Data Sharing and Information Retrieval in Wide-Area Distributed Systems

by

Chunqiang Tang

Submitted in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy

Supervised by Professor Sandhya Dwarkadas

Department of Computer Science
The College of Arts and Sciences
University of Rochester
Rochester, New York
2004


Curriculum Vitae

Chunqiang Tang was born in Luzhou, China on February 28, 1974. He attended the University of Science and Technology of China from 1991 to 1996, and graduated with a Bachelor of Engineering degree in 1996. He then attended the Institute of Computing Technology, Chinese Academy of Sciences from 1996 to 1999, and earned a Master of Engineering degree in 1999. He came to the University of Rochester in the Fall of 1999 and began graduate studies in Computer Science. He pursued his research in distributed systems under the direction of Professor Sandhya Dwarkadas and received the Master of Science degree from the University of Rochester in 2001.


Acknowledgments

I would like to thank my advisor, Sandhya Dwarkadas, and my committee members, Michael L. Scott, Kai Shen, and Wendi Heinzelman, for their patient guidance and insightful advice. Sandhya Dwarkadas and Michael L. Scott have supervised me through the entire InterWeave project. I appreciate all of their thoughtful direction and encouragement.

InterWeave is the fruit of strong teamwork and collaboration by a number of people. DeQing Chen implemented many features of InterWeave, including multi-level sharing, relaxed coherence models, dynamic views, and support for the Java language. Eduardo Pinheiro wrote an early version of the InterWeave IDL compiler. This dissertation work benefited greatly from Rob Stets' work on the Cashmere system and Srinivasan Parthasarathy's work on the InterAct system. Brandon Sanders helped us use InterWeave to coordinate shared state in the field of visual recognition. Xiangchuan Chen, Xiaochen Du, Gautam Altekar, Grant Farmer, and Andrew Ross helped design and implement various useful InterWeave applications and tools. I owe many thanks for insightful discussions with Tao Li, Shenghuo Zhu, John Kramer, Mallik Mahalingam, Zhichen Xu, William Scherer, and Ding Chen.

I have found the Computer Science Department at the University of Rochester to be a great place to study. I would like to thank all the faculty members, department staff, and fellow graduate students for their support and friendship, which have made life here a wonderful experience.


I would like to thank my parents, Yonglin Tang and Yongli Zhou, my sister Chunyan Tang, and my brother Chunjiang Tang for their unwavering support and encouragement.

This material is based upon work supported by the National Science Foundation under grants CCR-9988361, ECS-0225413, CCR-9705594, EIA-0080124, and CCR-0219848; DARPA/ITO under AFRL contract F29601-00-K-0182; the U.S. Dept. of Energy Office of Inertial Confinement Fusion under Cooperative Agreement No. DE-FC03-92SF19460; and equipment or financial grants from Compaq, IBM, Intel, and Sun. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of the above institutions or the University of Rochester.


Abstract

This dissertation addresses two problems related to the management of data and information in wide-area distributed systems: distributed shared state and peer-to-peer information retrieval.

Distributed applications typically resort to ad-hoc protocols built on top of remote invocation (e.g., Sun RPC or Java RMI) to maintain the coherence and consistency of shared state—information needed at more than one site. We instead propose to automate the management of shared state. As a complement rather than a replacement to remote invocation, our InterWeave system provides a unified programming environment that supports the use of shared-memory programming, remote invocation, relaxed coherence models, and transactions in a single application. InterWeave is the first system that automates the typesafe sharing of structured data in its internal form across heterogeneous platforms and multiple languages. Our evaluations show that InterWeave introduces minimal overhead while reducing bandwidth consumption and improving performance in important cases.

The other problem we study in this dissertation is peer-to-peer information retrieval (P2P IR). P2P systems have gained tremendous interest in recent years, but full-text search of information stored in P2P systems remains particularly challenging. We address this challenge by taking an interdisciplinary approach, making innovations in multiple fields—networks, systems, IR, and databases—when designing the components of our systems. Underlying our solutions are document clustering (i.e., indices stored on a node share similar features) and complete local indexing (i.e., if a node is involved in hosting the index for a document, it always stores the complete index for that document). Document clustering helps limit a search to only nodes hosting relevant indices. Complete local indexing allows each node to rank documents in its indices without consulting others. We propose two independent systems, eSearch and pSearch, that exhibit these properties and are built on top of distributed hash tables (DHTs). Our systems take advantage of the semantic information provided by modern IR algorithms: eSearch uses keywords to cluster documents, while pSearch uses concepts derived from latent semantic indexing (LSI) for clustering. Both are efficient and achieve retrieval quality comparable to that of the centralized baselines.


Table of Contents

Curriculum Vitae
Acknowledgments
Abstract
List of Tables
List of Figures

1 Introduction
   1.1 Distributed Shared State
   1.2 Peer-to-Peer Information Retrieval
   1.3 Summary
   1.4 Dissertation Outline

2 Intuitive Data Sharing with InterWeave
   2.1 Design
   2.2 Implementation
   2.3 Experimental Results
   2.4 Related Work
   2.5 Conclusions

3 Incorporating RPC and Transactions into InterWeave
   3.1 Design
   3.2 Implementation
   3.3 Experimental Results
   3.4 Conclusions

4 eSearch: Hybrid Global-Local Indexing for P2P IR
   4.1 System Architecture
   4.2 Top Term Selection and Automatic Query Expansion
   4.3 Disseminating Document Metadata
   4.4 Balancing Term List Distribution
   4.5 Analysis of System Resource Usage
   4.6 Related Work
   4.7 Conclusion

5 pSearch: A Semantic Overlay Network for P2P IR
   5.1 Background
   5.2 Overview of the pSearch System
   5.3 Resolving the Dimensionality Mismatch between CAN and LSI
   5.4 Balancing Index Distribution
   5.5 Reducing the Search Space
   5.6 Experimental Results
   5.7 Conclusion

6 Scaling Latent Semantic Indexing for pSearch
   6.1 Improving the Retrieval Quality of LSI
   6.2 Improving the Efficiency of LSI
   6.3 Performance of pSearch
   6.4 Potential of Clustered Search
   6.5 Related Work
   6.6 Conclusions

7 Conclusions
   7.1 Intuitive Data Sharing with InterWeave
   7.2 Peer-to-Peer Information Retrieval
   7.3 Future Work: Distributed Shared State
   7.4 Future Work: Peer-to-Peer Information Retrieval
   7.5 Summary

Bibliography


List of Tables

2.1 The breakdown of the InterWeave code. The "utilities" category includes our own implementation of utilities such as hash table, linked list, and high-resolution timer.

4.1 Difference in the number of retrieved relevant documents between eSearch and the centralized Okapi when returning 10 documents for each query. The columns correspond to eSearch with different configurations (without using query expansion). One entry with value p in the "d=k" row (e.g., the entry with value 8 in the "d=-2" row) means that, out of the 100 TREC queries, p queries return k more relevant documents in eSearch than in the baseline (or return fewer relevant documents if k<0). For instance, when publishing documents under their top 10 terms, for 8 queries, eSearch returns 2 fewer relevant documents than the baseline, and for 73 queries, eSearch performs the same as the baseline. For some queries, eSearch does better than the baseline because of the inherent fuzziness in Okapi's ranking function (i.e., just focusing on important terms may actually improve retrieval quality sometimes).

5.1 Notations used in the content-directed search algorithm.

5.2 The content-directed search algorithm that is executed for each rotated space i.

5.3 Parameters varied in experiments.


List of Figures

1.1 Comparison of distributed indexing structures. (i) Gnutella-like local indexing. (ii) Global indexing. (iii) Hybrid indexing. (iv) Optimized hybrid indexing. a, b, and c are terms. X, Y, and Z are documents. This example distributes metadata for three documents (X-Z) that contain terms from a small vocabulary (a-c) to three computers (1-3). Term list X → a, c means that document X contains terms a and c. Inverted list a → X, Z indicates that term a appears in documents X and Z.

1.2 Search in a semantic space.

2.1 Shared linked list in InterWeave. Variable head points to an unused header node; the first real item is in head->next.

2.2 Simplified view of InterWeave client-side data structures: the segment table, subsegments, and blocks within segments. Type descriptors, pointers from balanced trees to blocks and subsegments, and footers of blocks and free space are not shown.

2.3 Wire format translation of a structure consisting of three integers (i0-i2), two doubles (d0-d1), and a pointer (ptr). All fields except d0 and i1 are changed.

2.4 Grammar of the wire-format diff for a block.

2.5 Client's cost to translate 1MB of data.

2.6 Server's cost to translate 1MB of data.

2.7 Pointer swizzling cost as a function of pointed-to object type.

2.8 Breakdown of swizzling local pointers into MIPs (using cross 1024).

2.9 Breakdown of swizzling MIPs into local pointers (using cross 1024).

2.10 Diff management cost as a function of modification granularity (1MB total data, fine-grain modification).

2.11 Diff management cost as a function of modification granularity (1MB total data, coarse-grain modification).

2.12 The effect of varying block size.

2.13 Performance impact of version-based block layout, with 1MB total data.

2.14 Cache-aware diffing and diff run splicing.

2.15 Isomorphic type descriptors.

2.16 Block prediction.

2.17 Byte order swapping.

2.18 Wire format length of datamining segment under InterWeave and RPC.

2.19 Translation time for datamining segment under InterWeave and RPC.

3.1 Shared linked list in InterWeave. Both the RPC client and RPC server are InterWeave clients. The variable head points to a dummy header node; the first real item is in head->next. head->val is the average of all list items' val fields. The RPC client is initialized with list_init. To add a new item to the list, the RPC client starts a transaction, inserts the item, and makes a remote procedure call to a fast machine to update head->val. head->val could represent summary statistics much more complex than "average". The function clnt_compute_avg is the client-side stub generated by the standard rpcgen tool. The IW_close_segment in the RPC server code will leave the cached copy of the segment intact on the server for later reuse.

3.2 The argument structure and its XDR routine. During an RPC, xdr_arg_t() is invoked on both the caller side (to marshal arguments) and the callee side (to unmarshal arguments). Likewise, xdr_result_t() (not shown here) is invoked to marshal and unmarshal the results. Both routines are generated with the standard rpcgen tool and slightly modified by the InterWeave IDL compiler. xdr_trans_arg() is a function in the InterWeave library that marshals and unmarshals transaction metadata along with other arguments.

3.3 Execution time for transactions that transmit a large amount of data on a LAN.

3.4 Execution time for transactions that transmit a small amount of data on a LAN.

3.5 Execution time for a "proactive transaction" running on a LAN, when both the caller and callee only update x% of a 1MB segment.

3.6 Execution time for transactions running on a WAN.

3.7 Execution time for a "proactive transaction" running on a WAN, when both the caller and callee only update x% of a 1MB segment.

3.8 Total time to update and search a binary tree.

3.9 Total size of communication to update and search a binary tree.

3.10 Latencies for executing one query under different offloading configurations. "Local" executes queries locally at the front-end server; "RPC-update" uses RPC to offload computations without any caching scheme. It also represents the case where manual caching is used but the summary structure has changed (hence the cache is invalid and the entire summary structure has to be transmitted). "RPC-no-update" represents RPC with manual caching to offload computation when the cache happens to be valid. "IW-update" uses InterWeave and RPC to support offloading, and the summary structure has changed since the last cached copy. "IW-no-update" differs from "IW-update" in that the summary structure has not changed since the last cached copy.

3.11 Breakdown of query service time under different offloading strategies.

3.12 The impact of different offloading strategies on system throughput.

4.1 System architecture of eSearch.

4.2 Ranked term weight of the TREC corpus, normalized to the biggest term weight in each document.

4.3 eSearch's precision with respect to the number of terms under which a document is published. The performance of the "all" series is equivalent to that of a centralized implementation of Okapi.

4.4 The average number of retrieved relevant documents for a query when returning 1,000 documents.

4.5 eSearch's precision with respect to the number of expanded query terms.

4.6 Comparison of precision-recall between eSearch and the centralized Okapi.

4.7 Sensitivity of Okapi to the sampling of global statistics.

4.8 Cost of overlay source multicast normalized to that of using separate unicast to deliver data.

4.9 Comparison of load balancing techniques using equal-size objects. "baseline" is the basic Chord without virtual servers. "vs (k)" is a Chord with k virtual nodes running on each physical node. "split (k)" is our technique where a new node performs k random lookups and splits the overloaded node.

4.10 Length distribution of the TREC corpus's inverted lists when documents are published under (a) top 20 terms; or (b) all terms.

4.11 Comparison of load balancing schemes using (a) the "top 20 terms" load; and (b) the "all terms" load.

4.12 Scalability of our load balancing technique ("split+scatter").

4.13 Scalability of our load balancing technique ("split+scatter").

5.1 A 2-dimensional CAN.

5.2 Overview of the pSearch system.

5.3 pLSI in a 2-dimensional CAN.

5.4 Uneven distribution of document indices.

5.5 The average number of neighbors and routing hops for a 300-dimensional CAN.

5.6 A rolling-index example. The position of a vector is decided by its first two elements. (a) The CAN partitions dimensions v0 and v1 in the original semantic space. (b) The same CAN partitions dimensions v2 and v3 after rotating the semantic vectors by two dimensions. The relevant document A for the query Q can be easily found in node z on the rotated space.

5.7 The average weight of elements of semantic vectors ranked in decreasing order.

5.8 Singular values of the TREC corpus.

5.9 The effect of using low-dimensional elements to identify relevant documents (the documents and queries are from TREC 7&8).

5.10 Histogram of the accuracy of the 100 TREC queries, when using subvectors of semantic vectors for search (m=25, e=128).

5.11 The effect of content-aware node bootstrapping.

5.12 An example of content-directed search.

5.13 The effect of varying the system size.

5.14 The effect of simultaneously varying the system size and corpus size.

5.15 The effect of varying the number of returned documents for a 10k-node system.

5.16 Comparing the content-directed search heuristics for a 10k-node system.

5.17 Performance of zone-directed search, which prefers searching zones whose centers are the closest to the query vector.

5.18 The effect of replication on a 10k-node system.

5.19 Performance of a 128k-node system.

5.20 Distribution of documents found for a 10k-node system.

5.21 Sensitivity to term weighting schemes (a 10k-node system).

5.22 Sensitivity to query length (a 10k-node system).

5.23 Sensitivity to parallel search (a 10k-node system).

5.24 Sensitivity to rotated dimensions (a 10k-node system).

5.25 Sensitivity to the dimensionality of the semantic space (a 10k-node system).

5.26 Sensitivity to sample size (for a 10k-node system).

5.27 Visited nodes and accuracy when varying system size but keeping the same object density.

5.28 Nodes visited to achieve an accuracy around 0.8 when varying system size but keeping the same object density.

5.29 Nodes visited as a fraction of the total node population to achieve an accuracy around 0.8 when varying system size but keeping the same object density.

5.30 Nodes visited and accuracy when varying the average number of objects stored on a node from 1 to 400 (in parentheses).

5.31 Performance of distributed nearest-neighbor search when varying object density.

5.32 Visited nodes and accuracy when varying the number of objects to retrieve for a query.

5.33 Number of objects to retrieve divided by the number of visited nodes, when varying the number of objects to retrieve for each query.

6.1 Comparison of different configurations for LSI (using TREC).

6.2 Retrieved relevant documents by LSI with different configurations (using Medlars).

6.3 Comparison of different retrieval algorithms.

6.4 High-end precision for TREC.

6.5 Cumulative distribution of the 100 TREC queries as a function of the returned relevant documents when retrieving 10 documents.

6.6 Comparison of dimensionality reduction methods—retrieved relevant documents.

6.7 Comparison of dimensionality reduction methods (precision-recall).

6.8 Queries that find no relevant document.

6.9 High-end precision when combining with Okapi.

6.10 Efficiency of eLSI—memory consumption and execution time of SVD.

6.11 Efficiency of eLSI—retrieved relevant documents.

6.12 The number of visited nodes and precision at top 10 documents when varying the number of nodes in the system from 500 to 128,000 (in parentheses).

6.13 Precision at top 10 documents and documents searched on visited nodes as a percentage of the entire corpus, when varying the number of nodes in the system from 2,000 to 32,000 (in parentheses).

6.14 The impact of eLSI on pSearch's retrieval quality.

6.15 Performance as a function of searched clusters. Documents are partitioned into 500-8,000 clusters (in parentheses).

6.16 Comparing the strategies to choose clusters to search.

7.1 Single server architecture.

7.2 Replicated server architecture.

7.3 Hierarchical cache architecture.

1 Introduction

Most Internet-level applications are distributed not for the sake of parallel speedup, but rather to access people, data, and devices in geographically disparate locations. Increasingly, these programs are oriented as much toward information access as they are toward computation. E-commerce applications make business information available regardless of location. Computer-supported collaborative work allows colleagues at multiple sites to share project design and management data. Multi-player games maintain a distributed virtual environment. Peer-to-peer systems are largely devoted to indexing and lookup of a continually evolving distributed store. Even in the scientific community, so-called GRID computing [92, 93] is as much about finding and accessing remote data repositories as it is about utilizing multiple computing platforms.

This dissertation addresses two problems related to the management of data and information in wide-area distributed systems:

• How to make it easier for distributed applications to share in-memory program data;

• How to make it easier for people to retrieve information through full-text search in large-scale peer-to-peer systems.

I will further elaborate on these problems in Sections 1.1 and 1.2, respectively.


1.1 Distributed Shared State

We believe that distributed systems will continue to evolve toward data-centric computing. We envision a future in which a user has ubiquitous access to enormous amounts of shared state regardless of location and the device the user attaches to. Shared state is information that is needed at more than one site, that has largely static structure but whose content changes over time. Examples include a user's calendar, address book, the current screen shot of an editor, and all kinds of program data belonging to the user's typical applications. Ideally a digital assistant should automatically gather, present, and save the shared state for the user wherever the user goes. Distributed applications should be able to share program variables as easily as people share Web pages. To the best of our knowledge, no one to date has automated the typesafe sharing of structured data in its internal (in-memory) form across heterogeneous platforms and multiple languages, or optimized that sharing for distributed applications. This dissertation does so.

Today's systems employ a variety of mechanisms that might underlie distributed shared state. At one extreme are distributed file and database systems such as AFS [225], Lotus Notes, CVS [50], or OceanStore [138]. For the most part these are oriented toward external (byte-oriented) data representations, with a narrow, read-write interface, and structure imposed by convention. Data in these systems must generally be converted to and from an in-memory representation in order to be used in programs. At the other extreme, distributed object systems such as CORBA [181] and .NET present data in a structured, high-level form, but require that programs employ an object-oriented programming style. While some of these systems do allow caching of objects at multiple sites, performance can be poor.

By contrast, the shared memory available within cache-coherent multiprocessors allows processes to share arbitrarily complex structured data safely and efficiently, with ordinary reads and writes. Many researchers have developed software distributed shared-memory (S-DSM) systems to extend this programming model into message-based environments [5, 146, 242]. Object-based systems can of course be implemented on top of shared memory, but the lower-level interface suffices for many applications. Unfortunately, despite some 15 years of research, S-DSM remains for the most part a laboratory curiosity. The explanation, we believe, lies with the choice of application domain. The S-DSM community has placed most of its emphasis on traditional parallel programming: scientific applications running on low-latency networks of homogeneous machines, with a single protection domain, usually a single programming language, and a primitive (everything lives or dies together) failure model. Within this narrow framework, S-DSM systems provide an upward migration path for applications originally developed for small cache-coherent multiprocessors, but the resulting performance on clusters of up to a dozen nodes (a few dozen processors), while good, does not lead us to believe that S-DSM will scale sufficiently to be competitive with hand-written message-passing code for large-scale parallel computing.

As an abstract concept, we believe that shared memory has more to offer to distributed computing than it does to parallel computing. For the sake of availability, scalability, latency, and fault tolerance, most distributed applications cache information at multiple sites. To maintain these copies in the face of distributed updates, programmers typically resort to ad-hoc messaging protocols that embody the coherence and consistency requirements of the application at hand. The code devoted to these protocols often accounts for a significant fraction of overall application size and complexity, and this fraction is likely to increase.
We see the management of shared state as ripe for automation. Like hardware cache coherence, or S-DSM within clusters, a system for distributed shared state should provide a uniform name space, and should maintain coherence and consistency automatically. Unlike these more tightly coupled systems, however, it should address concerns unique to wide-area distribution:

• Names should be machine-independent, but cached copies should be accessed with ordinary reads and writes.

• Sharing should work across a wide variety of hardware architectures and programming languages.

• Shared data should be persistent, outliving individual executions of sharing applications and system failures.

• Coherence and consistency models should match the (generally very relaxed) requirements of applications.

• Important optimizations, of the sort embodied by hand-tuned ad-hoc coherence protocols, should be explicitly supported, in a form that allows the user to specify high-level requirements rather than low-level implementations.

By replacing ad-hoc protocols, we believe that automatic distributed shared state can dramatically simplify the construction of many distributed applications. Java programmers routinely accept the overhead of byte code interpretation in order to obtain the conceptual advantages of portability, extensibility, and mobile code. Similarly, we believe that many developers would be willing to accept a modest performance overhead for the conceptual advantages of shared state, if it were simple, reliable, and portable across languages and platforms. By incorporating optimizations that are often too difficult to implement by hand, automatic distributed shared state may even improve performance in important cases.

We see shared state as entirely compatible with programming models based on remote invocation. In an RPC/RMI-based program, shared state serves to

• eliminate invocations devoted to maintaining the coherence and consistency of cached data;

• support genuine reference parameters in RPC calls, eliminating the need to pass large structures repeatedly by value, or to recursively expand pointer-rich data structures using deep-copy parameter modes;

• reduce the number of trivial invocations used simply to put or get data.

These observations are not new. Systems such as Emerald [124], Amber [51], and PerDiS [89] have long employed shared state in support of remote invocation in homogeneous object-oriented systems. Clouds [72] integrated S-DSM into the operating system kernel in order to support thread migration for remote invocation. Working in the opposite direction, Koch and Fowler [132] integrated message passing into the coherence model of the TreadMarks S-DSM system [5]. Kono et al. [134] support reference parameters and caching of remote data during individual remote invocations, but with a restricted type system, and with no provision for coherence across calls. RPC systems have long supported automatic deep-copy transmission of structured data among heterogeneous languages and machine architectures [113, 65], and modern standards such as XML provide a language-independent notation for structured data. To the best of our knowledge, however, no one to date has automated the typesafe sharing of structured data in its in-memory form across multiple languages and platforms.

Over the past four years we have developed the InterWeave system to manage distributed shared state [231]. InterWeave allows programs written in multiple languages to map persistent shared segments into their address space, regardless of Internet address or machine type, and to access the data in those segments transparently and efficiently once mapped. To support these operations, InterWeave maintains metadata structures comparable to those of a sophisticated language reflection mechanism, and employs a variety of algorithmic and protocol optimizations specific to distributed shared state.

The current implementation of InterWeave consists of 45,000 lines of C++ code, running on Alpha, Sparc, x86, MIPS, and Power series processors, under Tru64, Solaris, Linux, Irix, AIX, and Windows NT (XP). Currently supported languages are C, C++, Java, Fortran 77, and Fortran 90. Driving applications include datamining, intelligent distributed environments, and scientific visualization.

Using a combination of microbenchmarks and real applications, we evaluate the performance of our heterogeneity mechanisms, and compare them to comparable mechanisms in RPC-style systems. When transmitting entire segments, InterWeave achieves performance comparable to that of RPC, while providing a more flexible programming model. When only a portion of a segment has changed, InterWeave's use of diffs allows it to scale its overhead down, significantly outperforming straightforward use of RPC.

InterWeave is a joint effort. Chen's dissertation [54] covers multi-level shared state, application-specific coherence models, and Java support. This dissertation focuses on the aspects of InterWeave that I contributed to most:

• Efficient sharing of strongly typed, pointer-rich data structures in their in-memory representation across heterogeneous platforms and multiple languages (C and Fortran) [253].

• Seamless integration of shared state, remote invocation, relaxed coherence models, and transactions to form a rich and efficient computing platform [254].

1.2 Peer-to-Peer Information Retrieval

In addition to the InterWeave system that facilitates data sharing for distributed applications, this dissertation addresses efficient information retrieval in peer-to-peer systems. Peer-to-Peer (P2P) systems [168, 187] have gained tremendous interest from both the user community and the research community in the past several years. First-generation systems such as Gnutella [103] and KaZaA [128] are already prevalent, and second-generation systems such as PAST [217] and CFS [70], based on Distributed Hash Tables (DHTs) (e.g., CAN [202], Chord [243], Pastry [216], and Tapestry [308]), are under serious development (e.g., the IRIS project [121]). With a gigantic amount of information shared in these systems, it is impossible for users to remember the location or precise ID of the desired data. The capability to retrieve documents using full-text search would greatly improve the usability of these systems and accordingly broaden their application domains.

A scalable, high-quality P2P search mechanism will not only help P2P systems reach their full potential, but can also serve as infrastructure to address problems arising from the exponential growth of global information [184]. According to a recent report [197], 93% of information produced worldwide is in digital form, and the unique information added each year exceeds one exabyte (10^18 bytes). This trend calls for equally scalable infrastructures capable of indexing and searching rich content such as HTML, plain text, music, and image files. Search engines such as Google [105] currently appear to work for the Web, but it is unclear whether their architecture [208] can scale with the exponential data growth, not to mention that little is known to the public about how these systems actually work.

This dissertation presents techniques to build self-organizing search engines based on P2P technology. The fundamentals of our techniques are applicable to well-managed, stable environments (e.g., Google-like search engines, data centers, and corporate desktops) as well as the more dynamic P2P environment.
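For readers unfamiliar with DHTs, the core idea behind systems like Chord can be sketched in a few lines: both node addresses and data keys are hashed onto the same identifier ring, and a key is stored on the first node whose identifier follows it on the ring. This is only a toy illustration with made-up parameters; real DHTs add finger tables and stabilization protocols to achieve O(log N) routing.

```python
import hashlib

def chord_id(key: str, bits: int = 16) -> int:
    """Hash a key (a node address or a data key) onto a 2^bits identifier ring."""
    return int(hashlib.sha1(key.encode()).hexdigest(), 16) % (1 << bits)

def successor(key: str, node_ids: list, bits: int = 16) -> int:
    """A key is stored on the first node whose ID follows the key's ID
    clockwise on the ring (Chord's successor rule)."""
    kid = chord_id(key, bits)
    ring = sorted(node_ids)
    for nid in ring:
        if nid >= kid:
            return nid
    return ring[0]  # wrap around the ring

nodes = [chord_id(f"node-{i}") for i in range(8)]   # hypothetical node addresses
home = successor("distributed", nodes)              # node responsible for this key
```

Because the mapping depends only on hashes, any participant can compute where a key lives without a central directory, which is what makes DHT-based storage self-organizing.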
Compared with traditional centralized or distributed search engines, a search system built on top of a large number (10^3 to 10^6) of P2P nodes is attractive for several reasons:

• Scalability. Lawrence and Giles reported in 1999 that no search engine indexed more than 16% of the indexable Web [220]. This gap is widening further because of the explosive growth of the Web in recent years. The self-organizing nature of P2P systems offers the hope of building search engines with "unlimited" scalability.

• Low cost and ease of deployment. A massive dedicated search engine requires a substantial investment, whereas a P2P search system is essentially free and can be deployed incrementally as new users join the system.

• Data freshness. Due to the limited network bandwidth between centralized search engines and the Internet, it usually takes them months to crawl Web pages to update their indices. P2P nodes can publish new documents voluntarily as soon as the documents appear, and the traffic is distributed to geographically scattered nodes.

• Availability. Centralized search engines are susceptible to distributed denial-of-service attacks, but it is difficult for an attacker to bring down a significant fraction of geographically distributed P2P nodes, and the loss of a reasonable fraction of them will not affect a P2P system's functionality as a whole [110].

• Diversity of information sources. The entangled net of ever-growing information is beyond any human's processing capability. As a result, we increasingly rely on a handful of search engines for information discovery. Because of the critical commercial value of the information fed to users, commercial search engines have an incentive to provide information suiting their own interests rather than those of the users. Several "democratic" P2P search systems, perhaps supported by an open-source community, would give users leverage to bargain in this battle.


On the other hand, P2P search systems face problems unprecedented in centralized search engines, in particular, security and trust. We suggest borrowing solutions from related work [49] and acknowledge that these are hard problems. A detailed study of these issues is beyond the scope of this dissertation.

1.2.1 Our Approach

An ideal P2P search system should store only a limited number of indices on each node and meet the following criteria when processing a query:

• search only a small number of nodes;

• transmit a small amount of data; and

• obtain search results as good as those of centralized systems.

Although a number of P2P search techniques [63, 66, 87, 163, 205, 206, 238] have been proposed in recent years, they either search a large number of nodes or transmit a large amount of data when processing a query. More importantly, with very few exceptions [67], most of them are based on simple keyword matching, ignoring the advanced relevance-ranking algorithms devised by the Information Retrieval (IR) community through decades of refinement and evaluation [199]. Without effective ranking, queries consisting of popular words may return far more documents than the user can possibly handle.

In addition to improving retrieval quality, it is our strong belief that by exploiting the synergy between IR and P2P, one can actually design more efficient P2P search systems, leveraging features of IR algorithms to guide system optimization. This synergy, in conjunction with the quickly increasing storage capacities provided by future technologies, may completely shift the design tradeoffs for P2P search systems, compared with existing solutions that consider "systems issues" alone.


We focus on extending classical IR algorithms to work in a P2P environment. Some IR techniques (e.g., Google's PageRank [33]) leverage hyperlinks to identify important Web pages. This cross-reference information, however, does not exist in many forms of digital content. We start with the most popular and well-studied statistical IR algorithms, the Vector Space Model (VSM) [221] and Latent Semantic Indexing (LSI) [19, 73], which do not rely on cross-reference information. VSM and LSI represent documents and queries as vectors in a Cartesian space, and measure the similarity between a query and a document as the cosine of the angle between their vector representations. Variants of VSM and LSI have been adopted by some major search engines [143, 199, 301]. In practice, various IR techniques are combined to build pragmatic search engines. The study of how other techniques (e.g., PageRank) can complement our approach is a subject for future work.

We take an interdisciplinary approach to designing P2P search systems. We borrow techniques from multiple fields—networks, systems, IR, and databases—to design each component of our systems, and introduce key innovations to improve the borrowed techniques and to integrate them seamlessly. We propose two independent systems built on top of DHTs: eSearch [255] and pSearch [256, 257, 258, 259]. Both are efficient and achieve retrieval quality comparable to that of centralized baselines. What underlie our solutions are document clustering (i.e., indices stored on a node share similar features) and complete local indexing (i.e., if a node is involved in hosting the index for a document, it always stores the complete index). Document clustering helps limit a search to only those nodes hosting relevant indices. Complete local indexing allows each node to rank documents in its indices without consulting others. eSearch uses keywords to cluster documents while pSearch uses concepts derived from LSI for clustering.
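The cosine measure at the heart of VSM (and of LSI, once documents are projected into the semantic space) can be illustrated with a toy sketch. Raw term frequencies stand in here for the weighted vectors real engines use (e.g., tf-idf or Okapi weights); the corpus is invented for illustration.

```python
import math
from collections import Counter

def cosine(query: str, doc: str) -> float:
    """Similarity = cosine of the angle between term-frequency vectors.
    (Real VSM systems weight terms; raw counts are used here.)"""
    q, d = Counter(query.lower().split()), Counter(doc.lower().split())
    dot = sum(q[t] * d[t] for t in q)
    norm = (math.sqrt(sum(v * v for v in q.values()))
            * math.sqrt(sum(v * v for v in d.values())))
    return dot / norm if norm else 0.0

docs = ["peer to peer search systems",
        "latent semantic indexing of documents",
        "cooking recipes for pasta"]
ranked = sorted(docs, key=lambda d: cosine("semantic search", d), reverse=True)
```

A document sharing no terms with the query scores 0 and falls to the bottom of the ranking; identical vectors score 1.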
In comparison, eSearch's search process is more efficient and much simpler than that of pSearch, but eSearch consumes more storage space than pSearch. Overall, we recommend eSearch for text retrieval and pSearch for multimedia retrieval. Below we give a brief summary of eSearch and pSearch.

1.2.2 eSearch: Hybrid Global-Local Indexing for P2P IR

To facilitate the retrieval of documents, a distributed (not necessarily P2P) search system places information regarding the occurrence of terms (words or phrases) in documents, in the form of metadata, at certain places in the system. The metadata placement strategies in existing systems are based on either local or global indexing [265].

In local indexing (the so-called partitioning-by-documents method), metadata are partitioned based on the document space (see Figure 1.1 (i)). The complete term list of a document is stored on a single node. (This is just at the conceptual level. In an actual implementation a node may build inverted lists for its local documents to speed up query processing.) A term list X → a, c means that document X contains terms a and c. During a retrieval operation, the query is broadcast to all nodes. Since each node has the complete term list for the documents that it is responsible for, it can compute the relevance between the query and its documents without consulting others. The drawback, however, is that every node is involved in processing every query, rendering systems of this type unscalable. Gnutella and search engines such as AllTheWeb (www.alltheweb.com) [208] are based on variants of local indexing.

In global indexing (the so-called partitioning-by-words method), metadata are distributed based on terms (see Figure 1.1 (ii)). Each node stores the complete inverted list of some terms. An inverted list a → X, Z indicates that term a appears in documents X and Z. The advantage of global indexing is that a query is processed by only a small number of nodes. To answer a query consisting of multiple terms, the query is sent to the nodes responsible for those terms. Their inverted lists are transmitted over the network so that an intersection can be performed to identify documents that contain multiple query terms. The communication cost for this distributed join, unfortunately, grows proportionally with the length of the inverted lists, i.e., the size of the corpus. Most recent proposals for P2P keyword search [102, 145, 205, 245] are based on global indexing, but with enhancements to reduce the communication cost, for instance, by using Bloom filters to summarize the inverted lists, or by incrementally transmitting the inverted lists and terminating early if sufficient results have already been obtained. In the following, we will simply refer to systems of this type as Global-P2P systems.

Figure 1.1: Comparison of distributed indexing structures. (i) Gnutella-like local indexing. (ii) Global indexing. (iii) Hybrid indexing. (iv) Optimized hybrid indexing. a, b, and c are terms. X, Y, and Z are documents. This example distributes metadata for three documents (X-Z) that contain terms from a small vocabulary (a-c) to three computers (1-3). Term list X → a, c means that document X contains terms a and c. Inverted list a → X, Z indicates that term a appears in documents X and Z.

Challenging the conventional wisdom that uses either local or global indexing, we propose a hybrid indexing structure that combines their benefits while avoiding their limitations. The basic tenet of our approach is selective metadata replication guided by modern IR algorithms. Metadata replication avoids transmitting large amounts of data when processing multi-term queries. Document term lists are replicated by a small factor to important places to guarantee the quality of retrieval.

Like global indexing, hybrid indexing distributes metadata based on terms (see Figure 1.1 (iii)). Each node j is responsible for the inverted list of some term t. In addition, for each document D in the inverted list for term t, node j also stores the complete term list for document D. Given a multi-term query, the query is sent to the nodes responsible for those terms. Each of those nodes then does a local search without consulting others, since it has the complete term lists for the documents in its inverted list.

Our system based on hybrid indexing is called eSearch. It uses a Distributed Hash Table (DHT) to map a term to a node where the inverted list for the term is stored. We chose Chord [243] for eSearch, but other DHTs such as CAN [202], Pastry [216], and Tapestry [308] could also be used without major changes to our design.

When naively implemented, eSearch's search efficiency obviously comes at the
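As a rough illustration of the hybrid scheme in Figure 1.1 (iii), the sketch below builds the index one node would hold for a single term and answers a multi-term query locally. DHT routing, ranking, and networking are all omitted, and the function names are ours, not eSearch's.

```python
# Toy corpus matching Figure 1.1: document -> its complete term list.
term_lists = {"X": {"a", "c"}, "Y": {"b", "c"}, "Z": {"a", "b", "c"}}

def build_node_index(term: str) -> dict:
    """The node responsible for `term` stores the inverted list for that
    term, plus the COMPLETE term list of every document in that list."""
    return {doc: terms for doc, terms in term_lists.items() if term in terms}

def local_search(node_index: dict, query: set) -> list:
    """The node filters (and in a real system, ranks) documents without
    consulting other nodes, because it holds each candidate document's
    complete term list."""
    return [doc for doc, terms in node_index.items() if query <= terms]

node_a = build_node_index("a")          # what the node for term 'a' stores
print(local_search(node_a, {"a", "c"})) # -> ['X', 'Z']
```

The key point is that the multi-term intersection happens entirely on one node, avoiding the distributed join that makes pure global indexing expensive.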


expense of publishing more metadata, requiring more communication and storage. We propose several optimizations that reduce communication by up to 97% and storage by up to 90%, compared with a naive implementation of hybrid indexing. Below we outline one important optimization—top term selection. See Chapter 4 for the others.

Each document contains many words. Some of them are central to the idea described in the document, while the majority are just auxiliary words. Modern statistical IR algorithms such as the vector space model (VSM) [221, 233] assign a weight to each term in a document. Terms central to the document are automatically identified by a heavy weight. (More specifically, we use Okapi [210] to compute the term weights.) In eSearch, we only publish the term list for a document to the nodes responsible for the top (important) terms in that document. Figure 1.1 (iv) illustrates this optimization. Document X contains terms a and c, but its term list is only published to computer 3, since only term c is important in document X. Given a query consisting of terms a and c, computer 3 can still determine that document X contains both terms, since it keeps the complete term list for document X.

This optimization reduces storage consumption, but it may degrade the quality of search results: if none of a query's terms is among a document's top terms, eSearch cannot return that document for the query. (We do not require all query terms to be among the top terms for the document, since we always replicate a document's term list in its entirety. For instance, in Figure 1.1 (iv), a query consisting of terms a and c can find document X although only term c is one of X's top terms.) Our argument is that, since none of the query terms is among the top terms for the document, IR algorithms are unlikely to rank this document among the best matching documents for this query anyway. Thus, the top search results for this query are unlikely to be affected by skipping this document. Our results show that eSearch obtains search quality as good as the centralized baseline by publishing a document under its top 20 terms or so.

In order to further reduce the chance of missing relevant documents, we adopt automatic query expansion [169]. We draw on the observation that, with more terms in a query, it is more likely that a document relevant to the query is published under at least one of the query terms. This scheme automatically identifies additional terms relevant to a query and also searches the nodes responsible for those terms. We also propose an overlay source multicast protocol to efficiently disseminate term lists, and two decentralized techniques to balance the distribution of term lists across nodes.

We evaluate eSearch through simulations and analysis. The results show that, owing to the optimization techniques, eSearch is scalable and efficient, and obtains search results as good as the centralized baseline. Despite the use of metadata replication, eSearch actually consumes less bandwidth than the Global-P2P systems when publishing documents. During a retrieval operation, eSearch typically transmits 3.3KB of data. These costs are independent of the size of the corpus and grow slowly (logarithmically) with the number of nodes in the system. In contrast, the cost to process a query in the Global-P2P systems grows with the size of the corpus. eSearch's search efficiency comes at a modest storage cost (6.8 times that of the Global-P2P systems), which can be further reduced by adopting index compression [284] or pruning [45].

Given the quickly increasing capacity and decreasing price of disks, we believe that trading modest disk space for communication savings and retrieval precision is a proper design choice for P2P systems. First, according to Blake and Rodrigues [27], in 15 years, disk capacity increased 8000-fold while bandwidth for an end user increased by only 50-fold. Second, in a large P2P system, the storage space scales proportionally with the number of nodes in the system, but the bisection bandwidth does not. Third, work in the Farsite project [29] observed that about 50% of disk space on desktops was not in use.
With the cheap massive storage available in large P2P systems, replication has already become a common practice to improve efficiency or resilience [53, 163, 217].
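The top term selection optimization can be sketched as follows. Plain term frequency stands in for the Okapi weights eSearch actually uses, and the function names are illustrative rather than eSearch's API.

```python
from collections import Counter

def top_terms(doc: str, k: int) -> list:
    """Pick the k heaviest terms of a document. eSearch uses Okapi
    term weights; raw term frequency stands in for them here."""
    tf = Counter(doc.lower().split())
    return [t for t, _ in tf.most_common(k)]

def publish(doc: str, k: int) -> dict:
    """Publish the document's COMPLETE term list, but only to the
    (hypothetical) nodes responsible for its top-k terms: selective
    metadata replication."""
    full_term_list = set(doc.lower().split())
    return {term: full_term_list for term in top_terms(doc, k)}

placement = publish("cache cache cache coherence coherence protocol", k=2)
# the full term list is replicated only under the two heaviest terms
```

A query on a non-top term ("protocol" here) cannot reach this document, but any node that does receive the replica can check all query terms against the complete term list, which is exactly the argument made above.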


1.2.3 pSearch: Semantic Overlay Network for P2P IR

eSearch improves global indexing (partitioning by words) by combining it with local indexing (partitioning by documents) and focusing on important terms in documents. In contrast, pSearch improves local indexing by introducing semantic locality when partitioning documents. Current P2P file-sharing systems such as Gnutella [103] and KaZaA [128] are based on local indexing. The fundamental problem that makes search difficult in these systems is that documents are randomly distributed with respect to semantics. Given a query, the system either has to search a large number of nodes, or the user runs a high risk of missing relevant documents.

To address this problem, we introduce the notion of a semantic overlay, a logical network where content is organized around its semantics such that the distance (e.g., routing hops) between two documents in the network is proportional to their dissimilarity in semantics. The search cost for a query is therefore reduced, since documents related to the query are likely to be concentrated on a small number of nodes.

Content-Addressable Networks (CAN) [202] provide a distributed hash table (DHT) abstraction over a Cartesian space. They allow efficient storage and retrieval of (key, object) pairs, where an object key is a point in the Cartesian space. We create a semantic overlay over a CAN by using the semantic vector of a document (generated by Latent Semantic Indexing (LSI) [19, 73]) as the key to store the document's index in the CAN, such that indices stored close to each other in the CAN have similar semantics.

Figure 1.2 illustrates how a semantic overlay can benefit a search. When document semantics are generated by LSI, each document is positioned as a point in the Cartesian space (or semantic space). Documents close in the semantic space have similar content, e.g., documents A and B. Each query can also be positioned in this semantic space. To find documents relevant to a query, we only need to


compare the query against documents within a small region centered at the query (i.e., searching a small number of computers in the overlay), because the relevance of documents outside the region is relatively low. By doing this, the search space for the query is effectively reduced while the precision is retained.

Figure 1.2: Search in a semantic space. (The figure shows the semantic space with documents as points, a query point, and the search region for the query containing nearby documents A and B.)

The basic idea of a semantic overlay is straightforward, and involves a mapping of the overlay to physical nodes in a CAN. It is, however, complicated by a number of factors:

• Due to a problem known as the curse of dimensionality, it has been shown that limiting the search region in high-dimensional spaces is difficult [278].

• We set the dimensionality of the CAN to be equal to that of LSI's semantic space, which typically ranges from 50 to 350. The "actual" dimensionality of the CAN, however, is much lower because there are not enough nodes to partition all the dimensions of a high-dimensional CAN. Along those unpartitioned dimensions, the search space is not reduced.

• Semantic vectors for documents are not uniformly distributed in the semantic space. A direct mapping from the semantic space to a CAN would result in an unbalanced distribution of indices across nodes.

• Because of its reliance on LSI, pSearch also inherits LSI's limitations. (1) When the corpus is large and heterogeneous, LSI's retrieval quality is inferior to methods such as Okapi [210]. (2) The Singular Value Decomposition (SVD) [104] that LSI uses to derive low-dimensional representations (i.e., semantic vectors) of documents is not scalable in terms of either memory consumption or computation time.

We address these problems by leveraging the properties of the semantic space, and by trading retrieval quality and storage space for efficiency when necessary. Using samples of indices and recently processed queries to guide the search, our content-directed search algorithm substantially reduces the search region in the high-dimensional semantic space. Taking advantage of the higher importance of low-dimensional elements of semantic vectors, our rolling-index scheme partitions the semantic space along more dimensions by using multiple rotated semantic vectors as keys to store document indices in the CAN. Our content-aware node bootstrapping helps distribute indices more evenly across nodes by forcing the node distribution in the CAN to follow the document distribution in the semantic space. Through extensive experimentation, we found a proper configuration that improves LSI's recall by 76%. We use Okapi to rerank the documents returned by LSI to further improve retrieval quality. Our eLSI (efficient LSI) algorithm reduces the cost of SVD by orders of magnitude through document clustering and term selection.
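Of these techniques, the rolling-index scheme is the simplest to sketch: each rotation shifts a document's semantic vector so that previously unpartitioned (higher) dimensions move into the low positions the CAN actually partitions, and each rotated vector serves as an additional key under which the document's index is stored. The vector and parameters below are toy values chosen for illustration.

```python
def rotated_keys(sem_vec: list, num_rotations: int, shift: int) -> list:
    """Generate the rotated semantic vectors used as CAN keys. Rotation m
    shifts the vector by m * shift dimensions, so dimensions that were
    unpartitioned appear in the low (partitioned) positions."""
    keys = []
    for m in range(num_rotations):
        off = (m * shift) % len(sem_vec)
        keys.append(sem_vec[off:] + sem_vec[:off])
    return keys

vec = [0.9, 0.5, 0.3, 0.1]          # toy 4-dimensional semantic vector
print(rotated_keys(vec, 2, 2))      # -> [[0.9, 0.5, 0.3, 0.1], [0.3, 0.1, 0.9, 0.5]]
```

Storing the index under every rotated key multiplies storage by the number of rotations, which is part of the quality/storage/efficiency tradeoff discussed above.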


These optimizations make pSearch both efficient and effective. For a 32,000-node system, pSearch's precision at 10 documents (for the TREC 7&8 corpus [263]) is 0.4, compared with 0.45 for the centralized Okapi system, while on average searching only 67 nodes per query. The precision can be improved by searching more nodes.

1.3 Summary

This dissertation presents my work on distributed shared state and peer-to-peer information retrieval (P2P IR). It can be summarized as follows:

• Distributed Shared State: With careful management of complex metadata structures as well as aggressive optimizations in protocol and implementation, it is possible to efficiently support the typesafe sharing of structured data in its internal (in-memory) form across heterogeneous platforms and multiple languages. The integration of shared state, remote invocation, relaxed coherence models, and transactions into a unified computing platform facilitates fast development of distributed applications and improves performance in important cases.

• Peer-to-Peer Information Retrieval: Designing a good P2P search system amounts to striking a balance between system efficiency and retrieval quality. One can substantially improve system efficiency by taking an interdisciplinary approach, leveraging features of IR algorithms to guide system design and optimization. The rules of thumb we suggest are focusing on important concepts or keywords in documents, and adopting smart index replication.

As a complement to, rather than a replacement of, remote invocation, InterWeave provides a unified programming environment that supports the use of shared-memory programming, remote invocation, relaxed coherence models, and transactions in a single application. Remote invocation can pass arguments and results through true references to shared state rather than through often unnecessary and inefficient deep copies. InterWeave provides programs with a URL-addressable global shared store, caching shared state locally and automatically managing the consistency and coherence of the caches. Regardless of platform and programming language, programs access shared state with ordinary reads and writes. To the best of our knowledge, InterWeave is the first system that automates the typesafe sharing of structured data in its internal form across heterogeneous platforms and multiple languages.

To support efficient information retrieval in P2P systems, we exploit the synergy between IR and P2P to guide the design and optimization of P2P search systems. In eSearch, we propose hybrid global-local indexing. Each node is responsible for certain terms. Given a document, eSearch uses a modern information retrieval algorithm to select a small number of top (important) terms in the document and publishes the complete term list for the document to the nodes responsible for those top terms. This selective replication of term lists allows a multi-term query to proceed locally at the nodes responsible for the query terms: the query is sent to those nodes, and each of them then does a local search and returns the results.

In pSearch, we propose a semantic overlay. It organizes nodes into a Content-Addressable Network (CAN) and places documents in the overlay according to document semantics derived from Latent Semantic Indexing (LSI). The distance (e.g., routing hops) between two documents in the overlay is proportional to their dissimilarity in semantics. The search cost for a query is therefore reduced, since documents related to the query are likely to be concentrated on a small number of nodes.
eSearch and pSearch are the first systems to introduce efficient information retrieval into structured peer-to-peer networks (i.e., distributed hash tables (DHTs)), and are also the first systems to exploit the synergy between IR and P2P to guide


system design and optimization. We have seen a flurry of recent publications that cite our work, indicating the importance and influence of this unique approach [2, 13, 16, 15, 17, 22, 23, 34, 53, 62, 48, 63, 84, 112, 117, 130, 140, 145, 147, 149, 156, 157, 158, 160, 165, 171, 172, 175, 176, 59, 219, 190, 200, 229, 228, 232, 236, 239, 245, 246, 260, 266, 273, 274, 275, 302, 303, 300, 306, 307, 309, 312].

1.4 Dissertation Outline

The remainder of the dissertation is organized as follows. Chapter 2 presents the design and implementation of InterWeave, with a focus on its support for heterogeneous platforms and multiple languages. Chapter 3 describes the integration of shared state, remote invocation, relaxed coherence models, and transactions in InterWeave. Chapters 4 and 5 present our P2P IR frameworks, eSearch and pSearch, respectively. Chapter 6 discusses techniques to improve the efficiency and efficacy of latent semantic indexing (LSI) and shows their use in pSearch. Chapter 7 concludes this dissertation.


2 Intuitive Data Sharing with InterWeave

With the rapid growth of the Internet, more and more applications are being developed for (or ported to) wide-area networks in order to take advantage of resources available at distributed sites. Examples include e-commerce, computer-supported collaborative work, intelligent environments, interactive data mining, and remote scientific visualization. Conceptually, most of these applications involve some sort of shared state: information that is needed at more than one site, that has largely static structure (i.e., is not streaming data), but whose content changes over time. For the sake of locality, "shared" state must generally be cached at each site, introducing the need to keep copies up to date in the face of distributed updates. Traditionally the update task has relied on ad-hoc, application-specific protocols built on top of RPC-based systems such as CORBA [181], .NET, and Java RMI [251]. We propose instead that it be automated. Specifically, we present a system known as InterWeave that allows programs written in multiple languages to map shared segments into their address space, regardless of Internet address or machine type, and to access the data in those segments transparently and efficiently once mapped.

Unfortunately, sharing is significantly more complex in a heterogeneous, wide-area network environment than it is in software distributed shared-memory (S-DSM) systems such as TreadMarks [5] and Cashmere [242]. With rare exceptions, S-DSM systems assume that clients are part of a single program, written in a single language, running on identical hardware nodes on a system-area network. InterWeave must support coherent, persistent sharing among programs written in multiple languages, running on multiple machine types, spanning potentially very slow Internet links.

RPC systems have of course accommodated multiple machine types for many years. They do so using stubs that convert value parameters to and from a machine-independent wire format. InterWeave, too, employs a wire format, but in a way that addresses three key challenges not found in RPC systems:

• To minimize communication bandwidth and to support relaxed coherence models (which allow cached copies of data to be slightly out of date), InterWeave must efficiently identify all changes to a segment, and track those changes over time. As a side benefit, this modification history allows InterWeave to choose a data layout on every node that maximizes locality of reference.

• In order to update cached copies, InterWeave must represent not only data, but also diffs (concise descriptions of only those data that have changed) in wire format.

• To support linked structures and reference parameters, InterWeave must swizzle pointers [283] in a way that turns them into appropriate machine addresses.

To support all these operations, InterWeave maintains metadata structures comparable to those of a sophisticated language reflection mechanism, and employs a variety of algorithmic and protocol optimizations specific to distributed shared state.
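The diff representation described above can be caricatured in a few lines: compare the cached copy against the new version at some block granularity and ship only the changed blocks. This is a sketch under assumed parameters, not InterWeave's actual algorithm; its wire format, pointer swizzling, and real diffing granularity are all omitted.

```python
def make_diff(old: bytes, new: bytes, block: int = 4) -> dict:
    """Compare two versions block by block and keep only changed blocks,
    as a map of (offset -> new contents)."""
    diff = {}
    for off in range(0, max(len(old), len(new)), block):
        if old[off:off + block] != new[off:off + block]:
            diff[off] = new[off:off + block]
    return diff

def apply_diff(old: bytes, diff: dict, block: int = 4) -> bytes:
    """Bring a cached copy up to date by patching in the changed blocks."""
    buf = bytearray(old)
    for off, data in sorted(diff.items()):
        buf[off:off + block] = data
    return bytes(buf)

old = b"aaaabbbbccccdddd"
new = b"aaaaBBBBccccDDDD"
d = make_diff(old, new)              # only two of the four blocks changed
assert apply_diff(old, d) == new
```

When only a small fraction of a segment changes, the diff is proportionally small, which is the source of the bandwidth savings claimed for InterWeave later in this chapter.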


InterWeave allows programs to access shared state with ordinary reads and writes. A common concern with shared-memory programming in a distributed environment is its performance relative to explicit messaging [159]: S-DSM systems tend to induce more underlying traffic than does hand-written message-passing code. At first glance one might expect the higher message latencies of a WAN to increase the performance gap, making shared memory comparatively less attractive.

We argue, however, that the opposite may actually occur. Communication in a distributed system is typically more coarse-grained than it is in a tightly coupled local-area cluster, enabling us to exploit optimizations that have a more limited effect in the tightly coupled world, and that are difficult or cumbersome to apply by hand. These optimizations include the use of variable-grain coherence blocks [226]; bandwidth reduction through communication of diffs [47] instead of whole objects; and the exploitation of relaxed coherence models to reduce the frequency of updates [56, 298].

When translating and transmitting previously uncached data, InterWeave achieves throughput comparable to that of standard RPC packages, and 20 times higher than that of Java RMI [57]. When the data have been cached and only a fraction of them have changed, InterWeave's translation cost and bandwidth requirements scale down proportionally and automatically; pure RPC code requires ad-hoc recoding to achieve similar performance gains.

The remainder of this chapter is organized as follows. We describe the design of InterWeave in more detail in Section 2.1. We provide implementation details in Section 2.2, and performance results in Section 2.3. We compare our design to related work in Section 2.4, and conclude in Section 2.5.


node_t *head;
IW_handle_t h;

void list_init(void)
{
    h = IW_open_segment("host/list");
    head = IW_mip_to_ptr("host/list#head");
}

node_t *list_search(int key)
{
    node_t *p;
    IW_rl_acquire(h);             // read lock
    for (p = head->next; p; p = p->next) {
        if (p->key == key) {
            IW_rl_release(h);     // read unlock
            return p;
        }
    }
    IW_rl_release(h);             // read unlock
    return NULL;
}

void list_insert(int key)
{
    node_t *p;
    IW_wl_acquire(h);             // write lock
    p = (node_t*) IW_malloc(h, IW_node_t);
    p->key = key;
    p->next = head->next;
    head->next = p;
    IW_wl_release(h);             // write unlock
}

Figure 2.1: Shared linked list in InterWeave. Variable head points to an unused header node; the first real item is in head->next.


2.1 Design

The InterWeave programming model assumes a distributed collection of servers and clients. Servers maintain persistent copies of shared data, and coordinate sharing among clients. Clients in turn must be linked with a special InterWeave library, which arranges to map a cached copy of needed data into local memory. InterWeave servers are oblivious to the programming languages used by clients, but the client libraries may be different for different programming languages. The client library for Fortran, for example, cooperates with the linker to bind InterWeave data to variables in common blocks (Fortran 90 programs may also make use of pointers).

Figure 2.1 presents a simple realization of a shared linked list. The InterWeave API used in the example is explained in more detail in the following sections. For consistency with the example we present the C version of the API. Similar versions exist for C++, Java, and Fortran.

InterWeave is designed to inter-operate with our Cashmere S-DSM system [242]. Together, these systems integrate hardware coherence and consistency within multiprocessors (level-1 sharing), S-DSM within tightly coupled clusters (level-2 sharing), and version-based coherence and consistency across the Internet (level-3 sharing). At level 3, InterWeave uses application-specific knowledge of minimal coherence requirements to reduce communication, and maintains consistency information in a manner that scales to large amounts of shared data. Further detail on InterWeave’s coherence and consistency mechanisms can be found in [56, 58].


2.1.1 Data Allocation

The unit of sharing in InterWeave is a self-descriptive segment (a heap) within which programs allocate strongly typed blocks of memory.1 Every segment is specified by an Internet URL. The blocks within a segment are numbered and optionally named. By concatenating the segment URL with a block name or number and optional offset (delimited by pound signs), we obtain a machine-independent pointer (MIP): “foo.org/path#block#offset”. To accommodate heterogeneous data formats, offsets are measured in primitive data units—characters, integers, floats, etc.—rather than in bytes.

Every segment is managed by an InterWeave server at the IP address corresponding to the segment’s URL. Different segments may be managed by different servers. Assuming appropriate access rights, IW_open_segment() communicates with the appropriate server to open an existing segment or to create a new one if the segment does not yet exist. The call returns an opaque handle that can be passed as the initial argument in calls to IW_malloc(). InterWeave currently employs public key based authentication and access control. If requested by the client, communication with a server can optionally be encrypted with DES.

As in multi-language RPC systems, the types of shared data in InterWeave must be declared in an IDL. The InterWeave IDL compiler translates these declarations into the appropriate programming language(s) (C, C++, Java, Fortran). It also creates initialized type descriptors that specify the layout of the types on the specified machine. The descriptors must be registered with the InterWeave library prior to being used, and are passed as the second argument in calls to IW_malloc(). These conventions allow the library to translate to and from wire format, ensuring that each type will have the appropriate machine-specific byte order, alignment, etc. in locally cached copies of segments.

1 Like distributed file systems and databases, and unlike systems such as PerDiS [89], InterWeave requires manual deletion of data; there is no automatic garbage collection.


Synchronization (to be discussed further in Section 2.1.2) takes the form of reader-writer locks. A process must hold a writer lock on a segment in order to allocate, free, or modify blocks. The lock routines take a segment handle as parameter.

Given a pointer to a block in an InterWeave segment, or to data within such a block, a process can create a corresponding MIP:

IW_mip_t m = IW_ptr_to_mip(p);

This MIP can then be passed to another process through a message, a file, or an argument of a remote procedure in RPC-style systems. Given appropriate access rights, the other process can convert back to a machine-specific pointer:

my_type *p = (my_type*)IW_mip_to_ptr(m);

The IW_mip_to_ptr() call reserves space for the specified segment if it is not already locally cached and returns a local machine address. Actual data for the segment will not be copied into the local machine unless and until the segment is locked.

It should be emphasized that IW_mip_to_ptr() is primarily a bootstrapping mechanism. Once a process has one pointer into a data structure (e.g. the head pointer in our linked list example), any data reachable from that pointer can be directly accessed in the same way as local data, even if embedded pointers refer to data in other segments. InterWeave’s pointer-swizzling and data-conversion mechanisms ensure that such pointers will be valid local machine addresses. It remains the programmer’s responsibility to ensure that segments are accessed only under the protection of reader-writer locks. To assist in this task, InterWeave allows the programmer to identify the segment in which the datum referenced by a pointer resides, and to determine whether that segment is already locked:


IW_handle_t h = IW_get_handle(p);
IW_lock_status s = IW_get_lock_status(h);

Much of the time we expect that programmers will know, because of application semantics, that pointers about to be dereferenced refer to data in segments that are already locked.

2.1.2 Coherence

When modified by clients, InterWeave segments move over time through a series of internally consistent states. When writing a segment, a process must have exclusive write access to the most recent version. When reading a segment, however, the most recent version may not be required because processes in distributed applications can often accept a significantly more relaxed—and hence less communication-intensive—notion of coherence. InterWeave supports several relaxed coherence models. It also supports the maintenance of causality among segments using a scalable hash-based scheme for conflict detection.

When a process first locks a shared segment (for either read or write access), the InterWeave library obtains a copy from the segment’s server. At each subsequent read-lock acquisition, the library checks to see whether the local copy of the segment is “recent enough” to use. If not, it obtains an update from the server. An adaptive polling/notification protocol [56] often allows the client library to avoid communication with the server when updates are not required. Twin and diff operations [47], extended to accommodate heterogeneous data formats, allow the implementation to perform an update in time proportional to the fraction of the data that has changed.

The server for a segment need only maintain a copy of the segment’s most recent version. The API specifies that the current version of a segment is always acceptable, and since processes cache whole segments, they never need an “extra


piece” of an old version. To minimize the cost of segment updates, the server remembers, for each block, the version number of the segment in which that block was last modified. This information allows the server to avoid transmitting copies of blocks that have not changed. As partial protection against server failure, InterWeave periodically checkpoints segments and their metadata to persistent storage. The implementation of real fault tolerance is a subject for future work. As noted at the beginning of Section 2.1, an S-DSM-style system such as Cashmere can play the role of a single InterWeave node. Within an S-DSM cluster, or within a hardware-coherent node, data can be shared using data-race-free [4] shared-memory semantics, so long as the cluster or node holds the appropriate InterWeave lock.

2.1.3 Security

Security becomes an important concern when data are shared across the Internet. InterWeave employs public key based authentication and access control. The implementation is built on top of the OpenSSL library [182]. Each client and each server has its own identity key, and segments have read and write access keys. When a client first attempts to access a segment, the client and the server authenticate with each other using the identity keys. The server’s public identity key is well-known to the clients, and the server maintains a list of public identity keys of authorized clients. If the client passes the identity check, meaning it has permission to connect to the server, then its access keys are used to determine whether it can read or write the requested segment.

Access keys can be specified at segment creation time or changed later by any client that successfully acquires write permission.2 This scheme relieves servers of the burden of maintaining group information about clients sharing segments, which could be highly dynamic for applications distributed over the Internet. A group of clients with accounts on a given server can agree to share data simply by creating and distributing segment access keys, without intervention by the server. If requested by the client, communication with a server can optionally be encrypted with DES.

2 A malicious client with write permission to a segment can therefore deny other clients access to the segment by changing the access key without sharing the new key with others. This seemingly indicates a problem, but it does not actually degrade the security level of the system, since a malicious client with write permission can trash the content of the segment anyway.

2.2 Implementation

In this section, we describe the implementation of the InterWeave server and client library. For both the client and the server we first describe the structure of the metadata and memory management, followed by the algorithms for modification tracking, wire-format diffing, and pointer swizzling. Finally, we present the support for multiple languages.

2.2.1 Memory Management and Segment Metadata

Client Side

As described in Section 2.1, InterWeave presents the programmer with two granularities of shared data: segments and blocks. Each block must have a well-defined type, but this type can be a recursively defined structure of arbitrary complexity, so blocks can be of arbitrary size. Every block has a serial number within its segment, assigned by IW_malloc(). It may also have an optional symbolic name, specified as an additional parameter. A segment is a named collection of blocks. There is no a priori limit on the number of blocks in a segment, and blocks within the same segment can be of different types and sizes.


[Figure 2.2 appears here. It depicts the client-side segment table—each entry holding SegName, FirstSubSeg, FreeList, blk_number_tree, blk_name_tree, and TCPcon—together with subsegments containing block headers, block data, free memory, pagemaps (pointers to twins), per-subsegment blk_addr_trees, and the global subseg_addr_tree.]

Figure 2.2: Simplified view of InterWeave client-side data structures: the segment table, subsegments, and blocks within segments. Type descriptors, pointers from balanced trees to blocks and subsegments, and footers of blocks and free space are not shown.


The copy of a segment cached by a given process need not necessarily be contiguous in the application’s virtual address space, so long as individually malloced blocks are contiguous. The InterWeave library can therefore implement a segment as a collection of subsegments, invisible to the user. Each subsegment is contiguous, and can be any integral number of pages in length. These conventions support blocks of arbitrary size, and ensure that any given page contains data from only one segment. New subsegments can be allocated by the library dynamically, allowing a segment to expand over time.

An InterWeave client manages its own heap area, rather than relying on the standard C library malloc(). The InterWeave heap routines manage subsegments, and maintain a variety of bookkeeping information. Among other things, this information includes a collection of balanced search trees that allow InterWeave to quickly locate blocks by name, serial number, or address. Figure 2.2 illustrates the organization of memory into subsegments, blocks, and free space.

The segment table has exactly one entry for each segment being cached by the client in local memory. It is organized as a hash table, keyed by segment name. In addition to the segment name, each entry in the table includes four pointers: one for the first subsegment that belongs to that segment, one for the first free space in the segment, and two for a pair of balanced trees containing the segment’s blocks. One tree is sorted by block serial number (blk_number_tree), the other by block symbolic name (blk_name_tree); together they support translation from MIPs to local pointers. An additional global tree contains the subsegments of all segments, sorted by address (subseg_addr_tree); this tree supports modification detection and translation from local pointers to MIPs. Each subsegment has a balanced tree of blocks sorted by address (blk_addr_tree).
Segment table entries may also include a cached TCP connection over which to reach the server. Each block in a subsegment begins with a header containing the size of the block, a pointer to a type descriptor, a serial number, an optional symbolic block name


and several flags for optimizations to be described in Section 2.2.3. Free space within a segment is kept on a linked list, with a head pointer in the segment table. Allocation is currently first-fit. To allow a deallocated block to be coalesced with its neighbor(s), if free, all blocks have a footer (not shown in Figure 2.2) that indicates whether that block is free and, if it is, where it starts.

Server Side

To avoid an extra level of translation, an InterWeave server stores both data and type descriptors in wire format. This also makes it easy for the server to checkpoint a copy of the entire segment to persistent storage. The server keeps track of segments, blocks, and subblocks. The latter are invisible to clients.

Each segment maintained by a server has an entry in a segment hash table keyed by segment name. Each block within the segment consists of a version number (the version of the segment in which the block was most recently modified), a serial number, a pointer to a type descriptor (as described in Section 2.2.2, the server’s type descriptors are machine independent and describe the layout of wire-format data), pointers to link the block into data structures described in the following paragraph, a pointer to a wire-format block header, and a pointer to the data of the block, again in wire format.

The blocks of a given segment are organized into a balanced tree sorted by serial number (svr_blk_number_tree) and a linked list sorted by version number (blk_version_list). The linked list is separated by markers into sublists, each of which contains blocks with the same version number. Markers are also organized into a balanced tree sorted by version number (marker_version_tree). Pointers to all these data structures are kept in the segment table, along with the segment name.

Each block is potentially divided into subblocks, comprising a contiguous group of primitive data elements (fixed at 16 for our experiments, but which can also vary


in number based on access patterns) from the same block. Subblocks also have version numbers, maintained in an array. Subblocks and pointers to the various data structures in the segment table are used in collecting and applying diffs (to be described in Section 2.2.2). In order to avoid unnecessary data relocation, machine-independent pointers and strings are stored separately from their blocks, since they can be of variable size.

2.2.2 Diff Creation and Translation

Client Side

Recall that there are four separate kinds of balanced trees in the client: a global tree of subsegments, ordered by memory address (subseg_addr_tree); two trees of blocks within each segment, ordered by serial number (blk_number_tree) and by symbolic name (blk_name_tree); and a tree of blocks within each subsegment, ordered by memory address (blk_addr_tree).

When a process acquires a write lock on a given segment, the InterWeave library asks the operating system to write protect the pages that comprise the various subsegments of the local copy of the segment. When a page fault occurs, the SIGSEGV signal handler, installed by the library at program startup time, creates a pristine copy, or twin [47], of the page in which the write fault occurred. It saves a pointer to that twin in the faulting segment’s pagemap for future reference, and then asks the operating system to re-enable write access to the page. More specifically, if the fault occurs in page i of subsegment j, the handler places a pointer to the twin in the ith entry of a structure called the pagemap, located in j’s header (see Figure 2.2). The subseg_addr_tree makes it easy for the handler to determine i and j. Together, the pagemap and the linked list of subsegments in a given segment allow InterWeave to quickly determine which


pages need to be diffed when the coherence protocol needs to send an update to the server. When a process releases a write lock, the library uses diffing to gather changes made locally and convert them into machine-independent wire format in a process called collecting a diff. Figure 2.3 shows an example of wire format translation. The changes are expressed in terms of segments, blocks, and primitive data unit offsets, rather than pages and bytes. A wire-format block diff consists of a block serial number, the length of the diff, and a series of run length encoded data changes, each of which consists of the starting point of the change, the length of the change, and the updated content. Both the starting point and length are measured in primitive data units rather than in bytes. The MIP in the wire format is either a string (for an intra-segment pointer) or a string accompanied by the serial number of the pointed-to block’s type descriptor (for a cross-segment pointer). Figure 2.4 shows the complete grammar of the wire-format diff for a block. The wire format for a segment includes extra information such as added type descriptors, deleted blocks, added blocks, changed blocks, and access keys for the segment. The diffing routine must have access to type descriptors in order to compensate for local byte order and alignment, and in order to swizzle pointers. The content of each descriptor specifies the substructure and layout of its type. For primitive types (integers, doubles, etc.) there is a single pre-defined code. For other types there is a code indicating either an array, a record, or a pointer, together with pointer(s) that recursively identify the descriptor(s) for the array element type, record field type(s), or pointed-at type. The type descriptors also contain information about the size of blocks and the number of primitive data units in blocks. 
For structures, the descriptor records both the byte offset of each field from the beginning of the structure in local format, and the machine-independent primitive offset of each field, measured in primitive data units. Like blocks, type descriptors have segment-specific serial numbers, which the server and client use in wire-format messages. A given type descriptor may have different serial numbers in different segments. Per-segment arrays and hash tables maintained by the client library map back and forth between serial numbers and pointers to local, machine-specific descriptors. An additional, global, hash table maps wire-format descriptors to addresses of local descriptors, if any. When an update message from the server announces the creation of a block with a type that has not appeared in the segment before, the global hash table allows InterWeave to tell whether the type is one for which the client has a local descriptor. If there is no such descriptor, then the new block can never be referenced by the client’s code, and can therefore be ignored.

[Figure 2.3 appears here. It shows a structure in local format (with byte offsets and alignment padding), the lookup of its type descriptors, and the resulting wire format: the block serial number, the block diff length measured in bytes, and run-length-encoded changes whose starting points and lengths are measured in primitive data units.]

Figure 2.3: Wire format translation of a structure consisting of three integers (i0–i2), two doubles (d0–d1), and a pointer (ptr). All fields except d0 and i1 are changed.

blk_diff  -> blk_serial_number blk_diff_len prim_RLEs
prim_RLEs -> prim_RLE prim_RLEs
          ->
prim_RLE  -> prim_start num_prims prims
prims     -> prim prims
          ->
prim      -> primitive data
          -> MIP
MIP       -> string
          -> string desc_serial_number

Figure 2.4: Grammar of the wire-format diff for a block.

When translating local modifications into wire format, the diffing routine takes a series of actions at three different levels: subsegments, blocks, and words. At

39

the subsegment level, it scans the list of subsegments of the segment and the pagemap within each subsegment. When it identifies a modified page, it searches the blk_addr_tree within the subsegment to identify the block containing the beginning of the modified page. It then scans blocks linearly, converting each to wire format by using block-level diffing. When it runs off the end of the last contiguous modified page, the diffing routine returns to the pagemap and subsegment list to find the next modified page.

At the block level, the library uses word-level diffing (word-by-word comparison of modified pages and their twins) to identify a run of contiguous modified words. Call the bounds of the run, in words, change_begin and change_end. Using the pointer to the type descriptor stored in the block header, the library searches for the primitive subcomponent spanning change_begin, and computes the primitive offset of this subcomponent from the beginning of the block. Consecutive type descriptors, from change_begin to change_end, are then retrieved sequentially to convert the run into wire format. At the end, word-level diffing is called again to find the next run to translate.

As noted above, word-level diffing identifies a run of contiguous modified words and returns it to block-level diffing. It performs a word-by-word diff of the pages and their twins, stores all modified runs internally, and returns each of them individually to subsequent invocations.

To ensure that translation cost is proportional to the size of the modified data, rather than the size of the entire block, we must avoid walking sequentially through subcomponent type descriptors while searching for the primitive subcomponent spanning change_begin. First, the library subtracts the starting address of the block from change_begin to get the byte offset of the target subcomponent. Then it searches recursively for the target subcomponent.
For structures, a binary search is applied to byte offsets of the fields. For an array, the library simply divides the given byte offset by the size of an individual element (stored in the


array element’s type descriptor). To accommodate reference types, InterWeave relies on pointer swizzling [283]. Briefly, swizzling uses type descriptors to find all (machine-independent) pointers within a newly-cached or updated segment, and converts them to pointers that work on the local machine.

There are two kinds of pointers: intra-segment pointers and cross-segment pointers. For a cross-segment pointer, it is possible that the pointed-to segment is not (yet) cached, or that it is cached but outdated and does not yet contain the pointed-to block. In both cases, the pointer is set to refer to reserved space in unmapped pages where data will lie once properly locked. The set of segments currently cached on a given machine thus displays an “expanding frontier” reminiscent of lazy dynamic linking: a core of currently accessible pages surrounded by a “fringe” of reserved but not-yet accessible pages. In wire format, a cross-segment pointer is accompanied by the serial number of the pointed-to block’s type descriptor, from which the size of the reserved space can potentially be derived without contacting the server.

To swizzle a local pointer to a MIP, the library first searches the subseg_addr_tree for the subsegment spanning the pointed-to address. It then searches the blk_addr_tree within the subsegment for the pointed-to block. It subtracts the starting address of the block from the pointed-to address to obtain the byte offset, which is then mapped to a corresponding primitive offset with the help of the type descriptor stored in the block header. Finally, the library converts the block serial number and primitive offset into strings, which it concatenates with the segment name to form a MIP.
When a client acquires a read lock and determines that its local cached copy of the segment is not recent enough, it asks the server to build a diff that describes the data that have changed between the client’s outdated copy and the master copy at the server. When the diff arrives, the library uses its content to update


the local copy in a process called applying a diff. Diff application is similar to diff collection, except that searches now are based on primitive offsets rather than byte offsets. To convert the serial numbers employed in wire format to local machine addresses, the client traverses the blk_number_tree—the balanced tree of blocks, sorted by serial number, that is maintained for every segment.

To swizzle a MIP into a local pointer, the library parses the MIP and extracts the segment name, block serial number or name, and the primitive offset, if any. The segment name is used as an index into the segment hash table to obtain a pointer to the segment’s blk_number_tree, which is in turn searched to identify the pointed-to block. The byte offset of the pointed-to address is computed from the primitive offset with the help of the type descriptor stored in the block header. Finally, the byte offset is added to the starting address of the block to get the pointed-to address.

Server Side

Each server maintains an up-to-date copy of each segment it serves, and controls access to the segments. For each modest-sized block in each segment, and for each subblock of a larger block, the server remembers the version number of the segment in which (some portion of) the content of the block or subblock was most recently modified. This convention strikes a balance between the size of server-to-client diffs and the size of server-maintained metadata. The server also remembers, for each block, the version number in which the block was created. Finally, the server maintains a list of the serial numbers of deleted blocks, along with the versions in which they were deleted.

The creation version number allows the server to tell, when sending a segment update to a client, whether the metadata for a given block need to be sent in addition to the data. When mallocing a new block, a client can use any available serial number. Any client with an old copy of the segment in which the


number was used for a different block will receive new metadata and new block content from the server the next time it obtains an update.

Recall that the blocks of a given segment are organized at the server into a balanced tree sorted by serial number (svr_blk_number_tree) and a linked list sorted by version number (blk_version_list). Markers in the blk_version_list separate blocks of different versions into sublists, and are organized into a balanced tree sorted by version number (marker_version_tree). Upon receiving a diff, the server first appends a new marker to the end of the blk_version_list and inserts it in the marker_version_tree. Newly created blocks are then appended to the end of the list. Modified blocks are first identified by searching the svr_blk_number_tree, and then moved to the end of the list. If several blocks are modified together in both the previous diff and the current diff, their positions in the linked list will be adjacent, and moving them together to the end of the list involves only a constant-cost operation to reset pointers into and out of the first and last blocks, respectively.

This makes clear the benefit of the two-level block version management—the balanced tree for markers and the linked list for blocks. If the blocks were directly organized in a balanced tree sorted by version number, every update to a block would require first deleting it from the tree and then re-inserting it with a new version number. In a balanced tree, both insertion and deletion are expensive operations that may require tree adjustment.

At the time of a lock acquire, a client must decide whether its local copy of the segment needs to be updated. (This decision may or may not require communication with the server; see below.) If an update is required, the server traverses the marker_version_tree to locate the first marker whose version is newer than the client’s version.
All blocks in the blk_version_list after that marker have some subblocks that need to be sent to the client. The server constructs a wire-format diff and returns it to the client. Because version numbers are remembered at the granularity of blocks and (when blocks are too large) subblocks, all of the


data in each modified block or subblock will appear in the diff. To avoid extra copying, the modified subblocks are sent to the client using writev(), without being compacted together.

The client’s type descriptors, generated by our IDL compiler, describe machine-specific data layout. At the server, however, we store data in wire format. When it receives a new type descriptor from a client, the server converts it to a machine-independent form that describes the layout of fields in wire format. Diff collection and application are then similar to the analogous processes on the client, except that searches are guided by the server’s machine-independent type descriptors.

2.2.3 Optimizations

Several optimizations improve the performance of InterWeave in important common cases. We describe here those related to memory management and heterogeneity. Others are described in [55, 56].

Data layout for cache locality. InterWeave’s knowledge of data types and formats allows it to organize blocks in memory for the sake of spatial locality. When a segment is cached at a client for the first time, blocks that have the same version number, meaning they were modified by another client in a single write critical section, are placed in contiguous locations, in the hope that they may be accessed or modified together by this client as well. We do not currently relocate blocks in an already cached copy of a segment. InterWeave has sufficient information to find and fix all pointers within segments, but references held in local variables would have to be discarded or updated manually.

Diff caching. The server maintains a cache of diffs that it has received recently from clients, or collected recently itself, in response to client requests. These cached diffs can often be used to respond to future requests, avoiding redundant collection overhead. In most cases, a client sends the server a diff, and the server


caches and forwards it in response to subsequent requests. The server updates its own master copy of the data asynchronously when it is not otherwise busy (or when it needs a diff it does not have). This optimization takes server diff collection off the critical path of the application in common cases.

No-diff mode. As in the TreadMarks S-DSM system [6], a client that repeatedly modifies most of the data in a segment will switch to a mode in which it simply transmits the whole segment to the server at every write lock release. This no-diff mode eliminates the overhead of mprotects, page faults, and the creation of twins and diffs. Moreover, translating an entire block is more efficient than translating diffs: the library can simply walk linearly through the type descriptors of primitive subcomponents and put each of them in wire format, avoiding searches for change begin and checks for change end. The switch occurs when the client finds that the size of a newly created diff is at least 75% of the size of the segment itself; this value strikes a balance between communication cost and the other overheads of twinning and diffing. Periodically afterwards (at linearly increasing random intervals), the client switches back to diffing mode to verify the size of current modifications. When the segment as a whole cannot be put into no-diff mode, we still try to put individual blocks whose changes exceed 75% of their size into no-diff mode, because doing so saves diffing overhead, and translating an almost entirely changed block in full is more efficient than translating its diffs.

Isomorphic type descriptors. For a given data structure declaration in IDL, our compiler outputs the type descriptor most efficient for run-time translation rather than strictly following the original type declaration. For example, if a struct contains 10 consecutive integer fields, the compiler generates a descriptor containing a 10-element integer array instead.
This altered type descriptor is used only by the InterWeave library, and is invisible to the programmer; the language-specific type declaration always follows the structure of the IDL declaration.


Cache-aware diffing. A blocking technique is used during word-level diffing to improve the temporal locality of data accesses. Specifically, instead of diffing a large block in its entirety and then translating the diff into wire format, we diff at most 1/4 of the size of the hardware cache, and then translate this portion of the diff while it is still in the cache. The ratio 1/4 accommodates the actual data, the twin, and the wire-format diff (all of which are about the same size), plus additional space for type descriptors and other metadata.

Diff run splicing. In word-level diffing, if one or two adjacent words are unchanged while both of their neighboring words are changed, we treat the entire sequence as changed in order to avoid starting a new run length encoding section in the diff. It already costs two words to specify a head and a length in the diff, and the spliced run is faster to apply. Splicing is particularly important when translating double-word primitive data in which only one word has changed.

Last-block searches. On both the client and the server, block predictions are used to avoid searching the balanced tree of blocks sorted by serial number when mapping serial numbers in wire format to blocks. After handling a changed block, we predict that the next changed block in the diff will be the block following the current block in the previous diff. Due to repetitive program patterns, blocks modified together in the past tend to be modified together in the future.

Flexible byte order for wire format. Unlike RPC-style systems that fix big endian as the wire-format byte order, InterWeave has the flexibility to choose either big endian or little endian. The cost of byte order swapping is avoided when the wire-format byte order matches the machine byte order. This is particularly useful in a world dominated by the little endian x86 architecture.

2.2.4 Support for Multiple Languages

Independent of its current implementation, InterWeave establishes a universal data sharing framework by defining a protocol for communication, coherence, and


consistency between clients and servers, along with a machine- and language-independent wire format. Any language implemented on any platform can join the sharing so long as it provides a client library that conforms to the protocol and the wire format. Library implementations may differ from one another in order to take advantage of features within a specific language or platform. Our current implementation employs a common back-end library with front ends tailored to each language. Our current set of front ends supports C, C++, Fortran, and Java.

For Fortran 77, the lack of dynamic memory allocation poses a special challenge. InterWeave addresses it by allowing processes to share static variables (this mechanism also works in C). Given an IDL file, the InterWeave IDL/Fortran compiler generates a .f file containing common block declarations (structures in IDL are mapped to common blocks in Fortran), a .c file containing initialized type descriptors (similar to its C counterpart), and a .s file telling the linker to reserve space for subsegments, headers, and footers surrounding the common blocks. At runtime, extensions to the InterWeave API allow a process to "attach" static variables to segments, optionally supplying the variables with block names. Once shared variables are attached, a Fortran program can access them in exactly the same way as local variables, given proper lock protections.

Generating type descriptors for common blocks in Fortran 77 also poses a challenge. Type descriptors contain information about the sizes of the common blocks, the types of variables in the common blocks, and the relative offsets of the variables in the common blocks. This information depends on the particular Fortran compiler in use. For C, our IDL compiler generates type descriptors in C source code, which, when passed through the C compiler, become properly initialized values. For instance, the offset of a field in a structure is represented in C source code as the difference between the address of the field and that of the structure. This method does not work with Fortran 77, however, because of the lack of support for pointers.


We address this problem by having the IDL compiler automatically generate and execute Fortran programs to probe the offsets of the variables in the common blocks. We provide a probing library that is written in C but also has a Fortran interface. For each common block, the IDL compiler generates a Fortran program that calls the probing library, using each variable in the common block in turn as the argument. This Fortran program is compiled into a binary by a Fortran compiler and executed. During the execution, the probing library records the addresses of the variables in the common block, calculating and saving their offsets. Finally, the IDL compiler takes this information and generates a .c file containing type descriptors with properly initialized offsets for variables in common blocks. The above procedure involves the execution of several processes, but it is completely automated by the IDL compiler.

Sharing the same back-end implementation between the front ends for C and Fortran means that the internal representation of a block in C or Fortran should look alike most of the time. In addition to the block body, an InterWeave block has additional header and footer data. Blocks are encompassed in subsegments and segments, which have their own extra metadata. In C, space for this extra metadata is reserved along with the block bodies when blocks are allocated at runtime. In Fortran, blocks are static, and space for them is calculated by the Fortran compiler, which knows nothing about InterWeave and hence will not reserve space for InterWeave's metadata. Two naive solutions exist. One is to store the metadata apart from the block body, but this leads to different block representations for C and Fortran, defeating our goal of using a shared back end. The other is to use a custom Fortran compiler that is aware of InterWeave's special needs, but this would limit InterWeave's portability.

In our implementation, we use the standard Fortran compiler and linker, but our IDL compiler generates a .s file to tell the linker where to put common blocks. The addresses for common blocks are carefully calculated to avoid conflicts


with other program data and to include space for all necessary metadata. After linking, instructions in the produced executable can properly reference variables in common blocks. The memory for common blocks, however, has not been allocated by the linker. When the program starts, the InterWeave library uses the mmap system call to allocate memory and map common blocks, along with their metadata, to the addresses specified in the .s file. When a Fortran client calls the InterWeave API to attach a common block to a segment at runtime, the library properly initializes all metadata related to the common block, for instance, linking the subsegment surrounding the common block into the segment's subsegment list and inserting the common block into the balanced trees for blocks. After that, a common block looks exactly the same as a dynamically allocated block, so Fortran and C can share the same back-end implementation transparently.

The implementation supporting static variables in C is similar to that for common blocks in Fortran. For C, we also need to allocate extra space for the metadata of static blocks. Similarly, the IDL compiler generates a .s file to tell the linker where to put the static variables. All static variables to be shared through InterWeave are required to be declared together in a C source file, from which a standard C compiler produces an object file. The IDL compiler internally invokes the nm tool on the object file to extract the sizes of the variables, and then calculates the addresses reserved for them accordingly.

Java introduces additional challenges for InterWeave, including dynamic allocation, garbage collection, constructors, and reflection. Our J-InterWeave system is described in [57]. When sharing with a language with significant semantic limitations, the programmer may need to avoid certain features. For example, pointers are allowed in segments shared by a Fortran 77 client and a C client, but the Fortran 77 client simply has no way to use them.


                                          client    server    shared
 memory, metadata, diffing                10,000     5,000
 transaction, Fortran, relaxed coherence   7,000     4,000
 misc.                                     5,000     5,000
 IDL compiler                                                  3,000
 utilities                                                     6,000

Table 2.1: The breakdown of the InterWeave code. The "utilities" category includes our own implementation of utilities such as hash table, linked list, and high-resolution timer.

2.2.5 Portability

InterWeave currently consists of approximately 45,000 lines of heavily commented C++ code (see Table 2.1). This total does not include the libraries InterWeave uses (e.g., AVL and OpenSSL), the J-InterWeave Java support [57], or the (relatively minor) modifications required to the Cashmere S-DSM system to make it work with InterWeave. Both the client library and the server have been ported to a variety of architectures (Power series processors, Alpha, Sparc, x86, and MIPS) and operating systems (AIX, Windows NT/XP, Linux, Solaris, Tru64 Unix, and IRIX). The Windows port is perhaps the best indicator of InterWeave’s portability: only about 100 lines of code are changed. The changes are: (1) replacing malloc and mprotect on Unix with VirtualAlloc and VirtualProtect on Windows, respectively; (2) replacing Unix’s SIGSEGV signal handler with Windows’ exception handler at address FS:0, where FS is a segment register in the x86 architecture; (3) fixing a dozen other minor issues; (4) linking InterWeave with Cygnus’ supporting library, a Unix-like environment on Windows [204].


2.2.6 Built-in Debug Utilities

InterWeave provides some built-in utilities to help debug the middleware system itself as well as applications. Our experience has shown that these utilities are extremely useful despite their simplicity.

• Unprotected write accesses. Accesses to segments must proceed under the protection of proper locks. InterWeave detects write accesses to segments outside write locks through the virtual memory system.

• Out-of-bound writes. In the debug version, the body and metadata of a block are guarded by an 8-byte header and an 8-byte footer, which are supposed to be accessed by neither the library nor the application under normal operation. The content of the header and footer is initialized to a special bit pattern. While collecting diffs, the library checks the headers and footers and reports any header or footer whose content differs from that pattern, which indicates that an out-of-bound write has occurred. An API is also provided for the application to check the sanity of the headers and footers at any point. In the production version, this feature is automatically turned off to avoid unnecessary overhead.

• Check for repeated IW free() operations. One common mistake in memory management is to free a block of heap memory multiple times. With most current memory management packages, this corrupts the metadata these packages maintain and causes an unexpected crash during a later block allocation or release that is itself completely legal. Our experience, as well as that of many others, has shown that it is tricky to trace this category of bugs back to the points that actually cause the problem. In InterWeave, during an IW free() operation, the library checks the block metadata to ensure that the block has not been freed before.


• Check for dangling or unsafe pointers. In the debug version, while collecting diffs, the library checks whether changed pointers point to blocks that actually exist and whether the types of the pointers match the types of the pointed-to data.

• Inspection and reflection mechanisms. An API is provided for the application to inspect InterWeave's metadata, for instance, the list of segments or the blocks in the segments and their states. The server also has a built-in interface that allows users to query the metadata through a Web interface. On the client, the library provides a reflection mechanism to help applications query the type of any block or subcomponent of a block.

2.3 Experimental Results

The results presented here were collected on a 500MHz Pentium III machine, with 256MB of memory, a 16KB L1 cache, and a 512KB L2 cache, running Linux 2.4.18. Unless otherwise noted, we use big endian as the wire-format byte order, the same choice made by RPC-style systems. Earlier results for other machines can be found in a technical report [55].

2.3.1 Microbenchmarks

Basic Translation Costs

Figure 2.5 shows the overhead required to translate various data structures from local to wire format and vice versa. In each case we arrange for the total amount of data to equal 1MB; what differs is the data's format and type. The int array and double array cases comprise a large array of integers or doubles, respectively. The int struct and double struct cases comprise an array of structures, each with 32 integer or double fields, respectively. The string and small string


cases comprise an array of strings, each of length 256 or 4 bytes, respectively. The pointer case comprises an array of pointers to integers. The int double case comprises an array of structures containing integer and double fields, intended to mimic typical data in scientific programs. The mix case comprises an array of structures containing integer, double, string, small string, and pointer fields, intended to mimic typical data in non-scientific programs such as calendars, CSCW, and games.

Figure 2.5: Client's cost to translate 1MB of data. (Bars compare RPC XDR, collect block, collect diff, apply block, and apply diff costs, in seconds, for each data type.)

All optimizations described in Section 2.2.3 are enabled in our experiments. All provide measurable improvements in performance, bandwidth, or both. Data layout for cache locality, isomorphic type descriptors, cache-aware diffing, diff run splicing, and last-block searches are evaluated separately in Section 2.3.2.

Depending on the history of a given segment, the no-diff optimization may choose to translate a segment in its entirety or to compute diffs on a block-by-block basis. The collect diff and apply diff bars in Figure 2.5 show the overhead of translation to and from wire format, respectively, in the block-by-block diff case; the collect block and apply block bars show the corresponding overheads when


diffing has been disabled. For comparison purposes we also show the overhead of translating the same data via RPC parameter marshaling, in stubs generated with the standard Linux rpcgen tool. Unmarshaling costs (not shown) are nearly identical. To bias the comparison in favor of RPC, we manually tune the data declaration fed into rpcgen and pick the one for which rpcgen generates the most efficient marshaling routine. Take int array as an example:

    typedef int int_array[M];   /* declaration one */

    struct int_array {          /* declaration two */
        int i[M];
    };

In the rpcgen-generated marshaling routine for the first declaration, the marshaling of each individual integer is not inlined, which obviously is inefficient. It is, however, inlined in the marshaling routine for the second declaration. We therefore use the second one for RPC XDR. However, rpcgen does not inline the marshaling routine for doubles (xdr double) regardless of the declaration format, which is the reason why RPC XDR is not as efficient as InterWeave at marshaling doubles.

Generally speaking, InterWeave overhead is comparable to that of RPC. Collect block and apply block are 25% faster than RPC on average; collect diff and apply diff are 8% faster. It is clear that RPC is not good at marshaling pointers and small strings. Excluding these two cases, InterWeave in no-diff mode (collect/apply block) is still 18% faster than RPC on average; when diffing has to be performed (collect/apply diff), InterWeave is 0.5% slower than RPC. Collect block is 39% faster than collect diff on average, and apply block is 4% faster than apply diff, justifying the use of the no-diff mode.

When RPC marshals a pointer, deep copy semantics require that the pointed-to data, an integer in this experiment, be marshaled along with the pointer. The size of the resulting RPC wire format is the same as that of InterWeave, because


MIPs in InterWeave are strings, longer than 4 bytes. The RPC overhead for structures containing doubles is high in part because rpcgen does not inline the marshaling routine for doubles. Overall, InterWeave outperforms the widely used RPC XDR implementation in both diff and no-diff modes. Although this performance gap can be shrunk or even reversed by adopting aggressive optimizations in RPC [115], the important message here is that InterWeave is efficient enough to be used for almost all distributed applications, since current RPC implementations are the de facto tools for building distributed applications and their marshaling performance is considered acceptable.

Figure 2.6: Server's cost to translate 1MB of data. (Bars show collect block, apply block, collect diff, and apply diff costs, in seconds, for each data type.)

Figure 2.6 shows translation overhead for the InterWeave server in the same experiment as above. Because the server maintains data in wire format, costs are negligible in all cases other than pointer and small string. In these cases the high cost stems from the fact that strings and MIPs are of variable length, and


are stored separately from their wire-format blocks. However, for data structures such as mix, with a more reasonable number of pointers and small strings, the server cost is still low. As noted in Section 2.2.3, the server's diff management cost is not on the critical path in most cases. Comparisons between InterWeave and Java RMI appear in a companion paper [57]. The short story: translation overhead under the Sun JDK 1.3.2 JVM is 20x that of J-InterWeave.

Pointer Swizzling

Figure 2.7 shows the cost of swizzling ("collect pointer") and unswizzling ("apply pointer") a pointer. This cost varies with the nature of the pointed-to data. The int 1 case represents an intra-segment pointer to the start of an integer block. struct 1 is an intra-segment pointer to the middle of a structure with 32 fields. The cross #n cases are cross-segment pointers to blocks in a segment with n total blocks. The modest rise in overhead with n reflects the cost of search in various metadata trees. Performance is best in the int 1 case, which we expect to be representative of the most common sorts of pointers. However, even for moderately complex cross-segment pointers, InterWeave can swizzle about one million of them per second.

Figure 2.8 shows the breakdown for swizzling a cross 1024 local pointer into a MIP, where searching block is the cost of searching for the pointed-to block by first traversing the subseg addr tree and then traversing the blk addr tree; computing offset is the cost of converting the byte offset between the pointed-to address and the start of the pointed-to block into a corresponding primitive offset; string conversion is the cost of converting the block serial number and primitive offset into strings and concatenating them with the segment name; and searching desc srlnum is the cost of retrieving the serial number of the pointed-to block's type descriptor from the type descriptor hash table. Recall that, in wire format,


the MIP for a cross-segment pointer is accompanied by the type of the pointed-to block. Similarly, Figure 2.9 shows the breakdown for swizzling a cross 1024 MIP into a local pointer, where read MIP is the cost of reading the MIP from wire format; parse MIP is the cost of extracting the segment name, block serial number, and primitive offset from the MIP; search segment is the cost of locating the pointed-to segment by using the hash table keyed by segment name; search block is the cost of finding the block in the blk number tree of the pointed-to segment; and offset to pointer is the cost of mapping the primitive offset into a byte offset between the pointed-to address and the start of the pointed-to block. As seen in Figures 2.8 and 2.9, pointer swizzling is a complex process in which no single factor is the main source of overhead.

Figure 2.7: Pointer swizzling cost as a function of pointed-to object type. (Cases: int 1, struct 1, and cross 1 through cross 65536; bars show collect pointer and apply pointer costs in microseconds.)

Figure 2.8: Breakdown of swizzling local pointers into MIPs (using cross 1024). (Components: searching block, computing offset, string conversion, searching desc_srlnum, and misc.)

Figure 2.9: Breakdown of swizzling MIPs into local pointers (using cross 1024). (Components: read MIP, parse MIP, search segment, search block, and offset to pointer.)

Modifications at Different Granularities

Figure 2.10 shows the client and server diffing costs as a function of the fraction of a segment that has changed. In all cases the segment in question consists of a 1MB array of integers. The x axis indicates the distance in words between consecutive modified words: ratio 1 indicates that the entire block has been changed; ratio 4 indicates that every 4th word has been changed. The client collect diff cost has been broken down into client word diffing (word-by-word comparison of a page and its twin) and client translation (converting the diff to wire format); values on these two curves add together to give the values on the client collect diff curve. There is a sharp knee for word diffing at ratio 1024. Before that point, every page in the segment has been modified; after that point, the number of modified pages decreases linearly. Due to the artifact of subblocks (16 primitive data units in our current implementation), the server collect diff and client apply diff costs are constant for ratios between 1 and 16, because in those cases the server loses track of fine-grain modifications and treats the entire block as changed. The jump in client collect diff, server apply diff, and client translation between ratios of 2 and 4 is due to the loss of the diff run splicing optimization described in Section 2.2.3. At a ratio of 2 the entire block is treated as changed, while at a ratio of 4 the block is partitioned into many small isolated changes; translating a complete block is more efficient than translating many small changed sections. The cost of word diffing increases between ratios of 1 and 2 because diffing is more efficient when there is only one continuous changed section.

Figure 2.10: Diff management cost as a function of modification granularity (1MB total data, fine-grain modification).

The experiment in Figure 2.11 is similar to that in Figure 2.10, except that modifications are spread out at a granularity of 8 words. The x axis indicates the ratio of changed words. For instance, ratio 1 indicates that the entire block has been changed; ratio 4 indicates that a quarter of the block has been changed: 8 changed words followed by 24 unchanged words, followed by another 8 changed words, and so forth. There is a sharp knee for word diffing at ratio 256. Before that point, every page belonging to the block has been modified; after that point, the number of modified pages decreases linearly. Due to the artifact of subblocks (16 primitive data units in the experiment), the server collect diff and client apply diff curves are flat for ratios 1 and 2, because in both cases the server treats the block as entirely changed. The increase in client translation and server apply diff between ratios 1 and 2 occurs because both the client and the server are more efficient at managing an entirely changed block. Similarly, there is a jump in server collect diff between ratios 2 and 4.

Figure 2.11: Diff management cost as a function of modification granularity (1MB total data, coarse-grain modification).

Figures 2.5 and 2.6 show that InterWeave is efficient at translating entirely changed blocks. Figures 2.10 and 2.11 show that InterWeave is also efficient at translating scattered modifications.
When only a fraction of a block has changed, InterWeave is able to reduce both translation cost and required bandwidth by transmitting only the diffs. With straightforward use of an RPC-style system, both translation cost and bandwidth remain constant regardless of the fraction of


the data that has changed.

Figure 2.12: The effect of varying block size. (Curves show server and client collect diff and apply diff costs, in seconds, for block sizes from 4 bytes to 1MB.)

Varying Block Size and Number of Blocks

Figure 2.12 shows the overhead introduced on the client and the server while varying the size and number of blocks in a segment. Each point on the x axis represents a different segment configuration, denoted by the size of a single block in the segment. For all configurations, the total size of all blocks in the segment is 1MB. In the 4K configuration, for example, there are 256 blocks in the segment, each of size 4KB. On both the client and the server, the curves flatten out between 64 and 256. Blocks of size 4 bytes are primitive blocks, a special case InterWeave can recognize and optimize accordingly. Clearly, using a large number of small blocks increases the overhead substantially.


2.3.2 Optimizations

Data Layout for Cache Locality

Figure 2.13 demonstrates the effectiveness of version-based block layout. As in Figure 2.12, several configurations are evaluated here, each with 1MB total data but with different block sizes. There are two versions of each configuration: orig and remapped. In the orig version, every other block in the segment is changed; in the remapped version, those changed blocks, exactly half of the segment, are grouped together. As before, the cost of word diffing is part of the cost of collect diff. Because fewer pages and blocks need to be traversed, the saving in the remapped version is significant. App comp. is simply the time to write all words in the modified blocks, and is presented to evaluate the locality effects of the layout on a potential computation. The savings in app comp. are due to fewer page faults, fewer TLB misses, and better spatial locality of blocks in the cache.

Cache-Aware Diffing and Diff Run Splicing

Figure 2.14 shows the effectiveness of cache-aware diffing and diff run splicing, where none means neither of the two techniques is applied; cache means only cache-aware diffing is applied; merge means only diff run splicing is applied; and cache-merge means both are applied. On average, diff run splicing by itself degrades performance by 1%, but it is effective for double array and double struct, as noted in Section 2.2.3. Cache-aware diffing by itself improves performance by only 12%. The two techniques are most effective when combined, improving performance by 20% on average.


Figure 2.13: Performance impact of version-based block layout, with 1MB total data. (Bars compare collect diff, app comp., and word diffing costs for the orig and remapped versions at several block sizes.)

Figure 2.14: Cache-aware diffing and diff run splicing. (Bars show collect and apply costs under none, cache, merge, and cache-merge for each data type.)


Isomorphic Type Descriptors

Figure 2.15 shows the potential performance improvement obtained by our IDL compiler when several adjacent fields of the same size in a structure can be merged into a single field that is an array of elements. Int struct from Figure 2.5 is used in this experiment. The x axis shows the number of adjacent fields merged by our IDL compiler. For instance, the data point for 2 on the x axis means the 32 fields of the structure are merged into 16 fields, each of which is a two-element array. For apply diff, the small increase from no merging to merging two adjacent fields is due to the overhead of handling small, 2-element arrays. This overhead also exists for collect diff, but the saving due to the simplification in finding change begin is more significant, so the total cost still goes down. The savings in collect diff and collect block are more substantial than those in apply diff and apply block. On average, merging 32 fields into a single field improves performance by 51%.

Last-Block Searches

Figure 2.16 demonstrates the effectiveness of block prediction. The setting for this experiment is the same as that in Figure 2.12, but we show the number of blocks on the x axis rather than the size of blocks. Predict 100 indicates the apply diff costs at the client or the server when block prediction is 100% correct; predict 50 indicates the costs when prediction is 50% correct; no predict indicates the costs when no prediction is implemented. As shown in the figure, the reduction in block searching costs is significant even at just 50% accuracy when there is a large number of blocks. When the number of blocks is moderate, say less than or equal to 4K, other costs dominate, and the saving due to block prediction is not significant.


Figure 2.15: Isomorphic type descriptors. (Execution time in seconds vs. number of adjacent fields merged, for collect diff, collect block, apply diff, and apply block.)

Figure 2.16: Block prediction. (Execution time in seconds vs. number of blocks, for server and client costs with no prediction, 50% correct prediction, and 100% correct prediction.)


Figure 2.17: Byte order swapping. (Execution time in seconds for collect block and apply block, with big-endian and little-endian wire formats, across the data types of Figure 2.5.)

Figure 2.17 demonstrates the benefit of choosing the right wire-format byte order. The setting for this experiment is the same as that in Figure 2.5. The byte order in parentheses is the wire-format byte order; the x86 family uses little endian. When the wire-format byte order matches the machine byte order, we save on average 17% in wire-format translation, justifying the use of little endian as the wire-format byte order when most machines use little endian.

2.3.3 Translation Costs for a Datamining Application

In an attempt to validate the microbenchmark results presented in Section 2.3.1, we have measured translation costs in a locally developed datamining application. The application performs incremental sequence mining on a remotely located database of transactions (e.g., retail purchases). Details of the application are described elsewhere [56]. Our sample database is generated by tools from IBM Research [237]. It includes 100,000 customers and 1,000 different items, with an average of 1.25 transactions per customer and a total of 5,000 item sequence patterns of average length 4. The total database size is 20MB.

In our experiments, we have a database server and a datamining client; both are InterWeave clients. The database server reads from an active, growing database and builds a summary data structure (a lattice of item sequences) to be used by mining queries. Each node in the lattice represents a potentially meaningful sequence of transactions and contains pointers to other sequences of which it is a prefix. This summary structure is shared between the database server and the mining client in a single InterWeave segment. Approximately 1/3 of the space in the local-format version of the segment is consumed by pointers.

The summary structure is initially generated using half the database. The server then repeatedly updates the structure using an additional 1% of the database each time, so the summary structure changes slowly over time. Because the summary structure is large and changes slowly, it makes sense for each client to keep a local cached copy of the structure and to update only the modified data as the database evolves. Another feature of the summary structure is that it is rich in pointers: in our experiments, pointers comprise 31% to 34% of the data space allocated in InterWeave as the database grows.

To compare InterWeave translation costs to those of RPC, we also implemented an RPC version of the application. The IDL definitions for the two versions (InterWeave and RPC) are identical. Figure 2.18 compares the translated wire format length between InterWeave and RPC. Points on the x axis indicate the percentage of the entire database that has been constructed at the time the summary data structure is transmitted between machines. The middle curve represents the RPC version of the application. The other curves represent InterWeave when sending the entire data structure (upper) or a diff from the previous version (lower).
The roughly 2X increase in space required to represent the entire segment in InterWeave stems from the use of character-string machine-independent pointers (MIPs) and from the wire-format metadata for blocks, such as block serial numbers. The blocks in this segment are quite small, ranging from 8 to 28 bytes. When transferring only diffs, however, InterWeave enjoys a roughly 2X space advantage in this application.

Figure 2.18: Wire format length of datamining segment under InterWeave and RPC. (Total size in bytes vs. percentage of total database.)

Figure 2.19 presents corresponding data for the time overhead of translation. When the whole segment needs to be translated, InterWeave takes roughly twice as long as RPC, due to the high cost of pointers. When transferring only diffs, however, the costs of the two versions are comparable. The high cost of pointer translation here does not contradict the data presented in the previous subsections. There, the pointer translation cost for RPC included the cost of translating the pointed-to data, while for InterWeave we measured only the cost of translating the pointers themselves. Here, all pointers refer to data that are internal to the summary data structure, and are translated by both versions of the application. This data structure represents the potential worst case for InterWeave.

Figure 2.19: Translation time for datamining segment under InterWeave and RPC. (Time in microseconds vs. percentage of total database.)

Though it is beyond the scope of this dissertation, we should also note that substantial additional savings are possible in InterWeave by exploiting relaxed coherence [56]. Because the summary data structure is statistical in nature, it need not stay completely consistent with the master database at every point in time. By sending updates only when the divergence exceeds a programmer-specified bound, we can decrease overhead dramatically. Comparable savings in the RPC version of the application would require new hand-written code.

2.3.4 Ease of Use

We have implemented several applications on top of InterWeave, in addition to the datamining application mentioned in Section 2.3.3. One particularly interesting example is a stellar dynamics code called Astroflow [96], developed by colleagues in the Department of Physics and Astronomy, and modified by our group to take advantage of InterWeave's ability to share data across heterogeneous platforms.


Astroflow is a computational fluid dynamics system used to study the birth and death of stars. The simulation engine is written in Fortran, and runs on a cluster of four AlphaServer 4100 5/600 nodes under the Cashmere [242] S-DSM system. As originally implemented, it dumps its results to a file, which is subsequently read by a visualization tool written in Java and running on a Pentium desktop. We used InterWeave to connect the simulator and visualization tool directly, to support online visualization and steering.

The changes required to the two existing programs were small and isolated. We wrote an IDL specification to describe the shared data structures and replaced the original file operations with accesses to shared segments. No special care is required to support multiple visualization clients. Moreover, the visualization front end can control the frequency of updates from the simulator simply by specifying a temporal bound on relaxed coherence [56].

Performance experiments [56] indicate that InterWeave imposes negligible overhead on the existing simulator. More significantly, we find the qualitative difference between file I/O and InterWeave segments to be compelling in this application. We also believe the InterWeave version to be dramatically simpler, easier to understand, and faster to write than a hypothetical version based on application-specific messaging. Our experience converting Astroflow from an offline to an online client highlighted the value of middleware that hides the details of network communication, multiple clients, and the coherence of transmitted data.

2.4 Related Work

InterWeave finds context in S-DSM systems [26, 47, 310], distributed object systems [60, 136], traditional databases [108], object-oriented databases [43, 152], persistent programming languages [170], and a wealth of other work, far too much to document fully here. We attempt to focus on the most relevant systems in the literature. Compared with these systems, InterWeave is the first, to our knowledge, to automate the type-safe sharing of structured data in its internal (in-memory) form across multiple languages and platforms. Other distinguishing features of InterWeave include its efficient integration of shared state, transactions, and remote invocation (see Chapter 3) and its exploitation of relaxed coherence models [56].

2.4.1 Software Distributed Shared Memory Systems

The pioneering work on software distributed shared memory (S-DSM) is the IVY system [146]. IVY implements a shared virtual memory on a cluster of workstations that do not physically share memory. It features a page-based sequential consistency protocol. Each shared page has a home node, and each node maintains a page table indicating its permissions for the shared pages: invalid, read-only, or read-write. Initially, the shared pages are invalid, and page faults are used to track data accesses. On a read fault, the faulting node requests read permission from the owner, obtains a copy of the page, and sets its permission to read-only. On a write fault, the faulting node requests write permission from the owner, which sends back the page and a list of current sharers. The faulting node then notifies the sharers to invalidate their copies of the page.

IVY's consistency protocol can cause serious performance problems in the face of false sharing, in which non-shared data accessed by two nodes happen to reside on a single page. In the worst case, IVY must move the page back and forth between the two nodes. To alleviate the false sharing problem, many later S-DSM systems, including Munin [47], TreadMarks [5], and Cashmere [242], have used (machine-specific) diffs to propagate updates, but only on homogeneous platforms.

Munin [47] improves on IVY by supporting multiple writers and by allowing variables to be shared under different coherence models. In contrast with IVY's sequential consistency, Munin adopts release consistency, which exploits application-level synchronization operations to relax the ordering constraints on memory accesses. Programs are annotated with synchronization operations: acquire and release. Intuitively, an acquire gains permission to access the shared data, whereas a release relinquishes the previously acquired permission. These two operations are usually paired and implemented with reader/writer locks. Locks and barriers are implemented using a distributed queue-based protocol. The propagation of modifications made by a node is delayed until that node's next release operation.

Munin assumes that applications follow the data-race-free-1 programming model [4] and allows multiple well-synchronized nodes to write to the same page concurrently. On a write fault, the node creates a pristine copy of the page, called a twin. At the next release, the node compares the twin with the page's current contents to create a diff that captures the modifications made by this node. It then sends the diff to the owner of the page, which incorporates the modifications into its master copy. Modifications from concurrent writers can therefore be safely merged. Munin also allows the use of multiple coherence models: in the program source code, the declaration of a shared variable is annotated with its expected access pattern, which the system then maps into a combination of underlying protocol parameters. The premise is that performance improves when the access pattern of a variable matches its coherence model.

TreadMarks [5] uses lazy release consistency to further reduce the cost of the coherence protocol. Lazy release consistency (LRC) improves upon release consistency (RC) by considering actual causality [3] among accesses. In RC, modifications are propagated at every release; LRC incurs no memory operations at release time. At an acquire, a node obtains and applies only the necessary modifications according to Lamport's happens-before relationship [141].
TreadMarks divides the execution of a program into intervals, delimited by synchronization operations. The causality between intervals on different nodes is tracked by a distributed vector-timestamp protocol. Each node creates diffs on demand and keeps them to satisfy later requests. A distributed garbage collection protocol reclaims the memory for diffs when memory becomes scarce.

Cashmere [209] is an S-DSM system running on a cluster of homogeneous SMP nodes. Its goal is to take advantage of emerging low-latency remote-write networks, using hardware shared memory for sharing within an SMP and incurring software overhead only when actively sharing data across SMPs. It implements a two-level, multiple-writer, moderately lazy release consistency protocol. On an SMP node, when a process reduces access permissions on a page in a shared address space, it must generally interrupt the execution of any processes executing in that address space on other processors, in order to force them to flush their TLBs and update their page tables. This is known as TLB shootdown. Cashmere eliminates TLB shootdown through the use of two-way diffing, a technique that reconciles intra-node hardware cache coherence with inter-node software coherence. InterWeave extends Cashmere's two-level protocol to a third level, allowing programs to share data across heterogeneous machines [56]. When caching a segment on a client, InterWeave lays out blocks according to past data access history to improve cache locality.

The scheme described by Freeh and Andrews [99] reduces false sharing and improves locality in an S-DSM system by dynamically moving data modified by a process into its own space. It introduces an extra level of indirection to allow this data relocation, and requires all nodes in the system to remap data to the same address after a single node relocates its shared data.

In Midway [304], Zekauskas et al. [305] quantified the cost of two write detection mechanisms: page faults through the virtual memory (VM) system, and compiler-generated write checks. They found that write checks have low average write latency and support fine-grain sharing with low overhead.
Moreover, the dominant cost of write detection with either strategy is due to the mechanism used to handle fine-grain sharing.


Dwarkadas et al. [83] examined the performance tradeoffs between fine-grain and coarse-grain S-DSM in the context of two state-of-the-art S-DSM systems: Shasta [226] and Cashmere. They found that the fine-grain, instrumentation-based approach to S-DSM offers a higher degree of robustness and superior performance in the presence of fine-grain synchronization, while the coarse-grain, VM-based approach offers higher performance when coarse-grain synchronization is used.

Brecht and Sandhu [32] described the design and implementation of a library (RTL) that provides, at the granularity of application-defined regions, the same set of services that are commonly available at page granularity through VM primitives. Each physical region is mapped into multiple virtual addresses with different protections: invalid, read-only, or read-write. On an invocation of region_protect(), a counterpart of mprotect(), all known pointers, including those in registers, are swizzled to the virtual address with the proper permission. To make RTL aware of a pointer, a C program must invoke region_ptr() to register the pointer with RTL. Like InterWeave, RTL uses AVL trees to support efficient search for the region spanning a faulting address. As a case study, they described and evaluated how to run TreadMarks over RTL to share data at a finer granularity.

Active Harmony [114] is an S-DSM system focused on dynamic reconfiguration to efficiently execute parallel applications in large-scale dynamic environments. It provides a mechanism for applications to specify tuning options and resource requirements. The adaptation controller gathers information about the environment and projects the effect and benefit of possible reconfigurations. Based on the critical path of a parallel computation, a process and procedure load balance factor (LBF) predicts the impact of changing the assignment of procedures or processes to processors. Active Harmony also tracks threads with similar page access patterns during the initial computation phase and relocates threads that access the same data set so that they execute together.


2.4.2 Heterogeneous S-DSM

Toronto’s Mermaid system [285, 310] allows objects to be shared across more than one type of machine, but requires that all data in the same VM page be of the same type and that objects be of the same size on all machines, with the same byte offset for every subcomponent. Shared memory can start at different addresses on different machines. With the strong restrictions, pointer conversion for Mermaid is reduced to simply adding an offset to the starting addresses of the shared memory. Mermaid has no intermediate wire format and only two types of machines were supported. Data are translated directly from one machine to another. To support more platforms, it would require implementing n2 translation routines among n platforms, which could pose a serious challenge for today’s wildly heterogeneous Internet. CMU’s Agora system [25, 26] supports sharing among more loosely-coupled processes, but in a significantly more restricted fashion than in InterWeave. Pointers and recursive types are not supported, and all shared data have to be accessed indirectly through a local mapping table. Once written, an individual element cannot be changed. Updates are performed by adding a new element to the shared memory and updating the mapping table to point to the new element. Only a single memory model (similar to processor consistency) is supported. Agora uses events and activation to synchronize and exchange modifications. Modifications are always sent back to the master copy.

2.4.3 Hybrid Message Passing and S-DSM Systems

Smart RPC [134] is an extension to conventional RPC that passes arguments by reference instead of by value (deep copy). When invoking a remote procedure, the caller converts local pointers into machine-independent long pointers and sends them to the callee. The callee converts the long pointers back into local pointers and reserves the space they point to. S-DSM techniques are used to fetch data on demand when the reserved space is actually accessed. Because Smart RPC lacks a shared global name space with a well-defined cache coherence model, modified data are always written back to the caller after each RPC session. Unfortunately, this may significantly slow down the critical path of a chain of remote invocations. Also due to the lack of a shared store, it invalidates caches after each RPC session, prohibiting cache reuse. Overall, passing arguments on demand in Smart RPC seems to be a quick fix to improve RPC's performance rather than a complete solution that enriches RPC's programming model. Transactions are not supported in Smart RPC.

Stardust [36, 37] supports both message passing and shared-memory programming on a cluster of heterogeneous machines, but its support for shared memory is limited. The programmer must manually supply type descriptors for shared data; recursive data structures and pointers are not supported. Stardust is a page-based system, always converting and transmitting a whole heterogeneous page of data, where a heterogeneous page is a multiple of the page sizes of all architectures in the system.

The Rthread [79] system is capable of executing pthread programs on a cluster of heterogeneous machines. In keeping with the pthread programming model, it assumes that all global variables are shared. It enforces a shared object model, in which remote data can only be accessed through read()/write() primitives. Rthread requires the programmer to manually replace pthread calls with Rthread calls, to add primitives to access shared data, and to insert synchronization primitives. Pointers are not supported in shared objects. A precompiler automatically extracts type information for global variables, which is then used at runtime to guide data conversion.
Addressing the limitations of existing argument passing methods in RPC (either call-by-reference or call-by-value), Silva et al. [69] proposed call-by-substitution, where an argument is substituted by a datum that already exists at the callee. The programmer must specify which data are substitutable and how to substitute them. Unlike caching, which is general, call-by-substitution applies only when arguments happen to be substitutable.

2.4.4 Distributed Object Systems

Dozens of object-based systems attempt to provide a uniform programming model for distributed applications. Many are language-specific (e.g., Argus [150] and Arjuna [189]); many of the more recent ones are based on Java. Language-independent distributed object systems include Legion [109], Globe [269], Microsoft's DCOM [213], and various CORBA-compliant systems [181]. Globe replicates objects for availability and fault tolerance. A few CORBA systems (e.g., ScaFDOCS [136] and CASCADE [60]) cache objects for locality of reference. Unfortunately, object-oriented update propagation, typically supported either by invalidating and re-sending on access or by RMI-style mechanisms, tends to be inefficient (re-sending a large object or a log of operations). Equally significant from our point of view, there are important applications (e.g., compute-intensive parallel applications) that do not employ an object-oriented programming style.

PerDiS [89] is a persistent distributed object system featuring object caching, transactions, security, and distributed garbage collection. Among existing systems it is perhaps the closest to InterWeave. It also uses URLs for object naming and has sharing units (clusters and objects) equivalent to InterWeave's segments and blocks. PerDiS, however, has no built-in support for heterogeneous platforms or relaxed coherence models. Unlike InterWeave, it does not allow remote procedure calls to be protected as part of a transaction (see Chapter 3).

The PerDiS architecture consists of four layers: distributed file system, object support, access method, and language support. Clusters and objects are persistently stored in the distributed file system, and each cluster has a designated home site. The object support layer performs pointer swizzling and garbage collection. Each cluster has a root with a named URL; objects unreachable from the root are garbage collected. PerDiS allows inter-cluster pointers. Each cluster maintains a list of object references exported to other clusters, as well as the clusters that reference them; this information facilitates distributed garbage collection. Each cluster also has an access control list for security checks. The access method layer uses S-DSM techniques to cache and map clusters. To maintain coherence, either the program explicitly calls the hold method of objects before making modifications (for PerDiS-aware applications), or the runtime automatically protects all pages and holds objects at the time of page faults (for legacy applications). PerDiS allows a hold on an object, a cluster, or an arbitrary contiguous address range. PerDiS supports pessimistic and optimistic transactions. The former takes a reader or writer lock as soon as the application issues a hold, consequently blocking any conflicting transaction. The latter reads pages and records their version numbers when the application issues a hold; at commit time, if the recorded version numbers differ from the current version numbers, the transaction aborts. The language support layer provides language-specific runtime services such as allocation of typed objects and type checking.

Emerald [124] is an object-based language and system for distributed programming, featuring fine-grained object mobility, efficient local execution, and a single object model for both data and threads. When a thread invokes a method provided by a remote object, the thread is migrated with its activation records to the node where the object resides. Emerald has three kinds of objects: global objects, local objects, and direct objects (primitive data). Global objects are accessed through one level of indirection, and Emerald uses forwarding addresses or broadcasts to locate them. The compiler plays a special role in Emerald. Registers are partitioned into data and address registers on a per-invocation basis, in order to simplify the task of identifying pointers in registers. To migrate an object, pointers inside the object are translated under the guidance of compiler-generated templates (type descriptors). The compiler also generates templates for activation records, in order to identify pointers in activation records when migrating threads. Emerald allows the programmer to specify the parameter passing mechanism for each method: call-by-remote-reference, call-by-move (move arguments to the remote node), and call-by-visit (move arguments to the remote node and return them to the caller afterwards). The compiler can generate different code depending on the context.

Amber [51] is a distributed object system based on the C++ language. Objects in Amber are passive entities consisting of private data and public methods. Thread objects are the real active entities in the system; they encapsulate execution state and can be scheduled on processors. A thread invoking a method of an object is migrated to the node where the object resides. Object migration is supported by arranging the global memory at the same virtual address on all nodes. As a result, pointer swizzling is not needed, but this also limits Amber to homogeneous machines. Amber has no coherence problem, since it does not support object replication or caching. After an object migration, Amber uses local object descriptors to track the location of the object and forwards method invocations when necessary. Object placement is explicitly controlled by the programmer. One feature that distinguishes Amber from its predecessors is its ability to execute multiple threads on a single shared-memory multiprocessor node.

Argus [150] is a language-based distributed object system similar in spirit to later popular distributed object systems such as CORBA. Objects (so-called guardians) encapsulate internal state and expose methods (so-called handlers) to be invoked by other objects. Argus was designed particularly with fault tolerance in mind: part of a guardian's state is persistent and is saved to stable storage periodically.
Transaction support at the language level is another important feature of Argus: each invocation of an object's method starts a subtransaction. Argus passes method arguments by value.


VDOM [88] distinguishes itself from other distributed object systems by its explicit version control over objects. Its coherence model is built around the concept of multi-version immutable objects: an update to an object logically creates a new version of the object. Version numbers are exposed to the application and can be used as a synchronization mechanism. Release consistency primitives (AcquireRead, AcquireWrite, ReleaseRead, and ReleaseWrite) can explicitly specify which version to access. When a new version of an object is created on a node, the new version is sent to all other nodes that currently have a cached copy of the object; it is therefore possible to have multiple versions of an object on a single node. As a new version is being created, threads using the old versions need not be interrupted. It is the programmer's responsibility to guarantee that no two writers create the same version of an object at the same time. The coherence unit in VDOM is the fragment object, from which language-level objects can be constructed. The purpose of fragment objects is to alleviate false sharing; it is the programmer's responsibility to properly decompose an object into fragment objects.

SAM [227] is a distributed object system that supports only two types of shared data: "accumulators" (data with a migratory sharing pattern that can only be accessed exclusively) and "values" (data with a producer-consumer sharing pattern and single-assignment semantics). Like VDOM, each update to a "value" object creates a new version of the object. Objects are typed, but multiple pointers to the same data are not allowed within objects.

DOSA [118] is a shared object system that employs a handle-based implementation to address the false sharing problem. The handle table of shared objects provides one level of indirection. Each physical page is mapped to three VM pages with different access permissions: invalid, read-only, and read-write.
On a read or write page fault, the corresponding entry in the handle table is set to the VM page with the proper permission. In doing so, DOSA allows different access permissions for objects in the same physical page. Performance results show that the one level of indirection introduces less than 5.2% overhead. DOSA only works for programs written in safe languages, or for programs that use weakly typed languages in a safe way, since DOSA does not allow programs to use cached entries of the handle table without the runtime system's knowledge.

ScaFDOCS [136] is an object caching framework built on top of CORBA. As in Java RMI, a shared object is derived from a base class, and its writeToString and readFromString methods implement the serialization and deserialization of the object's internal state. Programmers are allowed to override these methods to implement efficient class-specific serialization. Multiple consistency models, including causal consistency, are supported.

CASCADE [60] provides hierarchical object caching services for CORBA. To comply with the CORBA specification, the caching service itself is structured as a CORBA object running on a node close to the client, rather than being in the client's address space. It therefore still needs cross-domain operations to access objects and by no means provides a shared-memory programming model. The designers of both CASCADE and ScaFDOCS reported problems with CORBA's reference model, in which every access through a CORBA reference is an expensive cross-domain call and small changes to large objects still require large messages.

2.4.5 Language-based Systems

Weems et al. [279] conducted a survey of languages supporting heterogeneous parallel processing. Among them, Delirium [162] is the only one that adopts a shared-memory programming model, but it requires the programmer to list explicitly all variables that each routine might destructively modify.

Papadopoulos et al. [186] proposed viewing a distributed application as a computation part and a coordination part, the latter responsible for communication and coordination between the computation parts. They conducted a survey of coordination models and languages for distributed applications, among which Linda [101] is one prominent example. All of those systems employ new programming models and languages rather than sticking to the shared-memory programming model and existing languages.

With Linda [101], applications share state through a tuple space that is accessed through two basic operations: out and in. The out operation writes a tuple into the tuple space, and the in operation takes a tuple out of the tuple space. Synchronization is implicit, through blocking in operations. Systems that enhance Linda in one way or another include JavaSpaces [248], Lime [194], Scientific [120], and Jini [249].

Mneme [170] combines programming language and database features into a persistent programming language. Its goal is to provide better support for programs that use complex and highly structured data, e.g., computer-aided engineering. As in InterWeave, persistent data are stored on the server and can be cached at the clients. Objects in Mneme, however, are untyped byte streams; references inside objects are identified by routines supplied by the programmer rather than automatically by the runtime.

The C&CO system [90] extends the C language with communication variables to support persistent and concurrent data sharing in workflow systems. A communication variable is assigned only once; this immutable semantics greatly simplifies the coherence protocol, since the value of a communication variable never changes.

2.4.6

Java-based Systems

Java RMI [250] supports the use of object references across machines, but it has some limitations. In Java RMI, parameters and results can be passed either by reference or by deep copy. With call-by-reference, a proxy object is created on the
remote machine to forward method invocations. There are important cases for which both parameter-passing mechanisms have serious performance problems. Previous research in DSM systems for Java has focused mainly on providing a shared object system among distributed Java nodes. Some projects build on existing shared-memory systems. Java/DSM [299] and JESSICA [164] both modify the JVM to use TreadMarks [5] as a distributed memory heap for Java objects. The Java/DSM [299] system intends to hide both hardware heterogeneity and the distributed nature (message passing) of distributed applications by running a modified JVM on top of TreadMarks [5]. Java/DSM locates Java's heap in the global shared memory. Loaded classes are put in the shared memory and are hence accessible to all machines. Running a JVM in a heterogeneous environment involves data conversion, which is guided by type descriptors associated with each object. To make the search for type descriptors efficient, a page is only allowed to contain objects of the same size. It is unclear from this work how to run TreadMarks in a heterogeneous environment in which machines have different page sizes, different virtual address spaces, etc. Thread location is not transparent to the programmer and thread migration is not supported. The garbage collector is implemented in a distributed fashion. Each JVM records objects imported from remote sites and local objects exported to other sites. Both exportation and importation are implemented by up-calls from TreadMarks to the JVMs. The JVMs notify each other when references to remote objects are no longer needed. Using this information, the garbage collectors running on different machines can work independently without querying each other. Hyperion [7] is a shared-object system implemented in Java. In Hyperion, Java bytecode is compiled into native code and then linked with a runtime library that supports multithreaded execution of Java threads.
Hyperion implements the Java Memory Model (a variant of release consistency), which allows distributed threads to cache objects locally. Coherence is maintained by requiring local modifications to be transmitted to the central memory when a thread exits a monitor. A thread is guaranteed to access up-to-date objects since its cache is invalidated when it enters a monitor. JavaSpaces [248] is a Java variant of Linda [101]. It provides applications with a shared tuple space that supports insertion of objects into the tuple space, reading or removal of objects from the tuple space, matching of templates, event notification, and transactions. It is suitable for distributed applications following the "flow of objects" model, i.e., computation is done by data shipping rather than function shipping. There are several differences between JavaSpaces and Linda. In JavaSpaces, entries (tuples) themselves, in addition to their fields, are typed. For instance, (string, double, double) could be the representation of either a named point or a named double vector. In JavaSpaces, they have different classes and their templates do not match. Entries in JavaSpaces are associated with a lifetime in order to automatically get rid of debris left by network or client failures. Krishnaswamy and Haumacher [137] described a fast implementation of Java RMI that is capable of caching objects to avoid redundant serialization and retransmission. Philippsen and Haumacher [193] suggested that equipping classes with their own marshaling and unmarshaling routines can greatly improve performance, but their routines are written manually. In InterWeave, wire-format translation is done automatically by the runtime, achieving good performance through many aggressive optimizations.
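The typed-entry matching rule that distinguishes JavaSpaces from Linda can be sketched as follows. This is an illustrative C rendering under our own assumptions, not the actual JavaSpaces API: entries carry an explicit type tag, and a template field of NULL acts as a wildcard.

```c
#include <string.h>

/* Entries themselves are typed: a named point and a named double vector
 * may share the field layout (string, double, double), yet have
 * different classes and therefore never match each other's templates. */
typedef enum { NAMED_POINT, NAMED_VECTOR } entry_type_t;

typedef struct {
    entry_type_t type;   /* the entry's class participates in matching */
    const char *name;    /* NULL in a template acts as a wildcard */
    double a, b;
} entry_t;

int entry_matches(const entry_t *tmpl, const entry_t *e)
{
    if (tmpl->type != e->type)
        return 0;        /* same fields, different class: no match */
    if (tmpl->name != NULL && strcmp(tmpl->name, e->name) != 0)
        return 0;
    return 1;
}
```

In Linda, by contrast, matching considers only the field types and values, so the two representations above would be indistinguishable.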

2.4.7

Database Systems

In the past two decades, object-oriented databases (OODBs) have been proposed to address the impedance mismatch that arises at the boundary where a programming language's type system meets a database's relational type system. This problem is especially serious for engineering applications that manipulate large amounts of complex data. Atkinson et al. [8] argued that an OODB must
satisfy two criteria: it should be a database management system (DBMS), and it should be an object-oriented (OO) system. The first criterion requires the system to possess five features: persistence, secondary storage management, concurrency, recovery, and a query facility. The second requires the inclusion of several OO features: complex objects, object identity, encapsulation, types or classes, inheritance, late binding, and computational completeness. Carey and DeWitt [41] conducted a survey of work on the connection between objects and databases. They classified these systems into four categories. The extended relational database approach started with relational databases and incorporated flexible, user-defined abstract data types (ADTs). The persistent programming language approach took an object-oriented (OO) programming language and added features to make its data persistent and its program executions atomic. The more radical object-oriented database approach advocated the integration of all important features of databases and OO languages. What distinguishes it from the persistent language approach is its focus on indexing, navigation, query processing, and version control. The last approach, database toolkits, provided a set of basic toolkits to allow quick construction of domain-specific databases. As of 1996, extended relational databases had evolved into object-relational systems and become the dominant and most successful approach. Unlike OODBs, object-relational systems started with the relational model and its query language, and then introduced OO features. Among these approaches, InterWeave is closest to the persistent language approach. EXODUS [42] is a prominent example of the database toolkits approach. It provides certain kernel database facilities to help the fast development of high-performance, application-specific database systems. Storage objects in EXODUS are untyped arrays of bytes and no type information is stored in the database.
Correct interpretation of an object's content is the programmer's responsibility. As pointed out in [41], EXODUS still requires too much expertise from
the programmer to build a real database system. SHORE [43] is a follow-up project to EXODUS, featuring support for both OODB and file system interfaces and a peer-to-peer server architecture. SHORE objects are strongly typed. Each object contains a pointer to a type object that defines the structure and interface of the object. Objects in the database are managed by peer-to-peer servers. A client connects to its local server. A client and its local server run in different address spaces for the sake of security and communicate through RPC. From a client's point of view, it accesses objects on its local server and those on remote servers in the same way. The local server transparently acts as a proxy to bring in remote objects and cache them. There are two levels of caching in SHORE: a page cache in the server and an object cache in the client. The client cache is invalidated after each transaction, i.e., it cannot be reused across transaction boundaries. Each object has an object ID (OID). All OIDs in objects cached on the client side are swizzled to pointers to entries in an object table. This one level of indirection allows objects to be removed from memory before the transaction commits, without the need to track down and unswizzle all pointers to them. Object attributes can be read directly, but special methods are required to modify them. For instance, p.update()->x = VAL implements the semantics of p->x = VAL, where update() is a method that tells the runtime that p is modified and needs to be sent back to the server when the transaction commits. Thor [151, 152] is an OODB system featuring safe sharing of objects and dynamic client cache management. It allows objects to be cached at client front ends. Objects in the database are implemented in Theta, a type-safe OO language. Applications written in other languages invoke object methods through stubs.
An unsafe application is executed in a separate protection domain, and arguments are checked when it invokes object methods. Three techniques were proposed to reduce the overhead of safety checking: batching method invocations, transferring
code into the database, and sandboxing (i.e., the application is only allowed to access certain data). Each object is identified by a 128-bit global identifier, which is swizzled into a local address once the object is cached. A client prefetches data from the server and adjusts its prefetching size according to the usage of previously prefetched data. Thor neither addressed the heterogeneity problem nor attempted to provide a shared-memory programming model. Franklin et al. [98] gave a detailed classification and comparison of schemes for maintaining transactional client-server cache coherence. Compared to those schemes, InterWeave is unique in its integration of RPC, transactions, and shared state, and in its use of a transaction metadata table for efficient management of cache coherence without the excessive overhead of passing data through the database server (see Chapter 3). Zhou and Goscinski [311] presented a detailed realization of an RPC transaction model [108] that integrates data replication and transaction management. In this model, each replica of a database server provides a set of remote procedures to be called by clients to process data locally managed by the replica. InterWeave supports transactional RPC between arbitrary clients and maintains coherence efficiently among dynamically created caches. LOTEC [106] employs a two-phase locking scheme to support nested transactions in a distributed object system. Each invocation of a shared object's methods automatically starts a sub-transaction. LOTEC assumes that methods of a shared object only access shared data in the given object. The compiler can therefore automatically insert synchronization operations at the entry points of methods to conservatively update the data that might be accessed by the methods.
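The dirty-flag idiom behind SHORE's p.update()->x = VAL, discussed above, can be sketched in a few lines of C. The obj_t layout, the dirty field, and the update() function here are illustrative assumptions of ours, not SHORE's actual interface.

```c
/* Sketch of a SHORE-style update() wrapper: writes go through update(),
 * which marks the object dirty before handing back a writable pointer,
 * so a plain field assignment doubles as a write-back notification. */
typedef struct obj {
    int dirty;   /* set when the object must be shipped to the server
                    at transaction commit */
    int x;
} obj_t;

obj_t *update(obj_t *p)
{
    p->dirty = 1;   /* the runtime now knows p was modified */
    return p;
}
```

Reads remain direct (v = p->x), while a write is phrased as update(p)->x = VAL; at commit time the runtime ships back exactly the objects whose dirty flags are set.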


2.4.8

Infrastructure for Large-scale Distributed Computing

Khazana [46] is an infrastructure for building distributed services. It exports a global shared store that can be accessed either by read()/write() primitives or by mapping the global store into local virtual memory. Data in the global memory are untyped byte streams without any structure or semantics. The global memory is partitioned into regions. Each region has a home node that stores the region's metadata and keeps track of the nodes that cache a copy of the region. A global address map keeps track of the status of each region (e.g., reserved or free) and its home node. Nodes are grouped into clusters, each represented by a manager node. Clusters further form a hierarchy through the manager nodes. Each node locally runs a consistency manager (CM) that communicates with others to maintain coherence of the global memory. To acquire a lock on shared data, permission must be granted by the CM. Internally, the CM can choose to implement many different coherence models. The Legion [109] system is designed as a metacomputer that may consist of up to millions of computers and host up to trillions of objects, yet appears to users as a single virtual computer. Legion is structured around objects. Both processors and storage resources are represented as objects. Managing processes amounts to the manipulation of process objects: creation, destruction, and querying status. Legion adopts a three-level naming scheme: location-dependent object address (IP:port), location-independent and immutable Legion Object Identifier (LOID), and Unix-like path name. Replicated binding services allow transparent translation among these names. Legion's objects are persistent. Together with the naming scheme, this subsumes the functionality of a traditional distributed file system.
Inter-process communication is based on macro-dataflow, an asynchronous remote invocation protocol that allows multiple outstanding invocations from a single client as well as overlap of remote invocations and local computation. Legion supports optional data privacy (through RSA) and integrity (through signatures) within the message passing layer. The public key of an object is part of its LOID. Users are also represented as objects. Outgoing method invocations by a user are signed with the user's private key. Class managers are special objects responsible for the management of a set of other objects. Class managers can be organized into a hierarchy for scalability. Like Legion, Globe [269] also intends to provide a uniform model for distributed computing, allowing flexible implementations of the model and targeting worldwide scalability. Globe's computing model is also based on distributed objects, but with its own features: (1) the components of a single object can be physically distributed; (2) each object fully encapsulates its own policies for replication, migration, and so forth. To achieve world-wide scalability, Globe provides extensive support for object partitioning and replication, which is what DCOM and CORBA lack. A distributed object is built from local objects that reside in different address spaces but bind with each other through communication. A distributed object is composed of the following types of sub-objects: the semantic sub-object implements the functionality of the object; the communication sub-object encapsulates the communication; the replication sub-object implements a specific replication policy; and the control sub-object handles the control flow within local objects. All invocations on distributed objects, local or remote, go through the control sub-object, which directs the invocations through the semantic sub-object and replication sub-object accordingly. The replication sub-object controls the invocation and synchronization of remote objects.
Before using an object, a thread must first bind to that object (object lookup goes through a naming service), resulting in an interface belonging to the object being placed in the client's address space, along with an implementation of the interface. This local interface can either fully implement the object's functionality or just act as a stub that forwards requests to the remote parts of the object. In contrast, Legion always
invokes methods remotely. Globus [91] provides metacomputing applications with a set of low-level toolkits for communication, resource location, resource allocation, information, and authentication. These toolkits can be used to implement a variety of higher-level services. Globus provides rule-based selection, resource property inquiry, and notification to support resource-aware applications. Communication links are formed by binding startpoints and endpoints, supporting both interleaving and multicast. The Metacomputing Directory Service is an extension of LDAP. The Generic Security System can be layered on top of different methods such as Kerberos and SSL. The Data Access Service provides a remote interface to parallel file systems. On top of Globus, many parallel programming interfaces such as MPI can be built. Javelin [40] is a distributed computing architecture based on Web technology that allows users to contribute free cycles on their machines. There are three kinds of entities in Javelin. A client is a process seeking computing resources. A host is a process offering computing resources. A broker is a process that coordinates clients and hosts. Clients and hosts are Web browsers capable of executing Java applets. Javelin works as follows. A client in search of computing resources uploads its applet to the broker by visiting a special URL. The broker stores the applet and puts it in a job queue. A host that is willing to do some computation for others visits a special URL at the broker. It then downloads an applet, executes it, and sends back the result. Later, the client contacts the broker to get the result. The major challenge in Javelin is to support proper models for communication and storage, since applets are not allowed to access the local file system or contact hosts other than the broker from which they are downloaded.
As a result, all data needed by applets must be stored on the broker, and communication between an applet and the client submitting it, or among cooperating applets, must go through the broker. It is possible to build various programming models on top of Javelin, including SPMD and Linda. However, for performance reasons, Javelin is only suitable for coarse-grain distributed applications. WebOS [267] provides a set of services built on top of existing OSs to support wide-area applications: resource discovery, persistent storage, remote process execution, resource management, authentication, and security. Smart Clients, Java applets running on the clients, are responsible for name translation, load balancing, and fault tolerance. Servers piggyback state information such as load onto clients in extended fields of the HTTP header. A smart proxy can be introduced between clients and servers to support clients that do not speak WebOS protocols and also to improve performance. With WebFS, clients address remote persistent files through URLs. WebFS caches files in the kernel in a way similar to NFS. Each node running WebOS has a resource manager responsible for remote process control. The resource manager verifies remote requests to start a process and uses virtual machines to prevent interference among processes. Each node also runs a security manager to verify the identity and capabilities of requests. As a case study, the WebOS work presents in detail the "Rent-A-Server" application running on top of WebOS. The Rent-A-Server architecture consists of an expandable group of HTTP servers accessed by browsers running Smart Clients. When servers are overloaded, a new server is started and a mirror of the original site is created through WebFS. A Smart Client balances load dynamically by sending requests to different servers. Ninja [271] is a framework for building high performance network services (e.g., web or mail servers). It only supports two special persistent shared data structures, namely hash tables and B-trees. Shared state is replicated and a two-phase commit protocol is used to ensure serializability and atomicity of updates.


2.4.9

Thread and Process Migration among Heterogeneous Machines

Steensgaard and Jul [240] extended the Emerald system [124] to support object and thread migration among heterogeneous computers. The compiler generates object executables for each target architecture. When an object is moved, the object's internal data as well as the activation records of threads currently running in the object's methods must be converted and moved. The type information (templates) that the Emerald compiler already generates is augmented with information about the mapping between registers and slots in activation records, as well as information about the number and types of temporary variables in use at certain points of the program. With this information, the system can convert objects and thread contexts between architectures at runtime. One special challenge is converting the program counter across architectures when moving a thread. Since an execution point in the code on one machine does not necessarily have a corresponding execution point in the code on another machine, threads are only allowed to migrate at certain well-defined points (so-called bus stops) in the program, e.g., system calls, which are guaranteed to have corresponding execution points on all architectures. Smith and Hutchinson [235] argued that the main limitation of the Emerald system is its dependence on the unpopular Emerald language. They therefore proposed the Tui system, which can migrate, among heterogeneous machines, programs written in more general languages such as C and Fortran. A custom compiler provides type information about variables, which is used to guide data marshaling and pointer swizzling. In the intermediate image file of a process dumped before a migration, a pointer is represented by a tuple (variable serial number, offset), similar to the representation in InterWeave. During a migration, pointers in global and local variables are followed recursively to convert data in the heap.
It is sometimes impossible for Tui to work out a correct migration for programs written in
weakly typed languages such as C. The dependence on custom compilers may also limit the portability of the Tui system.
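Translating a (variable serial number, offset) tuple into a native pointer, as Tui and InterWeave both do, can be sketched as follows. The table layout and the register_var/swizzle names are our own illustrative assumptions, not either system's actual API.

```c
#include <stddef.h>

/* A machine-independent pointer: which top-level variable it refers
 * into, and the byte offset within that variable. */
typedef struct {
    int  serial;
    long offset;
} wire_ptr_t;

#define MAX_VARS 128
static char *local_base[MAX_VARS];   /* serial number -> local address */

/* Record the local address of the variable with the given serial number. */
void register_var(int serial, void *addr)
{
    if (serial >= 0 && serial < MAX_VARS)
        local_base[serial] = (char *) addr;
}

/* Swizzle: translate a wire pointer into a native pointer.  On another
 * machine the same (serial, offset) pair resolves to that machine's
 * local copy, which is what makes the representation portable. */
void *swizzle(wire_ptr_t wp)
{
    if (wp.serial < 0 || wp.serial >= MAX_VARS ||
        local_base[wp.serial] == NULL)
        return NULL;                 /* unknown variable: cannot swizzle */
    return local_base[wp.serial] + wp.offset;
}
```

The offset is in bytes rather than array elements, so the representation is independent of the source machine's word size, which is the property that makes migration between heterogeneous machines possible.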

2.4.10

Miscellaneous

Interface description languages date from Xerox Courier [65] and related systems of the early 1980s. Precedents for the automatic management of pointers include Herlihy's thesis work [113], LOOM [125], and the more recent "pickling" (serialization) of Java [207]. In [283], Wilson described a pointer swizzling scheme that translates persistent, huge pointers to transient, local pointers at the time of page faults. On program startup, all pages except those containing entry pointers are protected, and pages referred to by entry pointers are reserved but unmapped. When an access to an unmapped page occurs, the content of the page is brought in from persistent storage and pointers in the page are translated. Consequently, virtual addresses for pages referred to by newly translated pointers are reserved but unmapped. Relocation of pages thus occurs in a sort of "wave front" just ahead of the running program. This scheme requires identifying pointers in pages, which can be done with either compiler support or conservative pointer pattern matching techniques similar to those used in garbage collection. LBFS [174] is a low-bandwidth network file system that saves bandwidth by exploiting commonality between files; InterWeave saves bandwidth by exploiting commonality among versions of a segment. LBFS breaks files into chunks and indexes the chunks by their hash values. Transmission of duplicate chunks is avoided through chunk reuse. Douglis and Iyengar [78] proposed to detect object resemblance and use delta encoding to save communication. Stampede [201] supports interactive multimedia applications running on a cluster of SMPs by providing them with buffer management, inter-task synchronization, and soft real-time support. Shared data are organized into a two-dimensional Space-Time Table indexed by channel and logical timestamp. A channel can be a physical data source or just a logical sequence of data. Threads connect to channels and then get or put data from the channels. Stampede supports garbage collection of data that are no longer reachable. Reachability is defined in terms of temporal timestamps rather than references. Threads are required to explicitly advance their virtual time and announce when they are done using data. Based on this information, Stampede continuously computes the minimum global time and frees data whose timestamps are older than that minimum. A mechanism is provided for a thread to synchronize with real time, during which the thread may block until the synchronization is met; the system raises an exception if the thread has already lagged behind real time. Unlike InterWeave, Stampede does not support shared-memory programming. Instead, it just provides a mechanism that allows threads to access multimedia data without explicitly giving their source or location. In a broader view, caching [203, 262] and replication [298, 297] have been widely used in all kinds of distributed systems, but none of these systems intends to provide a shared-memory programming model. InterWeave can benefit from this past work to improve fault tolerance, availability, scalability, and security.
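Stampede's timestamp-based garbage collection reduces to a simple rule: an item is reclaimable once its timestamp falls behind the minimum virtual time over all threads. The sketch below illustrates just that rule; the function names and the use of plain integer timestamps are our own simplifications, not Stampede's interface.

```c
/* Minimum virtual time across all threads: no thread can still need an
 * item older than this. */
int min_virtual_time(const int *vt, int nthreads)
{
    int min = vt[0];
    for (int i = 1; i < nthreads; i++)
        if (vt[i] < min)
            min = vt[i];
    return min;
}

/* An item is garbage iff its timestamp is strictly older than the
 * minimum virtual time -- reachability by time, not by reference. */
int collectible(int item_ts, const int *vt, int nthreads)
{
    return item_ts < min_virtual_time(vt, nthreads);
}
```

Because each thread only ever advances its own virtual time, the minimum is monotonically nondecreasing, so an item found collectible stays collectible and can be freed immediately.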

2.5

Conclusions

We have described the design and implementation of a middleware system, InterWeave, that allows processes to access shared data transparently and efficiently across heterogeneous machine types and languages using ordinary reads and writes. InterWeave is, to the best of our knowledge, the first system to fully support the type-safe sharing of structured data in its internal (in-memory) form across multiple languages and platforms. The twin goals of convenience and
efficiency are achieved through the use of a wire format, and accompanying algorithms and metadata, rich enough to capture machine- and language-independent diffs of complex data structures, including pointers and recursive data types. InterWeave is compatible with existing RPC and RMI systems, for which it provides a global name space in which data structures can be passed by reference. Experimental evaluation demonstrates that automatically cached, coherent shared state can be maintained at reasonable cost, and that it provides significant performance advantages over straightforward (cacheless) use of RPC alone.


3

Incorporating RPC and Transactions into InterWeave

In the previous chapter, we introduced InterWeave [56, 58, 253, 254], a middleware system that allows the programmer to share both statically and dynamically allocated variables across programs written in different programming languages and running on a wide variety of hardware and OS platforms. Shared data segments in InterWeave are named by URLs, but are accessed, once mapped, with ordinary loads and stores. Segments are also persistent, outliving individual executions of sharing applications, and support a variety of built-in and user-defined coherence and consistency models. Aggressive protocol optimizations, embodied in the InterWeave library, allow InterWeave applications to outperform all but the most sophisticated examples of ad-hoc caching. Most distributed applications, despite their need for shared state, currently use remote invocation to transfer control among machines. InterWeave is therefore designed to be entirely compatible with RPC systems such as Sun RPC, Java RMI, CORBA, and .NET. By specifying where computation should occur, RPC allows an application to balance load, maximize locality, and co-locate computation with devices, people, or private data at specific locations. At the same time, shared state serves to

• eliminate invocations devoted to maintaining the coherence and consistency of cached data;
• support genuine reference parameters in RPC calls, eliminating the need to pass large structures repeatedly by value, or to recursively expand pointer-rich data structures using deep-copy parameter modes;

• reduce the number of trivial invocations used simply to put or get data.

The value of these changes becomes particularly apparent in cases where a process, acting as a server, needs access to a significant amount of client state, but the client cannot anticipate exactly what parts of that state will be needed. By automating the caching of shared state across multiple RPC calls, InterWeave allows processes to pass complex linked structures by reference. A server can then access whatever pieces of those structures it needs, without making callbacks to the client, and without requiring the client to pass a conservative cover of the data among its original parameters. Data will still be transmitted behind the scenes, of course, but only the data that are actually needed, and only when any already-cached copies are not recent enough to use. For Internet-level applications, system failures and race conditions when accessing shared data are the expectation rather than the exception. Since fault tolerance is not provided at the RPC level, RPC-based applications usually have to build their own mechanisms to recover from faults and to improve availability, which complicates application design along yet another dimension. InterWeave eases the task of building robust distributed applications by providing them with support for ACID transactions [108]. A sequence of RPC calls and accesses to shared state can be encapsulated in a transaction in such a way that either all of them execute or none of them do with respect to the shared state. Transactions also provide a framework in which the body of a remote procedure can see (and optionally contribute to) shared data updates that are visible to the caller but not yet to other processes.


The remainder of this chapter is organized as follows. Section 3.1 introduces a design that seamlessly integrates shared state, remote invocation, and transactions to form a distributed computing environment. Section 3.2 presents a detailed implementation of this design. Section 3.3 evaluates InterWeave in both local- and wide-area network environments, using microbenchmarks and a larger example drawn from our work in data mining, which uses RPC and shared state to offload computations to back-end servers. Section 3.4 concludes this chapter.

3.1

Design

InterWeave integrates shared state, remote invocation, and transactions into a distributed computing environment. The InterWeave programming model assumes a distributed collection of servers and clients. Servers maintain persistent copies of shared data and coordinate sharing among clients. Clients in turn must be linked with a special InterWeave library, which arranges to map a cached copy of needed data into local memory. Once mapped, shared data (including references) are accessed using ordinary reads and writes. InterWeave servers are oblivious to the programming languages used by clients, and the client libraries may be different for different programming languages. InterWeave supports the use of relaxed coherence models when accessing shared data. Updates to shared data and invocations of remote procedures on arbitrary InterWeave processes can optionally be protected by ACID transactions [108]. Figure 3.1 presents an example of InterWeave client code for a simple shared linked list. The InterWeave API used in the example will be explained in more detail in the following sections. For consistency with the example, we present the C version of the API. Similar APIs exist for C++, Java, and Fortran. InterWeave's shared state can be used with RPC systems by passing MIPs as ordinary RPC string arguments. When necessary, a sequence of RPC calls, lock


// RPC client code
IW_seg_t seg;
node_t *head;

void list_init(void)
{
    seg = IW_open_segment("data_svr/list");
    head = IW_mip_to_ptr("data_svr/list#head");
    IW_set_sharing(seg, IW_MIGRATORY);
}

// RPC client code
int list_insert(int key, float val)
{
    IW_trans_t trans;
    node_t *p;
    arg_t arg;

    for (int c = 0; c < RETRY; c++) {
        trans = IW_begin_work();            // new transaction
        IW_twl_acquire(trans, seg);         // lock
        p = (node_t*) IW_malloc(seg, IW_node_t);
        p->key = key;
        p->val = val;
        p->next = head->next;
        head->next = p;
        arg.list = "data_svr/list";         // RPC argument
        arg.trans = trans;                  // needed by IW lib
        clnt_compute_avg("rpc_svr", &arg);  // RPC
        printf("average %f\n", head->val);
        IW_twl_release(trans, seg);         // unlock
        if (IW_commit_work(trans) == IW_OK)
            return IW_OK;
    }
    return IW_FAIL;
}

// RPC server code
result_t *compute_avg(arg_t *arg, svc_req *r)
{
    static result_t result;
    int n = 0;
    float s = 0;
    char *seg_url = IW_extract_seg_url(arg->list);
    seg = IW_open_segment(seg_url);
    IW_twl_acquire(arg->trans, seg);        // lock
    head = IW_mip_to_ptr(arg->list);
    for (node_t *p = head->next; p; p = p->next) {
        s += p->val;
        n++;
    }
    head->val = (n != 0) ? s/n : 0;
    IW_twl_release(arg->trans, seg);        // unlock
    IW_close_segment(seg);
    result.trans = arg->trans;              // needed by IW lib
    return &result;
}

Figure 3.1: Shared linked list in InterWeave. Both the RPC client and RPC server are InterWeave clients. The variable head points to a dummy header node; the first real item is in head->next. head->val is the average of all list items' val fields. The RPC client is initialized with list_init(). To add a new item to the list, the RPC client starts a transaction, inserts the item, and makes a remote procedure call to a fast machine to update head->val. head->val could represent summary statistics much more complex than "average". The function clnt_compute_avg() is the client-side stub generated by the standard rpcgen tool. The IW_close_segment() call in the RPC server code leaves the cached copy of the segment intact on the server for later reuse.


struct arg_t {
    char *list;
    IW_trans_t trans;    // instrumented by IW IDL
};

bool_t xdr_arg_t(XDR *xdrs, arg_t *objp) {
    if (!xdr_string(xdrs, &objp->list, ~0))
        return FALSE;
    if (!xdr_trans_arg(xdrs, &objp->trans))  // instrumented by IW IDL
        return FALSE;
    return TRUE;
}

Figure 3.2: The argument structure and its XDR routine. During an RPC, xdr_arg_t() is invoked on both the caller side (to marshal arguments) and the callee side (to unmarshal arguments). Likewise, xdr_result_t() (not shown here) is invoked to marshal and unmarshal the results. Both routines are generated with the standard rpcgen tool and slightly modified by the InterWeave IDL compiler. xdr_trans_arg() is a function in the InterWeave library that marshals and unmarshals transaction metadata along with the other arguments.


operations, and data manipulations can be protected by a transaction to ensure that distributed shared state is updated atomically. Operations in a transaction are performed in such a way that, with respect to InterWeave shared state, either all of them execute or none of them does. InterWeave may run transactions in parallel, but the behavior of the system is equivalent to some serial execution of the transactions, giving the appearance that one transaction runs to completion before the next one starts. (Weaker transaction semantics are discussed further in Section 3.2.6.) Once a transaction commits, its changes to the shared state survive failures.

Typically, a transaction is embedded in a loop that retries the task until it succeeds or a retry bound is reached (see Figure 3.1). The task usually starts with an IW_begin_work() call, which returns an opaque transaction handle to be used in later transactional operations, such as IW_commit_work() and IW_rollback_work(). Each RPC call automatically starts a sub-transaction that can be individually aborted without rolling back the work that has already been done by outer (sub)transactions. In keeping with traditional RPC semantics, we assume that only one process in an RPC call chain is active at any given time. Asynchronous RPC is not supported in the current implementation.

The skeleton code for both the RPC client and server is generated with the standard rpcgen tool and slightly modified by the InterWeave IDL compiler to insert a transaction handle field in both the RPC argument and result structures (see Figure 3.2). Accordingly, the XDR translation routines for the arguments and results are augmented with a call to xdr_trans_arg() or xdr_trans_result(), respectively. These two InterWeave library functions encode and transmit transaction information along with the other RPC arguments or results. InterWeave's shared state and transaction support is designed to be completely compatible with existing RPC systems. Inside a transaction, the body of a remote


procedure can see (and optionally contribute to) shared data updates that are visible to the caller but not yet to other processes. An RPC caller can pass references to shared state (MIPs) to the callee as ordinary string arguments. The RPC callee then extracts the segment URL from the MIP using IW_extract_seg_url(), locks the segment, and operates on it (see the RPC server code in Figure 3.1). Modifications to the segment made by the callee become visible to other processes in the transaction when the lock is released, and are applied to the InterWeave server's master copy when the outermost (root) transaction commits. Before the root transaction commits, those modifications are invisible to other transactions.

A segment's sharing pattern can be specified as either migratory or stationary using IW_set_sharing(). When making an RPC call, the xdr_trans_arg() on the caller side temporarily releases writer locks on currently locked migratory segments and makes modifications to these segments visible to the RPC callee. When the RPC call returns, the xdr_trans_result() on the caller side automatically re-acquires locks on those segments and brings in updates made by the callee. Writer locks on stationary segments, however, are not released by the caller before making an RPC call. Should the callee attempt to acquire a writer lock on any of these locked segments, it is a synchronization error in the program, and the transaction simply aborts.

RPC and transactions do not necessarily have to be used together. Idempotent RPCs that do not require shared state may be used without transaction protection to save the extra overhead. If a lock acquire has a null pointer as its transaction handle, the critical section delimited by the lock acquire and release is treated as an anonymous transaction. Updates made in this critical section are atomically applied to the server's master copy as part of the lock release operation.
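The caller-side treatment of migratory versus stationary segments around an RPC can be made concrete with a small sketch. All types and helper names below are invented for illustration; in InterWeave itself this work happens inside xdr_trans_arg() and xdr_trans_result().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical miniature of the caller-side lock handling around an
 * RPC. IW_MIGRATORY/IW_STATIONARY mirror the IW_set_sharing() modes;
 * everything else is invented for illustration. */
typedef enum { IW_MIGRATORY, IW_STATIONARY } iw_sharing_t;

typedef struct {
    iw_sharing_t sharing;
    int writer_locked;   /* caller currently holds the writer lock */
    int sent_to_callee;  /* diffs shipped to the callee for this call */
} iw_seg_state_t;

/* Before the call: temporarily release writer locks on migratory
 * segments and make their modifications visible to the callee. */
void caller_prepare_rpc(iw_seg_state_t *segs, size_t n) {
    for (size_t i = 0; i < n; i++)
        if (segs[i].writer_locked && segs[i].sharing == IW_MIGRATORY) {
            segs[i].writer_locked = 0;
            segs[i].sent_to_callee = 1;
        }
}

/* After the call returns: re-acquire the released locks and bring
 * in the callee's updates. */
void caller_resume_after_rpc(iw_seg_state_t *segs, size_t n) {
    for (size_t i = 0; i < n; i++)
        if (segs[i].sent_to_callee) {
            segs[i].writer_locked = 1;
            segs[i].sent_to_callee = 0;
        }
}

/* A callee writer-lock request on a stationary segment still locked
 * by the caller is a synchronization error (-1 => abort). */
int callee_try_writer_lock(const iw_seg_state_t *seg) {
    return (seg->writer_locked && seg->sharing == IW_STATIONARY) ? -1 : 0;
}
```

A migratory segment thus travels with the call, while a stationary segment stays pinned at the caller and any conflicting lock request at the callee aborts the transaction.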
In addition to providing protection against various system failures, transactions also provide a mechanism for recovering from problems arising from relaxed coherence models, e.g., deadlock or lock failure caused by inter-segment inconsistency. Suppose, for example, that process P has acquired a reader lock on segment A, and that the InterWeave library determined at the time of the acquire that the currently cached copy of A, though not completely up-to-date, was "recent enough" to use. Suppose then that P attempts to acquire a lock on segment B, which is not yet locally cached. The library will contact B's server to obtain a current copy. If that copy was created using information from a more recent version of A than the one currently in use at P, a consistency violation has occurred. Users can disable this consistency check if they know it is safe to do so, but under normal circumstances the attempt to lock B must fail. The problem is exacerbated by the fact that the information required to track consistency (which segment versions depend on which?) is unbounded. InterWeave hashes this information in a way that is guaranteed to catch all true consistency violations, but introduces the possibility of spurious apparent violations [56]. Transaction aborts and retries can then be used to recover from the inconsistency, with automatic undo of uncommitted segment updates. An immediate retry is likely to succeed, because P's out-of-date copy of A will have been invalidated.
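The hashed consistency check can be pictured with the following sketch. This is a simplified model with invented names, not InterWeave's actual data structure: version dependences are folded into a fixed number of hash slots, which preserves every true violation but can report spurious ones when two segments collide in the same slot.

```c
#include <assert.h>

/* Hypothetical sketch of a conservative hashed dependency check. */
#define NSLOTS 8

/* slot i holds the version of the (hashed) segment currently in use
 * at this process; 0 means "no version in use". */
typedef struct { unsigned version[NSLOTS]; } iw_hash_vec_t;

static unsigned slot_of(unsigned seg_id) { return seg_id % NSLOTS; }

void note_version_in_use(iw_hash_vec_t *v, unsigned seg_id, unsigned ver) {
    v->version[slot_of(seg_id)] = ver;
}

/* A newly fetched copy depends on segment dep_seg at version dep_ver.
 * Report a violation if we are using an older version of whatever
 * hashes to that slot -- conservative, so collisions can produce
 * spurious violations but no true violation is missed. */
int violates(const iw_hash_vec_t *v, unsigned dep_seg, unsigned dep_ver) {
    unsigned in_use = v->version[slot_of(dep_seg)];
    return in_use != 0 && dep_ver > in_use;
}
```

With NSLOTS = 8, segment IDs 3 and 11 share a slot, so a dependence on segment 11 can be (spuriously) flagged against a version of segment 3 in use, which matches the behavior described above.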

3.2 Implementation

In this section, we elaborate on the implementation of the support for remote invocation and transactions in InterWeave. Please refer to Section 2.2 (page 31) for details on the basic InterWeave implementation.

3.2.1 Overview

When neither transactions nor RPC is being used, segment diffs sent from an InterWeave client to a server are immediately applied to the server's master copy of the segment. With transactions, updates to the segment master copy are deferred until the transaction commits. Like many database systems, InterWeave employs


the strict two-phase locking protocol and the two-phase commit protocol to support atomic, consistent, isolated, and durable (ACID) transactions. With the strict two-phase locking protocol, locks acquired in a transaction or sub-transaction are not released to the InterWeave server until the outermost (root) transaction commits or aborts. It would be possible to adapt variants of these protocols (e.g., optimistic two-phase locking [108]) to InterWeave, but we do not consider this possibility here.

Each InterWeave client runs a transaction manager (TM) thread that keeps track of all ongoing transactions involving the given client and listens on a specific TCP port for transaction-related requests. Each TM is uniquely identified by a two-element tuple (the client's address and the TM's port). Each transaction is uniquely identified by a three-element tuple whose first two elements are the ID of the TM that starts the transaction and whose third element, #seq, is a monotonically increasing sequence number inside that TM. These naming conventions allow TMs to establish connections to each other once the ID of a TM or transaction is known.

In the IW_begin_work() call, a transaction metadata table (TMT) is created to record information about the new transaction: locks acquired, locks currently held, version numbers of locked segments, segments modified, locations where diffs can be found, etc. The TMT is the key data structure that supports the efficient implementation of transactions and the integration of shared state and transactions with RPC. It is passed between caller and callee in every RPC call and return. With the aid of the TMT, processes in a transaction cooperate to share data invisible to other processes and to exchange data modifications without the overhead of going through the InterWeave server. This direct exchange of information is not typically supported by database transactions, but is crucial to RPC performance.
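The naming scheme and a minimal TMT entry can be sketched as follows. The struct layouts and field names are invented for illustration; the real TMT records more (e.g., per-version diff locations).

```c
#include <assert.h>
#include <string.h>

/* Hypothetical sketch of TM/transaction naming and a TMT entry. */
typedef struct { char host[64]; int port; } tm_id_t;      /* TM id */
typedef struct { tm_id_t tm; unsigned seq; } trans_id_t;  /* TM id + #seq */

typedef enum { LOCK_NONE, LOCK_READ, LOCK_WRITE } lock_t;

typedef struct {
    unsigned seg_id;    /* a segment ever locked in the transaction */
    lock_t   held;      /* lock currently held (if any) */
    unsigned version;   /* latest version seen or created */
    tm_id_t  diff_at;   /* client where the newest diff is buffered */
} tmt_entry_t;

typedef struct {
    trans_id_t  id;
    tmt_entry_t entry[16];
    int         n;
} tmt_t;

/* #seq is monotonically increasing inside one TM. */
trans_id_t new_transaction(tm_id_t tm, unsigned *next_seq) {
    trans_id_t t;
    t.tm = tm;
    t.seq = (*next_seq)++;
    return t;
}

tmt_entry_t *tmt_find(tmt_t *t, unsigned seg_id) {
    for (int i = 0; i < t->n; i++)
        if (t->entry[i].seg_id == seg_id) return &t->entry[i];
    return 0;
}
```

The lookup is what the lock and diff machinery in the following subsections consults on every acquire.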


3.2.2 Locks inside a Transaction

When a client requests a lock on a segment using either IW_twl_acquire() (for a writer lock) or IW_trl_acquire() (for a reader lock), the InterWeave library searches the TMT to see if the transaction has already acquired the requested lock. There are four possible cases. (1) The lock is found in the TMT, but another process in the transaction is currently holding an incompatible lock on the segment (e.g., both are writer locks). This is a synchronization error in the application, and the transaction aborts. (2) The lock is found in the TMT, and no other process in the transaction is currently holding an incompatible lock on the segment. The lock request is granted locally. This happens when, for instance, one process in the transaction previously acquired a writer lock on the segment and subsequently released the lock after making modifications. Because of the strict two-phase locking semantics, the lock is retained by the transaction instead of being returned to the InterWeave server immediately. (3) The lock is found in the TMT but only for reading, and the current request is for writing. The client contacts the InterWeave server to upgrade the lock. (4) The lock is not found in the TMT, meaning that the segment has not previously been locked by this transaction. The client contacts the InterWeave server to acquire the lock and updates the TMT accordingly.

When a client releases a lock, the InterWeave library updates the lock status in the TMT. In keeping with the strict two-phase locking semantics, the transaction retains the lock until it commits or aborts rather than returning it to the InterWeave server immediately. During the release of a writer lock, the library uses the process described in Section 2.2.2 (page 35) to collect a diff that describes the modifications made during the lock critical section. Unlike in the non-transactional case, where the diff would be sent to the InterWeave server immediately, the diff is stored locally in the created-diff buffer (or in a file, if memory is scarce). The library also increases the segment's current version number, stores this number in


the TMT, and appends an entry indicating that a diff that upgrades the segment to this new version has been created by this client. The actual content of the diff is not stored in the TMT.
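The four-way case analysis for lock acquisition above can be captured in a small decision function. This is a sketch with invented names, not the library's actual code.

```c
#include <assert.h>

/* Hypothetical sketch of the four lock-acquire cases inside a
 * transaction. */
typedef enum { LK_NONE, LK_READ, LK_WRITE } lk_t;

typedef enum {
    ACQ_ABORT,    /* (1) incompatible lock held by another process */
    ACQ_LOCAL,    /* (2) lock retained by the transaction: grant locally */
    ACQ_UPGRADE,  /* (3) held for reading only: ask server to upgrade */
    ACQ_SERVER    /* (4) not in the TMT: acquire from the server */
} acq_action_t;

/* in_tmt: lock appears in the TMT; held_by_other: another process in
 * the transaction currently holds it; held_mode: strongest mode the
 * transaction has acquired; want: requested mode. */
acq_action_t classify_acquire(int in_tmt, int held_by_other,
                              lk_t held_mode, lk_t want) {
    if (!in_tmt)
        return ACQ_SERVER;                                 /* case 4 */
    if (held_by_other && (want == LK_WRITE || held_mode == LK_WRITE))
        return ACQ_ABORT;                                  /* case 1 */
    if (held_mode == LK_READ && want == LK_WRITE)
        return ACQ_UPGRADE;                                /* case 3 */
    return ACQ_LOCAL;                                      /* case 2 */
}
```

Note that two concurrent reader locks inside the same transaction are compatible and are granted locally without contacting the server.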

3.2.3 Interplay of RPC and Transactions

When a client performs an RPC inside a transaction, the xdr_trans_arg() call, included in the argument marshaling routine by the InterWeave IDL compiler, encodes and transmits the TMT to the callee along with the other arguments. A complementary xdr_trans_arg() call on the callee side reconstructs the TMT when unmarshaling the arguments. Typically the TMT is small enough to have a negligible impact on the overhead of the call. For instance, a complete TMT containing information about a single segment is only 76 bytes. A null RPC call over a 1Gbps network takes 0.212ms, while a null RPC call in InterWeave (with this TMT) takes just 0.214ms. With a slower network, the round-trip time dominates and the TMT overhead becomes even more negligible.

Among other things, the TMT tracks the latest version of each segment ever locked in the transaction. This latest version can be either the InterWeave server's master copy or a tentative version created in the ongoing transaction. When the callee acquires a lock on a segment and finds that it needs an update (by comparing the latest version in the TMT with the version it has cached), it consults the TMT to decide whether to obtain diffs from InterWeave servers, from other InterWeave clients, or both. To fetch diffs from other clients, the callee's TM contacts the TMs on those clients directly. Once all needed diffs have been obtained, the callee applies them, in the order in which they were originally generated, to the version of the segment it has cached.

If the TMT is modified by the callee to reflect locks acquired or diffs created during an RPC, the modifications are sent back to the caller along with the RPC results, and incorporated into the caller's copy of the TMT. As in the original call,


the code that does this work (xdr_trans_result()) is automatically included in the marshaling routines generated by the InterWeave IDL compiler. When the caller needs diffs created by the callee to update its cache, it knows where to get them by inspecting the TMT. In an RPC call chain, the modifications to the TMT are recursively propagated back to the original caller that started the transaction. Since there is only one active process in a transaction, the TMT is guaranteed to be up-to-date at the site where it is in active use. We assume that both RPC and inter-TM communication use a reliable protocol such as TCP. RPC has exactly-once semantics in the absence of system failures and at-most-once semantics if processes can fail.
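The callee's choice of where to fetch needed diffs amounts to planning a chain of diffs from its cached version up to the latest version recorded in the TMT. The following is a minimal sketch with invented types; the real TMT records diff locations per version rather than a flat log.

```c
#include <assert.h>

/* Hypothetical diff record: which version it produces and which
 * party holds it (0 = InterWeave server, otherwise a client's TM). */
typedef struct { unsigned to_version; unsigned holder; } diff_rec_t;

/* Collect, in order, the holders of the diffs needed to go from
 * cached_ver to latest_ver; returns the number of diffs, or -1 if a
 * step is missing. out[] receives the holders, oldest diff first. */
int plan_update(const diff_rec_t *log, int nlog,
                unsigned cached_ver, unsigned latest_ver,
                unsigned *out) {
    int n = 0;
    for (unsigned v = cached_ver + 1; v <= latest_ver; v++) {
        int found = 0;
        for (int i = 0; i < nlog; i++)
            if (log[i].to_version == v) {
                out[n++] = log[i].holder;
                found = 1;
            }
        if (!found) return -1;  /* cannot bridge the version gap */
    }
    return n;
}
```

Applying the diffs in this order reproduces the rule that diffs must be applied in the order in which they were originally generated.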

3.2.4 Transaction Commits and Aborts

During a commit operation, the library on the client that originally started the transaction (the transaction coordinator) finds all InterWeave clients that participated in the transaction by inspecting the TMT. It then initiates a two-phase commit among those clients by sending every client a prepare message. During the first, prepare phase of the protocol, each client sends its locally created and temporarily buffered diffs to the appropriate InterWeave servers and asks them to prepare to commit. A client responds positively to the coordinator only if all servers the client contacted respond positively. During the prepare phase, each InterWeave server temporarily stores the received diffs in memory.

Not every diff created in the transaction has to be sent to the InterWeave servers during the prepare phase; only the diffs that determine the final contents of the segments matter. In particular, if a diff contains the entire new contents of the segment (e.g., due to the no-diff mode optimization [253] in Section 2.2.3, page 43), then all diffs before it that contain no information about deleted or newly created blocks can simply be discarded without affecting the contents of the final version of the segment. For each diff, the TMT records whether


it is discardable and whether it contains the entire contents of the segment, to facilitate this diff-reduction optimization. This optimization is particularly useful for long-running transactions that create many temporary, intermediate changes to the shared state.

Once the coordinator has heard positively from every client, it begins the second, commit phase of the protocol by sending every client a commit message. In response to this message, each client instructs the servers that it contacted during the prepare phase to commit. Upon receiving the commit message, a server writes all diffs to a diff log in persistent storage, and then applies the diffs to the segments' master copies in the order in which they were originally generated. The persistent diff log allows the server to reconstruct the segment's master copy in case of server failure. Occasionally, the server checkpoints a complete copy of the segment to persistent storage and frees the space used by the diff log. InterWeave is robust against client failures (detected via heartbeats), since cached segments can be destroyed at any time without affecting the server's persistent master copy.

A transaction abort call, IW_rollback_work(), can be issued either by the application explicitly or by the library implicitly if anything goes wrong during the transaction (examples include client failure, server failure, network partition, or lock failure caused by inter-segment inconsistencies). During an abort operation, the transaction coordinator asks all involved clients to abort. Each client instructs the InterWeave servers it contacted to abort, invalidates cached segments that were modified in the transaction, and discards its locally created and buffered diffs. Each server then discards any tentative diffs it received from clients. When a client later locks a locally invalidated segment, it will obtain a complete, fresh copy of the segment from the InterWeave server. (A more sophisticated implementation would log all changes made to the segment's cached copy and metadata structure, and thus could allow the client to restore the segment's cached copy to its pre-transaction state. This is not currently implemented in InterWeave because we consider


transaction aborts as rare events. The potential benefit of this implementation does not warrant the extra complexity it would introduce.) Both the InterWeave clients and servers use timeouts to decide when to abort an unresponsive transaction and reclaim the resources devoted to it. For instance, if the transaction coordinator dies before sending out any commit or abort message, the other parties will abort the transaction and reclaim resources after a timeout.

Each remote procedure call automatically starts a sub-transaction. In case of RPC failure, there is enough information in the TMT to identify the parties involved in the sub-transaction. Either the callee (if it is still alive) executes the protocol to abort the sub-transaction, or the sub-transaction aborts automatically on all involved parties after a timeout. It is then up to the application to try an alternative to finish the work, or to simply abort the root transaction.

For the sake of simplicity, InterWeave does not provide any mechanism for deadlock prevention or detection. Transactions experiencing deadlock are treated as unresponsive and are aborted by timeout automatically. After a transaction abort, the application usually introduces a random delay before retrying the transaction, to reduce the chance that the conditions that led to the abort will arise again. Solutions for other corner cases that might occur in a transaction are described in [108]. We adopt those techniques in InterWeave but omit a detailed discussion here. When a transaction completes, whether by commit or abort, the segment locks retained by the transaction are released to the corresponding InterWeave servers, and the various resources devoted to the transaction (e.g., diff buffers and the TMT) are reclaimed.
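To make the diff-reduction optimization from the prepare phase concrete, here is a minimal sketch (types and names are invented): every discardable diff that precedes the last full-contents diff can be dropped before diffs are shipped to the servers.

```c
#include <assert.h>

/* Hypothetical per-diff flags, as recorded in the TMT. */
typedef struct {
    int full_contents;  /* diff carries the entire segment contents */
    int discardable;    /* no info about deleted or new blocks */
    int keep;           /* output: must be sent to the server */
} diff_info_t;

/* Marks which of the n diffs (oldest first) must be sent during the
 * prepare phase; returns how many are kept. */
int reduce_diffs(diff_info_t *d, int n) {
    int last_full = -1, kept = 0;
    for (int i = 0; i < n; i++)
        if (d[i].full_contents) last_full = i;
    for (int i = 0; i < n; i++) {
        /* Diffs at or after the last full diff always survive, as do
         * earlier diffs that are not discardable. */
        d[i].keep = (i >= last_full) || !d[i].discardable;
        kept += d[i].keep;
    }
    return kept;
}
```

If no full-contents diff exists, every diff is kept, which matches the default behavior without the optimization.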

3.2.5 Proactive Diff Propagation

Normally, a diff generated inside a transaction is stored on the InterWeave client that created the diff, and is transmitted between clients on demand. To avoid


an extra exchange of messages in common cases, however, InterWeave may sometimes send diffs between clients proactively. Specifically, the TM of an RPC caller records the diffs that are created by the caller and requested by the callee during the RPC session. If the diffs for a segment are requested three times in a row by the same remote procedure, the library associates the segment with that remote procedure. In later invocations of the same remote procedure, the diffs for the associated segments are sent proactively to the callee, along with the TMT and RPC arguments. These diffs are stored in the callee's proactive diff buffer. When the callee needs a diff, it always searches the proactive diff buffer before sending a request to the InterWeave server or to the client that created the diff. When the RPC call finishes, the callee returns, along with the RPC results, information indicating whether the proactive diffs were actually used. If not, the association between the segment and the remote procedure is broken, and later invocations will not send diffs proactively. The same applies to the diffs created by the callee: if those diffs are always requested by the caller after the RPC call returns, the callee piggybacks them on the RPC results in later invocations.

As an alternative to automatic detection of segment sharing patterns at runtime, an application can explicitly specify that a segment is migratory or stationary using the IW_set_sharing() call. As described in Section 3.1, the library on the caller automatically releases writer locks on migratory segments before calling the remote procedure and re-acquires them when the call returns. Diffs for migratory segments are always sent between the caller and callee proactively. In common cases, the caller sends the diffs needed by the callee along with the RPC arguments. The callee applies the diffs to update its local cache, performs some computation based on those data, and sends diffs describing the modifications it made back to the caller along with the RPC results.
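The association heuristic above reduces to a small state machine per (segment, remote procedure) pair. This is an illustrative sketch; the threshold constant and names are taken from the prose, not from the InterWeave source.

```c
#include <assert.h>

/* Hypothetical sketch of the proactive-diff heuristic: associate a
 * segment with a remote procedure after its diffs are requested
 * three times in a row; break the association as soon as a proactive
 * diff goes unused. */
#define PROACTIVE_THRESHOLD 3

typedef struct {
    int streak;     /* consecutive RPCs that requested the diffs */
    int proactive;  /* currently associated with the procedure */
} assoc_t;

int should_send_proactively(const assoc_t *a) { return a->proactive; }

/* Called after each RPC; requested = the callee asked for (or used)
 * the segment's diffs during this invocation. */
void after_rpc(assoc_t *a, int requested) {
    if (requested) {
        if (++a->streak >= PROACTIVE_THRESHOLD)
            a->proactive = 1;
    } else {
        a->streak = 0;
        a->proactive = 0;  /* proactive diff unused: drop association */
    }
}
```

The same machine, run on the callee side, governs piggybacking diffs on the RPC results.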


Always deferring diff propagation to the InterWeave servers until the end of a transaction may incur a significant delay at transaction commit. As an optimization, each InterWeave client's TM thread also acts as a "diff cleaner", sending diffs in the created-diff buffer to the corresponding InterWeave servers when the client is idle (e.g., while waiting for RPC results). These diffs are buffered on the server until the transaction commits or aborts.

3.2.6 Relaxed Transaction Models

In addition to automatic data caching, RPC applications can also benefit from InterWeave's relaxed coherence models [58], which improve transaction throughput and reduce the demand for communication bandwidth. Bandwidth reduction is particularly important for applications distributed across wide-area networks. In a transaction, if all locks are acquired under Strict coherence, the transaction is similar to those in databases, possessing the ACID properties. When relaxed reader locks are used, however, InterWeave can no longer guarantee isolation among transactions. Suppose, for instance, that there are two transactions T1 = [R(a1) → W(b2)] and T2 = [R(b1) → W(a2)], where R(xj) reads version j of segment x, W(xj) produces (writes) version j of segment x, and [X → Y] means operation X precedes operation Y in the given transaction. Clearly, transactions T1 and T2 are not serializable. This situation occurs when T1 uses a relaxed lock on segment a and T2 uses a relaxed lock on segment b.

Gray and Reuter [108] classified transactions into degrees 0-3 based on the level of isolation they provide. Degree 0 transactions do not respect any data dependences among transactions, and hence allow the highest concurrency; degree 1 transactions respect WRITE→WRITE dependences; degree 2 transactions respect WRITE→WRITE and WRITE→READ dependences; degree 3 transactions respect all dependences, and hence ensure the ACID properties. In this terminology, InterWeave's relaxed transaction models belong to degree 1.

One important reason for many database applications to use degree 3 transactions is to have repeatable reads; that is, reads of the same data inside a transaction always return the same value. InterWeave supports repeatable reads across distributed processes: it guarantees that if any process in a transaction reads an old version of a segment, all other processes in the transaction that employ relaxed coherence models can only read the same old version. This enhancement, we believe, makes it easier to reason about InterWeave's relaxed transaction models than would normally be the case for degree 1 transactions.

Typically, relaxed transaction models are used to improve transaction throughput. In InterWeave, the relaxed coherence models provide the additional benefit of reducing communication cost, which is important for applications distributed across the Internet. When a segment is read under a relaxed lock for the first time, the TMT records which version of the segment is accessed (the target version). When another process in the transaction requires a relaxed lock on the same segment, there are several possible scenarios. (1) If the target version is not "recent enough" for the process to use, the transaction aborts. (2) If the cached copy on the process is newer than the target version, the transaction aborts. (3) If the cached copy is the same as the target version and it is "recent enough", the process uses the cached copy without an update. (4) If the cached copy is older than the target version (or the segment is not yet cached) and the target version would be "recent enough" for the process, it tries to update its cached copy to the target version. If such an update is impossible, the transaction aborts.
Updating a segment to a specific version other than the latest one is aided by the diff log on the servers and the created-diff buffer on the clients.
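The four scenarios for a relaxed lock acquire can be summarized in a single decision function; this is a sketch with invented names, with the "recent enough" test abstracted to a flag.

```c
#include <assert.h>

/* Hypothetical sketch of relaxed-lock acquisition against the target
 * version recorded in the TMT. */
typedef enum { RL_ABORT, RL_USE_CACHED, RL_UPDATE_TO_TARGET } rl_action_t;

/* cached_ver == 0 means the segment is not cached locally;
 * recent_enough says whether target_ver satisfies this process's
 * coherence requirement. */
rl_action_t relaxed_acquire(unsigned cached_ver, unsigned target_ver,
                            int recent_enough) {
    if (!recent_enough)
        return RL_ABORT;           /* (1) target too old for us */
    if (cached_ver > target_ver)
        return RL_ABORT;           /* (2) cache newer than target */
    if (cached_ver == target_ver)
        return RL_USE_CACHED;      /* (3) already at the target */
    return RL_UPDATE_TO_TARGET;    /* (4) update cache to the target;
                                      if that proves impossible, the
                                      transaction aborts as well */
}
```

Scenario (4) relies on the diff log and created-diff buffers mentioned above to reach a specific (not necessarily latest) version.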


3.3 Experimental Results

3.3.1 Transaction Cost Breakdown

We first use a microbenchmark to quantify InterWeave's transaction cost in both local-area network (LAN) and wide-area network (WAN) environments. In this benchmark, two processes share a segment containing an integer array of variable size and cooperatively update the segment inside a transaction. One process (the RPC caller) starts a transaction and contacts the InterWeave server to acquire a writer lock on the segment (the "lock" phase); increments every integer in the array (the "local" phase); generates a diff that describes the changes it made (the "collect" phase); and makes an RPC call to the other process, proactively sending the diff along with the RPC call, and waits for the callee to finish (the "RPC" phase). During this "RPC" phase, the callee acquires a writer lock on the segment (it finds the lock in the TMT, avoiding contacting the InterWeave server); uses the diff in the proactive diff buffer to update its local copy; increments every integer in the array; generates a diff that describes the new changes it made; and proactively sends that diff back to the caller along with the RPC results. After the callee finishes, the caller uses the returned proactive diff to update its local copy (the "apply" phase), prints some results, and then runs the two-phase commit protocol to update the InterWeave server's master copy of the segment (the "commit" phase). During the "commit" phase, the caller and callee each send the diffs they created to the InterWeave server.

We compare the above "proactive transaction" with two alternatives: "nonproactive transaction" and "no transaction". With "nonproactive transaction", diffs are sent between the caller and callee only on demand. During the "RPC" phase, the callee contacts the caller to fetch the diff created by the caller. Likewise, in the "apply" phase, the caller contacts the callee to fetch the diff created by the callee.


With "no transaction" (the basic InterWeave implementation without support for transactions), in the "collect" phase the caller sends the diff it created to the InterWeave server and releases the writer lock. In the "RPC" phase, the callee contacts the InterWeave server to acquire a writer lock and request the diff it needs. When the callee finishes, it sends the diff it created to the InterWeave server and releases the writer lock. In the "apply" phase, the caller acquires a reader lock and fetches the diff created by the callee from the InterWeave server to update its local copy.

Local-Area Network

The first set of experiments was run on a 1Gbps Ethernet. The InterWeave server, RPC caller, and RPC callee run on three different 2GHz Pentium IV machines under Linux 2.4.9. To evaluate the more general scenario, we turn off the diff-reduction optimization for this benchmark. For each configuration, we run the benchmark 20 times and report the median in Figure 3.3. The X axis is the size (in bytes) of the segment shared by the caller and callee. The Y axis is the time to complete the transaction.

Compared to a "proactive transaction", the "apply" phase in a "nonproactive transaction" is significantly longer because it includes the time to fetch the diff from the callee. Likewise, the "collect" and "apply" phases in "no transaction" are longer than those in "proactive transaction", because the diffs are sent to or fetched from the InterWeave server during those phases. For a "proactive transaction", the diffs are sent between the caller and callee during the "RPC" phase. However, the "commit" phase compensates for the savings, resulting in an overall small overhead to support transactions for RPC calls that transmit a large amount of data (see the "proactive" and "no trans." bars).

With the aid of the TMT, processes avoid propagating the diffs through the server when sharing segments. As a result, the critical path of the RPC call for a

Figure 3.3: Execution time for transactions that transmit a large amount of data on a LAN. [Bar chart: X axis — segment size in bytes, 16K to 1M; Y axis — execution time (ms), 0 to 90; bars for no trans., proactive, and nonproactive, each broken into lock, local, collect, rpc, apply, and commit phases.]

Figure 3.4: Execution time for transactions that transmit a small amount of data on a LAN. [Bar chart: X axis — segment size, 4K and 8K bytes; Y axis — execution time (ms), 0 to 2.5; bars for no trans., proactive, and nonproactive, each broken into lock, local, collect, rpc, apply, and commit phases.]


"proactive transaction" (the "RPC" phase) is up to 60% shorter than that of "no transaction" (the "collect"+"RPC"+"apply" phases). In this benchmark, the local computation cost is trivial. For transactions with relatively long computation, a "proactive transaction" will send diffs to the InterWeave server in the background, reducing the time spent in the "commit" phase.

The results for smaller segments are shown in Figure 3.4. The "proactive transaction" performs slightly better than the "nonproactive transaction" because it saves the two extra round trips to fetch the diffs. For transactions that transmit only a small amount of data between the caller and callee, the relative overhead of executing the two-phase commit protocol becomes more significant, as seen by comparing with the "no trans." results.

Figure 3.5 shows the execution time of a "proactive transaction" when both the caller and callee update only x% of a 1MB segment. As the percentage of changed data goes down, the transaction cost decreases proportionally, due to InterWeave's ability to automatically identify modifications and transmit only the diffs. In all cases, the overhead of computing the diffs (the "collect" phase) is negligible compared with the benefits. The "no-IW RPC" bar is a simple RPC program that makes no use of InterWeave, sending the 1MB of data between the caller and callee directly. It avoids the cost of sending the modifications to a database server and the overhead of acquiring locks and executing the two-phase commit protocol. The important lesson this figure reveals is that, for temporary (non-persistent) data with simple sharing patterns, it is more efficient to transmit the data directly (using deep copy) between sites than to put them in the global shared space. For persistent data (some data outlive a single run of the application and hence must be persistent) with non-trivial sharing patterns, however, applications can benefit significantly from InterWeave's caching capability. With InterWeave, both deep-copy arguments and MIPs into the shared store can be used in a single RPC call, giving the programmer

Figure 3.5: Execution time for a "proactive transaction" running on a LAN, when both the caller and callee only update x% of a 1MB segment. [Bar chart: X axis — percentage of changed data (100%, 50%, 25%, 12.5%, 6.25%), plus a "no-IW RPC" bar; Y axis — execution time (ms), 0 to 90; bars broken into lock, local, collect, rpc, apply, and commit phases.]

Figure 3.6: Execution time for transactions running on a WAN. [Bar chart: X axis — size in bytes (2K, 32K, 128K); Y axis — execution time (ms), 0 to 900; bars for no trans., proactive, and nonproactive, each broken into lock, local, collect, rpc, apply, and commit phases.]

117

maximum flexibility to choose the most efficient way to communicate data.

Wide-Area Network

Our second set of experiments runs the same benchmark in a wide-area network. The machines running the InterWeave server, RPC caller, and RPC callee are located at the University of Waterloo (900MHz Pentium III, Linux 2.4.18), Rochester Institute of Technology (300MHz AMD K6, Linux 2.2.16), and the University of Virginia (700MHz AMD Athlon, Linux 2.4.18), respectively. The execution times of the transactions, shown in Figure 3.6, are more than 100x longer than those on the fast LAN. As the data size increases, the relative cost of the “RPC” phase among “nonproactive transaction”, “proactive transaction”, and “no transaction” changes. When the data size is small, the “RPC” phase in “proactive transaction” is the smallest because it avoids the extra round trip to acquire the lock or to fetch the diffs. As the data size increases, the diff propagation time, which is included in the “RPC” phase for “proactive transaction”, comes to dominate. As a result, the “RPC” phase for “proactive transaction” becomes the longest among the three. Comparing Figures 3.6 and 3.4, the benefit of proactive diff propagation is more substantial on a WAN, due to the long network latency. Also due to the slow network, “no transaction” performs noticeably better than “proactive transaction”, as it does not have the overhead of executing the two-phase commit protocol. Figure 3.7 uses the same settings as Figure 3.5 except that the experiment is run on a WAN. The results are similar, but there are two important differences. First, the absolute savings due to cache reuse are much more significant on a WAN because of the low network bandwidth and long latency. Second, InterWeave’s protocol overhead (e.g., diff collection) becomes even more negligible compared with the long data communication time, justifying the use of aggressive optimizations (e.g., diffing and relaxed coherence models) in middleware to save bandwidth for WAN applications.

Figure 3.7: Execution time for a “proactive transaction” running on a WAN, when both the caller and callee only update x% of a 1MB segment.

3.3.2 Using InterWeave to Share Data in an RPC Application

We use an application kernel consisting of searching and updating a binary tree to further demonstrate the benefit of using InterWeave to facilitate the sharing of pointer-rich data structures between an RPC client and server. In this application, a client maintains a binary tree keyed by ASCII strings. The client reads words from a text file and inserts new words into the tree. Periodically, the client makes an RPC call to the server with one string as a parameter. The server uses the string as the keyword to search the binary tree, does some application-specific computation using the tree content, and returns the result.


To avoid requiring the client to pass the entire tree in every call, the server caches the tree in its local memory. We compare three methods by which the client may propagate updates to the server. In the first method, the client performs a deep copy of the entire tree and sends it to the server on every update. This method is obviously costly if updates happen frequently. In the second method, the server invokes a callback function to obtain updates when necessary. Specifically, it calls back to the client when (and only when) it is searching the tree and cannot find the desired keyword. The client will respond to the callback by sending the subtree needed to finish the search. The third method is to share the binary tree between the client and server in the global store provided by InterWeave, thereby benefiting automatically from caching and efficient updates via wire-format diffing. Figures 3.8 and 3.9 compare the above three solutions. The text file we use is from Hamlet Act III and has about 1,600 words. The client updates the server after reading a certain number of words from the text file. The frequency of updates is varied from every 16 words to every 320 words and is represented by the X axis in both graphs. After each update, the client takes every other word read since the previous update and asks the server to search for these words in the tree. Under these conventions the call frequency is the same as the update frequency. The Y axis shows the total amount of time and bandwidth, respectively, for the client to finish the experiment. As updates become more frequent (moving, for example, from every 80 words to every 64 words), the deep-copy update method consumes more and more total bandwidth and needs more time to finish. The call-back and InterWeave methods keep the bandwidth requirement almost constant, since only the differences between versions are transferred. 
However, the call-back method requires significantly more programmer effort to keep track of the updates manually. Due to the extra round-trip message needed by the call-back method, and InterWeave’s efficiency in wire-format translation, InterWeave achieves the best performance among the three. This example is inspired by the one used in [134], where data can be shared between the client and server during a single critical section (RPC call). In our version of the experiment, InterWeave allows the client and server to share the data structure across multiple critical sections.

Figure 3.8: Total time to update and search a binary tree.

Figure 3.9: Total size of communication to update and search a binary tree.

3.3.3 Service Offloading in Datamining

In this experiment, we implement a sequence mining service running on a Linux cluster to evaluate the potential performance benefit of combining InterWeave and RPC to build network services, and also to obtain a sense of the effort that a programmer must expend to use InterWeave. The service provided by the cluster is to answer sequence mining queries on a database of transactions (e.g., retail purchases). Each transaction in the database (not to be confused with transactions on the database) comprises a set of items, such as goods that were purchased together. Transactions are ordered with respect to each other in time. A query from a remote user usually asks for a sequence of items that are most commonly purchased by customers over time. The database is updated incrementally by distributed sources. When updates to the database exceed a given threshold, a data mining server running in the background uses an incremental algorithm to search for new meaningful sequences and summarizes the results in a lattice data structure. Each node in the lattice represents a sequence that has been found with a frequency above a specified threshold. Given a sequence mining query, the results can be found efficiently by traversing this summary structure instead of reading the much larger database. Following the structure of many network services, we assign one node in the cluster as a front-end to receive queries from clients. The front-end can either


answer mining queries by itself, or offload some queries to other computing nodes in the same cluster when the query load is high. We compare three different offloading strategies. In the first strategy, the front-end uses RPC to offload queries to other computing nodes. Each RPC takes the root of the summary structure and the query as arguments. The offloading nodes do not cache the summary structure. This is the simplest implementation one can get with the least amount of programming effort. However, it is obviously inefficient in that, on every RPC call, the XDR marshaling routine for the arguments will deep copy the entire summary structure. The second offloading strategy tries to improve performance with an ad-hoc caching scheme. With more programming effort, the offloading nodes manually cache the summary structures across RPC calls to avoid unnecessary communication when the summary structure has not changed since the last call. The data mining server updates the offloading nodes only when a new version of the summary structure has been produced. When the summary structure does change, in theory it would be possible for the programmer to manually identify the changes and only communicate those changes in the same way as InterWeave uses diffs. We consider the effort required for this optimization prohibitive, however, because the summary is a pointer-rich data structure and updates to the summary can happen at any place in the lattice. Therefore, this further optimization is not implemented; when the lattice has changed it is retransmitted in its entirety. Alternatively, the system designer can use the global store provided by InterWeave to automate caching in RPC-based offloading. In this third strategy, we use an InterWeave segment to share the summary structure among the cluster nodes. The data mining server uses transactions to update the segment. 
When making an offloading call, the data mining server passes the MIP of the root of the summary structure to the offloading node, within a transaction that ensures the proper handling of errors. On the offloading node, the MIP is converted back


to a local pointer to the root of the cached copy of the summary structure using IW_mip_to_ptr. Our sample database is generated by tools from IBM research [237]. It includes 100,000 customers and 1000 different items, with an average of 1.25 transactions per customer and a total of 5000 item sequence patterns of average length 4. The total database size is 20MB. The experiments start with a summary data structure generated using half of this database. Each time the database grows by an additional 1% of the total database size, the datamining server updates the summary structure with newly discovered sequences. The queries we use ask for the first K most supported sequences found in the database (K = 100 in our experiments). We use a cluster of 16 nodes connected by a Gigabit Ethernet. Each node has two 1.2GHz Pentium III processors with 2GB of memory, and runs Linux 2.4.2. One node is assigned as the front-end server and offloads incoming user queries to some or all of the other 15 nodes. In the first experiment, we quantify the overhead introduced by offloading by running one query at a time. (In this case, the front-end server is not overloaded and cannot actually benefit from offloading; this is simply our way of quantifying the overhead.) The results are shown in Figure 3.10, where the X axis represents a gradual increase in database size and the Y axis represents the time to answer one query by traversing the summary structure. In this figure, as the database grows, the time to answer one query locally at the front-end server increases from 1.4ms to 6.5ms. Offloading introduces additional overhead due to the communication between the front-end and the offloading site. Compared with the cases with an invalid cache, the service latency is significantly smaller when the cache is valid on the offloading site (“RPC-no-update” for RPC ad-hoc caching and “IW-no-update” for InterWeave automatic caching).
To understand the performance differences between the different strategies in Figure 3.10, we break down the measured latencies for the last round of database

124

0.025

Latency (sec.)

0.02

RPC-update RPC-no-update IW-update IW-no-update local

0.015

0.01

0.005

0 5

10

15

20

25

30

35

40

45

50

Database sizes

Figure 3.10: Latencies for executing one query under different offloading configurations. “Local” executes queries locally at the front-end server. “RPC-update” uses RPC to offload computations without any caching scheme; it also represents the case where manual caching is used but the summary structure has changed (hence the cache is invalid and the entire summary structure has to be transmitted). “RPC-no-update” represents RPC with manual caching to offload computation when the cache happens to be valid. “IW-update” uses InterWeave and RPC to support offloading, when the summary structure has changed since the last cached copy. “IW-no-update” differs from “IW-update” in that the summary structure has not changed since the last cached copy.


updates. The results are shown in Figure 3.11. The categories on the X axis are the same as the series in Figure 3.10. In this figure, “Computation” is the time to traverse the summary structure; “Argument Translation” is the time to marshal the arguments; “Result Translation” is the time to marshal the results; “IW Overhead” is the time spent in the InterWeave library, mainly the cost of computing the diffs; “Communication & Misc.” is mainly the time spent in communication. Comparing “RPC-update” with “IW-update” (both configurations need to update their local cache), InterWeave’s use of diffing significantly reduces the total amount of data transferred and accordingly the data translation time. Comparing “RPC-no-update” to “IW-no-update” (both configurations can reuse the cache without a summary structure update), InterWeave still noticeably outperforms ad-hoc caching due to its efficiency in returning the RPC results, which are arrays of common sequence patterns. With InterWeave, we only need to return an array of the MIPs that point to the sequences in the summary structure. With ad-hoc RPC caching, we have to return the sequences themselves because of the lack of a global store shared between the caller and callee. Figure 3.12 shows the aggregate service throughput (i.e., queries processed in 100 seconds) of the cluster with an increasing number of offloading nodes. For each offloading node, the front-end runs a dedicated thread to dispatch queries to it. The X axis is the number of offloading nodes we use in the cluster, starting from 0 (no offloading) up to a maximum of 15. Beginning with a database at 50% of its full contents, we increase the database to its full size in 50 steps. Each step takes about 2 seconds. The Y axis shows the total number of completed queries during the entire database update process, i.e., 100 seconds.
“IW-RPC” is the average over the “IW-update” and “IW-no-update” series in Figure 3.10 (as the database grows, the cache sometimes needs no update and sometimes does). Likewise, “RPC-cache” uses ad-hoc RPC caching, but its value is averaged over the cache-hit and cache-miss cases. “RPC-no-cache”

126

25 Communication & Misc. IW Overhead Result Translation Argument Translation Computation

Service latency (ms.)

20

15

10

5

0 RPC-update

IW-update

RPC-no-update

IW-no-update

Offloading configurations

Figure 3.11: Breakdown of query service time under different offloading strategies.

Figure 3.12: The impact of different offloading strategies on system throughput.


uses straightforward RPC offloading with no cache. As can be seen from the figure, the throughput for “IW-RPC” scales linearly with the size of the cluster, outperforming “RPC-cache” by up to 28%. Without caching, the system cannot benefit much from using RPC for offloading.

3.4 Conclusions

We have described the integration of shared state, remote invocation, and transactions in InterWeave to form a distributed computing environment. InterWeave works seamlessly with RPC systems, providing them with a global, persistent store that can be accessed using ordinary reads and writes. To guard against various system failures and race conditions, a sequence of remote invocations and accesses to shared state can be enclosed in an ACID transaction. Our novel use of the transaction metadata table allows processes cooperating inside a transaction to safely share data invisible to other processes and to exchange the data modifications they have made without the overhead of going through the InterWeave server. Experience with InterWeave demonstrates that the integration of the familiar RPC, transactional, and shared-memory programming models facilitates the rapid development of maintainable distributed applications that are robust against system failures. Experiments on a cluster-based datamining service demonstrate that InterWeave can improve service scalability with its optimized “two-way diff” mechanism and its global address space for passing pointer-rich shared data structures. In our experiments, an offloading scheme using InterWeave outperforms an RPC offloading scheme with a manually maintained cache by 28% in overall system throughput.


4 eSearch: Hybrid Global-Local Indexing for P2P IR

In this chapter, we address the problem of information retrieval (IR) in Peer-to-Peer (P2P) systems. We compared in Figure 1.1 (page 12) different indexing structures: local indexing based on partitioning by documents, global indexing based on partitioning by words, and our hybrid indexing. With local indexing, a system has to search a large number of nodes to answer a query. With global indexing, a system has to transmit a large amount of data to answer a multi-term query. Our hybrid indexing is intended to combine the benefits of global and local indexing while avoiding their limitations. The system we built around the idea of hybrid indexing is called eSearch. The basic tenet of our approach is selective metadata replication guided by modern information retrieval (IR) algorithms. Metadata replication avoids transmitting large amounts of data when processing multi-term queries. Document term lists are replicated, by a small factor, to important places to guarantee the quality of search. Like global indexing, hybrid indexing distributes metadata based on terms (see Figure 1.1(iii)). Each node j is responsible for the inverted list of some term t. In addition, for each document D in the inverted list for term t, node j also stores the complete term list for document D. Given a new document, eSearch identifies a small number of top (important) terms in the document and publishes


the complete term list for the document to nodes responsible for those top terms. Given a multi-term query, the query is sent to nodes responsible for those terms. Each of those nodes then does a local search without consulting others, since it has the complete term lists for documents in its inverted list. Selective replication, however, may degrade the quality of search results. We adopt automatic query expansion [169] to reduce the chance of missing relevant documents. We draw on the observation that, with more terms in a query, it is more likely that a document relevant to this query is published under at least one of the query terms. We also propose an overlay source multicast protocol to efficiently disseminate term lists, and two decentralized techniques to balance the distribution of term lists across nodes. We evaluate eSearch through simulations and analysis. The results show that, owing to the optimization techniques, eSearch is scalable and efficient, and obtains search results as good as the centralized baseline. Despite the use of metadata replication, eSearch actually consumes less bandwidth than P2P systems based on global indexing (so-called Global-P2P systems) when publishing documents. During a retrieval operation, eSearch typically transmits 3.3KB of data. These costs are independent of the size of the corpus and grow slowly (logarithmically) with the number of nodes in the system. In contrast, the cost to process a query in the Global-P2P systems grows with the size of the corpus. eSearch’s efficiency comes at a modest storage cost (6.8 times that of the Global-P2P systems), which can be further reduced by adopting index compression [284] or pruning [45]. The remainder of this chapter is organized as follows. Section 4.1 provides an overview of eSearch’s system architecture. 
Sections 4.2 to 4.4 describe and evaluate individual pieces of our techniques, including top term selection and automatic query expansion (Section 4.2), overlay source multicast (Section 4.3), and balancing term list distribution (Section 4.4). We analyze eSearch’s system resource usage and compare it with the Global-P2P systems in Section 4.5. Related

work is discussed in Section 4.6. Section 4.7 concludes this chapter.

Figure 4.1: System architecture of eSearch.

4.1 System Architecture

Figure 4.1 depicts eSearch’s system architecture. A large number of computers are organized into a structured overlay network (Chord [243]) to offer IR service. Nodes in the overlay collectively form an eSearch Engine. Inside the Engine, nodes have completely homogeneous functions. A client (e.g., node X) intending to use eSearch connects to any Engine node (e.g., node E) to publish documents or submit queries. Engine nodes are also user nodes that can initiate document publishing or searches on behalf of their users. Among the nodes in the P2P system, we intentionally distinguish server-like Engine nodes, which are stable and have good Internet connectivity, from the rest. Previous measurement studies of P2P systems [224] show that a significant percentage of P2P nodes join P2P systems for just a short period of time, although there are also many stable nodes. Excluding ephemeral nodes from the Engine avoids unnecessary maintenance operations. Moreover, not every stable node needs to be included in the Engine, so long as the Engine has enough capacity to offer the desired level of quality of service [268]. When a new node joins the P2P system, it starts as a client. After being stable for a threshold time (e.g., 20 minutes), and if the load inside the Engine is high, it joins the Engine to take over some load, using the protocol described in Section 4.4. Data stored on an Engine node are replicated on its neighbors; should a node fail, one of its neighbors takes over its job seamlessly. The boundary of the Engine is adjustable: when the load inside the Engine is high, more nodes can be recruited. Inside the Engine, nodes are organized into a ring topology corresponding to an ID space ranging from 0 to 2^m − 1, where m = 160. Each node is assigned an ID drawn from this ID space and is responsible for the key range between its ID and the ID of the previous node on the ring. Each term is hashed into a key in the ID space. The node whose ID immediately follows the term’s key is responsible for the term. For instance, node B is responsible for the inverted lists for the terms “computer” and “information”. Looking up the node responsible for a given key is done through routing in the overlay. With the help of additional links not shown in Figure 4.1, Chord on average routes a message to its destination in O(log N) hops, where N is the number of nodes in the overlay. Document metadata are organized based on the hybrid indexing structure. To publish a document, a client sends the document to the Engine node that it connects to.
The Engine node identifies top (important) terms in the document and disseminates its complete term list to nodes responsible for those top terms, using overlay source multicast to economize on network bandwidth (see Section 4.3). A client X starts a search by submitting a query to the Engine node E that


it connects to, which then uses overlay routing to forward the query to the nodes responsible for the terms in the query. Each of those nodes does a local search, identifies a small number of best matching documents, and returns the ID and relevance score (a numerical value that quantifies the relevance between a document and a query) of those documents to node E. Node E gives the returned documents a global rank based on their relevance scores and presents the top documents to client X. To improve search quality, node E may expand the query with more relevant terms learned from the returned documents, start a second round of search, and then present the final results to client X (see Section 4.2.2). To avoid processing the same query repeatedly, and also to alleviate hot spots corresponding to popular queries, query results are cached for a certain amount of time at the nodes that processed the query and along the paths over which the query was forwarded. If a query arrives at a node with live cached results, those results are returned immediately.

4.2 Top Term Selection and Automatic Query Expansion

Our hybrid indexing structure supports efficient search. The drawback, however, is that it publishes more metadata. The metadata for a document are replicated multiple times (under each term in the document), requiring more storage and potentially more communication. Our first optimization is to avoid publishing a document under stopwords: words that are too common to have any real effect on differentiating one document from another, e.g., the word “the”. This simple optimization results in great savings, since a significant portion of a document’s content can be stopwords. Our second optimization is to use the vector space model (VSM) [233] to identify the top (important) terms in a document, and to publish its term list only to the nodes responsible for those terms.

4.2.1 Top Term Selection

We first provide an overview of VSM. VSM assigns a weight to each term in a document or query. Terms central to a document automatically receive a heavy weight. VSM computes the relevance between a document D and a query Q as

relevance(D, Q) = \sum_{t \in D, Q} D_t \cdot Q_t    (4.1)

where t is a term appearing in both document D and query Q, D_t is term t’s weight in document D, and Q_t is term t’s weight in query Q. Documents with the highest relevance scores are returned as search results. The weight of a term is determined by several factors, including the length of the document, the frequency of the term in the document, and the frequency of the term in other documents. Intuitively, if a term appears in a document with a high frequency, there is a good chance that the term could be used to differentiate the document from others. However, if the term also appears in many other documents, its importance should be penalized. A myriad of term weighting schemes have been proposed, among which Okapi [210, 233] has been shown to be particularly effective. For instance, among the eight systems that achieved the best performance in the TREC-8 ad-hoc track [263], five were based on Okapi. We adopt Okapi in eSearch. With Okapi, the relevance between a document D and a query Q is computed as

relevance(D, Q) = \sum_{t \in Q, D} \ln\frac{N - N_t + 0.5}{N_t + 0.5} \cdot \frac{(k_1 + 1) d_t}{k_1 ((1 - b) + b \frac{P}{L}) + d_t} \cdot \frac{(k_3 + 1) q_t}{k_3 + q_t}    (4.2)

Here, N is the total number of documents in the corpus. N_t is the number of documents that contain term t. d_t is the frequency of term t in document D. q_t is the frequency of term t in query Q. P is the length of document D (in bytes). L is the average document length. k_1 (between 1.0 and 2.0), b (usually 0.75), and k_3 (between 0 and 1000) are constants. In our current implementation, k_1 = 1.2, b = 0.75, and k_3 = 7. The right-hand side of Equation 4.2 consists of three parts. We treat the first two parts as the weight for document terms (D_t) and the third part as the weight for query terms (Q_t). We first evaluate whether it is true that some terms in a document are much more important than others. The SMART system [35] developed at Cornell implements a framework for VSM. The SMART package consists of about 70,000 lines of C code, of which we use about 45,000 lines in our system. We extended SMART with an implementation of Okapi. SMART comes with a list of 571 stopwords, which we use as is in our experiments. The SMART stemmer is used without modification to strip word endings; for instance, “books” and “book” are treated as the same term. The corpus we use comprises disks 4 and 5 from TREC [263], excluding the Congressional Record. We use the TREC corpus in our evaluation because it is considered a representative heterogeneous corpus. It comes with a set of carefully constructed queries and manually selected relevant documents for each query, against which one can quantitatively evaluate the search quality of a system. It includes 528,543 documents from such sources as news and magazines. The average length of the documents is 3,778 bytes. The total size of the corpus is about 2GB. A TREC document on average has 231 unique terms; after stopword removal and stemming, a document on average has 153 unique terms. Since real queries are usually short, we use only the title field of topics 351-450 as queries. A query on average contains 2.4 terms. For each document, we sort its terms by decreasing Okapi weight and compute the relative weight of each term with respect to the biggest term weight in the document. (Note that the terms are those left after stopword removal and stemming.)
We average this normalized term weight across all documents, computing a mean for each term rank, and report these means in Figure 4.2. Note that the Y axis is in

log scale. The normalized term weight decreases exponentially as the term rank increases, and the weight for the top 20 terms drops even faster. This confirms our intuition that a small number of terms in a document are much more important than the others. One analogy is that these terms are like the words in the “Keywords” section of technical papers, except that eSearch extracts them automatically. We use term weights to guide the replication of term lists. Carmel et al. [45] used similar information to guide index pruning at a centralized site.

Figure 4.2: Ranked term weight of the TREC corpus, normalized to the biggest term weight in each document.

4.2.2 Automatic Query Expansion

Only publishing a document under its top terms may degrade the quality of search results. If none of the query terms is among the top terms for a document, eSearch cannot return this document for this query. (We do not require all query terms to be among the top terms for the document, since we always replicate a document’s term list in its entirety.) Our argument is that, since none of the query terms is among the top terms for the document, IR algorithms are unlikely to rank this document among the best matching documents for this query anyway. Usually a query is short, with an average of 2.4 terms. If all the document term weights D_t in Equation 4.1 are small, the relevance score for this document is likely to be smaller than the relevance scores for truly relevant documents (note that Q_t is the same for all documents). Thus, the top search results for this query are unlikely to be affected by skipping this document. We adopt automatic query expansion [169] (also called automatic relevance feedback) to alleviate this problem. We draw on the observation that, with more terms in a query, it is more likely that a document relevant to the query is published under at least one of the query terms. This scheme automatically expands a short query with additional relevant terms. It has been shown to be an important technique for improving performance in centralized IR systems [233]. We experimented with several query expansion techniques and found that complex ones such as that in [169] only marginally improve search quality on the TREC corpus compared with simple ones. For clarity, we describe below a simple scheme that is a degenerate form of [169] but has the same performance. Given a query, eSearch first uses the hybrid indexing structure to retrieve a small number f of best matching documents. We call these documents feedback documents. For each term in the feedback documents, the Engine node that starts the search on behalf of a client computes the term’s average weight in the feedback documents and chooses the k terms with the biggest average weight. These terms are assumed to be relevant to the query and are added to the query. The new query is then used to retrieve the final set of documents. Recall that VSM assigns a weight for query terms as it does for document terms (see Equation 4.1).
The weight for an expanded query term, which VSM does not assign, is its average weight in the feedback documents divided by a constant α. Through experiments we found that f = 10 and α = 16 work well. Alternatively, one may set α to a very large number, which makes the overall ranking equivalent to that without the expanded query terms while still searching the nodes corresponding to the expanded terms, thereby reducing the chance of missing relevant documents. We illustrate these steps with an example. Suppose VSM identifies “routing” as the only important term for a document D. Given the query “computer network”, eSearch first retrieves documents with either “computer” or “network” as one of their important terms. Examining the retrieved documents, eSearch finds that “routing” is an important common word among them. It therefore expands the query to “computer network routing”, assigning “routing” a smaller weight than the original query terms “computer network”. The expanded query is then used to retrieve the final set of documents, and this time it can find document D. Note that, in practice, a document relevant to a query need not contain all terms in the query, and documents containing all terms in a query are not necessarily relevant to it.
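The expansion step described above can be sketched as follows. This is a simplified illustration rather than eSearch's actual implementation; the dictionary-of-weights representation and the helper name expand_query are assumptions made for clarity.

```python
from collections import defaultdict

def expand_query(query_weights, feedback_docs, k=30, alpha=16):
    """Expand a query with the k terms that have the largest average
    weight across the feedback documents (illustrative sketch)."""
    f = len(feedback_docs)
    avg = defaultdict(float)
    for doc in feedback_docs:            # doc: {term: weight}
        for term, weight in doc.items():
            avg[term] += weight / f
    # Choose the k highest-weighted terms not already in the query.
    new_terms = sorted((t for t in avg if t not in query_weights),
                       key=lambda t: avg[t], reverse=True)[:k]
    expanded = dict(query_weights)
    for term in new_terms:
        # An expanded term's weight is its average weight in the
        # feedback documents divided by the constant alpha.
        expanded[term] = avg[term] / alpha
    return expanded
```

With the “computer network” example above and feedback documents in which “routing” carries large weights, the expanded query gains “routing” with a weight scaled down by α.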

4.2.3 Generating Global Statistics

Okapi relies on some global statistics to compute term weights, e.g., the average document length and the popularity of terms (Nt in Equation 4.2). Previous work [107, 196] has shown that statistical IR algorithms work well with estimated statistics. eSearch uses a combining tree to sample documents, merge statistics, and disseminate the combined statistics. The initial copy of the statistics is generated using seed documents similar to those that will be used in the specific P2P community. Over time, a combining tree that approximates the underlying Internet topology and includes randomly chosen nodes is used to sample documents and to merge statistics, using either our overlay multicast protocol [290] or a protocol similar to Bayeux [313] or Scribe [218].
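A minimal sketch of the per-node sampling and pairwise merging performed along the combining tree might look as follows. The statistics representation (document count, total length, and per-term document frequency) is an assumption, chosen because it suffices to derive the average document length and Nt:

```python
from collections import Counter

def local_stats(docs):
    """Per-node sample statistics over a list of documents,
    where each document is given as a list of terms."""
    df = Counter()
    total_len = 0
    for terms in docs:
        total_len += len(terms)
        df.update(set(terms))        # count each term once per document
    return {"n_docs": len(docs), "total_len": total_len, "df": df}

def merge_stats(a, b):
    """Merge two partial statistics; applied bottom-up along the
    combining tree until the root holds the global estimate."""
    return {"n_docs": a["n_docs"] + b["n_docs"],
            "total_len": a["total_len"] + b["total_len"],
            "df": a["df"] + b["df"]}
```

At the root, the average document length is total_len / n_docs and the per-term document frequency df[t] serves as the sampled estimate of Nt.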


The size of the statistics, which includes the dictionary and the IDF for the words in the dictionary,2 grows slowly with the size of the P2P community: word usage follows the Zipf distribution, so the number of new words increases slowly with the size of the corpus. We expect the statistics to change slowly, at the rate of weeks or even months [277], because statistics are more stable than the documents themselves, especially for a large corpus. The root of the combining tree periodically disseminates the statistics through the tree, attaching a new version number to each update. The root is the node owning a well-known key in Chord; if it fails, one of its neighbors takes over its role automatically. Upon receiving an update from the spanning tree, a node sets a timer X to 2T, where T is the estimated time it takes to propagate an update throughout the entire combining tree. After timer X expires, nodes include the version number of the statistics in keep-alive messages, recursively detecting and updating neighbors with outdated versions. Upon receiving an update, every node, whether or not it is in the spanning tree, sets a timer Y to P/2, where P is the statistics update period and P >> T. Both T and P are themselves statistics that change gradually and are propagated to all nodes. Nodes keep the old statistics until timer Y expires. We call the time span after receiving an update and before timer Y expires the transition period. Each query carries the version number of the statistics that it intends to use. During the transition period, a query is first processed under the old statistics. If it fails because one of the involved nodes has already discarded the old statistics, the query initiator is notified and a second round is started under the new statistics. Because Y is set long enough, all nodes involved in a query are guaranteed to share a common set of statistics, either the old one or the new one.

2 It is our conjecture that one can omit the IDF for rare words (and hence substantially reduce the size of the dictionary) without severely affecting retrieval quality. An evaluation of this conjecture is a subject for future work.


In summary, timer X allows the statistics to propagate through the efficient combining tree before the blind flooding starts, and timer Y allows eSearch to work properly during the transition period. eSearch does occasionally need to flood the statistics, but this happens at a rate several orders of magnitude lower than the query flooding in Gnutella.
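The retry behavior during the transition period can be illustrated with a toy model. Node, search, and StaleStatistics are hypothetical names, and version numbers stand in for full statistics sets; this is not eSearch's actual code.

```python
class StaleStatistics(Exception):
    """Raised when a node has already discarded the requested version."""

class Node:
    """A node keeps the old statistics until its timer Y expires;
    here, `versions` is the set of statistics versions it still holds."""
    def __init__(self, versions):
        self.versions = set(versions)

    def search(self, query, version):
        if version not in self.versions:
            raise StaleStatistics(version)
        return (query, version)      # placeholder for real ranking

def process_query(query, old, new, nodes):
    """First round under the old statistics; if any node has already
    discarded them, the initiator retries under the new version."""
    try:
        return [n.search(query, old) for n in nodes]
    except StaleStatistics:
        return [n.search(query, new) for n in nodes]
```

Because timer Y is long relative to the propagation time, every node involved in the retry is guaranteed to hold the new version.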

4.2.4 Experimental Results

We experiment with the TREC corpus to determine proper parameters for eSearch, including the number of top terms under which a document is published and the number of expanded terms for a query. We use the “title” field of TREC topics 351-450 as queries. On average, each query consists of 2.4 terms and has 94 manually identified relevant documents in the corpus. Note that this identification is subjective and based on user input, which implies that a document containing all terms in a query is not necessarily relevant to the query, and a document relevant to a query need not contain all terms in the query. The metric we use to quantify the quality of search results is precision, defined as the number of retrieved relevant documents divided by the number r of retrieved documents. For instance, prec@15=0.4 means that, when returning 15 documents for a query, about 40% of the returned documents will be judged by users as truly relevant to the query. prec@r varies with r. The average precision for a single topic is the mean of the precision obtained after each relevant document is retrieved (using zero as the precision for relevant documents that are not retrieved). We are particularly interested in high-end precision (e.g., prec@15), as users usually view only the top 10 search results [143]. Figure 4.3 reports eSearch’s precision with respect to the number of terms under which a document is published (shown on the X axis). The “expansion” and “no expansion” series are with and without query expansion, respectively. In this experiment, we set the number of expanded terms to 30. “All” means that


Figure 4.3: eSearch’s precision with respect to the number of terms under which a document is published. The performance of the “all” series is equivalent to that of a centralized implementation of Okapi.

Figure 4.4: The average number of retrieved relevant documents for a query when returning 1,000 documents.


a document is published under all its terms; its precision is then equivalent to that of a centralized implementation of Okapi. Two observations can be drawn from this figure. (1) eSearch can approach the precision of the centralized Okapi by publishing a document under a small number of selected terms, e.g., the top 20 terms. (2) Automatic query expansion improves precision, particularly when documents are published under very few top terms. We next examine the performance when returning a large number of documents for each query, the case when a user wishes to retrieve as many relevant documents as possible. Figure 4.4 reports the average number of retrieved relevant documents for a query when returning 1,000 documents. Query expansion increases the number of retrieved relevant documents by 22%. The “expansion” series catches up with the centralized Okapi earlier than the “no expansion” series, showing that query expansion allows eSearch to publish documents under fewer terms while matching the performance of centralized systems. Figure 4.5 shows the precision with respect to the number of expanded query terms. In this experiment, each document is published under its top 20 terms. The high-end precision is not very sensitive to query expansion, and the improvement in average precision slows down once the number of expanded terms exceeds 10. Since adding more terms to a query results in searching more nodes, this figure indicates that the benefit of query expansion can be reaped with limited overhead in eSearch. Comparing Figures 4.3 and 4.4, we find that the top search results are relatively insensitive to top term selection, whereas the low-rank search results are affected more severely. Accordingly, query expansion is most useful for improving low-end precision rather than high-end precision. This observation leads to a further optimization. Given a new query, eSearch first retrieves a small number of documents without query expansion.
If the user is unsatisfied with the results, eSearch uses



Figure 4.5: eSearch’s precision with respect to the number of expanded query terms.


Figure 4.6: Comparison of precision-recall between eSearch and the centralized Okapi.


these documents as feedback documents to expand the query and does a second round of search to return more documents. In addition to precision, another metric that quantifies the quality of search results is recall, defined as the number of retrieved relevant documents divided by the total number of relevant documents. This metric is used when the user wishes to retrieve as many relevant documents as possible. A precision-recall curve reflects the precision achieved at a given recall level. Figure 4.6 compares precision-recall curves for eSearch and the centralized Okapi when returning 1,000 documents for each query. eSearch in this experiment publishes each document under its top 20 terms. When expansion is used, both eSearch and the centralized Okapi add 30 terms to the query. In both cases (with and without query expansion), eSearch closely approximates the performance of the centralized Okapi. Query expansion improves low-end precision and recall but has little impact on high-end precision, consistent with the results in Figure 4.3. Figures 4.3-4.5 show that eSearch’s average retrieval quality is as good as that of the centralized Okapi. In Table 4.1, we examine the difference between eSearch and the baseline for individual queries. When documents are published under only a few terms, eSearch performs badly for a few queries. For instance, when publishing documents under only their top 5 terms, there is one query for which eSearch finds no relevant documents whereas the baseline finds 9. As the number of selected top terms increases, eSearch’s performance improves quickly. When publishing documents under their top 20 terms, the worst case for eSearch is one query for which it retrieves four fewer relevant documents; for 90 queries, eSearch and the baseline perform the same.
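For reference, the three metrics used in this chapter (prec@r, recall, and average precision) can be computed as follows. This is a sketch of the standard IR definitions given in the text, not code from eSearch:

```python
def precision_at(ranked, relevant, r):
    """prec@r: fraction of the top r results that are relevant."""
    return sum(1 for d in ranked[:r] if d in relevant) / r

def recall_at(ranked, relevant, r):
    """Recall: retrieved relevant documents over all relevant ones."""
    return sum(1 for d in ranked[:r] if d in relevant) / len(relevant)

def average_precision(ranked, relevant):
    """Mean of the precision obtained after each relevant document is
    retrieved, with zero precision for relevant documents that are
    never retrieved (as in the TREC evaluation used here)."""
    hits, total = 0, 0.0
    for i, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant)     # unretrieved ones contribute 0
```

For example, if three documents are relevant and the ranked list places them at positions 1 and 3 only, the average precision is (1/1 + 2/3 + 0)/3.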
We also experimented with queries of different original lengths, ranging from an average of 2.4 terms to 21 terms, by also using the “description” and/or “narrative” fields of TREC topics as part of the query. The precision and recall of the


difference   top 5 terms   top 10 terms   top 20 terms
d=3                1             0              0
d=2                3             2              0
d=1                5             3              3
d=0               53            73             90
d=-1              15             9              5
d=-2               6             8              0
d=-3               4             3              1
d=-4               3             0              1
d=-5               3             1              0
d=-6               3             0              0
d=-7               3             1              0
d=-8               0             0              0
d=-9               1             0              0

Table 4.1: Difference in the number of retrieved relevant documents between eSearch and the centralized Okapi when returning 10 documents for each query. The columns correspond to eSearch with different configurations (without query expansion). An entry with value p in the “d=k” row (e.g., the entry with value 8 in the “d=-2” row) means that, out of the 100 TREC queries, p queries return k more relevant documents in eSearch than in the baseline (or fewer relevant documents if k<0). For instance, when publishing documents under their top 10 terms, for 8 queries eSearch returns 2 fewer relevant documents than the baseline, and for 73 queries eSearch performs the same as the baseline. For some queries, eSearch does better than the baseline because of the inherent fuzziness in Okapi’s ranking function (i.e., focusing only on important terms can sometimes actually improve retrieval quality).


Figure 4.7: Sensitivity of Okapi to the sampling of global statistics.

centralized system improves with more query terms. Similar to what we observed with automatic query expansion, longer queries allow eSearch to approach the performance of centralized systems while publishing a document under an even smaller number of terms. This is because, with more terms in a query, it is more likely that a document relevant to the query is published under at least one of the query terms.

In Figure 4.7 we evaluate the impact on retrieval quality of using sampled global statistics for Okapi, e.g., the average document length L and the inverse document frequency (IDF) Nt/N in Equation 4.2. In this experiment, the fraction of documents that are sampled through the process described in Section 4.2.3 to generate the global statistics varies from 1/128 to 1/65,536. Query expansion is not used. We retrieve 1,000 documents for each query and report the average number of retrieved relevant documents per query and the average precision. Note that the Y axes do not start from 0. “full” (on the X axis) is


the centralized implementation that has the exact statistics (i.e., it “samples” all documents). For the “no idf” case, we assume no information regarding IDF (i.e., we always set Nt = 1) but a perfectly estimated average document length (we found that a very accurate estimate of the average document length can be extracted from a very small number of document samples). This hypothetical configuration tests the importance of the IDF component of Okapi’s equation. On average, the centralized Okapi finds 50.73 relevant documents for a query. With a 1/1024 sample rate, it finds 49.74 documents. With no knowledge about IDF but a perfect average document length (the “no idf” case), it still finds 43.99 documents, showing that Okapi is not very sensitive to IDF. Experiments not presented here reveal that Okapi is more sensitive to the average document length, which fortunately can be estimated very accurately. Overall, Okapi is insensitive to the sampling of global statistics. All eSearch experiments presented previously use a sample rate of 1/256.

4.2.5 Discussions

Our evaluation so far has assumed that eSearch always publishes documents under the same number of top terms. This number can in fact vary from document to document; intuitively, it may be more reasonable to publish long documents under more terms. Our current implementation includes a simple heuristic that publishes documents with large term weights under more terms. A complete comparison of heuristics for top term selection is a subject of future work. It should be emphasized that eSearch is not tied to any particular document ranking algorithm. For documents with cross-reference links, link analysis algorithms such as PageRank [33] and HITS [131] can be incorporated, for instance, by combining PageRank with VSM to assign term weights [155] or by publishing important Web pages (as identified by PageRank) under more terms.


As storage space is abundant in massive P2P systems, the main factor limiting the number of terms under which a document is published is network bandwidth. When an Engine node receives a document for the first time, it publishes the document only under a limited number of top terms. As the network becomes underutilized (e.g., at night), it can incrementally publish the document under more terms to make the document easier to retrieve. eSearch adapts to knowledge learned from document retrieval patterns. It records the number of times a document is downloaded; popular documents are published under more terms so that they become easier to find. eSearch also learns from query patterns. Suppose a document D is published only under term a. A user submits a query containing terms a and b, finds D, identifies it as relevant, and finally downloads D. eSearch thereby learns that term b is also important for D. If many such retrievals occur, eSearch will eventually publish D under term b so that a single-term query on b can also find D. Although eSearch does not publish a document under all its terms, it is not difficult for eSearch to find documents containing a rare term, so long as IR algorithms can identify the term as important for those documents. For instance, a search on “eSearch” can find this paper easily, because “eSearch” appears in this paper frequently and is not a popular word. The worst-case scenario for eSearch is a query about a term that is not important in any document, e.g., the phrase “because of”. eSearch then cannot return any result even though documents containing the term do exist. The techniques described above alleviate this problem but cannot rule it out completely. In this case, if the node responsible for the hashed key of the term sees the query repeatedly, it starts a gossip in the overlay, shouting “somebody is looking for term X!”. Each node accumulates gossips and disseminates them occasionally.
After receiving this gossip, nodes that have information about this term will publish


corresponding documents under this term. This is possible because a node stores the complete term lists for documents, including terms not important in those documents. If the user wants results immediately, it is always possible to flood a query to every node, degrading eSearch’s performance to that of Gnutella in this rare case. Alternatively, eSearch can be combined with a global indexing system: by default, queries are processed by eSearch, and in the rare case that eSearch cannot find matching documents, it resorts to global indexing, which unfortunately may require transmitting inverted lists over the network. A full evaluation of these enhancements is a subject for future work. Although the worst-case scenario does exist in theory, in practice eSearch’s search quality on the widely used TREC benchmark corpus is as good as the centralized baseline. To achieve good retrieval quality, the absolute number of selected top terms may vary from corpus to corpus, but we expect it to be a small percentage of the total number of unique terms in a document, as the importance of terms in a document decreases quickly (see Figure 4.2). The same trend also holds for several other corpora we tested, including MED, ADI, and CRAN (available in the SMART package [35]).

4.3 Disseminating Document Metadata

Given a new document, eSearch uses Okapi to identify its top terms and distributes its term list to the nodes responsible for those terms. Since the same data are sent to multiple recipients, one natural way to economize on bandwidth is to multicast the data to the recipients. Due to a variety of deployment issues, IP multicast is not widely supported in the Internet. Instead, several overlay multicast systems have recently been proposed, e.g., Narada [61]. These systems usually target multimedia or content distribution applications, where a multicast session is long; they can therefore amortize the overhead of network measurement, multicast tree creation, adaptation, and destruction over a long data session. The scenario for term list dissemination, however, is quite different. Typically, a document is only several kilobytes and its term list is even smaller. As a result, eSearch disseminates data through a large number of extremely short sessions. Moreover, the recipients of each session tend to differ, as each document has its own set of top terms, so a single shared tree cannot serve all the sessions. Although the absolute saving from multicast in a single session is small, the aggregate savings over a large number of sessions would be substantial, provided that the protocol overhead for multicast tree creation and destruction is very small. To this end, we propose overlay source multicast to distribute term lists. It uses no costly messaging to explicitly construct or destroy the multicast tree. Assisted by Internet distance estimation techniques such as GNP [178], the data source locally computes the structure of an efficient multicast tree that has itself as the root and includes all recipients. It builds an application-level packet with the data to be disseminated as payload and the structural information of the multicast tree as header, which specifies the IP addresses of the recipients and the parent-child relationships among them. It then sends the packet to the first-level children in the multicast tree. A recipient of this packet inspects the header to find its children in the multicast tree, strips the information for itself from the header, and forwards the rest of the packet to its children, and so forth. Source multicast is particularly suitable for eSearch’s short multicast sessions, as it needs no costly distributed messaging to construct or destroy the multicast tree.
On the other hand, it may not be suitable for multicast with a very large number of recipients, as the length of the packet header may then exceed that of the packet payload. In the following, we provide more detail on the method that the data source


uses to compute the structure of the multicast tree. Our method is based on GNP. A node using GNP measures RTTs to a set of well-known landmark nodes and computes coordinates for itself in a high-dimensional Cartesian space. The Internet distance between two nodes is estimated as the Euclidean distance between their coordinates. More specifically, a set of stable, well-known nodes are designated as landmark nodes. Landmark nodes measure latencies among themselves and compute coordinates for themselves. A user node measures latencies to the landmark nodes, contacts a registry to obtain the landmark nodes’ coordinates, and computes coordinates for itself based on the measured latencies and the landmark nodes’ coordinates. See [178] for further detail. When a node S wishes to send a term list to the nodes responsible for the top terms in a document, it performs concurrent DHT lookups to locate all recipients. Each recipient R replies directly to node S with its IP address and its GNP coordinates, using UDP to avoid the overhead of establishing a TCP connection between nodes R and S. A retransmission is needed only if a packet loss (detected through a timeout) actually happens. After hearing from all recipients, node S locally builds a fully connected graph (a clique) with itself and the recipients as vertices. It annotates each edge with the estimated Internet distance, i.e., the Euclidean distance between the coordinates of the two nodes incident to the edge. In practice, other factors such as bandwidth and load could also be considered when assigning edge weights. Finally, it runs a minimum spanning tree algorithm over the graph to find an efficient multicast tree. The lookup results for recipients’ IP addresses and GNP coordinates are cached and reused for a certain amount of time. Stale information used before it times out is detected when a node receives a term list that it should not be responsible for, which triggers a recovery mechanism to find the correct recipient.
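The source's local computation can be sketched as follows. Prim's algorithm is one concrete choice of minimum spanning tree algorithm (the text does not specify which is used), Euclidean distance over GNP coordinates stands in for the estimated Internet distance, and the packet header is modeled simply as a parent-to-children map:

```python
import heapq
import math

def mst_tree(coords, root):
    """Prim's algorithm over estimated distances. coords maps each
    node to its GNP coordinates; returns {node: [children]} rooted
    at the data source."""
    dist = lambda a, b: math.dist(coords[a], coords[b])
    children = {n: [] for n in coords}
    in_tree = {root}
    heap = [(dist(root, n), root, n) for n in coords if n != root]
    heapq.heapify(heap)
    while heap:
        d, parent, n = heapq.heappop(heap)
        if n in in_tree:
            continue                 # stale heap entry
        in_tree.add(n)
        children[parent].append(n)
        for m in coords:
            if m not in in_tree:
                heapq.heappush(heap, (dist(n, m), n, m))
    return children

def forward(node, tree, payload, send):
    """Each recipient forwards the payload to its children as listed
    in the header; `send` abstracts the transport."""
    for child in tree[node]:
        send(node, child, payload)
        forward(child, tree, payload, send)
```

In the real protocol the tree structure travels inside the packet header and each hop strips its own entry; here the shared `tree` map plays that role for brevity.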
We use simulations to quantify the savings from overlay source multicast. Three data sets are used in our experiments. The first one is derived from a


Figure 4.8: Cost of overlay source multicast normalized to that of using separate unicast to deliver data.

1,000-node transit-stub graph generated by GT-ITM [39]. We use its default edge weights as the link latencies. The second is derived from NLANR’s one-day RTT measurements among 131 sites scattered across the US [180]. After filtering out some unreachable sites, we are left with 105 fully connected sites and the latency between each pair of them; we use the median of the one-day measurements as the end-to-end latency. The third is an Internet Autonomous System (AS) snapshot taken by the Route Views Project [183] on April 28, 2003, with a total of 15,253 ASes recorded. Since the latencies between adjacent ASes are unknown, we assign latencies randomly between 8ms and 12ms. For the GT-ITM and AS data sets, we randomly assign user nodes to routers or ASes; the end-to-end latency between two nodes is computed as the latency of the shortest path between them. Figure 4.8 reports the performance of overlay source multicast using these data sets. In all experiments, we use the k-means clustering algorithm to select 15 center nodes as landmarks and then use GNP’s technique to compute node coordinates in a 7-dimensional Cartesian space. For each data set, we randomly


choose some nodes (shown on the X axis) as recipients to join the multicast tree. The cost of a multicast tree is the sum of the costs of all its edges. The Y axis is the multicast tree’s cost normalized to that of using separate unicast to deliver the term list to the recipients. For each data set, the “optimal” curve is the normalized cost of the optimal minimum spanning tree, assuming that the real latency between each pair of nodes is known. The “real” curve (the result of our scheme) is the normalized cost of the minimum spanning tree constructed using estimated Internet distances. As can be seen from the figure, with 20 nodes in the tree, multicast reduces the communication cost by up to 72%. In all cases, the performance of the “real” curve approaches that of the “optimal” curve, indicating that GNP estimates Internet distance with reasonable precision. The performance gap between the “optimal” and “real” curves for the NLANR data set widens as the number of tree nodes increases, because GNP is not very accurate at estimating short distances. Some of NLANR’s monitoring sites are very close to each other, particularly on the East Coast [180], and this data set has many RTTs under 10ms. With more nodes added to the multicast tree, there is a bigger chance that some of them are very close to each other (note that the NLANR data set has only 105 candidate nodes from which the tree nodes are selected). Accordingly, GNP’s estimation error increases and the resulting minimum spanning tree is farther from optimal.

4.4 Balancing Term List Distribution

One problem not addressed in previous studies on P2P keyword search [102, 145, 205, 245] is the balance of metadata distribution across nodes. Because the popularity of terms varies dramatically, nodes responsible for popular terms would store much more data than others. A traditional approach to balancing load in DHTs is the virtual server scheme [243], in which a physical node functions as several virtual nodes when joining the P2P network. As a result, the number of routing neighbors that a physical node monitors increases proportionally. More importantly, we find that virtual server is incapable of balancing load when the lengths of inverted lists vary dramatically. Our solution combines two techniques. First, we slightly modify Chord’s node join protocol: a new node performs lookups for several random keys and, among the nodes responsible for these keys, chooses the one storing the largest amount of data and takes over some of its data. Second, each term is hashed into a key range rather than a single key. For an unpopular term, the key range is mapped to a single node. For a popular term, the key range may be partitioned among multiple nodes, which collectively store the inverted list for that term. Our experiments show that these two techniques effectively balance the load even when the lengths of inverted lists vary dramatically. In addition, they avoid the extra maintenance overhead that virtual server introduces.

4.4.1 Node Join Protocol

In Chord, a new node performs a lookup for a random key Kr and splits the key range of the node n that is responsible for this key. Originally, node n is responsible for key range [Knb, Kne]. After the split, the new node and node n are responsible for key ranges [Knb, Kr] and [Kr + 1, Kne], respectively. With virtual server, the new node functions as several virtual nodes, and each virtual node executes this join protocol once. Instead of splitting the key range of a random node, our technique seeks to split the key range of an overloaded node. In eSearch, a new node performs lookups for k random keys. Among the nodes responsible for these keys, it chooses the one that stores the largest amount of data and splits its key range. If nodes have heterogeneous storage capacities, it instead chooses the one with the highest relative load. After a node joins, eSearch incurs none of the extra maintenance overhead of the virtual server scheme: each node monitors the same number of routing neighbors as in default Chord.
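The modified join decision can be sketched as follows. FakeChord and FakeNode are hypothetical stand-ins for the real Chord lookup interface and per-node load accounting; only the sampling logic mirrors the protocol described above.

```python
import random

class FakeNode:
    """Minimal stand-in for a Chord node (load = amount of data stored)."""
    def __init__(self, name, load):
        self.name, self.load = name, load

class FakeChord:
    """Maps each key to the node responsible for it (assumed interface)."""
    def __init__(self, nodes):
        self.nodes = nodes
    def lookup(self, key):
        return self.nodes[key % len(self.nodes)]

def choose_split_target(chord, k=10):
    """A joining node looks up k random 160-bit keys and splits the
    key range of the most heavily loaded responsible node."""
    candidates = {chord.lookup(random.randrange(2**160)) for _ in range(k)}
    # With heterogeneous capacities, compare load / capacity instead.
    return max(candidates, key=lambda n: n.load)
```

Because the k candidates are reached through random lookups, heavily loaded key ranges are split with a probability that grows with k, without any global load directory.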

Figure 4.9: Comparison of load balancing techniques using equal-size objects. “baseline” is the basic Chord without virtual server. “vs (k)” is Chord with k virtual nodes running on each physical node. “split (k)” is our technique, in which a new node performs k random lookups and splits the overloaded node.

Figure 4.9 compares these load balancing techniques. In this experiment, we distribute one million equal-size objects across a 10,000-node Chord. Nodes are sorted in decreasing order of the number of objects they store. The X axis shows the percentage of the total number of nodes in the system for which the Y axis gives the percentage of objects hosted by the corresponding nodes. The closer the curve is to linear, the more evenly the load is distributed. The figure shows that our technique balances load better than virtual server, particularly when k is small. Note that this benefit is achieved without virtual server’s maintenance overhead.



Figure 4.10: Length distribution of the TREC corpus’s inverted lists when documents are published under (a) top 20 terms; or (b) all terms.


4.4.2 Distributing Load for a Single Term

As the popularity of terms varies dramatically, so do the lengths of inverted lists. Figure 4.10 plots the length distribution of TREC’s inverted lists when documents are published under their top 20 terms or all terms; in the following, we refer to these as the “top 20 terms” load and the “all terms” load, respectively. Note that the Y axis is in log scale, and the numbers in Figure 4.10(b) are larger than those in Figure 4.10(a) by two orders of magnitude. The length distribution is skewed in both cases, particularly for the “all terms” load. Publishing documents under only their top 20 terms greatly shortens the long inverted lists corresponding to popular terms, since those terms are actually unimportant in many of the documents in which they appear. Because of this variation in the lengths of inverted lists, the load balancing techniques in Section 4.4.1 cannot work well on their own. Our complementary technique is to hash each term t into a key range [Ktb, Kte] rather than a single key. The key range of an unpopular term is mapped to a single node, whereas the key range of a popular term may be mapped to multiple nodes, which collectively store the inverted list of that term. Chord uses 160-bit keys. We partition a key into two parts: 140 high-order bits and 20 low-order bits. Given a document D, we first identify its top terms. For each top term t, we generate a key Kt. The high-order bits Kth of key Kt are the high-order bits of the SHA-1 hash [177] of the term’s text. Let Ktb = 2^20 * Kth. The low-order bits of Kt are generated randomly. We then store document D’s term list on the node responsible for key Kt. As a result, the term lists for documents that have term t as one of their important terms are stored in the key range [Ktb, Ktb + 2^20 - 1]. If this key range is partitioned among multiple nodes, the inverted list for term t is partitioned among those nodes automatically.
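The key construction can be sketched as follows; term_key and term_range are hypothetical helper names for the scheme described above (140 high-order bits from SHA-1, 20 random low-order bits):

```python
import hashlib
import random

LOW_BITS = 20  # low-order bits drawn at random per publication

def term_range(term):
    """The 2**20-key range that collectively holds a term's inverted
    list: the 140 high-order bits come from SHA-1, low bits span all."""
    digest = int.from_bytes(hashlib.sha1(term.encode()).digest(), "big")
    ktb = (digest >> LOW_BITS) << LOW_BITS   # zero the low-order bits
    return ktb, ktb + (1 << LOW_BITS) - 1

def term_key(term):
    """A publication key for one term list: deterministic high-order
    bits plus random low-order bits, so successive publications of the
    same term scatter uniformly across the term's key range."""
    ktb, _ = term_range(term)
    return ktb + random.randrange(1 << LOW_BITS)
```

If the nodes covering [Ktb, Kte] change as the network grows, the randomized low-order bits automatically spread the term's inverted list across whichever nodes currently own pieces of the range.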


The search process needs one change. A query containing term t is first routed to the node responsible for key Kt^b and is then forwarded along Chord’s ring until it reaches the node responsible for key Kt^b + 2^20 − 1. All nodes in this range participate in processing the query, since each holds part of the inverted list for term t. Our technique not only balances metadata distribution but also reduces search response time, as many nodes concurrently search different parts of the long inverted lists of popular terms.

The node join protocol also changes slightly. When a new node arrives, it obtains a random document (through any means) and randomly chooses k terms from the document that are not stopwords. For each chosen term t, it uses the process described above to compute a key Kt. Among the nodes responsible for these generated keys, it chooses the one that stores the largest amount of data to split. If nodes have heterogeneous storage capacities, it chooses the one with the highest relative load. We use random terms from a random document to generate the bootstrapping keys so that the distribution of these keys follows the inverted list distribution; long inverted lists are therefore split with higher probability. No other change to Chord is needed.

Importantly, no specific node maintains a list of the nodes that store the inverted list of a given term; our technique is completely decentralized. Any node participating in the partition of a term can fail independently. Node failure is handled by Chord’s default protocol. For instance, metadata stored on a node can be replicated on its neighbors; should a node fail, one of its neighbors takes over its job seamlessly.

As nodes come and go, an originally balanced load may become unbalanced; some nodes may take over more data from departing nodes than others. In the background, each underutilized node randomly samples load information from other nodes. If it finds a node that stores much more data than itself, it leaves its original position in the Chord ring to take over some of the data stored on the overloaded node. Each overloaded node executes a similar protocol, but seeks an underutilized node to take over some of its data. If an overloaded node cannot find any node to share its load, it starts to discard term lists published under relatively unimportant terms, or term lists of documents rarely retrieved by any user. This also prevents term lists aggressively published by the techniques described in Section 4.2.5 from accumulating in the system over time. We do not implement or evaluate this optimization in this dissertation.
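A minimal sketch of the join-time split selection might look as follows. The Node class, the lookup callback, and the default k = 10 are illustrative assumptions, not eSearch’s actual interfaces.

```python
import hashlib
import random

class Node:
    """Illustrative stand-in for a Chord node."""
    def __init__(self, name, stored_bytes, capacity=1.0):
        self.name = name
        self.stored_bytes = stored_bytes
        self.capacity = capacity

def bootstrap_key(term, low_bits=20):
    """A key inside the term's range: SHA-1 high bits plus random low bits."""
    digest = int.from_bytes(hashlib.sha1(term.encode()).digest(), "big")
    return ((digest >> low_bits) << low_bits) | random.getrandbits(low_bits)

def pick_node_to_split(doc_terms, stopwords, lookup, k=10):
    """Sample k non-stopword terms from a random document and, among the
    nodes responsible for the generated keys, pick the one with the
    highest relative load (stored data / capacity) to split."""
    candidates = [t for t in doc_terms if t not in stopwords]
    sample = random.sample(candidates, min(k, len(candidates)))
    nodes = {lookup(bootstrap_key(t)) for t in sample}  # responsible nodes
    return max(nodes, key=lambda n: n.stored_bytes / n.capacity)
```

Because the sampled keys follow the inverted list distribution, key ranges holding long lists are split with proportionally higher probability.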

4.4.3 Experimental Results

Figures 4.11(a) and (b) compare the load balancing techniques. The “baseline”, “vs”, and “split” curves are the same as those in Figure 4.9. “split+scatter” is our complete load balancing scheme, which maps a term into a key range and splits overloaded nodes. For virtual servers (“vs”), each physical node functions as 10 virtual nodes. For our techniques (“split” and “split+scatter”), a new node selects an overloaded node to split from among 10 random nodes. We distribute the “top 20 terms” load and the “all terms” load to a 10,000-node Chord, respectively. Although the “all terms” load is not how eSearch actually works, we choose it to represent a scenario in which the length distribution of eSearch’s inverted lists becomes extremely skewed because, for instance, term lists for a gigantic number of documents are stored in eSearch.

In Figures 4.11(a) and (b), the load for the “baseline” is unbalanced due to the large variation in the lengths of inverted lists. For the “all terms” load, 1% of the nodes store 21.5% of the term lists. Virtual servers improve the situation only marginally: for the “all terms” load, 1% of the nodes still store 20.3% of the term lists. Splitting overloaded nodes (“split”) performs better than virtual servers (“vs”), but only our complete technique (“split+scatter”) is able to balance the load when the inverted lists are extremely skewed, owing to its ability to partition the inverted list of a single term among multiple nodes.

Figure 4.11: Comparison of load balancing schemes using (a) the “top 20 terms” load; and (b) the “all terms” load. Each panel plots load percentage (Y) against node percentage (X) for the “baseline”, “vs”, “split”, and “split+scatter” schemes.

Figure 4.12: Scalability of our load balancing technique (“split+scatter”): (a) metadata distribution (load percentage vs. node percentage) for system sizes from 1,250 to 20,000 nodes; (b) average routing hops vs. system size for the “all terms”, “top terms”, and “baseline” configurations.


The rest of our experiments evaluate the scalability of “split+scatter”, using the “all terms” load and varying the number of nodes in the system from 1,250 to 20,000. The load distribution is reported in Figure 4.12(a). The curves overlap, indicating that “split+scatter” scales well with the system size.

Figure 4.12(b) shows the average number of routing hops in Chord under different configurations. “baseline” is the default Chord; the other two curves are “split+scatter” under different loads. All curves overlap, indicating that our modifications to Chord do not adversely affect its routing performance. The average number of routing hops still scales as O(log(N)), where N is the number of nodes in the system.

Figure 4.13(a) reports the number of terms whose inverted lists are stored on more than one node. Note that both the X axis and the Y axis are in log scale. As the number of nodes increases, the number of partitioned terms increases proportionally. The curve for the “top 20 terms” load grows faster: because its inverted lists are not extremely skewed, more of the added nodes are devoted to partitioning previously unpartitioned terms, whereas under the “all terms” load the added nodes are mainly used to repeatedly partition the terms with extremely long inverted lists.

Figure 4.13(b) reports the number of nodes that collectively host the longest inverted list. As the number of nodes increases, this inverted list is partitioned among proportionally more nodes. The degree of partitioning under the “all terms” load is higher than that under the “top 20 terms” load because the longest inverted list in the “all terms” load is much longer.

Overall, our load balancing technique scales well with the system size. It balances term list distribution well under different system sizes and does not affect Chord’s routing performance. As the system size increases, long inverted lists are automatically partitioned among more nodes. Since the technique works for two quite different loads (particularly the extreme “all terms” load), we expect it to also scale well with corpus size.


Figure 4.13: Scalability of our load balancing technique (“split+scatter”): (a) number of partitioned terms and (b) number of nodes that partition the longest inverted list, for system sizes from 1,250 to 20,000 nodes under the “all terms” and “top 20 terms” loads (both axes in log scale).


4.5 Analysis of System Resource Usage

In this section, we analyze eSearch’s system resource usage when publishing metadata for a document or processing a query, and compare it with that of P2P systems based on global indexing (so-called Global-P2P systems) [102, 145, 205, 245]. The analysis is based on statistics of the TREC corpus. Changing these statistics (e.g., the average document length) may affect the absolute results, but the relative comparison between eSearch and existing systems remains the same. We do not claim that the default values used in the analysis are representative of all situations; rather, we intend to give a flavor of eSearch’s resource usage.

We first summarize our major findings. Despite the use of metadata replication, eSearch actually consumes less bandwidth than the Global-P2P systems when publishing documents. During a retrieval operation, eSearch typically transmits 3.3KB of data. Both costs are independent of the size of the corpus and grow slowly (logarithmically) with the number of nodes in the system. In contrast, a local indexing system sends a query to every node in the system, whereas the cost to process multi-term queries in a global indexing system grows with the size of the corpus. eSearch’s efficiency is mainly due to the optimizations described in Sections 4.2 and 4.3, but it comes at the expense of 6.8 times the storage consumption of the Global-P2P systems. We believe that trading modest disk space for communication and precision is a proper design choice for massive P2P systems.

4.5.1 Publishing a Document

eSearch executes a two-phase protocol to publish a document. In the first phase, it uses DHT routing to locate the nodes responsible for the top terms in the document and obtains their IP addresses and GNP coordinates. In the second phase, it uses overlay source multicast to deliver the document. The cost for the first phase


could be avoided if the needed information were already cached locally; we assume that this cache is disabled in the following analysis.

The data transmitted to locate the recipients amount to Bl = nt * h * ml = 10,400 bytes, where nt = 20 is the number of top terms, h = 8 is the average number of routing hops in a 20,000-node Chord, and ml = 65 bytes is the size of the lookup message (40-byte TCP/IP headers, a 1-byte identifier that specifies the type of the message, the 4-byte IP address of the data source, and a 20-byte DHT key). (The actual delay stretch in a proximity-aware overlay [110] may be smaller than the hop count; this effect is discussed later.) The data replied by the recipients amount to Br = nt * mr = 1,220 bytes, where mr = 61 bytes is the size of the reply message (28-byte UDP/IP headers, a 1-byte message identifier, the 4-byte recipient IP address, and 28-byte GNP coordinates in a 7-dimensional Cartesian space). Throughout the analysis, we assume that communication between routing neighbors uses pre-established TCP connections, whereas short communication between non-neighboring nodes uses UDP. The total data transmitted in the first phase is Bl + Br = 11,620 bytes.

In the second phase, there are two options for the content to be multicast to the recipients. The data source can build the term list for the document and multicast the term list, or it can multicast the document itself, leaving it to the recipients to build the term list. The first method is more efficient in that a term list is smaller than a document, but the second method allows more flexible retrieval. With the document text in hand, eSearch can search for exact matches of quoted text, provide sentence context for matching terms, and support a “cached documents” feature similar to that in Google. We opt for multicasting the document itself.

Using statistics from the TREC corpus, we assume that the average document is 3,778 bytes long and can be compressed to 1,350 bytes (a 2-to-4 text compression ratio is typical for bzip2). The UDP/IP headers, message identifier, and structural information of the multicast tree add 150 bytes to the multicast


message, resulting in a 1,500-byte packet. Based on the simulation results in Section 4.3, we conservatively estimate that multicast saves 55% of the bandwidth. It therefore consumes 20*1500*0.45 = 13,500 bytes of bandwidth to multicast a document to 20 recipients. The total (phase one plus phase two) cost for publishing a document is 11,620 + 13,500 = 25,120 bytes.

Readers may be surprised that the signaling cost in the first phase almost equals the data transmission cost in the second phase. This is because the small packets in the first phase have high TCP/IP overhead, and this overhead is aggravated by routing each packet through several hops in the overlay.

Next, we calculate the bandwidth consumed to distribute the metadata for a document in a Global-P2P system. Although these systems [102, 145, 205, 245] do not adopt stopword removal, we add this step, which helps reduce the size of the metadata. According to the TREC corpus, each document on average contains 153 unique terms after stemming and stopword removal. The metadata for a term in a document include a 16-byte term ID, the document ID, and a 1-byte attribute specifying, for instance, the frequency of the term in the document. (If this information needs more than one byte, approximating it with a 256-level value would provide sufficient precision.) We assume the document ID is 8 bytes, including the IP address of the node that stores the document, the port number through which to establish a connection with that node, and a 2-byte document number that differentiates documents on that node.

In a Global-P2P system, the data transmitted to publish a document amount to B = na * h * m = 80,784 bytes, where na = 153 is the number of terms in the document, h = 8 is the number of routing hops in Chord, and m = 66 bytes is the size of the message (40-byte TCP/IP headers, a 1-byte message identifier, a 16-byte term ID, an 8-byte document ID, and a 1-byte attribute). This bandwidth consumption is 3.2 times that of eSearch. This inefficiency is due to routing high-overhead small packets in
This inefficiency is due to routing high-overhead small packets in


the overlay network. Suppose the overhead of overlay routing could be reduced from h = 8 to approximately h = 2 by introducing proximity neighbor selection into Chord [110]. A Global-P2P system would then consume 20,196 bytes of bandwidth to publish a document, whereas eSearch would consume 17,320 bytes.

Packing multiple small packets into a larger one can reduce the cost of document publishing, but this is possible only when multiple small packets head to the same destination. Because the metadata sent to each recipient differ, a Global-P2P system cannot leverage multicast to reduce the cost. Global-P2P systems send a large number of small messages to publish a document; eSearch, in contrast, sends a small number of large messages (the whole document). If the inefficiency of processing a large number of small messages in routers and end hosts is counted, the savings in eSearch would be even more significant. Moreover, eSearch distributes the actual document, allowing more flexible retrieval.

4.5.2 Processing a Query

When a user intends to retrieve a small number of best-matching documents, query expansion is not used in eSearch by default (see the discussion in Section 4.2.4). The bandwidth cost to process a query is Bq = nq * h * mq + nq * md = 3,335 bytes, where nq = 5 is the number of nodes responsible for the query terms, h = 8 is the number of routing hops in Chord, mq = 61 bytes is the size of the query message (40-byte TCP/IP headers, a 1-byte message identifier, and 20-byte query text), and md = 28 + 1 + 15 * (8 + 2) = 179 bytes is the size of the search results (28-byte UDP/IP headers, a 1-byte message identifier, and an 8-byte ID and 2-byte relevance score for each of 15 matching documents). Both query and publishing costs are independent of the size of the corpus and grow slowly (logarithmically) with the number of nodes in the system.
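The query-cost formula can be checked with a short sketch, using the default sizes assumed above.

```python
def query_cost(n_nodes=5, hops=8, query_msg=61, n_results=15):
    """Bq = nq*h*mq + nq*md for a default query without expansion."""
    md = 28 + 1 + n_results * (8 + 2)   # UDP/IP + id + (docID, score) pairs
    return n_nodes * hops * query_msg + n_nodes * md
```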


Occasionally, the user is not satisfied with the search results and requests a feedback process to retrieve a larger number of documents using query expansion. The node that started the search first collects the metadata of the feedback documents in order to select terms to add to the query. The bandwidth cost is Bc = nf * mf = 10,000 bytes, where nf = 10 is the number of feedback documents and mf = 1,000 bytes is the cost to retrieve the metadata for one feedback document. It then starts a second round of search using the expanded query. The analysis of the bandwidth consumption is similar to that for the first round, Bq = nq * h * mq + nq * md = 315,510 bytes, except that it searches 30 nodes (nq = 30) and each node returns 1,000 documents (md = 28 + 1 + 1000 * (8 + 2) = 10,029 bytes). In total, the feedback round consumes 325,510 bytes of bandwidth, the majority of which is due to returning a large number of documents.

The search cost (with or without query expansion) is independent of the size of the corpus, since eSearch does not transmit inverted lists on the fly. The factors that determine the cost grow slowly (logarithmically) with the number of nodes in the system, e.g., the number of routing hops h and the number of nodes partitioning a term (which affects nq). eSearch’s performance is therefore scalable. In contrast, the cost of either local or global indexing is not scalable: local indexing sends a query to every node in the system, and global indexing must transmit inverted lists, whose sizes grow with the size of the corpus, to process multi-term queries.
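The feedback-round arithmetic can be sketched the same way, again using only the default values assumed in the text.

```python
def feedback_cost(n_feedback=10, meta_bytes=1000,
                  n_nodes=30, hops=8, query_msg=61, results_per_node=1000):
    """Bytes for the feedback round: collect metadata for the feedback
    documents, then re-search with the expanded query."""
    collect = n_feedback * meta_bytes                 # Bc = nf * mf
    md = 28 + 1 + results_per_node * (8 + 2)          # 10,029-byte reply
    second_round = n_nodes * hops * query_msg + n_nodes * md
    return collect + second_round
```

Most of the 325,510-byte total comes from the 30 nodes each returning 1,000 result entries.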

4.5.3 Storage Cost

The term list for a document is replicated about 20 times in eSearch, but its storage cost is not 20 times that of the Global-P2P systems, as explained below. To speed up query processing, each eSearch node locally builds inverted lists for the documents whose term lists it holds. It also maintains a table that maps a document’s 8-byte global ID into a 2-byte local ID. Although further compression is possible [284], we assume that 3 bytes are used for each (non-stopword) term in a document: 2 bytes for the local document ID and 1 byte for an attribute (e.g., the frequency of the term in the document). The total cost to store the metadata for a document in eSearch (including the cost for the mapping table) is 20*(8+2+153*(2+1)) = 9,380 bytes, assuming each document has 153 terms after stemming and stopword removal. Since storing the full document text is an additional feature, we do not count it in the comparison.

Global-P2P systems need at least 9 bytes per term in a document: 8 bytes for the global document ID and 1 byte for the attribute. The total storage cost for a document’s metadata is 153*(8+1) = 1,377 bytes. In these systems, the information for the terms in a document is distributed across different nodes. They cannot benefit from the technique of mapping a document’s long global ID into a short local ID, since the size of one entry in the mapping table already exceeds the size of the information for a term and is not reused enough to justify the cost.

The storage space consumed by eSearch is thus 6.8 times that of the Global-P2P systems; the benefit is its low search cost. According to Blake and Rodrigues [27], disk capacity has increased 160 times faster than the network bandwidth for an end user. Moreover, in a large P2P system, the storage space scales proportionally with the number of nodes, but the bisection bandwidth does not. We therefore believe that trading modest disk space for communication and retrieval precision is a proper design choice for P2P systems. With the cheap massive storage provided by large P2P systems in hand, replication has already become a common practice to improve efficiency or resilience [53, 163, 217]. We also plan to adopt index compression [284] and pruning [45] to further reduce storage consumption.
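The storage comparison reduces to two one-line formulas; this sketch re-derives the 9,380-byte, 1,377-byte, and 6.8x figures from the per-entry sizes assumed above.

```python
def esearch_storage(n_terms=153, replicas=20):
    """Per replica: 8-byte global ID + 2-byte local ID in the mapping
    table, plus 3 bytes (2-byte local doc ID + 1-byte attribute) per term."""
    return replicas * (8 + 2 + n_terms * (2 + 1))

def global_p2p_storage(n_terms=153):
    """9 bytes per term: 8-byte global document ID + 1-byte attribute."""
    return n_terms * (8 + 1)
```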


4.6 Related Work

We classify search systems into four categories according to the type of network in which they operate.

4.6.1 Traditional Distributed Information Retrieval

Compared with the P2P architecture of eSearch and pSearch, traditional distributed IR systems such as GlOSS [107] use either a centralized meta-database or a hierarchy of meta-databases to summarize the content of other databases. During a search, the summary is consulted to select the databases that are likely to contain the most relevant documents. Two database selection algorithms were proposed in GlOSS: the bGlOSS algorithm for Boolean queries and the vGlOSS algorithm for retrieval based on the vector-space model. For scalability, an hGlOSS server can be introduced to further summarize the meta-databases maintained by the low-level GlOSS servers and act as the entry point for queries, essentially forming a hierarchy of meta-databases.

French et al. [100] compared two database selection algorithms, GlOSS’s gGlOSS algorithm and INQUERY’s CORI algorithm, and found that CORI is a better estimator of relevance-based ranks than gGlOSS. Powell et al. [196] demonstrated that, with a good database selection algorithm, search in distributed databases can achieve even better retrieval effectiveness than search in centralized databases, while contacting only a few databases per query. Moreover, it is not necessary to maintain collection-wide information such as the global inverse document frequency (IDF) when database selection is employed; using local information in fact achieves better performance.

Baumgarten [14] studied collection selection and fusion using a probabilistic model. Voorhees et al. [272] proposed two algorithms for collection fusion based on query histories. The goal of collection fusion is to combine the retrieval results


from multiple distributed collections into a single result whose effectiveness approximates that of searching a centralized collection. The first algorithm uses the results of the k past queries closest to the current query to estimate how the results for the current query are distributed among the collections. The second algorithm partitions past queries into clusters and uses this cluster information to estimate the distribution of results for the current query.

Xu and Callan [287] found that, in the face of a large number of collections, distributed IR systems based on collection selection work significantly worse than centralized systems, mainly because typical queries are not adequate for choosing the right collections to search. They proposed two techniques to address this problem: using phrase information and query expansion. Their findings motivated the techniques we use in eSearch and pSearch. eSearch also adopts query expansion. In pSearch, indices for documents are distributed based on document content rather than document source, and the LSI algorithm, to some extent, implicitly subsumes query expansion.

Xu and Croft [288] pointed out problems with distributed IR systems when dealing with heterogeneous collections. As opposed to conventional approaches that focus on designing collection selection and fusion algorithms, they proposed a new approach based on document clustering and language modeling, in which documents are partitioned into a pre-determined number of topics. On a retrieval, the query is routed to the related topics and the results retrieved from those topics are fused. Topic clustering is done by a two-pass k-means algorithm. Similarly, Larkey et al. [142] also found that it is better to organize collections by topic. They compared two collection selection algorithms, CORI and Kullback-Leibler divergence, and both achieve good retrieval quality when collections are topically organized. They also found that, when collections are organized by topics, effective result fusion requires the global inverse document frequency (IDF).


Danzig et al. [71] proposed an architecture for distributed IR in which index brokers summarize the content of distributed databases and information about these index brokers is flooded to every client. Topic brokers co-located with clients then use this information to forward queries to the right index brokers. Lagoze et al. [139] described their work on the NCSTRL global digital library. In their architecture, indices are built for each library, and a single library may be indexed by multiple indexing servers. Indexing servers and their front-end interface servers are partitioned into connectivity regions based on the underlying Internet topology. A query is typically resolved inside the region where it is submitted, unless the network or the indexing servers in that region fail.

Query-based sampling [38] is a technique for acquiring accurate database descriptions without the cooperation of data providers. Results show that resource descriptions created from samples can be used for accurate database selection. To improve the efficiency of searching very large collections, Lu and McKinley [161] proposed maintaining partial replicas containing the top relevant documents of the most frequent queries, along with their indices. On a retrieval, a meta-database directs the query to a small number of relevant replicas.

For a metasearch engine, the amount of index information it maintains is proportional to the number of databases in the system. To reduce the size of the indices, Wu et al. [286] proposed that, for a given term, the system keep indices only for the several top databases with the heaviest weights for that term. This approach essentially restricts the size of the indices at the metasearch engine to a constant. Liu [154] presented a large-scale query routing system based on multi-level progressive pruning strategies, which use user query profiles and source capability profiles to discover relevant information sources for a given query.


4.6.2 Search in Distributed Hash Table Systems

Distributed Hash Table (DHT) systems, such as CAN [202], Chord [243], Pastry [216], and Tapestry [308], support scalable and fault-tolerant overlay maintenance and efficient object lookup. However, with only a simple interface for the storage and retrieval of (key, value) pairs, DHTs do not directly support full-text search.

P2P keyword search systems built on top of DHTs using global indexing are the most relevant to eSearch. To answer a multi-term query, these systems must transmit inverted lists over the network and intersect them in order to identify the documents that contain all the query terms. Several techniques have consequently been proposed to reduce this cost. In KSS [102], the system precomputes and stores results for all possible queries consisting of up to a certain number of terms; the number of possible queries unfortunately grows exponentially with the size of the vocabulary. Reynolds and Vahdat [205] adopted a technique developed in the database community to perform the distributed join more efficiently: it transmits Bloom filters of the inverted lists instead of the inverted lists themselves. Suel et al. [245] used Fagin’s algorithm to compute the top-k results without transmitting the entire inverted lists; Fagin’s algorithm transmits the inverted lists incrementally and terminates early once sufficient results have been obtained. Li et al. [145] suggested combining several techniques to reduce the cost of a distributed join, including result caching, Bloom filters, and document clustering.

These approaches are orthogonal to our effort in eSearch; our hybrid indexing architecture is intended to completely eliminate the cost of the distributed join. A quantitative comparison of the search cost between these systems and eSearch would be an interesting subject for future work. Even with various optimizations, we still expect their cost to grow with corpus size, perhaps at a rate slower than that of a basic global indexing system.
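To illustrate the Bloom-filter approach in the simplest possible terms, the sketch below (entirely our own construction, not the cited systems’ actual code) intersects two inverted lists by shipping a fixed-size filter instead of the full list; false positives are possible, false negatives are not.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter over document IDs."""
    def __init__(self, m=8192, k=4):
        self.m, self.k = m, k     # m bits, k hash functions
        self.bits = 0

    def _positions(self, item):
        for i in range(self.k):
            h = hashlib.sha1(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p

    def __contains__(self, item):
        return all((self.bits >> p) & 1 for p in self._positions(item))

def candidate_intersection(remote_filter, local_list):
    """Node B intersects its inverted list with the filter node A sent over;
    A can then remove any false positives against its real list."""
    return [doc for doc in local_list if doc in remote_filter]
```

The filter’s size is fixed regardless of list length, which is the source of the bandwidth savings.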

4.6.3 Search in Unstructured Peer-to-Peer Networks

Centralized indexing systems such as Napster suffer from a single point of failure and performance bottlenecks at the index server. On the other hand, flooding-based techniques such as Gnutella and PlanetP [68] send either queries or indices to every node in the system, consuming huge amounts of system resources such as network bandwidth and CPU cycles. To reduce this cost, heuristic-based approaches try to direct the search to only a fraction of the node population.

Rhea and Kubiatowicz [206] proposed a method in which each node uses Bloom filters [28] to summarize its neighbors’ content. A query is forwarded only to neighbors that, judging from the Bloom filters, host relevant documents with high probability. More specifically, a node uses several Bloom filters to summarize the content of the neighbors reachable through an outgoing channel. Each Bloom filter summarizes the content of neighbors within a particular distance (e.g., 1 hop, 2 hops, and so forth) and is assigned a probability inversely proportional to that distance. The probability of finding a document by propagating a query through a particular channel is the sum of the probabilities of all Bloom filters associated with the channel that indicate the existence of relevant documents.

Like conventional distributed IR systems, PlanetP [67] divides the search process in a P2P system into two phases: (1) select nodes to search based on some criteria, and (2) search the documents on the selected nodes and merge the results. Each node’s content is summarized by a Bloom filter, which is disseminated to all other nodes through a gossip protocol. A query source uses the Bloom filters to choose the nodes to search. PlanetP’s scalability is limited to at most several thousand nodes, since it assumes that each node receives and keeps a copy of every other node’s summary. It is unclear whether the use of a gossip protocol to disseminate the summaries has any major advantage over protocols that periodically aggregate and disseminate the summaries through one or more trees consisting of the P2P nodes. PlanetP uses VSM to rank documents.

Yang and Garcia-Molina [294] studied the super-peer network, which aims to strike a good balance between the efficiency of centralized search and the scalability and availability of distributed search. Super-peers act as servers to a set of clients and form a P2P network among themselves. The super-peers for some clients may be replicated for availability. Similarly, FastTrack [87] designates high-bandwidth nodes as super-nodes; each super-node replicates the indices of several other nodes. Prinkey [198] proposed organizing nodes into a hierarchy. Each node P in the hierarchy builds indices summarizing the content stored on the nodes in the subtree rooted at itself. On a retrieval, the query is routed to the proper branches of the tree based on the information in the indices.

Instead of using Bloom filters, Crespo and Garcia-Molina introduced the concept of Routing Indices (RIs) into P2P systems [66], which allow nodes to forward queries to the neighbors that are more likely to have answers. If a node cannot answer a query, it forwards the query to a subset of its neighbors, based on its local RIs. They proposed and compared three RI schemes: the compound, hop-count, and exponential routing indices.

Lv et al. [64, 163] studied search and replication strategies in unstructured P2P networks. They found that, with regard to search, flooding is inefficient, expanding ring search performs better, and random walk is the most efficient. They also proved that the optimal number of replicas of an object should be proportional to the square root of the rate at which the object is searched for. They further showed that this optimal replication strategy can be approximated by randomly choosing nodes along the search path of an object to host replicas of the object.


In JXTA search [31], three types of entities form the search infrastructure: provider, consumer, and hub. Information providers are Web servers, database servers, or search engines that are capable of doing search over their content. Information providers register with the search hub the queries that they can answer, along with a filter that specifies queries they are not interested in. Only queries pass the filters are forwarded to search providers. Consumers submit queries to the hub, which then routes the queries to appropriate providers. Both queries and responses are represented in XML. Currently JXTA search does not support connecting several search hubs to route queries at a larger scale. Built on top of the JXTA framework, EDUTELLA [30] provides an RDF-based (Resource Description Framework) metadata infrastructure for P2P applications. Sun and Garcia-Molina [247] proposed partial lookup service to improve search efficiency, which only returns a few search results instead of the entire result set. Yang and Garcia-Molina [292] used an analytic model to study the performance of Napster-like hybrid P2P systems that have a centralized indexing server. They found that the chained model has the best performance, in which servers are chained together and clients connects to one of the servers on startup. Subsequent queries from a client are submitted to the server it connects to. A query is forwarded to other servers on the chain if it cannot be resolved by a server locally. Yang and Garcia-Molina [293] proposed three techniques for search in unstructured P2P networks: iterative deepening, directed BFS, and local indices. Iterative deepening is similar to the expanding ring search strategy proposed in [163]. With directed BFS, the query source sends the query to just a subset of its neighbors that may return many quality results. For example, one may select neighbors that have produced or forwarded many quality results in the past. 
The neighbors that receive the query then continue forwarding the query to their neighbors. With local indices, each node maintains indices for data stored on nodes within r hops of itself, where r is a system-wide parameter. When a node receives a query,


it can process the query on behalf of all nodes within r hops. Results show that these techniques achieve the same level of search quality while consuming far fewer resources.
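The local-indices scheme hinges on knowing which nodes lie within r hops. A minimal sketch, assuming a hypothetical adjacency-list overlay, is a radius-bounded breadth-first search:

```python
from collections import deque

def nodes_within_r_hops(adj, start, r):
    """Return all nodes within r hops of `start` in an overlay given as
    an adjacency dict -- the set whose data `start` would index under
    the local-indices scheme."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if dist == r:
            continue  # do not expand beyond the radius
        for neighbor in adj[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, dist + 1))
    return seen
```

A node receiving a query can then answer on behalf of every node in this set, which is why a search needs to probe only a subset of nodes whose r-hop neighborhoods cover the network.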

4.6.4 Search in Networks with Semantic Locality

SETS [16] follows the traditional database selection approach to build a P2P IR system. It, however, floods the meta-database to all nodes. SETS assumes that each node's content is concentrated on a certain topic and partitions the nodes into topic segments. Nodes within a segment share similar content. Each segment is represented by a centroid, and each node is represented by the centroid of its content. A node is assigned to the segment whose centroid is closest to the node's centroid under the cosine similarity measure. The number of segments is small (less than several hundred) and is predetermined. The knowledge of segments is global: a distinguished node computes the segment centroids and propagates them to all other nodes. Upon a retrieval, a node uses this global knowledge to select segments to search, routes the query to a node within each selected segment, and then floods the query to the nodes within those segments. In contrast, pSearch uses semantic vectors to cluster documents in a large P2P overlay. It does not explicitly maintain the meta-database, since this knowledge is embedded in the structure of the overlay network. The number of "topic segments" automatically adapts to the current node population. Schwartz [230] described a probabilistic resource-discovery protocol that replicates indices of databases and searches the databases by parallel random walk. Sites are organized into topic-based specialization subgraphs through a cache-swapping protocol. A search starts as a random walk but proceeds more deterministically once it hits a subgraph with a matching topic. One limitation of this work is that the topics are predetermined.


Motivated by research in data mining, Cohen et al. [63] used guide rules to organize nodes into an associative network. A guide rule is a set of nodes that satisfy certain predicates. Search within a rule is performed blindly, but the query source has the flexibility of choosing the rules to use for a given query. Sripanidkulchai et al. [238] extended existing P2P networks with common-interest groups by linking nodes to other nodes that satisfied their queries in the past. The basic premise is that nodes that satisfied previous queries are likely to provide high-quality answers to new queries. During a retrieval, the query is first forwarded to nodes within the same common-interest group. If no result is found, the system then resorts to the underlying search mechanism, such as flooding, to further process the query. In Gnutella, a node forwards a received query to all its neighbors. NeuroGrid [123] instead proposed that a node should forward queries only to nodes that are likely to provide high-quality answers. The forwarding decision is made according to a knowledge database that is dynamically maintained through a learning mechanism that adapts to user responses (success or failure) and previous search results. In BuddyWeb [276], Web browsers belonging to users with common interests form a P2P cooperative cache. It uses VSM to measure the similarity of the content in nodes' caches. BuddyWeb adopts a hybrid centralized-P2P architecture: a small number of centralized Location Independent Global Name Lookup servers keep track of node status, assign nodes identifiers that are independent of their IP addresses, and match nodes into buddies based on their interests. Although Ingrid [95] was developed before the flourishing of P2P software, it embraces the basic concepts of P2P. In Ingrid, nodes form clusters, and each cluster is responsible for indexing a certain term. A query consisting of multiple terms is serviced by nodes sitting at the intersection of the clusters responsible for the query terms. For instance, when processing a query on "computer science",


it starts at an entry node in the cluster servicing the term "computer" and travels through the cluster until it meets a node that also sits in the "science" cluster. The subcluster for "computer and science" is then fully searched. A centralized Global-Single-Term Server helps find the entry nodes for clusters. To allow a document to be found by a multi-term query, the document must be preregistered in the corresponding clusters. Ingrid did not address how to identify important term combinations for documents.

4.7 Conclusion

In this chapter, we proposed a new architecture for P2P information retrieval, along with various optimization techniques to improve system efficiency and the quality of search results. We made the following contributions:

• Challenging the conventional wisdom of using either local or global indexing, we proposed hybrid indexing, which employs selective term-list replication to combine the benefits of local and global indexing while avoiding their limitations.

• We used semantic information provided by modern IR algorithms to guide the replication. The term list of a document is replicated only on nodes corresponding to important terms in the document.

• We adopted automatic query expansion in a P2P environment to alleviate the precision degradation introduced by selective replication.

• We devised a novel overlay source multicast protocol with very low protocol overhead in order to reduce the cost of term-list dissemination. It reduces metadata dissemination cost by up to 80% on its own.


• We introduced two techniques to balance term-list distribution to a greater degree than is achievable using existing load-balancing techniques, while avoiding their maintenance overhead.

We have quantified the efficiency of eSearch (in terms of bandwidth consumption and storage cost) and the quality of its search results by experimenting with one of the largest benchmark corpora available in the public domain. Our results show that the combination of our proposed techniques results in a system that is scalable and efficient, and achieves search results as good as those of the centralized Okapi.


5 pSearch: A Semantic Overlay Network for P2P IR

The eSearch system described in the previous chapter improves global indexing (partitioning by words) by combining it with local indexing (partitioning by documents) and focusing on the important terms in documents. In contrast, our second P2P IR framework, pSearch, improves local indexing by introducing semantic locality when partitioning documents. eSearch and pSearch address the same problem (i.e., full-text information retrieval in P2P systems), but they take completely orthogonal approaches. A detailed comparison of eSearch and pSearch will be given in Section 7.2.3. Current P2P file-sharing software such as Gnutella [103] and KaZaA [128] is based on the local indexing structure (see Figure 1.1, page 12). The fundamental problem that makes search in these systems difficult is that, with respect to semantics, documents are randomly distributed. Given a query, the system has to search a large number of nodes to find even a few relevant documents, rendering systems of this type unscalable. To address this problem, we introduce the notion of a semantic overlay, a logical network where content is organized around its semantics such that the distance (e.g., routing hops) between two documents in the network is proportional to their dissimilarity in semantics. The search cost for a query is therefore reduced, since documents related to the query are likely to be concentrated on a small number


of nodes. Content-Addressable Networks (CANs) [202] provide a Distributed Hash Table (DHT) abstraction over a Cartesian space. They allow efficient storage and retrieval of (key, object) pairs. An object's key is a point in the Cartesian space. We use a CAN to create a semantic overlay by using the semantic vectors of documents (generated by LSI [19, 73]) as keys to store document indices in the CAN, such that indices stored close together in the CAN have similar semantics. Our system based on the concept of semantic overlay is called pSearch. The remainder of the chapter is organized as follows. Section 5.1 provides background information on IR and CAN. Section 5.2 gives an overview of pSearch and highlights the major challenges. Sections 5.3 to 5.5 describe our solutions to these challenges. Section 5.6 describes a prototype of pSearch and our experimental results. Section 5.7 concludes the chapter.

5.1 Background

5.1.1 Vector Space Model (VSM)

In VSM [222], a term-document matrix A = (a_ij) ∈ R^{t×d} is formed to represent a collection of d documents containing words from a vocabulary of t terms. Each column vector a_j (1 ≤ j ≤ d) corresponds to a document j. The weight a_ij represents the importance of term i in document j. The weights are usually computed from variants of the term frequency * inverse document frequency (TF*IDF) scheme [19]. For instance, the ltc [35] term weighting scheme computes a_ij as follows:

b_ij = [log(f_ij) + 1] · log(d / D_i)    (5.1)

a_ij = b_ij / sqrt(Σ_{x=1}^{t} b_xj^2)    (5.2)


where f_ij is the frequency of term i in document j and D_i is the number of documents that contain term i. The normalization in Equation 5.2 ensures that document vector a_j is of unit length. The intuition behind the equations is that two factors decide the importance of a term in a document: the frequency of the term in the document and the frequency of the term in other documents. If a term appears in a document with high frequency, there is a good chance that the term can be used to differentiate the document from others. However, if the term also appears in many other documents, its importance should be penalized. Queries are represented as vectors in a similar fashion. During a retrieval, documents are ranked according to the similarity between the document vector and the query vector, and those with the highest similarity are returned. A common measure of similarity is the cosine of the angle between the vectors. When vectors are normalized (as they are in ltc), the inner product is the same as the cosine of the angle between the vectors:

cos(X, Y) = (X · Y) / (|X| · |Y|) = Σ_{i=1}^{t} x_i y_i    (5.3)
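Equations 5.1 to 5.3 can be sketched directly; the three-document corpus below is a made-up toy example, and because the ltc vectors are unit-length, the inner product equals the cosine.

```python
import math
from collections import Counter

def ltc_weights(docs):
    """ltc term weighting (Equations 5.1-5.2) for a list of token lists.
    Returns one sparse unit-length vector (term -> weight) per document."""
    d = len(docs)
    df = Counter()                       # D_i: number of documents containing term i
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)                # f_ij: frequency of term i in document j
        b = {t: (math.log(f) + 1) * math.log(d / df[t]) for t, f in tf.items()}
        norm = math.sqrt(sum(w * w for w in b.values()))
        # A norm of zero occurs only if every term appears in all documents.
        vectors.append({t: w / norm for t, w in b.items()} if norm else b)
    return vectors

def similarity(x, y):
    """Inner product of sparse vectors; equals cos(X, Y) for unit vectors (Equation 5.3)."""
    return sum(w * y.get(t, 0.0) for t, w in x.items())
```

Two documents sharing the common term "car" score above zero but below a document's similarity with itself, while documents with no shared terms score exactly zero, which is the literal-matching limitation that LSI (next section) addresses.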

5.1.2 Latent Semantic Indexing (LSI)

Literal matching schemes suffer from synonymy and noise in documents. LSI overcomes these problems by using statistically derived concepts, instead of terms, for retrieval. It uses truncated Singular Value Decomposition (SVD) [104] to transform a high-dimensional document vector into a lower-dimensional semantic vector, by projecting the former into a semantic subspace. Each element of a semantic vector corresponds to the importance of an abstract concept in the document or query.


Suppose the rank of the term-document matrix A is r. SVD decomposes A into the product of three matrices:

A = U Σ V^T    (5.4)

where U = (u_1, ..., u_r) ∈ R^{t×r}, Σ = diag(σ_1, ..., σ_r) ∈ R^{r×r}, and V = (v_1, ..., v_r) ∈ R^{d×r}. V^T is the transpose of V. The σ_i are A's singular values, σ_1 ≥ σ_2 ≥ ... ≥ σ_r. U and V are column-orthonormal. LSI approximates A with a rank-l matrix

A_l = U_l Σ_l V_l^T    (5.5)

by omitting all but the l largest singular values, where U_l = (u_1, ..., u_l), Σ_l = diag(σ_1, ..., σ_l), and V_l = (v_1, ..., v_l). It has been proven that, among all matrices of rank l, A_l approximates A with the smallest error in the least-squares sense. Row i of U_l ∈ R^{t×l} is the representation of term i in the l-dimensional semantic space. A document (or query) vector q ∈ R^{t×1} can be folded into the l-dimensional semantic space using Equation 5.6 or 5.7 [188]; the difference between the two is whether to scale the vector by the inverse of the singular values. (Other variants of LSI also exist. Throughout this chapter, we use a configuration for LSI that we found through extensive experimentation; we defer the discussion to Chapter 6.) As in VSM, the similarity between semantic vectors is measured as their inner product.

q̂ = U_l^T q    (5.6)

q̂ = Σ_l^{-1} U_l^T q    (5.7)
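As a concrete sketch using NumPy's SVD (the matrix in the test is an arbitrary random example), Equations 5.4 to 5.7 amount to a truncation plus a projection:

```python
import numpy as np

def lsi(A, l):
    """Truncated SVD of a term-document matrix A (Equations 5.4-5.5),
    returning U_l, the l largest singular values, V_l, and a fold-in
    function implementing Equation 5.6 (or 5.7 when scale=True)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    Ul, sl, Vl = U[:, :l], s[:l], Vt[:l].T
    def fold(q, scale=False):
        qhat = Ul.T @ q                       # Equation 5.6
        return qhat / sl if scale else qhat   # Equation 5.7 divides by sigma_i
    return Ul, sl, Vl, fold
```

Folding in column j of A reproduces column j of Σ_l V_l^T, i.e., document j's l-dimensional semantic vector, since U_l^T A = Σ_l V_l^T by the column-orthonormality of U.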

By choosing an appropriate l for A_l, the important structure of the corpus is retained while the noise and variability in word usage (small σ_i) are eliminated. Earlier studies of LSI suggested setting l to a value between 50 and 350 and reported improvements over VSM of up to 30% in precision [73]. In addition, LSI is capable of bringing together documents that are semantically related even if they do not share terms, by learning from co-occurring word usage. For instance, a search about "car" may return relevant documents that actually use "automobile" in the text. This feature has been exploited in cross-language information retrieval [82]. In summary, LSI represents documents and queries as vectors (points) in a (semantic) Cartesian space. The similarity between a query and a document is measured as the cosine of the angle between their vector representations. These are the major properties of LSI that we will use in pSearch. Given a query as a point in the Cartesian space, the problem of finding the most relevant documents is reduced to locating the document points nearest to the query point. Therefore, the central issue in pSearch is to map the semantic space onto the nodes in a network and conduct an efficient nearest-neighbor search in a decentralized manner.

5.1.3 Content-Addressable Network (CAN)

Recent overlay networks, such as CAN, Chord, Pastry, and Tapestry, offer an administration-free and fault-tolerant distributed hash table (DHT) that maps "keys" to "values". CAN partitions a d-dimensional Cartesian space into zones and assigns each zone to a node. Two zones are neighbors if they overlap in all but one dimension, along which they abut each other. An object's key is a point in the Cartesian space, and the object is stored at the node whose zone contains the point. Locating an object is reduced to routing to the node that hosts the object. Routing translates to traversing from one zone to another in the Cartesian space. A node join corresponds to randomly picking a point in the Cartesian space, routing to the zone that contains the point, and splitting that zone with its current owner. When a node leaves, its zone is taken over by one of its neighbors. An example CAN is shown in Figure 5.1. There are five nodes, A-E, in the overlay. Each node owns a zone in the Cartesian space. Initially, C owns the entire zone at the upper-right corner. When D joins, the zone owned by C splits and part of it is given to D. When D wants to retrieve the object with key (0.4, 0.1), it sends the request to E, because E's coordinates are closer to the


object key, and E forwards the request to A. A then sends the object back to D directly.

Figure 5.1: A 2-dimensional CAN.

5.2 Overview of the pSearch System

In pSearch, a large number of machines are organized into a semantic overlay to offer the information retrieval service (see Figure 5.2). Nodes in the overlay collectively form a pSearch Engine. Inside the Engine, nodes have completely homogeneous functions. A client intending to use pSearch connects to any Engine node to publish document indices or submit queries. Similar to the Engine node selection process in eSearch (Section 4.1, page 130), pSearch only picks server-like nodes that are stable and have good Internet connectivity to join the Engine. A new Engine node finds its position in the overlay, takes over some indices stored


at its neighbors, and starts to process queries. The entire process is completely autonomous.

Figure 5.2: Overview of the pSearch system.

Figure 5.2 shows an example of how the system works. Node A publishes a document to node B inside the Engine. Node B builds an index for the document and routes the index through the overlay. The index is finally stored at node F, based on its semantics. When a query is submitted to node E, the query is routed to node C based on the semantics of the query. Node C then takes responsibility for finding relevant indices and returning them to node E. In this example, node C may return the index published by node A and stored at node F. Different queries can be processed concurrently in different regions of the overlay, resulting in high throughput. pSearch uses a CAN to organize Engine nodes into an overlay and uses an extension of LSI to answer queries. We call our algorithm pLSI. In the following, we first present a basic pLSI algorithm to outline our ideas and to highlight the major challenges.

5.2.1 The Basic Algorithm

pLSI sets the dimensionality of the CAN to be equal to that of LSI’s semantic space. The index for a document is stored in the CAN using its semantic vector


as the key. As a result, indices stored close together in the CAN have similar semantics. Among other things, an index includes the semantic vector of a document and a reference (URL) to the document.

Figure 5.3: pLSI in a 2-dimensional CAN.

Figure 5.3 illustrates the steps of pLSI.

1. When receiving a new document A, the Engine node generates its semantic vector Va using LSI and uses Va as the key to store the index in the CAN.

2. When receiving a query q, the Engine node generates its semantic vector Vq and routes the query in the overlay using Vq as the key.

3. Upon reaching the destination, the query is flooded to nodes within a radius r, determined by the similarity threshold or the number of documents wanted by the user.

4. All nodes that receive the query do a local search using LSI and report references to the best matching documents back to the user.


Since indices of documents similar to the query (above a certain threshold) can be stored only within this radius r, and we do an exhaustive search within this area, in theory pLSI can achieve the same precision as LSI. Ideally, the radius r should be small, such that only a small number of nodes are involved in a search. The only data transmitted during a search are the query and a small number of references to the top documents, both of which are small and independent of the corpus size. pLSI relies on some global information to function, including the inverse document frequency and the basis of the semantic space. Distributing this information to each Engine node allows the nodes to compute the semantic vectors of new documents and queries independently. Previous work [107, 196], as well as our own evaluation in Figure 4.7 (page 145), has shown that sampled statistics can work very well. We use the algorithm described in Section 4.2.3 (page 137) to aggregate and disseminate the global information.

5.2.2 Major Challenges

The basic idea of pLSI is straightforward, but several challenges must be overcome before it can work effectively.

• Dimensionality mismatch between CAN and LSI. pLSI sets the dimensionality of the CAN to be equal to that of LSI's semantic space, which typically ranges from 50 to 350. The "actual" dimensionality of the CAN, however, is much lower, because there are not enough nodes to partition all the dimensions of a high-dimensional CAN. Along those unpartitioned dimensions, the search space is not reduced.

• Uneven distribution of indices. There are several reasons why using semantic vectors as keys to store indices in a CAN may lead to an imbalance in the distribution of these indices across the nodes. First, semantic vectors


are normalized and reside on the surface of the unit sphere S in the semantic space.

Figure 5.4: Uneven distribution of document indices.

Figure 5.4 shows an example of a 2-dimensional CAN. pLSI places indices only on the unit sphere. The similarity cos θ between documents A and B is determined by their distance p on the circle, since cos θ = cos p on a unit sphere. The gray area is the region for searching documents close to A in semantics. Nodes U and V own two zones of the same size, but node V does not store any index. Second, even if the key space S is uniformly distributed across the nodes, the system can still suffer from hot spots, because the indices are not uniformly distributed on S.

• Large search region. The most difficult problem is due to the high dimensionality of the semantic space, which ranges from 50 to 350 for corpora most commonly used in the IR community and is expected to increase with the size of the corpus. Due to a problem known as the curse of dimensionality, it has been shown that limiting the search region in high-dimensional spaces is difficult [278]. With existing index structures for multidimensional data [278], it cannot be determined with certainty whether the closest data points have been found until a large fraction of the space has been explored.


• Limitations of LSI. Although LSI has been a popular IR technique for more than a decade, our extensive experimentation reveals some limitations of LSI, which may also cripple pSearch's efficiency and efficacy because of its reliance on LSI. (1) When the corpus is large and heterogeneous, LSI's retrieval quality is inferior to that of methods such as Okapi [210]. (2) The Singular Value Decomposition (SVD) that LSI uses to derive low-dimensional representations (i.e., semantic vectors) of documents is not scalable, in terms of both memory consumption and computation time [199].

In the rest of this chapter, we address the first three challenges and evaluate the solutions in the pSearch framework. We will show that, in a decentralized environment, pSearch can efficiently approximate a centralized implementation of LSI. Our techniques to further improve LSI (i.e., to address the last challenge) will be described in Chapter 6.

5.3 Resolving the Dimensionality Mismatch between CAN and LSI

In our algorithm, the dimensionality of the CAN is set to that of LSI's semantic space (l). Ratnasamy et al. [202] suggested that in an l-dimensional CAN with n nodes, each node on average needs to maintain 2l neighbors, and the average length of a routing path is (l/4)(n^{1/l}) hops. Since l can be as high as 300, this seems to indicate a problem in that each node would have a large number of neighbors. However, their result holds only if l < log2(n). Since at least 2^x zones are produced by partitioning along x dimensions, partitioning along more than log2(n) dimensions would result in more zones than there are nodes available. Along an unpartitioned dimension, the neighbor of a zone is itself. Therefore, when l ≥ log2(n) and zones are partitioned "evenly", only log2(n) dimensions will be partitioned and each node has only log2(n) neighbors. Figure 5.5 shows the average number


of neighbors and average routing hops for a 300-dimensional CAN, which can be seen to exhibit the relationship described above. We refer to the number of actually partitioned dimensions as a CAN's effective dimensionality. Note that, in a real CAN, zones may not be partitioned evenly, and the number of partitioned dimensions may vary across different regions of the Cartesian space. While the limited number of nodes avoids the problem of an excessive number of neighbors, the result is that only the low dimensions of the semantic space are partitioned, making searches less efficient. The search space along the unpartitioned dimensions is not reduced, since documents with similar semantic content along those dimensions will be spread across all nodes. This situation is illustrated in Figure 5.6(a). Suppose the semantic space has four dimensions, v0-v3. A query Q and a document A have semantic vectors Vq = (0.55, -0.1, 0.6, -0.57) and Va = (-0.1, 0.55, 0.57, -0.6), respectively. The similarity between Va and Vq is 0.574 (computed from Equation 5.3). The majority of the similarity is contributed by dimensions v2 and v3. For real corpora, this similarity is usually high enough to consider document A relevant to query Q. We store the two vectors in a 4-dimensional CAN. The effective dimensionality of this CAN is only two, since there are only four nodes, w-z, to partition the semantic space along dimensions v0 and v1 (see Figure 5.6(a)). Because Va and Vq are not similar in dimensions v0 and v1, a search in Figure 5.6(a) for query Q would not find document A unless all nodes are probed. Before presenting a solution to this problem, we first make some high-level observations.

• Although the dimensionality of the semantic space is high, in practice the number of dimensions relevant to a particular document is much smaller. For example, concepts in chemistry are unlikely to appear in a computer science paper. We validate this by showing the weight distribution of the


Figure 5.5: The average number of neighbors and routing hops for a 300-dimensional CAN.

Figure 5.6: A rolling-index example. The position of a vector is decided by its first two elements. (a) The CAN partitions dimensions v0 and v1 in the original semantic space. (b) The same CAN partitions dimensions v2 and v3 after rotating the semantic vectors by two dimensions. The relevant document A for the query Q can easily be found at node z in the rotated space.


semantic vectors of the TREC corpus [263]. We sort the elements of each semantic vector in decreasing weight and report the average weight (across all semantic vectors) at each rank in Figure 5.7. As can be seen from the figure, a small number of elements carry much larger weights than the rest. This suggests that the number of important concepts for a particular document is typically small.

• Queries submitted to search engines are usually short (2.4 terms on average [143]) and are likely to be captured by a few concepts. As a result, only a small number of elements in the semantic vectors will contribute significantly to the final similarity score.

• SVD sorts the elements of semantic vectors by decreasing importance. Figure 5.8 plots the singular values of the TREC corpus [263], which correspond to the importance of the elements. The singular values largely follow a Zipf-like distribution, σ_i = a · i^b, with a = 190 and b = -0.3. Because of the importance of the low-dimensional elements, a significant fraction of the similarity score is likely to be contributed by them. According to Equation 5.3, for a document and a query to be considered a good match, the inner product of some elements of their semantic vectors must be sufficiently large.

These observations suggest that for a particular query, the number of dimensions that actually contribute to a match is typically much smaller than the dimensionality of the entire semantic space. Taking advantage of these facts, we propose the use of rolling-index to bridge the dimensionality gap and also to reduce the search space. The basic idea is to use a single CAN to partition more dimensions of the semantic space by rotating the semantic vectors.


Figure 5.7: The average weight of elements of semantic vectors, ranked in decreasing order.

Figure 5.8: Singular values of the TREC corpus.


5.3.1 Rolling-Index

Given a semantic vector V = (v_0, v_1, ..., v_{l-1}), we rotate it repeatedly by m dimensions each time to generate a series of new vectors (see Equation 5.8, and note that V^0 = V). We call these vectors rotated semantic vectors. We set m using Equation 5.9, where n is the number of nodes in the system. Equation 5.9 estimates the effective dimensionality of the CAN by approximating the "average neighbors" curve in Figure 5.5. Rotated vectors of different documents or queries generated with the same amount of rotation (the same i) define a rotated space i with (v_{i·m}, ..., v_{i·m+m-1}) as its m-dimensional support subvector.

V^i = (v_{i·m}, ..., v_{l-1}, v_0, v_1, ..., v_{i·m-1}),  i = 0, ..., p-1    (5.8)

m = 2.3 · ln(n)    (5.9)
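Equations 5.8 and 5.9 can be sketched as below (the vectors in the test are arbitrary). Rotation is a cyclic shift, so inner products between equally rotated vectors are unchanged, which is the property that keeps the similarity measure intact across rotated spaces.

```python
import math

def rotation_unit(n_nodes):
    """m = 2.3 * ln(n): the estimated effective dimensionality of the
    CAN (Equation 5.9), rounded here to the nearest integer."""
    return round(2.3 * math.log(n_nodes))

def rotated_vectors(v, n_nodes, p):
    """The p rotated semantic vectors V^0 ... V^{p-1} of Equation 5.8;
    V^i starts at element i*m and wraps around."""
    m = rotation_unit(n_nodes)
    return [v[i * m:] + v[:i * m] for i in range(p)]
```

Each rotated vector V^i is then used as a separate CAN key, so the first m elements of V^i, which are the support subvector of rotated space i, fall in the partitioned dimensions of the CAN.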

Given a document A with semantic vector Va, we store its index at p places in the CAN, using V_a^i, i = 0, ..., p-1, as the keys. For a query with semantic vector Vq, we execute the pLSI algorithm in Figure 5.3 p times. Each time, it uses a different V_q^i to route the query and guide the search in rotated space i. Each rotated space independently returns matching documents based on the vectors in that space. Since the similarity between two semantic vectors is measured as their inner product (see Equation 5.3), rotation does not change the similarity measure. In a rotated space, documents close in the overlay are still close in semantics. Note that we still use full semantic vectors as the CAN keys. The similarity is also computed from full semantic vectors rather than from support subvectors. An example of rolling-index with m = 2 is shown in Figure 5.6(b) (page 192).

5.3.2 Discussions

Rolling-index uses the same CAN to partition more dimensions of the semantic space, but at an increased storage cost (p times the base storage space). The


low dimensions of each of multiple rotated spaces are partitioned onto a single CAN, and they correspond to different dimensions in the original semantic space. On the whole, more dimensions of the original semantic space are partitioned by a single CAN, and the different rotated spaces tend to complement each other. Because a smaller region is searched in each rotated space, the accumulated search region across multiple spaces is actually smaller than that in the original space. Rolling-index is an approximation to LSI. LSI uses full semantic vectors to cluster similar documents in the semantic space, whereas rolling-index uses support subvectors to cluster potentially similar documents in rotated spaces. In general, two similar subvectors cannot ensure that their full vectors are also similar, but we find that this probability is significantly higher for semantic vectors than for random vectors, because of the significance of the low-dimensional elements of semantic vectors and the correlation among the elements. We demonstrate this through experiments with the TREC corpus (see Section 6.1.1 for details of our experiments). We first retrieve 15 documents for each TREC 7&8 query based on the similarity of the 300-dimensional semantic vectors. The results form set A. In each rotated space, we then retrieve e · 15 documents for a query based solely on the similarity of the m-dimensional subvectors, where e is a constant multiplication factor. The results for the first four planes form set B. The "sv" series in Figure 5.9 show the average accuracy of set B with respect to set A, where accuracy = |A ∩ B| / |A| × 100%. When m = 25 and e = 128, B is only 1.3% of the entire corpus, but it already covers 90% of the documents in set A. We also conduct the same experiment using random vectors as documents and queries, and report the results as the "rand" series in Figure 5.9. The figure shows that, to some extent, similar semantic subvectors imply similar full semantic vectors, thanks to these features of semantic vectors. Therefore, a significant fraction of documents are likely to be correctly (though not perfectly) clustered in the CAN by the first several support subvectors.
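The accuracy measure is simply the overlap of the subvector-based results with the full-vector baseline; a short sketch with made-up result lists:

```python
def accuracy(full_vector_results, subvector_results):
    """Accuracy of a subvector-based result set B against the
    full-vector result set A: |A intersect B| / |A| * 100%."""
    A, B = set(full_vector_results), set(subvector_results)
    return 100.0 * len(A & B) / len(A)
```

Note that extra documents in B do not lower the score; accuracy measures only how much of the baseline A is recovered, which is why enlarging e can only help accuracy (at a higher search cost).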

[Figure 5.9 plots accuracy (0-100%) on the Y axis against the multiplication factor e (16 to 1024) on the X axis, with one series for each of sv and rand at m = 15, 25, and 35.]

Figure 5.9: The effect of using low-dimensional elements to identify relevant documents (the documents and queries are from TREC 7&8).

[Figure 5.10 is a histogram: number of queries (0-60) on the Y axis against accuracy (0%-100%) on the X axis.]

Figure 5.10: Histogram of the accuracy of the 100 TREC queries, when using subvectors of semantic vectors for search (m=25, e=128).


On the other hand, although the low-dimensional elements are statistically of higher importance for the entire corpus, some high-dimensional elements may carry heavy weight for an individual document discussing unpopular concepts. For queries about these concepts, rolling-index will be less effective. Figure 5.10 shows the accuracy histogram of the 100 queries when m=25 and e=128. A value x% on the X axis denotes the number of queries whose accuracy is in the range [x%, (x+10)%). A small number of queries have very low accuracy (e.g., one query's accuracy is 6.7%). The accuracy for these queries can be improved by increasing e, which unfortunately results in a bigger document set B to search. In Section 5.5, we will introduce an algorithm that selectively searches documents in a big B to achieve high accuracy at low cost.

Another potential solution to this problem is selective rotation. Given a document, in addition to storing its index in the first p rotated spaces, we also store the index in the rotated spaces whose corresponding support subvectors cover the heavily-weighted elements that are not covered by the first p spaces. Likewise, given a query, in addition to searching the first p rotated spaces, we also search some extra rotated spaces whose corresponding support subvectors cover the heavily-weighted elements of the query.

Currently, pSearch uses a single semantic space to organize documents. The dimensionality of the semantic space increases with the size of the corpus, which may make rolling-index less effective. We propose hierarchical document clustering to partition the document space into clusters, each of which comprises documents discussing closely-related topics [259]. Each cluster is then mapped onto the same CAN, using CAN's structure to partition only the document space inside a single cluster rather than the entire document space. Selective rotation and hierarchical partitioning are not included in our current pSearch framework.
A detailed study on these issues is a subject for future work.
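As a concrete illustration of the selective-rotation idea above, the sketch below picks the extra rotated spaces for a vector. The assumption that space i's support subvector covers original dimensions [i·m, i·m+m) and the weight threshold are simplifications introduced for illustration; they are not part of the actual design.

```python
import numpy as np

def extra_spaces(vec, m, p, weight_threshold=0.2):
    """Return IDs of rotated spaces, beyond the first p, whose support
    subvector (assumed to cover original dims [i*m, i*m+m)) contains an
    element whose |weight| exceeds the threshold."""
    extra = set()
    for j, w in enumerate(np.abs(vec)):
        i = j // m                    # space whose subvector covers dim j
        if i >= p and w >= weight_threshold:
            extra.add(i)
    return sorted(extra)

v = np.zeros(300)
v[5] = 0.9                            # heavy low-dimensional element: covered by space 0
v[130] = 0.4                          # heavy element outside the first p = 4 spaces
print(extra_spaces(v, m=25, p=4))     # space 5 covers dims 125..149
```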


5.4 Balancing Index Distribution

To cope with the uneven distribution of indices, we propose content-aware node bootstrapping, which forces the distribution of nodes in the CAN to follow the distribution of indices. At node join, the node randomly picks a document that it is going to publish and computes the semantic vector of the document. This semantic vector is randomly rotated to a space i (0 ≤ i < p), and the rotated semantic vector (instead of a random point, as suggested by [202]) is used as the point toward which the join request is routed. The node whose zone contains the rotated semantic vector splits in the middle along the lowest unpartitioned dimension (because of the high dimensionality of the CAN, there are always some unpartitioned dimensions) and hands over half of its zone to the new node. It is important for the zones to split regularly to minimize the number of neighbors.

Content-aware node bootstrapping can be further improved by the load-balancing technique we developed for eSearch (Section 4.4, page 152). That is, a new node generates several (instead of just one) random rotated semantic vectors; among the nodes whose zones contain these generated vectors, the new node splits the one that hosts the largest number of indices. We do not pursue this optimization further here, for simplicity.

Content-aware node bootstrapping has three effects.

• Balanced index distribution. A larger number of nodes will be used in areas of the semantic space that have a dense document population. Moreover, in a high-dimensional CAN, each dimension is partitioned at most once, so a node can never occupy a zone that does not intersect with the unit sphere in the semantic space (e.g., node V in Figure 5.4).
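A minimal sketch of the zone split in content-aware bootstrapping, assuming a flat list of zones over the unit cube and a trivial "route by containment"; the real system routes the join request through the CAN and rotates an actual document vector. The `rotate` callback is a placeholder.

```python
import random

class Zone:
    """A CAN zone: a [lo, hi) interval per dimension of the unit cube."""
    def __init__(self, dim):
        self.lo, self.hi = [0.0] * dim, [1.0] * dim

    def contains(self, point):
        return all(l <= x < h for l, x, h in zip(self.lo, point, self.hi))

    def split(self):
        """Split in the middle along the lowest unpartitioned dimension
        (in a high-dimensional CAN some dimension always spans [0, 1))."""
        d = next(i for i in range(len(self.lo))
                 if self.lo[i] == 0.0 and self.hi[i] == 1.0)
        mid = 0.5 * (self.lo[d] + self.hi[d])
        new = Zone(len(self.lo))
        new.lo, new.hi = list(self.lo), list(self.hi)
        self.hi[d], new.lo[d] = mid, mid      # hand over the upper half
        return new

def content_aware_join(zones, doc_vector, rotate, p, rng=random):
    """Route the join toward a randomly rotated semantic vector of one of the
    new node's documents and split the owner's zone."""
    point = rotate(doc_vector, rng.randrange(p))
    owner = next(z for z in zones if z.contains(point))
    zones.append(owner.split())
```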


• Index locality. Assume the documents published by a node have similar semantics. On space i, indices of a node's documents are likely to be published on the node itself or its neighbors.

• Query locality. Assume the documents published by a node are a good indication of the user's interests. On space i, queries submitted by the user will usually result in searching nodes neighboring the one where the query is submitted.

Figure 5.11: The effect of content-aware node bootstrapping.

Figure 5.11 evaluates this load-balancing technique by distributing the TREC corpus over a 10,000-node 300-dimensional CAN. Nodes are sorted in decreasing order of the number of indices they store. The X-axis gives the node percentage; the Y-axis gives the percentage of indices owned by the corresponding nodes. The CAN-rand-key series is the original CAN proposal, where nodes and indices are randomly placed; it serves as the baseline for comparison. The CAN-SV-key series uses random points for node bootstrapping, but indices are stored using semantic vectors as keys. pSearch-sp uses our content-aware bootstrapping, and indices are stored under semantic vectors; here p is the number of rotated semantic spaces. Note that even the load for CAN-rand-key is not completely balanced, due to randomness in the bootstrapping process. With a larger corpus, the load is expected to become more balanced. As can be seen from the figure, without load balancing, 5% of the nodes store 72% of the indices (the top curve). The load-balancing technique is effective even if only a single rotated space is used. Increasing the number of rotated spaces further balances the index distribution, because indices on different rotated spaces complement each other.

This bootstrapping process with multiple rotated spaces does not adversely affect the routing performance of the overlay; the CAN evaluated in Figure 5.5 uses this bootstrapping with four rotated spaces. It does, however, affect the neighbor distribution: nodes that own a large zone are likely to have more neighbors than others. When the number of nodes in the system varies from 250 to 8,000, the average number of neighbors for a node ranges from 10 to 20, whereas the maximum number of neighbors is about 3 to 5 times the average. To reduce the routing and maintenance load for nodes with an excessive number of neighbors, we can selectively drop some neighbors for those nodes. Greedy routing in such a network may fail; in this case, it can resort to an algorithm similar to GPSR [126] to complete the routing. We leave an evaluation of this optimization as a subject for future work.

5.5 Reducing the Search Space

Rolling-index clusters indices in the overlay based on their semantics, making it possible to find the most relevant documents for a query by searching a fraction of the nodes in the overlay. However, existing centralized index structures used to limit the search space for multidimensional data usually work well only for low-dimensional data; the search space grows quickly as the dimensionality of the data increases [278]. In high-dimensional spaces, it cannot be determined with certainty whether the closest data points have already been found until a large fraction of the space has been explored, a problem known as the curse of dimensionality. Our own unsuccessful experience in applying several distance-based or space-filling-curve-based index structures to the semantic space corroborates this curse.

Weber et al. listed several interesting observations about high-dimensional spaces [278]. We summarize some of them below.

• High-dimensional data spaces are sparsely populated.

• Even very large hyper-cube range queries in high-dimensional spaces are not likely to contain a point.

• The distance between a query and its nearest neighbor grows steadily with the dimensionality of the space.

A simple heuristic for pSearch could be to search only the node in which a query vector falls and its direct routing neighbors. But this is not sufficient, because the distance between a query and its nearest neighbor is large in a high-dimensional space: many of the matching documents will reside on indirect routing neighbors, and the number of indirect neighbors grows quickly as the dimensionality increases. Our experiment shows that, with a 4-space, 10,000-node, 300-dimensional system, visiting the nodes in which the rotated query vectors fall and their direct routing neighbors (93 nodes per query) achieves an accuracy of only 42.7%. The accuracy is defined as the overlap between the results returned by pSearch and those returned by a centralized LSI (see Section 5.6.1 for the precise definition). Even if the neighbors within two routing hops (810 nodes) are visited, the accuracy only improves to 75.2%.
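The third of Weber et al.'s observations is easy to reproduce with a small Monte Carlo experiment; the point counts and dimensionalities below are arbitrary choices for illustration.

```python
import math
import random

def mean_nn_distance(dim, n_points=200, n_queries=20, seed=1):
    """Average Euclidean distance from a random query to its nearest neighbor
    among n_points uniform points in the unit hypercube."""
    rng = random.Random(seed)
    pts = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    total = 0.0
    for _ in range(n_queries):
        q = [rng.random() for _ in range(dim)]
        total += min(math.dist(q, p) for p in pts)
    return total / n_queries

# The nearest neighbor drifts steadily farther away as dimensionality grows.
low, high = mean_nn_distance(2), mean_nn_distance(300)
print(f"mean NN distance: d=2 -> {low:.2f}, d=300 -> {high:.2f}")
```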



Figure 5.12: An example of content-directed search.

5.5.1 Content-Directed Search

Because of the large nearest-neighbor distance, a naive nearest-neighbor search will not work well unless a large number of nodes are searched. To solve this problem, we use the content (indices) stored on nodes and the recently processed queries to guide searches to the "right" nodes. In the sparse high-dimensional semantic space, documents usually form tight clusters (see the top curve in Figure 5.11). If a relevant document is found, it is likely to be surrounded by other relevant documents.

We illustrate the basic idea with an example in Figure 5.12. 1-25 are the IDs of nodes in a two-dimensional CAN, a-f are semantic vectors of documents, and q is the semantic vector of a query. The user wants to retrieve three documents relevant to query q. Judging from the Euclidean distance, documents a, b, and c should be the search results. The search starts at node 13, whose zone contains q. No document is found on node 13. We maintain a queue N that contains the candidate nodes yet to search. After searching node 13, its routing neighbors are added to this queue, N = {8, 12, 14, 18}. In the background, each node samples indices stored on its neighbors. These samples are used to decide the search order for nodes in N. In this example, the samples from node 14 are similar to the query. We search node 14 next and find document a. Node 14's neighbors are added to N, N = {8, 12, 18, 9, 15, 19}. Likewise, we choose node 9 to search because its samples are similar to the query. Document b is found, and node 9's neighbors are added to N, N = {8, 12, 18, 15, 19, 4, 10}, which leads us to search node 4 and find document c. After that, nodes 12, 11, and 17 are searched in turn, but no better matching documents are found. At this point, the chance of finding documents better than a, b, and c is low, and the search is terminated.

Although the example above assumes that nodes are searched sequentially, the search can be sped up by accessing nodes in parallel; the tradeoff is the number of nodes searched unnecessarily. This search algorithm takes advantage of the processing power of a large number of nodes to pre-process (sample) the semantic space in the background to enable efficient search. The use of sampled full semantic vectors to direct searches also removes some of the inaccuracy due to the limited clustering capability of the support subvectors used in rolling-index. Regardless of where the documents nearest to a given query q reside, they must be in the space surrounding the query; we just need to find the right way to reach them. Leveraging the uneven distribution of documents in the semantic space, this algorithm uses samples to guide the way to relevant documents.

5.5.2 Description of the Search Algorithm

We proceed to give a more formal description of our search algorithm. Table 5.1 lists the notation used in our discussion. The full steps of the algorithm are described in Table 5.2.


Notation        Description
Z               generally denotes a node.
V_q^i           the i = 0, ..., p−1 rotated semantic vectors for query q; p is the number of rotated spaces.
Z_q^i           the node whose zone contains V_q^i.
D^i[Z]          semantic vectors included in the indices that are stored on node Z and belong to space i.
S^i[Z, P]       the subset of D^i[P] sampled by node Z; the maximum size of this set is s.
Q^i[Z]          semantic vectors of queries recently processed by node Z that belong to space i; the maximum size of this set is g.
U^i[Z]          the vector that summarizes D^i[Z] and Q^i[Z].
r[Z, Z_q^i]     the number of routing hops to reach node Z_q^i from node Z.
e^i[Z, V_q^i]   the estimated highest similarity between V_q^i and the vectors in D^i[Z], used to decide the search order for nodes in N.
R               a queue of indices of retrieved documents, with at most k entries; k is the number of documents the user wants to retrieve.
N               a queue of candidate nodes yet to search.
C               a counter indicating that no new relevant documents were found in the most recent C node visits.
T               a threshold that decides when to terminate the search.

Table 5.1: Notation used in the content-directed search algorithm.


1. If it is space 0, initialize R = ∅; otherwise, inherit R from the searches in the previous spaces.
2. Initialize N = {Z_q^i}, C = 0, r[Z_q^i, Z_q^i] = 0, and e^i[Z_q^i, V_q^i] = 1.
3. Terminate the search if C > T, where T is computed from Equation 5.13.
4. Find the node in N that has the highest estimated similarity e^i; denote this node Z. Remove Z from N and perform a search on Z. Among the documents in R and D^i[Z], select the k documents that have the highest similarity to the query and put them into R.
5. If R is unchanged in step 4, i.e., no better document is found on Z, set C = C + 1; otherwise, set C = 0.
6. For each neighbor P of node Z, add P into N. Tag P with the estimated similarity e^i computed from Equation 5.12, and set r[P, Z_q^i] = r[Z, Z_q^i] + 1.
7. Go to step 3.

Table 5.2: The content-directed search algorithm, executed for each rotated space i.
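The steps above can be sketched for a single rotated space roughly as follows. This is a simplified illustration: `similarity` and `estimate` stand in for the cosine similarity and the sample-based estimate of Equation 5.12, the node objects with `.docs` and `.neighbors` attributes are invented for this sketch, and T is taken as a fixed argument rather than recomputed from Equation 5.13.

```python
import heapq
from itertools import count

def content_directed_search(Zq, query_vec, k, T, similarity, estimate, R=None):
    """Content-directed search of one rotated space (sketch of Table 5.2)."""
    tie = count()                           # tiebreaker for heap ordering
    R = [] if R is None else R              # step 1: inherit R across spaces
    N = [(-1.0, next(tie), Zq)]             # step 2: start at Zq with e^i = 1
    C, seen = 0, {Zq}
    while N and C <= T:                     # step 3: terminate when C > T
        _, _, Z = heapq.heappop(N)          # step 4: highest estimated e^i
        changed = False
        for doc in Z.docs:                  # merge Z's documents into the top-k R
            entry = (similarity(doc, query_vec), next(tie), doc)
            if len(R) < k:
                heapq.heappush(R, entry); changed = True
            elif entry[0] > R[0][0]:
                heapq.heapreplace(R, entry); changed = True
        C = 0 if changed else C + 1         # step 5: count fruitless visits
        for P in Z.neighbors:               # step 6: enqueue unseen neighbors
            if P not in seen:
                seen.add(P)
                heapq.heappush(N, (-estimate(P, query_vec), next(tie), P))
    return sorted(R, reverse=True)          # best matches first
```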


We use X[Z] (or X[Z, Y]) to denote an attribute X of node Z (with Y as an additional parameter). For instance, D[Z] is the set of semantic vectors in the indices stored on node Z, and Q[Z] is the set of semantic vectors of queries recently processed by Z. A node Z stores indices and processes queries for multiple rotated spaces; D^i[Z] and Q^i[Z] refer only to those parts relevant to space i.

For a node Z, we use a single vector U^i[Z], computed from Equations 5.10 and 5.11, to summarize the indices stored on Z and the queries recently processed by Z. Equation 5.11 normalizes H such that U^i[Z] is of unit length. U^i[Z] is the centroid (center of mass) of the indices in D^i[Z] and the queries in Q^i[Z].

H = \sum_{d \in D^i[Z]} d + \sum_{c \in Q^i[Z]} c    (5.10)

U^i[Z] = H / |H|    (5.11)

In the background, Z requests each of its neighbors P to return k_c samples of the semantic vectors that are in indices stored on P and have the highest similarity to the summary vector U^i[Z]. Z also requests k_r random samples from P. In our current implementation, k_c = 0.8s and k_r = 0.2s, where s is a constant. These returned semantic vectors are stored on Z as a sample set S^i[Z, P], which is used as an estimate of the indices stored on P. When Z is searched for a query q (with rotated semantic vectors V_q^i), Z uses Equation 5.12 to estimate the highest similarity between the query vector V_q^i and the semantic vectors in indices stored on P, based on the sample set S^i[Z, P].

e^i[P, V_q^i] = \max_{d \in S^i[Z, P]} \cos(d, V_q^i)    (5.12)
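Assuming unit-length vectors (so that the dot product equals the cosine), Equations 5.10-5.12 amount to:

```python
import numpy as np

def summary_vector(doc_vectors, query_vectors):
    """U^i[Z] of Equations 5.10-5.11: the unit-length centroid of the index
    vectors stored on a node and the queries it recently processed."""
    H = np.sum(list(doc_vectors) + list(query_vectors), axis=0)
    return H / np.linalg.norm(H)

def estimated_similarity(samples, query_vec):
    """e^i[P, V_q^i] of Equation 5.12: the highest cosine similarity between
    the query and the sampled index vectors of neighbor P (all unit length)."""
    return max(float(np.dot(d, query_vec)) for d in samples)
```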

The metric in Equation 5.12 is used to direct the search on each rotated space. It chooses the nodes whose sampled indices have high similarity to the query (i.e., a high e^i[P, V_q^i] value) to search first, and stops when no better matching document is found during the most recent T node visits. Let Z_q^i denote the node whose zone contains the rotated query vector V_q^i. Node Z_q^0 (on space 0) acts as the coordination center during the search. It maintains a queue of candidate nodes yet to search (N) and a queue of indices of identified relevant documents (R). On each rotated space, the search starts from node Z_q^i. A node under search returns the estimated similarity e^i[P, V_q^i] of its neighbors P and the indices of the best matching documents found locally to node Z_q^0. Z_q^0 adds the returned indices and the neighbor nodes P into the index queue R and the node queue N, respectively. Z_q^0 decides which node(s) in N to search next based on the estimated similarity returned from the searched nodes.

Recall that the search starts from node Z_q^i on each rotated space. For a node Z that is visited during the search, there exists a path that leads the search to Z from Z_q^i, and all nodes on the path are also visited during the search. Let r[Z, Z_q^i] denote the number of hops on this path (i.e., the number of nodes on the path). The quit threshold T is dynamically computed from

T = max(5, F − 5i) · 0.8^w    (5.13)

w = \min_{Z \in N} r[Z, Z_q^i]    (5.14)

where the quit bound F is a constant set by the user or a default system value, i is the ID of the rotated space under search, and w is the smallest number of hops to reach the nodes still in the node queue N from node Z_q^i (note that, once searched, a node is removed from N). Two components decide the quit threshold T. The first component, max(5, F − 5i), decreases as the space ID i increases. The intuition is that on lower spaces we want to search more nodes, because of the importance of the low-dimensional elements. To guarantee that at least some search is performed on high spaces, this component is no less than 5. The second component, 0.8^w, monotonically decreases during the search. The intuition is that the quit threshold should get tighter after the near neighbors of Z_q^i have already been searched. In practice, w ranges only between 0 and 2, since the algorithm always quits before all two-hop neighbors of Z_q^i have been searched.

The algorithm in Table 5.2 is described sequentially. In practice, multiple nodes are searched concurrently under three rules. (1) Nodes Z_q^i (i = 0, ..., p − 1) are searched in parallel. (2) The direct routing neighbors of Z_q^0 are always searched, and are searched in parallel; because of the importance of the low-dimensional elements of semantic vectors, these nodes would typically be visited in a sequential search anyway. (3) In addition to the first two rules, in each round we select the top b nodes from the node queue N to search in parallel. b is decided using Equation 5.15, where T is the quit threshold in Equation 5.13 and d is a "concurrency factor". Currently we use a constant d; designing an algorithm to dynamically tune d is a subject for future work.

b = min(d, T/2)    (5.15)
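For reference, the quit threshold and concurrency bound of Equations 5.13-5.15 are straightforward to compute:

```python
def quit_threshold(F, i, w):
    """T of Equation 5.13: looser on low spaces (small i), tighter once the
    unsearched candidates are w or more hops from Z_q^i (Equation 5.14)."""
    return max(5, F - 5 * i) * 0.8 ** w

def concurrency(d, F, i, w):
    """b of Equation 5.15: how many nodes to search in parallel per round."""
    return min(d, quit_threshold(F, i, w) / 2)

# With the default F = 24 on space 0 and all candidates one hop away:
print(quit_threshold(24, 0, 1))
```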

In addition, at the expense of extra storage space, our current implementation allows a node to replicate its neighbors' indices and to process queries on their behalf, in order to reduce the number of visited nodes and the amount of data transmitted during a search. In the future, the neighboring-content sampling process can be extended to implement selective index replication: a node can use Equations 5.10 and 5.11 to compute a vector to represent itself and replicate only those of its neighbors' indices whose similarity to this vector is beyond a threshold. Selective replication has the potential to reduce the amount of replicated indices while achieving performance similar to simple replication. We leave an evaluation of selective replication for future work.

It has been reported that a small number of queries are frequently submitted to search engines [143]. Results for frequent queries can be cached in a big region of the overlay to avoid processing them repeatedly and to relieve hot spots.


5.6 Experimental Results

We built a pSearch prototype to validate our algorithms. We implemented an overlay simulator based on CAN.2 Cornell's SMART system [35] implements VSM. We extended SMART with several modules and tools to implement LSI, using LAS2 from the SVDPACK package [252] to compute the SVD of large sparse matrices. We then linked SMART with the CAN simulator and implemented the pLSI algorithms to build a pSearch prototype. We validated the correctness of our LSI implementation using several corpora that come with SMART; the precision and recall are consistent with those reported in the literature [73].

The size of a semantic vector is about 1.2KB. Counting other metadata for a document, an index is about 2KB. If a machine dedicates 200MB of memory to pSearch, it can store about 100,000 indices. Assuming the Web grows to 10 billion pages, it would take about 100,000 machines to index the entire Web. This is roughly the biggest pSearch system we simulate.
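The capacity estimate above is simple arithmetic (using 2^20 bytes per MB; the text rounds both results to 100,000):

```python
index_size = 2 * 1024              # bytes per index (semantic vector + metadata)
memory_per_node = 200 * 2**20      # 200 MB dedicated to pSearch per machine
indices_per_node = memory_per_node // index_size
web_pages = 10_000_000_000         # hypothetical future Web size
machines = web_pages // indices_per_node
print(indices_per_node, machines)
```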

5.6.1 Experimental Setup

The corpus we use is disks 4 and 5 from TREC [263], excluding the Congressional Record. It includes 528,543 documents from heterogeneous sources such as news and magazines, with a total size of about 2GB. Topics 351-450 from the ad-hoc track are used as queries.

We use SMART to index the TREC corpus. The entire document or query content is indexed with our "atz" term weighting scheme.3 The SMART stopword list is used as is. The SMART stemmer is used without modification to strip word endings. The indexing process takes about 20 minutes to complete on a 1.7GHz Pentium IV machine with 1GB of memory.

We randomly sample 15% of the indexed documents to generate a term-by-document matrix. Terms appearing in only one sampled document are not included in the matrix. This leaves us with 79,316 sampled documents and 83,098 indexed terms in the matrix. We apply SVD to this matrix to compute the basis of the semantic space; the SVD computation takes about 57 minutes. Using this basis, we project all 528,543 documents into the semantic space, computing a semantic vector for each document, which takes about 7 minutes. Note that most of the above process can be done by pSearch Engine nodes concurrently.

The main metrics we use are the number of visited nodes during a search and the accuracy of the search results. System resource consumption, which is proportional to the number of visited nodes, is analyzed in Section 5.6.5. For each configuration, we use LSI to retrieve a fixed number of documents for a query. The returned documents form set A; we refer to documents in A as relevant documents. We then use pLSI to retrieve the same number of documents for the same query. The returned documents form set B. The accuracy is defined as follows:

Accuracy = |A ∩ B| / |A| × 100%    (5.16)

The accuracy metric compares pLSI against the centralized LSI baseline. A discussion of pLSI's absolute performance (i.e., the goodness of the search results under the subjective judgment of users) can be found in Section 5.6.4 and Chapter 6.

2 In practice, we can employ eCAN [289, 291] to boost CAN's routing performance while retaining CAN's Cartesian space abstraction.
3 "atz" differs from the classical "atc" [35] in the vector normalization step. Instead of normalizing the term vector to unit length, "atz" divides each element of the term vector by \sqrt{\sum_{i=0}^{20} tf_i^2 / (i+1)}, where tf_i is the i-th heaviest-weight element in the vector before normalization. Our results show that "atz" outperforms "atc" for the TREC corpus.
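The "atz" normalization described in the footnote above can be sketched as below; this is a plain-Python illustration, whereas the real scheme operates inside SMART's term-weighting pipeline.

```python
import math

def atz_normalize(weights):
    """Sketch of the "atz" normalization step: instead of unit-length
    normalization ("atc"), divide by a norm built from the 21 heaviest
    term weights, each discounted by its rank."""
    top = sorted(weights, reverse=True)[:21]       # tf_0 .. tf_20
    norm = math.sqrt(sum(tf * tf / (i + 1) for i, tf in enumerate(top)))
    return [w / norm for w in weights]
```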

5.6.2 Search Efficiency and Efficacy

Table 5.3 shows the parameters we vary in our experiments and their default values. Unless otherwise noted, our experiments use these default values without index replication. Our default baseline uses rolling-index with four rotated semantic spaces. While rolling-index does help reduce the number of visited nodes for a given accuracy, this number can still be high. Our baseline therefore combines rolling-index with content-directed search, in which each node samples a default of 50 indices from each of its neighbors.

    description                                      default value
n   number of nodes in the system                    10,000
l   dimensionality of LSI and CAN                    300
p   number of rotated semantic spaces                4
m   rotated dimensions                               use Equation 5.9
s   size of the sample set S^i[Z, P]                 50
F   quit bound in Equation 5.13                      24
k   number of returned documents for a query         15
g   number of warm-up queries (size of Q^i[Z])       0
d   the concurrent-search factor in Equation 5.15    1

Table 5.3: Parameters varied in experiments.

Effect of varying the system size, corpus size, and number of returned documents.

Figure 5.13 shows the effect on the number of visited nodes and the accuracy of the search results when varying the number of nodes. The results are averaged over 100 queries. The only non-default parameter is s (the number of samples): with 500 nodes, we set s to 150, and we decrease s by a factor of two each time the number of nodes quadruples, because the total number of documents is fixed and each node can sample a larger percentage of its neighbors' content when the indices are spread over a larger number of nodes.

From the figure, we can observe that as the system size increases exponentially, the number of visited nodes increases only moderately. This is because the number of neighbors of a node, one deciding factor for the number of visited nodes, increases logarithmically with the system size. Even for the 32k-node system, pSearch can achieve an accuracy of 90% by visiting just 139 nodes. For the 128k-node system, the accuracy is about 86%; we will show how to further improve the accuracy by varying other parameters. Second, as we relax the quit bound F that controls the accuracy, the accuracy improves slowly with the increase in the number of visited nodes, suggesting that search results can be returned to users incrementally, without waiting for the search to reach the final quit bound. Since only 30-50% of the documents returned by state-of-the-art information retrieval algorithms are useful, a user may want to set a low quit bound to trade accuracy for efficiency. For instance, when searching for "LATEX tutorial", the user would be happy to find just some good tutorials.

Unlike Figure 5.13, which varies only the number of nodes, Figure 5.14 varies the number of nodes and the size of the corpus proportionally, i.e., the full TREC corpus for 10k nodes, half the TREC corpus for 5k nodes, and so on. The search cost in this figure increases moderately as the system size and corpus size scale.

Figure 5.15(a) shows the effect of varying the number of returned documents for each query. Although the number of visited nodes grows quickly while the accuracy remains the same, the average number of nodes that must be searched to return one relevant document decreases drastically as we increase the number of returned documents, as illustrated in Figure 5.15(b). When the user requests 15 documents, on average 5.9 nodes must be searched to find one relevant document. When the number of returned documents increases to 960, on average only 0.47 nodes must be searched to find one relevant document (i.e., each searched node returns more than one relevant document). It has been reported that a significant percentage of users view only the top 10 search results [143].
We believe that using 15 as the default number of returned documents is appropriate.



Figure 5.13: The effect of varying the system size.


Figure 5.14: The effect of simultaneously varying the system size and corpus size.

Figure 5.15: The effect of varying the number of returned documents for a 10k-node system: (a) visited nodes and accuracy; (b) visited nodes per returned document.


Effect of varying the content-directed search heuristics.

We next evaluate the heuristics of the content-directed search. The results for a 10k-node system are shown in Figure 5.16(a). For the content series, g=0 and p=4; for the query series, g=5000 and p=2, where g is the number of past queries used by the search heuristic and p is the number of rotated spaces. The content heuristic uses only the node content to direct searches. The query heuristic warms up the system by processing 500,000 queries before measuring performance, and then uses both node content and past queries to direct searches. When queries have locality, learning from past history can increase the accuracy by up to 5.9% while reducing the number of visited nodes, due to a more selective sampling process for neighboring content. Figure 5.16(b) shows the accuracy histogram of the 100 queries measured in the experiment; x% on the X axis denotes the number of queries whose accuracy is between x% and (x+10)%. Compared with the content heuristic, the query heuristic improves the accuracy of about 15% of the queries from [90%, 100%) to 100%. The average improvement in accuracy is about 3-5%.

In contrast to our content-directed search, a simpler heuristic would prefer searching the zones whose centers are closest to the query vector, i.e., using the centers of the zones as an estimate of the content stored on them. We call this strategy zone-directed search. Figure 5.17 reports the visited nodes and accuracy with this heuristic. Compared with the results for content-directed search (see Figure 5.16(a)), this heuristic is much less effective: the accuracy improves slowly as the number of visited nodes increases, showing that it does not choose the "right" nodes to search. This experiment demonstrates the importance of using content samples to direct the search, taking advantage of the clustering effect of documents in the semantic space.


Figure 5.16: Comparing the content-directed search heuristics for a 10k-node system: (a) accuracy and visited nodes; (b) accuracy histogram.



Figure 5.17: Performance of zone-directed search, which prefers searching zones whose centers are the closest to the query vector.

Figure 5.18: The effect of replication on a 10k-node system.

219

Effect of replication. All the above results are achieved without replication. Replication can improve both accuracy and efficiency (see Figure 5.18). The no-repl series, with no replication, serves as the baseline. For the repl series, each node replicates its direct neighbors' content. For the repl-content series, besides replicating neighbor P's content, a node Z also replicates the samples that P keeps for P's neighbors, i.e., each node has samples of content within two routing hops. Like the content series in Figure 5.16, we set g=0 when doing the sampling, meaning no past query is used in the heuristic. The figure suggests that when replication is used, a small number of rotated spaces and a tight quit bound already achieve good accuracy: for instance, visiting only 24 nodes in a 10k-node system achieves an accuracy of 96.8%. This is because searching the replicated nodes is avoided and the search is directed to the right nodes more accurately.

Results for a large system. Figure 5.19 presents the performance of a 128k-node system. The configurations for the content and query series are the same as in Figure 5.16; the repl and repl-content series are the same as in Figure 5.18. The repl-query series differs from repl-content in that it uses the query heuristic for sampling, with g=5000 and p=2. As can be seen in the figure, the accuracy of the content series approaches 90% when the quit bound is relaxed, while only 0.2% of the nodes are visited. Combining replication and the query heuristic achieves an accuracy of 91.7% by visiting only 19 nodes, or an accuracy of 98% by visiting 45 nodes. It must be acknowledged that this example distributes a relatively small corpus over a large system. However, the results in Figure 5.14 show that pLSI can retain good performance as the corpus size and system size scale proportionally. As will be shown in Section 5.6.6, the search efficiency actually improves as the corpus size increases.

[Figure 5.19: Performance of a 128k-node system. Visited nodes and accuracy vs. quit bound for the content, query, repl, repl-content, and repl-query series.]

5.6.3 Distribution of Result Sources

We now try to understand the distribution of retrieved documents across the nodes in the system. The percentage of relevant documents retrieved from each rotated space is shown in Figure 5.20(a). The left Y-axis is the percentage of relevant documents found on each rotated space. The right Y-axis is the percentage of nodes visited on each rotated space out of all visited nodes (rather than the total number of nodes in the system). For the content heuristic, about 75.8% of relevant documents are found on the first rotated space. For the query heuristic, this number goes up to 92.3%.

[Figure 5.20: Distribution of documents found for a 10k-node system. (a) Rotated spaces; (b) hop counts.]

The important message here is that although the optimal dimensionality of LSI is between 50 and 350, a large fraction of documents can be correctly (though not perfectly) clustered in the overlay by the low-dimensional elements. This property, a result of the SVD (see Figure 5.8), makes search in the semantic space much easier than in other high-dimensional spaces. Using the sampled full semantic vectors to direct searches, the content-directed search algorithm also helps remove some of the inaccuracy due to the limited clustering capability of the low-dimensional elements.

Figure 5.20(b) shows the percentage of relevant documents found and nodes visited at different hop counts from Zqi. Here the hop count corresponds to r[Z, Zqi] in Table 5.1, which measures how many steps it takes to reach Z from Zqi during the search; note that the hop count can be greater than the shortest route between Z and Zqi. In this figure, the visited nodes curve fits well with the docs found curve, meaning that our content-directed heuristics are effective at directing searches to the right nodes even when the search is already several hops away from Zqi. A significant percentage of relevant documents are found more than two hops away from Zqi, which is why visiting only the direct routing neighbors of Zqi is not sufficient. The query heuristic has a longer tail than the content heuristic, showing that it is more accurate at directing searches to the right nodes before the quit bound is reached.

5.6.4 Sensitivity to System Parameters

The final set of experiments evaluates the sensitivity of pLSI to the underlying VSM baseline and system parameters.

Term weighting schemes. LSI is a proposal to improve VSM and is built on top of it. VSM produces the sampled term-by-document matrix from which the basis of the semantic space is computed; LSI's precision and recall are therefore tied to the underlying VSM baseline [80]. We believe that the ongoing efforts in the IR community to improve the performance of the baseline are orthogonal to our efforts to make the decentralized implementation approach the centralized baseline. When a more advanced VSM baseline is plugged in, we expect LSI's, and hence pLSI's, absolute performance to improve accordingly. Figure 5.21 supports this claim by showing pLSI's performance with different underlying VSM baselines (see [35] for the term weighting schemes in parentheses). Regardless of the absolute performance of the underlying VSM baselines, pLSI consistently achieves high relative accuracy at a reasonable cost, showing great promise that pLSI can improve along with future developments of advanced VSM baselines.

Query length. The sensitivity to the number of query terms is reported in Figure 5.22. Each TREC query consists of three parts: title, description, and narrative. The all series uses all three parts as the query, which is our default configuration; the title+desc series uses the title and description; the title series uses the title only. On average, a query in the all, title+desc, and title series contains 21, 7.8, and 2.4 terms, respectively. Overall, pLSI's relative accuracy is not very sensitive to the number of query terms, although LSI's absolute performance may vary.

Parallel search. In Figure 5.23 we vary the concurrent-search factor d (the numbers in parentheses), which specifies the number of nodes to search in parallel. The baseline is the content heuristic (d=1) in Figure 5.16(a). As can be seen from the figure, increasing the search concurrency increases the number of visited nodes only moderately, while speeding up the search process by nearly a factor of d. With an algorithm to fine-tune d dynamically, we expect to achieve even better speedup at a lower cost.

Rotated dimensions. The sensitivity to the number of dimensions m by which each space is rotated is shown in Figure 5.24. pLSI is not sensitive to m so long as it is larger than a certain threshold (e.g., 10), because the first 40 important dimensions of the semantic space are already partitioned by the four rotated spaces in use, and the content-directed search algorithm helps remove some of the inaccuracy due to the use of rolling-index.

Dimensionality of the semantic space. Figure 5.25 shows the effect of changing the dimensionality l of the semantic space (the numbers in parentheses). pLSI is not very sensitive to l. When the dimensionality is reduced to 100, we see a small but noticeable improvement in accuracy, because the ratio of the effective dimensionality of the CAN to the dimensionality of the semantic space increases, limiting the inaccuracy introduced by the unpartitioned dimensions. Although the relative accuracy of pLSI to LSI improves as the dimensionality decreases, the absolute precision of the LSI baseline actually suffers. In previous work, the suggested value for l is 50-350, but we expect it to increase as the corpus grows. This insensitivity to l suggests from another angle that pLSI has good potential to scale with corpus size.

Sample size. Lastly, we vary s, the number of indices that a node samples from each of its neighbors (see Figure 5.26). To leave room for the accuracy to improve as s varies, we set the quit bound F and the number of rotated spaces p differently for different configurations: for content, F=20 and p=4; for query, F=10 and p=2; for repl-content and repl-query, F=6 and p=2. From this figure, we observe that the query heuristics are insensitive to the number of samples, due to their more accurate index sampling process. The content heuristics are more sensitive to s when s is small; once s is sufficiently large, the accuracy improves only marginally as s increases. Although the sample size required for the content heuristics to work effectively seems small, it is already a significant fraction of the average number of indices stored on a node, because of the relatively small size of the TREC corpus. We expect this fraction to decrease as the size of the corpus increases, although it remains to be seen how fast the sample size must grow to maintain accuracy. One simple solution is for each node to replicate its direct neighbors' content, eliminating the need for sampling. Replication also improves fault tolerance and allows neighboring nodes to process queries on behalf of each other, relieving hot spots and reducing the number of searched nodes.

[Figure 5.21: Sensitivity to term weighting schemes (a 10k-node system).]
[Figure 5.22: Sensitivity to query length (a 10k-node system).]
[Figure 5.23: Sensitivity to parallel search (a 10k-node system).]
[Figure 5.24: Sensitivity to rotated dimensions (a 10k-node system).]
[Figure 5.25: Sensitivity to the dimensionality of the semantic space (a 10k-node system).]
[Figure 5.26: Sensitivity to sample size (a 10k-node system).]
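The rolling-index mechanism that underlies the p rotated spaces can be sketched as follows. This is a simplified reading, in which rotated space i keys an index on the subvector of the semantic vector beginning at dimension i·m; the dissertation's exact construction is given in Section 5.3.1, and the key width used here is illustrative.

```python
def rotate(vec, i, m):
    """Rotate a semantic vector left by i*m dimensions for rotated space i.
    In space i, the leading dimensions of the rotated vector are the
    original dimensions i*m, i*m+1, ..., which serve as the CAN key."""
    off = (i * m) % len(vec)
    return vec[off:] + vec[:off]

def can_keys(vec, p, m, key_dims):
    """One key per rotated space; an index is published once under each key."""
    return [tuple(rotate(vec, i, m)[:key_dims]) for i in range(p)]

# Semantic vectors have decreasing element importance, so each rotation
# exposes the next-most-important dimensions to the CAN partitioning.
v = [0.9, 0.5, 0.3, 0.2, 0.1, 0.05, 0.02, 0.01]
print(can_keys(v, p=2, m=2, key_dims=2))
# [(0.9, 0.5), (0.3, 0.2)]: space 0 keyed on dims 0-1, space 1 on dims 2-3
```

This is why only a small p is needed: a few rotations already place the most important leading dimensions of the semantic vector under the CAN's control.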

5.6.5 Analysis of System Resource Usage

In this section, we analyze the storage and network resource consumption based on the experimental results. When publishing the index of a document, the data transmitted (Bd) and the storage cost (S) are given by

    Bd = SI · p · h + SI · R        (5.17)
    S  = SI · p · R                 (5.18)

where SI is the size of the index, p is the number of rotated semantic spaces, h is the average number of routing hops in the CAN, and R is the number of replicas of the index. The total data transmitted for processing a query is given by

    Bq = SQ · p · h + v · (SQ + SR)        (5.19)

where SQ is the size of the query message, v is the number of visited nodes, and SR is the size of the data returned from a visited node. We set the variables in the above equations based on the most pessimistic results for the 128k-node system: SI = SQ = 1.5KB, p=4, h=8, R=1 (no replication) or R=25 (with replication), v=230, and SR=1KB. The index and query are 1.5KB because they consist of a 300-dimensional semantic vector (1,200B) and some small metadata. The data returned from each visited node contain the similarity scores of and references to the top documents, and the estimated similarity (ei[P, Vqi]) of the neighbors, all of which are independent of corpus size, query length, and document length. For this pessimistic setting, the data transmitted for processing a query is 632KB. Without replication, the data transmitted for publishing one index is 49.5KB and the storage cost is 6KB; with replication, these numbers are 85.5KB and 159KB, respectively.

One big advantage of pSearch over previous P2P keyword search systems [102, 145, 205, 245] and Gnutella-style query-flooding systems is that its bandwidth consumption for processing a query is independent of corpus size, query length, and document length. The factors that decide the resource usage are either constant or increase slowly as the system scales, e.g., the routing hops h and the number of visited nodes v. On each visited node, computing the relevant documents can be done efficiently. When indices are in memory, one node in our pSearch prototype can process a query against 0.9 million indices per second; when indices are on disk, it can process a query against 0.1 million indices per second.

When each node replicates the content stored on its neighbors, the scalability of the system's storage capacity declines from O(n) to O(n/log(n)), where n is the number of nodes, since each node on average has O(log(n)) neighbors. Considering the increasing capacity and decreasing price of commodity storage devices, we do not consider this a big problem. The processing power of the system still scales with n.
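As a quick numeric check, Equations 5.17-5.19 can be evaluated directly under the pessimistic 128k-node settings quoted above (all sizes in KB); the publishing costs reproduce the 49.5KB, 6KB, and 85.5KB figures from the text.

```python
# Pessimistic 128k-node settings from the text (all sizes in KB).
SI = SQ = 1.5      # index / query message size
p, h = 4, 8        # rotated spaces, average CAN routing hops
SR = 1.0           # data returned per visited node
v = 230            # visited nodes

def publish_cost(R):
    """Eq. 5.17 (data transmitted) and Eq. 5.18 (storage) for R replicas."""
    Bd = SI * p * h + SI * R
    S = SI * p * R
    return Bd, S

def query_cost():
    """Eq. 5.19: total data transmitted for processing one query."""
    return SQ * p * h + v * (SQ + SR)

print(publish_cost(R=1))    # (49.5, 6.0): no replication, as in the text
print(publish_cost(R=25))   # Bd = 85.5 with replication, as in the text
print(query_cost())         # Eq. 5.19 under these settings
```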

5.6.6 Nearest-Neighbor Search with Synthetic Data

pLSI visits nodes in the search region to retrieve the documents nearest (most relevant) to a query. Due to the curse of dimensionality [278], the size of the search region grows quickly as the dimensionality of the semantic space increases. Although it is prohibitive to find the exact nearest neighbors of a point in a high-dimensional space, the above evaluations show that search in a semantic space (i.e., using document data) for approximate nearest neighbors can be conducted efficiently in pSearch. The next question we want to answer is how much of this efficiency and efficacy is due to the special features of semantic vectors that allow us to apply domain-specific optimizations such as rolling-index and content-directed search.

In the following, we evaluate approximate nearest-neighbor search in a 10-25 dimensional space using synthetic random data. Approximate nearest neighbors suffice because the similarity score generated by LSI is inherently fuzzy, and we will present in Section 6.1.2 (page 245) a technique to re-rank approximate nearest neighbors before producing the final search results. 10-25 dimensions are sufficient for our needs because pLSI essentially uses only low-dimensional subvectors to guide index placement and query routing.

In this set of experiments, we use a regularly partitioned CAN and turn off the content-directed search optimization in pLSI. Instead of using samples to decide which node to search next, we simply prefer searching zones (nodes) whose centers are closest to the query. We build an l-dimensional Cartesian space and split each dimension once in the middle to generate n = 2^l zones, each assigned to a different node. We generate c·n objects with random keys and store them in the CAN, so on average each node stores c objects. We then generate random points as queries and retrieve f close objects for each query. Denote by A the set of f objects retrieved by pLSI, and by B the set of f objects that are actually closest to the query. We use

    accuracy = |A ∩ B| / f × 100%

to measure pLSI's retrieval quality. Ideally the accuracy should be close to one. Unless otherwise noted, in the experiments below we retrieve f = 10 objects per query.

Figure 5.27 reports the number of visited nodes and the retrieval accuracy when varying the dimensionality of the space (l) and the number of nodes in the system (n = 2^l). On average we store nine objects on a node (c = 9). This figure shows that pLSI can achieve a reasonable accuracy by visiting a small number of nodes: for a one-million-node (1M-node) system in a 20-dimensional space, pLSI achieves an accuracy of 80% by visiting only 194.4 nodes. However, it becomes

much less efficient when one wishes to achieve an accuracy close to 100%. For the 1M-node system, it searches 407.5 nodes to achieve an accuracy of 89%. This indicates that approximate nearest-neighbor search is fundamentally much easier than exact nearest-neighbor search. pSearch exploits this fact by trading moderate accuracy for a substantial improvement in efficiency.

It is important to note that the search in this experiment is much easier than that in pSearch. This experiment always uses full vectors to guide searches because of the low dimensionality of the space. pSearch, in contrast, uses low-dimensional subvectors of semantic vectors to search high-dimensional full semantic vectors (Section 5.3.1). Results in Figure 5.9 show that this kind of search is hard for random vectors. Results in Figure 5.17 further show that, with a simple zone-directed search similar to that used in this experiment, pSearch cannot work effectively in a high-dimensional space.

Figure 5.28 presents the results in Figure 5.27 in a different way, focusing on achieving an accuracy close to 80%. Note that both the X axis and the left Y axis are in log scale. Figure 5.29 shows the visited nodes in Figure 5.28 as a fraction of the total node population. These figures suggest that a reasonable accuracy can be achieved by searching a small fraction of the nodes in the system, and that this performance is scalable with respect to system size and the dimensionality of the space.

In Figure 5.30, we vary, in a 32K-node, 15-dimensional CAN, the average number of objects stored on a node from 1 to 400 (in parentheses). Figure 5.31 simplifies the results by showing only the nodes visited to achieve an accuracy around 80%. These figures suggest that the search efficiency improves dramatically as object density increases, because the objects closest to a query lie at smaller distances from the query as object density increases.

In Figure 5.32, we vary, in a 32K-node, 15-dimensional CAN, the number of objects to retrieve for a query from 1 to 60 (in parentheses). This figure shows that the search cost grows with the number of wanted objects, indicating that it is inherently costly for distributed nearest-neighbor search to achieve a very high recall for a large corpus. Figure 5.33 replots the same data, showing the number of wanted objects divided by the number of searched nodes. This figure, however, suggests that the search cost per wanted object actually decreases when more objects are wanted.

[Figure 5.27: Visited nodes and accuracy when varying system size (1K-1M nodes) but keeping the same object density.]
[Figure 5.28: Nodes visited to achieve an accuracy around 0.8 when varying system size but keeping the same object density.]
[Figure 5.29: Nodes visited as a fraction of the total node population to achieve an accuracy around 0.8 when varying system size but keeping the same object density.]
[Figure 5.30: Nodes visited and accuracy when varying the average number of objects stored on a node from 1 to 400.]
[Figure 5.31: Performance of distributed nearest-neighbor search when varying object density.]
[Figure 5.32: Visited nodes and accuracy when varying the number of objects to retrieve for a query.]
[Figure 5.33: Number of objects to retrieve divided by the number of visited nodes, when varying the number of objects to retrieve for each query.]
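The accuracy metric |A ∩ B| / f used in these synthetic experiments can be computed against a brute-force reference, as in the toy sketch below (random data; the retrieved set is fabricated here for illustration, whereas the experiments above obtain it from the CAN search).

```python
import numpy as np

def nn_accuracy(retrieved_ids, query, objects, f):
    """accuracy = |A ∩ B| / f, where A is the retrieved set and B is the
    set of the f objects truly closest to the query (Euclidean distance)."""
    dist = np.linalg.norm(objects - query, axis=1)
    truly_closest = set(np.argsort(dist)[:f].tolist())  # reference set B
    return len(set(retrieved_ids) & truly_closest) / f

rng = np.random.default_rng(0)
objects = rng.random((1000, 15))   # random keys in a 15-dimensional space
query = rng.random(15)
f = 10

# Pretend an approximate search returned 8 of the true top-10 plus
# 2 ids that are not among them.
true_top = np.argsort(np.linalg.norm(objects - query, axis=1))[:f].tolist()
retrieved = true_top[:8] + [10_000, 10_001]
print(nn_accuracy(retrieved, query, objects, f))  # 0.8
```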

5.6.7 Summary of Experimental Results

We have quantified the efficiency and accuracy of pLSI by experimenting with synthetic data as well as one of the largest corpora available in the public domain. The following are our major findings.

• pLSI can achieve good accuracy at a reasonable cost with respect to bandwidth consumption and the number of visited nodes, and this performance is scalable with respect to system size, corpus size, and the number of returned documents.

• Rolling-index needs only a small number of rotated spaces to work effectively, limiting the space overhead as well as the number of visited nodes.

• The content-directed heuristics are effective at directing searches to the right nodes, and learning from past history can be beneficial when queries have locality.

• Replication improves performance, but at the expense of extra storage.

• pLSI's performance is not very sensitive to its major parameters. Its absolute performance shows good potential to improve along with future developments of advanced IR techniques.

• Although exact nearest-neighbor search is prohibitive in a high-dimensional space, approximate nearest-neighbor search in a 10-25 dimensional space can still be conducted efficiently.

5.7 Conclusion

We presented techniques to realize the idea of a semantic overlay in a P2P network. We quantified the efficiency of pLSI with respect to bandwidth and the number of visited nodes, and the extent to which pLSI can retain LSI's efficacy, by experimenting via simulation with one of the largest corpora available in the public domain. We made the following contributions in pSearch.

• pSearch is the first system that organizes content around its semantics in a P2P network. This makes it possible to achieve accuracy comparable to centralized IR systems while visiting a small number of nodes and transmitting a small amount of data.

• We proposed the use of rolling-index to resolve the dimensionality mismatch between the semantic space and a CAN, taking advantage of the higher importance of the low-dimensional elements of semantic vectors. This helps reduce the number of visited nodes by partitioning the semantic space along more dimensions.

• We employed content-aware node bootstrapping to balance the load, which as a side effect also introduces index locality and query locality. This helps distribute document indices evenly across nodes.

• We employed content-directed search, using index samples and recently processed queries to guide the search to the right places in the high-dimensional semantic space. This further reduces the number of visited nodes.

Although our experience with the TREC, MED, ADI, and CRAN corpora shows great promise, more experiments are needed to study whether pLSI can be applied to a much larger corpus with hundreds of millions or even billions of documents. The scalability of the LSI algorithm itself may limit pLSI's scalability; we address the problems with LSI in the next chapter. Among our enhancements to pLSI, content-aware node bootstrapping is expected to scale well with corpus size. For content-directed search, the search efficiency actually improves as corpus size increases, because relevant documents can be found at distances closer to the search center. Rolling-index may be affected adversely as corpus size increases, due to the enlarged dimensionality gap between LSI and the CAN. Our proposals to improve rolling-index, selective rotating and hierarchical clustering, are yet to be evaluated.


6 Scaling Latent Semantic Indexing for pSearch

We have shown in the previous chapter that pSearch can efficiently approximate a centralized implementation of LSI. Despite the fact that LSI has been a popular subject of research for more than a decade, our extensive experimentation reveals some limitations of LSI that are not addressed by prior work.

• When the corpus is large and heterogeneous, LSI's retrieval quality is inferior to methods such as Okapi [210] or pivoted normalization [234].

• The Singular Value Decomposition (SVD) that LSI uses to derive low-dimensional representations (i.e., semantic vectors) of documents is not scalable in terms of either memory consumption or computation time.

These limitations of LSI may also cripple pSearch's efficiency and efficacy because of pSearch's reliance on LSI. In this chapter, we propose techniques to address these limitations and show their use in the pSearch framework. [Footnote 1: This chapter addresses the challenge of using LSI in pSearch. Potentially one can design a P2P IR system around our idea of clustering documents in an overlay network but without using LSI. See Section 6.4 for a discussion of this.]

• To improve the efficiency of LSI, we propose an algorithm we call eLSI (efficient LSI) that reduces the size of the input matrix for SVD while retaining the matrix's important content. We partition documents into clusters and use the centroids of the clusters as "representative" documents. We further reduce the dimensionality of the centroid vectors by filtering out elements corresponding to low-weight terms. The resulting matrix, which has short centroid vectors as columns, is several orders of magnitude smaller than the original matrix. Finally, we apply SVD to this matrix to derive the basis of the semantic space. Experiments show that eLSI retains the retrieval quality of LSI but is several orders of magnitude more efficient, and that it outperforms four major fast dimensionality reduction methods [24, 76, 127, 188] in retrieval quality.

• We conducted extensive experiments with LSI using a large corpus and found that proper normalization of semantic vectors for both terms and documents improves recall by 76% compared with the standard LSI that strictly follows SVD.

• We found that LSI is noticeably inferior to Okapi when the corpus is large and heterogeneous. Unlike works that use LSI to improve retrieval quality, we use LSI as an implicit document clustering method that can work with low-dimensional data. [Footnote 2: Working with low-dimensional data is essential. Due to the curse of dimensionality [278], for a high-dimensional space implemented in a decentralized environment, the system would have to search a large number of nodes to answer queries.] We use low-dimensional subvectors of semantic vectors to implicitly cluster documents in an overlay, which helps reduce the search space, and then use Okapi to guide the search process and document selection. The combination of these techniques makes pSearch both more efficient and more effective. For a 32,000-node system, pSearch's precision at 10 retrieved documents (for the TREC corpus [263]) is 0.4 compared with Okapi's 0.45, while on average searching only 67 nodes for a query.
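The eLSI pipeline sketched in the first bullet can be illustrated as follows. This is a minimal sketch with illustrative choices (a naive k-means clustering and a simple total-weight term filter stand in for whatever the full algorithm specifies); the point is that the SVD runs on a tiny terms-by-clusters matrix instead of the full terms-by-documents matrix.

```python
import numpy as np

def elsi_basis(A, n_clusters=8, n_terms=100, k=5, iters=20, seed=0):
    """Sketch of eLSI: cluster documents, keep only high-weight terms of
    the cluster centroids, then run SVD on the small centroid matrix to
    derive a k-dimensional semantic basis. A is terms x documents."""
    rng = np.random.default_rng(seed)
    docs = A.T                                   # documents as rows
    centers = docs[rng.choice(len(docs), n_clusters, replace=False)]
    for _ in range(iters):                       # naive k-means
        d = ((docs[:, None, :] - centers[None]) ** 2).sum(-1)
        label = d.argmin(1)
        for c in range(n_clusters):
            if (label == c).any():
                centers[c] = docs[label == c].mean(0)
    C = centers.T                                # terms x clusters centroids
    weight = np.abs(C).sum(1)                    # total weight of each term
    keep = np.argsort(weight)[-n_terms:]         # drop low-weight terms
    U, S, Vt = np.linalg.svd(C[keep], full_matrices=False)
    return keep, U[:, :k]                        # retained terms, reduced basis

rng = np.random.default_rng(1)
A = rng.random((300, 120))                       # 300 terms x 120 documents
keep, Uk = elsi_basis(A)
print(Uk.shape)                                  # (100, 5)
```

The SVD input shrinks here from 300x120 to 100x8; for a real corpus the reduction is several orders of magnitude, which is where eLSI's speedup comes from.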


It should be emphasized that our contributions extend beyond their use in pSearch, since the problems we address are common to many other systems.

• Deriving low-dimensional representations of high-dimensional data is a common theme in many fields. Existing methods such as Principal Component Analysis (PCA) and LSI are not scalable, since their cost grows quickly with the input size. eLSI is efficient and produces high-quality low-dimensional data, and can therefore be used in many systems in place of PCA or LSI.

• The proper configuration we found for LSI should be of general interest to the LSI community.

• Existing LSI implementations compare the semantic vector of a query with that of every document. Dumais [80] noticed the inefficiency of this method and commented that no known technique can effectively reduce the search space for high-dimensional data. The fundamentals of our techniques, although originally developed for P2P systems, can also be applied in centralized systems to reduce the search space.

The remainder of the chapter is organized as follows. Sections 6.1 and 6.2 describe and evaluate techniques to improve LSI's retrieval quality and efficiency, respectively. Section 6.3 puts these techniques together and evaluates the complete pSearch system. Related work is discussed in Section 6.5. Section 6.6 concludes the chapter.

6.1 Improving the Retrieval Quality of LSI

When the corpus is large and heterogeneous, LSI’s retrieval quality is inferior to methods such as Okapi [210]. In this section, we describe techniques to improve LSI’s retrieval quality.


6.1.1 Choosing a Proper Configuration for LSI

Although the use of SVD is common among LSI implementations, we have seen proposals [20, 21, 73, 119] that differ in the manner in which the output of SVD is used, depending on 1. term normalization—whether to normalize rows of Uk (semantic vectors for terms) to unit length before using them in Equation 5.6 or 5.7 (page 183) to project document or query vectors; 2. document normalization—whether to normalize the projected vectors (ˆ q in Equation 5.6 or 5.7, page 183) to unit length before using inner product to compute the similarity (i.e., the choice of using inner product or cosine as the similarity metric); and 3. the choice of using Equation 5.6 or 5.7 to project vectors. There are a total of eight different variants of LSI depending on these choices. In analysis [188], LSI is usually treated as a process that uses a low-rank matrix to approximate a high-rank matrix while introducing the smallest error. The “standard” LSI that follows from this analysis should do neither term normalization nor document normalization and use Equation 5.7 to project vectors. Since the “standard” LSI is most widely used, we refer to it as the “baseline”. In literature, some systems also used the “non-standard” variants. To the best of our knowledge, no study systematically evaluated these fundamental choices for LSI. Below we will show that these choices dramatically affect LSI’s performance. To evaluate these choices, we extended the SMART system [35] with an LSI implementation, using SVDPACK [252] to compute SVD of large sparse matrices. The SMART stopword list and stemmer are used as is. The corpus we use is the disk 4 and 5 from TREC [263], excluding the Congressional Record. It consists of 528,543 documents with a total size of about 2GB. Since real queries are usually


short, we use only the title field of topics 351-450 as queries. A query on average contains 2.4 terms and has 94 relevant documents.

We experimented with various term weighting schemes to generate the input term-document matrix for SVD, including Okapi [210], pivoted normalization [234], and those built into SMART. Although Okapi and pivoted normalization perform well when used standalone, using them as a pre-processing step for LSI does not improve performance. The reason is that they use document length as an important factor in weighting, but it is hard to assign a length to short queries that works with the equations of Okapi or pivoted normalization. We found that the ltc term weighting in Equations 5.1 and 5.2 (page 181) works well with LSI for several corpora, and we use ltc to generate the input term-document matrix for SVD. Due to memory limitations, we select only 15% of the TREC corpus to construct an 83,098-term by 79,316-document matrix as the input for SVD, which projects vectors into a 300-dimensional space. The SVD computation consumes 1.7GB of memory and takes 57 minutes to complete on a 2GHz Pentium 4 machine.

We use LSI to retrieve 1,000 documents for each query and report the average number of retrieved relevant documents for a query in Figure 6.1(a). Figure 6.1(b) plots the precision-recall curves.³ In these figures, "no scale" uses Equation 5.6 to project vectors whereas "scale" uses Equation 5.7. "norm term" normalizes each row of Uk. "norm doc" normalizes the semantic vector q̂ in Equation 5.6 or 5.7 before computing similarity. "norm both" does both normalizations. Normalizing the semantic vectors of terms or documents significantly improves precision and recall. Combined, the two normalizations return 76% more relevant documents than the baseline that does no normalization. (In pSearch, we are more interested in LSI's recall since we use Okapi to rerank documents afterward; see Section 6.1.2.) The baseline, unfortunately, is what most LSI implementations use, since it follows directly from SVD.

³ Precision is defined as the number of retrieved relevant documents divided by the number of retrieved documents. Recall is defined as the number of retrieved relevant documents divided by the total number of relevant documents in the corpus. A precision-recall curve shows the precision at a given recall level.

[Figure 6.1: Comparison of different configurations for LSI (using TREC). (a) Retrieved relevant documents. (b) Precision-recall.]

As pointed out in [119], normalizing the semantic vectors for terms improves performance by emphasizing rare terms. Despite the compensation from the IDF component in Equation 5.1, rare terms tend to have a small norm after truncated SVD because their semantics are usually captured by high-dimensional elements that are truncated away. Consequently, rare terms contribute little to the final similarity score that differentiates documents. The benefit of normalizing document vectors again after SVD (the first normalization, in Equation 5.2, is before SVD) corroborates the long-standing belief that cosine is a robust measure of similarity.

[Figure 6.2: Retrieved relevant documents by LSI with different configurations (using Medlars).]

We also conduct the same experiment with the Medlars corpus (available in the SMART package [35]), which has 1,033 documents and 30 queries. Documents and queries are projected into a 50-dimensional semantic space. We retrieve 15 documents for each query and report the average number of retrieved relevant documents in Figure 6.2. Unlike that for TREC, LSI's performance for Medlars varies


marginally with different configurations. We conjecture that the performance difference between Figure 6.1(a) and Figure 6.2 arises because a 50-dimensional space is sufficient for the small, homogeneous Medlars corpus, whereas a 300-dimensional space is insufficient for the large, heterogeneous TREC corpus. Another experiment corroborates this conjecture: when using only an insufficient 15-dimensional space for Medlars, "norm both" outperforms the baseline by 30%, which is consistent with the trend for TREC. This experiment demonstrates the importance of using large, heterogeneous corpora in evaluations. Most past experiments with LSI, however, use a relatively small corpus.

In summary, normalization is beneficial when the dimensionality of the semantic space is insufficient to capture the fine structure of the corpus, which is true for most large corpora. We choose "norm both" with Equation 5.6 as the configuration for LSI. There is no performance difference between using Equation 5.6 and 5.7; we opt for Equation 5.6 since it follows directly from the analysis that treats truncated SVD as a matrix approximation process [188].
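For concreteness, the normalization choices can be sketched in numpy. This is a hypothetical reimplementation (the experiments above use SMART and SVDPACK, not this code); Uk holds the truncated term semantic vectors, and the projection corresponds to the Equation 5.6 style ("no scale").

```python
import numpy as np

def project_queries(Uk, q, norm_term=False, norm_doc=False):
    """Project term-space vectors q (t x m) into the k-dim semantic space.

    norm_term -- "norm term": normalize each row of Uk (term vectors) first
    norm_doc  -- "norm doc": normalize each projected vector, i.e. cosine
    """
    U = Uk.copy()
    if norm_term:                       # emphasize rare terms
        U /= np.linalg.norm(U, axis=1, keepdims=True)
    Q = U.T @ q                         # Equation 5.6 style projection
    if norm_doc:                        # unit-length projected vectors
        Q /= np.linalg.norm(Q, axis=0, keepdims=True)
    return Q

# toy example: 4 terms, 6 documents, rank-2 semantic space
rng = np.random.default_rng(0)
A = rng.random((4, 6))
Uk, _, _ = np.linalg.svd(A, full_matrices=False)
Uk = Uk[:, :2]
q = rng.random((4, 1))
q_hat = project_queries(Uk, q, norm_term=True, norm_doc=True)
print(np.linalg.norm(q_hat))            # unit length under "norm both"
```

The "norm both" configuration chosen above corresponds to calling the helper with both flags set.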

6.1.2 Combining LSI with Okapi

Okapi [210] has consistently achieved the best performance in TREC's ad hoc track. Figures 6.3(a) and (b) compare the precision-recall of several retrieval algorithms,⁴ using TREC and Medlars, respectively. (LSI+Okapi is our method, described later.) LSI performs well for the small Medlars corpus, which is consistent with the results in previous work [73]. For the much larger TREC corpus, however, Okapi performs dramatically better than LSI. LSI's poor performance with large, heterogeneous corpora has also been reported elsewhere [12, 119]. This experiment, again, emphasizes the importance of using large corpora in evaluations.

⁴ Although automatic query expansion can improve performance (e.g., boosting Okapi's average precision from 0.221 to 0.273), it is not used in any experiment in this chapter, because we want to study the effect of dimensionality reduction independently.


[Figure 6.3: Comparison of different retrieval algorithms. (a) Precision-recall for TREC. (b) Precision-recall for Medlars.]

There are several reasons for LSI's inferior performance. First, LSI does not explicitly exploit document length as a factor in ranking, which Okapi and pivoted normalization [234] have shown to be important. There is no simple solution for this; our experience shows that simply using Okapi to generate the input term-document matrix for SVD leads to even worse performance. Second, a 300-dimensional semantic space is insufficient for TREC. LSI's performance can be improved by increasing the dimensionality (see the results in Figure 6.6), but this increases the cost of SVD. More importantly, in a decentralized environment, our pLSI algorithm exploits only the information in low-dimensional subvectors to guide index placement and query routing (see Section 5.2.1), so increasing the dimensionality does not help us.

SVD sorts the elements of semantic vectors by decreasing importance. The low dimensions of the semantic space capture the major structure of the corpus, but LSI still needs the fine structure captured by the high dimensions to rank documents. In other words, regardless of the optimal dimensionality for LSI to achieve the best retrieval quality, low-dimensional subvectors can approximately cluster related documents in the P2P network; without sufficient dimensions, however, LSI simply cannot rank documents properly.

Based on this observation, we make two important modifications to pLSI. First, on a searched node, we use Okapi instead of LSI to select documents. Second, we use Okapi to guide the exploration of the search region in Figure 5.2(c): in content-directed search, we use Okapi instead of LSI to compute the similarity between the sampled documents and the query, which is used to decide which node to search next. We call this method "LSI+Okapi".

Before moving to distributed retrieval, we first evaluate whether LSI+Okapi works in a centralized implementation. Figure 6.4 compares the high-end precision for different methods.
P@i is the precision when retrieving i documents for a query. The configuration for LSI+Okapi is as follows: we use a 4-plane pLSI.
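The plane decomposition can be illustrated as follows; the contiguous 25-dimension slicing and the helper name are assumptions for illustration, not the dissertation's code.

```python
import numpy as np

def split_planes(v, planes=4, dims=25):
    """Split a semantic vector into per-plane subvectors (4 x 25 here)."""
    assert len(v) >= planes * dims
    return [v[i * dims:(i + 1) * dims] for i in range(planes)]

v = np.arange(300, dtype=float)       # a 300-dim semantic vector
subs = split_planes(v)                # only the first 100 dims are used
print(len(subs), subs[0].shape)       # → 4 (25,)
```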


[Figure 6.4: High-end precision for TREC.]

Each plane is of 25 dimensions; that is, pLSI uses only the first 100 dimensions of the semantic space. Each plane retrieves 1,000 documents for a query based on the subvectors on that plane.⁵ The four planes in total return 4,000 documents. Finally, we use Okapi to rank the returned 4,000 documents. Figure 6.4 shows that, with proper ranking, the top documents retrieved by low-dimensional subvectors are almost as good as those retrieved by Okapi.

⁵ In a decentralized implementation, each plane does not actually return all 1,000 documents, since content-directed search [258] automatically avoids searching documents with low similarity scores.

[Figure 6.5: Cumulative distribution of the 100 TREC queries as a function of the number of returned relevant documents when retrieving 10 documents.]

Figure 6.5 plots the cumulative distribution of the 100 queries as a function of the number of returned relevant documents when retrieving 10 documents. LSI, LSI+Okapi, and Okapi find no relevant document for 39, 18, and 11 queries, respectively. The distribution of the retrieval quality of LSI+Okapi closely follows that of Okapi.

The precision-recall for LSI+Okapi is reported in Figure 6.3(a). The high-end precision of LSI+Okapi approaches that of Okapi, but the low-end precision still lags behind. The low-end precision can be improved by allowing each plane to return more candidate documents for Okapi to rank, but this would increase the search cost. Currently our focus is on high-end precision, since a significant percentage of users view only the top search results [143]. We leave improving the low-end precision as a subject for future work.

One lesson we learned through extensive experimentation is the importance of using large, heterogeneous corpora in evaluation. Unfortunately, most existing work on LSI is evaluated using relatively small and homogeneous corpora. Although the corpus we use is much larger, we acknowledge that it is still quite limited compared with the target environment of our system. In theory, with sufficient dimensions, LSI can at least approximate the performance of the term weighting scheme that produces the input matrix for SVD. In practice, the dimensionality of the semantic space is limited by SVD's high computation cost. On the other hand, we found that, although low-dimensional LSI loses the fine details of the corpus, it can still approximately cluster documents, since it captures the major structure of the corpus.
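The LSI+Okapi combination just described, per-plane candidate retrieval followed by Okapi reranking, can be sketched as follows. This is an illustrative reimplementation: the BM25 formula below is the standard one and may differ in detail from the exact Okapi weighting used in the experiments, and the per-plane retrieval functions are hypothetical placeholders.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc, docs, k1=1.2, b=0.75):
    """Standard BM25 score of one document (an approximation of Okapi)."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    tf = Counter(doc)
    score = 0.0
    for t in query_terms:
        df = sum(1 for d in docs if t in d)     # document frequency
        if df == 0:
            continue
        idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
        f = tf[t]
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avgdl))
    return score

def lsi_plus_okapi(query_terms, planes, docs, per_plane=1000, top=10):
    """Each pLSI plane returns candidate doc ids; Okapi reranks the union."""
    candidates = set()
    for retrieve in planes:                     # retrieve: query -> ranked ids
        candidates.update(retrieve(query_terms)[:per_plane])
    ranked = sorted(candidates,
                    key=lambda i: bm25_score(query_terms, docs[i], docs),
                    reverse=True)
    return ranked[:top]

# toy usage: one plane returning every document id
docs = [["apple", "pie"], ["latent", "semantic", "indexing"], ["okapi", "ranking"]]
planes = [lambda q: [0, 1, 2]]
print(lsi_plus_okapi(["latent", "semantic"], planes, docs, per_plane=3, top=1))  # → [1]
```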


6.2 Improving the Efficiency of LSI

In Section 6.1 we described techniques to improve the retrieval quality of pLSI to approach that of Okapi. In this section we address another problem that limits LSI's scalability—the high computation cost associated with SVD. Traditionally, LSI uses a term-document matrix as the input for SVD to compute the basis of the semantic space. For a sparse matrix A ∈ R^{t×d} with about c nonzero elements per column, the time complexity of SVD is O(t·d·c) [185].

We propose the following to reduce this cost. We partition documents into clusters and use the centroids of the clusters as "representative" documents. We further reduce the dimensionality of the centroid vectors by filtering out elements corresponding to low-weight terms. The resulting matrix, which has short centroid vectors as columns, is several orders of magnitude smaller than the original term-document matrix. Finally, we apply SVD to this matrix to derive the basis of the semantic space. We call this algorithm eLSI (efficient LSI). eLSI reduces the cost of SVD but introduces an extra clustering step; we believe clustering algorithms are much more scalable than SVD.

6.2.1 The eLSI Algorithm

Our eLSI algorithm efficiently derives a good low-dimensional representation for documents. It consists of the following steps:

1. Partition documents into clusters and compute a centroid for each cluster.

2. Reduce the dimensionality of the centroids by keeping only elements whose aggregate weight across centroids is sufficiently large.

3. Project the dimensionality-reduced centroids into a k-dimensional semantic space using SVD.


4. Project terms into the semantic space according to the usage of terms in the centroids and the k-dimensional representation of the centroids.

5. Finally, project documents into the semantic space according to the usage of terms in the documents and the k-dimensional representation of the terms.

Intuitively, we use document clustering and term selection to come up with a small matrix that captures the important content of the original term-document matrix. We then apply SVD to this small matrix to derive a low-dimensional representation for the centroids, followed by a chain of actions, centroids → terms → documents, to derive low-dimensional representations for terms and documents. Through Steps 4 and 5, low-dimensional representations can also be derived for terms or documents not seen before.

Each column of the original term-document matrix corresponds to a document. One natural way to reduce the number of columns is to replace the columns corresponding to a cluster of similar documents with a single vector that represents the centroid of the cluster. Centroids are landmark structures in the document space. If we find a good projection that maps the centroids to a low-dimensional semantic space while retaining the distances among them, it is likely that this projection also preserves the distances among documents. We use a hierarchical version of spherical k-means [76] to cluster documents (see Section 6.2.2 for details). Denote the centroid matrix obtained through clustering as

C = [c_1 \; c_2 \; \cdots \; c_s] \in \mathbb{R}^{t \times s} \qquad (6.1)

where s is the number of document clusters, t is the number of terms, and c_j is the centroid vector for cluster j. The centroid v̂ of a set of vectors v_i (1 ≤ i ≤ n) is defined as \hat{v} = \frac{1}{n} \sum_{i=1}^{n} v_i. Element c_{ij} indicates the importance of term i in centroid j. The aggregate weight of term i across the centroids is w_i = \sum_{j=1}^{s} c_{ij}.

We select a subset of e rows from matrix C to construct a row-reduced matrix


C̃ ∈ R^{e×s}. The x-th row of C is kept in C̃ if term x appears in more than one centroid and is among the top e terms with the largest aggregate weight. The rationale behind term selection is that terms with large aggregate weight are representative in expressing the relationships among centroids.

Similar to the use of document clustering to reduce the columns of the term-document matrix, one can use term clustering to reduce the rows of matrix C. We prefer term selection since it is more efficient and works well in practice. Alternatively, one can use Random Projection [97] to reduce the dimensionality of the centroid vectors (see Section 6.2.3 for further detail).

After document clustering and term selection, matrix C̃ is several orders of magnitude smaller than the original term-document matrix. For the TREC corpus, the complete term-document matrix has 408,653 rows and 528,155 columns; the matrix C̃ we use for TREC, on the other hand, has fewer than 2,000 rows and 2,000 columns. We apply SVD to C̃, computing the components corresponding to C̃'s k largest singular values (see Equations 6.2 and 6.3). This can be done efficiently because of the limited size of C̃.

\tilde{C} = U \Sigma V^T \qquad (6.2)

\tilde{C}_k = U_k \Sigma_k V_k^T \qquad (6.3)

V_k ∈ R^{s×k} is the representation of the centroids in the k-dimensional semantic space. Equation 6.4 projects terms into the semantic space using V_k; recall that each row of C corresponds to a term.

B = C V_k \in \mathbb{R}^{t \times k} \qquad (6.4)

As discussed in Section 6.1.1, we normalize each row of B to unit length to emphasize rare terms, resulting in a new matrix B̃. Finally, Equations 6.5 and 6.6 project a document (or query) vector q into the semantic space and normalize it to unit length.

\bar{q} = \tilde{B}^T q \qquad (6.5)


\hat{q} = \frac{\bar{q}}{\|\bar{q}\|_2} \in \mathbb{R}^{k \times 1} \qquad (6.6)

6.2.2 Document Clustering

In eLSI, we use clustering as a low-cost pre-processing step for LSI. It efficiently obtains many of the nice properties that LSI obtains through expensive SVD. In addition to reducing the matrix size, the use of centroids shares some spirit with LSI in that it also automatically discovers semantic relationships among words. Kontostathis and Pottenger [135] found that LSI has the capability to trace high-order co-occurrences of words. Suppose terms x and y co-occur in a document X, and terms y and z co-occur in a document Z, but terms x and z never co-occur in any document. If these patterns are strong, LSI can discover that terms x and z are also related. The use of centroids achieves a similar effect: two terms that are commonly used in a cluster of documents will both have heavy weight in the centroid of that cluster, strengthening their correlation, even if the two terms never co-occur in any document. In contrast, although document sampling can also reduce the size of the term-document matrix, it loses all information in documents that are not sampled, and it cannot discover relationships among terms if only a small fraction of the corpus is sampled.

We use hierarchical spherical k-means to partition documents into clusters. The algorithm starts with a single cluster that contains all documents and generates more clusters through recursive bisection. Given a j-way clustering solution, it selects one of the j existing clusters and partitions it into two to generate a (j+1)-way solution. Many methods have been proposed to select the cluster to partition in each step. In our implementation, we simply select the largest one, for two reasons. First, this allows us to balance the sizes of the final clusters. Second, a large set of similar documents that discuss the same popular topic will be represented by multiple clusters in the final solution. This is consistent with the basic idea of LSI: popular and important topics should


have a more frequent occurrence in the input for SVD to allow SVD to learn the relationships among words.

We proceed to describe the spherical k-means algorithm that partitions the documents of one cluster into two. The following discussion simply uses "documents" to refer to the documents in the to-be-partitioned cluster. Initially, two random documents are selected as seed centroids c_i (i=0,1). Documents are partitioned into two sets C_i (i=0,1): a document x is assigned to set C_i if it has higher similarity (inner product) to centroid c_i than to the other centroid. Then the two centroids are updated as follows.

\bar{c}_i = \sum_{x \in C_i} x \qquad (6.7)

c_i = \frac{\bar{c}_i}{\|\bar{c}_i\|_2} \qquad (6.8)

Equation 6.8 normalizes the centroid to unit length, which is why this algorithm is called spherical k-means [76]. Documents are again partitioned into two sets C_i according to their similarity to the new centroids, and the partition is used to update the centroids c_i. This process repeats until no documents, or only a small number of documents, move between the two clusters. We find through experimentation that the centroids usually converge after 5-40 iterations. The quality of the clustering solution depends on the initial seed centroids, so we run the clustering algorithm five times with different random seeds and pick the run that results in the most coherent clusters, i.e., the one for which the sum of the similarities between documents within each cluster is the largest.
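A minimal sketch of one bisection step, assuming the rows of X are unit-length document vectors; this is an illustrative reimplementation, not the code used in the experiments.

```python
import numpy as np

def spherical_bisect(X, max_iter=40, seed=0):
    """One bisection step of spherical k-means.

    X: n x t matrix of unit-length document vectors.
    Returns (labels, centroids) with unit-length centroids (Eq. 6.8).
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    c = X[rng.choice(n, size=2, replace=False)].copy()  # two random seed docs
    labels = np.full(n, -1)
    for _ in range(max_iter):
        new_labels = (X @ c.T).argmax(axis=1)   # assign by inner product
        if np.array_equal(new_labels, labels):  # converged: no movement
            break
        labels = new_labels
        for i in (0, 1):
            s = X[labels == i].sum(axis=0)      # Eq. 6.7: sum of members
            norm = np.linalg.norm(s)
            if norm > 0:
                c[i] = s / norm                 # Eq. 6.8: unit length
    return labels, c

# two obvious topic groups in a 3-term vocabulary
X = np.array([[1, 0, 0], [0.9, 0.1, 0], [0, 1, 0], [0.1, 0.9, 0]], float)
X /= np.linalg.norm(X, axis=1, keepdims=True)
labels, c = spherical_bisect(X)
print(labels)   # documents 0,1 and 2,3 end up in different clusters
```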

6.2.3 Other Dimensionality Reduction Methods

In this section, we summarize four fast dimensionality reduction methods. We will compare them with eLSI in Section 6.2.4. The first algorithm, Random Projection (RP) [97], projects a t-dimensional document (or query) vector q into a k-dimensional subspace using a random matrix


P ∈ R^{t×k} whose columns have unit length (see Equation 6.9). Here t is the number of terms and k is the dimensionality of the target semantic space, k ≪ t. Despite its simplicity, previous work has shown that RP is reasonably accurate in reducing dimensionality for text data [24].

\bar{q} = P^T q \qquad (6.9)

The first step of all other algorithms partitions documents into k clusters, where k is the dimensionality of the target semantic space. (Note that eLSI partitions documents into s clusters, s ≫ k.) Denote G = [g_1 \; g_2 \; \cdots \; g_k] \in \mathbb{R}^{t \times k} as the centroid matrix, where g_j is the centroid vector for cluster j.

The second algorithm, Concept Indexing (CI), was introduced by Karypis and Han [127]. Given a vector q, CI uses Equation 6.10 to project it into the subspace spanned by the k centroids.

\bar{q} = G^T q \qquad (6.10)

The third algorithm [76] solves the least-squares problem in Equation 6.11 to derive the k-dimensional representation for vector q. We refer to this algorithm as the LS (least-squares) algorithm.

\bar{q} = \arg\min_{\bar{q}} \|G\bar{q} - q\|_2 \qquad (6.11)

The fourth algorithm [188] is based on QR decomposition. The QR decomposition of matrix G ∈ R^{t×k} gives an orthogonal matrix Q ∈ R^{t×t} and an upper triangular matrix R ∈ R^{k×k} such that

G = Q \begin{pmatrix} R \\ 0 \end{pmatrix} = (Q_k \; Q_r) \begin{pmatrix} R \\ 0 \end{pmatrix} = Q_k R \qquad (6.12)

where Q_k ∈ R^{t×k} and Q_r ∈ R^{t×(t−k)}. The k-dimensional representation of vector q is computed using Equation 6.13. We refer to this algorithm as the QR algorithm.

\bar{q} = Q_k^T q \qquad (6.13)


As discussed in Section 6.1, normalization improves retrieval quality. We use Equation 6.6 to normalize the vector q̄ in Equations 6.9, 6.10, 6.11, and 6.13 to unit length.

The eLSI algorithm described in Section 6.2.1 uses term selection to reduce the dimensionality of the centroid vectors before applying SVD. Alternatively, one can use random projection with eLSI to reduce the dimensionality of the centroid vectors,

\tilde{C} = F^T C \qquad (6.14)

where C ∈ R^{t×s} is the centroid matrix (from Equation 6.1), C̃ ∈ R^{e×s} is the reduced centroid matrix, and F ∈ R^{t×e} is a random matrix whose columns have unit length. After obtaining C̃, the other steps of the eLSI algorithm are used without change. We refer to this version of eLSI as "RP-eLSI" (RP with eLSI) and to the original eLSI algorithm as "sel-eLSI" (term-selection eLSI); we use "eLSI" to refer generally to both.

In total, we have seven different dimensionality reduction algorithms to compare: RP-eLSI, sel-eLSI, RP, CI, LS, QR, and LSI, where "LSI" is the traditional LSI algorithm that directly applies SVD to the original or sampled term-document matrix. Note that RP-eLSI and RP are different. RP directly applies random projection to the term-document matrix, whereas RP-eLSI only uses random projection as one substep of our eLSI algorithm to reduce the dimensionality of the centroid vectors.
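For illustration, the projections in Equations 6.9-6.13 can be sketched in a few lines of numpy. The helper names are hypothetical; G is assumed to hold the k centroid vectors as columns.

```python
import numpy as np

def normalize_cols(M):
    return M / np.linalg.norm(M, axis=0, keepdims=True)

def rp_project(q, t, k, seed=0):
    """Random Projection (Eq. 6.9): random unit-length columns."""
    P = normalize_cols(np.random.default_rng(seed).standard_normal((t, k)))
    return P.T @ q

def ci_project(q, G):
    """Concept Indexing (Eq. 6.10): project onto the centroid directions."""
    return G.T @ q

def ls_project(q, G):
    """LS (Eq. 6.11): qbar minimizing ||G qbar - q||_2."""
    qbar, *_ = np.linalg.lstsq(G, q, rcond=None)
    return qbar

def qr_project(q, G):
    """QR method (Eqs. 6.12-6.13): project onto the orthonormal basis Qk."""
    Qk, _ = np.linalg.qr(G)        # reduced QR: Qk is t x k
    return Qk.T @ q

# toy example: 50 terms, k = 5 clusters
t, k = 50, 5
rng = np.random.default_rng(1)
G = rng.random((t, k))
q = rng.random((t,))
for f in (ci_project, ls_project, qr_project):
    print(f.__name__, f(q, G).shape)   # each yields a k-dimensional vector
```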

6.2.4 Experimental Results

In this section, we evaluate eLSI and the above algorithms. Experiments are conducted with the TREC 7&8 corpus and queries.

[Figure 6.6: Comparison of dimensionality reduction methods—retrieved relevant documents.]

Comparing Dimensionality Reduction Methods

Figure 6.6 compares the dimensionality reduction methods described in Section 6.2.3. In this experiment, sel-eLSI and RP-eLSI use 2,000 centroids of 2,000 dimensions. We use the semantic vectors generated by each method to retrieve 1,000 documents for a query and report the average number of retrieved relevant documents in Figure 6.6. The dimensionality of the semantic space varies from 20 to 300 (shown on the X axis), except for RP. RP's recall is among the worst when the dimensionality is low, but we notice a big improvement when the dimensionality increases from 200 to 300. Out of curiosity, we further increase the dimensionality up to 900 and observe that RP keeps its momentum. This matches the theory that RP performs well when the dimensionality of the reduced space is sufficient to capture the real dimensionality of the data. When it is insufficient, RP's recall degrades quickly because, unlike LSI, RP does not try to capture the major structure of the data in the low dimensions.

The recall of eLSI (both sel-eLSI and RP-eLSI) is very close to that of LSI, but we start to see a difference when the dimensionality is larger than 140. eLSI uses


document clustering plus term selection or RP to reduce the size of the input matrix for SVD. It retains the major structure of the data but does lose some of the fine details that are captured by the high-dimensional elements of the semantic vectors produced by LSI. Due to the dimensionality mismatch between CAN and LSI, pLSI exploits only the information in low-dimensional subvectors (usually fewer than 150 dimensions) to guide index placement and query routing (see Section 5.2.1). Therefore, when used with pLSI, the choice of LSI or eLSI makes no difference in retrieval quality.

eLSI outperforms QR, LS, and CI. With a 140-dimensional space, eLSI retrieves 38%, 79%, and 102% more relevant documents than QR, LS, and CI, respectively. This is because eLSI uses more document clusters, which contain more information about the corpus, and SVD is superior to the other methods in dimensionality reduction.

Figures 6.7(a) and (b) report the precision-recall for the different methods in a 140-dimensional and a 300-dimensional space, respectively. Since the performance of RP-eLSI and sel-eLSI is similar, we omit the results for sel-eLSI for clarity. To our surprise, although RP's recall is among the worst in Figure 6.6, its high-end precision is among the best in Figures 6.7(a) and (b). This indicates that, after random projection, vectors that are very close in the original high-dimensional space remain very close in the low-dimensional space. If two vectors are a medium or long distance apart in the original space, however, their distance in the low-dimensional space may be distorted; this is why RP has low recall. In the 140-dimensional space, eLSI's precision-recall is similar to that of LSI; in the 300-dimensional space, eLSI's performance is noticeably worse, which is consistent with the results in Figure 6.6.

Figure 6.8 reports the number of queries that find no relevant document in a 300-dimensional space. Consistent with the results in Figure 6.7, RP performs best, which again indicates that RP is good at preserving the distance between very close vectors.


[Figure 6.7: Comparison of dimensionality reduction methods (precision-recall). (a) Precision-recall in a 140-dimensional space. (b) Precision-recall in a 300-dimensional space.]


[Figure 6.8: Queries that find no relevant document.]

Figure 6.9 reports the high-end precision when combining the dimensionality reduction methods with Okapi. The configuration is similar to that for the LSI+Okapi curve in Figure 6.4. We project documents into a 100-dimensional space and then partition the 100-dimensional space into four planes, each of 25 dimensions. Each plane uses its subvectors to retrieve 1,000 documents for a query, resulting in a total of 4,000 returned documents. Finally, we use Okapi to rank these returned documents and report the precision for the top 5 and 10 documents. When used with Okapi, eLSI achieves precision almost identical to LSI's and outperforms all other methods. RP's precision is the worst in this figure, which seems to contradict the previous results. However, since we use Okapi to rerank documents, what the dimensionality reduction methods really contribute is their recall. RP's recall is the worst in spaces of very low dimensions (25 dimensions in this experiment); therefore, the precision of RP with Okapi reranking is the worst.

In summary, when the goal is high recall, or when the dimensionality reduction method is to be used with other ranking algorithms, we recommend eLSI. When the goal is high precision (without further ranking by other algorithms), we

recommend RP because of its simplicity, efficiency, and good high-end precision.

[Figure 6.9: High-end precision when combining with Okapi (precision at the top 5 and 10 documents).]

Scalability of eLSI

The main computation in eLSI is document clustering and SVD. We consider clustering to be much more scalable than SVD. The time complexity of the clustering algorithm we use is O(n log(s)), where n is the number of documents and s is the number of clusters. One can use more efficient algorithms based on data summarization or random sampling, or algorithms that scan the data set only once. Clustering is also inherently data parallel; implemented properly, distributed clustering can achieve almost linear speedup [75]. Interested readers may refer to [18] for details. A comparison of clustering algorithms is outside the scope of this work; our evaluations below focus on the scalability of SVD.

Figure 6.10 reports the execution time and memory consumption of using SVD to project eLSI's reduced centroid matrix C̃ into a 150-dimensional space. For sel-eLSI, the dimensionality of C̃ is the number of terms kept in C̃ after term selection. For RP-eLSI, the dimensionality of C̃ is the reduced dimensionality after random

[Figure 6.10: Efficiency of eLSI—memory consumption and execution time of SVD. (a) Varying the number of document centroids. (b) Varying the dimensionality of the reduced centroid matrix C̃.]


projection. In Figure 6.10(a), we vary the number of document clusters while fixing the dimensionality of C̃ to 2,000.⁶ Using data regression, we found that the execution time scales as 0.1·s^0.8 seconds and the memory consumption scales as 0.2·s^0.7 MB, where s is the number of document clusters. In Figure 6.10(b), we vary the dimensionality of C̃ while fixing the number of document clusters to 2,000. The execution time scales as 1.1·e^0.2 seconds and the memory consumption scales as 0.8·e^0.2 MB, where e is the number of selected terms. Note that the X axes of both figures are in log scale. The cost of SVD scales reasonably well in both figures.

sel-eLSI is more efficient than RP-eLSI because the matrix C̃ produced by sel-eLSI is sparser than that produced by RP-eLSI. Recall that the cost of SVD is proportional to the number of nonzero elements in the input matrix. A cluster of similar documents tends to use a small vocabulary, so the centroid vectors are sparse: when the TREC corpus is partitioned into 2,000 clusters, on average 98.8% of the elements in a centroid are zero. After term selection, many elements of matrix C̃ are still zero. RP, in contrast, uses a random matrix for projection; after projection, the probability of having zero elements in C̃ is very low.

Figure 6.11 evaluates the quality of the semantic vectors produced by eLSI. We retrieve f documents (f=1000 or f=5000, the number in parentheses) for each query based on the similarity of the 150-dimensional semantic vectors and report the average number of retrieved relevant documents for a query. sel-eLSI is effective with just a few hundred document clusters or a few hundred selected terms, indicating that document clustering and term selection do capture the important content of the original term-document matrix. In particular, the performance is not sensitive to the number of selected terms. This is because the selected terms

⁶ 150 dimensions are sufficient for our use since pLSI uses only a small number of subvectors to cluster related documents in the overlay.


[Figure 6.11: Efficiency of eLSI—retrieved relevant documents. (a) varies the number of training document centroids (200–8k); (b) varies the dimensionality of the reduced centroid matrix C̃, i.e., the number of training terms (200–4k). Y axis: retrieved relevant documents; curves: RP (5k), sel (5k), RP (1k), sel (1k).]


are very representative in expressing relationships among centroids. eLSI significantly reduces LSI's computation cost for SVD. SVD for sel-eLSI with a 2000-by-2000 centroid matrix C̃ takes 55 seconds and consumes 47 MB of memory, whereas running SVD over the sampled 15% of TREC documents takes 57 minutes and consumes 1.7 GB of memory. The good retrieval quality of eLSI indicates that document clustering and term selection do keep the important content of the term-document matrix. We believe that, when combined with an efficient clustering algorithm and a parallel implementation of SVD [144], eLSI can handle very large corpora.
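The sel-eLSI pipeline described above can be sketched in a few lines of numpy. This is a minimal illustration on toy data, not the dissertation's implementation: the random cluster assignment stands in for the hierarchical bisecting k-means used by eLSI, and all sizes are tiny stand-ins for TREC-scale values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy term-document matrix A: rows = terms, columns = documents.
n_terms, n_docs, n_clusters, n_sel, k = 50, 200, 8, 20, 5
A = rng.random((n_terms, n_docs)) * (rng.random((n_terms, n_docs)) < 0.1)

# Step 1: cluster documents (random assignment as a stand-in for
# bisecting k-means) and form the term-by-centroid matrix C.
assign = rng.integers(0, n_clusters, size=n_docs)
C = np.stack([A[:, assign == c].mean(axis=1) for c in range(n_clusters)], axis=1)

# Step 2: term selection -- keep the e terms with the largest total
# weight across centroids, yielding the reduced centroid matrix C~.
sel = np.argsort(C.sum(axis=1))[::-1][:n_sel]
C_tilde = C[sel, :]

# Step 3: run SVD on the small matrix C~ instead of on A.
U, S, Vt = np.linalg.svd(C_tilde, full_matrices=False)
Uk = U[:, :k]                    # k-dimensional semantic basis

# Step 4: fold documents (and, likewise, queries) into the semantic
# space through the reduced basis, then normalize the semantic vectors.
D = Uk.T @ A[sel, :]             # k x n_docs semantic vectors
D /= np.linalg.norm(D, axis=0, keepdims=True) + 1e-12
print(D.shape)
```

The point of the sketch is the cost structure: SVD runs on the n_sel-by-n_clusters matrix C̃ rather than on the full term-document matrix A, which is what makes eLSI orders of magnitude cheaper than standard LSI.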

6.3 Performance of pSearch

In Section 5.2, we outlined the architecture of pSearch. In Sections 6.1 and 6.2, we described techniques to improve LSI's retrieval quality and efficiency. In this section we use the TREC 7&8 corpus and queries to evaluate the complete pSearch system, including these enhancements to LSI. For details of pSearch not covered in this chapter, unless otherwise noted, we use the configuration in Table 5.3 (page 212). The semantic vectors are generated by sel-eLSI (below simply referred to as "eLSI") with 2,000 document clusters and 2,000 selected terms. We use Okapi to guide content-directed search and to select documents on visited nodes.

Figure 6.12 shows the number of visited nodes and the precision at top 10 documents when varying the number of nodes in the system from 500 to 128,000 (in parentheses). The X axis is the quit threshold T that controls the size of the search region (see Section 5.2.1): a search is terminated when no better document is found on the most recently searched T nodes. Okapi's precision at top 10


[Figure 6.12: The number of visited nodes and precision at top 10 documents when varying the number of nodes in the system from 500 to 128,000 (in parentheses). X axis: quit bound; left Y axis: visited nodes; right Y axis: precision @ 10.]

documents is 0.45. pSearch achieves a precision close to Okapi's by visiting only a small number of nodes, and this performance is scalable with respect to system size. For an 8,000-node system, pSearch achieves a precision of 0.4 by visiting 47 nodes; for a 32,000-node system, it achieves a precision of 0.4 by visiting 67 nodes. It can achieve a higher precision by visiting more nodes. When searching a similar number of nodes, the precision decreases as the system size increases: since the size of the corpus is constant, more nodes imply fewer indices on each node, and in Section 5.6.6 (page 229) we showed that search efficiency and retrieval quality improve as the number of indices stored on each node increases.

The experiment in Figure 6.13 is the same as that in Figure 6.12, but here we report the documents searched on the visited nodes as a percentage of the entire corpus. To some extent this reflects the balanced distribution of indices across nodes (see [258] for details on load balance). For clarity, we only vary the node population from 2,000 to 32,000. For the 8,000-node system, pSearch searches only 1.3% of TREC to achieve a precision of 0.4 for the top 10 documents.


[Figure 6.13: Precision at top 10 documents and documents searched on visited nodes as a percentage of the entire corpus, when varying the number of nodes in the system from 2,000 to 32,000 (in parentheses). X axis: quit bound; left Y axis: searched docs (%); right Y axis: precision @ 10.]

Figure 6.14 evaluates the impact on retrieval quality when eLSI reduces the size of the input matrix for SVD through document clustering and term selection. This experiment uses a 10,000-node system. In Figure 6.14(a), we fix the number of selected terms to 2,000 while varying the number of document clusters from 500 to 2,000 (in parentheses). In Figure 6.14(b), we fix the number of document clusters to 1,000 while varying the number of selected terms from 500 to 2,000 (in parentheses). Even when the size of the input for SVD is reduced dramatically, we see only a minor degradation in precision, indicating that eLSI is scalable. The performance gap between different input sizes diminishes as pSearch searches more nodes to achieve a higher precision.

Overall, pSearch is efficient and effective. It searches a small number of nodes to achieve a precision close to that of the state-of-the-art centralized baseline. pSearch is scalable with respect to system size: when the number of nodes increases


[Figure 6.14: The impact of eLSI on pSearch's retrieval quality. (a) varies the number of document clusters (c500, c1000, c2000); (b) varies the number of selected terms (t500, t1000, t2000). X axis: quit bound; left Y axis: visited nodes; right Y axis: precision @ 10.]


exponentially, the number of visited nodes increases moderately and the precision degrades marginally. We expect it to be scalable with respect to corpus size as well, since eLSI can aggressively reduce the size of the input for SVD without seriously compromising the quality of semantic vectors it produces. The performance is expected to improve as the average number of indices stored on a node increases.

6.4 Potential of Clustered Search

This chapter addresses the challenge of using LSI in pSearch. Potentially, one could build a P2P IR system around our document clustering idea but without LSI (i.e., working directly with high-dimensional document vectors), by partitioning documents into clusters and assigning the clusters to different nodes. Given a query, it searches only the nodes whose centroids are closest to the query. We call this approach clustered-search. Besides the problems that pSearch faces (e.g., how to efficiently cluster documents in a distributed environment), this approach faces some additional challenges.

• As nodes are dynamic in P2P systems, clustered-search needs a decentralized mechanism to dynamically assign clusters to nodes. pSearch uses CAN for this purpose. Clustered-search cannot simply adopt the same approach since it uses high-dimensional document vectors (and hence a high-dimensional space).

• Since nearest-neighbor search in a high-dimensional space is prohibitive, clustered-search cannot employ a distributed search strategy as pSearch does. Each node therefore needs to know the IP and centroid of every other node. In contrast, a node in pSearch knows only O(log(n)) routing neighbors, where n is the number of nodes in the system. In a dynamic environment, it is important for scalability that each node maintain only a small amount of global knowledge.


These challenges arise essentially because of the use of high-dimensional data. In pSearch, we avoid these problems by using the low-dimensional data produced by eLSI. It is a subject for future work to further pursue the clustered-search approach and address the above challenges. Below we briefly look into the potential of this approach, assuming that the listed challenges are somehow addressed. We focus on evaluating the retrieval quality of clustered-search in a "perfect" world. Experiments are conducted with the TREC corpus.

In the first experiment, we partition the TREC corpus into 500-8,000 clusters. For each query, we search the clusters whose centroids are closest to the query and use Okapi to rank documents in the searched clusters. Figure 6.15 reports the improvement in precision and recall as the number of searched clusters increases. This figure shows the great potential of the clustered-search approach: it can achieve a high precision by searching only a small fraction of the corpus. For instance, when the TREC corpus is partitioned into 2,000 clusters, it achieves a precision of 0.4 for the top 10 documents (compared with Okapi's 0.45) by searching only 50 clusters that contain 2.3% of the documents in the corpus. However, it suffers from the same problem that pSearch faces—a relatively low recall. With 2,000 clusters, searching 50 clusters finds only 32.5 relevant documents per query on average (compared with Okapi's 57). In a distributed environment, it is possible to achieve a high precision, but it is inherently costly to achieve a high recall.

Next we evaluate whether it is a good strategy to select the clusters to search based on the similarity between queries and cluster centroids. In this experiment, we report the percentage of the relevant documents covered by the searched clusters; that is, we assume a perfect ranking function that picks the relevant documents from the searched clusters.
Figure 6.16(a) shows the performance of a realistic case: we search the clusters whose centroids are closest to the query. Figure 6.16(b) shows the performance of an ideal case, assuming that we know in advance the clusters that contain the


[Figure 6.15: Performance as a function of searched clusters. Documents are partitioned into 500-8,000 clusters (in parentheses). (a) Percentage of searched documents and precision at 10 documents; (b) recall at 1,000 documents. X axis: searched clusters (1–200).]


[Figure 6.16: Comparing the strategies to choose clusters to search. (a) Real case—choosing clusters to search based on cluster centroids; (b) ideal case—choosing clusters to search with an "oracle". X axis: number of searched clusters (1–10); Y axis: % covered relevant documents; curves: 10, 20, 100, 1,000, and 10,000 clusters.]


most relevant documents and search only them. In both figures, the X axis is the number of searched clusters and the Y axis is the percentage of the relevant documents covered by the searched clusters. Choosing clusters based on centroids works reasonably well when there are only a small number of clusters. With 100 clusters, searching 10 clusters finds 48% of the relevant documents (assuming a perfect ranking algorithm). As the number of clusters increases, however, its performance degrades: with 10,000 clusters, searching 10 clusters finds only 17% of the relevant documents. In a large P2P system, the corpus must be partitioned into a large number of clusters to sufficiently reduce the search space, which poses a big challenge for the clustered-search approach.

One reason for the performance difference between Figures 6.16(a) and (b) is that documents that lie in the peripheral region of a cluster may be close to a query while the centroid of the cluster is not. The clusters to search therefore cannot be chosen properly by looking only at the query and the centroids. This performance gap indicates that it is worthwhile to study strategies for cluster selection. One potential solution is similar to our content-directed search heuristic: instead of using just the centroids to select the clusters to search, one can sample documents from the clusters and use the samples to direct searches, i.e., search the clusters that contain a sample close to the query. This heuristic exploits information about clusters at a finer granularity; in particular, it may sample documents that sit in the peripheral region of a cluster.

Figure 6.16(b) also shows that the hierarchical k-means clustering algorithm we use is effective. When the TREC corpus is partitioned into 10,000 clusters, just 10 clusters already cover 53% of the relevant documents. In other words, related documents are indeed clustered together.
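The two selection strategies compared above can be sketched side by side. This is a toy illustration with synthetic vectors; the function names `select_by_centroid` and `select_by_samples` are hypothetical, and cosine similarity stands in for whatever similarity measure a real clustered-search deployment would use.

```python
import numpy as np

rng = np.random.default_rng(1)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# Toy clusters of document vectors in a small vector space.
clusters = [rng.normal(loc=c, scale=0.5, size=(30, 8)) for c in range(5)]
centroids = [docs.mean(axis=0) for docs in clusters]
query = rng.normal(loc=2, scale=0.5, size=8)

def select_by_centroid(m):
    """Centroid-based selection: search the m clusters whose centroids
    are most similar to the query."""
    sims = [cosine(query, c) for c in centroids]
    return sorted(np.argsort(sims)[::-1][:m].tolist())

def select_by_samples(m, n_samples=5):
    """Sample-based heuristic: draw a few documents per cluster and rank
    clusters by their best-matching sample; samples can expose documents
    on a cluster's periphery that the centroid hides."""
    sims = []
    for docs in clusters:
        idx = rng.choice(len(docs), size=n_samples, replace=False)
        sims.append(max(cosine(query, docs[i]) for i in idx))
    return sorted(np.argsort(sims)[::-1][:m].tolist())

print(select_by_centroid(2), select_by_samples(2))
```

The sample-based variant inspects the clusters at a finer granularity at the cost of storing and probing a few documents per cluster, which is the trade-off the text above suggests studying.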
Overall, these results suggest that the clustered-search approach has good potential. We make the following observations.


• Clustered-search can achieve a reasonable precision (but a relatively low recall) if one can address the challenges listed at the beginning of this section.

• It is important to study heuristics for cluster selection, as suggested by the performance gap between the real case and the ideal case.

6.5 Related Work

In pSearch, we adopted techniques from several fields to build an efficient P2P IR system. Much related work has already been presented in Section 4.6. We focus here on the work most relevant to the content of this chapter.

6.5.1 Dimensionality Reduction

Various fast dimensionality reduction methods have been proposed to approximate LSI. The three methods based on document clustering (CI, LS, and QR) [76, 127, 188] were originally proposed for document categorization. We found that, when used for document retrieval, their performance is not as good as that of our eLSI algorithm.

Bingham and Mannila [24] reported reasonable performance when using random projection (RP) to reduce the dimensionality of text data. They used a corpus more than 200 times smaller than ours. We found that, when used alone to aggressively reduce dimensionality, RP's recall is the worst among eLSI, CI, LS, and QR. However, when used as one substep of eLSI, RP is effective as long as it does not reduce dimensionality too aggressively.

Papadimitriou et al. [185] used RP to reduce the dimensionality of document vectors prior to applying SVD. Our RP-eLSI algorithm uses RP to reduce the dimensionality of centroid vectors, producing a much smaller input matrix for


SVD. Moreover, they only presented experimental results on a set of artificially generated documents.

Kolda and O'Leary [133] used semi-discrete matrix decomposition (SDD), instead of SVD, with LSI. Compared with SVD, SDD uses less memory but takes much more time. Potentially, one could use eLSI to reduce the size of the input for SDD. The effectiveness of SDD over large corpora is yet to be evaluated.

Husbands et al. [119] found that normalizing the semantic vectors for terms improves LSI's retrieval quality by emphasizing rare terms. We discovered that, for large corpora, it is important to normalize the semantic vectors for both terms and documents. We observed performance discrepancies between a small, homogeneous corpus and a large, heterogeneous corpus and offered one explanation: normalization is beneficial when the dimensions of the semantic space are insufficient to capture the fine structure of the corpus.

Bassu and Behrens [12] observed that LSI's retrieval quality degrades as the corpus becomes large and heterogeneous. They proposed to partition documents into clusters and apply SVD to each cluster separately to discover the fine structure of each cluster. This method may improve LSI's performance as a ranking algorithm, but it is unclear whether it can outperform Okapi. eLSI, on the other hand, applies SVD to the centroids of the clusters to discover the major structure of the entire corpus.

Ding [77] showed that the document frequency of words follows a Zipf distribution; that the number of distinct words in documents follows a log-normal distribution; and that the importance of LSI dimensions follows a Zipf distribution.

Multidimensional Scaling (MDS) is a class of data analysis techniques for representing data as points in a multidimensional real-valued space. The objects are represented so that inter-point similarities in the space match the inter-object similarities provided by the user.
Bartell [11] showed how the document representation given by LSI is equivalent to the optimal representation found through solving a


particular MDS problem in which the inter-object similarity is measured as the inner product of document vectors.

FastMap [86] shares the goal of MDS but provides a more efficient solution whose time complexity is O(N), where N is the number of data objects. Like Random Projection [97], FastMap uses the projections of objects onto k lines as their k-dimensional coordinates. Unlike Random Projection [97], which selects the k projection lines randomly and all at once, FastMap selects the k projection lines recursively, each of which goes through the two data objects with the longest distance between them (a heuristic to maximize the variance of the projection).

Jiang and Littman [122] pointed out the mathematical similarity between VSM, LSI, and GVSM, and proposed a new algorithm, Approximate Dimension Equalization (ADE), that sits somewhere between LSI and GVSM and performs well on large collections, especially for multi-lingual IR.

Locally Linear Embedding (LLE) [215] is an unsupervised learning algorithm that computes low-dimensional, neighborhood-preserving embeddings of high-dimensional data. LLE attempts to discover nonlinear structure in high-dimensional data by exploiting the local symmetries of linear reconstructions. LLE is computationally very costly and, to our knowledge, no attempt has been made to apply it to documents.

6.5.2 Document Clustering

Surveys of clustering algorithms can be found in [173, 282, 18]. Steinbach et al. [241] compared agglomerative hierarchical clustering and k-means. They found that the bisecting k-means technique, which is what we use in eLSI (Section 6.2.2), is better than the standard k-means approach and is as good as or better than the agglomerative hierarchical approach.


Dhillon [74] presented an algorithm that models a document collection as a bipartite graph between documents and terms, and then uses a graph partitioning algorithm to cluster documents and terms simultaneously.

Strehl [244] studied the impact of similarity metrics on cluster quality. The similarity measures studied include Euclidean, cosine, Pearson correlation, and extended Jaccard. The clustering algorithms compared are self-organizing feature maps, hyper-graph partitioning, generalized k-means, and weighted graph partitioning. Strehl found that cosine and extended Jaccard are the best similarity measures for capturing human categorization behavior, while Euclidean performs poorest, and that weighted graph partitioning is superior to all the other clustering algorithms compared.

6.5.3 Multi-dimensional Data Access Methods

Many methods have been proposed to reduce search space for multi-dimensional data [278]. Our use of a CAN to partition a Cartesian space is similar to Grid File [179]. To our knowledge, all existing multi-dimensional data access methods, including Grid File, are designed for centralized systems.

6.6 Conclusions

We have described several techniques to scale LSI to work in a decentralized P2P environment and quantified the efficiency and efficacy of pSearch through experiments with the large TREC corpus. We made the following contributions in this chapter.

• We proposed the eLSI algorithm to improve the efficiency of LSI. eLSI uses document clustering plus term selection or random projection to reduce the size of the input matrix for SVD, while retaining the matrix's major content. eLSI retains the retrieval quality of LSI but is several orders of


magnitude more efficient. It outperforms four major fast dimensionality reduction methods in retrieval quality.

• Through extensive experiments, we found that, when the dimensions of the semantic space are insufficient to capture the fine structure of the corpus, proper normalization of the semantic vectors for terms and documents improves recall by 76% compared with the standard LSI that directly follows from SVD.

• Without sufficient dimensions, LSI cannot accurately rank documents for large corpora. We found that LSI can still cluster documents properly because it captures the important structure of the corpus. Our "LSI+Okapi" algorithm combines the benefits of LSI and Okapi: it uses LSI to cluster documents in a P2P network to reduce the search space, but uses Okapi to guide search and to rank documents.

• Despite the curse of dimensionality, we demonstrated that approximate nearest-neighbor search can still be conducted efficiently in a 10-25 dimensional space. Our technique to reduce the search space can also be applied to centralized systems.

• We demonstrated that the clustering capability of multiple low-dimensional subvectors of semantic vectors is almost as good as that of the full semantic vectors, due to the correlation among different elements of the semantic vectors.

The combination of our optimizations makes pSearch both more efficient and more effective. For a 32,000-node system, pSearch's precision at 10 retrieved documents (for the TREC corpus) is 0.4 compared with Okapi's 0.45, while on average searching only 67 nodes per query. The performance is expected to improve as the average number of indices stored on a node increases. Although pSearch's high-end precision approaches that of state-of-the-art centralized IR


systems, its low-end precision is inferior. Our future work includes improving low-end precision and incorporating other IR techniques such as automatic query expansion and PageRank.


7 Conclusions

Increasingly, Internet-level applications are distributed not for the sake of parallel speedup, but rather to access people, data, and devices in geographically disparate locations. This dissertation addressed two problems related to the management of data and information in wide-area distributed systems: distributed shared state and peer-to-peer information retrieval. Specifically, it addressed the issues of

• how to make it easier for distributed applications to share in-memory program data; and

• how to make it easier for people to retrieve information through full-text search in large-scale peer-to-peer systems.

7.1 Intuitive Data Sharing with InterWeave

Most distributed applications need some sort of shared state—information needed at more than one site. In the face of distributed updates to shared state, current applications typically resort to ad-hoc protocols built on top of remote invocation (e.g., Sun RPC or Java RMI) to maintain the coherence and consistency of shared state. The code devoted to these protocols often accounts for a significant fraction of overall application size and complexity, and this fraction is likely to grow as the complexity of application logic and shared data structures increases.


We have presented the design, implementation, and evaluation of InterWeave, a middleware system that automatically manages shared state for distributed applications running on heterogeneous platforms and written in multiple languages. We have made the following contributions.

• Heterogeneous S-DSM. InterWeave is the first system that automates the typesafe sharing of structured data in its internal form across heterogeneous platforms and multiple languages. Our evaluations show that InterWeave introduces minimal overhead while reducing bandwidth consumption and improving performance in important cases. The benefits of InterWeave become even more apparent when the evaluations move to the Internet domain, because of the low bandwidth and long latency of wide-area networks.

• Integration of remote invocation, shared state, and transactions. As a complement rather than a replacement to remote invocation, InterWeave provides a unified programming environment that supports the use of remote invocation, shared state, and transactions in a single application. In particular, remote invocation can pass arguments and results through true references to shared state rather than through often unnecessary and inefficient deep copies. A sequence of RPC calls and accesses to shared state can be encapsulated in a transaction such that, with respect to the shared state, either all of them execute or none of them do. With our novel use of the transaction metadata table, transactions also provide a framework in which the body of a remote procedure can see (and optionally contribute to) shared data updates that are visible to the caller but not yet to other processes.

• Platform-independent wire format. To reduce bandwidth consumption, InterWeave caches shared state, automatically identifies updates (diffs) to the


shared state, and transmits only the diffs during coherence operations. We have designed a wire format that captures not only data but also diffs in a machine- and language-independent form. The wire format is rich enough to describe diffs for arbitrarily complex data structures, including pointers and recursive data types.

• Rich metadata for typesafe sharing. We have designed a set of rich metadata for both clients and servers to facilitate efficient typesafe sharing. These metadata allow the runtime to quickly identify updates to shared state and the types of the changed data, and to swizzle pointers to maintain long-lived (cross-call) address transparency. The complexity of these metadata is comparable to that of a sophisticated language reflection mechanism.

• Aggressive optimizations for typesafe sharing. We have explored the design space and implementation alternatives for typesafe sharing, and proposed many aggressive optimizations to make the sharing efficient, e.g., isomorphic type descriptors and cache-aware diffing. Our evaluations on microbenchmarks and real applications show that these techniques help InterWeave achieve a wire-format translation efficiency comparable to Sun RPC and 8-23 times faster than Java, while providing much richer features.

• Multi-language support. InterWeave allows typesafe sharing across multiple languages. Our current implementation employs a common back-end library with thin front ends tailored to each language. This approach allows maximum code reuse and hence improves portability. We addressed the complexity due to the sharing of static common blocks in Fortran and objects in Java.
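The diff-based update transmission summarized above can be illustrated with a minimal sketch: compare the cached and current versions of a block and encode only the changed runs as (offset, data) records. This illustrates the idea only; it is not InterWeave's actual wire format, which additionally handles types, pointers, and machine independence.

```python
def make_diff(old: bytes, new: bytes):
    """Encode the changed runs of `new` relative to `old` as
    (offset, data) records; assumes equal-sized versions."""
    assert len(old) == len(new)
    diff, i = [], 0
    while i < len(new):
        if old[i] != new[i]:
            j = i
            while j < len(new) and old[j] != new[j]:
                j += 1
            diff.append((i, new[i:j]))   # one contiguous changed run
            i = j
        else:
            i += 1
    return diff

def apply_diff(old: bytes, diff):
    """Reconstruct the new version from the cached old version."""
    buf = bytearray(old)
    for offset, data in diff:
        buf[offset:offset + len(data)] = data
    return bytes(buf)

old = b"hello shared state block"
new = b"hello SHARED state blocK"
d = make_diff(old, new)
print(d)                      # [(6, b'SHARED'), (23, b'K')]
assert apply_diff(old, d) == new
```

Only the two changed runs travel over the wire, not the whole block, which is the source of the bandwidth savings the contribution describes.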


7.2 Peer-to-Peer Information Retrieval

Peer-to-Peer (P2P) systems have gained tremendous interest from both the user community and the research community in the past several years. Progress has been made in such areas as routing, storage, streaming media, fault tolerance, and security, but information retrieval (IR) using full-text search remains particularly challenging, partly because of the high barrier to interdisciplinary research in this area, which requires knowledge from such fields as networks, systems, information retrieval, and databases. Early work stemming from the networks and systems community tended to address the problem without considering techniques that the IR community has learned through decades of hard work, which not only renders the retrieval results of these early systems hardly usable but also misses important optimization opportunities that exist at the boundary of P2P and IR.

We address this challenge by taking a truly interdisciplinary approach: borrowing techniques from multiple fields to design each component of our P2P search systems, and making important innovations to improve the borrowed techniques and to integrate the components into seamless, efficient systems. We have learned several important lessons.

• Designing a good P2P search system amounts to striking a balance between system efficiency and retrieval quality. It is often possible and quite beneficial to trade a small degradation in retrieval quality and a small increase in storage consumption for a substantial improvement in search efficiency.

• It is our strong belief that by exploiting the synergy between IR and P2P, one can design more efficient P2P search systems, leveraging features of IR algorithms to guide system optimization.


We proposed eSearch and pSearch, the first systems that introduce efficient information retrieval into structured peer-to-peer networks (i.e., distributed hash tables (DHTs)). They are also the first systems that exploit the synergy between IR and P2P to guide system design and optimization.

7.2.1 eSearch

eSearch is based on our novel hybrid global-local indexing. Each node is responsible for certain terms. Given a document, eSearch uses a modern information retrieval algorithm (Okapi) to select a small number of top (important) terms in the document, and publishes the complete term list for the document to the nodes responsible for those top terms. This selective replication of term lists allows a multi-term query to proceed locally on the nodes responsible for the query terms: the query is sent to the nodes responsible for its terms, and each of those nodes then does a local search and returns the results. We made the following contributions in eSearch.

• Challenging the conventional wisdom of using either local or global indexing, we proposed hybrid indexing, which employs selective term list replication to combine the benefits of local and global indexing while avoiding their limitations.

• We used semantic information provided by modern IR algorithms to guide the replication. The term list of a document is replicated only on the nodes corresponding to important terms in the document. Our evaluations show that eSearch can achieve a retrieval quality comparable to that of the centralized baseline by using just a small number of important terms per document.

• We adopted automatic query expansion in a P2P environment to alleviate the precision degradation introduced by selective replication. Query expansion


improves recall by 22% and allows eSearch to publish documents under fewer important terms.

• We devised a novel overlay source multicast protocol with very low protocol overhead to reduce the cost of term list dissemination. It avoids extensive network measurements by using estimated network latency to guide the selection of multicast routes, and it saves the messages needed to create and destroy a multicast tree by piggybacking the structure of the tree on data packets, i.e., using source multicast. Our protocol reduces metadata dissemination cost by up to 80% on its own.

• We introduced two techniques to balance the term list distribution to a greater degree than is achievable using existing load balancing techniques, while avoiding their maintenance overhead. First, a new node performs lookups for several random keys; among the nodes responsible for those keys, it chooses the one that stores the largest amount of data and takes over some of its data. Second, each term is hashed into a key range rather than a single key. For an unpopular term, the key range is mapped to a single node. For a popular term, the key range may be partitioned among multiple nodes, which collectively store the inverted list for that term.
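The selective-replication step at the heart of hybrid indexing can be sketched as follows. The term weights and the hash-based DHT mapping here are toy stand-ins (eSearch uses Okapi term weights and a structured overlay); `node_for` and `publish` are illustrative names, not eSearch's API.

```python
import hashlib

def node_for(term: str, n_nodes: int) -> int:
    """Toy DHT mapping: hash a term to the node responsible for it."""
    h = hashlib.sha1(term.encode()).hexdigest()
    return int(h, 16) % n_nodes

def publish(doc_id, term_weights, n_nodes, top_k):
    """Hybrid indexing: pick the top_k highest-weight terms and publish
    the document's COMPLETE term list to the node responsible for each,
    so a multi-term query can be answered locally on those nodes."""
    top_terms = sorted(term_weights, key=term_weights.get, reverse=True)[:top_k]
    term_list = sorted(term_weights)     # the full term list travels with the doc
    placements = {}
    for t in top_terms:
        placements.setdefault(node_for(t, n_nodes), []).append((doc_id, term_list))
    return top_terms, placements

weights = {"distributed": 4.1, "hash": 3.2, "table": 2.9, "the": 0.1, "of": 0.1}
top_terms, placements = publish("doc42", weights, n_nodes=16, top_k=3)
print(top_terms)    # ['distributed', 'hash', 'table']
```

Because the complete term list is replicated under each important term, a node responsible for any query term can rank the document locally, which is what lets a multi-term query avoid shipping inverted lists between nodes.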

7.2.2 pSearch

pSearch is based on the concept of a semantic overlay. It organizes nodes into a Content-Addressable Network (CAN) and places documents in the overlay according to document semantics derived from Latent Semantic Indexing (LSI). The distance (e.g., in routing hops) between two documents in the overlay is proportional to the dissimilarity of their semantics. The search cost for a query is therefore reduced, since documents related to the query are likely to be concentrated on a small number of nodes. We made the following contributions in pSearch.


• pSearch is the first system that organizes content around its semantics in a P2P network. This makes it possible to achieve accuracy comparable to centralized IR systems while visiting a small number of nodes and transmitting a small amount of data during a search.

• We proposed the rolling-index technique to resolve the dimensionality mismatch between the semantic space and a CAN, taking advantage of the higher importance of the low-dimensional elements of semantic vectors. This helps reduce the number of visited nodes by partitioning the semantic space along more dimensions.

• We employed content-aware node bootstrapping to balance the load, which as a side effect also introduces index and query locality. This helps distribute document indices evenly across nodes.

• We employed content-directed search, using index samples and recently processed queries to guide the search to the right places in the high-dimensional semantic space. This further reduces the number of visited nodes.

• We proposed the eLSI algorithm to improve the efficiency of LSI. eLSI uses document clustering plus term selection or random projection to reduce the size of the input matrix for SVD while retaining the matrix's major content. eLSI retains the retrieval quality of LSI but is several orders of magnitude more efficient, and it outperforms four major fast dimensionality reduction methods in retrieval quality.

• Through extensive experiments, we found that, when the dimensions of the semantic space are insufficient to capture the fine structure of the corpus, proper normalization of the semantic vectors for terms and documents improves recall by 76% compared with the standard LSI that follows directly from SVD.


• Without sufficient dimensions, LSI cannot accurately rank documents for large corpora. We found that LSI can still cluster documents properly because it captures the important structure of the corpus. Our “LSI+Okapi” algorithm combines the benefits of LSI and Okapi: it uses LSI to cluster documents in a P2P network to reduce the search space, but uses Okapi to guide the search and to rank documents.

• Despite the curse of dimensionality, we demonstrated that approximate nearest-neighbor search can still be conducted efficiently in a 10-25 dimensional space. Our technique for reducing the search space can also be applied to centralized systems.

• We demonstrated that the clustering capability of multiple low-dimensional subvectors of semantic vectors is almost as good as that of the full semantic vectors, due to the correlation among different elements of the semantic vectors.
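The rolling-index idea can be sketched as follows: a high-dimensional semantic vector is cut into consecutive low-dimensional subvectors, and subvector i supplies the coordinates for the i-th “rotation” of the same low-dimensional CAN. The dimensions and the toy vector below are illustrative assumptions:

```python
def subvectors(semantic_vec, can_dim):
    """Split a semantic vector into consecutive low-dimensional subvectors.
    Subvector i gives the CAN coordinates used in rotation i; because LSI
    orders dimensions by importance, the first subvectors carry the most
    weight, so searching the first few rotations already yields good results."""
    return [tuple(semantic_vec[i:i + can_dim])
            for i in range(0, len(semantic_vec), can_dim)]

vec = [0.9, 0.4, 0.2, 0.1, 0.05, 0.02]  # toy 6-d semantic vector
print(subvectors(vec, can_dim=2))
# three rotations: (0.9, 0.4), (0.2, 0.1), (0.05, 0.02)
```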

7.2.3 Comparison of eSearch and pSearch

The fundamental reason that search is difficult in existing P2P systems such as Gnutella and KaZaA is that, with respect to semantics, documents are randomly distributed. At a very high level, what underlies our solutions for efficient peer-to-peer information retrieval is a combination of feature extraction, document clustering, and complete local indexing.

Feature Extraction and Document Clustering

Information retrieval is essentially a pattern recognition problem: extracting features from objects and measuring the similarity between objects based on their features. IR systems differ in the feature set and the similarity measure they choose. eSearch uses documents’ keywords as features. Documents whose indices are stored on the same node share the same keyword, under which, in some


sense, these documents are clustered. Given a query, eSearch searches only the document clusters that have at least one of the query terms as their keyword. In contrast, pSearch uses as features the transformed conceptual semantic vectors instead of the original keywords. Using concepts instead of keywords substantially reduces the total number of features, but it is very challenging to efficiently derive high-quality concepts for large corpora. Both eSearch and pSearch partition document indices based on features such that the indices stored on a node share similar features. Document clustering helps limit a search to only the nodes hosting relevant indices.

In addition to the difference in feature sets, eSearch and pSearch also differ in the way they partition the feature space among the nodes. In eSearch, features (keywords) are considered independent of each other. Whether a document owns a given feature is a binary decision. A feature is mapped to certain node(s), and the indices for documents with the given feature are then stored on the node(s) responsible for that feature. This independent treatment of features makes search extremely simple and efficient. On a retrieval, the query is sent to the nodes responsible for the query terms, each of which does a local search and returns the results.

In pSearch, features are not partitioned among nodes independently of each other. The space spanned by a combination of features (i.e., concepts) is partitioned into zones, and each zone is mapped to a node. A node is responsible for a subspace of the combined feature space, rather than a single feature as in eSearch. The indices stored on a node therefore are semantically related to the indices stored on the node’s neighbors. A retrieval amounts to a distributed nearest-neighbor search, which can be complex due to the curse of dimensionality. pSearch’s search strategy therefore is more complex and less efficient than that of eSearch.
In summary, eSearch searches nodes independently, while pSearch has to consider the relationships among nodes during a search. eSearch is simpler but only suitable for text retrieval, whereas the techniques we devised in pSearch can also be applied to multimedia retrieval [1, 85, 148, 306].


Complete Local Index

In a large P2P system, the storage space scales proportionally with the number of nodes in the system, but the bisection bandwidth does not. In some proposed P2P search systems [102, 145, 205, 245], the complete index for a single document is partitioned by words and scattered across multiple nodes. To compute the similarity between documents and a query, the system needs to spend precious bandwidth on piecing together the indices for related documents (i.e., a distributed intersection of the inverted lists). In contrast, a node in eSearch or pSearch always stores complete indices for documents (at the expense of extra storage), allowing it to compute the similarity between a query and the documents in its local indices without consulting others. This approach, we believe, better matches the scalability trends of bandwidth and storage in large P2P systems.

Both eSearch and pSearch employ replication to improve retrieval quality. The replication factor in eSearch, however, is about one order of magnitude larger than that in pSearch, because eSearch uses keywords as features, the number of which is larger than the number of semantic subvectors (the features in pSearch).

In summary, eSearch’s search process is more efficient and much simpler than that of pSearch, but eSearch consumes more storage space than pSearch due to its large replication factor. In pSearch, efficiently deriving high-quality semantic vectors for large corpora is also very challenging. Overall, we recommend eSearch for text retrieval and pSearch for multimedia retrieval.

7.3 Future Work: Distributed Shared State

There are several aspects of InterWeave that can be improved, as well as some interesting applications we would like to explore within the InterWeave framework.


Scalability

One important future direction is to improve the scalability and availability of the InterWeave server. This topic is not entirely specific to InterWeave; we can and should learn extensively from past solutions for scalable database, Web, and file servers. On the other hand, the knowledge that InterWeave servers have about clients’ data access patterns and their particular requirements for the fidelity of the data (i.e., relaxed coherence) gives rise to some unique opportunities for further optimization in InterWeave. Currently we have a preliminary implementation that allows the servers to be replicated for scalability and availability. Data coherence among replicated servers is maintained using a group communication protocol called Ensemble [129]. Further enhancements to scalability will have to resort to a hierarchical solution; see Section 7.3.2 for the details.

Cache Management

InterWeave maintains a set of complex metadata on both clients and servers. The algorithms InterWeave uses to search and update these metadata scale with respect to the amount of data InterWeave manages, but the memory overhead for the metadata grows linearly with the number of blocks, which could put serious pressure on main memory for applications that use an extremely large number of very small blocks. Moreover, it is likely that the persistent data accessed by some applications cannot fit in main memory. In these cases, like database systems, InterWeave has to treat main memory only as a cache for active segments, whose persistent copies reside on disk. Our current prototype supports saving segments to persistent storage and reloading them when needed. We have yet to study cache replacement algorithms for InterWeave servers.


Persistence through Reachability

In our current prototype, InterWeave segments and blocks are persistent unless they are explicitly destroyed by users. One way to relieve users of this burden is to designate a set of persistent roots and garbage collect any segment or block that cannot be reached from any of the roots. This is the so-called orthogonal persistence [116], or persistence by reachability. To support it, distributed garbage collection algorithms need to be implemented in InterWeave servers.

Applications

In addition to introducing new features into InterWeave, another interesting direction is to apply InterWeave to new application domains. We are actively collaborating with colleagues in our own and other departments to employ InterWeave in three principal application domains: remote visualization and steering of high-end simulations, incremental interactive data mining, and human-computer collaboration in richly instrumented physical environments.

Comparison with OODBs

InterWeave shares commonality with OODBs in such features as sharing of fine-grained objects, object caching, pointer swizzling, and client-side object manipulation. A performance comparison with OODB systems using standard benchmarks such as OO7 [44], and experience with porting OODB applications to InterWeave, would be interesting subjects for future work.

In the following, we further elaborate on exploiting InterWeave’s capability for memory virtualization for task migration and on improving InterWeave’s scalability through the use of hierarchy.


7.3.1 Virtualization and Migration with InterWeave

In recent years the importance of machine virtualization has been widely recognized, due to its applications in load balancing, fault tolerance, availability, and security. Approaches to virtualization vary widely, from those that completely emulate the hardware and require no modification to run commodity operating systems (e.g., VMware [270]), to those that emulate the hardware but require slight modifications to run commodity operating systems (e.g., Xen [9]), to those that do not attempt to emulate the underlying physical architecture precisely in order to reap additional performance gains (e.g., Denali [281]).

One important benefit of virtualization is the ability to move programs around for such reasons as load balancing and fault tolerance. In terms of mobility, traditional approaches to process or thread migration [167] can be viewed as virtualization at the application level. Existing solutions for thread or process migration among heterogeneous platforms are expensive and not portable [235, 240]. It is even more expensive to checkpoint and reinstate the state of a complete virtual machine. In systems where program migration is common, for instance due to frequent node failure (e.g., P2P systems), this inefficiency may pose a serious problem. In our opinion, this inefficiency is inherent in mobility solutions that lack cooperation from the application side.

InterWeave applications are different. They have already been slightly modified to use the InterWeave API in order to reap the benefits of shared-memory programming, automatic caching, and bandwidth reduction. If the important hard state of an application is stored in InterWeave segments, then the runtime can automatically capture the state, serialize it in an efficient manner (with diffing), and migrate it to another machine, on which the runtime can automatically reinstate the saved state from InterWeave. This approach, however, requires the programmer to write routines to initialize the


program from the saved state to a point from which the program can start to execute again, rebuilding program soft state from the saved state if necessary. We believe this requirement does not place a significant burden on the programmer, since it is similar to the widely accepted transactional programming model, in which programs start from a clean point after an abort. With this cooperation between runtime and application, we avoid the complexity and portability problems of overly conservative migration, i.e., marshaling and moving the entire contents of the heap, stack, registers, and program counter, the majority of which might be unnecessary for the program or easily reconstructed from the saved hard state. Compared with previous solutions for light-weight virtualization and migration, the benefit of using the infrastructure provided by InterWeave is a simpler, cleaner, and more efficient implementation.

There are many applications for this InterWeave-style, light-weight migration. In a P2P system, nodes come and go frequently. When a node leaves, either voluntarily or because of failure, the responsibility originally assigned to that node needs to be taken over by another node. We can use InterWeave to mirror a node’s state on its neighbors in the P2P network, and periodically update the mirrors through efficient diffing. When a node leaves, one of its neighbors can use the mirrored state to assume the role of the leaving node.
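The mirroring-through-diffing scheme sketched above can be illustrated at the level of key/value segments. This is a deliberately simplified stand-in for InterWeave's block-level diffing; the segment fields are made up:

```python
def diff(old, new):
    """Entries that changed between two versions of a segment.
    (Simplification: deleted entries are not handled.)"""
    return {k: v for k, v in new.items() if old.get(k) != v}

def apply_diff(mirror, delta):
    """Bring a mirrored copy up to date with a received diff."""
    mirror.update(delta)

primary = {"role": "index-node", "load": 10}   # a node's hard state
mirror = dict(primary)                         # full initial copy on a neighbor
snapshot = dict(primary)                       # last version sent to the mirror
primary["load"] = 42                           # local update on the primary
delta = diff(snapshot, primary)                # only the changed entries travel
apply_diff(mirror, delta)
print(delta, mirror["load"])  # {'load': 42} 42
```

Because only the delta crosses the network, periodic mirror refreshes stay cheap, and after a failure the neighbor's mirror contains the state needed to take over the departed node's role.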

7.3.2 Scalable and Cache Coherent Hierarchy

Scalability is one of the most challenging issues in data sharing. Researchers have investigated it in the context of Web caching [153, 212, 214, 296], content distribution [52, 61, 94], databases [295], and storage systems [70, 217]. In the simple client-server architecture, as shown in Figure 7.1, the server is a single point of failure and performance bottleneck. Requests from all clients are handled by the server, which could be quickly saturated as the number of clients increases.


Figure 7.1: Single server architecture.

Figure 7.2: Replicated server architecture.

Figure 7.3: Hierarchical cache architecture.

Distributed databases [98] use replicated servers to improve scalability. A replica handles only a subset of client requests. Figure 7.2 shows a diagram of such a system. Data coherence among the replicas can be guaranteed through ACID transactions. That is, a transaction that modifies some data cannot commit until all replicas have made the change. This strict requirement limits the number of replicas and hence works only for medium-size systems. It also degrades availability, since all replicas must be up in order to commit a transaction. Bayou [192, 261] and TACT [297, 298] relax this constraint by trading coherence for availability. One consequence is that conflicting updates have to be resolved either by user-supplied routines or manually by administrators.

For truly large-scale Internet services such as the WWW, streaming media, and content distribution, the only scalable solution seems to be the introduction of a hierarchy to share the load on the server. As the client population increases, it is simply a matter of adding more layers and caches to the hierarchy. Suppose each cache can serve p children. The number of clients that a hierarchy of depth d can serve is p^d. A diagram of a hierarchical client-server architecture is shown in Figure 7.3, where p = 2 and d = 3.

The use of a hierarchy improves scalability, but also introduces problems as the size and depth of the hierarchy increase. First, it becomes hard to maintain cache coherence efficiently. When data are modified, it takes a fair amount of time for an invalidation notice to travel through the hierarchy to reach every cache. Second, the latency of traveling through multiple layers of the hierarchy to fetch data on a cache miss might adversely affect performance, as argued by Barish and Obraczka [10]. In previous work, the hierarchy is either statically constructed [214] or dynamically adjusted [166, 211].
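As a quick check of the capacity argument (p children per cache, p^d clients at depth d), the depth required for a given client population can be computed directly; the fan-out and population numbers below are illustrative:

```python
def depth_needed(num_clients, fanout):
    """Smallest depth d with fanout**d >= num_clients,
    computed with integers to avoid floating-point edge cases."""
    d, capacity = 0, 1
    while capacity < num_clients:
        capacity *= fanout
        d += 1
    return d

# A hierarchy of fan-out 16 needs only 5 levels for a million clients,
# since 16**5 = 1,048,576.
print(depth_needed(1_000_000, 16))  # 5
```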
The drawback of a static hierarchy is obvious: it cannot adapt to the changing environment and program behavior. Previously proposed dynamic hierarchies mainly focus on optimizations guided by network


topology [52, 94]. No attempt has been made to take advantage of clients’ different data access patterns [191] and different requirements for data fidelity (i.e., relaxed coherence models). Suppose a client A that has low tolerance for stale data is placed at a level far from the root server S, whereas some clients or caches with high tolerance for stale data are placed on the path between A and S. Upon a cache miss on A, a new copy of the data must be passed down along the path from the server to the client, although the caches in between might never need this new copy to serve other clients. The important observation here is that clients that require high-fidelity data should be placed closer to the server. In another scenario, suppose two clients using similar coherence models are placed on two branches far apart from each other in the hierarchy. On cache misses, they fetch the data independently along two different paths. Ideally, these two clients should be organized under a branch close to each other so that they can share coherence operations (e.g., fetching a single copy of the data from the server). In summary, clients that require high-fidelity data should be placed closer to the server, and clients using similar coherence models should be placed near each other in the hierarchy. The design of a hierarchy for InterWeave needs to take these constraints into account.

In practice, the adoption of a hierarchy is a compromise. In content distribution and multicast, the data are assumed to be static. In Web caching, special attributes associated with Web pages indicate whether the pages are cacheable and how long they can be cached. These attributes are just hints, and coherence is not guaranteed. Despite this inconvenience, hierarchical Web caches such as Squid [214] are still in wide use, because strong coherence is not a major concern for the Web. (Users can always force the browser to fetch directly from the server if they feel the cached copy is stale.)
For general-purpose distributed computing, which is the target of InterWeave, this pragmatic attitude towards coherence may not be acceptable. It simply pushes all the burden onto the programmer.


Design Choice

Our goal is to design a Scalable and Cache Coherent Hierarchy (SCCH) for efficient data sharing. We outline the rationale behind our design choices below.

Franklin et al. [98] give a survey of transactional client-server cache coherence protocols. They classify coherence algorithms into two categories. Avoidance-based algorithms ensure that all cached data are valid, and hence no access to stale data is possible. Detection-based algorithms allow stale data to remain in client caches and ensure that transactions are allowed to commit only if they have not accessed such stale data. Each category is further divided into sub-categories. Detection-based algorithms are further classified by their mechanisms for validity check initiation, change notification hints, and remote update action. Avoidance-based algorithms are further classified by their mechanisms for write intention declaration, write permission duration, remote conflict priority, and remote update action.

The coherence algorithm used in the current InterWeave prototype falls into the category of pessimistic detection-based algorithms. Before using a segment, a client must contact the server and acquire the proper lock. With transaction support, it would be possible to adopt optimistic algorithms in InterWeave, which are known to have better performance under most configurations.

For our SCCH, we opt for an avoidance-based algorithm, i.e., keeping cached data up to date at all times so that they can be used directly on cache hits, without the need for a validity check with the server. Pessimistic detection-based algorithms require polling the server on every cache access, imposing a high load on the server and defeating the purpose of using a hierarchy. Optimistic detection-based algorithms can avoid these polls, but they rely on a transaction model to roll back when stale data are consumed. The implementation of data rollback in a hierarchy can be either very complicated or very inefficient.
For the same reasons, we opt for pessimistic write intention declaration. That is,


when a client intends to modify data, it must grab a write lock directly from the server and send modifications directly to the server when releasing the lock. We leave the choice of other design alternatives to experimentation (e.g., update vs. invalidate and write lock callback vs. one-time write lock), since they can be easily tried out within a single flexible framework.
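The design sketched above, avoidance-based coherence with pessimistic write intention declaration, can be modeled in a few lines. This is a toy single-level model with Python dicts standing in for client caches, not the planned SCCH implementation:

```python
class Server:
    """Avoidance-based coherence: cached copies are kept valid at all times,
    so a cache hit needs no round trip to the server for a validity check."""
    def __init__(self, data):
        self.data = dict(data)
        self.caches = []              # registered client caches
        self.write_lock_holder = None

    def acquire_write(self, client):
        # Pessimistic write intention declaration: one writer at a time,
        # granted directly by the server.
        assert self.write_lock_holder is None, "write lock already held"
        self.write_lock_holder = client

    def release_write(self, client, modifications):
        # Modifications go directly to the server on lock release, which then
        # pushes updates so no cache is ever left holding stale data.
        assert self.write_lock_holder is client
        self.data.update(modifications)
        for cache in self.caches:
            cache.update(modifications)
        self.write_lock_holder = None

server = Server({"x": 1})
c1, c2 = dict(server.data), dict(server.data)  # two client caches, initially valid
server.caches = [c1, c2]
server.acquire_write("client-1")
server.release_write("client-1", {"x": 2})
print(c2["x"])  # 2: the other client reads the new value locally, no server poll
```

In the real hierarchy, the update would propagate through intermediate caches rather than directly to clients, which is exactly where the update-vs-invalidate and lock-callback alternatives mentioned above come into play.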

7.4 Future Work: Peer-to-Peer Information Retrieval

As peer-to-peer computing is relatively new, the area of peer-to-peer information retrieval is particularly young. Our work on eSearch and pSearch is the first to introduce information retrieval into structured peer-to-peer networks (i.e., DHTs). Since then, the research community has devoted a fair amount of effort to attacking this problem, as manifested by the citations to our work [2, 13, 16, 15, 17, 22, 23, 34, 53, 62, 48, 63, 84, 112, 117, 130, 140, 145, 147, 149, 156, 157, 158, 160, 165, 171, 172, 175, 176, 59, 219, 190, 200, 229, 228, 232, 236, 239, 245, 246, 260, 266, 273, 274, 275, 302, 303, 300, 306, 307, 309, 312], as well as by recent workshops dedicated to this topic (e.g., the ACM SIGIR’04 Workshop on Peer-to-Peer Information Retrieval, http://p2pir.is.informatik.uni-duisburg.de). All work in this area, including ours, however, is still in its infancy.

Because of the immaturity of the field of peer-to-peer computing, many components necessary for a P2P IR system are yet to be developed, for instance, routing in the face of high churn rates, security for P2P systems, incentive mechanisms for P2P sharing, and theoretical foundations for guaranteed convergence of the behavior of a P2P system in a dynamic environment. Moreover, it requires a tremendous amount of effort to evaluate a P2P IR system in the realistic setting for which it is designed. Such an evaluation involves not only a large number of nodes but also a huge amount of realistic and representative content. So far, we have evaluated


our design by simulating a large number of networked nodes and experimenting with the widely used TREC corpus. These evaluations are obviously very constrained. We are working on converting our simulators into deployable prototypes of eSearch and pSearch and on testing the prototypes on PlanetLab [195] with Web pages crawled from the Open Directory Project [264]. However, even the PlanetLab environment is quite limited. Our long-term goal is to release freeware and foster a user community, eventually letting the users be the final judges of our systems.

In addition to the system-building effort, another important direction for our future work is to introduce new features into eSearch and pSearch. The history of Web search proves the importance of link analysis techniques, e.g., Google’s PageRank [33]. A more recent trend is to combine full-text search with link analysis [155]. We are working in the same direction, which so far appears extremely challenging, as PageRank requires knowledge of the global link graph and its current P2P implementations are not efficient enough for practical use [223]. One link analysis algorithm more suitable for P2P systems is Kleinberg’s HITS algorithm, based on hubs and authorities [131]. Instead of working on the global link graph, HITS identifies authoritative pages by working on a relatively small subgraph induced by the pages returned from a text-based search engine. One could use eSearch or pSearch to retrieve a moderate number of pages for a query and then use HITS to rerank them based on the link structure. Other important techniques for improving retrieval quality include exploiting context information [81] (e.g., documents recently browsed by a user) and user feedback (e.g., documents highly recommended by some users in the past). We are working on incorporating these techniques into our frameworks.

Information filtering is the flip side of information retrieval.
Due to the overwhelming amount of information on the Internet, it has become increasingly difficult for people to find relevant information in a timely fashion. Information filtering


and dissemination systems address this problem by allowing users to register persistent queries called user profiles. They detect new content, match it against the registered profiles, and notify users when relevant documents become available. Existing systems, however, either are not scalable or do not support matching of unstructured documents (e.g., text, HTML, or music files), which account for a significant percentage of Internet content. In [257], we proposed pFilter, a global-scale decentralized information filtering and dissemination system for unstructured documents. We plan to revise the design of pFilter based on what we learned from eSearch and pSearch and to evaluate pFilter under realistic settings.

In pSearch [256], we compared a total of seven dimensionality reduction algorithms and found that our eLSI algorithm outperforms the others. One important algorithm left out of the comparison is FastMap [86]. We plan to implement FastMap and evaluate its performance. Principal Component Analysis (PCA) is an optimal solution for dimensionality reduction in the sense of least squared error. In PCA, the axis onto which the data points are projected is selected to maximize the variance of the projection. In Random Projection, the axis is selected randomly. In FastMap, the axis is selected to pass through the two data points with the longest distance between them, which intuitively serves as a heuristic to maximize the variance of the projection. An interesting subject for future work is to study other heuristics for selecting the projection axis, for instance, computing the optimal axis for a set of sampled data and using the derived axis for the entire data set, or using some low-cost method to estimate the direction that maximizes the variance.

Fundamentally, efficient retrieval in pSearch boils down to reducing the search space through the use of indexing structures for multidimensional data. Our use of a CAN to partition a Cartesian space is similar to the Grid File [179].
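The FastMap axis heuristic discussed above, picking an axis through a far-apart pair of points as a cheap proxy for the maximum-variance direction, can be sketched as follows (the point set and the number of pivot-refinement trials are illustrative assumptions):

```python
import random

def fastmap_axis(points, dist, trials=5):
    """FastMap's pivot heuristic: start anywhere, then repeatedly jump to the
    point farthest from the current pivot, converging on a far-apart pair."""
    a = random.choice(points)
    b = a
    for _ in range(trials):
        b = max(points, key=lambda p: dist(a, p))
        a, b = b, a
    return a, b

def project(p, a, b, dist):
    """Coordinate of p on the a-b axis, via the law of cosines as in FastMap."""
    dab = dist(a, b)
    return (dist(a, p) ** 2 + dab ** 2 - dist(b, p) ** 2) / (2 * dab)

euclid = lambda u, v: sum((x - y) ** 2 for x, y in zip(u, v)) ** 0.5
pts = [[0, 0], [1, 0], [10, 0], [5, 1]]
a, b = fastmap_axis(pts, euclid)            # converges to the pair [0,0], [10,0]
print(round(project([5, 1], a, b, euclid), 6))  # 5.0: midpoint of the axis
```

The heuristic needs only pairwise distances, not coordinates, which is what makes it attractive when the objects live in an abstract similarity space rather than a Cartesian one.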
Applying other data structures such as R-Trees [111] to a distributed environment has also been explored lately [306], although the problems specific to tree structures need to be


addressed first, for instance, the routing bottleneck and the single point of failure at tree nodes. For information retrieval, what matters is the ability to find a set of approximate nearest neighbors rather than the strictly nearest neighbors, since the similarity scores used by IR algorithms are inherently fuzzy. An interesting direction for future work is to compare how the multidimensional data structures perform when retrieving approximate nearest neighbors, first in a centralized setting and then in a decentralized setting.

pSearch works by representing media content as vectors and mapping the content to an overlay network. A retrieval amounts to a distributed nearest-neighbor search in the overlay. This method can be applied to any medium that can be abstracted as vectors and whose object similarity can be measured as some kind of distance in the vector space. Many pattern recognition problems fall into this category. For instance, researchers have employed SVD to extract algebraic features from images [1, 85] and used various extractors to derive frequency, amplitude, and tempo feature vectors from music data [148, 280]. We plan to apply pSearch to image and music retrieval [306].

In eSearch, we publish the complete term list of a document only under its important terms in order to reduce storage consumption. One way to further reduce storage consumption is to store only a list of important terms, rather than all the terms, on the corresponding nodes. We plan to adopt this index pruning [45] as well as index compression [284] in eSearch. Currently we simply use the term weights generated by Okapi to select important terms. We plan to conduct a complete study of algorithms for keyword selection.

7.5 Summary

This dissertation presented our work on distributed shared state and peer-to-peer information retrieval. We have summarized our contributions to both fields


and outlined interesting directions for future work. Both fields are extremely interesting to us, and we plan to continue working on them in a broader context. We understand that the final judges of a piece of work are time and end users, and we hope our techniques find abundant applications beyond our systems.


Bibliography

[1] K. Aas and L. Eikvil. A Survey on: Content-based Access to Image and Video Databases, 1997.

[2] K. Aberer, F. Klemm, M. Rajman, and J. Wu. An Architecture for Peer-to-Peer Information Retrieval. In ACM SIGIR’04 Workshop on Peer-to-Peer Information Retrieval, 2004.

[3] S. V. Adve and M. D. Hill. Weak Ordering—A New Definition. In Proceedings of the 17th Annual International Symposium on Computer Architecture (ISCA’90), pages 2–14, Seattle, Washington, June 1990.

[4] S. V. Adve and M. D. Hill. A Unified Formulation of Four Shared-Memory Models. IEEE TPDS, 4(6):613–624, June 1993.

[5] C. Amza, A. Cox, S. Dwarkadas, P. Keleher, H. Lu, R. Rajamony, W. Yu, and W. Zwaenepoel. TreadMarks: Shared Memory Computing on Networks of Workstations. Computer, 29(2):18–28, February 1996.

[6] C. Amza, A. Cox, S. Dwarkadas, and W. Zwaenepoel. Software DSM Protocols that Adapt between Single Writer and Multiple Writer. In Proceedings of the Third International Symposium on High Performance Computer Architecture, San Antonio, TX, February 1997.


[7] G. Antoniu, L. Bougé, P. Hatcher, M. MacBeth, K. McGuigan, and R. Namyst. The Hyperion system: Compiling multithreaded Java bytecode for distributed execution. Parallel Computing, March 2001.

[8] M. Atkinson, F. Bancilhon, D. DeWitt, K. Dittrich, D. Maier, and S. Zdonik. The Object-Oriented Database System Manifesto. In Proceedings of the First International Conference on Deductive and Object-Oriented Databases, pages 223–240, December 1989.

[9] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield. Xen and the Art of Virtualization. In Proceedings of the 19th ACM Symposium on Operating Systems Principles, October 2003.

[10] G. Barish and K. Obraczka. World Wide Web Caching: Trends and Techniques. IEEE Communications, May 2000.

[11] B. T. Bartell, G. W. Cottrell, and R. K. Belew. Latent Semantic Indexing is an Optimal Special Case of Multidimensional Scaling. In Proceedings of Research and Development in Information Retrieval, pages 161–167, 1992.

[12] D. Bassu and C. Behrens. Distributed LSI: Scalable Concept-based Information Retrieval with High Semantic Resolution. In Proceedings of the 3rd SIAM International Conference on Data Mining (Text Mining Workshop), San Francisco, CA, May 2003.

[13] D. Bauer, P. Hurley, R. A. Pletka, and M. Waldvogel. Bringing Efficient Advanced Queries to Distributed Hash Tables. Technical report, IBM Zurich Research Laboratory, 2004.

[14] C. Baumgarten. A Probabilistic Solution to the Selection and Fusion Problem in Distributed Information Retrieval. In SIGIR’99, 1999.


[15] M. Bawa, R. J. Bayardo Jr., and R. Agrawal. Privacy Preserving Indexing of Documents on the Network. In VLDB’03, 2003. [16] M. Bawa, G. S. Manku, and P. Raghavan. SETS: Search Enhanced by Topic Segmentation. In SIGIR’03, 2003. [17] M. Bender, S. Michel, C. Zimmer, and G. Weikum. Bookmark-driven query routing in peer-to-peer web search. In ACM SIGIR’04 Workshop on Peer-to-Peer Information Retrieval, 2004. [18] P. Berkhin. Survey of clustering data mining techniques. Technical report, Accrue Software, San Jose, CA, 2002. [19] M. Berry, Z. Drmac, and E. Jessup. Matrices, vector spaces, and information retrieval. SIAM Review, 41(2):335–362, 1999. [20] M. W. Berry and M. Browne. Understanding Search Engines: Mathematical Modeling and Text Retrieval (Software, Environments, Tools). Society for Industrial & Applied Mathematics, 1999. [21] M. W. Berry, S. T. Dumais, and G. W. O’Brien. Using linear algebra for intelligent information retrieval. SIAM Review, 37(4):573–595, 1995. [22] B. Bhattacharjee, S. Chawathe, V. Gopalakrishnan, P. Keleher, and B. Silaghi. Efficient peer-to-peer searches using result-caching. In IPTPS’03, Berkeley, CA, USA, February 2003. [23] I. Bhattacharya, S. R. Kashyap, and S. Parthasarathy. Similarity Searching in Peer-to-Peer Databases. Technical Report CS-TR-4558, Department of Computer Science, University of Maryland, 2004. [24] E. Bingham and H. Mannila. Random projection in dimensionality reduction: applications to image and text data. In SIGKDD’01, 2001.


[25] R. Bisiani and A. Forin. Architectural Support for Multilanguage Parallel Programming on Heterogeneous Systems. In Proc. ASPLOS II, 1987. [26] R. Bisiani and A. Forin. Multilanguage Parallel Programming of Heterogeneous Machines. IEEE Trans. on Computers, 37(8), August 1988. [27] C. Blake and R. Rodrigues. High Availability, Scalable Storage, Dynamic Peer Networks: Pick Two. In HotOS’03, May 2003. [28] B. Bloom. Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, 13(7):422–426, July 1970. [29] W. J. Bolosky, J. R. Douceur, D. Ely, and M. Theimer. Feasibility of a Serverless Distributed File System Deployed on an Existing Set of Desktop PCs. In SIGMETRICS’00, 2000. [30] W. N. Boris. EDUTELLA: A P2P Networking Infrastructure Based on RDF, 2001. [31] S. Botros and S. Waterhouse. Search in JXTA and Other Distributed Networks. In Proceedings of the First International Conference on Peer-to-Peer Computing (P2P’01), August 2001. [32] T. Brecht and H. Sandhu. The Region Trap Library: Handling Traps on Application-Defined Regions of Memory. In Proceedings of the 1999 USENIX Annual Tech. Conf., pages 85–99, June 1999. [33] S. Brin and L. Page. The Anatomy of a Large-Scale Hypertextual Web Search Engine. Computer Networks and ISDN Systems, 30(1–7):107–117, 1998. [34] J. Broekstra, M. Ehrig, P. Haase, F. van Harmelen, M. Menken, P. Mika, B. Schnizler, and R. Siebes. Bibster - A Semantics-Based Bibliographic


Peer-to-Peer System. In The Second Workshop on Semantics in Peer-to-Peer and Grid Computing at the Thirteenth International World Wide Web Conference, 2004. [35] C. Buckley. Implementation of the SMART information retrieval system. Technical Report TR85-686, Department of Computer Science, Cornell University, Ithaca, NY 14853, May 1985. Source code available at ftp://ftp.cs.cornell.edu/pub/smart. [36] G. Cabillic and I. Puaut. Dealing with Heterogeneity in Stardust: An Environment for Parallel Programming on Networks of Heterogeneous Workstations. In Euro-Par, Vol. I, pages 114–119, 1996. [37] G. Cabillic and I. Puaut. Stardust: An Environment for Parallel Programming on Networks of Heterogeneous Workstations. Journal of Parallel and Distributed Computing, 40(1):65–80, 1997. [38] J. P. Callan and M. E. Connell. Query-based sampling of text databases. Information Systems, 19(2):97–130, 2001. [39] K. Calvert, M. Doar, and E. W. Zegura. Modeling Internet Topology. IEEE Communications Magazine, June 1997. [40] P. Cappello, B. Christiansen, M. Ionescu, M. Neary, K. Schauser, and D. Wu. Javelin: Internet Based Parallel Computing Using Java. In 1997 ACM Workshop on Java for Science and Engineering Computation, Las Vegas, June 1997. [41] M. J. Carey and D. J. DeWitt. Of Objects and Databases: A Decade of Turmoil. In the VLDB Journal, pages 3–14, 1996. [42] M. J. Carey, D. J. DeWitt, D. Frank, G. Graefe, J. E. Richardson, E. J. Shekita, and M. Muralikrishna. The architecture of the EXODUS extensible


DBMS. In Proc. Int’l. Workshop on Object-Oriented Database Systems, Pacific Grove, CA, 1986. [43] M. J. Carey, D. J. DeWitt, M. J. Franklin, N. E. Hall, M. L. McAuliffe, J. F. Naughton, D. T. Schuh, M. H. Solomon, C. K. Tan, O. G. Tsatalos, S. J. White, and M. J. Zwilling. Shoring up persistent applications. In Proceedings of the 1994 ACM SIGMOD Conference, pages 383–394, Minneapolis, MN, May 1994. [44] M. J. Carey, D. J. DeWitt, and J. F. Naughton. The OO7 benchmark. In SIGMOD Conference Proceedings, pages 12–21, Washington D.C., June 1993. [45] D. Carmel, D. Cohen, R. Fagin, E. Farchi, M. Herscovici, Y. S. Maarek, and A. Soffer. Static Index Pruning for Information Retrieval Systems. In SIGIR’01, 2001. [46] J. Carter, A. Ranganathan, and S. Susarla. Khazana: An Infrastructure for Building Distributed Services. In Intl. Conf. on Distributed Computing Systems, pages 562–571, May 1998. [47] J. B. Carter, J. K. Bennett, and W. Zwaenepoel. Implementation and Performance of Munin. In Proceedings of the Thirteenth ACM Symposium on Operating Systems Principles, pages 152–164, Pacific Grove, CA, 1991. [48] M. Castro, M. Costa, and A. Rowstron. Peer-to-peer overlays: structured, unstructured, or both? Technical Report MSR-TR-2004-73, Microsoft Research, July 2004. [49] M. Castro, P. Druschel, A. Ganesh, A. Rowstron, and D. S. Wallach. Security for structured peer-to-peer overlay networks. In OSDI’02, Boston, MA, Dec. 2002.


[50] P. Cederqvist et al. Version Management with CVS, 1993. Available at http://www.cvshome.org/docs/manual/. [51] J. Chase, F. Amador, E. Lazowska, H. M. Levy, and R. J. Littlefield. The Amber System: Parallel Programming on a Network of Multiprocessors. In Proc. of the 12th ACM Symp. on Operating Systems Principles, pages 147–158, Dec 1989. [52] Y. Chawathe. Scattercast: An Architecture for Internet Broadcast Distribution as an Infrastructure Service. PhD thesis, University of California, Berkeley, December 2000. [53] Y. Chawathe, S. Ratnasamy, L. Breslau, N. Lanham, and S. Shenker. Making Gnutella-like P2P Systems Scalable. In SIGCOMM’03, 2003. [54] D. Chen. Multi-Level Shared State and Application Specific Coherence Models. PhD thesis, University of Rochester, 2003. [55] D. Chen, C. Tang, X. Chen, S. Dwarkadas, and M. L. Scott. Beyond S-DSM: Shared State for Distributed Systems. Technical Report 744, URCS, 2001. [56] D. Chen, C. Tang, X. Chen, S. Dwarkadas, and M. L. Scott. Multi-level Shared State for Distributed Systems. In The 2002 International Conference on Parallel Processing (ICPP-02), Vancouver, British Columbia, Canada, August 2002. [57] D. Chen, C. Tang, S. Dwarkadas, and M. L. Scott. JVM for a Heterogeneous Shared Memory System. In Second Workshop on Caching, Coherence, and Consistency (WC3’02), in conjunction with ICS’02, New York, NY, June 2002. [58] D. Chen, C. Tang, B. Sanders, S. Dwarkadas, and M. L. Scott. Exploiting High-level Coherence Information to Optimize Distributed Shared State.


In ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP’03), San Diego, California, June 2003. [59] C.-H. Ng. Peer Clustering and Firework Query Model in Peer-to-Peer Networks. Master’s thesis, The Chinese University of Hong Kong, 2003. [60] G. Chockler, D. Dolev, R. Friedman, and R. Vitenberg. Implementing a caching service for distributed CORBA objects. In Middleware, pages 1–23, 2000. [61] Y. Chu, S. G. Rao, and H. Zhang. A Case for End System Multicast. In SIGMETRICS’00, 2000. [62] M. Ciglarič, M. Trampus, M. Pancur, and T. Vidmar. Message Routing in Pure Peer-to-Peer Networks. In The 21st IASTED International Multi-Conference on Applied Informatics (AI 2003), Innsbruck, Austria, February 2003. [63] E. Cohen, A. Fiat, and H. Kaplan. Associative Search in Peer to Peer Networks: Harnessing Latent Semantics. In INFOCOM’03, April 2003. [64] E. Cohen and S. Shenker. Replication Strategies in Unstructured Peer-to-Peer Networks. In SIGCOMM’02, 2002. [65] Xerox Corporation. The Remote Procedure Call Protocol. Technical Report XSIS 038112, Dec. 1981. [66] A. Crespo and H. García-Molina. Routing Indices for Peer-to-peer Systems. In ICDCS’02, July 2002. [67] F. M. Cuenca-Acuna and T. D. Nguyen. Text-based content search and retrieval in ad hoc P2P communities. In Proceedings of the International Workshop on Peer-to-Peer Computing, May 2002.


[68] F. M. Cuenca-Acuna, C. Peery, R. P. Martin, and T. D. Nguyen. PlanetP: Using Gossiping to Build Content Addressable Peer-to-Peer Information Sharing Communities. In the 12th IEEE International Symposium on High Performance Distributed Computing (HPDC’03), June 2003. [69] M. M. da Silva, M. P. Atkinson, and A. P. Black. Semantics for parameter passing in a type-complete persistent RPC. In International Conference on Distributed Computing Systems, pages 411–419, 1996. [70] F. Dabek, M. Kaashoek, D. Karger, R. Morris, and I. Stoica. Wide-area cooperative storage with CFS. In SOSP’01, October 2001. [71] P. B. Danzig, J. Ahn, J. Noll, and K. Obraczka. Distributed Indexing: A Scalable Mechanism for Distributed Information Retrieval. In A. Bookstein, Y. Chiaramella, G. Salton, and V. V. Raghavan, editors, Proceedings of the 14th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 220–229, Chicago, Illinois, 1991. [72] P. Dasgupta, R. Ananthanarayan, S. Menon, A. Mohindra, and R. Chen. Distributed Programming with Objects and Threads in the Clouds System. Computing Systems, 4(3):243–275, 1991. [73] S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman. Indexing by latent semantic analysis. Journal of the American Society of Information Science, 41(6):391–407, 1990. [74] I. S. Dhillon. Co-clustering documents and words using bipartite spectral graph partitioning. In Knowledge Discovery and Data Mining, pages 269–274, 2001. [75] I. S. Dhillon and D. S. Modha. A data-clustering algorithm on distributed memory multiprocessors. In Proceedings of Workshop on Large-Scale Parallel KDD Systems, 1999.


[76] I. S. Dhillon and D. S. Modha. Concept decompositions for large sparse text data using clustering. Machine Learning, 42(1):143–175, 2001. [77] C. H. Ding. A Probabilistic Model for Dimensionality Reduction in Information Retrieval and Filtering. In Proc. of 1st SIAM Computational Information Retrieval Workshop, October 2000. [78] F. Douglis and A. Iyengar. Application-specific Delta-encoding via Resemblance Detection. In USENIX 2003 Annual Technical Conference, 2003. [79] B. Dreier, M. Zahn, and T. Ungerer. Parallel and distributed programming with pthreads and rthreads. In Proc. of the 3rd Int’l Workshop on High-Level Parallel Programming Models and Supportive Environments (HIPS’98), pages 34–40, 1998. [80] S. Dumais. Using LSI for information filtering: TREC-3 experiments. In Third Text REtrieval Conference (TREC-3), 1995. [81] S. Dumais, E. Cutrell, J. Cadiz, G. Jancke, R. Sarin, and D. C. Robbins. Stuff I’ve Seen: A system for personal information retrieval and re-use. In SIGIR’03, 2003. [82] S. Dumais, T. Letsche, M. Littman, and T. Landauer. Automatic cross-language retrieval using latent semantic indexing. In AAAI Symposium on Cross-Language Text and Speech Retrieval, March 1997. [83] S. Dwarkadas, K. Gharachorloo, L. Kontothanassis, D. J. Scales, M. L. Scott, and R. J. Stets. Comparative evaluation of fine- and coarse-grain software distributed shared memory. In 5th International Symposium on High-Performance Computer Architecture, January 1999. [84] E. Cohen, A. Fiat, and H. Kaplan. A case for associative peer to peer overlays. In HotNets-I, October 2002.


[85] C. Faloutsos, R. Barber, M. Flickner, J. Hafner, W. Niblack, D. Petkovic, and W. Equitz. Efficient and effective querying by image content. Journal of Intelligent Information Systems, 3(3/4):231–262, 1994. [86] C. Faloutsos and K.-I. Lin. FastMap: A Fast Algorithm for Indexing, Data-Mining and Visualization of Traditional and Multimedia Datasets. In SIGMOD Conference, 1995. [87] FastTrack Peer-to-Peer technology company, 2001. http://www.fasttrack.nu. [88] M. Feeley and H. Levy. Distributed Shared Memory with Versioned Objects. In OOPSLA ’92 Conf. Proc., pages 247–262, Oct. 1992. [89] P. Ferreira, M. Shapiro, X. Blondel, O. Fambon, J. Garcia, S. Kloosterman, N. Richer, M. Roberts, F. Sandakly, G. Coulouris, J. Dollimore, P. Guedes, D. Hagimont, and S. Krakowiak. PerDiS: Design, Implementation and Use of a PERsistent DIstributed Store. Technical Report 3532, INRIA, Oct 1998. [90] A. Forst, E. Kühn, and O. Bukhres. General purpose work flow languages. Distributed and Parallel Databases, 3(2):187–218, 1995. [91] I. Foster and C. Kesselman. Globus: A Metacomputing Infrastructure Toolkit. Intl. Journal of Supercomputer, 11(2):115–128, 1997. [92] I. Foster, C. Kesselman, J. Nick, and S. Tuecke. The physiology of the grid: An open grid services architecture for distributed systems integration, January 2002. [93] I. Foster, C. Kesselman, and S. Tuecke. The anatomy of the grid. Intl. J. Supercomputer Applications, 2001.


[94] P. Francis. Yallcast: Extending the internet multicast architecture. Technical report, NTT Information Sharing Platform Laboratories, September 1999. http://www.yallcast.com. [95] P. Francis, T. Kambayashi, S.-y. Sato, and S. Shimizu. Ingrid: A self-configuring information navigation infrastructure. In Proceedings of the Fourth International World Wide Web Conference, pages 519–537, 1995. [96] A. Frank, G. Delamarter, R. Bent, and B. Hill. Astroflow simulator. http://astro.pas.rochester.edu/delamart/Research/Astroflow/Astroflow.html. [97] P. Frankl and H. Maehara. The Johnson-Lindenstrauss lemma and the sphericity of some graphs. Journal of Combinatorial Theory, Ser. B, 44(3):355–362, 1988. [98] M. J. Franklin, M. J. Carey, and M. Livny. Transactional client-server cache consistency: alternatives and performance. ACM Transactions on Database Systems, 22(3):315–363, 1997. [99] V. W. Freeh and G. R. Andrews. Dynamically controlling false sharing in distributed shared memory. In Proc. of the Fifth IEEE Int’l Symp. on High Performance Distributed Computing (HPDC-5), pages 403–411, 1996. [100] J. C. French, A. L. Powell, J. P. Callan, C. L. Viles, T. Emmitt, K. J. Prey, and Y. Mou. Comparing the Performance of Database Selection Algorithms. In Proceedings of International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 238–245, 1999. [101] D. Gelernter. Generative Communication in Linda. ACM Transactions on Programming Languages and Systems, 7(1):80–112, 1985.


[102] O. D. Gnawali. A Keyword Set Search System for Peer-to-Peer Networks. Master’s thesis, Massachusetts Institute of Technology, June 2002. [103] Gnutella. http://gnutella.wego.com. [104] G. Golub and C. V. Loan. Matrix Computations. The Johns Hopkins University Press, Baltimore, Maryland, second edition, 1989. [105] Google. http://www.google.com. [106] P. Graham and Y. Sui. LOTEC: A simple DSM consistency protocol for nested object transactions. In Symposium on Principles of Distributed Computing, pages 153–162, 1999. [107] L. Gravano, H. García-Molina, and A. Tomasic. GlOSS: text-source discovery over the Internet. ACM Transactions on Database Systems, 24(2), 1999. [108] J. Gray and A. Reuter. Transaction Processing: Concepts and Techniques. Morgan Kaufmann Publishers Inc., 1993. [109] A. S. Grimshaw and W. A. Wulf. Legion—A View from 50,000 Feet. In Proc. of the 5th Intl. Symp. on High Performance Distributed Computing, Aug. 1996. [110] K. P. Gummadi, R. Gummadi, S. D. Gribble, S. Ratnasamy, S. Shenker, and I. Stoica. The Impact of DHT Routing Geometry on Resilience and Proximity. In SIGCOMM’03, August 2003. [111] A. Guttman. R-Trees: A dynamic index structure for spatial searching. In Proceedings of the ACM SIGMOD Conference, pages 47–57, 1984. [112] P. Haase, R. Siebes, and F. van Harmelen. Peer Selection in Peer-to-Peer Networks with Semantic Topologies. In Proceedings of International Conference on Semantics of a Networked World: Semantics for Grid Databases, Paris, 2004. [113] M. Herlihy and B. Liskov. A Value Transmission Method for Abstract Data Types. ACM Transactions on Programming Languages and Systems, 4(4):527–551, Oct 1982. [114] J. K. Hollingsworth and P. J. Keleher. Prediction and Adaptation in Active Harmony. In Proc. of the 7th Intl. Symp. on High Performance Distributed Computing, Apr. 1998. [115] P. Hoschka. Compact and efficient presentation conversion code. IEEE/ACM Transactions on Networking, 6(4):389–396, 1998. [116] A. L. Hosking and J. Chen. PM3: An Orthogonally Persistent Systems Programming Language—Design, Implementation, Performance. In Proceedings of the International Conference on Very Large Data Bases, Edinburgh, Scotland, September 1999. [117] H.-C. Hsiao and C.-T. King. Similarity Discovery in Structured P2P Overlays. In Proceedings of the 32nd International Conference on Parallel Processing (ICPP’03), Kaohsiung, Taiwan, October 2003. [118] Y. C. Hu, W. Yu, A. L. Cox, D. S. Wallach, and W. Zwaenepoel. Runtime support for distributed sharing in typed languages. In Languages, Compilers, and Run-Time Systems for Scalable Computers, pages 192–206, 2000. [119] P. Husbands, H. Simon, and C. Ding. On the use of singular value decomposition for text retrieval. In M. Berry, editor, Proc. of SIAM Comp. Info. Retrieval Workshop, October 2000.


[120] Scientific Computing Associates, Inc. Virtual Shared Memory and the Paradise System for Distributed Computing. Technical report, New Haven, CT, Apr. 1999. [121] IRIS. http://www.project-iris.net. [122] F. Jiang and M. L. Littman. Approximate dimension equalization in vector-based information retrieval. In Proceedings of the Seventeenth International Conference on Machine Learning, 2000. [123] S. Joseph. Neurogrid: Semantically routing queries in peer-to-peer networks. In Proceedings of the 1st Int’l Workshop on Peer-to-Peer Systems, Pisa, Italy, May 2002. [124] E. Jul, H. Levy, N. Hutchinson, and A. Black. Fine Grained Mobility in the Emerald System. ACM Trans. on Computer Systems, 6(1):109–133, Feb 1988. [125] T. Kaehler. Virtual Memory on a Narrow Machine for an Object-Oriented Language. In OOPSLA ’86 Conference Proceedings, pages 87–106, Portland, OR, October 1986. [126] B. Karp and H. T. Kung. GPSR: greedy perimeter stateless routing for wireless networks. In Mobile Computing and Networking, pages 243–254, 2000. [127] G. Karypis and E.-H. S. Han. Concept indexing: A fast dimensionality reduction algorithm with applications to document retrieval and categorization. In CIKM’00, 2000. [128] KaZaA. http://www.kazaa.com. [129] B. Kemme and G. Alonso. A suite of database replication protocols based on group communication primitives. In International Conference on Distributed Computing Systems, pages 156–163, 1998.


[130] I. A. Klampanos and J. M. Jose. An Architecture for Information Retrieval over Semi-Collaborating Peer-to-Peer Networks. In SAC’04, Nicosia, Cyprus, March 2004. [131] J. Kleinberg. Authoritative sources in a hyperlinked environment. In Proc. of 9th ACM SIAM Symposium on Discrete Algorithms, February 1998. [132] P. T. Koch, R. J. Fowler, and E. Jul. Message-driven relaxed consistency in a software distributed shared memory. In Proceedings of the First Symposium on Operating Systems Design and Implementation (OSDI), pages 75–85, Monterey, California, November 1994. [133] T. G. Kolda and D. P. O’Leary. A semidiscrete matrix decomposition for latent semantic indexing in information retrieval. ACM Trans. Information Systems, 16:322–346, 1998. [134] K. Kono, K. Kato, and T. Masuda. Smart remote procedure calls: Transparent treatment of remote pointers. In International Conference on Distributed Computing Systems, pages 142–151, 1994. [135] A. Kontostathis and W. M. Pottenger. A Mathematical View of Latent Semantic Indexing: Tracing Term Co-occurrences. Technical Report LU-CSE-02-006, CSE Department, Lehigh University, 2002. [136] R. Kordale, M. Ahamad, and M. Devarakonda. Object Caching in a CORBA Compliant System. Computing Systems, 9(4):377–404, Fall 1996. [137] V. Krishnaswamy, D. Walther, S. Bhola, E. Bommaiah, G. Riley, B. Topol, and M. Ahamad. Efficient implementations of Java remote method invocation (RMI). In the 4th USENIX Conf. on Object-Oriented Technologies and Systems (COOTS’98), pages 19–36, 1998.


[138] J. Kubiatowicz, D. Bindel, Y. Chen, P. Eaton, D. Geels, R. Gummadi, S. Rhea, H. Weatherspoon, W. Weimer, C. Wells, and B. Zhao. OceanStore: An architecture for global-scale persistent storage. In Proceedings of ACM ASPLOS. ACM, November 2000. [139] C. Lagoze, D. Fielding, and S. Payette. Making global digital libraries work: Collection services, connectivity regions, and collection views. In ACM DL, pages 134–143, 1998. [140] W. Lam, W. Wang, and C.-W. Yue. Web Discovery and Filtering Based on Textual Relevance Feedback Learning. Computational Intelligence, 19(2):136–163, May 2003. [141] L. Lamport. Time, Clocks, and the Ordering of Events in a Distributed System. Communications of the ACM, 21(7):558–565, July 1978. [142] L. S. Larkey, M. E. Connell, and J. P. Callan. Collection Selection and Results Merging with Topically Organized U.S. Patents and TREC Data. In CIKM’00, 2000. [143] R. Lempel and S. Moran. Optimizing result prefetching in web search engines with segmented indices. In VLDB’01, 2001. [144] T. A. Letsche and M. W. Berry. Large-scale information retrieval with latent semantic indexing. Information Sciences, 100(1-4):105–137, 1997. [145] J. Li, B. T. Loo, J. Hellerstein, F. Kaashoek, D. R. Karger, and R. Morris. On the Feasibility of Peer-to-Peer Web Indexing and Search. In IPTPS’03, February 2003. [146] K. Li and P. Hudak. Memory coherence in shared virtual memory systems. ACM Transactions on Computer Systems, 7(4):321–359, November 1989.


[147] M. Li, W.-C. Lee, A. Sivasubramaniam, and D. L. Lee. A Small World Overlay Network for Semantic Based Search in P2P. In Second Workshop on Semantics in Peer-to-Peer and Grid Computing, New York, USA, May 2004. [148] T. Li, M. Ogihara, and Q. Li. A Comparative Study on Content-Based Music Genre Classification. In Proceedings of Annual ACM Conference on Research and Development in Information Retrieval (SIGIR), 2004. [149] J. Linnolahti. QoS routing for P2P networking, 2004. http://www.tml.hut.fi/Studies/T-110.551/2004/papers/Linnolahti.pdf. [150] B. Liskov. Distributed Programming in Argus. Comm. of the ACM, 31(3):300–312, Mar. 1988. [151] B. Liskov, A. Adya, M. Castro, M. Day, S. Ghemawat, R. Gruber, U. Maheshwari, A. C. Myers, and L. Shrira. Safe and Efficient Sharing of Persistent Objects in Thor. In Proc. of the 1996 ACM SIGMOD Intl. Conf. on Management of Data, June 1996. [152] B. Liskov, M. Castro, L. Shrira, and A. Adya. Providing persistent objects in distributed systems. In R. Guerraoui, editor, ECOOP ’99 — Object-Oriented Programming 13th European Conference, Lisbon Portugal, volume 1628, pages 230–257. Springer-Verlag, New York, NY, 1999. [153] C. Liu and P. Cao. Maintaining strong cache consistency in the world-wide web. In International Conference on Distributed Computing Systems, 1997. [154] L. Liu. Query routing in large-scale digital library systems. In ICDE’99, pages 154–163, 1999. [155] X. Long and T. Suel. Optimized Query Execution in Large Search Engines with Global Page Ordering. In VLDB’03, 2003.


[156] B. T. Loo, J. M. Hellerstein, R. Huebsch, S. Shenker, and I. Stoica. Enhancing P2P File-Sharing with an Internet-Scale Query Processor. In VLDB, 2004. [157] A. Löser. Towards taxonomy based routing in p2p networks. In The Second Workshop on Semantics in Peer-to-Peer and Grid Computing at the Thirteenth International World Wide Web Conference, 2004. [158] A. Löser, K. Schubert, and F. Zimmer. The Semantic Music Store: Managing Distributed Semantic Overlay Networks, 2004. http://cis.cs.tu-berlin.de/~aloeser/publications/Loeser_Semantic_Music_Store.pdf. [159] H. Lu, S. Dwarkadas, A. Cox, and W. Zwaenepoel. Message passing versus distributed shared memory on networks of workstations. In Proceedings of Supercomputing ’95, 1995. [160] J. Lu and J. Callan. Federated Search of Text-Based Digital Libraries in Hierarchical Peer-to-Peer Networks. In ACM SIGIR’04 Workshop on Peer-to-Peer Information Retrieval, 2004. [161] Z. Lu and K. S. McKinley. Searching a terabyte of text using partial replication. Technical Report UM-CS-1999-050, Computer Science Department, University of Massachusetts Amherst, 1999. [162] S. Lucco and O. Sharp. Delirium: an embedding coordination language. In Proceedings of Supercomputing ’90, pages 515–524. IEEE Computer Society Press, 1990. [163] Q. Lv, P. Cao, E. Cohen, K. Li, and S. Shenker. Search and Replication in Unstructured Peer-to-Peer Networks. In ICS’02, June 2002. [164] M. J. M. Ma, C.-L. Wang, F. C. M. Lau, and Z. Xu. JESSICA: Java-enabled single system image computing architecture. In The International Conference on Parallel and Distributed Processing Techniques and Applications, pages 2781–2787, June 1999. [165] F. Menczer, R. Akavipat, and L. Wu. 6S: Distributing crawling and searching across Web peers. Technical report, School of Informatics, Indiana University, June 2004. [166] S. Michel, K. Nguyen, A. Rosenstein, L. Zhang, S. Floyd, and V. Jacobson. Adaptive web caching: towards a new global caching architecture. Computer Networks and ISDN Systems, 30(22–23):2169–2177, 1998. [167] D. Milojicic, F. Douglis, Y. Paindaveine, R. Wheeler, and S. Zhou. Process migration survey. ACM Computing Surveys, September 2000. [168] D. S. Milojicic, V. Kalogeraki, R. Lukose, K. Nagaraja, J. Pruyne, B. Richard, S. Rollins, and Z. Xu. Peer-to-peer computing. Technical Report HPL-2002-57, HP Lab, 2002. [169] M. Mitra, A. Singhal, and C. Buckley. Improving Automatic Query Expansion. In SIGIR’98, 1998. [170] J. E. B. Moss. Design of the Mneme persistent object store. ACM Transactions on Information Systems, 8(2):103–139, 1990. [171] W. Müller, M. Eisenhardt, and A. Henrich. Efficient content-based P2P image retrieval using peer content descriptions. In SPIE Electronic Imaging 2004, San Jose, California, USA, January 2004. [172] W. Müller and A. Henrich. Fast Retrieval of High-Dimensional Feature Vectors in P2P Networks Using Compact Peer Data Summaries. In ACM MIR’03 Workshop, Berkeley, CA, USA, November 2003. [173] F. Murtagh. A survey of recent advances in hierarchical clustering algorithms. The Computer Journal, 26(4), 1983.


[174] A. Muthitacharoen, B. Chen, and D. Mazières. A low-bandwidth network file system. In Proceedings of the 18th ACM Symposium on Operating Systems Principles, 2001. [175] K. Nakauchi, Y. Ishikawa, H. Morikawa, and T. Aoyama. Peer-to-Peer Keyword Search Using Keyword Relationship. In Proceedings of 3rd International Workshop on Global and Peer-to-Peer Computing on Large Scale Distributed Systems (GP2PC 2003), Tokyo, Japan, May 2003. [176] K. Nakauchi, H. Morikawa, and T. Aoyama. Semantic Peer-to-Peer Search Using Query Expansion. In Second Workshop on Semantics in Peer-to-Peer and Grid Computing, New York, USA, May 2004. [177] National Institute of Standards and Technology. Secure Hash Standard, FIPS 180-1, April 1995. [178] T. S. E. Ng and H. Zhang. Predicting Internet Network Distance with Coordinates-Based Approaches. In INFOCOM’02, 2002. [179] J. Nievergelt, H. Hinterberger, and K. C. Sevcik. The grid file: An adaptable, symmetric multikey file structure. ACM Transactions on Database Systems, 9(1):38–71, 1984. [180] NLANR. http://watt.nlanr.net/. [181] Object Management Group, Inc., Framingham, MA. The Common Object Request Broker: Architecture and Specification, Revision 2.0, July 1996. [182] OpenSSL. http://www.openssl.org/. [183] Oregon Route Views Project. http://routeviews.org. [184] P. Lyman, H. R. Varian, J. Dunn, A. Strygin, and K. Swearingen. How much information? 2003. http://www.sims.berkeley.edu/research/projects/how-much-info.


[185] C. H. Papadimitriou, H. Tamaki, P. Raghavan, and S. Vempala. Latent Semantic Indexing: A Probabilistic Analysis. In PODC’98, 1998. [186] G. A. Papadopoulos and F. Arbab. Coordination models and languages. Advances in Computers, 55:329–400, 1998. [187] M. Parameswaran, A. Susarla, and A. B. Whinston. P2P networking: An information-sharing alternative. IEEE Computer, 34(7), July 2001. [188] H. Park, M. Jeon, and J. Rosen. Lower dimensional representation of text data based on centroids and least squares. BIT, 43(2):1–22, 2003. [189] G. D. Parrington, S. K. Srivastava, S. M. Wheater, and M. C. Little. The Design and Implementation of Arjuna. Computing Systems, 8(2):255–308, 1995. [190] C. Peery, F. Cuenca-Acuna, R. Martin, and T. Nguyen. Collaborative Management of Global Directories in P2P Systems. Technical Report DCS-TR510, Department of Computer Science, Rutgers University, November 2002. [191] J.-K. Peir, Y. Lee, and W. W. Hsu. Capturing Dynamic Memory Reference Behavior with Adaptive Cache Topology. In Architectural Support for Programming Languages and Operating Systems, pages 240–250, 1998. [192] K. Petersen, M. J. Spreitzer, and D. B. Terry. Flexible update propagation for weakly consistent replication. In Proceedings of the 16th ACM Symposium on Operating Systems Principles, 1997. [193] M. Philippsen and B. Haumacher. More efficient object serialization. In IPPS/SPDP Workshops, pages 718–732, 1999. [194] G. P. Picco, A. L. Murphy, and G. C. Roman. Lime: Linda meets mobility. In Proc. of the 21st Intl. Conf. on Software Engineering, pages 368–377, May 1999.

325

[195] PlanetLab. http://www.planet-lab.org/. [196] A. L. Powell, J. C. French, J. P. Callan, M. E. Connell, and C. L. Viles. The impact of database selection on distributed searching. In ACM SIGIR ’00, pages 232–239, 2000. [197] C. D. Prete, J. T. McArthur, R. L. Villars, I. L. Nathan Redmond, and D. Reinsel. Industry Developments and Models, Disruptive Innovation in Enterprise Computing: Storage. IDC, February 2003. [198] M. T. Prinkey. An efficient scheme for query processing on peer-to-peer networks, 2001. http://aeolusres.homestead.com/files. [199] P. Raghavan. Information Retrieval Algorithms: A survey. In Proc. 8th SIAM Symposium on Discrete Algorithms (SODA), 1997. [200] S. Ramabhadran, S. Ratnasamy, J. M. Hellerstein, and S. Shenker. Prefix Hash Tree: An Indexing Data Structure over Distributed Hash Tables, 2004. [201] U. Ramachandran, N. Harel, R. S. Nikhil, J. M. Rehg, and K. Knobe. Space-Time Memory: A Parallel Programming Abstraction for Interactive Multimedia Applications. In Proceedings of the Seventh ACM Symposium on Principles and Practice of Parallel Programming, pages 183–192, Atlanta, GA, May 1999. [202] S. Ratnasamy, P. Francis, M. Handley, R. Karp, and S. Shenker. A Scalable Content-Addressable Network. In SIGCOMM’01, 2001. [203] M. S. Raunak. A survey of cooperative caching, December 1999. [204] Red Hat Inc. http://www.cygwin.com. [205] P. Reynolds and A. Vahdat. Efficient Peer-to-Peer Keyword Searching. In Middleware’03, June 2003.

326

[206] S. Rhea and J. Kubiatowicz. Probabilistic Location and Routing. In INFOCOM’02, 2002. [207] R. Riggs, J. Waldo, A. Wollrath, and K. Bharat. Pickling State in the Java System. Computing Systems, 9(4):291–312, Fall 1996. [208] K. M. Risvik and R. Michelsen. Search Engines and Web Dynamics. Computer Networks, 39(3):289–302, 2002. [209] R. J. Stets. Leveraging Symmetric Multiprocessors and System Area Networks in Software Distributed Shared Memory. PhD thesis, University of Rochester, Rochester, New York, August 1999. [210] S. E. Robertson, S. Walker, S. Jones, M. M. Hancock-Beaulieu, and M. Gatford. Okapi at TREC-3. In TREC-3, 1994. [211] P. Rodriguez and S. Sibal. SPREAD: Scalable platform for reliable and efficient automated distribution. In 9th International World Wide Web Conference, 2000. [212] P. Rodriguez, C. Spanner, and E. Biersack. Web caching architectures: Hierarchical and distributed caching. In Proceedings of the 4th International Caching Workshop, San Diego, California, April 1999. [213] D. Rogerson. Inside COM. Microsoft Press, Redmond, Washington, Jan. 1997. [214] A. Rousskov. On performance of caching proxies, 1996. http://www.cs.ndsu.nodak.edu/rousskov/research/cache/squid/profiling/papers. [215] S. Roweis and L. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, Dec. 2000.

327

[216] A. Rowstron and P. Druschel. Pastry: Scalable, distributed object location and routing for large-scale peer-to-peer systems. In IFIP/ACM International Conference on Distributed Systems Platforms (Middleware), pages 329–350, November 2001.
[217] A. Rowstron and P. Druschel. Storage management and caching in PAST, a large-scale, persistent peer-to-peer storage utility. In SOSP'01, 2001.
[218] A. I. T. Rowstron, A.-M. Kermarrec, M. Castro, and P. Druschel. SCRIBE: The design of a large-scale event notification infrastructure. In Networked Group Communication, pages 30–43, 2001.
[219] R. Huebsch, J. M. Hellerstein, N. Lanham, B. T. Loo, S. Shenker, and I. Stoica. Querying the Internet with PIER. In VLDB, 2003.
[220] S. Lawrence and C. L. Giles. Accessibility of information on the web. Nature, 400:107–109, July 1999.
[221] G. Salton, A. Wong, and C. Yang. A Vector Space Model for Information Retrieval. Communications of the ACM, 18(11):613–620, 1975.
[222] G. Salton, A. Wong, and C. Yang. A vector space model for information retrieval. Communications of the ACM, 18(11):613–620, 1975.
[223] K. Sankaralingam, S. Sethumadhavan, and J. C. Browne. Distributed Pagerank for P2P Systems. In The 12th IEEE International Symposium on High Performance Distributed Computing (HPDC'03), June 2003.

328

[224] S. Saroiu, P. K. Gummadi, and S. D. Gribble. A Measurement Study of Peer-to-Peer File Sharing Systems. In Proceedings of Multimedia Computing and Networking 2002 (MMCN '02), San Jose, CA, USA, January 2002.
[225] M. Satyanarayanan and M. Spasojevic. AFS and the Web: Competitors or Collaborators? In Proceedings of the Seventh ACM SIGOPS European Workshop, Connemara, Ireland, September 1996.
[226] D. J. Scales and K. Gharachorloo. Towards Transparent and Efficient Software Distributed Shared Memory. In The Sixteenth ACM Symposium on Operating Systems Principles, 1997.
[227] D. J. Scales and M. S. Lam. The design and evaluation of a shared object system for distributed memory machines. In Operating Systems Design and Implementation, pages 101–114, 1994.
[228] C. Schmidt and M. Parashar. Flexible Information Discovery in Decentralized Distributed Systems. In Proceedings of the 12th IEEE International Symposium on High Performance Distributed Computing (HPDC-12), 2003.
[229] C. Schmidt, M. Parashar, W. Chen, and D. Foran. Engineering a Peer-to-Peer Collaboratory for Tissue Microarray Research. In Proceedings of the 2nd International Workshop on Challenges of Large Applications in Distributed Environments (CLADE 2004), Honolulu, Hawaii, USA, June 2004.
[230] M. Schwartz. A Scalable, Non-Hierarchical Resource Discovery Mechanism Based on Probabilistic Protocols. Technical Report TR CU-CS-474-90, University of Colorado, 1990.
[231] M. L. Scott, D. Chen, S. Dwarkadas, and C. Tang. Distributed Shared State (position paper). In The 9th International Workshop on Future Trends of Distributed Computing Systems (FTDCS'03), San Juan, Puerto Rico, May 2003.

329

[232] S. Shi, G. Yang, D. Wang, J. Yu, S. Qu, and M. Chen. Making Peer-to-Peer Keyword Searching Feasible Using Multi-level Partitioning. In Proceedings of the 3rd Int'l Workshop on Peer-to-Peer Systems, San Diego, CA, February 2004.
[233] A. Singhal. Modern Information Retrieval: A Brief Overview. IEEE Data Engineering Bulletin, 24(4):35–43, 2001.
[234] A. Singhal, C. Buckley, and M. Mitra. Pivoted Document Length Normalization. In SIGIR'96, 1996.
[235] P. Smith and N. C. Hutchinson. Heterogeneous process migration: The Tui system. Software Practice and Experience, 28(6):611–639, 1998.
[236] D. Spence and T. Harris. XenoSearch: Distributed Resource Discovery in the XenoServer Open Platform. In Proceedings of the Twelfth IEEE International Symposium on High Performance Distributed Computing (HPDC-12), June 2003.
[237] R. Srikant and R. Agrawal. Mining Sequential Patterns. IBM Research Report RJ9910, IBM Almaden Research Center, 1994. Expanded version of paper presented at the International Conference on Data Engineering, Taipei, Taiwan, March 1995.
[238] K. Sripanidkulchai, B. Maggs, and H. Zhang. Enabling Efficient Content Location and Retrieval in Peer-to-Peer Systems by Exploiting Locality in Interests. ACM SIGCOMM Computer Communication Review, 32(1), January 2002.
[239] P. Stacey, D. Berry, and E. Coyle. Using Structured P2P Overlay Networks to Build Content Sensitive Communities. In The Tenth International Conference on Parallel and Distributed Systems (ICPADS'04), Newport Beach, California, July 2004.

330

[240] B. Steensgaard and E. Jul. Object and native code thread mobility among heterogeneous computers. In Proceedings of the 15th ACM Symposium on Operating Systems Principles, pages 68–77, Dec. 1995.
[241] M. Steinbach, G. Karypis, and V. Kumar. A comparison of document clustering techniques. In KDD Workshop on Text Mining, 2000.
[242] R. Stets, S. Dwarkadas, N. Hardavellas, G. Hunt, L. Kontothanassis, S. Parthasarathy, and M. Scott. Cashmere-2L: Software Coherent Shared Memory on a Clustered Remote-Write Network. In Proc. of the 16th ACM Symp. on Operating Systems Principles, St. Malo, France, Oct. 1997.
[243] I. Stoica, R. Morris, D. Karger, M. F. Kaashoek, and H. Balakrishnan. Chord: A scalable peer-to-peer lookup service for internet applications. In SIGCOMM'01, 2001.
[244] A. Strehl, J. Ghosh, and R. Mooney. Impact of similarity measures on web-page clustering. In Proceedings of the 17th National Conference on Artificial Intelligence: Workshop of Artificial Intelligence for Web Search (AAAI 2000), 30–31 July 2000, Austin, Texas, USA, pages 58–64. AAAI, July 2000.
[245] T. Suel, C. Mathur, J. Wu, J. Zhang, A. Delis, M. Kharrazi, X. Long, and K. Shanmugasunderam. ODISSEA: A Peer-to-Peer Architecture for Scalable Web Search and Information Retrieval. In WebDB'03, June 2003.
[246] Q. Sun, P. Ganesan, and H. Garcia-Molina. The InfoMatrix: Distributed Indexing in a P2P Environment, 2004. http://www-db.stanford.edu/~qsun/research/infomat.pdf.
[247] Q. Sun and H. Garcia-Molina. Partial lookup services. In ICDCS'03, 2003.
[248] Sun Microsystems, Palo Alto, CA. JavaSpaces Specification, Jan. 1999.

331

[249] Sun Microsystems, Palo Alto, CA. Jini Architecture Specification, Jan. 1999.
[250] Sun Microsystems Inc., Mountain View, CA. Java Remote Method Invocation Specification.
[251] Sun Microsystems Inc., Mountain View, CA. Java Remote Method Invocation Specification, Revision 1.4, JDK 1.1, 1997.
[252] SVDPACK. http://www.netlib.org/svdpack.
[253] C. Tang, D. Chen, S. Dwarkadas, and M. L. Scott. Efficient Distributed Shared State for Heterogeneous Machine Architectures. In The 23rd International Conference on Distributed Computing Systems, Providence, Rhode Island, May 2003. Expanded version available as URCS technical report TR783, "Support for Machine and Language Heterogeneity in a Distributed Shared State System".
[254] C. Tang, D. Chen, S. Dwarkadas, and M. L. Scott. Integrating Remote Invocation and Distributed Shared State. In IPDPS'04, 2004.
[255] C. Tang and S. Dwarkadas. Hybrid Global-Local Indexing for Efficient Peer-to-Peer Information Retrieval. In First Symposium on Networked Systems Design and Implementation (NSDI'04), 2004.
[256] C. Tang, S. Dwarkadas, and Z. Xu. On Scaling Latent Semantic Indexing for Large Peer-to-Peer Systems. In SIGIR'04, Sheffield, UK, July 2004.
[257] C. Tang and Z. Xu. pFilter: Global information filtering and dissemination. In The 9th International Workshop on Future Trends of Distributed Computing Systems (FTDCS'03), San Juan, Puerto Rico, May 2003. Expanded version available as technical report HPL-2002-304.
[258] C. Tang, Z. Xu, and S. Dwarkadas. Peer-to-Peer Information Retrieval Using Self-Organizing Semantic Overlay Networks. In SIGCOMM'03, 2003.

332

[259] C. Tang, Z. Xu, and M. Mahalingam. pSearch: Information Retrieval in Structured Overlays. In The First Workshop on Hot Topics in Networks (HotNets I), 2002. Older but partially expanded version available as technical report HPL-2002-198, "PeerSearch: Efficient Information Retrieval in Peer-to-Peer Networks".
[260] N. Tang. The Information Discovery Graph: A Framework for a Distributed Search Engine. PhD thesis, University of California at Los Angeles, November 2003.
[261] D. Terry, M. Theimer, and K. Petersen. Managing update conflicts in a weakly connected replicated storage system. In Proceedings of the 15th ACM Symposium on Operating Systems Principles, 1995.
[262] R. Tewari, M. Dahlin, H. Vin, and J. Kay. Beyond hierarchies: Design considerations for distributed caching on the internet. In Proceedings of ICDCS'99, May 1999.
[263] Text Retrieval Conference (TREC). http://trec.nist.gov.
[264] The Open Directory Project: Web directory for over 2.5 million URLs. http://www.dmoz.org/.
[265] A. Tomasic and H. Garcia-Molina. Query Processing and Inverted Indices in Shared-Nothing Document Information Retrieval Systems. VLDB Journal, 2(3):243–275, 1993.
[266] D. A. Tran, K. A. Hua, and T. T. Do. A Peer-to-Peer Architecture for Media Streaming. IEEE JSAC Special Issue on Advances in Overlay Networks, 22(1), January 2004.
[267] A. Vahdat, T. Anderson, M. Dahlin, D. Culler, E. Belani, P. Eastham, and C. Yoshikawa. WebOS: Operating System Services for Wide Area Applications. In Proc. of the 7th Intl. Symp. on High Performance Distributed Computing, July 1998.
[268] A. Vahdat, J. Chase, R. Braynard, D. Kostic, and A. Rodriguez. Self-organizing subsets: From each according to his abilities, to each according to his needs. In Proceedings of the 1st Int'l Workshop on Peer-to-Peer Systems, Pisa, Italy, May 2002.
[269] M. van Steen, P. Homburg, and A. S. Tanenbaum. Globe: A Wide-Area Distributed System. IEEE Concurrency, pages 70–78, Jan.–Mar. 1999.
[270] VMWare Inc. http://www.vmware.com/.
[271] J. R. von Behren, E. A. Brewer, N. Borisov, M. Chen, M. Welsh, J. MacDonald, J. Lau, S. Gribble, and D. Culler. Ninja: A framework for network services. In USENIX 2002 Annual Conference, Monterey, CA, June 2002.
[272] E. M. Voorhees, N. K. Gupta, and B. Johnson-Laird. The collection fusion problem. In Third Text REtrieval Conference (TREC-3), 1994.
[273] S. Voulgaris, A.-M. Kermarrec, L. Massoulie, and M. van Steen. Exploiting Semantic Proximity in Peer-to-peer Content Searching. In Proc. 10th IEEE Int'l Workshop on Future Trends in Distributed Computing Systems (FTDCS 2004), Suzhou, China, May 2004.
[274] C. Wang, J. Li, and S. Shi. An Approach to Content-Based Approximate Query Processing in Peer-to-Peer Data Systems. In Grid and Cooperative Computing, Second International Workshop (GCC 2003), Shanghai, China, December 2003.
[275] J. Wang. Location and recommendation with distributed multimedia resources, 2004.


[276] X. Wang, W. Ng, B. Ooi, K.-L. Tan, and A. Zhou. BuddyWeb: A P2P-based Collaborative Web Caching System. In Proceedings of the 1st Int'l Workshop on Peer-to-Peer Systems, Pisa, Italy, May 2002.
[277] S. Waterhouse. JXTA search: Distributed search for distributed networks. http://search.jxta.org.
[278] R. Weber, H.-J. Schek, and S. Blott. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In VLDB'98, 1998.
[279] C. Weems, G. Weaver, and S. Dropsho. Linguistic support for heterogeneous parallel processing: A survey and an approach. In Third Heterogeneous Computing Workshop, pages 81–88, Cancun, Mexico, April 1994.
[280] M. Welsh, N. Borisov, J. Hill, R. von Behren, and A. Woo. Querying large collections of music for similarity. Technical Report UCB/CSD-00-1096, UC Berkeley, November 1999.
[281] A. Whitaker, M. Shaw, and S. D. Gribble. Scale and Performance in the Denali Isolation Kernel. In OSDI'02, 2002.
[282] P. Willett. Recent trends in hierarchical document clustering: a critical review. Information Processing & Management, 24:577–597, 1988.
[283] P. Wilson. Pointer Swizzling at Page Fault Time: Efficiently and Compatibly Supporting Huge Address Spaces on Standard Hardware. In International Workshop on Object Orientation in Operating Systems, September 1992.
[284] I. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann, second edition, 1999.


[285] D. B. Wortman, S. Zhou, and S. Fink. Automating data conversion for heterogeneous distributed shared memory. Software Practice and Experience, 24(1):111–125, 1994.
[286] Z. Wu, W. Meng, C. T. Yu, and Z. Li. Towards a highly-scalable and effective metasearch engine. In World Wide Web, pages 386–395, 2001.
[287] J. Xu and J. Callan. Effective retrieval with distributed collections. In Proceedings of the 21st International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 112–120, 1998.
[288] J. Xu and W. B. Croft. Cluster-Based Language Models for Distributed Retrieval. In SIGIR'99, 1999.
[289] Z. Xu, M. Mahalingam, and M. Karlsson. Turning Heterogeneity into an Advantage in Overlay Routing. In INFOCOM'03, 2003.
[290] Z. Xu, C. Tang, S. Banerjee, and S.-J. Lee. RITA: Receiver Initiated Just-in-Time Tree Adaptation for Rich Media Distribution. In The 13th International Workshop on Network and Operating Systems Support for Digital Audio and Video (NOSSDAV'03), Monterey, California, June 2003.
[291] Z. Xu, C. Tang, and Z. Zhang. Building Topology-Aware Overlays using Global Soft-State. In Proceedings of ICDCS'03, May 2003.
[292] B. Yang and H. Garcia-Molina. Comparing hybrid peer-to-peer systems. In The VLDB Journal, pages 561–570, Sep. 2001.
[293] B. Yang and H. Garcia-Molina. Efficient search in peer-to-peer networks. In Proceedings of the International Conference on Distributed Computing Systems (ICDCS), July 2002.


[294] B. Yang and H. Garcia-Molina. Designing a Super-peer Network. In Proceedings of the 19th International Conference on Data Engineering (ICDE), Bangalore, India, March 2003.
[295] J. Yin, L. Alvisi, M. Dahlin, and C. Lin. Using leases to support server-driven consistency in large-scale systems. In International Conference on Distributed Computing Systems, pages 285–294, 1998.
[296] H. Yu, L. Breslau, and S. Shenker. A scalable web cache consistency architecture. In Proceedings of ACM Sigcomm'99, August 1999.
[297] H. Yu and A. Vahdat. Design and Evaluation of a Continuous Consistency Model for Replicated Services. In OSDI 2000, October 2000.
[298] H. Yu and A. Vahdat. The Costs and Limits of Availability for Replicated Services. In Proceedings of the 18th ACM Symposium on Operating Systems Principles, October 2001.
[299] W. Yu and A. Cox. Java/DSM: A Platform for Heterogeneous Computing. Concurrency–Practice and Experience, 9(11), 1997.
[300] M. A. Zaharia and S. Keshav. Design, Analysis and Simulation of a System for Free-Text Search in Peer-to-Peer Networks, 2004.
[301] J. D. Zakis and Z. J. Pudlowski. The world wide web as universal medium for scholarly publication, information retrieval and interchange. Global Journal of Engineering Education, 1(3), 1997. http://www.eng.monash.edu.au/uicee/gjee/vol1no3/paper5.htm.
[302] D. Zeinalipour-Yazti. Information retrieval in peer-to-peer systems. Master's thesis, Computer Science and Engineering, University of California Riverside, 2003.


[303] D. Zeinalipour-Yazti, V. Kalogeraki, and D. Gunopulos. Information Retrieval Techniques for Peer-to-Peer Networks. IEEE Computing in Science and Engineering (CiSE) Magazine, Special Issue on Web Engineering, 2004.
[304] M. Zekauskas, W. Sawdon, and B. Bershad. Software write detection for distributed shared memory. In Proc. of the First USENIX Symposium on Operating System Design and Implementation, pages 87–100, November 1994.
[305] M. J. Zekauskas, W. A. Sawdon, and B. N. Bershad. Software write detection for a distributed shared memory. In Proc. of the 1st Symp. on Operating Systems Design and Implementation (OSDI'94), pages 87–100, 1994.
[306] C. Zhang, A. Krishnamurthy, and R. Y. Wang. SkipIndex: Towards a Scalable Peer-to-Peer Index Service for High Dimensional Data. Technical report, Department of Computer Science, Princeton University, 2004. http://www.cs.princeton.edu/~chizhang/skipindex.pdf.
[307] Z. Zhang, S. Shi, and J. Zhu. SOMO: Self-organized metadata overlay for resource management in P2P DHT. In IPTPS'03, Berkeley, CA, USA, February 2003.
[308] B. Zhao, J. Kubiatowicz, and A. Joseph. Tapestry: An infrastructure for fault-tolerant wide-area location and routing. Technical Report UCB/CSD-01-1141, Computer Science Division, U. C. Berkeley, April 2001.
[309] K. Zhao, S. Zhou, L. Xu, W. Cai, and A. Zhou. PeerSDI: A Peer-to-Peer Information Dissemination System. In Advanced Web Technologies and Applications, 6th Asia-Pacific Web Conference (APWeb), Hangzhou, China, April 2004.
[310] S. Zhou, M. Stumm, M. Li, and D. Wortman. Heterogeneous Distributed Shared Memory. IEEE Trans. on Parallel and Distributed Systems, 3(5):540–554, September 1992.


[311] W. Zhou and A. Goscinski. Managing replicated remote procedure call transactions. The Computer Journal, 42(7):592–608, 1999.
[312] Y. Zhu, H. Wang, and Y. Hu. Integrating semantics-based access mechanisms with peer-to-peer file systems. In Proceedings of the 3rd IEEE International Conference on Peer-to-Peer Computing (P2P2003), Linköping, Sweden, September 2003.
[313] S. Q. Zhuang, B. Y. Zhao, A. D. Joseph, R. H. Katz, and J. D. Kubiatowicz. Bayeux: An architecture for scalable and fault-tolerant wide-area data dissemination. In NOSSDAV 2001, June 2001.
