Configurable Meta-search for Integrating Web Public Access Catalogs

Hou Ieong Ho and Jieh Hsiang

Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan
[email protected], [email protected]

Abstract. A Web Public Access Catalog (WebPAC) is an important feature of modern libraries. In this paper we propose a meta-search method that gives users simultaneous access to the WebPACs of different libraries. Our method gives a librarian full freedom to select the WebPACs to be incorporated in the service but requires no programming effort on the librarian’s side. At the core of our method is a meta-search engine which sends a query to the incorporated WebPACs, receives the results, and post-processes them into a uniform presentation format. To incorporate an existing WebPAC into our system, one needs to analyze the query interaction behavior between the WebPAC and the browser. This can be done by extracting the query parameters from a query and from the subsequent query result web pages. We modeled and abstracted these interactions and defined corresponding XML formats to capture the needed parameters from these web pages. The resulting XML pages are then fed to the search engine, which automatically incorporates the designated WebPAC as part of its search. The advantage of our method is that the search engine does not need to be modified when new WebPACs are added. When adding a new WebPAC, the librarian only needs to analyze a few web pages to decide the parameters. Even this step can mostly be done automatically. To illustrate the effectiveness of our method, we have built a system, called MetaCat, that has incorporated the WebPACs of 26 major libraries in Taiwan. MetaCat can be accessed at http://MetaCat.ntu.edu.tw.

This research is supported in part by the National Science Council of the Republic of China under grant numbers NSC-94-2422-H-002-008 and NSC-93-2213-E-002-039.

E.A. Fox et al. (Eds.): ICADL 2005, LNCS 3815, pp. 317-322, 2005. © Springer-Verlag Berlin Heidelberg 2005

1 Introduction

The most important common service provided by modern libraries is the Web Public Access Catalog (WebPAC). By using a WebPAC, users can search a library’s catalog quickly via the Internet. However, trying to find books from several libraries can be a painful experience. The user needs to visit the WebPAC of every intended library and issue the same query to each of them separately. If the user does not have a clear idea of the books that she is looking for, it can be another time-consuming experience to go through the search results from those WebPACs. It is therefore reasonable to design an integrated search that can access several WebPACs simultaneously. This service can be achieved by building a centralized union catalog (such as WorldCat of OCLC), by establishing standard data exchange protocols (such as Z39.50 [1] or OAI-PMH [2]), or by using meta-search (see [3] as an example).

In this paper we propose a new meta-search methodology that allows a librarian to build her library’s cross-WebPAC service without any programming effort. Our method involves a core search facility and an XML format that allows the incorporation of a WebPAC service by simply identifying the parameters involved in queries. To demonstrate the effectiveness of our method, we have implemented such a service, called MetaCat, for the National Taiwan University Library. MetaCat currently incorporates the WebPACs of 26 major libraries of Taiwan. It is also a popular search tool provided by the NTU Library.

In Section 2 of the paper we give the methodology of our configurable meta-search method. Section 3 describes the implementation of MetaCat. We conclude the paper with some discussion and future directions.

2 Methodology

Meta-search for WebPACs is a mechanism that allows users to access and search, via the Web, the WebPACs of different libraries from a single webpage in a uniform way. In a typical (single) WebPAC service of a library, a user issues a query such as a title or author; the system then searches the catalog of the library and returns a list of books (if any) that match the query. If we treat the inner workings of a WebPAC as a black box, then the query session described above can be regarded as a series of webpage exchanges through the http protocol. The query issued by the user is sent as a sequence of parameters, usually wrapped inside the control elements (buttons, checkboxes, radio buttons, menus, text input, file select, hidden controls, object controls, etc.) [9] of the <form> tag of an html page. The query results, once retrieved from the database, are embedded in another html page and presented to the user’s browser. This http interaction model between the browser and the WebPAC is quite simple, and can be summarized as the following states:

1. Send request, as an html page, to WebPAC
2. Receive an html page from WebPAC
3. Identify the html template of the received html page
4. Extract data from the html page
5. Stop, or use the extracted data and go to State 1

As far as the Web interface is concerned, the only differences between two WebPACs are the configurations (parameters) of the queries that the WebPAC interfaces send to their respective search engines, and the configurations of the query results that the interfaces get back from the WebPAC search engines and present to the user. Among the five states of the above http interaction model, all except State 2 (the receive state) have their own configurations.
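The five-state loop can be sketched in a few lines of Python. This is only an illustration of the interaction model, not MetaCat's code: the `Template` class, the `run_session` function, and the canned pages are all hypothetical names introduced here, and `send` stands in for whatever actually performs the http exchange.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Template:
    """One kind of result page (hypothetical structure, not MetaCat's)."""
    key_phrase: str                       # text that identifies the template
    extract: Callable[[str], List[str]]   # pulls data out of the page
    follow: bool = False                  # True: extracted items are new requests


def run_session(send: Callable[[str], str], request: str,
                templates: List[Template], max_steps: int = 10) -> List[str]:
    """Drive the five-state loop; `send` maps a request to an html string."""
    results: List[str] = []
    pending = [request]
    steps = 0
    while pending and steps < max_steps:
        steps += 1
        page = send(pending.pop(0))                        # States 1 and 2
        template: Optional[Template] = next(
            (t for t in templates if t.key_phrase in page), None)  # State 3
        if template is None:
            continue                                       # unknown page: skip it
        data = template.extract(page)                      # State 4
        if template.follow:
            pending.extend(data)                           # State 5: back to State 1
        else:
            results.extend(data)                           # State 5: stop for this page
    return results


# Canned pages standing in for a real WebPAC (assumed content).
PAGES = {
    "search?title=xml": "Brief list: /rec/1 /rec/2",
    "/rec/1": "Detail: XML Basics",
    "/rec/2": "Detail: XML in Libraries",
}
TEMPLATES = [
    Template("Brief list:", lambda p: p.split()[2:], follow=True),
    Template("Detail:", lambda p: [p.split(": ", 1)[1]]),
    Template("No record", lambda p: []),
]
titles = run_session(PAGES.__getitem__, "search?title=xml", TEMPLATES)
```

Everything that differs between two WebPACs lives in the `TEMPLATES` data and in the request string; the loop itself does not change, which is the property the configuration formats of Section 2.1 rely on.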


2.1 Consolidate WebPAC Services Through Meta-search

To integrate several WebPACs, a conventional meta-search solution would analyze the interface of each WebPAC and incorporate them through programming. This process can be rather laborious and requires programming skills beyond the capability of an average librarian. Furthermore, if a WebPAC changes its interface or if a new WebPAC is to be added, then the program needs to be modified.

Our method uses a modular approach. WebPACs, due to their public-service nature, usually employ interfaces that are much simpler than those of commercial Web-accessed databases. Therefore, instead of using programming to incorporate each WebPAC, we propose general XML configuration formats to capture the parameters embedded in the http interaction. To include a WebPAC in our meta-search facility, all that needs to be done is to transform its http interactions into their respective XML formats and incorporate them in the meta-search mechanism. This action only requires knowledge of Web query parameters and XML, and can be done by any experienced librarian with some training. Furthermore, this framework is general enough that a library can choose the WebPACs that it wants to incorporate in its own meta-search service and build its own system. Adding a new WebPAC service or modifying an existing one can also be done easily.

Note that four of the five states in the WebPAC http interaction model given above involve webpages with parameters containing information related to the WebPAC transaction. They are captured in our framework through four classes of XML configuration formats: the Request configuration format, the Verify configuration format, the Extract configuration format, and the Flow configuration format. They correspond, respectively, to States 1, 3, 4, and 5 (State 2 does not need a corresponding configuration). The Request configuration format deals with the parameters of the queries of a WebPAC.
Verify and Extract have to do with those of the query results. Flow analyzes whether the WebPAC has any session control features. Due to space limitations, we only briefly outline these formats. Detailed information can be found in a longer version of this paper, available upon request.

Request Configuration. The most basic operation in a WebPAC is to issue a query. The Request configuration is the XML format that captures all the information involved in the query-sending process. This includes the host URL, the connection method (GET or POST), proxy information, query parameters, and user agent information such as the accepted content type, accept-language, and accept-encoding. If a WebPAC has been incorporated in the meta-search service, the meta-search engine will use that WebPAC’s Request configuration to simulate a user query to that particular WebPAC. We remark that the process of extracting the necessary query information and incorporating it into the Request format can be done almost automatically. The only human effort required is to issue sample queries to each of the query fields (such as title, author, etc.). The parameters associated with each query will be extracted and embedded into the corresponding configuration.

Verify Configuration. Several outcomes are possible when a WebPAC returns the query results. It may find no record, one record, or several records. The Verify configuration format includes three XML forms, one each for the templates of “no record”, “brief listed records”, and “detailed information”. Identifying which form a webpage corresponds to can be done by identifying specific key phrases that are pertinent to that particular template. Determining which Verify configuration forms are needed for a specific WebPAC requires the assistance of a librarian. An experienced librarian needs to indicate to the system the types of query result webpages that her WebPAC may produce and, for each template, identify a key phrase that is unique to that template.

Extract Configuration. After identifying the templates of the query result webpages, we need to extract the parameters, such as title, author, and the hyperlink to a page with detailed information, from each of the templates. This is the purpose of the Extract configuration format, which is an XML format that locates the bibliographic information embedded in html. As with the Verify configuration format, there are several Extract configuration forms, each corresponding to a possible query result webpage. To locate bibliographic information in a query result html page, we define four elements for the Extract configuration: the “Single tag” element, the “Range tags” element, the “Nested single tag” element, and the “Nested range tags” element. The “Single tag” element is used to locate a particular html tag using the tag’s name and its order of appearance in an html page. The “Range tags” element is used to locate the range enclosed within an html tag, such as “…”. It can also be used to specify repeated ranges with the same tag name. “Nested single tag” and “Nested range tags” are used when the located content needs to be further processed by the extraction elements.

Flow Configuration. In order to control query sessions, some WebPAC systems generate a transaction key when a user enters, and the key may expire after a timeout.
This key is hidden in the webpages and serves either to give better service (by remembering previous queries in the same session) or to prevent abuse by external agents. The Flow configuration format is, then, an XML format that describes the flow of the http interactions of a WebPAC so that this type of session control can be handled. Checking whether a WebPAC has this type of session control is quite simple. One simply needs to enter the library’s query system from two different computers, issue the same query, and then compare the parameters sent from the two browsers.

2.2 How to Incorporate a WebPAC

Using the four configurations mentioned above, a librarian can add a WebPAC to the meta-search engine easily and without any programming effort. With the help of aiding tools, most of the needed configurations can be generated automatically. The librarian only needs to provide minimal help (such as identifying the parameter name if a key is needed in a WebPAC with session control, and the key phrases associated with the Verify configurations) by highlighting the related terms.
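To make the idea concrete, the following sketch shows how a configuration-driven engine might consume such a description. The paper does not reproduce its XML schema, so every element and attribute name below (`webpac`, `request`, `param`, `verify`, `template`, `extract`, `range`), the host name, and the sample page are assumptions made for illustration. The extraction uses a regular expression as a stand-in for the “Range tags” element, which is adequate for this toy page but not for arbitrary html, and the Flow configuration is not modeled.

```python
import re
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Hypothetical configuration; the real MetaCat formats are not given in the paper.
CONFIG = """
<webpac name="Example Library">
  <request url="http://webpac.example.org/search" method="GET">
    <param name="searchtype" value="t"/>
    <param name="searcharg" value="{query}"/>
  </request>
  <verify>
    <template name="no_record" keyphrase="No matches found"/>
    <template name="brief_list" keyphrase="Search results"/>
  </verify>
  <extract template="brief_list">
    <range field="title" tag="a"/>
  </extract>
</webpac>
"""
root = ET.fromstring(CONFIG.strip())


def build_request(root, query):
    """Request configuration: return (method, url, encoded query parameters)."""
    req = root.find("request")
    params = {p.get("name"): p.get("value").format(query=query)
              for p in req.findall("param")}
    return req.get("method"), req.get("url"), urlencode(params)


def identify_template(root, page):
    """Verify configuration: the first template whose key phrase occurs in the page."""
    for t in root.find("verify").findall("template"):
        if t.get("keyphrase") in page:
            return t.get("name")
    return None


def extract_fields(root, template, page):
    """Extract configuration: collect the text enclosed by each configured tag."""
    fields = {}
    for ex in root.findall("extract"):
        if ex.get("template") != template:
            continue
        for rng in ex.findall("range"):
            tag = re.escape(rng.get("tag"))
            pattern = rf"<{tag}\b[^>]*>(.*?)</{tag}>"
            fields[rng.get("field")] = re.findall(pattern, page, flags=re.S | re.I)
    return fields


# A made-up result page standing in for a real WebPAC response.
PAGE = ('<html><h1>Search results</h1>'
        '<a href="/rec/1">XML Basics</a><a href="/rec/2">XML in Libraries</a></html>')
kind = identify_template(root, PAGE)
records = extract_fields(root, kind, PAGE)
```

Adding another library in this sketch means writing another `CONFIG` document; none of the three functions needs to change.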

3 MetaCat – An Implementation

To demonstrate how our method works, we have built a meta-search service, MetaCat, for the National Taiwan University Library. MetaCat currently incorporates the WebPAC services of 26 major libraries in Taiwan. To provide better services to all users in Taiwan, we have grouped them (not mutually exclusively) into 7 categories, according to geographic location (four areas covering northern, central, southern, and eastern Taiwan), size (a group that contains only libraries with over half a million books), and specialty (medical and educational). This arrangement makes it easier for a user with a special interest or from a specific locality to find what she wants. The user can also select the libraries of her choice from the list of 26 by clicking buttons.

MetaCat provides a query field with 6 query modes (title, author, subject, ISBN, ISSN, and keywords) on its query interface. Although the 26 WebPACs come from only 3 vendors, Innovative (Innopac), Transtech (TOTAL II) and Dynix (iPAC), each has its own variations in query interface and query result presentation and needs to be dealt with individually.

Querying multiple WebPACs and gathering their results may take a while, especially when the network is slow. To expedite the query outcome, MetaCat (1) dispatches the query to the involved WebPACs simultaneously, (2) displays query results in an incremental, first-arrive-first-present way, and (3) waits a maximum of 30 seconds, to compensate for possible site failure or a congested network.

Another important feature of MetaCat is that it groups records of the same book into a single result, with links to the different libraries from which the book records were retrieved. MetaCat does this by checking for similarities in the bibliographic information of the query results and aggregating the similar ones. This feature can significantly reduce the number of items in the list of results and makes the system much easier to use. We have also implemented a toolbar plug-in that can be installed in the Internet Explorer (IE) browser.

MetaCat is quite popular among users at NTU. It is also gaining acceptance among librarians because it is much more up-to-date than the Taiwanese union catalog NBInet.
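The dispatch and grouping behavior described above can be sketched with Python's standard thread pool. This is a minimal sketch under stated assumptions, not MetaCat's implementation: the function names, the record fields, and the three stand-in libraries are hypothetical, and since the paper does not specify its similarity test, an exact match on normalized title and author is used here as a simple substitute.

```python
import re
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from concurrent.futures import TimeoutError as FuturesTimeout


def dispatch(query, searchers, timeout=30.0):
    """Yield (library, records) in order of arrival; stop after `timeout` seconds."""
    pool = ThreadPoolExecutor(max_workers=max(1, len(searchers)))
    futures = {pool.submit(fn, query): name for name, fn in searchers.items()}
    try:
        for fut in as_completed(futures, timeout=timeout):
            try:
                yield futures[fut], fut.result()
            except Exception:
                yield futures[fut], []     # a failed site contributes nothing
    except FuturesTimeout:
        pass                               # sites still pending are dropped
    finally:
        pool.shutdown(wait=False)          # do not block on slow sites


def record_key(record):
    """Normalize title and author so trivially different records compare equal."""
    def norm(text):
        return re.sub(r"[\W_]+", " ", text.lower()).strip()
    return norm(record["title"]), norm(record["author"])


def group_records(results):
    """Merge records with the same key, remembering which libraries hold them."""
    groups = {}
    for library, records in results:
        for rec in records:
            entry = groups.setdefault(
                record_key(rec),
                {"title": rec["title"], "author": rec["author"], "libraries": []})
            entry["libraries"].append(library)
    return list(groups.values())


# Stand-in WebPACs: two fast ones holding the same book, one that is too slow.
def lib_a(query):
    return [{"title": "XML Basics", "author": "Lee, A."}]

def lib_b(query):
    return [{"title": "XML basics.", "author": "Lee, A"}]

def lib_slow(query):
    time.sleep(0.5)
    return [{"title": "Late Arrival", "author": "X"}]

merged = group_records(
    dispatch("xml", {"A": lib_a, "B": lib_b, "Slow": lib_slow}, timeout=0.1))
```

Because `dispatch` is a generator, a caller can render each library's records as soon as they arrive, and the timeout bounds the total wait rather than the wait per site.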

4 Discussion

In this paper we introduced an approach to building configurable meta-search services for WebPACs. In addition to providing the core search facilities, we introduced four general XML configuration formats, with which one can capture the parameters of a WebPAC’s query process and query result presentations. In our method, new WebPACs can be added to the meta-search service without any programming effort or modification of the program code. To demonstrate the effectiveness of our method, we have built such a service, called MetaCat, for the NTU Library. In addition to having incorporated the WebPACs of all major libraries in Taiwan (26 in total), MetaCat also provides six query modes, seven categories of libraries for ease of use, and post-processing features that make query results much easier to use than those of other similar services.

We have designed tools to help librarians analyze the configurations. We are also studying the possibility of fully automating this process. There are methods for extracting content from webpages (see, e.g., [4] [5] [6]), but they need to be tailored for WebPAC applications. The DeepSpot Agent Tool Box [7] provides ways to extract information based on pattern discovery [8], but the rules generated by that approach are not human-readable and may be hard for librarians to check for accuracy.


References

[1] National Information Standards Organization (NISO). ANSI Z39.50: Information Retrieval Service and Protocol, 1992.
[2] The Open Archives Initiative Protocol for Metadata Harvesting, protocol version 2.0. http://www.openarchives.org/OAI/2.0/openarchivesprotocol.htm
[3] Lin Fang, Library of Central China Normal University. A Developing Search Service: Heterogeneous Resources Integration and Retrieval System. D-Lib Magazine, Volume 10, Number 3, March 2004.
[4] N. Kushmerick, D. Weld and R. Doorenbos. Wrapper induction for information extraction. IJCAI-97, 1997. http://sherry.ifi.unizh.ch/kushmerick97wrapper.html
[5] Hongkun Zhao, Weiyi Meng, Zonghuan Wu, Vijay Raghavan, and Clement Yu. Fully Automatic Wrapper Generation for Search Engines. Proc. of 14th International World Wide Web Conference (WWW14), pp. 66-75, Chiba, Japan, May 2005.
[6] Benjamin Habegger. Multi-pattern wrappers for relation extraction from the Web. In Proceedings of the European Conference on Artificial Intelligence, 2002.
[7] Chia-Hui Chang, Harianto Siek, Jiann-Jyh Lu, Chun-Nan Hsu, Jen-Jie Chiou. Reconfigurable Web Wrapper Agents. IEEE Intelligent Systems, vol. 18, no. 5, pp. 34-40, September/October 2003.
[8] Chia-Hui Chang and Shao-Chen Lui. IEPAD: information extraction based on pattern discovery. In Proceedings of the Tenth International Conference on the World Wide Web, pp. 681-688, Hong Kong, China, 2001.
[9] Forms - User-input Forms: Text Fields, Buttons, Menus, and more. HTML 4.01 Specification, W3C Recommendation, 24 December 1999. http://www.w3.org/TR/REC-html40/interact/forms.html#h-17.10
