2007 IEEE/WIC/ACM International Conference on Web Intelligence

Finding Event-Relevant Content from the Web Using a Near-Duplicate Detection Approach

Hung-Chi Chang1, Jenq-Haur Wang2, Chih-Yi Chiu1
1 Institute of Information Science, Academia Sinica, Taiwan
2 Department of Computer Science and Information Engineering, National Taipei University of Technology, Taiwan
{hungchi, cychiu}@iis.sinica.edu.tw, [email protected]

Abstract

In online resources, such as news sites and weblogs, authors often excerpt articles, embed content, and comment on existing articles related to a popular event. Therefore, it is useful if authors can check whether two or more articles share common parts for further analysis, such as co-citation analysis and search result improvement. If articles do have parts in common, we say the content of such articles is event-relevant. Conventional text classification methods classify a complete document into categories, but they cannot represent the semantics precisely or extract meaningful event-relevant content. To resolve these problems, we propose a near-duplicate detection approach for finding event-relevant content in Web documents. The efficiency of the near-duplicate detection and duplicate set generation algorithms makes the approach suitable for identifying event-relevant content. The experimental results demonstrate the potential of the proposed approach for use in weblogs.

1. Introduction

Community applications, such as weblogs (or blogs), are growing in popularity because they are relatively simple and easy to use. As a result, increasingly diverse content and user comments are accumulating rapidly, in addition to conventional online resources such as Web pages and news. It is quite common for blog users (or bloggers), and even online news editors, to quote original scripts or other articles about a particular event, edit them, and add their own comments. Therefore, news and blog articles often contain common content. If we consider that the common paragraphs are related to the same event, then these articles are event-relevant.

We call documents that share common content near-duplicates if most of their content is identical. Near-duplicates are problematic because storing them requires extra storage space and creates substantial overlap among search results for a query. However, when analyzing a particular event, it is useful to identify articles that share common content, and the differences between them, for further analysis. For example, we could analyze the shared content to find the original article and all citations for a specific event, and thereby collect individual opinions about the event.

In this paper, we propose a near-duplicate detection approach for finding event-relevant content on the Web. First, based on a compact feature for document representation, our near-duplicate detection algorithm checks whether documents are near-duplicates. These near-duplicate documents are put in a candidate set of documents potentially related to the same event. Second, the duplicate set generation algorithm clusters the documents in the candidate set into event-relevant groups according to their features.

The remainder of this paper is organized as follows. In Section 2, we formally define our problem. Related work is discussed in Section 3. We describe the proposed method in Section 4 and detail the experiments in Section 5. Section 6 presents our discussion and conclusions.

2. Problem definition

The problem addressed in this paper is formally defined as follows. From online resources, such as news articles and blogs, assume that we have already collected a document set W that contains N documents. Given a new incoming document d, for each document wi in W, where i = 1, 2, ..., N, we want to determine whether d and wi are event-relevant. An event could be a major news story, a movie, or something else. In typical online resources on the Web, a document set keeps growing; therefore, we assume the document source is a document stream. In weblogs, bloggers might cite parts of articles from different sources, such as news sites and blog archives, and add their comments. We assume that such content-sharing behavior relates documents to a particular event.

To identify event-relevant documents, several issues have to be addressed. First, it is important to define the criteria that determine the event-relevance of documents. Second, since there could be a large number of documents to compare, computational efficiency is a practical concern. Third, the relevance of grouped documents to certain events has to be assessed. Due to space limitations, we focus on the first two issues.
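To make the stream setting concrete, the following Python sketch (our illustration, not the paper's implementation; the predicate is_event_relevant is a hypothetical placeholder that Sections 4.2 and 4.3 approximate with near-duplicate detection) shows the naive shape of the computation:

```python
def process_stream(stream, is_event_relevant):
    """Consume a document stream, comparing each incoming document d
    against the documents already collected in W."""
    collected = []        # the growing document set W
    relevant_pairs = []   # (d, w_i) pairs judged event-relevant
    for d in stream:      # the source is a stream, not a fixed corpus
        for w in collected:
            if is_event_relevant(d, w):
                relevant_pairs.append((d, w))
        collected.append(d)   # W keeps growing
    return relevant_pairs
```

The quadratic cost of this naive loop is precisely the efficiency concern raised above; the approach in Section 4 avoids it by indexing compact features rather than comparing raw documents pairwise.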


3. Related work

Event detection has been an important research topic in information retrieval since the Topic Detection and Tracking (TDT) project. For example, Yang et al. [9] used k-means clustering algorithms for document grouping. Yang, Pierce, and Carbonell [10] investigated the use and extension of text retrieval and clustering techniques for event detection. Allan et al. [1] addressed the problem of detecting new events by using a single-pass clustering algorithm and a novel threshold model. Our proposed approach differs from the above methods: a near-duplicate detection approach is used to identify the overlapping parts of documents, and associations between documents are then generated by a clustering algorithm.

Near-duplicate copy detection has received a great deal of attention in recent years. There are many copy detection applications, such as duplicate Web page detection and removal in Web search engines [3][5], document versioning, and plagiarism detection in digital libraries [4]. Early copy detection mechanisms, like SCAM [7] and COPS [2], were proposed for removing duplicate documents from databases. Later methods based on document fingerprinting were developed for detecting small partial copies; for example, Winnowing [6] is based on estimating the similarity of n-gram hashes. Henzinger [3] compared the performance of mainstream algorithms in large-scale evaluations. Yang and Callan [8] proposed an instance-level constrained clustering approach that incorporates additional information, such as document attributes and content structure, into the clustering process to form near-duplicate clusters.

Our approach differs from previous methods in that it uses sentence-level features, instead of word-level features, for near-duplicate document detection. Although sentence-level features are simple and compact, which makes them efficient, the effectiveness of the detection algorithm is not affected significantly. In this paper, we employ one such sentence-level feature in our evaluations: the sequence of sentence lengths in a document.

4. The proposed approach

We assume that documents are likely to be event-relevant if they share the same content, even if the amount of shared content is relatively small. To find event-relevant documents, our proposed architecture employs two functions, namely duplicate detection and duplicate set generation. The sentence length, calculated as the number of words or phrases, is used as a sentence-level feature in our approach. The duplicate set generation algorithm is then applied to generate potential event-relevant sets; that is, the documents in each set share the same content. Figure 1 illustrates the proposed architecture with an example of acquiring two archives. Features are extracted and indexed in the left-hand thread, labeled "indexing" in the figure, while duplicate sets are generated from the feature database in the right-hand thread, labeled "querying."

Figure 1. Architecture of kernel functional blocks

4.1. Feature space conversion

Sentence-level features are extracted from text documents in this process. Because file formats, layouts, and writing styles vary among documents in the source corpus, additional pre-processing steps are applied before further manipulation. We remove material that is irrelevant to our purpose, such as HTML markup and formatting instructions.

Since the proposed approach is based on sentence-level features, we must first determine sentence boundaries. A pre-defined delimiter set decides where the boundaries of a sentence lie when text-based sentences are converted into a numeric sequence; the text string between two adjacent delimiting symbols is treated as a single sentence. After the sentence boundaries have been determined, we count the number of words in each sentence and convert each sentence to its corresponding length. The output of this step is a number sequence, namely the feature string of the document. We then process the feature string instead of the original document. Instead of sentence lengths, other techniques, such as hash-based fingerprinting, could be used to generate sentence features; however, such techniques are beyond the scope of this paper.
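As a concrete illustration of this conversion (our own sketch, not the authors' code; the delimiter set borrows the punctuation listed in Section 5.1), a document can be turned into its feature string as follows:

```python
import re

# Delimiters that mark sentence boundaries (mirrors Section 5.1).
DELIMITERS = r'[,;".?!]'

def feature_string(text):
    """Convert a document into its feature string: the sequence of
    sentence lengths, measured in words."""
    sentences = re.split(DELIMITERS, text)
    # Each number is the word count of one sentence; empty splits are skipped.
    return [len(s.split()) for s in sentences if s.strip()]

# Example: quoting a passage preserves its sentence-length subsequence.
print(feature_string("The quick brown fox, it is said, jumps over the lazy dog."))
# -> [4, 3, 5]
```

Because a quoted passage keeps its sentence lengths even when surrounding text changes, a copied region shows up as a shared numeric subsequence in two feature strings.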

4.2. Duplicate detection

In this process, we index feature strings into a feature database and query the database to detect duplicate content. There are two primary parameters in both the index and query procedures: the window size, WS, and the sliding step width, SW. The window size is the length of a numeric sub-sequence and is held constant in our approach; such a fixed-length sub-sequence is called a feature vector. When indexing or querying with a feature string, a fixed-length window at the beginning of the string extracts the first feature vector. The window then slides SW elements forward to extract the next feature vector, and so on until the end of the string is reached. The extracted feature vectors, together with the current document ID, are then inserted into the feature database or used to query it.

If SW > 1, we refer to the window as a jumping window. If SW remains constant during the index process, we call it static jumping; otherwise, we call it dynamic jumping, which is proposed to further reduce the number of records in the feature database. The idea of dynamic jumping is to eliminate information repeated in adjacent feature vectors in the extracted order: if the leading SW ordered elements at the current window position are identical to those of both the previous and the following windows, we discard the current window and move to the next step. With this approach, we can detect near-duplicates and partial duplicates in addition to exact duplicates.
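A minimal sketch of the window extraction, operating on the feature string from Section 4.1 and following our reading of the dynamic jumping rule described above (the authors' implementation may differ):

```python
def extract_vectors(feature, ws, sw, dynamic=False):
    """Slide a window of size `ws` over a feature string in steps of `sw`,
    yielding fixed-length feature vectors."""
    positions = range(0, len(feature) - ws + 1, sw)
    vectors = [tuple(feature[p:p + ws]) for p in positions]
    if not dynamic:
        return vectors
    # Dynamic jumping: drop a window whose leading `sw` elements are
    # identical to those of both its neighbours (our interpretation).
    kept = []
    for i, v in enumerate(vectors):
        if 0 < i < len(vectors) - 1 and \
           v[:sw] == vectors[i - 1][:sw] == vectors[i + 1][:sw]:
            continue
        kept.append(v)
    return kept

# Indexing then associates each vector with the current document ID,
# e.g. index.setdefault(vector, set()).add(doc_id).
```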


4.3. Duplicate set generation

To deal with document streams, we modify the association generation method slightly. Instead of scanning the whole database after indexing has finished, we search for each feature vector as it is indexed. This modification is not too expensive, since the insertion operation must in any case locate the proper position in the database to store the record by its key (the feature vector). If the feature vector being indexed already exists in the feature database, the documents indexed with this vector and the current document form a duplicate set. The frequency of a set denotes how many feature vectors are shared by the documents in it; therefore, the higher the frequency of a set, the greater the likelihood that the documents in the set are event-relevant. We maintain another database that records the frequency of each duplicate set.

Each time the feature vectors of a new document are indexed, the document may be added to an existing duplicate set if some of its feature vectors are common to that set. We call the original set the base set and the extended set the derived set. If a document is added to a duplicate set in this way, the frequency of the base set is reduced by the frequency of the derived set. To filter out candidate sets that do not occur frequently enough, we use a pre-defined output threshold to control the output quality. With this clustering algorithm, we only need to check new sets in the duplicate set database, rather than scanning the whole feature database, after processing new documents.
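The following sketch shows one way to realize this incremental clustering (a simplified reading of Sections 4.2 and 4.3; the base/derived frequency adjustment is omitted for brevity, and all names are ours):

```python
from collections import defaultdict

# feature_index maps a feature vector to the set of document IDs that
# contain it; set_frequency counts the shared vectors per document group.
feature_index = defaultdict(set)
set_frequency = defaultdict(int)

def index_document(doc_id, vectors, threshold):
    """Index a new document's feature vectors and return the duplicate
    sets (document groups) whose frequency reaches the output threshold."""
    for v in vectors:
        if feature_index[v]:
            # Documents already indexed under v form a duplicate set
            # with the current document; count the shared vector.
            group = frozenset(feature_index[v] | {doc_id})
            set_frequency[group] += 1
        feature_index[v].add(doc_id)
    # Only sets that occur frequently enough are reported.
    return {g: f for g, f in set_frequency.items()
            if doc_id in g and f >= threshold}
```

Only groups touched by the current document need to be re-examined, which is the point of checking new sets instead of rescanning the whole feature database.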

5. Experiments

Two experiments were conducted to evaluate the performance of the proposed feature for duplicate detection and of the clustering algorithm, respectively. Both experiments were carried out on the same desktop PC with an Intel Pentium 4 3.0 GHz CPU and 2 GB of DDR2 RAM. The archive was downloaded from http://blog.myspace.com, one of the most popular blog sites. It contained 3.34 GB of content in 55,430 documents, and the major language used in the archive is English.

5.1. Parameter fine-tuning

The first experiment evaluated the performance of the proposed feature under various parameter configurations. A subset of 5,000 distinct documents from the corpus was prepared for this experiment. There are no exact duplicates in the subset, although it might contain near- or partial-duplicate documents. Ten assistants were asked to compose five test articles each. For each article, the assistant first picked 5 distinct documents from the subset archive, and then copied and pasted parts of those documents to generate the test article. Assistants were allowed to modify the layout style and add their own comments. Once a document had been chosen for one test article, it could not be chosen again by the same assistant. This procedure simulates the quoting and editing operations of bloggers. In total, 50 test queries were generated. We included most punctuation marks, such as {, ; " . ? !}, in the delimiter set.

The configurations and efficiency measurements are listed in Table 1, in which DJ indicates the dynamic jumping scheme. The effectiveness measurements for the 50 test queries are listed in Table 2; note that the search time in Table 2 is the total time required to process all 50 queries. According to the experimental results, the dynamic jumping scheme reduces the number of indexed records, the size of the feature database, and the search time. Generally speaking, a smaller window size yields a higher recall rate. Although a larger window size increases precision, the size of the index also grows, and the robustness of the duplicate detection algorithm deteriorates.
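For reference, the macro- and micro-averaged figures reported in Tables 2 and 3 can be computed as in this sketch of the standard definitions (the paper itself does not give the formulas):

```python
def macro_micro_precision(results):
    """`results` is a list of (retrieved, relevant_retrieved) counts,
    one pair per query. Macro-averaging averages the per-query precision;
    micro-averaging pools all counts before dividing."""
    per_query = [rel / ret for ret, rel in results if ret > 0]
    macro = sum(per_query) / len(per_query)
    micro = sum(rel for _, rel in results) / sum(ret for ret, _ in results)
    return macro, micro

# Example with two queries: (10 retrieved, 9 relevant), (5 retrieved, 5 relevant)
print(macro_micro_precision([(10, 9), (5, 5)]))  # -> (0.95, 0.9333...)
```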


Table 1. Efficiency measurements for different parameter configurations

Notation      WS  SW  DJ    Index number  Index size (KB)  Index time (sec.)
WS08SW1        8   1  No         936,339           30,216                328
WS08SW1wDJ     8   1  Yes        894,203           29,736                340
WS08SW2        8   2  No         474,595           15,368                327
WS08SW4        8   4  No         242,025            8,184                316
WS16SW1       16   1  No         924,469           39,240                336
WS16SW1wDJ    16   1  Yes        870,357           37,568                335
WS16SW2       16   2  No         466,361           20,360                322
WS32SW1       32   1  No         872,566           56,328                336


Table 2. Effectiveness measurements of 50 queries

Notation      Search time (sec.)  Average recall  Macro-avg. Prec.  Macro-avg. F1  Micro-avg. Prec.  Micro-avg. F1
WS08SW1                    61.41            1               0.919          0.958             0.9            0.947
WS08SW1wDJ                 56.16            1               0.938          0.968             0.931          0.964
WS08SW2                     0.155           1               0.919          0.958             0.9            0.947
WS08SW4                     0.08            0.98            0.952          0.966             0.946          0.963
WS16SW1                     0.545           0.88            1              0.936             1              0.936
WS16SW1wDJ                  0.395           0.88            1              0.936             1              0.936
WS16SW2                     0.16            0.88            1              0.936             1              0.936
WS32SW1                    39.38            0.38            0.9            0.534             1              0.551

5.2. Clustering evaluation

In this experiment, we processed the whole archive. To balance the efficiency and effectiveness of the proposed approach, we used the parameter configuration WS16SW1wDJ described in Section 5.1, with the output threshold set to 5. Since it is difficult to find all duplicates in the archive manually, only the precision measure was used to evaluate the effectiveness of the proposed algorithm. The high precision rates in Table 3 show that the approach can help find duplicate sets in document streams.

Table 3. Performance of the clustering algorithm

Pre-processing time       7,661 sec.    Number of output sets    8,840
Conversion time            9.36 sec.    Macro-average prec.      0.946
Indexing and clustering   6,565 sec.    Micro-average prec.      0.904

6. Conclusion

Event-relevant content is common in news articles and blogs, and conventional clustering and categorization approaches may not be directly applicable to such documents. In this paper, we have proposed a near-duplicate detection approach for finding event-relevant content. Our experiments demonstrate the efficiency of the approach and the effectiveness of its duplicate set generation.

7. References

[1] Allan, J., Papka, R., and Lavrenko, V. On-line new event detection and tracking. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 1998), 1998, 37-45.
[2] Brin, S., Davis, J., and Garcia-Molina, H. Copy detection mechanisms for digital documents. In Proceedings of the 1995 ACM International Conference on Management of Data (SIGMOD 1995), 1995, 398-409.
[3] Henzinger, M. Finding near-duplicate Web pages: A large-scale evaluation of algorithms. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2006), 2006, 284-291.
[4] Hoad, T. and Zobel, J. Methods for identifying versioned and plagiarized documents. Journal of the American Society for Information Science and Technology, 54(3), 203-215, 2003.
[5] Manku, G. S., Jain, A., and Sarma, A. D. Detecting near-duplicates for web crawling. In Proceedings of the 16th International Conference on World Wide Web (WWW 2007), 2007, 141-150.
[6] Schleimer, S., Wilkerson, D., and Aiken, A. Winnowing: Local algorithms for document fingerprinting. In Proceedings of the 2003 ACM International Conference on Management of Data (SIGMOD 2003), 2003, 76-85.
[7] Shivakumar, N. and Garcia-Molina, H. SCAM: A copy detection mechanism for digital documents. In Proceedings of the 2nd International Conference on Theory and Practice of Digital Libraries (DL 1995), 1995. Available at: http://csdl.tamu.edu/DL95/papers/shivakumar.ps
[8] Yang, H. and Callan, J. Near-duplicate detection by instance-level constrained clustering. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2006), 2006, 421-428.
[9] Yang, Y., Carbonell, J., Allan, J., and Yamron, J. Topic detection and tracking: Detection task. In Proceedings of the Workshop on Topic Detection and Tracking, 1997.
[10] Yang, Y., Pierce, T., and Carbonell, J. A study on retrospective and on-line event detection. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 1998), 1998, 28-36.
