A systematic study of parameter correlations in large scale duplicate document detection

Shaozhi Ye (1), Ji-Rong Wen (2), Wei-Ying Ma (2)

(1) Department of Computer Science, University of California, Davis
(2) Microsoft Research Asia

The 10th Pacific-Asia Conference on Knowledge Discovery and Data Mining, April 9-12, Singapore

S. Ye, J. Wen and W. Ma (UCD & MSRA)

PAKDD 2006

1 / 25

Outline

1. Motivation
2. Prior Work
   - Shingle Based Algorithms
   - Term Based Algorithms
3. Experiments
   - Data Description
   - Implementation Issues
   - Results
   - Parameter Correlations Summary
4. Adaptive Sampling Strategy
   - Adaptive Sampling
   - Experimental Results


1. Motivation

Duplicate pages and mirrored web sites are widespread on the Web.
- More than 250 sites mirrored the Linux Documentation Project.
- 10% of hosts are mirrored to various extents [Bharat & Broder, 1999].
- 5.5% of result entries for popular queries are duplicated in major search engines [Ye et al., 2004].

It is important to detect duplicate and near-duplicate documents for crawling, ranking, clustering, archiving, caching, and so on.


DDD: Duplicate Document Detection

- The tremendous volume of web pages challenges DDD algorithms.
- Much work has been done on both DDD algorithms and applications.
- Little has been explored about the factors affecting DDD performance and scalability.


2.1 Shingle Based Algorithms

Definition: A shingle is a sequence of contiguous terms in a document.

1. Each document is divided into multiple shingles.
   “The ones we don’t know we don’t know” → {the ones we}, {ones we don’t}, {we don’t know}, {don’t know we}, {know we don’t}
2. A hash value is assigned to each shingle.
   {the ones we} → 1cb888794a0ed3d9e9989093d0e353b4
   {ones we don’t} → 20dc9a35cd0cbbf895272744b100278a
   ···
3. The resemblance of two documents is then calculated from the number of shingles they share.
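The shingling and hashing steps can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `shingles` and `shingle_hash`, the lowercasing and whitespace tokenization, and the choice of k = 3 (taken from the example above) are assumptions. MD5 is the hash the slides report using, but the digests printed here need not match the ones on the slide, since the exact input encoding is not given.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-term shingles of `text` (lowercased, whitespace-tokenized)."""
    terms = text.lower().split()
    return {" ".join(terms[i:i + k]) for i in range(len(terms) - k + 1)}


def shingle_hash(shingle):
    """Hex MD5 digest of one shingle (UTF-8 encoded); the encoding is an assumption."""
    return hashlib.md5(shingle.encode("utf-8")).hexdigest()


doc = "The ones we don't know we don't know"
for s in sorted(shingles(doc)):
    print(s, "->", shingle_hash(s))
```

Because the shingles are collected in a set, the repeated shingle "we don't know" is counted once, which is why the example sentence yields five shingles rather than six.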


Document Similarity Metric [Broder et al., 1997]

Definition: The resemblance r of two documents A and B is defined as

\[ r(A, B) = \frac{|S(A) \cap S(B)|}{|S(A) \cup S(B)|} \tag{1} \]

where S(D) is the set of shingles of document D and |S| is the number of elements in the set S.

Pairwise comparison → N^2
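Equation (1) is the Jaccard coefficient of the two shingle sets. A small sketch, assuming shingles are held in Python sets (the function name `resemblance` and the value returned for two empty documents are illustrative choices, not from the slides):

```python
def resemblance(shingles_a, shingles_b):
    """Resemblance r(A, B) = |S(A) ∩ S(B)| / |S(A) ∪ S(B)| for two shingle sets."""
    union = shingles_a | shingles_b
    if not union:
        return 0.0  # convention assumed here for two empty documents
    return len(shingles_a & shingles_b) / len(union)


a = {"the ones we", "ones we don't", "we don't know"}
b = {"ones we don't", "we don't know", "don't know we"}
print(resemblance(a, b))  # 2 shared shingles out of 4 distinct ones
```

Evaluating this function for every pair of N documents takes N(N-1)/2 calls, which is the quadratic cost noted above.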


Clustering Algorithm: Beating N^2 Complexity

Merge sort → N log(N/M)

1. Get all the shingles for each document → kN
2. Sort pairs → kN log(kN/M)
3. Get list: expand, divide, sort, merge → kN log(kN/M)
4. Scan list → ???
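One plausible reading of these four steps, sketched in memory: sort (shingle, document) pairs so that equal shingles become adjacent, expand each shingle's document list into document pairs, count shared shingles per pair, and then scan the counts. The function name `cluster`, the union-find merge in the last step, and the in-memory sort are assumptions made for illustration. The slides use an external sort for data that does not fit in memory, and they do not spell out the contents of the pairs or the list.

```python
from collections import Counter
from itertools import combinations, groupby


def cluster(doc_shingles, threshold=0.5):
    """Group documents whose resemblance is at least `threshold`.

    doc_shingles: dict mapping a document ID to its set of shingle hashes.
    """
    # Steps 1-2: emit (shingle, doc) pairs and sort so equal shingles are adjacent.
    pairs = sorted((s, d) for d, sh in doc_shingles.items() for s in sh)

    # Step 3: expand each shingle's document list into document pairs and
    # count how many shingles each pair shares.
    shared = Counter()
    for _, group in groupby(pairs, key=lambda p: p[0]):
        docs = sorted(d for _, d in group)
        shared.update(combinations(docs, 2))

    # Step 4: scan the counts and merge pairs at or above the threshold (union-find).
    parent = {d: d for d in doc_shingles}

    def find(d):
        while parent[d] != d:
            parent[d] = parent[parent[d]]  # path halving
            d = parent[d]
        return d

    for (a, b), common in shared.items():
        union = len(doc_shingles[a]) + len(doc_shingles[b]) - common
        if common / union >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for d in doc_shingles:
        clusters.setdefault(find(d), set()).add(d)
    return sorted(sorted(c) for c in clusters.values())


docs = {"a": {1, 2, 3, 4}, "b": {2, 3, 4, 5}, "c": {9, 10}}
print(cluster(docs, threshold=0.5))
```

Only document pairs that share at least one shingle ever appear in the counts, so unrelated pairs are never compared. A shingle that occurs in very many documents still expands into many pairs, which is one reason the cost of the final scan is hard to bound.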


Parameters used in Prior Work

| Work | Volume of Document Set | Shingling Strategy | Hash Function | Similarity Threshold |
|---|---|---|---|---|
| Broder97 | 30M | 10-gram | 40-bit | 0.5 |
| Shivakumar98, Cho00 | 24M, 25M | entire document, two or four lines | 32-bit | 25 or 15 shingles in common |
| Fetterly03 | 150M | 5-gram | 64-bit | two supershingles in common |

| Work | Sampling Ratio/Strategy |
|---|---|
| Broder97 | 1/25 and at most 400 shingles per document |
| Shivakumar98 & Cho00 | Line based shingling |
| Fetterly03 | 14 shingles per supershingle, six supershingles per document |

No formal evaluation is provided for their parameter or tradeoff choices.
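To make the sampling parameters concrete, here is one common way to realize a ratio of 1/m with a per-document cap: keep only the shingles whose hash is 0 modulo m, then keep at most a fixed number of those. The function name `sample_shingles`, the use of MD5, and the rule of keeping the smallest hashes under the cap are assumptions for illustration. The slides do not say how each cited work implements its sampling.

```python
import hashlib


def sample_shingles(shingles, m=25, cap=400):
    """Keep shingle hashes that are 0 mod m, then at most `cap` of the smallest."""
    hashed = sorted(int(hashlib.md5(s.encode("utf-8")).hexdigest(), 16) for s in shingles)
    kept = [h for h in hashed if h % m == 0]
    return kept[:cap]


many = {f"shingle number {i}" for i in range(10_000)}
sample = sample_shingles(many)
print(len(many), "shingles ->", len(sample), "sampled")
```

Because the decision depends only on the hash value, a given shingle is kept or dropped consistently in every document, so shared shingles remain shared after sampling.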


2.2 Term Based Algorithms

- Use individual terms/words as the basic unit, instead of contiguous k-gram shingles.
- Many IR techniques are used, especially feature selection.
- Work well for small-scale IR systems and online DDD.
- Too complex for large scale DDD.

We focus on shingle based, offline DDD algorithms in this paper.


3.1 Data Description

TREC .Gov Collection:

| HTML Documents | Total Size | Average Document Size | Average Words per Document |
|---|---|---|---|
| 1,053,034 | 12.9 GB | 13.2 KB | 699 |

Divided into 11 groups based on size:

| Group | Words in Document | Number of Documents | Shingles in Group |
|---|---|---|---|
| 0 | 0-500 | 651,983 | 118,247,397 |
| 1 | 500-1000 | 153,741 | 105,876,410 |
| 2 | 1000-2000 | 78,590 | 107,785,579 |
| 3 | 2000-3000 | 28,917 | 69,980,491 |
| 4 | 3000-4000 | 14,669 | 50,329,605 |
| 5 | 4000-5000 | 8,808 | 39,165,329 |
| 6 | 5000-6000 | 5,636 | 30,760,394 |
| 7 | 6000-7000 | 3,833 | 24,750,365 |
| 8 | 7000-8000 | 2,790 | 20,796,424 |
| 9 | 8000-9000 | 1,983 | 16,770,544 |
| 10 | >9000 | 7,775 | 93,564,410 |


Figure: Document Size Distribution (long tailed).

3.2 Implementation Issues

Hash Function
- Previous work: 30-bit and 40-bit Rabin hashes → 0.5 probability of a collision among 2^20 (about one million) hash values.
- Our work: MD5 → 0.5 probability of a collision among 2^64 hash values.

No-sampling baseline → very time consuming
- Remove exact duplicates first.
- External sort is used: BerkeleyDB.
- Two weeks to run 400 trials with two 3 GHz Xeon CPUs, 4 GB of memory, and SCSI disks.
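The collision figures follow from the standard birthday approximation: for a uniformly random b-bit hash, the number of values needed to reach collision probability p is about sqrt(2 · 2^b · ln(1/(1-p))). The sketch below evaluates it for the hash widths mentioned above. The function name `items_for_collision` is illustrative, and real hash functions are only approximately uniform.

```python
import math


def items_for_collision(bits, p=0.5):
    """Approximate count of random `bits`-bit hashes needed for collision probability p."""
    return math.sqrt(2 * (2 ** bits) * math.log(1 / (1 - p)))


for bits in (30, 40, 128):
    n = items_for_collision(bits)
    print(f"{bits}-bit hash: about 2^{math.log2(n):.1f} values")
```

Under this approximation the 2^20 figure corresponds to the 40-bit case and the 2^64 figure to 128-bit MD5. A 30-bit hash reaches the same collision probability with far fewer values.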


3.3 Results

Figure: Precision with Different Similarity Thresholds. Sampling Ratio: 1/4. x-axis: Similarity Threshold (0.5 to 1); y-axis: Precision (0 to 1); one curve per document size group (<500, 500-1000, 1000-2000, 2000-3000, 3000-4000, 4000-5000, 5000-6000, 6000-7000, 7000-8000, 8000-9000, >9000 words).

Figure: Precision with Different Similarity Thresholds. Sampling Ratio: 1/16. x-axis: Similarity Threshold (0.5 to 1); y-axis: Precision (0 to 1); one curve per document size group (<500, 500-1000, 1000-2000, 2000-3000, 3000-4000, 4000-5000, 5000-6000, 6000-7000, 7000-8000, 8000-9000, >9000 words).

3.4 Parameter Correlations Summary

- Similarity Threshold: precision drops as the similarity threshold increases, especially when the threshold is higher than 0.9.
- Sampling Ratio: precision drops as the sampling ratio decreases, especially for small documents containing fewer than 500 words.
- Document Size: small documents are more sensitive to similarity threshold and sampling ratio than large documents.
- Recall: sampling ratio does not hurt recall because sampling only generates false positives.
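Precision and recall here can be read as comparisons between the duplicate pairs found with sampling and those found by the no-sampling baseline. A minimal sketch under that assumption (the function name `precision_recall` and the toy pairs are illustrative, not data from the experiments):

```python
def precision_recall(detected, truth):
    """Precision and recall of `detected` duplicate pairs against baseline `truth` pairs."""
    true_pos = len(detected & truth)
    precision = true_pos / len(detected) if detected else 1.0
    recall = true_pos / len(truth) if truth else 1.0
    return precision, recall


truth = {("d1", "d2"), ("d3", "d4")}                    # pairs found without sampling
detected = {("d1", "d2"), ("d3", "d4"), ("d5", "d6")}   # pairs found with sampling
precision, recall = precision_recall(detected, truth)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case the sampled run reports one extra pair, so precision falls while recall stays at 1, which is the pattern the recall bullet describes.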


4.1 Adaptive Sampling Strategy

Key Idea: apply a small sampling ratio to large documents and a large sampling ratio to small documents.

Long Tailed Distribution:
- In our data set, 68% of the documents have fewer than 500 words, but they contribute only 17% of the shingles.
- Applying a small sampling ratio to large documents greatly reduces the total number of shingles.

Experimental Result: with a required precision of 0.8 and a similarity threshold of 0.6, only 8% of the total shingles have to be processed.
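The 68% and 17% figures can be reproduced from the per-group counts in the data description table, and the same counts show how per-group sampling ratios translate into a fraction of shingles processed. In the sketch below the counts are copied from that table, while the list `ratios` is purely hypothetical: the slides do not list the ratios chosen for each group, so the final number printed is not the reported 8%.

```python
# Per-group counts copied from the data description table (groups 0 to 10).
DOCS = [651_983, 153_741, 78_590, 28_917, 14_669, 8_808, 5_636, 3_833, 2_790, 1_983, 7_775]
SHINGLES = [118_247_397, 105_876_410, 107_785_579, 69_980_491, 50_329_605, 39_165_329,
            30_760_394, 24_750_365, 20_796_424, 16_770_544, 93_564_410]


def share(values, index=0):
    """Fraction of the total contributed by one group."""
    return values[index] / sum(values)


def shingles_processed(ratios):
    """Fraction of all shingles kept when group i is sampled at ratios[i]."""
    kept = sum(r * s for r, s in zip(ratios, SHINGLES))
    return kept / sum(SHINGLES)


print(f"documents under 500 words: {share(DOCS):.0%}")
print(f"their share of shingles:   {share(SHINGLES):.0%}")

# Hypothetical ratios: sample small documents densely, large ones sparsely.
ratios = [1 / 2, 1 / 4, 1 / 8] + [1 / 16] * 8
print(f"shingles processed: {shingles_processed(ratios):.1%}")
```

Since the small-document group holds most of the documents but a small share of the shingles, giving it a high ratio costs little, and the savings come from the sparse sampling of the larger groups.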


4.2 Experimental Results

Figure: Adaptive Sampling with Different Precision Thresholds. x-axis: Similarity Threshold (0.5 to 1); y-axis: Percentage of Shingles (%), 0 to 100; one curve per required precision (0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99).

Summary

- A large sampling ratio is required for high precision, especially when the required precision is higher than 0.9.
- A small sampling ratio hurts the precision of DDD.
- Small documents make up a major fraction of the whole Web.
- The adaptive sampling strategy greatly reduces the sampling ratio of documents, making DDD faster and more scalable on large document sets.


Questions?

Thank you!


Backup Slides


Figure: Recall with Different Similarity Thresholds. Sampling Ratio: 1/16. x-axis: Similarity Threshold (0.5 to 1); y-axis: Recall (0 to 1); one curve per document size group (<500, 500-1000, 1000-2000, 2000-3000, 3000-4000, 4000-5000, 5000-6000, 6000-7000, 7000-8000, 8000-9000, >9000 words).
