Microscale Evolution of Web Pages

Carrie Grimes
Google, 345 Spear St, San Francisco, CA 94105
[email protected]

Sean O'Brien
Google, 345 Spear St, San Francisco, CA 94105
[email protected]

ABSTRACT
We track a large set of “rapidly” changing web pages and examine the assumption that the arrival of content changes follows a Poisson process on a microscale. We demonstrate that there are significant differences in the behavior of pages that can be exploited to maintain freshness in a web corpus.

Categories and Subject Descriptors
H.4.m [Information Systems]: Miscellaneous; H.3.3 [Information Search and Retrieval]: Search Process

General Terms
Measurement, Experimentation, Algorithms

Keywords
Web evolution, rate of change, change detection

1. INTRODUCTION

Search engines crawl the web to download a corpus of web pages to index for user queries. One of the most efficient ways to maintain an up-to-date corpus for a search engine to index is to re-crawl pages preferentially based on their rate of content update [3]. Much of the existing work on estimating expected rates of change has assumed that changes arrive as a Poisson process, and that the average rate of change can be estimated under this model. Given the rate at which rapidly changing pages need to be crawled, these pages carry high costs, both in crawling resources and in corpus consistency with the live web. In this paper, we ask whether the Poisson model truly represents the changes of rapidly updating pages and, if not, what a better understanding of the real structure of page changes can gain us in acquiring fresh content.

2. METHODOLOGY

Definitions: For each page, the number of updates that occur in a single hour, X, is distributed as a Poisson random variable with parameter λ = 1/∆. The average time between changes is ∆, and the time between page updates is distributed exponentially with parameter 1/∆.
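To make the model concrete, the following minimal sketch (Python with NumPy; all names are ours, not from the paper) simulates a page under these assumptions: counts of updates per hour are Poisson(1/∆), and waiting times between updates are exponential with mean ∆.

    import numpy as np

    rng = np.random.default_rng(0)
    delta = 4.0  # assumed mean time between changes, in hours

    # Under the Poisson model, the number of updates in one hour is
    # Poisson with rate lambda = 1/delta ...
    updates_per_hour = rng.poisson(lam=1.0 / delta, size=10_000)

    # ... and the waiting time between consecutive updates is
    # exponential with mean delta.
    wait_times = rng.exponential(scale=delta, size=10_000)

    print(updates_per_hour.mean())  # close to 1/delta = 0.25
    print(wait_times.mean())        # close to delta = 4.0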

Defining a ‘Change’: We employ the simhash technique, outlined by Charikar [1], which creates a fingerprint-like representation of the page but has the unique benefit that pages with similar content have similar simhash values. Distance between simhashes can be measured by the number of bits in which they differ; for this study we consider versions of a page whose simhashes differ in 6 or more bits to be changed.

Computing Rates of Change: Given a history of reported “changes,” measured on any regular interval of length C, the simple estimator divides the total time observed by the total number of changes observed. That simple estimator of the rate of change, ∆, is $\hat{\Delta} = T/X$, where T = total time and X = total number of changes. However, if the time between crawls of the page is remotely similar to the rate of change of the page, this estimate is significantly asymptotically biased: if more than one change occurs during an interval of length C, the crawler will only observe a single change. As a result, if C is too large compared to ∆, no matter how many observations are taken, the estimate will always overestimate the length of time between changes. Cho and Garcia-Molina [2], Section 4.2, reduce the asymptotic bias of the simple estimator by subtracting off the expected bias and making a small modification to remove singularities. The modified estimator,

$$\hat{\Delta}^* = -\frac{1}{\log\left(\frac{T - X + 0.5}{T + 0.5}\right)} \qquad (1)$$

is demonstrated to have significantly better bias, even for small samples (N < 50 observations), especially when C/∆ is in the neighborhood of 0.5–2.0. Although our page crawl interval granularity will be quite small compared to the total space of possible rates of change, the pages we are examining have rates of change on the order of 1–2 days, and therefore the shrinkage of the estimator given by Cho and Garcia-Molina makes a critical difference in the values. However, if the samples are significantly non-Poisson, the asymptotic results for this estimator do not apply. For this reason, we will compute both $\hat{\Delta}$ and the estimator $\hat{\Delta}^*$.

Sampling Web Pages: We use a multi-staged process to generate our sample. We first selected hosts according to the number of URLs they had in a feed from the Google crawl containing over 1B documents. From each host we sampled a number of URLs, crawled them twice daily, and down-sampled to URLs with an average rate of change of less than 48 hours. This left us with 29,000 URLs, which we crawled every hour.
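As a concrete illustration of the two estimators above, here is a small sketch (our own Python, not the authors' code) that computes both $\hat{\Delta} = T/X$ and the bias-corrected $\hat{\Delta}^*$ of (1) from a boolean history of regularly spaced change observations.

    import math

    def estimate_delta(changed):
        """Estimate the mean time between changes from a change history.

        changed[i] is True if a change was detected at the i-th crawl,
        with crawls taken one time unit apart. Returns the simple and
        bias-corrected estimates of Delta, in the same time unit.
        """
        T = len(changed)  # total observation time, in crawl intervals
        X = sum(changed)  # number of intervals in which a change was seen
        if X == 0:
            return float("inf"), float("inf")  # no changes observed
        simple = T / X
        # Cho and Garcia-Molina's correction, equation (1):
        #   Delta* = -1 / log((T - X + 0.5) / (T + 0.5))
        modified = -1.0 / math.log((T - X + 0.5) / (T + 0.5))
        return simple, modified

    # Example: 48 hourly crawls with a change detected on every 4th crawl.
    history = [i % 4 == 3 for i in range(48)]
    print(estimate_delta(history))  # (4.0, ~3.5): the correction shrinks the
                                    # estimate toward the true, faster rate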

Figure 1: Heatmap of observed interval frequencies given average observed rate of change of the page. Red = low intensity, Yellow = high intensity.

Figure 2: Comparison of the observed interval frequencies for pages with $\hat{\Delta} = 24$ with the number predicted by an exponentially-distributed waiting time between updates.

Of those, 80% had at least 500 consecutive successful fetches; this is the set of URLs we examine in this paper. Every time a page is accessed in our sample, we compute a simhash of the page’s content as described in Charikar [1], and consider the page changed if the current simhash differs from the previous one by 6 or more bits.
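The change test itself reduces to a Hamming distance between fingerprints. A minimal sketch in Python (assuming 64-bit simhash values have already been computed; the 6-bit threshold is the one used in this study, and the fingerprints below are hypothetical):

    def hamming_distance(a: int, b: int) -> int:
        """Number of bit positions in which two fingerprints differ."""
        return bin(a ^ b).count("1")

    def page_changed(prev: int, curr: int, threshold: int = 6) -> bool:
        """Apply the study's change rule: 6 or more differing bits."""
        return hamming_distance(prev, curr) >= threshold

    # Two hypothetical 64-bit simhash values, 3 bits apart:
    old, new = 0x9F2C4A10DEADBEEF, 0x9F2C4A10DEADB0EF
    print(hamming_distance(old, new))  # 3
    print(page_changed(old, new))      # False: below the 6-bit threshold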

3. DISTRIBUTION OF CONTENT CHANGES

Figure 3: Arrival times of page updates over one week, with respect to PST clock times.

Content Change-rate Profiles: We begin by examining the overall distribution of rates of change given by this sample. In our data, only a very small portion (< 5%) of the sample changes by 6 bits or more every time we access it. However, applying the modified estimator $\hat{\Delta}^*$ in (1), we estimate that up to 25% of our fast-changing sample has an average rate of change less than or equal to one hour. Over 50% of pages in the sample have an estimated rate of change of less than 4 hours. The primary differences between the simple estimator ($\hat{\Delta}$) and the modified estimator ($\hat{\Delta}^*$) occur in the fastest-changing bins, because those are the most sensitive to the censoring that remains in our hour-granularity sample.

Pages with Regular Updates: Intuitively, many pages should show much more regular behavior than the Poisson model dictates, due to automated updating of sites on an hourly or daily basis. Figure 1 is a heatmap of all actual between-change intervals observed, plotted by the overall average rate of change observed for the page. The high-frequency bins are concentrated around the fastest-changing pages at the lower left corner. However, there is an additional bright spot at 24-hour observed intervals: there are significant numbers of pages with an average rate of change near 24 hours and a large number of changes at exactly 24 hours. Figure 2 illustrates this effect specifically for pages with $\hat{\Delta} = 24$. The points plotted in Figure 2 show the observed number of change intervals of each length, and the lighter line shows the number predicted by a model where observed times between changes are distributed Exp(1/24).

Temporal Effects in Content Updates: If humans, or machines configured by humans, are the cause of page updates, those updates should have a clear temporal association. Within our “rapid” change sample, we divided the pages into region associations based on the top-level domain of the page. The data was aggregated into two major groups: CJK = {.cn, .jp, .kr} pages, and FIGS = {.fr, .it, .de, .es} pages. Plotting the arrival times of observed updates against the day (with respect to Pacific Standard Time) in Figure 3, we see that there is a significant decrease in probability

of a page change between local daytime and local nighttime, and an even more significant decrease in update volume on the weekend. The graph is smoothed with a 5-hour window to reduce spikiness. This graph suggests that resources for refreshing pages should be prioritized to run directly after the period of highest local page-update volume.
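To reproduce the kind of comparison shown in Figure 2, one could bin the observed between-change intervals for pages with $\hat{\Delta} = 24$ and overlay the counts an Exp(1/24) waiting-time model would predict. A rough sketch under that assumption (Python/NumPy; the intervals argument stands in for measurements we do not have here):

    import numpy as np

    def observed_vs_exponential(intervals, mean_hours=24.0, max_hours=72):
        """Bin observed between-change intervals into 1-hour bins and
        compute the counts predicted by an Exp(1/mean_hours) model."""
        intervals = np.asarray(intervals, dtype=float)
        bins = np.arange(0, max_hours + 1)  # 1-hour bin edges
        observed, _ = np.histogram(intervals, bins=bins)
        # P(waiting time falls in [t, t+1)) under the exponential model:
        t = bins[:-1]
        p = np.exp(-t / mean_hours) - np.exp(-(t + 1) / mean_hours)
        predicted = len(intervals) * p
        return observed, predicted

    # Toy usage: a page that updates at exactly 24 hours produces a spike
    # at the 24-hour bin that the smooth exponential prediction lacks.
    rng = np.random.default_rng(1)
    fake = np.concatenate([np.full(200, 24.0), rng.exponential(24.0, 300)])
    obs, pred = observed_vs_exponential(fake)
    print(obs[24], round(pred[24], 1))  # large spike vs. ~7.5 predicted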

4. CONCLUSIONS

The case for an aggregate Poisson model for these fast-changing pages is somewhat inconclusive: relatively few pages in our sample were strictly consistent with a Poisson model, but only a small portion differed significantly. We do show several effects that can be exploited to improve refresh performance on fast-changing pages: change volume varies with local time of day and day of the week, and fast-updating pages are biased toward hourly and daily changes. One dominating question in this work is whether the large component (nearly 20%) of pages that are more consistent with a 1-hour regular change pattern than with a Poisson process are updating content useful to the user.

5. REFERENCES

[1] M. S. Charikar. Similarity estimation techniques from rounding algorithms. In STOC ’02: Proceedings of the Thirty-Fourth Annual ACM Symposium on Theory of Computing, pages 380–388, New York, NY, USA, 2002. ACM Press.
[2] J. Cho and H. Garcia-Molina. Estimating frequency of change. ACM Trans. Inter. Tech., 3(3):256–290, 2003.
[3] J. Cho, H. García-Molina, and L. Page. Efficient crawling through URL ordering. Computer Networks and ISDN Systems, 30(1–7):161–172, 1998.
