Evolution of the Chilean Web Structure Composition

Ricardo Baeza-Yates
Barbara Poblete

Center for Web Research
Dept. of Computer Science
University of Chile
E-mail: {rbaeza,bpoblete}@dcc.uchile.cl

Abstract

In this paper we present the evolution of the structure of the Chilean Web between 2000 and 2002. Our results show that, although the Web grows as expected, a significant part of it also disappears. In addition, some components are much more stable than others. We also compare the expected life cycle of a Web site within the structure with the observed data.

Year        Pages     Sites   Domains
2000      730.673    10.352     9.102
2001      794.218    21.207    19.389
2002    2.214.253    39.320    35.520

Table 1. TodoCL collections.

1. Introduction

The Web is highly dynamic and little is known about its evolution. There are models that predict when a page will change, but change behavior differs a lot from site to site. There are also generative models for Web growth, but they do not include Web death. In fact, new websites appear and others disappear, but little is known about how this happens. In this paper we present the evolution of the structure composition of the Chilean Web at the site and domain level, based on data gathered between 2000 and 2002 by TodoCL.cl, a search engine targeted at this Web domain. We define the Chilean Web as all the .cl sites plus all other sites found by crawling that have an IP address belonging to a Chilean ISP. In the first year the crawl started from an initial sample of sites, whereas in subsequent years it started with all .cl domains, thanks to NIC Chile. Hence, the number of unconnected sites was low the first year. Also, the last crawl contains more dynamic pages, which in general do not change the Web structure. Table 1 shows the data for these years.

Proceedings of the First Latin American Web Congress (LA-WEB 2003) 0-7695-2058-8/03 $17.00 © 2003 IEEE

Our results show how the structure evolves, how sites migrate from one component to another, and where sites appear and disappear. The changes are dramatic, which suggests that we may be studying a process that is still in a transient phase, or one that cannot be modeled in detail. This is a first step toward measuring and following the evolution of part of the Web structure, as well as toward understanding the process behind the changes. To the best of our knowledge there are no other studies on Web composition as specific as ours. Most statistical studies deal with global attributes such as language or size. We would have liked to separate the Chilean Web into commercial, educational, governmental, etc. sites, but Chile does not use a subdomain level indicating that, so the classification is not trivial. In section 2 we review the results on the structure of the Web. Section 3 shows the evolution of this structure, and section 4 analyzes the expected changes in the structure with respect to the typical life cycle of a Web site. The last section has some concluding remarks. A preliminary and short version of this paper was presented as a poster in [BYP03].

2. Web Structure

The most complete study of the Web structure [BKM 00] focuses on page connectivity. One problem with this is that a page is not a logical unit (for example, a page can describe several documents and one document can be stored in several pages). Hence, we decided to study how websites are connected, as websites are closer to being real logical units. Not surprisingly, we found in [BYC01] that the structure at the website level was similar to that of the global Web, and hence we use the same notation as [BKM 00]. The components are:

a) MAIN, sites that are in the strongly connected component of the connectivity graph of sites (that is, we can navigate from any site to any other site in the same component);

b) IN, sites that can reach MAIN but cannot be reached from MAIN;

c) OUT, sites that can be reached from MAIN, but from which there is no path back to MAIN; and

d) other sites that can be reached from IN (T.IN, where T is an abbreviation for tentacles), sites in paths between IN and OUT (TUNNEL), sites that only reach OUT (T.OUT), and unconnected sites (ISLANDS).

In [BYC01] we analyzed the data for 2000 and extended this notation by dividing the MAIN component into four parts:

a) MAIN-MAIN, sites that can be reached directly from the IN component and can directly reach the OUT component;

b) MAIN-IN, sites that can be reached directly from the IN component but are not in MAIN-MAIN;

c) MAIN-OUT, sites that can directly reach the OUT component but are not in MAIN-MAIN; and

d) MAIN-NORM, sites not belonging to the previously defined subcomponents.

Figure 1 shows all these components.

[Figure 1: diagram with the components MAIN (subdivided into MAIN-MAIN, MAIN-IN, MAIN-OUT, and MAIN-NORM), IN, OUT, TUNNEL, T.IN, T.OUT, and ISLANDS.]

Figure 1. Structure of the Web.

We also considered domains in our study, although domains may contain sites that are quite different. For example, an ISP may host many unrelated sites under a common second-level domain such as co.cl. In table 2 we give the relative size of each component. Notice the size of ISLANDS, which accounts for more than half of the Chilean Web sites (about 60% in 2001 and 54% in 2002). These sites are usually recent, and the main growth of the Web is in that component. The average update time of pages and sites, and their relation to structure and link ranking techniques, was studied in [BYSJC02] for the first two collections (2000 and 2001).

Component       Size (%) 2001   Size (%) 2002
MAIN                    9.25%          11.98%
IN                      5.84%           9.97%
OUT                    20.21%          17.15%
TUNNEL                  0.22%           0.23%
TENTACLE-IN             3.04%           3.11%
TENTACLE-OUT            1.68%           3.31%
ISLANDS                59.73%          54.21%
MAIN-MAIN               3.43%           4.08%
MAIN-OUT                2.49%           2.77%
MAIN-IN                 1.16%           2.24%
MAIN-NORM               2.15%           2.88%

Table 2. Relative size of the number of sites in the components of the Chilean Web.
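The component definitions above can be illustrated with a small sketch. It is not the procedure used to produce the tables in this paper; it only assumes a list of site names and a list of (source, target) links between sites, and it takes MAIN to be the largest strongly connected component. The function names `reachable`, `bow_tie`, and `split_main`, and the toy graph, are hypothetical.

```python
from collections import defaultdict


def reachable(adj, sources):
    """Return all nodes reachable from `sources` by following edges in `adj`."""
    seen = set(sources)
    stack = list(sources)
    while stack:
        node = stack.pop()
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def bow_tie(sites, links):
    """Label every site with its component; every link endpoint must be in `sites`."""
    fwd, bwd = defaultdict(set), defaultdict(set)
    for src, dst in links:
        fwd[src].add(dst)
        bwd[dst].add(src)

    # MAIN: largest strongly connected component. The SCC of a site is the
    # intersection of what it can reach and what can reach it (simple, not fast).
    main, assigned = set(), set()
    for site in sites:
        if site not in assigned:
            scc = reachable(fwd, [site]) & reachable(bwd, [site])
            assigned |= scc
            if len(scc) > len(main):
                main = scc

    in_ = reachable(bwd, main) - main     # can reach MAIN, not reachable from it
    out = reachable(fwd, main) - main     # reachable from MAIN, no way back
    core = main | in_ | out
    from_in = reachable(fwd, in_) - core  # reachable from IN, outside the core
    to_out = reachable(bwd, out) - core   # can reach OUT, outside the core

    labels = {}
    for site in sites:
        if site in main:
            labels[site] = "MAIN"
        elif site in in_:
            labels[site] = "IN"
        elif site in out:
            labels[site] = "OUT"
        elif site in from_in and site in to_out:
            labels[site] = "TUNNEL"
        elif site in from_in:
            labels[site] = "T.IN"
        elif site in to_out:
            labels[site] = "T.OUT"
        else:
            labels[site] = "ISLANDS"
    return labels


def split_main(labels, links):
    """Subdivide MAIN by direct links from IN and direct links to OUT."""
    from_in = {dst for src, dst in links if labels[src] == "IN"}
    to_out = {src for src, dst in links if labels[dst] == "OUT"}
    sub = {}
    for site, comp in labels.items():
        if comp != "MAIN":
            continue
        if site in from_in and site in to_out:
            sub[site] = "MAIN-MAIN"
        elif site in from_in:
            sub[site] = "MAIN-IN"
        elif site in to_out:
            sub[site] = "MAIN-OUT"
        else:
            sub[site] = "MAIN-NORM"
    return sub


# Toy graph (hypothetical): a cycle a -> b -> c -> a plays the role of MAIN.
links = [("a", "b"), ("b", "c"), ("c", "a"),
         ("i", "a"), ("c", "o"),
         ("i", "t"), ("u", "o"),
         ("i", "x"), ("x", "o")]
sites = ["a", "b", "c", "i", "o", "t", "u", "x", "z"]
labels = bow_tie(sites, links)
sub = split_main(labels, links)
```

The strongly connected component search here is quadratic in the worst case; for a crawl of tens of thousands of sites a linear-time algorithm such as Tarjan's would be the more natural choice.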


3. Evolution of the Structure Composition

Table 3 shows the number of sites and domains that have appeared and disappeared from year to year. Tables 4 and 5 show the migration of sites among the components, and table 6 gives the same analysis for domains. There are two ways of reading these tables. By columns, we see from which component the sites/domains in each component come. By rows, we see where the sites/domains of each component in the previous year are today. The last column and the last row represent the sites/domains that no longer exist (GONE) and the new sites/domains (NEW), respectively. Death of a site means that there are no IP addresses associated with it (this might be wrong if the site changes its name), and death of a domain means that there are no sites associated with it (in particular the domain name itself or the name prefixed by www).¹

           Year     TOTAL       NEW      GONE
Sites      2000     7.497         -         -
           2001    21.207    15.415     1.705
           2002    39.320    23.937     5.824
Domains    2001    19.389         -         -
           2002    35.520    21.397     5.266

Table 3. Growth and death of sites and domains.

Notice that OUT and MAIN are stable components, because about 25% of the sites in each of them in 2002 were already there in 2001 (table 5). It is also interesting to see that about 20% of the sites in MAIN in 2002 came from OUT, and that ISLANDS is the component with the largest growth and also the largest death. Figures 4 and 5 show graphically the migration of sites and domains among the different components, using the same textures to identify from which component in the previous year they came.
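The two ways of reading the migration tables can be made concrete with a short sketch over the site counts of Table 5 (2001 to 2002). The layout below (one entry per 2001 component plus NEW, one position per 2002 component plus GONE) and the helper names `column_total`, `origin_share`, and `destination_share` are choices made for this illustration, not part of the original analysis.

```python
# Destinations in 2002, in the order used for every list below.
DEST = ["MAIN", "OUT", "IN", "ISLANDS", "TUNNEL", "TIN", "TOUT", "GONE"]

# Site counts from Table 5. Keys: component in 2001 (NEW = first seen in 2002).
# NEW sites cannot be GONE, so that cell is stored as 0.
MIGRATION = {
    "MAIN":    [1214, 339, 158, 42, 1, 17, 8, 183],
    "OUT":     [901, 1683, 188, 532, 15, 128, 43, 796],
    "IN":      [233, 98, 292, 196, 1, 22, 16, 382],
    "ISLANDS": [422, 1351, 786, 5182, 23, 365, 299, 4240],
    "TUNNEL":  [11, 15, 3, 4, 1, 2, 0, 12],
    "TIN":     [78, 215, 25, 128, 2, 66, 5, 127],
    "TOUT":    [52, 79, 41, 59, 0, 18, 24, 84],
    "NEW":     [1801, 2965, 2430, 15173, 50, 608, 910, 0],
}


def column_total(dest):
    """Number of sites that ended up in `dest` in 2002."""
    j = DEST.index(dest)
    return sum(row[j] for row in MIGRATION.values())


def origin_share(dest):
    """By columns: fraction of the 2002 sites in `dest` coming from each origin."""
    j = DEST.index(dest)
    total = column_total(dest)
    return {origin: row[j] / total for origin, row in MIGRATION.items()}


def destination_share(origin):
    """By rows: where the 2001 sites of `origin` are found in 2002."""
    row = MIGRATION[origin]
    total = sum(row)
    return dict(zip(DEST, (count / total for count in row)))


main_origins = origin_share("MAIN")
out_origins = origin_share("OUT")
```

Read by columns, `main_origins["MAIN"]` and `out_origins["OUT"]` give the share of sites that were already in the same component the year before, and `main_origins["OUT"]` gives the share of MAIN that arrived from OUT.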

4. Analysis of Web Site Migration

Web sites evolve and hence migrate inside the structure. First, a typical Web site should start as part of ISLANDS or IN (depending on whether or not it links to a good Web site). If the site becomes popular and also links to known sites, it migrates to MAIN. If its links are not well chosen or updated, it starts in or migrates to OUT. Figure 2 shows the expected life path of a website migrating to MAIN. We also include the migration from MAIN to OUT, which occurs if the site is not well maintained. Figure 3 shows the real migration (in percentages) in the structure. We can see that there is almost no migration from IN to MAIN or from MAIN to OUT, contrary to what intuition predicts. Also, there are websites that appear directly in MAIN or OUT. This means that a good site seems to be linked from a site in MAIN in less than a year, or that sites obtain links from portals in MAIN (for example, through a banner).

¹ The domain name could still be registered, though.


Figure 2. Expected migrations of websites in the structure.

5. Conclusions

The overall number of sites of the Chilean Web is roughly doubling each year. However, that is the net result of an increase of more than 125% and a death rate of about 25%. In addition, many sites, sometimes because of ignorance, do not allow crawlers to enter. For example, in 2001, 56% of the domains and 54% of the sites had only one page. However, 25% of them (14% of the total) had an initial Flash page or called a similar kind of program. There is still a lot to do to understand how the composition of the structure changes. For example, we can follow specific sites, but in that case we have to determine whether a one-year sampling strategy is enough. We are currently studying the change at the level of pages in relation to the structure. For example, the 20 largest sites (in pages) in 2002 are all different from the largest sites in 2001.
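The growth and death figures can be recomputed from the site counts in Table 3. The sketch below expresses new and gone sites relative to the previous year's total, which is an assumption about the denominator; the name `yearly_rates` is hypothetical.

```python
# (year, total sites, new sites, gone sites) from Table 3; 2000 has no previous year.
SITES = [
    (2000, 7497, None, None),
    (2001, 21207, 15415, 1705),
    (2002, 39320, 23937, 5824),
]


def yearly_rates(rows):
    """Growth factor, birth rate, and death rate relative to the previous total."""
    rates = {}
    for (_, prev_total, _, _), (year, total, new, gone) in zip(rows, rows[1:]):
        # Consistency check: previous total + new - gone must give the new total.
        if prev_total + new - gone != total:
            raise ValueError(f"inconsistent counts for {year}")
        rates[year] = {
            "growth_factor": total / prev_total,
            "birth_rate": new / prev_total,
            "death_rate": gone / prev_total,
        }
    return rates


rates = yearly_rates(SITES)
for year, values in rates.items():
    print(year, {name: round(value, 3) for name, value in values.items()})
```

The same function applies to the domain counts once totals for two consecutive years, together with their new and gone figures, are available.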

Acknowledgements

We acknowledge the support of Millennium Nucleus Grant P01-029-F from Mideplan, Chile, and Chilean Fondecyt Project 1020803.

References

[BYC01] Ricardo Baeza-Yates and Carlos Castillo. Relating web characteristics with link analysis. In String Processing and Information Retrieval. IEEE Computer Science Press, 2001.

[BYSJC02] Ricardo Baeza-Yates, Felipe Saint-Jean, and Carlos Castillo. Web dynamics, structure, and link ranking. In String Processing and Information Retrieval. Lecture Notes in CS, Springer, 2002.

[BYP03] Ricardo Baeza-Yates and Barbara Poblete. Evolution of the Web Structure. In Twelfth World Wide Web Conference. Budapest, Hungary, 2003.

[BKM 00] A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, and A. Tomkins. Graph structure in the Web: Experiments and models. In 9th World Wide Web Conference, 2000.

Figure 3. Real migrations of websites in the structure in percentage (top: 2000-2001, bottom: 2001-2002).


Rows: component in 2000 (NEW = sites first seen in 2001). Columns: component in 2001 (GONE = sites that no longer exist).

2000 \ 2001     MAIN     OUT      IN ISLANDS  TUNNEL     TIN    TOUT    GONE
MAIN             959     724     140     305      11      61      24     509
OUT              195    1151      39     749       5      96      48     668
IN                39      89     118     279       2      31      25     226
ISLANDS           18     124      14     213       0      14      19     174
TUNNEL             1       1       3      18       0       0       2       3
TIN                5      31       0      18       3       3       2      37
TOUT               3      38      25     131       0       4      12      88
NEW              742    2128     901   10955      27     437     225       -

Table 4. Component changes of sites from 2000 to 2001.

Rows: component in 2001 (NEW = sites first seen in 2002). Columns: component in 2002 (GONE = sites that no longer exist).

2001 \ 2002     MAIN     OUT      IN ISLANDS  TUNNEL     TIN    TOUT    GONE
MAIN            1214     339     158      42       1      17       8     183
OUT              901    1683     188     532      15     128      43     796
IN               233      98     292     196       1      22      16     382
ISLANDS          422    1351     786    5182      23     365     299    4240
TUNNEL            11      15       3       4       1       2       0      12
TIN               78     215      25     128       2      66       5     127
TOUT              52      79      41      59       0      18      24      84
NEW             1801    2965    2430   15173      50     608     910       -

Table 5. Component changes of sites from 2001 to 2002.

Rows: component in 2001 (NEW = domains first seen in 2002). Columns: component in 2002 (GONE = domains that no longer exist).

2001 \ 2002     MAIN     OUT      IN ISLANDS  TUNNEL     TIN    TOUT    GONE
MAIN             918     218      79      35       0       4       4     141
OUT              892    1424     167     466      14      97      35     560
IN               206      79     288     182       2      19       9     326
ISLANDS          487    1276     970    4967      25     320     242    4074
TUNNEL             4       1       3       1       0       0       0       4
TIN               88     226      22     134       0      59       8     102
TOUT              35      22      39      35       0       2      19      59
NEW             1376    2176    2644   14171      27     419     584       -

Table 6. Component changes in domains from 2001 to 2002.


Figure 4. Flow of sites among components. The same texture in each component indicates the origin of old sites.

Figure 5. Flow of domains among components.
