Does Internet Media Traffic Really Follow Zipf-like Distribution?

Lei Guo1, Enhua Tan1, Songqing Chen2, Zhen Xiao3, and Xiaodong Zhang1

1 The Ohio State University, {lguo, etan, zhang}@cse.ohio-state.edu
2 George Mason University, [email protected]
3 IBM Research, [email protected]

Categories and Subject Descriptors: C.2.4 [Computer-Communication Networks]: Distributed Systems
General Terms: Measurement
Keywords: Media, Zipf-like, stretched exponential

Copyright is held by the author/owner(s). SIGMETRICS’07, June 12–16, 2007, San Diego, California, USA. ACM 978-1-59593-639-4/07/0006.

1. INTRODUCTION

It is commonly agreed that Web traffic follows a Zipf-like distribution, which is an analytical foundation for improving Web access performance with client-server based proxy caching systems on the Internet. However, some recent studies have observed non-Zipf-like distributions of Internet media traffic in different content delivery systems. Due to the variety of media delivery systems and the diversity of media content, existing studies on media traffic are largely workload specific, and the observed access patterns often differ from or even conflict with each other. For Web media systems, study [3] reports that the access pattern of streaming media is Zipf-like in a university campus network, while study [2] finds that it is not Zipf-like in an enterprise media server. For VoD media systems, study [1] finds that it is not Zipf-like in a multicast-based Media-on-Demand server of a campus network, while study [9] reports that it is Zipf-like in a large VoD streaming system of an ISP. For P2P media systems, study [4] reports that the access pattern of a KaZaa media workload collected in a campus network is not Zipf-like, while study [5] reports that it is Zipf-like in another campus network. For live streaming media systems, study [8] reports that it is Zipf-like, while study [6] reports that it is not.

A number of models have been proposed to explain the observed media access patterns, such as the generalized Zipf-like model [7], the “fetch-at-most-once” model [4], and the two-mode Zipf model [6]. However, each of these models can explain only a very limited scope of measurement results. A general model of Internet media access patterns is highly desirable for traffic engineering on the Internet and is critical for designing, benchmarking, and evaluating Internet media delivery systems.

In this study, we have analyzed a wide variety of media workloads on the Internet. The workloads were collected from both the client side and the server side in Web, VoD, P2P, and live streaming environments between 1998 and 2006, where the media content is delivered via Web/P2P downloading or unicast/multicast streaming. The duration of these workloads ranges from a few days to more than two years, and the user population ranges from several thousand to more than one hundred thousand. The number of client requests ranges from tens of thousands to hundreds of millions, and the number of objects in each workload ranges from several hundred to several million.

Through extensive analysis, we find that the reference ranks of media objects in all sixteen workloads follow the stretched exponential (SE) distribution, and that a biased measurement may lead to a Zipf-like observation on media access patterns. With such a request pattern, the temporal locality in media systems is hard to exploit with client-server based caching systems. The stretched exponential model implies that peer-to-peer collaborative caching systems can effectively deliver Internet media content. Current technology advancements such as PPLive and BitTorrent have demonstrated the strong advantages of P2P collaboration for the delivery of Internet media content.

2. THE STRETCHED EXPONENTIAL DISTRIBUTION OF MEDIA TRAFFIC

Figures 1(a), 1(b), 1(c), and 1(d) show the reference rank distributions of media objects in typical Web, VoD, P2P, and Live media systems, respectively. In each figure, the x coordinate represents the reference rank of each object, plotted in log scale, while the y coordinate represents the number of references to this object, plotted in both log scale (marked on the right of the y-axis) and a powered scale (by a constant $c$, as marked on the left of the y-axis). These figures show that the reference rank distributions of all these workloads cannot be fitted with a straight line in a log-log scale, meaning they are not Zipf-like. Instead, by selecting a proper constant $c$, all these workloads can be well fitted with a straight line in log-$y^c$ scale. Such a distribution is called a stretched exponential distribution. As marked in the figures, the coefficient of determination of the stretched exponential fitting result, $R^2$, is very close to 1 for all workloads. The cumulative probability function of a stretched exponential distribution can be expressed as

$$P(X < x) = 1 - e^{-(x/x_0)^c}, \tag{1}$$

where $c$ and $x_0$ are constants. If we rank the $N$ objects in the workload in descending order of their reference numbers $y_i$ ($1 \le i \le N$), we have $P(X \ge y_i) = i/N$. Substituting Eq. (1) gives $e^{-(y_i/x_0)^c} = i/N$, so $y_i^c$ is linear in $\log i$, and the reference rank distribution can be expressed as

$$y_i^c = -a \log i + b \quad (1 \le i \le N), \tag{2}$$

where $a = x_0^c$ and $b = y_1^c$. Since $b$ is a normalization parameter, the shape of an SE distribution is determined by $c$, the stretch factor of the y coordinate, and $a$, the slope of the straight line in log-$y^c$ scale.

[Figure 1: Reference rank distributions of different kinds of media systems. Panels: (a) Web media system, (b) VoD media system, (c) P2P media system, (d) Live media system. x-axis: Rank (log scale); left y-axis: # of References ($y^c$ scale); right y-axis: # of References (log scale). Each panel plots the data in log-log scale, the data in log-$y^c$ scale, and the SE model fit. Fit annotations in the panels: $c = 0.2$, $a = 0.738$, $b = 6.533$, $R^2 = 0.997444$; $c = 0.4$, $a = 12.961$, $b = 118.864$, $R^2 = 0.985824$; $c = 0.52$, $a = 7.234$, $b = 58.142$, $R^2 = 0.999138$; $c = 0.2$, $a = 2.662$, $b = 22.694$, $R^2 = 0.99764$.]

For on-demand media systems, the stretch factor $c$ of the object reference rank distribution is closely related to the sizes of files in the system. In general, for media workloads delivered by similar kinds of systems or techniques, the stretch factors of their object reference rank distributions increase with their median file sizes; for workloads with similar median file sizes, the stretch factors are similar regardless of the underlying media systems and delivery techniques. Furthermore, for objects accessed in different time periods in a media system with a roughly constant object birth rate, the stretch factor $c$ of the corresponding reference rank distributions is a time-invariant constant.

For media systems with roughly constant request rates and object birth rates over time, the parameter $a$ (the slope of the SE line in log-$y^c$ scale) of the object reference rank distribution increases with its stretch factor $c$ and with the average number of requests per object in the workload. Furthermore, because the average number of requests per object grows over time, parameter $a$ increases gradually with the length of the workload duration but converges to a constant, which is determined by the ratio of the media request rate to the object birth rate and by the stretch factor $c$.

For a stretched exponential reference rank distribution with slope $a$ in log-$y^c$ scale and a total of $N$ objects, the difference between this distribution and its corresponding Zipf-like model in log-log scale increases with $a \log N$. For a workload with large media files, both the average number of requests per object and the stretch factor $c$ are large. Thus $a$ is large, and the difference between its reference rank distribution and the corresponding Zipf-like model is large.
For a workload with small media files, the difference between its reference rank distribution and the corresponding Zipf-like model is also large when the workload duration is long enough (at least months to years).
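
The fitting procedure described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the analysis code used for the workloads: the function names (`linear_fit`, `fit_se`, `fit_zipf`, `best_stretch_factor`), the use of ordinary least squares, the natural logarithm, the grid search over candidate values of $c$, and the synthetic input are all hypothetical choices made for the example.

```python
import math


def linear_fit(xs, ys):
    """Ordinary least squares for y = slope * x + intercept; also returns R^2."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1.0 - ss_res / ss_tot


def fit_se(counts, c):
    """Fit y_i^c = -a * log(i) + b for a given stretch factor c (Eq. 2)."""
    ranked = sorted(counts, reverse=True)
    xs = [math.log(i) for i in range(1, len(ranked) + 1)]
    ys = [y ** c for y in ranked]
    slope, intercept, r2 = linear_fit(xs, ys)
    return -slope, intercept, r2  # (a, b, R^2)


def fit_zipf(counts):
    """Fit log(y_i) = -alpha * log(i) + k, a straight line in log-log scale."""
    ranked = sorted(counts, reverse=True)
    xs = [math.log(i) for i in range(1, len(ranked) + 1)]
    ys = [math.log(y) for y in ranked]
    slope, intercept, r2 = linear_fit(xs, ys)
    return -slope, intercept, r2  # (alpha, k, R^2)


def best_stretch_factor(counts, candidates):
    """Pick the candidate c whose SE fit has the highest R^2 (assumed criterion)."""
    return max(candidates, key=lambda c: fit_se(counts, c)[2])


if __name__ == "__main__":
    # Synthetic reference counts drawn exactly from Eq. (2); the parameter
    # values and the object count of 1000 are illustrative assumptions.
    c_true, a_true, b_true = 0.2, 0.738, 6.533
    counts = [(b_true - a_true * math.log(i)) ** (1 / c_true)
              for i in range(1, 1001)]

    c_best = best_stretch_factor(counts, [0.1, 0.2, 0.3, 0.5, 1.0])
    a_fit, b_fit, r2_se = fit_se(counts, c_best)
    _, _, r2_zipf = fit_zipf(counts)
    print("best c:", c_best, "a:", a_fit, "b:", b_fit)
    print("R^2 (SE fit):", r2_se, "R^2 (Zipf fit):", r2_zipf)
```

On data that truly follows Eq. (2), the SE fit at the right $c$ is a straight line, while the log-log (Zipf-like) fit is not. On real traces the counts are integers and noisy, so a finer grid for $c$ or a nonlinear fit may be preferable.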

3. IMPLICATIONS ON MEDIA CACHING

Internet media objects commonly have long lifespans because they are seldom updated and have low production rates compared to Web objects. Most requested media objects were created a long time ago, and most media requests are for such old objects. For example, for a media workload collected at a large residential cable network in 2005, more than 50% of the requested objects were created at least 250 days earlier, and more than 50% of the requests are for objects older than 150 days.

The temporal locality in a computer system comes from the concentration and correlation of requests to the content in the system. During a short period such as one week, the popularity of media objects is almost stationary, so the temporal locality mainly comes from the request concentration. We have modeled the optimal hit ratios of typical short-term media workloads and Web workloads, where request concentration dominates the temporal locality. In such cases, caching a media (SE) workload is far less efficient than caching a Web (Zipf) workload. For example, assuming all objects are cacheable and have the same file size, caching 1% of Web content can achieve a hit ratio of about 40%, while caching 1% of media content can achieve a hit ratio of only 18%, even though the two workloads have the same hit ratio with an unlimited cache.

Nevertheless, the request concentration in a media workload (parameter $a$) increases with time. Furthermore, due to the long lifespan of media objects, the request correlation becomes important with time. With a much higher temporal locality, long-term caching can achieve a hit ratio greater than 85% by caching 10% of the content. However, it may take months to years and a huge amount of storage to achieve such an improvement, for which peer-to-peer techniques can be highly effective.
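
The role of request concentration can be illustrated with a small sketch that computes the share of requests going to the most popular fraction of objects under a given rank distribution. A static cache holding exactly those equal-sized objects would serve that share, ignoring first-access misses. The function names (`zipf_counts`, `se_counts`, `top_fraction_share`) and all parameter values below are hypothetical, and the sketch does not attempt to reproduce the 40% and 18% hit ratios quoted above.

```python
import math


def zipf_counts(n_objects, alpha, top_count):
    """Zipf-like rank distribution: y_i = top_count / i^alpha."""
    return [top_count / (i ** alpha) for i in range(1, n_objects + 1)]


def se_counts(n_objects, c, a, b):
    """SE rank distribution from Eq. (2): y_i = (b - a * log(i))^(1/c)."""
    counts = []
    for i in range(1, n_objects + 1):
        value = b - a * math.log(i)
        if value <= 0:
            # Eq. (2) only yields positive counts while b > a * log(i).
            raise ValueError("b is too small for this many objects")
        counts.append(value ** (1.0 / c))
    return counts


def top_fraction_share(counts, cache_fraction):
    """Share of all requests that target the most popular objects.

    Assumes equal object sizes, so cache_fraction of the content means
    cache_fraction of the objects; at least one object is always cached.
    """
    ranked = sorted(counts, reverse=True)
    cached = max(1, round(len(ranked) * cache_fraction))
    return sum(ranked[:cached]) / sum(ranked)


if __name__ == "__main__":
    # Illustrative (assumed) parameters for 1000 objects.
    web_like = zipf_counts(1000, alpha=0.8, top_count=10000.0)
    media_like = se_counts(1000, c=0.2, a=0.738, b=6.533)
    for fraction in (0.01, 0.10):
        print("cache", fraction,
              "Zipf share:", top_fraction_share(web_like, fraction),
              "SE share:", top_fraction_share(media_like, fraction))
```

How the two shares compare depends on the chosen parameters, so the printed numbers are only meaningful for parameters fitted to actual workloads.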

4. CONCLUSION

Our study shows that Internet media access patterns follow the stretched exponential distribution. Thus, media caching with a client-server model is far less effective than Web content caching. The stretched exponential distribution provides an analytical foundation for establishing peer-to-peer caching systems to deliver the rapidly increasing Internet media content.

5. REFERENCES

[1] S. Acharya, et al. Characterizing user access to videos on the World Wide Web. In Proc. of MMCN, 2000.
[2] L. Cherkasova, et al. Characterizing locality, evolution, and life span of accesses in enterprise media server workloads. In Proc. of NOSSDAV, May 2002.
[3] M. Chesire, et al. Measurement and analysis of a streaming media workload. In Proc. of USENIX USITS, March 2001.
[4] K. P. Gummadi, et al. Measurement, modeling, and analysis of a peer-to-peer file-sharing workload. In Proc. of ACM SOSP, October 2003.
[5] A. Iamnitchi, et al. Small-world file-sharing communities. In Proc. of IEEE INFOCOM, March 2004.
[6] K. Sripanidkulchai, et al. An analysis of live streaming workloads on the Internet. In Proc. of ACM SIGCOMM IMC, October 2004.
[7] W. Tang, et al. MediSyn: A synthetic streaming media service workload generator. In Proc. of ACM NOSSDAV, June 2003.
[8] E. Veloso, et al. A hierarchical characterization of a live streaming media workload. In Proc. of the ACM SIGCOMM IMW, November 2002.
[9] H. Yu, et al. Understanding user behavior in large scale video-on-demand systems. In Proc. of EuroSys, April 2006.
