Estimating the Size of Online Social Networks
Shaozhi Ye
Department of Computer Science, University of California, Davis

The 2nd IEEE International Conference on Social Computing
Joint work with S. Felix Wu @ UC Davis


Outline

1. Introduction
2. Maximum Likelihood Estimator
3. Random Walker Based Estimator
4. Pitfalls
5. Summary

1. Introduction

Motivation

OSNs claim various numbers of users. Facebook: "We have 500M active users." How can outsiders verify such claims?

Much OSN research is conducted on partial data sets. How representative are our samples?

The question we try to answer here: can we estimate the number of users without crawling the entire social network?


Existing approaches

Largest valid UID: fails when UIDs are assigned non-sequentially.

Probing the UID space: expensive when UIDs are assigned non-uniformly, and hard to apply to non-numerical UIDs.


Our approach

Assumptions:
  No assumption on how UIDs are assigned.
  Requires the ability to distinguish any two users.
Methods:
  MLE: Maximum likelihood estimation based on uniform sampling.
  RW: Unbiased estimator based on random walkers.

2. Maximum Likelihood Estimator

2.1 Intuition

Sampling the network with replacement multiple times yields duplicate users. For the same number of samples, the more duplicate users we see, the smaller the graph is.
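The intuition can be checked with a small simulation. This is only a sketch: the function names, the seed, and the graph sizes are illustrative assumptions, and users are modeled as the integers 0..n-1 drawn uniformly at random.

```python
import random

def count_unique(n, s, rng):
    """Draw s user IDs uniformly with replacement from range(n); return the number of distinct IDs."""
    return len({rng.randrange(n) for _ in range(s)})

def expected_unique(n, s):
    """Expected number of distinct users: each user is missed with probability (1 - 1/n)**s."""
    return n * (1.0 - (1.0 - 1.0 / n) ** s)

if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    s = 1000                # same number of samples for every graph size
    for n in (1_000, 10_000, 100_000):
        k = count_unique(n, s, rng)
        print(f"n={n:>7}  unique={k:>4}  duplicates={s - k:>4}  "
              f"expected unique={expected_unique(n, s):.1f}")
```

With s fixed, the number of duplicates shrinks as n grows, which is the signal the estimator exploits.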


2.2 Maximum Likelihood Estimation (MLE)
The graph size n can be estimated as the $\hat{n}$ which maximizes the probability of getting k unique users in s uniform samplings with replacement:
$\hat{n} = \arg\max_n P(k \mid n, s)$.

Finkelstein et al. 1998: If k < n, $\hat{n}$ is unique and is the smallest integer j ≥ k which satisfies
$\frac{j+1}{j+1-k} \left( \frac{j}{j+1} \right)^s < 1$.

$\frac{j+1}{j+1-k} \left( \frac{j}{j+1} \right)^s$ is infeasible and expensive to compute for large s.

For large OSNs, s ≪ n, thus linear probing becomes slow.
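One way to see where this condition comes from is sketched below. It uses the classical occupancy distribution and is not taken from the slides; $S(s,k)$, the Stirling number of the second kind, is a symbol introduced here and does not depend on n. The probability of seeing exactly k distinct users in s uniform draws from n users, and the ratio of successive likelihoods, are:

```latex
\begin{align*}
P(k \mid n, s) &= S(s,k)\,\frac{n!}{(n-k)!\,n^{s}}, \\
\frac{P(k \mid j+1, s)}{P(k \mid j, s)}
  &= \frac{(j+1)!}{(j+1-k)!\,(j+1)^{s}} \cdot \frac{(j-k)!\,j^{s}}{j!}
   = \frac{j+1}{j+1-k}\left(\frac{j}{j+1}\right)^{s}.
\end{align*}
```

The ratio dropping below 1 means the likelihood no longer increases from j to j+1, so the first j at which this happens is the maximizer.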


2.3 Examining the objective function
$f(j) = \log\left( \frac{j+1}{j+1-k} \right) + s \log\left( \frac{j}{j+1} \right) < 0$

Theorem: $\hat{n} \in [k, \lceil j_c \rceil]$, where $j_c = s(k-1)/(s-k)$. Furthermore, $f(j)$ is monotonically decreasing within the interval $[k, \lfloor j_c \rfloor]$.
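The theorem suggests replacing linear probing with a binary search over [k, ⌈j_c⌉]. The following is a minimal sketch of that idea, not the authors' implementation: the function names are hypothetical, `log1p` is a numerical choice made here, and the code assumes 1 ≤ k < s so that j_c is defined.

```python
import math

def _check(k, s):
    # j_c = s(k-1)/(s-k) is only defined when at least one duplicate was seen.
    if not 1 <= k < s:
        raise ValueError("need 1 <= k < s (at least one duplicate sample)")

def f(j, k, s):
    """f(j) = log((j+1)/(j+1-k)) + s*log(j/(j+1)), written with log1p for precision at large j."""
    return -math.log1p(-k / (j + 1)) + s * math.log1p(-1.0 / (j + 1))

def mle_linear(k, s):
    """Baseline: linear probing for the smallest integer j >= k with f(j) < 0."""
    _check(k, s)
    j = k
    while f(j, k, s) >= 0:
        j += 1
    return j

def mle_bisect(k, s):
    """Binary search for the same j over [k, ceil(j_c)]."""
    _check(k, s)
    ceil_jc = -(-(s * (k - 1)) // (s - k))  # integer ceiling of s(k-1)/(s-k)
    lo, hi = k, max(k, ceil_jc)
    while lo < hi:
        mid = (lo + hi) // 2   # mid < hi, so f is only evaluated on [k, floor(j_c)]
        if f(mid, k, s) < 0:
            hi = mid           # mid satisfies the condition; the answer is mid or smaller
        else:
            lo = mid + 1       # f is decreasing here, so the answer lies to the right
    return lo

if __name__ == "__main__":
    k, s = 95, 100             # illustrative values: 95 unique users in 100 samples
    print(mle_linear(k, s), mle_bisect(k, s))
```

The search takes a number of steps logarithmic in ⌈j_c⌉ − k, whereas linear probing takes one step per candidate j, which is the source of the slowdown when s ≪ n.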


2.4 Results: Mean
Size of simulated social networks: 10M


2.4 Results: Standard deviation
Size of simulated social networks: 10M


2.5 Summary

Simulation with n = 10M (1% sampling):
  Estimation error < 0.2%
  Standard deviation < 4.5% of the graph size
  70X speedup compared to linear probing

Estimating on Twitter.com: its public timeline service samples from 13M users.

3. Random Walker Based Estimator

3.1 How it works
Marchetti-Spaccamela 1989

Forward walking: Find a random acyclic path via random walk.

Back tracing: From where the random walker stops, trace back to the originator via random walks on the reverse graph.
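The forward-walking step alone can be sketched as a self-avoiding random walk. Everything specific here is an assumption made for illustration: the adjacency-dict representation, the function name, and in particular the stopping rule (stop when the current node has no unvisited out-neighbor), which the slides do not state. Back tracing and the estimator itself are not shown.

```python
import random

def forward_walk(graph, start, rng):
    """Return a random acyclic path from `start` in a directed graph.

    `graph` maps each node to a list of its out-neighbors (hypothetical format).
    """
    path = [start]
    visited = {start}
    while True:
        # Only unvisited out-neighbors are eligible, so the path never revisits a node.
        candidates = [v for v in graph.get(path[-1], ()) if v not in visited]
        if not candidates:
            return path  # assumed stopping rule: dead end
        nxt = rng.choice(candidates)
        visited.add(nxt)
        path.append(nxt)

if __name__ == "__main__":
    toy = {0: [1, 2], 1: [2, 3], 2: [0, 3], 3: [4], 4: []}  # toy graph, not real data
    print(forward_walk(toy, 0, random.Random(1)))
```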


3.2 Datasets
Four social graphs from Mislove et al. 2007

Graph        Total Nodes  Total Links  Mean Degree  Users Crawled (Estimated)
Flickr         1,657,846   22,613,981         13.6  26.9%
LiveJournal    4,929,069   77,402,652         15.7  95.4%
Orkut          3,072,441  223,534,301         72.8  11.3%
YouTube        1,099,764    4,945,382          4.5  Unknown

267 citations according to Google Scholar as of May 26th, 2010.


3.3 Results on four OSNs


3.4 How many runs do we need?
Results on Flickr.com


3.5 Reducing the crawling cost
Starting with high degree nodes


3.6 Generalizing RW to estimate other quantities
Refer to the paper for details.


4. Pitfalls
Things we could do better in the future:

Large variance and expensive back tracing
Poor lower and upper bounds for RW estimator
Reducing computation costs in simulations:
  Tight data structures
  Accelerating frequently executed routines
  Improving data locality


Summary

Two estimators are introduced to estimate the size of OSNs.
An O(log n) algorithm is introduced to solve the MLE problem quickly.
The bias and variance of these two estimators are evaluated with real OSNs.
The RW estimator is generalized.


Thanks!
