Estimating the Size of Online Social Networks

Shaozhi Ye
[email protected]
Department of Computer Science
University of California, Davis

The 2nd IEEE International Conference on Social Computing
Joint work with S. Felix Wu @ UC Davis


Outline

1 Introduction
2 Maximum Likelihood Estimator
3 Random Walker Based Estimator
4 Pitfalls
5 Summary


Motivation

OSNs claim various numbers of users. Facebook: "We have 500M active users." How can outsiders verify it?

Much OSN research is conducted on partial data sets. How representative are our samples?

The question we try to answer here: can we estimate the number of users without crawling the entire social network?


Existing approaches

Largest valid UID: fails when UIDs are assigned non-sequentially.

Probing the UID space: expensive when UIDs are assigned non-uniformly, and hard to apply to non-numerical UIDs.


Our approach

Assumptions:
  No assumption about how UIDs are assigned.
  Requires the ability to distinguish any two users.

Methods:
  MLE: maximum likelihood estimation based on uniform sampling.
  RW: unbiased estimator based on random walkers.


2.1 Intuition

Sampling the network with replacement multiple times yields duplicate users. For the same number of samples, the more duplicate users we see, the smaller the graph is.
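The intuition can be checked with a small simulation. The sketch below is illustrative only and not taken from the talk: the function names, the seed, and the graph sizes are assumptions, and users are modelled as the integers 0..n-1. It draws s IDs uniformly with replacement and counts the duplicates, alongside the standard closed form n(1 - (1 - 1/n)^s) for the expected number of distinct draws.

```python
import random


def expected_unique(n: int, s: int) -> float:
    """Expected number of distinct users in s uniform draws with replacement from n users."""
    return n * (1.0 - (1.0 - 1.0 / n) ** s)


def simulate_unique(n: int, s: int, rng: random.Random) -> int:
    """Draw s user IDs uniformly with replacement and count the distinct ones."""
    return len({rng.randrange(n) for _ in range(s)})


if __name__ == "__main__":
    rng = random.Random(0)  # arbitrary fixed seed, only for repeatability
    s = 1_000               # hypothetical number of samples
    for n in (1_000, 10_000, 100_000):  # hypothetical graph sizes
        k = simulate_unique(n, s, rng)
        print(f"n={n:>7}  unique={k:>4}  duplicates={s - k:>4}  "
              f"expected unique={expected_unique(n, s):.1f}")
```

With s held fixed, the number of duplicates shrinks as n grows, which is the signal the estimator inverts.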


2.2 Maximum Likelihood Estimation (MLE)

The graph size $n$ can be estimated as the $\hat{n}$ which maximizes the probability of getting $k$ unique users in $s$ uniform samplings with replacement:

$$\hat{n} = \arg\max_{n} P(k \mid n, s).$$

Finkelstein et al. 1998: If $k < n$, $\hat{n}$ is unique and is the smallest integer $j \ge k$ which satisfies

$$\frac{j+1}{j+1-k}\left(\frac{j}{j+1}\right)^{s} < 1.$$

$\frac{j+1}{j+1-k}\left(\frac{j}{j+1}\right)^{s}$ is infeasible and expensive to compute for large $s$.

For large OSNs, $s \ll n$, thus linear probing (testing $j = k, k+1, \dots$ in turn) becomes slow.


2.3 Examining the objective function

$$f(j) = \log\left(\frac{j+1}{j+1-k}\right) + s\,\log\left(\frac{j}{j+1}\right) < 0$$

Theorem: $\hat{n} \in [k, \lceil j_c \rceil]$, where $j_c = s(k-1)/(s-k)$. Furthermore, $f(j)$ is monotonically decreasing within the interval $[k, \lfloor j_c \rfloor]$.
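One way to use the theorem is to binary-search the bounded interval for the smallest j with f(j) < 0. The sketch below assumes that reading; the slides do not spell out the search procedure, so this may differ from the authors' implementation. The names `f`, `mle_linear` and `mle_binary` are hypothetical, and `log1p` is used only as a numerically convenient way to write the two logarithms. The search also assumes f is negative at the upper bound, which holds if f approaches 0 from below as j grows.

```python
import math


def f(j: int, k: int, s: int) -> float:
    """f(j) = log((j+1)/(j+1-k)) + s*log(j/(j+1)), rewritten with log1p."""
    return -math.log1p(-k / (j + 1)) - s * math.log1p(1.0 / j)


def _check(k: int, s: int) -> None:
    # With no duplicates (k == s) the likelihood has no finite maximizer.
    if not 1 <= k < s:
        raise ValueError("need 1 <= k < s (at least one duplicate sample)")


def mle_linear(k: int, s: int) -> int:
    """Linear probing: test j = k, k+1, ... until f(j) < 0."""
    _check(k, s)
    j = k
    while f(j, k, s) >= 0:
        j += 1
    return j


def mle_binary(k: int, s: int) -> int:
    """Binary search for the smallest j in [k, ceil(j_c)] with f(j) < 0."""
    _check(k, s)
    ceil_jc = -((-s * (k - 1)) // (s - k))  # integer ceiling of s(k-1)/(s-k)
    lo, hi = k, max(k, ceil_jc)
    while lo < hi:
        mid = (lo + hi) // 2
        if f(mid, k, s) < 0:
            hi = mid       # mid satisfies the condition; look for a smaller j
        else:
            lo = mid + 1   # condition fails; the answer lies above mid
    return lo


if __name__ == "__main__":
    # Hypothetical input: 100 samples, 80 of them distinct.
    print(mle_linear(80, 100), mle_binary(80, 100))
```

Both functions return the same estimate. The binary search needs a number of evaluations of f that is logarithmic in the width of the interval, rather than one evaluation per candidate j.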


2.4 Results - Mean

Size of simulated social networks: 10M


2.4 Results - Standard deviation

Size of simulated social networks: 10M


2.5 Summary

Simulation with n = 10M (1% sampling):
  Estimation error < 0.2%
  Standard deviation < 4.5% of the graph size
  70X speedup compared to linear probing

Estimation on Twitter.com:
  Its public timeline service samples from 13M users.


3.1 How it works (Marchetti-Spaccamela 1989)

Forward walking: Find a random acyclic path via random walk.

Back tracing: From where the random walker stops, trace back to the originator via random walks on the reverse graph.


3.2 Datasets

Four social graphs from Mislove et al. 2007

Graph        Total Nodes  Total Links  Mean Degree  Users Crawled (Estimated)
Flickr         1,657,846   22,613,981         13.6  26.9%
LiveJournal    4,929,069   77,402,652         15.7  95.4%
Orkut          3,072,441  223,534,301         72.8  11.3%
YouTube        1,099,764    4,945,382          4.5  Unknown

267 citations according to Google Scholar as of May 26th, 2010.


3.3 Results on four OSNs


3.4 How many runs do we need?

Results on Flickr.com


3.5 Reducing the crawling cost

Starting with high-degree nodes


3.6 Generalizing RW to estimate other quantities

Refer to the paper for details.


4. Pitfalls

Things we could do better in the future.

Large variance and expensive back tracing
Poor lower and upper bounds for RW estimator
Reducing computation costs in simulations
  Tight data structures
  Accelerating frequently executed routines
  Improving data locality


Summary

Two estimators are introduced to estimate the size of OSNs.
An O(log n) algorithm is introduced to solve the MLE problem quickly.
The bias and variance of these two estimators are evaluated with real OSNs.
The RW estimator is generalized.


Thanks!
