How Many People Visit YouTube? Imputing Missing Events in Panels With Excess Zeros

Georg M. Goerg, Yuxue Jin, Nicolas Remy, Jim Koehler

Google, Inc.; United States

E-mail for correspondence: [email protected]

Abstract: Media-metering panels track TV and online usage of people to analyze viewing behavior. However, panel data is often incomplete due to non-registered devices, non-compliant panelists, or work usage. We thus propose a probabilistic model to impute missing events in data with excess zeros using a negative-binomial hurdle model for the unobserved events and beta-binomial sub-sampling to account for missingness. We then use the presented models to estimate the number of people in Germany who visit YouTube.

Keywords: imputation; missing data; zero inflation; panel data.

1 Introduction

Media panels (GfK Consumer Panels, 2013) are used by advertisers to estimate the reach and frequency of a campaign: reach is the fraction of the population that has seen an ad, and frequency is how often they have seen it on average. Good estimates from panel data matter because they largely determine the cost of an ad spot on TV or a website. Naïvely, one would estimate reach by the sample fraction of panelists with a non-zero number of events (website visits, TV spots watched, etc.) per unit time, and frequency analogously. This, however, underestimates both quantities, since panels often record only a fraction of all events due to, e.g., non-compliance or work usage. Correcting this bias and imputing missing events has been studied previously (Fader and Hardie, 2000; Yang et al., 2010).

In this work we i) extend the beta-binomial negative-binomial (BBNB) model (Hofler and Scrogin, 2008) with a hurdle component to improve modeling of excess zeros in panel data (§2); ii) present the maximum likelihood estimator (MLE) and also add prior information on missingness (§3); and iii) use the methodology to estimate, from online media panels and internal YouTube log files, how many people in Germany visit YouTube (§4). The proposed methodology applies to a wide variety of situations where events have been counted but some are known to be missing.

2 Hierarchical Event Imputation

Let $N_i \in \{0, 1, 2, \ldots\}$ count the true (but unobserved) number of visits by panelist $i$. The population consists of people who do not visit YouTube at all (with probability $q_0 \in [0, 1]$), and those who visit at least once. If a panelist visits (overcoming the “hurdle” with probability $1 - q_0$), we assume that $N_i$ is distributed according to a shifted Poisson distribution (starting at $n = 1$) with rate $\lambda_i$. To model heterogeneity among the population we use a $\mathrm{Gamma}\left(r, \frac{q_1}{1 - q_1}\right)$ prior for $\lambda_i$, with $r > 0$ and $q_1 \in (0, 1)$.

Overall, this yields a shifted negative binomial hurdle (NBH) distribution

$$P(N = n; q_0, q_1, r) = \begin{cases} q_0, & \text{if } n = 0, \\ (1 - q_0) \cdot \dfrac{\Gamma(n + r - 1)}{\Gamma(r)\,\Gamma(n)} \cdot (1 - q_1)^r \, q_1^{n-1}, & \text{if } n \geq 1. \end{cases} \tag{1}$$

We choose a hurdle, rather than a mixture, model for the excess zeros (Hu et al., 2011), since $1 - q_0$ can be directly interpreted as the true, but unobserved, 1+ reach: if an advertiser shows an ad on YouTube they can expect that a fraction $1 - q_0$ of the population sees it at least once.

Let $p_i$ be the probability that a visit of user $i$ is recorded in the panel. Assuming independence across visits, the total number of recorded panel events, $K_i \in \{0, 1, 2, \ldots\}$, thus follows a binomial distribution, $K_i \sim \mathrm{Bin}(N_i, p_i)$. To account for heterogeneity across the population we assume $p_i \sim \mathrm{Beta}(\mu, \phi)$, with mean $\mu$ and precision $\phi$ (Ferrari and Cribari-Neto, 2004). Here $\mu$ represents the expected non-missing rate and $\phi$ the (inverse) variation across the population. Integrating out $p_i$ gives a Beta-Binomial (BB) distribution,

$$K_i \mid N_i \sim \mathrm{BB}(N_i; \mu, \phi). \tag{2}$$
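As a minimal sketch, equation (1) can be evaluated in log space with the standard library alone. The function name `nbh_pmf` and the parameter values in the usage lines are illustrative assumptions, not part of the paper; the code assumes $q_0 < 1$, $0 < q_1 < 1$, and $r > 0$.

```python
import math


def nbh_pmf(n, q0, q1, r):
    """P(N = n) under the shifted negative-binomial hurdle model, equation (1).

    Assumes 0 <= q0 < 1, 0 < q1 < 1 and r > 0 (hypothetical helper).
    """
    if n < 0:
        return 0.0
    if n == 0:
        return q0  # point mass for panelists who never visit
    # Work with log-gamma to avoid overflow for large n.
    log_p = (
        math.log1p(-q0)
        + math.lgamma(n + r - 1) - math.lgamma(r) - math.lgamma(n)
        + r * math.log1p(-q1)
        + (n - 1) * math.log(q1)
    )
    return math.exp(log_p)


# Illustrative parameter values only.
total = sum(nbh_pmf(n, q0=0.3, q1=0.5, r=1.5) for n in range(200))
print(abs(total - 1.0) < 1e-9)  # the pmf sums to one up to truncation error
```

With $r = 1$ the positive part reduces to a shifted geometric distribution, which gives a quick sanity check of the implementation.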

Combining (1) and (2) yields a hierarchical beta-binomial negative-binomial hurdle (BBNBH) imputation model with parameter vector $\theta = (\mu, \phi, q_0, r, q_1)$:

$$N_i \sim \mathrm{NBH}(N; q_0, r, q_1) \quad \text{and} \quad K_i \mid N_i \sim \mathrm{BB}(K \mid N_i; \mu, \phi). \tag{3}$$

2.1 Joint Distribution

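The probability mass function (pmf) of (2) can be written as

$$g(k \mid n; \mu, \phi) = \binom{n}{k} \frac{\Gamma(\phi)}{\Gamma(n + \phi)} \, \frac{\Gamma(k + \phi\mu)\,\Gamma(n - k + (1 - \mu)\phi)}{\Gamma(\mu\phi)\,\Gamma(\phi(1 - \mu))}.$$

For $k = 0$ this reduces to

$$P(K = 0 \mid N = n; \mu, \phi) = \frac{\Gamma(\phi)}{\Gamma(n + \phi)} \times \frac{\Gamma(n + (1 - \mu)\phi)}{\Gamma(\phi(1 - \mu))}. \tag{4}$$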
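The beta-binomial pmf and its $k = 0$ special case (4) can be sketched as follows. The function names `bb_pmf` and `bb_pmf_zero` are hypothetical; the parameterization follows the text, with shape parameters $\mu\phi$ and $(1 - \mu)\phi$.

```python
import math


def bb_pmf(k, n, mu, phi):
    """g(k | n; mu, phi): beta-binomial pmf in the mean/precision form."""
    if k < 0 or k > n:
        return 0.0
    a, b = mu * phi, (1.0 - mu) * phi  # Beta shape parameters
    log_p = (
        math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        + math.lgamma(phi) - math.lgamma(n + phi)
        + math.lgamma(k + a) + math.lgamma(n - k + b)
        - math.lgamma(a) - math.lgamma(b)
    )
    return math.exp(log_p)


def bb_pmf_zero(n, mu, phi):
    """P(K = 0 | N = n), equation (4)."""
    b = (1.0 - mu) * phi
    return math.exp(
        math.lgamma(phi) - math.lgamma(n + phi)
        + math.lgamma(n + b) - math.lgamma(b)
    )


# The general pmf at k = 0 agrees with the reduced form (4).
print(abs(bb_pmf(0, 7, 0.272, 2.32) - bb_pmf_zero(7, 0.272, 2.32)) < 1e-12)
```

Since the mean of this distribution is $n\mu$, summing $k \cdot g(k \mid n)$ over $k$ gives another simple check.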


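Due to the zero hurdle it is useful to treat $N = 0$ and $N > 0$ separately:

$$P(N = n, K = k) = P(K = k \mid N = n) \cdot P(N = n) = \mathrm{BB}(k \mid n; \mu, \phi) \cdot \mathrm{NBH}(n; q_0, q_1, r). \tag{5}$$

For $n = 0$, (5) is non-zero only for $k = 0$, $P(N = 0, K = 0) = q_0$, since $P(K > N) = 0$. For $n > 0$,

$$P(N = n, K = k) = (1 - q_0) \times \frac{1}{B(\phi\mu, \phi(1 - \mu))} \, \frac{(1 - q_1)^r}{\Gamma(r)} \, \frac{\Gamma(k + \phi\mu)}{\Gamma(k + 1)} \times \frac{\Gamma(n - k + \phi(1 - \mu))}{\Gamma(n - k + 1)} \, \frac{\Gamma(n + r - 1)}{\Gamma(n + \phi)} \, q_1^{n-1} \times \frac{\Gamma(n + 1)}{\Gamma(n)}. \tag{6}$$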
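A minimal sketch of the joint pmf, implementing (6) directly in log space together with the point mass at $(n, k) = (0, 0)$. The function name `joint_pmf` and the parameter values are illustrative assumptions.

```python
import math


def joint_pmf(n, k, mu, phi, q0, q1, r):
    """P(N = n, K = k): q0 at (0, 0), equation (6) for n >= 1 and 0 <= k <= n."""
    if n == 0:
        return q0 if k == 0 else 0.0
    if k < 0 or k > n:
        return 0.0  # the panel cannot record more events than occurred
    a, b = mu * phi, (1.0 - mu) * phi
    log_beta = math.lgamma(a) + math.lgamma(b) - math.lgamma(phi)
    log_p = (
        math.log1p(-q0) - log_beta
        + r * math.log1p(-q1) - math.lgamma(r)
        + math.lgamma(k + a) - math.lgamma(k + 1)
        + math.lgamma(n - k + b) - math.lgamma(n - k + 1)
        + math.lgamma(n + r - 1) - math.lgamma(n + phi)
        + (n - 1) * math.log(q1)
        + math.log(n)  # Gamma(n + 1) / Gamma(n) = n
    )
    return math.exp(log_p)


# Illustrative values; the joint pmf sums to one over 0 <= k <= n.
theta = dict(mu=0.3, phi=2.0, q0=0.6, q1=0.5, r=0.8)
total = sum(joint_pmf(n, k, **theta) for n in range(121) for k in range(n + 1))
print(abs(total - 1.0) < 1e-9)
```

Summing `joint_pmf` over $k$ for a fixed $n$ recovers the NBH probability in (1), which is a useful consistency check between (5) and (6).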

2.2 Conditional Predictive Distribution For Imputation

The panel records $k_i$ events for panelist $i$, but we want to know how many events truly occurred. That is, we are interested in (dropping subscript $i$)

$$P(N = n \mid K = k) = \frac{P(K = k \mid N = n) \, P(N = n)}{P(K = k)}. \tag{7}$$

To obtain analytical expressions we consider $k = 0$ and $k > 0$ separately:

k = 0: Either none truly happened ($n = 0$) or a panelist visited at least once ($n > 0$), but none were recorded.

n = 0:

$$P(N = 0 \mid K = 0) = \frac{q_0}{P(K = 0)}. \tag{8}$$

n > 0:

$$P(N = n \mid K = 0) = \frac{1}{P(K = 0)} \times \frac{\Gamma(\phi)}{\Gamma(n + \phi)} \, \frac{\Gamma(n + \phi(1 - \mu))}{\Gamma(\phi(1 - \mu))} \times (1 - q_0) \, \frac{\Gamma(n + r - 1)}{\Gamma(n)} \, \frac{(1 - q_1)^r}{\Gamma(r)} \, q_1^{n-1},$$

where the second term comes from (4).

k > 0: The zero “hurdle” for $N$ has been surpassed for sure.

n < k: By construction of binomial subsampling,

$$P(N = n \mid K = k) = 0 \quad \text{for all } n < k. \tag{9}$$

n ≥ k: Here

$$P(N = n \mid K = k) = n \cdot q_1^{n-1} \, \frac{\Gamma(n - k + (1 - \mu)\phi) \, \Gamma(n + r - 1)}{\Gamma(n - k + 1) \, \Gamma(n + \phi)} \times \left( \sum_{m=0}^{\infty} (m + k) \, \frac{\Gamma(m + \phi(1 - \mu))}{\Gamma(m + 1)} \, \frac{\Gamma(m + k + r - 1)}{\Gamma(m + k + \phi)} \, q_1^{m+k-1} \right)^{-1}.$$
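The case analysis above can be sketched numerically by truncating the infinite sum at a finite `n_max`. The truncation point, the function names, and the parameter values are assumptions made for illustration; `n_max` must be large enough that $q_1^{n}$ is negligible beyond it.

```python
import math


def log_kernel(n, k, mu, phi, q1, r):
    """Log of the n-dependent factor of P(N = n, K = k) for n >= max(k, 1)."""
    b = (1.0 - mu) * phi
    return (
        math.log(n) + (n - 1) * math.log(q1)
        + math.lgamma(n - k + b) - math.lgamma(n - k + 1)
        + math.lgamma(n + r - 1) - math.lgamma(n + phi)
    )


def predictive_pmf(k, mu, phi, q0, q1, r, n_max=5000):
    """Truncated P(N = n | K = k) for n = 0, ..., n_max, returned as a list."""
    weights = [0.0] * (n_max + 1)  # entries with n < k stay zero
    for n in range(max(k, 1), n_max + 1):
        weights[n] = math.exp(log_kernel(n, k, mu, phi, q1, r))
    if k == 0:
        # Constants only matter for k = 0, where they compete with q0 at n = 0.
        b = (1.0 - mu) * phi
        const = (1.0 - q0) * math.exp(
            r * math.log1p(-q1) - math.lgamma(r)
            + math.lgamma(phi) - math.lgamma(b)
        )
        weights = [const * w for w in weights]
        weights[0] = q0
    total = sum(weights)
    return [w / total for w in weights]


def predictive_mean(k, **kwargs):
    """E(N | K = k) under the truncated predictive distribution."""
    return sum(n * p for n, p in enumerate(predictive_pmf(k, **kwargs)))


# Illustrative values only.
theta = dict(mu=0.3, phi=2.0, q0=0.6, q1=0.5, r=0.8)
print(predictive_mean(2, n_max=300, **theta) >= 2.0)  # at least the observed count
```

For $k > 0$ all constant factors cancel in the normalization, so only the kernel is needed; for $k = 0$ the constants are kept because they set the balance between the hurdle mass $q_0$ and the unrecorded visits.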


Parameter   Estimate   Std. Err.   t value    Pr(>|t|)
µ           0.272      -           -          -
q0          0.641      0.016       38.858     0.000
q1          0.982      0.002       494.105    0.000
r           0.252      0.021       11.811     0.000
φ           2.320      0.594       3.907      0.000

TABLE 1: MLE for θ for panel data on YouTube visits in Germany. No standard error is reported for µ, which is fixed a priori (§3.1).

3 Parameter Estimation

Let $\mathbf{k} = \{k_1, \ldots, k_P\}$ be the number of observed events for all $P$ panelists. Each panelist also has socio-economic indicators such as gender, age, and income. These attributes determine their demographic weight $\tilde{w}_i$, which equals the number of people in the entire population that panelist $i$ represents. Finally, let $w_i = \tilde{w}_i \cdot P / \sum_{i=1}^{P} \tilde{w}_i$ be the re-scaled weight of panelist $i$ such that $\sum_{i=1}^{P} w_i$ equals the sample size $P$.

We estimate $\theta$ using maximum likelihood (MLE), $\widehat{\theta} = \arg\max_{\theta \in \Theta} \ell(\theta; x)$, where the log-likelihood is

$$\ell(\theta; x) = \sum_{\{k \mid x_k > 0\}} x_k \cdot \log P(K = k; \theta), \tag{10}$$

and $x = \{x_k \mid k = 0, 1, \ldots, \max(\mathbf{k})\}$, where $x_k = \sum_{\{i \mid k_i = k\}} w_i$ is the total weight of all panelists with $k$ visits.

For deriving closed form expressions of $P(K = k) = \sum_{n=0}^{\infty} P(N = n, K = k)$ it is simpler to consider $k = 0$ and $k > 0$ separately:

$$P(K = 0) = q_0 + (1 - q_0) \times \frac{(1 - q_1)^r}{\Gamma(r)} \, \frac{\Gamma(\phi)}{\Gamma(\phi(1 - \mu))} \times \sum_{n=0}^{\infty} \frac{\Gamma(n + 1 + \phi(1 - \mu))}{\Gamma(n + 1)} \, \frac{\Gamma(n + r)}{\Gamma(n + 1 + \phi)} \, q_1^{n}, \tag{11}$$

and for $k > 0$,

$$P(K = k) = (1 - q_0)(1 - q_1)^r \times \frac{\Gamma(\phi)}{\Gamma(\mu\phi)\,\Gamma(\phi(1 - \mu))} \, \frac{1}{\Gamma(r)} \, \frac{\Gamma(k + \mu\phi)}{\Gamma(k + 1)} \times \sum_{m=0}^{\infty} (m + k) \, \frac{\Gamma(m + \phi(1 - \mu))}{\Gamma(m + 1)} \, \frac{\Gamma(m + k + r - 1)}{\Gamma(m + k + \phi)} \, q_1^{m+k-1}. \tag{12}$$
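The marginal probabilities (11) and (12) and the weighted log-likelihood (10) can be sketched as below. The infinite sums are truncated at `n_terms`, which is an assumption of this sketch (it should be large when $q_1$ is close to 1); the function names and toy data are hypothetical, and the maximization itself is left to any generic numerical optimizer.

```python
import math
from collections import defaultdict


def marginal_pmf(k, mu, phi, q0, q1, r, n_terms=5000):
    """P(K = k) via the truncated sums in equations (11) and (12)."""
    a, b = mu * phi, (1.0 - mu) * phi
    base = math.log1p(-q0) + r * math.log1p(-q1) - math.lgamma(r) + math.lgamma(phi)
    if k == 0:
        const = base - math.lgamma(b)
        series = sum(
            math.exp(const + math.lgamma(n + 1 + b) - math.lgamma(n + 1)
                     + math.lgamma(n + r) - math.lgamma(n + 1 + phi)
                     + n * math.log(q1))
            for n in range(n_terms)
        )
        return q0 + series
    const = (base - math.lgamma(a) - math.lgamma(b)
             + math.lgamma(k + a) - math.lgamma(k + 1))
    return sum(
        math.exp(const + math.log(m + k) + math.lgamma(m + b) - math.lgamma(m + 1)
                 + math.lgamma(m + k + r - 1) - math.lgamma(m + k + phi)
                 + (m + k - 1) * math.log(q1))
        for m in range(n_terms)
    )


def weighted_loglik(counts, demo_weights, n_terms=5000, **theta):
    """Equation (10): weights are re-scaled to sum to the sample size P."""
    scale = len(counts) / sum(demo_weights)
    x = defaultdict(float)  # x[k] = total re-scaled weight with k visits
    for k_i, w in zip(counts, demo_weights):
        x[k_i] += w * scale
    return sum(x_k * math.log(marginal_pmf(k, n_terms=n_terms, **theta))
               for k, x_k in x.items() if x_k > 0)


# Toy data, invented for illustration only.
theta = dict(mu=0.3, phi=2.0, q0=0.6, q1=0.5, r=0.8)
print(weighted_loglik([0, 0, 1], [1.0, 1.0, 2.0], n_terms=300, **theta) < 0.0)
```

Grouping panelists by their observed count means each $P(K = k)$ is evaluated once per distinct $k$ rather than once per panelist.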


[Figure 1 panels: Beta(p; µ = 0.27, φ = 2.3), i.e. α = 0.63, β = 1.7, pdf against non-missingness rate; P(N <= n; r = 0.25, q1 = 0.98, q0 = 0.64), cdf against true counts (N), with q0 = 64% marked; P(K <= k; θ), empirical and model cdf against observed counts (K), log-likelihood -6466.69; P(N = n | K = k; θ), pmf against true counts (N) for K = 0 and K = 2, with E(N|K=0) = 1.02 and E(N|K=2) = 13.12.]

FIGURE 1: Model estimates for: (top left) true counts $N_i$; (top right) non-missing rate $p_i$; (bottom left) empirical count frequency and model fit; (bottom right) conditional predictive distributions and expectations.

3.1 Fix expected non-missing rate µ

Usually, researchers must estimate all 5 parameters from panel data. For our application, though, we can estimate (and fix) the non-missing rate $\mu$ a priori as we have access to internal YouTube log files. Let $\bar{k}_{\tilde{W}} = \sum_{i=1}^{P} \tilde{w}_i k_i$ be the observed panel visits projected to the entire population. Analogously, let $\bar{N}_{\tilde{W}} = \sum_{i=1}^{P} \tilde{w}_i N_i$ be the panel projection of the number of true YouTube visits. While any single $N_i$ is unobservable, we can estimate $\bar{N}_{\tilde{W}}$ by simply counting all YouTube homepage views in Germany from our YouTube log files, yielding $\widehat{\bar{N}}_{\tilde{W}}$. We herewith obtain a plug-in estimate of the non-missing rate, $\widehat{\mu}_{\text{Logs}} = \bar{k}_{\tilde{W}} / \widehat{\bar{N}}_{\tilde{W}}$. The remaining 4 parameters, $\theta_{(-\mu)} = (\phi, q_0, r, q_1)$, can be obtained by MLE, $\widehat{\theta}_{(-\mu)} = \arg\max_{\theta_{(-\mu)}} \ell((\widehat{\mu}_{\text{Logs}}, \theta_{(-\mu)}); x)$. The overall estimate is $\widehat{\theta} = (\widehat{\mu}_{\text{Logs}}, \widehat{\theta}_{(-\mu)})$.
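The plug-in step amounts to a ratio of two population-level totals. The sketch below uses invented toy numbers; the function name `plug_in_mu` and the argument `logged_visits` (standing in for the log-based count of true visits) are hypothetical.

```python
def plug_in_mu(counts, demo_weights, logged_visits):
    """Plug-in non-missing rate: projected panel visits over logged visits.

    counts        observed panel events k_i per panelist
    demo_weights  demographic weights (people represented by each panelist)
    logged_visits log-based estimate of the true number of visits (assumed given)
    """
    if len(counts) != len(demo_weights):
        raise ValueError("counts and demo_weights must have the same length")
    projected = sum(w * k for w, k in zip(demo_weights, counts))
    mu_hat = projected / logged_visits
    if not 0.0 < mu_hat < 1.0:
        raise ValueError("estimated non-missing rate must lie in (0, 1)")
    return mu_hat


# Toy numbers, invented for illustration only.
counts = [0, 2, 5]
demo_weights = [1000.0, 2000.0, 1000.0]
mu_hat = plug_in_mu(counts, demo_weights, logged_visits=33000.0)
print(round(mu_hat, 4))  # 9000 / 33000, prints 0.2727
```

The resulting value would then be held fixed while the log-likelihood is maximized over the remaining four parameters.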

4 Estimating YouTube Audience in Germany

Here we use data from a German online panel (GfK Consumer Panels, 2013), which monitors web usage of $P = 6{,}545$ individuals in October 2013 (31 days). In particular, we are interested in the probability that an adult in Germany visited the YouTube homepage www.youtube.de. Empirically, $\widehat{P}(K = 0) = 0.81$, yielding 19% observed 1+ reach. However, we know by comparison to YouTube log files that the panel only recorded 27.2% of all impressions. We fix the expected non-missing rate at $\widehat{\mu} = 0.272$ and obtain the remaining parameters via MLE (Table 1). Figure 1 shows the model fit for the true, observed, and predictive distributions. In particular, the true 1+ reach is 36% ($\widehat{q}_0 = 0.64$), not 19% as the naïve estimate suggests.

5 Discussion

We introduce a probabilistic framework to impute missing events in count data, including a hurdle component that adds flexibility for modeling excess zeros. Researchers can use our models to obtain probabilistic predictions of the number of true, unobserved events. We apply our methodology to estimate how many people in Germany visit YouTube.

Acknowledgments: We want to thank Christoph Best, Penny Chu, Tony Fagan, Yijia Feng, Oli Gaymond, Simon Morris, Raimundo Mirisola, Andras Orban, Simon Rowe, Sheethal Shobowale, Yunting Sun, Wiesner Vos, Xiaojing Wang, and Fan Zhang for constructive discussions and feedback.

References

Fader, P. and Hardie, B. (2000). A note on modelling underreported Poisson counts. Journal of Applied Statistics, 27(8):953–964.

Ferrari, S. and Cribari-Neto, F. (2004). Beta Regression for Modelling Rates and Proportions. Journal of Applied Statistics, 31(7):799–815.

GfK Consumer Panels (2013). Media Efficiency Panel.

Hofler, R. A. and Scrogin, D. (2008). A count data frontier model. Technical report, University of Central Florida.

Hu, M., Pavlicova, M., and Nunes, E. (2011). Zero-inflated and hurdle models of count data with extra zeros: examples from an HIV-risk reduction intervention trial. Am J Drug Alcohol Abuse, 37(5):367–75.

Rose, C., Martin, S., Wannemuehler, K., and Plikaytis, B. (2006). On the use of zero-inflated and hurdle models for modeling vaccine adverse event count data. J Biopharm Stat, 16(4):463–81.

Schmittlein, D. C., Bemmaor, A. C., and Morrison, D. G. (1985). Why Does the NBD Model Work? Robustness in Representing Product Purchases, Brand Purchases and Imperfectly Recorded Purchases. Marketing Science, 4(3):255–266.

Yang, S., Zhao, Y., and Dhar, R. (2010). Modeling the underreporting bias in panel survey data. Marketing Science, 29(3):525–539.
