
Convergence of Pseudo Posterior Distributions under Informative Sampling Terrance D. Savitsky∗, Daniell Toth∗ January 1, 2016

Abstract An informative sampling design assigns probabilities of inclusion that are correlated with the response of interest and induces a dependence among sampled observations. Unadjusted model-based inference performed on data acquired under an informative sampling design can be biased concerning parameters of the population generating distribution if the sample design is not accounted for in the model. Known marginal inclusion probabilities may be used to weight the likelihood contribution of each observed unit to form a “pseudo” posterior distribution with the intent to adjust for the design. This article extends a theoretical result on the consistency of the posterior distribution, defined on an analyst-specified model space, at the true generating distribution to the sampling-weighted pseudo posterior distribution used to account for an informative sampling design. We construct conditions on known marginal and pairwise inclusion probabilities that define a class of sampling designs where consistency of the pseudo posterior is achieved, in probability. We demonstrate the result on an application concerning the Bureau of Labor Statistics Job Openings and Labor Turnover Survey.

∗ U.S. Bureau of Labor Statistics, 2 Massachusetts Ave. N.E, Washington, D.C. 20212 USA

Key words: Bayesian hierarchical models; Survey sampling; Pseudo posterior distribution; Posterior consistency; Markov Chain Monte Carlo; Job Openings and Labor Turnover Survey.

1 Motivation

Let ν ∈ Z+ index a sequence of finite populations, {Uν}ν=1,...,Nν, each of size, |Uν| = Nν, such that Nν < Nν′, for ν < ν′, so that the finite population size grows as ν increases. Suppose that Xν,1, . . . , Xν,Nν are independent and identically distributed according to some unknown distribution P (with density, p), defined on the sample space, (X, A). If Π is a prior distribution on the model space, (P, C), to which P is known to belong, then the posterior distribution is given by

\[ \Pi\left(B \mid X_1, \ldots, X_{N_\nu}\right) = \frac{\int_{P \in B} \prod_{i=1}^{N_\nu} \frac{p}{p_0}(X_i)\, d\Pi(P)}{\int_{P \in \mathcal{P}} \prod_{i=1}^{N_\nu} \frac{p}{p_0}(X_i)\, d\Pi(P)}, \qquad (1) \]

for any B ∈ C, where we refer to {Xν,i}i=1,...,Nν as {Xi}i=1,...,Nν for readability when the context is clear. Ghosal and van der Vaart (2007) study the rate at which this posterior distribution converges to the assumed true (and fixed) generating distribution P0. They prove, under certain conditions on the model space, P, and the prior distribution, Π, that in P0-probability, the posterior distribution concentrates on an arbitrarily small neighborhood of P0 as Nν ↑ ∞.

The observed data on which we focus are not the entire finite population, X1, . . . , XNν, but rather a sample, X1, . . . , Xnν, with nν ≤ Nν, drawn under a sampling design distribution applied to the finite population under which each unit, i ∈ (1, . . . , Nν), is assigned a probability of inclusion in the sample. These unit inclusion probabilities are constructed to depend on the realized finite population values, X1, . . . , XNν, at each ν. The posterior distribution estimated on the resulting sample observations, without accounting for the informative sampling design, will generally be different from that estimated on the fully-observed finite population.

Informative sampling designs are widely employed among government agencies and by academic researchers to produce statistics that support policy-making in diverse fields, such as economic forecasting, public health, and education; for example, the U.S. Bureau of Labor Statistics (BLS) utilizes a survey instrument administered to business establishments to produce geographically- and industry-indexed statistics for total employment and total wages. Sampling design distributions for employment surveys typically seek to assign relatively higher inclusion probabilities to those establishments with a relatively larger number of employees since these larger establishments account for more of the total variation in the employment and wage statistics. The correlation between inclusion probabilities and the number of employees for establishments is typically induced by using a fully observed auxiliary variable, known for all units in the finite population and generally known to be correlated with the number of employees, to generate the inclusion probabilities; for example, the total revenue for each establishment is often used.

We devise new conditions on the sampling design distribution that, together, define an allowable class of sampling designs used to construct a frequentist L1 contraction rate of a sampling-weighted approximation of the posterior distribution, estimated on only the observed sample data realized under an informative design, to the true generating measure, P0. To our knowledge, this paper is the first to provide such theoretical guarantees for the consistency of the sampling-weighted Bayesian probability model, defined on the broad class of models, P ∈ P, addressed by Ghosal and van der Vaart (2007), under informative sampling.
Our approach further demonstrates that the sampling-weighted approximation to the posterior distribution estimated on the observed sample contracts on P0 through a path that converges to the posterior distribution estimated on the entire finite population (that we only partially observe in our sample). We accomplish our result by substantially extending Ghosal and van der Vaart (2007) and Wong and Shen (1995) using our new conditions. We start with the Ghosal and van der Vaart (2007) result for independent, non-identically distributed (inid) data and perform our extension to a dependent sample that is realized under the joint distribution of finite population generation and the taking of an informative sample.

Recent works that address the theoretical consistency of sampling-derived estimators focus on empirical likelihood constructions that may be used to generate draws of the unobserved finite population, labeled “pseudo” populations (Pfeffermann and Sverchkov 2009), from which associated sampling-weighted mean and total estimators (and their associated variances) are constructed (e.g. the Horvitz-Thompson estimator (Särndal et al. 2003)). All of these approaches implicitly focus on the “design” consistency under the sampling distribution for the resulting sampling-weighted population statistics, rather than our focus on “model” consistency of estimated parameters for an analyst-specified model under the joint distribution of population generation and the taking of informative samples. Rao and Wu (2010) compose a “pseudo” posterior distribution from a sampling-weighted empirical likelihood and define conditions on the sampling design for its convergence. Kunihama et al. (2014) replace the empirical distribution with a flexible Bayesian non-parametric mixture to compose their pseudo posterior. Their focus is to generate pseudo populations from which the analyst may construct sampling-weighted mean and total estimators, rather than inference on the population model parameters. Breslow and Wellner (2007) formulate a sampling-weighted empirical distribution and show it is consistent at the true generating distribution, P0, under a single sampling design: a stratified design with simple random samples taken within each stratum. Their approach applies the consistency result for the weighted bootstrap estimator of Praestgaard and Wellner (1993).

Our formulation for the pseudo posterior in the sequel exponentiates the likelihood contribution for each observation (under a model specified by the data analyst) by the associated sampling weight to form a plug-in, noisy estimator of the posterior distribution defined on the finite population.
We show that this plug-in estimator allows the analyst to perform asymptotically unbiased inference from their population generating model with no required change in the parameterization to account for the informative sampling design. Our construction for the pseudo posterior accommodates the event where the analyst does not know the sampling design to parameterize it into the model or where doing so may conflict with desired inference.

We next formally introduce notation to describe a sampling design and the associated distribution in Section 2, from which we construct a (sampling-weighted) pseudo posterior distribution on the observed sample using P ∈ P. We follow by stating the conditions for our main result on the contraction rate of the pseudo posterior distribution on P0 in Section 3. Added conditions on the sampling distribution are derived from required enabling results in Section 4. We apply the pseudo posterior estimator to perform a regression analysis on data collected under a monthly sample of job hires and separations administered to business establishments by the Bureau of Labor Statistics (BLS) in Section 5. We reveal large differences in parameter estimates between incorporating and ignoring the sampling weights. This section also includes a simulation study that compares the pseudo posterior estimated on the observed sample to the posterior estimated on the entire finite population. The paper concludes with a discussion in Section 6. The proofs for the main result and two enabling results are contained in an Appendix.

2 Pseudo Posterior Distribution

The observed data set is a sample from each finite population, Uν, under a known survey sampling design that may induce dependence among sampled units and between the response of interest and probabilities for inclusion among units of Uν. A sampling design is defined by placing a known distribution on a vector of inclusion indicators, δν = (δν1, . . . , δνNν), linked to the units comprising the population, Uν. The sampling distribution is subsequently used to take an observed random sample of size nν ≤ Nν. In general, each δνi is integer-valued, as a population unit may be included multiple times (or not at all) under sampling with replacement. The statement and discussion of our main result, to follow, is formulated under sampling without replacement, where each population unit may be included only once, so that δνi ∈ {0, 1}. We offer comments, along the way, about generalization to sampling with replacement. The joint distribution over (δν1, . . . , δνNν) is described by known marginal unit inclusion probabilities, πνi = Pr{δνi = 1} for all i ∈ Uν and the second-order pairwise probabilities, πνij = Pr{δνi = 1 ∩ δνj = 1} for i, j ∈ Uν. The dependence among unit inclusions in the sample contrasts with the usual iid draws from P. We denote the sampling distribution by Pν.

A prior belief about each density, p, for measure, P, in the space, P, is specified as Π(P) in a probability model for the populations, {Uν}ν∈Z+, from which we wish to perform inference for P0 using observed data collected from an informative sample of size nν. Under informative sampling, the marginal inclusion probabilities, πνi = Pr{δνi = 1}, i ∈ (1, . . . , Nν), are formulated to depend on the finite population data values, XNν = (X1, . . . , XNν). Since the resulting balance of information would be different in the sample (which will be skewed, relative to the finite population, towards inclusion of larger establishments), the posterior distribution for the incompletely observed, (X1δν1, . . . , XNνδνNν), that we employ for inference about P0, is not equal to that of Equation 1.

Our task is to perform inference about the population generating distribution, P0, using the observed data taken under an informative sampling design without replacement. We account for informative sampling by “undoing” the sampling design with the weighted estimator,

\[ p^{\pi}\left(X_i \delta_{\nu i}\right) := p\left(X_i\right)^{\delta_{\nu i}/\pi_{\nu i}}, \quad i \in U_\nu, \qquad (2) \]

which weights each density contribution, p(Xi), by the inverse of its marginal inclusion probability. This construction re-weights the likelihood contributions defined on those units randomly-selected for inclusion in the observed sample ({i ∈ Uν : δνi = 1}) to approximate the balance of information in Uν. This approximation for the population likelihood is referred to as the pseudo likelihood (Chambers and Skinner 2003), from which we state the associated pseudo posterior,

\[ \Pi^{\pi}\left(B \mid X_1\delta_{\nu 1}, \ldots, X_{N_\nu}\delta_{\nu N_\nu}\right) = \frac{\int_{P \in B} \prod_{i=1}^{N_\nu} \frac{p^{\pi}}{p_0^{\pi}}(X_i\delta_{\nu i})\, d\Pi(P)}{\int_{P \in \mathcal{P}} \prod_{i=1}^{N_\nu} \frac{p^{\pi}}{p_0^{\pi}}(X_i\delta_{\nu i})\, d\Pi(P)}, \qquad (3) \]

that we use to achieve our required conditions for the rate of contraction of the pseudo posterior distribution on P0. We recall that both P and δν are random variables defined on the space of measures and possible samples, respectively. Additional conditions are later formulated for the distribution over samples, Pν, drawn under the known sampling design, to achieve contraction of the pseudo posterior on P0. We assume measurability for the sets on which we compute prior, posterior and pseudo posterior probabilities on the joint product space, X × P. For brevity, we use the superscript, π, to denote the dependence on the known sampling probabilities, {πνi}i=1,...,Nν; for example, Ππ(B|X1δν1, . . . , XNνδνNν) := Π(B|(X1δν1, . . . , XNνδνNν), (πν1, . . . , πνNν)). Our main result is achieved in the limit as ν ↑ ∞, under the countable set of successively larger-sized populations, {Uν}ν∈Z+. We define the associated rate of convergence notation, O(bν), to denote limν↑∞ O(bν)/bν = 0, as in Bonnéry et al. (2013).
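The effect of the weighting in Equation 2 can be illustrated with a small simulation. The sketch below is not taken from the paper: it assumes a Normal model with known variance and a conjugate Normal prior on the mean (so that the pseudo posterior has a closed form), and it assumes independent (Poisson) inclusion of units with a hypothetical logistic inclusion rule; the function name and all numerical settings are illustrative choices.

```python
import numpy as np

def normal_pseudo_posterior(x, weights, sigma=1.0, prior_mean=0.0, prior_sd=10.0):
    """Pseudo posterior of a Normal mean when each sampled density contribution
    N(x_i | mu, sigma^2) is raised to the power weights[i]. Returns (mean, sd)."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(weights, dtype=float)
    # Exponentiating a Normal kernel by w_i multiplies its precision by w_i.
    precision = 1.0 / prior_sd**2 + w.sum() / sigma**2
    mean = (prior_mean / prior_sd**2 + (w * x).sum() / sigma**2) / precision
    return mean, precision ** -0.5

rng = np.random.default_rng(0)
N, mu0 = 50_000, 2.0                        # finite population size and true mean
x_pop = rng.normal(mu0, 1.0, size=N)        # population generated iid from P0

# Hypothetical informative design: larger responses are more likely to be sampled.
pi = 0.02 + 0.08 / (1.0 + np.exp(-(x_pop - mu0)))
delta = rng.random(N) < pi                  # independent inclusion indicators
x_s, pi_s = x_pop[delta], pi[delta]

weighted_mean, weighted_sd = normal_pseudo_posterior(x_s, 1.0 / pi_s)
unweighted_mean, unweighted_sd = normal_pseudo_posterior(x_s, np.ones_like(x_s))
print(f"pseudo posterior mean: {weighted_mean:.3f}, unweighted: {unweighted_mean:.3f}")
```

Under these assumptions the unweighted posterior centers well above the true mean, because high-valued units are over-represented in the sample, while the weighted version centers near it.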

2.1 Empirical process functionals

We employ the empirical distribution approximation for the joint distribution over population generation and the draw of an informative sample that produces our observed data to formulate our results. Our empirical distribution construction follows Breslow and Wellner (2007) and incorporates inverse inclusion probability weights, {1/πνi}i=1,...,Nν, to account for the informative sampling design,

\[ \mathbb{P}^{\pi}_{N_\nu} = \frac{1}{N_\nu} \sum_{i=1}^{N_\nu} \frac{\delta_{\nu i}}{\pi_{\nu i}}\, \delta(X_i), \qquad (4) \]

where δ(Xi) denotes the Dirac delta function, with probability mass 1 on Xi, and we recall that Nν = |Uν| denotes the size of the finite population. This construction contrasts with the usual empirical distribution, $\mathbb{P}_{N_\nu} = \frac{1}{N_\nu} \sum_{i=1}^{N_\nu} \delta(X_i)$, used to approximate P ∈ P, the distribution hypothesized to generate the finite population, Uν.

We follow the notational convention of Ghosal et al. (2000) and define the associated expectation functionals with respect to these empirical distributions by $\mathbb{P}^{\pi}_{N_\nu} f = \frac{1}{N_\nu} \sum_{i=1}^{N_\nu} \frac{\delta_{\nu i}}{\pi_{\nu i}} f(X_i)$. Similarly, $\mathbb{P}_{N_\nu} f = \frac{1}{N_\nu} \sum_{i=1}^{N_\nu} f(X_i)$. Lastly, we use the associated centered empirical processes, $\mathbb{G}^{\pi}_{N_\nu} = \sqrt{N_\nu}\left(\mathbb{P}^{\pi}_{N_\nu} - P_0\right)$ and $\mathbb{G}_{N_\nu} = \sqrt{N_\nu}\left(\mathbb{P}_{N_\nu} - P_0\right)$.
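The two expectation functionals can be computed directly. The following is a minimal sketch on a toy population of four units; the function names, the toy values, and the assumption of independent inclusion with probability 0.5 per unit are illustrative and not part of the paper. Enumerating all possible samples shows that, for fixed population values, the weighted functional averages to the population functional over the design.

```python
import itertools
import numpy as np

def weighted_empirical_functional(f_values, delta, pi):
    """(1/N) * sum_i (delta_i / pi_i) * f(X_i) over all N population units."""
    f_values = np.asarray(f_values, dtype=float)
    delta = np.asarray(delta, dtype=float)
    pi = np.asarray(pi, dtype=float)
    return np.sum(delta / pi * f_values) / f_values.size

def empirical_functional(f_values):
    """(1/N) * sum_i f(X_i), the usual (fully observed) empirical functional."""
    return float(np.mean(f_values))

x = np.array([1.0, 2.0, 3.0, 4.0])   # f(X_i) with f the identity
pi = np.full(4, 0.5)                 # assumed marginal inclusion probabilities
delta = np.array([1, 0, 1, 0])       # one realized sample: units 1 and 3

p_pi = weighted_empirical_functional(x, delta, pi)   # (2*1 + 2*3) / 4
p_n = empirical_functional(x)

# Under independent inclusion with probability 0.5, all 16 samples are equally
# likely, so the design expectation is the plain average over the enumeration.
samples = itertools.product([0, 1], repeat=x.size)
design_mean = np.mean(
    [weighted_empirical_functional(x, np.array(s), pi) for s in samples]
)
print(p_pi, p_n, design_mean)
```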

The sampling-weighted (average) pseudo Hellinger distance between distributions, P1, P2 ∈ P, is $d^{\pi,2}_{N_\nu}(p_1, p_2) = \frac{1}{N_\nu} \sum_{i=1}^{N_\nu} \frac{\delta_{\nu i}}{\pi_{\nu i}}\, d^2\left(p_1(X_i), p_2(X_i)\right)$, where $d(p_1, p_2) = \left[\int \left(\sqrt{p_1} - \sqrt{p_2}\right)^2 d\mu\right]^{1/2}$ (for dominating measure, µ). We need this empirical average distance metric because the observed (sample) data drawn from the finite population under Pν are no longer independent or identically distributed. The implication is that our results apply to finite populations generated as inid from which informative samples are taken. The associated non-sampling Hellinger distance is specified with $d^2_{N_\nu}(p_1, p_2) = \frac{1}{N_\nu} \sum_{i=1}^{N_\nu} d^2\left(p_1(X_i), p_2(X_i)\right)$.

3 Main result

We proceed to construct a theorem and associated conditions that contain our main result on the consistency of the pseudo posterior distribution under a class of informative sampling designs at the true generating distribution, P0. Our approach extends the main in-probability convergence result of Ghosal and van der Vaart (2007) by adding new conditions that restrict the distribution of the informative sampling design. We next specify the conditions that will be used in our main and enabling results, presented in the sequel. Suppose we have a sequence, $\xi_{N_\nu} \downarrow 0$ and $N_\nu \xi_{N_\nu}^2 \uparrow \infty$ and $n_\nu \xi_{N_\nu}^2 \uparrow \infty$ as ν ∈ Z+ ↑ ∞ and any constant, C > 0,

(A1) (Local entropy condition - Size of model)
\[ \sup_{\xi > \xi_{N_\nu}} \log N\left(\xi/36, \{P \in \mathcal{P}_{N_\nu} : d_{N_\nu}(P, P_0) < \xi\}, d_{N_\nu}\right) \le N_\nu \xi_{N_\nu}^2, \]

(A2) (Size of space)
\[ \Pi\left(\mathcal{P} \backslash \mathcal{P}_{N_\nu}\right) \le \exp\left(-N_\nu \xi_{N_\nu}^2 \left(2(1 + 2C)\right)\right) \]

(A3) (Prior mass covering the truth)
\[ \Pi\left(P : -P_0 \log\frac{p}{p_0} \le \xi_{N_\nu}^2 \cap P_0 \left[\log\frac{p}{p_0}\right]^2 \le \xi_{N_\nu}^2\right) \ge \exp\left(-N_\nu \xi_{N_\nu}^2 C\right) \]

(A4) (Non-zero Inclusion Probabilities)
\[ \sup_{\nu} \left[\frac{1}{\min_{i \in U_\nu} \pi_{\nu i}}\right] \le \gamma, \text{ with } P_0\text{-probability } 1. \]

(A5) (Asymptotic Independence Condition)
\[ \limsup_{\nu \uparrow \infty} \max_{i \ne j \in U_\nu} \left|\frac{\pi_{\nu ij}}{\pi_{\nu i}\pi_{\nu j}} - 1\right| = O\left(N_\nu^{-1}\right), \text{ with } P_0\text{-probability } 1 \]
such that for some constant, C3 > 0,
\[ N_\nu \sup_{\nu} \max_{i \ne j \in U_\nu} \left|\frac{\pi_{\nu ij}}{\pi_{\nu i}\pi_{\nu j}} - 1\right| \le C_3, \text{ for } N_\nu \text{ sufficiently large.} \]

(A6) (Constant Sampling fraction) For some constant, f ∈ (0, 1), that we term the “sampling fraction”,
\[ \limsup_{\nu} \left|\frac{n_\nu}{N_\nu} - f\right| = O(1), \text{ with } P_0\text{-probability } 1. \]

Condition (A1) denotes the logarithm of the covering number, defined as the minimum number of balls of radius ξ/36 needed to cover {P ∈ PNν : dNν(P, P0) < ξ} under distance metric, dNν. This condition restricts the growth of the model space, which guarantees the existence of test statistics, φnν(X1δν1, . . . , XNνδνNν) ∈ (0, 1), needed for enabling Lemma 4.1, stated below, that bounds the expectation of the pseudo posterior mass assigned on the set {P ∈ PNν : dnν(P, P0) ≥ ξNν}. Condition (A3) ensures the prior, Π, assigns mass to convex balls in the vicinity of P0. Conditions (A1) and (A3), together, define the minimum value of ξNν, where if these conditions are satisfied for some ξNν, then they are also satisfied for any ξ > ξNν. Condition (A2) allows, but restricts, the prior mass placed on the uncountable portion of the model space, such that we may direct our inference to an approximating sieve, PNν.

The next three new conditions impose restrictions on the sampling design and associated known distribution, Pν, used to draw the observed sample data that, together, define a class of allowable sampling designs on which the contraction result for the pseudo posterior is guaranteed. Condition (A4) requires the sampling design to assign a positive probability for inclusion of every unit in the population because the restriction bounds the sampling inclusion probabilities away from 0. Since the maximum inclusion probability is 1, the bound, γ ≥ 1. No portion of the population may be systematically excluded, which would prevent a sample of any size from containing information about the population from which the sample is taken. Condition (A5) restricts the result to sampling designs where the dependence among lowest-level sampled units attenuates to 0 as ν ↑ ∞; for example, a two-stage sampling design of clusters within strata would meet this condition if the number of population units nested within each cluster (from which the sample is drawn) increases in the limit of ν. Such would be the case in a survey of households within each cluster if the cluster domains are geographically defined and would grow in area as ν increases. In this case of increasing cluster area, the dependence among the inclusion of any two households in a given cluster would decline as the number of households increases with the size of the area defined for that cluster.
Condition (A6) ensures that the observed sample size, nν, grows to ∞ along with the size of the partially-observed finite population, Nν.

Theorem 3.1. Suppose conditions (A1)-(A6) hold. Then for sets PNν ⊂ P, constants, K > 0, and M sufficiently large,

\[ \mathbb{E}_{P_0, P_\nu} \Pi^{\pi}\left(P : d^{\pi}_{N_\nu}(P, P_0) \ge M \xi_{N_\nu} \mid X_1\delta_{\nu 1}, \ldots, X_{N_\nu}\delta_{\nu N_\nu}\right) \le \frac{16\gamma^2\left[\gamma + C_3\right]}{\left(Kf + 1 - 2\gamma\right)^2 N_\nu \xi_{N_\nu}^2} + 5\gamma \exp\left(\frac{-K n_\nu \xi_{N_\nu}^2}{2\gamma}\right), \qquad (5) \]

which tends to 0 as (nν , Nν ) ↑ ∞. We note that the rate of convergence is injured for a sampling distribution, Pν , that assigns relatively low inclusion probabilities to some units in the finite population such that γ will be relatively larger. Samples drawn under a design that expresses a large variability in the sampling weights will express more dispersion in their information similarity to the underlying finite population. Similarly, the larger the dependence among the finite population unit inclusions induced by Pν , the higher will be C3 and the slower will be the rate of contraction. The separability of the conditions on P and Π (P), on the one hand, from those on the sampling design distribution, Pν , on the other hand, coupled with the sequential process of taking the observed sample from the finite population reveal that the pseudo posterior, defined on the partially-observed sample from a population, contracts on P0 through converging to the posterior distribution defined on each fully-observed population. By contrast, if the posterior distribution, defined on each fully-observed finite population, fails to meet conditions (A1), (A2) and (A3) for the main result from Equation 5, such that it fails to contract on P0 , then the associated pseudo posterior cannot contract on P0 , even if the sampling design satisfies conditions (A4), (A5) and (A6). The proof generally follows that of Ghosal et al. (2000) with substantial modification to account for informative sampling. The L1 rate of contraction of the pseudo posterior distribution with respect to the joint distribution for population generation and the taking of informative samples is derived. Please see Appendix A for details.
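As a concrete check of the design conditions, consider simple random sampling without replacement, for which πνi = nν/Nν and πνij = nν(nν − 1)/(Nν(Nν − 1)) are standard results. The sketch below is an illustration rather than part of the paper; the function name is hypothetical. It computes the bound γ of (A4), the scaled pairwise term Nν|πνij/(πνi πνj) − 1| that appears in the boundedness part of (A5), and the sampling fraction of (A6).

```python
def srs_design_constants(N, n):
    """Design quantities for simple random sampling without replacement
    of n units from a population of N units."""
    pi_i = n / N                              # marginal inclusion probability
    pi_ij = n * (n - 1) / (N * (N - 1))       # pairwise inclusion probability
    gamma = 1.0 / pi_i                        # (A4): bound on 1 / min_i pi_i
    scaled_pairwise = N * abs(pi_ij / (pi_i * pi_i) - 1.0)   # (A5), scaled by N
    sampling_fraction = n / N                 # (A6)
    return gamma, scaled_pairwise, sampling_fraction

for N in (1_000, 1_000_000):
    gamma, scaled_pairwise, f = srs_design_constants(N, N // 10)
    print(N, gamma, round(scaled_pairwise, 6), f)
```

With the sampling fraction held at f = 0.1, the scaled pairwise term equals N(N − n)/(n(N − 1)) and settles near (1 − f)/f = 9 as N grows, so a finite C3 exists for this design, while γ = 1/f = 10.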


4 Enabling Results

We next construct two enabling results needed to prove Theorem 3.1 to account for informative sampling under (A4), (A5) and (A6). The first enabling result, Lemma 4.1, extends the applicability of Ghosal and van der Vaart (2007) - Lemmas 2 and 9 for inid models to informative sampling without replacement. This result is used to bound from above the numerator for the expectation with respect to the joint distribution for population generation and the taking of the informative sample, (P0, Pν), of the pseudo posterior distribution in Equation 3 on the restricted set of measures, {P ∈ B}, where B = {P ∈ P : dNν(P, P0) > δξNν}, (for any δ > 0). The restricted set includes those P that are at some minimum distance, δξNν, from P0 under pseudo Hellinger metric, $d^{\pi}_{N_\nu}$. The second result, Lemma 4.2, extends Lemma 8.1 of Ghosal et al. (2000) to bound the probability of the denominator of Equation 3 with respect to (P0, Pν), from below.

Lemma 4.1. Suppose conditions (A1) and (A4) hold. Then for every ξ > ξNν, a constant, K > 0, and any constant, δ > 0,

\[ \mathbb{E}_{P_0, P_\nu}\left[\int_{P \in \mathcal{P} \backslash \mathcal{P}_{N_\nu}} \prod_{i=1}^{N_\nu} \frac{p^{\pi}}{p_0^{\pi}}(X_i\delta_{\nu i})\, d\Pi(P)\, (1 - \phi_{n_\nu})\right] \le \Pi\left(\mathcal{P} \backslash \mathcal{P}_{N_\nu}\right) \qquad (6) \]

\[ \mathbb{E}_{P_0, P_\nu}\left[\int_{P \in \mathcal{P}_{N_\nu} : d^{\pi}_{N_\nu}(P, P_0) > \delta\xi} \prod_{i=1}^{N_\nu} \frac{p^{\pi}}{p_0^{\pi}}(X_i\delta_{\nu i})\, d\Pi(P)\, (1 - \phi_{n_\nu})\right] \le 2\gamma \exp\left(\frac{-K n_\nu \delta^2 \xi^2}{\gamma}\right). \qquad (7) \]

The constant multiplier, γ ≥ 1, defined in condition (A4), restricts the distribution of the sampling design by bounding all marginal inclusion probabilities for population units away from 0. As with the main result, the upper bound is injured by γ. Please see Appendix B for proof.

Lemma 4.2. For every ξ > 0 and measure Π on the set,

\[ B = \left\{P : -P_0 \log\frac{p}{p_0} \le \xi^2,\ P_0\left[\log\frac{p}{p_0}\right]^2 \le \xi^2\right\} \]

under the conditions (A2), (A3), (A4), and (A5), we have for every C > 0 and Nν sufficiently large,

\[ \Pr\left\{\int_{P \in \mathcal{P}} \prod_{i=1}^{N_\nu} \frac{p^{\pi}}{p_0^{\pi}}(X_i\delta_{\nu i})\, d\Pi(P) \le \exp\left[-(1 + C) N_\nu \xi^2\right]\right\} \le \frac{\gamma + C_3}{C^2 N_\nu \xi^2}, \qquad (8) \]

where the above probability is taken with respect to P0 and the sampling generating distribution, Pν, jointly. The bound of “1” in the numerator of the result for Lemma 8.1 of Ghosal et al. (2000) is replaced with γ + C3 for our generalization of this result in Equation 8. The sum of positive constants, γ + C3, is greater than 1 and will be larger for sampling designs where the inclusion probabilities, {πνi}, express relatively higher gradients. Observing each finite population in a skewed fashion through the taking of an informative sample may only slow the rate of posterior contraction (as compared to contraction of the posterior distribution defined on the fully observed finite population). Please see Appendix C for proof.
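One heuristic way to see how a constant of the form γ + C3 can arise is to compute the design mean and variance of the weighted empirical functional of Equation 4 for fixed population values. This is a sketch of the mechanism under conditions (A4) and (A5), not the authors' proof in Appendix C. The shorthand a_i = f(X_i) is introduced here only for illustration, the second inequality uses Cauchy-Schwarz, and the expectation and variance are taken over the sampling design alone.

```latex
\begin{align*}
\mathbb{E}_{P_\nu}\!\left[\mathbb{P}^{\pi}_{N_\nu} f\right]
  &= \frac{1}{N_\nu}\sum_{i=1}^{N_\nu}
     \frac{\mathbb{E}_{P_\nu}[\delta_{\nu i}]}{\pi_{\nu i}}\, a_i
   = \mathbb{P}_{N_\nu} f, \\
\mathrm{Var}_{P_\nu}\!\left[\mathbb{P}^{\pi}_{N_\nu} f\right]
  &= \frac{1}{N_\nu^{2}}\left[
     \sum_{i}\left(\frac{1}{\pi_{\nu i}} - 1\right) a_i^{2}
     + \sum_{i \neq j}\left(\frac{\pi_{\nu ij}}{\pi_{\nu i}\pi_{\nu j}} - 1\right) a_i a_j
     \right] \\
  &\le \frac{1}{N_\nu^{2}}\left[
     (\gamma - 1)\sum_{i} a_i^{2}
     + \frac{C_3}{N_\nu}\Big(\sum_{i} |a_i|\Big)^{2}
     \right]
   \le \frac{\gamma - 1 + C_3}{N_\nu}\cdot\frac{1}{N_\nu}\sum_{i} a_i^{2}.
\end{align*}
```

The variance of the fully observed functional scales as 1/Nν times a second moment, so under this sketch informative sampling inflates that scale by a factor no larger than γ + C3.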

5 Application

We next conduct an analysis on data produced by the Job Openings and Labor Turnover Survey (JOLTS), administered by BLS on a monthly basis to a randomly-selected sample from a frame composed of non-agricultural U.S. private and public establishments. JOLTS focuses on the demand side of U.S. labor force dynamics and measures job hires, separations (e.g. quits, layoffs and discharges) and openings. The JOLTS sampling design assigns inclusion probabilities (under sampling without replacement) to establishments to be proportional to the number of employees for each establishment (as obtained from the Quarterly Census of Employment and Wages (QCEW)). This design is informative in that the number of employees for an establishment will generally be correlated with the number of hires, separations and openings. Sub-groups of selected establishments are collected into panels on an annual basis (whose composition is revised, quarterly, based on new establishments entering the sampling frame, which then requires a re-weighting of the expanded sample). A new panel is initiated each month and data are collected on these establishments for a year. Approximately 16000 responses are received on a monthly basis.

We perform our modeling analysis on a May, 2012 data set of n = 8595 responding establishments for illustration. We construct a finite population regression model and formulate the associated sampling-weighted pseudo posterior distribution, from which we make inference on model parameters under the population generating distribution. We demonstrate that failing to incorporate sampling weights (e.g. by estimating the posterior distribution defined for the finite population on the observed sample) produces large differences in estimates of parameters.

Our regression model defines a multivariate response as the number of job hires (Hires), on the one hand, and total separations (Seps), on the other hand. We construct a single multivariate model (as contrasted with the specification of two univariate models) because these variables of interest tend to be highly correlated such that we expect the regression parameters to express dependence; for example, these two variables are correlated at 60% in our May 2012 dataset. We formulate a model for count data that accommodates the high degree of over-dispersion expressed in our establishment-indexed multivariate responses due to the large employment size differences across the establishments. Were we working with domain-indexed (e.g., by state or county) responses, we might consider using a Gaussian approximation for the count data likelihood, but such is not appropriate for us due to the presence of many small-sized establishments.
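Inclusion probabilities proportional to a size measure can be sketched as follows. This is a generic probability-proportional-to-size construction with capping at 1, offered as an illustration and not as the actual JOLTS procedure; the function name and the toy employment counts are hypothetical, and the sketch assumes positive sizes and a target sample size smaller than the population.

```python
import numpy as np

def pps_inclusion_probs(size, n):
    """Inclusion probabilities proportional to `size` that sum to n.
    Units whose proportional share would exceed 1 are taken with certainty
    and the remaining sample size is re-allocated among the other units."""
    size = np.asarray(size, dtype=float)
    certain = np.zeros(size.shape, dtype=bool)
    while True:
        remaining = n - certain.sum()
        pi = remaining * size / size[~certain].sum()
        pi[certain] = 1.0
        over = (pi > 1.0) & ~certain
        if not over.any():
            return pi
        certain |= over

employees = np.array([1.0, 2.0, 3.0, 4.0, 90.0])   # toy establishment sizes
pi = pps_inclusion_probs(employees, n=2)
weights = 1.0 / pi                                  # sampling weights w_i = 1 / pi_i
print(pi, weights)
```

In this toy frame the largest establishment is included with certainty, and the remaining expected sample size of one unit is spread over the four smaller establishments in proportion to their employment.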


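We construct the following count data model for the population,

\[ y_{id} \stackrel{\mathrm{ind}}{\sim} \mathrm{Pois}\left(\exp(\psi_{id})\right) \qquad (9) \]
\[ \underset{N \times D}{\Psi} \sim \underset{N \times P}{X}\ \underset{P \times D}{B} + \mathcal{N}_{N \times D}\left(I_N, \underset{D \times D}{\Lambda^{-1}}\right) \qquad (10) \]
\[ \underset{P \times D}{B} \sim 0 + \mathcal{N}_{P \times D}\left(\underset{P \times P}{M^{-1}}, \left[\tau_B \Lambda\right]^{-1}\right) \qquad (11) \]
\[ \Lambda \sim \mathcal{W}_D\left((D + 1), I_D\right) \qquad (12) \]
\[ \tau_B \sim \mathcal{G}(1, 1) \qquad (13) \]
\[ M \sim \mathcal{W}_P\left((P + 1), I_P\right), \qquad (14) \]

where i = 1, . . . , N indexes establishments and d = 1, . . . , D indexes the dimensions of the multivariate response, Y. The N × D log-mean, Ψ = (ψ1, . . . , ψN)′, with each ψi of dimension D × 1,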

may be viewed as a latent response whose columns index the number of job hires (Hires) and total separations (Seps) under our JOLTS application, where D = 2. The number of predictors in the design matrix, X, is denoted by P, and B is the unknown P × D matrix of population coefficients that serves as the focus for our inference. Our model is formulated as a multivariate Poisson-lognormal model, under which the Gaussian prior of Equation 10 for the logarithm of the Poisson mean allows for over-dispersion (of different degrees) in each of the D dimensions. The priors in Equation 10 and Equation 11 are formulated in matrix variate (or, more generally, tensor product) Gaussian distributions using the notation of Dawid (1981); for example, the prior for the P × D matrix of coefficients, B, assigns the P × D mean 0 for a Gaussian distribution that employs a separable covariance structure where the P × P, M, denotes the precision matrix for the columns of B, and the D × D, Λ, denotes the precision matrix for the rows. This prior formulation is the equivalent of assigning a PD dimensional Gaussian prior to a vectorization of B accomplished by stacking its columns with PD × PD precision matrix, M ⊗ Λ. (See Hoff (2011) for more background). Precision matrices, (M, Λ), each receive Wishart priors with hyperparameter values that impose uniform marginal prior distributions on the correlations (Barnard et al. 2000).

We regress the multivariate response, Ψ, on predictors representing the logarithm of the overall establishment-indexed number of employees (Emp), obtained from the QCEW, the logarithm of the number of job openings (Opens), region (Northeast, South, West, Midwest (Midw)) and ownership type (Private, Federal Government, State Government (State), Local Government (Local)). We convert region and ownership type to binary indicators and leave out the Northeast region and Federal Government ownership to provide the baseline of a full-column rank predictor matrix. We summarize our model by: (Hires, Seps) ∼ 1 + West + Midw + South + State + Local + Private + log(Emp) + log(Opens) + error, where 1 denotes an intercept (Int).

Our population model is hypothesized to generate the finite population of the U.S. nonagricultural establishments, from which we have taken a sample of size n = 8595 for May, 2012. For ease of reading, we will continue to use Y and X, to next define the associated pseudo posterior, though each possesses n < N rows representing the sampled observations, in this context. The population model likelihood contribution for establishment, i, on dimension, d, is formed with the integration,

\[ p\left(y_{id} \mid x_i, B, \Lambda\right) = \int_{\mathbb{R}} p\left(y_{id} \mid \psi_{id}\right) \times p\left(\psi_{id} \mid x_i, B, \Lambda\right) d\psi_{id}, \qquad (15) \]

where the sampling weight is wi = 1/πi and the normalized weight is $\tilde{w}_i = n \times w_i / \sum_{j=1}^{n} w_j$, such that the adjusted weights sum to n, the asymptotic amount of information contained in the sample (under a sampling design that obeys condition (A5)). This integrated likelihood induces the following pseudo likelihood,

\[ p^{\pi}\left(y_{id} \mid x_i, B, \Lambda\right) = \left[\int_{\mathbb{R}} p\left(y_{id} \mid \psi_{id}\right) \times p\left(\psi_{id} \mid x_i, B, \Lambda\right) d\psi_{id}\right]^{\tilde{w}_i}, \qquad (16) \]

which is analytically intractable, so we perform the integration, numerically, in our MCMC using the prior for each ψid exponentiated by the normalized sampling weight, $\tilde{w}_i$, which we use to construct its pseudo posterior distribution. Using Bayes rule we present the logarithm of the pseudo posteriors for the latent set of D × 1 log-mean parameters, {ψi}, (which are a posteriori independent over i = 1, . . . , n), with,

\[ \log p^{\pi}\left(\psi_i \mid y_i, x_i, B, \Lambda\right) \qquad (17a) \]
\[ \propto \log\left\{\prod_{d=1}^{D}\left[\exp(\psi_{id})^{y_{id}} \exp\left(-\exp(\psi_{id})\right)\right]^{\tilde{w}_i} \times \left[\mathcal{N}_D\left(\psi_i \mid B'x_i, \Lambda^{-1}\right)\right]^{\tilde{w}_i}\right\} \qquad (17b) \]
\[ \propto \tilde{w}_i \sum_{d=1}^{D}\left[y_{id}\psi_{id} - \exp(\psi_{id})\right] - \frac{1}{2}\left(\psi_i - B'x_i\right)'\left[\tilde{w}_i \Lambda\right]\left(\psi_i - B'x_i\right), \qquad (17c) \]

where we note in the second expression of the last equation that the sampling weights influence the prior precision for each $\psi_{id}$, such that a higher-weighted observation exerts relatively more influence on posterior inference. This is the underlying mechanism by which the information in the sample is re-balanced to approximate that in the finite population from which it was drawn. We draw samples from Equation 17c in our posterior sampling scheme using the elliptical slice sampler of Murray et al. (2010), which produces very well mixing sampling chains in our application, akin to Gibbs sampling under a conjugate model specification. We next illustrate the construction of the pseudo posterior distribution for the $P \times D$ matrix of regression coefficients, $B$ (which by D-separation is independent of the observations, $(y_{id})$, given $(\psi_{id})$),

\[
p^{\pi}\left(B \mid Y, X, \Psi, \Lambda, M\right) \propto \left[\prod_{i=1}^{n} N_D\left(\psi_i \mid B' x_i, \Lambda^{-1}\right)^{\tilde{w}_i}\right] N_{P \times D}\left(B \mid M^{-1}, \Lambda^{-1}\right) \tag{18a}
\]
\[
\log p^{\pi}\left(B \mid Y, X, \Psi, \Lambda, M\right) \propto \sum_{i=1}^{n}\left[\frac{\tilde{w}_i}{2}\log|\Lambda| - \frac{\tilde{w}_i}{2}\left(\psi_i - B' x_i\right)' \Lambda \left(\psi_i - B' x_i\right)\right] + \log N_{P \times D}\left(B \mid M^{-1}, \Lambda^{-1}\right). \tag{18b}
\]
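The weight normalization and the log pseudo posterior for $\psi_i$ in Equation 17c can be sketched numerically as follows. This is an illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, the observation kernel is read as Poisson with log-mean $\psi_{id}$, and the Gaussian kernel carries the usual factor of $-1/2$.

```python
import numpy as np

def normalize_weights(pi):
    """Return w~_i = n * w_i / sum(w), with w_i = 1/pi_i, so the weights sum to n."""
    w = 1.0 / np.asarray(pi, dtype=float)
    return len(w) * w / w.sum()

def log_pseudo_posterior_psi(psi_i, y_i, x_i, B, Lam, w_tilde_i):
    """Unnormalized log pseudo posterior of the D-vector psi_i (Equation 17c).

    psi_i, y_i : length-D arrays; x_i : length-P array;
    B : P x D coefficients; Lam : D x D precision; w_tilde_i : normalized weight.
    """
    resid = psi_i - B.T @ x_i                      # (x_i' B)' as a length-D vector
    log_lik = np.sum(y_i * psi_i - np.exp(psi_i))  # Poisson log kernel (assumed)
    log_prior = -0.5 * resid @ Lam @ resid         # Gaussian log kernel
    # The weight multiplies both pieces, i.e. it scales the prior precision too.
    return w_tilde_i * (log_lik + log_prior)
```

A function of this form is what a generic sampler would evaluate repeatedly for each $\psi_i$; doubling the weight of a unit doubles its contribution to the log pseudo posterior.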

In a Bayesian setting, the sum of the weights ($n = \sum_{i=1}^{n} \tilde{w}_i$) impacts the estimated posterior variance, as we observe in Equation 18b. We see that the weights scale the quadratic product of the Gaussian kernel in Equation 18b, such that we may accomplish the same result using the matrix variate formation to define the pseudo likelihood, $N_{n \times D}\left(\Psi - XB \mid \tilde{W}, \Lambda^{-1}\right)$, where $\tilde{W} = \mathrm{diag}\left(\tilde{w}_1, \ldots, \tilde{w}_n\right)$ holds the weights for the sampled observations, from which we compute the following conjugate conditional pseudo posterior distribution defined on the $n$ observations,

\[
p^{\pi}\left(B \mid Y, X, \Psi, \Lambda, M\right) = h^{\pi}_B + N_{P \times D}\left(B \mid \left(\phi^{\pi}_B\right)^{-1}, \Lambda^{-1}\right), \tag{19}
\]

where $\phi^{\pi}_B = X' \tilde{W} X + M$ and $h^{\pi}_B = \left(\phi^{\pi}_B\right)^{-1} X' \tilde{W} \Psi$.
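The conjugate update in Equation 19 can be illustrated with a short sketch. It assumes that $N_{P \times D}(\cdot \mid (\phi^{\pi}_B)^{-1}, \Lambda^{-1})$ denotes a matrix normal with row covariance $(\phi^{\pi}_B)^{-1}$ and column covariance $\Lambda^{-1}$; the function name and argument layout are hypothetical.

```python
import numpy as np

def pseudo_posterior_B(X, Psi, w_tilde, M, Lam, rng):
    """Draw B once from h + N_{PxD}(0, phi^{-1}, Lam^{-1}) (Equation 19).

    X : n x P predictors; Psi : n x D latent responses; w_tilde : length-n weights;
    M : P x P prior precision; Lam : D x D precision; rng : numpy Generator.
    Returns the draw together with the mean h and the row precision phi.
    """
    XtW = X.T * w_tilde                  # X' W~, with W~ = diag(w_tilde)
    phi = XtW @ X + M                    # phi = X' W~ X + M
    h = np.linalg.solve(phi, XtW @ Psi)  # h = phi^{-1} X' W~ Psi
    # Factor both covariances: phi^{-1} = A A' and Lam^{-1} = C C'.
    A = np.linalg.cholesky(np.linalg.inv(phi))
    C = np.linalg.cholesky(np.linalg.inv(Lam))
    Z = rng.standard_normal(h.shape)     # P x D matrix of iid N(0, 1) draws
    return h + A @ Z @ C.T, h, phi
```

Passing a vector of ones for `w_tilde` replaces $\tilde{W}$ by the identity matrix and so gives the equally weighted (population) posterior for comparison.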

Each plot panel in Figure 1 compares estimated posterior distributions for a coefficient in $B$ (within 95% credible intervals), labeled by “predictor, dimension (of the multivariate response)”, when applied to the May, 2012 JOLTS dataset between two estimation models:

1. The left-hand plot in each panel employs the sampling weights to estimate the pseudo posterior for $B$, induced by the pseudo posterior for the latent response in Equation 17c;
2. The right-hand plot estimates the coefficients using the posterior distribution defined on the finite population, which may be achieved by replacing $\tilde{W}$ by the identity matrix to equally weight establishments.

Equal weighting of establishments assumes that the sample represents the same balance of information as the population from which it was drawn, which is not the case under an informative sampling design. Comparing estimation results from the pseudo posterior and population posterior distributions provides one method to assess the sensitivity of estimated parameter distributions to the sampling design. The estimated results differ markedly in both location and variation between estimations performed under the pseudo posterior and population posterior distributions, indicating a high degree of informativeness in the sampling design. The 95% credible intervals for the coefficients of the continuous predictors, (the log of) job openings (Opens) and employment (Emp), do not overlap for either the number of hires (Hires) or the separations (Seps) response. The coefficient for the State ownership predictor and the number of hires response is bounded away from 0 when estimated under the (unweighted) population posterior, but is centered on 0 under the sampling-weighted pseudo posterior. The coefficient posterior variances estimated under the population posterior are understated because they do not reflect the uncertainty with which the information in the sample expresses that in the population (which is captured through the sampling weights).

[Figure 1 image: one panel per response (Hires, Seps) and predictor (Emp, int, Local, Midw, Opens, Private, South, State, West), each contrasting “weight” and “ignore”; x-axis: Response - Predictor; y-axis: Distribution within 95% CI for Coefficient.]

Figure 1: Comparison of posterior densities for each coefficient in the $(P = 9) \times (D = 2)$ coefficient matrix, $B$, within 95% credible intervals, based on inclusion of the sampling weights in a pseudo posterior (the left-hand plot in each panel) and exclusion of the sampling weights using the posterior distribution defined for the population (the right-hand plot). Each plot panel is labeled by “predictor,response” for the two included response variables, “Hires” and “Seps” (total separations).

5.1

Simulation Study

We implement a simulation study under which we devise an informative sampling design and conduct Monte Carlo draws of observations from the JOLTS May, 2012 data, which we now treat as though it were a finite population. We compare the marginal pseudo posterior distributions to the (unweighted) posterior distributions for regression coefficients, where both are estimated on the observed sample drawn under our informative sampling design in each Monte Carlo iteration. We compare the sampling-weighted pseudo posterior and unweighted posterior distributions estimated on the samples, on the one hand, to the population posterior distribution estimated from our JOLTS data, which serves as our finite population, on the other hand. This set-up for the simulation study allows us both to discover the benefit of correcting for the informative sampling design for estimation conducted on the observed sample data and to assess the convergence of the pseudo posterior, estimated on samples, to the posterior estimated on the finite population (from which the samples were taken).

Our chosen sampling design is single-stage, and inclusion probabilities are assigned proportionally to a size variable that we know to be correlated with the values for JOLTS hires and total separations that serve as our response variables of interest. We utilize the establishment-indexed number of employees, regress it on a 3rd-order polynomial function of the two response variables, and take the expected value as our size variable. This modeling step (versus using the raw number of employees as the size variable) serves to reduce the skewness in the resulting inclusion probabilities, which reduces the number of “certainty” establishments with inclusion probabilities equal to 1. Characteristics of the sampling design at each sample size are presented in Table 1. Our single-stage, proportional-to-size informative sampling design will induce distributions of the two response variables in our observed samples that differ from those for the population.

The designed correlation between the size of the response and establishment inclusion probabilities will produce observed samples with values skewed towards higher numbers of hires and separations than the distributions of those variables in the population from which the samples were taken. Figure 2 demonstrates this difference between the distributions for realized samples under an informative sampling design and those for the finite population. The left-most box plot in each of the two panels displays the population distribution for a response value. For illustration, a single sample is drawn at each of the sample sizes we employ in our simulation study. The next set of box plots displays the resulting distributions for the response values in each sample, with size increasing from left to right. The left-hand plot panel displays the distributions for the Hires response, while the right-hand panel displays those for the Seps (separations) response variable.

Table 1: Characteristics of the single-stage, fixed-size pps sampling design used in the simulation study. $n_\nu$ denotes the sample size. CUs denotes the number of certainty units (with inclusion probabilities equal to 1). $\pi_\nu$ denotes the inclusion probabilities (proportional to the square root of JOLTS employment), CV($\pi_\nu$) denotes the coefficient of variation of $\pi_\nu$, Cor($y_{hires}$, $\pi_\nu$) denotes the correlation of the number of hires and $\pi_\nu$, and Cor($y_{Seps}$, $\pi_\nu$) denotes the correlation of the number of separations and $\pi_\nu$.

     $n_\nu$   CUs   min($\pi_\nu$)   max($\pi_\nu$)   CV($\pi_\nu$)   Cor($y_{hires}$, $\pi_\nu$)   Cor($y_{Seps}$, $\pi_\nu$)
1      500      56        0.02             1.00            2.11                0.80                        0.62
2     1000     196        0.04             1.00            1.60                0.69                        0.50
3     1500     357        0.07             1.00            1.29                0.61                        0.44
4     2500     722        0.14             1.00            0.91                0.51                        0.36

We take 100 Monte Carlo samples at each of $n_\nu = (500, 1000, 1500, 2500)$ establishments from our $N = 8595$ JOLTS data. Pseudo posterior and population posterior distributions are estimated on each Monte Carlo sample at each sample size in $n_\nu$.
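The text does not state how certainty units were identified, so the following is only a sketch of one common approach to proportional-to-size inclusion probabilities: scale the size variable so the probabilities sum to the sample size, set any unit at or above 1 to a certainty unit, and rescale the rest. The function name and the iterative capping rule are assumptions made for illustration.

```python
def pps_inclusion_probabilities(size, n):
    """Inclusion probabilities proportional to `size`, summing to n, capped at 1.

    size : list of positive size measures, one per population unit
    n    : fixed sample size
    Returns (pi, certainty_units), where certainty_units are indices with pi == 1.
    """
    N = len(size)
    if not 0 < n <= N:
        raise ValueError("need 0 < n <= population size")
    pi = [0.0] * N
    certain = set()
    while True:
        free = [i for i in range(N) if i not in certain]
        total = sum(size[i] for i in free)
        remaining = n - len(certain)  # sample slots left for non-certainty units
        for i in free:
            pi[i] = remaining * size[i] / total
        new = {i for i in free if pi[i] >= 1.0}
        if not new:
            break
        certain |= new                # promote to certainty units and rescale
    for i in certain:
        pi[i] = 1.0
    return pi, sorted(certain)
```

For example, `pps_inclusion_probabilities([1, 1, 1, 1, 100], 2)` makes the large unit a certainty unit and spreads the one remaining sample slot evenly over the four small units.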

[Figure 2 image: box plots in two panels (Hires, Seps); x-axis: Sample Size; y-axis: Distribution of Response Values.]

Figure 2: Distributions of response values for the population compared to informative samples. The left-most box plot in each of the two plot panels contains the distribution for the JOLTS sample that we use as our “population” in the simulation study. The next set of box plots shows the distribution of the response values for increasing sample sizes (from left to right) for each sample drawn under our single-stage proportional-to-size design. The left-hand plot panel displays the Hires response variable and the right-hand panel displays the Seps (separations) response variable.

[Figure 3 image: box plots in a grid of panels with rows Emp_Hires and Emp_Seps and columns for sample sizes 500, 1000, 1500 and 2500; box plots labeled pop, weight, ignore and srs; x-axis: Sample Size; y-axis: Distribution within 95% CI for Coefficient.]

Figure 3: Comparison of posterior densities for 2 coefficients, Employment-Hires (top row of plot panels) and Employment-Separation (bottom row of plot panels) in $B$, within 95% credible intervals, between estimation on the population (left-hand plot in each panel), estimations from informative samples taken from that population, which include sampling weights in a pseudo posterior (the second plot from the left in each panel) and exclusion of the sampling weights using the population posterior distribution (the third plot from the left) under a simulation study. The right-most plot presents the posterior density estimated from a simple random sample of the same size for comparison. The simulation study uses the May, 2012 JOLTS sample as the “population” and generates 500 informative samples for a range of sample sizes (of 500, 1000, 1500, 2500, from left to right) under a sampling without replacement design with inclusion probabilities set proportionally to the square root of employment levels. A separate estimation is performed on each Monte Carlo sample and the draws from estimated distributions are concatenated over the samples.

Figure 3 compares estimation of the posterior distribution from the fully-observed population (left-hand box plot in each plot panel) to estimation using the pseudo posterior (the second box plot from the left) on the sample observations taken from that population under the proportional-to-size sampling design, and also to another application of the posterior distribution (the third box plot) estimated on the same sample. The right-hand box plot estimates the posterior density on a simple random sample of the same size (as the proportional-to-size informative sample) to allow comparison. We estimate the distributions on each of the 100 Monte Carlo draws for each sample size and concatenate the results such that they incorporate both the variation of population generation and repeated sampling from that population. The sample sizes, $n_\nu$, increase from left to right across the plot panels. The top set of plot panels displays the posterior distributions of the regression coefficient for the employment predictor (Emp) and the hires response (Hires), while the bottom set of panels displays the coefficient distributions for the employment predictor (Emp) and the total separations response (Seps).

Scanning from left to right in each row across the increasing sample sizes, we readily note a consistent difference in the estimated posterior mean, as expected, between the population model estimated on the samples without adjustment for the informative sampling design and the mean of the posterior distribution estimated on the entire finite population. The application of the pseudo posterior model, however, produces much less difference (relative to estimation on the fully observed population), though the difference between the estimated pseudo posterior and the population posterior is still notably larger than that for the simple random sampling result (estimated on samples of the same size as the informative sample). The estimated difference for the pseudo posterior converges to 0, however, as the sample size increases.

The posterior variance for the estimated posterior under simple random sampling remains larger than that for the pseudo posterior estimated on the informative sample. Estimation from a sample taken under an information-efficient sampling design often results in lower posterior variance than a simple random sample when the informative sampling design produces better information coverage of the finite population in the realized samples, enough to overcome the added variation from incorporating the sampling weights for estimation. Our proportional-to-size design over-samples the highest variance units, which provides relatively more information for estimation. In summary, this simulation study demonstrates the contraction of the pseudo posterior distribution estimated on the sample onto the posterior distribution estimated on a fully-observed finite population.

6

Discussion

This paper provides conditions, both on the model space and prior, on the one hand, and on the sampling design, on the other hand, under which the sampling-weighted, pseudo posterior estimator achieves convergence at the fixed, P0 . The pseudo posterior is an approximating mechanism (applied to the model space, P) for the posterior distribution and is designed to “undo” the informative sampling design. The posterior distribution is evaluated on a finite population, while the pseudo posterior is estimated on the informative sample taken from that population. Our main result reveals that we retain the conditions on the model space and prior from Ghosal and van der Vaart (2007) and add three new conditions that define a class of sampling designs under which L1 convergence is achieved with respect to the joint distribution for population generation and the taking of informative samples, (P0 , Pν ). This separability of conditions on the model space generating the finite population, on the one hand, from the sampling design under which the observed sample is taken from the finite population, on the other hand, is not surprising as convergence at P0 is achieved under informative sampling through convergence of the pseudo posterior, estimated on the observed sample, to the posterior, estimated on the finite population. So the conditions that guarantee convergence of the posterior distribution defined on the population are also required for convergence of the pseudo posterior distribution.


In a similar fashion as discussed in Toth and Eltinge (2011), our results are asymptotic as they are constructed under a super-population framework. Even further, the imposition of the subjective prior to define a Bayesian probability model requires conditions that restrict the class of priors to guarantee the frequentist result of contraction on a single measure. In practice, however, the analyst often conducts estimation on one sample from a single finite population, where additional diagnostic guidance on whether the finite population and/or the sample are extreme would be very useful. Our simulation study reveals that, in practice, employment of the pseudo posterior for observations taken under an informative sampling design reduces the estimated bias, even at relatively small sample sizes, assuming the population generating model meets the conditions on the model space and class of allowable prior distributions. Additional simulation runs we conducted under sampling designs that violate condition (A4) by excluding some portion of the finite population reveal that this bias-reducing property is robust. Calibration and other re-weighting steps may be used by the analyst to evaluate the actual, realized sample as compared to that intended under the sampling design when the design is known.

The asymptotic adjustment for informative sampling is not a fully Bayesian mechanism, but provides a plug-in estimator based solely on observed sample quantities. The use of the pseudo posterior plug-in estimator provides a more computationally tractable alternative to the use of sampling weights for generation of pseudo populations on which the population posterior distribution may be approximated (through repeated estimations over the pseudo population draws).

Incorporation of the sampling weights will generally increase the estimated posterior variance (relative to simple random sampling) because the weights encode the expected variation in the samples taken under an informative design. This increase may be partly or wholly offset when the informative design is more efficient than simple random sampling (e.g. use of stratification to produce fuller coverage of the population). The total amount of information in the sample is regulated by normalizing the weights to sum to the sample size, which is asymptotically correct; as noted above, however, the realization of a single sample may express some dependence induced by the sampling design such that the amount of information in the sample is less than the sample size (in which case, the posterior uncertainty may be under-estimated). A focus for future work would be to incorporate a sample dependence adjustment for setting the sum of the sampling weights.

References

Barnard, J., McCulloch, R. and Meng, X.-L. (2000), ‘Modeling covariance matrices in terms of standard deviations and correlations, with application to shrinkage’, Statistica Sinica 10(4), 1281–1311.

Bonnéry, D., Breidt, F. J. and Coquet, F. (2013), Uniform convergence of the empirical cumulative distribution under informative selection from a finite population, Technical report, Submitted to Bernoulli.

Breslow, N. E. and Wellner, J. A. (2007), ‘Weighted likelihood for semiparametric models and two-phase stratified samples, with application to Cox regression’, Scandinavian Journal of Statistics 34(1), 86–102. URL: http://EconPapers.repec.org/RePEc:bla:scjsta:v:34:y:2007:i:1:p:86-102

Chambers, R. and Skinner, C. (2003), Analysis of Survey Data, Wiley Series in Survey Methodology, Wiley. URL: http://books.google.com/books?id=4pYGz69d-LkC

Dawid, A. (1981), ‘Some matrix-variate distribution theory: Notational considerations and a Bayesian application’, Biometrika 68(1), 265–274.

Ghosal, S., Ghosh, J. K. and van der Vaart, A. W. (2000), ‘Convergence rates of posterior distributions’, Ann. Statist. pp. 500–531.

Ghosal, S. and van der Vaart, A. (2007), ‘Convergence rates of posterior distributions for noniid observations’, Ann. Statist. 35(1), 192–223. URL: http://dx.doi.org/10.1214/009053606000001172

Hoff, P. D. (2011), ‘Separable covariance arrays via the Tucker product, with applications to multivariate relational data’, Bayesian Anal. 6(2), 179–196. URL: http://dx.doi.org/10.1214/11-BA606

Kunihama, T., Herring, A. H., Halpern, C. T. and Dunson, D. B. (2014), Nonparametric Bayes modeling with sample survey weights, Technical report, Submitted to Biometrika.

Murray, I., Adams, R. P. and MacKay, D. J. (2010), ‘Elliptical slice sampling’, JMLR: W&CP 9, 541–548.

Pfeffermann, D. and Sverchkov, M. (2009), Inference under informative sampling, in D. Pfeffermann and C. Rao, eds, ‘Handbook of Statistics 29B: Sample Surveys: Inference and Analysis’, Elsevier Science Ltd., pp. 455–487.

Praestgaard, J. and Wellner, J. (1993), ‘Exchangeably weighted bootstraps of the general empirical processes’, Annals of Probability 21, 2053–2086.

Rao, J. N. K. and Wu, C. (2010), ‘Bayesian pseudo-empirical-likelihood intervals for complex surveys’, Journal of the Royal Statistical Society Series B 72(4), 533–544. URL: http://EconPapers.repec.org/RePEc:bla:jorssb:v:72:y:2010:i:4:p:533-544

Särndal, C.-E., Swensson, B. and Wretman, J. (2003), Model Assisted Survey Sampling, Springer Series in Statistics.

Toth, D. and Eltinge, J. L. (2011), ‘Building consistent regression trees from complex sample data’, J. Am. Stat. Assoc. 106(496), 1626–1636.

Wong, W. H. and Shen, X. (1995), ‘Probability inequalities for likelihood ratios and convergence rates of sieve MLEs’, Ann. Statist. 23(2), 339–362. URL: http://dx.doi.org/10.1214/aos/1176324524

A

Proof of Theorem 3

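Proof. Condition (A1) establishes the existence of test statistics, $\phi_{n_\nu}\left(X_1\delta_{\nu 1}, \ldots, X_{N_\nu}\delta_{\nu N_\nu}\right) \in (0, 1)$, used to achieve the following result,

\[
E_{P_0,P_\nu}\phi_{n_\nu} \le \exp\left(n_\nu \xi_{N_\nu}^2\right) \cdot \frac{\exp\left(-K n_\nu M^2 \xi_{N_\nu}^2\right)}{1 - \exp\left(-K n_\nu M^2 \xi_{N_\nu}^2\right)} \le 2\exp\left(-K n_\nu \xi_{N_\nu}^2\right), \tag{20}
\]

in Lemmas 2 and 9 of Ghosal and van der Vaart (2007) by setting $\xi = M\xi_{N_\nu}$, and by choosing the constant $M > 0$ sufficiently large, such that $KM^2 - 1 > K$. We will bound the expectation (under $(P_0, P_\nu)$, jointly) of the mass assigned by the pseudo posterior distribution to those $P$ at some minimum distance from $P_0$,

\[
\Pi^{\pi}\left(P \in \mathcal{P} : d^{\pi}_{N_\nu}(P, P_0) \ge M\xi_{N_\nu} \mid X_1\delta_{\nu 1}, \ldots, X_{N_\nu}\delta_{\nu N_\nu}\right)
= \Pi^{\pi}\left(P \in \mathcal{P} : d^{\pi}_{N_\nu}(P, P_0) \ge M\xi_{N_\nu} \mid X_1\delta_{\nu 1}, \ldots, X_{N_\nu}\delta_{\nu N_\nu}\right)\left(\phi_{n_\nu} + 1 - \phi_{n_\nu}\right). \tag{21}
\]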

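Equation 20 establishes the bound,

\[
E_{P_0,P_\nu}\Pi^{\pi}\left(P \in \mathcal{P} : d^{\pi}_{N_\nu}(P, P_0) \ge M\xi_{N_\nu} \mid X_1\delta_{\nu 1}, \ldots, X_{N_\nu}\delta_{\nu N_\nu}\right)\phi_{n_\nu} \le E_{P_0,P_\nu}\phi_{n_\nu} \le 2\exp\left(-K n_\nu \xi_{N_\nu}^2\right), \tag{22}
\]

since the pseudo posterior mass is bounded from above by 1. We next enumerate the pseudo posterior distribution for the second term of Equation 21,

\[
\Pi^{\pi}\left(P \in \mathcal{P} : d^{\pi}_{N_\nu}(P, P_0) \ge M\xi_{N_\nu} \mid X_1\delta_{\nu 1}, \ldots, X_{N_\nu}\delta_{\nu N_\nu}\right)\left(1 - \phi_{n_\nu}\right)
= \frac{\displaystyle\int_{P \in \mathcal{P} : d^{\pi}_{N_\nu}(P, P_0) \ge M\xi_{N_\nu}} \prod_{i=1}^{N_\nu} \frac{p^{\pi}}{p_0^{\pi}}\left(X_i\delta_{\nu i}\right) d\Pi(P)\left(1 - \phi_{n_\nu}\right)}{\displaystyle\int_{P \in \mathcal{P}} \prod_{i=1}^{N_\nu} \frac{p^{\pi}}{p_0^{\pi}}\left(X_i\delta_{\nu i}\right) d\Pi(P)}. \tag{23}
\]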

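We may bound the denominator of Equation 23 from below, in probability. Define the event,

\[
B_{N_\nu} = \left\{ P : -P_0 \log\frac{p}{p_0} \le \xi_{N_\nu}^2, \; P_0\left(\log\frac{p}{p_0}\right)^2 \le \xi_{N_\nu}^2 \right\}.
\]

We have from Lemma 4.2,

\[
\Pr\left\{ \int_{P \in \mathcal{P}} \prod_{i=1}^{N_\nu} \frac{p^{\pi}}{p_0^{\pi}}\left(X_i\delta_{\nu i}\right) d\Pi(P) \ge \exp\left[-(1 + C) N_\nu \xi^2\right] \right\} \ge 1 - \frac{\gamma + C_3}{C^2 N_\nu \xi^2},
\]

for every $P \in B_{N_\nu}$ and any $C > 0$, $\gamma > 1$, where $\gamma$ may be set closer to 1 for sampling designs that define a low gradient for inclusion probabilities, $\{\pi_{\nu i}\}$. The constant, $C_3 > 0$, will be close to 1 for sufficiently large $\nu$. Condition (A3) restricts the prior on $B_{N_\nu}$, $\Pi\left(B_{N_\nu}\right) \ge \exp\left(-N_\nu \xi_{N_\nu}^2 C\right)$. Then with probability at least

\[
1 - \frac{16\gamma^2\left[\gamma + C_3\right]}{\left(KM^2 f - 2\gamma\right)^2 N_\nu \xi^2},
\]

\[
\int_{P \in \mathcal{P}} \prod_{i=1}^{N_\nu} \frac{p^{\pi}}{p_0^{\pi}}\left(X_i\delta_{\nu i}\right) d\Pi(P) \ge \exp\left[-(1 + C) N_\nu \xi^2\right]\Pi\left(B_{N_\nu}\right) \ge \exp\left[-(1 + 2C) N_\nu \xi^2\right] \ge \exp\left(-\frac{KM^2 n_\nu \xi_{N_\nu}^2}{2\gamma}\right),
\]

where we set $1 + 2C = \frac{KM^2 f}{2\gamma}$ and use condition (A6) to replace $f \times N_\nu$ with $n_\nu$ for $\nu$ sufficiently large. Denote this event by,

\[
A^{\pi}_{N_\nu} = \left\{ \int_{P \in \mathcal{P}} \prod_{i=1}^{N_\nu} \frac{p^{\pi}}{p_0^{\pi}}\left(X_i\delta_{\nu i}\right) d\Pi(P) \ge \exp\left(-\frac{KM^2 n_\nu \xi_{N_\nu}^2}{2\gamma}\right) \right\}, \tag{24}
\]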

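such that,

\[
\Pi^{\pi}\left(P \in \mathcal{P} : d^{\pi}_{N_\nu}(P, P_0) \ge M\xi_{N_\nu} \mid X_1\delta_{\nu 1}, \ldots, X_{N_\nu}\delta_{\nu N_\nu}\right)\left(1 - \phi_{n_\nu}\right)
\]
\[
= \left[ \frac{\displaystyle\int_{\{P \in \mathcal{P} : d^{\pi}_{N_\nu}(P, P_0) \ge M\xi_{N_\nu}\}} \prod_{i=1}^{N_\nu} \frac{p^{\pi}}{p_0^{\pi}}\left(X_i\delta_{\nu i}\right) d\Pi(P)\left(1 - \phi_{n_\nu}\right)}{\displaystyle\int_{P \in \mathcal{P}} \prod_{i=1}^{N_\nu} \frac{p^{\pi}}{p_0^{\pi}}\left(X_i\delta_{\nu i}\right) d\Pi(P)} \right]\left( I\left(A^{\pi}_{N_\nu}\right) + I\left(A^{\pi c}_{N_\nu}\right) \right)
\]
\[
\le I\left(A^{\pi c}_{N_\nu}\right) + I\left(A^{\pi}_{N_\nu}\right) \times \left[ \exp\left(\frac{KM^2 n_\nu \xi_{N_\nu}^2}{2\gamma}\right)\Pi\left(\mathcal{P} \setminus \mathcal{P}_{N_\nu}\right) + \exp\left(\frac{KM^2 n_\nu \xi_{N_\nu}^2}{2\gamma}\right)\int_{\{P \in \mathcal{P}_{N_\nu} : d^{\pi}_{N_\nu}(P, P_0) \ge M\xi_{N_\nu}\}} \prod_{i=1}^{N_\nu} \frac{p^{\pi}}{p_0^{\pi}}\left(X_i\delta_{\nu i}\right) d\Pi(P)\left(1 - \phi_{n_\nu}\right) \right].
\]

Taking the expectation of both sides with respect to the joint distribution, $(P_0, P_\nu)$,

\[
E_{P_0,P_\nu}\Pi^{\pi}\left(P \in \mathcal{P} : d^{\pi}_{N_\nu}(P, P_0) \ge M\xi_{N_\nu} \mid X_1\delta_{\nu 1}, \ldots, X_{N_\nu}\delta_{\nu N_\nu}\right)\left(1 - \phi_{n_\nu}\right)
\]
\[
\le P\left(A^{\pi c}_{N_\nu}\right) + \exp\left(\frac{KM^2 n_\nu \xi_{N_\nu}^2}{2\gamma}\right)\Pi\left(\mathcal{P} \setminus \mathcal{P}_{N_\nu}\right) + \exp\left(\frac{KM^2 n_\nu \xi_{N_\nu}^2}{2\gamma}\right) \cdot E_{P_0,P_\nu}\int_{\{P \in \mathcal{P}_{N_\nu} : d^{\pi}_{N_\nu}(P, P_0) \ge M\xi_{N_\nu}\}} \prod_{i=1}^{N_\nu} \frac{p^{\pi}}{p_0^{\pi}}\left(X_i\delta_{\nu i}\right) d\Pi(P)\left(1 - \phi_{n_\nu}\right)
\]
\[
\overset{(i)}{\le} \frac{16\gamma^2\left[\gamma + C_3\right]}{\left(KM^2 f - 2\gamma\right)^2 N_\nu \xi_{N_\nu}^2} + \exp\left(-\frac{KM^2 n_\nu \xi_{N_\nu}^2}{2\gamma}\right) + \exp\left(\frac{KM^2 n_\nu \xi_{N_\nu}^2}{2\gamma}\right) \cdot E_{P_0,P_\nu}\int_{\{P \in \mathcal{P}_{N_\nu} : d^{\pi}_{N_\nu}(P, P_0) \ge M\xi_{N_\nu}\}} \prod_{i=1}^{N_\nu} \frac{p^{\pi}}{p_0^{\pi}}\left(X_i\delta_{\nu i}\right) d\Pi(P)\left(1 - \phi_{n_\nu}\right), \tag{25}
\]

where in (i) we have used condition (A2), which bounds from above $\Pi\left(\mathcal{P} \setminus \mathcal{P}_{N_\nu}\right)$, the prior mass assigned to the portion of the model space that lies outside the sieve, and have plugged in for the constant, $C$.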

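By conditions (A1), (A4) and Lemma 4.1,

\[
E_{P_0,P_\nu}\int_{\{P \in \mathcal{P}_{N_\nu} : d^{\pi}_{N_\nu}(P, P_0) \ge M\xi_{N_\nu}\}} \prod_{i=1}^{N_\nu} \frac{p^{\pi}}{p_0^{\pi}}\left(X_i\delta_{\nu i}\right) d\Pi(P)\left(1 - \phi_{n_\nu}\right) \le 2\gamma\exp\left(-\frac{KM^2 n_\nu \xi_{N_\nu}^2}{\gamma}\right).
\]

Returning to the expectation in Equation 25,

\[
E_{P_0,P_\nu}\Pi^{\pi}\left(P \in \mathcal{P} : d^{\pi}_{N_\nu}(P, P_0) \ge M\xi_{N_\nu} \mid X_1\delta_{\nu 1}, \ldots, X_{N_\nu}\delta_{\nu N_\nu}\right)\left(1 - \phi_{n_\nu}\right)
\]
\[
\le \frac{16\gamma^2\left[\gamma + C_3\right]}{\left(KM^2 f - 2\gamma\right)^2 N_\nu \xi_{N_\nu}^2} + \exp\left(-\frac{KM^2 n_\nu \xi_{N_\nu}^2}{2\gamma}\right) + \exp\left(\frac{KM^2 n_\nu \xi_{N_\nu}^2}{2\gamma}\right) \cdot 2\gamma\exp\left(-\frac{KM^2 n_\nu \xi_{N_\nu}^2}{\gamma}\right)
\]
\[
\overset{(i)}{\le} \frac{16\gamma^2\left[\gamma + C_3\right]}{\left(Kf - 2\gamma\right)^2 N_\nu \xi_{N_\nu}^2} + 3\gamma\exp\left(-\frac{KM^2 n_\nu \xi_{N_\nu}^2}{2\gamma}\right), \tag{26}
\]

where in (i) we use our earlier stated bound, $KM^2 - 1 > K \rightarrow KM^2 > K + 1$. Bringing all the pieces together,

\[
E_{P_0,P_\nu}\Pi^{\pi}\left(P \in \mathcal{P} : d^{\pi}_{N_\nu}(P, P_0) \ge M\xi_{N_\nu} \mid X_1\delta_{\nu 1}, \ldots, X_{N_\nu}\delta_{\nu N_\nu}\right)
\]
\[
\le 2\exp\left(-K n_\nu \xi_{N_\nu}^2\right) + \frac{16\gamma^2\left[\gamma + C_3\right]}{\left(Kf - 2\gamma\right)^2 N_\nu \xi_{N_\nu}^2} + 3\gamma\exp\left(-\frac{KM^2 n_\nu \xi_{N_\nu}^2}{2\gamma}\right)
\]
\[
\le \frac{16\gamma^2\left[\gamma + C_3\right]}{\left(Kf - 2\gamma\right)^2 N_\nu \xi_{N_\nu}^2} + 5\gamma\exp\left(-\frac{K n_\nu \xi_{N_\nu}^2}{2\gamma}\right), \tag{27}
\]

where $\gamma \ge 1$ and $C_3 > 0$. The right-hand side of Equation 27 tends to 0 (as $\nu \uparrow \infty$) in $P_0$ probability. This concludes the proof.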
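The constants $3\gamma$ and $5\gamma$ in Equations 26 and 27 follow from elementary steps, written out here as a sketch. The shorthand $a$ and $b$ is introduced only for this note; the steps use $\gamma \ge 1$ and $M^2 > 1$, the latter implied by $KM^2 - 1 > K$.

```latex
% Shorthand (not in the original): a = K M^2 n_\nu \xi_{N_\nu}^2,  b = K n_\nu \xi_{N_\nu}^2,  so a > b.
\begin{align*}
e^{a/(2\gamma)} \cdot 2\gamma\, e^{-a/\gamma} &= 2\gamma\, e^{-a/(2\gamma)}, \\
e^{-a/(2\gamma)} + 2\gamma\, e^{-a/(2\gamma)} &\le 3\gamma\, e^{-a/(2\gamma)}
  && (\gamma \ge 1), \\
2 e^{-b} + 3\gamma\, e^{-a/(2\gamma)} &\le 2\gamma\, e^{-b/(2\gamma)} + 3\gamma\, e^{-b/(2\gamma)}
  = 5\gamma\, e^{-b/(2\gamma)}
  && (\gamma \ge 1,\; a > b).
\end{align*}
```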

B

Proof of Lemma 4.1

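Proof. We proceed constructively to simplify the form of the expectations on the left-hand side of both Equations 6 and 7 and follow with an application of Lemma 2 (and result 2.2) and Lemma 9 of Ghosal and van der Vaart (2007), which is used to establish the right-hand bound of Equation 7 (based on the existence of tests, $\phi_{n_\nu}$).

Fixing $\nu$, we index units that comprise the population with $U_\nu = \{1, \ldots, N_\nu\}$. Next, draw a single observed sample of $n_\nu$ units from $U_\nu$, indexed by the subsequence $\{i_\ell \in U_\nu : \delta_{\nu i_\ell} = 1, \ell = 1, \ldots, n_\nu\}$. Without loss of generality, we simplify notation to follow by indexing the observed sample, sequentially, with $\ell = 1, \ldots, n_\nu$. We next decompose the expectation under the joint distribution with respect to population generation, $P_0$, and the drawing of a sample, $P_\nu$. Suppose we draw $P$ from some set $B \subset \mathcal{P}$. By Fubini,

\[
E_{P_0,P_\nu}\left[ \int_{P \in B} \prod_{i=1}^{N_\nu} \frac{p^{\pi}}{p_0^{\pi}}\left(X_i\delta_{\nu i}\right) d\Pi(P)\left(1 - \phi_{n_\nu}\right) \right]
\]
\[
\le \int_{P \in B} E_{P_0,P_\nu}\left[ \prod_{i=1}^{N_\nu} \frac{p^{\pi}}{p_0^{\pi}}\left(X_i\delta_{\nu i}\right)\left(1 - \phi_{n_\nu}\right) \right] d\Pi(P) \tag{28}
\]
\[
\le \int_{P \in B} \sum_{\delta_\nu \in \Delta_\nu} E_{P_0}\left[ \prod_{\ell=1}^{n_\nu} \left(\frac{p}{p_0}\left(X_\ell\right)\right)^{\frac{1}{\pi_{\nu\ell}}}\left(1 - \phi_{n_\nu}\right) \,\Big|\, \delta_\nu \right] P_{P_\nu}\left(\delta_\nu\right) d\Pi(P) \tag{29}
\]
\[
\le \int_{P \in B} \max_{\delta_\nu \in \Delta_\nu} E_{P_0}\left[ \prod_{\ell=1}^{n_\nu} \left(\frac{p}{p_0}\left(X_\ell\right)\right)^{\frac{1}{\pi_{\nu\ell}}}\left(1 - \phi_{n_\nu}\right) \,\Big|\, \delta_\nu \right] d\Pi(P) \tag{30}
\]
\[
\le \int_{P \in B} E_{P_0}\left[ \prod_{\ell=1}^{n_\nu} \left(\frac{p}{p_0}\left(X_\ell\right)\right)^{\frac{1}{\pi_{\nu\ell}}}\left(1 - \phi_{n_\nu}\right) \,\Big|\, \delta_\nu^{*} \right] d\Pi(P) \tag{31}
\]
\[
\le \int_{P \in B} E_{P_0}\left[ \prod_{\ell=1}^{n_\nu} \frac{p}{p_0}\left(X_\ell\right)\left(1 - \phi_{n_\nu}\right) \,\Big|\, \delta_\nu^{*} \right] d\Pi(P) \tag{32}
\]
\[
\le \int_{P \in B} P_{\delta_\nu^{*}}\left(1 - \phi_{n_\nu}\right) d\Pi(P),
\]

where $\sum_{\delta_\nu \in \Delta_\nu} P_{P_\nu}\left(\delta_\nu\right) = 1$ (Särndal et al. 2003) and $\delta_\nu^{*} \in \Delta_\nu = \left\{ \{\delta_{\nu i}^{*}\}_{i=1,\ldots,N_\nu}, \; \delta_{\nu i}^{*} \in \{0, 1\} \right\}$ denotes that sample, drawn from the space of all possible samples, $\Delta_\nu$, which maximizes the probability under the population generating distribution for the event of interest. The inequality in Equation 32 results from $\frac{p}{p_0} \le 1$ and $\frac{1}{\pi_{\nu\ell}} \ge 1$. The conditional expectation of $\left(1 - \phi_{n_\nu}\right)$ given $\delta_\nu^{*}$ is denoted by $P_{\delta_\nu^{*}}\left(1 - \phi_{n_\nu}\right)$.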

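If $P \in \mathcal{P} \setminus \mathcal{P}_{N_\nu}$,

\[
E_{P_0,P_\nu}\left[ \int_{P \in \mathcal{P} \setminus \mathcal{P}_{N_\nu}} \prod_{i=1}^{N_\nu} \frac{p^{\pi}}{p_0^{\pi}}\left(X_i\delta_{\nu i}\right)\left(1 - \phi_{n_\nu}\right) d\Pi(P) \right] \le \int_{P \in \mathcal{P} \setminus \mathcal{P}_{N_\nu}} P_{\delta_\nu^{*}}\left(1 - \phi_{n_\nu}\right) d\Pi(P) \le \int_{P \in \mathcal{P} \setminus \mathcal{P}_{N_\nu}} d\Pi(P) = \Pi\left(\mathcal{P} \setminus \mathcal{P}_{N_\nu}\right),
\]

since $\left(1 - \phi_{n_\nu}\right) \le 1$.

We next establish a bound for $P_{\delta_\nu^{*}}\left(1 - \phi_{n_\nu}\right)$ on a sieve or slice. Let $A_r^{\pi} = \{P \in \mathcal{P}_{N_\nu} : r\xi_{N_\nu} \le d^{\pi}_{N_\nu}(P, P_0) \le 2r\xi_{N_\nu}\}$ for integers, $r$. Under observed $X_1\delta_{\nu 1}^{*}, \ldots, X_{N_\nu}\delta_{\nu N_\nu}^{*} \in \mathcal{X}$, by conditions (A1) and (A4) we have,

\[
\sup_{P \in A_r^{\pi}} P_{\delta_\nu^{*}}\left(1 - \phi_{n_\nu}\right) \tag{33}
\]
\[
= \sup_{\{P \in \mathcal{P}_{N_\nu} : r\xi \le d^{\pi}_{N_\nu}(P, P_0) \le 2r\xi\}} P_{\delta_\nu^{*}}\left(1 - \phi_{n_\nu}\right) \tag{34}
\]
\[
\overset{(i)}{\le} \sup_{\left\{P \in \mathcal{P}_{N_\nu} : \frac{r\xi}{\sqrt{\gamma}} \le d_{N_\nu}(P, P_0) \le \frac{2r\xi}{\sqrt{\gamma}}\right\}} P_{\delta_\nu^{*}}\left(1 - \phi_{n_\nu}\right) \tag{35}
\]
\[
\overset{(ii)}{\le} \exp\left(-\frac{K n_\nu r^2 \xi^2}{\gamma}\right), \tag{36}
\]

where the smaller range in (i), $\left\{P \in \mathcal{P}_{N_\nu} : \frac{r\xi}{\sqrt{\gamma}} \le d_{N_\nu}(P, P_0) \le \frac{2r\xi}{\sqrt{\gamma}}\right\}$, increases $P_{\delta_\nu^{*}}\left(1 - \phi_{n_\nu}\right)$. The result in (ii) uses condition (A2) to obtain the result of Lemmas 2 and 9 in Ghosal and van der Vaart (2007), where we set $\xi \rightarrow \xi / \sqrt{\gamma}$.

Finally, fixing some value for $\delta > 0$, set $r = 2^{\ell}\delta$ for integers $\ell \ge 0$. Following the approach for bounding the sum over the slices in Wong and Shen (1995), let $L$ be the smallest integer such that $2^{2L}\delta^2\xi^2 > 2\gamma$, since $d^{\pi}_{N_\nu} < \sqrt{2\gamma}$ (by our definition of the pseudo Hellinger metric in Section 2.1). Then,

\[
E_{P_0,P_\nu}\left[ \int_{\{P \in \mathcal{P}_{N_\nu} : d^{\pi}_{N_\nu}(P, P_0) \ge \delta\xi\}} \prod_{i=1}^{N_\nu} \frac{p^{\pi}}{p_0^{\pi}}\left(X_i\delta_{\nu i}\right) d\Pi(P)\left(1 - \phi_{n_\nu}\right) \right] \tag{37}
\]
\[
= \sum_{\ell=0}^{L} E_{P_0,P_\nu} \int_{\{P \in \mathcal{P}_{N_\nu} : 2^{\ell}\delta\xi \le d^{\pi}_{N_\nu}(P, P_0) \le 2^{\ell+1}\delta\xi\}} \prod_{i=1}^{N_\nu} \frac{p^{\pi}}{p_0^{\pi}}\left(X_i\delta_{\nu i}\right) d\Pi(P)\left(1 - \phi_{n_\nu}\right) \tag{38}
\]
\[
\le \gamma \sum_{\ell=0}^{L} \exp\left(-\frac{2^{2\ell} K n_\nu \delta^2 \xi^2}{\gamma}\right) \tag{39}
\]
\[
\le 2\gamma\exp\left(-\frac{K n_\nu \delta^2 \xi^2}{\gamma}\right), \tag{40}
\]

for $n_\nu$ sufficiently large such that $\frac{K n_\nu \delta^2 \xi^2}{\gamma} \ge 1$.

This concludes the proof.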
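The step from Equation 39 to Equation 40 relies on the slice sum being dominated by its first term once $K n_\nu \delta^2 \xi^2 / \gamma \ge 1$. The following sketch checks that inequality numerically for a few values; the shorthand `a` for $K n_\nu \delta^2 \xi^2 / \gamma$ and the function name are introduced here only for illustration.

```python
import math

def slice_sum(a, L):
    """Sum over slices l = 0..L of exp(-2^(2l) * a), the sum in Equation 39
    without the leading gamma factor; a stands for K * n * delta^2 * xi^2 / gamma."""
    return sum(math.exp(-(4 ** l) * a) for l in range(L + 1))

def bound_holds(a, L):
    """Check the step to Equation 40: the slice sum is at most 2 * exp(-a)."""
    return slice_sum(a, L) <= 2.0 * math.exp(-a)
```

For `a >= 1` each successive term is smaller than the previous one by a factor of at most `exp(-3)`, which is why the sum stays below twice its first term for any number of slices.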

C

Proof of Lemma 4.2

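Proof. By Jensen’s inequality,

\[
\log\int_{P \in \mathcal{P}} \prod_{i=1}^{N_\nu} \frac{p^{\pi}}{p_0^{\pi}}\left(X_i\delta_{\nu i}\right) d\Pi(P) \ge \sum_{i=1}^{N_\nu}\int_{P \in \mathcal{P}} \log\frac{p^{\pi}}{p_0^{\pi}}\left(X_i\delta_{\nu i}\right) d\Pi(P) = N_\nu \cdot \mathbb{P}_{N_\nu}\int_{P \in \mathcal{P}} \log\frac{p^{\pi}}{p_0^{\pi}}\, d\Pi(P),
\]

where we recall that the last equation denotes the empirical expectation functional taken with respect to the joint distribution over population generation and informative sampling. By Fubini,

\[
\mathbb{P}_{N_\nu}\int_{P \in \mathcal{P}} \log\frac{p^{\pi}}{p_0^{\pi}}\, d\Pi(P) = \int_{P \in \mathcal{P}} \mathbb{P}_{N_\nu}\log\frac{p^{\pi}}{p_0^{\pi}}\, d\Pi(P)
= \int_{P \in \mathcal{P}} \mathbb{P}_{N_\nu}\frac{\delta_\nu}{\pi_\nu}\log\frac{p}{p_0}\, d\Pi(P)
= \int_{P \in \mathcal{P}} \mathbb{P}^{\pi}_{N_\nu}\log\frac{p}{p_0}\, d\Pi(P)
= \mathbb{P}^{\pi}_{N_\nu}\int_{P \in \mathcal{P}} \log\frac{p}{p_0}\, d\Pi(P),
\]

where we, again, apply Fubini. Then, the probability statement in the result of Equation 8 is bounded (from above) by,

\[
\Pr\left\{ \mathbb{G}^{\pi}_{N_\nu}\int_{P \in \mathcal{P}} \log\frac{p}{p_0}\, d\Pi(P) \le -\sqrt{N_\nu}\,\xi^2(1 + C) - \sqrt{N_\nu}\, P_0\int_{P \in \mathcal{P}} \log\frac{p}{p_0}\, d\Pi(P) \right\}
\]
\[
= \Pr\left\{ \mathbb{G}^{\pi}_{N_\nu}\int_{P \in \mathcal{P}} \log\frac{p}{p_0}\, d\Pi(P) \le -\sqrt{N_\nu}\,\xi^2(1 + C) - \sqrt{N_\nu}\int_{P \in \mathcal{P}} P_0\log\frac{p}{p_0}\, d\Pi(P) \right\}
\]
\[
\le \Pr\left\{ \mathbb{G}^{\pi}_{N_\nu}\int_{P \in \mathcal{P}} \log\frac{p}{p_0}\, d\Pi(P) \le -\sqrt{N_\nu}\,\xi^2(1 + C) + \sqrt{N_\nu}\,\xi^2 = -\sqrt{N_\nu}\,\xi^2 C \right\},
\]

where we have again applied Fubini in the second expression and also the bound $-P_0\log\frac{p}{p_0} \le \xi^2$ for $P$ on the set $B$.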

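We now apply Chebyshev and Jensen’s inequality to bound the probability,

\[
\Pr\left\{ \mathbb{G}^{\pi}_{N_\nu}\int_{P \in \mathcal{P}} \log\frac{p}{p_0}\, d\Pi(P) \le -\sqrt{N_\nu}\,\xi^2 C \right\} \le \frac{\mathrm{Var}\left[ \int_{P \in \mathcal{P}} \mathbb{G}^{\pi}_{N_\nu}\log\frac{p}{p_0}\, d\Pi(P) \right]}{N_\nu \xi^4 C^2} \tag{41a}
\]
\[
\le \frac{\int_{P \in \mathcal{P}} \mathrm{Var}\left[ \mathbb{G}^{\pi}_{N_\nu}\log\frac{p}{p_0} \right] d\Pi(P)}{N_\nu \xi^4 C^2} \tag{41b}
\]
\[
\le \frac{\int_{P \in \mathcal{P}} E_{P_0,P_\nu}\left[ \left( \mathbb{G}^{\pi}_{N_\nu}\log\frac{p}{p_0} \right)^2 \right] d\Pi(P)}{N_\nu \xi^4 C^2} \tag{41c}
\]
\[
\le \frac{\int_{P \in \mathcal{P}} E_{P_0,P_\nu}\left[ \left( \sqrt{N_\nu}\,\mathbb{P}^{\pi}_{N_\nu}\log\frac{p}{p_0} \right)^2 \right] d\Pi(P)}{N_\nu \xi^4 C^2}, \tag{41d}
\]

where $E_{P_0,P_\nu}(\cdot)$ denotes the expectation with respect to the joint distribution over population generation and sampling (from that population) without replacement. We apply Jensen’s inequality in Equation 41b, use $E\left(X^2\right) \ge \mathrm{Var}(X)$ in the third inequality, stated in Equation 41c, and drop the centering term in Equation 41d. We now bound the expectation inside the square brackets on the right-hand side of Equation 41d, which is taken with respect to this joint distribution. In the sequel, define $\mathcal{A}_\nu = \sigma\left(X_1, \ldots, X_{N_\nu}\right)$ as the sigma field of information potentially available for the $N_\nu$ units in population, $U_\nu$.

\[
E_{P_0,P_\nu}\left[ \left( \sqrt{N_\nu}\,\mathbb{P}^{\pi}_{N_\nu}\log\frac{p}{p_0} \right)^2 \right] = \frac{1}{N_\nu}\sum_{i,j \in U_\nu} E_{P_0,P_\nu}\left[ \frac{\delta_{\nu i}\delta_{\nu j}}{\pi_{\nu i}\pi_{\nu j}}\log\frac{p}{p_0}\left(X_i\right)\log\frac{p}{p_0}\left(X_j\right) \right]
\]
\[
= \frac{1}{N_\nu}\sum_{i=j \in U_\nu} E_{P_0}\left[ E_{P_\nu}\left\{ \frac{\delta_{\nu i}}{\pi_{\nu i}^2}\left( \log\frac{p}{p_0}\left(X_i\right) \right)^2 \,\Big|\, \mathcal{A}_\nu \right\} \right] + \frac{1}{N_\nu}\sum_{i \ne j \in U_\nu} E_{P_0}\left[ \frac{E_{P_\nu}\left[\delta_{\nu i}\delta_{\nu j} \mid \mathcal{A}_\nu\right]}{\pi_{\nu i}\pi_{\nu j}}\log\frac{p}{p_0}\left(X_i\right)\log\frac{p}{p_0}\left(X_j\right) \right]
\]
\[
= \frac{1}{N_\nu}\sum_{i=j \in U_\nu} E_{P_0}\left[ \frac{1}{\pi_{\nu i}}\left( \log\frac{p}{p_0}\left(X_i\right) \right)^2 \right] + \frac{1}{N_\nu}\sum_{i \ne j \in U_\nu} E_{P_0}\left[ \frac{\pi_{\nu ij}}{\pi_{\nu i}\pi_{\nu j}}\log\frac{p}{p_0}\left(X_i\right)\log\frac{p}{p_0}\left(X_j\right) \right]
\]
\[
\le \xi^2 \sup_{\nu}\left[ \frac{1}{\min_{i \in U_\nu}\pi_{\nu i}} \right] + \xi^2\left(N_\nu - 1\right)\sup_{\nu}\max_{i \ne j \in U_\nu}\left[ \frac{\pi_{\nu ij}}{\pi_{\nu i}\pi_{\nu j}} \right]
\]
\[
\le \xi^2\left(\gamma + C_3\right),
\]

for sufficiently large $N_\nu$, where we have applied the condition for $P \in B$ for the first term of the last two inequalities and conditions (A4) and (A5) for the last inequality. We additionally note that $\pi_{\nu ij} = \pi_{\nu i}$ when $i = j$, $i, j \in U_\nu$. This concludes the proof.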
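The decomposition of the second moment into diagonal terms in $1/\pi_{\nu i}$ and off-diagonal terms in $\pi_{\nu ij}/(\pi_{\nu i}\pi_{\nu j})$ can be checked exactly on a toy population. The sketch below assumes Poisson sampling (independent inclusion indicators, so $\pi_{\nu ij} = \pi_{\nu i}\pi_{\nu j}$ for $i \ne j$), which is not the fixed-size design used in the paper; it conditions on fixed values `f[i]` standing in for $\log\frac{p}{p_0}(X_i)$, and all names are hypothetical.

```python
from itertools import product

def exact_moments(f, pi):
    """Enumerate all samples under Poisson sampling and return the exact
    E[P^pi f] and E[(sqrt(N) * P^pi f)^2], where P^pi f = (1/N) sum_i delta_i f_i / pi_i."""
    N = len(f)
    mean, second = 0.0, 0.0
    for delta in product((0, 1), repeat=N):
        prob = 1.0
        for d, p in zip(delta, pi):
            prob *= p if d else 1.0 - p      # independent inclusion indicators
        ht = sum(d * fi / p for d, fi, p in zip(delta, f, pi)) / N
        mean += prob * ht
        second += prob * N * ht ** 2
    return mean, second

def second_moment_formula(f, pi):
    """(1/N) * [sum_i f_i^2 / pi_i + sum_{i != j} f_i f_j], using pi_ij / (pi_i pi_j) = 1."""
    N = len(f)
    diag = sum(fi ** 2 / p for fi, p in zip(f, pi))
    off = sum(f[i] * f[j] for i in range(N) for j in range(N) if i != j)
    return (diag + off) / N
```

The enumeration reproduces the closed form, and it also shows that the weighted empirical mean is unbiased for the population mean of `f`, which is the property the inverse-probability weights are designed to deliver.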
