Parameter Estimation with Out-of-Sample Objective

Peter Reinhard Hansen^a and Elena-Ivona Dumitrescu^b

^a European University Institute & UNC Chapel Hill & CREATES*
^b Paris-Ouest Nanterre la Défense University

April 22, 2016

Abstract

We study parameter estimation from the sample X, when the objective is to maximize the expected value of a criterion function, Q, for a distinct sample, Y. This is the situation that arises when a model is estimated for the purpose of describing other data than those used for estimation. The motivation for much estimation has this form, with forecasting problems being a prime example. A natural candidate for solving max_{T∈σ(X)} EQ(Y, T) is the innate estimator, θ̂ = arg max_θ Q(X, θ). While the innate estimator has certain advantages, we show that the asymptotically efficient estimator takes the form θ̃ = arg max_θ Q̃(X, θ), where Q̃ is defined from a likelihood function in conjunction with Q. The likelihood-based estimator is, however, fragile, as misspecification is harmful in two ways. First, the likelihood-based estimator may be inefficient under misspecification. Second, and more importantly, the likelihood approach requires a parameter transformation that depends on the truth, causing an improper mapping to be used under misspecification. The theoretical results are illustrated with two applications comprising asymmetric loss and multi-step forecasting, respectively.

Keywords: Forecasting, Out-of-Sample, LinEx Loss, Multi-step forecasting. JEL Classification: C52

∗ We thank Valentina Corradi, Nour Meddahi, Werner Ploberger, Barbara Rossi, Mark Watson, and seminar speakers at University of Pennsylvania, Penn State, and Cambridge University for helpful comments. The first author acknowledges support from CREATES - Center for Research in Econometric Analysis of Time Series (DNRF78), funded by the Danish National Research Foundation.


1 Introduction

Efficient parameter estimation is a well explored research topic. For instance, an estimator T(X) is said to be efficient for θ if it minimizes the expected loss, E[L(T(X), θ)], where L is a loss function and X is the random sample available for estimation. In this paper, we consider parameter estimation with a different objective. Our objective is characterized by the intended use of the estimated model, which involves a second random sample, Y. This is the structure that emerges in forecasting problems, where Y represents future data and X is the sample available for estimation. More generally, the sample Y can represent a random draw from the general population for which an estimated model is to be used. For instance, from a pilot study (based on X), one may seek to optimize tuning parameters in a policy program before the program is implemented more widely (to Y). Although our results are not specific to forecasting, we sometimes use standard forecasting terminology by referring to X and Y as in-sample and out-of-sample, respectively.

To fix ideas: Let the objective be max_{T∈σ(X)} EQ(Y, T), where Q is a criterion function. A natural candidate is the extremum estimator, θ̂ = arg max_θ Q(X, θ), which we label the innate estimator, because it is deduced directly from Q. While the innate estimator seeks to maximize the objective, Q, it need not be efficient and a better estimator may be available. To study this problem we consider a class of extremum estimators, where a typical element is given by θ̃ = arg max_θ Q̃(X, θ), where Q̃ is another criterion. While it may seem unnatural to estimate parameters using a criterion, Q̃, that differs from the actual objective, Q, this approach is quite common in practice. This is sometimes done out of convenience,^1 but, as we will show, the asymptotically efficient estimator is one deduced from a carefully crafted Q̃. The use of a different criterion, Q̃, for estimation has many pitfalls, and the asymptotic efficiency hinges on additional assumptions, such as the absence of misspecification.

We shall establish results in an asymptotic framework that are based on conventional assumptions made in the context of M-estimation. While our framework and objective differ from those usually used to study efficient parameter estimation, the classical structure emerges after manipulating the asymptotic expressions. This enables us to utilize the Cramér-Rao lower bound to establish a likelihood-based estimator as the asymptotically efficient estimator, albeit new and important issues arise in the case where the likelihood is misspecified. Under correct specification, the likelihood-based estimator dominates the innate estimator, sometimes by a wide margin. When the likelihood is misspecified, the asymptotic efficiency argument perishes but, more importantly, the likelihood approach requires a mapping of likelihood parameters to criterion parameters that hinges on the likelihood being correctly specified.

^1 For instance, estimation by simple regression analysis, although the objective may be the prediction of Value-at-Risk.


Under misspecification, this mapping becomes improper and causes θ̃ to be inconsistent for the value of θ that maximizes the objective. So our results cast light on the relative merits of likelihood-based estimation versus innate estimation. An advantage of the likelihood approach is that once the model is estimated, it can in principle be tailored to suit any objective. In contrast, the innate estimator is intrinsically tied to the objective. So if the objective changes then the innate estimator must be adjusted accordingly.

Our limit results do not unequivocally point to one approach being preferred to the other. It ultimately rests on the faith (or skepticism) that one has about the specification. If the likelihood is correctly specified, the limit theory clearly favors the likelihood-based estimator, while the innate estimator is preferred asymptotically under a fixed degree of misspecification. At moderate levels of misspecification, as defined in a framework with local-to-correct specifications, the choice is less obvious. The misspecification threshold at which the innate estimator becomes superior to the likelihood-based estimator is context-specific and depends on many factors, including the nature of the misspecification and the criterion function, Q. Model diagnostics and misspecification tests can provide some help in the choice of estimator, but should be tailored to the specific problem at hand.

In the context of forecasting, many have argued for the estimation criterion to coincide with the actual objective, starting with Granger (1969), see also Weiss (1996). For empirical support of this approach, see e.g. Weiss and Andersen (1984) and Christoffersen and Jacobs (2004). Changing the forecasting horizon amounts to changing the objective, even if the same loss function is applied to the forecasting errors. For this reason, one might expect the optimal estimation method to vary with the forecasting horizon. In the autoregressive setting with quadratic prediction loss, Bhansali (1999) and Ing (2003) have established that the relative merits of the two estimation methods depend on the degree of misspecification. This led Schorfheide (2005) to propose a model selection criterion that accounts for the bias-variance trade-off, and Hansen (2010a) develops a leave-h-out cross-validation criterion that is well suited for the selection of h-step-ahead forecasting models.

The existing literature has primarily focused on the case with a mean squared error (MSE) loss function and likelihood functions based on Gaussian specifications. In this paper, we establish results for the case where both Q and Q̃ belong to a general class of criteria that are suitable for M-estimation, see e.g. Huber (1981) and Amemiya (1985). In this general framework we establish analytical results that are asymptotic in nature. Specifically, we compare the relative merits of estimators in terms of the limit distributions that arise in this context. The asymptotic results are complemented by two applications where we also study the finite sample properties and consider situations with local misspecification in various forms.


First we make the simple observation that a discrepancy between Q and Q̃ can seriously degrade the performance of θ̃. Second, we show that the asymptotically optimal estimator is one deduced from the maximum likelihood estimator. This theoretical result is analogous to the well known Cramér-Rao bound for in-sample estimation. We address the case where the likelihood function involves a parameter of higher dimension than θ, and discuss the losses incurred by misspecification of the likelihood.

We illustrate the theoretical result in a context with an asymmetric (LinEx) loss function. The innate estimator performs on par with the likelihood-based estimator (LBE) when the loss is near-symmetric, whereas the LBE clearly dominates the innate estimator under asymmetric loss. In contrast, when the likelihood is misspecified the LBE suffers and its performance can drop considerably as the degree of misspecification increases.

A second application pertains to multi-step forecasting. In this literature the two competing forecasting methods are known as the direct forecast and the iterated forecast.^2 The direct and iterated forecasts can be related to the innate and the likelihood-based estimators, respectively. Several well known results for direct and iterated forecasts in the context of an autoregressive model and MSE loss emerge as special cases in our framework. We contribute to this literature by considering a case with asymmetric loss and derive results for the case with correct specification and the case with local misspecification. The asymmetry exacerbates the advantages of iterated forecasts (the likelihood-based approach), so that it takes a relatively higher degree of misspecification for the direct forecast to be competitive.

The rest of the paper is structured as follows. Section 2 presents the theoretical framework and the asymptotic results. Sections 3 and 4 present the two applications to asymmetric loss and multi-step forecasting, respectively. Section 5 concludes, and the three appendices contain proofs of the analytical results in Section 2, auxiliary results relating to the two applications, and details about the simulation studies. A Web Appendix is available with additional simulation results and analytical results.

2 Theoretical Framework

We will compare the merits of the innate estimator, θ̂, to a generic alternative estimator, θ̃. This is done within the theoretical framework of M-estimators, see Huber (1981), Amemiya (1985), and White (1994). Our notation will largely follow that in Hansen (2010b).

^2 The former is also known as prediction error estimation and the latter as the plug-in forecast.


The criterion functions take the form

$$Q(\mathcal{X}, \theta) = \sum_{t=1}^{n} q(x_t, \theta) \qquad \text{and} \qquad \tilde{Q}(\mathcal{X}, \theta) = \sum_{t=1}^{n} \tilde{q}(x_t, \theta),$$

with x_t = (X_t, ..., X_{t−k}) for some k. This framework includes criteria deduced from Markovian models. For instance, least squares estimation of an AR(1) model, X_t = ϕX_{t−1} + ε_t, would translate into x_t = (X_t, X_{t−1}) and q̃(x_t, θ) = −(X_t − ϕX_{t−1})².

Assumption 1. Suppose that {X_t} is stationary and ergodic, and that E|q(x_t, θ)| < ∞ and E|q̃(x_t, θ)| < ∞.

The assumed stationarity carries over to q(x_t, θ) and q̃(x_t, θ), and to their derivatives that we introduce below. Next we make some regularity assumptions about the criterion functions.

Assumption 2. (i) The criterion functions q(x_t; θ) and q̃(x_t; θ) are continuous in θ for all x_t and measurable for all θ ∈ Θ, where Θ is compact; (ii) θ∗ and θ0 are the unique maximizers of E[q(x_t, θ)] and E[q̃(x_t, θ)], respectively, where θ∗ and θ0 are interior to Θ; (iii) E[sup_{θ∈Θ} |q(x_t, θ)|] < ∞ and E[sup_{θ∈Θ} |q̃(x_t, θ)|] < ∞.

The assumed stationarity and Assumption 2 ensure that the θ that maximizes E[Q(X, θ)] is unique, invariant to the sample size, and given by θ∗ = arg max_θ E[q(x_t, θ)]. The same applies to Q̃ and θ0. The following consistency result follows from the literature on M-estimators, see e.g. Amemiya (1985, Theorem 4.1.1).

Lemma 1. The extremum estimators θ̂ = arg max_{θ∈Θ} Σ_{t=1}^n q(x_t, θ) and θ̃ = arg max_{θ∈Θ} Σ_{t=1}^n q̃(x_t, θ) converge in probability to θ∗ and θ0, respectively, as n → ∞.

Because the innate estimator (as its label suggests) is intrinsic to the criterion Q, it will be consistent for θ∗ under standard regularity conditions, in the sense that θ̂ →p θ∗ as the sample size, n, increases.
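To make the AR(1) example concrete, the following minimal sketch (our own illustration, not part of the paper; the simulated data and seed are arbitrary) evaluates the criterion Q̃ for the AR(1) least-squares case and recovers the extremum estimator numerically, which should coincide with the closed-form OLS slope.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Criterion from the AR(1) example above: q~(x_t, theta) = -(X_t - phi*X_{t-1})^2,
# so the extremum estimator is ordinary least squares for the AR(1) slope.
def Q_tilde(x, phi):
    return -np.sum((x[1:] - phi * x[:-1]) ** 2)

rng = np.random.default_rng(0)
n, phi0 = 1_000, 0.5
x = np.zeros(n)
for t in range(1, n):                      # simulate X_t = phi0*X_{t-1} + eps_t
    x[t] = phi0 * x[t - 1] + rng.normal()

phi_hat = minimize_scalar(lambda p: -Q_tilde(x, p),
                          bounds=(-0.999, 0.999), method="bounded").x
phi_ols = np.sum(x[1:] * x[:-1]) / np.sum(x[:-1] ** 2)   # closed-form check
print(phi_hat, phi_ols)                    # the two should agree closely
```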

This consistency need not hold for the alternative estimator, θ̃, unless θ∗ and θ0 coincide.

Next we assume the following regularity conditions that enable us to derive the limit results that form the basis for our main results. These conditions are also standard in the literature on M-estimation.

Assumption 3. The criteria, q and q̃, are twice continuously differentiable in θ, where (i) the first derivatives, s(x_t, θ) and s̃(x_t, θ), satisfy a central limit theorem, $n^{-1/2}\sum_{t=1}^{n}\big(s(x_t,\theta_*)', \tilde{s}(x_t,\theta_0)'\big)' \overset{d}{\to} N(0,\Sigma_S)$; (ii) the second derivatives, h(x_t, θ) and h̃(x_t, θ), are uniformly integrable in a neighborhood of θ∗ and θ0, respectively, where the matrices A = −E h(x_t, θ∗) and Ã = −E h̃(x_t, θ0) are invertible.

Let B and B̃ denote the long-run variances of s(x_t, θ∗) and s̃(x_t, θ0), respectively. Then Σ_S will have B and B̃ as diagonal blocks. There is no need to introduce notation for the off-diagonal blocks in Σ_S, as they are immaterial to the subsequent results.

The following result establishes an asymptotic independence between the in-sample scores and out-of-sample scores, which is useful for the computation of conditional expectations in the limit distribution.

Lemma 2. We have

$$n^{-1/2}\left(\sum_{t=1}^{n} s(x_t,\theta_*)',\ \sum_{t=1}^{n} \tilde{s}(x_t,\theta_0)',\ \sum_{t=n+1}^{2n} s(x_t,\theta_*)'\right)' \overset{d}{\to} N\!\left(0,\ \begin{pmatrix} \Sigma_S & 0 \\ 0 & B \end{pmatrix}\right).$$

In this literature it is often assumed that X and Y are independent, see e.g. Schorfheide (2005, Assumption 4), which implies independence of the score that relates to X and the score that relates to Y. Lemma 2 shows that the assumed independence between X and Y can be dispensed with, because the asymptotic independence of the scores is a simple consequence of the central limit theorem being applicable, and no further assumption is needed. The asymptotic independence holds whether the scores are martingale difference sequences or serially dependent, so long as the central limit theorem applies. To simplify the exposition, we focus on the case where the sample size of Y = (x_{n+1}, ..., x_{2n}) coincides with that of X = (x_1, ..., x_n).^3

Definition 1. Two criteria, Q and Q̃, are said to be coherent if θ∗ = θ0; otherwise the criteria are said to be incoherent. Similarly, we refer to an estimator as being coherent for Q if its probability limit is θ∗.

Next, we state the fairly obvious result that an incoherent criterion will lead to inferior performance.

Lemma 3. Consider an alternative estimator, θ̃, deduced from an incoherent criterion, so that θ̃ →p θ0 ≠ θ∗. Then

$$Q(\mathcal{Y}, \hat{\theta}) - Q(\mathcal{Y}, \tilde{\theta}) \to \infty,$$

in probability. The divergence is at rate n.

The result shows that any incoherent estimator is asymptotically inferior to the innate estimator. This shows that consistency for θ∗ is a critical requirement, which limits the set of criteria, Q̃, that are suitable for estimation. It is, however, possible to craft a coherent criterion, Q̃, from a likelihood function, as we shall show below.

^3 This setup is quite common in this literature, and was, for instance, used in Akaike (1974) to motivate Akaike's Information Criterion (AIC).


Theorem 1. If Assumptions 1–3 hold and Q̃ is a coherent criterion, then the limit distribution of Q(Y, θ̃) − Q(Y, θ0), as n → ∞, is given by

$$Z_y' B^{1/2} \tilde{A}^{-1} \tilde{B}^{1/2} Z_x - \tfrac{1}{2} Z_x' \tilde{B}^{1/2} \tilde{A}^{-1} A \tilde{A}^{-1} \tilde{B}^{1/2} Z_x,$$

where Z_x, Z_y ∼ iid N(0, I), and its expected value is −½ tr{Ã^{−1}AÃ^{−1}B̃}.

Interestingly, for the case with the innate estimator, the expected value of the limit distribution, −½ tr{A^{−1}B}, can be related to a result by Takeuchi (1976), who generalized the AIC to the case with misspecified models. The expected value of the limit distribution motivates the following definition of criterion risk:

Definition 2. The asymptotic criterion risk, induced by the estimation error of θ̃, is defined by

$$R(\tilde{\theta}) = \tfrac{1}{2}\,\mathrm{tr}\{A\tilde{A}^{-1}\tilde{B}\tilde{A}^{-1}\}.$$

The finite sample analog is defined by R_n(θ̃) = E[Q(Y, θ0) − Q(Y, θ̃)].

For the innate estimator we have R(θ̂) = ½ tr{A^{−1}B}, and its magnitude relative to ½ tr{AÃ^{−1}B̃Ã^{−1}} defines which of the two estimators is more efficient. We formalize this by defining the relative criterion efficiency

$$\mathrm{RQE}(\hat{\theta}, \tilde{\theta}) = \frac{E[Q(\mathcal{Y}, \theta_0) - Q(\mathcal{Y}, \tilde{\theta}(\mathcal{X}))]}{E[Q(\mathcal{Y}, \theta_0) - Q(\mathcal{Y}, \hat{\theta}(\mathcal{X}))]} = \frac{R_n(\tilde{\theta})}{R_n(\hat{\theta})}. \qquad (1)$$

Note that an RQE < 1 defines the case where θ̃ outperforms the innate estimator, θ̂. The asymptotic expression for the RQE is

$$\frac{R(\tilde{\theta})}{R(\hat{\theta})} = \frac{\mathrm{tr}\{A\tilde{A}^{-1}\tilde{B}\tilde{A}^{-1}\}}{\mathrm{tr}\{A^{-1}B\}},$$

provided that θ̃ is a coherent estimator. For an incoherent estimator it follows by Lemma 3 that RQE → ∞ as n → ∞, so that coherency is imperative for good asymptotic properties of an estimator.

2.1 Likelihood-Based Estimator

In this section we focus on estimators that are deduced from a likelihood criterion. In some cases, one can obtain θ̃ directly as a maximum likelihood estimator, but a mapping of the likelihood parameters, ϑ say, to the criterion parameters, θ, is typically needed. This is obviously the case if the dimensions of θ and ϑ do not coincide.

Consider a statistical model, {P_ϑ}_{ϑ∈Ξ}, and suppose that P_{ϑ0} is the true probability measure, with ϑ0 ∈ Ξ. The implication is that the expected value is defined by E_{ϑ0}(·) = ∫(·)dP_{ϑ0}, and we have in particular that

$$\theta_0 = \arg\max_{\theta} E_{\vartheta_0}[Q(\mathcal{Y}, \theta)],$$

which defines θ0 as a function of ϑ0, i.e. θ0 = θ(ϑ0).

Assumption 4. There exists τ(ϑ) so that ϑ ↦ (θ, τ) is continuously differentiable, with ∂(θ(ϑ)', τ(ϑ)')'/∂ϑ having non-zero determinant at ϑ0.

The assumption ensures that the reparameterization (that distills θ from ϑ) is invertible in a way that does not degenerate the limit distribution. While the assumption is relatively innocuous, we will analyze a case in Section 3 where this assumption does not hold.^4

Lemma 4. Given Assumptions 1–4, let ϑ̃ be the MLE. Then θ̃ = θ(ϑ̃) is a coherent estimator.

One potential challenge to using the likelihood-based estimator is that the mapping from ϑ to θ may be difficult to obtain.

When θ̃ is estimated from a correctly specified likelihood function, it follows from the information matrix identity that Ã = B̃. In terms of asymptotic criterion risk, the comparison of the innate estimator to the likelihood-based estimator becomes a comparison of the quantities ½ tr{A^{−1}B} and ½ tr{AB̃^{−1}}. The following Theorem shows that the latter is smaller.

Theorem 2 (Optimality of likelihood-based estimator). Suppose that Assumptions 1–4 hold, and let ϑ̃ be the maximum likelihood estimator so that θ̃ = θ(ϑ̃) is the likelihood-based estimator. If the likelihood function is correctly specified, then, as n → ∞,

$$Q(\mathcal{Y}, \hat{\theta}) - Q(\mathcal{Y}, \tilde{\theta}) \overset{d}{\to} \xi,$$

where E[ξ] = R(θ̂) − R(θ̃) ≤ 0.

Theorem 2 shows that the likelihood-based approach is asymptotically superior to the criterion-based approach, and the following Corollary shows that the likelihood-based estimator is in fact superior to any other M-estimator.

^4 The assumption also allows us to interpret θ̃ = θ(ϑ̃) as an extremum estimator that maximizes the reparameterized and concentrated log-likelihood function ℓ_c(θ) = ℓ(θ, τ̃(θ)), where τ̃(θ) = arg max_τ ℓ(θ, τ).


Corollary 1. Let θ̃ = θ(ϑ̃) be the likelihood-based estimator and let θ̌ be any other coherent estimator. Suppose that Assumptions 1–4 hold, with Assumptions 2 and 3 adapted to θ̌ in place of θ̂. Then

$$Q(\mathcal{Y}, \check{\theta}) - Q(\mathcal{Y}, \tilde{\theta}) \overset{d}{\to} \xi,$$

where E[ξ] = R(θ̌) − R(θ̃) ≤ 0.

An inspection of the proof of Corollary 1 reveals that the inequality is strict, unless the alternative estimator, θ̌, is asymptotically equivalent to θ̃. So the likelihood-based estimator is (also) asymptotically efficient in the present framework with an out-of-sample objective. The proof also reveals that manipulation of the asymptotic expression simplifies the comparison to one that is well known from the asymptotic analysis of estimation.

2.2 The Case with a Misspecified Likelihood

Misspecification is harmful to the likelihood-based estimator for two reasons. First, the resulting estimator is no longer efficient, which eliminates the argument in favor of adopting the likelihood-based estimator. Second, and more importantly, the mapping from ϑ to θ depends on the true probability measure, so that a misspecified likelihood will result in an improper mapping from ϑ to θ. The likelihood-based estimator θ̃ may therefore be inconsistent under misspecification, i.e. incoherent. An incoherent likelihood-based estimator, as the result of a fixed degree of misspecification, will be greatly inferior to the innate estimator in the sense that RQE → ∞ as n → ∞. An asymptotic design of this sort will in many cases be misleading about the relative performance of the estimators in finite samples. For this reason we turn our attention to the case with a slightly misspecified model. This can be achieved with an asymptotic design where the likelihood is only slightly misspecified, in the sense that the likelihood gets closer and closer to being correctly specified as n → ∞. We shall analyze cases where the discrepancy between θ0 and θ∗ is of order n^{−1/2}, and we refer to this mild form of misspecification as local-to-correct specification.

2.2.1 Local-to-Correct Specification

We consider a case where the true probability measure does not coincide with P_{ϑ0}. To make matters interesting, we consider a case with a locally misspecified model, where the degree of misspecification is balanced with the sample size. Thus, let the true probability measure be P_n, let the corresponding (best approximating) likelihood parameter be denoted by ϑ0^(n), and let θ0^(n) = θ(ϑ0^(n)). As n increases, P_n approaches P_{ϑ0} for some ϑ0 ∈ Ξ, and this occurs at a rate so that θ0^(n) − θ∗ = n^{−1/2}b, for some b ∈ R^k that defines the degree of local misspecification. Correct specification corresponds to the case where b = 0. In this scenario, the limit distribution of Q(Y, θ∗) − Q(Y, θ̃) is given by the following Theorem.

Theorem 3. Suppose that P_n → P_{ϑ0} as n → ∞, so that θ(ϑ0^(n)) − θ(ϑ0) = n^{−1/2}b. Then

$$R(\tilde{\theta}) = \tfrac{1}{2}\,\mathrm{tr}\{A(\tilde{B}^{-1} + bb')\},$$

where B̃ = E[−h̃(x_i, θ∗)].

The likelihood-based estimator retains its efficiency in terms of asymptotic variance under local misspecification, but is negatively affected by the asymptotic bias. Thus, under local misspecification the relative criterion efficiency is a question of the magnitude of the bias, bb', relative to A^{−1}BA^{−1} − B̃^{−1}, which is the advantage that the likelihood-based estimator has in terms of asymptotic variance.

One can measure the degree of local misspecification by a measure of non-centrality. In the univariate case (θ ∈ R) this can be expressed as d = b√B̃, which can be interpreted as the expected value of the t-statistic, (θ̃ − θ∗)/√avar(θ̃). In the multivariate case the non-centrality may be measured as d = √(b'B̃b). While the degree of non-centrality is, in some sense, a measure of the (average) statistical evidence of misspecification, it does not (unless k = 1) directly map into a particular value of criterion risk, because different vectors b can translate into the same non-centrality, d = √(b'B̃b), but different values of tr{Abb'} = b'Ab, and vice versa. The following Theorem states upper and lower bounds on the criterion risk that can result from a given level of misspecification.

Theorem 4. Let the local misspecification be such that d = √(b'B̃b). Then the asymptotic criterion risk resulting from this misspecification, b'Ab, is bounded by λ_min d² ≤ b'Ab ≤ λ_max d², where λ_min and λ_max are the smallest and largest solutions (eigenvalues) to |A − λB̃| = 0.

The degree of misspecification could be measured in other ways than by the non-centrality parameter d, for instance in terms of the Kullback-Leibler divergence. In one of our simulation designs below, where the likelihood-based estimator is deduced from a Gaussian likelihood, we map the value of d into the expected value of the Jarque-Bera test statistic. We shall study designs with local misspecification in the following two sections. The first is an application to a criterion function defined by the asymmetric LinEx loss function, and the second is an application of our framework to the problem of making multi-step forecasts.


Figure 1: The LinEx loss function for three values of c.

3 The Case with Asymmetric Loss and a Gaussian Likelihood

In this section we apply the theoretical results to the case where the criterion function is given by the LinEx loss function. In forecasting problems, there are many applications where asymmetry is thought to be appropriate, see e.g. Granger (1986), Christoffersen and Diebold (1997), and Hwang et al. (2001). The LinEx loss function is a highly tractable asymmetric loss function that was introduced by Varian (1974), and has found many applications in economics, see e.g. Weiss and Andersen (1984), Zellner (1986), Diebold and Mariano (1995), and Christoffersen and Diebold (1997). Here we shall adopt the following parameterization of the LinEx loss function

$$L_c(x) = \begin{cases} c^{-2}\left[\exp(cx) - cx - 1\right] & \text{for } c \in \mathbb{R}\setminus\{0\}, \\[4pt] \tfrac{1}{2}x^{2} & \text{for } c = 0, \end{cases} \qquad (2)$$

which has its minimum at x = 0. The absolute value of the parameter c determines the degree of asymmetry and its sign defines whether the asymmetry is left-skewed or right-skewed, see Figure 1. The quadratic loss arises as the limiting case, lim_{c→0} L_c(x) = ½x², which motivates the definition of L_0(x). We adopt the LinEx loss function because it produces simple estimators in closed form and analytical expressions of criterion risk that ease the computational burden.

The objective in this application is to estimate θ for the purpose of minimizing the expected loss, EL_c(Y_i − θ). This problem maps into our theoretical framework by setting q(X_i, θ) = −L_c(X_i − θ), and it is easy to show that θ∗ = arg min_θ EL_c(X_i − θ) = c^{−1} log[E exp(cX_i)], provided that E exp(cX_i) < ∞. Similarly, it can be shown that the innate estimator, which is given as the solution to min_θ Σ_{i=1}^n L_c(X_i − θ), can be written in closed form as

$$\hat{\theta} = \frac{1}{c}\log\Big[\frac{1}{n}\sum_{i=1}^{n}\exp(cX_i)\Big], \qquad (3)$$

and by the ergodicity of X_i, hence of exp(cX_i), it follows that θ̂ → θ∗ almost surely. Next, we introduce a likelihood-based estimator that is deduced from the assumption that X_i ∼ iid N(µ_0, σ_0²), for which it can be shown that

$$\theta_0 = \mu_0 + \frac{c\sigma_0^2}{2}, \qquad (4)$$

see Christoffersen and Diebold (1997). The likelihood-based estimator is therefore given by

$$\tilde{\theta} = \tilde{\mu} + \frac{c\tilde{\sigma}^2}{2}, \qquad (5)$$

where µ̃ = n^{−1}Σ_{i=1}^n X_i and σ̃² = n^{−1}Σ_{i=1}^n (X_i − µ̃)² are the maximum likelihood estimators of µ_0 and σ_0², respectively. Equation (4) illustrates the need to map the likelihood parameters, ϑ = (µ, σ²)', into the criterion parameter, θ. The likelihood-based estimator will be consistent for θ0, which coincides with θ∗ if the Gaussian assumption is correct. Under misspecification the two need not coincide.

We shall compare the two estimators in terms of the LinEx criterion

$$Q(\mathcal{Y}; \theta) = -\sum_{i=1}^{n} c^{-2}\left[\exp\{c(Y_i - \theta)\} - c(Y_i - \theta) - 1\right],$$

where Yi are iid and independent of (X1 , . . . , Xn ). First we consider the case with correct specification, i.e. the case where (X1 , . . . , Xn , Y1 , . . . , Yn ) are iid with marginal distribution N(µ0 , σ02 ). Subsequently we turn to the case where the marginal distribution is a normal inverse Gaussian (NIG) distribution, which causes the Gaussian likelihood to be misspecified.
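To illustrate this comparison, the following minimal sketch (our own illustration; the sample sizes and seed are arbitrary) computes the innate estimator (3) and the Gaussian likelihood-based estimator (5) on a simulated estimation sample and evaluates both on an independent out-of-sample draw using the LinEx criterion.

```python
import numpy as np

def linex_loss(z, c):
    return (np.exp(c * z) - c * z - 1.0) / c**2

def innate_estimator(x, c):
    # Closed form (3): theta_hat = (1/c) * log( mean(exp(c*X)) )
    return np.log(np.mean(np.exp(c * x))) / c

def likelihood_based_estimator(x, c):
    # Closed form (5): theta_tilde = mu_tilde + c * sigma_tilde^2 / 2
    return x.mean() + c * x.var() / 2.0

rng = np.random.default_rng(1)
c, n = 1.0, 1_000
x = rng.normal(size=n)                 # in-sample X (correctly specified Gaussian case)
y = rng.normal(size=n)                 # out-of-sample Y, independent of X

theta_hat = innate_estimator(x, c)
theta_tilde = likelihood_based_estimator(x, c)

# Out-of-sample LinEx losses; smaller is better.
print(linex_loss(y - theta_hat, c).sum(), linex_loss(y - theta_tilde, c).sum())
```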


3.1 Results for the Case with Correct Specification

With q_i(X_i, θ) = −L_c(X_i − θ) we have s_i(X_i, θ) = c^{−1}[exp{c(X_i − θ)} − 1] and h_i(X_i, θ) = −exp{c(X_i − θ)}. With X_i ∼ iid N(µ, σ²) it can be shown that

$$A = E[-h_i(X_i, \theta_0)] = 1, \qquad B = \mathrm{var}[s_i(X_i, \theta_0)] = \frac{\exp(c^2\sigma^2) - 1}{c^2} \quad (= \sigma^2 \text{ if } c = 0),$$
$$\tilde{A} = \tilde{B} = 1/\mathrm{avar}(\tilde{\theta}) = 1/(\sigma^2 + c^2\sigma^4/2),$$

see Appendix B.1. Consequently, in this application we have

$$\mathrm{RQE} = \frac{\mathrm{tr}\{A\tilde{B}^{-1}\}}{\mathrm{tr}\{A^{-1}B\}} = \frac{1}{B\tilde{B}} = \frac{(c\sigma)^2 + (c\sigma)^4/2}{\exp\{(c\sigma)^2\} - 1}, \qquad \text{for } c\sigma \neq 0,$$

which is (unsurprisingly) less than one, and RQE = 1 arises only in the limit, cσ → 0, where the two estimators coincide.

The relative efficiency of θ̂ and θ̃ is compared in Table 1 for the case with a correctly specified likelihood function. Panel A of Table 1 displays the asymptotic results based on our analytical expressions, whereas Panels B and C present finite sample results based on simulations with n = 1,000 and n = 100, respectively. 500,000 replications were used to compute all statistics.^5 The simulation design is detailed in Appendix C.1. The asymmetry parameter is given in the first column, followed by the population value of θ∗, the RQE, and the criterion losses resulting from estimation error.

Table 1 shows that the likelihood-based estimator dominates the innate estimator, and increasingly so for larger values of c. The exception is the case c = 0, where the two estimators coincide. Our analytical results imply the superiority of the LBE in the asymptotic design of Panel A; however, the LBE also dominates the innate estimator in finite samples, albeit to a lesser extent. The innate estimator appears to be somewhat less inferior in finite samples because its criterion loss tends to be relatively smaller in finite samples. However, this does not imply that the innate estimator performs better with a smaller sample size. To the contrary, the per-observation criterion loss, R_n(θ̂)/n, is decreasing in n. Moreover, the innate estimator has a larger finite sample bias relative to that of the likelihood-based estimator in the present context.
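As a quick check of the closed-form expression above, this snippet (our own illustration) evaluates RQE = [(cσ)² + (cσ)⁴/2]/[exp{(cσ)²} − 1] at the asymmetry values used in Table 1, with σ = 1; the output should match the RQE column of Panel A.

```python
import numpy as np

def rqe_linex_gaussian(c, sigma=1.0):
    """Asymptotic RQE under LinEx loss with a correctly specified Gaussian likelihood."""
    a = (c * sigma) ** 2
    return 1.0 if a == 0 else (a + a**2 / 2) / np.expm1(a)

for c in [0, 0.25, 0.5, 1, 1.5, 2, 2.5]:
    print(c, round(rqe_linex_gaussian(c), 3))
```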

^5 The standard deviations of all simulated quantities are smaller than 10^{−5}.


Table 1: Relative Efficiency under LinEx Loss

Panel A: Asymptotic Results
  c      θ∗       RQE      R(θ̂)     R(θ̃)     bias(θ̂)   bias(θ̃)
  0      0        1        0.5       0.5       0.000      0.000
  0.25   0.125    0.999    0.516     0.516     0.000      0.000
  0.5    0.250    0.990    0.568     0.563     0.000      0.000
  1      0.500    0.873    0.859     0.750     0.000      0.000
  1.5    0.750    0.563    1.886     1.063     0.000      0.000
  2      1.000    0.224    6.700     1.500     0.000      0.000
  2.5    1.250    0.050    41.36     2.063     0.000      0.000

Panel B: Finite Sample Results: n = 1,000
  c      θ∗       RQE      Rn(θ̂)    Rn(θ̃)    bias(θ̂)   bias(θ̃)
  0      0        1        0.499     0.499     0.000      0.000
  0.25   0.125    0.999    0.518     0.518     0.000      0.000
  0.5    0.250    0.991    0.569     0.563     0.000      0.000
  1      0.500    0.880    0.853     0.748    -0.001      0.000
  1.5    0.750    0.600    1.777     1.068    -0.003     -0.001
  2      1.000    0.350    4.341     1.513    -0.010     -0.001
  2.5    1.250    0.217    9.670     2.100    -0.030     -0.001

Panel C: Finite Sample Results: n = 100
  c      θ∗       RQE      Rn(θ̂)    Rn(θ̃)    bias(θ̂)   bias(θ̃)
  0      0        1        0.498     0.498     0.000      0.000
  0.25   0.125    0.999    0.515     0.514    -0.001     -0.001
  0.5    0.250    0.991    0.567     0.562    -0.003     -0.003
  1      0.500    0.900    0.839     0.753    -0.009     -0.005
  1.5    0.750    0.720    1.493     1.075    -0.023     -0.008
  2      1.000    0.560    2.745     1.526    -0.058     -0.010
  2.5    1.250    0.418    5.273     2.203    -0.122     -0.012

The likelihood-based estimator θ̃ is compared to the innate estimator, θ̂, in terms of the relative criterion efficiency in the case with LinEx loss and iid Gaussian observations with zero mean and unit variance. The former dominates the innate estimator and does so increasingly as the asymmetry increases. The upper panel reports the asymptotic results, whereas the next two panels present the corresponding results in finite samples, n = 1,000 and n = 100. The results are based on 500,000 simulations.


3.1.1 Likelihoods with One-Dimensional Parameter

With a likelihood deduced from X_t ∼ N(µ, σ²), we have in some sense stacked the results against the likelihood-based estimators. The likelihood approach involves a two-dimensional estimator, (µ̃, σ̃²), whereas the innate estimator only estimates the one-dimensional θ. This could be favorable to the innate estimator in finite samples, but the dimension of ϑ is immaterial to the asymptotic comparison. This follows from the fact that the asymptotic RQE for the likelihood-based estimator is always bounded by one, regardless of the dimension of ϑ. However, the asymptotic variance of θ̃ could be influenced by the complexity of the underlying likelihood function, so that a simpler likelihood (one with fewer degrees of freedom) may be even better in terms of RQE. To illustrate this, we consider two nested models that both have a one-dimensional ϑ. The first nested model takes σ_0² to be known, so that only µ is to be estimated, and the second nested model takes µ_0 to be known, so that ϑ = σ². The latter design is of separate interest because it constitutes a case where Assumption 4 is violated. The asymptotic variance of θ̃ that arises in the two models is the following:

$$\mathrm{avar}(\tilde{\theta}) = \begin{cases} \sigma_0^2 & \text{if } \sigma_0^2 \text{ is known}, \\[6pt] \dfrac{c^2\sigma_0^4}{2} & \text{if } \mu_0 \text{ is known}. \end{cases}$$

When σ_0² is known, the asymptotic variance of θ̃ is smaller than in our previous design, which gives θ̃ an even greater advantage over the innate estimator, θ̂. The design corresponds to a case where the "stakes" in using the likelihood approach are raised, because misspecification can now result from an incorrectly assumed value for σ_0² (in addition to the previous forms of misspecification). When µ_0 is known, the mapping from ϑ to θ is simply θ = θ(σ²) = µ_0 + cσ²/2, which does not depend on σ² when c = 0. This is the case where Assumption 4 is violated, because ∂θ(σ²)/∂σ² = 0 if c = 0. Consequently, the asymptotic results need not apply in this case. This particular violation of Assumption 4 is advantageous to the likelihood-based estimator, because with µ_0 known and c = 0 the optimal estimator is known without any need for estimation. For c close to zero, the LBE benefits from having a very small asymptotic variance, because its variance is proportional to c².

The results are reported in Table 2. As expected, the likelihood-based estimator performs even better in these cases where ϑ is one-dimensional. Panel A has the case ϑ = µ (and σ_0² is known) and Panel B has the case where ϑ = σ² (and µ_0 is known). In Panel A, the asymptotic criterion risk for the likelihood estimator, R(θ̃), does not depend on c, while the corresponding criterion loss for the innate estimator is increasing in c.


Table 2: Relative Criterion Efficiency: 1-dimensional likelihood parameters

                     Panel A: σ0² known              Panel B: µ0 known
                     θ̃ = µ̃ + cσ0²/2                 θ̃ = µ0 + cσ̃²/2
  c      θ∗       RQE      R(θ̂)     R(θ̃)        RQE      R(θ̂)     R(θ̃)
  0      0        1        0.5       0.5          0        0.5       0
  0.25   0.125    0.969    0.516     0.5          0.030    0.516     0.016
  0.5    0.25     0.880    0.568     0.5          0.110    0.568     0.063
  1.0    0.50     0.582    0.859     0.5          0.291    0.859     0.250
  1.5    0.75     0.265    1.886     0.5          0.298    1.886     0.563
  2.0    1.00     0.075    6.700     0.5          0.149    6.700     1.000
  2.5    1.25     0.012    41.36     0.5          0.038    41.36     1.563

The likelihood-based estimator θ̃ is compared to the innate estimator, θ̂, in terms of the relative criterion efficiency in the case with LinEx loss and iid Gaussian observations with zero mean and unit variance. The former dominates the innate estimator and the performance gap increases with the degree of asymmetry. Panel A corresponds to the case where the variance σ0² is known, whereas Panel B presents the case where the mean µ0 is known. The results are based on 500,000 simulations.

In Panel B, R(θ̃) is increasing in c, starting from zero at c = 0. The theoretical explanation for this follows from the underlying information matrices. Because the innate estimator is unaffected by the choice of specification for the likelihood, we continue to have A = 1 and B = [exp(c²σ²) − 1]/c² in both cases. Consequently, we have the same expression for R(θ̂) = ½ tr{A^{−1}B} = ½[exp(c²) − 1]/c², whereas the expressions differ for the likelihood-based estimators. For the specification in Panel A we have B̃ = 1/σ_0² = 1, so that ½ tr{AB̃^{−1}} = ½. Similarly, for the specification in Panel B we have B̃^{−1} = c²/2, so that ½ tr{AB̃^{−1}} = c²/4.

3.2 Local Misspecification

As previously discussed, misspecification distorts the likelihood-based estimator through two channels. First, it erodes the efficiency argument that favors the likelihood-based estimator, and second, the transformation of likelihood parameters into criterion parameters is distorted and will in general be incorrect under misspecification. To study the impact of misspecification we now consider the case where the truth is defined by a normal inverse Gaussian (NIG) distribution, introduced by Barndorff-Nielsen (1977, 1978), so that the Gaussian likelihood is misspecified. A NIG distribution is characterized by four parameters, λ, δ, α, and β, that represent location, scale, tail heaviness, and asymmetry, respectively. The density of a NIG distribution is presented in Figure 2. The NIG distribution is flexible and well suited for the present problem, because the Gaussian distribution, N(µ, σ²), can be obtained as the limiting case where λ = µ, δ = σ²α, β = 0, and α → ∞, and because the distribution yields tractable analytical expressions for the quantities that are relevant for our analysis of the LinEx loss function.



Figure 2: The density of a standardized NIG distribution (mean zero and unit variance) with ξ = 0.5 and χ = 0.25, and the standard Gaussian density.

The mean and variance of NIG(λ, δ, α, β) are given by µ = λ + δβ/γ and σ² = δα²/γ³, respectively, where γ = √(α² − β²). So it follows that the likelihood-based estimator converges in probability to

$$\theta_0 = \Big(\lambda + \frac{\delta\beta}{\gamma}\Big) + \frac{c}{2}\,\delta\,\frac{\alpha^2}{\gamma^3}.$$

The ideal value for θ is, however, equal to

$$\theta_* = \begin{cases} \lambda + \dfrac{\delta}{c}\Big[\sqrt{\alpha^2 - \beta^2} - \sqrt{\alpha^2 - (\beta + c)^2}\Big] & \text{if } c \neq 0, \\[8pt] \lambda + \dfrac{\delta\beta}{\gamma} & \text{if } c = 0, \end{cases} \qquad (6)$$

see Lemma B.1 in the Appendix. So θ0 and θ∗ generally do not coincide, and the (misspecified) likelihood-based estimator is therefore incoherent, with the exception of the following two special cases. The estimators are identical under symmetric loss, c = 0, and in the limiting case δ = σ²α, α → ∞, and β = O(α^{1−a}) with a ∈ (½, 1], see Theorem B.1, because this causes the NIG to converge to the Gaussian distribution.


To make our misspecified design comparable to our previous design (where X_i ∼ iid N(0, 1)), we consider the standard NIG distribution with mean zero and unit variance. The zero mean and unit variance are achieved by setting λ = −δβ/γ and δα²/γ³ = 1. This family of standard NIG distributions can, conveniently, be characterized by the two parameters

$$\xi = \frac{1}{\sqrt{1 + \delta\gamma}} \qquad \text{and} \qquad \chi = \xi\,\frac{\beta}{\alpha},$$

which will be such that 0 ≤ |χ| < ξ < 1, see Barndorff-Nielsen et al. (1985). Figure 2 displays the density for the case with ξ = 0.5 and χ = 0.25. The original parameter values can be obtained using the expressions

$$\alpha = \xi\,\frac{\sqrt{1 - \xi^2}}{\xi^2 - \chi^2} \qquad \text{and} \qquad \beta = \chi\,\frac{\sqrt{1 - \xi^2}}{\xi^2 - \chi^2}, \qquad (7)$$

that imply γ = √[(1 − ξ²)/(ξ² − χ²)].

Gaussian distribution. √ We now construct a local-to-correct specified model by letting (θ0 − θ∗ ) = b/ n in an asymmetric design where χ = −ξ 3/2 → 0. To ease the interpretation of the simulation results, we use that ξ ∝ n−1/3 to set ξ to d × n−1/3 with d the degree of local misspecification. The parameters for the NIG are deduced from ξ and the optimal predictor is computed using (6). In particular, with ξ ∝ n−1/3 so that χ ∝ −n−1/2 , it can be shown that α ∝ n1/3 and β ∝ −n1/6 , see Appendix C.2 where the simulation design is fully detailed. Besides, the results can be easily mapped in terms of the asymptotic bias b. To assess the extent to which the misspecification is statistically detectable, we have computed the expected value of the Jarque-Bera test statistic and the power of the Jarque-Bera test for various values of d and three sample sizes. These numbers are reported in Table 3. The large sample size, (n = 106 ), is sought to emulate the asymptotic results, while the sample sizes n = 1000 and n = 200 provide a finitesample analogy. The second column of the table reports the mapping of the level of misspecification to the asymptotic bias in the case c = 1. The Table shows that the statistical evidence for misspecification is strong once d ≥ 2 . Table 4 displays the performance of the two estimators under local misspecification, as defined by d, for different levels of asymmetry, c. The table also reports the optimal predictor, RQE, the criterion risk induced by estimation, the bias of the likelihood-based predictor and the asymptotic criterion risk induced by the misspecification. The first panel, where c = 0, corresponds to the case of the MSE loss function. Here, the likelihoodbased estimator and innate estimator coincide, as they are both equal to the sample average. The LBE 18

Table 3: Local Misspecification

n = 1, 000

“Asymptotic” d 0 0.1 0.2 0.3 0.5 0.6 0.8 1 1.5 2 2.5 3

b 0.00 0.02 0.04 0.08 0.17 0.23 0.35 0.49 0.89 1.36 1.89 2.47

Jarque-Bera 2.000 2.001 2.012 2.042 2.184 2.325 3.098 3.609 7.184 14.17 25.61 42.95

Power 5.00% 5.13% 5.14% 5.32% 6.43% 7.50% 11.3% 19.0% 52.1% 88.7% 99.5% 100%

Jarque-Bera 1.985 1.981 2.000 2.030 2.212 2.373 3.276 3.765 8.211 17.90 36.33 69.52

Power 4.89% 4.89% 5.00% 5.25% 6.63% 7.83% 12.0% 19.0% 51.6% 86.8% 98.9% 100%

n = 200 Jarque-Bera 1.912 1.910 1.941 1.985 2.216 2.417 3.624 4.276 10.91 28.29 70.14 169.2

Power 4.51% 4.52% 4.70% 5.02% 6.49% 7.74% 12.0% 18.7% 48.0% 81.1% 97.0% 99.8%

The misspecification level, d, is measured in terms of the non-centrality level in the Jarque-Bera test. The “asymptotic” results are based on 100, 000 replications with n = 106 , and the finite-sample ones rely on 500, 000 simulations for n = 200 and n = 1000, respectively. The normality is rejected at the 5% level when the Jarque-Bera statistic is larger than the χ2 (2), i.e. 5.99. The misspecification level d is also mapped into the asymptotic bias b for the case where c = 1.

is therefore not affected by misspecification if c = 0. The subsequent panels display the results with increasing levels of asymmetry, c. The RQE in the last column of Table 4 shows how the performance of the (quasi) likelihood-based estimator is affected by misspecification for different levels of asymmetry. For low levels of misspecification, the LBE dominates the innate estimator, and its performance improves as c increases. In contrast, with larger deviations from normality its performance worsens and it becomes increasingly inferior to the innate estimator. This is explained by the fact that the mapping from ϑ 7→ θ is distorted under misspecification, and increasingly so as the level of misspecification increases. The likelihood-based estimator still outperforms the innate estimator in terms of variance. This is evident from the LBE’s variance component in column 5, which can be compared to column 3 (because the bias is asymptotically negligible for the innate estimator). But the improper mapping of the LBE estimator induces a bias related risk component, which is exponentially increasing in d (see column 6), and this bias is responsible for the degradation of the LBE’s performance. To complement the asymptotic results in Table 4, we present results based on a sample size with n = 200 in Table 5. The results are similar, although for large values of c, we observe that the innate estimator begins to dominate the LBE once the misspecification parameter reaches d = 1. By comparing the values of d for which the LBE dominates the innate estimator with the power reported in Table 3, it is obvious that diagnostic tests can be a valuable aid for the selection of estimator. However, diagnostic

19

Table 4: Local Misspecification

c = 0.25

d 0 0.5 1 1.5 2 2.5 3

c = 0.5

d 0 0.5 1 1.5 2 2.5 3

c=1

d 0 0.5 1 1.5 2 2.5 3

d 0 0.5 1 1.5 2 2.5 3

c=2

c = 0.0

Innate

d 0 0.5 1 1.5 2 2.5 3

Likelihood-Based

θ∗

ˆ R(θ)

˜ R(θ)

˜ Rvar (θ)

˜ Rbias (θ)

RQE

0.000 0.000 0.000 0.000 0.000 0.000 0.000

0.493 0.502 0.493 0.501 0.498 0.500 0.494

0.493 0.502 0.493 0.501 0.498 0.500 0.494

0.493 0.502 0.493 0.501 0.498 0.500 0.494

0.000 0.000 0.000 0.000 0.000 0.000 0.000

1.000 1.000 1.000 1.000 1.000 1.000 1.000

θ∗

ˆ R(θ)

˜ R(θ)

˜ Rvar (θ)

˜ Rbias (θ)

RQE

0.125 0.125 0.125 0.125 0.125 0.125 0.125

0.514 0.507 0.522 0.516 0.507 0.520 0.513

0.514 0.507 0.522 0.517 0.511 0.528 0.526

0.514 0.507 0.521 0.515 0.507 0.520 0.514

0.000 0.000 0.001 0.002 0.004 0.008 0.012

1.000 1.000 1.000 1.003 1.008 1.014 1.025

θ∗

ˆ R(θ)

˜ R(θ)

˜ Rvar (θ)

˜ Rbias (θ)

RQE

0.250 0.250 0.250 0.250 0.250 0.250 0.249

0.566 0.560 0.576 0.567 0.559 0.573 0.564

0.561 0.557 0.579 0.588 0.615 0.686 0.761

0.561 0.555 0.571 0.562 0.554 0.569 0.563

0.000 0.001 0.008 0.026 0.061 0.118 0.198

0.991 0.994 1.004 1.036 1.101 1.197 1.350

θ∗

ˆ R(θ)

˜ R(θ)

˜ Rvar (θ)

˜ Rbias (θ)

RQE

0.500 0.500 0.500 0.499 0.499 0.498 0.498

0.857 0.852 0.873 0.855 0.844 0.862 0.843

0.750 0.762 0.885 1.145 1.670 2.552 3.806

0.750 0.746 0.767 0.750 0.741 0.763 0.759

0.000 0.015 0.119 0.395 0.929 1.789 3.047

0.876 0.895 1.015 1.340 1.978 2.961 4.516

θ∗

ˆ R(θ)

˜ R(θ)

˜ Rvar (θ)

˜ Rbias (θ)

RQE

1.000 0.999 0.998 0.997 0.995 0.993 0.991

6.723 6.678 6.560 6.440 6.370 6.210 6.067

1.535 1.736 3.355 7.382 15.15 27.58 45.42

1.535 1.500 1.562 1.509 1.476 1.548 1.521

0.000 0.236 1.793 5.873 13.67 26.04 43.90

0.228 0.260 0.511 1.146 2.378 4.442 7.486

˜ is compared with the innate estimator, θ, ˆ under local misspecification. The data The likelihood-based estimator, θ, ˜ generating process is a NIG distribution whose discrepancy from the Gaussian specification is characterized by d. Rbias (θ) 0 6 measures the bias component of the risk, b Ab/2. The simulation results are based on 100,000 replications with n = 10 .

20

Table 5: Local Misspecification (finite samples n = 200)

d 0 0.5 1 1.5 2 2.5 3 d 0 0.5 1 1.5 2 2.5 3

c=1

d 0 0.5 1 1.5 2 2.5 3 d 0 0.5 1 1.5 2 2.5 3

c=2

c = 0.5

c = 0.25

c = 0.0

Innate

d 0 0.5 1 1.5 2 2.5 3

Likelihood-Based

θ∗

ˆ bias(θ)

˜ bias(θ)

ˆ Rn (θ)

ˆ Rnvar (θ)

ˆ Rnbias (θ)

˜ Rn (θ)

˜ Rnvar (θ)

˜ Rnbias (θ)

RQE

0.000 0.000 0.000 0.000 0.000 0.000 0.000

0.000 0.000 0.000 0.000 0.000 0.000 0.000

0.000 0.503 0.501 0.499 0.498 0.499 0.500

0.500 0.503 0.501 0.499 0.498 0.499 0.500

0.500 0.000 0.000 0.000 0.000 0.000 0.000

0.000 0.503 0.501 0.499 0.498 0.499 0.500

0.500 0.503 0.501 0.499 0.498 0.499 0.500

0.500 0.000 0.000 0.000 0.000 0.000 0.000

0.000 1.000 1.000 1.000 1.000 1.000 1.000

1.00 1.00 1.00 1.00 1.00 1.00 1.00

θ∗

ˆ bias(θ)

˜ bias(θ)

ˆ Rn (θ)

ˆ Rnvar (θ)

ˆ Rnbias (θ)

˜ Rn (θ)

˜ Rnvar (θ)

˜ Rnbias (θ)

RQE

0.125 0.124 0.122 0.118 0.113 0.107 0.100

0.001 0.001 0.001 0.001 0.001 0.000 0.001

0.001 0.000 0.002 0.003 0.005 0.008 0.011

0.516 0.507 0.487 0.470 0.446 0.424 0.395

0.516 0.507 0.487 0.470 0.446 0.424 0.395

0.000 0.000 0.000 0.000 0.000 0.000 0.000

0.516 0.507 0.487 0.470 0.446 0.423 0.395

0.516 0.507 0.487 0.469 0.443 0.416 0.382

0.000 0.000 0.000 0.001 0.003 0.006 0.012

0.999 1.000 1.000 0.999 0.998 0.997 0.999

θ∗

ˆ bias(θ)

˜ bias(θ)

ˆ Rn (θ)

ˆ Rnvar (θ)

ˆ Rnbias (θ)

˜ Rn (θ)

˜ Rnvar (θ)

˜ Rnbias (θ)

RQE

0.250 0.247 0.242 0.235 0.228 0.219 0.209

0.001 0.001 0.001 0.001 0.001 0.001 0.001

0.001 0.002 0.007 0.014 0.021 0.030 0.040

0.568 0.548 0.509 0.473 0.430 0.389 0.345

0.565 0.548 0.509 0.473 0.430 0.389 0.345

0.000 0.000 0.000 0.000 0.000 0.000 0.000

0.563 0.545 0.513 0.491 0.473 0.473 0.496

0.560 0.545 0.509 0.473 0.428 0.384 0.339

0.000 0.000 0.005 0.018 0.045 0.089 0.157

0.991 0.995 1.009 1.038 1.101 1.215 1.438

θ∗

ˆ bias(θ)

˜ bias(θ)

ˆ Rn (θ)

ˆ Rnvar (θ)

ˆ Rnbias (θ)

˜ Rn (θ)

˜ Rnvar (θ)

˜ Rnbias (θ)

RQE

0.500 0.489 0.470 0.447 0.422 0.395 0.367

0.004 0.004 0.003 0.003 0.002 0.002 0.002

0.003 0.009 0.028 0.051 0.076 0.102 0.130

0.847 0.775 0.664 0.568 0.472 0.391 0.317

0.828 0.774 0.663 0.568 0.472 0.390 0.317

0.002 0.001 0.001 0.001 0.001 0.000 0.000

0.752 0.719 0.716 0.825 1.066 1.472 2.085

0.741 0.710 0.639 0.574 0.508 0.462 0.450

0.000 0.008 0.077 0.251 0.558 1.009 1.635

0.887 0.927 1.078 1.451 2.258 3.766 6.579

θ∗

ˆ bias(θ)

˜ bias(θ)

ˆ Rn (θ)

ˆ Rnvar (θ)

ˆ Rnbias (θ)

˜ Rn (θ)

˜ Rnvar (θ)

˜ Rnbias (θ)

RQE

1.000 0.959 0.896 0.826 0.754 0.682 0.610

0.035 0.028 0.021 0.014 0.010 0.007 0.005

0.005 0.036 0.099 0.169 0.241 0.313 0.385

3.139 2.580 1.903 1.377 0.961 0.658 0.443

2.869 2.503 1.861 1.355 0.951 0.653 0.442

0.119 0.077 0.042 0.021 0.010 0.005 0.002

1.523 1.475 2.038 3.499 5.792 8.794 12.45

1.470 1.342 1.110 0.939 0.817 0.773 0.819

0.003 0.133 0.928 2.560 4.975 8.021 11.63

0.485 0.572 1.071 2.542 6.030 13.38 28.08

ˆ in the case where the Gaussian likelihood is The likelihood-based estimator θ˜ is compared with the innate estimator, θ, (locally) misspecified for different levels of asymmetry. The data generating process is a standard NIG distribution where the degree of local misspecification is determined by d. The finite-sample results are based on 500,000 replications with n = 200.

21



Figure 3: RQE: Local Misspecification.

tests are unlikely to provide perfect guidance in this respect, for a number of reasons. One reason is that there are going to be levels of misspecification that are difficult to detect, yet detrimental to the likelihood-based estimator. This is illustrated in this application for medium levels of misspecification, d ∈ [1, 2] and c ≥ 1. In this design one would prefer to use the innate estimator over the LBE. However, for this level of misspecification one cannot rely on the JB test for the selection of the estimator, because it only has moderate power in this range for d. If more powerful tests were adopted, we would face the opposite problem of having higher rejection rates for d < 1, where the LBE still dominates the innate estimator.

Figure 3 illustrates the impact of local misspecification on the RQE under LinEx loss.^6 It shows that the LBE dominates the innate estimator when the misspecification is small (d ≤ 1). At some point (about d ≃ 1) the local misspecification more than offsets the advantages that the likelihood approach offers in terms of a reduced asymptotic variance. The larger is c, the more advantageous it is to use the LBE over the innate estimator for low levels of misspecification. The ranking is reversed under severe misspecification, where a high degree of asymmetry magnifies the distortions caused by misspecification.

^6 These results are based on n = 1,000,000 and 100,000 replications.


4 Multi-Step Forecasting

Our theoretical framework can be applied to the problem of making multi-step forecasts. Forecasts based on the innate and likelihood-based estimators are known as the direct forecast and the iterated forecast, respectively, see e.g. Marcellino et al. (2006). The iterated forecast is also known as the plug-in forecast. There is a vast literature on multi-step forecasting, see e.g. Cox (1961), Tiao and Tsay (1994), Clements and Hendry (1996), Bhansali (1997), Ing (2003), Chevillon (2007), and references therein. This literature has mainly focused on the case with MSE loss, with or without misspecification. We contribute to this literature by showing that the comparison of direct versus iterated forecasts is a special case of our theoretical framework in Section 2. We also contribute to the literature by establishing results beyond the quadratic loss function. For instance, we demonstrate that asymmetric loss can exacerbate the relative inefficiency of the direct (innate) estimator.

Consider an autoregressive process of order p. In this context, the direct forecast of Y_{T+h} at time T is obtained by regressing Y_t on (Y_{t−h}, ..., Y_{t−p−h+1}) and a constant for t = 1, ..., T, whereas the iterated forecast is obtained by estimating an AR(p) model that yields a forecast of Y_{T+1}, which is subsequently used to construct a forecast of Y_{T+2}, and so forth until the forecast of Y_{T+h} is obtained, by repeated use of the estimated autoregressive model. For ease of exposition, we focus on the case with the first-order autoregressive model

$$Y_t = \mu + \varphi Y_{t-1} + \varepsilon_t, \qquad t = 1, 2, \ldots \qquad (8)$$

where ε_t ∼ iid N(0, σ²), and we assume that |ϕ| < 1 and σ² > 0. It follows that the conditional distribution of Y_{t+h} given Y_t is N(ϕ^h Y_t + [(1−ϕ^h)/(1−ϕ)]µ, [(1−ϕ^{2h})/(1−ϕ²)]σ²), so that the optimal predictor under LinEx loss is given by

$$Y_{t+h,t}^{0} = \varphi^h Y_t + \frac{1-\varphi^h}{1-\varphi}\mu + \frac{c}{2}\,\frac{1-\varphi^{2h}}{1-\varphi^2}\sigma^2. \qquad (9)$$

The iterated (likelihood-based) predictor, Ỹ_{t+h,t}, is given by plugging the maximum likelihood estimators, µ̃, ϕ̃, and σ̃², into this expression. In the notation of Section 2, we have ϑ = (µ, ϕ, σ²)' and

$$\theta(\vartheta) = \Big(\frac{1-\varphi^h}{1-\varphi}\mu + \frac{c}{2}\,\frac{1-\varphi^{2h}}{1-\varphi^2}\sigma^2,\ \varphi^h\Big)',$$

so that the iterated forecast is Ỹ_{t+h,t} = α̃ + β̃Y_t, with α̃ = [(1−ϕ^h)/(1−ϕ)]µ + (c/2)[(1−ϕ^{2h})/(1−ϕ²)]σ² and β̃ = ϕ^h (evaluated at the maximum likelihood estimates). For convenience, we use α and β to represent the elements of θ, i.e. θ = (α, β)'.

min α,β

4.1

ˆ t. − α − βYt−h ), and the direct forecast is Yˆt+h,t = α ˆ + βY

PT

t=1 Lc (Yt

Results under Correct Specification

With the criterion function given by qt (θ) = −Lc (Yt − α − βYt−h ), it follows that  st (θ) = c−1 

ec(Yt −α−βYt−h ) − 1





(ec(Yt −α−βYt−h )



and ht (θ) = −ec(Yt −α−βYt−h ) 

Yt−h

− 1)

1

Yt−h

Yt−h

2 Yt−h

 ,

˜ −1 (the and the corresponding matrices, A = E[−ht (θ0 )], B (the long-run variance of st (θ0 )), and B ˜ are given in the next Theorem. asymptotic variance of θ) Theorem 5. Let Yt = µ + ϕYt−1 + εt with |ϕ| < 1 and εt ∼ iidN (0, σ 2 ). Then µy ≡ EY = µ/(1 − ϕ) and σy2 ≡ var(Y ) = σ 2 /(1 − ϕ2 ). Then  A=

B = c−2

h−1 X

2 σ 2 (ϕ|j| −ϕ2h−|j| ) y

 ˜ −1 =  B

µy

µy µ2y

 (δj − 1) 

j=−h+1

where δj = ec

1



+

σy2

,

1



µy + γj

µy + γj

µ2y

+

ϕ|j| σy2

+ µy γ|j|



, γj = cσy2 (ϕ2(h−j) − ϕ2h ) for j = 1, . . . , h, and γj = 0 for j ≤ 0, and h 2

) (1+ϕ) − 2cµy σy2 ζ + c2 σy4 ψ cσy2 ζ − µy ξ µ2y ξ + σy2 (1−ϕ(1−ϕ)

cσy2 ζ

− µy ξ

ξ

where ξ = (ϕ−2 − 1)h2 ϕh , ζ = hϕh (1 − ϕ2h ) − ϕh ξ, and ψ =

 ,

(1−ϕ2h )2 (1 + ϕ2 ) − 2hϕ2h (1 − ϕ2h ) + ϕ2h ξ. 2(1−ϕ2 )

So A does not depend on the forecasting horizon. The expression for the B-matrix is relatively more involved when h > 1 due to the resulting autocorrelation in the forecast errors. The general expressions for the criterion risk are given by h ˜ −1 } = σ 2 ξ + tr{AB y tr{A−1 B} =

1 c2

h−1 X

(1−ϕh )2 (1+ϕ) 1−ϕ

i

+ c2 σy4 ψ

(δj − 1)(1 + ϕ|j| )],

j=−h+1 |j| −ϕ2h−|j| )

where δj = eλ(ϕ

, with λ = c2 σy2 . So we can express the RQE as 24

1

1 h=2 h=3 h=4

h=2 h=3 h=4

0.9

0.8

0.8

0.7

0.7

0.6

0.6

RQE

RQE

0.9

0.5

0.5

0.4

0.4

0.3

0.3

0.2

0.2

0.1

0.1

0

0 0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

0

0.1

0.2

0.3

0.4

φ

0.5

0.6

0.7

0.8

0.9

1

φ

(a) Symmetric loss, c = 0

(b) Asymmetry loss, c = 1

Figure 4: RQE under correct specification as a function of φ and h. The left panel is for the symmetric case and the right panel is for an asymmetric case.

h λ ξ+ RQE = Ph−1

(1−ϕh )2 (1+ϕ) 1−ϕ |j| −ϕ2h−|j| )λ

(ϕ j=−h+1 (e

i

+ λ2 ψ

− 1)(1 + ϕ|j| )]

,

which is less than one if |ϕ| < 1, and it can be shown that h 2

) (1+ϕ) (1 − ϕ2 )h2 ϕ2h−2 + (1−ϕ(1−ϕ) lim RQE = Ph−1 , |j| 2h−|j| )(1 + ϕ|j| ) λ→0 j=−h+1 (ϕ − ϕ

which corresponds to the case with symmetric loss. Simpler expressions are available for specific values of h, for instance if h = 2 we have limλ→0 RQE = 21 (1 + 2ϕ + 5ϕ2 )/(1 + ϕ + 2ϕ2 ), see Appendix B.4. Figure 4 shows the RQE as a function of ϕ for different forecast horizons when the level of asymmetry, c, is fixed to 0 and 1, respectively, with σ normalized to σ = 1. Under MSE loss, c = 0, the RQE approaches one as ϕ → 1. This shows that in the situation with symmetric loss, the gains from using the iterated forecast over the direct forecast are small if the underlying process is highly persistent. This, however, does not generalize to the asymmetric case, as illustrated in the panel (b). Under asymmetric loss, c = 1, the likelihood-based estimator is increasingly superior to the direct approach, as the persistence of the process approaches unity. As is to be expected from the existing literature, the relative advantages of the iterated estimator increase with the forecast horizon h in the symmetric case, and this result carries over to the asymmetric case, particularly if the process is persistent. This point is further documented in Figure 5, where the RQE is given as a function of ϕ for five different

25

1 c=0 c=0.5 c=1 c=1.5 c=2

0.9 0.8 0.7

RQE

0.6 0.5 0.4 0.3 0.2 0.1 0 0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

φ

Figure 5: RQE under correct specification as a function of ϕ for different levels of asymmetry, λ = (cσ)2 . levels of asymmetry, for the case h = 2.7 The larger is the asymmetry the better is the iterated forecast compared to the direct forecast, and increasingly so as ϕ approaches unity. The direct and iterated forecasts, Yˆt+h,t and Y˜t+h,t , are compared in Tables 6-7 for the cases where h = 2 and h = 4, respectively. The simulations design is described in Appendix C.3. The iterated forecast is expected to be superior in this design where the iterated forecast is based on a correctly specified AR(1) model. The asymptotic results are based on our analytical expressions whereas the finite sample results, with n = 1, 000 and n = 200, are obtained by simulations with 500,000 replications. We consider a range of values for the asymmetry parameter, c, and three levels of persistence in the autoregressive process, ϕ ∈ {0.3, 0.8, 0.99}. The Tables quantify the performance of the iterated predictor, Y˜t+h,t , relative to the direct predictor, Yˆt+h,t , with the relative performance gap being increasing in c. This is particularly the case in the persistent design where ϕ = 0.99. The case c = 0.1 roughly mimics the behavior of the two estimators in the well studied case with mean squared error loss. In this case, the relative advantages of the iterated approach is decreasing in ϕ. Interestingly, this is not a general result. In fact, quite the opposite is observed with a high degree of asymmetry. For instance with c = 1.5 and c = 2.0 the relative performance of the iterated predictor is increasing in ϕ. So even though the two predictors are ˜ −1 }, tr{A−1 B}, and RQE, for the case h = 2, that are used in Figure 5 are given The simplified expressions for tr{AB in Appendix B.4. 7

26

c = 0.1

c = .25

c = 0.5

c=1

c = 1.5

0.56 0.58 0.51

RQE

0.34 0.22 0.14

RQE

0.12 0.03 0.01

0.3 0.8 0.99

ϕ

0.3 0.8 0.99

ϕ

0.3 0.8 0.99

RQE

ϕ

RQE

0.69 0.92 0.97

0.3 0.8 0.99

ϕ

RQE

ϕ

0.67 0.86 0.87

0.69 0.94 0.99

0.3 0.8 0.99

0.3 0.8 0.99

RQE

ϕ

2.377 6.050 7.900

˜ R(θ)

ˆ R(θ)

20.07 186.9 713.6

1.785 4.672 6.168

˜ R(θ)

ˆ R(θ)

5.276 21.39 45.13

1.363 3.688 4.930

˜ R(θ)

ˆ R(θ)

2.429 6.361 9.609

1.109 3.097 4.188

˜ R(θ)

ˆ R(θ)

1.658 3.621 4.797

1.046 2.949 4.002

˜ R(θ)

ˆ R(θ)

1.522 3.204 4.140

1.028 2.908 3.950

˜ R(θ)

1.487 3.099 3.980

ˆ R(θ)

Asymptotic

0.25 0.19 0.13

RQE

0.39 0.35 0.34

RQE

0.57 0.61 0.65

RQE

0.67 0.85 0.91

RQE

0.68 0.91 0.97

RQE

0.69 0.93 0.99

RQE

9.732 33.36 116.8

ˆ Rn (θ)

4.554 14.20 36.39

ˆ Rn (θ)

2.392 6.325 17.05

ˆ Rn (θ)

1.663 3.791 11.09

ˆ Rn (θ)

1.540 3.378 10.05

ˆ Rn (θ)

1.493 3.262 10.00

ˆ Rn (θ)

2.393 6.481 15.60

˜ Rn (θ)

1.800 4.938 12.53

˜ Rn (θ)

1.356 3.855 11.16

˜ Rn (θ)

1.108 3.231 10.12

˜ Rn (θ)

1.053 3.087 9.777

˜ Rn (θ)

1.025 3.034 9.866

˜ Rn (θ)

0.002 0.009 0.009

ˆ bias(β)

0.002 0.008 0.009

ˆ bias(β)

0.002 0.006 0.009

ˆ bias(β)

0.002 0.006 0.009

ˆ bias(β)

0.002 0.006 0.009

ˆ bias(β)

0.002 0.006 0.009

ˆ bias(β)

n = 1000

0.000 0.005 0.009

˜ bias(β)

0.000 0.005 0.009

˜ bias(β)

0.000 0.005 0.009

˜ bias(β)

0.000 0.005 0.009

˜ bias(β)

0.000 0.005 0.009

˜ bias(β)

0.000 0.005 0.009

˜ bias(β)

0.33 0.27 0.00

RQE

0.45 0.45 0.13

RQE

0.57 0.66 0.68

RQE

0.65 0.84 0.91

RQE

0.66 0.89 0.95

RQE

0.67 0.90 0.96

RQE

Table 6: Multi-step Forecasting (h = 2): AR(1) Model

7.401 30.51 > 106

ˆ Rn (θ)

4.071 13.28 596.7

ˆ Rn (θ)

2.391 6.865 57.26

ˆ Rn (θ)

1.695 4.444 31.62

ˆ Rn (θ)

1.559 4.000 28.28

ˆ Rn (θ)

1.523 3.853 27.42

ˆ Rn (θ)

2.409 8.320 308.4

˜ Rn (θ)

1.815 5.964 75.89

˜ Rn (θ)

1.367 4.507 39.12

˜ Rn (θ)

1.105 3.734 28.75

˜ Rn (θ)

1.036 3.558 26.95

˜ Rn (θ)

1.025 3.470 26.38

˜ Rn (θ)

0.011 0.034 0.051

ˆ bias(β)

0.010 0.033 0.052

ˆ bias(β)

0.011 0.031 0.052

ˆ bias(β)

0.010 0.030 0.052

ˆ bias(β)

0.011 0.030 0.052

ˆ bias(β)

0.010 0.030 0.052

ˆ bias(β)

n = 200

0.001 0.026 0.051

˜ bias(β)

0.001 0.025 0.051

˜ bias(β)

0.001 0.026 0.051

˜ bias(β)

0.001 0.026 0.051

˜ bias(β)

0.001 0.026 0.051

˜ bias(β)

0.001 0.026 0.051

˜ bias(β)

The direct (innate) forecast is compared to the iterated (likelihood-based) forecast for the case with two-step ahead forecasting. In this situation with correct specification the likelihood-based one dominates in terms of RQE for all sample sizes, and does so increasingly as the level of asymmetry, c, increases.

c=2

27

c = 0.1

c = 0.25

c = 0.5

c=1

c = 1.5

0.52 0.45 0.20

RQE

0.33 0.11 0.01

RQE

0.12 0.01 0.00

0.3 0.8 0.99

ϕ

0.3 0.8 0.99

ϕ

0.3 0.8 0.99

RQE

ϕ

RQE

0.61 0.81 0.93

0.3 0.8 0.99

ϕ

RQE

ϕ

0.60 0.75 0.74

0.61 0.83 0.98

0.3 0.8 0.99

0.3 0.8 0.99

RQE

ϕ

2.455 17.47 31.70

˜ R(θ)

ˆ R(θ)

21.07 2939 > 106

1.823 12.65 24.52

˜ R(θ)

ˆ R(θ)

5.605 119.9 3455

1.371 9.209 19.40

˜ R(θ)

ˆ R(θ)

2.648 20.62 97.15

1.100 7.143 16.32

˜ R(θ)

ˆ R(θ)

1.847 9.572 22.19

1.032 6.626 15.55

˜ R(θ)

ˆ R(θ)

1.705 8.161 16.80

1.013 6.482 15.37

˜ R(θ)

1.669 7.821 15.64

ˆ R(θ)

Asymptotic

0.24 0.16 0.01

RQE

0.375 0.296 0.120

RQE

0.52 0.51 0.44

RQE

0.59 0.74 0.82

RQE

0.61 0.80 0.93

RQE

0.61 0.81 0.96

RQE

10.22 128.2 > 105

ˆ Rn (θ)

4.874 45.28 515.2

ˆ Rn (θ)

2.627 18.73 107.3

ˆ Rn (θ)

1.858 9.904 48.27

ˆ Rn (θ)

1.718 8.472 40.41

ˆ Rn (θ)

1.679 8.114 38.63

ˆ Rn (θ)

2.474 20.05 107.6

˜ Rn (θ)

1.827 13.41 61.93

˜ Rn (θ)

1.373 9.514 47.24

˜ Rn (θ)

1.101 7.338 39.56

˜ Rn (θ)

1.039 6.770 37.66

˜ Rn (θ)

1.018 6.600 37.19

˜ Rn (θ)

0.002 0.011 0.019

ˆ bias(β)

0.002 0.012 0.019

ˆ bias(β)

0.002 0.011 0.018

ˆ bias(β)

0.002 0.009 0.018

ˆ bias(β)

0.002 0.009 0.018

ˆ bias(β)

0.002 0.009 0.018

ˆ bias(β)

n = 1000

0.000 0.006 0.017

˜ bias(β)

0.000 0.006 0.017

˜ bias(β)

0.000 0.006 0.017

˜ bias(β)

0.000 0.006 0.017

˜ bias(β)

0.000 0.006 0.017

˜ bias(β)

0.000 0.006 0.017

˜ bias(β)

0.32 0.20 0.00

RQE

0.422 0.368 0.000

RQE

0.53 0.56 0.18

RQE

0.58 0.71 0.79

RQE

0.59 0.75 0.87

RQE

0.59 0.76 0.89

RQE

Table 7: Multi-step Forecasting (h = 4): AR(1) Model

7.980 146.4 > 1015

ˆ Rn (θ)

4.386 44.77 > 109

ˆ Rn (θ)

2.608 19.36 4904

ˆ Rn (θ)

1.890 11.19 152.8

ˆ Rn (θ)

1.754 9.602 109.8

ˆ Rn (θ)

1.702 9.294 101.3

ˆ Rn (θ)

2.523 29.84 > 107

˜ Rn (θ)

1.853 16.48 > 105

˜ Rn (θ)

1.369 10.83 893.1

˜ Rn (θ)

1.103 7.910 120.0

˜ Rn (θ)

1.040 7.149 95.68

˜ Rn (θ)

1.011 7.034 90.39

˜ Rn (θ)

n = 200

0.010 0.047 0.106

ˆ bias(β)

0.010 0.045 0.103

ˆ bias(β)

0.010 0.044 0.102

ˆ bias(β)

0.010 0.043 0.100

ˆ bias(β)

0.010 0.043 0.100

ˆ bias(β)

0.010 0.043 0.100

ˆ bias(β)

0.001 0.027 0.095

˜ bias(β)

0.000 0.027 0.095

˜ bias(β)

0.001 0.027 0.095

˜ bias(β)

0.001 0.027 0.095

˜ bias(β)

0.001 0.027 0.095

˜ bias(β)

0.001 0.027 0.095

˜ bias(β)

The direct (innate) forecast is compared to the iterated (likelihood-based) forecast for the case with four-step ahead forecasting. In this situation with correct specification the likelihood-based one dominates in terms of RQE for all sample sizes, and does so increasingly as the level of asymmetry, c, increases. The finite-sample results are based on simulations with 500,000 replications.

c=2

28

roughly equivalent under MSE loss if the process is highly persistent, this is not a result that can be extrapolated to cases with non-MSE loss. See also Figure 5. By comparing the results based on two-step ahead forecasting (Table 6) with those with four-step ahead forecasting (Table 7), it is evident that the superiority of the iterated predictor is increasing in the forecasting horizon, which is also in agreement with the theoretical results. The finite sample results in Tables 6-7 are similar to the asymptotic results, albeit the relative superiority of the iterated estimator is less pronounced in some designs. This is partly explained by the bias that play a role in finite samples. We observe that the direct predictor tends to have a larger bias than the iterated predictor, and that the bias increases with the forecasting horizon.

4.2

Local Misspecification

To study the effect of local misspecification on the relative efficiency of the likelihood-based predictor with respect to the innate predictor, we adopt an asymptotic design from Schorfheide (2005). Specifically, we keep the complexity of our prediction models fixed, and introduce local misspecification in the conditional mean. Unlike the design of our application in Section 3, where we deviated from the Gaussian distribution, we maintain the Gaussian distribution in this application. The misspecification is instead introduced by modifying the data generating process to be an AR(2) model, Yt = µ+ϕ1 Yt−1 +ϕ2 Yt−2 +εt . A local-to-correct specification is obtained by letting ϕ2 ∝ n−1/2 . Specifically, we set ϕ2 = d×n−1/2 σϕ2 , where σϕ2 2 is the asymptotic variance of the maximum likelihood estimator, ϕ˜2 when ϕ2 = 0. It is easy to verify that σϕ2 2 = 1 in this case. The constant, d, defines the degree of misspecification and can be interpreted as the non-centrality of the t-statistics associated with testing ϕ2 = 0. In this local-to-correct design we hold the first order autocorrelation, ρ1 = corr(Yt, Yt−1 ), constant, which is achieved by setting ϕ1 = ρ1 (1 − ϕ2 ). We focus on two cases, ρ1 = 0.8 and ρ1 = 0.99. The results for the case ρ1 = 0.8 are presented below, whereas the corresponding results for the case ρ1 = 0.99 are presented in the Web Appendix. The optimal predictor is given by YT∗+2,T

c = µ(1 + ϕ) + (ϕ21 + ϕ2 )YT + ϕ1 ϕ2 YT −1 + (1 + ϕ21 )σε2 , 2

(10)

but neither the direct predictor nor the iterated predictor make use of two lags of Yt . These forecasts continue to be of the form α + βYT using a Gaussian structure, and it follows that   1 + ϕ21 − ϕ22 2 σε . Yt+2 |Yt ∼ N (1 + ρ1 )µ + ρ2 Yt , 1 − ϕ22

29

(11)

In this case we have α ˜ =

1−ϕ ˜2 ˜ 1−ϕ ˜ µ

+

˜2·2 2 c 1−ϕ σ ˜ 2 1−ϕ ˜2

and β˜ = ϕ˜h for the iterated forecast, while the direct

forecast deduces the estimates of α and β from the two-steps ahead LinEx loss function. The key ˜ difference between the two forecasts is that the iterated forecast is attempting to estimate ρ21 from β, whereas βˆ is correctly attempting to estimate the second-order autocorrelation, denoted ρ2 . In the case with misspecification, it is no longer true that ρ21 = ρ2 , which the iterated predictor relies on. In fact, the larger is d, the larger is the discrepancy between ρ2 by ρ21 . This illustrates the distortions that the likelihood-based estimator suffers from, as a result of misspecification. Table 8 reports the two-step ahead performance of the two predictors under (local-to-correct) misspecified for the case where ρ1 = 0.8. When the loss is nearly symmetric, it requires a certain level of misspecification (about d = 1) for the innate predictor to catch up and surpass the iterated predictor that suffers from misspecification. When the loss function is more asymmetric, it takes a substantially higher degree of misspecification. In fact, for a high level of asymmetry, c = 2, the direct forecasts remain inferior even at high levels of misspecification. So here we have a situation where misspecification is of a magnitude that is easily detectable, yet the likelihood-based estimator continues to dominate the innate estimator. This shows that one cannot solely rely on conventional misspecification tests for the selection of estimator. Ideally, the diagnostics should be tailored for the true objective, Q. The results displayed include the non-centrality parameter d, RQE, the criterion losses resulting from estimation error, the second order autocorrelation and the bias of the two estimators, as well as the ˜ Note that the likelihood-based predictor asymptotic criterion risk entailed by the misspecification for θ. dominates the direct one when the level of misspecification is small, and its performance improves as c increases. However, its relative performance declines with the level of misspecification. This reflects the negative impact of the non-centrality parameter d on the criterion loss for the likelihood-based estimator ˜ becomes larger as relatively to the innate estimator. Furthermore, the asymptotic bias term Rbias (θ) both c and d increase. Figure 6 illustrates an extended version of the asymptotic results in Table 8, by plotting the RQE as a function of the misspecification parameter d, for various degrees of asymmetry.8 The larger is the asymmetry, the larger is the range of local misspecification for which the iterated forecast dominates the direct forecast. For example, for c = 0.1, the iterated forecast performs better than the innate estimator for values of d up to 1, and for the greater degree of asymmetry, c = 1, the iterated forecast dominates at even higher levels of misspecification, to about d = 3. To add some perspective to this statement, d = 1 and d = 3 translate into ϕ2 ' 0.03 and ϕ2 ' 0.1, respectively, if the sample size is n = 1000, and 8

These results are based on n = 100, 000 and 100, 000 repetitions.

30

c = 0.1

c = 0.5

c=1

d 0 0.5 1 2 3 5

d 0 0.5 1 2 3 5

d 0 0.5 1 2 3 5

d 0 0.5 1 2 3 5

˜ R(θ)

3.10 3.29 3.66 5.27 8.02 16.7

˜ R(θ)

3.69 3.94 4.39 6.47 9.97 21.2

˜ R(θ)

ˆ R(θ)

3.62 3.76 3.95 4.87 6.44 11.5

ˆ R(θ)

6.36 6.48 6.66 7.59 9.16 14.3

ˆ R(θ)

187 158 151 150 152 156

RQE

0.86 0.88 0.93 1.08 1.24 1.45

RQE

0.58 0.61 0.66 0.85 1.09 1.48

RQE

0.03 0.04 0.05 0.08 0.12 0.25

6.05 6.75 7.30 11.4 17.7 39.4

2.91 3.08 3.43 4.88 7.39 15.4

3.10 3.23 3.43 4.34 5.92 11.0

0.94 0.95 1.00 1.12 1.25 1.40

˜ R(θ)

ˆ R(θ)

RQE

0.640 0.640 0.640 0.641 0.641 0.642

β

0.640 0.640 0.640 0.641 0.641 0.642

β

0.640 0.640 0.640 0.641 0.641 0.642

β

0.640 0.640 0.640 0.641 0.641 0.642

β

0.000 0.000 0.000 0.001 0.001 0.002

˜ bias(β)

0.000 0.000 0.000 0.001 0.001 0.002

˜ bias(β)

0.000 0.000 0.000 0.001 0.001 0.002

˜ bias(β)

0.000 0.000 0.000 0.001 0.001 0.002

˜ bias(β)

Asymptotic Results

0.000 0.296 1.314 5.311 11.86 33.11

˜ Rbias (θ)

0.000 0.173 0.708 2.783 6.285 17.42

˜ Rbias (θ)

0.000 0.141 0.553 2.180 4.921 13.64

˜ Rbias (θ)

0.000 0.131 0.502 1.985 4.483 12.44

˜ Rbias (θ)

0.11 0.12 0.16 0.27 0.45 0.88

RQE

0.50 0.55 0.64 0.89 1.17 1.66

RQE

0.81 0.88 0.97 1.20 1.42 1.75

RQE

0.93 0.99 1.09 1.30 1.47 1.74

RQE

32.8 32.3 31.7 29.9 29.6 30.3

ˆ Rn (θ)

6.29 6.28 6.42 7.08 8.19 11.4

ˆ Rn (θ)

3.78 3.86 4.09 4.83 6.11 9.54

ˆ Rn (θ)

3.24 3.35 3.59 4.36 5.65 9.12

ˆ Rn (θ)

3.72 4.02 4.98 8.21 13.2 26.8

˜ Rn (θ)

3.14 3.46 4.12 6.32 9.61 18.9

˜ Rn (θ)

3.05 3.38 3.98 5.81 8.65 16.7

˜ Rn (θ)

3.01 3.34 3.91 5.65 8.33 15.90

˜ Rn (θ)

0.640 0.646 0.651 0.663 0.674 0.697

β

0.640 0.646 0.651 0.663 0.674 0.697

β

0.640 0.646 0.651 0.663 0.674 0.697

β

0.640 0.646 0.651 0.663 0.674 0.697

β

0.009 0.009 0.009 0.009 0.009 0.009

ˆ bias(β)

0.006 0.007 0.007 0.007 0.007 0.008

ˆ bias(β)

0.006 0.006 0.006 0.006 0.007 0.007

ˆ bias(β)

0.006 0.006 0.006 0.006 0.007 0.007

ˆ bias(β)

Finite Sample, n = 1000

0.005 0.011 0.017 0.029 0.040 0.064

˜ bias(β)

0.005 0.011 0.017 0.029 0.041 0.064

˜ bias(β)

0.005 0.011 0.017 0.029 0.040 0.064

˜ bias(β)

0.005 0.011 0.017 0.029 0.040 0.064

˜ bias(β)

0.27 0.30 0.34 0.45 0.76 1.55

RQE

0.66 0.75 0.87 1.20 1.60 2.63

RQE

0.84 0.96 1.09 1.44 1.80 2.71

RQE

0.90 1.03 1.17 1.50 1.84 2.69

RQE

30.3 28.1 26.9 28.8 24.6 24.8

ˆ Rn (θ)

6.83 6.80 6.89 7.49 8.33 10.4

ˆ Rn (θ)

4.40 4.51 4.72 5.53 6.54 8.85

ˆ Rn (θ)

3.84 3.97 4.21 5.07 6.12 8.47

ˆ Rn (θ)

8.27 8.41 9.20 12.9 18.6 38.4

˜ Rn (θ)

4.49 5.07 5.96 8.95 13.3 27.3

˜ Rn (θ)

3.70 4.31 5.19 7.93 11.78 23.9

˜ Rn (θ)

3.45 4.07 4.94 7.60 11.3 22.8

˜ Rn (θ)

0.640 0.653 0.665 0.691 0.716 0.767

β

0.640 0.653 0.665 0.691 0.716 0.767

β

0.640 0.653 0.665 0.691 0.716 0.767

β

0.640 0.653 0.665 0.691 0.716 0.767

β

0.035 0.035 0.037 0.039 0.041 0.046

ˆ bias(β)

0.032 0.033 0.034 0.037 0.039 0.044

ˆ bias(β)

0.030 0.031 0.033 0.036 0.038 0.044

ˆ bias(β)

0.030 0.031 0.032 0.035 0.038 0.043

ˆ bias(β)

Finite Sample, n = 200

Table 8: Multi-Step Forecasting: Locally-misspecified AR(1) Model with ρ1 = 0.8 and h = 2

Note: The iterated forecast (likelihood based) is compared to the direct forecast (innate) in terms of LinEx loss under local misspecification as defined by the parameter d. Both forecasts employ an AR(1) structure while the true model is an AR(2) model, with ϕ2 being d standard deviations away from zero. Rbias (Y˜t+h,t ) captures the bias component of the risk, b0 Ab. The asymptotic results are based on 50,000 simulations with n = 106 whereas the finite-sample ones are based on 500,000 replications.

c=2

31

0.026 0.040 0.055 0.086 0.117 0.184

˜ bias(β)

0.026 0.040 0.055 0.086 0.117 0.184

˜ bias(β)

0.026 0.040 0.055 0.086 0.117 0.184

˜ bias(β)

0.026 0.040 0.055 0.086 0.117 0.184

˜ bias(β)

c = 0.1

c = 0.5

c=1

d 0 0.5 1 2 3 5

d 0 0.5 1 2 3 5

d 0 0.5 1 2 3 5

d 0 0.5 1 2 3 5 ˜ R(θ)

7.14 7.29 7.94 10.7 15.0 29.1 ˜ R(θ)

9.21 9.41 10.2 13.7 19.0 36.7 ˜ R(θ)

ˆ R(θ)

9.57 9.55 9.64 10.1 10.7 12.9

ˆ R(θ)

20.6 20.8 20.6 21.1 21.4 23.7

ˆ R(θ)

1081 1083 1070 1061 1034 1037

RQE

0.75 0.76 0.82 1.06 1.40 2.25

RQE

0.45 0.45 0.50 0.65 0.89 1.55

RQE

0.02 0.02 0.02 0.02 0.03 0.07

17.18 17.52 19.22 23.91 36.00 68.59

6.48 6.62 7.21 9.71 13.7 26.6

7.82 7.78 7.90 8.32 9.02 11.2

0.83 0.85 0.91 1.17 1.52 2.38

˜ R(θ)

ˆ R(θ)

RQE

0.41 0.41 0.41 0.41 0.41 0.41

β

0.41 0.41 0.41 0.41 0.41 0.41

β

0.41 0.41 0.41 0.41 0.41 0.41

β

0.41 0.41 0.41 0.41 0.41 0.41

β

0.00 0.00 0.00 0.00 0.00 0.00

˜ bias(β)

0.00 0.00 0.00 0.00 0.00 0.00

˜ bias(β)

0.00 0.00 0.00 0.00 0.00 0.00

˜ bias(β)

0.00 0.00 0.00 0.00 0.00 0.00

˜ bias(β)

Asymptotic Results

0.006 0.314 1.928 7.328 17.05 49.82

˜ Rbias (θ)

0.000 0.262 1.071 4.393 9.834 27.27

˜ Rbias (θ)

0.000 0.209 0.857 3.495 7.794 21.72

˜ Rbias (θ)

0.000 0.194 0.786 3.192 7.130 19.96

˜ Rbias (θ)

0.15 0.16 0.99 0.23 0.32 0.62

RQE

0.51 0.55 0.62 0.83 1.14 1.95

RQE

0.74 0.80 0.99 1.20 1.60 2.49

RQE

0.82 0.88 0.99 1.32 1.73 2.60

RQE

125 122 8.20 117 106 96.9

ˆ Rn (θ)

18.6 18.4 18.1 18.0 18.2 19.7

ˆ Rn (θ)

9.84 9.84 8.20 10.1 10.8 12.9

ˆ Rn (θ)

8.10 8.13 8.20 8.54 9.26 11.5

ˆ Rn (θ)

18.8 19.6 8.16 26.8 33.9 59.7

˜ Rn (θ)

9.52 10.1 11.2 14.9 20.8 38.4

˜ Rn (θ)

7.30 7.89 8.16 12.2 17.2 32.1

˜ Rn (θ)

6.60 7.18 8.16 11.3 16.1 29.9

˜ Rn (θ)

0.41 0.42 0.43 0.45 0.47 0.51

β

0.41 0.42 0.43 0.45 0.47 0.51

β

0.41 0.42 0.43 0.45 0.47 0.51

β

0.41 0.42 0.43 0.45 0.47 0.51

β

0.018 0.018 0.018 0.019 0.019 0.020

ˆ bias(β)

0.011 0.012 0.011 0.012 0.012 0.013

ˆ bias(β)

0.009 0.009 0.009 0.010 0.010 0.011

ˆ bias(β)

0.009 0.009 0.009 0.009 0.010 0.010

ˆ bias(β)

Finite Sample, n = 1000

0.006 0.017 0.027 0.049 0.070 0.110

˜ bias(β)

0.006 0.017 0.027 0.049 0.070 0.110

˜ bias(β)

0.006 0.017 0.027 0.049 0.070 0.110

˜ bias(β)

0.006 0.017 0.027 0.049 0.070 0.110

˜ bias(β)

0.20 0.20 0.23 0.32 0.47 0.53

RQE

0.56 0.62 0.73 1.01 1.39 2.34

RQE

0.71 0.81 0.94 1.31 1.76 2.77

RQE

0.76 0.87 1.02 1.41 1.88 2.87

RQE

139 142 135 125 113 203

ˆ Rn (θ)

19.3 19.1 18.9 19.0 19.7 23.1

ˆ Rn (θ)

11.1 11.1 11.3 11.9 12.9 16.4

ˆ Rn (θ)

9.26 9.32 9.57 10.2 11.3 14.9

ˆ Rn (θ)

28.1 28.7 30.9 39.4 53.6 107

˜ Rn (θ)

10.8 11.9 13.7 19.2 27.4 53.9

˜ Rn (θ)

7.85 8.97 10.7 15.5 22.7 45.5

˜ Rn (θ)

6.99 8.08 9.73 14.4 21.3 42.9

˜ Rn (θ)

0.410 0.434 0.457 0.501 0.544 0.623

β

0.410 0.434 0.457 0.501 0.544 0.623

β

0.410 0.434 0.457 0.501 0.544 0.623

β

0.410 0.434 0.457 0.501 0.544 0.623

β

0.057 0.060 0.062 0.067 0.072 0.081

ˆ bias(β)

0.049 0.051 0.053 0.058 0.062 0.072

ˆ bias(β)

0.045 0.047 0.049 0.054 0.058 0.068

ˆ bias(β)

0.044 0.046 0.048 0.052 0.056 0.066

ˆ bias(β)

Finite Sample, n = 200

Table 9: Multi-Step Forecasting: Locally-misspecified AR(1) Model with ρ1 = 0.8 and h = 4

Note: The iterated forecast (likelihood based) is compared to the direct forecast (innate) in terms of LinEx loss under local misspecification as defined by the parameter d. Both forecasts employ an AR(1) structure while the true model is an AR(2) model, with ϕ2 being d standard deviations away from zero.

c=2

32

0.027 0.054 0.079 0.129 0.177 0.272

˜ bias(β)

0.027 0.054 0.079 0.129 0.177 0.272

˜ bias(β)

0.027 0.054 0.079 0.129 0.177 0.272

˜ bias(β)

0.027 0.054 0.079 0.129 0.177 0.272

˜ bias(β)

ρ1 = 0.8 2 c= 1 c= 0.5

Relative Criterion Efficiency

c= 0.1

1

0 0

0.5

1

1.5

2

2.5

3

3.5

4

4.5

5

Local Misspecification (d)

Figure 6: Locally Misspecified AR(1) Model. into ϕ2 ' 0.07 and ϕ2 ' 0.21, respectively, if the sample size is n = 200. The corresponding results for the case ρ1 = 0.99 are very similar, and are given in the Web Appendix. The predictors behave similarly when the forecast horizon is h = 4 instead of h = 2. Table 9 shows that for each asymmetry coefficient c, the values of RQE are further from the threshold at 1 than before, whereas the criterion losses are larger. In particular, the bias component of the asymptotic LBE risk increases with the forecast horizon. Both estimators exhibit finite-sample biases increasing with the forecast horizon, with that of the iterated estimator being larger than that of the direct estimator for d > 0. The results for the case ρ1 = 0.99 are qualitatively similar to those in Tables 8 & 9, and the corresponding tables are presented in the Web Appendix.

5

Conclusion

In this paper we have studied parameter estimation in the situation where the objective is to obtain a good description of other data than those used for estimation. The leading example is the problem of forecasting future data, while estimating unknown parameters from past observation. The objective is defined by a criterion function, and the question is how this criterion function should be incorporated in the estimation method. The two most natural candidates are the innate estimator and a likelihood-based estimator. The former is directly deduced from the criterion, whereas the likelihood-based estimator

33

is deduced from a statistical model and relies on a criterion-specific transformation. The tradeoff is robustness vis-a-vis efficiency. In our theoretical analysis, we showed that the likelihood-based estimator is asymptotically optimal within a class of M -estimators. In fact the likelihood-based estimator can be vastly better than the innate estimator, even in simple problems. The flip-side is that the likelihoodbased estimator can also be vastly inferior to the innate estimator if the underlying statistical model is misspecified. We showed that the types of misspecification which are most harmful to the likelihoodbased estimator depend on the criterion. For instance, we presented cases where the likelihood-based estimator is seriously impaired by relatively modest levels of misspecification that would be difficult to detect statistically. Similarly we documented cases with easy-to-detect levels of misspecification where the likelihood-based estimator continued to dominate the innate estimator. This shows that a fixed set of model diagnostics and misspecification tests may not be useful for deciding whether to employ the likelihood-based estimator or not, because the best choice depends on the criterion function. The innate estimator employs the same criterion for estimation as the objective. If a different estimator is to be employed, we showed that the notion of coherency – between the estimation criterion and the actual objective – is essential. A coherent criterion can be crafted from a maximum likelihood estimator, and we showed that the likelihood-based estimator is asymptotically efficient in the present context. This result is analogous to, and deduced from, the classical Cramer-Rao lower bound. The superiority of the likelihood-based estimator relies on the likelihood function being correctly specified. When the likelihood function is misspecified, the asymptotic efficiency that is inherited from the underlying maximum likelihood estimators perishes. However, the most adverse consequence of misspecification is that the required mapping of likelihood parameters to criterion parameters hinges on the specification. So that misspecification can result in the likelihood-based estimator being incoherent. At moderate levels of misspecification, as defined in a local-to-correct specification framework, the choice is less obvious. The likelihood-based estimator dominates the innate estimator at “nearly correct” specifications. The threshold at which the innate estimator becomes superior to the likelihood-based estimator is context-specific and depends on many factors, including the nature of misspecification and the criterion function, Q. In our application based on the LinEx loss function we saw that the superiority of the likelihood-based estimator increases with the degree of asymmetry of the objective. For this reason, it takes a relatively high degree of misspecification before the innate estimator outperforms the likelihood-based estimator when the asymmetry is large, but relatively little misspecification when the loss function is symmetric. Our results highlight the potential benefits of conducting a thorough model diagnostic in the present

34

context. However, for the selection of estimator, the diagnostics should be targeted toward the forms of misspecification that are important for the objective. These are primarily the forms of misspecification that severely distort the mapping of likelihood-parameters to criterion parameters.

References Akaike, H. (1974), ‘A new look at the statistical model identification’, IEEE transactions on automatic control 19, 716–723. Amemiya, T. (1985), Advanced Econometrics, Cambridge: Harvard University Press. Barndorff-Nielsen, O. E. (1977), ‘Exponentially decreasing distributions for the logarithm of particle size’, Proceedings of the Royal Society of London A: Mathematical, Physical and Engineering Sciences 353(1674), 401–419. Barndorff-Nielsen, O. E. (1978), ‘Hyperbolic distributions and distributions on hyperbolae’, Scandinavian Journal of Statistics 5, 151–157. Barndorff-Nielsen, O. E., Blæsid, P., Jensen, J. L. and Sørensen, M. (1985), The fascination of sand, in A. Atkinson and S. Feinberg, eds, ‘A Celebration of Statistics’, Sprin, New York. Bhansali, R. (1997), ‘Direct autoregressive predictors for multistep prediction: Order selection and performance relative to the plug in predictors’, Statistica Sinica 7, 425–450. Bhansali, R. (1999), Parameter estimation and model selection for multistep prediction of time series: a review., 1 edn, CRC Press, pp. 201–225. Chevillon, G. (2007), ‘Direct multi-step estimation and forecasting’, Journal of Economic Surveys 21(4), 746–785. Christoffersen, P. and Diebold, F. (1997), ‘Optimal prediction under asymmetric loss’, Econometric Theory 13, 808–817. Christoffersen, P. and Jacobs, K. (2004), ‘The importance of the loss function in option valuation’, Journal of Financial Economics 72, 291–318. Clements, M. P. and Hendry, D. F. (1996), ‘Multi-step estimation for forecasting’, Oxford Bulletin of Economics and Statistics 58(4), 657–684. Cox, D. R. (1961), ‘Prediction by exponentially weighted moving averages and related methods’, Journal of the Royal Statistical Society. Series B (Methodological) pp. 414–422. Diebold, F. X. and Mariano, R. S. (1995), ‘Comparing predictive accuracy’, Journal of Business and Economic Statistics 13, 253–263. Granger, C. (1969), ‘Prediction with a generalized cost of error function’, OR pp. 199–207. Granger, C. (1986), Forecasting Economic Time Series, Academic Press. Hansen, B. E. (2010a), ‘Multi-step forecast model selection’, working paper .

35

Hansen, P. R. (2010b), ‘A winner’s curse for econometric models: On the joint distribution of in-sample fit and out-ofsample fit and its implications for model selection’, working paper . Huber, P. (1981), Robust Statistics, Wiley. Hwang, S., Knight, J. and Satchell, S. (2001), ‘Forecasting nonlinear functions of returns using linex loss functions’, annals of economics and finance 2(1), 187–213. Ing, C.-K. (2003), ‘Multistep prediction in autoregressive processes’, Econometric Theory 19(2), 254–279. Lütkepohl, H. (2005), New introduction to multiple time series analysis, Springer Science & Business Media. Marcellino, M., Stock, J. H. and Watson, M. W. (2006), ‘A comparison of direct and iterated multistep ar methods for forecasting macroeconomic time series’, Journal of Econometrics 135, 499–526. Schorfheide, F. (2005), ‘Var forecasting under misspecification’, Journal of Econometrics 128(1), 99–136. Takeuchi, K. (1976), ‘Distribution of informational statistics and a criterion of model fitting’, Suri-Kagaku (Mathematical Sciences) 153, 12–18. (In Japanese). Tiao, G. C. and Tsay, R. S. (1994), ‘Some advances in non-linear and adaptive modelling in time-series’, Journal of forecasting 13(2), 109–131. Varian, H. (1974), A Bayesian Approach to Real Estate Assessment, North-Holland, pp. 195–208. Weiss, A. (1996), ‘Estimating time series models using the relevant cost function’, Journal of Applied Econometrics 11(5), 539–560. Weiss, A. and Andersen, A. (1984), ‘Estimating time series models using the relevant forecast evaluation criterion’, Journal of the Royal Statistical Society. Series A (General) pp. 484–487. Weiss, G. (1975), ‘Time-reversibility of linear stochastic processes’, Journal of Applied Probability 12, 831–836. White, H. (1994), Estimation, Inference and Specification Analysis, Cambridge University Press, Cambridge. Zellner, A. (1986), ‘Bayesian estimation and prediction using asymmetric loss functions’, Journal of the American Statistical Association pp. 446–451.

36

A

Appendix: Proof of Analytical Results in Section 2

Proof of Lemma 2. For simplicity write st = s(xt , θ∗ ) and similarly for s˜t . By Assumption 3, the 0 P 2n 0 P2n 0 ˜t is ΣS . Now use the simple identity for the variance asymptotic variance of (2n)1/2 t=1 s t=1 st , P P of a sum to deduce that the asymptotic covariance of n1/2 nt=1 st and n1/2 2n t=n+1 st is zero. The P same argument can be applied to establish the (zero) asymptotic covariance between n1/2 nt=1 s˜t and P n1/2 2n  t=n+1 st . P p p ˜ → Proof of Lemma 3. Since θ˜ → θ0 , it follows by Assumptions 1 and 2 that n−1 nt=1 q(xt , θ) E[q(xt , θ0 )], which is strictly smaller than E[q(xt , θ∗ )], as a consequence of Assumption 2.ii. So that 1 ˆ n [Q[(Y, θ)

p ˜ → − Q(Y, θ)] E[q(xt , θ∗ )] − E[q(xt , θ0 )] > 0, and the result follows.



Proof of Theorem 1. To simplify notation, we write Qx (θ) in place of Q(X , θ), and similarly Sx (θ) = ˜ x (θ) = Q(X ˜ , θ), etc. Since Q ˜ is coherent, we have S(X , θ), Hx (θ) = H(X , θ), Qy (θ) = Q(Y, θ), Q p θ˜ → θ0 = θ∗ , and by a Taylor expansion we have

˜ − Q(Y, θ0 ) = Sy (θ0 )0 (θ˜ − θ0 ) + 1 (θ˜ − θ0 )0 Hy (θ0 )(θ˜ − θ0 ) + op (1). Q(Y, θ) 2 p p d ˜ , θ∗ ) → ˜ and {S(X ˜ , θ), S(Y, θ)} → By Assumption 3 and Lemma 2 we have that n−1 H(Y, θ∗ ) → −A, n−1 H(X −A,

˜ 1/2 Zx , B 1/2 Zy } where Zx and Zy are independent and both distributed as N(0, I). The result {B d

˜ − Q(Y, θ) → Z 0 B 1/2 A˜B ˜ 1/2 Zx + 1 Z 0 B ˜ 1/2 A˜−1 [−A]A˜−1 B ˜ 1/2 Zx now follows. The expectation Q(Y, θ) y 2 x of the first term is zero, and the final result follows by ˜ 1/2 EZx Z 0 B ˜ 1/2 }, ˜ 1/2 Zx } = tr{A˜−1 AA˜−1 B ˜ 1/2 A˜−1 AA˜−1 B tr{EZx0 B x and using that EZx Zx0 = I.



Proof of Lemma 4. Let P denote the true distribution. Consider the parameterized model, {Pϑ : ϑ ∈ Ξ}, which is correctly specified so that P = Pϑ0 for some ϑ0 ∈ Ξ. Since θ∗ is defined to be the maximizer of

ˆ E[Q(Y, θ)] = Eϑ0 [Q(Y, θ)] =

Q(Y, θ)dPϑ0 ,

it follows that θ0 is just a function of ϑ0 , i.e., θ0 = θ(ϑ0 ).



Proof of Theorem 2. Consider first the case where ϑ = θ. From Theorem 1 and a slight variation of its proof it follows that 1 d ˆ − Q(Y, θ) ˜ → Q(Y, θ) +Zy0 B 1/2 A−1 B 1/2 Zx − Zx0 B 1/2 A−1 B 1/2 Zx 2 ˜ 1/2 A˜−1 AA˜−1 B ˜ 1/2 Z˜x , ˜ 1/2 Z˜x + 1 Z˜x0 B −Zy0 B 1/2 A˜−1 B 2 37

where Zy , Zx , and Z˜x are all distributed as N(0, I), with Zy independent of (Zx , Z˜x ). This defines the random variable ξ. Two of the terms vanish after taking the expected value, which yields 1 1 ˜ = 1 tr{AA˜−1 − A−1 B}, − tr{A−1 B} + tr{A˜−1 AA˜−1 B} 2 2 2 ˜ Manipulating this expression leads to where we have used the information matrix equality, A˜ = B. o 1 n 1/2 ˜−1 tr A (A − A−1 BA−1 )A1/2 ≤ 0, 2 ˜ −1 is the asymptotic covariance matrix of the where the inequality follows from the fact that A˜−1 = B ˜ −1 MLE whereas A−1 BA−1 is the asymptotic covariance of the innate estimator, so that A−1 BA−1 − B is positive semi-definite by the Cramer-Rao bound. These arguments are valid whether θ has the same dimension as ϑ or not, because we can reparametrize the model in ϑ 7→ (θ, γ), which results in block-diagonal information matrices. This is achieved with γ(ϑ) = τ (ϑ) − Στ θ Σ−1 θθ θ(ϑ), where  

Σθθ

Σθτ

Στ θ Στ τ

 ,

denotes the asymptotic covariance of the MLE for the parametrization (θ, τ ).



Proof of Corollary 1.The proof is analogous to that of Theorem 2, albeit the comparison in question is here n o ˇ + tr{A˜−1 AA˜−1 B} ˜ = tr A1/2 (A˜−1 − Aˇ−1 B ˇ Aˇ−1 )A1/2 ≤ 0, −tr{Aˇ−1 AAˇ−1 B} ˇ ˇ are the “information” matrices associated with θ. where Aˇ and B (n)

Proof of Theorem 3. With Pn → Pϑ0 we have θ(ϑ0 ) = θ∗ . Thus with θ0

(n)



= θ(ϑ0 ) and a Taylor

˜ about Q(Y, θ∗ ) we find the limit distribution of Q(Y, θ) ˜ − Q(Y, θ∗ ) to be given expansion of Q(Y, θ) from (n) (n) (n) (n) (n) (n) Sy (θ∗ )0 (θ˜ − θ0 + θ0 − θ∗ ) + 21 (θ˜ − θ0 + θ0 − θ∗ )0 Hy (θ∗ )(θ˜ − θ0 + θ0 − θ∗ ).

The expectation of the first term is zero, and the limit distribution of the second term is 1 ˜−1 ˜ 1/2 Zx 2 (A B

˜ 1/2 Zx + b), + b)0 [−A](A˜−1 B

38

so that the asymptotic criterion risk is 1 ˜−1 ˜ 1/2 Zx 2 E(A B

˜ 1/2 Zx + b) = 1 tr{AB ˜ −1 } + 1 tr{Abb0 } = 1 tr{A(B ˜ −1 + bb0 } + b)0 A(A˜−1 B 2 2 2

˜ which also holds where we used that EZx = 0, EZx Zx0 = I and the information matrix equality A˜ = B, under local-to-correct specification.



˜ 1/2 b we have b0 Ab/b0 Bb ˜ = y0B ˜ −1/2 AB ˜ −1/2 y/y 0 y which is bounded Proof of Theorem 4. With y = B ˜ −1/2 AB ˜ −1/2 . If λ is a solution to |B ˜ −1/2 AB ˜ −1/2 − λI| = 0 by the smallest and largest eigenvalues of B ˜ = 0, and the result follows. then λ also solves |A − λB|

B



Appendix: Proof of Auxiliary Results

B.1

Derivations Related to the Linex Case With Correct Specification

The expression for A follows by A = E[−hi (Xi , θ0 )] = E exp{c(Xi − θ0 )} = E exp{c(Xi − µ) −

c2 σ 2 } 2

2 2

= exp{− c 2σ + 21 c2 σ 2 } = 1, where the second last equality follows by using that the moment generating function for V ∼ N(λ, τ 2 ) 2 2

is mgf(t) = E(exp{tV }) = exp{λt + 21 τ 2 t2 }, and setting λ = − c 2σ , τ 2 = c2 σ 2 , and t = 1. For B we note that E[si (Xi , θ0 )]2 = c−2 E[exp{2c(Xi − θ0 )} − 2 exp{c(Xi − θ0 )} + 1] = c−2 E[exp{2c(Xi − µ) − c2 σ 2 )} − 2 exp{c(Xi − µ) − 2 2

= c−2 [exp{−c2 σ 2 + 2c2 σ 2 } − 2 exp{− c 2σ +

c2 σ 2 2 }

c2 σ 2 2 }

+ 1]

+ 1]

= c−2 [exp{c2 σ 2 } − 1].

Here we have used the expression for the moment generating function for a Gaussian random variable twice.

39

B.2

Proof of (6) in Section 3.2

Lemma B.1. Suppose that X ∼ NIG(λ, δ, α, β) and let θ0 = λ +

δβ γ

2

+ 2c δ αγ 3 , where γ =

p

α2 − β 2 .

Then θ∗ = λ +

i p δ hp 2 α − β 2 − α2 − (β + c)2 , c

solves minθ E[exp{c(X − θ)} − c(X − θ) − 1]. Moreover, θ∗ − θ0 → 0 if either (a) c → 0; or (b) α → ∞ with σ 2 = δ/α constant and β = 0. Proof. Using the moment generating function for the NIG-distribution the objective is to minimize

exp{−cθ} exp{cλ + δ(γ −

p α2 − (β + c)2 )} − c(λ +

δβ γ

− θ),

with respect to θ. The first order conditions are therefore −c exp{−cθ} exp{cλ + δ(γ −

p α2 − (β + c)2 )} + c = 0,

hence by rearranging and taking the logarithm, we have

−cθ + cλ + δ(γ −

p α2 − (β + c)2 ) = 0,

and rearranging yields the first result. By the l’Hospital rule we find √ lim

α2 −β 2 −



α2 −(β+c)2

c

c→0

which proves that θ∗ → λ +

δβ γ

=

limc→0 2(β+c) 12 [α2 −(β+c)2 ]−1/2 1

= β/γ,

as c → 0.

With β = 0 we have γ = α, δ = σ 2 α = var(X) and λ = µ = E(X), so that θ∗ = µ +

i σ 2 1−√1−xc2 σ 2 α h√ 2 p 2 α − α − c2 = µ + , x c c

where x = α−2 . The limit as α → ∞, i.e. x → 0, is by the l’Hospital rule σ 2 12 c2 (1 − xc2 )−1/2 σ2 =µ+c , x→0 c 1 2

lim θ0 = µ + lim

α→∞

which completes the proof.

40

Theorem B.1. Consider the NIG(λ, δ, α, β), where λ = µ − δβ/γ, δ = σ 2 γ 3 /α2 and β = bα1−a for a ∈ ( 31 , 1] and b ∈ R. Then NIG(λ, δ, α, β) → N(µ, σ 2 ),

as

α→∞

Proof. Define x = α−2 so that α = x−1/2 and β = bx−(1−a)/2 and note that β/α = bα−a = bxa/2 so that p p γ = 1 − (β/α)2 = 1 − b2 xa . α Now consider the characteristic function for the NIG(λ, δ, α, β) which is given by exp{iλt + δ(γ −

p α2 − (β + it)2 }.

With δ = σ 2 γ 3 /α2 and λ = µ − δβ/γ = µ − σ 2 (γ/α)2 β, the first part of the characteristic function is given by λ = µ − σ 2 (1 − b2 xa )bx−

1−a 2

= µ − σ 2 bx−

1−a 2

+ σ 2 b3 x

We observe that the last term vanishes as x → 0 provided that a > itσ 2 bx−

1−a 2

= itσ 2 bx

1+a 2

3a−1 2

1 3,

.

while the second term,

/x, will be accounted for below.

The second part of the characteristic function equals p ( γ )4 − ( αγ )3 δ(γ − α2 − (β + it)2 ) = σ 2 α

p 1 − (β/α + it/α)2 , α−2

which, in terms of x, is expressed as σ

2 (1

− b2 xa )2 − (1 − b2 xa )3/2 x

p

1 − (bxa/2 + itx1/2 )2

.

Including the second term from the first part of the CF, we arrive at,

σ

2 −itbx

1+a 2

+ (1 − b2 xa )2 − (1 − b2 xa )3/2 x

p 1 − (bxa/2 + itx1/2 )2

,

and applying l’Hospital’s rule as x → 0, we find (apart for the scale σ 2 ) −itb 1+a 2 x

a−1 2

− 2ab2 xa−1 + 23 ab2 xa−1 − 12 (−b2 xa−1 − 2itb 1+a 2 x

41

a−1 2

+ t2 ) = − 21 t2 .

So the CF for the NIG converges to exp{iµt −

σ2 2 2 t }

as x → 0, which is the CF for N(µ, σ 2 ).

Corollary B.1. NIG(µ, σ 2 α, α, 0) → N(µ, σ 2 ), as α → ∞. Proof. The results follow from Theorem B.1, or directly by observing that the CF for NIG(µ, σ 2 α, α, 0) is exp{iµt + σ 2 α2 (1 −

p

1 + α−2 t2 }.

√ Now by l’Hospital’s rule note that ∂ 1 + xt2 /∂x = 12 t2 (1+xt2 )−1/2 , so by setting x = α−2 and applying L’Hospital rule we find

lim

σ 2 (1 −



x→0

limx→0 [− 21 t2 (1 + xt2 )−1/2 ] 1 + xt2 ) 1 = = − σ 2 t2 . x 1 2

Proof of (7). From ξ = (1 + δγ)−1/2 and χ = −ξ αβ , we note that 1/ξ 2 − 1 = (1 + δγ) = δγ = γ 4 /α2 and ξ 2 − χ2 = ξ 2 (1 − β 2 /α2 ) = ξ 2 α−2 γ 2 so that p p 1 − ξ2 1 − ξ2 1 γ 2 /α ξ 2 = = α. = 2 ξ − χ2 ξ α−2 γ 2 (1 − β 2 ) α

Similarly, p p −ξ αβ 1 − ξ2 χ 1 − ξ2 −χ 2 = − ξ = − α = β, ξ − χ2 ξ ξ 2 − χ2 ξ which proves the identities in (7). 

B.3

Derivations Related to the Asymptotic Criterion Risk for the AR(1) model

˜ −1 for the case with multi-step forecasting and LinEx loss. Before In this section we derive A, B and B we prove Theorem 5 we derive several intermediate results. Lemma B.2. Suppose  

then (i) EeX = eµ+

σ2 2 ,

Y X





 ∼ N 

(ii) EXeX = (µ + σ 2 )eµ+σ

ω µ

2 /2

42

  ,

τ2

γ

γ

σ2

  ,

, and (iii) E[Y eX ] = (ω + γ)eµ+

σ2 2 .

Proof. (i) is a simple consequence of the moment generating function of a Gaussian random variable. (ii) follows from (iii) as a special case. To prove (iii), let β = γ/σ 2 and consider EY eX

= E(Y − βX)eX + βEXeX = (ω − βµ)EeX + βEXeX = (ω − βµ)eµ+σ = (ω + γ)eµ+σ

2 /2

2 /2

+ β(µ + σ 2 )eµ+σ

2 /2

,

where the second equality follows by the independence of Y − βX and X. ∗ ∗ Lemma B.3. For the forecast error, et = Yt − Yt,t−h , where Yt,t−h = α∗ − β∗ Yt−h is based on the ideal h

parameter values, (α∗ , β∗ )0 = (µ 1−ϕ 1−ϕ +

c 1−ϕ2h 2 h 0 2 1−ϕ2 σ , ϕ ) ,

we have

et ∼ N(− 2c (1 − ϕ2h )σy2 , (1 − ϕ2h )σy2 ), and for j = 0, 1, . . . , h we have et + et±j ∼ N(−c(1 − ϕ2h )σy2 , 2(1 − ϕ2h + ϕj − ϕ2h−j )σy2 ). Proof. The first result is easily verified from 2h

et = − 2c 1−ϕ σ 2 + εt + ϕεt−1 + · · · + ϕh−1 εt−h+1 . 1−ϕ2 Next et+j + et = −2 2c (1 − ϕ2h )σy2 +(εt+j + ϕεt+j−1 + · · · + ϕj−1 εt+1 ) +(1 + ϕj )(εt + ϕεt−1 + · · · + ϕh−j−1 εt−h+j+1 ) +(ϕh−j εt−h+j + · · · + ϕh−1 εt−h+1 ) which verifies the mean. The variance of the three stochastic terms in the expression above adds up to σy2 times, (1 − ϕ2j ) + (1 + ϕj )2 (1 − ϕ2(h−j) ) + (ϕ2(h−j) − ϕ2h ) = (1 − ϕ2j ) + (1 + ϕ2j + 2ϕj ) − (ϕ2(h−j) + ϕ2h + 2ϕ2h−j ) + (ϕ2(h−j) − ϕ2h ) = 2 + 2ϕj − 2ϕ2h + 2ϕ2h−j ,

43

which completes the proof. Lemma B.4. We have Et−h−j ecet = 1

2 (ϕ|j| −ϕ2h−|j| )

Eec(et +et+j ) = δj = e(cσy )

and

,

for

|j| ≤ h.

Proof. The first and second results now follow by applying Lemma B.2 and simplifying the expressions. First, since et is independent of Ft−h−j we need to compute

Ee

cet

=

c2 1−ϕ2h 2 c2 1−ϕ2h 2 σ +2 σ −2 1−ϕ2 1−ϕ2 e

= e0 = 1,

and second, for j = 0, . . . , h, 2 2 (1−ϕ2h )σ 2 + c 2(1−ϕ2h +ϕj −ϕ2h−j )σ 2 y y 2

Eec(et +et+j ) = e−c

2 σ 2 (ϕj −ϕ2h−j ) y

= ec

,

and the result for j < 0 follows by symmetry. Note that for j ≥ 0 we also have δj = Et−h ec(et +et+j ) a.s. To simplify later expressions we introduce the notation γj = cov(t+j−h , cet ), where t+j−h =

Pj

j−i . i=1 εt−h+i ϕ

Note that γj = 0 for j ≤ 0 and γj = cσy2 (ϕ2(h−j) − ϕ2h ) for

j = 1, . . . , h. We will also use that E[t+j−h ec(et +et+j ) ] = γj δj , which follows by Lemma B.2 since Et+j−h ec(et +et+j ) = (0 + γj )e−c

2 (1−ϕ2h )σ 2 +c2 (1−ϕ2h +ϕj −ϕ2h−j )σ 2 y y

2 (ϕj −ϕ2h−j )σ 2 y

= γj e c

= γj δ j ,

where we have used Lemma B.3. Also, for j ≥ 0 it is easy to verify that also Et−h [t+j−h ec(et +et+j ) ] = γj δj a.s. Lemma B.5. For |j| ≤ h we have EYt+j−h ecet

= µy + γj

EYt+j−h ec(et +et+j ) = (µy + γj )δj EYt−h Yt+j−h ecet

= ϕ|j| σy2 + µy (µy + γj )

EYt−h Yt+j−h ec(et +et+j ) = [ϕ|j| σy2 + µy (µy + γ|j| )]δj . 44

(B.1) (B.2) (B.3) (B.4)

Proof. For ease of exposition, let us assume j = 1, . . . , h. First note that Yt+j−h = Et−h Yt+j−h + t+j−h , where Et−h Yt+j−h = ϕj Yt−h + (1 − ϕj )µy . We have EYt+j−h ecet

= E[Et−h {ϕj Yt−h + (1 − ϕj )µy + t+j−h )ecet ] = [ϕj µy + (1 − ϕj )µy ]Et−h ecet + E[t+j−h ecet ] c2

2 2h )σ 2 + c (1−ϕ2h )σ 2 y y 2

= µy + γj e− 2 (1−ϕ

= µy + γj ,

where we used Lemma B.4, and EYt−j−h ecet = EYt−j−h Et−j−h [ecet ] = µy completes the proof of (B.1). Next, (B.2) follows from: EYt−j−h ec(et +et−j ) = EYt−j−h Et−j−h [ec(et +et−j ) ] = µy δj and EYt+j−h ec(et +et+j ) = E[Et−h {ϕj Yt−h + (1 − ϕj )µy + t+j−h }ec(et +et+j ) ], = [ϕj µy + (1 − ϕj )µy ]Et−h ec(et +et+j ) + E[t+j−h ec(et +et+j ) ] = µy δj + δj (γj + γ0 ) = δj (µy + γj ).

The third result (B.3) follows by, EYt−h Yt+j−h ecet

2 = ϕj E(Yt−h ) + µ2y (1 − ϕj ) + EYt−h Et−h t+j−h ecet c2

= ϕj (σy2 + µ2y ) + µ2y (1 − ϕj ) + µy (0 + γj )e− 2 (1−ϕ

2h )σ 2 + c2 (1−ϕ2h )σ 2 y y 2

= ϕj (σy2 + µ2y ) + µ2y (1 − ϕj ) + µy γj = ϕj σy2 + µ2y + µy γj , where we used Lemma B.2(iii) in the second equality. Finally (B.4) follows from

EYt−h Yt+j−h ec(et +et+j ) = EYt−h (ϕj Yt−h + (1 − ϕj )µy + t+j−h )ec(et +et+j ) 2 = E{[ϕj Yt−h + Yt−h (1 − ϕj )µy ]Et−h ec(et +et+j ) } + E{Yt−h Et−h t+j−h ec(et +et+j ) }

= [ϕj (σy2 + µ2y ) + (1 − ϕj )µ2y ]δj + µy γj δj = [ϕj σy2 + µ2y + µy γj ]δj .

45

Proof of Theorem 5. For the matrix A it is straightforward that  A = E[−ht (θ0 )] = E[ecet 

1

Yt−h

2 Yt−h Yt−h





] = 

1

µy

µy µ2y + σy2

 ,

where we used Lemma B.4 and Lemma B.5 (B.1) and (B.3) with j = 0. For B we need to derive the long run variance of the score

st =

1 c



ecet − 1



(ecet

Yt−h



− 1)

.

For the upper left element we need to compute the following terms E(ecet − 1)(ecet+j − 1) = E(ec(et +et+j ) − ecet − ecet+j + 1) = Eec(et +et+j ) − 1, that are zero for |j| ≥ h, so that

c2 B11 =

h−1 X



2 σ 2 (ϕ|j| −ϕ2h−|j| ) y

ec

 −1 =

j=−(h−1)

h−1 X

(δj − 1)

j=−(h−1)

by Lemma B.4. Next, for B12 we need to compute the terms E(ecet − 1)Yt+j−h (ecet+j − 1) = EYt+j−h ec(et +et+j ) − EYt+j−h ecet − EYt+j−h ecet+j + EYt+j−h 2 σ 2 (ϕj −ϕ2h−j ) y

= (µy + γj )(ec

− 1) = (µy + γj )(δj − 1),

with EYt+j−h ecet+j = µy , and similarly we find E(ecet − 1)Yt−j−h (ecet−j − 1) = EYt−j−h ec(et +et−j ) − EYt−j−h ecet − EYt−j−h ecet−j + EYt−j−h 2 σ 2 (ϕj −ϕ2h−j ) y

= µy (ec

− 1) = µy (δj − 1).

Hence,

c2 B12 =

h−1 X

(µy + γj ) (δj − 1) ,

j=−h+1

46

where γj = 0 for j ≤ 0 and γj = cσy2 (ϕ2(h−j) − ϕ2h ) for j = 1, . . . , h. For B22 we need EYt−h (ecet − 1)Yt+j−h (ecet+j − 1) for j = 1, . . . , h, which equals EYt−h (ecet − 1)Yt−j−h (ecet−j − 1) given the stationarity. By Lemma B.5 (B.3) and (B.4) we have EYt−h Yt+j−h ec(et +et+j ) − EYt−h Yt+j−h ecet − EYt−h Yt+j−h ecet+j + EYt−h Yt+j−h = (ϕj σy2 + µ2y + µy γj )δj − (ϕj σy2 + µ2y + µy γj ) = (ϕj σy2 + µ2y + µy γj )(δj − 1) since EYt−h Yt+j−h ecet+j = EYt−h Yt+j−h Et+j−h (ecet+j ) = EYt−h Yt+j−h by Lemma B.4. Therefore

h−1 X

B22 =

(ϕ|j| σy2 + µ2y + µy γ|j| )(δj − 1).

j=−h+1

˜ = ( 1−b µ + ˜ −1 , we observe that for ϑ = (µ, ϕ, σ 2 )0 and θ˜0 = (α Finally, for the expression of B ˜ , β) 1−ϕ c 1−b2 2 2 1−ϕ2 σ , b)

˜ with b = ϕh , the differential ∂ θ/∂ϑ is given by  G=

1−b 1−ϕ

−1 (1−ϕ)+(1−b) µ −hbϕ (1−ϕ) 2

+

2 −1 (1−ϕ2 )+ϕ(1−b2 ) cσ 2 −hb ϕ (1−ϕ 2 )2

hbϕ−1

0

c 1−b2 2 1−ϕ2

0

 .

Next, 

σ2 +

µ2 (1+ϕ) (1−ϕ)

  avar(ϑ) =  −µ(1 + ϕ)  0

−µ(1 + ϕ) (1 − ϕ2 ) 0

0



  0 ,  4 2σ

˜ −1 = Gavar(ϑ)G0 , the expression stated in the Theorem see e.g. Lütkepohl (2005, section 3.4.3). With B follows by and simple multiplication of these terms. Detailed calculations of the matrix multiplications are available in the online Web Appendix.

B.4



Expression for the case h = 2

Relatively transparent expressions are obtained for the case with two-step ahead forecasting, where it is easy to verify that ˜ −1 } = (1 + 2ϕ + 5ϕ2 )(1 − ϕ2 )σ 2 + (1 + 4ϕ2 − ϕ4 )(1 − ϕ2 )2 c2 σ 4 tr{AB y y tr{A−1 B} =

2 (1−ϕ4 )c2 σy2 [e c2

− 1 + (ec 47

2 σ 2 ϕ(1−ϕ2 ) y

− 1)(1 + ϕ)].

With the notation λ = (cσy )2 we have RQE =

(1 + 2ϕ + 5ϕ2 )(1 − ϕ2 )λ + 21 (1 + 4ϕ2 − ϕ4 )(1 − ϕ2 )2 λ2 , 2[e(1−ϕ4 )λ − 1 + (eλϕ(1−ϕ2 ) − 1)(1 + ϕ)]

which is less than one for |ϕ| < 1, and it can be verified that limλ→0 RQE =

1 1+2ϕ+5ϕ2 2 1+ϕ+2ϕ2 ,

which

corresponds to the case with symmetric loss.

B.5

Proof of (11) in section 4.2

Theorem B.2. Consider an AR(2) process, Yt = µ + ϕ1 Yt−1 + ϕ2 Yt−2 + εt , with εt ∼ i.i.d.N (0, σ 2 ). Then   1 + ϕ21 − ϕ22 2 Yt+2 |Yt ∼ N (1 + ρ1 )µ + ρ2 Yt , σε . 1 − ϕ22 Proof. The conditional mean follows from

E(Yt+2 |Yt ) = E[µ+ϕ1 Yt+1 + ϕ2 Yt + εt+2 |Yt ] = µ + ϕ1 E(Yt+1 |Yt ) + ϕ2 Yt = µ+

ϕ1 1−ϕ2 µ

+ (ϕ1 ρ1 + ϕ2 )Yt = (1 + ρ1 )µ + ρ2 Yt ,

where we used that

E(Yt+1 |Yt ) = E[µ+ϕ1 Yt + ϕ2 Yt−1 + εt+1 |Yt ] = µ + ϕ1 Yt + ϕ2 E(Yt−1 |Yt ) =

1 1−ϕ2 (µ

+ ϕ1 Yt ) =

1 1−ϕ2 µ

+ ρ1 Yt .

2 |Y ) − For the conditional variance, var(Yt+2 |Yt ), we make use of the fact that var(Yt+2 |Yt ) = E(Yt+2 t

[E(Yt+2 |Yt )]2 , where the second term follows from the result above. To compute the first term, we make use of identities (from Yule-Walker):

ρ1 (1 − ϕ2 ) = ϕ1 ,

(B.5)

ϕ1 ρ1 + ϕ2 = ρ2 .

(B.6)

We seek 2 E(Yt+2 |Yt ) = E[(µ + ϕ1 Yt+1 + ϕ2 Yt + εt+2 )(µ + ϕ1 Yt+1 + ϕ2 Yt + εt+2 )|Yt ] 2 = µ2 + 2µϕ1 E(Yt+1 |Yt ) + 2µϕ2 Yt + ϕ21 E(Yt+1 |Yt ) + ϕ22 Yt2 + σε2 + 2ϕ1 ϕ2 Yt E(Yt+1 |Yt )

48

where we use (B.6) and that 2 E(Yt+1 |Yt ) = E((µ + ϕ1 Yt + ϕ2 Yt−1 + εt+1 )(µ + ϕ1 Yt + ϕ2 Yt−1 + εt+1 )|Yt ) 2 = µ2 + 2µYt ϕ1 + 2µϕ2 E(Yt−1 |Yt )) + ϕ21 Yt2 + ϕ22 E(Yt−1 |Yt ) + σε2 + 2ϕ1 ϕ2 Yt E(Yt−1 |Yt ).

2 |Y ) = E(Y 2 |Y ),9 Using E(Yt+1 t t−1 t

2 E(Yt+1 |Yt ) =

1 [µ2 1−ϕ22

ϕ2 + 2µYt (ϕ1 + 2ϕ2 ρ1 ) + (ϕ21 + 2ϕ1 ϕ2 ρ1 )Yt2 + σε2 ] + 2µ2 1−ϕ 2

so that by using (B.5) several times, we find the equivalent expressions 2 E(Yt+1 |Yt ) =

1 [µ2 1−ϕ22

ϕ2 1 + 2µ2 1−ϕ + 2µYt (ρ1 + ϕ2 ρ1 )] + ρ1 1+ϕ [ϕ1 + 2(ρ1 − ϕ1 )]Yt2 + 2 2

1 = µ2 (1−ϕ 2 + 2)

2 1+ϕ2 µYt ρ1

1 + ρ1 1+ϕ [2ρ1 − ϕ1 ]Yt2 + 2

1 σ2 1−ϕ22 ε

1 σ2, 1−ϕ22 ε

and 2 E(Yt+2 |Yt ) = µ2 (1 + 2ρ1 +

ϕ21 ) (1−ϕ2 )2

+ 2µYt (ϕ1 ρ1 + ϕ2 + ϕ2 ρ1 + ϕ1 ρ21 )

1 −ϕ1 +[ϕ21 ρ1 2ρ1+ϕ + 2ϕ2 ρ2 − ϕ22 ]Yt2 + 2

1+ϕ21 −ϕ22 2 σε 1−ϕ22

= µ2 (1 + 2ρ1 + ρ21 ) + 2µYt (ρ2 + (ϕ2 + ϕ1 ρ1 )ρ1 )) 1 −ϕ1 +[ϕ21 ρ1 2ρ1+ϕ + 2ϕ2 ρ2 − ϕ22 ]Yt2 + 2

The expression for the conditional variance, var(Yt+2 |Yt ) =

1+ϕ21 −ϕ22 2 σε . 1−ϕ22

1+ϕ21 −ϕ22 2 σε 1−ϕ22

follows from

1 −ϕ1 µ2 (1 + ρ1 )2 + 2µYt ρ2 (1 + ρ1 ) + [ϕ21 ρ1 2ρ1+ϕ + 2ϕ2 ρ2 − ϕ22 ]YT2 2

−[µ2 (1 + ρ1 )2 + 2µYt (1 + ρ1 )ρ2 + ρ22 Yt2 ] h i 2 2 2 2) = ϕ21 ρ21 2−(1−ϕ − (ϕ − 2ϕ ρ + ρ ) 2 2 2 2 YT 1+ϕ2   = (ρ2 − ϕ2 )2 − (ϕ2 − ρ2 )2 YT2 = 0, which completes the proof. 9

2 This equality follows from var(Yt±1 |Yt ) = E([Yt±1 − E(Yt±1 |Yt )]2 |Yt ) = E(Yt±1 |Yt ) − 2E(Yt±1 E(Yt±1 |Yt )|Yt ) + 2 E(E(Yt±1 |Yt ) |Yt ) where the equalities E(Yt−1 |Yt ) = E(Yt+1 |Yt ) = ρ1 Yt and var(Yt−1 |Yt ) = var(Yt+1 |Yt ) hold by the time-reversibility of stationary Gaussian processes, see e.g. Weiss (1975).

49

C

Appendix: Details Concerning the Simulations Designs

C.1

LinEx Loss

Proposition C.1 (Invariance of LinEx Simulation Study). The simulation study based on Xi ∼ iidN(µ, σ 2 ) and LinEx loss Ld , is equivalent to that based on Xi ∼ iidN(0, 1) and LinEx loss Lc . Our simulation design is based on random variables with mean zero and unit variance. This is without loss of generality because a simulation design based on random variables Xi with mean µ, variance σ 2 and asymmetry parameter d, is equivalent to a design based on Zi = (Xi − µ)/σ with asymmetry parameter c = σd. To establish this result we first show that the estimator, θˇd , deduced from LinEx loss, Ld , and the sample X = (X1 , . . . , Xn ), is linearly related to the estimator θˇc , deduced from LinEx loss, Lc , and the sample Z = (Z1 , . . . , Zn ), by θˇd (X ) = µ + σ θˇc (Z).

(C.1)

For the innate estimator this follows by X X 1 1 exp(dXi )} = log{ n1 exp(dσZi ) exp(dµ)} log{ n1 d d X 1 = µ + log{ n1 exp(cZi )} = µ + σ θˆc (Z), d

θˆd (X ) =

and similarly for the likelihood-based estimator we observed that ¯+ θ˜d (X ) = X

d1 2n

X ¯ 2 = µ + σ Z¯ + (Xi − X) i

d1 2n

X

¯ 2 = µ + σ θ˜c (Z). σ 2 (Zi − Z)

i

Hence d{Yi − θˇd (X )} = d{σ Yiσ−µ + µ − µ − σ θˇc (Z)} = c{ Yiσ−µ − θˇc (Z)}, which proves that, Ld (Yi − θˇd (X )) = σ 2 Lc ( Yiσ−µ − θˇc (Z)). Since the scale, σ 2 , is common for all estimators that satisfy (C.1), their relative performance is unaffected.

50

Details about the simulation study We draw 2n independent observations N (0, 1). The first n observations (in-sample) are used to compute ˜ The remaining n observation are used to compute the out-of-sample losses, the estimators θˆ and θ. including the losses resulting from the ideal parameter value θ∗ . This is done in 500,000 independent replications, and the properties of θˆ and θ˜ are evaluated by averaging over the simulations. We have used n = 100 and n = 1, 000 in the finite sample analysis. A range, c ∈ {0; 0.25; 0.5; 1; 1.5; 2; 2.5}, of values for the asymmetry parameter is used.

C.2

LinEx under NIG distribution

The simulations with the normal inverse Gaussian (NIG) distribution represent a case with local misspecification. We use random variables drawn from NIG(λ, δ, α, β), that are normalized to have mean p zero and unit variance. This is achieved by setting δ = γ 3 /α2 and λ = −δβ/γ where γ = α2 − β 2 , and this standardized NIG distribution may be parameterized by:

ξ=√

1 1 + δγ

β χ=ξ . α

and

An asymmetric distribution is obtained with χ = −ξ 3/2 , leaving one free parameter, where the Gaussian case arises as ξ → 0. This facilitates the setup of the local-misspecification experiments where an increase in the level of the bias will be immediately mirrored by a modification of ξ. The advantage of this setup is that the NIG optimal predictor is always computable as long as 0 ≤ ξ < 1. The values of the parameters λ, δ, α, β are then deduced from the relations above and the standardization constraints. To set the misspecification level in the simulations design, we first deduce the rate of convergence of √ ξ to zero. By setting the LinEx asymmetry coefficient to 1, we solve for ξ so that n(θ0 − θ∗ ) − b = 0. From (4) and (6), and the normalizations µ = 0 and σ = 1, we observe that θ0 − θ∗ = q h i p δ γ − α2 − (β + 1)2 . Next using that χ = −ξ 3/2 and defining aξ = 1+ξ 1−ξ we find that √ α = ξ

1−ξ 2 ξ 2 −χ2

√ =ξ

1−ξ 2 ξ 2 −ξ 3

r =

1 ξ

1−ξ 2 (1−ξ)2

√ √ 1−ξ 2 1−ξ 2 a β = χ ξ2 −χ2 = − √1ξ 1−ξ = − √ξξ , q q q 1−ξ 2 1−ξ 2 1+ξ 1 γ = = = = ξ 1−ξ ξ 2 −χ2 ξ2 so that δ =

γ 3 /α2

=

2 3/2 aξ ξ (1−ξ) ξ3

=

3/2 aξ (1−ξ) ξ

=

51



1+

=



1 2

−λ−

aξ ξ ,

√ 1−ξ ξ

1−ξ 2 1−ξ

ξ (1−ξ) ξ

q

=

=

√ aξ 1−ξ , ξ

1−ξ 2 ξ

q

1−ξ ξ

and λ = −δβ/γ =

√ = aξ 1−ξ ξ

q

1−ξ 2 ξ .

Consequently,

√ aξ 1−ξ ξ

θ0 − θ∗ =

1 2

=

1 2

=

1 3/2 2ξ

− −

q



1−ξ 2 ξ

3/2 aξ (1−ξ) ξ



q

1−ξ 2 ξ



q

√ aξ 1−ξ ξ

1−ξ ξ

 q p 2 2 2 − aξ /ξ − (1 − aξ / ξ) " # r q q p 1+ξ 1+ξ 1+ξ − 1−ξ /ξ 2 − (1 − 1−ξ / ξ)2 ξ2

− 18 ξ 2 + O(ξ 3 ), 2

1

equating with n−1/2 b implies ξ ' (2b) 3 n− 3 . So in our simulations design we set ξ = d × n−1/3 where d defines the degree of local-misspecification, and we let d vary from 0 (correctly specified model) to 3. Then we proceed with the following steps: 1. Given ξ we compute the parameters (λ, δ, α, β) using the expressions provided in Section 3.2. 2. Draw a sample of size 2n from the NIG(λ, δ, α, β) distribution. 3. The estimators θˆ and θ˜ are computed with (3) and (5) using the first n observations, while θ∗ is computed with (6). ˆ θ, ˜ and θ∗ . 4. The criterion is then evaluated out-of-sample using the last n observations, using θ, 5. Repeat steps 2-4 for the desired number of repetitions (we use 100, 000 and 500, 000 in some cases). 6. Evaluate the out-of-sample properties by averaging over repetitions. For the design resembling the asymptotic case, we use the sample size n = 1, 000, 000, and we have verified that n = 100, 000 produced very similar results. The levels of asymmetry used are c ∈ {0; 0.25; 0.5; 1; 2}. A finite sample analysis with n = 200 is also performed. In this where ξ ∝ n−1/3 , we have χ = −ξ 3/2 ∝ −n−1/2 , p p 1 − ξ2 1 − n−2/3 −1/3 α= 2 ∝ n ∝ n1/3 , ξ − χ2 n−2/3 − n−1 1

1

1

and β = αχ/ξ ∝ −n 3 − 2 + 3 = −n1/6 .

C.3

Multi-Step Forecasting

To compare the relative efficiency of the two predictors in the context of multi-step forecasting, the following steps are used in the simulation study.

52

1. For a given value of ϕ, we draw εt ∼ iidN(0, 1), for t = 1, . . . , 2n, and generate Yt = ϕYt−1 + εt with Y0 = 0. 2. The estimators θˆ and θ˜ are computed from the first n observations, while θ0 and θ∗ can be deduced analytically. 0 3. The predictors, Yt+h,t , Y˜t+h,t , Yˆt+h,t , are computed (out-of-sample) for t = n + h, . . . , 2n and eval-

uated using the criterion. 4. Repeat steps 1-3 for the desired number of repetitions. We use 50,000 to simulate “asymptotic” results (n = 1, 000, 000) and 500,000 for our simulation study of finite sample properties. 5. Evaluate the out-of-sample properties by averaging over repetitions.

53

Parameter Estimation with Out-of-Sample Objective

Apr 22, 2016 - Y represents future data and X is the sample available for estimation. .... In this general framework we establish analytical results that are.

1MB Sizes 2 Downloads 304 Views

Recommend Documents

DISTRIBUTED PARAMETER ESTIMATION WITH SELECTIVE ...
shared with neighboring nodes, and a local consensus estimate is ob- tained by .... The complexity of the update depends on the update rule, f[·], employed at ...

PARAMETER ESTIMATION OF 2D POLYNOMIAL ...
still important to track the instantaneous phases of the single com- ponents. This model arises, for example, in radar ... ods do not attempt to track the phase modulation induced by the motion. Better performance can be .... set of lags as τ(l). 1.

Network topology and parameter estimation - BMC Systems Biology
Feb 7, 2014 - lution of the system. It is hence necessary to determine the identity and number of perturbations and whether to generate data from individual or combined .... different metrics used for challenge scoring described in Additional file 3:

Parameter estimation for agenda-based user simulation
In spoken dialogue systems research, modelling .... Each time the user simulator receives a system act, .... these systems are commonly called slot-filling di-.

pdf-0730\parameter-estimation-and-inverse-problems-international ...
... loading more pages. Retrying... pdf-0730\parameter-estimation-and-inverse-problems-inte ... y-richard-c-aster-brian-borchers-clifford-h-thurber.pdf.

Reinforcement learning for parameter estimation in ...
Oct 14, 2011 - Keywords: spoken dialogue systems, reinforcement learning, POMDP, dialogue .... “I want an Indian restaurant in the cheap price range” spoken in a noisy back- ..... 1“Can you give me a phone number of The Bakers?” 12 ...

Network topology and parameter estimation - BMC Systems Biology
Feb 7, 2014 - using fluorescent data from protein time courses is a key ..... time course predictions. Score Bayesian Decompose network". Selection of data. Sampling. Orangeballs. 0.0229. 3.25E-03. 0.002438361. 1.21E - 25. 27.4 no yes ...... TC EB PM

Optimal ࣇ-SVM Parameter Estimation using Multi ...
May 2, 2010 - of using the Pareto optimality is of course the flexibility to choose any ... The authors are with the Dept. of Electrical and Computer Engineering.

Optimal ࣇ-SVM Parameter Estimation using Multi ...
May 2, 2010 - quadratic programming to solve for the support vectors, but regularization .... the inherent parallelism in the algorithm [5]. The NSGA-II algorithm ...

Network topology and parameter estimation: from ... - Springer Link
Feb 7, 2014 - No space constraints or color figure charges. • Immediate publication on acceptance. • Inclusion in PubMed, CAS, Scopus and Google Scholar. • Research which is freely available for redistribution. Submit your manuscript at www.bio

Final Report on Parameter Estimation for Nonlinear ... - GitHub
set of measured data that have some degree of randomness. By estimating the .... Available: http://www.math.lsa.umich.edu/ divakar/papers/Viswanath2004.pdf. 8.

TEST-BASED PARAMETER ESTIMATION OF A BENCH-SCALE ...
control tasks. This requires extrapolation outside of the data rank used to fit to the model (Lee ..... Torres-Ortiz, F.L. (2005). Observation and no lin- ear control of ...

TEST-BASED PARAMETER ESTIMATION OF A ...
analysis, sensitivity analysis and finally a global parametrical estimation. Some of these ..... tank (s) has an electrical resistance, the big-tank is a mixing-tank (b).

SBML-PET-MPI: A parallel parameter estimation tool for SBML ...
SBML-PET-MPI is a parallel parameter estimation tool for Systems Biology Markup. Language (SBML) (Hucka et al., 2003) based models. The tool allows the ...

working fluid selection through parameter estimation
WORKING FLUID SELECTION THROUGH PARAMETER ESTIMATION. Laura A. Schaefer. Mechanical Engineering Department, University of Pittsburgh [email protected]. Samuel V. Shelton. Woodruff School of ...... [23] Wilding, W., Giles, N., and Wilson, L., 1996, “P

parameter estimation for the equation of the ...
School of Electrical and Computer Engineering, Electric Power Department,. High Voltage Laboratory, National Technical University of Athens, Greece. ABSTRACT. This paper presents the parameter estimation of equations that describe the current during

Objective Objective #1 Objective Objective #2 Objective ...
This content is from rework.withgoogle.com (the "Website") and may be used for non-commercial purposes in accordance with the terms of use set forth on the Website. Page 2. Objective. 0.6. Accelerate Widget revenue growth. Overall score. Key Results.

Download Objective First Workbook with Answers with ...
Objective First Fourth edition Student's Book. The accompanying Audio CD provides exam-style listening practice. A Workbook without answers is also available ...

PdF Objective First Workbook with Answers with Audio ...
... Create Book of Illustrated Lyrics Adero said the book was dedicated Halloween Heat II book download Rachel Firasek Download Halloween Heat II This site ...