Statistics and Probability Letters 80 (2010) 540–547
Local adaptive smoothing in kernel regression estimation

Qi Zheng, K.B. Kulasekera*, Colin Gallagher

Department of Mathematical Sciences, Clemson University, Clemson, SC 29634-0975, United States

Article history: Received 2 July 2009; Received in revised form 4 December 2009; Accepted 5 December 2009; Available online 23 December 2009.
Abstract

We consider nonparametric estimation of a smooth function of one variable. Global selection procedures cannot sufficiently account for local sparseness of the covariate, nor can they adapt to local curvature of the regression function. We propose a new method for selecting local smoothing parameters which takes sparseness into account and adapts to local curvature. A Bayesian-type argument provides an initial smoothing parameter which adapts to the local sparseness of the covariate and provides the basis for local bandwidth selection procedures which further adjust the bandwidth according to the local curvature of the regression function. Simulation evidence indicates that the proposed method can reduce both the pointwise mean squared error and the integrated mean squared error.
1. Introduction

The kernel method for regression estimation has become a standard statistical technique in many areas of research. It is well known that the performance of kernel methods depends crucially on the smoothing parameter (the bandwidth). This has motivated the data-driven bandwidth selection procedures appearing in the literature, among which we mention three: least squares cross-validation (LSCV), proposed by Craven and Wahba (1979); plug-in methods (e.g., see Härdle and Bowman, 1988); and AIC-based methods (e.g., see Akaike, 1973).

However, in many situations a single fixed bandwidth does not provide a good estimator of the regression function. One reason is that the amount of smoothing needed where the data are dense may differ from that required where the data are sparse. Moreover, a bandwidth chosen for optimally estimating the function at or near a local mode may differ from the bandwidth required for an accurate estimate where the function is flat. To overcome the disadvantages of a fixed bandwidth, it is necessary to allow the bandwidth to adapt locally.

Apart from the direct plug-in method, which requires knowledge of the unknown mean function, very few data-based methods for selecting local bandwidths have been proposed. Fan et al. (1996) proposed an adaptive procedure for obtaining local bandwidths. Their technique uses a plug-in approach coupled with an approximation scheme to develop a bandwidth as a function of the point of estimation. However, as the authors mention, their procedure does not work well for moderate sample sizes. To allow a smoothing method to adapt to sparseness in the data, the choice of a local smoothing parameter should depend on the design density of the covariate. The local sparseness of the covariate would be reflected by a data-based local bandwidth for a design density estimator.
Gangopadhyay and Cheung (2002) and Kulasekera and Padgett (2006) proposed adaptive Bayesian bandwidth selection for density estimation. They treat the bandwidth as a scale parameter, and use a prior distribution to compensate for lack of information for small or moderate sample sizes. As a result this approach appears to perform well for moderate sample sizes.
* Corresponding author.
E-mail addresses: [email protected] (Q. Zheng), [email protected] (K.B. Kulasekera), [email protected] (C. Gallagher).
0167-7152/$ – see front matter © 2009 Elsevier B.V. All rights reserved. doi:10.1016/j.spl.2009.12.008
In this article, we combine Bayesian estimation of the design density bandwidth with local bandwidth selection techniques. The Bayesian bandwidths adapt to local sparseness (see Fig. 1) and can be used to create an initial smoothing window for a local bandwidth selection procedure. The result is local bandwidths which adapt both to sparseness and to the local curvature of the regression function. In Section 2 we describe our procedure for local linear regression, but the method can be adapted to other smoothed estimators of regression functions. Judicious choice of the prior parameter sequences results in bandwidths tending to zero as the sample size increases, so that estimators based on bandwidths from our procedure are consistent. Results of a simulation study comparing several bandwidth selectors against the proposed method are provided in Section 3. These simulation results indicate that the proposed method can enhance nonparametric regression estimator performance.

2. Adaptive bandwidth selection for kernel estimation of regression

We develop the Bayesian local bandwidth selection approach for kernel regression estimators and discuss asymptotic properties of local smoothing parameters from our procedure. We begin with some notation. Let $(X_i, Y_i)$, $i = 1, 2, \dots, n$, be $n$ independent and identically distributed (i.i.d.) bivariate observations with joint density $f(\cdot,\cdot)$. Let the marginal density of $X$ be $f(\cdot)$. Let $K$ be a classical second-order kernel function (e.g., see Eubank, 1999). Our goal is to estimate the regression function, the conditional expectation $m(x) = E[Y \mid X = x]$ (assuming $f(x) \neq 0$). The model can then be written as

$$Y = m(X) + \varepsilon, \qquad (1)$$

where $E[\varepsilon \mid X] = 0$ and $V[\varepsilon \mid X] = \sigma^2$. In this article, we consider the local linear estimator of $m(x)$ given by

$$\hat m_h(x) = \frac{1}{nh(x)} \sum_{i=1}^{n} K\!\left(\frac{x - X_i}{h(x)}\right) \frac{M_{2,n}(x) - \frac{x - X_i}{h(x)} M_{1,n}(x)}{M_{2,n}(x) M_{0,n}(x) - M_{1,n}^2(x)}\, Y_i, \qquad (2)$$

where $M_{j,n}(x) = (nh(x))^{-1} \sum_{i=1}^{n} K\!\left(\frac{x - X_i}{h(x)}\right)\left(\frac{x - X_i}{h(x)}\right)^j$, $j = 0, 1, 2$, and $h(x)$ is the bandwidth to be selected at $x$.
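For concreteness, the estimator in (2) can be sketched as follows. This is our own illustration with a Gaussian kernel; the function and variable names are ours, not from the paper.

```python
import numpy as np

def local_linear(x, X, Y, h):
    """Local linear estimate of m(x) as in Eq. (2), Gaussian kernel (a sketch)."""
    u = (x - X) / h
    K = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)   # second-order kernel
    n = len(X)
    # M_{j,n}(x) = (nh)^{-1} sum_i K(u_i) u_i^j, j = 0, 1, 2
    M0 = np.sum(K) / (n * h)
    M1 = np.sum(K * u) / (n * h)
    M2 = np.sum(K * u ** 2) / (n * h)
    w = K * (M2 - u * M1) / (M2 * M0 - M1 ** 2)        # effective local weights
    return float(np.sum(w * Y)) / (n * h)
```

A quick sanity check of the weights: the local linear estimator reproduces linear functions exactly, so applying it to $Y = 2X + 1$ returns $2x + 1$ at any interior point, regardless of the bandwidth.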
Our aim is to select a bandwidth that depends on the data. Noticing that the amount of smoothing where $X$ is dense should differ from that where $X$ is sparse, we use a Bayesian approach to obtain an initial smoothing window based on the $X$ observations (see Kulasekera and Padgett, 2006). This window can then be used in local bandwidth selection procedures, resulting in the $h(x)$ to be used in (2). The local bandwidth selection techniques accommodate different mean functions with the same sampling design density.

2.1. The Bayesian smoothing window

Here we develop an initial bandwidth which adapts to the local sparseness of the data. Define

$$f_h(x) = (f * K_h)(x) = \int f(u) K_h(x - u)\, du = E[K_h(X - x)], \qquad (3)$$

where $K_h(x) = \frac{1}{h} K\!\left(\frac{x}{h}\right)$. Now, considering $h$ as a parameter for $f_h$, for a prior density $\pi(h)$ the posterior distribution of $h$ is

$$\pi(h \mid x) = \frac{f_h(x)\,\pi(h)}{\int f_h(x)\,\pi(h)\, dh}. \qquad (4)$$
Since $f_h(x)$ is unknown, we cannot compute $\pi(h \mid x)$ directly. However, it is natural to use the sample mean

$$\hat f_h(x) = \frac{1}{n} \sum_{i=1}^{n} K_h(X_i - x)$$

to estimate $f_h(x)$. Substituting this in (4), we get

$$\hat\pi(h \mid X_1, \dots, X_n, x) = \frac{\hat f_h(x)\,\pi(h)}{\int \hat f_h(x)\,\pi(h)\, dh}. \qquad (5)$$

Then, for squared-error loss, the best local bandwidth $h = h(x)$ is given by the posterior mean

$$h^*(x) = \int h\, \hat\pi(h \mid X_1, \dots, X_n, x)\, dh. \qquad (6)$$
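Before specializing to a conjugate prior, (5) and (6) can be evaluated numerically for an arbitrary prior. The following sketch is our own: it assumes a Gaussian kernel, a user-chosen bandwidth grid, and a simple Riemann-sum approximation of the two integrals.

```python
import numpy as np

def posterior_mean_h(x, X, prior, h_grid):
    """Numerical version of Eqs. (5)-(6): posterior-mean bandwidth at x.
    `prior` is any vectorized density pi(h); `h_grid` a uniform grid (assumptions)."""
    H = h_grid[:, None]                              # shape (G, 1) for broadcasting
    # \hat f_h(x) = n^{-1} sum_i K_h(X_i - x) with a Gaussian kernel, for each h
    Kh = np.exp(-0.5 * ((X - x) / H) ** 2) / (H * np.sqrt(2.0 * np.pi))
    fhat = Kh.mean(axis=1)                           # one value per candidate h
    w = fhat * prior(h_grid)                         # unnormalized posterior (5)
    return float(np.sum(h_grid * w) / np.sum(w))     # posterior mean (6)
```

As a sanity check, a prior concentrated at a single grid point forces the posterior mean to that point.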
Note that with this approach the posterior is a function of $h$ only and, with a well-selected prior and kernel, $\pi(h \mid X_1, \dots, X_n, x)$ and $h^*(x)$ can be obtained explicitly. Although this approach works in principle for any kernel and any suitable prior (preferably a conjugate prior), for the remainder of this article we use a normal kernel and an inverted-gamma prior for algebraic simplicity. In particular, the normal kernel

$$K(u) = \frac{1}{\sqrt{2\pi}} e^{-u^2/2}, \qquad -\infty < u < \infty,$$

coupled with the inverted-gamma prior

$$\pi(h) = \frac{2}{\Gamma(\alpha)\,\beta^{\alpha}\, h^{2\alpha+1}} \exp\!\left(\frac{-1}{\beta h^2}\right), \qquad \alpha > 0,\ \beta > 0,\ h > 0,$$

gives

$$\hat\pi(h \mid X_1, \dots, X_n, x) = \frac{\sum_{i=1}^{n} (1/h^{2\alpha+2}) \exp\{-(1/h^2)((X_i - x)^2/2 + 1/\beta)\}}{\sum_{i=1}^{n} (\Gamma(\alpha + 1/2)/2)\,\{(X_i - x)^2/2 + 1/\beta\}^{-(\alpha + 1/2)}},$$

resulting in

$$h^*(x) = \frac{\Gamma(\alpha)}{\sqrt{2\beta}\,\Gamma(\alpha + 1/2)}\; \frac{\sum_{i=1}^{n} \{1/(\beta(X_i - x)^2 + 2)\}^{\alpha}}{\sum_{i=1}^{n} \{1/(\beta(X_i - x)^2 + 2)\}^{\alpha + 1/2}}. \qquad (7)$$
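The closed form (7) is straightforward to compute. Below is a possible transcription (ours); `math.gamma` evaluates $\Gamma(\cdot)$, and variable names are our own.

```python
import math
import numpy as np

def bayes_bandwidth(x, X, alpha, beta):
    """Posterior-mean bandwidth h*(x) of Eq. (7): normal kernel,
    inverted-gamma prior (a direct transcription, our own naming)."""
    w = 1.0 / (beta * (X - x) ** 2 + 2.0)
    C = math.gamma(alpha) / (math.sqrt(2.0 * beta) * math.gamma(alpha + 0.5))
    return C * float(np.sum(w ** alpha) / np.sum(w ** (alpha + 0.5)))
```

With a single observation at $X_1 = x$, (7) collapses to $\Gamma(\alpha)\sqrt{2}/(\sqrt{2\beta}\,\Gamma(\alpha + 1/2))$; for $\alpha = \beta = 1$ this is $2/\sqrt{\pi}$. The bandwidth is also larger where the design is sparse, which is the behavior Fig. 1 illustrates.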
The following theorem shows that if one picks the prior parameters suitably (fix $\alpha$ and let $\beta$ diverge at a proper rate as the sample size increases), the resulting sequence $h^*(x)$ converges to zero almost surely, for every $x$.

Theorem 2.1. Let $f(\cdot)$ be continuous and bounded away from 0 at $x$. If $\alpha > 0$ is fixed, $\beta \to \infty$ and $n\beta^{-1/2} \to \infty$ as $n \to \infty$, then $h^*(x) \to 0$ and $nh^*(x) \to \infty$ with probability 1 as $n \to \infty$.

The proof of this theorem is given in the Appendix.

Remark 2.1. Implementation of the Bayesian procedure requires a specification of the parameters $\alpha$ and $\beta$ of the prior distribution. From the proof of Theorem 2.1 in the Appendix, the asymptotic rate of the Bayesian bandwidth $h^*(x)$ lies between $\beta^{-1/2}$ and $\beta^{-\alpha/(4\alpha+2)}$. Thus, if we let $\alpha$ be sufficiently large and choose $\beta$ between $n^{2/5}$ and $n^{4/5}$, the Bayesian bandwidths converge to 0 at a rate close to that of the mean squared optimal bandwidths.

2.2. Local bandwidth selection

The above Bayesian bandwidth accounts only for the distribution of $X$. In kernel regression estimation, however, the bandwidths should also reflect the impact of the responses. This motivates the following modifications of the Bayesian bandwidths for obtaining the final bandwidths used to estimate the regression function. Since $m(\cdot)$ is continuous and our object is to estimate $m(\cdot)$ locally at $x$, we argue that only the observations in a neighborhood of $x$ are vitally important. In particular, for estimating $m$ at $x$, we consider three bandwidth selection techniques, namely leave-one-out cross-validation, generalized cross-validation and AICc (Hurvich et al., 1998), using only the observations falling in $I_x = [x - h^*(x),\, x + h^*(x)]$. Their local Bayesian versions are developed as follows. First let $l(x)$ denote the number of covariate values falling in $I_x$ and let $(X'_i, Y'_i)$, $i = 1, \dots, l(x)$, denote the corresponding observations. Then let
$$\hat m_{-i,h}(X'_i) = \frac{1}{(l(x)-1)h} \sum_{j \ne i} K\!\left(\frac{X'_i - X'_j}{h}\right) \frac{M_{2i}(X'_i) - \frac{X'_i - X'_j}{h} M_{1i}(X'_i)}{M_{2i}(X'_i) M_{0i}(X'_i) - M_{1i}^2(X'_i)}\, Y'_j,$$

where $M_{ki}(X'_i) = ((l(x)-1)h)^{-1} \sum_{j \ne i} K\!\left(\frac{X'_i - X'_j}{h}\right)\left(\frac{X'_i - X'_j}{h}\right)^k$, $k = 0, 1, 2$. Let $P_{l(x),h}$ denote the local hat matrix, with entries

$$\big(P_{l(x),h}\big)_{i,j} = \frac{1}{l(x)h} K\!\left(\frac{X'_i - X'_j}{h}\right) \frac{M_{2,l(x)}(X'_i) - \frac{X'_i - X'_j}{h} M_{1,l(x)}(X'_i)}{M_{2,l(x)}(X'_i) M_{0,l(x)}(X'_i) - M_{1,l(x)}^2(X'_i)},$$

and let

$$\hat\sigma^2_{l(x),h} = \frac{1}{l(x)} \sum_{i=1}^{l(x)} \big(Y'_i - \hat m_h(X'_i)\big)^2.$$

Now we define the local leave-one-out cross-validation bandwidth $h_{BLCV}(x)$, the local generalized cross-validation bandwidth $h_{BGCV}(x)$ and the local AICc bandwidth $h_{BAICc}(x)$ by
[Fig. 1. Averaged Bayesian bandwidths over the support of the regression functions for $n = 50$, N(0.5, 0.25) design, $\alpha = 100$, $\beta = n^{2/5}$, based on $N = 500$ replications.]
$$h_{BLCV}(x) = \arg\min_h \sum_{i=1}^{l(x)} \big(Y'_i - \hat m_{-i,h}(X'_i)\big)^2, \qquad (8)$$

$$h_{BGCV}(x) = \arg\min_h \left[\log\big(\hat\sigma^2_{l(x),h}\big) - 2\log\!\left(1 - \frac{\operatorname{tr}(P_{l(x),h})}{l(x)}\right)\right], \qquad (9)$$

$$h_{BAICc}(x) = \arg\min_h \left[\log\big(\hat\sigma^2_{l(x),h}\big) + 1 + \frac{2\,(\operatorname{tr}(P_{l(x),h}) + 1)}{l(x) - \operatorname{tr}(P_{l(x),h}) - 2}\right]. \qquad (10)$$
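A possible implementation sketch of the local leave-one-out criterion (8): the observations are restricted to $I_x$, and the minimization runs over a user-supplied candidate grid (the grids actually used in the paper are described in Section 3). The helper and variable names are ours, and a Gaussian kernel is assumed.

```python
import numpy as np

def _loclin(x0, X, Y, h):
    # Local linear estimate at x0 with a Gaussian kernel, as in Eq. (2).
    u = (x0 - X) / h
    K = np.exp(-0.5 * u ** 2)
    S0, S1, S2 = K.sum(), (K * u).sum(), (K * u ** 2).sum()
    w = K * (S2 - u * S1)
    return float(w @ Y) / (S0 * S2 - S1 ** 2)

def h_blcv(x, X, Y, h_star, grid):
    """Local leave-one-out CV bandwidth of Eq. (8): a sketch, not the
    authors' code. `h_star` is the Bayesian window h*(x) from Eq. (7)."""
    inI = np.abs(X - x) <= h_star          # keep observations in I_x only
    Xl, Yl = X[inI], Y[inI]
    best_h, best_cv = grid[0], np.inf
    for h in grid:
        cv = 0.0
        for i in range(len(Xl)):
            keep = np.arange(len(Xl)) != i            # leave (X'_i, Y'_i) out
            cv += (Yl[i] - _loclin(Xl[i], Xl[keep], Yl[keep], h)) ** 2
        if cv < best_cv:
            best_h, best_cv = h, cv
    return best_h
```

The BGCV and BAICc selectors replace the loop body with the criteria (9) and (10), which require the trace of the local hat matrix rather than leave-one-out fits.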
Remark 2.2. Note that the bandwidths $h_{BLCV}(x)$ and $h_{BGCV}(x)$ are selected from a set of candidate bandwidths when minimizing the cross-validation criteria above. In global cross-validation procedures this set is usually taken as $[n^{-1+\eta}, n^{-\eta}]$ for some small $\eta > 0$. Such cross-validations produce bandwidths of order $n^{-1/5}$ with probability tending to 1 as $n \to \infty$ (see Girard, 1998). Since the number of observations contained in the interval $I_x$ is of order $nh^*(x)$ under reasonable conditions on the density of $X$, the bandwidths $h_{BLCV}(x)$ and $h_{BGCV}(x)$ above will be of order $[nh^*(x)]^{-1/5}$. Thus, the proposed adaptive bandwidths $h_{BLCV}(x)$ and $h_{BGCV}(x)$ can achieve a near asymptotically optimal rate for suitably chosen $\beta$ in the prior. This extends to $h_{BAICc}(x)$, given that Härdle et al. (1988) showed that all the classical selectors considered here are asymptotically equivalent.

3. Numerical results

To illustrate the improvement of our Bayesian locally adaptive selection procedure over a few existing methods, we conducted a Monte Carlo simulation. Implementation of our technique is straightforward but requires selection of the parameters $\alpha$ and $\beta$ of the prior distribution. In practice, we recommend that these be chosen in accordance with the criterion proposed by Kulasekera and Padgett (2006). For the numerical results presented here we took $\alpha = 100$ and $\beta = n^{2/5}$. The performance of the local linear estimator under different bandwidth selection procedures is related to the sample size, the distribution of $X$ (the design density), the true regression function, and the true standard deviation of the errors.
Although for brevity only a few selected results are reported here, the following settings of these factors were examined, with 500 replications for each combination:

(a) sample size: n = 25 (small), 50 (moderate);
(b) distribution of X: Uniform(0, 1), Normal(0.5, 0.25);
(c) regression function: (i) m(x) = sin(5πx); (ii) m(x) = sin(15πx); (iii) m(x) = 1 − 48x + 218x² − 315x³ + 145x⁴; (iv) m(x) = 0.3 exp{−64(x − 0.25)²} + 0.7 exp{−256(x − 0.75)²}; (v) m(x) = 10 exp(−10x);
(d) error standard deviation: σ = 0.25R_y, 0.5R_y, where R_y is the range of m(x) over [0, 1].

Most of these regression functions were used in earlier studies (Ruppert et al., 1995; Hart and Yi, unpublished; Herrmann, 1997). We used R (Ihaka and Gentleman, 1996) for all computations. We compare the local linear estimators obtained using the Bayes-type local bandwidths $h_{BLCV}(x)$, $h_{BGCV}(x)$ and $h_{BAICc}(x)$ to local linear estimators based on LCV, GCV, AICc and the plug-in selector of Staniswalis (1989), by comparing the estimated MSE (ESMSE) of each estimator over the simulations. For any estimator $\hat m$, the ESMSE at a point $x$ is

$$\mathrm{ESMSE}(\hat m(x)) = N^{-1} \sum_{j=1}^{N} \big(\hat m_j(x) - m(x)\big)^2,$$

where $N$ is the number of simulations and $\hat m_j$ is the estimate from the $j$th replication. Specifically, we consider the pointwise log ratio between different selectors,

$$r(x) = \log \frac{\mathrm{ESMSE}(\hat m_1(x))}{\mathrm{ESMSE}(\hat m_2(x))},$$
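The ESMSE, the ratio r(x), and the EIMSE defined below are simple Monte Carlo averages. A sketch, under our own convention that the N fitted curves are stored as an N × p array over the evaluation grid:

```python
import numpy as np

def esmse(mhat_reps, m_true):
    """ESMSE(m_hat(x)) = N^{-1} sum_j (m_hat_j(x) - m(x))^2, pointwise.
    mhat_reps: (N, p) array of fitted curves; m_true: (p,) true values."""
    return np.mean((mhat_reps - m_true) ** 2, axis=0)

def r_ratio(esmse1, esmse2):
    # pointwise log ratio r(x) = log(ESMSE(m_hat_1) / ESMSE(m_hat_2))
    return np.log(esmse1 / esmse2)

def eimse(esmse_vals):
    # EIMSE = p^{-1} sum_i ESMSE(m_hat(x_i)) over p equally spaced points
    return float(np.mean(esmse_vals))
```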
Table 1
Simulated EIMSE based on N = 500 replications for n = 50.

Design   σ/R_y   h_opt   LCV    BLCV   GCV    BGCV   AICc   BAICc  Plug-in  T_opt

m(x) = sin(15πx)
Uniform  0.25    0.323   1.638  1.688  1.619  1.575  1.688  1.653  1.692    9.697
Uniform  0.5     0.336   1.883  1.850  1.836  1.769  1.767  1.769  2.147    8.562
Normal   0.25    0.407   1.801  1.712  2.104  1.851  1.937  1.738  1.610    33.079
Normal   0.5     0.411   1.953  1.896  2.563  2.226  2.015  1.936  2.246    24.379

m(x) = sin(5πx)
Uniform  0.25    0.050   1.778  1.711  1.740  1.672  1.792  1.711  4.415    1.483
Uniform  0.5     0.106   2.642  2.527  2.628  2.453  2.735  2.680  5.372    1.887
Normal   0.25    0.112   3.264  2.233  2.908  1.807  2.909  1.819  2.530    2.691
Normal   0.5     0.161   3.667  2.981  3.534  3.152  3.141  2.931  3.757    7.207

m(x) = 1 − 48x + 218x² − 315x³ + 145x⁴
Uniform  0.25    0.098   2.139  2.064  2.145  2.043  2.102  2.131  3.191    2.124
Uniform  0.5     0.208   2.856  2.794  2.773  2.762  2.679  2.717  3.778    2.728
Normal   0.25    0.155   3.918  3.688  3.402  3.001  3.259  2.948  18.833   7.159
Normal   0.5     0.270   4.443  4.338  4.991  4.670  3.652  3.715  21.892   12.281

m(x) = 0.3 exp{−64(x − 0.25)²} + 0.7 exp{−256(x − 0.75)²}
Uniform  0.25    0.005   2.248  2.122  2.140  2.005  2.362  2.166  7.197    3.100
Uniform  0.5     0.013   2.393  2.212  2.375  2.166  2.296  2.242  2.593    1.574
Normal   0.25    0.005   4.749  3.749  6.740  3.500  6.605  3.467  8.269    2.713
Normal   0.5     0.013   5.551  3.616  6.021  3.766  5.538  3.524  3.399    2.252

m(x) = 10 exp(−10x)
Uniform  0.25    0.212   3.635  3.196  3.464  3.042  3.206  2.882  4.984    3.214
Uniform  0.5     0.568   3.683  3.381  3.744  3.437  3.211  3.011  3.685    3.397
Normal   0.25    0.421   5.569  3.883  4.871  3.274  4.697  3.162  143.575  106.394
Normal   0.5     0.983   5.866  4.588  8.978  4.335  4.415  3.821  66.793   49.230
and the estimated integrated MSE,

$$\mathrm{EIMSE}(\hat m) = p^{-1} \sum_{i=1}^{p} \mathrm{ESMSE}(\hat m(x_i)),$$

where $x_1, \dots, x_p$ are equally spaced points in $(0, 1)$; we chose $p = 50$ for our simulations. Since the theoretical asymptotically optimal bandwidths $h_{topt}$ do not work well for the small and moderate sample sizes reported in our simulation, we localize the idea of Hengartner et al. (2002) and define the local optimal bandwidth $h_{opt}$ by

$$h_{opt}(x) = \arg\min_h \sum_{i=0}^{10} \big(m(x - 0.02 + 0.004i) - \hat m_h(x - 0.02 + 0.004i)\big)^2.$$
The bandwidth selection procedures considered here involve numerically solving minimization problems, which necessitates a search over a finite grid of candidate bandwidths. Because it is known that the optimal bandwidth is of order $n^{-0.2}$, we take the grid

$$H_g = \left\{\frac{n^{-0.2}}{\sqrt{2.5n}},\ \frac{2n^{-0.2}}{\sqrt{2.5n}},\ \dots,\ \frac{2.5n \cdot n^{-0.2}}{\sqrt{2.5n}}\right\}$$

for LCV, GCV and AICc. Since the proposed local procedures are based on the local number of observations $l(x)$, we take

$$H_{l(x)} = \left\{\frac{l(x)^{-0.2}}{\sqrt{2.5\,l(x)}},\ \frac{2\,l(x)^{-0.2}}{\sqrt{2.5\,l(x)}},\ \dots,\ \frac{2.5\,l(x) \cdot l(x)^{-0.2}}{\sqrt{2.5\,l(x)}}\right\}$$

for BLCV, BGCV and BAICc.
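A sketch of this grid construction, under our reading that the grid consists of the first ⌊2.5m⌋ multiples of $m^{-0.2}/\sqrt{2.5m}$ for a sample (or local sample) of size $m$; this reading is an assumption on our part.

```python
import numpy as np

def bandwidth_grid(m):
    """Candidate-bandwidth grid H for a sample of size m: multiples
    j * m^{-0.2} / sqrt(2.5 m), j = 1, ..., floor(2.5 m) (our reading)."""
    step = m ** (-0.2) / np.sqrt(2.5 * m)
    return step * np.arange(1, int(2.5 * m) + 1)
```

The same function serves both $H_g$ (with $m = n$) and $H_{l(x)}$ (with $m = l(x)$).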
The improvement of the proposed methods over LCV, GCV and AICc can be seen in the simulation results given in Table 1. This table gives the optimal EIMSE based on $h_{opt}$, and the ratio of each method's EIMSE to this optimal EIMSE. We can see that LCV, GCV, AICc and our proposed procedures BLCV, BGCV and BAICc all dominate the plug-in method. This is likely because plug-in methods mimic theoretical asymptotically optimal bandwidths, which can perform poorly for small and moderate sample sizes. For the Uniform design, the proposed modifications are competitive with LCV and GCV and give better results in most cases. For data simulated under the Normal design, the proposed methods show a marked improvement over LCV, GCV and AICc. As expected, the proposed procedures improve in performance when the covariate design has sparse regions.

The pointwise performance of the proposed methods can be seen in Fig. 2. Each panel of Fig. 2 contains a plot of the true regression function together with the average of 500 simulated regression curves based on a local method and on its adaptive Bayesian counterpart. We also plot the log ratio of pointwise ESMSE, $r(x)$. The proposed Bayesian modifications to LCV, GCV and AICc do not appear to have a significant impact on the bias of the estimators. However, since $r(x) > 0$ for nearly all points $x$, our proposed procedures appear to reduce the local variability of the estimators based on LCV, GCV and AICc, respectively. Moreover, $r(x)$ tends to be significantly larger where the data are sparse. This phenomenon is simply a manifestation of the fact that our technique takes the sparseness of the design into account.

In addition to the above, we examined the performance of our method against the adaptive bandwidth selection mentioned in Fan et al. (1996), who explicitly discussed smoothing parameter selection for density estimation only.
They suggested, however, that their method, although not performing well for small to moderate sample sizes, can be extended to regression estimation.

[Fig. 2. Estimates with different selectors for n = 50, N(0.5, 0.25) design, m(x) = 0.3 exp{−64(x − 0.25)²} + 0.7 exp{−256(x − 0.75)²}, σ = 0.25R_y. Panels: (a) estimate and r(x) for LCV, BLCV; (b) estimate and r(x) for GCV, BGCV; (c) estimate and r(x) for AICc, BAICc.]

Table 2
EIMSE for m(x) = 1 − 48x + 218x² − 315x³ + 145x⁴, σ = 0.5R_y, n = 50.

Design density   BLCV    BGCV    BAICc   Fan et al. type bandwidth
Uniform          0.580   0.574   0.564   2.053
Normal           1.169   1.258   1.001   1.682

Following their approach, we used the same spline approximation to the optimal bandwidth function $h(x)$, where the proposed cross-validation over a large set of cubic spline interpolants of pre-chosen knot-bandwidth pairs $\{(a_1, h_1), \dots, (a_p, h_p)\}$ (in their notation) was conducted by minimizing $\sum_{i=1}^{n} (\hat m_{\hat h_i}(X_i) - Y_i)^2$, where $\hat m_{\hat h_i}(X_i)$ is the estimated regression function for the bandwidth calculated at $X_i$ for each spline function. The knots $(a_1, \dots, a_p)$ for our simulation were chosen to be equispaced on $[0, 1]$ with $p = 6$. We chose four values $\{0.5\hat h_G, \hat h_G, 1.5\hat h_G, 2\hat h_G\}$ for each coordinate $h_i$, $i = 1, \dots, p$, where $\hat h_G$ was the global cross-validation bandwidth. The MSE performance of local linear estimators with Fan et al. type bandwidths was inferior to that of our proposed method. We provide the EIMSE for these local linear estimators in Table 2.

4. Conclusion

The Bayesian local bandwidth selection procedure proposed here has several advantages, as seen in our simulations. The small sample dominance of these estimators, coupled with reasonable asymptotic properties, makes such bandwidths highly desirable. Moreover, the use of Bayesian local bandwidth selection is not restricted to the local linear estimator. For example, we examined the Nadaraya–Watson estimator proposed by Nadaraya (1964) and Watson (1964) with our local bandwidths and obtained performance similar to the above.

Acknowledgements

The authors would like to thank the Co-Editor-in-Chief and an anonymous associate editor whose suggestions led to a marked improvement of the article.

Appendix

Here, we show that h*(x) → 0 and nh*(x) → ∞ as n → ∞.
Since $\beta$ diverges as the sample size increases, we write $\beta$ as $\beta_n$. Choose a sequence $\varepsilon_n$ such that (1) $\varepsilon_n \to \infty$ and (2) $\varepsilon_n/\sqrt{\beta_n} \to 0$ as $n \to \infty$. Since $\Gamma(\alpha)/\Gamma(\alpha + \frac{1}{2})$ is constant for fixed $\alpha$, (7) can be rewritten as

$$h^*(x) = C_\alpha \frac{\sum_{i=1}^{n} \{1/(\beta_n(X_i - x)^2 + 2)\}^{\alpha}}{\sqrt{\beta_n}\, \sum_{i=1}^{n} \{1/(\beta_n(X_i - x)^2 + 2)\}^{\alpha + 1/2}}, \qquad (11)$$

where $C_\alpha = \Gamma(\alpha)/(\sqrt{2}\,\Gamma(\alpha + \frac{1}{2}))$.

Consider the numerator of (11). Splitting the sum according to whether $|X_i - x| \le \varepsilon_n/\sqrt{\beta_n}$, and using $\beta_n(X_i - x)^2 + 2 \ge 2$ on the first set and $\beta_n(X_i - x)^2 + 2 > \varepsilon_n^2 + 2$ on the second,

$$\sum_{i=1}^{n} \left\{\frac{1}{\beta_n(X_i - x)^2 + 2}\right\}^{\alpha} \le \frac{2n\varepsilon_n}{\sqrt{\beta_n}} \left(\frac{1}{2}\right)^{\alpha} f_n(x) + n \left(\frac{1}{\varepsilon_n^2 + 2}\right)^{\alpha},$$

where $f_n(x) = \frac{\sqrt{\beta_n}}{2n\varepsilon_n} \sum_{i=1}^{n} I\{|X_i - x| \le \varepsilon_n/\sqrt{\beta_n}\}$. For the denominator of (11), retaining only the observations with $|X_i - x| \le \varepsilon_n/\sqrt{\beta_n}$, on which $\beta_n(X_i - x)^2 + 2 \le \varepsilon_n^2 + 2$,

$$\sqrt{\beta_n} \sum_{i=1}^{n} \left\{\frac{1}{\beta_n(X_i - x)^2 + 2}\right\}^{\alpha + 1/2} \ge 2n\varepsilon_n \left(\frac{1}{\varepsilon_n^2 + 2}\right)^{\alpha + 1/2} f_n(x).$$

Since $\sum_{n} \exp(-\gamma n(\varepsilon_n/\sqrt{\beta_n})^2)$ converges for every $\gamma > 0$ with our choices of $\varepsilon_n$ and $\beta_n$,

$$\sup_x |f_n(x) - f(x)| \to 0 \quad \text{almost surely}$$

as $n \to \infty$, by Theorem 2.1.3 of Rao (1983). Combining the two inequalities above and noting that $f$ is bounded away from 0 at $x$, a direct calculation shows that, with probability 1 for all sufficiently large $n$,

$$h^*(x) \le C \left(\frac{\varepsilon_n^{\alpha + 1}}{\sqrt{\beta_n}} + \frac{1}{\varepsilon_n^{\alpha}}\right)$$

almost surely, for some constant $C$. The rate of $\varepsilon_n$ minimizing the right-hand side is $\varepsilon_n = \beta_n^{1/(2(2\alpha + 1))}$. Then

$$h^*(x) \le C\, \beta_n^{-\alpha/(2(2\alpha + 1))}.$$

Also, since $1/(\beta_n(X_i - x)^2 + 2) \le 1/2 < 1$ implies

$$\sum_{i=1}^{n} \left\{\frac{1}{\beta_n(X_i - x)^2 + 2}\right\}^{\alpha} \ge \sum_{i=1}^{n} \left\{\frac{1}{\beta_n(X_i - x)^2 + 2}\right\}^{\alpha + 1/2},$$

we have $h^*(x) \ge C_\alpha\, \beta_n^{-1/2}$. Therefore

$$C_\alpha\, \beta_n^{-1/2} \le h^*(x) \le C\, \beta_n^{-\alpha/(2(2\alpha + 1))}. \qquad (12)$$
By the assumptions $\beta_n \to \infty$ and $n\beta_n^{-1/2} \to \infty$, (12) gives $h^*(x) \to 0$ and $nh^*(x) \to \infty$ as $n \to \infty$. Hence $h^*(x)$ is a proper bandwidth for the regression estimator at every $x$.

References

Akaike, H., 1973. Information theory and an extension of the maximum likelihood principle. In: Petrov, B.N., Csàki, F. (Eds.), Proc. 2nd Int. Symp. Information Theory. Akadémiai Kiadó, Budapest, pp. 267–281.
Craven, P., Wahba, G., 1979. Smoothing noisy data with spline functions. Numerische Mathematik 31, 377–403.
Eubank, R.L., 1999. Nonparametric Regression and Spline Smoothing. Marcel Dekker, New York.
Fan, J., Hall, P., Martin, M., Patil, P., 1996. On local smoothing of nonparametric curve estimators. Journal of the American Statistical Association 91, 258–266.
Gangopadhyay, A., Cheung, K., 2002. Bayesian approach to the choice of smoothing parameter in kernel density estimation. Journal of Nonparametric Statistics 14 (6), 655–664.
Girard, D.A., 1998. Asymptotic comparison of (partial) cross-validation, GCV and randomized GCV in nonparametric regression. Annals of Statistics 26 (1), 315–334.
Härdle, W., Bowman, A.G., 1988. Bootstrapping in nonparametric regression: Local adaptive smoothing and confidence bands. Journal of the American Statistical Association 83, 102–110.
Härdle, W., Hall, P., Marron, J.G., 1988. How far are automatically chosen regression smoothing parameters from their optimum? Journal of the American Statistical Association 83, 86–101.
Hart, J.D., Yi, S., 1996. One-sided cross-validation (unpublished).
Hengartner, N.W., Wegkamp, J.S., Matzner-Lober, E., 2002. Bandwidth selection for local linear regression smoothers. Journal of the Royal Statistical Society, Series B 64 (4), 791–804.
Herrmann, E., 1997. Local bandwidth choice in kernel regression estimation. Journal of Computational and Graphical Statistics 6, 35–54.
Hurvich, C.M., Simonoff, J.S., Tsai, C., 1998. Smoothing parameter selection in nonparametric regression using an improved Akaike information criterion. Journal of the Royal Statistical Society, Series B 60 (2), 271–293.
Ihaka, R., Gentleman, R., 1996. R: A language for data analysis and graphics. Journal of Computational and Graphical Statistics 5, 299–314. http://www.r-project.org.
Kulasekera, K.B., Padgett, W.J., 2006. Bayes bandwidth selection in kernel density estimation with censored data. Journal of Nonparametric Statistics 18, 129–143.
Nadaraya, E.A., 1964. On estimating regression. Theory of Probability and its Applications 9 (1), 141–142.
Rao, P.B., 1983. Nonparametric Function Estimation. Academic Press, London.
Ruppert, D., Sheather, S.J., Wand, M.P., 1995. An effective bandwidth selector for local least squares regression. Journal of the American Statistical Association 90, 1215–1230.
Staniswalis, J.G., 1989. Local bandwidth selection for kernel estimators. Journal of the American Statistical Association 84, 284–288.
Watson, G., 1964. Smooth regression analysis. Sankhya, Series A 26, 359–372.