Nonlinear Least Squares Theory - Lecture Notes
CHUNG-MING KUAN
Department of Finance & CRETA, National Taiwan University
March 9, 2010
Lecture Outline
1 Nonlinear Specifications
Nonlinear Specifications
y = f (x; β) + e(β),
The CES (constant elasticity of substitution) production function:
y = α [δ L^(−γ) + (1 − δ) K^(−γ)]^(−λ/γ),
and a linear-in-parameter approximation to its logarithm:
ln y = β1 + β2 ln L + β3 ln K + β4 (ln L)(ln K) + β5 (ln L)² + β6 (ln K)²,
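To make the specification concrete, here is a minimal sketch (in Python) of estimating the CES parameters by nonlinear least squares; the simulated data, parameter values, and starting values are illustrative assumptions and do not appear in the original notes.

```python
# Minimal sketch: NLS estimation of the CES function on simulated data.
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(0)
L = rng.uniform(1.0, 10.0, size=200)   # labor input
K = rng.uniform(1.0, 10.0, size=200)   # capital input

def ces(params, L, K):
    alpha, delta, gamma, lam = params
    return alpha * (delta * L**(-gamma) + (1.0 - delta) * K**(-gamma))**(-lam / gamma)

y = ces([2.0, 0.4, 0.8, 1.0], L, K) + rng.normal(scale=0.1, size=L.size)

# least_squares minimizes the sum of squared residuals y_t - f(x_t; beta).
fit = least_squares(lambda b: y - ces(b, L, K),
                    x0=[1.0, 0.5, 0.5, 0.5],
                    bounds=([1e-6, 1e-6, 1e-6, 1e-6], [np.inf, 1.0, np.inf, np.inf]))
print(fit.x)   # estimates of (alpha, delta, gamma, lambda)
```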
Nonlinear Time Series Models
The self-exciting threshold autoregressive (SETAR) model can be written with an indicator function as
y_t = a_0 + Σ_{j=1}^{p} a_j y_{t−j} + (δ_0 + Σ_{j=1}^{p} δ_j y_{t−j}) 1{y_{t−d} > c} + e_t,
with a_j + δ_j = b_j, so that the b_j are the autoregressive coefficients in the regime y_{t−d} > c.
Replacing the indicator function in the SETAR model with a "smooth" function h, we obtain the smooth threshold autoregressive (STAR) model:
y_t = a_0 + Σ_{j=1}^{p} a_j y_{t−j} + (δ_0 + Σ_{j=1}^{p} δ_j y_{t−j}) h(y_{t−d}; c, s) + e_t,
with c the threshold value and s a scale parameter. The STAR model admits smooth transition between different regimes, and it behaves like a SETAR model when (y_{t−d} − c)/s is large.
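The notes leave the smooth function h generic; a logistic transition is a common choice in this literature. The sketch below (an assumption for illustration, not necessarily the h used in the notes) shows that such an h behaves like the indicator function when |y_{t−d} − c|/s is large:

```python
import numpy as np

def logistic_transition(y_lag, c, s):
    """Logistic transition h((y - c)/s); approaches the indicator 1{y > c} as |y - c|/s grows."""
    return 1.0 / (1.0 + np.exp(-(y_lag - c) / s))

y_vals = np.array([-5.0, -0.5, 0.0, 0.5, 5.0])
print(logistic_transition(y_vals, c=0.0, s=0.1))  # close to 0 or 1 away from the threshold
print(logistic_transition(y_vals, c=0.0, s=5.0))  # smooth transition, values near 0.5
```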
Artificial Neural Networks
The network contains p input units, q hidden units, and one output unit. The functions h and g are known as activation functions, and the parameters in these functions are connection weights.
A common choice of h is the hyperbolic tangent:
h(x) = (e^x − e^(−x)) / (e^x + e^(−x)).
The function g may be the identity function or the same as h.
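The network equation itself is not reproduced above; the sketch below assumes the standard single-hidden-layer feedforward form with tanh hidden units and an identity output unit, matching the description of p inputs, q hidden units, and one output.

```python
import numpy as np

def feedforward(x, gamma, alpha, alpha0):
    """Single-hidden-layer network: output = alpha0 + sum_i alpha_i * h(gamma_i' [x, 1]),
    with hidden activation h = tanh and identity output activation g.
    x: (p,) inputs; gamma: (q, p+1) hidden-unit weights (last column is a bias);
    alpha: (q,) hidden-to-output weights; alpha0: output bias."""
    x1 = np.append(x, 1.0)           # append the bias term
    hidden = np.tanh(gamma @ x1)     # q hidden-unit activations
    return alpha0 + alpha @ hidden   # single output unit

rng = np.random.default_rng(1)
p, q = 3, 4
print(feedforward(rng.normal(size=p), rng.normal(size=(q, p + 1)),
                  rng.normal(size=q), 0.0))
```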
Artificial neural networks are designed to mimic the behavior of biological
neural systems and have the following properties.
The NLS Estimator
The NLS estimator β̂_T minimizes the NLS objective function
Q_T(β) = (1/T) Σ_{t=1}^{T} [y_t − f(x_t; β)]².
[ID-2] f(x; ·) is twice continuously differentiable in the second argument on Θ1, such that for given data (y_t, x_t), t = 1, . . . , T, ∇²_β Q_T(β̂_T) is positive definite.
Nonlinear Optimization Algorithms
An iterative algorithm updates the parameter vector as
β^(i+1) = β^(i) + s^(i) d^(i).
That is, the (i+1)-th iterated value β^(i+1) is obtained from β^(i) with an adjustment term s^(i) d^(i), where d^(i) characterizes the direction of change in the parameter space and s^(i) controls the amount of change. Note that an iterative algorithm can only locate a local optimum.
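As a schematic illustration, the generic update can be coded as below; the direction and step-size rules are placeholders to be supplied by the specific algorithms discussed next.

```python
import numpy as np

def iterate(beta0, direction, step_size, n_iter=100):
    """Generic iterative scheme beta^(i+1) = beta^(i) + s^(i) d^(i)."""
    beta = np.asarray(beta0, dtype=float)
    for _ in range(n_iter):
        d = direction(beta)       # d^(i): direction of change in the parameter space
        s = step_size(beta, d)    # s^(i): amount of change along that direction
        beta = beta + s * d
    return beta
```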
Gradient Method
Steepest Descent Algorithm
Thus, 0 = g^(i+1)′ g^(i) ≈ g^(i)′ g^(i) − s^(i) g^(i)′ H^(i) g^(i), or equivalently,
s^(i) = [g^(i)′ g^(i)] / [g^(i)′ H^(i) g^(i)] ≥ 0,
when H^(i) is p.d. We obtain the steepest descent algorithm:
β^(i+1) = β^(i) − { [g^(i)′ g^(i)] / [g^(i)′ H^(i) g^(i)] } g^(i).
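A minimal numpy sketch of this update, checked on a simple quadratic objective; the example function is an assumption for illustration.

```python
import numpy as np

def steepest_descent_step(beta, grad_fn, hess_fn):
    """One steepest-descent update: beta - [g'g / (g'Hg)] g."""
    g = grad_fn(beta)
    H = hess_fn(beta)
    step = (g @ g) / (g @ H @ g)   # step size from the local quadratic approximation
    return beta - step * g

# Example: Q(beta) = 0.5 * beta' A beta with A p.d.; gradient A beta, Hessian A.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
beta = np.array([4.0, -2.0])
for _ in range(25):
    beta = steepest_descent_step(beta, lambda b: A @ b, lambda b: A)
print(beta)   # converges toward the minimizer at the origin
```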
Newton Method
The Newton method takes into account the second-order derivatives. Consider the second-order Taylor expansion of Q_T(β) around some β†:
Q_T(β) ≈ Q_T(β†) + g†′ (β − β†) + (1/2) (β − β†)′ H† (β − β†).
Minimizing the right-hand side with respect to β (setting its gradient to zero) gives
β ≈ β† − (H†)^(−1) g†.
From Taylor's expansion it is easy to see that
Q_T(β^(i+1)) − Q_T(β^(i)) ≈ −(1/2) g^(i)′ (H^(i))^(−1) g^(i) ≤ 0,
provided H^(i) is positive definite.
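A corresponding sketch of the Newton update, again on an assumed quadratic objective, where a single step reaches the minimum.

```python
import numpy as np

def newton_step(beta, grad_fn, hess_fn):
    """One Newton update: beta - H(beta)^(-1) g(beta)."""
    g = grad_fn(beta)
    H = hess_fn(beta)
    return beta - np.linalg.solve(H, g)

# Same quadratic example as before: Newton reaches the minimizer in one step.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
beta = np.array([4.0, -2.0])
print(newton_step(beta, lambda b: A @ b, lambda b: A))   # [0., 0.]
```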
Gauss-Newton Algorithm
The Hessian matrix of the NLS objective function is
H(β) = −(2/T) ∇²_β f(β) [y − f(β)] + (2/T) Ξ(β)′ Ξ(β),
Ignoring the first term, an approximation to H(β) is 2 Ξ(β)′ Ξ(β)/T, which requires only the first-order derivatives and is guaranteed to be p.s.d. The Gauss-Newton algorithm utilizes this approximation as
β^(i+1) = β^(i) + [Ξ(β^(i))′ Ξ(β^(i))]^(−1) Ξ(β^(i))′ [y − f(β^(i))].
Note that the adjustment term can be obtained as the OLS coefficient estimate from regressing y − f(β^(i)) on Ξ(β^(i)); this is known as the Gauss-Newton regression.
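A minimal sketch of the Gauss-Newton iteration, with each adjustment computed as the OLS coefficients of the Gauss-Newton regression; the exponential model and simulated data below are assumptions for illustration.

```python
import numpy as np

def gauss_newton(y, f, jac, beta0, n_iter=20):
    """Gauss-Newton iterations: beta += (Xi'Xi)^(-1) Xi'(y - f(beta)),
    i.e., each adjustment is the OLS coefficient from the Gauss-Newton regression."""
    beta = np.asarray(beta0, dtype=float)
    for _ in range(n_iter):
        r = y - f(beta)               # residuals y - f(beta^(i))
        Xi = jac(beta)                # T x k matrix of first derivatives of f
        beta = beta + np.linalg.lstsq(Xi, r, rcond=None)[0]
    return beta

# Illustration: y = b0 * exp(b1 * x) + error (model and data chosen for the example).
rng = np.random.default_rng(2)
x = np.linspace(0.0, 2.0, 100)
y = 1.5 * np.exp(0.8 * x) + rng.normal(scale=0.05, size=x.size)
f = lambda b: b[0] * np.exp(b[1] * x)
jac = lambda b: np.column_stack([np.exp(b[1] * x), b[0] * x * np.exp(b[1] * x)])
print(gauss_newton(y, f, jac, beta0=[1.0, 0.5]))   # close to (1.5, 0.8)
```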
Other Modifications
Initial Values and Convergence Criteria
For the Gauss-Newton algorithm, one may stop the algorithm when TR² is "close" to zero, where R² is the coefficient of determination of the Gauss-Newton regression.
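A sketch of this stopping criterion; it uses the uncentered R² of the Gauss-Newton regression, which is an assumed convention (a centered version can be used instead).

```python
import numpy as np

def tr_squared(resid, Xi):
    """T * R^2 from regressing the current residuals y - f(beta) on Xi(beta)."""
    fitted = Xi @ np.linalg.lstsq(Xi, resid, rcond=None)[0]
    r2 = (fitted @ fitted) / (resid @ resid)   # uncentered R^2 of the GN regression
    return resid.size * r2

# Usage: stop the Gauss-Newton iterations once tr_squared(r, Xi) falls below a small tolerance.
```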
Digression: Uniform Law of Large Numbers
Consider the function q(zt (ω); θ). It is a r.v. for a given θ and a function
of θ for a given ω. Suppose {q(zt ; θ)} obeys a SLLN for each θ ∈ Θ:
Q_T(ω; θ) = (1/T) Σ_{t=1}^{T} q(z_t(ω); θ) −→ Q(θ) a.s.
When θ also depends on T (e.g., when θ is replaced by an estimator θ̃_T), there may not exist a finite T* such that Q_T(ω; θ̃_T) is arbitrarily close to Q(θ̃_T) for all T > T*. Thus, we need a notion of convergence that is uniform on the parameter space Θ:
sup_{θ∈Θ} |Q_T(ω; θ) − Q(θ)| → 0 almost surely (or in probability).
We then say that {q(z_t(ω); θ)} obeys a strong (or weak) uniform law of large numbers (SULLN or WULLN).
Example: Let z_t be i.i.d. with zero mean and
q_T(z_t(ω); θ) = z_t(ω) +
    Tθ,        if 0 ≤ θ ≤ 1/(2T),
    1 − Tθ,    if 1/(2T) < θ ≤ 1/T,
    0,         if 1/T < θ < ∞.
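A small numerical check of this example: for each fixed θ the pointwise limit of Q_T(θ) is 0, yet the supremum over θ stays near 1/2, so the convergence is not uniform. (The grid and random seed below are assumptions for illustration.)

```python
import numpy as np

rng = np.random.default_rng(3)

def tent(theta, T):
    """Deterministic part of q_T: a triangle of height 1/2 supported on [0, 1/T]."""
    theta = np.asarray(theta, dtype=float)
    rise = np.where((theta >= 0) & (theta <= 0.5 / T), T * theta, 0.0)
    fall = np.where((theta > 0.5 / T) & (theta <= 1.0 / T), 1.0 - T * theta, 0.0)
    return rise + fall

for T in (10, 100, 1000):
    z = rng.normal(size=T)                    # i.i.d. with zero mean
    grid = np.linspace(0.0, 2.0 / T, 2001)    # fine grid around the shrinking bump
    QT = z.mean() + tent(grid, T)             # Q_T(theta) = z_bar + tent(theta)
    print(T, np.abs(QT).max())                # stays near 1/2: sup|Q_T - Q| does not vanish
```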
Given ε > 0, we can choose θ† such that ‖θ − θ†‖ < ε/(2∆). Then,
sup_{θ∈Θ} |Q_T(θ) − Q_T(θ†)| ≤ C_T ε/(2∆) ≤ ε/2,
Consistency
For ω ∈ Ω0, we have for large T, IE[Q_T(β̂_T)] − Q_T(β̂_T) < ε/2, and
Q_T(β̂_T) − IE[Q_T(β_o)] ≤ Q_T(β_o) − IE[Q_T(β_o)] < ε/2,
Q: How do we ensure a SULLN or WULLN?
If Θ1 is compact and convex, we have from the mean-value theorem and the Cauchy-Schwarz inequality that
C_T = sup_{β∈Θ1} ‖∇_β Q_T(β)‖,
with ∇_β Q_T(β) = −(2/T) Σ_{t=1}^{T} ∇_β f(x_t; β)[y_t − f(x_t; β)]. Note that ∇_β Q_T(β) may be bounded in probability, but it may not be bounded in an almost sure sense. (Why?)
We impose the following conditions.
[C1] {(y_t, w_t′)′} is a sequence of random vectors, and x_t is a vector containing some elements of Y^(t−1) and W^t.
(i) The sequences {y_t²}, {y_t f(x_t; β)} and {f(x_t; β)²} all obey a WLLN for each β in Θ1, where Θ1 is compact and convex.
(ii) y_t, f(x_t; β) and ∇_β f(x_t; β) all have bounded second moments uniformly in β.
[C2] There exists a unique parameter vector β_o such that IE(y_t | Y^(t−1), W^t) = f(x_t; β_o).
Theorem 8.1
Given the nonlinear specification y = f(x; β) + e(β), suppose that [C1] and [C2] hold. Then, β̂_T −→ β_o in probability.
Remark: Theorem 8.1 is not satisfactory because it deals only with convergence to the global minimum, yet an iterative algorithm is not guaranteed to find a global minimum of the NLS objective function. Hence, it is more reasonable to expect the NLS estimator to converge to some local minimum of IE[Q_T(β)]. Therefore, we shall, in what follows, assert only that the NLS estimator converges in probability to a local minimum β* of IE[Q_T(β)]. In this case, f(x; β*) is, at most, an approximation to the conditional mean function.
Asymptotic Normality
Under suitable conditions,
√T ∇_β Q_T(β*) = −(2/√T) Σ_{t=1}^{T} ∇_β f(x_t; β*)[y_t − f(x_t; β*)]
obeys a CLT, i.e., (V*_T)^(−1/2) √T ∇_β Q_T(β*) −→ N(0, I_k) in distribution, where
V*_T = var( (2/√T) Σ_{t=1}^{T} ∇_β f(x_t; β*)[y_t − f(x_t; β*)] ).
By asymptotic equivalence,
(D*_T)^(−1/2) √T (β̂_T − β*) −→ N(0, I_k) in distribution.
When D*_T is replaced by a consistent estimator D̂_T,
D̂_T^(−1/2) √T (β̂_T − β*) −→ N(0, I_k) in distribution.
Note that
H_T(β*) = (2/T) Σ_{t=1}^{T} IE[ ∇_β f(x_t; β*) ∇_β f(x_t; β*)′ ] − (2/T) Σ_{t=1}^{T} IE[ ∇²_β f(x_t; β*) (y_t − f(x_t; β*)) ],
When ε_t = y_t − f(x_t; β*) are uncorrelated with ∇²_β f(x_t; β*), H_T(β*) depends only on the expectation of the outer product of ∇_β f(x_t; β*), so that Ĥ_T may be simplified as
Ĥ_T = (2/T) Σ_{t=1}^{T} ∇_β f(x_t; β̂_T) ∇_β f(x_t; β̂_T)′.
This is analogous to estimating M_xx by Σ_{t=1}^{T} x_t x_t′ / T in linear regressions.
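A sketch of computing this simplified Ĥ_T, with the gradients of f obtained by central finite differences; the function signature f(X, beta) and the example model are assumed conventions for the illustration.

```python
import numpy as np

def H_hat(f, X, beta_hat, eps=1e-6):
    """(2/T) * sum_t grad_beta f(x_t; beta_hat) grad_beta f(x_t; beta_hat)',
    with the gradients approximated by central finite differences."""
    T, k = X.shape[0], beta_hat.size
    grads = np.zeros((T, k))
    for j in range(k):
        bp, bm = beta_hat.copy(), beta_hat.copy()
        bp[j] += eps
        bm[j] -= eps
        grads[:, j] = (f(X, bp) - f(X, bm)) / (2.0 * eps)
    return 2.0 * grads.T @ grads / T

# Example: f(x_t; beta) = exp(x_t' beta) with assumed data.
X = np.random.default_rng(4).normal(size=(50, 3))
print(H_hat(lambda X, b: np.exp(X @ b), X, np.array([0.1, -0.2, 0.3])))
```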
Wald Tests
where Γ̂_T = R D̂_T R′, and D̂_T is a consistent estimator for D*_T.
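The Wald statistic itself is not shown in the surviving fragment. Assuming the standard Wald form for a linear hypothesis Rβ* = r, which is consistent with Γ̂_T = R D̂_T R′, a sketch would be:

```python
import numpy as np
from scipy.stats import chi2

def wald_test(beta_hat, D_hat, R, r, T):
    """Wald statistic T (R beta_hat - r)' (R D_hat R')^{-1} (R beta_hat - r);
    asymptotically chi-squared with rows(R) degrees of freedom under the null."""
    diff = R @ beta_hat - r
    Gamma_hat = R @ D_hat @ R.T
    W = T * diff @ np.linalg.solve(Gamma_hat, diff)
    return W, chi2.sf(W, df=R.shape[0])   # statistic and asymptotic p-value

# Example: test that the second coefficient is zero (all numbers are assumptions).
print(wald_test(np.array([0.5, 0.1, -0.2]), np.eye(3),
                np.array([[0.0, 1.0, 0.0]]), np.array([0.0]), T=200))
```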