Lecture Notes 17
Numerical Analysis
1. Numerical analysis
Numerical analysis is the area of mathematics and computer science that creates, analyzes, and imple-
ments algorithms for obtaining numerical solutions to problems involving continuous variables. Such
problems arise throughout the natural sciences, social sciences, engineering, medicine, and business.
Since the mid 20th century, the growth in power and availability of digital computers has led to an
increasing use of realistic mathematical models in science and engineering, and numerical analysis
of increasing sophistication is needed to solve these more detailed models of the world. The formal
academic area of numerical analysis ranges from quite theoretical mathematical studies to computer
science issues. A major advantage of numerical techniques is that a numerical answer can be obtained
even when a problem has no analytical solution. The result from numerical analysis is, in general, an
approximation, which can nevertheless be made as accurate as desired; for example, we can find the
approximate values of √2, π, etc.
With the increasing availability of computers, the new discipline of scientific computing, or com-
putational science, emerged during the 1980s and 1990s. The discipline combines numerical analysis,
symbolic mathematical computations, computer graphics, and other areas of computer science to make
it easier to set up, solve, and interpret complicated mathematical models of the real world.
1.1. Common perspectives in numerical analysis. Numerical analysis is concerned with all as-
pects of the numerical solution of a problem, from the theoretical development and understanding of
numerical methods to their practical implementation as reliable and efficient computer programs. Most
numerical analysts specialize in small subfields, but they share some common concerns, perspectives,
and mathematical methods of analysis. These include the following:
• When presented with a problem that cannot be solved directly, they try to replace it with a
“nearby problem” that can be solved more easily. Examples are the use of interpolation in
developing numerical integration methods and root-finding methods.
• There is widespread use of the language and results of linear algebra, real analysis, and func-
tional analysis (with its simplifying notation of norms, vector spaces, and operators).
• There is a fundamental concern with error, its size, and its analytic form. When approximating
a problem, it is prudent to understand the nature of the error in the computed solution.
Moreover, understanding the form of the error allows creation of extrapolation processes to
improve the convergence behaviour of the numerical method.
• Numerical analysts are concerned with stability, a concept referring to the sensitivity of the
solution of a problem to small changes in the data or the parameters of the problem. Numerical
methods for solving problems should be no more sensitive to changes in the data than the
original problem to be solved. Moreover, the formulation of the original problem should be
stable or well-conditioned.
In this chapter, we introduce and discuss some basic concepts of scientific computing. We begin
with a discussion of floating-point representation and then discuss the most fundamental source of
imperfection in numerical computing, namely roundoff errors. We also discuss sources of errors and
the stability of numerical algorithms.
This is an infinite series, but computers use a finite amount of memory to represent numbers. Thus
only a finite number of digits may be used to represent any number, no matter what representation
method is used.
For example, we can chop the infinite decimal representation of 8/3 after 4 digits:

8/3 = (2/10^1 + 6/10^2 + 6/10^3 + 6/10^4) × 10^1 = 0.2666 × 10^1.
Generalizing this, we say that the number has n decimal digits, and we call this n the precision.
For each real number x, we associate a floating point representation denoted by fl(x), given by

fl(x) = ±(0.a1 a2 . . . an)β × β^e,

where the β-based fraction (0.a1 a2 . . . an)β is called the mantissa, with all ai integers, and e is known
as the exponent. This representation is called the β-based floating point representation of x, and we
take base β = 10 in this course.
For example,
42.965 = 4 × 10^1 + 2 × 10^0 + 9 × 10^(−1) + 6 × 10^(−2) + 5 × 10^(−3) = 0.42965 × 10^2,
−0.00234 = −0.234 × 10^(−2).
The number 0 is written as 0.00 . . . 0 × 10^e. Likewise, we can use the binary number system, and any
real x can be written as

x = ±q × 2^m,

with 1/2 ≤ q < 1 and some integer m. Both q and m are expressed in terms of binary numbers. For
example,

(1001.1101)2 = 1 × 2^3 + 1 × 2^0 + 1 × 2^(−1) + 1 × 2^(−2) + 1 × 2^(−4) = (9.8125)10.
Remark 2.1. The above representation is not unique.
For example, 0.2666 × 10^1 = 0.02666 × 10^2, etc.
Definition 2.1 (Normal form). A non-zero floating-point number is in normal form if the value of the
mantissa lies in (−1, −0.1] or [0.1, 1).
Therefore, we normalize the representation by requiring a1 ≠ 0. Not only is the precision limited to a
finite number of digits, but the range of the exponent is also restricted: there are integers m and M
such that −m ≤ e ≤ M.
Definition 2.2 (Overflow and underflow). An overflow is obtained when a number is too large to fit
into the floating point system in use, i.e., e > M. An underflow is obtained when a number is too
small, i.e., e < −m. When overflow occurs in the course of a calculation, it is generally fatal. Underflow
is non-fatal: the system usually sets the number to 0 and continues. (Matlab does this, quietly.)
2.1. Rounding and chopping. Let x be any real number and fl(x) be its machine approximation.
There are two ways to do the “cutting” to store a real number

x = ±(0.a1 a2 . . . an an+1 . . . ) × 10^e, a1 ≠ 0.

(1) Chopping: We ignore the digits after an and write the number as

fl(x) = ±(0.a1 a2 . . . an) × 10^e.
(2) Rounding: Rounding is defined as

fl(x) = ±(0.a1 a2 . . . an) × 10^e,                       if 0 ≤ an+1 < 5 (rounding down),
fl(x) = ±[(0.a1 a2 . . . an) + (0.00 . . . 01)] × 10^e,   if 5 ≤ an+1 < 10 (rounding up).
Example 1.

fl(6/7) = 0.86 × 10^0 (rounding),    fl(6/7) = 0.85 × 10^0 (chopping).
Therefore, for chopping,

|x − fl(x)| = ( Σ_{i=n+1..∞} ai/10^i ) × 10^e
           ≤ ( Σ_{i=n+1..∞} (10 − 1)/10^i ) × 10^e
           = (10 − 1) (1/10^(n+1) + 1/10^(n+2) + . . . ) × 10^e
           = (10 − 1) · (1/10^(n+1)) / (1 − 1/10) × 10^e
           = 10^(e−n).

Thus the absolute error satisfies Ea = |x − fl(x)| ≤ 10^(e−n).
Now

|x| = (0.a1 a2 . . . an . . . )10 × 10^e ≥ 0.1 × 10^e = (1/10) × 10^e.
Therefore, for rounding,

|x − fl(x)| = ( Σ_{i=n+1..∞} ai/10^i ) × 10^e
           = ( an+1/10^(n+1) + Σ_{i=n+2..∞} ai/10^i ) × 10^e
           ≤ ( (10/2 − 1)/10^(n+1) + Σ_{i=n+2..∞} (10 − 1)/10^i ) × 10^e
           = ( (10/2 − 1)/10^(n+1) + 1/10^(n+1) ) × 10^e
           = (1/2) 10^(e−n).

Combining this with |x| ≥ 0.1 × 10^e, the relative error bound is

|x − fl(x)|/|x| ≤ ((1/2) 10^(e−n))/(0.1 × 10^e) = (1/2) 10^(1−n).
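To make these bounds concrete, here is a minimal Python sketch (ours, not from the notes; the helper name fl and its interface are our own choices) that simulates n-digit decimal chopping and rounding:

    import math

    def fl(x, n, mode="round"):
        # Simulate n-digit decimal floating point: fl(x) = ±(0.a1...an) × 10^e.
        if x == 0:
            return 0.0
        sign = -1.0 if x < 0 else 1.0
        ax = abs(x)
        e = math.floor(math.log10(ax)) + 1      # exponent so the mantissa lies in [0.1, 1)
        scaled = (ax / 10.0**e) * 10**n         # mantissa shifted to an n-digit integer
        kept = round(scaled) if mode == "round" else math.floor(scaled)
        return sign * kept * 10.0**(e - n)

    x = 8 / 3                                   # e = 1, n = 4 in the worked example above
    for mode in ("chop", "round"):
        print(mode, fl(x, 4, mode), abs(x - fl(x, 4, mode)))
    # chopping error ~0.000667 <= 10^(e-n) = 10^-3; rounding error ~0.000333 <= 0.5 * 10^-3

The printed errors sit inside the bounds Ea ≤ 10^(e−n) and Ea ≤ (1/2)10^(e−n) derived above.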
4. Significant Figures
The term significant digits is often used loosely to describe the number of decimal digits that appear
to be accurate. The definition given below is more precise.
Looking at an approximation 2.75303 to an actual value of 2.75194, we note that the three most
significant digits are equal, and therefore one may state that the approximation has three significant
digits of accuracy. One problem with simply looking at the digits is given by the following two examples:
(1) 1.9 as an approximation to 1.1 may appear to have one significant digit, but with a relative
error of 0.73, this seems unreasonable.
(2) 1.9999 as an approximation to 2.0001 may appear to have no significant digits, but the relative
error is 0.00010 which is almost the same relative error as the approximation 1.9239 is to 1.9237.
Thus, we need a more mathematical definition of the number of significant digits. Let the number x
and its approximation x∗ be written in decimal form. The number of significant digits tells us about
how many positions x and x∗ agree. More precisely, we say that x∗ has m significant digits of x if
the absolute error |x − x∗| has zeros in the first m decimal places, counting from the leftmost nonzero
(leading) position of x, followed by a digit from 0 to 5.
Examples:
5.1 has 1 significant digit of 5: |5 − 5.1| = 0.1.
0.51 has 1 significant digit of 0.5: |0.5 − 0.51| = 0.01.
4.995 has 3 significant digits of 5: 5 − 4.995 = 0.005.
4.994 has 2 significant digits of 5: 5 − 4.994 = 0.006.
0.57 has all significant digits of 0.57.
1.4 has 0 significant digits of 2: 2 − 1.4 = 0.6.
In terms of relative errors, the number x∗ is said to approximate x to m significant digits (or figures)
if m is the largest nonnegative integer for which

|x − x∗|/|x| ≤ 0.5 × 10^(−m).

If the relative error is greater than 0.5, then we will simply state that the approximation has zero
significant digits.
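The relative-error test above is easy to automate. The following short Python sketch is ours, not from the notes; the function name significant_digits is a hypothetical helper:

    def significant_digits(x, x_star):
        # Largest nonnegative m with |x - x*|/|x| <= 0.5 * 10^(-m); 0 if none.
        rel = abs(x - x_star) / abs(x)
        if rel == 0:
            return float("inf")     # exact: all digits significant
        if rel > 0.5:
            return 0                # relative error too large
        m = 0
        while rel <= 0.5 * 10.0 ** (-(m + 1)):
            m += 1
        return m

    print(significant_digits(5.0, 5.1))        # 1
    print(significant_digits(2.0001, 1.9999))  # 3
    print(significant_digits(1.1, 1.9))        # 0

These calls reproduce the worked examples above.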
Example 6. Multiply the following floating point numbers: 0.1111e74 and 0.2000e80.
Sol. On multiplying we obtain 0.1111e74 × 0.2000e80 = 0.2222e153. This shows the overflow condition
of normalized floating-point numbers.
Example 7. If x = 0.3721478693 and y = 0.3720230572, what is the relative error in the computation
of x − y using five decimal digits of accuracy?
Sol. We can compute with ten decimal digits of accuracy and take that as ‘exact’:

x − y = 0.0001248121.

Both x and y are rounded to five digits before the subtraction. Thus

fl(x) = 0.37215, fl(y) = 0.37202,
fl(x) − fl(y) = 0.13000 × 10^(−3).

The relative error, therefore, is

Er = |(x − y) − (fl(x) − fl(y))| / |x − y| ≈ 0.04 = 4%.
Example 8. The error in the measurement of the area of a circle is not allowed to exceed 0.5%. How
accurately should the radius be measured?
Sol. The area of the circle is A = πr^2, so

∂A/∂r = 2πr.

The relative error (in percentage) in the area is (δA/A) × 100 = 0.5.
Therefore δA = (0.5/100) A = (1/200)πr^2.
The percentage relative error in the radius is

(δr/r) × 100 = (100/r) · δA/(∂A/∂r) = (100/r) · (πr^2/200)/(2πr) = 0.25.
6. Loss of Accuracy
Round-off errors are inevitable and difficult to control. Other types of errors which occur in com-
putation may be under our control. The subject of numerical analysis is largely preoccupied with
understanding and controlling errors of various kinds.
One of the most common error-producing calculations involves the cancellation of digits due to the
subtraction of nearly equal numbers (or the addition of one very large number and one very small
number, or division by a very small number). The loss of accuracy due to round-off error can often be
avoided by a reformulation of the calculations, as illustrated in the next example.
Example 9. Use four-digit rounding arithmetic and the formula for the roots of a quadratic equation
to find the most accurate approximations to the roots of the following quadratic equation. Compute
the absolute and relative errors.

1.002x^2 + 11.01x + 0.01265 = 0.
Sol. The quadratic formula states that the roots of ax^2 + bx + c = 0 are

x1,2 = (−b ± √(b^2 − 4ac)) / (2a).

Using the above formula, the roots of the given equation 1.002x^2 + 11.01x + 0.01265 = 0 are
approximately (in long format)

x1 = −0.00114907565991, x2 = −10.98687487643590.
We use four-digit rounding arithmetic to find approximations to the roots. We write the approxima-
tions of the roots as x∗1 and x∗2. These approximations are given by

x∗1,2 = (−11.01 ± √((−11.01)^2 − 4 · 1.002 · 0.01265)) / (2 · 1.002)
      = (−11.01 ± √(121.2 − 0.05070)) / 2.004
      = (−11.01 ± 11.00) / 2.004.

Therefore, we find the first root:

x∗1 = (−11.01 + 11.00)/2.004 = −0.004990,

which has the absolute error |x1 − x∗1| = 0.00384095 and relative error |x1 − x∗1|/|x1| = 3.34265968
(very high).
We find the second root

x∗2 = (−11.01 − 11.00)/2.004 = −10.98,

which has absolute error

|x2 − x∗2| = 0.006874876,

and relative error

|x2 − x∗2|/|x2| = 0.000626127.
The quadratic formula for the calculation of the first root encounters the subtraction of nearly equal
numbers and causes loss of accuracy: here −b and √(b^2 − 4ac) are nearly equal in magnitude.
Therefore, we use the alternative quadratic formula, obtained by rationalizing the numerator, to
calculate the first root:

x∗1 = ((−b + √(b^2 − 4ac)) / (2a)) · ((−b − √(b^2 − 4ac)) / (−b − √(b^2 − 4ac)))
    = −2c / (b + √(b^2 − 4ac))
    = −0.001149,

which has the following very small relative error:

|x1 − x∗1|/|x1| = 6.584 × 10^(−5).
Remark 6.1. However, if we rationalize the numerator in x2, we get

x2 = −2c / (b − √(b^2 − 4ac)).

The use of this formula would involve not only the subtraction of two nearly equal numbers but also
division by a small number. This would degrade the accuracy.
Remark 6.2. The product of the roots of a quadratic is c/a. Thus we can find the approximation
of the first root from the second as

x∗1 = c / (a x∗2).
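Putting Remarks 6.1 and 6.2 together gives a cancellation-free recipe: compute the root where −b and the square root add, then recover the other root from the product of roots. A minimal Python sketch (ours, assuming real roots; the function name quadratic_roots is hypothetical):

    import math

    def quadratic_roots(a, b, c):
        # Roots of ax^2 + bx + c = 0, avoiding subtraction of nearly equal numbers.
        d = math.sqrt(b * b - 4 * a * c)     # assumes b^2 - 4ac >= 0
        if b >= 0:
            x1 = (-b - d) / (2 * a)          # no cancellation: b and d have the same sign
        else:
            x1 = (-b + d) / (2 * a)
        x2 = c / (a * x1)                    # product of the roots is c/a (Remark 6.2)
        return x1, x2

    print(quadratic_roots(1.002, 11.01, 0.01265))
    # approximately (-10.98687..., -0.00114907...), matching Example 9

Applied to the equation of Example 9 this reproduces both roots to nearly full double precision.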
Example 10. The quadratic formula is used for computing the roots of the equation ax^2 + bx + c = 0,
a ≠ 0, where the roots are given by

x = (−b ± √(b^2 − 4ac)) / (2a).

Consider the equation x^2 + 62.10x + 1 = 0 and discuss the numerical results.
Sol. Using the quadratic formula and 8-digit rounding arithmetic, we obtain the two roots

x1 = −0.01610723, x2 = −62.08390.

We use these values as “exact values”. Now we perform the calculations with 4-digit rounding
arithmetic. We have √(b^2 − 4ac) = √(62.10^2 − 4.000) = √(3856 − 4.000) = 62.06 and

fl(x1) = (−62.10 + 62.06)/2.000 = −0.02000.

The relative error in computing x1 is

|fl(x1) − x1|/|x1| = |−0.02000 + 0.01610723|/|−0.01610723| = 0.2417.

In calculating x2,

fl(x2) = (−62.10 − 62.06)/2.000 = −62.10.

The relative error in computing x2 is

|fl(x2) − x2|/|x2| = |−62.10 + 62.08390|/|−62.08390| = 0.259 × 10^(−3).

In this equation, b^2 = 62.10^2 is much larger than 4ac = 4, so b and √(b^2 − 4ac) are nearly equal
numbers. The calculation of x1 involves the subtraction of two nearly equal numbers, but x2 involves
the addition of nearly equal numbers, which does not cause a serious loss of significant figures.
To obtain a more accurate 4-digit rounding approximation for x1, we change the formulation by
rationalizing the numerator, that is,

x1 = −2c / (b + √(b^2 − 4ac)).

Then

fl(x1) = −2.000/(62.10 + 62.06) = −2.000/124.2 = −0.01610.

The relative error in computing x1 is now reduced to 0.62 × 10^(−3).
Nested Arithmetic: Accuracy loss due to round-off error can also be reduced by rearranging calcu-
lations, as shown in the next example. Polynomials should always be expressed in nested form before
performing an evaluation, because this form minimizes the number of arithmetic calculations. One
way to reduce round-off error is to reduce the number of computations.
Example 11. Evaluate f(x) = x^3 − 6.1x^2 + 3.2x + 1.5 at x = 4.71 using three-digit arithmetic,
directly and with nesting.
Sol. The exact result of the evaluation is (taking more digits):

Exact: f(4.71) = 4.71^3 − 6.1 × 4.71^2 + 3.2 × 4.71 + 1.5 = −14.263899.

Using three-digit rounding arithmetic, we obtain

f(4.71) = 4.71^3 − 6.1 × 4.71^2 + 3.2 × 4.71 + 1.5
        = 22.2 × 4.71 − 6.1 × 22.2 + 15.1 + 1.5
        = 105 − 135 + 15.1 + 1.5 = −13.4.

Similarly, if we use three-digit chopping, then

f(4.71) = 22.1 × 4.71 − 6.1 × 22.1 + 15.0 + 1.5
        = 104 − 134 + 15.0 + 1.5 = −13.5.
The relative error in the case of three-digit rounding is

|(−14.263899 + 13.4)/(−14.263899)| ≈ 0.06,

and in the case of three-digit chopping it is

|(−14.263899 + 13.5)/(−14.263899)| ≈ 0.05.

In the nested form f(x) = ((x − 6.1)x + 3.2)x + 1.5, three-digit rounding arithmetic gives

f(4.71) = ((4.71 − 6.1)4.71 + 3.2)4.71 + 1.5 = (−6.55 + 3.2)4.71 + 1.5 = −15.8 + 1.5 = −14.3,

with relative error |(−14.263899 + 14.3)/(−14.263899)| ≈ 0.0025.
Nesting has reduced the relative errors for both approximations. Moreover, the original form requires
7 multiplications while the nested form has only 2 multiplications. Thus the nested form reduces the
error as well as the number of calculations.
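The nested (Horner) scheme is straightforward to code. A small Python sketch (ours; the name horner is our own choice):

    def horner(coeffs, x):
        # Evaluate a polynomial with coefficients [a_n, ..., a_0] in nested form.
        result = 0.0
        for a in coeffs:
            result = result * x + a
        return result

    # f(x) = x^3 - 6.1x^2 + 3.2x + 1.5 at x = 4.71 (Example 11)
    print(horner([1.0, -6.1, 3.2, 1.5], 4.71))   # ~ -14.263899

In full double precision both forms agree; the advantage of nesting shows up, as above, when each intermediate result is rounded to a few digits.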
For example, if f(x) = 10/(1 − x^2), then the condition number can be calculated as

κ = |x f′(x)/f(x)| = 2x^2/|1 − x^2|.

The condition number can be quite large for |x| ≈ 1. Therefore, the function is ill-conditioned there.
Example 12. Compute and interpret the condition number for
(a) f(x) = sin x for x = 0.51π;
(b) f(x) = tan x for x = 1.7.
Sol. (a) The condition number is given by

κ = |x f′(x)/f(x)|.

For x = 0.51π, f′(x) = cos(0.51π) = −0.03141 and f(x) = sin(0.51π) = 0.99951.
∴ κ = |0.51π × (−0.03141)/0.99951| = 0.05035 < 1.
Since the condition number is < 1, we conclude that the relative error is attenuated.
(b) f(x) = tan x, f(1.7) = −7.6966, f′(x) = 1/cos^2 x, f′(1.7) = 1/cos^2(1.7) = 60.2377.
κ = |1.7 × 60.2377/(−7.6966)| = 13.305 ≫ 1.
Thus, the function is ill-conditioned.
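A quick numerical check of these values in Python (our sketch; condition_number is a hypothetical helper):

    import math

    def condition_number(f, fprime, x):
        # kappa = |x f'(x)/f(x)|: factor by which relative errors in x are magnified.
        return abs(x * fprime(x) / f(x))

    print(condition_number(math.sin, math.cos, 0.51 * math.pi))                 # ~0.0503
    print(condition_number(math.tan, lambda x: 1 / math.cos(x) ** 2, 1.7))      # ~13.3

The first call confirms the well-conditioned case (a), the second the ill-conditioned case (b).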
7.1. Creating Algorithms. Another theme that occurs repeatedly in numerical analysis is the dis-
tinction between numerical algorithms that are stable and those that are not. Informally speaking, a
numerical process is unstable if small errors made at one stage of the process are magnified and prop-
agated in subsequent stages and seriously degrade the accuracy of the overall calculation.
An algorithm can be thought of as a sequence of problems, i.e., a sequence of function evaluations.
In this case we consider the algorithm for evaluating f(x) to consist of the evaluation of the sequence
x1, x2, . . . , xn. We are concerned with the condition of each of the functions f1(x1), f2(x2), . . . ,
fn−1(xn−1), where f(x) = fi(xi) for all i. An algorithm is unstable if any fi is ill-conditioned, i.e., if
any fi(xi) has condition much worse than f(x). In the following, we study an example of creating a
stable algorithm.
Example 13. Write an algorithm to calculate the expression f(x) = √(x + 1) − √x when x is quite
large. By considering the condition number κ of the subproblem of evaluating the function, show that
such a function evaluation is not stable. Suggest a modification which makes it stable.
Sol. Consider

f(x) = √(x + 1) − √x,

so that there is a potential loss of significance when x is large. Taking x = 12345 as an example, one
possible algorithm is

x0 := x = 12345
x1 := x0 + 1
x2 := √x1
x3 := √x0
f(x) := x4 := x2 − x3.

The loss of significance occurs with the final subtraction. We can rewrite the last step in the form
f3(x3) = x2 − x3 to show how the final answer depends on x3. As f3′(x3) = −1, we have the condition

κ(x3) = |x3 f3′(x3)/f3(x3)| = |x3/(x2 − x3)|,
from which we find κ(x3) ≈ 24690.5 when x = 12345. Note that this is the condition of a subproblem
arrived at during the algorithm.
To find an alternative algorithm, we write

f(x) = (√(x + 1) − √x) · (√(x + 1) + √x)/(√(x + 1) + √x) = 1/(√(x + 1) + √x).
This suggests the algorithm

x0 := x = 12345
x1 := x0 + 1
x2 := √x1
x3 := √x0
x4 := x2 + x3
f(x) := x5 := 1/x4.

In this case f3(x3) = 1/(x2 + x3), giving a condition for the subproblem of

κ(x3) = |x3 f3′(x3)/f3(x3)| = |x3/(x2 + x3)|,
which is approximately 0.5 when x = 12345. Thus the first algorithm is unstable and the second is
stable for large values of x. In general, such analyses are not usually so straightforward, but, in
principle, stability can be analysed by examining the condition of a sequence of subproblems.
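The two algorithms of Example 13 can be compared directly in code. A minimal Python sketch (ours); in double precision the effect is most visible for very large x:

    import math

    def f_unstable(x):
        return math.sqrt(x + 1) - math.sqrt(x)          # subtraction of nearly equal numbers

    def f_stable(x):
        return 1.0 / (math.sqrt(x + 1) + math.sqrt(x))  # rationalized form

    x = 1.0e12
    print(f_unstable(x))   # cancellation: roughly half of the digits are lost
    print(f_stable(x))     # ~5.0e-7 with nearly full double-precision accuracy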
Example 14. Write an algorithm to calculate the expression f(x) = sin(a + x) − sin a when x =
0.0001. By considering the condition number κ of the subproblem of evaluating the function, show that
such a function evaluation is not stable. Suggest a modification which makes it stable.
Sol. Let x = 0.0001. One possible algorithm is

x0 = 0.0001
x1 = a + x0
x2 = sin x1
x3 = sin a
x4 = x2 − x3.

Now, to check the effect of x3 on the final answer, we consider the function f3(x3) = x2 − x3 and
calculate the condition with a = 2:

κ(x3) = |x3 f3′(x3)/f3(x3)| = |x3/(x2 − x3)| ≈ 20008.13.

We obtain a very large condition number, which shows that the last step is not stable.
Thus we need to modify the above algorithm, and we write the equivalent form

f(x) = sin(a + x) − sin a = 2 sin(x/2) cos(a + x/2).

The modified algorithm is the following:

x0 = 0.0001
x1 = x0/2
x2 = sin x1
x3 = cos(a + x1)
x4 = 2 x2 x3.

Now we consider the function f3(x3) = 2 x2 x3, for which

κ(x3) = |x3 · 2x2/(2 x2 x3)| = 1.

Thus the condition number is one, so this form is acceptable.
Remarks
(1) Accuracy tells us the closeness of the computed solution to the true solution of the problem.
Accuracy depends on the conditioning of the problem as well as the stability of the algorithm.
(2) Stability alone does not guarantee accurate results. Applying a stable algorithm to a well-
conditioned problem yields an accurate solution. Inaccuracy can result from applying a stable
algorithm to an ill-conditioned problem or an unstable algorithm to a well-conditioned problem.
Exercises
(1) Compute the absolute error and relative error in the approximation of x by x∗.
(a) x = π, x∗ = 22/7.
(b) x = √2, x∗ = 1.414.
(c) x = 8!, x∗ = 39900.
(2) Find the largest interval in which x∗ must lie to approximate x with relative error at most
10^(−4) for each value of x.
(a) π.
(b) e.
(c) √3.
(d) ∛7.
(3) A rectangular parallelepiped has sides of length 3 cm, 4 cm, and 5 cm, measured to the nearest
centimeter. What are the best upper and lower bounds for the volume of this parallelepiped?
What are the best upper and lower bounds for the surface area?
(4) Use three-digit rounding arithmetic to perform the following calculations. Compute the abso-
lute error and relative error with the exact value determined to at least five digits.
(a) √3 + (√5 + √7).
(b) (121 − 0.327) − 119.
(c) −10π + 6e − 3/62.
(d) (π − 22/7)/(1/17).
(5) Calculate the value of x^2 + 2x − 2 and (2x − 2) + x^2, where x = 0.7320e0, using normalized
floating-point arithmetic, and prove that they are not the same. Compare with the value of
(x^2 − 2) + 2x.
(6) The derivative of f(x) = 1/(1 − 3x^2) is given by 6x/(1 − 3x^2)^2. Do you expect to have
difficulties evaluating this derivative at x = 0.577? Try it using 3- and 4-digit arithmetic with
chopping.
(7) Suppose two points (x0, y0) and (x1, y1) are on a straight line with y1 ≠ y0. Two formulas are
available to find the x-intercept of the line:

x = (x0 y1 − x1 y0)/(y1 − y0), and x = x0 − (x1 − x0)y0/(y1 − y0).

(a) Show that both formulas are algebraically correct.
(b) Use the data (x0, y0) = (1.31, 3.24) and (x1, y1) = (1.93, 4.76) and three-digit rounding
arithmetic to compute the x-intercept both ways. Which method is better and why?
(8) Use four-digit rounding arithmetic and the quadratic formula to find the most accurate ap-
proximations to the roots of the following quadratic equation. Compute the absolute errors
and relative errors.

(1/3)x^2 + (123/4)x − 1/6 = 0.
(9) Find the root of smallest magnitude of the equation x^2 − 1000x + 25 = 0 using the quadratic
formula. Work in floating-point arithmetic using a four-decimal-place mantissa.
(10) Consider the identity

∫_0^x sin(xt) dt = (1 − cos(x^2))/x.

Explain the difficulty in using the right-hand fraction to evaluate this expression when x is
close to zero. Give a way to avoid this problem and be as precise as possible.
(11) Assume 3-digit mantissa with rounding
(b) Which function should be used for computations when x is near 3π/2? Why?
(16) (a) Consider the stability (by calculating the condition number) of √(1 + x) − 1 when x is
near 0. Rewrite the expression to rid it of subtractive cancellation.
(b) Rewrite e^x − cos x to be stable when x is near 0.
(17) Suppose that the function f(x) = ln(x + 1) − ln(x) is computed by the following algorithm for
large values of x, using six-digit rounding arithmetic:
x0 := x = 12345
x1 := x0 + 1
x2 := ln x1
x3 := ln x0
f(x) := x4 := x2 − x3.
By considering the condition κ(x3) of the subproblem of evaluating the function, show that
such a function evaluation is not stable. Also propose a modification of the function evaluation
so that the algorithm becomes stable.
CHAPTER 2 (8 LECTURES)
ROOTS OF NON-LINEAR EQUATIONS IN ONE VARIABLE
1. Introduction
Finding one or more roots (or zeros) of the equation

f(x) = 0

is one of the more commonly occurring problems of applied mathematics. In most cases explicit
solutions are not available and we must be satisfied with being able to find a root to any specified
degree of accuracy. The numerical procedures for finding roots are called iterative methods. These
problems arise in a variety of applications.
The growth of a population can often be modeled over short periods of time by assuming that the
population grows continuously with time at a rate proportional to the number present at that time.
Suppose that N(t) denotes the number in the population at time t and λ denotes the constant birth
rate of the population. Then the population satisfies the differential equation

dN(t)/dt = λN(t),

whose solution is N(t) = N0 e^(λt), where N0 denotes the initial population.
This exponential model is valid only when the population is isolated, with no immigration. If
immigration is permitted at a constant rate I, then the differential equation becomes

dN(t)/dt = λN(t) + I,

whose solution is

N(t) = N0 e^(λt) + (I/λ)(e^(λt) − 1).
Suppose a certain population contains N(0) = 1000000 individuals initially, that 424000 individuals
immigrate into the community in the first year, and that N(1) = 1564000 individuals are present at
the end of one year. To determine the birth rate of this population, we need to find λ in the equation

1564000 = 1000000 e^λ + (424000/λ)(e^λ − 1).

It is not possible to solve explicitly for λ in this equation, but numerical methods discussed in this
chapter can be used to approximate solutions of equations of this type to an arbitrarily high accuracy.
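For instance, the root can be located numerically. A minimal Python sketch (ours), using SciPy's brentq root finder (assuming SciPy is available; the bracket [0.01, 0.2] was found by checking the signs of the residual):

    from math import exp
    from scipy.optimize import brentq

    def f(lam):
        # residual of: 1564000 = 1000000 e^lam + (424000/lam)(e^lam - 1)
        return 1.0e6 * exp(lam) + (424000.0 / lam) * (exp(lam) - 1.0) - 1564000.0

    # f(0.01) < 0 and f(0.2) > 0, so a root is bracketed in [0.01, 0.2]
    print(brentq(f, 0.01, 0.2))   # birth rate lambda, roughly 0.11 for these data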
Definition 1.1 (Simple and multiple root). A zero (root) has a “multiplicity”, which refers to the
number of times its associated factor appears in the equation. A root having multiplicity one is
called a simple root. For example, f(x) = (x − 1)(x − 2) has simple roots at x = 1 and x = 2, but
g(x) = (x − 1)^2 has a root of multiplicity 2 at x = 1, which is therefore not a simple root.
A root with multiplicity m ≥ 2 is called a multiple (or repeated) root. For example, in the equation
(x − 1)^2 = 0, x = 1 is a multiple (double) root.
If a polynomial has a multiple root, its derivative also shares that root.
Let α be a root of the equation f(x) = 0, and imagine writing it in the factored form

f(x) = (x − α)^m φ(x)

with some integer m ≥ 1 and some continuous function φ(x) for which φ(α) ≠ 0. Then we say that α
is a root of f(x) of multiplicity m.
Now we study some iterative methods to solve the non-linear equations.
Example 1. The sum of two numbers is 20. If each number is added to its square root, the product
of the resulting sums is 155.55. Perform five iterations of the bisection method to determine the two
numbers.
Sol. Let x and y be the two numbers. Then

x + y = 20.

Now √x is added to x and √y is added to y. The product of these sums is

(x + √x)(y + √y) = 155.55.

∴ (x + √x)(20 − x + √(20 − x)) = 155.55.

Write the above equation as the root-finding problem

f(x) = (x + √x)(20 − x + √(20 − x)) − 155.55 = 0.

As f(6)f(7) < 0, there is a root in the interval (6, 7). Performing five iterations of the bisection
method gives the root as 6.53125, so the two numbers are approximately 6.53125 and 13.46875.
Note (choice of initial approximations): Initial approximations to the root are often known from the
physical significance of the problem. Graphical methods can be used to locate the zero of f(x) = 0,
and any value in the neighborhood of the root can be taken as an initial approximation. If the given
equation f(x) = 0 can be written as f1(x) = f2(x), then the point of intersection of the graphs
y = f1(x) and y = f2(x) gives the root of the equation, and any value in the neighborhood of this
point can be taken as an initial approximation.
Definition 2.1 (Convergence). A sequence {xn} is said to converge to a point α with order p if there
exists a constant c such that

lim_{n→∞} |xn+1 − α|/|xn − α|^p = c.

The constant c is known as the asymptotic error constant. If we write en = |xn − α|, where en denotes
the absolute error in the n-th iteration, then in the limiting case we can write

en+1 = c en^p.

Two cases are given special attention:
(i) If p = 1 (and c < 1), the sequence is linearly convergent.
(ii) If p = 2, the sequence is quadratically convergent.
Definition 2.2. Let {βn} be a sequence which converges to zero and {xn} be any sequence. If there
exist a constant c > 0 and an integer N > 0 such that

|xn − α| ≤ c|βn|, ∀n ≥ N,

then we say that {xn} converges to α with rate O(βn). We write

xn = α + O(βn).
Example: Define two sequences for n ≥ 1,

xn = (n + 1)/n^2, and yn = (n + 2)/n^3.

Both sequences have limit 0, but the sequence {yn} converges to this limit much faster than the
sequence {xn}. Now, with βn = 1/n and β̃n = 1/n^2,

|xn − 0| = (n + 1)/n^2 < (n + n)/n^2 = 2/n = 2βn

and

|yn − 0| = (n + 2)/n^3 < (n + 2n)/n^3 = 3/n^2 = 3β̃n.

Hence the rate of convergence of {xn} to zero is similar to the convergence of {1/n} to zero, whereas
{yn} converges to zero at a rate similar to the more rapidly convergent sequence {1/n^2}. We express
this by writing

xn = 0 + O(βn) and yn = 0 + O(β̃n).
2.2. Convergence analysis. Now we analyze the convergence of the iterations generated by the
bisection method.
Theorem 2.3. Suppose that f ∈ C[a, b] and f(a) · f(b) < 0. Then the bisection method generates a
sequence {cn} approximating a zero α of f with linear convergence.
Proof. Let [a1, b1], [a2, b2], · · · , [an, bn], · · · denote the successive intervals produced by the bisection
algorithm. Thus

a = a1 ≤ a2 ≤ · · · ≤ b1 = b,
b = b1 ≥ b2 ≥ · · · ≥ a1 = a.

This implies {an} and {bn} are monotonic and bounded, and hence convergent. Since

b1 − a1 = b − a,
b2 − a2 = (1/2)(b1 − a1) = (1/2)(b − a),
. . . . . . . . . . . . . . .
bn − an = (1/2^(n−1))(b − a),     (2.1)

we have

lim_{n→∞} (bn − an) = 0.
Here b − a denotes the length of the original interval with which we started. Taking limits,

lim_{n→∞} an = lim_{n→∞} bn = α (say).

Since f is a continuous function,

lim_{n→∞} f(an) = f(lim_{n→∞} an) = f(α).

The bisection method ensures that

f(an)f(bn) ≤ 0,

which implies

lim_{n→∞} f(an)f(bn) = f^2(α) ≤ 0 =⇒ f(α) = 0.

Thus the common limit of {an} and {bn} is a zero of f in [a, b].
Since the root α lies in either the interval [an, cn] or [cn, bn], we have

|α − cn| < cn − an = bn − cn = (1/2)(bn − an).

Combining this with (2.1), we obtain the further bound

en = |α − cn| < (1/2^n)(b − a).

Therefore

en+1 < (1/2^(n+1))(b − a),

so the error bound is halved at each iteration. This shows that the iterates cn converge to α as
n → ∞, and by the definition of convergence, the bisection method converges linearly with rate 1/2.
Illustrations: 1. Since the method brackets the root, the method is guaranteed to converge; however,
it can be very slow.
2. Computing cn: It might happen that at a certain iteration n, the computation of cn = (an + bn)/2
gives an overflow. It is better to compute cn as

cn = an + (bn − an)/2.

3. Stopping criteria: Since this is an iterative method, we must determine some stopping criteria
that will allow the iteration to stop. We can use the following criteria, in terms of the absolute error
and the relative error:

|cn+1 − cn| ≤ ε,    or    |cn+1 − cn|/|cn+1| ≤ ε,

provided |cn+1| ≠ 0. The criterion |f(cn)| ≤ ε can be misleading, since it is possible to have |f(cn)|
very small even if cn is not close to the root.
Let us now find the minimum number of iterations N needed with the bisection method to achieve a
certain desired accuracy. The interval length after N iterations is (b − a)/2^N. So, to obtain an
accuracy of ε, we must have (b − a)/2^N ≤ ε. That is,

2^(−N)(b − a) ≤ ε,

or

N ≥ (log(b − a) − log ε)/log 2.

Note that the number N depends only on the initial interval [a, b] bracketing the root.
4. If a function just touches the x-axis, for example f(x) = x^2, then we cannot find a and b such
that f(a)f(b) < 0, even though x = 0 is a root of f(x) = 0.
5. For functions with a singularity at which the sign reverses, the bisection method may converge on
the singularity. An example is f(x) = 1/x: we can choose a and b such that f(a)f(b) < 0, but the
function is not continuous, and the theorem guaranteeing that a root exists is not applicable.
Example 2. Use the bisection method to find a solution accurate to within 10^(−2) for
x^3 − 7x^2 + 14x − 6 = 0 on [0, 1].
Sol. The number of iterations satisfies

N ≥ (log(1 − 0) − log(10^(−2)))/log 2 = 6.6439.

Thus, a minimum of 7 iterations will be needed to obtain the desired accuracy using the bisection
method; the iterations produce the successive mid-points cn and absolute errors En = |cn − cn−1|.
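A small Python sketch (ours; the name bisection_trace is hypothetical) that computes the minimum iteration count and traces the mid-points for this example:

    import math

    def bisection_trace(f, a, b, eps):
        # minimum number of iterations from N >= (log(b - a) - log eps) / log 2
        n_min = math.ceil((math.log(b - a) - math.log(eps)) / math.log(2))
        print("minimum iterations:", n_min)       # 7 for [0, 1] and eps = 1e-2
        for n in range(1, n_min + 1):
            c = a + (b - a) / 2                   # safer form of the midpoint
            if f(a) * f(c) < 0:
                b = c
            else:
                a = c
            print(n, c)

    bisection_trace(lambda x: x**3 - 7*x**2 + 14*x - 6, 0.0, 1.0, 1e-2)

The mid-points approach the root 2 − √2 ≈ 0.5858 of this cubic in [0, 1].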
Fixed-point iterations: We now consider solving an equation x = g(x) for a root α by the iteration

xn+1 = g(xn), n ≥ 0,

with x0 as an initial guess to α. Each solution of x = g(x) is called a fixed point of g.
For example, the equation x^2 = 3 can be rewritten in the fixed-point form x = g(x) in several ways,
among them
2. x = 3/x.
3. x = (1/2)(x + 3/x).
Let x0 = 2. Now √3 = 1.73205 . . . , and it turns out that the third choice works, but why do the
others not? Which choice of g is suitable will be answered by the convergence result below (which
requires |g′(α)| < 1 and a ≤ g(x) ≤ b, ∀x ∈ [a, b], in a neighborhood of the root α).
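The difference between the two rearrangements is easy to see numerically. A minimal Python sketch (ours); note that the second rearrangement is in fact Newton's method for x^2 = 3:

    def fixed_point(g, x0, n):
        x = x0
        for _ in range(n):
            x = g(x)
        return x

    # Two rearrangements of x^2 = 3 (root sqrt(3) = 1.73205...), starting at x0 = 2:
    print(fixed_point(lambda x: 3 / x, 2.0, 10))              # oscillates: 2, 1.5, 2, 1.5, ...
    print(fixed_point(lambda x: 0.5 * (x + 3 / x), 2.0, 10))  # converges rapidly to 1.73205...

For g(x) = 3/x we have |g′(√3)| = 1, so the iterates never settle down; for the averaged form g′(√3) = 0, giving very fast convergence.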
Lemma 3.1. Let g(x) be a continuous function on [a, b] and assume that a ≤ g(x) ≤ b, ∀x ∈ [a, b].
Then x = g(x) has at least one solution in [a, b].
Proof. Let g be a continuous function on [a, b] with a ≤ g(x) ≤ b, ∀x ∈ [a, b]. Consider
φ(x) = g(x) − x.
If g(a) = a or g(b) = b, then the proof is trivial. Hence we assume that a ≠ g(a) and b ≠ g(b).
Since a ≤ g(x) ≤ b, this implies g(a) > a and g(b) < b. Now

φ(a) = g(a) − a > 0

and

φ(b) = g(b) − b < 0.

Since φ is continuous and φ(a)φ(b) < 0, by the Intermediate Value Theorem φ has at least one zero
in [a, b], i.e., there exists some α ∈ [a, b] such that g(α) = α.
Graphically, the roots are the intersection points of y = x and y = g(x), as shown in the Figure.
Theorem 3.2 (Contraction Mapping Theorem). Let g and g′ be continuous functions on [a, b], and
assume that g satisfies a ≤ g(x) ≤ b, ∀x ∈ [a, b]. Furthermore, assume that there is a positive
constant λ < 1 with

|g′(x)| ≤ λ, ∀x ∈ (a, b).

Then:
1. x = g(x) has a unique solution α in the interval [a, b].
2. The iterates xn+1 = g(xn), n ≥ 0, converge to α for any choice of x0 ∈ [a, b].
3. |α − xn| ≤ (λ^n/(1 − λ)) |x1 − x0|, n ≥ 0.
4. The convergence is linear.
Proof. Let g and g′ be continuous functions on [a, b], and assume that a ≤ g(x) ≤ b, ∀x ∈ [a, b]. By
the previous Lemma, there exists at least one solution to x = g(x). Also assume

|g′(x)| ≤ λ < 1, ∀x ∈ (a, b).

By the Mean-Value Theorem, for any x, y ∈ [a, b],

g(x) − g(y) = g′(c)(x − y) for some c between x and y,

so that

|g(x) − g(y)| ≤ λ|x − y|, 0 < λ < 1, ∀x, y ∈ [a, b].

1. Let x = g(x) have two solutions, say α and β, in [a, b]; then α = g(α) and β = g(β). Now

|α − β| = |g(α) − g(β)| ≤ λ|α − β| =⇒ (1 − λ)|α − β| ≤ 0 =⇒ α = β, since 0 < λ < 1.

Therefore x = g(x) has a unique solution in [a, b], which we call α.
2. To check the convergence of the iterates {xn}, we observe that they all remain in [a, b]: if
xn ∈ [a, b], then xn+1 = g(xn) ∈ [a, b].
Now

|α − xn+1| = |g(α) − g(xn)| = |g′(cn)| |α − xn|     (3.1)

for some cn between α and xn. Hence

|α − xn+1| ≤ λ|α − xn| ≤ λ^2 |α − xn−1| ≤ · · · ≤ λ^(n+1) |α − x0|.

As n → ∞, λ^n → 0, which implies xn → α. Also

|α − xn| ≤ λ^n |α − x0|.     (3.2)
3. To find the bound, note that

|α − x0| = |α − x1 + x1 − x0| ≤ |α − x1| + |x1 − x0| ≤ λ|α − x0| + |x1 − x0|
=⇒ (1 − λ)|α − x0| ≤ |x1 − x0|
=⇒ |α − x0| ≤ (1/(1 − λ)) |x1 − x0|
=⇒ λ^n |α − x0| ≤ (λ^n/(1 − λ)) |x1 − x0|.

Therefore, using (3.2),

|α − xn| ≤ λ^n |α − x0| ≤ (λ^n/(1 − λ)) |x1 − x0|.
4. Since xn → α, we also have cn → α in (3.1). Hence

lim_{n→∞} |α − xn+1|/|α − xn| = |g′(α)|.

If |g′(α)| < 1, this formula shows that the iterates are linearly convergent with rate (asymptotic error
constant) |g′(α)|. If in addition g′(α) ≠ 0, then the formula proves that the convergence is exactly
linear, with no higher order of convergence being possible.
Illustrations: 1. In practice, it is difficult to find an interval [a, b] for which the condition
a ≤ g(x) ≤ b is satisfied. On the other hand, if |g′(α)| > 1, then the iteration xn+1 = g(xn) will not
converge to α. When |g′(α)| ≈ 1, no conclusion can be drawn, and even if convergence occurs, the
method would be far too slow to be practical.
2. If

|α − xn| ≤ (λ^n/(1 − λ)) |x1 − x0| < ε,

where ε is the desired accuracy, this bound can be used to find the number of iterations needed to
achieve the accuracy ε. Also, from part 2 of the proof,
|α − xn| ≤ λ^n |α − x0| ≤ λ^n max{x0 − a, b − x0} < ε can be used to find the number of iterations.
3. The possible behavior of the fixed-point iterates {xn} is shown in Figure 3 for various values of
g′(α). To see the convergence, consider the case of x1 = g(x0), the height of y = g(x) at x0. We
bring the number x1 back to the x-axis by using the line y = x and the height y = x1. We continue
this with each iterate, obtaining a stair-step behavior when g′(α) > 0. When g′(α) < 0, the iterates
oscillate around the fixed point α, as can be seen in the Figure. In the first figure (top), the iterations
converge monotonically; in the second, they converge in an oscillatory manner; in the third, they
diverge; and in the last, they diverge in an oscillatory manner.
Theorem 3.3. Let α be a root of x = g(x), and let g(x) be p times continuously differentiable for
all x ∈ [α − δ, α + δ], with g(x) ∈ [α − δ, α + δ], for some p ≥ 2. Furthermore, assume that

g′(α) = g′′(α) = · · · = g^(p−1)(α) = 0, g^(p)(α) ≠ 0.     (3.3)

Then, for x0 sufficiently close to α, the iteration xn+1 = g(xn) converges to α with order p.
Proof. Let g(x) be p times continuously differentiable for all x ∈ [α − δ, α + δ], with
g(x) ∈ [α − δ, α + δ], and satisfying the conditions in equation (3.3) stated above.
Now expand g(xn) in a Taylor polynomial about α:

xn+1 = g(xn) = g(α + xn − α)
     = g(α) + (xn − α)g′(α) + · · · + ((xn − α)^(p−1)/(p − 1)!) g^(p−1)(α) + ((xn − α)^p/p!) g^(p)(cn),
Example 4. Consider the equation x^3 − 7x + 2 = 0 in [0, 1]. Write a fixed-point iteration which will
converge to the solution.
Sol. We rewrite the equation in the form x = (x^3 + 2)/7 and define the fixed-point iteration

xn+1 = (xn^3 + 2)/7.

Now g(x) = (x^3 + 2)/7 is a continuous function, with

g′(x) = 3x^2/7, g′′(x) = 6x/7,
g(0) = 2/7, g(1) = 6/7,
g′(0) = 0, g′(1) = 3/7.

Hence 2/7 ≤ g(x) ≤ 6/7 and |g′(x)| ≤ 3/7 < 1, ∀x ∈ [0, 1].
Hence, by the fixed point theorem, the sequence {xn} defined above converges to the unique solution
of the given equation in [0, 1]. Starting with x0 = 0.5, we compute

x1 = 0.303571429,
x2 = 0.28971083,
x3 = 0.289188016.

Therefore the root correct to three decimals is 0.289.
Example 5. The equation e^x = 4x^2 has a root in [4, 5]. Show that we cannot find that root using
x = g(x) = (1/2)e^(x/2) in the fixed-point iteration method. Can you find another iterative formula
which will locate that root? If yes, then compute three iterations with x0 = 4.5. Also find the error
bound.
Sol. Here g(x) = (1/2)e^(x/2) and g′(x) = (1/4)e^(x/2) > 1 for all x ∈ (4, 5); therefore, the
fixed-point iteration fails to converge to the root in [4, 5].
Now consider x = g(x) = ln(4x^2). Then

g′(x) = 2/x > 0, g′′(x) = −2/x^2 < 0, ∀x ∈ [4, 5].

Therefore g and g′ are monotonically increasing and decreasing, respectively. Now

g(4) = 4.15888, g(5) = 4.60517, g′(4) = 0.5, g′(5) = 0.4.

Thus

4 ≤ g(x) ≤ 5, λ = max_{4≤x≤5} |g′(x)| = 0.5 < 1.

Using the fixed-point iteration method with x0 = 4.5 gives the iterations

x1 = g(x0) = ln(4 × 4.5^2) = 4.3944,
x2 = 4.3469,
x3 = 4.3253.

We have the error bound

|α − x3| ≤ (0.5^3/(1 − 0.5)) |4.3944 − 4.5| = 0.0264.
Example 6. Use a fixed-point method to determine a solution to within 10^(−4) for x = tan x, for x
in [4, 5].
Sol. Using g(x) = tan x and x0 = 4 gives x1 = g(x0) = tan 4 = 1.158, which is not in the interval
[4, 5]. So we need a different fixed-point function. If we note that x = tan x implies

1/x = 1/tan x =⇒ x = x − 1/x + 1/tan x,

then, starting with x0 = 4 and taking g(x) = x − 1/x + 1/tan x, we obtain

x1 = 4.61369, x2 = 4.49596, x3 = 4.49341, x4 = 4.49341.

As x3 and x4 agree to five decimals, it is reasonable to assume that these values are sufficiently
accurate.
Example 7. The iterates xn+1 = 2 − (1 + c)xn + c xn^3 will converge to α = 1 for some values of
the constant c (provided that x0 is sufficiently close to α). Find the values of c for which convergence
occurs. For what values of c, if any, is the convergence quadratic?
Sol. This is a fixed-point iteration

xn+1 = g(xn) with g(x) = 2 − (1 + c)x + cx^3.

Note that α = 1 is a fixed point. For convergence we need |g′(α)| < 1:

|−(1 + c) + 3cα^2| = |2c − 1| < 1 =⇒ 0 < c < 1.

For quadratic convergence we need

g′(α) = 0 and g′′(α) ≠ 0.

This gives c = 1/2, and then g′′(α) = 6cα = 3 ≠ 0.
Example 8. Which of the following iterations,

a. xn+1 = (1/4)(xn^2 + 6/xn),    b. xn+1 = 4 − 6/xn^2,

is suitable to find a root of the equation x^3 = 4x^2 − 6 in the interval [3, 4]? Estimate the number of
iterations required to achieve 10^(−3) accuracy, starting from x0 = 3.
Sol. a. Here

g(x) = (1/4)(x^2 + 6/x), g′(x) = (1/2)(x − 3/x^2).

g is continuous in [3, 4], but g′(x) > 1 for all x ∈ (3, 4). So this choice of g(x) is not suitable.
b. Here

g(x) = 4 − 6/x^2, g′(x) = 12/x^3.

Now g is continuous in [3, 4] and g(x) ∈ [3, 4] for all x ∈ [3, 4]. Also |g′(x)| = |12/x^3| < 1 for all
x ∈ (3, 4). Thus a unique fixed point exists in [3, 4] by the fixed point theorem. To find an
approximation of that root with an accuracy of 10^(−3), we need to determine the number of
iterations n so that

|α − xn| ≤ (λ^n/(1 − λ)) |x1 − x0| < 10^(−3).

Here λ = max_{3≤x≤4} |g′(x)| = 4/9, and using the fixed-point method with x0 = 3 we have
x1 = g(x0) = 10/3. Hence we need

((4/9)^n/(1 − 4/9)) |10/3 − 3| < 10^(−3)
=⇒ n(log 4 − log 9) < log(10^(−3) × 5/3)
=⇒ n > 7.8883, i.e., n ≈ 8.
Illustrations:
1. Stopping criterion: We can use the following stopping criteria:

|xn+1 − xn| < ε,    or    |(xn+1 − xn)/xn+1| < ε,

where ε is the prescribed accuracy.
2. We can combine the secant method with the bisection method and bracket the root, i.e., we
choose initial approximations x0 and x1 in such a manner that f(x0)f(x1) < 0, and at each stage we
keep the root bracketed. The resulting method is known as the ‘Method of False Position’ or ‘Regula
Falsi Method’.
Example 9. Apply the secant method to find the root of the equation e^x = cos x with relative error
less than 0.5%.
Sol. Let f(x) = e^x − cos x = 0. The successive iterations of the secant method are given by

xn+1 = xn − f(xn)(xn − xn−1)/(f(xn) − f(xn−1)), n = 1, 2, . . .

We take initial guesses x0 = −1.1 and x1 = −1, and let en = |(xn − xn−1)/xn| × 100% denote the
relative error at the n-th step. We obtain

x2 = −1.333521, e2 = 25.01%,
x3 = −1.286223, e3 = 3.68%,
x4 = −1.292594, e4 = 0.49%,
x5 = −1.292696, e5 = 0.008%.

The relative error is now below 0.5%, and we accept x5 = −1.292696 as the root with the prescribed
accuracy.
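A minimal Python sketch of the secant iteration (ours; the name secant and the stopping tolerance are our own choices, and the relative-change test assumes the iterates stay away from 0):

    import math

    def secant(f, x0, x1, rel_tol=0.005, max_iter=50):
        # Secant iterations until the relative change is below rel_tol (0.5% here).
        f0, f1 = f(x0), f(x1)
        for _ in range(max_iter):
            x2 = x1 - f1 * (x1 - x0) / (f1 - f0)
            if abs((x2 - x1) / x2) < rel_tol:
                return x2
            x0, f0 = x1, f1
            x1, f1 = x2, f(x2)
        return x1

    print(secant(lambda x: math.exp(x) - math.cos(x), -1.1, -1.0))
    # ~ -1.2927, the root of e^x = cos x treated in Example 9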
Example 10. Let f ∈ C^2[a, b]. If α is a simple root of f(x) = 0, then show that the sequence {xn}
generated by the secant method has order of convergence 1.618.
Sol. We assume that α is a simple root of f(x) = 0; then f(α) = 0 and f′(α) ≠ 0.
Let xn = α + en, where en is the error at the n-th step. An iterative method is said to have order of
convergence p if

|xn+1 − α| = C |xn − α|^p, or equivalently |en+1| = C |en|^p.

Successive iterations of the secant method are given by

xn+1 = xn − f(xn)(xn − xn−1)/(f(xn) − f(xn−1)), n = 1, 2, . . .

The error equation is

en+1 = en − f(α + en)(en − en−1)/(f(α + en) − f(α + en−1)).

Expanding f(α + en) and f(α + en−1) in Taylor series about α (and using f(α) = 0), we obtain

en+1 = en − (en − en−1)(en f′(α) + (1/2)en^2 f′′(α) + . . . ) / ((en − en−1)(f′(α) + (1/2)(en + en−1) f′′(α) + . . . ))
     = en − (en + (1/2)en^2 f′′(α)/f′(α) + . . . )(1 + (1/2)(en−1 + en) f′′(α)/f′(α) + . . . )^(−1)
     = en − (en + (1/2)en^2 f′′(α)/f′(α) + . . . )(1 − (1/2)(en−1 + en) f′′(α)/f′(α) + . . . )
     = (1/2)(f′′(α)/f′(α)) en en−1 + O(en^2 en−1 + en en−1^2).

Therefore

en+1 ≈ A en en−1, where A = (1/2) f′′(α)/f′(α).

This relation is called the error equation. Now, by the definition of the order of convergence, we
expect a relation of the type

en+1 = C en^p.

Stepping one index down, en = C en−1^p, so en−1 = C^(−1/p) en^(1/p). Hence

C en^p = A en C^(−1/p) en^(1/p) =⇒ en^p = A C^(−(1+1/p)) en^(1+1/p).

Comparing the powers of en on both sides gives p = 1 + 1/p, i.e., p^2 − p − 1 = 0, whose positive root
is

p = (1 + √5)/2 ≈ 1.618.

Hence the secant method has order of convergence 1.618.
Example 11. A calculator is defective: it can only add, subtract, and multiply. Use Newton's
method on this defective calculator to find 1/1.37 correct to 5 decimal places.
Sol. We consider

x = 1/1.37, i.e., f(x) = 1/x − 1.37 = 0.

We have f′(x) = −1/x^2, and therefore Newton's method yields the iteration

xn+1 = xn − f(xn)/f′(xn) = xn(2 − 1.37xn).

Note that the expression xn(2 − 1.37xn) can be evaluated on our defective calculator, since it only
involves multiplication and subtraction. The choice x0 = 1 works, and we get

x1 = 0.63, x2 = 0.716247, x3 = 0.729670622, x4 = 0.729926917, x5 = 0.729927007.

Since the fourth and fifth iterates agree to five decimal places, we accept 0.729927007 as a solution of
f(x) = 0 correct to at least five decimal places.
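The same multiply-and-subtract iteration in Python (our sketch; the name reciprocal is hypothetical):

    def reciprocal(b, x0=1.0, n=6):
        # Approximate 1/b using only multiply and subtract: x <- x(2 - b x).
        x = x0
        for _ in range(n):
            x = x * (2 - b * x)
        return x

    print(reciprocal(1.37))   # ~0.729927007, i.e. 1/1.37

For this iteration the starting value must satisfy 0 < x0 < 2/b, which x0 = 1 does here.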
Example 13. Find the point on the curve y = ln x that is closest to the origin.
Sol. Let (x, ln x) be a general point on the curve, and let S(x) be the square of the distance from
(x, ln x) to the origin. Then

S(x) = x^2 + ln^2 x.

We want to minimize the distance, which is equivalent to minimizing the square of the distance. Now
the minimization proceeds in the usual way. Note that S(x) is only defined when x > 0. We have

S′(x) = 2x + 2 ln x/x = (2/x)(x^2 + ln x).

Our problem thus comes down to solving the equation S′(x) = 0. We could use Newton's method
directly on S′(x), but the calculations are more pleasant if we observe that S′(x) = 0 is equivalent to
x^2 + ln x = 0. Let f(x) = x^2 + ln x. Then f′(x) = 2x + 1/x, and we get the recurrence relation

xk+1 = xk − (xk^2 + ln xk)/(2xk + 1/xk), k = 0, 1, · · ·

We need a suitable starting point x0. Experimentation with a calculator suggests taking x0 = 0.65.
Then x1 = 0.6529181 and x2 = 0.65291864. Since x1 agrees with x2 to 5 decimal places, we can
conclude that, to 5 places, the minimum distance occurs at x = 0.65292.
Example 14. The function f(x) = sin x has a zero on the interval (3, 4), namely x = π. Perform
three iterations of Newton's method to approximate this zero, using x0 = 4. Determine the absolute
error in each of the computed approximations. What is the apparent order of convergence?
Sol. Consider f(x) = sin x. In the interval (3, 4), f has the zero α = π. Also, f′(x) = cos x.
Newton's iterations are given by

xn+1 = xn − f(xn)/f′(xn), n ≥ 0.

With x0 = 4, we have

x1 = x0 − sin 4/cos 4 = 2.8422,
x2 = x1 − sin 2.8422/cos 2.8422 = 3.1509,
x3 = x2 − sin 3.1509/cos 3.1509 = 3.1416.

The absolute errors are:

e0 = |x0 − α| = 0.8584,
e1 = |x1 − α| = 0.2994,
e2 = |x2 − α| = 0.0093,
e3 = |x3 − α| = 2.6876 × 10^(−7).

If p is the order of convergence, then e2/e1 ≈ (e1/e0)^p. The corresponding estimates of the order of
convergence are

p = ln(e2/e1)/ln(e1/e0) = ln(0.0093/0.2994)/ln(0.2994/0.8584) = 3.296,
p = ln(e3/e2)/ln(e2/e1) = ln(2.6876 × 10^(−7)/0.0093)/ln(0.0093/0.2994) = 3.010.

We observe better than third-order convergence, which is better than the theoretical (quadratic)
bound; this happens because f′′(π) = −sin π = 0, so the quadratic error term of Newton's method
vanishes at this particular root.
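The empirical order estimate is easy to reproduce in Python (our sketch; the helper name newton is our own choice):

    import math

    def newton(f, fprime, x0, n):
        # returns the list [x0, x1, ..., xn] of Newton iterates
        xs = [x0]
        for _ in range(n):
            xs.append(xs[-1] - f(xs[-1]) / fprime(xs[-1]))
        return xs

    xs = newton(math.sin, math.cos, 4.0, 3)
    errs = [abs(x - math.pi) for x in xs]
    for i in range(2, len(errs)):
        # empirical order: p ~ ln(e_n/e_{n-1}) / ln(e_{n-1}/e_{n-2})
        print(math.log(errs[i] / errs[i - 1]) / math.log(errs[i - 1] / errs[i - 2]))
    # prints roughly 3.3 and 3.0, matching the hand computation above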
4.4. Newton's method for multiple roots. Let α be a root of f(x) = 0 with multiplicity m. In
this case we can write

f(x) = (x − α)^m φ(x),

so that

f(α) = f′(α) = · · · = f^(m−1)(α) = 0, f^(m)(α) ≠ 0.

Recall that we can regard Newton's method as a fixed point method:

xn+1 = g(xn), g(x) = x − f(x)/f′(x).

Substituting f(x) = (x − α)^m φ(x), we obtain

g(x) = x − (x − α)^m φ(x) / (m(x − α)^(m−1) φ(x) + (x − α)^m φ′(x))
     = x − (x − α) φ(x) / (mφ(x) + (x − α)φ′(x)).

Therefore we obtain

g′(α) = 1 − 1/m.
For m > 1, this is nonzero, and therefore Newton's method is only linearly convergent.
There are ways of improving the speed of convergence of Newton's method, creating a modified
method that is again quadratically convergent. In particular, consider the fixed point iteration formula

xn+1 = g(xn), g(x) = x − m f(x)/f′(x),

in which we assume the multiplicity m of the root α being sought is known. Then, modifying the
above argument on the convergence of Newton's method, we obtain

g′(α) = 1 − m(1/m) = 0,

and the iteration method will be quadratically convergent. But most of the time we do not know the
multiplicity.
One method of handling the problem of multiple roots of a function f is to define

µ(x) = f(x)/f′(x).

If α is a zero of f of multiplicity m with f(x) = (x − α)^m φ(x), then

µ(x) = (x − α)^m φ(x) / (m(x − α)^(m−1) φ(x) + (x − α)^m φ′(x))
     = (x − α) φ(x) / (mφ(x) + (x − α)φ′(x))

also has a zero at α. However, φ(α) ≠ 0, so

φ(α)/(mφ(α) + (α − α)φ′(α)) = 1/m ≠ 0,

and α is a simple zero of µ(x). Newton's method can then be applied to µ(x) to give

g(x) = x − µ(x)/µ′(x) = x − (f(x)/f′(x)) / (([f′(x)]^2 − f(x)f′′(x))/[f′(x)]^2),

which simplifies to

g(x) = x − f(x)f′(x)/([f′(x)]^2 − f(x)f′′(x)).

If g has the required continuity conditions, functional iteration applied to g will be quadratically
convergent regardless of the multiplicity of the zero of f. Theoretically, the only drawback to this
method is the additional calculation of f′′(x) and the more laborious procedure of calculating the
iterates. In practice, however, multiple roots can cause serious round-off problems because the
denominator of the above expression consists of the difference of two numbers that are both close to 0.
Example 15. Let f(x) = e^x − x − 1. Show that f has a zero of multiplicity 2 at x = 0. Show that
Newton's method with x0 = 1 converges to this zero but not quadratically.
Sol. We have f(x) = e^x − x − 1, f′(x) = e^x − 1 and f′′(x) = e^x.
Now f(0) = 1 − 0 − 1 = 0, f′(0) = 1 − 1 = 0 and f′′(0) = 1 ≠ 0. Therefore f has a zero of
multiplicity 2 at x = 0.
Starting with x0 = 1, Newton's iterations xn+1 = xn − f(xn)/f′(xn) give

x1 = 0.58198, x2 = 0.31906, x3 = 0.16800, x4 = 0.08635, x5 = 0.04380, x6 = 0.02206.

Using the modified Newton's method

xn+1 = xn − f(xn)f′(xn)/([f′(xn)]^2 − f(xn)f′′(xn)),

starting with x0 = 1.0, we obtain

x1 = −0.023421, x2 = −0.0084527, x3 = −0.000011889.

We observe that the modified Newton's method converges much faster to the root 0.
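A small Python sketch (ours) contrasting plain Newton with the multiplicity-corrected step x ← x − m f(x)/f′(x) for this double root; here the multiplicity m = 2 is known, and the function name is hypothetical:

    import math

    def newton_multiplicity(m, x0, steps):
        # Newton for f(x) = e^x - x - 1 (double root at x = 0).
        # m = 1 is plain Newton; m = 2 is the corrected step x <- x - m f(x)/f'(x).
        x = x0
        for _ in range(steps):
            fp = math.exp(x) - 1.0
            if fp == 0.0:            # f'(x) rounds to 0 essentially only at the root
                break
            x -= m * (math.exp(x) - x - 1.0) / fp
        return x

    print(newton_multiplicity(1, 1.0, 4))   # ~0.086: the error only roughly halves per step
    print(newton_multiplicity(2, 1.0, 4))   # very close to 0: quadratic convergence restored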
Example 16. The equation f(x) = x^3 − 7x^2 + 16x − 12 = 0 has a double root at x = 2.0. Starting
with x0 = 1, find the root correct to three decimals with Newton's method and its modified version.
Sol. First we apply the simple Newton's method, whose successive iterations are given by

xn+1 = xn − (xn^3 − 7xn^2 + 16xn − 12)/(3xn^2 − 14xn + 16), n = 0, 1, 2, . . .

Starting with x0 = 1.0, we obtain

x1 = 1.4, x2 = 1.652632, x3 = 1.806484, x4 = 1.89586,
x5 = 1.945653, x6 = 1.972144, x7 = 1.985886, x8 = 1.992894,
x9 = 1.996435, x10 = 1.998214, x11 = 1.999106, x12 = 1.999553.

The root correct to 3 decimal places is x12 ≈ 2.000.
If we apply the modified Newton's method (with multiplicity m = 2), then

xn+1 = xn − 2(xn^3 − 7xn^2 + 16xn − 12)/(3xn^2 − 14xn + 16), n = 0, 1, 2, . . .

Starting with x0 = 1.0, we obtain

x1 = 1.8, x2 = 1.984615, x3 = 1.999884.

The root correct to 3 decimal places is 2.000, and in this case we need far fewer iterations to reach
the desired accuracy.
We end this chapter by solving an example with all three methods studied previously.
Example 17. The function f(x) = tan πx − 6 has a zero at (1/π) arctan 6 ≈ 0.447431543. Use eight
iterations of each of the following methods to approximate this root. Which method is most
successful and why?
a. Bisection method on the interval [0, 1].
b. Secant method with x0 = 0 and x1 = 0.48.
c. Newton's method with x0 = 0.4.
Sol. It is important to note that f has several roots on the interval [0, 5] (make a plot to see this).
a. Since f has several roots in [0, 5], the bisection method could converge to a different root on a
larger interval; it is therefore a better idea to choose the interval [0, 1]. In this case we have the
following results; after 8 iterations the answer is 0.447265625.

n    a           b            c
0    0           1            0.5
1    0           0.5          0.25
2    0.25        0.5          0.375
3    0.375       0.5          0.4375
4    0.4375      0.5          0.46875
5    0.4375      0.46875      0.453125
6    0.4375      0.453125     0.4453125
7    0.4453125   0.453125     0.44921875
8    0.4453125   0.44921875   0.447265625
Exercises
(1) Use the bisection method to find solutions accurate to within 10^(−3) for the following problems.
(a) x − 2^(−x) = 0 for 0 ≤ x ≤ 1.
(b) e^x − x^2 + 3x − 2 = 0 for 0 ≤ x ≤ 1.
(c) x + 1 − 2 sin(πx) = 0 for 0 ≤ x ≤ 0.5 and 0.5 ≤ x ≤ 1.
(2) Find an approximation to ∛25 correct to within 10^(−3) using the bisection algorithm.
(3) Find a bound for the number of iterations needed to achieve an approximation by the bisection
method with accuracy 10^(−2) to the solution of x^3 − x − 1 = 0 lying in the interval [1, 2]. Find
an approximation to the root with this degree of accuracy.
(4) Sketch the graphs of y = x and y = 2 sin x. Use the bisection method to find an approximation
to within 10^(−3) to the first positive value of x with x = 2 sin x.
(5) The function defined by f (x) = sin(πx) has zeros at every integer. Show that when −1 < a < 0
and 2 < b < 3, the bisection method converges to
(a) 0, if a + b < 2
(b) 2, if a + b > 2
(c) 1, if a + b = 2.
(6) Let f(x) = (x + 2)(x + 1)^2 x (x − 1)^3 (x − 2). To which zero of f does the bisection method
converge when applied on the following intervals?
(a) [−1.5, 2.5]
(b) [−0.5, 2.4]
(c) [−0.5, 3]
(d) [−3, −0.5].
(7) For each of the following equations, use the given interval or determine an interval [a, b] on
which fixed-point iteration will converge. Estimate the number of iterations necessary to obtain
approximations accurate to within 10^(−2), and perform the calculations.
(a) x = 2 + 5/x^2.
(b) 2 + sin x − x = 0 in the interval [2, 3].
(c) 3x^2 − e^x = 0.
(d) x − cos x = 0.
(8) Use the fixed-point iteration method to find smallest and second smallest positive roots of the
equation tan x = 4x, correct to 4 decimal places.
(9) Show that g(x) = π + 0.5 sin(x/2) has a unique fixed point on [0, 2π]. Use fixed-point iteration
to find an approximation to the fixed point that is accurate to within 10^(−2). Also estimate the
number of iterations required to achieve 10^(−2) accuracy, and compare this theoretical estimate
to the number actually needed.
(10) Find all the zeros of f(x) = x^2 + 10 cos x by using the fixed-point iteration method for an
appropriate iteration function g. Find the zeros accurate to within 10^(−2).
(11) Let A be a given positive constant and g(x) = 2x − Ax^2.
(a) Show that if fixed-point iteration converges to a nonzero limit, then the limit is α = 1/A,
so the inverse of a number can be found using only multiplications and subtractions.
(b) Find an interval about 1/A for which fixed-point iteration converges, provided x0 is in
that interval.
(12) Consider the root-finding problem f(x) = 0 with root α, with f′(x) ≠ 0. Convert it to the
fixed-point problem

x = x + cf(x) = g(x)
Suppose the particle has moved 1.7 ft in 1 s. Find, to within 10^(−5), the rate ω at which θ
changes. Assume that g = 32.17 ft/s^2.
(24) An object falling vertically through the air is subjected to viscous resistance as well as to the
force of gravity. Assume that an object with mass m is dropped from a height s0 and that the
height of the object after t seconds is

s(t) = s0 − (mg/k)t + (m^2 g/k^2)(1 − e^(−kt/m)),

where g = 32.17 ft/s^2 and k represents the coefficient of air resistance in lb-s/ft. Suppose
s0 = 300 ft, m = 0.25 lb, and k = 0.1 lb-s/ft. Find, to within 0.01 s, the time it takes this
quarter-pounder to hit the ground.
(25) The circle below has radius 1, and the longer circular arc joining A and B is twice as long as
the chord AB. Find the length of the chord AB, correct to four decimal places. Use Newton’s
method.
(26) It costs a firm C(q) dollars to produce q grams per day of a certain chemical, where
C(q) = 1000 + 2q + 3q^(2/3).
The firm can sell any amount of the chemical at $4 a gram. Find the break-even point of the
firm, that is, how much it should produce per day in order to have neither a profit nor a loss.
Use Newton's method and give the answer to the nearest gram.
Appendix A. Algorithms
Algorithm (Bisection):
To determine a root of f(x) = 0 that is accurate within a specified tolerance value ε, given values a
and b such that f(a)f(b) < 0:
Repeat:
  Define c = (a + b)/2.
  If f(a)f(c) < 0, then set b = c; otherwise set a = c.
Until |a − b| ≤ ε (tolerance value).
Print root as c.
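A direct Python translation of this algorithm (our sketch; it assumes the root is bracketed):

    def bisection(f, a, b, eps):
        # Runnable version of the bisection algorithm above.
        if f(a) * f(b) >= 0:
            raise ValueError("root not bracketed")
        while abs(b - a) > eps:
            c = a + (b - a) / 2          # midpoint, written to avoid overflow
            if f(a) * f(c) < 0:
                b = c
            else:
                a = c
        return a + (b - a) / 2

    # example: the root of x^3 - 7x^2 + 14x - 6 on [0, 1] from Example 2
    print(bisection(lambda x: x**3 - 7*x**2 + 14*x - 6, 0.0, 1.0, 1e-6))  # ~0.585786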
Algorithm (Fixed-point):
To find a solution to x = g(x) given an initial approximation x0 .
Input: Initial approximation x0 , tolerance value �, maximum number of iterations N .
Output: Approximate solution α or message of failure.
Step 1: Set i = 1.
Step 2: While i ≤ N do Steps 3 to 6.
Step 3: Set x1 = g(x0 ). (Compute xi .)
Step 4: If |x1 − x0 | ≤ ε or |x1 − x0 |/|x1 | < ε, then OUTPUT x1 ; (The procedure was successful.)
STOP.
Step 5: Set i = i + 1.
Step 6: Set x0 =x1 . (Update x0 .)
Step 7: Print the output and STOP.
Algorithm (Secant):
1. Give inputs and take two initial guesses x0 and x1 .
2. Start iterations
x2 = x1 − f1 (x1 − x0 )/(f1 − f0 ).
3. If |x2 − x1 | < ε or |x2 − x1 |/|x2 | < ε,
then stop and print the root.
4. Otherwise set x0 = x1 , x1 = x2 and repeat the iterations (step 2). Also check whether the number
of iterations has exceeded the maximum number of iterations.
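In Python the secant iteration might look like this; it is a sketch only, and the stopping tests mirror step 3 above.

def secant(f, x0, x1, eps=1e-8, max_iter=100):
    f0, f1 = f(x0), f(x1)
    for _ in range(max_iter):
        x2 = x1 - f1 * (x1 - x0) / (f1 - f0)    # secant update
        if abs(x2 - x1) < eps or (x2 != 0 and abs(x2 - x1) / abs(x2) < eps):
            return x2
        x0, f0 = x1, f1                         # shift the two iterates
        x1, f1 = x2, f(x2)
    raise RuntimeError("maximum number of iterations exceeded")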
Bibliography
[Burden] Richard L. Burden, J. Douglas Faires and Annette Burden, “Numerical Analysis,” Cengage
Learning, 10th edition, 2015.
[Atkinson] K. Atkinson and W. Han, “Elementary Numerical Analysis,” John Wiley and Sons, 3rd
edition, 2004.
CHAPTER 3 (4 LECTURES)
DIRECT METHODS FOR SOLVING LINEAR SYSTEMS
1. Introduction
Systems of simultaneous linear equations are associated with many problems in engineering and
science, as well as with applications to the social sciences and the quantitative study of business and
economic problems. These problems occur in a wide variety of disciplines, directly in real-world problems
as well as in the solution process for other problems.
The principal objective of this chapter is to discuss the numerical aspects of solving linear systems of
equations having the form
a11 x1 + a12 x2 + · · · + a1n xn = b1
a21 x1 + a22 x2 + · · · + a2n xn = b2
(1.1)
................................................
an1 x1 + an2 x2 + · · · + ann xn = bn .
This is a linear system of n equations in n unknowns x1 , x2 , · · · , xn . This system can simply be written
in the matrix equation form Ax = b:

    a11 a12 · · · a1n     x1     b1
    a21 a22 · · · a2n     x2     b2
     ..  ..        ..  ×  ..  =  ..          (1.2)
    an1 an2 · · · ann     xn     bn
This equation has a unique solution x = A⁻¹b when the coefficient matrix A is non-singular. Unless
otherwise stated, we shall assume that this is the case under discussion. If A⁻¹ is already available,
then x = A−1 b provides a good method of computing the solution x.
If A⁻¹ is not available, then in general A⁻¹ should not be computed solely for the purpose of obtaining
x. More efficient numerical procedures will be developed in this chapter. We study broadly two
categories, direct and iterative methods, and we start with direct methods to solve the linear system in
this chapter.
2. Gaussian Elimination
Direct methods, which are techniques that give a solution in a fixed number of steps, subject only to
round-off errors, are considered in this chapter. Gaussian elimination is the principal tool in the direct
solution of system (1.2). The method is named after Carl Friedrich Gauss (1777-1855).
To solve a larger system of linear equations, we consider the following n × n system
As a22 = −11/3 ≠ 0, we eliminate the entry in the a32 position by taking the multiplier m32 = 7/11
and replacing E3 by E3 − (7/11)E2 , to get

    3     2      −1      1
    0   −11/3    7/3     5/3   .
    0     0      2/11    14/11
Obviously, the original set of equations has been transformed to an upper-triangular form. Since all
the diagonal elements of the resulting upper-triangular matrix are nonzero, the coefficient matrix of
the given system is nonsingular, and therefore the given system has a unique solution.
Now backward substitution gives
(2/11)x3 = 14/11, so x3 = 7;
(−11/3)x2 = 5/3 − (7/3)x3 = −44/3, so x2 = 4;
3x1 = 1 − 2x2 + x3 = 0, so x1 = 0.
Partial Pivoting: In the elimination process, we divide by aii at each stage and assume that aii ≠ 0.
These elements are known as pivot elements. If at any stage of elimination one of the pivots becomes
small (or zero), then we bring another element into the pivot position by interchanging rows. This
process is called Gauss elimination with partial pivoting.
Example 2. Solve the system of equations using Gauss elimination. This system has exact solution
(known from other sources!) x1 = 2.6, x2 = −3.8, x3 = −5.0.
In this case (after interchanging rows), the multiplier is m32 = 0.00005999 and we write the new E3 as
E3 − 0.00005999 E2 , to get
6.000 2.000 2.000 −2.000
0.0 1.667 −1.337 0.3334 .
0.0 0.0 −0.3332 1.667
Using back substitution, we obtain
x3 = −5.003
x2 = −3.801
x1 = 2.602.
We see that after partial pivoting, we get the desired solution.
Example 3. Given the linear system
x1 − x2 + αx3 = −2,
−x1 + 2x2 − αx3 = 3,
αx1 + x2 + x3 = 2.
a. Find value(s) of α for which the system has no solutions.
b. Find value(s) of α for which the system has an infinite number of solutions.
c. Assuming a unique solution exists for a given α, find the solution.
Sol. Augmented matrix is given by
1 −1 α −2
−1 2 −α 3
α 1 1 2
Multipliers are m21 = −1 and m31 = α. Performing E2 → E2 + E1 and E3 → E3 − αE1 to obtain
    1    −1      α        −2
    0     1      0         1
    0   1 + α  1 − α²   2(1 + α)
Multiplier is m32 = 1 + α and we perform E3 → E3 − m32 E2 .
    1   −1     α       −2
    0    1     0        1
    0    0   1 − α²   1 + α
a. If α = 1, then the last row of the reduced augmented matrix says that 0 · x3 = 2, so the system has
no solution.
b. If α = −1, then the last row says that 0 · x3 = 0, so the system has infinitely many solutions.
c. If α ≠ ±1, then the system has a unique solution:
x3 = 1/(1 − α), x2 = 1, x1 = −1/(1 − α).
Remark 2.1. Unique solution, no solution, or infinite number of solutions:
(1) If there is a leading one in every column, then the system has a unique solution.
(2) If there is a row of zeros equal to a non-zero number on the right side, then the system has no
solution.
(3) If a homogeneous system (i.e., a system in which all the equations equal zero) does not have a
leading one in every column, or if there is a row of zeros, then the system has an infinite number of
solutions.
Example 4. Solve the system by Gauss elimination
4x1 + 3x2 + 2x3 + x4 = 1
3x1 + 4x2 + 3x3 + 2x4 = 1
2x1 + 3x2 + 4x3 + 3x4 = −1
x1 + 2x2 + 3x3 + 4x4 = −1.
Complete Pivoting: In the first stage of elimination, we search for the largest element in magnitude
in the entire matrix and bring it to the position of the first pivot. We repeat the same process at every
step of elimination. This process requires interchanging both rows and columns.
Scaled Partial Pivoting: In this approach, the algorithm selects the largest relative entry as the
pivot element at each stage of elimination. At the beginning, a scale factor must be computed for
each equation in the system. We define
si = max_{1≤j≤n} |aij |, 1 ≤ i ≤ n.
These numbers are recorded in the scale vector s = [s1 , s2 , · · · , sn ]. Note that the scale vector does not
change throughout the procedure.
In starting the forward elimination process, we do not arbitrarily use the first equation as the pivot
equation. Instead, we use the equation for which the ratio |ai1 |/si is greatest. We repeat the process
at the later stages of elimination with the same scale factors.
The scale factors are s1 = 4.21, s2 = 10.2, & s3 = 1.09. We need to pick the largest (2.11/4.21 =
0.501, 4.01/10.2 = 0.393, 1.09/1.09 = 1), which is the third entry, and interchange row 1 and row 3 and
interchange s1 and s3 to get
1.09 0.987 0.832 4.21
4.01 10.2 −1.12 −3.09
2.11 −4.21 0.921 2.01.
Now comparing (6.57/10.2 = 0.6444, 6.12/4.21 = 1.45), the second ratio is largest so we need to
interchange row 2 and row 3 and interchange scale factor accordingly.
1.09 0.987 0.832 4.21
0 −6.12 −0.689 −6.16
0 6.57 −4.18 −18.6.
by hand using scaled partial pivoting. Justify all row interchanges and write out the transformed matrix
after you finish working on each column.
and the scale factors are s1 = 13, s2 = 18, s3 = 6, & s4 = 12. We need to pick the largest (3/13, 6/18, 6/6, 12/12),
which is the third entry, and interchange row 1 and row 3 and interchange s1 and s3 to get
6 −2 2 4 16
−6 4 1 −18 −34
3 −13 9 3 −19
12 −8 6 10 26
second equation is now the pivot equation. We do not actually interchange the equations and use
the second equation as the first pivot equation. The rest of the calculations are as above for partial
pivoting. The computed values of x and y are correct to five significant digits.
2.1. Operation Counts. We count the number of operations required to solve the system Ax = b.
Both the amount of time required to complete the calculations and the subsequent round-off error
depend on the number of floating-point arithmetic operations needed to solve a routine problem.
In general, the amount of time required to perform a multiplication or division on a computer is
approximately the same and is considerably greater than that required to perform an addition or
subtraction. The actual differences in execution time, however, depend on the particular computing
system.
To demonstrate the counting operations for a given method, we will count the operations required to
solve a typical linear system of n equations in n unknowns using Gauss elimination Algorithm. We will
keep the count of the additions/subtractions separate from the count of the multiplications/divisions
because of the time differential.
The first step is to calculate the multipliers mji = aji /aii . Then the replacement of the equation Ej by
(Ej − mji Ei ) requires that mji be multiplied by each term in Ei and then each term of the resulting
equation be subtracted from the corresponding term in Ej .
The following table states the operation counts in going from A to U at each step 1, 2, · · · , n − 1.

Step     Number of divisions    Number of multiplications    Number of additions/subtractions
1        (n − 1)                (n − 1)²                     (n − 1)²
2        (n − 2)                (n − 2)²                     (n − 2)²
...      ...                    ...                          ...
n − 2    2                      4                            4
n − 1    1                      1                            1
Total:   n(n − 1)/2             n(n − 1)(2n − 1)/6           n(n − 1)(2n − 1)/6
Therefore the total number of additions/subtractions from A to U is n(n − 1)(2n − 1)/6 (I).
The total number of multiplications/divisions is n(n − 1)/2 + n(n − 1)(2n − 1)/6 = n(n² − 1)/3 (II).
Now we count the number of additions/subtractions and the number of multiplications/divisions for
the right-hand side vector b. We have:
Total number of additions/subtractions: (n − 1) + (n − 2) + · · · + 2 + 1 = n(n − 1)/2 (III).
Total number of multiplications/divisions: (n − 1) + (n − 2) + · · · + 2 + 1 = n(n − 1)/2 (IV).
Lastly we count the number of additions/subtractions and multiplications/divisions for finding the
solution from the back-substitution method. Recall that
xn = bn /ann , where bn = an,n+1 .
For each i = n − 1, n − 2, · · · , 2, 1, we have
xi = ( ai,n+1 − Σ_{j=i+1}^{n} aij xj ) / aii .
Therefore the total number of additions/subtractions is 0 + 1 + · · · + (n − 1) = n(n − 1)/2 (V).
The total number of multiplications/divisions is 1 + 2 + · · · + n = n(n + 1)/2 (VI).
Thus the total number of operations to obtain the solution of a system of n linear equations in n
variables using Gaussian elimination is:
Additions/Subtractions (I + III + V):
n(n − 1)(2n + 5)/6.
Multiplications/Divisions (II + IV + VI):
n(n² + 3n − 1)/3.
For large n, the total number of multiplications and divisions is approximately n3 /3, as is the total
number of additions and subtractions. Thus the amount of computation and the time required increases
with n in proportion to n3 , as shown in Table.
n Multiplications/Divisions Additions/Subtractions
3 17 11
10 430 375
50 44, 150 42, 875
100 343, 300 338, 250
3. The LU Factorization
In terms of matrix multiplication, another meaning can be given to Gauss elimination: the matrix A
can be factored into the product of two triangular matrices.
Let Ax = b be the system to be solved, where A is the n × n coefficient matrix. The linear system can
be reduced to the upper triangular system U x = g with

        u11  u12  · · ·  u1n
U =      0   u22  · · ·  u2n
         ..   ..          ..
         0    0   · · ·  unn
Here the entries uij are those produced at the end of the elimination. Introduce an auxiliary lower
triangular matrix L based on the multipliers mij as follows:
      1     0    0   · · ·     0
     m21    1    0   · · ·     0
L =  m31   m32   1   · · ·     0
      ..    ..   ..           ..
     mn1   mn2  · · ·  mn,n−1  1
Theorem 3.1. Let A be a non-singular matrix and let L and U be defined as above. If U is produced
without pivoting then
LU = A.
This is called LU factorization of A.
We can use Gaussian elimination to solve a system by LU decomposition. Suppose that A has been
factored into the triangular form A = LU , where L is lower triangular and U is upper triangular. Then
we can solve for x more easily by using a two-step process.
First we let y = U x and solve the lower triangular system Ly = b for y. Once y is known, the upper
triangular system U x = y provides the solution x. One can check that the total number of operations
is the same as for Gauss elimination.
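The two-step solve can be sketched in Python as follows, assuming L (unit lower triangular) and U are already available from the elimination; this is an illustration of ours, not the notes' own code.

import numpy as np

def lu_solve(L, U, b):
    n = len(b)
    y = np.zeros(n)
    for i in range(n):                               # forward substitution: Ly = b
        y[i] = (b[i] - L[i, :i] @ y[:i]) / L[i, i]   # L[i, i] = 1 here
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):                   # back substitution: Ux = y
        x[i] = (y[i] - U[i, i+1:] @ x[i+1:]) / U[i, i]
    return x

# With L and U from Example 8 below, lu_solve(L, U, np.array([3., 13., 4.]))
# should return [3., 4., -2.].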
Example 8. We wish to solve the following system of linear equations using LU decomposition.
x1 + 2x2 + 4x3 = 3
3x1 + 8x2 + 14x3 = 13
2x1 + 6x2 + 13x3 = 4.
Find the matrices L and U using Gauss elimination and using those values of L and U , solve the
system of equations.
Sol. We first apply the Gaussian elimination on the matrix A and collect the multipliers m21 , m31 ,
and m32 .
We have
1 2 4
A = 3 8 14
2 6 13
Multipliers are m21 = 3, m31 = 2.
E2 → E2 − 3E1 and E3 → E3 − 2E1 .
1 2 4
∼ 0 2 2
0 2 5
Multiplier is m32 = 2/2 = 1 and we perform E3 → E3 − E2 .
1 2 4
∼ 0 2 2
0 0 3
We observe that m21 = 3, m31 = 2, and m32 = 1. Therefore,
1 2 4 1 0 0 1 2 4
A = 3 8 14 = 3 1 0 0 2 2 = LU
2 6 13 2 1 1 0 0 3
Therefore,
Ax = b =⇒ LU x = b.
Assuming U x = y, we obtain,
Ly = b
i.e.
1 0 0 y1 3
3 1 0 y2 = 13 .
2 1 1 y3 4
Using forward substitution, we obtain y1 = 3, y2 = 4, and y3 = −6. Now
             1 2 4    x1      3
U x = y =⇒   0 2 2    x2  =   4 .
             0 0 3    x3     −6
Now, using the backward substitution process, we obtain the final solution as x3 = −2, x2 = 4, and
x1 = 3.
Example 9. (a) Determine the LU factorization for matrix A in the linear system Ax = b, where
     1   1   0   3             1
     2   1  −1   1             1
A =  3  −1  −1   2   and b =  −3 .     (3.1)
    −1   2   3  −1             4
(b) Then use the factorization to solve the system
x1 + x2 + 3x4 = 8
2x1 + x2 − x3 + x4 = 7
3x1 − x2 − x3 + 2x4 = 14
−x1 + 2x2 + 3x3 − x4 = −7.
Sol. (a) We take the coefficient matrix and apply Gauss elimination.
Multipliers are m21 = 2, m31 = 3, and m41 = −1.
Sequence of operations E2 → E2 − 2E1 , E3 → E3 − 3E1 , E4 → E4 − (−1)E1 .
1 1 0 3
0 −1 −1 −5
0 −4 −1 −7 .
0 3 3 2
Multipliers are m32 = 4 and m42 = −3.
E3 → E3 − 4E2 , E4 → E4 − (−3)E2 .
    1   1   0    3
    0  −1  −1   −5
∼   0   0   3   13  .
    0   0   0  −13
The multipliers mij and the upper triangular matrix produce the following factorization:

     1   1   0   3      1   0  0  0     1   1   0    3
A =  2   1  −1   1   =  2   1  0  0     0  −1  −1   −5   = LU.
     3  −1  −1   2      3   4  1  0     0   0   3   13
    −1   2   3  −1     −1  −3  0  1     0   0   0  −13
(b)
               1   0  0  0     1   1   0    3     x1      8
Ax = (LU )x =  2   1  0  0     0  −1  −1   −5     x2  =   7 .
               3   4  1  0     0   0   3   13     x3     14
              −1  −3  0  1     0   0   0  −13     x4     −7
We first introduce the substitution y = U x. Then b = L(U x) = Ly. That is,

      1   0  0  0    y1      8
Ly =  2   1  0  0    y2  =   7 .
      3   4  1  0    y3     14
     −1  −3  0  1    y4     −7
This system is solved for y by a simple forward-substitution process:
Sol. (a) We have already counted the mathematical operation in detail in Gauss elimination. Here we
provide the same for LU factorization.
We found that the total number of additions/subtractions from A to U is n(n − 1)(2n − 1)/6 =
(1/3)n³ − (1/2)n² + (1/6)n, and the total number of multiplications/divisions is n(n − 1)/2 +
n(n − 1)(2n − 1)/6 = n(n² − 1)/3 = (1/3)n³ − (1/3)n.
These counts remain the same when factorizing the matrix A into L and U .
(b) Solving Ly = b, where L is a lower-triangular matrix with lii = 1 for all i, requires:
Total number of additions/subtractions: 0 + 1 + · · · + (n − 1) = n(n − 1)/2.
Total number of multiplications/divisions: 0 + 1 + · · · + (n − 1) = n(n − 1)/2.
Please note that these operations can be counted in a similar manner as discussed for back substitution.
As lii is always 1, one division is saved at each step. These counts are the same as those for processing
the right-hand side b.
(c) Finally, the counts for solving U x = y are the same as those required for back substitution.
Therefore, the total counts remain the same for LU decomposition and simple Gauss elimination.
Exercises
(1) Use Gaussian elimination with backward substitution and two-digit rounding arithmetic to
solve the following linear systems. Do not reorder the equations. (The exact solution to each
system is x1 = −1, x2 = 1, x3 = 3.)
(a)
−x1 + 4x2 + x3 = 8
(5/3)x1 + (2/3)x2 + (2/3)x3 = 1
2x1 + x2 + 4x3 = 11.
(b)
4x1 + 2x2 − x3 = −5
(1/9)x1 + (1/9)x2 − (1/3)x3 = −1
x1 + 4x2 + 2x3 = 9.
(2) Using the four-digit arithmetic solve the following system of equations by Gaussian elimination
with and without partial pivoting
This system has exact solution, rounded to four places x1 = 0.2245, x2 = 0.2814, x3 = 0.3279.
Compare your answers!
(3) Use the Gaussian elimination algorithm to solve the following linear systems, if possible, and
determine whether row interchanges are necessary:
(a)
x1 − x2 + 3x3 = 2
3x1 − 3x2 + x3 = −1
x1 + x2 = 3.
(b)
2x1 − x2 + x3 − x4 = 6
x2 − x3 + x4 = 5
x4 = 5
x3 − x4 = 3.
(4) Use Gaussian elimination and three-digit chopping arithmetic to solve the following linear
system, and compare the approximations to the actual solution [0, 10, 1/7]T .
3.03x1 − 12.1x2 + 14x3 = −119
−3.03x1 + 12.1x2 − 7x3 = 120
6.11x1 − 14.2x2 + 21x3 = −139.
(5) Repeat the above exercise using Gaussian elimination with partial and scaled partial pivoting
and three-digit rounding arithmetic.
(6) Suppose that
2x1 + x2 + 3x3 = 1
4x1 + 6x2 + 8x3 = 5
6x1 + αx2 + 10x3 = 5,
with |α| < 10. For which of the following values of α will there be no row interchange required
when solving this system using scaled partial pivoting?
(a) α = 6.
(b) α = 9.
(c) α = −3.
(7) Modify the LU Factorization Algorithm so that it can be used to solve a linear system, and
then solve the following linear systems.
2x1 − x2 + x3 = −1
3x1 + 3x2 + 9x3 = 0
3x1 + 3x2 + 5x3 = 4.
Appendix A. Algorithms
Algorithm (Gauss Elimination)
Input: number of unknowns and equations n; augmented matrix A = [aij ], where 1 ≤ i ≤ n and
1 ≤ j ≤ n + 1.
Output: solution x1 , x2 , · · · , xn or message that the linear system has no unique solution.
Step 1: For i = 1, · · · , n − 1 do Steps 2-4. (Elimination process.)
Step 2: Let p be the smallest integer with i ≤ p ≤ n and api ≠ 0.
If no integer p can be found
then OUTPUT (‘no unique solution exists’);
STOP.
Step 3: If p ≠ i then perform (Ep ) ↔ (Ei ).
Step 4: For j = i + 1, · · · , n do Steps 5 and 6.
Step 5: Set mji = aji /aii .
Step 6: Perform (Ej − mji Ei ) → (Ej );
Step 7: If ann = 0 then OUTPUT (‘no unique solution exists’);
STOP.
Step 8: Set xn = an,n+1 /ann . (Start backward substitution.)
Step 9: For i = n − 1, · · · , 1 set xi = [ ai,n+1 − Σ_{j=i+1}^{n} aij xj ] / aii .
Step 10: OUTPUT (x1 , · · · , xn ); (Procedure completed successfully.)
STOP.
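A direct Python transcription of this algorithm follows; it is our own sketch of the steps above, and partial pivoting could be substituted at Step 2 for better stability.

import numpy as np

def gauss_eliminate(A, b):
    A = np.array(A, dtype=float)
    b = np.array(b, dtype=float)
    n = len(b)
    for i in range(n - 1):                      # Step 1: elimination process
        nz = np.nonzero(A[i:, i])[0]            # Step 2: smallest p with A[p, i] != 0
        if nz.size == 0:
            raise ValueError("no unique solution exists")
        p = i + nz[0]
        if p != i:                              # Step 3: row interchange
            A[[i, p]] = A[[p, i]]
            b[[i, p]] = b[[p, i]]
        for j in range(i + 1, n):               # Steps 4-6: eliminate below the pivot
            m = A[j, i] / A[i, i]
            A[j, i:] -= m * A[i, i:]
            b[j] -= m * b[i]
    if A[n - 1, n - 1] == 0:                    # Step 7
        raise ValueError("no unique solution exists")
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):              # Steps 8-9: back substitution
        x[i] = (b[i] - A[i, i+1:] @ x[i+1:]) / A[i, i]
    return x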
Bibliography
[Burden] Richard L. Burden, J. Douglas Faires and Annette Burden, “Numerical Analysis,” Cengage
Learning, 10th edition, 2015.
[Atkinson] K. Atkinson and W. Han, “Elementary Numerical Analysis,” John Wiley and Sons, 3rd
edition, 2004.
CHAPTER 4 (5 LECTURES)
ITERATIVE TECHNIQUES IN MATRIX ALGEBRA
1. Introduction
In this chapter we will study iterative techniques to solve linear systems. An initial approximation
(or approximations) will be found, and new approximations are then determined based on how well
the previous approximations satisfied the equation. The objective is to find a way to minimize the
difference between the approximations and the exact solution. To discuss iterative methods for solving
linear systems, we first need to determine a way to measure the distance between n-dimensional column
vectors. This will permit us to determine whether a sequence of vectors converges to a solution of the
system. In actuality, this measure is also needed when the solution is obtained by the direct methods
presented in Chapter 3. Those methods required a large number of arithmetic operations, and using
finite-digit arithmetic leads only to an approximation to an actual solution of the system. We end
the chapter by presenting a way to find the dominant eigenvalue and an associated eigenvector. The
dominant eigenvalue plays an important role in the convergence of any iterative method.
1.1. Norms of Vectors and Matrices.
1.2. Vector Norms. Let Rn denote the set of all n−dimensional column vectors with real-number
components. To define a distance in Rn we use the notion of a norm, which is the generalization of
the absolute value on R, the set of real numbers.
Definition 1.1. A vector norm on Rn is a function ‖·‖ from Rn into R with the following properties:
(1) ‖x‖ ≥ 0 for all x ∈ Rn .
(2) ‖x‖ = 0 if and only if x = 0.
(3) ‖x + y‖ ≤ ‖x‖ + ‖y‖ for all x, y ∈ Rn (triangle inequality).
(4) ‖αx‖ = |α| ‖x‖ for all x ∈ Rn and α ∈ R.
Definition 1.2. The l2 and l∞ norms for the vector x = (x1 , x2 , . . . , xn )t are defined by
‖x‖2 = ( Σ_{i=1}^{n} |xi |² )^(1/2) and ‖x‖∞ = max_{1≤i≤n} |xi |.
Note that each of these norms reduces to the absolute value in the case n = 1.
The l2 norm is called the Euclidean norm of the vector x because it represents the usual notion of
distance from the origin in case x is in R1 = R, R2 , or R3 . For example, the l2 norm of the vector
x = (x1 , x2 , x3 )t gives the length of the straight line joining the points (0, 0, 0) and (x1 , x2 , x3 ).
Example 1. Determine the l2 norm and the l∞ norm of the vector x = (−1, 1, −2)t .
Sol. The vector x = (−1, 1, −2)t in R³ has norms
‖x‖2 = √((−1)² + (1)² + (−2)²) = √6
and
‖x‖∞ = max{|−1|, |1|, |−2|} = 2.
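Both norms are also available directly in numpy; a one-line check of Example 1 (our own illustration, not part of the notes):

import numpy as np
x = np.array([-1.0, 1.0, -2.0])
print(np.linalg.norm(x, 2))       # l2 norm: sqrt(6) ≈ 2.4495
print(np.linalg.norm(x, np.inf))  # l-infinity norm: 2.0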
Definition 1.3 (Distance between vectors in Rn). If x = (x1 , x2 , · · · , xn )t and y = (y1 , y2 , · · · , yn )t
are vectors in Rn , the l2 and l∞ distances between x and y are defined by
‖x − y‖2 = ( Σ_{i=1}^{n} (xi − yi )² )^(1/2) and ‖x − y‖∞ = max_{1≤i≤n} |xi − yi |.
Definition 1.4 (Matrix Norm). A matrix norm on the set of all n × n matrices is a real-valued
function, ‖·‖, defined on this set, satisfying for all n × n matrices A and B and all real numbers α:
(1) ‖A‖ ≥ 0,
(2) ‖A‖ = 0 if and only if A is O, the matrix with all 0 entries,
(3) ‖αA‖ = |α| ‖A‖,
(4) ‖A + B‖ ≤ ‖A‖ + ‖B‖,
(5) ‖AB‖ ≤ ‖A‖ ‖B‖.
If ‖·‖ is a vector norm on Rn , then
‖A‖ = max_{‖x‖=1} ‖Ax‖
is a matrix norm.
The matrix norms we will consider have the forms
‖A‖∞ = max_{‖x‖∞=1} ‖Ax‖∞ and ‖A‖2 = max_{‖x‖2=1} ‖Ax‖2 .
∴ lim_{k→∞} A^k = 0,
which implies that the matrix A is convergent.
Note that the convergent matrix A in this example has spectral radius ρ(A) = 1/2, because 1/2 is the
only eigenvalue of A. This illustrates an important connection that exists between the spectral radius
of a matrix and the convergence of the matrix, as detailed in the following result.
The proof of this theorem can be found in advanced texts of numerical analysis.
2. Iterative Methods
The linear system Ax = b may have a large order. For such systems Gauss elimination is often too
expensive in either computation time or computer memory requirements or both.
In an iterative method, a sequence of progressively better iterates is produced to approximate the solution.
Jacobi and Gauss-Seidel Method: We start with an example. Let us consider a system of equations
9x1 + x2 + x3 = 10
2x1 + 10x2 + 3x3 = 19
3x1 + 4x2 + 11x3 = 0.
One class of iterative method for solving this system as follows.
We write
x1 = (1/9)(10 − x2 − x3 )
x2 = (1/10)(19 − 2x1 − 3x3 )
x3 = (1/11)(0 − 3x1 − 4x2 ).
Let x(0) = [x1(0) x2(0) x3(0)] be an initial approximation of the solution x. Then define the iteration
x1(k+1) = (1/9)(10 − x2(k) − x3(k))
x2(k+1) = (1/10)(19 − 2x1(k) − 3x3(k))
x3(k+1) = (1/11)(0 − 3x1(k) − 4x2(k)), k = 0, 1, 2, . . . .
This is called Jacobi or method of simultaneous replacements. The method is named after German
mathematician Carl Gustav Jacob Jacobi.
We start with [0 0 0] and obtain
x1(1) = 1.1111, x2(1) = 1.900, x3(1) = 0.0,
x1(2) = 0.9000, x2(2) = 1.6778, x3(2) = −0.9939,
etc.
Another approach to solving the same system is the following.
x1(k+1) = (1/9)(10 − x2(k) − x3(k))
x2(k+1) = (1/10)(19 − 2x1(k+1) − 3x3(k))
x3(k+1) = (1/11)(0 − 3x1(k+1) − 4x2(k+1)), k = 0, 1, 2, . . . .
This method is called Gauss-Seidel or method of successive replacements. It is named after the German
mathematicians Carl Friedrich Gauss and Philipp Ludwig von Seidel. Starting with [0 0 0], we obtain
x1(1) = 1.1111, x2(1) = 1.6778, x3(1) = −0.9131,
x1(2) = 1.0262, x2(2) = 1.9687, x3(2) = −0.9588.
The Jacobi iterative method is obtained by solving the i-th equation for xi to obtain (provided aii ≠ 0)
xi = (1/aii ) [ Σ_{j=1, j≠i}^{n} (−aij xj ) + bi ].
For each k ≥ 0, generate the components xi(k+1) from xi(k) as
xi(k+1) = (1/aii ) [ Σ_{j=1, j≠i}^{n} (−aij xj(k)) + bi ].
Now we write the Gauss-Seidel scheme in matrix form. To obtain the matrix form, we take
aii xi(k+1) = − Σ_{j=1}^{i−1} aij xj(k+1) − Σ_{j=i+1}^{n} aij xj(k) + bi .
In matrix form
(D + L)x(k+1) = −U x(k) + b,
where D, L and U are the diagonal, strictly lower triangular and strictly upper triangular parts of A,
respectively. Hence
x(k+1) = −(D + L)⁻¹ U x(k) + (D + L)⁻¹ b
x(k+1) = Tg x(k) + B, k = 0, 1, 2, · · · .
Here Tg = −(D + L)⁻¹ U is called the iteration matrix, and B = (D + L)⁻¹ b.
Stopping Criteria: Since these techniques are iterative, we require a stopping criterion. If ε is the
desired accuracy, then we can use the following:
‖x(k) − x(k−1)‖∞ / ‖x(k)‖∞ < ε.
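A Python sketch of the Gauss-Seidel sweep with this stopping criterion follows (the names and defaults are ours; replacing x[:i] by x_old[:i] in the first sum gives the Jacobi method):

import numpy as np

def gauss_seidel(A, b, x0, eps=1e-6, N=100):
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float)
    x = np.array(x0, dtype=float)
    n = len(b)
    for k in range(N):
        x_old = x.copy()
        for i in range(n):
            s1 = A[i, :i] @ x[:i]          # already-updated components
            s2 = A[i, i+1:] @ x_old[i+1:]  # components from the previous sweep
            x[i] = (b[i] - s1 - s2) / A[i, i]
        if np.max(np.abs(x - x_old)) / np.max(np.abs(x)) < eps:
            return x
    raise RuntimeError("no convergence in N iterations")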
Example 5. Use the Gauss-Seidel method to approximate the solution of the following system:
4x1 + x2 − x3 = 3
2x1 + 7x2 + x3 = 19
x1 − 3x2 + 12x3 = 31.
Continue the iterations until two successive approximations are identical when rounded to three signif-
icant digits.
Sol. To begin, write the system in the form
x1 = (1/4)(3 − x2 + x3 )
x2 = (1/7)(19 − 2x1 − x3 )
x3 = (1/12)(31 − x1 + 3x2 ).
As
|a11 | = 4 > |a12 | + |a13 | = 1
|a22 | = 7 > |a21 | + |a23 | = 3
|a33 | = 12 > |a31 | + |a32 | = 2
which shows that the coefficient matrix is strictly diagonally dominant. Therefore the Gauss-Seidel
iterations will converge.
Starting with the initial vector x(0) = [0, 0, 0]t , the first approximation is
x1(1) = 0.7500
x2(1) = 2.5000
x3(1) = 3.1458.
Similarly
x(2) = [0.9115, 2.0045, 3.0085]t
2.1. Convergence analysis of iterative methods. To study the convergence of general iteration
techniques, we need to analyze the formula
x(k+1) = T x(k) + B, for each k = 0, 1, · · · ,
where x(0) is arbitrary. The next lemma and Theorem provide the key for this study.
Lemma 2.1. If the spectral radius ρ(T ) < 1, then (I − T )⁻¹ exists, and
(I − T )⁻¹ = I + T + T ² + · · · = Σ_{k=0}^{∞} T^k.
Theorem 2.2 (Necessary and sufficient condition). A necessary and sufficient condition for the con-
vergence of an iterative method is that the eigenvalue of iteration matrix T satisfy the inequality
ρ(T ) < 1.
Proof. Let ρ(T ) < 1. The sequence of vectors x(k) produced by the iterative method satisfies
x(k+1) = T x(k) + B.
Since the exact solution satisfies x = T x + B, subtracting gives
x − x(k) = T (x − x(k−1)) = T ²(x − x(k−2)) = · · · = T^k (x − x(0)).
Let z = x − x(0). Since ρ(T ) < 1, T^k z → 0 as k → ∞, and hence x(k) → x. Conversely, if the
method converges for every initial vector x(0), then T^k z → 0 for every z, which
=⇒ ρ(T ) < 1.
Theorem 2.3. If A is strictly diagonally dominant in Ax = b, then the iterative method always
converges for any initial starting vector.
x(k+1) = Tj x(k) + B.
The method is convergent iff ρ(Tj ) < 1. For the Jacobi method the iteration matrix is Tj = −D⁻¹(L + U ),
so by strict diagonal dominance
‖Tj ‖∞ = max_{1≤i≤n} Σ_{j≠i} |aij | / |aii | < 1.
This shows the convergence condition for the Jacobi method.
Further we prove the convergence of the Gauss-Seidel method. The Gauss-Seidel iterations are given by
x(k+1) = −(D + L)⁻¹ U x(k) + (D + L)⁻¹ b,
x(k+1) = Tg x(k) + B.
Let λ be an eigenvalue of the iteration matrix Tg and x a corresponding eigenvector, normalized so
that max_j |xj | = 1. Then
Tg x = λx
−(D + L)⁻¹ U x = λx
−U x = λ(D + L)x.
Componentwise, this reads
− Σ_{j=i+1}^{n} aij xj = λ [ Σ_{j=1}^{i} aij xj ], i = 1, 2, . . . , n,
− Σ_{j=i+1}^{n} aij xj = λ aii xi + λ Σ_{j=1}^{i−1} aij xj ,
λ aii xi = −λ Σ_{j=1}^{i−1} aij xj − Σ_{j=i+1}^{n} aij xj .
Choosing i so that |xi | = 1 and taking absolute values,
|λ| |aii | ≤ |λ| Σ_{j=1}^{i−1} |aij | + Σ_{j=i+1}^{n} |aij |
=⇒ |λ| ≤ Σ_{j=i+1}^{n} |aij | / ( |aii | − Σ_{j=1}^{i−1} |aij | ) < 1,
since strict diagonal dominance gives |aii | − Σ_{j=1}^{i−1} |aij | > Σ_{j=i+1}^{n} |aij |. Hence ρ(Tg ) < 1,
and the Gauss-Seidel iterations converge for strictly diagonally dominant A.
Eigenvalues are −1/2, −1/2, 0.
Thus ρ(Tg ) = 1/2 < 1.
The spectral radius of the iteration matrix is greater than one for the Jacobi method and less than
one for Gauss-Seidel. Therefore the Gauss-Seidel iterations converge.
The choice of relaxation factor ω is not necessarily easy, and depends upon the properties of the
coefficient matrix. If A is a symmetric and positive definite matrix and 0 < ω < 2, then the SOR
method converges for any choice of initial approximate vector x(0) .
Important Note: If a matrix A is symmetric, it is positive definite if and only if all its leading
principal submatrices (minors) have a positive determinant.
Example 7. Consider a linear system Ax = b, where
3 −1 1 −1
A = −1 3 −1 , b = 7
1 −1 3 −7
a. Check, that the SOR method with value ω = 1.25 of the relaxation parameter can be used to solve
this system.
b. Compute the first iteration by the SOR method starting at the point x(0) = (0, 0, 0)t .
Sol. a. Let us verify the sufficient condition for using the SOR method. We have to check whether the
matrix A is symmetric positive definite. A is symmetric as A = Aᵀ, so let us check positive definiteness:
det(3) = 3 > 0, det [ 3 −1 ; −1 3 ] = 8 > 0, det(A) = 20 > 0.
All leading principal minors are positive and so the matrix A is positive definite. We know, that for
symmetric positive definite matrices the SOR method converges for values of the relaxation parameter
ω from the interval 0 < ω < 2.
Therefore the SOR method with value ω = 1.25 can be used to solve this system.
b. The iterations of the SOR method are easier to compute by elements than in the vector form:
Write the system as equations and write down the equations for the Gauss-Seidel iterations
x1(k+1) = (−1 + x2(k) − x3(k))/3
x2(k+1) = (7 + x1(k+1) + x3(k))/3
x3(k+1) = (−7 − x1(k+1) + x2(k+1))/3.
Now multiply the right hand side by the parameter ω and add to it the vector x(k) from the previous
iteration multiplied by the factor of (1 − ω) :
x1(k+1) = (1 − ω)x1(k) + ω(−1 + x2(k) − x3(k))/3
x2(k+1) = (1 − ω)x2(k) + ω(7 + x1(k+1) + x3(k))/3
x3(k+1) = (1 − ω)x3(k) + ω(−7 − x1(k+1) + x2(k+1))/3.
For k = 0:
x1(1) = (1 − 1.25) · 0 + 1.25 · (−1 + 0 − 0)/3 = −0.41667
x2(1) = (1 − 1.25) · 0 + 1.25 · (7 − 0.41667 + 0)/3 = 2.7431
x3(1) = (1 − 1.25) · 0 + 1.25 · (−7 + 0.41667 + 2.7431)/3 = −1.6001.
The next three iterations are
x(2) = (1.4972, 2.1880, −2.2288)t ,
x(3) = (1.0494, 1.8782, −2.0141)t ,
x(4) = (0.9428, 2.0007, −1.9723)t .
The exact solution is x = (1, 2, −2)t .
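The SOR sweep of Example 7 can be written generically in Python as follows; this is a sketch of ours, and with ω = 1.25 it reproduces the iterates above.

import numpy as np

def sor(A, b, x0, omega=1.25, eps=1e-6, N=100):
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float)
    x = np.array(x0, dtype=float)
    n = len(b)
    for _ in range(N):
        x_old = x.copy()
        for i in range(n):
            gs = (b[i] - A[i, :i] @ x[:i] - A[i, i+1:] @ x_old[i+1:]) / A[i, i]
            x[i] = (1 - omega) * x_old[i] + omega * gs   # blend with the previous iterate
        if np.max(np.abs(x - x_old)) / np.max(np.abs(x)) < eps:
            return x
    return x

# With A = [[3,-1,1],[-1,3,-1],[1,-1,3]], b = [-1,7,-7], x0 = [0,0,0],
# the iterates approach the exact solution (1, 2, -2).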
Condition Numbers: The inequalities in the above theorem imply that ‖A⁻¹‖ and ‖A‖·‖A⁻¹‖ pro-
vide an indication of the connection between the residual vector and the accuracy of the approximation.
In general, the relative error ‖x − x̃‖/‖x‖ is of most interest, and this error is bounded by the product
of ‖A‖·‖A⁻¹‖ with the relative residual for this approximation, ‖r‖/‖b‖. Any convenient norm can be
used for this approximation; the only requirement is that it be used consistently throughout.
Definition 4.3. The condition number of the nonsingular matrix A relative to a norm ‖·‖ is
K(A) = ‖A‖·‖A⁻¹‖.
With this notation, the inequalities in the above theorem become
‖x − x̃‖ ≤ K(A) ‖r‖/‖A‖
and
‖x − x̃‖/‖x‖ ≤ K(A) ‖r‖/‖b‖.
For any nonsingular matrix A and natural norm ‖·‖,
1 = ‖I‖ = ‖A · A⁻¹‖ ≤ ‖A‖·‖A⁻¹‖ = K(A).
A matrix A is well-conditioned if K(A) is close to 1, and is ill-conditioned when K(A) is significantly
greater than 1. Conditioning in this context refers to the relative security that a small residual vector
implies a correspondingly accurate approximate solution. When K(A) is very large, the solution of
Ax = b will be very sensitive to relatively small changes in b; likewise, a relatively small residual may
well correspond to a relatively large error in x̃ as compared with x. These comments are also valid
when the changes are made to A rather than to b.
Example 9. Suppose x̄ = (0.98, 1.1)t is an approximate solution for the linear system Ax = b, where
A = [ 3.9 1.6 ; 6.8 2.9 ], b = (5.5, 9.7)t .
Find a bound for the relative error ‖x − x̄‖/‖x‖.
Sol. The residual is given by
r = b − Ax̄ = (5.5, 9.7)t − [ 3.9 1.6 ; 6.8 2.9 ](0.98, 1.1)t = (−0.0820, −0.1540)t .
The bound for the relative error is (for the infinity norm)
‖x − x̄‖/‖x‖ ≤ ‖A‖ ‖A⁻¹‖ ‖r‖/‖b‖.
Also
det(A) = 0.43.
∴ A⁻¹ = (1/0.43) [ 2.9 −1.6 ; −6.8 3.9 ] = [ 6.7442 −3.7209 ; −15.8140 9.0698 ].
‖A‖∞ = 9.7, ‖A⁻¹‖∞ = 24.8837, ‖r‖∞ = 0.1540, ‖b‖∞ = 9.7.
∴ ‖x − x̄‖/‖x‖ ≤ ‖A‖∞ ‖A⁻¹‖∞ ‖r‖∞ /‖b‖∞ = 3.8321.
Example 10. Determine the condition number for the matrix
� �
1 2
A= .
1.0001 2
Sol. We saw in a previous example that the very poor approximation (3, −0.0001)t to the exact solution
(1, 1)t had a residual vector with small norm, so we should expect the condition number of A to be
large. We have ‖A‖∞ = max{|1| + |2|, |1.0001| + |2|} = 3.0001, which would not be considered large.
However,
A⁻¹ = [ −10000 10000 ; 5000.5 −5000 ], so ‖A⁻¹‖∞ = 20000,
and for the infinity norm, K(A) = (20000)(3.0001) = 60002. The size of the condition number for this
example should certainly keep us from making hasty accuracy decisions based on the residual of an
approximation.
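These computations are easily checked numerically; for instance (a small illustration of ours using numpy, not part of the notes):

import numpy as np

A = np.array([[1.0, 2.0], [1.0001, 2.0]])
norm_A = np.linalg.norm(A, np.inf)                    # 3.0001
norm_Ainv = np.linalg.norm(np.linalg.inv(A), np.inf)  # 20000
print(norm_A * norm_Ainv)                             # K(A) = 60002
# numpy also provides this directly as np.linalg.cond(A, np.inf).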
Example 11. Find the condition number K(A) of the matrix
A = [ 1 c ; c 1 ], |c| ≠ 1.
When does A become ill-conditioned? What does this say about the linear system Ax = b? How is
K(A) related to det(A)?
Sol. For the given system of equations the matrix A is
A = [ 1 c ; c 1 ],
which is well conditioned if K(A) is near 1. K(A) with respect to the norm ‖·‖∞ is given by
K(A) = ‖A‖∞ ‖A⁻¹‖∞ .
Here det(A) = 1 − c² and adj(A) = [ 1 −c ; −c 1 ]. Thus
A⁻¹ = (1/(1 − c²)) [ 1 −c ; −c 1 ].
Thus ‖A‖∞ = 1 + |c| and ‖A⁻¹‖∞ = 1/|1 − c²| + |c|/|1 − c²| = (1 + |c|)/|1 − c²|.
Hence the condition number is
K(A) = (1 + |c|)²/|1 − c²|.
Thus A is ill-conditioned when |c| is near 1.
When the condition number is large, the solution of the system Ax = b is sensitive to small changes
in A. If the determinant of A is small, then the condition number of A will be very large.
4.1. The Residual Correction Method. A further use of this error estimation procedure is to
define an iterative method for improving the computed value x. Let x(0) be the initial computed value
for x, generally obtained by using Gaussian elimination. Define
r(0) = b − Ax(0) = A(x − x(0)).
Then
Ae(0) = r(0), e(0) = x − x(0).
Solving by Gaussian elimination, we obtain an approximate value of e(0) . Using it, we define an
improved approximation
x(1) = x(0) + e(0) .
Now we repeat the entire process, calculating
r(1) = b − Ax(1),
x(2) = x(1) + e(1),
where e(1) is the approximate solution of
Ae(1) = r(1), e(1) = x − x(1).
Continue this process until there is no further decrease in the size of error vector.
For example, use a computer with four-digit floating-point decimal arithmetic with rounding, and
use Gaussian elimination with pivoting. The system to be solved is
x1 + 0.5x2 + 0.3333x3 = 1
0.5x1 + 0.3333x2 + 0.25x3 = 0
0.3333x1 + 0.25x2 + 0.2x3 = 0
Then
x(0) = [8.968, −35.77, 29.77]t
r(0) = [−0.005341, −0.004359, −0.0005344]t
e(0) = [0.09216, −0.5442, 0.5239]t
x(1) = [9.060, −36.31, 30.29]t
r(1) = [−0.0006570, −0.0003770, −0.0001980]t
e(1) = [0.001707, −0.01300, 0.01241]t
x(2) = [9.062, −36.32, 30.30]t .
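The residual correction loop can be sketched as follows. This is our own illustration: solve stands for any fixed solver, for example one reusing an LU factorization of A, and in practice the residual should be computed in higher precision than the solves.

import numpy as np

def residual_correction(A, b, solve, n_steps=3):
    x = solve(b)                    # initial computed solution x(0)
    for _ in range(n_steps):
        r = b - A @ x               # residual r(k) = b - A x(k)
        e = solve(r)                # approximate error from A e = r
        x = x + e                   # improved approximation x(k+1)
    return x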
dominant eigenvalue, that is, the eigenvalue with largest magnitude. By modifying the method, it can
be used to determine other eigenvalues. One useful feature of the power method is that it produces not
only the eigenvalue but also an associated eigenvector.
To apply the power method, we assume that the n × n matrix A has n eigenvalues λ1 , λ2 , · · · , λn (which
we don't know) with associated eigenvectors v(1), v(2), · · · , v(n); that is, we assume A is diagonalizable.
We write
Av (i) = λi v (i) , i = 1, 2, · · · , n.
We assume that these eigenvalues are ordered so that λ1 is the dominant eigenvalue (with correspond-
ing eigenvector v (1) ).
From linear algebra, if A is diagonalizable, then it has n linearly independent eigenvectors v(1), v(2), · · · , v(n).
An n × n matrix need not have n linearly independent eigenvectors; when it does not, the power method
may still be successful, but it is not guaranteed to be.
As the n eigenvectors v (1) , v (2) , · · · , v (n) are linearly independent, they must form a basis for Rn .
We select an arbitrary nonzero starting vector x(0) and express it as a linear combination of basis
vectors as
x(0) = c1 v(1) + c2 v(2) + · · · + cn v(n).
We assume that c1 ≠ 0. (If c1 = 0, the power method may not converge, and a different x(0) must be
used as an initial approximation.)
Then we repeatedly carry out matrix-vector multiplication, using the matrix A to produce a sequence
of vectors. Specifically, we have
x(1) = Ax(0)
x(2) = Ax(1) = A2 x(0)
..
.
x(k) = Ax(k−1) = Ak x(0) .
In general, we have
x(k) = Ak x(0) , k = 1, 2, 3, · · ·
Substituting the value of x(0) , we obtain
x(k) = A^k x(0)
     = c1 A^k v(1) + c2 A^k v(2) + · · · + cn A^k v(n)
     = c1 λ1^k v(1) + c2 λ2^k v(2) + · · · + cn λn^k v(n)
     = λ1^k [ c1 v(1) + c2 (λ2 /λ1)^k v(2) + · · · + cn (λn /λ1)^k v(n) ].
Now, from our original assumption that λ1 is larger in absolute value than the other eigenvalues, it
follows that each of the fractions
|λ2 /λ1 |, |λ3 /λ1 |, · · · , |λn /λ1 | < 1.
Therefore each of the factors
(λ2 /λ1)^k, (λ3 /λ1)^k, · · · , (λn /λ1)^k
must approach 0 as k approaches infinity. This implies the approximation
A^k x(0) ≈ λ1^k c1 v(1), c1 ≠ 0.
Since v(1) is a dominant eigenvector, it follows that any scalar multiple of v(1) is also a dominant
eigenvector. Thus we have shown that A^k x(0) approaches a multiple of the dominant eigenvector of A.
The entries of A^k x(0) may grow with k, therefore we scale the powers of A^k x(0) in an appropriate
manner to ensure that the limit is finite and nonzero. The scaling begins by choosing the initial guess
x(0) to be a unit vector relative to the maximum norm, that is, ‖x(0)‖∞ = 1. Then we compute
y(1) = Ax(0), and the next approximation is taken as
x(1) = y(1)/‖y(1)‖∞ .
We repeat the procedure and stop using the following stopping criterion:
‖x(k) − x(k−1)‖∞ / ‖x(k)‖∞ < ε,
where ε is the desired accuracy.
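A Python sketch of the power method with infinity-norm scaling and this stopping criterion (the names are ours):

import numpy as np

def power_method(A, x0, eps=1e-6, N=100):
    x = np.array(x0, dtype=float)
    x = x / np.max(np.abs(x))            # unit vector in the maximum norm
    mu = 0.0
    for _ in range(N):
        y = A @ x
        mu = y[np.argmax(np.abs(y))]     # scaling factor; tends to the dominant eigenvalue
        x_new = y / mu
        if np.max(np.abs(x_new - x)) / np.max(np.abs(x_new)) < eps:
            return mu, x_new
        x = x_new
    return mu, x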
Example 12. Calculate four iterations of the power method with scaling to approximate a dominant
eigenvector of the matrix
1 2 0
−2 1 2
1 3 1
Sol. Using x(0) = [1, 1, 1]ᵀ as the initial approximation, we obtain
y(1) = Ax(0) = [ 1 2 0 ; −2 1 2 ; 1 3 1 ][1, 1, 1]ᵀ = [3, 1, 5]ᵀ,
and by scaling we obtain the approximation
x(1) = (1/5)[3, 1, 5]ᵀ = [0.60, 0.20, 1.00]ᵀ.
Similarly we get
y(2) = Ax(1) = [1.00, 1.00, 2.20]ᵀ = 2.20 [0.45, 0.45, 1.00]ᵀ = 2.20 x(2),
y(3) = Ax(2) = [1.35, 1.55, 2.8]ᵀ = 2.8 [0.48, 0.55, 1.00]ᵀ = 2.8 x(3),
y(4) = Ax(3) = 3.1 [0.51, 0.51, 1.00]ᵀ,
etc.
After four iterations, we observe that the dominant eigenvector is approximately
x = [0.51, 0.51, 1.00]ᵀ,
and the scaling factors are approaching the dominant eigenvalue λ ≈ 3.1.
Remark 5.1. The power method is useful for computing an eigenvalue, but it gives only the dominant
one. To find the other eigenvalues we use properties of the matrix, such as: the sum of all eigenvalues
equals the trace of the matrix, and if λ is an eigenvalue of A then λ⁻¹ is an eigenvalue of A⁻¹. Hence
the reciprocal of the smallest eigenvalue of A (in magnitude) is the dominant eigenvalue of A⁻¹.
5.1. Inverse Power method. The Inverse Power method is a modification of the Power method that
is used to determine the eigenvalue of A that is closest to a specified number σ.
We consider A − σI then its eigenvalues are λ1 − σ, λ2 − σ, · · · , λn − σ, where λ1 , λ2 , · · · , λn are the
eigenvalues of A.
Now the eigenvalues of (A − σI)⁻¹ are 1/(λ1 − σ), 1/(λ2 − σ), · · · , 1/(λn − σ).
The eigenvalue of the original matrix A that is closest to σ corresponds to the eigenvalue of largest
magnitude of the shifted and inverted matrix (A − σI)⁻¹.
To find the eigenvalue closest to σ, we apply the power method to obtain the eigenvalue µ of (A − σI)⁻¹.
Then we recover the eigenvalue λ of the original problem by λ = 1/µ + σ. This method is called shift-
and-invert. We solve y = (A − σI)⁻¹x by solving the linear system (A − σI)y = x, so we need not
compute the inverse of the matrix.
Example 13. Apply the inverse power method with x(0) = [1, 1, 1]T to the matrix
−4 14 0
−5 13 0
−1 0 2
with σ = 19/3.
Sol. For the inverse power method, we consider
A − (19/3)I = [ −31/3 14 0 ; −5 20/3 0 ; −1 0 −13/3 ].
Starting with x(0) = [1, 1, 1]ᵀ, y(1) = (A − σI)⁻¹x(0) gives (A − σI)y(1) = x(0). This gives
[ −31/3 14 0 ; −5 20/3 0 ; −1 0 −13/3 ] [a, b, c]ᵀ = [1, 1, 1]ᵀ.
Solving above system by Gauss elimination (LU decomposition), we get a = −6.6, b = −4.8, and
c = 1.2923.
Therefore y(1) = (−6.6, −4.8, 1.2923)ᵀ. We normalize it by taking −6.6 as the scale factor:
x(1) = (1/(−6.6)) y(1) = (1, 0.7272, −0.1958)ᵀ.
Therefore the first approximation to the eigenvalue of A near 19/3 is 1/(−6.6) + 19/3 = 6.1818.
Repeating the above procedure, we can obtain the eigenvalue (which is 6).
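A sketch of the shift-and-invert iteration in Python (our own illustration; note that a linear system is solved each step rather than forming the inverse):

import numpy as np

def inverse_power(A, sigma, x0, eps=1e-6, N=100):
    n = A.shape[0]
    M = A - sigma * np.eye(n)               # shifted matrix A - sigma*I
    x = np.array(x0, dtype=float)
    x = x / np.max(np.abs(x))
    lam = sigma
    for _ in range(N):
        y = np.linalg.solve(M, x)           # y = (A - sigma*I)^(-1) x
        mu = y[np.argmax(np.abs(y))]        # dominant scale factor of the inverted problem
        x, lam_new = y / mu, 1.0 / mu + sigma
        if abs(lam_new - lam) < eps:
            return lam_new, x
        lam = lam_new
    return lam, x

# For the matrix of Example 13 with sigma = 19/3 and x0 = [1, 1, 1],
# the first pass gives mu = -6.6 and lambda ≈ 6.1818, as computed above.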
Important Remark: Although the power method worked well in these examples, we must say
something about cases in which the power method may fail. There are basically three such cases:
1. Using the power method when A is not diagonalizable. Recall that A has n linearly independent
eigenvectors if and only if A is diagonalizable. Of course, it is not easy to tell just by looking at A
whether it is diagonalizable.
2. Using the power method when A does not have a dominant eigenvalue, or when the dominant
eigenvalue is such that |λ1 | = |λ2 |.
3. When the entries of A contain significant errors: the powers A^k will then have significant roundoff
errors in their entries.
Exercises
(1) Find l∞ and l2 norms of the vectors.
a. x = (3, −4, 0, 3/2)t .
b. x = (sin k, cos k, 2^k)t for a fixed positive integer k.
4 −1 7
(2) Find the l∞ norm of the matrix: −1 4 0 .
−7 0 4
(3) The following linear system Ax = b has x as the actual solution and x̄ as an approximate
solution. Compute ‖x − x̄‖∞ and ‖Ax̄ − b‖∞ . Also compute ‖A‖∞ .
x1 + 2x2 + 3x3 = 1
2x1 + 3x2 + 4x3 = −1
3x1 + 4x2 + 6x3 = 2,
x = (0, −7, 5)t
x̄ = (−0.2, −7.5, 5.4)t .
(4) Find the first two iterations of Jacobi and Gauss-Seidel using x(0) = 0:
4.63x1 − 1.21x2 + 3.22x3 = 2.22
−3.07x1 + 5.48x2 + 2.11x3 = −3.17
1.26x1 + 3.11x2 + 4.57x3 = 5.11.
(11) Determine the largest eigenvalue and the corresponding eigenvector of the following matrix,
correct to three decimals, using the power method with x(0) = (−1, 2, 1)t .
1 −1 0
−2 4 −2 .
0 −1 2
(12) Use the inverse power method to approximate the smallest eigenvalue of the matrix, until
a tolerance of 10⁻² is achieved, with x(0) = (1, −1, 2)t .
2 1 1
1 2 1 .
1 1 2
(13) Find the eigenvalue of the following matrix nearest to 3
2 −1 0
−1 2 −1
0 −1 2
using the inverse power method.
Appendix A. Algorithms
Algorithm (Gauss-Seidel):
(1) Input matrix A = [aij ], b, XO = x(0) , tolerance TOL, maximum number of iterations
(2) Set k = 1
(3) while (k ≤ N ) do step 4-7
(4) For i = 1, 2, · · · , n
xi = (1/aii ) [ − Σ_{j=1}^{i−1} (aij xj ) − Σ_{j=i+1}^{n} (aij XOj ) + bi ]
Bibliography
[Burden] Richard L. Burden, J. Douglas Faires and Annette Burden, “Numerical Analysis,” Cengage
Learning, 10th edition, 2015.
[Atkinson] K. Atkinson and W. Han, “Elementary Numerical Analysis,” John Wiley and Sons, 3rd
edition, 2004.
CHAPTER 5 (8 LECTURES)
POLYNOMIAL INTERPOLATION AND APPROXIMATIONS
1. Introduction
Polynomials are used as the basic means of approximation in nearly all areas of numerical analysis.
They are used in the solution of equations and in the approximation of functions, of integrals and
derivatives, of solutions of integral and differential equations, etc. Polynomials have simple structure,
which makes it easy to construct effective approximations and then make use of them. For this reason,
the representation and evaluation of polynomials is a basic topic in numerical analysis. We discuss this
topic in the present chapter in the context of polynomial interpolation, the simplest and certainly the
most widely used technique for obtaining polynomial approximations.
Definition 1.1 (Polynomial). A polynomial Pn (x) of degree ≤ n is, by definition, a function of the
form
Pn (x) = a0 + a1 x + a2 x² + · · · + an x^n    (1.1)
with certain coefficients a0 , a1 , · · · , an . This polynomial has (exact) degree n in case its leading coeffi-
cient an is nonzero.
The power form (1.1) is the standard way to specify a polynomial in mathematical discussions. It is
a very convenient form for differentiating or integrating a polynomial. But, in various specific contexts,
other forms are more convenient. For example, the following shifted power form may be helpful.
P (x) = a0 + a1 (x − c) + a2 (x − c)² + · · · + an (x − c)^n.    (1.2)
It is good practice to employ the shifted power form with the center c chosen somewhere in the interval
[a, b] when interested in a polynomial on that interval.
Definition 1.2 (Newton form). A further generalization of the shifted power form is the following
Newton form
P (x) = a0 + a1 (x − c1 ) + a2 (x − c1 )(x − c2 ) + · · · + an (x − c1 )(x − c2 ) · · · (x − cn ).
This form plays a major role in the construction of an interpolating polynomial. It reduces to the
shifted power form if the centers c1 , · · · , cn , all equal c, and to the power form if the centers c1 , · · · , cn ,
all equal zero.
2. Lagrange Interpolation
In this chapter, we consider interpolation problems. Suppose we do not know the function f , but only
some information (data) about it. We then try to compute a function g that approximates f .
2.1. Polynomial Interpolation. The polynomial interpolation problem, also called Lagrange inter-
polation, can be described as follows: Given (n + 1) data points (xi , yi ), i = 0, 1, · · · , n, find a polynomial
P of lowest possible degree such that
yi = P (xi ), i = 0, 1, · · · , n.
Such a polynomial is said to interpolate the data. Here yi may be the value of some unknown function
f at xi , i.e. yi = f (xi ).
One reason for considering the class of polynomials in the approximation of functions is that they
uniformly approximate continuous functions.
Theorem 2.1 (Weierstrass Approximation Theorem). Suppose that f is defined and continuous on
[a, b]. For any ε > 0, there exists a polynomial P (x) defined on [a, b] with the property that
|f (x) − P (x)| < ε, ∀x ∈ [a, b].
Another reason for considering the class of polynomials in approximation of functions is that the
derivatives and indefinite integrals of a polynomial are easy to compute.
Theorem 2.2 (Existence and Uniqueness). Given a real-valued function f (x) and n + 1 distinct points
x0 , x1 , · · · , xn , there exists a unique polynomial Pn (x) of degree ≤ n which interpolates the unknown
function f (x) at given points x0 , x1 , · · · , xn .
Proof. Existence: Let x0 , x1 , · · · , xn be the given n + 1 discrete data points. We will prove the result
by the mathematical induction.
The Theorem clearly holds for n = 0: only one data point is given, and we can take the constant
polynomial P0 (x) = f (x0 ) for all x.
Assume that the Theorem holds for n ≤ k, i.e. there is a polynomial Pk with degree ≤ k such that
Pk (xi ) = f (xi ), for 0 ≤ i ≤ k.
Now we try to construct a polynomial of degree at most k + 1 to interpolate (xi , f (xi )), 0 ≤ i ≤ k + 1.
Let
Pk+1 (x) = Pk (x) + c(x − x0 )(x − x1 ) · · · (x − xk ).
For x = xk+1 , the requirement Pk+1 (xk+1 ) = f (xk+1 ) gives
c = [ f (xk+1 ) − Pk (xk+1 ) ] / [ (xk+1 − x0 )(xk+1 − x1 ) · · · (xk+1 − xk ) ].
Since the xi are distinct, the polynomial Pk+1 (x) is well-defined and its degree is ≤ k + 1. Now, by
the induction hypothesis,
Pk+1 (xi ) = Pk (xi ) = f (xi ), 0 ≤ i ≤ k,
and by the choice of c,
Pk+1 (xk+1 ) = f (xk+1 ).
The above two equations imply
Pk+1 (xi ) = f (xi ), 0 ≤ i ≤ k + 1.
Therefore Pk+1 (x) interpolates f (x) at all k + 2 nodal points. By mathematical induction, the result
is true for all n.
Uniqueness: Suppose there are two such polynomials Pn and Qn with
Pn (xi ) = f (xi ),
Qn (xi ) = f (xi ), 0 ≤ i ≤ n.
Define
Sn (x) = Pn (x) − Qn (x)
Since for both Pn and Qn , degree ≤ n, which implies the degree of Sn is also ≤ n.
Also
Sn (xi ) = Pn (xi ) − Qn (xi ) = f (xi ) − f (xi ) = 0, 0 ≤ i ≤ n.
This implies Sn has at least n + 1 zeros, which is not possible for a nonzero polynomial of degree at
most n. This implies
Sn (x) = 0 for all x, i.e., Pn = Qn .
solved in order to determine all li (x)’s. Fortunately there is a shortcut. An obvious way of constructing
polynomials li (x) of degree n that satisfy the condition is the following:
li (x) = [ (x − x0 )(x − x1 ) · · · (x − xi−1 )(x − xi+1 ) · · · (x − xn ) ] /
        [ (xi − x0 )(xi − x1 ) · · · (xi − xi−1 )(xi − xi+1 ) · · · (xi − xn ) ].
The uniqueness of the interpolating polynomial of degree ≤ n given n + 1 distinct interpolation points
implies that the polynomials li (x) given by the above relation are the only polynomials of degree n
satisfying li (xj ) = 1 for j = i and li (xj ) = 0 for j ≠ i.
Note that the denominator does not vanish since we assume that all interpolation points are distinct.
We can write the formula for li (x) in a compact form using the product notation:
li (x) = W (x) / [ (x − xi ) W ′(xi ) ], i = 0, 1, · · · , n,
where
W (x) = (x − x0 ) · · · (x − xi−1 )(x − xi )(x − xi+1 ) · · · (x − xn ),
∴ W ′(xi ) = (xi − x0 ) · · · (xi − xi−1 )(xi − xi+1 ) · · · (xi − xn ).
The Lagrange interpolating polynomial can be written as
Pn (x) = Σ_{i=0}^{n} f (xi ) Π_{j=0, j≠i}^{n} (x − xj )/(xi − xj ).
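A direct Python evaluation of this formula (a sketch of ours; for many points or repeated evaluation, the Newton form treated later in this chapter is preferable):

def lagrange_interpolate(xs, ys, x):
    n = len(xs)
    total = 0.0
    for i in range(n):
        li = 1.0
        for j in range(n):
            if j != i:
                li *= (x - xs[j]) / (xs[i] - xs[j])   # basis polynomial l_i(x)
        total += ys[i] * li
    return total

# For the data of Example 1 below:
# lagrange_interpolate([-1, 0, 1, 2], [3, -4, 5, -6], 1.5) estimates y(1.5).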
Example 1. Use Lagrange interpolation to find the unique polynomial of degree 3 or less that agrees
with the following data, and estimate y(1.5).
xi −1 0 1 2
yi 3 −4 5 −6
l1 (x) = (x − x0 )(x − x2 ) / [ (x1 − x0 )(x1 − x2 ) ] = x(x − 1) / [ x1 (x1 − 1) ],
l2 (x) = (x − x0 )(x − x1 ) / [ (x2 − x0 )(x2 − x1 ) ] = x(x − x1 ) / (1 − x1 ).
∴ P2 (x) = l0 (x)f (x0 ) + l1 (x)f (x1 ) + l2 (x)f (x2 )
= l0 (x) · 0 + [ x(x − 1)/(x1 (x1 − 1)) ] · √(x1 − x1²) + l2 (x) · 0
= −x(x − 1)/√(x1 (1 − x1 )).
If we now consider f (x) − P2 (x), then
f (x) − P2 (x) = √(x − x²) + x(x − 1)/√(x1 (1 − x1 )).
Hence f (0.5) − P2 (0.5) = −0.25 implies
√(0.5 − 0.5²) + 0.5(0.5 − 1)/√(x1 (1 − x1 )) = −0.25.
Solving for x1 gives
x1² − x1 = −1/9, or (x1 − 1/2)² = 5/36,
which gives x1 = 1/2 − √(5/36) or x1 = 1/2 + √(5/36).
The largest of these is therefore
x1 = 1/2 + √(5/36) ≈ 0.8727.
2.3. Error Analysis for Polynomial Interpolation. We are given nodes x0 , x1 , · · · , xn and the
corresponding function values f (x0 ), f (x1 ), · · · , f (xn ), but we do not know an expression for the
function. Let Pn (x) be the polynomial of degree ≤ n that passes through the n + 1 points (x0 , f (x0 )),
(x1 , f (x1 )), · · · , (xn , f (xn )).
Question: What is the error between f (x) and Pn (x), even though we don't know f (x) in advance?
Definition 2.3 (Truncation error). The polynomial Pn (x) coincides with f (x) at all nodal points and
may deviate from it at other points in the interval. This deviation is called the truncation error, and
we write
En (f ; x) = f (x) − Pn (x).
Theorem 2.4. Suppose that x0 , x1 , · · · , xn are distinct numbers in [a, b] and f ∈ C n+1 [a, b]. Let Pn (x)
be the unique polynomial of degree ≤ n that passes through n + 1 distinct points then prove that
∀x ∈ [a, b], ∃ξ = ξ(x) ∈ (a, b)
such that
f (x) − Pn (x) = [ (x − x0 )(x − x1 ) · · · (x − xn ) / (n + 1)! ] f^(n+1)(ξ).
Proof. Let x0 , x1 , · · · , xn are distinct numbers in [a, b] and f ∈ C n+1 [a, b].
Let Pn (x) be the unique polynomial of degree ≤ n that passes through n + 1 discrete points.
Since f (xi ) = Pn (xi ) for all i = 0, 1, · · · , n, we have f (x) − Pn (x) = 0 whenever x is a node, so the
result holds trivially there; assume then that x is not a node.
Now, for t ∈ [a, b], define a function
g(t) = f (t) − P (t) − [f (x) − P (x)] · (t − x0 )(t − x1 ) · · · (t − xn ) / [ (x − x0 )(x − x1 ) · · · (x − xn ) ].   (2.2)
Now g(t) ∈ C^(n+1)[a, b], as f ∈ C^(n+1)[a, b] and P (x) ∈ C^(n+1)[a, b].
Also g(t) = 0 at t = x, x0 , x1 , · · · , xn . Therefore g(t) satisfies the conditions of the generalized Rolle's
Theorem which states that between n + 2 zeros of a function, there is at least one zero of (n + 1)th
derivative of the function. Hence there exists a point ξ such that
g (n+1) (ξ) = 0
where ξ ∈ (a, b) and depends on x.
Now differentiate the function g(t) (n + 1) times to obtain
g^(n+1)(t) = f^(n+1)(t) − Pn^(n+1)(t) − [f (x) − Pn (x)] · (n + 1)! / [ (x − x0 )(x − x1 ) · · · (x − xn ) ]
           = f^(n+1)(t) − [f (x) − Pn (x)] · (n + 1)! / [ (x − x0 )(x − x1 ) · · · (x − xn ) ].
Here Pn^(n+1)(t) = 0, as Pn (x) is an n-th degree polynomial.
Now g^(n+1)(ξ) = 0, and solving for f (x) − Pn (x), we obtain
f (x) − Pn (x) = [ (x − x0 )(x − x1 ) · · · (x − xn ) / (n + 1)! ] f^(n+1)(ξ).
Corollary 2.5. To find the maximum error, we have to bound the right-hand side, which contains
two factors: one is the product of factors of the form (x − xi ), and the second is f^(n+1)(ξ). In practice
we try to find separate bounds for the two terms.
The next example illustrates how the error formula can be used to prepare a table of data that will
ensure a specified interpolation error within a specified bound.
Example 3. Suppose a table is to be prepared for the function f (x) = ex , for x in [0, 1]. Assume
the number of decimal places to be given per entry is d ≥ 8 and that the difference between adjacent
x−values, the step size, is h. What step size h will ensure that linear interpolation gives an absolute
error of at most 10−6 for all x in [0, 1]?
Sol. Let x0 , x1 , . . . be the numbers at which f is evaluated, x be in [0, 1], and suppose i satisfies
xi ≤ x ≤ xi+1 .
The error in linear interpolation is
|f (x) − P (x)| = | (1/2) f ″(ξ)(x − xi )(x − xi+1 ) | = ( |f ″(ξ)|/2 ) |x − xi | |x − xi+1 |.
The step size is h, so xi = ih, xi+1 = (i + 1)h, and
|f (x) − p(x)| ≤ (1/2) |f ″(ξ)| |(x − xi )(x − xi+1 )|.
Hence
|f (x) − p(x)| ≤ (1/2) max_{ξ∈[0,1]} e^ξ · max_{xi ≤x≤xi+1} |(x − xi )(x − xi+1 )|
             ≤ (e/2) max_{xi ≤x≤xi+1} |(x − xi )(x − xi+1 )|.
We write g(x) = (x − xi )(x − xi+1 ), for xi ≤ x ≤ xi+1 .
For simplification, we write
x − xi = th,
∴ x − xi+1 = x − (xi + h) = (t − 1)h.
Thus
g(t) = h² t(t − 1),
g′(t) = h² (2t − 1).
The only critical point for g is at t = 1/2, which gives |g(1/2)| = h²/4.
Since g(xi ) = 0 and g(xi+1 ) = 0, the maximum value of |g(x)| in [xi , xi+1 ] must occur at the critical
point, which implies that
|f (x) − p(x)| ≤ (e/2) max_{xi ≤x≤xi+1} |g(x)| ≤ (e/2) · (h²/4) = eh²/8.
Consequently, to ensure that the error in linear interpolation is bounded by 10⁻⁶, it is sufficient for h
to be chosen so that
eh²/8 ≤ 10⁻⁶.
This implies that h < 1.72 × 10⁻³.
Because n = (1 − 0)/h must be an integer, a reasonable choice for the step size is h = 0.001.
Example 4. Determine the step size h that can be used in the tabulation of a function f (x), a ≤ x ≤ b,
at equally spaced nodal points so that the truncation error of the quadratic interpolation is less than ε.
Sol. Let xi−1 , xi , xi+1 be three equispaced points with spacing h. The truncation error of the quadratic
interpolation is given by
|f (x) − P2 (x)| ≤ (M/3!) max_{a≤x≤b} |(x − xi−1 )(x − xi )(x − xi+1 )|,
where M = max_{a≤x≤b} |f ‴(x)|.
To simplify the calculation, let
x − xi = th
∴ x − xi−1 = x − (xi − h) = (t + 1)h
and x − xi+1 = x − (xi + h) = (t − 1)h.
∴ |(x − xi−1 )(x − xi )(x − xi+1 )| = h³ |t(t + 1)(t − 1)| = g(t) (say).
Now g(t) attains its extreme values if
dg/dt = 0,
which gives t = ±1/√3. At the end points of the interval, g becomes zero.
For both values t = ±1/√3, we obtain max_{xi−1 ≤x≤xi+1} |g(t)| = 2h³/(3√3).
Truncation error:
|f (x) − P2 (x)| < ε
=⇒ (h³/(9√3)) M < ε
=⇒ h < (9√3 ε/M )^(1/3).
3. Neville’s Method
Neville’s method can be applied in the situation that we want to interpolate f (x) at a given point
x = p with increasingly higher order Lagrange interpolation polynomials.
For concreteness, consider three distinct points x0 , x1 , and x2 at which we can evaluate f (x) ex-
actly f (x0 ), f (x1 ), f (x2 ). From each of these three points we can construct an order zero (constant)
“polynomial” to approximate f (p) as
f (p) ≈ P0 (p) = f (x0 ) (3.1)
f (p) ≈ P1 (p) = f (x1 ) (3.2)
f (p) ≈ P2 (p) = f (x2 ) (3.3)
Of course this isn’t a very good approximation so we turn to first order Lagrange polynomials
p − x1 p − x0
f (p) ≈ P0,1 (p) = f (x0 ) + f (x1 )
x0 − x1 x1 − x0
8 INTERPOLATION AND APPROXIMATIONS
p − x2 p − x1
f (p) ≈ P1,2 (p) = f (x1 ) + f (x2 ).
x1 − x2 x2 − x1
There is also P0,2 , but we won’t concern ourselves with that one.
If we note that f (xi ) = Pi (p), we find
P0,1 (p) = [ (p − x1 )/(x0 − x1 ) ] P0 (p) + [ (p − x0 )/(x1 − x0 ) ] P1 (p)
        = [ (p − x1 )P0 (p) − (p − x0 )P1 (p) ] / (x0 − x1 ),
and similarly
P1,2 (p) = [ (p − x2 )P1 (p) − (p − x1 )P2 (p) ] / (x1 − x2 ).
In general we want to multiply Pi (x) by (x − xj ), where j ≠ i (i.e., xj is a point that is NOT interpolated
by Pi (x)). We take the difference of two such products and divide by the difference between the added
points.
The result is a polynomial Pi,i+1 of one degree higher than either of the two used to construct it,
and it interpolates all the points of the two constructing polynomials combined. This idea can be
extended to construct the second-degree polynomial P0,1,2 :
P0,1,2 (p) = [ (p − x2 )P0,1 (p) − (p − x0 )P1,2 (p) ] / (x0 − x2 ).
A little algebra will convince you that
P0,1,2 (p) = [ (p − x1 )(p − x2 )/((x0 − x1 )(x0 − x2 )) ] f (x0 ) + [ (p − x0 )(p − x2 )/((x1 − x0 )(x1 − x2 )) ] f (x1 )
           + [ (p − x0 )(p − x1 )/((x2 − x0 )(x2 − x1 )) ] f (x2 ),
which is just the second-degree Lagrange polynomial interpolating the points x0 , x1 , x2 . This shouldn't
surprise you, since this is the unique polynomial of degree ≤ 2 interpolating these three points.
Example 5. We are given the function
f(x) = 1/x.
Approximate the value f (3) using three points 2, 2.5 and 4 by Neville’s method.
Sol. Firstly we evaluate the function at the three points
xi f (xi )
2 0.5
2.5 0.4
4 0.25
We can first make three separate zero-order approximations
f (3) ≈ P0 (3) = f (x0 ) = 0.5
f (3) ≈ P1 (3) = f (x1 ) = 0.4
f (3) ≈ P2 (3) = f (x2 ) = 0.25.
From these we proceed to construct P_{0,1} and P_{1,2} by using the Neville formula
f(3) ≈ P_{0,1}(3) = [(3 − x₁)P₀(3) − (3 − x₀)P₁(3)] / (x₀ − x₁) = 0.3
f(3) ≈ P_{1,2}(3) = [(3 − x₂)P₁(3) − (3 − x₁)P₂(3)] / (x₁ − x₂) = 0.35.
So we can add these numbers to our table
x_i    f(x_i)   P_{i,i+1}
2      0.5
2.5    0.4      0.3
4      0.25     0.35
One more application of the Neville formula combines these into the final approximation
f(3) ≈ P_{0,1,2}(3) = [(3 − x₂)P_{0,1}(3) − (3 − x₀)P_{1,2}(3)] / (x₀ − x₂) = [(−1)(0.3) − (1)(0.35)]/(2 − 4) = 0.325,
which is close to the exact value f(3) = 1/3.
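Neville's recurrence is short to code. The following Python sketch (not from the text; the tableau convention Q[i][j], interpolating x_{i−j}, ..., x_i at p, is one common choice) reproduces the numbers of Example 5:

    def neville(xs, ys, p):
        # Q[i][j] is the value at p of the polynomial through xs[i-j..i]
        n = len(xs)
        Q = [[0.0] * n for _ in range(n)]
        for i in range(n):
            Q[i][0] = ys[i]
        for j in range(1, n):
            for i in range(j, n):
                Q[i][j] = ((p - xs[i - j]) * Q[i][j - 1]
                           - (p - xs[i]) * Q[i - 1][j - 1]) / (xs[i] - xs[i - j])
        return Q

    Q = neville([2, 2.5, 4], [0.5, 0.4, 0.25], 3)
    print(Q[1][1], Q[2][1], Q[2][2])   # 0.3, 0.35, 0.325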
4. Newton's Divided Differences
The ratio
f[x₀, x₁] = (f(x₁) − f(x₀)) / (x₁ − x₀)
is called the first divided difference of f(x), and in general
f[x_i, x_{i+1}] = (f(x_{i+1}) − f(x_i)) / (x_{i+1} − x_i).
The remaining divided differences are defined recursively. The second divided difference of three points x_i, x_{i+1}, x_{i+2} is defined as
f[x_i, x_{i+1}, x_{i+2}] = (f[x_{i+1}, x_{i+2}] − f[x_i, x_{i+1}]) / (x_{i+2} − x_i).
Now if we substitute x = x₂ and the values of a₀ and a₁ in Eq. (4.1), we obtain
P(x₂) = f(x₂) = f(x₀) + (x₂ − x₀)(f(x₁) − f(x₀))/(x₁ − x₀) + a₂(x₂ − x₀)(x₂ − x₁)
⟹ a₂ = f(x₀)/((x₀ − x₁)(x₀ − x₂)) + f(x₁)/((x₁ − x₀)(x₁ − x₂)) + f(x₂)/((x₂ − x₀)(x₂ − x₁))
      = (f[x₁, x₂] − f[x₀, x₁]) / (x₂ − x₀) = f[x₀, x₁, x₂].
The process ends with the single n-th divided difference
a_n = f[x₀, x₁, ..., x_n] = (f[x₁, x₂, ..., x_n] − f[x₀, x₁, ..., x_{n−1}]) / (x_n − x₀)
    = Σ_{i=0}^{n} f(x_i) / Π_{j=0, j≠i}^{n} (x_i − x_j).
We can write Newton's divided difference formula in the following fashion (as we will prove in the next theorem):
P_n(x) = f(x₀) + f[x₀, x₁](x − x₀) + f[x₀, x₁, x₂](x − x₀)(x − x₁) + ··· + f[x₀, x₁, ..., x_n](x − x₀)(x − x₁) ··· (x − x_{n−1})
       = f(x₀) + Σ_{i=1}^{n} f[x₀, x₁, ..., x_i] Π_{j=0}^{i−1} (x − x_j).
We can also construct Newton's interpolating polynomial as given in the next result.
Theorem 4.1. The unique polynomial of degree ≤ n that passes through (x0 , y0 ), (x1 , y1 ), · · · , (xn , yn )
is given by
Pn (x) = f [x0 ] + f [x0 , x1 ](x − x0 ) + f [x0 , x1 , x2 ](x − x0 )(x − x1 ) + · · · +
f [x0 , x1 , · · · , xn ](x − x0 )(x − x1 ) · · · (x − xn−1 ).
Proof. We prove it by induction. The unique polynomial of degree 0 that passes through (x0 , y0 ) is
obviously
P0 (x) = y0 = f [x0 ].
Suppose that the polynomial Pk (x) of order ≤ k that passes through (x0 , y0 ), (x1 , y1 ), · · · , (xk , yk ) is
Pk (x) = f [x0 ] + f [x0 , x1 ](x − x0 ) + f [x0 , x1 , x2 ](x − x0 )(x − x1 ) + · · · +
f [x0 , x1 , · · · , xk ](x − x0 )(x − x1 ) · · · (x − xk−1 ).
Write P_{k+1}(x), the unique polynomial of degree ≤ k + 1 that passes through (x₀, y₀), (x₁, y₁), ..., (x_k, y_k), (x_{k+1}, y_{k+1}), as
P_{k+1}(x) = f[x₀] + f[x₀, x₁](x − x₀) + f[x₀, x₁, x₂](x − x₀)(x − x₁) + ··· + f[x₀, x₁, ..., x_k](x − x₀)(x − x₁) ··· (x − x_{k−1}) + C(x − x₀)(x − x₁) ··· (x − x_{k−1})(x − x_k).
We only need to show that
C = f [x0 , x1 , · · · , xk , xk+1 ].
For this, let Qk (x) be the unique polynomial of degree ≤ k that passes through (x1 , y1 ), · · · , (xk , yk )
(xk+1 , yk+1 ). Define
R(x) = P_k(x) + ((x − x₀)/(x_{k+1} − x₀)) [Q_k(x) − P_k(x)].
Then,
• R(x) is a polynomial of degree k + 1.
• R(x₀) = P_k(x₀) = y₀,
  R(x_i) = P_k(x_i) + ((x_i − x₀)/(x_{k+1} − x₀))(Q_k(x_i) − P_k(x_i)) = P_k(x_i) = y_i, i = 1, ..., k,
  R(x_{k+1}) = Q_k(x_{k+1}) = y_{k+1}.
By the uniqueness, R(x) = Pk+1 (x).
The leading coefficient of Pk+1 (x) is C.
The leading coefficient of R(x) is the leading coefficient of ((x − x₀)/(x_{k+1} − x₀))[Q_k(x) − P_k(x)], which is
(1/(x_{k+1} − x₀)) (leading coefficient of Q_k(x) − leading coefficient of P_k(x)).
On the other hand, the leading coefficient of Q_k(x) is f[x₁, ..., x_{k+1}], and the leading coefficient of P_k(x) is f[x₀, ..., x_k]. Therefore
C = (f[x₁, ..., x_{k+1}] − f[x₀, ..., x_k]) / (x_{k+1} − x₀) = f[x₀, x₁, ..., x_{k+1}].
Consider the data
x    −1    0    1    2
y     3   −4    5   −6
Find a polynomial in Newton's form to interpolate the data and evaluate f(1.5) (the same exercise was done by Lagrange interpolation).
Sol. To write the Newton's form, we compute the divided difference (d.d.) table:
first divided differences: f[−1, 0] = −7, f[0, 1] = 9, f[1, 2] = −11;
second: f[−1, 0, 1] = 8, f[0, 1, 2] = −10;
third: f[−1, 0, 1, 2] = −6.
Hence
P₃(x) = f(x₀) + (x + 1)f[−1, 0] + (x + 1)(x − 0)f[−1, 0, 1] + (x + 1)(x − 0)(x − 1)f[−1, 0, 1, 2]
      = 3 − 7(x + 1) + 8x(x + 1) − 6x(x + 1)(x − 1)
      = −4 + 7x + 8x² − 6x³.
∴ f(1.5) ≈ P₃(1.5) = 4.25.
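The divided-difference coefficients and the nested evaluation of the Newton form can be checked with a short Python sketch (an illustration on the same data, not the notes' own algorithm, which is given in Appendix A):

    def divided_differences(xs, ys):
        # returns [f[x0], f[x0,x1], ..., f[x0,...,xn]] computed in place
        coef = list(ys)
        n = len(xs)
        for j in range(1, n):
            for k in range(n - 1, j - 1, -1):
                coef[k] = (coef[k] - coef[k - 1]) / (xs[k] - xs[k - j])
        return coef

    def newton_eval(xs, coef, x):
        # evaluate the Newton form by nested (Horner-like) multiplication
        result = coef[-1]
        for k in range(len(coef) - 2, -1, -1):
            result = result * (x - xs[k]) + coef[k]
        return result

    xs, ys = [-1, 0, 1, 2], [3, -4, 5, -6]
    coef = divided_differences(xs, ys)
    print(coef)                          # [3, -7, 8, -6]
    print(newton_eval(xs, coef, 1.5))    # 4.25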
Note that the x_i can be reordered but must be distinct. When the order of the x_i is changed, one obtains the same polynomial, but written in a different form.
Theorem 4.2. Let f ∈ Cⁿ[a, b] and let x₀, ..., x_n be distinct numbers in [a, b]. Then there exists ξ ∈ (a, b) such that
f[x₀, x₁, x₂, ..., x_n] = f⁽ⁿ⁾(ξ) / n!.
Proof. Let
P_n(x) = f(x₀) + Σ_{k=1}^{n} f[x₀, x₁, ..., x_k](x − x₀)(x − x₁) ··· (x − x_{k−1})
and set g(x) = f(x) − P_n(x). Since P_n(x_i) = f(x_i) for i = 0, 1, ..., n, the function g has n + 1 distinct zeros in [a, b]. By the generalized Rolle's Theorem there exists ξ ∈ (a, b) such that g⁽ⁿ⁾(ξ) = 0, that is, f⁽ⁿ⁾(ξ) = P_n⁽ⁿ⁾(ξ). Here
P_n⁽ⁿ⁾(x) = n! f[x₀, x₁, ..., x_n].
Therefore
f[x₀, x₁, ..., x_n] = f⁽ⁿ⁾(ξ) / n!.
4.1. Newton’s interpolation for equally spaced points. Newton’s divided-difference formula can
be expressed in a simplified form when the nodes are arranged consecutively with equal spacing. Let
n + 1 points x0 , x1 , · · · , xn are arranged consecutively with equal spacing h.
Let
xn − x0
h= = xi+1 − xi , i = 0, 1, · · · , n
n
Then each xi = x0 + ih, i = 0, 1, · · · , n.
For any x ∈ [a, b], we can write x = x0 + sh, s ∈ R.
Then x − xi = (s − i)h.
Consider the data
x      −2   −1    0    1    2
f(x)   −1    3    1   −1    3
and two cubic interpolating polynomials P(x) and Q(x) in Newton's form: in the formulation of P(x) the second point −1 is taken as the initial point x₀, while in the formulation of Q(x) the first point is taken as the initial point.
Also (alternatively, without drawing the table) P(−2) = Q(−2) = −1, P(−1) = Q(−1) = 3, P(0) = Q(0) = 1, P(1) = Q(1) = −1, P(2) = Q(2) = 3.
Therefore both cubic polynomials interpolate the given data. Further, the interpolating polynomial is unique, but the format of a polynomial is not unique: if P(x) and Q(x) are expanded, they are identical.
∂E/∂b = Σ_{i=1}^{n} [y_i − (a + b x_i)](−2x_i) = 0
⟹ Σ_{i=1}^{n} x_i y_i = a Σ_{i=1}^{n} x_i + b Σ_{i=1}^{n} x_i².     (5.2)
These equations (5.1)–(5.2) are called the normal equations, which are to be solved to get the desired values of a and b.
Example 12. Obtain the least square straight line fit to the following data
Example 13. Find the least square approximation of second degree for the discrete data
x −2 −1 0 1 2
f (x) 15 1 1 3 19
Sol. We fit y = a + bx + cx². We have
Σ_{i=1}^{5} x_i = 0,  Σ x_i² = 10,  Σ x_i³ = 0,  Σ x_i⁴ = 34,  Σ f(x_i) = 39,  Σ x_i f(x_i) = 10,  Σ x_i² f(x_i) = 140.
Substituting these values into the normal equations, we obtain
5a + 10c = 39
10b = 10
10a + 34c = 140.
The solution of this system is a = −37/35, b = 1, and c = 31/7.
The required approximation is y = (1/35)(−37 + 35x + 155x²).
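The same coefficients can be cross-checked with numpy's least-squares polynomial fit (numpy.polyfit is a standard numpy routine; using it here is just a verification sketch, not the method of the notes):

    import numpy as np

    x = np.array([-2, -1, 0, 1, 2], dtype=float)
    f = np.array([15, 1, 1, 3, 19], dtype=float)
    c2, c1, c0 = np.polyfit(x, f, 2)   # coefficients, highest degree first
    print(c0, c1, c2)                  # about -1.0571 (= -37/35), 1.0, 4.4286 (= 31/7)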
Example 14. Use the method of least squares to fit the curve f(x) = c₀x + c₁/√x. Also find the least square error.
Sol. By the principle of least squares, we minimize the error
E(c₀, c₁) = Σ_{i=1}^{5} [f(x_i) − c₀x_i − c₁/√x_i]².
We obtain the normal equations
c₀ Σ_{i=1}^{5} x_i² + c₁ Σ_{i=1}^{5} √x_i = Σ_{i=1}^{5} x_i f(x_i)
c₀ Σ_{i=1}^{5} √x_i + c₁ Σ_{i=1}^{5} (1/x_i) = Σ_{i=1}^{5} f(x_i)/√x_i.
We have
Σ √x_i = 4.1163,  Σ (1/x_i) = 11.8333,  Σ x_i² = 5.38,
Σ x_i f(x_i) = 24.9,  Σ f(x_i)/√x_i = 85.0151.
The normal equations are given by
5.38c₀ + 4.1163c₁ = 24.9
4.1163c₀ + 11.8333c₁ = 85.0151,
whose solution is c₀ = −1.1836, c₁ = 7.5961.
Therefore, the least square fit is given as
f(x) = 7.5961/√x − 1.1836x.
The least square error is given by
E = Σ_{i=1}^{5} [f(x_i) − 7.5961/√x_i + 1.1836x_i]² = 1.6887.
Example 15. Obtain the least square fit of the form y = ab^x to the following data
x 1 2 3 4 5 6 7 8
f (x) 1.0 1.2 1.8 2.5 3.6 4.7 6.6 9.1
Sol. The curve y = ab^x takes the form Y = A + Bx after taking logarithms to base 10, where Y = log y, A = log a, and B = log b.
Hence the normal equations are given by
Σ_{i=1}^{8} Y_i = 8A + B Σ_{i=1}^{8} x_i
Σ_{i=1}^{8} x_i Y_i = A Σ_{i=1}^{8} x_i + B Σ_{i=1}^{8} x_i².
From the data, we form the following table.
x    y     Y = log y   xY       x²
1    1.0   0.0         0.0      1
2    1.2   0.0792      0.1584   4
3    1.8   0.2553      0.7659   9
4    2.5   0.3979      1.5916   16
5    3.6   0.5563      2.7815   25
6    4.7   0.6721      4.0326   36
7    6.6   0.8195      5.7365   49
8    9.1   0.9590      7.6720   64
Σ    36    30.5        3.7393   22.7385  204
Substituting the values, we obtain
3.7393 = 8A + 36B
22.7385 = 36A + 204B,
which give B = 0.1408 and A = −0.1660. Hence a = 10^A ≈ 0.6823 and b = 10^B ≈ 1.3827, so the required fit is y = 0.6823(1.3827)^x.
Remark 5.1. If the data values are quite large, we can make them small by shifting the origin and scaling appropriately.
Example 16. Show that the line of fit to the following data is given by y = 0.7x + 11.28.
x 0 5 10 15 20 25
y 12 15 17 22 24 30
Sol. Here n = 6. We fit a line of the form y = A + Bx.
Let u = (x − 15)/5, v = y − 20, and fit a line of the form v = a + bu.
x     y    u    v    uv   u²
0    12   −3   −8    24    9
5    15   −2   −5    10    4
10   17   −1   −3     3    1
15   22    0    2     0    0
20   24    1    4     4    1
25   30    2   10    20    4
Σ         −3    0    61   19
The normal equations are
0 = 6a − 3b
61 = −3a + 19b.
Solving, a = 1.7428 and b = 3.4857.
Therefore the equation of the line is v = 1.7428 + 3.4857u.
Changing back to the original variables, we obtain
y − 20 = 1.7428 + 3.4857 (x − 15)/5
⟹ y = 11.2857 + 0.6971x.
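A quick numerical cross-check of this example, using the same shifted variables and numpy.polyfit (again only as an independent verification sketch):

    import numpy as np

    x = np.array([0, 5, 10, 15, 20, 25], dtype=float)
    y = np.array([12, 15, 17, 22, 24, 30], dtype=float)
    u, v = (x - 15) / 5, y - 20
    b, a = np.polyfit(u, v, 1)      # v = a + b*u
    B, A = b / 5, 20 + a - 3 * b    # substitute u and v back in terms of x, y
    print(A, B)                     # about 11.2857 and 0.6971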
Exercises
(1) Find the unique polynomial P (x) of degree 2 or less such that
P (1) = 1, P (3) = 27, P (4) = 64
using Lagrange interpolation. Evaluate P (1.05).
(2) For the given functions f (x), let x0 = 1, x1 = 1.25, and x2 = 1.6. Construct Lagrange
interpolation polynomials of degree at most one and at most two to approximate f (1.4), and
find the absolute error.
(a) f(x) = sin πx
(b) f(x) = ∛(x − 1)
(c) f(x) = log₁₀(3x − 1)
(d) f(x) = e^{2x} − x.
(3) Let P₃(x) be the Lagrange interpolating polynomial for the data (0, 0), (0.5, y), (1, 3), and (2, 2). Find y if the coefficient of x³ in P₃(x) is 6.
(4) Let f (x) = ln(1 + x), x0 = 1, x1 = 1.1. Use Lagrange linear interpolation to find the
approximate value of f (1.04) and obtain a bound on the truncation error.
(5) Construct the Lagrange interpolating polynomials for the following functions, and find a bound
for the absolute error on the interval [x0 , xn ].
(a) f(x) = e^{2x} cos 3x, x₀ = 0, x₁ = 0.3, x₂ = 0.6, n = 2.
(b) f(x) = sin(ln x), x₀ = 2.0, x₁ = 2.4, x₂ = 2.6, n = 2.
show that the error of linear interpolation using data (x₀, f₀) and (x₁, f₁) cannot exceed (x₁ − x₀)²/(2√(2πe)).
(14) Using Newton’s divided difference interpolation, construct interpolating polynomials of degree
one, two, and three for the following data. Approximate the specified value using each of the
polynomials.
f (0.43) if f (0) = 1, f (0.25) = 1.64872, f (0.5) = 2.71828, f (0.75) = 4.4816.
(15) Show that the polynomial interpolating (in Newton’s form) the following data has degree 3.
x −2 −1 0 1 2 3
f (x) 1 4 11 16 13 −4
(16) Let f(x) = eˣ. Show that f[x₀, x₁, ..., x_m] > 0 for all values of m and all distinct equally spaced nodes x₀ < x₁ < ··· < x_m.
(17) Show that the interpolating polynomial for f(x) = x^{n+1} at the n + 1 nodal points x₀, x₁, ..., x_n is given by
x^{n+1} − (x − x₀)(x − x₁) ··· (x − x_n).
(18) The following data are given for a polynomial P (x) of unknown degree
x 0 1 2 3
f (x) 4 9 15 18
Determine the coefficient of x3 in P (x) if all fourth-order forward differences are 1.
(19) Let i0 , i1 , · · · , in be a rearrangement of the integers 0, 1, · · · , n. Show that f [xi0 , xi1 , · · · , xin ] =
f [x0 , x1 , · · · , xn ].
(20) Let f (x) = 1/(1 + x) and let x0 = 0, x1 = 1, x2 = 2. Calculate the divided differences f [x0 , x1 ]
and f [x0 , x1 , x2 ]. Using these divided differences, give the quadratic polynomial P2 (x) that
interpolates f (x) at the given node points {x0 , x1 , x2 }. Graph the error f (x) − P2 (x) on the
interval [0, 2].
(21) Construct the interpolating polynomial that fits the following data using Newton forward and backward difference interpolation. Hence find the values of f(x) at x = 0.15 and 0.45.
x      0     0.1    0.2    0.3    0.4    0.5
f(x)  −1.5  −1.27  −0.98  −0.63  −0.22   0.25
(22) For a function f, the forward divided differences are given by
x₀ = 0.0   f[x₀]
x₁ = 0.4   f[x₁]        f[x₀, x₁]         f[x₀, x₁, x₂] = 50/7
x₂ = 0.7   f[x₂] = 6    f[x₁, x₂] = 10
Determine the missing entries in the table.
(23) A fourth-degree polynomial P (x) satisfies Δ4 P (0) = 24, Δ3 P (0) = 6, and Δ2 P (0) = 0, where
ΔP (x) = P (x + 1) − P (x). Compute Δ2 P (10).
(24) Show that
f[x₀, x₁, x₂, ..., x_n, x] = f⁽ⁿ⁺¹⁾(ξ(x)) / (n + 1)!.
(25) Use the method of least squares to fit the linear and quadratic polynomial to the following
data.
x −2 −1 0 1 2
f (x) 15 1 1 3 19
(26) By the method of least squares, fit a curve of the form y = axᵇ to the following data.
x 2 3 4 5
y 27.8 62.1 110 161
(27) Use the method of least squares to fit a curve y = c₀/x + c₁√x to the following data.
x   0.1   0.2   0.4   0.5   1   2
y   21    11    7     6     5   6
Appendix A. Algorithms
Algorithm (Lagrange Interpolation):
• Read the degree n of the polynomial Pn(x).
• Read the values of x(i) and y(i) = f(x(i)), i = 0, 1, . . . , n.
• Read the point of interpolation p.
• Calculate the Lagrange fundamental polynomials l(i) at x = p using the following loop:
  for i = 0 to n
    l(i) = 1.0
    for j = 0 to n
      if j ≠ i
        l(i) = ((p − x(j))/(x(i) − x(j))) ∗ l(i)
    end j
  end i
• Calculate the approximate value of the function at x = p using the following loop:
  sum = 0.0
  for i = 0 to n
    sum = sum + l(i) ∗ y(i)
  end i
• Print sum.
Algorithm (Newton’s Divided-Difference Interpolation):
Given n distinct interpolation points x0 , x1 , · · · , xn , and the values of a function f (x) at these
points, the following algorithm computes the matrix of divided differences:
D = zeros(n, n);
for i = 1 : n
D(i,1) = y(i);
end i
for j = 2 : n,
for k = j : n,
D(k, j) = (D(k, j − 1) − D(k − 1, j − 1))/(x(k) − x(k − j + 1));
end i
end j.
Bibliography
[Burden] Richard L. Burden, J. Douglas Faires and Annette Burden, “Numerical Analysis,” Cengage
Learning, 10th edition, 2015.
[Atkinson] K. Atkinson and W. Han, “Elementary Numerical Analysis,” John Wiley and Sons, 3rd
edition, 2004.
CHAPTER 6 (4 LECTURES)
NUMERICAL INTEGRATION
1. Introduction
The general problem is to find the approximate value of the integral of a given function f (x) over
an interval [a, b]. Thus
I = ∫_a^b f(x) dx.     (1.1)
The problem can be solved by using the Fundamental Theorem of Calculus: find an anti-derivative F of f, that is, F′(x) = f(x), and then
∫_a^b f(x) dx = F(b) − F(a).
But finding an anti-derivative is not an easy task in general. Hence, it is certainly not a good approach
for numerical computations.
In this chapter we’ll study methods for finding integration rules. We’ll also consider composite versions
of these rules and the errors associated with them.
where
λ_i = ∫_a^b l_i(x) dx.
We can also use Newton divided difference interpolation to approximate the function f (x).
3. Newton-Cotes Formula
Let all nodes be equally spaced with spacing h = (b − a)/n. The number h is also called the step length.
Let x₀ = a and x_n = b; then x_i = a + ih, i = 0, 1, ..., n.
The general quadrature formula is given by
∫_a^b f(x) dx = Σ_{i=0}^{n} λ_i f(x_i) + E(f).
This formula is called a Newton-Cotes formula if all points are equally spaced. We now derive rules by taking interpolating polynomials of degree one and two.
3.1. Trapezoidal Rule. We derive the trapezoidal rule for approximating ∫_a^b f(x) dx using the linear Lagrange polynomial. Let x₀ = a, x₁ = b, and h = b − a. Then
∫_{a=x₀}^{b=x₁} f(x) dx = ∫_{x₀}^{x₁} P₁(x) dx + E(f).
∫_{x₀}^{x₁} P₁(x) dx = ∫_{x₀}^{x₁} [l₀(x)f(x₀) + l₁(x)f(x₁)] dx
= f(x₀) ∫_{x₀}^{x₁} (x − x₁)/(x₀ − x₁) dx + f(x₁) ∫_{x₀}^{x₁} (x − x₀)/(x₁ − x₀) dx
= [f(x₀) (x − x₁)²/(2(x₀ − x₁)) + f(x₁) (x − x₀)²/(2(x₁ − x₀))]_{x₀}^{x₁}
= ((x₁ − x₀)/2) [f(x₀) + f(x₁)]
= (h/2) [f(a) + f(b)].
For the error term,
E(f) = ∫_{x₀}^{x₁} (f⁽²⁾(ξ)/2!) (x − x₀)(x − x₁) dx
     = (1/2) ∫_{x₀}^{x₁} f⁽²⁾(ξ) (x − x₀)(x − x₁) dx.
Since (x − x₀)(x − x₁) does not change its sign in [x₀, x₁], by the Weighted Mean-Value Theorem there exists a point ξ ∈ (x₀, x₁) such that
E(f) = (f⁽²⁾(ξ)/2) ∫_{x₀}^{x₁} (x − x₀)(x − x₁) dx
     = (f⁽²⁾(ξ)/2) ((x₀ − x₁)³/6)
     = −(h³/12) f⁽²⁾(ξ).
Thus the integration formula is
∫_a^b f(x) dx = (h/2)[f(a) + f(b)] − (h³/12) f⁽²⁾(ξ).
Geometrically, it is the area of the trapezium (trapezoid) with width h and ordinates f(a) and f(b).
3.2. Simpson’s Rule. We take second degree Lagrange interpolating polynomial. We take n =
a+b
2, x0 = a, x1 = , x2 = b, h = (b − a)/2.
2
� 2
b=x �x2
f (x) dx = P2 (x)dx + E(f ).
a=x0 x0
∫_{x₀}^{x₂} P₂(x) dx = ∫_{x₀}^{x₂} [l₀(x)f(x₀) + l₁(x)f(x₁) + l₂(x)f(x₂)] dx
= λ₀ f(x₀) + λ₁ f(x₁) + λ₂ f(x₂).
The values of the multipliers λ₀, λ₁, and λ₂ are given by
λ₀ = ∫_{x₀}^{x₂} (x − x₁)(x − x₂)/((x₀ − x₁)(x₀ − x₂)) dx.
To simplify this integral, we substitute x = x₀ + ht, dx = h dt, and change the limits to 0 and 2 accordingly. Therefore
λ₀ = ∫₀² (h(t − 1) h(t − 2))/((−h)(−2h)) h dt = h/3.
Similarly
λ₁ = ∫_{x₀}^{x₂} (x − x₀)(x − x₂)/((x₁ − x₀)(x₁ − x₂)) dx = ∫₀² (ht · h(t − 2))/((h)(−h)) h dt = 4h/3
and
λ₂ = ∫_{x₀}^{x₂} (x − x₀)(x − x₁)/((x₂ − x₀)(x₂ − x₁)) dx = ∫₀² (ht · h(t − 1))/((2h)(h)) h dt = h/3.
Since (x − x₀)(x − x₁)(x − x₂) changes its sign in the interval [x₀, x₂], we cannot apply the Weighted Mean-Value Theorem (as we did for the trapezoidal rule). Also
∫_{x₀}^{x₂} (x − x₀)(x − x₁)(x − x₂) dx = 0.
We can add an interpolation point without affecting the integral of the interpolating polynomial, leaving the error unchanged. We can therefore do the error analysis of Simpson's rule with any single point added; since adding any point in [a, b] does not affect the integral, we simply double the midpoint, so that our node set is {x₀ = a, x₁ = (a + b)/2, x₁ = (a + b)/2, x₂ = b}. We can now examine the error of the next interpolating polynomial. Therefore
E(f) = (1/4!) ∫_{x₀}^{x₂} f⁽⁴⁾(ξ)(x − x₀)(x − x₁)²(x − x₂) dx.
Now the product (x − x₀)(x − x₁)²(x − x₂) does not change its sign in [x₀, x₂], so by the Weighted Mean-Value Theorem there exists a point ξ ∈ (x₀, x₂) such that
E(f) = (f⁽⁴⁾(ξ)/24) ∫_{x₀}^{x₂} (x − x₀)(x − x₁)²(x − x₂) dx
     = −(f⁽⁴⁾(ξ)/2880) (x₂ − x₀)⁵
     = −(h⁵/90) f⁽⁴⁾(ξ).
Hence
∫_a^b f(x) dx = (h/3)[f(a) + 4f((a + b)/2) + f(b)] − (h⁵/90) f⁽⁴⁾(ξ).
This rule is called Simpson's 1/3 rule.
Similarly, by taking the third-order Lagrange interpolating polynomial with four nodes a = x₀, x₁, x₂, x₃ = b and h = (b − a)/3, we get the next integration formula, known as Simpson's 3/8 rule:
∫_a^b f(x) dx = (3h/8)[f(x₀) + 3f(x₁) + 3f(x₂) + f(x₃)] − (3/80) h⁵ f⁽⁴⁾(ξ).
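The three single-interval rules derived so far fit in a few lines of Python. In the sketch below, the test integrand f(x) = 1/(1 + x) on [0, 1] (exact value ln 2) is an assumption chosen to anticipate Example 1:

    import math

    def trapezoid(f, a, b):
        h = b - a
        return h / 2 * (f(a) + f(b))

    def simpson_13(f, a, b):
        h = (b - a) / 2
        return h / 3 * (f(a) + 4 * f((a + b) / 2) + f(b))

    def simpson_38(f, a, b):
        h = (b - a) / 3
        return 3 * h / 8 * (f(a) + 3 * f(a + h) + 3 * f(a + 2 * h) + f(b))

    f = lambda x: 1 / (1 + x)
    print(trapezoid(f, 0, 1))   # 0.75
    print(simpson_13(f, 0, 1))  # 0.69444...
    print(simpson_38(f, 0, 1))  # 0.69375
    print(math.log(2))          # 0.69314...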
Definition 3.1. The degree of accuracy, or precision, or order of a quadrature formula is the largest positive integer n such that the formula is exact for xᵏ, for each k = 0, 1, ..., n.
In other words, an integration method of the form
∫_a^b f(x) dx = Σ_{i=0}^{n} λ_i f(x_i) + (1/(n + 1)!) f⁽ⁿ⁺¹⁾(ξ) ∫_a^b Π_{i=0}^{n} (x − x_i) dx
is said to be of order n if it provides exact results for all polynomials of degree less than or equal to n, so that the error term is zero for all polynomials of degree ≤ n.
Trapezoidal rule has degree of precision one and Simpson’s rule has three.
Example 1. Find the value of the integral
I = ∫₀¹ dx/(1 + x)
using the trapezoidal and Simpson's rules. Also obtain a bound on the errors, and compare with the exact value.
Sol. Here
f(x) = 1/(1 + x).
By the trapezoidal rule,
I_T = (h/2)[f(a) + f(b)], with a = 0, b = 1, h = b − a = 1.
Therefore
I_T = (1/2)[f(0) + f(1)] = (1/2)(1 + 1/2) = 0.75.
By Simpson's rule, with h = 1/2,
I_S = (1/6)[f(0) + 4f(1/2) + f(1)] = (1/6)[1 + 8/3 + 1/2] = 0.694444.
The exact value is ln 2 = 0.693147, and the error bounds are
|E_T| ≤ (h³/12) max_{0≤x≤1} |f⁽²⁾(x)| = 2/12 = 0.1667 (h = 1),
|E_S| ≤ (h⁵/90) max_{0≤x≤1} |f⁽⁴⁾(x)| = 24/2880 = 0.0083 (h = 1/2),
both consistent with the actual errors 0.0569 and 0.0013.
Next, consider a quadrature formula of the form ∫₀¹ f(x)/√(x(1 − x)) dx ≈ af(0) + bf(1/2) + cf(1). Making it exact for f(x) = 1, x, x² leads to the equations
a + b + c = π
b/2 + c = π/2
b/4 + c = 3π/8.
By solving these equations, we obtain a = π/4, b = π/2, c = π/4. Hence
∫₀¹ f(x)/√(x(1 − x)) dx ≈ (π/4)[f(0) + 2f(1/2) + f(1)].
Now
I = ∫₀¹ dx/√(x − x³) = ∫₀¹ dx/(√(1 + x) √(x(1 − x))) = ∫₀¹ f(x)/√(x(1 − x)) dx,
with f(x) = 1/√(1 + x). By using the above formula, we obtain
I = (π/4)[1 + 2√2/√3 + √2/2] = 2.62331.
4. Composite Integration
As the order of the integration method is increased, the order of the derivative involved in the error term also increases. Therefore, we can use a higher-order method only if the integrand is differentiable up to the required degree. Alternatively, we can apply lower-order methods by dividing the whole interval into subintervals and then using a Newton-Cotes or Gauss quadrature method on each subinterval separately.
Composite Trapezoidal Method: We divide the interval [a, b] into N subintervals with step size h = (b − a)/N, taking nodal points a = x₀ < x₁ < ··· < x_N = b where x_i = x₀ + ih, i = 1, 2, ..., N − 1.
Now
I = ∫_a^b f(x) dx = ∫_{x₀}^{x₁} f(x) dx + ∫_{x₁}^{x₂} f(x) dx + ··· + ∫_{x_{N−1}}^{x_N} f(x) dx.
Now using the trapezoidal rule for each of the integrals on the right side, we obtain
I = (h/2)[(f(x₀) + f(x₁)) + (f(x₁) + f(x₂)) + ··· + (f(x_{N−1}) + f(x_N))] − (h³/12)[f⁽²⁾(ξ₁) + f⁽²⁾(ξ₂) + ··· + f⁽²⁾(ξ_N)]
  = (h/2)[f(x₀) + f(x_N) + 2 Σ_{i=1}^{N−1} f(x_i)] − (h³/12) Σ_{i=1}^{N} f⁽²⁾(ξ_i).
This formula is the composite trapezoidal rule, where x_{i−1} ≤ ξ_i ≤ x_i, i = 1, 2, ..., N.
The error associated with this approximation is
E(f) = −(h³/12) Σ_{i=1}^{N} f⁽²⁾(ξ_i).
If f ∈ C²[a, b], the Extreme Value Theorem implies that f⁽²⁾ assumes its maximum and minimum in [a, b]. Since
min_{x∈[a,b]} f⁽²⁾(x) ≤ f⁽²⁾(ξ_i) ≤ max_{x∈[a,b]} f⁽²⁾(x),
on summing we have
N min_{x∈[a,b]} f⁽²⁾(x) ≤ Σ_{i=1}^{N} f⁽²⁾(ξ_i) ≤ N max_{x∈[a,b]} f⁽²⁾(x)
and
min_{x∈[a,b]} f⁽²⁾(x) ≤ (1/N) Σ_{i=1}^{N} f⁽²⁾(ξ_i) ≤ max_{x∈[a,b]} f⁽²⁾(x).
By the Intermediate Value Theorem, there is a ξ ∈ (a, b) such that
f⁽²⁾(ξ) = (1/N) Σ_{i=1}^{N} f⁽²⁾(ξ_i).
Therefore
E(f) = −(h³/12) N f⁽²⁾(ξ),
or, since h = (b − a)/N,
E(f) = −((b − a)/12) h² f⁽²⁾(ξ).
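The O(h²) behaviour of the composite trapezoidal rule is easy to observe numerically; in the sketch below, the test integral ∫₀¹ eˣ dx is an assumption for illustration, and the error drops by roughly a factor of 4 each time N doubles:

    import math

    def composite_trapezoid(f, a, b, N):
        h = (b - a) / N
        total = f(a) + f(b) + 2 * sum(f(a + i * h) for i in range(1, N))
        return h / 2 * total

    exact = math.e - 1
    for N in (4, 8, 16):
        print(N, abs(composite_trapezoid(math.exp, 0, 1, N) - exact))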
Composite Simpson’s Method: Simpson’s rule require three abscissas, choose an even integer N
b−a
to produce odd number of nodes with h = . Likewise before, we write
N
� b
I = f (x)dx
a
� x2 � x4 � xN
= f (x)dx + f (x)dx + · · · + f (x)dx.
x0 x2 xN −2
Now use Simpson’s rule for each of the integrals on the right side to obtain
h
I = [(f (x0 ) + 4f (x1 ) + f (x2 )) + (f (x2 ) + 4f (x3 ) + f (x4 )) + · · · + (f (xN −2 ) + 4f (xN −1 ) + f (xN )]
3
h5
− [f (4) (ξ1 ) + f (4) (ξ2 ) + · · · + f (4) (ξN/2 )]
90
N/2−1 N/2 N/2
h � � h5 � (4)
= f (x0 ) + 2 f (x2i ) + 4 f (x2i−1 ) + f (xN ) − f (ξi ).
3 90
i=1 i=1 i=1
This formula is called the composite Simpson's rule. The error in the integration rule is given by
E(f) = −(h⁵/90) Σ_{i=1}^{N/2} f⁽⁴⁾(ξ_i).
If f ∈ C⁴[a, b], the Extreme Value Theorem implies that f⁽⁴⁾ assumes its maximum and minimum in [a, b]. Since
min_{x∈[a,b]} f⁽⁴⁾(x) ≤ f⁽⁴⁾(ξ_i) ≤ max_{x∈[a,b]} f⁽⁴⁾(x),
on summing we have
(N/2) min_{x∈[a,b]} f⁽⁴⁾(x) ≤ Σ_{i=1}^{N/2} f⁽⁴⁾(ξ_i) ≤ (N/2) max_{x∈[a,b]} f⁽⁴⁾(x)
and
min_{x∈[a,b]} f⁽⁴⁾(x) ≤ (2/N) Σ_{i=1}^{N/2} f⁽⁴⁾(ξ_i) ≤ max_{x∈[a,b]} f⁽⁴⁾(x).
By the Intermediate Value Theorem, there is a ξ ∈ (a, b) such that
f⁽⁴⁾(ξ) = (2/N) Σ_{i=1}^{N/2} f⁽⁴⁾(ξ_i).
Therefore
E(f) = −(h⁵/180) N f⁽⁴⁾(ξ),
or, since h = (b − a)/N, E(f) = −((b − a)/180) h⁴ f⁽⁴⁾(ξ).
So the composite Simpson's rule requires only n ≥ 18. The composite Simpson's rule with n = 18 gives
∫₀^π sin x dx ≈ (π/54)[2 Σ_{i=1}^{8} sin(iπ/9) + 4 Σ_{i=1}^{9} sin((2i − 1)π/18)] = 2.0000104.
This is accurate to within about 10⁻⁵, because the true value is −cos(π) − (−cos(0)) = 2.
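A corresponding sketch of the composite Simpson's rule in Python; run with N = 18 on ∫₀^π sin x dx, it reproduces the value 2.0000104 quoted above:

    import math

    def composite_simpson(f, a, b, N):
        # N must be even; nodes are x_i = a + i*h, i = 0, ..., N
        h = (b - a) / N
        odd = sum(f(a + (2 * i - 1) * h) for i in range(1, N // 2 + 1))
        even = sum(f(a + 2 * i * h) for i in range(1, N // 2))
        return h / 3 * (f(a) + 4 * odd + 2 * even + f(b))

    print(composite_simpson(math.sin, 0, math.pi, 18))   # about 2.0000104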
Example 5. The area A inside the closed curve y² + x² = cos x is given by
A = 4 ∫₀^α (cos x − x²)^{1/2} dx,
where α is the positive root of
f(x) = cos x − x² = 0.
(a) Using Newton's method for this equation, we obtain the iteration scheme
x_{k+1} = x_k + (cos x_k − x_k²)/(sin x_k + 2x_k), k = 0, 1, 2, ...
Starting with x₀ = 0.5, we obtain
x₁ = 0.5 + 0.62758/1.47942 = 0.92420
x₂ = 0.92420 − 0.25169/2.64655 = 0.82911
x₃ = 0.82911 − 0.011882/2.39554 = 0.82414
x₄ = 0.82414 − 0.000033/2.38226 = 0.82413.
Hence the value of α correct to three decimals is 0.824.
(b) Substituting the value of α, we obtain
A = 4 ∫₀^{0.824} (cos x − x²)^{1/2} dx.
Using the composite trapezoidal method with h = 0.824, 0.412, and 0.206, respectively, we obtain the following approximations of the area A:
A = (4(0.824)/2)[1 + 0.017753] = 1.67725
A = (4(0.412)/2)[1 + 2(0.864047) + 0.017753] = 2.262578
A = (4(0.206)/2)[1 + 2(0.967688 + 0.864047 + 0.658115) + 0.017753] = 2.470951.
5. Gauss Quadrature
In a numerical integration method, if both the nodes x_i and the multipliers λ_i are unknown, then the method is called Gaussian quadrature. We can obtain the unknowns by making the method exact for polynomials of degree as high as required. The formulas are derived for the interval [−1, 1]; any interval [a, b] can be transformed to [−1, 1] by the transformation x = At + B, which gives a = −A + B and b = A + B, and after solving we get
x = ((b − a)/2) t + (b + a)/2.
The error of such a formula has the form
E(f) = (f⁽ⁿ⁺¹⁾(ξ)/(n + 1)!) C,
where
C = ∫_{−1}^{1} (x − x₀) ··· (x − x_n) dx.
We can compute the value of C by putting f(x) = x^{n+1}, to obtain
∫_a^b x^{n+1} dx = Σ_{i=0}^{n} λ_i x_i^{n+1} + C
⟹ C = ∫_a^b x^{n+1} dx − Σ_{i=0}^{n} λ_i x_i^{n+1}.
The number C is called the error constant. With this notation, we can write the error term as
E(f) = (C/(n + 1)!) f⁽ⁿ⁺¹⁾(ξ).
Gauss-Legendre Integration Methods: The technique we have described can be used to determine the nodes and coefficients for formulas that give exact results for higher-degree polynomials.
One-point formula: The formula is given by
∫_{−1}^{1} f(x) dx = λ₀ f(x₀).
The method has two unknowns, λ₀ and x₀. Making the method exact for f(x) = 1, x, we obtain
f(x) = 1 :  ∫_{−1}^{1} dx = 2 = λ₀
f(x) = x :  ∫_{−1}^{1} x dx = 0 = λ₀x₀ ⟹ x₀ = 0.
Therefore the one-point formula is given by
∫_{−1}^{1} f(x) dx = 2f(0).
The error in the approximation is given by
E(f) = (C/2!) f″(ξ),
where the error constant C is given by
C = ∫_{−1}^{1} x² dx − 2(0)² = 2/3.
Hence
E(f) = (1/3) f″(ξ), −1 < ξ < 1.
Two-point formula:
∫_{−1}^{1} f(x) dx = λ₀ f(x₀) + λ₁ f(x₁).
The method has four unknowns. Making the method exact for f(x) = 1, x, x², x³, we obtain
f(x) = 1 :  ∫_{−1}^{1} dx = 2 = λ₀ + λ₁     (5.1)
f(x) = x :  ∫_{−1}^{1} x dx = 0 = λ₀x₀ + λ₁x₁     (5.2)
f(x) = x² :  ∫_{−1}^{1} x² dx = 2/3 = λ₀x₀² + λ₁x₁²     (5.3)
f(x) = x³ :  ∫_{−1}^{1} x³ dx = 0 = λ₀x₀³ + λ₁x₁³     (5.4)
Solving these equations gives λ₀ = λ₁ = 1 and x₀ = −1/√3, x₁ = 1/√3, so the two-point formula is
∫_{−1}^{1} f(x) dx = f(−1/√3) + f(1/√3).
Three-point formula:
∫_{−1}^{1} f(x) dx = λ₀ f(x₀) + λ₁ f(x₁) + λ₂ f(x₂).
The method has six unknowns. Making the method exact for f(x) = 1, x, x², x³, x⁴, x⁵, we obtain
f(x) = 1 :  2 = λ₀ + λ₁ + λ₂
f(x) = x :  0 = λ₀x₀ + λ₁x₁ + λ₂x₂
f(x) = x² :  2/3 = λ₀x₀² + λ₁x₁² + λ₂x₂²
f(x) = x³ :  0 = λ₀x₀³ + λ₁x₁³ + λ₂x₂³
f(x) = x⁴ :  2/5 = λ₀x₀⁴ + λ₁x₁⁴ + λ₂x₂⁴
f(x) = x⁵ :  0 = λ₀x₀⁵ + λ₁x₁⁵ + λ₂x₂⁵
By solving these equations, we obtain λ₀ = λ₂ = 5/9, λ₁ = 8/9, x₀ = ±√(3/5), x₁ = 0, and x₂ = ∓√(3/5).
∴ E₅ = (1/15750) f⁽⁶⁾(ξ), −1 < ξ < 1.
Note: The Legendre polynomial P_n(x) is a monic polynomial of degree n. The first few Legendre polynomials are
P₀(x) = 1,
P₁(x) = x,
P₂(x) = x² − 1/3,
P₃(x) = x³ − (3/5)x.
Nodes in Gauss-Legendre rules are roots of these polynomials.
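The low-order Gauss-Legendre rules can be tabulated directly from the weights and nodes derived above; the sketch below applies them to the transformed integrand of Example 6 (which follows), so the printed values can be compared with the worked solution:

    import math

    GAUSS = {
        1: [(2.0, 0.0)],
        2: [(1.0, -1 / math.sqrt(3)), (1.0, 1 / math.sqrt(3))],
        3: [(5 / 9, -math.sqrt(3 / 5)), (8 / 9, 0.0), (5 / 9, math.sqrt(3 / 5))],
    }

    def gauss_legendre(f, n):
        # integrate f over [-1, 1] with the n-point Gauss-Legendre rule
        return sum(w * f(x) for w, x in GAUSS[n])

    f = lambda t: 8 * (t + 3) / (16 + (t + 3) ** 4)
    print(gauss_legendre(f, 1))   # about 0.4948
    print(gauss_legendre(f, 2))   # about 0.5434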
Example 6. Evaluate
I = ∫₁² 2x/(1 + x⁴) dx
using Gauss-Legendre 1 and 2-point formula. Also compare with the exact value.
Sol. Firstly we change the interval [1, 2] into [−1, 1] by taking x = (t + 3)/2, dx = dt/2. Then
I = ∫₁² 2x/(1 + x⁴) dx = ∫_{−1}^{1} 8(t + 3)/(16 + (t + 3)⁴) dt.
Let
f(t) = 8(t + 3)/(16 + (t + 3)⁴).
By the 1-point formula,
I ≈ 2f(0) = 0.4948.
By the 2-point formula,
I ≈ f(−1/√3) + f(1/√3) = f(−0.57735) + f(0.57735) = 0.5434.
Now the exact value of the integral is given by
I = ∫₁² 2x/(1 + x⁴) dx = tan⁻¹ 4 − π/4 = 0.5404.
Therefore the errors of the one- and two-point formulas are |0.4948 − 0.5404| = 0.0456 and |0.5434 − 0.5404| = 0.0030, respectively.
Example 7. Evaluate
I = ∫_{−1}^{1} (1 − x²)^{3/2} cos x dx
using the Gauss-Legendre 3-point formula.
Sol. Using the Gauss-Legendre 3-point formula with f(x) = (1 − x²)^{3/2} cos x, we obtain
I ≈ (1/9)[5f(−√(3/5)) + 8f(0) + 5f(√(3/5))]
  = (1/9)[5(2/5)^{3/2} cos(√(3/5)) + 8 + 5(2/5)^{3/2} cos(√(3/5))]
  = 1.08979.
Example 8. Determine constants a, b, c, and d that will produce a quadrature formula
∫_{−1}^{1} f(x) dx = af(−1) + bf(1) + cf′(−1) + df′(1)
that has degree of precision 3.
Sol. We want the formula
∫_{−1}^{1} f(x) dx = af(−1) + bf(1) + cf′(−1) + df′(1)
to hold for polynomials 1, x, x2 , · · · . Plugging these into the formula, we obtain:
f(x) = x⁰ :  ∫_{−1}^{1} dx = 2 = a·1 + b·1 + c·0 + d·0
f(x) = x¹ :  ∫_{−1}^{1} x dx = 0 = a·(−1) + b·1 + c·1 + d·1
f(x) = x² :  ∫_{−1}^{1} x² dx = 2/3 = a·1 + b·1 + c·(−2) + d·2
f(x) = x³ :  ∫_{−1}^{1} x³ dx = 0 = a·(−1) + b·1 + c·3 + d·3.
We have 4 equations in 4 unknowns:
a + b = 2,
−a + b + c + d = 0,
a + b − 2c + 2d = 2/3,
−a + b + 3c + 3d = 0.
Solving this system, we obtain
a = 1, b = 1, c = 1/3, d = −1/3.
Thus, the quadrature formula with degree of precision 3 is
∫_{−1}^{1} f(x) dx = f(−1) + f(1) + (1/3)f′(−1) − (1/3)f′(1).
Example 9. Evaluate
I = ∫₀¹ dx/(1 + x)
by subdividing the interval [0, 1] into two equal parts and then using the Gauss-Legendre three-point formula.
Sol. Recall the three-point formula
∫_{−1}^{1} f(x) dx ≈ (1/9)[5f(−√(3/5)) + 8f(0) + 5f(√(3/5))].
Let
I = ∫₀¹ dx/(1 + x) = ∫₀^{1/2} dx/(1 + x) + ∫_{1/2}^{1} dx/(1 + x) = I₁ + I₂.
Now substitute x = (t + 1)/4 and x = (z + 3)/4 in I₁ and I₂, respectively, to change the limits to [−1, 1]. We have dx = dt/4 and dx = dz/4 for I₁ and I₂, respectively. Therefore
I₁ = ∫_{−1}^{1} dt/(t + 5) = (1/9)[5/(5 − √(3/5)) + 8/5 + 5/(5 + √(3/5))] = 0.405464
I₂ = ∫_{−1}^{1} dz/(z + 7) = (1/9)[5/(7 − √(3/5)) + 8/7 + 5/(7 + √(3/5))] = 0.287682.
Hence
I = I₁ + I₂ = 0.405464 + 0.287682 = 0.693146.
Exercises
(1) Given
I = ∫₀¹ x eˣ dx.
Approximate the value of I using the trapezoidal and Simpson's one-third methods. Also obtain the error bounds and compare with the exact value of the integral.
(2) Evaluate
I = ∫₀¹ dx/(1 + x²)
using the trapezoidal and Simpson's rules with 4 and 6 subintervals. Compare with the exact value of the integral.
(3) Approximate the following integrals using the trapezoidal and Simpson formulas.
(a) I = ∫_{−0.25}^{0.25} (cos x)² dx.
(b) I = ∫_e^{e+1} 1/(x ln x) dx.
Find a bound for the error using the error formula, and compare this to the actual error.
(4) The quadrature formula ∫₀² f(x) dx = c₀f(0) + c₁f(1) + c₂f(2) is exact for all polynomials of degree less than or equal to 2. Determine c₀, c₁, and c₂.
(5) Determine the values of a, b, and c such that the formula
∫₀^h f(x) dx = h[af(0) + bf(h/3) + cf(h)]
is exact for polynomials of degree as high as possible. Also obtain the degree of precision.
(6) The length of the curve represented by a function y = f(x) on an interval [a, b] is given by the integral
I = ∫_a^b √(1 + [f′(x)]²) dx.
Use the trapezoidal rule and Simpson's rule with 4 and 6 subintervals to compute the length of the graph of the ellipse with equation 4x² + 9y² = 36.
(7) Determine the values of n and h required to approximate
∫₀² e^{2x} sin 3x dx
Appendix A. Algorithms
Algorithm (Composite Trapezoidal Method):
Step 1 : Inputs: function f (x); end points a and b; and N number of subintervals.
Step 2 : Set h = (b − a)/N .
Step 3 : Set sum = 0
Step 4 : For i = 1 to N − 1
Step 5 : Set x = a + h ∗ i
Step 6 : Set sum = sum+2 ∗ f (x)
end
Step 7 : Set sum = sum+f (a) + f (b)
Step 8 : Set ans = sum∗(h/2)
End
Bibliography
[Burden] Richard L. Burden, J. Douglas Faires and Annette Burden, “Numerical Analysis,” Cengage
Learning, 10th edition, 2015.
[Atkinson] K. Atkinson and W. Han, “Elementary Numerical Analysis,” John Wiley and Sons, 3rd
edition, 2004.
CHAPTER 7 (4 LECTURES)
INITIAL-VALUE PROBLEMS FOR ORDINARY DIFFERENTIAL EQUATIONS
1. Introduction
Differential equations are used to model problems in science and engineering that involve the change
of some variable with respect to another. Most of these problems require the solution of an initial-value
problem, that is, the solution to a differential equation that satisfies a given initial condition.
In common real-life situations, the differential equation that models the problem is too complicated to
solve exactly, and one of two approaches is taken to approximate the solution. The first approach is to
modify the problem by simplifying the differential equation to one that can be solved exactly and then
use the solution of the simplified equation to approximate the solution to the original problem. The
other approach, which we will examine in this chapter, uses methods for approximating the solution of
the original problem. This is the approach that is most commonly taken because the approximation
methods give more accurate results and realistic error information.
In this chapter, we discuss the numerical methods for solving the ordinary differential equations of
initial-value problems (IVP) of the form
dy/dt = f(t, y), t ∈ ℝ, y(t₀) = y₀     (1.1)
where y is a function of t, f is a function of t and y, and t₀ is the initial point. The numerical values of y(t) on an interval containing t₀ are to be determined.
We divide the domain [a, b] into subintervals
a = t₀ < t₁ < ··· < t_N = b.
These points are called mesh points or grid points. Let the equal spacing be h; the uniform mesh points are then given by t_i = t₀ + ih, i = 0, 1, 2, .... The set of values y₀, y₁, ..., y_N is the numerical solution of the initial-value problem (IVP).
Thus
φ₁(t) = 1 + ∫₀ᵗ s ds = 1 + t²/2,
φ₂(t) = 1 + ∫₀ᵗ s(1 + s²/2) ds = 1 + t²/2 + t⁴/(2·4),
φ₃(t) = 1 + t²/2 + t⁴/(2·4) + t⁶/(2·4·6).
It may be established by induction that
φ_k(t) = 1 + (t²/2) + (1/2!)(t²/2)² + ··· + (1/k!)(t²/2)^k.
We recognize φ_k(t) as the partial sum of the series expansion of the function φ(t) = e^{t²/2}. We know that this series converges for all t, and this means that φ_k(t) → φ(t) as k → ∞, for all t ∈ ℝ.
Indeed φ is a solution of the given initial value problem.
Example 5. Generate φ₀(t), φ₁(t), φ₂(t), and φ₃(t) for the initial-value problem
y′ = −y + t + 1, 0 ≤ t ≤ 1, y(0) = 1
using Picard's method.
Sol.
φ₀(t) = y(0) = 1.
φ₁(t) = 1 + ∫₀ᵗ f(s, φ₀(s)) ds = 1 + ∫₀ᵗ (−1 + s + 1) ds = 1 + t²/2.
φ₂(t) = 1 + ∫₀ᵗ f(s, φ₁(s)) ds = 1 + ∫₀ᵗ [−(1 + s²/2) + s + 1] ds = 1 + ∫₀ᵗ (s − s²/2) ds = 1 + t²/2 − t³/6.
φ₃(t) = 1 + ∫₀ᵗ f(s, φ₂(s)) ds = 1 + ∫₀ᵗ [−(1 + s²/2 − s³/6) + s + 1] ds = 1 + t²/2 − t³/6 + t⁴/24.
We can check that these approximations are the partial sums of the Maclaurin series of t + e⁻ᵗ, which is the exact solution of the given IVP.
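Picard iterates can also be generated symbolically. The following sketch uses sympy (an assumption — the notes do not prescribe a tool) to reproduce the iterates of Example 5:

    import sympy as sp

    t, s = sp.symbols('t s')

    def picard(f, y0, n):
        # return the Picard iterates phi_1, ..., phi_n for y' = f(t, y), y(0) = y0
        phi = sp.Integer(y0)
        iterates = []
        for _ in range(n):
            phi = y0 + sp.integrate(f(s, phi.subs(t, s)), (s, 0, t))
            iterates.append(sp.expand(phi))
        return iterates

    f = lambda tt, yy: -yy + tt + 1
    for k, phi in enumerate(picard(f, 1, 3), start=1):
        print(k, phi)   # phi_3 = 1 + t**2/2 - t**3/6 + t**4/24, as computed above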
2.2. Taylor’s Series method. Consider the one dimensional initial value problem
y � = f (t, y), y(t0 ) = y0
where f is a function of two variables t and y and (t0 , y0 ) is a known point on the solution curve.
If the existence of all higher order partial derivatives is assumed for y at some point t = ti , then by
Taylor series the value of y at any neighboring point ti + h can be written as
h2 �� h3 ���
y(ti + h) = y(ti ) + hy � (ti ) + y (ti ) + y (ti ) + · · · + O(hp+1 ).
2 3!
Since y_i is known at t_i, y′ at t_i can be found by computing f(t_i, y_i). Similarly, higher derivatives of y at t_i can be computed by making use of the relation y′ = f(t, y). Hence the value of y at the neighboring point t_i + h can be obtained by summing the above series. If the series is terminated after the p-th derivative term, the resulting formula is called the Taylor series approximation to y of order p, and the error is of order p + 1.
Example 6. Given the IVP y′ = x²y − 1, y(0) = 1, find y at x = 0.1 and x = 0.2 by the Taylor series method of order 4 with step size 0.1.
Sol. From the given IVP,
y′ = x²y − 1,  y″ = 2xy + x²y′,  y‴ = 2y + 4xy′ + x²y″,  y⁽⁴⁾ = 6y′ + 6xy″ + x²y‴.
3.1. Euler’s Method: The Euler method is named after Swiss mathematician Leonhard Euler (1707-
1783). This is the one of the simplest method to solve the IVP. Consider the IVP given in Eqs(3.1-3.2).
dy
We can approximate the derivative as following by assuming that all nodes ti are equally spaced
dt
with spacing h and ti+1 = ti + h.
Now by the definition of the derivative,
y′(t₀) ≈ (y(t₀ + h) − y(t₀))/h.
Applying this approximation to the given IVP at the point t = t₀ gives
y′(t₀) = f(t₀, y₀).
Therefore
(1/h)[y(t₁) − y(t₀)] = f(t₀, y₀)
⟹ y(t₁) − y(t₀) = h f(t₀, y₀),
which gives
y(t₁) = y(t₀) + h f(t₀, y₀).
In general, we write
ti+1 = ti + h
yi+1 = yi + hf (ti , yi )
where yi = y(ti ). This procedure is called Euler’s method.
Alternatively, we can derive this method from a Taylor series. We write
y(t_{i+1}) = y(t_i + h) = y(t_i) + h y′(t_i) + (h²/2!) y″(t_i) + ··· .
If we cut the series after y′(t_i), we obtain
y(t_{i+1}) ≈ y(t_i) + h y′(t_i) = y(t_i) + h f(t_i, y(t_i))
⟹ y_{i+1} = y_i + h f(t_i, y_i).
If the truncation error has the term h^{p+1}, then the order of the numerical method is p. Therefore, Euler's method is a first-order method.
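Euler's method in code is a one-line update inside a loop. In the sketch below, the test IVP y′ = −2y + 2 − e^{−4t}, y(0) = 1, with h = 0.1, is an assumption; it is chosen because it matches the numbers in the worked fragment that follows (y₁ = 0.9, y₂ = 0.852967):

    import math

    def euler(f, t0, y0, h, n):
        t, y = t0, y0
        for _ in range(n):
            y = y + h * f(t, y)   # y_{i+1} = y_i + h f(t_i, y_i)
            t = t + h
        return y

    f = lambda t, y: -2 * y + 2 - math.exp(-4 * t)   # assumed test problem
    print(euler(f, 0.0, 1.0, 0.1, 2))                # 0.852967..., i.e. y(0.2)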
3.2. The Improved or Modified Euler's method. We write the integral form of y(t) as
dy/dt = f(t, y) ⟺ y(t) = y(t₀) + ∫_{t₀}^{t} f(s, y(s)) ds.
t₂ = t₀ + 2h = 0 + 2 × 0.1 = 0.2
y₂ = y₁ + h f(0.1, 0.9) = 0.9 + 0.1(−2 × 0.9 + 2 − e^{−4(0.1)}) = 0.9 + 0.1(−0.47032) = 0.852967
∴ y₂ = y(0.2) = 0.852967.
Example 8. For the IVP y′ = t + √y, y(0) = 1, calculate y in the interval [0, 0.6] with h = 0.2 by using the modified Euler's method.
Sol.
y′ = t + √y = f(t, y), t₀ = 0, y₀ = 1, h = 0.2, t₁ = 0.2
K₁ = h f(t₀, y₀) = 0.2(1) = 0.2
K₂ = h f(t₁, y₀ + K₁) = h f(0.2, 1.2) = 0.2591
y₁ = y(0.2) = y₀ + (K₁ + K₂)/2 = 1.22955.
Similarly we can compute the solutions at the other points.
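One step of the modified Euler method, as used in Example 8, takes only a few lines of Python (a verification sketch; f is the same function as in the example):

    import math

    def modified_euler_step(f, t, y, h):
        K1 = h * f(t, y)
        K2 = h * f(t + h, y + K1)
        return y + (K1 + K2) / 2

    f = lambda t, y: t + math.sqrt(y)
    print(modified_euler_step(f, 0.0, 1.0, 0.2))   # 1.22955..., i.e. y(0.2)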
Example 9. Show that the following initial-value problem has a unique solution:
y′ = t⁻²(sin 2t − 2ty), 1 ≤ t ≤ 2, y(1) = 2.
Find y(1.1) and y(1.2) with step-size h = 0.1 using the modified Euler's method.
Sol.
y′ = t⁻²(sin 2t − 2ty) = f(t, y).
Holding t constant,
|f(t, y₁) − f(t, y₂)| = |t⁻²(sin 2t − 2ty₁) − t⁻²(sin 2t − 2ty₂)| = (2/|t|)|y₁ − y₂| ≤ 2|y₁ − y₂|.
Thus f satisfies a Lipschitz condition in the variable y with Lipschitz constant L = 2. Additionally, f(t, y) is continuous for 1 ≤ t ≤ 2 and −∞ < y < ∞, so the Existence Theorem implies that a unique solution exists.
3.3. Runge-Kutta Methods: These are among the most important methods for solving IVPs. The techniques were developed around 1900 by the German mathematicians C. Runge and M. W. Kutta. If we apply Taylor's Theorem directly, we require the function to have higher-order derivatives. The advantage of the class of Runge-Kutta methods is that it does not involve higher-order derivatives.
Euler's method is an example of a first-order Runge-Kutta method, and the modified Euler's method is an example of a second-order Runge-Kutta method.
Third-order Runge-Kutta methods: As with the modified Euler's method, using Simpson's rule to approximate the integral, we obtain the following Runge-Kutta method of order three:
t_{i+1} = t_i + h
K₁ = h f(t_i, y_i)
K₂ = h f(t_i + h/2, y_i + K₁/2)
K₃ = h f(t_i + h, y_i − K₁ + 2K₂)
y_{i+1} = y_i + (1/6)(K₁ + 4K₂ + K₃).
There are different Runge-Kutta methods of order three. The most commonly used is Heun's method, given by
t_{i+1} = t_i + h
y_{i+1} = y_i + (h/4)[f(t_i, y_i) + 3f(t_i + 2h/3, y_i + (2h/3) f(t_i + h/3, y_i + (h/3) f(t_i, y_i)))].
Runge-Kutta methods of order three are not generally used. The most common Runge-Kutta method in use is of order four, given by the following.
Fourth-order Runge-Kutta method:
t_{i+1} = t_i + h
K₁ = h f(t_i, y_i)
K₂ = h f(t_i + h/2, y_i + K₁/2)
K₃ = h f(t_i + h/2, y_i + K₂/2)
K₄ = h f(t_i + h, y_i + K₃)
y_{i+1} = y_i + (1/6)(K₁ + 2K₂ + 2K₃ + K₄) + O(h⁵).
The local truncation error in a Runge-Kutta method is the error that arises in each step because of the truncated Taylor series. This error is inevitable. The fourth-order Runge-Kutta method has a local truncation error of O(h⁵).
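A classical RK4 step in Python; applied to Example 11 below, with f(t, y) = (y² − t²)/(y² + t²), it reproduces y(0.2) ≈ 1.196 and y(0.4) ≈ 1.3752 (a checking sketch, distinct from the pseudocode in Appendix A):

    def rk4_step(f, t, y, h):
        K1 = h * f(t, y)
        K2 = h * f(t + h / 2, y + K1 / 2)
        K3 = h * f(t + h / 2, y + K2 / 2)
        K4 = h * f(t + h, y + K3)
        return y + (K1 + 2 * K2 + 2 * K3 + K4) / 6

    f = lambda t, y: (y * y - t * t) / (y * y + t * t)
    y1 = rk4_step(f, 0.0, 1.0, 0.2)
    y2 = rk4_step(f, 0.2, y1, 0.2)
    print(y1, y2)   # about 1.196 and 1.3752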
Example 11. Using the Runge-Kutta fourth-order method, solve dy/dt = (y² − t²)/(y² + t²) with y₀ = 1 at t = 0.2 and 0.4.
Sol.
f(t, y) = (y² − t²)/(y² + t²), t₀ = 0, y₀ = 1, h = 0.2
K₁ = h f(t₀, y₀) = 0.2 f(0, 1) = 0.200
K₂ = h f(t₀ + h/2, y₀ + K₁/2) = 0.2 f(0.1, 1.1) = 0.19672
K₃ = h f(t₀ + h/2, y₀ + K₂/2) = 0.2 f(0.1, 1.09836) = 0.1967
K₄ = h f(t₀ + h, y₀ + K₃) = 0.2 f(0.2, 1.1967) = 0.1891
y₁ = y₀ + (1/6)(K₁ + 2K₂ + 2K₃ + K₄) = 1 + 0.19599 = 1.196
∴ y(0.2) = 1.196.
Now
t₁ = t₀ + h = 0.2
K₁ = h f(t₁, y₁) = 0.1891
K₂ = h f(t₁ + h/2, y₁ + K₁/2) = 0.2 f(0.3, 1.2906) = 0.1795
K₃ = h f(t₁ + h/2, y₁ + K₂/2) = 0.2 f(0.3, 1.2858) = 0.1793
K₄ = h f(t₁ + h, y₁ + K₃) = 0.2 f(0.4, 1.3753) = 0.1688
y₂ = y(0.4) = y₁ + (1/6)(K₁ + 2K₂ + 2K₃ + K₄) = 1.196 + 0.1792 = 1.3752.
K₃ = h f(x₀ + h/2, y₀ + K₂/2, z₀ + L₂/2) = 0.34385
L₃ = h g(x₀ + h/2, y₀ + K₂/2, z₀ + L₂/2) = −0.007762
K₄ = h f(x₀ + h, y₀ + K₃, z₀ + L₃) = 0.3893
L₄ = h g(x₀ + h, y₀ + K₃, z₀ + L₃) = −0.03104.
Hence
y₁ = y(0.3) = y₀ + (1/6)(K₁ + 2K₂ + 2K₃ + K₄) = 0.34483
z₁ = z(0.3) = z₀ + (1/6)(L₁ + 2L₂ + 2L₃ + L₄) = 0.9899.
Example 14. Consider the following Lotka-Volterra system, in which u is the number of prey and v is the number of predators:
du/dt = 2u − uv, u(0) = 1.5
dv/dt = −9v + 3uv, v(0) = 1.5.
Use the fourth-order Runge-Kutta method with step-size h = 0.2 to approximate the solution at t = 0.2.
Sol.
du/dt = 2u − uv = f(t, u, v)
dv/dt = −9v + 3uv = g(t, u, v),
u₀ = 1.5, v₀ = 1.5, h = 0.2.
K₁ = h f(t₀, u₀, v₀) = 0.15
L₁ = h g(t₀, u₀, v₀) = −1.35
K₂ = h f(t₀ + h/2, u₀ + K₁/2, v₀ + L₁/2) = 0.370125
L₂ = h g(t₀ + h/2, u₀ + K₁/2, v₀ + L₁/2) = −0.7054
K₃ = h f(t₀ + h/2, u₀ + K₂/2, v₀ + L₂/2) = 0.2874
L₃ = h g(t₀ + h/2, u₀ + K₂/2, v₀ + L₂/2) = −0.9052
K₄ = h f(t₀ + h, u₀ + K₃, v₀ + L₃) = 0.5023
L₄ = h g(t₀ + h, u₀ + K₃, v₀ + L₃) = −0.4328.
Therefore
u(0.2) = 1.5 + (1/6)(0.15 + 2 × 0.370125 + 2 × 0.2874 + 0.5023) = 1.8279
v(0.2) = 1.5 + (1/6)(−1.35 − 2 × 0.7054 − 2 × 0.9052 − 0.4328) = 0.6660.
Example 15. Solve by using the fourth-order Runge-Kutta method for x = 0.2:
d²y/dx² = x(dy/dx)² − y², y(0) = 1, y′(0) = 0.
Sol. Let
dy/dx = z = f(x, y, z).
Then
dz/dx = xz² − y² = g(x, y, z).
Now
x₀ = 0, y₀ = 1, z₀ = 0, h = 0.2
K₁ = h f(x₀, y₀, z₀) = 0.0
L₁ = h g(x₀, y₀, z₀) = −0.2
K₂ = h f(x₀ + h/2, y₀ + K₁/2, z₀ + L₁/2) = −0.02
L₂ = h g(x₀ + h/2, y₀ + K₁/2, z₀ + L₁/2) = −0.1998
K₃ = h f(x₀ + h/2, y₀ + K₂/2, z₀ + L₂/2) = −0.02
L₃ = h g(x₀ + h/2, y₀ + K₂/2, z₀ + L₂/2) = −0.1958
K₄ = h f(x₀ + h, y₀ + K₃, z₀ + L₃) = −0.0392
L₄ = h g(x₀ + h, y₀ + K₃, z₀ + L₃) = −0.1905.
Hence
y₁ = y(0.2) = y₀ + (1/6)(K₁ + 2K₂ + 2K₃ + K₄) = 0.9801
z₁ = y′(0.2) = z₀ + (1/6)(L₁ + 2L₂ + 2L₃ + L₄) = −0.1970.
Example 16. The motion of a swinging pendulum is described by the second-order differential equation
d²θ/dt² + (g/L) sin θ = 0, θ(0) = π/6, θ′(0) = 0,
where θ is the angle with the vertical at time t, the length of the pendulum is L = 2 ft, and g = 32.17 ft/s². With h = 0.1 s, find the angle θ at t = 0.1 using the Runge-Kutta fourth-order method.
Sol. First of all we convert the given second-order initial value problem into simultaneous first-order initial value problems. Setting dθ/dt = y, we obtain the system
dθ/dt = y = f(t, θ, y), θ(0) = π/6
dy/dt = −(g/L) sin θ = g(t, θ, y), y(0) = 0.
Here t0 = 0, θ0 = π/6, and y0 = 0. We have, by Runge-Kutta fourth order method, taking h = 0.1.
K1 = hf (t0 , θ0 , y0 ) = 0.00000000
L1 = hg(t0 , θ0 , y0 ) = −0.80425000
K2 = hf (t0 + 0.5h, θ0 + 0.5K1 , y0 + 0.5L1 ) = −0.04021250
L2 = hg(t0 + 0.5h, θ0 + 0.5K1 , y0 + 0.5L1 ) = −0.80425000
K3 = hf (t0 + 0.5h, θ0 + 0.5K2 , y0 + 0.5L2 ) = −0.04021250
L3 = hg(t0 + 0.5h, θ0 + 0.5K2 , y0 + 0.5L2 ) = −0.77608129
K4 = hf (t0 + h, θ0 + K3 , y0 + L3 ) = −0.07760813
L4 = hg(t0 + h, θ0 + K3 , y0 + L3 ) = −0.74759884.
θ₁ = θ₀ + (1/6)(K₁ + 2K₂ + 2K₃ + K₄) = 0.48385575.
Therefore, θ(0.1) ≈ θ1 = 0.48385575.
Exercises
(1) Show that each of the following initial-value problems (IVP) has a unique solution, and find
the solution.
(a) y′ = y cos t, 0 ≤ t ≤ 1, y(0) = 1.
(b) y′ = (2/t) y + t² eᵗ, 1 ≤ t ≤ 2, y(1) = 0.
(2) Apply Picard’s method for solving the initial-value problem generate y0 (t), y1 (t), y2 (t), and
y3 (t) for the initial-value problem
y � = −y + t + 1, 0 ≤ t ≤ 1, y(0) = 1.
(3) Consider the following initial-value problem:
x′ = t(x + t) − 2, x(0) = 2.
Use the Euler method with stepsize h = 0.2 to compute x(0.6).
(4) Given the initial-value problem
y′ = 1/t² − y/t − y², 1 ≤ t ≤ 2, y(1) = −1,
with exact solution y(t) = −1/t.
(a) Use Euler’s method with h = 0.05 to approximate the solution, and compare it with the
actual values of y.
(b) Use the answers generated in part (a) and linear interpolation to approximate the following
values of y, and compare them to the actual values.
i. y(1.052) ii. y(1.555) iii. y(1.978).
(5) Solve the following IVP by the second-order Runge-Kutta method:
y′ = −y + 2 cos t, y(0) = 1.
Compute y(0.2), y(0.4), and y(0.6) with mesh length 0.2.
(6) Compute solutions to the following problems with a second-order Taylor method. Use step size h = 0.2.
(a) y′ = (cos y)², 0 ≤ x ≤ 1, y(0) = 0.
(b) y′ = 20/(1 + 19e^{−x/4}), 0 ≤ x ≤ 1, y(0) = 1.
(7) A projectile of mass m = 0.11 kg shot vertically upward with initial velocity v(0) = 8 m/s is
slowed due to the force of gravity, Fg = −mg, and due to air resistance, Fr = −kv|v|, where
g = 9.8 m/s2 and k = 0.002 kg/m. The differential equation for the velocity v is given by
mv′ = −mg − kv|v|.
(a) Find the velocity after 0.1, 0.2, · · · , 1.0s.
(b) To the nearest tenth of a second, determine when the projectile reaches its maximum
height and begins falling.
(8) Use the Runge-Kutta fourth-order method to solve the IVP
dy/dx = x + √y, y(0.4) = 0.41
at x = 0.8 with step length h = 0.2.
(9) Water flows from an inverted conical tank with a circular orifice at the rate
dx/dt = −0.6πr² √(2g) √x / A(x),
where r is the radius of the orifice, x is the height of the liquid level from the vertex of the
cone, and A(x) is the area of the cross section of the tank x units above the orifice. Suppose
r = 0.1 ft, g = 32.1 ft/s2 , and the tank has an initial water level of 8 ft and initial volume of
512(π/3) ft3 . Use the Runge-Kutta method of order four to find the following.
(a) The water level after 10 min with h = 20 s.
(b) When the tank will be empty, to within 1 min.
(10) The following system represents a much simplified model of nerve cells:
dx/dt = x + y − x³, x(0) = 0.5
dy/dt = −x/2, y(0) = 0.1
where x(t) represents voltage across the boundary of nerve cell and y(t) is the permeability of
the cell wall at time t. Solve this system using Runge-Kutta fourth-order method to generate
the profile up to t = 0.2 with step size 0.1.
(11) Use the Runge-Kutta method of order four to solve
y″ − 3y′ + 2y = 6e⁻ᵗ, 0 ≤ t ≤ 1, y(0) = y′(0) = 2
for t = 0.2 with stepsize 0.2.
Appendix A. Algorithms
Algorithm for second-order Runge-Kutta method:
for i = 0, 1, 2, ... do
    t_{i+1} = t_i + h = t₀ + (i + 1)h
    K₁ = h f(t_i, y_i)
    K₂ = h f(t_{i+1}, y_i + K₁)
    y_{i+1} = y_i + (1/2)(K₁ + K₂)
end for
Algorithm for fourth-order Runge-Kutta method:
for i = 0, 1, 2, ... do
    t_{i+1} = t_i + h
    K₁ = h f(t_i, y_i)
    K₂ = h f(t_i + h/2, y_i + K₁/2)
    K₃ = h f(t_i + h/2, y_i + K₂/2)
    K₄ = h f(t_{i+1}, y_i + K₃)
    y_{i+1} = y_i + (1/6)(K₁ + 2K₂ + 2K₃ + K₄)
end for
Bibliography
[Burden] Richard L. Burden, J. Douglas Faires and Annette Burden, “Numerical Analysis,” Cengage
Learning, 10th edition, 2015.
[Atkinson] K. Atkinson and W. Han, “Elementary Numerical Analysis,” John Wiley and Sons, 3rd
edition, 2004.