
Lecture Notes

Numerical Analysis

Third Version 2017

Instructor: Paramjeet Singh


These notes were originally prepared during Fall 2013 for a Mathematics module, Numerical Analysis. In writing these notes, it was not my intention to add to the glut of Numerical Analysis texts. They were designed to complement the course text, Numerical Analysis, Ninth edition, by Burden and Faires. As such, these notes follow the conventions of that text fairly closely. If you are at all serious about pursuing the study of Numerical Analysis, you should consider acquiring that text, or any one of a number of other fine texts by, e.g., Atkinson or Cheney & Kincaid.
Special thanks go to the students of batch 2013, who suffered through early versions of these notes, which were riddled with (more) errors; I hope the notes now contain fewer. Working additional homework questions and example problems will help you learn the material.

First Version 2013.


Second Version 2015.
CHAPTER 1 (4 LECTURES)
FLOATING POINT ARITHMETIC AND ERRORS

1. Numerical analysis
Numerical analysis is the area of mathematics and computer science that creates, analyzes, and implements algorithms for obtaining numerical solutions to problems involving continuous variables. Such problems arise throughout the natural sciences, social sciences, engineering, medicine, and business. Since the mid 20th century, the growth in power and availability of digital computers has led to an increasing use of realistic mathematical models in science and engineering, and numerical analysis of increasing sophistication is needed to solve these more detailed models of the world. The formal academic area of numerical analysis ranges from quite theoretical mathematical studies to computer science issues. A major advantage of numerical techniques is that a numerical answer can be obtained even when a problem has no analytical solution. However, the result from numerical analysis is, in general, an approximation, which can be made as accurate as desired; for example, we can approximate the values of √2, π, etc.
With the increasing availability of computers, the new discipline of scientific computing, or com-
putational science, emerged during the 1980s and 1990s. The discipline combines numerical analysis,
symbolic mathematical computations, computer graphics, and other areas of computer science to make
it easier to set up, solve, and interpret complicated mathematical models of the real world.

1.1. Common perspectives in numerical analysis. Numerical analysis is concerned with all as-
pects of the numerical solution of a problem, from the theoretical development and understanding of
numerical methods to their practical implementation as reliable and efficient computer programs. Most
numerical analysts specialize in small subfields, but they share some common concerns, perspectives,
and mathematical methods of analysis. These include the following:
• When presented with a problem that cannot be solved directly, they try to replace it with a
“nearby problem” that can be solved more easily. Examples are the use of interpolation in
developing numerical integration methods and root-finding methods.
• There is widespread use of the language and results of linear algebra, real analysis, and func-
tional analysis (with its simplifying notation of norms, vector spaces, and operators).
• There is a fundamental concern with error, its size, and its analytic form. When approximating
a problem, it is prudent to understand the nature of the error in the computed solution.
Moreover, understanding the form of the error allows creation of extrapolation processes to
improve the convergence behaviour of the numerical method.
• Numerical analysts are concerned with stability, a concept referring to the sensitivity of the
solution of a problem to small changes in the data or the parameters of the problem. Numerical
methods for solving problems should be no more sensitive to changes in the data than the
original problem to be solved. Moreover, the formulation of the original problem should be
stable or well-conditioned.
In this chapter, we introduce and discuss some basic concepts of scientific computing. We begin with a discussion of floating-point representation and then discuss the most fundamental source of imperfection in numerical computing, namely roundoff errors. We also discuss the sources of errors and the stability of numerical algorithms.

2. Floating-point representation of numbers


Any real number is represented by an infinite sequence of digits. For example
8/3 = 2.66666 · · · = (2/10^1 + 6/10^2 + 6/10^3 + . . . ) × 10^1.

Figure 1. Numerical Approximations

This is an infinite series, but computers use a finite amount of memory to represent numbers. Thus only a finite number of digits may be used to represent any number, no matter what representation method is used.
For example, we can chop the infinite decimal representation of 8/3 after 4 digits:
8/3 ≈ (2/10^1 + 6/10^2 + 6/10^3 + 6/10^4) × 10^1 = 0.2666 × 10^1.
Generalizing this, we say that the number has n decimal digits and call this n the precision.
For each real number x, we associate a floating point representation denoted by fl(x), given by
fl(x) = ±(0.a_1 a_2 . . . a_n)_β × β^e,
where the β-based fraction is called the mantissa, with all a_i integers, and e is known as the exponent. This representation is called the β-based floating point representation of x, and we take base β = 10 in this course. For example,
42.965 = 4 × 10^1 + 2 × 10^0 + 9 × 10^{−1} + 6 × 10^{−2} + 5 × 10^{−3} = 0.42965 × 10^2,
−0.00234 = −0.234 × 10^{−2}.
The number 0 is written as 0.00 . . . 0 × 10^e. Likewise, we can use the binary number system, and any real x can be written as
x = ±q × 2^m
with 1/2 ≤ q ≤ 1 and some integer m. Both q and m will be expressed in terms of binary numbers. For example,
(1001.1101)_2 = 1 × 2^3 + 1 × 2^0 + 1 × 2^{−1} + 1 × 2^{−2} + 1 × 2^{−4} = (9.8125)_{10}.
Remark 2.1. The above representation is not unique.
For example, 0.2666 × 10^1 = 0.02666 × 10^2, etc.
Definition 2.1 (Normal form). A non-zero floating-point number is in normal form if the value of the mantissa lies in (−1, −0.1] or [0.1, 1).
Therefore, we normalize the representation by requiring a_1 ≠ 0. Not only is the precision limited to a finite number of digits, but the range of the exponent is also restricted. Thus there are integers m and M such that −m ≤ e ≤ M.

Definition 2.2 (Overflow and underflow). An overflow is obtained when a number is too large to fit into the floating point system in use, i.e., e > M. An underflow is obtained when a number is too small, i.e., e < −m. When overflow occurs in the course of a calculation, it is generally fatal. But underflow is non-fatal: the system usually sets the number to 0 and continues. (Matlab does this, quietly.)
2.1. Rounding and chopping. Let x be any real number and fl(x) be its machine approximation. There are two ways to do the “cutting” to store a real number
x = ±(0.a_1 a_2 . . . a_n a_{n+1} . . . ) × 10^e, a_1 ≠ 0.
(1) Chopping: We ignore the digits after a_n and write
fl(x) = ±(0.a_1 a_2 . . . a_n) × 10^e.
(2) Rounding: Rounding is defined as
fl(x) = ±(0.a_1 a_2 . . . a_n) × 10^e, if 0 ≤ a_{n+1} < 5 (rounding down),
fl(x) = ±[(0.a_1 a_2 . . . a_n) + (0.00 . . . 01)] × 10^e, if 5 ≤ a_{n+1} < 10 (rounding up).
Example 1.
fl(6/7) = 0.86 × 10^0 (rounding), fl(6/7) = 0.85 × 10^0 (chopping).
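The two cutting rules are easy to experiment with on a computer. Below is a short Python sketch (Python is not otherwise used in these notes; the helper name fl and its interface are illustrative only), which normalizes a number to the form (0.a_1 . . . a_n) × 10^e and then chops or rounds the mantissa; exponent-range limits are ignored.

    import math

    def fl(x, n, mode="round"):
        # n-digit base-10 floating-point approximation of x (illustrative sketch).
        # Normalizes x as m * 10**e with m in [0.1, 1), then cuts m to n digits.
        if x == 0:
            return 0.0
        e = math.floor(math.log10(abs(x))) + 1      # exponent so mantissa is in [0.1, 1)
        m = abs(x) / 10**e                          # mantissa 0.a1 a2 a3 ...
        if mode == "chop":
            m = math.floor(m * 10**n) / 10**n       # drop digits after a_n
        else:
            m = math.floor(m * 10**n + 0.5) / 10**n # round half up, as in the definition
        return math.copysign(m * 10**e, x)

    print(fl(8/3, 4, "chop"))    # 2.666
    print(fl(6/7, 2, "round"))   # 0.86
    print(fl(6/7, 2, "chop"))    # 0.85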

3. Errors in numerical approximations


Definition 3.1 (Absolute and relative error). If fl(x) is the approximation to the exact value x, then the absolute error is |x − fl(x)|, and the relative error is |x − fl(x)|/|x|.
Remark: As a measure of accuracy, the absolute error may be misleading and the relative error is more
meaningful.

Example 2. Find the largest interval in which fl(x) must lie to approximate √2 with relative error at most 10^{−5}.
Sol. We require
|√2 − fl(x)|/√2 ≤ 10^{−5}.
Therefore
|√2 − fl(x)| ≤ √2 · 10^{−5},
−√2 · 10^{−5} ≤ √2 − fl(x) ≤ √2 · 10^{−5},
√2 − √2 · 10^{−5} ≤ fl(x) ≤ √2 + √2 · 10^{−5}.
Hence the interval (in decimals) is [1.4141994 · · · , 1.4142277 · · · ].
3.1. Chopping and Rounding Errors. Let x be any real number we want to represent in a computer, and let fl(x) be the representation of x in the computer. What is the largest possible value of |x − fl(x)|/|x|? In the worst case, how much data are we losing due to round-off or chopping errors?
Chopping errors: Let
x = (0.a_1 a_2 . . . a_n a_{n+1} . . . ) × 10^e = (a_1/10 + a_2/10^2 + · · · + a_n/10^n + a_{n+1}/10^{n+1} + · · · ) × 10^e = (Σ_{i=1}^{∞} a_i/10^i) × 10^e, a_1 ≠ 0,
fl(x) = (0.a_1 a_2 . . . a_n) × 10^e = (Σ_{i=1}^{n} a_i/10^i) × 10^e.
Therefore
|x − fl(x)| = (Σ_{i=n+1}^{∞} a_i/10^i) × 10^e.
Now since each a_i ≤ 9 = 10 − 1,
|x − fl(x)| ≤ Σ_{i=n+1}^{∞} (10 − 1)/10^i × 10^e = (10 − 1)(1/10^{n+1} + 1/10^{n+2} + . . . ) × 10^e = (10 − 1) · (1/10^{n+1})/(1 − 1/10) × 10^e = 10^{e−n}.
Therefore the absolute error bound is
E_a = |x − fl(x)| ≤ 10^{e−n}.
Now
|x| = (0.a_1 a_2 . . . a_n . . . )_{10} × 10^e ≥ 0.1 × 10^e = 10^{e−1}.
Therefore the relative error bound is
E_r = |x − fl(x)|/|x| ≤ 10^{e−n}/10^{e−1} = 10^{1−n}.

Rounding errors: For rounding,
fl(x) = (0.a_1 a_2 . . . a_n)_{10} × 10^e = (Σ_{i=1}^{n} a_i/10^i) × 10^e, if 0 ≤ a_{n+1} < 5,
fl(x) = (0.a_1 a_2 . . . a_{n−1}[a_n + 1])_{10} × 10^e = (1/10^n + Σ_{i=1}^{n} a_i/10^i) × 10^e, if 5 ≤ a_{n+1} < 10.
For 0 ≤ a_{n+1} < 5 = 10/2,
|x − fl(x)| = (Σ_{i=n+1}^{∞} a_i/10^i) × 10^e = (a_{n+1}/10^{n+1} + Σ_{i=n+2}^{∞} a_i/10^i) × 10^e
≤ ((10/2 − 1)/10^{n+1} + Σ_{i=n+2}^{∞} (10 − 1)/10^i) × 10^e
= ((10/2 − 1)/10^{n+1} + 1/10^{n+1}) × 10^e
= (1/2) · 10^{e−n}.
For 5 ≤ a_{n+1} < 10,
|x − fl(x)| = |Σ_{i=n+1}^{∞} a_i/10^i − 1/10^n| × 10^e = |1/10^n − Σ_{i=n+1}^{∞} a_i/10^i| × 10^e
= |1/10^n − a_{n+1}/10^{n+1} − Σ_{i=n+2}^{∞} a_i/10^i| × 10^e
≤ |1/10^n − a_{n+1}/10^{n+1}| × 10^e.
Since −a_{n+1} ≤ −10/2,
|x − fl(x)| ≤ (1/10^n − (10/2)/10^{n+1}) × 10^e = (1/2) · 10^{e−n}.
Thus, for both cases, the absolute error bound is
E_a = |x − fl(x)| ≤ (1/2) · 10^{e−n}.
Hence the relative error bound is
E_r = |x − fl(x)|/|x| ≤ (1/2) · 10^{e−n}/10^{e−1} = (1/2) · 10^{1−n} = 5 × 10^{−n}.

4. Significant Figures
The term significant digits is often used to loosely describe the number of decimal digits that appear to be accurate. The definition below is more precise.
Looking at an approximation 2.75303 to an actual value of 2.75194, we note that the three most significant digits are equal, and therefore one may state that the approximation has three significant digits of accuracy. One problem with simply looking at the digits is given by the following two examples:
(1) 1.9 as an approximation to 1.1 may appear to have one significant digit, but with a relative error of 0.73, this seems unreasonable.
(2) 1.9999 as an approximation to 2.0001 may appear to have no significant digits, but the relative error is 0.00010, which is almost the same relative error as the approximation 1.9239 has to 1.9237.
Thus, we need a more mathematical definition of the number of significant digits. Let the number x and the approximation x* be written in decimal form. The number of significant digits tells us to about how many positions x and x* agree. More precisely, we say that x* has m significant digits of x if the absolute error |x − x*| has zeros in the first m decimal places, counting from the leftmost nonzero (leading) position of x, followed by a digit from 0 to 5.
Examples:
5.1 has 1 significant digit of 5: |5 − 5.1| = 0.1.
0.51 has 1 significant digit of 0.5: |0.5 − 0.51| = 0.01.
4.995 has 3 significant digits of 5: 5 − 4.995 = 0.005.
4.994 has 2 significant digits of 5: 5 − 4.994 = 0.006.
0.57 has all significant digits of 0.57.
1.4 has 0 significant digits of 2: 2 − 1.4 = 0.6.
In terms of relative errors, the number x* is said to approximate x to m significant digits (or figures) if m is the largest nonnegative integer for which
|x − x*|/|x| ≤ 0.5 × 10^{−m}.
If the relative error is greater than 0.5, then we will simply state that the approximation has zero significant digits.
6 FLOATING POINT ARITHMETIC AND ERRORS

For example, if we approximate π with 3.14, then the relative error is
E_r = |π − 3.14|/π ≈ 0.00051 ≤ 0.005 = 0.5 × 10^{−2},
and therefore it is correct to two significant digits.
Also, 4.994 has 2 significant digits of 5, as the relative error is (5 − 4.994)/5 = 0.0012 = 0.12 × 10^{−2} ≤ 0.5 × 10^{−2}.
Some numbers are exact because they are known with complete certainty. Most exact numbers are
integers: exactly 12 inches are in a foot, there might be exactly 23 students in a class. Exact numbers
can be considered to have an infinite number of significant figures.
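The relative-error criterion above can be checked mechanically. The following Python sketch (the function name is mine) returns the largest m with |x − x*|/|x| ≤ 0.5 × 10^{−m}:

    import math

    def significant_digits(x, x_star):
        # Largest nonnegative m with |x - x*| / |x| <= 0.5 * 10**(-m).
        rel = abs(x - x_star) / abs(x)
        if rel > 0.5:
            return 0            # zero significant digits
        m = 0
        while rel <= 0.5 * 10 ** (-(m + 1)):
            m += 1
        return m

    print(significant_digits(math.pi, 3.14))    # 2
    print(significant_digits(2.0001, 1.9999))   # 3
    print(significant_digits(1.1, 1.9))         # 0 (relative error 0.73)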

5. Rules for mathematical operations


In carrying out calculations, the general rule is that the accuracy of a calculated result is limited
by the least accurate measurement involved in the calculation. In addition and subtraction, the result
is rounded off so that it has the same number of digits as the measurement having the fewest decimal
places (counting from left to right). For example, 100 + 23.643 = 123.643, which should be rounded to
124.
Let the floating-point representations fl(x) and fl(y) be given for the real numbers x and y, and let the symbols ⊕, ⊖, ⊗, and ⊘ represent machine addition, subtraction, multiplication, and division operations, respectively. We will assume a finite-digit arithmetic given by
x ⊕ y = fl(fl(x) + fl(y)), x ⊖ y = fl(fl(x) − fl(y)),
x ⊗ y = fl(fl(x) × fl(y)), x ⊘ y = fl(fl(x) ÷ fl(y)).
This arithmetic corresponds to performing exact arithmetic on the floating-point representations of x and y and then converting the exact result to its finite-digit floating-point representation.
Example 3. Suppose that x = 5/7 and y = 1/3. Use five-digit chopping to calculate x + y, x − y, x × y, and x ÷ y.
Sol. Here x = 5/7 = 0.714285 · · · and y = 1/3 = 0.33333 · · · .
The five-digit chopping values of x and y are
fl(x) = 0.71428 × 10^0 and fl(y) = 0.33333 × 10^0.
Thus,
x ⊕ y = fl(fl(x) + fl(y)) = fl(0.71428 × 10^0 + 0.33333 × 10^0) = fl(1.04761 × 10^0) = 0.10476 × 10^1.
The true value is x + y = 5/7 + 1/3 = 22/21, so we have
Absolute Error E_a = |22/21 − 0.10476 × 10^1| = 0.190 × 10^{−4},
Relative Error E_r = (0.190 × 10^{−4})/(22/21) = 0.182 × 10^{−4}.
Similarly we can perform the other calculations.
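The computation above can be mimicked in Python with a small chopping helper (a sketch; chop is my own name, and only the addition x ⊕ y is shown):

    import math

    def chop(x, n=5):
        # n-digit base-10 chopping of x, as in Section 2.1.
        if x == 0:
            return 0.0
        e = math.floor(math.log10(abs(x))) + 1
        m = math.floor(abs(x) / 10**e * 10**n) / 10**n
        return math.copysign(m * 10**e, x)

    x, y = 5/7, 1/3
    fx, fy = chop(x), chop(y)     # 0.71428 and 0.33333
    s = chop(fx + fy)             # x (+) y = fl(fl(x) + fl(y)) = 1.0476
    print(fx, fy, s)
    print(abs(22/21 - s))         # absolute error, about 0.190e-4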
Further we show some examples of arithmetic with different exponents.
Example 4. Add the following floating-point numbers: 0.4546e3 and 0.5433e7.
Sol. These operands have unequal exponents. To add them, we first express both with the larger exponent:
0.5433e7 + 0.0000e7 = 0.5433e7,
because 0.4546e3, shifted to exponent 7, becomes 0.0000e7 in four-digit arithmetic.
Example 5. Subtract the following floating-point numbers:
1. 0.5424e − 99 from 0.5452e − 99
2. 0.3862e − 7 from 0.9682e − 7
Sol. On subtracting we get 0.0028e − 99. This is a floating-point number, but it is not in normalized form. To normalize it, we shift the mantissa to the left and obtain 0.28e − 101, whose exponent falls below the permissible range. This condition is called an underflow condition.
Similarly, after the second subtraction we get 0.5820e − 7.
FLOATING POINT ARITHMETIC AND ERRORS 7

Example 6. Multiply the following floating point numbers: 0.1111e74 and 0.2000e80.
Sol. On multiplying we obtain 0.1111e74 × 0.2000e80 = 0.2222e153. This shows overflow condition of
normalized floating-point numbers.
Example 7. If x = 0.3721478693 and y = 0.3720230572, what is the relative error in the computation of x − y using five decimal digits of accuracy?
Sol. We can compute with ten decimal digits of accuracy and take this value as ‘exact’:
x − y = 0.0001248121.
Both x and y will be rounded to five digits before subtraction. Thus
fl(x) = 0.37215, fl(y) = 0.37202,
fl(x) − fl(y) = 0.13000 × 10^{−3}.
The relative error, therefore, is
E_r = |(x − y) − (fl(x) − fl(y))|/|x − y| ≈ 0.04 = 4%.
Example 8. The error in the measurement of the area of a circle is not allowed to exceed 0.5%. How accurately should the radius be measured?
Sol. The area of the circle is A = πr^2, so
∂A/∂r = 2πr.
The relative error (in percentage) in the area is (δA/A) × 100 = 0.5.
Therefore δA = (0.5/100) × A = πr^2/200.
Since δr ≈ δA/(∂A/∂r), the percentage relative error in the radius is
(δr/r) × 100 = (100/r) · δA/(∂A/∂r) = (100/r) · (πr^2/200)/(2πr) = 0.25.

6. Loss of Accuracy
Round-off errors are inevitable and difficult to control. Other types of errors which occur in com-
putation may be under our control. The subject of numerical analysis is largely preoccupied with
understanding and controlling errors of various kinds.
One of the most common error-producing calculations involves the cancellation of digits due to the subtraction of nearly equal numbers (or the addition of one very large number and one very small number, or the multiplication of a small number by a quite large number). The loss of accuracy due to round-off error can often be avoided by a reformulation of the calculation, as illustrated in the next example.
Example 9. Use four-digit rounding arithmetic and the formula for the roots of a quadratic equation,
to find the most accurate approximations to the roots of the following quadratic equation. Compute the
absolute and relative errors.
1.002x2 + 11.01x + 0.01265 = 0.
Sol. The quadratic formula states that the roots of ax^2 + bx + c = 0 are
x_{1,2} = (−b ± √(b^2 − 4ac))/(2a).
Using the above formula, the roots of 1.002x^2 + 11.01x + 0.01265 = 0 are approximately (using long format)
x_1 = −0.00114907565991, x_2 = −10.98687487643590.

We use four-digit rounding arithmetic to find approximations to the roots, written x_1^* and x_2^*. These approximations are given by
x_{1,2}^* = (−11.01 ± √((−11.01)^2 − 4 · 1.002 · 0.01265))/(2 · 1.002)
= (−11.01 ± √(121.2 − 0.05070))/2.004
= (−11.01 ± 11.00)/2.004.
Therefore, we find the first root:
x_1^* = (−11.01 + 11.00)/2.004 = −0.004990,
which has the absolute error |x_1 − x_1^*| = 0.00384095 and relative error |x_1 − x_1^*|/|x_1| = 3.34265968 (very high).
We find the second root
x_2^* = (−11.01 − 11.00)/2.004 = −10.98,
which has absolute error
|x_2 − x_2^*| = 0.006874876
and relative error
|x_2 − x_2^*|/|x_2| = 0.000626127.
The quadratic formula for the calculation of the first root encounters the subtraction of nearly equal numbers, which causes a loss of accuracy: here b and √(b^2 − 4ac) are nearly equal. Therefore, we use the alternate quadratic formula, obtained by rationalizing the numerator, to calculate the first root:
x_1^* = ((−b + √(b^2 − 4ac))/(2a)) · ((−b − √(b^2 − 4ac))/(−b − √(b^2 − 4ac)))
= −2c/(b + √(b^2 − 4ac))
= −0.001149,
which has the following very small relative error:
|x_1 − x_1^*|/|x_1| = 6.584 × 10^{−5}.
Remark 6.1. If, however, we rationalize the numerator in x_2 to get
x_2 = −2c/(b − √(b^2 − 4ac)),
the use of this formula not only involves the subtraction of two nearly equal numbers but also division by a small number. This would degrade the accuracy.
Remark 6.2. Since the product of the roots of a quadratic is c/a, we can find the approximation of the first root from the second as
x_1^* = c/(a x_2^*).

Example 10. The quadratic formula is used for computing the roots of the equation ax^2 + bx + c = 0, a ≠ 0, which are given by
x = (−b ± √(b^2 − 4ac))/(2a).
Consider the equation x^2 + 62.10x + 1 = 0 and discuss the numerical results.

Sol. Using the quadratic formula and 8-digit rounding arithmetic, we obtain the two roots
x_1 = −0.01610723, x_2 = −62.08390.
We use these values as “exact values”. Now we perform the calculations with 4-digit rounding arithmetic. We have √(b^2 − 4ac) = √(62.10^2 − 4.000) = √(3856 − 4.000) = √3852 = 62.06 and
fl(x_1) = (−62.10 + 62.06)/2.000 = −0.02000.
The relative error in computing x_1 is
|fl(x_1) − x_1|/|x_1| = |−0.02000 + 0.01610723|/|−0.01610723| = 0.2417.
In calculating x_2,
fl(x_2) = (−62.10 − 62.06)/2.000 = −62.10.
The relative error in computing x_2 is
|fl(x_2) − x_2|/|x_2| = |−62.10 + 62.08390|/|−62.08390| = 0.259 × 10^{−3}.
In this equation, b^2 = 62.10^2 is much larger than 4ac = 4, hence b and √(b^2 − 4ac) are nearly equal numbers. The calculation of x_1 involves the subtraction of two nearly equal numbers, but x_2 involves the addition of two nearly equal numbers, which does not cause serious loss of significant figures.
To obtain a more accurate 4-digit rounding approximation for x_1, we change the formulation by rationalizing the numerator, that is,
x_1 = −2c/(b + √(b^2 − 4ac)).
Then
fl(x_1) = −2.000/(62.10 + 62.06) = −2.000/124.2 = −0.01610.
The relative error in computing x_1 is now reduced to 0.62 × 10^{−3}.
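The rationalization trick generalizes to a standard recipe for quadratic equations: compute the larger-magnitude root with the sign of b chosen so that no cancellation occurs, and recover the smaller root from the product x_1 x_2 = c/a (Remark 6.2). A Python sketch, assuming real, distinct roots:

    import math

    def quadratic_roots(a, b, c):
        # Roots of a*x**2 + b*x + c = 0 without subtractive cancellation.
        d = math.sqrt(b * b - 4 * a * c)        # assumes b**2 - 4ac > 0
        q = -0.5 * (b + math.copysign(d, b))    # b and sign(b)*d have the same sign
        return q / a, c / q                     # larger root, then smaller via c/a

    print(quadratic_roots(1.0, 62.10, 1.0))
    # approximately (-62.0839, -0.0161072), matching Example 10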

Nested Arithmetic: Accuracy loss due to round-off error can also be reduced by rearranging calcu-
lations, as shown in the next example. Polynomials should always be expressed in nested form before
performing an evaluation, because this form minimizes the number of arithmetic calculations. One
way to reduce round-off error is to reduce the number of computations.
Example 11. Evaluate f (x) = x3 − 6.1x2 + 3.2x + 1.5 at x = 4.71 using three-digit arithmetic directly
and with nesting.
Sol. The exact result of the evaluation is (by taking more digits):
Exact: f(4.71) = 4.71^3 − 6.1 × 4.71^2 + 3.2 × 4.71 + 1.5 = −14.263899.
Now using three-digit rounding arithmetic, we obtain
f(4.71) = 4.71^3 − 6.1 × 4.71^2 + 3.2 × 4.71 + 1.5
= 22.2 × 4.71 − 6.1 × 22.2 + 15.1 + 1.5
= 105 − 135 + 15.1 + 1.5 = −13.4.
Similarly, if we use three-digit chopping, then
f(4.71) = 4.71^3 − 6.1 × 4.71^2 + 3.2 × 4.71 + 1.5
= 22.1 × 4.71 − 6.1 × 22.1 + 15.0 + 1.5
= 104 − 134 + 15.0 + 1.5 = −13.5.
The relative error in the case of three-digit rounding is
|(−14.263899 + 13.4)/(−14.263899)| ≈ 0.06,

and for three-digit chopping it is
|(−14.263899 + 13.5)/(−14.263899)| ≈ 0.05.

As an alternative approach, we write the polynomial f (x) in a nested manner as


f (x) = ((x − 6.1)x + 3.2)x + 1.5.
Using three-digit chopping arithmetic now produces
f(4.71) = ((4.71 − 6.1)4.71 + 3.2)4.71 + 1.5
= ((−1.39)4.71 + 3.2)4.71 + 1.5
= (−6.54 + 3.2)4.71 + 1.5
= (−3.34)4.71 + 1.5
= −15.7 + 1.5 = −14.2.
In a similar fashion, we can obtain a three-digit rounding and answer is −14.3.
The relative error in the case of three-digit chopping is
|(−14.263899 + 14.2)/(−14.263899)| ≈ 0.0045,
and for three-digit rounding it is
|(−14.263899 + 14.3)/(−14.263899)| ≈ 0.0025.

Nesting has reduced the relative error for both approximations. Moreover, in the original form each power of x is computed separately, requiring several more multiplications, while the nested form needs only 2 multiplications. Thus the nested form reduces errors as well as the number of calculations.
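The nested form is known as Horner's method; a Python sketch of the idea (see also Exercise 13):

    def horner(coeffs, x):
        # Evaluate a polynomial by nested multiplication (Horner's method).
        # coeffs lists coefficients from the highest power down, e.g.
        # x**3 - 6.1*x**2 + 3.2*x + 1.5  ->  [1.0, -6.1, 3.2, 1.5].
        result = 0.0
        for c in coeffs:
            result = result * x + c   # one multiplication and one addition per step
        return result

    print(horner([1.0, -6.1, 3.2, 1.5], 4.71))   # -14.263899 in full precision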

7. Algorithms and Stability


An algorithm is a procedure that describes, in an unambiguous manner, a finite sequence of steps
to be performed in a specified order. The object of the algorithm is to implement a procedure to solve
a problem or approximate a solution to the problem. One criterion we will impose on an algorithm
whenever possible is that small changes in the initial data produce correspondingly small changes in
the final results. An algorithm that satisfies this property is called stable; otherwise it is unstable.
Some algorithms are stable only for certain choices of initial data, and are called conditionally stable.
The words condition and conditioning are used to indicate how sensitive the solution of a problem may
be to small changes in the input data. A problem is well-conditioned if small changes in the input data
can produce only small changes in the results. On the other hand, a problem is ill-conditioned if small
changes in the input data can produce large changes in the output.
For certain types of problems, a condition number can be defined. If that number is large (much greater than one), it indicates an ill-conditioned problem. In contrast, if the number is of modest size (around one or smaller), the problem is recognized as well-conditioned.
The condition number can be calculated in the following manner:
κ = (relative change in output)/(relative change in input)
= |(f(x) − f(x*))/f(x)| / |(x − x*)/x|
≈ |x f′(x)/f(x)|.

For example, if f(x) = 10/(1 − x^2), then the condition number can be calculated as
κ = |x f′(x)/f(x)| = 2x^2/|1 − x^2|.
The condition number is quite large for |x| ≈ 1. Therefore, the function is ill-conditioned near these points.
Example 12. Compute and interpret the condition number for
(a) f(x) = sin x for x = 0.51π.
(b) f(x) = tan x for x = 1.7.
Sol. (a) The condition number is given by
κ = |x f′(x)/f(x)|.
For x = 0.51π, f′(x) = cos(0.51π) = −0.03141 and f(x) = sin(0.51π) = 0.99951.
∴ κ = 0.05035 < 1.
Since the condition number is < 1, we conclude that the relative error is attenuated.
(b) f(x) = tan x, f(1.7) = −7.6966, f′(x) = 1/cos^2 x, f′(1.7) = 1/cos^2(1.7) = 60.2377.
κ = |1.7 × 60.2377/(−7.6966)| = 13.305 ≫ 1.
Thus, the function is ill-conditioned.
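The condition number κ = |x f′(x)/f(x)| can also be estimated numerically when f′ is not known in closed form, e.g. with a central finite difference (a sketch; the step h is an arbitrary small choice of mine):

    import math

    def condition_number(f, x, h=1e-6):
        # kappa = |x * f'(x) / f(x)| with f' estimated by a central difference.
        fprime = (f(x + h) - f(x - h)) / (2 * h)
        return abs(x * fprime / f(x))

    print(condition_number(math.sin, 0.51 * math.pi))   # ~0.0503, well-conditioned
    print(condition_number(math.tan, 1.7))              # ~13.3, ill-conditioned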

7.1. Creating Algorithms. Another theme that occurs repeatedly in numerical analysis is the distinction between numerical algorithms that are stable and those that are not. Informally speaking, a numerical process is unstable if small errors made at one stage of the process are magnified and propagated in subsequent stages and seriously degrade the accuracy of the overall calculation.
An algorithm can be thought of as a sequence of problems, i.e., a sequence of function evaluations. In this case we consider the algorithm for evaluating f(x) to consist of the evaluation of the sequence x_1, x_2, · · · , x_n. We are concerned with the condition of each of the functions f_1(x_1), f_2(x_2), · · · , f_{n−1}(x_{n−1}), where f(x) = f_i(x_i) for all i. An algorithm is unstable if any f_i is ill-conditioned, i.e., if any f_i(x_i) has condition much worse than f(x). In the following, we study an example of creating a stable algorithm.
Example 13. Write an algorithm to calculate the expression f(x) = √(x + 1) − √x when x is quite large. By considering the condition number κ of the subproblem of evaluating the function, show that such a function evaluation is not stable. Suggest a modification which makes it stable.
Sol. Consider
f(x) = √(x + 1) − √x,
so that there is a potential loss of significance when x is large. Taking x = 12345 as an example, one possible algorithm is
x_0 := x = 12345
x_1 := x_0 + 1
x_2 := √x_1
x_3 := √x_0
f(x) := x_4 := x_2 − x_3.
The loss of significance occurs with the final subtraction. We can rewrite the last step in the form f_3(x_3) = x_2 − x_3 to show how the final answer depends on x_3. As f_3′(x_3) = −1, we have the condition
κ(x_3) = |x_3 f_3′(x_3)/f_3(x_3)| = |x_3/(x_2 − x_3)|,

from which we find κ(x_3) ≈ 24690.5 when x = 12345. Note that this is the condition of a subproblem arrived at during the algorithm.
To find an alternative algorithm, we write
f(x) = (√(x + 1) − √x) · (√(x + 1) + √x)/(√(x + 1) + √x) = 1/(√(x + 1) + √x).
This suggests the algorithm
x_0 := x = 12345
x_1 := x_0 + 1
x_2 := √x_1
x_3 := √x_0
x_4 := x_2 + x_3
f(x) := x_5 := 1/x_4.
In this case f_3(x_3) = 1/(x_2 + x_3), giving a condition for the subproblem of
κ(x_3) = |x_3 f_3′(x_3)/f_3(x_3)| = |x_3/(x_2 + x_3)|,
which is approximately 0.5 when x = 12345. Thus the first algorithm is unstable and the second is stable for large values of x. In general such analyses are not usually so straightforward but, in principle, stability can be analysed by examining the condition of a sequence of subproblems.
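The instability is easy to observe by doing the computation in single precision (about 7 decimal digits). A Python/NumPy sketch (single precision is my stand-in for the short-mantissa decimal arithmetic of these notes):

    import numpy as np

    x = np.float32(12345.0)
    one = np.float32(1.0)

    unstable = np.sqrt(x + one) - np.sqrt(x)            # subtracts nearly equal numbers
    stable = one / (np.sqrt(x + one) + np.sqrt(x))      # rationalized form

    exact = 1.0 / (np.sqrt(12346.0) + np.sqrt(12345.0)) # double-precision reference
    print(unstable, stable, exact)
    # the unstable form retains only a few correct digits; the stable form ~7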
Example 14. Write an algorithm to calculate the expression f(x) = sin(a + x) − sin a when x = 0.0001. By considering the condition number κ of the subproblem of evaluating the function, show that such a function evaluation is not stable. Suggest a modification which makes it stable.
Sol. Let x = 0.0001. One possible algorithm is
x_0 = 0.0001
x_1 = a + x_0
x_2 = sin x_1
x_3 = sin a
x_4 = x_2 − x_3.
Now, to check the effect of x_3 on x_2, we consider the function f_3(x_3) = x_2 − x_3 and calculate the condition with a = 2:
κ(x_3) = |x_3 f_3′(x_3)/f_3(x_3)| = |x_3/(x_2 − x_3)| ≈ 20008.13.
We obtain a very large condition number, which shows that the last step is not stable.
Thus we need to modify the above algorithm, and we write the equivalent form
f(x) = sin(a + x) − sin a = 2 sin(x/2) cos(a + x/2).
The modified algorithm is the following:
x_0 = 0.0001
x_1 = x_0/2
x_2 = sin x_1
x_3 = cos(a + x_1)
x_4 = 2 x_2 x_3.
Now we consider the function f_3(x_3) = 2 x_2 x_3, for which
κ(x_3) = |x_3 · 2x_2/(2 x_2 x_3)| = 1.
Thus the condition number is one, so this form is acceptable.
Remarks

(1) Accuracy tells us the closeness of the computed solution to the true solution of the problem. Accuracy depends on the conditioning of the problem as well as on the stability of the algorithm.
(2) Stability alone does not guarantee accurate results. Applying a stable algorithm to a well-conditioned problem yields an accurate solution. Inaccuracy can result from applying a stable algorithm to an ill-conditioned problem or an unstable algorithm to a well-conditioned problem.

Exercises
(1) Compute the absolute error and relative error in approximations of x by x*.
(a) x = π, x* = 22/7.
(b) x = √2, x* = 1.414.
(c) x = 8!, x* = 39900.
(2) Find the largest interval in which x* must lie to approximate x with relative error at most 10^{−4} for each value of x.
(a) π.
(b) e.
(c) √3.
(d) ∛7.
(3) A rectangular parallelepiped has sides of length 3 cm, 4 cm, and 5 cm, measured to the nearest
centimeter. What are the best upper and lower bounds for the volume of this parallelepiped?
What are the best upper and lower bounds for the surface area?
(4) Use three-digit rounding arithmetic to perform the following calculations. Compute the absolute error and relative error with the exact value determined to at least five digits.
(a) √3 + (√5 + √7).
(b) (121 − 0.327) − 119.
(c) −10π + 6e − 3/62.
(d) (π − 22/7)/(1/17).
(5) Calculate the values of x^2 + 2x − 2 and (2x − 2) + x^2, where x = 0.7320e0, using normalized floating-point arithmetic, and prove that they are not the same. Compare with the value of (x^2 − 2) + 2x.
(6) The derivative of f(x) = 1/(1 − 3x^2) is given by 6x/(1 − 3x^2)^2. Do you expect to have difficulties evaluating this derivative at x = 0.577? Try it using 3- and 4-digit arithmetic with chopping.
(7) Suppose two points (x_0, y_0) and (x_1, y_1) are on a straight line with y_1 ≠ y_0. Two formulas are available to find the x-intercept of the line:
x = (x_0 y_1 − x_1 y_0)/(y_1 − y_0), and x = x_0 − (x_1 − x_0) y_0/(y_1 − y_0).
(a) Show that both formulas are algebraically correct.
(b) Use the data (x_0, y_0) = (1.31, 3.24) and (x_1, y_1) = (1.93, 4.76) and three-digit rounding arithmetic to compute the x-intercept both ways. Which method is better and why?
(8) Use four-digit rounding arithmetic and the quadratic formula to find the most accurate approximations to the roots of the following quadratic equation. Compute the absolute errors and relative errors.
(1/3)x^2 + (123/4)x − 1/6 = 0.
(9) Find the root of smallest magnitude of the equation x^2 − 1000x + 25 = 0 using the quadratic formula. Work in floating-point arithmetic using a four-decimal-place mantissa.
(10) Consider the identity
∫_0^x sin(xt) dt = (1 − cos(x^2))/x.
Explain the difficulty in using the right-hand fraction to evaluate this expression when x is close to zero. Give a way to avoid this problem and be as precise as possible.
(11) Assume a 3-digit mantissa with rounding.
(a) Evaluate y = x^3 − 3x^2 + 4x + 0.21 for x = 2.73.
(b) Evaluate y = [(x − 3)x + 4]x + 0.21 for x = 2.73.
Compare and discuss the errors obtained in parts (a) and (b).
(12) How many multiplications and additions are required to determine a sum of the form
Σ_{i=1}^{n} Σ_{j=1}^{i} a_i b_j ?
Modify the sum to an equivalent form that reduces the number of computations.
(13) Let P(x) = a_n x^n + a_{n−1} x^{n−1} + · · · + a_1 x + a_0 be a polynomial, and let x_0 be given. Construct an algorithm to evaluate P(x_0) using nested multiplication.
(14) Construct an algorithm that has as input an integer n ≥ 1, numbers x_0, x_1, · · · , x_n, and a number x, and that produces as output the product (x − x_0)(x − x_1) · · · (x − x_n).
(15) Verify that the functions f(x) and g(x) are identical:
f(x) = 1 − sin x, g(x) = cos^2 x/(1 + sin x).
(a) Which function should be used for computations when x is near π/2? Why?
(b) Which function should be used for computations when x is near 3π/2? Why?
(16) (a) Consider the stability (by calculating the condition number) of √(1 + x) − 1 when x is near 0. Rewrite the expression to rid it of subtractive cancellation.
(b) Rewrite e^x − cos x to be stable when x is near 0.
(17) Suppose that the function f(x) = ln(x + 1) − ln(x) is computed by the following algorithm for large values of x using six-digit rounding arithmetic:
x_0 := x = 12345
x_1 := x_0 + 1
x_2 := ln x_1
x_3 := ln x_0
f(x) := x_4 := x_2 − x_3.
By considering the condition κ(x_3) of the subproblem of evaluating the function, show that such a function evaluation is not stable. Also propose a modification of the function evaluation so that the algorithm becomes stable.

Bibliography
[Burden] Richard L. Burden, J. Douglas Faires, and Annette Burden, “Numerical Analysis,” Cengage
Learning, 10th edition, 2015.
[Atkinson] K. Atkinson and W. Han, “Elementary Numerical Analysis,” John Wiley and Sons, 3rd edition, 2004.
CHAPTER 2 (8 LECTURES)
ROOTS OF NON-LINEAR EQUATIONS IN ONE VARIABLE

1. Introduction
Finding one or more roots (or zeros) of the equation
f(x) = 0
is one of the more commonly occurring problems of applied mathematics. In most cases explicit solutions are not available and we must be satisfied with being able to find a root to any specified degree of accuracy. The numerical procedures for finding the roots are called iterative methods. These problems arise in a variety of applications.
The growth of a population can often be modeled over short periods of time by assuming that the
population grows continuously with time at a rate proportional to the number present at that time.
Suppose that N (t) denotes the number in the population at time t and λ denotes the constant birth
rate of the population. Then the population satisfies the differential equation
dN(t)/dt = λN(t),
whose solution is N(t) = N_0 e^{λt}, where N_0 denotes the initial population.
This exponential model is valid only when the population is isolated, with no immigration. If immigration is permitted at a constant rate I, then the differential equation becomes
dN(t)/dt = λN(t) + I,
whose solution is
N(t) = N_0 e^{λt} + (I/λ)(e^{λt} − 1).
Suppose a certain population contains N(0) = 1000000 individuals initially, that 424000 individuals immigrate into the community in the first year, and that N(1) = 1564000 individuals are present at the end of one year. To determine the birth rate of this population, we need to find λ in the equation
1564000 = 1000000 e^λ + (424000/λ)(e^λ − 1).
It is not possible to solve explicitly for λ in this equation, but numerical methods discussed in this
chapter can be used to approximate solutions of equations of this type to an arbitrarily high accuracy.
Definition 1.1 (Simple and multiple root). A zero (root) has a “multiplicity”, which refers to the number of times that its associated factor appears in the equation. A root having multiplicity one is called a simple root. For example, f(x) = (x − 1)(x − 2) has a simple root at x = 1 and at x = 2, but g(x) = (x − 1)^2 has a root of multiplicity 2 at x = 1, which is therefore not a simple root. A root with multiplicity m ≥ 2 is called a multiple or repeated root. For example, in the equation (x − 1)^2 = 0, x = 1 is a multiple (double) root.
If a polynomial has a multiple root, its derivative also shares that root.
Let α be a root of the equation f(x) = 0, and imagine writing it in the factored form
f(x) = (x − α)^m φ(x)
with some integer m ≥ 1 and some continuous function φ(x) for which φ(α) ≠ 0. Then we say that α is a root of f(x) of multiplicity m.
Now we study some iterative methods to solve the non-linear equations.

2. The Bisection Method


2.1. Method. Let f(x) be a continuous function on some given interval [a, b] that satisfies the condition f(a) f(b) < 0. Then by the Intermediate Value Theorem, the function f(x) must have at least one root in [a, b]. The bisection method repeatedly bisects the interval [a, b] and then selects a subinterval in which a root must lie for further processing. It is a very simple and robust method, but it is also relatively slow. Usually [a, b] is chosen to contain only one root α.

Figure 1. Bisection method

Example 1. The sum of two numbers is 20. If each number is added to its square root, the product of the resulting sums is 155.55. Perform five iterations of the bisection method to determine the two numbers.¹
Sol. Let x and y be the two numbers. Then
x + y = 20.
Now x is added to √x and y is added to √y. The product of these sums is
(x + √x)(y + √y) = 155.55,
∴ (x + √x)(20 − x + √(20 − x)) = 155.55.
Write the above equation as the root-finding problem
f(x) = (x + √x)(20 − x + √(20 − x)) − 155.55 = 0.
As f(6)f(7) < 0, there is a root in the interval (6, 7).
Below are the iterations of the bisection method for finding the root. Therefore the root is 6.53125.

n    a           b           c           sign(f(a) · f(c))
1    6.000000    7.000000    6.500000    > 0
2    6.500000    7.000000    6.750000    < 0
3    6.500000    6.750000    6.625000    < 0
4    6.500000    6.625000    6.562500    < 0
5    6.500000    6.562500    6.531250    < 0

If x = 6.53125, then y = 20 − 6.53125 = 13.4688.
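The five rows of the table can be reproduced with a few lines of Python (a sketch, not part of the original notes):

    import math

    def f(x):
        # f(x) = (x + sqrt(x)) * (y + sqrt(y)) - 155.55 with y = 20 - x
        y = 20.0 - x
        return (x + math.sqrt(x)) * (y + math.sqrt(y)) - 155.55

    a, b = 6.0, 7.0                  # f(a) * f(b) < 0, so a root lies in (6, 7)
    for _ in range(5):
        c = a + (b - a) / 2          # midpoint
        if f(a) * f(c) > 0:
            a = c                    # root lies in [c, b]
        else:
            b = c                    # root lies in [a, c]
    print(c)                         # 6.53125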


Next we discuss the convergence of the approximate solution to the exact solution. We first define the usual meaning of convergence and the order of convergence.

¹ Choice of initial approximations: Initial approximations to the root are often known from the physical significance of the problem. Graphical methods can be used to locate the zero of f(x) = 0, and any value in the neighborhood of the root can be taken as the initial approximation.
If the given equation f(x) = 0 can be written as f_1(x) = f_2(x), then the point of intersection of the graphs y = f_1(x) and y = f_2(x) gives the root of the equation. Any value in the neighborhood of this point can be taken as the initial approximation.

Definition 2.1 (Convergence). A sequence {x_n} is said to converge to a point α with order p if there exists a constant c such that
lim_{n→∞} |x_{n+1} − α|/|x_n − α|^p = c.
The constant c is known as the asymptotic error constant. If we write e_n = |x_n − α|, where e_n denotes the absolute error in the n-th iteration, then in the limiting case we can write
e_{n+1} = c e_n^p.
Two cases are given special attention.
(i) If p = 1 (and c < 1), the sequence is linearly convergent.
(ii) If p = 2, the sequence is quadratically convergent.
Definition 2.2. Let {β_n} be a sequence which converges to zero and {x_n} any sequence. If there exist a constant c > 0 and an integer N > 0 such that
|x_n − α| ≤ c|β_n|, ∀n ≥ N,
then we say that {x_n} converges to α with rate O(β_n). We write
x_n = α + O(β_n).
Example: Define two sequences for n ≥ 1,
x_n = (n + 1)/n^2, and y_n = (n + 2)/n^3.
Both sequences have limit 0, but the sequence {y_n} converges to this limit much faster than the sequence {x_n}. Now
|x_n − 0| = (n + 1)/n^2 < (n + n)/n^2 = 2 · (1/n) = 2β_n
and
|y_n − 0| = (n + 2)/n^3 < (n + 2n)/n^3 = 3 · (1/n^2) = 3β̃_n.
Hence the rate of convergence of {x_n} to zero is similar to the convergence of {1/n} to zero, whereas {y_n} converges to zero at a rate similar to the more rapidly convergent sequence {1/n^2}. We express this by writing
x_n = 0 + O(1/n) and y_n = 0 + O(1/n^2).
2.2. Convergence analysis. Now we analyze the convergence of the iterations generated by the
bisection method.
Theorem 2.3. Suppose that f ∈ C[a, b] and f (a) · f (b) < 0. Then the bisection method generates a
sequence {cn } approximating a zero α of f with linear convergence.
Proof. Let [a_1, b_1], [a_2, b_2], · · · , [a_n, b_n], · · · denote the successive intervals produced by the bisection algorithm. Thus
a = a_1 ≤ a_2 ≤ · · · ≤ b_1 = b,
b = b_1 ≥ b_2 ≥ · · · ≥ a_1 = a.
This implies {a_n} and {b_n} are monotonic and bounded, and hence convergent. Since
b_1 − a_1 = b − a,
b_2 − a_2 = (b_1 − a_1)/2 = (b − a)/2,
........................
b_n − a_n = (b − a)/2^{n−1}.     (2.1)
Hence
lim_{n→∞} (b_n − a_n) = 0.

Here b − a denotes the length of the original interval with which we started. Taking limits,
lim_{n→∞} a_n = lim_{n→∞} b_n = α (say).
Since f is a continuous function,
lim_{n→∞} f(a_n) = f(lim_{n→∞} a_n) = f(α).
The bisection method ensures that
f(a_n) f(b_n) ≤ 0,
which implies
lim_{n→∞} f(a_n) f(b_n) = f^2(α) ≤ 0
=⇒ f(α) = 0.
Thus the common limit of {a_n} and {b_n} is a zero of f in [a, b].
Since the root α is in either the interval [a_n, c_n] or [c_n, b_n],
|α − c_n| < c_n − a_n = b_n − c_n = (b_n − a_n)/2.
Combining with (2.1), we obtain the further bound
e_n = |α − c_n| < (b − a)/2^n.
Therefore
e_{n+1} < (b − a)/2^{n+1},
∴ e_{n+1} < (1/2) e_n.
This shows that the iterates c_n converge to α as n → ∞. By the definition of convergence, we can say that the bisection method converges linearly with rate 1/2.

Illustrations: 1. Since the method brackets the root, it is guaranteed to converge; however, it can be very slow.
2. Computing c_n: It might happen that at a certain iteration n, computation of c_n = (a_n + b_n)/2 gives an overflow. It is better to compute c_n as
c_n = a_n + (b_n − a_n)/2.
3. Stopping Criteria: Since this is an iterative method, we must determine some stopping criteria that will allow the iteration to stop. We can use the following criteria, in terms of the absolute error and the relative error:
|c_{n+1} − c_n| ≤ ε,
|c_{n+1} − c_n|/|c_{n+1}| ≤ ε,
provided |c_{n+1}| ≠ 0.
The criterion |f(c_n)| ≤ ε can be misleading, since it is possible to have |f(c_n)| very small even if c_n is not close to the root.
Let us now find the minimum number of iterations N needed with the bisection method to achieve a certain desired accuracy. The interval length after N iterations is (b − a)/2^N. So, to obtain an accuracy of ε, we must have (b − a)/2^N ≤ ε. That is,
2^{−N}(b − a) ≤ ε,
or
N ≥ (log(b − a) − log ε)/log 2.
Note that the number N depends only on the initial interval [a, b] bracketing the root.
4. If a function is such that it just touches the x-axis, for example f(x) = x^2, then we cannot find a and b such that f(a)f(b) < 0, even though x = 0 is a root of f(x) = 0.
5. For functions where there is a singularity and the sign reverses at the singularity, the bisection method may converge on the singularity. An example is f(x) = 1/x. We can choose a and b such that f(a)f(b) < 0. However, the function is not continuous, and the theorem that a root exists is not applicable.
Example 2. Use the bisection method to find a solution accurate to within 10^{−2} for x^3 − 7x^2 + 14x − 6 = 0 on [0, 1].
Sol. The number of iterations satisfies
N ≥ (log(1 − 0) − log(10^{−2}))/log 2 = 6.6439.
Thus, a minimum of 7 iterations will be needed to obtain the desired accuracy using the bisection method. This yields the following results for the mid-points c_n and the absolute errors E_n = |c_n − c_{n−1}|.

n    a_n        b_n        c_n         sign(f(a) · f(c))    E_n
1    0          1          0.5         > 0
2    0.5        1          0.75        < 0                  0.25
3    0.5        0.75       0.625       < 0                  0.125
4    0.5        0.625      0.5625      > 0                  0.0625
5    0.5625     0.625      0.59375     < 0                  0.03125
6    0.5625     0.59375    0.578125    > 0                  0.015625
7    0.578125   0.59375    0.5859375                        0.0078125 < 0.01 (= 10^{−2})

3. Fixed-point iteration method


A fixed point for a function is a number at which the value of the function does not change when
the function is applied. The terminology was first used by the Dutch mathematician L. E. J. Brouwer
(1882-1962) in the early 1900s.
The number α is a fixed point for a given function g if g(α) = α.
In this section we consider the problem of finding solutions to fixed-point problems and the connec-
tion between the fixed-point problems and the root-finding problems we wish to solve. Root-finding
problems and fixed-point problems are equivalent in the following sense:
given a root-finding problem f(x) = 0, we can define functions g with a fixed point at the root in a number of ways. Conversely, if the function g has a fixed point at α, then the function defined by f(x) = x − g(x) has a zero at α.
Although the problems we wish to solve are in the root-finding form, the fixed-point form is easier to
analyze, and certain fixed-point choices lead to very powerful root-finding techniques.
Example 3. Determine any fixed points of the function g(x) = x2 − 2.
Sol. A fixed point x for g has the property that
x = g(x) = x2 − 2
which implies that
0 = x2 − x − 2 = (x + 1)(x − 2).
A fixed point for g occurs precisely when the graph of y = g(x) intersects the graph of y = x, so g has
two fixed points, one at x = −1 and the other at x = 2.

Fixed-point iterations: We now consider solving an equation x = g(x) for a root α by the iteration
xn+1 = g(xn ), n ≥ 0,
with x0 as an initial guess to α.
Each solution of x = g(x) is called a fixed point of g.

For example, consider solving x^2 − 3 = 0. We can write:
1. x = x^2 + x − 3, or more generally x = x + c(x^2 − 3), c ≠ 0.
2. x = 3/x.
3. x = (x + 3/x)/2.
Let x_0 = 2.

Table 1. Iterations in the three cases

n    1       2      3
0    2.0     2.0    2.0
1    3.0     1.5    1.75
2    9.0     2.0    1.732143
3    87.0    1.5    1.732051


Now √3 = 1.73205, and it is clear that the third choice converges, but why do the other two not work? Which choices of g lead to convergence will be answered by the convergence result below (which requires |g′(α)| < 1 and a ≤ g(x) ≤ b, ∀x ∈ [a, b], in the neighborhood of the root α).
Lemma 3.1. Let g(x) be a continuous function on [a, b] and assume that a ≤ g(x) ≤ b, ∀x ∈ [a, b]. Then x = g(x) has at least one solution in [a, b].
Proof. Let g be a continuous function on [a, b] with a ≤ g(x) ≤ b, ∀x ∈ [a, b]. Consider φ(x) = g(x) − x.
If g(a) = a or g(b) = b, then the proof is trivial. Hence we assume that a ≠ g(a) and b ≠ g(b). Since a ≤ g(x) ≤ b, this means g(a) > a and g(b) < b. Now
φ(a) = g(a) − a > 0
and
φ(b) = g(b) − b < 0.
Since φ is continuous and φ(a)φ(b) < 0, by the Intermediate Value Theorem φ has at least one zero in [a, b]; i.e., there exists some α such that
g(α) = α, α ∈ [a, b].
Graphically, the roots are the intersection points of y = x & y = g(x) as shown in the Figure.

Figure 2. An example of Lemma



Theorem 3.2 (Contraction Mapping Theorem). Let g and g′ be continuous functions on [a, b], and assume that g satisfies a ≤ g(x) ≤ b, ∀x ∈ [a, b]. Furthermore, assume that there is a positive constant λ < 1 with
|g′(x)| ≤ λ, ∀x ∈ (a, b).
Then
1. x = g(x) has a unique solution α in the interval [a, b].
2. The iterates x_{n+1} = g(x_n), n ≥ 0, converge to α for any choice of x_0 ∈ [a, b].
3. |α − x_n| ≤ (λ^n/(1 − λ)) |x_1 − x_0|, n ≥ 0.
4. The convergence is linear.
Proof. Let g and g′ be continuous functions on [a, b], and assume that a ≤ g(x) ≤ b, ∀x ∈ [a, b]. By the previous Lemma, there exists at least one solution to x = g(x). Also assume
|g′(x)| ≤ λ < 1, ∀x ∈ (a, b).
By the Mean-Value Theorem,
g(x) − g(y) = g′(c)(x − y), c ∈ (x, y),
so |g(x) − g(y)| ≤ λ|x − y|, 0 < λ < 1, ∀x, y ∈ [a, b].
1. Suppose x = g(x) has two solutions, say α and β, in [a, b]; then α = g(α) and β = g(β). Now
|α − β| = |g(α) − g(β)| ≤ λ|α − β|
=⇒ (1 − λ)|α − β| ≤ 0
=⇒ α = β, since 0 < λ < 1.
Therefore x = g(x) has a unique solution in [a, b], which we call α.
2. To check the convergence of the iterates {x_n}, we observe that they all remain in [a, b]: if x_n ∈ [a, b], then x_{n+1} = g(x_n) ∈ [a, b]. Now
|α − x_{n+1}| = |g(α) − g(x_n)| = |g′(c_n)||α − x_n|     (3.1)
for some c_n between α and x_n.
=⇒ |α − x_{n+1}| ≤ λ|α − x_n| ≤ λ^2 |α − x_{n−1}| ≤ · · · ≤ λ^{n+1} |α − x_0|.
As n → ∞, λ^n → 0, which implies x_n → α. Also
|α − x_n| ≤ λ^n |α − x_0|.     (3.2)
3. To find the bound: since
|α − x_0| = |α − x_1 + x_1 − x_0| ≤ |α − x_1| + |x_1 − x_0| ≤ λ|α − x_0| + |x_1 − x_0|,
we get
(1 − λ)|α − x_0| ≤ |x_1 − x_0|
=⇒ |α − x_0| ≤ |x_1 − x_0|/(1 − λ)
=⇒ λ^n |α − x_0| ≤ (λ^n/(1 − λ)) |x_1 − x_0|.
Therefore, using (3.2),
|α − x_n| ≤ λ^n |α − x_0| ≤ (λ^n/(1 − λ)) |x_1 − x_0|.

4. Now by equation (3.1),
|α − x_{n+1}|/|α − x_n| = |g′(c_n)|,
for some c_n between α and x_n. Taking the limit n → ∞, we have
lim_{n→∞} |α − x_{n+1}|/|α − x_n| = lim_{n→∞} |g′(c_n)|.
Now x_n → α implies c_n → α. Hence
lim_{n→∞} |α − x_{n+1}|/|α − x_n| = |g′(α)|.
If |g′(α)| < 1, the above formula shows that the iterates are linearly convergent with rate (asymptotic error constant) |g′(α)|. If in addition g′(α) ≠ 0, then the formula proves that the convergence is exactly linear, with no higher order of convergence being possible.
Illustrations: 1. In practice, it may be difficult to find an interval [a, b] for which the condition a ≤ g(x) ≤ b is satisfied. On the contrary, if |g′(α)| > 1, then the iteration method x_{n+1} = g(x_n) will not converge to α. When |g′(α)| ≈ 1, no conclusion can be drawn, and even if convergence occurs, the method would be far too slow to be practical.
2. If
|α − x_n| ≤ (λ^n/(1 − λ)) |x_1 − x_0| < ε,
where ε is the desired accuracy, this bound can be used to find the number of iterations required to achieve the accuracy ε. Also, from part 2, |α − x_n| ≤ λ^n |α − x_0| ≤ λ^n max{x_0 − a, b − x_0} < ε can be used to find the number of iterations.
3. The possible behavior of the fixed-point iterates {x_n} is shown in Figure 3 for various values of g′(α). To see the convergence, consider the case of x_1 = g(x_0), the height of y = g(x) at x_0. We bring the number x_1 back to the x-axis by using the line y = x and the height y = x_1. We continue this with each iterate, obtaining a stair-step behavior when g′(α) > 0. When g′(α) < 0, the iterates oscillate around the fixed point α, as can be seen in the Figure. In the first figure (on top) the iterations are monotonically convergent, in the second they are oscillatory convergent, in the third they are divergent, and in the last they are oscillatory divergent.

Theorem 3.3. Let α be a root of x = g(x), and let g(x) be p times continuously differentiable for all x ∈ [α − δ, α + δ], with g(x) ∈ [α − δ, α + δ], for some p ≥ 2. Furthermore, assume
g′(α) = g″(α) = · · · = g^{(p−1)}(α) = 0, g^{(p)}(α) ≠ 0.     (3.3)
Then, if the initial guess x_0 is sufficiently close to α, the iteration
x_{n+1} = g(x_n), n ≥ 0,
will have order of convergence p.
Proof. Let g(x) be p times continuously differentiable for all x ∈ [α − δ, α + δ], with g(x) ∈ [α − δ, α + δ], satisfying the conditions in equation (3.3) stated above. Now expand g(x_n) in a Taylor polynomial about α:
x_{n+1} = g(x_n) = g(α + x_n − α)
= g(α) + (x_n − α) g′(α) + · · · + ((x_n − α)^{p−1}/(p − 1)!) g^{(p−1)}(α) + ((x_n − α)^p/p!) g^{(p)}(c_n),

Figure 3. Convergent and non-convergent sequences xn+1 = g(xn )

for some c_n between x_n and α.
Using equation (3.3) and g(α) = α, we obtain
x_{n+1} − α = ((x_n − α)^p/p!) g^{(p)}(c_n)
=⇒ (x_{n+1} − α)/(x_n − α)^p = g^{(p)}(c_n)/p!
=⇒ (α − x_{n+1})/(α − x_n)^p = (−1)^{p−1} g^{(p)}(c_n)/p!.
Taking limits as n → ∞ on both sides,
lim_{n→∞} (α − x_{n+1})/(α − x_n)^p = lim_{n→∞} (−1)^{p−1} g^{(p)}(c_n)/p! = (−1)^{p−1} g^{(p)}(α)/p!.
By the definition of convergence, the iterations have order of convergence p.

Example 4. Consider the equation x^3 − 7x + 2 = 0 in [0, 1]. Write a fixed-point iteration which will converge to the solution.
Sol. We rewrite the equation in the form x = (x^3 + 2)/7 and define the fixed-point iteration
x_{n+1} = (x_n^3 + 2)/7.
Now g(x) = (x^3 + 2)/7 is a continuous function, with
g′(x) = 3x^2/7, g″(x) = 6x/7,
g(0) = 2/7, g(1) = 3/7,
g′(0) = 0, g′(1) = 3/7.
Hence 2/7 ≤ g(x) ≤ 3/7 and |g′(x)| ≤ 3/7 < 1, ∀x ∈ [0, 1].
Hence, by the fixed point theorem, the sequence {x_n} defined above converges to the unique solution of the given equation. Starting with x_0 = 0.5, we can compute the solution as follows:
x_1 = 0.303571429
x_2 = 0.28971083
x_3 = 0.289188016.
Therefore the root correct to three decimals is 0.289.
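The iteration of Example 4 in Python (a sketch; the stopping rule |x_{n+1} − x_n| < tol is one of the criteria discussed earlier for iterative methods):

    def fixed_point(g, x0, tol=1e-6, max_iter=100):
        # Iterate x_{n+1} = g(x_n) until successive iterates agree to tol.
        x = x0
        for _ in range(max_iter):
            x_new = g(x)
            if abs(x_new - x) < tol:
                return x_new
            x = x_new
        return x

    # Example 4: g(x) = (x**3 + 2) / 7 on [0, 1]
    root = fixed_point(lambda x: (x**3 + 2) / 7, 0.5)
    print(root)   # ~0.2892, i.e. 0.289 correct to three decimals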
Example 5. The equation e^x = 4x^2 has a root in [4, 5]. Show that we cannot find that root using x = g(x) = (1/2)e^{x/2} in the fixed-point iteration method. Can you find another iterative formula which will locate that root? If yes, find three iterations with x_0 = 4.5. Also find the error bound.
Sol. Here g(x) = (1/2)e^{x/2} and g′(x) = (1/4)e^{x/2} > 1 for all x ∈ (4, 5); therefore, the fixed-point iteration fails to converge to the root in [4, 5].
Now consider x = g(x) = ln(4x^2). Then
g′(x) = 2/x > 0, g″(x) = −2/x^2 < 0, ∀x ∈ [4, 5].
Therefore g and g′ are monotonically increasing and decreasing, respectively. Now
g(4) = 4.15888, g(5) = 4.60517, g′(4) = 0.5, g′(5) = 0.4.
Thus
4 ≤ g(x) ≤ 5, λ = max_{4≤x≤5} |g′(x)| = 0.5 < 1.
Using the fixed-point iteration method with x_0 = 4.5 gives the iterations
x_1 = g(x_0) = ln(4 × 4.5^2) = 4.3944,
x_2 = 4.3469,
x_3 = 4.3253.
We have the error bound
|α − x_3| ≤ (0.5^3/(1 − 0.5)) |4.3944 − 4.5| = 0.0264.
Example 6. Use a fixed-point method to determine a solution to within 10^{−4} for x = tan x, for x in [4, 5].
Sol. Using g(x) = tan x and x_0 = 4 gives x_1 = g(x_0) = tan 4 = 1.158, which is not in the interval [4, 5]. So we need a different fixed-point function.
If we note that x = tan x implies
1/x = 1/tan x
=⇒ x = x − 1/x + 1/tan x.
Starting with x_0 = 4 and taking g(x) = x − 1/x + 1/tan x, we obtain
x_1 = 4.61369, x_2 = 4.49596, x_3 = 4.49341, x_4 = 4.49341.
As x_3 and x_4 agree to five decimals, it is reasonable to assume that these values are sufficiently accurate.

Example 7. The iterates x_{n+1} = 2 − (1 + c)x_n + c x_n^3 will converge to α = 1 for some values of the constant c (provided that x_0 is sufficiently close to α). Find the values of c for which convergence occurs. For what values of c, if any, is the convergence quadratic?
Sol. This is a fixed-point iteration
x_{n+1} = g(x_n)
with
g(x) = 2 − (1 + c)x + cx^3.
Then α = 1 is a fixed point, and for convergence we need |g′(α)| < 1:
|−(1 + c) + 3cα^2| = |2c − 1| < 1
=⇒ 0 < c < 1.
For quadratic convergence we need
g′(α) = 0 and g″(α) ≠ 0.
g′(1) = 2c − 1 = 0 gives c = 1/2, and for this value of c, g″(1) = 6c = 3 ≠ 0. Hence the convergence is quadratic for c = 1/2.
Example 8. Which of the following iterations,
a. x_{n+1} = (1/4)(x_n^2 + 6/x_n),      b. x_{n+1} = 4 − 6/x_n^2,
is suitable to find a root of the equation x^3 = 4x^2 − 6 in the interval [3, 4]? Estimate the number of iterations required to achieve 10^{−3} accuracy, starting from x_0 = 3.
Sol. a.
g(x) = (1/4)(x^2 + 6/x),
g′(x) = (1/2)(x − 3/x^2).
g is continuous in [3, 4], but g′(x) > 1 for all x ∈ (3, 4). So this choice of g(x) is not suitable.
b.
g(x) = 4 − 6/x^2,
g′(x) = 12/x^3.
Now g is continuous in [3, 4] and g(x) ∈ [3, 4] for all x ∈ [3, 4]. Also |g′(x)| = |12/x^3| < 1 for all x ∈ (3, 4).
Thus a unique fixed point exists in [3, 4] by the fixed point theorem. To find an approximation of that root with an accuracy of 10^{−3}, we need to determine the number of iterations n so that
|α − x_n| ≤ (λ^n/(1 − λ)) |x_1 − x_0| < 10^{−3}.
Here λ = max_{3≤x≤4} |g′(x)| = 4/9, and using the fixed-point method with x_0 = 3 we have x_1 = g(x_0) = 10/3, so
((4/9)^n/(1 − 4/9)) |10/3 − 3| < 10^{−3}
=⇒ n(log 4 − log 9) < log(5 × 10^{−3}/3)
=⇒ n > 7.8883, i.e., n = 8.

4. Iteration method based on first degree equation


4.1. The Secant Method. Let f (x) = 0 be the given non-linear equation.
Let (x_0, f(x_0)) and (x_1, f(x_1)) be two points on the curve y = f(x). Then the equation of the secant
line joining these two points is given by
y − f(x_1) = [(f(x_1) − f(x_0))/(x_1 − x_0)] (x − x_1).
Let the intersection point of the secant line with the x-axis be (x_2, 0); then at x = x_2, y = 0. Therefore
0 − f(x_1) = [(f(x_1) − f(x_0))/(x_1 − x_0)] (x_2 − x_1)
x_2 = x_1 − f(x_1) (x_1 − x_0)/(f(x_1) − f(x_0)).
Here x0 and x1 are two approximations of the root. The point (x2 , 0) can be taken as next approx-
imation of the root. This method is called the secant or chord method and successive iterations are
given by
x_{n+1} = x_n − f(x_n) (x_n − x_{n−1})/(f(x_n) − f(x_{n−1})),   n = 1, 2, . . .
Geometrically, in this method we replace the unknown function by a straight line or chord passing
through (x0 , f (x0 )) and (x1 , f (x1 )) and we take the point of intersection of the straight line with the
x-axis as the next approximation to the root and continue the process.

Figure 4. Secant method

Illustrations:
1. Stopping Criterion: We can use the following stopping criteria
|x_{n+1} − x_n| < ε,   or   |(x_{n+1} − x_n)/x_{n+1}| < ε,
where ε is the prescribed accuracy.
2. We can combine the secant method with the bisection method and bracket the root, i.e., we choose
initial approximations x0 and x1 in such a manner that f (x0 )f (x1 ) < 0. At each stage we bracket the
root. The method is known as ‘Method of False Position’ or ‘Regula Falsi Method’.
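As an illustration, here is a minimal Python sketch of the secant iteration with the relative-error stopping test above (the function name, tolerance, and iteration cap are illustrative, not from the notes):

def secant(f, x0, x1, tol=1e-6, max_iter=50):
    # Secant iterations x_{n+1} = x_n - f(x_n)(x_n - x_{n-1}) / (f(x_n) - f(x_{n-1})).
    f0, f1 = f(x0), f(x1)
    for _ in range(max_iter):
        x2 = x1 - f1 * (x1 - x0) / (f1 - f0)
        if abs((x2 - x1) / x2) < tol:   # relative-error stopping criterion
            return x2
        x0, f0, x1, f1 = x1, f1, x2, f(x2)
    raise RuntimeError("secant method did not converge")

import math
print(secant(lambda x: math.exp(x) - math.cos(x), -1.1, -1.0))  # about -1.2927 (Example 9 below)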
Example 9. Apply the secant method to find the root of the equation e^x = cos x with relative error less
than 0.5%.
Sol. Let f(x) = e^x − cos x = 0.
The successive iterations of the secant method are given by
x_{n+1} = x_n − f(x_n) (x_n − x_{n−1})/(f(x_n) − f(x_{n−1})),   n = 1, 2, . . .
We take initial guesses x_0 = −1.1 and x_1 = −1, and let e_n = |(x_n − x_{n−1})/x_n| × 100% denote the
relative error at the n-th step. We obtain
x_2 = −1.335205,   e_2 = 25.1%,
x_3 = −1.286223,   e_3 = 3.8%,
x_4 = −1.292594,   e_4 = 0.49%,
x_5 = −1.292696,   e_5 = 0.008%.
The relative error is now well below 0.5%, and we accept x_5 = −1.292696 as the root with the prescribed
accuracy.
Example 10. Let f ∈ C^2[a, b]. If α is a simple root of f(x) = 0, then show that the sequence {x_n}
generated by the secant method has order of convergence 1.618.
Sol. We assume that α is a simple root of f(x) = 0, so f(α) = 0 and f'(α) ≠ 0.
Let x_n = α + e_n, where e_n is the error at the n-th step.
An iterative method is said to have order of convergence p if
|x_{n+1} − α| ≈ C |x_n − α|^p,
or equivalently
|e_{n+1}| ≈ C |e_n|^p.
Successive iterations in the secant method are given by
x_{n+1} = x_n − f(x_n) (x_n − x_{n−1})/(f(x_n) − f(x_{n−1})),   n = 1, 2, . . .
The error equation is written as
e_{n+1} = e_n − (e_n − e_{n−1}) f(α + e_n)/(f(α + e_n) − f(α + e_{n−1})).
By expanding f(α + e_n) and f(α + e_{n−1}) in Taylor series about α (and using f(α) = 0), we obtain
e_{n+1} = e_n − (e_n − e_{n−1}) [e_n f'(α) + (1/2)e_n^2 f''(α) + · · ·] / ((e_n − e_{n−1}) [f'(α) + (1/2)(e_n + e_{n−1}) f''(α) + · · ·])
= e_n − [e_n + (1/2)e_n^2 f''(α)/f'(α) + · · ·] [1 + (1/2)(e_{n−1} + e_n) f''(α)/f'(α) + · · ·]^{−1}
= e_n − [e_n + (1/2)e_n^2 f''(α)/f'(α) + · · ·] [1 − (1/2)(e_{n−1} + e_n) f''(α)/f'(α) + · · ·]
= (1/2) (f''(α)/f'(α)) e_n e_{n−1} + O(e_n^2 e_{n−1} + e_n e_{n−1}^2).
Therefore
e_{n+1} ≈ A e_n e_{n−1},
where the constant A = (1/2) f''(α)/f'(α).
This relation is called the error equation. Now, by the definition of the order of convergence, we expect
a relation of the following type:
e_{n+1} = C e_n^p.
Stepping one index down gives e_n = C e_{n−1}^p, i.e., e_{n−1} = C^{−1/p} e_n^{1/p}.
Hence
C e_n^p = A e_n C^{−1/p} e_n^{1/p}
=⇒ e_n^p = A C^{−(1+1/p)} e_n^{1+1/p}.
Comparing the powers of e_n on both sides, we get
p = 1 + 1/p,
which gives two values of p; one is p = (1 + √5)/2 ≈ 1.618 and the other is negative (we reject the negative
value, since the order of convergence is non-negative).
Therefore, the order of convergence of the secant method is about 1.618: superlinear, but less than 2.

4.2. Newton’s Method. Let f (x) = 0 be the given non-linear equation.


Let the tangent line at point (x0 , f (x0 )) on the curve y = f (x) intersect with the x-axis at (x1 , 0). The
equation of tangent is given by
y − f(x_0) = f'(x_0)(x − x_0).
Here the number f'(x_0) gives the slope of the tangent at x_0. At x = x_1,
0 − f(x_0) = f'(x_0)(x_1 − x_0)
x_1 = x_0 − f(x_0)/f'(x_0).
Here x_0 is an initial approximation of the root.
This is called Newton's method, and successive iterations are given by
x_{n+1} = x_n − f(x_n)/f'(x_n),   n = 0, 1, . . . .
The method can be obtained directly from the secant method by taking limit xn−1 → xn . In the
limiting case the chord joining the points (xn−1 , f (xn−1 )) and (xn , f (xn )) becomes the tangent at
(xn , f (xn )).
In this case problem of finding the root of the equation is equivalent to finding the point of intersection
of the tangent to the curve y = f (x) at point (xn , f (xn )) with the x-axis.
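A minimal Python sketch of the Newton iteration (the name and tolerance are illustrative; the derivative is supplied by the caller):

def newton(f, df, x0, tol=1e-10, max_iter=50):
    # Newton iterations x_{n+1} = x_n - f(x_n)/f'(x_n).
    x = x0
    for _ in range(max_iter):
        x_new = x - f(x) / df(x)    # breaks down if f'(x_n) = 0 (see Example 12)
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    raise RuntimeError("Newton's method did not converge")

# Example 11 below: 1/x - 1.37 = 0, i.e. computing 1/1.37
print(newton(lambda x: 1/x - 1.37, lambda x: -1/x**2, 1.0))  # about 0.729927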

Figure 5. Newton’s method.

Example 11. A calculator is defective: it can only add, subtract, and multiply. Use Newton's
Method on the defective calculator to find 1/1.37 correct to 5 decimal places.
Sol. We consider
x = 1/1.37, i.e., f(x) = 1/x − 1.37 = 0.
We have f'(x) = −1/x^2, and therefore Newton's Method yields the iteration
x_{n+1} = x_n − f(x_n)/f'(x_n) = x_n(2 − 1.37 x_n).
Note that the expression xn (2 − 1.37xn ) can be evaluated on our defective calculator, since it only
involves multiplication and subtraction. The choice x_0 = 1 works, and we get
x1 = 0.63, x2 = 0.716247, x3 = 0.729670622, x4 = 0.729926917, x5 = 0.729927007.
Since the fourth and fifth iterates agree to five decimal places, we take 0.729927007 as a solution of
f(x) = 0 correct to at least five decimal places.

4.2.1. The Newton’s Method can go bad.


• Once the Newton’s Method catches scent of the root, it usually hunts it down with amazing
speed. But since the method is based on local information, namely f (xn ) and f � (xn ), the
Newton’s Method’s sense of smell is deficient.
• If the initial estimate is not close enough to the root, the Newton’s Method may not converge,
or may converge to the wrong root.
• Let f(x) be twice continuously differentiable on the closed finite interval [a, b] and suppose the
following conditions are satisfied:
(i) f(a) f(b) < 0.
(ii) f'(x) ≠ 0, ∀x ∈ [a, b].
(iii) Either f''(x) ≥ 0 or f''(x) ≤ 0, ∀x ∈ [a, b].
(iv) The tangent to the curve at either endpoint intersects the x-axis within the interval [a, b];
in other words, at the end points a, b,
|f(a)|/|f'(a)| < b − a,   |f(b)|/|f'(b)| < b − a.
|f � (a)| |f � (b)|
Then Newton's method converges to the unique solution α of f(x) = 0 in [a, b] for any
choice of x_0 ∈ [a, b].
Conditions (i) and (ii) guarantee that there is one and only one solution in [a, b]. Condition
(iii) states that the graph of f(x) is either concave from above or concave from below, and
furthermore, together with condition (ii), it implies that f'(x) is monotone on [a, b].
The following example shows that choice of initial guess is very important for convergence.
Example 12. Use Newton's Method to find a non-zero solution of x = 2 sin x.
Sol. Let f(x) = x − 2 sin x.
Then f'(x) = 1 − 2 cos x, and f(1)f(2) < 0, so a root lies in (1, 2).
The Newton iterations are given by
x_{n+1} = x_n − f(x_n)/f'(x_n) = x_n − (x_n − 2 sin x_n)/(1 − 2 cos x_n) = 2(sin x_n − x_n cos x_n)/(1 − 2 cos x_n),   n ≥ 0.
Let x0 = 1.1. The next six estimates, to 3 decimal places, are:
x1 = 8.453, x2 = 5.256, x3 = 203.384, x4 = 118.019, x5 = −87.471, x6 = −203.637.
Therefore the iterations diverge.
Note that choosing x_0 = π/3 ≈ 1.0472 leads to immediate disaster, since then 1 − 2 cos x_0 = 0 and
therefore x_1 does not exist. The trouble was caused by choosing x_0 with f'(x_0) ≈ 0.
Let us see whether we can do better. Draw the curves y = x and y = 2 sin x. A quick sketch shows that
they meet a bit past π/2, so we take x_0 = 1.5. Here are the next five estimates:
x1 = 2.076558, x2 = 1.910507, x3 = 1.895622, x4 = 1.895494, x5 = 1.895494.
Example 13. Find, correct to 5 decimal places, the x-coordinate of the point on the curve y = ln x
which is closest to the origin. Use the Newton’s Method.
Figure 6. An example where Newton’s method will not work.

Figure 7. One more example of where Newton’s method will not work.

Sol. Let (x, ln x) be a general point on the curve, and let S(x) be the square of the distance from
(x, ln x) to the origin. Then
S(x) = x^2 + (ln x)^2.
We want to minimize the distance. This is equivalent to minimizing the square of the distance. Now
the minimization process takes the usual route. Note that S(x) is only defined when x > 0. We have
S'(x) = 2x + (2 ln x)/x = (2/x)(x^2 + ln x).
Our problem thus comes down to solving the equation S'(x) = 0. We could use Newton's Method
directly on S'(x), but the calculations are more pleasant if we observe that S'(x) = 0 is equivalent to
x^2 + ln x = 0.
Let f(x) = x^2 + ln x. Then f'(x) = 2x + 1/x, and we get the recurrence relation
x_{k+1} = x_k − (x_k^2 + ln x_k)/(2x_k + 1/x_k),   k = 0, 1, · · ·
We need to find a suitable starting point x0 . Experimentation with a calculator suggests that we take
x0 = 0.65.
Then x1 = 0.6529181, and x2 = 0.65291864.
Since x1 agrees with x2 to 5 decimal places, we can perhaps decide that, to 5 places, the minimum
distance occurs at x = 0.65292.

4.3. Convergence Analysis.


Theorem 4.1. Let f ∈ C^2[a, b]. If α is a simple root of f(x) = 0 (so that f'(α) ≠ 0), then Newton's
method generates a sequence {x_n} converging at least quadratically to the root α for any initial
approximation x_0 sufficiently near to α.
Proof. The proof is based on analyzing Newton's method as the fixed-point iteration scheme
x_{n+1} = g(x_n) = x_n − f(x_n)/f'(x_n),   n ≥ 0,
with
g(x) = x − f(x)/f'(x).
We first find an interval [α − δ, α + δ] that g maps into itself and on which |g'(x)| ≤ λ for some λ ∈ (0, 1).
Since f' is continuous and f'(α) ≠ 0, and a continuous function that is non-zero at a point remains
non-zero in a neighborhood of that point, f'(x) ≠ 0 in a neighborhood of α.
Thus g is defined and continuous in a neighborhood of α. Also, in that neighborhood,
g'(x) = 1 − [f'(x)f'(x) − f(x)f''(x)]/[f'(x)]^2 = f(x)f''(x)/[f'(x)]^2.   (4.1)
Now since f(α) = 0, we have
g'(α) = f(α)f''(α)/[f'(α)]^2 = 0.
Since g' is continuous and g'(α) = 0, given any λ with 0 < λ < 1 there exists a number δ > 0 such that
|g'(x)| ≤ λ, ∀x ∈ [α − δ, α + δ].
Now we will show that g maps [α − δ, α + δ] into [α − δ, α + δ].
If x ∈ [α − δ, α + δ], the Mean Value Theorem implies that for some number c between x and α,
|g(x) − α| = |g(x) − g(α)| = |g'(c)| |x − α| ≤ λ|x − α| < |x − α|.
It follows that |x − α| < δ =⇒ |g(x) − α| < δ.
Hence, g maps [α − δ, α + δ] into [α − δ, α + δ].
All the hypotheses of the Fixed-Point Convergence Theorem (Contraction Mapping) are now satisfied,
so the sequence x_n converges to the root α. Further, differentiating (4.1) and evaluating at α gives
g''(α) = f''(α)/f'(α) ≠ 0,
which proves that the convergence is of second order provided f''(α) ≠ 0.
Remark 4.1. Newton's method converges at least quadratically. If g''(α) = 0, then higher-order
convergence is expected.
Example 14. The function f (x) = sin x has a zero on the interval (3, 4), namely, x = π. Perform
three iterations of Newton’s method to approximate this zero, using x0 = 4. Determine the absolute
error in each of the computed approximations. What is the apparent order of convergence?
Sol. Consider f (x) = sin x. In the interval (3, 4), f has a zero α = π.
Also, f'(x) = cos x.
Newton's iterations are given by
x_{n+1} = x_n − f(x_n)/f'(x_n),   n ≥ 0.
With x_0 = 4, we have
x_1 = x_0 − f(x_0)/f'(x_0) = 4 − sin 4/cos 4 = 2.8422,
x_2 = x_1 − f(x_1)/f'(x_1) = 2.8422 − sin 2.8422/cos 2.8422 = 3.1509,
x_3 = x_2 − f(x_2)/f'(x_2) = 3.1509 − sin 3.1509/cos 3.1509 = 3.1416.
The absolute errors are:
e0 = |x0 − α| = 0.8584,
e1 = |x1 − α| = 0.2994,
e2 = |x2 − α| = 0.0093,
e_3 = |x_3 − α| = 2.6876 × 10^{−7}.
If p is the order of convergence, then
e_2/e_1 = (e_1/e_0)^p.
The corresponding order(s) of convergence are
p = ln(e_2/e_1)/ln(e_1/e_0) = ln(0.0093/0.2994)/ln(0.2994/0.8584) = 3.296,
p = ln(e_3/e_2)/ln(e_2/e_1) = ln(2.6876 × 10^{−7}/0.0093)/ln(0.0093/0.2994) = 3.010.
We obtain better than third-order convergence, which is a higher order than the theoretical bound
gives us.
4.4. Newton's method for multiple roots. Let α be a root of f(x) = 0 with multiplicity m. In
this case we can write
f(x) = (x − α)^m φ(x), with φ(α) ≠ 0.
In this case
f(α) = f'(α) = · · · = f^{(m−1)}(α) = 0,   f^{(m)}(α) ≠ 0.
Recall that we can regard Newton's method as a fixed-point method:
x_{n+1} = g(x_n),   g(x) = x − f(x)/f'(x).
Then we substitute
f(x) = (x − α)^m φ(x)
to obtain
g(x) = x − (x − α)^m φ(x) / [m(x − α)^{m−1} φ(x) + (x − α)^m φ'(x)]
= x − (x − α) φ(x) / [m φ(x) + (x − α) φ'(x)].
Therefore we obtain
g'(α) = 1 − 1/m ≠ 0.
For m > 1, this is nonzero, and therefore Newton’s method is only linearly convergent.
There are ways of improving the speed of convergence of Newton’s method, creating a modified method
that is again quadratically convergent. In particular, consider the fixed-point iteration formula
x_{n+1} = g(x_n),   g(x) = x − m f(x)/f'(x),
in which we assume we know the multiplicity m of the root α being sought. Then, modifying the above
argument on the convergence of Newton's method, we obtain
g'(α) = 1 − m(1/m) = 0,
and the iteration method will be quadratically convergent. But most of the time we don’t know the
multiplicity.
One method of handling the problem of multiple roots of a function f is to define
µ(x) = f(x)/f'(x).
If α is a zero of f of multiplicity m with f(x) = (x − α)^m φ(x), then
µ(x) = (x − α)^m φ(x) / [m(x − α)^{m−1} φ(x) + (x − α)^m φ'(x)]
= (x − α) φ(x) / [m φ(x) + (x − α) φ'(x)]
also has a zero at α. However, φ(α) ≠ 0, so
φ(α) / [m φ(α) + (α − α)φ'(α)] = 1/m ≠ 0,
and α is a simple zero of µ(x). Newton's method can then be applied to µ(x) to give
g(x) = x − µ(x)/µ'(x) = x − [f(x)/f'(x)] / ({[f'(x)]^2 − f(x)f''(x)}/[f'(x)]^2),
which simplifies to
g(x) = x − f(x)f'(x) / ([f'(x)]^2 − f(x)f''(x)).
− f (x)f �� (x)
If g has the required continuity conditions, functional iteration applied to g will be quadratically
convergent regardless of the multiplicity of the zero of f. Theoretically, the only drawback to this
method is the additional calculation of f''(x) and the more laborious procedure of calculating the
iterates. In practice, however, multiple roots can cause serious round-off problems because the
denominator of the above expression is the difference of two numbers that are both close to 0.
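A minimal Python sketch of this modified iteration (names and tolerance illustrative); it needs f'' in addition to f', and applied to Example 15 below it converges to the double zero at 0 in a few steps:

import math

def modified_newton(f, df, d2f, x0, tol=1e-10, max_iter=50):
    # Newton applied to mu(x) = f(x)/f'(x); quadratic even at multiple roots.
    x = x0
    for _ in range(max_iter):
        fx, dfx = f(x), df(x)
        # note: the denominator is a difference of two small numbers near the root
        x_new = x - fx * dfx / (dfx**2 - fx * d2f(x))
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    return x

# Example 15: f(x) = e^x - x - 1 has a double zero at x = 0
print(modified_newton(lambda x: math.exp(x) - x - 1,
                      lambda x: math.exp(x) - 1,
                      lambda x: math.exp(x), 1.0))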
Example 15. Let f(x) = e^x − x − 1. Show that f has a zero of multiplicity 2 at x = 0. Show that
Newton’s method with x0 = 1 converges to this zero but not quadratically.
Sol. We have f(x) = e^x − x − 1, f'(x) = e^x − 1 and f''(x) = e^x.
Now f(0) = 1 − 0 − 1 = 0, f'(0) = 1 − 1 = 0 and f''(0) = 1 ≠ 0. Therefore f has a zero of multiplicity 2
at x = 0.
Starting with x0 = 1, iterations are given by
x_{n+1} = x_n − f(x_n)/f'(x_n).
x_1 = 0.58198, x_2 = 0.31906, x_3 = 0.16800, x_4 = 0.08635, x_5 = 0.04380, x_6 = 0.02206.
By using the modified Newton's Method
x_{n+1} = x_n − f(x_n)f'(x_n) / ([f'(x_n)]^2 − f(x_n)f''(x_n))
and starting with x_0 = 1.0, we obtain
x_1 = −0.23421, x_2 = −0.0084527, x_3 = −0.000011889.
We observe that the modified Newton's method converges much faster to the root 0.
Example 16. The equation f(x) = x^3 − 7x^2 + 16x − 12 = 0 has a double root at x = 2.0. Starting
with x_0 = 1, find the root correct to three decimals with Newton's method and its modified version.
Sol. First we apply the simple Newton's method, whose successive iterations are given by
x_{n+1} = x_n − (x_n^3 − 7x_n^2 + 16x_n − 12)/(3x_n^2 − 14x_n + 16),   n = 0, 1, 2, . . .
Starting with x_0 = 1.0, we obtain
x_1 = 1.4, x_2 = 1.652632, x_3 = 1.806484, x_4 = 1.89586
x_5 = 1.945653, x_6 = 1.972144, x_7 = 1.985886, x_8 = 1.992894
x_9 = 1.996435, x_{10} = 1.998214, x_{11} = 1.999106, x_{12} = 1.999553.
The root correct to 3 decimal places is x_{12} = 2.000.
If we apply the modified Newton's method (with m = 2), then
x_{n+1} = x_n − 2(x_n^3 − 7x_n^2 + 16x_n − 12)/(3x_n^2 − 14x_n + 16),   n = 0, 1, 2, . . .
Starting with x_0 = 1.0, we obtain
x_1 = 1.8, x_2 = 1.984615, x_3 = 1.999884.
The root correct to 3 decimal places is again 2.000, and in this case we need far fewer iterations to reach
the desired accuracy.
We end this chapter by solving an example with all three methods studied previously.
Example 17. The function f(x) = tan πx − 6 has a zero at (1/π) arctan 6 ≈ 0.447431543. Use eight
iterations of each of the following methods to approximate this root. Which method is most successful
and why?
a. Bisection method in interval [0,1].
b. Secant method with x0 = 0 and x1 = 0.48.
c. Newton’s method with x0 = 0.4.
Sol. It is important to note that f has several roots on the interval [0, 5] (plot it to see this).
a. Since f has several roots in [0, 5], the bisection method can converge to a different root on a larger
interval, so it is better to choose the interval [0, 1]. For this choice we have the following
results; after 8 iterations the answer is 0.447265625.

n a b c
0 0 1 0.5
1 0 0.5 0.25
2 0.25 0.5 0.375
3 0.375 0.5 0.4375
4 0.4375 0.5 0.46875
5 0.4375 0.46875 0.453125
6 0.4375 0.46875 0.4453125
7 0.4375 0.4453125 0.44921875
8 0.4453125 0.44921875 0.447265625

b. The secant method diverges for x_0 = 0 and x_1 = 0.48.
It converges for some other choices of initial guesses, for example, x_0 = 0.4 and
x_1 = 0.48. A few iterations are given:
x2 = 4.1824045, x3 = 4.29444232, x4 = 4.57230361, x5 = 0.444112051,
x6 = 0.446817663, x7 = 0.447469928, x8 = 0.447431099, x9 = 0.447431543.
c. We have
f(x) = tan(πx) − 6,   and   f'(x) = π/cos^2(πx).
Since the function f has several roots, some initial guesses may lead to convergence to a different root.
Indeed, for x0 = 0, Newton’s method converges to a different root. For Newton’s method, therefore, it
is suggested that we use x_0 = 0.4 in order to converge to the given root.


Starting with x0 = 0.4, we obtain
x1 = 0.488826408, x2 = 0.480014377, x3 = 0.467600335, x4 = 0.455142852,
x5 = 0.448555216, x6 = 0.447455353, x7 = 0.447431554, x8 = 0.447431543.
We see that for these particular examples and initial guesses, Newton's method and the secant
method give very similar convergence behavior, with Newton's method converging slightly faster.
The bisection method converges much more slowly than the other two methods, as expected.

Exercises
(1) Use the bisection method to find solutions accurate to within 10^{−3} for the following problems.
(a) x − 2^{−x} = 0 for 0 ≤ x ≤ 1.
(b) e^x − x^2 + 3x − 2 = 0 for 0 ≤ x ≤ 1.
(c) x + 1 − 2 sin(πx) = 0 for 0 ≤ x ≤ 0.5 and 0.5 ≤ x ≤ 1.
(2) Find an approximation to ∛25 correct to within 10^{−3} using the bisection algorithm.
(3) Find a bound for the number of iterations needed to achieve an approximation by the bisection
method with accuracy 10^{−2} to the solution of x^3 − x − 1 = 0 lying in the interval [1, 2]. Find
an approximation to the root with this degree of accuracy.
(4) Sketch the graphs of y = x and y = 2 sin x. Use the bisection method to find an approximation
to within 10^{−3} to the first positive value of x with x = 2 sin x.
(5) The function defined by f (x) = sin(πx) has zeros at every integer. Show that when −1 < a < 0
and 2 < b < 3, the bisection method converges to
(a) 0, if a + b < 2
(b) 2, if a + b > 2
(c) 1, if a + b = 2.
(6) Let f(x) = (x + 2)(x + 1)^2 x(x − 1)^3 (x − 2). To which zero of f does the bisection method
converge when applied on the following intervals?
(a) [−1.5, 2.5]
(b) [−0.5, 2.4]
(c) [−0.5, 3]
(d) [−3, −0.5].
(7) For each of the following equations, use the given interval or determine an interval [a, b] on
which fixed-point iteration will converge. Estimate the number of iterations necessary to obtain
approximations accurate to within 10^{−2}, and perform the calculations.
(a) x = 2 + 5/x^2.
(b) 2 + sin x − x = 0 in the interval [2, 3].
(c) 3x^2 − e^x = 0.
(d) x − cos x = 0.
(8) Use the fixed-point iteration method to find smallest and second smallest positive roots of the
equation tan x = 4x, correct to 4 decimal places.
(9) Show that g(x) = π + 0.5 sin(x/2) has a unique fixed point on [0, 2π]. Use fixed-point iteration
to find an approximation to the fixed point that is accurate to within 10^{−2}. Also estimate the
number of iterations required to achieve 10^{−2} accuracy, and compare this theoretical estimate
to the number actually needed.
(10) Find all the zeros of f(x) = x^2 + 10 cos x by using the fixed-point iteration method for an
appropriate iteration function g. Find the zeros accurate to within 10^{−2}.
(11) Let A be a given positive constant and g(x) = 2x − Ax^2.
(a) Show that if fixed-point iteration converges to a nonzero limit, then the limit is α = 1/A,
so the inverse of a number can be found using only multiplications and subtractions.
(b) Find an interval about 1/A for which fixed-point iteration converges, provided x0 is in
that interval.
(12) Consider the root-finding problem f(x) = 0 with root α, with f'(x) ≠ 0. Convert it to the
fixed-point problem
x = x + cf (x) = g(x)
with c a nonzero constant. How should c be chosen to ensure rapid convergence of


xn+1 = xn + cf (xn )
to α (provided that x0 is chosen sufficiently close to α)? Apply your way of choosing c to the
root-finding problem x^3 − 5 = 0.
(13) Show that if A is any positive number, then the sequence defined by
x_n = (1/2) x_{n−1} + A/(2x_{n−1}),   for n ≥ 1,
converges to √A whenever x_0 > 0. What happens if x_0 < 0?
(14) Use the secant method to find solutions accurate to within 10^{−3} for the following problems.
(a) −x^3 − cos x = 0 with x_0 = −1 and x_1 = 0.
(b) x − cos x = 0, x ∈ [0, π/2].
(15) Use Newton's method to find solutions accurate to within 10^{−3} for the following problems.
(a) x − e^{−x} = 0 for 0 ≤ x ≤ 1.
(b) 2x cos 2x − (x − 2)^2 = 0 for 2 ≤ x ≤ 3 and 3 ≤ x ≤ 4.
(16) Use Newton's method to approximate the positive root of 2 cos x = x^4 correct to six decimal
places.
(17) The fourth-degree polynomial f(x) = 230x^4 + 18x^3 + 9x^2 − 221x − 9 = 0 has two real zeros, one
in [−1, 0] and the other in [0, 1]. Attempt to approximate these zeros to within 10^{−6} using the
secant and Newton's methods.
(18) Use Newton's method to solve the equation
1/2 + (1/4)x^2 − x sin x − (1/2) cos 2x = 0
with x_0 = π/2. Iterate using Newton's method until an accuracy of 10^{−5} is obtained. Explain
why the result seems unusual for Newton's method. Also, solve the equation with x_0 = 5π and
x_0 = 10π.
(19) (a) Apply Newton's method to the function
f(x) = √x for x ≥ 0,   f(x) = −√(−x) for x < 0,
with the root α = 0. What is the behavior of the iterates? Do they converge, and if so, at
what rate?
(b) Do the same but with
f(x) = ∛(x^2) for x ≥ 0,   f(x) = −∛(x^2) for x < 0.
(20) Apply Newton's method with x_0 = 0.8 to the equation f(x) = x^3 − x^2 − x + 1 = 0, and
verify that the convergence is only of first order. Further, show that the root α = 1 has multiplicity
2, and then apply the modified Newton's method with m = 2 and verify that the convergence
is of second order.
(21) Use Newton's method to approximate, to within 10^{−4}, the value of x that produces the point
on the graph of y = x^2 that is closest to (1, 0).
(22) Use Newton's method and the modified Newton's method to find a solution of
cos(x + √2) + x(x/2 + √2) = 0, for −2 ≤ x ≤ −1,
accurate to within 10^{−3}.
(23) A particle starts at rest on a smooth inclined plane whose angle θ is changing at a constant
rate
dθ/dt = ω < 0.
At the end of t seconds, the position of the object is given by
x(t) = −(g/(2ω^2)) ((e^{ωt} − e^{−ωt})/2 − sin ωt).
Suppose the particle has moved 1.7 ft in 1 s. Find, to within 10^{−5}, the rate ω at which θ
changes. Assume that g = 32.17 ft/s^2.

(24) An object falling vertically through the air is subjected to viscous resistance as well as to the
force of gravity. Assume that an object with mass m is dropped from a height s_0 and that the
height of the object after t seconds is
s(t) = s_0 − (mg/k) t + (m^2 g/k^2)(1 − e^{−kt/m}),
where g = 32.17 ft/s^2 and k represents the coefficient of air resistance in lb-s/ft. Suppose
s_0 = 300 ft, m = 0.25 lb, and k = 0.1 lb-s/ft. Find, to within 0.01 s, the time it takes this
quarter-pounder to hit the ground.
(25) The circle below has radius 1, and the longer circular arc joining A and B is twice as long as
the chord AB. Find the length of the chord AB, correct to four decimal places. Use Newton’s
method.

(26) It costs a firm C(q) dollars to produce q grams per day of a certain chemical, where
C(q) = 1000 + 2q + 3q^{2/3}.
The firm can sell any amount of the chemical at $4 a gram. Find the break-even point of the
firm, that is, how much it should produce per day in order to have neither a profit nor a loss.
Use Newton's method and give the answer to the nearest gram.

Appendix A. Algorithms
Algorithm (Bisection):
To determine a root of f(x) = 0 that is accurate within a specified tolerance value ε, given values a
and b such that f(a) f(b) < 0.
Repeat:
Define c = (a + b)/2.
If f(a) f(c) < 0, then set b = c, otherwise set a = c.
End if.
Until |a − b| ≤ ε (tolerance value).
Print root as c.
Algorithm (Fixed-point):
To find a solution to x = g(x) given an initial approximation x0 .
Input: Initial approximation x_0, tolerance value ε, maximum number of iterations N.
Output: Approximate solution α or message of failure.
Step 1: Set i = 1.
Step 2: While i ≤ N do Steps 3 to 6.
Step 3: Set x_1 = g(x_0). (Compute x_i.)
Step 4: If |x_1 − x_0| ≤ ε or |x_1 − x_0|/|x_1| < ε then OUTPUT x_1; (The procedure was successful.)
STOP.
Step 5: Set i = i + 1.
Step 6: Set x_0 = x_1. (Update x_0.)
Step 7: OUTPUT ('The method failed after N iterations'); (The procedure was unsuccessful.)
STOP.

Algorithm (Secant):
1. Give inputs and take two initial guesses x_0 and x_1; set f_0 = f(x_0), f_1 = f(x_1).
2. Compute the next iterate
x_2 = x_1 − f_1 (x_1 − x_0)/(f_1 − f_0).
3. If
|x_2 − x_1| < ε   or   |x_2 − x_1|/|x_2| < ε,
then stop and print the root.
4. Otherwise set x_0 = x_1, x_1 = x_2 and repeat step 2. Also check whether the number of iterations has
exceeded the maximum number of iterations.

Algorithm (Newton’s method):


To find a solution to f (x) = 0, given an initial approximation x0 .
Input: Initial approximation x_0, tolerance value ε, maximum number of iterations N.
Output: Approximate solution x_1 or message of failure.
Step 1: Set i = 1.
Step 2: While i ≤ N do Steps 3 to 6.
Step 3: Set x_1 = x_0 − f(x_0)/f'(x_0). (Compute x_i.)
Step 4: If |x_1 − x_0| ≤ ε or |x_1 − x_0|/|x_1| < ε then OUTPUT x_1; (The procedure was successful.) STOP.
Step 5: Set i = i + 1.
Step 6: Set x0 =x1 . (Update x0 .)
Step 7: Output (’The method failed after N iterations, N =’, N ); (The procedure was unsuccessful.)
STOP.

Bibliography
[Burden] Richard L. Burden, J. Douglas Faires and Annette Burden, “Numerical Analysis,” Cengage
Learning, 10th edition, 2015.
[Atkinson] K. Atkinson and W. Han, “Elementary Numerical Analysis,” John Wiley and Sons, 3rd
edition, 2004.
CHAPTER 3 (4 LECTURES)
DIRECT METHODS FOR SOLVING LINEAR SYSTEMS

1. Introduction
Systems of simultaneous linear equations are associated with many problems in engineering and
science, as well as with applications to the social sciences and the quantitative study of business and
economic problems. These problems occur in a wide variety of disciplines, directly in real-world
problems as well as in the solution process for other problems.
The principal objective of this chapter is to discuss the numerical aspects of solving linear systems of
equations having the form

a_{11}x_1 + a_{12}x_2 + · · · + a_{1n}x_n = b_1
a_{21}x_1 + a_{22}x_2 + · · · + a_{2n}x_n = b_2        (1.1)
................................................
a_{n1}x_1 + a_{n2}x_2 + · · · + a_{nn}x_n = b_n.

This is a linear system of n equation in n unknowns x1 , x2 , · · · , xn . This system can simply be written
in the matrix equation form
Ax=b

     
[ a_{11}  a_{12}  · · ·  a_{1n} ] [ x_1 ]   [ b_1 ]
[ a_{21}  a_{22}  · · ·  a_{2n} ] [ x_2 ] = [ b_2 ]     (1.2)
[  ...     ...            ...  ] [ ... ]   [ ... ]
[ a_{n1}  a_{n2}  · · ·  a_{nn} ] [ x_n ]   [ b_n ]

This equation has a unique solution x = A^{−1}b when the coefficient matrix A is non-singular. Unless
otherwise stated, we shall assume that this is the case under discussion. If A^{−1} is already available,
then x = A^{−1}b provides a good method of computing the solution x.
If A^{−1} is not available, then in general A^{−1} should not be computed solely for the purpose of obtaining
x. More efficient numerical procedures will be developed in this chapter. We study broadly two
categories, direct and iterative methods. We start with direct methods to solve the linear system in
this chapter.

2. Gaussian Elimination
Direct methods, which are techniques that give a solution in a fixed number of steps, subject only to
round-off errors, are considered in this chapter. Gaussian elimination is the principal tool in the direct
solution of system (1.2). The method is named after Carl Friedrich Gauss (1777-1855).
To solve a larger system of linear equations, we consider the following n × n system:
a_{11}x_1 + a_{12}x_2 + a_{13}x_3 + · · · + a_{1n}x_n = b_1   (E_1)
a_{21}x_1 + a_{22}x_2 + a_{23}x_3 + · · · + a_{2n}x_n = b_2   (E_2)
............................................................
a_{i1}x_1 + a_{i2}x_2 + a_{i3}x_3 + · · · + a_{in}x_n = b_i   (E_i)
............................................................
a_{n1}x_1 + a_{n2}x_2 + a_{n3}x_3 + · · · + a_{nn}x_n = b_n   (E_n).

Here E_i denotes the i-th equation (row) of the system with coefficient matrix A = [a_{ij}], i, j = 1, 2, · · · , n.


Firstly we write the Augmented matrix [A, b] (coefficient and right hand side together) as follows.
 
[ a_{11}  a_{12}  · · ·  a_{1n}  a_{1,n+1} ]
[ a_{21}  a_{22}  · · ·  a_{2n}  a_{2,n+1} ]
[  ...     ...            ...      ...    ]
[ a_{n1}  a_{n2}  · · ·  a_{nn}  a_{n,n+1} ]
where ai,n+1 = bi .
Let a_{11} ≠ 0 and eliminate x_1 from E_2, E_3, · · · , E_n.
Define multipliers m_{j1} = a_{j1}/a_{11}, for each j = 2, 3, · · · , n.
We replace each E_j by E_j − m_{j1}E_1, j = 2, 3, · · · , n.
This eliminates x_1 from these rows.
We repeat this procedure sequentially for i = 2, 3, · · · , n − 1, performing the following operations
provided a_{ii} ≠ 0:
E_j −→ E_j − (a_{ji}/a_{ii})E_i, for each j = i + 1, i + 2, · · · , n.
This eliminates x_i in each row below the i-th for all values of i = 1, 2, · · · , n − 1.
The resulting matrix is triangular and has the form
 
[ a_{11}  a_{12}  · · ·  a_{1n}  a_{1,n+1} ]
[ 0       a_{22}  · · ·  a_{2n}  a_{2,n+1} ]
[ ...      ...            ...      ...    ]
[ 0       0       · · ·  a_{nn}  a_{n,n+1} ]
Solving the n-th equation for x_n gives
x_n = a_{n,n+1}/a_{nn}.
Solving the (n − 1)-st equation for x_{n−1} and using the known value of x_n yields (back substitution)
x_{n−1} = (a_{n−1,n+1} − a_{n−1,n} x_n)/a_{n−1,n−1}.
Continuing this process, we obtain
x_i = (a_{i,n+1} − a_{i,i+1}x_{i+1} − a_{i,i+2}x_{i+2} − · · · − a_{in}x_n)/a_{ii} = (a_{i,n+1} − Σ_{j=i+1}^{n} a_{ij}x_j)/a_{ii},
for each i = n − 1, n − 2, · · · , 2, 1.
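A minimal Python sketch of forward elimination followed by back substitution (no pivoting; this is a sketch under the assumption that every pivot a_{ii} is nonzero, and the function name is illustrative):

def gauss_solve(A, b):
    # Solve Ax = b by Gaussian elimination without pivoting; modifies A and b in place.
    n = len(b)
    # forward elimination: E_j <- E_j - m_ji * E_i
    for i in range(n - 1):
        for j in range(i + 1, n):
            m = A[j][i] / A[i][i]
            for k in range(i, n):
                A[j][k] -= m * A[i][k]
            b[j] -= m * b[i]
    # back substitution
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(A[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - s) / A[i][i]
    return x

# Example 1 below: the solution is x = [0, 4, 7]
print(gauss_solve([[3, 2, -1], [1, -3, 2], [2, -1, 1]], [1, 2, 3]))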
Example 1. Solve the following systems using the simple Gaussian elimination method
3x1 + 2x2 − x3 = 1
x1 − 3x2 + 2x3 = 2
2x1 − x2 + x3 = 3.
Sol. We label the rows as E_1, E_2 and E_3. The augmented matrix form of the given system is
 
3 2 −1 1
1 −3 2 2 .
2 −1 1 3
Since a_{11} = 3 ≠ 0, we wish to eliminate the elements a_{21} and a_{31} by subtracting from the second
and third rows the appropriate multiples of the first row. In this case the multipliers are
m_{21} = 1/3,   m_{31} = 2/3.
Thus we write E_2 as E_2 − (1/3)E_1 and E_3 as E_3 − (2/3)E_1. Hence
 
3    2       −1     1
0    −11/3   7/3    5/3
0    −7/3    5/3    7/3
As a_{22} = −11/3 ≠ 0, we eliminate the entry in the a_{32} position by taking the multiplier m_{32} = 7/11
and writing E_3 as E_3 − (7/11)E_2, to get
 
3    2       −1     1
0    −11/3   7/3    5/3
0    0       2/11   14/11
Obviously, the original set of equations has been transformed to upper-triangular form. All the
diagonal elements of the resulting upper-triangular matrix are nonzero, which means that the
coefficient matrix of the given system is nonsingular; therefore, the given system has a unique
solution.
Now backward substitution gives
(2/11)x_3 = 14/11 ⇒ x_3 = 7,
−(11/3)x_2 = 5/3 − (7/3)x_3 = −44/3 ⇒ x_2 = 4,
3x_1 = 1 − 2x_2 + x_3 = 0 ⇒ x_1 = 0.
Partial Pivoting: In the elimination process we divide by a_{ii} at each stage, assuming a_{ii} ≠ 0.
These elements are known as pivot elements. If at any stage of elimination one of the pivots becomes
small (or zero), then we bring another element into the pivot position by interchanging rows. This
process is called Gauss elimination with partial pivoting.
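The change to the elimination loop is small: before eliminating column i, swap in the row with the largest |a_{ji}|. A sketch of the modification (the helper name is illustrative; it builds on the gauss_solve sketch above):

def pivot_rows(A, b, i):
    # Partial pivoting: swap row i with the row below having the largest |A[j][i]|.
    n = len(b)
    p = max(range(i, n), key=lambda j: abs(A[j][i]))
    if p != i:
        A[i], A[p] = A[p], A[i]
        b[i], b[p] = b[p], b[i]

# Calling pivot_rows(A, b, i) at the top of the elimination loop in gauss_solve
# turns it into Gaussian elimination with partial pivoting.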
Example 2. Solve the system of equations using Gauss elimination. This system has exact solution
(known from other sources!) x1 = 2.6, x2 = −3.8, x3 = −5.0.

6x_1 + 2x_2 + 2x_3 = −2
2x_1 + (2/3)x_2 + (1/3)x_3 = 1
x_1 + 2x_2 − x_3 = 0.
Sol. Let us use a floating-point representation with 4 digits, where all operations are rounded. We
label the rows as E_1, E_2 and E_3. The augmented matrix (coefficient matrix with right-hand side)
is given by
 
6.000 2.000 2.000 −2.000
2.000 0.6667 0.3333 1.000 
1.000 2.000 −1.000 0.0
Multipliers are m_{21} = 2/6 = 0.3333 and m_{31} = 1/6 = 0.1667.
We write E_2 as E_2 − 0.3333E_1 and E_3 as E_3 − 0.1667E_1.
 
6.000 2.000 2.000 −2.000
 0.0 0.0001000 −0.3333 1.667 
0.0 1.667 −1.333 0.3334
Multiplier is m_{32} = 1.667/0.0001 = 16670 and we write E_3 as E_3 − 16670E_2.
 
6.000 2.000 2.000 −2.000
 0.0 0.0001000 −0.3333 1.667 
0.0 0.0 5555 −27790
Using back substitution, we obtain
x3 = −5.003
x2 = 0.0
x1 = 1.335.
We observe that computed solution is not compatible with the exact solution.
The difficulty is in a22 . This coefficient is very small (almost zero). This means that the coefficient
in this position had essentially infinite relative error and this was carried through into computation
involving this coefficient. To avoid this, we interchange second and third rows and then continue the
elimination.
In this case (after interchanging) the multiplier is m_{32} = 0.0001/1.667 = 0.00005999 and we write
the new E_3 as E_3 − 0.00005999E_2.
 
6.000   2.000   2.000     −2.000
0.0     1.667   −1.333    0.3334
0.0     0.0     −0.3332   1.667
Using back substitution, we obtain
x3 = −5.003
x2 = −3.801
x1 = 2.602.
We see that after partial pivoting, we get the desired solution.
Example 3. Given the linear system
x1 − x2 + αx3 = −2,
−x1 + 2x2 − αx3 = 3,
αx1 + x2 + x3 = 2.
a. Find value(s) of α for which the system has no solutions.
b. Find value(s) of α for which the system has an infinite number of solutions.
c. Assuming a unique solution exists for a given α, find the solution.
Sol. Augmented matrix is given by
 
1 −1 α −2
−1 2 −α 3 
α 1 1 2
Multipliers are m21 = −1 and m31 = α. Performing E2 → E2 + E1 and E3 → E3 − αE1 to obtain
 
1    −1      α          −2
0    1       0          1
0    1 + α   1 − α^2    2(1 + α)
Multiplier is m32 = 1 + α and we perform E3 → E3 − m32 E2 .
 
1    −1   α          −2
0    1    0          1
0    0    1 − α^2    1 + α
a. If α = 1, then the last row of the reduced augmented matrix says that 0.x3 = 2, the system has no
solution.
b. If α = −1, then we see that the system has infinitely many solutions.
c. If α ≠ ±1, then the system has a unique solution:
x_3 = 1/(1 − α),   x_2 = 1,   x_1 = −1/(1 − α).
Remark 2.1. Unique solution, no solution, or infinite number of solutions.
(1) If we have a leading one in every column, then we will have a unique solution.
(2) If we have a row of zeros equal to a non-zero number in right side, then the system has no
solution.
(3) If we do not have a leading one in every column in a homogeneous system (i.e., a system where
all the equations equal zero), or we get a row of zeros, then the system has an infinite number of
solutions.
Example 4. Solve the system by Gauss elimination
4x1 + 3x2 + 2x3 + x4 = 1
3x1 + 4x2 + 3x3 + 2x4 = 1
2x1 + 3x2 + 4x3 + 3x4 = −1
x1 + 2x2 + 3x3 + 4x4 = −1.
Sol. We write the augmented matrix and solve the system.
 
4 3 2 1 1
3 4 3 2 1 
 
2 3 4 3 −1
1 2 3 4 −1
Multipliers are m_{21} = 3/4, m_{31} = 1/2, and m_{41} = 1/4.
Replace E_2 with E_2 − m_{21}E_1, E_3 with E_3 − m_{31}E_1 and E_4 with E_4 − m_{41}E_1.
 
4 3 2 1 1
0 7/4 3/2 5/4 1/4 
 
0 3/2 3 5/2 −3/2
0 5/4 5/2 15/4 −5/4
Multipliers are m_{32} = 6/7 and m_{42} = 5/7.
Replace E_3 with E_3 − m_{32}E_2 and E_4 with E_4 − m_{42}E_2, and we obtain
 
4 3 2 1 1
0 7/4 3/2 5/4 1/4 
 
0 0 12/7 10/7 −12/7
0 0 10/7 20/7 −10/7
Multiplier is m_{43} = 5/6 and we replace E_4 with E_4 − m_{43}E_3.
 
4 3 2 1 1
0 7/4 3/2 5/4 1/4 
 
0 0 12/7 10/7 −12/7
0 0 0 5/3 0

Using back substitution successively for x4 , x3 , x2 , x1 , we obtain x4 = 0, x3 = −1, x2 = 1, x1 = 0.

Complete Pivoting: In the first stage of elimination, we search the largest element in magnitude
from the entire matrix and bring it at the position of first pivot. We repeat the same process at every
step of elimination. This process require interchange of both rows and columns.

Scaled Partial Pivoting: In this approach, the algorithm selects as pivot the entry that is largest
relative to the entries in its row. At the beginning, a scale factor must be computed for each equation
in the system. We define
s_i = max_{1≤j≤n} |a_{ij}|   (1 ≤ i ≤ n).
These numbers are recorded in the scale vector s = [s_1, s_2, · · · , s_n]. Note that the scale vector does
not change throughout the procedure.
In starting the forward elimination process, we do not automatically use the first equation as the pivot
equation. Instead, we use the equation for which the ratio |a_{i1}|/s_i is greatest. We repeat the process
at the later stages, keeping the same scale factors.
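A sketch of the pivot-selection step in Python (the helper name is illustrative): the scale vector is computed once, and at stage i the row maximizing |A[j][i]|/s[j] is swapped into the pivot position.

def scaled_pivot_rows(A, b, s, i):
    # Scaled partial pivoting: pick the row with the largest |A[j][i]| / s[j].
    n = len(b)
    p = max(range(i, n), key=lambda j: abs(A[j][i]) / s[j])
    if p != i:
        A[i], A[p] = A[p], A[i]
        b[i], b[p] = b[p], b[i]
        s[i], s[p] = s[p], s[i]

# scale factors, computed once before elimination:
# s = [max(abs(a) for a in row) for row in A]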

Example 5. Solve the system
2.11x_1 − 4.21x_2 + 0.921x_3 = 2.01
4.01x_1 + 10.2x_2 − 1.12x_3 = −3.09
1.09x_1 + 0.987x_2 + 0.832x_3 = 4.21
by using scaled partial pivoting.


Sol. The augmented matrix is


 
2.11 −4.21 0.921 2.01
4.01 10.2 −1.12 −3.09
1.09 0.987 0.832 4.21.

The scale factors are s_1 = 4.21, s_2 = 10.2, and s_3 = 1.09. We need to pick the largest ratio among
(2.11/4.21 = 0.501, 4.01/10.2 = 0.393, 1.09/1.09 = 1), which is the third entry, so we interchange
row 1 and row 3 and interchange s_1 and s_3 to get
 
1.09 0.987 0.832 4.21
4.01 10.2 −1.12 −3.09
2.11 −4.21 0.921 2.01.

Performing E2 → E2 − 3.68E1 , E3 → E3 − 1.94E1 , we obtain


 
1.09 0.987 0.832 4.21
 0 6.57 −4.18 −18.6 
0 −6.12 −0.689 −6.16.

Now comparing (6.57/10.2 = 0.6444, 6.12/4.21 = 1.45), the second ratio is larger, so we interchange
row 2 and row 3 and interchange the scale factors accordingly.
 
1.09 0.987 0.832 4.21
 0 −6.12 −0.689 −6.16 
0 6.57 −4.18 −18.6.

Performing E3 → E3 + 1.07E2 , we get


 
1.09 0.987 0.832 4.21
 0 −6.12 −0.689 −6.16 
0 0 −4.92 −25.2.

Backward substitution gives x3 = 5.12, x2 = 0.43, x1 = −0.436.

Example 6. Solve the system

3x1 − 13x2 + 9x3 + 3x4 = −19


−6x1 + 4x2 + x3 − 18x4 = −34
6x1 − 2x2 + 2x3 + 4x4 = 16
12x1 − 8x2 + 6x3 + 10x4 = 26

by hand using scaled partial pivoting. Justify all row interchanges and write out the transformed matrix
after you finish working on each column.

Sol. The augmented matrix is


 
3 −13 9 3 −19
−6 4 1 −18 −34
 
6 −2 2 4 16 
12 −8 6 10 26

and the scale factors are s1 = 13, s2 = 18, s3 = 6, & s4 = 12. We need to pick the largest (3/13, 6/18, 6/6, 12/12),
which is the third entry, and interchange row 1 and row 3 and interchange s1 and s3 to get
 
6 −2 2 4 16
−6 4 1 −18 −34
 
 3 −13 9 3 −19
12 −8 6 10 26
with s_1 = 6, s_2 = 18, s_3 = 13, s_4 = 12. Performing E_2 → E_2 − (−6/6)E_1, E_3 → E_3 − (3/6)E_1, and
E_4 → E_4 − (12/6)E_1, we obtain
 
6 −2 2 4 16
0 2 3 −14 −18
 
0 −12 8 1 −27
0 −4 2 2 −6.
Comparing (|a_{22}|/s_2 = 2/18, |a_{32}|/s_3 = 12/13, |a_{42}|/s_4 = 4/12), the largest ratio belongs to
the third row, so we interchange row 2 and row 3 and interchange s_2 and s_3 to get
 
6 −2 2 4 16
0 −12 8 1 −27
 
0 2 3 −14 −18
0 −4 2 2 −6
with s1 = 6, s2 = 13, s3 = 18, s4 = 12. Performing E3 → E3 − (2/12)E2 and E4 → E4 − (−4/12)E2 ,
we get
 
6 −2 2 4 16
0 −12 8 1 −27 
 
0 0 13/3 −83/6 −45/2
0 0 −2/3 5/3 3
Comparing (|a33 |/s3 = (13/3)/18, |a43 |/s4 = (2/3)/12), the largest is the first entry so we do not
interchange rows. Performing E4 → E4 − (−2/13)E3 , we get the final reduced matrix
 
6 −2 2 4 16
0 −12 8 1 −27 
 
0 0 13/3 −83/6 −45/2
0 0 0 −6/13 −6/13
Backward substitution gives x1 = 3, x2 = 1, x3 = −2, x4 = 1.
Example 7. Solve this system of linear equations:
0.0001x + y = 1
x+y =2
using no pivoting, partial pivoting, and scaled partial pivoting. Carry at most five significant digits
of precision (rounding) to see how finite precision computations and roundoff errors can affect the
calculations.
Sol. By direct substitution, it is easy to verify that the true solution is x = 1.0001 and y = 0.99990 to
five significant digits.
For no pivoting, the first equation in the original system is the pivot equation, and the multiplier is
1/0.0001 = 10000. The new system of equations is
0.0001x + y = 1
9999y = 9998
We obtain y = 9998/9999 ≈ 0.99990 and x = 1. Notice that we have lost the last significant digit in
the correct value of x.
We repeat the solution process using partial pivoting for the original system. We see that the second
entry is larger, so the second equation is used as the pivot equation. We can interchange the two
equations, obtaining
x+y =2
0.0001x + y = 1
which gives y = 0.99980/0.99990 ≈ 0.99990 and x = 2 − y = 2 − 0.99990 = 1.0001.
Both computed values of x and y are correct to five significant digits.
We repeat the solution process using scaled partial pivoting for the original system. Since the scaling
constants are s = (1, 1) and the ratios for determining the pivot equation are (0.0001/1, 1/1), the
second equation is now the pivot equation. We do not actually interchange the equations and use
the second equation as the first pivot equation. The rest of the calculations are as above for partial
pivoting. The computed values of x and y are correct to five significant digits.

2.1. Operation Counts. We count the number of operations required to solve the system Ax = b.
Both the amount of time required to complete the calculations and the subsequent round-off error
depend on the number of floating-point arithmetic operations needed to solve a routine problem.
In general, the amount of time required to perform a multiplication or division on a computer is
approximately the same and is considerably greater than that required to perform an addition or
subtraction. The actual differences in execution time, however, depend on the particular computing
system.
To demonstrate the counting operations for a given method, we will count the operations required to
solve a typical linear system of n equations in n unknowns using Gauss elimination Algorithm. We will
keep the count of the additions/subtractions separate from the count of the multiplications/divisions
because of the time differential.
The first step is to calculate the multipliers m_{ji} = a_{ji}/a_{ii}. Then the replacement of the equation
E_j by (E_j − m_{ji}E_i) requires that m_{ji} be multiplied by each term in E_i and that each term of the
resulting equation be subtracted from the corresponding term in E_j.
The following table states the operation count in going from A to U at each step 1, 2, · · · , n − 1.
Step      Divisions      Multiplications        Additions/subtractions
1         n − 1          (n − 1)^2              (n − 1)^2
2         n − 2          (n − 2)^2              (n − 2)^2
...       ...            ...                    ...
n − 2     2              4                      4
n − 1     1              1                      1
Total:    n(n − 1)/2     n(n − 1)(2n − 1)/6     n(n − 1)(2n − 1)/6
Therefore the total number of additions/subtractions from A to U is n(n − 1)(2n − 1)/6 (I).
The total number of multiplications/divisions is n(n − 1)/2 + n(n − 1)(2n − 1)/6 = n(n^2 − 1)/3 (II).
Now we count the number of additions/subtractions and the number of multiplications/divisions for
the right-hand-side vector b. We have:
Total number of additions/subtractions: (n − 1) + (n − 2) + · · · + 2 + 1 = n(n − 1)/2 (III).
Total number of multiplications/divisions: (n − 1) + (n − 2) + · · · + 2 + 1 = n(n − 1)/2 (IV).
Lastly we count the number of additions/subtractions and multiplications/divisions for finding the
solutions by the back-substitution method. Recall that
x_n = b_n/a_{nn}, where b_n = a_{n,n+1}.
For each i = n − 1, n − 2, · · · , 2, 1, we have
x_i = (a_{i,n+1} − Σ_{j=i+1}^{n} a_{ij}x_j)/a_{ii}.
Therefore the total number of additions/subtractions is 0 + 1 + · · · + (n − 1) = n(n − 1)/2 (V).
The total number of multiplications/divisions is 1 + 2 + · · · + n = n(n + 1)/2 (VI).
Thus the total number of operations to obtain the solution of a system of n linear equations in n
variables using Gaussian elimination is:
Additions/Subtractions (I + III + V):
n(n − 1)(2n + 5)/6.
Multiplications/Divisions (II + IV + VI):
n(n^2 + 3n − 1)/3.
For large n, the total number of multiplications and divisions is approximately n^3/3, as is the total
number of additions and subtractions. Thus the amount of computation and the time required increase
with n in proportion to n^3, as shown in the table.
n       Multiplications/Divisions    Additions/Subtractions
3       17                           11
10      430                          375
50      44,150                       42,875
100     343,300                      338,250
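These closed forms are easy to check numerically; a quick Python sketch (the function name is illustrative):

def gauss_counts(n):
    # Closed-form operation counts for Gaussian elimination on an n x n system.
    mults = n * (n**2 + 3*n - 1) // 3
    adds = n * (n - 1) * (2*n + 5) // 6
    return mults, adds

for n in (3, 10, 50, 100):
    print(n, gauss_counts(n))   # reproduces the table above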

3. The LU Factorization
When we use matrix multiplication, another interpretation can be given to Gauss elimination: the
matrix A can be factored into the product of two triangular matrices.
Let Ax = b be the system to be solved, where A is the n × n coefficient matrix. The linear system can
be reduced to the upper triangular system Ux = g with
 
U = [ u_{11}  u_{12}  · · ·  u_{1n} ]
    [ 0       u_{22}  · · ·  u_{2n} ]
    [ ...      ...            ...  ]
    [ 0       0       · · ·  u_{nn} ]
Here u_{ij} = a_{ij} (the entries left after elimination). Introduce an auxiliary lower-triangular matrix L
based on the multipliers m_{ij} as follows:
L = [ 1        0        0    · · ·  0 ]
    [ m_{21}   1        0    · · ·  0 ]
    [ m_{31}   m_{32}   1    · · ·  0 ]
    [  ...      ...                ... ]
    [ m_{n1}   m_{n2}   · · ·  m_{n,n−1}  1 ]
Theorem 3.1. Let A be a non-singular matrix and let L and U be defined as above. If U is produced
without pivoting then
LU = A.
This is called LU factorization of A.
We can use Gaussian elimination to solve a system by LU decomposition. Suppose that A has been
factored into the triangular form A = LU , where L is lower triangular and U is upper triangular. Then
we can solve for x more easily by using a two-step process.
First we let y = U x and solve the lower triangular system Ly = b for y. Once y is known, the upper
triangular system Ux = y provides the solution x. One can check that the total number of operations is the
same as Gauss elimination.
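Assuming the factors L (unit lower triangular, holding the multipliers) and U are available, the two-step solve is a few lines of Python (a sketch; the name is illustrative):

def lu_solve(L, U, b):
    # Solve LUx = b: forward substitution for Ly = b, then back substitution for Ux = y.
    n = len(b)
    y = [0.0] * n
    for i in range(n):                      # Ly = b, using l_ii = 1
        y[i] = b[i] - sum(L[i][j] * y[j] for j in range(i))
    x = [0.0] * n
    for i in range(n - 1, -1, -1):          # Ux = y
        x[i] = (y[i] - sum(U[i][j] * x[j] for j in range(i + 1, n))) / U[i][i]
    return x

# Example 8 below: A = LU with L = [[1,0,0],[3,1,0],[2,1,1]], U = [[1,2,4],[0,2,2],[0,0,3]]
print(lu_solve([[1, 0, 0], [3, 1, 0], [2, 1, 1]],
               [[1, 2, 4], [0, 2, 2], [0, 0, 3]], [3, 13, 4]))  # [3, 4, -2]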
Example 8. We require to solve the following system of linear equations using LU decomposition.
x1 + 2x2 + 4x3 = 3
3x1 + 8x2 + 14x3 = 13
2x1 + 6x2 + 13x3 = 4.
Find the matrices L and U using Gauss elimination and using those values of L and U , solve the
system of equations.
Sol. We first apply the Gaussian elimination on the matrix A and collect the multipliers m21 , m31 ,
and m32 .
We have
 
1 2 4
A = 3 8 14
2 6 13
Multipliers are m21 = 3, m31 = 2.
E2 → E2 − 3E1 and E3 → E3 − 2E1 .
 
1 2 4
∼  0 2 2
0 2 5
Multiplier is m32 = 2/2 = 1 and we perform E3 → E3 − E2 .
 
1 2 4
∼  0 2 2
0 0 3
We observe that m21 = 3, m31 = 2, and m32 = 1. Therefore,
    
1 2 4 1 0 0 1 2 4
A = 3 8 14 = 3 1 0 0 2 2 = LU
2 6 13 2 1 1 0 0 3
Therefore,
Ax = b =⇒ LU x = b.
Assuming U x = y, we obtain,
Ly = b
i.e.
[ 1  0  0 ] [ y_1 ]   [ 3  ]
[ 3  1  0 ] [ y_2 ] = [ 13 ]
[ 2  1  1 ] [ y_3 ]   [ 4  ]
Using forward substitution, we obtain y1 = 3, y2 = 4, and y3 = −6. Now
    
Ux = y =⇒ [ 1  2  4 ] [ x_1 ]   [ 3  ]
          [ 0  2  2 ] [ x_2 ] = [ 4  ]
          [ 0  0  3 ] [ x_3 ]   [ −6 ]
Now, using the backward substitution process, we obtain the final solution as x3 = −2, x2 = 4, and
x1 = 3.
Example 9. (a) Determine the LU factorization for the matrix A in the linear system Ax = b, where
A = [ 1   1   0   3  ]        [ 1  ]
    [ 2   1   −1  1  ]    b = [ 1  ]     (3.1)
    [ 3   −1  −1  2  ]        [ −3 ]
    [ −1  2   3   −1 ]        [ 4  ]
(b) Then use the factorization to solve the system
x_1 + x_2 + 3x_4 = 8
2x_1 + x_2 − x_3 + x_4 = 7
3x_1 − x_2 − x_3 + 2x_4 = 14
−x_1 + 2x_2 + 3x_3 − x_4 = −7.
Sol. (a) We take the coefficient matrix and apply Gauss elimination.
Multipliers are m21 = 2, m31 = 3, and m41 = −1.
Sequence of operations E2 → E2 − 2E1 , E3 → E3 − 3E1 , E4 → E4 − (−1)E1 .
 
1    1    0    3
0    −1   −1   −5
0    −4   −1   −7
0    3    3    2
Multipliers are m32 = 4 and m42 = −3.
E3 → E3 − 4E2 , E4 → E4 − (−3)E2 .
 
∼
1    1    0    3
0    −1   −1   −5
0    0    3    13
0    0    0    −13
The multipliers mij and the upper triangular matrix produce the following factorization
    
A = [ 1   1   0   3  ]   [ 1   0   0  0 ] [ 1  1   0   3   ]
    [ 2   1   −1  1  ] = [ 2   1   0  0 ] [ 0  −1  −1  −5  ] = LU.
    [ 3   −1  −1  2  ]   [ 3   4   1  0 ] [ 0  0   3   13  ]
    [ −1  2   3   −1 ]   [ −1  −3  0  1 ] [ 0  0   0   −13 ]
(b)
      
Ax = (LU)x = [ 1   0   0  0 ] [ 1  1   0   3   ] [ x_1 ]   [ 8  ]
             [ 2   1   0  0 ] [ 0  −1  −1  −5  ] [ x_2 ] = [ 7  ]
             [ 3   4   1  0 ] [ 0  0   3   13  ] [ x_3 ]   [ 14 ]
             [ −1  −3  0  1 ] [ 0  0   0   −13 ] [ x_4 ]   [ −7 ]
We first introduce the substitution y = U x. Then b = L(U x) = Ly. That is,
    
Ly = [ 1   0   0  0 ] [ y_1 ]   [ 8  ]
     [ 2   1   0  0 ] [ y_2 ] = [ 7  ]
     [ 3   4   1  0 ] [ y_3 ]   [ 14 ]
     [ −1  −3  0  1 ] [ y_4 ]   [ −7 ]
This system is solved for y by a simple forward-substitution process:

y1 = 8, y2 = −9, y3 = 26, y4 = −26.


We then solve Ux = y for x, the solution of the original system; that is,
    
[ 1  1   0   3   ] [ x_1 ]   [ 8   ]
[ 0  −1  −1  −5  ] [ x_2 ] = [ −9  ]
[ 0  0   3   13  ] [ x_3 ]   [ 26  ]
[ 0  0   0   −13 ] [ x_4 ]   [ −26 ]
Using backward substitution we obtain x4 = 2, x3 = 0, x2 = −1, x1 = 3.
Example 10. Show that the LU factorization algorithm requires
(a) (1/3)n^3 − (1/3)n multiplications/divisions and (1/3)n^3 − (1/2)n^2 + (1/6)n additions/subtractions.
(b) Show that solving Ly = b, where L is a lower-triangular matrix with l_{ii} = 1 for all i, requires
(1/2)n^2 − (1/2)n multiplications/divisions and (1/2)n^2 − (1/2)n additions/subtractions.
(c) Show that solving Ax = b by first factoring A into A = LU and then solving Ly = b and Ux = y
requires the same number of operations as the Gaussian elimination algorithm.
Sol. (a) We have already counted the mathematical operations in detail for Gauss elimination. Here
we do the same for the LU factorization.
We found that the total number of additions/subtractions from A to U is
n(n − 1)(2n − 1)/6 = (1/3)n^3 − (1/2)n^2 + (1/6)n,
and the total number of multiplications/divisions is
n(n − 1)/2 + n(n − 1)(2n − 1)/6 = n(n^2 − 1)/3 = (1/3)n^3 − (1/3)n.
These counts remain the same for factoring the matrix A into L and U.
(b) Solving Ly = b, where L is a lower-triangular matrix with l_{ii} = 1 for all i, requires
0 + 1 + · · · + (n − 1) = n(n − 1)/2 = (1/2)n^2 − (1/2)n additions/subtractions and the same number of
multiplications/divisions.
These operations are counted in the same manner as for back substitution; since each l_{ii} = 1, one
division is saved at each step.
(c) Finally, the counts for solving Ux = y are the same as for back substitution, and those in (b) cover
Ly = b. Adding these to the factorization counts from (a) gives the same totals as for Gaussian
elimination, so the total operation count is the same for the LU approach as for simple Gauss
elimination.

Exercises
(1) Use Gaussian elimination with backward substitution and two-digit rounding arithmetic to
solve the following linear systems. Do not reorder the equations. (The exact solution to each
system is x1 = −1, x2 = 1, x3 = 3.)
(a)

−x1 + 4x2 + x3 = 8
5 2 2
x1 + x2 + x3 = 1
3 3 3
2x1 + x2 + 4x3 = 11.

(b)

4x1 + 2x2 − x3 = −5
1 1 1
x1 + x2 − x3 = −1
9 9 3
x1 + 4x2 + 2x3 = 9.

(2) Using four-digit arithmetic, solve the following system of equations by Gaussian elimination
with and without partial pivoting:
0.729x_1 + 0.81x_2 + 0.9x_3 = 0.6867
x_1 + x_2 + x_3 = 0.8338
1.331x_1 + 1.21x_2 + 1.1x_3 = 1.000.

This system has exact solution, rounded to four places x1 = 0.2245, x2 = 0.2814, x3 = 0.3279.
Compare your answers!
(3) Use the Gaussian elimination algorithm to solve the following linear systems, if possible, and
determine whether row interchanges are necessary:
(a)

x1 − x2 + 3x3 = 2
3x1 − 3x2 + x3 = −1
x1 + x2 = 3.
(b)
2x1 − x2 + x3 − x4 = 6
x2 − x3 + x4 = 5
x4 = 5
x3 − x4 = 3.
(4) Use Gaussian elimination and three-digit chopping arithmetic to solve the following linear
system, and compare the approximations to the actual solution [0, 10, 1/7]^T.
3.03x1 − 12.1x2 + 14x3 = −119
−3.03x1 + 12.1x2 − 7x3 = 120
6.11x1 − 14.2x2 + 21x3 = −139.
(5) Repeat the above exercise using Gaussian elimination with partial and scaled partial pivoting
and three-digit rounding arithmetic.
(6) Suppose that
2x1 + x2 + 3x3 = 1
4x1 + 6x2 + 8x3 = 5
6x1 + αx2 + 10x3 = 5,
with |α| < 10. For which of the following values of α will there be no row interchange required
when solving this system using scaled partial pivoting?
(a) α = 6.
(b) α = 9.
(c) α = −3.
(7) Modify the LU Factorization Algorithm so that it can be used to solve a linear system, and
then solve the following linear systems.
2x1 − x2 + x3 = −1
3x1 + 3x2 + 9x3 = 0
3x1 + 3x2 + 5x3 = 4.

Appendix A. Algorithms
Algorithm (Gauss Elimination)
Input: number of unknowns and equations n; augmented matrix A = [aij ], where 1 ≤ i ≤ n and
1 ≤ j ≤ n + 1.
Output: solution x1 , x2 , · · · , xn or message that the linear system has no unique solution.
Step 1: For i = 1, · · · , n − 1 do Steps 2-4. (Elimination process.)
Step 2: Let p be the smallest integer with i ≤ p ≤ n and api �= 0.
If no integer p can be found
then OUTPUT (‘no unique solution exists’);
STOP.
Step 3: If p �= i then perform (Ep ) ↔ (Ei ).
Step 4: For j = i + 1, · · · , n do Steps 5 and 6.
Step 5: Set mji = aji /aii .
Step 6: Perform (Ej − mji Ei ) → (Ej );
Step 7 If ann = 0 then OUTPUT (‘no unique solution exists’);
STOP.
Step 8: Set x_n = a_{n,n+1}/a_{nn}. (Start backward substitution.)
Step 9: For i = n − 1, · · · , 1 set x_i = [a_{i,n+1} − Σ_{j=i+1}^{n} a_{ij}x_j]/a_{ii}.
Step 10: OUTPUT (x1 , · · · , xn ); (Procedure completed successfully.)
STOP.
Algorithm (LU Factorization)


To factor the n × n matrix A = [aij ] into the product of the lower-triangular matrix L = [lij ] and the
upper-triangular matrix U = [uij ]; that is, A = LU , where the main diagonal of either L or U consists
of all ones:
INPUT: dimension n; the entries aij , 1 ≤ i, j ≤ n of A; the diagonal l11 = · · · = lnn = 1 of L or the
diagonal u11 = · · · = unn = 1 of U .
OUTPUT: the entries lij , 1 ≤ j ≤ i, 1 ≤ i ≤ n of L and the entries, uij , i ≤ j ≤ n, 1 ≤ i ≤ n of U .
Step 1: Select l_{11} and u_{11} satisfying l_{11}u_{11} = a_{11}.
If l_{11}u_{11} = 0 then OUTPUT ('Factorization impossible'); STOP.
Step 2: For j = 2, · · · , n set u_{1j} = a_{1j}/l_{11} (first row of U) and l_{j1} = a_{j1}/u_{11} (first column of L).


Step 3: For i = 2, · · · , n − 1 do Steps 4 and 5.
Step 4: Select l_{ii} and u_{ii} satisfying l_{ii}u_{ii} = a_{ii} − Σ_{k=1}^{i−1} l_{ik}u_{ki}.
If l_{ii}u_{ii} = 0 then OUTPUT ('Factorization impossible'); STOP.
Step 5: For j = i + 1, · · · , n set
u_{ij} = (1/l_{ii})[a_{ij} − Σ_{k=1}^{i−1} l_{ik}u_{kj}]   (i-th row of U);
l_{ji} = (1/u_{ii})[a_{ji} − Σ_{k=1}^{i−1} l_{jk}u_{ki}]   (i-th column of L).
Step 6: Select l_{nn} and u_{nn} satisfying l_{nn}u_{nn} = a_{nn} − Σ_{k=1}^{n−1} l_{nk}u_{kn}.
(Note: If lnn unn = 0, then A = LU but A is singular.)
Step 7: OUTPUT (lij for j = 1, · · · , i and i = 1, · · · , n);
OUTPUT (uij for j = i, · · · , n and i = 1, · · · , n);
STOP.

Bibliography
[Burden] Richard L. Burden, J. Douglas Faires and Annette Burden, “Numerical Analysis,” Cengage
Learning, 10th edition, 2015.
[Atkinson] K. Atkinson and W. Han, “Elementary Numerical Analysis,” John Wiley and Sons, 3rd
edition, 2004.
CHAPTER 4 (5 LECTURES)
ITERATIVE TECHNIQUES IN MATRIX ALGEBRA

1. Introduction
In this chapter we will study iterative techniques to solve linear systems. An initial approximation
(or approximations) will be found, and new approximations are then determined based on how well
the previous approximations satisfied the equation. The objective is to find a way to minimize the
difference between the approximations and the exact solution. To discuss iterative methods for solving
linear systems, we first need to determine a way to measure the distance between n-dimensional column
vectors. This will permit us to determine whether a sequence of vectors converges to a solution of the
system. In actuality, this measure is also needed when the solution is obtained by the direct methods
presented in Chapter 3. Those methods required a large number of arithmetic operations, and using
finite-digit arithmetic leads only to an approximation to an actual solution of the system. We end
the chapter by presenting a way to find the dominant eigenvalue and an associated eigenvector. The
dominant eigenvalue plays an important role in the convergence of any iterative method.
1.1. Norms of Vectors and Matrices.
1.2. Vector Norms. Let R^n denote the set of all n-dimensional column vectors with real-number
components. To define a distance in R^n we use the notion of a norm, which is the generalization of
the absolute value on R, the set of real numbers.
Definition 1.1. A vector norm on R^n is a function, ‖·‖, from R^n into R with the following properties:
(1) ‖x‖ ≥ 0 for all x ∈ R^n.
(2) ‖x‖ = 0 if and only if x = 0.
(3) ‖x + y‖ ≤ ‖x‖ + ‖y‖ for all x, y ∈ R^n (triangle inequality).
(4) ‖αx‖ = |α| ‖x‖ for all x ∈ R^n and α ∈ R.
Definition 1.2. The l_2 and l_∞ norms for the vector x = (x_1, x_2, . . . , x_n)^t are defined by
‖x‖_2 = (Σ_{i=1}^{n} |x_i|^2)^{1/2}   and   ‖x‖_∞ = max_{1≤i≤n} |x_i|.
Note that each of these norms reduces to the absolute value in the case n = 1.
The l_2 norm is called the Euclidean norm of the vector x because it represents the usual notion of
distance from the origin in case x is in R^1 = R, R^2, or R^3. For example, the l_2 norm of the vector
x = (x_1, x_2, x_3)^t gives the length of the straight line joining the points (0, 0, 0) and (x_1, x_2, x_3).
Example 1. Determine the l2 norm and the l∞ norm of the vector x = (−1, 1, −2)t .
Sol. The vector x = (−1, 1, −2)^t in R3 has norms
‖x‖2 = ((−1)² + (1)² + (−2)²)^{1/2} = √6
and
‖x‖∞ = max{|−1|, |1|, |−2|} = 2.
Definition 1.3 (Distance between vectors in Rn). If x = (x1, x2, · · · , xn)^t and y = (y1, y2, · · · , yn)^t
are vectors in Rn, the l2 and l∞ distances between x and y are defined by
‖x − y‖2 = (Σ_{i=1}^{n} (xi − yi)²)^{1/2} and ‖x − y‖∞ = max_{1≤i≤n} |xi − yi|.
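As a quick illustration (a sketch of my own, with variable names that are not from the text), both distances can be computed directly in Python:

```python
import numpy as np

x = np.array([-1.0, 1.0, -2.0])
y = np.array([0.0, 0.0, 0.0])

l2 = np.sqrt(np.sum((x - y) ** 2))    # l2 distance: sqrt(6) ≈ 2.449
linf = np.max(np.abs(x - y))          # l-infinity distance: 2.0
print(l2, linf)
```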

Definition 1.4 (Matrix Norm). A matrix norm on the set of all n × n matrices is a real-valued
function, ‖·‖, defined on this set, satisfying for all n × n matrices A and B and all real numbers α:
(1) ‖A‖ ≥ 0,
(2) ‖A‖ = 0 if and only if A is O, the matrix with all 0 entries,
(3) ‖αA‖ = |α| ‖A‖,
(4) ‖A + B‖ ≤ ‖A‖ + ‖B‖,
(5) ‖AB‖ ≤ ‖A‖ ‖B‖.
If ‖·‖ is a vector norm on Rn, then
‖A‖ = max_{‖x‖=1} ‖Ax‖
is a matrix norm.
The matrix norms we will consider have the forms
‖A‖∞ = max_{‖x‖∞=1} ‖Ax‖∞ and ‖A‖2 = max_{‖x‖2=1} ‖Ax‖2.

Theorem 1.5. If A = (aij) is an n × n matrix, then
‖A‖∞ = max_{1≤i≤n} Σ_{j=1}^{n} |aij|.

Example 2. Determine ‖A‖∞ for the matrix
A = [ 1 2 −1 ; 0 3 −1 ; 5 −1 1 ].    (1.1)
Sol. We have
Σ_{j=1}^{3} |a1j| = |1| + |2| + |−1| = 4,  Σ_{j=1}^{3} |a2j| = |0| + |3| + |−1| = 4,
and
Σ_{j=1}^{3} |a3j| = |5| + |−1| + |1| = 7.
So the above theorem implies that ‖A‖∞ = max{4, 4, 7} = 7.
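The row-sum formula of Theorem 1.5 translates directly into code; here is a minimal sketch (mine, not from the text) for the matrix of Example 2:

```python
import numpy as np

A = np.array([[1, 2, -1],
              [0, 3, -1],
              [5, -1, 1]])

row_sums = np.abs(A).sum(axis=1)   # [4, 4, 7]
print(row_sums.max())              # 7, agreeing with Example 2
```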


Definition 1.6 (Eigenvalue and eigenvector). Let A be a square matrix. A number λ is called an
eigenvalue of A if there exists a nonzero vector x such that Ax = λx. Here x is called a corresponding
eigenvector.
Definition 1.7 (Characteristic polynomial). The characteristic polynomial is defined as
P (λ) = det(A − λI).
λ is an eigenvalue of matrix A if and only if λ is a root of the characteristic polynomial, i.e., P (λ) = 0.
Definition 1.8 (Spectral Radius). The spectral radius ρ(A) of a matrix A is defined by
ρ(A) = max |λ|, where λ is an eigenvalue of A.
Theorem 1.9. If A is an n × n matrix, then
‖A‖2 = (ρ(A^T A))^{1/2}.
The proof of this theorem requires more information concerning eigenvalues. We illustrate the
procedure by taking an example.

Example 3. Determine the l2 norm of
A = [ 1 1 0 ; 1 2 1 ; −1 1 2 ].
Sol. We calculate ρ(A^T A); therefore we need to find the eigenvalues of A^T A.
A^T A = [ 1 1 −1 ; 1 2 1 ; 0 1 2 ] [ 1 1 0 ; 1 2 1 ; −1 1 2 ] = [ 3 2 −1 ; 2 6 4 ; −1 4 5 ].
The eigenvalues satisfy the characteristic equation
0 = det(A^T A − λI) = det [ 3−λ 2 −1 ; 2 6−λ 4 ; −1 4 5−λ ]
  = −λ³ + 14λ² − 42λ = −λ(λ² − 14λ + 42)
=⇒ λ = 0, 7 ± √7.
Therefore ρ(A^T A) = 7 + √7. Thus
‖A‖2 = (ρ(A^T A))^{1/2} = (7 + √7)^{1/2} ≈ 3.106.
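Theorem 1.9 is easy to check numerically; the sketch below (the NumPy calls are my choice, not part of the text) computes ρ(A^T A) for the matrix of Example 3:

```python
import numpy as np

A = np.array([[1.0, 1.0, 0.0],
              [1.0, 2.0, 1.0],
              [-1.0, 1.0, 2.0]])

rho = max(abs(np.linalg.eigvals(A.T @ A)))  # spectral radius of A^T A
print(np.sqrt(rho))                         # ≈ 3.106
print(np.linalg.norm(A, 2))                 # same value from the library routine
```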

1.3. Convergent Matrices. In studying iterative matrix techniques, it is of particular importance


to know when powers of a matrix become small (that is, when all the entries approach zero). Matrices
of this type are called convergent.
Definition 1.10. An n × n matrix A is convergent if
lim_{k→∞} A^k = 0.

Example 4. Show that the matrix
A = [ 1/2 0 ; 1/4 1/2 ]
is a convergent matrix.
Sol. Computing powers of A, we obtain:
A² = [ 1/4 0 ; 1/4 1/4 ],  A³ = [ 1/8 0 ; 3/16 1/8 ],
and, in general,
A^k = [ (1/2)^k 0 ; k/2^(k+1) (1/2)^k ].
So A is a convergent matrix because
lim_{k→∞} (1/2)^k = 0 and lim_{k→∞} k/2^(k+1) = 0,
∴ lim_{k→∞} A^k = 0,
which implies matrix A is convergent.
Note that the convergent matrix A in this example has spectral radius ρ(A) = 1/2, because 1/2 is the
only eigenvalue of A. This illustrates an important connection that exists between the spectral radius
of a matrix and the convergence of the matrix, as detailed in the following result.

Theorem 1.11. The following statements are equivalent.
(i) A is a convergent matrix.
(ii) lim_{k→∞} ‖A^k‖ = 0, for all natural norms.
(iii) ρ(A) < 1.
(iv) lim_{k→∞} A^k x = 0, for every x.

The proof of this theorem can be found in advanced texts of numerical analysis.
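For the 2 × 2 matrix of Example 4, the equivalence of (i) and (iii) is easy to observe numerically; the following is a small sketch of my own:

```python
import numpy as np

A = np.array([[0.5, 0.0],
              [0.25, 0.5]])

print(max(abs(np.linalg.eigvals(A))))    # spectral radius: 0.5 < 1
A50 = np.linalg.matrix_power(A, 50)
print(np.linalg.norm(A50, np.inf))       # practically 0: the powers die out
```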

2. Iterative Methods
The linear system Ax = b may have a large order. For such systems Gauss elimination is often too
expensive in either computation time or computer memory requirements or both.
In an iterative method, a sequence of progressively better iterates is produced to approximate the solution.
Jacobi and Gauss-Seidel Method: We start with an example. Let us consider a system of equations
9x1 + x2 + x3 = 10
2x1 + 10x2 + 3x3 = 19
3x1 + 4x2 + 11x3 = 0.
One class of iterative methods for solving this system is as follows.
We write
x1 = (1/9)(10 − x2 − x3)
x2 = (1/10)(19 − 2x1 − 3x3)
x3 = (1/11)(0 − 3x1 − 4x2).
Let x(0) = [x1^(0) x2^(0) x3^(0)]^t be an initial approximation of the solution x. Then define the iteration
x1^(k+1) = (1/9)(10 − x2^(k) − x3^(k))
x2^(k+1) = (1/10)(19 − 2x1^(k) − 3x3^(k))
x3^(k+1) = (1/11)(0 − 3x1^(k) − 4x2^(k)), k = 0, 1, 2, . . . .
This is called Jacobi or method of simultaneous replacements. The method is named after German
mathematician Carl Gustav Jacob Jacobi.
We start with x(0) = [0 0 0]^t and obtain
x1^(1) = 1.1111, x2^(1) = 1.9000, x3^(1) = 0.0,
x1^(2) = 0.9000, x2^(2) = 1.6778, x3^(2) = −0.9939,
etc.
Another approach to solve the same system is the following:
x1^(k+1) = (1/9)(10 − x2^(k) − x3^(k))
x2^(k+1) = (1/10)(19 − 2x1^(k+1) − 3x3^(k))
x3^(k+1) = (1/11)(0 − 3x1^(k+1) − 4x2^(k+1)), k = 0, 1, 2, . . . .
This method is called Gauss-Seidel or method of successive replacements. It is named after the German
mathematicians Carl Friedrich Gauss and Philipp Ludwig von Seidel. Starting with [0 0 0], we obtain
x1^(1) = 1.1111, x2^(1) = 1.6778, x3^(1) = −0.9131,
x1^(2) = 1.0262, x2^(2) = 1.9687, x3^(2) = −0.9588.
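The difference between the two schemes (Jacobi uses only old values, Gauss-Seidel uses each new component immediately) is clearest in code. Here is a minimal Python sketch for the example system; the function names are my own:

```python
import numpy as np

def jacobi_step(x):
    x1, x2, x3 = x
    return np.array([(10 - x2 - x3) / 9,
                     (19 - 2*x1 - 3*x3) / 10,
                     (0 - 3*x1 - 4*x2) / 11])

def gauss_seidel_step(x):
    x1, x2, x3 = x
    x1 = (10 - x2 - x3) / 9          # uses old x2, x3
    x2 = (19 - 2*x1 - 3*x3) / 10     # uses the new x1 immediately
    x3 = (0 - 3*x1 - 4*x2) / 11      # uses the new x1, x2
    return np.array([x1, x2, x3])

x = np.zeros(3)
for _ in range(10):
    x = gauss_seidel_step(x)
print(x)   # approaches the exact solution (1, 2, -1)
```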

General Approach: We consider a system Ax = b of order n and, for i = 1, 2, · · · , n, we write the
i-th equation
ai1 x1 + ai2 x2 + · · · + aii xi + · · · + ain xn = bi,
that is,
Σ_{j=1, j≠i}^{n} aij xj + aii xi = bi.

The Jacobi iterative method is obtained by solving the i-th equation for xi to obtain (provided aii ≠ 0)
xi = (1/aii)[Σ_{j=1, j≠i}^{n} (−aij xj) + bi].
For each k ≥ 0, generate the components xi^(k+1) from x^(k) as
xi^(k+1) = (1/aii)[Σ_{j=1, j≠i}^{n} (−aij xj^(k)) + bi].

Now we write the above iterative scheme in matrix form. To do so, note that
aii xi^(k+1) = Σ_{j=1, j≠i}^{n} (−aij xj^(k)) + bi,
that is,
D x(k+1) = −(L + U) x(k) + b,
where D, L and U are the diagonal, strictly lower triangular and strictly upper triangular parts of A,
respectively, so that A = L + D + U. Here b = [b1 b2 · · · bn]^t.
If D−1 exists, then the Jacobi iterative scheme is
x(k+1) = −D−1 (L + U )x(k) + D−1 b.
We write Tj = −D−1 (L + U ) and B = D−1 b to obtain
x(k+1) = Tj x(k) + B.
Matrix Tj is called the iteration matrix.
For Gauss-Seidel, we write the i-th equation as
ai1 x1 + ai2 x2 + · · · + ai,i−1 xi−1 + aii xi + ai,i+1 xi+1 + · · · + ain xn = bi,
Σ_{j=1}^{i−1} aij xj + aii xi + Σ_{j=i+1}^{n} aij xj = bi.
∴ xi = (1/aii)[−Σ_{j=1}^{i−1} aij xj − Σ_{j=i+1}^{n} aij xj + bi].
The iterative scheme is given, for each k ≥ 0, by
xi^(k+1) = (1/aii)[−Σ_{j=1}^{i−1} aij xj^(k+1) − Σ_{j=i+1}^{n} aij xj^(k) + bi].

In matrix form
(D + L)x(k+1) = −U x(k) + b,
where D, L and U are the diagonal, strictly lower triangular and strictly upper triangular parts, respectively. Hence
x(k+1) = −(D + L)^{−1} U x(k) + (D + L)^{−1} b,

x(k+1) = Tg x(k) + B, k = 0, 1, 2, · · · .
Here Tg = −(D + L)^{−1} U is called the iteration matrix, and B = (D + L)^{−1} b.

Stopping Criterion: Since these techniques are iterative, we require a stopping criterion. Let ε be the
desired accuracy; then we can use
‖x(k) − x(k−1)‖∞ / ‖x(k)‖∞ < ε.
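In matrix form, the Gauss-Seidel iteration together with the stopping criterion above can be sketched as follows (my own sketch; np.linalg.solve is used on the lower-triangular system purely for brevity, where forward substitution would be the idiomatic choice):

```python
import numpy as np

def gauss_seidel(A, b, x0, eps=1e-8, max_iter=100):
    """(D + L) x^(k+1) = b - U x^(k), stopped by the relative criterion."""
    DL = np.tril(A)            # D + L
    U = np.triu(A, k=1)        # strictly upper-triangular part
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        x_new = np.linalg.solve(DL, b - U @ x)
        if np.max(np.abs(x_new - x)) / np.max(np.abs(x_new)) < eps:
            return x_new
        x = x_new
    return x
```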
Example 5. Use the Gauss-Seidel method to approximate the solution of the following system:
4x1 + x2 − x3 = 3
2x1 + 7x2 + x3 = 19
x1 − 3x2 + 12x3 = 31.
Continue the iterations until two successive approximations are identical when rounded to three signif-
icant digits.
Sol. To begin, write the system in the form
x1 = (1/4)(3 − x2 + x3)
x2 = (1/7)(19 − 2x1 − x3)
x3 = (1/12)(31 − x1 + 3x2).
As
|a11 | = 4 > |a12 | + |a13 | = 1
|a22 | = 7 > |a21 | + |a23 | = 3
|a33 | = 12 > |a31 | + |a32 | = 2
which shows that coefficient matrix is strictly diagonally dominant. Therefore Gauss-Seidel iterations
will converge.
Starting with the initial vector x(0) = [0, 0, 0]^t, the first approximation is
x1^(1) = 0.7500
x2^(1) = 2.5000
x3^(1) = 3.1458.
Similarly
x(2) = [0.9115, 2.0045, 3.0085]t

x(3) = [1.0010, 1.9985, 2.9995]t

x(4) = [1.000, 2.000, 3.000]t .

2.1. Convergence analysis of iterative methods. To study the convergence of general iteration
techniques, we need to analyze the formula
x(k+1) = T x(k) + B, for each k = 0, 1, · · · ,
where x(0) is arbitrary. The next lemma and Theorem provide the key for this study.
Lemma 2.1. If the spectral radius ρ(T) < 1, then (I − T)^{−1} exists, and
(I − T)^{−1} = I + T + T² + · · · = Σ_{k=0}^{∞} T^k.

Proof. If T v = λv, then T^k v = λ^k v, so the eigenvalues of T^k are the k-th powers of those of T.
Since ρ(T) < 1, it follows that
lim_{k→∞} T^k = 0.
Since this limit is zero, the matrix series
I + T + T² + · · · + T^k + · · ·
is convergent. Now, multiplying the matrix (I − T) by this series, we obtain
(I − T)(I + T + T² + · · · + T^k + · · ·) = I.
Thus
(I − T)^{−1} = Σ_{k=0}^{∞} T^k.

Theorem 2.2 (Necessary and sufficient condition). A necessary and sufficient condition for the convergence of an iterative method is that the eigenvalues of the iteration matrix T satisfy ρ(T) < 1.
Proof. First let ρ(T) < 1. The sequence of vectors x(k) produced by the iterative method is given by
x(1) = T x(0) + B,
x(2) = T x(1) + B = T(T x(0) + B) + B = T² x(0) + (T + I)B,
⋮
x(k) = T^k x(0) + (T^{k−1} + T^{k−2} + · · · + T + I)B.
Since ρ(T) < 1, this implies
lim_{k→∞} T^k x(0) = 0.
Therefore, by Lemma 2.1,
lim_{k→∞} x(k) = (I − T)^{−1} B.
Therefore x(k) converges to the unique solution of x = T x + B.


Conversely, assume that the sequence x(k) converges to x for every choice of x(0). Now
x − x(k) = T x + B − T x(k−1) − B = T(x − x(k−1)) = T²(x − x(k−2)) = · · · = T^k (x − x(0)).
Let z = x − x(0); then
lim_{k→∞} T^k z = lim_{k→∞} (x − x(k)) = x − lim_{k→∞} x(k) = x − x = 0.
Since z is arbitrary, Theorem 1.11 gives ρ(T) < 1.

Theorem 2.3. If A is strictly diagonally dominant in Ax = b, then the Jacobi and Gauss-Seidel
iterations always converge, for any initial starting vector.

Proof. We assume that A is strictly diagonally dominant; hence aii ≠ 0 and
|aii| > Σ_{j=1, j≠i}^{n} |aij|, i = 1, 2, · · · , n.

First consider the Jacobi method, whose iterations are given by
x(k+1) = −D^{−1}(L + U) x(k) + D^{−1} b,
x(k+1) = Tj x(k) + B.
The method is convergent iff ρ(Tj) < 1. The entries of Tj = −D^{−1}(L + U) are −aij/aii for j ≠ i
and 0 on the diagonal, so
‖Tj‖∞ = max_{1≤i≤n} Σ_{j=1, j≠i}^{n} |aij|/|aii| < 1
by strict diagonal dominance. Since ρ(Tj) ≤ ‖Tj‖∞ < 1, this shows the convergence of the Jacobi method.

Further, we prove the convergence of the Gauss-Seidel method, whose iterations are given by
x(k+1) = −(D + L)^{−1} U x(k) + (D + L)^{−1} b,
x(k+1) = Tg x(k) + B.
Let λ be an eigenvalue of the iteration matrix Tg and let x be a corresponding eigenvector. Since x is an eigenvector, x ≠ 0, so we may normalize so that ‖x‖∞ = 1. Then
Tg x = λx
=⇒ −(D + L)^{−1} U x = λx
=⇒ −U x = λ(D + L)x,
that is, for i = 1, 2, . . . , n,
−Σ_{j=i+1}^{n} aij xj = λ Σ_{j=1}^{i} aij xj = λ aii xi + λ Σ_{j=1}^{i−1} aij xj.
Hence
λ aii xi = −λ Σ_{j=1}^{i−1} aij xj − Σ_{j=i+1}^{n} aij xj,
|λ| |aii| |xi| ≤ |λ| Σ_{j=1}^{i−1} |aij| |xj| + Σ_{j=i+1}^{n} |aij| |xj|.
Choose i so that |xi| = ‖x‖∞ = 1; then |xj| ≤ 1 for all j, and
|λ| (|aii| − Σ_{j=1}^{i−1} |aij|) ≤ Σ_{j=i+1}^{n} |aij|
=⇒ |λ| ≤ Σ_{j=i+1}^{n} |aij| / (|aii| − Σ_{j=1}^{i−1} |aij|) < 1,
where the last inequality follows from strict diagonal dominance, since |aii| − Σ_{j=1}^{i−1} |aij| > Σ_{j=i+1}^{n} |aij|.
This implies that the spectral radius satisfies ρ(Tg) < 1, so Gauss-Seidel is convergent.

Example 6. The linear system
2x1 − x2 + x3 = −1,
2x1 + 2x2 + 2x3 = 4,
−x1 − x2 + 2x3 = −5
has the solution (1, 2, −1)^T.
(a) Show that ρ(Tj) = √5/2 > 1. (b) Show that ρ(Tg) = 1/2.
Sol. We write A = L + D + U.
(a)
Tj = −D^{−1}(L + U) = [ 0.5 0 0 ; 0 0.5 0 ; 0 0 0.5 ] [ 0 1 −1 ; −2 0 −2 ; 1 1 0 ]
   = [ 0 0.5 −0.5 ; −1 0 −1 ; 0.5 0.5 0 ].
The spectral radius ρ(Tj) of the matrix Tj is defined by
ρ(Tj) = max |λ|, where λ is an eigenvalue of Tj.
The eigenvalues of Tj are
λ = 0, ±(√5/2) i.
Thus ρ(Tj) = √5/2 > 1.
(b)
Tg = −(D + L)^{−1} U = −[ 0.5 0 0 ; −0.5 0.5 0 ; 0 0.25 0.5 ] [ 0 −1 1 ; 0 0 2 ; 0 0 0 ]
   = [ 0 0.5 −0.5 ; 0 −0.5 −0.5 ; 0 0 −0.5 ].
Its eigenvalues are 0, −1/2, −1/2. Thus ρ(Tg) = 1/2 < 1.
The spectral radius of the Jacobi iteration matrix is greater than one, while that of the Gauss-Seidel
iteration matrix is less than one. Therefore the Gauss-Seidel iterations converge.

3. The SOR method


We observed that the convergence of an iterative technique depends on the spectral radius of the
matrix associated with the method. One way to select a procedure to accelerate convergence is to
choose a method whose associated matrix has minimal spectral radius. These techniques are known
as Successive Over-Relaxation (SOR). The SOR method is devised by applying extrapolation to the
Gauss-Seidel method. This extrapolation takes the form of a weighted average between the previous
iterate and the computed Gauss-Seidel iterate, successively for each component. We introduce a
weight ω and, to calculate xi^(k+1), we modify the Gauss-Seidel procedure to
xi^(k+1) = xi^(k) + ω(x̂i^(k+1) − xi^(k)),
xi^(k+1) = (1 − ω) xi^(k) + ω x̂i^(k+1).

The last term x̂i^(k+1) is calculated by Gauss-Seidel, and we write
xi^(k+1) = (1 − ω) xi^(k) + (ω/aii)[bi − Σ_{j=1}^{i−1} aij xj^(k+1) − Σ_{j=i+1}^{n} aij xj^(k)].

The choice of relaxation factor ω is not necessarily easy, and depends upon the properties of the
coefficient matrix. If A is a symmetric and positive definite matrix and 0 < ω < 2, then the SOR
method converges for any choice of initial approximate vector x(0) .
Important Note: If a matrix A is symmetric, it is positive definite if and only if all its leading
principal submatrices (minors) have a positive determinant.
Example 7. Consider a linear system Ax = b, where
   
A = [ 3 −1 1 ; −1 3 −1 ; 1 −1 3 ],  b = [ −1 ; 7 ; −7 ].
a. Check that the SOR method with the relaxation parameter value ω = 1.25 can be used to solve
this system.
b. Compute the first iteration by the SOR method starting at the point x(0) = (0, 0, 0)t .
Sol. a. Let us verify the sufficient condition for using the SOR method: we have to check whether
the matrix A is symmetric positive definite. A is symmetric since A = A^T, so we check positive
definiteness via the leading principal minors:
det(3) = 3 > 0,  det [ 3 −1 ; −1 3 ] = 8 > 0,  det(A) = 20 > 0.
All leading principal minors are positive, so the matrix A is positive definite. We know that for
symmetric positive definite matrices the SOR method converges for any value of the relaxation parameter
ω in the interval 0 < ω < 2.
Therefore the SOR method with ω = 1.25 can be used to solve this system.
b. The iterations of the SOR method are easier to compute componentwise than in vector form.
Write the system as equations and write down the Gauss-Seidel iterations:
x1^(k+1) = (−1 + x2^(k) − x3^(k))/3
x2^(k+1) = (7 + x1^(k+1) + x3^(k))/3
x3^(k+1) = (−7 − x1^(k+1) + x2^(k+1))/3.
Now multiply the right-hand side by the parameter ω and add to it the vector x^(k) from the previous
iteration multiplied by the factor (1 − ω):
x1^(k+1) = (1 − ω)x1^(k) + ω(−1 + x2^(k) − x3^(k))/3
x2^(k+1) = (1 − ω)x2^(k) + ω(7 + x1^(k+1) + x3^(k))/3
x3^(k+1) = (1 − ω)x3^(k) + ω(−7 − x1^(k+1) + x2^(k+1))/3.
For k = 0:
x1^(1) = (1 − 1.25) · 0 + 1.25 · (−1 + 0 − 0)/3 = −0.41667
x2^(1) = (1 − 1.25) · 0 + 1.25 · (7 − 0.41667 + 0)/3 = 2.7431
x3^(1) = (1 − 1.25) · 0 + 1.25 · (−7 + 0.41667 + 2.7431)/3 = −1.6001.
The next three iterations are
x(2) = (1.4972, 2.1880, −2.2288)t ,
x(3) = (1.0494, 1.8782, −2.0141)t ,
x(4) = (0.9428, 2.0007, −1.9723)t .
The exact solution is x = (1, 2, −2)t .
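A short Python sketch of the componentwise SOR sweep used above (the function name is mine); with ω = 1.25 it reproduces the iterates of Example 7:

```python
import numpy as np

def sor(A, b, x0, omega, n_iter):
    """SOR: a weighted Gauss-Seidel update, swept component by component."""
    n = len(b)
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        for i in range(n):
            gs = (b[i] - A[i, :i] @ x[:i] - A[i, i+1:] @ x[i+1:]) / A[i, i]
            x[i] = (1 - omega) * x[i] + omega * gs
    return x

A = np.array([[3., -1., 1.], [-1., 3., -1.], [1., -1., 3.]])
b = np.array([-1., 7., -7.])
print(sor(A, b, np.zeros(3), 1.25, 4))   # ≈ (0.9428, 2.0007, -1.9723)
```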

4. Error Bounds and Iterative Refinement


Definition 4.1. Suppose x̃ ∈ Rn is an approximation to the solution of the linear system defined by
Ax = b. The residual vector for x̃ with respect to this system is r = b − Ax̃.
It seems intuitively reasonable that if x̃ is an approximation to the solution x of Ax = b and the
residual vector r = b − Ax̃ has the property that ‖r‖ is small, then ‖x − x̃‖ would be small as well. This
is often the case, but certain systems, which occur frequently in practice, fail to have this property.
Example 8. The linear system Ax = b given by
[ 1 2 ; 1.0001 2 ] [ x1 ; x2 ] = [ 3 ; 3.0001 ]
has the unique solution x = (1, 1)^t. Determine the residual vector for the poor approximation
x̃ = (3, −0.0001)^t.
Sol. We have
r = b − Ax̃ = [ 3 ; 3.0001 ] − [ 1 2 ; 1.0001 2 ] [ 3 ; −0.0001 ] = [ 0.0002 ; 0 ],
so ‖r‖∞ = 0.0002. Although the norm of the residual vector is small, the approximation
x̃ = (3, −0.0001)^t is obviously quite poor; in fact, ‖x − x̃‖∞ = 2.
Theorem 4.2. Suppose that x̃ is an approximation to the solution of Ax = b, A is a nonsingular
matrix, and r is the residual vector for x̃. Then for any natural norm,
‖x − x̃‖ ≤ ‖r‖ · ‖A^{−1}‖,
and if x ≠ 0 and b ≠ 0,
‖x − x̃‖/‖x‖ ≤ ‖A‖ · ‖A^{−1}‖ · ‖r‖/‖b‖.
Proof. Since r = b − Ax̃ = Ax − Ax̃ and A is nonsingular, we have x − x̃ = A^{−1} r, so
‖x − x̃‖ = ‖A^{−1} r‖ ≤ ‖A^{−1}‖ · ‖r‖.
Moreover, since b = Ax, we have ‖b‖ ≤ ‖A‖ · ‖x‖. So 1/‖x‖ ≤ ‖A‖/‖b‖ and
‖x − x̃‖/‖x‖ ≤ (‖A‖ · ‖A^{−1}‖/‖b‖) ‖r‖.

Condition Numbers: The inequalities in the above theorem imply that ‖A^{−1}‖ and ‖A‖ · ‖A^{−1}‖ provide an indication of the connection between the residual vector and the accuracy of the approximation.
In general, the relative error ‖x − x̃‖/‖x‖ is of most interest, and this error is bounded by the product
of ‖A‖ · ‖A^{−1}‖ with the relative residual for this approximation, ‖r‖/‖b‖. Any convenient norm can be
used; the only requirement is that it be used consistently throughout.
Definition 4.3. The condition number of the nonsingular matrix A relative to a norm ‖·‖ is
K(A) = ‖A‖ · ‖A^{−1}‖.
With this notation, the inequalities in the above theorem become
‖x − x̃‖ ≤ K(A) ‖r‖/‖A‖
and
‖x − x̃‖/‖x‖ ≤ K(A) ‖r‖/‖b‖.
For any nonsingular matrix A and natural norm ‖·‖,
1 = ‖I‖ = ‖A · A^{−1}‖ ≤ ‖A‖ · ‖A^{−1}‖ = K(A).
A matrix A is well-conditioned if K(A) is close to 1, and is ill-conditioned when K(A) is significantly
greater than 1. Conditioning in this context refers to the relative security that a small residual vector
implies a correspondingly accurate approximate solution. When K(A) is very large, the solution of
Ax = b will be very sensitive to relatively small changes in b; equivalently, a relatively small residual
may well correspond to a relatively large error in x̃ as compared with x. These comments remain valid
when the changes are made to A rather than to b.
Example 9. Suppose x̄ = (0.98, 1.1)^t is an approximate solution for the linear system Ax = b, where
A = [ 3.9 1.6 ; 6.8 2.9 ],  b = [ 5.5 ; 9.7 ].
Find a bound for the relative error ‖x − x̄‖/‖x‖.
Sol. The residual is given by
r = b − Ax̄ = [ 5.5 ; 9.7 ] − [ 3.9 1.6 ; 6.8 2.9 ] [ 0.98 ; 1.1 ] = [ −0.0820 ; −0.1540 ].
The bound for the relative error is (for the infinity norm)
‖x − x̄‖/‖x‖ ≤ ‖A‖ ‖A^{−1}‖ ‖r‖/‖b‖.
Also det(A) = 0.43, so
A^{−1} = (1/0.43) [ 2.9 −1.6 ; −6.8 3.9 ] = [ 6.7442 −3.7209 ; −15.8140 9.0698 ],
‖A‖ = 9.7, ‖A^{−1}‖ = 24.8837, ‖r‖ = 0.1540, ‖b‖ = 9.7.
∴ ‖x − x̄‖/‖x‖ ≤ ‖A‖ ‖A^{−1}‖ ‖r‖/‖b‖ = 3.8321.
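The computation in Example 9 can be verified in a few lines; this sketch of mine simply evaluates the norms used above:

```python
import numpy as np

A = np.array([[3.9, 1.6], [6.8, 2.9]])
b = np.array([5.5, 9.7])
x_bar = np.array([0.98, 1.1])

r = b - A @ x_bar
K = np.linalg.norm(A, np.inf) * np.linalg.norm(np.linalg.inv(A), np.inf)
bound = K * np.linalg.norm(r, np.inf) / np.linalg.norm(b, np.inf)
print(K, bound)   # K(A) ≈ 241.37, bound ≈ 3.8321
```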
Example 10. Determine the condition number for the matrix
A = [ 1 2 ; 1.0001 2 ].
Sol. We saw in the previous example that the very poor approximation (3, −0.0001)^t to the exact solution
(1, 1)^t had a residual vector with small norm, so we should expect the condition number of A to be
large. We have ‖A‖∞ = max{|1| + |2|, |1.0001| + |2|} = 3.0001, which would not be considered large.
However,
A^{−1} = [ −10000 10000 ; 5000.5 −5000 ], so ‖A^{−1}‖∞ = 20000,
and for the infinity norm, K(A) = (20000)(3.0001) = 60002. The size of the condition number for this
example should certainly keep us from making hasty accuracy decisions based on the residual of an
approximation.
Example 11. Find the condition number K(A) of the matrix
A = [ 1 c ; c 1 ], |c| ≠ 1.
When does A become ill-conditioned? What does this say about the linear system Ax = b? How is
K(A) related to det(A)?
Sol. The matrix A is well-conditioned if K(A) is near 1. With respect to the norm ‖·‖∞,
K(A) = ‖A‖∞ ‖A^{−1}‖∞.
Here det(A) = 1 − c² and adj(A) = [ 1 −c ; −c 1 ]. Thus
A^{−1} = (1/(1 − c²)) [ 1 −c ; −c 1 ].
Thus ‖A‖∞ = 1 + |c| and ‖A^{−1}‖∞ = 1/|1 − c²| + |c|/|1 − c²| = (1 + |c|)/|1 − c²|.
Hence the condition number is
K(A) = (1 + |c|)²/|1 − c²|.
Thus A is ill-conditioned when |c| is near 1.
When the condition number is large, the solution of the system Ax = b is sensitive to small changes in A.
Here |det(A)| = |1 − c²|, so if the determinant of A is small, the condition number of A is very large.
4.1. The Residual Correction Method. A further use of this error-estimation procedure is to
define an iterative method for improving the computed value x. Let x(0) be the initial computed value
for x, generally obtained by using Gaussian elimination. Define
r(0) = b − Ax(0) = A(x − x(0)).
Then
Ae(0) = r(0), where e(0) = x − x(0).
Solving this system by Gaussian elimination, we obtain an approximate value of e(0). Using it, we define an
improved approximation
x(1) = x(0) + e(0).
Now we repeat the entire process, calculating
r(1) = b − Ax(1),
x(2) = x(1) + e(1),
where e(1) is the approximate solution of
Ae(1) = r(1), e(1) = x − x(1).
Continue this process until there is no further decrease in the size of the error vector.
For example, use a computer with four-digit floating-point decimal arithmetic with rounding, and
use Gaussian elimination with pivoting. The system to be solved is
x1 + 0.5x2 + 0.3333x3 = 1
0.5x1 + 0.3333x2 + 0.25x3 = 0
0.3333x1 + 0.25x2 + 0.2x3 = 0
Then
x(0) = [8.968, −35.77, 29.77]t
r(0) = [−0.005341, −0.004359, −0.0005344]t
e(0) = [0.09216, −0.5442, 0.5239]t
x(1) = [9.060, −36.31, 30.29]t
r(1) = [−0.0006570, −0.0003770, −0.0001980]t
e(1) = [0.001707, −0.01300, 0.01241]t
x(2) = [9.062, −36.32, 30.30]t .
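In exact arithmetic one correction step would already give the exact solution; the value of the method shows up in finite-precision computation, ideally with the residual computed in higher precision. A minimal sketch of the loop (the function name is mine, and np.linalg.solve stands in for Gaussian elimination):

```python
import numpy as np

def residual_correction(A, b, n_sweeps=2):
    x = np.linalg.solve(A, b)      # initial computed solution
    for _ in range(n_sweeps):
        r = b - A @ x              # residual r^(k) = b - A x^(k)
        e = np.linalg.solve(A, r)  # approximate error from A e = r
        x = x + e                  # improved approximation
    return x
```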

5. Power method for approximating eigenvalues


The eigenvalues of an n × n matrix A are obtained by solving its characteristic equation
det(A − λI) = 0,
λ^n + c_{n−1} λ^{n−1} + c_{n−2} λ^{n−2} + · · · + c0 = 0.
For large values of n, the polynomial equations like this one are difficult, time-consuming to solve
and sensitive to rounding errors. In this section we look at an alternative method known as Power
Method for approximating eigenvalues. The method is an iterative method used to determine the

dominant eigenvalue, that is, the eigenvalue with largest magnitude. By modifying the method it can
be used to determine other eigenvalues. One useful feature of the power method is that it produces not
only the eigenvalue but also an associated eigenvector.
To apply the power method, we assume that n × n matrix A has n eigenvalues λ1 , λ2 , · · · , λn (which
we don’t know) with associated eigenvectors v (1) , v (2) , · · · , v (n) . We say matrix A is diagonalizable.
We write
Av (i) = λi v (i) , i = 1, 2, · · · , n.
We assume that these eigenvalues are ordered so that λ1 is the dominant eigenvalue (with correspond-
ing eigenvector v (1) ).
From linear algebra, if A is diagonalizable, then it has n linearly independent eigenvectors v (1) , v (2) , · · · , v (n) .
An n×n matrix need not have n linearly independent eigenvectors. When it does not the Power method
may still be successful, but it is not guaranteed to be.
As the n eigenvectors v (1) , v (2) , · · · , v (n) are linearly independent, they must form a basis for Rn .
We select an arbitrary nonzero starting vector x(0) and express it as a linear combination of basis
vectors as
x(0) = c1 v(1) + c2 v(2) + · · · + cn v(n).
We assume that c1 ≠ 0. (If c1 = 0, the power method may not converge, and a different x(0) must be
used as an initial approximation.)
Then we repeatedly carry out matrix-vector multiplication, using the matrix A to produce a sequence
of vectors. Specifically, we have
x(1) = Ax(0),
x(2) = Ax(1) = A² x(0),
⋮
x(k) = Ax(k−1) = A^k x(0).
In general, x(k) = A^k x(0), k = 1, 2, 3, · · ·.
Substituting the value of x(0), we obtain
x(k) = A^k x(0)
     = c1 A^k v(1) + c2 A^k v(2) + · · · + cn A^k v(n)
     = c1 λ1^k v(1) + c2 λ2^k v(2) + · · · + cn λn^k v(n)
     = λ1^k [c1 v(1) + c2 (λ2/λ1)^k v(2) + · · · + cn (λn/λ1)^k v(n)].
Now, from our original assumption that λ1 is larger in absolute value than the other eigenvalues, it
follows that each of the fractions
|λ2/λ1|, |λ3/λ1|, · · · , |λn/λ1| < 1.
Therefore each of the factors (λ2/λ1)^k, (λ3/λ1)^k, · · · , (λn/λ1)^k
must approach 0 as k approaches infinity. This implies the approximation
A^k x(0) ≈ λ1^k c1 v(1), c1 ≠ 0.
Since v(1) is a dominant eigenvector, it follows that any scalar multiple of v(1) is also a dominant
eigenvector. Thus we have shown that A^k x(0) approaches a multiple of the dominant eigenvector of A.
The entries of A^k x(0) may grow with k; therefore we scale the powers of A^k x(0) in an appropriate
manner to ensure that the limit is finite and nonzero. The scaling begins by choosing the initial guess
x(0) to be a unit vector relative to the maximum norm, that is ‖x(0)‖∞ = 1. Then we compute
y(1) = Ax(0), and the next approximation is taken as
x(1) = y(1)/‖y(1)‖∞.

We repeat the procedure and stop using the following stopping criterion:
‖x(k) − x(k−1)‖∞ / ‖x(k)‖∞ < ε,
where ε is the desired accuracy.
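The scaled power method translates directly into the following sketch (the function name and the tolerance handling are my own):

```python
import numpy as np

def power_method(A, x0, eps=1e-6, max_iter=100):
    """Power method with infinity-norm scaling; returns (eigenvalue, eigenvector)."""
    x = x0 / np.max(np.abs(x0))
    for _ in range(max_iter):
        y = A @ x
        lam = y[np.argmax(np.abs(y))]      # scale factor -> dominant eigenvalue
        x_new = y / lam
        if np.max(np.abs(x_new - x)) / np.max(np.abs(x_new)) < eps:
            return lam, x_new
        x = x_new
    return lam, x
```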
Example 12. Calculate four iterations of the power method with scaling to approximate a dominant
eigenvector of the matrix
A = [ 1 2 0 ; −2 1 2 ; 1 3 1 ].
Sol. Using x(0) = [1, 1, 1]^T as the initial approximation, we obtain
y(1) = Ax(0) = [3, 1, 5]^T,
and by scaling we obtain the approximation
x(1) = (1/5)[3, 1, 5]^T = [0.60, 0.20, 1.00]^T.
Similarly we get
y(2) = Ax(1) = [1.00, 1.00, 2.20]^T = 2.20 [0.45, 0.45, 1.00]^T = 2.20 x(2),
y(3) = Ax(2) = [1.35, 1.55, 2.80]^T = 2.80 [0.48, 0.55, 1.00]^T = 2.80 x(3),
y(4) = Ax(3) = 3.1 [0.51, 0.51, 1.00]^T,
etc.
After four iterations, we observe that the dominant eigenvector is approximately
x = [0.51, 0.51, 1.00]^T.
The scaling factors are approaching the dominant eigenvalue λ ≈ 3.1.
Remark 5.1. The power method is useful for computing an eigenvalue, but it gives only the dominant
one. To find other eigenvalues we use properties of the matrix, such as: the sum of all eigenvalues equals
the trace of the matrix, and if λ is an eigenvalue of A then λ^{−1} is an eigenvalue of A^{−1}. Hence the
reciprocal of the smallest eigenvalue of A (in magnitude) is the dominant eigenvalue of A^{−1}.
5.1. Inverse Power method. The Inverse Power method is a modification of the Power method that
is used to determine the eigenvalue of A that is closest to a specified number σ.
We consider A − σI; its eigenvalues are λ1 − σ, λ2 − σ, · · · , λn − σ, where λ1, λ2, · · · , λn are the
eigenvalues of A. The eigenvalues of (A − σI)^{−1} are then
1/(λ1 − σ), 1/(λ2 − σ), · · · , 1/(λn − σ).
The eigenvalue of the original matrix A that is closest to σ corresponds to the eigenvalue of largest
magnitude of the shifted and inverted matrix (A − σI)^{−1}.
To find the eigenvalue closest to σ, we apply the power method to obtain the dominant eigenvalue µ of
(A − σI)^{−1}, and then recover the eigenvalue λ of the original problem by λ = 1/µ + σ. This method
is called shift-and-invert. To compute y = (A − σI)^{−1} x we solve (A − σI)y = x, so we need not
compute the inverse of the matrix.

Example 13. Apply the inverse power method with x(0) = [1, 1, 1]^T to the matrix
A = [ −4 14 0 ; −5 13 0 ; −1 0 2 ]
with σ = 19/3.
Sol. For the inverse power method, we consider
A − (19/3)I = [ −31/3 14 0 ; −5 20/3 0 ; −1 0 −13/3 ].
Starting with x(0) = [1, 1, 1]^T, y(1) = (A − σI)^{−1} x(0) gives (A − σI)y(1) = x(0), that is,
[ −31/3 14 0 ; −5 20/3 0 ; −1 0 −13/3 ] [ a ; b ; c ] = [ 1 ; 1 ; 1 ].
Solving this system by Gauss elimination (LU decomposition), we get a = −6.6, b = −4.8, and
c = 1.2923.
Therefore y(1) = (−6.6, −4.8, 1.2923)^T. We normalize it by taking −6.6 as the scale factor:
x(1) = (1/−6.6) y(1) = (1, 0.7272, −0.1958)^T.
Therefore the first approximation of the eigenvalue of A near 19/3 is −1/6.6 + 19/3 = 6.1818.
Repeating the above procedure, we can obtain the eigenvalue (which is 6).
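A sketch of the shift-and-invert loop, solving with the shifted matrix rather than forming its inverse, as recommended above (the names are mine). On Example 13 it converges to 6:

```python
import numpy as np

def inverse_power(A, sigma, x0, n_iter=20):
    """Shift-and-invert power method for the eigenvalue of A nearest sigma."""
    M = A - sigma * np.eye(A.shape[0])
    x = np.asarray(x0, dtype=float)
    mu = 1.0
    for _ in range(n_iter):
        y = np.linalg.solve(M, x)       # (A - sigma I) y = x; no inverse formed
        mu = y[np.argmax(np.abs(y))]    # dominant eigenvalue of (A - sigma I)^{-1}
        x = y / mu
    return 1.0 / mu + sigma             # recover the eigenvalue of A

A = np.array([[-4., 14., 0.], [-5., 13., 0.], [-1., 0., 2.]])
print(inverse_power(A, 19/3, np.ones(3)))   # ≈ 6.0
```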

Important Remark: Although the power method worked well in these examples, we must say
something about cases in which it may fail. There are basically three such cases:
1. Using the power method when A is not diagonalizable. Recall that A has n linearly independent
eigenvectors if and only if A is diagonalizable. Of course, it is not easy to tell just by looking at A
whether it is diagonalizable.
2. Using the power method when A does not have a dominant eigenvalue, or when the dominant
eigenvalue is such that |λ1| = |λ2|.
3. If the entries of A contain significant error, the powers A^k will have significant roundoff error in
their entries.

Exercises
(1) Find the l∞ and l2 norms of the vectors.
a. x = (3, −4, 0, 3/2)^t
b. x = (sin k, cos k, 2^k)^t for a fixed positive integer k.
(2) Find the l∞ norm of the matrix:
[ 4 −1 7 ; −1 4 0 ; −7 0 4 ].
(3) The following linear system Ax = b has x as the actual solution and x̄ as an approximate
solution. Compute ‖x − x̄‖∞ and ‖Ax̄ − b‖∞. Also compute ‖A‖∞.
x1 + 2x2 + 3x3 = 1
2x1 + 3x2 + 4x3 = −1
3x1 + 4x2 + 6x3 = 2,
x = (0, −7, 5)t
x̄ = (−0.2, −7.5, 5.4)t .
(4) Find the first two iterations of Jacobi and Gauss-Seidel using x(0) = 0:
4.63x1 − 1.21x2 + 3.22x3 = 2.22
−3.07x1 + 5.48x2 + 2.11x3 = −3.17
1.26x1 + 3.11x2 + 4.57x3 = 5.11.

(5) The linear system
x1 − x3 = 0.2
−(1/2)x1 + x2 − (1/4)x3 = −1.425
x1 − (1/2)x2 + x3 = 2
has the solution (0.9, −0.8, 0.7)^T.
(a) Is the coefficient matrix strictly diagonally dominant?
(b) Compute the spectral radius of the Gauss-Seidel iteration matrix.
(c) Perform four iterations of the Gauss-Seidel iterative method to approximate the solution.
(d) What happens in part (c) when the first equation in the system is changed to x1 − 2x3 = 0.2?
(6) Show that Gauss-Seidel method does not converge for the following system of equations
2x1 + 3x2 + x3 = −1
3x1 + 2x2 + 2x3 = 1
x1 + 2x2 + 2x3 = 1.
(7) Find the first two iterations of the SOR method with ω = 1.1 for the following linear systems,
using x(0) = 0 :
4x1 + x2 − x3 = 5
−x1 + 3x2 + x3 = −4
2x1 + 2x2 + 5x3 = 1.
(8) Compute the condition numbers of the following matrices relative to ‖·‖∞.
(a) [ 3.9 1.6 ; 6.8 2.9 ]
(b) [ 0.04 0.01 −0.01 ; 0.2 0.5 −0.2 ; 1 2 4 ].
(9) Use Gaussian elimination and three-digit rounding arithmetic to approximate the solutions to
the following linear systems. Then use one iteration of iterative refinement to improve the
approximation, and compare the approximations to the actual solutions.
(a)
0.03x1 + 58.9x2 = 59.2
5.31x1 − 6.10x2 = 47.0.
Actual solution (10, 1)t .
(b)
3.3330x1 + 15920x2 + 10.333x3 = 7953
2.2220x1 + 16.710x2 + 9.6120x3 = 0.965
−1.5611x1 + 5.1792x2 − 1.6855x3 = 2.714.
Actual solution (1, 0.5, −1)t .
(10) The linear system Ax = b given by
[ 1 2 ; 1.0001 2 ] [ x1 ; x2 ] = [ 3 ; 3.0001 ]
has solution (1, 1)^t. Use four-digit rounding arithmetic to find the solution of the perturbed
system
[ 1 2 ; 1.000011 2 ] [ x1 ; x2 ] = [ 3.00001 ; 3.00003 ].
Is matrix A ill-conditioned?

(11) Determine the largest eigenvalue and the corresponding eigenvector of the following matrix,
correct to three decimals, using the power method with x(0) = (−1, 2, 1)^t:
[ 1 −1 0 ; −2 4 −2 ; 0 −1 2 ].
(12) Use the inverse power method to approximate the most dominant eigenvalue of the following
matrix until a tolerance of 10^{−2} is achieved, with x(0) = (1, −1, 2)^t:
[ 2 1 1 ; 1 2 1 ; 1 1 2 ].
(13) Find the eigenvalue of the matrix
[ 2 −1 0 ; −1 2 −1 ; 0 −1 2 ]
nearest to 3 using the inverse power method.

Appendix A. Algorithms
Algorithm (Gauss-Seidel):
(1) Input matrix A = [aij], b, XO = x(0), tolerance TOL, maximum number of iterations N
(2) Set k = 1
(3) While (k ≤ N) do Steps 4 to 7
(4) For i = 1, 2, · · · , n set
    xi = (1/aii)[−Σ_{j=1}^{i−1} aij xj − Σ_{j=i+1}^{n} aij XOj + bi]
(5) If ‖x − XO‖ < TOL, then OUTPUT (x1, x2, · · · , xn); STOP
(6) Set k = k + 1
(7) For i = 1, 2, · · · , n set XOi = xi
(8) OUTPUT (x1, x2, · · · , xn); STOP.
Algorithm (Power Method):
(1) Start
(2) Define matrix A and initial guess x
(3) Calculate y = Ax
(4) Find the largest element in magnitude of matrix y and assign it to K.
(5) Calculate fresh value x = (1/K) ∗ y
(6) If |K(n) − K(n − 1)| > error, go to Step 3.
(7) Stop

Bibliography
[Burden] Richard L. Burden, J. Douglas Faires and Annette Burden, “Numerical Analysis,” Cengage
Learning, 10th edition, 2015.
[Atkinson] K. Atkinson and W. Han, “Elementary Numerical Analysis,” John Wiley and Sons, 3rd
edition, 2004.
CHAPTER 5 (8 LECTURES)
POLYNOMIAL INTERPOLATION AND APPROXIMATIONS

1. Introduction
Polynomials are used as the basic means of approximation in nearly all areas of numerical analysis.
They are used in the solution of equations and in the approximation of functions, of integrals and
derivatives, of solutions of integral and differential equations, etc. Polynomials have simple structure,
which makes it easy to construct effective approximations and then make use of them. For this reason,
the representation and evaluation of polynomials is a basic topic in numerical analysis. We discuss this
topic in the present chapter in the context of polynomial interpolation, the simplest and certainly the
most widely used technique for obtaining polynomial approximations.
Definition 1.1 (Polynomial). A polynomial Pn (x) of degree ≤ n is, by definition, a function of the
form
Pn (x) = a0 + a1 x + a2 x2 + · · · + an xn (1.1)
with certain coefficients a0 , a1 , · · · , an . This polynomial has (exact) degree n in case its leading coeffi-
cient an is nonzero.
The power form (1.1) is the standard way to specify a polynomial in mathematical discussions. It is
a very convenient form for differentiating or integrating a polynomial. But, in various specific contexts,
other forms are more convenient. For example, the following shifted power form may be helpful.
P (x) = a0 + a1 (x − c) + a2 (x − c)2 + · · · + an (x − c)n . (1.2)
It is good practice to employ the shifted power form with the center c chosen somewhere in the interval
[a, b] when interested in a polynomial on that interval.
Definition 1.2 (Newton form). A further generalization of the shifted power form is the following
Newton form

P(x) = a0 + a1(x − c1) + a2(x − c1)(x − c2) + · · · + an(x − c1)(x − c2) · · · (x − cn).
This form plays a major role in the construction of an interpolating polynomial. It reduces to the
shifted power form if the centers c1 , · · · , cn , all equal c, and to the power form if the centers c1 , · · · , cn ,
all equal zero.

2. Lagrange Interpolation
In this chapter, we consider interpolation problems. Suppose we do not know the function f itself,
but only a few pieces of information (data) about it. We then try to compute a function g that approximates f.

2.1. Polynomial Interpolation. The polynomial interpolation problem, also called Lagrange inter-
polation, can be described as follows: Given (n+1) data points (xi , yi ), i = 0, 1, · · · , n find a polynomial
P of lowest possible degree such that
yi = P (xi ), i = 0, 1, · · · , n.
Such a polynomial is said to interpolate the data. Here yi may be the value of some unknown function
f at xi , i.e. yi = f (xi ).
One reason for considering the class of polynomials in approximation of functions is that they uniformly
approximate continuous function.
Theorem 2.1 (Weierstrass Approximation Theorem). Suppose that f is defined and continuous on
[a, b]. For any ε > 0, there exists a polynomial P (x) defined on [a, b] with the property that
|f (x) − P (x)| < ε, ∀x ∈ [a, b].

Another reason for considering the class of polynomials in approximation of functions is that the
derivatives and indefinite integrals of a polynomial are easy to compute.

Theorem 2.2 (Existence and Uniqueness). Given a real-valued function f (x) and n + 1 distinct points
x0 , x1 , · · · , xn , there exists a unique polynomial Pn (x) of degree ≤ n which interpolates the unknown
function f (x) at given points x0 , x1 , · · · , xn .

Proof. Existence: Let x0 , x1 , · · · , xn be the given n + 1 discrete data points. We will prove the result
by the mathematical induction.
The Theorem clearly holds for n = 0, only one data point is given and we can take constant polynomial
P0 (x) = f (x0 ), ∀x.
Assume that the Theorem holds for n ≤ k, i.e. there is a polynomial Pk with degree ≤ k such that
Pk (xi ) = f (xi ), for 0 ≤ i ≤ k.
Now we try to construct a polynomial of degree at most k + 1 to interpolate (xi , f (xi )), 0 ≤ i ≤ k + 1.
Let
Pk+1 (x) = Pk (x) + c(x − x0 )(x − x1 ) · · · (x − xk ).
For x = xk+1 ,

Pk+1 (xk+1 ) = f (xk+1 ) = Pk (xk+1 ) + c(xk+1 − x0 )(xk+1 − x1 ) · · · (xk+1 − xk )

f (xk+1 ) − Pk (xk+1 )
=⇒ c = .
(xk+1 − x0 )(xk+1 − x1 ) · · · (xk+1 − xk )
Since xi are distinct, the polynomial Pk+1 (x) is well-defined and degree of Pk+1 ≤ k + 1. Now

Pk+1 (xi ) = Pk (xi ) + 0 = Pk (xi ) = f (xi ), 0 ≤ i ≤ k

and
Pk+1 (xk+1 ) = f (xk+1 )
Above two equations implies
Pk+1 (xi ) = f (xi ), 0 ≤ i ≤ k + 1.
Therefore Pk+1(x) interpolates f(x) at all k + 2 nodal points. By mathematical induction, the result
is true for all n.
Uniqueness: Let there are two such polynomials Pn and Qn such that

Pn (xi ) = f (xi )

Qn (xi ) = f (xi ), 0 ≤ i ≤ n.
Define
Sn (x) = Pn (x) − Qn (x)
Since for both Pn and Qn , degree ≤ n, which implies the degree of Sn is also ≤ n.
Also
Sn (xi ) = Pn (xi ) − Qn (xi ) = f (xi ) − f (xi ) = 0, 0 ≤ i ≤ n.
This implies Sn has at least n + 1 zeros which is not possible as degree of Sn is at most n.
This implies
Sn (x) = 0, ∀x

=⇒ Pn (x) = Qn (x), ∀x.


Therefore the interpolating polynomial is unique.

2.2. Linear Interpolation. We determine a polynomial


P (x) = ax + b (2.1)
where a and b are arbitrary constants satisfying the interpolating conditions f (x0 ) = P (x0 ) and
f (x1 ) = P (x1 ). We have
f (x0 ) = P (x0 ) = ax0 + b
f (x1 ) = P (x1 ) = ax1 + b.
Lagrange interpolation: Solving for a and b, we obtain
a = (f(x0) − f(x1))/(x0 − x1),
b = (f(x0)x1 − f(x1)x0)/(x1 − x0).
Substituting these values in equation (2.1), we obtain
P(x) = ((f(x0) − f(x1))/(x0 − x1)) x + (f(x0)x1 − f(x1)x0)/(x1 − x0)
=⇒ P(x) = ((x − x1)/(x0 − x1)) f(x0) + ((x − x0)/(x1 − x0)) f(x1)
=⇒ P(x) = l0(x) f(x0) + l1(x) f(x1),
where l0(x) = (x − x1)/(x0 − x1) and l1(x) = (x − x0)/(x1 − x0).
These functions l0(x) and l1(x) are called the Lagrange fundamental polynomials, and they satisfy the
following conditions:
l0(x) + l1(x) = 1,
l0(x0) = 1, l0(x1) = 0,
l1(x0) = 0, l1(x1) = 1,
that is,
li(xj) = δij = { 1 if i = j; 0 if i ≠ j }.
Higher-order Lagrange interpolation: In this section we take a different approach and assume
that the interpolation polynomial is given as a linear combination of n + 1 polynomials of degree n.
This time, we set the coefficients as the interpolated values, {f (xi )}ni=0 , while the unknowns are the
polynomials. We thus let
Pn(x) = Σ_{i=0}^{n} f(xi) li(x),
where li (x) are n + 1 polynomials of degree n. Note that in this particular case, the polynomials li (x)
are precisely of degree n (and not ≤ n). However, Pn (x), given by the above equation may have a
lower degree. In either case, the degree of Pn (x) is n at the most. We now require that Pn (x) satisfies
the interpolation conditions
Pn (xj ) = f (xj ), 0 ≤ j ≤ n.
By substituting xj for x we have
Pn(xj) = Σ_{i=0}^{n} f(xi) li(xj), 0 ≤ j ≤ n.
Therefore we may conclude that li (x) must satisfy
li (xj ) = δij , i, j = 0, 1, · · · , n
where δij is the Kronecker delta, defined as

δij = { 1 if i = j; 0 if i ≠ j }.
Each polynomial li (x) has n + 1 unknown coefficients. The conditions given above through delta
provide exactly n + 1 equations that the polynomials li (x) must satisfy and these equations can be
solved in order to determine all li (x)’s. Fortunately there is a shortcut. An obvious way of constructing
polynomials li (x) of degree n that satisfy the condition is the following:
li(x) = [(x − x0)(x − x1) · · · (x − xi−1)(x − xi+1) · · · (x − xn)] /
        [(xi − x0)(xi − x1) · · · (xi − xi−1)(xi − xi+1) · · · (xi − xn)].
The uniqueness of the interpolating polynomial of degree ≤ n on n + 1 distinct interpolation points
implies that the polynomials li(x) given by the above relation are the only such polynomials of degree n.
Note that the denominator does not vanish, since we assume that all interpolation points are distinct.
We can write the formula for li(x) in a compact form using the product notation:
li(x) = W(x)/((x − xi) W′(xi)), i = 0, 1, · · · , n,
where
W(x) = (x − x0) · · · (x − xi−1)(x − xi)(x − xi+1) · · · (x − xn),
∴ W′(xi) = (xi − x0) · · · (xi − xi−1)(xi − xi+1) · · · (xi − xn).
The Lagrange interpolating polynomial can be written as
Pn(x) = Σ_{i=0}^{n} f(xi) Π_{j=0, j≠i}^{n} (x − xj)/(xi − xj).

Example 1. Use Lagrange interpolation to find the unique polynomial of degree 3 or less that agrees
with the following data, and estimate y(1.5).

xi −1 0 1 2
yi 3 −4 5 −6

Sol. The Lagrange fundamental polynomials are given by
l0(x) = ((x − 0)(x − 1)(x − 2))/((−1 − 0)(−1 − 1)(−1 − 2)) = −(1/6)(x³ − 3x² + 2x),
l1(x) = ((x + 1)(x − 1)(x − 2))/((0 + 1)(0 − 1)(0 − 2)) = (1/2)(x³ − 2x² − x + 2),
l2(x) = ((x + 1)(x − 0)(x − 2))/((1 + 1)(1 − 0)(1 − 2)) = −(1/2)(x³ − x² − 2x),
l3(x) = ((x + 1)(x − 0)(x − 1))/((2 + 1)(2 − 0)(2 − 1)) = (1/6)(x³ − x).
The interpolating polynomial in the Lagrange form is therefore
P3(x) = y0 l0(x) + y1 l1(x) + y2 l2(x) + y3 l3(x)
      = 3 l0(x) − 4 l1(x) + 5 l2(x) − 6 l3(x)
      = −6x³ + 8x² + 7x − 4.
∴ y(1.5) ≈ P3(1.5) = 4.25.
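Evaluating the Lagrange form directly from its definition takes only a few lines; the following sketch (the names are my own) reproduces P3(1.5) = 4.25 for the data above:

```python
def lagrange_eval(xs, ys, x):
    """Evaluate the Lagrange interpolating polynomial at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        li = 1.0
        for j, xj in enumerate(xs):
            if j != i:
                li *= (x - xj) / (xi - xj)   # fundamental polynomial l_i(x)
        total += yi * li
    return total

xs = [-1.0, 0.0, 1.0, 2.0]
ys = [3.0, -4.0, 5.0, -6.0]
print(lagrange_eval(xs, ys, 1.5))   # 4.25
```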

Example 2. Let f(x) = √(x − x²) and let P2(x) be the interpolation polynomial on x0 = 0, x1 and
x2 = 1. Find the largest value of x1 in (0, 1) for which f(0.5) − P2(0.5) = −0.25.
Sol. If f(x) = √(x − x²), then our nodes are [x0, x1, x2] = [0, x1, 1] and f(x0) = 0, f(x1) = √(x1 − x1²),
and f(x2) = 0. Therefore
l0(x) = ((x − x1)(x − x2))/((x0 − x1)(x0 − x2)) = ((x − x1)(x − 1))/x1,
l1(x) = ((x − x0)(x − x2))/((x1 − x0)(x1 − x2)) = (x(x − 1))/(x1(x1 − 1)),
l2(x) = ((x − x0)(x − x1))/((x2 − x0)(x2 − x1)) = (x(x − x1))/(1 − x1).
∴ P2(x) = l0(x) f(x0) + l1(x) f(x1) + l2(x) f(x2)
        = 0 + (x(x − 1))/(x1(x1 − 1)) · √(x1 − x1²) + 0
        = −x(x − 1)/√(x1(1 − x1)).
If we now consider f(x) − P2(x), then
f(x) − P2(x) = √(x − x²) + x(x − 1)/√(x1(1 − x1)).
Hence f(0.5) − P2(0.5) = −0.25 implies
√(0.5 − 0.25) + (0.5)(0.5 − 1)/√(x1(1 − x1)) = −0.25.
Solving for x1 gives
x1² − x1 = −1/9, or (x1 − 1/2)² = 5/36,
which gives x1 = 1/2 − √(5/36) or x1 = 1/2 + √(5/36).
The largest of these is therefore
x1 = 1/2 + √(5/36) ≈ 0.8727.
2.3. Error Analysis for Polynomial Interpolation. We are given nodes x0, x1, · · · , xn and the
corresponding function values f(x0), f(x1), · · · , f(xn), but we do not know an expression for the
function. Let Pn(x) be the polynomial of degree ≤ n that passes through the n + 1 points (x0, f(x0)),
(x1, f(x1)), · · · , (xn, f(xn)).
Question: What is the error between f(x) and Pn(x), given that we do not know f(x) in advance?
Definition 2.3 (Truncation error). The polynomial Pn (x) coincides with f (x) at all nodal points and
may deviates at other points in the interval. This deviation is called the truncation error and we write
En (f ; x) = f (x) − Pn (x).
Theorem 2.4. Suppose that x0, x1, · · · , xn are distinct numbers in [a, b] and f ∈ C^{n+1}[a, b]. Let Pn(x)
be the unique polynomial of degree ≤ n that passes through these n + 1 distinct points. Then for every
x ∈ [a, b] there exists ξ = ξ(x) ∈ (a, b) such that
f(x) − Pn(x) = ((x − x0)(x − x1) · · · (x − xn)/(n + 1)!) f^{(n+1)}(ξ).
Proof. Let x0, x1, · · · , xn be distinct numbers in [a, b] and f ∈ C^{n+1}[a, b], and let Pn(x) be the unique
polynomial of degree ≤ n that passes through the n + 1 points. Since f(xi) = Pn(xi) for i = 0, 1, · · · , n,
the error f(x) − Pn(x) vanishes at the nodes, so the result holds trivially when x is a node; fix x
distinct from every xi.
For t ∈ [a, b], define the function
g(t) = f(t) − Pn(t) − [f(x) − Pn(x)] ((t − x0)(t − x1) · · · (t − xn))/((x − x0)(x − x1) · · · (x − xn)). (2.2)
Now g ∈ C^{n+1}[a, b], since f ∈ C^{n+1}[a, b] and Pn ∈ C^{n+1}[a, b]. Moreover g(t) = 0 at
t = x, x0, x1, · · · , xn. Therefore g satisfies the conditions of the generalized Rolle's theorem, which
states that between n + 2 zeros of a function there is at least one zero of its (n + 1)-th derivative.
Hence there exists a point ξ ∈ (a, b), depending on x, such that
g^{(n+1)}(ξ) = 0.
Now differentiate the function g(t) (n + 1) times to obtain
g^{(n+1)}(t) = f^{(n+1)}(t) − Pn^{(n+1)}(t) − [f(x) − Pn(x)] (n + 1)!/((x − x0)(x − x1) · · · (x − xn)).
Here Pn^{(n+1)}(t) = 0, since Pn(x) is a polynomial of degree ≤ n. Setting g^{(n+1)}(ξ) = 0 and solving
for f(x) − Pn(x), we obtain
f(x) − Pn(x) = ((x − x0)(x − x1) · · · (x − xn)/(n + 1)!) f^{(n+1)}(ξ).

Corollary 2.5. To bound the maximum error, we bound the right-hand side, which contains two
factors: the product of terms of the form (x − xi), and the derivative f^{(n+1)}(ξ). In practice we find
two separate bounds, one for each factor.
The next example illustrates how the error formula can be used to prepare a table of data that will
ensure a specified interpolation error within a specified bound.
Example 3. Suppose a table is to be prepared for the function f (x) = ex , for x in [0, 1]. Assume
the number of decimal places to be given per entry is d ≥ 8 and that the difference between adjacent
x−values, the step size, is h. What step size h will ensure that linear interpolation gives an absolute
error of at most 10−6 for all x in [0, 1]?
Sol. Let x0 , x1 , . . . be the numbers at which f is evaluated, x be in [0, 1], and suppose i satisfies
xi ≤ x ≤ xi+1 .
The error in linear interpolation is
|f(x) − P(x)| = |(1/2) f″(ξ)(x − xi)(x − xi+1)| = (|f″(ξ)|/2) |(x − xi)| |(x − xi+1)|.
The step size is h, so xi = ih, xi+1 = (i + 1)h, and
|f(x) − P(x)| ≤ (1/2) |f″(ξ)| |(x − xi)(x − xi+1)|.
Hence
|f(x) − P(x)| ≤ (1/2) max_{ξ∈[0,1]} e^ξ · max_{xi≤x≤xi+1} |(x − xi)(x − xi+1)|
             ≤ (e/2) max_{xi≤x≤xi+1} |(x − xi)(x − xi+1)|.
We write g(x) = (x − xi)(x − xi+1) for xi ≤ x ≤ xi+1. For simplification, write
x − xi = th, so that x − xi+1 = x − (xi + h) = (t − 1)h.
Thus
g(t) = h² t(t − 1),  g′(t) = h²(2t − 1).
The only critical point for g is at t = 1/2, which gives |g(1/2)| = h²/4. Since g(xi) = 0 and g(xi+1) = 0,
the maximum value of |g(x)| in [xi, xi+1] must occur at the critical point, which implies that
|f(x) − P(x)| ≤ (e/2) max_{xi≤x≤xi+1} |g(x)| ≤ (e/2) · (h²/4) = e h²/8.
Consequently, to ensure that the error in linear interpolation is bounded by 10^{−6}, it is sufficient
for h to be chosen so that
e h²/8 ≤ 10^{−6}.
This implies that h < 1.72 × 10^{−3}.
Because n = (1 − 0)/h must be an integer, a reasonable choice for the step size is h = 0.001.
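The step-size calculation is easy to confirm numerically; here is a small sketch of mine under the same assumptions as the example:

```python
import math

# choose h with e * h**2 / 8 <= 1e-6
h_max = math.sqrt(8e-6 / math.e)
print(h_max)                          # ≈ 0.001715, so h = 0.001 is safe

# check the actual linear-interpolation error at a subinterval midpoint
h, xi = 0.001, 0.5
x = xi + h / 2
p = (math.exp(xi) * (xi + h - x) + math.exp(xi + h) * (x - xi)) / h
print(abs(math.exp(x) - p))           # ≈ 2.1e-07, well below 1e-6
```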
Example 4. Determine the step size h that can be used in the tabulation of a function f(x), a ≤ x ≤ b,
at equally spaced nodal points so that the truncation error of quadratic interpolation is less than ε.
Sol. Let xi−1, xi, xi+1 be three equispaced points with spacing h. The truncation error of quadratic
interpolation is bounded by
|f(x) − P2(x)| ≤ (M/3!) max_{a≤x≤b} |(x − xi−1)(x − xi)(x − xi+1)|,
where M = max_{a≤x≤b} |f^{(3)}(x)|.
To simplify the calculation, let
x − xi = th, so that x − xi−1 = x − (xi − h) = (t + 1)h and x − xi+1 = x − (xi + h) = (t − 1)h.
∴ |(x − xi−1)(x − xi)(x − xi+1)| = h³ |t(t + 1)(t − 1)| = g(t) (say).
Now g(t) attains its extreme values where dg/dt = 0, which gives t = ±1/√3; at the end points of the
interval g is zero. For both values t = ±1/√3 we obtain
max_{xi−1≤x≤xi+1} |g(t)| = 2h³/(3√3).
The truncation error requirement is
|f(x) − P2(x)| < ε
=⇒ (h³/(9√3)) M < ε
=⇒ h < (9√3 ε/M)^{1/3}.

3. Neville’s Method
Neville’s method can be applied in the situation that we want to interpolate f (x) at a given point
x = p with increasingly higher order Lagrange interpolation polynomials.
For concreteness, consider three distinct points x0 , x1 , and x2 at which we can evaluate f (x) ex-
actly f (x0 ), f (x1 ), f (x2 ). From each of these three points we can construct an order zero (constant)
“polynomial” to approximate f (p) as
f (p) ≈ P0 (p) = f (x0 ) (3.1)
f (p) ≈ P1 (p) = f (x1 ) (3.2)
f (p) ≈ P2 (p) = f (x2 ) (3.3)
Of course this isn't a very good approximation, so we turn to first-order Lagrange polynomials:
f(p) ≈ P0,1(p) = ((p − x1)/(x0 − x1)) f(x0) + ((p − x0)/(x1 − x0)) f(x1),
f(p) ≈ P1,2(p) = ((p − x2)/(x1 − x2)) f(x1) + ((p − x1)/(x2 − x1)) f(x2).
There is also P0,2 , but we won’t concern ourselves with that one.
If we note that f(xi) = Pi(p), we find
P0,1(p) = ((p − x1)/(x0 − x1)) P0(p) + ((p − x0)/(x1 − x0)) P1(p)
        = ((p − x1)P0(p) − (p − x0)P1(p))/(x0 − x1),
and similarly
P1,2(p) = ((p − x2)P1(p) − (p − x1)P2(p))/(x1 − x2).
In general we want to multiply Pi(p) by (p − xj) where j ≠ i (i.e., xj is a point that is NOT interpolated
by Pi). We take the difference of two such products and divide by the difference between the added
points. The result is a polynomial of one degree higher than either of the two used to construct it,
and it interpolates all the points of the two constructing polynomials combined. This idea can be
extended to construct the second-degree polynomial P0,1,2:
P0,1,2(p) = ((p − x2)P0,1(p) − (p − x0)P1,2(p))/(x0 − x2).
A little algebra will convince you that
P0,1,2(p) = ((p − x1)(p − x2))/((x0 − x1)(x0 − x2)) f(x0) + ((p − x0)(p − x2))/((x1 − x0)(x1 − x2)) f(x1)
          + ((p − x0)(p − x1))/((x2 − x0)(x2 − x1)) f(x2),
which is just the second-degree Lagrange polynomial interpolating the points x0, x1, x2. This shouldn't
surprise you, since that is the unique polynomial of degree ≤ 2 through these three points.
Example 5. We are given the function f(x) = 1/x. Approximate the value f(3) using the three points
2, 2.5 and 4 by Neville's method.
Sol. Firstly we evaluate the function at the three points
xi f (xi )
2 0.5
2.5 0.4
4 0.25
We can first make three separate zero-order approximations
f (3) ≈ P0 (3) = f (x0 ) = 0.5
f (3) ≈ P1 (3) = f (x1 ) = 0.4
f (3) ≈ P2 (3) = f (x2 ) = 0.25.
From these we proceed to construct P0,1 and P1,2 using the Neville formula:
f(3) ≈ P0,1(3) = ((3 − x1)P0(3) − (3 − x0)P1(3))/(x0 − x1) = 0.3,
f(3) ≈ P1,2(3) = ((3 − x2)P1(3) − (3 − x1)P2(3))/(x1 − x2) = 0.35.
So we can add these numbers to our table:
xi     f(xi)   Pi,i+1
2      0.5
2.5    0.4     0.3
4      0.25    0.35

Finally we can compute P0,1,2 using P0,1 and P1,2:
f(3) ≈ P0,1,2(3) = ((3 − x2)P0,1(3) − (3 − x0)P1,2(3))/(x0 − x2) = 0.325.
xi     f(xi)   Pi,i+1   Pi,i+1,i+2
2      0.5
2.5    0.4     0.3
4      0.25    0.35     0.325
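Neville's recurrence is naturally implemented as an in-place tableau; the sketch below (the names are my own) reproduces the value 0.325 from Example 5:

```python
def neville(xs, ys, p):
    """Neville's tableau; after the loops Q[0] equals P_{0,...,n}(p)."""
    Q = list(ys)
    n = len(xs)
    for k in range(1, n):        # k = order of the current column
        for i in range(n - k):   # Q[i+1] is still the previous-order value here
            Q[i] = ((p - xs[i + k]) * Q[i] - (p - xs[i]) * Q[i + 1]) / (xs[i] - xs[i + k])
    return Q[0]

print(neville([2.0, 2.5, 4.0], [0.5, 0.4, 0.25], 3.0))   # 0.325
```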
Example 6. Neville's method is used to approximate f(0.4) as follows. Complete the table.
xi     Pi(0.4)   Pi,i+1(0.4)       Pi,i+1,i+2(0.4)       Pi,i+1,i+2,i+3(0.4)
0      1
0.25   2         P0,1(0.4) = 2.6
0.5    P2        P1,2(0.4)         P0,1,2(0.4)
0.75   8         P2,3(0.4) = 2.4   P1,2,3(0.4) = 2.96    P0,1,2,3(0.4) = 3.016
Sol.
P2,3(0.4) = ((0.4 − 0.75)P2 − (0.4 − 0.5)P3)/(0.5 − 0.75) = 2.4
=⇒ P2 = 4.
P1,2(0.4) = ((0.4 − 0.5)P1 − (0.4 − 0.25)P2)/(0.25 − 0.5) = ((−0.1)(2) − (0.15)(4))/(−0.25) = 3.2.
P0,1,2(0.4) = ((0.4 − 0.5)P0,1(0.4) − (0.4 − 0)P1,2(0.4))/(0 − 0.5) = ((−0.1)(2.6) − (0.4)(3.2))/(−0.5) = 3.08.
Example 7. In Neville's method, suppose xi = i for i = 0, 1, 2, 3 and it is known that P0,1(x) =
x + 1, P1,2(x) = 3x − 1, and P1,2,3(1.5) = 4. Find P2,3(1.5) and P0,1,2,3(1.5).
Sol. Here x0 = 0, x1 = 1, x2 = 2, x3 = 3.
P0,1,2(x) = ((x − x2)P0,1(x) − (x − x0)P1,2(x))/(x0 − x2) = ((x − 2)(x + 1) − x(3x − 1))/(−2) = x² + 1.
P1,2,3(1.5) = ((1.5 − x1)P2,3(1.5) − (1.5 − x3)P1,2(1.5))/(x3 − x1) = 4
=⇒ P2,3(1.5) = 5.5.
Also P0,1,2(1.5) = 3.25.
∴ P0,1,2,3(1.5) = ((1.5 − 3)P0,1,2(1.5) − (1.5 − 0)P1,2,3(1.5))/(0 − 3) = 3.625.

4. Newton’s divided difference interpolation


Suppose that Pn (x) is the n-th order Lagrange polynomial that agrees with the function f at the
distinct numbers x0 , x1 , · · · , xn . Although this polynomial is unique, there are alternate algebraic
representations that are useful in certain situations. The divided differences of f with respect to
x0 , x1 , · · · , xn are used to express Pn (x) in the form
Pn (x) = a0 + a1 (x − x0 ) + a2 (x − x0 )(x − x1 ) + · · · + an (x − x0 )(x − x1 ) · · · (x − xn−1 ), (4.1)
for appropriate constants a0 , a1 , · · · , an .
Now we determine the first of these constants a0 . For this we substitute x = x0 in Pn (x) and we obtain
a0 = Pn (x0 ) = f (x0 ).
Similarly, when Pn (x) is evaluated at x1 , the only nonzero terms in the evaluation of Pn (x1 ) are the
constant and linear terms,
f(x0) + a1(x1 − x0) = Pn(x1) = f(x1),
so
a1 = (f(x1) − f(x0))/(x1 − x0) = f[x0, x1].
The ratio f[x0, x1] = (f(x1) − f(x0))/(x1 − x0) is called the first divided difference of f(x), and in general
f[xi, xi+1] = (f(xi+1) − f(xi))/(xi+1 − xi).
The remaining divided differences are defined recursively.
The second divided difference of three points, xi , xi+1 , xi+2 , is defined as
f[xi, xi+1, xi+2] = (f[xi+1, xi+2] − f[xi, xi+1])/(xi+2 − xi).
Now if we substitute x = x2 and the values of a0 and a1 in Eq. (4.1), we obtain
P(x2) = f(x2) = f(x0) + ((f(x1) − f(x0))/(x1 − x0))(x2 − x0) + a2(x2 − x0)(x2 − x1)
=⇒ a2 = f(x0)/((x0 − x1)(x0 − x2)) + f(x1)/((x1 − x0)(x1 − x2)) + f(x2)/((x2 − x0)(x2 − x1))
      = (f[x1, x2] − f[x0, x1])/(x2 − x0) = f[x0, x1, x2].
The process ends with the single n-th divided difference,
an = f[x0, x1, · · · , xn] = (f[x1, x2, · · · , xn] − f[x0, x1, · · · , xn−1])/(xn − x0)
   = Σ_{i=0}^{n} f(xi) / Π_{j=0, j≠i}^{n} (xi − xj).

We can write Newton's divided difference formula in the following fashion (as we will prove in the
next theorem):
Pn(x) = f(x0) + f[x0, x1](x − x0) + f[x0, x1, x2](x − x0)(x − x1) +
        · · · + f[x0, x1, · · · , xn](x − x0)(x − x1) · · · (x − xn−1)
      = f(x0) + Σ_{i=1}^{n} f[x0, x1, · · · , xi] Π_{j=0}^{i−1} (x − xj).

We can also construct the Newton’s interpolating polynomial as given in the next result.
Theorem 4.1. The unique polynomial of degree ≤ n that passes through (x0 , y0 ), (x1 , y1 ), · · · , (xn , yn )
is given by
Pn (x) = f [x0 ] + f [x0 , x1 ](x − x0 ) + f [x0 , x1 , x2 ](x − x0 )(x − x1 ) + · · · +
f [x0 , x1 , · · · , xn ](x − x0 )(x − x1 ) · · · (x − xn−1 ).
Proof. We prove it by induction. The unique polynomial of degree 0 that passes through (x0 , y0 ) is
obviously
P0 (x) = y0 = f [x0 ].
Suppose that the polynomial Pk (x) of order ≤ k that passes through (x0 , y0 ), (x1 , y1 ), · · · , (xk , yk ) is
Pk (x) = f [x0 ] + f [x0 , x1 ](x − x0 ) + f [x0 , x1 , x2 ](x − x0 )(x − x1 ) + · · · +
f [x0 , x1 , · · · , xk ](x − x0 )(x − x1 ) · · · (x − xk−1 ).
Write Pk+1(x), the unique polynomial of degree ≤ k + 1 that passes through (x0, y0), (x1, y1), · · · ,
(xk, yk), (xk+1, yk+1), as
Pk+1(x) = f[x0] + f[x0, x1](x − x0) + f[x0, x1, x2](x − x0)(x − x1) + · · · +
f[x0, x1, · · · , xk](x − x0)(x − x1) · · · (x − xk−1) + C(x − x0)(x − x1) · · · (x − xk−1)(x − xk).
We only need to show that
C = f[x0, x1, · · · , xk, xk+1].

For this, let Qk(x) be the unique polynomial of degree ≤ k that passes through (x1, y1), · · · , (xk, yk),
(xk+1, yk+1). Define
R(x) = Pk(x) + ((x − x0)/(xk+1 − x0)) [Qk(x) − Pk(x)].
Then:
• R(x) is a polynomial of degree ≤ k + 1.
• R(x0) = Pk(x0) = y0,
  R(xi) = Pk(xi) + ((xi − x0)/(xk+1 − x0))(Qk(xi) − Pk(xi)) = Pk(xi) = yi, i = 1, · · · , k,
  R(xk+1) = Qk(xk+1) = yk+1.
By uniqueness, R(x) = Pk+1(x).
The leading coefficient of Pk+1(x) is C. The leading coefficient of R(x) is the leading coefficient of
((x − x0)/(xk+1 − x0)) [Qk(x) − Pk(x)], which is
(1/(xk+1 − x0)) (leading coefficient of Qk(x) − leading coefficient of Pk(x)).
On the other hand, the leading coefficient of Qk(x) is f[x1, · · · , xk+1], and the leading coefficient of
Pk(x) is f[x0, · · · , xk]. Therefore
C = (f[x1, · · · , xk+1] − f[x0, · · · , xk])/(xk+1 − x0) = f[x0, x1, · · · , xk+1].
The generation of the divided differences is outlined in the table in the following example.

Example 8. We have the following four data points:

x −1 0 1 2
y 3 −4 5 −6

Find a polynomial in Newton’s form to interpolate the data and evaluate f (1.5) (the same exercise was
done by Lagrange interpolation).
Sol. To write the Newton’s form, we draw divided difference (d.d.) table as following.
P3 (x) = f (x0 ) + (x + 1)f [−1, 0] + (x + 1)(x − 0)f [−1, 0, 1] + (x + 1)(x − 0)(x − 1)f [−1, 0, 1, 2]
= 3 − 7(x + 1) + 8x(x + 1) − 6x(x + 1)(x − 1)
= −4 + 7x + 8x2 − 6x3 .
∴ f (1.5) ≈ P3 (1.5) = 4.25.
x    y = f(x)   first d.d.   second d.d.   third d.d.
−1     3
 0    −4         −7
 1     5          9            8
 2    −6        −11          −10           −6
Note that the x_i can be re-ordered but must be distinct. When the order of the x_i is changed, one obtains the same polynomial, but written in a different form.
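The divided-difference computation and the nested evaluation of the Newton form are easy to program. The following Python sketch (not from the course text; the function names are our own) rebuilds the table of Example 8 and reproduces f(1.5) ≈ 4.25:

    # Newton's divided-difference interpolation: a minimal sketch.
    def divided_differences(x, y):
        """Return f[x0], f[x0,x1], ..., f[x0,...,xn] (in-place column sweep)."""
        n = len(x)
        coef = list(y)                      # zeroth divided differences
        for j in range(1, n):               # j-th order differences
            for k in range(n - 1, j - 1, -1):
                coef[k] = (coef[k] - coef[k - 1]) / (x[k] - x[k - j])
        return coef

    def newton_eval(x, coef, p):
        """Evaluate the Newton form at p by nested multiplication."""
        result = coef[-1]
        for k in range(len(coef) - 2, -1, -1):
            result = result * (p - x[k]) + coef[k]
        return result

    x = [-1, 0, 1, 2]
    y = [3, -4, 5, -6]
    coef = divided_differences(x, y)        # [3, -7, 8, -6], as in the table
    print(newton_eval(x, coef, 1.5))        # 4.25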

Theorem 4.2. Let f ∈ C^n[a, b] and let x_0, · · · , x_n be distinct numbers in [a, b]. Then there exists ξ ∈ (a, b) such that
f[x_0, x_1, x_2, · · · , x_n] = f^{(n)}(ξ)/n!.
Proof. Let
P_n(x) = f(x_0) + Σ_{k=1}^{n} f[x_0, x_1, · · · , x_k](x − x_0)(x − x_1) · · · (x − x_{k−1})

be the interpolating polynomial of f in Newton’s form. Define

g(x) = f (x) − Pn (x).

Since Pn (xi ) = f (xi ) for i = 0, 1, · · · , n, the function g has n + 1 distinct zeros in [a, b]. By the
generalized Rolle’s Theorem there exists ξ ∈ (a, b) such that

g^{(n)}(ξ) = f^{(n)}(ξ) − P_n^{(n)}(ξ) = 0.
Here
P_n^{(n)}(x) = n! f[x_0, x_1, · · · , x_n].
Therefore
f[x_0, x_1, · · · , x_n] = f^{(n)}(ξ)/n!.
Example 9. Let f(x) = x^n for some integer n ≥ 0, and let x_0, x_1, · · · , x_m be m + 1 distinct numbers. What is f[x_0, x_1, · · · , x_m] for m = n? For m > n?

Sol. Since we can write
f[x_0, x_1, · · · , x_m] = f^{(m)}(ξ)/m!,
we have
f[x_0, x_1, · · · , x_n] = n!/n! = 1.
If m > n, then f^{(m)}(x) = 0 as f(x) is a monomial of degree n; thus f[x_0, x_1, · · · , x_m] = 0.

4.1. Newton’s interpolation for equally spaced points. Newton’s divided-difference formula can
be expressed in a simplified form when the nodes are arranged consecutively with equal spacing. Let the n + 1 points x_0, x_1, · · · , x_n be arranged consecutively with equal spacing
h = (x_n − x_0)/n = x_{i+1} − x_i, i = 0, 1, · · · , n − 1,
so that each x_i = x_0 + ih, i = 0, 1, · · · , n.
For any x ∈ [a, b], we can write x = x_0 + sh, s ∈ R. Then x − x_i = (s − i)h.
Now Newton's interpolating polynomial is given by
P_n(x) = f(x_0) + Σ_{k=1}^{n} f[x_0, x_1, · · · , x_k] (x − x_0) · · · (x − x_{k−1})
= f(x_0) + Σ_{k=1}^{n} f[x_0, x_1, · · · , x_k] (s − 0)h (s − 1)h · · · (s − k + 1)h
= f(x_0) + Σ_{k=1}^{n} f[x_0, x_1, · · · , x_k] s(s − 1) · · · (s − k + 1) h^k
= f(x_0) + Σ_{k=1}^{n} f[x_0, x_1, · · · , x_k] k! h^k \binom{s}{k},
where the binomial coefficient is
\binom{s}{k} = s(s − 1) · · · (s − k + 1)/k!.
Now we introduce the forward difference operator Δ:
Δf(x_i) = f(x_{i+1}) − f(x_i),
Δ^k f(x_i) = Δ^{k−1}(Δf(x_i)) = Δ^{k−1}[f(x_{i+1}) − f(x_i)], i = 0, 1, · · · , n − 1.

Using the Δ notation, we can write
f[x_0, x_1] = (f(x_1) − f(x_0))/(x_1 − x_0) = Δf(x_0)/h,
f[x_0, x_1, x_2] = (f[x_1, x_2] − f[x_0, x_1])/(x_2 − x_0) = ((1/h)Δf(x_1) − (1/h)Δf(x_0))/(2h) = Δ²f(x_0)/(2! h²).
In general,
f[x_0, x_1, · · · , x_k] = Δ^k f(x_0)/(k! h^k).
Therefore
P_n(x) = P_n(x_0 + sh) = f(x_0) + Σ_{k=1}^{n} \binom{s}{k} Δ^k f(x_0).
This is Newton's forward difference interpolation formula.
If the interpolation nodes are arranged in reverse order x_n, x_{n−1}, · · · , x_0, a formula for the interpolating polynomial is similar to the previous result. In this case, Newton's divided difference formula can be written as
P_n(x) = f(x_n) + Σ_{k=1}^{n} f[x_n, x_{n−1}, · · · , x_{n−k}] (x − x_n) · · · (x − x_{n−k+1}).
If the nodes are equally spaced with spacing
h = (x_n − x_0)/n, x_i = x_n − (n − i)h, i = n, n − 1, · · · , 0,
let x = x_n + sh.
Therefore
P_n(x) = f(x_n) + Σ_{k=1}^{n} f[x_n, x_{n−1}, · · · , x_{n−k}] (x − x_n) · · · (x − x_{n−k+1})
= f(x_n) + Σ_{k=1}^{n} f[x_n, x_{n−1}, · · · , x_{n−k}] (s)h (s + 1)h · · · (s + k − 1)h
= f(x_n) + Σ_{k=1}^{n} f[x_n, x_{n−1}, · · · , x_{n−k}] (−1)^k h^k k! \binom{−s}{k},
where the binomial formula is extended to include all real values s:
\binom{−s}{k} = (−s)(−s − 1) · · · (−s − k + 1)/k! = (−1)^k s(s + 1) · · · (s + k − 1)/k!.
As with the forward difference operator, we introduce the backward difference operator, denoted by ∇ (nabla):
∇f(x_i) = f(x_i) − f(x_{i−1}),
∇^k f(x_i) = ∇^{k−1}(∇f(x_i)) = ∇^{k−1}[f(x_i) − f(x_{i−1})].
Then
f[x_n, x_{n−1}] = (f(x_n) − f(x_{n−1}))/(x_n − x_{n−1}) = ∇f(x_n)/h,
f[x_n, x_{n−1}, x_{n−2}] = (f[x_n, x_{n−1}] − f[x_{n−1}, x_{n−2}])/(x_n − x_{n−2}) = ((1/h)∇f(x_n) − (1/h)∇f(x_{n−1}))/(2h) = ∇²f(x_n)/(2! h²).
In general,
f[x_n, x_{n−1}, x_{n−2}, · · · , x_{n−k}] = ∇^k f(x_n)/(k! h^k).
Therefore, using the backward difference operator, the divided-difference formula can be written as
P_n(x) = f(x_n) + Σ_{k=1}^{n} (−1)^k \binom{−s}{k} ∇^k f(x_n).
This is the Newton’s backward difference interpolation formula.
Example 10. Using the following table for tan x, approximate its value at 0.71 using Newton’s inter-
polation.

x_i       0.70     0.72     0.74     0.76     0.78
tan x_i   0.84229  0.87707  0.91309  0.95045  0.98926

Sol. As the point x = 0.71 lies in the beginning, we will use Newton’s forward interpolation. The
forward difference table is:

xi f (xi ) Δf (xi ) Δ2 f (xi ) Δ3 f (xi ) Δ4 f (xi )


0.70 0.84229
0.72 0.87707 0.03478
0.74 0.91309 0.03602 0.00124
0.76 0.95045 0.03736 0.00134 0.0001
0.78 0.98926 0.03881 0.00145 0.00011 0.00001

Here x0 = 0.70, h = 0.02, x = 0.71 = x0 + sh gives s = 0.5.


The Newton forward difference polynomial (here of degree four) is given by
P_4(x) = P_4(x_0 + sh)
= f(x_0) + sΔf(x_0) + (s(s − 1)/2!)Δ²f(x_0) + (s(s − 1)(s − 2)/3!)Δ³f(x_0) + (s(s − 1)(s − 2)(s − 3)/4!)Δ⁴f(x_0).
Substituting the values from the table (the first entry of each column, starting from the second), we obtain
tan(0.71) ≈ P_4(0.71) = 0.85953.
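The same computation is easily checked in Python. The sketch below (our own helper names, not from the course text) builds the leading forward differences and evaluates the forward formula at x = 0.71:

    # Newton forward-difference interpolation, checked against the tan x table.
    def forward_diff_leading(y):
        """Return [f(x0), Δf(x0), Δ²f(x0), ..., Δⁿf(x0)]."""
        row, diffs = list(y), [y[0]]
        for _ in range(1, len(y)):
            row = [row[i + 1] - row[i] for i in range(len(row) - 1)]
            diffs.append(row[0])
        return diffs

    def newton_forward(x0, h, diffs, x):
        s = (x - x0) / h
        term, result = 1.0, diffs[0]
        for k in range(1, len(diffs)):
            term *= (s - (k - 1)) / k      # accumulates s(s-1)...(s-k+1)/k!
            result += term * diffs[k]
        return result

    y = [0.84229, 0.87707, 0.91309, 0.95045, 0.98926]
    print(newton_forward(0.70, 0.02, forward_diff_leading(y), 0.71))  # ≈ 0.85953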
Example 11. Show that the cubic polynomials
P (x) = 3 − 2(x + 1) + 0(x + 1)(x) + (x + 1)(x)(x − 1)
and
Q(x) = −1 + 4(x + 2) − 3(x + 2)(x + 1) + (x + 2)(x + 1)(x)
both interpolate the given data. Why does this not violate the uniqueness property of interpolating
polynomials?
x -2 -1 0 1 2
f (x) -1 3 1 -1 3

Sol. In the formulation of P(x), the second point −1 is taken as the initial point x_0, while in the formulation of Q(x) the first point −2 is taken as the initial point.
Also (alternatively without drawing the table) P (−2) = Q(−2) = −1, P (−1) = Q(−1) = 3, P (0) =
Q(0) = 1, P (1) = Q(1) = −1, P (2) = Q(2) = 3.
Therefore both the cubic polynomials interpolate the given data. Further the interpolating polynomials
are unique but format of a polynomial is not unique. If P (x) and Q(x) are expanded, they are identical.
The divided difference table for the data is:

x    f(x)   first d.d.   second d.d.   third d.d.   fourth d.d.
−2    −1
−1     3      4
 0     1     −2           −3
 1    −1     −2            0             1
 2     3      4            3             1            0

5. Curve Fitting : Principles of Least Squares


Least-squares, also called “regression analysis”, is one of the most commonly used methods in
numerical computation. Essentially it is a technique for solving a set of equations where there are more
equations than unknowns, i.e. an overdetermined set of equations. Least squares is a computational
procedure for fitting an equation to a set of experimental data points. The criterion of the “best” fit is
that the sum of the squares of the differences between the observed data points, (xi , yi ), and the value
calculated by the fitting equation, is minimum. The goal is to find the parameter values for the model
which best fits the data. The least squares method finds its optimum when the sum E, of squared
residuals
E = Σ_{i=1}^{n} e_i²
is a minimum. A residual is defined as the difference between the actual value of the dependent variable
and the value predicted by the model. Thus
ei = yi − f (xi ).
Least square fit of a straight line: Suppose that we are given a data set (x1 , y1 ), (x2 , y2 ), · · · , (xn , yn )
of observations from an experiment. We are interested in fitting a straight line of the form f (x) = a+bx,
to the given data. Now residuals is given by
ei = yi − (a + bxi ).
Note that ei is a function of parameters a and b. We need to find a and b such that
E = Σ_{i=1}^{n} e_i²
is minimum. The necessary condition for the minimum is given by
∂E ∂E
= 0, = 0.
∂a ∂b
The conditions yield
∂E/∂a = Σ_{i=1}^{n} [y_i − (a + bx_i)](−2) = 0
=⇒ Σ_{i=1}^{n} y_i = na + b Σ_{i=1}^{n} x_i.   (5.1)
∂E/∂b = Σ_{i=1}^{n} [y_i − (a + bx_i)](−2x_i) = 0
=⇒ Σ_{i=1}^{n} x_i y_i = a Σ_{i=1}^{n} x_i + b Σ_{i=1}^{n} x_i².   (5.2)
These equations (5.1-5.2) are called normal equations, which are to be solved to get desired values for
a and b.
Example 12. Obtain the least square straight line fit to the following data

x 0.2 0.4 0.6 0.8 1


f (x) 0.447 0.632 0.775 0.894 1

Sol. The normal equations for fitting a straight line y = a + bx are
Σ_{i=1}^{5} f(x_i) = 5a + b Σ_{i=1}^{5} x_i,
Σ_{i=1}^{5} x_i f(x_i) = a Σ_{i=1}^{5} x_i + b Σ_{i=1}^{5} x_i².
From the data, we have Σ x_i = 3, Σ x_i² = 2.2, Σ f(x_i) = 3.748, and Σ x_i f(x_i) = 2.5224.
Therefore
5a + 3b = 3.748, 3a + 2.2b = 2.5224.
The solution of this system is a = 0.3392 and b = 0.684. The required approximation is y = 0.3392 +
0.684x.

Least square error = Σ_{i=1}^{5} [f(x_i) − (0.3392 + 0.684x_i)]² = 0.00245.
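A short Python sketch (ours, not from the course text) solves the two normal equations of Example 12 by elimination and recomputes the least square error:

    # Least-squares straight line y = a + b x via the normal equations.
    x = [0.2, 0.4, 0.6, 0.8, 1.0]
    y = [0.447, 0.632, 0.775, 0.894, 1.0]
    n = len(x)
    Sx  = sum(x)
    Sy  = sum(y)
    Sxx = sum(t * t for t in x)
    Sxy = sum(t * u for t, u in zip(x, y))
    # Solve  n*a + Sx*b = Sy  and  Sx*a + Sxx*b = Sxy.
    b = (n * Sxy - Sx * Sy) / (n * Sxx - Sx * Sx)
    a = (Sy - b * Sx) / n
    E = sum((u - (a + b * t)) ** 2 for t, u in zip(x, y))
    print(a, b, E)   # ≈ 0.3392, 0.684, 0.00245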

Example 13. Find the least square approximation of second degree for the discrete data

x −2 −1 0 1 2
f (x) 15 1 1 3 19

Sol. We fit a second degree polynomial y = a + bx + cx2 .


By the principle of least squares, we minimize the function
E = Σ_{i=1}^{5} [y_i − (a + bx_i + cx_i²)]².
The necessary condition for the minimum is given by
∂E ∂E ∂E
= 0, = 0, = 0.
∂a ∂b ∂c
The normal equations for fitting a second degree polynomial are
Σ_{i=1}^{5} f(x_i) = 5a + b Σ x_i + c Σ x_i²,
Σ_{i=1}^{5} x_i f(x_i) = a Σ x_i + b Σ x_i² + c Σ x_i³,
Σ_{i=1}^{5} x_i² f(x_i) = a Σ x_i² + b Σ x_i³ + c Σ x_i⁴.
We have Σ x_i = 0, Σ x_i² = 10, Σ x_i³ = 0, Σ x_i⁴ = 34, Σ f(x_i) = 39, Σ x_i f(x_i) = 10, and Σ x_i² f(x_i) = 140.
Substituting these values, we obtain
5a + 10c = 39
10b = 10
10a + 34c = 140.
The solution of this system is a = −37/35, b = 1, and c = 31/7.
The required approximation is y = (1/35)(−37 + 35x + 155x²).

Example 14. Use the method of least squares to fit the curve f(x) = c_0 x + c_1/√x to the following data. Also find the least square error.

x      0.2   0.3   0.5   1   2
f(x)   16    14    11    6   3
Sol. By the principle of least squares, we minimize the error
E(c_0, c_1) = Σ_{i=1}^{5} [f(x_i) − c_0 x_i − c_1/√x_i]².
We obtain the normal equations
c_0 Σ_{i=1}^{5} x_i² + c_1 Σ_{i=1}^{5} √x_i = Σ_{i=1}^{5} x_i f(x_i),
c_0 Σ_{i=1}^{5} √x_i + c_1 Σ_{i=1}^{5} (1/x_i) = Σ_{i=1}^{5} f(x_i)/√x_i.
We have
Σ √x_i = 4.1163, Σ (1/x_i) = 11.8333, Σ x_i² = 5.38,
Σ x_i f(x_i) = 24.9, Σ f(x_i)/√x_i = 85.0151.
The normal equations are given by
5.38c_0 + 4.1163c_1 = 24.9,
4.1163c_0 + 11.8333c_1 = 85.0151,
whose solution is c_0 = −1.1836, c_1 = 7.5961.
Therefore, the least square fit is given as
f(x) = 7.5961/√x − 1.1836x.
The least square error is given by
E = Σ_{i=1}^{5} [f(x_i) − 7.5961/√x_i + 1.1836x_i]² = 1.6887.

Example 15. Obtain the least square fit of the form y = abx to the following data

x 1 2 3 4 5 6 7 8
f (x) 1.0 1.2 1.8 2.5 3.6 4.7 6.6 9.1
Sol. The curve y = ab^x takes the form Y = A + Bx after taking logarithms to base 10, where Y = log y, A = log a and B = log b.
Hence the normal equations are given by
Σ_{i=1}^{8} Y_i = 8A + B Σ_{i=1}^{8} x_i,
Σ_{i=1}^{8} x_i Y_i = A Σ_{i=1}^{8} x_i + B Σ_{i=1}^{8} x_i².
From the data, we form the following table. Substituting the values, we obtain

x y Y = log y xY x2
1 1.0 0.0 0.0 1
2 1.2 0.0792 0.1584 4
3 1.8 0.2553 0.7659 9
4 2.5 0.3979 1.5916 16
5 3.6 0.5563 2.7815 25
6 4.7 0.6721 4.0326 36
7 6.6 0.8195 5.7365 49
8 9.1 0.9590 7.6720 64
Σ 36 30.5 3.7393 22.7385 204

8A + 36B = 3.7393, 36A + 204B = 22.7385
=⇒ A = −0.1660, B = 0.1407
=⇒ a = 10^A = 0.6824, b = 10^B = 1.3826.
The required curve is y = (0.6824)(1.3826)^x.
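The linearized fit is easy to recompute, which is how the constants above were checked. A Python sketch (our own variable names; assumes nothing beyond the standard library):

    # Fitting y = a b^x by linearizing with log10 (Example 15).
    from math import log10
    x = list(range(1, 9))
    y = [1.0, 1.2, 1.8, 2.5, 3.6, 4.7, 6.6, 9.1]
    Y = [log10(v) for v in y]
    n = len(x)
    Sx, SY = sum(x), sum(Y)
    Sxx = sum(t * t for t in x)
    SxY = sum(t * v for t, v in zip(x, Y))
    B = (n * SxY - Sx * SY) / (n * Sxx - Sx * Sx)
    A = (SY - B * Sx) / n
    print(A, B, 10 ** A, 10 ** B)   # ≈ -0.1660, 0.1407, 0.6824, 1.3826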

Remark 5.1. If the data values are large, we can reduce them by shifting the origin and applying an appropriate scaling.
Example 16. Show that the line of fit to the following data is given by y = 0.7x + 11.28.
x 0 5 10 15 20 25
y 12 15 17 22 24 30
Sol. Here n = 6. We fit a line of the form y = A + Bx.
Let u = (x − 15)/5, v = y − 20, and fit a line of the form v = a + bu.
x y u v uv u2
0 12 −3 −8 24 9
5 15 −2 −5 10 4
10 17 −1 −3 3 1
15 22 0 2 0 0
20 24 1 4 4 1
25 30 2 10 20 4
Σ −3 0 61 19
The normal equations are,
0 = 6a − 3b
61 = −3a + 19b.
Solving, a = 1.7428 and b = 3.4857.
Therefore the equation of the line is v = 1.7428 + 3.4857u.
Changing back to the original variables, we obtain
y − 20 = 1.7428 + 3.4857 ((x − 15)/5)
=⇒ y = 11.2857 + 0.6971x.

Exercises
(1) Find the unique polynomial P (x) of degree 2 or less such that
P (1) = 1, P (3) = 27, P (4) = 64
using Lagrange interpolation. Evaluate P (1.05).
(2) For the given functions f (x), let x0 = 1, x1 = 1.25, and x2 = 1.6. Construct Lagrange
interpolation polynomials of degree at most one and at most two to approximate f (1.4), and
find the absolute error.
(a) f(x) = sin πx
(b) f(x) = ∛(x − 1)
(c) f(x) = log_{10}(3x − 1)
(d) f(x) = e^{2x} − x.
(3) Let P3 (x) be the Lagrange interpolating polynomial for the data (0, 0), (0.5, y), (1, 3) and (2, 2).
Find y if the coefficient of x3 in P3 (x) is 6.
(4) Let f (x) = ln(1 + x), x0 = 1, x1 = 1.1. Use Lagrange linear interpolation to find the
approximate value of f (1.04) and obtain a bound on the truncation error.
(5) Construct the Lagrange interpolating polynomials for the following functions, and find a bound
for the absolute error on the interval [x0 , xn ].
(a) f (x) = e2x cos 3x, x0 = 0, x1 = 0.3, x2 = 0.6, n = 2.
(b) f (x) = sin(ln x), x0 = 2.0, x1 = 2.4, x2 = 2.6, n = 2.

(c) f (x) = cos x + sin x, x0 = 0, x1 = 0.25, x2 = 0.5, x3 = 1.0, n = 3.


(6) Use the following values and four-digit rounding arithmetic to construct a third degree Lagrange
polynomial approximation to f (1.09). The function being approximated is f (x) = log10 (tan x).
Use this knowledge to find a bound for the error in the approximation.
f (1.00) = 0.1924, f (1.05) = 0.2414, f (1.10) = 0.2933, f (1.15) = 0.3492.
(7) Use the Lagrange interpolating polynomial of degree three or less and four-digit chopping
arithmetic to approximate cos 0.750 using the following values. Find an error bound for the
approximation.
cos 0.698 = 0.7661, cos 0.733 = 0.7432, cos 0.768 = 0.7193, cos 0.803 = 0.6946.
The actual value of cos 0.750 is 0.7317 (to four decimal places). Explain the discrepancy between
the actual error and the error bound.
(8) Determine the spacing h in a table of equally spaced values of the function f(x) = √x between 1 and 2, so that interpolation with a quadratic polynomial will yield an accuracy of 5 × 10^{−8}.
(9) Use Neville’s method to obtain the approximations for Lagrange interpolating polynomials of
degrees one, two, and three to approximate each of the following:
(a) f (8.4) if f (8.1) = 16.94410, f (8.3) = 17.56492, f (8.6) = 18.50515, f (8.7) = 18.82091.
(b) f (−1/3) if f (−0.75) = −0.07181250, f (−0.5) = −0.02475000, f (−0.25) = 0.33493750, f (0) =
1.10100000.
(10) Use Neville's method to approximate √3 with the following functions and values.
(a) f(x) = 3^x and the values x_0 = −2, x_1 = −1, x_2 = 0, x_3 = 1, and x_4 = 2.
(b) f(x) = √x and the values x_0 = 0, x_1 = 1, x_2 = 2, x_3 = 4, and x_4 = 5.
(c) Compare the accuracy of the approximation in parts (a) and (b).
(11) Let P3 (x) be the interpolating polynomial for the data (0, 0), (0.5, y), (1, 3), and (2, 2). Use
Neville’s method to find y if P3 (1.5) = 0.
(12) Neville’s Algorithm is used to approximate f (0) using f (−2), f (−1), f (1), and f (2). Suppose
f (−1) was understated by 2 and f (1) was overstated by 3. Determine the error in the original
calculation of the value of the interpolating polynomial to approximate f (0).
(13) If linear interpolation is used to interpolate the error function
f(x) = (2/√π) ∫_0^x e^{−t²} dt,
show that the error of linear interpolation using data (x_0, f_0) and (x_1, f_1) cannot exceed (x_1 − x_0)²/(2√(2πe)).
(14) Using Newton’s divided difference interpolation, construct interpolating polynomials of degree
one, two, and three for the following data. Approximate the specified value using each of the
polynomials.
f (0.43) if f (0) = 1, f (0.25) = 1.64872, f (0.5) = 2.71828, f (0.75) = 4.4816.
(15) Show that the polynomial interpolating (in Newton’s form) the following data has degree 3.

x −2 −1 0 1 2 3
f (x) 1 4 11 16 13 −4
(16) Let f (x) = ex , show that f [x0 , x1 , . . . , xm ] > 0 for all values of m and all distinct equally spaced
nodes {x0 < x1 < · · · < xm }.
(17) Show that the interpolating polynomial for f (x) = xn+1 at n + 1 nodal points x0 , x1 , · · · , xn is
given by
xn+1 − (x − x0 )(x − x1 ) · · · (x − xn ).
(18) The following data are given for a polynomial P (x) of unknown degree
x 0 1 2 3
f (x) 4 9 15 18
Determine the coefficient of x3 in P (x) if all fourth-order forward differences are 1.
(19) Let i0 , i1 , · · · , in be a rearrangement of the integers 0, 1, · · · , n. Show that f [xi0 , xi1 , · · · , xin ] =
f [x0 , x1 , · · · , xn ].
(20) Let f (x) = 1/(1 + x) and let x0 = 0, x1 = 1, x2 = 2. Calculate the divided differences f [x0 , x1 ]
and f [x0 , x1 , x2 ]. Using these divided differences, give the quadratic polynomial P2 (x) that
interpolates f (x) at the given node points {x0 , x1 , x2 }. Graph the error f (x) − P2 (x) on the
interval [0, 2].
(21) Construct the interpolating polynomial that fits the following data using Newton forward and
backward
x 0 0.1 0.2 0.3 0.4 0.5
f (x) −1.5 −1.27 −0.98 −0.63 −0.22 0.25
difference interpolation. Hence find the values of f (x) at x = 0.15 and 0.45.
(22) For a function f, the forward divided differences are given by
x_0 = 0.0   f[x_0]
x_1 = 0.4   f[x_1]        f[x_0, x_1]
x_2 = 0.7   f[x_2] = 6    f[x_1, x_2] = 10    f[x_0, x_1, x_2] = 50/7
Determine the missing entries in the table.
(23) A fourth-degree polynomial P (x) satisfies Δ4 P (0) = 24, Δ3 P (0) = 6, and Δ2 P (0) = 0, where
ΔP (x) = P (x + 1) − P (x). Compute Δ2 P (10).
(24) Show that
f[x_0, x_1, x_2, · · · , x_n, x] = f^{(n+1)}(ξ(x))/(n + 1)!.
(25) Use the method of least squares to fit the linear and quadratic polynomial to the following
data.
x −2 −1 0 1 2
f (x) 15 1 1 3 19
(26) By the method of least square fit a curve of the form y = axb to the following data.
x 2 3 4 5
y 27.8 62.1 110 161

(27) Use the method of least squares to fit a curve y = c0 /x + c1 x to the following data.
x 0.1 0.2 0.4 0.5 1 2
y 21 11 7 6 5 6
(28) An experiment with a periodic process provided the following data:
t◦ 0 50 100 150 200
y 0.754 1.762 2.041 1.412 0.303
Estimate the parameter a and b in the model y = a + b sin t, using the least square approxima-
tion.

Appendix A. Algorithms
Algorithm (Lagrange Interpolation):
• Read the degree n of the polynomial P_n(x).
• Read the values x(i) and y(i) = f(x_i), i = 0, 1, . . . , n.
• Read the point of interpolation p.
• Calculate the Lagrange fundamental polynomials l(i) at x = p using the following loop:
for i = 0 to n
  l(i) = 1.0
  for j = 0 to n
    if j ≠ i
      l(i) = l(i) * (p − x(j))/(x(i) − x(j))
  end j
end i
• Calculate the approximate value of the function at x = p using the following loop:
sum = 0.0
for i = 0 to n
  sum = sum + l(i) * y(i)
end i
• Print sum.
Algorithm (Newton's Divided-Difference Interpolation):
Given n distinct interpolation points x(1), · · · , x(n) and the values y(i) = f(x(i)) at these points, the following algorithm computes the matrix D of divided differences:
D = zeros(n, n);
for i = 1 : n
  D(i, 1) = y(i);
end
for j = 2 : n
  for k = j : n
    D(k, j) = (D(k, j − 1) − D(k − 1, j − 1))/(x(k) − x(k − j + 1));
  end
end

Now compute the value at interpolating point p using nesting:


f p = D(n, n);
for i = n − 1 : −1 : 1
f p = f p ∗ (p − x(i)) + D(i, i);
end i
Print Matrix D and f p.

Bibliography
[Burden] Richard L. Burden, J. Douglas Faires and Annette Burden, “Numerical Analysis,” Cengage
Learning, 10th edition, 2015.
[Atkinson] K. Atkinson and W. Han, “Elementary Numerical Analysis,” John Wiley & Sons, 3rd edition, 2004.
CHAPTER 6 (4 LECTURES)
NUMERICAL INTEGRATION

1. Introduction
The general problem is to find the approximate value of the integral of a given function f (x) over
an interval [a, b]. Thus
I = ∫_a^b f(x) dx.   (1.1)
The problem can be solved by using the Fundamental Theorem of Calculus: find an anti-derivative F of f, that is, a function with F′(x) = f(x); then
∫_a^b f(x) dx = F(b) − F(a).
But finding an anti-derivative is not an easy task in general. Hence, it is certainly not a good approach
for numerical computations.
In this chapter we’ll study methods for finding integration rules. We’ll also consider composite versions
of these rules and the errors associated with them.

2. Elements of numerical integration

The basic method involved in approximating the integral is called numerical quadrature and uses a sum of the type
∫_a^b f(x) dx ≈ Σ_{i=0}^{n} λ_i f(x_i).   (2.1)
The method of quadrature is based on polynomial interpolation. We divide the interval [a, b] into a set of distinct nodes {x_0, x_1, x_2, · · · , x_n}. Then we approximate the function f(x) by an interpolating polynomial; say the Lagrange interpolating polynomial is used to approximate f(x), i.e.
f(x) = P_n(x) + Error = Σ_{i=0}^{n} f(x_i) l_i(x) + (f^{(n+1)}(ξ)/(n + 1)!) Π_{i=0}^{n} (x − x_i).

Here ξ = ξ(x) ∈ (a, b) and
l_i(x) = Π_{j=0, j≠i}^{n} (x − x_j)/(x_i − x_j), 0 ≤ i ≤ n.
Therefore
∫_a^b f(x) dx = ∫_a^b P_n(x) dx + ∫_a^b e_n(x) dx
= Σ_{i=0}^{n} f(x_i) ∫_a^b l_i(x) dx + (1/(n + 1)!) ∫_a^b f^{(n+1)}(ξ) Π_{i=0}^{n} (x − x_i) dx
= Σ_{i=0}^{n} λ_i f(x_i) + E(f),
where
λ_i = ∫_a^b l_i(x) dx.
The error in the numerical quadrature is given by
E(f) = (1/(n + 1)!) ∫_a^b f^{(n+1)}(ξ) Π_{i=0}^{n} (x − x_i) dx.

We can also use Newton divided difference interpolation to approximate the function f (x).

3. Newton-Cotes Formula
Let all nodes be equally spaced with spacing h = (b − a)/n. The number h is also called the step length. Let x_0 = a and x_n = b; then x_i = a + ih, i = 0, 1, · · · , n.
The general quadrature formula is given by
∫_a^b f(x) dx = Σ_{i=0}^{n} λ_i f(x_i) + E(f).

This formula is called Newton-Cotes formula if all points are equally spaced. We now derive rules by
taking one and two degree interpolating polynomials.

3.1. Trapezoidal Rule. We derive the trapezoidal rule for approximating ∫_a^b f(x) dx using the linear Lagrange polynomial. Let x_0 = a, x_1 = b, and h = b − a. Then
∫_{a=x_0}^{b=x_1} f(x) dx = ∫_{x_0}^{x_1} P_1(x) dx + E(f).

We calculate both integrals separately:
∫_{x_0}^{x_1} P_1(x) dx = ∫_{x_0}^{x_1} [l_0(x)f(x_0) + l_1(x)f(x_1)] dx
= f(x_0) ∫_{x_0}^{x_1} (x − x_1)/(x_0 − x_1) dx + f(x_1) ∫_{x_0}^{x_1} (x − x_0)/(x_1 − x_0) dx
= f(x_0) [(x − x_1)²/(2(x_0 − x_1))]_{x_0}^{x_1} + f(x_1) [(x − x_0)²/(2(x_1 − x_0))]_{x_0}^{x_1}
= ((x_1 − x_0)/2) [f(x_0) + f(x_1)]
= (h/2) [f(a) + f(b)].

E(f) = ∫_{x_0}^{x_1} (f^{(2)}(ξ)/2!) (x − x_0)(x − x_1) dx = (1/2) ∫_{x_0}^{x_1} f^{(2)}(ξ)(x − x_0)(x − x_1) dx.
Since (x − x_0)(x − x_1) does not change sign on [x_0, x_1], by the Weighted Mean-Value Theorem there exists a point ξ ∈ (x_0, x_1) such that
E(f) = (f^{(2)}(ξ)/2) ∫_{x_0}^{x_1} (x − x_0)(x − x_1) dx = (f^{(2)}(ξ)/2) ((x_0 − x_1)³/6) = −(h³/12) f^{(2)}(ξ).
Thus the integration formula is
∫_a^b f(x) dx = (h/2)[f(a) + f(b)] − (h³/12) f^{(2)}(ξ).

Geometrically, it is the area of Trapezium (Trapezoid) with width h and ordinates f (a) and f (b).

3.2. Simpson's Rule. We take the second degree Lagrange interpolating polynomial, with n = 2, x_0 = a, x_1 = (a + b)/2, x_2 = b, h = (b − a)/2:
∫_{a=x_0}^{b=x_2} f(x) dx = ∫_{x_0}^{x_2} P_2(x) dx + E(f),
∫_{x_0}^{x_2} P_2(x) dx = ∫_{x_0}^{x_2} [l_0(x)f(x_0) + l_1(x)f(x_1) + l_2(x)f(x_2)] dx = λ_0 f(x_0) + λ_1 f(x_1) + λ_2 f(x_2).
The values of the multipliers λ_0, λ_1, and λ_2 are given by
λ_0 = ∫_{x_0}^{x_2} (x − x_1)(x − x_2)/((x_0 − x_1)(x_0 − x_2)) dx.
To simplify this integral, we substitute x = x_0 + ht, dx = h dt, and change the limits to 0 and 2 accordingly. Therefore
λ_0 = ∫_0^2 [h(t − 1) h(t − 2)]/[(−h)(−2h)] h dt = h/3.
Similarly
λ_1 = ∫_{x_0}^{x_2} (x − x_0)(x − x_2)/((x_1 − x_0)(x_1 − x_2)) dx = ∫_0^2 [h(t − 0) h(t − 2)]/[(h)(−h)] h dt = 4h/3,
and
λ_2 = ∫_{x_0}^{x_2} (x − x_0)(x − x_1)/((x_2 − x_0)(x_2 − x_1)) dx = ∫_0^2 [h(t − 0) h(t − 1)]/[(2h)(h)] h dt = h/3.
Now the error is given by
E(f) = (1/3!) ∫_{x_0}^{x_2} f′′′(ξ)(x − x_0)(x − x_1)(x − x_2) dx.

Since (x − x_0)(x − x_1)(x − x_2) changes its sign in the interval [x_0, x_2], we cannot apply the Weighted Mean-Value Theorem (as we did for the trapezoidal rule). Also
∫_{x_0}^{x_2} (x − x_0)(x − x_1)(x − x_2) dx = 0.
We can add an interpolation point without affecting the integral of the interpolated polynomial, leaving the error unchanged. We can therefore do the error analysis of Simpson's rule with any single point added; since adding any point in [a, b] does not affect the integral, we simply double the midpoint, so that our node set is {x_0 = a, x_1 = (a + b)/2, x_1 = (a + b)/2, x_2 = b}. Examining the error of the next interpolating polynomial, we get
E(f) = (1/4!) ∫_{x_0}^{x_2} f^{(4)}(ξ)(x − x_0)(x − x_1)²(x − x_2) dx.
Now the product (x − x_0)(x − x_1)²(x − x_2) does not change its sign in [x_0, x_2], so by the Weighted Mean-Value Theorem there exists a point ξ ∈ (x_0, x_2) such that
E(f) = (f^{(4)}(ξ)/24) ∫_{x_0}^{x_2} (x − x_0)(x − x_1)²(x − x_2) dx = −(f^{(4)}(ξ)/2880) (x_2 − x_0)⁵ = −(h⁵/90) f^{(4)}(ξ).
Hence
� b � � � �
h a+b h5
f (x)dx = f (a) + 4f + f (b) − f (4) (ξ).
a 3 2 90
1
This rule is called Simpson’s rule.
3
Similarly by taking third order Lagrange interpolating polynomial with four nodes a = x0 , x1 , x2 , x3 = b
b−a 3
with h = , we get the next integration formula known as Simpson’s rule given below.
3 8
� b
3h 3
f (x)dx = [f (x0 ) + 3f (x1 ) + 3f (x2 ) + f (x3 )] − h5 f (4) (ξ).
a 8 80
Definition 3.1. The degree of accuracy, or precision, or order of a quadrature formula is the largest positive integer n such that the formula is exact for x^k, for each k = 0, 1, · · · , n.
In other words, an integration method of the form
∫_a^b f(x) dx = Σ_{i=0}^{n} λ_i f(x_i) + (1/(n + 1)!) ∫_a^b f^{(n+1)}(ξ) Π_{i=0}^{n} (x − x_i) dx
is said to be of order n if it provides exact results for all polynomials of degree less than or equal to n, so that the error term is zero for all polynomials of degree ≤ n.
The trapezoidal rule has degree of precision one and Simpson's rule has three.
Example 1. Find the value of the integral
I = ∫_0^1 dx/(1 + x)
using the trapezoidal and Simpson's rules. Also obtain a bound on the errors, and compare with the exact value.
Sol. Here f(x) = 1/(1 + x). By the trapezoidal rule,
I_T = (h/2)[f(a) + f(b)],
with a = 0, b = 1, h = b − a = 1:
I_T = (1/2)[1 + 1/2] = 0.75.
The exact value is I_exact = ln 2 = 0.693147, so the error is |0.75 − 0.693147| = 0.056853.
The error bound for the trapezoidal rule is given by
E(f) ≤ (h³/12) max_{0≤ξ≤1} |f′′(ξ)| = (1/12) max_{0≤ξ≤1} |2/(1 + ξ)³| = 1/6.
Similarly, by using Simpson's rule with h = (b − a)/2 = 1/2, we obtain
I_S = (h/3)[f(0) + 4f(1/2) + f(1)] = (1/6)(1 + 8/3 + 1/2) = 0.69444.
Error = |0.69444 − 0.693147| = 0.001297.
The error bound for Simpson's rule is given by
E(f) ≤ (h⁵/90) max_{0≤ξ≤1} |f^{(4)}(ξ)| = (1/2880) max_{0≤ξ≤1} |24/(1 + ξ)⁵| = 0.008333.
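Both single-interval rules are one-liners in code. The following Python sketch (our own helper names) reproduces the two approximations of Example 1:

    # Basic trapezoidal and Simpson's 1/3 rules, checked against Example 1.
    def trapezoid(f, a, b):
        return (b - a) / 2 * (f(a) + f(b))

    def simpson(f, a, b):
        h = (b - a) / 2
        return h / 3 * (f(a) + 4 * f((a + b) / 2) + f(b))

    f = lambda x: 1 / (1 + x)
    print(trapezoid(f, 0, 1))   # 0.75
    print(simpson(f, 0, 1))     # ≈ 0.69444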
Example 2. Find the quadrature formula, by the method of undetermined coefficients,
∫_0^1 f(x)/√(x(1 − x)) dx = a f(0) + b f(1/2) + c f(1)
which is exact for polynomials of the highest possible degree. Then use the formula to evaluate
∫_0^1 dx/√(x − x³).
Sol. We make the method exact for polynomials up to degree 2:
f(x) = 1 :  I_1 = ∫_0^1 dx/√(x(1 − x)) = a + b + c,
f(x) = x :  I_2 = ∫_0^1 x dx/√(x(1 − x)) = b/2 + c,
f(x) = x² :  I_3 = ∫_0^1 x² dx/√(x(1 − x)) = b/4 + c.
Now, with the substitution t = 2x − 1,
I_1 = ∫_0^1 dx/√(x(1 − x)) = ∫_{−1}^{1} dt/√(1 − t²) = [sin^{−1} t]_{−1}^{1} = π.
Similarly
I_2 = π/2, I_3 = 3π/8.
Therefore
a + b + c = π,
b/2 + c = π/2,
b/4 + c = 3π/8.
Solving these equations, we obtain a = π/4, b = π/2, c = π/4. Hence
∫_0^1 f(x)/√(x(1 − x)) dx = (π/4)[f(0) + 2f(1/2) + f(1)].
For the given integral,
I = ∫_0^1 dx/√(x − x³) = ∫_0^1 dx/(√(1 + x) √(x(1 − x))),
so here f(x) = 1/√(1 + x). Using the above formula, we obtain
I = (π/4)[1 + 2√2/√3 + √2/2] = 2.62331.
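The moment system of Example 2 can also be solved numerically. A sketch using numpy (assumed available; the matrix layout is ours):

    # Method of undetermined coefficients: solving the 3x3 moment system.
    import numpy as np
    from math import pi

    M = np.array([[1.0, 1.0,  1.0],    # f = 1   : a + b + c = pi
                  [0.0, 0.5,  1.0],    # f = x   : b/2 + c   = pi/2
                  [0.0, 0.25, 1.0]])   # f = x^2 : b/4 + c   = 3*pi/8
    rhs = np.array([pi, pi / 2, 3 * pi / 8])
    a, b, c = np.linalg.solve(M, rhs)
    print(a, b, c)                     # pi/4, pi/2, pi/4

    f = lambda x: 1 / (1 + x) ** 0.5
    print(a * f(0) + b * f(0.5) + c * f(1))   # ≈ 2.6233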

4. Composite Integration
As the order of the integration method is increased, the order of the derivative involved in the error term also increases. Therefore, we can use a higher-order method only if the integrand is differentiable up to the required degree. Alternatively, we can apply lower-order methods by dividing the whole interval into subintervals and then using any Newton-Cotes or Gauss quadrature method on each subinterval separately.
Composite Trapezoidal Method: We divide the interval [a, b] into N subintervals with step size h = (b − a)/N, taking nodal points a = x_0 < x_1 < · · · < x_N = b where x_i = x_0 + ih, i = 1, 2, · · · , N − 1. Now
I = ∫_a^b f(x) dx = ∫_{x_0}^{x_1} f(x) dx + ∫_{x_1}^{x_2} f(x) dx + · · · + ∫_{x_{N−1}}^{x_N} f(x) dx.

Now using the trapezoidal rule for each of the integrals on the right side, we obtain
I = (h/2)[(f(x_0) + f(x_1)) + (f(x_1) + f(x_2)) + · · · + (f(x_{N−1}) + f(x_N))] − (h³/12)[f^{(2)}(ξ_1) + f^{(2)}(ξ_2) + · · · + f^{(2)}(ξ_N)]
= (h/2)[f(x_0) + f(x_N) + 2 Σ_{i=1}^{N−1} f(x_i)] − (h³/12) Σ_{i=1}^{N} f^{(2)}(ξ_i).
This formula is the composite trapezoidal rule, where x_{i−1} ≤ ξ_i ≤ x_i, i = 1, 2, · · · , N.
The error associated with this approximation is
E(f) = −(h³/12) Σ_{i=1}^{N} f^{(2)}(ξ_i).

If f ∈ C²[a, b], the Extreme Value Theorem implies that f^{(2)} assumes its maximum and minimum in [a, b]. Since
min_{x∈[a,b]} f^{(2)}(x) ≤ f^{(2)}(ξ_i) ≤ max_{x∈[a,b]} f^{(2)}(x),
on summing we have
N min_{x∈[a,b]} f^{(2)}(x) ≤ Σ_{i=1}^{N} f^{(2)}(ξ_i) ≤ N max_{x∈[a,b]} f^{(2)}(x)
and
min_{x∈[a,b]} f^{(2)}(x) ≤ (1/N) Σ_{i=1}^{N} f^{(2)}(ξ_i) ≤ max_{x∈[a,b]} f^{(2)}(x).
By the Intermediate Value Theorem, there is a ξ ∈ (a, b) such that
f^{(2)}(ξ) = (1/N) Σ_{i=1}^{N} f^{(2)}(ξ_i).
Therefore
E(f) = −(h³/12) N f^{(2)}(ξ),
or, since h = (b − a)/N,
E(f) = −((b − a)/12) h² f^{(2)}(ξ).
Composite Simpson's Method: Simpson's rule requires three abscissas, so choose an even integer N to produce an odd number of nodes, with h = (b − a)/N. As before, we write
I = ∫_a^b f(x) dx = ∫_{x_0}^{x_2} f(x) dx + ∫_{x_2}^{x_4} f(x) dx + · · · + ∫_{x_{N−2}}^{x_N} f(x) dx.
Now using Simpson's rule for each of the integrals on the right side, we obtain
I = (h/3)[(f(x_0) + 4f(x_1) + f(x_2)) + (f(x_2) + 4f(x_3) + f(x_4)) + · · · + (f(x_{N−2}) + 4f(x_{N−1}) + f(x_N))] − (h⁵/90)[f^{(4)}(ξ_1) + f^{(4)}(ξ_2) + · · · + f^{(4)}(ξ_{N/2})]
= (h/3)[f(x_0) + 2 Σ_{i=1}^{N/2−1} f(x_{2i}) + 4 Σ_{i=1}^{N/2} f(x_{2i−1}) + f(x_N)] − (h⁵/90) Σ_{i=1}^{N/2} f^{(4)}(ξ_i).

This formula is called the composite Simpson's rule. The error in the integration rule is given by
E(f) = −(h⁵/90) Σ_{i=1}^{N/2} f^{(4)}(ξ_i).

If f ∈ C⁴[a, b], the Extreme Value Theorem implies that f^{(4)} assumes its maximum and minimum in [a, b]. Since
min_{x∈[a,b]} f^{(4)}(x) ≤ f^{(4)}(ξ_i) ≤ max_{x∈[a,b]} f^{(4)}(x),
on summing we have
(N/2) min_{x∈[a,b]} f^{(4)}(x) ≤ Σ_{i=1}^{N/2} f^{(4)}(ξ_i) ≤ (N/2) max_{x∈[a,b]} f^{(4)}(x)
and
min_{x∈[a,b]} f^{(4)}(x) ≤ (2/N) Σ_{i=1}^{N/2} f^{(4)}(ξ_i) ≤ max_{x∈[a,b]} f^{(4)}(x).
By the Intermediate Value Theorem, there is a ξ ∈ (a, b) such that
f^{(4)}(ξ) = (2/N) Σ_{i=1}^{N/2} f^{(4)}(ξ_i).
Therefore
E(f) = −(h⁵/180) N f^{(4)}(ξ),
or, since h = (b − a)/N,
E(f) = −((b − a)/180) h⁴ f^{(4)}(ξ).
Example 3. Determine the number of subintervals n and the step size h required to approximate
∫_0^2 dx/(x + 4)
to within 10^{−5}, and hence compute the approximation using the composite Simpson's rule.
Sol. Here f(x) = 1/(x + 4), therefore f^{(4)}(x) = 24/(x + 4)⁵ and
max_{x∈[0,2]} |f^{(4)}(x)| = 24/4⁵.
Now the error in the composite Simpson's rule is given by
E(f) = −(h⁴(b − a)/180) f^{(4)}(ξ).
To get the desired accuracy, we require
(2h⁴ × 24)/(4⁵ × 180) < 10^{−5}
=⇒ h < 0.44267.
=⇒ h < 0.44267.
Since n = (b − a)/h > 2/0.44267 = 4.518, and the nearest even integer above is 6, we take a minimum of 6 subintervals to achieve the desired accuracy.
Taking 6 subintervals with h = 2/6 = 1/3 and using the composite Simpson's rule, we obtain
I_S = (1/9)[f(0) + 4{f(1/3) + f(1) + f(5/3)} + 2{f(2/3) + f(4/3)} + f(2)] = 0.405466.
Example 4. Determine the values of h (or n) that will ensure an approximation error of less than 0.00002 when approximating ∫_0^π sin x dx, employing (a) the composite trapezoidal rule and (b) the composite Simpson's rule.
Sol. (a) The error form for the composite trapezoidal rule for f(x) = sin x on [0, π] is
|(π h²/12) f′′(ξ)| = |(π h²/12)(−sin ξ)| = (π h²/12)|sin ξ|.
To ensure sufficient accuracy with this technique we need to have
(π h²/12)|sin ξ| ≤ π h²/12 < 0.00002.
Since h = π/n implies that n = π/h, we need
π³/(12n²) < 0.00002, which implies that n > (π³/(12 × 0.00002))^{1/2} ≈ 359.44,
and the composite trapezoidal rule requires n ≥ 360.
(b) The error form for the composite Simpson's rule for f(x) = sin x on [0, π] is
|(π h⁴/180) f^{(4)}(ξ)| = |(π h⁴/180) sin ξ| = (π h⁴/180)|sin ξ|.
To ensure sufficient accuracy with this technique we need to have
(π h⁴/180)|sin ξ| ≤ π h⁴/180 < 0.00002.
Using again the fact that n = π/h gives
π⁵/(180 n⁴) < 0.00002, which implies that n > (π⁵/(180 × 0.00002))^{1/4} ≈ 17.07.
So the composite Simpson's rule requires only n ≥ 18. Composite Simpson's rule with n = 18 gives
∫_0^π sin x dx ≈ (π/54)[2 Σ_{i=1}^{8} sin(iπ/9) + 4 Σ_{i=1}^{9} sin((2i − 1)π/18)] = 2.0000104.
This is accurate to within about 10^{−5} because the true value is −cos(π) − (−cos(0)) = 2.
Example 5. The area A inside the closed curve y² + x² = cos x is given by
A = 4 ∫_0^α (cos x − x²)^{1/2} dx,
where α is the positive root of the equation cos x = x².


(a) Compute α with three correct decimals.
(b) Use trapezoidal rule to compute the area A with an absolute error less than 0.05.
Sol. (a) Using Newton's method to find the root of the equation
f(x) = cos x − x² = 0,
we obtain the iteration scheme
x_{k+1} = x_k + (cos x_k − x_k²)/(sin x_k + 2x_k), k = 0, 1, 2, · · ·
Starting with x_0 = 0.5, we obtain
x_1 = 0.5 + 0.62758/1.47942 = 0.92420,
x_2 = 0.92420 − 0.25169/2.64655 = 0.82911,
x_3 = 0.82911 − 0.011882/2.39554 = 0.82414,
x_4 = 0.82414 − 0.000033/2.38226 = 0.82413.
Hence the value of α correct to three decimals is 0.824.
(b) Substituting the value of α, we obtain
A = 4 ∫_0^{0.824} (cos x − x²)^{1/2} dx.

Using composite trapezoidal method by taking h = 0.824, 0.412, and 0.206 respectively, we obtain the
following approximations of the area A.
4(0.824)
A = [1 + 0.017753] = 1.67725
2
4(0.412)
A = [1 + 2(0.864047) + 0.017753] = 2.262578
2
4(0.206)
A = [1 + 2(0.967688 + 0.864047 + 0.658115) + 0.017753] = 2.470951.
2

5. Gauss Quadrature
In a numerical integration method, if both the nodes x_i and the multipliers λ_i are unknown, then the method is called Gaussian quadrature. We can obtain the unknowns by making the method exact for polynomials of degree as high as required. The formulas are derived for the interval [−1, 1]; any interval [a, b] can be transformed to [−1, 1] by the transformation x = At + B, which gives a = −A + B and b = A + B, and after solving we get
x = ((b − a)/2) t + (b + a)/2.
As observed for Newton-Cotes quadrature, we can write any integral as
∫_{−1}^{1} f(x) dx = Σ_{i=0}^{n} λ_i f(x_i) + ∫_{−1}^{1} (f^{(n+1)}(ξ)/(n + 1)!) (x − x_0) · · · (x − x_n) dx = Σ_{i=0}^{n} λ_i f(x_i) + E(f).
If the product does not change its sign on the interval concerned, we can write the error as
E(f) = (f^{(n+1)}(ξ)/(n + 1)!) ∫_{−1}^{1} (x − x_0) · · · (x − x_n) dx = (f^{(n+1)}(ξ)/(n + 1)!) C,
where
C = ∫_{−1}^{1} (x − x_0) · · · (x − x_n) dx.
We can compute the value of C by putting f(x) = x^{n+1}, to obtain
∫_a^b x^{n+1} dx = Σ_{i=0}^{n} λ_i x_i^{n+1} + (C/(n + 1)!) (n + 1)!
=⇒ C = ∫_a^b x^{n+1} dx − Σ_{i=0}^{n} λ_i x_i^{n+1}.
The number C is called the error constant. Using this notation, we can write the error term as
E(f) = (C/(n + 1)!) f^{(n+1)}(ξ).
Gauss-Legendre Integration Methods: The technique we describe can be used to determine the nodes and coefficients for formulas that give exact results for higher-degree polynomials.
One-point formula: The formula is given by
∫_{−1}^{1} f(x) dx = λ_0 f(x_0).
The method has two unknowns, λ_0 and x_0. Making the method exact for f(x) = 1, x, we obtain
f(x) = 1 : ∫_{−1}^{1} dx = 2 = λ_0,
f(x) = x : ∫_{−1}^{1} x dx = 0 = λ_0 x_0 =⇒ x_0 = 0.
Therefore the one-point formula is given by
∫_{−1}^{1} f(x) dx = 2f(0).
The error in the approximation is given by
E(f) = (C/2!) f′′(ξ),
where the error constant C is given by (taking f(x) = x²)
C = ∫_{−1}^{1} x² dx − 2f(0) = 2/3.
Hence
E(f) = (1/3) f′′(ξ), −1 < ξ < 1.
Two-point formula:
∫_{−1}^{1} f(x) dx = λ_0 f(x_0) + λ_1 f(x_1).
The method has four unknowns. Making the method exact for f(x) = 1, x, x², x³, we obtain
f(x) = 1 : ∫_{−1}^{1} dx = 2 = λ_0 + λ_1,   (5.1)
f(x) = x : ∫_{−1}^{1} x dx = 0 = λ_0 x_0 + λ_1 x_1,   (5.2)
f(x) = x² : ∫_{−1}^{1} x² dx = 2/3 = λ_0 x_0² + λ_1 x_1²,   (5.3)
f(x) = x³ : ∫_{−1}^{1} x³ dx = 0 = λ_0 x_0³ + λ_1 x_1³.   (5.4)

Now eliminating λ_0 from the second and fourth equations gives
λ_1 x_1³ − λ_1 x_1 x_0² = 0, i.e. λ_1 x_1 (x_1 − x_0)(x_1 + x_0) = 0.
Since λ_1 ≠ 0, x_0 ≠ x_1 and x_1 ≠ 0 (if x_1 = 0 then by the second equation x_0 = 0), we must have x_1 = −x_0. Substituting in the second equation, we obtain λ_0 = λ_1, and substituting these values in the first equation, we get λ_0 = λ_1 = 1.
The third equation gives x_0² = 1/3, so x_0 = ±1/√3 and x_1 = ∓1/√3.
Therefore, the two-point formula is given by
∫_{−1}^{1} f(x) dx = f(−1/√3) + f(1/√3).
The error is given by
E(f) = (C/4!) f^{(4)}(ξ),
where (taking f(x) = x⁴)
C = ∫_{−1}^{1} x⁴ dx − [f(−1/√3) + f(1/√3)] = 8/45.
Hence the error in the two-point formula is
E(f) = (1/135) f^{(4)}(ξ), −1 < ξ < 1.
Three-point formula: Taking n = 2, we obtain
∫_{−1}^{1} f(x) dx = λ_0 f(x_0) + λ_1 f(x_1) + λ_2 f(x_2).
The method has six unknowns. Making the method exact for f(x) = 1, x, x², x³, x⁴, x⁵, we obtain
f(x) = 1 : 2 = λ_0 + λ_1 + λ_2
f(x) = x : 0 = λ_0 x_0 + λ_1 x_1 + λ_2 x_2
f(x) = x² : 2/3 = λ_0 x_0² + λ_1 x_1² + λ_2 x_2²
f(x) = x³ : 0 = λ_0 x_0³ + λ_1 x_1³ + λ_2 x_2³
f(x) = x⁴ : 2/5 = λ_0 x_0⁴ + λ_1 x_1⁴ + λ_2 x_2⁴
f(x) = x⁵ : 0 = λ_0 x_0⁵ + λ_1 x_1⁵ + λ_2 x_2⁵
Solving these equations, we obtain λ_0 = λ_2 = 5/9, λ_1 = 8/9, x_0 = ±√(3/5), x_1 = 0 and x_2 = ∓√(3/5).
Therefore the formula is given by
∫_{−1}^{1} f(x) dx = (1/9)[5f(−√(3/5)) + 8f(0) + 5f(√(3/5))].
The error in the three-point formula is given by
E_5 = (C/6!) f^{(6)}(ξ),
where
C = ∫_{−1}^{1} x⁶ dx − (1/9)[5(−√(3/5))⁶ + 8 × 0 + 5(√(3/5))⁶] = 8/175.
∴ E_5 = (1/15750) f^{(6)}(ξ), −1 < ξ < 1.
Note: The Legendre polynomial P_n(x) is a monic polynomial of degree n. The first few Legendre polynomials are
P_0(x) = 1, P_1(x) = x, P_2(x) = x² − 1/3, P_3(x) = x³ − (3/5)x.
The nodes in the Gauss-Legendre rules are the roots of these polynomials.
Example 6. Evaluate
I = ∫_1^2 2x/(1 + x⁴) dx
using the Gauss-Legendre 1- and 2-point formulas. Also compare with the exact value.
Sol. First we change the interval [1, 2] into [−1, 1] by taking x = (t + 3)/2, dx = dt/2:
I = ∫_1^2 2x/(1 + x⁴) dx = ∫_{−1}^{1} 8(t + 3)/(16 + (t + 3)⁴) dt.
Let
f(t) = 8(t + 3)/(16 + (t + 3)⁴).
By the 1-point formula,
I ≈ 2f(0) = 0.4948.
By the 2-point formula,
I ≈ f(−1/√3) + f(1/√3) = f(−0.57735) + f(0.57735) = 0.5434.
Now the exact value of the integral is
I = ∫_1^2 2x/(1 + x⁴) dx = tan^{−1} 4 − π/4 = 0.5404.
Therefore the errors in the one- and two-point formulas are |0.4948 − 0.5404| = 0.0456 and |0.5434 − 0.5404| = 0.0030, respectively.
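A small Python sketch (our own function names) maps [a, b] onto [−1, 1] and applies the 2- and 3-point node/weight sets derived above to the integral of Example 6:

    # Gauss-Legendre quadrature on [a, b] via the affine map to [-1, 1].
    from math import sqrt

    def gauss_legendre(f, a, b, nodes_weights):
        mid, half = (a + b) / 2, (b - a) / 2     # x = mid + half * t
        return half * sum(w * f(mid + half * t) for t, w in nodes_weights)

    two_point   = [(-1 / sqrt(3), 1.0), (1 / sqrt(3), 1.0)]
    three_point = [(-sqrt(3 / 5), 5 / 9), (0.0, 8 / 9), (sqrt(3 / 5), 5 / 9)]

    f = lambda x: 2 * x / (1 + x ** 4)
    print(gauss_legendre(f, 1, 2, two_point))    # ≈ 0.5434
    print(gauss_legendre(f, 1, 2, three_point))  # ≈ 0.5406, close to 0.5404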
Example 7. Evaluate
I = ∫_{−1}^{1} (1 − x²)^{3/2} cos x dx
using the Gauss-Legendre 3-point formula.
Sol. Using the Gauss-Legendre 3-point formula with f(x) = (1 − x²)^{3/2} cos x, we obtain
I = (1/9)[5f(−√(3/5)) + 8f(0) + 5f(√(3/5))]
= (1/9)[5(2/5)^{3/2} cos(√(3/5)) + 8 + 5(2/5)^{3/2} cos(√(3/5))]
= 1.08979.
Example 8. Determine constants a, b, c, and d that will produce a quadrature formula
∫_{−1}^{1} f(x) dx = a f(−1) + b f(1) + c f′(−1) + d f′(1)
that has degree of precision 3.
Sol. We want the formula to hold for the polynomials 1, x, x², x³. Plugging these into the formula, we obtain:
f(x) = 1 : ∫_{−1}^{1} dx = 2 = a · 1 + b · 1 + c · 0 + d · 0
f(x) = x : ∫_{−1}^{1} x dx = 0 = a · (−1) + b · 1 + c · 1 + d · 1
f(x) = x² : ∫_{−1}^{1} x² dx = 2/3 = a · 1 + b · 1 + c · (−2) + d · 2
f(x) = x³ : ∫_{−1}^{1} x³ dx = 0 = a · (−1) + b · 1 + c · 3 + d · 3.
We have 4 equations in 4 unknowns:
a + b = 2,
−a + b + c + d = 0,
a + b − 2c + 2d = 2/3,
−a + b + 3c + 3d = 0.
Solving this system, we obtain
a = 1, b = 1, c = 1/3, d = −1/3.
Thus, the quadrature formula with precision 3 is
∫_{−1}^{1} f(x) dx = f(−1) + f(1) + (1/3) f′(−1) − (1/3) f′(1).
Example 9. Evaluate
I = ∫_0^1 dx/(1 + x)
by subdividing the interval [0, 1] into two equal parts and then using the Gauss-Legendre three-point formula
∫_{−1}^{1} f(x) dx = (1/9)[5f(−√(3/5)) + 8f(0) + 5f(√(3/5))].
Sol. Let
I = ∫_0^1 dx/(1 + x) = ∫_0^{1/2} dx/(1 + x) + ∫_{1/2}^{1} dx/(1 + x) = I_1 + I_2.
Now substitute x = (t + 1)/4 and x = (z + 3)/4 in I_1 and I_2, respectively, to change the limits to [−1, 1]; then dx = dt/4 and dx = dz/4 for I_1 and I_2, respectively. Therefore
I_1 = ∫_{−1}^{1} dt/(t + 5) = (1/9)[5/(5 − √(3/5)) + 8/5 + 5/(5 + √(3/5))] = 0.405464,
I_2 = ∫_{−1}^{1} dz/(z + 7) = (1/9)[5/(7 − √(3/5)) + 8/7 + 5/(7 + √(3/5))] = 0.287682.
Hence
I = I_1 + I_2 = 0.405464 + 0.287682 = 0.693146.

Exercises
(1) Given
I = ∫_0^1 x e^x dx,
approximate the value of I using the trapezoidal and Simpson's one-third methods. Also obtain the error bounds and compare with the exact value of the integral.
(2) Evaluate
I = ∫_0^1 dx/(1 + x²)
using the trapezoidal and Simpson's rules with 4 and 6 subintervals. Compare with the exact value of the integral.
(3) Approximate the following integrals using the trapezoidal and Simpson formulas.
(a) I = ∫_{−0.25}^{0.25} (cos x)² dx.
(b) I = ∫_e^{e+1} dx/(x ln x).
Find a bound for the error using the error formula, and compare this to the actual error.
(4) The quadrature formula ∫_0^2 f(x) dx = c_0 f(0) + c_1 f(1) + c_2 f(2) is exact for all polynomials of degree less than or equal to 2. Determine c_0, c_1, and c_2.
(5) Determine the values of a, b, and c such that the formula
∫_0^h f(x) dx = h[a f(0) + b f(h/3) + c f(h)]

is exact for polynomials of degree as high as possible. Also obtain the degree of the precision.
(6) The length of the curve represented by a function y = f (x) on an interval [a, b] is given by the
integral
I = ∫_a^b √(1 + [f′(x)]²) dx.
Use the trapezoidal rule and Simpson’s rule with 4 and 6 subintervals compute the length of
the graph of the ellipse given with equation 4x2 + 9y 2 = 36.
(7) Determine the values of n and h required to approximate
∫_0^2 e^{2x} sin 3x dx

to within 10−4 . Use composite Trapezoidal and composite Simpson’s rule.


(8) The equation
∫_0^x (1/√(2π)) e^{−t²/2} dt = 0.45
can be solved for x by applying Newton's method to the function
f(x) = ∫_0^x (1/√(2π)) e^{−t²/2} dt − 0.45, with f′(x) = (1/√(2π)) e^{−x²/2}.
Note that Newton’s method would require the evaluation of f (xk ) at various xk which can be
estimated using a quadrature formula. Find a solution for f (x) = 0 with error no more than
10−5 using Newton’s method starting with x0 = 0.5 and by means of the composite Simpson’s
rule.
(9) A car laps a race track in 84 seconds. The speed of the car at each 6-second interval is
determined by using a radar gun and is given from the beginning of the lap, in feet/second, by
the entries in the following table.
Time 0 6 12 18 24 30 36 42 48 54 60 66 72 78 84
Speed 124 134 148 156 147 133 121 109 99 85 78 89 104 116 123
How long is the track?
(10) A particle of mass m moving through a fluid is subjected to a viscous resistance R, which is a function of the velocity v. The relationship between the resistance R, velocity v, and time t is given by the equation
t = ∫_{v(t_0)}^{v(t)} m/R(u) du.
Suppose that R(v) = −v√v for a particular fluid, where R is in newtons and v is in meters/second. If m = 10 kg and v(0) = 10 m/s, approximate the time required for the particle to slow to v = 5 m/s.
(11) Evaluate the integral
∫_{−1}^{1} e^{−x²} cos x dx
by using Gaussian quadrature with n = 1 and n = 2.
(12) Compute by Gaussian quadrature with n = 2 and compare with the exact value of the integral:
∫_3^{3.5} x/√(x² − 4) dx.
(13) Evaluate
I = ∫_0^1 sin x/(2 + x) dx
by subdividing the interval [0, 1] into two equal parts and then using Gaussian quadrature with n = 2.
(14) Consider approximating integrals of the form
I = ∫_0^1 √x f(x) dx,
in which f(x) has several continuous derivatives on [0, 1].
(a) Find a formula
∫_0^1 √x f(x) dx ≈ w_1 f(x_1) = I_1
which is exact if f(x) is any linear polynomial.
which is exact if f (x) is any linear polynomial.
(b) To find a formula
∫_0^1 √x f(x) dx ≈ w_1 f(x_1) + w_2 f(x_2) = I_2
which is exact for all polynomials of degree ≤ 3, set up a system of four equations with unknowns w_1, w_2, x_1, x_2. Verify that
x_1 = (1/9)(5 + 2√(10/7)), x_2 = (1/9)(5 − 2√(10/7)),
w_1 = (1/15)(5 + √(7/10)), w_2 = 2/3 − w_1
is a solution of the system.
(c) Apply I_1 and I_2 to the evaluation of
I = ∫_0^1 √x e^{−x} dx = 0.37894469164.

Appendix A. Algorithms
Algorithm (Composite Trapezoidal Method):
Step 1 : Inputs: function f (x); end points a and b; and N number of subintervals.
Step 2 : Set h = (b − a)/N .
Step 3 : Set sum = 0
Step 4 : For i = 1 to N − 1
Step 5 : Set x = a + h ∗ i
Step 6 : Set sum = sum+2 ∗ f (x)
end
Step 7 : Set sum = sum+f (a) + f (b)
Step 8 : Set ans = sum∗(h/2)
End

Algorithm (Composite Simpson’s Method):


Step 1 : Inputs: function f (x); end points a and b; and N number of subintervals (even).
Step 2 : Set h = (b − a)/N .
Step 3 : Set sum = 0
Step 4 : For i = 1 to N − 1
Step 5 : Set x = a + h ∗ i
Step 6 : If rem(i, 2) = 0
sum = sum+2 ∗ f (x)
else
sum = sum+4 ∗ f (x)
end
Step 7 : Set sum = sum+f (a) + f (b)
Step 8 : Set ans = sum∗(h/3)
End

Bibliography
[Burden] Richard L. Burden, J. Douglas Faires and Annette Burden, “Numerical Analysis,” Cengage
Learning, 10th edition, 2015.
[Atkinson] K. Atkinson and W. Han, “Elementary Numerical Analysis,” John Wiley & Sons, 3rd edition, 2004.
CHAPTER 7 (4 LECTURES)
INITIAL-VALUE PROBLEMS FOR ORDINARY DIFFERENTIAL EQUATIONS

1. Introduction
Differential equations are used to model problems in science and engineering that involve the change
of some variable with respect to another. Most of these problems require the solution of an initial-value
problem, that is, the solution to a differential equation that satisfies a given initial condition.
In common real-life situations, the differential equation that models the problem is too complicated to
solve exactly, and one of two approaches is taken to approximate the solution. The first approach is to
modify the problem by simplifying the differential equation to one that can be solved exactly and then
use the solution of the simplified equation to approximate the solution to the original problem. The
other approach, which we will examine in this chapter, uses methods for approximating the solution of
the original problem. This is the approach that is most commonly taken because the approximation
methods give more accurate results and realistic error information.
In this chapter, we discuss the numerical methods for solving the ordinary differential equations of
initial-value problems (IVP) of the form
dy/dt = f(t, y), t ∈ R, y(t_0) = y_0,   (1.1)
where y is a function of t, f is function of t and y, and t0 is called the initial value. The numerical
values of y(t) on an interval containing t0 are to be determined.
We divide the domain [a, b] in to subintervals
a = t0 < t1 < · · · < tN = b.
These points are called mesh points or grid points. Let equal spacing is h. The uniform mesh points
are given by ti = t0 + ih, i = 0, 1, 2, ... The set of points y0 , y1 , · · · , yN are the numerical solution of
the initial-value problem (IVP).

2. Existence and Uniqueness of Solutions


Definition 2.1. A function f (t, y) is said to satisfy a Lipschitz condition in the variable y on some
domain if a constant L > 0 exists with
|f (t, y1 ) − f (t, y2 )| ≤ L |y1 − y2 |,
whenever (t, y1 ) and (t, y2 ) are in domain. The constant L is called a Lipschitz constant for f .
Example 1. Let f(t, x) = x² e^{−t²} sin t be defined on
D = {(t, x) ∈ R² : 0 ≤ x ≤ 2}.
Show that f satisfies a Lipschitz condition.
Sol. Let (t, x_1), (t, x_2) ∈ D. Then
|f(t, x_1) − f(t, x_2)| = |x_1² e^{−t²} sin t − x_2² e^{−t²} sin t|
= |e^{−t²} sin t| |x_1 + x_2| |x_1 − x_2|
≤ (1)(4)|x_1 − x_2|.
Thus we may take L = 4, and f satisfies a Lipschitz condition in D with Lipschitz constant 4.
Example 2. Show that f(t, y) = t|y| satisfies a Lipschitz condition on the set D = {(t, y) : 1 ≤ t ≤ 2 and −3 ≤ y ≤ 4}.
Sol. For each pair of points (t, y_1) and (t, y_2) in D, we have
|f(t, y_1) − f(t, y_2)| = |t| ||y_1| − |y_2||
≤ |t| |y_1 − y_2|
≤ 2|y_1 − y_2|.
Thus f satisfies a Lipschitz condition on D in the variable y with Lipschitz constant L = 2.
Thus f satisfies a Lipschitz condition on D in the variable y with Lipschitz constant L = 2.
Theorem 2.2. If f (t, y) is continuous in a ≤ t ≤ b, −∞ ≤ y ≤ ∞, and
|f (t, y1 ) − f (t, y2 )| ≤ L|y1 − y2 |
for some positive constant L (which means f is Lipschitz continuous in y), then the IVP (1.1) has a
unique solution in the interval [a, b].
Example 3. Show that there is a unique solution to the initial–value problem
y � = 1 + t sin(ty), 0 ≤ t ≤ 2, y(0) = 0.
Sol. Take two points (t, y1 ) and (t, y2 ) in the domain, we have
|f (t, y1 ) − f (t, y2 )| = |(1 + t sin(ty1 )) − (1 + t sin(ty2 ))|
= |t| | sin(ty1 ) − sin(ty2 )|.
Holding t constant and applying the Mean Value Theorem for f (t, y) = sin(ty), we get
| sin(ty1 ) − sin(ty2 )| = |t cos(tξ)| |y1 − y2 |, ξ ∈ (y1 , y2 )
= |t| | cos(tξ)| |y1 − y2 |.
Therefore
|f (t, y1 ) − f (t, y2 )| = t2 | cos(tξ)| |y1 − y2 |,
≤ 4|y1 − y2 |.
Hence f satisfies a Lipschitz condition in the variable y with Lipschitz constant L = 4.
Additionally, f (t, y) is continuous when 0 ≤ t ≤ 2 and −∞ ≤ y ≤ ∞, so Existence Theorem implies
that a unique solution exists to this initial-value problem.
2.1. Picard's method. This method is also known as the method of successive approximations.
We consider the IVP
dy/dt = f(t, y), t ∈ R, y(t_0) = y_0,
and take f(t, y) to be a continuous function on the given domain. The initial value problem is equivalent to the integral equation
y(t) = y_0 + ∫_{t_0}^{t} f(s, y(s)) ds.
Note that y(t) also appears inside the integral, in f(s, y(s)); therefore we take some approximation of y(t) to start the procedure. The successive approximations to the solution are given by
y_0(t) = y_0, y_{k+1}(t) = y_0 + ∫_{t_0}^{t} f(s, y_k(s)) ds, k = 0, 1, 2, · · ·
Equivalently, by a solution of this equation is meant a sequence of continuous functions φ_k which approximate y(t), i.e.,
φ_0(t) = y_0, φ_{k+1}(t) = y_0 + ∫_{t_0}^{t} f(s, φ_k(s)) ds, k = 0, 1, 2, · · ·

Example 4. Consider the initial value problem y′ = ty, y(0) = 1.
Sol. The integral equation corresponding to this problem is
y(t) = 1 + ∫_0^t s y(s) ds.
The successive approximations are given by
φ_0(t) = 1, φ_{k+1}(t) = 1 + ∫_0^t s φ_k(s) ds, k = 0, 1, 2, .....

Thus
φ_1(t) = 1 + ∫_0^t s ds = 1 + t²/2,
φ_2(t) = 1 + ∫_0^t s(1 + s²/2) ds = 1 + t²/2 + t⁴/(2·4),
φ_3(t) = 1 + t²/2 + t⁴/(2·4) + t⁶/(2·4·6).
It may be established by induction that
φ_k(t) = 1 + (t²/2) + (1/2!)(t²/2)² + .... + (1/k!)(t²/2)^k.
We recognize φ_k(t) as the partial sum of the series expansion of the function φ(t) = e^{t²/2}. We know that this series converges for all t, which means that φ_k(t) → φ(t) as k → ∞, for all t ∈ R.
Indeed φ is a solution of the given initial value problem.
Example 5. Generate φ_0(t), φ_1(t), φ_2(t), and φ_3(t) for the following initial-value problem using Picard's method:
y′ = −y + t + 1, 0 ≤ t ≤ 1, y(0) = 1.
Sol.
φ_0(t) = y(0) = 1.
φ_1(t) = 1 + ∫_0^t f(s, φ_0(s)) ds = 1 + ∫_0^t (−1 + s + 1) ds = 1 + t²/2.
φ_2(t) = 1 + ∫_0^t f(s, φ_1(s)) ds = 1 + ∫_0^t [−(1 + s²/2) + s + 1] ds = 1 + ∫_0^t (s − s²/2) ds = 1 + t²/2 − t³/6.
φ_3(t) = 1 + ∫_0^t f(s, φ_2(s)) ds = 1 + ∫_0^t [−(1 + s²/2 − s³/6) + s + 1] ds = 1 + t²/2 − t³/6 + t⁴/24.
These approximations are the partial sums of the Maclaurin series expansion of t + e^{−t}, which is the exact solution of the given IVP.
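The symbolic iterations of Example 5 can be generated automatically. A sketch using sympy (assumed available; the variable names are ours):

    # Picard iteration for y' = -y + t + 1, y(0) = 1, done symbolically.
    import sympy as sp

    t, s = sp.symbols('t s')
    phi = sp.Integer(1)                        # phi_0(t) = y(0) = 1
    for k in range(3):
        integrand = -phi.subs(t, s) + s + 1    # f(s, phi_k(s))
        phi = 1 + sp.integrate(integrand, (s, 0, t))
        print(sp.expand(phi))
    # prints: 1 + t**2/2,  1 + t**2/2 - t**3/6,  1 + t**2/2 - t**3/6 + t**4/24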

2.2. Taylor's Series method. Consider the one dimensional initial value problem
y′ = f(t, y), y(t_0) = y_0,
where f is a function of the two variables t and y, and (t_0, y_0) is a known point on the solution curve.
If the existence of all higher order derivatives of y at some point t = t_i is assumed, then by Taylor series the value of y at any neighboring point t_i + h can be written as
y(t_i + h) = y(t_i) + h y′(t_i) + (h²/2!) y′′(t_i) + (h³/3!) y′′′(t_i) + · · · + O(h^{p+1}).
Since y_i is known at t_i, y′ at t_i can be found by computing f(t_i, y_i). Similarly, higher derivatives of y at t_i can be computed by making use of the relation y′ = f(t, y). Hence the value of y at any neighboring point t_i + h can be obtained by summing the above series. If the series is terminated after the p-th derivative term, then the approximation is called the Taylor series approximation to y of order p, and the error is of order p + 1.
Example 6. Given the IVP y′ = x²y − 1, y(0) = 1, use the Taylor series method of order 4 with step size 0.1 to find y at x = 0.1 and x = 0.2.
y � = x2 y − 1, y �� = 2xy + x2 y � , y ��� = 2y + 4xy � + x2 y �� , y (4) = 6y � + 6xy �� + x2 y ��� .

∴ y � (0) = −1, y �� (0) = 0, y (3) (0) = 2, y (4) (0) = −6.


The fourth-order Taylor’s formula is given by
h2 �� h3 h4
y(xi + h) = y(xi ) + hy � (xi ) + y (xi ) + y ��� (xi ) + y iv (xi ) + O(h5 )
2 3! 4!
Therefore
y(0.1) = 1 + (0.1)(−1) + 0 + (0.1)3 (2)/6 − (0.1)4 (−6)/24 = 0.900033
Similarly
y(0.2) = 0.80227.
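A Python sketch of the order-4 Taylor step for this particular equation, with the derivative formulas of the solution above hard-coded (function name ours):

    # Fourth-order Taylor step for y' = x^2 y - 1 (Example 6).
    def taylor4_step(x, y, h):
        y1 = x**2 * y - 1                       # y'
        y2 = 2*x*y + x**2 * y1                  # y''
        y3 = 2*y + 4*x*y1 + x**2 * y2           # y'''
        y4 = 6*y1 + 6*x*y2 + x**2 * y3          # y''''
        return y + h*y1 + h**2/2*y2 + h**3/6*y3 + h**4/24*y4

    y = 1.0
    for i in range(2):
        y = taylor4_step(0.1 * i, y, 0.1)
        print(0.1 * (i + 1), y)   # ≈ 0.90031 at x = 0.1, ≈ 0.80227 at x = 0.2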

3. Numerical methods for IVP


We consider the following IVP
dy/dt = f(t, y), t ∈ R,   (3.1)
y(t_0) = y_0.   (3.2)
Its integral form is the equation
y(t) = y_0 + ∫_{t_0}^{t} f(s, y(s)) ds.
3.1. Euler’s Method: The Euler method is named after Swiss mathematician Leonhard Euler (1707-
1783). This is one of the simplest methods for solving the IVP. Consider the IVP given in Eqs. (3.1)-(3.2). We can approximate the derivative dy/dt as follows, assuming that all nodes t_i are equally spaced with spacing h and t_{i+1} = t_i + h. By the definition of the derivative,
y′(t_0) ≈ (y(t_0 + h) − y(t_0))/h.
Applying this approximation to the given IVP at the point t = t_0, where y′(t_0) = f(t_0, y_0), gives
(1/h)[y(t_1) − y(t_0)] = f(t_0, y_0)
=⇒ y(t_1) − y(t_0) = h f(t_0, y_0),
which gives
y(t_1) = y(t_0) + h f(t_0, y_0).
In general, we write
ti+1 = ti + h
yi+1 = yi + hf (ti , yi )
where yi = y(ti ). This procedure is called Euler’s method.
Alternatively, we can derive this method from a Taylor series. We write
y(t_{i+1}) = y(t_i + h) = y(t_i) + h y′(t_i) + (h²/2!) y′′(t_i) + · · ·
If we truncate the series after y′(t_i), we obtain
y(t_{i+1}) = y(t_i) + h y′(t_i) =⇒ y(t_{i+1}) = y(t_i) + h f(t_i, y(t_i)) =⇒ y_{i+1} = y_i + h f(t_i, y_i).
If the truncation error contains the term h^{p+1}, then the order of the numerical method is p. Therefore, Euler's method is a first-order method.

3.2. The Improved or Modified Euler's method. We write the integral form of the solution as
dy/dt = f(t, y) ⟺ y(t) = y(t_0) + ∫_{t_0}^{t} f(s, y(s)) ds.
Approximating the integral over [t_0, t_1] using the trapezium rule gives
y(t_1) = y(t_0) + (h/2)[f(t_0, y(t_0)) + f(t_0 + h, y(t_1))], t_1 = t_0 + h.
Using Euler's method to approximate y(t_1) ≈ y(t_0) + h f(t_0, y(t_0)) inside the trapezium rule:
y(t_1) = y(t_0) + (h/2)[f(t_0, y(t_0)) + f(t_1, y(t_0) + h f(t_0, y(t_0)))].
Hence the modified Euler’s scheme
K_1 = h f(t_0, y_0)
K_2 = h f(t_1, y_0 + K_1)
y_1 = y_0 + (K_1 + K_2)/2.
In general, the modified Euler’s scheme is given by


t_{i+1} = t_i + h
K_1 = h f(t_i, y_i)
K_2 = h f(t_{i+1}, y_i + K_1)
y_{i+1} = y_i + (K_1 + K_2)/2.
Example 7. For the IVP
y′ + 2y = 2 − e^{−4t}, y(0) = 1,
taking step size 0.1, find y at t = 0.1 and 0.2 by Euler's method.
Sol.
y′ = −2y + 2 − e^{−4t} = f(t, y), y(0) = 1,
f(0, 1) = −2(1) + 2 − 1 = −1.
By Euler's method with step size h = 0.1:
t_1 = t_0 + h = 0 + 0.1 = 0.1,
y_1 = y_0 + h f(0, 1) = 1 + 0.1(−1) = 0.9,
∴ y_1 = y(0.1) = 0.9.
t_2 = t_0 + 2h = 0 + 2 × 0.1 = 0.2,
y_2 = y_1 + h f(0.1, 0.9) = 0.9 + 0.1(−2 × 0.9 + 2 − e^{−4(0.1)}) = 0.9 + 0.1(−0.47032) = 0.852967,
∴ y_2 = y(0.2) = 0.852967.
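Euler's method in code is a single loop. The following Python sketch (function name ours) reproduces the two steps of Example 7:

    # Euler's method, checked against Example 7.
    from math import exp

    def euler(f, t0, y0, h, steps):
        t, y = t0, y0
        for _ in range(steps):
            y = y + h * f(t, y)   # y_{i+1} = y_i + h f(t_i, y_i)
            t = t + h
        return y

    f = lambda t, y: -2 * y + 2 - exp(-4 * t)
    print(euler(f, 0.0, 1.0, 0.1, 1))   # 0.9
    print(euler(f, 0.0, 1.0, 0.1, 2))   # ≈ 0.852967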

Example 8. For the IVP y′ = t + √y, y(0) = 1, calculate y in the interval [0, 0.6] with h = 0.2 by using the modified Euler's method.
Sol.
y′ = t + √y = f(t, y), t_0 = 0, y_0 = 1, h = 0.2, t_1 = 0.2,
K_1 = h f(t_0, y_0) = 0.2(1) = 0.2,
K_2 = h f(t_1, y_0 + K_1) = h f(0.2, 1.2) = 0.2591,
y_1 = y(0.2) = y_0 + (K_1 + K_2)/2 = 1.22955.
Similarly we can compute solutions at other points.
Example 9. Show that the following initial-value problem has a unique solution:
y′ = t^{−2}(sin 2t − 2ty), 1 ≤ t ≤ 2, y(1) = 2.
Find y(1.1) and y(1.2) with step size h = 0.1 using the modified Euler's method.
Sol.
y′ = t^{−2}(sin 2t − 2ty) = f(t, y).
Holding t constant,
|f(t, y_1) − f(t, y_2)| = |t^{−2}(sin 2t − 2ty_1) − t^{−2}(sin 2t − 2ty_2)| = (2/|t|)|y_1 − y_2| ≤ 2|y_1 − y_2|.
Thus f satisfies a Lipschitz condition in the variable y with Lipschitz constant L = 2. Additionally,
f (t, y) is continuous when 1 ≤ t ≤ 2, and −∞ < y < ∞, so Existence Theorem implies that a unique
solution exists to this initial-value problem.


Now we apply the modified Euler method to find the solution.
\begin{align*}
t_0 &= 1, \; y_0 = 2, \; h = 0.1, \; t_1 = 1.1 \\
K_1 &= h f(t_0, y_0) = h f(1, 2) = -0.309072 \\
K_2 &= h f(t_1, y_0 + K_1) = h f(1.1, 1.690928) = -0.24062 \\
y_1 &= y(1.1) = y_0 + \tfrac{1}{2}(K_1 + K_2) = 1.725152.
\end{align*}
Now $y_1 = 1.725152$, $h = 0.1$, $t_2 = 1.2$. Then
\begin{align*}
K_1 &= -0.24684 \\
K_2 &= -0.19947 \\
y_2 &= y(1.2) = 1.50199.
\end{align*}
Example 10. Given the initial-value problem
\[ y' = \frac{2}{t}\, y + t^2 e^t, \quad 1 \le t \le 2, \quad y(1) = 0, \]
(i) use Euler's method with $h = 0.1$ to approximate the solution in the interval $[1, 1.6]$;
(ii) use the answers generated in part (i) and linear interpolation to approximate $y$ at $t = 1.04$ and $t = 1.55$.
Sol. Given $y' = \frac{2}{t}\, y + t^2 e^t = f(t, y)$ with $t_0 = 1.0$, $y(t_0) = 0.0$, and $h = 0.1$.
By Euler's method, the approximations at successive time levels are given by
\[ y(t_{i+1}) = y(t_i) + h f(t_i, y(t_i)). \]
\begin{align*}
y(t_1) = y(1.1) &= y(1.0) + h f(1, 0) = 0.0 + 0.1\left( \frac{2}{1.0}(0.0) + (1.0)^2 e^{1.0} \right) = 0.271828, \quad t_1 = 1.1 \\
y(t_2) = y(1.2) &= 0.271828 + 0.1\left( \frac{2}{1.1}(0.271828) + (1.1)^2 e^{1.1} \right) = 0.684756, \quad t_2 = 1.2 \\
y(t_3) = y(1.3) &= 0.684756 + 0.1\left( \frac{2}{1.2}(0.684756) + (1.2)^2 e^{1.2} \right) = 1.27698, \quad t_3 = 1.3.
\end{align*}
Similarly,
\[ y(t_4) = y(1.4) = 2.09355, \quad y(t_5) = y(1.5) = 3.18745, \quad y(t_6) = y(1.6) = 4.62082. \]
Now, using linear interpolation, approximate values of $y$ are found as follows:
\begin{align*}
y(1.04) &\approx \frac{1.04 - 1.1}{1.0 - 1.1}\, y(1.0) + \frac{1.04 - 1.0}{1.1 - 1.0}\, y(1.1) = 0.10873120, \\
y(1.55) &\approx \frac{1.55 - 1.6}{1.5 - 1.6}\, y(1.5) + \frac{1.55 - 1.5}{1.6 - 1.5}\, y(1.6) = 3.90413500.
\end{align*}

3.3. Runge-Kutta Methods: These are among the most important methods for solving the IVP. The techniques were developed around 1900 by the German mathematicians C. Runge and M. W. Kutta. Applying Taylor's theorem directly requires the function to have higher-order derivatives; the advantage of the Runge-Kutta class is that it avoids higher-order derivatives altogether. Euler's method is an example of a first-order Runge-Kutta method, and the modified Euler method is an example of a second-order Runge-Kutta method.
Third-order Runge-Kutta methods: As with the modified Euler method, using Simpson's rule to approximate the integral, we obtain the following Runge-Kutta method of order three:
\begin{align*}
t_{i+1} &= t_i + h \\
K_1 &= h f(t_i, y_i) \\
K_2 &= h f(t_i + h/2, \; y_i + K_1/2) \\
K_3 &= h f(t_i + h, \; y_i - K_1 + 2K_2) \\
y_{i+1} &= y_i + \frac{1}{6}(K_1 + 4K_2 + K_3).
\end{align*}
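A single step of this third-order scheme in Python might look as follows (a sketch; rk3_step is our own name):
\begin{verbatim}
def rk3_step(f, t, y, h):
    """One step of the third-order Runge-Kutta scheme above."""
    k1 = h * f(t, y)
    k2 = h * f(t + h/2, y + k1/2)
    k3 = h * f(t + h, y - k1 + 2*k2)
    return y + (k1 + 4*k2 + k3) / 6
\end{verbatim}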
There are different Runge-Kutta methods of order three. A commonly used one is Heun's method, given by
\begin{align*}
t_{i+1} &= t_i + h \\
y_{i+1} &= y_i + \frac{h}{4}\left[ f(t_i, y_i) + 3 f\!\left( t_i + \frac{2h}{3}, \; y_i + \frac{2h}{3} f\!\left( t_i + \frac{h}{3}, \; y_i + \frac{h}{3} f(t_i, y_i) \right) \right) \right].
\end{align*}
Runge-Kutta methods of order three are not generally used. The most common Runge-Kutta method in use is of order four, given by the following.
Fourth-order Runge-Kutta method:
\begin{align*}
t_{i+1} &= t_i + h \\
K_1 &= h f(t_i, y_i) \\
K_2 &= h f(t_i + h/2, \; y_i + K_1/2) \\
K_3 &= h f(t_i + h/2, \; y_i + K_2/2) \\
K_4 &= h f(t_i + h, \; y_i + K_3) \\
y_{i+1} &= y_i + \frac{1}{6}(K_1 + 2K_2 + 2K_3 + K_4) + O(h^5).
\end{align*}
The local truncation error in a Runge-Kutta method is the error that arises in each step because of the truncated Taylor series; this error is inevitable. The fourth-order Runge-Kutta method has a local truncation error of $O(h^5)$.
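A minimal Python sketch of one classical RK4 step (rk4_step is our own name), checked against Example 11 below:
\begin{verbatim}
def rk4_step(f, t, y, h):
    """One step of the classical fourth-order Runge-Kutta method."""
    k1 = h * f(t, y)
    k2 = h * f(t + h/2, y + k1/2)
    k3 = h * f(t + h/2, y + k2/2)
    k4 = h * f(t + h, y + k3)
    return y + (k1 + 2*k2 + 2*k3 + k4) / 6

# Example 11 below: y' = (y^2 - t^2)/(y^2 + t^2), y(0) = 1, h = 0.2
f = lambda t, y: (y*y - t*t) / (y*y + t*t)
y1 = rk4_step(f, 0.0, 1.0, 0.2)   # ~ 1.196
y2 = rk4_step(f, 0.2, y1, 0.2)    # ~ 1.3752
\end{verbatim}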
Example 11. Using the fourth-order Runge-Kutta method, solve
\[ \frac{dy}{dt} = \frac{y^2 - t^2}{y^2 + t^2}, \quad y(0) = 1, \]
at $t = 0.2$ and $t = 0.4$.
Sol.
\begin{align*}
f(t, y) &= \frac{y^2 - t^2}{y^2 + t^2}, \quad t_0 = 0, \; y_0 = 1, \; h = 0.2 \\
K_1 &= h f(t_0, y_0) = 0.2 f(0, 1) = 0.200 \\
K_2 &= h f\left( t_0 + \tfrac{h}{2}, \; y_0 + \tfrac{K_1}{2} \right) = 0.2 f(0.1, 1.1) = 0.19672 \\
K_3 &= h f\left( t_0 + \tfrac{h}{2}, \; y_0 + \tfrac{K_2}{2} \right) = 0.2 f(0.1, 1.09836) = 0.1967 \\
K_4 &= h f(t_0 + h, \; y_0 + K_3) = 0.2 f(0.2, 1.1967) = 0.1891 \\
y_1 &= y_0 + \tfrac{1}{6}(K_1 + 2K_2 + 2K_3 + K_4) = 1 + 0.19599 = 1.196 \\
\therefore\ y(0.2) &= 1.196.
\end{align*}

Now
\begin{align*}
t_1 &= t_0 + h = 0.2 \\
K_1 &= h f(t_1, y_1) = 0.1891 \\
K_2 &= h f\left( t_1 + \tfrac{h}{2}, \; y_1 + \tfrac{K_1}{2} \right) = 0.2 f(0.3, 1.2906) = 0.1795 \\
K_3 &= h f\left( t_1 + \tfrac{h}{2}, \; y_1 + \tfrac{K_2}{2} \right) = 0.2 f(0.3, 1.2858) = 0.1793 \\
K_4 &= h f(t_1 + h, \; y_1 + K_3) = 0.2 f(0.4, 1.3753) = 0.1688 \\
y_2 &= y(0.4) = y_1 + \tfrac{1}{6}(K_1 + 2K_2 + 2K_3 + K_4) = 1.196 + 0.1792 = 1.3752.
\end{align*}

4. Numerical solution of systems and second-order equations

The Euler and Runge-Kutta methods can also be applied to systems of differential equations, and second-order equations can be converted into systems of first-order equations. The application of these methods is explained in the following examples.
Example 12. Solve the following system
\begin{align*}
\frac{dx}{dt} &= 3x - 2y \\
\frac{dy}{dt} &= 5x - 4y \\
x(0) &= 3, \quad y(0) = 6.
\end{align*}
Find the solution at $t = 0.1$ and $t = 0.2$ by Euler's method, taking time increment 0.1.
Sol. Given $t_0 = 0$, $x_0 = 3$, $y_0 = 6$, $h = 0.1$. Write $f(t, x, y) = 3x - 2y$ and $g(t, x, y) = 5x - 4y$.
By Euler's method,
\begin{align*}
x_1 &= x(0.1) = x_0 + h f(t_0, x_0, y_0) = 3 + 0.1(3 \times 3 - 2 \times 6) = 2.7 \\
y_1 &= y(0.1) = y_0 + h g(t_0, x_0, y_0) = 6 + 0.1(5 \times 3 - 4 \times 6) = 5.1.
\end{align*}
Similarly,
\begin{align*}
x_2 &= x(0.2) = x_1 + h f(t_1, x_1, y_1) = 2.7 + 0.1(3 \times 2.7 - 2 \times 5.1) = 2.49 \\
y_2 &= y(0.2) = y_1 + h g(t_1, x_1, y_1) = 5.1 + 0.1(5 \times 2.7 - 4 \times 5.1) = 4.41.
\end{align*}
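A Python sketch of Euler's method for a pair of equations (names are ours), reproducing the numbers above:
\begin{verbatim}
def euler_system(f, g, t0, x0, y0, h, n):
    """Euler's method for the coupled pair x' = f(t,x,y), y' = g(t,x,y)."""
    t, x, y = t0, x0, y0
    for _ in range(n):
        # Both right-hand sides are evaluated at the old (x, y) before updating.
        x, y = x + h * f(t, x, y), y + h * g(t, x, y)
        t = t + h
    return x, y

f = lambda t, x, y: 3*x - 2*y
g = lambda t, x, y: 5*x - 4*y
print(euler_system(f, g, 0.0, 3.0, 6.0, 0.1, 2))   # (2.49, 4.41)
\end{verbatim}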
Example 13. Solve the following system
\[ \frac{dy}{dx} = 1 + xz, \qquad \frac{dz}{dx} = -xy \]
for $x = 0.3$ using the fourth-order Runge-Kutta method, given $y(0) = 0$, $z(0) = 1$.
Sol. Given
\[ \frac{dy}{dx} = 1 + xz = f(x, y, z), \qquad \frac{dz}{dx} = -xy = g(x, y, z), \]
with $x_0 = 0$, $y_0 = 0$, $z_0 = 1$, $h = 0.3$.
\begin{align*}
K_1 &= h f(x_0, y_0, z_0) = 0.3 f(0, 0, 1) = 0.3 \\
L_1 &= h g(x_0, y_0, z_0) = 0.3 g(0, 0, 1) = 0 \\
K_2 &= h f\left( x_0 + \tfrac{h}{2}, \; y_0 + \tfrac{K_1}{2}, \; z_0 + \tfrac{L_1}{2} \right) = 0.3 f(0.15, 0.15, 1) = 0.345 \\
L_2 &= h g\left( x_0 + \tfrac{h}{2}, \; y_0 + \tfrac{K_1}{2}, \; z_0 + \tfrac{L_1}{2} \right) = -0.00675 \\
K_3 &= h f\left( x_0 + \tfrac{h}{2}, \; y_0 + \tfrac{K_2}{2}, \; z_0 + \tfrac{L_2}{2} \right) = 0.34485 \\
L_3 &= h g\left( x_0 + \tfrac{h}{2}, \; y_0 + \tfrac{K_2}{2}, \; z_0 + \tfrac{L_2}{2} \right) = -0.007762 \\
K_4 &= h f(x_0 + h, \; y_0 + K_3, \; z_0 + L_3) = 0.3893 \\
L_4 &= h g(x_0 + h, \; y_0 + K_3, \; z_0 + L_3) = -0.03104.
\end{align*}
Hence
\begin{align*}
y_1 &= y(0.3) = y_0 + \frac{1}{6}(K_1 + 2K_2 + 2K_3 + K_4) = 0.34483 \\
z_1 &= z(0.3) = z_0 + \frac{1}{6}(L_1 + 2L_2 + 2L_3 + L_4) = 0.98999.
\end{align*}
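The same componentwise idea gives RK4 for a pair of equations. A sketch (rk4_system_step is our own name), checked against Example 13 above:
\begin{verbatim}
def rk4_system_step(f, g, x, y, z, h):
    """One RK4 step for the pair y' = f(x, y, z), z' = g(x, y, z)."""
    k1 = h * f(x, y, z)
    l1 = h * g(x, y, z)
    k2 = h * f(x + h/2, y + k1/2, z + l1/2)
    l2 = h * g(x + h/2, y + k1/2, z + l1/2)
    k3 = h * f(x + h/2, y + k2/2, z + l2/2)
    l3 = h * g(x + h/2, y + k2/2, z + l2/2)
    k4 = h * f(x + h, y + k3, z + l3)
    l4 = h * g(x + h, y + k3, z + l3)
    return (y + (k1 + 2*k2 + 2*k3 + k4) / 6,
            z + (l1 + 2*l2 + 2*l3 + l4) / 6)

# Example 13 above: y' = 1 + xz, z' = -xy, y(0) = 0, z(0) = 1, h = 0.3
f = lambda x, y, z: 1 + x*z
g = lambda x, y, z: -x*y
print(rk4_system_step(f, g, 0.0, 0.0, 1.0, 0.3))   # ~ (0.34483, 0.98999)
\end{verbatim}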
6
Example 14. Consider the following Lotka-Volterra system, in which $u$ is the number of prey and $v$ is the number of predators:
\begin{align*}
\frac{du}{dt} &= 2u - uv, \quad u(0) = 1.5 \\
\frac{dv}{dt} &= -9v + 3uv, \quad v(0) = 1.5.
\end{align*}
Use the fourth-order Runge-Kutta method with step size $h = 0.2$ to approximate the solution at $t = 0.2$.
Sol. Write
\[ \frac{du}{dt} = 2u - uv = f(t, u, v), \qquad \frac{dv}{dt} = -9v + 3uv = g(t, u, v), \]
with $u_0 = 1.5$, $v_0 = 1.5$, $h = 0.2$.
\begin{align*}
K_1 &= h f(t_0, u_0, v_0) = 0.15 \\
L_1 &= h g(t_0, u_0, v_0) = -1.35 \\
K_2 &= h f\left( t_0 + \tfrac{h}{2}, \; u_0 + \tfrac{K_1}{2}, \; v_0 + \tfrac{L_1}{2} \right) = 0.370125 \\
L_2 &= h g\left( t_0 + \tfrac{h}{2}, \; u_0 + \tfrac{K_1}{2}, \; v_0 + \tfrac{L_1}{2} \right) = -0.7054 \\
K_3 &= h f\left( t_0 + \tfrac{h}{2}, \; u_0 + \tfrac{K_2}{2}, \; v_0 + \tfrac{L_2}{2} \right) = 0.2874 \\
L_3 &= h g\left( t_0 + \tfrac{h}{2}, \; u_0 + \tfrac{K_2}{2}, \; v_0 + \tfrac{L_2}{2} \right) = -0.9052 \\
K_4 &= h f(t_0 + h, \; u_0 + K_3, \; v_0 + L_3) = 0.5023 \\
L_4 &= h g(t_0 + h, \; u_0 + K_3, \; v_0 + L_3) = -0.4328.
\end{align*}
Therefore
\begin{align*}
u(0.2) &= 1.5 + \frac{1}{6}(0.15 + 2 \times 0.370125 + 2 \times 0.2874 + 0.5023) = 1.8279 \\
v(0.2) &= 1.5 + \frac{1}{6}(-1.35 - 2 \times 0.7054 - 2 \times 0.9052 - 0.4328) = 0.6660.
\end{align*}
6
Example 15. Solve for $x = 0.2$ using the fourth-order Runge-Kutta method:
\[ \frac{d^2 y}{dx^2} = x \left( \frac{dy}{dx} \right)^2 - y^2, \quad y(0) = 1, \quad y'(0) = 0. \]
Sol. Let
\[ \frac{dy}{dx} = z = f(x, y, z). \]
Then
\[ \frac{dz}{dx} = x z^2 - y^2 = g(x, y, z). \]

Now
\begin{align*}
x_0 &= 0, \; y_0 = 1, \; z_0 = 0, \; h = 0.2 \\
K_1 &= h f(x_0, y_0, z_0) = 0.0 \\
L_1 &= h g(x_0, y_0, z_0) = -0.2 \\
K_2 &= h f\left( x_0 + \tfrac{h}{2}, \; y_0 + \tfrac{K_1}{2}, \; z_0 + \tfrac{L_1}{2} \right) = -0.02 \\
L_2 &= h g\left( x_0 + \tfrac{h}{2}, \; y_0 + \tfrac{K_1}{2}, \; z_0 + \tfrac{L_1}{2} \right) = -0.1998 \\
K_3 &= h f\left( x_0 + \tfrac{h}{2}, \; y_0 + \tfrac{K_2}{2}, \; z_0 + \tfrac{L_2}{2} \right) = -0.02 \\
L_3 &= h g\left( x_0 + \tfrac{h}{2}, \; y_0 + \tfrac{K_2}{2}, \; z_0 + \tfrac{L_2}{2} \right) = -0.1958 \\
K_4 &= h f(x_0 + h, \; y_0 + K_3, \; z_0 + L_3) = -0.0392 \\
L_4 &= h g(x_0 + h, \; y_0 + K_3, \; z_0 + L_3) = -0.1905.
\end{align*}
Hence
\begin{align*}
y_1 &= y(0.2) = y_0 + \frac{1}{6}(K_1 + 2K_2 + 2K_3 + K_4) = 0.9801 \\
z_1 &= y'(0.2) = z_0 + \frac{1}{6}(L_1 + 2L_2 + 2L_3 + L_4) = -0.1970.
\end{align*}
Example 16. The motion of a swinging pendulum is described by the second-order differential equation
\[ \frac{d^2 \theta}{dt^2} + \frac{g}{L} \sin \theta = 0, \quad \theta(0) = \frac{\pi}{6}, \quad \theta'(0) = 0, \]
where $\theta$ is the angle with the vertical at time $t$, the length of the pendulum is $L = 2$ ft, and $g = 32.17$ ft/s$^2$. With $h = 0.1$ s, find the angle $\theta$ at $t = 0.1$ using the fourth-order Runge-Kutta method.

Sol. First we convert the given second-order initial value problem into a pair of simultaneous first-order initial value problems. Setting $\frac{d\theta}{dt} = y$, we obtain the system
\begin{align*}
\frac{d\theta}{dt} &= y = f(t, \theta, y), \quad \theta(0) = \pi/6 \\
\frac{dy}{dt} &= -\frac{g}{L} \sin \theta = g(t, \theta, y), \quad y(0) = 0.
\end{align*}
Here $t_0 = 0$, $\theta_0 = \pi/6$, and $y_0 = 0$. By the fourth-order Runge-Kutta method with $h = 0.1$,
\begin{align*}
K_1 &= h f(t_0, \theta_0, y_0) = 0.00000000 \\
L_1 &= h g(t_0, \theta_0, y_0) = -0.80425000 \\
K_2 &= h f(t_0 + 0.5h, \; \theta_0 + 0.5K_1, \; y_0 + 0.5L_1) = -0.04021250 \\
L_2 &= h g(t_0 + 0.5h, \; \theta_0 + 0.5K_1, \; y_0 + 0.5L_1) = -0.80425000 \\
K_3 &= h f(t_0 + 0.5h, \; \theta_0 + 0.5K_2, \; y_0 + 0.5L_2) = -0.04021250 \\
L_3 &= h g(t_0 + 0.5h, \; \theta_0 + 0.5K_2, \; y_0 + 0.5L_2) = -0.77608129 \\
K_4 &= h f(t_0 + h, \; \theta_0 + K_3, \; y_0 + L_3) = -0.07760813 \\
L_4 &= h g(t_0 + h, \; \theta_0 + K_3, \; y_0 + L_3) = -0.74759884.
\end{align*}
\[ \theta_1 = \theta_0 + \frac{K_1 + 2K_2 + 2K_3 + K_4}{6} = 0.48385575. \]
Therefore, $\theta(0.1) \approx \theta_1 = 0.48385575$.
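Reusing the rk4_system_step sketch given after Example 13, the pendulum computation can be reproduced in a few lines (the variable gL is our own shorthand for $g/L$):
\begin{verbatim}
import math

# theta'' + (g/L) sin(theta) = 0 rewritten as theta' = y, y' = -(g/L) sin(theta)
gL = 32.17 / 2.0
f = lambda t, theta, y: y                        # theta' = y
g = lambda t, theta, y: -gL * math.sin(theta)    # y' = -(g/L) sin(theta)
theta1, y1 = rk4_system_step(f, g, 0.0, math.pi/6, 0.0, 0.1)
print(theta1)   # theta(0.1) ~ 0.48385575
\end{verbatim}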

Exercises
(1) Show that each of the following initial-value problems (IVP) has a unique solution, and find
the solution.
(a) $y' = y \cos t$, $0 \le t \le 1$, $y(0) = 1$.
(b) $y' = \frac{2}{t}\, y + t^2 e^t$, $1 \le t \le 2$, $y(1) = 0$.
(2) Apply Picard's method to generate $y_0(t)$, $y_1(t)$, $y_2(t)$, and $y_3(t)$ for the initial-value problem
\[ y' = -y + t + 1, \quad 0 \le t \le 1, \quad y(0) = 1. \]
(3) Consider the following initial-value problem
\[ x' = t(x + t) - 2, \quad x(0) = 2. \]
Use the Euler method with step size $h = 0.2$ to compute $x(0.6)$.
(4) Given the initial-value problem
\[ y' = \frac{1}{t^2} - \frac{y}{t} - y^2, \quad 1 \le t \le 2, \quad y(1) = -1, \]
with exact solution $y(t) = -\frac{1}{t}$:
(a) Use Euler’s method with h = 0.05 to approximate the solution, and compare it with the
actual values of y.
(b) Use the answers generated in part (a) and linear interpolation to approximate the following
values of y, and compare them to the actual values.
i. y(1.052) ii. y(1.555) iii. y(1.978).
(5) Solve the following IVP by the second-order Runge-Kutta method:
\[ y' = -y + 2\cos t, \quad y(0) = 1. \]
Compute $y(0.2)$, $y(0.4)$, and $y(0.6)$ with mesh length 0.2.
(6) Compute solutions to the following problems with a second-order Taylor method. Use step size
h = 0.2.
(a) $y' = (\cos y)^2$, $0 \le x \le 1$, $y(0) = 0$.
(b) $y' = \dfrac{20}{1 + 19e^{-x/4}}$, $0 \le x \le 1$, $y(0) = 1$.
(7) A projectile of mass $m = 0.11$ kg shot vertically upward with initial velocity $v(0) = 8$ m/s is slowed due to the force of gravity, $F_g = -mg$, and due to air resistance, $F_r = -kv|v|$, where $g = 9.8$ m/s$^2$ and $k = 0.002$ kg/m. The differential equation for the velocity $v$ is given by
\[ m v' = -mg - kv|v|. \]
(a) Find the velocity after $0.1, 0.2, \ldots, 1.0$ s.
(b) To the nearest tenth of a second, determine when the projectile reaches its maximum height and begins falling.
(8) Use the fourth-order Runge-Kutta method to solve the IVP
\[ \frac{dy}{dx} = \sqrt{x + y}, \quad y(0.4) = 0.41, \]
at $x = 0.8$ with step length $h = 0.2$.
(9) Water flows from an inverted conical tank with a circular orifice at the rate
\[ \frac{dx}{dt} = -0.6 \pi r^2 \sqrt{2g}\, \frac{\sqrt{x}}{A(x)}, \]
where $r$ is the radius of the orifice, $x$ is the height of the liquid level from the vertex of the cone, and $A(x)$ is the area of the cross section of the tank $x$ units above the orifice. Suppose $r = 0.1$ ft, $g = 32.1$ ft/s$^2$, and the tank has an initial water level of 8 ft and initial volume of $512(\pi/3)$ ft$^3$. Use the Runge-Kutta method of order four to find the following.
(a) The water level after 10 min with $h = 20$ s.
(b) When the tank will be empty, to within 1 min.

(10) The following system represents a much simplified model of nerve cells:
\begin{align*}
\frac{dx}{dt} &= x + y - x^3, \quad x(0) = 0.5 \\
\frac{dy}{dt} &= -\frac{x}{2}, \quad y(0) = 0.1,
\end{align*}
where $x(t)$ represents the voltage across the boundary of the nerve cell and $y(t)$ is the permeability of the cell wall at time $t$. Solve this system using the fourth-order Runge-Kutta method to generate the profile up to $t = 0.2$ with step size 0.1.
(11) Use the Runge-Kutta method of order four to solve
\[ y'' - 3y' + 2y = 6e^{-t}, \quad 0 \le t \le 1, \quad y(0) = y'(0) = 2, \]
for $t = 0.2$ with step size 0.2.

Appendix A. Algorithms
Algorithm for the second-order Runge-Kutta method:
for i = 0, 1, 2, ... do
    t_{i+1} = t_i + h = t_0 + (i+1)h
    K_1 = h f(t_i, y_i)
    K_2 = h f(t_{i+1}, y_i + K_1)
    y_{i+1} = y_i + (K_1 + K_2)/2
end for
Algorithm for the fourth-order Runge-Kutta method:
for i = 0, 1, 2, ... do
    t_{i+1} = t_i + h
    K_1 = h f(t_i, y_i)
    K_2 = h f(t_i + h/2, y_i + K_1/2)
    K_3 = h f(t_i + h/2, y_i + K_2/2)
    K_4 = h f(t_{i+1}, y_i + K_3)
    y_{i+1} = y_i + (K_1 + 2K_2 + 2K_3 + K_4)/6
end for

