Numerical Analysis
Mohammad Tawfik
Academy of Knowledge
http://academyofknowledge.org
1. Basic Concepts
A most important question that should come to your mind is “Why should I study numerical
analysis?” The first answer would be “It is a better use of your computer than games!” However,
most of us would disagree with that answer. Other answers, with better arguments, may be:
1. Studying numerical analysis develops programming skills that are essential for an
engineering career. You might not love programming, but as an engineer you will eventually
get stuck at some point where you will need to write, at least, a small program to solve
a problem you are facing that is not readily handled by the packages you have.
2. Studying numerical analysis will give you insight into how the numerical packages you are
using work. In many instances you will interact with a readily available numerical analysis
package that is needed to perform some part of your work. In most cases, at least in the
beginning, you will generate solutions that do not make any sense! In such cases, you will
need to troubleshoot the problem, and a good understanding of how the package
generates its results is necessary in order to find which part of the input data does not
properly fit the problem.
3. Studying numerical analysis will give you the necessary foundations upon which you will be
able to use and develop new numerical techniques that are specially fitted to your specific
problems. Even if you are not working in research and development, you may need to find a
new way of solving the problems you are facing. This will probably require a second course
in numerical analysis that allows you to develop more advanced techniques for specific
problems.
Nonlinear relations are very common in modeling different physical problems. Finding the roots of
a nonlinear function means solving the equation:

f(x) = 0
Where f(x) is the nonlinear function of interest. In most practical problems, the solution of such
an equation lies in a range that is determined by physical constraints or by an educated guess. For
example, if you are trying to solve a relation for my body mass, knowing that I am a human adult,
my mass will never be negative (a physical constraint), is almost impossible to be less than 20 kg, and
will not exceed 300 kg. Of course, a good approximation would be the average human mass, around
75 kg, but in this section we are interested in a range of values in which the solution may lie. This
range is called a bracket.
If the function changes sign over the bracket, that is, if f(x1)·f(x2) < 0, then at least one root lies
between x1 and x2, which is known in mathematics as the “Intermediate Value Theorem”. Figure 2.1 illustrates the
basic idea of having a root in the bracket. Now, the problem becomes a search for that value. One
advantage of the bracketing methods is that they ensure reaching a solution. Nevertheless, they
may be quite slow compared to open methods (discussed later).
The following step of the bisection method is to determine which two of the three values of
x should be kept for the next iteration. To perform this step, we evaluate f(x3); the point retained with
x3 will be the one whose function value has the opposite sign. Formally:
if f(x3)·f(x1) < 0  →  x1(new) = x1 and x2(new) = x3
otherwise  →  x1(new) = x3 and x2(new) = x2
The final step in this search is to determine when to stop the search: the termination criteria. The main
criterion for stopping a search algorithm is that there is no more progress. That is, when we generate a
new value for the solution, it does not change much from the previous value. This is determined by
evaluating the relative error:
e_relative = |(x3 − x3(old)) / x3| < ε
Where ε is a small positive number that we may call the tolerance. The value of the tolerance will
depend on the problem. There is an apparent problem associated with that definition of the relative
error, namely, if the solution is equal to zero. This is quite rare, however, if it occurs, the
relative error may be evaluated using:
e_relative = |x2 − x1| < ε
Which means that the bracket has become too small to divide again and will result in almost the
same accuracy.
In most search techniques, it is advisable to add a second termination criterion to the search
algorithm. The second criterion makes sure that the computer search does not go on forever in case
there is no solution or the tolerance is extremely small. The second termination criterion thus
becomes the number of times the algorithm has been iterated on the given problem. There is no
universally good number of iterations for stopping; rather, it will depend, again, on the problem at hand.
Bisection Algorithm
1. Select initial values, x1 and x2, and tolerance ϵ
2. Initiate Counter=0 (counting number of iterations)
3. Calculate f (x1 ), f ( x 2)
4. Increment Counter
5. Calculate x3 = (x1 + x2)/2 and f(x3)
6. If f(x3) = 0 then the solution is x3, END
7. If e_relative < ε then the solution is x3, END
8. If Counter > allowed number of iterations then stop: no solution found, END
9. If f(x3)·f(x1) < 0 → x2 = x3, otherwise x1 = x3
10. Go to step 4
Example: Find the root of the function:

f(x) = e^(−x) − 0.5 = 0

given that the root lies within [−1, 5], with tolerance 10^(−3). Limit the number of iterations to 20.
Solution:
The first step gives:

x3 = (x1 + x2)/2 = (−1 + 5)/2 = 2

Since f(2) = −0.365 has the opposite sign of f(−1) = 2.218, the new bracket becomes [−1, 2], and the
next step gives x3 = 0.5 with f(0.5) = 0.107. Another step:

x3 = (x1 + x2)/2 = (0.5 + 2)/2 = 1.25

Which gives:

f(1.25) = −0.214

This will need to be repeated 14 more times to reach the required accuracy. Using a spreadsheet
to perform the calculations, we get the solution steps in the table below:
A computer code was developed using Octave® to solve the problem (the same code may run on
FreeMat® and Matlab® ) and is presented below:
%A simple program that demonstrates the
% application of the Bisection method.
%Clearing the memory
clear all
close all
clc
%Defining the bracket of search
x1=-1;
x2=5;
%Evaluating the function
% at the boundaries of the bracket
F1=exp(-x1)-0.5;
F2=exp(-x2)-0.5;
%Problem parameters
Epsilon = 1e-3; %Tolerance
MaxIter =20; %Maximum number of iterations
%Initializing the counter and error
Counter=0;
Err=1000;
x3old=x1; %Initializing the previous solution estimate
%The search loop
while true
Counter=Counter+1; %Incrementing the counter
x3=0.5*(x1+x2); %Evaluating the solution
F3=exp(-x3)-0.5;%Evaluating the function
%Evaluating the relative error
if x3~=0 %~= also runs in Matlab, unlike !=
Err=abs((x3-x3old)/x3);
else
Err=abs(x2-x1);
end
%Checking for termination
if or(F3==0,or(Err<Epsilon,Counter>MaxIter))
break
else
if (F1*F3)<0 %If they have opposite signs
F2=F3;
x2=x3; %The root lies between x1 and x3
else
F1=F3;
x1=x3; %The root lies between x3 and x2
end
end
x3old=x3; %Storing the estimate for the next iteration
end
%Displaying the solution and the iteration count
x3
Counter
Note:
In the above program we could have created a function for evaluating f(x) instead of having to
write it three different times; however, it was preferred to keep the program simple for readability.
If we assume that the line joining the bracket points is an approximation of the function, then the root of
the line may approximate the root of the function. The root of the line may be found readily by setting
y = 0 and solving for x, which gives:

x3 = x1 − f(x1)·(x2 − x1)/(f(x2) − f(x1))
Now we may evaluate the function at the new solution and then proceed with the same steps as the
bisection method. Thus, the algorithm for the false-position method is the previous algorithm with
only step number 5 changed. The same holds for the program, as only the line in the while loop at
which x3 is evaluated will change.
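As a sketch, assuming the same variable names as the bisection program above, the changed line in the while loop would read:

%False-position update: replace the bisection line
% x3=0.5*(x1+x2);
%with the root of the line joining (x1,F1) and (x2,F2)
x3=x1-F1*(x2-x1)/(F2-F1);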
x_1 = x_0 − f(x_0) / (df/dx)|_{x=x_0}
When iterating for the solution, the general expression for the approximate solution will be:
x_{k+1} = x_k − f(x_k) / (df/dx)|_{x=x_k}
Newton-Raphson Algorithm
1- Select an initial guess x_0
2- Evaluate the function f(x_k) and its derivative (df/dx)|_{x=x_k} at the current guess
3- Evaluate the new approximation x_{k+1} = x_k − f(x_k)/(df/dx)|_{x=x_k}
4- Increment the iteration counter and stop if it exceeds the allowed number of iterations
5- If |(x_{k+1} − x_k)/x_{k+1}| < ε then let x_Solution = x_{k+1} and exit
6- Set x_k = x_{k+1} and go to step 2
Note:
The solution using this method will not be adequate if the root lies near a local extreme point, as the
slope of the function will be near zero, which may lead the solution to be unstable.
Newton-Raphson is the most popular method used to find the roots of nonlinear functions
because of its general stability and convergence rate. In very rare cases you may find that the Newton-
Raphson method does not work; in such cases you might need to try other methods.
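As a minimal sketch, the method may be applied in Octave to the earlier example f(x) = e^(−x) − 0.5 (the function and its derivative df/dx = −e^(−x) are assumed here for illustration):

%A simple Newton-Raphson sketch
x=1; %Initial guess
Epsilon=1e-3; %Tolerance
MaxIter=20; %Maximum number of iterations
for Counter=1:MaxIter
  F=exp(-x)-0.5; %Function value at the guess
  dF=-exp(-x); %Derivative value at the guess
  xNew=x-F/dF; %Newton-Raphson update
  if abs((xNew-x)/xNew)<Epsilon
    break %Relative error small enough
  end
  x=xNew;
end
xNew %Approximates the exact root ln(2)=0.6931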
Another method to derive Newton-Raphson method is through the Taylor’s series. We are going to
introduce it here as an exercise for using the series to approximate the behavior of different
functions. Using Taylor’s series, we may write any function as:
f(x) = f(x_0) + (x − x_0) df/dx|_{x=x_0} + ((x − x_0)^2/2!) d^2f/dx^2|_{x=x_0} + ((x − x_0)^3/3!) d^3f/dx^3|_{x=x_0} + ...

Or, in summation form:

f(x) = Σ_{n=0}^{∞} ((x − x_0)^n/n!) d^n f/dx^n|_{x=x_0}
If we restrict our attention to the first two terms of the series (the first-order approximation) and set the function to zero, we get:

0 ≈ f(x_0) + (x − x_0) df/dx|_{x=x_0}  →  x_Solution ≈ x_0 − f(x_0)/(df/dx)|_{x=x_0}
Which is the same relation we got earlier by drawing the tangent line. The derivations using Taylor's
series are more elegant and systematic in finding the relations we are seeking.
One main drawback, of the Newton-Raphson method, is the need to evaluate the derivative at every
iteration. In many practical cases, the derivative is not easily evaluated which makes the use of the
method not easy as well. The following methods try to overcome that problem.
x_2 = x_1 − f(x_1)(x_1 − x_0)/(f(x_1) − f(x_0))
When iterating for the solution, the general expression for the approximate solution will be:
x_{k+2} = x_{k+1} − f(x_{k+1})(x_{k+1} − x_k)/(f(x_{k+1}) − f(x_k))
5- If |(x_{k+2} − x_{k+1})/x_{k+2}| < ε then let x_Solution = x_{k+2} and exit
6- Set x_k = x_{k+1}, x_{k+1} = x_{k+2}, f(x_k) = f(x_{k+1}) and go to step 2
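A minimal Octave sketch of this algorithm, using the same example function as before (assumed for illustration):

%A simple secant-method sketch
x0=0; x1=1; %Two initial guesses
F0=exp(-x0)-0.5;
Epsilon=1e-3;
for Counter=1:20
  F1=exp(-x1)-0.5;
  x2=x1-F1*(x1-x0)/(F1-F0); %Secant update
  if abs((x2-x1)/x2)<Epsilon
    break
  end
  x0=x1; F0=F1; x1=x2; %Shift the points
end
x2 %Approximate root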
Notes:
1- The secant method is similar to the false-position method, but in the initial selection we do not
have to enclose the root in an interval, and the algorithm is more systematic.
2- From the viewpoint of the Newton-Raphson method, the secant method is very similar in finding
the root, with the main difference being the evaluation of the slope of the line at each iteration. In the
Newton-Raphson method we had to evaluate the slope of the function; on the other hand, in the
secant method, the slope of the line is evaluated using two points that belong to the function.
If the initial guesses are close enough, we may claim that the slope of the line we obtain is
approximately equal to the slope of the function at either of the two points. However, when we move
to the following iteration, the approximate solution may not be close enough to its previous point;
thus, the slope of the line can no longer be claimed to be equal to the slope of the function.
Nevertheless, as the solution converges nearer to the root, the changes from one iteration to the
other become quite small, which brings us back to the case where the slope of the line becomes a
good approximation of the slope of the function.
df/dx = lim_{δx→0} (f(x + δx) − f(x))/δx
Which can be approximated by:
df/dx ≈ (f(x + δx) − f(x))/δx
As long as δx is small enough, the result is accurate enough; however, “enough” is a matter of
choice. Now, we may substitute the approximation we obtained above into the relation for the Newton-
Raphson method to get:
x_{k+1} = x_k − f(x_k) δx/(f(x_k + δx) − f(x_k))
A common practice for the evaluation of δx is to take it as a fraction of the current value of x,
such that:

δx = ε x_k

Where ε can take any small value (0.1, 0.01, ...). This needs to be modified if x_k becomes zero.
The price we will have to pay for this method to work is an extra function evaluation at every
iteration. This may be a price we are willing to pay as long as the evaluation of the slope is not
readily available. Of course, if it were available, we would have needed to evaluate it anyway; thus, we did
not lose much in terms of computational effort. On the other hand, as we get nearer to the solution,
we may switch to the secant method and save the extra evaluation.
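A sketch of this finite-difference variant, under the same assumed example function:

%Newton-Raphson with an approximate slope
x=1; Epsilon=1e-3;
for Counter=1:20
  dx=0.01*x; %Perturbation as a fraction of x
  if x==0, dx=0.01; end %Guard against x=0
  F=exp(-x)-0.5;
  Fdx=exp(-(x+dx))-0.5; %The extra function evaluation
  xNew=x-F*dx/(Fdx-F); %Update with the approximate slope
  if abs((xNew-x)/xNew)<Epsilon, break; end
  x=xNew;
end
xNew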
We may use Tailor’s series to write down the first order approximation of any function as:
(
f i (x 1, x 2, ..., x n)k +1≈ f i (x1, x 2, ... , x n )+(x 1−x 1 ) k
∂f1
∂ x1 k
∂f
∂ x2 k
∂f
+(x 2−x 2 ) 1 ...+( x n−x n ) 1
∂ xn )|
( x 1, x2, ... , x n)k
If we do this to all the given functions, we get a system of equations in the form:
{−f_1; −f_2; ...; −f_n}_k = [ ∂f_1/∂x_1  ∂f_1/∂x_2  ...  ∂f_1/∂x_n
                              ∂f_2/∂x_1  ∂f_2/∂x_2  ...  ∂f_2/∂x_n
                                  ⋮          ⋮       ...      ⋮
                              ∂f_n/∂x_1  ∂f_n/∂x_2  ...  ∂f_n/∂x_n ]_k {Δx_1; Δx_2; ...; Δx_n}

Solving for the increments using the inverse of the matrix of partial derivatives (the Jacobian matrix):

{Δx_1; Δx_2; ...; Δx_n} = [ ∂f_i/∂x_j ]_k^{−1} {−f_1; −f_2; ...; −f_n}_k

Then the solution is updated using:

{x_1; x_2; ...; x_n}_{k+1} = {x_1; x_2; ...; x_n}_k + {Δx_1; Δx_2; ...; Δx_n}
A measure for the convergence in such a problem may be the value of the square of the change
given by:
e^2 = ⌊Δx_1 Δx_2 ... Δx_n⌋ {Δx_1; Δx_2; ...; Δx_n} = {Δx}^T {Δx} < ε
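A minimal Octave sketch of the procedure for a system of two equations (the example system is assumed here for illustration): f_1 = x_1^2 + x_2^2 − 4 = 0 and f_2 = x_1 − x_2 = 0.

%Newton-Raphson for a system of equations
x=[1;1]; %Initial guess
for Counter=1:20
  f=[x(1)^2+x(2)^2-4; x(1)-x(2)]; %Function values
  J=[2*x(1) 2*x(2); 1 -1]; %Jacobian matrix
  Dx=J\(-f); %Solve [J]{Dx}=-{f}
  x=x+Dx; %Update the solution
  if Dx'*Dx<1e-6, break; end %Square of the change
end
x %Converges to [sqrt(2);sqrt(2)]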
4. Interpolation
In many practical problems, you do not get a function describing the relation between dependent
and independent variables, rather, you get a set of data points relating the variables. In this case, you
might need to estimate values of the dependent variables that are not present in the given data sets,
or even want to estimate the maximum or minimum of them. To do that, you may interpolate a
function or functions that pass by the given data points in order to be able to perform the required
analysis. This is known as curve fitting. Curve fitting is a procedure by which you create a function
that best describes the behavior of the dependent variables with the change of the independent
variables.
Curve fitting may be divided into two main categories: interpolation and regression. Interpolation,
usually, uses polynomials to fit a function to a given set of data. In interpolation, you ensure that the
function you get passes by all the points that are given in the data set, then, you may use the
function you obtained to perform the analysis you need. In regression, you create a function that
passes near the given points of the data set, then, you apply some technique to ensure that the
function you selected has a minimum error measured in a certain chosen manner.
A polynomial of the form y(x) = a_0 + a_1 x + a_2 x^2 + ... + a_n x^n is called an nth-order polynomial because the highest power of the independent variable is
n. Such a polynomial has n+1 terms with n+1 coefficients. The coefficients are sometimes called
generalized coordinates. Another way of presenting the above polynomial is:
y(x) = Σ_{i=0}^{n} a_i x^i
Or, in matrix form:

y(x) = ⌊1 x x^2 ... x^{n−1} x^n⌋ {a_0; a_1; a_2; ...; a_{n−1}; a_n} = ⌊H(x)⌋{a}
If you have a two-point data set, as shown in Figure 4.1.1, you may create a line joining the data
points by forcing the above equation to take the values of (x1, y1) and (x2, y2) as follows:
y(x_1) = y_1 = a_0 + a_1 x_1
y(x_2) = y_2 = a_0 + a_1 x_2

Now, you have two equations in the two unknown variables a_0 and a_1. The above two equations may be
rewritten in matrix form as:
[1 x_1; 1 x_2] {a_0; a_1} = {y_1; y_2}
In compact form:
[T]{a} = {y}  →  {a} = [T]^{−1}{y}
y(x) = (x_2 y_1 − x_1 y_2)/(x_2 − x_1) + ((y_2 − y_1)/(x_2 − x_1)) x
Notes
1- The only data set that you cannot work with is one that has x_1 = x_2.
2- If you try to fit a higher-order polynomial, you will end up with an infinite number of solutions;
that is because you will have unconstrained coefficients.
y(x) = a_0 + a_1 x + a_2 x^2 + a_3 x^3

[1 x_1 x_1^2 x_1^3
 1 x_2 x_2^2 x_2^3
 1 x_3 x_3^2 x_3^3
 1 x_4 x_4^2 x_4^3] {a_0; a_1; a_2; a_3} = {y_1; y_2; y_3; y_4}
Notes:
1- As in the case of linear interpolation, if any two of the x-values are equal, you end up with two
linearly dependent equations, which renders the system unsolvable.
2- As the number of equations increases, the matrix inversion becomes computationally expensive.
Matrix inversion also has the problems associated with round-off errors that may cause
inaccurate results.
3- If any two values of x are very close, the system of equations may become ill-conditioned.
Ill-conditioned systems may appear to be solvable, but the round-off errors become
extremely high, causing the computer to return meaningless or infinite values for the solution.
4- If the polynomial order becomes too large, that may introduce very large numbers into the matrix,
which, in turn, may create an ill-conditioned matrix. There are several ways to try to avoid ill-
conditioning; however, each comes with its own problems.
Solution:
For four points, we need to use a cubic function (third-order polynomial), which may be written as:

y(x) = a_0 + a_1 x + a_2 x^2 + a_3 x^3

[1 −1 1 −1
 1 0 0 0
 1 1 1 1
 1 2 4 8] {a_0; a_1; a_2; a_3} = {1; 2; 5; 16}

Solving:

{a_0; a_1; a_2; a_3} = {2; 1; 1; 1}
y(x) = 2 + x + x^2 + x^3
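As a quick sketch, the same system may be assembled and solved directly in Octave:

%Cubic interpolation through four points
x=[-1;0;1;2]; y=[1;2;5;16]; %The given data set
T=[ones(4,1) x x.^2 x.^3]; %Rows of 1, x, x^2, x^3
a=T\y %Gives a=[2;1;1;1]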
Let’s get back to the two-point interpolation problem of Figure 4.1.1. If we rewrite the linear
equation in the form:
y(x) = b_1 + b_2 (x − x_1)
We still have a linear relation, but the x-values are shifted to be measured from the point x1. Now, if
we try to force the line to pass by the given data set, we get:
y(x_1) = y_1 = b_1
y(x_2) = y_2 = b_1 + b_2 (x_2 − x_1)  →  b_2 = (y_2 − y_1)/(x_2 − x_1)
Again, we get a set of two simultaneous equations, but this time their solution is readily available
once we start from the first point. Notice that if we had started from the other point, we would have got the
same result:
y(x) = b_1' + b_2' (x − x_2)
y(x) = y_1 + ((y_2 − y_1)/(x_2 − x_1))(x − x_1) = y_2 + ((y_1 − y_2)/(x_1 − x_2))(x − x_2)

Expanding either form recovers the matrix-method result:

y(x) = (x_2 y_1 − x_1 y_2)/(x_2 − x_1) + ((y_2 − y_1)/(x_2 − x_1)) x = y_1 + ((y_2 − y_1)/(x_2 − x_1))(x − x_1)
Now, let’s turn our focus to a three-points data set. We will need to use a quadratic (second order)
polynomial to interpolate a function that passe by three points. In Newton’s method, we may write
the quadratic polynomial as:
y(x) = b_1 + b_2 (x − x_1) + b_3 (x − x_1)(x − x_2)
y(x_2) = y_2 = y_1 + b_2 (x_2 − x_1)  →  b_2 = (y_2 − y_1)/(x_2 − x_1)

y(x_3) = y_3 = y_1 + ((y_2 − y_1)/(x_2 − x_1))(x_3 − x_1) + b_3 (x_3 − x_1)(x_3 − x_2)

→ b_3 = [ (y_3 − y_2)/(x_3 − x_2) − (y_2 − y_1)/(x_2 − x_1) ] / (x_3 − x_1)
Hint to prove the above relation: if you start from the final relation and employ the fact that
y_2 = y_1 + ((y_2 − y_1)/(x_2 − x_1))(x_2 − x_1), you will prove that the identity is true.
Notes:
1- In the above analysis, for both linear and quadratic problems, the coefficient b1 turned out to be
the y-value of the starting point, and b2 turned out to be the slope of the line joining the first two
points
2- In the solution for the quadratic equation, we could express the coefficient b3 as an expression
that may be interpreted as: The change in the slope of the lines joining the adjacent points divided
by the distance between the first and last points. In other words, b3 presented the slope of the slope
(second derivative).
3- We can easily deduce that the expression for b4 and b5 (for higher order polynomials) can be
readily written, but, they will be extremely cumbersome.
Newton’s method, thus far, has helped us evaluating the expressions for the generalized coordinates
without involving any matrix inversion. Further, the method can be used in an algorithmic way
using a table. To illustrate the table method, let us get back to the linear problem:
In Table 4.2.1, the first row presented the x-values, the second row presented the y-values, the third
row was obtained by dividing the differences from the seond row by those from the first row, and
finally, the second row of the second and third columns, presented the solution we were seeking.
Let’s go one more step. Let’s present the solution for a quadratic function using the table method. In
Table 4.2.2 we employed the same procedure as Table 4.2.1 for the first three columns, then used
the results of the third column to obtain the fourth. Let us now try to solve a numerical example
envolving a third order polynomial.
Solution:
Create a table and fill in the data.
After filling the table up to the fifth column, we may read the coefficients to be:
b_1 = 1
b_2 = 1
b_3 = 1
b_4 = 1
You can manipulate the results by expanding the bracket products and collecting the powers of x to
obtain:
y(x) = 2 + x + x^2 + x^3
Note:
You can use the table to extract the equations for the three lines joining each of the adjacent points
in the form:
L_1(x) = 1 + (x + 1)
L_2(x) = 2 + 3x
L_3(x) = 5 + 11(x − 1)
Also, you can use it to find the quadratic functions that join the sets of the first and last three
points:

P_1(x) = 1 + (x + 1) + x(x + 1)
P_2(x) = 2 + 3x + 4x(x − 1)
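The table may also be filled programmatically. A sketch in Octave for the same data set:

%Divided-difference table for Newton interpolation
x=[-1 0 1 2]; y=[1 2 5 16];
n=length(x);
D=zeros(n,n); D(:,1)=y(:); %First column holds the y-values
for j=2:n %Each column divides differences of the previous one
  for i=1:n-j+1
    D(i,j)=(D(i+1,j-1)-D(i,j-1))/(x(i+j-1)-x(i));
  end
end
b=D(1,:) %Top row gives b=[1 1 1 1]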
Notes:
1- Using Newton's method, we avoided the matrix inversion and all the problems that may be
associated with it.
2- Expressing a general term for the generalized coordinates is still out of reach, although not
impossible.
3- The table method introduced an algorithmic way of solving interpolation problems, with the
added benefit of providing the data to construct the polynomials for interpolation on subsets of the
given data set.
For the linear Lagrange interpolation, we may write the function in the form:

y(x) = c_1 (x − x_2) + c_2 (x − x_1)

The above function is still linear, but it involves the sum of two distinct linear relations. Forcing this
function to pass through the given data set, we get:
y(x_1) = y_1 = c_1 (x_1 − x_2) + c_2 (x_1 − x_1)  →  c_1 = y_1/(x_1 − x_2)

y(x_2) = y_2 = c_1 (x_2 − x_2) + c_2 (x_2 − x_1)  →  c_2 = y_2/(x_2 − x_1)
The above function may be seen as a weighted summation of two lines, each of which takes the
value of one at a point and zero at the other; meanwhile, the weights, which may be seen as the
generalized coordinates, are equal to the value of the function at the corresponding points. In algebraic
form:
y(x) = y_1 L_1(x) + y_2 L_2(x)

L_1(x) = (x − x_2)/(x_1 − x_2)
L_2(x) = (x − x_1)/(x_2 − x_1)
If we extend the procedure described above, we may get a form for the quadratic interpolation
function in the form:
y(x) = c_1 (x − x_2)(x − x_3) + c_2 (x − x_1)(x − x_3) + c_3 (x − x_1)(x − x_2)

y(x_1) = y_1 = c_1 (x_1 − x_2)(x_1 − x_3)  →  c_1 = y_1/((x_1 − x_2)(x_1 − x_3))

y(x_2) = y_2 = c_2 (x_2 − x_1)(x_2 − x_3)  →  c_2 = y_2/((x_2 − x_1)(x_2 − x_3))

y(x_3) = y_3 = c_3 (x_3 − x_1)(x_3 − x_2)  →  c_3 = y_3/((x_3 − x_1)(x_3 − x_2))
Which, as in the linear case, presents a weighted summation of three quadratic functions, each of
which is equal to one at one point and equal to zero at the other two points. It may be expressed as:
y (x)= y 1 P1 ( x)+ y 2 P2 (x)+ y 3 P3 (x )
P_1(x) = (x − x_2)(x − x_3)/((x_1 − x_2)(x_1 − x_3))
P_2(x) = (x − x_1)(x − x_3)/((x_2 − x_1)(x_2 − x_3))
P_3(x) = (x − x_1)(x − x_2)/((x_3 − x_1)(x_3 − x_2))
Figure 4.3.2. Illustration of the three quadratic functions P1, P2, and P3
Solution:
For four points, we may write the Lagrange polynomial in the form:

y(x) = y_1 ((x − x_2)/(x_1 − x_2))((x − x_3)/(x_1 − x_3))((x − x_4)/(x_1 − x_4))
     + y_2 ((x − x_1)/(x_2 − x_1))((x − x_3)/(x_2 − x_3))((x − x_4)/(x_2 − x_4))
     + y_3 ((x − x_1)/(x_3 − x_1))((x − x_2)/(x_3 − x_2))((x − x_4)/(x_3 − x_4))
     + y_4 ((x − x_1)/(x_4 − x_1))((x − x_2)/(x_4 − x_2))((x − x_3)/(x_4 − x_3))

Substituting the given data set, the terms for y_3 = 5 and y_4 = 16, for example, read:

+ 5 ((x + 1)/2)(x/1)((x − 2)/(−1)) + 16 ((x + 1)/3)(x/2)((x − 1)/1)
If you simplify the above expression, you will arrive at the good old:

y(x) = 2 + x + x^2 + x^3
Note: In the Lagrange method, you may easily obtain the final result you want for any number of
points; the price will be manipulating the polynomials to obtain a simplified form. Nevertheless, if
all you are looking for is using the function in computations, then plugging in the numbers is
a straightforward task that any computer can easily perform.
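A sketch of such a computation in Octave, evaluating the Lagrange form for the same data set at an arbitrary point xp:

%Evaluating the Lagrange interpolation numerically
x=[-1 0 1 2]; y=[1 2 5 16];
xp=0.5; yp=0; %Point of evaluation and accumulator
for i=1:4
  Li=1; %Build the i-th Lagrange polynomial at xp
  for j=1:4
    if j~=i
      Li=Li*(xp-x(j))/(x(i)-x(j));
    end
  end
  yp=yp+y(i)*Li; %Weighted summation
end
yp %Equals 2+0.5+0.25+0.125=2.875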
Let us start with a simple problem: given the values of a function of two independent variables
(x, y) at four different points (see Figure 4.4.1), it makes perfect sense to select an interpolation
function that looks like:

f(x, y) = (b_0 + b_1 x)(c_0 + c_1 y)

which, when expanded, gives f(x, y) = a_0 + a_1 x + a_2 y + a_3 xy.
Now, we will follow a similar procedure as for the single-variable functions by forcing the function
to pass by all the points:
f(x_1, y_1) = f_1 = a_0 + a_1 x_1 + a_2 y_1 + a_3 x_1 y_1
f(x_2, y_2) = f_2 = a_0 + a_1 x_2 + a_2 y_2 + a_3 x_2 y_2
f(x_3, y_3) = f_3 = a_0 + a_1 x_3 + a_2 y_3 + a_3 x_3 y_3
f(x_4, y_4) = f_4 = a_0 + a_1 x_4 + a_2 y_4 + a_3 x_4 y_4

[1 x_1 y_1 x_1y_1
 1 x_2 y_2 x_2y_2
 1 x_3 y_3 x_3y_3
 1 x_4 y_4 x_4y_4] {a_0; a_1; a_2; a_3} = {f_1; f_2; f_3; f_4}
Or:

[T]{a} = {f}  →  {a} = [T]^{−1}{f}

Let's use the matrix presentation of the interpolation polynomial presented in section 4.1, here with the terms ⌊H(x, y)⌋ = ⌊1 x y xy⌋:

f(x, y) = ⌊H(x, y)⌋{a} = ⌊H(x, y)⌋[T]^{−1}{f} = ⌊N(x, y)⌋{f}

Which gives something that looks like a set of functions, each multiplied by the value of f at a point.
That is exactly what Lagrange interpolation is, and the set of functions N(x, y) is the set of Lagrange
polynomials that interpolate the function over the given set of points. To illustrate, let us assume
that the four points of Figure 4.4.1 are aligned with the four corners of a rectangle that starts at
the origin and extends a distance x_1 and y_1 in the x and y directions respectively (see Figure 4.4.2).
[T] = [1 0 0 0
       1 x_1 0 0
       1 x_1 y_1 x_1y_1
       1 0 y_1 0]

[T]^{−1} = [1 0 0 0
            −1/x_1 1/x_1 0 0
            −1/y_1 0 0 1/y_1
            1/(x_1y_1) −1/(x_1y_1) 1/(x_1y_1) −1/(x_1y_1)]
Which gives:

⌊N(x, y)⌋ = ⌊H(x, y)⌋[T]^{−1}
          = ⌊1 − x/x_1 − y/y_1 + xy/(x_1y_1),  x/x_1 − xy/(x_1y_1),  xy/(x_1y_1),  y/y_1 − xy/(x_1y_1)⌋
Check that each of those functions is equal to one at its corresponding point and zero at the other
three points.
Now, what should we do if we have only three points instead of four? You might easily answer “drop
the xy term”, which is a good answer, but what happens if you have five points? There must be a
systematic way to select which terms to use for the interpolation, since we are always stuck with the
number of points that are given to us. Whenever the number of points is a perfect square
(4, 9, 16, ...), we can easily create the two-dimensional polynomial by multiplying two one-
dimensional polynomials (two lines, two parabolas, ...); however, that is not the case for other
numbers of points. Here comes what is known as Pascal's triangle. Pascal's triangle is a graphical
presentation of the different terms of polynomials that makes it easy for us to select the terms to use in
a way that keeps the resulting two-dimensional (or multi-dimensional) polynomials in a
systematic form.
Figure 4.4.3 shows Pascal's triangle. As you may see, the polynomial terms are presented in levels of
increasing power (0, 1, 2, ...). In the four-point case we described above, we used the top two rows
and one term from the third row. If we had only three points, we would use the top two
rows only. In the case of five, we have a choice of adding either of the other two elements of the third
row. However, if you want to keep the symmetry of the polynomial (relative to Pascal's triangle),
then you will either use the x^2 and y^2 terms, or use the xy and x^2y^2 terms. Your choice will depend on
your insight into the problem.
Another way of selecting the terms of the polynomial is to ignore the symmetry
completely. Such a choice is reasonable if you know that, for example, the problem is strictly
linear in x. In that case, your choice of terms does not need to include any higher power of x, and you may
keep your selection of terms to the two right-most lines of the triangle.
5. Regression
In general, a single line cannot pass through all the points of a measured data set; at the first point, for
example, it misses by some amount. Rather:

a_0 + a_1 x_1 − y_1 = e_1

Where e_1 is the error measured between the data point you have and the value you get from the
line function. Generally, this is what you will get at all the points you have. The problem now
becomes: which line can we draw to obtain the least amount of error over the data set?
Several answers may come to mind, and they are all valid. For example, you may decide to
find the line that passes above as many points as it passes below, so that the summation of the errors
becomes zero:
Σ_{i=1}^{n} e_i = 0
Another idea may be to minimize the sum of the absolute values of the errors:

Minimize Σ_{i=1}^{n} |e_i|
Or maybe you want to make sure that the maximum value of the deviation (error) is as low as
possible:
Minimize ( Max(|e_i|) )
A very common choice is to minimize the summation of the squares of the errors:

Minimize Σ_{i=1}^{n} e_i^2
This choice makes the analytical work associated with the problem much easier because the
function is differentiable and, because it is squared, you are sure that the slope will become zero at
the absolute minimum. This method is known as the Least Squares method.
We may write the equations for the error at different points in Figure 5.1.1 as follows:
e_1 = a_0 + a_1 x_1 − y_1
e_2 = a_0 + a_1 x_2 − y_2
e_3 = a_0 + a_1 x_3 − y_3
e_4 = a_0 + a_1 x_4 − y_4
{e_1; e_2; e_3; e_4} = [1 x_1; 1 x_2; 1 x_3; 1 x_4] {a_0; a_1} − {y_1; y_2; y_3; y_4}
In shorthand:
{e} = [A]{a} − {y}
An expression for the summation of the squares of the error values is:

Σ_{i=1}^{n} e_i^2 = {e}^T {e} = ({a}^T [A]^T − {y}^T)([A]{a} − {y})

Note that the above is a scalar equation, and that the transpose of a scalar is itself. Thus, we have:

{y}^T [A]{a} = {a}^T [A]^T {y}

Yielding the error relation to be:

Σ_{i=1}^{n} e_i^2 = {a}^T [A]^T [A]{a} − 2{a}^T [A]^T {y} + {y}^T {y}
This relation presents the square of the error as a quadratic function in the generalized coefficients.
To find the values of {a} that minimize e^2, we need to differentiate:
d/d{a} ( Σ_{i=1}^{n} e_i^2 ) = 2[A]^T [A]{a} − 2[A]^T {y}
In linear algebra terms, we say we are differentiating from the left. Equating the above result to zero, we
end up with:
[A]^T [A]{a} = [A]^T {y}
The above is a linear set of equations in the generalized coordinates that minimize the square of the
error. Solving, we get:
{a} = ([A]^T [A])^{−1} [A]^T {y}
In case you are interested in evaluating the square of the error you will obtain, you may plug the
result back into the relation for the error.
Note: If you revise the steps above, starting from the first introduction of the summation of the squared
errors, you will find that the procedure applies to any number of generalized coordinates; hence,
we will employ the same results directly when handling nonlinear regression.
x: 1 2 3 4 5 6 7
y: 0.5 2.5 2 4 3.5 6 5.5
Solution:
Using a line equation:
y(x) = a_0 + a_1 x
{a} = {a_0; a_1}
{y}^T = ⌊0.5 2.5 2 4 3.5 6 5.5⌋

[A]^T = [1 1 1 1 1 1 1
         1 2 3 4 5 6 7]
Now, we may write the equations for the optimum a as:
[A]^T [A]{a} = [A]^T {y}  →  [7 28; 28 140] {a} = {24; 119.5}
Solving, we get:
{a_0; a_1} = {0.0714; 0.8393}
Finally, the line that minimizes the square of the error may be written as:
y(x) = 0.0714 + 0.8393x
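The whole computation takes a few lines in Octave, as a sketch:

%Linear least-squares fit of the example data
x=(1:7)'; y=[0.5;2.5;2;4;3.5;6;5.5];
A=[ones(7,1) x]; %Regression matrix
a=(A'*A)\(A'*y) %Gives a=[0.0714;0.8393]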
For a curved trend, we may instead fit a parabola:

y(x) = a_0 + a_1 x + a_2 x^2

But the parabola will not pass by every point in the data; thus:

e_i = a_0 + a_1 x_i + a_2 x_i^2 − y_i

Which leads us to the same problem we had with linear regression, but this time with
three generalized coordinates. The optimum values of {a} will be those satisfying the same equation.
Solution:
Using a quadratic equation:
y(x) = a_0 + a_1 x + a_2 x^2
{a} = {a_0; a_1; a_2}
{y}^T = ⌊0.5 2.5 2 4 3.5 6 5.5⌋

[A]^T = [1 1 1 1 1 1 1
         1 2 3 4 5 6 7
         1 4 9 16 25 36 49]
[A]^T [A]{a} = [A]^T {y}  →  [7 28 140; 28 140 784; 140 784 4676] {a} = {24; 119.5; 665.5}
Solving, we get:
{a_0; a_1; a_2} = {−0.2857; 1.0774; −0.0298}
Finally, the parabola that minimizes the square of the error may be written as:
y(x) = −0.2857 + 1.0774x − 0.0298x^2
Now, you can use all sorts of nonlinear functions to fit any set of data using the same method.
Consider the case where m = 1 kg, k = 1 N/m, and c = 0.5 N·s/m. Checking the numbers, you
will find that the damping ratio is 0.25 and the natural frequency is 1 rad/s. If the system
is excited with a unit initial displacement or a unit initial velocity, we get two different
responses, as shown in Figure 5.5.2 (time increment 0.1 s).
[Figure 5.5.2: Time responses x(t) of the system (x(t) vs. Time (s))]
If we try to examine the differential equation of the system in terms of finite difference (see section
7.2), we may write:
m (x_{t+δt} − 2x_t + x_{t−δt})/δt^2 + c (x_{t+δt} − x_{t−δt})/(2δt) + k x_t = 0
Solving for the most recent displacement:

x_{t+δt} = [ −(1 − δtc/2m) x_{t−δt} − (δt^2 k/m − 2) x_t ] / (1 + δtc/2m)
We may expand the denominator using Taylor’s series and multiply by the numerator. If we ignore
second-order time terms, we may write down the above equation as:
x_{t+δt} ≈ −(1 − δtc/m) x_{t−δt} − (−2 + δtc/m) x_t
Substituting the numbers we have, we get:
x_{t+δt} ≈ −0.95 x_{t−δt} + 1.95 x_t
Now, let’s examine how we may use regression to get a relation for the displacement x(t) in terms
of the previous values of x(t). We may rite:
x (t+ δ t )=a0 + a1 x (t−δ t)+ a2 x (t )
To evaluate the values of the unknown coefficients, we will need to get some numerical data of the
system dynamics. Thus, we will evaluate the time response for the first two seconds with an
increment of 0.1 seconds which will give us 21 data points (including the initial conditions) for the
case with
x(0) = 1, ẋ(0) = 0
Now, we may write 19 equations for the above relation in the form:
{x_2; x_3; ...; x_20} − [1 x_0 x_1; 1 x_1 x_2; ...; 1 x_18 x_19] {a_0; a_1; a_2} = {error}
From which we may get the unknown coefficients from the relation:

{a} = ([A]^T [A])^{−1} [A]^T {x}

Which is quite like the one we got using the finite difference relation above, with the
effect of the time increment terms ignored. If we plot the responses of the exact solution and the one we
got from the approximate relation obtained by the regression, the results will be as presented in Figure
5.5.3.
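As a sketch, assuming xr is a column vector holding the 21 sampled values of x(t) at 0:0.1:2 obtained from the exact (or simulated) response:

%Auto-regressive fit (the sampled vector xr is assumed given)
A=[ones(19,1) xr(1:19) xr(2:20)]; %Regressors: 1, x(t-dt), x(t)
b=xr(3:21); %Next-step values x(t+dt)
a=(A'*A)\(A'*b) %The coefficients a0, a1, a2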
To validate the accuracy of the relation, we will compare the results of the exact solution for the
case with the initial conditions given by:
x(0) = 0, ẋ(0) = 1
with those obtained from the auto-regressive relation by using the first two points as input to the
relation. For that, we get the plot of Figure 5.5.4.
[Figure 5.5.3: Exact response X(t) and auto-regressive prediction x_{t+δt} (Response (m) vs. Time (sec))]
[Plot: Response (m) vs. Time (sec), showing X(t) and x_{t+δt}]
Figure 5.5.4: Exact and approximate solutions with different initial conditions
The above results indicate how powerful the auto-regressive method is in predicting the performance
of a mass-spring-damper system. The method can be easily extended to multiple-input multiple-
output systems. Also, just as in ordinary regression, the auto-regressive relation may
accommodate nonlinear terms that may increase its accuracy for nonlinear systems' responses.
6. Numerical Integration
7. Differential Equations
Two main families of approximate methods can be identified in the literature: the discrete
coordinate methods and the distributed coordinate methods.
Discrete coordinate methods depend on solving the differential relations at pre-specified points in
the domain. When those points are determined, the differential equation may be approximately
presented in the form of a difference equation. The difference equation presents a relation, based on
the differential equation, between the values of the dependent variables at different values of the
independent variables. When those equations are solved, the values of the dependent variables are
determined at those points, giving an approximation of the distribution of the solution. Examples of
the discrete coordinate methods are the finite difference methods and the Runge-Kutta methods.
Discrete coordinate methods are widely used in fluid dynamics and in the solution of initial value
problems.
The other family of approximate methods is the distributed coordinate methods. These methods,
generally, are based on approximating the solution of the differential equation using a summation of
functions that satisfy some or all the boundary conditions. Each of the proposed functions is
multiplied by a coefficient, generalized coordinate, that is then evaluated by a certain technique that
identifies different methods from one another. After the solution of the problem, you will obtain a
function that represents, approximately, the solution of the problem at any point in the domain.
Stationary functional methods are part of the distributed coordinate methods family. These methods
depend on minimizing/maximizing the value of a functional that describes a certain property of the
solution, for example, the total energy of the system. Using the stationary functional approach, the
finite element model of a problem may be obtained. It is usually much easier to present the relations
of different variables using a functional, especially when the relations are complex as in the case of
fluid structure interaction problems or structure dynamics involving control mechanisms.
The weighted residual methods, on the other hand, work directly on the differential equations. As
the approximate solution is introduced, the differential equation is no longer balanced. Thus, a
residue, a form of error, is introduced into the differential equation. The different weighted residual
methods handle the residue in different ways to obtain the values of the generalized coordinates that
satisfy a certain criterion.
f(x + δx) = f(x) + δx df/dx + (δx^2/2!) d^2f/dx^2 + (δx^3/3!) d^3f/dx^3 + ...

Solving for the first-derivative term:

δx df/dx = f(x + δx) − f(x) − (δx^2/2!) d^2f/dx^2 − (δx^3/3!) d^3f/dx^3 − ...
Dividing by δx and dropping the higher-order terms:

df/dx ≈ (f(x + δx) − f(x))/δx
The above expression for the first derivative is known as the forward difference: the evaluation of
the derivative of a function depends on the evaluation of the function at a point that lies after the
current one. Similarly, we write the backward difference expression for the first derivative as:
df/dx ≈ (f(x) − f(x − δx))/δx
They are equivalent in the numerical sense, but we need to keep the distinction in mind when
handling differential equations using the numerical differentiation in order to determine which
values of the function we should be using in the equation.
f(x + δx) = f(x) + δx df/dx + (δx^2/2!) d^2f/dx^2 + (δx^3/3!) d^3f/dx^3 + ...
f(x − δx) = f(x) − δx df/dx + (δx^2/2!) d^2f/dx^2 − (δx^3/3!) d^3f/dx^3 + ...

Subtracting the second from the first:

f(x + δx) − f(x − δx) = 2δx df/dx + 2(δx^3/3!) d^3f/dx^3 + ...
If we use this equation to find an expression for the first derivative, we get:

df/dx ≈ (f(x + δx) − f(x − δx))/(2δx)

Which is another expression for the first derivative, with an order of error O(δx^2), that is more
accurate than the forward or backward expressions we obtained earlier. This expression
is known as the central difference of the first derivative, and its second-order accuracy
makes it superior to the forward and backward expressions. As usual, there is a price to pay; in this
case, the price is a double evaluation of the function.
Let’s play another trick. Write down the Tailor’s series such that:
4f(x + δx) = 4f(x) + 4δx df/dx + 4(δx^2/2!) d^2f/dx^2 + 4(δx^3/3!) d^3f/dx^3 + ...
f(x + 2δx) = f(x) + 2δx df/dx + 4(δx^2/2!) d^2f/dx^2 + 8(δx^3/3!) d^3f/dx^3 + ...

Subtracting the second from the first:

4f(x + δx) − f(x + 2δx) = 3f(x) + 2δx df/dx − 4(δx^3/3!) d^3f/dx^3 + ...

df/dx = (−f(x + 2δx) + 4f(x + δx) − 3f(x))/(2δx) + O(δx^2)
Which is a forward difference expression for the first derivative with second-order accuracy. We
will not discuss here further expressions for the first derivative obtained by other
manipulations of the Taylor's series, but the above examples should give you an indication of that
direction.
f(x + δx) = f(x) + δx df/dx + (δx^2/2!) d^2f/dx^2 + (δx^3/3!) d^3f/dx^3 + ...
f(x − δx) = f(x) − δx df/dx + (δx^2/2!) d^2f/dx^2 − (δx^3/3!) d^3f/dx^3 + ...

Adding the two equations:

f(x + δx) + f(x − δx) = 2f(x) + 2(δx^2/2!) d^2f/dx^2 + 2(δx^4/4!) d^4f/dx^4 + ...

Rearranging, we get the central difference expression for the second derivative:

d^2f/dx^2 = (f(x + δx) − 2f(x) + f(x − δx))/δx^2 + O(δx^2)
Note:
If we used the central difference expression:

d^2f/dx^2 = ( df/dx|_{x+δx} − df/dx|_{x−δx} ) / (2δx)

then substituted the central difference expression for the first derivative, we get:

d^2f/dx^2 = [ (f(x + 2δx) − f(x))/(2δx) − (f(x) − f(x − 2δx))/(2δx) ] / (2δx)
          = (f(x + 2δx) − 2f(x) + f(x − 2δx))/(4δx^2)
While, if we used the forward difference, we get:

d^2f/dx^2 = ( df/dx|_{x+δx} − df/dx|_{x} ) / δx
This illustrates that we may combine different difference expressions to evaluate a derivative, but
the order of accuracy may change in an unpredictable way. The error and convergence analysis of
different expressions is beyond the scope of this work.
Now let us try to find another expression for the second derivative using the equations:
2f(x + δx) = 2f(x) + 2δx df/dx + 2(δx^2/2!) d^2f/dx^2 + 2(δx^3/3!) d^3f/dx^3 + ...
f(x + 2δx) = f(x) + 2δx df/dx + 4(δx^2/2!) d^2f/dx^2 + 8(δx^3/3!) d^3f/dx^3 + ...

Subtracting the second from the first:

2f(x + δx) − f(x + 2δx) = f(x) − 2(δx^2/2!) d^2f/dx^2 − 6(δx^3/3!) d^3f/dx^3 + ...

Manipulating:

d^2f/dx^2 = (f(x + 2δx) − 2f(x + δx) + f(x))/δx^2 + O(δx)
Which is a forward difference expression for the second derivative with first-order accuracy! Let's
try:

5f(x + δx) = 5f(x) + 5δx df/dx + 5(δx^2/2!) d^2f/dx^2 + 5(δx^3/3!) d^3f/dx^3 + ...
4f(x + 2δx) = 4f(x) + 8δx df/dx + 16(δx^2/2!) d^2f/dx^2 + 32(δx^3/3!) d^3f/dx^3 + ...
f(x + 3δx) = f(x) + 3δx df/dx + 9(δx^2/2!) d^2f/dx^2 + 27(δx^3/3!) d^3f/dx^3 + ...

Adding the first equation to the third, then subtracting the second, we get:

f(x + 3δx) − 4f(x + 2δx) + 5f(x + δx) = 2f(x) − δx^2 d^2f/dx^2 + ...

Manipulating:

d^2f/dx^2 = (2f(x) − 5f(x + δx) + 4f(x + 2δx) − f(x + 3δx))/δx^2 + O(δx^2)

Which is a forward difference expression for the second derivative with second-order accuracy.
The Euler method uses a difference expression for the first derivative, usually the first-order forward
one, to express the derivative in the form:

dy(x)/dx|_x = (y(x + δx) − y(x))/δx
Now we have an expression to evaluate the value of the function at a point based on the value of the
function at a neighboring point and the slope of the function at that point. If we divide the domain
into equal intervals, we get the expression:
y(x_{i+1}) = y(x_i) + δx f(y(x_i), x_i)
Which is the general expression for the Euler method. The Euler method is the simplest method used for
first-order differential equations, usually initial value problems, but it has the lowest possible
accuracy; thus, errors may accumulate easily as we propagate through the domain. If we select the mesh
size (δx) to be smaller, the solution should, theoretically, approach the exact solution. However,
when the mesh size is too small we become subject to round-off errors, which may cause the
solution to become unstable. Nevertheless, the simplicity of the Euler method makes it quite commonly
used, especially when the accuracy of the results is not critical.
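A minimal sketch of the method in Octave, for an assumed example equation dy/dx = −y with y(0) = 1 (exact solution e^(−x)):

%Euler method for dy/dx=-y
dx=0.1; x=0:dx:2;
y=zeros(size(x)); y(1)=1; %Initial condition
for i=1:length(x)-1
  y(i+1)=y(i)+dx*(-y(i)); %y(i+1)=y(i)+dx*f(y(i),x(i))
end
plot(x,y,x,exp(-x)) %Compare with the exact solution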
From basic calculus, we may recall the mean value theorem, which states that the slope of the line
joining two points a and b is equal to the slope of the function at some point c that lies between both
points, and that slope has a value that lies between the slope values at a and b. In other words:

for a < b:

(y(b) − y(a))/(b − a) = dy(c)/dx,  where a < c < b

→ dy(a)/dx ≤ dy(c)/dx ≤ dy(b)/dx
Now, we may approximate the value of the slope at the point c by the average of the slopes at a & b.
(y_1 − y_0)/δx = (f(y_e, x + δx) + f(y_0, x))/2

Which leads to:

y_1 = y_0 + (δx/2)(f(y_e, x + δx) + f(y_0, x))
In the above expressions, we use y_e, which is the approximation of y_1 obtained using the Euler method.
Usually, in Runge-Kutta methods, we identify the different values of the slopes by k_i; thus we may write
the second-order Runge-Kutta method as:
y_1 = y_0 + (δx/2)(k_1 + k_2)
k_1 = f(y_0, x_0)
k_2 = f(y_0 + δx k_1, x_0 + δx)
There are different Runge-Kutta methods that use different weighted averages of the slopes. Without
getting into the details of the derivations, the third-order method states:

y_1 = y_0 + (δx/6)(k_1 + 4k_2 + k_3)
k_1 = f(y_0, x_0)
k_2 = f(y_0 + (δx/2) k_1, x_0 + δx/2)
k_3 = f(y_0 + δx(2k_2 − k_1), x_0 + δx)
And the fourth-order method states:

y_1 = y_0 + (δx/6)(k_1 + 2k_2 + 2k_3 + k_4)
k_1 = f(y_0, x_0)
k_2 = f(y_0 + (δx/2) k_1, x_0 + δx/2)
k_3 = f(y_0 + (δx/2) k_2, x_0 + δx/2)
k_4 = f(y_0 + δx k_3, x_0 + δx)
which is the most commonly used method as it balances between accuracy and complexity.
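A sketch of the fourth-order method in Octave, for the same assumed example dy/dx = −y, y(0) = 1:

%Fourth-order Runge-Kutta
f=@(y,x) -y; %Right-hand-side function
dx=0.1; x=0:dx:2;
y=zeros(size(x)); y(1)=1;
for i=1:length(x)-1
  k1=f(y(i),x(i));
  k2=f(y(i)+0.5*dx*k1,x(i)+0.5*dx);
  k3=f(y(i)+0.5*dx*k2,x(i)+0.5*dx);
  k4=f(y(i)+dx*k3,x(i)+dx);
  y(i+1)=y(i)+dx*(k1+2*k2+2*k3+k4)/6;
end
plot(x,y,x,exp(-x)) %Nearly indistinguishable curves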
For a system of first-order equations:

d/dx {y_1; y_2; ...; y_n} = {f_1(y_1, y_2, ..., y_n, x); f_2(y_1, y_2, ..., y_n, x); ...; f_n(y_1, y_2, ..., y_n, x)}

The Euler method becomes:

{y_1; y_2; ...; y_n}_{x_0+δx} = {y_1; y_2; ...; y_n}_{x_0} + δx {f_1; f_2; ...; f_n}|_{x_0}

And the second-order Runge-Kutta method becomes:

{y_1; y_2; ...; y_n}_{x_0+δx} = {y_1; y_2; ...; y_n}_{x_0} + (δx/2)( {f_1; f_2; ...; f_n}|_{x_0} + {f_1; f_2; ...; f_n}|_{({y}_e, x_0+δx)} )

Where

{y_1; y_2; ...; y_n}_e = {y_1; y_2; ...; y_n}_{x_0} + δx {f_1; f_2; ...; f_n}|_{x_0}
The above expressions for the Euler and Runge-Kutta methods may be used for linear and nonlinear
problems. However, many practical problems are linearized. For linearized problems, the
differential equation may be written in the form:
d/dx {y_1; y_2; ...; y_n} = [a_11 a_12 ... a_1n
                             a_21 a_22 ... a_2n
                               ⋮    ⋮  ...   ⋮
                             a_n1 a_n2 ... a_nn] {y_1; y_2; ...; y_n} + {u_1(x); u_2(x); ...; u_n(x)}

This is commonly known as the state space representation of linear time-independent systems (with time
as the independent variable instead of x). For such systems, the Euler method may be written as:
{y}_{x+δx} = {y}_x + δx ( [A]{y}_x + {u(x)} )
And the second-order Runge-Kutta method becomes:

{y}_{x+δx} = {y}_x + (δx/2)(k_1 + k_2)

k_1 = [A]{y}_x + {u(x)}
k_2 = [A]( {y}_x + δx k_1 ) + {u(x + δx)}
A second-order equation, such as the mass-spring-damper equation m ÿ + c ẏ + k y = u, cannot be
handled directly by the Euler and Runge-Kutta methods as described above. However, by introducing
extra variables to the problem, we may transform the second-order equation into a set of first-order
equations that may be handled as described in section 7.5. Let us introduce z(t) such that:

z(t) = ẏ(t)  →  ż(t) = ÿ(t)
{ẏ(t); ż(t)} = [0 1; −k/m −c/m] {y(t); z(t)} + {0; u/m}
Now we may apply the methods of section 7.5 to the system above to get an
approximate solution. The same can be applied to a system of second-order equations of the
form:

[M]{Ÿ} + [C]{Ẏ} + [K]{Y} = {U}
{Ẏ(t); Ż(t)} = [0 I; −[M]^{−1}[K] −[M]^{−1}[C]] {Y(t); Z(t)} + {0; [M]^{−1}{U}}
And we may proceed from that point on.
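A sketch in Octave, applying the Euler method to the mass-spring-damper system of section 5.5 (m = 1, k = 1, c = 0.5, no excitation):

%Second-order equation reduced to state space
m=1; k=1; c=0.5;
A=[0 1; -k/m -c/m]; %State matrix
dt=0.01; t=0:dt:10;
Y=zeros(2,length(t)); Y(:,1)=[1;0]; %y(0)=1, ydot(0)=0
for i=1:length(t)-1
  Y(:,i+1)=Y(:,i)+dt*A*Y(:,i); %Euler step with u=0
end
plot(t,Y(1,:)) %Decaying oscillation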
Consider a differential equation written in the general form L(f(x)) = g(x),
where f(x) is the function of interest, g(x) is some arbitrary input to the domain (excitation
function), while L(.) is a differential operator that may be expanded in the form:

L(.) = a_0(x) + a_1(x) d/dx + a_2(x) d^2/dx^2 + ... + a_n(x) d^n/dx^n
Which is a general linear nth order differential equation with coefficients that may depend on the
independent variable x. In order not to risk losing you at this point, we will switch our attention to a
more specific problem.
EA d^2u/dx^2 + f(x) = 0
Where E is the modulus of elasticity, A is the cross-section area, and u is the displacement. Subject
to the boundary conditions:
u(0) = u(L) = 0
Solution:
The first step in solving such a problem is to divide the domain into pre-specified intervals.
We will restrict our work here to n equally distributed intervals to simplify the relations (see Figure
7.7.2). The length of each interval will be:

δx = L/n
For the second derivative, we may select the second order accurate expression we derived in section
7.2.2.
d^2u/dx^2|_{x=x_i} = (u_{i+1} − 2u_i + u_{i−1})/δx^2
Which gives us the general expression for the difference equation at point xi to be:
u_{i+1} − 2u_i + u_{i−1} = −(δx^2/EA) f(x_i)
“Difference equation” is the term we use for the equation we get after substituting the
difference relations for the differential relations in the differential equation of interest. At the first
internal point, we have:
u_2 − 2u_1 + u_0 = −(δx^2/EA) f(x_1)  →  u_2 − 2u_1 = −u_0 − (δx^2/EA) f(x_1)
Similarly, at the last internal point, we have:

u_n − 2u_{n−1} + u_{n−2} = −(δx^2/EA) f(x_{n−1})  →  −2u_{n−1} + u_{n−2} = −u_n − (δx^2/EA) f(x_{n−1})
Applying the relation to all other internal points of the domain, we get the system of equations:
[−2 1 0 0 ... 0
 1 −2 1 0 ... 0
 0 1 −2 1 ... 0
 ⋮ ⋮ ⋮ ⋮ ⋮ ⋮
 0 ... 0 1 −2 1
 0 ... 0 0 1 −2] {u_1; u_2; u_3; ...; u_{n−2}; u_{n−1}} = {−u_0; 0; 0; ...; 0; −u_n} − (δx^2/EA){f_1; f_2; f_3; ...; f_{n−2}; f_{n−1}}
Which is an algebraic set of (n-1) equations in the unknown values of u inside the domain. Thus, we
may say that the finite difference procedure we followed above transformed the differential
equation into a set of algebraic equations that can be readily solved for the values of the unknown
function at discrete points inside the domain.
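A sketch of the procedure in Octave, with the values EA = 1, L = 1, a constant load f = 1, and n = 10 intervals assumed for illustration:

%Finite-difference solution of EA u''+f=0, u(0)=u(L)=0
n=10; L=1; EA=1; f0=1;
dx=L/n;
K=-2*eye(n-1)+diag(ones(n-2,1),1)+diag(ones(n-2,1),-1);
rhs=-dx^2/EA*f0*ones(n-1,1); %u0=un=0 drop out here
u=K\rhs;
x=(1:n-1)'*dx;
plot(x,u,x,f0*x.*(L-x)/(2*EA)) %Exact: f0*x*(L-x)/(2EA)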
Note:
In the above example, we were given boundary conditions in the form of the value of the function at
the boundary points; this type of boundary condition is known in mathematics as a Dirichlet
boundary condition. If the boundary conditions were in the form of the derivative of the function
at the boundary points, they would have been known as Neumann boundary conditions. Usually, a
mixture of both is what is given in a practical problem.
EA d^2u/dx^2 + f(x) = 0

u(0) = 0

du/dx|_{x=L} = P/EA
Figure 7.7.3: Simple fixed-free bar with distributed load and tip concentrated load
Solution:
We will use the same difference relation for the second derivative, so we get the general form of the
difference equation to be:
u_{i+1} − 2u_i + u_{i−1} = −(δx^2/EA) f(x_i)
For the second boundary condition (Neumann type) we will need to rewrite it in difference form
as well; using the first-order backward difference we may write:

(u_n − u_{n−1})/δx = P/EA  →  u_n − u_{n−1} = δx P/EA
This difference relation introduces an extra equation that may be used simultaneously with the set
of (n-1) equations we got before, and then all be solved for the n unknown values of u. However,
the difference relation we have used for the boundary condition is of first order accuracy (see
section 7.2.1), while the difference relation we used for the differential equation was of second
order. This combination reduces the order of accuracy of the whole system to first order.
Nevertheless, the presentation of the boundary condition is simple and can be easily implemented.
In order to try to preserve the accuracy of the system of equations at second order, we will need to
use a second order presentation of the first derivative. Let us write the difference equation at the nth
point so it looks like:
u_{n−1} − 2u_n + u_{n+1} = −(δx^2/EA) f(x_n)

And the boundary condition will be:

(u_{n+1} − u_{n−1})/(2δx) = P/EA

Which can be manipulated to get:

u_{n+1} = u_{n−1} + 2δx P/EA

Which can then be substituted back into the difference equation to get:

2u_{n−1} − 2u_n = −(δx^2/EA) f(x_n) − 2δx P/EA
Thus, we end up with a set of equations in the unknown values of the function as follows:

[−2 1 0 0 ... 0
 1 −2 1 0 ... 0
 0 1 −2 1 ... 0
 ⋮ ⋮ ⋮ ⋮ ⋮ ⋮
 0 ... 0 1 −2 1
 0 ... 0 0 2 −2] {u_1; u_2; u_3; ...; u_{n−1}; u_n} = {−u_0; 0; 0; ...; 0; −2δx P/EA} − (δx^2/EA){f_1; f_2; f_3; ...; f_{n−1}; f_n}
Which is a set of n equations in n unknown values that may be readily solved for the values of the
function.
Note:
An argument against this approach of handling the Neumann boundary condition would be that we
relied on a point that does not exist physically in the problem (it lies outside the domain). A counter-
argument may be that the problem has already been transformed from the physical continuous
domain to a discrete domain; thus the introduction of such points is just a way to complete the
transformation (from the x-domain to the i-domain).
A third way of handling the Neumann boundary condition is to use a backward
second-order difference representation of the first derivative. Referring to section 7.2.1, we may
write:
[−2 1 0 0 ... 0
 1 −2 1 0 ... 0
 0 1 −2 1 ... 0
 ⋮ ⋮ ⋮ ⋮ ⋮ ⋮
 0 ... 0 1 −2 1
 0 ... 0 1 −4 3] {u_1; u_2; u_3; ...; u_{n−1}; u_n} = {−u_0 − (δx^2/EA)f_1; −(δx^2/EA)f_2; ...; −(δx^2/EA)f_{n−1}; 2δx P/EA}

where the last row is the second-order backward form of the boundary condition, u_{n−2} − 4u_{n−1} + 3u_n = 2δx P/EA.
Which is another system of equations that may preserve the degree of accuracy of the problem.
ρA ∂^2u/∂t^2 + EA ∂^2u/∂x^2 + f(x, t) = 0
Subject to the boundary conditions u(0, t) = u(L, t) = 0. Applying the central difference expression to the spatial derivative at point x_i:
ρA ∂^2u_i/∂t^2 + EA (u_{i−1} − 2u_i + u_{i+1})/δx^2 + f(x_i, t) = 0
Now the set of difference equations become:
(δx^2 ρ/E) [1 0 0 ... 0
            0 1 0 ... 0
            ⋮ ⋮ ⋮ ... ⋮
            0 ... 0 1 0
            0 ... 0 0 1] {ü_1; ü_2; ...; ü_{n−1}} + [−2 1 0 ... 0
                                                     1 −2 1 ... 0
                                                     ⋮ ⋮ ⋮ ... ⋮
                                                     0 ... 1 −2 1
                                                     0 ... 0 1 −2] {u_1; u_2; ...; u_{n−1}} = −(δx^2/EA){f_1(t); f_2(t); ...; f_{n−1}(t)}
Or:
[M]{ü} + [K]{u} = {f}
The above step transformed the partial differential equation into a set of n ordinary differential
equations in time (initial value problem), thus we may apply any of the techniques we learned
earlier to find a solution. We will apply the Euler method to find the temporal solution, however, we
will need to reduce the order of the equation from second order to first order to be able to solve it
using Euler method. Let us introduce the set of (n-1) variables vi, where:
{v_1; v_2; v_3; ...; v_{n−2}; v_{n−1}} = {u̇_1; u̇_2; u̇_3; ...; u̇_{n−2}; u̇_{n−1}}
Using this, similarly to section 7.6, we obtain the equations in the form:

{u̇; v̇} = [0 I; −[M]^{−1}[K] 0] {u; v} + {0; [M]^{−1}{f}}
Using the Euler method, we obtain the values of the functions for every time step as:
{u; v}_{t+δt} = {u; v}_t + δt ( [0 I; −[M]^{−1}[K] 0] {u; v}_t + {0; [M]^{−1}{f}}_t )
Consider again a differential equation in the general form L(f(x)) = g(x),
where f(x) is the function of interest, g(x) is some arbitrary input to the domain (excitation
function), while L(.) is a differential operator that may be expanded in the form:

L(.) = a_0(x) + a_1(x) d/dx + a_2(x) d^2/dx^2 + ... + a_n(x) d^n/dx^n
Which is a general linear nth order differential equation with coefficients that may depend on the
independent variable x. The above equation has to have certain boundary conditions that render the
solution unique. If we select different functions, Ψi (which we may call trial functions), that satisfy
all boundary conditions but do not necessarily satisfy the differential equation, we may write the
approximate solution of f(x) in the form of:
f(x) = Σ_{i=1}^{n} a_i ψ_i(x)
Where ai are the generalized coordinates, or the unknown coefficients (Note the similarity with the
interpolation using polynomials) (see section 4.1). Applying the differential operator on the
approximate solution, you get:
L(f(x)) − g(x) = L( Σ_{i=1}^{n} a_i ψ_i ) − g(x) = Σ_{i=1}^{n} a_i L(ψ_i) − g(x) ≠ 0
Note that the right-hand side of the above equation is not equal to zero. This non-zero value of the
right-hand side is called the residue:

R(x) = Σ_{i=1}^{n} a_i L(ψ_i) − g(x)
Note:
The residue is NOT the error.
The error, as you may recall, is the difference between the exact solution and the approximate one.
Meanwhile, the residue is the imbalance created in the differential equation due to the use of the
approximate solution; this may be expressed in the form:

L( Σ_{i=1}^{n} a_i ψ_i ) − g(x) = R(x)
The criterion used in Weighted Residual methods is based on one simple assumption: if we
integrate the residue over the domain and force it to equal zero, then the error will be reduced.
Or:

∫_Domain R(x) dx = 0

Which is a single equation in the n unknown generalized coordinates a_i. To find a solution for a
linear system of equations in n unknown variables, we need to have n linearly independent
equations. In order to generate such a set of equations, we may multiply the integrand in the above
equation by n linearly independent functions and perform the integral with each of the functions to
generate an equation. These linearly independent functions are called weighting functions. Thus we
may write:
Σ_{i=1}^{n} a_i ∫_Domain L(ψ_i(x)) w_j(x) dx = ∫_Domain g(x) w_j(x) dx,  j = 1, 2, ..., n

Or, in matrix form, [K]{a} = {q}, where

k_ij = ∫_Domain L(ψ_i(x)) w_j(x) dx

q_j = ∫_Domain g(x) w_j(x) dx
EA d^2u/dx^2 + f(x) = 0

Subject to the boundary conditions:

u(0) = 0
du(L)/dx = 0
Now, let's use the approximate solution u(x) = Σ_{i=1}^{n} a_i ψ_i(x). Substituting into the weighted residual statement:

EA Σ_{i=1}^{n} a_i ∫_0^L (d^2ψ_i/dx^2) w_j dx = −∫_0^L f(x) w_j dx
For the boundary conditions to be satisfied, we need functions that have zero value at x = 0 and
slope equal to zero at the free end. Sinusoidal functions are appropriate for this; hence, we may
write:

u(x) = Σ_{i=1}^{n} a_i sin((2i − 1) πx/2L)
For the purpose of illustration, let's use the first term only (n = 1). For the weighting function, we
may use a polynomial term; the simplest is w = 1. With a constant distributed load f(x) = f_o, this gives:

EA a ∫_0^L −(π^2/4L^2) sin(πx/2L) dx = −∫_0^L f_o dx

a = 2f_o L^2/(EAπ) ≈ 0.637 f_o L^2/EA
Then, the approximate solution for this problem becomes:

u(x) ≈ 0.637 (f_o L^2/EA) sin(πx/2L)
Now we may compare the obtained solution with the exact one obtained from solving
the differential equation. The maximum displacement is:

u(L) = 0.637 f_o L^2/EA  (Exact: 0.5 f_o L^2/EA)
And the maximum strain is:

du(0)/dx = f_o L/EA  (Exact: f_o L/EA)
Now, if we select a set of points x_j inside the domain of the problem, we may write down the
integral of the residue, multiplied by delta functions as weighting functions, as follows:

∫_Domain R(x) δ(x − x_j) dx = R(x_j) = Σ_{i=1}^{n} a_i L(ψ_i(x_j)) − g(x_j) = 0

Where

k_ij = L(ψ_i(x_j))
q_j = g(x_j)
Applying this to the same example:

EA d^2u/dx^2 + f(x) = 0

Subject to the boundary conditions:

u(0) = 0
du(L)/dx = 0

Now, let's use the approximate solution:

u(x) = Σ_{i=1}^{n} a_i ψ_i(x)
Substituting into the collocation equations:

EA Σ_{i=1}^{n} a_i d^2ψ_i(x_j)/dx^2 = −f(x_j)
For the boundary conditions to be satisfied, we need functions that have zero value at x = 0 and
slope equal to zero at the free end. Sinusoidal functions are appropriate for this; hence, we may
write:

u(x) = Σ_{i=1}^{n} a_i sin((2i − 1) πx/2L)
For the purpose of illustration, let's use the first term only (n = 1). The most logical choice for the
collocation point would be x_j = L/2; thus, for a constant load f(x) = f_o:

EA a (−π^2/4L^2) sin(π(L/2)/2L) = −f(L/2)

→ EA a π^2/(4√2 L^2) = f_o

Or:

a = 4√2 f_o L^2/(EAπ^2) ≈ 0.57 f_o L^2/EA
Then the approximate solution becomes:

u(x) ≈ 0.57 (f_o L^2/EA) sin(πx/2L)
Now we may compare the obtained solution with the exact one. The maximum displacement is:

u(L) = 0.57 f_o L^2/EA  (Exact: 0.5 f_o L^2/EA)

And the maximum strain is:

du(0)/dx = 0.9 f_o L/EA  (Exact: f_o L/EA)
Now, if we select a set of points x_j inside the domain of the problem, we may write down the
integral of the residue, multiplied by the step-function weighting derived above, as follows:

∫_Domain R(x)(S(x − x_j) − S(x − x_{j+1})) dx = ∫_{x_j}^{x_{j+1}} R(x) dx = Σ_{i=1}^{n} a_i ∫_{x_j}^{x_{j+1}} L(ψ_i(x)) dx − ∫_{x_j}^{x_{j+1}} g(x) dx = 0

Where

k_ij = ∫_{x_j}^{x_{j+1}} L(ψ_i(x)) dx

q_j = ∫_{x_j}^{x_{j+1}} g(x) dx
$$EA \frac{d^2 u}{d x^2} + f(x) = 0$$

Subject to the boundary conditions

$$u(0) = 0 \quad , \quad \frac{d u(L)}{dx} = 0$$

Now, let's use the approximate solution

$$u(x) = \sum_{i=1}^{n} a_i \psi_i(x)$$
$$EA \sum_{i=1}^{n} a_i \int_{x_j}^{x_{j+1}} \frac{d^2 \psi_i}{d x^2}\, dx = -\int_{x_j}^{x_{j+1}} f(x)\, dx$$
For the boundary conditions to be satisfied, we need a function that has zero value at x=0 and a slope equal to zero at the free end. Sinusoidal functions are appropriate for this, hence, we may write:

$$u(x) = \sum_{i=1}^{n} a_i \sin\left(\frac{(2i-1)\pi x}{2L}\right)$$
For the purpose of illustration, let's use the first term only (n=1). The most logical choice for the subdomain would be x_j = 0 and x_{j+1} = L, thus:

$$EA\, a \int_0^L \left(-\frac{\pi^2}{4 L^2}\right) \sin\left(\frac{\pi x}{2L}\right) dx = -\int_0^L f(x)\, dx$$

$$a = \frac{2 f_o L^2}{EA\, \pi} \approx 0.637 \frac{f_o L^2}{EA}$$
Then, the approximate solution for this problem becomes

$$u(x) \approx 0.637 \frac{f_o L^2}{EA} \sin\left(\frac{\pi x}{2L}\right)$$
Now we may compare the obtained solution with the exact one. The maximum displacement is

$$u(L) = 0.637 \frac{f_o L^2}{EA} \quad \left(\text{Exact} = 0.5 \frac{f_o L^2}{EA}\right)$$
And maximum strain is:

$$\frac{d u(0)}{dx} = \frac{f_o L}{EA} \quad \left(\text{Exact} = \frac{f_o L}{EA}\right)$$
In the Galerkin method, the weighting functions are chosen to be the proposed solution functions themselves, w_j(x) = ψ_j(x). Thus:

$$k_{ij} = \int_{Domain} L(\psi_i(x))\, \psi_j(x)\, dx \quad , \quad q_j = \int_{Domain} g(x)\, \psi_j(x)\, dx$$

Let us apply it, once more, to the same bar problem:
$$EA \frac{d^2 u}{d x^2} + f(x) = 0$$

Subject to the boundary conditions

$$u(0) = 0 \quad , \quad \frac{d u(L)}{dx} = 0$$

Now, let's use the approximate solution

$$u(x) = \sum_{i=1}^{n} a_i \psi_i(x)$$
$$EA \sum_{i=1}^{n} a_i \int_0^L \frac{d^2 \psi_i}{d x^2}\, \psi_j(x)\, dx = -\int_0^L f(x)\, \psi_j(x)\, dx$$
For the boundary conditions to be satisfied, we need a function that has zero value at x=0 and a slope equal to zero at the free end. Sinusoidal functions are appropriate for this, hence, we may write:

$$u(x) = \sum_{i=1}^{n} a_i \sin\left(\frac{(2i-1)\pi x}{2L}\right)$$
For the purpose of illustration, let's use the first term only (n=1), thus:

$$EA\, a \int_0^L \left(-\frac{\pi^2}{4 L^2}\right) \sin^2\left(\frac{\pi x}{2L}\right) dx = -\int_0^L f(x) \sin\left(\frac{\pi x}{2L}\right) dx$$

$$EA \frac{\pi^2}{4 L^2} \frac{L}{2}\, a = f_o \frac{2L}{\pi}$$

Or:

$$a = \frac{16 f_o L^2}{EA\, \pi^3} \approx 0.52 \frac{f_o L^2}{EA}$$
Then, the approximate solution for this problem becomes

$$u(x) \approx 0.52 \frac{f_o L^2}{EA} \sin\left(\frac{\pi x}{2L}\right)$$
Now we may compare the obtained solution with the exact one. The maximum displacement is

$$u(L) = 0.52 \frac{f_o L^2}{EA} \quad \left(\text{Exact} = 0.5 \frac{f_o L^2}{EA}\right)$$
And maximum strain is:
d u(0)
dx
=0.82
foL
EA (
Exact=
f oL
EA )
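To put the three one-term approximations side by side, the following short Octave sketch (a minimal check, assuming a constant load and normalizing f_o L²/EA = 1) evaluates the coefficient obtained by each method and compares it with the exact tip displacement of 0.5:

%Comparing the one-term approximations u(x)=a*sin(pi*x/(2*L))
%Normalized so that fo*L^2/EA=1; the exact tip displacement is 0.5
a_w1  = 2/pi;            %weighting function w=1 (also the subdomain result)
a_col = 4*sqrt(2)/pi^2;  %collocation at x=L/2
a_gal = 16/pi^3;         %Galerkin weighting
printf("w=1: %.3f collocation: %.3f Galerkin: %.3f exact: 0.500\n", ...
       a_w1, a_col, a_gal)

Note how the Galerkin coefficient is the closest to the exact tip displacement.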
In most structural mechanics problems, the differential equation involves second or higher derivatives of the displacement function. When the Galerkin method is applied to such problems, the proposed function appears multiplied by itself or by another member of its function family, which suggests the use of integration by parts. Let's examine this for the previous example. Substituting the approximate solution and integrating by parts:
$$\int_0^L \frac{d^2 \psi_i}{d x^2}\, \psi_j\, dx = \left. \frac{d \psi_i}{dx}\, \psi_j \right|_0^L - \int_0^L \frac{d \psi_i}{dx} \frac{d \psi_j}{dx}\, dx$$
But, for this specific problem, the boundary terms are equal to zero since the functions were already chosen to satisfy the boundary conditions, so evaluating the integrals will give the same results. So, what did we gain by performing the integration by parts?
• The functions are required to be less differentiable
• Not all boundary conditions need to be satisfied
• The matrix became symmetric!
These gains suggest that the Galerkin method is the best candidate, among the weighted residual methods, for the derivation of finite element, finite volume, and boundary element models.
Figure 7.8.5: Simple fixed-free bar with distributed and end loads
Solution:
The governing differential equation (as seen before) is:

$$EA \frac{d^2 u}{d x^2} + f_0 = 0$$

Subject to the boundary conditions:

$$u(0) = 0 \quad \text{and} \quad \left. \frac{d u}{dx} \right|_{x=L} = -\frac{P}{EA}$$

The exact solution of this problem is:

$$u(x) = -\frac{f_0}{2 EA} x^2 + \left( \frac{f_0 L}{EA} - \frac{P}{EA} \right) x$$
If we use a two-term polynomial, we will actually obtain the exact solution using the weighted residual methods. Let's have the solution in the form:

$$u_a(x) = a_1 x + a_2 x^2$$

Both terms satisfy the boundary conditions of the bar (a constant term would have violated the fixation condition at the left-hand end). However, not both terms satisfy the differentiability conditions, thus we may only use the Galerkin method. Substituting the solution into the differential equation, we get:
$$EA \frac{d^2 u_a}{d x^2} + f_0 = R(x)$$

Applying the Galerkin method with the first weighting function, ψ_1 = x:

$$\int_0^L x\, R(x)\, dx = \int_0^L \left( EA\, x\, \frac{d^2 u_a}{d x^2} + x\, f_0 \right) dx = 0$$

Integrating by parts:

$$\left. EA\, x\, \frac{d u_a}{dx} \right|_0^L - \int_0^L EA\, \frac{d u_a}{dx}\, dx + \int_0^L x\, f_0\, dx = 0$$

$$\Rightarrow EA\, L\, (a_1 + 2 a_2 L) - EA\, (L\, a_1 + L^2 a_2) + f_0\, \frac{L^2}{2} = 0$$

$$\Rightarrow a_2 = -\frac{f_0}{2\, EA}$$
Similarly, for the second weighting function, ψ_2 = x²:

$$\int_0^L x^2 R(x)\, dx = \int_0^L \left( EA\, x^2 \frac{d^2 u_a}{d x^2} + x^2 f_0 \right) dx = 0$$

Integrating by parts:

$$\left. EA\, x^2 \frac{d u_a}{dx} \right|_0^L - \int_0^L EA\, 2x\, \frac{d u_a}{dx}\, dx + \int_0^L x^2 f_0\, dx = 0$$

$$\Rightarrow \left. EA\, x^2 (a_1 + 2 a_2 x) \right|_0^L - EA \int_0^L 2x\, (a_1 + 2 a_2 x)\, dx + \int_0^L x^2 f_0\, dx = 0$$

$$\Rightarrow EA\, L^2 (a_1 + 2 a_2 L) - EA \left( L^2 a_1 + \frac{4 L^3}{3} a_2 \right) + f_0 \frac{L^3}{3} = 0$$

$$\Rightarrow 2\, EA\, a_2 + f_0 = 0 \quad \Rightarrow \quad a_2 = -\frac{f_0}{2\, EA}$$
Incomplete XXXXXX
Figure 7.9.1: Example of 2-D unstructured grid near the surface of an airfoil
Figure 7.9.2: An example of a structured grid near the leading edge of an airfoil
For a one-dimensional domain, the grid generation is much simpler than what we see in the figures above. The domain is divided into intervals, called elements, and the points at the ends of each element are called nodes. The element length, thus, is the difference between the x-values of its end nodes (see Figure 7.9.3). Now that we have all we need to know about the grid, we may start creating the interpolation functions.
The interpolation function for the element needs to satisfy the essential boundary conditions, as we intend to use the Galerkin method. If the essential boundary conditions are the values of the function at each end of the element, we need to satisfy two conditions, which leads to the need for a first-order polynomial. The polynomial may, generally, be written as:

$$u(x) = a_0 + a_1 x$$

Where u(x) may represent the displacement of a bar, the temperature in a conductive material, the flow potential in a fluid, or any other function described by the problem. In vector form, we may write:

$$u(x) = \lfloor 1 \quad x \rfloor \begin{Bmatrix} a_0 \\ a_1 \end{Bmatrix} = \lfloor H(x) \rfloor \{a\}$$
The boundary conditions that need to be satisfied by this polynomial are the values of the function at each end of the element, as presented in Figure 7.9.4.
Figure 7.9.4: One-dimensional element with end function values and concentrated excitations.
Using the boundary conditions, we may get:

$$u(0) = u_1 = \lfloor H(0) \rfloor \{a\} \quad , \quad u(L) = u_2 = \lfloor H(L) \rfloor \{a\}$$
In matrix form:

$$\begin{Bmatrix} u_1 \\ u_2 \end{Bmatrix} = \begin{bmatrix} 1 & 0 \\ 1 & L \end{bmatrix} \begin{Bmatrix} a_0 \\ a_1 \end{Bmatrix} \quad \Rightarrow \quad \{u\} = [T_m]\{a\}$$

Inverting the relation:

$$\{a\} = \begin{bmatrix} 1 & 0 \\ -\frac{1}{L} & \frac{1}{L} \end{bmatrix} \{u\}$$

Substituting back into the polynomial gives

$$u(x) = \lfloor H(x) \rfloor [T_m]^{-1} \{u\} = \lfloor N(x) \rfloor \{u\}$$

where:

$$\lfloor N(x) \rfloor^T = \{N(x)\} = \begin{Bmatrix} 1 - \dfrac{x}{L} \\ \dfrac{x}{L} \end{Bmatrix}$$
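As a quick numerical check, the following Octave sketch (with an assumed element length L = 2 and an assumed sample point x = 0.5) recovers the shape functions from ⌊H(x)⌋[T_m]⁻¹:

L  = 2;            %assumed element length
x  = 0.5;          %any point inside the element
Tm = [1 0; 1 L];   %{u}=[Tm]{a}
H  = [1 x];        %polynomial row vector at x
N  = H/Tm          %shape functions: [1-x/L, x/L] = [0.75 0.25]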
Note that the functions in {N(x)} are exactly those we would have obtained using the Lagrange interpolation method. Let's now keep focusing on this simple problem to illustrate the method. From the point of view of the weighted residual methods, we may view the interpolation function as:
$$u(x) = N_1(x)\, u_1 + N_2(x)\, u_2 = \sum_{i=1}^{2} u_i N_i(x)$$
Where the proposed solution functions are N_i(x) and the generalized coordinates are u_i. The next step is creating the element equations using the Galerkin method, which gives

$$\sum_{i=1}^{2} k_{ij}\, u_i = q_j$$

Where

$$k_{ij} = \int_{Domain} L(N_i(x))\, N_j(x)\, dx \quad , \quad q_j = \int_{Domain} g(x)\, N_j(x)\, dx$$
If we use the bar problem as an example, we may rewrite the governing equation of the bar in the form:

$$\frac{d}{dx}\left( EA \frac{d u}{dx} \right) + f(x) = 0$$
$$\sum_{i=1}^{2} \frac{d}{dx}\left( EA \frac{d N_i}{dx} u_i \right) + f(x) = R(x)$$

$$\int_{x_1}^{x_2} N_j\, R(x)\, dx = \sum_{i=1}^{2} \int_{x_1}^{x_2} N_j \frac{d}{dx}\left( EA \frac{d N_i}{dx} u_i \right) dx + \int_{x_1}^{x_2} N_j\, f(x)\, dx = 0$$
Where x_1 and x_2 are the boundaries of the element we are concerned with. However, since all the integrations we are performing are bound by x_1 and x_2, we may transform the coordinates such that we perform all the integrals from 0 to L, where L is the element length. Applying integration by parts, and using the sifting property of the Dirac delta function to extract the concentrated loads applied at the nodes, the above equation may be rewritten as:
$$\sum_{i=1}^{2} \left( \left. N_j\, EA \frac{d N_i}{dx} u_i \right|_0^L - u_i \int_0^L EA \frac{d N_j}{dx} \frac{d N_i}{dx}\, dx \right) + \int_0^L N_j\, f(x)\, dx = 0$$

$$\sum_{i=1}^{2} u_i \int_0^L EA \frac{d N_j}{dx} \frac{d N_i}{dx}\, dx = \int_0^L N_j\, f(x)\, dx + \left. N_j \sum_{i=1}^{2} EA \frac{d N_i}{dx} u_i \right|_0^L$$
From the interpolation functions we obtained in section 7.9.2, we can find that:

$$N_1(0) = 1,\; N_1(L) = 0,\; N_2(0) = 0,\; N_2(L) = 1$$

$$\frac{d N_1}{dx} = -\frac{1}{L} \quad , \quad \frac{d N_2}{dx} = \frac{1}{L}$$
Using those relations in the equation above, we get:

$$-u_1 \int_0^L EA \frac{d N_j}{dx} \frac{1}{L}\, dx + u_2 \int_0^L EA \frac{d N_j}{dx} \frac{1}{L}\, dx = \int_0^L N_j\, f(x)\, dx + \left. N_j\, EA \frac{d u}{dx} \right|_0^L$$
Note that du/dx is the strain of the bar element; when the strain is multiplied by the modulus of elasticity and the cross-section area, the result is the applied axial load. For j=1, we get the first equation as:

$$u_1 \int_0^L EA \frac{1}{L^2}\, dx - u_2 \int_0^L EA \frac{1}{L^2}\, dx = \int_0^L \left(1 - \frac{x}{L}\right) f(x)\, dx + p_1$$
Similarly, for j=2, we get the second equation as:

$$-u_1 \int_0^L EA \frac{1}{L^2}\, dx + u_2 \int_0^L EA \frac{1}{L^2}\, dx = \int_0^L \frac{x}{L}\, f(x)\, dx + p_2$$
If we assume EA to be constant over the element, we get the element equations as:

$$\frac{EA}{L} \begin{bmatrix} 1 & -1 \\ -1 & 1 \end{bmatrix} \begin{Bmatrix} u_1 \\ u_2 \end{Bmatrix} = \begin{Bmatrix} f_1 \\ f_2 \end{Bmatrix} + \begin{Bmatrix} p_1 \\ p_2 \end{Bmatrix}$$
Or

$$[K]\{u\} = \{q\}$$

Where [K] is called the stiffness matrix and {q} is called the generalized force vector. Note that the generalized force vector includes the effect of the distributed loads as well as the concentrated ones. Also note that we cannot write an explicit expression for the generalized force without knowing the force function. However, we may write:
$$\begin{Bmatrix} f_1 \\ f_2 \end{Bmatrix} = \int_0^L f(x)\, \{N\}\, dx = \begin{Bmatrix} \int_0^L f(x)\, N_1\, dx \\ \int_0^L f(x)\, N_2\, dx \end{Bmatrix}$$
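For example, for a uniformly distributed load f(x) = f_o, these integrals evaluate to

$$f_1 = f_o \int_0^L \left(1 - \frac{x}{L}\right) dx = \frac{f_o L}{2} \quad , \quad f_2 = f_o \int_0^L \frac{x}{L}\, dx = \frac{f_o L}{2}$$

that is, half of the total distributed load is lumped at each node.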
Let's examine what we have got here. On the left-hand side, we have a matrix multiplied by the vector of displacements at both ends of the bar element. On the right-hand side we have two vectors: the first is the resultant of integrating the distributed force over the element, while the second represents the externally applied loads at the nodes of the element. The above equation is the element equation.
Assembling the equations of two adjacent elements that share node 2, we get:

$$\frac{EA}{L} \begin{bmatrix} 1 & -1 & 0 \\ -1 & 2 & -1 \\ 0 & -1 & 1 \end{bmatrix} \begin{Bmatrix} u_1 \\ u_2 \\ u_3 \end{Bmatrix} = \begin{Bmatrix} f_1^{(1)} \\ f_2^{(1)} + f_1^{(2)} \\ f_2^{(2)} \end{Bmatrix} + \begin{Bmatrix} p_1 \\ p_2 \\ p_3 \end{Bmatrix}$$
Where we assumed that the element lengths, modulus of elasticity, and cross-section area are
constant for both elements. Also, the superscripts (1) and (2) denote the information from the first
and second element matrices respectively.
If the bar is fixed at node 1, we apply the boundary condition u_1 = 0:

$$\frac{EA}{L} \begin{bmatrix} 1 & -1 & 0 \\ -1 & 2 & -1 \\ 0 & -1 & 1 \end{bmatrix} \begin{Bmatrix} 0 \\ u_2 \\ u_3 \end{Bmatrix} = \begin{Bmatrix} f_1^{(1)} \\ f_2^{(1)} + f_1^{(2)} \\ f_2^{(2)} \end{Bmatrix} + \begin{Bmatrix} p_1 \\ p_2 \\ p_3 \end{Bmatrix}$$

Retaining the second and third equations, which involve only the unknown displacements, we get:
$$\frac{EA}{L} \begin{bmatrix} 2 & -1 \\ -1 & 1 \end{bmatrix} \begin{Bmatrix} u_2 \\ u_3 \end{Bmatrix} = \begin{Bmatrix} f_2^{(1)} + f_1^{(2)} \\ f_2^{(2)} \end{Bmatrix} + \begin{Bmatrix} p_2 \\ p_3 \end{Bmatrix}$$
This is called the primary system of equations. The primary equations could be readily solved for the unknown displacements by inverting the matrix on the left-hand side and multiplying it by the right-hand side vectors. The remaining (first) equation becomes:

$$\frac{EA}{L} \lfloor -1 \quad 0 \rfloor \begin{Bmatrix} u_2 \\ u_3 \end{Bmatrix} = \left\{ f_1^{(1)} \right\} + \left\{ p_1 \right\}$$
This is called the secondary equation or the auxiliary equation. Its unknown is the concentrated force at the fixed end, i.e., the support reaction, which is found by substituting the solution obtained from the primary equations directly.
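As an illustration of the whole procedure, the following Octave sketch (assuming, for instance, EA = 1 and L = 1 for each element, no distributed load, and a unit concentrated load at node 3) assembles the two elements, solves the primary equations, and recovers the support reaction from the secondary equation:

EA=1; L=1;                  %assumed element properties
ke=EA/L*[1 -1; -1 1];       %element stiffness matrix
K=zeros(3,3);               %global stiffness matrix
K(1:2,1:2)=K(1:2,1:2)+ke;   %adding element 1 (nodes 1-2)
K(2:3,2:3)=K(2:3,2:3)+ke;   %adding element 2 (nodes 2-3)
q=[0; 0; 1];                %unit concentrated load at node 3
u=zeros(3,1);               %u(1)=0 at the fixed end
u(2:3)=K(2:3,2:3)\q(2:3);   %primary equations: u2=1, u3=2
p1=K(1,2:3)*u(2:3)          %secondary equation: reaction p1=-1

The computed reaction p_1 = -1 balances the applied tip load, as expected.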
If, instead, the bar is fixed at both ends, u_1 = u_3 = 0, and we get:

$$\frac{EA}{L} \begin{bmatrix} 1 & -1 & 0 \\ -1 & 2 & -1 \\ 0 & -1 & 1 \end{bmatrix} \begin{Bmatrix} 0 \\ u_2 \\ 0 \end{Bmatrix} = \begin{Bmatrix} f_1^{(1)} \\ f_2^{(1)} + f_1^{(2)} \\ f_2^{(2)} \end{Bmatrix} + \begin{Bmatrix} p_1 \\ p_2 \\ p_3 \end{Bmatrix}$$

In this case, the secondary equations become:

$$\frac{EA}{L} \begin{bmatrix} -1 \\ -1 \end{bmatrix} \{u_2\} = \begin{Bmatrix} f_1^{(1)} \\ f_2^{(2)} \end{Bmatrix} + \begin{Bmatrix} p_1 \\ p_3 \end{Bmatrix}$$
8. Optimization
To translate that into mathematics, if we have an initial guess x_k at which the function has a slope of f'(x_k), we may say that the next best guess is given by:

$$x_{k+1} = x_k \pm \alpha f'(x_k)$$

Where α is called the learning rate, the '+' sign is used when searching for a maximum, and the '-' sign when searching for a minimum. To illustrate, let us use the simple function f(x) = 10x - x², which has a slope of f'(x) = 10 - 2x, for which we want to find the maximum with a starting guess of x = 2 and α = 0.5. We get:

$$f'(2) = 6 \quad \Rightarrow \quad x_1 = 2 + 0.5 \times 6 = 5$$
Which means that we reached the maximum in a single step. Of course, this is not going to be the case in most problems. However, you may notice that the selection of the learning rate is quite important: if it is too large, we may be thrown around the domain without being able to reach a proper solution, while if it is too small, we will need a very large number of iterations to reach a reasonable solution. Nevertheless, the procedure of the steepest descent method is quite simple, which makes it very appealing as a first choice when you are trying to find the minimum (or maximum) of a function. A simple code to implement the procedure above was developed in Octave.
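A minimal sketch of such a program, using the example above with its analytic slope f'(x) = 10 - 2x and the '+' sign to search for a maximum, might look like this:

x0=2;                  %Initial guess
Alpha=0.5;             %Learning rate
Err=10;                %initializing error
Iteration=1;           %iteration counter
while (Err>0.001)      %accepted tolerance
  Iteration=Iteration+1;
  FPrime=10-2*x0;      %analytic slope of f(x)=10x-x^2
  x1=x0+Alpha*FPrime;  %evaluating next guess ('+' for a maximum)
  Err=abs((x1-x0)/x1); %evaluating the error
  x0=x1;
  %Making sure that we do not loop for ever
  if Iteration>20 break endif
endwhile
%Displaying best answer reached
x1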
The result history was plotted for different values of α to illustrate its effect on the convergence of the solution, as presented in Figure 8.3.2.
Figure 8.3.2: Illustration of the effect of α on the convergence of the solution
As simple and, generally, reliable as this method is, in practice we have many cases for which the evaluation of the slope of the function may not be feasible. Thus, we may resort to an approximate evaluation of the slope, using any of the available numerical differentiation methods. In the code below, we replaced the evaluation of the slope by the forward difference formula:

$$f'(x) \approx \frac{f(x + \delta x) - f(x)}{\delta x}$$
x0=0;                    %Initial guess
Alpha=.1;                %Learning rate
Err=10;                  %initializing error
Iteration=1;             %iteration counter
while (Err>0.001)        %accepted tolerance
  Iteration=Iteration+1;
  DeltaX=0.01*x0;        %step size for the forward difference
  if x0==0 DeltaX=0.01; endif
  F1=FofX(x0);
  F2=FofX(x0+DeltaX);
  FPrime=(F2-F1)/DeltaX; %evaluating the slope
  x1=x0+Alpha*FPrime;    %evaluating next guess
  Err=abs((x1-x0)/x1);   %evaluating the error
  x0=x1;
  %Making sure that we do not loop for ever
  if Iteration>20 break endif
endwhile
%Displaying best answer reached
x1
The above code could be used with any objective function that you may need to maximize (or minimize, by changing the '+' to a '-'), which makes it very convenient for optimizing objective functions of a single variable.
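Here, FofX is assumed to return the objective value; for the earlier example f(x) = 10x - x², it might be defined as:

function FofX=FofX(xx)
  %objective function of the earlier example (assumed for illustration)
  FofX=10*xx-xx*xx;
endfunction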
To find an optimum value, we are actually searching for a root of the slope of the function, thus we may write:

$$x_{k+1} = x_k - \frac{f'(x_k)}{f''(x_k)}$$
The following is a listing of an Octave program that was created to solve the above problem. Figure 8.4.1 shows the convergence history of the Newton-Raphson method compared to steepest ascent.
x0=0;                        %Initial guess
Err=10;                      %initializing error
Iteration=1;                 %iteration counter
while (Err>0.001)            %accepted tolerance
  Iteration=Iteration+1;
  FPrime=3*x0*x0-60*x0+200;  %evaluating the slope
  FPPrime=6*x0-60;           %evaluating the second derivative
  x1=x0-FPrime/FPPrime;      %evaluating next guess
  Err=abs((x1-x0)/x1);       %evaluating the error
  x0=x1;
  %Making sure that we do not loop for ever
  if Iteration>2000 break endif
endwhile
%Displaying best answer reached
x1
Note:
1- Steepest descent will diverge if we use larger values of α!
2- Newton-Raphson does not distinguish between a maximum and a minimum, so you will need to test the second derivative.
As seen above, the Newton-Raphson method is much faster in converging on an optimum, but you may get in trouble if the optimum is near or at an inflection point (inflection points are those with zero second derivative). Also, this method has the drawback of requiring the evaluation of two derivatives instead of one. To overcome this problem, we may resort to numerical differentiation as we did for steepest descent. For that, we may use the relations:

$$f'(x) \approx \frac{f(x + \delta x) - f(x - \delta x)}{2 \delta x}$$

$$f''(x) \approx \frac{f(x + \delta x) - 2 f(x) + f(x - \delta x)}{\delta x^2}$$
To apply this in the code, a few lines need to be changed as follows:

F1=FofX(x0);
F2=FofX(x0+DeltaX);
F3=FofX(x0-DeltaX);
FPrime=(F2-F3)/2/DeltaX;            %evaluating the slope (central difference)
FPPrime=(F2+F3-2*F1)/DeltaX/DeltaX; %evaluating the second derivative
x1=x0-FPrime/FPPrime;               %evaluating next guess

Which will yield the same results and convergence rate for this particular problem.
For a function of n variables, the optimum is located where all the partial derivatives vanish:

$$\begin{Bmatrix} \dfrac{\partial f}{\partial x_1} \\ \dfrac{\partial f}{\partial x_2} \\ \vdots \\ \dfrac{\partial f}{\partial x_n} \end{Bmatrix} = \begin{Bmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{Bmatrix}$$
Using the argument for the steepest descent method, we may write:

$$\begin{Bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{Bmatrix}_{k+1} = \begin{Bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{Bmatrix}_k \pm \alpha \begin{Bmatrix} \dfrac{\partial f}{\partial x_1} \\ \dfrac{\partial f}{\partial x_2} \\ \vdots \\ \dfrac{\partial f}{\partial x_n} \end{Bmatrix}_k$$
This could be performed quite easily if we have expressions for the partial derivatives, which are not available in most cases. Thus, we will need to resort to numerical differentiation. For partial derivatives, the central difference formula becomes:

$$\frac{\partial f}{\partial x_i} \approx \frac{f(\dots, x_i + \delta x_i, \dots) - f(\dots, x_i - \delta x_i, \dots)}{2 \delta x_i}$$

This will require 2n function evaluations for every iteration. Below is a code that uses steepest descent to find the optimum of a function of n variables using the numerical evaluation of the partial derivatives.
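A minimal sketch of such a code, assuming the objective is the function FofXY listed at the end of this section (whose nearest local maximum lies at x_i ≈ 4.23 in every direction) and searching for a maximum, might be:

nn=2;                %number of variables (assumed)
x0=zeros(nn,1);      %initial guess
Alpha=0.01;          %learning rate
Err=10;              %initializing error
Iteration=1;         %iteration counter
while (Err>0.001)    %accepted tolerance
  Iteration=Iteration+1;
  Grad=zeros(nn,1);
  for ii=1:nn
    DeltaX=0.01*x0(ii);
    if x0(ii)==0 DeltaX=0.01; endif
    xp=x0; xp(ii)=xp(ii)+DeltaX;  %perturbed points
    xm=x0; xm(ii)=xm(ii)-DeltaX;
    Grad(ii)=(FofXY(nn,xp)-FofXY(nn,xm))/(2*DeltaX); %central difference
  endfor
  x1=x0+Alpha*Grad;           %'+' searches for a maximum
  Err=norm(x1-x0)/norm(x1);   %evaluating the error
  x0=x1;
  %Making sure that we do not loop for ever
  if Iteration>2000 break endif
endwhile
%Displaying best answer reached
x1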
If we attempt to use the Newton-Raphson method, we may recall from section 3.4 that:

$$\begin{Bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{Bmatrix}_{k+1} = \begin{Bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{Bmatrix}_k - \begin{bmatrix} \dfrac{\partial f_1}{\partial x_1} & \dfrac{\partial f_1}{\partial x_2} & \dots & \dfrac{\partial f_1}{\partial x_n} \\ \dfrac{\partial f_2}{\partial x_1} & \dfrac{\partial f_2}{\partial x_2} & \dots & \dfrac{\partial f_2}{\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \dfrac{\partial f_n}{\partial x_1} & \dfrac{\partial f_n}{\partial x_2} & \dots & \dfrac{\partial f_n}{\partial x_n} \end{bmatrix}^{-1} \begin{Bmatrix} f_1 \\ f_2 \\ \vdots \\ f_n \end{Bmatrix}_k$$
If we modify the relation by replacing the functions with the first derivatives of the objective function, we get:

$$\begin{Bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{Bmatrix}_{k+1} = \begin{Bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{Bmatrix}_k - \begin{bmatrix} \dfrac{\partial^2 f}{\partial x_1^2} & \dfrac{\partial^2 f}{\partial x_1 \partial x_2} & \dots & \dfrac{\partial^2 f}{\partial x_1 \partial x_n} \\ \dfrac{\partial^2 f}{\partial x_2 \partial x_1} & \dfrac{\partial^2 f}{\partial x_2^2} & \dots & \dfrac{\partial^2 f}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \dfrac{\partial^2 f}{\partial x_n \partial x_1} & \dfrac{\partial^2 f}{\partial x_n \partial x_2} & \dots & \dfrac{\partial^2 f}{\partial x_n^2} \end{bmatrix}^{-1} \begin{Bmatrix} \dfrac{\partial f}{\partial x_1} \\ \dfrac{\partial f}{\partial x_2} \\ \vdots \\ \dfrac{\partial f}{\partial x_n} \end{Bmatrix}_k$$
To evaluate the second partial derivatives, we may use any of the relations that describe them. For the sake of demonstrating the method, we will be using the simplest relations. We may write:

$$\frac{\partial^2 f}{\partial x_i \partial x_j} \approx \frac{1}{4 \delta x_i \delta x_j} \left[ f(\dots, x_i + \delta x_i, \dots, x_j + \delta x_j, \dots) - f(\dots, x_i - \delta x_i, \dots, x_j + \delta x_j, \dots) - f(\dots, x_i + \delta x_i, \dots, x_j - \delta x_j, \dots) + f(\dots, x_i - \delta x_i, \dots, x_j - \delta x_j, \dots) \right]$$

Noting that the order of differentiation does not matter, the Hessian matrix is symmetric:

$$\frac{\partial^2 f}{\partial x_i \partial x_j} = \frac{\partial^2 f}{\partial x_j \partial x_i}$$
Thus, the code for the Newton-Raphson method may be written as:
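A minimal sketch, under the same assumptions (objective FofXY, central differences for the gradient, and the four-point relation above for the second derivatives), might be:

nn=2;                  %number of variables (assumed)
x0=zeros(nn,1);        %initial guess
Dx=0.01*ones(nn,1);    %perturbation for each variable
Err=10;                %initializing error
Iteration=1;           %iteration counter
while (Err>0.001)      %accepted tolerance
  Iteration=Iteration+1;
  Grad=zeros(nn,1); Hess=zeros(nn,nn);
  for ii=1:nn
    xp=x0; xp(ii)=xp(ii)+Dx(ii);
    xm=x0; xm(ii)=xm(ii)-Dx(ii);
    Grad(ii)=(FofXY(nn,xp)-FofXY(nn,xm))/(2*Dx(ii)); %central difference
    for jj=1:nn
      %four perturbed points for the second derivative
      xpp=x0; xpp(ii)=xpp(ii)+Dx(ii); xpp(jj)=xpp(jj)+Dx(jj);
      xmp=x0; xmp(ii)=xmp(ii)-Dx(ii); xmp(jj)=xmp(jj)+Dx(jj);
      xpm=x0; xpm(ii)=xpm(ii)+Dx(ii); xpm(jj)=xpm(jj)-Dx(jj);
      xmm=x0; xmm(ii)=xmm(ii)-Dx(ii); xmm(jj)=xmm(jj)-Dx(jj);
      Hess(ii,jj)=(FofXY(nn,xpp)-FofXY(nn,xmp) ...
                  -FofXY(nn,xpm)+FofXY(nn,xmm))/(4*Dx(ii)*Dx(jj));
    endfor
  endfor
  x1=x0-Hess\Grad;            %Newton-Raphson step
  Err=norm(x1-x0)/norm(x1);   %evaluating the error
  x0=x1;
  %Making sure that we do not loop for ever
  if Iteration>2000 break endif
endwhile
%Displaying best answer reached
x1

Note that for i = j the four-point relation reduces to the standard central second-difference formula with a step of 2δx, so the same loop fills the whole Hessian.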
For both of the above programs, we used the function FofXY as written below:

function FofXY=FofXY(nn,xx)
  %objective: sum over the variables of 200*x - 30*x^2 + x^3
  FofXY=0;
  for ii=1:nn
    FofXY=FofXY+200*xx(ii)-30*xx(ii)*xx(ii)+xx(ii)*xx(ii)*xx(ii);
  endfor
endfunction
Which you may modify to accommodate any objective function that you need to optimize.
Now, we may claim that we have some tools that may be used to find the optimum solution for any given objective function. However, that may not be true. In many problems, the steepest descent/ascent and Newton-Raphson algorithms may get stuck, or simply do not work. Also, both algorithms can easily converge to a local optimum that is not the global optimum we are searching for. Thus, other optimization algorithms needed to be developed in order to try to find the global optimum.