
ECSE343MidtermW2023 Final

This document is the midterm exam for the course ECSE 343 Numerical Methods in Engineering. It contains 3 questions worth a total of 20 marks. Students have 1 hour and 20 minutes to complete the exam, which is open book; general web access and communication with others are not permitted. Question 1 (7 marks) covers the 16-bit half precision floating point format: decoding two bit patterns, finding the gap to the next larger representable number, and encoding 1.8125, 29 and 0.1. Question 2 (3 marks) asks whether a rewritten expression is more accurate in finite precision. Question 3 (10 marks) asks for three candidate approaches to an ill-conditioned least squares problem.

Uploaded by

Noor Jetha
Copyright
© All Rights Reserved

Midterm Exam

ECSE 343 Numerical Methods in Engineering


Date: Thursday Feb 23, 2023, 11:35am-12:55pm

Name: __________________________________________________

Student Number: _____________________

Time Allowed: 1hr 20min (80 min)

• Answer all the questions.


• You may use Faculty standard calculators.
• Please show all your work.
• This is an open book exam.
• You may use your computer or other devices to access the course notes on myCourses or an
electronic version of the textbook. You may not access the web in general or communicate with
others.
• If you think there is information missing in a question, make a reasonable assumption (while
explaining why) and answer the question.
• Answer all questions on the exam paper.
• This exam has 3 questions.
• The exam has 8 pages (including the cover page)
          Q1    Q2    Q3    Total
Weight     7     3    10       20
Grade

Question I (7 Marks)
Suppose we are using the half precision format (16-bit floating point representation). The details of
the 16-bit floating point format are shown below:
a. 1 bit is used to determine the sign
b. 5 bits are used for storing exponent, E.
c. 10 bits are used for storing mantissa.
d. The value of the bias used is 15.

a) What is the value in decimal of the following two floating point numbers? (show your work to
justify the answer).
b0  b1  b2  b3  b4  b5  b6  b7  b8  b9  b10 b11 b12 b13 b14 b15
1   1   0   0   1   0   1   0   0   1   0   0   0   0   0   0

Sign bit = 1 => the number is negative.
E = 10010₂ = 2^4 + 2^1 = 18 => e = E - bias = 18 - 15 = 3
M = 1.1001₂ (the leading 1 is not stored) = 2^0 + 2^-1 + 2^-4 = 1.5625
Value = -1.1001₂ x 2^3 = -1100.1₂ = -1.5625 x 2^3 = -12.5

b0  b1  b2  b3  b4  b5  b6  b7  b8  b9  b10 b11 b12 b13 b14 b15
1   1   1   1   1   0   1   0   0   1   0   0   0   0   0   0

Sign bit = 1 => the number is negative.
E = 11110₂ = 2^4 + 2^3 + 2^2 + 2^1 = 30 => e = E - bias = 30 - 15 = 15
M = 1.1001₂ (the leading 1 is not stored) = 2^0 + 2^-1 + 2^-4 = 1.5625
Value = -1.1001₂ x 2^15 = -1.5625 x 2^15 = -51200
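The decoding steps used for both numbers can be checked with a short Python sketch. The function name `decode_half` is hypothetical, and the sketch assumes the layout given in the question (1 sign bit, 5 exponent bits, 10 mantissa bits, bias 15) and handles normalized numbers only.

```python
def decode_half(bits: str, bias: int = 15) -> float:
    """Decode a 16-digit bit string: sign, 5 exponent bits, 10 fraction bits.

    Only normalized numbers are handled; E = 0 (zero/subnormal) and
    E = 31 (infinity/NaN) are outside the scope of this sketch.
    """
    bits = bits.replace(" ", "")
    if len(bits) != 16 or set(bits) - {"0", "1"}:
        raise ValueError("expected 16 binary digits")
    sign = -1.0 if bits[0] == "1" else 1.0
    E = int(bits[1:6], 2)                  # stored (biased) exponent
    if E in (0, 31):
        raise ValueError("only normalized numbers are handled")
    fraction = int(bits[6:], 2) / 2**10    # stored fraction bits as a value in [0, 1)
    return sign * (1.0 + fraction) * 2.0 ** (E - bias)


print(decode_half("1 10010 1001000000"))   # first number of part a)
print(decode_half("1 11110 1001000000"))   # second number of part a)
```

The implicit leading 1 is added explicitly in `1.0 + fraction`, which mirrors the "not stored" remark in the worked solution.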

b) What is the difference between the two numbers in part a) above and the next larger number that
can be represented as a 16-bit floating point number respectively? Show your work to justify your
answers.
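With 10 mantissa bits, neighbouring numbers that share the exponent e differ by 2^-10 x 2^e.

For the first one (-12.5), e = 3, so the difference is 2^-10 x 2^3 = 2^-7 = 7.8125 x 10^-3 (about 0.008).

For the second one (-51200), e = 15, so the difference is 2^-10 x 2^15 = 2^5 = 32.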
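The spacing argument can be cross-checked by stepping the stored bit pattern by one. The sketch below uses Python's standard `struct` module, whose `e` format code is IEEE 754 binary16; the helper names are hypothetical, and the sketch assumes a finite, nonzero, normalized input that is not the largest representable value.

```python
import struct


def half_to_pattern(value: float) -> int:
    """Return the 16-bit pattern of value rounded to half precision."""
    return struct.unpack(">H", struct.pack(">e", value))[0]


def pattern_to_half(pattern: int) -> float:
    """Return the half-precision value stored in a 16-bit pattern."""
    return struct.unpack(">e", struct.pack(">H", pattern))[0]


def gap_to_next_larger(value: float) -> float:
    """Distance from value to the next larger representable half.

    For a positive number the next larger value is pattern + 1; for a
    negative number the magnitude must shrink, so it is pattern - 1.
    """
    pattern = half_to_pattern(value)
    step = -1 if pattern & 0x8000 else 1
    return pattern_to_half(pattern + step) - pattern_to_half(pattern)


print(gap_to_next_larger(-12.5))      # gap for the first number
print(gap_to_next_larger(-51200.0))   # gap for the second number
```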

c) What is the 16-bit floating point representation of 1.8125? Show your work to justify the answer.
b0  b1  b2  b3  b4  b5  b6  b7  b8  b9  b10 b11 b12 b13 b14 b15
0   0   1   1   1   1   1   1   0   1   0   0   0   0   0   0

1.8125 = 1 + 0.8125
0.8125 x 2 = 1 + 0.625
0.625  x 2 = 1 + 0.25
0.25   x 2 = 0 + 0.5
0.5    x 2 = 1 + 0

=> 1.8125 = 1.1101₂ = 1.1101₂ x 2^0
M = 1.1101₂ => stored mantissa bits = 1101000000
e = 0 => E = e + 15 = 15 = 01111₂
Sign bit is 0 (the number is positive).
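The repeated multiply-by-2 procedure used here to expand the fractional part can be written as a small Python sketch. The name `fraction_bits` and the `max_bits` cut-off are assumptions made for illustration.

```python
def fraction_bits(fraction: float, max_bits: int = 10) -> str:
    """Binary digits of a fraction in [0, 1) by repeated doubling.

    Each doubling moves the next binary digit in front of the point;
    that digit is recorded and removed before the next step.
    """
    if not 0.0 <= fraction < 1.0:
        raise ValueError("fraction must lie in [0, 1)")
    bits = []
    for _ in range(max_bits):
        if fraction == 0.0:        # expansion terminated exactly
            break
        fraction *= 2
        bit = int(fraction)        # 0 or 1: the integer part after doubling
        bits.append(str(bit))
        fraction -= bit
    return "".join(bits)


print(fraction_bits(0.8125))       # fractional part of 1.8125
```

For 0.8125 the expansion terminates after four digits, so the remaining stored mantissa bits are padded with zeros.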

d) What is the 16-bit floating point representation of 29? Show your work to justify the answer.

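b0  b1  b2  b3  b4  b5  b6  b7  b8  b9  b10 b11 b12 b13 b14 b15
0   1   0   0   1   1   1   1   0   1   0   0   0   0   0   0

29/2 = 14 + 1/2
14/2 =  7 + 0/2
 7/2 =  3 + 1/2
 3/2 =  1 + 1/2
 1/2 =  0 + 1/2

Reading the remainders from last to first: 29 = 11101₂ = 1.1101₂ x 2^4
M = 1.1101₂ => stored mantissa bits = 1101000000
e = 4 => E = 4 + 15 = 19 = 10011₂
Sign bit is 0 (the number is positive).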
e) Suppose you are using 16-bit floating point numbers and you make the assignment x=0.1. How would
the variable x be stored as a floating point number? Show your work to justify your answer.

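b0  b1  b2  b3  b4  b5  b6  b7  b8  b9  b10 b11 b12 b13 b14 b15
0   0   1   0   1   1   1   0   0   1   1   0   0   1   1   0

0.1 x 2 = 0 + 0.2
0.2 x 2 = 0 + 0.4
0.4 x 2 = 0 + 0.8
0.8 x 2 = 1 + 0.6
0.6 x 2 = 1 + 0.2
0.2 x 2 = 0 + 0.4
0.4 x 2 = 0 + 0.8
0.8 x 2 = 1 + 0.6
(the pattern 0011 repeats forever)

=> 0.1 = 0.00011001100110011...₂ = 1.100110011001100...₂ x 2^-4
e = -4 => E = -4 + 15 = 11 = 01011₂
M: only the first 10 fraction bits can be stored => 1001100110
   (the discarded bits 0110... are less than half a unit in the last place,
   so rounding to nearest gives the same bits as truncation)
Sign bit is zero.

0.1 cannot be represented exactly; the stored value is only an approximation of 0.1.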
Question II (3 Marks)

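A numerical simulation requires you to compute $y = \sqrt{x + \frac{1}{x}} - \sqrt{x - \frac{1}{x}}$, for $x \gg 1$. A colleague
notes that:

$$
y = \sqrt{x + \tfrac{1}{x}} - \sqrt{x - \tfrac{1}{x}}
  = \left(\sqrt{x + \tfrac{1}{x}} - \sqrt{x - \tfrac{1}{x}}\right)
    \frac{\sqrt{x + \tfrac{1}{x}} + \sqrt{x - \tfrac{1}{x}}}{\sqrt{x + \tfrac{1}{x}} + \sqrt{x - \tfrac{1}{x}}}
  = \frac{2/x}{\sqrt{x + \tfrac{1}{x}} + \sqrt{x - \tfrac{1}{x}}}
$$

Your colleague claims that evaluating the equivalent expression
$y = \dfrac{2/x}{\sqrt{x + \frac{1}{x}} + \sqrt{x - \frac{1}{x}}}$ would provide a
more accurate answer on a finite precision computer. Is your colleague correct? Justify your answer.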

$y = \sqrt{x + \frac{1}{x}} - \sqrt{x - \frac{1}{x}}$; for $x \gg 1$, $\frac{1}{x} \to 0$, so $\sqrt{x + \frac{1}{x}} \approx \sqrt{x - \frac{1}{x}}$.

Evaluating the expression as originally presented requires the subtraction of
two numbers that are almost equal. The subtraction $f(x_1, x_2) = x_1 - x_2$ is ill-conditioned
when $x_1 \approx x_2$ (leads to catastrophic cancellation).

The second expression is mathematically equivalent but does not require the
subtraction of two almost equal numbers. The colleague is correct.
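The cancellation argument can be illustrated numerically. The sketch below assumes the expression is y = sqrt(x + 1/x) - sqrt(x - 1/x) and that the colleague's form is (2/x) / (sqrt(x + 1/x) + sqrt(x - 1/x)); the function names are hypothetical, and the arithmetic is Python's double precision rather than the 16-bit format of Question I.

```python
import math


def y_naive(x: float) -> float:
    """Direct evaluation: subtracts two nearly equal square roots."""
    return math.sqrt(x + 1.0 / x) - math.sqrt(x - 1.0 / x)


def y_rewritten(x: float) -> float:
    """Conjugate form: the subtraction is replaced by an addition."""
    return (2.0 / x) / (math.sqrt(x + 1.0 / x) + math.sqrt(x - 1.0 / x))


for x in (1e2, 1e5, 1e9):
    print(f"x={x:.0e}  naive={y_naive(x):.6e}  rewritten={y_rewritten(x):.6e}")
```

At x = 1e9 the term 1/x is smaller than half the spacing between doubles near x, so both square roots receive the same argument and the direct form returns exactly zero, while the rewritten form still returns a value close to x^(-3/2).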

Question III (10 Marks)

You are tasked with designing a computer program for obtaining polynomial models using least
squares approximation. This would typically require the solution of an overdetermined system A x =
b, where the matrix A is a Vandermonde matrix known for being ill-conditioned. You have access to
a standard numerical library (such as Matlab for example) with the following methods:

a) Fundamental matrix and vector operations (multiplication, addition, etc.)
b) Norm(A,n) → Returns the Ln norm of A
c) PLU(A) → Returns L, U and P such that P*A=L*U (Partial pivoting)
d) LU(A) → Returns L and U such that L*U=A (No pivoting)
e) transp(A) → Returns the transpose of A
f) Chol(A) → Returns L such that A=L*L^T (uses Cholesky decomposition; A must be
symmetric positive definite).
g) MGSqr(A) → Returns Q and R such that A=Q*R (uses the modified Gram-Schmidt
algorithm)
h) HouseholderQR(A) → Returns Q and R such that A=Q*R (uses Householder reflections)
i) GivensQR(A) → Returns Q and R such that A=Q*R (uses Givens rotations)
j) FwdSub(L,b) → Returns y such that L*y=b (L must be lower triangular).
k) BwdSub(U,y) → Returns x such that U*x=y (U must be upper triangular).

Note that you are not prevented from choosing techniques such as Tikhonov regularization and
implementing them using the above library. You do not need to implement the above methods;
you can simply use them.

Identify the three best candidate approaches to solving this problem and express them as
pseudocode. Identify the strengths and weaknesses of each approach, and choose one while justifying your
choice. If you believe that more information and/or constraints are needed to make a choice, simply
identify the constraints you need to add or the compromises you need to make that are associated
with your final choice.

Option #1

Solve the Normal Equations (A^T A) x = A^T b

A^T A is symmetric positive semidefinite, and positive definite when A has full column rank
=> can use Cholesky factorization, for faster CPU time.

Code:
At  = transp(A);
AtA = At * A;
AtB = At * b;
L   = Chol(AtA);
y   = FwdSub(L, AtB);
x   = BwdSub(transp(L), y);

Advantages: faster CPU time than the other approaches.
Disadvantages: if A is ill-conditioned, then A^T A is even more ill-conditioned.
Accuracy could be a problem.
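Option #1 can be sketched in runnable form. The exam's library (`Chol`, `FwdSub`, `BwdSub`, `transp`) is not a real package, so the sketch below substitutes small pure-Python stand-ins with hypothetical lowercase names; it assumes a Vandermonde matrix with columns 1, t, t^2, ... and a full-column-rank A.

```python
import math


def transp(A):
    return [list(col) for col in zip(*A)]


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def chol(S):
    """Lower-triangular L with S = L * L^T (S must be symmetric positive definite)."""
    n = len(S)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        d = S[j][j] - sum(L[j][k] ** 2 for k in range(j))
        if d <= 0.0:
            raise ValueError("matrix is not positive definite")
        L[j][j] = math.sqrt(d)
        for i in range(j + 1, n):
            L[i][j] = (S[i][j] - sum(L[i][k] * L[j][k] for k in range(j))) / L[j][j]
    return L


def fwd_sub(L, b):
    y = []
    for i, row in enumerate(L):
        y.append((b[i] - sum(row[k] * y[k] for k in range(i))) / row[i])
    return y


def bwd_sub(U, y):
    n = len(U)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(U[i][k] * x[k] for k in range(i + 1, n))) / U[i][i]
    return x


def normal_equations_fit(t, b, degree):
    """Polynomial coefficients (lowest order first) via A^T A x = A^T b."""
    A = [[ti ** p for p in range(degree + 1)] for ti in t]   # Vandermonde matrix
    At = transp(A)
    AtA = matmul(At, A)
    Atb = [sum(a * bi for a, bi in zip(row, b)) for row in At]
    L = chol(AtA)
    y = fwd_sub(L, Atb)
    return bwd_sub(transp(L), y)


print(normal_equations_fit([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0], 1))
```

The example data lie exactly on the line 1 + 2t, so the fit should recover coefficients close to 1 and 2; with high polynomial degrees the squared condition number of A^T A is what limits the accuracy.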
Option #2

Use Tikhonov regularization: instead of min ||Ax - b||₂², minimize
min ||Ax - b||₂² + δ² ||x||₂²

This leads to solving (A^T A + δ² I) x = A^T b rather than A^T A x = A^T b
-> solve using Cholesky factorization for fast execution

Pseudocode:
At     = transp(A);
U      = eye(n);                 (n = number of columns of A)
AtApls = At * A + δ^2 * U;
AtB    = At * b;
L      = Chol(AtApls);
y      = FwdSub(L, AtB);
x      = BwdSub(transp(L), y);

Advantages:
· Improves the conditioning of the problem
· Using Cholesky results in lower CPU cost
· Potentially helps reduce overfitting

Disadvantages: need to carefully choose the hyperparameters
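The step from the regularized objective to the linear system in Option #2 can be made explicit. This is a short derivation sketch; writing the penalty as δ²‖x‖² is an assumption about the notation, chosen so that it matches a stacked system of the form [A; δI].

```latex
\begin{aligned}
\phi(x) &= \|Ax - b\|_2^2 + \delta^2 \|x\|_2^2 \\
\nabla \phi(x) &= 2A^T(Ax - b) + 2\delta^2 x = 0 \\
\Longrightarrow \quad (A^T A + \delta^2 I)\,x &= A^T b
\end{aligned}
```

Adding δ²I shifts every eigenvalue of A^T A up by δ², so the 2-norm condition number becomes (σ_max² + δ²)/(σ_min² + δ²), where σ_max and σ_min are the largest and smallest singular values of A. This is why the regularized matrix is better conditioned than A^T A and is positive definite for any δ > 0.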

Option #3

Use Tikhonov regularization: instead of min ||Ax - b||₂², minimize
min ||Ax - b||₂² + δ² ||x||₂²

This is equivalent to applying least squares to the problem:
[A; δI] x = [b; 0]   or   Ap x = bp

Solve using QR. Use Householder for better accuracy.
Ap = Q R => Q R x = bp => R x = Q^T bp
-> solve using back-substitution.

Code:
U      = eye(n);                 (n = number of columns of A)
z      = zeros(n, 1);
bp     = [b; z];
Ap     = [A; δ * U];
[Q, R] = HouseholderQR(Ap);
bq     = transp(Q) * bp;
x      = BwdSub(R, bq);

Advantages:
· Improves the conditioning of the problem
· Potentially helps reduce overfitting
· Householder QR gives better accuracy than the Cholesky approach

Final choice: Option #2 gives a good compromise between speed and accuracy, so we will
choose it here; however, all three are valid depending on what one prioritizes.