ECSE343MidtermW2023 Final
Name: __________________________________________________
Question I (7 Marks)
Suppose we are using the half-precision format (16-bit floating-point representation). The details of
the 16-bit floating-point format are shown below:
a. 1 bit is used to determine the sign
b. 5 bits are used for storing exponent, E.
c. 10 bits are used for storing mantissa.
d. The value of the bias used is 15.
a) What is the value in decimal of the following two floating point numbers? (show your work to
justify the answer).
First number:
b0 b1 b2 b3 b4 b5 b6 b7 b8 b9 b10 b11 b12 b13 b14 b15
1  1  0  0  1  0  1  0  0  1  0   0   0   0   0   0
Sign bit = 1, so the value is negative.
E = (10010)_2 = 2^4 + 2^1 = 18  =>  e = E - bias = 18 - 15 = 3
M = (1.1001)_2 = 1 + 2^-1 + 2^-4 = 1.5625   (the leading 1 is not stored)
Value = -(1.1001)_2 x 2^3 = -(1100.1)_2 = -(2^3 + 2^2 + 2^-1) = -12.5
Equivalently, Value = -1.5625 x 2^3 = -12.5
Second number:
M = (1.1001)_2 = 1.5625   (the leading 1 is not stored)
Value = -1.5625 x 2^5 = -50
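The decoding rule stated in the question (1 sign bit, 5 exponent bits with bias 15, 10 mantissa bits with an implicit leading 1) can be sketched in Python as follows. The function name `decode_half` is hypothetical, and the sketch assumes a normalized number (stored exponent strictly between 0 and 31).

```python
def decode_half(bits: str) -> float:
    """Decode a 16-character bit string (b0, the sign bit, first)."""
    assert len(bits) == 16 and set(bits) <= {"0", "1"}
    sign = -1.0 if bits[0] == "1" else 1.0
    E = int(bits[1:6], 2)        # 5-bit stored exponent
    frac = int(bits[6:], 2)      # 10-bit stored mantissa
    assert 0 < E < 31, "this sketch handles normalized numbers only"
    M = 1 + frac / 2**10         # the leading 1 is implicit, not stored
    return sign * M * 2.0 ** (E - 15)

print(decode_half("1100101001000000"))  # -12.5
```

The call uses the bit pattern b0..b15 given in part a).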
b) What is the difference between the two numbers in part a) above and the next larger number that
can be represented as a 16-bit floating point number respectively? Show your work to justify your
answers.
For the first one (-12.5), the difference is 2^-10 x 2^3 = 2^-7 = 7.8125e-3.
For the second one, the difference is 2^-10 x 2^5 = 2^-5 = 1/32.
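The gap to the next larger representable number can be checked numerically. This is a minimal sketch that assumes Python's `struct` format code `"e"` (IEEE 754 binary16) matches the format described in the question; the helper names are hypothetical, and the sketch only covers finite, nonzero inputs that are exactly representable.

```python
import struct

def half_from_bits(pattern: int) -> float:
    return struct.unpack(">e", pattern.to_bytes(2, "big"))[0]

def bits_from_half(x: float) -> int:
    return int.from_bytes(struct.pack(">e", x), "big")

def next_larger(x: float) -> float:
    """Next representable half-precision value above x."""
    p = bits_from_half(x)
    # Positive numbers: the next bit pattern is the next larger value.
    # Negative numbers: the previous bit pattern has a smaller magnitude,
    # so it is the next larger value.
    return half_from_bits(p + 1 if x > 0 else p - 1)

gap = next_larger(-12.5) - (-12.5)
print(gap, gap == 2.0 ** -7)  # 0.0078125 True
```

The gap equals 2^-10 (the weight of the last mantissa bit) scaled by 2^e, which is the rule used in part b).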
c) What is the 16-bit floating point representation of 1.8125? Show your work to justify the answer.
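b0 b1 b2 b3 b4 b5 b6 b7 b8 b9 b10 b11 b12 b13 b14 b15
0  0  1  1  1  1  1  1  0  1  0   0   0   0   0   0
Converting the fractional part by repeated multiplication by 2:
0.8125 x 2 = 1 + 0.625
0.625 x 2 = 1 + 0.25
0.25 x 2 = 0 + 0.5
0.5 x 2 = 1 + 0
=> 1.8125 = (1.1101)_2 = (1.1101)_2 x 2^0  =>  M = 1101000000, e = 0
E = e + bias = 0 + 15 = 15 = (01111)_2
The number is positive, so the sign bit is 0.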
d) What is the 16-bit floating point representation of 29? Show your work to justify the answer.
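Converting by repeated division by 2 (the remainders, read from last to first, give the binary digits):
29/2 = 14 + 1/2
14/2 = 7 + 0/2
7/2 = 3 + 1/2
3/2 = 1 + 1/2
1/2 = 0 + 1/2
=> 29 = (11101)_2 = (1.1101)_2 x 2^4  =>  M = 1101000000, e = 4
E = e + bias = 4 + 15 = 19 = (10011)_2
The number is positive, so the sign bit is 0.
b0 b1 b2 b3 b4 b5 b6 b7 b8 b9 b10 b11 b12 b13 b14 b15
0  1  0  0  1  1  1  1  0  1  0   0   0   0   0   0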
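The representations asked for in parts c) and d) can be cross-checked with a short Python sketch. It assumes that the `struct` format code `"e"` (IEEE 754 binary16, round to nearest) corresponds to the format in the question; `half_bits` is a hypothetical helper name.

```python
import struct

def half_bits(x: float) -> str:
    """Return 'sign exponent mantissa' for x packed as a 16-bit float."""
    raw = int.from_bytes(struct.pack(">e", x), "big")
    b = format(raw, "016b")
    return f"{b[0]} {b[1:6]} {b[6:]}"

print(half_bits(1.8125))  # 0 01111 1101000000
print(half_bits(29.0))    # 0 10011 1101000000
```

Both values share the mantissa 1101000000 and differ only in the stored exponent (15 versus 19).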
e) If you are using 16-bit floating point numbers and you make the assignment x = 0.1, how would
the variable x be stored as a floating point number? Show your work to justify your answer.
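b0 b1 b2 b3 b4 b5 b6 b7 b8 b9 b10 b11 b12 b13 b14 b15
0  0  1  0  1  1  1  0  0  1  1   0   0   1   1   0
Converting by repeated multiplication by 2:
0.1 x 2 = 0 + 0.2
0.2 x 2 = 0 + 0.4
0.4 x 2 = 0 + 0.8
0.8 x 2 = 1 + 0.6
0.6 x 2 = 1 + 0.2
0.2 x 2 = 0 + 0.4   (the pattern 0011 now repeats)
=> 0.1 = (0.00011001100110011...)_2 = (1.100110011001...)_2 x 2^-4
e = -4  =>  E = e + bias = -4 + 15 = 11 = (01011)_2
M = 1001100110   (the first 10 bits after the leading 1; the remaining bits 0110... are dropped)
The number is positive, so the sign bit is 0.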
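The repeated-multiplication procedure for x = 0.1 can be sketched in Python with exact rational arithmetic. The helper name `fraction_bits` is hypothetical, and the sketch truncates the mantissa to 10 bits rather than rounding; for 0.1 both choices give the same stored bits.

```python
from fractions import Fraction

def fraction_bits(x: Fraction, count: int) -> str:
    """First `count` binary digits of 0 <= x < 1 by repeated doubling."""
    digits = []
    for _ in range(count):
        x *= 2
        digit = int(x >= 1)      # integer part after doubling
        digits.append(str(digit))
        x -= digit               # keep only the fractional part
    return "".join(digits)

digits = fraction_bits(Fraction(1, 10), 16)   # digits after the binary point
shift = digits.index("1") + 1                 # position of the leading 1
e = -shift                                    # true exponent
E = e + 15                                    # stored exponent (bias 15)
mantissa = digits[shift:shift + 10]           # 10 bits after the leading 1
pattern = "0" + format(E, "05b") + mantissa
stored = (1 + Fraction(int(mantissa, 2), 2**10)) * Fraction(2) ** e
print(pattern, float(stored))  # 0010111001100110 0.0999755859375
```

The stored value is slightly smaller than 0.1, because the binary expansion of 0.1 does not terminate.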
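Question II (3 Marks)
A numerical simulation requires you to compute $y = \sqrt{x + \frac{1}{x}} - \sqrt{x - \frac{1}{x}}$, for $x \gg 1$. A colleague
notes that:
$$y = \sqrt{x + \tfrac{1}{x}} - \sqrt{x - \tfrac{1}{x}} = \left(\sqrt{x + \tfrac{1}{x}} - \sqrt{x - \tfrac{1}{x}}\right) \frac{\sqrt{x + \tfrac{1}{x}} + \sqrt{x - \tfrac{1}{x}}}{\sqrt{x + \tfrac{1}{x}} + \sqrt{x - \tfrac{1}{x}}} = \frac{2/x}{\sqrt{x + \tfrac{1}{x}} + \sqrt{x - \tfrac{1}{x}}}$$
Your colleague claims that evaluating the equivalent expression $y = \frac{2/x}{\sqrt{x + \frac{1}{x}} + \sqrt{x - \frac{1}{x}}}$ would provide a
more accurate answer on a finite precision computer. Is your colleague correct? Justify your answer.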
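Yes, the colleague is correct. Evaluating the expression as originally presented requires the subtraction
of two numbers that are almost equal, and f(x1, x2) = x1 - x2 is ill-conditioned when x1 and x2 are almost equal.
The equivalent expression replaces that subtraction with an addition, which avoids the cancellation.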
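The effect of the cancellation can be illustrated in double precision. This is a minimal sketch assuming the expression in question is y = sqrt(x + 1/x) - sqrt(x - 1/x); the function names `y_naive` and `y_stable` are hypothetical.

```python
import math

def y_naive(x: float) -> float:
    # Direct evaluation: subtracts two nearly equal square roots.
    return math.sqrt(x + 1.0 / x) - math.sqrt(x - 1.0 / x)

def y_stable(x: float) -> float:
    # Equivalent form: the subtraction is replaced by an addition.
    return (2.0 / x) / (math.sqrt(x + 1.0 / x) + math.sqrt(x - 1.0 / x))

x = 1.0e9
print(y_naive(x))   # 0.0
print(y_stable(x))  # close to x**-1.5
```

For x = 1e9 the term 1/x is smaller than half the spacing of doubles near x, so x + 1/x and x - 1/x both round to x and the direct form loses every significant digit, while the rewritten form stays close to the leading-order value x^(-3/2).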
Question III (10 Marks)
You are tasked with designing a computer program for obtaining polynomial models using least
squares approximation. This would typically require the solution of an overdetermined system Ax = b,
where the matrix A is a Vandermonde matrix known for being ill-conditioned. You have access to
a standard numerical library (such as Matlab for example) with the following methods:
Note that you are not prevented from choosing to use techniques such as Tikhonov regularization and
implementing them using the above library. But you do not need to implement the above methods,
you can simply use them.
Identify the three best candidate approaches to solving this problem and express them as
pseudocode. Identify the strengths and weaknesses of each approach, and choose one while justifying your
choice. If you believe that more information and/or constraints are needed to make a choice, simply
identify what constraints you needed to add or compromises you needed to make that are associated
with your final choice.
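Option #1
Solve the normal equations (AtA) x = At b using Cholesky factorization.
Code:
  At = transp(A);
  AtA = At * A;
  AtB = At * b;
  C = Chol(AtA);
  y = FwdSub(C, AtB);
  x = Bcksub(transp(C), y);
Advantages: faster than the other approaches.
Disadvantages: if A is ill-conditioned, AtA is even more ill-conditioned (its condition number is the
square of that of A), so accuracy could be a problem.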
Option #2
Use Tikhonov regularization: instead of min ||Ax - b||_2^2,
minimize: min { ||Ax - b||_2^2 + delta^2 ||x||_2^2 }
This leads to solving the system (AtA + delta^2 U) x = At b, where U is the identity matrix.
Pseudocode:
  At = transp(A);
  AtApls = At * A + U * (delta^2);
  AtB = At * b;
  C = Chol(AtApls);
  y = FwdSub(C, AtB);
  x = Bcksub(transp(C), y);
Advantages:
  · Improves the conditioning of the problem (AtApls is better conditioned than AtA)
  · Using Cholesky results in lower CPU cost
  · Potentially helps reduce overfitting
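The Cholesky-based approach to the regularized normal equations can be sketched in plain Python. This is an illustrative sketch, not the library named in the exam: all function names are hypothetical, `delta` is an assumed name for the regularization parameter, and setting `delta=0` gives the unregularized normal equations.

```python
import math

def cholesky(S):
    """Lower-triangular C with C * C^T = S (S symmetric positive definite)."""
    n = len(S)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = S[i][j] - sum(C[i][k] * C[j][k] for k in range(j))
            C[i][j] = math.sqrt(s) if i == j else s / C[j][j]
    return C

def fwd_sub(L, rhs):
    """Solve L y = rhs for lower-triangular L."""
    y = []
    for i, row in enumerate(L):
        y.append((rhs[i] - sum(row[k] * y[k] for k in range(i))) / row[i])
    return y

def back_sub_transposed(L, y):
    """Solve L^T x = y for lower-triangular L."""
    n = len(L)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def tikhonov_normal_equations(A, b, delta=0.0):
    """Solve (A^T A + delta^2 I) x = A^T b via Cholesky."""
    n = len(A[0])
    AtA = [[sum(r[i] * r[j] for r in A) + (delta**2 if i == j else 0.0)
            for j in range(n)] for i in range(n)]
    Atb = [sum(r[i] * bi for r, bi in zip(A, b)) for i in range(n)]
    C = cholesky(AtA)
    return back_sub_transposed(C, fwd_sub(C, Atb))

# Fit c0 + c1*t to points that lie exactly on the line 1 + 2t.
t = [0.0, 1.0, 2.0, 3.0]
A = [[1.0, ti] for ti in t]          # 4 x 2 Vandermonde matrix
b = [1.0 + 2.0 * ti for ti in t]
print(tikhonov_normal_equations(A, b))             # close to [1, 2]
print(tikhonov_normal_equations(A, b, delta=0.1))  # slightly shrunk
```

The regularized solution is pulled slightly toward zero, which is the price paid for the improved conditioning.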
Option #3
Use Tikhonov regularization: instead of min ||Ax - b||_2^2,
minimize: min { ||Ax - b||_2^2 + delta^2 ||x||_2^2 }
This is equivalent to applying least squares to the problem:
  [A; delta*U] x = [b; 0]   or   Ap x = bp
Solve using QR. Use Householder for better accuracy.
  Ap = Q R
  Q R x = bp
  R x = transp(Q) bp   =>   solve using back-substitution.
Code:
  U = eye(length(x));
  z = zeros(length(x), 1);
  bp = [b; z];
  Ap = [A; delta * U];
  [Q, R] = HouseholderQR(Ap);
  bq = transp(Q) * bp;
  x = Bcksub(R, bq);
Advantages:
  · Improves the conditioning of the problem
  · Potentially helps reduce overfitting
  · Householder QR gives better accuracy
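The augmented least-squares formulation with Householder QR can be sketched in plain Python as follows. This is a sketch under stated assumptions: the function names are hypothetical, `delta` is an assumed name for the regularization parameter, the augmented matrix is assumed to have full column rank, and the reflections are applied directly to [Ap | bp] instead of forming Q explicitly.

```python
import math

def householder_lstsq(A, b):
    """Least-squares solution of A x = b via Householder reflections."""
    m, n = len(A), len(A[0])
    # Work on [A | b] so each reflection is applied to b as well.
    W = [list(row) + [bi] for row, bi in zip(A, b)]
    for k in range(n):
        norm = math.sqrt(sum(W[i][k] ** 2 for i in range(k, m)))
        alpha = -math.copysign(norm, W[k][k])   # sign chosen to avoid cancellation
        v = [W[i][k] for i in range(k, m)]
        v[0] -= alpha                           # v = x - alpha * e1
        vtv = sum(vi * vi for vi in v)
        if vtv == 0.0:
            continue                            # column is already zero
        for j in range(k, n + 1):
            s = 2.0 * sum(v[i] * W[k + i][j] for i in range(m - k)) / vtv
            for i in range(m - k):
                W[k + i][j] -= s * v[i]
    # Back-substitution on the upper-triangular n x n block of R.
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (W[i][n] - sum(W[i][j] * x[j] for j in range(i + 1, n))) / W[i][i]
    return x

def tikhonov_qr(A, b, delta):
    """Solve min ||Ax - b||^2 + delta^2 ||x||^2 as an augmented least-squares problem."""
    n = len(A[0])
    Ap = [list(r) for r in A] + [[delta if i == j else 0.0 for j in range(n)]
                                 for i in range(n)]
    bp = list(b) + [0.0] * n
    return householder_lstsq(Ap, bp)

# Fit c0 + c1*t to points that lie exactly on the line 1 + 2t.
t = [0.0, 1.0, 2.0, 3.0]
A = [[1.0, ti] for ti in t]
b = [1.0 + 2.0 * ti for ti in t]
print(tikhonov_qr(A, b, 0.0))   # close to [1, 2]
print(tikhonov_qr(A, b, 0.1))   # slightly shrunk toward zero
```

Because the product A^T A is never formed, the factorization works with the conditioning of the augmented matrix itself, at a higher cost than Cholesky.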
Choice: Option #2 gives a good compromise between accuracy and speed, so we will choose it.
All three are valid here, however.