EE2007C Chap1 201516

This document provides information about the Computer System Fundamentals EE2007C course taught by Dr. Isaac Y.F. Fung. The course covers topics such as microprocessor architectures, assembly language programming, interrupts, memory systems, input/output subsystems, computer communication, and bus architecture over 11 weeks. Assessment includes continuous assessments and a final exam. Continuous assessments comprise lab performance, exercises, activities, lab reports, and tests. The course objectives are to understand computer system principles, learn assembly language programming, and design simple computer systems.

Computer System Fundamentals EE2007C

Lecturer: Dr Isaac Y.F. Fung


Email: [email protected]
Office: CF605

Consultation: Tuesday Afternoon

Contents:

1 Basic Architectures of Intel Family of microprocessors (3 weeks)
2 Instruction Modes/Assembly Language programming (3 weeks)
3 Interrupt mechanism (1.5 weeks)
4 Memory system design (1.5 weeks)
5 Input/Output Subsystem (1 week)
6 Computer communication (1 week)
7 BUS architecture (1 week)

Text/Reference books

1 The Intel Microprocessors, Barry Brey, 7th ed., Prentice-Hall, ISBN 0-13-060714-2 (or the latest edition)
2 Computer Architecture: A Quantitative Approach, Hennessy JL and Patterson DA, 5th ed., Elsevier, 2012
3 Computer Organization and Embedded Systems, C. Hamacher et al., McGraw-Hill, ISBN 978-007-108900-5
4 Intel Microprocessors: Architecture, Programming and Interfacing, Ray & Bhurchandi, McGraw-Hill, ISBN 0-07-120169-6
5 Pentium Processor System Architecture, 2nd ed., Don Anderson & Tom Shanley, MindShare, Inc.
6 IBM PC Assembly Language and Programming, 4th ed., Peter Abel, Prentice-Hall International Inc.
7 The 8086 and 80286 Microprocessors: Hardware, Software and Interfacing, A. Singh & W.A. Triebel, Prentice-Hall

Assessment methods:

Continuous assessments: 40%
Final examination: 60%

Continuous assessments include


1 Performance in Lab (5%)
2 Blackboard exercises (5%)
3 In-class activities (5%)
4 Lab report (10%)
5 Written test (15%)
Criterion-Referenced Assessment (CRA)

Pass
Some understanding of assembly language programming
Some understanding of other topics
Grade C/C+
Able to write simple assembly language programs
Able to demonstrate in-depth understanding of certain topics
Grade B/B+
Able to write assembly language programs to achieve specific tasks
Able to demonstrate in-depth understanding of most topics
Grade A/A+
Able to write assembly language programs
Able to demonstrate in-depth understanding of all topics
Able to apply the knowledge learnt to solve real-life problems
Objectives and learning outcomes

1. To understand the basic principles (hardware components) of a simple computer system
2. To learn how to control/program a computer system using assembly language programming
3. To learn how to design a simple computer system (e.g. for the development of an electronic mouse, robot, or simple control device for your final year group project)
Basic number systems - revision

The microprocessor (μP) is a binary device: everything inside it is represented by 0s and 1s. Since the microprocessor performs many numerical operations, it is necessary to have a basic understanding of the number systems used inside a microprocessor as well as the number systems used during programming.

The most direct number system used inside the microprocessor is the binary system (base 2), with only the digits 0 and 1, for example 01010101. Each digit (or bit) in the number represents a power of 2, starting from the RHS (right-hand side). The first digit represents 2^0, then 2^1, 2^2, etc. The rightmost bit is also called the LSB (least significant bit) while the leftmost bit is the MSB (most significant bit).

So the value 01010101 (base 2) = 0x2^7 + 1x2^6 + 0x2^5 + 1x2^4 + 0x2^3 + 1x2^2 + 0x2^1 + 1x2^0
= 64 + 16 + 4 + 1 = 85

In addition, we use the term byte for 8-bit data, the term word for a 16-bit value, and double word for a 32-bit value.
Representing a very large number requires many bits, which is not convenient to write on paper. Therefore, when we communicate with the computer, we usually use number systems derived from the binary system; the most commonly used are octal (base 8) and hexadecimal (base 16), usually just called Hex.

In octal, only digits from 0 to 7 will be used and in hexadecimal, we use 0 to 9 and A, B,
C, D, E, F to represent the values. A = 10, B = 11, C = 12, D = 13, E = 14, F = 15

Converting from a binary number to hex is very easy: starting from the RHS, every 4 bits of the binary number convert directly into one hex digit.
For example: 0101 0101 is equal to 55H (H = hexadecimal). We have two 4-bit groups, 0101 and 0101; each group 0101 = 5, so the binary pattern is 55H (in Hex).

Example: 1100 0111 1010 1011 = C 7 A B H

To convert from a Hex to a binary value it is just the reverse process.


Example ABCDH = 1010 1011 1100 1101 B

What is the value 10101100 equal to ?

Units used in a computer system


1K = 1024 (2^10)
1M = 1024K (2^20)
1G = 1024M (2^30)
Positive and negative values
If a binary pattern represents both positive and negative values then it is regarded as signed; otherwise it is unsigned (positive only).
Since everything inside the computer is binary, the -ve sign that we use to represent a negative value is also represented by a binary pattern (usually by 1 bit).
There are two types of signed notation: using a sign bit (usually the leftmost bit), and 2's complement.
For example, using the leftmost bit as a sign bit, 10101010 = -42 while 00101010 = 42

The sign-bit system is seldom used because it is not easy to implement in hardware for arithmetic operations (such as addition, multiplication, etc.).
Usually the 2's complement system is used. To convert a value X to -X using 2's complement, we first take the 1's complement of X and then add 1 to the result. The 1's complement of a binary value is the negation of its bit pattern (i.e. inverting the 0s and 1s).

Example: X = 00101010 = 42. Taking the 1's complement of X gives 11010101;
then add 1 to the result, i.e. 11010101 + 1 = 11010110 = -42

Note: there is no -0 in binary numbers based on 2's complement

Example:
If the number is represented by only 2 bits:
The binary addition 01 + 01 = 10
The binary subtraction 00 - 01 = 11 with a borrow of 1

Note: from the computer's point of view, A-B is not the same as A+(-B)!

Self-exercise
What is the range of values that can be represented by an 8-bit pattern?

Floating point number using binary form

As mentioned above, each digit in a binary pattern represents a power of 2.
In a fractional binary number, the digits after the binary point represent the values 2^-1, 2^-2, etc.
Example: .1010 = 1x2^-1 + 0x2^-2 + 1x2^-3 + 0x2^-4 = 0.5 + 0.125 = 0.625

Floating point format

Floating-point numbers are represented in the form

X = F x 2^E

F is called the fraction (or mantissa) and E is the exponent.
F must be in the form 1.XXXXX, and the leading 1 is not stored in the pattern since it is already known.
Floating-point numbers are usually represented in two standards defined by the IEEE (Institute of Electrical and Electronics Engineers): the 32-bit (single precision) and 64-bit (double precision) standards. An 80-bit (extended precision) standard is also available. When you write your C++ programs, you should try using float and double!
For single precision, there are 23 bits for the fraction, 8 bits for the exponent, and 1 sign bit.
For double precision, there are a 52-bit fraction, an 11-bit exponent, and 1 sign bit.
The bit order is: Sign, Exp., Fraction

Some special cases in floating point notation

For the single precision format:

If E=255, F!=0: NaN (not a number)

If E=255, F=0, S=1: -infinity

If E=255, F=0, S=0: +infinity

If 0<E<255: X = (-1)^S x 2^(E-127) x 1.F

If E=0, F!=0: X = (-1)^S x 2^(-126) x 0.F

If E=0, F=0, S=1: X = -0

If E=0, F=0, S=0: X = +0

With floating point numbers represented in the form a x 2^b, multiplication and division of two floating point numbers can be achieved easily.
For example
X = a x 2^b
Y = c x 2^d
Then X*Y = (a*c) x 2^(b+d)
X/Y = (a/c) x 2^(b-d)

So standard multiply and divide hardware can be used in floating-point calculations, in the same way that standard addition and subtraction hardware is used.

Self-exercise
How would you perform addition and subtraction of floating point numbers?

Example

How do we convert the value 6.234 into the IEEE 32-bit floating-point format?

1. Convert the value into the form 1.XXXX x 2^x
6.234 = 1.5585 x 2^2
It is a positive value so the sign bit is 0
2. Derive the exponent
Since the exponent is stored in the form E-127 and the actual exponent here is 2, E = 129,
so the binary pattern is 1000 0001 (129)
3. Determine the fraction. From 1.5585 we only need the .5585 (we do not need to store the leading 1.)
.5585 = 0.5 + 0.03125 + 0.015625 + ...
0.5 = 2^-1
0.03125 = 2^-5
0.015625 = 2^-6
giving the fraction bits 1000 11...

So the final pattern = 0 1000 0001 1000 11...

Exercise:
Convert the number -4.5 into a 32-bit floating-point pattern

Representing characters
In addition to values, characters are also represented using binary codes, usually ASCII (American Standard Code for Information Interchange).

ASCII is a character-encoding scheme based on the ordering of the English alphabet. ASCII codes represent text in a computer using 8 bits (in fact only 7 bits are used), as shown in the ASCII table.

The ASCII table gives the codes used for representing different characters. For example, the letter A is 41H and a is 61H. Therefore, from the computer's point of view, A is not the same as a. You can also convert letters from lower case to upper case by a simple subtraction: c - 20H will give you the character C.
Similarly, other characters such as Chinese are also encoded in binary form.

Chinese characters
There are different approaches to representing Chinese characters in a computer, such as Unicode, Big5, GB2312-80, etc. A font file must be installed in order to display the corresponding characters. Big5 is the character-encoding standard most commonly used for traditional Chinese characters. Every Chinese character is represented by a two-byte code. The first byte ranges from 0xA1 to 0xF9; the second byte ranges from 0x40 to 0x7E and 0xA1 to 0xFE. In a document that contains both Chinese and regular ASCII characters, the ASCII characters are still represented by a single byte. In ASCII the MSB is always 0, but in the first byte of a Big5 code the MSB is always 1 (every byte from 0xA1 to 0xF9 has MSB 1), so English and Chinese characters can be distinguished.

Some example Big5 codes: A640H, AF66H, A741H, AA46H (the corresponding Chinese characters are shown in the original table).

Revision exercises:

1 What is a bit, a byte, one K, one M?
2 Binary number notation (10101010 = ???)
3 How many different values can be represented by an 8-bit pattern?
4 Hexadecimal A = ??? CF (Hex) = ???
5 What is 27.4375 in binary form?
6 What is -88 in binary?
7 What is ASCII? What is a string?
8 Do you know how to program in C/C++, Fortran, Java, assembly language, etc.?
9 Do you think learning this subject is useful?
Introduction to microprocessor
Computer technology has made tremendous progress in the past 65 years since the first general-purpose electronic computer was created. Today, $3,000 to $4,000 HKD buys a computer with more performance than one that cost around $7 million HKD in 1983.
This introduction discusses the basic concept and internal architecture of a microprocessor. Features that can improve the performance of a microprocessor are highlighted. In addition, the components that make up a simple computer system will also be discussed.

What is a microprocessor (μP)?

Definition:
A microprocessor is a processor-on-a-chip, i.e. a very powerful IC (Integrated Circuit).
There are many different types of microprocessor; the most commonly used include the 8051 series, the 8086, the Intel Pentium series, the Intel i-series, etc.

Table 1 illustrates key features that are used to evaluate the performance of a microprocessor.
Model             | Year | Max. clock freq. at introduction | Transistors per die | Register sizes                                  | Ext. data bus size | Max. external address space | Caches
8086              | 1978 | 8 MHz                            | 29K                 | 16-bit GP                                       | 16                 | 1 MB                        | None
486               | 1989 | 25 MHz                           | 1.2M                | 32-bit GP, 80-bit FPU                           | 32                 | 4 GB                        | L1: 8KB
Pentium           | 1993 | 60 MHz                           | 3.1M                | 32-bit GP, 80-bit FPU                           | 64                 | 4 GB                        | L1: 16KB
P3                | 1999 | 500 MHz                          | 8.2M                | 32-bit GP, 80-bit FPU, 64-bit MMX, 128-bit XMM  | 64                 | 64 GB                       | L1: 32KB, L2: 512KB
Pentium Dual Core | 2007 | 1.6 to 2.4 GHz                   | 167M                |                                                 | 64                 | 64 GB                       | L2: 1MB
Core i7           | 2015 | 2.66 to 3.83 GHz                 | 1400M               | 8 32-bit and 16 64-bit registers per core       | 64                 | 64 GB                       | L2: 1MB, L3: 8MB

GP = general purpose; FPU = floating point unit
The key features when comparing different microprocessors are:

1. Operating frequency (clock frequency)
The microprocessor is a digital device (similar to a synchronous machine), so the overall performance of a microprocessor is directly related to its operating frequency.
2. Registers. A register is a digital device used to store binary information. Registers are very important components during the operation of the processor, so the size, as well as the number, of registers included inside the processor will affect the performance of the device.
3. Size of the data bus, used for sending data between components of a computer; a wider data bus allows a higher data transfer rate. The size of a bus refers to the number of bits it carries in parallel.
4. Size of the address bus, which relates to the size of memory that can be accessed. A processor can access more memory if a wider address bus is available.
5. Size of the cache memory: high-speed memory (located inside the microprocessor as well as outside it) that can reduce the rate of access to external memory.
6. For modern microprocessors, the number of cores.

The above features affect the overall performance of a microprocessor, and details will be discussed in later lectures. More recently, there are other families of processor such as the ARM Cortex-A series as well as the Nvidia Tegra 2 (a dual-core processor). These devices are commonly used in mobile phones and tablet computers.
Figure 1 Growth in processor performance since 1971

Moore's law
Moore's law refers to an observation made by Intel co-founder Gordon Moore in 1965. He noticed that the number of transistors per square inch on integrated circuits had doubled every year since their invention.
Moore's law predicts that this trend will continue into the foreseeable future. Although the pace has slowed, the number of transistors per square inch has since doubled approximately every 18 months. This is used as the current definition of Moore's law.
Because Moore's law suggests exponential growth, it is unlikely to continue indefinitely. Most experts expect Moore's law to hold for another two decades. Some studies have shown physical limitations could be reached by 2017.
The extension of Moore's law is that computers, machines that run on computers, and computing power all become smaller and faster with time, as transistors on integrated circuits become more efficient. Transistors are simple electronic on/off switches embedded in microchips, processors and tiny electrical circuits. The faster microchips process electrical signals, the more efficient a computer becomes.
Costs of these higher-powered computers eventually came down as well, usually about 30 percent per year. When designers increased the performance of computers with better integrated circuits, manufacturers were able to create better machines that could automate certain processes. This automation created lower-priced products for consumers, as the hardware lowered labor costs.
Ref: http://www.investopedia.com/terms/m/mooreslaw.asp

RISC and CISC instruction sets

Microprocessors can also be classified based on their instruction set. The instruction set is the set of instructions available for programming the device, such as ADD, SUB, etc.
There are two fundamentally different approaches to the design of an instruction set. One popular approach is called Reduced Instruction Set Computer (RISC). In a RISC, each instruction occupies exactly one word. Examples of RISC processors include ARM and PIC.
One word is the basic unit of storage space used inside the processor; for the ARM processor it is 32 bits. So in the ARM processor, every instruction is 32 bits long.
An alternative to RISC is to make use of more complex instructions which may span more than one word of memory (i.e. more than 32 bits), and which may specify more complicated operations. Processors based on this idea are called Complex Instruction Set Computers (CISC). The Intel x86 CPUs and AMD processors are examples of CISC.

RISC vs CISC summary

RISC                                       | CISC
Simple instructions, few in number         | Many complex instructions
Fixed-length instructions                  | Variable-length instructions
Complexity in compiler                     | Complexity in microcode
Only LOAD/STORE instructions access memory | Many instructions can access memory
Few addressing modes                       | Many addressing modes

For a RISC, since all information related to an instruction is stored in one single word, it is faster than a CISC at getting the instruction from memory. But a RISC is also less flexible, as all instructions must fit in one word.
In a RISC instruction, all elements involved in an arithmetic or logic operation must either be in processor registers, or one of the operands may be given explicitly within the instruction word.
Example: to compute C = A+B, where A, B, C are data stored in memory, we need to do:
Load R2, A ; put the variable A into register R2
Load R3, B ; put the variable B into register R3
ADD R4, R2, R3 ; do the addition and put the result in register R4
Store R4, C ; move the result from register R4 to the memory location represented by C

In RISC instructions, as the size of an instruction is only 1 word, the information that can be included is also limited, so we need to divide an operation into smaller steps.

In CISC, some instructions may occupy multiple words, and arithmetic and logic instructions use the two-address format.
For the same example, C = A+B:
Move C, B ; C = B
ADD C, A ; C = C+A

The processing power of the microprocessor increased dramatically from the mid-1980s to the late 1990s (about 16 years), at a rate of close to 50% per year. This dramatic rate of improvement has led to the dominance of microprocessor-based computers across the entire range of computer design. PCs and workstations have emerged as major products in the computer industry. Even mainframes have been almost entirely replaced by machines built from small numbers of off-the-shelf microprocessors. High-end supercomputers are being built with collections of microprocessors.
From the year 2000 onwards, however, the rate slowed to about 20% per year. This is due to the triple hurdles of the maximum power dissipation of air-cooled chips, little instruction-level parallelism left to exploit, and almost unchanged memory latency (operating speed of the memory).
The latest path to higher performance is via multiple processors per chip (multi-core) rather than via a faster uniprocessor.

What is a microcomputer?

Definition:
A microcomputer is a complete device based on a particular microprocessor chip, as shown in Figure 1.1.
The microprocessor is the most important component in a microcomputer; therefore, to study a microcomputer system, we must first understand the microprocessor.

Figure 1.1 Block diagram of a microcomputer (components connected by the BUS)

As shown in Figure 1.1, the block diagram of a simple microcomputer includes the MPU (Microprocessor Unit), or CPU (Central Processing Unit), memory (primary & secondary memory), input units and output units.
The memory unit stores programs and data.
Primary memory usually refers to RAM (Random Access Memory) or ROM (Read Only Memory) and is also called main memory. It is a fast memory that operates at electronic speeds. Programs must be stored in this memory while they are being executed. Memory consists of a large number of semiconductor storage cells, each capable of storing one bit of information. Primary memory is expensive and does not retain information when power is turned off.
External memory usually refers to a hard disk, CD-ROM, or DVD-ROM. Nowadays, external memory can also be an SD card.
Input units include the keyboard, mouse, scanner, touch-screen monitor, etc.
Output units include the display, printer, speaker, etc.

Components within the microcomputer are connected together by buses: the data bus, the address bus and the control bus. A bus is a collection of wires, or signal lines. Inside the computer system, data is sent in bytes (8 bits), words (16 bits) or double words (32/64 bits), so a group of wires is used to send the data bits together. One bit of information is transferred on each wire.

Data bus is used to transmit data between different components.

Address bus is used to send an address (location) so that different I/O components can be selected or a memory location can be accessed.

Example - if you only have a 2-bit address then you can access 4 (2^2) different components. Consider all possible combinations that you can obtain with a 2-bit pattern: 00, 01, 10, 11 - each pattern can be used to select one component.

Modern computer systems use a data bus with a size of 32 or 64 bits, so 32-bit or 64-bit data can be sent in a single operation.
On the other hand, if you have a wider address bus then you can access more components (especially memory) in the system.

The Arithmetic and Logic unit (ALU) and control unit


The ALU carries out most of the operations.
The control unit coordinates the various units such as the ALU, I/O and memory.

The Intel 8086 microprocessor

The 8088/8086 was a popular device in the late 1970s and 1980s, and its architecture is simple and suitable for teaching computer architecture. Once we grasp the basic concepts of the 8086, we can then discuss features embodied in more advanced microprocessors such as the Intel Pentium.

The 8086 is a 16-bit microprocessor chip fabricated using high-performance metal-oxide semiconductor (HMOS) technology. The circuitry on the chip comprises approximately 29,000 transistors, and it comes in a 40-pin package.
Self-exercises

What does it mean by a 16-bit, 32-bit, or 64-bit processor?

Is the Intel Core 2 Duo microprocessor a 32-bit device or a 64-bit device, and why?

Basic 8086 features

The 8086 microprocessor is a true 16-bit microprocessor with 16-bit internal and external data paths (buses).
The address bus and data bus are multiplexed???
There is a 20-bit address bus which allows access to 1 MB (2^20) of memory locations.
Via the address bus, the device can address up to 64K byte-wide (8-bit) I/O ports, or 32K word-wide ports (word = 16 bits). An I/O port is an interface to an external device.

Figure 1.2 Two 8-bit ports combined to form one 16-bit port (8-bit + 8-bit = 16-bit)

What is a multiplexed address bus and data bus?

The 8086 address bus and data bus are multiplexed, meaning that the two buses share the same pins. Since the data bus is 16 bits wide, those 16 pins are used both for transferring data and for transferring addresses; refer to the block diagram of the 8086 pin layout. Multiplexing the two buses reduces the pin count of the device. The concept of multiplexing is important, as other processors also have such a multiplexed design.

The name AD1, as shown in Figure 2, means that this pin is used for address bit 1 as well as data bit 1.

Another special feature of the 8086 is that it has two modes: minimum mode and maximum mode.
In minimum mode, the 8086 is used as a typical microprocessor; we mainly study the 8086 in minimum mode in this course.
In maximum mode, the 8086 can be used with multiple processors, usually for floating-point arithmetic.
The mode is selected via the MN/MX input pin; refer to the pin layout diagram.
Figure 2 8086 processor pin configuration

Internal architecture of the 8086

Components inside a CPU are designed to execute the operations assigned to the CPU. What are the basic operations performed by a computer?
Programs that control the operations of the microprocessor are stored in the memory (or on the hard disk). Therefore, in order to get the instructions of a program, the microprocessor must read the memory to extract the instructions; this is usually called a read cycle. After this, the microprocessor must execute the instruction; this is called an execution cycle. The two cycles are performed repeatedly. So modern microprocessors usually include separate hardware units to perform the two functions. This is also the case for the 8086.
Block diagram for a simple computer system: memory, CPU, I/O, and a display unit (LCD).
The basic operations performed by a computer are: get an instruction from memory, perform/execute the operation, then get the next instruction.

In order to optimize performance, the 8086 processor is organized into a separate Bus Interface Unit (BIU) (for performing the read cycle) and Execution Unit (EU). This enables the 8086 to fetch and execute instructions simultaneously. What is the advantage of such an organization?

Figure 3 Processor model for 8086


Referring to Figure 3, the BIU is responsible for performing all bus operations, such as instruction fetching, reading and writing operands from/to memory, and the input and output of data for peripherals (input/output devices).
There are two types of buses: data and address.
To get an instruction from memory, the CPU must access the address bus as well as the data bus. The BIU is responsible for controlling the buses so that instructions can be extracted from the memory.

Self-exercise:
Why are operations such as instruction fetching and the reading and writing of operands related to the buses?

The EU is responsible for executing instructions once they are extracted from memory by the BIU.
The two units operate asynchronously, so overlapping instruction fetch and execution is possible (what's the advantage of this?).
Therefore, the overall performance of the microprocessor can be improved.

Before we discuss the functions of the BIU and EU, we must first understand some terms that will be used during the discussion.

Terminology

A program is stored in memory and consists of a sequence of instructions as well as some data. Executing an instruction may require some operands. What is an operand?

An operand is the object that is being operated upon! For example, in the instruction ADD A, B (similar to A = A+B in C++), ADD is the operation (addition), and A and B are the operands.

Bus Interface Unit (BIU)

The BIU is the 8086's interface to the outside world (external memory). The major task of the BIU is to get information from the memory; information here means data and instructions.

How can we get data from memory?

In order to access the memory, we need to issue a 20-bit address (via the address bus) and then read the 16-bit data (via the data bus). (Details of this mechanism will be discussed when we study memory systems.) Therefore, you will find that the BIU contains the address generation and bus control unit.
In addition, the BIU also contains a set of segment registers, internal communication registers, the instruction pointer, the instruction object code queue, an address summer (Σ), and bus control logic. All these registers are related to the generation of the memory address for accessing information.

Function of the instruction queue

To extract an instruction, the BIU issues the address of the memory location where the instruction is stored, then waits for the instruction to be available from the memory.
When the instruction (2 bytes) is available on the data bus, it is stored in the instruction queue, which can hold only 6 bytes. The instruction is executed by the Execution Unit (EU) when it is extracted from the instruction queue. The instruction queue can be considered a buffer between the BIU and EU.

This setup is called a pipeline. The pipeline is a concept, and its implementation in the 8086 is based on the BIU, the instruction queue and the EU. Pipelining is an implementation technique whereby multiple instructions are overlapped in execution; it takes advantage of the parallelism that exists among the actions needed to execute an instruction. Today, pipelining is the key implementation technique used to make fast CPUs.

A pipeline has two ends: one is the input and the other is the output (refer to Figure 4). A pipeline is like an assembly line: in a computer pipeline, each component completes one part of an instruction.

For the instruction queue in the 8086, the input is connected to the BIU and the output is connected to the EU. The queue behaves as a FIFO (First In, First Out) buffer.

Note: during the execution of an instruction, EU may request the BIU to extract data
(operand) from memory in order to execute an instruction.

The pipeline is a very important feature because it improves the efficiency of the microprocessor. The internal architectures of modern microprocessors also include various forms of pipeline structure.

Figure 4 Block diagram of the pipeline mechanism inside the 8086: information coming from memory enters the BIU (which controls access to the memory), passes through the 6-byte queue (the pipeline), and the EU carries out each instruction; the EU can also request the BIU to fetch operands.

Instruction pre-fetch by the BIU

The instruction queue is 6 bytes long. For example, if each instruction is only 2 bytes, then the instruction queue can store 3 instructions. If instructions 1, 2 and 3 are stored in the queue, the EU will extract instruction 1 from the queue and execute it. In this case, instructions 2 and 3 have been fetched before they are executed. The term pre-fetch refers to exactly this:
pre-fetch == pre (before), fetched before it is executed.

This is similar to what you do at a buffet dinner. You fetch different kinds of food from the buffet table, for example sashimi, roast beef, and salad. While you're eating the salad, you have already pre-fetched the sashimi and the roast beef, or even the ice cream! If you do not pre-fetch, then you take the salad first, eat it, and only when you finish the salad do you go back to the table for more food.

Critical thinking:
Why do you pre-fetch your food at a buffet dinner?
Why does the BIU pre-fetch instructions?
Are the reasons for the above the same, or different?
Is the buffet scenario exactly equivalent to a pipeline?

There are conditions in order to carry out a pre-fetch in the 8086:

1 When the instruction queue has space to store at least 2 bytes


2 The EU is not requesting the BIU to read or write operands from memory (i.e. the
BIU is free!!)

If the above conditions are satisfied, the BIU looks ahead in the program by prefetching
the next sequential instruction. The prefetched instructions are held in the instruction
queue. Two bytes are fetched in a single memory read cycle (the data bus is 16 bits wide).
The EU reads one instruction byte at a time from the output of the queue. If the queue is
full and the EU is not requesting access to operands in memory, the BIU performs no bus
cycles; these are called idle states (i.e. the BIU is not doing anything!).
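As a minimal sketch (in Python, not part of the original notes), the 6-byte FIFO queue and the two pre-fetch conditions can be modelled as follows; the function names and the word-at-a-time fetch are illustrative assumptions:

```python
from collections import deque

QUEUE_CAPACITY = 6  # the 8086 instruction queue holds 6 bytes

queue = deque()

def biu_may_prefetch(eu_requesting_bus):
    """Both pre-fetch conditions from the text must hold."""
    free_bytes = QUEUE_CAPACITY - len(queue)
    # Condition 1: at least 2 bytes of free space (a whole word arrives per read cycle)
    # Condition 2: the EU is not asking the BIU to read/write an operand
    return free_bytes >= 2 and not eu_requesting_bus

def biu_prefetch(word):
    # one memory read cycle delivers 2 bytes over the 16-bit data bus
    queue.append(word & 0xFF)          # low byte first (little-endian order)
    queue.append((word >> 8) & 0xFF)   # then the high byte

def eu_fetch_byte():
    # the EU consumes one byte at a time from the front, in FIFO order
    return queue.popleft()
```

For example, prefetching the word 0xC38B (the bytes 8B C3) and letting the EU drain the queue returns the bytes in the order they were fetched.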

When the BIU is in the middle of fetching an instruction and the EU requests its services,
the BIU first completes the instruction-fetch bus cycle and then serves the EU. The
following diagram, Figure 5, illustrates the process of pre-fetch and instruction execution.
You can observe that both the EU and the BIU are busy most of the time, so the overall
performance improves compared with a design in which a single unit performs both
fetching and execution.
With only one unit performing both operations, you must first fetch the instruction and
then execute it; the two steps cannot overlap.

Can you see the advantage of a pipeline system?


Figure 5 Instruction execution sequence. Operating with a single component, fetch and
execute strictly alternate: Fetch, Execute, Fetch, Execute, Fetch, Execute.

Basic Pipeline concept

The time required to move an instruction one step down the pipeline is a processor cycle.
Because all stages proceed at the same time, the length of a processor cycle is determined
by the time required for the slowest pipe stage. If the pipeline stages are perfectly
balanced, then the time per instruction on the pipelined processor, under ideal conditions,
is equal to

Time per instruction on unpipelined machine / Number of pipe stages

So, in this ideal case, the speedup from pipelining equals the number of pipeline stages!

Exercise

An instruction can be completed by a single processing stage and it takes 25 clock cycles to
finish. On the other hand, the instruction can be divided into five sub-tasks each of which
can be carried out by one processing stage forming a pipeline. If sub-task 1 takes 10 clock
cycles, sub-task 2 takes 5 clock cycles, sub-task 3 takes 8 clock cycles, sub-task 4 takes
10 clock cycles and sub-task 5 takes 12 clock cycles, how many instructions should be
executed so that the sub-task approach will be more cost-effective?
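One way to attack the exercise is to compare total cycle counts directly. The following Python sketch (illustrative, not part of the notes) assumes instructions enter the pipeline back-to-back and that the pipeline clock is set by the slowest stage:

```python
def break_even_instructions(single_stage_cycles, stage_cycles):
    # The pipeline clock period is set by the slowest stage
    slowest = max(stage_cycles)
    n_stages = len(stage_cycles)
    # For n instructions: the single stage takes single_stage_cycles * n cycles;
    # the pipeline takes (n_stages + n - 1) * slowest cycles (fill + drain)
    for n in range(1, 1000):
        unpipelined = single_stage_cycles * n
        pipelined = (n_stages + n - 1) * slowest
        if pipelined < unpipelined:
            return n
    return None

# The exercise values: 25 cycles single-stage vs stages of 10, 5, 8, 10, 12 cycles
print(break_even_instructions(25, [10, 5, 8, 10, 12]))  # -> 4
```

With these numbers the pipeline runs at a 12-cycle clock, so 4 instructions take 96 cycles pipelined versus 100 cycles unpipelined; the sub-task approach pays off from the 4th instruction onward.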

Components included in the BIU

The function of the BIU is to access memory; it reads from and writes to the memory.
To achieve this, the BIU must include the components needed to generate a 20-bit
memory address (the address bus of the 8086 is 20 bits wide). The 20-bit memory address
is the sum of two values: one is called the segment address and the other the offset.
Adding these two values together generates the physical address of a memory location.
The segment address is stored in a segment register (such as CS), while the offset is
stored in another register; for example, the Instruction Pointer (IP) is one of the offset
registers.

The segment concept

The 8086 can access up to 1 Mbyte of memory. The memory is divided into smaller
blocks called segments. The size of a segment is 64 Kbytes, so in theory the memory can
be divided into 16 segments, but only 4 segments can be active at any instant because
there are only 4 segment registers. Each memory location can now be represented by two
components, the segment address and the offset address; see Figure 6.

The real address (physical address) of a location within a memory segment is:

Physical address = Base address (segment address) + Offset

Figure 6 Segment concept



The segment address is also called the base address (it is the first address of the
corresponding segment). Since the size of a segment is 64 Kbytes, the maximum value of
the offset is FFFFH (hex). Adding the segment value and the offset gives the physical
address (or real address) of any memory location within the segment.
For example, if the base address is 00000H, then the segment occupies 00000H to
0FFFFH (64 Kbytes).
If the base address is 10000H, then the segment runs from 10000H to 1FFFFH, also
64 Kbytes.

Note: since we are dealing with a binary system, things always start from 0!

Therefore, inside the BIU there is a dedicated adder to perform the addition of the
segment and the offset. In addition, you will find different registers located inside the
BIU for holding the segment values (segment registers) and the offset values.

However, inside the 8086 all registers (segment registers as well as offset registers) are
only 16 bits wide. So, in order to generate a 20-bit physical address, a special mechanism
is needed.

Example

The maximum value of a 16-bit quantity is FFFF (hex). If two 16-bit values are added
directly, such as FFFF (segment) + FFFF (offset), the result is 1FFFE (hex), which is
only a 17-bit value; physical addresses from 20000H to FFFFFH could never be produced
by such an addition.

So in the 8086 you cannot randomly assign the base address of a segment. The segment
address must satisfy one condition: the base address must be divisible by 16.
If a value is divisible by 16 and we are using hex (base 16) as the number system, then
the last digit of the value must be 0.

For example:

FFFFEH is not divisible by 16


FFFF0H is divisible by 16
12340H is also divisible by 16

Therefore, a valid segment address must have its last digit equal to 0, and when we
store the segment value, we can drop that 0.
For example, if the segment address is 12340 hex, then only 1234 is stored in the
segment register.

However, when we generate the physical address, we must first append the 0 to the
stored value and then carry out the addition.

For example, if the value in the segment register is 1234H and the offset is 20H, then the
physical address is 12340H + 20H = 12360H.

This is how to generate a 20-bit physical address from two 16-bit values (segment
and offset).
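The two-step recipe above (append the 0, then add) can be sketched in Python; the function name is just for illustration:

```python
def physical_address(segment, offset):
    # "Append the 0": shift the 16-bit segment value left by one hex digit
    # (4 bits), then add the 16-bit offset. The result is kept to 20 bits,
    # the width of the 8086 address bus.
    return ((segment << 4) + offset) & 0xFFFFF

print(hex(physical_address(0x1234, 0x0020)))  # the example above: 0x12360
```

Note that addresses past FFFFFH wrap around because only 20 bits are kept, which mirrors what happens on the 20-bit address bus.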

The segment concept analogy


If you design an elevator for a very tall building, say one with 100 levels, how are you
going to arrange the buttons so that the elevator can reach every level?
If you stay in a hotel and your room number is 1234, does that mean the hotel has more
than 1000 rooms?

The Execution Unit (EU)

The EU is responsible for decoding and executing all instructions. The EU is a digital
device that takes a binary instruction as input and outputs other binary signals that control
units inside the microprocessor to execute the instruction.

What is decoding?
Instructions are stored in machine code format (in binary). The EU sees data such
as 8B C3 (10001011 11000011). Decoding means carrying out the proper operation
according to that binary string. The pattern 8B C3 is MOV AX, BX. After decoding, the
EU performs the move (MOV) operation accordingly (i.e. it controls the proper hardware
components to move the data from the source BX to the destination AX).

The EU consists of an ALU (Arithmetic and Logic Unit), status and control flags, eight
general-purpose registers, temporary registers, and queue control logic. The EU extracts
instructions from the top of the instruction queue in the BIU, decodes them, generates
operand addresses if necessary, passes them to the BIU and requests it to perform the
read or write bus cycles to memory or I/O, and performs the operation specified by the
instruction on the operands. During the execution of an instruction, the EU tests the status
and control flags and updates them based on the results of executing the instruction.

The decoder takes instructions in binary format (e.g. 001010011) and outputs the signals
that control the hardware modules to perform the operation.
Control signals are generated for each execution step based on the instruction. These
signals are usually determined by a program stored in a special memory. The control
program is called a microprogram to distinguish it from the program being executed by
the processor. The microprogram is stored on the processor chip in a small and fast
memory called the microprogram memory or the control store.
Suppose that n control signals are needed. Let each control signal be represented by a bit
in an n-bit word, which is often referred to as a control word or a microinstruction.
A typical organization of the hardware needed for microprogrammed control consists of a
microinstruction address generator, which generates the address to be used for reading
microinstructions from the control store. The address generator uses a microprogram
counter, uPC, to keep track of control store addresses when reading microinstructions
from successive locations.
Microprogrammed control can be viewed as having a control processor within the main
processor. Microinstructions are fetched and executed much like machine instructions.
Their function is to direct the actions of the main processor's hardware components by
indicating which control signals need to be active during each execution step.

Example:

ADD AX, 16 ; meaning add 16 to the AX register


Where AX is a register inside the CPU.
If AX is 20 then after the operation it becomes 36
For the above operation, do we need to fetch an operand from memory? The 16 in the
operation is called an immediate. An immediate value is considered part of the
instruction, so the value 16 is also pre-fetched and stored inside the instruction queue.
Now consider ADD AX, X ; where X is a variable.
Do we need to fetch an operand from memory in this case?
If the instruction queue is empty, the EU waits for the next instruction byte to be fetched
and shifted to the top of the queue. When the EU executes a branch or jump instruction
(resulting, for example, from an IF...ELSE statement), it transfers control to a location
corresponding to another set of sequential instructions. Whenever this happens, the BIU
automatically resets (i.e. clears) the queue and then begins fetching instructions from the
new location to refill it.

Instruction queue activities during a jump or branch operation

The 8086 internal registers

Registers are important components because they are used as temporary storage, as well
as for storing the current status of the CPU. Registers are available in all microprocessors.
The contents of some registers indicate the memory locations to be accessed, such as the
segment registers. Registers are internal components and some of them can be controlled
through assembly language programming, so it is important to have a basic understanding
of them. The 8086 has 4 groups of 16-bit registers:
a. Instruction Pointer (IP)
b. Data Registers (4)
c. Pointers and Index Registers (4)
d. Segment Registers (4)

In addition, there is the Flag Register for storing the status of the microprocessor.
In different microprocessors, names assigned to the registers may be different but
functions performed by the registers are very similar.
Instruction Pointer (IP)

IP identifies the location of the next instruction to be executed in the current code
segment. IP holds the offset value, not the physical address, of the next instruction:
Physical address = CS (code segment register, with the 0 appended) + IP
The code segment is the memory segment that stores the program (the instructions).
Every time an instruction word is fetched from memory, the BIU updates the value in IP
(e.g. IP = IP + 2) so that it points to the next sequential instruction word in memory.

Sometimes the IP is called the PC (Program Counter), implying that it is used to control
the program execution.

Data registers

There are 4 general-purpose data registers, used for temporary storage of frequently used
intermediate results. This improves speed because the registers are located inside the
microprocessor.
Each data register can be used either as one 16-bit register or as two 8-bit registers. The
registers are the following:
1 Accumulator Register (AX: AH, AL)
2 Base Register (BX: BH, BL)
3 Count Register (CX: CH, CL)
4 Data Register (DX: DH, DL)

AX is 16 bits; AH and AL are 8 bits each. A 16-bit register can be divided into two parts,
High (H) and Low (L). If only 8-bit data is used, then you can use either AL or AH.

AX (16-bit)
AH (8-bit) | AL (8-bit)

The general purpose data registers can be used for arithmetic or logic operations.
For example, to carry out an addition: add ax, bx
The result is stored in ax and it is equal to the sum of values in ax and bx (in C++, it is
similar to ax+=bx).
In addition, the data registers are also used for some special situations and details will be
discussed in the Assembly Language Programming section. Here are some examples:

In implementing a LOOP, the CX register stores a counter value representing the number
of iterations.
In I/O operations, the data to be input or output must be stored in the AX or AL register,
while register DX holds the address of the I/O port.

Segment Registers
The segment registers are used for storing the base address of a segment as discussed in
the segment concept.
The 8086 address space is segmented into 64K-byte segments and just four segments can
be active at the same time because there are only 4 segment registers!!!
The segment registers are used to select, or address the active segments

Code Segment (CS) Register


CS identifies the starting address of the 64-K byte segment known as the code segment.
Code segments of memory contain instructions of the program.

Data Segment (DS) Register


DS register identifies the starting location of the current data segment in memory. Data is
stored in the data segment.

Stack Segment (SS) Register


SS register contains a logical address that identifies the starting location of the current
stack segment in memory. Stack is used for temporary storage

Extra Segment (ES) Register


ES register identifies the extra segment usually used for data storage.

The segment registers store the base address of a segment. To determine the physical
address, an offset is required. The index registers are used to store the offset value.

Offset Registers

The offset registers are used to store the offset of a memory location relative to a base
address (the segment address); refer to Figure 6. In the 8086 the offset registers come
with different names; collectively they are called the pointer and index registers.

Stack Pointer (SP) permits easy access to locations in the stack segment of memory.
The value in SP represents the offset of the next stack location which can be accessed
relative to the current address in the stack segment (SS) register, i.e., always points to the
top of the stack.
Base Pointer (BP) - BP represents an offset from the stack segment. However, it is used
to access data within the stack segment and usually used in the based addressing mode

The applications of the various registers will be discussed in details when we learn
assembly language programming.

Index registers are used to hold offset addresses for instructions that access data stored in
the data segment of memory.
Source Index Register (SI)
SI is used to store an offset address for a source operand under index addressing for string
and memory operation.
Destination Index Register (DI)
DI is used for storage of an offset that identifies the location of destination operand also
used in some string operations.
Remarks: The offset value is always referenced to the value in the data segment (DS)
register.

Table 1 Segment registers and their corresponding offset registers

Segment register       Offset register
CS (code segment)      IP (instruction pointer)
DS (data segment)      SI, DI
SS (stack segment)     SP (stack pointer), BP (base pointer)
ES (extra segment)     DI

Flag Register

The flag register is a 16-bit register within the execution unit, as shown in the figure. The
status flags in the register indicate conditions that are produced as the result of executing
an arithmetic or logic instruction.
A flag is represented by a single bit.

Flags that are most commonly used include: carry flag, sign flag, parity, zero, and
overflow. When we write assembly language programs, we need to make use of the flags
to carry out different kinds of decision making, such as IF carry flag is set THEN ..

Since a flag is only a single bit, it can have two states: 0 or 1. Usually we use the terms
SET and CLEAR to describe the flag status: SET = 1; CLEAR = 0.
We can also set or clear a flag in our own programs.

C - Carry flag (set if there is a carry-out or borrow-in). When we perform an addition,
the Carry flag may be set. On the other hand, when we perform a subtraction, a
borrow may be flagged. Physically, the Carry flag and the borrow are the same bit.

Example:
If our data is only 8 bits wide, then FFH + 1H = 1 0000 0000, a 9-bit value; the leading
1 is the carry!
Similarly, when we compute 00H - 1H, the result is 1 1111 1111; the leading 1 is the
borrow bit.
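The carry and borrow examples can be checked with a short Python sketch (illustrative only, not 8086 code; the function names are assumptions):

```python
def add8(a, b):
    # 8-bit addition: bit 8 of the raw sum is the carry-out
    total = a + b
    return total & 0xFF, (total >> 8) & 1

def sub8(a, b):
    # 8-bit subtraction: a borrow occurs whenever a < b
    return (a - b) & 0xFF, int(a < b)

print(add8(0xFF, 0x01))  # result 0x00 with the carry set
print(sub8(0x00, 0x01))  # result 0xFF with the borrow set
```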

Z - Zero flag (set if the result of an operation equals zero). For example, computing
1 - 1 sets the zero flag.
S - Sign flag (indicates a negative result produced by an operation). For example,
computing 0 - 1 sets the Sign flag. The Sign flag is the MSB (most significant bit) of
the result. Note: the sign flag is not the same as the borrow; there are cases where the
sign flag is set but the borrow is not!

O - Overflow flag (result is out of range). If the result of a signed operation is too large
or too small to be accommodated in the destination register, the Overflow flag is set.
If you are dealing with unsigned values, you do not need to consider the Overflow
flag.

If we are dealing with 8-bit signed values, the representable range is -128 to 127. So if
you compute 127 + 1, you get an overflow. Why? Because 127 + 1 = 128 (1000 0000),
which is larger than the largest representable positive value (only 127!).
Similarly, -128 + (-1) = -129, which is also out of range, so overflow is also set; the
8-bit result, 0111 1111, reads as +127, a positive value produced from two negatives.
If you examine the MSB (the sign bit), you will observe that when two positive
numbers are added and the result becomes negative, overflow is set. Similarly, when
two negative numbers are added and the result is positive, overflow is also set.
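The overflow rule described above (operands share a sign, but the result's sign differs) can be sketched as an illustrative Python model; this is not 8086 hardware, and the function name is an assumption:

```python
def add8_signed_flags(a, b):
    # a and b are 8-bit values (0..255) interpreted as two's complement
    result = (a + b) & 0xFF
    sign_a, sign_b, sign_r = a >> 7, b >> 7, result >> 7
    # Overflow: both operands have the same sign but the result's sign differs
    overflow = int(sign_a == sign_b and sign_r != sign_a)
    return result, overflow

print(add8_signed_flags(127, 1))      # 127 + 1: result 0x80, overflow set
print(add8_signed_flags(0x80, 0xFF))  # -128 + (-1): result 0x7F, overflow set
```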

T - Trap Bit (when set, 8086 goes into the single-step mode). In single-step, after the
execution of an instruction, the CPU will halt and wait for an external input signal in
order to execute the following instruction.
I - Interrupt Bit (to enable maskable interrupt interface, details given when interrupt is
discussed )
D - Direction Bit (determines the direction in which string operations proceed). When 0,
the string is processed from the lowest address to the highest address.
(More in the assembly language programming section.)
Memory Read Write Cycle

Block diagram of a simple computer system: memory, CPU, and I/O, with a display
unit (LCD) attached.


A computer basically performs two functions:

Get instruction from memory


Execute the instruction

To get an instruction from memory, a Read Cycle is performed. After executing the
instruction, the CPU may want to store some data back to memory, and a Write Cycle is
performed. These are called bus cycles because during a read/write operation the system
must use the data bus and the address bus. If a CPU can complete a bus cycle in a very
short time, it can perform many operations per unit time, implying high performance.
Once again, the bus cycle is common to all microprocessors.

A bus cycle is used to access memory, I/O devices, or the interrupt controller.
A bus cycle starts with an address being output on the system bus, followed by a read or
write data transfer. A series of control signals is produced to control the direction and
timing of the bus; refer to the figures.
For the 8086, a standard bus cycle consists of 4 clock periods. Understanding the system
bus timing will assist you in choosing the proper memory devices.

Functions performed during a bus cycle

1 T1: the BIU puts an address on the bus
2 T2: data are put on the bus (write cycle)
3 T2: the bus is in high-Z (high-impedance) mode (read cycle)
4 T3: data are on the bus
5 T4: data are on the bus

Timing diagram for a READ CYCLE

Timing diagram for a Write CYCLE


Wait states
Wait states (extra clock cycles) can be inserted into a bus cycle in response to a request
from the external hardware. To initiate a wait state, the READY input of the 8086 is
pulled LOW. As long as READY is held low, wait states (Tw) are inserted between T3
and T4 of the bus cycle. For a write cycle, the data are maintained on the bus during the
wait states. The purpose of inserting wait states is to extend the duration of the bus cycle
so that slower memory or other I/O devices can be used in the computer system.

For example, during a READ cycle the CPU expects data to be valid at T3, but if the
device is very slow and cannot output valid data by T3, what happens? Wait states can be
inserted until valid data are available.
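This reasoning can be sketched roughly in Python. The model below is a simplification of the real 8086 timing parameters: it assumes the memory's access time must fit within the T1-T3 window, and that each wait state Tw adds one full clock period; both assumptions are illustrative.

```python
import math

def wait_states(clock_mhz, access_time_ns, cycles_to_t3=3):
    # Simplified model: data must be valid within the T1..T3 portion of the
    # bus cycle; each wait state extends the window by one clock period.
    t_clk = 1000.0 / clock_mhz          # clock period in ns
    available = cycles_to_t3 * t_clk    # time before data is sampled
    if access_time_ns <= available:
        return 0
    return math.ceil((access_time_ns - available) / t_clk)

print(wait_states(5, 500))  # 0: at 5 MHz, 500 ns fits in the 600 ns window
print(wait_states(5, 700))  # 1: one Tw extends the window to 800 ns
```

A calculation like this is how a designer decides whether a given memory chip needs wait-state hardware at all.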

Figure 7 Timing diagram including a wait state


An example to demonstrate how a computer system operates

Assembly Language Machine Code

MOV AX, 0F802H B802F8

PUSH AX 50

MOV CX, BX 8BCB

MOV DX, CX 8BD1

ADD AX, [SI] 0304

ADD SI, 8086H 81C68680

JMP $ -14 EBF0

The table shows a very simple assembly language program and the corresponding
machine code. The program is stored in memory, and the CPU performs a read cycle to
extract the first instruction bytes and put them into the instruction queue. The EU extracts
the codes from the queue and executes them. The BIU keeps reading instructions as long
as the instruction queue is not full.
The following figure illustrates the first 2 read cycles.
Stages of code fetching and execution:
1. Initially, the queue is empty.
2. Two bytes are fetched and put to the queue
3. 1 byte extracted from the queue
4. 1 byte extracted from the queue
5. 2 bytes put to the queue

Pins definition of the 8086

1. AD15 AD0 Address/Data Bus


2. A19/S6 A16/S3 - Address/Status
3. BHE/S7 Bus high enable
4. MN/MX min. max. mode control
5. RD read control (read data from memory or I/O)
6. Test wait on test (input). If test signal is HIGH then processor will be in an idle
state
7. Ready (input) wait state control
8. Reset system reset (if kept HIGH for 4 clock cycles)
9. NMI non-maskable interrupt request
10. INTR interrupt request
11. CLK system clock
12. HOLD hold request (used with DMA)
13. HLDA hold acknowledge (entered the HOLD state)
14. WR write control
15. M/IO memory /IO control
16. DT/R data transmit receive (to enable external data bus buffer)
17. DEN data enable
18. ALE address latch enable
19. INTA interrupt acknowledge
20. RQ / GT1,0 request / grant bus access control (Used in max. mode, to force the
processor to release the local bus at the end of the processors current bus cycle)
21. LOCK bus priority lock control (prevents other bus masters from gaining access to
the system bus)
22. /S2-/S0 bus cycle status (these lines reflect the type of operation being carried out by
the processor)
23. QS1 QS0 instruction queue status (give information about the status of the
code-prefetch queue)

Status lines - indicate the type of bus cycle being performed by the BIU

/S2 /S1 /S0 Indication


0 0 0 Interrupt acknowledge
0 0 1 Read I/O port
0 1 0 Write I/O port
0 1 1 Halt
1 0 0 Code access
1 0 1 Read memory
1 1 0 Write memory
1 1 1 passive

Queue Status shows the instruction queue status

QS1 QS0 Indication


0 0 No operation
0 1 First byte of opcode from the queue
1 0 Empty queue
1 1 Subsequent byte from the queue

Status signals
S6 - S3 are output on the bus at the same time that data are transferred over the other
bus lines.
S4 and S3 form a 2-bit binary code that identifies which of the 8086's internal segment
registers was used to generate the physical address:
00 extra, 01 stack, 10 code/none, 11 data
S5 reflects the logic level of the interrupt enable flag
S6 is not used and is always 0
Control signals

Control signals are provided to support memory and I/O interfaces


ALE (Address Latch Enable): a 0 -> 1 transition signals external circuitry that a valid
address word is on the bus
/BHE (Bank High Enable): 0 acts as the memory enable for the most significant byte
(high byte) half of the data bus
M/IO: 1 represents a memory operation; 0 represents an I/O operation
DT/R: 1 means the bus is in transmit mode; 0 means receive mode
/RD: 0 represents a read cycle, with data being read from the bus
/WR: 0 represents a write cycle, with valid write or output data on the bus
/DEN: signals external devices when they should put data on the bus

Operating modes of 8086

The 8086 can run in two different modes: minimum and maximum
In min. mode, the 8086 provides all the control signals needed to implement the memory
I/O interfaces. In max. mode, it provides signals (status signals) for implementing a
multiprocessor/coprocessor system environment.
In max. mode, bus controller, bus arbiter are included in the system, see the figure. The
controller derives the control signals based on the status signals.
Maximum mode hardware configuration
Maximum mode features:
1 The basic function of the bus controller chip (8288) is to derive control signals
such as /RD, /WR, /DEN, DT/R, and ALE from the status lines
2 /IORC, /IOWC - I/O read/write command signals. They enable an I/O interface to
read or write data from or to the addressed port.
3 /MRDC, /MWTC - memory read and write command signals, instructing memory
to accept data from, or send data to, the bus.
4 /AIOWC, /AMWTC - advanced /IOWC and /MWTC. They serve the same purpose
as /IOWC and /MWTC but are activated one clock cycle earlier.

Minimum mode circuit

In the minimum mode circuit, you should pay attention to the de-multiplexing of the
address/data bus. To access memory, the address and the data must both be available, but
the 8086 bus is multiplexed, so physically it cannot supply both pieces of information
simultaneously. Address latches are therefore used to hold the address while the bus is
used to transmit or receive data.
The 8086 first outputs the address on the bus; the ALE signal then goes active and latches
the address into the address latch (a buffer). A valid address is now supplied by the
address latch; refer to the minimum mode circuit.
Data can now be read or written via the transceiver (transmitter/receiver). The signal
DT/R controls the direction of the data, while the signal /DEN (data enable) enables the
transceiver so that data can be received by, or sent out from, the CPU.
An overview of modern microprocessor architecture

The 8086 architecture is simple, and its major components include the BIU, the EU, the
instruction queue, and the registers. The functions performed by these components, as
well as the general operation of a simple computer system, have been discussed. Based
on our understanding of the 8086 microprocessor, which features need to be modified
to make it more powerful, or more efficient?

What are the major differences between a modern microprocessor and an 8086???

The following features are crucial to the performance:


1. Fabrication technology
2. Memory (size and speed)
3. Data bus size
4. Floating point processing
5. Overlapping of execution and memory access (the pipeline)
6. Perform more tasks in a single cycle (parallel processing)

Fabrication technology

The 8086 operates at 5 MHz, but the latest i5 microprocessors operate in the GHz range
(2.6 GHz). The operating frequency is governed by the fabrication technology (65 nm for
the Intel Core 2 family, while the 8086 was based on 3-micron technology).
With sub-micron (less than one micron) technology, 2006 saw production at 0.09 microns
(90 nanometers), with 65-nanometer chips underway. This dimension is referred to as the
feature size. Transistor performance (the transistor is the basic component of a
microprocessor) improves linearly with decreasing feature size. As the feature size
decreases, we can put more components into a chip and also run it at a higher speed.
The operating frequency is related to the delay and delay decreases when the components
become smaller. The delay also decreases when more pipeline stages are included.
Wires in an integrated circuit have a more complicated relationship with feature size. The
signal delay for a wire increases in proportion to the product of its resistance and
capacitance. Of course as feature size shrinks, wires get shorter, but the resistance and
capacitance per unit length get worse. This relationship is complex, since both resistance
and capacitance depend on detailed aspects of the process, the geometry of a wire, the
loading on a wire, and even the adjacency to other structures. There are occasional
process enhancements, such as the introduction of copper, which provide one-time
improvements in wire delay.
In general, however, wire delay scales poorly compared to transistor performance,
creating additional challenges for the designer. In the past few years, wire
delay has become a major design limitation for large integrated circuits and is
often more critical than transistor switching delay. Larger and larger fractions of
the clock cycle have been consumed by the propagation delay of signals on wires.
In 2001, the Pentium 4 broke new ground by allocating 2 stages of its 20+-stage
pipeline just for propagating signals across the chip.
For CMOS chips, the traditional dominant energy consumption has been in
switching transistors, also called dynamic power. The power required per transistor
is proportional to the product of the load capacitance of the transistor, the
square of the voltage, and the frequency of switching, with watts being the unit:

Power (dynamic) = 1/2 * Capacitive load * Voltage^2 * Frequency switched

The higher the operating speed of a CPU, the more heat (power) is generated, and
cooling the CPU becomes more important. No cooling was necessary for CPUs in the
486 era. What happens if a CPU overheats?

Hence, dynamic power and energy are greatly reduced by lowering the voltage, and so
voltages have dropped from 5V to just over 1V in 20 years. The capacitive load is a
function of the number of transistors connected to an output and the technology, which
determines the capacitance of the wires and the transistors. For a fixed task, slowing
clock rate reduces power.

Example:
Some microprocessors today are designed to have adjustable voltage, so that a 15%
reduction in voltage may result in a 15% reduction in frequency. What would be the
impact on dynamic power?
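A quick check of this example, using the dynamic power relation above (a Python sketch; the 15% figures are applied as scale factors to voltage and frequency, and the capacitive load is assumed unchanged):

```python
def dynamic_power_scale(voltage_scale, frequency_scale):
    # P_dynamic is proportional to (1/2) * C * V^2 * f, with C unchanged,
    # so only the V^2 and f factors change the ratio
    return voltage_scale ** 2 * frequency_scale

scale = dynamic_power_scale(0.85, 0.85)
print(round(scale, 3))  # 0.614
```

So dynamic power drops to roughly 61% of the original, a saving of about 39%, because power scales with the cube of the common factor when voltage and frequency shrink together.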

Power is now the major limitation to using transistors; in the past it was raw silicon area.
As a result of this limitation, most microprocessors today turn off the clock of inactive
modules to save energy and dynamic power. For example, if no floating-point
instructions are executing, the clock of the floating-point unit is disabled.
Although dynamic power is the primary source of power dissipation in CMOS, static
power is becoming an important issue because leakage current flows even when a
transistor is off:
Power (static) = Current (static) * Voltage

Thus, increasing the number of transistors increases power even if they are turned
off, and leakage current increases in processors with smaller transistor sizes. As a
result, very low power systems are even gating the voltage to inactive modules to
control loss due to leakage. In 2006, the goal was to keep leakage to 25% of the total
power consumption, with leakage in high-performance designs sometimes far exceeding
that goal. As mentioned before, the limits of air cooling have led to the exploration of
multiple processors on a chip running at lower voltages and clock rates.

Memory

The 8086 has a 20-bit address bus, so the maximum memory it can access is only
1 Mbyte. The Pentium II can access up to 64 GBytes (what is the size of the address
bus???) of memory. The computer can run faster if more memory is available in the
system.
The speed of memory chips has also increased in the last 20 years. RDRAM can
operate at 600MHz or 800MHz; DDR RAM can operate at 200MHz. The speed of the
memory affects performance, as discussed above for the memory read/write cycle.
Using the 8086 as an example, a memory read/write must complete in 4 clock cycles, so
if the memory is too slow the read/write cannot be completed in 4 cycles.

Cache memory

To improve performance, internal high-speed memory is provided for the storage of
frequently used data as well as instructions. This internal memory is called cache. The
cache inside the CPU is called the Level 1 (L1) cache (at least 8KB in the P4). The
cache located between the CPU and the external memory is called the Level 2 (L2)
cache (512KB in the P4); the speed of cache is higher than that of traditional memory.
In newer microprocessor designs, the L2 cache is also located inside the CPU, and an
L3 cache may be provided as well. The L3 cache is external, just like the L2 cache in
the old days.
L1 is the primary cache; L2 is the secondary cache.
Accessing data/instructions from the cache is faster than accessing traditional memory,
so when executing a program, if the entire program is already stored in the cache then
the overall performance can be improved. In addition, to access cache memory the CPU
does not have to use the motherboard's system bus for data transfer. Whenever data
must be passed through the system bus, the transfer slows to the motherboard's speed.
The CPU can process data much faster by avoiding the bottleneck created by the
system bus.
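This benefit can be quantified with the usual average-memory-access-time relation. The figures below are illustrative assumptions, not measurements of any particular CPU:

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time: hit time plus the weighted miss cost."""
    return hit_time + miss_rate * miss_penalty

# Assumed figures: 1-cycle cache hit, 5% miss rate, 50-cycle penalty
# for fetching from external memory over the system bus.
print(amat(1, 0.05, 50))  # 3.5 cycles on average, vs. 50 with no cache
```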

Figure: Cache hierarchy — L1 and L2 inside the CPU, external L3 cache, and external memory

The effectiveness of a cache is based on a property of computer programs called locality
of reference. Analysis of programs shows that most of their execution time is spent in
routines in which many instructions are executed repeatedly, such as loops and
functions. Therefore, many instructions in localized areas of the program are executed
repeatedly during some time period: a recently executed instruction is likely to be
executed again very soon, and instructions close to a recently executed instruction are
also likely to be executed soon.
Whenever an information item (data or instruction) is first needed, this item should be
brought to the cache because it is likely to be needed again soon. In addition, instead of
fetching just one item from the main memory it is useful to fetch several items that are
located at adjacent addresses as well.
The cache can be considered a buffer between the microprocessor and the main
memory (SDRAM). Cache memory is usually divided into smaller units, called
blocks or lines. In the 486, the 8K cache is divided into four 2K blocks; each block is
128 rows of 16 bytes, and each 16-byte row is divided into four 4-byte lines. None of
the 4 lines can be accessed partially, so data is written to the cache 32 bits at a time.
During operation, the microprocessor always checks the cache for data and instructions
first. A mechanism usually called paging is employed to write frequently used data and
instructions into the cache memory.
The processor does not need to know the existence of the cache. It simply issues Read
and Write requests using addresses that refer to locations in the memory. The cache
control circuitry determines whether the requested word currently exists in the cache. If it
does, the Read or Write operation is performed on the appropriate cache location. This is
called a cache hit. The main memory is not involved when there is a cache hit in a Read
operation. For a write operation, there are two techniques: write-through and write-back.
In a write-through, both the cache and the main memory are updated. In write-back, only
the cache location is updated and a flag (a dirty or modified bit) is used to reflect the
change in status. The main memory location is updated later, when the block containing
this marked word is removed from the cache to make room for a new block.
If the data/instruction is not already stored in the cache then a cache miss occurs; the
required data is read from the external memory and placed into the cache as a page.
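The write-back policy described above can be sketched with a toy one-line cache. The class and method names are invented for illustration; a real cache operates on whole lines across many slots:

```python
# Minimal sketch of write-back behaviour: writes update only the
# cache and set the dirty bit; main memory is updated on eviction.
class WriteBackCache:
    def __init__(self, memory):
        self.memory = memory          # backing store: dict addr -> value
        self.line = None              # (addr, value, dirty)

    def write(self, addr, value):
        self.evict_if_needed(addr)
        self.line = (addr, value, True)   # dirty bit set, memory untouched

    def read(self, addr):
        if self.line and self.line[0] == addr:
            return self.line[1]           # cache hit
        self.evict_if_needed(addr)
        value = self.memory[addr]         # cache miss: fetch from memory
        self.line = (addr, value, False)
        return value

    def evict_if_needed(self, addr):
        if self.line and self.line[0] != addr and self.line[2]:
            old_addr, old_value, _ = self.line
            self.memory[old_addr] = old_value  # write dirty line back

mem = {0: 10, 4: 20}
cache = WriteBackCache(mem)
cache.write(0, 99)
print(mem[0])        # 10: main memory not yet updated
cache.read(4)        # eviction writes the dirty line back
print(mem[0])        # 99
```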

Figure Cache mechanism

Cache mechanism
In modern CPUs the size of a page (the data copied from memory to cache) is 64 bytes,
and the cache operation can be summarized as follows:
1. the CPU looks for the instruction stored at memory address X
2. since the contents of address X aren't inside the cache, the CPU has to get them
from external memory
3. the cache controller loads a page (64 bytes) starting at address X into the cache,
so if the next instruction to be fetched is at address X+2 then that instruction is
now stored in the cache
4. if the program runs sequentially, i.e. instructions are executed at addresses X,
X+2, X+4 etc., then the CPU never needs to fetch data directly from the RAM
except for the first instruction.
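Step 3 can be sketched as a simple address calculation. Real controllers load the line-aligned 64-byte block containing the requested address; the function name here is ours:

```python
LINE_SIZE = 64  # bytes per cache line, as in the text

def line_base(addr):
    """Base address of the 64-byte line containing addr."""
    return addr & ~(LINE_SIZE - 1)

# An access to 0x1234 loads the line 0x1200..0x123F, so the next
# sequential fetch at 0x1236 already hits in the cache.
print(hex(line_base(0x1234)))                  # 0x1200
print(line_base(0x1236) == line_base(0x1234))  # True
```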
In order to identify whether data from main memory is stored in a cache slot, a tag is
provided. A tag contains information related to the address of the memory being stored
in the cache. The size of the tag is (address bus size - log2 N) bits, where N is the
number of bytes in the data part of the cache slot. For example, if the address bus is 20
bits and the cache slot is 32 bytes, then the tag is 20 - 5 = 15 bits. If the memory data is
stored in the cache then the upper 15 bits of the address should match the tag value of
the cache slot. The rest of the address (5 bits) is the offset value identifying the exact
location of the data within the slot.
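The tag/offset split can be computed directly; a sketch (the function name is ours):

```python
def split_address(addr, addr_bits, slot_bytes):
    """Split an address into (tag, offset, tag_bits) for a given slot size."""
    offset_bits = slot_bytes.bit_length() - 1   # log2, for power-of-two sizes
    tag_bits = addr_bits - offset_bits
    tag = addr >> offset_bits
    offset = addr & (slot_bytes - 1)
    return tag, offset, tag_bits

# The text's example: 20-bit address bus, 32-byte slots -> 15 tag bits.
tag, offset, tag_bits = split_address(0x12345, addr_bits=20, slot_bytes=32)
print(tag_bits)            # 15
print(hex(tag), offset)    # 0x91a 5
```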

The V bit is the valid bit, indicating whether the slot holds valid data. If V=1, the data is
valid; if V=0, it is not. Initially a slot is invalid; once data is placed in the slot it becomes
valid.

The D bit is the dirty bit. This bit only has meaning if V=1. It indicates whether the data
in the slot has been modified (written to). If D=1, the data has been modified since being
brought into the cache. If D=0, the data is the same as when it first entered the cache.

Mapping between the cache and external memory

The size of the cache (KB or MB) is much smaller than the external memory (GB), so a
mapping mechanism is applied so that external memory can be mapped onto the cache.
The most popular mapping mechanism is called n-way set associative.
The cache is updated one page (or line) at a time, usually 64 bytes. For example, if the
cache is 512KB then there are a total of 8192 pages (or slots). The 8192 pages are
divided into sets based on the value n: if n is 4 then there are 8192/4 = 2048 sets. Based
on the number of sets, the external memory is also divided into the same number of
blocks.
For example, if the external memory is 1 GByte (1G = 2^30) then each block of the
external memory is 1G/2048 = 512KBytes, and each block of the external memory is
mapped to one set of the cache. So every 4 lines of the cache (256 bytes) are in charge
of (used to store) 512KBytes of the external memory, i.e. the ratio of cache to main
memory is 256 to 512K.

Example
Suppose the CPU has a 32-bit address and the cache has 128 slots with 32 bytes per slot.
Using 8-way set associative mapping, there are 8 slots per set, so there are 16 sets
(128/8). The tag size of each cache slot is 32 - 5 = 27 bits.
Since the cache is divided into sets, 4 bits are needed to represent the set number,
leaving a tag of 27 - 4 = 23 bits.

The 32-bit memory address is interpreted as follows:

Tag (23 bits) | Set (4 bits) | Offset (5 bits)

When the CPU issues an address, the 4 bits representing the set are used to determine
the set number. Then the slots in that set are searched for the tag value included in the
address. If there is a match, the corresponding data is extracted based on the offset
value.
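The lookup in this example can be sketched as a bit-field extraction (field widths taken from the example above; the function name is ours):

```python
# Decompose a 32-bit address for the 8-way example in the text:
# 5 offset bits (32-byte slots), 4 set bits (16 sets), 23 tag bits.
OFFSET_BITS, SET_BITS = 5, 4

def decompose(addr):
    offset = addr & ((1 << OFFSET_BITS) - 1)
    set_no = (addr >> OFFSET_BITS) & ((1 << SET_BITS) - 1)
    tag = addr >> (OFFSET_BITS + SET_BITS)
    return tag, set_no, offset

tag, set_no, offset = decompose(0xDEADBEEF)
print(hex(tag), set_no, offset)  # 0x6f56df 7 15
```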

Replacement algorithms

When a new block is to be brought into the cache and all the positions that it may occupy
are full, the cache controller must decide which of the old blocks to overwrite. In general,
it should keep blocks in the cache that are likely to be referenced in the near future. A
direct approach is to overwrite the one that has gone the longest time without being
referenced. This block is called the least recently used (LRU) block, and the technique is
called the LRU replacement algorithm. To use the LRU algorithm, the cache controller
must track references to all blocks as the computation proceeds.
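A sketch of LRU bookkeeping for one cache set, using Python's OrderedDict purely for illustration (hardware implements this with counters or bit matrices):

```python
from collections import OrderedDict

# Minimal LRU replacement sketch for one cache set.
class LRUSet:
    def __init__(self, ways=4):
        self.ways = ways
        self.blocks = OrderedDict()   # tag -> data, oldest first

    def access(self, tag, data=None):
        if tag in self.blocks:
            self.blocks.move_to_end(tag)       # mark most recently used
            return True                        # hit
        if len(self.blocks) >= self.ways:
            self.blocks.popitem(last=False)    # evict least recently used
        self.blocks[tag] = data
        return False                           # miss

s = LRUSet(ways=2)
s.access("A"); s.access("B")
s.access("A")            # A is now most recently used
s.access("C")            # evicts B, the least recently used
print("B" in s.blocks)   # False
print("A" in s.blocks)   # True
```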

Figure 4-way set associative mapping


Data bus

The 8086 has a 16-bit data bus. The P4 has a 64-bit data bus, so it can transfer more data
in a single read/write cycle, and processing of high-precision data can be more effective.
Consider this: what are the maximum values that can be represented by a 16-bit and a
32-bit pattern respectively, as well as by floating-point numbers? If the system uses a
16-bit data bus, how can you process 32-bit values?
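For the questions above: the maximum unsigned values, and the usual two-transfer approach for moving a 32-bit value over a 16-bit bus, can be sketched as:

```python
# Maximum unsigned values for 16- and 32-bit patterns.
print((1 << 16) - 1)   # 65535
print((1 << 32) - 1)   # 4294967295

def combine_words(low, high):
    """Reassemble a 32-bit value read as two 16-bit bus transfers."""
    return (high << 16) | low

value = 0x12345678
low, high = value & 0xFFFF, value >> 16
print(hex(combine_words(low, high)))  # 0x12345678
```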

Overlapping of operations

In the 8086 the instruction pipeline enables the overlapping of instruction execution and
fetching. The Pentium processor has a superscalar architecture, meaning that it can
execute multiple instructions concurrently. There are two integer pipelines, U and V,
each with 5 stages, see figure (the 8086 only has a two-stage pipeline). In addition,
there is one floating-point execution unit. So it is possible to execute three instructions
simultaneously (with proper programming the speedup can be up to 40%).

Pentium processor issues 2 instructions in parallel to the 2 independent integer pipelines


(U and V). This enhances the speed of integer arithmetic. In the prefetch stage, the CPU
fetches instructions from the code cache; there is a separate cache for data. The code
cache and the data cache both have 128K locations.
In Write Back, the CPU updates the register contents or the status bits in the flag register
depending on the execution result. The Pentium processor requires only one clock cycle
to decode an instruction, compared with two clock cycles for the 486.
The dual-pipeline of modern microprocessor
The five-stage U and V Pipeline architecture

The above diagrams illustrate the extended pipeline used in modern microprocessors.
Can you spot the similarity between the pipeline implemented in the 8086 and the above?

Can you guess the function of the branch prediction unit and the reason for including
such a unit?
As in the 8086, when a jump operation is performed the instruction queue must be reset,
and instructions already pre-fetched are discarded.
If you can predict when a branch or jump will take place, then you can pre-fetch from
the new jump target, so the instructions being pre-fetched are always relevant to the
instructions about to be executed.
The cache controller of a modern CPU analyzes each memory block it loads, and
whenever it finds a JMP instruction it loads the memory block for the target position
into the L2 cache before the CPU reaches that JMP instruction.
Consider a conditional statement such as "if a <= b go to address 1, else if a > b go to
address 2". This could cause a cache miss, because the values of a and b are unknown
and the cache controller looks only for JMP-like instructions. The cache controller
therefore loads both targets into the memory cache. Later, when the CPU processes the
branch instruction, it simply discards the one that wasn't chosen. It is better to load the
memory cache with unnecessary data than to access the RAM directly.
Dynamic branch prediction

The processor hardware assesses the likelihood of a given branch being taken by
keeping track of branch decisions every time a branch instruction is executed. A
dynamic branch prediction algorithm can use the result of the most recent execution of
a branch instruction: the processor assumes that the next time the instruction is executed,
the branch decision is likely to be the same as the last time.
There are two states: LT (branch is likely to be taken) and LNT (branch is likely not to
be taken).

Suppose it starts in LNT: when the branch instruction is executed and the branch is
taken, the machine moves to state LT; otherwise, it remains in state LNT. The next time
the same instruction is encountered, the branch is predicted as taken if the state is LT;
otherwise it is predicted as not taken.
This algorithm works well inside program loops. Once a loop is entered, the decision for
the branch instruction that controls looping will always be the same except for the last
pass through the loop. Hence, each prediction for the branch instruction will be correct
except in the last pass. The prediction in the last pass will be incorrect, and the branch
history state will be changed to the opposite state. Therefore, if the same loop is
executed again, the first prediction will also be wrong.

BT - branch taken
BNT - branch not taken

Example

do {
    x = x + 1;
} while (x < 100);

The system will be in state LNT when "while (x < 100)" is first executed, so the first
prediction is wrong. During the loop, the predictions will be correct while x is still less
than 100. When x reaches 100 the loop exits and the prediction (taken) will be wrong
again.
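The two-state scheme can be traced in a few lines (a simulation sketch; we assume x counts up from 0, giving 99 taken branches followed by one not-taken):

```python
# Two-state (LT/LNT) predictor traced over the loop in the text.
def simulate_one_bit(outcomes, state="LNT"):
    mispredictions = 0
    for taken in outcomes:
        predicted_taken = (state == "LT")
        if predicted_taken != taken:
            mispredictions += 1
        state = "LT" if taken else "LNT"   # remember the last outcome
    return mispredictions

# The branch at "while (x < 100)": taken 99 times, then not taken once.
one_run = [True] * 99 + [False]
print(simulate_one_bit(one_run))       # 2: first pass and last pass
print(simulate_one_bit(one_run * 2))   # 4: two mispredictions per run
```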
A four-state algorithm

ST - strongly likely to be taken
LT - likely to be taken
LNT - likely not to be taken
SNT - strongly likely not to be taken

Assume that the state of the algorithm is initially set to LNT. After the branch instruction
is executed, if the branch is actually taken, the state is changed to ST; otherwise, it is
changed to SNT. As program execution progresses and the same branch instruction is
encountered multiple times, the state of the prediction algorithm changes. The branch is
predicted as taken if the state is either ST or LT; otherwise, the branch is predicted as
not taken.

Consider what happens when executing a program loop. Assume that the branch
instruction is at the end of the loop and that the processor sets the initial state of the
algorithm to LNT. In the first pass, the prediction (not taken) will be wrong, and hence
the state will be changed to ST. In all subsequent passes, the prediction will be correct,
except for the last pass; at that time, the state will change to LT. When the loop is
entered a second time, the prediction in the first pass will be to take the branch, which
will be correct if there is more than one iteration. Thus, repeated execution of the same
loop now results in only one misprediction, in the last pass.
The 4-state algorithm also works well for nested loops such as:

do {
    do {
        ...
    } while (condition1);
} while (condition2);
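A sketch of the four-state predictor on the same branch trace as before. The text specifies only some of the transitions; the full transition table below follows the common textbook state diagram and is our assumption:

```python
# Four-state predictor: SNT, LNT, LT, ST. Predict taken in LT or ST.
def simulate_four_state(outcomes, state="LNT"):
    taken_next     = {"SNT": "LNT", "LNT": "ST",  "LT": "ST",  "ST": "ST"}
    not_taken_next = {"SNT": "SNT", "LNT": "SNT", "LT": "SNT", "ST": "LT"}
    mispredictions = 0
    for taken in outcomes:
        predicted_taken = state in ("LT", "ST")
        if predicted_taken != taken:
            mispredictions += 1
        state = taken_next[state] if taken else not_taken_next[state]
    return mispredictions

one_run = [True] * 99 + [False]
print(simulate_four_state(one_run))       # 2 on the first execution
print(simulate_four_state(one_run * 2))   # 3: only one more on the rerun
```

Compare this with the two-state scheme, which mispredicts twice on every rerun of the loop.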

Other hurdles of pipelining

Branching is a major hurdle to maximizing the performance of a pipeline, but there are
other issues as well; these are called pipeline hazards. Hazards reduce the performance
below the ideal speedup gained by pipelining. In addition to branching, there are:

Structural hazards - arise from resource conflicts when the hardware cannot support all
possible combinations of instructions simultaneously in overlapped execution. Structural
hazards appear when some resource has not been duplicated enough to allow all
combinations of instructions in the pipeline to execute. For example, a processor may have
only one register-file write port, but under certain circumstances, the pipeline might want
to perform two writes in a clock cycle. This will generate a structural hazard.

Some pipelined processors share a single memory pipeline for data and instructions.
As a result, when an instruction contains a data memory reference, it conflicts with the
instruction fetch of a later instruction. To resolve this hazard, the pipeline is stalled for
1 clock cycle when the data memory access occurs. As an alternative to this structural
hazard, the designer could provide separate memory access for instructions, either by
splitting the cache into separate instruction and data caches, or by using a set of buffers,
usually called instruction buffers, to hold instructions.
Hazards in pipelines can make it necessary to stall the pipeline. Avoiding a hazard often
requires that some instructions in the pipeline be allowed to proceed while others are
delayed. For the pipelines when an instruction is stalled, all instructions issued later than
the stalled instruction are also stalled. Instructions issued earlier than the stalled instruction
must continue, since otherwise the hazard will never clear. As a result, no new instructions
are fetched during the stall.
Figure Parallelism from pipelining

Data hazards - arise when an instruction depends on the results of a previous instruction in
a way that is exposed by the overlapping of instructions in the pipeline. Data hazards occur
when the pipeline changes the order of read/write accesses to operands so that the order
differs from the order seen by sequentially executing instructions on an unpipelined
processor.
Using a 5-stage pipeline as an example, the 5 stages are:

Instruction fetch cycle (IF)


Instruction decode/register fetch cycle (ID)
Execution/effective address cycle (EX)
Memory access (MEM)
Write-back cycle (WB) - write the result into the register file

Five-stage pipeline

Consider the pipelined execution of these instructions:

DADD R1,R2,R3   ; addition:    R1 = R2 + R3
DSUB R4,R1,R5   ; subtraction: R4 = R1 - R5
AND  R6,R1,R7   ; logical AND: R6 = R1 AND R7
OR   R8,R1,R9   ; logical OR:  R8 = R1 OR R9
XOR  R10,R1,R11 ; logical XOR: R10 = R1 XOR R11
Figure Data hazard from the example

All the instructions after the DADD use the result of the DADD instruction. The DADD
instruction writes the value of R1 in the WB stage, but the DSUB instruction reads the
value during its ID (instruction decode) stage. This problem is called a data hazard. Unless
precautions are taken to prevent it, the DSUB instruction will read the wrong value and try
to use it. In fact, the value used by the DSUB instruction is not even deterministic.
The above problem can be solved with a simple hardware technique called forwarding
(also called bypassing and sometimes short-circuiting). The key insight in forwarding is
that the result is not really needed by the DSUB until after the DADD actually produces it.
If the result can be moved from the pipeline register where the DADD stores it to where the
DSUB needs it, then the need for a stall can be avoided. Using this observation, forwarding
works as follows:
1. The ALU result from both the EX/MEM and MEM/WB pipeline registers is
always fed back to the ALU inputs.

2. If the forwarding hardware detects that the previous ALU operation has written
the register corresponding to a source for the current ALU operation, control logic selects
the forwarded result as the ALU input rather than the value read from the register file.
Notice that with forwarding, if the DSUB is stalled, the DADD will be completed and the
bypass will not be activated.
Five-stage pipeline with forwarding
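The detection step (2.) can be sketched as a simple register comparison. The instruction encoding here is a hypothetical tuple, not a real pipeline-register format:

```python
# Forwarding check: does a previous in-flight instruction write a
# register that the current instruction reads as a source?
def needs_forwarding(prev_instr, curr_instr):
    _, prev_dest, _, _ = prev_instr        # (op, dest, src1, src2)
    _, _, src1, src2 = curr_instr
    return prev_dest in (src1, src2)

dadd = ("DADD", "R1", "R2", "R3")
dsub = ("DSUB", "R4", "R1", "R5")
print(needs_forwarding(dadd, dsub))                      # True: forward R1
print(needs_forwarding(dadd, ("AND", "R6", "R2", "R7"))) # False: no dependence
```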

Floating point hardware and format

Floating-point calculation is very important, and dedicated hardware is now embedded
in the microprocessor to speed up floating-point operations. Can you imagine how to
implement floating-point arithmetic with the 8086? The co-processor mechanism was
used before floating-point hardware was available within the microprocessor.

X = F * 2^E

If floating-point data follows the IEEE format then arithmetic operations can easily be
implemented using adders, multipliers, dividers, subtractors etc.

For example,
if A = 1.2 * 2^3 and B = 1.123 * 2^3,
then A + B = (1.2 + 1.123) * 2^3 (a single addition)
and A * B = (1.2 * 1.123) * 2^(3+3) (a multiply and an addition of exponents)
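The arithmetic above can be checked directly in Python (the comparisons hold exactly here because scaling by a power of two only changes the exponent field):

```python
import math

A = 1.2 * 2**3
B = 1.123 * 2**3

# Equal exponents: add the significands, keep the exponent.
print(A + B == (1.2 + 1.123) * 2**3)        # True
# Multiplication: multiply significands, add the exponents.
print(A * B == (1.2 * 1.123) * 2**(3 + 3))  # True

# math.frexp exposes the F * 2^E decomposition of any float:
print(math.frexp(12.0))                     # (0.75, 4): 0.75 * 2^4 = 12
```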

The current trend in microprocessor design is to perform more tasks in a single cycle, or
parallel processing (using multiple cores). Dual-core technology refers to two individual
processor cores on a single die. This is essentially two central processing units (CPUs)
in one chip. The advantage of a dual-core chip is that tasks can be carried out in parallel
streams, decreasing processing time. But inside each core, you will still find designs
based on the pipeline mechanism.

Self test

1. What is the most important hardware feature embedded in an 8086 microprocessor?


2. What are the basic operations performed when a computer is running?
3. What is the maximum value represented by a 20-bit pattern?
4. What is the maximum value represented by the sum of two 16-bit patterns?
5. What is the major function of an ALU?
6. What are the major functions of the BIU?
7. Can you use a block diagram to represent a memory?
8. What is an instruction pipeline and why can it improve the overall performance of a
microprocessor?
9. Referring to the block diagram of the Pentium 4 microprocessor, there are four major
components; can you identify the basic functions of the different components?

Block diagram of the Intel P4 microprocessor

Critical thinking:
Referring to the block diagram of the Intel P4 microprocessor, can you spot components
that are also found in the 8086 microprocessor? Are you able to guess the functions of
those components?

Introduction to microcontroller

If you want to develop a simple robot, such as a robotic car that can follow a track, are
you going to use a P5 microprocessor in your system?
Usually, to implement a simple system, a microcontroller is used instead of a
microprocessor.
A microcontroller can be regarded as an all-in-one device with a CPU, memory and
input/output interfaces all included in a single package. Commonly used
microcontrollers include the 8051 series, Basic Stamp, BasicX etc. In addition, the cost
of a microcontroller is usually lower than that of a traditional microprocessor; for
example, an 8051 costs only HKD25.
Some microcontrollers also come with additional features such as ADC (analog to digital
converter), DAC (digital to analog converter) and PWM (pulse-width modulation) output.
All those features are very useful for the implementation of basic robotic or control
systems.

The ADuC832 device


The ADuC832 is a powerful microcontroller and is used in the experiment setup.
Basic features of the device include:
1. an 8-channel 12-bit ADC
2. two 12-bit DACs
3. 62KBytes of program memory, 4KBytes of data memory
4. 2304 bytes of on-chip data RAM
5. dual PWM outputs (the PWM can be used to control an analog device)
6. an 8051-compatible instruction set
7. four 8-bit input/output ports
Figure 5 Functional block diagram of ADuC832

With the above features, if you want to implement a robot that can follow a track, you
only need to add suitable sensors and motors to the ADuC832 and, most importantly,
write a suitable control program!

The components can be connected directly to the microcontroller via the I/O ports
without other supporting devices.
