For Private Circulation Only
Advanced
Microprocessors
BE-Semester-VII-Computer
PART-I
Chapter-1 Overview
Chapter-2 Advanced Intel Microprocessors
Chapter-3 Study of Pentium Family
Notes Prepared By
Prof. Faruk Kazi
NOTE: Printed notes are revised as per the changed question paper pattern and are not suitable for independent reading. Combine them with class notes for better conceptual understanding and exam-oriented preparation.
Downloaded from FaaDoOEngineers.com
B.E. COMPUTER ENGINEERING
FOURTH YEAR SEMESTER VII
SUBJECT: ADVANCED MICROPROCESSORS
Lectures: 4 Hrs per week Theory: 100 Marks
Practical: 2 Hrs per week Term work: 25 Marks
Oral Exam.: 25 Marks
Objective: To study microprocessor basics and the fundamental principles of architecture related to advanced microprocessors.
Pre-requisite: Microprocessors
DETAILED SYLLABUS
1. Overview of new generation of modern microprocessors
2. Advanced Intel Microprocessors
Protected mode operation of x86 Intel family, study of Pentium, superscalar architecture and pipelining, register set & special instructions, memory management, cache organization, bus operation, branch prediction logic
3. Study of Pentium Family of Processors
Pentium I, Pentium II, Pentium III, Pentium IV, architectural features, comparative study.
4. Advanced RISC Microprocessors
Overview of RISC development and current systems, Alpha AXP architecture, Alpha AXP implementation and applications
5. Study of Sun SPARC Family
SPARC architecture, the SuperSPARC, SPARC implementation and application
6. Standard for Bus Architecture and Ports
EISA, VESA, PCI, SCSI, PCMCIA cards and slots, ATA, ATAPI, LPT, USB, AGP, RAID
7. System Architecture for desktop and server based systems
Study of memory subsystems and I/O subsystems, integration issues.
BOOKS:
Text Books:
1. Daniel Tabak, "Advanced Microprocessors", Tata McGraw Hill
2. Barry Brey, "The Intel Microprocessors: Architecture, Programming and Interfacing"
3. Tom Shanley, "Pentium Processor System Architecture", Addison Wesley Press
References:
1. Ray and Bhurchandi, "Advanced Microprocessors and Peripherals", TMH
2. James Antonakos, "The Pentium Microprocessor", Pearson Education
3. Badri Ram, "Advanced Microprocessors and Interfacing", TMH
4. Intel Manuals
TERM WORK
1. Term work shall consist of at least 10 practical experiments and two assignments covering the topics of the syllabus.
ORAL EXAMINATION
An oral examination is to be conducted based on the above syllabus.
BE-Sem-VII-COMP-Advanced Microprocessor Notes By Prof. Faruk Kazi -9820223893
The major status signals are:
SLCT When the printer is selected, this signal goes high.
ACKNLG When it is low, it indicates that the data character has been accepted and the printer is ready for the next character.
BUSY If, for some reason, the printer is not ready to receive a character, this signal goes high. An example of such a reason is the printer being out of paper.
PE This signal goes high if the out-of-paper switch in the printer is activated.
ERROR This signal goes low for a number of problems in the printer.
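The status signals above can be read as a simple "may I send the next character?" check. The sketch below is only an illustration of that logic: the function names and the idea of passing each line as a 0/1 logic level are assumptions made for this example, not part of the LPT standard or of any driver API.

```python
def describe_status(slct, busy, pe, error, acknlg):
    """Translate LPT status line levels (0 or 1) into readable notes.

    Polarity follows the descriptions above:
    SLCT high = selected, BUSY high = not ready, PE high = out of paper,
    ERROR low = printer problem, ACKNLG low = character accepted.
    """
    notes = []
    if not slct:
        notes.append("printer not selected")
    if busy:
        notes.append("printer busy")
    if pe:
        notes.append("out of paper")
    if not error:
        notes.append("printer error")
    if not acknlg:
        notes.append("last character acknowledged")
    return notes


def can_send(slct, busy, pe, error):
    """True only when the printer is selected, idle, has paper and no fault."""
    return bool(slct and not busy and not pe and error)


if __name__ == "__main__":
    # Selected printer that has run out of paper (BUSY and PE both high).
    print(describe_status(slct=1, busy=1, pe=1, error=1, acknlg=1))
    print(can_send(slct=1, busy=1, pe=1, error=1))
```

Note that an out-of-paper printer raises both BUSY and PE, which matches the example given for BUSY above.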
RAID
Redundant Array of Independent Drives (or Disks), also known as Redundant Array of Inexpensive Drives (or Disks).
The basic idea of RAID was to combine multiple small, inexpensive disk drives into an array of disk drives which yields performance exceeding that of a Single Large Expensive Drive (SLED). Additionally, this array of drives appears to the computer as a single logical storage unit or drive. That is, RAID combines multiple hard disks into a single logical unit. There are two ways this can be done: in hardware and in software. Hardware RAID combines the drives into a logical unit in dedicated hardware, which then presents the drives as a single drive to the operating system. Software RAID does this within the operating system and presents the drives as a single drive to the users of the system. A quick summary of the most commonly used RAID levels is given below. (You may also refer to your COA notes.)
* RAID 0: Striped set (2 disks minimum) without parity. Provides improved performance and additional storage but no fault tolerance from disk errors or disk failure. Any disk failure destroys the array, which becomes more likely with more disks in the array. A single disk failure destroys the entire array because, when data is written to a RAID 0 drive, the data is broken into "fragments". The number of fragments is dictated by the number of disks in the drive. Each of these fragments is written to its respective disk simultaneously on the same sector. This allows the entire chunk of data to be read off the drive in parallel, giving this type of arrangement huge bandwidth. When one sector on one of the disks fails, though, the corresponding sector on every other disk is rendered useless because part of the data is now corrupted. RAID 0 does not implement error checking, so any error is unrecoverable. More disks in the drive means higher bandwidth, but greater risk of data loss.
* RAID 1: Mirrored set (2 disks minimum) without parity. Provides fault tolerance from disk errors and single disk failure. Increased read performance occurs when using a multi-threaded operating system that supports split seeks; there is a very small performance reduction when writing. The array continues to operate so long as at least one drive is functioning.
* RAID 3 and RAID 4: Striped set (3 disks minimum) with dedicated parity. The parity is the bitwise XOR of the corresponding data on the data disks, so the contents of any one failed disk can be recomputed from the remaining disks. Provides improved performance and fault tolerance similar to RAID 5, but with a dedicated parity disk rather than rotated parity stripes. The single parity disk is a bottleneck for writing, since every write requires updating the parity data. One minor benefit is that the dedicated parity disk can fail and operation will continue without parity and without a performance penalty.
* RAID 5: Striped set (3 disks minimum) with distributed parity. Distributed parity requires all but one drive to be present to operate; a drive failure requires replacement, but the array is not destroyed by a single drive failure. Upon drive failure, any subsequent reads can be calculated from the distributed parity, so the drive failure is masked from the end user. The array will have data loss in the event of a second drive failure and is vulnerable until the data that was on the failed drive is rebuilt onto a replacement drive.
* RAID 6: Striped set (4 disks minimum) with dual distributed parity. Provides fault tolerance from two drive failures; the array continues to operate with up to two failed drives. This makes larger RAID groups more practical. It is becoming a popular choice for SATA drives as they approach 1 terabyte in size, because the single-parity RAID levels are vulnerable to data loss until the failed drive is rebuilt. The larger the drive, the longer the rebuild will take. Dual parity gives the array time to rebuild onto a large drive while still being able to sustain another drive failure.
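The parity-based levels above (RAID 3, 4 and 5) rely on the fact that XOR-ing all surviving blocks of a stripe gives back the missing block. The following is a minimal sketch of that idea for one stripe only. The block contents and the helper name `xor_blocks` are made up for illustration, and the rotation of parity across disks that distinguishes RAID 5 is not modelled.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe spread over three data disks (illustrative contents).
data = [b"ABCD", b"EFGH", b"IJKL"]

# The parity block is the XOR of all data blocks in the stripe.
parity = xor_blocks(data)

# Simulate losing disk 1: rebuild its block from the survivors plus parity.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)

print(rebuilt)              # the block that was on the failed disk
print(rebuilt == data[1])
```

If a second disk were lost as well, the XOR of the remaining blocks would no longer determine either missing block, which is why single-parity arrays are vulnerable during a rebuild.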
Chapter 1- Overview
Notes by Prof. Faruk Kazi
DEC07/10M: Compare x86 processors - 8086 to Pentium
(Please refer to class notes of the first lecture for the complete answer.)
Processor (number of transistors) | Data/address bus | Architecture | Processing | Instructions : clocks | Clock speed / MIPS
8086 (29,000 at 3 µm) | 16-bit data and 20-bit address bus. Addressable memory 1 MB. | CISC with separate bus interface and execution units. Six-byte instruction queue. | Fetch and execution cycles overlap. | 1:4 | 5 MHz with 0.33 MIPS; 10 MHz with 0.75 MIPS
80186 | Refer class notes
80286 (134,000 at 1.5 µm) | 16-bit data and 24-bit address bus. Addressable memory 16 MB physical. | Added protected-mode features to 8086 with essentially the same instruction set. Included memory protection hardware to support multitasking operating systems with per-process address space. | | | 10 MHz with 1.5 MIPS; 12.5 MHz with [value illegible] MIPS
80386 DX (275,000) | Complete 32-bit microprocessor. | Real and protected mode with 32-bit memory management. (Also read about its versions.) | 3-stage pipeline: fetch, decode and execute. Reduced instruction times. | 1:2 | 16 MHz with 5 MIPS; 25 MHz with 8.5 MIPS; 33 MHz with 11.4 MIPS
Differential SCSI is a great idea in theory, and one might have thought it would become very popular. In fact, this never happened in the PC world, largely due to cost. The circuits needed to drive differential signals are more expensive and use more power than those for single-ended SCSI. For many years, single-ended SCSI was "good enough" and allowed cable lengths sufficient for the needs of most users.
PCMCIA CARDS AND SLOTS
* PCMCIA stands for Personal Computer Memory Card International Association, an international standards body and trade association based in San Jose, California, founded in 1989.
* It is a standard for peripherals whose size is that of a credit card.
* PCMCIA cards are also called PC Cards.
* The card is approximately 2 inches wide and about 3.4 inches long.
* The thickness varies from about 1/8 inch to about 5/8 inch depending on its type.
* Originally, the standard was developed for removable memory cards for portable computers. Nowadays the standards are available for a variety of peripheral devices such as fax, modem, SCSI adapter, Ethernet adapter, disk drives etc.
* The standard specifies the physical design of cards, the physical design of the connector, the electrical interface to cards etc. PCMCIA cards are becoming standard features on portable and desktop computers.
* The PCMCIA slot supports hot insertion, which allows devices to be plugged and unplugged without switching off the power supply to the computer.
* The types of PCMCIA cards are Type I to Type IV. The Type I card is 3.3 mm thick, has a 68-pin connector (two rows of 34 pins) and supports actual memory cards (for example ATA Type I Flash Memory Cards) such as DRAM or flash memory.
* The Type II card is 5 mm thick and has a 68-pin connector. It can interface fax, cellular modem, LAN adapter, wireless LAN adapter, SCSI adapter etc.
* The Type III card is 10.5 mm thick and has a 68-pin connector. It is used for hard disk drives up to [capacity missing] GB.
* The Type IV card is 16 mm thick and was developed by Toshiba for removable hard disks.
LPT - Line PrinTer - Parallel port
A parallel port can be used to interface printers, CD-ROM drives, external hard disk drives, ZIP drives or other mass storage devices. Devices attached this way run slower than if they were directly connected to the PC's I/O bus via a plug-in card, but it is convenient to hook up external peripheral devices to the parallel port of a PC. IEEE 1284 is a standard for peripheral devices to be connected to a parallel port. It allows five modes of operation.
Centronics Standard: Centronics is the name of a company. It developed a standard for interfacing a printer to a parallel port, using a 36-pin interface. As the 36-pin connector is bulky, IBM preferred to use a 25-pin connector for the printer interface to a parallel port and named it the LPT port.
The major control signals in the LPT standard are:
INIT It directs the printer to perform its internal initialization sequence.
STROBE It tells the printer that a character is available for it.
SLCTIN When this signal is low, data can be entered into the printer.
AUTOFEED XT When this signal is low, the paper is automatically fed one line after printing.
Processor (number of transistors) | Data/address bus | Architecture | Processing | Instructions : clocks | Clock speed / MIPS
80486 (1.2 million at 1 µm; the 50 MHz version was at 0.8 µm) | 32-bit address and data bus | Onboard 8 KB cache and FPU. | 5-stage pipeline | 1:1 | 25 MHz with [value illegible] MIPS; 33 MHz with 27 MIPS; 50 MHz with 41 MIPS
Pentium-I (3.1 million) | 64 bits | Superscalar, split cache. Redesigned FPU with 8-stage pipeline. Branch prediction logic. | Dual pipelines allow two instructions to be executed simultaneously. | 2:1 | 60 MHz with 100 MIPS (further entry illegible)
P6 (refer 3rd chapter notes) | 64 bits | Three instruction decoders with 12-stage pipeline. Out-of-order execution with on-board level one and level two cache. | Instructions broken into micro-ops that move through the pipeline in a fetch/decode, dispatch/execute, retire sequence. | 3:1 | 300-500 [unit illegible]
Table 1.1: Comparison of Intel family
Note: Clock frequency and MIPS depend on processor versions. Pentium-I has three basic versions, P5, P54 and P54C, of which P5 is considered here for comparison. For memory hierarchy and its calculations, please refer to class notes of the first lecture.
Common features of all advanced microprocessors:
* 32 or 64 bit microprocessor
* Wider data bus - double the ALU size
* On-chip FPU
* On-chip MMU
* Dedicated L1 code cache
* Dedicated L1 data cache
* Branch Prediction Logic (BPL)
* Superscalar
* Multiple-stage pipeline
REQ (REQUEST) A signal driven by a target to request a REQ/ACK data transfer handshake.
ACK (ACKNOWLEDGE) A signal driven by an initiator to acknowledge a REQ/ACK data transfer handshake.
ATN (ATTENTION) A signal driven by an initiator to indicate the Attention condition (the initiator has a message for the target).
RST (RESET) Indicates the Reset condition.
DB(7-0, P) (DATA BUS) Eight data-bit (DB) signals, plus a parity-bit signal, that form the Data Bus. DB(7) is the most significant bit. Bit number, significance, and priority decrease downward to DB(0). A data bit is defined as one when the signal value is true and as zero when the signal value is false. Data parity DB(P) shall be odd, but parity is undefined during the Arbitration phase.
Single-Ended (SE) and Differential (High Voltage Differential, HVD) SCSI
Conventional SCSI signaling is very similar to that used for most other interfaces and buses within the PC. Conventional logic is used: a positive voltage is a "one", and a zero voltage (ground) is a "zero". This is called single-ended signaling, abbreviated SE. Up until recently, single-ended SCSI had been by far the most popular signaling type, for a simple reason: it is relatively simple and inexpensive to implement.
There is an important problem with SE signaling, however. SCSI is a high-speed bus capable of supporting multiple devices, including devices connected both inside and outside the PC. As with all high-speed parallel buses, there is always a concern about signal integrity on the bus: problems can arise due to bouncing signals, interference, degradation over distance and cross-talk from adjacent signals. The faster the bus runs, the more these problems manifest themselves; the longer the cable, the more the problems exist for any given interface speed. As a result, the length of a single-ended SCSI cable is rather limited, and the faster the bus runs, the shorter the maximum allowable cable length.
To get around this problem, a different signaling method was also defined for SCSI, which uses two wires for each signal that are mirror images of each other. For a logical "zero", zero voltage is sent on both wires. For a logical "one", the first wire of each signal pair carries a positive voltage, similar to the signal on an SE bus, but not necessarily at the same voltage. The second wire carries the electrical opposite of the first wire. The circuitry at the receiving device takes the difference between the two signals sent, and thus sees a relatively high voltage for a one and a zero voltage for a zero. It is called differential signaling, after the technique used by the recipient to determine the value of each signal. The two signals in each pair are usually named with "+" and "-" signs; for example, the signal carrying data bit 0 would use "+DB(0)" and "-DB(0)".
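The benefit of taking the difference can be illustrated with a small numerical sketch. It assumes, purely for illustration, a 2.5 V drive level, a 1.0 V differential threshold, a 1.25 V single-ended threshold, and noise that couples equally onto both wires of a pair; none of these numbers come from the SCSI specification.

```python
V_HIGH = 2.5  # illustrative drive voltage, not a SCSI-specified value


def drive_pair(bit):
    """Return (+wire, -wire) voltages: opposite for a one, both zero for a zero."""
    return (V_HIGH, -V_HIGH) if bit else (0.0, 0.0)


def receive_differential(plus, minus, threshold=1.0):
    """The receiver looks only at the difference between the two wires."""
    return 1 if (plus - minus) > threshold else 0


def receive_single_ended(voltage, threshold=1.25):
    """A single-ended receiver compares one wire against ground."""
    return 1 if voltage > threshold else 0


def transmit(bits, noise):
    """Send bits over both schemes with the same noise sample per bit."""
    diff_out, se_out = [], []
    for bit, n in zip(bits, noise):
        plus, minus = drive_pair(bit)
        # Common-mode noise adds to both wires and cancels in the difference.
        diff_out.append(receive_differential(plus + n, minus + n))
        se_out.append(receive_single_ended((V_HIGH if bit else 0.0) + n))
    return diff_out, se_out


if __name__ == "__main__":
    bits = [1, 0, 1, 1, 0]
    noise = [0.8, -1.2, 2.0, -0.5, 1.5]
    diff_out, se_out = transmit(bits, noise)
    print(diff_out)  # matches the transmitted bits
    print(se_out)    # the last bit is corrupted by the 1.5 V noise spike
```

Because the noise term appears on both wires, it drops out of `plus - minus`, which is the reason differential buses tolerate much longer cables.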
The table below shows the great difference in cable length that exists between SE and differential devices, particularly as bus speed increases:
Signaling speed | Bus speed (MHz) | Single-ended SCSI maximum cable length (m) | Differential SCSI maximum cable length (m)
Slow | 5 | 6 | 25
Fast | 10 | 3 | 25
Fast-20 | 20 | 1.5 | 25
As you can see, each doubling of the bus speed results in a halving of the maximum cable length for single-ended SCSI, but differential SCSI allows long (25 m) cables for all three speeds.
Table 1.2: Generations of Intel processors
Generation | Processor | Features
P1 | 8086 | 16-bit registers and data bus, real mode only
P1 | 8088 | Same as 8086 with 8-bit external data bus
P2 | 80286 | Added protected mode
P3 | 80386DX | Introduced IA-32, 32-bit registers and buses, added virtual 8086 mode
P3 | 80386SX | Same as 80386DX with 16-bit external data bus
P4 | 80486DX | Same as 80386DX with integrated FPU and L1 cache
P4 | 80486SX | Same as 80486DX without coprocessor
P4 | 80486DX2 and 80486DX4 | Same as 80486DX with faster (2x or 3x) internal clock
P5 | Pentium Classic | Superscalar architecture, dual instruction pipelines, 64-bit external data bus, branch prediction
P5 | Pentium MMX | Same as Classic with support for MMX operations
P6 | Pentium Pro | Dynamic execution, L2 cache in same package, no MMX
P6 | Pentium II | Same as Pro, new cartridge package, MMX support
P6 | Celeron | Same as Pentium II but no integrated L2
P6 | Pentium III | Same as Pentium II with SSE support
P6 | Pentium 4 | NetBurst architecture
P7 | Itanium (released May 29, 2001) | IA-64, 64-bit registers, 128-bit instruction bundles with explicit parallelism, 128-bit data bus, 64-bit address bus. It has 16 KB of Level 1 instruction cache and 16 KB of Level 1 data cache. The L2 cache is unified (both instruction and data) and is 256 KB. The Level 3 cache is also unified and varies in size from 1.5 MB to 24 MB.
There is also a SCSI command called SCSI Reset that can take the bus immediately to the BUS FREE phase. SCSI Reset is used to force the SCSI bus to go to the BUS FREE phase.
[Figure: host adapter and SCSI devices on a shared bus, with the cable continuing to other SCSI devices]
Figure 6.7 Typical SCSI Configuration
SCSI Bus Signals:
Signal Description
BSY (BUSY) A signal which indicates that the bus is being used.
SEL (SELECT) A signal used by an initiator to select a target, or by a target to reselect an initiator.
C/D (CONTROL/DATA) A signal driven by a target to indicate whether control or data information is on the data bus. True indicates control.
I/O (INPUT/OUTPUT) A signal driven by a target to control the direction of data movement on the data bus. True indicates input to the initiator. This signal is also used to distinguish between the selection and reselection phases.
MSG (MESSAGE) A signal driven by a target during the Message phase.
Clock Generator
2 Independent DMA Channels
3 Programmable 16-Bit Timers
Dynamic RAM Refresh Control Unit
Programmable Memory and Peripheral Chip Select Logic
Programmable Wait State Generator
Local Bus Controller
System-Level Testing Support
Direct Addressing Capability to 1 Mbyte Memory and 64 Kbyte I/O
Supports Intel 80187 Numeric Coprocessor Interface
SCSI Bus Phases
The SCSI bus can be time-shared, which results in greater usage of bus bandwidth. This is how it works: while one device is using the bus, other devices may be active and performing internal activities. Devices do not use the bus unless they are involved in data transfer or have status to report. Devices may disconnect from the bus while time-consuming activities internal to the device are occurring. As soon as a device is ready to resume communication, the device can arbitrate for the bus (when the bus is free) to reattach to the host. System performance is significantly increased when devices disconnect and reconnect to the bus. During the bus phases (refer figure), devices must first contend for access to the bus. Then a physical path is established between the initiator and target. Remember, the SCSI bus cannot be in more than one phase at a time.
Bus Free Phase: The Bus Free Phase is used to indicate that no SCSI device is actively using the SCSI bus and that it is available for subsequent users. SCSI devices shall detect the Bus Free Phase after SEL and BSY are both false.
Arbitration Phase: The Arbitration Phase allows one SCSI device to gain control of the SCSI bus so that it can assume the role of an initiator or target. If no higher priority SCSI ID bit is true on the Data Bus, then the SCSI device has won the arbitration and it asserts SEL. Any other SCSI device that is participating in the Arbitration Phase has lost the arbitration. The SCSI device that won the arbitration has both BSY and SEL asserted.
Selection Phase: The Selection Phase allows an initiator to select a target for the purpose of initiating some target function, for example the Read or Write command. The initiator sets the Data Bus to a value which is the OR of its SCSI ID bit and the target's SCSI ID bit for selection of the target.
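The arbitration and selection rules above can be sketched with plain bit operations on an 8-bit data bus, where SCSI ID n corresponds to data bit DB(n) and higher bits have higher priority. The function names are hypothetical and timing (bus-settle and arbitration delays) is ignored; this is only a model of the bit patterns.

```python
def arbitrate(contending_ids):
    """Each contender asserts its own ID bit; the highest asserted bit wins.

    Returns (bus_value, winning_id).
    """
    if not contending_ids:
        raise ValueError("at least one device must arbitrate")
    bus = 0
    for scsi_id in contending_ids:
        bus |= 1 << scsi_id          # wired-OR of all asserted ID bits
    winner = bus.bit_length() - 1    # index of the highest set bit
    return bus, winner


def selection_pattern(initiator_id, target_id):
    """Data bus value during selection: OR of the two ID bits."""
    return (1 << initiator_id) | (1 << target_id)


if __name__ == "__main__":
    bus, winner = arbitrate([2, 5, 7])
    print(f"bus={bus:08b} winner=ID {winner}")
    print(f"selection pattern={selection_pattern(7, 2):08b}")
```

Since the host adapter uses the highest ID, it wins whenever it takes part in arbitration.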
Reselection Phase: Reselection is an optional phase that allows a target to reconnect to an initiator for the purpose of continuing some operation that was previously started by the initiator but was suspended by the target. For example, a host system may have requested a Read from a disk. The disk can Disconnect and Reconnect if the Read involves a time-consuming seek operation. This is one of the optimization features of SCSI.
Information Transfer Phases: The Command, Data, Status, and Message Phases are all grouped together as the Information Transfer Phases because they are all used to transfer data or control information via the Data Bus. The C/D, I/O, and MSG signals are used to distinguish between the different Information Transfer Phases. The target drives these three signals and therefore controls all changes from one phase to another.
Command Phase: The Command Phase allows the target to request command information from the initiator. The target shall assert the C/D signal and negate the I/O and MSG signals during the REQ/ACK handshake(s) of this phase.
Data Phase: The Data Phase is a term that encompasses both the Data In Phase and the Data Out Phase. The Data In Phase allows the target to request that data be sent to the initiator from the target. The Data Out Phase allows the target to request that data be sent from the initiator to the target.
Status Phase: The Status Phase allows the target to request that status information be sent from the target to the initiator.
Message Phase: The Message Phase is a term that references either a Message In or a Message Out Phase. Multiple messages may be sent during either phase. The first byte transferred in either of these phases is either a single-byte message or the first byte of a multiple-byte message. Multiple-byte messages are wholly contained within a single message phase. The Message In Phase allows the target to request that message(s) be sent to the initiator from the target.
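Since the target selects the information transfer phase purely through the MSG, C/D and I/O lines, the current phase can be decoded with a lookup table. The Command Phase entry follows directly from the text above; the remaining combinations are filled in from the SCSI-2 convention (C/D true = control, I/O true = towards the initiator) and should be checked against the standard. The function name is hypothetical.

```python
# Key: (MSG, C/D, I/O) with 1 = asserted (true), 0 = negated (false).
PHASES = {
    (0, 0, 0): "DATA OUT",     # data, initiator -> target
    (0, 0, 1): "DATA IN",      # data, target -> initiator
    (0, 1, 0): "COMMAND",      # control, initiator -> target
    (0, 1, 1): "STATUS",       # control, target -> initiator
    (1, 1, 0): "MESSAGE OUT",  # message, initiator -> target
    (1, 1, 1): "MESSAGE IN",   # message, target -> initiator
}


def information_transfer_phase(msg, cd, io):
    """Decode the phase the target is requesting; unused codes are reserved."""
    return PHASES.get((msg, cd, io), "RESERVED")


if __name__ == "__main__":
    # A typical read: command out, data in, status in, message in.
    for lines in [(0, 1, 0), (0, 0, 1), (0, 1, 1), (1, 1, 1)]:
        print(lines, information_transfer_phase(*lines))
```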
Bus Free Phase: Once the bus has gone to the BUS FREE phase, the target is no longer in control of the bus. At this point a device is free to proceed with another SCSI transaction.
MAY06/10 Marks: State versions of 80386. Draw its block diagram and explain.
Figure 1.2a: Functional Block Diagram of Intel 80386 Processor
SCSI Standard: (MAY05/4M, NOV05/10M, MAY06/10M, NOV06/20M, MAY07/10M)
The development of the Small Computer Systems Interface (SCSI) was a major step forward in hardware interfaces for "small computers" (as opposed to mainframes and minicomputers). Interfaces prior to SCSI were not intelligent and were designed for specific devices. Thus there was a hard disk interface for a hard drive, a tape drive interface for a tape drive, and so on. With SCSI, a standard interface was defined for all devices so that only a single adapter was required. The first SCSI standard, referred to as SCSI-1, supported up to seven devices per adapter and was approved in 1986. It had its roots in SASI (Shugart Associates Systems Interface), which was developed by Al Shugart's Shugart Associates in 1979.
SCSI is an intelligent adapter and not a controller. The controller is built on the drive itself. It has a separate I/O bus called the SCSI bus. As its data transfer rate is high, it is connected to the PCI bus. Its latest version, called SCSI-3, can interface up to 15 devices. Each device connected to the SCSI bus is assigned an identification number. The highest identification (ID) number is used by the host adapter. A device designed to be connected to a SCSI bus is called a SCSI device. A wide variety of devices such as hard disk drives, optical disk drives, ZIP drives, tape drives, printers, scanners, graphics tablets etc. can be connected to a SCSI bus. The devices are connected in a daisy-chain fashion, as shown in Fig. 6.5. A flat cable which contains 50 wires runs from the SCSI host adapter card to the SCSI devices. To reduce reflections the cable is terminated at the end. The terminator terminates lines to the ground through resistors.
SCSI devices, except hard disk drives, can be moved along with their data from the host adapter of one computer to that of another. A hard disk is treated as a new disk on the other computer. This problem does not arise in the case of optical disks and removable hard disks. In these cases the driver software which supports their use has been provided with the features needed for such movement. The SCSI standard allows multiple, independent conversations between SCSI devices to go on simultaneously across a single SCSI bus. This feature allows a high degree of multiprocessing in a PC with a SCSI bus. Suitable software is needed for this purpose.
On the SCSI host adapter card, there are connectors for external SCSI devices as well as internal SCSI devices. SCSI, being costly, is used on servers. Desktop computers use EIDE connectors.
[Figure: SCSI host adapter connected by flat cables to SCSI Device 1, SCSI Device 2 and further SCSI devices in a daisy chain]
Fig. 6.5 SCSI Interface
SCSI Layered Architecture:
The peripheral interface is made up of many layers. A peripheral interface model with four layers for SCSI is agreed upon by the American National Standards Institute (ANSI). The lowest layer is the Physical Interface Layer. It describes the cable and connector types. It also defines signal voltages and current requirements of the drivers used in the interface. The timing specifications and the coordination of all the signals at the interface bus are described in this layer. Above the physical layer resides the Protocol Layer. A protocol is nothing but a set of rules. The protocol layer gives rules for the exchange of messages between devices connected through an interface. It also describes the use of error correction if the data is corrupted. It defines a data byte and separates it from an instruction. The Device Model Layer lies on top of the protocol layer. This layer describes the behavior of the device to be connected to the interface. For example, a printer interface may define a printer as a page printer or a line printer, depending on the interface. These descriptions can be detailed and precise. The Command Set Layer is the fourth layer of the interface model. The command set builds upon the device model. It defines the commands that must be understood by the interface devices.
FEATURES OF 80386 (Refer class notes)
* 32-bit microprocessor having 8 general-purpose 32-bit registers; supports 8, 16 and 32-bit data types.
* Very large address space: 4 GB physical and up to 64 TB virtual memory support.
* Variable segment size from one byte to 4 gigabytes.
* 4 levels of protection, PL0 - PL3.
* Paging on demand.
* Optimized for multitasking operations.
* Virtual 8086 mode for running 8086 software in a protected and paged system.
* Pipelined instruction execution.
* TLB (address translation cache).
* High-speed numeric support via 80287 and 80387 coprocessors.
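The 4 GB and 64 TB figures in the feature list can be checked with a little arithmetic. The physical figure follows from the 32-bit address bus. The breakdown of the virtual figure below (16,384 selectable segments, each up to 4 GB) is the usual explanation from the 80386 segmentation model and is added here as background; it is not stated in the notes themselves.

```python
ADDRESS_BUS_BITS = 32     # 80386DX physical address width
SELECTOR_INDEX_BITS = 13  # descriptor index field in a segment selector
TABLE_INDICATOR_BITS = 1  # chooses GDT or LDT
OFFSET_BITS = 32          # maximum offset within one segment

GIB = 2**30
TIB = 2**40

physical_bytes = 2**ADDRESS_BUS_BITS
segments = 2**(SELECTOR_INDEX_BITS + TABLE_INDICATOR_BITS)
virtual_bytes = segments * 2**OFFSET_BITS

print(f"physical: {physical_bytes // GIB} GB")
print(f"segments per task: {segments}")
print(f"virtual: {virtual_bytes // TIB} TB")
```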
The 80386 Functional Units: (Diagram to be drawn in exam)
[Block diagram; recoverable labels: bus unit with bus control, instruction prefetcher with prefetch queue, instruction decoder with decode queue, 64-bit barrel shifter]
Figure 1.2b: Functional Block Diagram of Intel 80386 Processor
The 80386 microprocessor consists of five functional units.
> BUS UNIT: Handles communication with devices external to the microprocessor chip.
> CODE PREFETCH UNIT: Fetches instructions from memory before the microprocessor actually requests them. It has a 16-byte prefetch queue. Code prefetch requests are given a lower priority by the bus unit than requests from the execution unit.
> INSTRUCTION DECODE UNIT: Decodes the instruction prior to passing it to the execution unit for execution. It has a decode queue three instructions deep for use by the execution unit.
> EXECUTION UNIT: Consists of 8 general-purpose 32-bit registers, a 64-bit barrel shifter and the ALU. The execution unit executes each instruction received from the decoded instruction queue.
* PIO Modes: ATA includes support for PIO modes 0, 1 and 2.
* DMA Modes: ATA includes support for single word DMA modes 0, 1 and 2, and multiword DMA mode 0.
* The flat ribbon cable has 40 wires in it, and usually has three identical female connectors: one is intended for the IDE controller (or motherboard header for PCs with built-in PCI ATA controllers) and the other two are for the master and slave devices on the interface.
* Flat ribbon cables have no insulation or protection from electromagnetic interference.
* It was originally designed for very slow hard disks that transferred less than 5 MB/s, not the high-speed devices of today.
* The main issue is the length of the cable. The longer the cable, the greater the chance of data corruption due to interference on the cable and uneven signal propagation. It is therefore often recommended that the cable be kept as short as possible. According to the ATA standards, the official maximum length is 18 inches.
"Plain" ATA does not include support for enhancements such as ATAPI support for non-hard-disk IDE/ATA devices, block mode transfers, logical block addressing, Ultra DMA modes or other advanced features. Drives developed to meet this standard are no longer made, as the standard is old and obsolete. In fact, at the recommendation of the T13 Technical Committee, ATA-1 was withdrawn as an official ANSI standard in 1999. This is presumably due to its age and the large number of replacement ATA standards already published by that time.
SATA/PATA: SATA (Serial AT Attachment) is a high-speed serial version of the IDE [ATA] specification. It uses
a 4-conductor cable with two differential pairs [Tx/Rx], plus an additional three ground pins and a separate power connector.
SATA runs at 150 MB/s (1.5 Gbit/s line rate) with 250 mV signal swings. Serial ATA is not compatible with IDE [Parallel ATA,
PATA] because the connectors are different, the voltage levels are different, and the data format is different. SATA sends one
bit at a time while PATA sends 16 bits at once. SATA will not interface with the IDE bus. No cable can be made to
connect SATA with IDE. However, a converter may be purchased which translates SATA to PATA.
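The relation between the 1.5 Gbit/s line rate and the 150 MB/s data rate can be checked with a short calculation. This is a minimal sketch, assuming 8b/10b line coding (ten line bits per data byte), which the notes do not state; the function name is only illustrative.

```python
# Sketch: payload rate of a serial link that uses 8b/10b line coding.
# Assumption (not stated in the notes): every data byte is sent as 10 line bits.

LINE_BITS_PER_BYTE = 10


def sata_payload_mb_per_s(line_rate_gbit: float) -> float:
    """Return the payload bandwidth in MB/s for a line rate given in Gbit/s."""
    line_bits_per_second = line_rate_gbit * 1e9
    bytes_per_second = line_bits_per_second / LINE_BITS_PER_BYTE
    return bytes_per_second / 1e6


if __name__ == "__main__":
    print(sata_payload_mb_per_s(1.5))
```

Under this assumption the first-generation SATA figure quoted in the notes follows directly from the line rate.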
ATA Packet Interface (ATAPI): Originally, the IDE/ATA interface was designed to work only with hard
disks. CD-ROMs and tape drives used either proprietary interfaces (often implemented on sound cards), the floppy disk
interface (which is slow and cumbersome) or SCSI. In the early 1990s it became apparent that there would be
enormous advantages to using the standard IDE/ATA interface to support devices other than hard disks, due to its high
performance, relative simplicity, and universality. The intention was not to replace SCSI, of course, but rather to get rid
of the proprietary interfaces (which nobody really likes) and the slow floppy interface for tape drives.
Unfortunately, because of how the ATA command structure works, it wasn't possible to simply put non-hard-disk
devices on the IDE channel and expect them to work. Therefore, a special protocol was developed called the AT
Attachment Packet Interface, or ATAPI. The ATAPI standard is used for devices like optical, tape and removable
storage drives. It enables them to plug into the standard IDE cable used by IDE/ATA hard disks, and be configured as
master or slave, etc., just like a hard disk would be. When you see a CD-ROM or other non-hard-disk peripheral
advertised as being an "IDE device" or working with IDE, it is really using the ATAPI protocol.
Internally, however, the ATAPI protocol is not at all identical to the standard ATA (ATA-2, etc.) command set used by hard
disks. The name "packet interface" comes from the fact that commands to ATAPI devices are sent in groups
called packets. ATAPI in general is a much more complex interface than regular ATA, and in some ways resembles
SCSI more than IDE in terms of its command set and operation. (At the time it was created, SCSI was the interface of
choice for many CD-ROM and higher-end tape drives.)
A special ATAPI driver is used to communicate with ATAPI devices. This driver must be loaded into memory before
the device can be accessed (most newer operating systems support ATAPI internally and, in essence, load their own
drivers for the interface). The actual transfers over the channel use regular PIO or DMA modes, just like hard disks,
although support for the various modes differs much more widely by device than it does for hard disks. For the most
part, ATAPI devices will coexist with IDE/ATA devices and, from the user's perspective, they behave as if they are
regular IDE/ATA hard disks on the channel.
> MEMORY MANAGEMENT UNIT : When the microprocessor must address a memory location,
the MMU forms the physical memory address that is driven out onto the address bus by the bus
unit during a bus cycle.
MAY08/5M: DIFFERENCE BETWEEN 80386 SX & 80386 DX
Address Bus: Unlike the DX address bus, which consists of A31-A2 and four byte-enable lines, the SX
address bus consists of A23-A0 and BHE# (i.e., identical to the 80286).
Data Bus: Like the 80286 microprocessor, the SX has only two data paths, versus four for the
DX microprocessor.
The 80386 SX microprocessor has substantially slower throughput than the 80386 DX.
80386 SX: Single eXecution speed, 16-bit data
80386 DX: Double eXecution speed, 32-bit data
80386 operating modes: Real Mode (fast 8086), Protected Mode, Virtual 8086 Mode
Intel's 80386 DX can operate in real mode, in protected mode, or in a variation of protected mode called
virtual 8086 mode. When the processor is reset or powered up, it is initialized in real mode. Real
mode has the same base architecture as the 8086, but allows access to the 32-bit register set of the
80386 DX. Basically it functions as a fast 8086.
This mode is usually used to:
* Initialize the peripheral devices.
* Load the main part of the operating system from disk into memory.
* Load some registers.
* Enable the interrupts.
* Enter protected mode.
NOTE (VIVA): There are a few more versions of the 80386, such as the EX and SL. The 80386 EX
microprocessor is designed for embedded applications that require high integration and low power.
Key features include power management, low-voltage operation, and on-chip integration of numerous
common peripherals such as interrupt controllers, chip selects, counters and timers. The 80386 SL is
basically an 855,000-transistor version of the 386SX processor, with cache, bus, and memory
controllers, ISA compatibility and power management circuitry. It added a special system management
mode (SMM), in which the BIOS could more easily perform power management and other functions
without requiring OS support. The 386SL was the first chip specifically made for portable computers.
Difference between USB and serial port: An ordinary serial port provided on the back of a PC can connect only one
serial device. For practical reasons, two serial and two parallel ports can be provided on the back of a PC.
Therefore, up to two serial devices can be connected to the serial ports, and up to two devices designed for parallel data
transfer can be connected to the parallel ports. There is no such limitation when USB is used. A dozen USB devices,
covering a wide range of input/output devices, can easily be connected to the USB bus.
IDE, EIDE, ATA, ATAPI
IDE stands for Integrated Drive (or Device) Electronics. It is the standard according to which the IDE interface is implemented.
EIDE is Enhanced IDE. IDE was developed to interface hard disk drives. EIDE can interface hard disk, floppy disk
drive, optical disk drive and tape drive. The motherboard of a new PC has two EIDE connectors. Each is an adapter and
not a controller. The controller is on the drive itself. Commercial motherboards of PCs provide only two EIDE
connectors. From each connector a flat cable runs. A flat cable provides one channel. On each channel two EIDE
devices can be connected. The cable has two more connectors at the other end, each of which can connect an EIDE (or
IDE) device. From two EIDE connectors, two flat cables run and up to four EIDE devices can be connected. But EIDE
is capable of providing up to 4 channels. Two additional channels, if required, can be provided by adding plug-in cards
on the ISA bus. Figure 6.4 shows the EIDE interface.
[Figure 6.4: EIDE interface, showing magnetic disk drives on the EIDE channels]
The main advantage of IDE drives is cost. Because the separate controller or host adapter is eliminated and the cable
connections are simplified, IDE drives cost much less than a standard controller-and-drive combination. These drives
are also more reliable, because the controller is built into the drive. Therefore, the data separator (the converter between
the digital and analog signals on the drive) stays close to the media. Because the drive has a short analog signal path, it
is less susceptible to external noise and interference.
ATA is AT Attachment. It is a standard which specifies how to deal with hard disks over an EIDE channel. ATAPI is
ATA Packet Interface. It extends ATA to deal with optical disks and other types of devices on the EIDE channel.
ATA (ATA-1)
The first formal standard defining the AT Attachment interface was submitted to ANSI for approval in 1990. The
original IDE/ATA standard defines the following features:
* Two Hard Disks: The specification calls for a single channel in a PC, shared by two devices that are
configured as master and slave.
[Figure 6.2: AGP workstation, showing the PCI bus (132 MB/s), the ISA bus and the AGP slot]
UNIVERSAL SERIAL BUS (USB)
* It is a serial bus designed to connect several devices.
* It can interface a wide variety of peripherals such as monitor, keyboard, mouse, modem, speaker, microphone,
scanner and printer, etc.
* It can handle up to 127 devices.
* A USB cable contains four wires: two for supplying electrical power and two for transmitting data and
commands.
* Low-power devices such as keyboard, mouse, etc. can get power from the USB cable, eliminating a bulky power
supply. A device which needs a larger amount of power, for example a big loudspeaker, must have a local
power supply.
* The USB controller assigns each device an identification number and allows devices to communicate with one
another.
* It has two operating modes: low-speed and medium-speed (called full-speed in the USB specification). In low-speed
mode, the data transfer rate is 1.5 Mbps. In medium-speed mode, the data transfer rate is 12 Mbps.
* It provides three types of data transfer schemes: isochronous (or real-time), interrupt-driven and bulk data
transfer. In the isochronous data transfer scheme, there is no interruption in the flow of data, for example video or
sound. In such a case a uniform amount of data must be transferred every second, and fixed amounts of data
must be transferred in chunks on a regular schedule.
* The USB provides plug-and-play facility.
USB devices can be connected in a daisy-chain fashion, as shown in Figure 6.3. One device is connected to the USB
controller. Another device is plugged into the device which has already been connected. In this way a number of
devices can be connected in a daisy chain. Sometimes the resulting chain of cables may branch at some devices. The
system treats all the devices the same, as if they were all connected in series or directly at the PC.
[Figure 6.3: USB connections in daisy chain]
80286 Features:
(Same as 8086 except protected mode for multitasking; refer class notes)
* It is a 16-bit microprocessor.
* Its 24-bit address bus gives a 16 MByte address space.
* It has a 16-bit data bus.
* Its prefetch queue is 6 bytes.
* Protected mode operation was first introduced with the 80286.
* It implements a protection mechanism with 4 levels of privilege, PL0-PL3.
* On-chip support for multitasking.
* Its segment size is variable from 1 byte to 64 KBytes.
* The descriptor structure was first implemented with the 80286.
* It supports virtual memory of 1 GByte (refer class notes for this calculation).
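The 16 MByte and 1 GByte figures above can be reproduced with the usual textbook arithmetic. This is a sketch of that calculation, not necessarily the one given in class; it assumes the standard 80286 selector layout (13-bit descriptor index plus one table-indicator bit choosing between the global and local descriptor tables). The constant and function names are illustrative only.

```python
# Sketch: 80286 physical and virtual address space sizes.
# Assumption: a selector has a 13-bit index and a 1-bit table indicator
# (GDT or LDT), and each segment can be at most 64 KB.

ADDRESS_LINES = 24
INDEX_BITS = 13
DESCRIPTOR_TABLES = 2          # global and local descriptor tables
MAX_SEGMENT_BYTES = 64 * 1024  # 16-bit offset


def physical_space_bytes() -> int:
    """Bytes addressable through the 24-bit address bus."""
    return 2 ** ADDRESS_LINES


def virtual_space_bytes() -> int:
    """Total bytes reachable through all possible segment descriptors."""
    segments = (2 ** INDEX_BITS) * DESCRIPTOR_TABLES
    return segments * MAX_SEGMENT_BYTES


if __name__ == "__main__":
    print(physical_space_bytes() // 2 ** 20, "MB physical")
    print(virtual_space_bytes() // 2 ** 30, "GB virtual")
```

That is, 16K segments of up to 64 KBytes each give 2^30 bytes of virtual address space.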
GQ: Explain general structure of an advanced microprocessor
(Also useful for VIVA)
[Figure 1.4: General structure of advanced microprocessors. Blocks shown: bus interface unit (data, address and
control buses); prefetch unit and instruction queue; decoding unit; branch target buffer (BTB); control unit; cache;
memory management unit (MMU); internal bus; integer unit (IU) with integer register file and integer operation units;
floating-point unit (FPU) with floating-point register file and floating-point operation units; special function units
(e.g. MMX).]
[Figure 6.1: Typical PCI workstation, showing PCI devices on the PCI bus and ISA devices on the 8 MHz ISA bus]
ACCELERATED GRAPHICS PORT (AGP)
With the increase of processor speed, the speed of the host bus (or processor bus) is also increasing. When the speed of
a processor was about 400-500 MHz, the speed of the host bus was 100 MHz. Today the speed of a processor is in the
range of 700-1000 MHz and that of the processor bus is 33-400MHz. The PCI was developed when the processor bus
speed was 33 MHz. Its present speed remains the same, i.e., 33 MHz. Today a new bus standard is needed to keep up with
the processor bus speed. A new bus, the AGP bus, has been developed to operate at processor bus speed.
Though AGP is called a port, it is actually an expansion slot. It is a new 32-bit bus, specially designed for the video card.
Figure 6.2 shows AGP along with the PCI and ISA buses. Its data transfer rate is 528 MB/s or more. The video card
contains a video accelerator, which can access main memory at high speed through the AGP bus and the chipset. A
graphics and video accelerator performs image calculation. It generates and processes pixels, receives commands from
the CPU, converts graphics commands into a data stream and keeps it in the local memory. A video accelerator is
provided with local memory. The video accelerator includes a digital-to-analog converter (DAC), which receives
information from the local memory and controls the intensity of the red, blue and green electron beams.
PCI BUS: (NOV04/5M, NOV05/10M, MAY06/8M, NOV06/10M, MAY07/10M)
* PCI stands for Peripheral Component Interconnect.
* It was developed by Intel Corporation in 1992.
* It is a kind of local bus, which is directly connected to the processor bus. In other words, a local bus is an
extension of the processor bus.
* It is a widely used bus architecture.
* The PCI bus provides plug-and-play facility or auto-configuration.
* It is a 32-bit bus and can be expanded up to 64-bit if the need arises (Pentium).
* It operates at 33 MHz and its data transfer rate is about 132 MB/s.
* It is faster than the EISA and MCA buses.
* Its address and data buses are multiplexed to reduce the size of the connector. In a 32-bit bus, all the 32 lines
are multiplexed for address and data. At one moment they carry an address and at another moment they carry
data.
* In case of a 64-bit bus system, 32 address lines are multiplexed with data lines. The remaining 32 lines
carry only data.
* An error checking mechanism has been provided for all address and data transfers.
* It supports reflected-wave switching for lower power consumption. Hence it is sometimes called a GREEN MACHINE.
* It uses hidden bus arbitration, in which arbitration can be performed even if the PCI bus is not free.
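The transfer rate quoted for PCI follows from the bus width and the clock. The sketch below computes the theoretical peak only, assuming one data transfer per clock and ignoring address phases and wait states; the function name is illustrative.

```python
# Sketch: theoretical peak bandwidth of a parallel bus.
# Assumption: one data transfer of the full bus width on every clock cycle.


def peak_bandwidth_mb_per_s(width_bits: int, clock_mhz: float) -> float:
    """Return peak bandwidth in MB/s for a bus of the given width and clock."""
    bytes_per_transfer = width_bits // 8
    return bytes_per_transfer * clock_mhz


if __name__ == "__main__":
    print("32-bit PCI at 33 MHz:", peak_bandwidth_mb_per_s(32, 33), "MB/s")
    print("64-bit PCI at 33 MHz:", peak_bandwidth_mb_per_s(64, 33), "MB/s")
```

The same formula can be applied to the other buses in this chapter, although buses that need more than one clock per transfer achieve less than this peak.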
Figure 6.1 shows a workstation system with a PCI bus. A PCI slot does not accept 8 or 16-bit ISA cards. Therefore, the ISA
bus is also used in combination with the PCI bus, to interface 8 and 16-bit cards. The PCI bus provides plug-and-play
facility, which gives the user the ability to insert any hardware peripheral into the system and use it without any
configuration or setup. In other words, it provides auto-configuration, which enables the peripheral to configure itself,
rather than configuration being supplied by the user. The PCI interface contains a number of registers to hold
information about the board that allows the computer to automatically configure a PCI card. This auto-configuration
feature is called Plug-and-Play.
There is a bridge chip (chipset) between the processor and the PCI bus, which connects the PCI bus to the processor
bus. Once a host chipset is included in the system, the processor can access all available PCI peripherals. This makes
the PCI bus processor independent. When a new processor is to be used, only the chipset needs to be changed. PowerPC
and Apple's Macintosh systems also use the PCI bus. A PCI controller immediately stores data in a buffer. This allows the CPU
to go quickly to the next operation, rather than waiting for it to complete the data transfer. The PCI bus is designed to
operate without termination (unlike a SCSI bus).
There may be more than one PCI bus. A PCI-to-PCI bridge is available. The second PCI bus is connected to the first PCI
bus through the PCI-to-PCI bridge IC.
Chapter 2- Advanced Intel Microprocessors
Notes by Prof. Faruk Kazi
2.1 Protected Mode Operation of X86 Intel Family
The first processor in the 80x86 family was the 16-bit 8086, which was capable of addressing one
megabyte of memory, a significant improvement over the 8-bit machines available in the late 1970s.
Twenty address lines were provided on the processor to access the 1 MB of memory. The advanced
processors that followed the 8086, beginning with the 80286, all contained additional address lines.
The 80386, 80486, and Pentium all contain 32 address lines, giving them the ability to access 2^32 bytes, or
4 GB, of memory. This large addressing space allows the advanced Intel microprocessors to perform
many operating system chores, such as multitasking, that are difficult, or even impossible, on the
8086.
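The link between the number of address lines and the addressable memory is simply a power of two. A minimal sketch (the function and dictionary names are illustrative, not from the notes):

```python
# Sketch: addressable memory as a function of the number of address lines.
# Each address selects one byte, so n address lines reach 2**n bytes.

ADDRESS_LINES = {"8086": 20, "80286": 24, "80386/80486/Pentium": 32}


def address_space_bytes(address_lines: int) -> int:
    """Return the number of distinct byte addresses for n address lines."""
    return 2 ** address_lines


if __name__ == "__main__":
    for cpu, lines in ADDRESS_LINES.items():
        megabytes = address_space_bytes(lines) // 2 ** 20
        print(f"{cpu}: {lines} address lines, {megabytes} MB")
```

This gives 1 MB for the 8086, 16 MB for the 80286 and 4096 MB (4 GB) for the 32-bit processors.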
Beginning with the 80286, the advanced Intel microprocessors all contained the ability to operate in
two different modes of operation, real mode and protected mode. In real mode, the advanced
processors, including the Pentium, simply operate like a very fast 8086, with the associated 1 MB
memory limit. Real mode operation is automatically selected upon power-up. So a Pentium-based
PC that boots up into DOS is operating in real mode (DOS is a real mode operating system).
In protected mode, the full 4 GB of memory is available to the processor, as are special privileged
instructions and many other architectural goodies, including support for multitasking, virtual
memory addressing, memory management and protection, and control over the internal data and
instruction cache. The Windows operating system runs in protected mode to take advantage of these
improvements. Writing programs that run in protected mode requires special background
knowledge of operating systems theory.
66287.77 Hz. This output of the timer is used as a request signal for DRAM refresh in ISA systems. Upon sensing a DRAM
refresh request, the DRAM refresh logic asserts the DRQ0 signal of the master DMAC to execute the refresh cycle.
Timer 2 (Speaker Timer): The output of Timer 2 is used to drive the speaker. Timer 2 is also driven from the
1.19318 MHz timebase.
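The ISA timer frequencies can be derived from the 14.31818 MHz OSC clock, the divide-by-12 stage and the count values loaded during POST (FFFFH for Timer 0, 0012H for Timer 1). The sketch below only reproduces that arithmetic; the names are illustrative.

```python
# Sketch: ISA timer output frequencies derived from the OSC clock.
# Values taken from the notes: OSC = 14.31818 MHz, divided by 12 for the timers,
# Timer 0 count = FFFFH, Timer 1 count = 0012H.

OSC_HZ = 14_318_180
TIMER_CLOCK_HZ = OSC_HZ / 12  # about 1.19318 MHz


def timer_output_hz(count: int) -> float:
    """Output frequency of a down counter that reloads after `count` clocks."""
    return TIMER_CLOCK_HZ / count


if __name__ == "__main__":
    print(f"Timer clock: {TIMER_CLOCK_HZ / 1e6:.5f} MHz")
    print(f"Timer 0 (system tick): {timer_output_hz(0xFFFF):.2f} Hz")
    print(f"Timer 1 (DRAM refresh): {timer_output_hz(0x0012):.2f} Hz")
    print(f"Refresh request period: {1e6 / timer_output_hz(0x0012):.2f} microseconds")
```

So Timer 0 interrupts the processor roughly 18.2 times per second, and Timer 1 requests a refresh about every 15 microseconds.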
EISA BUS (32 Bit version)
* EISA stands for Extended Industry Standard Architecture.
* It was introduced in 1988.
* It uses 32-bit address lines and 32-bit data lines.
* It is suitable for multiuser systems and is faster than the ISA bus.
* ISA cards can also be inserted into EISA slots. An EISA connector contains two layers of contacts. The top
layer contacts are the contacts for additional EISA signals. The connector of an EISA bus is of the same
physical size as that of an ISA bus, so that either an ISA or an EISA card can be inserted into the EISA connector
slot.
* Though an EISA bus has a 32-bit data bus, its clock speed is only 8 MHz and its data transfer rate is 33 MB/s.
Hence, it is slower than the PCI bus, which operates at 33 MHz.
* The EISA bus is no longer used and has been replaced by the PCI bus.
* As it was expensive, it was used in servers.
MCA BUS
* MCA stands for Micro Channel Architecture.
* It was developed by IBM in 1987.
* It uses 32-bit address lines and 32-bit data lines.
* It does not accept older 8-bit and 16-bit ISA expansion cards.
* MCA and EISA bus architectures are completely incompatible.
* The MCA bus operates at 10 MHz and its data transfer rate is 80 MB/s.
* It was used in IBM's servers.
* At the time of the MCA release, there was already a large established base of products and machines that were ISA
compatible. It was also expensive. For these reasons it did not gain industry acceptance and was replaced
by the PCI bus.
VESA BUS
* VESA stands for Video Electronics Standards Association.
* It was introduced in 1992.
* It is a 32-bit local bus, directly connected to the processor bus.
* It operates at 33 MHz, and its data transfer rate is 130 MB/s.
* It is also called the VL (VESA Local) bus.
* It allows up to 3 peripherals.
* It does not provide auto-configuration and has 64-bit expansion capability.
* It was used in combination with the ISA bus.
* It contained a bus controller to arbitrate between the bus masters and the CPU.
2.2 Study of Pentium: (Architecture & Features)
Pentium Processor Features (Pl. refer class notes for EQ and detailed answer)
The on-chip memory management unit (MMU) of the Pentium is completely compatible with the Intel 386
and 486 CPUs. The Pentium processor (510/60, 567/66 MHz) contains all the features of the Intel 486
CPU. The significant features and additions are the following:
* Improved Instruction Execution Time
* Bus Cycle Pipelining
* Address Parity
* Internal Parity Checking
* Functional Redundancy Checking
* Execution Tracing
* Performance Monitoring
* System Management Mode
* Virtual Mode Extensions
The Pentium processor operates at a very high speed. It also overcomes many performance
bottlenecks associated with earlier x86 processors. This enhanced performance is achieved through
its superscalar architecture.
The important features of the Pentium architecture are:
* Wider (64-bit) Data Bus: With its 64-bit-wide external data bus (in contrast to the
Intel486 processor's 32-bit-wide external bus) the Pentium processor can handle up
to twice the data load of the Intel486 processor at the same clock frequency.
* Superscalar Architecture: Dual Instruction Pipeline.
The Intel486 processor can execute only one instruction at a time. With superscalar
execution, the Pentium processor can sometimes execute two instructions
simultaneously.
* Dynamic Branch Prediction Logic: The Pentium processor fetches the branch target
instruction before it executes the branch instruction.
* Enhanced Floating Point Unit: The Pentium processor executes individual
instructions faster through execution pipelining, which allows multiple floating-point
instructions to be executed at the same time.
* Dedicated Instruction and Data Cache: The Pentium processor has two separate
8-kilobyte (KB) caches on chip, one for instructions and one for data, which allows
the Pentium processor to fetch data and instructions from the cache simultaneously.
* Write-Back MESI Protocol in Data Cache: When data is modified, only the data in
the cache is changed. Memory data is changed only when the Pentium processor
replaces the modified data in the cache with a different set of data.
As Figure 2.1 (or Figure 2.2; either figure can be drawn for the EQ) shows, the Pentium processor is a
complex machine with many interlocking parts. At the heart of the processor are the two integer
pipelines, the U pipeline and the V pipeline. These pipelines are responsible for executing 80x86
Chapter 6- Standard for Bus Architecture and Ports
Notes by Faruk Kazi
Pl. NOTE: Printed notes are not sufficient for this chapter. Please refer class notes for better understanding and exam
oriented preparation, as usually one out-of-syllabus question comes on this topic.
The following major bus architectures have been developed over the past two decades:
1. ISA Bus (8-bit and 16-bit versions)
2. EISA Bus
3. MCA Bus
4. PCI Bus
5. VESA Bus
6. AGP
Out of these bus architectures PCI, ISA and AGP are used in a modern computer. Other bus architectures have been
replaced by the PCI bus architecture.
ISA BUS (8/16 Bit Version)
* ISA stands for Industry Standard Architecture. It is pronounced "eye-sa".
* It was introduced in 1984.
* It contains 24 address lines and 16 data lines. (8-bit: 20-bit address and 8-bit data)
* Since it is a 16-bit bus, it does not take full advantage of the 32-bit address bus and 32-bit data bus of a 32-bit
microprocessor (for example 80386 and onwards).
* It operates at 8 MHz and its data transfer rate is 8 MB/s. (8-bit: 4 MB/s)
* The ISA expansion slot is designed in such a way that an older 8-bit card can also be plugged into it. ISA
connector slots are still used in most PCs to connect 8-bit and 16-bit cards, usually with other types of
expansion slots.
* The ISA bus is also known as the AT bus because it first appeared with IBM's PC/AT.
ISA Timers: (Pl. refer class notes for ISA DMA and ISA Interrupt)
ISA systems have a free-running clock with frequency 14.31818 MHz. This clock is available on the ISA connector
with the name "OSC" and may be used in add-on cards. Its frequency (14.31818 MHz) is four times the television color
burst frequency, and therefore after division by 4 it can be used as a clock in video display adapter cards. The
divider logic on the system board divides the OSC clock by 12 to provide a driving signal with frequency 1.19318 MHz for
Timer 0, Timer 1 and Timer 2.
Timer 0 (System Timer): The system timer, Timer 0, is used as a programmable frequency source. It is a 16-bit
down counter. The programmer can load a divisor count in the count register of the timer to get the desired output
frequency. During the POST, the count register is loaded with count FFFFH (65535 decimal). After every clock input the
count in the count register is decremented by one, until the count becomes zero. When the count in the count register
becomes zero, the count register is automatically reloaded with the original count and the cycle repeats. In ISA systems the
output of Timer 0 is used to trigger the IRQ0 interrupt of the 8259 interrupt controller.
Timer 1 (Refresh Timer): The refresh timer, Timer 1, is also used as a programmable frequency source. It is a
16-bit down counter. The programmer can load a divisor count in the count register of the timer to get the desired output
frequency. During POST, the count register is loaded with count 0012H (18 decimal), giving the output frequency
A floating-point unit is included on the chip to execute instructions previously handled
by the external 80x87 math coprocessors. During execution, the U and V pipelines are capable of
executing two integer instructions at the same time, under special conditions, or one floating-point
instruction.
The Pentium communicates with the outside world via a 32-bit address bus and a 64-bit data bus.
The bus unit is capable of performing burst reads and writes of 32 bytes to memory and, through bus
cycle pipelining, allows two bus cycles to be in progress simultaneously.
An 8KB instruction cache is used to provide quick access to frequently used instructions. When an
instruction is not found in the instruction cache, it is read from the external data bus and a copy is
placed into the instruction cache for future reference. The branch target buffer and prefetch buffers
work together with the instruction cache to fetch instructions as fast as possible. The prefetch buffers
maintain a copy of the next 32 bytes of prefetched instruction code, and can be loaded from the
cache in a single clock cycle, due to the 256-bit wide data output of the instruction cache.
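The widths quoted for the Pentium bus and instruction cache fit together arithmetically. The sketch below just checks those numbers (64-bit data bus, 32-byte bursts, 256-bit cache output); the function names are illustrative.

```python
# Sketch: how the Pentium bus and cache widths quoted in the notes relate.

DATA_BUS_BITS = 64        # external data bus width
BURST_BYTES = 32          # size of one burst read or write
CACHE_OUTPUT_BITS = 256   # instruction cache output width


def transfers_per_burst() -> int:
    """Number of 64-bit data transfers needed to move one 32-byte burst."""
    return BURST_BYTES // (DATA_BUS_BITS // 8)


def prefetch_bytes_per_clock() -> int:
    """Bytes the instruction cache can deliver to a prefetch buffer per clock."""
    return CACHE_OUTPUT_BITS // 8


if __name__ == "__main__":
    print("Transfers per burst:", transfers_per_burst())
    print("Prefetch bytes per clock:", prefetch_bytes_per_clock())
```

A burst therefore takes four data transfers, and a 256-bit cache output is exactly the 32 bytes held by a prefetch buffer.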
A separate 8KB data cache stores a copy of the most frequently accessed memory data. Since
memory accesses are significantly longer than processor clock cycles, it pays to keep a copy of
memory data in a fast-reading cache. The data and instruction caches may both be enabled or disabled
by hardware or software. Both also employ a translation lookaside buffer, which
converts logical addresses into physical addresses when virtual memory is employed.
The Pentium uses a technique called branch prediction to maintain a steady flow of instructions into
the pipelines. To support branch prediction, the branch target buffer maintains a copy of instructions
in a different part of the program located at an address called the branch target.
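To illustrate how a branch history can drive prediction, here is a minimal sketch of a generic two-bit saturating-counter predictor as found in textbooks. It is an assumption for illustration only and is not claimed to be the exact state machine of the Pentium's branch target buffer; the class and function names are hypothetical.

```python
# Sketch: generic two-bit saturating-counter branch predictor (textbook scheme).
# Counter values 0-1 predict "not taken", values 2-3 predict "taken".


class TwoBitPredictor:
    def __init__(self) -> None:
        # One counter per branch address; unseen branches start at 1
        # (weakly not taken). This starting value is an arbitrary choice.
        self.counters = {}

    def predict(self, branch_address: int) -> bool:
        return self.counters.get(branch_address, 1) >= 2

    def update(self, branch_address: int, taken: bool) -> None:
        counter = self.counters.get(branch_address, 1)
        counter = min(3, counter + 1) if taken else max(0, counter - 1)
        self.counters[branch_address] = counter


def count_correct(predictor: TwoBitPredictor, address: int, outcomes) -> int:
    """Run the predictor over a list of actual outcomes and count the hits."""
    correct = 0
    for taken in outcomes:
        if predictor.predict(address) == taken:
            correct += 1
        predictor.update(address, taken)
    return correct


if __name__ == "__main__":
    # A loop branch that is taken nine times and then falls through.
    loop_outcomes = [True] * 9 + [False]
    print(count_correct(TwoBitPredictor(), 0x1000, loop_outcomes), "of 10 correct")
```

In this example only the first iteration and the loop exit are mispredicted, which is why history-based prediction works well for loops.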
Why not 80586?
In 1993, following their earlier naming conventions, Intel's new fifth-generation
chip was expected to be named the 586. However, Intel
wanted to be able to register the name of their new
processor as a trademark, and since numbers cannot be trademarked, the Pentium was
born. Since then the Pentium name has become one of the most
widely recognized trademarks in the computer world.
* Implemented using 0.5 micron, 4-layer metal CMOS technology operating at 3.3 volts.
* Packaged using a 521-pin plastic Ball Grid Array (BGA).
* High speed memory transfers (1.3 GB/sec).
* Implements the new Ultra Port Architecture (UPA) Interconnect.
Other SPARC Implementations:
The very first SPARC architecture implementations included a CPU chip with a single-ALU IU, a register
file (usually 136 registers from the start), PCs, PSR, and a few other control and status registers. The FPU,
MMU, and cache had to be realized on separate chips. The original Cypress SPARC CPU was labeled
CY7C601. Cypress followed it up with a subsequent model called HyperSPARC, or CY7C620. The
HyperSPARC is a two-issue superscalar, with over 1 million transistors, containing on-chip IU and FPU,
operating at 55.5 MHz, with a 64-bit wide data bus. The MMU and cache have to be configured outside of
the CPU chip.
TI also featured a lower-level, scalar, single-pipeline microprocessor called MicroSPARC. The
MicroSPARC is a 0.8 micron, 800,000-transistor, 5 V microprocessor, consuming 3.5 W at 50 MHz. It
contains an IU, FPU, and a modest on-chip dual cache: 4 Kbytes code and 2 Kbytes data.
Figure 2.2: Pentium Architecture
The floating-point unit of the Pentium maintains a set of floating-point registers and provides 80-bit
precision when performing high-speed math operations. This unit has been completely redesigned
from the one used inside the 80486 and is also pipelined. The floating-point unit uses hardware in the
U and V pipelines to perform the initial work during a floating-point instruction (such as fetching a
64-bit operand), and then uses its own pipeline to complete the operation. Since both integer
pipelines are used, only one floating-point instruction may be executed at a time.
Altogether, the Pentium processor includes many features designed to increase performance over
earlier 80x86 machines.
and cache organization. This organization uses the term 'set' in a different manner than in the computer
literature. In the SuperSPARC a set is a block of data of 4 Kbytes, the page size in this system. The
instruction cache contains five such sets for a total of five pages or 20 Kbytes. The data cache contains four
such sets for a total of four pages or 16 Kbytes. Based on this, the instruction cache is said to be five-way
set-associative, and the data cache four-way set-associative. Both caches use a pseudo-LRU replacement
algorithm. The line size of the instruction cache is 64 bytes, and the line size of the data cache is 32 bytes.
The instruction cache is accessed by a 128-bit fetch path, allowing the fetching of 4 instructions
simultaneously. The data cache is accessed by a 64-bit path, allowing transmission of double-precision
floating-point data in one bus cycle. The hit rate was reported by Sun to be 98 percent for the instruction
cache and 90 percent for the data cache. The cache is physically addressed. The SuperSPARC TLB has 64
entries and is fully associative.
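The SuperSPARC cache figures above can be cross-checked with simple arithmetic. This sketch uses only the numbers in the notes plus the fact that SPARC instructions are 32 bits wide; the function names are illustrative.

```python
# Sketch: cross-checking the SuperSPARC cache figures quoted in the notes.

SET_BYTES = 4 * 1024      # one "set" is one 4 Kbyte page
INSTRUCTION_BITS = 32     # every SPARC instruction is 32 bits wide


def cache_bytes(number_of_sets: int) -> int:
    """Total cache size for the given number of 4 Kbyte sets."""
    return number_of_sets * SET_BYTES


def cache_lines(number_of_sets: int, line_bytes: int) -> int:
    """Number of cache lines for the given set count and line size."""
    return cache_bytes(number_of_sets) // line_bytes


def instructions_per_fetch(fetch_path_bits: int) -> int:
    """Instructions delivered per access over a fetch path of the given width."""
    return fetch_path_bits // INSTRUCTION_BITS


if __name__ == "__main__":
    print("I-cache:", cache_bytes(5) // 1024, "KB,", cache_lines(5, 64), "lines")
    print("D-cache:", cache_bytes(4) // 1024, "KB,", cache_lines(4, 32), "lines")
    print("Instructions per 128-bit fetch:", instructions_per_fetch(128))
```

The results match the 20 Kbyte and 16 Kbyte totals and the four-instruction fetch stated in the text.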
The UltraSPARC-I Processor: MAY08/5M
One of the first implementations of the new SPARC-V9 architecture, the UltraSPARC-I retains complete
upwards compatibility with the 32-bit SPARC-V8 specification, ensuring binary compatibility with existing
applications. UltraSPARC-I not only provides 64-bit data and addressing, but adds a number of other
features to improve operating system and application performance:
* Nine-stage pipeline; can issue up to 4 instructions per cycle
* Better cache management and greatly reduced memory latency
* On-chip cache: 16K data and 16K instruction, with up to 4 MB external cache allowed
* Integrated multi-processor support with low latency to shared data
* On-chip graphics and imaging support
2.3 Superscalar Architecture and Pipelining
In the Pentium processor, integer instructions traverse a five-stage pipeline. The pipeline stages are
as follows:
PF - Prefetch
D1 - Instruction Decode
D2 - Address Generate
EX - Execute (ALU and Cache Access)
WB - Write-Back
The Pentium processor is a superscalar machine, capable of executing two instructions in parallel. Its two
five-stage pipelines operate in parallel, allowing integer instructions to execute in a single clock in
each pipeline. The pipelines are called the U and V pipes, and the process of
issuing two instructions in parallel is termed two-issue superscalar. There are two execution units in the
Pentium, and instruction pairing allows each unit to complete the execution of an instruction at
the same time.
Figure 2.3 depicts how ten instructions move through the pipeline of the Pentium processor.
Figure 5.7 (a) Taken conditional branch (b) Untaken conditional branch
The SuperSPARC implements a precise exception model. At any given time there can be up to nine
instructions in the IU pipeline and four more in the floating-point queue. Exceptions and the instructions that
caused them propagate through the IU pipeline. They are resolved in the execute stage, before their results
can modify visible state in the register file, control registers, or memory.
The FPU consists of a floating-point controller (FPC), two independent pipelines (FADD and FMUL), a
floating-point queue, and a 32-register, 32-bit floating-point file. The FP file is organized as sixteen 64-bit
doublewords to optimize double-precision performance. Each 32-bit word of the FP file can still be accessed
separately. The FP file has three read ports and two write ports. The FPC is tightly coupled to the IU
pipeline and is capable of executing a floating-point memory event and a floating-point operation in the
same cycle. The FPC also handles floating-point exceptions.
There are two types of floating-point instructions:
1. FPOPs - floating-point operations, such as add, multiply, convert and so on.
2. FPEVENTs - floating-point events, such as load/store to/from a floating-point register, load/store
to/from the floating-point status register, store floating-point queue, integer multiply, and integer divide.
FPEVENTs are executed by the FPU but do not enter the FPU queue.
The FPU pipeline consists of four stages:
1. FRD - decode and read
2. FM/FA - execute multiply or add
3. FN/FR - normalization and rounding
4. FWB - write-back to FP file
The SuperSPARC has a dual cache, a 20-Kbyte Icache and a 16-Kbyte Dcache, for a total of 36 Kbytes of on-chip primary
cache. There is support for an external, second-level 1-Mbyte cache. Figure shows the SuperSPARC MMU
Assuming that all the instructions follow the pairing rules, the five instruction pairs are
shown in Figure 2.3. Five clock cycles are needed for a pair to pass through the five pipeline stages. In clock
cycle 1, the prefetch (PF) action takes place: a pair of instructions is prefetched from the
on-chip code cache. In clock cycle 2, this first pair is issued in parallel to the U and V pipelines for
decoding (D1 stage), while another pair is being prefetched (PF stage).
In clock cycle 3, the first instruction pair moves to the decode 2 (D2) stage, while the second pair
is issued to the decode 1 (D1) stage of both pipelines and the third pair of instructions is
being fetched (PF stage). In this way, each pair of instructions proceeds to the next stage in the
pipeline with each cycle of the processor clock (PCLK). During clock cycle 5, the first instruction
pair completes its execution. In the CLK5 column, the first pair is in the last stage
(WB) of the pipeline, the second pair is in the 4th stage (EX), the third
pair is in the 3rd stage (D2), and so on. Thus, ten different instructions are
present at the various pipeline stages during a single clock cycle. After clock cycle 5, each
succeeding clock cycle completes another instruction pair.
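The stage occupancy described above can be reproduced with a few lines of Python. This is a rough sketch that assumes ideal conditions (every pair is pairable and nothing stalls); the function names are illustrative and not taken from any Intel document.

```python
STAGES = ["PF", "D1", "D2", "EX", "WB"]

def stage_of(pair_index, clock):
    """Stage occupied by instruction pair `pair_index` (0-based) during
    clock cycle `clock` (1-based), or None if it is not in the pipeline."""
    position = clock - 1 - pair_index
    if 0 <= position < len(STAGES):
        return STAGES[position]
    return None

def occupancy(clock, num_pairs):
    """Map stage name -> pair number (1-based) for one clock cycle."""
    result = {}
    for pair in range(num_pairs):
        stage = stage_of(pair, clock)
        if stage is not None:
            result[stage] = pair + 1
    return result

for clk in range(1, 8):
    print(f"CLK{clk}:", occupancy(clk, num_pairs=5))

# In clock 5 all five stages hold a pair, i.e. ten instructions in flight.
instructions_in_flight_clk5 = 2 * len(occupancy(5, num_pairs=5))
```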
Integer Pipeline Stages:
1. Prefetch (PF Stage)
There are two prefetch buffers (queues) in the Pentium, and only one of them is active at a time. The
active queue fetches instruction codes from the on-chip cache or memory until the branch
prediction logic predicts that a branch will be taken when the branch instruction reaches the
execution stage. During normal pipeline operation, the active queue supplies two consecutive
instructions to the U and V pipelines.
2. Decode 1 (D1) Stage
In the D1 stage, the instructions in the two pipelines are decoded. Besides branch prediction, the
instructions are checked for pairability.
• Instruction Pairing (Refer class notes for complete answer)
Certain rules govern instruction pairing; not all instructions are pairable. The first
limitation is imposed by the V pipeline. During normal operation, the active queue delivers the first
instruction to the U pipeline and the second to the V pipeline. However, the V pipeline has no barrel shifter and
cannot execute all types of instructions; it can execute only simple instructions. The U pipeline, an enhanced
version of the 486 pipeline, can execute any instruction in the Intel architecture. Considering all these
limitations, certain criteria are defined. Two instructions are pairable only if they satisfy the
following conditions:
✓ Both instructions in the pair must be simple.
✓ There must be no register dependencies/contention between them.
Instructions that are completely hardwired are called simple instructions. They do not
require any microcode control and execute in 1, 2, or at most 3 clock cycles. The following integer
instructions are considered simple and may be paired:
MOV reg, reg/mem/imm
MOV mem, reg/imm
data cache. The other three ALUs are used for data processing. If integer instructions are data
dependent, the first instruction can be executed in one of the upper ALUs, forwarding its result and another
operand to the lower ALU, so that both computations complete within a single cycle. After that, both results are
stored in the IU register file.
The SuperSPARC integer pipeline consists of four stages (cycles). Each cycle has two phases. The eight pipeline
phases are:
1. F0 - Instruction cache (Icache) access and TLB lookup.
2. F1 - Icache match detect. Four instructions sent to the instruction queue (Iqueue).
3. D0 - Issue one, two, or three instructions. Select register indices for load/store instructions.
4. D1 - Read register file for load/store instructions. Resource allocation for ALU instructions. Evaluate
branch target address.
5. D2 - Read register file for ALU operands. Calculate EA for load/store instructions.
6. E0 - First stage of ALU. Data cache (Dcache) access and TLB lookup. Floating-point instruction
dispatch.
7. E1 - Second stage of ALU. Dcache match detect. Load data available. Resolve exceptions.
8. WB - Write back result into the register file. Retire store into the store buffer.
The FPU pipeline is tightly coupled to the integer pipeline. An operation may be started every cycle; the
delay of most floating-point operations is three cycles. In the E0 phase, one floating-point arithmetic
instruction is selected for execution, and its operands are read during E1. Two stages of execution delay are
required for the double-precision FPU adder and FPU multiplier. The first cycle of the adder examines
exponents, aligns mantissas, and produces a result. The first cycle of the multiplier computes and adds
partial products. Independent second stages round and normalize the results of the respective units.
Forwarding paths are provided to chain the result of one FPU operation into the source of a subsequent
operation.
Figure 5.6 illustrates an example of pipelined execution of a set of ALU operations, with a load instruction
in between. Up to four instructions can be fetched during the (F0, F1) cycle, but only up to three instructions
can be issued as a group (GRP) during the D0 phase. Forwarding of results between subsequent groups of
instructions is shown by arrows in Figure 5.6. But where is Figure 5.6? Refer class notes of Faruk Kazi.
Pipelined execution of a set of instructions that includes a conditional branch is shown in Figure 5.7: a
taken branch in Figure 5.7(a) and an untaken branch in Figure 5.7(b). The original sequential instructions
are denoted S1 and S2, and the target instructions T1, T2, T3 and T4. The delay instruction, placed
after the conditional branch instruction (BNE in this example), is denoted DI, while C1 refers to the
certainty instruction stream. The SuperSPARC processor can group the compare (CMP) and the conditional
branch instruction (BNE) to speed execution. The processor statically predicts that all branches are taken.
When a control transfer instruction relative to the PC is issued, its DI is fetched concurrently. During the D1
phase the target address (TA) is computed. As the branch instruction enters phase D2, the target instruction
stream is fetched (FT). The fetch completes as the DI advances to phase D1 and the compare and branch
instructions enter phase E0. The compare instruction computes new integer condition codes in phase E0, and
the branch direction is resolved. When a branch is taken, all sequential path instructions (S1 onward, grouped
together with the DI) are invalidated (squash S1), as shown in the figure. When a branch is not taken (untaken),
the sequential path instructions (S1 onward) remain valid and the fetched target instructions (T1 onward) are discarded.
This scheme does not introduce a pipeline bubble (stall) for either branch path. The PC and prefetch PC
values for both directions are precomputed. The SuperSPARC branch implementation can execute untaken
branches somewhat more efficiently than taken branches.
ALU reg, reg/mem/imm
ALU mem, reg/imm
INC reg/mem
DEC reg/mem
PUSH reg/mem
POP reg
LEA reg, mem
JMP/CALL/Jcc near
NOP
If the two instructions are not pairable, instruction I2 is removed from the V pipeline's D1 stage and
shifted to the D1 stage of the U pipeline when I1 moves to the D2 stage of the U pipeline.
Instruction Issue Algorithm (EQ):
Decode two consecutive instructions I1 and I2
If the following are all true:
I1 is a “simple” instruction
I2 is a “simple” instruction
I1 is not a jump instruction
Destination of I1 ≠ source of I2
Destination of I1 ≠ destination of I2
(i.e. no contention)
Then,
issue I1 to U pipe and I2 to V pipe
Else,
issue I1 to U pipe
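The issue algorithm above can be written out as a short Python sketch. It is a simplified model only: the `Instr` record, its field names, and the sample instructions are hypothetical, and the full pairing rules of a real Pentium contain further cases that are not modelled here.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Instr:
    name: str
    simple: bool                     # hardwired, no microcode needed
    is_jump: bool = False
    dest: Optional[str] = None       # destination register, if any
    srcs: Tuple[str, ...] = ()       # source registers

def can_pair(i1: Instr, i2: Instr) -> bool:
    """Apply the five conditions of the issue algorithm to I1 and I2."""
    if not (i1.simple and i2.simple):
        return False
    if i1.is_jump:
        return False
    if i1.dest is not None:
        if i1.dest in i2.srcs:       # destination of I1 == source of I2
            return False
        if i1.dest == i2.dest:       # destination of I1 == destination of I2
            return False
    return True

def issue(i1: Instr, i2: Instr) -> dict:
    """Return which instruction goes to which pipe in this cycle."""
    if can_pair(i1, i2):
        return {"U": i1.name, "V": i2.name}
    return {"U": i1.name, "V": None}  # I2 waits for the next cycle

mov = Instr("MOV EAX, EBX", simple=True, dest="EAX", srcs=("EBX",))
add_ok = Instr("ADD ECX, EDX", simple=True, dest="ECX", srcs=("ECX", "EDX"))
add_dep = Instr("ADD ECX, EAX", simple=True, dest="ECX", srcs=("ECX", "EAX"))

print(issue(mov, add_ok))    # independent: both pipes are used
print(issue(mov, add_dep))   # ADD reads EAX written by MOV: U pipe only
```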
• Branch Prediction (Refer Section 2.8 for detailed answer for separate EQ)
The Pentium processor includes branch prediction logic, allowing it to avoid pipeline stalls if it
correctly predicts whether or not the branch will be taken when the branch instruction is executed.
When a branch is correctly predicted, no performance penalty is incurred. However, when the
prediction is incorrect, a three-cycle penalty is incurred if the branch is executed in the U
pipeline and a four-cycle penalty if the branch is in the V pipeline.
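To get a feel for these penalties, the average cost per branch can be estimated as (1 - prediction accuracy) x penalty. The sketch below uses the three- and four-cycle figures from the text; the 80 percent accuracy is an arbitrary assumption chosen for illustration, not a measured value.

```python
# Misprediction penalties in clock cycles, as stated in the text.
PENALTY = {"U": 3, "V": 4}

def average_branch_penalty(accuracy, pipe):
    """Expected penalty cycles per branch for a given prediction accuracy
    (0.0 to 1.0) when the branch executes in pipe "U" or "V"."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    return (1.0 - accuracy) * PENALTY[pipe]

assumed_accuracy = 0.80  # hypothetical value, for illustration only
for pipe in ("U", "V"):
    cycles = average_branch_penalty(assumed_accuracy, pipe)
    print(f"{pipe} pipe: {cycles:.2f} penalty cycles per branch on average")
```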
3. Decode 2 or D2 Stage
The D1 stage is followed by the D2 stage, in which the instructions are further decoded and the addresses
of memory-resident operands are calculated. It performs segmentation addressing. The address
calculation at this stage is much faster: the Pentium requires a single clock cycle to calculate the address
even for instructions containing a base and index addressing mode with displacement and an
immediate addressing mode.
5.3 The SuperSPARC (SPARC Implementation)
The SuperSPARC is a 3.1-million-transistor, 0.8-micron, three-layer-metal BiCMOS microprocessor in a 293-pin ceramic pin grid
array (PGA) package, manufactured by TI in cooperation with Sun Microsystems. The processor chip
contains an IU, FPU, MMU, and a dual cache (20 Kbytes code, 16 Kbytes data, total 36 Kbytes). A block
diagram of the SuperSPARC is shown in Figure 5.5.
Figure 5.5 - SuperSPARC functional block diagram
The SuperSPARC is a three-issue superscalar system. It can issue and execute three
instructions every cycle, subject to the following constraints:
1. Maximum of two integer results
2. Maximum of one data memory reference
3. Maximum of one floating-point arithmetic instruction
4. Terminate group of instructions after each control transfer
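The four grouping constraints can be expressed as a small legality check. The Python sketch below is an interpretation, not the hardware's actual grouping logic: each instruction is modelled as a set of made-up tags, and constraint 4 is read as "a control transfer must be the last instruction of its group".

```python
MAX_GROUP_SIZE = 3  # three-issue machine

def group_is_legal(group):
    """`group` is a list of sets of tags drawn from
    {"int_result", "mem_ref", "fp_arith", "control"} (hypothetical names)."""
    if len(group) > MAX_GROUP_SIZE:
        return False
    if sum("int_result" in ins for ins in group) > 2:
        return False
    if sum("mem_ref" in ins for ins in group) > 1:
        return False
    if sum("fp_arith" in ins for ins in group) > 1:
        return False
    # A control transfer terminates the group, so it may only come last.
    for ins in group[:-1]:
        if "control" in ins:
            return False
    return True

alu = {"int_result"}
store = {"mem_ref"}
fadd = {"fp_arith"}
branch = {"control"}

print(group_is_legal([alu, alu, store]))   # two integer results, one memory reference
print(group_is_legal([alu, alu, alu]))     # three integer results
print(group_is_legal([branch, alu]))       # instruction after a control transfer
print(group_is_legal([alu, fadd, branch])) # control transfer closes the group
```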
Data dependencies are resolved on the SuperSPARC by
1. Cascading dependent instructions in the same group
2. Forwarding dependent instructions in consecutive groups
The block diagram in Figure 5.5 shows the detailed structure of the IU. There are four ALUs and a
shifter. The lower leftmost ALU is used for address computations; its output is forwarded to the MMU and the
During the D2 stage, the processor also performs the segmentation protection checks required when
the processor is forming memory addresses in protected mode.
Figure 2.4: D2 Stage
4. Execution or EX Stage
Figure 2.5 illustrates the execution stage of the dual instruction pipelines.
Figure 2.5: EX Stage
SUBX (SUBXcc)        Subtract with Carry (and modify icc)
TSUBcc (TSUBccTV)    Tagged Subtract and modify icc (and Trap on overflow)
MULScc               Multiply Step and modify icc
AND (ANDcc)          And (and modify icc)
ANDN (ANDNcc)        And Not (and modify icc)
OR (ORcc)            Inclusive-Or (and modify icc)
ORN (ORNcc)          Inclusive-Or Not (and modify icc)
XOR (XORcc)          Exclusive-Or (and modify icc)
XNOR (XNORcc)        Exclusive-Nor (and modify icc)
SLL                  Shift Left Logical
SRL                  Shift Right Logical
SRA                  Shift Right Arithmetic
SETHI                Set High 22 bits of r register
SAVE                 Save caller's window
RESTORE              Restore caller's window
Bicc                 Branch on integer condition codes
FBfcc                Branch on floating-point condition codes
CBccc                Branch on coprocessor condition codes
CALL                 Call
JMPL                 Jump and Link
RETT*                Return from Trap
Ticc                 Trap on integer condition codes
RDY                  Read Y register
RDPSR*               Read Processor State Register
RDWIM*               Read Window Invalid Mask Register
RDTBR*               Read Trap Base Register
WRY                  Write Y register
WRPSR*               Write Processor State Register
WRWIM*               Write Window Invalid Mask Register
WRTBR*               Write Trap Base Register
UNIMP                Unimplemented instruction
IFLUSH               Instruction cache Flush
FPop                 Floating-point Operate
CPop                 Coprocessor Operate
* Privileged instruction.
The execution stage comprises the arithmetic logic unit (ALU) of each pipeline. The U pipeline's ALU
incorporates a barrel shifter, while the V pipeline's does not. The U pipeline
can therefore handle instructions that cannot be handled in the V pipeline. When necessary, data cache
accesses (on a cache hit) or memory accesses (on a cache miss) are performed in this stage. The U pipeline
and V pipeline can access the data cache simultaneously.
Note that both instructions enter the execution stage at the same time. If the instruction in the V
pipeline stalls, the U pipeline instruction is permitted to proceed to the write-back stage (i.e. the last
stage in the integer pipeline). However, if the U pipeline instruction stalls, the V pipeline instruction will
not proceed to the write-back stage.
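The asymmetric stall rule can be captured in a couple of lines. The following Python fragment is just a restatement of the rule above as a truth table; the function name is an assumption made for this sketch.

```python
def advance_to_wb(u_stalled, v_stalled):
    """Return (u_advances, v_advances) for a pair sitting in the EX stage.
    The U instruction may leave V behind, but V never overtakes U."""
    u_advances = not u_stalled
    v_advances = (not u_stalled) and (not v_stalled)
    return u_advances, v_advances

for u in (False, True):
    for v in (False, True):
        print(f"U stalled={u}, V stalled={v} ->", advance_to_wb(u, v))
```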
5. Write-Back or WB Stage
This is the final stage of integer instruction execution. In the WB stage, the processor state is modified
by updating the target registers and the EFLAGS register (if necessary).
(EQ) Floating Point Instruction Pipeline Stages (Pl. refer class notes)
Most floating-point instructions are issued singly to the U pipeline and cannot be paired with integer
instructions. The floating-point pipeline consists of eight stages. The first four stages are shared with the integer pipeline,
and the last four reside within the floating-point unit itself.
The 8 pipeline stages are:
Stage | Description
Prefetch (PF) | Identical to the integer prefetch stage
Instruction Decode 1 (D1) | Identical to the integer D1 stage
Instruction Decode 2 (D2) | Identical to the integer D2 stage
Execution Stage (EX) | Register read, memory read, or memory write performed as required by the instruction (to access an operand)
FP Execution 1 Stage (X1) | Information from register or memory is written into an FP register. Data is converted to floating-point format before being loaded into the floating-point unit.
FP Execution 2 Stage (X2) | Floating-point operation performed within the floating-point unit
Write FP Result (WF) | Floating-point results are rounded and the result is written to the target floating-point register.
Error Reporting (ER) | If an error is detected, the error is reported and the FPU status word is updated.
Instruction set
The SPARC architecture features the following types of instructions:
1 Load/store
2 Arithmetic/logical/shift
3 Control transfer
4 Read/write control registers
5 Floating-point operate
6 Coprocessor operate (not needed on latest highly integrated implementations)
The SPARC instruction set is summarized in Table 5.1.
Table 5.1 SPARC Instruction Set
Opcode               Name
LDSB (LDSBA*)        Load Signed Byte (from Alternate space)
LDSH (LDSHA*)        Load Signed Halfword (from Alternate space)
LDUB (LDUBA*)        Load Unsigned Byte (from Alternate space)
LDUH (LDUHA*)        Load Unsigned Halfword (from Alternate space)
LD (LDA*)            Load Word (from Alternate space)
LDD (LDDA*)          Load Doubleword (from Alternate space)
LDF                  Load Floating-point
LDDF                 Load Double Floating-point
LDFSR                Load Floating-point State Register
LDC                  Load Coprocessor
LDDC                 Load Double Coprocessor
LDCSR                Load Coprocessor State Register
STB (STBA*)          Store Byte (into Alternate space)
STH (STHA*)          Store Halfword (into Alternate space)
ST (STA*)            Store Word (into Alternate space)
STD (STDA*)          Store Doubleword (into Alternate space)
STF                  Store Floating-point
STDF                 Store Double Floating-point
STFSR                Store Floating-point State Register
STDFQ*               Store Double Floating-point Queue
STC                  Store Coprocessor
STDC                 Store Double Coprocessor
STCSR                Store Coprocessor State Register
STDCQ*               Store Double Coprocessor Queue
LDSTUB (LDSTUBA*)    Atomic Load-Store Unsigned Byte (in Alternate space)
SWAP (SWAPA*)        Swap r Register with Memory (in Alternate space)
ADD (ADDcc)          Add (and modify icc)
ADDX (ADDXcc)        Add with Carry (and modify icc)
TADDcc (TADDccTV)    Tagged Add and modify icc (and Trap on overflow)
SUB (SUBcc)          Subtract (and modify icc)
Instruction Pairing Rules for Floating Point Instructions
The rules for issuing floating-point (FP) instructions on the Pentium processor are given below.
i. FP instructions are normally issued singly to the U pipeline, as they are not paired with
integer instructions. However, a limited pairing of two FP instructions can be performed.
ii. Pairing can occur only if the first instruction, issued to the U pipeline, belongs to the set F of simple
instructions and the second instruction is the floating-point exchange instruction, FXCH. The
set F of simple instructions comprises FLD single/double precision, FLD ST(i), and all forms of
FADD, FSUB, FMUL, FDIV, FCOM, FUCOM, FABS, and FCHS.
FPU Internal Pipelining-resources
Inside the FPU, the resources are allocated to one or more instructions at a time. This permits
pipelined execution within the FPU, as the following examples show:
i. An FDIV instruction cannot be executed with any other instruction, since FDIV requires all of
the FPU resources.
ii. Similarly, two consecutive FMUL instructions cannot be executed simultaneously.
iii. An FMUL instruction can be executed in parallel with one or two FADD instructions.
iv. Three FADD instructions can be executed simultaneously.
2.4 The Register Set (Software Model /Architecture)
The Intel x86 architecture's register set is subdivided into the following groups:
1 Base architecture registers (application register set)
i General-purpose registers (8x32 bit)
ii Instruction pointer (EIP 32 bit)
iii Flags register (EFLAGS 32 bit) (DEC07/MAY08)
iv Segment registers (6x16 bit)
2 System registers (MAY08)
i Memory management registers (MAY07)
ii Control registers (NOV05/MAY06/MAY07)
3 Floating-point registers (Same as 8087 hence not discussed here)
i Data registers
ii Tag word
iii Status word
iv Instruction and data pointers
4 Debug registers (DEC07) (Refer class notes)
The base architecture and floating-point registers are accessible by application programs. The
system and debug registers are accessible only by system programs (such as the OS) running at the
highest privilege level.
Figure 5.4 - SPARC instruction formats
As can be seen in Figure 5.4, the operate instructions implement three-operand addressing, and all
formats are a single word (32 bits) long. The fields in the instructions have the following designations:
op      bits 31 and 30 in all formats. They select the instruction format: 01 for format 1 (CALL), 00 for format 2 (branches and SETHI), and 10 or 11 for format 3.
rd      bits 29 to 25 in formats 2 and 3. Selects the source register for store instructions and the destination register for all other
        instructions.
a       bit 29 in format 2. Annul bit. Changes the behavior of the instruction encountered immediately after a control transfer.
cond    bits 28 to 25 in format 2. Selects the condition code for conditional branches.
imm22   bits 21 to 0 in format 2. A 22-bit constant value used by the SETHI instruction.
disp22  bits 21 to 0 in format 2. A 22-bit sign-extended word displacement for branch instructions.
disp30  bits 29 to 0 in format 1. A 30-bit sign-extended word displacement for PC-relative call instructions.
op3     bits 24 to 19 in format 3. Opcode extension.
i       bit 13 in format 3. Selects the type of the second ALU operand for non-floating-point operate instructions:
        i = 0: the second operand is in register rs2
        i = 1: the second operand is the sign-extended simm13
asi     bits 12 to 5 in format 3. An 8-bit address space identifier generated by load and store alternate instructions.
rs1     bits 18 to 14 in format 3. Selects the first source operand register.
rs2     bits 4 to 0 in format 3. Selects the second source operand register.
simm13  bits 12 to 0 in format 3. A sign-extended 13-bit immediate value.
opf     bits 13 to 5 in format 3. Identifies a floating-point operate instruction.
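The field positions listed above translate directly into shift-and-mask operations. The Python sketch below decodes the format 3 fields of a 32-bit instruction word; the helper names are illustrative, and the sample word 0x86004002 is used on the assumption that it encodes a register-register ADD (op = 2, rd = 3, op3 = 0, rs1 = 1, rs2 = 2).

```python
def bits(word, hi, lo):
    """Extract bits hi..lo (inclusive) of a 32-bit word."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)

def sign_extend(value, width):
    """Interpret `value` as a two's-complement number of `width` bits."""
    sign = 1 << (width - 1)
    return (value ^ sign) - sign

def decode_format3(word):
    """Split a format 3 instruction word into its named fields."""
    fields = {
        "op": bits(word, 31, 30),
        "rd": bits(word, 29, 25),
        "op3": bits(word, 24, 19),
        "rs1": bits(word, 18, 14),
        "i": bits(word, 13, 13),
    }
    if fields["i"]:
        # Second operand is the sign-extended 13-bit immediate.
        fields["simm13"] = sign_extend(bits(word, 12, 0), 13)
    else:
        # Second operand is register rs2; bits 12..5 carry the asi.
        fields["asi"] = bits(word, 12, 5)
        fields["rs2"] = bits(word, 4, 0)
    return fields

print(decode_format3(0x86004002))

# Build an immediate-form word by hand: op=2, rd=1, op3=0, rs1=1, i=1, simm13=-4
imm_word = (2 << 30) | (1 << 25) | (1 << 14) | (1 << 13) | (-4 & 0x1FFF)
print(decode_format3(imm_word))
```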
5.2.4 Addressing modes (Refer class notes for examples and complete answer)
Besides the standard register direct and immediate addressing modes, there are only three addressing modes
for memory access:
1 Register indirect with displacement: register + signed 13-bit constant
2 Register indirect indexed: register1 + register2
3 PC-relative: used in CALL instructions with a 30-bit displacement
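The three modes amount to three small address calculations. The Python sketch below illustrates them under the assumption of 32-bit wrap-around arithmetic and a CALL displacement counted in 4-byte words (it is described elsewhere in these notes as a word displacement); the function names are made up for this example.

```python
MASK32 = 0xFFFFFFFF  # addresses are 32 bits wide

def ea_register_displacement(reg_value, simm13):
    """Mode 1: register + signed 13-bit constant."""
    if not -4096 <= simm13 <= 4095:
        raise ValueError("displacement does not fit in 13 signed bits")
    return (reg_value + simm13) & MASK32

def ea_register_indexed(reg1_value, reg2_value):
    """Mode 2: register1 + register2."""
    return (reg1_value + reg2_value) & MASK32

def call_target(pc, disp30):
    """Mode 3: PC-relative target of CALL; disp30 counts 4-byte words."""
    if not -(1 << 29) <= disp30 < (1 << 29):
        raise ValueError("displacement does not fit in 30 signed bits")
    return (pc + 4 * disp30) & MASK32

print(hex(ea_register_displacement(0x1000, -8)))
print(hex(ea_register_indexed(0x2000, 0x30)))
print(hex(call_target(0x2000, 4)))
```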
Base architecture registers:
The base architecture registers (or the application register set) are shown in Fig. 2.6. There are eight
32-bit general-purpose registers.
The Flags Register, shown in Fig. 2.7, is a 32-bit register called EFLAGS. The specified
bits and bit fields of EFLAGS control a number of operations and indicate the status of the
processor. The lower 16 bits of EFLAGS, called FLAGS, are used when executing 8086 or 80286
code.
• Bit 21, ID - Identification Flag: The ability of a program to set and clear the ID flag indicates
that the processor supports the CPU identification (CPUID) instruction.
• Bit 20, VIP - Virtual Interrupt Pending Flag: The VIP flag, together with the VIF flag (bit 19),
enables each application program in a multitasking environment to have a virtualized version of
the system's IF flag (bit 9). The processor reads this flag but never modifies it.
• Bit 19, VIF - Virtual Interrupt Flag: The VIF is a virtual image of the IF flag, used together with VIP.
The processor recognizes the VIF flag when either the VME or PVI bit in CR4 is set and the
IOPL is less than 3. The VME flag enables the virtual 8086 mode extensions, while the PVI flag
enables protected-mode virtual interrupts.
Figure 5.3 - Processor data types
5.2.3 Instruction formats
The SPARC instruction formats are shown in Figure 5.4. There are three basic instruction format types:
1 CALL
2 Branch instructions
3 Operate instructions (register-to-register)
• Bit 18, AC - Alignment Check: Setting the AC flag and the AM bit in Control Register 0
(CR0) enables alignment checking on memory references. An alignment check exception is
generated when a reference is made to an unaligned operand, such as a word at an odd byte
address. Alignment check exceptions are generated only in user mode (PL=3).
• Bit 17, VM - Virtual Mode: If VM is set (VM=1), the processor is placed in virtual 8086
mode, which is an emulation of the programming environment of the 8086 microprocessor.
• Bit 16, RF - Resume Flag: When RF is set (RF=1), it temporarily disables debug faults so that
an instruction can be restarted after a debug fault without immediately causing another debug
fault.
• Bit 14, NT - Nested Task: If NT is set (NT=1), the currently executing task is
nested within another task and has a valid link to the previous task in its TSS.
• Bits 13-12, IOPL - Input/Output Privilege Level: The IOPL encoded value (0, 1, 2, 3) indicates
the numerically maximum current privilege level permitted to access the I/O address space.
• Bit 11, OF - Overflow Flag: OF is set (OF=1) if the operation resulted in a signed overflow.
• Bit 10, DF - Direction Flag: DF defines whether the ESI and/or EDI registers are incremented
(post-increment) or decremented (post-decrement) during the execution of string instructions.
Post-increment occurs if DF=0; post-decrement occurs if DF=1.
• Bit 9, IF - Interrupt Enable Flag: When IF is set (IF=1), it allows recognition of external
interrupts signaled on the INTR pin. When IF=0, external interrupts on INTR are not recognized.
• Bit 8, TF - Trap Enable Flag: When TF is set (TF=1), the processor is put into single-step mode
for debugging. In this mode, the processor generates a debug exception after each instruction,
which allows a program to be inspected as it executes each instruction.
• Bit 7, SF - Sign Flag: SF is set (SF=1) if the MSB of the result is set (MSB=1), or in other
words, the result is negative. SF reflects the state of bit 7, 15, or 31 for 8-, 16-, and 32-bit operations,
respectively.
• Bit 6, ZF - Zero Flag: ZF is set (ZF=1) if all bits of the result are zero. Otherwise, ZF=0.
• Bit 4, AF - Auxiliary Carry Flag: AF is used for BCD operations. AF is set (AF=1) if the
operation resulted in a carry out of bit 3. Otherwise, AF=0.
• Bit 2, PF - Parity Flag: PF is set (PF=1) if the low-order 8 bits of the result contain an even
number of 1s (even parity). PF is reset (PF=0) if the low-order 8 bits have odd parity (odd number
of 1s).
Window Invalid Mask (WIM): Refer class notes.
Trap Base Register (TBR): The Trap Base Register (TBR) contains the address to which control is
transferred when a trap occurs.
Program Counters (PC, nPC): The 32-bit PC contains the address of the instruction currently being
executed by the IU. The nPC holds the address of the next instruction to be executed (assuming a trap
does not occur). For a delayed control transfer, the instruction that immediately follows the transfer
instruction is known as the delay instruction. This delay instruction is executed (unless the control
transfer instruction annuls it) before control is transferred to the target. During execution of the delay
instruction, the nPC points to the target of the control transfer instruction, while the PC points to the
delay instruction.
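The interplay of PC and nPC during a delayed control transfer can be traced with a very small model. The Python sketch below assumes 4-byte instructions, no traps, and no annulment; the `branches` dictionary and the addresses are invented for the example.

```python
def run(branches, start, steps):
    """Trace (PC, nPC) pairs. `branches` maps the address of a taken,
    non-annulling delayed control transfer to its target address."""
    pc, npc = start, start + 4
    trace = []
    for _ in range(steps):
        trace.append((pc, npc))
        target = branches.get(pc)
        # A taken transfer changes where nPC will point next; the
        # instruction already addressed by nPC (the delay instruction)
        # still executes first.
        next_npc = target if target is not None else npc + 4
        pc, npc = npc, next_npc
    return trace

# Hypothetical program: branch at 0x100 to 0x200, delay instruction at 0x104.
trace = run({0x100: 0x200}, start=0x100, steps=3)
for pc, npc in trace:
    print(f"PC={pc:#x}  nPC={npc:#x}")
```

In the second step of the trace the PC holds the address of the delay instruction while the nPC already holds the branch target, which matches the description above.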
Ancillary State Registers (ASRs): SPARC provides for up to 31 Ancillary State Registers (ASRs),
numbered from 1 to 31. ASRs numbered 1-15 are reserved for future use by the architecture and
should not be referenced by software. ASRs numbered 16-31 are available for implementation-dependent
uses, such as timers, counters, diagnostic registers, self-test registers, and trap-control
registers. A particular IU may choose to implement from zero to sixteen of these ASRs. The semantics
of accessing any of these ASRs is implementation-dependent. Whether a particular Ancillary State
Register is privileged or not is implementation-dependent. An ASR is read and written with the
RDASR and WRASR instructions.
IU Deferred-Trap Queue: An implementation may contain zero or more deferred-trap queues. Such a
queue contains sufficient state to implement resumable deferred traps caused by the IU.
5.2.2 Data types
SPARC architecture recognizes the following data types, as shown in Figure 5.3:
1 Integer
Signed, unsigned byte 8 bits
Signed, unsigned half word 16 bits
Signed, unsigned word 32 bits
Double word 64 bits
2 Floating-point (IEEE 754 standard)
Single-precision 32 bits
Double-precision 64 bits
Quad-precision exponent: 15 bits, mantissa: 63 bits
• Bit 0, CF - Carry Flag: CF is set (CF=1) if the operation resulted in a carry out of the MSB (the
sign bit). Otherwise, CF=0. For 8-, 16-, or 32-bit operations, CF is set according to the carry out of
bit 7, 15, or 31, respectively.
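The bit positions listed for EFLAGS can be checked with a small decoder. The Python sketch below is a plain illustration of the bit layout described in these notes, together with the parity rule for PF; the function names and the sample value 0x246 are chosen for the example only.

```python
# Single-bit flags and their bit positions, as listed in the notes.
EFLAGS_BITS = {
    "CF": 0, "PF": 2, "AF": 4, "ZF": 6, "SF": 7, "TF": 8, "IF": 9,
    "DF": 10, "OF": 11, "NT": 14, "RF": 16, "VM": 17, "AC": 18,
    "VIF": 19, "VIP": 20, "ID": 21,
}

def decode_eflags(value):
    """Return a dict of flag name -> value for a 32-bit EFLAGS image."""
    flags = {name: (value >> bit) & 1 for name, bit in EFLAGS_BITS.items()}
    flags["IOPL"] = (value >> 12) & 0b11   # two-bit field in bits 13-12
    return flags

def parity_flag(result):
    """PF = 1 when the low-order 8 bits contain an even number of 1s."""
    ones = bin(result & 0xFF).count("1")
    return 1 if ones % 2 == 0 else 0

sample = decode_eflags(0x246)
set_flags = sorted(name for name, v in sample.items() if v)
print(set_flags)
print(parity_flag(0x03), parity_flag(0x07))
```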
Segment registers
Six 16-bit segment registers CS, SS, DS, ES, FS, and GS hold segment selector values identifying
the currently addressable memory segments. The selector in CS indicates the current code segment,
the selector in SS indicates the current stack segment, and the selectors in DS, ES, FS and GS
indicate the current four data segments.
System memory management registers (Refer class notes for detailed answer) EQ
Four memory management registers are used to control segmented memory management. The Global
Descriptor Table Register (GDTR) and Interrupt Descriptor Table Register (IDTR) can be loaded
with instructions that fetch a 6-byte data item from memory. The Local Descriptor Table Register
IU control/status registers: A SPARC processor includes two types of registers: general-purpose or
"working" data registers and control/status registers. The IU's general-purpose registers are called r registers,
as discussed above, and the FPU's general-purpose registers are called f registers. IU control/status registers
include:
• Processor State Register (PSR)
• Window Invalid Mask (WIM)
• Trap Base Register (TBR)
• Program Counters (PC, nPC)
• Implementation-dependent Ancillary State Registers (ASRs)
• Implementation-dependent IU Deferred-Trap Queue
Processor State Register (PSR)
[Figure: layout of the PSR fields impl, ver, icc, reserved, EC, EF, PIL, S, PS, ET, CWP]
impl: Bits 31 through 28 are hardwired to identify an implementation or class of
implementations of the architecture. Together, the impl and ver fields define a
unique implementation or class of implementations of the architecture.
ver: Bits 27 through 24 are implementation-dependent. The ver field is either hardwired
to identify one or more particular implementations or is a readable and writable
state field whose properties are implementation-dependent.
icc: Bits 23 through 20 are the IU's condition codes. These bits are modified by the
arithmetic and logical instructions whose names end with the letters cc (e.g.,
ANDcc), and by the WRPSR instruction. The Bicc and Ticc instructions cause a
transfer of control based on the value of these bits, which are defined as follows:
bit 23 is n (negative), bit 22 is z (zero), bit 21 is v (overflow), and bit 20 is c (carry).
reserved: Bits 19 through 14 are reserved.
EC (Enable Coprocessor): Bit 13 determines whether the implementation-dependent coprocessor is enabled.
If disabled, a coprocessor instruction will trap. 1 = enabled, 0 = disabled. If an
implementation does not support a coprocessor in hardware, EC should always
read as 0.
EF (Enable Floating-point): Bit 12 determines whether the FPU is enabled. If disabled, a floating-point
instruction will trap. 1 = enabled, 0 = disabled.
PIL (Processor Interrupt Level): Bits 11 (the most significant bit) through 8 (the least significant bit) identify the
interrupt level above which the processor will accept an interrupt.
S (Supervisor): Bit 7 determines whether the processor is in supervisor or user mode.
1 = supervisor mode, 0 = user mode.
PS (Previous Supervisor): Bit 6 contains the value of the S bit at the time of the most recent trap.
ET (Enable Traps): Bit 5 determines whether traps are enabled. 1 = traps enabled, 0 = traps disabled.
CWP (Current Window Pointer): Bits 4 (the MSB) through 0 (the LSB) comprise the current window pointer, a
counter that identifies the current window into the r registers.
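The PSR field positions listed above can be turned into a small decoder. This is a sketch only: `PSR_FIELDS` and `decode_psr` are hypothetical names, and the ET field is placed at bit 5 as in the SPARC V8 layout.

```python
# (shift, width) of each PSR field; ET at bit 5 follows the SPARC V8 layout.
PSR_FIELDS = {
    "impl": (28, 4), "ver": (24, 4), "icc": (20, 4),
    "EC": (13, 1), "EF": (12, 1), "PIL": (8, 4),
    "S": (7, 1), "PS": (6, 1), "ET": (5, 1), "CWP": (0, 5),
}

def decode_psr(psr):
    """Split a 32-bit PSR value into its named fields."""
    return {name: (psr >> shift) & ((1 << width) - 1)
            for name, (shift, width) in PSR_FIELDS.items()}
```

Calling `decode_psr` on a value with bits 7 and 5 set reports supervisor mode with traps enabled, and the low five bits come back as the current window pointer.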
(LDTR) and Task Register (TR) can be loaded with instructions which take a 16-bit segment
selector as an operand. The remaining bytes of these registers are then loaded automatically by the
processor from the descriptor referenced by the operand.
Control registers (Caution: wrong figure is given in some text books)
There are five control registers (CR0, CR1, CR2, CR3, CR4). Only four of them are used by the
current implementation; register CR1 is reserved for future use.
The CR0 register contains system control flags, which control modes of operation or indicate states
of the processor. Only bits 0 to 5, 16, 18, and 29 to 31 are currently used. The other bits are reserved
for future implementation. The function of the CR0 bits is briefly explained in the following.
* Bit 31, PG-Paging Enable: When PG is set (PG=1), paging is enabled. When PG=0, paging is
disabled.
* Bit 30, CD-Cache Disable: The CD bit is used to enable or disable the on-chip cache fill
mechanism. When CD=1, the cache will not be filled on cache misses. When CD=0, cache fills
may be performed on misses.
* Bit 29, NW-Not Write-through: When NW is cleared (NW=0), it enables on-chip cache
write-through. When NW=0, all writes, including cache hits, are sent out to the pins. When NW=1,
write-through and write-invalidate cycles are disabled. The only write cycles that reach the
external bus when NW=1 are cache misses. Invalidate cycles are ignored. Write cycles that hit the
cache with NW=1 do not update main memory.
* Bit 18, AM-Alignment Mask: The AM bit allows alignment checking when set (AM=1) and
disables alignment checking when clear (AM=0).
* Bit 16, WP-Write Protect: When WP is set (WP=1), it offers write protection to user-level
pages against supervisor-level write operations. When WP is clear (WP=0), read-only user-level
pages can be written by a supervisor process.
* Bit 5, NE-Numeric Error: When NE is set (NE=1), it enables the standard mechanism for
reporting floating-point numeric errors.
* Bit 4, ET-Extension Type: The ET bit indicates support of the i387 mathematical coprocessor
instructions.
* Bit 3, TS-Task Switched: TS is set (TS=1) whenever a task switch operation is performed.
* Bit 2, EM-Emulation: When EM is set (EM=1), execution of a numeric floating-point
instruction generates the coprocessor-not-available exception. The EM bit must be set when the
processor does not have a floating-point unit.
* Bit 1, MP-Monitor coProcessor: On the i286 and i386 processors, the MP bit controls the
function of the WAIT instruction, which is used to synchronize with a coprocessor. The WAIT
The group subdivision of the window registers is:
r31 to r24 ins, contain parameters passed to the procedure by the calling procedure
r23 to r16 locals, contain local parameters of the procedure
r15 to r8 outs, contain parameters passed to the called procedure
As can be seen in Fig. 5.1, the outs registers of the calling procedure are physically the ins registers of the
called procedure. The calling procedure passes parameters to the called procedure through its outs registers,
which are the ins registers of the called procedure. The register window of the currently running procedure,
called the active window, is pointed to by the current window pointer (CWP) in the processor state register
(PSR).
The number of windows (NWINDOWS) that can be used in different versions of the SPARC ranges from 2
to 32, for a total number of general-purpose IU registers (including the eight globals) ranging from 40 to
520, respectively (16 x NWINDOWS + 8). Most current SPARC implementation microprocessors feature eight windows for a total of
136 registers. Implemented windows are contiguously numbered 0 to (NWINDOWS-1). An example of an
eight-window implementation, where the windows are circularly interconnected, is shown in Figure 5.2.
Figure 5.2 - Circular stack of window registers
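The register count follows from the overlap: each window contributes 16 registers of its own (8 locals plus 8 outs, the outs doubling as the next window's ins), and the 8 globals are added once. A minimal sketch, with `total_iu_registers` as a hypothetical helper name:

```python
def total_iu_registers(nwindows):
    """General-purpose IU registers for a given NWINDOWS (2..32)."""
    if not 2 <= nwindows <= 32:
        raise ValueError("NWINDOWS must be between 2 and 32")
    # 16 unshared registers per window (8 locals + 8 outs) plus 8 globals
    return 16 * nwindows + 8
```

With eight windows this gives the 136 registers quoted for most current implementations.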
The CPU contains a 32-bit control register called window invalid mask (WIM). Each bit of WIM, wi
(i = 0, 1, ..., 31), corresponds to one of the possible 32 windows (even if fewer than 32 are implemented). If wi = 1,
window i is considered to be invalid, and a trap condition exists. The CPU's program counter (PC) is a
separate register, not included in the general-purpose register file. SPARC implementations may have
several PCs containing addresses of subsequent instructions.
Some of the SPARC IU registers have specially designated tasks. The r0 is hardwired to a zero value, as it is
in many other RISC-type systems. A CALL instruction writes its own address into the outs register r15. The
CWP is decremented with a SAVE instruction on a procedure call and incremented by a RESTORE
instruction on a procedure return. Procedures can also be called without changing the window.
Suppose that in the case of NWINDOWS = 8 (Figure 5.2), window 0 is the currently running active
window. In this case, CWP=0. Since window 0 is the last free window, when the procedure using window
0 calls another procedure, a window overflow occurs. A new register window wraps around to overwrite the
previously used window 7, whose contents must be saved in memory by software. After a return, and
when the register file was out of windows, we have a window underflow. Software must restore previously
used register windows in this case. A window overflow trap is caused by the overflow. The overflow trap
handler uses the locals of window 7 for pointers into the memory where the overflowed window is stored.
Window 7 is invalidated during the trap handling by setting bit w7 of the WIM register.
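The interplay of CWP and WIM can be sketched as follows. This is a simplified model, assuming NWINDOWS = 8; the function names `save` and `restore` and the returned trap strings are illustrative, and the trap handler that spills or refills a window is not modelled.

```python
NWINDOWS = 8  # assumed eight-window implementation, as in Figure 5.2

def save(cwp, wim):
    """Model SAVE: decrement CWP modulo NWINDOWS, or report an overflow trap."""
    new_cwp = (cwp - 1) % NWINDOWS
    if (wim >> new_cwp) & 1:          # target window marked invalid in WIM
        return cwp, "window_overflow"
    return new_cwp, None

def restore(cwp, wim):
    """Model RESTORE: increment CWP modulo NWINDOWS, or report an underflow trap."""
    new_cwp = (cwp + 1) % NWINDOWS
    if (wim >> new_cwp) & 1:
        return cwp, "window_underflow"
    return new_cwp, None
```

With CWP = 0 and bit w7 of WIM set, `save` leaves CWP unchanged and reports an overflow, which matches the wrap-around case described for window 7.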
instruction is not needed on processors with on-chip FPU, such as the i486 and the Pentium. When
running i286 and i386 programs on the i486 and the Pentium FPU, MP should be set (MP=1). It
should be cleared (MP=0) on i486 and the Pentium.
* Bit 0, PE-Protection Enable: When PE is set (PE=1), the protection mechanism is enabled.
When PE=0, the processor operates in unprotected real (8086) mode.
The low-order 16 bits of CR0 are also known as the machine status word (MSW), for
compatibility with the i286. The MSW of the i286 has 4 bits that are used: 3 through 0 (TS, EM, MP,
and PE).
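The CR0 bit assignments described above can be collected in a small decoder. This is only a sketch; `CR0_BITS`, `decode_cr0`, and `machine_status_word` are hypothetical names, and the code inspects a plain integer rather than reading the real register.

```python
# Bit positions of the CR0 flags described in the text.
CR0_BITS = {
    "PE": 0, "MP": 1, "EM": 2, "TS": 3, "ET": 4, "NE": 5,
    "WP": 16, "AM": 18, "NW": 29, "CD": 30, "PG": 31,
}

def decode_cr0(cr0):
    """Return a dict mapping each CR0 flag name to 0 or 1."""
    return {name: (cr0 >> bit) & 1 for name, bit in CR0_BITS.items()}

def machine_status_word(cr0):
    """Low-order 16 bits of CR0, the i286-compatible MSW."""
    return cr0 & 0xFFFF
```

A value such as 0x80000001 decodes as protected mode with paging enabled (PE=1, PG=1) and all other flags clear.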
The CR2 register holds the 32-bit linear address that caused the last page fault detected.
The CR3 register contains in its upper 20 bits (bits 31 through 12) the address of the page directory
base. The CR3 is also known as the page directory base register (PDBR). The page directory
occupies a regular page frame, that is, it must be aligned to a page boundary, so the low 12 bits of
CR3 are not used as address bits. On the i486 and the Pentium (and possibly on future
implementations) the state of bits 4 and 3 is driven on the outside pins PCD and PWT respectively,
and they are used as follows:
* Bit 4, PCD-Page-level Cache Disable: When PCD=1, the on-chip cache is disabled. When
PCD=0, on-chip caching is enabled, provided it is not disabled by other means (such as cache
deactivation by a signal from an external pin).
* Bit 3, PWT-Page-level Write Transparent: The PWT bit can be used to control the write
policy of an external second-level cache. When PWT=1, it allows a write-through policy for the
external cache. If PWT=0, a write-back policy for the external cache is adopted.
Register CR4, new on the Pentium, contains bits that enable certain architectural extensions. Only
bits 6 and 4 through 0 are currently used, as follows:
* Bit 6, MCE-Machine Check Enable: Setting MCE (MCE=1) enables the machine check
exception.
* Bit 4, PSE-Page Size Extension: Setting PSE (PSE=1) enables paging with large 4-Mbyte
pages.
* Bit 3, DE-Debugging Extensions: Setting DE (DE=1) enables I/O breakpoints.
* Bit 2, TSD-Time Stamp Disable: Setting TSD (TSD=1) makes the read from time stamp
counter (RDTSC) a privileged instruction.
Chapter 5 - Sun SPARC Family
Notes by Prof. Faruk Kazi
5.1 Overview
The SPARC architecture was initiated by Sun Microsystems. Before announcing the SPARC, Sun
Microsystems produced a very popular family of M68000-based Sun workstations. One of the things that
differentiates SPARC from other RISC-type systems is that Sun does not have a history of preceding
microprocessors, RISC or CISC, so it had no software compatibility constraints to worry about. SPARC
designers could start from a clean slate. The name SPARC stands for Scalable Processor ARChitecture. The
concept of scalability, as seen by the creators of SPARC, is the wide spectrum of its possible
price/performance implementations, ranging from microcomputers to supercomputers. The scalability of the
SPARC can also be interpreted in the number of CPU registers that can be used in various versions of
products implementing the SPARC architecture. The SPARC architecture follows the Berkeley RISC
design philosophy by stressing the importance of a relatively large CPU register file and by
its register window features.
5.2 SPARC Architecture
5.2.1 Register Organization
General Purpose Register file
SPARC architecture features a comparatively large CPU register file of over 100 registers. As in the
Berkeley RISC, any procedure running on the SPARC can access only 32 registers, denoted r0 to r31. Eight
of the registers (r0 to r7) are global, accessible by all procedures. The other 24 registers are the window
registers, assigned to each procedure, with an overlap of eight registers between procedures. The 24 window
registers are subdivided into three groups of eight registers each, as illustrated in Figure 5.1 for a sequence
of three nested procedures.
Figure 5.1 Three overlapping windows and globals
* Bit 1, PVI-Protected-mode Virtual Interrupts: Setting PVI (PVI=1) enables support for a
virtual interrupt flag in protected mode. This feature can enable some programs designed for
execution at privilege level 0 to execute at privilege level 3 (applications level; least privileged).
* Bit 0, VME-Virtual-8086 Mode Extensions: Setting VME (VME=1) enables support for a
virtual interrupt flag in virtual-8086 mode. This feature may improve performance in this mode.
Fig 2.7 Control Registers
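As a quick check of the CR4 bit positions (bits 6 and 4 through 0), the following sketch lists which extensions a given CR4 value enables. `CR4_BITS` and `enabled_cr4_extensions` are hypothetical names; bit 5 is treated as unused, as in the description.

```python
# (bit position, flag name) for the CR4 bits currently in use on the Pentium.
CR4_BITS = [(0, "VME"), (1, "PVI"), (2, "TSD"), (3, "DE"), (4, "PSE"), (6, "MCE")]

def enabled_cr4_extensions(cr4):
    """Return the names of the CR4 extension flags that are set."""
    return [name for bit, name in CR4_BITS if (cr4 >> bit) & 1]
```

For instance, a CR4 value of 0x50 has bits 4 and 6 set, so 4-Mbyte pages and the machine check exception are enabled.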
Floating-point registers (Pl. refer 8087 class notes for diagram)
The on-chip FPU includes eight 80-bit data registers R0 to R7, a 16-bit tag word, a 16-bit control
register, a 16-bit status register, a 48-bit instruction pointer, and a 48-bit data pointer.
Data Registers: The data registers R0 to R7 are used by floating-point computations. These registers
can be accessed in two ways:
1. As a stack whose top is pointed to by bits 13 to 11 (TOP field) of the status register (or status
word), with instructions operating on the top one or two stack elements.
2. As a fixed register set, with instructions operating on explicitly designated registers.
A PUSH operation decrements TOP by 1 and loads a value into the new top data register. A POP
operation stores the value from the current top data register and then increments TOP by one. Like
other x86 stacks in memory, the FPU data register stack grows down towards lower-addressed
Paging in DEC Alpha AXP
The system supports pages of 8 Kbytes, 64 Kbytes, 512 Kbytes and 4 Mbytes.
Other AXP Implementations
A number of subsequent Alpha AXP implementation microprocessors have been produced by DEC.
The 21064A is a 0.5-micron, three-metal-layer, CMOS-5, 2.5-million-transistor, 431-pin PGA microprocessor, running
at frequencies ranging from 225 to 275 MHz. It has double the on-chip cache of the 21064: 16 Kbytes instruction,
16 Kbytes data, for a total of 32 Kbytes. The 21066 is a highly integrated implementation of Alpha, whose on-chip
functions include an I/O controller, IU, FPU, memory controller, graphics accelerator, instruction and data caches (8
Kbytes each, as on the 21064) and an external cache controller. The 21066 is a 0.68-micron, three-metal-layer, CMOS-4,
287-pin PGA microprocessor, running at 166 MHz. The 21068 is a lower-frequency version of the 21066 running at
66 MHz.
4.5 Applications of DEC Alpha AXP: (Pl. refer class notes for CRAY T3D MPP)
The T3D (Torus, 3-Dimensional) was Cray Research's first attempt at a massively parallel supercomputer architecture.
Launched in 1993, it also marked Cray's first use of a non-proprietary microprocessor architecture in a supercomputer.
The T3D consisted of between 32 and 2048 Processing Elements (PEs), each comprising a 150 MHz DEC Alpha
21064 (EV4) processor and either 16 or 64 MB of DRAM. PEs were grouped in pairs, or nodes, which incorporated a
6-way processor interconnect switch. These switches had a peak bandwidth of 300 MB/second in each direction and
were connected to form a three-dimensional torus network topology.
The T3D was designed to be hosted by a Cray Y-MP Model E, M90 or C90-series "front-end" system and rely on it and
its UNICOS operating system for all I/O and most system services. The T3D PEs ran a simple microkernel called
UNICOS MAX.
VIVA: The first processor of the Alpha family was called 21064 ("21" implied that Alpha was
an architecture of the 21st century, "0" denoted the processor generation, and "64" the computational
capability in bits), also code-named EV4 ("EV" was the abbreviation of "Extended VAX"
and "4" denoted the technological process generation). But then what is AXP? Check your class
registers.
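The PUSH and POP behaviour of the TOP field can be modelled in a few lines. This is a toy sketch, assuming TOP starts at 0 (its value after FPU initialization); the class name `FpuStack` and its methods are hypothetical, and the tag word and stack-fault exceptions are not modelled.

```python
class FpuStack:
    """Toy model of the eight FPU data registers R0..R7 used as a stack."""

    def __init__(self):
        self.regs = [None] * 8   # physical registers R0..R7
        self.top = 0             # 3-bit TOP field of the status word

    def push(self, value):
        self.top = (self.top - 1) & 7      # decrement TOP, wrapping 0 -> 7
        self.regs[self.top] = value

    def pop(self):
        value = self.regs[self.top]
        self.regs[self.top] = None
        self.top = (self.top + 1) & 7      # increment TOP, wrapping 7 -> 0
        return value

    def st(self, i):
        """Value of stack-relative register ST(i)."""
        return self.regs[(self.top + i) & 7]
```

After two pushes from the initial state, TOP has moved from 0 to 7 and then to 6, so the stack grows toward lower-numbered registers.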
Tag word: The tag word marks the content of each data register, R0 to R7. Each 2-bit tag, Tag(0) to
Tag(7), represents one of the R0 to R7 registers, respectively.
Status word: The 16-bit status word, located in the status register, reflects the overall state of the
FPU.
Control word: The control word provides the user with several programmable processing options.
The low-order 6 bits contain individual masks for each of the six exceptions that the FPU recognizes.
They correspond to the low-order 6 bits of the status word.
Instruction and data pointers: In case of an FPU error, the 48-bit instruction pointer contains the
address of the failing instruction and the 48-bit data pointer contains the address of its numeric
memory operand, if appropriate.
Debug registers
The x86 architecture features eight debug registers, DR0 to DR7. Only programs executing at the
highest privilege level can access these registers. Registers DR0 to DR3 specify the four linear
breakpoint addresses. The debug control register, DR7, is used to set the breakpoints.
2.5 Memory Management
The primary functions of the MMU are:
1. Translation of the virtual (logical) address into a physical (real) address.
2. Provision of the paging mechanism involved in the virtual memory organization. The paging
unit does this.
3. Provision of the segmentation mechanism by the segmentation unit.
4. Provision of memory protection. This is usually done within the paging or segmentation unit,
or both.
5. Inclusion and management of a fast-access translation lookaside buffer (TLB).
Segmentation (Refer class notes for comparison of real/protected mode segmentation)
Segmented memory is utilized by protected mode to allow tasks to have their own separate memory
spaces, which are protected from access by other tasks. A segment can be from 1 byte to 4 GB long.
Segments can start at any base address in memory, and storage overlapping between segments is
allowed.
Address Translation Mechanism
A virtual (logical) address in the x86 architecture is formed out of two components:
1. A 16-bit selector, used to determine the linear base address (the address of the first byte of the
segment) of the segment.
2. A 32-bit offset, used to address internally within a segment. The offset of a given memory
Figure 4.8 CPU external interface
Pipeline in DEC Alpha AXP
The 21064 IU and FPU pipelines are illustrated in the figure (Pl. refer class notes of Faruk Kazi). The integer pipeline is
seven stages deep. The first four stages are associated with instruction fetching, decoding, and scoreboard checking of
operands for possible data dependency. Pipeline stages 0 through 3 can be stalled. Beyond stage 3, however, all
pipeline stages advance every cycle. Most ALU operations complete in cycle 4 (A1). Primary cache accesses complete
in cycle 6 (WR), so cache delay is three cycles. The instruction stream is based on autonomous prefetching in cycles 0
and 1, with the final resolution of cache hit not occurring until cycle 5. The prefetcher includes a branch history table
and a subroutine return stack. The architecture provides a convention for compilers to predict branch decisions and
destination addresses, including those for register indirect jumps. The penalty for a branch mispredict is four cycles.
Figure 4.9 Pipeline in DEC ALPHA AXP
The FPU pipeline is 10 stages deep. It is identical to and mostly shared with the IU pipeline in stages 0 through 3. All
operations, 32- and 64-bit, have the same timing (except divide). Divide is handled by a non-pipelined, single-bit-per-
cycle, dedicated divide unit. In cycle 4 (F1), the register file data is formatted to fraction, exponent, and sign. In the first-
stage adder, the exponent difference is calculated and a 3X multiplicand is generated for multiplies. In addition, a predictive
leading 1 or 0 detector using the input operands is initiated for use in result normalization. In cycles 5 (F2) and 6 (F3),
for add/subtract, alignment or shift is performed. For both single- and double-precision multiplication, the multiply is
done in a radix-8 pipelined array multiplier. In cycles 7 (F4) and 8 (F5), the final addition and rounding are performed
in parallel, and the final result is selected and driven back to the register file in cycle 9 (FWR). With an allowed bypass
of the register write data, floating-point delay is six cycles.
Pairability in DEC ALPHA AXP
The super scalar dual issue of instructions is restricted to the following pairs:
* Any load/store in parallel with any operate
* An integer operate in parallel with a floating-point operate
* A floating-point operate and a floating-point branch
* An integer operate and an integer branch
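The pairing rules above amount to a small lookup. The sketch below encodes them with illustrative instruction-class names (`load_store`, `int_operate`, `fp_operate`, `int_branch`, `fp_branch`) that are assumptions made for this example; resource conflicts and data dependencies, which also affect issue, are ignored.

```python
OPERATES = {"int_operate", "fp_operate"}
OTHER_PAIRS = [
    {"int_operate", "fp_operate"},
    {"fp_operate", "fp_branch"},
    {"int_operate", "int_branch"},
]

def can_dual_issue(first, second):
    """True if the two instruction classes form one of the listed pairs."""
    pair = {first, second}
    if "load_store" in pair:
        return bool(pair & OPERATES)   # load/store with any operate
    return pair in OTHER_PAIRS
```

Under these rules a load/store pairs with either kind of operate, but two load/stores, or an integer operate with a floating-point branch, do not pair.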
location address is its distance in bytes from the segment base address. (Called EIP for
instruction fetch.)
The addressing of a memory operand within a segment is illustrated in Fig 2.8. When a 32-bit x86
processor is reset or powered up, it is initialized in real mode. Real mode has the same architecture
as the 8086 but allows access to the 32-bit register set. The default operand size in real mode is 16
bits. However, the regular mode of operation of a 32-bit x86 architecture processor is protected
virtual address mode (PVAM) or, simply, protected mode. In protected mode the 16-bit selector is
used to specify an index in an OS-defined table. The table contains the 32-bit base address of a given
segment. Adding the base address obtained from the table to the offset forms the physical address (if
PG=0).
Segment Descriptors (BQ)
Each segment has a segment descriptor associated with it. The segment descriptor is 8 bytes long and
contains the following information about the segment:
1. A 32-bit segment base linear address.
2. A 20-bit segment limit, specifying the size of the segment.
3. Access rights byte, containing protection mechanism information.
4. Control bits.
The segment limit field of the segment descriptor has only 20 bits (and not 32) because the segment
size does not have byte granularity for all segment sizes. Segments have a byte granularity (i.e.,
segments may differ in size by a single byte) for segment sizes up to 1 Mbyte. For segments above
1 Mbyte and up to 4 GB, there is a page granularity, that is, segment sizes may differ by a page size,
which is 4 Kbytes (2^12 bytes).
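The effect of the 20-bit limit and its granularity can be sketched as below. The function names are hypothetical; `page_granular` stands for the granularity control bit (named G in Intel documentation), and expand-down segments and access-rights checks are left out.

```python
def segment_size_bytes(limit20, page_granular):
    """Segment size implied by the 20-bit limit field."""
    units = (limit20 & 0xFFFFF) + 1          # limit holds the last valid unit
    return units * 4096 if page_granular else units

def linear_address(base, offset, limit20, page_granular):
    """Base + offset, after checking the offset against the segment size."""
    if offset >= segment_size_bytes(limit20, page_granular):
        raise ValueError("offset beyond segment limit")
    return (base + offset) & 0xFFFFFFFF
```

With the limit field at its maximum of 0xFFFFF, byte granularity gives a 1-Mbyte segment and page granularity gives the full 4 GB.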
Figure 4.7 Block diagram of the 21064.
IBOX - issues instructions (two at a time), maintains the integer pipeline, and performs PC calculations. It
decodes two instructions in parallel and checks availability of resources. There is no out-of-order issue. There
will be an issue if appropriate resources are available. If resources are not available for the first instruction,
there will be no issue. The IBOX contains branch prediction logic, instruction translation buffers (ITBs),
interrupt logic, and performance counters (issues, non-issues, total cycles, pipe dry, pipe freeze, cache misses).
There are two ITBs:
1. Small-page ITB, eight-entry, fully associative; contains recently used instruction stream page table entries
(PTEs) for 8-KByte pages.
2. Large-page ITB, four-entry, fully associative, for 512 x 8 KByte pages (4 Mbytes).
EBOX - integer execution unit. It contains a 64-bit adder, logic box, barrel shifter, bypassers, integer
multiplier, and a 32 x 64 IRF with four read and two write ports.
ABOX - address generation unit. It contains the address translation datapath, load silo, data cache interface,
internal processor registers (IPRs), the BIU, and a 32-entry, fully associative data translation buffer
(DTB). The load silo is a memory reference pipeline that can accept a new load or store instruction every
cycle until a data cache fill is required. The BIU has an external 128-bit data bus.
FBOX - the FPU. It contains, in addition to the operation units, a 32 x 64 floating-point register file (FRF) and a
user-accessible floating-point control register (FPCR).
ICACHE - instruction cache, 8 KBytes, direct-mapped, physical-addressed, 32 bytes/line.
DCACHE - data cache, 8 KBytes, direct-mapped, physical-addressed, 32 bytes/line, write-through, read
allocate.
An example of an external interface interconnection of the 21064 is shown in Fig. 4.8. It is designed to directly support an
off-chip secondary cache (also called backup cache, or B-cache) that can range from 128 Kbytes to 8 Mbytes and can
be constructed from ordinary SRAM. The interface is designed to allow all cache policy decisions to be controlled by
logic external to the CPU chip. There are 3 control bits associated with each B-cache line: valid (V), shared (S), and
dirty (D). The chip completes a B-cache read as long as valid is true. A write is processed by the CPU only if valid is
true and shared is false. When a write is performed, the dirty bit is set to true. In all other cases, the chip defers to an
external state machine to complete the transaction.
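The read and write rules for the three B-cache control bits can be written out as a decision function. This is a sketch of the rules as stated above, not of the actual 21064 bus protocol; the function name `bcache_action` and the returned labels are hypothetical.

```python
def bcache_action(op, valid, shared, dirty):
    """Return (who completes the access, resulting dirty bit)."""
    if op == "read":
        who = "cpu_completes" if valid else "external_logic"
        return who, dirty
    if op == "write":
        if valid and not shared:
            return "cpu_completes", True     # a performed write sets dirty
        return "external_logic", dirty
    raise ValueError("op must be 'read' or 'write'")
```

A write to a valid but shared line, for example, is not completed by the chip and is handed to the external state machine.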
Fig 2.9 Segment Selector
Fig 2.10 Segment Descriptor format (Pl. refer class notes for TYPE field)
Segment descriptors are stored in descriptor tables in memory. The descriptor tables define all the
segments which are used in the system. There are three types of descriptor tables:
1. The Global Descriptor Table (GDT) contains descriptors that are possibly available to all
tasks in the system.
2. The Local Descriptor Table (LDT) contains descriptors associated with a given task. Each
task may have a separate LDT. A segment cannot be accessed by a task if its segment
descriptor does not exist in either the current LDT or the GDT.
3. The Interrupt Descriptor Table (IDT) contains the descriptors for up to 256 ISRs. The IDT is
basically the interrupt vector table.
All of the above descriptor tables are variable-length memory arrays. They can range in size between
8 bytes (a single descriptor) and 64 Kbytes (upper limit: 8192 descriptors, 8 bytes each). The
upper 13 bits of a selector are used as an index into the descriptor table. Each of the above tables has
a register, located in the CPU, associated with it and pointing to it:
1. GDT register (GDTR), 48 bits, associated with the GDT
2. LDT register (LDTR), 16 bits, associated with the LDT
3. IDT register (IDTR), 48 bits, associated with the IDT
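How a selector indexes a descriptor table can be shown with a short sketch. The 13-bit index comes from the text; the table-indicator bit (bit 2) and the two-bit requested privilege level (bits 1 and 0) are the standard x86 selector layout and are stated here as an assumption, since the text does not spell them out. The function names are hypothetical.

```python
def decode_selector(selector):
    """Split a 16-bit segment selector into index, table, and RPL."""
    return {
        "index": (selector >> 3) & 0x1FFF,           # upper 13 bits
        "table": "LDT" if selector & 0x4 else "GDT",  # table indicator bit
        "rpl": selector & 0x3,                        # requested privilege level
    }

def descriptor_offset(selector):
    """Byte offset of the 8-byte descriptor inside its table."""
    return ((selector >> 3) & 0x1FFF) * 8
```

The largest index, 8191, gives an offset of 65528, so the last descriptor ends exactly at the 64-Kbyte upper limit of a table.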
There are two types of floating-point compare instructions:
1. CMPGxx Compare G_floating, operands: Fa.rg, Fb.rg, Fc.wq, where xx may take the options:
   EQ  equal
   LE  less than or equal
   LT  less than
   for a total of three instructions.
2. CMPTxx Compare T_floating, operands: Fa.rx, Fb.rx, Fc.wq, where xx may take the options EQ, LE, LT as
for CMPGxx and another option, UN (unordered).
In all the floating-point compare instructions the operands in Fa and Fb are compared. If the specified relationship is
true, a nonzero floating-point value (0.5 for CMPGxx, 2.0 for CMPTxx) is written into Fc. Otherwise, a true zero is
written into Fc.
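The result convention of the T_floating compares can be mimicked with ordinary Python floats. This is only an illustration of the "2.0 if true, else true zero" rule; `cmptxx` is a hypothetical helper, UN is modelled as "either operand is a NaN", and arithmetic traps are not modelled.

```python
import math
import operator

_RELATIONS = {"EQ": operator.eq, "LE": operator.le, "LT": operator.lt}

def cmptxx(option, fa, fb):
    """Value written to Fc by a CMPTxx-style compare of Fa and Fb."""
    if option == "UN":
        holds = math.isnan(fa) or math.isnan(fb)   # unordered operands
    else:
        holds = _RELATIONS[option](fa, fb)
    return 2.0 if holds else 0.0
```

A CMPGxx-style compare would follow the same pattern with 0.5 as the nonzero result and without the UN option.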
Privileged architecture library (PAL) code
The PAL code provides a mechanism to implement the following functions without resorting to a microcoded
machine:
* Instructions that require complex sequencing as an atomic operation
* Instructions that require VAX-style interlocked memory accesses
* Privileged instructions
* Memory management control
* Context swapping
* Interrupt and exception dispatching
* Power-up initialization and booting
* Console functions
* Emulation of instructions with no hardware support
PAL functions are implemented in the Alpha architecture in standard machine code, resident in main memory. The PAL code
environment differs from the normal environment in the following ways:
1. There is complete control of the machine state, allowing all functions of the machine to be controlled.
2. Interrupts are disabled, allowing the system to provide multi-instruction sequences as atomic operations.
3. Implementation-specific hardware functions are enabled, allowing access to low-level system hardware.
4. Instruction stream memory management traps are prevented, allowing PAL code to implement memory
management functions such as translation buffer (TB) fills.
4.4 Alpha AXP Implementations
The first implementation of the Alpha architecture is the 21064 microprocessor chip. The 21064 is fabricated in a 0.75-
micron CMOS technology utilizing three levels of metallization and optimized for 3.3-V operation. The die size is 16.8
X 13.9 mm and it contains 1.68 million transistors. Its initial operating frequency is within the 150 to 200 MHz
interval. Power dissipation at 200 MHz is 30 W. The processor is a two-issue super scalar. The chip includes a dual
cache: 8-Kbyte instruction and 8-Kbyte data. It also includes a four-entry, 32-bytes-per-entry write buffer, a pipelined
64-bit integer execution unit with a 32 X 64 register file, and a pipelined FPU with a 32 X 64 register file of its own. The
pin interface includes integral support for an external secondary cache of 128 Kbytes up to 8 Mbytes. The internal
caches are direct-mapped. All caches have 32 bytes/line. The internal data cache is a write-through, read-allocate,
physical cache. The chip package is a 431-pin pin grid array (PGA) with 140 pins dedicated to power supply voltage
and ground.
The LGDT, LLDT and LIDT instructions load the base and the limit of the GDT, LDT, and IDT,
respectively, into the appropriate register: GDTR, LDTR and IDTR, respectively. The SGDT, SLDT,
and SIDT instructions store the contents of GDTR, LDTR, and IDTR, respectively, into a specified
destination address.
Paging Mechanism
The paging mechanism is optional. It is enabled when the PG bit (bit 31) in CR0 is set. Paging works
beneath segmentation and is transparent to the segmentation. The standard page size of the x86 is
4 KB = 2^12 bytes, but can be extended to 4 Mbytes for the Pentium processor. The x86 uses two levels of
tables to translate the linear address into a physical address. There are three components to the
paging mechanism: the page directory, the page tables, and the page frame. A uniform size for all the
elements simplifies memory allocation and reallocation schemes, since there is no problem with
memory fragmentation. Figure 2.11 illustrates the paging mechanism.
Fig 2.11 Paging Mechanism (4 KB Page)
(For 4 MB Page Pl. Refer Class Notes)
The control register CR2 is the page fault linear address register. It holds the 32-bit linear address
which caused the last page fault detected. Register CR3 points to the base of the page directory.
The page directory is 4 KBytes long and allows up to 1024 PDEs. Each PDE, shown in Fig. 2.12 (a),
contains the address of the next level tables, the page tables, and information about the page table
pointed to. The upper 10 bits of the linear address (the directory field; bits 31 to 22) are used as an
index to select the correct PDE.
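The field split used by two-level translation with 4 KB pages can be sketched as below. This is a minimal illustration, not processor code; the function name split_linear_address is hypothetical. It assumes the standard x86 layout: directory field in bits 31 to 22, table field in bits 21 to 12, byte offset in bits 11 to 0.

```python
def split_linear_address(linear: int) -> tuple[int, int, int]:
    """Split a 32-bit linear address into (directory, table, offset) for 4 KB pages."""
    linear &= 0xFFFFFFFF                # keep only 32 bits
    directory = (linear >> 22) & 0x3FF  # bits 31..22 select one of 1024 PDEs
    table = (linear >> 12) & 0x3FF      # bits 21..12 select one of 1024 PTEs
    offset = linear & 0xFFF             # bits 11..0 are the byte offset in the page
    return directory, table, offset


print(split_linear_address(0x00403025))
```

Each 10-bit index addresses 1024 entries of 4 bytes, which is why the page directory and every page table occupy exactly one 4 KB page.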
SUBF   Subtract F_floating
SUBG   Subtract G_floating
SUBS   Subtract S_floating
SUBT   Subtract T_floating
MULF   Multiply F_floating
MULG   Multiply G_floating
MULS   Multiply S_floating
MULT   Multiply T_floating
DIVF   Divide F_floating
DIVG   Divide G_floating
DIVS   Divide S_floating
DIVT   Divide T_floating
The four arithmetic operations are performed as follows:
Addition:        Fc ← Fav + Fbv
Subtraction:     Fc ← Fav - Fbv
Multiplication:  Fc ← Fav * Fbv
Division:        Fc ← Fav / Fbv
The floating-point convert operations are summarized in Table 4.9. In all the above operations Fb contains the datum to
be converted and Fc is the destination where the converted datum is stored.
Table 4.9 Floating-point convert operations
Mnemonic   Operation
CVTLQ      Convert longword to quadword
CVTQL      Convert quadword to longword
CVTDG      Convert D to G_floating
CVTGD      Convert G to D_floating
CVTGF      Convert G to F_floating
CVTGQ      Convert G_floating to quadword
CVTQF      Convert quadword to F_floating
CVTQG      Convert quadword to G_floating
CVTQS      Convert quadword to S_floating
CVTQT      Convert quadword to T_floating
CVTTQ      Convert T_floating to quadword
CVTTS      Convert T to S_floating
CVTST      Convert S to T_floating
Each page table is 4 Kbytes and holds up to 1024 PTEs. A PTE, shown in Fig. 2.12 (b), contains the
starting address of the page frame and access information about the page. Address bits 21 to 12 (table
field) are used as an index to select one of the 1024 PTEs. Bits 31 to 12 of the PTE contain the upper
20 bits of the page frame base. The lower 12 bits of the PTE are identical to those of the PDE. The
function of the bits currently in use is as follows:
Bit 6, D-Dirty Bit: D is set before a write to an address covered by the PTE occurs. It is undefined
for the PDE.
Bit 5, A-Accessed Bit: A is set before a read or write access occurs to an address covered by the
entry.
Bit 4, PCD-Page Cache Disable: The PCD bit controls the on-chip cacheability of the page. When (PCD)
= 0, on-chip caching is enabled. When (PCD) = 1, on-chip caching is disabled.
Bit 3, PWT-Page Write-Through: The PWT bit controls the page write policy. (PWT) = 1 defines a
write-through policy for the current page. (PWT) = 0 allows the possibility of write-back. Bits PCD
and PWT are also bits 4 and 3 of CR3. The state of the PCD and PWT bits is driven out on the
PCD and PWT pins during a memory access.
Bit 2, User/Supervisor: Bit U/S differentiates between lower-privilege user mode and higher-privilege
supervisor mode.
Bit 1, Read/Write: Bit R/W establishes read and write protection privileges for the page.
Bit 0, Present: The page is present in physical memory.
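The bit assignments listed above can be checked with a small decoder. This is an illustrative sketch only; decode_pte and the PTE_FLAGS table are hypothetical names, and the sample entry value is made up.

```python
# Bit position -> flag name, following the list of PTE/PDE bits in the notes.
PTE_FLAGS = {0: "P", 1: "R/W", 2: "U/S", 3: "PWT", 4: "PCD", 5: "A", 6: "D"}


def decode_pte(pte: int) -> dict:
    """Return the flag bits and the page frame base held in a 32-bit PTE."""
    fields = {name: (pte >> bit) & 1 for bit, name in PTE_FLAGS.items()}
    fields["frame_base"] = pte & 0xFFFFF000  # bits 31..12: upper 20 bits of the frame base
    return fields


entry = decode_pte(0x12345067)  # made-up example value
print(hex(entry["frame_base"]), entry["P"], entry["D"])
```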
LDT    Load T_floating
STF    Store F_floating
STG    Store G_floating
STS    Store S_floating
STT    Store T_floating
Floating-point control instructions
There are six floating-point branch instructions. These instructions test the value of a floating-point register Fa, and
conditionally change the value of the PC. The instructions are summarized in Table 4.7.
Table 4.7 Floating-point Branch Instructions
Mnemonic   Operation
FBEQ       Floating branch equal
FBGE       Floating branch > or equal
FBGT       Floating branch >
FBLE       Floating branch < or equal
) or by an immediate value of a literal #b. The number of bytes to extract is specified in the
function code. Remaining bytes are filled with zeros.
3. Byte Insert: The byte insert instruction INSxx has seven options:
INSxx Option   Insert
BL             Byte low
WL             Word low
LL             Longword low
QL             Quadword low
WH             Word high
LH             Longword high
QH             Quadword high
INSxL and INSxH shift bytes from register Ra and insert them into a field of zeros, storing the result in Rc.
Register Rbv<2:0> or a literal #b selects the shift amount (0 to 7), and the function code selects the maximum field
width: 1, 2, 4 or 8 bytes.
4. Byte mask: The byte mask instruction MSKxx has the same seven options as the byte insert and byte extract instructions
(MSKBL, MSKWL, MSKLL, MSKQL, MSKWH, MSKLH, MSKQH). MSKxL and MSKxH set selected bytes of
register Ra to zero, storing the result in register Rc. Register Rbv<2:0> or a literal selects the starting position of
the field of zero bytes, and the function code selects the maximum width: 1, 2, 4 or 8 bytes.
5. Zero bytes: This group contains two instructions:
ZAP      Zero bytes
ZAPNOT   Zero bytes not
These instructions set selected bytes of register Ra to zero and store the result in register Rc. Register Rbv or
a literal selects the bytes to be zeroed; bit 0 of Rbv corresponds to byte 0 of Ra, bit 1 of Rbv corresponds to byte 1 of
Ra, and so on.
The CMPBGE and ZAP instructions allow very fast implementations of the C language string routines, among
other uses.
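The behaviour of CMPBGE, ZAP and ZAPNOT can be modelled on 64-bit integers as in the sketch below. The Python function names are hypothetical stand-ins for the instructions, and the string example is only an illustration of why these operations help C string routines: comparing a zero register against eight data bytes flags every byte that is zero, so a terminating NUL is found in one step.

```python
MASK64 = (1 << 64) - 1


def byte_of(value: int, i: int) -> int:
    """Return byte i (0 = least significant) of a 64-bit value."""
    return (value >> (8 * i)) & 0xFF


def cmpbge(rav: int, rbv: int) -> int:
    """Eight parallel unsigned byte compares; bit i is set if Rav byte i >= Rbv byte i."""
    result = 0
    for i in range(8):
        if byte_of(rav, i) >= byte_of(rbv, i):
            result |= 1 << i
    return result


def zap(rav: int, rbv: int) -> int:
    """Zero byte i of Rav wherever bit i of Rbv is 1."""
    result = rav & MASK64
    for i in range(8):
        if (rbv >> i) & 1:
            result &= ~(0xFF << (8 * i)) & MASK64
    return result


def zapnot(rav: int, rbv: int) -> int:
    """Zero byte i of Rav wherever bit i of Rbv is 0 (keep the selected bytes)."""
    return zap(rav, ~rbv & 0xFF)


# Eight string bytes loaded as one little-endian quadword; byte 3 is the NUL.
word = int.from_bytes(b"abc\0defg", "little")
print(bin(cmpbge(0, word)))  # the set bit marks the position of the zero byte
```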
Floating-point load and store instructions
The floating-point load and store instructions (a total of eight) move floating-point data between memory and floating-point
registers. The instructions are summarized in Table 4.6.
In the load instructions Fa is the destination register. The memory address is computed by adding a sign-extended
displacement to Rbv for both load and store instructions. Register Fa serves as the source register in the store
instructions.
Table 4.6 Floating-point Load and Store Instructions
Mnemonic   Operation
LDF        Load F_floating
LDG        Load G_floating
LDS        Load S_floating
2.6 Pentium Cache Organization (EQ NOV06/20Marks)
Cache Revision (Pl. refer notes of COA for basic understanding)
Cache architecture
A cache system (the cache and cache controller) can be interfaced to the processor in two ways.
i. Look-through cache architecture.
ii. Look-aside cache architecture.
Cache Coherency
It is required that the data present in the cache and main memory are exact duplicates of each other,
i.e. they should always provide the processor or any other bus master the latest copy of the
information. This is known as maintaining cache coherency or consistency. When the processor
updates any information in the cache, the same change should be made in the main memory before
any other bus master tries to access it. The cache controller achieves this by following one of
the following write policies.
i. Write through
ii. Buffered write through
iii. Write back
First and Second Level Caches
The 80486 and the Pentium processors have an internal cache. This cache is known as the level 1 cache
or L1 cache. The L1 cache provides the processor with the most often used code and data and is
usually small in size (4 KB to 64 KB). A second level (L2) cache can also be added in between the L1
cache and the main memory. L2 caches are usually larger in size compared to L1 caches (64 KB to
512 KB). L1 caches are thus subsets of the L2 caches. The use of two cache levels substantially
increases the hit rate. This is because when a cache miss occurs in the L1 cache, the L2 cache can
provide the required information in zero wait states.
Pentium Processor Cache: General Features
* The Pentium processor has separate code and data caches, each of 8 KB.
* The cache line size is 32 bytes.
* Since the Pentium processor has a data bus of 8 bytes (64 bits), it requires a burst of four
consecutive transfers to fill the cache line of 32 bytes.
* Each cache is organized as two-way set-associative.
* The data cache can be configured as a write-through or a write-back cache on a line-by-line
basis and it follows the MESI protocol.
* The code cache does not require a write policy, as it is a read-only cache.
* Each cache has a dedicated translation look-aside buffer (TLB) to translate linear addresses to
physical addresses.
* The data cache tags are triple ported to support two data transfers and a snoop cycle in the
same clock.
* The code cache tags are also triple ported to support snooping and split line access
Table 4.5 Logical and Shift Instructions
Mnemonic   Operation
AND        Logical AND
BIC        Logical AND with complement
BIS        Logical OR
EQV        Logical equivalence (XORNOT)
ORNOT      Logical OR with complement
XOR        Exclusive OR
CMOVxx     Conditional move integer
SLL        Shift left logical
SRA        Shift right arithmetic
SRL        Shift right logical
The logical operations are performed as follows:
Mnemonic   Operation
AND        Rc ← Rav AND Rbv
BIC        Rc ← Rav AND (NOT Rbv)
BIS        Rc ← Rav OR Rbv
ORNOT      Rc ← Rav OR (NOT Rbv)
XOR        Rc ← Rav XOR Rbv
EQV        Rc ← Rav XOR (NOT Rbv)
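The six register transfers above can be tried out on 64-bit values with the short sketch below. The function alpha_logical is a hypothetical helper written for these notes, not part of any Alpha toolchain; Python integers are unbounded, so the complement is masked to 64 bits explicitly.

```python
MASK64 = (1 << 64) - 1


def alpha_logical(op: str, rav: int, rbv: int) -> int:
    """Compute Rc for one of the Alpha logical operations on 64-bit operands."""
    rav &= MASK64
    rbv &= MASK64
    not_rbv = ~rbv & MASK64  # NOT Rbv, truncated to 64 bits
    results = {
        "AND": rav & rbv,
        "BIC": rav & not_rbv,
        "BIS": rav | rbv,
        "ORNOT": rav | not_rbv,
        "XOR": rav ^ rbv,
        "EQV": rav ^ not_rbv,
    }
    return results[op]


print(bin(alpha_logical("BIC", 0b1100, 0b1010)))  # clears the bits of Rav that are set in Rbv
```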
Byte-manipulation instructions
The Alpha architecture features five types of byte-manipulation instruction (a total of 24 instructions) within registers.
This is an unusual feature compared to other systems. The byte-manipulation instructions can be used with the load and
store unaligned instructions to manipulate short unaligned strings of bytes. All of the byte-manipulation instructions
have the same operand specifications as the logical and shift instructions.
1. Compare byte: This group contains a single instruction CMPBGE (compare byte greater or equal). It does eight
parallel unsigned byte comparisons between corresponding bytes of Rav and Rbv, storing the eight results in the
low 8 bits of Rc.
2. Extract byte: This group features seven options for the EXTxx instruction:
EXTxx Option   Extract
BL             Byte low
WL             Word low
LL             Longword low
QL             Quadword low
WH             Word high
simultaneously.
* Individual pages in the main memory can be configured as cacheable or non-cacheable by
software or hardware.
* The cache can be enabled or disabled by software or hardware.
Handling of Memory Read Transaction
The sequence of events when one of the execution units requests a memory read from the L1 data
cache, or the prefetcher requests one from the L1 code cache, is given below.
1. If an L1 cache hit occurs, the request is immediately fulfilled.
2. If an L1 cache miss occurs, the Pentium processor must perform a cache line fill from external
memory or the L2 cache.
3. The L2 cache now detects the memory read bus cycle and checks its directory to see if it has
a copy of the requested information.
4. If it is an L2 cache hit, the L2 cache asserts KEN# to indicate that the address is cacheable. The L2
cache then supplies the data in a burst of four consecutive 64-bit transfers (cache line fill
operation).
5. If it is an L2 cache miss, the L2 cache passes the bus cycle to the system bus. The NCA (non-cacheable
address) logic decodes this address to determine if the address is cacheable or not.
6. If the address is non-cacheable, the NCA logic deasserts KEN#. Thus, the bus cycle is not
converted into a cache line fill and instead a single-transfer bus cycle is run to fetch the
requested information directly from memory.
7. If the address is cacheable, the NCA logic asserts KEN#. Since the processor had also
asserted CACHE#, the read cycle is converted into a cache line fill for both the L2 and L1
caches.
8. The L2 cache copies the first quadword (8 bytes) of data into its cache line fill buffer and
simultaneously forwards it to the processor while asserting BRDY# to indicate that valid
data is present on the processor's data bus.
9. The Pentium processor reads the first quadword and stores it in its cache line-fill buffer.
This quadword is the one which was requested, and hence it is immediately passed on to the
requester (execution unit or prefetcher). The next three quadwords are awaited and as they
are received, they are stored in the buffer. When the entire line is received, both the L1 and L2
caches copy the line from their respective line-fill buffers into their respective caches.
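The decision points in the steps above can be summarised as a small function. This is a simplified sketch under stated assumptions: the three boolean inputs stand in for the L1 directory lookup, the L2 directory lookup and the NCA logic decode, and the name memory_read and its return fields are hypothetical.

```python
def memory_read(l1_hit: bool, l2_hit: bool, cacheable: bool) -> dict:
    """Return where the data comes from and how many 64-bit bus transfers are run."""
    if l1_hit:
        # Step 1: request fulfilled from L1, no external bus cycle.
        return {"source": "L1", "transfers": 0, "line_fill": False}
    if l2_hit:
        # Step 4: L2 asserts KEN# and supplies the line in a burst of four.
        return {"source": "L2", "transfers": 4, "line_fill": True}
    if cacheable:
        # Step 7: NCA logic asserts KEN#, so the read becomes a line fill.
        return {"source": "memory", "transfers": 4, "line_fill": True}
    # Step 6: KEN# deasserted, a single-transfer bus cycle is run.
    return {"source": "memory", "transfers": 1, "line_fill": False}


print(memory_read(l1_hit=False, l2_hit=False, cacheable=True))
```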
The Write Once Policy & the MESI Cache Consistency Model
The MESI (Modified-Exclusive-Shared-Invalid) protocol provides a method to maintain cache
coherency. The MESI protocol is used only for the data cache, and the SI protocol for the code cache.
Each line in the data cache can be in one of the four MESI states, as indicated by two bits stored
along with the tag address.
Modified
It indicates that this line in the cache has been updated or modified due to a write hit in the cache.
In this case, when the cache subsystem snoops the system bus and finds a snoop hit, it should
write the modified line back to memory (update the memory).
3. Describe subroutine and coroutine returns: By marking each branch and jump as 'call', 'return' or 'neither',
the architecture provides an implementation with enough information to maintain a small stack of likely subroutine
return addresses.
The conditional move instructions and the branching hints eliminate some branches and speed up the remaining ones
without compromising multiple instruction issue.
Integer arithmetic instructions
The integer arithmetic instructions of Alpha (a total of 20) are listed in Table 4.4.
Table 4.4 Integer arithmetic instructions
Mnemonic   Operation
ADDL       Add longword
ADDQ       Add quadword
S4ADDL     Scaled add longword by 4
S8ADDL     Scaled add longword by 8
S4ADDQ     Scaled add quadword by 4
S8ADDQ     Scaled add quadword by 8
CMPEQ      Compare signed quadword =
CMPLT      Compare signed quadword <
CMPLE      Compare signed quadword <=

disp<15:14>   Mnemonic        Predicted Target<15:0>   Prediction Stack Action
00            JMP             PC + {4*disp<13:0>}      -
01            JSR             PC + {4*disp<13:0>}      Push PC
10            RET             Prediction stack         Pop
11            JSR_COROUTINE   Prediction stack         Pop, push PC
Table 4.3 Integer control instructions
Mnemonic        Operation
BEQ             Branch if Rav = 0
BGE             Branch if Rav ≥ 0
BGT             Branch if Rav > 0
BLBC            Branch if Ra LSB = 0
BLBS            Branch if Ra LSB = 1
BLE             Branch if Rav ≤ 0
BLT             Branch if Rav < 0
BNE             Branch if Rav ≠ 0
BR              Unconditional branch
BSR             Branch to subroutine
JMP             Jump
JSR             Jump to subroutine
RET             Return from subroutine
JSR_COROUTINE   Jump to subroutine return
The Alpha architecture specifies three types of branching hints in instructions:
1. Architected static branch prediction rule: Forward conditional branches are predicted not-taken, and backward
ones taken. To the extent that compilers and hardware implementations follow this rule, programs can run more
quickly with little hardware cost. This hint does not preclude doing dynamic branch prediction in an
implementation, but it may reduce the need to do so.
2. Describe computed jump targets: Otherwise unused instruction bits are defined to give the low bits of the most
likely target, using the same target calculations as unconditional branches. The 14 bits provided are enough to
specify the instruction's offset within a page, which is often enough to start a first-level instruction cache fetch
many cycles before the actual target value is known.
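The first two hints can be illustrated with a short sketch. Both function names are hypothetical. The static rule depends only on the sign of the branch displacement; the jump hint splits the 16-bit displacement field of a jump into a 2-bit type code and a 14-bit target hint, with the encoding 00 = JMP, 01 = JSR, 10 = RET, 11 = JSR_COROUTINE assumed here.

```python
def predict_taken(branch_disp: int) -> bool:
    """Static rule: backward branches (negative displacement) are predicted taken."""
    return branch_disp < 0


JUMP_KINDS = ["JMP", "JSR", "RET", "JSR_COROUTINE"]  # indexed by disp<15:14>


def jump_hint(disp16: int) -> tuple[str, int]:
    """Split a 16-bit jump displacement field into (jump kind, 14-bit target hint)."""
    kind = JUMP_KINDS[(disp16 >> 14) & 0b11]
    hint = disp16 & 0x3FFF  # low 14 bits: likely target, in instruction (longword) units
    return kind, hint


print(predict_taken(-12), predict_taken(8))
print(jump_hint(0x4010))
```

A loop-closing branch has a negative displacement, so under the static rule it is predicted taken on every iteration and mispredicted only once, at loop exit.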
When the prefetcher issues an instruction request, the code cache is checked to see if a copy is
available. Assuming a cache miss, a cache line-fill request is made to the bus unit, i.e. a cache line is
brought in from L2 cache/memory.
The 32-bit address given by the processor is interpreted as shown in the figure.
TAG/PAGE (20 Bits) | INDEX (7 Bits) | BYTE (5 Bits)
Figure 2.14 - Interpretation of 32-Bit Address
The <access type> is a letter denoting the operand access type. It may be one of the following:
Access type   Meaning
a             Used in address calculation:
              "al" means scale by 4 (longwords)
              "aq" means scale by 8 (quadwords)
              "ab" means the operand is in byte units
i             The operand is an immediate literal
r             The operand is read only
m             The operand is both read and written
w             The operand is write only
The <data type> is a letter denoting the data type of the operand. It may be one of the following:
Data type   Meaning
b           Byte
f           F_floating
g           G_floating
l           Longword
q           Quadword
s           S_floating
t           T_floating
w           Word
x           Specified by the instruction
Integer load and store instructions
The memory access integer load and store instructions (a total of 12) are summarized in Table 4.2.
Mnemonic   Operation
LDA        Load address
LDAH       Load address high
LDL        Load sign-extended longword
LDQ        Load quadword
LDQ_U      Load quadword unaligned
LDL_L      Load sign-extended longword locked
LDQ_L      Load quadword locked
STL        Store longword
STQ        Store quadword
STL_C      Store longword conditional
STQ_C      Store quadword conditional
Operate Instruction Format
Opcode | Ra | Rb | SBZ | 0 | Function | Rc     (register operand, bit 12 = 0)
Opcode | Ra | LIT | 1 | Function | Rc          (literal operand, bit 12 = 1)
Floating-Point Operate Instruction Format
Opcode | Fa | Fb | Function | Fc
PAL Code Instruction Format
Opcode | PAL code Function
Figure 4.6 Alpha instruction formats (check corrections)
Addressing modes: (Pl. refer class notes for examples)
The Alpha architecture features four simple addressing modes, as practiced on RISC-type systems:
Register
Immediate
Register indirect with displacement
PC-relative
Instruction set:
The Alpha architecture features the following types of instructions:
1. Integer load and store
2. Integer control
3. Integer arithmetic
4. Logical and shift
5. Byte manipulation
6. Floating-point load and store
7. Floating-point control
8. Floating-point operate
9. Miscellaneous
An instruction operand is specified by the following attributes:
<name>.<access type><data type>
The <name> may be any of the registers Ra, Rb, Rc, Fa, Fb, Fc, or:
disp   The displacement field of the instruction
fnc    The PAL function field of the instruction
#b     An integer literal operand in the Rb field of the instruction
Split-Line Access: NOV06/5Marks (Pl. Refer Class Notes)
In a CISC processor, instructions are of variable length. In the Pentium processor the smallest
instruction is one byte while the maximum legal length is 15 bytes. A code cache miss always results
in a 32-byte cache line fill, if it is a cacheable address. Multi-byte instructions may straddle two
sequential lines stored in the code cache. When the prefetcher determines that the instruction
straddles two lines, it would have to perform two sequential cache accesses, which would
hamper performance. For this reason the Pentium processor incorporates a split line access, which
allows the upper half of one line and the lower half of the next line to be accessed in one cycle.
When a split line access is made the bytes must be rotated so that they are in proper order. In order
for the split access to work efficiently, instruction boundaries within the cache line need to be
defined. When an instruction is decoded the first time, the length of the instruction is fed back to the
cache. Each code cache entry marks instruction boundaries within the line so that, if necessary, split
line accesses can be performed.
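Whether an instruction straddles two code cache lines follows directly from its start address and length, as in the sketch below. The function name is hypothetical; the 32-byte line size and the 1 to 15 byte instruction length come from the text above.

```python
LINE_SIZE = 32  # bytes per code cache line


def needs_split_line_access(address: int, length: int) -> bool:
    """True if an instruction of `length` bytes starting at `address` crosses a line boundary."""
    if not 1 <= length <= 15:
        raise ValueError("Pentium instruction length must be 1 to 15 bytes")
    offset_in_line = address % LINE_SIZE
    return offset_in_line + length > LINE_SIZE


print(needs_split_line_access(0x101E, 5))   # starts at byte 30 of a line, 5 bytes long
print(needs_split_line_access(0x1000, 15))  # starts at byte 0 of a line
```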
LIT is an 8-bit literal value from 0 to 255.
All instruction formats have a 6-bit (bits <31:26>) major opcode field. Any unused register field (5 bits) of an
instruction (Ra, Rb, Fa, or Fb) must be set to a value of 31 (11111 binary). The five instruction formats are described
below.
1. Memory instruction format
This format is used to transfer information between registers and memory, to load an effective address, and for
subroutine jumps. The Memory_disp field is a byte offset. It is sign-extended and added to the contents of register
Rb to form a virtual address. The virtual address is used as a memory load/store address or a result value,
depending on the specific instruction. For some instructions, the Memory_disp field is replaced by the Function
field. It serves as an extension of the opcode that designates a set of miscellaneous instructions.
2. Branch instruction format
The branch format is used for conditional branch instructions (in which case the Ra field contains the condition
encoding) and for PC-relative subroutine jumps. As each instruction is decoded, the PC value is advanced to point
to the next sequential instruction. The new PC value is referred to as the updated PC.
The Branch_disp field is treated as a longword offset. It is shifted left 2 bits (to address a longword boundary),
sign-extended to 64 bits, and added to the updated PC value to form the target virtual address.
3. Operate instruction format
The operate instruction format is used for instructions that perform integer register-to-register operations. Fields Ra
and Rb specify source operands. Field Rc specifies the destination. The Function field is an extension of the
opcode. If bit 12 is 0, Rb specifies a source register operand. If bit 12 is 1, an 8-bit zero-extended literal constant is
formed by bits <20:13> of the instruction. The literal is interpreted as a positive integer between 0 and 255 and is
zero-extended to 64 bits.
4. Floating-point operate instruction format
This format is used for instructions that perform floating-point register-to-register operations. The Fa and Fb fields
specify floating-point register source operands. The Fc field specifies the destination. Floating-point convert
instructions use a subset of the floating-point operate format and perform register-to-register conversion operations.
The Fb operand specifies the source, the Fa field must be F31 (i.e. zero) and Fc is naturally the destination.
5. PAL code instruction format
The Privileged Architecture Library (PAL) code format is used to specify extended processor functions. The 26-bit
PAL code Function field specifies the particular PAL code operation. The source and destination operands for PAL
code instructions are supplied in fixed registers that are specified in the individual instruction descriptions. (VIVA:
An opcode of zero and PAL code function of zero specify the HALT instruction.)
Memory Instruction Format
Opcode <31:26> | Ra <25:21> | Rb <20:16> | Memory_disp <15:0>
Opcode <31:26> | Ra <25:21> | Rb <20:16> | Function <15:0>
Branch Instruction Format
Opcode <31:26> | Ra <25:21> | Branch_disp <20:0>
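The field positions described for the memory, branch and operate formats can be exercised with the decoder sketched below. The helper names (sign_extend, decode_fields, branch_target) are hypothetical, the instruction words in the example are made-up bit patterns rather than real programs, and only the fields whose bit positions are stated in these notes are extracted.

```python
MASK64 = (1 << 64) - 1


def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's-complement number."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)


def decode_fields(instr: int) -> dict:
    """Extract the fields of a 32-bit Alpha instruction word (all formats at once)."""
    return {
        "opcode": (instr >> 26) & 0x3F,                    # bits <31:26>
        "ra": (instr >> 21) & 0x1F,                        # bits <25:21>
        "rb": (instr >> 16) & 0x1F,                        # bits <20:16>
        "mem_disp": sign_extend(instr & 0xFFFF, 16),       # memory format, bits <15:0>
        "branch_disp": sign_extend(instr & 0x1FFFFF, 21),  # branch format, bits <20:0>
        "is_literal": bool((instr >> 12) & 1),             # operate format, bit 12
        "literal": (instr >> 13) & 0xFF,                   # operate format, bits <20:13>
    }


def branch_target(pc: int, instr: int) -> int:
    """Target = updated PC + (sign-extended Branch_disp shifted left 2 bits)."""
    updated_pc = pc + 4
    return (updated_pc + 4 * decode_fields(instr)["branch_disp"]) & MASK64


# Branch-format word with Branch_disp = -1: it branches back to itself.
word = (0x39 << 26) | (1 << 21) | 0x1FFFFF
print(hex(branch_target(0x1000, word)))
```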
The Data Cache
The data cache is 8 KB and is organized as a 2-way set associative cache. The two ways are called
way 0 and way 1. Each cache line is 32 bytes.
* Total size of cache = 8 KB
* Size of each way = 4 KB
* Cache line size = 32 bytes
* Number of lines per way = 128
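The figures in the list follow from the cache size, associativity and line size. A short worked calculation is sketched below; the variable names are illustrative only.

```python
CACHE_BYTES = 8 * 1024  # total data cache size
WAYS = 2                # 2-way set associative
LINE_BYTES = 32         # cache line size

way_bytes = CACHE_BYTES // WAYS           # size of each way
lines_per_way = way_bytes // LINE_BYTES   # directory entries per way

offset_bits = LINE_BYTES.bit_length() - 1    # log2(32): byte within the line
index_bits = lines_per_way.bit_length() - 1  # log2(128): line within the way
tag_bits = 32 - index_bits - offset_bits     # remaining address bits form the tag

print(way_bytes, lines_per_way, offset_bits, index_bits, tag_bits)
```

The 20-bit tag equals the 4 KB page number, which is why the notes use "tag" and "page" interchangeably for address bits A31 to A12.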
Operation of Internal Data Cache
Each 4 KB cache way is divided into 128 lines. There are thus correspondingly 128 entries in each
tag directory. Each directory entry stores a 20-bit tag (page) address A[31:12]. The entry also consists of
two state bits (to indicate one of four states: M, E, S or I) and a parity bit P.
Each data cache line is 32 bytes or eight double words. Parity is generated for each byte within a data
cache line as shown in the figure.
When a byte of information is read from the data cache, the parity is checked. On detecting a parity
error, an internal parity error is signaled to external logic through the IERR# (Internal Error) output.
The processor also generates a special shutdown bus cycle and stops execution. The data cache itself
is single ported, but the cache directories are triple ported to allow access from both pipelines (U and
V) and to allow an external snoop simultaneously. The interpretation of the 32-bit address as viewed by
the internal data cache controller is shown below.
Page (A31:A12) | Line (A11:A5) | Bank (A4:A2) | A1:A0
* A31:A12 identify the page in which the target location resides.
* A11:A5 identify the line that the target address occupies within the page (hence its position
in the cache way).
* A4:A2 identify which double word within the line the target address occupies (hence
the internal data cache bank in which the addressed data resides).
* A1:A0 are not used (don't care).
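The address interpretation above amounts to three shift-and-mask operations, sketched below. The function name decode_cache_address is hypothetical and the sample address is arbitrary.

```python
def decode_cache_address(addr: int) -> dict:
    """Split a 32-bit address the way the internal data cache controller views it."""
    return {
        "page": (addr >> 12) & 0xFFFFF,  # A31:A12, compared against the directory tag
        "line": (addr >> 5) & 0x7F,      # A11:A5, selects one of 128 lines in each way
        "bank": (addr >> 2) & 0x7,       # A4:A2, selects one of 8 double-word banks
    }


print(decode_cache_address(0x12345678))
```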
S_floating Register Format
Quadword Integer Floating-Register Format
Figure 4.5 Integer data storage in memory and FPU registers.
Instruction formats
All Alpha instructions are 32 bits long. The Alpha architecture features five basic instruction formats, illustrated in
Fig. 4.6. The notation used in Fig. 4.6 is the following:
Ra, Rb, Rc are integer register operands.
Fa, Fb, Fc are floating-point register operands.
disp is a displacement, added to the value in Rb to form a virtual address.
SBZ means Should Be Zero.
Figure - Internal Data Cache Structure (Way 0 and Way 1, each with bank select logic; directory ports for
Pipeline "U", Pipeline "V" and Snoop)
F_floating Datum, D_floating Datum (memory layouts)
Figure 4.2 Memory storage of VAX floating-point formats
F_floating Register Format, G_floating Register Format (register layouts)
Figure 4.3 Register format of VAX floating-point formats
IEEE floating-point formats
The IEEE standard features the single-precision 32-bit (S_floating) and the double-precision 64-bit (T_floating)
formats. Their memory and floating register storage are illustrated in Fig. 4.4. Location A may be anywhere in
memory, but for better performance it should be naturally aligned, as in the case of the VAX formats.
Longword and quadword integers may be stored in FPU registers. Their storage in memory and FPU registers is
illustrated in Fig. 4.5.
Request From U and V Pipelines
When the U and V pipelines request information and the operands are present in different banks
(banks are decided by A[4:2]), then both the operands can be accessed simultaneously. If the U and
V pipelines simultaneously require data from the same bank, then a bank conflict occurs, since each
data cache bank is single ported. In such cases, the U-pipe access is completed first and the V-pipe is
made to wait. Thus, a bank conflict incurs a one-clock penalty on the V-pipe instruction.
If the requests from both the U and V pipes happen to be cache misses, then two cache line-fill
requests are made to the bus controller at the same time. The U-pipe read occurs first, followed by the
V-pipe read.
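The bank conflict rule can be stated in a few lines. This sketch follows the rule exactly as given in these notes (same A[4:2] in both pipes means a conflict); the function names are hypothetical.

```python
def bank(addr: int) -> int:
    """Data cache bank selected by address bits A[4:2]."""
    return (addr >> 2) & 0x7


def v_pipe_penalty(u_addr: int, v_addr: int) -> int:
    """Clocks the V-pipe access is delayed: 1 on a bank conflict, otherwise 0."""
    return 1 if bank(u_addr) == bank(v_addr) else 0


print(v_pipe_penalty(0x1000, 0x1004))  # banks 0 and 1: both accesses proceed together
print(v_pipe_penalty(0x1000, 0x1020))  # both in bank 0: the V-pipe waits one clock
```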
EQ: Anatomy of a Read Hit and Miss - Cache Line Fill Algorithm
The steps involved when the execution unit of the processor requests a memory read cycle are:
1. The directory entries at the index given by A[11:5] in both the cache ways are checked.
2. If in both ways the state bits are 'I', then no further checking is required and it is a cache
miss.
3. If in any one of the ways the state bits are other than 'I' (i.e. M, E or S), then the page number
or tag given by A[31:12] is compared with the corresponding tag entry.
4. If these tags match, then it means that the required line is available (cache read hit). The
required data is then made available immediately.
5. If not, then step (3) is repeated for the other way.
6. Also, each directory entry has a parity bit that is generated and written each time the
directory is updated. This parity is checked each time the tag entry is accessed to check for a
hit or miss. If a parity error is detected, IERR# is asserted, a shutdown special cycle is run
and the processor stops.
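Steps 1 to 5 can be sketched as a lookup over two tag directories. This is a simplified model under stated assumptions: each directory entry is a (tag, state) pair, the state is a single letter M, E, S or I, parity checking (step 6) is left out, and the names lookup and directory are hypothetical.

```python
def lookup(directory, addr: int):
    """Return the way (0 or 1) that hits for addr, or None on a cache miss.

    directory[way][index] holds a (tag, state) pair for each of the 128 lines.
    """
    index = (addr >> 5) & 0x7F    # A[11:5] selects the directory entry
    tag = (addr >> 12) & 0xFFFFF  # A[31:12] is compared with the stored tag
    for way in (0, 1):
        entry_tag, state = directory[way][index]
        if state != "I" and entry_tag == tag:  # valid line with a matching tag
            return way
    return None


# Two ways of 128 invalid entries, then one valid line placed in way 1.
directory = [[(0, "I")] * 128 for _ in range(2)]
directory[1][51] = (0x12345, "E")
print(lookup(directory, 0x12345678), lookup(directory, 0x22345678))
```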
2.7 Pentium Bus Operation
The Pentium processor supports a number of different types of bus cycles. The three main types are:
i. Single Transfer Cycles
ii. Burst Cycles
iii. Special Cycles
Architecture of Alpha AXP:
Data types - The Alpha architecture recognizes the following data types.
Integer data types
* Byte, 8 bits. Basic addressable unit.
* Word, 16 bits. Two contiguous bytes starting on an arbitrary byte boundary. A word is addressed by the
address of its least significant byte (the byte that contains bit zero).
* Longword, 32 bits. Four contiguous bytes starting on an arbitrary byte boundary. A longword is addressed by
the address of its least significant byte.
* Quadword, 64 bits. Eight contiguous bytes starting on an arbitrary byte boundary. A quadword is addressed
by the address of its least significant byte. In a 64-bit integer, bit 63 is the sign bit.
The Alpha integer data types are shown in Fig. 4.1.
Figure 4.1 Alpha integer data types
Although words, longwords, and quadwords may be stored at any byte address, better performance can be achieved if
they are naturally aligned. That is, longwords are stored at addresses divisible by 4 (low-order 2 bits of the address are
zero), and quadwords are stored at addresses divisible by 8 (low-order 3 bits are zero).
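Natural alignment means the address is a multiple of the datum size, which is the same as the low-order address bits being zero. A minimal check is sketched below; the SIZES table and the function name are illustrative.

```python
SIZES = {"byte": 1, "word": 2, "longword": 4, "quadword": 8}  # sizes in bytes


def is_naturally_aligned(addr: int, kind: str) -> bool:
    """True if addr is a multiple of the size of the given Alpha integer data type."""
    size = SIZES[kind]
    return addr & (size - 1) == 0  # low-order log2(size) bits must be zero


print(is_naturally_aligned(0x1004, "longword"), is_naturally_aligned(0x1004, "quadword"))
```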
Floating-point data types
The Alpha architecture features two groups of floating-point data types.
* VAX floating-point formats, for backward compatibility with the VAX software
* IEEE standard (IEEE 754) floating-point formats, as practiced in practically all other modern systems
VAX floating-point formats
The Alpha architecture features three VAX floating-point formats:
1. F_floating, 32 bits
2. G_floating, 64 bits (11-bit exponent)
3. D_floating, 64 bits (8-bit exponent)
The memory storage of the above formats is illustrated in Fig. 4.2, and their CPU floating register storage is illustrated in
Fig. 4.3. Although A may be any address in memory, better performance will be attained if A is naturally aligned
(divisible by 4 for F, and by 8 for G and D formats). The main difference between the G and D formats is that G has an
11-bit exponent field, while D has an 8-bit one. Thus, the G format has a much higher range. The D format is not fully
supported on the Alpha, and no D floating-point arithmetic operations are provided. For VAX compatibility, exact D
floating-point arithmetic may be provided by software emulation.
Single Transfer Bus Cycles Using the Bus Cycle Definition Signals
The table shows all of the bus cycles initiated by the Pentium processor except the special cycles.
M/IO# | D/C# | W/R# | CACHE# | KEN# | Cycle Description | Transfers
0 | 0 | 0 | 1 | x | Interrupt acknowledge (2 locked cycles) | 1 transfer each cycle
0 | 0 | 1 | 1 | x | Special cycle | 1
0 | 1 | 0 | 1 | x | I/O read, 32 bits or less, non-cacheable | 1
0 | 1 | 1 | 1 | x | I/O write, 32 bits or less, non-cacheable | 1
1 | 0 | 0 | 1 | x | Code read, 64 bits, non-cacheable | 1
1 | 0 | 0 | 0 | 0 | Code read, 256-bit burst line fill | 4
1 | 1 | 0 | 1 | x | Memory read, 64 bits or less, non-cacheable | 1
1 | 1 | 0 | 0 | 0 | Memory read, 256-bit burst line fill | 4
1 | 1 | 1 | 1 | x | Memory write, 64 bits or less, non-cacheable | 1
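External logic identifies the cycle type from the three bus cycle definition pins. The sketch below shows that lookup in software form; the dictionary and function names are hypothetical, and only the combinations listed in the table are included.

```python
# Keys are (M/IO#, D/C#, W/R#) as sampled when ADS# is asserted.
CYCLE_TYPES = {
    (0, 0, 0): "Interrupt acknowledge",
    (0, 0, 1): "Special cycle",
    (0, 1, 0): "I/O read",
    (0, 1, 1): "I/O write",
    (1, 0, 0): "Code read",
    (1, 1, 0): "Memory read",
    (1, 1, 1): "Memory write",
}


def decode_cycle(m_io: int, d_c: int, w_r: int) -> str:
    """Map the bus cycle definition pins to a cycle description."""
    return CYCLE_TYPES.get((m_io, d_c, w_r), "not listed in the table")


print(decode_cycle(0, 1, 0))  # I/O read
print(decode_cycle(1, 1, 1))  # Memory write
```

CACHE# and KEN# are left out of the lookup because they only distinguish single transfers from burst line fills, not the cycle type itself.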
Single transfer bus cycles are categorized into two classes: Non-Pipelined Cycles and Pipelined
Cycles.
Memory Read & Write Bus Cycles - Non-pipelined
This is the simplest type of bus cycle, either with or without wait states. The following figure shows
a non-pipelined memory read cycle (zero wait states) and a write cycle (with 1 wait state). The cycle is
divided into two T-states, T1 and T2. The sequence of operations performed in the two T-states is as follows.
T1 State
It is also called the address phase, since the address is placed on the bus during this state. The processor
initiates the cycle by asserting the address status (ADS#) signal. The ADS# output indicates that
a valid bus cycle definition and address are available on the cycle definition pins (M/IO#,
D/C#, W/R#) and the address bus (A31-A3, BE0#-BE7#). The CACHE# output is deasserted
(high) to indicate a single transfer cycle.
T2 State
It is also known as the data phase, as data is sent out or received in this state. If it is a write
operation, the processor drives the data onto the data bus at the beginning of the T2 state. At the
end of the T2 clock, the processor samples the signal BRDY# (this signal is generated by
the memory subsystem and sent to the microprocessor). If asserted, BRDY# indicates
that the external memory subsystem has presented valid data in response to a read, or that the
external memory subsystem has accepted data in response to a write. If BRDY# is found not
asserted, the processor is forced to insert another T2 state, i.e. one wait state. Any number of
wait states can be added to Pentium processor bus cycles by keeping BRDY# inactive.
The deasserted BRDY# signal indicates that the system is not ready to drive or to accept
data.
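The way BRDY# stretches a non-pipelined cycle can be modelled with a short simulation. This is a simplified sketch: the function names are made up for illustration, and each list element stands for the BRDY# level sampled at the end of one T2 clock (True means asserted).

```python
def bus_cycle_clocks(brdy_samples):
    """Count the clocks of one non-pipelined single-transfer cycle.

    One clock is spent in T1 (address phase). Each BRDY# sample costs one
    T2 clock; the cycle ends on the first clock where BRDY# is asserted.
    """
    clocks = 1  # T1: ADS# asserted, address and cycle definition driven
    for asserted in brdy_samples:
        clocks += 1  # one T2 clock (data phase)
        if asserted:
            return clocks
    raise ValueError("BRDY# was never asserted; the cycle did not complete")


def wait_states(brdy_samples):
    """Wait states are the extra T2 clocks beyond the minimum T1 + T2."""
    return bus_cycle_clocks(brdy_samples) - 2


print(bus_cycle_clocks([True]), wait_states([True]))                # 2 0
print(bus_cycle_clocks([False, True]), wait_states([False, True]))  # 3 1
```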
such as the IBM RS/6000, it is close to 200.
Among the RISC manufacturers there are companies which started with a RISC product, such as MIPS Computer
Systems (now a part of Silicon Graphics) with its Rx000 series, and Sun Microsystems with its SPARC. There are
other manufacturers, known for their CISC microprocessor families, who also started their own RISC system families,
such as Intel, known for its x86 family, which started the RISC i860 family, and Motorola, known for its M68000 family, which
started the RISC M88000 family.
Of particular note is IBM, which was actually the first to start with the development of an experimental RISC system,
the 801, and now features the RISC System/6000. This effort is continued jointly by IBM, Motorola,
and Apple in creating a new RISC-type family of microprocessors, called PowerPC, with the 6xx series.
DEC, some of whose professionals opposed the RISC idea in the beginning, now features its own RISC product, the
Alpha AXP, considered to be one of the fastest microprocessors of the early nineties.
Practically all new RISC-type products, as well as some CISC, are superscalar. The MIPS R4000 and R4400 are
two-issue superpipelined. Of the superscalar systems, the majority are two-issue.
The application of RISC processors is widening. Generally speaking, most RISC processors are universal and their
field of application is not limited. However, some of the most notable recent RISC applications are in workstations,
multiprocessors, and real-time systems, primarily because of their superior performance at a relatively low cost. The
application area of RISC is expected to widen in the future. Most RISC systems implement the following features,
although they may not constitute the basic principles of RISC.
* HLL support
* Implementation of register windows
* Pipelining
* Delayed branch
* Scoreboarding
* Dual cache
* ILP (Instruction Level Parallelism: superscalar or superpipelined)
4.3 The Alpha AXP Architecture & Features:
Features:
The Alpha is a 64-bit RISC-type microprocessor manufactured by Digital Equipment Corporation (DEC) in 1991. Its
features are listed below:
+ It is a two-issue superscalar implementation of instruction level parallelism (ILP).
+ It has a dual cache with 8 Kbytes for code and 8 Kbytes for data.
+ It has an on-chip Floating Point Unit (FPU).
+ It has an on-chip Memory Management Unit (MMU).
+ Instruction size is fixed at 32 bits, supporting three operands.
+ Operations are register-to-register, with memory access by load and store instructions only.
+ It has two sets of thirty-two 64-bit registers: R0 to R31 for the Integer Unit (IU) and F0 to F31 for the FPU.
+ It is byte addressable, i.e. its basic addressable unit is the byte.
+ Memory is accessed using a 64-bit virtual address. The minimum virtual address size is 43 bits.
+ It operates at frequencies starting at 150 MHz and reaching 300 MHz.
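A few of the figures in the feature list can be turned into concrete sizes with simple arithmetic. The sketch below is only an illustration of the numbers quoted above; the variable names are arbitrary.

```python
MIN_VIRTUAL_ADDRESS_BITS = 43   # minimum virtual address size from the list
INSTRUCTION_BITS = 32           # fixed instruction length
REGISTER_SETS = 2               # integer set (R0-R31) and floating set (F0-F31)
REGISTERS_PER_SET = 32
REGISTER_BITS = 64

# Byte-addressable memory: one address per byte.
virtual_space_bytes = 2 ** MIN_VIRTUAL_ADDRESS_BITS
virtual_space_tib = virtual_space_bytes // 2 ** 40

instruction_bytes = INSTRUCTION_BITS // 8
register_file_bytes = REGISTER_SETS * REGISTERS_PER_SET * REGISTER_BITS // 8

print(virtual_space_bytes)   # 8796093022208
print(virtual_space_tib)     # 8
print(instruction_bytes)     # 4
print(register_file_bytes)   # 512
```

So even the minimum 43-bit virtual address covers 8 terabytes, and every instruction occupies exactly one naturally aligned long word.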
Ti State
Every bus cycle starts and ends with the idle state Ti. This idle state is required because of
timing constraints associated with the faster bus speed. Hence most of the signals, including the valid
address, bank enables, and bus cycle definition signals, extend slightly into the idle state following T2.
Here CACHE# and KEN# are both deasserted to indicate that the bus cycles are not dealing
with the internal cache of the Pentium.
[Figure: Timing of non-pipelined memory read and write bus cycles]
Chapter 4 - DEC Alpha AXP Family
Notes by Faruk Kazi
4.1 RISC versus CISC:
The microprocessor families Intel x86 and Motorola M68000 are known for their abundant instruction sets, multiple
addressing modes, and multiple instruction formats and sizes. Their control is microprogrammed and different
instructions execute within a different number of cycles. The control units of such microprocessors are naturally
complex, since they have to distinguish between a large number of opcodes, addressing modes, and formats. This type
of system belongs to the category called Complex Instruction Set Computer (CISC).
As opposed to the traditional CISC design, in the early eighties there emerged a new trend of computer design called
RISC (Reduced Instruction Set Computer). What is "reduced" in a RISC? Practically everything: the number of
instructions, addressing modes, and formats. In an ideal RISC all instructions have the same size (usually 32 bits) and
execute within a single CPU cycle. In practice, only the majority of the instructions (over 80 percent in most RISC
systems) execute in a single cycle. Some of the important RISC properties are listed below.
+ Single-cycle execution of all (or at least most, over 80 percent) instructions
+ Single-word standard length of all instructions
+ Small number of instructions, not to exceed about 128
+ Small number of instruction formats, not to exceed about 4
+ Small number of addressing modes, not to exceed about 4
+ Memory access by load and store instructions only
+ All operations, except load and store, are register-to-register within the CPU
+ Hardwired control unit
+ A relatively large (at least 32) general-purpose CPU register file
Advantages of RISC:
The advantages of RISC-based microprocessors can be summarized as
+ VLSI realization
+ Computing speed
+ Design cost and reliability
+ HLL support
Shortcomings of RISC:
RISC shortcomings are directly related to some of its points of advantage. The principal RISC disadvantage is its
reduced number of instructions. Since a RISC has a small number of instructions, a number of functions performed on a
CISC by a single instruction will need two, three, or more instructions on a RISC. This in turn will cause the RISC
code to be longer. More memory will have to be allocated for RISC programs, and the instruction traffic between the
memory and the CPU will be increased. Recent studies have shown that, on average, a RISC program is about 30
percent longer than a CISC program performing the same function. This is because only a minority of the instructions
is used most of the time, and this minority is usually featured on RISC systems.
4.2 Overview of RISC Development and Current Systems:
As can be seen from the preceding discussion, the RISC concept is not quite clear-cut; it has both advantages and
shortcomings. It has encountered opposition right from its inception. The RISC controversy continued over a number of
years. Notwithstanding the controversy, an important fact is notable: there are a considerable number of commercial
computer products announced as RISC-type by their manufacturers. To be sure, some of them do not adhere to all the
RISC properties specified above. One particular RISC "violation" is in the number of instructions. In some systems
Non-Pipelined I/O Read and Write Bus Cycles:
The figure illustrates I/O read and write cycles. The I/O read shown in the figure is a zero wait state
transfer. The subsequent I/O write is a one wait state transfer. It is important to note that the
CACHE# signal in these cycles is always deasserted and KEN# is not sampled.
Figure - Timing of an I/O Read Followed by an I/O Write Bus Cycle (Non-Pipelined)
Single Transfer Bus Cycle (Pipelined)
When back-to-back cycles are run to memory or I/O devices that require one or more wait states to
complete a transfer, pipelining can improve performance. These devices must decode the address
and assert the NA# (Next Address) signal. When the processor samples NA# asserted, it drives the
next pending bus cycle early, before the current bus cycle completes. This allows devices designed
to take advantage of pipelining to decode the address early in preparation for the next transfer. These
devices can also latch the current data access and return that data to the processor, while starting the
next access during the current cycle.
The following figure illustrates two bus cycle transfer sequences, one without pipelining and the other
with pipelining. The first sequence consists of three back-to-back bus cycles to a device requiring
two wait states to complete each transfer. The processor samples NA# deasserted at all sample
points, therefore the next cycle is not pipelined early. The second sequence consists of the same three
back-to-back bus cycles. However, in this example NA# is sampled asserted by the processor. This
causes the processor to start the next bus cycle prior to completing the first. Notice that the three
pipelined cycles complete in four fewer clock cycles than the non-pipelined transfers.
Special Cycles:
The special cycles are indicated by the byte enable signals BE0# to BE7#. These signals define six
special cycles. They are described in the table. The bus cycle definition pins for them are in the
following state: M/IO# = 0, D/C# = 0 and W/R# = 1.
BE7# | BE6# | BE5# | BE4# | BE3# | BE2# | BE1# | BE0# | Special Bus Cycle
1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | Shutdown
1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | Flush (INVD, WBINVD instruction)
1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | Halt
1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | Write-Back (WBINVD instruction)
1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | Flush Acknowledge
1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | Branch Trace Message
Table: Special Bus Cycles
During special cycles, the data bus is undefined or floated and the address lines A31-A3 are driven to
'0'. When the external logic detects that a special cycle is in progress, it decodes the byte
enables to determine which special cycle is being run.
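The byte-enable decode performed by external logic can be sketched as a lookup on the 8-bit BE7#-BE0# pattern. The encodings below (exactly one byte enable low per special cycle, starting with BE0# for shutdown) are written from the Intel Pentium documentation as best recalled and should be treated as an assumption to verify against the data sheet; the names are illustrative.

```python
# Bit 7 = BE7# ... bit 0 = BE0#. The signals are active low, so a 0 bit
# means that byte enable is asserted.
SPECIAL_CYCLES = {
    0b11111110: "Shutdown",
    0b11111101: "Flush (INVD, WBINVD)",
    0b11111011: "Halt",
    0b11110111: "Write-back (WBINVD)",
    0b11101111: "Flush acknowledge",
    0b11011111: "Branch trace message",
}


def decode_special_cycle(m_io: int, d_c: int, w_r: int, byte_enables: int) -> str:
    """Identify a special cycle from the definition pins and BE7#-BE0#."""
    if (m_io, d_c, w_r) != (0, 0, 1):
        return "not a special cycle"
    return SPECIAL_CYCLES.get(byte_enables & 0xFF, "undefined encoding")


print(decode_special_cycle(0, 0, 1, 0b11111011))  # Halt
print(decode_special_cycle(1, 1, 0, 0b11111011))  # not a special cycle
```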
+ Shutdown Special Cycle
The shutdown cycle is executed for the following reasons:
i. Any other exception occurs while the Pentium is attempting to invoke the double-fault handler
(triple-fault situation).
ii. An internal parity error is detected.
During shutdown, the internal caches remain in the same state unless an inquire cycle is run or
the cache is flushed. The pins FLUSH#, SMI# and R/S# are recognized during this state. The processor
comes out of shutdown if NMI, INIT or RESET is asserted.
+ Halt Special Cycle
The processor executes the halt cycle when a HLT instruction is executed. During halt, the internal
processor status is the same as during the shutdown cycle. The halt cycle can be recognized externally
because the byte enables are asserted differently from those of the shutdown cycle. The Pentium processor will
exit the halt state if INTR is asserted and maskable interrupts are enabled, or on the assertion
of NMI, INIT or RESET.
+ Branch Trace Message Special Cycle
If the execution tracing enable bit (bit 1) in Test Register 12 (TR12) is set to one, the processor
generates a branch trace message special cycle whenever a branch is taken. The processor also
asserts the IBT (Instruction Branch Taken) pin.
As in the other special bus cycles, the data bus is undefined and the bus cycle definition signals have
the same settings. The only difference is that the processor does not drive zeros on the address bus. The
following is driven on the address bus during a branch trace message special cycle:
A31-A3: Bits 31-3 of the branch target linear address.
BT2-BT0: Bits 2-0 of the branch target linear address
(the byte enables should not be decoded for A2-A0).
Intel Dual-Core Processors: (for VIVA)
In April of 2005, Intel announced the Intel Pentium processor Extreme Edition, featuring an Intel
dual-core processor, which can provide immediate advantages for people looking to buy systems that
boost multitasking computing power and improve the throughput of multithreaded applications. An
Intel dual-core processor consists of two complete execution cores in one physical processor,
both running at the same frequency. Both cores share the same packaging and the same interface with
the chipset/memory. Overall, an Intel dual-core processor offers a way of delivering more capabilities
while balancing energy-efficient performance, and is the first step in the multi-core processor future.
An Intel dual-core processor-based PC will enable new computing experiences as it delivers value by
providing additional computing resources that expand the PC's capabilities in the form of higher
throughput and simultaneous computing. Imagine that a dual-core processor is like a four-lane
highway: it can handle up to twice as many cars as its two-lane predecessor without making each car
drive twice as fast. Similarly, with an Intel dual-core processor-based PC, people can perform
multiple tasks such as downloading music and gaming simultaneously.
When combined with Hyper-Threading Technology (HT Technology), the Intel dual-core
processor is the next step in the evolution of high-performance computing. Intel dual-core products
supporting Hyper-Threading Technology can process four software threads simultaneously by more
efficiently using resources that otherwise may sit idle. A new Intel dual-core processor-based PC
gives people the flexibility and performance to handle robust content creation or intense gaming, while
simultaneously managing background tasks such as virus scanning and downloading. Cutting-edge
gamers can play the latest titles and experience ultra-realistic effects and gameplay. Entertainment
enthusiasts will be able to create and improve digital content while encoding other content in the
background.
The new Intel Core Duo processors have ushered in a new era in processor architecture design in
which multi-core processors become the standard for delivering greater performance, improved
performance per watt, and new capabilities across Intel's desktop, mobile, and server platforms. The
Intel dual-core products also represent a vital first step on the road to realizing Platform 2015, Intel's
vision for the future of computing and the evolving processor and platform architectures that support
it.
BT3: High if the default operand size is 32 bits,
low if the default operand size is 16 bits.
+ Flush Special Cycle
This cycle is generated by the execution of two instructions:
a. INVD - Invalidate
When this instruction is executed, the processor sets all cache entries to the I (invalid) state and runs a flush
special cycle. This cycle notifies external logic that all internal cache lines have been invalidated and
that L2, if present, should also invalidate itself. In this case, modified data in L2 is not written back to
memory and is hence lost.
b. WBINVD - Write Back and Invalidate
The execution of this instruction causes all modified data to be written back to memory, and then all
the cache entries are invalidated. As each modified line is written back, the processor invalidates its
entry. After all the lines are written back, the processor runs
a write-back special cycle followed by
a flush special cycle.
This informs the L2 cache that it should invalidate its entries after writing back to main memory.
+ Write-back Special Cycle
This cycle is run after the WBINVD instruction. It indicates that all modified lines in the L1 data cache
have been written back to main memory / L2 cache. The write-back special cycle is followed by a
flush special cycle, which forces the L2 cache to invalidate its entries after writing back to main memory.
+ Flush Acknowledge Special Cycle
This cycle is run in response to the FLUSH# signal pin being asserted. When FLUSH# is asserted,
the processor writes back all modified lines in the data cache and invalidates all cache entries in both
the code and data caches. This cycle notifies the external logic that all modified lines have been
written back and all cache entries invalidated.
Interrupt Acknowledge Bus Cycle
Unlike the cycles discussed earlier, interrupt acknowledge cycles have a unique cycle type
generated on the cycle type pins. The Pentium processor generates two back-to-back interrupt
acknowledge bus cycles when an interrupt request (INTR) is recognized. This is the response given by
the Pentium to a maskable interrupt request generated on the INTR pin, if interrupts are enabled. The
processor uses this locked pair of interrupt acknowledge cycles to communicate with the interrupt
controller(s). It is the system designer's responsibility to insert wait states if required to
meet the specified data setup and hold time requirements of the interrupt controllers.
Comparison of Pentium family pipeline: MAY08
As the figure below shows, the NetBurst pipeline is twice as deep as that of the P6, which in turn had twice
the depth of the P5's. Increasing pipeline depth increases logic complexity and branch penalties, but it
also allows clock speeds to increase.
Figure: Comparison of Pentium Family Pipeline
* Trace Cache Next Instruction Pointer: The trace cache fetch logic gets a pointer to the next
instruction in the trace cache. Trace cache is Intel's name for putting the L1 cache inside of
the first functional unit for speed.
* Trace Cache Fetch: Uses the pointer to fetch an instruction from the cache.
* Drive: The two Drive stages shown in the figure represent the time required to move signals across
the chip. No other work is done during these stages. NetBurst is the first pipeline with
dedicated stages for wire delays. This is apparently necessary for multiple-gigahertz speeds.
* Allocate and Rename: The CPU actually contains more registers than are named in the
specification, in order to speed things up and be able to execute operations in a superscalar
fashion (which is to say, more than one operation at once). At this time, the CPU will
associate different registers with the names of the registers.
* Queue: Operations are now placed into either the memory queue or the arithmetic (everything
else) queue for scheduling.
* Schedule: In a superscalar processor, operations are often executed out of order so that they
do not step on each other and so that they are completed as rapidly as possible. The P4 has
four queues: Memory, Fast ALU, Slow ALU/General FPU, and Simple FP. All instructions
get placed into one of them for later execution by the appropriate (and linked) functional
unit. Operations in each queue are sorted based on the order they were submitted, what
instructions are waiting on them, and the number of cycles required by an ALU (or FPU, or
LSU) to complete the instruction.
* Dispatch: Instructions are moved from the queues to the functional units.
* Register Files: The instructions are now loaded into the functional units for actual execution.
* Execute: The functional units process the instructions in the files along with data in the
registers. This is the seventeenth stage! A lot has happened before we got here.
* Flags: There is a status register (sometimes called a flag register) in all CPUs of the x86
family which is used for conditional jumps. The flags are set in this stage.
* Branch Check: Now, in the nineteenth out of twenty stages, we finally check to see if the
branch predictor predicted incorrectly and we have to discard some operation we have just
spent eighteen stages (and a cycle) on.
* In other words, the Pentium 4 is split up into 20 very short pipeline stages. Some of them are
so short that they aren't long enough to fit an entire function, and so that function actually
takes two stages. Chopping execution up into so many short stages means that very high clock
[Figure: Timing of the two interrupt acknowledge bus cycles, separated by at least one idle state]
The first and the second interrupt acknowledge cycles are distinguished by the state of address bit 2
(A2, encoded from the byte enables). If A2 = 1, it corresponds to the first interrupt acknowledge
cycle, and if A2 = 0, it indicates the second interrupt acknowledge cycle.
1st interrupt acknowledge: A2 = 1, A31-A3 = 0. Hence address = 4
2nd interrupt acknowledge: A2 = 0, A31-A3 = 0. Hence address = 0
Data is returned by the interrupt controller at the end of both cycles. The data returned
during the first cycle is ignored by the processor. During the second cycle, the interrupt vector is
returned on the lower 8 bits of the data bus. The Pentium has 256 possible interrupt vectors. Both
cycles are separately terminated when the external system returns BRDY#. Wait states can be added
by withholding BRDY#. The Pentium processor automatically generates at least one idle clock between
the first and second cycle.
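The address seen in each acknowledge cycle and the extraction of the vector from the data bus can be sketched as follows. The function names are illustrative, and the data bus value is just an example number.

```python
def inta_address(cycle_number: int) -> int:
    """Address driven during interrupt acknowledge cycle 1 or 2.

    A31-A3 are zero; A2 is 1 in the first cycle and 0 in the second.
    """
    if cycle_number not in (1, 2):
        raise ValueError("there are exactly two interrupt acknowledge cycles")
    a2 = 1 if cycle_number == 1 else 0
    return a2 << 2


def interrupt_vector(data_bus_value: int) -> int:
    """The vector is taken from the lower 8 bits of the data bus (second cycle)."""
    return data_bus_value & 0xFF


print(inta_address(1), inta_address(2))      # 4 0
print(hex(interrupt_vector(0x12345620)))     # 0x20
```

Masking to 8 bits is what limits the Pentium to 256 possible interrupt vectors.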
Memory subsystem
The processor provides three levels of on-package cache for scalable performance across a variety of
workloads. At the first level, instruction and data caches are split, each 16 Kbytes in size, four-way
set-associative, and with a 32-byte line size. The dual-ported data cache has a load latency of two
cycles, is write-through, and is physically addressed and tagged. The L1 caches are effective on
moderate-size workloads and act as a first-level filter for capturing the immediate locality of large
workloads. The second cache level is 96 Kbytes in size, is six-way set-associative, and uses a 64-byte
line size. The cache can handle two requests per clock via banking. This cache is also the level at
which ordering requirements and semaphore operations are implemented. The L2 cache uses a four-state
MESI (modified, exclusive, shared, and invalid) protocol for multiprocessor coherence. The
cache is unified, allowing it to service both instruction and data side requests from the L1 caches. This
approach allows optimal cache use for both instruction-heavy (server) and data-heavy (numeric)
workloads. Since floating-point workloads often have large data working sets and are used with
compiler optimizations such as data blocking, the L2 cache is the first point of service for
floating-point loads. Also, because floating-point performance requires high bandwidth to the register file, the
L2 cache can provide four double-precision operands per clock to the floating-point register file,
using two parallel floating-point load pair instructions. The third level of on-package cache is 4
Mbytes in size, uses a 64-byte line size, and is four-way set-associative. It communicates with the
processor at core frequency (800 MHz) using a 128-bit bus. This cache serves the large workloads of
server and transaction-processing applications, and minimizes the cache traffic on the front-side
system bus. The L3 cache also implements a MESI protocol for multiprocessor coherence.
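The cache parameters quoted above determine the number of sets in each level and the peak L3 bandwidth. The sketch below works these out using the standard relation sets = size / (ways x line size); the helper names are illustrative.

```python
KB = 1024
MB = 1024 * KB


def cache_sets(size_bytes: int, ways: int, line_bytes: int) -> int:
    """Number of sets in a set-associative cache."""
    return size_bytes // (ways * line_bytes)


l1_sets = cache_sets(16 * KB, ways=4, line_bytes=32)
l2_sets = cache_sets(96 * KB, ways=6, line_bytes=64)
l3_sets = cache_sets(4 * MB, ways=4, line_bytes=64)

# Peak L3 bandwidth: a 128-bit bus moves 16 bytes per clock at 800 MHz.
l3_bytes_per_second = (128 // 8) * 800_000_000

print(l1_sets, l2_sets, l3_sets)   # 128 256 16384
print(l3_bytes_per_second)         # 12800000000
```

The 12.8 Gbytes/s result is consistent with the "over 12 Gbytes/s" figure the notes quote for the L3 cache, and the six-way organization is what lets the unusual 96-Kbyte L2 size still divide into a power-of-two number of sets.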
NOV04/MAY07: Bus State Transition Diagram
The Pentium processor's bus state machine has six bus states, as follows:
Ti   Bus Idle State
T1   Address Phase
T2   Data Phase
T12  Address Phase (new cycle) and Data Phase (current cycle)
T2P  Data Phase (1st cycle) and Data Phase (2nd cycle, pipelined)
TD   Dead State
The bus control state machine diagram is shown in the following figure. The state transitions are
listed below:
[Figure: Pentium bus control state machine diagram, with transitions #0 to #11 between states Ti, T1, T2, T12, T2P and TD]
and efficiently deliver this information to the hardware. The processor provides a six-wide and
10-stage deep pipeline, running at 800 MHz on a 0.18-micron process. This combines both abundant
resources to exploit ILP and high frequency for minimizing the latency of each instruction. The
resources consist of four integer units, four multimedia units, two load/store units, three branch units,
two extended-precision floating-point units, and two additional single-precision floating-point units
(FPUs). The hardware employs dynamic prefetch, branch prediction, non-blocking caches, and a
register scoreboard to optimize for compile-time nondeterminism. Three levels of on-package
cache minimize overall memory latency. This includes a 4-Mbyte level-3 (L3) cache, accessed at core
speed, providing over 12 Gbytes/s of data bandwidth. The figure below provides the block diagram of the
Itanium processor.
The 16-Kbyte, four-way set-associative instruction cache is fully pipelined and can deliver 32 bytes of
code (two instruction bundles or six instructions) every clock. The cache is supported by a
single-cycle, 64-entry instruction translation look-aside buffer (TLB). The fetched code is fed into a
decoupling buffer that can hold eight bundles of code. As a result of this buffer, the machine's front
end can continue to fetch instructions into the buffer even when the back end stalls. Conversely, the
buffer can continue to feed the back end even when the front end is disrupted by fetch bubbles due to
branches or instruction cache misses.
Hierarchy of branch predictors: The processor employs a
hierarchy of branch prediction structures to deliver high-accuracy and low-penalty predictions across a
wide spectrum of workloads. Note that if a branch misprediction led to a full pipeline flush, there
would be nine cycles of pipeline bubbles before the pipeline is full again. This would mean a heavy
performance loss.
After instructions are fetched in the front end, they move into the middle pipeline, which disperses
instructions, implements the architectural renaming of registers, and delivers operands to the wide
parallel hardware. The processor has a total of nine issue ports capable of issuing up to two memory
instructions (ports M0 and M1), two integer (ports I0 and I1), two floating-point (ports F0 and F1),
and three branch instructions (ports B0, B1, and B2) per clock. The processor's 17 execution units are
fed through the M, I, F, and B groups of issue ports. The processor provides an abundance of
execution resources to exploit ILP. The integer execution core includes two memory and two integer
ports, with all four ports capable of executing arithmetic, shift-and-add, logical, compare, and most
integer SIMD multimedia operations. The memory ports can also perform load and store operations,
including loads and stores with post-increment functionality. The integer ports add the ability to
perform the less common integer instructions, such as test bit, look for zero byte, and variable shift.
Additional uncommon instructions are implemented on only the first integer port.
For data speculation, the software issues an advanced load instruction. When the hardware encounters
an advanced load, it places the address, size, and destination register of the load into the ALAT
structure. The ALAT then observes all subsequent explicit store instructions, checking for overlaps with
the valid advanced load addresses present in the ALAT. In the common case, there is no match, the
ALAT state is unchanged, and the advanced load result is used normally. In the case of an overlap, all
address-matching advanced loads in the ALAT are invalidated.
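The ALAT behaviour described above can be modelled in a few lines. This is a simplified software sketch, not the hardware organization: the class and method names are invented for illustration, entries are keyed by destination register, and the table is assumed to be unbounded.

```python
class Alat:
    """Toy model of an advanced load address table."""

    def __init__(self):
        # destination register -> (address, size in bytes)
        self.entries = {}

    def advanced_load(self, dest_reg: str, address: int, size: int) -> None:
        """Record the address, size and destination register of an advanced load."""
        self.entries[dest_reg] = (address, size)

    def store(self, address: int, size: int) -> None:
        """Invalidate every advanced load whose byte range overlaps this store."""
        for reg, (load_addr, load_size) in list(self.entries.items()):
            overlaps = address < load_addr + load_size and load_addr < address + size
            if overlaps:
                del self.entries[reg]

    def is_valid(self, dest_reg: str) -> bool:
        """True if the advanced load result can still be used as is."""
        return dest_reg in self.entries


alat = Alat()
alat.advanced_load("r5", address=0x100, size=8)
alat.store(address=0x200, size=4)   # no overlap: entry survives
print(alat.is_valid("r5"))          # True
alat.store(address=0x104, size=4)   # overlaps bytes 0x104-0x107: entry invalidated
print(alat.is_valid("r5"))          # False
```

When the check fails, the software must redo the load (or run recovery code) before using the value, which is the cost paid in the uncommon overlap case.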
Note that once NA# is sampled asserted, the Pentium processor latches it and will pipeline a cycle when
one becomes pending, even if NA# is subsequently deasserted.
#0 No request pending.
#1 The Pentium processor starts a new bus cycle by driving the address and bus cycle definition
signals and asserting ADS# in the T1 state.
#2 The Pentium processor always moves from T1 to T2 to process the data transfer. The only
exception is the assertion of BOFF#. If BOFF# is asserted, the cycle is aborted, and it is
restarted once BOFF# is released.
#3 The Pentium processor stays in the T2 state (adding wait states) until the transfer is over (BRDY#
sampled asserted) if no new request becomes pending or if NA# is not asserted.
#4 If the current transfer is complete, the processor begins from the T1 state again if there is a
new request pending and NA# is sampled asserted. If NA# is not asserted, the processor
enters the Ti state from the T2 state.
#5 If the current transfer is over and no other bus cycle is pending or NA# is not asserted, the
processor goes to the idle state Ti.
#6 Before the current cycle ends, if NA# is found asserted and another cycle becomes
pending, the processor now has two outstanding cycles. ADS# is asserted for the second
cycle and the bus moves from T2 to the T12 state.
#7 When the current cycle is complete in the T12 state and no dead clock is needed, the
processor moves back to the T2 state from T12.
#8 When the processor finishes the current bus cycle in T12 but a dead clock is needed, it goes to
the TD state. (This happens for consecutive read and write cycles or write and read cycles.)
#9 When the current bus cycle is still running and BOFF# is not asserted, the processor always
transitions from T12 to the T2P state to process the data transfer. Both the current and the
next cycles execute their data phase in state T2P.
#10 The processor stays in T2P until the first cycle's transfer is over.
#11 The bus state goes from T2P to T2 when the processor finishes the first transfer and
no dead clock is needed (no consecutive read-write or write-read operation).
#12 The processor enters the TD state from T2P if the first transfer is complete but a dead clock
is required.
#13 The bus state returns to T12 from TD if NA# is sampled asserted and there is a new
request pending.
#14 If there is no new request pending or NA# is not asserted, the processor moves
from TD to the T2 state.
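The transition list above can be read as a small state machine. The following is a minimal sketch, assuming four simplified inputs per clock (pending, na, done, dead_clock) and ignoring BOFF# and the latching of NA#. The function and parameter names are made up for the illustration; the comments refer to the transition numbers in the notes.

```python
def next_state(state, pending=False, na=False, done=False, dead_clock=False):
    """Return the next bus state for one clock.

    pending    : a new bus cycle request is pending
    na         : NA# has been sampled asserted
    done       : the current (first outstanding) transfer has completed
    dead_clock : a dead clock is needed (read followed by write or vice versa)
    """
    if state == "Ti":
        return "T1" if pending else "Ti"              # #1 / #0
    if state == "T1":
        return "T2"                                   # #2
    if state == "T2":
        if done:
            return "T1" if (pending and na) else "Ti" # #4 / #5
        return "T12" if (pending and na) else "T2"    # #6 / #3
    if state == "T12":
        if done:
            return "TD" if dead_clock else "T2"       # #8 / #7
        return "T2P"                                  # #9
    if state == "T2P":
        if not done:
            return "T2P"                              # #10
        return "TD" if dead_clock else "T2"           # #12 / #11
    if state == "TD":
        return "T12" if (pending and na) else "T2"    # #13 / #14
    raise ValueError("unknown bus state: " + state)


def trace(inputs, state="Ti"):
    """Apply one dict of input signals per clock and collect the visited states."""
    states = [state]
    for signals in inputs:
        state = next_state(state, **signals)
        states.append(state)
    return states


# Non-pipelined cycle with one wait state.
simple = trace([{"pending": True}, {}, {}, {"done": True}])
# Pipelined case: a second request arrives with NA# asserted during T2.
pipelined = trace([{"pending": True}, {}, {"pending": True, "na": True},
                   {}, {"done": True}, {"done": True}])
print(simple)
print(pipelined)
```

The first trace walks Ti, T1, T2, T2, Ti; the second passes through T12 and T2P before returning to T2 and then Ti.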
Summary of Pentium Family of Processors: (Also useful for VIVA Exam)
Feature | Pentium | Pentium Pro | Pentium II | Pentium III | Pentium 4
Code name | P5/P54C | P6 | Klamath/Deschutes | Katmai/Coppermine | Willamette/Northwood
Introduced | 03/22/93 | 11/01/95 | 05/07/97 | 02/26/99 | 11/2000
Operations per clock cycle | 2 | 3 | 3 | 3 | 6
Max clock speed (slower system bus) | 150MHz at 60MHz | 180MHz at 60MHz | 333MHz at 66MHz | 1.0GHz at 100MHz | 2.40GHz at 400MHz
Max clock speed (faster system bus) | 200MHz at 66MHz | 200MHz at 66MHz | 450MHz at 100MHz | 1.4GHz at 133MHz | 2.53GHz at 533MHz
Bus frequency | 60MHz, 66MHz | 60MHz, 66MHz | 66MHz, 100MHz | 100MHz, 133MHz | 400MHz (100*4), 533MHz (133*4)
Number of transistors | 3,100,000 (0.8 micron) | 5,500,000 (0.35 micron) | 7,500,000 (0.35 micron) | 24,000,000 (0.13 micron) | 42,000,000 (0.13 micron)
L1 cache | 16KB (8KB code + 8KB data) | 16KB (8KB code + 8KB data) | 32KB (16KB code + 16KB data) | 32KB (16KB code + 16KB data) | 12K µop + 8KB data
L2 cache | off-chip, not specified | 1MB (on-chip) | 512KB (off-chip) | 512KB (on-chip) | 512KB (on-chip)
Addressable memory | 4GB | 64GB | 64GB | 64GB | 64GB
Integer pipelines | 2 | 2 | 2 | 2 | 4
Floating-point pipelines | 1 | 1 | 1 | 1 | 2
Brief description | Superscalar architecture | Intel's first [illegible] server/workstation chip | Dual independent bus, dynamic execution, Intel MMX technology | Data Prefetch Logic, Level 2 Advanced Transfer Cache | Capable of delivering 4.2GB of data per second into and out of the processor
3.8 Intel Itanium Processor:
(Please refer class notes for EQ answer)
The Itanium processor is the first implementation of the IA-64 instruction set architecture (ISA). The
design team optimized the processor to meet a wide range of requirements: high performance on
Internet servers and workstations, support for 64-bit addressing, reliability for mission-critical
applications, full IA-32 instruction set compatibility in hardware, and scalability across a range of
operating systems and platforms. The processor employs EPIC (explicitly parallel instruction
computing) design concepts for a tighter coupling between hardware and software. In this design style
the hardware-software interface lets the software exploit all available compilation time information
Burst Bus cycles
In case of an L1 data cache miss, the required line has to be brought in from external memory or the L2
cache. This is done as a cache line fill operation in a burst cycle. Burst bus cycles are made up of
four consecutive bus cycles. The processor outputs the start address for only the first required
quadword. The memory subsystem should latch this address and compute the addresses for the other
three quadwords in the burst sequence. The fastest burst cycle possible requires 2 clocks for the first
data item to be returned/driven, with the subsequent data items returned/driven in every clock. Thus
a fast burst cycle takes 5 clocks (2 + 1 + 1 + 1) to complete. When a cache line fill operation has to be
done for the L1 cache, first the L2 cache is accessed. If it is an L2 cache hit, we have a burst cycle with
no wait states, as explained later. If it is an L2 cache miss, then the slow DRAM has to be accessed, which
results in a slow burst cycle, as explained later.
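The clock count and the address computation above can be checked with a short sketch. The notes do not give the burst order; the interleaved order used here (XOR of the first quadword offset with 0x00, 0x08, 0x10, 0x18) is the order commonly documented for the Pentium and should be treated as an assumption, as should the function names.

```python
def burst_addresses(start_address):
    """Addresses of the four quadwords of one 32-byte line, in burst order.

    Assumed order: the requested quadword first, then the others in the
    interleaved (XOR) sequence. The low three address bits are ignored
    because the processor only drives A31-A3.
    """
    line_base = start_address & ~0x1F      # 32-byte aligned line address
    first_offset = start_address & 0x18    # quadword offset within the line
    return [line_base | (first_offset ^ step) for step in (0x00, 0x08, 0x10, 0x18)]


def burst_clocks(first_item_clocks=2, later_item_clocks=1, transfers=4):
    """Total clocks for a burst: first data item plus the remaining items."""
    return first_item_clocks + (transfers - 1) * later_item_clocks


fast = burst_clocks()                          # 2 + 1 + 1 + 1
slow = burst_clocks(first_item_clocks=3, later_item_clocks=2)
order = [hex(a) for a in burst_addresses(0x1008)]
print(fast, slow, order)
```

With the default arguments the total is the 5 clocks quoted above; the second call shows how wait states on every transfer lengthen the burst (the 3 and 2 clock figures are only example values).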
Burst Read Cycle
We consider a U pipeline request resulting in an L1 data cache miss and an L2 hit.
* The bus cycle begins with T1, when the processor outputs the address and bus cycle definition
signals on A31-A3, BE0#-BE7#, M/IO# and D/C#.
* ADS# is asserted as the address is driven, indicating that the address and bus cycle definition
currently on the bus are valid.
* CACHE# is driven low to indicate that the processor wishes to perform a burst line fill.
* W/R# driven low indicates that this should be a burst read cycle.
* L2 cache, on receiving the address, finds a cache hit, which also implies that the address is
cacheable. Hence, L2 cache asserts KEN# on the clock in which it first returns BRDY# asserted.
(KEN# is sampled only once during a cycle to determine cacheability.)
* Since the processor samples BRDY# asserted, it reads the first quadword that has been placed
on the data bus by L2 cache.
* The processor also samples the WB/WT# signal and finds it low, indicating that the line
should be placed in L1 cache in the 'S' state, i.e. write-through policy.
* Since this is a burst cycle, L2 cache computes the additional addresses as described before
and then supplies the corresponding data. It asserts BRDY# to indicate valid data on the data
bus.
* When the first quadword is received by the processor, it sends the requested data to the U
pipe immediately and also stores the quadword in the 32-byte line-fill buffer. Each of the
remaining quadwords is placed in the line-fill buffer as it is received. When the entire
32-byte line is placed in the buffer, the entire cache line is written into cache memory and the
cache directory is updated.
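The last step of the list, the line-fill buffer, can be sketched as follows. This is a simplified software picture under stated assumptions: the cache is modelled as a plain dictionary from line address to 32 bytes (standing in for both data array and directory), and the names LineFillBuffer and line_fill are hypothetical.

```python
class LineFillBuffer:
    """Collects the four 8-byte quadwords of one 32-byte cache line."""

    OFFSETS = (0, 8, 16, 24)

    def __init__(self):
        self.quadwords = {}                 # offset within line -> 8 bytes

    def receive(self, offset, data):
        if offset not in self.OFFSETS or len(data) != 8:
            raise ValueError("expected an 8-byte quadword at offset 0, 8, 16 or 24")
        self.quadwords[offset] = data

    def full(self):
        return len(self.quadwords) == len(self.OFFSETS)

    def line(self):
        # Assemble the line in address order, whatever the arrival order was.
        return b"".join(self.quadwords[o] for o in self.OFFSETS)


def line_fill(burst, cache, line_address):
    """Process one burst of (offset, data) pairs given in arrival order.

    Returns the quadword forwarded to the U pipe, i.e. the first one received.
    """
    buffer = LineFillBuffer()
    forwarded = None
    for index, (offset, data) in enumerate(burst):
        if index == 0:
            forwarded = data                # requested data goes to the pipe at once
        buffer.receive(offset, data)
    if buffer.full():
        cache[line_address] = buffer.line() # whole line written in one step
    return forwarded


cache = {}
burst = [(8, b"B" * 8), (0, b"A" * 8), (24, b"D" * 8), (16, b"C" * 8)]
forwarded = line_fill(burst, cache, 0x1000)
print(forwarded, cache[0x1000])
```

The requested quadword is available to the pipeline after the first transfer, while the cache itself is only updated once all four quadwords have arrived.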
improvements to the execution units over that of the P6 microarchitecture. For example, the
arithmetic logic units operate twice as fast as in previous microarchitectures.
As with the previous implementations, the retirement section receives the results of the executed µops
from the execution core and processes the results so that the proper architectural state is updated
according to the original program order. For semantically correct execution, the results of IA-32
instructions must be committed in original program order before they are retired. Exceptions may be
raised as instructions are retired. Thus, exceptions cannot occur speculatively, they occur in the
correct order, and the machine can be correctly restarted after an exception. When a µop completes
and writes its result to the destination, it is retired. Up to three µops may be retired per cycle. Again,
the ROB is the unit in the processor which buffers completed µops, updates the architectural state in
order, and manages the ordering of exceptions. The retirement section also keeps track of branches
and sends updated branch target information to the branch target buffer (BTB) to update branch
history.
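In-order retirement from the ROB can be illustrated with a short sketch. It assumes the ROB is a queue of µops in program order, each with a completion flag; the dictionary layout and the name retire_cycle are made up for the example. Only the limit of three µops per cycle comes from the notes.

```python
from collections import deque

RETIRE_WIDTH = 3  # up to three µops may be retired per cycle


def retire_cycle(rob):
    """Retire completed µops from the head of the ROB, oldest first.

    Retirement stops at the first µop that has not completed, so a younger
    µop that finished early still waits for every older one.
    """
    retired = []
    while rob and rob[0]["done"] and len(retired) < RETIRE_WIDTH:
        retired.append(rob.popleft()["id"])
    return retired


# µops 0..5 in program order; µop 2 is still executing.
rob = deque({"id": i, "done": i != 2} for i in range(6))
cycle1 = retire_cycle(rob)      # stops in front of µop 2
rob[0]["done"] = True           # µop 2 completes
cycle2 = retire_cycle(rob)      # limited by the retire width
cycle3 = retire_cycle(rob)
print(cycle1, cycle2, cycle3)
```

Although µops 3, 4 and 5 finished out of order, they leave the ROB only after µop 2, which is what keeps exceptions precise.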
3.7 The Intel Pentium M Processor Family
The Intel Pentium M processor family is designed for low power consumption. Its enhanced
microarchitecture includes the following features:
* Support for Intel Architecture with Dynamic Execution
* A high-performance, low-power core manufactured using Intel's advanced process
technology with copper interconnect
* On-die, primary 32-KByte instruction cache and 32-KByte write-back data cache
* On-die, second-level cache (up to 2 MByte) with Advanced Transfer Cache Architecture
* Advanced Branch Prediction and Data Prefetch Logic
* Support for MMX Technology, Streaming SIMD instructions, and the SSE2 instruction set
* A 400 MHz, Source-Synchronous Processor System Bus
* Advanced power management using Enhanced Intel SpeedStep technology
Figure - Burst Read Cycle (Basic)
Intel's MMX technology to 128 bits and supports packed integer operations. While the extended
width of the operation used to be 64 bits, these new instructions double the SIMD integer bandwidth
over SSE/MMX technology. This accelerates a broad range of applications, including video, speech,
and image and photo processing. The new 64-bit adds/subtracts and 32x32 unsigned multiply provide
significant enhancements to encryption operations as well. As we move into the future,
encryption/decryption capabilities will be more important in driving a secure e-Business infrastructure
for the connected world. The 128-bit SIMD double precision floating-point delivers the capability to
execute two 64-bit double precision floating-point instructions at once, doubling the performance
capability. In addition, it offers a full set of SIMD double precision floating-point operations, and
additional operations that convert between double and single precision. This double precision
floating-point support accelerates content creation, financial, engineering, and scientific applications.
Balanced Platform Solution
As part of a complete platform solution, the Intel Pentium 4 processor was designed in tandem with
the Intel 850 chipset to create a powerful new platform for high-performance users. The 400-MHz
system bus in the processor is balanced by dual RDRAM memory channels in the 850 chipset that
operate in lock-step to deliver 3.2 GB/s of memory bandwidth. Coupled with more efficient protocols
and the 400-MHz system bus, the Intel Pentium 4 processor and Intel 850 chipset deliver three times
the bandwidth of platforms based on high-performance Intel Pentium III processors. The increased
bandwidth enables faster memory acquisitions, which increase performance on any application
requiring intensive memory accesses, such as many 3D and video applications.
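The bandwidth figures in this paragraph follow from transfer rate times bus width. The sketch below reproduces them under assumptions that are not stated in the notes: a 64-bit (8-byte) system data bus, a 133-MHz system bus for the Pentium III platform being compared, and RDRAM channels that are 16 bits wide at 800 million transfers per second.

```python
def bandwidth_gb_s(transfers_mhz, width_bytes):
    """Peak bandwidth in GB/s for a given transfer rate (in millions of
    transfers per second) and bus width in bytes."""
    return transfers_mhz * 1e6 * width_bytes / 1e9


p4_system_bus = bandwidth_gb_s(400, 8)        # 400-MHz system bus, 8 bytes wide
p3_system_bus = bandwidth_gb_s(133, 8)        # assumed Pentium III comparison point
dual_rdram = 2 * bandwidth_gb_s(800, 2)       # two assumed 16-bit RDRAM channels
ratio = p4_system_bus / p3_system_bus

print(round(p4_system_bus, 2), round(dual_rdram, 2), round(ratio, 2))
```

Under these assumptions the system bus and the two memory channels both come out at 3.2 GB/s, which is the sense in which the platform is balanced, and the ratio to the Pentium III bus is close to the factor of three quoted above.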
More on Intel NetBurst MicroArchitecture:
Figure 3.6 is an overview of the Intel NetBurst microarchitecture. This microarchitecture pipeline is
made up of three sections: (1) the front end pipeline, (2) the out-of-order execution core, and (3) the
retirement unit.
The concept behind the Intel NetBurst microarchitecture (Pentium 4 processor, Intel Xeon processor)
was to improve the throughput, improve the efficiency of the out-of-order execution engine, and to
create a processor that can reach much higher frequencies with higher performance relative to the P5
and P6 microarchitectures, while maintaining backward compatibility.
The Intel NetBurst microarchitecture addressed some of the common problems found in high-speed,
pipelined microprocessors. Limiting factors for processor performance were delays from pre-fetch
and decoding of the instructions to µops, the efficiency of the branch prediction algorithm, and cache
misses. The execution trace cache addresses these problems by storing decoded IA-32 instructions.
Instructions are fetched and decoded by a translation engine, which builds the decoded instructions
into sequences of µops called traces, which are then stored in the trace cache. The execution trace
cache stores these µops in the path of predicted program execution flow, where the results of branches
in the code are integrated into the same cache line. This increases the instruction flow from the cache
and makes better use of the overall cache storage space, since the cache no longer stores instructions
that are branched over and never executed. The trace cache delivers up to three µops per clock to the
core.
Branch targets are predicted based on their linear address using branch prediction logic and fetched as
soon as possible. Branch targets are fetched from the execution trace cache if they are cached there;
otherwise, they are fetched from the memory hierarchy. The translation engine's branch prediction
information is used to form traces along the most likely paths.
The core's ability to execute instructions out of order remains a key factor in enabling parallelism. The
processor employs several buffers to smooth the flow of µops. This implies that when one portion of
the entire processor pipeline experiences a delay, that delay may be covered by other operations
executing in parallel (for example, in the core) or by the execution of µops which were previously
queued up in a buffer (for example, in the front end). The NetBurst microarchitecture adds further
2.8 Branch Prediction (Regularly asked EQ)
The Pentium processor includes branch prediction logic, allowing it to avoid pipeline stalls if it
correctly predicts whether or not the branch will be taken when the branch instruction is executed.
When a branch operation is correctly predicted, no performance penalty is incurred. However, when
branch prediction is not correct, a three-cycle penalty is incurred if the branch is executed in the U
pipeline and a four-cycle penalty if the branch is in the V pipeline. The prediction mechanism is
implemented using a four-way, set-associative cache with 256 entries. This is referred to as the
branch target buffer, or BTB. The directory entry for each line contains the following information.
* A valid bit that indicates whether or not the entry is in use.
* History bits that track how often the branch has been taken each time that it entered the
pipeline before.
* The source memory address that the branch instruction was fetched from.
The branch target buffer, or BTB, is a look-aside cache that sits off to the side of the D1 stages of the
two pipelines and monitors for branch instructions.
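The organisation of the BTB can be sketched as a four-way, set-associative table: 256 entries in 4 ways gives 64 sets. The notes do not say how the set is selected or which entry is replaced, so the index function (address modulo 64) and the first-in, first-out replacement below are assumptions, as are all names. The target and history fields follow what the notes go on to describe for a branch taken for the first time.

```python
class BranchTargetBuffer:
    WAYS = 4
    ENTRIES = 256
    SETS = ENTRIES // WAYS          # 64 sets of 4 entries each

    def __init__(self):
        self.sets = [[] for _ in range(self.SETS)]

    def _set_for(self, source_address):
        # Assumed index function: low bits of the branch's source address.
        return self.sets[source_address % self.SETS]

    def lookup(self, source_address):
        """Return the matching valid entry, or None on a BTB miss."""
        for entry in self._set_for(source_address):
            if entry["valid"] and entry["source"] == source_address:
                return entry
        return None

    def allocate(self, source_address, target_address):
        """Create an entry when a branch is taken for the first time."""
        ways = self._set_for(source_address)
        if len(ways) == self.WAYS:
            ways.pop(0)             # assumed FIFO replacement
        entry = {"valid": True, "source": source_address,
                 "target": target_address, "history": "strongly taken"}
        ways.append(entry)
        return entry


btb = BranchTargetBuffer()
first_lookup = btb.lookup(0x1F40)       # never seen: miss, predicted not taken
btb.allocate(0x1F40, 0x2000)            # branch turned out to be taken
second_lookup = btb.lookup(0x1F40)      # now a hit that can supply the target
print(first_lookup, second_lookup["target"])
```

A miss corresponds to a not-taken prediction, and a later hit makes the recorded target address available to the prefetcher.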
Figure - Relationship of the D1 pipeline stages and the BTB (prefetcher, BTB, hit).
The first time that a branch instruction enters either pipeline, the BTB uses its source memory
address to perform a lookup in the cache. Since the instruction has not been seen before, this results
in a BTB miss. This essentially means that the branch prediction logic has no history on the
instruction. It therefore predicts that the branch will not be taken when the instruction reaches the
execution stage of the pipeline, and does not instruct the prefetcher to alter program flow. Even
unconditional jumps will be predicted as not taken the first time that they are seen by the BTB.
When the instruction reaches the execution stage, the branch will either be taken or not taken. If
taken, the next instruction to be executed should be the one fetched from the branch target address.
GB/s). With 3.2 GB/s of system bandwidth, the Intel Pentium 4 processor delivers the highest
bandwidth desktop bus currently in the industry.
Hyper-pipelined Technology
With the Intel Pentium 4 processor, Intel has doubled the pipeline depth to 20 stages, enabling a
higher clock frequency. The additional pipeline stages establish a new baseline for processor speed,
delivering 1.40 GHz at launch on Intel's 0.18 micron process. This higher core frequency significantly
increases processor performance and frequency capability and provides the scalability needed for
future applications.
Advanced Dynamic Execution
The Advanced Dynamic Execution engine is a very deep, out-of-order speculative execution engine
that keeps the execution units executing instructions. It does so by providing a very large window of
instructions from which the execution units can choose. The large out-of-order instruction window
allows the processor to significantly reduce stalls that can occur while instructions are waiting for
dependencies to resolve. One of the more common forms of stalls is waiting for data to be loaded
from memory on a cache miss. This aspect is very important in high-frequency designs, as the latency
to main memory increases relative to the core frequency. The NetBurst microarchitecture can have up
to 126 instructions in this window (in flight) vs. the previous P6 microarchitecture's much smaller
window of 42 instructions. The Advanced Dynamic Execution engine also delivers an enhanced
branch prediction capability that allows the Pentium 4 processor to be more accurate in predicting
program branches. This has the net effect of reducing the number of branch mispredictions by about
33 percent over the P6 microarchitecture's branch prediction capability. It does this by implementing
a 4-KB branch target buffer that stores more detail on the history of past branches, as well as by
implementing a more advanced branch prediction algorithm. This enhanced branch prediction
capability is one of the key design elements that reduce the overall sensitivity of the NetBurst
microarchitecture to the branch misprediction penalty.
Rapid Execution Engine
The two Arithmetic Logic Units (ALUs) in the Intel Pentium 4 processor run at twice the core
frequency of the processor. This makes it possible to execute basic integer instructions (such as add,
subtract, logical AND, and logical OR) in half a clock cycle, with higher execution throughput and
reduced latency of execution. With the 1.40-GHz Intel Pentium 4 processor, each of the ALUs is
running at 2.80 GHz, increasing performance on integer-based applications.
Revolutionary Cache Subsystem
In order to increase performance and scalability, the Intel Pentium 4 processor features an innovative
new cache subsystem designed to optimize data transfer to the core. An execution trace cache stores
12K decoded instructions in the order of program flow instead of predecoded instructions that cannot
take code branches into consideration. The execution trace cache removes the decoder from the main
instruction loop and results in a higher performance, more efficient level 1 instruction cache. The Intel
Pentium 4 processor also includes a level 2 Advanced Transfer Cache (ATC). While still only 256 KB
in size, this ATC improves the data transfer rate between the on-die level 2 cache and the processor
core to 44.8 GB/s at 1.40 GHz, compared with 16 GB/s on a 1-GHz Intel Pentium III processor. The
level 2 Advanced Transfer Cache is able to clock 256 bits (32 bytes) of data into and out of the cache
on every clock cycle, unlike previous microarchitectures. The overall gain with the new cache
subsystem is that the transfer rate between the cache subsystem and the processor core is optimized
over previous subsystems in both bandwidth and latency. The enhanced cache subsystem delivers
increased performance and response on a wide variety of applications.
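The cache transfer rates and the ALU clock quoted in the last two subsections can be reproduced with simple arithmetic. The sketch below uses 32 bytes per transfer and the core frequencies from the text. The Pentium III line assumes a 32-byte transfer every second clock, which is consistent with the 16 GB/s figure but is not stated in the notes.

```python
def transfer_rate_gb_s(bytes_per_transfer, core_ghz, clocks_per_transfer=1):
    """Peak cache-to-core transfer rate in GB/s."""
    return bytes_per_transfer * core_ghz / clocks_per_transfer


p4_l2_rate = transfer_rate_gb_s(32, 1.40)                         # every clock
p3_l2_rate = transfer_rate_gb_s(32, 1.0, clocks_per_transfer=2)   # assumed every 2nd clock
alu_clock_ghz = 2 * 1.40                                          # ALUs run at twice core clock

print(round(p4_l2_rate, 1), round(p3_l2_rate, 1), round(alu_clock_ghz, 2))
```

The first value matches the 44.8 GB/s quoted for the 1.40-GHz Pentium 4, and the doubled ALU clock matches the 2.80 GHz figure.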
Streaming SIMD Extensions 2 (SSE2)
To make the SIMD instruction set even more powerful, the Intel Pentium 4 processor provides 144
new performance-improving instructions, including 128-bit SIMD double precision floating point,
128-bit SIMD integer, and improved cache and memory management instructions. The SSE2 extends
If the branch is not taken, the next instruction executed should be the one fetched from the next
sequential memory address after the branch instruction.
When the branch is taken for the first time, the execution unit provides feedback to the branch
prediction logic. The branch target address is sent back and recorded in the BTB. A directory entry is
made containing the source memory address that the branch instruction was fetched from, and the
history bits are set to indicate that the branch has been strongly taken. The history bits can indicate
one of four possible states.
1. Strongly Taken
The history bits are initialized to this state when the entry is first made. In addition, if a branch
marked weakly taken is taken again, it is upgraded to the strongly taken state. When a branch marked
strongly taken is not taken the next time, it is downgraded to weakly taken.
2. Weakly Taken
When a branch marked weakly taken is taken again, it is upgraded to the strongly taken state. When
it is not taken, it is downgraded to the weakly not taken state.
In the D1 stage, a hit on a strongly or weakly taken entry results in a positive prediction
(i.e., the branch is predicted taken).
3. Weakly Not Taken
If a branch marked weakly not taken is taken again, it is upgraded to the weakly taken state. When a
branch marked weakly not taken is not taken the next time, it is downgraded to strongly not taken.
4. Strongly Not Taken
If a branch marked strongly not taken is taken again, it is upgraded to the weakly not taken state.
When a branch marked strongly not taken is not taken the next time, it remains in the strongly not
taken state.
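The four history states behave like a two-bit saturating counter. The sketch below follows the upgrade and downgrade rules listed above. The function names are made up for the illustration, and predict_taken assumes the usual reading of such a scheme, namely that the two taken states give a positive prediction and the two not taken states a negative one.

```python
STATES = ["strongly not taken", "weakly not taken", "weakly taken", "strongly taken"]


def update(state, taken):
    """Move one step up when the branch is taken, one step down when it is
    not, saturating at the two strong states."""
    index = STATES.index(state)
    index = min(index + 1, len(STATES) - 1) if taken else max(index - 1, 0)
    return STATES[index]


def predict_taken(state):
    # Assumed: either of the two taken states yields a positive prediction.
    return STATES.index(state) >= 2


state = "strongly taken"            # state of a newly created entry
history = []
for outcome in [False, False, True, False, False, False]:
    state = update(state, outcome)
    history.append(state)
print(history)
```

Starting from the initial strongly taken state, a single not-taken outcome does not flip the prediction; two in a row do.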
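In the D1 stage, a hit on a weakly or strongly not taken entry results in a negative prediction
(i.e., the branch is predicted not taken).
Figure - History bit state transitions (movement when branch is taken; movement when branch is not taken).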
If the branch is predicted taken, the BTB supplies the branch target address back to the prefetcher and
indicates that a positive prediction is being made. In response, the prefetcher switches to the opposite
prefetch queue and immediately begins to prefetch from memory starting at the branch target
address. The instructions fetched are supplied to the instruction pipelines immediately behind the
branch instruction.