
The Micro Architecture of Intel Pentium 4

Netburst Microarchitecture formed the basis for a new family of Intel processors starting from the Pentium 4. Uses a deeply pipelined architecture to ensure a high clock rate. Uses high speed execution engine to reduce the latency of basic integer instructions.

Uploaded by

Rekha Govindaraj
Copyright
© Attribution Non-Commercial (BY-NC)

The Microarchitecture of Intel Pentium 4

Sudipta Mahapatra

Introduction
• The Intel Pentium 4 was introduced in November 2000, targeting a high clock rate of 1.5 GHz.
• The Netburst microarchitecture formed the basis for a new family of Intel processors, starting with the Pentium 4.
• It was developed to deliver a high level of performance for important applications such as multimedia.

Targeted application areas


• Internet audio and streaming video
• Image processing
• Video content creation
• Speech recognition
• 3D applications and games
• Video editing and video conferencing

Overview of the Netburst Microarchitecture


• Uses a deeply pipelined architecture to sustain a high clock rate.
• Uses a high-performance, quad-pumped bus interface: the 100 MHz system bus transfers data four times per clock, for an effective rate of 400 MHz.
• Uses a high-speed execution engine to reduce the latency of basic integer instructions.
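The quad-pumped bus figure can be checked with a little arithmetic. The sketch below is illustrative only: the function names are made up for this example, and the 8-byte (64-bit) data bus width is an assumption that does not appear on the slides.

```python
def effective_rate_mhz(bus_clock_mhz: float, transfers_per_clock: int) -> float:
    """Effective transfer rate of a multi-pumped bus, in millions of transfers per second."""
    return bus_clock_mhz * transfers_per_clock


def peak_bandwidth_gb_per_s(bus_clock_mhz: float, transfers_per_clock: int,
                            bus_width_bytes: int) -> float:
    """Peak bandwidth in GB/s (1 GB = 1e9 bytes) for the given bus parameters."""
    transfers_per_second = bus_clock_mhz * 1e6 * transfers_per_clock
    return transfers_per_second * bus_width_bytes / 1e9


# 100 MHz bus clock, four transfers per clock (quad-pumped).
rate = effective_rate_mhz(100, 4)

# Assumption: 8-byte-wide data bus (not stated in the slides).
bandwidth = peak_bandwidth_gb_per_s(100, 4, 8)

print(f"effective rate: {rate:.0f} MHz, peak bandwidth: {bandwidth:.1f} GB/s")
```

The bus clock itself stays at 100 MHz; only the number of data transfers per clock changes.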

Overview (Contd.)
• Out-of-order, speculative execution to enable parallelism.
• Superscalar issue to exploit as much parallelism as possible.

Main Features
• Hardware register renaming to overcome the limits of the small architectural register name space (removes WAR and WAW hazards).
• Cache line sizes of 64 bytes.
• Optimization for the common case of frequently executed instructions.
• Improved branch handling techniques.
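A minimal sketch of how register renaming removes name dependences follows. It assumes an unlimited pool of physical registers and a three-operand instruction format; both are simplifications for illustration, and the `rename` function and `p0`, `p1`, ... names are hypothetical, not the Pentium 4's actual mechanism.

```python
from itertools import count


def rename(instructions):
    """Map architectural registers to fresh physical registers.

    instructions: list of (dest, src1, src2) architectural register names.
    Returns the same instructions expressed with physical register names.
    Assumption: the pool of physical registers never runs out.
    """
    next_phys = count()
    mapping = {}  # architectural name -> current physical name

    def lookup(reg):
        # A register read before any write gets an initial physical home.
        if reg not in mapping:
            mapping[reg] = f"p{next(next_phys)}"
        return mapping[reg]

    renamed = []
    for dest, *srcs in instructions:
        phys_srcs = [lookup(s) for s in srcs]    # read sources under the old mapping
        mapping[dest] = f"p{next(next_phys)}"    # every write gets a fresh register
        renamed.append((mapping[dest], *phys_srcs))
    return renamed


program = [
    ("eax", "ebx", "ecx"),  # 0: writes eax
    ("edx", "eax", "ebx"),  # 1: reads eax (true dependence on 0, kept)
    ("eax", "esi", "edi"),  # 2: writes eax again (WAW with 0, WAR with 1)
]

for before, after in zip(program, rename(program)):
    print(before, "->", after)
```

After renaming, instructions 0 and 2 write different physical registers, so instruction 2 no longer has to wait for 0 or 1. The true dependence of 1 on 0 is preserved.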

Basic Block Diagram

[Figure: basic block diagram of the Pentium 4; the only surviving label is "Branch-history update". Source: Glenn Hinton et al., Intel Technology Journal, Q1 2001]

Main sections
1. In-order front end (FE)
2. Out-of-order execution logic (OOE)
3. Integer and floating-point execution units (EX)
4. Memory subsystem (M)

In order front end


• Fetches the instructions to be executed next.
• Supplies a set of decoded instructions to the execution pipeline.
• Uses accurate branch prediction logic to determine the branch target.
• The instructions from the branch target are decoded into a set of micro-operations (uops) that can be executed in the execution core.
• Uses the trace cache to store the uops corresponding to the most recently executed instructions.

Front end

[Figure: front-end block diagram; input from the L2 cache, output to the allocator/register renamer. Source: Glenn Hinton et al., Intel Technology Journal, Q1 2001]

Front end components


• Trace cache (TC): Serves as the L1 instruction cache, but holds the uops corresponding to the most recently decoded instructions.
• Delivers up to three uops per clock cycle to the OOE. Capacity = 12K uops.
• The L2 cache is accessed only on a TC miss.
• The trace cache has its own branch predictor that indicates where to go next in the trace cache. It is smaller than the front-end BTB, as it is concerned only with the subset of instructions currently in the trace cache.
• Also includes a 16-entry return address stack.
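The hit/miss behaviour described above can be sketched with a toy model. This is a heavy simplification: it caches uops per instruction address rather than building traces across branches, it uses least-recently-used eviction (an assumption; the slides do not state a replacement policy), and the `decode` callback stands in for the L2 fetch plus decoder path. The class and method names are hypothetical.

```python
from collections import OrderedDict


class TraceCacheModel:
    """Toy uop cache: returns cached uops on a hit, decodes on a miss."""

    def __init__(self, decode, capacity_uops=12 * 1024):
        self.decode = decode                # stand-in for L2 fetch + decode
        self.capacity_uops = capacity_uops  # 12K uops, as stated in the text
        self.lines = OrderedDict()          # address -> list of uops
        self.stored_uops = 0
        self.hits = 0
        self.misses = 0

    def fetch(self, address):
        if address in self.lines:
            self.hits += 1
            self.lines.move_to_end(address)  # mark as most recently used
            return self.lines[address]
        # Miss: only now is the slower decode path used.
        self.misses += 1
        uops = self.decode(address)
        self.lines[address] = uops
        self.stored_uops += len(uops)
        # Evict least recently used entries until the uop budget is met.
        while self.stored_uops > self.capacity_uops and len(self.lines) > 1:
            _, evicted = self.lines.popitem(last=False)
            self.stored_uops -= len(evicted)
        return uops


# Hypothetical decoder: one uop per instruction address.
tc = TraceCacheModel(decode=lambda addr: [f"uop@{addr:#x}"], capacity_uops=2)
for addr in (0x10, 0x10, 0x20, 0x30, 0x10):
    tc.fetch(addr)
print(f"hits={tc.hits} misses={tc.misses}")
```

The point of the model is the ordering of work: decoding happens only on a miss, so code that loops over recently executed instructions bypasses the decoder.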

Front end components (Contd.)


• Microcode ROM: Used for complex IA-32 instructions, such as string move, and for fault and interrupt handling. For a complex instruction, control is transferred to the microcode ROM, which then issues the needed uops.
• Instruction TLB/Prefetcher: Fetches instructions from the L2 cache on a TC miss. Translates the supplied IA-32 linear instruction address into the physical address needed to access the L2 cache.
• Front-end BTB: Steers the fetch of the IA-32 instruction bytes predicted to be executed next from the L2 cache. On a BTB miss, backward branches are predicted taken and forward branches not taken.
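The BTB-miss rule for the front-end BTB can be written as a short function. This is a sketch under stated assumptions: the `btb` dictionary is a hypothetical stand-in for the dynamic predictor's stored history, and any additional conditions real hardware applies are omitted.

```python
def predict_taken(branch_addr: int, target_addr: int, btb: dict) -> bool:
    """Return True if the branch is predicted taken.

    btb maps a branch address to its dynamically predicted direction
    (hypothetical representation). On a miss, fall back to the static rule:
    backward branches taken, forward branches not taken.
    """
    if branch_addr in btb:
        return btb[branch_addr]
    return target_addr < branch_addr  # backward branch -> predict taken


btb = {0x400: False}  # one branch with recorded history

print(predict_taken(0x400, 0x300, btb))  # BTB hit: history wins
print(predict_taken(0x500, 0x480, btb))  # miss, backward branch (loop-like)
print(predict_taken(0x500, 0x540, btb))  # miss, forward branch
```

The static rule favours loops: a backward branch usually closes a loop body and is taken on most iterations.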

Front end components (Contd.)


• Instruction decoder: Receives IA-32 instruction bytes from the L2 cache, 64 bits at a time, and decodes them into uops.
• Can decode at a maximum rate of one IA-32 instruction per clock cycle.
• Most instructions are converted into a single uop.
• If an instruction needs more than 4 uops, control is transferred to the microcode ROM.
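The routing rule between the decoder and the microcode ROM can be stated in a few lines. The function name and return strings are hypothetical, and the uop counts passed in are illustrative inputs rather than figures for any particular IA-32 instruction.

```python
MAX_DECODER_UOPS = 4  # threshold from the text: more than 4 uops goes to microcode


def uop_source(uop_count: int) -> str:
    """Say which front-end unit supplies the uops for one instruction."""
    if uop_count < 1:
        raise ValueError("an instruction produces at least one uop")
    if uop_count <= MAX_DECODER_UOPS:
        return "decoder"
    return "microcode ROM"


for n in (1, 4, 5):
    print(f"{n} uop(s): {uop_source(n)}")
```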


Out-of-order Execution logic


• Prepares the instructions for out-of-order execution.
• Uses aggressive reordering to execute instructions as soon as they are ready.
• Aims for maximal utilization of the execution resources.
• Has retirement logic that reorders the instructions so that they commit in program order.
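The difference between out-of-order completion and in-order commit can be seen in a small scheduling model. It assumes unlimited execution units and no issue-width limit, which real hardware does not have; the `schedule` function, uop names, and latencies are all hypothetical.

```python
def schedule(uops):
    """uops: list of (name, latency, deps), deps being indices of earlier uops.

    Returns (completion_order, retire_cycles):
      completion_order - uop names sorted by the cycle their result is ready
      retire_cycles    - cycle at which each uop commits, in program order
    Assumption: a uop starts as soon as all of its inputs are ready.
    """
    finish = []
    for _name, latency, deps in uops:
        start = max((finish[d] for d in deps), default=0)
        finish.append(start + latency)

    by_finish = sorted(range(len(uops)), key=lambda i: finish[i])
    completion_order = [uops[i][0] for i in by_finish]

    # Retirement is in program order: a uop cannot commit before its elders.
    retire_cycles, last = [], 0
    for f in finish:
        last = max(last, f)
        retire_cycles.append(last)
    return completion_order, retire_cycles


program = [
    ("load", 3, ()),    # long-latency operation
    ("add", 1, (0,)),   # needs the load result
    ("sub", 1, ()),     # independent, can run ahead
]
completion, retire = schedule(program)
print("completion order:", completion)
print("retire cycles (program order):", retire)
```

Here `sub` finishes first because it depends on nothing, yet it commits last, after `load` and `add`, which is the job of the retirement logic.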


Out-of-order Execution logic

[Figure: out-of-order execution logic block diagram; input from the uop queue, output to the execution units. Source: Glenn Hinton et al., Intel Technology Journal, Q1 2001]

Execution Units
• The execution units include several integer and floating-point units for result computation.
• The execution section also includes the L1 data cache, used for most load/store operations.


Execution Units

[Figure: execution units block diagram; input from the out-of-order execution logic, with a path from/to the memory subsystem. Source: Glenn Hinton et al., Intel Technology Journal, Q1 2001]

Memory Subsystem
• The memory section contains the L2 cache and the system bus.
• The system bus is used to access main memory when the L2 cache misses.
• It is also used to access the I/O resources.


Memory Subsystem

[Figure: memory subsystem block diagram; input from the execution units, output to the ITLB/prefetcher. Source: Glenn Hinton et al., Intel Technology Journal, Q1 2001]

Pentium 4 pipeline
• The P6 microarchitecture (Pentium II, Pentium III, Celeron) has twice the pipeline depth of the Pentium processor.
• The Netburst microarchitecture almost doubles the pipeline depth of P6.
- This allows a higher operating frequency.
- Different parts of the Pentium 4 operate at different clock frequencies.
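Why a deeper pipeline permits a higher clock can be illustrated with a common textbook model, in which the cycle time is the total logic delay divided by the number of stages, plus a fixed per-stage latch overhead. Everything numeric here is an assumption for illustration: the 10 ns logic delay and 0.1 ns overhead are made-up values, and the stage counts of 5, 10, and 20 are commonly cited figures that do not appear on the slides.

```python
def clock_ghz(stages: int, logic_delay_ns: float = 10.0,
              latch_overhead_ns: float = 0.1) -> float:
    """Clock frequency (GHz) under a simple pipelining model.

    cycle time = logic_delay / stages + latch_overhead
    Both delay parameters are hypothetical illustration values.
    """
    cycle_ns = logic_delay_ns / stages + latch_overhead_ns
    return 1.0 / cycle_ns


# Stage counts are assumptions, not taken from the slides.
for name, stages in [("Pentium", 5), ("P6", 10), ("Netburst", 20)]:
    print(f"{name:9s} {stages:2d} stages -> {clock_ghz(stages):.2f} GHz (model only)")
```

In this model, doubling the depth raises the frequency by less than a factor of two, because the latch overhead does not shrink with the stage count.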
