A Tour of the Pentium Pro Processor Microarchitecture

Introduction

One of the Pentium Pro processor's primary goals was to significantly exceed the
performance of the 100MHz Pentium processor while being manufactured on the same
semiconductor process. Using the same process as a volume production processor
practically assured that the Pentium Pro processor would be manufacturable, but
it meant that Intel had to focus on an improved microarchitecture for ALL of the
performance gains. This guided tour describes how multiple architectural
techniques (some proven in mainframe computers, some proposed in academia and
some we innovated ourselves) were carefully interwoven, modified, enhanced,
tuned and implemented to produce the Pentium Pro microprocessor. This unique
combination of architectural features, which Intel describes as Dynamic
Execution, enabled the first Pentium Pro processor silicon to exceed the
original performance goal.

Building from an already high platform

The Pentium processor set an impressive performance standard with its pipelined,
superscalar microarchitecture. The Pentium processor's pipelined implementation
uses five stages to extract high throughput from the silicon; the Pentium Pro
processor moves to a decoupled, 12-stage, superpipelined implementation, trading
less work per pipestage for more stages. The Pentium Pro processor reduced its
pipestage time by 33 percent, compared with a Pentium processor, which means the
Pentium Pro processor can have a 33% higher clock speed than a Pentium processor
and still be equally easy to produce from a semiconductor manufacturing process
(i.e., transistor speed) perspective.

The Pentium processor's superscalar microarchitecture, with its ability to
execute two instructions per clock, would be difficult to exceed without a new
approach. The new approach used by the Pentium Pro processor removes the
constraint of linear instruction sequencing between the traditional "fetch" and
"execute" phases, and opens up a wide instruction window using an instruction
pool. This approach allows the "execute" phase of the Pentium Pro processor to
have much more visibility into the program's instruction stream so that better
scheduling may take place. It requires the instruction "fetch/decode" phase of
the Pentium Pro processor to be much more intelligent in terms of predicting
program flow. Optimized scheduling requires the fundamental "execute" phase to
be replaced by decoupled "dispatch/execute" and "retire" phases. This allows
instructions to be started in any order but always be completed in the original
program order. The Pentium Pro processor is implemented as three independent
engines coupled with an instruction pool as shown in Figure 1 below.

What is the fundamental problem to solve?

Before starting our tour on how the Pentium Pro processor achieves its high
performance it is important to note why this three-independent-engine approach
was taken. A fundamental fact of today's microprocessor implementations must be
appreciated: most CPU cores are not fully utilized. Consider the code fragment
in Figure 2 below:

The first instruction in this example is a load of r1 that, at run time, causes
a cache miss. A traditional CPU core must wait for its bus interface unit to
read this data from main memory and return it before moving on to instruction 2.
This CPU stalls while waiting for this data and is thus being under-utilized.

While CPU speeds have increased 10-fold over the past 10 years, the speed of
main memory devices has only increased by 60 percent. This increasing memory
latency, relative to the CPU core speed, is a fundamental problem that the
Pentium Pro processor set out to solve. One approach would be to place the
burden of this problem onto the chipset, but a high-performance CPU that needs
very high speed, specialized support components is not a good solution for a
volume production system.

A brute-force approach to this problem is, of course, increasing the size of the
L2 cache to reduce the miss ratio. While effective, this is another expensive
solution, especially considering the speed requirements of today's L2 cache SRAM
components. Instead, the Pentium Pro processor is designed from an overall
system implementation perspective which will allow higher performance systems to
be designed with cheaper memory subsystem designs.

Pentium Pro processor takes an innovative approach

To avoid this memory latency problem the Pentium Pro processor "looks-ahead"
into its instruction pool at subsequent instructions and will do useful work
rather than be stalled. In the example in Figure 2, instruction 2 is not
executable since it depends upon the result of instruction 1; however both
instructions 3 and 4 are executable. The Pentium Pro processor speculatively
executes instructions 3 and 4. We cannot commit the results of this speculative
execution to permanent machine state (i.e., the programmer-visible registers)
since we must maintain the original program order, so the results are instead
stored back in the instruction pool awaiting in-order retirement. The core
executes instructions depending upon their readiness to execute and not on their
original program order (it is a true dataflow engine). This approach has the
side effect that instructions are typically executed out-of-order.
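
The readiness test described above can be sketched in a few lines of Python.
Figure 2 is not reproduced in this copy, so the four-instruction fragment below
is a hypothetical one chosen only to match the description (a load into r1, a
dependent instruction, and two independent ones); the register names and the
ready_instructions helper are assumptions made for illustration.

```python
# Hypothetical fragment consistent with the description of Figure 2.
# Each entry is: instruction number -> (destination, sources).
program = {
    1: ("r1", ["mem[r0]"]),   # load that misses the cache
    2: ("r2", ["r1", "r2"]),  # needs the result of instruction 1
    3: ("r5", ["r5"]),        # independent of the load
    4: ("r6", ["r6", "r3"]),  # independent of the load
}

def ready_instructions(program, pending):
    """Return instructions that do not read a register still being produced.

    pending holds the numbers of instructions that have started but whose
    results have not come back yet (for example, a load waiting on memory).
    """
    waiting_on = {program[i][0] for i in pending}  # registers not yet written
    ready = []
    for num, (dest, sources) in sorted(program.items()):
        if num in pending:
            continue
        if not any(src in waiting_on for src in sources):
            ready.append(num)
    return ready

# Instruction 1 is outstanding; which instructions can execute meanwhile?
print(ready_instructions(program, pending={1}))
```

With instruction 1 outstanding, only the instructions that do not read r1 are
reported as ready, which is the behavior the paragraph describes.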

The cache miss on instruction 1 will take many internal clocks, so the Pentium
Pro processor core continues to look ahead for other instructions that could be
speculatively executed and is typically looking 20 to 30 instructions in front
of the program counter. Within this 20- to 30-instruction window there will be,
on average, five branches that the fetch/decode unit must correctly predict if
the dispatch/execute unit is to do useful work. The sparse register set of an
Intel Architecture (IA) processor will create many false dependencies on
registers so the dispatch/execute unit will rename the IA registers to enable
additional forward progress. The retire unit owns the physical IA register set
and results are only committed to permanent machine state when it removes
completed instructions from the pool in original program order.
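
To illustrate how renaming removes false dependencies, here is a minimal sketch
in Python. It is a generic renaming scheme, not a description of the Pentium
Pro hardware; the rename function, the p0, p1, ... physical register names and
the three-instruction example are all assumptions.

```python
from itertools import count

def rename(instructions, arch_regs):
    """Give every register write a fresh physical register name.

    instructions: list of (destination, sources) using architectural names.
    Returns the same list expressed in physical register names.
    """
    fresh = count()                                     # physical reg numbers
    alias = {r: f"p{next(fresh)}" for r in arch_regs}   # current alias table
    renamed = []
    for dest, sources in instructions:
        srcs = [alias[s] for s in sources]              # read current mappings
        alias[dest] = f"p{next(fresh)}"                 # new name per write
        renamed.append((alias[dest], srcs))
    return renamed

# eax is reused for two unrelated computations, a false dependency.
code = [
    ("eax", ["ebx"]),   # eax <- f(ebx)
    ("ecx", ["eax"]),   # ecx <- g(eax)
    ("eax", ["edx"]),   # eax <- h(edx), unrelated to the first eax
]
for dest, srcs in rename(code, ["eax", "ebx", "ecx", "edx"]):
    print(dest, "<-", srcs)
```

After renaming, the third instruction writes a different physical register
from the one the second instruction reads, so it no longer has to wait.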

Dynamic Execution technology can be summarized as optimally adjusting
instruction execution by predicting program flow, analysing the program's
dataflow graph to choose the best order to execute the instructions, then having
the ability to speculatively execute instructions in the preferred order. The
Pentium Pro processor dynamically adjusts its work, as defined by the incoming
instruction stream, to minimize overall execution time.

Overview of the stops on the tour

We have previewed how the Pentium Pro processor takes an innovative approach to
overcome a key system constraint. Now let's take a closer look inside the
Pentium Pro processor to understand how it implements Dynamic Execution. Figure
3 below extends the basic block diagram to include the cache and memory
interfaces; these will also be stops on our tour. We shall travel down the
Pentium Pro processor pipeline to understand the role of each unit:

• The FETCH/DECODE unit: An in-order unit that takes as input the user program
  instruction stream from the instruction cache, and decodes them into a series
  of micro-operations (uops) that represent the dataflow of that instruction
  stream. The program pre-fetch is itself speculative.
• The DISPATCH/EXECUTE unit: An out-of-order unit that accepts the dataflow
  stream, schedules execution of the uops subject to data dependencies and
  resource availability and temporarily stores the results of these speculative
  executions.
• The RETIRE unit: An in-order unit that knows how and when to commit ("retire")
  the temporary, speculative results to permanent architectural state.
• The BUS INTERFACE unit: A partially ordered unit responsible for connecting
  the three internal units to the real world. The bus interface unit
  communicates directly with the L2 cache, supporting up to four concurrent
  cache accesses. The bus interface unit also controls a transaction bus, with
  MESI snooping protocol, to system memory.

Tour stop #1: The FETCH/DECODE unit

Figure 4 shows a more detailed view of the fetch/decode unit:

Let's start the tour at the Instruction Cache (ICache), a nearby place for
instructions to reside so that they can be looked up quickly when the CPU needs
them. The Next_IP unit provides the ICache index, based on inputs from the
Branch Target Buffer (BTB), trap/interrupt status, and branch-misprediction
indications from the integer execution section. The 512 entry BTB uses an
extension of Yeh's algorithm to provide greater than 90 percent prediction
accuracy. For now, let's assume that nothing exceptional is happening, and that
the BTB is correct in its predictions. (The Pentium Pro processor integrates
features that allow for the rapid recovery from a mis-prediction, but more of
that later.)
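
For readers unfamiliar with history-based prediction, the sketch below shows a
generic two-level adaptive predictor in Python: a per-branch history of recent
outcomes selects a 2-bit saturating counter. It is written in the spirit of
Yeh's scheme but is not the Pentium Pro BTB; the 4-bit history length, the
counter's starting value, the unbounded dictionaries (instead of 512 entries)
and all class and function names are assumptions.

```python
class TwoLevelPredictor:
    """Per-branch outcome history selects a 2-bit saturating counter."""

    def __init__(self, history_bits=4):
        self.history_bits = history_bits
        self.history = {}    # branch address -> recent outcomes as an int
        self.counters = {}   # (branch address, history) -> counter in 0..3

    def predict(self, addr):
        key = (addr, self.history.get(addr, 0))
        return self.counters.get(key, 1) >= 2   # True means "taken"

    def update(self, addr, taken):
        hist = self.history.get(addr, 0)
        key = (addr, hist)
        ctr = self.counters.get(key, 1)
        self.counters[key] = min(3, ctr + 1) if taken else max(0, ctr - 1)
        mask = (1 << self.history_bits) - 1
        self.history[addr] = ((hist << 1) | int(taken)) & mask

def accuracy(outcomes, addr=0x40):
    """Fraction of outcomes predicted correctly for a single branch."""
    predictor = TwoLevelPredictor()
    correct = 0
    for taken in outcomes:
        correct += predictor.predict(addr) == taken
        predictor.update(addr, taken)
    return correct / len(outcomes)

# A loop branch that is taken three times, then falls through, repeatedly.
pattern = [True, True, True, False] * 50
print(f"accuracy: {accuracy(pattern):.2f}")
```

On a short repeating pattern like this one the predictor mispredicts only
during warm-up, because each history value ends up with its own trained
counter.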

The ICache fetches the cache line corresponding to the index from the Next_IP,
and the next line, and presents 16 aligned bytes to the decoder. Two lines are
read because the IA instruction stream is byte-aligned, and code often branches
to the middle or end of a cache line. This part of the pipeline takes three
clocks, including the time to rotate the prefetched bytes so that they are
justified for the instruction decoders (ID). The beginning and end of the IA
instructions are marked.
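
The reason for reading two lines can be seen in a small Python sketch. The
16-byte window comes from the text; the 32-byte line size, the flat bytes
object standing in for memory and the fetch_window name are assumptions made
for illustration.

```python
LINE_SIZE = 32      # assumed cache line size in bytes (not stated in the text)
FETCH_BYTES = 16    # bytes presented to the decoders, per the text

def fetch_window(memory, addr):
    """Return FETCH_BYTES bytes starting at addr, reading two cache lines."""
    line_base = addr - (addr % LINE_SIZE)
    first = memory[line_base:line_base + LINE_SIZE]
    second = memory[line_base + LINE_SIZE:line_base + 2 * LINE_SIZE]
    both = first + second
    offset = addr - line_base
    # "Rotate" the bytes so the one at addr is first in the window.
    return both[offset:offset + FETCH_BYTES]

memory = bytes(range(128))
window = fetch_window(memory, 28)    # branch target near the end of a line
print(list(window))
```

A target four bytes before the end of a line still yields a full 16-byte
window, because the remaining twelve bytes come from the following line.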

Three parallel decoders accept this stream of marked bytes, and proceed to find
and decode the IA instructions contained therein. The decoder converts the IA
instructions into triadic uops (two logical sources, one logical destination per
uop). Most IA instructions are converted directly into single uops, some
instructions are decoded into one-to-four uops and the complex instructions
require microcode (the box labeled MIS in Figure 4; this microcode is just a set
of preprogrammed sequences of normal uops). Some instructions, called prefix
bytes, modify the following instruction, giving the decoder a lot of work to do.

The uops are enqueued, and sent to the Register Alias Table (RAT) unit, where
the logical IA-based register references are converted into Pentium Pro
processor physical register references, and to the Allocator stage, which adds
status information to the uops and enters them into the instruction pool. The
instruction pool is implemented as an array of Content Addressable Memory called
the ReOrder Buffer (ROB).

We have now reached the end of the in-order pipe.

Tour stop #2: The DISPATCH/EXECUTE unit

The dispatch unit selects uops from the instruction pool depending upon their
status. If the status indicates that a uop has all of its operands then the
dispatch unit checks to see if the execution resource needed by that uop is also
available. If both are true, it removes that uop and sends it to the resource
where it is executed. The results of the uop are later returned to the pool.
There are five ports on the Reservation Station and the multiple resources are
accessed as shown in Figure 5 below:

The Pentium Pro processor can schedule at a peak rate of 5 uops per clock, one
to each resource port, but a sustained rate of 3 uops per clock is typical. The
activity of this scheduling process is the quintessential out-of-order process;
uops are dispatched to the execution resources strictly according to dataflow
constraints and resource availability, without regard to the original ordering
of the program.

Note that the actual algorithm employed by this execution-scheduling process is
vitally important to performance. If only one uop per resource becomes
data-ready per clock cycle, then there is no choice. But if several are
available, which should it choose? It could choose randomly, or
first-come-first-served. Ideally it would choose whichever uop would shorten the
overall dataflow graph of the program being run. Since there is no way to really
know that at run-time, it approximates by using a pseudo FIFO scheduling
algorithm favoring back-to-back uops.
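
The choice among several ready uops can be illustrated with a simple
oldest-first policy in Python. This is a plain first-come-first-served sketch,
not the pseudo FIFO algorithm mentioned above, whose details the text does not
give; the pool layout, the port numbers and the select_uops name are
assumptions.

```python
def select_uops(pool, free_ports):
    """Pick at most one ready uop per free port, oldest first.

    pool: list of dicts with 'name', 'age' (lower is older), 'port', 'ready'.
    Returns the names of the chosen uops, ordered by port number.
    """
    chosen = {}
    for uop in sorted(pool, key=lambda u: u["age"]):
        port = uop["port"]
        if uop["ready"] and port in free_ports and port not in chosen:
            chosen[port] = uop
    return [chosen[p]["name"] for p in sorted(chosen)]

pool = [
    {"name": "add_a", "age": 3, "port": 0, "ready": True},
    {"name": "add_b", "age": 1, "port": 0, "ready": True},
    {"name": "load_c", "age": 2, "port": 2, "ready": True},
    {"name": "mul_d", "age": 0, "port": 1, "ready": False},
]
print(select_uops(pool, free_ports={0, 1, 2}))
```

Two uops compete for port 0 and the older one wins; the oldest uop overall is
skipped because its operands are not ready, which is what makes the dispatch
out-of-order.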

Note that many of the uops are branches, because many IA instructions are
branches. The Branch Target Buffer will correctly predict most of these branches
but it can't correctly predict them all. Consider a BTB that's correctly
predicting the backward branch at the bottom of a loop: eventually that loop is
going to terminate, and when it does, that branch will be mispredicted. Branch
uops are tagged (in the in-order pipeline) with their fallthrough address and
the destination that was predicted for them. When the branch executes, what the
branch actually did is compared against what the prediction hardware said it
would do. If those coincide, then the branch eventually retires, and most of the
speculatively executed work behind it in the instruction pool is good.

But if they do not coincide (a branch was predicted as taken but fell through,
or was predicted as not taken and it actually did take the branch) then the Jump
Execution Unit (JEU) changes the status of all of the uops behind the branch to
remove them from the instruction pool. In that case the proper branch
destination is provided to the BTB, which restarts the whole pipeline from the
new target address.
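
The recovery step can be sketched as follows in Python. The list of uops in
program order, the predicted_target field and the resolve_branch name are
assumptions; the sketch only shows the comparison and the removal of younger
uops, not how the hardware marks them.

```python
def resolve_branch(pool, branch_index, actual_target):
    """Compare a branch's outcome with its prediction; squash on mismatch.

    pool: uops in program order; the branch uop carries 'predicted_target'.
    Returns (surviving pool, restart address or None if predicted correctly).
    """
    branch = pool[branch_index]
    if branch["predicted_target"] == actual_target:
        return pool, None                      # prediction was right
    survivors = pool[:branch_index + 1]        # drop everything younger
    return survivors, actual_target            # refetch from the real target

pool = [
    {"op": "add"},
    {"op": "jcc", "predicted_target": 0x100},  # predicted taken to 0x100
    {"op": "load"},                            # speculative, wrong path
    {"op": "sub"},                             # speculative, wrong path
]
pool, restart = resolve_branch(pool, 1, actual_target=0x08)  # fell through
print([u["op"] for u in pool], hex(restart))
```

Only the uops that follow the branch are discarded; older work and the branch
itself remain in the pool and go on to retire normally.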

Tour stop #3: The RETIRE unit

Figure 6 shows a more detailed view of the retire unit:

The retire unit is also checking the status of uops in the instruction pool; it
is looking for uops that have executed and can be removed from the pool. Once
removed, the uops' original architectural target is written as per the original
IA instruction. The retirement unit must not only notice which uops are
complete, it must also re-impose the original program order on them. It must
also do this in the face of interrupts, traps, faults, breakpoints and
mispredictions.

There are two clock cycles devoted to the retirement process. The retirement
unit must first read the instruction pool to find the potential candidates for
retirement and determine which of these candidates are next in the original
program order. Then it writes the results of this cycle's retirements to both
the Instruction Pool and the Retirement Register File (RRF). The retirement
unit is capable of retiring 3 uops per clock.
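
In-order retirement can be sketched in Python as draining completed uops from
the head of the pool. The limit of three per clock is from the text; the list
used as the pool, the dictionary standing in for the architectural registers
and the retire_cycle name are assumptions, and interrupts and faults are not
modeled.

```python
RETIRE_WIDTH = 3   # uops retired per clock, per the text

def retire_cycle(rob, registers):
    """Retire up to RETIRE_WIDTH completed uops from the head of the pool."""
    retired = 0
    while rob and retired < RETIRE_WIDTH and rob[0]["done"]:
        uop = rob.pop(0)                        # oldest uop first
        registers[uop["dest"]] = uop["result"]  # commit to architectural state
        retired += 1
    return retired

rob = [
    {"dest": "eax", "result": 1, "done": True},
    {"dest": "ebx", "result": 2, "done": False},  # still executing
    {"dest": "ecx", "result": 3, "done": True},   # done, but must wait
]
registers = {}
print(retire_cycle(rob, registers), registers)
```

The third uop has finished executing but cannot retire until the second one
does, which is how the original program order is re-imposed.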

Tour stop #4: BUS INTERFACE unit

Figure 7 shows a more detailed view of the bus interface unit:

There are two types of memory access: loads and stores. Loads only need to
specify the memory address to be accessed, the width of the data being
retrieved, and the destination register. Loads are encoded into a single uop.
Stores need to provide a memory address, a data width, and the data to be
written. Stores therefore require two uops, one to generate the address, one to
generate the data. These uops are scheduled independently to maximize their
concurrency, but must re-combine in the store buffer for the store to complete.

Stores are never performed speculatively, there being no transparent way to undo
them. Stores are also never re-ordered among themselves. The Store Buffer
dispatches a store only when the store has both its address and its data, and
there are no older stores awaiting dispatch.
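
The two dispatch conditions (address and data both present, no older store
waiting) can be sketched in Python. The StoreBuffer class, its method names
and the string tags are assumptions for illustration, and the sketch does not
model the requirement that a store be non-speculative before it leaves.

```python
from collections import deque

class StoreBuffer:
    """Stores leave in program order, and only once address and data exist."""

    def __init__(self):
        self.entries = deque()            # oldest store at the left

    def allocate(self, tag):
        self.entries.append({"tag": tag, "addr": None, "data": None})

    def write_address(self, tag, addr):   # result of the address uop
        self._find(tag)["addr"] = addr

    def write_data(self, tag, data):      # result of the data uop
        self._find(tag)["data"] = data

    def _find(self, tag):
        return next(e for e in self.entries if e["tag"] == tag)

    def dispatch(self):
        """Return the oldest store if it is complete, else None."""
        if self.entries:
            head = self.entries[0]
            if head["addr"] is not None and head["data"] is not None:
                return self.entries.popleft()
        return None

sb = StoreBuffer()
sb.allocate("s1")
sb.allocate("s2")
sb.write_address("s2", 0x20)
sb.write_data("s2", 7)          # the younger store is complete first
print(sb.dispatch())            # s1 is still incomplete, so nothing leaves
sb.write_address("s1", 0x10)
sb.write_data("s1", 5)
print(sb.dispatch()["tag"], sb.dispatch()["tag"])
```

Because only the head of the queue is ever examined, a complete younger store
simply waits behind an incomplete older one.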

What impact will a speculative core have on the real world? Early in the Pentium
Pro processor project, we studied the importance of memory access reordering.
The basic conclusions were as follows:

• Stores must be constrained from passing other stores, for only a small
  impact on performance.
• Stores can be constrained from passing loads, for an inconsequential
  performance loss.
• Constraining loads from passing other loads or from passing stores creates
  a significant impact on performance.

So what we need is a memory subsystem architecture that allows loads to pass
stores. And we need to make it possible for loads to pass loads. The Memory
Order Buffer (MOB) accomplishes this task by acting like a reservation station
and Re-Order Buffer, in that it holds suspended loads and stores, redispatching
them when the blocking condition (dependency or resource) disappears.
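
One possible blocking condition is sketched below in Python: a load is held
back only while an older store might write the address it reads. The text does
not spell out the MOB's actual rules, so this conservative policy, the
dictionary layout and the load_may_issue name are all assumptions.

```python
def load_may_issue(load, older_stores):
    """A load may pass older stores unless one might write the same address."""
    for store in older_stores:
        if store["addr"] is None:          # address unknown: be conservative
            return False
        if store["addr"] == load["addr"]:  # true dependency on that store
            return False
    return True

older_stores = [{"addr": 0x10}, {"addr": 0x20}]
print(load_may_issue({"addr": 0x30}, older_stores))   # no conflict
print(load_may_issue({"addr": 0x20}, older_stores))   # must wait for the store
```

A suspended load would be retried once the conflicting store has dispatched,
which corresponds to the blocking condition disappearing.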

Tour Summary

It is the unique combination of improved branch prediction (to offer the core
many instructions), data flow analysis (choosing the best order), and
speculative execution (executing instructions in the preferred order) that
enables the Pentium Pro processor to deliver its performance boost over the
Pentium processor. This unique combination is called Dynamic Execution and it is
similar in impact to what "Superscalar" was to previous generation Intel
Architecture processors. While all your PC applications run on the Pentium Pro
processor, today's powerful 32-bit applications take best advantage of Pentium
Pro processor performance.

And while our architects were honing the Pentium Pro processor microarchitecture,
our silicon technologists were working on an advanced manufacturing process,
the 0.35 micron process. The result is that the initial Pentium Pro processor
CPU core speeds range up to 200MHz.
