A Tour of the Pentium Pro Processor Microarchitecture
Introduction
One of the Pentium Pro processor's primary goals was to significantly exceed the
performance of the 100MHz Pentium processor while being manufactured on the same
semiconductor process. Using the same process as a volume production processor
practically assured that the Pentium Pro processor would be manufacturable, but
it meant that Intel had to focus on an improved microarchitecture for ALL of the
performance gains. This guided tour describes how multiple architectural
techniques (some proven in mainframe computers, some proposed in academia and
some we innovated ourselves) were carefully interwoven, modified, enhanced,
tuned and implemented to produce the Pentium Pro microprocessor. This unique
combination of architectural features, which Intel describes as Dynamic
Execution, enabled the first Pentium Pro processor silicon to exceed the
original performance goal.
Building from an already high platform
The Pentium processor set an impressive performance standard with its pipelined,
superscalar microarchitecture. The Pentium processor's pipelined implementation
uses five stages to extract high throughput from the silicon; the Pentium Pro
processor moves to a decoupled, 12-stage, superpipelined implementation, trading
less work per pipestage for more stages. The Pentium Pro processor reduced its
pipestage time by 33 percent, compared with a Pentium processor, which means the
Pentium Pro processor can have a 33% higher clock speed than a Pentium processor
and still be equally easy to produce from a semiconductor manufacturing process
(i.e., transistor speed) perspective.
The Pentium processor's superscalar microarchitecture, with its ability to
execute two instructions per clock, would be difficult to exceed without a new
approach. The new approach used by the Pentium Pro processor removes the
constraint of linear instruction sequencing between the traditional "fetch" and
"execute" phases, and opens up a wide instruction window using an instruction
pool. This approach allows the "execute" phase of the Pentium Pro processor to
have much more visibility into the program's instruction stream so that better
scheduling may take place. It requires the instruction "fetch/decode" phase of
the Pentium Pro processor to be much more intelligent in terms of predicting
program flow. Optimized scheduling requires the fundamental "execute" phase to
be replaced by decoupled "dispatch/execute" and "retire" phases. This allows
instructions to be started in any order but always be completed in the original
program order. The Pentium Pro processor is implemented as three independent
engines coupled with an instruction pool as shown in Figure 1 below.
What is the fundamental problem to solve?
Before starting our tour of how the Pentium Pro processor achieves its high
performance, it is important to note why this three-independent-engine approach
was taken. A fundamental fact of today's microprocessor implementations must be
appreciated: most CPU cores are not fully utilized. Consider the code fragment
in Figure 2 below:
The first instruction in this example is a load of r1 that, at run time, causes
a cache miss. A traditional CPU core must wait for its bus interface unit to
read this data from main memory and return it before moving on to instruction 2.
This CPU stalls while waiting for this data and is thus being under-utilized.
While CPU speeds have increased 10-fold over the past 10 years, the speed of
main memory devices has only increased by 60 percent. This increasing memory
latency, relative to the CPU core speed, is a fundamental problem that the
Pentium Pro processor set out to solve. One approach would be to place the
burden of this problem onto the chipset, but a high-performance CPU that needs
very high speed, specialized support components is not a good solution for a
volume production system.
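
As a rough worked example of the gap described above, the two growth figures
quoted in the text can be combined into a single ratio. This is only arithmetic
on those two numbers; the variable names are illustrative.

```python
# Relative growth of the CPU-to-memory speed gap, using the figures in the text.
cpu_speedup = 10.0      # CPU speeds: 10-fold increase over 10 years
memory_speedup = 1.6    # main memory speed: 60 percent increase
relative_gap = cpu_speedup / memory_speedup
print(f"Memory is about {relative_gap:.2f}x slower relative to the core")
```

In other words, a main memory access that once cost a given number of core
clocks now costs roughly six times as many.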
A brute-force approach to this problem is, of course, increasing the size of the
L2 cache to reduce the miss ratio. While effective, this is another expensive
solution, especially considering the speed requirements of today's L2 cache SRAM
components. Instead, the Pentium Pro processor is designed from an overall
system implementation perspective which will allow higher performance systems to
be designed with cheaper memory subsystem designs.
Pentium Pro processor takes an innovative approach
To avoid this memory latency problem the Pentium Pro processor "looks ahead"
into its instruction pool at subsequent instructions and will do useful work
rather than be stalled. In the example in Figure 2, instruction 2 is not
executable since it depends upon the result of instruction 1; however, both
instructions 3 and 4 are executable. The Pentium Pro processor speculatively
executes instructions 3 and 4. We cannot commit the results of this speculative
execution to permanent machine state (i.e., the programmer-visible registers)
since we must maintain the original program order, so the results are instead
stored back in the instruction pool awaiting in-order retirement. The core
executes instructions depending upon their readiness to execute and not on their
original program order (it is a true dataflow engine). This approach has the
side effect that instructions are typically executed out-of-order.
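
The look-ahead behavior just described can be sketched with a toy dataflow
model. The four-instruction fragment below is hypothetical: it is chosen only to
match the dependencies the text attributes to Figure 2 (instruction 1 is a load
of r1 that misses, instruction 2 needs r1, instructions 3 and 4 are
independent). The latencies, the unlimited execution resources and the name
`simulate` are assumptions as well, not properties of the real hardware.

```python
# Toy dataflow model: an instruction starts as soon as its sources are ready,
# regardless of program order. Unlimited execution resources are assumed.

# (number, destination, sources, latency in clocks) -- hypothetical fragment
PROGRAM = [
    (1, "r1", ["r0"], 10),       # load of r1 that misses (assumed 10 clocks)
    (2, "r2", ["r1", "r2"], 1),  # depends on the result of instruction 1
    (3, "r5", ["r5"], 1),        # independent of instructions 1 and 2
    (4, "r6", ["r6", "r3"], 1),  # independent of instructions 1 and 2
]

def simulate(program):
    """Return {instruction number: clock at which its result is available}."""
    ready_at = {}  # register -> clock when its newest value becomes available
    done = {}
    for num, dest, srcs, latency in program:
        start = max(ready_at.get(s, 0) for s in srcs)
        done[num] = start + latency
        ready_at[dest] = done[num]
    return done

completion = simulate(PROGRAM)
# Order in which results become available (ties broken by program order).
completion_order = sorted(completion, key=lambda n: (completion[n], n))
print(completion_order)
```

In this sketch instructions 3 and 4 finish while the load is still outstanding,
and instruction 2 finishes last; retirement would still present the results in
the order 1, 2, 3, 4.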
The cache miss on instruction 1 will take many internal clocks, so the Pentium
Pro processor core continues to look ahead for other instructions that could be
speculatively executed and is typically looking 20 to 30 instructions in front
of the program counter. Within this 20- to 30-instruction window there will be,
on average, five branches that the fetch/decode unit must correctly predict if
the dispatch/execute unit is to do useful work. The sparse register set of an
Intel Architecture (IA) processor will create many false dependencies on
registers, so the dispatch/execute unit will rename the IA registers to enable
additional forward progress. The retire unit owns the physical IA register set,
and results are only committed to permanent machine state when it removes
completed instructions from the pool in original program order.
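
A minimal sketch of how renaming removes false register dependencies follows.
It is a generic illustration of the technique, not the Pentium Pro's actual
mechanism: the function `rename`, the physical register names p0, p1, ... and
the sample instruction sequence are all assumptions made for the example.

```python
# Minimal register-renaming sketch: every write gets a fresh physical register,
# so a later reuse of the same logical register no longer conflicts with
# earlier readers or writers of it. All names here are illustrative.
from itertools import count

def rename(program):
    """program: list of (dest, [sources]) using logical register names."""
    alias = {}       # logical register -> physical register holding its value
    fresh = count()  # supplies the numbers for p0, p1, p2, ...
    renamed = []
    for dest, srcs in program:
        # Sources read whichever physical register currently holds the value;
        # registers not yet written keep their logical (architectural) name.
        new_srcs = [alias.get(s, s) for s in srcs]
        alias[dest] = f"p{next(fresh)}"
        renamed.append((alias[dest], new_srcs))
    return renamed

# eax is reused for two unrelated computations (a false dependency).
program = [
    ("eax", ["ebx"]),  # eax = f(ebx)
    ("ecx", ["eax"]),  # ecx = g(eax)  true dependency on the line above
    ("eax", ["edx"]),  # eax = h(edx)  only reuses the name eax
    ("esi", ["eax"]),  # esi = k(eax)  true dependency on the line above
]
renamed = rename(program)
for line in renamed:
    print(line)
```

After renaming, the third and fourth operations use p2 and p3 and share nothing
with the first two, so the two pairs can proceed independently.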
Dynamic Execution technology can be summarized as optimally adjusting
instruction execution by predicting program flow, analyzing the program's
dataflow graph to choose the best order to execute the instructions, then having
the ability to speculatively execute instructions in the preferred order. The
Pentium Pro processor dynamically adjusts its work, as defined by the incoming
instruction stream, to minimize overall execution time.
Overview of the stops on the tour
We have previewed how the Pentium Pro processor takes an innovative approach to
overcome a key system constraint. Now let's take a closer look inside the
Pentium Pro processor to understand how it implements Dynamic Execution. Figure
3 below extends the basic block diagram to include the cache and memory
interfaces; these will also be stops on our tour. We shall travel down the
Pentium Pro processor pipeline to understand the role of each unit:
• The FETCH/DECODE unit: An in-order unit that takes as input the user program
instruction stream from the instruction cache, and decodes it into a series of
micro-operations (uops) that represent the dataflow of that instruction stream.
The program pre-fetch is itself speculative.
• The DISPATCH/EXECUTE unit: An out-of-order unit that accepts the dataflow
stream, schedules execution of the uops subject to data dependencies and
resource availability, and temporarily stores the results of these speculative
executions.
• The RETIRE unit: An in-order unit that knows how and when to commit ("retire")
the temporary, speculative results to permanent architectural state.
• The BUS INTERFACE unit: A partially ordered unit responsible for connecting the
three internal units to the real world. The bus interface unit communicates
directly with the L2 cache, supporting up to four concurrent cache accesses. The
bus interface unit also controls a transaction bus, with MESI snooping protocol,
to system memory.
Tour stop #1: The FETCH/DECODE unit
Figure 4 shows a more detailed view of the fetch/decode unit:
Let's start the tour at the Instruction Cache (ICache), a nearby place for
instructions to reside so that they can be looked up quickly when the CPU needs
them. The Next_IP unit provides the ICache index, based on inputs from the
Branch Target Buffer (BTB), trap/interrupt status, and branch-misprediction
indications from the integer execution section. The 512-entry BTB uses an
extension of Yeh's algorithm to provide greater than 90 percent prediction
accuracy. For now, let's assume that nothing exceptional is happening, and that
the BTB is correct in its predictions. (The Pentium Pro processor integrates
features that allow for the rapid recovery from a mis-prediction, but more of
that later.)
The ICache fetches the cache line corresponding to the index from the Next_IP,
and the next line, and presents 16 aligned bytes to the decoder. Two lines are
read because the IA instruction stream is byte-aligned, and code often branches
to the middle or end of a cache line. This part of the pipeline takes three
clocks, including the time to rotate the prefetched bytes so that they are
justified for the instruction decoders (ID). The beginning and end of the IA
instructions are marked.
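
One possible reading of why two lines are fetched is sketched below. The 32-byte
line size, the flat `memory` byte string and the function name `fetch_window`
are assumptions for illustration; only the 16-byte window comes from the text,
and the real rotate logic is certainly more involved.

```python
LINE_SIZE = 32    # assumed cache-line size in bytes (not stated in the text)
FETCH_BYTES = 16  # bytes presented to the decoders, per the text

def fetch_window(memory, ip):
    """Return the 16 bytes starting at ip, read from the line holding ip
    plus the following line, so the window may straddle a line boundary."""
    line_start = (ip // LINE_SIZE) * LINE_SIZE
    two_lines = memory[line_start:line_start + 2 * LINE_SIZE]
    offset = ip - line_start           # how far into the first line ip falls
    return two_lines[offset:offset + FETCH_BYTES]  # "rotate" to justify bytes

memory = bytes(range(128))         # stand-in for instruction bytes
window = fetch_window(memory, 30)  # a branch target near the end of a line
print(list(window))
```

With a target two bytes before the end of a line, a single-line fetch would
deliver only two useful bytes; reading the next line as well fills the window.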
Three parallel decoders accept this stream of marked bytes, and proceed to find
and decode the IA instructions contained therein. The decoder converts the IA
instructions into triadic uops (two logical sources, one logical destination per
uop). Most IA instructions are converted directly into single uops, some
instructions are decoded into one to four uops, and the complex instructions
require microcode (the box labeled MIS in Figure 4; this microcode is just a set
of preprogrammed sequences of normal uops). Some instructions, called prefix
bytes, modify the following instruction, giving the decoder a lot of work to do.
The uops are enqueued, and sent to the Register Alias Table (RAT) unit, where
the logical IA-based register references are converted into Pentium Pro
processor physical register references, and to the Allocator stage, which adds
status information to the uops and enters them into the instruction pool. The
instruction pool is implemented as an array of Content Addressable Memory called
the ReOrder Buffer (ROB).
We have now reached the end of the in-order pipe.
Tour stop #2: The DISPATCH/EXECUTE unit
The dispatch unit selects uops from the instruction pool depending upon their
status. If the status indicates that a uop has all of its operands, then the
dispatch unit checks to see if the execution resource needed by that uop is also
available. If both are true, it removes that uop and sends it to the resource
where it is executed. The results of the uop are later returned to the pool.
There are five ports on the Reservation Station and the multiple resources are
accessed as shown in Figure 5 below:
The Pentium Pro processor can schedule at a peak rate of 5 uops per clock, one
to each resource port, but a sustained rate of 3 uops per clock is typical. The
activity of this scheduling process is the quintessential out-of-order process;
uops are dispatched to the execution resources strictly according to dataflow
constraints and resource availability, without regard to the original ordering
of the program.
Note that the actual algorithm employed by this execution-scheduling process is
vitally important to performance. If only one uop per resource becomes
data-ready per clock cycle, then there is no choice. But if several are
available, which should it choose? It could choose randomly, or
first-come-first-served. Ideally it would choose whichever uop would shorten the
overall dataflow graph of the program being run. Since there is no way to really
know that at run-time, it approximates by using a pseudo-FIFO scheduling
algorithm favoring back-to-back uops.
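
The selection problem can be illustrated with a plain oldest-first policy. This
is a simplified stand-in, not the processor's pseudo-FIFO algorithm: the
preference for back-to-back uops is not modeled, and the `dispatch` function,
the uop fields and the port numbering are assumptions for the example.

```python
def dispatch(pool, num_ports=5):
    """Pick at most one data-ready uop per port, oldest first.

    pool: list of dicts with 'age' (smaller = older), 'port' and 'ready'.
    Returns {port: age of the uop dispatched to that port this clock}.
    """
    chosen = {}
    for uop in sorted(pool, key=lambda u: u["age"]):
        port = uop["port"]
        if uop["ready"] and 0 <= port < num_ports and port not in chosen:
            chosen[port] = uop["age"]
    return chosen

pool = [
    {"age": 0, "port": 0, "ready": False},  # oldest, but still waiting on data
    {"age": 1, "port": 0, "ready": True},
    {"age": 2, "port": 0, "ready": True},   # loses port 0 to the older uop
    {"age": 3, "port": 2, "ready": True},
]
selected = dispatch(pool)
print(selected)
```

Here the oldest uop is skipped because it is not data-ready, and of the two
ready uops competing for port 0 the older one wins.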
Note that many of the uops are branches, because many IA instructions are
branches. The Branch Target Buffer will correctly predict most of these branches
but it can't correctly predict them all. Consider a BTB that's correctly
predicting the backward branch at the bottom of a loop: eventually that loop is
going to terminate, and when it does, that branch will be mispredicted. Branch
uops are tagged (in the in-order pipeline) with their fallthrough address and
the destination that was predicted for them. When the branch executes, what the
branch actually did is compared against what the prediction hardware said it
would do. If those coincide, then the branch eventually retires, and most of the
speculatively executed work behind it in the instruction pool is good.
But if they do not coincide (a branch was predicted as taken but fell through,
or was predicted as not taken and it actually did take the branch) then the Jump
Execution Unit (JEU) changes the status of all of the uops behind the branch to
remove them from the instruction pool. In that case the proper branch
destination is provided to the BTB, which restarts the whole pipeline from the
new target address.
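
The check-and-flush step can be sketched as follows. The pool is reduced to a
list of sequence numbers in program order, and `resolve_branch` and its
parameters are hypothetical names; the sketch shows only the comparison and the
removal of younger uops, not how the hardware marks them.

```python
def resolve_branch(pool, branch_seq, predicted_target, actual_target):
    """Compare a branch's predicted and actual targets.

    pool: uop sequence numbers in program order (larger = younger).
    Returns (surviving pool, restart address or None if predicted correctly).
    """
    if predicted_target == actual_target:
        return pool, None                  # speculative work behind it is good
    survivors = [seq for seq in pool if seq <= branch_seq]
    return survivors, actual_target        # refetch from the proper target

pool = [10, 11, 12, 13, 14]
kept, restart = resolve_branch(pool, 12, predicted_target=0x400,
                               actual_target=0x480)
print(kept, hex(restart))
```

A correct prediction leaves the pool untouched; a misprediction discards every
uop younger than the branch and supplies the address to restart from.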
Tour stop #3: The RETIRE unit
Figure 6 shows a more detailed view of the retire unit:
The retire unit is also checking the status of uops in the instruction pool; it
is looking for uops that have executed and can be removed from the pool. Once
removed, the uops' original architectural target is written as per the original
IA instruction. The retirement unit must not only notice which uops are complete,
it must also re-impose the original program order on them. It must also do this
in the face of interrupts, traps, faults, breakpoints and mis-predictions.
There are two clock cycles devoted to the retirement process. The retirement
unit must first read the instruction pool to find the potential candidates for
retirement and determine which of these candidates are next in the original
program order. Then it writes the results of this cycle's retirements to both
the Instruction Pool and the Retirement Register File (RRF). The retirement unit
is capable of retiring 3 uops per clock.
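
In-order retirement with a fixed width can be sketched like this. The width of
3 comes from the text; the deque representation, the `retire_one_clock` name and
the collapsing of the two-cycle process into one step are simplifying
assumptions.

```python
from collections import deque

RETIRE_WIDTH = 3  # from the text: up to 3 uops retire per clock

def retire_one_clock(rob):
    """rob: deque of (seq, executed) tuples in program order, oldest first.

    Retires the oldest executed uops, stopping at the first unfinished one
    so that original program order is preserved.
    """
    retired = []
    while rob and rob[0][1] and len(retired) < RETIRE_WIDTH:
        retired.append(rob.popleft()[0])
    return retired

rob = deque([(1, True), (2, True), (3, False), (4, True)])
first = retire_one_clock(rob)   # uops 1 and 2 retire
second = retire_one_clock(rob)  # nothing retires: uop 3 has not executed
print(first, second)
```

Uop 4 has executed but must wait behind uop 3, which is exactly the
re-imposition of program order that the retire unit is responsible for.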
Tour stop #4: BUS INTERFACE unit
Figure 7 shows a more detailed view of the bus interface unit:
There are two types of memory access: loads and stores. Loads only need to
specify the memory address to be accessed, the width of the data being retrieved,
and the destination register. Loads are encoded into a single uop. Stores need
to provide a memory address, a data width, and the data to be written. Stores
therefore require two uops, one to generate the address, one to generate the
data. These uops are scheduled independently to maximize their concurrency, but
must re-combine in the store buffer for the store to complete.
Stores are never performed speculatively, there being no transparent way to undo
them. Stores are also never re-ordered among themselves. The Store Buffer
dispatches a store only when the store has both its address and its data, and
there are no older stores awaiting dispatch.
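
The store dispatch rule stated above can be sketched as a small function. The
list-of-dicts buffer, the field names and `next_store_to_dispatch` are
assumptions for illustration; the rule itself (address and data both present,
no older store still waiting) is the one given in the text.

```python
def next_store_to_dispatch(store_buffer):
    """store_buffer: list of dicts in program order (oldest first), with
    'address' and 'data' left as None until the matching uop has executed.

    Stores leave strictly in order, so only the oldest entry is a candidate.
    """
    if not store_buffer:
        return None
    oldest = store_buffer[0]
    if oldest["address"] is not None and oldest["data"] is not None:
        return store_buffer.pop(0)
    return None

buffer = [
    {"id": "A", "address": 0x1000, "data": None},  # data uop not yet executed
    {"id": "B", "address": 0x2000, "data": 7},     # complete, but younger
]
blocked = next_store_to_dispatch(buffer)   # None: B may not pass A
buffer[0]["data"] = 42                     # A's data uop completes
released = next_store_to_dispatch(buffer)  # now A dispatches
print(blocked, released["id"])
```

Store B is ready first but is held back until the older store A has both halves,
which keeps stores in program order among themselves.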
What impact will a speculative core have on the real world? Early in the Pentium
Pro processor project, we studied the importance of memory access reordering.
The basic conclusions were as follows:
• Stores must be constrained from passing other stores, for only a small
impact on performance.
• Stores can be constrained from passing loads, for an inconsequential
performance loss.
• Constraining loads from passing other loads or from passing stores creates
a significant impact on performance.
So what we need is a memory subsystem architecture that allows loads to pass
stores. And we need to make it possible for loads to pass loads. The Memory
Order Buffer (MOB) accomplishes this task by acting like a reservation station
and Re-Order Buffer, in that it holds suspended loads and stores, redispatching
them when the blocking condition (dependency or resource) disappears.
Tour Summary
It is the unique combination of improved branch prediction (to offer the core
many instructions), data flow analysis (choosing the best order), and
speculative execution (executing instructions in the preferred order) that
enables the Pentium Pro processor to deliver its performance boost over the
Pentium processor. This unique combination is called Dynamic Execution, and its
impact is similar to that of "Superscalar" on previous generation Intel
Architecture processors. While all your PC applications run on the Pentium Pro
processor, today's powerful 32-bit applications take best advantage of Pentium
Pro processor performance.
And while our architects were honing the Pentium Pro processor microarchitecture,
our silicon technologists were working on an advanced manufacturing process, the
0.35 micron process. The result is that the initial Pentium Pro processor CPU
core speeds range up to 200MHz.