ExtremeTech - Print Article
1 of 27
http://www.extremetech.com/print_article/0,3998,a=22731,00.as
64-Bit CPUs: What You Need to Know
February 8, 2002
By: Jim Turley
It's the peak, the top, it's the Mona Lisa. It's the $64,000 Question: what processor will dominate 64-bit
computing? Sixty-four bits holds the promise of new performance, new architectures, new compilers, and a
new balance of power in CPU realpolitik. A clean break with the old, a new chance for the new.
What hardware or architectural changes are in store for 64 bits? Quite a lot, although few of them have to
do with 64-bittedness, per se. But 64-bit processors are at today's very high end, and they showcase all the
best thinking in microprocessor design. This is the cutting edge, where silicon manufacturing, computer
architecture, compiler technology, and marketing wizardry all come together. In the words of Calvin and
Hobbes, scientific progress goes "Boink!"
For most of us waiting breathlessly on the sidelines, the 64-bit battle is between Intel's IA-64 and AMD's
Hammer architectures. Separately, we'll evaluate the pros and cons of the "other" 64-bit processors used
in workstations and servers, such as SPARC, Power, MIPS, and Alpha.
In this first segment of our 64-bit computing series, we'll launch into the wonder that is IA-64. You've
probably seen much information already written about Itanium and IA-64 architecture in the past few years,
which is mostly a replay of Intel-generated information. We'll try to get beyond the standard facts and hype,
and take a critical look at Itanium and IA-64/EPIC, by describing features and delivering some critical
analyses. We'll set the stage for an architectural comparison with Hammer and other 64-bit architectures in
future segments.
To be clear, this 64-bit computing architecture series is not focused on performance testing. It's focused on
architecture and long-term potential. We will, however, point you to a few Itanium performance
studies on the Web.
Intel IA-64 & Itanium
Intel's IA-64 (née Tahoe) architecture had a gestation period longer than that of an elephant. After first
announcing their cooperation in 1994, Hewlett Packard and Intel said the first offspring of their matrimony
would arrive "not before 1998," a prognostication that certainly proved to be true. In reality, the design was
even longer in the making, for Intel and HP had stealthily begun working well before their mid-'94
announcement.
Ten years and 325 million transistors later, we behold Itanium (all the good names were taken). Originally
code-named Merced, Itanium is the first-born of the IA-64 family and our first real look into how well IA-64
will--or won't--work. (The subsequent offspring, code-named McKinley, Madison, and Deerfield, are covered
later in this article.) First and most obviously, Itanium, like all IA-64 processors, is not an x86 chip. It is a
clean break from the long and legendary x86 (or IA-32, in Intel parlance) architecture that Intel invented,
seemingly back when Earth was still cooling, and which propelled the Santa Clara company to such
heights. Yes, Itanium is able to run x86 code in backward-compatibility mode, but that compatibility is
tacked on; in its element, Itanium and all IA-64 chips are nothing at all like Pentium.
That's both good news and bad news, as we shall see. It's good to be free from the tyranny of the x86
architecture, considered by many programmers to be the worst 8-bit, 16-bit, or 32-bit (take your pick) CPU
family ever developed. That it should have succeeded so spectacularly is enough to shake one's faith in
divine forces. The bad news? IA-64 leaves behind everything that made x86 chips ubiquitous, and
presumably replaces it all with new bugs, new quirks, and new head-scratchers, leaving us to wonder,
"why the hell did they design it that way?"
The Heart of the Beast: A Modified VLIW Core
Internally, Itanium is a six-issue processor, meaning it can profitably handle six instructions simultaneously.
It's also a VLIW (very long instruction word) machine with some enhancements for added flexibility in
instruction groupings, less code expansion than classic VLIW designs, and better scalability, to permit
wider parallel instruction issue in future IA-64 processors. Thus Intel prefers the term EPIC: Explicitly
Parallel Instruction-set Computing.
Itanium has nine execution units and future IA-64 processors will probably have more. The nine are
grouped into two integer units, two combo integer-and-load/store units, two floating-point units, and three
branch units. These four groups are significant, as we shall see in a moment.
Here's a simplified Itanium block diagram:
And here's a more complex block diagram:
Itanium has a 10-stage pipeline, which is respectable but not impressive by today's standards. Again,
future IA-64 processors may have different and probably longer pipes. For comparison, Pentium III has a
12-stage pipeline, but the Alpha 21264 has just eight stages. And Pentium 4 has 20 stages (from the point
of fetching micro-ops from its trace cache), and Athlon has 10 stages.
Here's a basic Itanium pipeline diagram:
328 Registers and Counting
The Itanium processor has a massive register set, with 128 general-purpose integer registers (each 64 bits
wide), 128 floating-point registers (each 82 bits wide), 64 1-bit predicate registers, 8 branch registers, and a
whole bunch of other registers scattered among several different functions, including some for x86
backward compatibility. Like a lot of RISC processors, the first register (GR0) is hard-wired to a permanent
zero, making it worthless for storage but useful as a constant for inputs and a bit bucket for outputs.
Here's a simplified diagram of key application registers:
Here's a detailed diagram of application and system-level register sets:
And of course Itanium supports standard 32-bit x86 execution modes, and the 32-bit registers are mapped
onto the IA-64 registers. See details in the section titled "Don't Look Back: How Itanium Handles x86 Code"
down below.
What a far cry from the cramped, crowded register set of the x86! With 256 registers to play with,
programmers have an embarrassment of riches. To avoid that embarrassment, IA-64 has two features that
manage the register file: register frames and register rotation. These require some explanation…
Register Your Window Frames
Registers are great when your program is running, but pushing and popping 128 big ol' registers for
subroutine calls is unpleasantly time-consuming (and usually not necessary anyway). It's traditional but it's
inefficient. One alternative is register windows, of which SPARC processors are a notable proponent.
Register windows have their problems, too, and it's no coincidence that the only major RISC architecture to
use register windows is also the slowest major RISC architecture still in production. IA-64 gets around the
constant pushing and popping by using register frames.
The first 32 of the 128 integer registers are global, available to all tasks at all times. The other 96, though,
can be framed, rotated, or both. Before a function call, you use Itanium's ALLOC instruction (which is
unrelated to the C function of the same name) to shift the apparent arrangement of the general-purpose
registers so that it appears that parameters are being passed from one function to another through shared
registers. In reality, ALLOC changes the mapping of the logical (software-visible) registers to the physical
registers, much like SPARC does. The similarities with SPARC's windows are strong and the differences
mostly minor. With IA-64's frames, the frame size is arbitrary, unlike SPARC, which supports a few different
fixed frame sizes. In the example illustration, the calling routine sets aside 11 registers (GR32 - GR42) for
the called routine, with four registers overlapping. The overlapping registers are where the parameters will
be passed, although they never really move. Regardless of what registers either routine physically uses,
they will appear to be contiguous with the first 32 fixed registers, GR0 - GR31.
The maximum frame size is all 96 registers, plus the 32 globals that are always visible. Only the integer
registers are framed; FP registers and predicate registers (described below) are not. The minimum frame
size is one register, or you can choose not to use ALLOC at all.
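The frame mechanics described above can be modeled in a few lines. The sketch below is purely illustrative (the class and method names are invented; only the register counts and the GR32 numbering come from the architecture): ALLOC just moves the base of the logical-to-physical mapping, so the caller's output registers become the callee's first registers without any data moving.

```python
# Toy model of IA-64 register frames: 32 global registers (GR0-GR31) plus a
# 96-register stacked area whose mapping ALLOC shifts on each call.
GLOBALS = 32          # GR0-GR31, always visible
STACKED = 96          # GR32-GR127, framed per call

class RegisterStack:
    def __init__(self):
        self.phys = [0] * (GLOBALS + STACKED)  # physical register file
        self.base = 0                          # physical offset of current frame

    def alloc(self, frame_size, outputs):
        """Enter a callee frame: its logical GR32 lands on the caller's
        first output register, so parameters never actually move."""
        self.base += frame_size - outputs

    def read(self, logical):
        if logical < GLOBALS:                  # globals bypass the mapping
            return self.phys[logical]
        return self.phys[GLOBALS + self.base + (logical - GLOBALS)]

    def write(self, logical, value):
        if logical < GLOBALS:
            self.phys[logical] = value
        else:
            self.phys[GLOBALS + self.base + (logical - GLOBALS)] = value

# The article's example: an 11-register frame (GR32-GR42), 4 overlapping.
rs = RegisterStack()
rs.write(39, 7)          # caller puts an argument in its output register GR39
rs.alloc(11, 4)          # call: the mapping shifts by 11 - 4 = 7 registers
assert rs.read(32) == 7  # callee sees the same value as its own GR32
```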
Rotating Registers
On top of the frames, there's register rotation, a feature that helps loop unrolling more than parameter
passing. With rotation, Itanium can shift up to 96 of its general-purpose registers (the first 32 are still fixed
and global) by one or more apparent positions. Why? So that iterative loops that hammer on the same
register(s) time after time can all be dispatched and executed at once without stepping on each other. Each
instance of the loop actually targets different physical registers, allowing them all to be in flight at once.
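The rotation just described can be sketched the same way. This toy model (names invented; real rotation also covers fixed ranges of FP and predicate registers) shows the payoff for unrolled loops: each iteration writes "the same" logical register, yet every value lands in a distinct physical register.

```python
# Toy model of IA-64 register rotation: the stacked registers rotate by one
# position per loop iteration, so overlapped iterations don't collide.
SIZE = 96

class RotatingFile:
    def __init__(self):
        self.phys = [None] * SIZE
        self.rrb = 0                    # rotating register base

    def index(self, logical):           # logical 0 corresponds to GR32
        return (logical + self.rrb) % SIZE

    def write(self, logical, value):
        self.phys[self.index(logical)] = value

    def read(self, logical):
        return self.phys[self.index(logical)]

    def rotate(self):                   # performed by the loop-branch
        self.rrb = (self.rrb - 1) % SIZE

rf = RotatingFile()
for i in range(3):                      # three overlapped iterations
    rf.write(0, i)                      # each targets "the same" register...
    rf.rotate()
# ...yet all three values survive, one rotation step apart
assert rf.read(1) == 2 and rf.read(2) == 1 and rf.read(3) == 0
```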
If this sounds a lot like register renaming, it is. Itanium's register-rotation feature is less generic than
all-purpose register renaming like Athlon's, so it's easier to implement and faster to execute. Chip-wide
register renaming like Athlon's adds gobs of multiplexers, adders, and routing, one of the big drawbacks of
a massively out-of-order machine. On a smaller scale, ARM used this trick with its ill-fated Piccolo DSP
coprocessor. At the high end, Cydrome also used this technique, a favorite feature that Cydrome alumnus
and Itanium team member Bob Rau apparently brought with him.
So IA-64 has two levels of indirection for its own registers: the logical-to-virtual mapping of the frames and
the virtual-to-physical mapping of the rotation. All this means that programs usually aren't accessing the
physical registers they think they are, but that's nothing new to high-end microprocessors. Arcane as it
seems, this method still uses less hardware trickery than the full register renaming of Athlon, Pentium III, or
P4.
Frames and rotation help up to a point, but eventually even Itanium runs out of registers. When that
happens, we're back to pushing and popping registers on and off the stack. Where Itanium differs from
SPARC is that Intel makes it automatic. Itanium's register save engine (RSE) is a circuit within the
processor that fills and spills registers to/from the stack, invisibly to software, whenever the register file
overflows or underflows. SPARC, in contrast, raises a fault that must be handled in software.
The RSE is more complicated than you might think. It has to handle any kind of memory problem, page
fault, exception, or error without bothering the processor. In Itanium, the RSE stalls the processor to do its
work. In future IA-64 implementations, it will probably be more elegantly handled in the background.
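The fill/spill behavior the RSE automates can also be modeled simply. In this toy model (class name, frame sizes, and spill policy are all invented for illustration), nested calls that need more stacked registers than physically exist push the oldest frame out to a backing store in memory, and returns pull it back in, with no software fault.

```python
# Toy model of the register save engine: spill the oldest frame on overflow,
# fill it back on underflow, all without software involvement.
PHYS = 96                    # physical stacked registers available

class RSE:
    def __init__(self):
        self.frames = []     # live frames, oldest first (lists of values)
        self.backing = []    # frames spilled to the memory backing store

    def live(self):
        return sum(len(f) for f in self.frames)

    def call(self, frame):
        self.frames.append(frame)
        while self.live() > PHYS:            # overflow: spill oldest frame
            self.backing.append(self.frames.pop(0))

    def ret(self):
        self.frames.pop()
        if self.backing and self.live() + len(self.backing[-1]) <= PHYS:
            self.frames.insert(0, self.backing.pop())  # underflow: fill

rse = RSE()
for _ in range(5):
    rse.call([0] * 24)       # five 24-register frames need 120 > 96 slots
assert len(rse.backing) == 1 # the oldest frame was spilled, invisibly
rse.ret()
assert not rse.backing       # and filled back on return
```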
The Good Stuff: Instruction Set
As we mentioned, IA-64 is an enhanced VLIW architecture, so its concept of "instruction" is a little different
from that of, say, Pentium or Alpha. With IA-64, there are instructions, there are bundles, and there are
groups. Get your notepads ready.
Instructions are 41 bits long. Yup - say goodbye to powers of two. It takes 7 bits to specify one of 128
general-purpose (or floating-point) registers, so two source-operand fields and a destination field eat up 21
bits right there, before you even get to the opcode. Another 6 bits select one of the 64 predicate
registers for predication (which we discuss in more detail below).
Instructions are delivered to the processor in "bundles." Bundles are 128 bits: three 41-bit instructions
(making 123 bits), plus one 5-bit template, which we'll get to in a minute. Still with us? Then there are
instruction groups, which are collections of instructions that can theoretically all execute all at once. The
instruction groups are the compiler's way of showing the processor which instructions can be dispatched
simultaneously without dependencies or interlocks. It's the responsibility of the compiler to get this right; the
processor doesn't check. Groups can be of any arbitrary length, from one lonely instruction up to millions of
instructions that can (hypothetically, at least) all run at once without interfering with each other. A bit in the
template identifies the end of a group.
A bundle is not a group. That is, IA-64 instructions are physically packaged into 128-bit bundles because
that's deemed the minimum width for an IA-64 processor's bus and decode circuitry. (Itanium dispatches
two bundles, or 256 bits, at once.) A bundle just happens to hold three complete instructions. But logically,
instructions can be grouped in any arbitrary amount, and it's the groups that determine how instructions
interrelate to one another.
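The bundle arithmetic is easy to check in a few lines. The sketch below packs three 41-bit instruction slots and a 5-bit template into one 128-bit value; the counts come from the article, but the ordering of fields within the bundle is a simplification, not the architecture's actual bit layout.

```python
# Bit-level sketch of an IA-64 bundle: 3 * 41 + 5 = 128 bits exactly.
def pack_bundle(template, slot0, slot1, slot2):
    assert template < 2**5 and max(slot0, slot1, slot2) < 2**41
    return template | (slot0 << 5) | (slot1 << 46) | (slot2 << 87)

def unpack_bundle(bundle):
    mask41 = 2**41 - 1
    return (bundle & 0x1F, (bundle >> 5) & mask41,
            (bundle >> 46) & mask41, (bundle >> 87) & mask41)

b = pack_bundle(0x10, 1, 2, 3)
assert b < 2**128                       # a bundle fits in exactly 128 bits
assert unpack_bundle(b) == (0x10, 1, 2, 3)
```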
All IA-64 instructions fall into one of four categories: integer, load/store, floating-point, and branch
operations. These categories are significant in how they map onto the chip's hardware resources. Different
IA-64 implementations (Itanium, McKinley, etc.) might have different hardware resources, but all will do
their best to dispatch all the instructions in a group at once. And we'll see IA-64 compilers capable of
optimizing binaries for different IA-64 processors too.
It's hard not to think that Intel's institutionalized taste for baroque and ungainly, not to mention bizarre,
instruction set features crept in here somewhere. With so much elegance going for it, IA-64 falls
down in the evening gown competition. First, IA-64 opcodes are not unique - they're reused up to four
times. In other words, the same 41-bit pattern decodes into four completely different and unrelated
operations depending on whether it's sent to the integer unit, the floating-point unit, the memory unit, or the
branch unit. A C++ programmer would call this overloading. An assembly programmer would call it nuts. You'd
think that Itanium's designers would have been satisfied with 2^41 different opcodes, but no…
The second eccentric feature, which is related to the first, explains how Itanium avoids confusing these
identical-but-different opcodes (a process serious engineers call disambiguation). The five-bit template at
the start of every 128-bit bundle helps route the three-instruction payload to the correct execution units.
Those of you who are good at binary arithmetic are thinking, "wait a minute… five bits isn't enough." And
you'd be right--if you weren't designing Itanium. Rather than tagging each of the three instructions with its
associated execution unit, or just extending the instruction width, IA-64 uses these five bits to define one of
24 different "templates" for an instruction bundle (the other eight combinations are reserved). A template
spells out how the three instructions are arranged in a bundle, and where the end of the logical group is, if
any. And yes, you're right again, 24 templates is not enough to define all possible combinations of integer,
FP, branch, and memory operations within a bundle, as well as the presence of a group's logical stop. Deal
with it.
You'll notice that it's impossible to have an FP instruction as the first instruction of a bundle, and that
load/store instructions are not allowed at the end. You can't have two FP instructions in a bundle, yet you
can have three branch instructions bundled together. This is not as counterproductive as it sounds--as long
as two of the branches are conditional and evaluate false, they do no harm other than wasting space.
How Epic is EPIC?
Is EPIC really VLIW? Yes, by most definitions of that term. Pedantic computer architects may argue over
abstruse differences, and Intel's marketing people will steam over the misuse of their trademark, but for all
intents and purposes, EPIC is merely a more pronounceable rendition of VLIW with a few enhancements.
Few, if any, of EPIC's features discussed so far are unique to Itanium or to Intel. Broadsiding a processor
with a volley of instructions at once is what VLIW is all about. EPIC corrupts, if you will, the pure ideal of
VLIW by introducing its peculiar 5-bit instruction templates, which unnecessarily complicate multi-instruction
issue and effectively eliminate several potential combinations of instructions. On the plus side, Intel gets
credit for allowing flexible-sized instruction groupings, which help increase issue efficiency. This is likely to
pay off handsomely in future IA-64 processors. IA-64's groups also reduce the code bloat seen in traditional
VLIW designs (where fixed-width VLIW instruction slots may often go unused if the compiler cannot find
independent instructions to group together from within a particular window of instructions).
Certainly there are plenty of processors with multiple execution units and microarchitectures that can keep
them busy. Predicated execution is nothing new, either. Tiny embedded processors do it, and compiler
writers are happy to manage the multiple predicate bits. Itanium's scoreboard bits, register frames, and
svelte and RISC-like instruction set all have been seen before. Itanium doesn't even reorder instructions,
for cryin' out loud, something even midrange 32-bitters do all day long. But then again, Intel's formally
stated goal was to shift complexity out of the processor logic and to the compiler. Yet, if you read a
presentation from the last Intel Developer Forum, you'll see that "Future Itanium Processor Family processors
can have out-of-order execution." Of course, this also implies that McKinley will be called Itanium II or
something similar.
IA-64 doesn't really introduce anything all that new. It's more of an amalgam of concepts and techniques
seen before and given the ol' Intel twist. That doesn't make it bad, but it's also not spectacular nerd porn.
Instruction Set Highlights
It would be tedious in the extreme to even summarize the entire IA-64 instruction set; you can refer here for
the complete ISA (Instruction Set Architecture) listing. But there are some highlights in the ISA worth
noting, such as conditional (predicated) execution, hinted and speculative loads, and the odd way in which
Itanium handles integer math.
Pretty much any IA-64 instruction can be conditional, with its execution predicated on literally anything you
care to define. Far beyond the simple Z (zero), V (overflow), S (sign), and N (negative) flags of our
childhood, IA-64 has 64 free-form predicate bits, each considered a separate predicate register. You can
set or clear a predicate bit any way you like, and its condition sticks indefinitely. Any subsequent instruction
anywhere in the program can check that bit (or multiple bits) and behave accordingly. This allows you, for
example, to evaluate two numbers in one part of a program, but not make a decision (conditional branch)
until much later. The microprocessor cognoscenti consider predicate bits more elegant than flags; they
scale more easily to larger sizes (more bits) and are easier for compilers to target. We'll cover predication
in more detail below.
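The set-early, test-late pattern described above can be sketched directly. In this toy model the 64-entry predicate file mirrors the article; the helper functions echo IA-64 mnemonics (cmp.eq, a predicated add) but are invented for illustration.

```python
# Sketch of predicated execution: a comparison sets a predicate bit early,
# and any later instruction can be guarded by it -- no branch needed.
pr = [True] + [False] * 63      # PR0 is hard-wired true on real hardware

def cmp_eq(p, a, b):            # cmp.eq: set predicate register p
    pr[p] = (a == b)

def padd(p, a, b):              # (p) add: executes only if the predicate holds
    return a + b if pr[p] else None

cmp_eq(6, 10, 10)               # evaluate the condition now...
# ...arbitrary other work can happen here...
assert padd(6, 2, 3) == 5       # ...and act on it much later, branch-free
cmp_eq(6, 1, 2)
assert padd(6, 2, 3) is None    # false predicate: the add is squashed
```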
Loading Up the Stores
IA-64 is surprisingly stingy with memory-addressing modes. It has precisely one: register-indirect with
optional post-increment. This seems horribly limiting but is very RISC-like in philosophy. Addresses are
calculated just like any other number and deposited in a general-purpose register. By avoiding special
addressing modes, Itanium avoids specialized hardware in the critical path. VLIW pushes complexity onto
the compiler instead of the hardware.
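The single addressing mode is simple enough to model in one function. The sketch below (helper and register names are invented) shows why register-indirect with post-increment is all a pointer-walking loop needs: read through the address register, then bump it.

```python
# Sketch of IA-64's lone addressing mode: register-indirect, optional
# post-increment. Memory is a dict of 8-byte-aligned addresses to values.
memory = {0x100: 10, 0x108: 20, 0x110: 30}

def ld8_postinc(regs, dst, addr_reg, inc=0):
    regs[dst] = memory[regs[addr_reg]]   # value comes from *[addr_reg]...
    regs[addr_reg] += inc                # ...then the address register advances

regs = {"r8": 0, "r9": 0x100}
total = 0
for _ in range(3):                       # walk an array of 8-byte elements
    ld8_postinc(regs, "r8", "r9", inc=8)
    total += regs["r8"]
assert total == 60 and regs["r9"] == 0x118
```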
Loads can be pretty uninteresting, but IA-64 manages to spice them up a bit. Loads can "hint" to the cache
that it would be beneficial to preload additional data after the load, whether that data is likely to be reused,
and if so, which of the three cache levels is most appropriate to hold it. These are not the kinds of things
even dedicated assembly-language programmers are likely to know, but large-scale commercial
developers might profile a new operating system or major application extensively, and use the feedback to
provide prefetch and caching hints. These are just hints, too--the processor is under no obligation to act on
the hints or the caching information.
Speculative Loads
Somewhat stronger than a hint is a speculative load, an instruction that tells the processor it might want to
load data from memory. Programmers (or more realistically, advanced compilers) can sprinkle their code
with speculative loads to try to snag data that might be needed soon. Itanium will do its best to comply, but
if the system bus is busy, the speculative load might be postponed indefinitely. If a speculative load fails
(such as from a memory fault or violation) the processor does not raise an exception. Hey, it was only
speculative anyway.
Itanium can hoist loads above branches, which many high-end RISCs do, but it can also hoist loads above
stores, which is much trickier. The usual problem with the latter procedure is alias detection: the compiler
can't be sure that loads and stores aren't to the same address. As long as there's a chance, it's dangerous
to load from memory before all the stores to the same memory addresses are finished. Yet loads are
time-consuming, so it's a big win if you can accelerate them.
IA-64 gets around this problem--with a little help from you--with the LD.A (load advanced) instruction. LD.A
speculatively loads from memory, but also stuffs the load address into a special buffer called the Advanced
Load Address Table (ALAT). Subsequent stores to memory are checked against addresses in the ALAT. If
there's a match, the speculative load aborts (or, if it already completed, the contents are discarded). Using
the data from a LD.A can be tricky, too. You need to validate them with a CHK.A instruction first. There's
no guarantee that any calculations you did won't have to be redone with valid data. It's a bit of a gamble,
but can pay handsomely if you speculate wisely. Architecture imitates life.
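The advanced-load dance can be sketched as three operations. The function names below echo the mnemonics (ld.a, chk.a); the data structures are invented, and a real ALAT also tracks target registers and access sizes, which this model omits.

```python
# Sketch of data speculation: ld.a records its address in the ALAT, stores
# snoop the ALAT and kill matching entries, and chk.a reports whether the
# speculatively loaded value is still usable.
memory = {0x100: 42, 0x200: 7}
alat = set()                    # advanced load address table (addresses only)

def ld_a(addr):                 # hoisted load: read early, remember address
    alat.add(addr)
    return memory[addr]

def store(addr, value):         # stores check the ALAT for conflicts
    memory[addr] = value
    alat.discard(addr)

def chk_a(addr):                # True if the advanced load is still valid
    return addr in alat

early = ld_a(0x100)             # load hoisted above the store
store(0x200, 99)                # store to a *different* address: no conflict
assert chk_a(0x100) and early == 42

early = ld_a(0x100)
store(0x100, 5)                 # aliasing store: the speculation failed
assert not chk_a(0x100)         # must re-load (and redo dependent work)
```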
FP You, Too
Bizarrely, Itanium's two floating-point units can't multiply two numbers together. They can't add, either. The
FPU is designed for multiply-accumulate (MAC) operations, so if you want a conventional FP MUL you
program it as an FP MAC with an adder of zero. Likewise, if you want a simple FP ADD you're forced to
use a multiplier of 1.0 along with the value you want to add.
Stranger still, Itanium has no integer multiply function at all. Any multiplication, whether it's integer or
floating-point, has to happen in the FP MAC unit. Unfortunately, that means transferring a pair of integers
from the general-purpose registers to the floating-point registers, then transferring the result back again.
Fortunately, IA-64 includes a few instructions specifically for this eventuality. What were they thinking?
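The multiply-accumulate identities are worth seeing once: a*b is just a*b + 0.0, and a+b is a*1.0 + b. The sketch below uses Python's math.fma (added in 3.13) as a stand-in for the hardware MAC, with a plain fallback for older interpreters; the function names are ours, not Intel's.

```python
# Synthesizing multiply and add from a fused multiply-accumulate, the way
# Itanium's FP MAC unit does it.
import math

def fma(a, b, c):
    try:
        return math.fma(a, b, c)    # single-rounded FMA, Python 3.13+
    except AttributeError:
        return a * b + c            # close enough for this illustration

def fmul(a, b):
    return fma(a, b, 0.0)           # multiply = MAC with an addend of zero

def fadd(a, b):
    return fma(a, 1.0, b)           # add = MAC with a multiplier of 1.0

assert fmul(3.0, 4.0) == 12.0
assert fadd(2.5, 0.5) == 3.0
```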
Branches: Going Out On a Limb
The longer the pipeline, the bigger the train wreck if the processor mispredicts a branch. And Itanium has a
fairly long pipeline, so the potential for performance-robbing disaster looms ever larger. Predicting branches
takes on paramount importance and to that end, IA-64 has a number of tricks to help it avoid the dreaded
mispredicted branch.
First, there's only one form of conditional branch, but its behavior can be based on any of the 64 predicate
bits mentioned earlier. Branches can also be tagged with either static or dynamic branch prediction (that's
prediction, not predication), which predicts whether the branch is likely, or not likely, to be taken this time
around. Static prediction cannot be overridden; dynamic prediction leaves the decision to Itanium's own
branch-prediction hardware. If you, as the programmer, know which way the branch is likely to go, stick
with static prediction and the chip will assume you're always right. If you're unsure, let Itanium make up its
own mind. If you're feeling especially clairvoyant, you can also suggest that Itanium fetch instructions from
the predicted target of the branch, and even how far ahead of the branch target it should prefetch.
Predicated Execution
Predication is cool--it avoids short branches that inject bubbles into the pipeline. Rather than skip over
short sections of code, predicated processors can plow straight ahead, either committing or discarding the
results based on the predicate test. It effectively permits execution of both branch code paths at the same
time. Predicated instruction sets have a mixed effect on code density. They improve code density slightly
by eliminating branch instructions, but then hand back much of that improvement by usurping several bits
(in IA-64's case, six bits per instruction specifying one of 64 predicate registers) from every instruction for
the predicate field.
Predicated execution sacrifices execution units on the altar of branch latency. In other words, predicated
instructions make it most of the way through the pipeline whether they're supposed to execute or not. In
Itanium's case, all conditional instructions are predicated so everything executes nearly to completion. It's
only in the next-to-last DET (exception detection) stage of the pipeline that their effects are canceled if the
predicate turns out to be false. By that time, the instruction has already commandeered one of Itanium's
nine execution units for nothing, possibly preventing some other instruction from using it. Well, not entirely
for nothing; it has served the greater good by avoiding a potential bubble in the pipeline. Better to waste a
little work than to spin your wheels waiting for a branch to resolve.
It's small comfort, but predicated instructions that would stall waiting for an operand are killed early,
because Itanium resolves the predicate (true/false) about the same time that it detects the dependency. It
won't stall instructions waiting for data that's irrelevant. That's the beauty of predicate bits set well ahead of
time instead of flags that are updated every cycle.
Don't Look Back: How Itanium Handles x86 Code
Yes, Virginia, there is an x86-compatibility mode in Itanium. It's awkward and unnatural, but we know how
attached you are to your old binaries. IA-64 does not normally support older x86 binaries, and it's entirely
possible that some future IA-64 implementation might drop this feature or water it down, but for now your
old Lotus 1-2-3 diskettes are safe.
Itanium supports all x86 instructions in one way or another, even MMX, SSE (not SSE2), Protected, Virtual
8086, and Real mode features. You can even run entire operating systems in x86 mode, or just run the
applications under a new IA-64 OS. All the x86 registers map onto Itanium's own general-purpose
registers, but some of the less orthogonal x86 registers appear in Itanium's "application registers" AR24
through AR31.
Switching modes appears trivial but isn't. There's one IA-64 instruction that switches the processor to x86
mode and another (newly defined) x86 instruction, JMPE, that switches to IA-64 mode. If the programmer
so wishes, interrupts can switch automatically to IA-64 mode or the machine can stay in x86 mode. In the
latter case, you can reuse your x86 interrupt handlers.
Switching to x86 mode is a lot like booting a '386 because you have to set up memory segment
descriptors, status registers, and flags. Also, x86 code likes to have its way with all the resources of the
processor, either overwriting or ignoring many of Itanium's state bits and registers. It's also likely to upset
your cache contents. In general, it's best to save the entire state of the processor before switching to x86
mode. It's awkward enough that you probably don't want to switch modes willy-nilly. Save it for dramatic
changes, such as executing entire x86 applications.
Not that anyone was asking, but PA-RISC compatibility is handled offline through a software translator.
IA-64 instructions don't directly support PA-RISC instructions, but they do map fairly closely (hey, RISC is
RISC). The fact that x86 binaries are emulated in minute detail with enormous helpings of hardware while
PA-RISC code is relegated to a translator before it has any hope of running says a lot about the relative
importance of these two installed bases. It may also tell us something about the "equal" relationship
between the HP and Intel engineers designing IA-64.
Oooooh, It's So Big!
The definitions of chip, processor, and die become somewhat clouded with Itanium. The first IA-64 "chip" is
really a metal-cased cartridge, somewhat like Pentium II modules of yore. The cartridge - which is
mechanically incompatible with anything ever seen before - contains at least five chips, including the
processor itself and four cache SRAMs. The first- and second-level caches (L1 and L2) really are on the
same die as the processor; the L3 cache takes up those four SRAMs that are off-chip but on-module. Got
it?
Processor abstraction layer
Then there's the PAL. PAL is Intel's "processor abstraction layer," a flash ROM inside the cartridge that, in
Intel's words "… maintain[s] a single software interface for multiple implementations of the processor
silicon steppings." Sounds like a "fudge ROM" for hiding, tweaking, or patching imperfections in the
processor that may not entirely live up to their data book specification.
The whole thing weighs in at about 325 million transistors: 25 million for the processor chip (including L1
and L2 caches) and about 75 million for each of the four L3 cache chips. We'll toss in the PAL for free. If 25
million transistors seems like a lot, remember that Pentium III has 24 million and Pentium 4 has 42 million.
For a high-end 64-bit processor, Itanium is looking positively dinky.
Itanium die
You know what else is big? Itanium's code footprint. Poor code density is a hallmark of VLIW designs, and
although IA-64 makes some improvements as we mentioned, it's no exception to the rule. With no (public)
code to look at it's hard to be sure, but educated estimates pin Itanium's code size at about one-third bigger
than other 64-bit RISCs and double the size of Pentium binaries.
Poor code density means lots of disk space, but that's not a big deal for high-end systems. It also means
less effective cache size, which in turn reduces cache-hit rates. Again, no big deal because caches can
always be made bigger. But cache bandwidth is hard to improve and that may be the real bottleneck for
IA-64 processors. That's why Itanium's first two levels of cache are on the processor die itself and the L3
cache is very nearby on the same module.
Outside the Box
The 128-bit bus between the Itanium die and its L3 caches is contained entirely within the cartridge; it's
never exposed to the outside. Itanium's external bus is 64 bits wide and this is its only connection with the
outside world, main memory, or other processors. Up to four processors can share this bus. After that, Intel
has a bridge chip that allows four-processor clusters to talk to each other.
It's a pretty pedestrian bus as these things go. It has none of the exotic interprocessor communications that
Hammer has (as we'll study in our next segment), nor is it even very fast at 2.1 GB/second of maximum
bandwidth, compared with 3.2 GB/second for Pentium 4 or 3.6 GB/second for MIPS. It's also a doomed,
dead-end bus: McKinley will have a completely different interface.
McKinley, Madison, and Deerfield: The Next Generation
The second IA-64 processor after Itanium is code-named McKinley and it's likely to be faster, smaller, and
all-around better than its predecessor. McKinley's L1 caches will be the same size as Itanium's, but the L2
cache will grow from 96K to 256K. The L3 cache will get smaller (3M instead of 4M) but move onto the
actual chip, not just on the same cartridge. All three cache interfaces will get faster. McKinley shaves one
cycle off the L1 cache access time (from two cycles to one), shortens L2 access time by seven cycles (to
five), and takes eight cycles off the L3 latency (to 12 cycles). Adding the L3 cache to the chip will boost
McKinley's die size significantly, probably to around 450 mm2, and up the transistor count to 221 million.
But manufacturing cost should be significantly reduced without the external L3 SRAMs and larger package
required for the dual-chip (core and L3) Itanium.
McKinley will use a completely different socket design from Itanium and a revised bus interface, dooming
the first IA-64 systems almost before they get out the door. Just like Pentium Pro, Itanium's mechanical
footprint will be an orphan from Day One. McKinley's system bus will widen to 128 bits (up from Itanium's
64) and its clock frequency will improve from 133 MHz to 200 MHz. The bus will still be double-pumped
(i.e., transferring data on both rising and falling edges of every clock), yielding 6.4 GB/sec front-side bus
bandwidth.
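These bus numbers fall straight out of arithmetic: bus width in bytes, times clock rate, times two transfers per clock for a double-pumped bus. A quick sanity check (the figures come from the text above; the helper function is our own):

```python
def bus_bandwidth_gb_s(width_bits: int, clock_mhz: int, pumps: int = 2) -> float:
    """Peak bus bandwidth in GB/s.

    width_bits / 8  -> bytes per transfer
    clock_mhz * 1e6 -> clocks per second
    pumps           -> transfers per clock (2 = double-pumped)
    """
    return (width_bits / 8) * (clock_mhz * 1e6) * pumps / 1e9

# Itanium: 64-bit bus at 133 MHz, double-pumped -> about 2.1 GB/s
print(bus_bandwidth_gb_s(64, 133))   # 2.128

# McKinley: 128-bit bus at 200 MHz, double-pumped -> 6.4 GB/s
print(bus_bandwidth_gb_s(128, 200))  # 6.4
```

The same formula explains why widening the bus from 64 to 128 bits and raising the clock from 133 MHz to 200 MHz together triple the bandwidth.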
Next up comes Madison, expected to be a 0.13-micron shrink of McKinley, all other things being equal.
Deerfield, the fourth member of IA-64's growing family, will also be a 0.13-micron shrink of McKinley, but
this time with a smaller 1M L3 cache and yet another new bus interface intended for cheaper systems.
Deerfield will be the "value" version of IA-64, à la Celeron or Duron.
Bottom Line
IA-64 is an interesting architecture that borrows from and/or extends many existing microarchitectural
techniques, and also adds some new and interesting twists, but the first instantiation of the architecture,
Itanium, has not been a major success to date. After waiting a few years longer than originally anticipated
for the first IA-64 chip to appear (Intel publicly disclosed initial IA-64 details at Microprocessor Forum in
1997, and stated the first IA-64 chip, code-named Merced, was expected to ship in mid-1999), we saw a
processor with a slower than expected clock rate, and less than stellar integer performance that catered to
a very limited market. Plus, initial shipments were stymied by delays from key vendors. Some
commentators called Intel's first IA-64 chip "Unobtanium". And not surprisingly, the catch-phrase for quite
some time has been "wait for McKinley; Itanium is simply a development platform".
Very recently another setback occurred with Dell dropping Itanium workstations from its product lineup (see
"Dell Discontinues Itanium Workstation"), possibly encouraging even more people to "wait for McKinley".
But clearly Itanium is not all that bad. Floating-point performance as seen in some benchmarks is
impressive today, and its large address space can certainly be useful in various high-end applications, but
Intel faces a steep uphill battle trying to convince many server and workstation customers, with long
histories using established 64-bit architectures, to convert to IA-64 at this juncture. Then again, Intel has
swayed many customers to convert portions of their application processing to Itanium-based solutions as
seen at this link. Software developers are a key target as well, and many have been on the IA-64
bandwagon for a while.
Things could improve substantially when McKinley arrives later this year in development systems and early
next year in volume. We expect Intel to start seriously ramping IA-64 architecture processor shipments in
selected markets within two to three years. But let's not forget about AMD, who clearly appears to be up for the
challenge, as we'll see in our next segment. Also, we'll provide our thoughts on the rumored Yamhill 64-bit
x86 "hedge your bet" technology under deep wraps within Intel development labs.
Resources
Itanium manuals. Be sure to explore the menu to the left of the page - it has links to lots of other
Itanium reference material including some PowerPoint slides.
A nice quick overview of basic Itanium features can be found here.
A set of performance tests and a summary of SPEC test results are at this link from last summer.
Intel's own benchmarketing results are at this link.
64-Bit CPUs: AMD Hammer vs. Intel IA-64
February 13, 2002
By: Jim Turley
In 25 words or less: Intel's IA-64 is a clean break, while AMD's Hammer is philosophically (some would say
pathologically) another extension to the ages-old x86 architecture.
It's always fun to root for the underdog. It's the American way to cheer for the little guy, hoping he'll triumph
over the dark forces of--ironically enough--Corporate America. AMD has been squarely in the underdog
role for quite a while now, but the brewing Hammer v. Itanium match-up will take that to a whole new level.
IA-64 is a new VLIW design that has x86 compatibility tacked on; Hammer is a real x86 processor (albeit
one with 64-bit extensions) just like Athlon, K6, and AMD's other processors before it. The two product
lines are now heading down separate paths. You're always the winner when you run a race by yourself.
But when you reach that finish line, are you anyplace you want to be?
In this second part of our three-part analysis on 64-bit computing architectures, we'll delve into Hammer's
microarchitecture and compare/contrast to Itanium's existing design, and to some extent McKinley's
upcoming design. You can check out our earlier Itanium analysis at this link.
Hammer of the Gods
As x86 processors go, this is about as good as it gets, boys and girls. Hammer strikes a blow to the
doomsayers who prophesied x86 was (or should be) dead. Who would've thought that anyone could pound
so much performance out of a souped-up 8086 after twenty years? Behold, Hammer: the ultimate
expression of CISC ingenuity.
At its core, Hammer is a nine-way superscalar, massively out-of-order, CISC-into-RISC processor not too
different in architecture from Athlon. In many ways, the change from K6 to Athlon (née K7) was greater
than from Athlon to Hammer (née K8).
Hammer has nine execution units, the same as Athlon and, coincidentally, the same as Itanium. These are
grouped into three integer units (arithmetic/logic units, aka ALUs), three address-generation units (AGUs),
and three floating-point units. Like Athlon and K6 before it, Hammer transmogrifies every x86 instruction
into one or more internal RISC operations (ROPs). Beyond the first few stages of the pipeline, Hammer is a
RISC machine with no idea of x86 instructions or machine state.
Hammer can decode up to three x86 instructions and dispatch up to nine ROPs per cycle, assuming the best
case where each ROP serendipitously maps to one of Hammer's nine execution units. Most ROPs execute
directly in hardware, but even after conversion, some x86 operations are too perverse for that. These are
trapped and emulated by routines in Hammer's micro-ROM, just like Athlon. Ah, microcode--the
quintessential CISC technique.
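The gap between that best case and typical behavior is easy to make concrete with a toy model: each decoded x86 instruction yields one or more ROPs tagged with the unit class it needs, at most three instructions decode per cycle, and each of the nine units accepts one ROP per cycle. The unit mix and names below are our illustrative simplification, not AMD's actual scheduler:

```python
from collections import deque

# Hammer-like resources (our simplification): three units of each class.
UNITS = {"alu": 3, "agu": 3, "fpu": 3}
DECODE_WIDTH = 3  # x86 instructions decoded per cycle

def dispatch(program):
    """program: list of x86 instructions, each a list of the ROP kinds
    it decodes into. Returns the cycle count to issue everything,
    ignoring latencies and dependencies."""
    instrs = deque(program)
    pending = deque()  # ROPs decoded but not yet issued
    cycles = 0
    while instrs or pending:
        cycles += 1
        # Decode stage: up to three x86 instructions become ROPs.
        for _ in range(min(DECODE_WIDTH, len(instrs))):
            pending.extend(instrs.popleft())
        # Issue stage: each unit class accepts at most UNITS[kind] ROPs.
        free = dict(UNITS)
        stalled = deque()
        while pending:
            rop = pending.popleft()
            if free.get(rop, 0):
                free[rop] -= 1       # issued to a free unit
            else:
                stalled.append(rop)  # no unit of that kind left; wait
        pending = stalled
    return cycles

print(dispatch([["alu"], ["alu"], ["alu"]]))  # 1: three ALU ops, one cycle
print(dispatch([["alu", "alu"]] * 3))         # 2: six ALU ROPs vs. three ALUs
```

When the ROP mix happens to match the unit mix, everything issues in one cycle; when it clumps on one unit class, work spills into later cycles, which is the serendipity caveat above.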
Hammer's pipeline is longer than Itanium's, at 12 stages. It should come as no surprise that most of that
small difference is spent decoding x86 instructions and converting them to more digestible ROPs.
Which Weighs More: Nine Lbs of Hammer or Nine Lbs of Itanium?
Although Hammer and Itanium both have nine execution units, it's hard to say that they'd both accomplish
the same amount of work per cycle. For starters, Hammer's got three address-generation units (Itanium
has none), which don't really contribute to forward progress. They're more of a necessary evil. Itanium has
no address-generation units because it supports only one simple addressing mode. Advantage: Intel.
Hammer can dispatch nine ROPs to Itanium's six, which would appear to give Hammer a 50% lead in
grunt-per-cycle. On the other hand, these ROPs are nothing but decimated x86 instructions, so they don't
count for much individually. It usually takes a handful of ROPs to equal one "real" instruction. But the same
could be said for Itanium (and all VLIW machines)--what is one of those instructions worth? We'll call it a
draw and file it under the same category as "how many angels can dance on the head of a pin?"
Three of Hammer's nine execution units are for floating-point operations only, leaving six for integer code
(the three AGUs and three ALUs), while only two of Itanium's are used for FP, with the remaining seven
free for integer code (recall that Itanium has two integer units, two combo integer-and-load/store units, two
floating-point units, and three branch units). That, and the fact that Itanium's integer units do more useful
work (as opposed to the housekeeping mentioned above) suggest that the Intel chip will make more
headway on normal code. Advantage: Intel.
But wait: three of Itanium's seven non-FP units are branch units. Itanium really has just four integer units to
Hammer's three. Two of Itanium's integer units do double duty as load/store units, so it's not quite fair to
say you could run four integer operations at once - you've got to do loads and stores sometime. Hammer,
on the other hand, sets aside three address-generation units to this task, so you really can execute three
integer operations at once. We'll call this one a tie.
Hammer has one more floating-point unit than Itanium. On the other hand, Itanium's are both equivalent
and able to handle any FP operation, whereas all three of Hammer's are different. Itanium gets bonus
points for symmetry but Hammer can potentially get more floating-point work done. Advantage: AMD.
Please Register Your Software
The most noticeable enhancement to Hammer is its 64-bit register file (we'll get to the 64-bit instructions in
a minute). All the old familiar x86 registers are extended to 64 bits, and eight new registers are added with
RISC-like names R8 through R15. Obviously, existing x86 binaries won't see the upper half of the extended
registers, or the eight new registers at all. The enhancements are visible to new 64-bit code only.
AMD's sixteen 64-bit registers are a far cry from Intel's 128 general-purpose plus 128 floating-point
registers. Even in its 64-bit mode, AMD has one-sixteenth the quantity of registers that IA-64 has. There's
no argument that more registers is better, although there's plenty of contention over how much better. How
many registers is enough? There comes a point of diminishing returns, but we'd bet that most
programmers and compiler writers would prefer some number greater than 16.
If it makes AMD fans feel any better, Itanium's register file is so big it takes two clock cycles to access a
register, adding a stage to the pipeline. If it makes Intel fans feel any better, that delay's probably going to
go away in McKinley and future IA-64 processors.
When I'm 64
AMD presented itself with a problem that Intel didn't have: how to extend the original x86 instruction set to
64 bits? Intel treats 32-bit x86 code as the end of the road. If you want 64 bits, go to IA-64 (though this
stance may change--see the section on Yamhill below). AMD had to stretch the opcode map while still
retaining backward compatibility with old x86 code. No mean feat.
Adding 64-bit capability to existing x86 code doesn't make sense. The whole point of Hammer is to be able
to run old binaries without modification. So to access its new 64-bit instructions and other features,
Hammer has to switch to 64-bit mode. The process is simpler and less traumatic than Itanium's mode
switch. First you set a control bit to globally enable 64-bit mode. Then you mark individual code segments
as 64-bit segments by setting a previously unused bit in the local descriptor table (LDT). Any branches to
segments so marked are interpreted as 64-bit code. Branches to code segments without the magic bit set
are assumed to be 32-bit code.
AMD also invented some new 64-bit instructions. There's a whole new set of floating-point operations that
use a new flat FP register file of sixteen 128-bit registers. This will be a huge improvement over the
spectacularly clumsy register-stack architecture of the original 8087 and all x87 FPUs that followed. (The
floating-point abilities of Intel's processors were the low point of an already awkward architecture).
Hammer also provides mixed code size compatibility the same way the '386 did, with a size-override byte.
Any existing (that is, pre-Hammer) instruction can be prefixed with the one-byte REX pseudo-instruction.
This byte tells the decoder that the operands in this instruction should be interpreted as 64-bit quantities.
The '386 worked exactly the same way, introducing the 0x66 prefix byte, instantly turning decade-old 16-bit
operations into new 32-bit operations without actually changing the instruction set or duplicating every
operation.
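In the x86-64 encoding as it eventually shipped, a REX prefix is any byte in the range 0x40 to 0x4F: the low four bits are the flags W, R, X, and B, where W selects 64-bit operand size and R, X, and B each supply one extra high bit for a register field, which is how the new registers R8 through R15 are reached. A small decoder sketch:

```python
def decode_rex(byte: int):
    """Decode a REX prefix byte (0x40-0x4F); return None for anything else.

    Bit layout: 0100 W R X B
      W: 1 = 64-bit operand size
      R: extends the ModRM reg field (adds bit 3, reaching R8-R15)
      X: extends the SIB index field
      B: extends the ModRM r/m or SIB base field
    """
    if byte & 0xF0 != 0x40:
        return None  # not a REX prefix
    return {
        "w": (byte >> 3) & 1,
        "r": (byte >> 2) & 1,
        "x": (byte >> 1) & 1,
        "b": byte & 1,
    }

# 0x48 = REX.W: "treat this instruction's operands as 64-bit quantities"
print(decode_rex(0x48))  # {'w': 1, 'r': 0, 'x': 0, 'b': 0}
print(decode_rex(0x66))  # None -- that's the '386-era operand-size prefix
```

The parallel with 0x66 is exact: one prefix byte reinterprets the operand size of the instruction that follows, without duplicating the opcode map.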
Bus & Cache Stuff
Hammer's external interfaces are as interesting as Itanium's are dull. First off, Hammer includes two
double data-rate (DDR) controllers that directly manage external SDRAM memory. The memory bus can
be either 64 bits or 128 bits wide, and requires no glue logic. This compares favorably to Itanium's generic
(and short-lived) system bus, which requires a separate Intel controller chip to make sense of memory.
Sexier still, Hammer includes three (count 'em!) HyperTransport links, an obvious advantage over Itanium
in multiprocessing. This bus is relatively open and has bandwidth to spare. Depending on how you arrange
them, up to eight Hammer processors can seamlessly communicate amongst themselves using nothing
but their built-in HyperTransport links. Anybody remember the Transputer? Whereas Itanium processors
have to share a system bus, each Hammer gets its own private memory, courtesy of its on-chip SDRAM
controller. (The first Hammer processor, Clawhammer, may only support two processors using a single
HyperTransport link).
But wait--there's more. The L1 data cache and the L2 cache, as well as all external memory, are protected
with ECC. For good measure, all of these are periodically scrubbed to remove soft errors. (The L1
instruction cache is protected with parity, not ECC, partly for reasons of speed, but also because the
instruction cache is never written into from applications, the OS, or drivers, making it more reliable.)
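The parity-versus-ECC trade-off those caches embody is easy to demonstrate in miniature. A parity bit detects a single flipped bit but can't locate it; a Hamming-style ECC (sketched here over just four data bits, far narrower than a real cache line's code) pins down the flipped bit so it can be repaired in place, which is exactly what scrubbing does:

```python
def parity(bits):
    """Even parity: detects a single flip, cannot locate it."""
    return sum(bits) % 2

def hamming74_encode(d):
    """Encode 4 data bits as a 7-bit Hamming codeword.
    Positions are 1..7; parity bits sit at positions 1, 2, and 4."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4  # checks positions 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # checks positions 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # checks positions 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]

def hamming74_correct(cw):
    """Return (corrected codeword, 1-based error position or 0)."""
    c = cw[:]
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]  # parity over positions 1, 3, 5, 7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]  # positions 2, 3, 6, 7
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]  # positions 4, 5, 6, 7
    pos = s1 + 2 * s2 + 4 * s4      # syndrome = position of the bad bit
    if pos:
        c[pos - 1] ^= 1             # flip it back: the "scrub"
    return c, pos

data = [1, 0, 1, 1]
cw = hamming74_encode(data)
bad = cw[:]
bad[5] ^= 1                         # a soft error at position 6
fixed, where = hamming74_correct(bad)
print(where, fixed == cw)           # 6 True
```

A parity-protected instruction cache can afford the cheaper check because, as noted above, a corrupted line can simply be refetched from memory; data lines may be dirty, so they need the correcting code.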
AMD has clearly made its powerful, yet simple and elegant multiprocessing design Hammer's big
differentiator over Itanium. While Hammer has the more conservative internal architecture, it has the more
ambitious system interface. Itanium has the radical redesign inside, but fairly pedestrian, almost PC-like,
system architecture. Thankfully, McKinley will step the external bus performance up by about 3X. But
there's no question that Hammer provides the easier path to two-way, four-way, or even eight-way
multiprocessing systems. The question is, will that be enough?
Hammer may have multiprocessing features that Itanium and McKinley lack, but no one doubts that Intel
could add those features at any time (Itanium Xeon, anyone?). The company hardly lacks the wherewithal;
it just doesn't see the market demand at present. It'll be a whole lot easier for Intel to add multiprocessing
features to IA-64 chips than it would for AMD to add IA-64 compatibility to a future Hammer.
For all its features, the first Hammer will be a small guy. Clawhammer will likely have dual 64K L1 caches
and a 256K L2 cache (no L3 cache). The chip should measure just 104 mm2 in a 0.13-micron CMOS
process, according to AMD--just one-quarter the die size estimated for McKinley. The second Hammer
processor, Sledgehammer, should also consume far less real estate than McKinley, but it will quadruple
the L2 cache to 1 MB.
All things being equal, Clawhammer silicon will be far less expensive to manufacture than Itanium, even
before you consider Itanium's extra L3 cache chips and its elaborate mechanical housing. For both
single-processor and multiprocessor systems, AMD offers the more economical option for system makers.
Getting Your 64-Bit Goodness
Hammer is not VLIW and it doesn't expose parallelism (or anything else) to the compiler. It's just another
turbocharged x86: really fast at x86 code, but nothing radically new in architecture. New Hammer
code can access all the new registers, and even treat them as a flat register file, but it can't break free of
the inherent awkwardness of the x86 instruction set. The problem is not the binary encoding of x86
instructions--AMD and others have shown they can build blazing fast x86 chips with RISC-like internals
even with the fundamental x86 handicap. It's the nonparallel nature of x86 code that's impossible to
overcome.
In contrast, IA-64 compilers have all the time in the world to locate parallelism, find dependencies, optimize
loads, organize branches, develop predicate conditions, and much more, and then express all that richness
and intelligence explicitly to the processor. Poor Hammer has to pry what secrets it can from an inherently
serial binary stream, searching for tiny fragments of instruction-level parallelism in hardware--and do it all in
a handful of nanoseconds without slowing the critical path or impacting the clock frequency. Plus, Hammer
has to do all this with no foreknowledge, and no memory, of the entire program. Hammer's scope for
reassignment and optimization is limited to the few instructions already in its pipeline. It's like decoding the
Rosetta Stone while skydiving, or trying to characterize Internet routing protocols using Latin or Sanskrit.
Hammer and Itanium are complete opposites when it comes to rescheduling or re-ordering instructions to
improve overall efficiency and utilization of internal chip resources. IA-64 does no reordering whatsoever
and is damn proud of it. That's the compiler's job and the hardware is just a dumb servant of the compiler,
doing what it's told. Hammer, on the other hand, is--has to be--aggressive about locating and exploiting
weaknesses (for lack of a better term) in the compiled output, stealthily reorganizing the occasional integer
or FP instruction to take advantage of its hardware resources.
Which do you suppose has more headroom, more upside growth potential? It's inconceivable that Intel
can't extend and enhance IA-64 for another decade or so. Hammer, on the other hand, is the result of a
decade of stretching, tweaking, and cajoling a Paleolithic architecture into modern form. It's difficult to
believe there would be as much life left in Father Time as in the New Year's baby.
Branch latency and its related pipeline bubbles are the bane of high-speed microprocessors. IA-64 is the
hands-down winner here. It offers complete software and compiler control over branch prediction, hinting,
and prefetching, all in addition to its hardware branch-prediction logic. And its predication technique
removes many branches altogether. Hammer has to make do with the x86 instruction set, which has no
concept of branch hinting or predication, and never will, unless AMD chooses to invent new instructions
and create yet another significant extension to x86.
IA-64's instruction clustering means the hardware doesn't have to check for dependencies; that's the
compiler's responsibility. This eliminates a lot of the nastiest, most convoluted hardware from the
processor's critical path--exactly the place where Hammer spends most of its effort.
IA-64 code density should be as bad as x86 code density is good. That's a bonus for AMD and Hammer,
though you don't often see server manufacturers choosing their high-end processor based on code density.
Intel threw out the baby with the bathwater, creating an entirely new microprocessor and gluing x86
compatibility onto the side for sentimental value. AMD keeps straining the same old bathwater. Odd as it
seems, AMD now carries the torch for x86, extending it this way and that, while Intel heads down another
path. In a sense, AMD now "owns" the x86 architecture.
Intel's Ace in the Hole: Yamhill
It's pretty well understood that Itanium will not provide leadership x86 performance. That's Hammer's great
hope, in fact. AMD's strategy depends on Intel mistakenly abdicating its x86 throne, leaving Hammer and its
descendants the heirs apparent to a software kingdom.
Would Intel so cavalierly jeopardize its legacy? Not on your life. To no one's great surprise, Intel is rumored
to be developing something that will give future Pentium processors--not IA-64 processors--a performance
kick. In a perverse reversal of roles, Intel may actually be following AMD's lead in 64-bit x86 extensions. A
"Hammer killer" technology, code-named Yamhill, may appear in chips late next year, about the time
Hammer makes its debut. It's suggested that Intel's forthcoming Prescott processor will be based on
Pentium 4, but with Yamhill 64-bit extensions that coincidentally mimic Hammer's. (Prescott is also
rumored to be built on a 0.09-micron process and to implement Hyper-Threading.)
Naturally, the very existence of Yamhill, if it exists at all, is a diplomatically touchy subject at Intel HQ. The
company doesn't want to undermine its outward confidence in Itanium and IA-64, but neither can it afford
the possibility of ceding x86 dominance to a competitor. Besides, whether they appear in future Pentium
derivatives or not, Intel's 64-bit extensions could appear in future IA-64 processors instead. New IA-64
features plus competitive x86 performance--now that's a compelling product.
Summary Analysis: Intel v. AMD
It's said that "the short walk to the gallows focuses the mind tremendously" and AMD is headed up those
thirteen steps. The company worked very diligently on Hammer and appears to have produced another silk
purse from the sow's ear of x86 architecture. Will that be enough? It depends on what you're after.
Both processors are totally, completely, and inarguably backward compatible with x86 binaries. Anything
else would be a criminal dereliction of duty. If what you want is a faster x86 PC, it's entirely likely that
Hammer-based systems will run old (or upcoming) PC applications much faster than an Itanium- or
McKinley-based system. That's fine, for as long as you keep that box. But the day may come when
Microsoft Windows version n+1, or Quake XVII, is released for IA-64, but not for AMD's x86-64. And then
you'll have to choose sides. Microsoft has publicly announced it will port Windows XP to IA-64, but it has
made no such announcement about x86-64.
Running existing binaries on either Itanium or Hammer is a no-brainer, but what about new code? Now, for
the first time, software vendors will have to decide: do they support Intel, AMD, or both? Porting major
applications and operating systems to Hammer will not be trivial--but neither is supporting IA-64. Backing
Intel's newest and heavily promoted next-generation architecture is a foregone conclusion for vendors that
want to stay in business. Supporting AMD becomes more problematic. Will the added market share be
worth the effort? Suddenly AMD finds itself in the same boat as Apple with a different, yet competitive,
product that requires dedicated software support to survive.
Grimly, AMD itself lived through this tragedy not so many years ago, and the wound was self-inflicted. AMD
unceremoniously axed its entire 29000 family, one of the most popular RISC processors of the early 1990s,
due to the cost of software support. The company decommissioned the second-best-selling RISC in the
world because subsidizing the independent software developers was sapping all the profits from 29K chip
sales. As "successful" as it was, AMD had to abandon the 29K, the only original CPU architecture it ever
created.
There is one possible future scenario, though it's a long-shot hope for Hammer. Recall that back in the 1980s,
IBM was losing ground to PC clone makers, so it took its ball and went home. It changed the game, and
called it PS/2 - and look how well that worked. Instead of following IBM and switching platforms, the world
went right on using PC clones. IBM never regained the dominance it once had. Maybe IA-64 is just the
PS/2 of processors, a futile attempt to change the game just when it was getting good. Maybe the world
really wants a faster x86 instead of a new and different family of processors. Maybe lightning will strike
twice.
64-Bit CPUs: Alpha, SPARC, MIPS, and POWER
February 21, 2002
By: Jim Turley
Alpha--and Omega
Since the very first day Alpha processors were available, they've held the microprocessor speed lead.
Alpha consistently topped the benchmarks throughout the 1990s: the nerds' favorite microprocessor, an
engineer's delight, and everything a good 64-bit architecture should be. It also ended up a dismal
commercial failure. Is there no justice?
Digital introduced Alpha in 1992 and shipped its first Alpha processor in January 1993. Developed entirely
in-house, Alpha was a replacement for the MIPS processors Digital had been using for a few years, which
were, in turn, replacements for the VAX, which had become too complex for its own good.
Not entirely happy with MIPS processors, Digital's engineers set about developing "the RISC to end all
RISCs," in the words of team member Jesse Lipcon. Code-named EVAX, for Extended VAX, each Alpha
generation bore the project name EVn. Alas, EV7 was to be the last of this great dynasty.
There have really only been two major generations of Alpha processors despite the plethora of model
numbers and product announcements. The 21164 and the 21264 (EV6) supplied the core for all the other
processors. The 21364 (EV7) is essentially the same as the '264 internally but with a different external
system interface. Sadly, the EV8, which was to be an entirely new internal design, was never finished.
Some of the EV8 technology may be incorporated into future Intel 64-bit processors, since Compaq
licensed Alpha technology to Intel last year – more on that in a bit. But note that Compaq expects to
manufacture and ship Alpha chips into 2004, before converting over to IA-64.
Alpha 21364 -- Just Icing on the Cake
Alpha's latest and greatest processor, the 21364 (EV7), due later this year at 1GHz to 1.2GHz, takes an
unmodified '264 processor core and wraps some bodacious system-bus logic around it. First off, the '364
includes no fewer than four channels to Direct RDRAM (Rambus) memory for a whopping 6 GB/second of
theoretical peak bandwidth. And that's just to main memory. The chip also has four
interprocessor-communication channels for communicating with other '364 processors. Each channel
includes a pair of one-way 16-bit buses. The whole thing works like a packet network, with destination
headers in the packets and forwarding. Although the '364 has ports for four direct connections, you can
actually hook up as many processors as you want, and intermediate ones (that is, processors standing
between the one sending and the one addressed) will kindly forward packets on their way. If this sounds
familiar, that's because AMD licensed this technology from Digital for its own Hammer processor line (see
more details below).
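This forwarding behavior is the same idea as any store-and-forward network: each processor knows only its directly attached neighbors, a packet carries its destination in a header, and any node that isn't the destination passes the packet toward whichever port gets it closer. A toy sketch with processors on a ring (the topology and routing policy are our illustration; the real '364 fabric is richer):

```python
def route_on_ring(n_procs, src, dst):
    """Forward a packet around a ring of processors, taking the shorter
    direction each hop. Returns the list of nodes the packet visits."""
    path = [src]
    node = src
    while node != dst:
        cw = (dst - node) % n_procs    # hops remaining going clockwise
        ccw = (node - dst) % n_procs   # hops remaining going the other way
        node = (node + 1) % n_procs if cw <= ccw else (node - 1) % n_procs
        path.append(node)              # an intermediate node forwards it
    return path

# Eight processors, each linked only to its two ring neighbors:
print(route_on_ring(8, 0, 3))  # [0, 1, 2, 3] -- two intermediates forward
print(route_on_ring(8, 0, 6))  # [0, 7, 6]   -- shorter the other way
```

The point is that the sender needs no global bus: processors it was never configured to talk to are reachable as long as some chain of neighbors connects them.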
Although the '364's execution resources are the same as the '264's, the newer chip delivers better
performance because it has bigger caches. The '364 adds 1.5 MB of L2 cache directly onto the chip, 128
bits wide. The dual 64K instruction and data caches remain. The whole thing measures about 350 mm2 in
Digital's (oops, Intel's) 0.18-micron process – smaller than its '264 predecessor, which was built in
0.35-micron silicon.
Alpha 21264 Sets the Stage for Performance
The 21264 is getting a bit creaky by high-end microprocessor standards -- it was first announced in 1996,
yet this core has tided Alpha over to the present day. It is a four-issue, highly out-of-order machine. Its six
execution units are evenly divided into two integer units, two load/store units, and two floating-point units, a
pattern not too dissimilar from Athlon or Itanium. The '264 initially debuted at over 500 MHz, an unheard-of
frequency at the time. Current chips just squeak past 1 GHz, which is still impressive considering they're
not built in leading-edge processes, by any means.
Uniquely, the '264 actually duplicates its entire register file, giving one copy to one half of the execution
units and the other copy to the other half. Logically, any instruction can access any registers, but physically
they are separate. Exotic synchronization hardware keeps both halves of this schizophrenic register set
consistent.
Alpha is aggressively out of order. Instructions are decoded and left waiting in one of two instruction
queues (one for integer code, one for FP instructions). From there, instructions that can execute right away
with no dependencies, and with all their operands handy (i.e., already in registers), are pulled out first. After
all of those are dispatched, the Alpha takes mercy on the oldest instructions first, favoring them over those
that haven't been in the queue as long. The actual sequence in which instructions are executed is utterly
unpredictable, and has precious little to do with program order.
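That pick policy, ready instructions first and the oldest ready instruction winning ties, can be sketched as a tiny scheduler. Here each instruction names the earlier instructions whose results it needs, and the model (our own, heavily simplified: issue width of two, single-cycle operations) shows execution order diverging from program order:

```python
def schedule(program, width=2):
    """program: list of (name, deps) in program order, where deps are the
    names of earlier instructions whose results this one needs.
    Each cycle, issue up to `width` ready instructions, oldest first."""
    done = set()
    waiting = list(program)        # program order doubles as age order
    executed = []
    while waiting:
        ready = [ins for ins in waiting if set(ins[1]) <= done]
        picked = ready[:width]     # oldest ready instructions win the slots
        for ins in picked:
            waiting.remove(ins)
        executed.extend(name for name, _ in picked)
        done.update(name for name, _ in picked)  # results visible next cycle
    return executed

# 'b' waits on 'd', which itself waits on 'c': the machine runs a and c,
# then d, and the oldest-but-stalled 'b' comes dead last.
prog = [("a", []), ("b", ["d"]), ("c", []), ("d", ["c"])]
print(schedule(prog))  # ['a', 'c', 'd', 'b']
```

Even this toy shows why the final sequence "has precious little to do with program order": age only breaks ties among instructions that are already free of dependencies.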
This aggressively out-of-order approach is in stark contrast to Itanium, which does no reordering in
hardware at all. But that's exactly how IA-64 is supposed to be. Like any VLIW architecture, it relies on the
compiler to find parallelism and trusts the compiler to have found the best combination and sequence of
executable instruction units. Athlon and Hammer, on the other hand, reorder instructions, though not as
aggressively as Alpha.
Photograph of Alpha 21264 Slot B module
The '264 and '364 include not one, but two, dynamic branch-prediction methods. The processor chooses
between them on the fly. Because Alpha's 1-ns clock cycle is so wicked fast (at least, for its time and its
technology), an instruction-cache miss is even more than usually painful. That, and the '264's large die size
(298 mm2 in 0.35-micron, with 15.2 million transistors) means that even the on-chip caches can't be
accessed in a single cycle. You can't fight physics, but to alleviate some of the pain, Alpha's
branch-prediction hardware cooperates with the instruction cache. The '264's instruction cache lines are
longer than usual; if the line holds a branch instruction, the expected target of that instruction is cached
with it. If the branch is predicted taken, the '264 starts a cache lookup for the branch target right away.
Feedback from the branch-prediction circuitry periodically updates the predicted-target entry in the cache.
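The idea of carrying a predicted branch target alongside its instruction-cache line, so the target lookup can start before the branch resolves, can be sketched roughly like this. The class and field names are hypothetical; the real hardware does this with wires, not objects:

```python
# Sketch of a branch target cached alongside its instruction-cache
# line, so a predicted-taken branch can begin fetching its target
# immediately. Class and field names are hypothetical.

class ICacheLine:
    def __init__(self, instructions, predicted_target=None):
        self.instructions = instructions
        self.predicted_target = predicted_target  # next-fetch hint

def next_fetch_addr(line, fall_through, predict_taken):
    """Pick the next fetch address: the cached target for a
    predicted-taken branch, otherwise the fall-through address."""
    if predict_taken and line.predicted_target is not None:
        return line.predicted_target
    return fall_through

def update_prediction(line, actual_target):
    # Feedback from branch resolution refreshes the cached hint.
    line.predicted_target = actual_target

line = ICacheLine(["cmp r1, r2", "beq loop"], predicted_target=0x400)
target = next_fetch_addr(line, fall_through=0x108, predict_taken=True)
# target == 0x400: the target lookup can start right away
update_prediction(line, 0x480)  # this time the branch went elsewhere
```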
Shades of Alpha in Hammer
Digital licensed Alpha to microprocessor nonentities Mitsubishi (in 1993) and Samsung (in 1996), hoping
these new disciples would help spread the gospel. Samsung started with a second-source license, but
upgraded it to a full license (i.e., with rights to design its own processors) in 1998. Then, for no apparent
reason, Samsung renamed its US arm Alpha Processor, Inc (later API Networks) for the purpose of selling
its Alpha chips. That was about as successful as you might think.
There's a little bit of Alpha in AMD's Hammer processors. Back in 1997, AMD proudly announced it would
use Alpha's system bus for the "Slot A" interface of its upcoming K7. But it didn't end there. While the FTC was
investigating antitrust aspects of Intel's acquisition of Digital Semiconductor, it uncovered documents
showing AMD had been negotiating with Digital for an entire Alpha processor license, not just the bus. It's
no coincidence then, that Hammer's interprocessor communications and some of its internal features bear
more than a passing similarity to Alpha.
The Omega Glory
Alpha is fast – very fast – but why? Surprisingly, it isn't fast because of its architecture. Alpha's instruction
set is about the same as that of any other RISC architecture. There's no magic bullet there. Nor is there
anything spectacularly different about Alpha's internal implementations. Its pipelines have never been
especially long, nor does it embody any unusually clever microarchitectural techniques. No, Alpha's lead is
(was) almost entirely due to its top-notch manufacturing and the minute tweaking and detail work that went
into tuning the process to fit the processor. Alpha was exquisitely designed for the exact semiconductor
technology Digital used to fab it.
That equation fell apart when Intel and Compaq divided up Alpha. Intel got Digital Semiconductor's
fabrication plant in Massachusetts (along with StrongARM) while Compaq got Alpha and the DEC
computer business. Neither company got many of the engineers responsible for Alpha – they all bailed out
rather than join either acquirer. The magic that conjured up Alpha evaporated, never to rematerialize.
Alpha was the first 64-bit processor to reach 1 GHz, in 2001. Ironically, this milestone came just a few short
weeks after Compaq proclaimed it was discontinuing Alpha development and licensing Alpha technology to
Intel. It's a bit like having your star running back score the game-winning touchdown right after the team
announces he's being traded away. So long, Alpha. We'll miss you.
Sun UltraSPARC
SPARC is one of the purer RISC architectures still in existence. It's also just about the only one still used
for its original purpose of powering computers. That's not to say that the architecture has stagnated – far
from it. Sun has added visual- and media-processing features in its VIS (visual instruction set) extensions
to SPARC. Similar to MMX or 3DNow!, VIS adds the ability to handle packed RGB-alpha data for
compression, decompression, and video-processing applications. Even with those enhancements, and fully
eight generations of design, SPARC processors are all still software compatible, from the first to the most
recent. No mean feat, that.
But like any well-established architecture, SPARC is showing its age. While AMD and Intel mass-produce
processors at 2.0 GHz and up, Sun's latest UltraSPARC-III just barely squeaked past 1.05 GHz in January
– and it took TI's latest six-layer copper-interconnect process to do it.
UltraSPARC-III chips
Historically, SPARC has lagged behind most (usually all) other processors in clock speed and benchmark
performance. SPARC has spent a decade at the back of the RISC pack, at least according to most
recognized benchmarks. In its defense, Sun says it emphasizes "system performance," which includes
factors such as memory bandwidth and availability that aren't quantifiable in the benchmarks. Still, it's hard
to ignore SPARC's consistent lack of progress against MIPS, Alpha, Power (including PowerPC), and even
the dreaded x86. It's been said that the biggest difference between a Sun workstation and a PC is the ego
of the person sitting in front of it.
SPARC is to processors what Linux is to operating systems. It has become the flagpole around which
rebellious mobs gather in passive-aggressive demonstrations against the dominant player (in this case,
Intel). SPARC, and Sun, have earned a weird kind of nerd chic that is out of all proportion to their
relevance in the market, technical features, or performance. Sun succeeds largely on anti-PC and
anti-Microsoft sentiment, not pro-Sun sensibility.
Here's a good bar bet: What company has the most CPU design engineers (after Intel)? It's Sun, with
1,300 SPARC designers spread across Sunnyvale, Austin, and Chelmsford. For CPU designers looking for
work, Sun is the last, best hope before succumbing to The Dark Side.
UltraSPARC-III Outshines Its Predecessors
Sun describes UltraSPARC-III as "…the second generation of the 64-bit SPARC V9 architecture…" a
confusing agglomeration of revision numbers, to be sure. Is it second, third, or ninth generation?
UltraSPARC-III is essentially the same internally as other UltraSPARC chips that came before it. The
biggest differences are clock speed, external buses, and cache.
UltraSPARC-III has a 14-stage pipeline, the longest of any of the 64-bitters reviewed in our series, and on
par with the old Pentium Pro. It also has the now-familiar six execution units: two for integer, two for
floating-point, one load/store unit and one address-generation unit. With only one load/store unit,
UltraSPARC-III can't process multiple memory transactions the way, say, Hammer or Itanium can. It can
still have multiple loads and/or stores outstanding, thanks to its buffers and queues; it just has to
dispatch them one at a time from the code stream.
UltraSPARC-III has average-sized L1 caches of 32K for instructions, and 64K for data. The chip contains
the L2 cache tags, but not the cache memory itself. That cache is built off-chip using standard SRAMs.
There is no L3 cache at all. Without the L2 cache on the chip, UltraSPARC-III's price looks
artificially low next to its competitors'; in practice you'll add the 1M, 2M, or 8M of external cache
the chip supports. Like Alpha and Hammer, UltraSPARC-III has a built-in DRAM controller; in this
case, it manages SDRAM (synchronous DRAM) devices.
SPARC Instruction Set and Register Windows
SPARC's instruction set is unremarkable – after all, RISC is RISC – but its register set is unique. All
SPARC chips expose 32 registers to the programmer at any one time, but these registers are just a
"window" into a larger set of physical registers. The additional registers are hidden from view until you call
a subroutine or other function. Where other processors would push parameters on a stack for the called
routine to pop off, SPARC processors just "rotate" the register window to give the called routine a fresh set
of registers. The old window and the new window overlap, so that some registers are shared. As long as
you're careful about placing parameters in the right registers, the windows are a slick way to pass
operands without using the stack at all.
SPARC circular register windows
Slick as it seems, register windows have their drawbacks. The concept has been around for decades, yet
SPARC is almost the only CPU architecture to use it. First, register windows only help up to a point:
the number of physical registers is finite, and eventually SPARC runs out of space for more windows.
When that happens, you're back to pushing and popping operands on and off the stack. Second, it's next
to impossible to predict when the register file will overflow or underflow, so performance can be
unpredictable. Finally, the processor doesn't handle the overflow/underflow automatically in hardware.
It generates a software fault, which the operating system has to handle, burning more cycles.
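The overflow mechanics lend themselves to a short simulation. This Python sketch assumes a hypothetical pool of eight windows (real implementations vary) and ignores the overlapping in/out registers entirely:

```python
# Toy model of SPARC-style register-window overflow: a fixed pool of
# physical windows, rotated on each call, with a spill trap to the OS
# once the pool fills. Pool size is hypothetical; the overlapping
# in/out registers are ignored.

class RegisterWindows:
    def __init__(self, n_windows=8):
        self.n_windows = n_windows
        self.depth = 0           # windows currently in use
        self.spill_traps = 0     # times the OS had to spill to the stack

    def call(self):
        if self.depth == self.n_windows - 1:
            # No free window left: hardware raises a spill trap and
            # the OS copies the oldest window out to memory.
            self.spill_traps += 1
        else:
            self.depth += 1

    def ret(self):
        if self.depth > 0:
            self.depth -= 1

rw = RegisterWindows(n_windows=8)
for _ in range(10):   # a call chain deeper than the window pool
    rw.call()
# The first 7 calls get fresh windows; the last 3 each trap to the OS.
```

The unpredictability the text mentions falls out directly: whether a given call is free or costs a trap plus a memory copy depends on how deep the call chain happens to be at that moment.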
Many hardware engineers aren't particular fans of register windowing. It puts enormous demands on
multiplexers and register ports to make any physical register appear to be any logical register. In the
1990s, when there were nearly a dozen different vendors designing and marketing SPARC-compatible
processors, their designers complained bitterly about the headaches in routing interconnect over, around,
and through the register file in the middle of every SPARC processor.
Register windowing, an inherent and permanent feature of every SPARC, has so far made it impossible
to add multithreading, and difficult to keep clock speeds up. The 900-MHz and new 1.05-GHz
UltraSPARC-III chips both use TI's 0.15-micron copper process for their 29 million transistors.
Fortunately, most SPARC processors are buried inside Sun workstations, where the value of Sun's
software base and systems-level expertise outshines the relative shortcomings of its processors.
MIPS64 5Kf and R20Kc
Here's another good Silicon Valley bar bet: What does "MIPS" stand for? In the case of the eponymous
company, it's "microprocessor without interlocked pipeline stages." In other words, a RISC architecture
without (ideally) any hardware interlocks, a design goal that MIPS has very nearly kept for all these years.
MIPS, of course, does not make processor chips any more, not even for Silicon Graphics. It is one of the
more popular licensed architectures, used by over a dozen chip-making companies around the world for
their consumer devices (like handheld PCs), video games (Nintendo 64 and PlayStation), and countless
network boxes. Many networking companies use MIPS cores (some officially licensed, some not) in their
chips, but not because MIPS is particularly good at network processing. It isn't. MIPS is just a convenient,
clean, and easily scaled architecture around which special-purpose network processors or protocol
engines can be added. Indeed, MIPS is one of the cleanest and most generic processor designs around,
finely tuned for absolutely nothing.
Recently a little Boston company called Lexra licensed a knock-off "clean room" MIPS core, but the
company has now gone legit and taken out a MIPS license. Like SiByte and other MIPS users, Lexra
used MIPS processors as the framework for more interesting network processors.
One Little, Two Little, Three Little Endians
MIPS has two different 64-bit cores for sale: the low-end (relatively speaking, of course) 5Kf, and the
high-end 20Kc. The 5Kf has a six-stage pipeline and single-issue, scalar execution. The core comes with (or,
more precisely, can be designed with) L1 caches up to 64K in size. Assuming you choose to build yours in
a 0.13-micron process, you can expect to use up about 4.0 mm2 of silicon, according to MIPS. The caches
do not support multiprocessing features such as cache-coherent snooping, so the 5Kf is not destined for
high-end computers of any sort. Instead, it's a good embedded core for comparatively speedy networking
or video applications.
The 20Kc core, on the other hand, is MIPS's "big iron." In a break from recent tradition, you can buy this
core as a real chip. The 20K processor, as it's called, sports 7.2 million transistors on its 34mm2 die. With
that you get dual 32K caches and a generic new system bus called, for reasons that immediately suggest
themselves, MGB. MGB runs at 150 MHz and can displace 3.6 GB of data per second. Let's hope it doesn't leak
oil like most MGBs.
Whether you like it soft core or hard core, the 20K pumps two instructions through its simple seven-stage,
in-order pipeline. For a 64-bit processor, the 20K is no fire-breather. But this is how MIPS's customers want
it. Now that Silicon Graphics has adopted IA-64 like everyone else (except Sun), there's no point designing
high-end MIPS processors for workstations that will never exist. MIPS is now firmly in the embedded camp,
and that means sacrificing a few of the sexier features for better power consumption and easier
manufacturing. The 20Kc doesn't even have any exotic branch prediction. Its seven-stage pipe will
probably limit the 20Kc to about 500 MHz, and that suits most embedded ASIC designers just fine. In a nod
to networking reality, the core supports both big- and little-endian byte ordering.
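For the curious, the two byte orderings look like this in Python's struct module; the value is arbitrary, and this only shows the memory layouts a bi-endian core like the 20Kc can be configured to use:

```python
import struct

# The same 32-bit value laid out both ways. A bi-endian core can be
# configured for either ordering at the hardware level; this just
# shows what the two memory layouts look like.
value = 0x12345678
big = struct.pack(">I", value)     # most significant byte first
little = struct.pack("<I", value)  # least significant byte first

# big    == b"\x12\x34\x56\x78"
# little == b"\x78\x56\x34\x12"
```

Network protocols transmit multi-byte fields big-endian ("network byte order"), which is why a core aimed at network boxes bothers supporting both.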
MIPS 20K processor
The 20Kc core has three execution units (one full integer, one partial integer, one floating-point), but can
dispatch and execute only two instructions per cycle. Because the second integer ALU is a somewhat
limited version of the first, the 20Kc cannot dispatch two load/stores simultaneously, nor can it do two
branches (which is pretty obvious). Also, clearly, it can have only one floating-point operation in flight at any
one time, although, since FP operations can be many cycles long, the 20K can crunch on two new integer
operations while it waits for the FP instruction to come out of the wash.
Although the 20Kc won't spank other 64-bit processors, it does have the advantage of ASIC availability.
Taiwanese fab giant TSMC has already licensed the 20Kc, just as it did with earlier MIPS cores. This
makes the 20Kc easily available to ASIC customers with little or no interest in actually designing their own
MIPS-based chips; they can just glue on the 20Kc core and stick to their own specialty.
IBM Power4
Everything about IBM's Power4 processor is amazing. Amazing technology, amazing size (680 million
transistors), amazing power requirements, amazing performance. And it's amazing that anything so
complex can work at all. This beast has 5,200 pins on the package and consumes 500 watts (that's right,
half a kilowatt) of power.
Actually, Power4 is more than a processor, it's an entire neighborhood of processors. It's sold as a module
comprising two processor cores per die, and four die per module, making eight 64-bit processors and 680
million transistors in one unit. Each individual die contains 174 million transistors and measures a
sun-blocking 400 mm2 in IBM's 0.18-micron 7-layer copper process.
IBM Power4 module
The pipeline of a Power4 processor – one Power4 processor – is 12 stages long and feeds eight execution
units. Those eight include two integer units, two floating-point units, two load/store units, one branch unit,
and one condition-evaluation unit. Like current Athlon and Pentium machines, Power4 cracks its
instructions into an intermediate internal format that is more easily digested by the pipeline. This is a bit
odd, since Power is nominally a RISC architecture to begin with, but there you are. Both "native" Power
instructions, as well as PowerPC instructions, are decoded into this internal representation early in the
pipeline.
Power4 die
In-order versus out-of-order is a slightly clouded issue (no pun intended) on Power4. Instructions are
dispatched in order, whereupon the pipeline almost immediately reorganizes them. Individual instructions
can progress through the pipeline at various rates until they are reunited with their comrades and retired in
order. In all, more than 200 instructions may be in flight in Power4. And this is all on just one of the eight
processors in the module.
Some more statistics: the data cache is 32K and the instruction cache is 64K. The 1.5 MB of on-chip L2
cache is divided into thirds, the better to serve up fast responses. (You can see this division in the die
photograph.) This cache is shared between the two cores on the Power4 chip through an elaborate switch
matrix. Because the cache pieces are relatively small they can be made faster than one large cache. This
also helps open up more ports into the cache memory and avoids contention for resources, buses, tags,
and cache lines. It also makes the entire cache system spectacularly complex to design and manufacture
(not to mention the problems keeping all three portions cache-coherent) but IBM is not one to shy away
from such a challenge. These guys are professionals. Oh, and the cache controller for the 32MB of L3
cache is included on the Power4 chip, although the cache memory itself is off-chip, but not off the module.
The Tale of the Tape
Someone once said there are lies, damn lies, and benchmarks. Be that as it may, it sometimes comes
down to standard recognized benchmarks when it's time to pick a processor. And for 64-bit processors,
SPEC (Standard Performance Evaluation Corporation; www.spec.org) is the most-used source for those
benchmarks. SPECmarks are more reliable than some other benchmarks because the results can be
verified by outsiders. Any vendor posting results is required to specify exactly how those results were
obtained and how to duplicate them. Peer pressure keeps blatant marketing optimism in check. Still,
SPECmarks don't tell the whole story, as anyone who scores near the bottom will earnestly explain.
Current SPEC Benchmark Data for 64-Bit Processors

Processor, Speed            SPECint2000 (base)   SPECfp2000 (base)   System
Alpha 21264C, 1.01 GHz      561                  585                 Compaq AlphaServer GS160
Alpha 21264C, 1 GHz         621                  776                 Compaq AlphaServer ES45
Alpha 21264B, 833 MHz       518                  621                 Compaq AlphaServer ES40
Itanium, 800 MHz            379                  701                 HP rx4610
Itanium, 800 MHz            358                  655                 HP i2000
McKinley, 1.0 GHz           640                  --                  Very preliminary simulations from Intel
UltraSPARC III, 1.05 GHz    537                  701                 Sun Blade 2050
UltraSPARC III, 900 MHz     470                  629                 Sun Blade 1000
UltraSPARC III, 750 MHz     363                  312                 Sun Blade 1000
Power4, 1.3 GHz             790                  1098                IBM eServer 690
There is no data for Hammer because it's not available yet, of course, nor is there any SPEC data for the
MIPS R20K. As with all benchmarks, the scores can seem a little confusing because they don't track
clock frequency very linearly. That's partly because other system components (chipsets, memory
controllers, etc.) affect performance, as they should. Compiler revisions also affect performance, so
"faster" processors don't necessarily produce faster benchmark scores.
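Another reason the composites behave nonlinearly: each SPEC CPU2000 score is, roughly, a geometric mean of per-benchmark speed ratios against a reference machine, scaled so the reference machine itself scores 100. A sketch with made-up runtimes:

```python
import math

# Rough sketch of how a SPEC CPU2000 composite is computed: the
# geometric mean of per-benchmark speedups over a reference machine,
# scaled so the reference machine itself scores 100. Runtimes below
# are made up for illustration.

def spec_score(runtimes, ref_runtimes, scale=100):
    ratios = [ref / t for ref, t in zip(ref_runtimes, runtimes)]
    log_mean = sum(math.log(r) for r in ratios) / len(ratios)
    return scale * math.exp(log_mean)

# A machine 4x, 5x, and 6.25x faster than the reference on three
# hypothetical benchmarks:
score = spec_score([100.0, 100.0, 80.0], [400.0, 500.0, 500.0])
# geometric mean of (4.0, 5.0, 6.25) is 5.0, so score == 500.0
```

The geometric mean rewards balance: one spectacular benchmark can't paper over a few terrible ones, which is part of why a higher clock doesn't automatically buy a higher score.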
There's no question that IBM's Power4 behemoth is the winner in this contest, although you'd have to buy
a pretty big system (and a pretty big room to keep it in) to enjoy it firsthand. Itanium turns in remarkably
embarrassing SPECint (integer benchmark) numbers, scoring about as well as an UltraSPARC-III
processor that's 6% slower in terms of clock rate and about a hundred years older in terms of architecture.
As new product launches go, "Itanic" appears to be sinking under its own weight. Itanium's floating-point
scores are first-rate, however, an impressive reversal of fortune for Intel, which normally has to apologize
for its mediocre FP performance. And McKinley should do much better, as the estimated integer score
indicates.
Wrap-Up
Generations. Looking across all three parts of our 64-bit Computing series, what we're seeing here is
different processor generations. IA-64 is the latest generation, though not the latest thinking on CPU
architecture. Hammer is, by necessity, the oldest generation. Hammer, K6, Athlon, and Pentium Pro/II/III/4
all made a break from "true" x86 microarchitecture long ago when they started cracking x86 opcodes into
internal ROPs and running RISC machines internally. That makes Athlon perhaps a 1.5-generation
machine and Hammer a 1.999-generation processor. Alpha is generation 2. It's pure RISC, has some
exotic features, and is very well architected. SPARC probably belongs in this group, too, though its basic
architecture is a bit older than Alpha's. MIPS is of the same vintage as SPARC, and MIPS, SPARC, and
Alpha all bear some similarities to one another. All three stand between Hammer and Itanium in terms
of modern thinking.
Power4 is in a class by itself, pulling a middle-aged instruction set with it as it develops on-chip
multiprocessing and other system-level features. Although all of these processors are 64-bit machines,
most are destined for different and mutually exclusive markets. MIPS has given up the desktop and
become a very successful embedded architecture. SPARC has done just the opposite: Sun relies on
SPARC for its workstations and servers, and maintains its characteristically defiant attitude about other
CPUs. After some initial dabbles in the embedded realm, SPARC is now almost entirely Sun's pet
processor.
Alpha will fade away on its own, though not because of its performance, development, features, or
software support. Alpha is disappearing by fiat. It has no strategic place in its new environment so it must
be eliminated. Power4 is clearly in another world, one dominated by scientists with white lab coats and
bulging foreheads.
That leaves Hammer and Itanium competing head to head for the mainstream 64-bit workstation, and
Hammer should also compete with Itanium in portions of the 64-bit server market. Two roads are diverging
in the yellow woods of PC processors, and Intel, for a change, is taking the one less traveled. Whether
that's blazing a new trail or abandoning the road to riches remains to be seen. What's certain is that both
sides will claim to be on the shining path. AMD and Intel swap insults, claims, and counterclaims as
frequently as AOL sends us shiny new coasters in the mail.
In a time when microprocessors are advertised on TV and CPU vendors have their own jingles, the
fantastic technology embodied in these chips seems almost irrelevant. And it nearly is: microprocessor
marketing is drawing ever nearer to perfume advertising. All the emphasis is on packaging and marketing,
branding and pricing, channels and distribution, with little left over for solid product details, features, and
benefits. Little old ladies who don't know a transistor from a tarantula know the name "Pentium" and think
they want "HyperThreading." It's a good thing that for some of us, the technology still matters.
Copyright (c) 2002 Ziff Davis Media Inc. All Rights Reserved.