ExtremeTech - Print Article 1 of 27
http://www.extremetech.com/print_article/0,3998,a=22731,00.as

64-Bit CPUs: What You Need to Know
February 8, 2002
By: Jim Turley

It's the peak, the top, it's the Mona Lisa. It's the $64,000 Question: what processor will dominate 64-bit computing? Sixty-four bits holds the promise of new performance, new architectures, new compilers, and a new balance of power in CPU realpolitik. A clean break with the old, a new chance for the new.

What hardware or architectural changes are in store for 64 bits? Quite a lot, although few of them have to do with 64-bittedness, per se. But 64-bit processors are at today's very high end, and they showcase all the best thinking in microprocessor design. This is the cutting edge, where silicon manufacturing, computer architecture, compiler technology, and marketing wizardry all come together. In the words of Calvin and Hobbes, scientific progress goes "Boink!"

For most of us waiting breathlessly on the sidelines, the 64-bit battle is between Intel's IA-64 and AMD's Hammer architectures. Separately, we'll evaluate the pros and cons of the "other" 64-bit processors used in workstations and servers, such as SPARC, Power, MIPS, and Alpha.

In this first segment of our 64-bit computing series, we'll launch into the wonder that is IA-64. You've probably seen much information already written about Itanium and IA-64 architecture in the past few years, which is mostly a replay of Intel-generated information. We'll try to get beyond the standard facts and hype, and take a critical look at Itanium and IA-64/EPIC, by describing features and delivering some critical analyses. We'll set the stage for an architectural comparison with Hammer and other 64-bit architectures in future segments. To be clear, this 64-bit computing architecture series is not performance-testing focused. It's architecture focused, and discusses long-term potentials. We will, however, point you to a few Itanium performance studies on the Web.
Intel IA-64 & Itanium

Intel's IA-64 (née Tahoe) architecture had a gestation period longer than that of an elephant. After first announcing their cooperation in 1994, Hewlett Packard and Intel said the first offspring of their matrimony would arrive "not before 1998," a prognostication that certainly proved to be true. In reality, the design was even longer in the making, for Intel and HP had stealthily begun working well before their mid-'94 announcement. Ten years and 325 million transistors later, we behold Itanium (all the good names were taken). Originally code-named Merced, Itanium is the first-born of the IA-64 family and our first real look into how well IA-64 will--or won't--work. (The subsequent offspring, code-named McKinley, Madison, and Deerfield, are covered later in this article.)

First and most obviously, Itanium, like all IA-64 processors, is not an x86 chip. It is a clean break from the long and legendary x86 (or IA-32, in Intel parlance) architecture that Intel invented, seemingly back when Earth was still cooling, and which propelled the Santa Clara company to such heights. Yes, Itanium is able to run x86 code in backward-compatibility mode, but that compatibility is tacked on; in its element, Itanium and all IA-64 chips are nothing at all like Pentium. That's both good news and bad news, as we shall see. It's good to be free from the tyranny of the x86 architecture, considered by many programmers to be the worst 8-bit, 16-bit, or 32-bit (take your pick) CPU family ever developed. That it should have succeeded so spectacularly is enough to shake one's faith in divine forces. The bad news? IA-64 leaves behind everything that made x86 chips ubiquitous, and presumably replaces it all with new bugs, new quirks, and new head-scratchers, leaving us to wonder, "why the hell did they design it that way?"
The Heart of the Beast: A Modified VLIW Core

Internally, Itanium is a six-issue processor, meaning it can profitably handle six instructions simultaneously. It's also a VLIW (very long instruction word) machine with some enhancements for added flexibility in instruction groupings, less code expansion than classic VLIW designs, and better scalability, to permit wider parallel instruction issue in future IA-64 processors. Thus Intel prefers the term EPIC: Explicitly Parallel Instruction-set Computing.

Itanium has nine execution units, and future IA-64 processors will probably have more. The nine are grouped into two integer units, two combo integer-and-load/store units, two floating-point units, and three branch units. These four groups are significant, as we shall see in a moment.

Here's a simplified Itanium block diagram, and a more complex block diagram:

Itanium has a 10-stage pipeline, which is respectable but not impressive by today's standards. Again, future IA-64 processors may have different and probably longer pipes. For comparison, Pentium III has a 12-stage pipeline, but the Alpha 21264 has just eight stages. And Pentium 4 has 20 stages (from the point of fetching micro-ops from its trace cache), and Athlon has 10 stages.

Here's a basic Itanium pipeline diagram:

328 Registers and Counting

The Itanium processor has a massive register set, with 128 general-purpose integer registers (each 64 bits wide), 128 floating-point registers (each 82 bits wide), 64 1-bit predicate registers, 8 branch registers, and a whole bunch of other registers scattered among several different functions, including some for x86 backward compatibility.
Like a lot of RISC processors, the first register (GR0) is hard-wired to a permanent zero, making it worthless for storage but useful as a constant for inputs and a bit bucket for outputs.

Here's a simplified diagram of key application registers, and a detailed diagram of application and system-level register sets:

And of course Itanium supports standard 32-bit x86 execution modes, and the 32-bit registers are mapped onto the IA-64 registers. See details in the section titled "Don't Look Back: How Itanium Handles x86 Code" below.

What a far cry from the cramped, crowded register set of the x86! With 256 registers to play with, programmers have an embarrassment of riches. To avoid that embarrassment, IA-64 has two features that manage the register file: register frames and register rotation. These require some explanation…

Register Your Window Frames

Registers are great when your program is running, but pushing and popping 128 big ol' registers for subroutine calls is unpleasantly time-consuming (and usually not necessary anyway). It's traditional but it's inefficient. One alternative is register windows, of which SPARC processors are a notable proponent. Register windows have their problems, too, and it's no coincidence that the only major RISC architecture to use register windows is also the slowest major RISC architecture still in production. IA-64 gets around the constant pushing and popping by using register frames. The first 32 of the 128 integer registers are global, available to all tasks at all times. The other 96, though, can be framed, rotated, or both.
Before a function call, you use Itanium's ALLOC instruction (which is unrelated to the C function of the same name) to shift the apparent arrangement of the general-purpose registers so that it appears that parameters are being passed from one function to another through shared registers. In reality, ALLOC changes the mapping of the logical (software-visible) registers to the physical registers, much like SPARC does. The similarities with SPARC's windows are strong and the differences mostly minor. With IA-64's frames, the frame size is arbitrary, unlike SPARC, which supports a few different fixed frame sizes.

In the example illustration, the calling routine sets aside 11 registers (GR32 - GR42) for the called routine, with four registers overlapping. The overlapping registers are where the parameters will be passed, although they never really move. Regardless of what registers either routine physically uses, they will appear to be contiguous with the first 32 fixed registers, GR0 - GR31.

The maximum frame size is all 96 registers, plus the 32 globals that are always visible. Only the integer registers are framed; FP registers and predicate registers (described below) are not. The minimum frame size is one register, or you can choose not to use ALLOC at all.

Rotating Registers

On top of the frames, there's register rotation, a feature that helps loop unrolling more than parameter passing. With rotation, Itanium can shift up to 96 of its general-purpose registers (the first 32 are still fixed and global) by one or more apparent positions. Why? So that iterative loops that hammer on the same register(s) time after time can all be dispatched and executed at once without stepping on each other. Each instance of the loop actually targets different physical registers, allowing them all to be in flight at once.
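The frame mechanics above can be sketched in miniature. This is a toy model, not real IA-64 behavior: the function names and the wrap-around policy are our own illustration. It shows how an ALLOC-style shift of the frame base makes a caller's output registers and a callee's input registers land on the same physical slots, matching the GR32 - GR42 example with a four-register overlap:

```python
# Hypothetical sketch of IA-64 register framing (illustrative only):
# logical registers GR32+ are remapped onto a physical file via a frame
# base, so a caller's output registers overlap the callee's inputs.

PHYS_STACKED = 96          # stacked physical registers behind GR32..GR127
GLOBALS = 32               # GR0..GR31 are always visible, never remapped

def logical_to_physical(logical, frame_base):
    """Map a logical register number to a physical slot."""
    if logical < GLOBALS:              # globals map straight through
        return ('global', logical)
    # stacked registers wrap around the physical file
    return ('stacked', (frame_base + (logical - GLOBALS)) % PHYS_STACKED)

# Caller's frame starts at physical slot 0; its outputs are GR39..GR42.
caller_base = 0
# An ALLOC-style shift advances the callee's base so the callee's GR32
# lands on the caller's first output register (the caller's GR39).
callee_base = (caller_base + (39 - GLOBALS)) % PHYS_STACKED

# The caller's GR39 and the callee's GR32 name the same physical slot:
assert logical_to_physical(39, caller_base) == logical_to_physical(32, callee_base)
```

The parameters "move" from caller to callee without a single copy; only the base register changes.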
If this sounds a lot like register renaming, it is. Itanium's register-rotation feature is less generic than all-purpose register renaming like Athlon's, so it's easier to implement and faster to execute. Chip-wide register renaming like Athlon's adds gobs of multiplexers, adders, and routing, one of the big drawbacks of a massively out-of-order machine. On a smaller scale, ARM used this trick with its ill-fated Piccolo DSP coprocessor. At the high end, Cydrome also used this technique, a favorite feature that Cydrome alumnus and Itanium team member Bob Rau apparently brought with him.

So IA-64 has two levels of indirection for its own registers: the logical-to-virtual mapping of the frames and the virtual-to-physical mapping of the rotation. All this means that programs usually aren't accessing the physical registers they think they are, but that's nothing new to high-end microprocessors. Arcane as it seems, this method still uses less hardware trickery than the full register renaming of Athlon, Pentium III, or P4.

Frames and rotation help up to a point, but eventually even Itanium runs out of registers. When that happens, we're back to pushing and popping registers on and off the stack. Where Itanium differs from SPARC is that Intel makes it automatic. Itanium's register save engine (RSE) is a circuit within the processor that oversees filling and spilling registers to/from the stack when the register file overflows or underflows, automatically and invisibly to software. SPARC, in contrast, raises a fault that must be handled in software.

The RSE is more complicated than you might think. It has to handle any kind of memory problem, page fault, exception, or error without bothering the processor. In Itanium, the RSE stalls the processor to do its work; in future IA-64 implementations, it will probably be handled more elegantly in the background.
The Good Stuff: Instruction Set

As we mentioned, IA-64 is an enhanced VLIW architecture, so its concept of "instruction" is a little different from that of, say, Pentium or Alpha. With IA-64, there are instructions, there are bundles, and there are groups. Get your notepads ready.

Instructions are 41 bits long. Yup - say goodbye to powers of two. It takes 7 bits to specify one of 128 general-purpose (or floating-point) registers, so two source-operand fields and a destination field eat up 21 bits right there, before you even get to the opcode. Another 6 bits specify one of the 64 predicate registers (which we discuss in more detail below), if any.

Instructions are delivered to the processor in "bundles." Bundles are 128 bits: three 41-bit instructions (making 123 bits), plus one 5-bit template, which we'll get to in a minute.

Still with us? Then there are instruction groups, which are collections of instructions that can theoretically all execute at once. The instruction groups are the compiler's way of showing the processor which instructions can be dispatched simultaneously without dependencies or interlocks. It's the responsibility of the compiler to get this right; the processor doesn't check. Groups can be of any arbitrary length, from one lonely instruction up to millions of instructions that can (hypothetically, at least) all run at once without interfering with each other. A bit in the template identifies the end of a group.

A bundle is not a group. That is, IA-64 instructions are physically packaged into 128-bit bundles because that's deemed the minimum width for an IA-64 processor's bus and decode circuitry. (Itanium dispatches two bundles, or 256 bits, at once.) A bundle just happens to hold three complete instructions.
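The bit accounting above is easy to verify:

```python
# Bit-layout arithmetic for an IA-64 bundle, as described above.
INSTR_BITS = 41            # one instruction
SLOTS_PER_BUNDLE = 3       # three instructions per bundle
TEMPLATE_BITS = 5          # one template per bundle

bundle_bits = SLOTS_PER_BUNDLE * INSTR_BITS + TEMPLATE_BITS
assert bundle_bits == 128  # 3 * 41 + 5

# Within each 41-bit instruction, the register fields dominate:
REG_FIELD = 7              # 7 bits select one of 128 registers
PREDICATE_FIELD = 6        # 6 bits select one of 64 predicate registers
operand_bits = 3 * REG_FIELD + PREDICATE_FIELD   # two sources + destination
assert operand_bits == 27  # leaving 14 bits for opcode and modifiers
```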
But logically, instructions can be grouped in any arbitrary amount, and it's the groups that determine how instructions interrelate to one another. All IA-64 instructions fall into one of four categories: integer, load/store, floating-point, and branch operations. These categories are significant in how they map onto the chip's hardware resources. Different IA-64 implementations (Itanium, McKinley, etc.) might have different hardware resources, but all will do their best to dispatch all the instructions in a group at once. And we'll see IA-64 compilers capable of optimizing binaries for different IA-64 processors, too.

It's hard not to think that Intel's institutionalized taste for baroque, ungainly, and downright bizarre instruction-set features crept in here somewhere. With so much elegance going for it, IA-64 falls down in the evening gown competition.

First, IA-64 opcodes are not unique - they're reused up to four times. In other words, the same 41-bit pattern decodes into four completely different and unrelated operations depending on whether it's sent to the integer unit, the floating-point unit, the memory unit, or the branch unit. A C++ programmer would call this overloading. An assembly programmer would call it nuts. You'd think that Itanium's designers would have been satisfied with 2^41 different opcodes, but no…

The second eccentric feature, which is related to the first, explains how Itanium avoids confusing these identical-but-different opcodes (a process serious engineers call disambiguation). The five-bit template at the start of every 128-bit bundle helps route the three-instruction payload to the correct execution units. Those of you who are good at binary arithmetic are thinking, "wait a minute… five bits isn't enough." And you'd be right--if you weren't designing Itanium.
Rather than tagging each of the three instructions with its associated execution unit, or just extending the instruction width, IA-64 uses these five bits to define one of 24 different "templates" for an instruction bundle (the other eight combinations are reserved). A template spells out how the three instructions are arranged in a bundle, and where the end of the logical group is, if any. And yes, you're right again, 24 templates is not enough to define all possible combinations of integer, FP, branch, and memory operations within a bundle, as well as the presence of a group's logical stop. Deal with it.

You'll notice that it's impossible to have an FP instruction as the first instruction of a bundle, and that load/store instructions are not allowed at the end. You can't have two FP instructions in a bundle, yet you can have three branch instructions bundled together. This is not as counterproductive as it sounds--as long as two of the branches are conditional and evaluate false, they do no harm other than wasting space.

How Epic is EPIC?

Is EPIC really VLIW? Yes, by most definitions of that term. Pedantic computer architects may argue over abstruse differences, and Intel's marketing people will steam over the misuse of their trademark, but for all intents and purposes, EPIC is merely a more pronounceable rendition of VLIW with a few enhancements. Few, if any, of EPIC's features discussed so far are unique to Itanium or to Intel. Broadsiding a processor with a volley of instructions at once is what VLIW is all about. EPIC corrupts, if you will, the pure ideal of VLIW by introducing its peculiar 5-bit instruction templates, which unnecessarily complicate multi-instruction issue and effectively eliminate several potential combinations of instructions.
On the plus side, Intel gets credit for allowing flexible-sized instruction groupings, which help increase issue efficiency. This is likely to pay off handsomely in future IA-64 processors. IA-64's groups also reduce the code bloat seen in traditional VLIW designs (where fixed-width VLIW instruction slots may often go unused if the compiler cannot find independent instructions to group together from within a particular window of instructions).

Certainly there are plenty of processors with multiple execution units and microarchitectures that can keep them busy. Predicated execution is nothing new, either. Tiny embedded processors do it, and compiler writers are happy to manage the multiple predicate bits. Itanium's scoreboard bits, register frames, and svelte, RISC-like instruction set have all been seen before. Itanium doesn't even reorder instructions, for cryin' out loud, something even midrange 32-bitters do all day long. But then again, Intel's formally stated goal was to shift complexity out of the processor logic and into the compiler. Yet, if you read a presentation from the last Intel Developer Forum, you'll see that "Future Itanium Processor Family processors can have out-of-order execution." Of course, this also implies that McKinley will be called Itanium II or something similar.

IA-64 doesn't really introduce anything all that new. It's more of an amalgam of concepts and techniques seen before and given the ol' Intel twist. That doesn't make it bad, but it's also not spectacular nerd porn.

Instruction Set Highlights

It would be tedious in the extreme to even summarize the entire IA-64 instruction set; you can refer here for the complete ISA (Instruction Set Architecture) listing.
But there are some highlights in the ISA worth noting, such as conditional (predicated) execution, hinted and speculative loads, and the odd way in which Itanium handles integer math.

Pretty much any IA-64 instruction can be conditional, with its execution predicated on literally anything you care to define. Far beyond the simple Z (zero), V (overflow), S (sign), and N (negative) flags of our childhood, IA-64 has 64 free-form predicate bits, each considered a separate predicate register. You can set or clear a predicate bit any way you like, and its condition sticks indefinitely. Any subsequent instruction anywhere in the program can check that bit (or multiple bits) and behave accordingly. This allows you, for example, to evaluate two numbers in one part of a program, but not make a decision (conditional branch) until much later. The microprocessor cognoscenti consider predicate bits more elegant than flags; they scale more easily to larger sizes (more bits) and are easier for compilers to target. We'll cover predication in more detail below.

Loading Up the Stores

IA-64 is surprisingly stingy with memory-addressing modes. It has precisely one: register-indirect with optional post-increment. This seems horribly limiting but is very RISC-like in philosophy. Addresses are calculated just like any other number and deposited in a general-purpose register. By avoiding special addressing modes, Itanium avoids specialized hardware in the critical path. VLIW pushes complexity onto the compiler instead of the hardware.

Loads can be pretty uninteresting, but IA-64 manages to spice them up a bit. Loads can "hint" to the cache that it would be beneficial to preload additional data after the load, whether that data is likely to be reused, and if so, which of the three cache levels is most appropriate to hold it.
These are not the kinds of things even dedicated assembly-language programmers are likely to know, but large-scale commercial developers might profile a new operating system or major application extensively, and use the feedback to provide prefetch and caching hints. These are just hints, too--the processor is under no obligation to act on the hints or the caching information.

Speculative Loads

Somewhat stronger than a hint is a speculative load, an instruction that tells the processor it might want to load data from memory. Programmers (or more realistically, advanced compilers) can sprinkle their code with speculative loads to try to snag data that might be needed soon. Itanium will do its best to comply, but if the system bus is busy, the speculative load might be postponed indefinitely. If a speculative load fails (such as from a memory fault or violation), the processor does not raise an exception. Hey, it was only speculative anyway.

Itanium can hoist loads above branches, which many high-end RISCs do, but it can also hoist loads above stores, which is much trickier. The usual problem with the latter procedure is alias detection: the compiler can't be sure that loads and stores aren't to the same address. As long as there's a chance, it's dangerous to load from memory before all the stores to the same memory addresses are finished. Yet loads are time-consuming, so it's a big win if you can accelerate them. IA-64 gets around this problem--with a little help from you--with the LD.A (load advanced) instruction. LD.A speculatively loads from memory, but also stuffs the load address into a special buffer called the Advanced Load Address Table (ALAT). Subsequent stores to memory are checked against addresses in the ALAT. If there's a match, the speculative load aborts (or, if it already completed, the contents are discarded). Using the data from an LD.A can be tricky, too. You need to validate it with a CHK.A instruction first.
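The ALAT bookkeeping can be modeled in a few lines. This is a toy sketch only; the method names echo the mnemonics but are our own invention, not the real hardware interface. An advanced load records its address, a conflicting store invalidates the entry, and the check step either keeps the speculative value or falls back to a reload:

```python
# Toy model of the Advanced Load Address Table (ALAT); illustrative only.
class ALAT:
    def __init__(self):
        self.entries = {}            # address -> speculatively loaded value

    def ld_a(self, memory, addr):
        value = memory[addr]         # speculative load; address remembered
        self.entries[addr] = value
        return value

    def store(self, memory, addr, value):
        memory[addr] = value
        self.entries.pop(addr, None) # a matching store invalidates the entry

    def chk_a(self, memory, addr, spec_value):
        if addr in self.entries:     # no intervening store: value is good
            return spec_value
        return memory[addr]          # recovery: reload the real value

mem = {0x100: 7, 0x200: 9}
alat = ALAT()
v = alat.ld_a(mem, 0x100)            # load hoisted above the store
alat.store(mem, 0x100, 42)           # aliasing store fires the ALAT entry
assert alat.chk_a(mem, 0x100, v) == 42   # check forces a reload

w = alat.ld_a(mem, 0x200)
alat.store(mem, 0x100, 1)            # unrelated store leaves the entry intact
assert alat.chk_a(mem, 0x200, w) == 9    # speculative value survives
```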
There's no guarantee that any calculations you did won't have to be redone with valid data. It's a bit of a gamble, but it can pay handsomely if you speculate wisely. Architecture imitates life.

FP You, Too

Bizarrely, Itanium's two floating-point units can't multiply two numbers together. They can't add, either. The FPU is designed for multiply-accumulate (MAC) operations, so if you want a conventional FP MUL you program it as an FP MAC with an addend of zero. Likewise, if you want a simple FP ADD you're forced to use a multiplier of 1.0 along with the value you want to add. Stranger still, Itanium has no integer multiply function at all. Any multiplication, whether it's integer or floating-point, has to happen in the FP MAC unit. Unfortunately, that means transferring a pair of integers from the general-purpose registers to the floating-point registers, then transferring the result back again. Fortunately, IA-64 includes a few instructions specifically for this eventuality. What were they thinking?

Branches: Going Out On a Limb

The longer the pipeline, the bigger the train wreck if the processor mispredicts a branch. And Itanium has a fairly long pipeline, so the potential for performance-robbing disaster looms ever larger. Predicting branches takes on paramount importance, and to that end, IA-64 has a number of tricks to help it avoid the dreaded mispredicted branch. First, there's only one form of conditional branch, but its behavior can be based on any of the 64 predicate bits mentioned earlier. Branches can also be tagged with either static or dynamic branch prediction (that's prediction, not predication), which predicts whether the branch is likely, or not likely, to be taken this time around. Static prediction cannot be overridden; dynamic prediction leaves the decision to Itanium's own branch-prediction hardware.
If you, as the programmer, know which way the branch is likely to go, stick with static prediction and the chip will assume you're always right. If you're unsure, let Itanium make up its own mind. If you're feeling especially clairvoyant, you can also suggest that Itanium fetch instructions from the predicted target of the branch, and even how far ahead of the branch target it should prefetch.

Predicated Execution

Predication is cool--it avoids short branches that inject bubbles into the pipeline. Rather than skip over short sections of code, predicated processors can plow straight ahead, either committing or discarding the results based on the predicate test. It effectively permits execution of both branch code paths at the same time. Predicated instruction sets have a mixed effect on code density. They improve code density slightly by eliminating branch instructions, but then hand back much of that improvement by usurping several bits (in IA-64's case, six bits per instruction specifying one of 64 predicate registers) from every instruction for the predicate field.

Predicated execution sacrifices execution units on the altar of branch latency. In other words, predicated instructions make it most of the way through the pipeline whether they're supposed to execute or not. In Itanium's case, all conditional instructions are predicated, so everything executes nearly to completion. It's only in the next-to-last DET (exception detection) stage of the pipeline that their effects are canceled if the predicate turns out to be false. By that time, the instruction has already commandeered one of Itanium's nine execution units for nothing, possibly preventing some other instruction from using it. Well, not entirely for nothing; it has served the greater good by avoiding a potential bubble in the pipeline. Better to waste a little work than to spin your wheels waiting for a branch to resolve.
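A predicated minimum illustrates the idea: both "arms" of a short if/else occupy execution slots, and complementary predicate bits decide, late, which result survives. (This Python sketch only mimics the semantics; the register and predicate names are illustrative, not real IA-64 encodings.)

```python
# Sketch of predicated execution: no branch, both arms "run", and the
# predicate bits gate which writeback is committed. Illustrative only.
def run_predicated(a, b):
    # a compare instruction sets a pair of complementary predicate bits
    p1 = a < b
    p2 = not p1

    regs = {}
    # Both instructions occupy execution resources; the predicate decides
    # near the end of the pipeline whether their effects are kept.
    if p1:
        regs['r8'] = a         # (p1) mov r8 = a
    if p2:
        regs['r8'] = b         # (p2) mov r8 = b
    return regs['r8']

assert run_predicated(3, 5) == 3   # computes min(a, b) with no branch
assert run_predicated(7, 2) == 2
```

The hardware cost is the wasted slot for the losing arm; the payoff is that no pipeline bubble is risked on a short, hard-to-predict branch.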
It's small comfort, but predicated instructions that would stall waiting for an operand are killed early, because Itanium resolves the predicate (true/false) about the same time that it detects the dependency. It won't stall instructions waiting for data that's irrelevant. That's the beauty of predicate bits set well ahead of time instead of flags that are updated every cycle.

Don't Look Back: How Itanium Handles x86 Code

Yes, Virginia, there is an x86-compatibility mode in Itanium. It's awkward and unnatural, but we know how attached you are to your old binaries. IA-64 does not normally support older x86 binaries, and it's entirely possible that some future IA-64 implementation might drop this feature or water it down, but for now your old Lotus 1-2-3 diskettes are safe.

Itanium supports all x86 instructions in one way or another, even MMX, SSE (not SSE2), Protected, Virtual 8086, and Real mode features. You can even run entire operating systems in x86 mode, or just run the applications under a new IA-64 OS. All the x86 registers map onto Itanium's own general-purpose registers, but some of the less orthogonal x86 registers appear in Itanium's "application registers" AR24 through AR31.

Switching modes appears trivial but isn't. There's one IA-64 instruction that switches the processor to x86 mode and another (newly defined) x86 instruction, JMPE, that switches to IA-64 mode. If the programmer so wishes, interrupts can switch automatically to IA-64 mode or the machine can stay in x86 mode. In the latter case, you can reuse your x86 interrupt handlers. Switching to x86 mode is a lot like booting a '386 because you have to set up memory segment descriptors, status registers, and flags.
Also, x86 code likes to have its way with all the resources of the processor, either overwriting or ignoring many of Itanium's state bits and registers. It's also likely to upset your cache contents. In general, it's best to save the entire state of the processor before switching to x86 mode. It's awkward enough that you probably don't want to switch modes willy-nilly. Save it for dramatic changes, such as executing entire x86 applications.

Not that anyone was asking, but PA-RISC compatibility is handled offline through a software translator. IA-64 instructions don't directly support PA-RISC instructions, but they do map fairly closely (hey, RISC is RISC). The fact that x86 binaries are emulated in minute detail with enormous helpings of hardware while PA-RISC code is relegated to a translator before it has any hope of running says a lot about the relative importance of these two installed bases. It may also tell us something about the "equal" relationship between the HP and Intel engineers designing IA-64.

Oooooh, It's So Big!

The definitions of chip, processor, and die become somewhat clouded with Itanium. The first IA-64 "chip" is really a metal-cased cartridge, somewhat like Pentium II modules of yore. The cartridge - which is mechanically incompatible with anything ever seen before - contains at least five chips, including the processor itself and four cache SRAMs. The first- and second-level caches (L1 and L2) really are on the same die as the processor; the L3 cache takes up those four SRAMs that are off-chip but on-module. Got it?

Processor abstraction layer

Then there's the PAL. PAL is Intel's "processor abstraction layer," a flash ROM inside the cartridge that, in Intel's words "… maintain[s] a single software interface for multiple implementations of the processor silicon steppings."
Sounds like a "fudge ROM" for hiding, tweaking, or patching imperfections in processors that may not entirely live up to their data book specification.

The whole thing weighs in at about 325 million transistors: 25 million for the processor chip (including L1 and L2 caches) and about 75 million for each of the four L3 cache chips. We'll toss in the PAL for free. If 25 million transistors seems like a lot, remember that Pentium III has 24 million and Pentium 4 has 42 million. For a high-end 64-bit processor, Itanium is looking positively dinky.

You know what else is big? Itanium's code footprint. Poor code density is a hallmark of VLIW designs, and although IA-64 makes some improvements as we mentioned, it's no exception to the rule. With no (public) code to look at it's hard to be sure, but educated estimates pin Itanium's code size at about one-third bigger than other 64-bit RISCs and double the size of Pentium binaries.

Poor code density means lots of disk space, but that's not a big deal for high-end systems. It also means less effective cache size, which in turn reduces cache-hit rates. Again, no big deal because caches can always be made bigger. But cache bandwidth is hard to improve, and that may be the real bottleneck for IA-64 processors. That's why Itanium's first two levels of cache are on the processor die itself and the L3 cache is very nearby on the same module.

Outside the Box

The 128-bit bus between the Itanium die and its L3 caches is contained entirely within the cartridge; it's never exposed to the outside. Itanium's external bus is 64 bits wide, and this is its only connection with the outside world, main memory, or other processors. Up to four processors can share this bus. After that, Intel has a bridge chip that allows four-processor clusters to talk to each other. It's a pretty pedestrian bus as these things go.
It has none of the exotic interprocessor communications that Hammer has (as we'll study in our next segment), nor is it even very fast at 2.1 GB/second of maximum bandwidth, compared with 3.2 GB/second for Pentium 4 or 3.6 GB/second for MIPS. It's also a doomed, dead-end bus: McKinley will have a completely different interface. McKinley, Madison, and Deerfield: The Next Generation The second IA-64 processor after Itanium is code-named McKinley and it's likely to be faster, smaller, and all-around better than its predecessor. McKinley's L1 caches will be the same size as Itanium's, but the L2 cache will grow from 96K to 256K. The L3 cache will get smaller (3M instead of 4M) but move onto the actual chip, not just on the same cartridge. All three cache interfaces will get faster. McKinley shaves one cycle off the L1 cache access time (from two cycles to one), shortens L2 access time by seven cycles (to five), and takes eight cycles off the L3 latency (to 12 cycles). Adding the L3 cache to the chip will boost McKinley's die size significantly, probably to around 450 mm2, and ups the transistor count to 221 million. But manufacturing cost should be significantly reduced without the external L3 SRAMs and larger package required for the dual-chip (core and L3) Itanium. McKinley will use a completely different socket design from Itanium and a revised bus interface, dooming the first IA-64 systems almost before they get out the door. Just like Pentium Pro, Itanium's mechanical footprint will be an orphan from Day One. McKinley's system bus will widen to 128 bits (up from Itanium's 64) and its clock frequency will improve from 133 MHz to 200 MHz. The bus will still be double-pumped (i.e., transferring data on both rising and falling edges of every clock) yielding 6.4GB/sec front-side bus bandwidth. Next up comes Madison, expected to be a 0.13-micron shrink of McKinley, all other things being equal. 
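The bus numbers quoted above all fall out of the same simple formula: bus width in bytes, times clock, times transfers per clock (two, for a double-pumped bus). A quick sketch, using only figures already given in the text:

```python
# Peak front-side bus bandwidth = bytes per transfer x clock x transfers/clock.
# All figures below are the ones quoted in the article.

def peak_bandwidth_gb_s(width_bits: int, clock_mhz: int, pumps: int = 2) -> float:
    bytes_per_transfer = width_bits // 8
    return bytes_per_transfer * clock_mhz * pumps / 1000  # MB/s -> GB/s

itanium = peak_bandwidth_gb_s(64, 133)    # 64-bit bus at 133 MHz, double-pumped
mckinley = peak_bandwidth_gb_s(128, 200)  # 128-bit bus at 200 MHz, double-pumped
print(round(itanium, 1), mckinley)        # roughly 2.1 GB/s vs. 6.4 GB/s
```

Doubling the width and half-again the clock is how McKinley triples Itanium's bus bandwidth without any exotic signaling.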
Deerfield, the fourth member of IA-64's growing family, will also be a 0.13-micron shrink of McKinley, but this time with a smaller 1M L3 cache and yet another new bus interface intended for cheaper systems. Deerfield will be the "value" version of IA-64, à la Celeron or Duron. Bottom Line IA-64 is an interesting architecture that borrows from and/or extends many existing microarchitectural techniques, and also adds some new and interesting twists, but the first instantiation of the architecture, Itanium, has not been a major success to date. After waiting a few years longer than originally anticipated for the first IA-64 chip to appear (Intel publicly disclosed initial IA-64 details at Microprocessor Forum in 1997, and stated the first IA-64 chip, code-named Merced, was expected to ship in mid-1999), we saw a processor with a slower than expected clock rate, and less than stellar integer performance that catered to a very limited market. Plus initial shipments were stymied by delays from key vendors. Some commentators called Intel's first IA-64 chip "Unobtanium". And not surprisingly, the catch-phrase for quite some time has been "wait for McKinley, Itanium is simply a development platform". Very recently another setback occurred with Dell dropping Itanium workstations from its product lineup (see "Dell Discontinues Itanium Workstation"), possibly encouraging even more people to "wait for McKinley". But clearly Itanium is not all that bad. Floating-point performance as seen in some benchmarks is impressive today, and its large address space can certainly be useful in various high-end applications, but Intel faces a steep uphill battle trying to convince many server and workstation customers, with long histories using established 64-bit architectures, to convert to IA-64 at this juncture. Then again, Intel has swayed many customers to convert portions of their application processing to Itanium-based solutions as seen at this link.
Software developers are a key target as well, and many have been on the IA-64 bandwagon for a while. Things could improve substantially when McKinley arrives later this year in development systems and early next year in volume. We expect Intel to start seriously ramping IA-64 architecture processor shipments in selected markets within two to three years. But let's not forget about AMD, who clearly appears to be up for the challenge, as we'll see in our next segment. Also, we'll provide our thoughts on the rumored Yamhill 64-bit x86 "hedge your bet" technology under deep wraps within Intel development labs. Resources Itanium manuals. Be sure to explore the menu to the left of the page - it has links to lots of other Itanium reference material including some PowerPoint slides. A nice quick overview of basic Itanium features can be found here. A set of performance tests and a summary of SPEC test results are at this link from last summer. Intel's own benchmarketing results are at this link. 64-Bit CPUs: AMD Hammer vs. Intel IA-64 February 13, 2002 By: Jim Turley In 25 words or less: Intel's IA-64 is a clean break, while AMD's Hammer is philosophically (some would say pathologically) another extension to the ages-old x86 architecture. It's always fun to root for the underdog. It's the American way to cheer for the little guy, hoping he'll triumph over the dark forces of--ironically enough--Corporate America. AMD has been squarely in the underdog role for quite awhile now, but the brewing Hammer v. Itanium match-up will take that to a whole new level. IA-64 is a new VLIW design that has x86 compatibility tacked on; Hammer is a real x86 processor (albeit one with 64-bit extensions) just like Athlon, K6, and AMD's other processors before it. The two product lines are now heading down separate paths. You're always the winner when you run a race by yourself.
But when you reach that finish line, are you anyplace you want to be? In this second part of our three-part analysis on 64-bit computing architectures, we'll delve into Hammer's microarchitecture and compare/contrast to Itanium's existing design, and to some extent McKinley's upcoming design. You can check out our earlier Itanium analysis at this link. Hammer of the Gods As x86 processors go, this is about as good as it gets, boys and girls. Hammer strikes a blow to the doomsayers who prophesied x86 was (or should be) dead. Who would've thought that anyone could pound so much performance out of a souped-up 8086 after twenty years? Behold, Hammer: the ultimate expression of CISC ingenuity. At its core, Hammer is a nine-way superscalar, massively out-of-order, CISC-into-RISC processor not too different in architecture from Athlon. In many ways, the change from K6 to Athlon (née K7) was greater than from Athlon to Hammer (née K8). Hammer has nine execution units, the same as Athlon and, coincidentally, the same as Itanium. These are grouped into three integer units (arithmetic/logic units, aka ALUs), three address-generation units (AGUs), and three floating-point units. Like Athlon and K6 before it, Hammer transmogrifies every x86 instruction into one or more internal RISC operations (ROPs). Beyond the first few stages of the pipeline, Hammer is a RISC machine with no idea of x86 instructions or machine state. Hammer can decode up to three x86 instructions and dispatch up to 9 ROPs per cycle, assuming the best case where each ROP serendipitously maps to one of Hammer's nine execution units. Most ROPs execute directly in hardware, but even after conversion, some x86 operations are too perverse for that. These are trapped and emulated by routines in Hammer's micro-ROM, just like Athlon. Ah, microcode--the quintessential CISC technique. Hammer's pipeline is longer than Itanium's, at 12 stages. 
It should come as no surprise that most of that small difference is spent decoding x86 instructions and converting them to more digestible ROPs. Which Weighs More: Nine Lbs of Hammer or Nine Lbs of Itanium? Although Hammer and Itanium both have nine execution units, it's hard to say that they'd both accomplish the same amount of work per cycle. For starters, Hammer's got three address-generation units (Itanium has none), which don't really contribute to forward progress. They're more of a necessary evil. Itanium has no address-generation units because it supports only one simple addressing mode. Advantage: Intel. Hammer can dispatch nine ROPs to Itanium's six, which would appear to give Hammer a 50% lead in grunt-per-cycle. On the other hand, these ROPs are nothing but decimated x86 instructions, so they don't count for much individually. It usually takes a handful of ROPs to equal one "real" instruction. But the same could be said for Itanium (and all VLIW machines)--what is one of those instructions worth? We'll call it a draw and file it under the same category as "how many angels can dance on the head of a pin?" Three of Hammer's nine execution units are for floating-point operations only, leaving six for integer code (the three AGUs and three ALUs), while only two of Itanium's are used for FP, with the remaining seven free for integer code (recall that Itanium has two integer units, two combo integer-and-load/store units, two floating-point units, and three branch units). That, and the fact that Itanium's integer units do more useful work (as opposed to the housekeeping mentioned above) suggest that the Intel chip will make more headway on normal code. Advantage: Intel. But wait: three of Itanium's seven non-FP units are branch units. Itanium really has just four integer units to Hammer's three.
Two of Itanium's integer units do double duty as load/store units, so it's not quite fair to say you could run four integer operations at once - you've got to do loads and stores sometime. Hammer, on the other hand, sets aside three address-generation units to this task, so you really can execute three integer operations at once. We'll call this one a tie. Hammer has one more floating-point unit than Itanium. On the other hand, Itanium's are both equivalent and able to handle any FP operation, whereas all three of Hammer's are different. Itanium gets bonus points for symmetry but Hammer can potentially get more floating-point work done. Advantage: AMD. Please Register Your Software The most noticeable enhancement to Hammer is its 64-bit register file (we'll get to the 64-bit instructions in a minute). All the old familiar x86 registers are extended to 64 bits and new registers added, with RISC-like names R8 through R15. Obviously, existing x86 binaries won't see the upper halves of the extended registers, or the eight new registers at all. The enhancements are visible to new 64-bit code only. AMD's sixteen 64-bit registers are a far cry from Intel's 128 general-purpose plus 128 floating-point registers. Even in its 64-bit mode, AMD has one-sixteenth the quantity of registers that IA-64 has. There's no argument that more registers is better, although there's plenty of contention over how much better. How many registers is enough? There comes a point of diminishing returns, but we'd bet that most programmers and compiler writers would prefer some number greater than 16. If it makes AMD fans feel any better, Itanium's register file is so big it takes two clock cycles to access a register, adding a stage to the pipeline. If it makes Intel fans feel any better, that delay's probably going to go away in McKinley and future IA-64 processors.
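The diminishing-returns argument is easy to demonstrate with a toy experiment (ours, not AMD's or Intel's): replay a stream of value accesses through an LRU-managed register file and count the reloads. Going from a handful of registers to 16 helps enormously; going from 16 to 128 helps less, unless the working set of live values is large.

```python
# Toy register-pressure experiment (our illustration): values are kept in a
# fixed-size register file managed least-recently-used; an access to a value
# not currently in a register counts as a spill/reload.
from collections import OrderedDict
import random

def spills(accesses, num_regs):
    regs, count = OrderedDict(), 0
    for v in accesses:
        if v in regs:
            regs.move_to_end(v)           # recently used, keep it around
        else:
            count += 1                    # reload: value wasn't in a register
            if len(regs) >= num_regs:
                regs.popitem(last=False)  # evict the least recently used
            regs[v] = True
    return count

random.seed(1)
trace = [random.randrange(40) for _ in range(2000)]  # 40 live values, reused often
print(spills(trace, 8), spills(trace, 16), spills(trace, 128))
```

With 128 registers, everything that fits stays resident after its first touch; with 16, the register file churns. Where the knee of that curve sits for real compiled code is exactly the contested question.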
When I'm 64 AMD presented itself with a problem that Intel didn't have: how to extend the original x86 instruction set to 64 bits? Intel treats 32-bit x86 code as the end of the road. If you want 64 bits go to IA-64 (though this stance may change--see the section on Yamhill below). AMD had to stretch the opcode map while still retaining backward compatibility with old x86 code. No mean feat. Adding 64-bit capability to existing x86 code doesn't make sense. The whole point of Hammer is to be able to run old binaries without modification. So to access its new 64-bit instructions and other features, Hammer has to switch to 64-bit mode. The process is simpler and less traumatic than Itanium's mode switch. First you set a control bit to globally enable 64-bit mode. Then you mark individual code segments as 64-bit segments by setting a previously unused bit in the local descriptor table (LDT). Any branches to segments so marked are interpreted as 64-bit code. Branches to code segments without the magic bit set are assumed to be 32-bit code. AMD also invented some new 64-bit instructions. There's a whole new set of floating-point operations that use a new flat FP register file of sixteen 128-bit registers. This will be a huge improvement over the spectacularly clumsy register-stack architecture of the original 8087 and all x87 FPUs that followed. (The floating-point abilities of Intel's processors were the low point of an already awkward architecture). Hammer also provides mixed code size compatibility the same way the '386 did, with a size-override byte. Any existing (that is, pre-Hammer) instruction can be prefixed with the one-byte REX pseudo-instruction. This byte tells the decoder that the operands in this instruction should be interpreted as 64-bit quantities.
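The decision the decoder ends up making can be sketched roughly like this (names and structure are ours for illustration; the real silicon involves descriptor caches and more prefix cases). In a 64-bit code segment, operands default to 32 bits; a REX prefix with its W bit set widens them to 64, and the legacy 0x66 override inherited from the '386 era still shrinks them to 16:

```python
# Illustrative sketch (our names, not AMD's) of operand-size selection:
# a global long-mode enable bit, a per-code-segment "64-bit" descriptor bit,
# then per-instruction prefix bytes.

def operand_size(long_mode_enabled: bool, segment_is_64bit: bool,
                 has_rex_w: bool, has_66_prefix: bool) -> int:
    if long_mode_enabled and segment_is_64bit:
        if has_rex_w:        # REX prefix with W set: 64-bit operands
            return 64
        if has_66_prefix:    # legacy 0x66 override shrinks to 16 bits
            return 16
        return 32            # default stays 32-bit even in 64-bit code
    # outside a 64-bit segment, behave like a plain 32-bit x86
    return 16 if has_66_prefix else 32

print(operand_size(True, True, True, False))    # 64-bit: REX.W in a 64-bit segment
print(operand_size(True, False, False, False))  # 32-bit: old code runs unchanged
```

The key point is that unprefixed instructions keep their old meaning, which is why existing binaries run unmodified.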
The '386 worked exactly the same way, introducing the 0x66 prefix byte, instantly turning decade-old 16-bit operations into new 32-bit operations without actually changing the instruction set or duplicating every operation. Bus & Cache Stuff Hammer's external interfaces are as interesting as Itanium's are dull. First off, Hammer includes two double data-rate (DDR) controllers that directly manage external SDRAM memory. The memory bus can be either 64 bits or 128 bits wide, and requires no glue logic. This compares favorably to Itanium's generic (and short-lived) system bus, which requires a separate Intel controller chip to make sense of memory. Sexier still, Hammer includes three (count 'em!) HyperTransport links, an obvious advantage over Itanium in multiprocessing. This bus is relatively open and has bandwidth to spare. Depending on how you arrange them, up to eight Hammer processors can seamlessly communicate amongst themselves using nothing but their built-in HyperTransport links. Anybody remember the Transputer? Whereas Itanium processors have to share a system bus, each Hammer gets its own private memory, courtesy of its on-chip SDRAM controller. (The first Hammer processor, Clawhammer, may only support two processors using a single HyperTransport link). But wait--there's more. The L1 data cache and the L2 cache, as well as all external memory, are protected with ECC. For good measure, all of these are periodically scrubbed to remove soft errors. (The L1 instruction cache is protected with parity, not ECC, partly for reasons of speed, but also because the instruction cache is never written into from applications, the OS, or drivers, making it more reliable.) AMD has clearly made its powerful, yet simple and elegant multiprocessing design Hammer's big differentiator over Itanium.
While Hammer has the more conservative internal architecture, it has the more ambitious system interface. Itanium has the radical redesign inside, but fairly pedestrian, almost PC-like, system architecture. Thankfully, McKinley will step the external bus performance up by about 3X. But there's no question that Hammer provides the easier path to two-way, four-way, or even eight-way multiprocessing systems. The question is, will that be enough? Hammer may have multiprocessing features that Itanium and McKinley lack, but no one doubts that Intel could add those features at any time (Itanium Xeon, anyone?). The company hardly lacks the wherewithal; it just doesn't see the market demand at present. It'll be a whole lot easier for Intel to add multiprocessing features to IA-64 chips than it would for AMD to add IA-64 compatibility to a future Hammer. For all its features, the first Hammer will be a small guy. Clawhammer will likely have dual 64K L1 caches and a 256K L2 cache (no L3 cache). The chip should measure just 104 mm2 in a 0.13-micron CMOS process, according to AMD--just one-quarter the die size estimated for McKinley. The second Hammer processor, Sledgehammer, should also consume far less real estate than McKinley, but it will quadruple the L2 cache to 1 MB. All things being equal, Clawhammer silicon will be far less expensive to manufacture than Itanium, even before you consider Itanium's extra L3 cache chips and its elaborate mechanical housing. For both single-processor and multiprocessor systems, AMD offers the more economical option for system makers. Getting Your 64-Bit Goodness Hammer is not VLIW and it doesn't expose parallelism (or anything else) to the compiler. It's just another turbocharged x86: really fast at x86 code, but really nothing radically new in architecture. New Hammer code can access all the new registers, and even treat them as a flat register file, but it can't break free of the inherent awkwardness of the x86 instruction set. 
The problem is not the binary encoding of x86 instructions--AMD and others have shown they can build blazing fast x86 chips with RISC-like internals even with the fundamental x86 handicap. It's the nonparallel nature of x86 code that's impossible to overcome. In contrast, IA-64 compilers have all the time in the world to locate parallelism, find dependencies, optimize loads, organize branches, develop predicate conditions, and much more, and then express all that richness and intelligence explicitly to the processor. Poor Hammer has to pry what secrets it can from an inherently serial binary stream, searching for tiny fragments of instruction-level parallelism in hardware--and do it all in a handful of nanoseconds without slowing the critical path or impacting the clock frequency. Plus, Hammer has to do all this with no foreknowledge, and no memory, of the entire program. Hammer's scope for reassignment and optimization is limited to the few instructions already in its pipeline. It's like decoding the Rosetta Stone while skydiving, or trying to characterize Internet routing protocols using Latin or Sanskrit. Hammer and Itanium are complete opposites when it comes to rescheduling or re-ordering instructions to improve overall efficiency and utilization of internal chip resources. IA-64 does no reordering whatsoever and is damn proud of it. That's the compiler's job and the hardware is just a dumb servant of the compiler, doing what it's told. Hammer, on the other hand, is--has to be--aggressive about locating and exploiting weaknesses (for lack of a better term) in the compiled output, stealthily reorganizing the occasional integer or FP instruction to take advantage of its hardware resources. Which do you suppose has more headroom, more upside growth potential? It's inconceivable that Intel can't extend and enhance IA-64 for another decade or so. 
Hammer, on the other hand, is the result of a decade of stretching, tweaking, and cajoling a Paleolithic architecture into modern form. It's difficult to believe there would be as much life left in Father Time as in the New Year's baby. Branch latency and its related pipeline bubbles are the bane of high-speed microprocessors. IA-64 is the hands-down winner here. It offers complete software and compiler control over branch prediction, hinting, and prefetching, all in addition to its hardware branch-prediction logic. And its predication technique removes many branches altogether. Hammer has to make do with the x86 instruction set, which has no concept of branch hinting or predication, and never will, unless AMD chooses to invent new instructions and create yet another significant extension to x86. IA-64's instruction clustering means the hardware doesn't have to check for dependencies; that's the compiler's responsibility. This eliminates a lot of the nastiest, most convoluted hardware from the processor's critical path--exactly the place where Hammer spends most of its effort. IA-64 code density should be as bad as x86 code density is good. That's a bonus for AMD and Hammer, though you don't often see server manufacturers choosing their high-end processor based on code density. Intel threw out the baby with the bathwater, creating an entirely new microprocessor and gluing x86 compatibility onto the side for sentimental value. AMD keeps straining the same old bathwater. Odd as it seems, AMD now carries the torch for x86, extending it this way and that, while Intel heads down another path. In a sense, AMD now "owns" the x86 architecture. Intel's Ace in the Hole: Yamhill It's pretty well understood that Itanium will not provide leadership x86 performance. That's Hammer's great hope, in fact.
AMD's strategy depends on Intel mistakenly abdicating its x86 throne, leaving Hammer and its descendants the heirs apparent to a software kingdom. Would Intel so cavalierly jeopardize its legacy? Not on your life. To no one's great surprise, Intel is rumored to be developing something that will give future Pentium processors--not IA-64 processors--a performance kick. In a perverse reversal of roles, Intel may actually be following AMD's lead in 64-bit x86 extensions. A "Hammer killer" technology, code-named Yamhill, may appear in chips late next year, about the time Hammer makes its debut. It's suggested that Intel's forthcoming Prescott processor will be based on Pentium 4, but with Yamhill 64-bit extensions that coincidentally mimic Hammer's. (Prescott is also rumored to be built on a 0.09 micron process and implement HyperThreading.) Naturally, the very existence of Yamhill, if it exists at all, is a diplomatically touchy subject at Intel HQ. The company doesn't want to undermine its outward confidence in Itanium and IA-64, but neither can it afford the possibility of ceding x86 dominance to a competitor. Besides, whether they appear in future Pentium derivatives or not, Intel's 64-bit extensions could appear in future IA-64 processors instead. New IA-64 features plus competitive x86 performance--now that's a compelling product. Summary Analysis: Intel v. AMD It's said that "the short walk to the gallows focuses the mind tremendously" and AMD is headed up those thirteen steps. The company worked very diligently on Hammer and appears to have produced another silk purse from the sow's ear of x86 architecture. Will that be enough? It depends on what you're after. Both processors are totally, completely, and inarguably backward compatible with x86 binaries. Anything else would be a criminal dereliction of duty.
If what you want is a faster x86 PC, it's entirely likely that Hammer-based systems will run old (or upcoming) PC applications much faster than an Itanium- or McKinley-based system. That's fine, for as long as you keep that box. But the day may come when Microsoft Windows version n+1, or Quake XVII, is released for IA-64, but not for AMD's x86-64. And then you'll have to choose sides. Microsoft has publicly announced it will port Windows XP to IA-64, but it has made no such announcement about x86-64. Running existing binaries on either Itanium or Hammer is a no-brainer, but what about new code? Now, for the first time, software vendors will have to decide: do they support Intel, AMD, or both? Porting major applications and operating systems to Hammer will not be trivial--but neither is supporting IA-64. Backing Intel's newest and heavily promoted next-generation architecture is a foregone conclusion for vendors that want to stay in business. Supporting AMD becomes more problematic. Will the added market share be worth the effort? Suddenly AMD finds itself in the same boat as Apple with a different, yet competitive, product that requires dedicated software support to survive. Grimly, AMD itself lived through this tragedy not so many years ago, and the wound was self-inflicted. AMD unceremoniously axed its entire 29000 family, one of the most popular RISC processors of the early 1990s, due to the cost of software support. The company decommissioned the second-best-selling RISC in the world because subsidizing the independent software developers was sapping all the profits from 29K chip sales. As "successful" as it was, AMD had to abandon the 29K, the only original CPU architecture it ever created. There is one very possible future scenario, though a long-shot hope for Hammer. Recall back in the 1980s, IBM was losing ground to PC clone makers so it took its ball and went home. It changed the game, and called it PS/2 - and look how well that worked. 
Instead of following IBM and switching platforms, the world went right on using PC clones. IBM never regained the dominance it once had. Maybe IA-64 is just the PS/2 of processors, a futile attempt to change the game just when it was getting good. Maybe the world really wants a faster x86 instead of a new and different family of processors. Maybe lightning will strike twice. 64-Bit CPUs: Alpha, SPARC, MIPS, and POWER February 21, 2002 By: Jim Turley Alpha--and Omega Since the very first day Alpha processors were available, they've held the microprocessor speed lead. Alpha consistently topped the benchmarks throughout the 1990s. The nerds' favorite microprocessor, an engineer's delight, and everything a good 64-bit architecture should be. It also ended up a dismal commercial failure. Is there no justice? Digital introduced Alpha in 1992 and shipped its first Alpha processor in January 1993. Developed entirely in-house, Alpha was a replacement for the MIPS processors Digital had been using for a few years, which were, in turn, replacements for the VAX, which had become too complex for its own good. Not entirely happy with MIPS processors, Digital's engineers set about developing "the RISC to end all RISCs," in the words of team member Jesse Lipcon. Code-named EVAX, for Extended VAX, each Alpha generation bore the project name EVn. Alas, EV7 was to be the last of this great dynasty. There have really only been two major generations of Alpha processors despite the plethora of model numbers and product announcements. The 21164 and the 21264 (EV6) supplied the core for all the other processors. The 21364 (EV7) is essentially the same as the '264 internally but with a different external system interface. Sadly, the EV8, which was to be an entirely new internal design, was never finished.
Some of the EV8 technology may be incorporated into future Intel 64-bit processors, since Compaq licensed Alpha technology to Intel last year – more on that in a bit. But note that Compaq expects to manufacture and ship Alpha chips into 2004, before converting over to IA-64. Alpha 21364 -- Just Icing on the Cake Alpha's latest and greatest processor, the 21364 (EV7), due later this year at 1GHz to 1.2GHz, takes an unmodified '264 processor core and wraps some bodacious system-bus logic around it. First off, the '364 includes no fewer than four channels to Direct RDRAM (Rambus) memory for a whopping 6 GB/second of theoretical peak bandwidth. And that's just to main memory. The chip also has four interprocessor-communication channels for communicating with other '364 processors. Each channel includes a pair of one-way 16-bit buses. The whole thing works like a packet network, with destination headers in the packets and forwarding. Although the '364 has ports for four direct connections, you can actually hook up as many processors as you want, and intermediate ones (that is, processors standing between the one sending and the one addressed) will kindly forward packets on their way. If this sounds familiar, that's because AMD licensed this technology from Digital for its own Hammer processor line (see more details below). Although the '364's execution resources are the same as the '264's, the newer chip delivers better performance because it has bigger caches. The '364 adds 1.5 MB of L2 cache directly onto the chip, 128 bits wide. The dual 64K instruction and data caches remain. The whole thing measures about 350 mm2 in Digital's (oops, Intel's) 0.18-micron process – smaller than its '264 predecessor, which was built in 0.35-micron silicon. Alpha 21264 Sets the Stage for Performance The 21264 is getting a bit creaky by high-end microprocessor standards -- it was first announced in 1996, yet this core has tided Alpha over to the present day.
It is a four-issue, highly out-of-order machine. Its six execution units are evenly divided into two integer units, two load/store units, and two floating-point units, a pattern not too dissimilar from Athlon or Itanium. The '264 initially debuted at over 500 MHz, an unheard-of frequency at the time. Current chips just squeak past 1 GHz, which is still impressive considering they're not built in leading-edge processes, by any means. Uniquely, the '264 actually duplicates its entire register file, giving one copy to one half of the execution units and the other copy to the other half. Logically, any instruction can access any registers, but physically they are separate. Exotic synchronization hardware keeps both halves of this schizophrenic register set consistent. Alpha is aggressively out of order. Instructions are decoded and left waiting in one of two instruction queues (one for integer code, one for FP instructions). From there, instructions that can execute right away with no dependencies, and with all their operands handy (i.e., already in registers), are pulled out first. After all of those are dispatched, the Alpha takes mercy on the oldest instructions first, favoring them over those that haven't been in the queue as long. The actual sequence in which instructions are executed is utterly unpredictable, and has precious little to do with program order. This aggressively out-of-order approach is in stark contrast to Itanium, which does no reordering in hardware at all. But that's exactly how IA-64 is supposed to be. Like any VLIW architecture, it relies on the compiler to find parallelism and trusts the compiler to have found the best combination and sequence of executable instruction units. Athlon and Hammer, on the other hand, reorder instructions, though not as aggressively as Alpha.
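That issue policy, ready instructions first with the oldest among them favored, is simple enough to sketch. This toy model (ours) uses lists and sets where real hardware uses wakeup and select logic operating in a fraction of a nanosecond:

```python
# Toy model of an out-of-order issue queue: pick instructions whose source
# operands are available, oldest first, up to the machine's issue width.

def pick_for_issue(queue, ready_regs, width):
    """queue: list of (age, dest, sources); lowest age = oldest instruction."""
    ready = [ins for ins in queue if all(s in ready_regs for s in ins[2])]
    ready.sort(key=lambda ins: ins[0])   # among ready instructions, oldest first
    return ready[:width]

queue = [(0, "r4", ["r9"]),        # oldest, but r9 isn't ready yet
         (1, "r5", ["r1", "r2"]),  # ready
         (2, "r6", ["r3"]),        # ready
         (3, "r7", ["r5"])]        # waits on the in-flight write to r5
issued = pick_for_issue(queue, ready_regs={"r1", "r2", "r3"}, width=2)
print([ins[0] for ins in issued])  # -> [1, 2]: younger but ready beats older but stalled
```

Note how the oldest instruction sits while younger, independent ones issue around it; that reordering is precisely what Itanium refuses to do in hardware.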
Photograph of Alpha 21264 Slot B module The '264 and '364 include not one, but two, dynamic branch-prediction methods. The processor chooses between them on the fly. Because Alpha's 1-ns clock cycle is so wicked fast (at least, for its time and its technology), an instruction-cache miss is even more than usually painful. That, and the '264's large die size (298 mm2 in 0.35-micron, with 15.2 million transistors) means that even the on-chip caches can't be accessed in a single cycle. You can't fight physics, but to alleviate some of the pain, Alpha's branch-prediction hardware cooperates with the instruction cache. The '264's instruction cache lines are longer than usual--if the line holds a branch instruction, the expected target of that instruction is cached with it. If the branch is predicted taken, the '264 starts a cache lookup for the branch target right away. Feedback from the branch-prediction circuitry periodically updates the predicted-target entry in the cache. Shades of Alpha in Hammer Digital licensed Alpha to microprocessor nonentities Mitsubishi (in 1993) and Samsung (in 1996), hoping these new disciples would help spread the gospel. Samsung started with a second-source license, but upgraded it to a full license (i.e., with rights to design its own processors) in 1998. Then, for no apparent reason, Samsung renamed its US arm Alpha Processor, Inc (later API Networks) for the purpose of selling its Alpha chips. That was about as successful as you might think. There's a little bit of Alpha in AMD's Hammer processors. Back in 1997, AMD proudly announced it would use Alpha's system bus for its "Slot A" interfaces for K7. But it didn't end there.
While the FTC was investigating antitrust aspects of Intel's acquisition of Digital Semiconductor, it uncovered documents showing AMD had been negotiating with Digital for an entire Alpha processor license, not just the bus. It's no coincidence, then, that Hammer's interprocessor communications and some of its internal features bear more than a passing similarity to Alpha.

The Omega Glory

Alpha is fast – very fast – but why? Surprisingly, it isn't fast because of its architecture. Alpha's instruction set is about the same as that of any other RISC architecture. There's no magic bullet there. Nor is there anything spectacularly different about Alpha's internal implementations. Its pipelines have never been especially long, nor does it embody any unusually clever microarchitectural techniques. No, Alpha's lead is (was) almost entirely due to its top-notch manufacturing and the minute tweaking and detail work that went into tuning the process to fit the processor. Alpha was exquisitely designed for the exact semiconductor technology Digital used to fab it. That equation fell apart when Intel and Compaq divided up Alpha. Intel got Digital Semiconductor's fabrication plant in Massachusetts (along with StrongARM), while Compaq got Alpha and the DEC computer business. Neither company got many of the engineers responsible for Alpha – they bailed out rather than join either acquirer. The magic that conjured up Alpha evaporated, never to rematerialize. Alpha was the first 64-bit processor to reach 1 GHz, in 2001. Ironically, this milestone came just a few short weeks after Compaq proclaimed it was discontinuing Alpha development and licensing Alpha technology to Intel. It's a bit like having your star running back score the game-winning touchdown right after the team announces he's being traded away. So long, Alpha. We'll miss you.

Sun UltraSPARC

SPARC is one of the purer RISC architectures still in existence.
It's also just about the only one still used for its original purpose of powering computers. That's not to say that the architecture has stagnated – far from it. Sun has added visual- and media-processing features in its VIS (visual instruction set) extensions to SPARC. Similar to MMX or 3DNow!, VIS adds the ability to handle packed RGB-alpha data for compression, decompression, and video-processing applications. Even with those enhancements, and fully eight generations of design, SPARC processors are all still software compatible, from the first to the most recent. No mean feat, that. But like any well-established architecture, SPARC is showing its age. While AMD and Intel mass-produce processors at 2.0 GHz and up, Sun's latest UltraSPARC-III just barely squeaked past 1.05 GHz in January – and it took TI's latest six-layer copper-interconnect process to do it.

UltraSPARC-III chips

Historically, SPARC has lagged behind most (usually all) other processors in clock speed and benchmark performance. SPARC has spent a decade at the back of the RISC pack, at least according to most recognized benchmarks. In its defense, Sun says it emphasizes "system performance," which includes factors such as memory bandwidth and availability that aren't quantifiable in the benchmarks. Still, it's hard to ignore SPARC's consistent lack of progress against MIPS, Alpha, Power (including PowerPC), and even the dreaded x86. It's been said that the biggest difference between a Sun workstation and a PC is the ego of the person sitting in front of it. SPARC is to processors what Linux is to operating systems. It has become the flagpole around which rebellious mobs gather in passive-aggressive demonstrations against the dominant player (in this case, Intel).
SPARC, and Sun, have earned a weird kind of nerd chic that is out of all proportion to their relevance in the market, technical features, or performance. Sun succeeds largely on anti-PC and anti-Microsoft sentiment, not pro-Sun sensibility. Here's a good bar bet: what company has the most CPU design engineers (after Intel)? It's Sun, with 1,300 SPARC designers spread across Sunnyvale, Austin, and Chelmsford. For CPU designers looking for work, Sun is the last, best hope before succumbing to The Dark Side.

UltraSPARC-III Outshines Its Predecessors

Sun describes UltraSPARC-III as "…the second generation of the 64-bit SPARC V9 architecture…" -- a confusing agglomeration of revision numbers, to be sure. Is it second, third, or ninth generation? UltraSPARC-III is essentially the same internally as the UltraSPARC chips that came before it. The biggest differences are clock speed, external buses, and cache. UltraSPARC-III has a 14-stage pipeline, the longest of any of the 64-bitters reviewed in our series, and on par with the old Pentium Pro. It also has the now-familiar six execution units: two for integer, two for floating-point, one load/store unit, and one address-generation unit. With only one load/store unit, UltraSPARC-III can't process multiple memory transactions the way, say, Hammer or Itanium can. UltraSPARC-III can have multiple loads and/or stores outstanding, thanks to its buffers and queues -- it just has to dispatch them one at a time from the code stream. UltraSPARC-III has average-sized L1 caches: 32K for instructions and 64K for data. The chip contains the L2 cache tags, but not the cache memory itself. That cache is built off-chip using standard SRAMs. There is no L3 cache at all. Without the L2 cache on the chip, UltraSPARC-III's price is somewhat artificially lower than its competitors', assuming you do add the 1M, 2M, or 8M of external cache the chip supports. Like Alpha and Hammer, UltraSPARC-III has a built-in DRAM controller.
In this case, it manages SDRAM (synchronous DRAM) devices.

SPARC Instruction Set and Register Windows

SPARC's instruction set is unremarkable – after all, RISC is RISC – but its register set is unique. All SPARC chips expose 32 registers to the programmer at any one time, but these registers are just a "window" into a larger set of physical registers. The additional registers are hidden from view until you call a subroutine or other function. Where other processors would push parameters on a stack for the called routine to pop off, SPARC processors just "rotate" the register window to give the called routine a fresh set of registers. The old window and the new window overlap, so that some registers are shared. As long as you're careful about placing parameters in the right registers, the windows are a slick way to pass operands without using the stack at all.

SPARC circular register windows

Slick as it seems, register windows have their drawbacks. The concept has been around for decades, yet SPARC is almost the only CPU architecture to use it. First, register windows only help up to a point -- the number of physical registers is finite, and eventually SPARC runs out of space for more windows. When that happens, you're back to pushing and popping operands on and off the stack. It's next to impossible to predict when the register file will overflow or underflow, so performance can be unpredictable. Second, the processor doesn't handle the overflow/underflow automatically in hardware. It generates a software fault, which the operating system has to handle, burning more cycles. Many hardware engineers aren't particularly fond of register windowing. It puts enormous demands on multiplexers and register ports to make any physical register appear to be any logical register.
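The rotate-overlap-spill behavior described above can be sketched as a toy model in Python. The window count and structure here are illustrative (real SPARCs typically have seven or more windows, plus global registers, and a restore path we omit):

```python
# Toy SPARC-style register windows: each SAVE (procedure call) rotates the
# window so the caller's "out" registers become the callee's "in" registers.
# When the physical file is exhausted, a simulated spill trap pushes the
# oldest window to memory -- the software fault the OS must service.
# Sizes and names are illustrative; RESTORE is omitted for brevity.

NWINDOWS = 4  # physical windows in this toy (real parts have more)

class WindowFile:
    def __init__(self):
        self.windows = [{"in": [0] * 8, "local": [0] * 8, "out": [0] * 8}
                        for _ in range(NWINDOWS)]
        self.cwp = 0        # current window pointer
        self.depth = 0      # call depth
        self.spilled = []   # windows spilled to the stack by the "OS"

    def save(self):  # executed on procedure call
        self.depth += 1
        if self.depth >= NWINDOWS:
            # Overflow trap: spill the oldest window (circular indexing).
            self.spilled.append(self.windows[self.cwp - (NWINDOWS - 1)])
        nxt = (self.cwp + 1) % NWINDOWS
        # Overlap: the caller's outs ARE the callee's ins (shared storage),
        # so parameters pass with no stack traffic at all.
        self.windows[nxt]["in"] = self.windows[self.cwp]["out"]
        self.cwp = nxt

rf = WindowFile()
rf.windows[0]["out"][0] = 42   # caller places a parameter in %o0
rf.save()
print(rf.windows[rf.cwp]["in"][0])  # callee reads it from %i0: 42
```

Call deep enough -- four saves in this toy -- and the spill list starts filling, which is exactly the unpredictable overflow cost the text complains about.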
In the 1990s, when there were nearly a dozen different vendors designing and marketing SPARC-compatible processors, their designers complained bitterly about the headaches of routing interconnect over, around, and through the register file in the middle of every SPARC processor. Register windowing, an inherent and permanent feature of every SPARC, has so far made it impossible to add multithreading, and difficult to keep clock speeds up. The 900-MHz and new 1.05-GHz UltraSPARC-III chips both use TI's 0.15-micron copper process for their 29 million transistors. Fortunately, most SPARC processors are buried inside Sun workstations, where the value of Sun's software base and systems-level expertise outshines the relative shortcomings of its processors.

MIPS64 5Kf and R20Kc

Here's another good Silicon Valley bar bet: what does "MIPS" stand for? In the case of the eponymous company, it's "microprocessor without interlocked pipeline stages." In other words, a RISC architecture without (ideally) any hardware interlocks, a design goal that MIPS has very nearly kept for all these years. MIPS, of course, does not make processor chips any more, not even for Silicon Graphics. It is one of the more popular licensed architectures, used by over a dozen chip-making companies around the world for consumer devices (like handheld PCs), video games (the Nintendo 64 and PlayStation), and countless network boxes. Many networking companies use MIPS cores (some officially licensed, some not) in their chips, but not because MIPS is particularly good at network processing. It isn't. MIPS is just a convenient, clean, and easily scaled architecture around which special-purpose network processors or protocol engines can be added. Indeed, MIPS is one of the cleanest and most generic processor designs around, finely tuned for absolutely nothing.
Recently a little Boston company called Lexra licensed a knock-off "clean room" MIPS core, but the company has now gone legit and taken out a MIPS license. Like SiByte and other MIPS users, Lexra used MIPS processors as the framework for more interesting network processors.

One Little, Two Little, Three Little Endians

MIPS has two different 64-bit cores for sale: the low-end (relatively speaking, of course) 5Kf, and the high-end 20Kc. The 5Kf has a six-stage pipeline and single, scalar execution. The core comes with (or, more precisely, can be designed with) L1 caches up to 64K in size. Assuming you choose to build yours in a 0.13-micron process, you can expect to use up about 4.0 mm2 of silicon, according to MIPS. The caches do not support multiprocessing features such as cache-coherent snooping, so the 5Kf is not destined for high-end computers of any sort. Instead, it's a good embedded core for comparatively speedy networking or video applications. The 20Kc core, on the other hand, is MIPS's "big iron." In a break from recent tradition, you can buy this core as a real chip. The 20K processor, as it's called, sports 7.2 million transistors on its 34-mm2 die. With that you get dual 32K caches and a generic new system bus called, for reasons that immediately suggest themselves, MGB. MGB runs at 150 MHz and can displace 3.6 GB of data per second. Let's hope it doesn't leak oil like most MGBs. Whether you like it soft core or hard core, the 20K pumps two instructions through its simple seven-stage, in-order pipeline. For a 64-bit processor, the 20K is no fire-breather. But this is how MIPS's customers want it. Now that Silicon Graphics has adopted IA-64 like everyone else (except Sun), there's no point designing high-end MIPS processors for workstations that will never exist.
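The section title, by the way, is a nod to byte ordering, which matters for MIPS's networking customers because traffic arrives on the wire in big-endian network order. Python's struct module illustrates what the two orderings mean for the same 32-bit value:

```python
# Byte ordering illustrated with Python's struct module: the same 32-bit
# value laid out big-endian (network order) versus little-endian.
import struct

value = 0x12345678
big = struct.pack(">I", value)     # big-endian: most significant byte first
little = struct.pack("<I", value)  # little-endian: least significant first

print(big.hex())     # 12345678
print(little.hex())  # 78563412
```

A core that can run either way spares the programmer from swapping bytes on every packet header.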
MIPS is now firmly in the embedded camp, and that means sacrificing a few of the sexier features for better power consumption and easier manufacturing. The 20Kc doesn't even have any exotic branch prediction. Its seven-stage pipe will probably limit the 20Kc to about 500 MHz, and that suits most embedded ASIC designers just fine. In a nod to networking reality, the core supports both big- and little-endian byte ordering.

MIPS 20K processor

The 20Kc core has three execution units (one full integer, one partial integer, one floating-point), but can dispatch and execute only two instructions per cycle. Because the second integer ALU is a somewhat limited version of the first, the 20Kc cannot dispatch two load/stores simultaneously, nor can it do two branches (which is pretty obvious). Also, clearly, it can have only one floating-point operation in flight at any one time, although, since FP operations can be many cycles long, the 20K can crunch on two new integer operations while it waits for the FP instruction to come out of the wash. Although the 20Kc won't spank other 64-bit processors, it does have the advantage of ASIC availability. Taiwanese fab giant TSMC has already licensed the 20Kc, just as it did with earlier MIPS cores. This makes the 20Kc easily available to ASIC customers with little or no interest in actually designing their own MIPS-based chips -- they can just glue on the 20Kc core and stick to their own specialty.

IBM Power4

Everything about IBM's Power4 processor is amazing. Amazing technology, amazing size (680 million transistors), amazing power requirements, amazing performance. And it's amazing that anything so complex can work at all. This beast has 5,200 pins on the package and consumes 500 watts (that's right, half a kilowatt) of power. Actually, Power4 is more than a processor; it's an entire neighborhood of processors.
It's sold as a module comprising two processor cores per die and four dies per module, making eight 64-bit processors and 680 million transistors in one unit. Each individual die contains 174 million transistors and measures a sun-blocking 400 mm2 in IBM's 0.18-micron, seven-layer copper process.

IBM Power4 module

The pipeline of a Power4 processor – one Power4 processor – is 12 stages long and feeds eight execution units. Those eight include two integer units, two floating-point units, two load/store units, one branch unit, and one condition-evaluation unit. Like current Athlon and Pentium machines, Power4 cracks its instructions into an intermediate internal format that is more easily digested by the pipeline. This is a bit odd, since Power is nominally a RISC architecture to begin with, but there you are. Both "native" Power instructions and PowerPC instructions are decoded into this internal representation early in the pipeline.

Power4 die

In-order versus out-of-order is a slightly clouded issue (no pun intended) on Power4. Instructions are dispatched in order, whereupon the pipeline almost immediately reorganizes them. Individual instructions can progress through the pipeline at various rates until they are reunited with their comrades and retired in order. In all, more than 200 instructions may be in flight in Power4. And this is all on just one of the eight processors in the module. Some more statistics: the data cache is 32K and the instruction cache is 64K. The 1.5 MB of on-chip L2 cache is divided into thirds, the better to serve up fast responses. (You can see this division in the die photograph.) This cache is shared between the two cores on the Power4 chip through an elaborate switch matrix. Because the cache pieces are relatively small, they can be made faster than one large cache.
This also helps open up more ports into the cache memory and avoids contention for resources, buses, tags, and cache lines. It also makes the entire cache system spectacularly complex to design and manufacture (not to mention the problem of keeping all three portions cache-coherent), but IBM is not one to shy away from such a challenge. These guys are professionals. Oh, and the cache controller for the 32MB of L3 cache is included on the Power4 chip, although the cache memory itself is off-chip, though not off the module.

The Tale of the Tape

Someone once said there are lies, damn lies, and benchmarks. Be that as it may, it sometimes comes down to standard recognized benchmarks when it's time to pick a processor. And for 64-bit processors, SPEC (the Standard Performance Evaluation Corporation, www.spec.org) is the most-used source for those benchmarks. SPECmarks are more reliable than some other benchmarks because the results can be verified by outsiders. Any vendor posting results is required to specify exactly how those results were obtained and how to duplicate them. Peer pressure keeps blatant marketing optimism in check. Still, SPECmarks don't tell the whole story, as anyone who scores near the bottom will earnestly explain.
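One caveat worth keeping in mind before reading any SPEC numbers: scores track raw clock frequency only loosely. A quick sketch, dividing a few published SPECint2000 (base) results by clock speed, makes the point (figures as reported by the vendors at the time):

```python
# SPECint2000 (base) per GHz for a few 64-bit chips, to show how little
# the scores track raw clock frequency across architectures.

results = {
    "Alpha 21264C @ 1.00 GHz":   (621, 1.00),
    "Itanium @ 0.80 GHz":        (379, 0.80),
    "UltraSPARC III @ 1.05 GHz": (537, 1.05),
    "Power4 @ 1.30 GHz":         (790, 1.30),
}
for name, (specint, ghz) in results.items():
    print(f"{name}: {specint / ghz:.0f} SPECint/GHz")
```

Per-cycle, Alpha and Power4 land in roughly the same neighborhood despite a 300-MHz clock gap, which is why the table below should be read with system and compiler differences in mind.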
Current SPEC Benchmark Data for 64-Bit Processors

Processor, Speed            SPECint2000 (base)   SPECfp2000 (base)   System
Alpha 21264C, 1.01 GHz      561                  585                 Compaq AlphaServer GS160
Alpha 21264C, 1 GHz         621                  776                 Compaq AlphaServer ES45
Alpha 21264B, 833 MHz       518                  621                 Compaq AlphaServer ES40
Itanium, 800 MHz            379                  701                 HP rx4610
Itanium, 800 MHz            358                  655                 HP i2000
McKinley, 1.0 GHz           640                  --                  Very preliminary simulations from Intel
UltraSPARC III, 1.05 GHz    537                  701                 Sun Blade 2050
UltraSPARC III, 900 MHz     470                  629                 Sun Blade 1000
UltraSPARC III, 750 MHz     363                  312                 Sun Blade 1000
Power4, 1.3 GHz             790                  1098                IBM eServer 690

There is no data for Hammer because it's not available yet, of course, nor is there any SPEC data for the MIPS R20K. As with all benchmarks, the scores seem a little confusing because they don't track clock frequency very linearly. That's partly because other system components (chipsets, memory controllers, etc.) affect performance, as they should. Compiler revisions also affect performance, so "faster" processors don't necessarily produce faster benchmark scores. There's no question that IBM's Power4 behemoth is the winner in this contest, although you'd have to buy a pretty big system (and a pretty big room to keep it in) to enjoy it firsthand. Itanium turns in remarkably embarrassing SPECint (integer benchmark) numbers, scoring about as well as an UltraSPARC-III processor that's 6% slower in terms of clock rate and about a hundred years older in terms of architecture. As new product launches go, "Itanic" appears to be sinking under its own weight. Itanium's floating-point scores are first-rate, however, an impressive reversal of fortune for Intel, which normally has to apologize for its mediocre FP performance. And McKinley should do much better, as its estimated integer score indicates.

Wrap-Up

Generations.
Looking across all three parts of our 64-bit computing series, what we're seeing here is different processor generations. IA-64 is the latest generation, though not the latest thinking on CPU architecture. Hammer is, by necessity, the oldest generation. Hammer, K6, Athlon, and the Pentium Pro/II/III/4 all made a break from "true" x86 microarchitecture long ago, when they started cracking x86 opcodes into internal ROPs and running RISC machines internally. That makes Athlon perhaps a 1.5-generation machine and Hammer a 1.999-generation processor. Alpha is generation 2. It's pure RISC, has some exotic features, and is very well architected. SPARC probably belongs in this group, too, though its basic architecture is a bit older than Alpha's. MIPS is of the same vintage as SPARC, and it, SPARC, and Alpha all bear some similarities to one another. All three stand between Hammer and Itanium in terms of modern thinking. Power4 is in a class by itself, pulling a middle-aged instruction set with it as it develops on-chip multiprocessing and other system-level features. Although all of these processors are 64-bit machines, most are destined for different and mutually exclusive markets. MIPS has given up the desktop and become a very successful embedded architecture. SPARC has done just the opposite: Sun relies on SPARC for its workstations and servers, and maintains its characteristically defiant attitude about other CPUs. After some initial dabbles in the embedded realm, SPARC is now almost entirely Sun's pet processor. Alpha will fade away on its own, though not because of its performance, development, features, or software support. Alpha is disappearing by fiat: it has no strategic place in its new environment, so it must be eliminated. Power4 is clearly in another world, one dominated by scientists with white lab coats and bulging foreheads.
That leaves Hammer and Itanium competing head to head for the mainstream 64-bit workstation, and Hammer should also compete with Itanium in portions of the 64-bit server market. Two roads are diverging in the yellow wood of PC processors, and Intel, for a change, is taking the one less traveled. Whether that's blazing a new trail or abandoning the road to riches remains to be seen. What's certain is that both sides will claim to be on the shining path. AMD and Intel swap insults, claims, and counterclaims as frequently as AOL sends us shiny new coasters in the mail. In a time when microprocessors are advertised on TV and CPU vendors have their own jingles, the fantastic technology embodied in these chips seems almost irrelevant. And it nearly is -- microprocessor marketing is drawing ever nearer to perfume advertising. All the emphasis is on packaging and marketing, branding and pricing, channels and distribution, with little left over for solid product details, features, and benefits. Little old ladies who don't know a transistor from a tarantula know the name "Pentium" and think they want "HyperThreading." It's a good thing that for some of us, the technology still matters.

Copyright (c) 2002 Ziff Davis Media Inc. All Rights Reserved.