From surfnet.nl!newsfeed2.news.nl.uu.net!sun4nl!oleane.net!oleane!proxad.net!enews.sgi.com!fido.engr.sgi.com!news.corp.sgi.com!mash.engr.sgi.com!mash Sun Sep 3 07:11:57 MET DST 2000 Article: 9415 of comp.arch Path: cwi.nl!surfnet.nl!newsfeed2.news.nl.uu.net!sun4nl!oleane.net!oleane!proxad.net!enews.sgi.com!fido.engr.sgi.com!news.corp.sgi.com!mash.engr.sgi.com!mash From: mash@mash.engr.sgi.com (John R. Mashey) Newsgroups: comp.arch Subject: NUMAflex essay [extremely long, ~30 pages] Date: 3 Sep 2000 01:17:38 GMT Organization: Silicon Graphics, Inc. Lines: 1734 Message-ID: <8os8ri$p9j$3@murrow.corp.sgi.com> NNTP-Posting-Host: 163.154.3.73 A while ago, in response to some questions about SGI's NUMAflex, I said I'd write up some info on the architecture & especially on the design process that led there, and alternatives considered but not taken. I only had time to write all this once, for multiple uses, so please excuse the occasional marketing-stuff that snuck in amongst the technology and history. ============================================================================= NUMAflex Modular Design Approach A Revolution in Evolution John R. Mashey On behalf of the team, I hope! 8/30/00 0. INTRODUCTION 1. BACKGROUND AND OVERALL CHRONOLOGY 1.1 HISTORICAL BACKGROUND 1.2 ORIGIN 2000 (LATE 1996) 1.3 "FLINTSTONES" PROJECT (AS OF JANUARY 1998) 2. SGI'S NUMAFLEX TM) DESIGN APPROACH - BRICK-AND-CABLE MODULARITY 3. ORIGIN 3000 + ONYX 3000 - FIRST NUMAFLEX FAMILY 4. THE NEXT FAMILY - ITANIUM-BASED NUMAFLEX 5. FURTHER MIPS & IA-64 NUMAFLEX FAMILIES (2001-2006) 6. NUMAFLEX COMMENTARY 7. SUMMARY 8. ACKNOWLEDGEMENTS 9. REFERENCES 0. INTRODUCTION SGI's "NUMAflex" (TM) modular design approach builds computer families with unusual scalability and evolvability characteristics. It partitions CPU, I/O, and other functions into small, 19" rackmount computing "bricks", then combines them via efficient, high-speed cache-coherent cabled interconnects, rather than large backplanes. This particular approach enables a fundamental change in the way *scalable* computers are designed and evolved, implying an unusual degree of interaction among development process, systems architectures, innovative physical packaging, and resulting system products. For mid-range and high-end systems, the typical development process at most computer vendors tends to be a synchronous, "big-bang" process that produces a complete new system design about every 3-4 years, followed by modest upgrades until the next major design. People upgrade CPUs and disks often, I/O busses rarely, and backplanes hardly ever. The NUMAflex development process is a more asynchronous, "continuous-creation" approach that offers frequent moderate improvements, trying to incorporate the faster development cycle of clustered, low-end systems into the design of larger scalable systems. The NUMAflex roadmap includes multiple, and often major, upgrades every year. Most larger computers have big backplanes and strong coupling of multiple technologies, such as interconnects and I/O. By contrast, the NUMAflex design approach eschews big backplanes in favor of small bricks and high-speed interconnect cables. The NUMAflex approach uses specific interface standards and components shared among NUMAflex families, combined with strategies for evolving them. The first NUMAflex family is the MIPS/IRIX-based Origin 3000 & Onyx 3000, soon to be followed by an IA-64/Linux family. 
Unlike past SGI families (Power Series, Challenge, Origin 2000), the new family is *not* just another set of new products, but rather the first of several families that share a common design approach and, quite often, common physical components.

NUMAflex designs have several obvious analogies. Physically, they resemble modern modular consumer audio systems. Philosophically, they follow the classic UNIX shell-and-pipeline approach that handles a huge variety of problems by using common connections among a set of modular tools. At SGI, NUMAflex resembles the recent strategies for IRIX and MIPS CPUs. IRIX has shifted from "big-bang" releases to quarterly releases, with much smoother transitions and happier customers, and MIPS chips have thankfully reverted to a more disciplined evolutionary process in place of multiple big-bang designs.

1. BACKGROUND AND OVERALL CHRONOLOGY

1.1 HISTORICAL BACKGROUND

TIMELINE AND CLOSE-RELATIVE SYSTEM GENEALOGY

1988            1993              1996                  2000

Cray T3D------> T3E-----------\
                                NUMAflex (Origin 3000, ....)
  -->DASH--------------> Origin2000-----/
 /                       ^
PowerSeries---> Challenge------/

Power Series & Challenge were classic backplane-bus-based SMPs. Stanford's DASH prototype [LEN95a] combined 4P SGI 4D/340s with directory controllers and a mesh network to build a research ccNUMA system. Some key DASH people (such as Dan Lenoski and Jim Laudon) joined SGI, and then worked on the Origin 2000. NUMAflex in general, and Origin 3000 in particular, includes alumni of both Origin 2000 and T3E; although more closely related to Origin 2000, it was influenced by both predecessors, and acquired some radically different characteristics as well.

T3D & T3E are shared-memory MPP (Massive Parallel Processing), NUMA (Non-Uniform Memory Access) systems. Each CPU can address all memory, using normal cached access to local memory, and remote memory is accessed using uncached memory references, managed via software conventions and additional special hardware. T3s were designed to retain the high CPU counts of message-passing MPPs, such as TMC CM-5s, but offer more convenient global shared memory.

In common usage, if a system includes hardware to maintain synchronization among all of the caches in a system, as seen by ordinary memory accesses from any CPU, it is called *cache-coherent*. Although it would be more precise to say hardware-cache-coherent (as opposed to software-cache-coherent), people don't. Hardware technology is only now getting good enough to build reasonable systems that have both large T3E sizes and (hardware) cache-coherency, and lacking this hardware, T3s are not normally labeled cache-coherent. Origin 2000 and Origin 3000 are cache-coherent NUMA (ccNUMA) systems. Origin 3000 is SGI's *second* ccNUMA product, but is (properly) labeled a *third-generation* ccNUMA, acknowledging Stanford DASH's important contribution as the first along this general line of development.

TERMINOLOGICAL CONFUSION AND UNCONSCIOUS OBFUSCATION

It is common in computing, as elsewhere, that the meaning of a term changes over time, causing confusion among the unwary. In particular, a term used to make a distinction may change meaning if that distinction becomes less important, and a general term for a large set may acquire a more specific connotation if one member of the set is especially successful. MPP (Massive Parallel Processing) has generally meant large systems with many CPUs, but it often takes on the connotation of message-passing, since the early MPPs used that approach.
SMP once meant Symmetric MultiProcessing (all CPUs have equal capabilities), as opposed to Asymmetric MultiProcessing (where perhaps only one CPU could run the operating system, or only one had I/O device attachments). Over time, system design favored SMPs, so the symmetric-vs-asymmetric distinction became less useful.

Later, SMP tended to mean Shared-memory Multiprocessing, as opposed to message-passing or shared-nothing systems. By *this* definition, most multiprocessors are SMPs, whether built around a common shared bus (many minicomputers and microprocessor-based servers) or a fixed set of crossbars (as in many mainframes and vector supercomputers, and in Sun E10000, HP N-series, various IBM RS/6000s). With this usage, Origin 2000s and other ccNUMAs should be called SMPs, but ....

Later yet, the term SMP has meant "SMP, but with bus or small crossbar", as opposed to distributed shared memory systems (NUMA, ccNUMA, COMA). Even more specifically, since most SMPs are small bus-based SMPs, many people use the term SMP to mean bus-based SMP.

There is related evolution in terms like NUMA and ccNUMA. If all of the memories in an SMP have the same memory access times from all CPUs, it is labeled UMA (Uniform Memory Access). Most small SMP designs are UMAs, and any system that is Non-UMA is labeled NUMA. NUMA systems with (hardware) cache-coherency are usually called ccNUMAs. Any ccNUMA is a NUMA, but a NUMA need not have cache-coherency. As happened with bus-based SMPs, the unadorned term NUMA is often meant as ccNUMA, simply because most NUMAs in the world are actually ccNUMAs.

Finally, COMA (Cache Only Memory Architecture) systems (like those of KSR) are certainly NUMAs, and logically are ccNUMAs as well, since they do have hardware cache-coherency. However, sometimes people place COMA on the same level as ccNUMA, i.e., in COMA-vs-ccNUMA arguments.

The moral of all this is to ask people what they mean when they use these terms!

TOP-LEVEL TECHNICAL SUMMARY OF RELATED SYSTEMS

                                 Peak/raw
                                 Connect
Year System            Arch      MB/sec   #CPU     Max      I/O Type
                                                   Mem GB
1988 SGI PowerSeries   Bus SMP   128      2-8P     .256     SCSI, ENET, VME
1991 Stanford DASH     ccNUMA    2*60     48P      .256     (same)
1993 SGI Challenge     Bus SMP   1500     2-36P    16       SCSI, ENET, VME
1993 Cray T3D          Hybrid*   2*300    2048P    128      Attached
1996 SGI/Cray T3E      Hybrid*   2*500    2048P+   4TB      GigaRing (SCI variant)
1996 SGI Origin 2000   ccNUMA    2*800    2-512P   1TB      XIO, PCI (64b, 33MHz)
2000 SGI Origin 3000   ccNUMA    2*1600   2-512P   1TB      PCI (64b, 66MHz), XIO

(*Hybrid is as discussed above: like MPP in scalability, but with shared memory).

For consistency, peak/raw bandwidth numbers are given along one connection, either the shared bus in an SMP, or along one connection in an MPP or ccNUMA, expressed as 2*N, since all the cases here are full-duplex. Sustained numbers are 50-80% of peaks - YMMV - Your Mileage May Vary!

END OF THE ROAD FOR BIG-BUS SMPS

SMP bus bandwidth rose sharply from 1988 to 1993 - at least a 10X difference in peaks, but efficiency improved as well, so in the case above, 1988's sustained 60 MB/sec grew to 1993's 1200 MB/sec. Unfortunately, from 1993-2000, 1200 MB/sec only improved by 2-3X (2600-3200 MB/sec) for big-bus (rackmount-width) SMPs, falling far behind the growth in CPU performance, and yielding the oft-made complaint:

        SMP busses don't scale

They used to, but they don't any more, but they still managed to take over a huge market, displacing the classic minicomputer architectures, and even attacking the low-end of the supercomputer business. But now...
Big shared-bus backplanes have gotten about as wide (128-256 bit data) and as fast (83-100MHz) as they are likely to get any time soon. Skew and bus-loading problems make it more and more difficult to do much better, especially in the larger systems, whose backplanes may have limited clock rates compared to their smaller siblings. For example, Sun's E6500 bus runs at 90MHz, using the same boards as the smaller E3500, whose bus achieves 100MHz.

This issue has led people to build MPPs (many), and then shared-memory ccNUMAs of various flavors (KSR (COMA), Convex/HP, SGI, Sequent, DG, Compaq). Vendors no longer appear to be building new big-bus SMPs. Instead, one way or another, people use switch-based designs, with narrower point-to-point interfaces running at much higher clock rates. Busses that appeared in 1996 have by now been scaled to 100MHz, while point-to-point designs were already at 400MHz in 1996, and 800MHz in 2000.

Of course, mainframes and vector supercomputers have long used crossbar switches, and various ccNUMAs and MPPs have used cabled memory interconnects, although usually to connect fairly large hardware aggregates. Some early ccNUMAs have used SCI (Scalable Coherent Interconnect), although this appears to be slowly disappearing in more recent designs. (SCI was an ambitious, probably too ambitious, attempt to create industry-standard mechanisms for ccNUMA and local networking. See [SCI00a].) In any case, crossbar switches, and sometimes, high-speed cables, have been propagating downward in price into the broader systems market.

THE SHAPE OF THINGS TO COME

Going forward, I conjecture that computing will be dominated by 2 types of computers, whose block diagrams look relatively similar:

(a) COMMODITY SBC (SINGLE-BOARD COMPUTER) - memory, fixed I/O, 1P, 2P or maybe 4P, often clustered. Evidence is accumulating that 1-2P per shared bus seems to be a design sweet spot. Many CPUs and chipsets permit 4P/bus, but quite often, it is found that 2-3P saturate the bus for many workloads. Anyway, the most typical designs have 1-2P and 1-2 PCI busses per memory-control ASIC:

        P       P
        |_______|
            |
  MEMORY-----ASIC
            |
           I/O

Of course, people sometimes integrate the controller ASIC with the CPU, and multiple CPUs/die and CPU+memory integrations are coming, but for a while, the block diagram above is quite common.

(b) SCALABLE SYSTEM built from similar nodes - 1-2P per bus, memory, I/O (either as part of the node, or a connection to remote I/O), and some kind of ccNUMA port that integrates multiple nodes together in a consistent shared-memory environment. Much of the ASIC design can be similar to (a), but it needs the ccNUMA port, and extra logic to do cache-coherence, and the memory must provide bits to hold the coherency information (wider DIMMs or extra directory DIMMs are typical).

        P       P
        |_______|
            |
  MEMORY+----ASIC+-----ccNUMA interconnect
            |
           I/O

Hence, clearly a scalable system needs more hardware, and costs more per CPU (to handle the domain of problems that want larger systems). ccNUMA designers vary wildly in preference, ranging from those who prefer "light" nodes (2-4P) that resemble (or even are) commodity SBCs, to those who prefer "heavy" nodes, using crossbars to connect 8-16P before going to a ccNUMA interconnect. The trend seems toward light nodes.
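To put a rough number on the "extra bits" a scalable node needs, here is a back-of-the-envelope sketch (Python). The cache-line size and directory widths are illustrative assumptions, not figures from any particular SGI design; the point is only that directory state is a modest, bounded fraction of memory.

    # Why a ccNUMA node needs "extra bits" in memory: every cache-line-sized
    # chunk of DRAM carries directory state recording which node(s) may be
    # caching it.  Line size and directory widths below are assumptions.

    def directory_overhead(line_bytes, dir_bits_per_line):
        """Directory storage as a fraction of data storage."""
        return dir_bits_per_line / (line_bytes * 8)

    if __name__ == "__main__":
        LINE_BYTES = 128                       # assumed cache-line size
        for nodes, dir_bits in [(8, 16), (64, 64), (512, 64)]:
            # small systems: something like a presence bit-vector; big systems
            # may use wider or coarser formats -- purely illustrative numbers
            frac = directory_overhead(LINE_BYTES, dir_bits)
            print(f"{nodes:4d} nodes: ~{dir_bits} directory bits/line "
                  f"= {frac:.1%} extra memory")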
Of course, many people build clusters of commodity computers, or similar collections of non-shared-memory systems, like IBM SPs, and these offer some of the same characteristics as NUMAflex designs, but without producing arbitrarily-sized shared-memory systems, which are the focus of the remainder of this writing.

DEVELOPMENT INTERVALS & "INFRASTRESS"

For whatever reasons, 3.5-4 years is a typical interval between one top-of-the-line system and the next. For example, the following are typical:

--------SGI--------     --------Sun----     ----DEC/Compaq---------
Year   System           Year   System       Year   System
1988Q4 PowerSeries
1993Q1 Challenge        1992   SC2000       1992   DEC 10000
1996Q4 Origin 2000      1996   UEx000       1996   DEC 8400 (GS140)
2000Q3 Origin 3000      2000?  next?        2000   Compaq GS320

The usual process creates chassis designs used for 4-6 years, and allows for CPU board upgrades and disk upgrades. I/O-bus upgrades happen occasionally, but are often awkward, and serious interconnect upgrades are rare. Occasionally, modest increments in bus bandwidth have been provided. In practice, multiple technologies are coupled in such ways that it is difficult to re-use very much between major generations. Technologies change at different intervals, and at different rates, and are not aligned:

Technology    Interval     Rate
CPU           ~year        1.6X [noticeable speed grade each year]
Disk          ~year        2X capacity (now; used to be slower)
DRAMs         ~3 years     4X capacity, costs go down meanwhile
Chassis       ~4-6 years   you bought it, you have it
I/O busses    varies       varies [difficult to predict, political]

The different rates of change cause enough stress on system infrastructure (and its designers!) that it inspired a talk I gave often, starting in 1997: "Big Data and The Next Wave of InfraStress", of which an online version can be found in [MAS99a]. The subtle goal for NUMAflex was *not* just to build SGI's next scalable product, but to change the big-bang dynamics and timing into a more continuous process able to incorporate new technologies faster. But, to understand NUMAflex and Origin 3000, it helps to review the predecessor Origin 2000, in a bit more detail.

1.2 ORIGIN 2000 (LATE 1996)

Technically, this is a 2-bristled (2 nodes/Router), 2P/node, directory-based ccNUMA using hypercube topologies up to 64P, and then fat hypercubes above 64P, using extra MetaRouter cabinet(s). The discussion is ordered from the bottom up, for reasons described later: ASICs, system module, larger systems, assessment.

1.2.1 ORIGIN 2000 ASIC SUMMARY

ASIC     Description                  Major ports                Peak B/W, MB/sec
                                                                 each port
HUB      CPU Node Crossbar            1 SYSAD <-> 2 MIPS CPUs    800
                                      1 Memory                   800
                                      1 XIO                      2*800
                                      1 NUMAlink*                2*800
XBOW     I/O Crossbar                 8 XIO                      2*800
ROUTER   Node interconnect crossbar   6 NUMAlink*                2*800

and, on most XIO cards:
BRIDGE   XIO <-> 64b, 33MHz PCI       1 XIO                      2*800
                                      1 64b 33MHz PCI            267

* NUMAlink: the original marketing name was CrayLink, but with the spinoff of the Cray vector unit, the name has been changed to NUMAlink. NUMAlink and XIO are simultaneous bidirectional packet-oriented channels, where each direction is 16 data bits wide, running at an effective 400 MBaud, to get the peak 800 MB/sec each direction. See [GAL96a] for details.

1.2.2 ORIGIN 2000 "MODULE"

An "8P+12I/O Module" is a file-cabinet-size box, with a big midplane:

2 XBOW I/O ASIC crossbars, each with 8 ports
    2 ports to CPU nodes
    6 ports to XIO cards; a PCI shoebox could replace some XIOs.
      XIO cards often used a BRIDGE ASIC that converted XIO to
      64b, 33MHz PCI on-board.
    Total: 12 XIO slots.
4 slots for CPU node cards, each of which includes: 1 HUB ASIC 2 MIPS R1x000 CPUs Memory DIMMs, including directory for smaller systems Directory DIMMs added for larger systems 2 slots for 6-port Router boards, each with one Router chip. A 4P system might save cost using a Null Router, and an 8P system could use a Star Router, but modules expected to be part of a larger system need both Router boards. Half-a-dozen SCSI disks Power supplies, blower, etc. ORIGIN 2000 NODE BLOCK DIAGRAM P P |_______| | MEMORY-----HUB--- (NUMAlink) | I/O (XIO) ORIGIN 2000 MODULE BLOCK DIAGRAM P P P P P P P P |_______| |_______| |_______| |_______| | | | | MEMORY-----HUB---\ /--HUB--MEM MEM-HUB---\ /--HUB-MEM + Router * + Router * I/O / | | \ I/O I/O / | | \ I/O + \-- * ------------- + ---/ * + * + * ++XBOW+++++++++ * +++++++++++++++ * / / | | \ \ ***********************XBOW****** / / | | \ \ The Routers offer 6 NUMAlink ports outside the module, and the XBOWs provide 12 XIO slots. Each Router supports a pair of CPU nodes, one of which is connected to each XBOW. The same module can be a deskside machine or rackmount; a deskside Onyx2 uses a different cardcage to incorporate a graphics unit in place of 2 CPU nodes. An Origin 200 is a smaller box, limited to 2P internally, which uses the same ASICs, but has very different physical packaging, PCI I/O, with optional external XIO box. Two can be cabled together into a 4P system. ASIC/CPU EFFICIENCY A full Origin 2000 module has 8 CPUs, and 8 main ASICs (4 HUBs, 2 XBOWs, and 2 Routers), ignoring I/O cards. A useful first-order metric for cost is the overhead ratio of ASICs/CPU, about 1:1 for the big ASICs in a full module, and a bit worse in partially-populated modules. The 1:1 ratio was an improvement on the previous Challenges, which ran about 3:1. Lower ratios are *not* automatically better, as less ASICs may mean less pins and less bandwidth, but *usually* lower ratios mean less cost, and really large ratios may imply very high cost. This is actually a special case of the more general issue of fixed-costs in scalable systems. In many systems, the customer buys some chassis, then fills it with CPUs and I/O, implying that the worst price-performance occurs when the chassis is minimally filled. The best price-performance occurs when the chassis is filled, except, perhaps, where bandwidth limitations (as in big-bus SMPs) cause the later CPUs to suffer in performance. Thus, people like to have incremental, pay-as-you-go costs, rather than large fixed costs that must be amortized. A big backplane is expensive, and represents a larger unit of failure, than does a smaller backplane with less chips. Engineers are always jiggling the boundaries, as the tradeoffs keep changing, and different design groups can rationally prefer different partitioning. The next generation of ccNUMA systems illustrate this, as systems with quite similar topologies display radically different packaging and ASIC counts. 1.2.3 ORIGIN 2000 LARGER SYSTEMS A rack can hold 2 modules, connected by 2 NUMAlink cables, giving 16P/rack, with 24 XIO slots, and 4 Routers. By adding racks and cables, one builds larger and larger hypercubes (either complete or partially populated), up to 64P. Then, these are connected via MetaRouters (additional racks containing only Routers), as fat hypercubes, to build larger systems. Hypercubes were chosen for scalable bisection bandwidth, which generally grows in proportion to the number of CPUs, and for low latency. 
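As a rough illustration of why hypercubes were attractive, the following sketch (Python) counts the links that cross a bisection, and the worst-case hop count, as the node count grows. It uses the 2*800 MB/sec NUMAlink peak from the table above and idealized topologies; real machines (fat hypercubes, MetaRouters) are more complicated, so this is only the first-order trend.

    import math

    LINK_MB_S = 2 * 800   # full-duplex NUMAlink, peak, from the ASIC table

    def bisection_links(nodes, topology):
        if topology == "bus":
            return 1                    # bisection bandwidth = the one bus
        if topology == "ring":
            return 2                    # constant, whatever the size
        if topology == "hypercube":
            return nodes // 2           # N/2 links cross any dimension cut
        raise ValueError(topology)

    def worst_case_hops(nodes, topology):
        if topology == "bus":
            return 1
        if topology == "ring":
            return nodes // 2
        if topology == "hypercube":
            return int(math.log2(nodes))   # one hop per dimension
        raise ValueError(topology)

    for n in (8, 32, 128, 256):
        hc = bisection_links(n, "hypercube") * LINK_MB_S
        rg = bisection_links(n, "ring") * LINK_MB_S
        print(f"{n:3d} nodes: hypercube bisection ~{hc/1000:5.1f} GB/s, "
              f"{worst_case_hops(n, 'hypercube')} hops worst case; "
              f"a ring stays at ~{rg/1000:.1f} GB/s")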
See [LEN95a] for the details and comparisons of interconnect topologies.

Local restart memory latency is ~330ns. Remote restart latency, to the furthest node in a 128P system, is about 1.2 microseconds. Theoretical average remote latencies (assuming random memory references scattered equally) are lower, since the furthest-away node is the furthest away, and in a hypercube, most nodes are of course closer than the furthest. Effective latencies, as experienced by actual programs, are lower yet, probably more like 400-600ns (in the same ballpark as a Sun E10000), because:

- MIPS CPUs are aggressive speculative CPUs that overlap memory accesses, hiding latency for many codes. Many vendors build speculative CPUs, because they successfully hide some latency, but one must be careful in interpreting latency numbers, as some benchmarks, on purpose, defeat the speculative execution features.

- The theoretical average latency is about as bad as an OS can do, whereas the IRIX OS has become well-tuned at allocating memory "near" the CPU(s) using it.

- OS code is normally replicated into multiple nodes, so that OS instruction cache misses are "closer" than average.

- For special cases, programmers use "cpuset" directives to control memory allocation more explicitly.

All of these serve to reduce the actual latency as seen by programs. Latency citations are often not much better than the old mips-ratings - unless somebody is very specific, it's hard to know what they mean, and being really specific rapidly turns a quick note into a thesis. SGI usually cites "restart latency", or the interval from detection of a primary cache miss through restart of the instruction, including CPU and refill overhead, but with no contention from other CPUs or earlier cache misses (back-to-back). Back-to-back latency (as measured in lmbench) is worse, and other contentions are worse, and very complicated. For discussion, see [HRI97a].

1.2.4 ORIGIN 2000 RETROSPECTIVE ASSESSMENT

I've lost track of the numbers, but I think there are 30,000 or so Origins out there (including the smaller Origin 200s, which use the same ASICs), of which:

Size    Quantity
512P    1          (196GB main-memory system @ NASA Ames)
256P    6-8 (?)    (?, because 256Ps sometimes get split into 2*128)
128P    200+

There are hundreds of 64P systems, and thousands of the smaller ones.

GENERALLY-WORKABLE ccNUMAs IN 1996

Origin 2000s were directory-based ccNUMAs that actually worked when people spread jobs across multiple nodes, and acted more like UMA (Uniform Memory Access) SMPs than did many ccNUMAs of the time. The workability was due to several attributes:

(a) Remote:local latency ratios were generally under 4:1, usually more like 2:1 as seen by typical programs, which means many people didn't need to worry too much about data placement, especially in comparison with some ccNUMAs where the ratio could hit 10:1. Of course, one can keep this latency ratio low (good) by increasing the local latency, but that is *not* a good idea. Much better is to pay fanatical attention to latency everywhere.

(b) The hypercube topology scaled up bisection bandwidth, more-or-less proportional to the number of CPUs, and hence many workloads scaled fairly well. Bisection bandwidth is the bandwidth obtained by slicing the machine in half and computing the total bandwidth across the bisection.
In bus-based SMPs, both total memory bandwidth and bisection bandwidth (= bus bandwidth) remain constant, implying that an increase in the number of CPUs causes a proportionate decrease in per-CPU bandwidths (memory bandwidth and bisection bandwidth). In ccNUMAs, the total memory bandwidth is normally proportional to the number of CPUs, but the scaling of the bisection bandwidth varies according to the topology, i.e., hypercube and 3D-torus scale fairly well, while a simple ring stays constant, like an SMP bus. Bisection bandwidth doesn't matter much if a system is running a workload of mostly-independent transactions that stay in their local nodes, but large-compute or large-I/O jobs can suffer badly if the interconnect gets overloaded, just as people found with big-bus SMPs, when adding more CPUs simply did not help. SGI systems are required to deal with high bandwidths, so simple rings were not an option.

(c) The node size was small (2P), and the same interconnect was used at all levels. This avoided the kind of performance dropoff sometimes found with big-node ccNUMAs, where the inter-node latency difference was so high that some computer centers did not permit users to run jobs that spanned multiple nodes, because the dropoff for remote memory accesses degraded performance seriously. In effect, such machines were run as clusters of independent systems, in which case the ccNUMA hardware was not very useful.

(d) IRIX tried to optimize data placement, and tools were provided for user directives to override the defaults. Early on, customers seriously worried about data placement, but most discovered they need not worry about it much.

Of course, some Origin 2000 users *do* worry about data placement, typically for one of several reasons:

(a) They are going for all-out performance on big parallel CPU jobs, especially those with higher CPU counts.

(b) They are doing I/O tasks that require careful balancing of memory bandwidths. Each node's memory system peaks at 800 MB/sec (and about 600 MB/sec sustained), but we have seen sustained single-file read rates above 4 GB/sec, which requires careful striping across about 8 memories. These issues show up in big-I/O and media-streaming applications.

Some people wanted more local memory bandwidth, but it turned out that the ccNUMA interconnect bandwidths, for most people, were actually a bit higher than needed. However, in making the transition from SMP to ccNUMA, we wanted to make it as smooth and surprise-free as possible, if necessary providing a bit more bandwidth than strictly needed, and then expecting to fine-tune later designs. Anyway, in actual practice, most users didn't worry about it, and certainly, most third-party software vendors haven't. Many people just ran the existing Challenge SMP binaries on Origins as well.

Customers liked the ability to start low, buy incrementally, and build big; they liked being able to disassemble systems and reconfigure. Some of them loved the massive I/O capabilities:

    7 GB/sec from one file system [although one customer I knew wanted 15]
    4+ GB/sec to/from one file
    1 TB backup in an hour
    7-TB individual files actually used by real customers
    Disk farms in use at 10-100+ TBs, with no fsck

While such abilities are relatively straightforward on an Origin 2000, on clusters of machines they simply do not work at all, or not without serious data special-casing and data partitioning.
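As a rough check on the "striping across about 8 memories" remark above, here is the sizing arithmetic as a small Python sketch; the sustained per-node figure comes from the text, while the derating factor for other traffic is an assumption.

    # If one node's memory sustains ~600 MB/sec, a multi-GB/sec single-file
    # read has to spread its buffers (and its I/O paths) over several nodes'
    # memories.  The 0.85 headroom factor is an assumption, not a measurement.

    def memories_needed(target_mb_s, per_node_sustained_mb_s=600, headroom=0.85):
        usable = int(per_node_sustained_mb_s * headroom)  # leave room for other traffic
        return -(-target_mb_s // usable)                  # ceiling division

    for gb_s in (1, 4, 7):
        print(f"{gb_s} GB/sec sustained needs buffers striped across "
              f"~{memories_needed(gb_s * 1000)} node memories")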
Also, the high-speed I/O pipes were crucial for feeding graphics units, handling digital media applications, and doing high-speed networking, not just in handling large numbers of disks. Most people liked the minimal Non-Uniformity, so they could treat the machine more like UMA SMP. People often think Non-Uniformity is a goal. It has *never* been the goal of any SGI NUMA designers, but rather viewed as a necessary price to obtain cost-effective scalability, and a reasonable price to pay as long as kept under tight control. If we could get UMA everywhere, cheap, we'd love it. Until then, we obey the laws of physics, and have NUMA. 1.2.5 BUT CUSTOMERS ALWAYS WANT MORE HAVE IT THEIR WAY (a) Many customers filled every CPU slot (8 per rack), and 2-3 (of 24) I/O slots and complained about the wasted I/O slots. (b) A few customers filled every I/O slot, used only half of the CPU slots, just enough to get all of the I/O connected, and moaned about having to buy CPU boards they didn't really need. (c) Some customers wished for *more* disk in the main chassis. (d) Some customers didn't want *any* disks in the main chassis; they just wanted separate Fibre Channel disk racks, and they complained that SCSI disk slots were a waste. I once visited a customer whose basement was filled with Origins (good!), but all 4 complaints were voiced to me on the visit, by different people. TECHNOLOGY EVOLUTION & MIS-MATCHES People do not expect major in-chassis upgrades for single-board computers, but they demand them for their scalable systems. This "InfraStress" issue was discussed earlier, but the following is more specific. EXAMPLE 1: the Origin 2000 is an XIO machine with a few concessions to PCI. The XIO interface is very fast, with many good characteristics, but overkill for some uses, and the cards naturally cost more than PCI cards. PCI 64b, 33Mhz simply wasn't fast enough for some SGI needs, but 66Mhz wasn't really there yet, and cards weren't available. We did have a PCI shoebox to take care of people with standard-bus needs, but the packaging remains a bit awkward [the shoebox sticks out a few inches.] A year earlier, and Origin 2000 might not have had any PCI, and a year later, it might have been mostly PCI, with a few XIO slots. Likewise, Origin 2000s spanned co-existence/changeover between SCSI disks and FibreChannel disks; had it come out 2 years later, it might have used only FC disks, perhaps. This sensitivity of design to such timing is always painful, and of course, truly awful things happen when some part of a schedule slips, especially for reasons outside one's control. EXAMPLE 2: Sun servers are transitioning from SBUS to PCI, leading to the following sort of comment (from some Sun Web Page): "Q. PCI boards do not package well in Enterprise servers compared to SBus boards. So, why would I recommend SCI PCI over the SBus version? A. You are correct and that is why we are keeping SCI SBus in the product line. However, our next generation of servers will not support SBus. Therefore, customers who will want to cluster today's generation of Enterprise servers with tomorrow should consider using SCI PCI." That is not a knock on Sun, it is a typical inter-generation transition problem that occurs again and again with I/O. Customers not only keep old I/O devices "forever" (some SGI customers still have many VME cardcages out there), but still want access to the newest devices. 
I/O is also awkward, in that it's as much political as technical, the standards come from shifting industry coalitions, and the schedules are often unpredictable. REDUNDANCY & RESILIENCY Some people wanted more redundancy, but others wouldn't pay for it. In most cases, for a single system size it is cost-optimal to build a design optimized for that size, although doing so may not be optimal across a product line. Computer designers and marketeers are always arguing about the number and sizes of distinct boxes to be built. A unit that is too big may create a large unit of failure, and a unit too small may impose high costs for modularity. For example, if every replaceable unit needs its own N+1-redundant power supplies, something built of small units ends up with "redundant redundancy". While an 8P+12 I/O unit was not that big, especially compared to the larger servers around, it was still too large a unit of failure for some customers, even with some redundant internal paths. With 2 XBOW ASICs, each connected to 2 CPU nodes, there were at least 2 CPU<-> I/O paths, but still, a node failure required a reboot. People wanted support for partitioning, and that *almost* worked, but not quite. Under certain circumstances, reset signals propagated into other partitions, a Bad Thing. It was good enough to debug partitioning software, but not (the required) 100% safe to release. AGGLOMERATED SYSTEMS The partitioning issues were just a few among the myriad of details that one must learn to get right in a system constructed from an agglomeration of relatively-independent modules. System controllers aren't simple, but need to cooperate with others. Methods that work fine in uniprocessors or small SMPs, or even some big-bus SMPs may fall apart when scaled up. In the "good old days", a master CPU could check each slot and see what's there. In a modular system that supports a wide range of topologies, and may have broken links, it's not so easy. Algorithms that used to work may become impractical due to scaling issues. Elapsed times may become dominated by serialized code. For example, UNIX "fsck" is a disaster on a 10-TB filesystem, which is why people use journaled filesystems like XFS. Likewise, early on, the reboot time for the 512P + 196GB-main-memory Origin 2000 at NASA AMES was .... 2 hours. (It's much less now). In general, leading-edge, high-end systems get to have more "close encounters of a strange kind" with scaling issues that sound like fantasy to most people, but later come to afflict many systems. Just 10 years ago, many people considered the ideas of 64-bit micros and more-than-4GB-memory as strange, but we've seen 8GB desktops already. Anyway, although these agglomerated systems overlap in market space with bus-based SMPs, they add all sorts of exciting new issues to solve. INTERNAL ELECTRICAL IMPLEMENTATION ISSUES In some cases, the multiplicity of Origin 2000 physical realizations caused hassles or overhead. For example, node cards used a tricky connector, over which ran both NUMAlink and XIO onto traces that connected with Routers and XBOWs, which then had different connectors for NUMAlink cables and XIO cards. To run an XIO cable (XTown) (to a graphics unit) required an XIO card with an "XC" chip to provide differential signals. All of this was done for good reasons, but people wished for less different flavors, especially since high-speed-signal engineering is nontrivial. 
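To make the discovery problem mentioned under AGGLOMERATED SYSTEMS a bit more concrete: in a modular, cabled system, finding out what is present amounts to exploring a graph from whichever node is bootstrapping, skipping dead links, rather than probing a fixed set of slots. The following toy sketch (Python, with a made-up 4-brick example) only illustrates that shift in flavor; it is not SGI's actual PROM or boot logic.

    from collections import deque

    def discover(start, links):
        """links: dict mapping node -> list of (neighbor, link_ok) tuples."""
        seen, frontier = {start}, deque([start])
        while frontier:
            node = frontier.popleft()
            for neighbor, link_ok in links.get(node, []):
                if link_ok and neighbor not in seen:   # skip broken cables
                    seen.add(neighbor)
                    frontier.append(neighbor)
        return seen

    # Hypothetical 4-brick example with one broken cable to C3.
    links = {
        "C1": [("R1", True)],
        "R1": [("C1", True), ("C2", True), ("C3", False), ("C4", True)],
        "C2": [("R1", True)],
        "C3": [("R1", False)],
        "C4": [("R1", True)],
    }
    print(sorted(discover("C1", links)))   # C3 drops out: ['C1', 'C2', 'C4', 'R1']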
STATUS - LATE 1997 The Origin 2000 was introduced in late-1996, and its smaller sibling, the Origin 200, in early 1997. Hardware engineers were already considering new ASIC designs. There was, of course, serious churning around due to digestion of the (mid-1996) Cray merger, changes in the executive ranks, etc. The high-end MIPS roadmap had been changed, with two new designs cancelled in favor of extensions to R10000/R12000. Intel's IA-64 was to be incorporated into systems. We also had experience from T3Es, whose sweet-spot was larger than Origin 2000, but whose engineering philosophies were more similar than one might expect - both groups shared fanatic attention to latency, bisection bandwidth, and serious I/O, even though many other details were different. We had considerable software efforts going towards scalability, attempting every 9-12 months to double the size of the largest useful configuration. So, we continued to believe strongly in directory-based ccNUMAs, but we'd learned a great deal from real practice, and we knew we would have a long overlap of MIPS and IA-64 systems. At that point (late 1997), we were going to have 2 product lines: - Flintstones: MIPS/IA64; 2-128P [hardware primarily Mountain View] - Aqua: MIPS; 64-1024P [hardware primarily Chippewa Falls] So, we'll see how we got from there to where we are now, but first: REMINDER OF AN UNPLEASANT FACT OF CURRENT COMPUTER DESIGN Once upon a time (15 years ago), people built microprocessor systems that included many PALs and wires, and they could be changed and fixed fairly easily. Boards often shipped to customers with fixup wires soldered on them. These days, people don't have this luxury, but work for years on monster CPUs and ASICs that have more gates than whole boards had just a few years ago, cannot be fixed with soldering irons, and require gate counts, clock rates, or special circuits not yet obtainable from FPGAs. It can easily take a few months for a complete turn on a CPU or aggressive ASIC, so verification tests become ever more important. Thus, ASICs & CPUs have to be started long before one knows *all* of the details of system partitioning, packaging, configurations. If new requirements cause major changes, especially to interfaces between ASICs (or with CPUs), people intone the dread words "Major Schedule Slip", meaning year(s). On the other hand, given a flexible set of ASICs, one may be able to build a wide variety of systems. For example, Origin 2000 and 200 use the same major ASICs, but are otherwise rather different. For this reason, I keep presenting ASICs first, then system design, because the ASICs tend to acquire more inertia earlier in the process. This will help explain what happened, and I've often noted that there is often more insight to be gained from hearing about paths considered, but then rejected, than in just knowing the path chosen. 1.3 "FLINTSTONES PROJECT" (AS OF JANUARY 1998) At that point, SGI was fairly far along with the designs for several ASICs, and people were developing concepts for system partitioning and packaging. There were to be both MIPS and IA-64 flavors, although whether we were doing one or the other, or both, and if both, in which order, sometimes changed according to executive decisions! In such cases, good engineers keep on slogging away, and opt for very flexible designs, Just In Case. [There must surely be a Dilbert to this effect.] 
In a net posting, I won't describe all of the variants, but it is worth looking at our clearly-defined view of the world in early 1998. 1.3.1 ASIC SUMMARY (AS OF JANUARY 1998, STILL TRUE) ASIC Description Major ports Peak B/W, MB/sec Each port Bedrock CPU Node Crossbar 2 SYSAD <-> 4 MIPS CPUs 1600 1 Memory 3200 1 XTown2 -> XBridge 2*1200 1 NUMAlink3 2*1600 XBridge I/O Crossbar 2 XTown2 (XIO+) 2*1200 4 XIO ports 2*800 2 64b 66MHz PCI 533 Router Node interconnect 8 NUMAlink3 2*1600 XXXXXXX Itanium bus <-> MIPS bus Not yet public Bedrock fills the same role as the Origin 2000 HUB, except has: 4X CPU-bus bandwidth (for 4 CPUs, not 2) = 2 MIPS SYSAD busses, each 2X faster 4X memory bandwidth 2X ccNUMA interconnect bandwidth (NUMAlink -> NUMAlink 3) 1.5X I/O bandwidth (XIO -> XTown2) XBridge resembles an XBOW with 2 builtin BRIDGES, and can be used to provide either XIO or PCI ports. Three XBridges ganged together the right way can supply 6 64b 66Mhz PCI busses. The differing improvement ratios reflected costs, physics, and studies of Origin 2000 performance. For example, many wanted more memory bandwidth, but relatively few had saturated the ccNUMA or I/O interconnect. It would have been nice to have made the XTown2 bandwidth 2*1600 MB/sec, but that was not feasible. Given these, it is possible to build all sorts of systems, but the basic CPU nodes must look like the following, with Routers and I/O placed wherever makes sense for system partitioning. MIPS Itanium P P P P P P P P |_______| |_______| |_______| |_______| \ / \ / \ / XXXXXXXX XXXXXXXX \ / \ / MEMORY---------Bedrock---NUMAlink 3 MEMORY--Bedrock---NUMAlink 3 | | XTown2 XTown2 The NUMAlink 3's (optionally) connect Bedrock <-> Bedrock, or Bedrock <-> Router. The XTown2's connect to (optional) XBridge chips. Numerous topologies are possible. For example, one might have 2 CPU nodes per Router, leaving 6 ports free to connect to other Routers, or 4 CPU nodes/Router, leaving 4 ports free. It is good if the smallest machines (4P or 8P) can avoid paying for Routers. [In Origin 2000s, several special Router cards were used to achieve this, but it was irksome to need the variations, or to change router boards as systems scaled up.] 1.3.2 FLINTSTONES MODULE - BAMBAM This was the replacement for the Origin 2000 module, would have been fairly similar, and naturally have been named Origin 3000 or 4000. It was proposed as a 10U (17.5") box: 8 CPUs, connected to 2 Bedrocks 3 XBridges, giving either 12 PCIs (3 4-PCI shoeboxes), or 8 PCIs and 2 XTown2s 6 disks 2 removable media (Device Bay) 1 Router, with 2 ports connected to the 2 Bedrocks, 6 ports free Miscellaneous other items as needed; power, fans, etc. Ignoring I/O, a full BamBam would have had 2 Bedrocks, 1 Router, and 3 Xbridges, or 6 ASICs / 8 CPUs, a .75:1 ratio, better than Origin 2000's 1:1. However, if one recalls complaint 1.2.5 (a)-(d) above, nobody is ever satisfied with a fixed CPU-I/O ratio, and SGI not only has Big-CPU + Big-I/O customers, but also Big-CPU-hardly-any-I/O customers, so the following was added, to avoid wasting money on unused I/O: 1.3.3 FLINTSTONES CPU MODULE - PEBBLES 5U box 8 CPUs, connected to 2 Bedrocks 1 Router, with 2 ports connected to the 2 Bedrocks, 6 ports free These have a ratio of 3 ASICs: 8 CPUs, or .375 (good). 1.3.4 FLINTSTONES - BIGGER SYSTEMS It is clearly possible to build Flintstones systems similar to Origin 2000s, by combining BamBams, and CPU-rich configurations could be gotten by adding Pebbles boxes. 
With 6 free router ports per box, there were of course a myriad of potential topologies, limited mainly by one's imagination and willingness to handle the cabling variations in the field.

1.3.5 BROADER CONTEXT AND OTHER PROJECTS

In addition to the teams working on Itanium and MIPS, another group was getting started on Intel "McKinley" designs. Looking forward over the next few years, one could observe that:

(a) The various CPUs had different (sometimes, radically different) power, cooling, and packaging requirements, with different CPU busses. Among other effects, it is extremely difficult to optimize dense-packed CPU boards to handle such differences efficiently, especially when specifications change.

(b) 64b, 66MHz PCI satisfied most needs, but we still had to support XIO for a few cases, as well as legacy re-use. Later, we might want PCI-X; and then we might have either Intel's NGIO or the rival Future I/O, with no way of knowing which would win, or whether both would be required. As it happened, thankfully, they merged to create one - InfiniBand. From past experience, we knew that I/O standards efforts were unpredictable.

(c) There was no end of argument about I/O mixtures for BamBam. Almost every integrated system design I've been involved with has had these arguments. There is only so much physical space, and there are always serious fights over every bit of it.

(d) There was no end of argument about the amount of redundancy required, and where it would be, and who would or wouldn't be willing to pay for it.

(e) Any optimizations we came up with for delivery at a specific date seemed to rapidly become non-optimal over the following few years, especially given the rapid changes to I/O standards.

1.3.6 WHAT HAPPENED THEN WAS ...

We didn't actually build early-1998's Flintstones at all, although if we had, it might have shipped a bit earlier. The name did persist, but what we built was very different from Origin 2000 or January-1998 Flintstones. In mid-January, a more modular design was suggested, and went through frequent iterations and input from numerous people. Also, in March, Aqua and Flintstones were combined into one project. By June 1998, we had mostly settled on something that was radically different, not yet complete, but recognizable in the systems finally built. The ASICs mostly stayed the same, of necessity, but almost everything else changed.

In January 1998, it was not at all clear that the new approach would work. Physical and low-level electrical design issues often get short shrift in publicity, but in this case, people continually solved very subtle and difficult problems, without which this whole approach was infeasible.

However, something much more important happened during 1998, although it wasn't yet quite so obvious. We created an overall design approach (now named NUMAflex) that encompasses multiple generations and families of machines, and that gives better hope of adapting to uncertainty. We generated plans out through 2006, looking at evolutionary scenarios. We modified some earlier designs to avoid causing problems for later ones, and we modified later designs to make better re-use of earlier pieces. Hence, instead of having several independent projects, we got one big project, with subprojects that shared many common pieces. [This has been done before, of course, but it hasn't been very common in the large microprocessor-systems business, and especially not at SGI, so it was a major philosophical change.]
We changed to a design philosophy that emphasized frequent improvements, rather than having 3-4-year development efforts optimized for a delivery date, but then inevitably becoming suboptimal. High-end systems suffer from this, across the industry, because it takes big efforts to do an entire new design. NUMAflex is a long-term-oriented design approach, not just targeted at the first family, Origin 3000. If it were *only* that family, some optimizations would be *quite* different. Hence, this whole approach is a large bet that a continuous design process will work better than the "big-bang" approach.

Also during 1998, we decided that Linux-IA64 would be used rather than IRIX-IA64 for the IA64 versions. That is a whole separate story, based on strong input from major third-party software vendors.

2. SGI NUMAFLEX (TM) DESIGN APPROACH - BRICK-AND-CABLE MODULARITY

All of the NUMAflex systems are assembled from "bricks" (i.e., particular sorts of "cyber-bricks"), using high-speed cabled interconnects, rather than normal big backplanes. People seeing an Origin 3000 often look for the big backplane. THERE ISN'T ANY. Bricks are rack-wide, mostly 3U-4U high (5.25" - 7"), connected via cables. Brick-and-cable is the implementation essence of all NUMAflex systems.

As mentioned earlier, a good physical analogy is that of modern, modular audio components that plug together and evolve independently. By wonderful contrast, our marketing folks showed an old wooden TV+record-player entertainment center - beautiful wood, but not very upgradeable, no CD!

FIRST DO NO HARM - we wanted to keep the scalability we had in Origin 2000, and then do better:

2.1 INDEPENDENT RESOURCE SCALABILITY - CPU+memory, I/O, and storage can be scaled (relatively) independently in the varying ratios desired by customers, without wasting slots, floor space, power, and cost. In early NUMAflex systems, the number of I/O bricks is no more than the number of CPU bricks, but we expect to relax that restriction later.

2.2 INDEPENDENT RESOURCE EVOLVABILITY - CPU+memory, I/O interfaces, storage, and even interconnects can be evolved at their own natural rates, still work together, and offer many years of cost-effective, yet compatible upgrades. Numerous subtle problems must be solved to do this. MIPS and Intel IA-64 NUMAflex families offer long roadmaps, and most hardware elements are shared, both between the MIPS and IA-64 families and between generations.

The *crucial* enabler of the two attributes above is the (somewhat unusual) partitioning of I/O, wherein a CPU brick only pays for an XTown2 connector. If I/O is to be connected, one adds an XTown2 cable, and an I/O brick that internally converts XTown2 to the specific I/O implemented in that brick. VERY IMPORTANT: CPU BRICKS ARE COMPLETELY DECOUPLED FROM THE END I/O BUSSES - no CPU brick contains any manifestation of XIO, PCI, PCI-X, InfiniBand, etc. I/O is only paid for as I/O slots are filled, and CPU decisions are decoupled from I/O evolution, so that CPU and I/O evolution could become more asynchronous, for hardware. [There is *always* A Minor Matter of Software.]

This approach is clearly very different in packaging from the integrated I/O of the Origin 2000 and many older SMPs. It is less different from designs that have separate I/O ASICs plugged into CPU backplanes, with cables to remote PCI boxes, like the Compaq GS320. The subtle, but important, difference is that no hardware component in the SGI CPU brick depends on the choice of the I/O busses employed elsewhere.
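A conceptual sketch (Python, not actual SGI software) of the partitioning rule just stated: the CPU brick knows only that it has an XTown2 connector, and everything bus-specific lives in whatever I/O brick is cabled onto it.

    class CPUBrick:
        def __init__(self, cpus):
            self.cpus = cpus
            self.xtown2_ports = [None]          # just a connector; no bus knowledge

        def attach(self, io_brick):
            self.xtown2_ports[0] = io_brick     # cable an I/O brick onto the port

    class IOBrick:
        """Converts XTown2 internally to some concrete I/O bus."""
        def __init__(self, bus_type, slots):
            self.bus_type, self.slots = bus_type, slots

    c = CPUBrick(cpus=4)
    c.attach(IOBrick("PCI 64b/66MHz", slots=12))   # swap in an XIO or future brick
    print(c.xtown2_ports[0].bus_type)              # the C-brick itself never changed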
By 2001, we should have seen the whole new round of ccNUMA designs, and then can see the different ways people have sliced this problem. Meanwhile, we are quite happy to have the line drawn where we put it, that is, with CPU bricks bearing little cost for I/O and having no dependence on the type of I/O connected. This allows great flexibility of evolutionary paths, but also, upgrade paths for individual installed machines. 2.3 PERFORMANCE - Of course. Bandwidths up (~2X), latencies down (~.5X), CPU performance up about 30% (at same clock rate) in Origin 3000, and later NUMAflex machines improve more. 2.4 PRICE/PERFORMANCE - it is not good enough to build fast, but expensive systems. These systems use commodities wherever they can, and get some unusual economies of scale by minimizing the number of distinct entities, given the immense range of configurations. We're happy to have had big backplanes disappear. It is convenient that more elements are reasonably FEDEXable. An obvious improvement is that we reduced the typical 1 big ASIC per CPU ratio in the O2000 (8 CPUs, 4 HUBs, 2 XBOWs, 2 Routers) down to .3-.6, depending on I/O richness. [8 Bedrocks, 2 Routers, and likely 4 Xbridges typical for 32P system: 14:32, or .44]. MIPS/IRIX, and IA-64/Linux systems share most hardware elements, and even if drivers are different, the overhead of device qualification and support is far less than doing it twice. 2.5 STRONG RAS FEATURES - the use of independently-powered bricks solved numerous problems and ended many arguments, once we became convinced that it was actually possible at the speeds expected. Becoming convinced took serious efforts by many people! There is an obvious pluggable connection at the end of each cable. There are less different kinds of connectors, cables, and power supplies. PCI cards are warm-pluggable, without moving anything else around. Power supplies and fans are N+1-redundant and hot-pluggable. I/O bricks can be cabled to 2 separate CPU bricks, rather than being incapacitated if their attachment to a unique CPU node fails. Although dual-attach is common in mainframes, it is rarely found in micro-based systems, even systems introduced in 2000, most of which either integrate I/O with CPU boards, or if cabled, have no redundancy in path from I/O box to CPU+memory system. 2.6 SERVERS AND GRAPHICS - of course, SGI always does this, but with even more flexibility, given the increased modularity of the bricks. 2.7 FAMILIES - A NUMAflex family is typically defined by the CPU brick, i.e., one may mix different speeds of the same flavor, may upgrade within the family with no software-visible changes, etc. Typically, different C-brick ASICs create different families, due to necessary changes in addressing, cache-coherency protocols, etc. For example, early MIPS, Itanium, and later IA-64s are clearly different families. Some bricks (like I/O & Routers) and cables are used in common by multiple families. Early I/O bricks will live a long time, even if newer systems begin using newer I/O bricks, but the general process is one of incremental overlap, rather than big-bang changes. Some elements (racks, power supply, etc) are common among all families. Anyway, the whole point is to maximize the re-use across families, avoid premature obsolescence, and give a lot of flexibility with regard to I/O and CPU evolution, and in general, PROTECT CUSTOMER INVESTMENT. 
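Before moving on, here is the ASIC-per-CPU bookkeeping used in 2.4 (and back in 1.2.2) made explicit, as a minimal Python sketch; the counts are the ones quoted in the text, and anything beyond that is configuration-dependent (I/O-rich systems add XBridges, CPU-rich ones don't).

    def asic_per_cpu(cpus, **asics):
        return sum(asics.values()) / cpus

    # Origin 2000: full 8P module = 4 HUBs + 2 XBOWs + 2 Routers
    print(asic_per_cpu(8, hub=4, xbow=2, router=2))           # 1.0

    # Origin 3000: typical 32P = 8 Bedrocks + 2 Routers + 4 XBridges
    print(asic_per_cpu(32, bedrock=8, router=2, xbridge=4))   # 0.4375, i.e. ~.44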
2.8 UNUSUAL FLEXIBILITIES

With Origin 2000s, people sometimes built up a big configuration, then later split it up into several pieces placed in different locations. NUMAflex systems are like this, but much more so. It is also easier to put the bricks down submarine hatches, or into other constrained spaces. Smaller bricks are inherently more rugged than larger modules, so they are directly suitable for some applications that required ruggedizing of Origin 2000s.

So, what are the details?

3. ORIGIN 3000 + ONYX 3000 = FIRST NUMAFLEX FAMILY

Technically, these are 4-bristled ccNUMAs (4 nodes/Router), 4P/node, using hypercube topology up to 128P, and then fat cubes up to 512P. In CPU-rich configurations, they use about half the floor space per CPU of the Origin 2000s. One can have a complete 32P, CPU-rich configuration in one rack, or 128P in 5 racks (4 CPU, 1 I/O). An Origin 3000 rack resembles a rack of loosely-coupled small computers, but as bricks are added, they form a tightly-integrated system, whose close joining is symbolized by the wavy vertical lines on the front door.

3.1 BRICKS

Each brick has hot-plug, N+1 fans at the front, pulling air through to the back. This was a change from the big blower used in the Origin 2000.

BRICK    HEIGHT (U=1.75")   DESCRIPTION

C-brick  3U    2-4 MIPS R12000A CPUs, 1 Bedrock ASIC, 512MB - 8GB of memory
               1 NUMAlink connector at back (2 * 1.6 GB/sec)
               1 XTown2 connector at back (2 * 1.2 GB/sec)
               LCD, Level 1 system controller, etc.
               Local memory bandwidth (peak) = 3.2 GB/sec, regardless of size
               [This is strangely important: bandwidth does *not* depend on
               the number of DIMMs, unlike, for example, the Compaq GS320,
               where the full quoted bandwidth only occurs for some memory
               sizes, whose minimum is 4GB for a 4P node.]
               Local memory (restart) latency = 180ns.

R-brick  2U    1 8-port Router (or 6-port in smaller systems)
               8 * (2*1.6 GB/sec) ports
               Latency: CPU -> cable -> Router -> = about 30ns for a 1m cable,
               a bit more for longer cables.
               A good rough estimate is that Origin 3000 latency tends to be
               about 50% of Origin 2000 latency, at the same CPU counts. Local
               memory latency is better, per-hop latency is better, and the
               4-bristled topology with 8-port routers costs fewer hops.

I-brick  4U    Base I/O; every partition in a system needs at least one.
               Connects to 1-2 C-bricks
               1 XBridge ASIC
               2 XTown2 ports (can dual-attach, either for resiliency or for
                 bandwidth, or the second one can attach to a G-brick)
               2 64b PCI busses
                 1 66MHz, 2-slot
                 1 33MHz, 4-way: 3 slots, plus local I/O (1 serial, 2 1394s,
                   2 USBs, Ethernet, RTI/RTO (Real-Time Sync ports))
               2 FC disks (FC controller uses a slot)
               1 DVD/CD-ROM [1394]

P-brick  4U    PCI brick, 12 64b 66MHz slots on 6 busses
               3 XBridge ASICs ganged together
               XBridge #1 uses all ports:
                 1-2 XTown2 to C-brick(s)
                 2 PCI busses, with 2 slots apiece
                 2 XIO to XBridge #2
                 2 XIO to XBridge #3
               XBridge #2 & XBridge #3 supply 2 PCI busses each, with 2 XIOs
               to XBridge #1, and 2 XIOs unused.
               Sustained PCI bandwidth could get as high as 6 * 400 MB/sec, or
               2400 MB/sec, which would likely saturate 1 XTown2 (2*1200 MB/sec
               peak), but of course, the P-brick has 2 XTown2 ports, so
               bandwidth-intense applications would use both. This also leaves
               headroom for later (faster) PCI-X versions.
Thus, the customer can have: - Low-cost, connectivity applications - 1 port connected - Redundant, connectivity applications - 2 ports connected - High-bandwidth applications - 2 ports connected All I & P-brick PCI cards are (in hardware) warm-plug, with horizontal insertion/removal from the rear, using a plastic carrier that pushes the PCI card down as needed. [Big-system hardware designers are not fond of PCI's insertion at right angles to the external connectors. Note, for example, that Compaq's GS320 documentation says nothing about warm-plug PCI. X-Brick 4U XIO Brick, 4 XIO slots (like Origin 2000's) 1 XBridge ASIC 2 XTown2 ports to outside 4 XIO slots (PCIs unused) G-brick 18U Graphics (Infinite Reality 3) 1 XTown port, connectable to XTown2 port in I, P, or X-bricks. A pleasant improvement: it no longer needs a card to convert XIO to differential XTown. An Onyx 3000 is an Origin 3000 + G-brick(s). Of course, 18U is a giant "brick" and the graphics backplane is the only big one in the system, as this brick was essentially brought forward from Onyx2. D-brick 4U JBOD Disk brick [RAIDs are normally in other racks] 1-12 FC disks Powerbay 3U (Not really a brick, although similar size). 1-6 hot-plug power supplies, yield 48V, which runs to C, R, I, P, X bricks, each of which is responsible for its own internal power conversions [they vary]. They use industry-standard power supplies. Tall rack has 1-2 of these, short rack has 1. Other items Bricks have Level 1 system controllers. Each CPU rack in bigger systems has an L2 syscon as well, which runs the door's display, communicates with other L2s and system console (if any) via Ethernet. Every brick is independently powered & removable, with the degree of pluggability being up to software, which improves over time. 3.2 CABLES The XTown2 (5 meters max) and NUMAlink 3 (3 meters max) cables are otherwise identical, with the same connectors. There are no big backplanes or mid-planes: Routers plus cables are the equivalent, in effect a virtual backplane. The Origin 2000's multiplicity of connectors and backplane traces has been simplified into one connector type for high-speed signals. Following is a block diagram for the CPU + Router part of a 16P system; an I-brick plus 0-3 {I-, P-, and X-bricks} would be connected to XTown2 ports. This shows 4 C-bricks, 1 R-brick, and 4 NUMAlink 3 cables: C-brick 1 . C-brick 2 . C-brick 3 . C-brick 4 . . . P P P P . P P P P . P P P P . P P P P |___| |___| . |___| |___| . |___| |___| . |___| |___| \ / . \ / . \ / . \ / MEM--BEDROCK . MEM--BEDROCK . MEM--BEDROCK . MEM--BEDROCK / | . / | . / | . / | XTown2 | . XTown2 | . XTown2 | . XTown2 | ............................................................ Cables | |______ ____| | |_____________________ \ / __________________| \ | | / ............................................................ ROUTER R-brick / | | \ ............................................................ 3.3 RACKS Tall rack - 74" high, 30" wide, 50" deep - 39U configurable space Half-rack - 34" high, 24" wide, 42" deep They have the usual cool-SGI colors & plastic skins, and the tall rack takes care of the (serious) cabling issues. Of course, some SGI customers will use their own racks, especially the ones who put them into vans, airplanes, other embedded systems, or the aforementioned submarines. 3.4 MODELS - HAVE IT YOUR WAY ... WITHIN LIMITS OF SANITY It is obvious that one can make up zillions of different combinations, but anybody who does that too much will be sorry later. 
Hence there are standard configurations and ways to put these things
together, with a lot of leeway left if somebody wants to buy $100M of
something different.

On occasion, a "molecular" notation is useful in describing
configurations, and it is no accident that brick names have distinct
letters. Treat the numbers as subscripts. (A short script that expands
this notation follows at the end of this section.)

3.4.1 SGI ORIGIN 3200, 2-8P EXAMPLES

This is normally in a half-rack; molecular notation follows, with
ASIC/CPU ratios in []; remember that a C-brick typically has 4 CPUs.

CI      2-4P, minimal system (has 1 powerbay) [1:1 - .5:1]
CID     CI plus disk-brick
CID2    CI plus 2 disk trays
C2I     8P, still routerless [3 ASICs, .375:1]
C2ID    8P with JBOD disks
C2IP    8P, but more PCI [5 ASICs, .625:1]
C2IG    8P, graphics ... Onyx 3200, in tall rack (the 18U G-brick!)

3.4.2 SGI ORIGIN 3400 OR ONYX 3200, 4-32P EXAMPLES

1 rack, with all CPUs, I/O, and storage, OR
2 racks = 1 CPU rack, rest in the second rack, OR
3 racks (or more) = 1 CPU rack, 1 I/O rack, and more racks for G-bricks
and disk.

The basic approach is to build up a C4R group: a Router and 1-4
C-bricks. For smaller systems, I/O goes in the remainder of the first
rack. For larger ones, I/O starts in the second rack. A few examples:

C4R2IPD    16P & some I/O & disk; 1 rack. [.63 ASIC/CPU]
           [These come configured with 2 Routers]
C8R2I      32P, 2 disks, a few Ethernets; 1 rack. [.34 ASIC/CPU]
           This is naturally a system liked by CPU-heavy users, but
           certain kinds of e-commerce customers actually do it also.
C8R2IGD    A 32P Onyx 3000, 2 racks, one disk tray.
C8R2IP7D*  32P I/O configuration; 2 racks plus disk racks. [1 ASIC/CPU]
           Ignoring the I-brick: 84 PCI slots, 16.8 GB/sec of I/O.
           Using an SGI TP9100 storage unit [9 12-drive chassis/rack],
           and supposing you have 73GB drives and you do single attach,
           you get 84 * ~870GB = ~73 TB, in 10 disk racks, dwarfing the
           2 racks for CPU & I/O.

3.4.3 ORIGIN OR ONYX 3800, 16-512P EXAMPLES

Here, for cabling sanity, CPU racks are kept "pure".

C32R8I     128P, 2 disks, some Ethernets. 5 racks. [.31 ASIC/CPU]
           Definitely for CPU-heavy applications.
C32R8IP31  128P, plus disks; looks like you could get 372 PCI slots
           (ignoring the I-brick), or 323 TB of disk. [For some of our
           friends who make 7-TB files, that's only 46 such files, so
           don't laugh...]
C128R44..  512P, and some racks have an extra R-brick at the top to
           connect with other racks.

NOTE: all of the above are *hardware* possibilities, with no guarantees
that software has tested and supports all such configurations. Sales
literature and IRIX release notes describe the actual limits, which in
practice rise over the years. In practice, while small configurations
may in fact max out their I/O limits, we have not generally seen large
Origin 2000s fill every I/O slot.
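Since the molecular notation is just brick letters with optional repeat
counts, it is easy to expand mechanically. Here is the minimal sketch
promised above (not an SGI configuration tool); the 4-CPUs-per-C-brick
figure and the per-brick ASIC counts are taken from section 3.1, and the
counting convention is my own, so the ratios are only approximate:

    # Minimal sketch: expand the "molecular" brick notation and estimate
    # CPU counts and ASIC/CPU ratios.  Assumptions from section 3.1:
    # 4 CPUs and 1 Bedrock per C-brick, 1 Router per R-brick, 1 XBridge
    # per I- or X-brick, 3 XBridges per P-brick; G- and D-bricks are not
    # counted.  This is not an SGI configuration tool.
    import re

    CPUS_PER_C = 4
    ASICS_PER_BRICK = {"C": 1, "R": 1, "I": 1, "P": 3, "X": 1, "G": 0, "D": 0}

    def expand(formula):
        """'C8R2IP7D' -> {'C': 8, 'R': 2, 'I': 1, 'P': 7, 'D': 1}"""
        counts = {}
        for brick, repeat in re.findall(r"([A-Z])(\d*)", formula):
            counts[brick] = counts.get(brick, 0) + (int(repeat) if repeat else 1)
        return counts

    def summarize(formula):
        counts = expand(formula)
        cpus = counts.get("C", 0) * CPUS_PER_C
        asics = sum(ASICS_PER_BRICK.get(b, 0) * n for b, n in counts.items())
        return cpus, asics

    for formula in ("C4R2IPD", "C8R2I", "C8R2IP7D"):
        cpus, asics = summarize(formula)
        print(f"{formula:10s} {cpus:4d}P  {asics:3d} ASICs  "
              f"{asics / cpus:.3f} ASIC/CPU")

Under these assumptions, the three examples come out at 0.625, 0.344,
and 1.0 ASIC/CPU, which is (rounded) where the bracketed ratios above
come from.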
3.5 SOFTWARE, PARTITIONING, ETC

Origin 3000 and Onyx 3000 run IRIX, binary-compatible with the earlier
systems, and from a user viewpoint essentially identical. Of course,
IRIX is already comfortable with ccNUMA systems that handle huge
memories, large CPU counts, and big I/O systems. As usual, large numbers
of software people labored mightily on this project, especially to make
sure that these systems appeared minimally different from the Origin
2000. Many improvements were made under the skin, especially for
tolerating and recovering from errors.

As is often the case, but sometimes frustrating to old OS people (like
me), some of the most painstaking and difficult OS work produces no
cool-looking visible feature, but rather improves performance or
reliability, or lets people spend less effort tuning applications, i.e.,
the better the job, the more invisible it is! This is always
unsung-hero(ine) work, and there was a great deal of it done this time.
In fact, the Origin 3000 was released with the same standard mainline OS
release available on other platforms, a major improvement over the
common habit of distinct releases for new hardware.

Work continues on being able to warm-plug PCI cards, I/O bricks, and
later, CPU bricks, in that order. The hardware for all this works fine,
but ... It's a Minor Matter of Software.

The systems can be partitioned, with each partition requiring an
I-brick. When partitioned, they act like clusters that happen to have
high-speed memory-to-memory links. Systems can be clustered, and
customers can arrange the same hardware in a myriad of ways.

I believe that the usual argument of single system image versus cluster
is a diversion. Rather, most environments are likely to be clusters of
machines of the appropriate sizes, where the issue is workload-dependent
sizing.

Workloads of independent, CPU-intense jobs, with minimal data sharing
and very predictable resource requirements, have been the most widely
successful in clusters. For example, high-energy physics, some chemistry
problems, final graphics rendering for films, some Web applications, and
some transaction processing applications fit this well. In addition,
some individual codes have been successfully parallelized to run on
clusters of small systems, typically using MPI message-passing.

On the other hand, if jobs are less independent, do more I/O, share more
data, are more unpredictable, or are not very amenable to rewriting for
message-passing, then people sensibly prefer larger systems. For
example, if one needs to run 1000 copies of a single 32-bit, 240MB,
integer-CPU-intense job in throughput mode, then one might usefully buy
1000 256MB PCs. If, however, the job changes to require 300MB, somebody
has to open up 1000 PCs and replace DIMMs. More subtly, some jobs vary
dramatically in their memory requirements, either by phase or according
to the specific input, causing people to wish for bigger systems to
absorb the dynamic variations.

Some customers are rightfully happy with hundreds or thousands of Linux
PCs. Some SGI customers run clusters of 128P systems, and would like
clusters of 512P systems, if they had the budget. For others, the
optimal size is somewhere in between 1P and 128P - I've personally seen
or heard of Origin 200s and 2000s used in clusters of 2P, 4P, 8P, and
32P elements, but the other sizes probably happen as well. For example,
I know one customer whose ideal is a cluster of systems, each with 32P,
16GB of memory, and 2 disks. Each system makes an in-core copy of an 8GB
database, because the running system simply cannot afford many disk
accesses.

The NUMAflex bricks, of course, adapt even better to these varied sizing
needs than the Origin 2000s did, given independent resource scalability
and the smaller increments of CPU and I/O.

4. THE NEXT FAMILY - ITANIUM-BASED NUMAFLEX

This has not been announced yet, so I cannot say much. However, the
architecture is easy to describe: it's the same, except:

(a) The C-bricks have 4 Itaniums, with 2 XXXXXXXs, and a Bedrock, and
    the rest of the hardware is identical.
(b) The OS is Linux, not IRIX, and IRIX scaling tends to precede Linux
    scaling to larger configurations.

5. FURTHER MIPS & IA-64 NUMAFLEX FAMILIES (2001-2006)

Again, not released, but a few hints are OK. We've got scenarios, going
out years, with improvements every year.

(a) New CPU bricks will work with existing I/O bricks & Routers, as long
    as that makes sense.

(b) New I/O bricks will appear, which will generally work with existing
    CPU bricks, until that doesn't make sense any more, i.e., at some
    point there will likely be I/O bricks that work with newer CPU
    bricks, but not the oldest ones.

(c) Some bricks will work with existing interconnects, but allow for
    faster ones, and when they (and faster Routers) appear, it will
    become possible to upgrade the interconnects at least once, possibly
    several times. That's very exciting to me, because running out of
    interconnect has usually been the final straw in ending a system
    architecture's life.

(d) There is room for some uncertainty, as with I/O busses. We tried to
    make design decisions that allowed for changing our minds about
    bricks, and sometimes topologies, without bothering other elements.

Again, I cannot emphasize enough that NUMAflex is a design approach for
multiple generations of strongly-related system families. Origin 3000
and Onyx 3000 are the first products from a new, and hopefully improved,
model of development.

6. NUMAFLEX EARLY ASSESSMENT

6.1 NUMAflex separates CPU and I/O more than is usually the case with
mid-range systems, and the relatively small size of the bricks is
unusual for scalable systems. One does find echoes of mainframe channel
I/O, of cluster-of-PCs appearance, and of KSR's modules.

This approach utterly depends on:
(a) Being able to do high-speed, low-latency cabled interconnects.
(b) Being able to do cache-coherency well.

At first, it seemed weird to do this, and there was quite a bit of
rational skepticism, because people were much more used to producing
integrated boxes, including I/O and power. After the months of work that
led to 1/98's Flintstones, it took 6 months (January-June 1998) of
discussion and analysis to make sure this could work. At one point, we
went through iterations where CPU bricks were 5U high, 5U wide, 2-across
in a rack, which might have allowed a reasonable deskside "stack", but
they just didn't work out. An amazing variety of cabling designs was
examined and rejected; it was non-trivial to get cabling systems that
work for the big machines without sacrificing a lot of cost in the
smaller ones. C-brick designs went through many iterations, as they were
heavily over-constrained by combinations of trace lengths, cooling
issues, and physical packaging, which differ among the various CPUs.
Having separate bricks at least isolated the problems, which certainly
reduced the inter-design constraints.

6.2 But once this approach is adopted, it seems to work well. It is
well-matched to the "asynchronous design style" where different teams
can be working at different rates on different bricks. We think it
improves resilience to "surprises" arising from events outside our
control, like changes in oncoming I/O standards.

6.2.1 The I/O ASICs are *NOT* in the C-bricks. If a new I/O bus appears:

(a) Create an XBridge or other variant to support it.
(b) Make up some new kind of I/O brick.
(c) Start shipping new machines that can include the new bricks.
(d) If it makes sense, the new I/O bricks can also be shipped as
    additions to installed-base systems, thus upgrading their I/O
    systems.
(e) If the new brick clearly subsumes some older brick, one can stop
    making the older brick whenever that is convenient, but the old I/O
    interface doesn't just disappear, or require hard choices between
    old and new (recall the Sun SBus versus PCI issue).

This eliminates the agonizing decisions needed all too often across I/O
bus switchovers, or trying to mix busses in the same machine, or
scrambling to avoid obsoleting customer investments. Of course, taste is
required to avoid Quality Assurance explosions. The hardest issue has
been understanding, of the myriad possibilities, which ones actually
make sense to offer at first, and which others might be considered upon
demand.

Anyway, having now done this, we find we *really* like using one
standard I/O connection available at the CPU brick, with minimal cost
burden there, and then adding bricks that use that interface, letting
them bear the cost and conversion burden.

We are happy that we can use the bricks to build fairly modest-sized
machines as well as big ones. We wish we could build the very smallest
machines, but the minimum machine uses 10U of rack space (CI = 7U, plus
the power bay's 3U). Getting smaller would require a separate dedicated
design (akin to the Origin 200), but even the short-rack design appears
to offer a size and price-point lower than usual for a high-end
technology, as we believe that many competitive systems in the Origin
3000 class will likely have one-rack minima. We of course have more
ideas on how to do all this even better.

6.2.2 CPU bricks allow headroom. The various CPU chips have rather
different packaging, power, and heat characteristics. The NUMAflex
design allows the issues to be attacked separately: if one needs a
C-brick with faster fans, so be it. Each C-brick worries about its own
voltage and power needs. If the board layout for one CPU is radically
different from that of another, so be it - there is no backplane and
airflow they must share.

6.2.3 Fans & blowers. Every brick has its own fans, rather than a giant
blower that might be awkward to replace, and the cost of all this is
incremental as bricks are added, rather than paid upfront.

6.2.4 Engineers love to start from scratch, but we've learned that it
may be better not to have to do that all of the time. Nevertheless, it
is a wrenching change for many engineers and marketers, whose natural
instinct is to make *their* product great, to then take a broader view
of making a whole series of products great, even if it means compromises
in their own pieces. I salute the many people who were able to make that
change, because it is never easy to think this far ahead, especially
when the immediate problems are difficult in their own right.

6.2.5 There are numerous subtle issues in the manufacturing,
configuring, and marketing of brick-style systems, since the very word
"system" doesn't necessarily mean the usual thing.

6.3 PERFORMANCE

As of this writing, there are few public benchmark numbers, and various
submitted results are working their way through approvals. But a few
notes are possible.

6.3.1 Comparison of 400MHz R12000A in Origin 2000 and Origin 3000

As one would expect, the two systems, both with 8MB caches, perform
about the same on cache-resident codes, but the Origin 3000 performs
noticeably better on codes with higher cache-miss rates, given roughly
2X the bandwidth and half the latency. Of the SPEC CPU benchmarks
(SPECint2000, SPECfp2000, SPECint_rate2000, SPECfp_rate2000), we usually
consider SPECfp_rate2000 the most useful.
SPECint2000 and SPECint_rate2000 get good hit rates in 4-8MB caches, so
they reveal little about the performance of the memory system. The
uniprocessor benchmarks (SPECint2000, SPECfp2000) are not very useful
for multiprocessor comparison, as they completely ignore contention
among CPUs. That leaves SPECfp_rate2000, which uses multiple CPUs,
stresses the memory system, and whose CPU-scaling curves are useful in
understanding performance dropoffs with increasing CPU counts.

To avoid misleading interpretations of the results, it is a good idea to
compare similar sorts of systems when possible: smaller systems, or
clusters thereof, should almost always have better price/performance on
workloads for which they are suitable (including SPECfp_rate), but
people continue to buy large scalable systems because their workloads
include jobs with additional requirements.

The Origin 2x00 has always had relatively flat SPEC*rate curves, with no
drastic dropoffs as the number of CPUs is raised, and the Origin 3x00 is
quite similar. Following are the public SPEC*rate numbers, followed by
the normalized SPEC*rate/CPU numbers, which allow easier comparisons
across machines with differing numbers of CPUs.

Peaks: SPECint_rate2000, SGI3x00 vs SGI2x00, unofficial estimates marked "E"

            1P     2P     4P     8P    16P     32P     64P    128P
SGI3x00      -      -      -      -  65.3E  130.15  259.04       -
SGI2x00      -   7.79  15.38  30.51      -  124.51       -  476.71

Peaks: SPECint_rate2000, normalized per CPU, unofficial estimates marked "E"

            1P     2P     4P     8P    16P     32P     64P    128P
SGI3x00      -      -      -      -   4.1E     4.1     4.1       -
SGI2x00      -    3.9    3.8    3.8      -     3.9       -     3.7

The SGI3x00 is only 6-7% faster here, and there is little significance
to the difference between 3.9 and 3.8.

Peaks: SPECfp_rate2000, unofficial estimates marked "E"

            1P     2P     4P     8P    16P     32P     64P    128P
SGI3x00      -      -      -      -  66.9E   133.8     265       -
SGI2x00      -    6.7   13.2   26.2      -   105.5       -  406.63

Peaks: SPECfp_rate2000 per CPU, unofficial estimates marked "E"

            1P     2P     4P     8P    16P     32P     64P    128P
SGI3x00      -      -      -      -   4.2E     4.2     4.1       -
SGI2x00      -    3.4    3.3    3.3      -     3.3       -     3.2

Here the difference is 25-30% in favor of the SGI3x00, and we have seen
memory-intensive real codes that were substantially better. SPECfp_rate
is much more influenced by the actual memory system than is
SPECint_rate.
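The per-CPU normalization in the tables above is just division by the
CPU count, but a small sketch makes the bookkeeping explicit. The data
are the public SPECfp_rate2000 peaks quoted above, with the unofficial
estimates still marked "E"; this is only an illustration, not a SPEC
reporting tool:

    # Sketch of the normalization used above: divide each published (or
    # estimated, "E") SPEC*_rate2000 peak by its CPU count, so machines
    # with different CPU counts can be compared.
    SPECFP_RATE2000 = {
        "SGI3x00": {16: "66.9E", 32: "133.8", 64: "265"},
        "SGI2x00": {2: "6.7", 4: "13.2", 8: "26.2", 32: "105.5", 128: "406.63"},
    }

    def per_cpu(results):
        """{cpus: 'rate'} -> {cpus: (rate_per_cpu, is_estimate)}"""
        out = {}
        for cpus, rate in results.items():
            estimate = rate.endswith("E")
            out[cpus] = (float(rate.rstrip("E")) / cpus, estimate)
        return out

    for system, results in SPECFP_RATE2000.items():
        row = "  ".join(f"{cpus}P: {value:.1f}{'E' if est else ''}"
                        for cpus, (value, est) in sorted(per_cpu(results).items()))
        print(f"{system}: {row}")

This reproduces the per-CPU rows above (4.2E/4.2/4.1 for the SGI3x00,
3.4/3.3/3.3/3.3/3.2 for the SGI2x00).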
6.3.2 SGI3x00 versus other comparable machines on SPECfp_rate2000

As of 8/30/00, it is difficult to get good sets of recent and consistent
numbers, especially for comparable larger server systems; the SPEC2000
benchmarks are relatively new. The only Sun number is for the 480MHz
450, the HP numbers are for the N-series, and the HP "SuperDome" is not
announced yet. The best immediate comparison is with the Compaq GS
systems, which are large ccNUMAs that overlap with the middle of the
Origin 3000 range. The following are taken from [SPE00a], augmented by a
few unofficial estimated results for smaller Origin 3000 CPU counts. The
IBM SP numbers are for the 375MHz High Node.

Peaks: SPECfp_rate2000, unofficial estimates marked "E"

            1P     2P     4P     8P    16P     32P     64P    128P
CPQ GS     5.2      -      -      -   73.3   147.8       -     N/A
SGI3x00      -      -      -      -  66.9E   133.8     265       -
HPN4000      -   7.84   14.4  23.04    N/A     N/A     N/A     N/A
SGI2x00      -    6.7   13.2   26.2      -   105.5       -  406.63
IBM SP       -      -   14.5     28   51.7       -       -       -
Sun 450      -      -  11.13    N/A    N/A     N/A     N/A     N/A

Peaks: SPECfp_rate2000/CPU, unofficial estimates marked "E"

            1P     2P     4P     8P    16P     32P     64P    128P
CPQ GS     5.2      -      -      -    4.6     4.6       -     N/A
SGI3x00      -      -      -      -   4.2E     4.2     4.1       -
HPN4000      -    3.9    3.6    2.9    N/A     N/A     N/A     N/A
SGI2x00      -    3.4    3.3    3.3      -     3.3       -     3.2
IBM SP       -      -    3.6    3.5    3.2       -       -       -
Sun 450      -      -    2.8    N/A    N/A     N/A     N/A     N/A

So, doing the best that I can to compare similar systems, 400MHz SGI
3x00 systems deliver about 90% of the SPECfp_rate2000 performance of
731MHz Compaq GS systems with identical CPU counts, at least in the
16-32P range. Presumably, the charts will get filled in over time.

Detailed price comparisons are beyond the scope of this writing,
especially for systems as configurable as the Origin 3000 and Compaq GS.
In CPU-rich configurations, I think Origin 3000s use about 50% of the
floor space of same-CPU-count Compaq GS systems, and I think they are
priced at roughly 50% of the GS prices. If that is *actually* true, or
even close, I'm quite happy!

7. SUMMARY

We think that the NUMAflex design approach blends some good attributes
of small-system design (iteration speed, lower cost) into the design
model for larger systems. It will take a while to know if we're right,
particularly because some of the hoped-for improvements show up in cost
savings and time-to-market later in the life-cycle. Of course, this
approach changes the very nature of the life-cycle, since the system
life-cycle is converted to a series of overlapping life-cycles of
bricks, cables, and racks.

From the numerous possible styles of modularity, the NUMAflex design
approach chooses a specific kind:

- Small bricks, connected primarily by high-speed, full-duplex,
  source-synchronous, cache-coherent interconnect cables, that can be
  used to create shared-memory nodes of various sizes, with good enough
  bandwidth and latency to be usually treated like SMP UMAs

- I/O busses split into separate bricks, with no I/O-specific
  manifestations in other bricks

- Practical systems from small to very large, using the same elements
  across the entire range of sizes.

This approach supports independent resource scalability (at any one
time) and independent resource evolvability over time. We think it will
allow much faster evolution of systems, and we think it will help RAS.
We know it helps amortize effort across MIPS and IA-64 systems, given
the commonality of components. Although it is too early to be sure, we
think this will pay off in seriously-improved customer investment
protection.

In 1996, when we announced the Origin 2000, we claimed there were strong
rationales for building scalable systems as switch-based ccNUMAs, and
that others would go this way, and in fact, more (albeit not yet all)
vendors are doing so. In 2000, I conjecture that the pressure from
quickly-evolving smaller systems, and customer desires for economic
scalability and evolvability, will tend to drive scalable systems over
time toward ccNUMAs that look more like NUMAflex-style designs. We
probably won't really know until around 2003/2004, given the usual
life-cycle.
8. ACKNOWLEDGEMENTS

Shifting the design approach to NUMAflex, and getting the first products
out the door, have taken huge efforts by a large cast of engineering,
manufacturing, and marketing people spread across Mountain View,
Chippewa Falls, and Eagan. Amazingly, in the midst of some extremely
difficult years at SGI, people managed not only to ship an innovative
product, but to make a major positive change in the entire product
development approach, converting big bangs to more continuous evolution.

NUMAflex, SGI, and SGI Origin are Trademarks of SGI. Linux is a
Trademark of Linus Torvalds. Others are Trademarks of their respective
organizations.

9. REFERENCES

[COM00a] Compaq "Wildfire" (AlphaServer GS website).
         http://www.compaq.com/AlphaServer/gs320/index.html

[GAL96a] Mike Galles, "The SGI SPIDER Chip", Proc. Hot Interconnects IV,
         Stanford, August 15-17, 1996, 141-146.
         This describes the Router used in the Origin 2000.

[HRI97a] Cristina Hristea, Daniel Lenoski, John Keen, "Measuring
         Performance of Cache-Coherent Multiprocessors Using Micro
         Benchmarks", Proc. SC'97.
         http://www.supercomp.org/sc97/program/TECH/HRISTEA/INDEX.HTM

[LEN95a] Daniel E. Lenoski & Wolf-Dietrich Weber, Scalable Shared-Memory
         Multiprocessing, Morgan Kaufmann, San Francisco, 1995.
         This is a good all-around reference.

[MAS97a] John R. Mashey, "Big Data and the Next Wave of Infrastress",
         Proc. 1999 USENIX, Monterey, CA.
         http://www.usenix.org/events/usenix99/invited_talks/mashey.pdf
         In particular, page 12 shows intervals for large servers.

[SCI00a] SCIzzL main Web page: http://www.SCIzzL.com

[SPE00a] SPEC CFP2000 Rates.
         http://www.specbench.org/osg/cpu2000/results/rfp2000.html

[SUN00a] Sun Website on mid-range bus-based SMPs.
         http://www.sun.com/servers/midrange/

SGI Origin and Onyx 3000 Websites:
         http://www.sgi.com/features/2000/july/3000/
         http://www.sgi.com/origin/3000/
         http://www.sgi.com/onyx3000/

-- 
-John Mashey    EMAIL: mash@sgi.com    DDD: 650-933-3090    FAX: 650-933-2663
USPS: SGI 1600 Amphitheatre Pkwy., ms. 562, Mountain View, CA 94043-1351
SGI employee 25% time, non-conflicting, local, consulting elsewise.