The QCDOC architecture has been designed to provide a highly cost-effective, massively parallel computer capable of focusing significant computing resources on relatively small but extremely demanding problems. This new design is a natural evolution of that used in our earlier QCDSP machines. The individual processing nodes are PowerPC-based and interconnected in a six-dimensional mesh with the topology of a torus. A second, Ethernet-based network provides booting and diagnostic capability as well as more general I/O. The entire computer is packaged in a style that provides good temperature control and a small footprint. Central to this design is the IBM Blue Logic technology, which makes possible the high-density, low-power combination of an industry-standard RISC processor with 64-bit floating point, embedded DRAM, six-dimensional interprocessor communications and the wide array of predesigned functions needed to assemble a complete, functional unit.
Node architecture. Each node is made of a single application-specific integrated circuit (ASIC) chip and industry-standard DIMM memory. The ASIC contains a 500 MHz 440 PowerPC processor core with a 1 Gflops, 64-bit floating point unit. (The actual clock speed of the final computer will be determined in the next few months as more hardware is tested.) In addition, there are 4 Mbytes of on-chip memory, referred to as embedded DRAM or EDRAM. This is sufficient to hold the code and data for a standard lattice QCD calculation and provides a peak bandwidth of 8 GBytes/sec to the processor. This ASIC also contains DMA capability for automatically moving data between EDRAM and external memory, the circuitry to support inter-node communication and an Ethernet controller for the boot-diagnostic-I/O network described below. The single DIMM memory card, which is part of each node, is 64 bits wide, 333 MHz, DDR SDRAM with 7 additional bits of ECC. The memory size is determined by the particular memory modules acquired, ranging from 128 to 2048 Mbytes per node.
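As a quick sanity check, the headline per-node numbers above are mutually consistent. The following back-of-the-envelope sketch in Python assumes (as is standard for this class of FPU, but not stated explicitly above) one fused multiply-add, i.e. two flops, per cycle, and treats the quoted 333 MHz as the effective DDR data rate:

```python
# Back-of-the-envelope check of the per-node figures quoted in the text.
# Assumptions (not stated in the text): the 64-bit FPU retires one fused
# multiply-add (2 flops) per cycle; 333 MHz is the effective DDR data rate.

clock_hz = 500e6                 # 440 PowerPC core clock
flops_per_cycle = 2              # assumed: one fused multiply-add per cycle
peak_flops = clock_hz * flops_per_cycle   # 1e9 = 1 Gflops peak per node

edram_bytes = 4 * 2**20          # 4 Mbytes of on-chip EDRAM
edram_bw = 8e9                   # 8 GBytes/sec peak EDRAM-to-processor bandwidth

# External DIMM: 64 bits (8 bytes) wide at an assumed 333 MHz effective rate.
dimm_bw = 333e6 * 8              # ~2.66 GBytes/sec to external memory
```

The comparison of `edram_bw` with `dimm_bw` makes the design point concrete: on-chip EDRAM offers roughly three times the bandwidth of the external DIMM, which is why a standard lattice QCD working set is kept on-chip.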
Inter-node Communications. Each processor has the capability to send and receive data from each of its twelve nearest neighbors in six dimensions at a rate of 500 Mbit/sec. This provides a total off-node bandwidth of 12 Gbit/sec. Each communication link has a phase-locked receiver and single-bit error detection with automatic resend. Each of these twenty-four communication channels has its own direct memory access capability, allowing autonomous reads/writes from either EDRAM or external SDRAM. Instructions controlling each of these DMA transfers will be stored as 16 sequences of block-strided moves located in 24 separate, on-chip register arrays. A communication process can be initiated with a single PowerPC write operation, providing very low processor overhead and start-up latencies of ~50 nsec. As in the QCDSP machines, special hardware is provided to increase the efficiency of global operations for this mesh-based machine. In the QCDOC case there is a ``store-and-forward'' operation that permits low-latency broadcasts and global sums. We plan to use only four of the six dimensions for actual inter-node communication during a physics calculation. The additional two dimensions will be used to partition the machine. Thus, in a typical configuration we expect the overall six-dimensional torus to be subdivided into a few 4-dimensional sub-tori which will run completely independently. The inter-node communications hardware supports three independent message modes, which allow application communication, operating system messaging and global interrupts to be transmitted without interference and in a fashion consistent with the partitioning described above.
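The neighbor structure and partitioning scheme described above can be illustrated with a short Python sketch. The machine shape (4,4,4,4,2,2) used here is purely illustrative and not a quoted QCDOC configuration; it simply shows how the last two torus coordinates label independent 4-dimensional sub-machines:

```python
# Sketch of node addressing on a 6-dimensional torus, and of using the
# last two dimensions to partition the machine into 4-d sub-tori.
# The shape (4,4,4,4,2,2) is an illustrative assumption, not a quoted config.
from itertools import product

shape = (4, 4, 4, 4, 2, 2)

def neighbors(coord, shape):
    """Return the twelve nearest neighbors of a node on the 6-d torus."""
    result = []
    for d in range(len(shape)):
        for step in (-1, +1):
            n = list(coord)
            n[d] = (n[d] + step) % shape[d]   # periodic (torus) wrap-around
            result.append(tuple(n))
    return result

# Physics communication uses only the first four dimensions; the last two
# coordinates identify which independent 4-d sub-torus a node belongs to.
def partition_id(coord):
    return coord[4:]

all_nodes = list(product(*[range(s) for s in shape]))
partitions = {partition_id(c) for c in all_nodes}   # 2 x 2 = 4 sub-machines
```

In this sketch each of the 4 partitions is a 4^4 sub-torus of 256 nodes that can run a physics calculation completely independently, matching the partitioning strategy described in the text.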
Booting, diagnostics and I/O. The SCSI-based booting, diagnostic and I/O network of the QCDSP machines is replaced by 100 Mbit/sec Ethernet. The Ethernet connections of four processors are joined together with a Fast Ethernet hub whose output is fed to a switch which includes a Gbit/sec Ethernet port. The resulting Ethernet tree will be connected to an SMP host with a number of 1 Gbit Ethernet cards running a threaded OS. The Ethernet tree can be used in broadcast mode to provide boot code to the processors; it also allows individual processors to be interrogated for diagnostic purposes and permits easy connection to industry-standard RAID disks, providing a large aggregate I/O bandwidth. In addition, we support IBM's PowerPC RISCWatch debugger, configured in a fashion which allows multiple windows to be opened, one for each processing node being ``watched''.
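The worst-case fan-in of this tree follows directly from the link speeds quoted above; the hub-to-switch ratio below is an illustrative assumption, since the exact switch topology is not specified:

```python
# Worst-case fan-in of the Ethernet boot/diagnostic tree (illustrative;
# only the link speeds are taken from the text, the topology ratio is assumed).
nodes_per_hub = 4          # four processors share one Fast Ethernet hub
node_link_mbit = 100       # 100 Mbit/sec per node
uplink_mbit = 1000         # Gbit/sec Ethernet port on the switch

hub_demand_mbit = nodes_per_hub * node_link_mbit   # 400 Mbit/sec if all four nodes talk at once
hubs_per_uplink = uplink_mbit // hub_demand_mbit   # 2 hubs fit per Gbit uplink without oversubscription
```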
Mechanical design. As in the QCDSP machines, we plan to exploit the homogeneity of this style of massively parallel machine to achieve a high degree of mechanical modularity. The individual processors will be mounted two per daughter card, since one per card is impractical given the 5-inch size of the DDR SDRAM cards. We mount 32 such daughter cards on a mother board and then 8 mother boards in a rather large crate with two 4-mother-board backplanes. These crates will be cooled by vertical air flow passing through a water-cooled radiator below each crate. Cable connections are provided on the backplane for the off-node communications of each mother board.
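The node counts implied by this packaging are a simple product of the numbers quoted above:

```python
# Node and peak-performance counts implied by the packaging described above.
nodes_per_daughter = 2
daughters_per_mother = 32
mothers_per_crate = 8

nodes_per_mother = nodes_per_daughter * daughters_per_mother  # 64 nodes per mother board
nodes_per_crate = nodes_per_mother * mothers_per_crate        # 512 nodes per crate
peak_gflops_per_crate = nodes_per_crate * 1                   # at 1 Gflops peak per node
```

The 64 nodes per mother board agree with the mother-board figure caption later in the text, and at 1 Gflops peak per node a single crate provides roughly half a Teraflops of peak performance.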
An overview of the QCDOC ASIC is provided by the figure above. To a large extent, this ASIC was created from existing IBM macros that are simply interconnected in the specified way to create the larger unit. Specific to our design, in addition to our particular configuration of components, are the serial communications unit (SCU), the EDRAM controller and the DMA controller permitting direct transfers between external and embedded DRAM.
We had our first daughter boards working in July 2003 and the first populated mother board working in December 2003; we will build 2-3 2-Teraflops (peak speed) machines in February 2004 and at least two 10-Tflops-scale machines (peak speed) in the summer of 2004. A cost performance of less than \$1/Mflops is anticipated.
Picture of a 2-node daughter card. The two silver chips are the QCDOC ASICs (later daughter board examples have these covered by heat sinks) and the two vertically mounted cards are 256-Mbyte DDR SDRAMs. The central connector carries 40 differential pairs making up the off-daughter-board 6-dimensional serial communications network. The left-most connector carries power, clock, Ethernet and various housekeeping signals. The large quad flatpack chip on the left is a 5-port Broadcom (Altima) Ethernet hub.
A picture of a single mother board. Two rows of 16 daughter cards with two nodes each provide a total of 64 nodes. The connectors on the right-hand edge provide 48-volt DC power and the 384 differential pairs needed to transmit the serial communication signals from the six faces of this 2^6 hypercube of nodes which are connected off-board.
Boyle, D. Chen,
N.H. Christ, M. Clark, S.D. Cohen, C. Cristian, Z. Dong, A. Gara, B. Joo, C. Jung, C. Kim, L. Levkova, X. Liao, G. Liu, R.D. Mawhinney,
S. Ohta, K. Petrov, T. Wettig, A. Yamaguchi
Comments: Lattice2003(machine), 6 pages, 5 figures
QCDOC is a massively parallel supercomputer whose processing nodes are based on an application-specific integrated circuit (ASIC). This ASIC was custom-designed so that crucial lattice QCD kernels achieve an overall sustained performance of 50% on machines with tens of thousands of nodes. This strong scalability, together with low power consumption and a price/performance ratio of $1 per sustained MFlops, enables QCDOC to attack the most demanding lattice QCD problems. The first ASICs became available in June of 2003, and the testing performed so far has shown all systems functioning according to specification. We review the hardware and software status of QCDOC and present performance figures obtained on real hardware as well as in simulation.
An overview is given of the QCDOC architecture, a massively parallel and highly scalable computer optimized for lattice QCD using system-on-a-chip technology. The heart of a single node is the PowerPC-based QCDOC ASIC, developed in collaboration with IBM Research, with a peak speed of 1 GFlop/s. The nodes communicate via high-speed serial links in a 6-dimensional mesh with nearest-neighbor connections. We find that highly optimized four-dimensional QCD code obtains over 50% efficiency in cycle-accurate simulations of QCDOC, even for problems of fixed computational difficulty run on tens of thousands of nodes. We also provide an overview of the QCDOC operating system, which manages and runs QCDOC applications on partitions of variable dimensionality. Finally, we discuss the SciDAC activity for QCDOC and the QMP message-passing interface specified as part of the SciDAC effort. We explain how to make optimal use of QMP routines on QCDOC in conjunction with existing C and C++ lattice QCD codes, including the publicly available MILC codes.
Boyle, D. Chen,
N.H. Christ, C. Cristian, Z. Dong, A. Gara, B. Joo, C. Jung, C. Kim, L. Levkova, X. Liao, G. Liu, R. D. Mawhinney,
S. Ohta, K. Petrov, T. Wettig, A. Yamaguchi
Comments: 3 pages, 1 figure. Lattice2002(machines)
QCDOC is a supercomputer designed for high scalability at a low cost per node. We discuss the status of the project and provide performance estimates for large machines obtained from cycle-accurate simulation of the QCDOC ASIC.