

QCDSP Submission to Gordon Bell Prize Committee
updated version

Norman H. Christ, Robert Mawhinney, Pavlos Vranas

Department of Physics
Columbia University, NY, NY, 10027

Abstract: The Quantum Chromodynamics on Digital Signal Processors (QCDSP) machines in operation at Columbia University and partially complete at the RIKEN Brookhaven Research Center are MIMD machines with processing nodes based on the Texas Instruments TMS320C31-50 Digital Signal Processor (DSP), interconnected as a four-dimensional torus. The Columbia machine contains 8,192 nodes and has a peak speed of 0.4Tflops. The RIKEN/BNL machine has 12,288 nodes, a peak speed of 0.6Tflops and a total cost of $1.85M. For this submission we have run two standard lattice Quantum Chromodynamics (QCD) programs on portions of this hardware and obtained three benchmark numbers. The first program stochastically estimates the trace of the inverse of the Wilson Dirac operator on a series of thermalized lattices. For the first benchmark, this calculation is performed on a lattice of 524,288 sites. Running for 492 minutes on 1/6 of the Brookhaven machine, this code performs approximately 8.7 × 10^14 floating point operations for a cost/performance of $10.50/Mflops. The second benchmark runs the same code on the same machine but with a lattice volume 1/16 as large, 32,768 sites. By using the smallest practical lattice volume per node, this benchmark explores the strength of the QCDSP architecture for problems with small volume per node, a large surface to volume ratio and correspondingly large demands on communication bandwidth and latency. This second benchmark ran for 65 minutes and performed 8.085 × 10^13 floating point operations for a cost/performance of $14.8/Mflops. The third benchmark uses a second sample program which generates a Markov chain of lattice configurations distributed according to the statistical weight describing two species of light Wilson quarks interacting with the gauge degrees of freedom of QCD. Running for 1334 minutes on a single cabinet at Columbia (equivalent to 1/12th of the Brookhaven machine), this program executes approximately 9.1 × 10^14 floating point operations for a cost/performance of $13.6/Mflops.
Pictures of the machine and further description of the architecture can be found at http://www.phys.columbia.edu/~cqft.

1. Introduction


Over the past five years a group in the Physics Department at Columbia University, together with collaborators from four other institutions, has designed and built a number of QCDSP machines. These are generally programmable, highly parallel computers targeted at large-scale numerical studies of the fundamental theory of the strong interactions, QCD. Each machine represents a different configuration of common, scalable computer hardware. We presently have a 64-node machine installed at the University of Wuppertal in Germany, a 128-node machine at Ohio State University, and a 1024-node machine at Florida State University, as well as the 8,192- and 12,288-node machines at Columbia and RIKEN/Brookhaven respectively.

The 0.4Tflops (peak) Columbia machine was completed in April, 1998, but parts of this machine have been in intensive use since September, 1997. Since our initial physics program has been to reproduce earlier results obtained on smaller machines and to explore a new physics algorithm (domain wall fermions), we have cabled the machine at Columbia as eight separate 1024-node machines, each running an independent calculation. Therefore, the benchmark presented from the Columbia machine was run on a single 1024-node cabinet. As our studies proceed over the next one to two months, we expect to join the Columbia machine into a smaller number of larger sections.

The 0.6Tflops (peak) machine at Brookhaven is nearly complete but has had major components finished and operating since the middle of April, 1998. We therefore have run the RIKEN/Brookhaven machine benchmark on a 2048-node portion of the machine that was complete.

The two benchmark programs used to establish the sustained speed of our machines were taken from our present suite of physics production code. They have been in use since March, 1998 and, for example, contributed the results presented by our group at the recent Workshop on Fermion Frontiers in Vector Lattice Gauge Theories, held at Brookhaven, May 6-9, 1998.

2. Hardware


Because of its regular, homogeneous structure, lattice QCD lends itself easily to parallel computation. In the typical lattice calculation one divides space-time into an array of smaller identical sub-volumes and assigns the field variables associated with each sub-volume to a separate processor. Since the fundamental interactions that must be modeled are local or involve neighboring sites, for most parts of the calculation only processors that correspond to contiguous sub-volumes need to communicate directly. However, the most interesting lattice QCD calculations are very computationally demanding, so one is often limited to small lattice sizes with increased communication demands. As a rule of thumb, one needs one Mword/sec of off-node bandwidth per 10Mflops of processor speed. There are similar demands for small communication latency, given the frequent, short communications required by the relatively small problem size.
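To make the surface-to-volume argument concrete, the following minimal sketch counts the fraction of sites in a hypercubic sub-volume that have at least one neighbor on another node. The function name and the restriction to hypercubic sub-volumes are illustrative assumptions, not part of the QCDSP software.

```python
def boundary_fraction(side, dims=4):
    """Fraction of sites in a side**dims sub-volume that touch a boundary.

    A site is interior only if it is at least one step away from every
    face, so the interior is a (side - 2)**dims block (empty if side <= 2).
    """
    total = side ** dims
    interior = max(side - 2, 0) ** dims
    return (total - interior) / total


if __name__ == "__main__":
    for side in (2, 4, 8):
        print(f"side {side}: {boundary_fraction(side):.4f} of sites on the surface")
```

For a sub-volume of side 2 every site lies on the surface, which is why shrinking the problem per node pushes the demand onto the network rather than the processor.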

Our architecture is chosen to provide these characteristics at a reasonably low cost. The fundamental node of our machine is constructed from a commodity processor, a Texas Instruments DSP. This DSP executes 32-bit floating point arithmetic at a peak speed of 50Mflops. The memory is standard 4Mbit, 60ns DRAM (now a little dated). The only non-commodity component in the machine is the custom gate array which provides ECC and a 32-word programmable cache needed for the DSP to use DRAM effectively. This device also controls the 16 serial wires needed to provide bi-directional communications with the eight nearest neighbors in a 4-dimensional mesh. We designed this ASIC using Viewlogic tools, relying heavily on VHDL. This 250K-transistor chip worked on our first try and is manufactured for us by the ATMEL corporation for under $20/unit. The entire processor node fits on a PC board whose complete manufacture and test costs less than $80.

Sixty-three of these small cards insert in SIMM sockets on a motherboard which has a 64th node directly attached, with extra memory, PROM and SCSI access. The 64 nodes are interconnected on the motherboard as a four-dimensional array. Two of the eight faces of this hypercube are joined together on the motherboard. The off-node serial wires corresponding to the remaining 6 faces are taken out to 6 separate cable connectors on the rear of the backplane, into which this board is inserted. This backplane holds eight motherboards, and a large number of backplanes (at least 256) can be cabled together to form a large machine of various geometries. Each backplane has complete clock and reset circuitry so that no additional controller is required for the machine. Each motherboard has two independent SCSI ports. These are connected into a large tree with the UNIX host as its root. This SCSI tree is used to boot the machine, load code and extract results. Additional disks for checkpointing the calculation and data archiving can be joined to these SCSI connections as well.
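The nearest-neighbor wiring of a four-dimensional torus can be sketched as follows. This is only an illustration of periodic mesh addressing: the function name and the mesh shape used in the example are hypothetical and do not describe how any particular QCDSP machine was cabled.

```python
def torus_neighbors(coord, shape):
    """Return the 2 * len(shape) nearest neighbors of coord on a periodic mesh.

    Neighbors are listed axis by axis, forward (+1) first and then
    backward (-1), with wrap-around at the mesh edges.
    """
    neighbors = []
    for axis, extent in enumerate(shape):
        for step in (+1, -1):
            n = list(coord)
            n[axis] = (n[axis] + step) % extent
            neighbors.append(tuple(n))
    return neighbors


if __name__ == "__main__":
    shape = (4, 4, 4, 8)  # hypothetical 512-node mesh, chosen only for illustration
    for neighbor in torus_neighbors((0, 0, 0, 0), shape):
        print(neighbor)
```

Each node has eight neighbors in four dimensions, matching the eight off-node directions served by the node's serial links. When a mesh dimension has extent 2, the forward and backward neighbors along that axis are the same node.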

Let us complete this overview of the computer hardware by summarizing the characteristics of the inter-processor communication. Communication in each of the eight off-node directions is controlled by a separate DMA engine within the ASIC. Each of these eight links can be programmed to send or receive a specified number of data blocks, of specified length, separated in memory by a fixed stride. The DSP need only initiate such a transfer and later poll to determine that it has finished. The corresponding receive or send must also be programmed by the processor on the other end of the link. These two actions need not be synchronized: the hard-wired protocol does not permit data to be lost but rather stalls the DMA process until the transfer has been set up at both ends of the link. The data transfer rate on a single link is 5Mbytes/sec, giving a net off-node bandwidth of 40Mbytes/sec across the eight links.
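The block/length/stride description of a transfer can be modeled in a few lines. This is a software sketch of the addressing pattern only, not of the ASIC's DMA engine; the function names are hypothetical, and the stride is assumed to be measured from the start of one block to the start of the next, which the text does not specify.

```python
def gather_blocks(memory, start, num_blocks, block_len, stride):
    """Collect num_blocks blocks of block_len words from memory.

    Assumption: consecutive block starts are separated by stride words.
    """
    words = []
    for b in range(num_blocks):
        base = start + b * stride
        words.extend(memory[base:base + block_len])
    return words


def scatter_blocks(memory, words, start, num_blocks, block_len, stride):
    """Write a flat list of received words back into strided blocks in place."""
    for b in range(num_blocks):
        base = start + b * stride
        memory[base:base + block_len] = words[b * block_len:(b + 1) * block_len]


if __name__ == "__main__":
    sender = list(range(20))
    payload = gather_blocks(sender, start=1, num_blocks=3, block_len=2, stride=5)
    receiver = [0] * 20
    scatter_blocks(receiver, payload, start=0, num_blocks=3, block_len=2, stride=4)
    print(payload)
    print(receiver)
```

The sending and receiving sides may use different strides, as in the example, so a face of the local sub-volume can be read from one memory layout and deposited into another without an intermediate copy by the processor.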

3. Software


Code for the machine is written using commercial development tools for the DSP provided by Texas Instruments. These include C and C++ cross compilers as well as an assembler, all running on the host machine. The QCDSP machine is controlled from within a UNIX shell running on the host which, in addition to the normal c-shell commands, is augmented with further machine-specific instructions allowing the loading of code or data, the running of diagnostics, and the reading of data. Resident on each node is a small kernel which handles communication and provides the user with standard C programming support, such as the functions printf() and fopen(), so data can be written to the screen and files on the host can be directly accessed for reading or writing. In addition, there are specific "system" subroutines that can be called to initiate an inter-node data transfer.

With the present software environment, the actual programming of the machine is done from the perspective of a single node. Normally the lattice size on a single node will be left as a run-time variable so a given piece of compiled code can be run on a variety of machine topologies, yielding results for a number of actual lattice volumes. While the most time-critical inner loops may be written in assembler, the bulk of the application code for the machine is written in C++. At present a large number of important physics programs have been completed, including much code for computing hadron masses using staggered, Wilson and domain wall fermions, as well as code for complete hybrid Monte Carlo sampling using each of these fermion formulations.

Our benchmarks for this submission use two of these production codes. The first computes the chiral condensate on a series of equilibrated, high-temperature lattices. The second carries out a hybrid Monte Carlo evolution including both the quark and gluon dynamics. The performance of both pieces of code relies on an efficient conjugate gradient inverter used to compute the inverse of the Wilson Dirac operator, a sparse complex matrix. This inverter contains the usual algorithmic enhancements, including exploiting the spin projection structure of the r=1 Wilson hopping terms and using a red-black preconditioning scheme.
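For readers unfamiliar with the method, the sketch below shows the generic textbook conjugate gradient iteration for a symmetric positive-definite system. It is not the inverter used on QCDSP: it works on small real vectors, omits spin projection and red-black preconditioning, and the function name and tolerance are illustrative assumptions. For a non-Hermitian operator M such as the Wilson Dirac operator, the method is commonly applied to the normal-equation matrix M†M, which would be supplied here through `apply_a`.

```python
def conjugate_gradient(apply_a, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive-definite A, given x -> A x.

    Returns the solution estimate and the number of iterations used.
    """
    x = [0.0] * len(b)
    r = list(b)                      # residual b - A x for the start x = 0
    p = list(r)                      # initial search direction
    rr = sum(ri * ri for ri in r)
    iterations = 0
    while iterations < max_iter and rr > tol * tol:
        ap = apply_a(p)
        alpha = rr / sum(pi * api for pi, api in zip(p, ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rr_new = sum(ri * ri for ri in r)
        beta = rr_new / rr
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rr = rr_new
        iterations += 1
    return x, iterations


if __name__ == "__main__":
    matrix = [[4.0, 1.0], [1.0, 3.0]]

    def apply_matrix(v):
        return [sum(a * vi for a, vi in zip(row, v)) for row in matrix]

    solution, count = conjugate_gradient(apply_matrix, [1.0, 2.0])
    print(solution, count)
```

Each pass through the loop costs one application of the operator plus a handful of vector operations, which is why the operator application dominates the run time of both benchmark programs.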

It is important to emphasize that the large computational demands of lattice QCD require that powerful machines be used on relatively small problems. As the number of nodes of a parallel machine is increased to increase total computational power, the size of the problem assigned to each node will decrease. Good efficiency for this small problem size then requires a tightly-coupled, parallel machine with a large network bandwidth per processor and low processor latency for communication. For example, the smallest practical problem size on a single processor is a 2^4 (16-site) lattice. A single Wilson Dirac operator iteration (typically the most demanding part of our code) requires about 45Kflop. During these operations the calculation requires that at least 64 separate transfers of 48 words each be made off node. This implies a network bandwidth of about 0.07 words/flop and a processor/network latency consistent with efficiently initiating a network transfer after roughly every 700 flops. Our second benchmark is carried out with this demanding single-node lattice volume.
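The communication requirement follows directly from the three figures quoted in this paragraph (45Kflop per iteration, 64 transfers, 48 words per transfer). A short worked calculation, with variable names chosen only for illustration:

```python
flops_per_iteration = 45_000   # about 45Kflop per Wilson Dirac operator iteration
transfers = 64                 # separate off-node transfers per iteration
words_per_transfer = 48        # words in each transfer

words_off_node = transfers * words_per_transfer           # total words sent off node
words_per_flop = words_off_node / flops_per_iteration     # required bandwidth ratio
flops_per_transfer = flops_per_iteration / transfers      # work between transfer starts

if __name__ == "__main__":
    print(f"{words_off_node} words off node per iteration")
    print(f"{words_per_flop:.3f} words/flop")
    print(f"one transfer initiated every {flops_per_transfer:.0f} flops")
```

The resulting ratio is the same order as the rule of thumb of one Mword/sec per 10Mflops, that is, 0.1 words/flop.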

4. Cost


The cost for such non-commercial hardware can be hard to determine. However, for this submission, we can use the actual cost of the machine being completed at the RIKEN Brookhaven Research Center at the Brookhaven National Laboratory. This machine is being constructed by the group at Columbia for an amount somewhat less than $1.85M. This figure has two components. The first is $1.8M in funding explicitly provided to Columbia for the procurement of the machine. Extensive paperwork is available at Columbia, detailing the cost of each component down to the last wire-lug and clock chip and each manufacturing/assembly contract. This sum has paid for complete, tested subassemblies: 206 fully populated mother boards, 16 spares, 12 fully assembled, water-cooled cabinets and two 8-slot, air-cooled crates.

The final $50K is an estimate of the labor costs required to configure and burn in the entire system. This will take approximately four months and require approximately one full-time Brookhaven technician and the 1/3-time supervision of one of us. We believe the $50K figure used is slightly larger than these actual personnel costs. We then compute the cost of the 2048-node machine on which the benchmark was run as a prorated 1/6 of $1.85M, or $308K.
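The proration is simple node-count arithmetic on the figures given in the text (12,288 nodes, $1.85M). A minimal sketch, with a hypothetical function name:

```python
from fractions import Fraction


def prorated_cost(nodes_used, total_nodes=12_288, total_cost=1_850_000):
    """Cost in dollars attributed to nodes_used nodes of the full machine."""
    return float(Fraction(nodes_used, total_nodes) * total_cost)


if __name__ == "__main__":
    print(f"2048 nodes: ${prorated_cost(2048):,.0f}")   # 1/6 of the machine
    print(f"1024 nodes: ${prorated_cost(1024):,.0f}")   # 1/12 of the machine
```

The 2048-node figure rounds to the $308K used for the Brookhaven benchmarks, and the 1024-node figure is half of that, the cost assigned to a single cabinet.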

5. Benchmark


We ran three separate benchmarks to establish the sustained speed of the machine. The first is a series of measurements of the quark condensate, run on a 2048-node machine at Brookhaven configured as a four-dimensional processor mesh, with each node holding a sub-volume of 256 sites. The code was set to generate a series of configurations equilibrated at fixed coupling strength, on which we computed an estimate of the trace of the inverse of the Wilson Dirac operator by averaging 50 random diagonal elements. Estimates for four separate quark masses were performed on each configuration. Timed with the SUN processor clock, the code ran for 492 minutes and performed 590,755 conjugate gradient iterations. Since each such iteration requires 2,808 operations per site and the entire machine contains 524,288 sites, approximately 8.7 × 10^14 floating point operations were performed in that time, for a sustained rate of 29.4Gflops and a cost performance of $10.5/Mflops.

The second benchmark was run with the same program on the same machine at Brookhaven. We simply reduced the lattice volume per node from 256 sites to the smallest practical volume, 2^4 = 16 sites. Now, running for 65 minutes, 878,708 conjugate gradient iterations were performed. This corresponds to 8.085 × 10^13 floating point operations for a sustained rate of 20.73Gflops and a cost performance of $14.87/Mflops.

The third benchmark was run on a 1024-node machine at Columbia. This was a complete, full-QCD, hybrid Monte Carlo run at fixed coupling strength, with the quark mass determined by the hopping parameter. In 1334 minutes this code performed 1,232,496 conjugate gradient iterations, or approximately 9.1 × 10^14 floating point operations. This corresponds to a sustained rate of 11.3Gflops for hardware of one-half the $308K cost above, giving a cost performance of $13.6/Mflops.
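All three results follow from the same arithmetic: operations = iterations × 2,808 operations per site × number of sites, divided by the run time to give the sustained rate, with cost/performance obtained from the prorated hardware cost. The sketch below reproduces the quoted figures. The site count for the first run is stated in the text; the site counts for the second and third runs (32,768 and 262,144) are assumptions inferred from the quoted rates, and the function name is hypothetical.

```python
def benchmark(iterations, sites, minutes, cost_dollars, ops_per_site=2808):
    """Return (total flops, sustained flops/sec, dollars per sustained Mflops)."""
    flops = iterations * ops_per_site * sites
    rate = flops / (minutes * 60)
    dollars_per_mflops = cost_dollars / (rate / 1e6)
    return flops, rate, dollars_per_mflops


COST_2048 = 1_850_000 / 6    # 2048 of 12,288 nodes
COST_1024 = 1_850_000 / 12   # one 1024-node cabinet

RUNS = {
    "1: condensate, 256 sites/node": (590_755, 524_288, 492, COST_2048),
    "2: condensate, 16 sites/node": (878_708, 32_768, 65, COST_2048),    # sites inferred
    "3: hybrid Monte Carlo": (1_232_496, 262_144, 1334, COST_1024),      # sites inferred
}

if __name__ == "__main__":
    for name, args in RUNS.items():
        flops, rate, dpm = benchmark(*args)
        print(f"{name}: {flops:.3e} flops, {rate / 1e9:.2f} Gflops, ${dpm:.2f}/Mflops")
```

Small differences from the quoted values in the last digit reflect rounding of the rate and of the $308K cost in the text.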









Norman Christ
Fri Dec 4 00:14:53 EST 1998