Norman H. Christ, Robert Mawhinney, Pavlos Vranas
Department of Physics
Columbia University, NY, NY, 10027
Abstract: The Quantum Chromodynamics on Digital Signal Processors (QCDSP)
machines in operation at Columbia University and partially complete at the RIKEN
Brookhaven Research Center are MIMD machines with processing nodes based on the Texas
Instruments TMS320C31-50 Digital Signal Processor (DSP), interconnected as a
four-dimensional torus. The Columbia machine contains 8,192 nodes and has a peak speed of
0.4Tflops. The RIKEN/BNL machine has 12,288 nodes, a peak speed of 0.6Tflops and a total
cost of $1.85M. For this submission we have run two standard lattice Quantum
Chromodynamics (QCD) programs on portions of this hardware and obtained three benchmark
numbers. The first program stochastically estimates the trace of the inverse of the Wilson
Dirac operator on a series of thermalized lattices. For the first benchmark, this
calculation is performed on a lattice of 524,288 sites.
Running for 492 minutes on 1/6 of the Brookhaven machine, this code performs
about 8.7 × 10^14 floating point operations for a
cost/performance of $10.50/Mflops. The second benchmark runs the same code on the same
machine but with a lattice volume 1/16 as large, 32,768 sites. By using the smallest practical
lattice volume per node, this benchmark explores the strength of the QCDSP
architecture for problems with small volume per node, a large surface to volume ratio and
correspondingly large demands on communication bandwidth and latency. This second
benchmark ran for 65 minutes and performed 8.085 × 10^13
floating point operations for a cost/performance of $14.8/Mflops. The third
benchmark uses a second sample program which generates a Markov chain of
lattice configurations distributed according to the
statistical weight describing two species of light Wilson quarks interacting with the
gauge degrees of freedom of QCD. Running for
1334 minutes on a single cabinet at Columbia (equivalent to 1/12th of the Brookhaven
machine), this program executes roughly 9 × 10^14
floating point operations for a cost performance of $13.6/Mflops. Pictures of the machine
and further description of the architecture can be found at http://www.phys.columbia.edu/~cqft.
Over the past five years a group in the Physics Department at Columbia University, together with collaborators from four other institutions, has designed and built a number of QCDSP machines. These are generally programmable, highly parallel computers targeted at large-scale numerical studies of the fundamental theory of the strong interactions, QCD. Each machine represents a different configuration of common, scalable computer hardware. We presently have a 64-node machine installed at the University of Wuppertal in Germany, a 128-node machine at Ohio State University, and a 1024-node machine at Florida State University, as well as the 8,192- and 12,288-node machines at Columbia and RIKEN/Brookhaven respectively.
The 0.4Tflops (peak) Columbia machine was completed in April, 1998, but parts of this machine have been in intensive use since September, 1997. Since our initial physics program has been to reproduce earlier results obtained on smaller machines and to explore a new physics algorithm (domain wall fermions), we have cabled the machine at Columbia as eight separate 1024-node machines, each running an independent calculation. Therefore, the benchmark presented from the Columbia machine was run on a single 1024-node cabinet. As our studies proceed over the next one to two months, we expect to join the Columbia machine into a smaller number of larger sections.
The 0.6Tflops (peak) machine at Brookhaven is nearly complete but has had major components finished and operating since the middle of April, 1998. We therefore have run the RIKEN/Brookhaven machine benchmark on a 2048-node portion of the machine that was complete.
The two benchmark programs used to establish the sustained speed of our machines were taken from our present suite of physics production code. They have been in use since March, 1998 and, for example, contributed the results presented by our group at the recent Workshop on Fermion Frontiers in Vector Lattice Gauge Theories, held at Brookhaven, May 6-9, 1998.
Because of its regular, homogeneous structure, lattice QCD lends itself easily to parallel computation. Typically one divides the space-time lattice into an array of smaller, identical sub-volumes and assigns the field variables associated with each sub-volume to a separate processor. Since the fundamental interactions that must be modeled are local or involve neighboring sites, for most parts of the calculation only processors that correspond to contiguous sub-volumes need to communicate directly. However, the most interesting lattice QCD calculations are computationally very demanding, so one is often limited to small lattice sizes with increased communication demands. As a rule of thumb, one needs one Mword/sec of off-node bandwidth per 10Mflops of processor speed. There are similar demands for small communication latency, given the frequent, short communications required by the relatively small problem size.
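To make the decomposition concrete, the following minimal sketch maps a global lattice site to a node and a local coordinate, and counts the nearest-neighbor links that leave one sub-volume. It is an illustration only: the function names `owner` and `off_node_links` are hypothetical and not part of the QCDSP software, and the link count assumes that every direction is split across more than one node.

```python
def owner(site, local_dims):
    """Return (node coordinate, local coordinate) for a global lattice site."""
    node = tuple(x // n for x, n in zip(site, local_dims))
    local = tuple(x % n for x, n in zip(site, local_dims))
    return node, local


def off_node_links(local_dims):
    """Return (sites, links) for one sub-volume.

    links counts nearest-neighbor hops that cross a sub-volume boundary,
    assuming each direction is split across more than one node.
    """
    sites = 1
    for n in local_dims:
        sites *= n
    # Each direction has two faces, each holding sites // n boundary sites.
    links = sum(2 * (sites // n) for n in local_dims)
    return sites, links


for dims in [(2, 2, 2, 2), (4, 4, 4, 4)]:
    sites, links = off_node_links(dims)
    print(dims, sites, links, links / sites)
```

Under these assumptions, halving the linear extent of a hypercubic sub-volume doubles the number of off-node links per site, which is why small volumes per node place the heaviest demands on the network.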
Our architecture is chosen to provide these characteristics at a reasonably low cost.
The fundamental node of our machine is constructed from a commodity processor, a Texas
Instruments DSP. This DSP executes 32-bit floating point arithmetic at a peak speed of
50Mflops. The memory is standard 4Mbit, 60ns DRAM (now a little dated). The only
non-commodity component in the machine is the custom gate array which provides ECC and a
32-word programmable cache needed for the DSP to use DRAM effectively. This device also
controls the 16 serial wires needed to provide bi-directional communications with the
eight nearest neighbors in a 4-dimensional mesh. We designed this ASIC using Viewlogic
tools, relying heavily on VHDL. This 250K-transistor chip worked on our first try and is
manufactured for us by the ATMEL corporation for under $20/unit. The entire processor node
fits on a
PC board whose complete
manufacture and test costs less than $80.
Sixty-three of these small cards insert into SIMM sockets on a
motherboard which has a 64th
node directly attached, with extra memory, PROM and
SCSI access. The 64 nodes are interconnected on the motherboard as a
array. Two of the eight faces of this hypercube are
joined together on the motherboard. The off-node serial wires corresponding to the
remaining 6 faces are taken out to 6 separate cable connectors on the rear of the
backplane, into which this board is inserted. This backplane holds eight motherboards, and
a large number of backplanes (at least 256) can be cabled together to form a large machine
of various geometries. Each backplane has complete clock and reset circuitry so that no
additional controller is required for the machine. Each motherboard has two independent
SCSI ports. These are connected into a large tree with the UNIX host as its root. This
SCSI tree is used to boot the machine, load code and extract results. Additional disks for
checkpointing the calculation and data archiving can be joined to these SCSI connections
as well.
Let us complete this overview of the computer hardware by summarizing the
characteristics of the inter-processor communication. Communication in each of the eight
off-node directions is controlled by a separate DMA engine within the ASIC. Each of these
eight links can be programmed to send or receive a specified number of data blocks, of
specified length, separated in memory by a fixed stride. The DSP need only initiate such a
transfer and later poll to determine that it has finished. The corresponding receive or
send must also be programmed by the processor on the other end of the link. These two
actions need not be synchronized: the hard-wired protocol does not permit data to be lost
but rather stalls the DMA process until the transfer has been set up at both ends of the
link. The data transfer rate on a single link is
Mbytes/sec, giving a net off-node bandwidth of 40Mbytes/sec.
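The block-and-stride addressing described above can be sketched in a few lines. This is a software illustration of the addressing pattern only, not the ASIC's interface: the names `gather_blocks` and `scatter_blocks` are hypothetical, memory is modeled as a Python list of words, and the stride is assumed to be measured between the starts of consecutive blocks.

```python
def gather_blocks(memory, start, num_blocks, block_len, stride):
    """Collect num_blocks blocks of block_len words; block starts are stride apart."""
    words = []
    for b in range(num_blocks):
        base = start + b * stride
        words.extend(memory[base:base + block_len])
    return words


def scatter_blocks(memory, start, num_blocks, block_len, stride, words):
    """Write received words into memory using a block/stride layout."""
    for b in range(num_blocks):
        base = start + b * stride
        memory[base:base + block_len] = words[b * block_len:(b + 1) * block_len]


sender = list(range(20))   # stand-in for the sending node's memory
receiver = [0] * 20        # stand-in for the receiving node's memory

# The two ends may use different start addresses and strides.
payload = gather_blocks(sender, start=2, num_blocks=3, block_len=2, stride=5)
scatter_blocks(receiver, start=0, num_blocks=3, block_len=2, stride=4, words=payload)
print(payload)
print(receiver)
```

Because sender and receiver each describe their own layout, a face of the local lattice can be sent directly from, and received directly into, its natural position in memory without an intermediate packing step.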
Code for the machine is written using commercial development tools for the DSP provided
by Texas Instruments. These include C and C++
cross compilers as well as an assembler, all running on the host machine.
The QCDSP machine is controlled from within a UNIX shell running on the host which,
in addition to the normal C shell commands, is augmented with further
machine-specific instructions allowing the loading of code or data, the running of
diagnostics, and the reading of data. Resident on each node is a small kernel which
handles communication and provides the user with standard C programming support, such as
the functions printf() and fopen(), so data can be written to the screen
and files on the host can be directly accessed for reading or writing. In addition, there
are specific "system" subroutines that can be called to initiate an inter-node data
transfer.
With the present software environment, the actual programming of the machine is done
from the perspective of a single node. Normally the lattice size on a single node will be
left as a run-time variable so a given piece of compiled code can be run on a variety of
machine topologies yielding results for a number of actual lattice volumes. While the most
time critical inner loops may be written in assembler, the bulk of the application code
for the machine is written in C++. At
present a large number of important physics programs have been completed, including much
code for computing hadron masses using staggered, Wilson and domain wall
fermions, as well as code for complete hybrid Monte Carlo sampling using each of these
fermion formulations.
Our benchmarks for this submission use two of these production codes: the first
computes the chiral condensate on a series of equilibrated, high-temperature lattices. The
second carries out a hybrid Monte Carlo evolution including both the quark and gluon
dynamics. The performance of both pieces of code relies on an efficient conjugate gradient
inverter used to compute the inverse of the Wilson Dirac operator, a sparse
complex matrix. This inverter contains the
usual algorithmic enhancements normally used to increase efficiency, including exploiting
the spin projection structure of the r=1 Wilson hopping terms and using a red-black
preconditioning scheme.
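For readers unfamiliar with the method, here is a minimal sketch of an unpreconditioned conjugate gradient solver, together with a trace estimate formed by averaging randomly chosen diagonal elements of the inverse, as in the condensate benchmark. It is not the production inverter: it uses a small, dense, real symmetric positive-definite matrix as a stand-in for the sparse complex operator, it omits spin projection and red-black preconditioning, and all function names are hypothetical.

```python
import random


def matvec(a, x):
    """Dense matrix-vector product."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def dot(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))


def conjugate_gradient(a, b, tol=1e-12, max_iter=1000):
    """Solve a x = b for a real symmetric positive-definite matrix a."""
    x = [0.0] * len(b)
    r = list(b)          # residual b - a x for the starting guess x = 0
    p = list(r)          # search direction
    rr = dot(r, r)
    for _ in range(max_iter):
        if rr <= tol * tol:
            break
        ap = matvec(a, p)
        alpha = rr / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rr_new = dot(r, r)
        p = [ri + (rr_new / rr) * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x


def estimate_inverse_trace(a, samples, rng):
    """Estimate tr(a^-1) by averaging randomly chosen diagonal elements of a^-1."""
    n = len(a)
    total = 0.0
    for _ in range(samples):
        i = rng.randrange(n)
        unit = [1.0 if j == i else 0.0 for j in range(n)]
        total += conjugate_gradient(a, unit)[i]   # (a^-1)_ii
    return n * total / samples


solution = conjugate_gradient([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])
trace = estimate_inverse_trace([[2.0, 1.0], [1.0, 2.0]], samples=50, rng=random.Random(0))
print(solution, trace)
```

Each sampled diagonal element costs one full solve, which is why the conjugate gradient iteration count dominates the operation counts quoted for the benchmarks.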
It is important to emphasize that the large computational demands of lattice QCD
require that powerful machines be used on relatively small problems. As the number of
nodes of a parallel machine is increased to increase total computational power, the size
of the problem assigned to each node will decrease. Good efficiency for this small problem
size then requires a tightly-coupled, parallel machine with a large network
bandwidth/processor and low processor latency for communication. For example, the smallest
practical problem size on a single processor is a
2^4 lattice. A single Wilson Dirac operator iteration (typically the most
demanding part of our code) requires about 45Kflop. During these operations the
calculation requires that at least 64 separate transfers of 48 words each be made off-node.
This implies a network bandwidth of about 0.07 words/flop (64 × 48 = 3072 words per 45Kflop)
and a processor/network latency consistent with efficiently initiating a
network transfer after every 700 flop or so. Our
second benchmark is carried out with this demanding
single-node lattice volume.
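The network demand implied by these figures follows from simple arithmetic, sketched here using only the numbers quoted in the paragraph above (about 45Kflop per iteration, 64 off-node transfers of 48 words each). The function name `network_demand` is hypothetical.

```python
def network_demand(flops_per_iteration, transfers, words_per_transfer):
    """Return (words sent, words per flop, flops between transfer starts)."""
    words = transfers * words_per_transfer
    words_per_flop = words / flops_per_iteration
    flops_per_transfer = flops_per_iteration / transfers
    return words, words_per_flop, flops_per_transfer


words, words_per_flop, flops_per_transfer = network_demand(45_000, 64, 48)
print(words)                         # words sent off-node per iteration
print(round(words_per_flop, 3))      # required bandwidth in words/flop
print(round(flops_per_transfer))     # flops available between transfer starts
```

With these inputs the node must move 3072 words per iteration, or roughly 0.07 words per flop, and must start a new transfer about every 700 floating point operations.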
The cost for such non-commercial hardware can be hard to determine. However, for this submission, we can use the actual cost of the machine being completed at the RIKEN Brookhaven Research Center at the Brookhaven National Laboratory. This machine is being constructed by the group at Columbia for an amount somewhat less than $1.85M. This figure has two components. The first is $1.8M in funding explicitly provided to Columbia for the procurement of the machine. Extensive paperwork is available at Columbia, detailing the cost of each component down to the last wire-lug and clock chip and each manufacturing/assembly contract. This sum has paid for complete, tested subassemblies: 206 fully populated motherboards, 16 spares, 12 fully assembled, water-cooled cabinets and two 8-slot, air-cooled crates.
The final $50K is an estimate of the labor costs required to configure and burn in the
entire system. This will take approximately four months of work by one full-time
Brookhaven technician, with 1/3-time supervision by one of us. We believe the $50K
figure used is slightly larger than these actual personnel costs. We then compute the cost
of the 2048-node machine on which the benchmark was run as a prorated 1/6
of $1.85M or $308K.
We ran three separate benchmarks to establish the sustained speed of the machine. The
first is a series of measurements of the quark condensate, run on a 2048-node machine at
Brookhaven configured as a
processor
mesh machine with each node holding a
256-site sub-volume. The code was set to generate a series of configurations equilibrated at a
coupling strength of
on which we
computed an estimate of the trace of the inverse of the Wilson Dirac operator by averaging
50 random, diagonal elements. Estimates for four separate quark masses were performed on
each configuration. Timed with the SUN processor clock, the code ran for 492 minutes and
performed 590,755 conjugate gradient iterations. Since each such iteration requires 2,808
operations per site and the entire machine contains 524,288 sites, about
8.70 × 10^14 floating point operations were performed in that time,
for a sustained rate of 29.4Gflops and a cost performance of $10.5/Mflops.
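The arithmetic behind these figures can be reproduced with a short sketch. The inputs are the values quoted for the first benchmark (590,755 iterations, 2,808 operations per site, 524,288 sites, 492 minutes, and the $308K prorated cost); the function name `sustained_performance` is hypothetical, and only conjugate gradient operations are counted, as in the text.

```python
def sustained_performance(cg_iterations, flops_per_site, sites, minutes, cost_dollars):
    """Return (total flops, sustained Gflops, dollars per sustained Mflops)."""
    total_flops = cg_iterations * flops_per_site * sites
    seconds = minutes * 60
    gflops = total_flops / seconds / 1e9
    dollars_per_mflops = cost_dollars / (gflops * 1e3)
    return total_flops, gflops, dollars_per_mflops


total, gflops, cost = sustained_performance(590_755, 2_808, 524_288, 492, 308_000)
print(f"{total:.3e} flops, {gflops:.2f} Gflops, ${cost:.2f}/Mflops")
```

This gives about 8.70 × 10^14 operations, 29.46 Gflops and $10.45/Mflops, in agreement with the quoted 29.4Gflops and $10.5/Mflops to within rounding.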
The second benchmark was run with the same program on the same machine at Brookhaven.
We simply reduced the lattice volume per node from 256 sites
to the smallest practical volume,
2^4 = 16 sites. Now, running for 65 minutes, 878,708 conjugate gradient iterations were
performed. This corresponds to 8.085 × 10^13
floating point operations for a sustained rate of 20.73Gflops and a cost performance of
$14.87/Mflops.
The third benchmark was run on a 1024-node machine at Columbia. This was a complete,
full-QCD, hybrid Monte Carlo run on a
lattice with a coupling strength
and a
quark mass determined by the hopping parameter
. In 1334 minutes this code performed 1,232,496 conjugate gradient
iterations or roughly 9 × 10^14 floating point
operations. This corresponds to a sustained rate of 11.3Gflops for hardware of one-half
the $308K cost above, giving a cost performance of $13.6/Mflops.