Given the enormous computational demands of quantum field theory and the easily parallelized nature of this problem, it is natural to design and build massively parallel machines optimized for this class of calculation. Beginning in March 1993, a collaboration of physicists and engineers, centered at Columbia, began the design of a digital signal processor (DSP) based machine with this objective.
Here we describe the architecture and implementation of what has grown into a series of computers ranging from 64-node, 3 Gflops machines to an 8,192-node, 400 Gflops machine at Columbia and a 12,288-node, 600 Gflops machine now being constructed at the RIKEN/Brookhaven Research Center at the Brookhaven National Laboratory. The processors in each of these machines are interconnected as a four-dimensional mesh with nearest-neighbor connections. For example, the Columbia machine is a 16x16x8x4 mesh. Each processor or node is made up of a Texas Instruments TMS320C31 DSP, 0.5 Mwords of error-corrected memory and a gate array providing memory management and inter-node communication. The Columbia machine cost approximately $1.8M to construct over the past two years (it would cost about $1.2M at present component prices), implying a cost-performance advantage of more than ten-fold over present commercial machines.
Figure 1 shows a single node containing the DSP, memory and a gate array chip which acts as a memory and inter-node communications controller. The memory is made of five 256K x 16, 60 ns DRAM chips, providing a 32-bit word and an additional 7 bits for error detection and correction.
Figure 1. A block diagram of a single node.
The gate array chip, referred to as our Node Gate Array or NGA, contains standard memory management circuitry, eight buffered DMA channels to permit simultaneous communication with each neighbor, and a circular buffer used to efficiently match the 25 Mword/sec DSP bandwidth to the I/O characteristics of DRAM, as shown in Figure 2.
Figure 2. A diagram of the components contained within the gate array chip (NGA).
The seven components that make up this basic processor node are mounted on a small, 1.75" x 2.65", six-layer printed circuit board. The resulting physical unit is called a daughter board. These nodes now cost approximately $80 each and one is shown in Figure 3 below.
Figure 3. An example of a commercially assembled daughter board. Our special ASIC chip is on the left while the Texas Instruments DSP is on the right. The gold contacts of the 40-pin SIMM card edge are visible along the bottom.
Sixty-three of these daughter boards are mounted in SIMM sockets on a 20" x 14" mother board, as shown in Figure 4 below. The 64th node on the mother board is soldered directly to the board and has an extra 8 Mwords of memory and connections to two SCSI controller chips, allowing communication with two industry-standard SCSI busses. These SCSI connections are used both to connect one mother board to a host machine (typically a Sun) and to connect additional mother boards to disk and tape drives and to each other. This allows configuration checkpointing, backup and archiving.
Figure 4. A mother board holding 63 daughter boards and a soldered-down 64th node. The board is mounted in a test fixture with somewhat primitive air cooling.
Eight of these mother boards are plugged into a single backplane to form a crate. An example of such an 8-mother-board crate is shown in Figure 5. A stack of two such crates forms a cabinet, and eight such cabinets make up the entire 8K-node Columbia machine. These 2-crate cabinets are designed so they can be stacked two-high, and the Brookhaven machine is constructed of 12 cabinets arranged as six of these stacked, four-crate units.
Figure 5. An eight-mother-board crate. This example has the side physically cut away for hardware debugging purposes, allowing the daughter boards attached to the right-most mother board to be seen. There are six working mother boards installed in the picture. The left-most four boards are interconnected as a 4x4x4x4 machine, the next panel is blank, and the final two mother boards operate independently of the four-mother-board unit and of each other; each is connected as a 4x4x2x2 machine.
Transfers between two adjacent nodes are synchronized at the word level. Upon receipt of a word from a neighboring node, its parity is checked and a ready/acknowledge is sent back to that neighbor. If this acknowledge message indicates a parity error, the sending node will retransmit. If this ready/acknowledge reply indicates no error, it also guarantees that the received word has been removed from the input buffer so that the receiver is ready to receive another word. Thus, communication is self-synchronized in hardware. We reduce the processor overhead required for these transmissions and make the latency implied by our serial scheme less onerous by implementing a ``block-strided-move'' instruction in hardware. This allows a single DSP instruction to initiate the transfer (either send or receive) of a specified number of blocks of contiguous words, with each block separated from the preceding one by a fixed stride.
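As an illustration, the gather pattern performed by a block-strided-move can be sketched in a few lines of Python. The function name and argument order here are ours for illustration only; the real instruction is executed by the NGA's DMA controllers directly on DSP memory.

```python
def block_strided_move(memory, start, block_len, num_blocks, stride):
    """Gather the words a block-strided-move would transfer:
    num_blocks blocks of block_len contiguous words, with the
    start of each block separated from the previous one by a
    fixed stride."""
    out = []
    for b in range(num_blocks):
        base = start + b * stride
        out.extend(memory[base:base + block_len])
    return out

# A 20-word memory; move 3 blocks of 2 words with stride 5.
memory = list(range(20))
words = block_strided_move(memory, start=0, block_len=2,
                           num_blocks=3, stride=5)
```

A single such instruction can, for example, gather all the words lying on one face of a node's sub-lattice, so the DSP pays the instruction-issue cost once per face rather than once per word.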
Communication between a pair of nodes is provided by a single open-collector wire, driven according to Future-Bus conventions at 50 MHz. The 64-node mother board contains the complete 4-node time axis (connected as a circle) for each of the 16 spatial coordinates (one node at each coordinate) which make up a 4x2x2 mesh. Thus, our four-dimensional interconnection scheme can be realized if each mother board is connected to its six nearest neighbors in a 3-dimensional mesh. Two of these directions (corresponding to the two 4x2x2 faces of the processor hypercube on the mother board) are joined with 16 wires, while the remaining four directions (corresponding to the four 4x4x2 faces of the processor hypercube on the mother board) are joined with 32 wires. This is easily accomplished with 36- and 68-conductor high density ribbon cable, plugged into connectors on the backplane. By simply rearranging the connection pattern of these cables one can choose a variety of topologies for the machine. We typically configure a machine as a hypercube with periodic boundary conditions, i.e. a four-dimensional torus.
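In the usual toroidal configuration, the node-to-node adjacency is simply that of a four-dimensional torus. The following minimal sketch of the neighbor map is ours for illustration (coordinate and dimension ordering are arbitrary); it is not taken from the machine's firmware.

```python
def torus_neighbors(coord, dims):
    """Return the 2*d nearest neighbors of a node in a
    d-dimensional torus with periodic boundary conditions,
    stepping +1 and -1 along each axis with wrap-around."""
    neighbors = []
    for axis in range(len(dims)):
        for step in (-1, +1):
            n = list(coord)
            n[axis] = (n[axis] + step) % dims[axis]
            neighbors.append(tuple(n))
    return neighbors

# The 8 neighbors of the corner node in a 16x16x8x4 torus
# (the Columbia machine's mesh dimensions).
nbrs = torus_neighbors((0, 0, 0, 0), (16, 16, 8, 4))
```

The wrap-around in each direction is exactly what the rearranged backplane cables provide: the cable leaving one face of the machine returns to the opposite face.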
The total communications bandwidth for each node, recognizing the possibility of simultaneous communication with each of the node's eight neighbors, is about 50 Mbytes/sec. For our quantum field theory application this is well matched to the 50 Mflops processing speed of the DSP. However, serial communication of the sort implemented in our QCDSP machines implies a considerable latency for the transfer of a word. This can be easily hidden by other parts of the calculation when data on one node is transferred to be used by its neighbor. However, for global operations such as a global sum or global broadcast, the latencies of the individual communication steps add and can introduce a very significant bottleneck.
We avoid this standard problem of serial communications by implementing pass-through operations in hardware. For example, the NGA can be instructed to receive an incoming serial word from a specified direction and to pass the incoming bits on to a given subset of the remaining directions with minimum latency. Thus, the incoming word can begin to be transmitted after only a few bits have been received, rather than waiting for the entire word to arrive. In a similar, although reversed, fashion, a specific node can also be instructed to receive incoming serial data from a subset of the eight off-node directions and to add the incoming bit streams, passing the multiword sum on in another, specified direction. By keeping "carry" information, a series of words can be summed, providing an arbitrary-precision, low-latency, global integer sum. Of course, this operation requires that the transmitted word come with its least significant bits first. Finally, the hardware also supports a second "combine" operation where, instead of the sum of the incoming words, the node transmits the largest of the incoming words (now the most significant bits must be transmitted first). By combining the global integer sum with this global max, we can realize a low-latency, arbitrary-precision, global floating point sum. Figure 6 below shows the data motion that might occur in a two-dimensional plane during a global sum operation.
Figure 6. Data motion that occurs during a low-latency, pass-through global sum shown in two dimensions. The black lines represent a two-dimensional plane in the four-dimensional communications network, the blue dots the nodes and the red arrows indicate the active links and the direction of data flow.
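The arithmetic behind the pass-through integer sum can be modeled simply: a combining node adds the incoming LSB-first bit streams one bit position at a time, carrying between positions, so the low bits of the sum can be forwarded before the high bits have even arrived. The Python model below is ours for illustration and ignores the pass-through timing and routing.

```python
def bit_serial_sum(streams):
    """Add several equal-length, LSB-first bit streams,
    keeping a running carry between bit positions, the way a
    pass-through combining node sums incoming serial data.
    Returns the sum as an LSB-first bit list."""
    carry = 0
    out = []
    for bits in zip(*streams):
        total = carry + sum(bits)
        out.append(total & 1)   # emit this bit position immediately
        carry = total >> 1      # carry propagates to the next position
    while carry:                # flush any remaining carry bits
        out.append(carry & 1)
        carry >>= 1
    return out

def to_int(bits):
    """Decode an LSB-first bit list back to an integer."""
    return sum(b << i for i, b in enumerate(bits))

# Sum 5 and 6, each sent LSB first: 5 = 1,0,1,0 and 6 = 0,1,1,0.
total = bit_serial_sum([[1, 0, 1, 0], [0, 1, 1, 0]])
```

Because the carry is retained between successive words, a multiword integer of any length can be summed this way, which is what makes the arbitrary-precision global sum possible.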
In addition to the usual address generation and error handling, the NGA provides an efficient interface between the very fast DSP and our DRAM. We must solve two problems: how to produce a valid ready signal very early in the DSP read cycle and how to provide zero-wait-state, page-mode memory access for the DSP.
These problems are solved by the circular buffer, a 32-word internal cache. This device interprets a DSP memory request, supplies the requested data from memory, and also begins to pre-fetch data held in sequential addresses according to a programmable algorithm. This pre-fetched data can be read in at the full, page-mode DRAM bandwidth and is then available for guaranteed, zero-wait-state access by the DSP. By supporting multiple images of our 0.5 Mword DRAM within the DSP address space, we can provide a valid ready signal at the beginning of a DSP read, provided the DSP code addresses memory in a particular image reserved for a pattern of usage for which good pre-fetched data can be guaranteed.
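A toy model of the circular buffer's effect on DRAM traffic, assuming a simple sequential pre-fetch policy, is sketched below. The class name, buffer-fill policy, and access counting are our illustrative assumptions; the real NGA supports programmable pre-fetch algorithms.

```python
class CircularBuffer:
    """Toy model of a 32-word pre-fetching buffer between a
    fast processor and DRAM.  On a miss it fetches the
    requested word plus the following sequential run in one
    page-mode burst; subsequent sequential reads hit the
    buffer with zero wait states."""

    def __init__(self, dram, size=32):
        self.dram = dram
        self.size = size
        self.cache = {}          # address -> word
        self.dram_bursts = 0     # number of DRAM burst accesses

    def read(self, addr):
        if addr not in self.cache:
            self.dram_bursts += 1
            end = min(addr + self.size, len(self.dram))
            self.cache = {a: self.dram[a] for a in range(addr, end)}
        return self.cache[addr]

# 32 sequential reads cost a single DRAM burst.
buf = CircularBuffer(list(range(100)))
values = [buf.read(a) for a in range(32)]
```

The point of the model is the ratio: one DRAM burst serves a full buffer's worth of sequential processor reads, so sequential code sees the full page-mode bandwidth.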
NODE GATE ARRAY
In addition to the communication and memory management functions described above, the NGA must also arbitrate between the memory needs of the off-node communication and the DSP. Each of the eight off-node communications directions is provided with a separate direct memory access controller, capable of carrying out an independent block-strided-move instruction, and a separate eight-word buffer. Memory arbitration is provided between each of these eight independent communications units and the DSP. This memory arbitration uses an algorithm which attempts to exploit the caching possible in the eight, eight-word buffers to reduce the inefficiencies caused by DRAM page misses.
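The flavor of such page-aware arbitration can be sketched as follows. The page size, request format, and priority rule here are illustrative assumptions on our part, not the NGA's actual algorithm.

```python
def arbitrate(requests, open_page, page_size=512):
    """Pick the next memory request from a list of
    (unit, address) pairs.  Prefer a request that hits the
    currently open DRAM page, avoiding a page-miss penalty;
    otherwise fall back to the first (oldest) request."""
    for unit, addr in requests:
        if addr // page_size == open_page:
            return (unit, addr)
    return requests[0]

# With page 1 open, the DMA request to address 600 wins over
# the DSP request to address 10 (which lies in page 0).
winner = arbitrate([("dsp", 10), ("dma0", 600)], open_page=1)
```

By letting the eight-word buffers absorb short delays, the arbiter can group same-page accesses together and pay the page-miss cost less often.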
A good overview of our QCDSP architecture can be obtained by considering the three networks provided in the machine. Most central to an actual calculation is the 4-dimensional serial network. This network provides the communication between neighboring nodes in a four-dimensional mesh that is used during a typical parallel calculation to exchange data between one region of the problem at hand and a neighboring region. Since many problems in physics and engineering can naturally be divided into semi-autonomous regions in space (or space-time) communicating only with neighboring regions, this nearest-neighbor communications network may handle the vast majority of the inter-processor communications required by the calculation without the need to pass data on to more remote nodes.
Figure 7. A schematic view of the three networks that make up our QCDSP architecture.
Perhaps of next importance to a potential user of the machine is the SCSI network. A segment of this network is shown in Figure 7, where three mother boards with their SCSI connection to the host are depicted schematically. As shown in the figure, a SCSI port on only the top mother board is directly connected to the host. That #0 mother board can then access the two lower mother boards by using its second SCSI port, which is attached to a second SCSI bus to which those two lower mother boards are also connected. Each of these second-layer mother boards can then be attached to further SCSI busses through their second SCSI ports, allowing this scheme to be continued until an entire tree of SCSI connections is developed, joining each mother board, indirectly, to the host. This SCSI tree serves two purposes. First, it allows the host to communicate with each mother board. This is necessary when booting the machine and for transferring code or data between the host and a mother board. Second, by adding standard SCSI disks to these SCSI busses, we can provide a highly parallel, high bandwidth disk system for the machine.
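For illustration, the chain of intermediate mother boards the host must traverse to reach a given board in such a tree can be computed as below. The fan-out of two matches the segment depicted in the figure, but the numbering scheme and the actual bus assignment are our illustrative assumptions.

```python
def scsi_path(board, fanout=2):
    """Return the chain of intermediate mother boards (host
    side first) traversed to reach `board` in a SCSI tree
    where board 0 hangs directly off the host and each board
    serves `fanout` boards on its second SCSI bus."""
    path = []
    while board > 0:
        board = (board - 1) // fanout   # step up to the parent board
        path.append(board)
    return list(reversed(path))

# Board 4 in a fanout-2 tree is reached via board 0, then board 1.
route = scsi_path(4, fanout=2)
```

The depth of this tree grows only logarithmically in the number of mother boards, so even the 128-board Columbia machine is only a few SCSI hops deep.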
Finally, Figure 7 shows a third, DSP-serial, network. This network uses a special serial port available on the Texas Instruments TMS320C31 to provide a connection between the distinguished node #0 on each mother board and each of the 63 other nodes mounted on the SIMM daughter boards. Considerable care has been taken in designing this system so that relatively few components are required to support the communication between the host and a specific daughter board. As a result, the SCSI and DSP-serial networks together form a reliable boot and diagnostic network for the machine.
The actual construction of the machine is carried out by a number of outside vendors. Our application-specific integrated circuit chip is made by the Atmel Corporation. The printed circuit boards for both the mother boards and daughter boards are manufactured by HADCO. The assembly and debugging of the daughter and mother boards is carried out by AJC Electronics of Syosset, NY, 516-496-0992. The major suppliers of components for the project are Marshall Industries and NuHorizons. The backplanes for the 8-mother-board crates are made by Bustronics of Fremont, CA, 510-490-7388, and the card cages themselves, as well as the cabinets for each of our different configurations, are made by ELMA.
These cabinets come in three different configurations. The simplest are the single mother board units that provide a 3 Gflops machine easily connected by a SCSI bus to a Sun. These are air-cooled and consume about 300 Watts. The next larger configuration is a stand-alone, air-cooled crate holding 8 mother boards or 512 nodes. This 25 Gflops machine consumes about 2,500 Watts. These can be placed side by side and connected by 4-dimensional serial cables to form a larger machine. The size of a machine that can be configured in this way is limited by the air cooling, which prevents these crates from being directly stacked. Finally, the largest unit is a two-crate, 16-mother-board, 1024-node, water-cooled cabinet. Each cabinet consumes about 5,000 Watts. These units, typically cooled by 50°F water, can be arranged side by side or stacked.
This section has summarized the architecture and technology used in our QCDSP machines. Our experience to date indicates that the resulting system forms a robust and easily maintained machine with a considerable cost-performance advantage over alternative, massively parallel machines.