The major effort in this project is now focused on writing operating system and application code. The machine is accessed through a command-line interface, written as a UNIX shell, which provides additional machine-specific control commands and maintains and uses a database of machine configuration and initialization choices. These system commands allow operation of a machine whose boot network and 4-dimensional network have arbitrary configuration. The system is self-configuring: it begins by automatically examining the SCSI tree, or boot network, and directly determining its topology. Next, the 4-dimensional network is similarly explored, and all of this configuration information is saved for later use by system and application routines.
As might be expected, the programming model for the machine is that of an array of identical nodes which, in most cases, will run identical code. Thus, the first step in writing code for the machine might be to create a C or C++ program that could be compiled to run on either a workstation or a single DSP.
Next, communication routines should be added for data transfers between neighboring nodes. This would usually still be done in the context of a single program, running identically on every node. Each off-node data transfer would be programmed as both a send of the data required by a neighbor and a receive of the corresponding data to be expected from the neighbor in the opposite direction. Finally, calls to the hardware-implemented global sum and global interrupt synchronization might be added.
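The paired send and receive described above can be sketched in ordinary C++ by simulating a one-dimensional ring of nodes inside a single process. This is only an illustration under stated assumptions: the `RingMachine` type, its members, and the software `global_sum` are hypothetical stand-ins and are not the machine's actual communication interface, which the text does not specify.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the machine: slot i holds the data of node i.
// On real hardware each node would run the same code on its own local data.
struct RingMachine {
    std::vector<double> boundary;  // value each node sends in the + direction
    std::vector<double> halo;      // value each node receives from the - direction

    explicit RingMachine(std::size_t nodes)
        : boundary(nodes, 0.0), halo(nodes, 0.0) {}

    // Paired transfer: every node sends to its + neighbor and, in the same
    // step, receives the corresponding value from its - neighbor.
    // The ring is periodic, so the last node sends to node 0.
    void shift_plus() {
        const std::size_t n = boundary.size();
        for (std::size_t i = 0; i < n; ++i) {
            const std::size_t dest = (i + 1) % n;
            halo[dest] = boundary[i];
        }
    }

    // Software stand-in for the hardware global sum: adds one value per node
    // and returns the total that every node would see.
    double global_sum(const std::vector<double>& local) const {
        double total = 0.0;
        for (double v : local) total += v;
        return total;
    }
};
```

Because every node executes the same send and the same receive, the single-program model is preserved: no node needs code that differs from its neighbors'.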
The code thus far could all be written in C++ and, except for the provision for off-node communications, need be no more complex than that required for a general-purpose machine. Code of this sort will generally run quite slowly and should not be expected to achieve more than 5% of the machine's peak speed. If critical routines are identified and specifically relocated to the on-chip memory (CRAM) of the DSP, this performance can often be boosted to 10%.
To achieve greater efficiency, the program (still in C++) needs to be reorganized to conserve the DSP's bandwidth to memory. Specific arithmetic arguments might be moved to CRAM and the circular buffer programmed to maximize memory bandwidth. These efforts may increase overall performance to nearly 20% of peak speed.
Finally, maximum efficiency (20-40% of peak speed) can be achieved with the judicious introduction of assembly language routines, careful avoidance of pipeline conflicts, and attention to the requirements of the DSP's parallel add and multiply instructions. Address arithmetic should be minimized by pre-computing tables of the addresses used in critical routines.
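As an illustration of pre-computing address tables, the following C++ sketch builds a table of forward-neighbor indices for a 4-dimensional local lattice once, so that the inner loop performs only table lookups. The `NeighborTable` type, the lexicographic site ordering, and the periodic wrap-around within a node are assumptions made for this example; the layout used in the actual routines is not described in the text.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical local lattice of extent L[0] x L[1] x L[2] x L[3], with
// periodic wrap-around inside the node (off-node transfers are ignored here).
struct NeighborTable {
    std::array<std::size_t, 4> L;
    std::vector<std::size_t> plus;  // plus[4*site + mu] = index of site + mu-hat

    explicit NeighborTable(const std::array<std::size_t, 4>& extent) : L(extent) {
        const std::size_t volume = L[0] * L[1] * L[2] * L[3];
        plus.resize(4 * volume);
        for (std::size_t site = 0; site < volume; ++site) {
            // Decode the lexicographic index into coordinates, x[0] fastest.
            std::array<std::size_t, 4> x;
            std::size_t rest = site;
            for (std::size_t mu = 0; mu < 4; ++mu) {
                x[mu] = rest % L[mu];
                rest /= L[mu];
            }
            // Store the index of the forward neighbor in each direction.
            for (std::size_t mu = 0; mu < 4; ++mu) {
                std::array<std::size_t, 4> y = x;
                y[mu] = (x[mu] + 1) % L[mu];
                plus[4 * site + mu] = index(y);
            }
        }
    }

    // Lexicographic site index, x[0] running fastest.
    std::size_t index(const std::array<std::size_t, 4>& x) const {
        return x[0] + L[0] * (x[1] + L[1] * (x[2] + L[2] * x[3]));
    }
};

// Inner loop: the modulo and multiply address arithmetic is replaced by lookups.
double sum_forward_neighbors(const NeighborTable& t, const std::vector<double>& field) {
    double total = 0.0;
    const std::size_t volume = field.size();
    for (std::size_t site = 0; site < volume; ++site)
        for (std::size_t mu = 0; mu < 4; ++mu)
            total += field[t.plus[4 * site + mu]];
    return total;
}
```

The table costs four index entries per site, so in practice it would have to be weighed against the limited on-chip memory mentioned above.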
Application programs available at present include reasonably optimized conjugate gradient routines for staggered, Wilson and domain-wall fermions, efficient Cabibbo-Marinari heat bath code, and a hybrid Monte Carlo routine for staggered fermions; development of similar code for Wilson and domain-wall fermions is well underway. All of these application programs achieve approximately 20% efficiency, i.e. they run at about 10 Mflops/node.
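The efficiency figures quoted in this section can be converted to sustained speeds with a one-line calculation. The sketch below assumes a peak of 50 Mflops/node, which is not stated explicitly here but follows from the statement that 20% efficiency corresponds to 10 Mflops/node; the names `kInferredPeakMflops` and `sustained_mflops` are hypothetical.

```cpp
// Peak speed per node, inferred from "20% efficiency, i.e. 10 Mflops/node".
// This value is an inference from the text, not a quoted specification.
constexpr double kInferredPeakMflops = 10.0 / 0.20;

// Sustained speed per node for a given fraction of peak (0.05 means 5%).
constexpr double sustained_mflops(double efficiency) {
    return efficiency * kInferredPeakMflops;
}
```

Under this assumption, the 5%, 10%, 20% and 40% levels discussed above correspond to roughly 2.5, 5, 10 and 20 Mflops/node.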