Storage systems and processing platforms have similar histories. Both started as single large monoliths, and both have moved decisively toward parallelism to take advantage of the volume economies introduced by the personal computer industry. Large attached storage systems now have thousands of individual spindles; Top500 computing platforms increasingly have thousands of individual processors. The challenge for system architects is to maximize the parallelism between those thousands of spindles and those thousands of processors while assuring reliability when any one of them fails. Winning solutions maximize total transfer bandwidth while minimizing software complexity. Several successful approaches now exist, including Lustre, PVFS, and GPFS, but until now none of them could extend its reach to the other system resource that is present in the thousands: DIMMs. By reflecting the power of the Lustre parallel file system back onto terabytes of the system's own main memory, the SiCortex SC5832 FabriCache provides many of the benefits of a global shared memory without the enormous cost.
Since the SiCortex FabriCache takes advantage of the Lustre parallel file system, a few words about Lustre are in order. Produced by Cluster File Systems, Lustre has established itself as a standard in high performance computing, used by Cray and HP among others.
Like any parallel file system, Lustre starts by spreading, or “striping”, individual files across as many disk spindles as possible. Because the file is spread across many disks, an individual processor can read it back much faster than it could from a single disk. More importantly, many processors can access pieces of the same file simultaneously. For the physical scientist, parallel file systems allow all the processors in the system to write updated information about their individual grid points to the appropriate places in a single file at once. For the life scientist, parallel file systems allow every processor to read from a single copy of a genome as it looks for patterns of similarity.
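To make the idea of striping concrete, the following is a minimal sketch of how a byte offset in a file could be mapped to a storage server. It assumes plain round-robin striping with a fixed stripe size; the names `locate`, `stripe_size`, and `stripe_count` and the 1 MiB stripe size are illustrative choices, not taken from Lustre or from SiCortex documentation.

```python
from typing import NamedTuple


class StripeLocation(NamedTuple):
    server: int         # index of the storage server holding the byte
    object_offset: int  # byte offset inside that server's piece of the file


def locate(offset: int, stripe_size: int, stripe_count: int) -> StripeLocation:
    """Map a file byte offset to (server, offset within that server's object).

    Assumes simple round-robin striping: stripe 0 goes to server 0,
    stripe 1 to server 1, and so on, wrapping after stripe_count servers.
    """
    if offset < 0 or stripe_size <= 0 or stripe_count <= 0:
        raise ValueError("offset must be >= 0; stripe_size and stripe_count must be > 0")
    stripe_index = offset // stripe_size        # which stripe of the file
    server = stripe_index % stripe_count        # round-robin server choice
    full_rounds = stripe_index // stripe_count  # complete passes over all servers
    object_offset = full_rounds * stripe_size + offset % stripe_size
    return StripeLocation(server, object_offset)


if __name__ == "__main__":
    # Hypothetical layout: 1 MiB stripes spread over 4 servers.
    MIB = 1 << 20
    for off in (0, MIB, 4 * MIB + 10, 5 * MIB):
        print(off, locate(off, MIB, 4))
```

Under this model, processors working on different stripes of the same file are talking to different servers, which is what lets them proceed in parallel.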
Traditional parallel file systems coordinate transfers by means of extra processors added to the network. Instead of going straight to the disk spindles themselves, the cluster nodes send their read/write requests to these intermediate processors (Lustre calls them “Object Storage Servers”, or OSSs) which then do the actual disk transfers and pass the data back and forth to node memory. The Lustre OSSs have an additional function: assuring the integrity of the data. The OSSs deny write access to portions of a file that are already being written by another node and suspend read access to portions that are about to be written. In addition to the OSSs that coordinate transfers between nodes and established files, a Metadata Server node manages the efficient creation and deletion of files. If thousands of nodes simultaneously ask for a new temporary file apiece, it is the job of the Metadata Server to assure that they do not have to wait in line to get them.
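The integrity role described above can be illustrated with a toy model. The sketch below is a single-process table of byte ranges reserved for writing; it is not Lustre's actual locking protocol, and the class and method names (`ExtentLockTable`, `try_write_lock`, `can_read`, `release`) are hypothetical.

```python
from typing import List, Tuple


class ExtentLockTable:
    """Toy record of which byte ranges of one file are reserved for writing."""

    def __init__(self) -> None:
        # Each entry is (start, end, owner_node) for the half-open range [start, end).
        self._locks: List[Tuple[int, int, int]] = []

    @staticmethod
    def _overlaps(a_start: int, a_end: int, b_start: int, b_end: int) -> bool:
        return a_start < b_end and b_start < a_end

    def try_write_lock(self, node: int, start: int, end: int) -> bool:
        """Reserve [start, end) for writing; refuse if another node holds an overlap."""
        if start >= end:
            raise ValueError("start must be less than end")
        for s, e, owner in self._locks:
            if owner != node and self._overlaps(start, end, s, e):
                return False  # another node is already writing here
        self._locks.append((start, end, node))
        return True

    def can_read(self, node: int, start: int, end: int) -> bool:
        """A read is held up while a different node has the range reserved for writing."""
        return not any(
            owner != node and self._overlaps(start, end, s, e)
            for s, e, owner in self._locks
        )

    def release(self, node: int) -> None:
        """Drop every range held by the given node."""
        self._locks = [lock for lock in self._locks if lock[2] != node]


if __name__ == "__main__":
    table = ExtentLockTable()
    print(table.try_write_lock(node=1, start=0, end=100))   # granted
    print(table.try_write_lock(node=2, start=50, end=150))  # refused: overlaps node 1
    print(table.can_read(node=3, start=90, end=110))        # held up by node 1's range
```

Writers to disjoint ranges never contend in this model, which is why many nodes can update different parts of one file at once.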
Lustre on SiCortex is different because it is “inboard” rather than “outboard.” Instead of being an external file system in addition to the root file system used by the processing node kernels, SiCortex Lustre is the file system for the whole configuration. And rather than running on exogenous processors, SiCortex Lustre uses SC5832 and SC648 nodes directly as its OSSs. Most critically, instead of communicating with OSSs over an Ethernet or similar enabling network, SiCortex uses its very high bandwidth Kautz graph fabric.
Each SiCortex circuit module implements 27 nodes. (An SC5832 has 36 of these circuit modules; an SC648 has four.) Of the 27 nodes, three are linked to PCI EXPRESS® slots to which disk spindles can be interfaced. Any or all of these three can be Lustre OSSs, either on a dedicated basis or in addition to running applications.
On SiCortex systems, every node knows how to act as an OSS. The reason that only three of 27 act as OSSs to spindles is that they are the ones that have PCI EXPRESS slots. All 27, however, are connected to their own DIMMs, so all 27 can act as OSSs to portions of that memory set aside for FabriCache.
FabriCaches can be established in either of two ways. For intense interaction, a subset of SiCortex nodes can have all their available memory dedicated to FabriCache usage. The remaining compute nodes communicate with them on the Kautz graph fabric. A more interesting case is to set aside a portion of every node’s memory for FabriCache. Now all the nodes (972 for the SC5832, 108 for the SC648) are part-time OSSs.
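The node counts quoted above follow directly from 27 nodes per module, and the same arithmetic gives a rough size for a FabriCache under either setup. The sketch below works this through. The per-node amount of memory set aside is a free parameter here, since the text does not state one, and the function names are illustrative.

```python
from typing import Dict, Optional

NODES_PER_MODULE = 27       # from the text
PCIE_NODES_PER_MODULE = 3   # from the text: nodes with PCI Express slots
MODULES = {"SC5832": 36, "SC648": 4}  # from the text


def node_counts(model: str) -> Dict[str, int]:
    """Total nodes and disk-capable OSS nodes for a given system model."""
    modules = MODULES[model]
    return {
        "nodes": modules * NODES_PER_MODULE,
        "disk_capable_oss_nodes": modules * PCIE_NODES_PER_MODULE,
    }


def fabricache_bytes(model: str, reserved_bytes_per_node: int,
                     dedicated_nodes: Optional[int] = None) -> int:
    """Aggregate FabriCache size.

    dedicated_nodes=None models the case where every node sets aside
    reserved_bytes_per_node; an integer models a dedicated subset of nodes.
    reserved_bytes_per_node is an assumption supplied by the caller.
    """
    total = node_counts(model)["nodes"]
    contributing = total if dedicated_nodes is None else dedicated_nodes
    if not 0 <= contributing <= total:
        raise ValueError("dedicated_nodes must be between 0 and the node count")
    return contributing * reserved_bytes_per_node


if __name__ == "__main__":
    for model in MODULES:
        print(model, node_counts(model))
    # Hypothetical: every SC5832 node sets aside 1 GiB.
    print(fabricache_bytes("SC5832", 1 << 30) / (1 << 40), "TiB")
```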
The power of FabriCache is that, either way, compute nodes address it as just another file with a single address space, because that is what it is. The Lustre logic that is already in every node’s kernel has the responsibility of determining where in the file a read or write goes, and which OSS is responsible for that portion of the file. It sends the request over the fabric to that OSS node, and exchanges the data the same way. If the data happens to be in the node’s own portion of the FabriCache file, the node acts as its own OSS and the operation collapses to a memory copy. Of course, if another node has reserved that part of the file for writing, the operation may be held up for coherence reasons. In effect, the node acting as Lustre OSS pulls rank on the node acting as compute resource in order to assure the integrity of the data in the FabriCache. The key point is that the node acting as FabriCache OSS is transparent to the node acting as compute resource. No special code is needed to access the data. It is just a file. And no special code is needed to get extreme performance. Lustre does that automatically.
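Because the FabriCache is presented as an ordinary file, application code can use ordinary file operations. The sketch below shows each node writing a fixed-size record into its own slot of one shared file and reading it back. The record size and helper names are hypothetical, and a temporary file stands in for a file on a FabriCache mount, whose actual path is not given in the text.

```python
import os
import tempfile

RECORD_SIZE = 8  # hypothetical fixed-size record per node


def write_record(path: str, node_id: int, payload: bytes) -> None:
    """Write one node's record into its own slot of the shared file."""
    if len(payload) != RECORD_SIZE:
        raise ValueError("payload must be exactly RECORD_SIZE bytes")
    with open(path, "r+b") as f:  # ordinary read/write open, no special API
        f.seek(node_id * RECORD_SIZE)
        f.write(payload)


def read_record(path: str, node_id: int) -> bytes:
    """Read back the record stored in a given node's slot."""
    with open(path, "rb") as f:
        f.seek(node_id * RECORD_SIZE)
        return f.read(RECORD_SIZE)


if __name__ == "__main__":
    # A temporary file stands in for a file on a FabriCache mount.
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "shared.dat")
        with open(path, "wb") as f:
            f.write(bytes(4 * RECORD_SIZE))  # pre-size the file for 4 nodes
        write_record(path, 2, b"node0002")
        print(read_record(path, 2))
```

Nothing in this code knows which node's memory holds a given slot; under the model described above, that routing would be the file system's job.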