The SiCortex DMA Engines: Almost Like a Cluster Within a Cluster

The SiCortex DMA Engine

Each SiCortex node chip includes, in addition to its six 64-bit Linux cores, a DMA Engine that is capable of:

  • Cooperating with other DMA Engines to transfer data from one processor's virtual memory to another's.
  • Sending small amounts of data with very low latency and no operating system involvement.
  • Establishing connections and then sending or receiving large amounts of data without any application intervention (see diagram).
  • Doing all of this for multiple execution threads at once, and hence for multiple concurrent transfers, with separate state maintained for each.

To do all this, the DMA Engine executes its own microcoded instruction set, which has been tuned for the MPI library. This instruction set is quite general-purpose, and it includes the ability to queue operations on other DMA Engines. Because each DMA Engine executes its own instructions and can enlist the help of other DMA Engines, together they really do act like a cluster within a cluster.

The DMA Engine Command Set

Three DMA Engine instructions are used to do the heavy MPI lifting. They are:

Send Event: Immediately transmit a packet, with up to 112 bytes of data, to another node's DMA Engine, where it becomes available to the destination user-mode program.

Send Command: Transmit a command to a destination node where it will be executed by that node's DMA Engine.

Put Buffer: Send a sequence of packets to a destination node according to parameters contained within the DMA command. Put Buffer is a zero-copy operation.
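
To make the "sequence of packets" idea concrete, here is a minimal Python sketch of how a Put Buffer style transfer might split a source buffer into packets and land each payload directly in the destination buffer. It is only an illustration: the names Packet, put_buffer and deliver are hypothetical, and PACKET_PAYLOAD is an assumed value, since the real per-packet payload size is not given here. memoryview slices stand in for the zero-copy behaviour on the sending side.

```python
from dataclasses import dataclass

# Assumed per-packet payload size; the real hardware value is not stated in the text.
PACKET_PAYLOAD = 64


@dataclass
class Packet:
    offset: int          # where the payload belongs in the destination buffer
    payload: memoryview  # a view onto the source buffer, not a copy


def put_buffer(src: bytes, payload_size: int = PACKET_PAYLOAD) -> list:
    """Describe a source buffer as a sequence of packets without copying it."""
    view = memoryview(src)
    return [Packet(off, view[off:off + payload_size])
            for off in range(0, len(src), payload_size)]


def deliver(packets: list, dest: bytearray) -> None:
    """Write each payload into the destination buffer at its recorded offset."""
    for p in packets:
        dest[p.offset:p.offset + len(p.payload)] = p.payload


src = bytes(range(200))
dest = bytearray(len(src))
packets = put_buffer(src)
deliver(packets, dest)
```

Because every packet carries its own offset, the receiving side in this sketch needs no intermediate staging buffer: each payload is written once, straight into its final location.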

Notice that no separate "Get" command is needed. A DMA Engine that wants data gets it by queuing a "Put" command on the DMA Engine of the node that holds the data.
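
The "Get by remote Put" idea can be sketched as a toy Python model. This is not the SiCortex programming interface: the DmaEngine class, its method names, the dictionary used as a "fabric" and the named buffers are all hypothetical stand-ins. Only the three command names and the 112-byte event limit come from the description above, and combining Send Command with Put Buffer to implement a get is an assumption about how the pieces fit together.

```python
from collections import deque

MAX_EVENT_BYTES = 112  # Send Event payload limit given in the text


class DmaEngine:
    """Toy stand-in for one node's DMA Engine (all names are hypothetical)."""

    def __init__(self, node_id, fabric):
        self.node_id = node_id
        self.fabric = fabric     # dict mapping node_id -> DmaEngine
        self.memory = {}         # buffer name -> bytearray
        self.commands = deque()  # commands queued for this engine to execute
        self.events = deque()    # events visible to the user-mode program
        fabric[node_id] = self

    def send_event(self, dest, data):
        """Deliver a small payload to the destination node's event queue."""
        if len(data) > MAX_EVENT_BYTES:
            raise ValueError("event payload exceeds 112 bytes")
        self.fabric[dest].events.append((self.node_id, bytes(data)))

    def send_command(self, dest, command):
        """Queue a command for execution by the destination node's engine."""
        self.fabric[dest].commands.append(command)

    def put_buffer(self, dest, src_name, dst_name):
        """Push one of this node's buffers into the destination node's memory."""
        self.fabric[dest].memory[dst_name] = bytearray(self.memory[src_name])

    def get(self, owner, src_name, dst_name):
        """A 'Get' is a Put Buffer command queued on the engine that owns the data."""
        self.send_command(owner, ("put_buffer", self.node_id, src_name, dst_name))

    def run(self):
        """Execute every command currently queued on this engine."""
        while self.commands:
            op, *args = self.commands.popleft()
            if op == "put_buffer":
                self.put_buffer(*args)
            else:
                raise ValueError(f"unknown command: {op}")


fabric = {}
node0 = DmaEngine(0, fabric)
node1 = DmaEngine(1, fabric)
node1.memory["results"] = bytearray(b"payload held by node 1")

node0.get(owner=1, src_name="results", dst_name="inbox")  # queues a Put on node 1
node1.run()                                               # node 1 pushes the data back
node1.send_event(0, b"done")                              # small completion notice
```

In this model node 0 never reads node 1's memory itself. It only asks node 1's engine to push the buffer, which is why a dedicated "Get" operation is unnecessary.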