The Journey of a Thousand Teraflops

The coming generation of petaflops systems will strain this dichotomy to the breaking point. With their unprecedented processor counts, petaflops systems cry out for new approaches to applications, but their even higher price tags price out the experimentation that leads to those approaches.
SiCortex can help. The SC5832, with its thousands of $300 Linux processors and enormous bisection bandwidth, is the ideal, affordable launching pad for new peta-applications. Just as important as its high processor count is its rich array of open-source development tools and performance monitors.
SiCortex support for petaflops software development starts with pervasive instrumentation within each node chip. Each processor core is capable of recording its own instruction execution behavior, including cache hits/misses and stalls. At the node level, each chip monitors its shared L2 cache, PCI I/O transactions, and interprocessor communications. (For more details, see the Late Winter 2007 edition of 5832, available at www.SiCortex.com.) This hardware-level data is then made available to the SiCortex suite of development tools through the widely used PAPI subsystem.

The SiCortex tools strategy combines best-of-breed open-source tools with full interoperability. The tools work on unmodified codes and rapidly characterize hardware utilization, memory, I/O, and communications usage, as well as thread/task load balance. Advanced users have full access to the performance monitoring hardware through pfmon, a tool that communicates directly with the Perfmon2 kernel subsystem.
Perfmon2, Libpfm, and PAPI: While not typically used directly by applications developers, Perfmon2, Libpfm, and PAPI provide a consistent interface to the SiCortex hardware counters. They offer first- and third-person semantics for thread-centric counting and sampling.

The SiCortex “-ex” Interface Architecture: In order to make the performance monitoring experience as accessible as possible, SiCortex has designed a consistent set of commands that layer on top of standard tools.
Papiex/PAPI: Papiex is used to provide summary information such as memory footprint, percent of time in I/O, and percent of time in MPI. A typical Papiex run will produce upwards of 30 top-level run statistics. In short, Papiex derives meaningful statistics from the wealth of performance data that the node chip provides, giving a high-level view of how time is being spent within the processors.
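
As a rough illustration of what "deriving statistics" from raw counters means, the sketch below turns a handful of event counts into ratio-style figures. It is not Papiex code and does not reflect its actual output format. The event names follow PAPI preset naming (PAPI_TOT_CYC, PAPI_TOT_INS, PAPI_L1_DCA, PAPI_L1_DCM); the function name derive_summary and all counter values are invented for the example.

```python
# Illustrative sketch only: the counter values are made up, not measured
# on an SC5832, and derive_summary is a hypothetical helper, not a Papiex API.

def ratio(numerator, denominator):
    """Return numerator/denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def derive_summary(counters):
    """Turn raw event counts into a few high-level, ratio-style statistics."""
    cycles = counters["PAPI_TOT_CYC"]        # total cycles
    instructions = counters["PAPI_TOT_INS"]  # instructions completed
    l1_accesses = counters["PAPI_L1_DCA"]    # L1 data cache accesses
    l1_misses = counters["PAPI_L1_DCM"]      # L1 data cache misses
    return {
        "ipc": ratio(instructions, cycles),
        "l1d_miss_rate": ratio(l1_misses, l1_accesses),
        "l1d_misses_per_1k_ins": 1000.0 * ratio(l1_misses, instructions),
    }


# Hypothetical counts for a single thread.
sample = {
    "PAPI_TOT_CYC": 2_000_000,
    "PAPI_TOT_INS": 1_000_000,
    "PAPI_L1_DCA": 400_000,
    "PAPI_L1_DCM": 20_000,
}

summary = derive_summary(sample)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

A real summary tool would collect these counts per thread from the hardware and report many more statistics; the point here is only that each top-level figure is a simple function of a few raw events.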

Mpiex/mpiP: Mpiex utilizes the LLNL mpiP package to characterize MPI load balance, MPI function profile, message size distribution, and call site information.
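
To make "load balance" and "message size distribution" concrete, here is a minimal sketch of how such figures could be computed from per-rank timings and a list of message sizes. It does not use or imitate the mpiP API or report format. The function names, the max-over-mean imbalance measure, the power-of-two bucketing, and every number are assumptions chosen for illustration.

```python
# Illustrative sketch only: hypothetical helpers and invented data,
# not mpiP or Mpiex code.
from collections import Counter


def mpi_load_balance(mpi_seconds, app_seconds):
    """Return per-rank MPI time fractions and a max/mean imbalance factor.

    An imbalance of 1.0 means every rank spent the same time in MPI;
    larger values mean at least one rank waited noticeably longer.
    """
    fractions = [m / a for m, a in zip(mpi_seconds, app_seconds)]
    mean_mpi = sum(mpi_seconds) / len(mpi_seconds)
    imbalance = max(mpi_seconds) / mean_mpi if mean_mpi else 0.0
    return fractions, imbalance


def message_size_histogram(sizes_bytes):
    """Count messages per bucket, where a bucket is the smallest power of two >= size."""
    hist = Counter()
    for size in sizes_bytes:
        bucket = 1 if size <= 1 else 1 << (size - 1).bit_length()
        hist[bucket] += 1
    return dict(sorted(hist.items()))


# Hypothetical four-rank run: seconds spent in MPI and total seconds per rank.
mpi_seconds = [2.0, 2.0, 4.0, 2.0]
app_seconds = [10.0, 10.0, 10.0, 10.0]
fractions, imbalance = mpi_load_balance(mpi_seconds, app_seconds)

# Hypothetical message sizes in bytes.
hist = message_size_histogram([8, 9, 1000, 1024, 64])

print("MPI fraction per rank:", fractions)
print("imbalance (max/mean):", imbalance)
print("message size buckets:", hist)
```

In this made-up run, rank 2 spends twice as long in MPI as its peers, which is the kind of outlier a load-balance report is meant to expose.
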
Ioex: Ioex, based on concepts from IOtrack, developed at PDC/KTH, characterizes the I/O behavior and performance of a high-processor count application.

Hpcex/HPCToolkit: Hpcex, based on HPCToolkit from Rice University, produces statistical profiles without the need for user-coded instrumentation. It can profile by load module, file, function, line, and even instruction.
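
The idea of one set of samples viewed at several granularities can be sketched as follows. This is not HPCToolkit code: the sample tuples, file and function names, and the aggregate_samples helper are all hypothetical, and a real statistical profiler obtains its samples from timer or counter-overflow interrupts rather than from a hand-written list.

```python
# Illustrative sketch only: invented samples and a hypothetical helper,
# not HPCToolkit or Hpcex code.
from collections import Counter


def aggregate_samples(samples):
    """Roll (module, file, function, line) samples up to each granularity."""
    levels = {
        "module": Counter(),
        "file": Counter(),
        "function": Counter(),
        "line": Counter(),
    }
    for module, file, function, line in samples:
        levels["module"][module] += 1
        levels["file"][(module, file)] += 1
        levels["function"][(module, file, function)] += 1
        levels["line"][(module, file, function, line)] += 1
    return levels


# Each tuple stands for one point where a periodic sample caught the program.
samples = [
    ("app", "solver.c", "relax", 42),
    ("app", "solver.c", "relax", 42),
    ("app", "solver.c", "relax", 47),
    ("app", "io.c", "write_grid", 10),
    ("libm.so", "exp.c", "exp", 5),
]

profile = aggregate_samples(samples)
total = len(samples)
for key, count in profile["function"].most_common():
    print(f"{key}: {100.0 * count / total:.0f}% of samples")
```

Because the samples are taken without touching the source, the same data can answer "which module?" and "which line?" without rerunning the application.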

Gptlex/GPTL: Gptlex controls the behavior of GPTL, developed at NCAR, and adds support for automatic compiler instrumentation in GCC and Fortran on SiCortex systems.

Tauex/TAU: Tauex provides a consistent interface to TAU, the widely adopted parallel performance profiling environment from the University of Oregon and ParaTools, Inc. It supports parallel profiling, tracing, and high-level 2D and 3D visualization. (Detailed information about TAU is available at www.cs.uoregon.edu/research/tau/home.php.)

Vampir: Vampir is a powerful visualization tool for temporal performance data that scales to trace data volumes in excess of 40 GBytes. (Detailed information about Vampir is available at www.vampir.eu.)