<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0"><channel><title>SiCortex Newsletter Articles</title><link>http://sicortex.com</link><description></description><language>en-US</language><item><pubDate>Fri, 29 Feb 2008 15:24:26 GMT</pubDate><title>From www.xconomy.com — Peddle Power: MIT Cyclocross Team Promotes Alternative Energy, Low-Power Computing </title><link>http://sicortex.com/news_events/5832_newsletter/it_s_actually_easier_being_green/from_www_xconomy_com_peddle_power_mit_cyclocross_team_promotes_alternative_energy_low_power_computing</link><description>
&lt;div class="object-left"&gt;&lt;div class="content-view-embeddedmedia"&gt;
&lt;div class="class-image"&gt;

&lt;div class="attribute-image"&gt;
&lt;p&gt;      
    
        
    
                &lt;img src="/var/ezwebin_site/storage/images/media/images/mit_cyclocross/3112-1-eng-US/mit_cyclocross_medium.jpg" width="200" height="150"  style="border: 0px;" alt="" title="" /&gt;
		
    
    
    
      &lt;/p&gt;
&lt;/div&gt;

&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;....the bikes were used to power machines made by SiCortex, of Maynard, MA, a venture-funded startup (investors include Flagship Ventures, Polaris Venture Partners, and Prism VentureWorks, along with Chevron and JK&amp;B Capital) that specializes in low-powered supercomputers. To give you an idea of how low-powered, CEO John Mucci says the chip in his supercomputer, with six processors, uses about eight watts of power. The chip in my laptop, he told me, takes almost 100 watts. Ouch. &lt;/p&gt;

&lt;p&gt;
Read the full story at http://www.xconomy.com/2007/12/11/peddle-power-mit-cyclocross-team-promotes-alternative-energy-low-power-computing/ &lt;/p&gt;
</description></item><item><pubDate>Fri, 29 Feb 2008 15:18:37 GMT</pubDate><title>Five Ways To Reduce the Footprint Of An HPC Application</title><link>http://sicortex.com/news_events/5832_newsletter/it_s_actually_easier_being_green/five_ways_to_reduce_the_footprint_of_an_hpc_application</link><description>&lt;a name="eztoc3090_0_0_1" id="eztoc3090_0_0_1"&gt;&lt;/a&gt;&lt;h3&gt;0. Minimize Runtime (duh)&lt;/h3&gt;
&lt;p&gt;No HPC user needs to be reminded to do this. The thirst for faster runs is insatiable, and developers work hard to make their codes run faster. There remain, however, options at runtime. Computing to a higher-than-needed accuracy, for example, increases runtime, perhaps to the third power, and hence increases footprint. &lt;/p&gt;
&lt;a name="eztoc3090_0_0_1" id="eztoc3090_0_0_1"&gt;&lt;/a&gt;&lt;h3&gt;1. Eschew Peak&lt;/h3&gt;
&lt;p&gt;It has long been noted that “peak is meaningless,” in that knowing the peak speed of a processor tells little about the sustained performance that an application will actually get. Peak is, however, a meaningful indicator of the amount of power that will be consumed during a program run. It is the sustained performance as a percent of peak that determines the footprint of an application. An application that runs at twice the percentage of peak on processors with half the peak has half the footprint. Or maybe a quarter the footprint, since power goes up more than linearly with clock speed. &lt;/p&gt;
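&lt;p&gt;The arithmetic behind this claim can be sketched in a few lines of Python. This is an illustrative model only, not a SiCortex tool: the function name, the workload size, the two peak speeds, and the assumption that power scales as peak speed raised to a fixed exponent are all hypothetical.&lt;/p&gt;
&lt;pre&gt;
```python
import math

def footprint(work_flop, peak_flops, fraction_of_peak, power_exponent=1.0):
    """Energy for one run in arbitrary units, assuming power = peak ** exponent."""
    sustained_flops = peak_flops * fraction_of_peak   # what the application really gets
    runtime_s = work_flop / sustained_flops           # time to completion
    power = peak_flops ** power_exponent              # assumed power model
    return power * runtime_s

work = 1.0e15  # hypothetical workload, in floating point operations

# A fast processor at 5% of peak versus a half-speed processor at 10% of peak.
fast = footprint(work, peak_flops=4.0e9, fraction_of_peak=0.05)
slow = footprint(work, peak_flops=2.0e9, fraction_of_peak=0.10)
print("linear power model, footprint ratio:", slow / fast)

# The same comparison if power grows with the square of peak speed.
fast_sq = footprint(work, 4.0e9, 0.05, power_exponent=2.0)
slow_sq = footprint(work, 2.0e9, 0.10, power_exponent=2.0)
print("quadratic power model, footprint ratio:", slow_sq / fast_sq)
```
&lt;/pre&gt;
&lt;p&gt;Under the linear model the ratio comes out at about one half, and under the quadratic model at about one quarter, which is the range the paragraph above describes.&lt;/p&gt;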
&lt;a name="eztoc3090_0_0_1" id="eztoc3090_0_0_1"&gt;&lt;/a&gt;&lt;h3&gt;2. Just Communicate&lt;/h3&gt;
&lt;p&gt;For the past ten years, Linux/MPI applications have been headed in the opposite direction, accepting additional arithmetic to avoid doing communications. A simple example: many applications recompute known values from scratch to avoid doing a lookup in a large table that requires communications. SiCortex systems can send individual values from tables as big as a terabyte in time comparable to a floating point operation. Applications that look up values have a correspondingly smaller footprint.&lt;/p&gt;
&lt;a name="eztoc3090_0_0_1" id="eztoc3090_0_0_1"&gt;&lt;/a&gt;&lt;h3&gt;3. Practice Spin Control&lt;/h3&gt;
&lt;p&gt;Modern HPC applications deal with enormous amounts of data that must be shared amongst large numbers of processors. Traditionally, this has meant large arrays of disk drives, often hundreds of them. Keeping all those disks spinning is the worst of all worlds: slow access times and high energy use. Clusters with large amounts of main memory and the right access methods can keep all that data cached, reducing time-to-completion and lowering the disk energy footprint to boot.&lt;/p&gt;
&lt;a name="eztoc3090_0_0_1" id="eztoc3090_0_0_1"&gt;&lt;/a&gt;&lt;h3&gt;4. Keep the Offense On the Field&lt;/h3&gt;
&lt;p&gt;
When an application is moving toward completion, it is playing offense. When it stops to do a checkpoint to protect itself against unreliable hardware, it is playing defense, and bloating its footprint.&lt;br /&gt;
Perversely, the more energy a processor chip uses, the hotter and therefore the less reliable it is, forcing more frequent checkpoints. It is not unheard of for applications to spend half their run time checkpointing the other half. &lt;br /&gt;Applications that run on inherently reliable hardware can potentially halve their footprint, and their time-to-completion, by keeping the offense on the field the whole game.&lt;/p&gt;
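&lt;p&gt;A rough sketch of the checkpoint arithmetic, in Python. It assumes constant power draw, so that footprint is proportional to wall-clock time, and it treats checkpointing as a fixed fraction of the run. Both assumptions, and the function names, are illustrative only.&lt;/p&gt;
&lt;pre&gt;
```python
def wall_clock_hours(useful_hours, checkpoint_fraction):
    """Total run time when a fixed fraction of wall-clock time goes to checkpoints."""
    return useful_hours / (1.0 - checkpoint_fraction)

def relative_footprint(checkpoint_fraction, baseline_fraction=0.0):
    """Footprint relative to a baseline run, assuming constant power draw."""
    return wall_clock_hours(1.0, checkpoint_fraction) / wall_clock_hours(1.0, baseline_fraction)

# 100 hours of useful computation at several checkpoint overheads.
for fraction in (0.0, 0.1, 0.25, 0.5):
    print(fraction, wall_clock_hours(100.0, fraction), relative_footprint(fraction))
```
&lt;/pre&gt;
&lt;p&gt;With half the run time spent checkpointing, 100 useful hours take 200 hours of wall-clock time, so removing the checkpoints halves both the time-to-completion and the footprint.&lt;/p&gt;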
</description></item><item><pubDate>Mon, 23 Jul 2007 19:58:19 GMT</pubDate><title>The Payoff for Petaflops</title><link>http://sicortex.com/news_events/5832_newsletter/a_thousand_teraflops/the_payoff_for_petaflops</link><description>
&lt;p&gt;
Now comes the petaflops, and with it the high processor counts that make for a radically different computing milieu. Algorithms can no longer depend on their author’s knowing what individual processors are doing; there are simply too many of them. Adaptive approaches, ones that figure out on the fly how best to organize the work, are sure to become more important, as are algorithms inspired by biology. These algorithms, however, need extensive testing on a wide range of data sets to confirm that they perform appropriately.&lt;br /&gt;Designers of these new approaches need a cost-effective and highly-transparent computing resource in which to develop their codes. The SiCortex combination of low-cost processors and a rich library of monitoring software makes it the natural birthplace for this new generation of petaflops software.&lt;/p&gt;
</description></item><item><pubDate>Mon, 23 Jul 2007 15:29:40 GMT</pubDate><title>The Journey of a Thousand Teraflops</title><link>http://sicortex.com/news_events/5832_newsletter/a_thousand_teraflops/the_journey_of_a_thousand_teraflops</link><description>
&lt;p&gt;
The coming generation of petaflops systems will strain this dichotomy to the breaking point. With their unprecedented processor counts, petaflops systems cry out for new applications approaches, but, with their even higher price tags, they price out the experimentation that is the path to those new approaches.&lt;br /&gt;
SiCortex can help. The SC5832, with its thousands of $300 Linux processors and enormous bisection bandwidth, is the ideal, affordable, launching pad for new peta-applications. Just as important as its high processor count is its rich array of open-source development tools and performance monitors.&lt;br /&gt;SiCortex support for petaflops software development starts with pervasive instrumentation within each node chip. Each processor core is capable of recording its own instruction execution behavior, including cache hits/misses and stalls. At the node level, each chip monitors its shared L2 cache, PCI I/O transactions, and interprocessor communications. (For more details, see the Late Winter 2007 edition of 5832, available at www.SiCortex.com.) This hardware-level data is then made available to the SiCortex suite of development tools through the widely used PAPI subsystem.&lt;/p&gt;

&lt;p&gt;
The SiCortex tools strategy combines best-of-breed open-source tools with full interoperability. The tools work on unmodified codes and provide rapid characterization of hardware utilization, memory, I/O and communications usages, and thread/task load balancing. Advanced users can have full access to the performance monitoring hardware through pfmon, a tool that communicates directly with the Perfmon2 kernel subsystem.&lt;br /&gt;&lt;b&gt;Perfmon2, Libpfm, and PAPI&lt;/b&gt;: While not typically used directly by applications developers, Perfmon2, Libpfm, and PAPI provide a consistent interface to the SiCortex hardware counters. They offer first- and third-person semantics for thread-centric counting and sampling. &lt;/p&gt;

&lt;div class="object-center"&gt;&lt;div class="content-view-embeddedmedia"&gt;
&lt;div class="class-image"&gt;

&lt;div class="attribute-image"&gt;
&lt;p&gt;      
    
        
    
                &lt;img src="/var/ezwebin_site/storage/images/media/images/sicortex_library/1776-1-eng-US/sicortex_library_medium.jpg" width="200" height="150"  style="border: 0px;" alt="" title="" /&gt;
		
    
    
    
      &lt;/p&gt;
&lt;/div&gt;

&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;
&lt;b&gt;The SiCortex “-ex” Interface Architecture&lt;/b&gt;: In order to make the performance monitoring experience as accessible as possible, SiCortex has designed a consistent set of commands that layer on top of standard tools.&lt;br /&gt;&lt;b&gt;Papiex/PAPI:&lt;/b&gt; Papiex is used to provide summary information such as memory footprint, percent of time in I/O, and percent of time in MPI. A typical Papiex run will produce upwards of 30 top-level run statistics. In short, Papiex derives meaningful statistics from the wealth of performance data that the node chip provides, giving a high-level view of how time is being spent within the processors.&lt;/p&gt;

&lt;p&gt;
&lt;b&gt;Mpiex/mpiP&lt;/b&gt;: Mpiex uses the LLNL mpiP package to characterize MPI load balance, MPI function profile, message size distribution, and call site information.&lt;br /&gt;&lt;b&gt;Ioex:&lt;/b&gt; Ioex, based on concepts from IOtrack, developed at PDC/KTH, characterizes the I/O behavior and performance of a high-processor-count application.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Hpcex/HPCToolkit:&lt;/b&gt; Hpcex, based on the HPCToolkit from Rice University, produces statistical profiles without the need for user-coded instrumentation. It can profile by load module, file, function, line, and even instruction.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Gptlex/GPTL&lt;/b&gt;: Gptlex controls the behavior of GPTL, developed at NCAR, and adds support for automatic compiler instrumentation in GCC and Fortran on SiCortex systems.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Tauex/TAU&lt;/b&gt;: Tauex provides a consistent interface to TAU, the widely adopted parallel performance profiling environment from the University of Oregon and Paratools, Inc. It supports parallel profiling, tracing, and high-level 2D and 3D visualization. (Detailed information about TAU is available at: www.cs.uoregon.edu/research/tau/home.php).&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Vampir&lt;/b&gt;: Vampir is a powerful visualization tool for temporal performance data that scales to trace data volumes in excess of 40 GBytes. (Detailed information about Vampir is available at: www.vampir.eu)&lt;/p&gt;
</description></item><item><pubDate>Mon, 23 Jul 2007 12:32:39 GMT</pubDate><title>The Payoff For Applications</title><link>http://sicortex.com/news_events/5832_newsletter/the_sicortex_dma_engines/the_payoff_for_applications</link><description>
&lt;p&gt;Users with extreme performance requirements on dedicated applications can go further: investigating the possibilities for extending the standard SiCortex MPI library to provide more targeted support. For example, the ability of each DMA Engine to inject commands for execution by other DMA Engines is a natural fit for functions that many applications now carry out laboriously in user software.&lt;/p&gt;
&lt;a name="eztoc1738_0_0_0_1" id="eztoc1738_0_0_0_1"&gt;&lt;/a&gt;&lt;h4&gt;Collective Wisdom&lt;/h4&gt;
&lt;p&gt;This newsletter has focused on the DMA Engine's role in node-to-node data transfers. As you might expect, it also has extensive facilities for accelerating so-called "collective" operations that involve large numbers of processors. Fast collectives are essential for high processor count computing. A future edition of the newsletter will discuss collectives in detail.&lt;/p&gt;
</description></item><item><pubDate>Fri, 20 Jul 2007 20:19:29 GMT</pubDate><title>The SiCortex DMA Engines: Almost Like a Cluster Within a Cluster</title><link>http://sicortex.com/news_events/5832_newsletter/the_sicortex_dma_engines/the_sicortex_dma_engines_almost_like_a_cluster_within_a_cluster</link><description>&lt;a name="eztoc1729_0_0_0_1" id="eztoc1729_0_0_0_1"&gt;&lt;/a&gt;&lt;h4&gt;The SiCortex DMA Engine&lt;/h4&gt;
&lt;div class="object-center"&gt;&lt;div class="content-view-embeddedmedia"&gt;
&lt;div class="class-image"&gt;

&lt;div class="attribute-image"&gt;
&lt;p&gt;      
    
        
    
                &lt;img src="/var/ezwebin_site/storage/images/media/images/sicortex_layer__1/1759-1-eng-US/sicortex_layer_medium.jpg" width="200" height="124"  style="border: 0px;" alt="" title="" /&gt;
		
    
    
    
      &lt;/p&gt;
&lt;/div&gt;

&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Each SiCortex node chip includes, in addition to its six 64-bit Linux cores, a DMA Engine that is capable of:&lt;/p&gt;

&lt;ul&gt;

&lt;li&gt;Cooperating with other DMA Engines to transfer data from one processor's virtual memory to another's.&lt;/li&gt;

&lt;li&gt;Sending small amounts of data with very low latency and no operating system involvement.&lt;/li&gt;

&lt;li&gt;Establishing connections and then sending or receiving large amounts of data without any application intervention (see diagram).&lt;/li&gt;

&lt;li&gt;Doing all of this for multiple execution threads, and hence for multiple ongoing transfers, at a time, with separate state maintained for each.&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;In order to do all this, the DMA Engine executes its own microcoded instruction set that has been tuned on behalf of the MPI Library. This instruction set is quite general-purpose, and features the ability to queue operations on other DMA Engines. Because each DMA Engine executes its own instructions and can invoke the help of other DMA Engines, they really do act like a cluster inside of a cluster.&lt;/p&gt;

&lt;div class="object-center"&gt;&lt;div class="content-view-embeddedmedia"&gt;
&lt;div class="class-image"&gt;

&lt;div class="attribute-image"&gt;
&lt;p&gt;      
    
        
    
                &lt;img src="/var/ezwebin_site/storage/images/media/images/block_diagram/1755-1-eng-US/block_diagram_medium.jpg" width="200" height="91"  style="border: 0px;" alt="" title="" /&gt;
		
    
    
    
      &lt;/p&gt;
&lt;/div&gt;

&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;a name="eztoc1729_0_0_0_1" id="eztoc1729_0_0_0_1"&gt;&lt;/a&gt;&lt;h4&gt;The DMA Engine Command Set&lt;/h4&gt;
&lt;p&gt;Three DMA Engine instructions are used to do the heavy MPI lifting. They are:&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Send Event&lt;/b&gt;: Immediately transmit a packet, with up to 112 bytes of data, to another node's DMA engine, where it is available to the destination user-mode program.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Send Command&lt;/b&gt;: Transmit a command to a destination node where it will be executed by that node's DMA Engine.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Put Buffer&lt;/b&gt;: Send a sequence of packets to a destination node according to parameters contained within the DMA command. Put buffer is a zero-copy operation.&lt;/p&gt;

&lt;p&gt;Notice that no separate "Get" command is needed. A DMA engine that wants data gets it by queuing a "Put" command in the DMA Engine of the node that has it.&lt;/p&gt;
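&lt;p&gt;The way a "Get" falls out of Send Command plus Put Buffer can be illustrated with a toy model in Python. Everything here is hypothetical: the class, its method names, and the dictionary standing in for the fabric are invented for illustration and bear no relation to the real microcode or to the SiCortex MPI library. Only the 112-byte event limit comes from the text above.&lt;/p&gt;
&lt;pre&gt;
```python
from collections import deque

MAX_EVENT_BYTES = 112  # payload limit for Send Event, from the text


class ToyDmaEngine:
    """Hypothetical stand-in for one node's DMA Engine; not the real microcode."""

    def __init__(self, node_id, fabric):
        self.node_id = node_id
        self.fabric = fabric      # dict mapping node id to engine
        self.memory = {}          # buffer name mapped to bytes
        self.events = deque()     # small messages visible to the user program
        self.commands = deque()   # commands queued by this or other engines
        fabric[node_id] = self

    def send_event(self, dest, payload):
        """Send Event: deliver a small payload to the destination's event queue."""
        if len(payload) not in range(MAX_EVENT_BYTES + 1):
            raise ValueError("an event carries at most 112 bytes")
        self.fabric[dest].events.append((self.node_id, payload))

    def send_command(self, dest, command):
        """Send Command: queue a command for execution by the destination engine."""
        self.fabric[dest].commands.append(command)

    def put_buffer(self, dest, src_name, dest_name):
        """Put Buffer: copy a named buffer into the destination node's memory."""
        self.fabric[dest].memory[dest_name] = self.memory[src_name]

    def run(self):
        """Execute every queued command in arrival order."""
        while self.commands:
            op, args = self.commands.popleft()
            getattr(self, op)(*args)

    def get_buffer(self, owner, src_name, dest_name):
        """A 'Get' built by queuing a Put on the engine of the node that has the data."""
        self.send_command(owner, ("put_buffer", (self.node_id, src_name, dest_name)))


fabric = {}
a = ToyDmaEngine("A", fabric)
b = ToyDmaEngine("B", fabric)
b.memory["table"] = b"values held on node B"

a.get_buffer("B", "table", "table_copy")   # A queues a Put on B's engine
b.run()                                    # B's engine executes it
print(a.memory["table_copy"])
```
&lt;/pre&gt;
&lt;p&gt;In this sketch node A never reads B's memory itself. It only queues work on B's engine, which is the pattern the paragraph above describes.&lt;/p&gt;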
</description></item><item><pubDate>Mon, 14 May 2007 21:52:13 GMT</pubDate><title>The Payoff For Applications</title><link>http://sicortex.com/news_events/5832_newsletter/the_unexamined_program/the_payoff_for_applications</link><description>
&lt;p&gt;In the early days of supercomputing there was a very direct and beneficial link of understanding between computational scientists and their hardware. The scientists knew their vector pipes and architects like Seymour Cray knew their inner loops. Each could help the other to maximize the answers per day from the machines. &lt;/p&gt;

&lt;p&gt;One of the reasons that the efficiency of today’s scientific codes is so low (often a few percent or less) is that this linkage of understanding has been broken. Chip designers do not understand floating point codes because that is not what they design for. Scientists have lost track of the innards of complex out-of-order nodes made up of chips from multiple vendors. &lt;/p&gt;

&lt;p&gt;SiCortex is working to make its hardware more transparent and accessible to the scientists who rely on it. Where we can, we are eliminating the need to worry about what the hardware is doing. The Kautz graph fabric of the SC5832 and SC648, for example, has such a flat performance curve that there is rarely any reason to lay out data in any but the most natural way. &lt;/p&gt;

&lt;p&gt;Where the behavior of the hardware does matter to the application, SiCortex is making it easy to monitor the inner workings, not just of the processors, but of the whole system. Often the true performance issues lurk in places other than the apparent ones. &lt;/p&gt;

&lt;p&gt;In the past 10 years the transparency of software such as Linux has accelerated user involvement and interaction. Now SiCortex is expanding that transparency to hardware, making it easier to develop applications that run efficiently on the hardware, and easier to develop hardware that fits those applications. &lt;/p&gt;
</description></item><item><pubDate>Mon, 14 May 2007 21:46:09 GMT</pubDate><title>"The unexamined program is not worth running."</title><link>http://sicortex.com/news_events/5832_newsletter/the_unexamined_program/the_unexamined_program_is_not_worth_running</link><description>
&lt;div class="object-center"&gt;&lt;div class="content-view-embeddedmedia"&gt;
&lt;div class="class-image"&gt;

&lt;div class="attribute-image"&gt;
&lt;p&gt;      
    
        
    
                &lt;img src="/var/ezwebin_site/storage/images/media/images/steth_o_chip/1497-1-eng-US/steth_o_chip_small.jpg" width="100" height="102"  style="border: 0px;" alt="" title="" /&gt;
		
    
    
    
      &lt;/p&gt;
&lt;/div&gt;

&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Each node within a SiCortex system has its own central monitoring circuitry which can selectively sample more than 1200 activities around the chip. Because the data comes together in one place, it is possible to learn about dependencies as well as about individual event counts.&lt;/p&gt;

&lt;div class="object-center newsletter_expandable_with_caption"&gt;&lt;div class="content-view-embeddedmedia"&gt;
&lt;div class="class-image"&gt;

&lt;div class="attribute-image"&gt;
&lt;p&gt;      
    
        
    
            &lt;a href="/media/images/exploded_chip" class="greybox" target="_self" onclick="return GB_showCenter('Sicortex', this.href, 570, 570)"&gt;
                    &lt;img src="/var/ezwebin_site/storage/images/media/images/exploded_chip/1601-1-eng-US/exploded_chip_medium.gif" width="193" height="200"  style="border: 0px;" alt="" title="" /&gt;
	&lt;/a&gt;	
    
    
    
      &lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;</description></item><item><pubDate>Mon, 14 May 2007 21:37:37 GMT</pubDate><title>The Payoff For Applications</title><link>http://sicortex.com/news_events/5832_newsletter/just_plane_fast/the_payoff_for_applications</link><description>
&lt;p&gt;Three characteristics become obvious as you experience the Kautz graph visualizations: &lt;/p&gt;

&lt;p&gt;1. Performance remains very high even as the number of nodes approaches a thousand. &lt;/p&gt;

&lt;p&gt;2. The “response curve” of the network is essentially flat, meaning that the time it takes to get to the most distant node is almost the same as the time to the closest. (Each hop costs only 30 ns.)&lt;/p&gt;

&lt;p&gt;3. The Kautz graph’s enormous bisection bandwidth allows it to power through collective operations such as broadcast without degradation. &lt;/p&gt;

&lt;p&gt;All three of these characteristics are essential to the high processor count applications that will dominate technical computing in the years to come. &lt;/p&gt;

&lt;p&gt;The first key to increasing the performance of any Linux/MPI application is to run it on more processors, provided the network scales cost-effectively. The SiCortex Kautz graph achieves this goal by integrating the switching within processor nodes and putting the wires in a backplane. &lt;/p&gt;

&lt;p&gt;A second key to increasing performance is often to use more sophisticated, and perhaps adaptive, data structures. That is why so many mainstream applications are now moving from fixed grids to dynamic ones. The flat response curve of the Kautz network assures that performance remains high even as the data moves around unpredictably in memory. &lt;/p&gt;

&lt;p&gt;And it is well known among computational scientists that the performance of collectives is often the limiting factor in scalability. The Kautz graph addresses this issue by offering enormous bisection bandwidth and supporting microcode-accelerated collectives. &lt;/p&gt;

&lt;p&gt;Ultimately, the payoff for applications is that you can henceforth forget about your SiCortex system’s Kautz graph and implement your application in the way that best fits the algorithm. The Kautz graph will perform well regardless.&lt;/p&gt;
</description></item><item><pubDate>Mon, 14 May 2007 21:34:32 GMT</pubDate><title>Just Plane Fast</title><link>http://sicortex.com/news_events/5832_newsletter/just_plane_fast/just_plane_fast</link><description>
&lt;p&gt;A small Kautz graph is easy to understand. Each node in a degree-three graph has three wires out and three wires back in. There are obvious symmetries, albeit not the traditional ones. For example, the path from node A to node B is not the same as the path from node B back to node A. It is known that the number of nodes in a Kautz graph is exponential in the diameter, but for a true understanding, it helps enormously to be able to see what is going on.&lt;/p&gt;
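&lt;p&gt;For readers who prefer code to pictures, here is a minimal Python sketch that builds a Kautz graph from its textbook definition and measures it. It uses the standard construction (nodes are words over degree + 1 symbols with no two adjacent symbols equal; an edge shifts the word left and appends a new symbol). The function names are invented for illustration. The choice of word length 6 for the large case is an assumption, made because it yields 4 × 3^5 = 972 nodes, the node count of the SC5832.&lt;/p&gt;
&lt;pre&gt;
```python
from collections import deque
from itertools import product


def kautz_graph(degree, length):
    """Return {node: [successors]} for words of `length` symbols, no two adjacent equal."""
    alphabet = range(degree + 1)
    nodes = [w for w in product(alphabet, repeat=length)
             if all(x != y for x, y in zip(w, w[1:]))]
    # An edge drops the first symbol and appends any symbol that
    # differs from the current last one.
    return {w: [w[1:] + (s,) for s in alphabet if s != w[-1]] for w in nodes}


def hops_from(graph, start):
    """Breadth-first search: hop count from `start` to every reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def diameter(graph):
    """Largest hop count between any ordered pair of nodes."""
    return max(max(hops_from(graph, n).values()) for n in graph)


small = kautz_graph(3, 2)   # a 12-node graph
big = kautz_graph(3, 6)     # the 972-node case
print(len(small), diameter(small))
print(len(big), diameter(big))

# The route from a to b is not the route from b back to a.
a, b = (0, 1), (1, 2)
print(hops_from(small, a)[b], hops_from(small, b)[a])
```
&lt;/pre&gt;
&lt;p&gt;Every node has exactly three successors, and the hop counts in the two directions between a pair of nodes need not match, which is the asymmetry noted above.&lt;/p&gt;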

&lt;div class="object-center"&gt;&lt;div class="content-view-embeddedmedia"&gt;
&lt;div class="class-image"&gt;

&lt;div class="attribute-image"&gt;
&lt;p&gt;      
    
        
    
                &lt;img src="/var/ezwebin_site/storage/images/media/images/kautz_12/1605-1-eng-US/kautz_12.gif" width="334" height="326"  style="border: 0px;" alt="" title="" /&gt;
		
    
    
    
      &lt;/p&gt;
&lt;/div&gt;

&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Suggestions for further reading: Bermond, J.-C., and Peyrat, C., “de Bruijn and Kautz Networks: a competitor for the hypercube?” in Andre, F., and Verjus, J.P., eds., Hypercube and Distributed Computers, North Holland, 1989. Banerjee, Subrata, et al., “Regular Multihop Logical Topologies for Lightwave Networks”, in IEEE Communications Surveys, Vol. 2, No. 1, First Quarter 1999. &lt;/p&gt;
&lt;a name="eztoc1474_0_0_0_1" id="eztoc1474_0_0_0_1"&gt;&lt;/a&gt;&lt;h4&gt;Now It Is Your Turn &lt;/h4&gt;
&lt;p&gt;The SiCortex SC5832 and SC648 are the first clusters to be designed from a clean sheet of silicon, but they could not have been designed on a clean sheet of paper. In a system with thousands of interconnected processors, there is too much going on. You can see that for yourself using the Kautz Graph Viewer on the accompanying CD. Take it all the way up to 972 nodes to see how quickly a Kautz graph can get from anywhere to anywhere. Click on any node to start a broadcast and see how fast it can get from anywhere to everywhere. Treat it as a one-dimensional row of processors or a two-dimensional box. The ability to visualize the backplane topology from a variety of points of view was crucial to the overall systems design, which involved integrating the behavior of the Kautz graph with the behavior of the nodes and the applications software running on those nodes. It was the partnership of experienced architect and computer visualization that made it possible to take this significant step forward in network design. Of course, when it came time to work out the physical wiring of the backplane, the architects had to stand aside. They could not compete with the efficiency of the computer’s genetic algorithm.&lt;/p&gt;
</description></item><item><pubDate>Mon, 14 May 2007 21:21:15 GMT</pubDate><title>The Payoff for Applications</title><link>http://sicortex.com/news_events/5832_newsletter/the_sicortex_fabricache/the_payoff_for_applications</link><description>
&lt;p&gt;Every HPC application faces the temptation to employ heroic measures in order to eke out performance. A common example is the use of local disks on conventional cluster nodes. Used in parallel, they offer the potential of very high aggregate bandwidth, but at the expense of specialized programming and uncertain reliability.&lt;/p&gt;

&lt;p&gt;FabriCache is better in every way. It is faster, because it operates at memory speed. It is more robust, because it takes advantage of the full error correction facilities built into Lustre. And it is simpler, because a FabriCache file is no different from any other file that the application uses. A simple way to test for disk file performance issues is to temporarily shift a file into FabriCache and see how overall applications performance changes.&lt;/p&gt;

&lt;p&gt;SiCortex is committed to architecture that maximizes performance while minimizing special cases. FabriCache is just one example. &lt;/p&gt;
</description></item><item><pubDate>Fri, 11 May 2007 16:36:05 GMT</pubDate><title>The SiCortex FabriCache™: Measure Its Abilities In Genomes/sec.</title><link>http://sicortex.com/news_events/5832_newsletter/the_sicortex_fabricache/the_sicortex_fabricache_measure_its_abilities_in_genomes_sec</link><description>
&lt;p&gt;Storage systems and processing platforms have similar histories. Both started as single large monoliths, but both have moved strongly to parallelism to take advantage of the volume economies introduced by the personal computer industry. Large attached storage systems now have thousands of individual spindles; Top500 computing platforms increasingly have thousands of individual processors. The challenge for system architects, of course, is to maximize the parallelism between those thousands of spindles and those thousands of processors, whilst assuring reliability when one of either fails. Winning solutions maximize total transfer bandwidth while minimizing software complexity. Several successful approaches now exist, including Lustre, PVFS, and GPFS among others, but until now none of them could extend their reach to the other system resource that is present in the thousands: DIMMs. By reflecting the power of the Lustre parallel file system back onto terabytes of its own main memory, the SiCortex SC5832 FabriCache provides many of the benefits of a global shared memory without the enormous cost. &lt;/p&gt;
&lt;a name="eztoc1168_0_0_0_1" id="eztoc1168_0_0_0_1"&gt;&lt;/a&gt;&lt;h4&gt;Parallel File Architecture &lt;/h4&gt;
&lt;p&gt;Since the SiCortex FabriCache takes advantage of the Lustre parallel file system, a few words about Lustre are in order. Produced by Cluster File Systems, Lustre has established itself as a standard in high performance computing, used by Cray and HP among others. &lt;/p&gt;

&lt;p&gt;Like any parallel file system, Lustre starts by spreading, or “striping”, individual files across as many disk spindles as possible. Having a file spread across many disks allows it to be read back into an individual processor very fast. More importantly, it allows many processors to be getting at pieces of the same file simultaneously. For the physical scientist, parallel file systems allow all the processors in the system to be writing updated information about their individual grid points to the appropriate places in a single file at once. For the life scientist, parallel file systems allow every processor to be reading from single copies of genomes as they look for patterns of similarity. &lt;/p&gt;
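&lt;p&gt;Striping can be pictured as simple round-robin address arithmetic. The Python sketch below is a generic model, not Lustre's actual layout code. The function name, the 1 MiB stripe size, and the four storage targets are assumptions chosen for illustration.&lt;/p&gt;
&lt;pre&gt;
```python
def locate(offset, stripe_size, stripe_count):
    """Map a byte offset in a file to (target index, byte offset within that target)."""
    stripe_number, within_stripe = divmod(offset, stripe_size)
    target = stripe_number % stripe_count         # round-robin over storage targets
    full_rounds = stripe_number // stripe_count   # complete passes before this stripe
    return target, full_rounds * stripe_size + within_stripe


MIB = 1024 * 1024

# Eight processors each reading their own 1 MiB piece of one file
# striped across four targets (both numbers are hypothetical).
for rank in range(8):
    print(rank, locate(rank * MIB, stripe_size=MIB, stripe_count=4))
```
&lt;/pre&gt;
&lt;p&gt;Consecutive pieces of the file land on different targets, so processors working on neighboring regions can be served in parallel rather than queuing on one disk.&lt;/p&gt;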

&lt;p&gt;Traditional parallel file systems coordinate transfers by means of extra processors added to the network. Instead of going straight to the disk spindles themselves, the cluster nodes send their read/write requests to these intermediate processors (Lustre calls them “Object Storage Servers”, or OSSs) which then do the actual disk transfers and pass the data back and forth to node memory. The Lustre OSSs have an additional function: assuring the integrity of the data. The OSSs deny write access to portions of a file that are already being written by another node and suspend read access to portions that are about to be written. In addition to the OSSs that coordinate transfers between nodes and established files, a Metadata Server node manages the efficient creation and deletion of files. If thousands of nodes simultaneously ask for a new temporary file apiece, it is the job of the Metadata Server to assure that they do not have to wait in line to get them. &lt;/p&gt;
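&lt;p&gt;The integrity role of an OSS can be sketched as a table of byte-range write locks. The Python below is a deliberately simplified illustration, not Lustre's lock manager: the class and method names are hypothetical, each node holds at most one lock, and byte ranges are half-open (the end offset is excluded).&lt;/p&gt;
&lt;pre&gt;
```python
class ToyExtentLocks:
    """Hypothetical write-lock table for one file; not Lustre's lock manager."""

    def __init__(self):
        self.write_locks = {}   # node name mapped to (start, end), end excluded

    def _conflicts(self, start, end, asking_node):
        """True if another node holds a write lock that overlaps the range."""
        return any(
            len(range(max(start, s), min(end, e))) != 0
            for node, (s, e) in self.write_locks.items()
            if node != asking_node
        )

    def request_write(self, node, start, end):
        """Grant the range unless another node is already writing part of it."""
        if self._conflicts(start, end, node):
            return False
        self.write_locks[node] = (start, end)
        return True

    def may_read(self, node, start, end):
        """Reads are held off while another node has the range locked for writing."""
        return not self._conflicts(start, end, node)

    def release(self, node):
        """Drop the node's write lock, if it holds one."""
        self.write_locks.pop(node, None)


locks = ToyExtentLocks()
print(locks.request_write("node0", 0, 100))     # granted
print(locks.request_write("node1", 50, 150))    # denied, overlaps node0
print(locks.request_write("node1", 100, 200))   # granted, ranges only touch
print(locks.may_read("node2", 90, 110))         # held off by both writers
```
&lt;/pre&gt;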

&lt;p&gt;Lustre on SiCortex is different because it is “inboard” rather than “outboard.” Instead of being an external file system in addition to the root file system used by the processing node kernels, SiCortex Lustre is the file system for the whole configuration. And rather than running on exogenous processors, SiCortex Lustre uses SC5832 and SC1458 nodes directly as its OSSs. Most critically, instead of communicating with OSSs over an Ethernet or similar enabling network, SiCortex uses its very high bandwidth Kautz graph. &lt;/p&gt;

&lt;p&gt;Each SiCortex circuit module implements 27 nodes. (An SC5832 has 36 of these circuit boards; an SC1458 has nine.) Of the 27 nodes, three are linked to PCI EXPRESS® slots to which disk spindles can be interfaced. Any or all of these three can be Lustre OSSs, either on a dedicated basis or in addition to running applications. &lt;/p&gt;
&lt;a name="eztoc1168_0_0_0_1" id="eztoc1168_0_0_0_1"&gt;&lt;/a&gt;&lt;h4&gt;Implementing FabriCache Via Lustre &lt;/h4&gt;
&lt;p&gt;On SiCortex systems, every node knows how to act as an OSS. The reason that only three of 27 act as OSSs to spindles is that they are the ones that have PCI EXPRESS slots. All 27, however, are connected to their own DIMMs, so all 27 can act as OSSs to portions of that memory set aside for FabriCache. &lt;/p&gt;

&lt;p&gt;FabriCaches can be established in either of two ways. For intense interaction, a subset of SiCortex nodes can have all their available memory dedicated to FabriCache usage. The remaining compute nodes communicate with them on the Kautz graph fabric. A more interesting case is to set aside a portion of every node’s memory for FabriCache. Now all the nodes (972 for the SC5832, 243 for the SC1458) are part-time OSSs. &lt;/p&gt;
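&lt;p&gt;The node counts quoted here follow directly from the module counts given earlier, as this small Python sketch shows. The figure of six cores per node comes from the DMA Engine article in this newsletter series. The function and constant names are made up for illustration.&lt;/p&gt;
&lt;pre&gt;
```python
NODES_PER_MODULE = 27        # nodes on each circuit module
DISK_NODES_PER_MODULE = 3    # nodes wired to PCI Express slots
CORES_PER_NODE = 6           # cores on each node chip


def system_counts(modules):
    """Derive node, core, and OSS-capable counts from the number of modules."""
    nodes = modules * NODES_PER_MODULE
    return {
        "nodes": nodes,
        "cores": nodes * CORES_PER_NODE,
        "disk_capable_oss_nodes": modules * DISK_NODES_PER_MODULE,
        "fabricache_capable_nodes": nodes,   # every node has its own DIMMs
    }


print("SC5832:", system_counts(36))
print("SC1458:", system_counts(9))
```
&lt;/pre&gt;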

&lt;div class="object-center newsletter_expandable_with_caption"&gt;&lt;div class="content-view-embeddedmedia"&gt;
&lt;div class="class-image"&gt;

&lt;div class="attribute-image"&gt;
&lt;p&gt;      
    
        
    
            &lt;a href="/media/images/pci_disks" class="greybox" target="_self" onclick="return GB_showCenter('Sicortex', this.href, 570, 570)"&gt;
                    &lt;img src="/var/ezwebin_site/storage/images/media/images/pci_disks/1593-1-eng-US/pci_disks_medium.jpg" width="200" height="145"  style="border: 0px;" alt="" title="" /&gt;
	&lt;/a&gt;	
    
    
    
      &lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;The power of FabriCache is that, either way, compute nodes address it as just another file with a single address space, because that is what it is. The Lustre logic that is already in every node’s kernel has the responsibility of determining where in the file a read or write goes, and which OSS is responsible for that portion of that file. It sends the request over the fabric to that OSS node, and exchanges the data the same way. If the data happens to be in the node’s own portion of the FabriCache file, it acts as its own OSS and the operation collapses to a memory copy. Of course, if another node has reserved that part of the file for writing, the operation may be held up for coherence reasons. In effect, the node acting as Lustre OSS pulls rank on the node acting as compute resource in order to assure integrity of the data in the FabriCache. The key point is that the node acting as FabriCache OSS is transparent to the node acting as compute resource. No special code is needed to access the data. It is just a file. And no special code is needed to get extreme performance. Lustre does that automatically. &lt;/p&gt;
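&lt;p&gt;The dispatch decision described above can be sketched in Python. This is an illustration under a strong simplifying assumption, namely that node k holds one contiguous slice of the file. Real Lustre striping is more involved, and the function name and parameters are hypothetical.&lt;/p&gt;
&lt;pre&gt;
```python
def plan_read(offset, my_node, node_count, bytes_per_node):
    """Decide how a read at `offset` into a FabriCache-style file would be served.

    Assumes node k holds the k-th contiguous slice of `bytes_per_node` bytes.
    Returns (owning node, offset within that node's slice, how it is served).
    """
    owner, local_offset = divmod(offset, bytes_per_node)
    if owner not in range(node_count):
        raise ValueError("offset lies outside the file")
    how = "memory copy" if owner == my_node else "request over the fabric"
    return owner, local_offset, how


GIB = 1024 ** 3

# Node 5 reads from its own slice, then from a slice held by node 700.
print(plan_read(5 * GIB + 4096, my_node=5, node_count=972, bytes_per_node=GIB))
print(plan_read(700 * GIB, my_node=5, node_count=972, bytes_per_node=GIB))
```
&lt;/pre&gt;
&lt;p&gt;The application issues the same file read in both cases. Only the kernel-level path differs, which is why no special code is needed in the application.&lt;/p&gt;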

&lt;div class="object-center newsletter_expandable_with_caption"&gt;&lt;div class="content-view-embeddedmedia"&gt;
&lt;div class="class-image"&gt;

&lt;div class="attribute-image"&gt;
&lt;p&gt;      
    
        
    
            &lt;a href="/media/images/board_with_dna" class="greybox" target="_self" onclick="return GB_showCenter('Sicortex', this.href, 570, 570)"&gt;
                    &lt;img src="/var/ezwebin_site/storage/images/media/images/board_with_dna/1597-1-eng-US/board_with_dna_medium.jpg" width="200" height="125"  style="border: 0px;" alt="" title="" /&gt;
	&lt;/a&gt;	
    
    
    
      &lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;</description></item></channel></rss>
