
Wednesday, February 21, 2007

Supercomputing for Pedestrians

..stray thoughts and personal experiences

This article won the first prize at Aagomani 2006 - a workshop on Supercomputers & High Performance Computing, organized by the Electrical Engineering Students' Association, IIT-Bombay and sponsored by IBM and TATA-CRL. If you like it, please link to it on your website/blog.

0. Introduction

The BlueGene/L built by IBM presently ranks number one on the list of the top500 supercomputers (www.top500.org). As long as there exists a list, somebody (let’s call him the institution) has to top it and somebody else (let’s call him the underdog) will always try to topple him. So what’s the big deal! The very same institution-underdog relation manifests itself in all walks of life, e.g. operating systems – Windows & Linux, personal computers – IBM PC & Apple Mac, and Formula 1 – Schumacher & Alonso (or the other way round, if it makes you happy). The underdog has always had an inherent advantage over the institution, i.e. that of being an underdog. Crowds love the underdog: they want to associate with it and see it win, and they hate the institution. We all loved the last sequence of Jo Jeeta Wohi Sikandar and Remember the Titans, just praying for that underdog to win.

BlueGene/L can do up to 360 tera floating point operations per second using 131072 processors in parallel. Sounds intriguing? Well, maybe! But how is that going to affect me? The folks at the institution may have used these machines to solve some complex problems in protein analysis, but as far as I am concerned proteins are something you get from chicken & dals, and that’s the farthest you should analyze them. I am sorry if I sound like a technical jerk, but let’s face it guys, the folks at IBM are never going to let me use their servers to play AOE or run those MATLAB simulations on their machines. Hence I promise not to talk about these computers on the top500 list anymore, nor about the things that they do to stay on the list.

Having spent the first two paragraphs of the introduction mentioning the things I won’t be writing about in this article, let me judiciously spend the last one stating the things I love and will talk about. Firstly, I will try to describe the various nitty-gritties of choosing a computer configuration, and how one can optimize to get the maximum juice out of your box (Section 1). Secondly, how to set up your own distributed computing environment using existing resources (Section 2). Thirdly, some freely accessible and very common applications of high performance systems/networks built using resources donated by their contributors (Section 3).

1. System.config

Let me give you some background on what I do as my M.Tech project (or at least what I tell people I do!). Broadly, I work in the field of 3D Technology CAD and modeling of electron devices. For device modeling one typically needs to use simulators, both home-grown and commercially available, which use finite element methods to solve the partial differential equations resulting from the physics governing the device operation. This requires a lot of computational resources. Recently we were approaching a point where the existing resources in the Microelectronics Computational Lab (MCL), IIT Bombay were no longer sufficient to handle these computations. So we decided to buy some new hardware, not just for present use, but something that will remain capable for the next couple of years (seems like a common issue in every lab/company).

To : sales@unicompservices.com

From : Aneesh Nainani _02d07031 aneeshnainani@gmail.com

Sub : requesting information

Dear Sir,

We are looking to purchase some high-end workstations: 1-2 processors per workstation (64-bit), 8 to 16 GB of RAM, medium-end graphics and a 19”/21” TFT. Can you please suggest the best configuration and approximate prices for the same?

To: aneeshaninani@gmail.com

From : shashank@unicompservices.com

>stripped

For AMD Opteron workstations

Configuration:

1. AMD Opteron Dual Core 275 (2.2 GHz) Rs 31000 per unit

2. Motherboard Tyan 2877 > Rs 18000 / TYAN 2895 Rs 29500

3. RAM 2GB DIMM ECC DDR2 Rs 17000 per module

4. HDD 250GB SATA Seagate Rs 4200

5. 256MB nvidia graphics card > fx1500 Rs 27000 approx

normal card gforce chipset Rs 3500 approx

6. 19" TFT Monitor / 21" TFt monitor

6.1. 19” LCD Monitor (Philips/Samsung) Rs 15600

6.2 19” LCD Monitor (E96FSB) ViewSonic Rs 16500

6.3. 21" LCD Monitor (P227FSB) ViewSonic Rs 65600

7. Server Chassis (recommend Chieftec)

with 450 watts Smps Rs 6500

with 500 watts Smps Rs 8500 to 9000

pl finalise your config and kindly get back to us for any further details

>private

Let me describe how to interpret most of this information and, more importantly, what more one should ask for before zeroing in on the configuration that gives you the highest performance for your requirements. If you are a hardware expert and already know all this, you can skip directly to Section 2. All the prices and hardware descriptions are as of 11-Oct-2006.

1.1 The CPU

The CPU configuration for my system says AMD Opteron Dual Core (2.2 GHz). Let me explain why I chose this particular one and how it meets my computational requirements. To explain this I need to answer three questions: a) Why 64-bit? b) Why dual core? c) Why Opteron? Let’s take them one by one.

1.1.1 Why 64-bit?

64-bit Arithmetic: With 64-bit computing, each general purpose register is 64 bits wide and can represent a much larger number. A 64-bit integer type such as long long (or long on a 64-bit Linux system) fits within a single register on a 64-bit machine versus two registers on a 32-bit machine. A 64-bit processor takes only one instruction to perform an arithmetic operation on such a variable and two loads from memory (one per operand), whereas on a 32-bit machine the same operation would require two or more arithmetic operations and four loads from memory. In addition, logical operations like AND, OR and XOR can all operate on the much larger data size.

64-bit Memory Space: A 32-bit processor may address a maximum of 4 GB (2^32 bytes) of virtual memory. A 64-bit processor has the ability to address 16 exabytes (2^64 bytes). A 4 GB memory limit seemed like an unlimited resource many years ago, but nowadays it’s simply not enough. Clearly 64-bit applications can take advantage of a larger virtual address space on a supporting operating system. Also, 32-bit applications running on a 64-bit operating system can benefit from the larger real address range (i.e. ten 32-bit applications, each requiring 1 GB of memory, can still run together on a 64-bit OS, taking advantage of the larger real address range).
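As a quick sanity check, here is a minimal Python sketch (my own illustration, not from the original article's setup) that reports whether the machine you are on is 32- or 64-bit and works out the corresponding address-space limits:

import struct
import sys

# Pointer size in bytes tells us whether this build/OS combination is 32- or 64-bit.
pointer_bits = struct.calcsize("P") * 8
print(f"Running on a {pointer_bits}-bit interpreter (sys.maxsize = {sys.maxsize})")

# Maximum addressable virtual memory for each word size.
print(f"32-bit limit: {2**32 // 2**30} GiB")   # 4 GiB
print(f"64-bit limit: {2**64 // 2**60} EiB")   # 16 EiB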

64-bit seems almost necessary for our lab usage, since most of our computations involve floating point operations and the memory requirement frequently exceeds 4 GB. AMD offers 64-bit processors in all market segments – Opteron for servers, Athlon 64 for desktops and Turion 64 for laptops. IBM’s PowerPC processors are based on a 64-bit architecture as well. The 64-bit offerings from Intel are the Itanium and Xeon-64 series for servers; it is yet to answer AMD’s challenge in the desktop and laptop segments, but given the way things move in the processor market, that won’t take long. Note that all of the top 10 machines on the top500 list use 64-bit processors.


1.1.2 Dual core and dual processor


It has always been a frequent question -- "Will I benefit from multiple processors?" With the growing popularity of dual core processors, the topic is more important than ever! Will multiple processors or a dual core processor be beneficial to you, and what are the differences between them?


For anyone doing video editing, running multi-threaded applications, or doing a lot of multitasking (if you want to play NFS-3 while running those MATLAB simulations in the background), the answer to the first question is a very clear 'yes'. Then the question becomes whether two separate processors are the way to go, or whether a single dual-core CPU will do just as well. Dual CPU vs. dual core -- which is better?!

Having two CPUs (and a motherboard capable of hosting them) is more expensive, so computer engineers came up with another approach: take two CPUs, smash them together onto one chip, and presto! The power of two CPUs, but only one socket on the motherboard. This keeps the price of the motherboard reasonable, and allows for the power of two CPUs (now known as cores) at a cost that is less than two separate chips. This, in a nutshell, is what the term "Dual Core" refers to - two CPUs put together on one chip.

As far as performance is concerned, the dual-core avatars are definitely better than the single-core ones, but a little worse than two physical CPUs, since contention arises from the two cores sharing the same memory bus and other resources. Remember the quadratic relationship between the frequency of a transistor and the power it consumes, taught in the first lecture of the Analog Circuits course? With the multi-core approach, you can increase performance linearly with a linear increase in power, as opposed to the quadratic relationship between power and performance for a larger monolithic processor. Therefore, multiple small cores have the potential to provide near-linear increases in performance with only a linear increase in power (as opposed to a quadratic increase in power with one large core).

But does this mean that power must shoot up with multiple cores? No. When we apply multiple processors to a problem, we can use that quadratic relationship between performance and power to our advantage. For example, a 15 percent drop in per-core performance can give us roughly a 50 percent decrease in per-core power usage. So in the future, we can double the number of processor cores on a die using cores that each have 15 percent lower performance than a larger monolithic processor, still greatly increase the overall throughput, and keep the power budget firmly in check. So yes, we can get more performance without a runaway increase in power. That’s why dual cores have become such a rage in the laptop market.
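To see roughly where such numbers come from, here is a back-of-the-envelope sketch of my own, using the common dynamic-power model P ~ V^2 * f and assuming the supply voltage can be scaled down along with frequency (how close you get to the 50 percent figure depends on how far the voltage can really be dropped):

# Dynamic power model: P ~ V^2 * f, with voltage assumed to track frequency.
def relative_power(freq_scale: float, volt_scale: float) -> float:
    return volt_scale ** 2 * freq_scale

# One core slowed down by 15%, with the voltage reduced by the same factor:
per_core = relative_power(0.85, 0.85)
print(f"Per-core power: {per_core:.2f}x of the original")   # ~0.61x; further voltage trimming approaches 0.5x

# Two such cores working on a nicely parallel job:
print(f"Two cores: ~{2 * 0.85:.2f}x throughput at ~{2 * per_core:.2f}x power")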

Thus, to sum up, dual-core CPUs offer performance comparable to two physical CPUs while going easy on the pocket. There is an added advantage in terms of software cost – many companies like SAP and Oracle price their licenses by the number of physical CPUs running the software, another ace up the dual core’s sleeve since it counts as a single physical unit. And if you thought dual core was enough, guess what: quad-core (four cores per CPU) technologies are already in development and are expected to hit the market very soon.

Note that the Hyper-Threading technology from Intel is not the same as dual core; it basically presents a single core as two logical processors to the operating system so that the core’s resources are better utilized. Performance-wise it rates somewhere between the single-core and dual-core solutions.

Coming back to how we use all this: for the MCL lab, since we want the best performance and can afford to burn some money, we chose an AMD Opteron system with a dual-CPU motherboard and a dual-core CPU in each socket. That gives a grand total of four functional CPU cores! This setup is especially desirable when multiple heavy-duty applications are open at once (since we have many users logged on to every system) - CAD, video editing and modeling come to mind - just make sure to complement those processors with plenty of memory.
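To give a flavour of how such a four-core box earns its keep, here is a small Python sketch of my own (the workload is a made-up stand-in, not our actual simulators) that detects the available cores and farms independent tasks out to all of them:

import multiprocessing as mp
import os


def heavy_work(n: int) -> float:
    # Stand-in for one independent simulation task (hypothetical workload).
    total = 0.0
    for i in range(1, n):
        total += 1.0 / (i * i)
    return total


if __name__ == "__main__":
    cores = os.cpu_count()  # logical cores visible to the OS (four on the box described above)
    print(f"Logical cores available: {cores}")

    jobs = [2_000_000] * 8  # eight independent tasks spread across the cores
    with mp.Pool(processes=cores) as pool:
        results = pool.map(heavy_work, jobs)

    print(f"Finished {len(results)} tasks, sample result: {results[0]:.6f}")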

1.1.3 Why Opteron ?

Having decided that I want 64-bit dual-core CPUs, I have AMD’s Athlon 64 and Opteron and Intel’s Itanium and Xeon-64 to choose from. AMD scores miles above Intel here, clearly because of its Direct Connect architecture and HyperTransport interconnect. Let me explain what that means.

For optimal performance, the front-side bus (FSB) bandwidth must scale with increasing processor speeds. When cache misses occur, the processor must fetch information from main memory. In the Northbridge/Southbridge architecture used until now (remember that aluminium-finned maze of a thing projecting from your motherboard – that is the Northbridge), shown in the figure below, memory transactions must traverse the Northbridge, creating additional latency that reduces the performance potential. To help resolve this bottleneck, AMD incorporates the memory controller into the processor itself. This Direct Connect interface to memory significantly reduces the memory latency seen by the processor, and that latency will continue to drop as the processor frequency scales. The reduction in memory latency, coupled with the additional memory bandwidth available directly to the processor, tremendously benefits system performance. This very same direct-connect methodology is used in the computers on the top500 list.

Note that the processors available in India are always a generation behind what’s shipping in the USA. For example, AMD is presently shipping 2.8 GHz dual-core Opterons in the States, while the 2.4 GHz dual-core Opteron is the best commercially available in India.

For the nanoelectronics buffs: the latest Intel CPUs are built on the 65nm node, while AMD’s CPUs are one generation behind on the 90nm node. AMD uses Silicon-On-Insulator (SOI) technology to boost speeds.

1.2 The RAM

Most people who have, or are planning on purchasing, a computer know to look for certain things. One of those things is the amount of memory, or RAM, that comes in a computer. The higher the amount of RAM, the better the system. What a lot of people don't know is that the type of memory that goes into a computer can also make a big difference to the performance and the future ability to upgrade that system. Let us refer back to the mail correspondence with the computer vendor above. The quotation for the RAM reads:

3. RAM 2GB DIMM ECC DDR2 Rs 17000 per module

What does each of these specifications (DIMM, ECC, DDR2) signify, and are these the only things I need to know when choosing RAM?

There are two primary pieces that determine the memory that can be used in a computer: the CPU and the motherboard (or chipset). All CPUs have a speed rating given to them; this is often the rating of the processor in gigahertz. There is a second speed rating of the processor, referred to as the front-side bus. The front-side bus is the speed at which the processor can talk to the memory and other components of the system. The processor speed is actually a multiple of the front-side bus speed. For example, a 2.4 GHz Intel Pentium 4 has a 400 megahertz front-side bus multiplied by 6 to generate the 2.4 GHz core speed. This means that the CPU can communicate with the memory at up to 400 megahertz.
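A back-of-the-envelope sketch of that multiplier arithmetic, using the same numbers as above (my own illustration, nothing more):

# Core clock = front-side bus clock x multiplier (the Pentium 4 example from the text).
fsb_mhz = 400
multiplier = 6
core_mhz = fsb_mhz * multiplier
print(f"Core clock: {core_mhz} MHz = {core_mhz / 1000:.1f} GHz")

# Memory faster than the front-side bus is largely wasted on this CPU; it can only be fed at FSB speed.
for memory_mhz in (333, 400, 533):
    usable = min(memory_mhz, fsb_mhz)
    print(f"{memory_mhz} MHz memory: effectively used at {usable} MHz")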

Two types of DIMMs: a 168-pin SDRAM module (top) and a 184-pin DDR SDRAM module (bottom).

Now the motherboard also determines a lot about the memory. The chipset that interacts with both the CPU and the memory is designed to work with specific memory types and sizes. While the chipset may be able to support a processor such as the 2.4 GHz Pentium 4 mentioned before, it might only support memory up to speeds of 333MHz. These chipsets are also generally designed to communicate with memory modules up to a specific size, such as 2GB per DIMM. This, coupled with the number of DIMM slots on the motherboard, determines the maximum amount of memory a system can hold: if the system has 2 slots, the maximum would be 4 GB of RAM, which is less than a board with 3 slots that can take 6 GB of RAM.

Note that DIMM stands for Dual In-line Memory Module. DIMMs generally have a 64-bit data path and have electrical contacts on both sides of the module (see figure above).

While there are many different types of memory on the market, currently only two are used in modern consumer PCs:

· Double Data Rate SDRAM (DDR)

· Double Data Rate 2 SDRAM (DDR2)

Double data rate or DDR memory is designed to perform two memory transfers per clock cycle. This effectively doubles the speed of the memory over older synchronous memory modules. This memory type is still used by many budget computer systems but is slowly being phased out in favor of the faster DDR2 standard. The two types of memory are not interchangeable, as they interface with the chipset and memory controller in different ways. To differentiate the two, each type has a different pin count and layout for the memory modules (see figure above).

To make things confusing, these memory types can be listed in two ways. The first method lists the memory by its effective clock speed: 400MHz DDR memory is referred to as DDR400, and 400MHz DDR2 memory as DDR2-400. The other method classifies the modules by their peak bandwidth in megabytes per second: with the standard 64-bit (8-byte) data path, a 400MHz module can theoretically move 400 x 8 = 3,200 megabytes per second, or 3.2 gigabytes per second. Thus DDR400 memory is also referred to as PC3200 memory, and 400MHz DDR2 memory as PC2-3200. One must keep this in mind while reading the specs.
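A tiny sketch of that naming arithmetic (my own illustration, assuming the standard 64-bit module width mentioned earlier):

def pc_rating(effective_mhz: int, generation: int = 1) -> str:
    # Convert a DDR effective clock (MHz) into its bandwidth-based name,
    # assuming the usual 64-bit (8-byte) DIMM data path.
    bandwidth_mb_s = effective_mhz * 8
    return f"PC{bandwidth_mb_s}" if generation == 1 else f"PC{generation}-{bandwidth_mb_s}"


print(pc_rating(400))                 # DDR400   -> PC3200
print(pc_rating(400, generation=2))   # DDR2-400 -> PC2-3200
print(pc_rating(800, generation=2))   # DDR2-800 -> PC2-6400 (the modules chosen for the MCL lab machines)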

Both parity and ECC (Error Correcting Code) are forms of error detection for memory modules. Parity is a simple form of error detection that adds an extra bit for every 8 bits on a memory module. This extra bit records whether there is an even or odd number of 1's in those 8 bits. When the data is read back, the parity is recomputed; if it doesn't match the stored bit, an error has been detected within the memory. ECC is a more advanced scheme that goes beyond the single parity bit and can actually correct errors, not just detect them.
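A minimal sketch of the parity idea (my own toy illustration; real parity/ECC modules do this per byte in hardware, and ECC uses Hamming-style codes rather than a single bit):

def parity_bit(byte: int) -> int:
    # Records whether the byte holds an even (0) or odd (1) number of 1 bits.
    return bin(byte).count("1") % 2


stored_byte = 0b10110010
stored_parity = parity_bit(stored_byte)

# Later, a single bit flips in memory (simulated fault).
corrupted = stored_byte ^ 0b00001000

if parity_bit(corrupted) != stored_parity:
    print("Parity mismatch: a memory error was detected (but cannot be corrected).")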

Thus there are 5 primary things one should look for while buying memory: 1) memory capacity (obviously!), 2) memory type (DDR/DDR2), 3) memory speed, 4) fit with the CPU (FSB) and the motherboard, and 5) the presence of an error-checking mechanism such as ECC/parity. For the MCL lab computers we chose the best we could afford – 8 DIMM modules, limited by the number of DIMM slots available on the motherboard, each module being 2GB (the maximum capacity available per DIMM) of DDR2-800 (PC2-6400) with ECC, putting 16 GB of RAM in each computer. Beware that vendors often leave the memory speed out of the specs, trying to trick you with slower memory.

1.3 The Hard Disk Drive (HDD)

Referring back to the email from the computer vendor above, the quotation for the HDD reads:

4. HDD 250GB SATA Seagate Rs 4200

SATA, or Serial Advanced Technology Attachment (ATA), is the latest hard drive storage technology to come along. The fundamental difference between SATA and the earlier traditional ATA formats is how the data is transferred between the device and the processor. Traditional ATA devices and controllers use a parallel data transfer mechanism. Parallel transfer is a fairly common technique where multiple channels of data are sent simultaneously to try and increase the amount of data transferred in a single clock cycle. The problem with this mechanism is the number of wires required to transfer that data; this is why ATA cables are so wide – it is necessary to have the 40 or 80 wires required to carry the data. The drawback is the interference caused between these wires: at the higher clock speeds needed for faster transfers, the crosstalk between the wires becomes too great to allow reliable transmission.

On the other hand, serial transmissions run across a single channel. This means that at the same clock speed a serial line carries less data, but because the serial method requires fewer wires, less interference is generated to cause data-integrity problems. This allows serial transmission methods to run at much higher speeds than the equivalent parallel methods. SATA also allows for neat (as well as longer, if needed) cabling between devices, which helps reduce the amount of heat that gets trapped within a computer (see figure below).

Note that almost all the interfaces – processor to storage, processor to processor – in the computers on the top500 list are serial in nature. Below is the inside of a PC with parallel ATA cables, and on the right is the same computer with serial ATA cables.

Two other things to look for while buying storage, apart from the warranty, are the rotational (RPM) speed and the platter count. For example, a 160GB HDD can be made of two 80GB platters or four 40GB platters, with the two-platter version having a significant performance edge over the other.

2. Poor Man’s Supercomputer

Building a supercomputer that can get into the top500 list requires millions of dollars of investment in infrastructure, which obviously not everyone can afford and most probably does not need either. In this section, using two examples, I want to illustrate how one can create a distributed environment to get the highest performance out of modest resources.

2.1 Remote Procedure Calls (RPC)

Remote procedure call (RPC) is a protocol that allows a computer program running on one computer to cause a subroutine on another computer to be executed without the programmer explicitly coding the details for this interaction. When the software in question is written using object-oriented principles, RPC may be referred to as remote invocation or remote method invocation.

RPC is an easy and popular paradigm for implementing the client-server model of distributed computing. An RPC is initiated by the caller (client) sending a request message to a remote system (the server) to execute a certain procedure using the arguments supplied. A result message is returned to the caller. There are many variations and subtleties in the various implementations, resulting in a variety of different (incompatible) RPC protocols, e.g. XML-RPC, SOAP, etc.

What’s most important is that RPC utilities come as standard with most Linux distributions (use man rpcinfo or man portmap to see whether RPC is active on your computer). So basically, using some standard Linux desktops and RPC, I can set up my own little distributed computing implementation with a minimum of effort. Moreover, one can use Samba utilities to pull Windows machines into the same setup as well. Now that’s nice!
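The same client-server idea is easy to try with the XML-RPC support in Python's standard library. The sketch below is my own minimal illustration (not the Sun RPC tooling mentioned above), and the hostname and port are made up:

# server.py -- run this on the "worker" machine
from xmlrpc.server import SimpleXMLRPCServer


def simulate(step_count: int) -> float:
    # Stand-in for a compute-heavy routine executed remotely (hypothetical workload).
    return sum(1.0 / (i * i) for i in range(1, step_count))


server = SimpleXMLRPCServer(("0.0.0.0", 8000), allow_none=True)
server.register_function(simulate, "simulate")
print("Worker listening on port 8000 ...")
server.serve_forever()

# client.py -- run this on your own desktop
import xmlrpc.client

worker = xmlrpc.client.ServerProxy("http://worker-node:8000/")  # "worker-node" is a made-up hostname
print("Remote result:", worker.simulate(1_000_000))

The client blocks until the remote machine finishes the call and returns the result, which is exactly the request/response exchange described above.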

2.2 RAID

The acronym ‘RAID' stands for Redundant Array of Independent (or Inexpensive) Disks. There are several variations designed to meet different needs. Some are for making larger, faster storage solutions. Others trade off size for increased reliability. Yet others try and accomplish both. Here is a rundown of the basic types of RAID available today.

RAID 0 - a.k.a. Striping: Data is striped across two or more disks. This allows much faster data access. But unfortunately, if any disk in the array fails, you lose ALL the data in the array.

RAID 1 - a.k.a. Mirroring: The opposite of striping. A copy of every single piece of information is written to every disk in the array. If one disk fails you have a complete and up-to-date backup of everything. But only half the total disk space is available, because of the mirroring.

RAID 0+1 - Mirroring Two RAID 0 Stripes, or RAID 10 - Striping Two RAID 1 Mirrors: Attempts to give both a performance and a reliability boost. However, just like RAID 1, only half the total amount of disk space is usable (since all data is written twice).

RAID 5 - Striping with Parity: RAID 5 requires at least three disks, making it more affordable (in terms of disks) to operate than either RAID 0+1 or 10. The key to RAID 5 is 'parity' data. Parity data is special code generated when data is written to the array that allows it to rebuild a whole disk if one should fail. The array operates by striping a given amount of data across all the disks except one, and using that one to store parity data. The next piece of data is treated the same, except that a different disk is used to store the parity data - and so on. This way, the total storage available is the total amount of disk space less one disk's worth (that gets used up for parity). Reading data from a RAID 5 array is not as fast as it would be from a RAID 0 (as there is a parity stream to check), but is still slightly faster than a single disk drive.
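The 'parity data' in RAID 5 is essentially a bitwise XOR of the data blocks. Here is a toy Python sketch of my own (real RAID controllers do this per stripe, in hardware) showing how a failed disk's data is rebuilt from the survivors:

from functools import reduce

# Three data "disks" holding one small block each, plus a parity block (toy sizes).
disk_a = bytes([0x10, 0x22, 0x3C])
disk_b = bytes([0x55, 0x01, 0x7F])
disk_c = bytes([0x0A, 0xFE, 0x90])


def xor_blocks(*blocks: bytes) -> bytes:
    # Byte-wise XOR of equally sized blocks.
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


parity = xor_blocks(disk_a, disk_b, disk_c)

# Disk B dies; XOR-ing the surviving disks with the parity block recovers its contents.
recovered_b = xor_blocks(disk_a, disk_c, parity)
assert recovered_b == disk_b
print("Disk B rebuilt from parity:", recovered_b.hex())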

RAID 6 - RAID 5 on Steroids: This configuration uses the same basic idea as RAID 5, but creates two separate parity sets. This means it has to have four disks to function, and loses two disks worth of storage space to parity. However, it also means that any two disks can fail, and the array can still be rebuilt. Additionally, RAID 6 (and, to a certain extent, RAID 5) can scale up easily and give very large storage arrays while only losing a small portion of their overall drive space.

RAID (in its various forms) along with RPC are two of the fundamental concepts behind most network file system implementations – NFS (Network File System), AFS (Andrew File System), etc. – which also come as part of standard Linux distributions and let you select the type of RAID you want to use.

3. I chat, I download therefore I am

In this section I describe two applications of high-performance systems/networks that are built and maintained by individuals like you and me, rely on modest resources contributed by their users, and are open for all to use: first, Internet Relay Chat (IRC), and second, KaZaA and Napster (or DC++, which I am sure all of you must be using by now).

3.1 Internet Relay Chat (IRC)

Internet Relay Chat (IRC) is a form of realtime internet chat. It is mainly designed for group (many-to-many) communication in discussion forums called channels, but also allows one-to-one communication via private message.

IRC was created by Jarkko Oikarinen (nickname "WiZ") in late August 1988, long before Yahoo or Google chat. IRC gained prominence when it was used to report on the 1991 Soviet coup attempt throughout a media blackout; it had previously been used in a similar fashion by Kuwaitis during the Iraqi invasion. The relevant logs are available from the ibiblio (www.ibiblio.org) archive. One of the coolest things about IRC is the compulsion to have a nick if you want to join in.

Bots are automated clients that serve as permanent points of contact for information exchange and as protection agents for the channels they serve, thanks to their superior speed compared to humans. Some bots were created for malevolent uses, such as flooding channels or taking them over and locking out their rightful owner(s).

A bouncer's purpose is to maintain a connection to an IRC server, acting as a relay between it and the connecting client. Should the client lose network connectivity, the bouncer will archive all traffic for later delivery, allowing the user to resume the IRC session without any externally perceptible disruption.

There exist thousands of IRC networks running across the world (e.g. Google for FreeNode or IRCnet). One can freely join in and, if you wish, contribute your own resources as a server. IRC channels are the preferred mode of communication and the best place to interact with the open-source community worldwide. Some of the coolest people I know on God's green earth I met for the first time over IRC.

3.2 P2P – Napster, KaZaA, DC++

Napster was a file-sharing service created by Shawn Fanning. It was the first widely used peer-to-peer (P2P) music-sharing service, and it tore the music industry's revenue model apart. Although the original service was shut down by court order following a lawsuit initiated by the rock band Metallica, it paved the way for decentralized P2P file-sharing programs such as KaZaA, which have been much harder to control.

KaZaA uses peer-to-peer (P2P) file sharing -- the same type of technology that made Napster famous. But unlike Napster, which located content via a centralized server, KaZaA uses a decentralized system: KaZaA users contact one another directly online to share content. KaZaA's decentralization is one of the main reasons why it has weathered the legal firestorm this long. Decentralized control is also a key feature of many of the computers on the top500 list. If you request a file that exists on multiple clients in the KaZaA network, KaZaA splits your one request into many and downloads parts of the file from each client to maximize throughput (DC++ users must have experienced this).
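That split-one-request-into-many idea is easy to sketch. Below is a toy Python illustration of my own; the peer URLs are made up, and real P2P clients speak their own protocols rather than plain HTTP range requests:

from concurrent.futures import ThreadPoolExecutor
from urllib.request import Request, urlopen

# Hypothetical peers that all hold the same 4 MB file.
PEERS = ["http://peer1.example/file.bin", "http://peer2.example/file.bin",
         "http://peer3.example/file.bin", "http://peer4.example/file.bin"]
FILE_SIZE = 4 * 1024 * 1024
CHUNK = FILE_SIZE // len(PEERS)


def fetch_chunk(index: int) -> bytes:
    # Ask one peer for one byte range of the file.
    start = index * CHUNK
    end = FILE_SIZE - 1 if index == len(PEERS) - 1 else start + CHUNK - 1
    request = Request(PEERS[index], headers={"Range": f"bytes={start}-{end}"})
    with urlopen(request) as response:
        return response.read()


if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=len(PEERS)) as pool:
        parts = list(pool.map(fetch_chunk, range(len(PEERS))))
    data = b"".join(parts)
    print(f"Downloaded {len(data)} bytes from {len(PEERS)} peers")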

I hope you enjoyed the read !

Aneesh Nainani is a final-year Dual Degree student specializing in Microelectronics at the double-E department, IIT Bombay. He can be contacted at aneeshnainani[at]gmail[dot]com, or you can catch him loitering on the IRC channel #elinux on Freenode.