This will be another introductory academic post like the last post explaining how torrents work.
We’ve all seen references to “supercomputers” in popular culture, run by institutions like NASA, the Chinese government, Bond villains, and other nefarious groups. But what is a supercomputer, and what distinguishes one from a “normal” computer? Surprisingly, this isn’t even discussed in the curriculums of many computer science programs unless you happen to take electives in parallel computing.
Supercomputers, better called cluster computers and often referred to as high performance computing (HPC), consist of racks of conventional computers, tied together with special interlinks to share information as quickly as possible, and loaded with software to run pieces of a program across each of the computers in the racks. Whereas most desktop and laptop computers have a single processor, allowing them to do only one thing at once (or, with a 4-core or 8-core processor, to almost do 4 things or 8 things at once), a supercomputer consists of dozens to tens of thousands of CPUs, and up to millions of cores, allowing it to run many tasks concurrently. Notably, the processors inside aren’t any different than the ones in a desktop, and certainly aren’t any faster: Many of the computers on the Top500 High Performance Computers list run Intel Xeons, and some clusters are clocked as low as 1.45 Gigahertz. If you could somehow run the latest Halo game on a supercomputer there’d be no meaningful speed-up over your home computer. Code must be written specifically to take advantage of the enormous parallelism available on a cluster computer to achieve any performance gain.
What workloads benefit from this kind of parallelism? Mostly large simulation work: weather prediction, epidemic spread, economic impact estimation, industrial engineering to design boxes that can be moved quickly on an assembly line without tipping over, etc. These are usually simulations with a large number of variables, where it is desirable to run a hundred thousand slightly different configurations of the model and determine the optimal, average, or worst-case outcome. These are all problems that require an enormous number of calculations that mostly do not depend on one another, and so do not have to be run sequentially.
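As a rough illustration of that kind of workload, here is a minimal Python sketch of a parameter sweep. The `simulate` function and its scoring formula are made up for the example, and a thread pool on one machine stands in for the many nodes of a cluster; the point is only that each configuration can be evaluated without knowing anything about the others.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def simulate(parameter: float) -> float:
    """Hypothetical stand-in for one expensive model run (lower is better)."""
    return (parameter - 3.0) ** 2 + 1.0


def sweep(configs, workers=4):
    # Every configuration is independent of the others, so the runs can be
    # handed to workers in any order with no coordination between them.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scores = list(pool.map(simulate, configs))
    # Only this final summary step needs to see all of the results.
    return {"best": min(scores), "average": mean(scores), "worst": max(scores)}


configs = [x / 10 for x in range(61)]  # 0.0, 0.1, ..., 6.0
summary = sweep(configs)
print(summary)
```

On a real cluster the pool of workers would be replaced by compute nodes, but the shape of the problem is the same: many independent runs followed by one small combining step.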
We made an allusion to hardware interlinks in clusters being a “magic sauce” that makes everything so quick. Before discussing the software written for these magic interlinks, we should dig deeper into how they work.
Most cluster systems include some kind of peer-to-peer network with highly specialized features: usually it can write directly to memory in userspace, the network itself can handle operations like receiving multiple messages and adding them together before delivery, and it all runs very quickly, with as much networking logic implemented in hardware as possible. For those familiar with Internet networking, these networks are loosely similar to UDP in that there’s no need for fault tolerance, guaranteed delivery, or checksumming if the cables are high enough quality to ensure zero data loss, and routing is much simpler since the entire network topology is fixed and predefined.
So that’s the hardware link, but equally important is the network topology, or which computers are linked to which others. This networking hardware is extraordinarily expensive, so linking every node to every other is infeasible, and for most programs wouldn’t give much of a performance boost anyway. Supercomputer designers must make tradeoffs to allow information to be distributed through the cluster efficiently using as few links as possible.
Some supercomputers use a simple Fat Tree topology where high level routers forward messages to “pods” of compute nodes:
This is appropriate for simple workloads where each node in the cluster needs to receive information at the start and then works independently until results are combined at the end. However, for any workload where nodes regularly need to share data with one another this puts a great deal of strain on the switches, and introduces latency in larger trees.
Some cluster systems, like the now-retired IBM Blue Gene series, use a Torus topology that organizes nodes into a rectangular prism with links along every axis and wrapping around each row and column. The Blue Gene systems use 3-dimensional and 5-dimensional torus networks, but we’ve limited ourselves to two dimensions to simplify the diagram:
Other supercomputers use radically different topologies, like the Cray butterfly network, which lacks the wrap-around flexibility of a Torus but can quickly distribute and re-combine top-level results using few links:
Each of these network structures changes the number of hops required to send information from one node to another, and whether there are local “groupings” of compute nodes that can communicate quickly without sending messages to distant nodes.
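To make the effect on hop counts concrete, the following Python sketch compares a 2-dimensional torus against the same grid without wrap-around links. The plain grid is only a baseline chosen for comparison, the 8 by 8 size is arbitrary, and the functions assume messages always take a shortest path.

```python
def mesh_hops(a, b):
    """Hops between (row, col) nodes on a 2D grid with no wrap-around links."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def torus_hops(a, b, rows, cols):
    """Hops on a 2D torus: along each axis a message may go either way round."""
    dr = abs(a[0] - b[0])
    dc = abs(a[1] - b[1])
    return min(dr, rows - dr) + min(dc, cols - dc)


def worst_case(hops, rows, cols):
    """Largest hop count between any pair of nodes in the grid."""
    nodes = [(r, c) for r in range(rows) for c in range(cols)]
    return max(hops(a, b) for a in nodes for b in nodes)


rows, cols = 8, 8  # arbitrary example size
mesh_diameter = worst_case(mesh_hops, rows, cols)
torus_diameter = worst_case(
    lambda a, b: torus_hops(a, b, rows, cols), rows, cols
)
print(mesh_diameter, torus_diameter)  # 14 8
```

Opposite corners of the grid are 14 hops apart without wrap-around but only 2 hops apart on the torus, which is why the wrap-around links matter for workloads where distant nodes exchange data.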
Now we have a cluster of computers, wired in an elaborate communications network using custom, very high-performance interlinks. Cool, but how do we write code that actually uses that architecture? Most supercomputers use some implementation of the Message Passing Interface (MPI), like Open MPI, to describe parallel operations.
From the programmer’s perspective, an identical copy of their program runs on every compute node in the cluster, except that each copy is aware of both how many nodes exist, and the number of its own node in the cluster. For anyone used to systems programming, think “the program has been forked once for each node before the first line of
The program then loads data into each node, either by loading all the data into one node and distributing it, or by using a networked file system so that each node can directly read the starting data relevant to its work.
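One common way for each copy of the program to decide which part of the starting data is "its" part is to compute a slice from its own node number and the total node count. The sketch below is a plain Python illustration of that arithmetic, not MPI code; `block_range` is a hypothetical helper, and in a real MPI program the two inputs would come from calls such as `MPI_Comm_rank` and `MPI_Comm_size`.

```python
def block_range(rank: int, size: int, n_items: int) -> range:
    """Indices owned by node `rank` when n_items are split over `size` nodes."""
    base, extra = divmod(n_items, size)
    # The first `extra` nodes each take one additional item, so no node ends
    # up with more than one item above any other.
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return range(start, stop)


size = 4      # number of nodes (example value)
n_items = 10  # number of data items to divide (example value)
ranges = [block_range(rank, size, n_items) for rank in range(size)]
for rank, owned in enumerate(ranges):
    print(rank, list(owned))
```

Because every node runs the same function with a different `rank`, no communication is needed to agree on who reads what.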
The message passing interface defines a number of basic operations that form the basis of parallel programming:
Scatter: Take an array and send a subset of the array to each node in a list
Gather: Take a small array from each node in a list and combine into a single large array on the gathering node
Send / Recv: Send a single message directly to another node, or block on receiving a message from another node
Barrier: Similar to a multi-process breakpoint, all processes must reach this line in the code before they can proceed, synchronizing the nodes for scatter and gather operations
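The scatter and gather pattern can be modeled in a few lines of ordinary Python. This is a single-process sketch of the data flow only: the `scatter`, `gather`, and `node_work` functions are illustrative names written for this example rather than calls into an MPI library, and the per-node work runs in a loop here instead of simultaneously on separate machines.

```python
def scatter(data, size):
    """Split `data` into `size` contiguous chunks, one per node."""
    base, extra = divmod(len(data), size)
    chunks, start = [], 0
    for rank in range(size):
        stop = start + base + (1 if rank < extra else 0)
        chunks.append(data[start:stop])
        start = stop
    return chunks


def gather(chunks):
    """Combine the per-node results, in node order, into one large list."""
    return [item for chunk in chunks for item in chunk]


def node_work(chunk):
    """The computation each node performs on its own chunk (toy example)."""
    return [x * x for x in chunk]


data = list(range(10))
chunks = scatter(data, size=4)
# On a real cluster these calls would run at the same time, one per node.
results = [node_work(chunk) for chunk in chunks]
squares = gather(results)
print(squares)
```

The gathering step implicitly waits for every node's result, which is the role a barrier plays when the nodes really are running independently.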
Since each node is a separate process with independent memory, there are few shared resources between nodes and usually no complexities around threading and mutexes and variable race conditions unless a process uses multithreading internally. Data sharing between nodes is entirely via send and receive calls or synchronized scatters and gathers, making it (relatively) easy to track data dependencies and avoid collisions.
Message passing performance is closely tied with the network structure of the cluster computer. Therefore, for more complex simulations with frequent message passing the programmer must be familiar with the configuration of their particular cluster system, so they can break up work in a way that places tasks with data dependencies on “close” nodes within the cluster. This also means that programs written for one cluster computer must be re-tuned before they can be effectively deployed on another cluster, or risk massive slow-downs from inefficient message passing and network clogging.
We’ve described how a supercomputer is built, and how code is written for it. The last piece is how to interact with it. You can’t exactly ssh into a cluster system, because it isn’t a singular computer: Each compute node is running its own operating system (usually a thoroughly tuned Linux distribution), and the only applications that cross between nodes are ones written specifically for use with the messaging interconnect system.
Instead, one or more nodes in the cluster are designated as “I/O nodes” that can be sshed into. The user can upload or compile their software on these landing pads, and from these systems can submit their executable as a job. Then, much like a mainframe system in the 1970s, a batch scheduling system will decide which jobs will run on which nodes in what order to maximize use of the cluster and potentially ensure fair sharing of resources between users.
While general-purpose Central Processing Units (CPUs) usually have only four to sixteen cores, more special-purpose Graphics Processing Units (GPUs) in graphics cards typically have hundreds to tens of thousands of cores in a single computer! Why don’t we use these for massive parallelism? The answer is “we do when we can” and “it’s very hard”.
The reason graphics cards can have so many more cores than a CPU is that graphics processors are simpler and can do far less, which means the cores are physically smaller and require less power, so many more can fit on a chip. Many GPU operations involve working on vectors: for example, you can multiply a vector of a thousand elements by a scalar in one step by using a thousand cores to manipulate the vector in parallel, but you cannot direct those thousand cores to run independent operations in that single step. If and when programs can be expressed in terms of the limited operations possible on a graphics card then we can take advantage of the massive parallelism available there.
Most recently-built cluster systems include graphics cards in each node, so that complex work can be distributed across compute nodes, with the abstract tasks handled by the CPUs, and the rote mathematics handled by each graphics card using APIs like CUDA and OpenCL when possible.