Programmers, including game architects, have a weakness for interesting things, and that weakness ends up complicating code when simple code would work better.
I have been writing a series of articles about fluid simulation, including a discussion of parallelization using Intel Threading Building Blocks (TBB). That API advocates data parallelism over its alternatives, task parallelism and instruction-level parallelism. I agree with that approach, but it is very boring and therefore, to some extent, doomed in the face of programmers looking for something more challenging.
I used to research computational fluid dynamics, which entailed using massively parallel processing (MPP), and I had accounts on a lot of very large machines. My simulations and post-processing code (like most simulations that ran on those massive machines) used data parallelism, because it scales well: the same code works about as efficiently regardless of problem size or the number of CPUs available. The work automatically gets redistributed to suit both the problem size and the computer's capacity.
Data parallelism basically means that if you have a loop that iterates over data, you divide the iterations across multiple threads. So if the loop has N iterations and M threads, then each thread works on N/M of the iterations.
The problem with the simplest form of data parallelism is that, for it to work properly, the result of each iteration must be independent of the results of all other iterations. Such "embarrassingly parallel" problems are trivial to parallelize -- so easy that one might think there could never be enough of them to use up all the available computational resources. So programmers, being career problem-solvers, inevitably gravitate toward solving the more general problem: task parallelism.
Task parallelism means that various tasks run on various threads, and from time to time, the threads synchronize and communicate. The synchronization requires special mediation using atomic operations. This is the most general form of parallelism, and it leads to many potential problems, such as deadlock.
At GDC in 2007, I saw a talk given by a lead programmer of a studio making games for Xenon and Cell, each of which has multiple processing cores. The talk focused on their nifty job manager and how successful they were at balancing computational load. He spoke for an hour on all the trials and tribulations, discussed extremely interesting load-balancing strategies, expounded on how to detect and diagnose deadlock, described why all the naive approaches fail or fall short, showed graphs of computational load, and the talk was fascinating. The audience was full of people who had either written their own job managers or were about to, inspired by the success of the speaker and enthused by how interesting it all was.
Interesting, but ultimately not the right way to spend a lot of time.
Task parallelism does not scale well. If the problem size increases, then so must the code: a programmer must manually modify the algorithm to involve more subtasks. Likewise, if the number of CPUs varies, the code must change. One might argue with this claim and say that the job manager will balance the load automatically, given a pool of processors. Perhaps, but the job manager cannot use more threads than there are jobs. So task parallelism is not scalable to the extent we require.
The Xenon and Cell have more processing units than previous consoles did, but only by a small amount -- under a dozen each. In contrast, MPP machines have hundreds or thousands of CPUs. When that new generation of consoles came out, I -- accustomed to thinking of parallelizing across vast numbers of CPUs -- felt these multi-core consoles were quaint but annoying: too many CPUs to ignore, too few to use without an uncomfortable amount of effort. Not to mention that they did not even agree on a memory model -- the Xenon uses shared memory and the PS3 uses distributed memory. So people wrote versatile and robust cross-platform job manager APIs. And for a few years, people at GDC had something new to talk about.
Since everybody knows that not enough problems can use loop parallelism, why do I advocate it? We have problems that require synchronization -- the data has interdependencies -- so we need task parallelism and job managers, right?
Yes and no.
Most data interdependencies in the large problems that occur in virtual simulations follow a "reduction" pattern, or a "map-reduce" pattern. This means that the problem can be broken into two phases: one (map) which subdivides the problem into smaller, independent pieces, and another (reduce) which recombines the divided pieces into the original whole. This principle is what allows "divide-and-conquer" to work.
A great example is merge sort. Merge sort recursively divides a list of items in half until each sub-list is small enough to sort trivially. At that point, you have a binary tree of sorted sub-lists. A merge operation combines two sorted lists such that the result is also sorted. The merge operation is less complex (linear, in fact) than the sort operation, but the sort operation gets subdivided until its complexity becomes trivial (i.e. constant time).
By analogy, map-reduce problems boil down to figuring out how to combine the results of multiple computations efficiently.
It turns out that at least one company has hinged its success, at least partially, on being able to phrase problems so that they suit the map-reduce paradigm.
And my series of articles demonstrates that physical simulation of thousands of mutually interacting particles also fits the same paradigm.
I predict that when the number of processors increases to a "serious" quantity (dozens or more), we'll see a shift away from interesting discussions of job managers and task parallelism, toward the far more commonplace practice of using data parallelism, and map-reduce will be as commonly applied, and about as worthy of discussion, as creating a class.
We'll still need task parallelism for niche tasks (like asynchronous file loading) but I'd like to see its importance recede compared to data parallelism. I'd like to see a relatively small number of persistent tasks, each of which uses data parallelism as the primary means of exploiting parallel hardware.

2 comments:
It seems to me that we've been urging stream processing for a while with Gamebryo, which is just a limited form of data parallelism. But we're actually already reaching the limits not only of stream processing but of data parallelism as a whole: most of the embarrassingly parallel rendering techniques -- skinning, morphing, particle systems -- are easy to "go wide" with on the current platforms, but what I hear developers wanting to parallelize in next-generation games is more often behaviors / AI / script execution.
nVidia is reporting huge wins from GPU parallelization of physics, and you could shift some of that to extra cores instead of the graphics hardware. But what else in near-future games is amenable to data parallelism? We could add some flexibility to our scheduler to get a broader set of GPGPU-style algorithms running, but most of the GPGPU algorithms aren't all that game-relevant.
In an earlier article, I advocated the "Actor" model of concurrency, a form of task parallelism, which opposes what I advocate here. So apparently I have waffled.
When speculating about the future of concurrency in games, I vacillate between two options for how games should use massive numbers of CPUs: Actors or map-reduce.
On the one hand, each entity (character, AI, prop, whatever) in the system could have its own thread. This is "task parallelism", which, in general, is a hellish nightmare of synchronization issues including contention and deadlock. But the Actor model simplifies synchronization.
On the other hand, each iteration in a loop could have its own thread. This is "data parallelism", where synchronization is either trivial (none) or nearly so (reduction).
From an architectural standpoint, both forms have the same merit: Robust synchronization.
But within thread parallelism, you have to choose between robust synchronization (the Actor model) and potentially more efficient but fragile ad-hoc synchronization schemes.
So the unifying theme in my otherwise apparently contradictory advocations is to prefer robust synchronization over ad-hoc schemes even if the latter are more efficient.
If you think of giving every entity its own thread as a giant update loop, where each entity's Update runs in one iteration, then in that limited sense the Actor model (which seems to be thread parallelism) differs from data parallelism in only one respect: how the various iterations of the loop interact.
My point in this post (though not articulated in this way) is to seek out ways of using the map-reduce pattern, to keep synchronization robust without the asynchronous message passing overhead endemic to the Actor model.
The upshot of this post is that I have seen what I consider a disproportionate amount of interest in thread-level parallelism with ad-hoc, complicated, and unwieldy synchronization schemes, and I have personally seen a relative neglect of seeking to exploit data parallelism -- especially the map-reduce pattern. I probably oversold the point in this post, but sometimes the joy of rhetoric takes over. This is a nearly stream-of-consciousness blog, and until you posted this comment I didn't think anybody read it. It's a sandbox for half-baked ideas; I just phrase them as though they're fully baked. Hence "misinformed cognoscenti". It's a place where charlatans like me can feel at home.
*cough*
Anyway, I appreciate your comment, which basically points out that thread-level parallelism has a home in games and simulations, and I agree.