I want to give a shout out to Robert Nystrom. He's a colleague at EA who is writing a book about Design Patterns for use in video games, which, to my knowledge, would be the first true book on Game Engine Architecture. Other books claim to be that but fall short of providing a valuable description of how to concoct a game engine architecture -- they merely present an example of one.
Bob also works on an interesting language called Magpie. It's really tempting to talk about his language but I don't want to steal his thunder, so go read what he has to write about it. I will mention, though, that his adoption of ML-style type inference makes me think that's the way to go.
He has nifty ideas about allowing coders to create infix operators. When I asked him if he came up with that notion, he tried to pass the idea off on older languages like Lisp that allow practically anything as an identifier, but I think he's got something new. And I want to steal his ideas.
By the way, he was also in a band.
So I guess today is Bob's day.
Monday, October 5, 2009
Tuesday, September 8, 2009
Concurrency in games: Prefer data parallelism to task parallelism
Programmers, including game architects, have a weakness for interesting things, and this ends up complicating code, when simple code would work better.
I have been writing a series of articles about fluid simulation, which includes a discussion about parallelization using Intel Threading Building Blocks (TBB). That API advocates using data parallelism instead of its alternatives, task parallelism or instruction parallelism. And I agree with their approach, but it is very boring and therefore, to some extent, doomed in the face of programmers looking for something more challenging.
I used to research computational fluid dynamics, which entailed using massively parallel processing (MPP), and I had accounts on a lot of very large machines. My simulations and post-processing code (and most simulations that used those massive machines) used data parallelism, because it scales well, meaning that the same code works about as efficiently regardless of problem size or the number of CPU's available. The work automatically gets redistributed to suit both the problem size and the computer capacity.
Data parallelism basically means that if you have a loop that iterates over data, you divide the iterations across multiple threads. So if the loop has N iterations and M threads then each thread works on N/M of the iterations.
The problem with the simplest form of data parallelism is that to work properly, the result of each iteration must be independent of the results of all other iterations. Such "embarrassingly parallel" problems are trivial to parallelize -- so easy that one might think one could not be so lucky as to have enough such problems to use up all computational resources. So programmers, being career problem-solvers, inevitably gravitate towards solving the more general problem: task parallelism.
Task parallelism means that various tasks run on various threads, and from time to time, the threads synchronize and communicate. The synchronization requires special mediation using atomic operations. This is the most general form of parallelism, and it leads to many potential problems, such as deadlock.
At GDC in 2007, I saw a talk given by a lead programmer of a studio making games for Xenon and Cell, each of which have multiple processing cores. The talk focused on their nifty job manager and how successful they were at balancing computational load. He spoke for an hour on all the trials and tribulations, discussed extremely interesting load-balancing strategies, expounded on how to detect and diagnose deadlock, described why all the naive approaches fail or fall short, showed graphs of computational load, and the talk was fascinating. The audience was full of people who had either written their own job managers or were about to, inspired by the success of the speaker and enthused by how interesting it all is.
Interesting, but ultimately not the right way to spend a lot of time.
Task parallelism does not scale well. If the problem size increases then so must the code; a programmer must manually modify the algorithm to involve more subtasks. Likewise, if the number of CPU's varies, the code must change. One might argue with this claim and state that the job manager will balance the load automatically, given a pool of processors. Perhaps, but the job manager cannot use more threads than there are jobs. So, task parallelism is not scalable to the extent we require.
The Xenon and Cell have more processing units than previous consoles had, but only by a small amount -- under a dozen each. In contrast, MPP machines have hundreds or thousands of CPU's. When that new generation of consoles came out, accustomed to thinking of parallezing across vast numbers of CPU's, I felt these multi-core consoles were quaint, but annoying -- too many CPU's to ignore, too few to use without an uncomfortable amount of effort. Not to mention that they did not even agree on a memory model -- Xenon uses shared and PS3 uses distributed. So people wrote versatile and robust cross-platform job manager API's. And for a few years, people at GDC had something new to talk about.
Since everybody knows that not enough problems can use loop parallelism, why do I advocate it? We have problems that require synchronization -- the data has interdepencencies -- so we need task parallelism and job managers, right?
Yes and no.
Most data interdependencies for large problems that occur in virtual simulations follow a "reduction" pattern, or a "map-reduce" pattern. This means that the problem can be broken into 2 phases: One (map) which subdivides the problem into smaller, independent pieces, and another (reduce) which combines each divided piece to its original whole. This principle allows "divide-and-conquer" to work.
A great example is "merge sort". Merge sort recursively divides a list of items in half until the list of items is small enough to sort trivially. At that point, you have a binary tree of sorted sub-lists. A merge operation merges 2 sorted lists such that the result is also sorted. The merge operation is less complex (linear, in fact) than the sort operation, but the sort operation gets subdivided until complexity becomes trivial (i.e. constant time).
By analogy, map-reduce problems boil down to being able to figure out how to combine the results of multiple computations, efficiently.
It turns out that at least one company has hinged their success at least partially on being able to phrase problems such that they suit the map-reduce paradigm.
And my series of articles demonstrates that physical simulation between thousands of mutually interacting particles also fits the same paradigm.
I predict that when the number of processors increases to a "serious" quantity (dozens or more) then we'll see a shift away from interesting discussions of job managers and task parallelism, toward a far more commonplace practice of using data parallelism, and map-reduce will be as commonly applied, and about as worthy of discussion, as creating a class.
We'll still need task parallelism for niche tasks (like asynchronous file loading) but I'd like to see its importance recede compared to data parallelism. I'd like to see a relatively small number of persistent tasks, each of which uses data parallelism as the primary means of exploiting parallel hardware.
I have been writing a series of articles about fluid simulation, which includes a discussion about parallelization using Intel Threading Building Blocks (TBB). That API advocates using data parallelism instead of its alternatives, task parallelism or instruction parallelism. And I agree with their approach, but it is very boring and therefore, to some extent, doomed in the face of programmers looking for something more challenging.
I used to research computational fluid dynamics, which entailed using massively parallel processing (MPP), and I had accounts on a lot of very large machines. My simulations and post-processing code (and most simulations that used those massive machines) used data parallelism, because it scales well, meaning that the same code works about as efficiently regardless of problem size or the number of CPU's available. The work automatically gets redistributed to suit both the problem size and the computer capacity.
Data parallelism basically means that if you have a loop that iterates over data, you divide the iterations across multiple threads. So if the loop has N iterations and M threads then each thread works on N/M of the iterations.
The problem with the simplest form of data parallelism is that to work properly, the result of each iteration must be independent of the results of all other iterations. Such "embarrassingly parallel" problems are trivial to parallelize -- so easy that one might think one could not be so lucky as to have enough such problems to use up all computational resources. So programmers, being career problem-solvers, inevitably gravitate towards solving the more general problem: task parallelism.
Task parallelism means that various tasks run on various threads, and from time to time, the threads synchronize and communicate. The synchronization requires special mediation using atomic operations. This is the most general form of parallelism, and it leads to many potential problems, such as deadlock.
At GDC in 2007, I saw a talk given by a lead programmer of a studio making games for Xenon and Cell, each of which have multiple processing cores. The talk focused on their nifty job manager and how successful they were at balancing computational load. He spoke for an hour on all the trials and tribulations, discussed extremely interesting load-balancing strategies, expounded on how to detect and diagnose deadlock, described why all the naive approaches fail or fall short, showed graphs of computational load, and the talk was fascinating. The audience was full of people who had either written their own job managers or were about to, inspired by the success of the speaker and enthused by how interesting it all is.
Interesting, but ultimately not the right way to spend a lot of time.
Task parallelism does not scale well. If the problem size increases then so must the code; a programmer must manually modify the algorithm to involve more subtasks. Likewise, if the number of CPU's varies, the code must change. One might argue with this claim and state that the job manager will balance the load automatically, given a pool of processors. Perhaps, but the job manager cannot use more threads than there are jobs. So, task parallelism is not scalable to the extent we require.
The Xenon and Cell have more processing units than previous consoles had, but only by a small amount -- under a dozen each. In contrast, MPP machines have hundreds or thousands of CPU's. When that new generation of consoles came out, accustomed to thinking of parallezing across vast numbers of CPU's, I felt these multi-core consoles were quaint, but annoying -- too many CPU's to ignore, too few to use without an uncomfortable amount of effort. Not to mention that they did not even agree on a memory model -- Xenon uses shared and PS3 uses distributed. So people wrote versatile and robust cross-platform job manager API's. And for a few years, people at GDC had something new to talk about.
Since everybody knows that not enough problems can use loop parallelism, why do I advocate it? We have problems that require synchronization -- the data has interdepencencies -- so we need task parallelism and job managers, right?
Yes and no.
Most data interdependencies for large problems that occur in virtual simulations follow a "reduction" pattern, or a "map-reduce" pattern. This means that the problem can be broken into 2 phases: One (map) which subdivides the problem into smaller, independent pieces, and another (reduce) which combines each divided piece to its original whole. This principle allows "divide-and-conquer" to work.
A great example is "merge sort". Merge sort recursively divides a list of items in half until the list of items is small enough to sort trivially. At that point, you have a binary tree of sorted sub-lists. A merge operation merges 2 sorted lists such that the result is also sorted. The merge operation is less complex (linear, in fact) than the sort operation, but the sort operation gets subdivided until complexity becomes trivial (i.e. constant time).
By analogy, map-reduce problems boil down to being able to figure out how to combine the results of multiple computations, efficiently.
It turns out that at least one company has hinged their success at least partially on being able to phrase problems such that they suit the map-reduce paradigm.
And my series of articles demonstrates that physical simulation between thousands of mutually interacting particles also fits the same paradigm.
I predict that when the number of processors increases to a "serious" quantity (dozens or more) then we'll see a shift away from interesting discussions of job managers and task parallelism, toward a far more commonplace practice of using data parallelism, and map-reduce will be as commonly applied, and about as worthy of discussion, as creating a class.
We'll still need task parallelism for niche tasks (like asynchronous file loading) but I'd like to see its importance recede compared to data parallelism. I'd like to see a relatively small number of persistent tasks, each of which uses data parallelism as the primary means of exploiting parallel hardware.
Subscribe to:
Posts (Atom)
