David Field chats to David Nalasco, one of AMD’s senior techies for graphics cards. Technicality doesn’t just ensue, it haemorrhages. We recently traveled to Singapore to talk about tech with the AMD/ATI boys. And talk we did.
The first two pages of what you are about to read appear in Atomic issue 89, but we covered so much we've decided to publish the interview in its entirety here.
Atomic: Multi-GPUs are a niche market compared to the rest of the industry. Given the law of diminishing returns that you get with multi-GPU systems, how are you going to stay competitive in the high end card game?
David Nalasco: One thing we’ve been guilty of in the past is that we’ve designed ever larger, more expensive GPUs, so that our high end card would be something like a [$US] 500-600 GPU. We put all this research and development into designing this expensive card, but the return from it is fairly small because hardly anybody will spend that much on it.
What we’re doing now, especially with the HD3800, is focusing more engineering resources on a more reasonably priced product, and then providing a solution for someone who wants to spend huge amounts of money. That comes from multi-GPU scaling. The trick to that is to get your software to efficiently scale up, and the multi-core concept is not new to us on the GPU.
CPUs hit a wall when they were scaled up by increasing the MHz because they got too hot and too expensive. So they went to multi-core, and the undertaking to make applications multi-core aware and multithreaded is still an ongoing task.
We’ve never had this problem with GPUs. If we want to make a GPU faster, we just add more cores, and as we shrink the transistor size we can go from 64 to 128 to 256 cores. I mean we’re at 360 individual processors in an HD3800 series GPU, and we don’t see any clear limitations that would prevent us from increasing that.
But we can only build a chip that’s so big and so hot and so power hungry. And at some point, going above a certain level of power consumption is, like you said, a law of diminishing returns.
Atomic: How challenging has it been to get multi-GPU systems working efficiently, given they’ve had so little time to mature relative to single card solutions?
David: In Crossfire, we use multiple frame rendering, which basically queues up frames across multiple GPUs. When we started working on our multi GPU driver, DirectX 9 was fairly simple. We had a good understanding of it and it scaled well.
DirectX 10 is a lot more challenging because it’s new and has existed only as long as Vista has existed, whereas DirectX 9 has been around for 5 or 6 years.
One of the things that made multiple GPU processing difficult in DirectX 10 was that some of the more sophisticated render effects take the results of one frame and use them to determine what you see in the next frame. Take motion blur as an example. There you want to see how fast an object is moving, and if it’s moving really fast you blur it across multiple frames. That means that you have to keep track of what happened in the scene one, two, three frames ago.
With multiple frame rendering you have different frames on different GPUs, so you have to transfer data between GPUs which takes up a lot of resources and prevents you from getting scaling in a lot of cases. You’ll never be any slower than a single GPU, but you can get into a case where, say, three GPUs aren’t much faster than a single GPU, maybe only 10% faster. And we want to avoid that wherever possible.
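To put rough numbers on that scaling problem, here is a toy model in Python. It is a sketch under loud assumptions, not a description of how the Crossfire driver works: the function name `afr_speedup`, the idea of a `dependent_fraction` (the share of each frame that must wait for the previous frame's results) and the fixed `transfer_ms` copy cost are all invented for illustration. The clamp to single-GPU speed simply encodes the claim above that you are never slower than one GPU.

```python
def afr_speedup(num_gpus, render_ms, transfer_ms, dependent_fraction):
    """Toy estimate of the speedup over one GPU with multiple frame rendering.

    All parameters are hypothetical modelling knobs:
      render_ms          - time for one GPU to render one frame
      transfer_ms        - cost of copying last frame's data between GPUs
      dependent_fraction - share of a frame that needs the previous frame's result
    """
    if num_gpus <= 1:
        return 1.0
    # Work that depends on the previous frame cannot overlap with it.
    serial_ms = dependent_fraction * render_ms + transfer_ms
    # The remaining work overlaps across GPUs, one frame per GPU.
    parallel_ms = (1.0 - dependent_fraction) * render_ms / num_gpus
    frame_interval_ms = serial_ms + parallel_ms
    # Assumption taken from the interview: never slower than a single GPU.
    frame_interval_ms = min(frame_interval_ms, render_ms)
    return render_ms / frame_interval_ms


if __name__ == "__main__":
    print(afr_speedup(3, 30.0, 0.0, 0.0))  # independent frames
    print(afr_speedup(3, 30.0, 2.0, 0.8))  # heavy frame-to-frame dependency
```

With fully independent frames the model gives three GPUs a threefold speedup. With 80% of each frame depending on the previous one and a 2 ms copy, three GPUs come out only about 7% faster than one, which is the kind of result described above.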
The tough part is that a lot of it has to do with the developer. How they architect their game and rendering algorithm has a big impact on us, so we try to work with them and show them certain techniques that they can use to allow our driver to handle their game more efficiently.
Certain games that we find more challenging were just designed in such a way that they make it very difficult for us to optimise around their issues. A lot of that has to do with our DirectX 10 driver still maturing, but we’re just starting to see the first round of games that are using DirectX 10, so even the developers aren’t that familiar with it.
I’m fairly confident that it’s going to improve and we’re going to start seeing better scaling, much like we see in DirectX 9 already; it’s just taking some time for us to work out those details.
Atomic: How much of a jump is DirectX 10.1 from DirectX 10?
David: It’s an incremental change. There are a lot of what seem like nitpicky little features added here and there, but certain algorithms can really benefit from these small changes. Take global illumination for example.
Global illumination is a way to capture all the lighting in the scene including multiple bounces, multiple light sources and area light sources all at once. It gives us an image that looks like a ray traced image, but is very fast. Doing that with traditional techniques is just not feasible.
Normally when you have light sources in a scene, you calculate how much light is coming from each light source for every pixel, and traditionally games just calculate the contribution from all these different light sources. You get into problems when you have a lot of light sources in the scene (like hundreds), because performance just scales down linearly.
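The linear cost is easy to see in a minimal Python sketch of per-pixel lighting. The shading model here (diffuse Lambertian lighting from point lights with inverse-square falloff) is an assumption chosen for brevity, not something named in the interview, and `shade_pixel` and `lighting_evaluations` are hypothetical helper names.

```python
import math


def shade_pixel(position, normal, lights):
    """Sum the diffuse contribution of every light at one surface point.

    `lights` is a list of (light_position, intensity) pairs. The loop body
    runs once per light, so the cost per pixel grows linearly with the
    number of lights.
    """
    total = 0.0
    for light_position, intensity in lights:
        to_light = [l - p for l, p in zip(light_position, position)]
        distance = math.sqrt(sum(c * c for c in to_light))
        if distance == 0.0:
            continue  # light sits exactly on the surface point; skip it
        direction = [c / distance for c in to_light]
        # Lambert's cosine term, clamped so lights behind the surface add nothing.
        cos_angle = max(0.0, sum(n * d for n, d in zip(normal, direction)))
        total += intensity * cos_angle / (distance * distance)
    return total


def lighting_evaluations(num_pixels, num_lights):
    """Number of per-light evaluations needed for one frame."""
    return num_pixels * num_lights


if __name__ == "__main__":
    overhead = [((0.0, 1.0, 0.0), 1.0)]
    print(shade_pixel((0.0, 0.0, 0.0), (0.0, 1.0, 0.0), overhead))
    print(lighting_evaluations(1920 * 1080, 100))
```

At 1920 x 1080 with 100 lights, that is more than 200 million light evaluations per frame before any bounces are considered.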
Another problem is when you have area light sources, like if an entire wall is glowing one colour. It’s hard to define that as one light source because the light is actually coming from multiple locations.
Then there’s the multiple bounce problem. If we want to do things like ray tracing, we have to do the calculations multiple times for each bounce that we want to calculate, which slows things down even further.
To do global illumination in DirectX 10, we implemented a technique that uses something called cube maps, which have been in DirectX for a while. For any given point, a cube map renders how the scene looks from that point from six different directions, like the faces of a cube. We don’t do it from every pixel in the scene, we break the scene up into cubes, and for each cube we calculate how the scene looks in all six directions from the centre of that cube; we can break the scene up into as many cubes as we want.
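The grid-of-cubes idea can be sketched in a few lines of Python. This only illustrates the geometry: where the cube centres sit and which of the six faces a given direction falls on. It does no rendering, and the names `CUBE_FACES`, `probe_centres` and `face_for_direction` are made up for this example rather than taken from DirectX or AMD's implementation.

```python
# The six view directions of a cube map, one per face.
CUBE_FACES = {
    "+X": (1, 0, 0), "-X": (-1, 0, 0),
    "+Y": (0, 1, 0), "-Y": (0, -1, 0),
    "+Z": (0, 0, 1), "-Z": (0, 0, -1),
}


def probe_centres(bounds_min, bounds_max, cells_per_axis):
    """Split an axis-aligned scene box into cubes and return each cube's centre."""
    sizes = [(hi - lo) / cells_per_axis for lo, hi in zip(bounds_min, bounds_max)]
    centres = []
    for ix in range(cells_per_axis):
        for iy in range(cells_per_axis):
            for iz in range(cells_per_axis):
                index = (ix, iy, iz)
                centres.append(tuple(
                    lo + (i + 0.5) * size
                    for lo, i, size in zip(bounds_min, index, sizes)
                ))
    return centres


def face_for_direction(direction):
    """Pick the cube face a direction points at: the axis with the largest magnitude."""
    axis = max(range(3), key=lambda i: abs(direction[i]))
    sign = "+" if direction[axis] >= 0 else "-"
    return sign + "XYZ"[axis]


if __name__ == "__main__":
    centres = probe_centres((0.0, 0.0, 0.0), (2.0, 2.0, 2.0), 2)
    print(len(centres), centres[0])
    print(face_for_direction((0.1, -0.9, 0.2)))
```

Each centre returned by `probe_centres` would need its own six-face cube map, so a finer grid quickly multiplies the number of cube maps to render.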
In DirectX 10 you can only render one cube map at a time, but in DirectX 10.1 there’s a feature that lets us render multiple cube maps in parallel. It seems like a small, simple change that only a developer would normally be interested in, but because we can now calculate multiple cubes simultaneously, what would have taken hundreds of passes before now takes just a small number of passes through the rendering engine.
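The saving is simple arithmetic, shown here as a small Python sketch. The batch size of 64 cube maps per pass is a made-up figure for illustration; the interview does not say how many cube maps DirectX 10.1 hardware can render at once, and `cube_map_passes` is a hypothetical helper.

```python
import math


def cube_map_passes(num_cubes, cubes_per_pass=1):
    """Rendering passes needed when `cubes_per_pass` cube maps fit in one pass.

    cubes_per_pass=1 models the one-at-a-time case described for DirectX 10;
    larger values model rendering several cube maps in parallel.
    """
    if cubes_per_pass < 1:
        raise ValueError("cubes_per_pass must be at least 1")
    return math.ceil(num_cubes / cubes_per_pass)


if __name__ == "__main__":
    print(cube_map_passes(512))      # one cube map per pass
    print(cube_map_passes(512, 64))  # hypothetical batch of 64 per pass
```

Under those assumed numbers, 512 cubes drop from 512 passes to 8.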