Atomic: What’s the biggest difference between Firestream and a consumer card when you’re looking at GPGPU calculations?
David: The most obvious difference is that there's no display output. There are no major hardware differences, but we package those boards with a lot more memory, because if you use a GPU to replace what a CPU would have done, the CPU might have access to four gigs or more of system memory. So, ideally, we have to put equivalent amounts of memory on the GPU, subject to the limitation that the GPU memory we use is much faster and more expensive, so we can't go and put 16GB of it on a single card. We are planning to support configurations of up to 4GB for Firestream, which is more than you would ever get or need on a consumer graphics card.
Atomic: What would happen if you were to put DDR3 on a graphics card that was doing GPGPU calculations instead of graphics calculations?
David: Well, you could, but DDR3 is just not going to be as fast as GDDR3. DDR3 is actually approaching the speeds of GDDR3, but GDDR keeps progressing: there's GDDR4, and GDDR5 is coming as well. So I think GDDR3 will start replacing GDDR2 as the low-cost solution for graphics cards over the coming year or two. But mass-market system memory will never be as fast as dedicated graphics memory.
And to really take advantage of a GPU – think about it – if we're saying we have a hundred times the processing power of a CPU, we can't just feed it data at the same rate. The memory speed becomes a major bottleneck, so even though roughly doubling the memory speed on a Firestream board is the best we can do, it's not necessarily a deal breaker. We'd still like to have far more bandwidth.
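As a rough illustration of the gap David is describing, the back-of-the-envelope C snippet below compares typical system-memory and graphics-memory bandwidth of the period and works out how much arithmetic a GPU would need to do per byte fetched to stay busy. The clock rates, bus widths and GFLOPS figure are assumed, era-typical values chosen for the example, not numbers quoted in the interview.

#include <stdio.h>

int main(void) {
    /* Dual-channel DDR3-1600 system memory: 1600 MT/s x 16 bytes per transfer (assumed) */
    double system_gbps   = 1600e6 * 16 / 1e9;   /* ~25.6 GB/s */

    /* GDDR3 at 1600 MT/s effective on a 256-bit (32-byte) bus (assumed) */
    double graphics_gbps = 1600e6 * 32 / 1e9;   /* ~51.2 GB/s */

    /* Hypothetical GPU peak of 500 GFLOPS, single precision */
    double gflops = 500.0;

    printf("System memory bandwidth:   %.1f GB/s\n", system_gbps);
    printf("Graphics memory bandwidth: %.1f GB/s\n", graphics_gbps);

    /* Operations the GPU must perform per byte fetched to stay compute-bound */
    printf("Required arithmetic intensity: %.1f FLOPs per byte\n",
           gflops / graphics_gbps);
    return 0;
}

With these figures the graphics card has only about twice the bandwidth of the system memory, yet needs roughly ten operations per fetched byte before the memory stops being the bottleneck – which is the imbalance David is pointing at.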
Atomic: So you prefer speed over capacity?
David: It depends a lot on the application. When we design GPGPU applications, some problems have to be reframed, because if you have a problem that requires constant accessing of memory, that's the kind of application that won't run any better on the GPU than on the CPU. If your limitation is memory bandwidth, the GPU doesn't have that much more memory bandwidth than the CPU does – it has some more, but it might be one to two times more, not 50 times more. So sometimes you have to reframe a problem such that you reduce the number of memory accesses and maximise how much computation you do on each piece of data.
The stuff that runs best on GPUs takes a small amount of data out of memory and does a huge number of operations on it. Some problems fall into that category and some don't. In some problems you just take a bit of information out of memory, perform a couple of math operations on it and write it back. In that kind of situation you're not going to get a big speedup on the GPU, at least not until we can find a way to get more memory bandwidth into the GPU. So it depends on the application, basically.
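To make the distinction concrete, here is a minimal sketch in C of the two workload shapes David contrasts: a bandwidth-bound loop that does only a couple of operations per element, and a compute-bound loop that fetches a value once and then does a large amount of work on it. The function names, the iterated formula and the loop counts are invented purely for illustration.

/* Low arithmetic intensity: one read, a couple of math ops, one write.
 * This is bandwidth-bound, so a GPU gives little speedup over a CPU. */
void scale_and_offset(float *data, int n, float a, float b) {
    for (int i = 0; i < n; i++)
        data[i] = a * data[i] + b;          /* 2 FLOPs per 8 bytes moved  */
}

/* High arithmetic intensity: each value is fetched once and then a huge
 * amount of computation is done on it before it is written back.
 * This is compute-bound, the shape that maps well onto a GPU. */
void iterate_map(float *data, int n, int iters) {
    for (int i = 0; i < n; i++) {
        float x = data[i];                  /* one read ...               */
        for (int k = 0; k < iters; k++)     /* ... thousands of FLOPs     */
            x = x * x - 0.5f;
        data[i] = x;                        /* one write                  */
    }
}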
Atomic: What can you tell us about Swift and its architecture? Are you going to include any memory on the die?
David: We're taking the concept of the IGP and expanding it, so the CPU will be in the IGP package. The way an IGP works now is that you have a shared memory interface.
The idea is that our CPU memory controllers are designed to use system memory, because they have to – that's what they're always connected to. In the case of a Fusion processor like Swift, the GPU shares the bus to system memory, so there isn't a way you can use high-speed dedicated graphics memory in that configuration.
Atomic: What if you integrated the memory into the chip?
David: There are a lot of challenges. You've seen how memory prices have fallen through the floor; that's because the processes used to make memory chips have been highly optimised for memory. Memory looks very different, in terms of how the silicon is laid out, compared to logic like CPUs and GPUs. There are ways to embed memory into a microprocessor, but they're far less efficient in terms of space than what you get on a memory die.
So for us to integrate even a small amount of memory onto a microprocessor takes an inordinate amount of die area. Look at a CPU with a 2MB or 6MB cache – that cache already takes up a significant proportion of the die. So if you're talking about 512MB of memory built the way that cache is built, it would be extremely expensive in terms of die area and not really practical. Instead, what we try to do is find ways to make efficient caches.
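A quick back-of-the-envelope calculation shows why on-die memory at that scale is impractical. The sketch below assumes the classic six-transistor SRAM cell used for caches and simply compares the transistor budget of a 6MB cache with that of the 512MB example David mentions, if it were built the same way; the figures are illustrative arithmetic, not numbers from the interview.

#include <stdio.h>

int main(void) {
    const double SRAM_TRANSISTORS_PER_BIT = 6.0;   /* classic 6T SRAM cell */

    double cache_mb = 6.0;     /* a typical CPU cache of the time          */
    double frame_mb = 512.0;   /* the 512MB example from the interview     */

    double cache_transistors = cache_mb * 1024 * 1024 * 8 * SRAM_TRANSISTORS_PER_BIT;
    double frame_transistors = frame_mb * 1024 * 1024 * 8 * SRAM_TRANSISTORS_PER_BIT;

    printf("6MB cache in SRAM:    ~%.1f billion transistors\n", cache_transistors / 1e9);
    printf("512MB built as SRAM:  ~%.1f billion transistors\n", frame_transistors / 1e9);
    return 0;
}

Roughly 0.3 billion transistors for the cache against well over 25 billion for 512MB – far beyond what could be spent on memory in a processor die of that era, which is why the external memory interface stays.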
CPUs obviously use a lot of cache, but GPUs are very efficient in terms of cache. Whereas a CPU's cache has to function in a very general-purpose way – it has to be able to cache any kind of data used for any kind of purpose – GPU caches are highly specialised. So we have texture caches, vertex caches and colour caches: a whole multitude of small caches that each do only one specific task, but do it very efficiently. Those caches might be kilobytes, usually less than a megabyte, but they give you very good performance.
So I don't think you'll see a trend in the near future towards including very large amounts of memory on a Fusion type of die. You'll see cache sizes increase, as they are on CPUs in general, but the only way you're going to get large amounts of memory is still through some kind of external interface.
Atomic: If die real estate is so expensive, will we see a Core 2 Duo-style inclusion of two GPUs on a single die, or are we just going to see more CrossFire on single cards using two GPUs?
David: I wouldn't necessarily say that. The idea of accelerated computing, or this heterogeneous core [Swift], is that you can tailor the core mix for an application. You might decide that an application benefits from a single CPU core and multiple GPU cores if you want it to be really graphics heavy, or you might say that for a certain task – a server application, say – you need lots of CPU cores and one very simple GPU core. Although our initial implementations may not work that way, the design we're working on is such that you can potentially have multiple GPU cores and multiple CPU cores. Of course, as these cores shrink we have more flexibility in how much we can mix and match, but I wouldn't rule out seeing multiple GPUs on a die some time in the future.