The time has come for NVIDIA to finally realise the vision they first set out with Fermi, improving on many strong points of the architecture, boosting efficiency, and simultaneously moving to a smaller process. On paper, the GTX 680 (codename Kepler) looks to be one hell of a GPU. It has 1536 CUDA cores, each one essentially the same as those found on Fermi (just smaller), almost double the performance per watt, a 6.0Gbps memory data rate, a 1GHz core clock and 3.5 billion transistors.
It’s the last item on that spec list that we’re most interested in, though. Transistors are the building blocks of the GPU and its processing power, so counting them is similar to measuring an engine’s power by its cylinder capacity. Other factors in the GPU architecture do play a significant role in overall performance, though the transistor count is a fairly good indicator of what to expect when comparing new-generation tech to existing products.
By comparison, the GTX 580 has 3 billion transistors, and for this reason we only expected to see gains of around 16 per cent over the GTX 580 at the same clock speeds. Of course, the GTX 680’s actual lead over the GTX 580 is larger than 16 per cent, and this is largely due to the 1GHz clock speed found on the GTX 680.
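The 16 per cent figure is simply the ratio of the two transistor totals. A minimal sketch of that arithmetic, using only the rounded totals quoted in this article (the function name is ours, purely for illustration):

```python
# Rounded transistor totals as quoted in the article.
GTX_580_TRANSISTORS = 3.0e9
GTX_680_TRANSISTORS = 3.5e9


def percent_increase(old: float, new: float) -> float:
    """Return how much larger `new` is than `old`, as a percentage."""
    return (new - old) / old * 100.0


transistor_gain = percent_increase(GTX_580_TRANSISTORS, GTX_680_TRANSISTORS)
print(f"Extra transistors on the GTX 680: {transistor_gain:.1f}%")  # 16.7%
```

This is a paper comparison only; it says nothing about how efficiently either architecture uses its transistors.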
As mentioned, the GTX 680 has around 16 per cent more transistors as a grand total. However, NVIDIA have tripled the core count, from 512 in the old GTX 580 to a very respectable 1536. Though the GTX 680 has an enormous core count, each core has far fewer transistors behind it, making each core less powerful than those on the GTX 580. Dividing the totals by the core counts, each GTX 680 core has roughly 2.3 million transistors at its disposal, while each GTX 580 core had a far more significant 5.86 million or so. This gives each GTX 580 core around 2.6 times the transistor budget on paper, and this is one of the main reasons why the three times greater core count on the GTX 680 has not converted into three times the performance.
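The per-core figures come from dividing each chip's total transistor count by its CUDA core count. The sketch below redoes that division with the totals quoted in this article. Bear in mind it is a crude average: a real GPU also spends transistors on caches, memory controllers and other logic, so these are not literal per-core counts.

```python
def transistors_per_core(total_transistors: float, cuda_cores: int) -> float:
    """Average transistor budget per CUDA core (total divided by core count)."""
    return total_transistors / cuda_cores


# Totals and core counts as quoted in the article.
gtx_580_per_core = transistors_per_core(3.0e9, 512)
gtx_680_per_core = transistors_per_core(3.5e9, 1536)
budget_ratio = gtx_580_per_core / gtx_680_per_core

print(f"GTX 580: {gtx_580_per_core:,.0f} transistors per core")  # 5,859,375
print(f"GTX 680: {gtx_680_per_core:,.0f} transistors per core")  # 2,278,646
print(f"GTX 580 budget is {budget_ratio:.2f}x the GTX 680's")    # 2.57x
```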
As for the physical size of the chip, 294mm² is the final measurement; this is significantly smaller than the GTX 580, which sported roughly 3 billion transistors on a 520mm² chip. The reason for this is that NVIDIA have moved from a 40nm manufacturing process to 28nm, which shrinks the nominal feature size by 30 per cent. Given the much smaller size of each transistor and of the chip itself, it’s fairly safe to assume that the power requirements have also been reduced. NVIDIA claim a maximum power draw of around 195W, which as far as we can tell is fairly accurate. Compare this to the AMD Radeon HD 7970, and you should see a decent power saving on paper. In our testing, however, the 3D load consumption of the two video cards is extremely close.
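The shrink is easy to misread: 40/28 is about 1.43, so the old feature size is roughly 43 per cent larger than the new one, while the new one is 30 per cent smaller than the old. The sketch below works through that and compares transistor density using the die sizes and totals quoted above. It treats the node names as literal linear dimensions, which is an idealisation; real process nodes are nominal labels, so take the results as a rough guide only.

```python
def linear_shrink(old_nm: float, new_nm: float) -> float:
    """Fractional reduction in nominal feature size (0.30 means 30% smaller)."""
    return 1.0 - new_nm / old_nm


def ideal_area_scale(old_nm: float, new_nm: float) -> float:
    """Ideal area of a shrunk transistor relative to the old one."""
    return (new_nm / old_nm) ** 2


def density_per_mm2(transistors: float, die_area_mm2: float) -> float:
    """Average transistors per square millimetre of die."""
    return transistors / die_area_mm2


shrink = linear_shrink(40, 28)        # about 0.30
area_scale = ideal_area_scale(40, 28)  # about 0.49

# Totals and die sizes as quoted in the article.
density_580 = density_per_mm2(3.0e9, 520)
density_680 = density_per_mm2(3.5e9, 294)
density_ratio = density_680 / density_580

print(f"Feature size reduction: {shrink:.0%}")
print(f"Ideal area per transistor: {area_scale:.0%} of the 40nm size")
print(f"GTX 580 density: {density_580 / 1e6:.1f}M transistors per mm2")
print(f"GTX 680 density: {density_680 / 1e6:.1f}M transistors per mm2")
print(f"Density ratio: {density_ratio:.2f}x")
```

The measured density ratio of roughly 2x sits close to the ideal 1/0.49 scaling, which fits the claim that the smaller die is mostly down to the process change.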
Another way NVIDIA managed to cut back on power draw, besides employing a smaller 28nm process, was to simplify the architecture compared to the overly complicated Fermi. The NVIDIA design team removed the hardware-level scheduler and moved the task to software. This is arguably less efficient in theory, though removing hardware simplifies the design, reduces manufacturing costs and also reduces power consumption.
In order to make this transition work, NVIDIA had to spend a long time working on the software compiler and its efficiency. If this crucial piece had failed, or proved inefficient, the entire project would have been dead in the water, and NVIDIA would have needed to either fix the software issues or go back to the drawing board and reinstate a hardware-level scheduler. Luckily, their software-implemented solution has paid off.
The good news is that with the 28nm process NVIDIA are now using, clock speeds can be increased and power consumption can be lowered. This has allowed NVIDIA to develop a new technology that monitors GPU load, clock speed, power consumption and temperatures. This technology is called “GPU Boost” and is best compared to Intel’s “Turbo Boost.”
Essentially, the GPU driver monitors the card’s workload and sets the core voltage, fan speed and GPU clock to suit. So for 2D web browsing, you will likely end up with a core clock of around 300-350MHz, which can vary from card to card. Load up a light 3D application and you may see your “default” core clock of 1006MHz or, if this level of performance is not required, somewhere around 700MHz. The core clock is no longer a static number, and now more closely resembles a car’s tachometer.
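To make the idea concrete, here is a toy model of load-based clock selection. It is emphatically not NVIDIA's algorithm: the load thresholds, the 325MHz idle value (picked from within the 300-350MHz range above) and the function name are all our own assumptions, and only the 700MHz, 1006MHz and 195W figures come from this article. It also ignores voltage, fan speed and temperature entirely.

```python
# Clock tiers loosely based on the figures in the article.
IDLE_CLOCK_MHZ = 325       # assumed value inside the quoted 300-350MHz range
LIGHT_3D_CLOCK_MHZ = 700   # "somewhere around 700MHz" for light 3D work
BASE_CLOCK_MHZ = 1006      # the "default" core clock
POWER_LIMIT_W = 195        # NVIDIA's claimed maximum power draw

# Hypothetical load thresholds, chosen only for illustration.
IDLE_LOAD_THRESHOLD = 0.10
LIGHT_LOAD_THRESHOLD = 0.50


def pick_core_clock(gpu_load: float, board_power_w: float) -> int:
    """Pick a core clock in MHz from GPU load (0.0 to 1.0) and power draw."""
    if not 0.0 <= gpu_load <= 1.0:
        raise ValueError("gpu_load must be between 0.0 and 1.0")

    # Choose a tier from the current load.
    if gpu_load < IDLE_LOAD_THRESHOLD:
        target = IDLE_CLOCK_MHZ
    elif gpu_load < LIGHT_LOAD_THRESHOLD:
        target = LIGHT_3D_CLOCK_MHZ
    else:
        target = BASE_CLOCK_MHZ

    # If the card is at or over its power limit, never go above the light tier.
    if board_power_w >= POWER_LIMIT_W:
        target = min(target, LIGHT_3D_CLOCK_MHZ)
    return target


print(pick_core_clock(0.02, 15.0))    # 2D desktop use -> 325
print(pick_core_clock(0.30, 80.0))    # light 3D -> 700
print(pick_core_clock(0.90, 150.0))   # heavy 3D within the power limit -> 1006
print(pick_core_clock(0.90, 200.0))   # heavy 3D over the power limit -> 700
```

The point of the sketch is simply that the clock becomes an output of the card's current state rather than a fixed specification.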