We take a look at the Intel Nehalem CPU - the Core i7 965.
Intel has dominated the processor industry for the past few years, spurred on by its amazing 65nm Core 2 Duo and Quad releases. It extended that lead when it shrank those processors to a 45nm process, fitting the same cores into a smaller die while bumping up both cache and performance significantly. Two years later, we enter a new phase, complete with a new core and a whole new Intel.
Nehalem is built on the tried-and-true 45nm process, but sits in a completely new foundation: the LGA1366 socket. This socket (funnily enough) has 1366 pins inside it, and is physically larger than the current LGA775. It uses a lever securing method similar to that of LGA775, but adds a metal bracket on the back of the motherboard, so physical stability and strength are much improved. The socket wasn’t made larger and sturdier on a whim, though: it has to hold a whole new chip.
The new architecture is a slight reworking of the previous Core 2 design, starting with the Instruction Fetch and Pre-Decode stage. Here, the processor retrieves the code and holds it in the 32KB Level-1 instruction cache, an incredibly fast cache located directly next to each core. This cache is fed by a 256KB Level-2 cache (each core has its own), which in turn is fed by a shared 8MB pool of Level-3 cache. Each core can use more or less of that pool depending on its needs, so a single core can potentially have access to the full 8MB.
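To make the L1, L2 and L3 lookup order concrete, here is a minimal toy model of the hierarchy. The cache sizes come from the text above. Everything else is an assumption made for illustration: the 64-byte line size, the fully associative LRU replacement, and the names `CacheLevel`, `make_hierarchy` and `access`. It is a sketch of the general idea, not a description of how Nehalem actually manages its caches.

```python
from collections import OrderedDict

LINE_SIZE = 64  # bytes per cache line (assumed; not stated in the article)


class CacheLevel:
    """Toy fully associative LRU cache (real caches are set-associative)."""

    def __init__(self, name, size_bytes):
        self.name = name
        self.capacity = size_bytes // LINE_SIZE  # how many lines fit
        self.lines = OrderedDict()               # oldest entry first

    def lookup(self, line):
        if line in self.lines:
            self.lines.move_to_end(line)  # mark as most recently used
            return True
        return False

    def fill(self, line):
        self.lines[line] = True
        self.lines.move_to_end(line)
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)  # evict the least recently used line


def make_hierarchy():
    # Sizes as quoted in the article: 32KB L1, 256KB L2, 8MB shared L3.
    return [
        CacheLevel("L1", 32 * 1024),
        CacheLevel("L2", 256 * 1024),
        CacheLevel("L3", 8 * 1024 * 1024),
    ]


def access(hierarchy, address):
    """Return the name of the level that served the access, or 'memory'."""
    line = address // LINE_SIZE
    served_by = "memory"
    hit_index = len(hierarchy)
    for index, level in enumerate(hierarchy):
        if level.lookup(line):
            served_by = level.name
            hit_index = index
            break
    # Copy the line into every faster level that missed.
    for level in hierarchy[:hit_index]:
        level.fill(line)
    return served_by


if __name__ == "__main__":
    caches = make_hierarchy()
    print(access(caches, 0))  # first touch has to go out to memory
    print(access(caches, 0))  # second touch is served by L1
```

Touching more than 32KB of other data between two accesses to the same address pushes the line out of the toy L1, so the next access is served by L2 instead. That is the basic reason each level exists: the larger, slower caches catch what the smaller, faster ones have had to discard.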
The instructions then pass through the Instruction Queue and on to the Decode stage. Here, a Branch Prediction unit attempts to ‘guess’ which way upcoming branches in the code will go, so that the instructions most likely to be needed next can be fetched ahead of time, saving time and increasing performance. Once decoded, the instructions are analysed by the Loop Stream Detector, which looks for repeating sequences in the code and can hold them for replay as many times as needed, essentially bypassing every step up to this stage.
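The ‘guessing’ done by a branch predictor can be illustrated with the textbook two-bit saturating counter. This is a generic teaching scheme, not Intel's actual predictor, whose design is not covered here. The names `TwoBitPredictor` and `accuracy`, the starting counter value and the branch address `0x400` are all assumptions chosen for this sketch.

```python
class TwoBitPredictor:
    """One 2-bit saturating counter per branch address (textbook scheme)."""

    def __init__(self):
        self.counters = {}  # branch address -> counter value in 0..3

    def predict(self, branch_addr):
        # 0 and 1 mean "predict not taken", 2 and 3 mean "predict taken".
        # Unseen branches start at 1, i.e. weakly not taken (assumed default).
        return self.counters.get(branch_addr, 1) >= 2

    def update(self, branch_addr, taken):
        counter = self.counters.get(branch_addr, 1)
        if taken:
            counter = min(counter + 1, 3)  # saturate at strongly taken
        else:
            counter = max(counter - 1, 0)  # saturate at strongly not taken
        self.counters[branch_addr] = counter


def accuracy(outcomes, branch_addr=0x400):
    """Fraction of branch outcomes the predictor guessed correctly."""
    predictor = TwoBitPredictor()
    correct = 0
    for taken in outcomes:
        if predictor.predict(branch_addr) == taken:
            correct += 1
        predictor.update(branch_addr, taken)  # learn from the real outcome
    return correct / len(outcomes)


if __name__ == "__main__":
    # A loop branch: taken nine times, then falls through once, ten times over.
    loop_branch = ([True] * 9 + [False]) * 10
    print(accuracy(loop_branch))
```

On this loop-shaped pattern the counter mispredicts only the very first iteration and each loop exit, so it is right roughly nine times out of ten. Because a single exit only moves the counter from ‘strongly taken’ to ‘weakly taken’, the predictor still guesses correctly when the loop starts again.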
The final step is to send the code to the Execution Units, which perform the calculations. The results are held temporarily in the final 32KB Data Cache until they are written back to the L2 or L3 cache, or to system memory.