Memory Hierarchy
First off, it's important to discuss and understand what a register is. Before we get into that however, let's have a look at my favorite image of the memory hierarchy:
(thanks to COMPUTER SCIENCE E-1 for this great image)
It's safe to say there are probably thousands of images regarding the memory hierarchy throughout CS books, documents, and presentations, but this one takes the cake for me! It's about as good as it visually gets for a hierarchy image, and although it doesn't display a few key points, I can do that myself right here in this post.
What are the key points I'm talking about that are missing from this image? Well, going from bottom to top, we move from slowest to fastest, and going from top to bottom, we move from fastest to slowest (in terms of read and write access time). It's also imperative to understand that the faster the memory, the more expensive it is, and the slower the memory, the cheaper it is (in USD per unit of storage). With that known, you can imagine that reading from or writing to a removable media device (such as a USB drive) is slower than reading from or writing to your hard drive, but is less expensive.
As this is a post strictly about registers, I won't go into the complexities and intricacies of each part of the hierarchy, and will instead focus on the registers themselves. As far as access time goes, let's compare registers and the hard drive as an example:
Registers - 1-2ns (nanoseconds)
Hard Drive - 5-20ms (milliseconds)
-- It's all dependent on the architecture of the processor, really. These are rough #'s.
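To put those rough numbers in perspective, here's a quick back-of-the-envelope calculation. It only uses the approximate ranges quoted above (they are not measurements), and the function name is just something I made up for the sketch:

```python
# Rough access times from the text, expressed in nanoseconds.
REGISTER_NS = (1, 2)                      # 1-2 ns
HARD_DRIVE_NS = (5_000_000, 20_000_000)   # 5-20 ms

def slowdown_range(fast, slow):
    """Return the (smallest, largest) ratio of slow to fast access times."""
    return slow[0] / fast[1], slow[1] / fast[0]

low, high = slowdown_range(REGISTER_NS, HARD_DRIVE_NS)
print(f"A hard drive access is roughly {low:,.0f}x to {high:,.0f}x slower")
```

In other words, with these rough figures we're talking about a gap of millions of times between the top and the lower end of the hierarchy.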
Cue the amazing Grace Hopper!
Why are registers so fast? Registers are circuits built (literally wired) directly into the CPU core, right alongside the Arithmetic Logic Unit (ALU), which is widely considered the fundamental building block of a CPU. We really can't get any closer than that, so there is essentially no data transfer overhead and barely any clock cycles are required to reach them. Also, a CPU's instruction set tends to work with registers more than it does with actual memory locations.
Speaking of clock cycles, here's a chart displaying the cycles regarding the memory hierarchy:
(thanks to HLNAND for this great image)
As we can see, a register only takes one clock cycle.
What is a register?
Now that we understand the basic fundamentals behind the memory hierarchy and where the register resides on the hierarchy, we can discuss what a register actually is! In its most basic definition, a register is used to store small pieces of data that the processor is actively working on. There are many different registers and categories of registers, all of which essentially do something different, however you can generally break registers down into two types. Regarding the first type, we have the general purpose register (GPR), which holds the operands and results of the arithmetic an instruction performs (addition, subtraction, multiplication, etc.). Once the arithmetic is finished (or the manipulation of data/memory is finished), what happens to the result is entirely up to the instructions: it can be stored back into memory by the same instruction, by a different one, and so on.
Regarding the second type, we have the special purpose register (SPR), which as the name implies has a special meaning and a specific purpose. For example, on the IA-32 architecture the SP (Stack Pointer) register is an SPR in addition to being counted as a GPR. The CPU uses this register to hold the address of the most recently pushed item on the stack. As new items are pushed, they sit on top of the older ones, so the most recent item always resides at the top of the stack.
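Here's a tiny toy model of how a stack pointer tracks the top of the stack. It's only a sketch: the `Stack` class, the starting address 0x1000, and the values pushed are all made up for illustration, and the one real-world detail it borrows is that the IA-32 stack grows toward lower addresses in 4-byte steps.

```python
class Stack:
    """Toy model of a downward-growing 32-bit stack, as on IA-32."""

    def __init__(self, top=0x1000):
        self.esp = top       # stack pointer: address of the newest item
        self.memory = {}     # address -> 32-bit value

    def push(self, value):
        self.esp -= 4        # the stack grows toward lower addresses
        self.memory[self.esp] = value & 0xFFFFFFFF

    def pop(self):
        value = self.memory.pop(self.esp)
        self.esp += 4
        return value

stack = Stack()
stack.push(0x11111111)
stack.push(0x22222222)
print(hex(stack.esp))    # 0xff8
print(hex(stack.pop()))  # 0x22222222, the most recent push comes off first
```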
It's important to note that register contents can change on every clock tick, so the value in a given register may not be the same as it was prior to the tick. A good example of why this matters: when an interrupt fires, register values are copied to a stack and stay there while the Interrupt Service Routine (ISR) is executed by the CPU. Once the interrupt has been handled, the original register values are loaded back from the stack so the CPU can resume the instructions it was previously working on.
What I described above, saving register state and restoring it later, is essentially a context switch: execution jumps from the interrupted code to the ISR and back. Although unrelated, it's interesting to note that in some special cases regarding 0x101 bug checks, depending on what actually caused the bug check, you may need knowledge of context switching to properly debug.
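The save/restore idea can be sketched in a few lines. This is a deliberately simplified model: the function names and the register values are made up, and on real x86 hardware the CPU itself only pushes a few values (EFLAGS, CS, EIP) while the handler is responsible for saving whatever else it touches.

```python
def run_with_interrupt(registers, isr):
    """Save registers, run the ISR, then restore the saved values."""
    saved = []                                   # stands in for the stack
    for name in registers:
        saved.append((name, registers[name]))    # push each register
    isr(registers)                               # the ISR may clobber anything
    while saved:
        name, value = saved.pop()                # pop in reverse order
        registers[name] = value
    return registers

def noisy_isr(regs):
    """A hypothetical handler that overwrites two registers."""
    regs["eax"] = 0xDEADBEEF
    regs["ecx"] = 0

regs = {"eax": 0x818F4920, "ecx": 0x818FB9C0}
run_with_interrupt(regs, noisy_isr)
print({name: hex(value) for name, value in regs.items()})
```

After the call, the interrupted code sees exactly the register values it had before the interrupt fired.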
With all of the above said, there's about a dozen different ways I could go at this point. I could go on to talk about the register file, the many different and various categories of registers, etc. However, let's jump ahead to register renaming as that's a pretty important topic.
Register Renaming
Register renaming is a technique used in pipelined, out-of-order processors that deals with false data dependencies between instructions by renaming their register operands. The way renaming works is that it replaces the architectural register names (the user-accessible registers, or more simply 'the registers') with a new name for the destination operand of each instruction.
Thanks to register renaming, we can also successfully perform what is known as out-of-order execution (OOE). How exactly does it allow OOE to be performed? Register renaming eliminates name dependencies between instructions while preserving true dependencies. A true dependency occurs when an instruction depends on the result of a previous instruction.
With the above explained, here's a good time to explain data hazards. Data hazards occur when instructions that exhibit data dependence modify data in different stages of a pipeline. Ignoring a data hazard can lead to what is known as a race condition, where the result depends on the order in which operations happen to complete rather than on the intended order. We have three main data hazards:
- Read-after-write (RAW), also known as a true dependency.
- Write-after-read (WAR), also known as an anti-dependency.
- Write-after-write (WAW), also known as an output dependency.
1. What is RAW? Let's take two instruction locations (l1, l2). A prime RAW example is when l2 tries to read a source before l1 writes to it. l2 is attempting to refer to a result that hasn't been calculated or retrieved yet by l1.
2. What is WAR? Let's once again take l1 & l2. l2 tries to write a destination before it is read by l1. This is only a problem when instructions execute concurrently or out of order. If they execute strictly sequentially, l1 always reads before l2 writes, and there is no hazard.
3. What is WAW? Taking l1 & l2 one last time, l2 tries to write an operand before it is written by l1.
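The three definitions above boil down to comparing which registers two instructions read and write. Here's a minimal sketch of that comparison. The tuple representation of an instruction and the register names r1 to r5 are made up for the example:

```python
def hazards(first, second):
    """Classify data hazards between two instructions in program order.

    Each instruction is (destination_register, set_of_source_registers).
    """
    dst1, src1 = first
    dst2, src2 = second
    found = set()
    if dst1 in src2:
        found.add("RAW")   # second reads what first writes
    if dst2 in src1:
        found.add("WAR")   # second overwrites what first reads
    if dst1 == dst2:
        found.add("WAW")   # both write the same register
    return found

l1 = ("r1", {"r2", "r3"})   # r1 = r2 + r3
l2 = ("r4", {"r1", "r5"})   # r4 = r1 + r5
print(hazards(l1, l2))      # {'RAW'}
```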
With register renaming, since we're ultimately maintaining a status bit for each value that indicates whether or not it has been completed, it allows the execution of two instruction operations to be performed out of order when there are no true data dependencies between them. This removes WAR/WAW, and of course leaves RAW intact as discussed above.
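To see why renaming removes WAR/WAW but keeps RAW, here's a minimal sketch of a renamer. It is not how any particular processor implements it: the physical register names p0, p1, ... and the three-instruction program are assumptions for illustration, and registers that are read before ever being written simply keep their architectural name.

```python
import itertools

def rename(instructions):
    """Give every register write a fresh physical name.

    Instructions are (dest, [sources]) tuples in program order.
    """
    fresh = (f"p{n}" for n in itertools.count())
    mapping = {}     # architectural name -> current physical name
    renamed = []
    for dest, sources in instructions:
        # Sources read whichever physical register holds the latest value.
        new_sources = [mapping.get(src, src) for src in sources]
        mapping[dest] = next(fresh)      # a fresh name for every write
        renamed.append((mapping[dest], new_sources))
    return renamed

program = [
    ("r1", ["r2", "r3"]),   # l1: r1 = r2 + r3
    ("r4", ["r1"]),         # l2: reads r1 (RAW on l1, must stay)
    ("r1", ["r5"]),         # l3: rewrites r1 (WAW with l1, WAR with l2)
]
for instruction in rename(program):
    print(instruction)
```

After renaming, l3 writes p2 instead of reusing r1, so it no longer conflicts with l1 or l2 and could run early. l2 still reads p0, the value produced by l1, so the true dependency is preserved.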
(Excerpt from the following .pdf)
x86 Registers
In WinDbg, by using the r command we can go ahead and dump the registers from the context of the thread that caused the crash. For example:
kd> r
eax=818f4920 ebx=86664d90 ecx=818fb9c0 edx=000002d0 esi=818f493c edi=00000000
eip=818c07dd esp=a1e72cb0 ebp=a1e72ccc iopl=0 nv up ei pl nz na pe nc
cs=0008 ss=0010 ds=0023 es=0023 fs=0030 gs=0000 efl=00000206
x86 (IA-32) has eight GPRs, which are:
- EAX
- EBX
- ECX
- EDX
- ESI
- EDI
- EBP
- ESP
Great, so there's our eight GPRs. Now we can go ahead and break them down:
- EAX - The 'A' in the EAX register implies it's the accumulator register for operands and results data.
- EBX - The 'B' in the EBX register implies it's the base register, used as a pointer to data in the DS segment. DS is the current data segment.
- ECX - The 'C' in the ECX register implies it's the counter for loop and string operations.
- EDX - The 'D' in the EDX register implies it's the data register, which also serves as the I/O pointer.
- ESI - The 'SI' in the ESI register implies it's the Source Index, which is a pointer to data in the segment pointed to by the DS register.
- EDI - The 'DI' in the EDI register implies it's the Destination Index, which is a pointer to data (or a destination) in the segment pointed to by the ES register. It's essentially the counterpart to the ESI register.
- EBP - The 'BP' in the EBP register implies it's the Base Pointer, which is the pointer to data on the stack.
- ESP - The 'SP' in the ESP register implies it's the Stack Pointer, which points to the top of the stack, i.e. the last item put on the stack.
Whew, alright! So there's just a few things I'd like to quickly explain as well as we haven't covered all of the bases:
- EAX - The 'AX' in the EAX register is used to address only the lower 16 bits of the register. If we were to reference all 32 bits, we'd use all of EAX, and not just AX.
- EIP - I didn't forget this guy, don't worry! The 'IP' in the EIP register implies it's the Instruction Pointer, which can also be called the 'program counter'. It contains the offset in the current code segment for the next instruction that will be executed. It's also interesting to note that EIP cannot be accessed directly by software. Instead, it is controlled implicitly by control-transfer instructions such as JMP, CALL, JC, and RET. Aside from control-transfer instructions, interrupts and exceptions also modify EIP.
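The EAX/AX relationship mentioned above is just bit masking. Here's a small sketch using the eax value from the register dump earlier. AL and AH (the low and high bytes of AX) aren't covered in this post, but they follow the same pattern, so I've included them:

```python
eax = 0x818F4920            # value of eax in the register dump above

ax = eax & 0xFFFF           # lower 16 bits of EAX
al = eax & 0xFF             # lowest 8 bits (low byte of AX)
ah = (eax >> 8) & 0xFF      # bits 8-15 (high byte of AX)

print(hex(ax), hex(ah), hex(al))   # 0x4920 0x49 0x20
```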
Okay, so I can't mention control-transfer instructions and then not explain them. I mean, I could... but I wouldn't be happy. Control-transfer instructions specifically control the flow of program execution. There are quite a few control-transfer instructions, but I will discuss the ones I mentioned that can directly access EIP:
- JMP - Jump. JMP transfers program control to a different point in the instruction stream without recording any return information. The destination operand specifies the address of the instruction being jumped to. This operand can be an immediate value, a GPR, or a memory location. The JMP instruction can be used to actually execute four different types of jumps:
1. Near jump - A jump to an instruction within the segment currently pointed to by the CS register. It can also at times be referred to as an intrasegment jump.
2. Short jump - A type of near jump in which the jump range is limited to -128 to +127 bytes from the current EIP value.
3. Far jump - A jump to an instruction that's located in a different segment than the current code segment, but at the same privilege level. It can also at times be referred to as an intersegment jump.
4. Task switch - A jump to an instruction that's located in a different task. Note that a task switch can only be accomplished in protected-mode, which deserves an explanation of its own. It's a bit large to cover here, so I'll explain it below in a short while.
- CALL - Call procedure. CALL pushes the return address (the location of the instruction following the CALL) onto the hardware supported stack in memory, and then performs an unconditional jump to the code location indicated by the label operand. Unlike the simple jump instructions I listed above, the call instruction saves the location to return to when the subroutine completes.
- JC - Jump if carry flag is set.
- RET - Return. This instruction transfers control to the return address located on the stack, which is usually placed there by a CALL instruction like the one discussed above. It pops that address off the stack and then performs an unconditional jump to it. For example:
call <label>
ret
I mentioned intersegment/intrasegment jumps above. Intersegment jumps can transfer control to a statement in a different code segment, while intrasegment jumps are always between statements in the same code segment.
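To make the CALL/RET pairing concrete, here's a toy interpreter that models EIP and the stack. Everything about it is made up for illustration (the tuple-based instruction format, the 'nop' and 'halt' operations, and indexes standing in for addresses). The only behavior it borrows from the real instructions is that CALL pushes the return address and RET pops it back into EIP.

```python
def run(program, start=0):
    """Execute a tiny program and return the order in which indexes ran."""
    eip, stack, trace = start, [], []
    while True:
        op = program[eip]
        trace.append(eip)
        if op[0] == "call":
            stack.append(eip + 1)   # push the return address
            eip = op[1]             # jump to the subroutine
        elif op[0] == "ret":
            eip = stack.pop()       # pop the return address into EIP
        elif op[0] == "halt":
            return trace
        else:
            eip += 1                # fall through to the next instruction

program = [
    ("call", 3),   # 0: call the subroutine at index 3
    ("nop",),      # 1: execution resumes here after ret
    ("halt",),     # 2
    ("nop",),      # 3: subroutine body
    ("ret",),      # 4
]
print(run(program))   # [0, 3, 4, 1, 2]
```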
Now that we have the above control-transfer instructions explained, let's discuss protected-mode as I mentioned above. Before further discussing protected-mode and the similarly named but very different real-mode, we'll need to do a bit of a history lesson.
Way before my time, back in the late 70's (developed between 1976 and 1978), Intel's 16-bit 8086 processor was released. It was 16-bit because its internal registers and its internal/external data buses were 16 bits wide. A 20-bit external address bus meant this beast could address a whopping 1 MB of memory! 1 MB may not seem like anything these days, but it was considered more than overkill at the time. Any single segment, however, was limited to a mere 64 KB, because the internal registers that hold offsets were only 16 bits wide.
There were two problems here:
1. Programming over 64 KB boundaries meant adjusting segment registers.
2. As time went on, applications were being released that made this mere 64 KB seem like the measly number it is today.
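The first problem is easier to see with the actual address calculation. In 8086 real mode a physical address is formed as segment * 16 + offset, with a 16-bit offset, so one segment value only reaches 64 KB. This is a minimal sketch of that arithmetic; the function name and the sample segment values are just for illustration.

```python
def physical_address(segment, offset):
    """8086 real-mode translation: segment * 16 + offset, wrapped to 20 bits."""
    return ((segment << 4) + offset) & 0xFFFFF

# One segment value reaches exactly 64 KB of offsets (0x0000-0xFFFF).
print(hex(physical_address(0x1000, 0x0000)))   # 0x10000
print(hex(physical_address(0x1000, 0xFFFF)))   # 0x1ffff
# Going past that means loading a new value into the segment register.
print(hex(physical_address(0x2000, 0x0000)))   # 0x20000
```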
What was the saving grace? Intel's 80286 processor, released in 1982! Well, what's so great about this processor that it solved the above two problems? The 80286 processor had two operating modes, as opposed to the 8086 which only had one. The operating modes were:
1. Real-Mode (backwards compatible 8086 mode).
2. Protected-Mode.
The 80286 processor had a 24-bit address bus, which could address up to 16 MB! That's 15 MB more than its predecessor. That's a lot of good news, so let's discuss the problems, and they were considerably big ones:
- DOS apps couldn't easily be ported to protected-mode, given that most (if not all) DOS apps were developed in a way that made them incompatible.
- The 80286 processor couldn't revert back to the backwards compatible real-mode without a CPU reset. Later on however, in 1984, IBM added circuitry to the PC/AT that allowed a special series of instructions to reset just the CPU and resume in real-mode without rebooting the whole machine. This method, while certainly a great feat, posed quite the performance penalty. Later on it was discovered that initiating a triple fault was a much faster and cleaner way, but there was still no 'painless' method of transition.
With the above said, the successor was released, which is the 80386. The 80386 had a 32-bit address bus, which allowed for 4 GB of memory access. From 1 MB to 16 MB to 4 GB, quite the increase! In addition, segment offsets were widened to 32 bits, so a single segment could span the full 4 GB address space and there was no longer a need to switch between multiple segments to reach all of it.
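The 1 MB, 16 MB, and 4 GB figures fall straight out of the address bus widths, since an n-bit address bus can select 2^n distinct bytes. A quick sketch (the function name is made up):

```python
def addressable_bytes(address_bits):
    """Number of distinct byte addresses an n-bit address bus can select."""
    return 2 ** address_bits

for cpu, bits in [("8086", 20), ("80286", 24), ("80386", 32)]:
    megabytes = addressable_bytes(bits) // 2**20
    print(f"{cpu}: {bits}-bit address bus -> {megabytes} MB")
```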
Whew! With all of this known, what is protected-mode actually so great for at this point? Protected-mode allows for virtual memory, paging, the ability to finally switch back to real-mode painlessly without a CPU reset, etc.
How do we actually get into protected-mode? The Global Descriptor Table (GDT) needs to be created with a minimum of three entries. These three entries are a null descriptor, a code segment descriptor, and a data segment descriptor. Afterwards, the PE bit needs to be set in the CR0 register and a far JMP needs to be made to clear the prefetch input queue (PIQ).
// Set PE bit
mov eax, cr0
or eax, 1
mov cr0, eax
// Far JMP to reload CS (here cs stands for the selector of the code segment descriptor in the GDT)
jmp cs:@pm
@pm:
// Now in protected-mode :)
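As for the three GDT entries mentioned above, here's a sketch of what they might look like when packed into the 8-byte descriptor format. The post doesn't specify any descriptor contents, so the values here are my assumptions: a common 'flat' setup with base 0, a 4 GB limit, and ring 0 code/data access bytes. The helper name `gdt_entry` is made up as well.

```python
def gdt_entry(base, limit, access, flags):
    """Pack one 8-byte segment descriptor into a 64-bit integer."""
    entry = limit & 0xFFFF                   # limit bits 0-15
    entry |= (base & 0xFFFFFF) << 16         # base bits 0-23
    entry |= (access & 0xFF) << 40           # access byte
    entry |= ((limit >> 16) & 0xF) << 48     # limit bits 16-19
    entry |= (flags & 0xF) << 52             # granularity / size flags
    entry |= ((base >> 24) & 0xFF) << 56     # base bits 24-31
    return entry

# Assumed flat layout: 4 KB granularity, 32-bit segments (flags = 0xC).
gdt = [
    gdt_entry(0, 0, 0, 0),               # null descriptor
    gdt_entry(0, 0xFFFFF, 0x9A, 0xC),    # code: present, ring 0, execute/read
    gdt_entry(0, 0xFFFFF, 0x92, 0xC),    # data: present, ring 0, read/write
]
for entry in gdt:
    print(f"{entry:016x}")
```

With this layout the code descriptor sits at offset 8 in the table, which lines up with the cs=0008 selector seen in the register dump earlier.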
So we got to most of the registers from the x86 register dump excerpt, but we're missing these ones:
nv up ei pl nz na pe nc
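These are the current contents of the EFLAGS register, which is the status register for x86, as decoded by WinDbg (the raw value is the efl field in the dump). The Intel manuals define 17 flags and flag fields for current IA-32 processors. The eight shown here are: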
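- nv - No overflow (overflow flag clear).
- up - Up (direction flag clear).
- ei - Interrupts enabled (interrupt flag set).
- pl - Plus (sign flag clear, i.e. the last result was positive).
- nz - Not zero (zero flag clear).
- na - No auxiliary carry (auxiliary carry flag clear).
- pe - Parity even (parity flag set).
- nc - No carry (carry flag clear).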
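You can check the list above against the raw efl=00000206 value from the dump by testing the individual bits. The bit positions below come from the Intel manuals and the two-letter mnemonics are the ones WinDbg prints. The table layout and the function name are just my own sketch.

```python
# (bit position, mnemonic when clear, mnemonic when set)
FLAG_BITS = [
    (11, "nv", "ov"),   # overflow
    (10, "up", "dn"),   # direction
    (9,  "di", "ei"),   # interrupt enable
    (7,  "pl", "ng"),   # sign
    (6,  "nz", "zr"),   # zero
    (4,  "na", "ac"),   # auxiliary carry
    (2,  "po", "pe"),   # parity
    (0,  "nc", "cy"),   # carry
]

def decode_eflags(value):
    """Render an EFLAGS value using WinDbg-style flag mnemonics."""
    return " ".join(when_set if (value >> bit) & 1 else when_clear
                    for bit, when_clear, when_set in FLAG_BITS)

print(decode_eflags(0x00000206))   # nv up ei pl nz na pe nc
```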
Disassembly Example
Let's now dump a stack from a random x86 crash dump to show an example of some of the registers we talked about, some assembly code, and instructions:
ChildEBP RetAddr
a1e72ccc 81a8eec4 nt!KeBugCheckEx+0x1e
a1e72cf0 81a1e85f nt!PspCatchCriticalBreak+0x73
a1e72d20 81a1e806 nt!PspTerminateAllThreads+0x2c
a1e72d54 8185c986 nt!NtTerminateProcess+0x1c1
a1e72d54 77725d14 nt!KiSystemServicePostCall
0028f0d0 00000000 0x77725d14
Let's go ahead and disassemble the code at nt!PspTerminateAllThreads+0x2c, which is the return address inside that kernel function:
nt!PspTerminateAllThreads+0x2c:
81a1e85f 8b450c mov eax,dword ptr [ebp+0Ch] // Load the 32-bit value at memory address ebp+0Ch into the eax register.
81a1e862 8b4048 mov eax,dword ptr [eax+48h] // Load the 32-bit value at memory address eax+48h into the eax register.
81a1e865 8945f0 mov dword ptr [ebp-10h],eax // Store the contents of the eax register to memory at address ebp-10h.
81a1e868 6a00 push 0 // Push a 32-bit zero onto the stack.
81a1e86a 8bc7 mov eax,edi // Copy the contents of the edi register into the eax register.
81a1e86c c745fc22010000 mov dword ptr [ebp-4],122h // Store the 32-bit value 122h to memory at address ebp-4.
81a1e873 e8bf010000 call nt!PspGetPreviousProcessThread (81a1ea37) // Call the function nt!PspGetPreviousProcessThread.
81a1e878 8b5d14 mov ebx,dword ptr [ebp+14h] // Load the 32-bit value at memory address ebp+14h into the ebx register.
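One last detail from the listing above: the call instruction's bytes (e8bf010000) don't contain the target address 81a1ea37 directly. The E8 opcode is a near relative call, so the four bytes after it are a signed little-endian displacement from the address of the next instruction. Here's a small sketch that redoes the math WinDbg did for us (the function name is made up):

```python
import struct

def call_target(address, encoded):
    """Resolve the target of an E8 (near relative) CALL instruction."""
    assert encoded[0] == 0xE8 and len(encoded) == 5
    (displacement,) = struct.unpack("<i", encoded[1:])   # signed, little-endian
    # The displacement is relative to the instruction after the CALL.
    return (address + len(encoded) + displacement) & 0xFFFFFFFF

target = call_target(0x81A1E873, bytes.fromhex("e8bf010000"))
print(hex(target))   # 0x81a1ea37
```

The next instruction lives at 81a1e878, and adding the displacement 0x1bf lands on 81a1ea37, exactly the address WinDbg shows in parentheses.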
Hope you enjoyed reading!