Debugging and reverse engineering: How the BSOD actually 'works', why, etc.

On this blog I've talked in depth many times about postmortem debugging of kernel dumps from blue screen crashes. I decided it's time to explain in detail why a blue screen occurs and what actually goes on when one does.

-- Disclaimer: At the time of this post, I have never experienced a BSOD myself on my Windows 8.1 system, so I cannot 100% confirm whether or not the display is switched to a low-res VGA mode when painting the screen. I may use NotMyFault to test this out and will edit when I get confirmation. For now though, let's assume nothing has changed and hope I'm correct : )

---------------------------

First off, why does the blue screen of death occur? It's important to know that there are many reasons why a blue screen occurs. Just to name a few:

References to invalid/inaccessible memory, which cause access violations, etc.
Unexpected exceptions.
Bugs in kernel-mode drivers, 3rd party drivers referencing invalid memory as in the first item, etc.

Again, these are only a few of the potential reasons, but they are some of the most prevalent. For those interested, here's a distribution of what most commonly causes bug checks in Windows.

This is a picture I found on Google from a TechNet article, so thanks to the author for this! It's from Windows Internals - 5th Edition. AFAIK there is not one in the 6th; at least I have not seen it in my reading or research afterwards. With that said, it's likely not entirely accurate today, but I imagine it has not changed too much. Given that I analyze postmortem kernel dumps quite a bit, I am surprised to see pool so low. Again, the 5th edition was written back in the Vista era, so many things have changed since then. It's up in the air, really!

With that said, we now understand a few reasons why Windows stops and a blue screen occurs. Good! Let's also understand that when any of these things occurs, Windows could theoretically ignore it and keep going. Why doesn't it just do this? The answer is simple: many of these things can cause severe data/memory corruption, which in the worst case could even lead to hardware problems.

Since we don't want any of that, Windows thankfully has a fail-safe known to us as the Blue Screen of Death (BSOD -- abbreviating from now on). If Windows detects a serious problem that is unrecoverable, it will halt execution, switch the display to the basic/low-res VGA mode, paint the blue screen itself, write memory/crash information to what we know as a memory dump (crash dump/dmp file/dmp), and display a stop code (bug check). All of this is done through a series of functions.

Now that we're on this topic, I must STRESS one thing and dispel the misconception that the blue screen itself is a bad thing. It's not! The blue screen is a good thing, because it keeps our data from becoming completely corrupt. Remember, the blue screen happens because Windows has detected that something has gone horribly wrong and it cannot recover from it and/or stop it. When this happens, the bug check appropriate to whatever caused the error is called, and the blue screen is painted.

Bottom line... the blue screen is our friend, not our enemy : )

---------------------------

As discussed above, a blue screen happens when Windows detects that an unrecoverable/irreversible problem is occurring. Regardless of what the actual problem is, the end result is a blue screen, and the blue screen process itself is carried out through functions.

Despite the common belief that only one function begins the bug check process, there are actually two: KeBugCheck and KeBugCheckEx. Both are documented on MSDN.

First off, before stating their differences, let's make it easy by saying that both of these functions take what is known as a BugCheckCode parameter. What is a BugCheckCode parameter? Good question! This parameter is otherwise known as a STOP code (for example - 0x0000000A, 0x0000001A, 0x0000009F, etc). These stop codes (otherwise called 'bug checks') are what allow us to troubleshoot the blue screen without having to debug the crash dump itself, because each STOP code has a preset meaning/cause describing why it occurred.
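To make the idea of a preset meaning concrete, here is a minimal sketch of a STOP code lookup for the three codes mentioned above. The table and the stop_code_name helper are hypothetical and only for illustration; the symbolic names are the ones Microsoft documents for these codes, and the real list is far longer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One entry of a small, illustrative STOP code table. */
struct stop_code_entry {
    uint32_t code;
    const char *name;
};

/* Only the three codes mentioned in the text; the real list is much longer. */
static const struct stop_code_entry stop_codes[] = {
    { 0x0000000A, "IRQL_NOT_LESS_OR_EQUAL" },
    { 0x0000001A, "MEMORY_MANAGEMENT" },
    { 0x0000009F, "DRIVER_POWER_STATE_FAILURE" },
};

/* Hypothetical helper: returns the symbolic name, or "UNKNOWN" if the
 * code is not in this small table. */
const char *stop_code_name(uint32_t code)
{
    for (size_t i = 0; i < sizeof stop_codes / sizeof stop_codes[0]; i++) {
        if (stop_codes[i].code == code)
            return stop_codes[i].name;
    }
    return "UNKNOWN";
}
```

Calling stop_code_name(0x0000009F) returns "DRIVER_POWER_STATE_FAILURE", which is usually the first thing you would look up when starting to troubleshoot.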

Great, so now that we know that, what is the difference between KeBugCheckEx and KeBugCheck? Good question! KeBugCheck simply calls KeBugCheckEx with the four parameters set to zero.

Example - {0,0,0,0}

Essentially, KeBugCheckEx provides more information because its caller passes four additional parameters, and the meaning of each parameter is preset based on the STOP code/bug check.
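The relationship between the two functions can be sketched in user-mode C. This is a simulation, not the kernel's code: sim_bugcheck and sim_bugcheck_ex are hypothetical stand-ins that only record their arguments, whereas the real routines never return to the caller.

```c
#include <assert.h>
#include <stdint.h>

/* Record of the last simulated bug check, so the call can be inspected. */
struct bugcheck_record {
    uint32_t code;
    uint64_t params[4];
};

static struct bugcheck_record last_bugcheck;

/* Hypothetical stand-in for KeBugCheckEx: a STOP code plus four
 * parameters whose meaning depends on that code. */
void sim_bugcheck_ex(uint32_t code, uint64_t p1, uint64_t p2,
                     uint64_t p3, uint64_t p4)
{
    last_bugcheck.code = code;
    last_bugcheck.params[0] = p1;
    last_bugcheck.params[1] = p2;
    last_bugcheck.params[2] = p3;
    last_bugcheck.params[3] = p4;
}

/* Hypothetical stand-in for KeBugCheck: same STOP code, but all four
 * parameters are zero, i.e. {0,0,0,0}. */
void sim_bugcheck(uint32_t code)
{
    sim_bugcheck_ex(code, 0, 0, 0, 0);
}
```

After sim_bugcheck(0x0000001A), the recorded parameters are all zero, which is why a crash raised through the simpler function gives you less to work with.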

---------------------------

Once KeBugCheckEx is called, it first disables all interrupts by calling the KiDisableInterrupts function. After this is done, KeBugCheckEx transitions the system to a special system-state (blue screen 'mode') and dumps the STOP code (0x0000000A, for example). Both the transition and the dump of the STOP code are accomplished with a call to the HalDisplayString function.

HalDisplayString takes one parameter (the string to print to the blue screen) and first checks whether the system is already in its special system-state (blue screen 'mode'). If it is not, HalDisplayString attempts to use the firmware to switch to the proper system-state before continuing.

Once the system is confirmed to be in its proper state, HalDisplayString dumps the string into text-mode video memory at the current location of the cursor. HalDisplayString keeps track of the cursor position across all future calls.
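The cursor tracking described above can be sketched as a small user-mode simulation. Everything here is an assumption made for illustration: the 80x25 buffer size, the function names, and the choice to drop output once the screen is full. It is not how the HAL actually implements HalDisplayString.

```c
#include <assert.h>
#include <string.h>

#define SCREEN_COLS 80   /* assumed text-mode width  */
#define SCREEN_ROWS 25   /* assumed text-mode height */

/* Simulated text-mode video memory, plus the cursor position that is
 * remembered between calls. */
static char screen[SCREEN_ROWS][SCREEN_COLS];
static int cursor_row;
static int cursor_col;

/* Clear the simulated screen and move the cursor to the top-left. */
void sim_reset_screen(void)
{
    memset(screen, ' ', sizeof screen);
    cursor_row = 0;
    cursor_col = 0;
}

/* Hypothetical stand-in for HalDisplayString: writes the string at the
 * current cursor position and advances the cursor. */
void sim_display_string(const char *s)
{
    for (; *s != '\0'; s++) {
        if (cursor_row >= SCREEN_ROWS)
            return;                 /* screen full: drop the rest */
        if (*s == '\n') {
            cursor_row++;
            cursor_col = 0;
            continue;
        }
        screen[cursor_row][cursor_col] = *s;
        if (++cursor_col == SCREEN_COLS) {   /* wrap to the next row */
            cursor_col = 0;
            cursor_row++;
        }
    }
}
```

Because the cursor is kept in static state, two consecutive calls such as sim_display_string("*** STOP: ") and sim_display_string("0x0000000A\n") end up on the same line, one after the other.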

After all of this is accomplished, KeBugCheckEx calls the KeGetBugMessageText function, which translates the stop code into its text equivalent. Microsoft publishes a full bug check code reference list.

Once that is completed, KeBugCheckEx starts calling any bug check handlers that drivers have registered (if any). Drivers register these handlers by calling KeRegisterBugCheckCallback. When a handler runs, it fills in a buffer that was allocated by the caller of the register routine, so that the data can later be examined in the debugger. The handlers also give drivers a chance to stop their devices.
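The register-then-invoke pattern can be sketched as a user-mode simulation. The names, the fixed-size table, and the callback signature below are hypothetical and much simpler than the real KeRegisterBugCheckCallback interface; the point is only that the caller owns the buffer and the handler fills it in at crash time.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CALLBACKS 8

/* Hypothetical callback signature: the handler receives the buffer that
 * the registering driver allocated, plus its length. */
typedef void (*bugcheck_callback)(void *buffer, size_t length);

struct callback_record {
    bugcheck_callback routine;
    void *buffer;
    size_t length;
};

static struct callback_record callbacks[MAX_CALLBACKS];
static size_t callback_count;

/* Register a handler. Returns 1 on success, 0 if the routine is NULL or
 * the table is full. */
int sim_register_callback(bugcheck_callback routine, void *buffer, size_t length)
{
    if (routine == NULL || callback_count == MAX_CALLBACKS)
        return 0;
    callbacks[callback_count].routine = routine;
    callbacks[callback_count].buffer = buffer;
    callbacks[callback_count].length = length;
    callback_count++;
    return 1;
}

/* Called during the simulated bug check: run every registered handler. */
void sim_run_callbacks(void)
{
    for (size_t i = 0; i < callback_count; i++)
        callbacks[i].routine(callbacks[i].buffer, callbacks[i].length);
}

/* Example handler: saves a one-byte "device state" into its buffer. */
void example_driver_callback(void *buffer, size_t length)
{
    if (length >= 1)
        ((unsigned char *)buffer)[0] = 0x42;
}
```

A driver would register example_driver_callback once at startup with its own buffer; nothing is written to that buffer until sim_run_callbacks is invoked.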

Once that is through, the callbacks that drivers registered with the KeRegisterBugCheckReasonCallback function get their turn. These allow drivers to write data to the crash dump or to write crash dump information to alternate devices.

Once the above is done (if possible, because handlers aren't always registered), KeDumpMachineState is called, which dumps the rest of the text to the screen. The first thing KeDumpMachineState does is try to interpret each of the four parameters that were passed to KeBugCheckEx as a valid address within a loaded module. It stops as soon as it successfully resolves one. The function used to accomplish this is KiPcToFileHeader.

For the first parameter that KiPcToFileHeader successfully resolves, KeDumpMachineState immediately prints the text form of the STOP code, along with the base address of the module and the module's name.
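The "try each parameter until one resolves" step can be sketched as a range check against a list of loaded modules. This is a simulation under stated assumptions: the module names, base addresses, and sizes below are made up, and find_module and resolve_first_parameter are hypothetical helpers, not the actual KiPcToFileHeader logic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* A loaded module occupies the address range [base, base + size). */
struct loaded_module {
    const char *name;
    uint64_t base;
    uint64_t size;
};

/* Made-up module list; the bases and sizes are illustrative only. */
static const struct loaded_module modules[] = {
    { "ntoskrnl.exe", 0x80100000u, 0x200000u },
    { "hal.dll",      0x80400000u, 0x020000u },
    { "example.sys",  0xF0000000u, 0x008000u },
};

/* Return the module containing addr, or NULL if no module does. */
const struct loaded_module *find_module(uint64_t addr)
{
    for (size_t i = 0; i < sizeof modules / sizeof modules[0]; i++) {
        if (addr >= modules[i].base && addr - modules[i].base < modules[i].size)
            return &modules[i];
    }
    return NULL;
}

/* Try the four bug check parameters in order and stop at the first one
 * that falls inside a loaded module. */
const struct loaded_module *resolve_first_parameter(const uint64_t params[4])
{
    for (int i = 0; i < 4; i++) {
        const struct loaded_module *m = find_module(params[i]);
        if (m != NULL)
            return m;
    }
    return NULL;
}
```

With parameters {0x0, 0x2, 0xF0001234, 0x80100010}, the first two do not resolve, so the result is example.sys even though the fourth parameter also points into a module.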

---------------------------

Below I will share the difference between your standard 8/8.1 and XP/Vista/7 screens: