Saturday, February 22, 2014

0x101 Debugging

I've been working on an 0x101 BSOD (located here), and I thought I'd go ahead and blog about it, even though it's not officially solved just yet. It's still interesting nonetheless, and I believe is good content for a post.

CLOCK_WATCHDOG_TIMEOUT (101)

This indicates that an expected clock interrupt on a secondary processor, in a multi-processor system, was not received within the allocated interval. 

So there's the basic definition of this particular bug check. Let's get into the debugging now.

--------------------


BugCheck 101, {19, 0, fffff880009b2180, 4}

^^ 19 clock ticks in regards to the timeout.

fffff880009b2180 is the PRCB address of the hung processor, let's keep this address in mind.

0: kd> !prcb 4
PRCB for Processor 4 at fffff880009b2180:
Current IRQL -- 0
Threads--  Current fffffa800d851060 Next fffffa800caa6680 Idle fffff880009bd0c0
Processor Index 4 Number (0, 4) GroupSetMember 10
Interrupt Count -- 001bd6a1
Times -- Dpc    00000018 Interrupt 00000048
         Kernel 0000f52d User      00003d36
For reference, I did not do !prcb 0 through 4. That would have been very tedious. Instead, you can use !running -it. The "i" argument causes it to display idle processors too, and "t" displays the stack trace for the thread running on each processor. If we run that extension, it shows the is an 8 core box.

Hint: At times, the 4th parameter of the bug check will show you the responsible processor. For example, in your *101 here, it was correct as the 4th parameter was 4.

Hint #2: You can also generally tell the amount of cores on the box by checking the bugcheck_string - BUGCHECK_STR: CLOCK_WATCHDOG_TIMEOUT_8_PROC

As this matches the 3rd parameter of the bug check, processor #4 is the responsible processor. Now with the information we have here thus far, we know that processor #4 reached 19 clock ticks without responding, therefore the system crashed. Before we go further, what is a clock tick? A clock interrupt is a form of interrupt which involves counting the the cycles of the processor core, which is running a clock on the processors to keep them all in sync. A clock interrupt is handed out to all processors and then they must report in, and when one doesn't report in, you then crash.

--------------------

Let's now look at the stacks of the different processors to see what the threads were involved in:

We can use knL and go through a grueling method of obtaining the trap frame, but we don't like having to put in more work, so let's use kv instead on Processor 0:

0: kd> kv
Child-SP          RetAddr           : Args to Child                                                           : Call Site
fffff880`009a9728 fffff800`0322c443 : 00000000`00000101 00000000`00000019 00000000`00000000 fffff880`009b2180 : nt!KeBugCheckEx
fffff880`009a9730 fffff800`032885f7 : 00000000`00000000 fffff800`00000004 00000000`00002711 00000000`00000000 : nt! ?? ::FNODOBFM::`string'+0x4e3e
fffff880`009a97c0 fffff800`037f5895 : fffff800`0381a460 fffff880`009a9970 fffff800`0381a460 00000000`00000000 : nt!KeUpdateSystemTime+0x377
fffff880`009a98c0 fffff800`0327c3f3 : 00000000`8e403992 fffff800`033f8e80 fffff880`03088180 fffffa80`101ca060 : hal!HalpHpetClockInterrupt+0x8d
fffff880`009a98f0 fffff800`032b55a3 : 00000000`00000000 00000000`00000001 00000000`00000000 00000000`00000000 : nt!KiInterruptDispatchNoLock+0x163 (TrapFrame @ fffff880`009a98f0)
fffff880`009a9a80 fffff800`0328de2c : 00000000`00000000 fffff6fc`40004308 00000000`00000000 00000000`00000000 : nt!KxFlushEntireTb+0x93
fffff880`009a9ac0 fffff800`032c76b9 : fffff6fc`40004308 00000000`00000008 fffff800`033f8e80 00000000`00000080 : nt!KeFlushMultipleRangeTb+0x28c
fffff880`009a9b90 fffff800`032c728f : ffffffff`ffffffff 00000000`0000007f 00000000`00000000 00000000`00000000 : nt!MiZeroPageChain+0x14e
fffff880`009a9bd0 fffff800`03523166 : fffffa80`0ca90460 00000000`00000080 fffffa80`0ca909e0 fffff800`0325e479 : nt!MmZeroPageThread+0x7da
fffff880`009a9d00 fffff800`0325e486 : fffff800`033f8e80 fffffa80`0ca90460 fffff800`03406c40 15ff0000`0160248c : nt!PspSystemThreadStartup+0x5a
fffff880`009a9d40 00000000`00000000 : fffff880`009aa000 fffff880`009a4000 fffff880`009a9970 00000000`00000000 : nt!KiStartSystemThread+0x16
There it is! Let's move forward:

0: kd> .trap fffff880`009a98f0
NOTE: The trap frame does not contain all registers.
Some register values may be zeroed or incorrect.
rax=0000000000000001 rbx=0000000000000000 rcx=00000000000406f8
rdx=00000000000008e1 rsi=0000000000000000 rdi=0000000000000000
rip=fffff800032b55a3 rsp=fffff880009a9a80 rbp=fffff880009a9bb8
 r8=0000000000000000  r9=ffffffffffffff7f r10=0000000000000008
r11=fffff880009a9a20 r12=0000000000000000 r13=0000000000000000
r14=0000000000000000 r15=0000000000000000
iopl=0         nv up ei ng nz na pe nc
nt!KxFlushEntireTb+0x93:
fffff800`032b55a3 ebe4            jmp     nt!KxFlushEntireTb+0x79 (fffff800`032b5589)
0: kd> knL
  *** Stack trace for last set context - .thread/.cxr resets it
 # Child-SP          RetAddr           Call Site
00 fffff880`009a9a80 fffff800`0328de2c nt!KxFlushEntireTb+0x93
01 fffff880`009a9ac0 fffff800`032c76b9 nt!KeFlushMultipleRangeTb+0x28c
02 fffff880`009a9b90 fffff800`032c728f nt!MiZeroPageChain+0x14e
03 fffff880`009a9bd0 fffff800`03523166 nt!MmZeroPageThread+0x7da
04 fffff880`009a9d00 fffff800`0325e486 nt!PspSystemThreadStartup+0x5a
05 fffff880`009a9d40 00000000`00000000 nt!KiStartSystemThread+0x16
^^ Here we can find the stored registers and the stack at the time of the interrupt.

This is where we're going to do some instruction disassembling:

0: kd> u @rip
nt!KxFlushEntireTb+0x93:
fffff800`032b55a3 ebe4            jmp     nt!KxFlushEntireTb+0x79 (fffff800`032b5589)
fffff800`032b55a5 f08305d3c5140001 lock add dword ptr [nt!KiTbFlushTimeStamp (fffff800`03401b80)],1
fffff800`032b55ad 400fb6c6        movzx   eax,sil
fffff800`032b55b1 440f22c0        mov     cr8,rax
fffff800`032b55b5 488b5c2440      mov     rbx,qword ptr [rsp+40h]
fffff800`032b55ba 488b742448      mov     rsi,qword ptr [rsp+48h]
fffff800`032b55bf 4883c430        add     rsp,30h
fffff800`032b55c3 5f              pop     rdi
0: kd> u fffff800`032b5589 fffff800`032b55a5
nt!KxFlushEntireTb+0x79:
fffff800`032b5589 8b8780200000    mov     eax,dword ptr [rdi+2080h]
fffff800`032b558f 85c0            test    eax,eax <-- Checking if value is non-zero.
fffff800`032b5591 7412            je      nt!KxFlushEntireTb+0x95 (fffff800`032b55a5) <-- It looks like it takes the jmp here to stay in the loop.
fffff800`032b5593 ffc3            inc     ebx
fffff800`032b5595 851d310d2000    test    dword ptr [nt!HvlLongSpinCountMask (fffff800`034b62cc)],ebx
fffff800`032b559b 0f84a11e0200    je      nt! ?? ::FNODOBFM::`string'+0x7467 (fffff800`032d7442)
fffff800`032b55a1 f390            pause
fffff800`032b55a3 ebe4            jmp     nt!KxFlushEntireTb+0x79 (fffff800`032b5589)
fffff800`032b55a5 f08305d3c5140001 lock add dword ptr [nt!KiTbFlushTimeStamp (fffff800`03401b80)],1
^^ Disassembling the first few instructions reveals a jump (jmp) that is back up in the nt!KxFlushEntireTb function. It appears at the time of the bug check, the thread was executing a pause (a CPU delay), and doing this in a loop waiting for a release.

So, what's the summary so far? Processor #0 was the thread that created the bugcheck itself, and must have been interrupted by a clock interrupt in order to trigger the CLOCK_WATCHDOG_TIMEOUT bug check.

--------------------

Let's take a look into Processor #1's call stack like we did Processor #0:

Child-SP          RetAddr           : Args to Child                                                           : Call Site
fffff880`02f1bc58 fffff800`0328da3a : 00000000`0035ce39 fffffa80`0dc25bd8 fffff880`009fb0c0 00000000`00000001 : intelppm!MWaitIdle+0x19
fffff880`02f1bc60 fffff800`032886cc : fffff880`009f0180 fffff880`00000000 00000000`00000000 fffff880`00000000 : nt!PoIdle+0x53a
fffff880`02f1bd40 00000000`00000000 : fffff880`02f1c000 fffff880`02f16000 fffff880`02f1bd00 00000000`00000000 : nt!KiIdleLoop+0x2c
1: kd> !irql
Debugger saved IRQL for processor 0x1 -- 0 (LOW_LEVEL)
^^ Either it's running at 0 or the IRQL despite saying 'saved' really didn't get saved. Windows Internals notes this is a possibility.

1: kd> u @rip
intelppm!MWaitIdle+0x19:
fffff880`06c7ac61 c3              ret
fffff880`06c7ac62 cc              int     3
fffff880`06c7ac63 cc              int     3
fffff880`06c7ac64 cc              int     3
fffff880`06c7ac65 cc              int     3
fffff880`06c7ac66 cc              int     3
fffff880`06c7ac67 cc              int     3
intelppm!SetPerfStateIO:
fffff880`06c7ac68 48895c2408      mov     qword ptr [rsp+8],rbx
^^ So it seems that we have the intelppm!MWaitIdle function. I have done some research and I cannot find info on it, although intelppm is related to the processor and I believe its power configuration, power states, etc. Assuming idle implies what I believe it does, this may indicate that processor #1 at the time of the crash was idle waiting for something.

--------------------

Let's check Processor #2:
2: kd> kv
Child-SP          RetAddr           : Args to Child                                                           : Call Site
fffff880`02f8cc58 fffff800`0328da3a : 00000000`0035ce39 fffffa80`0d163908 00000000`00000000 00000000`00000000 : intelppm!MWaitIdle+0x19
fffff880`02f8cc60 fffff800`032886cc : fffff880`02f64180 fffff880`00000000 00000000`00000000 fffff880`00000000 : nt!PoIdle+0x53a
fffff880`02f8cd40 00000000`00000000 : fffff880`02f8d000 fffff880`02f87000 fffff880`02f8cd00 00000000`00000000 : nt!KiIdleLoop+0x2c
^^ Exact same as Processor #1.

--------------------

Let's check Processor #3:

3: kd> kv
Child-SP          RetAddr           : Args to Child                                                           : Call Site
fffff880`07bb5210 fffff800`032ab0fb : 00000000`00000002 fffff880`07bb5380 fffff900`c2e82000 00000000`00000001 : nt!MiFlushTbAsNeeded+0x28a
fffff880`07bb5320 fffff800`033afceb : 00000000`00001230 00000000`00001230 00000000`00000021 00000008`00000000 : nt!MiAllocatePagedPoolPages+0x4bb
fffff880`07bb5440 fffff800`03292860 : 00000000`00001230 fffff880`049fccc0 00000000`00000021 00000000`00000000 : nt!MiAllocatePoolPages+0x8e2
fffff880`07bb5590 fffff800`033b2bfe : 00000000`00000000 00000000`02323fff fffffa80`00000020 fffff880`049fccc0 : nt!ExpAllocateBigPool+0xb0
fffff880`07bb5680 fffff960`001928ed : 00000000`00001d01 00000000`00000000 00000000`00000000 fffff960`001a40d1 : nt!ExAllocatePoolWithTag+0x82e
fffff880`07bb5770 fffff960`00193e0f : 00000000`00000001 fffff880`07bb5908 00000000`00000001 fffff960`001a4302 : win32k!AllocateObject+0xdd
fffff880`07bb57b0 fffff960`00169f2f : fffff880`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : win32k!SURFMEM::bCreateDIB+0x1fb
fffff880`07bb58a0 fffff960`00180bc4 : 00000000`01010051 fffff900`c34b3a00 00000000`00000000 00000000`0000002c : win32k!GreCreateDIBitmapReal+0x533
fffff880`07bb59d0 fffff960`00182be6 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : win32k!InternalGetIconInfo+0x174
fffff880`07bb5ac0 fffff800`0327f153 : fffffa80`0ea2d060 00000000`0272f098 fffff880`07bb5b88 00000000`00000020 : win32k!NtUserGetIconInfo+0x182
fffff880`07bb5b70 00000000`778f462a : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KiSystemServiceCopyEnd+0x13 (TrapFrame @ fffff880`07bb5be0)
00000000`0272f078 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : 0x778f462a
3: kd> .trap fffff880`07bb5be0
NOTE: The trap frame does not contain all registers.
Some register values may be zeroed or incorrect.
rax=0000000000000000 rbx=0000000000000000 rcx=0000000000000000
rdx=0000000000000000 rsi=0000000000000000 rdi=0000000000000000
rip=00000000778f462a rsp=000000000272f078 rbp=000000000272f1e0
 r8=0000000000000000  r9=0000000000000000 r10=0000000000000000
r11=0000000000000000 r12=0000000000000000 r13=0000000000000000
r14=0000000000000000 r15=0000000000000000
iopl=0         nv up ei pl zr na po nc
0033:00000000`778f462a ??              ???
3: kd> u @rip
00000000`778f462a ??              ???
            ^ Memory access error in 'u @rip'
^^ Cannot seem to access the rip register on processor #3. From the stack, it looks like nt!MiFlushTbAsNeeded may be in a loop.

--------------------

Let's now take a look at the problematic processor (#4):

4: kd> kv
Child-SP          RetAddr           : Args to Child                                                           : Call Site
00000000`00000000 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : 0x0
4: kd> r
rax=0000000000000000 rbx=0000000000000000 rcx=0000000000000000
rdx=0000000000000000 rsi=0000000000000000 rdi=0000000000000000
rip=0000000000000000 rsp=0000000000000000 rbp=0000000000000000
 r8=0000000000000000  r9=0000000000000000 r10=0000000000000000
r11=0000000000000000 r12=0000000000000000 r13=0000000000000000
r14=0000000000000000 r15=0000000000000000
iopl=0         nv up di pl nz na pe nc
cs=0000  ss=0000  ds=0000  es=0000  fs=0000  gs=0000             efl=00000000
00000000`00000000 ??              ???
^^ We have a zerod stack + registers, so this will be problematic. Usually this occurs on the problem processor because the IRQL is too high, OR the processor was too hung at the time of the crash to report its information, etc. We will need to get the raw stack.

Let's give this a shot:

4: kd> !pcr
KPCR for Processor 4 at fffff880009b2000:
    Major 1 Minor 1
    NtTib.ExceptionList: fffff880009bd640
        NtTib.StackBase: fffff880009b7040
       NtTib.StackLimit: 00000000000ade58
     NtTib.SubSystemTib: fffff880009b2000
          NtTib.Version: 00000000009b2180
      NtTib.UserPointer: fffff880009b27f0
          NtTib.SelfTib: 00000000fffdb000

                SelfPcr: 0000000000000000
                   Prcb: fffff880009b2180
                   Irql: 0000000000000000
                    IRR: 0000000000000000
                    IDR: 0000000000000000
          InterruptMode: 0000000000000000
                    IDT: 0000000000000000
                    GDT: 0000000000000000
                    TSS: 0000000000000000

          CurrentThread: fffffa800d851060
             NextThread: fffffa800caa6680
             IdleThread: fffff880009bd0c0
4: kd> !thread
THREAD fffffa800d851060  Cid 0bb8.08c4  Teb: 00000000fffdb000 Win32Thread: fffff900c4147c30 RUNNING on processor 4
Not impersonating
DeviceMap                 fffff8a0019544e0
Owning Process            fffffa800d845b30       Image:         chrome.exe
Attached Process          N/A            Image:         N/A
Wait Start TickCount      78494          Ticks: 714 (0:00:00:11.138)
Context Switch Count      232083         IdealProcessor: 4                 LargeStack
UserTime                  00:00:44.990
KernelTime                00:00:09.032
Win32 Start Address 0x0000000000288c9e
Stack Init fffff8800b922d70 Current fffff8800b922a60
Base fffff8800b923000 Limit fffff8800b91a000 Call 0
Priority 9 BasePriority 8 UnusualBoost 0 ForegroundBoost 0 IoPriority 2 PagePriority 5
Child-SP          RetAddr           : Args to Child                                                           : Call Site
00000000`00000000 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : 0x0
^^ We'll be using the base & limit addresses to dump the raw stack:

(For convenience purposes, I cut the stack to the important part because the entire raw stack is really large. Even after cutting it, it's still really large...)

fffff880`0b920988  fffff880`0fef3a11*** ERROR: Symbol file could not be found.  Defaulted to export symbols for nvlddmkm.sys -
 nvlddmkm+0xbda11
fffff880`0b920990  fffffa80`0fab3460
fffff880`0b920998  fffff880`06d92095 dxgmms1!VidSchiUpdateContextRunningTimeAtISR+0x45
fffff880`0b9209a0  00000000`0035ce39
fffff880`0b9209a8  fffffa80`0e47c000
fffff880`0b9209b0  00000000`00000000
fffff880`0b9209b8  00000000`00000005
fffff880`0b9209c0  fffffa80`0fab3460
fffff880`0b9209c8  00000000`0000e323
fffff880`0b9209d0  fffffa80`0e47d1a0
fffff880`0b9209d8  fffff880`06d8e66b dxgmms1!VidSchiProcessIsrCompletedPacket+0x1eb
fffff880`0b9209e0  fffffa80`0e47d1a0
fffff880`0b9209e8  fffffa80`0e44a410
fffff880`0b9209f0  fffffa80`0e47c000
fffff880`0b9209f8  fffffa80`0e47c000
fffff880`0b920a00  00000000`00000000
fffff880`0b920a08  00000000`00000000
fffff880`0b920a10  fffffa80`0e47d1a0
fffff880`0b920a18  00000000`00000000
fffff880`0b920a20  fffffa80`0fab3460
fffff880`0b920a28  00000000`00001661
fffff880`0b920a30  00000000`0000c350
fffff880`0b920a38  00000000`00400120
fffff880`0b920a40  00000000`0000e323
fffff880`0b920a48  00000000`00000001
fffff880`0b920a50  00000000`00000000
fffff880`0b920a58  00000000`00000000
fffff880`0b920a60  fffffa80`0e44a410
fffff880`0b920a68  fffff880`06d8e172 dxgmms1!VidSchDdiNotifyInterruptWorker+0x1ea
fffff880`0b920a70  fffffa80`0ebbd9c0
fffff880`0b920a78  fffff880`06d92095 dxgmms1!VidSchiUpdateContextRunningTimeAtISR+0x45
fffff880`0b920a80  00000000`0035ce39
fffff880`0b920a88  fffffa80`0e47e000
fffff880`0b920a90  00000000`00000000
fffff880`0b920a98  00000000`00000006
fffff880`0b920aa0  fffffa80`0ebbd9c0
fffff880`0b920aa8  00000000`00006ed6
fffff880`0b920ab0  fffffa80`0e47f5b0
fffff880`0b920ab8  fffff880`06d8e66b dxgmms1!VidSchiProcessIsrCompletedPacket+0x1eb
fffff880`0b920ac0  fffffa80`0e47f5b0
fffff880`0b920ac8  fffffa80`0e44a410
fffff880`0b920ad0  fffffa80`0e47e000
fffff880`0b920ad8  fffffa80`0e47e000
fffff880`0b920ae0  00000000`00000000
fffff880`0b920ae8  00000000`00000000
fffff880`0b920af0  fffffa80`0e47f5b0
fffff880`0b920af8  00000000`00000000
fffff880`0b920b00  fffffa80`0ebbd9c0
fffff880`0b920b08  00000000`0000071a
fffff880`0b920b10  00000000`000124f8
fffff880`0b920b18  00000000`00400120
fffff880`0b920b20  00000000`00006ed6
fffff880`0b920b28  00000000`00000001
fffff880`0b920b30  00000000`00000001
fffff880`0b920b38  00000000`00000000
fffff880`0b920b40  fffffa80`0e44a410
fffff880`0b920b48  fffff880`06d8e172 dxgmms1!VidSchDdiNotifyInterruptWorker+0x1ea
fffff880`0b920b50  fffffa80`0e47e000
fffff880`0b920b58  00000000`00000000
fffff880`0b920b60  fffffa80`00000001
fffff880`0b920b68  fffff880`0b920e00
fffff880`0b920b70  fffff880`0b920ba0
fffff880`0b920b78  00000000`00000005
fffff880`0b920b80  00000000`00000000
fffff880`0b920b88  00000000`00000001
fffff880`0b920b90  fffff880`0b920e00
fffff880`0b920b98  fffff880`06d8df76 dxgmms1!VidSchDdiNotifyInterrupt+0x9e
fffff880`0b920ba0  fffffa80`00006ed6
fffff880`0b920ba8  00000000`00000000
fffff880`0b920bb0  fffffa80`0e472000
fffff880`0b920bb8  fffffa80`0ebc2010
fffff880`0b920bc0  fffff880`0b920e00
fffff880`0b920bc8  fffff880`06c9513f dxgkrnl!DxgNotifyInterruptCB+0x83
fffff880`0b920bd0  00000000`00006ed6
fffff880`0b920bd8  00000000`00000000
fffff880`0b920be0  00000000`00000001
fffff880`0b920be8  00000000`00000000
fffff880`0b920bf0  fffff880`0b920c80
fffff880`0b920bf8  fffff880`0fef37c9 nvlddmkm+0xbd7c9
fffff880`0b920c00  fffff880`0b920e00
fffff880`0b920c08  fffffa80`0ebc2010
fffff880`0b920c10  fffffa80`0d906000
fffff880`0b920c18  00000000`00000000
fffff880`0b920c20  fffff880`0fef376f nvlddmkm+0xbd76f
fffff880`0b920c28  fffffa80`0d906000
fffff880`0b920c30  00000000`00000000
fffff880`0b920c38  00000000`00000000
fffff880`0b920c40  00000000`00000000
fffff880`0b920c48  00000000`00000000
fffff880`0b920c50  fffffa80`0d697480
fffff880`0b920c58  fffff880`0b920e00
fffff880`0b920c60  00000000`00006ed6
fffff880`0b920c68  fffffa80`0e246000
fffff880`0b920c70  00000000`00000001
fffff880`0b920c78  00000000`00000000
fffff880`0b920c80  fffff880`0b920d40
fffff880`0b920c88  fffff880`0fef3a11 nvlddmkm+0xbda11
fffff880`0b920c90  fffffa80`0d906000
fffff880`0b920c98  fffffa80`0ebc2010
fffff880`0b920ca0  fffff880`0b920e00
fffff880`0b920ca8  fffffa80`0e246000
fffff880`0b920cb0  fffff880`0fef39aa nvlddmkm+0xbd9aa
fffff880`0b920cb8  fffffa80`0d906000
fffff880`0b920cc0  00000000`00000000
fffff880`0b920cc8  00000000`00000000
fffff880`0b920cd0  00000000`00000000
fffff880`0b920cd8  00000000`00000000
fffff880`0b920ce0  00000000`00006ed6
fffff880`0b920ce8  00000000`00000000
fffff880`0b920cf0  00000000`00000001
fffff880`0b920cf8  00000000`00000000
fffff880`0b920d00  00000000`00000000
fffff880`0b920d08  fffffa80`0d906000
fffff880`0b920d10  00000000`00006ed6
fffff880`0b920d18  fffff880`0fef3924 nvlddmkm+0xbd924
fffff880`0b920d20  fffff880`0fef39aa nvlddmkm+0xbd9aa
fffff880`0b920d28  fffffa80`0e246000
fffff880`0b920d30  fffffa80`0e251ad0
fffff880`0b920d38  fffffa80`0d906000
fffff880`0b920d40  fffff880`0b920e90
fffff880`0b920d48  fffff880`0ff2c17d nvlddmkm+0xf617d
fffff880`0b920d50  00000000`00000000
fffff880`0b920d58  fffffa80`0ebc2010
fffff880`0b920d60  00000000`00000001
fffff880`0b920d68  00000000`00000000
fffff880`0b920d70  fffff880`0ff2be9c nvlddmkm+0xf5e9c
fffff880`0b920d78  fffffa80`0d906000
fffff880`0b920d80  fffffa80`0ebb5010
fffff880`0b920d88  fffffa80`0ebb6010
fffff880`0b920d90  fffffa80`0ebb69d0
fffff880`0b920d98  00000000`00000000
fffff880`0b920da0  00000000`00000001
fffff880`0b920da8  00000000`00000000
fffff880`0b920db0  fffffa80`0fab3460
fffff880`0b920db8  fffff880`06d92095 dxgmms1!VidSchiUpdateContextRunningTimeAtISR+0x45
fffff880`0b920dc0  00000000`0035ce39
fffff880`0b920dc8  fffffa80`0e47c000
fffff880`0b920dd0  00000000`00000000
fffff880`0b920dd8  00000000`00000006
fffff880`0b920de0  fffffa80`0fab3460
fffff880`0b920de8  00000000`0005fea6
fffff880`0b920df0  fffffa80`0e47cf30
fffff880`0b920df8  fffff880`06d8e66b dxgmms1!VidSchiProcessIsrCompletedPacket+0x1eb
fffff880`0b920e00  fffffa80`0fab3460
fffff880`0b920e08  fffff880`06d92095 dxgmms1!VidSchiUpdateContextRunningTimeAtISR+0x45
fffff880`0b920e10  00000000`0035ce39
fffff880`0b920e18  fffffa80`0e47c000
fffff880`0b920e20  00000000`00000000
fffff880`0b920e28  00000000`0000000b
fffff880`0b920e30  fffffa80`0fab3460
fffff880`0b920e38  00000000`0007cd31
fffff880`0b920e40  fffffa80`0e47ccc0
fffff880`0b920e48  fffff880`06d8e66b dxgmms1!VidSchiProcessIsrCompletedPacket+0x1eb
fffff880`0b920e50  fffffa80`0e47ccc0
fffff880`0b920e58  fffffa80`0e44a410
fffff880`0b920e60  fffffa80`0e47c000
fffff880`0b920e68  fffffa80`0e47c000
fffff880`0b920e70  00000000`00000000
fffff880`0b920e78  00000000`00000000
fffff880`0b920e80  fffffa80`0e47ccc0
fffff880`0b920e88  00000000`00000000
fffff880`0b920e90  fffffa80`0fab3460
fffff880`0b920e98  00000000`00000d3f
fffff880`0b920ea0  00000000`00007111
fffff880`0b920ea8  00000000`00400120
fffff880`0b920eb0  00000000`0007cd31
fffff880`0b920eb8  00000000`00000001
fffff880`0b920ec0  00000000`00000000
fffff880`0b920ec8  00000000`00000000
fffff880`0b920ed0  fffffa80`0e44a410
fffff880`0b920ed8  fffff880`06d8e172 dxgmms1!VidSchDdiNotifyInterruptWorker+0x1ea
fffff880`0b920ee0  fffffa80`0e47c000
fffff880`0b920ee8  00000000`00000000
fffff880`0b920ef0  fffffa80`00000001
fffff880`0b920ef8  fffff880`0b921190
fffff880`0b920f00  fffff880`0b920f30
fffff880`0b920f08  00000000`00000005
fffff880`0b920f10  00000000`00000000
fffff880`0b920f18  00000000`00000000
fffff880`0b920f20  fffff880`0b921190
fffff880`0b920f28  fffff880`06d8df76 dxgmms1!VidSchDdiNotifyInterrupt+0x9e
fffff880`0b920f30  fffffa80`0007cd31
fffff880`0b920f38  00000000`00000000
fffff880`0b920f40  fffffa80`0e472000
fffff880`0b920f48  fffffa80`0fbf5010
fffff880`0b920f50  fffff880`0b921190
fffff880`0b920f58  fffff880`06c9513f dxgkrnl!DxgNotifyInterruptCB+0x83
^^ Okay, so from that raw stack, we can see quite a few DirectX Kernel & MMS calls, as well as nVidia driver calls as well. This is good news, as this may be our problem (it gives us a good start as far as troubleshooting goes). I'd like to note that there were much more than this, and that the raw stack went on for a very very long time. I am just cutting it to a very small sample for blogging purposes.

--------------------

Let's check Processor #5:

5: kd> kv
Child-SP          RetAddr           : Args to Child                                                           : Call Site
00000000`00000000 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : 0x0
5: kd> u @rip
00000000`00000000 ??              ???
            ^ Memory access error in 'u @rip'
^^ Looks like this specific processor was too hung at the time of the crash to report any information.

--------------------

Let's check Processor #6:

6: kd> kv
Child-SP          RetAddr           : Args to Child                                                           : Call Site
fffff880`03121c58 fffff800`0328da3a : 00000000`0035ce39 fffffa80`0d16d378 fffff880`030f9180 00000000`00000000 : intelppm!MWaitIdle+0x19
fffff880`03121c60 fffff800`032886cc : fffff880`030f9180 fffff880`00000000 00000000`00000000 fffff880`00000000 : nt!PoIdle+0x53a
fffff880`03121d40 00000000`00000000 : fffff880`03122000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KiIdleLoop+0x2c
 ^^ Same as processors #1 and #2.

--------------------

Finally, let's check Processor #7:

7: kd> kv
Child-SP          RetAddr           : Args to Child                                                           : Call Site
fffff880`07a19930 fffff800`032c1c4a : 00000000`00000000 00000000`2d6c0fff fffffa80`00000000 fffffa80`00000000 : nt!MiDeleteVirtualAddresses+0x7d8
fffff880`07a19af0 fffff800`0327f153 : ffffffff`ffffffff 00000000`2174e6d0 00000000`2174e6c8 00000000`00008000 : nt!NtFreeVirtualMemory+0x5ca
fffff880`07a19be0 00000000`77a3009a : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KiSystemServiceCopyEnd+0x13 (TrapFrame @ fffff880`07a19be0)
00000000`2174e698 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : 0x77a3009a
 7: kd> .trap fffff880`07a19be0
NOTE: The trap frame does not contain all registers.
Some register values may be zeroed or incorrect.
rax=0000000000000000 rbx=0000000000000000 rcx=0000000000000000
rdx=0000000000000000 rsi=0000000000000000 rdi=0000000000000000
rip=0000000077a3009a rsp=000000002174e698 rbp=0000000017a8f904
 r8=0000000000000000  r9=0000000000000000 r10=0000000000000000
r11=0000000000000000 r12=0000000000000000 r13=0000000000000000
r14=0000000000000000 r15=0000000000000000
iopl=0         nv up ei pl nz na po nc
7: kd> u @rip
00000000`77a3009a ??              ???
            ^ Memory access error in 'u @rip'
^^ Cannot seem to go too far into processor #7, but it seemed to be doing virtual memory related things.

--------------------

Overall, from the above, this looks like a hardware issue. Video card, RAM, or CPU itself. I'd like to say it's also possible for it to be a video driver causing corruption, we shall see.

I'm having the user go through hardware diagnostics, as well as a few other things, so I'll report back with any info when I have it.

Friday, February 21, 2014

0x9F (1st parameter = 0x4) Debugging

There are generally two types of 0x9F bug checks that you'll see most in the wild:

1. 4th parameter containing the blocked IRP address.

-- In the case of #1, if the analyze -v doesn't show anything, and the stack is useless, you'd run !irp 123addresshere123 and it would show the culprit '>'.

2. 1st parameter = 0x4, which implies that a power IRP has failed to synchronize with the PnP Manager.

-- In the case of #2, it's a bit different, so let's get into a recent debugging I did. I learned most of this from documentation provided by my good friend x BlueRobot, and other parts viewing various developer forums.

--------------------

0: kd> .bugcheck
Bugcheck code 0000009F
Arguments 00000000`00000004 00000000`00000258 fffffa80`070b5b50 fffff800`00b9c3d0
1st parameter = 0x4.

Let's take a look at the call stack:

0: kd> knL
 # Child-SP          RetAddr           Call Site
00 fffff800`00b9c398 fffff800`033577e6 nt!KeBugCheckEx
01 fffff800`00b9c3a0 fffff800`0350834c nt!PnpBugcheckPowerTimeout+0x76
02 fffff800`00b9c400 fffff800`032d085c nt!PopBuildDeviceNotifyListWatchdog+0x1c
03 fffff800`00b9c430 fffff800`032d06f6 nt!KiProcessTimerDpcTable+0x6c
04 fffff800`00b9c4a0 fffff800`032d05de nt!KiProcessExpiredTimerList+0xc6
05 fffff800`00b9caf0 fffff800`032d03c7 nt!KiTimerExpiration+0x1be
06 fffff800`00b9cb90 fffff800`032bd8ca nt!KiRetireDpcList+0x277
07 fffff800`00b9cc40 00000000`00000000 nt!KiIdleLoop+0x5a
Not really much info, just a few PnP routines that call into the bugcheck eventually. No driver in the stack, etc.

Let's run !locks to see what's going on:

0: kd> !locks
**** DUMP OF ALL RESOURCE OBJECTS ****
KD: Scanning for held locks..

Resource @ nt!IopDeviceTreeLock (0xfffff800034c9ce0)    Shared 1 owning threads
    Contention Count = 35
     Threads: fffffa80070b5b50-01<*>
KD: Scanning for held locks.

Resource @ nt!PiEngineLock (0xfffff800034c9be0)    Exclusively owned
    Contention Count = 41
     Threads: fffffa80070b5b50-01<*>
KD: Scanning for held locks.............
31147 total locks, 2 locks currently held
Let's take the address in the 'Threads' field and run !thread:

0: kd> !thread fffffa80070b5b50
THREAD fffffa80070b5b50  Cid 0004.0030  Teb: 0000000000000000 Win32Thread: 0000000000000000 WAIT: (Executive) KernelMode Non-Alertable
    fffff88003762370  NotificationEvent
IRP List:
    fffffa800d1fc730: (0006,0430) Flags: 00060000  Mdl: fffffa800d7bd520
Not impersonating
DeviceMap                 fffff8a000008b30
Owning Process            fffffa8006f93380       Image:         System
Attached Process          N/A            Image:         N/A
Wait Start TickCount      563451         Ticks: 216 (0:00:00:03.369)
Context Switch Count      2944           IdealProcessor: 4  NoStackSwap
UserTime                  00:00:00.000
KernelTime                00:00:00.452
Win32 Start Address nt!ExpWorkerThread (0xfffff800032cf150)
Stack Init fffff88003762c70 Current fffff88003762070
Base fffff88003763000 Limit fffff8800375d000 Call 0
Priority 12 BasePriority 12 UnusualBoost 0 ForegroundBoost 0 IoPriority 2 PagePriority 5
Child-SP          RetAddr           : Args to Child                                                           : Call Site
fffff880`037620b0 fffff800`032bb5f2 : 00000000`00000000 fffffa80`070b5b50 00000000`00000000 00000000`00000001 : nt!KiSwapContext+0x7a
fffff880`037621f0 fffff800`032cc99f : fffffa80`0d1fcad0 00000000`00000000 fffffa80`00000000 fffff880`017b8975 : nt!KiCommitThreadWait+0x1d2
fffff880`03762280 fffff800`035e66b8 : fffffa80`0cdec700 fffff880`00000000 fffffa80`62747300 00000000`00000700 : nt!KeWaitForSingleObject+0x19f
fffff880`03762320 fffff800`035e65d6 : fffffa80`000003e6 fffffa80`08064960 00000000`00000000 00000000`000007ff : nt!FstubReadSector+0x98
fffff880`037623a0 fffff800`035e66f5 : fffffa80`0cdec8f0 fffffa80`0de3a1a0 fffff880`03762438 fffffa80`081d6010 : nt!FstubDetectPartitionStyle+0x22
fffff880`037623d0 fffff880`00f9b512 : fffffa80`0cdec8f0 fffff880`00000001 00000000`00000000 fffffa80`08064960 : nt!IoReadPartitionTableEx+0x2d
fffff880`03762400 fffff880`00fa803e : fffffa80`081d6010 00000000`00000000 fffffa80`0cdec8f0 fffffa80`0de3a1a0 : partmgr!PmGetDriveLayoutEx+0x5d2
fffff880`03762500 fffff880`00f9ec1d : 00000000`00000000 fffffa80`0de3a1a0 fffffa80`0cdec8f0 00000000`00000000 : partmgr!PmSetSystemCriticalFlag+0x2e
fffff880`03762530 fffff880`00fa832a : 00000000`00000000 fffffa80`081d6010 fffffa80`081d6010 fffffa80`0cdec7a0 : partmgr!PmStartDevice+0x1fd
fffff880`03762600 fffff800`0368218e : fffffa80`081d6010 fffffa80`0d47eae0 fffffa80`0cdec7a0 fffff880`009b3180 : partmgr!PmPnp+0x11a
fffff880`03762650 fffff800`033bc7ad : fffffa80`0ea70050 fffffa80`0d47eae0 fffff800`033c5ed0 00000000`00000000 : nt!PnpAsynchronousCall+0xce
fffff880`03762690 fffff800`03691506 : fffff800`034c99a0 fffffa80`07c2fd10 fffffa80`0d47eae0 fffffa80`07c2feb8 : nt!PnpStartDevice+0x11d
fffff880`03762750 fffff800`036917a4 : fffffa80`07c2fd10 fffffa80`07040043 fffffa80`070421e0 00000000`00000001 : nt!PnpStartDeviceNode+0x156
fffff880`037627e0 fffff800`036b4eb6 : fffffa80`07c2fd10 fffffa80`070421e0 00000000`00000002 00000000`00000000 : nt!PipProcessStartPhase1+0x74
fffff880`03762810 fffff800`036b5448 : 00000000`00000000 fffff800`033c4e76 00000000`00000000 00000000`00000000 : nt!PipProcessDevNodeTree+0x296
fffff880`03762a80 fffff800`033c8827 : 00000001`00000003 00000000`00000000 00000000`00000001 fffffa80`00000084 : nt!PiProcessReenumeration+0x98
fffff880`03762ad0 fffff800`032cf261 : fffff800`033c8500 ffffffff`80000201 fffff800`0364bc00 fffffa80`070b5b50 : nt!PnpDeviceActionWorker+0x327
fffff880`03762b70 fffff800`035622ea : 00000000`00000000 fffffa80`070b5b50 00000000`00000080 fffffa80`06f93380 : nt!ExpWorkerThread+0x111
fffff880`03762c00 fffff800`032b68e6 : fffff880`03589180 fffffa80`070b5b50 fffff880`035940c0 00000000`00000000 : nt!PspSystemThreadStartup+0x5a
fffff880`03762c40 00000000`00000000 : fffff880`03763000 fffff880`0375d000 fffff880`03762370 00000000`00000000 : nt!KxStartSystemThread+0x16

We have an IRP list, and a little bit of a more informative stack text. We can see a few partmgr routines (Partition Management system driver).

** I've gone through a few 0x9F's in which dumping the !thread didn't provide an IRP list address. I am not sure why this would be the case, but my only guess is simply not enough information was available at the time of the crash to obtain an IRP list address.

Let's go ahead and take a look at the current IRP within the thread:

0: kd> !irp fffffa800d1fc730 7
Irp is active with 12 stacks 11 is current (= 0xfffffa800d1fcad0)
 Mdl=fffffa800d7bd520: No System Buffer: Thread fffffa80070b5b50:  Irp stack trace. 
Flags = 00060000
ThreadListEntry.Flink = fffffa80070b5f40
ThreadListEntry.Blink = fffffa80070b5f40
IoStatus.Status = 00000000
IoStatus.Information = 00000000
RequestorMode = 00000000
Cancel = 00
CancelIrql = 0
ApcEnvironment = 00
UserIosb = fffff88003762360
UserEvent = fffff88003762370
Overlay.AsynchronousParameters.UserApcRoutine = 00000000
Overlay.AsynchronousParameters.UserApcContext = 00000000
Overlay.AllocationSize = 00000000 - 00000000
CancelRoutine = 00000000  
UserBuffer = 00000000
&Tail.Overlay.DeviceQueueEntry = fffffa800d1fc7a8
Tail.Overlay.Thread = fffffa80070b5b50
Tail.Overlay.AuxiliaryBuffer = 00000000
Tail.Overlay.ListEntry.Flink = 00000000
Tail.Overlay.ListEntry.Blink = 00000000
Tail.Overlay.CurrentStackLocation = fffffa800d1fcad0
Tail.Overlay.OriginalFileObject = 00000000
Tail.Apc = 00000001
Tail.CompletionKey = 00000001
     cmd  flg cl Device   File     Completion-Context
 [  0, 0]   0  0 00000000 00000000 00000000-00000000   

            Args: 00000000 00000000 00000000 00000000
 [  0, 0]   0  0 00000000 00000000 00000000-00000000   

            Args: 00000000 00000000 00000000 00000000
 [  0, 0]   0  0 00000000 00000000 00000000-00000000   

            Args: 00000000 00000000 00000000 00000000
 [  0, 0]   0  0 00000000 00000000 00000000-00000000   

            Args: 00000000 00000000 00000000 00000000
 [  0, 0]   0  0 00000000 00000000 00000000-00000000   

            Args: 00000000 00000000 00000000 00000000
 [  0, 0]   0  0 00000000 00000000 00000000-00000000   

            Args: 00000000 00000000 00000000 00000000
 [  0, 0]   0  0 00000000 00000000 00000000-00000000   

            Args: 00000000 00000000 00000000 00000000
 [  0, 0]   0  0 00000000 00000000 00000000-00000000   

            Args: 00000000 00000000 00000000 00000000
 [  0, 0]   0  0 00000000 00000000 00000000-00000000   

            Args: 00000000 00000000 00000000 00000000
 [  0, 0]   0  0 00000000 00000000 00000000-00000000   

            Args: 00000000 00000000 00000000 00000000
>[  3,34]   2 e0 fffffa800dd8f060 00000000 fffff88000f9e1b0-00000000 Success Error Cancel
           \Driver\Disk    partmgr!PmReadWriteCompletion
            Args: 00000200 00000000 00000000 00000000
 [  3, 0]   2  0 fffffa800cdec7a0 00000000 00000000-00000000   
           \Driver\partmgr
            Args: 4632491f2 00000000 00000000 00000000

 ^^ Disk.sys and partmgr.sys. Disk.sys is the PnP Disk system driver.

Let's now go ahead and run !devstack on the hightlighted address for Disk.sys:

0: kd> !devstack fffffa800dd8f060
  !DevObj           !DrvObj            !DevExt           ObjectName
  fffffa800cdec7a0  \Driver\partmgr    fffffa800cdec8f0 
> fffffa800dd8f060  \Driver\Disk       fffffa800dd8f1b0  DR66
  fffffa800ea70050  \Driver\AmUStor    fffffa800ea701a0  00000267
!DevNode fffffa8007c2fd10 :
  DeviceInstis "USBSTOR\Disk&Ven_Multiple&Prod_Card_Reader&Rev_1.00\058F63666433&01"
  ServiceName is "disk"
We can see mention of AmUStor.sys here which is the ALCOR Micro SD Card Controller driver.

-- FAILURE_BUCKET_ID: X64_0x9F_4_disk_IMAGE_AmUStor.SYS

We can see this matches our FBID.

--------------------

As the card controller drivers was OEM, there appeared to be no relevant updates, therefore it was disabled (or the card reader was removed) and it was solved.

Saturday, February 15, 2014

[SOLVED] MULTIPLE_IRP_COMPLETE_REQUESTS

Link to solved thread - BSoD Multiple_irp_complete_requests problem

What the issue was -
This was the LogMeIn Hamachi 0x44 bug that has since been solved with the 2/5/2014 update.

[SOLVED] VIDEO_TDR_FAILURE

Link to solved thread - System Crashing BSOD, need help

What the issue was -
SFC needed to be run.

[SOLVED] MEMORY_MANAGEMENT / NTFS_FILE_SYSTEM

Link to solved thread - Random Reboots/BSOD Crashes (Various Error messages)

What the issue was -
RAM was faulty, RMA to Corsair set up.

[SOLVED] DRIVER_POWER_STATE_FAILURE

Link to solved thread - Driver Power State Failure - Blue Screen Of Death - Windows Vista

What the issue was -
Ralink 802.11 USB Wireless Adapter driver needed to be updated.

[SOLVED] BAD_SYSTEM_CONFIG_INFO

Link to solved thread - blue screen

What the issue was -
SYSTEM Hive was corrupt, reinstall was needed.

[SOLVED] MULTIPLE_IRP_COMPLETE_REQUESTS

Link to solved thread - Update Drivers to stop "Blue Screen Of Death"

What the issue was -
This was the LogMeIn Hamachi 0x44 bug that has since been solved with the 2/5/2014 update.

[SOLVED] UNEXPECTED_KERNEL_MODE_TRAP

Link to solved thread - Help with BSOD "Unexpected Kernel Trap Mode" error

What the issue was -
ESET needed to be removed and replaced with Windows Defender.

[SOLVED] MACHINE_CHECK_EXCEPTION / WHEA_UNCORRECTABLE_ERROR

Link to solved thread - BSOD: Stop 0x0000009C

What the issue was -
I presume uninstalling the Asus AI Suite solved the crashes as the user never reported they crashed afterwards and had to RMA the processor.

[SOLVED] MULTIPLE_IRP_COMPLETE_REQUESTS

Link to solved thread - BSOD Multiple_irp_complete_requests

What the issue was -
This was the LogMeIn Hamachi 0x44 bug that has since been solved with the 2/5/2014 update.

[SOLVED] MULTIPLE_IRP_COMPLETE_REQUESTS

Link to solved thread - Multiple IRP Requests - BSOD

What the issue was -
This was the LogMeIn Hamachi 0x44 bug that has since been solved with the 2/5/2014 update.

[SOLVED] CRITICAL_STRUCTURE_CORRUPTION

Link to solved thread - Windows 8.1 has found Critical Structure Corruption

What the issue was -


- ESET needed to be removed and replaced with Windows Defender.

- sptd.sys needed to be removed.

[SOLVED] MULTIPLE_IRP_COMPLETE_REQUESTS

Link to solved thread - Windows 8.1 MULTIPLE_IRP_COMPLETE_REQUESTS

What the issue was -
This was the LogMeIn Hamachi 0x44 bug that has since been solved with the 2/5/2014 update.

[SOLVED] MULTIPLE_IRP_COMPLETE_REQUESTS

Link to solved thread - Multiple IRP Complete Request error BSOD

What the issue was -
This was the LogMeIn Hamachi 0x44 bug that has since been solved with the 2/5/2014 update.

[SOLVED] DPC_WATCHDOG_VIOLATION

Link to solved thread - BSOD creating a WiFi hotspot

What the issue was -
Realtek driver needed to be updated.

[SOLVED] MULTIPLE_IRP_COMPLETE_REQUESTS

Link to solved thread - Blue Screen - MULTIPLE-IRP-COMPLETE-REQUESTS on Windows 8.1

What the issue was -
This was the LogMeIn Hamachi 0x44 bug that has since been solved with the 2/5/2014 update.

[SOLVED] DRIVER_IRQL_NOT_LESS_OR_EQUAL / CRITICAL_PROCESS_DIED

Link to solved thread - Blue screen of death occurring frequently!

What the issue was -
Quickheal needed to be removed and replaced with Windows Defender.

[SOLVED] KERNEL_DATA_INPAGE_ERROR

Link to solved thread - WIndows 8.1 BSOD caused by KERNEL_DATA_INPAGE_ERROR

What the issue was - 


- Asus bloatware needed to be removed.

- AVG needed to be removed and replaced with Windows Defender.

- ISODrive needed to be removed.

[SOLVED] DPC_WATCHDOG_VIOLATION / DRIVER_VERIFIER_DETECTED_VIOLATION

Link to solved thread - Window 8 BSOD. DPC Watchdog Violations. Updated all driver.

What the issue was -
What seemed to be occurring was the DPC may have been looping itself by gathering a Spinlock at DPC Level, cancelling the Timer, and then finally releasing the Spinlock again. This was going into a loop over and over again, apparently caused by IntelliMemory. Removal of IntelliMemory solved the crashes.

Thursday, February 13, 2014

0x133 (with a little 0xC4 in addition) Debugging

I've written a newer/more informative version of this blog post here - 0x133 (with a little 0xC4 in addition) Debugging - New Version (Update).

--------------------

I've solved of course many *133 bug checks in the past, however, this was the first time I was supplied a kernel-dump for it and was finally knowledgeable enough thanks to reading, experience, and documentation, to debug it 'in-depth', and successfully solved it at the same time.

I have also supplied Harry (x BlueRobot) this kernel-dump as we both wanted to write and learn about 0x133's in-depth, so he has gone ahead and written a tutorial as well for it (which you can see here). Harry goes into pretty nice detail about DPC's, which is something I won't be doing here in my tutorial. I'll instead be focusing on what caused it and how it was solved.

--------------------

I've thankfully done most of the analysis in the thread I solved the crash in, however, I don't want to get lazy and will of course go into detail wherever I can. Let's get started:

DPC_WATCHDOG_VIOLATION (133)
The DPC watchdog detected a prolonged run time at an IRQL of DISPATCH_LEVEL
or above.
Arguments:
Arg1: 0000000000000000, A single DPC or ISR exceeded its time allotment. The offending
    component can usually be identified with a stack trace.
Arg2: 0000000000000501, The DPC time count (in ticks).
Arg3: 0000000000000500, The DPC time allotment (in ticks).
Here we have the basic bug check information. First off, the DPC_WATCHDOG_VIOLATION bug check can be triggered in two ways. First, if a single DPC exceeds a specified number of ticks, the system will stop with 0x133 with parameter 1 of the bug check set to 0.  In this case, the system's time limit for single DPC will be in parameter 3, with the number of ticks taken by this DPC in parameter 2.

BugCheck 133, {0, 501, 500, 0}


^^ In this case, the 1st parameter = 0!


0: kd> .bugcheck
Bugcheck code 00000133
Arguments 00000000`00000000 00000000`00000501 00000000`00000500 00000000`00000000



This specific DPC has run for 0x501 ticks, when the limit was 0x500.


In the case of a stop 0x133 with the first parameter set to 0, the call stack should contain the offending driver. Rather than running a kv as we usually do, let's go ahead and run knL:

0: kd> knL
 # Child-SP          RetAddr           Call Site
00 fffff803`a2459408 fffff803`a044bf4b nt!KeBugCheckEx
01 fffff803`a2459410 fffff803`a0310774 nt! ?? ::FNODOBFM::`string'+0x145a4
02 fffff803`a2459490 fffff803`a0228eca nt!KeUpdateTime+0x2ec
03 fffff803`a2459670 fffff803`a02c501e hal!HalpTimerClockInterrupt+0x86
04 fffff803`a24596a0 fffff880`18564dd2 nt!KiInterruptDispatchLBControl+0x1ce
05 fffff803`a2459830 fffff803`a02f51ea usb80236!CancelSendsTimerDpc+0xa6
06 fffff803`a2459870 fffff803`a02f3655 nt!KiProcessExpiredTimerList+0x22a
07 fffff803`a24599a0 fffff803`a02f5668 nt!KiExpireTimerTable+0xa9
08 fffff803`a2459a40 fffff803`a02f4a06 nt!KiTimerExpiration+0xc8
09 fffff803`a2459af0 fffff803`a02f59ba nt!KiRetireDpcList+0x1f6
0a fffff803`a2459c60 00000000`00000000 nt!KiIdleLoop+0x5a
k = Displays the call stack of the given thread.
n = Displays frame numbers.
L (important you capitalize it) = Hides source lines in the display.

This makes a nice, neat, and informative call stack for 0x133 debugging.

As we can see above in the call stack, the offending driver is usb80236.sys, which is the Remote NDIS USB Driver. It calls into KiInterruptDispatchLBControl.

Let’s view the driver’s unassembled DPC routine:

0: kd> ub fffff880`18564dd2
usb80236!CancelSendsTimerDpc+0x81:
fffff880`18564dad 488bcd          mov     rcx,rbp
fffff880`18564db0 c6465001        mov     byte ptr [rsi+50h],1
fffff880`18564db4 ff1556230000    call    qword ptr [usb80236!_imp_KeReleaseSpinLockFromDpcLevel (fffff880`18567110)]
fffff880`18564dba 488bcb          mov     rcx,rbx
fffff880`18564dbd ff15a5220000    call    qword ptr [usb80236!_imp_IoCancelIrp (fffff880`18567068)]
fffff880`18564dc3 488bcd          mov     rcx,rbp
fffff880`18564dc6 ff15e4220000    call    qword ptr [usb80236!_imp_KeAcquireSpinLockAtDpcLevel (fffff880`185670b0)]
fffff880`18564dcc ff8708030000    inc     dword ptr [rdi+308h]
1. The driver started by calling the KeAcquireSpinLockAtDpcLevel routine which acquires a spin lock when the caller is already running at IRQL >= DISPATCH_LEVEL.

The caller should release the spin lock with KeReleaseSpinLockFromDpcLevel as quickly as possible.

2. The driver then called the IoCancelIrp routine which sets the cancel bit in a given IRP and calls the cancel routine for the IRP if there is one.

If the IRP has a cancel routine, IoCancelIrp sets the cancel bit and calls the cancel routine.

3. The driver then called the KeReleaseSpinLockFromDpcLevel routine (as mentioned above) which releases an executive spin lock without changing the IRQL.

Going a bit deeper...

0: kd> u fffff880`18564dd2
usb80236!CancelSendsTimerDpc+0xa6:
fffff880`18564dd2 488b36          mov     rsi,qword ptr [rsi]
fffff880`18564dd5 493bf6          cmp     rsi,r14
fffff880`18564dd8 75a9            jne     usb80236!CancelSendsTimerDpc+0x57 (fffff880`18564d83)
fffff880`18564dda 4c8d87b0020000  lea     r8,[rdi+2B0h]
fffff880`18564de1 488d8f70020000  lea     rcx,[rdi+270h]
fffff880`18564de8 48c7c2001f0afa  mov     rdx,0FFFFFFFFFA0A1F00h
fffff880`18564def ff1553220000    call    qword ptr [usb80236!_imp_KeSetTimer (fffff880`18567048)]
fffff880`18564df5 eb12            jmp     usb80236!CancelSendsTimerDpc+0xdd (fffff880`18564e09)
1. The driver called the CancelSendsTimerDpc routine. I do not know exactly what this routine does, however, it's certainly something in regards to a timer on and/or for a DPC (Deferred Procedure Call). According to Harry, he believes that the driver may use a Custom DPC associated with a Timer object.

2. The driver then calls the KeSetTimer routine which sets the absolute or relative interval at which a timer object is to be set to a signaled state and, optionally, supplies a CustomTimerDpc routine to be executed when that interval expires.

3. The driver then calls the CancelSendsTimerDpc routine again. As far as I know, what should be going on here is the CustomTimerDpc routine should be called, but CancelSendsTimerDpc may be in a loop.

Overall, what seems to be occurring is the DPC may be looping itself by gathering a Spinlock at DPC Level, cancelling the Timer, and then finally releasing the Spinlock again. This is happening over and over again, therefore we have a loop.

--------------------

If we run a !pcr to show us the queued DPCs for the processor:

0: kd> !pcr
KPCR for Processor 0 at fffff803a056a000:
    Major 1 Minor 1
    NtTib.ExceptionList: fffff803a2452000
        NtTib.StackBase: fffff803a2453080
       NtTib.StackLimit: 0000000004ccee48
     NtTib.SubSystemTib: fffff803a056a000
          NtTib.Version: 00000000a056a180
      NtTib.UserPointer: fffff803a056a7f0
          NtTib.SelfTib: 000000007ef44000

                SelfPcr: 0000000000000000
                   Prcb: fffff803a056a180
                   Irql: 0000000000000000
                    IRR: 0000000000000000
                    IDR: 0000000000000000
          InterruptMode: 0000000000000000
                    IDT: 0000000000000000
                    GDT: 0000000000000000
                    TSS: 0000000000000000

          CurrentThread: fffff803a05c4880
             NextThread: fffffa80036e8040
             IdleThread: fffff803a05c4880

              DpcQueue:  0xfffffa80073595e0 0xfffff8800400c960 [Normal] dxgkrnl!DpiFdoDpcForIsr
                         0xfffffa8007be9b68 0xfffff88001efb380 [Normal] ndis!ndisInterruptDpc
                         0xfffff803a04fcfe0 0xfffff803a027c71c [Normal] nt!PpmCheckPeriodicStart
                         0xfffff803a0545d60 0xfffff803a031e45c [Normal] nt!KiBalanceSetManagerDeferredRoutine
1. ndis.sys:

(Network Driver Interface Specification driver) routine call. The Network Driver Interface Specification (NDIS) is an application programming interface (API) for network interface cards (NICs). The NDIS forms the Logical Link Control (LLC) sublayer, which is the upper sublayer of the OSI data link layer (layer 2). Therefore, the NDIS acts as the interface between the Media Access Control (MAC) sublayer, which is the lower sublayer of the data link layer, and the network layer (layer 3).

The NDIS is a library of functions often referred to as a "wrapper" that hides the underlying complexity of the NIC hardware and serves as a standard interface for level 3 network protocol drivers and hardware level MAC drivers. Another common LLC is the Open Data-Link Interface (ODI).

2. dxgkrnl.sys - Direct X Kernel.

--------------------

So, with all of this said, we know that something is causing usb80236.sys to call into a loop, and it may be anything that's working with and/or possibility interfering with Windows' networking, or Direct X. We'll need to do some detective work to determine what is causing this, as it's a system driver and is being faulted by something else. At this point, since we're at quite the wall, I recommend enabling Driver Verifier so we can see what's going on. The user enabled DV, and sure enough, they had an 0xC4 crash! Let's take a look:

DRIVER_VERIFIER_DETECTED_VIOLATION (c4)
A device driver attempting to corrupt the system has been caught.  This is
because the driver was specified in the registry as being suspect (by the
administrator) and the kernel has enabled substantial checking of this driver.
If the driver attempts to corrupt the system, bugchecks 0xC4, 0xC1 and 0xA will
be among the most commonly seen crashes.
Arguments:
Arg1: 0000000000001011, Invariant MDL buffer contents for Read Irp were modified during dispatch or buffer backed by dummy pages.
Arg2: fffffa8006219060, Device object to which the Read IRP was issued.
Arg3: fffff980098a8c60, The address of the IRP.
Arg4: fffff8801a5b3000, System-Space Virtual Address for the buffer that the MDL describes.
Here we have the basic bug check info, with the 2nd/3rd parameter highlighted as they will be useful later on. Let's go ahead and take a look at the call stack first:

0: kd> kv
Child-SP          RetAddr           : Args to Child                                                           : Call Site
fffff880`1af96888 fffff803`07a31fcc : 00000000`000000c4 00000000`00001011 fffffa80`06219060 fffff980`098a8c60 : nt!KeBugCheckEx
fffff880`1af96890 fffff803`07ebfa51 : 00000000`00000000 fffff880`1af96a00 fffff980`098a8c60 fffff803`07cba000 : nt!MdlInvariantPreProcessing1+0x200
fffff880`1af96900 fffff803`07ebdc51 : 00000000`00000000 fffffa80`0dd877d0 fffffa80`06218060 00000000`00000001 : nt!IovpCallDriver1+0x1cd
fffff880`1af96a60 fffff803`07eb4cde : fffff980`098a8c60 00000000`00000002 fffff880`1af96c40 00000000`c0000089 : nt!VfBeforeCallDriver+0x141
fffff880`1af96a90 fffff880`012ccdab : fffff980`098a8c60 fffffa80`06219060 fffffa80`062191b0 fffffa80`0dd877d0 : nt!IovCallDriver+0x35e
fffff880`1af96ae0 fffff880`012ccb7e : fffff980`098a8c60 fffff880`1af96ba0 fffff980`098a8e50 fffffa80`06219060 : intmsd+0x3dab
fffff880`1af96b20 fffff880`012d4402 : 00000000`00000000 fffff880`1af96c40 00000000`00000000 fffff803`07928d44 : intmsd+0x3b7e
fffff880`1af96b60 fffff880`012ccaf3 : fffff980`098a8c60 fffffa80`062191b0 fffffa80`06219060 00000000`00000000 : intmsd+0xb402
fffff880`1af96bf0 fffff803`07eb4d66 : fffff980`098a8c60 00000000`00000002 fffffa80`0bf1b790 fffffa80`06219b10 : intmsd+0x3af3
fffff880`1af96cc0 fffff803`07eb4d66 : fffff980`098a8c60 fffffa80`06219b10 00000000`00000002 fffffa80`0bf1b790 : nt!IovCallDriver+0x3e6
fffff880`1af96d10 fffff880`0127b14e : fffffa80`05356de0 fffffa80`05356c90 00000000`00000002 fffffa80`0d745ba0 : nt!IovCallDriver+0x3e6
fffff880`1af96d60 fffff803`07eb4d66 : 00000000`00000002 fffff980`098a8c60 fffffa80`05356c90 fffffa80`0dd66e00 : volmgr!VmReadWrite+0x13e
fffff880`1af96da0 fffff880`01f63faa : fffff980`098a8c60 00000001`94c3f000 fffff980`098a8c60 fffffa80`0dd66e00 : nt!IovCallDriver+0x3e6
fffff880`1af96df0 fffff880`01f64236 : fffff980`098a8c60 00000001`94c3f000 fffff980`098a8c60 fffff803`07ebdc51 : fvevol!FveReadWrite+0x3e
fffff880`1af96e30 fffff803`07eb4d66 : fffffa80`06227040 00000000`00000002 fffffa80`0d9eee40 fffff880`01c18329 : fvevol!FveFilterRundownReadWrite+0x1b6
fffff880`1af96e80 fffff880`01c01af2 : fffffa80`06228190 fffffa80`06228040 00000000`00000002 fffffa80`0d9eee40 : nt!IovCallDriver+0x3e6
fffff880`1af96ed0 fffff803`07eb4d66 : fffff980`098a8c60 00000000`00000002 00000000`00000030 00000000`00000000 : volsnap!VolSnapReadFilter+0x112
fffff880`1af96f00 fffff880`01817b69 : fffff880`1af79828 fffff880`1af798f0 fffffa80`0c543080 fffffa80`045faf40 : nt!IovCallDriver+0x3e6
fffff880`1af96f50 fffff803`078c8b67 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000001 : Ntfs!NtfsStorageDriverCallout+0x16
fffff880`1af96f80 fffff803`078c8b2d : 00000000`00000000 00000000`00000000 00000000`00000002 fffff803`07923ad8 : nt!KxSwitchKernelStackCallout+0x27 (TrapFrame @ fffff880`1af96e40)
fffff880`1af79690 fffff803`07923ad8 : 00000000`00000006 00000000`00000000 00000000`00000006 00000000`00000000 : nt!KiSwitchKernelStackContinue
fffff880`1af796b0 fffff803`07926405 : fffff880`01817b54 fffff880`1af79828 00000000`00055000 00000000`00000001 : nt!KeExpandKernelStackAndCalloutInternal+0x218
fffff880`1af797b0 fffff880`01817aac : fffff880`1af79c60 fffff980`098a8c60 fffff8a0`01299b90 00000000`0005d000 : nt!KeExpandKernelStackAndCalloutEx+0x25
fffff880`1af797f0 fffff880`0181646a : fffff880`1af798a0 fffff880`1af79c60 fffff8a0`01299b90 fffff880`018770b0 : Ntfs!NtfsMultipleAsync+0xac
fffff880`1af79860 fffff880`01825b26 : 00000000`00000000 fffff880`1af79df0 fffff8a0`01299be8 fffff8a0`01299b90 : Ntfs!NtfsNonCachedIo+0x26a
fffff880`1af79a70 fffff880`0182742b : fffff880`1af79c60 fffff980`098a8c60 fffffa80`0c169701 00000000`00000001 : Ntfs!NtfsCommonRead+0x896
fffff880`1af79c30 fffff803`07eb4d66 : fffff980`098a8c60 fffff980`098a8c60 00000000`00000002 fffffa80`0dd800c0 : Ntfs!NtfsFsdRead+0x1db
fffff880`1af79e70 fffff880`013924ee : fffffa80`0fb303f0 fffff880`1af79f00 fffff980`098a8c60 fffffa80`0dd800c0 : nt!IovCallDriver+0x3e6
fffff880`1af79ec0 fffff880`013900b6 : fffffa80`0538bb00 00000000`00000002 fffff980`098a8c60 fffffa80`0d579e28 : FLTMGR!FltpLegacyProcessingAfterPreCallbacksCompleted+0x25e
fffff880`1af79f60 fffff803`07eb4d66 : fffff980`098a8c60 00000000`00000002 fffffa80`0d5f47b0 fffffa80`0398e598 : FLTMGR!FltpDispatch+0xb6
fffff880`1af79fc0 fffff803`0794b43e : fffff980`098a8c60 00000000`00000000 fffffa80`0d5f4878 fffffa80`0d579d80 : nt!IovCallDriver+0x3e6
fffff880`1af7a010 fffff803`0794fa61 : fffffa80`0d5f47b0 fffffa80`0d5f47d8 fffffa80`0d5f4818 fffffa80`0d5f4808 : nt!IoPageRead+0x21e
fffff880`1af7a060 fffff803`0794ab20 : 00000000`00000002 fffff880`1af7a0d0 fffffa80`0c543080 fffffa80`0d5f47b0 : nt!MiIssueHardFaultIO+0xc9
fffff880`1af7a0a0 fffff803`07908d8f : fffffa80`0c543080 fffff803`07bc5f40 00000000`c0033333 00000000`00000000 : nt!MiIssueHardFault+0x170
fffff880`1af7a130 fffff803`0792369b : 00000000`00000000 fffffa80`0c543080 8de00001`15efb900 00000000`00000000 : nt!MmAccessFault+0x81f
fffff880`1af7a270 fffff803`07939a37 : 00000000`00000000 00000000`00051000 00000000`00055000 fffff880`1af7a438 : nt!MmCheckCachedPageStates+0x8db
fffff880`1af7a400 fffff803`07938e62 : fffffa80`0c16b010 0000003c`a4737c90 fffff880`1af7a550 fffff880`00000000 : nt!CcMapAndCopyInToCache+0x397
fffff880`1af7a4f0 fffff880`018d4070 : 00000000`000551b8 fffffa80`0538bb00 fffffa80`0c5acbe8 fffffa80`0c1697b0 : nt!CcCopyWriteEx+0x1b2
fffff880`1af7a590 fffff880`01394415 : 00000000`00000000 00000000`000002b0 fffff880`1af7a9b0 00000000`00000000 : Ntfs!NtfsCopyWriteA+0x290
fffff880`1af7a7f0 fffff880`01394b53 : fffff880`1af7a8e0 00000000`00000000 fffffa80`0c169700 fffffa80`0c5aca90 : FLTMGR!FltpPerformFastIoCall+0x155
fffff880`1af7a850 fffff880`013bcabd : 00000000`000000d8 00000000`00000000 00000000`00000000 00000000`00000000 : FLTMGR!FltpPassThroughFastIo+0xc3
fffff880`1af7a8b0 fffff803`07caa249 : fffffa80`0c1697b0 00000000`00000000 00000000`00000000 00000000`00000000 : FLTMGR!FltpFastIoWrite+0x19d
fffff880`1af7a960 fffff803`078cd453 : fffffa80`0c543001 00000000`00000000 00000000`00000000 00000000`00000000 : nt!NtWriteFile+0x5b8
fffff880`1af7aa90 000007f8`a88c2c6a : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KiSystemServiceCopyEnd+0x13 (TrapFrame @ fffff880`1af7ab00)
0000003c`a744fa28 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : 0x000007f8`a88c2c6a
As we can see, we have many different file system related routines (Ntfs, FLTMGR, etc). Why? Well, as we move up the stack, we eventually see three intmsd.sys calls. This is the IntelliMemory Storage Filter Driver from Condusiv Technologies.

Now, let's go ahead and refer to the 2nd/3rd parameters as I've highlighted above!

If we run an !irp on the 3rd parameter:

 [  3, 0]  10 e0 fffffa8006219060 00000000 fffff880012ccbc8-fffffa8006219060 Success Error Cancel
           \Driver\intmsd    intmsd
            Args: 00008000 00000000 1cec3f000 00000000
>[  3, 0]  10 e1 fffffa8006219060 00000000 fffff880012183c0-00000000 Success Error Cancel pending
           \Driver\intmsd    partmgr!PmIoCompletion
If we run a !devobj on the 2nd parameter:

0: kd> !devobj fffffa8006219060
Device object (fffffa8006219060) is for:
 intmsd0 \Driver\intmsd DriverObject fffffa80049241b0
Current Irp 00000000 RefCount 0 Type 00000007 Flags 00000850
Vpb fffffa800534d8f0 Dacl fffff9a10052d360 DevExt fffffa80062191b0 DevObjExt fffffa8006219930 Dope fffffa800534d880
ExtensionFlags (0x80000800)  DOE_DEFAULT_SD_PRESENT, DOE_DESIGNATED_FDO
Characteristics (0x00000100)  FILE_DEVICE_SECURE_OPEN
AttachedDevice (Upper) fffffa8006219b10 \Driver\partmgr
AttachedTo (Lower) fffffa8006218060 \Driver\disk
Device queue is not busy.
That's why we're seeing so many file system and storage related routines being called. After this was found, I recommend disabling and/or preferably uninstalling IntelliMemory. After uninstalling IntelliMemory, the crashes ceased. Why?

First off, IntelliMemory™ is an intelligent data caching technology that provides faster access to frequently used files. IntelliMemory is supposed to improve latency and throughput by reducing disk I/O requests as active files are predicatively cached within the server to preempt round trips between VMs and network storage.

Remember how we saw various network related routines, etc, during the 0x133 debugging? Well, it's because IntelliMemory was the driver that was causing the loop.

...and solved : )



0xDEADDEAD Debugging (sort of)

This isn't really a tutorial, or how to debug, but just some information that came as quite a surprise to me that I thought I'd share!

Right, so a user shared their dump and I began to look, and it was of the 0xDEADDEAD bug check. I have never actually seen this bug check outside of driver development, crash and/or hang troubleshooting, learning, etc, because those are generally the only things it's used for. However, today, I learned that a driver can call KeBugCheckEx and pass it the code.

--------------------

MANUALLY_INITIATED_CRASH1 (deaddead)
The user manually initiated this crash dump.
Arguments:
Arg1: 000000000f000101
Arg2: 000000000023002c
Arg3: 0000000012a70000
Arg4: 0000000000000000
So here we have the basic bug check information. As far as I know, the parameters have no meaning in this bug check. There's no 'if parameter 1 = 3, it means x'. We can even see the description of the bug check via WinDbg is 'The user manually initiated this crash dump'.

1: kd> kv
Child-SP          RetAddr           : Args to Child                                                           : Call Site
fffff880`009e85f8 fffff880`0775bfcf : 00000000`deaddead 00000000`0f000101 00000000`0023002c 00000000`12a70000 : nt!KeBugCheckEx
fffff880`009e8600 00000000`deaddead : 00000000`0f000101 00000000`0023002c 00000000`12a70000 00000000`00000000 : NETwNs64+0xa6fcf
fffff880`009e8608 00000000`0f000101 : 00000000`0023002c 00000000`12a70000 00000000`00000000 fffffa80`1419ceae : 0xdeaddead
fffff880`009e8610 00000000`0023002c : 00000000`12a70000 00000000`00000000 fffffa80`1419ceae 00000000`00000000 : 0xf000101
fffff880`009e8618 00000000`12a70000 : 00000000`00000000 fffffa80`1419ceae 00000000`00000000 00000000`00000000 : 0x23002c
fffff880`009e8620 00000000`00000000 : fffffa80`1419ceae 00000000`00000000 00000000`00000000 fffffa80`0a73ed80 : 0x12a70000
From the call stack we can see that 0xdeaddead called into NETwNs64+0xa6fcf (Intel(R) Wireless WiFi Link 5000 Series Adapter Driver for Windows 7). Also, what's interesting, in the RetAddr portion of the call stack, we can see deaddead is mentioned. I am going to assume that this indicates that NETwNs64.sys went ahead and called nt!KeBugCheckEx and passed it the code. That's what gave us the 0xDEADDEAD bug check.

Cool : )

Wednesday, February 12, 2014

DRIVER_IRQL_NOT_LESS_OR_EQUAL (d1) NETIO.sys Debugging

This will be my first among many debugging tutorials (aside from older ones)! I very much want to get back into writing tutorials for a few reasons, but the main is that they are very fun, and I obviously learn more and more every day! Another thing about tutorials is they are all over the web on various blogs, forums, etc, but many have different styles of the way they were written. Some may contain more info, etc, and different methods of explaining, etc. My goal with everything regarding debugging has always and will always be explain as much as my personal knowledge permits, and do it in the way that anyone that doesn't know how to do it can learn it by reading and then performing it hands on by themselves.

--------------------

Let's get started! We're going to start off with the *D1 bug check, but more specifically when NETIO.sys is the labeled fault of the crash. I've been debugging online on various forums for a little over two years now, and in the past few months to a year, I have seen a huge increase in NETIO.sys *D1's. I am going to tell you right now that NETIO.sys *D1 bug checks are caused 100% of the time from what I have seen (and I have debugged and solved MANY NETIO.sys *D1's) by either the following:

1. Network drivers themselves; whether they need to be updated, reinstalled due to corruption, rolled back due to bug in latest version, etc.

2. 3rd party antivirus or firewall software causing NETBIOS and/or network related conflicts.
 (99% of the time #2 is the cause, and rarely have I seen #1 but it's of course possible).
Right, so with all of this said, what's NETIO.sys? NETIO.sys is Microsoft Windows' Network I/O Subsystem.

First of all, Input and Output (I/O) is actually extremely in-depth and will not be explained in this blog post. If you of course would however like to read about it and learn (which I highly recommend), read the following from the msdn website.

More specifically, we're interested in Network I/O operations in this regard - msdn link here

--------------------

With this said, the basic definition (per msdn) for the *D1 bug check is the following:

DRIVER_IRQL_NOT_LESS_OR_EQUAL (d1)
This indicates that a kernel-mode driver attempted to access pageable memory at a process IRQL that was too high.
A driver tried to access an address that is pageable (or that is completely invalid) while the IRQL was too high. This bug check is usually caused by drivers that have used improper addresses. 
So, this is a fairly standard explanation for a person who understands how Windows' memory manager works. If you don't however, you can kinda sorta get the gist of it, but at the same time it may not really mean much to you. Let's go into detail on the memory manager subsystem, because we're all about learning!

Windows' memory manager runs at IRQL 0 (PASSIVE_LEVEL), which is the layer that threads run at. If for example a driver attempts to access memory that is not currently in RAM (paged), this will cause an exception (thrown by the processor). When this exception happens, Windows' memory manager will go ahead and catch the exception, fetch memory from the hard disk, and then finally the processor will then go ahead and return to the driver that attempted to access this memory which was not paged, but at this point will now be paged.

Alright, great, so why do we get this bug check? *D1 occurs when a driver attempts to access memory that is running at a higher IRQL. This is not good (clearly), because when the driver attempts to access paged-out memory at IRQL[n] (I use (n) because there are different levels, but I will go ahead and say that 2 is the most common, so from this point on I will use 2), Windows' memory manager will page-in the memory and run at IRQL 0. This cannot happen, so Windows' memory manager will bug check the system as a deadlock will occur.

This can also occur not only if a driver attempts to access memory that is running at a higher IRQL, but if a driver attempts to access an invalid memory address.

--------------------

Now that we have all of that said, let's move onto an example crash dump (just a random *D1 NETIO.sys dump from a user that I managed to dig up):

0: kd> !analyze -v
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

DRIVER_IRQL_NOT_LESS_OR_EQUAL (d1)
An attempt was made to access a pageable (or completely invalid) address at an
interrupt request level (IRQL) that is too high.  This is usually
caused by drivers using improper addresses.
If kernel debugger is available get stack backtrace.
Arguments:
Arg1: 0000000000000028, memory referenced
Arg2: 0000000000000002, IRQL
Arg3: 0000000000000000, value 0 = read operation, 1 = write operation
Arg4: fffff80000f8c43f, address which referenced memory

Debugging Details:
------------------
Right away we can see that the 2nd parameter and/or argument of the *D1 bug check itself is 0000000000000002 (2) as I mentioned earlier. There are various other ways to display the parameters of a bug check in different ways.

For example, by running the .bugcheck command:

0: kd> .bugcheck
Bugcheck code 000000D1
Arguments 00000000`00000028 00000000`00000002 00000000`00000000 fffff800`00f8c43f

I've highlighted where '00000000`00000002' = 2.

Before running !analyze v it's listed:

BugCheck D1, {28, 2, 0, fffff80000f8c43f}
It's also listed after running !analyze v further in the dump:

CURRENT_IRQL:  2
So, with this specific crash dump, it was a minidump and didn't contain very much information. For example, just have a look at the call stack:

STACK_TEXT: 
ffffd000`253ab288 fffff801`9776d7e9 : 00000000`0000000a 00000000`00000028 00000000`00000002 00000000`00000000 : nt!KeBugCheckEx
ffffd000`253ab290 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KiBugCheckDispatch+0x69
We can see from the stack that we just have Windows' usual error handling and fault tolerance bug check related routines. No driver calls, etc. Very dead stack. Let's go ahead and refer to the FBID:

FAILURE_BUCKET_ID:  X64_0xD1_NETIO!RtlCopyBufferToMdl+1f
We can see the fault of the crash is NETIO.sys (calling into?) the RtlCopyBufferToMdl routine. I am not entirely sure actually what this routine implies, however just from knowing the acronyms...

Rtl = Run-Time Library.

Mdl = Memory Descriptor List.

I can imagine there's some sort of buffer being copied from an RTL routine to an MDL. So, what does this mean to us? Well, nothing really. It's a minidump with not very much information. All we know is something is conflicting with NETIO.sys. Let's go ahead and take a look at the loaded modules list (Debug > Modules). Now, in NETIO.sys dumps you are going to want to check for popular antivirus drivers. I would list them here, but there are so many. I think I'll add them over time. I will just go ahead and let you know that this specific dump contained ggc.sys which is a driver in relation to Quick Heal AntiVirus.

0: kd> lmvm ggc
start             end                 module name
fffff800`01600000 fffff800`01618000   ggc        (deferred)           
    Image path: \SystemRoot\system32\DRIVERS\ggc.sys
    Image name: ggc.sys
    Timestamp:        Wed Sep 04 02:43:22 2013
So, there's ggc.sys. Now, at this point I recommend removal of QuickHeal and explained that it was likely causing network related conflicts, which in turn caused the system to crash. After QuickHeal was removed, the crashes stopped.

--------------------

-- Today when I wake up I will add a list of antiviruses and firewalls that I have seen cause this bug check.