Debugging and reverse engineering: April 2014

Sunday, April 27, 2014

700 answers on MS Answers

I hit 700 answers today on MS Answers! Very cool. I am ecstatic I've been able to help so many people.

Friday, April 25, 2014

0x50 Debugging - When it's not the antivirus!

0x50: the bug check we all love because it's so easy to say 'Remove avast!, AVG, Kaspersky, McAfee, Norton, ESET, etc.' Most commonly this bug check is caused by an antivirus corrupting the file system, or by interceptors conflicting when anti-malware and antivirus active protections run side by side (or when two antiviruses run at once). Lots of different possibilities. However, what if we're not so quick to blame the antivirus, and instead come to find that it's faulty RAM? Well, let's talk about it!

---------------------------

PAGE_FAULT_IN_NONPAGED_AREA (50)
Invalid system memory was referenced. This cannot be protected by try-except,
it must be protected by a Probe. Typically the address is just plain bad or it
is pointing at freed memory.
Arguments:
Arg1: fffffa806589b700, memory referenced.
Arg2: 0000000000000000, value 0 = read operation, 1 = write operation.
Arg3: fffff803fd7133d4, If non-zero, the instruction address which referenced the bad memory
address.
Arg4: 0000000000000002, (reserved)

^^ Here we of course have the basic output of the bug check. Parameter 1 is the memory that was referenced, parameter 2 tells us whether the access was a read (0) or a write (1), and parameter 3 (if non-zero) is the address of the instruction that referenced the bad memory address (parameter 1). So, we can so far say that the instruction at address fffff803fd7133d4 attempted to read address fffffa806589b700. Pretty easy so far!

6: kd> r cr2
cr2=fffffa806589b700

^^ We can see above that the 1st parameter address was stored in cr2 prior to calling the page fault handler. This doesn't tell us anything we don't already know about the bug check, just a confirmation, if you will.

---------------------------

Now that we know all of this, let's go ahead and run !pte on the 1st parameter address. !pte displays the page table entry (PTE) and page directory entry (PDE) for the specified address.

6: kd> !pte fffffa806589b700
                                           VA fffffa806589b700
PXE at FFFFF6FB7DBEDFA8    PPE at FFFFF6FB7DBF5008    PDE at FFFFF6FB7EA01960    PTE at FFFFF6FD4032C4D8
contains 000000021EFFF863 contains 0000000000000000
GetUlongFromAddress: unable to read from fffff803fd9e20e4
pfn 21efff    ---DA--KWEV not valid

^^ We can see from the above that the address fffffa806589b700 is indeed invalid. With this said, why did the instruction at fffff803fd7133d4 attempt to read fffffa806589b700?
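As a side note, the PXE/PPE/PDE/PTE addresses that !pte prints can be reproduced by hand. The sketch below is only an illustration: it assumes the fixed self-map base addresses that x64 Windows used before Windows 10 version 1607 (which happen to match the bases visible in this dump), and the function name is my own.

```python
# Assumption: fixed self-map bases used by x64 Windows prior to Windows 10 1607.
PTE_BASE = 0xFFFFF68000000000
PDE_BASE = 0xFFFFF6FB40000000
PPE_BASE = 0xFFFFF6FB7DA00000
PXE_BASE = 0xFFFFF6FB7DBED000

VA_MASK = (1 << 48) - 1  # only the low 48 bits of a virtual address are translated


def paging_entry_addresses(va):
    """Return the virtual addresses of the PXE, PPE, PDE and PTE that map va."""
    va &= VA_MASK
    # Each entry is 8 bytes, so the index at every level is shifted left by 3.
    return {
        "PXE": PXE_BASE + ((va >> 39) << 3),
        "PPE": PPE_BASE + ((va >> 30) << 3),
        "PDE": PDE_BASE + ((va >> 21) << 3),
        "PTE": PTE_BASE + ((va >> 12) << 3),
    }


for name, address in paging_entry_addresses(0xFFFFFA806589B700).items():
    print(f"{name} at {address:016X}")
```

For fffffa806589b700 this yields the same four addresses shown in the !pte output above.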

Let's go ahead and run kv to get the trapframe:

6: kd> kv
Child-SP RetAddr : Args to Child : Call Site
fffff880`02f324f8 fffff803`fd7a55a0 : 00000000`00000050 fffffa80`6589b700 00000000`00000000 fffff880`02f326e0 : nt!KeBugCheckEx
fffff880`02f32500 fffff803`fd71eacb : 00000000`00000000 fffffa80`6589b700 fffffa80`06705700 00000000`00000000 : nt! ?? ::FNODOBFM::`string'+0x33e2a
fffff880`02f325a0 fffff803`fd6e1eee : 00000000`00000000 a1e00021`d8925847 00000000`00001000 fffff880`02f326e0 : nt!MmAccessFault+0x55b
fffff880`02f326e0 fffff803`fd7133d4 : 00000003`00000000 00000000`00000000 a0200002`15917005 00000000`73576d4d : nt!KiPageFault+0x16e (TrapFrame @ fffff880`02f326e0)
fffff880`02f32870 fffff803`fd732434 : fffffa80`080e2568 fffff880`02f32ac0 00000000`00000001 fffff803`fd731c9e : nt!MiAgeWorkingSet+0x264
fffff880`02f32a30 fffff803`fd7318bd : fffff880`02f32b09 00000000`00000001 00000000`00000000 00000000`00000000 : nt!MiTrimOrAgeWorkingSet+0xb4
fffff880`02f32a80 fffff803`fd740e94 : 00000000`00000007 00000000`00000000 00000000`00000001 00000000`00000007 : nt!MiProcessWorkingSets+0x1dd
fffff880`02f32b70 fffff803`fd731e31 : 00000000`00000002 fffff880`02f32be0 00000000`00000008 00000000`00000000 : nt!MmWorkingSetManager+0x40
fffff880`02f32ba0 fffff803`fd6b6fd9 : fffffa80`06705700 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KeBalanceSetManager+0xd9
fffff880`02f32c10 fffff803`fd76b7e6 : fffff880`00e65180 fffffa80`06705700 fffff880`00e70f40 fffffa80`066fd040 : nt!PspSystemThreadStartup+0x59
fffff880`02f32c60 00000000`00000000 : fffff880`02f33000 fffff880`02f2d000 00000000`00000000 00000000`00000000 : nt!KiStartSystemThread+0x16

6: kd> .trap fffff880`02f326e0
NOTE: The trap frame does not contain all registers.
Some register values may be zeroed or incorrect.
rax=00000000021d8925 rbx=0000000000000000 rcx=8000000000000000
rdx=0000098000000000 rsi=0000000000000000 rdi=0000000000000000
rip=fffff803fd7133d4 rsp=fffff88002f32870 rbp=000000000000021e
r8=0000000fffffffff r9=fffff803fd9db700 r10=0000058000000000
r11=000007f8ca99f000 r12=0000000000000000 r13=0000000000000000
r14=0000000000000000 r15=0000000000000000
iopl=0 nv up ei ng nz na po cy
nt!MiAgeWorkingSet+0x264:
fffff803`fd7133d4 498b5510 mov rdx,qword ptr [r13+10h] ds:00000000`00000010=????????????????

^^ The instruction we failed on, at address fffff803`fd7133d4, dereferenced r13+10h, where r13 is 0000000000000000 according to the trap frame (bearing in mind WinDbg's note above that the trap frame does not contain all registers). That would make it a memory read from the address 00000000`00000010. Let's go ahead and run !pte on 00000000`00000010 to see whether or not it's a valid address.

6: kd> !pte 00000000`00000010
                                           VA 0000000000000010
PXE at FFFFF6FB7DBED000    PPE at FFFFF6FB7DA00000    PDE at FFFFF6FB40000000    PTE at FFFFF68000000000
contains 19000001DD904867 contains 014000020228E867 contains 01500001D070F867 contains 0000000000000000
pfn 1dd904    ---DA--UWEV pfn 20228e    ---DA--UWEV pfn 1d070f    ---DA--UWEV not valid

^^ It's invalid. We can go one step further and run dd, which dumps the memory at that virtual address as doublewords.

6: kd> dd 00000000`00000010
00000000`00000010 ???????? ???????? ???????? ????????
00000000`00000020 ???????? ???????? ???????? ????????
00000000`00000030 ???????? ???????? ???????? ????????
00000000`00000040 ???????? ???????? ???????? ????????
00000000`00000050 ???????? ???????? ???????? ????????
00000000`00000060 ???????? ???????? ???????? ????????
00000000`00000070 ???????? ???????? ???????? ????????
00000000`00000080 ???????? ???????? ???????? ????????

^^ Again, completely invalid.

---------------------------

Right, so the code wanted to read from 00000000`00000010, which as we can see above is a completely invalid and/or bogus address. The 1st parameter and cr2, however, note that we failed reading address fffffa806589b700. This does not make sense, and is essentially not logically possible.

MiAgeWorkingSet told the hardware to read from 00000000`00000010 (which again by the way is a completely invalid address), and the hardware came back and said 'I cannot read fffffa806589b700'. I like ntdebug's analogy on this, which can be read (here). The way I like to explain it in this specific scenario is: you kindly ask the waiter at your table for some delicious hot lava water (doesn't exist, of course! :']), he writes it down, but comes back and says 'I'm sorry, but we're all out of coffee'.

Ultimately, the hardware was told to read a completely invalid address, and then came back with an entirely different invalid address. Seems very fishy on the hardware side, doesn't it?
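The comparison being made here can be written out as a tiny check. This is only a sketch of the reasoning, with the values copied from the dump; it assumes the r13 value shown by .trap is trustworthy, which WinDbg's own note warns may not always be the case. The function name is made up for illustration.

```python
def effective_address(base, displacement):
    """Address formed by a [base+displacement] operand, wrapped to 64 bits."""
    return (base + displacement) & 0xFFFFFFFFFFFFFFFF


cr2 = 0xFFFFFA806589B700   # bug check parameter 1, also reported by 'r cr2'
r13 = 0x0                  # as displayed by .trap (assumed to be accurate)

# mov rdx, qword ptr [r13+10h]
expected = effective_address(r13, 0x10)
mismatch = expected != cr2

print(f"address the instruction asked for: {expected:016x}")
print(f"address the fault was reported on: {cr2:016x}")
print("mismatch" if mismatch else "consistent")
```

When the two addresses disagree like this and no driver can be blamed, hardware becomes the prime suspect.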

The analysis alone points to hardware, but there's more to back that up: Driver Verifier was enabled for this specific crash dump, yet it failed to flag a 3rd party driver as the culprit (because there isn't one):

Verify Level 2092b ... enabled options are:
    Special pool
    Special irql
    All pool allocations checked on unload
    Deadlock detection enabled
    Security checks enabled
    Miscellaneous checks enabled

Summary of All Verifier Statistics

RaiseIrqls                             0x0
AcquireSpinLocks                       0xfbf4b4
Synch Executions                       0x124b88
Trims                                  0xa77

Pool Allocations Attempted             0x28395e
Pool Allocations Succeeded             0x28395e
Pool Allocations Succeeded SpecialPool 0x28395e
Pool Allocations With NO TAG           0x26
Pool Allocations Failed                0x0
Resource Allocations Failed Deliberately   0x0

Current paged pool allocations         0xa8 for 00100740 bytes
Peak paged pool allocations            0xbc for 002387F0 bytes
Current nonpaged pool allocations      0x6eb9 for 0099F300 bytes
Peak nonpaged pool allocations         0x6f4f for 009B0B08 bytes

---------------------------

Ultimately, in this scenario, I asked the user to run Memtest (as RAM is as always the most likely culprit in a situation like this), and sure enough in ~10 hrs time, there were 7 errors.

Thanks for reading!

HP Envy 700-074 ACPI BSOD BIOS update released + story!

*CLICK HERE FOR BIOS UPDATE*

---------------------------

Way back in January I made this blog post (here), which also made me realize just how fast time flies, because it's already almost May!

In that post, I mentioned I wasn't sure whether or not I was the only person who had reported the issue in-depth as a model-wide bug rather than a system-specific issue. Well, it appears that I was, because for a few weeks I was in contact with various 'higher-ups' within HP. Supposedly they really had no idea the bug even existed. I can only assume that owners of the 700-074 model were contacting HP's basic customer support and ultimately getting either refunds or a replacement of the same model, which would eventually have the exact same problem. I imagine it didn't appear 'alarming' enough yet for HP's customer/replacement support to bring it to the attention of the higher-ups within HP, but I digress.

First off, I wasn't sure at all where to go to actually report this as a somewhat serious issue. I looked all over, and all I could find regarding contact was basic technical and/or customer support. I did not own an HP device myself, so I couldn't contact either of those, because unless you have a service tag, you are charged. I didn't want to risk being charged money I didn't have only to be hung up on or get nowhere. So, with that said, I wandered over to the HP Support forums, created a thread, and hoped I would get something. I checked the thread a few hours later and found a response notifying me that there is no official HP presence on the forums, but the person was nice enough to go ahead and escalate my issue so that it could possibly get the attention I was hoping for.

Sure enough, around 24 hours later I received an email from someone in HP's Executive Customer Relations. I was asked for any info about the problem I was trying to bring to attention, so I explained thoroughly what was going on with the Envy 700-074. I made sure to stress that I'm just a young guy who isn't doing this for any reason other than to help HP possibly push a solution to this problem faster, so their customers didn't have to keep returning the same model 5+ times until they said screw it and got a full refund. Shortly after, I received a reply stating that I would be updated within 24 to 48 hours with additional information on their findings. I never ended up receiving that update. At this point, I wasn't really sure what to do. I didn't want to email again and be annoying, so I let that go.

About a week or so later, thanks to a good friend of mine (MS MVP - John Griffith/Jcgriff2), I got into contact with an HP Social Media Ambassador, who was very kind. We exchanged emails back-and-forth for a few weeks, but I eventually stopped receiving responses from them as well. At the last time we spoke, I was told that the information had successfully reached HP's Escalation Engineer team (very cool). I can only imagine someone in HP said 'stop contacting this person' or something related, because I'm just a regular 20 year old guy interested in computers trying to help fix something any way I can, and they're a major business. I suppose it could have been considered sensitive information once it reached a certain point. I do appreciate both people I was in contact with, though.

Fast forward almost three months (to now), and I received an email notification that my thread on the HP forums had a new reply. I figured it was just another Envy 700-074 customer saying they were having this problem, etc. However, it was a person notifying me that a BIOS update had been released a few days earlier, and that it addressed this problem. Sure enough, I checked HP's website for the Envy 700-074, and a BIOS update was posted on 4/22!

Rejoice, another bug solved that I had a little to do with! :o)

Tuesday, April 22, 2014

0xBE - Journaling file system & Log File Service

For the first time today I came upon NTFS Journaling in a crash dump, so I thought I'd go ahead and write a blog post about it!

Thread/post here.

---------------------------

First off, before going into the specific scenario, let's talk about what a journaling file system is. Essentially, a JFS is a file system that keeps track of changes in what is known as a journal. This journal is generally a circular log located in a dedicated area of the file system. It's very important to note that changes are recorded in the journal before they are committed to the main file system and carried through to the disk.

With that said, one question may remain, which is likely "Why do we even go through this process in the first place?" Well, in its simplest terms, this process is done to maintain data integrity. If a crash, hang, etc., occurs, the JFS has a log it can use to recreate any data that may have been corrupted or lost. Not only will it return the data to the pre-crash configuration, but it will also go ahead and recover any logged changes that had not yet been written, and store them in the location they would have been stored in if the system had not been unexpectedly interrupted.
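To make the 'log first, commit second' idea a bit more concrete, here is a toy sketch. It is not how NTFS or the LFS is actually implemented; the class, its fields, and the simulated crash flag are all invented purely for illustration.

```python
class ToyJournal:
    """Toy redo log: record the change first, apply it to 'disk' second."""

    def __init__(self):
        self.log = []    # stands in for the journal area
        self.disk = {}   # stands in for the main file system

    def write(self, key, value, crash_before_apply=False):
        self.log.append((key, value))   # 1. journal the intended change
        if crash_before_apply:
            return                      # simulated crash: 'disk' is never updated
        self.disk[key] = value          # 2. commit the change to the main store

    def recover(self):
        """Replay every logged change; re-applying a finished change is harmless."""
        for key, value in self.log:
            self.disk[key] = value
        self.log.clear()


store = ToyJournal()
store.write("a.txt", "v1")
store.write("b.txt", "v2", crash_before_apply=True)
print(store.disk)   # b.txt is missing after the simulated crash
store.recover()
print(store.disk)   # b.txt has been restored from the journal
```

The point of the ordering is that a change which was interrupted halfway still exists in the journal, so it can be replayed after the crash.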

Now, you may be wondering how we actually communicate and/or work with something like this, and that's where NTFS.sys comes in. With NTFS.sys, we have a series of kernel-mode routines (which I will display below in my analysis) that are used to access the log file. This log file is specifically divided into two regions:

1. LFS Restart Area.

2. Infinite Logging Area.

Here's a diagram from http://www.ntfs.com/transaction.htm:

NTFS.sys calls the LFS (Log File Service) to read/write to the Restart Area. In the diagram, we can see the two areas we mentioned above. You may notice that under LFS Restart Area, we have two copies. This is done so that if one copy is corrupt or inaccessible, the second is still available.

If we take a look at the other side, we can see the Logging Area, which as I mentioned above is circular (which is where 'infinite' comes from). New records are added to the log file until it reaches full capacity, at which point the LFS frees up space for new records once any prior writes to the log are complete.

For what we're discussing here, that's about all we need to know. If you'd like to know more, I suggest reading the link above.

---------------------------

Great, so now we have some pretty decent knowledge of how JFS, LFS, and NTFS fit together to maintain data integrity. Let's now go ahead and take a look at the crash I dealt with earlier:

ATTEMPTED_WRITE_TO_READONLY_MEMORY (be)
An attempt was made to write to readonly memory. The guilty driver is on the
stack trace (and is typically the current instruction pointer).
When possible, the guilty driver's name (Unicode string) is printed on
the bugcheck screen and saved in KiBugCheckDriver.
Arguments:
Arg1: fffff9800a472010, Virtual address for the attempted write.
Arg2: 80e0000057391121, PTE contents.
Arg3: fffff880031b5d50, (reserved)
Arg4: 000000000000000b, (reserved)

Let's go ahead and take a look at that 2nd parameter:

3: kd> !pte 80e0000057391121
                                           VA 80e0000057391121
PXE at FFFFF6FB7DBED000    PPE at FFFFF6FB7DA00008    PDE at FFFFF6FB400015C8    PTE at FFFFF680002B9C88
contains 0070000138B42867 contains 60D00000A3408867 contains 0000000000000000
pfn 138b42    ---DA--UWEV pfn a3408     ---DA--UWEV not valid

WARNING: noncanonical VA, accesses will fault !
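Since the 2nd parameter is described as the PTE contents, another way to look at it is to decode its bits directly. The sketch below only covers the architectural x86-64 page table bits; Windows also uses some software-defined bits, which are ignored here, and the names in the table are my own labels.

```python
# Architectural x86-64 PTE bit positions (software-defined bits are ignored).
PTE_FLAG_BITS = {
    "valid": 0,
    "writable": 1,
    "user": 2,
    "accessed": 5,
    "dirty": 6,
    "global": 8,
    "no_execute": 63,
}


def decode_pte(pte):
    """Split a raw 64-bit PTE value into its flag bits and page frame number."""
    flags = {name: bool((pte >> bit) & 1) for name, bit in PTE_FLAG_BITS.items()}
    pfn = (pte >> 12) & ((1 << 40) - 1)   # bits 12..51 hold the page frame number
    return flags, pfn


flags, pfn = decode_pte(0x80E0000057391121)   # Arg2 from the bug check
print(f"pfn {pfn:x}")
for name, is_set in flags.items():
    print(f"{name:<10} {is_set}")
```

Decoded this way, the entry is valid but its writable bit is clear, which lines up with the bug check's description of a write to readonly memory.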

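The !pte output flags an invalid, noncanonical virtual address (VA). Keep in mind, though, that the 2nd parameter is documented as the PTE contents rather than an address, so !pte is interpreting a PTE value as if it were a VA here; running !pte on the 1st parameter (fffff9800a472010) would be the way to inspect the page that was actually written to. Let's take a look at the call stack: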

3: kd> k
Child-SP RetAddr Call Site
fffff880`031b5be8 fffff800`02ef37c6 nt!KeBugCheckEx
fffff880`031b5bf0 fffff800`02e73cee nt! ?? ::FNODOBFM::`string'+0x44cde
fffff880`031b5d50 fffff880`012fcd0e nt!KiPageFault+0x16e
fffff880`031b5ee0 fffff880`01303be5 Ntfs!LfsWriteLogRecordIntoLogPage+0x1ee <--- As the LFS data is being written to the LFS log, we call into a pagefault.
fffff880`031b5f80 fffff880`012ff536 Ntfs!LfsWrite+0x145 <--- Writing to the LFS.
fffff880`031b6040 fffff880`013002ef Ntfs!NtfsWriteLog+0x466 <--- Preparing to call the LFS to write to the log.
fffff880`031b6290 fffff880`013013ad Ntfs!NtfsChangeAttributeValue+0x34f <--- Changing the value of an attribute. In NTFS, attributes are the pieces that make up a file record (e.g. $STANDARD_INFORMATION, $DATA).
fffff880`031b6480 fffff880`012cea70 Ntfs!NtfsUpdateStandardInformation+0x26b <--- Updating the file's standard information (timestamps, etc.).
fffff880`031b6590 fffff880`012cf41d Ntfs!NtfsCommonFlushBuffers+0x1f0 <--- Common buffer flush routine.
fffff880`031b6670 fffff800`0331ed26 Ntfs!NtfsFsdFlushBuffers+0x10d <--- File System Driver (FSD) dispatch routine for the buffer flush.
fffff880`031b66e0 fffff880`01041bcf nt!IovCallDriver+0x566
fffff880`031b6740 fffff880`010406df fltmgr!FltpLegacyProcessingAfterPreCallbacksCompleted+0x24f
fffff880`031b67d0 fffff800`0331ed26 fltmgr!FltpDispatch+0xcf
fffff880`031b6830 fffff800`0317f17b nt!IovCallDriver+0x566
fffff880`031b6890 fffff800`03113ea1 nt!IopSynchronousServiceTail+0xfb
fffff880`031b6900 fffff800`02e74e53 nt!NtFlushBuffersFile+0x171
fffff880`031b6990 fffff800`02e71410 nt!KiSystemServiceCopyEnd+0x13
fffff880`031b6b28 fffff800`03114c5f nt!KiServiceLinkage
fffff880`031b6b30 fffff800`03114a20 nt!CmpFileFlush+0x3f
fffff880`031b6b70 fffff800`03114caa nt!HvWriteDirtyDataToHive+0xe0
fffff880`031b6be0 fffff800`03105bbf nt!HvOptimizedSyncHive+0x32
fffff880`031b6c10 fffff800`03105d25 nt!CmpDoFlushNextHive+0x197
fffff880`031b6c70 fffff800`02e7f261 nt!CmpLazyFlushWorker+0xa5
fffff880`031b6cb0 fffff800`031122ea nt!ExpWorkerThread+0x111
fffff880`031b6d40 fffff800`02e668e6 nt!PspSystemThreadStartup+0x5a
fffff880`031b6d80 00000000`00000000 nt!KxStartSystemThread+0x16

Bug check (BE), as I noted above, indicates that there was an attempt to write to readonly memory. The write in question came from this call right here - Ntfs!LfsWriteLogRecordIntoLogPage+0x1ee. So, why did NTFS.sys attempt to write to readonly memory, causing a page fault? Generally, in almost all cases, you will not see a system (i.e. non-3rd party) driver accessing invalid, readonly, etc., memory on its own.

I had the user run Chkdsk, which found and corrected errors but reported no bad sectors. I also recommended running Seatools in DOS, so I will report back when I can with any new info, etc.

---------------------------

References I used to learn about JFS, LFS, etc:

http://www.ntfs.com/transaction.htm
http://www.webopedia.com/TERM/J/journaled_file_system.html
http://en.wikipedia.org/wiki/Journaling_file_system

[SOLVED] DRIVER_IRQL_NOT_LESS_OR_EQUAL

Link to solved thread - BSOD driver irql not less or equal netwsw00.sys

What the issue was - Modem drivers were too old to function with the OS, therefore a new modem with more up to date device drivers needed to be purchased.

[SOLVED] BAD_POOL_HEADER

Link to solved thread - ntkrnlmp.exe causing BSOD randomly

What the issue was - Sentinel drivers were uninstalled and reinstalled.

[SOLVED] KERNEL_SECURITY_CHECK_FAILURE

Link to solved thread - Kernel security check failure

What the issue was -

- AVG Secure Search needed to be removed.

- ESET needed to be removed and replaced with Windows Defender.

[SOLVED] BAD_POOL_HEADER

Link to solved thread - Error Bad Pool Header

What the issue was - Kaspersky needed to be removed and replaced with Windows Defender.

[SOLVED] PFN_LIST_CORRUPT

Link to solved thread - BSOD issues with windows 8.1

What the issue was - Creative drivers needed to be updated.

[SOLVED] MULTIPLE_IRP_COMPLETE_REQUESTS

Link to solved thread - Blue Screen Multiple IRP Complete Requests on Windows 8.1

What the issue was - Atheros Bluetooth drivers needed to be updated.

[SOLVED] IRQL_NOT_LESS_OR_EQUAL

Link to solved thread - Windows 7 blue screen (kernel power ID41 category 63)

What the issue was -

- Gigabyte On/Off needed to be removed + Easy Saver.

- Kaspersky needed to be removed and replaced with MSE.

- Samsung printer software removed (device drivers too old).

[SOLVED] PAGE_FAULT_IN_NONPAGED_AREA

Link to solved thread - My computer crashes EVERY. SINGLE. TIME. it *tries* to come out of sleep mode.

What the issue was -

- avast! needed to be removed and replaced with Windows Defender.

- Realtek wireless driver updated.

[SOLVED] DRIVER_IRQL_NOT_LESS_OR_EQUAL

Link to solved thread - driver_irql_not_less_or_equal (tcpip.sys) blue screen error leading to computer restart

What the issue was - Malwarebytes needed to be removed.

[SOLVED] DRIVER_UNLOADED_WITHOUT_CANCELLING_PENDING_OPERATIONS

Link to solved thread - ERROR: driver_unloaded_without_cancelling_pending_operations.

What the issue was -

- CyberLink software needed to be removed.

- AVG needed to be removed and replaced with Windows Defender.

[SOLVED] MEMORY_MANAGEMENT / SYSTEM_THREAD_EXCEPTION_NOT_HANDLED_M / SYSTEM_SERVICE_EXCEPTION / CRITICAL_STRUCTURE_CORRUPTION

Link to solved thread - Windows Repeatedly Crashing

What the issue was - Faulty RAM stick.

[SOLVED] DRIVER_POWER_STATE_FAILURE

Link to solved thread - Driver power state failure caused by afd.sys

What the issue was - ???

[SOLVED] KERNEL_DATA_INPAGE_ERROR

Link to solved thread - Windows 8.1 kernel data inpage error

What the issue was - AVG needed to be removed and replaced with MSE.

[SOLVED] DRIVER_POWER_STATE_FAILURE

Link to solved thread - Driver Power State Failure (BSOD) after Windows 8.1 Installation

What the issue was - Beta video card driver needed to be installed.

Monday, April 14, 2014

[SOLVED] BAD_POOL_HEADER

Link to solved thread - BSOD, bad_pool_header error - Windows 8.1 x64

What the issue was - AVG + McAfee was installed, therefore kernel corruption was occurring due to conflicts. McAfee was removed.

[SOLVED] MEMORY_MANAGEMENT

Link to solved thread - Frequent BSOD

What the issue was - XBMC needed to be updated to the beta version.

[SOLVED] UNEXPECTED_KERNEL_MODE_TRAP / PROCESS_HAS_LOCKED_PAGES / DRIVER_IRQL_NOT_LESS_OR_EQUAL

Link to solved thread - Help With BSOD Windows 8.1

What the issue was - Kaspersky needed to be removed and replaced with Windows Defender.

[SOLVED] DRIVER_POWER_STATE_FAILURE

Link to solved thread - driver_power_state_failure asus zenbook windows 8.1

What the issue was -

- avast! needed to be removed and replaced with Windows Defender.

- ExpressCache needed to be removed.

[SOLVED] DRIVER_IRQL_NOT_LESS_OR_EQUAL

Link to solved thread - 0xD1, 0x3B and 0xA bluescreens

What the issue was -

- avast! needed to be removed and replaced with MSE.

- Bigfoot network drivers needed to be updated.

[SOLVED] DRIVER_IRQL_NOT_LESS_OR_EQUAL / MEMORY_MANAGEMENT / SYSTEM_SERVICE_EXCEPTION / IRQL_NOT_LESS_OR_EQUAL / DRIVER_OVERRAN_STACK_BUFFER

Link to solved thread - lots of BSOD's

What the issue was -

- Video card drivers needed to be updated.

- Kaspersky needed to be removed and replaced with Windows Defender.

- Samsung printer needed to be removed due to very old device drivers.

[SOLVED] DRIVER_POWER_STATE_FAILURE

Link to solved thread - ntoskrnl.exe driver power state failure

What the issue was - avast! needed to be removed and replaced with Windows Defender.

[SOLVED] KERNEL_SECURITY_CHECK_FAILURE

Link to solved thread - BSOD Kernel_Security_Check_Failure

What the issue was -

- USB drivers needed to be updated.

- Kaspersky needed to be removed and replaced with Windows Defender.

[SOLVED] DRIVER_IRQL_NOT_LESS_OR_EQUAL / SYSTEM_THREAD_EXCEPTION_NOT_HANDLED_M

Link to solved thread - BSOD Windows 7 Profesional 64bit (Wdf01000.sys)

What the issue was - Lumension Security + McAfee needed to be removed and replaced with MSE.

[SOLVED] DRIVER_POWER_STATE_FAILURE

Link to solved thread - Driver Power State Failure caused by adress "ntoskrnl.exe+14dca0"

What the issue was - Intel Rapid Storage Technology driver needed to be updated.

[SOLVED] IRQL_NOT_LESS_OR_EQUAL

Link to solved thread - Windows 7 irql_not_less_or_equal minutes after login

What the issue was - Asus utility driver needed to be updated.

[SOLVED] PAGE_FAULT_IN_NONPAGED_AREA / SYSTEM_THREAD_EXCEPTION_NOT_HANDLED_M / PFN_LIST_CORRUPT / KMODE_EXCEPTION_NOT_HANDLED / CRITICAL_STRUCTURE_CORRUPTION / SYSTEM_SERVICE_EXCEPTION

Link to solved thread - DPC_WATCHDOG_VIOLATION everytime at boot, help?

What the issue was - CPU was faulty and needed to be replaced.

[SOLVED] BAD_POOL_CALLER

Link to solved thread - BSOD Bad_pool_caller on Switching Users (Windows 7 Ultimate x64)

What the issue was - NetGear USB Control Center needed to be uninstalled.

[SOLVED] IRQL_NOT_LESS_OR_EQUAL

Link to solved thread - Bad Pool Caller

What the issue was - Network adapter was too old regarding its device drivers and a new one needed to be purchased.

[SOLVED] IRQL_UNEXPECTED_VALUE

Link to solved thread - BSoD Locale ID 3081 BC Code c8

What the issue was - avast! needed to be removed and replaced with MSE.

[SOLVED] DRIVER_POWER_STATE_FAILURE

Link to solved thread - Windows 8.1 Crashes while attempting to go to sleep

What the issue was - Daemon Tools needed to be uninstalled.

[SOLVED] DPC_WATCHDOG_VIOLATION

Link to solved thread - DPC Watchdog Violation

What the issue was - LogMeIn Hamachi was uninstalled and the crashing ceased.

[SOLVED] SYSTEM_THREAD_EXCEPTION_NOT_HANDLED_M

Link to solved thread - BSOD Occurred. Computer cannot wake or boot up properly after sleeping.

What the issue was - Video card was faulty and needed to be replaced.

[SOLVED] MULTIPLE_IRP_COMPLETE_REQUESTS

Link to solved thread - MULTIPLE_IRP_COMPLETE_REQUEST Windows 8.1

What the issue was - Bluetooth drivers needed to be updated.

[SOLVED] MULTIPLE_IRP_COMPLETE_REQUESTS

Link to solved thread - Windows 8.1 BSOD (Multiple IRP Complete Requests)

What the issue was - Bluetooth drivers needed to be updated.

[SOLVED] KERNEL_DATA_INPAGE_ERROR

Link to solved thread - KERNEL_DATA_INPAGE_ERROR and black stripes / glitchy screen

What the issue was - Hard disk was faulty and needed to be replaced.

[SOLVED] DRIVER_IRQL_NOT_LESS_OR_EQUAL

Link to solved thread - Driver IRQL not less or equal - Win 8.1 (with minidump files)

What the issue was -

- Video card drivers needed to be updated.

- AiCharger.sys was present on the system, therefore Asus bloatware needed to be uninstalled.

- AVG needed to be removed and replaced with MSE.

[SOLVED] BAD_POOL_HEADER

Link to solved thread - BSOD BAD_POOL_HEADER caused by ntoskrnl.exe

What the issue was - Auto-OC was unstable, therefore the CMOS was cleared and the system was stable at defaults.

[SOLVED] MULTIPLE_IRP_COMPLETE_REQUESTS

Link to solved thread - Help with Windows 8.1 MULTIPLE_IRP_COMPLETE_REQUESTS

What the issue was - Bluetooth drivers needed to be updated.

[SOLVED] DRIVER_POWER_STATE_FAILURE

Link to solved thread - Shutdown BSOD: Driver Power State Failure (Windows 8.1 Pro 64-bit)

What the issue was -

- AODDriver2.sys was listed and loaded and the software needed to be removed.

- AiCharger needed to be uninstalled.

- Diskeeper needed to be uninstalled.

[SOLVED] DRIVER_IRQL_NOT_LESS_OR_EQUAL

Link to solved thread - Windows 8 BSOD

What the issue was -

- Western Digital SES software/driver needed to be removed.

- CyberLink software needed to be removed.

- Realtek PCI/PCIe Adapters driver needed to be updated.

- Video card drivers needed to be updated.

[SOLVED] MULTIPLE_IRP_COMPLETE_REQUESTS

Link to solved thread - Windows 8.1 Professional - Multiple random BSoD

What the issue was - ATEN USB to Serial device driver needed to be updated.

Friday, April 11, 2014

0x44 btfilter.sys Bug (?)

-- For any users reading this and experiencing this problem, my current advice is to go ahead and ensure all of your USB/Bluetooth related drivers (especially Atheros) are up to date via your manufacturer's website.

Worst case, roll back any recent Windows Updates.

--------------------------------

Hello everyone!

Over the past few days, many 0x44 BSODs have been popping up. I can't help but smile, as it reminds me of this - LogMeIn Hamachi Windows 8 and 8.1 Bug. When you have crashes this frequent and this consistent, it's generally a bug in a driver belonging to a specific piece of software, a recently rolled-out update, etc.

Anyway, enough rambling, let's get on with the analysis!

--------------------------------

First off, the bug check seems to consistently be MULTIPLE_IRP_COMPLETE_REQUESTS (44).

-- A driver has called IoCompleteRequest to ask that an IRP be completed, but the packet has already been completed.

3: kd> k
Child-SP RetAddr Call Site
ffffd000`207bb8e8 fffff803`43c08b4b nt!KeBugCheckEx
ffffd000`207bb8f0 fffff800`02cd4389 nt! ?? ::FNODOBFM::`string'+0x2a9ab
ffffd000`207bba00 fffff800`02cd391d USBPORT!USBPORT_Core_iCompleteDoneTransfer+0x979
ffffd000`207bbba0 fffff800`02cd33b8 USBPORT!USBPORT_Core_iIrpCsqCompleteDoneTransfer+0x21d
ffffd000`207bbc00 fffff800`02cd312e USBPORT!USBPORT_Core_UsbIocDpc_Worker+0x238
ffffd000`207bbc70 fffff803`43adad10 USBPORT!USBPORT_Xdpc_Worker_IocDpc+0x1fe
ffffd000`207bbcf0 fffff803`43ada9f0 nt!KiExecuteAllDpcs+0x1b0
ffffd000`207bbe40 fffff803`43bd0dd5 nt!KiRetireDpcList+0xd0
ffffd000`207bbfb0 fffff803`43bd0bd9 nt!KxRetireDpcList+0x5
ffffd000`25051a40 fffff803`43bd2e45 nt!KiDispatchInterruptContinue
ffffd000`25051a70 fffff803`43bcead3 nt!KiDpcInterruptBypass+0x25
ffffd000`25051a80 00000018`c76f0399 nt!KiChainedDispatch+0x173
00000018`c5cccfc0 00000000`00000000 0x00000018`c76f0399

Pretty interesting call stack! As we move up, we can see that it calls to retire a DPC list, and then it makes a call to execute all DPCs. We then call into what appears to be the USBPORT.sys driver and have a few worker routines going on, which inevitably call into the bug check.

-- FAILURE_BUCKET_ID: 0x44_IMAGE_ACPI

We can see that by default, WinDbg blames the Advanced Configuration and Power Interface (ACPI). In its simplest terms, ACPI brings power management under the control of the operating system, as opposed to the previous BIOS-central system, which relied on platform-specific firmware to determine power management and configuration policy.

-- ADDITIONAL_DEBUG_TEXT: USB\VID_0CF3&PID_3004

WinDbg provides us with a Vendor/Product ID, which we can look up in a USB ID database to find out which USB-based device this specifically is. Vendor ID 0CF3 falls under Atheros Communications, Inc. Product ID 3004 is not listed, but 3000, 3002, and 3005 are all related to Bluetooth (from Atheros).
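If you find yourself doing this lookup often, pulling the two IDs out of the string is easy to script. A minimal sketch follows; the function name and the regular expression are my own, and the actual vendor/product lookup against a USB ID database is left out.

```python
import re

# Matches the VID_xxxx&PID_xxxx portion of a USB hardware ID string.
HARDWARE_ID_PATTERN = re.compile(r"VID_([0-9A-Fa-f]{4})&PID_([0-9A-Fa-f]{4})")


def parse_usb_hardware_id(text):
    """Return (vendor_id, product_id) as integers, or None if no IDs are found."""
    match = HARDWARE_ID_PATTERN.search(text)
    if match is None:
        return None
    vendor, product = match.groups()
    return int(vendor, 16), int(product, 16)


ids = parse_usb_hardware_id(r"USB\VID_0CF3&PID_3004")
if ids is not None:
    vendor_id, product_id = ids
    print(f"vendor {vendor_id:04X}, product {product_id:04X}")
```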

Interestingly enough, in the crash dump itself, this is also mentioned:

-- OVERLAPPED_MODULE: Address regions for 'bthport' and 'bthport.sys' overlap.

bthport.sys is the Bluetooth Bus driver.

Right, so the summary so far is that a Bluetooth related driver (possibly related to Atheros) appears to be causing issues. Let's dig further to confirm or deny this.

--------------------------------

3: kd> dt nt!_IO_STATUS_BLOCK ffffe00001587ac0
   +0x000 Status           : 0n20223232
   +0x000 Pointer          : 0xffffe000`01349500 Void
   +0x008 Information      : 0

Now that we have this, let's check the STATUS field layout:

3: kd> ? 0n20223232
Evaluate expression: 20223232 = 00000000`01349500

3: kd> .formats 0n20223232
Evaluate expression:
Hex:     00000000`01349500
Decimal: 20223232
Octal:   0000000000000115112400
Binary: 00000000 00000000 00000000 00000000 00000001 00110100 10010101 00000000
Chars:   .....4..
Time:    Sat Aug 22 21:33:52 1970
Float:   low 3.31677e-038 high 0
Double: 9.9916e-317

Assuming I am correct, '00000000`01349500' falls under:

NT_SUCCESS(Status) Evaluates to TRUE if the return value specified by Status is a success type (0 − 0x3FFFFFFF) or an informational type (0x40000000 − 0x7FFFFFFF).
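To double-check that reading, here is a small Python sketch of the same test. It assumes the standard NTSTATUS layout, where the top two bits of the 32-bit value are the severity (0 = success, 1 = informational, 2 = warning, 3 = error). The helper names are my own, not anything from the debugger:

```python
SEVERITIES = ("success", "informational", "warning", "error")

def ntstatus_severity(status):
    """Name the severity encoded in bits 31-30 of a 32-bit NTSTATUS value."""
    return SEVERITIES[(status & 0xFFFFFFFF) >> 30]

def nt_success(status):
    """Mirror NT_SUCCESS: true for success and informational values."""
    return (status & 0xFFFFFFFF) < 0x80000000

status = 0x01349500  # 0n20223232 from the dt output above
print(ntstatus_severity(status), nt_success(status))
```

For comparison, a value such as 0xC0000089 (seen later in the 0xC4 call stack arguments) has both top bits set, so it falls in the error range and NT_SUCCESS would be FALSE.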

Moving forward, let's take a closer look at the IRP:

3: kd> !irp ffffe00001587ac0 1
Irp is active with 8 stacks 7 is current (= 0xffffe00001587d40)
No Mdl: No System Buffer: Thread 00000000: Irp stack trace.
Flags = 00000000
ThreadListEntry.Flink = ffffe00001587ae0
ThreadListEntry.Blink = ffffe00001587ae0
IoStatus.Status = 00000000
IoStatus.Information = 00000000
RequestorMode = 00000000
Cancel = 00
CancelIrql = 0
ApcEnvironment = 00
UserIosb = 00000000
UserEvent = 00000000
Overlay.AsynchronousParameters.UserApcRoutine = 00000000
Overlay.AsynchronousParameters.UserApcContext = 00000000
Overlay.AllocationSize = 00000000 - 00000000
CancelRoutine = 00000000
UserBuffer = 00000000
&Tail.Overlay.DeviceQueueEntry = ffffe00001587b38
Tail.Overlay.Thread = 00000000
Tail.Overlay.AuxiliaryBuffer = 00000000
Tail.Overlay.ListEntry.Flink = 00000000
Tail.Overlay.ListEntry.Blink = 00000000
Tail.Overlay.CurrentStackLocation = ffffe00001587d40
Tail.Overlay.OriginalFileObject = 00000000
Tail.Apc = 00000000
Tail.CompletionKey = 00000000
     cmd flg cl Device   File     Completion-Context
[ 0, 0]   0 0 00000000 00000000 00000000-00000000

            Args: 00000000 00000000 00000000 00000000
[ 0, 0]   0 0 00000000 00000000 00000000-00000000

            Args: 00000000 00000000 00000000 00000000
[ 0, 0]   0 0 00000000 00000000 00000000-00000000

            Args: 00000000 00000000 00000000 00000000
[ 0, 0]   0 0 00000000 00000000 00000000-00000000

            Args: 00000000 00000000 00000000 00000000
[ 0, 0]   0 0 00000000 00000000 00000000-00000000

            Args: 00000000 00000000 00000000 00000000
[ 0, 0]   0 0 00000000 00000000 00000000-00000000

            Args: 00000000 00000000 00000000 00000000
*** ERROR: Module load completed but symbols could not be loaded for btfilter.sys
>[ f, 0]   0 e1 ffffe000033f1050 00000000 fffff8000415d460-ffffe00001660180 Success Error Cancel pending
           \Driver\usbohci    btfilter
            Args: ffffe00003831910 00000000 00220003 00000000
[ f, 0]   0 0 00000000 00000000 00000000-00000000

            Args: ffffe00003831910 00000000 00220003 00000000

This is a very interesting IRP output, as a lot of the values are zeroed out. Also, it's not noting that the IRP has completed, or that the PENDING has returned.

In any case, we can see that btfilter.sys is mentioned. btfilter.sys is the Atheros Bluetooth driver.
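One more detail worth pulling out of the current stack location is the third argument, 00220003. Assuming this is an I/O control code (the major function is 0xf, which is IRP_MJ_INTERNAL_DEVICE_CONTROL), it can be split into its fields using the CTL_CODE layout: device type in bits 31-16, access in bits 15-14, function in bits 13-2, and method in bits 1-0. The sketch below is in Python and the function name decode_ioctl is hypothetical:

```python
def decode_ioctl(code):
    """Split a 32-bit I/O control code into its CTL_CODE fields."""
    return {
        "device_type": (code >> 16) & 0xFFFF,
        "access": (code >> 14) & 0x3,
        "function": (code >> 2) & 0xFFF,
        "method": code & 0x3,
    }

fields = decode_ioctl(0x00220003)
for name, value in fields.items():
    print(f"{name:12} = 0x{value:X}")
```

Device type 0x22 with function 0 and method 3 (METHOD_NEITHER) matches, as far as I know, IOCTL_INTERNAL_USB_SUBMIT_URB, which would fit a Bluetooth filter driver sending a USB request down the stack.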

3: kd> !devstack ffffe000033f1050
!DevObj   !DrvObj            !DevExt   ObjectName
ffffe000031b4b80 *** ERROR: Module load completed but symbols could not be loaded for usbfilter.sys
\Driver\usbfilter ffffe000031b4cd0
ffffe000026ab050 \Driver\usbhub     ffffe000026ab1a0 0000003a
ffffe000031b5bf0 \Driver\usbfilter ffffe000031b5d40
ffffe000031cfe50 \Driver\ACPI       ffffe00000f94b90
> ffffe000033f1050 \Driver\usbohci    ffffe000033f11a0 USBPDO-2
!DevNode ffffe0000336a130 :
DeviceInst is "USB\ROOT_HUB\4&301d4ea9&0"
ServiceName is "usbhub"

Here's where we can see mention of ACPI, and other USB related drivers (even the root hub of USB).

--------------------------------

Overall, with all of this said, I can only guess. With the zeroed-out values and no completion notice, either the IRP is already well beyond completion and PENDING return, or, despite the bug check, the IRP is not completing (or in a completion state) at all and is instead remaining PENDING. It's also entirely possible that usbfilter.sys and btfilter.sys are both battling for completion, each believing that it owns the packet. In any case, I believe this was brought on by a recent buggy update, but I will have to see as time goes on. I also need to look into a few more things regarding debugging this.

More info soon, hopefully.

Wednesday, April 9, 2014

0x133 (with a little 0xC4 in addition) Debugging - New Version (Update)

This is an updated version of the following blog post - 0x133 (with a little 0xC4 in addition) Debugging

---------------------------

Before we get into the actual debugging of the bug check itself, let's talk about why this bug check 'came to be', etc, so you have a little more than a basic understanding. I guess you could call it a little history lesson! This is something I believe I am going to try and do with all of my future debugging posts. Not only is it neat & fun for me to get to talk about it, but it's always great to learn not just how to solve the bugcheck, but why it came to be an actual bugcheck.

Starting with Windows 8 (Server 2012, etc) we had the famous (or infamous depending on your stance with Windows 8!) touch-optimized user interface (GUI) that was based off of Microsoft's Metro. We all know the good & bad that has been said about Metro, so let's not get into that. Regardless, that's not what we're really here to discuss anyway. What we can say regarding the Metro GUI is that it was developed primarily to improve the tablet-based computing experience, as Microsoft was competing with mobile OSes such as Android and iOS.

Now, with that said (we'll get back to it) let's move on to the Interrupt handler, or otherwise known as ISR (Interrupt Service Routine). In its simplest definition, this is a callback subroutine in microcontroller firmware, operating system or device driver whose execution is triggered by the reception of an interrupt. Interrupt handlers have a multitude of functions, which vary based on the reason the interrupt was generated and the speed at which the interrupt handler completes its task.

An ISR runs at a high interrupt request level (IRQL). There are 16 levels (0-15) on an x64-based processor, and 32 levels (0-31) on an x86-based processor. Programs/Applications run at 0 (PASSIVE_LEVEL), DPCs run at 2 (DISPATCH_LEVEL), and device interrupts will go ahead and fire at a higher level than that. Let's say for example that you have IRQ 4 and IRQ 5 waiting to be serviced. IRQ 5 holds a higher priority than IRQ 4 does, therefore it will shove its own function to the top of the stack and run. Afterwards, it will unwind, and IRQ 4 will continue, having no idea that it was stopped at all. With this said, interrupts will interrupt each other (which is where the word interrupt comes from).

ISRs will queue work, because hogging the CPU is not good. Obviously the CPU needs to address all of the interrupts, the work it's assigned, etc, however, it needs to be 'fair'. This queued work is turned into what is known as a DPC (Deferred Procedure Call). This mechanism allows high-priority tasks to defer required but lower-priority tasks for later execution from the ISR. This permits device drivers and other low-level event consumers to perform the high-priority part of their processing quickly, and schedule non-critical additional processing for execution at a lower priority.

DPCs block all applications from running (or being serviced by the CPU), as they are serviced ahead of any thread. Absolutely no applications whatsoever (not even explorer.exe) can run on that processor while a DPC is pending and waiting to be serviced. Now, let's use a neat example for fun! I would go as far as to say that if you've used a computer for an extended period of time, you've at least once had a system hang where you were still able to move your mouse cursor. However, even though you were able to use your mouse cursor, NO applications worked. You weren't able to Ctrl+Shift+Esc to bring up Task Manager, and you weren't able to do anything at all regarding any applications. I always laugh when I think of this, because I just imagine tons and tons of frustration and attempting to open Task Manager over and over again, or trying to open the Start Menu to Restart the system.

Anyway, the reason no applications work, but the mouse cursor still does and is not frozen with the applications, is because the mouse cursor is actually an interrupt that's firing a scheduled DPC. Because of this, the mouse cursor still gets drawn. So, if you've ever wondered why your system froze but you still had a moving mouse cursor, now you know why. Some pretty fun food for thought, I think!

Now that we're (hopefully) smiling about mouse cursors still moving during DPC hangs, we can return to why we discussed Metro as a GUI. Remember, as I said, Metro was developed primarily to improve the tablet-based computing experience. Now, on a tablet-based device, you don't have a keyboard. Okay, we don't have a keyboard on a tablet, big deal. Well, because we don't have a keyboard, if a DPC hang occurs, we cannot force a system crash (and/or break-out) to call a bugcheck and ultimately safely bring down the system. You may be asking "Couldn't I just power down the tablet?" Yes, absolutely. However, if you power down the tablet, you risk data corruption.

-- If you're interested in reading about forcing a bugcheck with the keyboard, take a look at this article - Forcing a System Crash from the Keyboard (Windows Debuggers)

Due to the inability to force a bugcheck if a DPC hang occurs on a tablet-based device due to the lack of a keyboard, the wonderful DPC_WATCHDOG_VIOLATION bug check was created. If we didn't have this bug check, if a DPC hang occurred on your tablet-based device, you'd either inevitably wait for the device to lose battery and simply turn off because it'd hang forever, or the DPC(s) MIGHT (depending - and I use might and depending very loosely) be serviced and Windows will continue.

To my knowledge, the rules that apply to calling the bugcheck in a DPC hang-based scenario are as follows:

1. If one specific DPC has been running for ~10 seconds, the system will bugcheck.

2. If an accumulation of DPCs have been running for ~15 seconds, the system will bugcheck.

(Microsoft recommends that DPCs should not run longer than 100 microseconds and ISRs should not run longer than 25 microseconds, however the actual timeout values on the system are set much higher).

-- I'd also like to go and make note that depending on your tablet device, there are sometimes alternative methods of performing an ACPI 'reset' by usually hitting two buttons simultaneously.

Fantastic! Now that we know all of the cool information above, let's proceed with the debugging. Remember, as with all bug checks, there is no single culprit. You will see 0x133's being caused by all sorts of things! With this said, just because you see what you see here in my debugging example, doesn't mean you'll see it in another! This is not a 'how to solve' tutorial, but more so a look into how to generally debug this type of bug check.

DPC_WATCHDOG_VIOLATION (133)
The DPC watchdog detected a prolonged run time at an IRQL of DISPATCH_LEVEL
or above.
Arguments:
Arg1: 0000000000000000, A single DPC or ISR exceeded its time allotment. The offending
component can usually be identified with a stack trace.
Arg2: 0000000000000501, The DPC time count (in ticks).
Arg3: 0000000000000500, The DPC time allotment (in ticks).

^^ Here we have the basic bug check information. First off, the DPC_WATCHDOG_VIOLATION bug check can be triggered in two ways:

1. If a single DPC exceeds a specified number of ticks, the system will bugcheck with 0x133, and the first parameter of the bug check will be set to 0. If this is the case, the system's time limit for single DPC will be in the third parameter of the bug check, with the number of ticks taken by this DPC in the 2nd parameter of the bug check.

2. If the system exceeds a larger timeout of time spent cumulatively in all DPCs since the IRQL was raised to DPC level, the system will bugcheck with a 0x133, and the first parameter will be set to 1.

BugCheck 133, {0, 501, 500, 0}

^^ In this case, the first parameter = 0. We'd refer to #1. This specific DPC has run for 0x501 ticks, when the limit was 0x500.
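To get a feel for what those tick counts mean in wall-clock time, here is a rough Python sketch. It ASSUMES one tick is 15.625 ms (a common default clock interval on Windows); the real interval on the crashing system may differ, and the function name ticks_to_seconds is hypothetical:

```python
ASSUMED_TICK_MS = 15.625  # assumption: default clock interval, not read from the dump

def ticks_to_seconds(ticks, tick_ms=ASSUMED_TICK_MS):
    """Convert a DPC watchdog tick count to seconds for a given tick length."""
    return ticks * tick_ms / 1000.0

allotment = ticks_to_seconds(0x500)  # Arg3: the single-DPC limit
elapsed = ticks_to_seconds(0x501)    # Arg2: how long this DPC actually ran
print(f"limit ~{allotment:.2f} s, ran ~{elapsed:.2f} s")
```

Under that assumption the limit works out to roughly 20 seconds, so the ~10 second figure quoted in the rules earlier should be treated as approximate. Either way, the DPC ran one tick past its allotment.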

In the case of a stop 0x133 with the first parameter set to 0, the call stack should contain the offending driver. Rather than running a kv as we usually do, let's go ahead and run knL:

0: kd> knL
   # Child-SP RetAddr Call Site
   00 fffff803`a2459408 fffff803`a044bf4b nt!KeBugCheckEx
   01 fffff803`a2459410 fffff803`a0310774 nt! ?? ::FNODOBFM::`string'+0x145a4
   02 fffff803`a2459490 fffff803`a0228eca nt!KeUpdateTime+0x2ec
   03 fffff803`a2459670 fffff803`a02c501e hal!HalpTimerClockInterrupt+0x86
   04 fffff803`a24596a0 fffff880`18564dd2 nt!KiInterruptDispatchLBControl+0x1ce
   05 fffff803`a2459830 fffff803`a02f51ea usb80236!CancelSendsTimerDpc+0xa6
   06 fffff803`a2459870 fffff803`a02f3655 nt!KiProcessExpiredTimerList+0x22a
   07 fffff803`a24599a0 fffff803`a02f5668 nt!KiExpireTimerTable+0xa9
   08 fffff803`a2459a40 fffff803`a02f4a06 nt!KiTimerExpiration+0xc8
   09 fffff803`a2459af0 fffff803`a02f59ba nt!KiRetireDpcList+0x1f6
   0a fffff803`a2459c60 00000000`00000000 nt!KiIdleLoop+0x5a

k = Displays the call stack of the given thread.
n = Displays frame numbers.
L (important you capitalize it) = Hides source lines in the display.

This makes a nice, neat, and informative call stack for 0x133 debugging.

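As we can see above in the call stack, the offending driver is usb80236.sys, which is the Remote NDIS USB Driver. Its CancelSendsTimerDpc routine was still running when the clock interrupt fired (KiInterruptDispatchLBControl and HalpTimerClockInterrupt), and KeUpdateTime then went on to call the bug check.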

Let’s view the driver’s unassembled DPC routine:

   0: kd> ub fffff880`18564dd2
   usb80236!CancelSendsTimerDpc+0x81:
   fffff880`18564dad 488bcd mov rcx,rbp
   fffff880`18564db0 c6465001 mov byte ptr [rsi+50h],1
   fffff880`18564db4 ff1556230000 call qword ptr [usb80236!_imp_KeReleaseSpinLockFromDpcLevel (fffff880`18567110)]
   fffff880`18564dba 488bcb mov rcx,rbx
   fffff880`18564dbd ff15a5220000 call qword ptr [usb80236!_imp_IoCancelIrp (fffff880`18567068)]
   fffff880`18564dc3 488bcd mov rcx,rbp
   fffff880`18564dc6 ff15e4220000 call qword ptr [usb80236!_imp_KeAcquireSpinLockAtDpcLevel (fffff880`185670b0)]
   fffff880`18564dcc ff8708030000 inc dword ptr [rdi+308h]

Reading the instructions from top to bottom:

1. The driver called the KeReleaseSpinLockFromDpcLevel routine, which releases an executive spin lock without changing the IRQL.

2. The driver then called the IoCancelIrp routine, which sets the cancel bit in a given IRP and calls the cancel routine for the IRP if there is one.

3. The driver then called the KeAcquireSpinLockAtDpcLevel routine, which acquires a spin lock when the caller is already running at IRQL >= DISPATCH_LEVEL.

-- The caller should release the spin lock with KeReleaseSpinLockFromDpcLevel as quickly as possible.

Going a bit deeper...

0: kd> u fffff880`18564dd2
   usb80236!CancelSendsTimerDpc+0xa6:
   fffff880`18564dd2 488b36 mov rsi,qword ptr [rsi]
   fffff880`18564dd5 493bf6 cmp rsi,r14
   fffff880`18564dd8 75a9 jne usb80236!CancelSendsTimerDpc+0x57 (fffff880`18564d83)
   fffff880`18564dda 4c8d87b0020000 lea r8,[rdi+2B0h]
   fffff880`18564de1 488d8f70020000 lea rcx,[rdi+270h]
   fffff880`18564de8 48c7c2001f0afa mov rdx,0FFFFFFFFFA0A1F00h
   fffff880`18564def ff1553220000 call qword ptr [usb80236!_imp_KeSetTimer (fffff880`18567048)]
   fffff880`18564df5 eb12 jmp usb80236!CancelSendsTimerDpc+0xdd (fffff880`18564e09)

1. We are still inside the CancelSendsTimerDpc routine. I do not know exactly what this routine does, however, it's certainly something in regards to a timer on and/or for a DPC. According to Harry (x BlueRobot), he believes that the driver may use a Custom DPC associated with a Timer object.

2. The cmp/jne pair compares rsi against r14 and, if they differ, jumps back to CancelSendsTimerDpc+0x57, which is earlier in the same routine. In other words, the release/cancel/acquire sequence we looked at sits inside a loop.

3. Once the loop exits, the driver calls the KeSetTimer routine which sets the absolute or relative interval at which a timer object is to be set to a signaled state and, optionally, supplies a CustomTimerDpc routine to be executed when that interval expires. As far as I know, what should be going on here is the timer being set so the CustomTimerDpc routine gets called again later, but CancelSendsTimerDpc may be stuck in its loop.
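As a side note, the constant loaded into rdx right before the KeSetTimer call (0FFFFFFFFFA0A1F00h) can be decoded. KeSetTimer takes its DueTime in 100-nanosecond units, and a negative value means an interval relative to the current time. A small Python sketch of that decoding, with a hypothetical helper name:

```python
def decode_due_time(raw):
    """Interpret a raw 64-bit DueTime as (kind, seconds)."""
    value = raw & 0xFFFFFFFFFFFFFFFF
    if value >= 1 << 63:           # reinterpret as a signed 64-bit integer
        value -= 1 << 64
    kind = "relative" if value < 0 else "absolute"
    seconds = abs(value) / 10_000_000  # 100 ns units per second
    return kind, seconds

kind, seconds = decode_due_time(0xFFFFFFFFFA0A1F00)
print(kind, seconds)
```

So, if I'm reading the disassembly right, the driver sets its timer to expire 10 seconds from now each time it reaches this point.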

Overall, what seems to be occurring is the DPC may be looping on itself by repeatedly taking and dropping a spin lock at DPC level around a call to IoCancelIrp. This is happening over and over again, therefore we have a loop.

If we run a !pcr to show us the queued DPCs for the processor:

-- Essentially, !pcr will display the current status of the Processor Control Region (PCR) on a specific processor.

0: kd> !pcr
   KPCR for Processor 0 at fffff803a056a000:
   Major 1 Minor 1
   NtTib.ExceptionList: fffff803a2452000
   NtTib.StackBase: fffff803a2453080
   NtTib.StackLimit: 0000000004ccee48
   NtTib.SubSystemTib: fffff803a056a000
   NtTib.Version: 00000000a056a180
   NtTib.UserPointer: fffff803a056a7f0
   NtTib.SelfTib: 000000007ef44000

   SelfPcr: 0000000000000000
   Prcb: fffff803a056a180
   Irql: 0000000000000000
   IRR: 0000000000000000
   IDR: 0000000000000000
   InterruptMode: 0000000000000000
   IDT: 0000000000000000
   GDT: 0000000000000000
   TSS: 0000000000000000

   CurrentThread: fffff803a05c4880
   NextThread: fffffa80036e8040
   IdleThread: fffff803a05c4880

   DpcQueue: 0xfffffa80073595e0 0xfffff8800400c960 [Normal] dxgkrnl!DpiFdoDpcForIsr
   0xfffffa8007be9b68 0xfffff88001efb380 [Normal] ndis!ndisInterruptDpc
   0xfffff803a04fcfe0 0xfffff803a027c71c [Normal] nt!PpmCheckPeriodicStart
   0xfffff803a0545d60 0xfffff803a031e45c [Normal] nt!KiBalanceSetManagerDeferredRoutine

1. ndis.sys:

(Network Driver Interface Specification driver) interrupt routine. The Network Driver Interface Specification (NDIS) is an application programming interface (API) for network interface cards (NICs). The NDIS forms the Logical Link Control (LLC) sublayer, which is the upper sublayer of the OSI data link layer (layer 2). Therefore, the NDIS acts as the interface between the Media Access Control (MAC) sublayer, which is the lower sublayer of the data link layer, and the network layer (layer 3).

The NDIS is a library of functions often referred to as a "wrapper" that hides the underlying complexity of the NIC hardware and serves as a standard interface for level 3 network protocol drivers and hardware level MAC drivers. Another common LLC is the Open Data-Link Interface (ODI).

2. dxgkrnl.sys - Direct X Kernel.

So, with all of this said, we know that something is causing usb80236.sys to call into a loop, and it may be anything that's working with and/or possibly interfering with Windows' networking, or Direct X. We'll need to do some detective work to determine what is causing this, as it's a system driver and is being faulted by something else. At this point, since we were at quite the wall, I recommended enabling Driver Verifier so we could see what's going on. The user enabled DV, and sure enough, they had an 0xC4 crash! Let's have a look below!

DRIVER_VERIFIER_DETECTED_VIOLATION (c4)
A device driver attempting to corrupt the system has been caught. This is
because the driver was specified in the registry as being suspect (by the
administrator) and the kernel has enabled substantial checking of this driver.
If the driver attempts to corrupt the system, bugchecks 0xC4, 0xC1 and 0xA will
be among the most commonly seen crashes.
Arguments:
Arg1: 0000000000001011, Invariant MDL buffer contents for Read Irp were modified during dispatch or buffer backed by dummy pages.
Arg2: fffffa8006219060, Device object to which the Read IRP was issued.
Arg3: fffff980098a8c60, The address of the IRP.
Arg4: fffff8801a5b3000, System-Space Virtual Address for the buffer that the MDL describes.

Here we have the basic bug check info. Take note of the second and third parameters (the device object and the IRP address), as they will be useful later on. Let's go ahead and take a look at the call stack first:

0: kd> kv
Child-SP RetAddr : Args to Child : Call Site
fffff880`1af96888 fffff803`07a31fcc : 00000000`000000c4 00000000`00001011 fffffa80`06219060 fffff980`098a8c60 : nt!KeBugCheckEx
fffff880`1af96890 fffff803`07ebfa51 : 00000000`00000000 fffff880`1af96a00 fffff980`098a8c60 fffff803`07cba000 : nt!MdlInvariantPreProcessing1+0x200
fffff880`1af96900 fffff803`07ebdc51 : 00000000`00000000 fffffa80`0dd877d0 fffffa80`06218060 00000000`00000001 : nt!IovpCallDriver1+0x1cd
fffff880`1af96a60 fffff803`07eb4cde : fffff980`098a8c60 00000000`00000002 fffff880`1af96c40 00000000`c0000089 : nt!VfBeforeCallDriver+0x141
fffff880`1af96a90 fffff880`012ccdab : fffff980`098a8c60 fffffa80`06219060 fffffa80`062191b0 fffffa80`0dd877d0 : nt!IovCallDriver+0x35e
fffff880`1af96ae0 fffff880`012ccb7e : fffff980`098a8c60 fffff880`1af96ba0 fffff980`098a8e50 fffffa80`06219060 : intmsd+0x3dab
fffff880`1af96b20 fffff880`012d4402 : 00000000`00000000 fffff880`1af96c40 00000000`00000000 fffff803`07928d44 : intmsd+0x3b7e
fffff880`1af96b60 fffff880`012ccaf3 : fffff980`098a8c60 fffffa80`062191b0 fffffa80`06219060 00000000`00000000 : intmsd+0xb402
fffff880`1af96bf0 fffff803`07eb4d66 : fffff980`098a8c60 00000000`00000002 fffffa80`0bf1b790 fffffa80`06219b10 : intmsd+0x3af3
fffff880`1af96cc0 fffff803`07eb4d66 : fffff980`098a8c60 fffffa80`06219b10 00000000`00000002 fffffa80`0bf1b790 : nt!IovCallDriver+0x3e6
fffff880`1af96d10 fffff880`0127b14e : fffffa80`05356de0 fffffa80`05356c90 00000000`00000002 fffffa80`0d745ba0 : nt!IovCallDriver+0x3e6
fffff880`1af96d60 fffff803`07eb4d66 : 00000000`00000002 fffff980`098a8c60 fffffa80`05356c90 fffffa80`0dd66e00 : volmgr!VmReadWrite+0x13e
fffff880`1af96da0 fffff880`01f63faa : fffff980`098a8c60 00000001`94c3f000 fffff980`098a8c60 fffffa80`0dd66e00 : nt!IovCallDriver+0x3e6
fffff880`1af96df0 fffff880`01f64236 : fffff980`098a8c60 00000001`94c3f000 fffff980`098a8c60 fffff803`07ebdc51 : fvevol!FveReadWrite+0x3e
fffff880`1af96e30 fffff803`07eb4d66 : fffffa80`06227040 00000000`00000002 fffffa80`0d9eee40 fffff880`01c18329 : fvevol!FveFilterRundownReadWrite+0x1b6
fffff880`1af96e80 fffff880`01c01af2 : fffffa80`06228190 fffffa80`06228040 00000000`00000002 fffffa80`0d9eee40 : nt!IovCallDriver+0x3e6
fffff880`1af96ed0 fffff803`07eb4d66 : fffff980`098a8c60 00000000`00000002 00000000`00000030 00000000`00000000 : volsnap!VolSnapReadFilter+0x112
fffff880`1af96f00 fffff880`01817b69 : fffff880`1af79828 fffff880`1af798f0 fffffa80`0c543080 fffffa80`045faf40 : nt!IovCallDriver+0x3e6
fffff880`1af96f50 fffff803`078c8b67 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000001 : Ntfs!NtfsStorageDriverCallout+0x16
fffff880`1af96f80 fffff803`078c8b2d : 00000000`00000000 00000000`00000000 00000000`00000002 fffff803`07923ad8 : nt!KxSwitchKernelStackCallout+0x27 (TrapFrame @ fffff880`1af96e40)
fffff880`1af79690 fffff803`07923ad8 : 00000000`00000006 00000000`00000000 00000000`00000006 00000000`00000000 : nt!KiSwitchKernelStackContinue
fffff880`1af796b0 fffff803`07926405 : fffff880`01817b54 fffff880`1af79828 00000000`00055000 00000000`00000001 : nt!KeExpandKernelStackAndCalloutInternal+0x218
fffff880`1af797b0 fffff880`01817aac : fffff880`1af79c60 fffff980`098a8c60 fffff8a0`01299b90 00000000`0005d000 : nt!KeExpandKernelStackAndCalloutEx+0x25
fffff880`1af797f0 fffff880`0181646a : fffff880`1af798a0 fffff880`1af79c60 fffff8a0`01299b90 fffff880`018770b0 : Ntfs!NtfsMultipleAsync+0xac
fffff880`1af79860 fffff880`01825b26 : 00000000`00000000 fffff880`1af79df0 fffff8a0`01299be8 fffff8a0`01299b90 : Ntfs!NtfsNonCachedIo+0x26a
fffff880`1af79a70 fffff880`0182742b : fffff880`1af79c60 fffff980`098a8c60 fffffa80`0c169701 00000000`00000001 : Ntfs!NtfsCommonRead+0x896
fffff880`1af79c30 fffff803`07eb4d66 : fffff980`098a8c60 fffff980`098a8c60 00000000`00000002 fffffa80`0dd800c0 : Ntfs!NtfsFsdRead+0x1db
fffff880`1af79e70 fffff880`013924ee : fffffa80`0fb303f0 fffff880`1af79f00 fffff980`098a8c60 fffffa80`0dd800c0 : nt!IovCallDriver+0x3e6
fffff880`1af79ec0 fffff880`013900b6 : fffffa80`0538bb00 00000000`00000002 fffff980`098a8c60 fffffa80`0d579e28 : FLTMGR!FltpLegacyProcessingAfterPreCallbacksCompleted+0x25e
fffff880`1af79f60 fffff803`07eb4d66 : fffff980`098a8c60 00000000`00000002 fffffa80`0d5f47b0 fffffa80`0398e598 : FLTMGR!FltpDispatch+0xb6
fffff880`1af79fc0 fffff803`0794b43e : fffff980`098a8c60 00000000`00000000 fffffa80`0d5f4878 fffffa80`0d579d80 : nt!IovCallDriver+0x3e6
fffff880`1af7a010 fffff803`0794fa61 : fffffa80`0d5f47b0 fffffa80`0d5f47d8 fffffa80`0d5f4818 fffffa80`0d5f4808 : nt!IoPageRead+0x21e
fffff880`1af7a060 fffff803`0794ab20 : 00000000`00000002 fffff880`1af7a0d0 fffffa80`0c543080 fffffa80`0d5f47b0 : nt!MiIssueHardFaultIO+0xc9
fffff880`1af7a0a0 fffff803`07908d8f : fffffa80`0c543080 fffff803`07bc5f40 00000000`c0033333 00000000`00000000 : nt!MiIssueHardFault+0x170
fffff880`1af7a130 fffff803`0792369b : 00000000`00000000 fffffa80`0c543080 8de00001`15efb900 00000000`00000000 : nt!MmAccessFault+0x81f
fffff880`1af7a270 fffff803`07939a37 : 00000000`00000000 00000000`00051000 00000000`00055000 fffff880`1af7a438 : nt!MmCheckCachedPageStates+0x8db
fffff880`1af7a400 fffff803`07938e62 : fffffa80`0c16b010 0000003c`a4737c90 fffff880`1af7a550 fffff880`00000000 : nt!CcMapAndCopyInToCache+0x397
fffff880`1af7a4f0 fffff880`018d4070 : 00000000`000551b8 fffffa80`0538bb00 fffffa80`0c5acbe8 fffffa80`0c1697b0 : nt!CcCopyWriteEx+0x1b2
fffff880`1af7a590 fffff880`01394415 : 00000000`00000000 00000000`000002b0 fffff880`1af7a9b0 00000000`00000000 : Ntfs!NtfsCopyWriteA+0x290
fffff880`1af7a7f0 fffff880`01394b53 : fffff880`1af7a8e0 00000000`00000000 fffffa80`0c169700 fffffa80`0c5aca90 : FLTMGR!FltpPerformFastIoCall+0x155
fffff880`1af7a850 fffff880`013bcabd : 00000000`000000d8 00000000`00000000 00000000`00000000 00000000`00000000 : FLTMGR!FltpPassThroughFastIo+0xc3
fffff880`1af7a8b0 fffff803`07caa249 : fffffa80`0c1697b0 00000000`00000000 00000000`00000000 00000000`00000000 : FLTMGR!FltpFastIoWrite+0x19d
fffff880`1af7a960 fffff803`078cd453 : fffffa80`0c543001 00000000`00000000 00000000`00000000 00000000`00000000 : nt!NtWriteFile+0x5b8
fffff880`1af7aa90 000007f8`a88c2c6a : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KiSystemServiceCopyEnd+0x13 (TrapFrame @ fffff880`1af7ab00)
0000003c`a744fa28 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : 0x000007f8`a88c2c6a

As we can see, we have many different file system related routines (Ntfs, FLTMGR, etc). Why? Well, as we move up the stack, we eventually see four intmsd.sys calls. This is the IntelliMemory Storage Filter Driver from Condusiv Technologies.

Now, let's go ahead and refer to the 2nd/3rd parameters of the bug check, as I noted above!

If we run an !irp on the 3rd parameter:

[ 3, 0] 10 e0 fffffa8006219060 00000000 fffff880012ccbc8-fffffa8006219060 Success Error Cancel
   \Driver\intmsd intmsd
   Args: 00008000 00000000 1cec3f000 00000000
   >[ 3, 0] 10 e1 fffffa8006219060 00000000 fffff880012183c0-00000000 Success Error Cancel pending
   \Driver\intmsd partmgr!PmIoCompletion

If we run a !devobj on the 2nd parameter:

0: kd> !devobj fffffa8006219060
   Device object (fffffa8006219060) is for:
   intmsd0 \Driver\intmsd DriverObject fffffa80049241b0
   Current Irp 00000000 RefCount 0 Type 00000007 Flags 00000850
   Vpb fffffa800534d8f0 Dacl fffff9a10052d360 DevExt fffffa80062191b0 DevObjExt fffffa8006219930 Dope fffffa800534d880
   ExtensionFlags (0x80000800) DOE_DEFAULT_SD_PRESENT, DOE_DESIGNATED_FDO
   Characteristics (0x00000100) FILE_DEVICE_SECURE_OPEN
   AttachedDevice (Upper) fffffa8006219b10 \Driver\partmgr
   AttachedTo (Lower) fffffa8006218060 \Driver\disk
   Device queue is not busy.

That's why we're seeing so many file system and storage related routines being called. After this was found, I recommended disabling and/or preferably uninstalling IntelliMemory. After uninstalling IntelliMemory, the crashes ceased. Why?

First off, IntelliMemory is an intelligent data caching technology that provides faster access to frequently used files. IntelliMemory is supposed to improve latency and throughput by reducing disk I/O requests, as active files are predictively cached within the server to preempt round trips between VMs and network storage.

Remember how we saw various network related routines, etc, during the 0x133 debugging? Well, it's because IntelliMemory was the driver that was causing the loop.

Thanks for reading!

Friday, April 4, 2014

600 answers on MS Answers

Today I hit 600 answers on MS Answers, neat! We're closing in on the big 1000 : )