Debugging and reverse engineering: analyzing

Sunday, June 23, 2013

0x124: WHEA_UNCORRECTABLE_ERROR

Ah, good ol' 0x124. To most, this is the 'Oh gosh, my hardware!!! It's dying!!!' bugcheck, and generally that's unfortunately true. However, there are some really neat ways to debug 0x124 dumps to help you hopefully figure things out faster!

Let's start with our favorite thing, the dump:

Disclaimer: troubleshooting a 0x124 bugcheck properly takes multiple dumps, because a single dump doesn't give you much to go on. For example, one 0x124 dump can report one error, and the next can report something completely different (still hardware related of course, but not CPU related). It's important to have multiple dumps to truly figure out whether or not the CPU itself is at fault.

WHEA_UNCORRECTABLE_ERROR (124)
A fatal hardware error has occurred. Parameter 1 identifies the type of error
source that reported the error. Parameter 2 holds the address of the
WHEA_ERROR_RECORD structure that describes the error conditon.
Arguments:
Arg1: 0000000000000000, Machine Check Exception
Arg2: fffffa800ddde028, Address of the WHEA_ERROR_RECORD structure.
Arg3: 00000000b6004000, High order 32-bits of the MCi_STATUS value.
Arg4: 00000000e6000175, Low order 32-bits of the MCi_STATUS value.

Debugging Details:
------------------

BUGCHECK_STR: 0x124_AuthenticAMD

CUSTOMER_CRASH_COUNT: 1

DEFAULT_BUCKET_ID: WIN7_DRIVER_FAULT

PROCESS_NAME: WebKit2WebProc

CURRENT_IRQL: f

STACK_TEXT:
fffff880`03297b08 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KeBugCheckEx

STACK_COMMAND: kb

FOLLOWUP_NAME: MachineOwner

MODULE_NAME: AuthenticAMD

IMAGE_NAME: AuthenticAMD

DEBUG_FLR_IMAGE_TIMESTAMP: 0

FAILURE_BUCKET_ID: X64_0x124_AuthenticAMD_PROCESSOR_CACHE

BUCKET_ID: X64_0x124_AuthenticAMD_PROCESSOR_CACHE

Followup: MachineOwner
---------

Alright, cool, right off the bat we are thankfully greeted with fairly respectable instructions. It tells us that parameter 1 contains and identifies the type of error source that reported the error. Now, in this dump, that would be 'Machine Check Exception'.

What is a Machine Check Exception (otherwise known as an MCE), you may ask? Well, it's not as hard to describe as the name makes it sound. It simply means that the computer's CPU detected a hardware problem and reported it to the operating system.

Moving on, you can see it says that parameter 2 holds the address of the WHEA_ERROR_RECORD structure that describes the error condition. In this dump, the WHEA_ERROR_RECORD structure address is: fffffa800ddde028.
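Parameters 3 and 4 are worth a quick look too: they are the high and low halves of the 64-bit MCi_STATUS value. Here is a minimal Python sketch (my own illustration, not something WinDbg runs) that glues the two halves together and lists which of the top status flags are set. The flag positions follow the architectural machine check layout; the function names combine_status and set_flags are made up for this example.

```python
# Sketch: rebuild the 64-bit MCi_STATUS value from bugcheck Arg3/Arg4.
# combine_status and set_flags are hypothetical helper names.

ARG3_HIGH = 0x00000000B6004000  # Arg3 from the dump: high-order 32 bits
ARG4_LOW = 0x00000000E6000175   # Arg4 from the dump: low-order 32 bits

# Top flag bits of MCi_STATUS (bit position -> short name).
STATUS_FLAGS = {
    63: "VAL",    # the register holds a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # the error was not corrected
    60: "EN",     # reporting of this error was enabled
    59: "MISCV",  # MCi_MISC holds valid extra information
    58: "ADDRV",  # MCi_ADDR holds a valid address
    57: "PCC",    # processor context may be corrupt
}


def combine_status(high: int, low: int) -> int:
    """Join the two 32-bit halves into one 64-bit status value."""
    return ((high & 0xFFFFFFFF) << 32) | (low & 0xFFFFFFFF)


def set_flags(status: int) -> list:
    """Return the names of the flag bits that are set, highest bit first."""
    return [name for bit, name in STATUS_FLAGS.items() if (status >> bit) & 1]


status = combine_status(ARG3_HIGH, ARG4_LOW)
print(hex(status))        # 0xb6004000e6000175
print(set_flags(status))  # ['VAL', 'UC', 'EN', 'ADDRV', 'PCC']
```

For this dump the UC and PCC flags are both set, which fits the 'Fatal' severity: the error was uncorrected and the processor state could not be trusted.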

So, with these handy instructions that we now understand, let's go ahead and run an !errrec (dumps a specific WHEA error record) on the WHEA_ERROR_RECORD structure address, which in our case is fffffa800ddde028!

!errrec fffffa800ddde028

We are then presented with:

5: kd> !errrec fffffa800ddde028
===============================================================================
Common Platform Error Record @ fffffa800ddde028
-------------------------------------------------------------------------------
Record Id     : 01ce686c947ffec6
Severity      : Fatal (1)
Length        : 928
Creator       : Microsoft
Notify Type   : Machine Check Exception
Timestamp     : 6/13/2013 19:40:22 (UTC)
Flags         : 0x00000000

===============================================================================
Section 0     : Processor Generic
-------------------------------------------------------------------------------
Descriptor    @ fffffa800ddde0a8
Section       @ fffffa800ddde180
Offset        : 344
Length        : 192
Flags         : 0x00000001 Primary
Severity      : Fatal

Proc. Type    : x86/x64
Instr. Set    : x64
Error Type    : Cache error
Operation     : Generic
Flags         : 0x00
Level         : 1
CPU Version   : 0x0000000000100fa0
Processor ID : 0x0000000000000005

===============================================================================
Section 1     : x86/x64 Processor Specific
-------------------------------------------------------------------------------
Descriptor    @ fffffa800ddde0f0
Section       @ fffffa800ddde240
Offset        : 536
Length        : 128
Flags         : 0x00000000
Severity      : Fatal

Local APIC Id : 0x0000000000000005
CPU Id        : a0 0f 10 00 00 08 06 05 - 09 20 80 00 ff fb 8b 17
                00 00 00 00 00 00 00 00 - 00 00 00 00 00 00 00 00
                00 00 00 00 00 00 00 00 - 00 00 00 00 00 00 00 00

Proc. Info 0 @ fffffa800ddde240

===============================================================================
Section 2     : x86/x64 MCA
-------------------------------------------------------------------------------
Descriptor    @ fffffa800ddde138
Section       @ fffffa800ddde2c0
Offset        : 664
Length        : 264
Flags         : 0x00000000
Severity      : Fatal

Error         : DCACHEL1_EVICT_ERR (Proc 5 Bank 0)
Status      : 0xb6004000e6000175
Address     : 0x0000000000000700
Misc.       : 0x0000000000000000

As you can see, we have a Cache Error in this specific dump. If you see Section 2 of the !errrec report, we can see that the error specifically is 'DCACHEL1_EVICT_ERR (Proc 5 Bank 0)'. Simply put, this means:

DCACHEL1_EVICT_ERR (Proc 5 Bank 0)

- This means an error was detected while a line of data was being evicted from (written back out of) the L1 data cache.

What does that mean? L1 Cache = Level 1 Cache, otherwise known as the primary cache. It's the small, fast memory closest to each CPU core, used for temporary storage of instructions and data in fixed-size blocks called cache lines (64 bytes each on most modern x86 CPUs).
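If you're curious where the error name comes from, it can be read straight out of the low 16 bits of the Status value. For cache errors those bits use the pattern 0000 0001 RRRR TTLL (request type, transaction type, cache level). The following is a rough Python sketch under that assumption; decode_cache_error is a hypothetical helper, and the output string just mimics the naming that !errrec prints.

```python
from typing import Optional

# Sketch: decode the low 16 bits of MCi_STATUS when they follow the
# memory-hierarchy pattern 0000 0001 RRRR TTLL.

REQUEST = {0: "ERR", 1: "RD", 2: "WR", 3: "DRD", 4: "DWR",
           5: "IRD", 6: "PREFETCH", 7: "EVICT", 8: "SNOOP"}
TRANSACTION = {0: "I", 1: "D", 2: "G"}      # instruction, data, generic
LEVEL = {0: "L0", 1: "L1", 2: "L2", 3: "LG"}  # LG = generic level


def decode_cache_error(status: int) -> Optional[str]:
    """Return a name such as 'DCACHEL1_EVICT_ERR', or None if the
    error code is not in the memory-hierarchy format."""
    code = status & 0xFFFF
    if code >> 8 != 0x01:
        return None
    request = REQUEST.get((code >> 4) & 0xF, "UNKNOWN")
    transaction = TRANSACTION.get((code >> 2) & 0x3, "?")
    level = LEVEL[code & 0x3]
    return f"{transaction}CACHE{level}_{request}_ERR"


# Status value taken from Section 2 of the !errrec output above.
print(decode_cache_error(0xB6004000E6000175))  # DCACHEL1_EVICT_ERR
```

The low 16 bits here are 0x0175, so RRRR = 0111 (EVICT), TT = 01 (data) and LL = 01 (level 1), which lines up with what !errrec reported.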

Now that we have this info, let's take a look at another 0x124 dump from the same system:

**Rather than pasting the entire dump, I am just going to show the output of running !errrec on the WHEA_ERROR_RECORD structure address**

4: kd> !errrec fffffa800ec8e838
===============================================================================
Common Platform Error Record @ fffffa800ec8e838
-------------------------------------------------------------------------------
Record Id     : 01ce686c947ffec5
Severity      : Fatal (1)
Length        : 928
Creator       : Microsoft
Notify Type   : Machine Check Exception
Timestamp     : 6/13/2013 19:31:21 (UTC)
Flags         : 0x00000002 PreviousError

===============================================================================
Section 0     : Processor Generic
-------------------------------------------------------------------------------
Descriptor    @ fffffa800ec8e8b8
Section       @ fffffa800ec8e990
Offset        : 344
Length        : 192
Flags         : 0x00000001 Primary
Severity      : Fatal

Proc. Type    : x86/x64
Instr. Set    : x64
Error Type    : Cache error
Operation     : Data Write
Flags         : 0x00
Level         : 1
CPU Version   : 0x0000000000100fa0
Processor ID : 0x0000000000000003

===============================================================================
Section 1     : x86/x64 Processor Specific
-------------------------------------------------------------------------------
Descriptor    @ fffffa800ec8e900
Section       @ fffffa800ec8ea50
Offset        : 536
Length        : 128
Flags         : 0x00000000
Severity      : Fatal

Local APIC Id : 0x0000000000000003
CPU Id        : a0 0f 10 00 00 08 06 03 - 09 20 80 00 ff fb 8b 17
                00 00 00 00 00 00 00 00 - 00 00 00 00 00 00 00 00
                00 00 00 00 00 00 00 00 - 00 00 00 00 00 00 00 00

Proc. Info 0 @ fffffa800ec8ea50

===============================================================================
Section 2     : x86/x64 MCA
-------------------------------------------------------------------------------
Descriptor    @ fffffa800ec8e948
Section       @ fffffa800ec8ead0
Offset        : 664
Length        : 264
Flags         : 0x00000000
Severity      : Fatal

Error         : DCACHEL1_DWR_ERR (Proc 3 Bank 0)
Status      : 0xf614c00000000145
Address     : 0x000000043679f000
Misc.       : 0x0000000000000000

As we can see, this one is also reporting a Cache Error. If you look at Section 2 of the !errrec report, the specific error is 'DCACHEL1_DWR_ERR (Proc 3 Bank 0)'. Simply put, this means:

DCACHEL1_DWR_ERR (Proc 3 Bank 0)

- This means it could not write data to the L1 cache.

Now we have two dumps showing L1 data cache errors (one on an eviction, one on a data write), on two different cores. Are two dumps enough to go on? I would say no; however, an error like this is a big red flag for a faulty CPU. In this specific situation, the rest of the user's dumps were all L1 cache read & write errors as well, so it was more than likely a faulty CPU.
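When you have a pile of dumps, it helps to tally the 'Error' lines to see whether they keep pointing at the same component. Here's a small Python sketch of that idea, using only the two error lines from the dumps above. The summarize helper and the choice to treat the text before the first underscore as the 'component' are my own assumptions for illustration.

```python
import re
from collections import Counter

# 'Error' lines copied from Section 2 of the two !errrec reports above.
ERROR_LINES = [
    "DCACHEL1_EVICT_ERR (Proc 5 Bank 0)",
    "DCACHEL1_DWR_ERR (Proc 3 Bank 0)",
]

PATTERN = re.compile(r"(?P<name>\w+) \(Proc (?P<proc>\d+) Bank (?P<bank>\d+)\)")


def summarize(lines):
    """Count errors per component and collect the processors involved."""
    components = Counter()
    procs = set()
    for line in lines:
        match = PATTERN.fullmatch(line.strip())
        if match is None:
            continue  # skip anything that is not an 'Error' line
        # Assumption: the text before the first underscore names the component.
        components[match["name"].split("_")[0]] += 1
        procs.add(int(match["proc"]))
    return components, sorted(procs)


components, procs = summarize(ERROR_LINES)
print(components)  # Counter({'DCACHEL1': 2})
print(procs)       # [3, 5]
```

If every dump lands in the same component like this, that consistency is what makes the CPU the prime suspect; a mix of unrelated components would point elsewhere.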

Another 9F example!

I run into 0x9F: DRIVER_POWER_STATE_FAILURE a fair bit. I figured I'd share another example just to show how simple they can be, and how nothing really changes in regards to troubleshooting them if the fault isn't obvious.

Not going to share an entire dump, just quick & easy troubleshooting for a quick & easy bugcheck:

BugCheck 9F, {3, 86404030, 83135ae0, 85b49570}
Probably caused by : pci.sys

As you can see, this specific dump blamed pci.sys. You can pretty much bet your life that this is not the real cause, so let's go ahead and see what else we can find. As you know (and if you don't already, please visit my earlier 9F blog post that goes into detail), to get more details on what specifically caused the crash, you're going to want to locate the address of the blocked IRP. In this specific dump, the address of the blocked IRP is the 4th parameter, which is 85b49570.

Once you have located the blocked IRP address, run !irp on that address. So, for example, for this specific dump we would run:

!irp 85b49570

Now we get the following:

0: kd> !irp 85b49570
Irp is active with 4 stacks 3 is current (= 0x85b49628)
No Mdl: No System Buffer: Thread 00000000: Irp stack trace.
cmd flg cl Device File Completion-Context
[ 0, 0] 0 0 00000000 00000000 00000000-00000000
Args: 00000000 00000000 00000000 00000000
[ 0, 0] 0 0 00000000 00000000 00000000-00000000
Args: 00000000 00000000 00000000 00000000
>[ 16, 2] 0 e1 8674c028 00000000 83331d5d-885bc728 Success Error Cancel pending
Unable to load image \SystemRoot\system32\DRIVERS\Rt86win7.sys, Win32 error 0n2
*** WARNING: Unable to verify timestamp for Rt86win7.sys
*** ERROR: Module load completed but symbols could not be loaded for Rt86win7.sys
\Driver\RTL8167 nt!PopSystemIrpCompletion

The important part here is Rt86win7.sys (\Driver\RTL8167): it was the driver holding the IRP at the time, which makes it the true fault. The > marks the IRP's current stack location, meaning the driver listed under it is the one that was handling the IRP at the time of the crash.
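If you save !irp output to a text file, picking out that driver can be scripted. Below is a minimal Python sketch, assuming the layout shown above: it finds the line starting with > and then returns the first \Driver\ name that appears before the next stack location. The function name current_driver is hypothetical.

```python
import re
from typing import Optional

# !irp output copied from the dump above (raw string keeps the backslashes).
IRP_OUTPUT = r"""
Irp is active with 4 stacks 3 is current (= 0x85b49628)
No Mdl: No System Buffer: Thread 00000000: Irp stack trace.
cmd flg cl Device File Completion-Context
[ 0, 0] 0 0 00000000 00000000 00000000-00000000
Args: 00000000 00000000 00000000 00000000
[ 0, 0] 0 0 00000000 00000000 00000000-00000000
Args: 00000000 00000000 00000000 00000000
>[ 16, 2] 0 e1 8674c028 00000000 83331d5d-885bc728 Success Error Cancel pending
Unable to load image \SystemRoot\system32\DRIVERS\Rt86win7.sys, Win32 error 0n2
*** WARNING: Unable to verify timestamp for Rt86win7.sys
*** ERROR: Module load completed but symbols could not be loaded for Rt86win7.sys
\Driver\RTL8167 nt!PopSystemIrpCompletion
"""


def current_driver(irp_output: str) -> Optional[str]:
    """Return the driver name listed under the '>' stack location."""
    lines = irp_output.splitlines()
    for index, line in enumerate(lines):
        if not line.lstrip().startswith(">"):
            continue
        for follow in lines[index + 1:]:
            if follow.lstrip().startswith("["):
                break  # reached the next stack location
            match = re.search(r"\\Driver\\(\S+)", follow)
            if match:
                return match.group(1)
    return None


print(current_driver(IRP_OUTPUT))  # RTL8167
```

The driver object name (RTL8167) is not always the same as the file name (Rt86win7.sys), so you still have to read the surrounding lines to connect the two.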

Rt86win7.sys is a Realtek NIC driver, therefore I instructed the user to visit Realtek's website and update his / her network drivers.

That's about it : )

Thursday, July 12, 2012

0x9F: DRIVER_POWER_STATE_FAILURE

Alright, let's get down to some business. As I said, I am now going to start going through successful analysis posts of mine and posting them here for others to learn from, read up on, etc. Also just for personal reference as well! Some will be very easy and I will just explain what I did, and some will be difficult, etc.

Now, this post is going to be about bugcheck 0x9F: DRIVER_POWER_STATE_FAILURE. 9F bugchecks are personally my favorite as in most cases they're very easy to solve, and I will explain why. In most cases, a 9F will name the culprit driver right in the "Probably caused by" line. In some cases, however, it will blame the wrong driver, or it will just relay the bucket ID for that specific dump.

Before we get into all of that though, here's that basic definition of a 0x9F:

A device driver is in an invalid or inconsistent power state from either shutdown or going into or returning from hibernate or standby modes.

I recently dealt with a case in which a user reported receiving 0x9F BSODs. I opened the dump in WinDbg:

*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

Use !analyze -v to get detailed debugging information.

BugCheck 9F, {3, fffffa800a727060, fffff80004ba93d8, fffffa8006a62010}

*** WARNING: Unable to verify timestamp for asmthub3.sys
*** ERROR: Module load completed but symbols could not be loaded for asmthub3.sys
*** WARNING: Unable to verify timestamp for win32k.sys
*** ERROR: Module load completed but symbols could not be loaded for win32k.sys
Probably caused by : asmthub3.sys

Followup: MachineOwner
---------

0: kd> !analyze -v
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

DRIVER_POWER_STATE_FAILURE (9f)
A driver is causing an inconsistent power state.
Arguments:
Arg1: 0000000000000003, A device object has been blocking an Irp for too long a time
Arg2: fffffa800a727060, Physical Device Object of the stack
Arg3: fffff80004ba93d8, Functional Device Object of the stack
Arg4: fffffa8006a62010, The blocked IRP

Debugging Details:
------------------

DRVPOWERSTATE_SUBCODE: 3

DRIVER_OBJECT: fffffa800670ac10

IMAGE_NAME: asmthub3.sys

DEBUG_FLR_IMAGE_TIMESTAMP: 4eb203d0

MODULE_NAME: asmthub3

FAULTING_MODULE: fffff88005f25000 asmthub3

CUSTOMER_CRASH_COUNT: 1

DEFAULT_BUCKET_ID: VISTA_DRIVER_FAULT

BUGCHECK_STR: 0x9F

PROCESS_NAME: System

CURRENT_IRQL: 2

STACK_TEXT:
fffff800`04ba9388 fffff800`033046c2 : 00000000`0000009f 00000000`00000003 fffffa80`0a727060 fffff800`04ba93d8 : nt!KeBugCheckEx
fffff800`04ba9390 fffff800`032a4e3c : fffff800`04ba94c0 fffff800`04ba94c0 00000000`00000000 00000000`00000002 : nt! ?? ::FNODOBFM::`string'+0x34050
fffff800`04ba9430 fffff800`032a4cd6 : fffff800`03433f70 00000000`0000ea7d 00000000`00000000 00000000`00000000 : nt!KiProcessTimerDpcTable+0x6c
fffff800`04ba94a0 fffff800`032a4bbe : 00000002`2e2dc26d fffff800`04ba9b18 00000000`0000ea7d fffff800`03410228 : nt!KiProcessExpiredTimerList+0xc6
fffff800`04ba9af0 fffff800`032a49a7 : 00000000`bedfc7c2 00000000`0000ea7d 00000000`bedfc78b 00000000`0000007d : nt!KiTimerExpiration+0x1be
fffff800`04ba9b90 fffff800`03291eca : fffff800`0340ce80 fffff800`0341acc0 00000000`00000000 fffff880`00000000 : nt!KiRetireDpcList+0x277
fffff800`04ba9c40 00000000`00000000 : fffff800`04baa000 fffff800`04ba4000 fffff800`04ba9c00 00000000`00000000 : nt!KiIdleLoop+0x5a

STACK_COMMAND: kb

FOLLOWUP_NAME: MachineOwner

FAILURE_BUCKET_ID: X64_0x9F_3_IMAGE_asmthub3.sys

BUCKET_ID: X64_0x9F_3_IMAGE_asmthub3.sys

Followup: MachineOwner
---------

As you can see, this is a fairly straightforward 0x9F. It names the culprit right there in the "Probably caused by" line, which is asmthub3.sys (ASMedia USB 3.0 Hub driver). All that needed to be done was to link the user to the latest chipset / utility drivers on the motherboard page, and the issue was solved after updating the driver.

Now, let's assume that this dump file was not so straightforward. Let's pretend that when we opened it up, rather than the "Probably caused by" line naming the guilty driver, it said for example "usbhub.sys".

What we would do then is take a look at the 4th argument to see if there was a blocked IRP, and then run !irp on the address in the 4th argument.
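That step is mechanical enough to script. Here is a minimal Python sketch that takes the one-line "BugCheck 9F, {...}" summary and builds the !irp command from the 4th argument. It only handles subcode 3 (the blocked IRP case shown in this post), and the name blocked_irp_command is made up for the example.

```python
import re


def blocked_irp_command(bugcheck_line: str) -> str:
    """Build the '!irp <address>' command from a 0x9F summary line."""
    match = re.search(r"BugCheck\s+9F,\s*\{([^}]*)\}", bugcheck_line,
                      re.IGNORECASE)
    if match is None:
        raise ValueError("not a 0x9F bugcheck summary line")
    params = [part.strip() for part in match.group(1).split(",")]
    if len(params) != 4:
        raise ValueError("expected four bugcheck parameters")
    # Assumption: only subcode 3 is handled, where Arg4 is the blocked IRP.
    if int(params[0], 16) != 3:
        raise ValueError("this sketch only handles subcode 3")
    return f"!irp {params[3]}"


# Summary line copied from the dump above.
line = "BugCheck 9F, {3, fffffa800a727060, fffff80004ba93d8, fffffa8006a62010}"
print(blocked_irp_command(line))  # !irp fffffa8006a62010
```

The check on the first argument matters: the meaning of the other arguments depends on the subcode, so the script refuses to guess for anything other than 3.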

So, for example, for this dump: !irp fffffa8006a62010

After running that, we then get the following:

0: kd> !irp fffffa8006a62010
Irp is active with 8 stacks 7 is current (= 0xfffffa8006a62290)
No Mdl: No System Buffer: Thread 00000000: Irp stack trace.
     cmd flg cl Device   File     Completion-Context
[ 0, 0]   0 0 00000000 00000000 00000000-00000000

            Args: 00000000 00000000 00000000 00000000
[ 0, 0]   0 0 00000000 00000000 00000000-00000000

            Args: 00000000 00000000 00000000 00000000
[ 0, 0]   0 0 00000000 00000000 00000000-00000000

            Args: 00000000 00000000 00000000 00000000
[ 0, 0]   0 0 00000000 00000000 00000000-00000000

            Args: 00000000 00000000 00000000 00000000
[ 0, 0]   0 0 00000000 00000000 00000000-00000000

            Args: 00000000 00000000 00000000 00000000
[ 0, 0]   0 0 00000000 00000000 00000000-00000000

            Args: 00000000 00000000 00000000 00000000
>[ 16, 2]   0 e1 fffffa800a740790 00000000 fffff80003285ce0-fffffa800b934340 Success Error Cancel pending
           \Driver\asmthub3    nt!IopUnloadSafeCompletion
            Args: 00016600 00000001 00000004 00000005
[ 0, 0]   0 0 00000000 00000000 00000000-fffffa8006958dc0

            Args: 00000000 00000000 00000000 00000000

As you can see, there's a (>). This symbol marks the IRP's current stack location, which tells you which driver was handling the IRP at the time of the crash, and the asmthub3 driver is listed there. That's what you'd do if the "Probably caused by" line pointed at the wrong driver. However, sometimes you get 0x9Fs that have no blocked IRP and also an incorrect / false fault. In that case, you'd have to go through some other dumps or take a look at the loaded modules list and see if there are any obviously troublesome 3rd party drivers that may have caused it. You can also get 9F bugchecks that deal with locks and such, but we'll get into that at another time.

Generally, 0x9F's are very easy. I learned how to solve most 0x9F's by reading VirGnarus' posts across various BSOD communities.