Sunday, June 23, 2013

0x124: WHEA_UNCORRECTABLE_ERROR

Ah, good ol' 0x124. To most, this is the 'Oh gosh, my hardware!!! It's dying!!!' bugcheck, and unfortunately that's generally true. However, there are some really neat ways to debug 0x124 dumps that can help you figure things out faster!

Let's start with our favorite thing, the dump:

Disclaimer: a single 0x124 dump is rarely much to go on, so you need multiple dumps to have any real chance of troubleshooting successfully. For example, one 0x124 dump can report one error, and the next could report something completely different (still hardware related, of course, but not necessarily CPU related). It's important to have multiple dumps to truly figure out whether or not the CPU itself is at fault.

WHEA_UNCORRECTABLE_ERROR (124)
A fatal hardware error has occurred. Parameter 1 identifies the type of error
source that reported the error. Parameter 2 holds the address of the
WHEA_ERROR_RECORD structure that describes the error conditon.
Arguments:
Arg1: 0000000000000000, Machine Check Exception
Arg2: fffffa800ddde028, Address of the WHEA_ERROR_RECORD structure.
Arg3: 00000000b6004000, High order 32-bits of the MCi_STATUS value.
Arg4: 00000000e6000175, Low order 32-bits of the MCi_STATUS value.

Debugging Details:
------------------


BUGCHECK_STR:  0x124_AuthenticAMD

CUSTOMER_CRASH_COUNT:  1

DEFAULT_BUCKET_ID:  WIN7_DRIVER_FAULT

PROCESS_NAME:  WebKit2WebProc

CURRENT_IRQL:  f

STACK_TEXT: 
fffff880`03297b08 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KeBugCheckEx


STACK_COMMAND:  kb

FOLLOWUP_NAME:  MachineOwner

MODULE_NAME: AuthenticAMD

IMAGE_NAME:  AuthenticAMD

DEBUG_FLR_IMAGE_TIMESTAMP:  0

FAILURE_BUCKET_ID:  X64_0x124_AuthenticAMD_PROCESSOR_CACHE

BUCKET_ID:  X64_0x124_AuthenticAMD_PROCESSOR_CACHE

Followup: MachineOwner
---------
Alright, cool, right off the bat we are thankfully greeted with fairly respectable instructions. They tell us that parameter 1 identifies the type of error source that reported the error. In this dump, that is 'Machine Check Exception'.

What is a Machine Check Exception (otherwise known as an MCE), you may ask? Well, it's not as hard to describe as the name makes it sound. It simply means that the computer's CPU has detected a hardware problem and reported it to the operating system.


Moving on, the instructions also say that parameter 2 holds the address of the WHEA_ERROR_RECORD structure that describes the error condition. In this dump, the WHEA_ERROR_RECORD structure address is: fffffa800ddde028.
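
While we're looking at the arguments: Arg3 and Arg4 are the two halves of one 64-bit MCi_STATUS value, and it can be handy to stitch them back together so you can compare them with the Status line that !errrec prints further down. Here is a minimal Python sketch of that arithmetic. The function and constant names are my own, purely for illustration; they are not part of WinDbg.

```python
# Arg3 and Arg4 copied from the bugcheck output above.
ARG3_HIGH = 0x00000000B6004000  # high-order 32 bits of MCi_STATUS
ARG4_LOW = 0x00000000E6000175   # low-order 32 bits of MCi_STATUS


def combine_mci_status(high: int, low: int) -> int:
    """Join the two 32-bit halves into the full 64-bit MCi_STATUS value."""
    return ((high & 0xFFFFFFFF) << 32) | (low & 0xFFFFFFFF)


status = combine_mci_status(ARG3_HIGH, ARG4_LOW)
print(f"MCi_STATUS = 0x{status:016x}")  # MCi_STATUS = 0xb6004000e6000175
```

That result is the same value !errrec reports as Status in the x86/x64 MCA section of this dump.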

So, with these handy instructions that we now understand, let's go ahead and run an !errrec (dumps a specific WHEA error record) on the WHEA_ERROR_RECORD structure address, which in our case is fffffa800ddde028!

!errrec fffffa800ddde028

We are then presented with:

 5: kd> !errrec fffffa800ddde028
===============================================================================
Common Platform Error Record @ fffffa800ddde028
-------------------------------------------------------------------------------
Record Id     : 01ce686c947ffec6
Severity      : Fatal (1)
Length        : 928
Creator       : Microsoft
Notify Type   : Machine Check Exception
Timestamp     : 6/13/2013 19:40:22 (UTC)
Flags         : 0x00000000

===============================================================================
Section 0     : Processor Generic
-------------------------------------------------------------------------------
Descriptor    @ fffffa800ddde0a8
Section       @ fffffa800ddde180
Offset        : 344
Length        : 192
Flags         : 0x00000001 Primary
Severity      : Fatal

Proc. Type    : x86/x64
Instr. Set    : x64
Error Type    : Cache error
Operation     : Generic
Flags         : 0x00
Level         : 1
CPU Version   : 0x0000000000100fa0
Processor ID  : 0x0000000000000005

===============================================================================
Section 1     : x86/x64 Processor Specific
-------------------------------------------------------------------------------
Descriptor    @ fffffa800ddde0f0
Section       @ fffffa800ddde240
Offset        : 536
Length        : 128
Flags         : 0x00000000
Severity      : Fatal

Local APIC Id : 0x0000000000000005
CPU Id        : a0 0f 10 00 00 08 06 05 - 09 20 80 00 ff fb 8b 17
                00 00 00 00 00 00 00 00 - 00 00 00 00 00 00 00 00
                00 00 00 00 00 00 00 00 - 00 00 00 00 00 00 00 00

Proc. Info 0  @ fffffa800ddde240

===============================================================================
Section 2     : x86/x64 MCA
-------------------------------------------------------------------------------
Descriptor    @ fffffa800ddde138
Section       @ fffffa800ddde2c0
Offset        : 664
Length        : 264
Flags         : 0x00000000
Severity      : Fatal

Error         : DCACHEL1_EVICT_ERR (Proc 5 Bank 0)
  Status      : 0xb6004000e6000175
  Address     : 0x0000000000000700
  Misc.       : 0x0000000000000000

As you can see, we have a Cache Error in this specific dump. If you look at Section 2 of the !errrec report, the specific error is 'DCACHEL1_EVICT_ERR (Proc 5 Bank 0)'. Simply put, this means:

DCACHEL1_EVICT_ERR (Proc 5 Bank 0)

- This means the CPU hit an error while evicting a line from the L1 data cache, i.e. while reading that data back out of the L1 cache.

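What does that mean? L1 cache = Level 1 cache, otherwise known as the primary cache. It's the small, fast memory closest to each CPU core, used for temporary storage of instructions and data organized in fixed-size blocks called cache lines (64 bytes on most modern x86 CPUs; the exact size depends on the processor).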
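
If you're wondering where the debugger gets a name like DCACHEL1_EVICT_ERR from, it comes out of the low 16 bits of the Status value. The sketch below assumes the cache hierarchy error layout from the Intel and AMD machine check documentation (0000 0001 RRRR TTLL); the tables and function name are my own, and it deliberately only handles that one error class.

```python
# Field encodings for cache hierarchy errors: 0000 0001 RRRR TTLL
REQUEST_TYPES = {
    0b0000: "ERR",       # generic error
    0b0001: "RD",        # generic read
    0b0010: "WR",        # generic write
    0b0011: "DRD",       # data read
    0b0100: "DWR",       # data write
    0b0101: "IRD",       # instruction fetch
    0b0110: "PREFETCH",
    0b0111: "EVICT",
    0b1000: "SNOOP",
}
TRANSACTION_TYPES = {0b00: "I", 0b01: "D", 0b10: "G"}  # instruction, data, generic
CACHE_LEVELS = {0b00: "L0", 0b01: "L1", 0b10: "L2", 0b11: "LG"}


def decode_cache_error(status: int):
    """Return (request, transaction, level), or None if not a cache hierarchy error."""
    code = status & 0xFFFF            # MCA error code lives in bits 15:0
    if (code >> 8) != 0x01:           # upper byte must be 0000 0001
        return None
    request = REQUEST_TYPES.get((code >> 4) & 0xF, "?")
    transaction = TRANSACTION_TYPES.get((code >> 2) & 0x3, "?")
    level = CACHE_LEVELS.get(code & 0x3, "?")
    return request, transaction, level


# Status value from the !errrec output above.
print(decode_cache_error(0xB6004000E6000175))  # ('EVICT', 'D', 'L1')
```

EVICT, D and L1 line up with what the debugger printed: a data cache, level 1, eviction error.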

Now that we have this info, let's take a look at another 0x124 dump from the same system:

**Rather than pasting the entire dump, I am just going to show the output of running !errrec on the WHEA_ERROR_RECORD structure address**

 4: kd> !errrec fffffa800ec8e838
===============================================================================
Common Platform Error Record @ fffffa800ec8e838
-------------------------------------------------------------------------------
Record Id     : 01ce686c947ffec5
Severity      : Fatal (1)
Length        : 928
Creator       : Microsoft
Notify Type   : Machine Check Exception
Timestamp     : 6/13/2013 19:31:21 (UTC)
Flags         : 0x00000002 PreviousError

===============================================================================
Section 0     : Processor Generic
-------------------------------------------------------------------------------
Descriptor    @ fffffa800ec8e8b8
Section       @ fffffa800ec8e990
Offset        : 344
Length        : 192
Flags         : 0x00000001 Primary
Severity      : Fatal

Proc. Type    : x86/x64
Instr. Set    : x64
Error Type    : Cache error
Operation     : Data Write
Flags         : 0x00
Level         : 1
CPU Version   : 0x0000000000100fa0
Processor ID  : 0x0000000000000003

===============================================================================
Section 1     : x86/x64 Processor Specific
-------------------------------------------------------------------------------
Descriptor    @ fffffa800ec8e900
Section       @ fffffa800ec8ea50
Offset        : 536
Length        : 128
Flags         : 0x00000000
Severity      : Fatal

Local APIC Id : 0x0000000000000003
CPU Id        : a0 0f 10 00 00 08 06 03 - 09 20 80 00 ff fb 8b 17
                00 00 00 00 00 00 00 00 - 00 00 00 00 00 00 00 00
                00 00 00 00 00 00 00 00 - 00 00 00 00 00 00 00 00

Proc. Info 0  @ fffffa800ec8ea50

===============================================================================
Section 2     : x86/x64 MCA
-------------------------------------------------------------------------------
Descriptor    @ fffffa800ec8e948
Section       @ fffffa800ec8ead0
Offset        : 664
Length        : 264
Flags         : 0x00000000
Severity      : Fatal

Error         : DCACHEL1_DWR_ERR (Proc 3 Bank 0)
  Status      : 0xf614c00000000145
  Address     : 0x000000043679f000
  Misc.       : 0x0000000000000000

As we can see, this one is also reporting a Cache Error. If you look at Section 2 of the !errrec report, the specific error is 'DCACHEL1_DWR_ERR (Proc 3 Bank 0)'. Simply put, this means:


DCACHEL1_DWR_ERR (Proc 3 Bank 0)

- This means it could not write data to L1 cache.
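
The top bits of each Status value are worth a glance too, since they say how serious the error was. Below is a small sketch that lists which flags are set in the two Status values from these dumps. It assumes the standard architectural MCi_STATUS flag positions (bits 63 down to 57); the table and helper name are mine, not something the debugger provides.

```python
# Architectural MCi_STATUS flag bits (bit number, short name, meaning).
STATUS_FLAGS = [
    (63, "VAL", "register holds valid error information"),
    (62, "OVER", "another error arrived while this one was still logged"),
    (61, "UC", "error was not corrected by hardware"),
    (60, "EN", "reporting for this error was enabled"),
    (59, "MISCV", "Misc register holds valid data"),
    (58, "ADDRV", "Address register holds a valid address"),
    (57, "PCC", "processor context may be corrupt"),
]


def set_flags(status: int) -> list:
    """Return the short names of all flags set in a 64-bit MCi_STATUS value."""
    return [name for bit, name, _meaning in STATUS_FLAGS if (status >> bit) & 1]


# Status values from the two !errrec outputs above.
for status in (0xB6004000E6000175, 0xF614C00000000145):
    print(f"0x{status:016x}: {', '.join(set_flags(status))}")
```

Both values come back uncorrected (UC) with possibly corrupt processor context (PCC), which fits the 'Fatal' severity in the reports, and the second one also has the overflow bit (OVER) set.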

Now we have two dumps showing read & write errors from the L1 cache, on two different cores (Proc 5 and Proc 3). Are two dumps enough to go on? I would say no; however, an error like this is a big red flag for a faulty CPU. In this specific situation, the rest of the user's dumps were all read & write errors from the L1 cache, so it was more than likely a faulty CPU.
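
When you have a whole pile of dumps, it helps to tally the Error lines rather than eyeball them. Here's a rough sketch of how that could look, assuming you've copied the 'Error' line out of each !errrec report by hand. The regex is only written against the two lines shown in this post, and the helper name is hypothetical.

```python
import re
from collections import Counter

# 'Error' lines copied from the two !errrec reports above.
ERROR_LINES = [
    "Error         : DCACHEL1_EVICT_ERR (Proc 5 Bank 0)",
    "Error         : DCACHEL1_DWR_ERR (Proc 3 Bank 0)",
]

ERROR_RE = re.compile(r"Error\s*:\s*(\w+)\s*\(Proc (\d+) Bank (\d+)\)")


def tally(lines):
    """Count errors by hardware unit (name prefix), by processor, and by bank."""
    by_unit, by_proc, by_bank = Counter(), Counter(), Counter()
    for line in lines:
        match = ERROR_RE.search(line)
        if not match:
            continue  # skip anything that isn't an 'Error' line
        name, proc, bank = match.group(1), int(match.group(2)), int(match.group(3))
        by_unit[name.split("_")[0]] += 1  # e.g. 'DCACHEL1'
        by_proc[proc] += 1
        by_bank[bank] += 1
    return by_unit, by_proc, by_bank


by_unit, by_proc, by_bank = tally(ERROR_LINES)
print("by unit:", dict(by_unit))
print("by proc:", dict(by_proc))
print("by bank:", dict(by_bank))
```

If the same unit (here DCACHEL1) keeps coming up across dumps and across different cores, that's the kind of pattern that points at the CPU rather than at something else.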
