Friday, June 28, 2013

TechEd 2013 Videos for analysis & debugging!

Vir Gnarus over at Sysnative was kind enough to share some links from TechEd 2013 regarding debugging, analysis, and the like. Here are a few (I will hopefully add more once I get around to watching more myself)!

Sysinternals Primer: TechEd 2013 Edition

Join us for the fourth edition of the popular Sysinternals Primer series with Aaron Margosis, Mark Russinovich’s co-author of The Windows Sysinternals Administrator’s Reference. The Sysinternals utilities are vital tools for any computer professional on the Windows platform. Mark Russinovich's popular "Case Of The Unexplained" demonstrates some of their capabilities in advanced troubleshooting scenarios. This complementary tutorial series focuses primarily on the utilities themselves, deep-diving into as many features as time will allow. This year’s session describes everything that’s new and improved in the Sysinternals tool set since the book was published two years ago. It’s like an early draft of Sysinternals, Second Edition.
Case of the Unexplained 2013: Windows Troubleshooting with Mark Russinovich

Come hear Mark Russinovich, the master of Windows troubleshooting, walk you step-by-step through how he has solved seemingly unsolvable system and application problems on Windows. With all new real case studies, Mark shows how to apply the Microsoft Debugging Tools and his own Sysinternals tools, including Process Explorer and Process Monitor, to solve system crashes, process hangs, security vulnerabilities, DLL conflicts, permissions problems, registry misconfiguration, network hangs, and file system issues. These tools are used on a daily basis by Microsoft Product Support and have been used effectively to solve a wide variety of desktop and server issues, so being familiar with their operation and application will assist you in dealing with different problems on Windows.

Hardcore Debugging (a very, very difficult and in-depth video covering not only kernel debugging, but debugging programs in general).

The title says Hard Core for a reason. If you think it's possible for a 400-level session to be too technical, then I guarantee this one is. But if you don't, or if you just want to have your brain melted anyway, then join a group of Microsoft's greatest debuggers for a session designed to enlighten and entertain as you learn new techniques and watch people push the envelope on what's possible with our debugging platform. A fan favorite since TechReady 7!

Sunday, June 23, 2013

0x124: WHEA_UNCORRECTABLE_ERROR

Ah, good ol' 0x124. To most, this is the 'Oh gosh, my hardware!!! It's dying!!!' bugcheck, and unfortunately that's generally true. However, there are some really neat ways to debug 0x124 dumps that can hopefully help you figure things out faster!

Let's start with our favorite thing, the dump:

Disclaimer: 0x124 bugchecks are nearly impossible to troubleshoot successfully from a single dump, because one dump on its own isn't much to go on. For example, one 0x124 dump can report one error, and the next can report something completely different (still hardware related, of course, but perhaps not CPU related). It's important to collect multiple dumps to figure out whether or not the CPU itself is truly at fault.

WHEA_UNCORRECTABLE_ERROR (124)
A fatal hardware error has occurred. Parameter 1 identifies the type of error
source that reported the error. Parameter 2 holds the address of the
WHEA_ERROR_RECORD structure that describes the error conditon.
Arguments:
Arg1: 0000000000000000, Machine Check Exception
Arg2: fffffa800ddde028, Address of the WHEA_ERROR_RECORD structure.
Arg3: 00000000b6004000, High order 32-bits of the MCi_STATUS value.
Arg4: 00000000e6000175, Low order 32-bits of the MCi_STATUS value.

Debugging Details:
------------------


BUGCHECK_STR:  0x124_AuthenticAMD

CUSTOMER_CRASH_COUNT:  1

DEFAULT_BUCKET_ID:  WIN7_DRIVER_FAULT

PROCESS_NAME:  WebKit2WebProc

CURRENT_IRQL:  f

STACK_TEXT: 
fffff880`03297b08 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KeBugCheckEx


STACK_COMMAND:  kb

FOLLOWUP_NAME:  MachineOwner

MODULE_NAME: AuthenticAMD

IMAGE_NAME:  AuthenticAMD

DEBUG_FLR_IMAGE_TIMESTAMP:  0

FAILURE_BUCKET_ID:  X64_0x124_AuthenticAMD_PROCESSOR_CACHE

BUCKET_ID:  X64_0x124_AuthenticAMD_PROCESSOR_CACHE

Followup: MachineOwner
---------
Alright, cool, right off the bat we are thankfully greeted with fairly respectable instructions. They tell us that parameter 1 identifies the type of error source that reported the error. In this dump, that would be a Machine Check Exception.

What is a Machine Check Exception (otherwise known as an MCE), you may ask? Well, it's not as hard to describe as the name makes it sound. It simply means that the computer's CPU detected a hardware problem and reported it to the operating system.
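Incidentally, parameters 3 and 4 of this bugcheck are the high and low 32 bits of the raw MCi_STATUS register the CPU latched when it raised the MCE. Just as a quick sketch using this dump's values, you can stitch them back together with WinDbg's expression evaluator (the default MASM evaluator treats numbers as hex, so the 20 below is a 32-bit shift):

? (b6004000 << 20) | e6000175

That evaluates to 0xb6004000e6000175, which is exactly the Status value you'll see again in Section 2 of the !errrec output below.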


Moving on, parameter 2 holds the address of the WHEA_ERROR_RECORD structure that describes the error condition. In this dump, that address is fffffa800ddde028.

So, with these handy instructions understood, let's go ahead and run !errrec (it dumps a specific WHEA error record) on the WHEA_ERROR_RECORD structure address, which in our case is fffffa800ddde028!

!errrec fffffa800ddde028

We are then presented with:

 5: kd> !errrec fffffa800ddde028
===============================================================================
Common Platform Error Record @ fffffa800ddde028
-------------------------------------------------------------------------------
Record Id     : 01ce686c947ffec6
Severity      : Fatal (1)
Length        : 928
Creator       : Microsoft
Notify Type   : Machine Check Exception
Timestamp     : 6/13/2013 19:40:22 (UTC)
Flags         : 0x00000000

===============================================================================
Section 0     : Processor Generic
-------------------------------------------------------------------------------
Descriptor    @ fffffa800ddde0a8
Section       @ fffffa800ddde180
Offset        : 344
Length        : 192
Flags         : 0x00000001 Primary
Severity      : Fatal

Proc. Type    : x86/x64
Instr. Set    : x64
Error Type    : Cache error
Operation     : Generic
Flags         : 0x00
Level         : 1
CPU Version   : 0x0000000000100fa0
Processor ID  : 0x0000000000000005

===============================================================================
Section 1     : x86/x64 Processor Specific
-------------------------------------------------------------------------------
Descriptor    @ fffffa800ddde0f0
Section       @ fffffa800ddde240
Offset        : 536
Length        : 128
Flags         : 0x00000000
Severity      : Fatal

Local APIC Id : 0x0000000000000005
CPU Id        : a0 0f 10 00 00 08 06 05 - 09 20 80 00 ff fb 8b 17
                00 00 00 00 00 00 00 00 - 00 00 00 00 00 00 00 00
                00 00 00 00 00 00 00 00 - 00 00 00 00 00 00 00 00

Proc. Info 0  @ fffffa800ddde240

===============================================================================
Section 2     : x86/x64 MCA
-------------------------------------------------------------------------------
Descriptor    @ fffffa800ddde138
Section       @ fffffa800ddde2c0
Offset        : 664
Length        : 264
Flags         : 0x00000000
Severity      : Fatal

Error         : DCACHEL1_EVICT_ERR (Proc 5 Bank 0)
  Status      : 0xb6004000e6000175
  Address     : 0x0000000000000700
  Misc.       : 0x0000000000000000
As you can see, we have a cache error in this specific dump. Looking at Section 2 of the !errrec report, the specific error is 'DCACHEL1_EVICT_ERR (Proc 5 Bank 0)'. Simply put, this means:

DCACHEL1_EVICT_ERR (Proc 5 Bank 0)

- This means an error occurred while evicting (writing back) a cache line from the L1 data cache.

What does that mean? L1 cache = Level 1 cache, otherwise known as the primary cache: the small, fast cache closest to the CPU core, used for temporary storage of instructions and data, organized in small blocks called cache lines (64 bytes on most modern x86 CPUs).
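As a side note, !errrec is really just formatting a structure for us. If you'd rather poke at the raw record yourself, and assuming you have good public symbols for nt, you can dump it directly with dt:

dt nt!_WHEA_ERROR_RECORD fffffa800ddde028

This should show the record header and section descriptors that !errrec summarizes above.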

Now that we have this info, let's take a look at another 0x124 dump from the same system:

**Rather than pasting the entire dump, I am just going to show the output of running !errrec on the WHEA_ERROR_RECORD structure address**

 4: kd> !errrec fffffa800ec8e838
===============================================================================
Common Platform Error Record @ fffffa800ec8e838
-------------------------------------------------------------------------------
Record Id     : 01ce686c947ffec5
Severity      : Fatal (1)
Length        : 928
Creator       : Microsoft
Notify Type   : Machine Check Exception
Timestamp     : 6/13/2013 19:31:21 (UTC)
Flags         : 0x00000002 PreviousError

===============================================================================
Section 0     : Processor Generic
-------------------------------------------------------------------------------
Descriptor    @ fffffa800ec8e8b8
Section       @ fffffa800ec8e990
Offset        : 344
Length        : 192
Flags         : 0x00000001 Primary
Severity      : Fatal

Proc. Type    : x86/x64
Instr. Set    : x64
Error Type    : Cache error
Operation     : Data Write
Flags         : 0x00
Level         : 1
CPU Version   : 0x0000000000100fa0
Processor ID  : 0x0000000000000003

===============================================================================
Section 1     : x86/x64 Processor Specific
-------------------------------------------------------------------------------
Descriptor    @ fffffa800ec8e900
Section       @ fffffa800ec8ea50
Offset        : 536
Length        : 128
Flags         : 0x00000000
Severity      : Fatal

Local APIC Id : 0x0000000000000003
CPU Id        : a0 0f 10 00 00 08 06 03 - 09 20 80 00 ff fb 8b 17
                00 00 00 00 00 00 00 00 - 00 00 00 00 00 00 00 00
                00 00 00 00 00 00 00 00 - 00 00 00 00 00 00 00 00

Proc. Info 0  @ fffffa800ec8ea50

===============================================================================
Section 2     : x86/x64 MCA
-------------------------------------------------------------------------------
Descriptor    @ fffffa800ec8e948
Section       @ fffffa800ec8ead0
Offset        : 664
Length        : 264
Flags         : 0x00000000
Severity      : Fatal

Error         : DCACHEL1_DWR_ERR (Proc 3 Bank 0)
  Status      : 0xf614c00000000145
  Address     : 0x000000043679f000
  Misc.       : 0x0000000000000000

Now, in this one, as we can see, a cache error is also being reported. Looking at Section 2 of the !errrec report, the specific error is 'DCACHEL1_DWR_ERR (Proc 3 Bank 0)'. Simply put, this means:


DCACHEL1_DWR_ERR (Proc 3 Bank 0)

- This means an error occurred during a data write to the L1 data cache.

Now we have two dumps showing read & write errors in the L1 cache. Are two dumps enough to go on? I would say no; however, an error like this is a big red flag for a faulty CPU. In this specific situation, the rest of the user's dumps were all L1 cache read & write errors as well, so it was more than likely a faulty CPU.

Another 9F example!

I run into 0x9F: DRIVER_POWER_STATE_FAILURE a fair bit. I figured I'd share another example just to show how simple they can be, and how the troubleshooting approach really doesn't change when the fault isn't obvious.

Not going to share an entire dump, just quick & easy troubleshooting for a quick & easy bugcheck:

BugCheck 9F, {3, 86404030, 83135ae0, 85b49570}
Probably caused by : pci.sys
As you can see, this specific dump was blaming pci.sys. You can pretty much bet your life that this is not the real cause, so let's go ahead and see what else we can find. As you know (and if you don't already, please visit my earlier 9F blog post, which goes into detail), to get more details on what specifically caused the crash, you're going to want to locate the address of the blocked IRP. In this case, for this specific dump, the blocked IRP address is the 4th parameter: 85b49570.

Once you have located the blocked IRP address, run !irp on it. So, for this specific dump, we would run:

!irp 85b49570
Now we get the following:

 0: kd> !irp 85b49570
Irp is active with 4 stacks 3 is current (= 0x85b49628)
No Mdl: No System Buffer: Thread 00000000: Irp stack trace.
cmd flg cl Device File Completion-Context
[ 0, 0] 0 0 00000000 00000000 00000000-00000000

Args: 00000000 00000000 00000000 00000000
[ 0, 0] 0 0 00000000 00000000 00000000-00000000

Args: 00000000 00000000 00000000 00000000
>[ 16, 2] 0 e1 8674c028 00000000 83331d5d-885bc728 Success Error Cancel pending
Unable to load image \SystemRoot\system32\DRIVERS\Rt86win7.sys, Win32 error 0n2

*** WARNING: Unable to verify timestamp for Rt86win7.sys
*** ERROR: Module load completed but symbols could not be loaded for Rt86win7.sys
\Driver\RTL8167
nt!PopSystemIrpCompletion
The important part here is Rt86win7.sys: it's the driver at the current IRP stack location, and the true fault. The > at the left marks the stack location that was being processed when the IRP got stuck, i.e., which driver the power request was sitting with. (For the curious, the [ 16, 2] pair is the major/minor function code: 0x16 is IRP_MJ_POWER, and minor code 2 is IRP_MN_SET_POWER, exactly what you'd expect for a power-state failure.)
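If you want to double-check which device the stuck power request belongs to, the Device column of that same row (8674c028 here) is a device object address, and you can hand it to !devstack to dump the whole driver stack for that device (a sketch; the address is of course specific to this dump):

!devstack 8674c028

Every driver in that stack is a candidate, but the one at the current stack location is where I look first.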

Rt86win7.sys is a Realtek NIC (network interface card) driver, so I instructed the user to visit Realtek's website and update their network drivers.
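Before sending someone off to a vendor's site, it's also worth checking how old the installed driver actually is. A quick sketch; lmvm on the module name prints its image path and link timestamp:

lmvm Rt86win7

If the timestamp is several years old, a driver update is an easy first recommendation.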

That's about it : )

A new bugcheck appears!

I ran into my first 0xC0000221 (STATUS_IMAGE_CHECKSUM_MISMATCH) bugcheck today. What does this bugcheck mean? Good question! Ultimately, it's caused when a device driver or an important system file has become corrupted (from what I have read, more so the latter than the former). In most cases, the filename of the problematic driver or system file is shown right in the stop message. However, there are certain cases in which it is not, and you need to do some digging, though it's not very difficult!

For example, here's a dump from a crash I dealt with:

BugCheck C0000221, {fffff8a000227450, 0, 0, 0}

Probably caused by : ntkrnlmp.exe ( nt!ExpSystemErrorHandler2+5ff )

Followup: MachineOwner
---------

0: kd> !analyze -v
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

Unknown bugcheck code (c0000221)
Unknown bugcheck description
Arguments:
Arg1: fffff8a000227450
Arg2: 0000000000000000
Arg3: 0000000000000000
Arg4: 0000000000000000

Debugging Details:
------------------


BUGCHECK_STR:  0xc0000221

ERROR_CODE: (NTSTATUS) 0xc0000221 - {Bad Image Checksum}  The image %hs is possibly corrupt. The header checksum does not match the computed checksum.

EXCEPTION_CODE: (NTSTATUS) 0xc0000221 - {Bad Image Checksum}  The image %hs is possibly corrupt. The header checksum does not match the computed checksum.

EXCEPTION_PARAMETER1:  fffff8a000227450

EXCEPTION_PARAMETER2:  0000000000000000

EXCEPTION_PARAMETER3:  0000000000000000

EXCEPTION_PARAMETER4: 0

MODULE_NAME: nt

IMAGE_NAME:  ntkrnlmp.exe

CUSTOMER_CRASH_COUNT:  1

DEFAULT_BUCKET_ID:  VERIFIER_ENABLED_VISTA_MINIDUMP

PROCESS_NAME:  System

CURRENT_IRQL:  0

LAST_CONTROL_TRANSFER:  from fffff8000352332f to fffff800032d4c00

STACK_TEXT: 
fffff880`009a91e8 fffff800`0352332f : 00000000`0000004c 00000000`c0000221 fffff880`009a9288 fffffa80`0353d610 : nt!KeBugCheckEx
fffff880`009a91f0 fffff800`0332090d : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00001000 : nt!ExpSystemErrorHandler2+0x5ff
fffff880`009a9420 fffff800`037079e1 : 00000000`c0000221 00000000`00000001 00000000`00000001 00000000`00040000 : nt!ExpSystemErrorHandler+0xdd
fffff880`009a9460 fffff800`03707de6 : fffffa80`c0000221 00000000`00000001 fffffa80`00000001 00000000`00040000 : nt!ExpRaiseHardError+0xe1
fffff880`009a9790 fffff800`037097a6 : fffff880`c0000221 00000000`00000001 00000000`00000001 fffff880`009a9988 : nt!ExRaiseHardError+0x1d6
fffff880`009a9890 fffff800`0371cadf : 00000000`c0000221 00000000`08000000 fffff800`037a3828 ffffffff`800000a0 : nt!NtRaiseHardError+0x1e4
fffff880`009a9930 fffff800`0371ce39 : 00000000`002a0028 00000000`00000000 00000000`00000001 fffff800`037d3ac0 : nt!PspLocateSystemDll+0xbf
fffff880`009a9a00 fffff800`0380736d : fffff800`00812810 00000000`00000002 00000000`00000000 fffff800`0344fe80 : nt!PsLocateSystemDlls+0x69
fffff880`009a9a40 fffff800`0380a4f5 : 00000000`00000007 00000000`00000010 ffffffff`8000002c fffff800`00818080 : nt!IoInitSystem+0x85d
fffff880`009a9b40 fffff800`0375a0f9 : 00000000`00000000 fffffa80`018e6040 00000000`00000080 fffffa80`01869890 : nt!Phase1InitializationDiscard+0x1275
fffff880`009a9d10 fffff800`03572ede : 00000000`00000000 00000000`00000080 00000000`00000000 fffff800`032c58f9 : nt!Phase1Initialization+0x9
fffff880`009a9d40 fffff800`032c5906 : fffff800`0344fe80 fffffa80`018e6040 fffff800`0345dcc0 00000000`00000000 : nt!PspSystemThreadStartup+0x5a
fffff880`009a9d80 00000000`00000000 : fffff880`009aa000 fffff880`009a4000 fffff880`009a93f0 00000000`00000000 : nt!KiStartSystemThread+0x16


STACK_COMMAND:  kb

FOLLOWUP_IP:
nt!ExpSystemErrorHandler2+5ff
fffff800`0352332f cc              int     3

SYMBOL_STACK_INDEX:  1

SYMBOL_NAME:  nt!ExpSystemErrorHandler2+5ff

FOLLOWUP_NAME:  MachineOwner

DEBUG_FLR_IMAGE_TIMESTAMP:  5147d9c6

FAILURE_BUCKET_ID:  X64_0xc0000221_VRF_nt!ExpSystemErrorHandler2+5ff

BUCKET_ID:  X64_0xc0000221_VRF_nt!ExpSystemErrorHandler2+5ff

Followup: MachineOwner
---------
As you can see, this one was fairly unforgiving and did not hand us a 'Look!!! I know what caused it!!' answer. Now, if this ever happens to you, here's what you can do.

1. Take a look at the four parameters. All of them but the 1st are 0000000000000000.

2. Copy the first parameter and run 'da' on it (da displays the ASCII string at that address). For example: da fffff8a000227450

Here's the output:

0: kd> da fffff8a000227450
fffff8a0`00227450  "\SystemRoot\System32\ntdll.dll"
As you can see, the problematic file here is 'ntdll.dll', the dynamic link library that exports the Windows Native API.
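As an aside, when the suspect module actually made it into the dump's loaded-module list (which won't always be true for an early-boot crash like this one), you can also ask the debugger to compare the in-memory image against the copy on your image/symbol path and list any mismatched bytes. A sketch, assuming a working symbol path:

!chkimg -d ntdll

If it reports corruption, that corroborates the bad checksum.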

Once you find the file in question, you can generally fix it by running System File Checker, or by booting from your Windows installation disc and running a repair (you can also replace the file manually that way).
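For reference, here's a sketch of the System File Checker route. The first form is run from an elevated command prompt inside Windows; the second is the offline form you'd run from the installation disc's recovery command prompt (the drive letters are just examples and depend on how the offline installation is mounted):

sfc /scannow
sfc /scannow /offbootdir=D:\ /offwindir=D:\Windows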

Saturday, June 15, 2013

Back in action!

After a very difficult nine months away from the BSOD analysis community, I am back in action. During the time I was gone, I managed to lose roughly 50 pounds, change my lifestyle dramatically for the better, achieve a 4.0 GPA, and I am now working towards the A+, Network+, and Security+ certifications.

Sometimes life gets you down and you just need to take a hard look at yourself and make some much-needed changes! Well, those changes were made, and I am back and better than ever. It's going to be wonderful working with a lot of my old friends in the BSOD analysis community; I have missed many of you. BSOD kernel dump analysis is a skill set I never want to lose, only to improve.