Monday, August 18, 2014

PDC_WATCHDOG_TIMEOUT (14f) debugging

I received a PDC_WATCHDOG_TIMEOUT (14f) crash dump, although I seem to have misplaced the source. What a shame! Anyway, this bug check is pretty mysterious. There's little to no documentation on it, and too few people have hit it for it to show up much in web searches. With that said, let's do our best to get some info on it!

 PDC_WATCHDOG_TIMEOUT (14f)  
 A system component failed to respond within the allocated time period,  
 preventing the system from exiting connected standby.  
 Arguments:  
 Arg1: 0000000000000002, Client ID of the hung component.  
 Arg2: 0000000000000002, A resiliency client failed to respond.  
 Arg3: fffff801867d3578, Pointer to the resiliency client (pdc!_PDC_RESILIENCY_CLIENT).  
 Arg4: ffffd001b7f61b30, Pointer to a pdc!PDC_14F_TRIAGE structure.  
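
When comparing several dumps of the same bug check, it can help to pull the four arguments out of the text rather than reading them by eye. The following is a minimal sketch, assuming the argument lines look exactly like the output above; the function name and regular expression are my own, not part of any debugger API.

```python
import re

# Argument lines copied from the bug check output above.
ANALYZE_TEXT = """\
Arg1: 0000000000000002, Client ID of the hung component.
Arg2: 0000000000000002, A resiliency client failed to respond.
Arg3: fffff801867d3578, Pointer to the resiliency client (pdc!_PDC_RESILIENCY_CLIENT).
Arg4: ffffd001b7f61b30, Pointer to a pdc!PDC_14F_TRIAGE structure.
"""

# Matches "ArgN: <hex value>, <description>".
ARG_LINE = re.compile(r"^\s*Arg(\d):\s*([0-9a-fA-F]+),\s*(.*)$")


def parse_bugcheck_args(text):
    """Return {arg_number: (value_as_int, description)} for each ArgN line."""
    args = {}
    for line in text.splitlines():
        match = ARG_LINE.match(line)
        if match:
            number, value, description = match.groups()
            args[int(number)] = (int(value, 16), description.strip())
    return args


args = parse_bugcheck_args(ANALYZE_TEXT)
for number, (value, description) in sorted(args.items()):
    print(f"Arg{number} = {value:#x}  {description}")
```

Here Arg1 and Arg2 are both 2, and Arg3 and Arg4 are kernel addresses that can only be followed in a dump that still contains that memory.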

We can see right away that the cause of the bug check itself is:
A system component failed to respond within the allocated time period, preventing the system from exiting connected standby.
With that said, now is a good time to discuss connected standby. Connected standby is a low-power state (introduced in Windows 8 and also present in 8.1) that features extremely low power consumption while maintaining a constant internet connection. Here is how to trigger connected standby:

  • Press the system power button.
  • Close the lid or tablet cover, or close the tablet into an attached dock.
  • Select Sleep from the Power button on the Settings charm.

The best comparison is a smartphone's power button. When you press the power button on one of today's smartphones, it will transition to a similar state as opposed to entirely shutting down. This way, when you press the power button again, it will start right back up from where you previously left off.

Now that we know how to trigger connected standby, how does the system wake up and transition to its active state again?

  • Press the system power button.
  • Open the lid on a clamshell form-factor system.
  • Open the tablet if it is connected to a portable dock with a keyboard (similar to a lid in a clamshell system).
  • Generate input on an integrated or attached keyboard, mouse, or touchpad.
  • Press the Windows button that is integrated into the system display.

Aside from user actions, system components, programs, and so on can wake the system from connected standby as well. For example, if the user has an incoming Skype call, the system will wake immediately and allow a 25-second window to answer the call. If it is not answered, the call is canceled and the system goes back into connected standby.

System components or devices can also wake the core silicon or SoC from connected standby, even though those events may not turn on the display. Nearly all devices connected to a connected standby system are expected to be capable of waking the SoC from its deepest idle power state.

--------------------

Now that we understand connected standby, the bug check makes more sense: for some reason, a specific system component failed to respond within the allotted time period, so the system remained in connected standby when it should have woken up.
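
To make the idea of a watchdog concrete, here is a toy sketch of the general pattern: every registered client must respond within a deadline, and any client that does not is reported as hung. This is not how pdc.sys is implemented. The function name, the client IDs, the response times, and the 30-second timeout are all made up for illustration.

```python
def find_hung_clients(response_times, timeout_seconds):
    """Return the IDs of clients that did not respond within the timeout.

    response_times maps a client ID to the seconds it took to respond,
    or to None if the client never responded at all.
    """
    hung = []
    for client_id, elapsed in sorted(response_times.items()):
        if elapsed is None or elapsed > timeout_seconds:
            hung.append(client_id)
    return hung


# Hypothetical data: client 2 never answers, loosely mirroring Arg1 above.
responses = {1: 0.4, 2: None, 3: 1.2}
hung = find_hung_clients(responses, timeout_seconds=30.0)
if hung:
    # A real watchdog would bug check here; this sketch only reports.
    print(f"watchdog expired, hung client IDs: {hung}")
```

In the real dump, the equivalent of the hung client ID is what Arg1 carries, and the watchdog's reaction is the bug check itself.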

First of all, what kind of device is this, given that we likely wouldn't see connected standby on a desktop (or maybe even a laptop)?

 0: kd> !sysinfo machineid  
 Machine ID Information [From Smbios 2.8, DMIVersion 39, Size=1106]  
 BiosMajorRelease = 3  
 BiosMinorRelease = 7  
 FirmwareMajorRelease = 32  
 FirmwareMinorRelease = 0  
 BiosVendor = American Megatrends Inc.  
 BiosVersion = 3.07.0150  
 BiosReleaseDate = 05/15/2014  
 SystemManufacturer = Microsoft Corporation  
 SystemProductName = Surface Pro 3  
 SystemFamily = Surface  
 SystemVersion = 1  
 SystemSKU = Surface_Pro_3  
 BaseBoardManufacturer = Microsoft Corporation  
 BaseBoardProduct = Surface Pro 3  
 BaseBoardVersion = 1  

Ah, it's a Surface tablet! It all makes sense now.

As this was a minidump, our call stack was extremely uninformative:

 0: kd> k  
 Child-SP     RetAddr      Call Site  
 ffffd001`b7f61af8 fffff801`867dcd72 nt!KeBugCheckEx  
 ffffd001`b7f61b00 fffff803`712daadb pdc!PdcpResiliencyWatchdog+0xa6  
 ffffd001`b7f61b50 fffff803`71356794 nt!ExpWorkerThread+0x293  
 ffffd001`b7f61c00 fffff803`713e15c6 nt!PspSystemThreadStartup+0x58  
 ffffd001`b7f61c60 00000000`00000000 nt!KiStartSystemThread+0x16  

We can see a system thread start up, which turns out to be a worker thread, and that thread runs pdc!PdcpResiliencyWatchdog, which goes on to call nt!KeBugCheckEx. This implies we failed to complete the resiliency phase in the allotted time period (however long that is). Usually, when you see resiliency phase issues on a device that is waking from an inactive state (sleep, hibernate, etc.), the first thing to look at is the network. For example, the D0 IRP for the required network device may not have completed in time due to a third-party conflict.
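
If you triage a lot of dumps, the same reading of the stack can be done mechanically. The sketch below is only an illustration: it assumes frames formatted exactly like the `k` output above, and the helper names are my own. It splits each frame into module, function, and offset, then picks the first frame that is not in the kernel (nt).

```python
import re

# Frames copied from the k output above (header line omitted).
STACK_TEXT = """\
ffffd001`b7f61af8 fffff801`867dcd72 nt!KeBugCheckEx
ffffd001`b7f61b00 fffff803`712daadb pdc!PdcpResiliencyWatchdog+0xa6
ffffd001`b7f61b50 fffff803`71356794 nt!ExpWorkerThread+0x293
ffffd001`b7f61c00 fffff803`713e15c6 nt!PspSystemThreadStartup+0x58
ffffd001`b7f61c60 00000000`00000000 nt!KiStartSystemThread+0x16
"""

# Child-SP, RetAddr, then module!function with an optional +0xNN offset.
FRAME = re.compile(
    r"^\s*([0-9a-f]{8}`[0-9a-f]{8})\s+([0-9a-f]{8}`[0-9a-f]{8})\s+"
    r"(\w+)!(\w+)(?:\+0x([0-9a-f]+))?\s*$",
    re.IGNORECASE,
)


def parse_stack(text):
    """Return a list of frame dicts, innermost frame first."""
    frames = []
    for line in text.splitlines():
        match = FRAME.match(line)
        if match:
            _child_sp, _ret_addr, module, function, offset = match.groups()
            frames.append({
                "module": module,
                "function": function,
                "offset": int(offset, 16) if offset else 0,
            })
    return frames


def first_non_kernel_frame(frames, skip_modules=("nt",)):
    """Return the first frame whose module is not in skip_modules, or None."""
    for frame in frames:
        if frame["module"] not in skip_modules:
            return frame
    return None


frames = parse_stack(STACK_TEXT)
culprit = first_non_kernel_frame(frames)
print(culprit)
```

For this stack the only non-kernel frame is the pdc watchdog, which matches the manual reading: the stack tells us who raised the bug check, not which client hung.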

We can further confirm we're likely dealing with a network issue by taking a look at our bucket_id:

 FAILURE_BUCKET_ID: 0x14F_WCM_pdc!PdcpResiliencyWatchdog  

WCM is the Windows Connection Manager, which enables the creation and configuration of connection manager software.
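
The bucket ID above can be split into its pieces in a couple of lines. This is a sketch that assumes the three-part shape seen here (bug check code, a tag, then module!function); bucket IDs in other dumps can look different, and the function name is my own.

```python
def split_bucket_id(bucket_id):
    """Split a bucket ID shaped like CODE_TAG_module!function."""
    # Split on the first two underscores only, so the symbol stays intact.
    code, tag, symbol = bucket_id.split("_", 2)
    module, _, function = symbol.partition("!")
    return {
        "bugcheck": int(code, 16),
        "tag": tag,
        "module": module,
        "function": function,
    }


bucket = split_bucket_id("0x14F_WCM_pdc!PdcpResiliencyWatchdog")
print(bucket)
```

A bucket ID without two underscores would raise a ValueError in this sketch, so treat it as a starting point rather than a general parser.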

As this is a Surface tablet, one can imagine it's likely using Wi-Fi. If a Wi-Fi connection is available, the system will wait for the Wi-Fi device only, regardless of whether a mobile broadband (MBB) connection is available. With this said, I took a look at the loaded modules to see if any antivirus, firewall, or similar software was installed. I was essentially looking for anything that could have accidentally interfered with the network upon wake.

Here's what I found:

 0: kd> lmvm MBAMSwissArmy  
 start       end         module name  
 fffff801`8851d000 fffff801`8853e000  MBAMSwissArmy  (deferred)         
   Image path: \??\C:\windows\system32\drivers\MBAMSwissArmy.sys  
   Image name: MBAMSwissArmy.sys  
   Timestamp:    Thu Mar 20 18:12:35 2014  

Malwarebytes Anti-malware driver, listed and loaded.
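
The same kind of check can be scripted once you have the image names from the loaded module list. The sketch below is only an illustration: the watchlist is hypothetical and contains just the one driver found in this dump, and the sample list of loaded images is made up apart from MBAMSwissArmy.sys.

```python
# Hypothetical watchlist mapping driver image names (lowercase) to products.
# Only MBAMSwissArmy.sys comes from this dump; extend it as needed.
WATCHLIST = {
    "mbamswissarmy.sys": "Malwarebytes Anti-Malware",
}


def flag_modules(image_names, watchlist=WATCHLIST):
    """Return (image_name, product) pairs for images found in the watchlist."""
    flagged = []
    for name in image_names:
        product = watchlist.get(name.lower())
        if product:
            flagged.append((name, product))
    return flagged


# Illustrative list of image names, not the full module list from the dump.
loaded = ["ntoskrnl.exe", "pdc.sys", "MBAMSwissArmy.sys"]
flagged = flag_modules(loaded)
for name, product in flagged:
    print(f"{name}: {product}")
```

A hit here is only a lead, not proof; removing the software and watching whether the crashes stop is still the real test.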

I asked the user to uninstall Malwarebytes for temporary troubleshooting purposes, and the crashes no longer occurred. I hope the user also contacted Malwarebytes support to work out any possible issues that need to be patched.

I hope to see more of these bug checks in the future, hopefully with a kernel dump next time so I can go in-depth!

Thanks for reading!

9 comments:

  1. As usual a cheerful article to read. Thanks Patrick for these articles ^_^

    Replies
    1. Sorry for the late reply my friend, I was not notified of such! My pleasure, glad you enjoyed. Thanks as always for reading! Hope you're well.

      Regards,

      Patrick

  2. I am seeing a similar issue on my own Surface Pro 3, but do not have Malwarebytes installed. When I open the generated memory.dmp file I see:

    1: kd> k
    Child-SP RetAddr Call Site
    ffffd000`81b8baf8 fffff800`5e653d72 nt!KeBugCheckEx
    ffffd000`81b8bb00 fffff803`9a05eadb pdc!PdcpResiliencyWatchdog+0xa6
    ffffd000`81b8bb50 fffff803`9a0da794 nt!ExpWorkerThread+0x293
    ffffd000`81b8bc00 fffff803`9a1655c6 nt!PspSystemThreadStartup+0x58
    ffffd000`81b8bc60 00000000`00000000 nt!KiStartSystemThread+0x16

    I am new to this sort of thing. What would you suggest to debug further? Thanks

    Replies
    1. Hi Ruben Perez ^_^,

      Like Patrick said, in this case it was Malwarebytes that was causing the problem. Do you have any antivirus or third-party firewall installed on your Surface Pro 3? If yes, then please see this page: http://kb.eset.com/esetkb/index?page=content&id=SOLN146

      and download the cleaning tool for your antivirus. That should solve your problem. If the problem is still not solved, then please upload the dump files from "C:\Windows\Minidump" using any file hosting service so that they can be analyzed.

    2. Hi Ruben,

      If you email me the kernel dump(s) (located in C:\Windows and named MEMORY.DMP), I can take a look for you.

      Regards,

      Patrick

  3. It should have only the built-in Windows antivirus and firewall at this point; it's a new system.

    Thanks for agreeing to take a look at this Patrick. I've shared the file with you via google drive. Good luck, and let me know what you find.

    Replies
    1. Apologies for the delayed reply, I just had a chance to look.

      Yours is pretty much identical to the blog post here, in addition to being a Surface
      Pro 3. It is next to impossible to get reliable information out of this bug check,
      so it's mostly educated guesswork.

      If we check the stack (as you pasted above prior):

      1: kd> k
      Child-SP RetAddr Call Site
      ffffd000`81b8baf8 fffff800`5e653d72 nt!KeBugCheckEx
      ffffd000`81b8bb00 fffff803`9a05eadb pdc!PdcpResiliencyWatchdog+0xa6
      ffffd000`81b8bb50 fffff803`9a0da794 nt!ExpWorkerThread+0x293
      ffffd000`81b8bc00 fffff803`9a1655c6 nt!PspSystemThreadStartup+0x58
      ffffd000`81b8bc60 00000000`00000000 nt!KiStartSystemThread+0x16

      All we know is that a system thread started up that happened to be a worker thread, and
      then we bug checked due to a failure to come out of standby when the system should have
      responded and woken.

      If I had to take an educated guess at the likely cause, as I said above, it's
      DisplayLink. I've seen tons of issues surrounding DisplayLink products lately.

      Regards,

      Patrick

    2. Thanks for taking a look. Fortunately (or unfortunately, depending on the mindset), it hasn't reoccurred in the last week despite using the Surface in a docking station with a DisplayLink adapter for a second monitor.

      If it reoccurs, I'll uninstall DisplayLink and use only one monitor to see if the crash and DisplayLink are correlated. Thanks again for your help.

  4. I have a Surface Pro 3 with this problem (actually it is the second one, as the first was replaced under RMA with no analysis of the logs done). If you would still be interested, I can send the logs. There seem to be many others with this same problem, but no definitive answers from anyone, nor any help from MS (apart from device replacements). I'd like to see if a better answer might be found in my logs.
