Hello fellow troubleshooters who may be reading this, and SSD owners. To my fellow troubleshooters: if you're helping a user who has an SSD, always ask whether they have updated to the latest firmware, and if they have not, walk them through how to do it. SSD owners: this advice is crucial to the functionality of your SSD and the overall stability of your system. Depending on the manufacturer, an SSD firmware update can be as simple as a minor bugfix, or it can be a mandatory update that stops fatal crashes.
For example, here's a recent case I dealt with on Tech Support Forum. Quick fix and to the point.
On this blog, you'll find postmortem/live bug check (BSOD) debugging, malware analysis, and reverse engineering.
Friday, August 3, 2012
Thursday, July 26, 2012
So many athrx.sys BSOD's!
Whew, can't catch a break with these lately! I've solved a TON of athrx.sys (Atheros network adapter driver) related BSOD's lately. Most if not all of them are entirely straightforward, and all that is required is an update of the wired AND wireless drivers, which can be downloaded from here in case any readers are wondering:
http://www.atheros.cz/
What's funny is, a real-life friend actually messaged me on Skype and told me he was having BSOD's, so I had him send the dumps over. Sure enough, every dump was a D1 pointing right to athrx.sys. I couldn't believe it! Sure enough again, a simple update of the drivers solved everything.
Not sure what's up with these network drivers recently, Atheros specifically of course, but remember to keep them updated!
Here's an example Atheros thread that I am fairly sure is solved, as the user reported that verifier is no longer flagging athrx.sys and the system is working fine.
Tuesday, July 17, 2012
Recent interesting solve - F4: CRITICAL_OBJECT_TERMINATION
I recently solved (well, the user recently solved!) a relatively easy F4 bugcheck. I always love when they're simple and only require an exchange of a few words! :')
Anyways, the user was having BSOD's of course. After taking a look at the dump attached to the post, I saw it was an F4: CRITICAL_OBJECT_TERMINATION bugcheck. I informed the user that an F4 means a process or thread critical to system operation exited or was terminated unexpectedly, and that the usual causes are a hardware problem with the boot drive, a buggy device driver, or a critical service being stopped.
The user mentioned that they cannot boot into safe mode, so I figured one of two things given the info provided and the bugcheck:
1. Dying drive.
2. Software issue preventing safe mode from booting properly.
I recommended that the user try running a repair if the OS disc was at hand. Thankfully, the user was able to perform a System Restore to the same day the BSOD happened. As it turns out, the issue ended up being a Microsoft Malicious Software patch. The user mentioned the system had Norton antivirus, so Norton may have blocked or interfered with that patch somehow, which then caused a bit of trouble in Windows.
After the user restored the system to a point before the patch, the system worked fine.
Kaspersky: Guilty of causing BSOD's
I have been in the BSOD analysis community for a few months now. I have never personally seen a situation in which Kaspersky was an issue, but I finally got my chance.
A user was having BSOD's when playing games, specifically Tera. I'm not sure as to whether or not the user was crashing in other instances, such as being idle and NOT gaming, but we'll leave it at that for now. I took a look and what I knew from the start was:
There were 5 dumps attached. Three pointed to dxgmms1.sys (DirectX), one pointed to a core MS file, and the earliest pointed to the culprit RzSynapse.sys (Razer Synapse Engine/Razer Naga). After seeing that, I figured it might be a simple video card driver issue, as all of the DirectX crashes happened while the process TERA.exe (the video game) was running, or the infamous Razer drivers causing BSOD's again, which I hadn't seen in a while.
I asked the user to ensure his/her DirectX was fully functional and up to date, which it was. I next recommended ensuring the video card drivers were up to date as well. If the user had recently updated the video card drivers and the issues started appearing then, I recommended reinstalling an older driver version to be sure the newer drivers weren't the issue.
The last step I suggested was an update of the Razer drivers. If the drivers were already at the latest version, and the user was still BSOD'ing after updating DX and drivers (or rolling back GPU drivers), I recommended uninstalling the Razer Naga drivers and letting Windows install the generic mouse drivers to rule out a possible Razer driver issue.
After doing all of the updates and such, the user reported the system felt much more responsive and stable. Whether or not this was placebo, that sounded good, and the updated drivers may have offered some nice performance fixes that may have been the issue. However, after 30 minutes of gameplay, the user reported a BSOD, but a different one than the usual. I was hopeful!
The newest attached dump from the user pointed to the culprit kl1.sys, which is the Kaspersky driver. I recommended the user temporarily remove Kaspersky using the removal tool provided by Kaspersky to check whether Kaspersky was the actual issue. Sure enough, after the user removed Kaspersky, the system went 3 full days without a single hiccup (and continued to do so).
With this being said, I'm now stuck trying to figure out why Kaspersky was causing those BSOD's, and why its removal is recommended all the time in the BSOD analysis community. It may just be because we want to ensure, as usual, that whatever AV the user has isn't interfering with anything and causing the BSOD's. But I am not so sure. I have asked some experts why Kaspersky is such an issue, and I will update this blog post when I have an answer!
Thursday, July 12, 2012
0x116: VIDEO_TDR_ERROR
* Updated 2/25/2014
**This is actually a canned reply, even though it's not under my canned replies post. Even though it's a canned reply, you can still use it for troubleshooting. Judging by the page views this post gets, most readers do that anyway.
----------------
If Timeout Detection and Recovery (TDR) fails to recover the display driver, Windows raises the 0x116 bugcheck. There are many different things that can cause a 0x116, which I will explain below:
(Ensure you have the latest video card drivers. If you are already on the latest video card drivers, uninstall and install a version or a few versions behind the latest to ensure it's not a latest driver only issue. If you have already experimented with the latest video card driver and many previous versions, please give the beta driver for your card a try.)
The following hardware issues can cause a TDR event:
1. Unstable overclock (CPU, GPU, etc). Revert all and any overclocks to stock settings.
2. Bad sector in memory resulting in corrupt data being communicated between the GPU and the system (video memory otherwise known as vRAM or physical memory otherwise known as RAM).
GPU testing: Furmark, run for ~15 minutes and watch temperatures to ensure there's no overheating and watch for artifacts.
RAM testing: Memtest (RUN FOR NO LESS THAN ~8 PASSES) - Refer to the below:
Memtest:
Memtest86+:
Download Memtest86+ here:
http://www.memtest.org/
Which should I download?
You can either download the pre-compiled ISO that you would burn to a CD and then boot from the CD, or you can download the auto-installer for the USB key. What this will do is format your USB drive, make it a bootable device, and then install the necessary files. Both do the same job, it's just up to you which you choose, or which you have available (whether it's CD or USB).
Do note that some older generation motherboards do not support USB-based booting, therefore your only option is CD (or Floppy if you really wanted to).
How Memtest works:
Memtest86 writes a series of test patterns to most memory addresses, reads back the data written, and compares it for errors.
The default pass does 9 different tests, varying in access patterns and test data. A tenth test, bit fade, is selectable from the menu. It writes all memory with zeroes, then sleeps for 90 minutes before checking to see if bits have changed (perhaps because of refresh problems). This is repeated with all ones for a total time of 3 hours per pass.
Many chipsets can report RAM speeds and timings via SPD (Serial Presence Detect) or EPP (Enhanced Performance Profiles), and some even support changing the expected memory speed. If the expected memory speed is overclocked, Memtest86 can test that memory performance is error-free with these faster settings.
Some hardware is able to report the "PAT status" (PAT: enabled or PAT: disabled). This is a reference to Intel Performance Acceleration Technology; there may be BIOS settings which affect this aspect of memory timing.
This information, if available to the program, can be displayed via a menu option.
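To make the write / read back / compare idea under "How Memtest works" concrete, here is a minimal Python sketch. It is only an illustration of the principle: the four patterns, the function name pattern_test, and the StuckBitMemory class (a hypothetical simulation of one faulty byte) are assumptions for this example and are not taken from Memtest86+ itself, which runs outside the OS against real physical memory.

```python
from typing import List, Tuple

# Illustrative patterns only; the real tool uses its own, larger set of tests.
PATTERNS = (0x00, 0xFF, 0xAA, 0x55)


def pattern_test(memory) -> List[Tuple[int, int, int]]:
    """Write each pattern to every cell, read it back, and record mismatches.

    Returns (offset, expected, actual) tuples; an empty list means no errors.
    """
    errors = []
    for pattern in PATTERNS:
        # Write phase: fill every cell with the current pattern.
        for offset in range(len(memory)):
            memory[offset] = pattern
        # Read-back phase: compare every cell against what was written.
        for offset in range(len(memory)):
            actual = memory[offset]
            if actual != pattern:
                errors.append((offset, pattern, actual))
    return errors


class StuckBitMemory:
    """Hypothetical faulty memory: bit 0 of one cell is always stored as 1."""

    def __init__(self, size: int, bad_offset: int):
        self._cells = bytearray(size)
        self._bad_offset = bad_offset

    def __len__(self) -> int:
        return len(self._cells)

    def __getitem__(self, index: int) -> int:
        return self._cells[index]

    def __setitem__(self, index: int, value: int) -> None:
        if index == self._bad_offset:
            value |= 0x01  # simulate the stuck bit
        self._cells[index] = value


if __name__ == "__main__":
    print(pattern_test(bytearray(64)))           # healthy buffer
    print(pattern_test(StuckBitMemory(64, 10)))  # simulated bad cell
```

Note that the stuck bit only shows up for patterns where that bit should be 0, which is one reason a memory tester uses several different patterns and multiple passes.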
Any other questions can most likely be answered by reading this great guide here:
http://forum.canardpc.com/threads/28864-FAQ-please-read-before-posting
3. Corrupt hard drive or Windows install / OS install resulting in corruption to the registry or page file.
HDD diagnostics: Seatools - Refer to the below:
http://www.seagate.com/support/downloads/seatools/
You can run it via Windows or DOS. Do note that the only difference is the environment you're running it in. If you believe you have device driver related issues in Windows that may cause conflicts or false positives, it may be wise to choose the most minimal testing environment (DOS).
Run all tests EXCEPT: Fix All, Long Generic, and anything Advanced.
To reset your page file, follow the instructions below:
a ) Go to Start...Run...and type in "sysdm.cpl" (without the quotes) and press Enter.
- Then click on the Advanced tab,
- Then on the Performance Settings Button,
- Then on the next Advanced tab,
- Then on the Virtual Memory Change button.
b ) In this window, note down the current settings for your pagefile (so you can restore them later on).
- Then click on the "No paging file" radio button, and
- Then on the "Set" button. If you have multiple hard drives, be sure that the paging file is set to 0 on all of them.
- Click OK to exit the dialogs.
c ) Reboot (this will remove the pagefile from your system)
d ) Then go back in following the directions in step a ) and re-enter the settings that you wrote down in step b ). Follow the steps all the way through (and including) the reboot.
e ) Once you've rebooted this second time, go back in and check to make sure that the settings are as they're supposed to be.
Run System File Checker:
SFC.EXE /SCANNOW
Go to Start and type in "cmd.exe" (without the quotes).
At the top of the search results, right-click cmd.exe and select "Run as administrator".
In the black window that opens, type "SFC.EXE /SCANNOW" (without the quotes) and press Enter.
Let the program run and post back what it says when it's done.
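If the on-screen summary is not enough, System File Checker also writes its details to the CBS log (normally %windir%\Logs\CBS\CBS.log), tagging its own entries with [SR]. The following Python sketch pulls out just those entries so they are easier to post. The function names are made up for this example, and the default log location is an assumption you should confirm on your own system.

```python
import os
from pathlib import Path
from typing import List


def sfc_entries(log_text: str) -> List[str]:
    """Return only the lines that System File Checker tagged with [SR]."""
    return [line for line in log_text.splitlines() if "[SR]" in line]


def default_cbs_log() -> Path:
    """Assumed default location of the CBS log on a standard install."""
    windir = os.environ.get("windir", r"C:\Windows")
    return Path(windir) / "Logs" / "CBS" / "CBS.log"


def read_sfc_entries(log_path: Path) -> List[str]:
    """Read a CBS log file and keep only the SFC entries."""
    text = log_path.read_text(encoding="utf-8", errors="replace")
    return sfc_entries(text)


if __name__ == "__main__":
    log = default_cbs_log()
    if log.is_file():
        for entry in read_sfc_entries(log):
            print(entry)
    else:
        print(f"No CBS log found at {log}")
```

Reading the log may require an elevated prompt, the same as running SFC itself.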
- Overheating of the CPU or GPU and or other components can cause 0x116 bugchecks. Monitor your temperatures and ensure the system is cooled adequately.
- GPU failure: heat, power issue (PSU issue), faulty vRAM, etc.
The following software issues can cause a TDR event:
- Incompatible drivers of any sort
- Messy / corrupt registry
- Corrupt DirectX - http://support.microsoft.com/kb/179113
- Corrupt system files (run System File Checker as advised above)
- Buggy and/or corrupt 3rd party drivers. If you suspect a 3rd party driver is the issue, enable Driver Verifier:
Driver Verifier:
What is Driver Verifier?
Driver Verifier is included in Windows 8/8.1, 7, Windows Server 2008 R2, Windows Vista, Windows Server 2008, Windows 2000, Windows XP, and Windows Server 2003 to promote stability and reliability; you can use this tool to troubleshoot driver issues. Windows kernel-mode components can cause system corruption or system failures as a result of an improperly written driver, such as an earlier version of a Windows Driver Model (WDM) driver.
Essentially, if there's a 3rd party driver believed to be at issue, enabling Driver Verifier will help flush out the rogue driver if it detects a violation.
Before enabling Driver Verifier, it is recommended to create a System Restore Point:
Vista - START | type rstrui - create a restore point
Windows 7 - START | type create | select "Create a Restore Point"
Windows 8 - http://www.eightforums.com/tutorials/4690-restore-point-create-windows-8-a.html
How to enable Driver Verifier:
Start > type "verifier" without the quotes > Select the following options -
1. Select - "Create custom settings (for code developers)"
2. Select - "Select individual settings from a full list"
3. Check the following boxes -
- Special Pool
- Pool Tracking
- Force IRQL Checking
- Deadlock Detection
- Security Checks (Windows 7 & 8)
- DDI compliance checking (Windows 8)
- Miscellaneous Checks
4. Select - "Select driver names from a list"
5. Click on the "Provider" tab. This will sort all of the drivers by the provider.
6. Check EVERY box that is NOT provided by Microsoft / Microsoft Corporation.
7. Click on Finish.
8. Restart.
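For those who prefer the command line, the checkboxes in step 3 correspond to bits in the value passed to "verifier /flags". The Python sketch below only computes that value and prints the command; it does not run anything. The bit values are the ones I understand Microsoft's Driver Verifier documentation to list, so treat them as assumptions and confirm them for your Windows version first. The driver names are just the examples from earlier posts on this blog.

```python
from typing import Iterable

# Assumed bit values for "verifier /flags"; verify against Microsoft's
# Driver Verifier documentation before relying on them.
VERIFIER_FLAGS = {
    "Special Pool": 0x00000001,
    "Force IRQL Checking": 0x00000002,
    "Pool Tracking": 0x00000008,
    "Deadlock Detection": 0x00000020,
    "Security Checks": 0x00000100,          # Windows 7 & 8
    "Miscellaneous Checks": 0x00000800,
    "DDI compliance checking": 0x00020000,  # Windows 8
}


def flags_mask(options: Iterable[str]) -> int:
    """OR together the bit values of the selected options."""
    mask = 0
    for option in options:
        mask |= VERIFIER_FLAGS[option]
    return mask


def verifier_command(options: Iterable[str], drivers: Iterable[str]) -> str:
    """Build (but do not run) a verifier command line for the given drivers."""
    return f"verifier /flags 0x{flags_mask(options):X} /driver {' '.join(drivers)}"


if __name__ == "__main__":
    windows7 = [name for name in VERIFIER_FLAGS if name != "DDI compliance checking"]
    print(verifier_command(windows7, ["athrx.sys", "kl1.sys"]))
    print(verifier_command(VERIFIER_FLAGS, ["athrx.sys", "kl1.sys"]))
```

The GUI route above remains the safer choice if you are unsure, since it lets you pick every non-Microsoft driver from a list instead of typing names by hand.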
Important information regarding Driver Verifier:
- If Driver Verifier finds a violation, the system will BSOD. To expand on this a bit for the interested: Driver Verifier watches for any driver making illegal function calls, which would lead to system corruption if allowed to continue. With Driver Verifier enabled, it monitors all 3rd party drivers (as we have it set that way), and when it catches a driver attempting this, it flags that driver as the troublemaker and brings the system down safely before any corruption can occur.
- After enabling Driver Verifier and restarting the system, you may not be able to get back into normal Windows. If the culprit driver loads at start-up, for example, Driver Verifier will detect the violation almost straight away, and as stated above, that will force a BSOD.
If this happens, do not panic, do the following:
- Boot into Safe Mode by repeatedly tapping the F8 key during boot-up.
- Once in Safe Mode - Start > Search > type "cmd" without the quotes.
- To turn off Driver Verifier, type in cmd "verifier /reset" without the quotes.
- Restart and boot into normal Windows.
If your OS became corrupt or you cannot boot into Windows after disabling verifier via Safe Mode:
- Boot into Safe Mode by repeatedly tapping the F8 key during boot-up.
- Once in Safe Mode - Start > type "system restore" without the quotes.
- Choose the restore point you created earlier.
-- Note that Safe Mode for Windows 8 is a bit different, and you may need to try different methods: 5 Ways to Boot into Safe Mode in Windows 8 & Windows 8.1
How long should I keep Driver Verifier enabled for?
I recommend keeping it enabled for at least 24 hours. If you don't BSOD by then, disable Driver Verifier. I will usually say whether or not I'd like for you to keep it enabled any longer.
My system BSOD'd with Driver Verifier enabled, where can I find the crash dumps?
They will be located in %systemroot%\Minidump
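If you want a quick way to see which dumps are there before zipping them up, here is a small Python sketch that lists the .dmp files in that folder, newest first. The function names are made up for this example, and the fallback to C:\Windows when %systemroot% is not set is an assumption.

```python
import os
from pathlib import Path
from typing import List


def minidump_dir() -> Path:
    """Resolve %systemroot%\\Minidump, falling back to C:\\Windows (assumption)."""
    root = os.environ.get("SystemRoot", r"C:\Windows")
    return Path(root) / "Minidump"


def list_minidumps(directory: Path) -> List[Path]:
    """Return the .dmp files in the directory, most recently modified first."""
    if not directory.is_dir():
        return []
    dumps = [p for p in directory.iterdir() if p.suffix.lower() == ".dmp"]
    return sorted(dumps, key=lambda p: p.stat().st_mtime, reverse=True)


if __name__ == "__main__":
    for dump in list_minidumps(minidump_dir()):
        print(dump.name)
```

An empty result can simply mean no dumps have been written yet, or that the folder needs administrator rights to read.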
Any other questions can most likely be answered by this article:
http://support.microsoft.com/kb/244617
----------------
Now that we've gone over what can cause a 0x116 bugcheck, let's go over a very simple case I solved the other day!
The user was complaining of crashing during gameplay, specifically Battlefield 3. After taking a look at the dumps, of course there were 0x116 VIDEO_TDR_ERROR bugchecks. Here's a dump file excerpt:
In most cases of 0x116 BSOD's, the first thing I always recommend, as noted earlier in this post, is uninstalling and reinstalling the video card drivers. If the user is on the latest version, roll back a version or two to see if the issue disappears. If the user is not on the latest, then update to the latest OR a beta if available.
Well, as it turns out, this case was as simple as an uninstall and reinstall of the latest available video card drivers for the user's specific GPU. Sometimes though, and unfortunately in most cases, it's not that easy and the issue is usually hardware related which takes some patience along with trial and error.
The basic definition of a 0x116 bugcheck is:
This indicates that an attempt to reset the display driver and recover from a timeout failed.
So, let me now explain what VIDEO_TDR_ERROR means. First off, TDR is an acronym for 'Timeout Detection and Recovery'. Timeout Detection and Recovery was introduced in Vista and carried over to Windows 7. Rather than paraphrasing exactly what Timeout Detection and Recovery does, I'll just directly quote the MSDN article!
Timeout detection:
The GPU scheduler, which is part of the DirectX graphics kernel subsystem (Dxgkrnl.sys), detects that the GPU is taking more than the permitted amount of time to execute a particular task. The GPU scheduler then tries to preempt this particular task. The preempt operation has a "wait" timeout, which is the actual TDR timeout. This step is thus the timeout detection phase of the process. The default timeout period in Windows Vista and later operating systems is 2 seconds. If the GPU cannot complete or preempt the current task within the TDR timeout period, the operating system diagnoses that the GPU is frozen.
To prevent timeout detection from occurring, hardware vendors should ensure that graphics operations (that is, DMA buffer completion) take no more than 2 seconds in end-user scenarios such as productivity and game play.
Preparation for recovery:
The operating system's GPU scheduler calls the display miniport driver's DxgkDdiResetFromTimeout function to inform the driver that the operating system detected a timeout. The driver must then reinitialize itself and reset the GPU. In addition, the driver must stop accessing memory and should not access hardware. The operating system and the driver collect hardware and other state information that could be useful for post-mortem diagnosis.
Desktop recovery:
The operating system resets the appropriate state of the graphics stack. The video memory manager, which is also part of Dxgkrnl.sys, purges all allocations from video memory. The display miniport driver resets the GPU hardware state. The graphics stack takes the final actions and restores the desktop to the responsive state. As previously mentioned, some legacy DirectX applications might render just black at the end of this recovery, which requires the end user to restart these applications. Well-written DirectX 9Ex and DirectX 10 and later applications that handle Device Remove technology continue to work correctly. An application must release and then recreate its Direct3D device and all of the device's objects. For more information about how DirectX applications recover, see the Windows SDK.
Article here.
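The 2 second default mentioned in the quote can be overridden on a given machine. As far as I know, Microsoft documents a TdrDelay value (a DWORD, in seconds) under HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\GraphicsDrivers for this; that key and value name are not from the quoted text, so treat them as an assumption to verify. Here is a minimal, read-only Python sketch that reports the effective timeout. The function names are made up for this example.

```python
from typing import Optional

try:
    import winreg  # standard library, but only available on Windows
except ImportError:
    winreg = None

DEFAULT_TDR_DELAY_SECONDS = 2  # default quoted in the MSDN text
GRAPHICS_DRIVERS_KEY = r"SYSTEM\CurrentControlSet\Control\GraphicsDrivers"


def read_tdr_delay() -> Optional[int]:
    """Return the configured TdrDelay in seconds, or None if it is not set."""
    if winreg is None:
        return None
    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, GRAPHICS_DRIVERS_KEY) as key:
            value, _value_type = winreg.QueryValueEx(key, "TdrDelay")
            return int(value)
    except OSError:
        # Key or value missing: Windows falls back to the default.
        return None


def effective_tdr_delay(configured: Optional[int]) -> int:
    """Use the configured value when present, otherwise the default."""
    return DEFAULT_TDR_DELAY_SECONDS if configured is None else configured


if __name__ == "__main__":
    print(f"TDR timeout: {effective_tdr_delay(read_tdr_delay())} second(s)")
```

This only reads the value. Raising the timeout tends to hide a TDR problem instead of fixing it, so I would check it for information and leave it alone.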
With this being said, if Timeout Detection and Recovery fails to recover the display driver, it will then shoot the 0x116 bugcheck. There are many different things that can cause a 0x116, which I will explain below:
(Ensure you have the latest video card drivers. If you are already on the latest video card drivers, uninstall and install a version or a few versions behind the latest to ensure it's not a latest driver only issue. If you have already experimented with the latest video card driver and many previous versions, please give the beta driver for your card a try.)
The following hardware issues can cause a TDR event:
1. Unstable overclock (CPU, GPU, etc). Revert all and any overclocks to stock settings.
2. Bad sector in memory resulting in corrupt data being communicated between the GPU and the system (video memory otherwise known as vRAM or physical memory otherwise known as RAM).
GPU testing: Furmark, run for ~15 minutes and watch temperatures to ensure there's no overheating and watch for artifacts.
RAM testing: Memtest (RUN FOR NO LESS THAN ~8 PASSES) - Refer to the below:
Memtest:
Memtest86+:
Download Memtest86+ here:
http://www.memtest.org/
Which should I download?
You can either download the pre-compiled ISO that you would burn to a CD and then boot from the CD, or you can download the auto-installer for the USB key. What this will do is format your USB drive, make it a bootable device, and then install the necessary files. Both do the same job, it's just up to you which you choose, or which you have available (whether it's CD or USB).
Do note that some older generation motherboards do not support USB-based booting, therefore your only option is CD (or Floppy if you really wanted to).
How Memtest works:
Memtest86 writes a series of test patterns to most memory addresses, reads back the data written, and compares it for errors.
The default pass does 9 different tests, varying in access patterns and test data. A tenth test, bit fade, is selectable from the menu. It writes all memory with zeroes, then sleeps for 90 minutes before checking to see if bits have changed (perhaps because of refresh problems). This is repeated with all ones for a total time of 3 hours per pass.
Many chipsets can report RAM speeds and timings via SPD (Serial Presence Detect) or EPP (Enhanced Performance Profiles), and some even support changing the expected memory speed. If the expected memory speed is overclocked, Memtest86 can test that memory performance is error-free with these faster settings.
Some hardware is able to report the "PAT status" (PAT: enabled or PAT: disabled). This is a reference to Intel Performance acceleration technology; there may be BIOS settings which affect this aspect of memory timing.
This information, if available to the program, can be displayed via a menu option.
Any other questions, they can most likely be answered by reading this great guide here:
http://forum.canardpc.com/threads/28864-FAQ-please-read-before-posting
3. Corrupt hard drive or Windows install / OS install resulting in corruption to the registry or page file.
HDD diagnostics: Seatools - Refer to the below:
http://www.seagate.com/support/downloads/seatools/
You can run it via Windows or DOS. Do note that the only difference is simply the environment you're running it in. In Windows, if you are having what you believe to be device driver related issues that may cause conflicts or false positive, it may be a wise decision to choose the most minimal testing environment (DOS).
Run all tests EXCEPT: Fix All, Long Generic, and anything Advanced.
To reset your page file, follow the instructions below:
a ) Go to Start...Run...and type in "sysdm.cpl" (without the quotes) and press Enter.
- Then click on the Advanced tab,
- Then on the Performance Settings Button,
- Then on the next Advanced tab,
- Then on the Virtual Memory Change button.
b ) In this window, note down the current settings for your pagefile (so you can restore them later on).
-Then click on the "No paging file" radio button, and
- then on the "Set" button. Be sure, if you have multiple hard drives, that you ensure that the paging file is set to 0 on all of them.
-Click OK to exit the dialogs.
c ) Reboot (this will remove the pagefile from your system)
d ) Then go back in following the directions in step a ) and re-enter the settings that you wrote down in step
b ). Follow the steps all the way through (and including) the reboot.
e ) Once you've rebooted this second time, go back in and check to make sure that the settings are as they're supposed to be.
Run System File Checker:
SFC.EXE /SCANNOW
Go to Start and type in "cmd.exe" (without the quotes)
At the top of the search box, right click on the cmd.exe and select "Run as adminstrator"
In the black window that opens, type "SFC.EXE /SCANNOW" (without the quotes) and press Enter.
Let the program run and post back what it says when it's done.
- Overheating of the CPU or GPU and or other components can cause 0x116 bugchecks. Monitor your temperatures and ensure the system is cooled adequately.
- GPU failure- Heat, power issue (PSU issue), faulty vRAM, etc.
The following software issues can cause a TDR event:
- Incompatible drivers of any sort
- Messy / corrupt registry
- Corrupt Direct X - http://support.microsoft.com/kb/179113
- Corrupt system files (run System File Checker as advised above)
- Buggy and or corrupt 3rd party drivers. If you suspect a 3rd party driver being the issue, enable Driver Verifier:
Driver Verifier:
What is Driver Verifier?
Driver Verifier is included in Windows 8/8.1, 7, Windows Server 2008 R2, Windows Vista, Windows Server 2008, Windows 2000, Windows XP, and Windows Server 2003 to promote stability and reliability; you can use this tool to troubleshoot driver issues. Windows kernel-mode components can cause system corruption or system failures as a result of an improperly written driver, such as an earlier version of a Windows Driver Model (WDM) driver.
Essentially, if there's a 3rd party driver believed to be at issue, enabling Driver Verifier will help flush out the rogue driver if it detects a violation.
Before enabling Driver Verifier, it is recommended to create a System Restore Point:
Vista - START | type rstrui - create a restore point
Windows 7 - START | type create | select "Create a Restore Point"
Windows 8 - http://www.eightforums.com/tutorials/4690-restore-point-create-windows-8-a.html
How to enable Driver Verifier:
Start > type "verifier" without the quotes > Select the following options -
1. Select - "Create custom settings (for code developers)"
2. Select - "Select individual settings from a full list"
3. Check the following boxes -
- Special Pool
- Pool Tracking
- Force IRQL Checking
- Deadlock Detection
- Security Checks (Windows 7 & 8)
- DDI compliance checking (Windows 8)
- Miscellaneous Checks
4. Select - "Select driver names from a list"
5. Click on the "Provider" tab. This will sort all of the drivers by the provider.
6. Check EVERY box that is NOT provided by Microsoft / Microsoft Corporation.
7. Click on Finish.
8. Restart.
Important information regarding Driver Verifier:
- If Driver Verifier finds a violation, the system will BSOD. To expand on this a bit more for the interested, specifically what Driver Verifier actually does is it looks for any driver making illegal function calls. When and/if this happens, system corruption occurs if allowed to continue. When Driver Verifier is enabled, it is monitoring all 3rd party drivers (as we have it set that way) and when it catches a driver attempting to do this, it will quickly flag that driver as being a troublemaker, and bring down the system safely before any corruption can occur.
- After enabling Driver Verifier and restarting the system, depending on the culprit, if for example the driver is on start-up, you may not be able to get back into normal Windows because Driver Verifier will detect it in violation almost straight away, and as stated above, that will cause / force a BSOD.
If this happens, do not panic, do the following:
- Boot into Safe Mode by repeatedly tapping the F8 key during boot-up.
- Once in Safe Mode - Start > Search > type "cmd" without the quotes.
- To turn off Driver Verifier, type in cmd "verifier /reset" without the quotes.
・ Restart and boot into normal Windows.
If your OS became corrupt or you cannot boot into Windows after disabling verifier via Safe Mode:
- Boot into Safe Mode by repeatedly tapping the F8 key during boot-up.
- Once in Safe Mode - Start > type "system restore" without the quotes.
- Choose the restore point you created earlier.
-- Note that Safe Mode for Windows 8 is a bit different, and you may need to try different methods: 5 Ways to Boot into Safe Mode in Windows 8 & Windows 8.1
How long should I keep Driver Verifier enabled for?
I recommend keeping it enabled for at least 24 hours. If you don't BSOD by then, disable Driver Verifier. I will usually say whether or not I'd like for you to keep it enabled any longer.
My system BSOD'd with Driver Verifier enabled, where can I find the crash dumps?
They will be located in %systemroot%\Minidump
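If you'd rather not dig through Explorer, a small script can list the dumps for you, newest first. This is just a sketch: the function name is my own, it assumes the default minidump location mentioned above, and reading that folder may require an elevated prompt.

```python
import os
from pathlib import Path


def list_minidumps(folder=None):
    """Return the .dmp files in the minidump folder, newest first.

    If no folder is given, fall back to %SystemRoot%\\Minidump
    (ASSUMPTION: the default location has not been changed).
    """
    if folder is None:
        system_root = os.environ.get("SystemRoot", r"C:\Windows")
        folder = Path(system_root) / "Minidump"
    folder = Path(folder)
    if not folder.is_dir():
        # No folder usually means no minidump has been written yet.
        return []
    dumps = [
        p for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() == ".dmp"
    ]
    # Sort by modification time so the most recent crash comes first.
    return sorted(dumps, key=lambda p: p.stat().st_mtime, reverse=True)


if __name__ == "__main__":
    for dump in list_minidumps():
        print(dump)
```

An empty list simply means the folder is missing or holds no .dmp files.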
Any other questions can most likely be answered by this article:
http://support.microsoft.com/kb/244617
-------- --------------------------------------------------------------------------------------------
Now that we've gone over what can cause a 0x116 bugcheck, let's go over a very simple case I solved the other day!
The user was complaining of crashing during gameplay, specifically Battlefield 3. After taking a look at the dumps, of course there were 0x116 VIDEO_TDR_ERROR bugchecks. Here's a dump file excerpt:
Built by: 7601.17835.amd64fre.win7sp1_gdr.120503-2030
Debug session time: Fri Jun 22 04:12:04.033 2012 (UTC - 4:00)
System Uptime: 0 days 1:26:38.899
BugCheck 116, {fffffa800cdfe4e0, fffff880042078b8, 0, c}
*** WARNING: Unable to verify timestamp for atikmpag.sys
*** ERROR: Module load completed but symbols could not be loaded for atikmpag.sys
Probably caused by : atikmpag.sys ( atikmpag+78b8 )
BUGCHECK_STR: 0x116
PROCESS_NAME: bf3.exe
As you can see there, the "Probably caused by" line is pointing to atikmpag.sys (ATI/AMD video card driver). In most cases, this on its own tells you very little: a 0x116 means the display driver failed to recover from a timeout, so Windows will naturally blame the video / display driver. In 0x116 crashes you will at times also see DirectX listed as the fault (either dxgkrnl - DirectX Kernel OR dxgmms1 - DirectX MMS).
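When you have a pile of dumps to go through, it can help to pull just the bugcheck code, its arguments, and the "Probably caused by" module out of the saved WinDbg text. Here is a minimal sketch of that idea. The function name and the regular expressions are my own and only assume the output looks like the excerpt above.

```python
import re

# Matches e.g. "BugCheck 116, {fffffa800cdfe4e0, fffff880042078b8, 0, c}"
BUGCHECK_RE = re.compile(r"BugCheck\s+([0-9A-Fa-f]+),\s*\{([^}]*)\}")
# Matches e.g. "Probably caused by : atikmpag.sys ( atikmpag+78b8 )"
CAUSE_RE = re.compile(r"Probably caused by\s*:\s*(\S+)")


def summarize_dump(text):
    """Extract the bugcheck code, its arguments and the blamed module."""
    summary = {"code": None, "args": [], "probable_cause": None}
    bugcheck = BUGCHECK_RE.search(text)
    if bugcheck:
        # WinDbg prints the code and arguments in hexadecimal.
        summary["code"] = int(bugcheck.group(1), 16)
        summary["args"] = [
            int(arg.strip(), 16) for arg in bugcheck.group(2).split(",")
        ]
    cause = CAUSE_RE.search(text)
    if cause:
        summary["probable_cause"] = cause.group(1)
    return summary


EXCERPT = """\
BugCheck 116, {fffffa800cdfe4e0, fffff880042078b8, 0, c}
Probably caused by : atikmpag.sys ( atikmpag+78b8 )
"""

if __name__ == "__main__":
    print(summarize_dump(EXCERPT))
```

Keep in mind this only reports what WinDbg printed. As explained above, the blamed module in a 0x116 is a starting point, not a verdict.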
In most cases of 0x116 BSODs, the first thing I always recommend is an uninstall and reinstall of the video card drivers. If the user is already on the latest version, roll back a version or two to see if the issue disappears. If the user is not on the latest, update to the latest OR a beta if available.
Well, as it turns out, this case was as simple as an uninstall and reinstall of the latest available video card drivers for the user's specific GPU. Unfortunately, in most cases it's not that easy: the issue is usually hardware related, which takes some patience along with trial and error.
0x9F: DRIVER_POWER_STATE_FAILURE
Alright, let's get down to some business. As I said, I am now going to start going through successful analysis posts of mine and posting them here for others to learn from, read up on, etc. Also just for personal reference as well! Some will be very easy and I will just explain what I did, and some will be difficult, etc.
Now, this post is going to be about bugcheck 0x9F: DRIVER_POWER_STATE_FAILURE. 9F bugchecks are personally my favorite as in most cases they're very easy to solve, and I will explain why. In most cases, a 9F will tell you the driver culprit right in the "probably caused by", however in some cases it will shoot an incorrect fault, or it will relay what the bucket ID is in that specific dump.
Before we get into all of that though, here's that basic definition of a 0x9F:
A device driver is in an invalid or inconsistent power state from either shutdown or going into or returning from hibernate or standby modes.
I recently dealt with a case in which the user was reporting 0x9F BSODs. As the dump below shows, this is a fairly straightforward 0x9F: the culprit is named right there in the "Probably caused by" line, which is asmthub3.sys (ASMedia USB 3.0 Hub driver). All that needed to be done was to link the user to the latest chipset / utility drivers on the motherboard page, and the issue was solved after updating the driver. I opened the dump in WinDbg:
*******************************************************************************
* *
* Bugcheck Analysis *
* *
*******************************************************************************
Use !analyze -v to get detailed debugging information.
BugCheck 9F, {3, fffffa800a727060, fffff80004ba93d8, fffffa8006a62010}
*** WARNING: Unable to verify timestamp for asmthub3.sys
*** ERROR: Module load completed but symbols could not be loaded for asmthub3.sys
*** WARNING: Unable to verify timestamp for win32k.sys
*** ERROR: Module load completed but symbols could not be loaded for win32k.sys
Probably caused by : asmthub3.sys
Followup: MachineOwner
---------
0: kd> !analyze -v
*******************************************************************************
* *
* Bugcheck Analysis *
* *
*******************************************************************************
DRIVER_POWER_STATE_FAILURE (9f)
A driver is causing an inconsistent power state.
Arguments:
Arg1: 0000000000000003, A device object has been blocking an Irp for too long a time
Arg2: fffffa800a727060, Physical Device Object of the stack
Arg3: fffff80004ba93d8, Functional Device Object of the stack
Arg4: fffffa8006a62010, The blocked IRP
Debugging Details:
------------------
DRVPOWERSTATE_SUBCODE: 3
DRIVER_OBJECT: fffffa800670ac10
IMAGE_NAME: asmthub3.sys
DEBUG_FLR_IMAGE_TIMESTAMP: 4eb203d0
MODULE_NAME: asmthub3
FAULTING_MODULE: fffff88005f25000 asmthub3
CUSTOMER_CRASH_COUNT: 1
DEFAULT_BUCKET_ID: VISTA_DRIVER_FAULT
BUGCHECK_STR: 0x9F
PROCESS_NAME: System
CURRENT_IRQL: 2
STACK_TEXT:
fffff800`04ba9388 fffff800`033046c2 : 00000000`0000009f 00000000`00000003 fffffa80`0a727060 fffff800`04ba93d8 : nt!KeBugCheckEx
fffff800`04ba9390 fffff800`032a4e3c : fffff800`04ba94c0 fffff800`04ba94c0 00000000`00000000 00000000`00000002 : nt! ?? ::FNODOBFM::`string'+0x34050
fffff800`04ba9430 fffff800`032a4cd6 : fffff800`03433f70 00000000`0000ea7d 00000000`00000000 00000000`00000000 : nt!KiProcessTimerDpcTable+0x6c
fffff800`04ba94a0 fffff800`032a4bbe : 00000002`2e2dc26d fffff800`04ba9b18 00000000`0000ea7d fffff800`03410228 : nt!KiProcessExpiredTimerList+0xc6
fffff800`04ba9af0 fffff800`032a49a7 : 00000000`bedfc7c2 00000000`0000ea7d 00000000`bedfc78b 00000000`0000007d : nt!KiTimerExpiration+0x1be
fffff800`04ba9b90 fffff800`03291eca : fffff800`0340ce80 fffff800`0341acc0 00000000`00000000 fffff880`00000000 : nt!KiRetireDpcList+0x277
fffff800`04ba9c40 00000000`00000000 : fffff800`04baa000 fffff800`04ba4000 fffff800`04ba9c00 00000000`00000000 : nt!KiIdleLoop+0x5a
STACK_COMMAND: kb
FOLLOWUP_NAME: MachineOwner
FAILURE_BUCKET_ID: X64_0x9F_3_IMAGE_asmthub3.sys
BUCKET_ID: X64_0x9F_3_IMAGE_asmthub3.sys
Followup: MachineOwner
---------
Now, let's assume that this dump file was not so straightforward. Let's pretend that when we opened it up, rather than the "Probably caused by" line displaying the guilty driver, it said for example "usbhub.sys".
What we would do then is take a look at the 4th argument to see if there was a blocked IRP, and then run !irp on the address of the 4th argument.
So, for example, in this dump: !irp fffffa8006a62010
After running that, we get the output below. As you can see in it, there's a (>). This symbol marks the current IRP stack location, in other words the driver that was holding the IRP at the time of the crash, and the asmthub3 driver is listed there. That's what you'd do if there was a false "Probably caused by" fault. However, sometimes you get 0x9F's that have no blocked IRP AND an incorrect / false fault. In that case you'd have to go through some other dumps, or take a look at the loaded modules list and see if there are any obviously troublesome 3rd party drivers that may have caused it, etc. You can also get 9F bugchecks that deal with locks and such, but we'll get into that at another time.
0: kd> !irp fffffa8006a62010
Irp is active with 8 stacks 7 is current (= 0xfffffa8006a62290)
No Mdl: No System Buffer: Thread 00000000: Irp stack trace.
cmd flg cl Device File Completion-Context
[ 0, 0] 0 0 00000000 00000000 00000000-00000000
Args: 00000000 00000000 00000000 00000000
[ 0, 0] 0 0 00000000 00000000 00000000-00000000
Args: 00000000 00000000 00000000 00000000
[ 0, 0] 0 0 00000000 00000000 00000000-00000000
Args: 00000000 00000000 00000000 00000000
[ 0, 0] 0 0 00000000 00000000 00000000-00000000
Args: 00000000 00000000 00000000 00000000
[ 0, 0] 0 0 00000000 00000000 00000000-00000000
Args: 00000000 00000000 00000000 00000000
[ 0, 0] 0 0 00000000 00000000 00000000-00000000
Args: 00000000 00000000 00000000 00000000
>[ 16, 2] 0 e1 fffffa800a740790 00000000 fffff80003285ce0-fffffa800b934340 Success Error Cancel pending
\Driver\asmthub3 nt!IopUnloadSafeCompletion
Args: 00016600 00000001 00000004 00000005
[ 0, 0] 0 0 00000000 00000000 00000000-fffffa8006958dc0
Args: 00000000 00000000 00000000 00000000
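To show how the (>) marker is read, here is a small sketch that builds the !irp command from the 4th bugcheck argument and then picks the driver out of saved !irp output. The function names are my own, and the parsing only assumes the layout shown in the output above, where the \Driver\ name sits on the line right after the marked stack location.

```python
def irp_command(arg4):
    """Build the !irp command for the blocked IRP address (bugcheck Arg4)."""
    return "!irp {:x}".format(arg4)


def current_irp_driver(irp_output):
    """Return the driver at the stack location marked with '>', or None."""
    lines = irp_output.splitlines()
    for index, line in enumerate(lines):
        if not line.lstrip().startswith(">"):
            continue
        # ASSUMPTION: the driver name is on the line following the marker,
        # as in the output shown in this post.
        if index + 1 < len(lines):
            for token in lines[index + 1].split():
                if token.startswith("\\Driver\\"):
                    return token[len("\\Driver\\"):]
    return None


SAMPLE = r"""
 [  0, 0]   0  0 00000000 00000000 00000000-00000000
            Args: 00000000 00000000 00000000 00000000
>[ 16, 2]   0 e1 fffffa800a740790 00000000 fffff80003285ce0-fffffa800b934340 Success Error Cancel pending
           \Driver\asmthub3 nt!IopUnloadSafeCompletion
            Args: 00016600 00000001 00000004 00000005
"""

if __name__ == "__main__":
    print(irp_command(0xfffffa8006a62010))
    print(current_irp_driver(SAMPLE))
```

If no line is marked, the function returns None, which matches the situation described earlier where there is no blocked IRP to follow.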
Generally, 0x9F's are very easy. I learned how to solve most 0x9F's by reading VirGnarus' posts across various BSOD communities.
Monday, July 9, 2012
Wow, time flies! I wrote my first blog post almost two weeks ago and it doesn't even feel like it has been that long.
Anyways, this post will finally be about some BSOD discussion, but not totally into any cases just yet. This post will be more about how I got myself into BSOD kernel dump analysis, etc. After I write this post, I'll have a look around, compile some of my successful BSOD analyses to post here, and go into in-depth explanations. So, without further ado, let's begin.
I built my current rig about, I'd say... almost two years ago? It worked well for a long time, absolutely loved it. Well, one day, I shut it down real quick to see if an adapter I bought fit one of my GPUs, and when I went to power it back on, my Corsair 750W PSU shorted out and killed every single one of my components EXCEPT my two ATI 5850 video cards.
However, I did not know this until VERY much later. I figured my motherboard and HDD were the only ones lost in combat. So I powered my rig back on and it wouldn't POST. I knew right away that my SATA controller was probably dead as the HDD wasn't being detected in another system, or in mine. So I called up Asus and Samsung, got RMAs set up with both, and had my system up and running again within almost two weeks.
Flash forward two weeks to when I got everything up and running, and it's actually all downhill from here. I would BSOD VERY randomly after a few days of system uptime, never really soon after a cold boot. It always took a few days.
I had always been somewhat knowledgeable about computers in various ways, so I took it upon myself to see if anybody else was having my issues, what they were doing for it, etc. Back then, I only knew that the crashes were being caused by what I saw as the driver culprit on the actual blue screen itself. The two most popular were dxgmms1.sys, and of course... atikmpag.sys. So, with those two drivers in mind, I set off on a journey to Google. I came across various forums: Seven Forums, Tech Support Forum, etc., all discussing this. When I was reading up on this, I took it upon myself to go ahead and learn how to analyze dump files.
I got the WinDbg client set up, the symbols and the path set up, and then set off. At first, I didn't know what the hell I was looking at. It was like looking at another language, hieroglyphics would be the closest thing I can think of. I just closed the client and gave up. I decided to RMA my RAM and my PSU to Corsair and both were replaced. After about 3 weeks, I got my system up and running again, and on a fresh Windows 7 install. After a few days... boom, BSOD.
I sat there and was almost to tears, I couldn't believe it. I couldn't believe I pretty much replaced my entire computer, and it's still crashing. I couldn't believe I spent over $2000 on a computer that is completely unstable, but a backup rig that I have in a plastic drawer with the mobo screwed to a piece of wood NEVER crashes. It just hurt... it really did, and it REALLY bothered me.
I said enough was enough and decided to learn how to debug and analyze dump files. So I once again set off on a journey, but I was for sure this time going to learn. I remember reading posts on various forums from jcgriff2, JMH3143, zigzag3143, satrow, writhziden, Vir Gnarus, etc. Without people like that, I wouldn't be where I am today with this hobby, and my knowledge would be absolute EONS behind.
After lots of reading, I had enough knowledge to understand that my issue was related to either DirectX... or my AMD drivers. I tackled DirectX first just in case, and it wasn't that... so I moved onto AMD drivers. Before uninstalling and reinstalling the drivers, I thought about when these BSODs started happening. I remembered that before I RMA'd all of my parts, my system worked completely fine. I thought to myself, unless I am incredibly unlucky and all of this brand new replacement hardware somehow became DoA, it should be fine and something ELSE is the culprit. I then at that moment remembered on my old Windows install prior to my PSU failing, I was on CCC 12.1 and not CCC 12.3 (which is what I was on after I got the replacement hardware).
To be sure everything went smoothly, I went ahead and did a clean Windows 7 install. I might as well have, anyways. I was BSOD'ing left and right on the new install, and had to make a lot of crappy adjustments and optimizations to be stable for more than 15 minutes... so my OS and registry in general were probably in shambles.
After I got a clean install of Windows 7 going, satrow helped me get everything up and running again by running me through a checklist and such. After reinstalling 12.1 rather than 12.3 CCC, 116 TDR nightmares... GONE!
At that moment, I took it upon myself to spend all of the free time I have helping others with their BSOD related issues. BSODs are not fun, let's get that out of the way. They cause real life stress, annoyance, waste time when work needs to be done, etc. They are a plague, and if you're having issues that are so ongoing like I did, and you don't know how to solve them, imagine that? I don't want people to have to go through what I did.
I made it my goal to make it so people when they get a BSOD, they don't have to go to the nearest Geek Squad and pay $100 to have them say "Oh, we need to reinstall Windows" when it's a simple driver culprit, etc. I want to help others with their issues, for free. There aren't many of us in the BSOD analysis community if you think about it. Compared to the amount of computers and users that use said computers in the world, we're almost non-existent.
So now, with my free time, which currently in my life is ALL of the time until I start IT school, I read and read about BSODs courtesy of our BSOD communities. I also spend tons of time on these various communities solving BSOD related issues. The best part is, I have been noticed for this, and awarded.
Ever since all of this, here I am almost a year later with this knowledge that increases every single day because of my peers, etc. I have made it my long term goal to one day achieve the Microsoft MVP award. I know for a fact that my hard work, determination, and personality will get me there. It's just a matter of time.
:)