Wednesday, 27 November 2013

How WHEA Works Internally

With this blog post, I think I've covered all the aspects of a Stop 0x124; if not, you may see a few more posts about it in the future. I'm also planning on adding a few more debugging extensions, and will provide an updated write-up of a Stop 0x101.

WHEA General Structure and Reporting:

WHEA has the following general component structure:


Low Level Hardware Error Handlers (LLHEHs) perform hardware error source discovery, gather information about the error in the form of Hardware Error Packets, and then notify the operating system of the error. The Windows Kernel then formats these Hardware Error Packets into Error Records.

Here is the general format of an Error Record for WHEA:


Each Error Record is described by the WHEA_ERROR_RECORD structure, which provides information about the error condition. The record is then saved in the Event Log through ETW (Event Tracing for Windows).
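
To give a feel for the layout, here is a minimal sketch in C, assuming the usual arrangement of a record header followed by section descriptors which point at the section data. The type and field names (SketchRecordHeader, SketchSectionDescriptor, sections_fit) are my own simplified stand-ins and are not the WDK definitions; the real WHEA_ERROR_RECORD carries many more fields.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for illustration only; not the WDK definitions. */
typedef struct {
    uint32_t record_length;  /* total size of the record, in bytes */
    uint16_t section_count;  /* number of section descriptors that follow */
} SketchRecordHeader;

typedef struct {
    uint32_t section_offset; /* offset of the section from the start of the record */
    uint32_t section_length; /* size of the section, in bytes */
} SketchSectionDescriptor;

/* Returns true if every section described lies inside the record. */
bool sections_fit(const SketchRecordHeader *hdr,
                  const SketchSectionDescriptor *descs)
{
    for (size_t i = 0; i < hdr->section_count; i++) {
        /* Widen to 64 bits so offset + length cannot wrap around. */
        uint64_t end = (uint64_t)descs[i].section_offset
                     + descs[i].section_length;
        if (end > hdr->record_length) {
            return false;
        }
    }
    return true;
}
```

A check like this is the sort of sanity test you might run before walking the sections of a record pulled from a dump file.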

When an error condition occurs, the LLHEH may communicate with the PSHED (Platform Specific Hardware Error Driver) to retrieve platform-specific information for that error condition.

Hardware Error Classification

There are two classes of hardware error: corrected and uncorrected. These classifications are fairly self-explanatory, but I'll explain them nevertheless.

Corrected Errors: These are errors which have been corrected by the hardware; the operating system is then notified of the correction.

Uncorrected Errors: These are errors which can't be corrected by the hardware, and they fall into two further categories: Fatal and Non-Fatal.

Fatal: An uncorrected error which can't be recovered by the hardware, and will result in a bugcheck.

Non-Fatal: The operating system can attempt to recover from these errors; however, failure to do so will result in a bugcheck.
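
The classification above boils down to a small decision table, which can be sketched in C as follows. This is only a model of the rules as I've described them, and the names (ErrorClass, Outcome, resolve_error) are hypothetical rather than anything exported by Windows.

```c
#include <stdbool.h>

/* Hypothetical model of the classification; not the WDK's own types. */
typedef enum {
    ERR_CORRECTED,            /* already fixed by the hardware */
    ERR_UNCORRECTED_NONFATAL, /* the OS may attempt recovery */
    ERR_UNCORRECTED_FATAL     /* no recovery is possible */
} ErrorClass;

typedef enum {
    OUTCOME_LOG_ONLY,  /* the OS is notified and carries on */
    OUTCOME_RECOVERED, /* the OS recovery attempt succeeded */
    OUTCOME_BUGCHECK   /* the system bugchecks */
} Outcome;

/* os_recovery_succeeded only matters for non-fatal uncorrected errors. */
Outcome resolve_error(ErrorClass cls, bool os_recovery_succeeded)
{
    switch (cls) {
    case ERR_CORRECTED:
        return OUTCOME_LOG_ONLY;
    case ERR_UNCORRECTED_NONFATAL:
        return os_recovery_succeeded ? OUTCOME_RECOVERED : OUTCOME_BUGCHECK;
    case ERR_UNCORRECTED_FATAL:
    default:
        return OUTCOME_BUGCHECK;
    }
}
```

Note that a bugcheck is reachable from both uncorrected categories, which is why a Stop 0x124 on its own doesn't tell you whether a recovery attempt was ever made.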

Error Sources

An error source is the hardware which detected and reported a hardware error; it is not necessarily the hardware which is at fault. At boot, the PSHED (Platform Specific Hardware Error Driver) returns a list of WHEA_ERROR_SOURCE_DESCRIPTOR structures to the Windows Kernel to indicate all the supported error sources for that hardware platform. With this information, the operating system is able to load and set up the necessary LLHEHs (Low Level Hardware Error Handlers) for the hardware error sources.

The error sources for x86 and x64 hardware platforms are:
  • Machine Check Exceptions
  • Corrected Machine Checks
  • Non-Maskable Interrupts
  • Boot Errors

The PCIe AER error sources are discovered by the PCIe bus driver rather than by the PSHED.
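
The boot-time discovery step can be pictured as the kernel walking the descriptor list and setting up one handler per error source. The following is a rough sketch in C under that assumption; BootSourceDescriptor, HandlerSlot and setup_handlers are hypothetical names of my own, and the real setup path inside the kernel is considerably more involved.

```c
#include <stddef.h>

/* Hypothetical source types, mirroring the error sources listed above. */
typedef enum {
    BOOT_SRC_MCE,  /* Machine Check Exception */
    BOOT_SRC_CMC,  /* Corrected Machine Check */
    BOOT_SRC_NMI,  /* Non-Maskable Interrupt */
    BOOT_SRC_BOOT  /* Boot Error */
} BootSourceType;

/* Simplified stand-in for one descriptor returned by the PSHED. */
typedef struct {
    BootSourceType type;
    unsigned source_id;
} BootSourceDescriptor;

/* Simplified stand-in for a handler bound to one error source. */
typedef struct {
    BootSourceType type;
    unsigned source_id;
} HandlerSlot;

/* Bind one handler per descriptor; returns how many were set up.
   Stops early if the output table is full. */
size_t setup_handlers(const BootSourceDescriptor *list, size_t count,
                      HandlerSlot *out, size_t out_capacity)
{
    size_t n = 0;
    for (size_t i = 0; i < count && n < out_capacity; i++) {
        out[n].type = list[i].type;
        out[n].source_id = list[i].source_id;
        n++;
    }
    return n;
}
```

PCIe AER sources are left out of the sketch on purpose, since they arrive later from the PCIe bus driver rather than in the list the PSHED returns.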

The general layout of the WHEA_ERROR_SOURCE_DESCRIPTOR structure is as follows:


The Type member indicates the type of error source. The ErrorSourceId member holds the unique identifier for the error source. More information on this structure can be found in the WDK documentation.

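The union holds the descriptor specific to the error source's type, with one member for each kind of error source; which member is valid depends on the Type member.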
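
This is the classic tagged union pattern: a type field tells you which member of the union to read. Below is a cut-down sketch in C to show the idea. The names (SketchDescriptor, DescType, cmc_poll_interval_ms) and the placeholder fields inside each member are assumptions of mine for illustration; refer to the WDK headers for the real members.

```c
#include <stdint.h>

/* Hypothetical tag values; the real structure has more source types. */
typedef enum { DESC_MCE, DESC_CMC, DESC_NMI } DescType;

/* Placeholder payloads; the fields are illustrative only. */
typedef struct { uint32_t bank_count; } MceDetails;
typedef struct { uint32_t poll_interval_ms; } CmcDetails;
typedef struct { uint8_t reserved; } NmiDetails;

typedef struct {
    DescType type;            /* selects the valid union member */
    uint32_t error_source_id; /* unique identifier for the source */
    union {
        MceDetails mce;
        CmcDetails cmc;
        NmiDetails nmi;
    } info;
} SketchDescriptor;

/* Read the CMC payload only when the tag says it is valid;
   returns 0 for any other source type. */
uint32_t cmc_poll_interval_ms(const SketchDescriptor *d)
{
    if (d->type == DESC_CMC) {
        return d->info.cmc.poll_interval_ms;
    }
    return 0;
}
```

The same rule applies when reading the real structure in a debugger: check Type first, then look at the matching member of the union and ignore the rest.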





