BSODTutorials: Stop 0x124

Thursday, 23 January 2014

List of WHEA Data Structures

I've listed other WHEA data structures in my other blog posts, and therefore will not be listing the same ones here. The purpose of this blog post is to list the WHEA data structures available with WinDbg, and Microsoft's Public Symbol Server. The information within the structures has more or less been explained in my other WHEA posts, but if in doubt please leave a comment or read the WDK documentation.

_WHEA_ERROR_STATUS
_WHEA_ERROR_RECORD_HEADER_FLAGS
_WHEA_ERROR_PACKET_V2
_WHEA_ERROR_PACKET_FLAGS
_WHEA_ERROR_TYPE
_WHEA_ERROR_SEVERITY
_WHEA_ERROR_SOURCE_TYPE
_WHEA_ERROR_PACKET_DATA_FORMAT
_WHEA_ERROR_RECORD
_WHEA_ERROR_RECORD_HEADER
_WHEA_ERROR_RECORD_SECTION_DESCRIPTOR
_WHEA_REVISION
_WHEA_ERROR_RECORD_SECTION_DESCRIPTOR_VALIDBITS
_WHEA_ERROR_RECORD_SECTION_DESCRIPTOR_FLAGS

Monday, 2 December 2013

Debugging Stop 0x124 - Using !whea

A new month, a new post and an old topic. It's back to Stop 0x124 once again, and this time I'm going to explain the !whea extension, which outputs the high-level structure of the WHEA architecture.

You'll need at least a Kernel Memory dump to use this extension.

Take note of the current error record address for the WHEA_ERROR_RECORD structure, and the error source which was notified. We can actually view the data structure with the dt nt!_(name) command. I've added the -r switch to dump all the substructures too; here is a partial output:

Here is the !whea output.

I've highlighted the address of the current WHEA_ERROR_RECORD present in our dump file, as the main error record for the crash. I've highlighted the source of the error, and the type of hardware platform which the error record falls under. We can see that one error has caused a bugcheck, and the rest are corrected errors. The rest of the !whea output is below:

Let's explain what each of these fields means. Firstly, the Type field contains an enumeration called WHEA_ERROR_SOURCE_TYPE. It defines all the error sources which the hardware is able to report. Its general structure is here:

These values match parameter 1 of the Stop 0x124 bugcheck. Some of the acronyms you may already be familiar with, but I'll list and explain these fields nonetheless.

MCE (0) = Machine Check Exception
CMC (1) = Corrected Machine Check
CPE (2) = Corrected Platform Error
NMI (3) = Non Maskable Interrupt
PCIe (4) = PCI Express
Generic (5) = Unknown Error
INIT (6) = Itanium INIT error
BOOT (7) = Boot Error
SCI (8) = System Control Interrupt
IPFMCA (9) = Itanium Machine Check Abort
IPFCMC (A) = Itanium Corrected Machine Check
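To make the mapping from bugcheck parameter 1 to an error source a little more concrete, here is a minimal Python sketch. It only covers the values 0 to 8 from the list above and leaves the Itanium-only sources out; the class and function names are my own and are not part of WinDbg or the WDK.

```python
from enum import IntEnum


class WheaErrorSourceType(IntEnum):
    """Subset of WHEA_ERROR_SOURCE_TYPE (values 0-8 from the list above)."""
    MCE = 0x0
    CMC = 0x1
    CPE = 0x2
    NMI = 0x3
    PCIE = 0x4
    GENERIC = 0x5
    INIT = 0x6
    BOOT = 0x7
    SCI = 0x8


def describe_source(param1: int) -> str:
    """Map Stop 0x124 parameter 1 to an error source name, if it is in the subset."""
    try:
        return WheaErrorSourceType(param1).name
    except ValueError:
        # Itanium-only sources and anything else fall through to here.
        return f"unlisted source (0x{param1:x})"


if __name__ == "__main__":
    print(describe_source(0x0))
    print(describe_source(0x4))
```

Passing the first bugcheck parameter to describe_source is enough to tell at a glance whether you are dealing with a CPU Machine Check Exception or a PCIe error.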

The Error Count field shows the number of uncorrectable errors which have led to a bugcheck, and the Record Count field shows the number of Error Records under that particular error source.

The fields shown in the red box belong to the WHEA_ERROR_SOURCE_DESCRIPTOR structure. Its general structure can be seen here:

The Length and Version fields aren't really important here. The Length field indicates the size of the structure in bytes, and the Version field shows the version of the structure.

The Type field is the WHEA_ERROR_SOURCE_TYPE enumeration.

The State field holds the WHEA_ERROR_SOURCE_STATE enumeration, which defines the runtime state of the error source: either the error source has stopped handling and processing error information, or it has started.

MaxRawDataLength is the amount of data which should be stored in the error packet, for error source information and any additional information provided by the PSHED plug-in, which gives specific troubleshooting information developed by the hardware vendor.

NumRecordsToPreallocate is quite self-explanatory, and defines how many error records should be preallocated for the error source.

MaxSectionsPerRecord is the maximum number of sections to be provided within the error record.

ErrorSourceID and PlatformErrorSourceID are unique identifiers to the error source on the system where the error has happened.

Flags are bitwise OR'ed to show additional information. The three possible flags are as follows:

WHEA_ERROR_SOURCE_FLAG_DEFAULTSOURCE indicates that the error source is the default error source for the hardware platform on which the error was reported.
WHEA_ERROR_SOURCE_FLAG_FIRMWAREFIRST shows that firmware processed the error condition before control was handed to the operating system.
WHEA_ERROR_SOURCE_FLAG_GLOBAL shows that any settings applied to one error source should apply to all error sources of the same type.
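Since the Flags field is a bitwise OR of these three values, it can be picked apart with simple masking. The sketch below is only an illustration: the numeric bit values are what I recall from the WDK headers and should be treated as assumptions until checked against your own copy of the headers, and decode_flags is a made-up helper name.

```python
from typing import List

# Assumed bit values (from memory of the WDK headers; verify before relying on them).
WHEA_ERROR_SOURCE_FLAGS = {
    0x00000001: "WHEA_ERROR_SOURCE_FLAG_FIRMWAREFIRST",
    0x00000002: "WHEA_ERROR_SOURCE_FLAG_GLOBAL",
    0x80000000: "WHEA_ERROR_SOURCE_FLAG_DEFAULTSOURCE",
}


def decode_flags(flags: int) -> List[str]:
    """Return the names of all known flags which are set in the Flags field."""
    return [name for bit, name in WHEA_ERROR_SOURCE_FLAGS.items() if flags & bit]


if __name__ == "__main__":
    for name in decode_flags(0x80000001):
        print(name)
```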

Looking at our example, we can see that all the supported x86/x64 error sources returned by querying the PSHED are listed.

Just to add, to obtain the error packet address and use the !errpkt extension, you need to dump the error record, and the error packet should be in one of the record sections. Unfortunately, I've never come across a dump file like this, so I'll post the WinDbg documentation example here.

Wednesday, 27 November 2013

How WHEA Works Internally

After this blog post, I think I've covered all the aspects of a Stop 0x124. Otherwise, you may see a few more posts about it in the future. I'm also planning on adding a few more debugging extensions, and will provide an updated version of a Stop 0x101.

WHEA General Structure and Reporting:

WHEA has the following general component structure:

LLHEHs are used to perform hardware error source discovery, gather information about the error source in the form of Hardware Error Packets, and then notify the operating system of the error. The Windows Kernel then formats these Hardware Error Packets into Error Records.

Here is the general format of an Error Record for WHEA:

Each Error Record is described by the WHEA_ERROR_RECORD structure which provides information about the error condition. This is then saved in the Event Log through ETW (Event Tracing for Windows).

When an error condition has happened, the LLHEH may communicate with the PSHED to receive platform-specific information for that error condition.

Hardware Error Classification

There are two groups of hardware errors: corrected and uncorrected. These classifications are quite self-explanatory, but I'll explain them nevertheless.

Corrected Errors: These are errors which have been corrected by the hardware. The operating system is then notified of this correction.

Uncorrected Errors: These are errors which can't be corrected by the hardware, and they fall into a further two categories: Fatal and Non-fatal.

Fatal: An uncorrected error which can't be recovered from by the hardware, and will result in a bugcheck.

Non-Fatal: The operating system can attempt to recover from these errors; however, failure to do so will result in a bugcheck.

Error Sources

Error sources refer to the hardware which detected a hardware error; this does not necessarily mean that the error source is the problem. At boot, the PSHED (Platform Specific Hardware Error Driver) returns a list of WHEA_ERROR_SOURCE_DESCRIPTOR structures to the Windows Kernel to indicate all the supported error sources for that hardware platform. With this information, the operating system is able to load and set up the necessary LLHEHs (Low Level Hardware Error Handlers) for the hardware error sources.

The error sources for the x86 and x64 hardware platforms are:

Machine Check Exceptions
Corrected Machine Checks
Non Maskable Interrupts
Boot Errors

The PCIe AER error sources are discovered by the PCIe Bus driver, and aren't discovered by the PSHED.

The general structure is as follows:

The Type member indicates the type of error source. ErrorSourceID indicates the unique identifier for the error source. More information on this structure can be found in the WDK documentation.

The union lists all the descriptors for a specific error source.

Tuesday, 26 November 2013

Debugging Stop 0x124 - !sysinfo, !cpuinfo, !whea and !errpkt

Another blog post about Stop 0x124. As I've always said, these bugchecks are probably among the hardest to debug due to the lack of information they retain, so it's very important to understand how to gather as much as possible from these seemingly barren dump files. I'm going to explain the rest of the !sysinfo extensions, !cpuinfo, !whea and !errpkt.

In regards to !sysinfo, I'm going to discuss the flags to the extension highlighted with the red box.

Let's begin with the !cpuinfo extension, which will show some basic information about the CPU. By default, it will display information about all the processors in the system; however, since this is a Minidump, only one processor, the one which was last running, will be shown.

The MHz field shows the clockspeed of the processor. The Manufacturer field is used to show that the processor is a real Intel processor.

The CP field shows the current processor number. The F indicates the processor family number; the M indicates the processor model number and the S indicates the stepping size.

A processor family (F) is a form of categorisation used by CPU vendors to group their products, and therefore make comparison of the different features between processors of a similar feature set much easier. In a debugging sense, this just makes it easier to identify the processor, and find the relevant documentation for it.

The Model number (M) shows the specific type of processor within that family.

The Stepping Size (S) is the version number of a CPU.
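On x86 processors, the family, model and stepping are conventionally packed into a single 32-bit signature (the EAX value returned by CPUID leaf 1). If you ever have that raw signature and want the three numbers back, the following Python sketch shows the usual Intel decoding rules. The function name is my own, and I'm assuming the Intel convention for combining the extended family and model fields; AMD applies the extended model slightly differently.

```python
def decode_cpu_signature(eax: int) -> dict:
    """Split a CPUID leaf 1 EAX signature into family, model and stepping."""
    stepping = eax & 0xF
    base_model = (eax >> 4) & 0xF
    base_family = (eax >> 8) & 0xF
    ext_model = (eax >> 16) & 0xF
    ext_family = (eax >> 20) & 0xFF

    # Intel convention: the extended family is only added when the base family is 0xF.
    family = base_family + ext_family if base_family == 0xF else base_family
    # The extended model is only prepended for base families 0x6 and 0xF.
    if base_family in (0x6, 0xF):
        model = (ext_model << 4) | base_model
    else:
        model = base_model
    return {"family": family, "model": model, "stepping": stepping}


if __name__ == "__main__":
    # 0x000306A9 is used purely as an example signature.
    print(decode_cpu_signature(0x000306A9))
```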

The MSR (Machine/Model Specific Register) Signature Features field is used to show the set of debugging and performance monitoring features. It can refer to any of the Control Registers used. These are usually displayed as cr8 or another number. See Volume 3 Chapter 35 of the Intel Developers Manual.

The !sysinfo cpuinfo extension is used to display similar information.

The !sysinfo cpumicrocode shows the processor family, model and stepping information.
This only works for Intel processors.

The CPUID string shows the name of the processor, and the MaxSpeed and CurrentSpeed of the processor. This is very useful for checking for overclocking.

The !sysinfo gbl extension is used to provide ACPI Table information, and only works on systems which support ACPI. The ACPI Specifications can be found here - ACPI - Advanced Configuration and Power Interface

The !sysinfo machineid extension is used for displaying basic motherboard and BIOS date information.

The !sysinfo registers extension is used to display information about the MSRs, here I would consult the Intel Developers Manual to gather more information. This extension only works on processors which are not Itanium processors.

The !sysinfo smbios extension explains information related to the BIOS, such as memory, BIOS Version information, processor information and power information.

Again, the above is only a partial view of the extension due to size limitations. This only works on systems which support SMBIOS.

The !whea extension never seems to produce much with a Minidump, since it's part of the higher levels of the WHEA structure, and therefore I'd advise checking the WinDbg documentation.

The Error Source indicates the hardware which notified WHEA of the hardware error condition. This does not mean that the Error Source is necessarily the culprit of the crash.

There is also the !errpkt extension which displays information about a WHEA Hardware Error Packet, however, this information is converted into a WHEA Error Record (!errrec) by the Windows Kernel, by being given the Error Packet from the LLHEH (Low Level Hardware Error Handler).

Sunday, 17 November 2013

Debugging Stop 0x124 - PCIe Errors Part 3

Poisoned TLP (PTLP) - A TLP is considered "poisoned" when it carries bad data. A PTLP is marked by the EP bit being set in the Header of the TLP, which is how the receiver knows that the TLP is poisoned. Only TLPs which carry data for read or write requests (posted or non-posted) can be considered poisoned. Any other forms of bad data are considered as Unsupported Requests (Memory, I/O or Messages). See Section 2.7.2.2.

Flow Control Protocol (FCP) - This is quite a simple error, in that it indicates that a Flow Control Protocol rule has been broken. Flow Control information is carried in FCPs (Flow Control Packets), which are a type of DLLP, as explained in Part 2. See Section 2.6.1 for an entire listing of the rules. Again, I'll add a few for your convenience.

Each Virtual Channel (Data Buffer) uses a separate flow control credit system, and is therefore independent of the other channels. Any TLP received on a Virtual Channel which isn't enabled is considered a Malformed TLP.

Completer Abort (CA) - Completer Aborts are errors which are generally caused by requests which physically can't be processed by the device. Any requester which is returned a completion status of CA must free all the resources and buffer space for that request, and treat it as the last request, so that bad requests of that type don't keep being sent. The receiving device or port is the device which flags this error.
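To show where the EP bit actually lives, here is a small Python sketch which decodes the first Double Word of a TLP header using the PCIe 1.1 field layout. I'm assuming the Double Word is written with header byte 0 as the most significant byte; the Header Log in a dump may present the bytes in a different order, so treat this as an illustration rather than something to apply blindly. The function name and the example value are made up.

```python
def decode_tlp_dw0(dw0: int) -> dict:
    """Decode the first Double Word of a TLP header (PCIe 1.1 layout)."""
    return {
        "fmt": (dw0 >> 29) & 0x3,            # header size and whether data follows
        "type": (dw0 >> 24) & 0x1F,          # request or completion type
        "traffic_class": (dw0 >> 20) & 0x7,
        "td": bool(dw0 & (1 << 15)),         # TLP Digest (ECRC) present
        "ep": bool(dw0 & (1 << 14)),         # poisoned data
        "length_dw": dw0 & 0x3FF,            # payload length in Double Words
    }


if __name__ == "__main__":
    # Hypothetical memory write with a 1 DW payload and the EP bit set.
    fields = decode_tlp_dw0(0x40004001)
    print(fields)
    print("poisoned" if fields["ep"] else "not poisoned")
```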

The table is a little misleading here, I would suggest checking Section 2.2.9.

End To End Redundancy Check (ECRC) - This kind of error reporting and support is entirely optional, and as a result I'm not entirely sure if it will always be reported in the dump file, so be watchful of this. When it is supported, this kind of information is reported within the AER Capabilities and Control Register (Caps & Control in the dump file). The value of the ECRC is set in the Digest field of the TLP. See Sections 2.7.1 and 7.10.7.

ECRC is used to help with data integrity reporting, and any reported errors are sourced from the receiving port. Devices which support ECRC must enable this error checking on all TLPs.

References:

PCI-E WHEA errors (Stop 0x124) - Sysnative Forums

Stop 0x124 Troubleshooting (New Method) - SevenForums (my old tutorial)

Down to the TLP - How PCI express devices talk (Part 1)

Note: The most important documentation to check is the 1.1 PCIe Base Specification.

Debugging Stop 0x124 - PCIe Errors Part 2

Continuing from Part 1, this second part is going to elaborate on the errors which were described in my previous blog post. Here, I would strongly recommend you download the PCIe 1.1 Specification, which isn't freely available to download from the PCIe website, but you can find a copy on Google. The error table is available in section 6.2.7 of the PCIe 1.1 Specification.

In general, here is what the PCIe AER Extended Capability Structure looks like:

Let's begin with the Unsupported Request Error (UR). Unsupported Requests are basically requests which are not supported, and they can be a result of several things. One is the device power state: any I/O or Memory TLP request sent while the power state is D1, D2 or D3 is treated as an Unsupported Request (see Section 5.3.1). I'll explain TLPs later. There are several errors which can cause an Unsupported Request error. The PCIe 1.1 Specification has a listing with references to the appropriate sections; I'll list a few of these errors for convenience:

The data link between two devices could be down, which is shown with the DL_Down status being sent to the Transaction Layer (TLPs are part of this layer), therefore any TLPs being sent, which are non-posted, are completed with a return of Unsupported Request and then discarded. See Section 2.9.1 for more details. PCIe Posted and Non-Posted Requests can be found here - PCIe posted vs non-posted transactions

See Section 6.5.7 - PCIe Endpoints do not support locking mechanisms, unlike their legacy PCI counterparts which do, but this is now discouraged. Any MRdLk requests are treated as an Unsupported Request. More information is available here - The PCIe Lock Protocol

Malformed TLP (Malformed Transaction Layer Packet); here I'm going to explain some of the concepts behind TLPs. TLPs are part of the Transaction Layer, which sits on top of the Link Layer and the Physical Layer. TLPs are essentially request packets, like IRPs in the I/O Manager subsystem of Windows. These packets contain requests to perform certain operations. These requests can be memory reads, memory writes, I/O reads and writes, and messages. Each packet header is either 3 or 4 Double Words long, where a Double Word (a Windows data type) is 32 bits.

The Transaction Layer primarily reads and sends these packets. More Information - PCI Express The Transaction Layer

Malformed TLPs are TLPs which have caused errors. There are a few reasons why a TLP can become Malformed. The length of a TLP mustn't exceed the Max Payload Size (the amount of data in bytes); this setting can be altered in the BIOS. See Section 2.2.7 for the general I/O, Memory and Message rules for TLPs.

The TLP Address and Length can't cross a 4KB memory boundary. See Section 7.8.4.

Regarding memory read requests, the TLP length must not exceed the MAX_READ_REQUEST_SIZE - See Section 7.8.4.

Design Assistant for PCI Express - What is the difference between MAX_READ_REQUEST_SIZE and MAX_PAYLOAD_SIZE?
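The length rules above (the 4KB boundary and MAX_READ_REQUEST_SIZE) boil down to a couple of comparisons. Here is a minimal Python sketch of both checks; the function names are my own, and the lengths are taken in bytes for simplicity, whereas the TLP header itself counts in Double Words.

```python
def crosses_4k_boundary(address: int, length_bytes: int) -> bool:
    """True if a request starting at address spills over a 4KB boundary."""
    if length_bytes <= 0:
        raise ValueError("length must be positive")
    # Offset within the 4KB page plus the length must stay within 0x1000 bytes.
    return (address & 0xFFF) + length_bytes > 0x1000


def exceeds_max_read_request(length_bytes: int, max_read_request_size: int) -> bool:
    """True if a memory read asks for more than MAX_READ_REQUEST_SIZE bytes."""
    return length_bytes > max_read_request_size


if __name__ == "__main__":
    print(crosses_4k_boundary(0xFF0, 0x10))    # ends exactly on the boundary
    print(crosses_4k_boundary(0xFF0, 0x11))    # one byte over
    print(exceeds_max_read_request(1024, 512))
```

A request which fails either check would be a candidate for being flagged as a Malformed TLP by the receiver.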

Surprise Down (SD) - This error is detected within the Link Layer of the PCIe topology. I'm not sure exactly how to define how it happens, but it seems to also be part of the Sequence Number and Link CRC (LCRC) error detection which applies to a TLP. I'm guessing the Transaction Layer sends a TLP to the Link Layer, and then the Link Layer performs some error checking on it. The TLP is checked before it is sent to another component connected along the link.

I believe that if the TLP isn't sent across the link properly the first time, the TLP is stored within a retry buffer and sent again a few times, until the component on the opposite end of the link gives a positive response back. If all the repeated attempts to send the TLP receive negative responses, then the link is considered to be malfunctioning, and the link may be closed by changing the link status to the DL_Inactive state. However (see Section 4.2.6), the Physical Layer will first attempt to recover or retrain the link, and only if this fails does the link become inactive.

ROF (Receiver Overflow) - This error indicates that a TLP was received which consumes more flow control credit than was given. The flow control credit system seems to be a method of controlling TLP ordering (see Section 2.4) and preventing receiver buffer overflows. It's maintained between the Transaction Layer and the Data Link Layer. Any TLPs transferring between the layers of the topology must pass through each flow control credit gate. See Section 2.6.

Unexpected Completion (UC) - A completion was received which doesn't match one of the receiver's outstanding requests.

Completion Timeout (CT) - A completion for a request wasn't received within the expected time interval. See Section 2.8.

Data Link Protocol (DLP) - This indicates that an error has occurred within the Data Link Layer, which sits between the Physical Layer and the Transaction Layer and is used for sending TLPs between different devices. It also provides data integrity checking (see Section 3.5) for TLPs and DLLPs. See Section 3.4.1 and Section 3.

The Data Link Layer also has its own packets called Data Link Layer Packets (DLLPs), which are used to provide functions like power management, flow control information and TLP acknowledgment.

Ack DLLP is used to acknowledge that the component has successfully received the expected number of TLPs. Nak DLLP does the opposite.

InitFC1, InitFC2 and UpdateFC are used to exchange flow control information between components. It's important to remember that if a receiver has received a malformed TLP, then no receivers should update their flow control information, thus no UpdateFC DLLPs should be used.

See Part 3 for more information on the other errors.

Friday, 15 November 2013

Debugging Stop 0x124 - PCIe Errors Part 1

I've bumped into a Stop 0x124 bugcheck, which was sourced from a PCI or PCIe bus on the motherboard. It's been a very long time since I've debugged one of these bugchecks (PCIe that is, not CPU), and in fact I did write a tutorial on how to debug these types of crashes, but that tutorial is quite dated now and I've learned quite a few more things since its writing. Therefore I'm going to create an updated version here which will bring everything together. I already know this tutorial is going to have to be split across multiple posts, since there is so much information about PCIe errors that I need to explain. Fortunately, there are some free specification documents available online, so I would download those right away. I'm hopefully going to explain all the fields within the AER (more on that later) and the errors which caused the bugcheck.

Okay, that was a lengthy introduction already, but in this first part, I'm just going to show you the general structure of the dump file and then provide information on which data structures you should look at. These data structures can't be dumped within WinDbg I believe, so you'll need to check the MSDN documentation.

Let's begin:

The parameters given are not much different from their CPU Machine Check Exception counterparts. The first parameter indicates the source of the error. This is usually the Root Port for PCI Express or PCI, or it can even be related to USB/Bluetooth if the computer is an OEM machine. Please note, the source tells us where the error originated from, and does not mean that the error is being caused by the particular device being shown.

The second parameter is the address of the WHEA_ERROR_RECORD data structure. We can use the !errrec extension on this address to dump the data structure in WinDbg.

The !errrec extension will produce something like this, I've omitted the less useful information from the image:

The Port Type simply tells us where the error was sourced from, and who reported the error; in this case it was the Root Port. The Version field here is important: it tells us which version of the documentation we need to start reading to understand all the information given to us in the AER (PCI Express Advanced Error Reporting) section. The VenId and the DevId will help identify the device connected to the port which sourced the error; we can enter these values into a PCI Database.

In this case, the Device ID fields pointed to an Intel 7500 Chipset PCIe Root Port.

The next important fields are Uncorrectable Error Status, Uncorrectable Error Mask and Uncorrectable Error Severity. Each of these fields is set by bit flags, which help us identify the errors which have occurred. I'll explain these in more detail in Part 2, but for now, I will quickly give a list of them.

Status - Shows which errors have been found and reported - See PCI_EXPRESS_UNCORRECTABLE_ERROR_STATUS

Mask - Shows which errors are not important and can be disregarded - See PCI_EXPRESS_UNCORRECTABLE_ERROR_MASK

Severity - Shows which errors are fatal and therefore can't be recovered from. These are the errors which we will need to examine, since these errors have caused the bugcheck.

See - PCI_EXPRESS_UNCORRECTABLE_ERROR_SEVERITY

Now, the capitalised characters are the errors which have supposedly been reported, so here's a list of what they mean:

UR - Unsupported Request Error
MTLP - Malformed TLP
SD - Surprise Down
ROF - Receiver Overflow
UC - Unexpected Completion
CT - Completion Timeout
DLP - Data Link Protocol Error
PTLP - Poisoned TLP
FCP - Flow Control Protocol Error
CA - Completer Abort
ECRC - End to End Redundancy Check Error
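If you only have the raw hexadecimal values of the Status, Mask and Severity fields, the mnemonics can be recovered by testing individual bits. The Python sketch below uses the bit positions of the AER Uncorrectable Error Status register as I understand them from the PCIe Base Specification; the helper name is my own, so please check the bit positions against the specification before trusting the output.

```python
from typing import List, Tuple

# Bit position -> mnemonic for the Uncorrectable Error Status/Mask/Severity registers.
UNCORRECTABLE_ERROR_BITS = {
    4: "DLP",
    5: "SD",
    12: "PTLP",
    13: "FCP",
    14: "CT",
    15: "CA",
    16: "UC",
    17: "ROF",
    18: "MTLP",
    19: "ECRC",
    20: "UR",
}


def decode_uncorrectable(status: int, mask: int, severity: int) -> List[Tuple[str, bool, bool]]:
    """Return (mnemonic, masked, fatal) for every error bit set in status."""
    found = []
    for bit, name in sorted(UNCORRECTABLE_ERROR_BITS.items()):
        flag = 1 << bit
        if status & flag:
            found.append((name, bool(mask & flag), bool(severity & flag)))
    return found


if __name__ == "__main__":
    # Hypothetical register values, not taken from a real dump.
    for name, masked, fatal in decode_uncorrectable(0x00140000, 0x0, 0x00040000):
        print(name, "masked" if masked else "unmasked", "fatal" if fatal else "non-fatal")
```

An error which is set in Status, clear in Mask and set in Severity is the kind we care about most, since those are the errors which lead to the bugcheck.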

These errors will be explained in more detail in later parts.

Caps & Control field corresponds to the AER Capabilities and Control Register, which is used for ECRC Error Checking.

The Header Log field contains the header (up to four 32-bit Double Words) of the TLP associated with the error.

These errors will be explained in more detail in Part 2.

Wednesday, 23 October 2013

Debugging Stop 0x124 - Calculating Clockspeed (Without !sysinfo cpuspeed)

We all know that Stop 0x124 bugchecks contain very little practical information to work with: the stack consists of WHEA reporting routines, and many commands have no significance to a Stop 0x124.

The first thing to look at with a Stop 0x124 is the clock speed of the processor. Generally, an overclocked processor means that the user has most likely overclocked their GPU or RAM, and also changed the voltage settings. However, the !sysinfo cpuspeed extension does not always work, as shown below:

This is quite annoying, but there are other ways to view the clock speed of a processor. We can use the !prcb extension to view the Processor Control Block, which is a private kernel data structure used to store thread scheduling information, the DPC queue, detailed CPU vendor information, cache sizes and so on.

Using the !prcb extension doesn't provide much information, but using the address fffff780ffff0000 with the _KPRCB data structure will provide some very detailed information. Please be aware that not specifying the processor number with !prcb extension will default to the context of the current processor.

This is not the entire data structure, you can enter the command yourself, if you wish to view the entire data structure, but for the purposes of this blog post the MHz field is what we are interested in the most.

The hexadecimal value of 0xa21 doesn't give much information and is practically useless. Using the ? (evaluate expression) command we can convert this into something readable.
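The same conversion can be done outside of WinDbg if you've only noted the value down. Here is a minimal Python sketch; the function name is just for illustration.

```python
def mhz_from_hex(text: str) -> int:
    """Convert the hexadecimal MHz value from the _KPRCB output to decimal."""
    return int(text, 16)


if __name__ == "__main__":
    mhz = mhz_from_hex("0xa21")
    # 0xa21 = 10 * 256 + 2 * 16 + 1
    print(f"{mhz} MHz, roughly {mhz / 1000:.1f} GHz")
```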

We can see that the processor is running at 2.6 GHz, the .formats command will give similar information:

Tuesday, 1 October 2013

Debugging Stop 0x124 - CPU Mnemonics

Okay, we all know that Stop 0x124's are very generic and irritating bugchecks since they don't provide much information at all to be honest.

However, this can be made easier by reading the error mnemonics within the CPU documentation, which will provide further insight into how the error was caused. I think I learned this from one of Vir's quotes in one of YoYo's posts, so thanks for providing information on where to find the documentation.

You will want to download the .PDF file of Intel 64 and IA-32 Architectures Software Developer's Manual Volume 3: System Programming Guide, and then turn to page 2352. Here you will find the error mnemonics for the type of error. Remember, we can find the type of error and then decode its meaning by using the !errrec extension with the second parameter of the bugcheck.

Due to size limitations in which the Snipping Tool can expand to, I've taken a screenshot of the relevant part of the output in which the !errrec extension reads from the WHEA_ERROR_RECORD data structure.

Okay, so the error is related to a Bus Error, which is documented within the CPU Developer's Manual, each type of error has a table of mnemonics associated with it.

So, looking at the above error, we can see that the error originated from the Level 0 Cache, and the error was sourced from the processor itself (the CPU raised the flag for the error, hence the Machine Check Exception in the first parameter). The error was a generic no timeout error, occurring within processor number 0 and memory bank 0. The M indicates that something was accessing the data stored within the cache when the error happened.
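To show how the mnemonic is put together, here is a small Python sketch which decodes a Bus and Interconnect error code using the 0000 1PPT RRRR IILL layout described in the Intel manual. The tables are my reading of that layout and the function name is made up; where the manual gives no mnemonic for a value (for example the generic participation code), I've used a lowercase placeholder instead.

```python
LEVEL = ["L0", "L1", "L2", "LG"]                   # LL: cache or memory hierarchy level
MEMORY_IO = ["M", "reserved", "IO", "other"]       # II: memory access, I/O or other
PARTICIPATION = ["SRC", "RES", "OBS", "generic"]   # PP: how the processor took part
TIMEOUT = ["NOTIMEOUT", "TIMEOUT"]                 # T: whether the request timed out
REQUEST = {0: "ERR", 1: "RD", 2: "WR", 3: "DRD", 4: "DWR",
           5: "IRD", 6: "PREFETCH", 7: "EVICT", 8: "SNOOP"}


def decode_bus_error(mca_error_code: int) -> str:
    """Build the BUS{LL}_{PP}_{RRRR}_{II}_{T}_ERR mnemonic from the low 16 bits."""
    code = mca_error_code & 0xFFFF
    if (code & 0xF800) != 0x0800:
        raise ValueError("not a bus and interconnect error code")
    level = LEVEL[code & 0x3]
    mem_io = MEMORY_IO[(code >> 2) & 0x3]
    request = REQUEST.get((code >> 4) & 0xF, "unknown")
    timeout = TIMEOUT[(code >> 8) & 0x1]
    participation = PARTICIPATION[(code >> 9) & 0x3]
    return f"BUS{level}_{participation}_{request}_{mem_io}_{timeout}_ERR"


if __name__ == "__main__":
    print(decode_bus_error(0x0800))
```

Each part of the resulting string corresponds to one of the observations above: the level, who sourced the error, the request type, the memory access and the timeout status.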