Friday 15 November 2013

Debugging Stop 0x124 - PCIe Errors Part 1

I've bumped into a Stop 0x124 bugcheck which was sourced from a PCI or PCIe bus on the motherboard. It's been a very long time since I've debugged one of these bugchecks (PCIe that is, not CPU). I did write a tutorial on how to debug these types of crashes, but that tutorial is quite dated now and I've learned quite a few more things since writing it, so I'm going to create an updated version here which will bring everything together. I already know this tutorial is going to have to be split across multiple posts, since there is so much information about PCIe errors that I need to explain. Fortunately, there are some free specification documents available online, so I would download those right away. I'm hopefully going to explain all the fields within the AER (more on that later) and the errors which caused the bugcheck.

Okay, that was a lengthy introduction already, but in this first part, I'm just going to show you the general structure of the dump file and then provide information on which data structures you should look at. These data structures can't be dumped within WinDbg I believe, so you'll need to check the MSDN documentation.

Let's begin:

The parameters given are not much different from their CPU Machine Check Exception counterparts. The first parameter indicates the source of the error. This is usually the Root Port for PCI Express or PCI, or it can even be related to USB/Bluetooth if the computer is an OEM machine. Please note that the source tells us where the error originated from, and does not necessarily mean that the error is being caused by the particular device being shown.

The second parameter is the address of the WHEA_ERROR_RECORD data structure. We can use the !errrec extension on this address to dump the data structure in WinDbg.
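To make the first two parameters a little more concrete, here is a minimal Python sketch which turns them into something readable. The table of source types is my understanding of the WHEA_ERROR_SOURCE_TYPE values and should be checked against the WDK headers; the function name and the example address are made up for illustration.

```python
# Error source types as I understand the WHEA_ERROR_SOURCE_TYPE enumeration.
# Treat the exact values as an assumption and verify them against the WDK.
ERROR_SOURCE_TYPES = {
    0x0: "Machine Check Exception",
    0x1: "Corrected Machine Check",
    0x2: "Corrected Platform Error",
    0x3: "Non-Maskable Interrupt",
    0x4: "PCI Express Error",
}


def describe_stop_0x124(params):
    """Summarise the first two parameters of a Stop 0x124 bugcheck.

    `params` is a sequence of the four bugcheck parameters as integers.
    This helper is hypothetical and is not part of WinDbg.
    """
    source_type, record_address = params[0], params[1]
    return {
        "source": ERROR_SOURCE_TYPES.get(source_type, "Other / unknown"),
        # The command we would type into WinDbg to dump the error record.
        "command": f"!errrec {record_address:016x}",
    }


# Hypothetical parameters, not taken from a real dump file.
summary = describe_stop_0x124((0x4, 0xFFFFFA8012345678, 0x0, 0x0))
print(summary["source"])
print(summary["command"])
```

The only real work here is the lookup on the first parameter; the second parameter is simply passed to !errrec as it is.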

The !errrec extension will produce something like this (I've omitted the less useful information from the image):

The Port Type simply tells us where the error was sourced from and who reported the error; in this case it was the Root Port. The Version field here is important: it tells us which version of the PCI Express specification we need to start reading to understand all the information given to us in the AER (PCI Express Advanced Error Reporting) section. The VenId and the DevId will help identify the device connected to the port which sourced the error; we can enter these values into a PCI database.

In this case, the Device ID fields pointed to an Intel 7500 Chipset PCIe Root Port.
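If you prefer searching by hardware ID rather than typing the two values into a database separately, the VenId and DevId can be combined into the usual PCI\VEN_xxxx&DEV_xxxx form. This is just a small sketch; the function name is my own and the device ID in the example is a placeholder rather than the value from this dump (0x8086 is Intel's vendor ID).

```python
def pci_hardware_id(ven_id: int, dev_id: int) -> str:
    """Format a vendor ID and device ID as a PCI hardware ID string."""
    # Both IDs are 16-bit values, so pad each one to four hex digits.
    return f"PCI\\VEN_{ven_id:04X}&DEV_{dev_id:04X}"


# 0x8086 is Intel; 0x1234 is a placeholder device ID for illustration only.
print(pci_hardware_id(0x8086, 0x1234))
```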

The next important fields are Uncorrectable Error Status, Uncorrectable Error Mask and Uncorrectable Error Severity. Each of these fields is a set of bit flags, which help us identify the errors which have occurred. I'll explain these in more detail in Part 2, but for now, I will quickly give a list of them.

Status - Shows which errors have been found and reported - See PCI_EXPRESS_UNCORRECTABLE_ERROR_STATUS

Mask - Shows which errors are masked, meaning they are treated as not important and are not reported - See PCI_EXPRESS_UNCORRECTABLE_ERROR_MASK

Severity - Shows which errors are treated as fatal and therefore can't be recovered from. These are the errors which we will need to examine, since these errors have caused the bugcheck - See PCI_EXPRESS_UNCORRECTABLE_ERROR_SEVERITY
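The way these three fields combine can be sketched with a few bitwise operations. This is my own reading of how Status, Mask and Severity relate to each other, not code taken from Windows, and the register values in the example are invented.

```python
def classify_uncorrectable(status: int, mask: int, severity: int) -> dict:
    """Split the Status bits into reported, fatal and non-fatal errors.

    Assumption: a Status bit only matters if the same bit is clear in Mask,
    and a reported error is fatal if the same bit is set in Severity.
    """
    reported = status & ~mask & 0xFFFFFFFF
    return {
        "reported": reported,
        "fatal": reported & severity,
        "non_fatal": reported & ~severity & 0xFFFFFFFF,
    }


# Invented values: two errors are flagged in Status, one of them is masked,
# and the remaining one is marked as fatal in Severity.
result = classify_uncorrectable(status=0x00104000, mask=0x00100000, severity=0x00004000)
print({name: hex(value) for name, value in result.items()})
```

The bits left in "fatal" are the ones worth looking up first, since those are the errors the system could not recover from.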


Now, the capitalised characters are the errors which have supposedly been reported, so here's a list of what they mean:

UR - Unsupported Request Error
MTLP - Malformed TLP
SD - Surprise Down
ROF - Receiver Overflow
UC - Unexpected Completion
CT - Completion Timeout
DLP - Data Link Protocol Error
PTLP - Poisoned TLP
FCP - Flow Control Protocol Error
CA - Completer Abort
ECRC - End-to-End Cyclic Redundancy Check Error
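The abbreviations above each correspond to one bit in the Status, Mask and Severity fields. Here is a rough Python sketch which turns a raw value into the list of abbreviations. The bit positions are taken from my reading of the Uncorrectable Error Status Register layout in the PCI Express specification, so please check them against the version of the specification reported in the Version field; the function name and example value are my own.

```python
# Bit position -> (abbreviation, meaning). Positions are an assumption based
# on the PCI Express specification and should be verified before relying on them.
UNCORRECTABLE_ERROR_BITS = {
    4: ("DLP", "Data Link Protocol Error"),
    5: ("SD", "Surprise Down"),
    12: ("PTLP", "Poisoned TLP"),
    13: ("FCP", "Flow Control Protocol Error"),
    14: ("CT", "Completion Timeout"),
    15: ("CA", "Completer Abort"),
    16: ("UC", "Unexpected Completion"),
    17: ("ROF", "Receiver Overflow"),
    18: ("MTLP", "Malformed TLP"),
    19: ("ECRC", "End-to-End Cyclic Redundancy Check Error"),
    20: ("UR", "Unsupported Request Error"),
}


def decode_uncorrectable(value: int) -> list:
    """Return the abbreviations of every error bit set in `value`."""
    return [
        abbreviation
        for bit, (abbreviation, _meaning) in sorted(UNCORRECTABLE_ERROR_BITS.items())
        if (value >> bit) & 1
    ]


# Invented example value with the CT and UR bits set.
print(decode_uncorrectable(0x00104000))
```

The same function works for all three fields, since they share the same bit layout.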

These errors will be explained in more detail in later parts.

The Caps & Control field corresponds to the Advanced Error Capabilities and Control Register, which holds the First Error Pointer and the bits used for ECRC generation and checking.
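As a rough illustration, the Caps & Control value can be broken down like this. The bit layout is my understanding of the register from the PCI Express specification (First Error Pointer in the low five bits, followed by the four ECRC bits), so treat it as an assumption; the function name and the example value are invented.

```python
def decode_aer_caps_control(value: int) -> dict:
    """Decode the low bits of the AER Capabilities and Control value."""
    return {
        # Bit position of the first error recorded in the Status field.
        "first_error_pointer": value & 0x1F,
        "ecrc_generation_capable": bool(value & (1 << 5)),
        "ecrc_generation_enable": bool(value & (1 << 6)),
        "ecrc_check_capable": bool(value & (1 << 7)),
        "ecrc_check_enable": bool(value & (1 << 8)),
    }


# Invented example value, not taken from a real dump file.
for name, field in decode_aer_caps_control(0x000000B4).items():
    print(name, field)
```

A First Error Pointer of 20, for example, would point at bit 20 of the Status field.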

The Header Log field contains the header of the TLP associated with the error, stored as four 32-bit values.

These fields will be explained in more detail in Part 2.

1 comment:

  1. possible if you could tell me what the issue was, because i had the exact same issue, with the same devid etc.. much thanks!