Sunday 17 November 2013

Debugging Stop 0x124 - PCIe Errors Part 2

Continuing from Part 1, this second part elaborates on the errors which were described in my previous blog post. I would strongly recommend downloading the PCIe 1.1 Specification; it isn't freely available from the PCI-SIG website, but you can find a copy through Google. The error table is in Section 6.2.7 of the PCIe 1.1 Specification.

In general, here is what the PCIe AER Extended Capability Structure looks like:
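As a rough text sketch of that layout, the register offsets and the Uncorrectable Error Status bits can be written down as below. The offsets and bit positions are from my reading of the AER section of the specification, so check them against your copy; the dictionary and function names are my own and purely illustrative.

```python
# Register offsets, relative to the start of the AER Extended Capability.
AER_REGISTERS = {
    0x00: "PCI Express Enhanced Capability Header",
    0x04: "Uncorrectable Error Status",
    0x08: "Uncorrectable Error Mask",
    0x0C: "Uncorrectable Error Severity",
    0x10: "Correctable Error Status",
    0x14: "Correctable Error Mask",
    0x18: "Advanced Error Capabilities and Control",
    0x1C: "Header Log (4 DWORDs)",
    0x2C: "Root Error Command",
    0x30: "Root Error Status",
    0x34: "Error Source Identification",
}

# Bit positions in the Uncorrectable Error Status register.
UNCORRECTABLE_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down Error",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request Error",
}


def decode_uncorrectable_status(value):
    """Return the names of every error bit set in a status register value."""
    return [
        name
        for bit, name in sorted(UNCORRECTABLE_BITS.items())
        if value & (1 << bit)
    ]
```

For example, `decode_uncorrectable_status(0x00100000)` reports only the Unsupported Request Error, since bit 20 is the only bit set.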

Let's begin with the Unsupported Request Error (UR). Unsupported Requests are requests which the receiving device does not support, and they can be the result of several things. One is the device power state: any I/O TLP or Memory TLP sent while the power state is D1, D2 or D3 is treated as an Unsupported Request (see Section 5.3.1). I'll explain TLPs later. Several conditions can cause an Unsupported Request error; the PCIe 1.1 Specification lists them with references to the appropriate sections, and I'll list a few of them here for convenience:

The data link between two devices could be down, which is indicated by the DL_Down status being sent to the Transaction Layer (TLPs are part of this layer). Any Non-Posted TLPs being sent are then completed with a status of Unsupported Request and discarded. See Section 2.9.1 for more details. PCIe Posted and Non-Posted Requests are explained here - PCIe posted vs non-posted transactions

See Section 6.5.7 - PCIe Endpoints do not support locking mechanisms, unlike their legacy PCI counterparts which do, although using them is now discouraged. Any MRdLk requests are treated as an Unsupported Request. More information is available here - The PCIe Lock Protocol

Malformed TLP (Malformed Transaction Layer Packet); here I'm going to explain some of the concepts behind TLPs. TLPs belong to the Transaction Layer, which sits on top of the Data Link Layer and the Physical Layer. TLPs are essentially request packets, like IRPs in the I/O Manager subsystem of Windows. These packets contain requests to perform certain operations: memory reads, memory writes, I/O reads and writes, and messages. Each packet header is either 3 or 4 Double Words long (a Double Word, the DWORD data type in Windows, is 32 bits).

The Transaction Layer primarily sends and receives these packets. More Information - PCI Express The Transaction Layer
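To make the header layout a little more concrete, here is a minimal sketch which decodes the first Double Word of a TLP header. It assumes the PCIe 1.1 layout as I understand it (a 2-bit Fmt field, a 5-bit Type field and a 10-bit Length field counted in Double Words); the function name and the returned dictionary are my own invention.

```python
# Fmt field -> (header size in Double Words, whether a data payload follows)
FMT = {
    0b00: (3, False),
    0b01: (4, False),
    0b10: (3, True),
    0b11: (4, True),
}


def decode_tlp_dw0(header):
    """Decode the first Double Word (4 bytes) of a TLP header."""
    b0, b1, b2, b3 = header[:4]
    header_dwords, has_data = FMT[(b0 >> 5) & 0x3]
    length = ((b2 & 0x3) << 8) | b3
    if length == 0:
        length = 1024  # an all-zero Length field encodes 1024 Double Words
    return {
        "header_dwords": header_dwords,
        "has_data": has_data,
        "type": b0 & 0x1F,
        "traffic_class": (b1 >> 4) & 0x7,
        "length_dwords": length,
    }
```

Feeding it `bytes([0x40, 0x00, 0x00, 0x01])` describes a 3 Double Word header with one Double Word of data, which is what a 32-bit memory write of a single Double Word would look like.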

Malformed TLPs are TLPs which break the rules for how a packet must be formed. There are a few reasons why a TLP can become Malformed. The data payload of a TLP mustn't exceed the Max Payload Size (an amount of data in bytes); this setting can be altered in the BIOS. See Section 2.2.7 for the general I/O, Memory and Message rules for TLPs.

The TLP Address and Length combination can't cross a 4KB memory boundary. See Section 7.8.4.

Regarding memory read requests, the TLP length must not exceed the MAX_READ_REQUEST_SIZE - see Section 7.8.4.
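The three rules above can be sketched as a simple checker. This is only an illustration: the function names are hypothetical, lengths are in bytes, and the defaults of 128 and 512 bytes are what I understand the power-on values of Max Payload Size and Max Read Request Size to be, so treat them as assumptions.

```python
def crosses_4kb_boundary(address, length_bytes):
    """True if the first and last byte of the access fall in different 4KB pages."""
    first_page = address // 4096
    last_page = (address + length_bytes - 1) // 4096
    return first_page != last_page


def check_request(kind, address, length_bytes,
                  max_payload_size=128, max_read_request_size=512):
    """Return a list of rule violations for a memory 'read' or 'write' request."""
    problems = []
    if crosses_4kb_boundary(address, length_bytes):
        problems.append("crosses 4KB boundary")
    if kind == "write" and length_bytes > max_payload_size:
        problems.append("payload exceeds Max Payload Size")
    if kind == "read" and length_bytes > max_read_request_size:
        problems.append("read exceeds Max Read Request Size")
    return problems
```

For instance, a 32-byte access starting at address 0xFF0 ends at 0x100F, so it crosses from the first 4KB page into the second and would be flagged.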

Design Assistant for PCI Express - What is the difference between MAX_READ_REQUEST_SIZE and MAX_PAYLOAD_SIZE?

Surprise Down (SD) - This error is detected within the Link Layer of the PCIe topology. I'm not sure exactly how to define how it happens, but it seems to be related to the Sequence Number and Link CRC (LCRC) error detection which is applied to a TLP. I'm guessing the Transaction Layer passes a TLP to the Link Layer, and the Link Layer then performs some error checking on it. The TLP is checked before it is sent to the component at the other end of the link.

I believe that if the TLP isn't sent across the link properly the first time, it is kept in a retry buffer and sent again a few times, until the component at the opposite end of the link gives a positive response. If all of the repeated attempts receive negative responses, then the link is considered to be malfunctioning, and the link status may be changed to the DL_Inactive state. However, the Physical Layer will first attempt to recover or retrain the link (see Section 4.2.6); if this fails, then the link becomes inactive.
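The retry behaviour I've described can be sketched roughly like this. It is a simplified model of my understanding rather than the exact algorithm from the specification: `send_with_replay`, the `transmit` callback and the limit of three replays are all assumptions made for illustration.

```python
def send_with_replay(tlp, transmit, max_replays=3):
    """Send a TLP, replaying it from the retry buffer until it is acknowledged.

    transmit(tlp) is a hypothetical callback which returns True for a
    positive response (Ack) and False for a negative one (Nak or timeout).
    Returns ("acked", attempts) or ("retrain_link", attempts).
    """
    attempts = 0
    while True:
        attempts += 1
        if transmit(tlp):
            return ("acked", attempts)
        if attempts > max_replays:
            # Every replay failed, so hand the problem to the Physical Layer.
            return ("retrain_link", attempts)
```

With a link which never responds positively, the first transmission plus three replays are attempted before the function gives up and asks for the link to be retrained.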

ROF (Receiver Overflow) - This error indicates that a received TLP consumed more flow control credit than the receiver had made available. The flow control credit system seems to be a method of managing TLP ordering (see Section 2.4) and of preventing receiver buffer overflows. It's maintained between the Transaction Layer and the Data Link Layer. Any TLP transferring between the layers of the topology must pass through a flow control credit gate. See Section 2.6.
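Here is a minimal sketch of the receiver side of that credit accounting. It uses plain integers, whereas I believe the real counters are fixed-width and wrap around, and the class and method names are hypothetical.

```python
class ReceiverCredits:
    """Tracks the credits a receiver has handed out against the credits used."""

    def __init__(self, allocated):
        self.allocated = allocated  # total credits advertised to the sender
        self.received = 0           # total credits consumed by received TLPs

    def release(self, credits):
        """Buffer space was freed, so more credits can be advertised."""
        self.allocated += credits

    def receive(self, tlp_credits):
        """Account for a received TLP, flagging an overflow if it doesn't fit."""
        if self.received + tlp_credits > self.allocated:
            return "Receiver Overflow"
        self.received += tlp_credits
        return "ok"
```

A well-behaved sender keeps the same two totals on its side and never transmits a TLP which would push the consumed count past the advertised limit, so a Receiver Overflow suggests the two ends have fallen out of step.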

Unexpected Completion (UC) - A completion was received which doesn't match any of the receiver's outstanding requests.

Completion Timeout (CT) - A completion for a request wasn't received within the expected time interval. See Section 2.8.
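Both of these errors come down to the requester keeping track of its outstanding requests, which can be sketched as below. The class is hypothetical and matches on a bare tag for simplicity; as far as I understand, real hardware matches on the full Transaction ID (Requester ID plus Tag), and the timeout units here are arbitrary.

```python
class Requester:
    """Tracks outstanding Non-Posted requests by tag."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.outstanding = {}  # tag -> time the request was issued

    def issue(self, tag, now):
        self.outstanding[tag] = now

    def on_completion(self, tag):
        """Match a completion against the outstanding requests."""
        if tag not in self.outstanding:
            return "Unexpected Completion"
        del self.outstanding[tag]
        return "ok"

    def timed_out(self, now):
        """Tags of requests which have waited longer than the timeout."""
        return sorted(
            tag for tag, issued in self.outstanding.items()
            if now - issued > self.timeout
        )
```

A completion whose tag was never issued (or was already completed) is reported as unexpected, while a tag which stays in the table past the timeout would be reported as a Completion Timeout.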

Data Link Protocol (DLP) - This indicates that an error has occurred within the Data Link Layer, which sits between the Physical Layer and the Transaction Layer and is used for sending TLPs between different devices. It also provides data integrity checking (see Section 3.5) for TLPs and DLLPs. See Section 3.4.1 and Section 3.

The Data Link Layer also has its own packets called Data Link Layer Packets (DLLPs), which are used to provide functions like power management, flow control information and TLP acknowledgment.

An Ack DLLP is used to acknowledge that the component has successfully received the TLPs up to a given sequence number. A Nak DLLP does the opposite, reporting that a TLP was not received correctly.
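The way Ack and Nak interact with the sender's retry buffer can be sketched like this. It is a simplified model under my own assumptions: the class is hypothetical, and it ignores the fact that real sequence numbers are a fixed width and wrap around.

```python
from collections import OrderedDict


class RetryBuffer:
    """Holds transmitted TLPs until the other end acknowledges them."""

    def __init__(self):
        self.pending = OrderedDict()  # sequence number -> TLP, oldest first

    def store(self, seq, tlp):
        self.pending[seq] = tlp

    def on_ack(self, seq):
        """Ack: everything up to and including seq arrived, so drop it."""
        for s in [s for s in self.pending if s <= seq]:
            del self.pending[s]

    def on_nak(self, seq):
        """Nak: seq was the last good TLP, so replay everything after it."""
        self.on_ack(seq)
        return list(self.pending.values())
```

So after storing TLPs 0, 1 and 2, a Nak carrying sequence number 0 purges the first TLP and returns the other two for replay.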

InitFC1, InitFC2 and UpdateFC DLLPs are used to exchange flow control information between components. It's important to remember that if a receiver has received a malformed TLP, then it should not update its flow control information for that TLP, and thus no UpdateFC DLLP should be sent for it.

See Part 3 for more information on the other errors.
