Sunday 17 November 2013

Debugging Stop 0x124 - PCIe Errors Part 3

Poisoned TLP (PTLP) - A TLP is usually considered "poisoned" when it contains bad data, and it is flagged as such so that the receiver knows the data can't be trusted. Only TLPs which carry data for read or write requests (posted or non-posted) can be considered poisoned. A PTLP is marked by setting the EP bit in the Header of the TLP. Any other forms of bad data are considered as Unsupported Requests (Memory, I/O or Messages). See Section 2.7.2.2.
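
To make the EP bit a little more concrete, here is a minimal Python sketch which pulls the relevant fields out of the first header DWORD of a TLP. It assumes the DWORD has been assembled with header byte 0 as the most significant byte (the dump file may present the header log in a different byte order), and the function name and sample value are made up for illustration.

```python
def decode_tlp_dw0(dw0: int) -> dict:
    """Split the first header DWORD of a TLP into the fields of interest."""
    return {
        "fmt": (dw0 >> 29) & 0x3,        # header size / data present
        "type": (dw0 >> 24) & 0x1F,      # request or completion type
        "td": bool(dw0 & (1 << 15)),     # TLP Digest (ECRC) present
        "ep": bool(dw0 & (1 << 14)),     # set when the TLP is poisoned
        "length_dw": dw0 & 0x3FF,        # payload length in DWORDs
    }


# Hypothetical header DWORD: Fmt=10b, Type=00000b (memory write),
# EP set, no digest, one DWORD of payload.
sample = 0x40004001
fields = decode_tlp_dw0(sample)
print(fields)
```

If `ep` comes back set for the logged header, the TLP was flagged as poisoned before it reached the port which reported the error.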

Flow Control Protocol (FCP) - This is quite a simple error, in that it indicates that a Flow Control Protocol rule has been broken. Flow Control information is carried in FCPs (Flow Control Packets), which are a type of DLLP, as explained in Part 2. See Section 2.6.1 for the full list of rules. Again, I'll add a few for your convenience.

Each Virtual Channel (Data Buffer) uses a separate flow control credit system, and is therefore independent of the other channels. Any TLP received on a Virtual Channel which isn't enabled is treated as a Malformed TLP.
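
The per-channel credit idea can be sketched in a few lines of Python. This is only a simplified model based on my reading of the transmitter gating rule in Section 2.6.1, not how any real device is implemented; the class, the function and the credit values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CreditCounter:
    field_bits: int            # assumed: 8 for header credits, 12 for data credits
    credit_limit: int = 0      # latest limit advertised by the receiver
    credits_consumed: int = 0  # running total used by the transmitter

    def may_send(self, needed: int) -> bool:
        # Modular comparison so the counters can wrap around safely.
        modulus = 1 << self.field_bits
        remaining = (self.credit_limit - (self.credits_consumed + needed)) % modulus
        return remaining <= modulus // 2

    def consume(self, needed: int) -> None:
        self.credits_consumed = (self.credits_consumed + needed) % (1 << self.field_bits)


# One independent counter per enabled Virtual Channel (only VC0 here).
vc_data_credits = {0: CreditCounter(field_bits=12, credit_limit=8)}


def try_send(vc: int, data_credits: int) -> str:
    counter = vc_data_credits.get(vc)
    if counter is None:
        return "malformed"   # VC isn't enabled, receiver would flag a Malformed TLP
    if not counter.may_send(data_credits):
        return "blocked"     # not enough credits, transmitter has to wait
    counter.consume(data_credits)
    return "sent"


print(try_send(0, 4), try_send(0, 4), try_send(0, 1), try_send(1, 1))
```

Each channel keeps its own counters, so running out of credits on one Virtual Channel doesn't affect any of the others.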


Completer Abort (CA) - Completer Aborts are errors which are generally caused by requests which the device physically can't process. Any requester which is returned a completion status of CA must free all the resources and buffer space for that request, and treat it as the last request, so that bad requests of that type don't keep being sent. The receiving device or port is the device which flags this error.


The table is a little misleading here, so I would suggest checking Section 2.2.9.
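
For reference, the completion status which the requester sees is a 3-bit field in the completion header. Below is a minimal Python sketch of decoding it, assuming the second header DWORD is available as an integer with header byte 4 as the most significant byte. The function name and sample value are made up for illustration.

```python
COMPLETION_STATUS = {
    0b000: "Successful Completion (SC)",
    0b001: "Unsupported Request (UR)",
    0b010: "Configuration Request Retry Status (CRS)",
    0b100: "Completer Abort (CA)",
}


def completion_status(dw1: int) -> str:
    """Return the completion status name from the second header DWORD."""
    code = (dw1 >> 13) & 0x7   # status sits in bits 15:13
    return COMPLETION_STATUS.get(code, f"Reserved ({code:03b})")


# Hypothetical completion header DWORD: Completer ID 0x0100,
# status 100b, byte count 4.
sample = 0x01008004
print(completion_status(sample))
```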

End-to-End CRC (ECRC) - Support for this kind of error checking and reporting is entirely optional, and as a result I'm not entirely sure that it will always be reported in the dump file, so be watchful of this. Anyhow, ECRC support and enablement are always reported within the AER Capabilities and Control Register (Caps & Control in the dump file). The ECRC value itself is carried in the Digest field of the TLP. See Sections 2.7.1 and 7.10.7.

ECRC is used to help with data integrity reporting, and any reported errors are sourced from the receiving port. Devices which support ECRC and have it enabled must apply it to all of the TLPs which they originate.
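
As a rough illustration, the Caps & Control value from the dump file can be broken down with a short Python sketch like the one below. The bit positions are taken from my reading of the AER Capabilities and Control Register layout in the specification; the function name and the sample value are hypothetical.

```python
def decode_aer_caps_control(value: int) -> dict:
    """Break the AER Capabilities and Control value into its ECRC-related bits."""
    return {
        "first_error_pointer": value & 0x1F,               # bits 4:0
        "ecrc_generation_capable": bool(value & (1 << 5)),
        "ecrc_generation_enable": bool(value & (1 << 6)),
        "ecrc_check_capable": bool(value & (1 << 7)),
        "ecrc_check_enable": bool(value & (1 << 8)),
    }


# Hypothetical Caps & Control value: all four ECRC bits set,
# First Error Pointer = 19 (the ECRC Error bit of the Uncorrectable Error Status).
sample = 0x000001F3
for name, bit in decode_aer_caps_control(sample).items():
    print(f"{name}: {bit}")
```

If the capable bits are clear, the device doesn't implement ECRC at all, and an ECRC error shouldn't be expected in the dump file for that device.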

References:


Stop 0x124 Troubleshooting (New Method) - SevenForums (my old tutorial)


Note: The most important documentation to check is the 1.1 PCIe Base Specification.

2 comments:

  1. Hello, I am debugging a bugcheck 0x80. Reading your blog I came to understand
    that this is a case of pci-e device sending Unsupported request to cpu due to which
    NMI was sent to cpu.
    I am having trouble identifying what pci-e request was unsupported in this case.
    Can you please point to some additional debugging tactic to pinpoint this issue?

  2. Hi Harry, I am having the exactly same debugging results as yours, same device id, severity, etc. I was wondering if you had fixed this problem. If so, what was the problem. I hope you could help me out
