Wednesday 29 May 2013

Understanding Bit Flips

I will not take all the credit for writing this tutorial about how to debug and understand potential flipped bits in CPU registers. I would like to say thanks to Vir Gnarus for helping me to understand this very important method.

Okay, let's start with the tutorial. It took me a while to find the thread I used again, so be grateful ;)


CONTEXT: fffff880095630f0 -- (.cxr 0xfffff880095630f0)
rax=fffffa800f5177c8 rbx=fffffa800f5177c0 rcx=f7fffa800f5177c8
rdx=fffffa800d01ebf0 rsi=00000000014a2e00 rdi=fffffa800f616640
rip=fffff800030c7ccb rsp=fffff88009563ad0 rbp=00000000014a2e70
r8=0000000000000000 r9=0000000000000000 r10=fffffa800cb045d0
r11=00000000001f0003 r12=fffff88002fd5180 r13=fffffa800f5177c8
r14=fffffa800f4e09d8 r15=fffffa800f5177c8
iopl=0 nv up ei pl zr na po nc
cs=0010 ss=0018 ds=002b es=002b fs=0053 gs=002b efl=00010246
nt!KiInsertQueue+0xab:
fffff800`030c7ccb 48894108 mov qword ptr [rcx+8],rax ds:002b:f7fffa80`0f5177d0=????????????????
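As we can see from the context record above, the rcx register, which the CPU was using to form the address it was writing to, holds f7fffa800f5177c8, whereas rax, r13 and r15 all hold fffffa800f5177c8. The 7 near the start of the address in rcx is the result of the flipped bit.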




3: kd> .formats f7fffa800f5177c8
Evaluate expression:
Hex: f7fffa80`0f5177c8
Decimal: -576466799360378936
Octal: 1737777650001724273710
Binary: 11110111 11111111 11111010 10000000 00001111 01010001 01110111 11001000
Chars: .....Qw.
Time: ***** Invalid FILETIME
Float: low 1.03276e-029 high -1.03837e+034
Double: -1.05588e+270
   

3: kd> .formats fffffa800f5177c8
Evaluate expression:
Hex: fffffa80`0f5177c8
Decimal: -6047056955448
Octal: 1777777650001724273710
Binary: 11111111 11111111 11111010 10000000 00001111 01010001 01110111 11001000
Chars: .....Qw.
Time: ***** Invalid FILETIME
Float: low 1.03276e-029 high -1.#QNAN
Double: -1.#QNAN
Using the .formats command, we can obtain the binary representation of the memory addresses contained within the CPU registers, and then compare the two binary values to confirm the flipped bit. As you can see, the address with the random 7 has a 0 instead of a 1 in the fifth bit of its first byte (11110111 instead of 11111111); this indicates a flipped bit.

The crash resulted because the corrupted memory address within the CPU register was being accessed. The usual causes of these bit flips are the CPU, PSU or motherboard. A large number of changed bits can instead point to the hard drive or the RAM being at fault.
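
The same comparison can be done arithmetically: XOR-ing the corrupted value with the expected one leaves a 1 only in the positions that differ. This is a minimal Python sketch and not something the debugger produces; the function name flipped_bits is my own, and it assumes you already know what the value should have been (here taken from rax, r13 and r15).

```python
def flipped_bits(observed, expected, width=64):
    """Return the bit positions (0 = least significant) where the two values differ."""
    diff = observed ^ expected  # XOR leaves a 1 wherever the values disagree
    return [bit for bit in range(width) if (diff >> bit) & 1]


rcx = 0xF7FFFA800F5177C8       # value seen in rcx in the context record
expected = 0xFFFFFA800F5177C8  # value held by rax, r13 and r15

print(flipped_bits(rcx, expected))  # [59]
```

A single position in the result means a single flipped bit; a long list would suggest wider corruption.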








Debugging Stop 0x124

Stop 0x124s are fundamentally caused by hardware errors, although they can also be caused by corrupted drivers reporting false errors to the Windows operating system. I would have written a full blog post on how to debug a Stop 0x124; however, a while ago I wrote an entire tutorial on this issue, so I will simply link to that tutorial and save a good hour of formatting the text and merging the content of separate posts.

Here's the link - Debugging Stop 0x124 - The Guide


Monday 20 May 2013

Debugging Stop 0x101

I would usually explain how to use the 'old method' of finding which CPU or processor core has stopped responding to interrupts, and therefore causing a hang, but there is a more efficient method of analyzing Stop 0x101's for any device driver faults (thanks to muhahaa for introducing this method). More Information - Class 101 for 0x101 Bugchecks

We can use the !running extension to quickly produce the information contained within the PRCBs of each processor. The !running extension takes two parameters, which are:


  • -i     This causes the debugger to show idle processors as well as active processors.
  • -t     This causes the debugger to display a stack trace for each processor.
We can then use the !running extension with the two parameters like so:



0: kd> !running -ti

System Processors: (000000000000000f)
Idle Processors: (0000000000000000) (0000000000000000) (0000000000000000) (0000000000000000)

Prcbs Current (pri) Next (pri) Idle
0 fffff80002dfae80 fffffa8006a2fad0 (16) fffff80002e08cc0 ................

Child-SP RetAddr Call Site
fffff880`033164e8 fffff800`02cd6a3a nt!KeBugCheckEx
fffff880`033164f0 fffff800`02c896e7 nt! ?? ::FNODOBFM::`string'+0x4e3e
fffff880`03316580 fffff800`031fa895 nt!KeUpdateSystemTime+0x377
fffff880`03316680 fffff800`02c7c153 hal!HalpHpetClockInterrupt+0x8d
fffff880`033166b0 fffff800`02cb5483 nt!KiInterruptDispatchNoLock+0x163
fffff880`03316840 fffff800`02c84a0c nt!KxFlushEntireTb+0x93
fffff880`03316880 fffff800`02c699e4 nt!KeFlushMultipleRangeTb+0x28c
fffff880`03316950 fffff800`02d00f15 nt!MiAgeWorkingSet+0x64a
fffff880`03316b00 fffff800`02c69b16 nt! ?? ::FNODOBFM::`string'+0x4c7f6
fffff880`03316b80 fffff800`02c69fc3 nt!MmWorkingSetManager+0x6e
fffff880`03316bd0 fffff800`02f1dede nt!KeBalanceSetManager+0x1c3
fffff880`03316d40 fffff800`02c70906 nt!PspSystemThreadStartup+0x5a
fffff880`03316d80 00000000`00000000 nt!KiStartSystemThread+0x16

1 fffff880009ec180 fffffa80077de060 ( 8) fffffa8008ffaa00 (15) fffff880009f6fc0 ................

Child-SP RetAddr Call Site
00000000`00000000 00000000`00000000 0x0

2 fffff88002f64180 fffffa800a2a9640 ( 8) fffffa8009f23060 (22) fffff88002f6efc0 ................

Child-SP RetAddr Call Site
00000000`00000000 00000000`00000000 0x0

3 fffff88002fd5180 fffffa8009267b50 (11) fffffa8007a61590 (26) fffff88002fdffc0 ................

Child-SP RetAddr Call Site
00000000`00000000 00000000`00000000 0x0
We can then obtain a raw stack trace from each idle processor by using the !thread extension with the address highlighted in red, as seen in this blog post - Stack Text Commands
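
The "System Processors" and "Idle Processors" values in the output above read as bitmasks, with one bit per processor. As a rough Python sketch under that assumption (the function name processors_in_mask is my own), the masks can be turned into processor numbers like this:

```python
def processors_in_mask(mask):
    """List the processor numbers whose bit is set in a bitmask."""
    return [cpu for cpu in range(mask.bit_length()) if (mask >> cpu) & 1]


system_mask = int("000000000000000f", 16)  # "System Processors" value above
idle_mask = int("0000000000000000", 16)    # first "Idle Processors" value above

print(processors_in_mask(system_mask))  # [0, 1, 2, 3]
print(processors_in_mask(idle_mask))    # []
```

This matches the four PRCB entries (0 to 3) listed in the !running output.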






Thursday 16 May 2013

!error and NTSTATUS Errors

This is going to be a very short post; however, I still feel it's important to understand how to use the !error extension in order to extract some readable and understandable information about an NTSTATUS error.

Here's a current list of NTSTATUS Errors - 2.3.1 NTSTATUS values

NTSTATUS values are used by kernel-mode drivers which support standard driver routines and driver support routines. Some driver routines use NTSTATUS as their return type, in order to report success values, informational values, warnings and error values.

We can therefore use the !error extension with the NTSTATUS value displayed in the dump:


Stop: 0x0000007E (0xC0000005, 0x95E5529C, 0xA12C0B40, 0xA12C0720)

Here's a little snippet I've taken from a dump. Notice the 0xC0000005: this is an NTSTATUS error, and we can use the !error extension with this value to display the following result:


STATUS_ACCESS_VIOLATION

The instruction at 0x%08lx referenced memory at 0x%08lx. The memory could not be %s.
This is usually due to drivers referencing invalid memory addresses.
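
The four categories mentioned earlier (success, informational, warning and error) are encoded in the top two bits of the value itself. Here is a small Python sketch of that layout as I understand it from the NTSTATUS documentation linked above: severity in bits 31-30, a customer flag in bit 29, the facility in bits 27-16 and the code in bits 15-0. The function name decode_ntstatus and the dictionary keys are my own.

```python
SEVERITY = {0: "Success", 1: "Informational", 2: "Warning", 3: "Error"}


def decode_ntstatus(value):
    """Split a 32-bit NTSTATUS value into its documented bit fields."""
    return {
        "severity": SEVERITY[(value >> 30) & 0x3],  # bits 31-30
        "customer": bool((value >> 29) & 0x1),      # bit 29, set for non-Microsoft codes
        "facility": (value >> 16) & 0xFFF,          # bits 27-16
        "code": value & 0xFFFF,                     # bits 15-0
    }


print(decode_ntstatus(0xC0000005))
# {'severity': 'Error', 'customer': False, 'facility': 0, 'code': 5}
```

This only splits the number into fields; the readable name and message still come from !error.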

  






Wednesday 15 May 2013

Understanding Page Faults

To understand Page Faults, we must understand the differences between Virtual and Physical memory types.

Virtual memory is an abstraction provided by the operating system. Each process has its own virtual memory address space, which is divided into fixed-size blocks called Pages, so each page contains a fixed number of virtual addresses. Pages holding the data and instructions of a process can be kept in RAM, or stored on the hard-drive in the page file when RAM is short.

Physical memory is simply the addresses backed by RAM. We also need to understand that the amount of usable RAM is limited by the address space usable by the operating system.

When a process accesses memory, the virtual address provided by the process must be mapped to a physical address within the RAM. This process is known as Paging, and is handled by the Memory Management Unit (MMU) within the CPU, using mappings maintained by the operating system.



Page Table and Mappings

The Page Table is used by the operating system to store mappings between virtual and physical memory addresses. A mapping is simply the correspondence between a virtual and physical memory address. Each mapping is known as a Page Table Entry (PTE).

So, where do Page Faults come into the equation?

A Page Fault is a type of interrupt; as the name suggests, the CPU stops what it was running and switches to a handler in the operating system. In terms of a Page Fault, the interrupt occurs when a process accesses a virtual memory address and the MMU cannot translate it into a physical memory address in RAM, for example because the page is currently stored on the hard-drive. The operating system then loads the data for the process and the access is retried.
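
To make the idea of a mapping concrete, here is a toy Python sketch. It is not how Windows or the MMU implement page tables (real page tables are multi-level hardware structures); it assumes 4 KB pages and a plain dictionary from virtual page number to physical page number, and the names PageFault, translate and page_table are purely illustrative.

```python
PAGE_SIZE = 4096  # assumed page size of 4 KB


class PageFault(Exception):
    """Raised when no mapping exists for the requested virtual page."""


def translate(page_table, virtual_address):
    """Translate a virtual address using a {virtual page: physical page} table."""
    page_number, offset = divmod(virtual_address, PAGE_SIZE)
    if page_number not in page_table:
        # No entry for this page: the operating system would have to step in
        raise PageFault("no mapping for virtual page 0x{:x}".format(page_number))
    return page_table[page_number] * PAGE_SIZE + offset


page_table = {0x10: 0x2A, 0x11: 0x07}  # two made-up page table entries

print(hex(translate(page_table, 0x10123)))  # 0x2a123
```

The offset within the page is carried over unchanged; only the page number is looked up.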


Invalid Page Fault

An Invalid Page Fault occurs when an invalid virtual address is referenced; this is usually due to a corrupt page table or page file corruption.

Tuesday 14 May 2013

Debugging Stop 0x9F - Blocked IRPs


Stop 0x9F Debugging Guide

How is it caused?

Typically, a Stop 0x9F with the first parameter holding the value of 3 means that a certain device object (the Windows representation of an installed device) has been holding an IRP for too long, and is therefore blocking any further IRPs from being processed.

At this point, for those who do not understand what an IRP is or how it works: quite simply, an IRP is an I/O request packet. This data structure is used by the Windows operating system and drivers to communicate with each other. The packets are handled by the I/O Manager, which then routes them to the appropriate destination.

Debugging the Stop 0x9F:


Now that you understand what an IRP is, we can look into how a Stop 0x9F may be debugged. Here are the parameters:


DRIVER_POWER_STATE_FAILURE (9f)
A driver has failed to complete a power IRP within a specific time (usually 10 minutes).
Arguments:
Arg1: 0000000000000003, A device object has been blocking an Irp for too long a time
Arg2: fffffa8005bd7060, Physical Device Object of the stack
Arg3: fffff80000b9c3d8, nt!TRIAGE_9F_POWER on Win7, otherwise the Functional Device Object of the stack
Arg4: fffffa8005f6bc50, The blocked IRP
We can see there is a blocked IRP packet, and fortunately we can analyze this IRP packet and check which Device Object it belongs to.


0: kd> !irp fffffa8005f6bc50
Irp is active with 4 stacks 3 is current (= 0xfffffa8005f6bdb0)
No Mdl: No System Buffer: Thread 00000000: Irp stack trace.
cmd flg cl Device File Completion-Context
[ 0, 0] 0 0 00000000 00000000 00000000-00000000

Args: 00000000 00000000 00000000 00000000
[ 0, 0] 0 0 00000000 00000000 00000000-00000000

Args: 00000000 00000000 00000000 00000000

>[ 16, 2] 0 e1 fffffa800926b050 00000000 fffff80004ad2200-fffffa8007fc5a10 Success Error Cancel pending
*** WARNING: Unable to verify timestamp for k57nd60a.sys
*** ERROR: Module load completed but symbols could not be loaded for k57nd60a.sys

\Driver\k57nd60a nt!PopSystemIrpCompletion
Args: 00014400 00000000 00000004 00000002
[ 0, 0] 0 0 00000000 00000000 00000000-fffffa8007fc5a10

Args: 00000000 00000000 00000000 00000000
The !irp extension is used with the address from parameter 4, and displays information about the specified IRP. The small > points to the driver which was active at the time of the crash. Notice the two numbers within the [    ] box; these are called function codes. The first number is the major function code and the second number is the minor function code, both shown in hexadecimal.

The major function code 16 (0x16, IRP_MJ_POWER) means that the IRP has been sent to a power-related stack, with the minor function code 2 (IRP_MN_SET_POWER) indicating that a request to change the power state has been sent.

Notice one last thing, the Success Error Cancel flags. Success indicates that the IRP completion routine will be called if the IRP completes successfully, Error indicates that the completion routine will be called when the IRP completes with an error, and Cancel means that the completion routine will be called when the current IRP is canceled.
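
The Success Error Cancel pending text is a decoded form of the cl column (e1 in the active stack location). The Python sketch below shows how those names can be recovered from the hexadecimal value. The bit values are the SL_* stack location control flags and the IRP_MN_* power minor codes as I understand them from wdm.h, so treat them as assumptions to check against your own WDK headers; the function and table names are my own.

```python
# Stack location control flags, listed in the order !irp prints them
CONTROL_FLAGS = [
    (0x40, "Success"),  # SL_INVOKE_ON_SUCCESS
    (0x80, "Error"),    # SL_INVOKE_ON_ERROR
    (0x20, "Cancel"),   # SL_INVOKE_ON_CANCEL
    (0x01, "pending"),  # SL_PENDING_RETURNED
]

IRP_MJ_POWER = 0x16
POWER_MINOR_CODES = {
    0x00: "IRP_MN_WAIT_WAKE",
    0x01: "IRP_MN_POWER_SEQUENCE",
    0x02: "IRP_MN_SET_POWER",
    0x03: "IRP_MN_QUERY_POWER",
}


def decode_control(cl):
    """Return the names of the control flags set in the cl column."""
    return [name for bit, name in CONTROL_FLAGS if cl & bit]


def decode_power_minor(major, minor):
    """Name the minor code of a power IRP, or None if it is not a known power IRP."""
    if major != IRP_MJ_POWER:
        return None
    return POWER_MINOR_CODES.get(minor)


print(decode_control(0xE1))            # ['Success', 'Error', 'Cancel', 'pending']
print(decode_power_minor(0x16, 0x02))  # IRP_MN_SET_POWER
```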












Monday 13 May 2013

Checking System Information with !sysinfo

Checking System Information

There will most certainly be times when you are debugging and need to know information about the system you are debugging. Fortunately, there is a very useful extension provided by the Windows Debugger, which we can use to gain such invaluable information from dump files. The command is !sysinfo.

The extension is able to provide information such as the motherboard model with revision number, the BIOS model with the timestamp and current version of the BIOS, and the clockspeed of the CPU.


!sysinfo Parameters:

cpuinfo: This will provide basic information about the CPU.

cpuspeed: This will provide the maximum and current clockspeed of the CPU (ideal for overclocked systems).

machineid: This will provide information about the system, which includes BIOS, SMBIOS, firmware and motherboard.

smbios: This will provide information regarding the RAM, BIOS, CPU and SMBIOS.





Stack Text Commands

The stack text is one of the most fundamental elements of a dump file, and shouldn't be overlooked. The stack text will contain all the saved function calls used by drivers and kernel modules at the time of the crash. There are a few different commands which can be used to produce a stack unwind.

The three main stack unwind commands I tend to use are listed as follows:

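kv: This will produce a stack unwind with the symbols, module names and memory addresses, along with the first three parameters passed to each function and extra frame information (such as FPO data and, where available, trap frame addresses).

k: This will produce a stack unwind with only the child stack pointer, the return address and the call site (module and function name) for each frame.


kb: This is the stack text backtrace, and will provide similar information to the kv command (the call stack plus the first three parameters of each function) without the extra frame information. This is useful when needing to check the call stack within a particular context, such as after setting a context record with .cxr.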


Note: The stack text is produced in reverse order. The first call is at the bottom of the stack, whereas the last call, usually KeBugCheckEx (when the BSOD was produced), is at the top of the stack text.


The Raw Stack Text:

The stack text will usually be set to the context of the crash, and may not always contain all the information which could assist us in our debugging efforts; therefore we need to gather a raw stack text.

Firstly, we need the details of the crashing thread; we can use the !thread extension in order to display them.


0: kd> !thread
GetPointerFromAddress: unable to read from fffff80004abd000
THREAD fffff80004a0ecc0 Cid 0000.0000 Teb: 0000000000000000 Win32Thread: 0000000000000000 RUNNING on processor 0
Not impersonating
GetUlongFromAddress: unable to read from fffff800049fcba4
Owning Process fffff80004a0f180 Image: <Unknown>
Attached Process fffffa80052059e0 Image: System
fffff78000000000: Unable to get shared data
Wait Start TickCount 17103124
Context Switch Count 11139370 IdealProcessor: 0
ReadMemory error: Cannot get nt!KeMaximumIncrement value.
UserTime 00:00:00.000
KernelTime 00:00:00.000
Win32 Start Address nt!KiIdleLoop (0xfffff8000487c8f0)
Stack Init fffff80000b9cc70 Current fffff80000b9cc00
Base fffff80000b9d000 Limit fffff80000b97000 Call 0
Priority 16 BasePriority 0 UnusualBoost 0 ForegroundBoost 0 IoPriority 0 PagePriority 0
Notice the two highlighted values, Base and Limit. Together they give the address range of the current thread's stack, and more importantly the extent of our raw stack text.

So, let's go and get our raw stack:

dps fffff80000b97000 fffff80000b9d000 
Press Enter, and you will receive a very large stack text for the entire context of the crashing thread.


fffff800`00b9c588 fffff880`102924cbUnable to load image \SystemRoot\system32\DRIVERS\nvlddmkm.sys, Win32 error 0n2
*** WARNING: Unable to verify timestamp for nvlddmkm.sys
*** ERROR: Module load completed but symbols could not be loaded for nvlddmkm.sys
nvlddmkm+0x1fb4cb 
You will notice that certain modules have symbol problems. Don't worry: if you have set up your symbol server correctly, then this should appear only for third-party driver modules which do not have their symbols made publicly available. The nvlddmkm.sys driver seems to be a possible cause for the crash, and therefore we should use some other commands to investigate further.
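
As a side note on the dps command used earlier, the two addresses passed to it are the Limit and Base values from the !thread output. The following Python sketch simply does the arithmetic and builds the command string; the function name stack_range_command is my own and nothing here talks to the debugger.

```python
def stack_range_command(limit, base):
    """Return the stack size in bytes and the dps command covering Limit..Base."""
    size = base - limit  # the stack grows down from Base towards Limit
    command = "dps {:x} {:x}".format(limit, base)
    return size, command


limit = 0xFFFFF80000B97000  # "Limit" from the !thread output
base = 0xFFFFF80000B9D000   # "Base" from the !thread output

size, command = stack_range_command(limit, base)
print(size // 1024, "KB")  # 24 KB
print(command)             # dps fffff80000b97000 fffff80000b9d000
```

A 24 KB range explains why the resulting raw stack text is so long.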





Driver Verifier - Command Line

What is Driver Verifier?

Driver Verifier is a very useful tool which is provided by the Windows operating system. The filename is verifier.exe. The tool is used to test for corrupted or misbehaving drivers, especially third-party drivers.

If you wish to enable Driver Verifier, then visit these two links:



Driver Verifier Command-Line Options:

To access and use these commands, you must use an elevated Command Prompt to do so, please follow these steps:
  1. Click Start or use the Windows flag key on your keyboard.
  2. In "Search programs and files" - type: cmd
  3. Right-Click "cmd" and then select Run As Administrator
  4. Accept the UAC prompt
Figure 1 - Elevated Command Prompt

Now, you will want to simply enter the desired commands into Command Prompt and Press Enter.

verifier /reset: This will clear all the current settings. After the next boot, Driver Verifier will no longer check any drivers.

verifier /query: This will show the current activity of Driver Verifier, and what drivers are being checked. 


verifier /querysettings: This will show the current settings and the drivers which will be checked after rebooting the system.


verifier /volatile: This will change any settings without having to reboot the system. The command can be used with /adddriver and /removedriver to check specific drivers. 
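
If you find yourself typing these commands often, they can be assembled in a script first. The Python sketch below only builds the argument lists and does not run anything; the helper name verifier_args is my own, and the driver name is just an example taken from the dump in the Stop 0x9F post. Actually running the result would still need an elevated prompt.

```python
def verifier_args(option, *drivers):
    """Build the argument list for one of the verifier options listed above."""
    simple = {"reset", "query", "querysettings"}
    volatile = {"adddriver", "removedriver"}
    if option in simple:
        return ["verifier", "/" + option]
    if option in volatile and drivers:
        # /volatile applies the change without a reboot
        return ["verifier", "/volatile", "/" + option] + list(drivers)
    raise ValueError("unsupported option or missing driver name: " + option)


print(" ".join(verifier_args("query")))
print(" ".join(verifier_args("adddriver", "k57nd60a.sys")))
```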












Checking Drivers - Common Commands

Checking Drivers - What To Do?

Drivers are the main cause of BSODs; however, most BSODs will not point directly to the driver causing the problem. There are some basic principles you should follow when checking for any problematic or outdated drivers. Firstly, make sure you update your drivers on a regular basis: join a forum or visit the driver's website to ensure you have the most stable and up to date drivers for your hardware. Remember that drivers, especially those for graphics cards, can improve the performance of your hardware.


A great tool for checking for problematic drivers is Driver Verifier.


Common Commands:

lm - 

The lm command is used to list the driver modules loaded at the time of the crash. Only the address range and name of each module are shown; a timestamp for each module is not included. This is useful for checking for known problematic drivers.


lmtsm - 

This command is very similar to the lm command, although more information is displayed for the driver modules loaded at the time of the crash. The timestamp of each module is included (t), and the list is sorted by module name (sm). This command is very useful when checking for very outdated or known problematic drivers.

lmvm - 

This command can be used to find detailed information about a specified module, such as timestamp, address, checksum, module name and directory in which it is stored.

Syntax:

lmvm [Module Name]

*Note* Remember to leave out the file extension from the module name e.g. .sys
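
As a small illustration of the note above, this Python sketch turns a driver file name or path into the matching lmvm command by dropping the directory and the file extension. The helper name lmvm_command is my own; ntpath is used so that Windows-style paths are handled on any platform.

```python
import ntpath


def lmvm_command(driver_file):
    """Build an lmvm command from a driver file name or Windows path."""
    file_name = ntpath.basename(driver_file)     # drop any directory part
    module, _extension = ntpath.splitext(file_name)  # drop .sys and similar
    return "lmvm " + module


print(lmvm_command("nvlddmkm.sys"))  # lmvm nvlddmkm
print(lmvm_command(r"\SystemRoot\system32\DRIVERS\nvlddmkm.sys"))  # lmvm nvlddmkm
```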