Happy New Year!
Tuesday, 31 December 2013
Happy New Year 2014 (Almost)
Well, as we all know, it's the first day of 2014 tomorrow; New Year's Day. I thought I would write a very quick blog post wishing all my readers a Happy New Year, just in case I don't have time tomorrow to write one up.
Sunday, 29 December 2013
Timeout Detection and Recovery (Stop 0x116) Internals
Stop 0x116 and Stop 0x117 are largely the same bugcheck. There is also Stop 0x119, which is related to problems caused by the video scheduler. However, this blog post is going to look at the internals of Timeout Detection and Recovery, explain what this recovery process is, and show how it may lead to a Stop 0x116 or Stop 0x117 bugcheck.
Windows Vista introduced a feature called TDR (Timeout Detection and Recovery) which, as the name suggests, enables the graphics driver to recover from hardware time-outs instead of the system crashing completely.
The GPU Scheduler first detects that a graphics task is taking longer than it should, and attempts to preempt this task and its associated thread. If the GPU Scheduler is unable to complete or preempt the task within the TDR Timeout period, then the GPU is considered to be frozen, and the preparation for the recovery process begins.
The GPU Scheduler then calls the DxgkDdiResetFromTimeout routine, which informs the graphics driver that the operating system has detected a timeout and that the GPU will need to be reset. The driver must also stop accessing any form of memory. The operating system ensures that no other driver threads are running at the same time as the DxgkDdiResetFromTimeout routine. Furthermore, access to the frame buffer is not permitted during the reset, and the PLL for the memory controller may be reprogrammed. The PLL, or Phase Locked Loop, is used for digital clock signal synchronization for data transfers. A Frame Buffer, on the other hand, holds the bitmap of pixels forming an entire image in Video RAM (VRAM), ready to be sent to the monitor for output. The KeSynchronizeExecution routine may be called to synchronize parts of the reset with the driver's interrupt service routine (ISR).
If the DxgkDdiResetFromTimeout routine fails, then the system will bugcheck with a Stop 0x116. Otherwise, the recovery stage will be started and the graphics stack will be reset.
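The decision flow described so far can be sketched as a small function. This is only a model of the prose above, not code from the graphics stack: the structure, the field names and the outcome names are all hypothetical, and the real scheduler tracks far more state.

```c
#include <stdbool.h>

/* Outcomes of the simplified TDR flow described above (hypothetical names). */
enum tdr_outcome {
    TDR_TASK_COMPLETED,  /* task finished or was preempted in time */
    TDR_RECOVERED,       /* reset succeeded; graphics stack is reset */
    TDR_BUGCHECK_0X116   /* reset failed; system bugchecks */
};

/* Hypothetical inputs standing in for what the GPU Scheduler observes. */
struct tdr_inputs {
    unsigned elapsed_ms;     /* how long the task has been running */
    unsigned timeout_ms;     /* TDR Timeout period */
    bool preempt_succeeded;  /* could the scheduler preempt the task? */
    bool reset_succeeded;    /* did the driver's reset routine succeed? */
};

enum tdr_outcome tdr_evaluate(const struct tdr_inputs *in)
{
    /* Still within the timeout, or preemption worked: nothing to recover. */
    if (in->elapsed_ms < in->timeout_ms || in->preempt_succeeded)
        return TDR_TASK_COMPLETED;

    /* The GPU is considered frozen, so the reset routine is called. */
    if (!in->reset_succeeded)
        return TDR_BUGCHECK_0X116;

    return TDR_RECOVERED;
}
```

Passing a task that has exceeded the timeout, could not be preempted and whose reset failed yields TDR_BUGCHECK_0X116, which matches the failure path described above.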
Once the DxgkDdiResetFromTimeout routine has returned STATUS_SUCCESS, the operating system begins to clean up any resources which are no longer in use. Other driver routines may be called here, which I will explain below.
For example, let's begin with the DxgkDdiBuildPagingBuffer routine. This routine is called if an allocation was paged into a memory segment. A short explanation of memory segments and video memory is needed here to help explain this routine.
Memory Segments are used by the Miniport driver to describe the GPU's address space to the Video Memory Manager. Memory Segments are generally used to organize video memory resources. The driver reports the list of supported segment types through the DxgkDdiQueryAdapterInfo routine, and describes each segment with the DXGK_SEGMENTDESCRIPTOR data structure.
When the Video Memory Manager wishes to place a certain video resource in a memory segment, the driver checks which segment (by segment identifier) is most suitable for the resource in question. An allocation is created with the DxgkDdiCreateAllocation routine, and each allocation is then described with the DXGK_ALLOCATIONINFO data structure.
The information above should be enough to understand the DxgkDdiBuildPagingBuffer routine and its role in releasing allocations. When this routine is called after a reset, a paging buffer is built, which is a DMA buffer for use by the GPU.
In this situation, the paging buffer is built for a transfer operation, so the Operation member of the DXGKARG_BUILDPAGINGBUFFER data structure is set to DXGK_OPERATION_TRANSFER, which normally moves the content of an allocation from one location to another.
The Transfer.Size member is set to 0, since the content would have been lost during the reset.
On the other hand, if the Memory Segment was an aperture memory segment (physical address space assigned to an external device), then the Operation member of DXGKARG_BUILDPAGINGBUFFER is assigned the value DXGK_OPERATION_UNMAP_APERTURE_SEGMENT, which unmaps the allocation from the aperture.
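To summarise the two cases, here is a minimal sketch of how the request could be chosen after a reset. The enum and structure below are hypothetical stand-ins that I have made up for illustration; they are not the WDK definitions of DXGKARG_BUILDPAGINGBUFFER or its operation values.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical mirrors of the two operations named above; these are NOT
   the WDK definitions, only illustrative stand-ins. */
enum sketch_operation {
    SKETCH_OPERATION_TRANSFER,
    SKETCH_OPERATION_UNMAP_APERTURE_SEGMENT
};

struct sketch_paging_request {
    enum sketch_operation operation;
    size_t transfer_size;  /* only meaningful for a transfer */
};

/* Chooses the request issued for an allocation after a reset. */
struct sketch_paging_request sketch_request_after_reset(bool in_aperture_segment)
{
    /* The size is zero because the content was lost during the reset. */
    struct sketch_paging_request req = { SKETCH_OPERATION_TRANSFER, 0 };

    if (in_aperture_segment)
        req.operation = SKETCH_OPERATION_UNMAP_APERTURE_SEGMENT;
    return req;
}
```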
Additional Reading - Linear Aperture Address Space Segments
The DxgkDdiReleaseSwizzlingRange routine may be called to release a swizzling range for a CPU-accessible aperture memory segment. The DXGKARG_RELEASESWIZZLINGRANGE data structure is used to store information about releasing a swizzling range. The DxgkDdiAcquireSwizzlingRange routine is used to create a swizzling range.
Swizzling in computer graphics commonly means rearranging the elements of vectors, or the layout of texture data in memory, so that they provide better performance for graphics.
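As an illustration of swizzling a memory layout, the sketch below converts a pixel's (x, y) position into a Morton (Z-order) index, which keeps neighbouring pixels close together in memory. Morton order is my own choice for the example; the layout that a real GPU uses behind a swizzling range is hardware specific, and the function names are hypothetical.

```c
#include <stdint.h>

/* Spread the lower 16 bits of v so that a zero bit sits between each bit. */
static uint32_t spread_bits16(uint32_t v)
{
    v &= 0x0000FFFFu;
    v = (v | (v << 8)) & 0x00FF00FFu;
    v = (v | (v << 4)) & 0x0F0F0F0Fu;
    v = (v | (v << 2)) & 0x33333333u;
    v = (v | (v << 1)) & 0x55555555u;
    return v;
}

/* Swizzled (Morton order) index: bits of x and y are interleaved. */
uint32_t swizzle_morton(uint32_t x, uint32_t y)
{
    return spread_bits16(x) | (spread_bits16(y) << 1);
}

/* Ordinary linear (row-major) index, for comparison. */
uint32_t linear_index(uint32_t x, uint32_t y, uint32_t width)
{
    return y * width + x;
}
```

In a 4x4 image, the 2x2 block at the top left occupies swizzled indices 0 to 3, whereas in the linear layout the second row of that block starts a full row later.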
Additional Reading - What is Swizzling?
Friday, 27 December 2013
Advanced Debugging Tools
OllyDbg (V2.1) - This tool is mostly for examining malware and programs. I find the tool really useful, and the Assembly view is very good too, especially for tracing JMPs.
Download - OllyDbg 2.0
Hook Analyser (2.6) - Able to view application crashes with more detail, and hook onto running processes for malware analysis and debugging.
Download - Hook Analyser Blog
WinCheck (8.50) - Able to view Kernel Data Structures not available in WinDbg.
Documentation - WinCheck Blog
Download - WinCheck KernelMode.Info forum
Tis' The Season To Be Sharing - Sharing and Mapping Memory
This blog post is going to look at sharing memory, control areas and section objects, and how to view information about these mechanisms. Let's begin by looking at the general concept of sharing physical memory between two processes.
Process A and Process B both wish to use the same resource; this could be a library or some other kind of object. The pages used to map the shared resource do not cause any conflicts between the two processes, since the processes retain their own private virtual address space, and furthermore the pages will be marked with protection flags such as copy-on-write and execute-only.
The sharing mechanism is mostly driven by a special object used by the Memory Manager called a Section Object. This may also be referred to as a File Mapping Object.
Section Objects are created by calling the CreateFileMapping function, either with a file handle for the file that backs the Section Object, or with INVALID_HANDLE_VALUE to create a region backed by the page file. The reason for this choice is that a Section Object can be backed either by the page file on the hard disk or by an open file.
You may also notice other API functions such as MapViewOfFile and MapViewOfFileEx. These are used to map a Section Object into the address space of a process. When the Section Object is mapped to a file much larger than the address space of the process, only a small portion of the Section Object is mapped at a time. This is called a View of the Section Object.
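Here is a minimal sketch of these calls. To keep it short, it maps two views of one page-file-backed Section Object inside a single process rather than in two processes, and checks that a write through one view is visible through the other. The function name is my own, and on anything other than Windows it is just a stub that returns -1.

```c
#include <string.h>
#ifdef _WIN32
#include <windows.h>
#endif

/* Returns 1 if a write through one view is visible through the other,
   0 if an API call failed, and -1 when not built for Windows. */
int section_views_share_pages(void)
{
#ifdef _WIN32
    int shared = 0;
    char *view_a;
    char *view_b;

    /* INVALID_HANDLE_VALUE requests a page-file-backed section (4096 bytes). */
    HANDLE section = CreateFileMappingA(INVALID_HANDLE_VALUE, NULL,
                                        PAGE_READWRITE, 0, 4096, NULL);
    if (section == NULL)
        return 0;

    /* Two views of the same Section Object; 0 bytes means "map it all". */
    view_a = (char *)MapViewOfFile(section, FILE_MAP_ALL_ACCESS, 0, 0, 0);
    view_b = (char *)MapViewOfFile(section, FILE_MAP_ALL_ACCESS, 0, 0, 0);

    if (view_a != NULL && view_b != NULL) {
        strcpy(view_a, "shared");
        /* Different virtual addresses, same underlying pages. */
        shared = (view_a != view_b) && (strcmp(view_b, "shared") == 0);
    }

    if (view_a != NULL) UnmapViewOfFile(view_a);
    if (view_b != NULL) UnmapViewOfFile(view_b);
    CloseHandle(section);
    return shared;
#else
    return -1;
#endif
}
```

To share between two real processes, the section would normally be given a name (the last parameter of CreateFileMapping) so that the second process can open it.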
Like all objects, the Section Object is managed by the Object Manager, and as a result the same techniques discussed in my Object Manager posts can be used to gather information about a Section Object. We also need to understand the concept of Control Areas.
Looking for Control Areas
Control Areas are used to store information about Section Objects and mapped files. We can find Control Areas through several different methods. The first method is to use the !filecache WinDbg extension, which displays the usage of the file cache. If you're wondering why Section Objects and Control Areas can be found this way, it is because the Cache Manager uses mapped files.
The Control column lists all the Control Areas used by the Cache Manager and their addresses. The No Name for File message indicates that the Virtual Address Control Block (VACB) is not present, or that the view is being used to cache metadata. The VACB can be seen with the _VACB data structure in WinDbg.
The BaseAddress field of the data structure points to the starting address of the view used by the system cache, and is used in conjunction with the MmMapViewInSystemCache routine. The SharedCacheMap field is a pointer to another data structure, which belongs to the VACB and uses the Section Object used to map the view of the file. The ArrayHead field is a pointer to a data structure which is used to manage the array of VACBs.
You could also use the !memusage extension to view the PFN database, and then view the Control Areas for the Section Objects from using that extension.
The last method is to use the !handle extension, which dumps all the handles to section objects for the current process.
Using the address of the Section Object, and then applying the dt command with the _SECTION_OBJECT data structure, we can locate the _CONTROL_AREA data structure by checking the _SEGMENT_OBJECT data structure.
If you're wondering about the VAD Tree related fields, then please read my VAD Tree blog post, and see this blog post too - Hidding Module from the Virtual Address Descriptor Tree
Examining Control Areas
We can then view information about a Control Area with the !ca extension or by using the _CONTROL_AREA data structure.
Personally, I prefer to use the !ca extension instead of formatting the data structure in WinDbg. Using the !ca extension with the address of the Control Area, we can see three distinct categories: Control Area, Segment and Subsection.
The File Object field contains the address of the File Object associated with the file mapping; we can use the !fileobj extension or the _FILE_OBJECT data structure here to gather further information. Generally, the Control Area section contains information about the state of any mapped files. The Flush Count field contains the number of times a mapped view was flushed with the FlushViewOfFile function.
The most important field is the Section Object Pointers, which contains the address of the _SECTION_OBJECT_POINTERS data structure. This data structure is used by the Memory Manager or the Cache Manager to provide information about file mapping and cache information for a file stream.
Within the _SECTION_OBJECT_POINTERS data structure, the DataSectionObject field contains the address of the Control Area. A NULL value indicates that the file stream is not present in memory. See the WDK documentation for more information.
The Segment section contains information about the prototype PTEs used to map the pages used by the section object. This structure is allocated from paged pool.
The Subsection contains information about the page protection and mapping information for each section of the file used in the mapping.
Monday, 23 December 2013
Physical Address Extension (PAE)
This blog post is going to explain the fundamentals and the internals of Physical Address Extension (PAE) on Windows.
Physical Address Extension
PAE Mode enables x86 operating systems to address up to 64GB of physical memory on x86 processors, and up to 1,024GB on x64 processors running in Legacy Mode, which is the same as running in x86 Protected Mode. The PDEs and PTEs are extended to 64 bits wide, and an extra layer called the Page Directory Pointer Table is added. CR3 then points to the address of this table instead of the Page Directory.
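With 4KB pages, the extra layer changes how a 32-bit virtual address is divided up: 2 bits select the Page Directory Pointer Table entry, 9 bits select the PDE, 9 bits select the PTE and 12 bits form the byte offset. The sketch below shows that split; the structure and function names are my own and only exist for this example.

```c
#include <stdint.h>

/* Indices extracted from a 32-bit virtual address under PAE (4KB pages). */
struct pae_indices {
    uint32_t pdpt_index;  /* bits 31:30, selects 1 of 4 PDPT entries */
    uint32_t pd_index;    /* bits 29:21, selects 1 of 512 PDEs */
    uint32_t pt_index;    /* bits 20:12, selects 1 of 512 PTEs */
    uint32_t offset;      /* bits 11:0, byte offset within the page */
};

struct pae_indices pae_split(uint32_t va)
{
    struct pae_indices ix;
    ix.pdpt_index = (va >> 30) & 0x3u;
    ix.pd_index   = (va >> 21) & 0x1FFu;
    ix.pt_index   = (va >> 12) & 0x1FFu;
    ix.offset     = va & 0xFFFu;
    return ix;
}
```

The 64GB figure comes from the physical address being 36 bits wide, since 2^36 bytes is 64GB.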
We can check if a system has PAE enabled by checking bit 5 of the CR4 register. According to my dump file, my system has the PAE bit set. However, due to licensing restrictions set by Microsoft, which exist because of compatibility issues with drivers, my operating system will still not address more than 4GB of RAM.
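The bit test itself is trivial, and can be sketched as below. The function takes a CR4 value that you have already read, for example from a kernel debugger; it does not read the register itself, since that requires running at CPL 0. The names are hypothetical.

```c
#include <stdint.h>
#include <stdbool.h>

#define CR4_PAE_BIT 5u  /* PAE is bit 5 of CR4 */

/* Returns true if the PAE bit is set in the supplied CR4 value. */
bool cr4_pae_enabled(uint64_t cr4)
{
    return ((cr4 >> CR4_PAE_BIT) & 1u) != 0;
}
```

For example, a CR4 value of 0x6f8 has bit 5 set, so the function returns true for it.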
Additionally, when PAE has been enabled on x64 processors, other features are also automatically enabled, such as Data Execution Prevention (DEP), hot-swappable memory and support for NUMA (Non-Uniform Memory Access).
Address Windowing Extension
Furthermore, some processes can still use more memory than the 2GB addressing limit allows by using AWE (Address Windowing Extensions).
AWE works by allocating the physical memory to be used, and then mapping views of that physical memory into virtual address space allocated within the process, as shown in the above diagram. It's also important to understand that physical pages and virtual memory ranges allocated by AWE cannot be shared with, or inherited by, other processes. AWE allocated address ranges are always read/write, so other protection bits do not apply to these pages.
All memory allocated with AWE is non-paged. Since AWE allocated address ranges must be freed as one unit, the MEM_RELEASE flag must be specified when calling the VirtualFree API function. The pages will then be in the Free page state.
Setting PAE and Checking PAE Support
I've developed a very simple tool using the Win32 API to check if your processor supports PAE. Please be aware that, by default, x64 processors will always return non-zero (true). The code is available below:
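The original listing is not reproduced here, so what follows is only a minimal sketch of such a check. It assumes the tool relied on the Win32 IsProcessorFeaturePresent function with the PF_PAE_ENABLED feature, which is my assumption rather than something stated above; the wrapper name is hypothetical, and on non-Windows builds it is a stub that returns -1.

```c
#ifdef _WIN32
#include <windows.h>
#endif

/* Returns 1 if PAE is reported as enabled, 0 if it is not, and -1 when
   the code is not built for Windows. */
int pae_enabled(void)
{
#ifdef _WIN32
    /* IsProcessorFeaturePresent returns non-zero if the feature is present. */
    return IsProcessorFeaturePresent(PF_PAE_ENABLED) ? 1 : 0;
#else
    return -1;
#endif
}
```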
If you have DEP set, then PAE can't be disabled. You must disable DEP before disabling PAE. This can be achieved by editing the Boot Configuration Data with BCDEdit from an elevated Command Prompt.
We can instead force PAE on with the ForceEnable value of the pae option, or force DEP on with the AlwaysOn value of the nx option.
Additional Reading:
Address Windowing Extensions
BCDEdit /set
Thursday, 19 December 2013
Where Did My Kernel Memory Dump Go?
Okay, this is going to be a very short blog post about the common problem of Kernel Memory Dumps not being saved. Despite your efforts to follow all the instructions listed in this Sysnative Tutorial, Windows still doesn't seem to be saving your Kernel Memory Dumps. So what's the problem? The answer lies within the registry.
According to customer service feedback, many users were complaining about Kernel Memory Dumps using up their hard drive space. For the average user, unless they are receiving support on a forum, these Kernel Memory Dumps are valueless. As a result, a registry value called AlwaysKeepMemoryDump was introduced to address this issue, and you will need to create it if you want your dumps to be kept.
If this registry key is set to 1, then Kernel Memory Dumps will always be saved regardless.
The highlighted key shows the maximum number of Minidumps which will be created.
Translation Lookaside Buffer (TLB) and Look Aside Lists
TLB Cache
The TLB Cache is a key part of Virtual to Physical Address Translation, and its main purpose is to improve the performance of that translation. All modern CPUs and their MMUs (Memory Management Units) support the use of the TLB.
An important aspect to understand is the difference between a TLB Hit and a TLB Miss. When a Virtual Address is accessed and looked up, the TLB Cache is checked first to see if the Virtual to Physical Address mapping is present, and if so a TLB Hit occurs. On the other hand, if the mapping isn't present, then a TLB Miss occurs and the MMU is forced to execute a Page Walk, which is the process of looking through the Page Tables as discussed in my previous blog posts. Once the Page Walk has completed and the physical address is found, this information is loaded into the TLB Cache.
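The hit and miss behaviour can be modelled with a toy TLB. Everything in this sketch is a simplification that I have made up for illustration: the TLB is direct mapped with four entries, and the "page walk" is a stand-in function that maps virtual page N to frame N + 0x100 instead of reading real Page Tables.

```c
#include <stdint.h>
#include <stdbool.h>

#define TLB_ENTRIES 4u
#define PAGE_SHIFT  12u  /* 4KB pages */

struct tlb_entry {
    bool valid;
    uint32_t vpn;  /* virtual page number (the tag) */
    uint32_t pfn;  /* physical frame number */
};

struct tlb {
    struct tlb_entry entries[TLB_ENTRIES];
    unsigned hits;
    unsigned misses;
};

/* Hypothetical page walk: virtual page N always maps to frame N + 0x100. */
static uint32_t page_walk(uint32_t vpn)
{
    return vpn + 0x100u;
}

/* Translates a virtual address, filling the TLB on a miss. */
uint32_t tlb_translate(struct tlb *t, uint32_t va)
{
    uint32_t vpn = va >> PAGE_SHIFT;
    uint32_t offset = va & ((1u << PAGE_SHIFT) - 1u);
    struct tlb_entry *e = &t->entries[vpn % TLB_ENTRIES];  /* direct mapped */

    if (e->valid && e->vpn == vpn) {
        t->hits++;    /* TLB Hit: mapping already cached */
    } else {
        t->misses++;  /* TLB Miss: walk, then load the result into the TLB */
        e->valid = true;
        e->vpn = vpn;
        e->pfn = page_walk(vpn);
    }
    return (e->pfn << PAGE_SHIFT) | offset;
}
```

Translating two addresses within the same page gives one miss followed by one hit, while touching a page that falls on the same slot evicts the older entry.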
If the Page Walk is unsuccessful, then a Page Fault is raised, and the Virtual to Physical Address Mapping is created in the Page Table. Generally, any changes to the Page Table Structure and the Paging Structure will result in the TLB being flushed, with an Inter-Processor Interrupt (IPI) used to flush it on the other processors. This is one of the reasons why you will tend to notice TLB flushing and IPIs with Stop 0x101 bugchecks.
The TLB Cache can be flushed by reloading CR3 (the Page Directory Base Register); there is an easier method which I will explain too.
Here is a small segment of Assembly code from OSDev Wiki:
However, if the G flag has been set for a PTE or PDE, then that entry will not be flushed from the TLB Cache.
The other method would be to use the
The data structure contains a pointer to a larger data structure called _GENERAL_LOOKASIDE_LIST which retains the information about the current Look Aside List.
The type of pool being used for the Look Aside List will depend upon if the driver is going to need to access the entries at IRQL Level 2 or IRQL Level 1. Obviously, Non-Paged Pool will need to be used if the driver is going to need to use the entries at IRQL Level 2, and Paged Pool will need to be used if the driver is not going to access these entries at any level above IRQL Level 1.
The Allocate and Free fields are used as function pointers to the functions in which you wish to use to allocate and free the entries in the list. The TotalAllocates and TotalFrees fields shows the total number of allocations and frees.
The Depth field contains the number of entries in the list.
The SingleListHead.Next field is used to point to the next free pool chunk.
We can use the !lookaside extension to see the efficiency and information about system lookaside lists.
Additional Reading:
Using LookAside Lists (Windows Drivers)
Source Code Examples
The TLB Cache is very much a key part for the necessary performance of Virtual to Physical Address Translation. It's main purpose is to improve the performance of Virtual Address Translation. All modern CPUs and their MMUs (Memory Management Units) support the use of the TLB.
A important aspect to understand, is the difference between TLB Hit and TLB Miss. When a Virtual Address is accessed, and then looked up, the TLB Cache is checked first to see if the Virtual-Physical Address mapping is present and if so then a TLB Hit occurs. On the other hand, if the address isn't present, then a TLB Miss occurs and the MMU is forced to execute a Page Walk which is the process of looking through the Page Tables like discussed in my previous blog posts. Once, the Page Walk has completed and the physical address is found, then this information is loaded into the TLB Cache.
If Page Walk is unsuccessful, then a Page Fault is raised, and the Virtual to Physical Address Mapping is created in the Page Table. Generally, any changes to the Page Table Structure and the Paging Structure will resulting the flushing of the TLB with a Inter Processor Interrupt (IPI). This one of the reasons why you will tend to notice TLB flushing and IPIs with Stop 0x101 bugchecks.
The flushing of the TLB Cache can be achieved by reloading the CR3 (Page Directory Base Register), there is a easier method which I will explain too.
Here is a small segment of Assembly code from OSDev Wiki:
However, if the G flag has been set for a PTE or PDE, then that entry will not be flushed from the TLB Cache.
The other method would be to use the
invlpg (Invalidate TLB Entry) instruction. This instruction is a privileged instruction, and therefore the CPL (Current Privilege Level) must be Level 0. This instruction also flushes or invalidates an entry for a specific page, and therefore is much more suited if you wish to only flush a certain entry. Although, in some circumstances, it may flush the entire TLB or multiple entries, there is the guarantee though that it will flush the entries associated with the current PCID (Process Context Identifier). See Volume 3: Section 4.10 in Intel Developer's Manual.
You can check if PCIDs are enabled by checking the 17th bit of the CR4 register.
The above example is from a AMD CPU, and I don't think AMD yet supports PCIDs. We could also use the j command with the addition of the .echo command, as seen below:Getting back on topic back the TLB Cache, each entry is associated with a tag. The tag contains important information such as the part of the virtual address, physical page number, protection bits, valid bit and dirty bit. A Virtual address being checked, and is then matched against a tag within the TLB Cache. The 8 bit ASID (Address Space Identifier) is used to from part of the tag. The ASID part is the is matched between the TLB Entry and the Virtual Address (PTE).
Context Switches and Task Switches can invalidate TLB Entries, since the mappings will be different.
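To make the tagging and flushing behaviour a little more concrete, here is a minimal Python sketch of a TLB. It is only a toy model and not how any real processor is implemented: the class and method names (TlbEntry, Tlb, reload_cr3, invlpg) are my own, the tag is reduced to an ASID plus a virtual page number, and I am assuming that a CR3 reload drops every entry which does not have the G flag set, while invlpg drops the entry for one page only.

```python
from dataclasses import dataclass

PAGE_SHIFT = 12  # 4KB pages, so the low 12 bits are the byte offset


@dataclass
class TlbEntry:
    """One cached translation; (asid, vpn) stands in for the tag."""
    asid: int                # address space identifier, part of the tag
    vpn: int                 # virtual page number, part of the tag
    pfn: int                 # physical page number
    is_global: bool = False  # G flag: survives a CR3 reload in this model


class Tlb:
    def __init__(self):
        self.entries = {}

    def insert(self, entry):
        self.entries[(entry.asid, entry.vpn)] = entry

    def lookup(self, asid, vaddr):
        """Return the physical address on a hit, or None on a miss."""
        entry = self.entries.get((asid, vaddr >> PAGE_SHIFT))
        if entry is None:
            return None
        offset = vaddr & ((1 << PAGE_SHIFT) - 1)
        return (entry.pfn << PAGE_SHIFT) | offset

    def reload_cr3(self):
        """Model of reloading CR3: keep only the global entries."""
        self.entries = {k: e for k, e in self.entries.items() if e.is_global}

    def invlpg(self, asid, vaddr):
        """Model of invlpg: drop the entry for a single page."""
        self.entries.pop((asid, vaddr >> PAGE_SHIFT), None)


tlb = Tlb()
tlb.insert(TlbEntry(asid=1, vpn=0x10, pfn=0x200))
tlb.insert(TlbEntry(asid=1, vpn=0x11, pfn=0x201, is_global=True))
tlb.reload_cr3()
print(tlb.lookup(1, 0x10ABC))       # the non-global entry is gone
print(hex(tlb.lookup(1, 0x11004)))  # the global entry is still cached
```

A lookup with a different ASID misses even when the virtual page number matches, which is the point of making the ASID part of the tag.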
Look Aside Lists
The second part of my blog post will concern Look Aside Lists. Look Aside Lists are a type of pool allocation algorithm; the difference is that Look Aside Lists have fixed sizes and do not use spinlocks. They are also based around Singly Linked Lists, using a LIFO order.
Device Drivers and parts of the operating system (I/O Manager, Cache Manager and Object Manager) can build their own Look Aside Lists. The Executive versions of the Look Aside List are managed per processor (see _KPRCB). These Look Aside Lists are managed by the operating system. Each Look Aside List can be allocated from Paged Pool or Non-Paged Pool. The operating system will increase the number of allocations to a Look Aside List, and thus the number of entries, if the demand is great. The opposite is true if demand is low.
Furthermore, Look Aside Lists are managed automatically by the Balance Set Manager, which calls ExAdjustLookasideDepth (reserved for System Use) every second. This only happens if no allocations have happened recently. The Look Aside List depth will never drop below 4.
Look Aside Lists are mainly intended for when a device driver is going to use pool blocks of a specific size frequently. Each allocation is known as an entry, and depending on the pool allocation type, the data structure used will either be _PAGED_LOOKASIDE_LIST or _NPAGED_LOOKASIDE_LIST.
Firstly, a Look Aside List is created by calling ExInitializeLookasideListEx. Just as a side note before I continue, all the said functions should be fully documented in the WDK. The function mentioned creates a data structure called _LOOKASIDE_LIST_EX. The data structure contains a pointer to a larger data structure called _GENERAL_LOOKASIDE_LIST, which retains the information about the current Look Aside List.
The type of pool being used for the Look Aside List will depend upon whether the driver is going to need to access the entries at IRQL Level 2 or IRQL Level 1. Obviously, Non-Paged Pool will need to be used if the driver is going to need to use the entries at IRQL Level 2, and Paged Pool can be used if the driver is not going to access these entries at any level above IRQL Level 1.
The Allocate and Free fields are function pointers to the functions which you wish to use to allocate and free the entries in the list. The TotalAllocates and TotalFrees fields show the total number of allocations and frees.
The Depth field contains the number of entries in the list.
The SingleListHead.Next field is used to point to the next free pool chunk.
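The behaviour described above can be sketched in a few lines of Python. This is only a toy model to show the LIFO reuse of fixed-size entries, and not the kernel implementation: the class name, the method names and the allocate_misses counter are my own, the Python list stands in for the singly linked list, and I am assuming that depth acts as the maximum number of free entries which the list will hold before handing blocks back to the pool.

```python
class LookasideList:
    """Toy model of a lookaside list: a LIFO stack of fixed-size free blocks."""

    def __init__(self, size, depth, allocate=bytearray, free=None):
        self.size = size          # every entry has the same fixed size
        self.depth = depth        # assumed: maximum number of cached free entries
        self.allocate = allocate  # fallback allocator (like the Allocate pointer)
        self.free = free or (lambda block: None)  # fallback (like the Free pointer)
        self.free_entries = []    # stands in for the singly linked list head
        self.total_allocates = 0
        self.total_frees = 0
        self.allocate_misses = 0

    def allocate_entry(self):
        self.total_allocates += 1
        if self.free_entries:
            return self.free_entries.pop()  # LIFO: most recently freed first
        self.allocate_misses += 1
        return self.allocate(self.size)     # list is empty, go to the pool

    def free_entry(self, block):
        self.total_frees += 1
        if len(self.free_entries) < self.depth:
            self.free_entries.append(block)  # keep the block for quick reuse
        else:
            self.free(block)                 # list is full, return it to the pool


lookaside = LookasideList(size=64, depth=2)
first = lookaside.allocate_entry()
second = lookaside.allocate_entry()
lookaside.free_entry(first)
lookaside.free_entry(second)
print(lookaside.allocate_entry() is second)  # the last block freed is reused first
```

The ratio between total_allocates and allocate_misses is the sort of efficiency figure which a lookaside list is trying to keep high.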
We can use the !lookaside extension to see the efficiency and information about system lookaside lists.
Additional Reading:
Using LookAside Lists (Windows Drivers)
Source Code Examples
Wednesday, 18 December 2013
Virtual to Physical Address Translation (Part 3)
All the pages resident in physical memory are managed by the PFN Database or Page Frame Number Database. The PFN Database is used to describe the page state of each page, and the number of references to this page.
Page States
The page states can be found in an enumeration called _MMLISTS:
Zeroed - The page is free and already contains 0's, or has been freed and is being zeroed.
Free - The page is free, but may still contain data since the Dirty bit may not have been set; therefore these pages are zeroed before being given to user-mode processes as user pages.
Standby - The page has recently been removed from the working set of a process, and as a result is currently in Transition. The page hasn't been modified since it was last written to the hard disk, but the PTE (Invalid) may still refer to the physical page.
Modified - The page has been recently removed from the working set of a process, but was modified before it could be written to the hard disk; therefore the contents of this page must be written to the hard disk before the physical page the PTE points to can be used.
Modified No-Write - The no write version of the Modified page state. The page will not be written to the hard disk.
Bad - The page has a parity error, or has other associated hardware issues with it. It may be used by the Kernel for look-aside caches and changing page states.
Active/Valid - The page is part of a working set (process or system), or is a non-paged pool page, and will therefore have a valid PTE.
Transition - Temporary page state used for pages which aren't present on working set lists and other page state lists. I/O for these pages will be in progress, and collided page faults can be used here.
ROM - Read Only Memory Page
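As a quick illustration of how the Standby and Modified states relate to each other, here is a small Python sketch. The enumeration below simply lists the states named above and does not claim to reproduce the numeric values of _MMLISTS; the two helper functions and their names are hypothetical and only encode the rules from the descriptions.

```python
from enum import Enum


class PageState(Enum):
    ZEROED = "Zeroed"
    FREE = "Free"
    STANDBY = "Standby"
    MODIFIED = "Modified"
    MODIFIED_NO_WRITE = "Modified No-Write"
    BAD = "Bad"
    ACTIVE = "Active/Valid"
    TRANSITION = "Transition"
    ROM = "ROM"


def state_after_trim(dirty):
    """A page leaving a working set: clean pages go to Standby, dirty to Modified."""
    return PageState.MODIFIED if dirty else PageState.STANDBY


def must_write_before_reuse(state):
    """Only Modified pages have to be written to disk before the page is reused."""
    return state is PageState.MODIFIED


trimmed = state_after_trim(dirty=True)
print(trimmed.value, must_write_before_reuse(trimmed))
```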
Since we now understand the different page states, there are two methods for gathering system wide information about the page states used by the PFN database. The first method is to use the WinDbg extension called !memusage.
You will always notice a few bad pages, for the reasons explained earlier. The second and faster method would be to use the MemInfo Tool with the -s switch.
The appropriate program version is shown in the image below:
Structure of the PFN Database
The PFN Database is an array of data structures such as _MMPFN and _MMPFNENTRY. Each PFN Entry data structure is used to store information about a physical page in memory.
The flags are all represented by certain bits, and these flag meanings will be explained below:
WriteInProgress - Indicates that a write operation is in progress for this page. An event object will become signaled when this operation is complete.
Modified - Shows if the page was modified, and if so, then the contents of the page must be written out to the disk before the page can be removed from memory.
ReadInProgress - Indicates that a read and in-page operation is in progress for this page. Again, an event object will be signaled when this operation is complete.
Rom - The page is currently read-only memory.
InPageError - An I/O error occurred during an in-page operation with this page.
KernelStack - The page is being used to contain a Kernel Stack; the PFN Entry will contain the owner of the stack and the next stack for the PFN for the same thread.
RemovalRequested - Indicates that the page is eligible for removal.
ParityError - Physical page contains parity errors or ECC errors. There is a white paper somewhere on parity and ECC errors.
The !pfn extension can give us information about a PFN Entry. Using !pfn with a known page frame number, we can see the output below. Remember that we can obtain the Page Frame from the !pte extension. The !pfn extension gives similar information to the _MMPFN structure.
We can see the Reference Count field which refers to the number of references to this page. The reference count is incremented when the page is added to the working set of a process, and when the page has been locked into memory for I/O. The reference count is decremented when the share count becomes 0, or when the page has been unlocked from memory.
We can also see the Share Count field which refers to the number of PTEs which refer to the page; this value can be greater than one for shared pages.
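The relationship between the two counters is easier to follow with a small model. The Python sketch below is my own simplification and not the Memory Manager's code: the class and method names are hypothetical, and I am assuming that the working set reference is taken when the share count goes from 0 to 1, which pairs up with the reference being dropped when the share count returns to 0.

```python
class PfnEntry:
    """Toy model of the share count and reference count of one physical page."""

    def __init__(self):
        self.share_count = 0      # number of PTEs which refer to the page
        self.reference_count = 0  # working set reference plus I/O locks

    def map_pte(self):
        """Another PTE now refers to this page."""
        if self.share_count == 0:
            self.reference_count += 1  # assumed: taken on the first mapping only
        self.share_count += 1

    def unmap_pte(self):
        """A PTE no longer refers to this page."""
        if self.share_count == 0:
            raise ValueError("page is not mapped by any PTE")
        self.share_count -= 1
        if self.share_count == 0:
            self.reference_count -= 1  # dropped when the share count reaches 0

    def lock_for_io(self):
        self.reference_count += 1

    def unlock_from_io(self):
        if self.reference_count == 0:
            raise ValueError("page is not locked")
        self.reference_count -= 1


page = PfnEntry()
page.map_pte()      # first process maps the shared page
page.map_pte()      # second process maps the same page
page.lock_for_io()  # a driver locks the page for I/O
print(page.share_count, page.reference_count)
```

In this model a page which has been unmapped by every process still has a non-zero reference count for as long as the I/O lock is held.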
The Colour field is used to link physical pages together by colour, which refers to their location within the processor cache. This is generally used for performance purposes.
The Priority field shows us the page priority of the PFN, and thereby which standby list it will be placed upon. I'll explain these page priorities a little later in my post.
Page Priority Levels
The page priority levels run from 0 to 7, with 0 being the lowest priority. The default priority level is 5, but the priority level can be inherited from the process' thread which caused the allocation. The Memory Manager will always take pages from the standby list of the lowest priority first.
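The idea of one standby list per priority level can be sketched as follows. This is a minimal Python model under my own assumptions: the class and method names are hypothetical, the page frame numbers are made up, and I am assuming pages are taken in FIFO order within a single priority list.

```python
from collections import deque

NUM_PRIORITIES = 8    # priorities 0 to 7
DEFAULT_PRIORITY = 5


class StandbyLists:
    """Toy model of the eight prioritised standby lists."""

    def __init__(self):
        self.lists = [deque() for _ in range(NUM_PRIORITIES)]

    def add(self, pfn, priority=DEFAULT_PRIORITY):
        if not 0 <= priority < NUM_PRIORITIES:
            raise ValueError("priority must be between 0 and 7")
        self.lists[priority].append(pfn)

    def take(self):
        """Take a page from the lowest priority list which is not empty."""
        for standby in self.lists:  # index 0 is the lowest priority
            if standby:
                return standby.popleft()
        return None


standby = StandbyLists()
standby.add(0x1A2B)              # lands on the default priority 5 list
standby.add(0x3C4D, priority=7)
standby.add(0x5E6F, priority=2)
print(hex(standby.take()))       # the priority 2 page is repurposed first
```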
Using the MemInfo Tool with the -c switch, we can then view the Page Priority Standby lists, as shown here:
I'm going to explain the TLB Cache and Look Aside Lists in my next blog post instead, since this topic got a little big.
Monday, 16 December 2013
Virtual to Physical Address Translation (Part 2)
The second part is going to concern the paging structure on x86 and x64, and how virtual memory addresses and physical memory addresses are mapped according to this structure. The third part will look at how physical memory is managed with the PFN database.
Hardware PTEs and Paging Structure
A virtual address on an x86 system is divided into three different parts: Page Directory Index (10 bits), Page Table Index (10 bits) and the Byte Index (12 bits). The above image shows their relationship to the general page table structure.
The Page Directory Index selects the PDE which holds the address of the page table in which the desired PTE is located. The Page Table Index selects the PTE within that Page Table, and the Byte Index is the offset of the desired byte within the physical page which the PTE maps.
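The 10/10/12 split is simple bit arithmetic, so here is a short Python sketch of it. The function name and the example address are my own, and it only covers the non-PAE x86 layout described above.

```python
def split_x86_va(va):
    """Split a 32-bit virtual address into (PDE index, PTE index, byte index)."""
    if not 0 <= va <= 0xFFFFFFFF:
        raise ValueError("expected a 32-bit virtual address")
    page_directory_index = (va >> 22) & 0x3FF  # top 10 bits
    page_table_index = (va >> 12) & 0x3FF      # middle 10 bits
    byte_index = va & 0xFFF                    # low 12 bits
    return page_directory_index, page_table_index, byte_index


pdi, pti, offset = split_x86_va(0x00403025)
print(pdi, pti, hex(offset))  # 1 3 0x25
```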
Before we go on to briefly explain the x64 version, and go into greater depth about each part of the translation process, let's quickly discuss how to find the Page Directory address. Remember it is unique to each process, and each thread running under one process will inherit this address, meaning that the address space does not need to be switched when changing context to a different thread of the same process.
Using the !process extension, the DirBase field shows the physical address of the Page Directory. This same physical address is stored within the Control Register called CR3. Using the r command with the @ character, we can see that the two addresses are identical.
Alternatively, you could also check the _KPROCESS data structure, and investigate the address from that standpoint. The CR3 register (Page Directory Base Register) will be updated with the address of a different Page Directory if a process context switch occurs.
The Page Directory is essentially a large array of PDEs (Page Directory Entries); each PDE points to the address of a Page Table, and is 4 bytes long (or 32 bits). Page Tables are created on demand, and therefore the VAD Tree is checked to see whether a new Page Table should be created upon access of a virtual address without a corresponding Page Table Index and Page Directory Index.
The PDE virtual address can be found with the !pte extension, as shown below:
Each PDE points to the address of a Page Table, and the Page Table is similar in that it is an array of PTEs (Page Table Entries). The Page Table Index is used to find the relevant PTE within the array. 1,024 Page Tables are required to map the entire 4GB of address space for x86.
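The figure of 1,024 Page Tables falls out of some quick arithmetic, which I have written as Python below. It assumes the non-PAE x86 layout, where a PTE is 4 bytes, a page is 4KB and each Page Table occupies exactly one page; the constant names are my own.

```python
PTE_SIZE = 4             # bytes per PTE on non-PAE x86
PAGE_SIZE = 4096         # bytes per page
ADDRESS_SPACE = 4 * 1024 ** 3  # 4GB of virtual address space

entries_per_table = PAGE_SIZE // PTE_SIZE             # PTEs in one Page Table
bytes_per_page_table = entries_per_table * PAGE_SIZE  # memory mapped by one table
page_tables_needed = ADDRESS_SPACE // bytes_per_page_table

print(entries_per_table)                      # 1024
print(bytes_per_page_table // (1024 * 1024))  # 4 (MB mapped by one Page Table)
print(page_tables_needed)                     # 1024
```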
Each PTE is then used to point to the relevant physical page, and the Byte Index is used to find the appropriate address within this page. The PTE has a number of different PTE Protection and Status bits associated with it, which I will explain here.
- Accessed (A) - The page has been read.
- Cache Disable (Cd) - Caching is disabled for the page.
- Copy-on-Write (Cw) - Page is using copy on write.
- Dirty (D) - Page has been written to.
- Global (Gl) - Translation applies to all processes.
- Large Page (L) - PDE maps a 4MB page.
- Owner (O) - Shows if the page is accessible in User-Mode or Kernel-Mode.
- Prototype (P) - Prototype PTE, will be explained later.
- Valid (V) - Virtual Page maps to a physical page.
- Write Through (Wt) - Disables caching of writes.
- Write (W) - Page is writable.
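Several of the bits listed above live at fixed positions in a hardware x86 PTE, so a PTE value can be decoded with a few masks. The Python sketch below only covers bits 0 to 8, which are defined by the processor architecture; the Copy-on-Write and Prototype bits are kept in the software-available bits and I have left them out rather than guess their positions. The function name and the example PTE value are made up for illustration.

```python
# Bit positions 0 to 8 of a 32-bit (non-PAE) x86 PTE, using the names from the list above.
PTE_FLAGS = [
    (0, "Valid"),
    (1, "Write"),
    (2, "Owner"),
    (3, "Write Through"),
    (4, "Cache Disable"),
    (5, "Accessed"),
    (6, "Dirty"),
    (7, "Large Page"),  # only meaningful in a PDE
    (8, "Global"),
]


def decode_pte(pte):
    """Return (page frame number, list of flag names which are set)."""
    flags = [name for bit, name in PTE_FLAGS if pte & (1 << bit)]
    page_frame_number = (pte >> 12) & 0xFFFFF  # upper 20 bits
    return page_frame_number, flags


pfn, flags = decode_pte(0x12345067)
print(hex(pfn), flags)
```

For the value 0x12345067 this gives a page frame number of 0x12345 with the Valid, Write, Owner, Accessed and Dirty bits set.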
System PTEs
System PTEs are used to map the system address space. For example, kernel stacks, MDLs and I/O are mapped with the use of System PTEs. We can see the number of System PTEs free to use by using the !sysptes extension in WinDbg.
You can also view the number of System PTE allocation failures, which could indicate a PTE leak, by gathering the address of the MiSystemPteInfo global variable and dumping the _MI_SYSTEM_PTE_TYPE data structure at that address.
The above address is then used with the dt command, as seen below:
As we can see, there aren't any allocation failures, which is a positive sign and suggests everything is running normally. On the other hand, if you do notice any allocation failures, then you could create a registry value called HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management\TrackPtes. By creating this DWORD value and setting it to 1, you will be enabling the tracking of System PTEs. The next step would be to use the !sysptes extension with the 0x4 bit flag set.
Invalid PTEs
Invalid PTEs indicate that the PTE isn't accessible to the process. Usually, for the invalid PTE to become valid, a Page Fault exception is raised and the Page Fault is then resolved by the Memory Manager's fault handler called MmAccessFault. There are four different kinds of Invalid PTEs, which will be explained below:
Page File - The page is located within a page file on the hard drive. Accessing this page will result in a page fault, which will allocate a physical page and enable the Valid bit for the PTE. The page will also be added to the Working Set of the accessing process.
Demand Zero - The page will be written with a page of 0's; if this page is accessed then a zero filled page is added to the working set of the process. At first, the zero page list is checked, and if this is empty, then a page from the free list is taken and filled with 0's. Otherwise, the page is taken from a Standby List and filled with 0's.
Transition - The page is currently on a standby, modified, modified-no-write or no list, and therefore will be removed from the corresponding list and added to the working set of the process.
Unknown - PTE is zero, or there isn't a page table yet. This leads to the VAD Tree being checked to see if the page is committed, and if so, a page table will be created.
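The search order for a Demand Zero fault can be sketched like this. It is only a Python model of the order described above: the function name is hypothetical, the three lists are plain Python lists of made-up page frame numbers, and the second return value just records whether the page would still need filling with 0's.

```python
def take_demand_zero_page(zero_list, free_list, standby_list):
    """Return (pfn, needs_zeroing), trying the zero, free and standby lists in order."""
    if zero_list:
        return zero_list.pop(0), False    # already full of 0's
    if free_list:
        return free_list.pop(0), True     # may still contain old data
    if standby_list:
        return standby_list.pop(0), True  # repurposed page, must be zeroed
    return None, False                    # nothing available in this model


zero_pages = []
free_pages = [0x2000]
standby_pages = [0x3000]
print(take_demand_zero_page(zero_pages, free_pages, standby_pages))
```

Here the zero page list is empty, so the page comes from the free list and has to be zeroed before it can be added to the working set.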
Prototype PTEs
Pages which are shared between processes are mapped with Prototype PTEs. When a sharable and mapped page is referenced by a process, a hardware PTE is used to point to the referenced page, thus both the Prototype PTE and the hardware PTE point to the physical page. For each reference to a shareable page, a counter is incremented within the PFN Database. This allows the Memory Manager to know when a page is no longer referenced, so that it can invalidate the page and move it to the hard drive or a transition list.
The PTE used by the process' page table has its Valid flag cleared and is used to point to the Prototype PTE, which points to the page.
If the page is later accessed, then the Prototype PTE can improve the lookup process. The diagram below illustrates this point, and how a Prototype PTE and Valid PTE may look in concept.
A Prototype PTE can be used to describe the page state of a sharable page; these states are as follows:
Valid - Page is in physical memory.
Transition - The page is currently present on a standby or modified list, or may not be present on any list.
Modified-No-Write - The page is present in physical memory, and present on the modified-no-write list.
Demand Zero - The page will be written with a page of 0's.
Page File - The page is present in a paging file.
Mapped File - The page is present in a mapped file.
x64 Translation Process:
On x64 systems, the paging structure has expanded from two levels to four levels. The top level is called the Page Map Level 4. The Virtual Address on an x64 system therefore has more sections. I've created a diagram for the current 48-bit implementation.
The Page Map Level 4 Selector points to an entry in the Page Map Level 4, and the Page Directory Pointer Selector then points to an entry in the Parent Page Directory Pointers Table. The Page Table Selector selects the entry in the Page Directory, and the Page Table Entry Selector then points to the correct PTE which maps to the physical page. The Byte Within Page gives the offset of the desired byte within that physical page. Remember the x64 paging structure still applies to each process. Using the above information, you can imagine the overall x64 paging structure being like the diagram below:
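As with x86, the selectors are just bit fields of the virtual address. The Python sketch below splits a 48-bit x64 virtual address into four 9-bit selectors and a 12-bit byte offset; the function name and the short dictionary keys (pml4, pdpt, pd, pt, offset) are my own labels for the selectors in the order given above, and the example address is made up.

```python
def split_x64_va(va):
    """Split a virtual address into the four 9-bit selectors and the byte offset."""
    va &= (1 << 48) - 1  # keep only the 48 implemented bits
    return {
        "pml4": (va >> 39) & 0x1FF,  # Page Map Level 4 Selector
        "pdpt": (va >> 30) & 0x1FF,  # Page Directory Pointer Selector
        "pd": (va >> 21) & 0x1FF,    # Page Table Selector (Page Directory entry)
        "pt": (va >> 12) & 0x1FF,    # Page Table Entry Selector
        "offset": va & 0xFFF,        # Byte Within Page
    }


parts = split_x64_va(0x00007FF712345678)
print({name: hex(value) for name, value in parts.items()})
```

Each 9-bit selector can index 512 entries, which matches a 4KB table made up of 8-byte entries.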