Troubleshooting a hung or nonresponsive Windows server can be a challenging endeavor. Simply hitting the reset button is no longer a tolerated option as more companies use these servers for business-critical operations. This three-part series will explore the reasons why a Windows server may hang and provide a cookbook approach to diagnosing the underlying issues with the Windows Kernel Debugger (Windbg).
When Microsoft released the early versions of its server operating system (Windows NT 3.5x and NT4), there was no easy way to troubleshoot a hung server. Other mainstream operating systems, such as Digital Equipment Corp.’s VAX/VMS, offered ways to manually intervene by forcing a crash dump whereby the server’s state could be captured at the time of the hang. This dump could then be analyzed to determine why the server hung. The only option for early Windows platforms, however, was to reset the box.
As Windows servers became more predominant in the business world, hitting the reset button became unacceptable. As a result, in Windows 2000 Server and later versions, it became possible to force a crash dump to assist with determining why the server hung. Microsoft introduced this feature in Knowledge Base article 244139. It allows a keystroke combination (right CTRL+SCROLL LOCK twice) to generate a crash dump on PS/2-type keyboards. Microsoft extended this feature in Windows Server 2003 with a hotfix to the Kbdhid.sys driver to accommodate USB-type keyboards.
Several other options now exist to force a crash dump. Microsoft provides the Windows Special Administrative Console (SAC) Crashdump command as part of Windows Emergency Management Services (EMS), which allows for “headless” servers with no local graphical console. Vendor-specific options also exist to force a crash dump including the HP Integrity server’s Management Processor TC (transfer of control) command, an NMI (non-maskable interrupt) button on some Integrity models, or the Integrated Lights Out (iLO) virtual NMI button. We’ll take a closer look at each of these options later in the series.
Why a server hangs
There are a variety of reasons why a server may hang, including both hardware and software issues. The most common hardware reason for a server hang is spurious interrupts by a failing device. For example, a network interface controller (NIC) may have a bad component or be attached to a bad cable causing false interrupts to occur. These interrupts occur at an elevated interrupt request level (IRQL) dominating the attention of the processor(s), leaving lower priority requests (user level) unanswered. As a result, the server appears to be hung.
Another example of a hardware-induced hang involves storage requests going unanswered. For example, consider a case where a disk drive fails, causing outstanding I/O requests to be queued up. Eventually, these pending requests trigger a cascading effect of user and system threads to hang, leading to a system-wide outage.
More often, however, server hangs are a result of software issues. These issues come in several flavors, including:
System resource depletion (e.g., out of memory pool) — The most common type of software hang, this typically is the result of a memory leak by a driver or kernel mode thread. Resource depletion can also result from exceeding architectural limits of paged and nonpaged memory pools (typically experienced on an x86 32-bit operating system).
Deadlock conditions — A deadlock occurs when contention exists for common resources between two or more threads. For example, a deadlock exists when one thread owns an exclusive lock on a resource that another thread wants, and that thread exclusively owns a resource that the initial thread wants.
Spinlock conditions — Spinlock hangs are similar to deadlocks, but involve contention for a spinlock that is used to synchronize access to data structures in a multi-processor environment. Other permutations of these conditions include a driver holding a lock while performing other activities for an extended period of time. Actual examples of deadlock and spinlock hangs will be provided later.
High-priority, compute-bound threads — A software hang can also occur if high-priority, compute-bound thread(s) are dominating the processors. Since the Windows operating system permits varying levels of thread priority, one or more threads may execute at a higher priority than typical user threads. The result is that applications and users at normal priority are starved for CPU time, causing a perceived software hang.
The big picture
So, as you can see, there are numerous reasons why a server may hang. To give you a better idea of what happens when you force a crash to generate a memory dump, and subsequently analyze the crash to determine what caused the hang, see Figure 1 below.
Starting on the left-hand side, you can see the server crashes or hangs. In the event of a crash, the server would generate a memory dump if the dumpfile and pagefile are properly configured (see Microsoft Knowledge Base articles 254649, 197379 and 889654).
In the event of a hang, manual intervention would be required to force a crash dump as previously described. In either case, the content of memory is written to the pagefile.sys before the server is rebooted. During the reboot, the pagefile.sys is written to the memory.dmp file. Finally, once the server has rebooted, you can use the Windows Kernel Debugger (Windbg) to analyze the memory dump using a symbol server (as documented in KB article 311503) to translate memory references to meaningful functions and variables.
Figure 1: Overview of memory dump process and analysis
Now that you have a better idea of why server hangs occur, the next article in this series will look at the preparation process for troubleshooting a hung Windows server.
TROUBLESHOOTING A HUNG WINDOWS SERVER
– Part 1: Why do servers hang?
– Part 2: Preparing to troubleshoot
– Part 3: Resolving the issue
Previously in this series, we looked at some of the reasons why server hangs occur in a network. Now that you have a little background, let’s look at the preparation process for resolving the problem using a tool called the Windows Kernel Debugger, or Windbg.
When troubleshooting a hung Windows server, there are several things that need to be done up front to prepare for collecting data. A forced crash dump may only be necessary if other means of troubleshooting prove unsuccessful. The first thing administrators should always do is runMPS Reports to collect event logs and other pertinent information. Close examination of system and application event logs may reveal a pattern of particular entries occurring prior to each hang. If the problem starts with a slow down or performance issue, you should collect Perfmon data as described in Microsoft Knowledge Base article 248345.
Once you determine that a forced crash dump is necessary, update the appropriate registry entries per KB article 244139 or 927069 and reboot the server. Also, ensure you have properly configured the dump file type as previously mentioned in KB article 254649. Finally, be sure that your pagefile.sys is sufficiently sized to accommodate a memory dump and that you have enough free space on the disk where the memory.dmp will be located, per KB article 886429.
Installing Windbg and setting the symbol path
In addition to configuring the server to generate a memory dump, you have to install the Windows Kernel Debugger and establish the symbol path. Do that on a separate server from the one that you are troubleshooting. You can download the Windbg kit free from Microsoft, and the kind of kit you choose depends on the architecture you are installing it on: (x86 or x64/IA64). Each is capable of reading a dump from a different architecture (i.e., 32-bit Windbg can read a 64-bit dump and vice versa).
Once Windbg is installed, be sure to establish the symbol path as documented in KB article 311503. Setting up the symbol path allows the debugger to translate memory references to meaningful functions and variable names. This will allow you to look at a stack trace and determine what routines were executing at the time of the hang.
Once you have all this set up, you are ready to analyze a crash dump. Use the appropriate keystrokes, Web GUI, Management Processor TC command or NMI button to initiate the forced crash as previously described. Be sure to allow sufficient time for the contents of memory to write to the pagefile.sys. If you have trouble getting the crash dump created, be aware that there are several reasons why a crash dump may not be captured as expected (see KB article 130536).
Now you are ready to determine the cause of the server hang. In the final part of this series, I’ll explain how to use Windbg to analyze a forced crash as a means of resolving the problem.
Previously in this series, we talked about why Windows server hangs occur and how to prepare to resolve the problem using a tool called the Windows Kernel Debugger, or Windbg. In this article, we’ll finish up by learning how to analyze the crash dump and fixing the issue.
After you have captured a forced crash dump, you are ready to begin using Windbg to determine what caused the hang. The following sections will explore the appropriate Windbg commands to use depending on the type of hang.
You can invoke Windbg two ways. One way is from the Windows Start menu:
Start | All Programs | Debugging Tools for Windows | Windbg
The other is from the DOS command prompt:
C:\ > windbg
In Windbg, use the File pulldown menu to select Open Crash Dump, specifying the location of the dumpfile. This can be accomplished in one step from the command prompt by using the –z option:
C:\> windbg –z memory.dmp
Be sure to watch out for any warnings from Windbg indicating a truncated or inconsistent set-bit count. Messages like this may indicate the dumpfile is corrupt or missing data:
WARNING: Dump file has been truncated. Data may be missing.WARNING: Dump file has inconsistent set-bit count. Data may be missing.
Windbg does a good job of pointing out problems with asterisks (*), so be sure to pay particular attention whenever you see them in the output. By default, the debugger output is displayed in the main window with a one-line command prompt at the bottom.
No matter what sort of hang your server has encountered, the first command that should be used in Windbg is this:
!analyze –v -hang
The !analyze command will perform a preliminary analysis of the dump and provide a “best guess” for what caused the crash. In the case of a forced dump, the analysis will typically point to the i8042prt.sys or kbdhid.sys driver because that is the driver that initiated the crash. You will also notice the bugcheck type is a 0xE2, indicating a manually initiated crash as seen in Figure 1.
In addition to providing a best guess for the cause of the crash, the !analyze command will also check for blocking locks and set the processor, process, thread and register context to the current ones at the time of the crash. Subsequent commands will use this context for their execution.
Once you have executed the !analyze command, the commands in Table 1 will help determine the footprint or circumstances that existed when the crash was forced. Be sure to focus on the current process, current thread, stack trace, virtual and physical memory usage, and locking information. We will take a closer look at these commands in subsequent sections.
Windbg commands for analyzing server hangs.
|!process||Display current process information|
|!thread||Display current thread information|
|!running –it||Display currently executing threads on all CPUs|
|!vm||Display virtual memory usage|
|!poolused||Display paged and non-paged pool usage|
|!memusage||Display physical memory usage|
|!locks||Display kernel locks held|
|!stacks||Display summary of threads, states and function|
|kv||Display current threads stack trace|
High-priority compute-bound threads
Identifying the current process (!process) and the current thread (!thread) can prove useful if the server hung because of a high-priority runaway, compute-bound thread. Use the !running –it command, as it will list all the currently executing threads across all the processors. Processes and threads can be assigned various levels of priorities that can preempt other processes and threads.
System resource depletion
If you suspect a system resource depletion caused the hang, use the !vm, !poolused and!memusage commands. These commands display the virtual and physical memory usage at the time of the hang. Be sure to watch for any asterisks flagged by Windbg as illustrated in Figure 2.
To determine if paged pool or non-paged pool has been depleted, compare the “usage” to the “maximum” value as circled in red above. If the usage is relatively close to the maximum value, then there is a high likelihood that pool depletion caused the hang. You would then use the!poolused command to focus in on which pool data structure was responsible. The !poolusedcommand has several flags to sort the paged or non-paged data structures according to their usage (see the online debugger help for more information on the command syntax and usage).
It is worth mentioning that pool statistics can also be acquired by several tools without the need for a memory dump. You can use Perfmon to collect general paged and non-paged performance statistics. Poolmon and Poolsnap are free tools from Microsoft that capture more granular specifics on the actual pool data structures. Finally, note that it is possible to tune paged pool on x86 servers by tweaking two registry values (PagedPoolSize and PoolUsageMaximum). For further details on tuning paged pool, check out Microsoft KB article 312362.
Deadlock and spinlock hangs
Use the !locks command if you suspect a deadlock hang. As explained earlier, a deadlock exists when one thread owns an exclusive lock on a resource that another thread wants, and that thread exclusively owns a resource that the initial thread wants. There are several variants of a deadlock scenario, but there must be waiter threads that stall as a result. In Figure 3, you can see a potential deadlock scenario where we have an exclusively owned lock on a resource that has numerous waiters.
You will notice under the list of threads for the resource that one has an asterisk next to it. This thread is the one that owns the exclusive lock for the resource. So, the question to be answered is, what is causing the owning thread to stall and not release the lock for the other waiters to acquire? Therefore, the next command to issue would be a !thread command on the owning thread to determine why it is stalled.
Figure 4 shows the output of a !thread command on the owner. It reveals that it is stalled waiting for an I/O request packet (IRP) to complete from the QAFilter.sys driver. This particular case is a known issue caused by a deadlock with the QAFilter driver documented in Microsoft KB article 906194. Note that QAFilter and NmSvFsf are not standard Microsoft drivers, so symbols are not available for them from the Microsoft symbol server.
A spinlock hang is very similar to a deadlock condition except that processors are involved instead of threads. A data structure called a spinlock is used to synchronize access to other data structures or a critical section. Only one processor can own a particular spinlock at a time. The other processors that want to acquire the spinlock will wait (or spin) until the spinlock is released. In a spinlock scenario, multiple processors all want to acquire the same spinlock at an elevated IRQL, causing a perceived system hang.
To troubleshoot a spinlock hang, examine each processor to determine what function is executing at the time. Use the ~# command — where # is the processor number (0, 1, 2 …) — to change context between processors. You will notice that the debugger’s kd prompt changes to reflect the processor number that currently has context.
Then use the !thread or kv command to determine the stack trace of the current thread to see what function was executing. In a true spinlock scenario, all processors except one will be executing a spinlock acquire function. Finally, to determine the culprit (driver) responsible for the spinlock condition, look down the stack trace for the last driver to call the spinlock acquire function. See Figure 5 for an example of a stack trace illustrating a spinlock hang initiated by the XYZDrv.sys driver.
Finally, the command !stacks is very useful to determine which threads are executing and the states of those threads (running, ready, blocked, etc.). In the example of the spinlock hang,!stacks was extremely useful in illustrating how threads currently running on the various processors were all trying to acquire spinlocks except for the current thread that was executing the bugcheck code. Figure 6 shows an example of the !stacks command and the pertinent output.
And there you have it. Troubleshooting non-responsive Windows servers can be very perplexing. Fortunately, the Windows operating system has matured over the years and now offers a variety of features and tools to help determine what causes servers to hang. By forcing a crash dump and using Windbg to analyze it, you can typically isolate the hang to a particular application or system resource. Plus, if the problem requires further analysis from Microsoft, you will have the memory dump they will need to troubleshoot the issue.