WinDbg is an immensely useful tool, but it's painfully hard to extract value from if you don't know your way around it. On a few occasions I've found myself in a situation where I knew WinDbg could give me the answer, but I didn't know enough to ask it the right questions.
This week we performed a production deployment on one of the projects I am involved in. Pretty soon after, we started to notice some issues cropping up that were affecting the performance and stability of the whole site.
In this scenario, WinDbg was a life saver for us.
The first sign that something was wrong was this event getting fired across the whole front-end tier:
03 Jul 2009 08:32:43 AM
Monitor Title: "Event Log Monitor" (Type=Event Log Monitor)
* Event Time: 03 Jul 2009 08:32:42 AM
* Source: .NET Runtime 2.0 Error Reporting
* Event Log: Application
* Type: Error Event
* Event ID: 1000
* Faulting application w3wp.exe, version 6.0.3790.3959, stamp 45d6968e, faulting module kernel32.dll, version 5.2.3790.4062, stamp 4626467c, debug? 0, fault address 0x00022366.
We also noted that CPU usage was spiking erratically and was well above the average we would expect for the number of firewall connections we had open.
This event log entry was telling us that the w3wp worker process was faulting in an unrecoverable way, causing the application pool to be torn down in IIS. The nature of the fault was also causing the Windows Error Reporting
dw20.exe process to trigger and try to capture a mini dump. This rapid recycling, combined with the CPU- and IO-intensive work of capturing a mini dump, was killing our servers, and we needed to act fast.
The first step was to work out what was causing the process to be torn down in the first place, as the impact of Windows Error Reporting's automated analysis was only a secondary issue. To capture this information, I used the
adplus script from the Debugging Tools for Windows package to take a dump of the faulting process.
In its default configuration, adplus captures a crash dump on any first-chance exception. Because this is an ASP.NET application, there are a number of exceptions we expect to be thrown as 404s are reported, redirects are performed, and so on. As such, running adplus in its default configuration caused it to grab crash dumps far too frequently, without ever capturing the fault that was causing the complete teardown.
To capture the process teardown fault, I ran adplus with these parameters:
adplus.vbs -pn w3wp.exe -c Unknown.cfg
The first parameter just says that we want to capture dump data for the
w3wp.exe process. The second parameter is the interesting one – I am passing in a configuration file that includes some exception filters. I grabbed this config from Tess’ blog.
The content of the configuration file is fairly simple XML that describes the unknown fault scenario:
<adplus>
  <settings>
    <runmode> CRASH </runmode>
  </settings>
  <exceptions>
    <config>
      <code>AllExceptions</code>
      <actions1>Log</actions1>
      <actions2>MiniDump;Log;EventLog</actions2>
    </config>
    <newexception>
      <code> 0xe053534f </code>
      <name> Unknown_Exception </name>
    </newexception>
    <config>
      <code> 0xe053534f </code>
      <actions1>FullDump;Log;EventLog</actions1>
      <actions2>FullDump;Log;EventLog</actions2>
    </config>
  </exceptions>
</adplus>
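As an aside, many Windows custom exception codes embed a short ASCII tag in their low bytes; the CLR's own exception code, 0xE0434F4D, famously decodes to "COM". A quick Python sketch (my own helper, not part of any tool mentioned here) makes the tag inside 0xe053534f visible:

```python
def exception_tag(code: int) -> str:
    """Decode the three low bytes of a Windows custom exception code
    as ASCII, a convention many runtimes use to tag their exceptions."""
    return (code & 0xFFFFFF).to_bytes(3, "big").decode("ascii")

print(exception_tag(0xE0434F4D))  # CLR exceptions -> "COM"
print(exception_tag(0xE053534F))  # the code filtered above -> "SSO"
```

Whatever the tag was meant to abbreviate, treating the code as an opaque filter value, as the config above does, is all adplus needs.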
With the debugger now attached, it was a simple matter of waiting for the problem to occur. In our scenario it was happening every 10 to 15 minutes per server, so it took a while for me to capture the dump. Finally, the console showed “Dump file successfully written”.
To analyse the dump file, I opened
windbg.exe, chose Open Crash Dump and selected the
.dmp file that had been written out. You’ll usually find it in a CrashDump folder next to where you ran adplus from.
As we are working with a .NET application, but WinDbg is a native debugger, my next step was to load a set of debugger helpers called SOS. This is basically a plugin for WinDbg that lets us access information about the CLR state.
To load it, run:
.loadby sos mscorwks
The period at the start says it’s a WinDbg core command. The next parameter is the name of the library we want to load into the debugger. The final parameter is basically a relative file base – we’re telling WinDbg to load SOS from the same location that
mscorwks had been loaded from.
mscorwks is the .NET runtime, so this approach helps ensure we are loading the correct version and architecture of SOS and it saves us from having to write out the full path.
Next, I wanted to see where in our application we were when the process faulted. With SOS loaded, the command is quite simple:

!clrstack
Instantly, it was obvious that we had a stack overflow problem. This was made clear by literally hundreds of repeated frames on our call stack. It also explained why our ASP.NET Health Monitoring wasn’t reporting an exception, as a stack overflow causes the whole process to be torn down before any exception handling code can be run.
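When the repetition is buried among thousands of frames, counting beats eyeballing. Here's a rough Python sketch (a hypothetical helper of my own, assuming the usual two-address-columns-then-method-name layout of a saved stack listing) that flags frames repeating suspiciously often:

```python
from collections import Counter

def repeated_frames(stack_text: str, threshold: int = 50) -> dict:
    """Count each frame in a saved call-stack listing and return those
    that repeat often enough to suggest unbounded recursion."""
    frames = [line.split(None, 2)[-1]          # drop the two address columns
              for line in stack_text.splitlines()
              if line.strip() and not line.startswith("OS Thread")]
    counts = Counter(frames)
    return {frame: n for frame, n in counts.items() if n >= threshold}
```

Piping a pasted stack through something like this turns "literally hundreds of repeated frames" into a single suspect method name.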
Here’s a short snippet (with the project details masked out):
The component in question is used across almost all of our pages, yet the site wasn't completely down, so I could deduce that the fault was being triggered by a very specific piece of data. This leads us to our next challenge: finding out which record was triggering the problem.
To do so, I ran:

!clrstack -p

The -p argument asks SOS to list out the parameters for each stack frame.
Finding out the URL of the current request is a non-trivial task, and not all parameters are available due to runtime optimisations. Rather than persisting in a search for the URL, it was much easier to grab another stack frame which included the information I needed:
1947ed28 1abb1c7e CompanyName.Project.Web.Logic.Presenters.RetailItemPresenter.View_ItemLoading(System.Object, CompanyName.Project.Web.Logic.Views.ItemLoadingEventArgs)
this = 0x04144bf8
e = 0x04151b38
This is the event handler on one of our presenters being called as a result of the
ItemLoading event being fired on the corresponding view. This approach is specific to our MVP architecture, but it was a known and simple entry point into most of the stack. Try looking for an equivalent call in your own stack.
I could now see that it was a particular retail item triggering the problem, and I knew that the information I needed was in the event arguments. You can see the parameter on the stack frame shown as
e = 0x04151b38. That number is the memory location of the object we need.
To get the object, I then ran:

!do 0x04151b38

The !do command is a shortcut for !DumpObject, and it gave us this output:
Size: 12(0xc) bytes
(c:\WINDOWS\Microsoft.NET\Framework\v2.0.50727\Temporary ASP.NET Files\root\aef04219\e69835b3\assembly\dl3\29e5062003a4007_a4fac901\CompanyName.Project.Web.Logic.DLL)
MT Field Offset Type VT Attr Value Name
7932a400 4000183 4c System.EventArgs 0 shared static Empty
>> Domain:Value 001cc630:NotInit 00204700:03043914 <<
793308ec 4000819 4 System.String 0 instance 0412522c <RetailItemId>k__BackingField
We can now see the
RetailItemId property and its corresponding backing field. The
value column on that line is
0412522c, which is another memory reference.
Once again, we dump the memory reference:

!do 0412522c

This returns us the corresponding System.String object:
Size: 28(0x1c) bytes
MT Field Offset Type VT Attr Value Name
79332b38 4000096 4 System.Int32 1 instance 6 m_arrayLength
79332b38 4000097 8 System.Int32 1 instance 5 m_stringLength
793315cc 4000098 c System.Char 1 instance 36 m_firstChar
793308ec 4000099 10 System.String 0 shared static Empty
>> Domain:Value 001cc630:02fe01d0 00204700:02fe01d0 <<
7933151c 400009a 14 System.Char 0 shared static WhitespaceChars
>> Domain:Value 001cc630:02fe0728 00204700:06fe0e14 <<
Hidden away on the sixth line there is the raw string value,
6T046, which was the product id.
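Those m_arrayLength / m_stringLength / m_firstChar fields are the raw layout of a .NET 2.0 string, and the offsets in the dump (0x4, 0x8, 0xc) are enough to decode one by hand from raw memory bytes. A small Python sketch of that decoding, assuming the 32-bit layout shown above:

```python
import struct

def decode_clr2_string(raw: bytes) -> str:
    """Decode a .NET 2.0 System.String from its raw object bytes.

    32-bit layout, matching the field offsets in the dump above:
      0x0  MethodTable pointer
      0x4  m_arrayLength  (Int32)
      0x8  m_stringLength (Int32)
      0xc  m_firstChar onwards (UTF-16LE characters)
    """
    (length,) = struct.unpack_from("<i", raw, 0x8)
    return raw[0xc:0xc + 2 * length].decode("utf-16-le")

# Rebuild the object from the dump above: array length 6, string
# length 5, first char 0x36 ('6') -- giving the product id "6T046".
raw = b"\x00" * 4 + struct.pack("<ii", 6, 5) + "6T046".encode("utf-16-le")
```

Note how the m_firstChar value of 36 in the dump is simply the hex UTF-16 code for '6', the first character of the product id.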
At this point, we had:
- Captured a complete memory dump of the w3wp.exe process when the fault occurred
- Identified a stack overflow to be the cause of the fault
- Identified the area of our code that was involved in the stack overflow
- Identified the particular piece of data that was exposing the bug
We could now:
- Perform a quick fix to the data to circumvent the bug and get the website performant and stable again
- Reproduce the bug locally using the same data that caused the issue
- Have the lazy Friday we’d been hoping for