This blog contains reflections and thoughts on my work as a software engineer

Saturday, October 10, 2009

Tracking down an IIS deadlock with WinDbg

We’ve had a few performance issues lately on the website I’m working on – while digging through the code we’ve discovered a lot of technical debt, which we’ve started paying more attention to. We also discovered that IIS on our farm servers went down on a daily basis – approximately once or twice a day IIS on each server would recycle the application pool for reasons unknown to us. There was an entry in the event log telling us that IIS had deadlocked and hence a reset had been issued. Our monitoring didn't work as expected, so we never got anything but random users complaining once or twice a month that "the site is slow sometimes"...

We developed quite a lot of theories along the way, but based on gut feeling and what we could see in log files etc. we suspected either thread pool starvation – caused by the way we made some synchronous, long-running web service calls to e.g. clear the server’s CMS cache – or an infinite loop spinning away and eating all available resources until IIS decided it needed to recycle the application pool to come back to life. We had little else to go on apart from Performance Monitor telling us a few minutes in advance that a server was going down, due to a massive number of SQL user connections suddenly craving all resources.

Enter WinDbg – the coolest and geekiest debugging tool I have ever worked with. Until last week I had no knowledge of debugging with WinDbg apart from reading the occasional blog post by people who have made debugging their niche in the development community. I decided to take a dive into it because we needed some facts on the table instead of just more theories. First of all I knew we had to collect some sort of dump of the process while it was in an unhealthy condition, so we started collecting memory dumps using ADPlus, and after a few hours of waiting the IIS on one server went down and generated a 500+ MB dump named [something]_Process_Shutdown_[timestamps-something].dmp.

I spent the next three days deep inside that dump, scratching my head, reading lots and lots of blog posts found on Google, peeking and poking and trying out commands which didn’t output anything useful. At first I blamed my own lack of competence for the missing results, until I came across a blog post about false positive dumps… which was exactly what I had. A false positive means that the dump was generated while the process was in a different state than expected – something you can only find out by debugging the contents of the dump. What I had wasn’t a dump of a deadlocked process, oh no – I had a dump of the process taken right after IIS had decided to recycle the application pool, so every thread except the finalizer had no ID, which means it had been shut down nicely and was basically just waiting for the garbage collector to put an end to its misery. That was also why there were no call stacks on the threads and nothing useful on the heap… Eureka! Nothing but a lot of useless bytes in a big file, basically.
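For the record, the ADPlus invocation mentioned above was along these lines – the output folder is an example, not our exact setup:

cscript adplus.vbs -crash -pn w3wp.exe -o D:\dumps -quiet

In -crash mode ADPlus attaches a console debugger to the process and writes dumps when exceptions occur or the process shuts down; -pn picks the process by name (with a specific PID you would use -p instead), and -quiet suppresses the dialog boxes so it can run unattended on a server.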

A lot wiser but still with no data, I started looking for a way to collect a dump while IIS was in the deadlocked state. I managed to locate an MSDN article describing how to set the metabase property OrphanWorkerProcess to “true”, so that instead of IIS automatically recycling the application pool, I could kick off a script invoking WinDbg from the command line to write a dump of the process to disk for later analysis. It took a while, but after a bit of trial and error I managed to get the script working. The script worked fine apart from the IIS not calling filename.cmd with a parameter whenever the worker process was orphaned, so whenever the IIS executed the script, the script invoking the debugger failed because there was no process ID in %1… It took only another two days to figure out what was going on, thank you very much Microsoft. MSDN is an invaluable resource for me, but I didn't have anything nice to say about the author of that article for about 10-12 minutes   :o)
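For the curious, the configuration was roughly as follows – the app pool name, the paths and the script name dumpw3wp.cmd are made up for illustration, and this is a sketch from memory rather than a copy of our actual files:

cscript C:\Inetpub\AdminScripts\adsutil.vbs SET W3SVC/AppPools/MyAppPool/OrphanWorkerProcess TRUE
cscript C:\Inetpub\AdminScripts\adsutil.vbs SET W3SVC/AppPools/MyAppPool/OrphanActionExe "cmd.exe"
cscript C:\Inetpub\AdminScripts\adsutil.vbs SET W3SVC/AppPools/MyAppPool/OrphanActionParams "/c C:\debuggers\dumpw3wp.cmd %1%"

dumpw3wp.cmd then does nothing more than attach a console debugger to the orphaned worker process and write a full dump:

@echo off
rem %1 should hold the PID of the orphaned w3wp.exe process - in our case it never arrived, as described above
"C:\Program Files\Debugging Tools for Windows\cdb.exe" -p %1 -c ".dump /ma /u C:\dumps\w3wp_orphan.dmp; qd"

The .dump /ma command writes a full memory dump, /u appends a timestamp and the PID to the file name, and qd detaches the debugger again so the process is left for IIS to clean up.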

As of tonight I managed to get a true positive dump, and it took only 15 minutes to figure out exactly what was going on. The last week’s trial and error, peeking and poking proved to be a good investment – here’s what I did:

I downloaded the dump generated by WinDbg to my local machine and fired up WinDbg there. I had earlier downloaded the .NET Framework binaries from the farm server, because even though you have version v2.0.50727 installed on your local machine, there is a chance that it isn’t the exact same revision as the one installed on your server. In my case my local workstation’s copy of aspnet_isapi.dll is v2.0.50727.4016 while the server has 2.0.50727.3053 – I figure it has something to do with the fact that I’m running Vista while the farm server is running Windows Server 2003. It has no effect on your daily work, but when debugging a crash dump it is essential that you have the *exact* same assemblies and symbol files as the machine where the dump was generated; otherwise you might not be able to extract information from the dump or – even worse – you might get corrupted or incorrect data which could very easily lead you in a completely wrong direction. So I zipped and downloaded the .NET Framework files from the farm server to my local machine and used the command “.exepath+ [path]” to make WinDbg search for matching assemblies in [path] if no matches were found in the default locations.
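In practice that meant pointing WinDbg at Microsoft’s public symbol server and at the downloaded binaries before digging in – something along these lines (the local folders are examples):

$$ symbol path: a local cache plus Microsoft's public symbol server
.sympath srv*C:\symbols*http://msdl.microsoft.com/download/symbols
$$ also search the folder holding the server's copy of the v2.0.50727.3053 binaries
.exepath+ C:\dumps\framework-from-server
$$ reload module information with the new paths in place
.reload
$$ load the SOS extension from the same folder as the server's CLR (mscorwks.dll)
.loadby sos mscorwks

The .loadby sos mscorwks part is what makes the managed-debugging commands like !threads and !clrstack used below available.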

I loaded the dump into WinDbg and thought back on our initial suspects. Our first suspicion was the infinite loop, because it was the strongest theory we had. Threading issues, such as two threads deadlocking each other, would also be harder to debug, while an infinite loop would without a doubt reveal itself in a very long stack trace on one of the active worker threads – provided, of course, that I didn't have yet another false positive crash dump(!). So I used !threads to get a list of threads – here’s the output:

0:028> !threads 
ThreadCount: 52
UnstartedThread: 0
BackgroundThread: 39
PendingThread: 0
DeadThread: 6
Hosted Runtime: no
PreEmptive GC Alloc Lock
ID OSID ThreadOBJ State GC Context Domain Count APT Exception
13 1 52c 000ebc28 1808220 Enabled 00000000:00000000 000db760 0 MTA (Threadpool Worker)
15 2 1008 000f7ce0 b220 Enabled 00000000:00000000 000db760 0 MTA (Finalizer)
16 3 164c 00110208 80a220 Enabled 00000000:00000000 000db760 0 MTA (Threadpool Completion Port)
17 4 11c0 00113840 1220 Enabled 00000000:00000000 000db760 0 Ukn
11 5 314 0012f818 880a220 Enabled 00000000:00000000 000db760 0 MTA (Threadpool Completion Port)
18 6 12a0 0015ff88 180b220 Enabled 00000000:00000000 000db760 0 MTA (Threadpool Worker)
19 7 1760 0015a5c8 200b020 Enabled 00000000:00000000 001147a0 0 MTA
20 8 11fc 00164c40 200b020 Enabled 00000000:00000000 001147a0 0 MTA
21 9 f80 00158dc8 200b020 Enabled 00000000:00000000 001147a0 0 MTA
22 a 14dc 001597a8 200b020 Enabled 00000000:00000000 001147a0 0 MTA
23 b 170 0012a638 200b020 Enabled 00000000:00000000 001147a0 0 MTA
24 c 152c 00170c58 200b020 Enabled 00000000:00000000 001147a0 0 MTA
25 d 1668 06b33c10 200b020 Enabled 00000000:00000000 001147a0 0 MTA
27 e e70 06b6fb40 200b220 Enabled 00000000:00000000 001147a0 0 MTA
28 f 5e4 06b5e420 180b220 Enabled 00000000:00000000 001147a0 1 MTA (Threadpool Worker) System.Data.SqlClient.SqlException (1ee19b2c) (nested exceptions)
29 10 1470 06b6a650 200b220 Enabled 00000000:00000000 001147a0 0 MTA
30 11 fa4 06b7dbe8 180b220 Enabled 00000000:00000000 000db760 0 MTA (Threadpool Worker)
10 12 148c 06b80e60 220 Enabled 00000000:00000000 000db760 0 Ukn
7 14 1640 06b8e1b0 220 Enabled 00000000:00000000 000db760 0 Ukn
37 16 fd4 08f340d8 180b220 Enabled 00000000:00000000 000db760 0 MTA (Threadpool Worker)
39 18 8bc 09097448 200b220 Enabled 00000000:00000000 001147a0 0 MTA
40 17 104c 090532b8 200b220 Enabled 00000000:00000000 001147a0 0 MTA
41 19 1238 090967e0 200b220 Enabled 00000000:00000000 001147a0 0 MTA
42 1a 13fc 090a8bd8 200b220 Enabled 00000000:00000000 001147a0 0 MTA
43 1b 1408 09055108 200b220 Enabled 00000000:00000000 001147a0 0 MTA
44 1c 1580 090559c0 200b220 Enabled 00000000:00000000 001147a0 0 MTA
45 1d 184 09056278 200b220 Enabled 00000000:00000000 001147a0 0 MTA
46 1e 13dc 09056b30 200b220 Enabled 00000000:00000000 001147a0 0 MTA
47 1f 12e8 090573e8 200b220 Enabled 00000000:00000000 001147a0 0 MTA
48 20 1458 06c04298 200b220 Enabled 00000000:00000000 001147a0 0 MTA
49 21 93c 06c04878 200b220 Enabled 00000000:00000000 001147a0 0 MTA
50 22 d5c 06c052a8 200b220 Enabled 00000000:00000000 001147a0 0 MTA
51 23 15ec 06c05cd8 200b220 Enabled 00000000:00000000 001147a0 0 MTA
52 24 29c 06c06738 200b220 Enabled 00000000:00000000 001147a0 0 MTA
53 25 1098 06c07198 200b220 Enabled 00000000:00000000 001147a0 0 MTA
54 26 13cc 06c07bf8 200b220 Enabled 00000000:00000000 001147a0 0 MTA
55 27 12f0 06c08658 200b220 Enabled 00000000:00000000 001147a0 0 MTA
56 28 ce4 06c090b8 200b220 Enabled 00000000:00000000 001147a0 0 MTA
57 29 e60 06c09b00 200b220 Enabled 00000000:00000000 001147a0 0 MTA
58 2a d98 06c0a4e0 200b220 Enabled 00000000:00000000 001147a0 0 MTA
6 15 1060 090a6980 220 Enabled 00000000:00000000 000db760 0 Ukn
4 2b 590 090ddae8 220 Enabled 00000000:00000000 000db760 0 Ukn
3 13 cd4 1712e4f8 220 Enabled 00000000:00000000 000db760 0 Ukn
XXXX 2c 0 17133d50 1801820 Enabled 00000000:00000000 000db760 0 Ukn (Threadpool Worker)
XXXX 2d 0 17104970 1801820 Enabled 00000000:00000000 000db760 0 Ukn (Threadpool Worker)
XXXX 2e 0 090ac858 1801820 Enabled 00000000:00000000 000db760 0 Ukn (Threadpool Worker)
69 2f 99c 090ad248 180b220 Enabled 00000000:00000000 000db760 0 MTA (Threadpool Worker)
XXXX 30 0 090b5dd0 1801820 Enabled 00000000:00000000 000db760 0 Ukn (Threadpool Worker)
70 31 fcc 090b70e8 180b220 Enabled 00000000:00000000 000db760 0 MTA (Threadpool Worker)
XXXX 32 0 090b83e0 1801820 Enabled 00000000:00000000 000db760 0 Ukn (Threadpool Worker)
XXXX 33 0 090bb380 1801820 Enabled 00000000:00000000 000db760 0 Ukn (Threadpool Worker)
75 35 1134 1718fb70 200b220 Enabled 00000000:00000000 001147a0 1 MTA

There was an immediate suspect – thread #28 appeared to have a nested SqlException in it… So I switched to that thread by executing the command “~28s”, and now I could get a call stack for it using “!clrstack”, with the following output:


…074fe5d4 0fc5c6f5 dgiCore.Log.ExceptionLogPublisher.Publish(System.Exception, System.Collections.Specialized.NameValueCollection, System.Collections.Specialized.NameValueCollection) 
[snip]
074fe920 0fc5c6f5 dgiCore.Log.ExceptionLogPublisher.Publish(System.Exception, System.Collections.Specialized.NameValueCollection, System.Collections.Specialized.NameValueCollection)
[snip]
074fec6c 0fc5c6f5 dgiCore.Log.ExceptionLogPublisher.Publish(System.Exception, System.Collections.Specialized.NameValueCollection, System.Collections.Specialized.NameValueCollection)
[snip]
074fee20 0f97d9e2 DGI.Web.layouts.DGI.Loginbox.btnLogin_Click(System.Object, System.EventArgs)
[snip]
074ff09c 01e1adf5 ASP.layouts_main_aspx.ProcessRequest(System.Web.HttpContext)
[snip]
074ff180 66083e1c System.Web.HttpRuntime.ProcessRequestInternal(System.Web.HttpWorkerRequest)
074ff1b4 66083ac3 System.Web.HttpRuntime.ProcessRequestNoDemand(System.Web.HttpWorkerRequest)
074ff1c4 66082c5c System.Web.Hosting.ISAPIRuntime.ProcessRequest(IntPtr, Int32)
074ff3d8 79f68cde [ContextTransitionFrame: 074ff3d8]
074ff40c 79f68cde [GCFrame: 074ff40c]
074ff568 79f68cde [ComMethodFrame: 074ff568]

I was quite surprised to find out that a click on the Submit button on our login page proved to be the cause of an infinite loop… What parameters were the methods being called with? Enter the command “!clrstack -p”, and somewhere along the massive output of nested exceptions I found the call invoked by the user clicking the “Login” button:


074fed68 0d6a755b dgiCore.Security.Authenticate(System.String, System.String, System.String, dgiCore.Log.Exceptions.Fejl ByRef) 
PARAMETERS:
username =
password =
applicationName =
fejl = 0x074fee24

Empty username and password? Normally there would be a reference to an address in memory… Just for the fun of it I tried to log in to the website under suspicion with an empty username and password, and what do you know? The website hung until IIS decided to recycle the application pool and come back to life with a new process ID... Imagine my surprise, sitting late in the evening, watching every little piece of the puzzle fall into place as the IIS on the server choked harder and harder before its inevitable death and immediate resurrection. It was in that exact split second I fell in deep, unconditional love with WinDbg.
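As a side note: the one parameter that does hold something, fejl = 0x074fee24, is passed ByRef, so the value shown is the address of a stack slot rather than of the object itself. If I had wanted to dig into it, my understanding is that something like this would do the trick – the second command takes whatever pointer the first one prints:

dp 074fee24 L1
!do <pointer printed by the line above>

dp reads one pointer-sized value from the stack slot, and !do (!dumpobj) then dumps the managed object it points to.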


I won’t go into detail about the error itself and how it was fixed (I won’t say too much, but part of it might include a bit of JavaScript form validation), but I have had a major success using WinDbg (and Google) to track down and fix an error which by default only revealed itself through an entry in the server’s event log. WinDbg is indeed a very useful tool in a developer’s toolbox, and now that I’ve got some experience with it I can only imagine how many hours I’ve wasted earlier trying to come up with grand unified theories about the behaviour of a piece of code, where I would now just attach a debugger or create a dump and check for myself what’s actually going on. It took me four to five effective days to get to know the mindset and the commands necessary to make it useful, but boy... they will pay for themselves a hundred times over the next three decades of programming. Try it out for yourself  :o)

4 comments:

Unknown said...

You wrote: "The script worked fine apart from the IIS not calling filename.cmd with a parameter whenever the worker process was orphaned, so whenever the IIS executed the script, the script invoking the debugger failed because there was no process ID in %1… It took only another two days to figure out what was going on, thank you very much Microsoft."

What exactly did you mean by this? What was your fix?

Kristian Erbou said...

@Chris: The fix in this particular situation was that instead of relying on IIS to provide the process ID of the w3wp.exe process and extracting it with %1, I hardcoded the process ID into the script and waited for the server to go down. The catch was that if the server went down and the script didn't write the dump as expected (I had some issues regarding read/write access to the folder where my dump should be located), I had to edit the script again and change the process ID to the one which the w3wp.exe process had now.
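In other words, the line in the action script ended up looking roughly like this, with the current w3wp.exe PID (1480 here is just an example) baked in and edited by hand after every recycle:

"C:\Program Files\Debugging Tools for Windows\cdb.exe" -p 1480 -c ".dump /ma /u C:\dumps\w3wp_orphan.dmp; qd"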

Unknown said...

So I was able to get a dump file generated; however, I am getting no information in my dump. I believe I have loaded the correct symbols, and I am running WinDbg on the server so the environment versions are correct. I was wondering if you ran into issues where the dump returned something like this:

"The dump file has a breakpoint exception stored in it. The stored exception information can be accessed via .ecxr.ntdll!DbgBreakPoint"

Is this an issue with not getting information for the correct process? The dump file is 2GB so there has to be some data in it. Just curious if you had any ideas?

Kristian Erbou said...

As of tonight I'm trying to debug a Blue Screen of Death on a colleague's computer (some driver issue, I believe – but I thought I'd try WinDbg on the minidump generated just before the reboot occurred) – running "!analyze -v" got me the following message:

"This is a very common bugcheck. Usually the exception address pinpoints
the driver/function that caused the problem. Always note this address
as well as the link date of the driver/image that contains this address.
Some common problems are exception code 0x80000003. This means a hard
coded breakpoint or assertion was hit, but this system was booted
/NODEBUG. This is not supposed to happen as developers should never have
hardcoded breakpoints in retail code, but ...
If this happens, make sure a debugger gets connected, and the
system is booted /DEBUG. This will let us see why this breakpoint is
happening."

Have you by any chance noticed the exception code 0x80000003 in your dump?