Note: If you are interested only in finding a quick resolution for Non-Yielding scheduler or Non-yielding IOCP Listener dumps or Non-yielding resource monitor Jump to THIS LINK. Continue reading this article if you like to learn How to analyze Non-Yielding scheduler dumps and Non-yielding IOCP listener dumps
How to analyze Non-Yielding scheduler dumps and Non-yielding IOCP listener dumps?
This blog is targeted towards analyzing Non-Yielding scheduler dumps and not targeted on explaining how Non-Yield Detection works please read http://technet.microsoft.com/en-us/library/cc917684.aspx to understand how the Non-Yield Detection works but let us recollect few key points before we get in to analysis.
1. SQL Server has its own logical schedulers to schedule the SQL Server workers.
2. The scheduler is called the User Mode Scheduler (UMS) in SQL Server 2000 and the SQL Server Operating System (SOS) Scheduler in SQL Server 2005
3. Logical scheduler makes the worker non-preemptive to the database engine. The worker owns the scheduler until it yields to another worker on the same scheduler.
What if the threads which owns the scheduler executes for long time without yielding (or) forever and does not yield to give quantum for the other threads waiting in the scheduler?
Answer: Other threads would not get CPU cycles and starve the SQL Server performance.
What if the thread is not able to finish its work with in quantum –(4 Milliseconds) for example large for loop?
SQL Server worker thread’s Quantum target is 4ms which means the thread(worker) is expected to yield back to SQL Server scheduler when it exceeds 4ms and rescheduled when the other threads which are already waiting in SOS Scheduler (runnable list) finishes its execution or quantum.
What if the thread did not yield after 4 Milliseconds?
SQL Server has its scheduler monitor to track this. SchedulerMonitor algorithm is to check non-Yield condition every 5 seconds during which the basic check (Check if the thread is executing for >4Ms) is done . When the basic check evaluates to true, tracking of the worker begins and if the thread doesn’t yield beyond 10seconds (Nonyield threshold) after the tracking begins then threshold check becomes true. So there is approximately 15 seconds between the time of the last yield on the scheduler and the time that the threshold check becomes true and tracking continues.
A dump is taken when an specific nonyield situation has reached 60 seconds in total duration. Once a 17883 mini-dump is captured, no further 17883 mini-dumps are captured until trace flag -T1262 is enabled or the SQL Server process is restarted. However, 17883 error message reporting continues, regardless of the mini-dump capture. Also when –T1262 is enabled mini-dump is captured when the Non-Yield threshold check becomes true (15 seconds)
and at subsequent 60-second intervals for the same nonyield occurrence. A new nonyielding occurrence causes dump captures to occur again.
When the SQL Server decides to take the minidump on nonyield occurrence it copies the CONTEXT of the nonyielding thread to a global structure and then initiates the dump because Sometimes it is possible that by the time SQLDumper gets the dump, the non-yielding thread has already yielded. So to get the exact snapshot of the thread we need to trust on CONTEXT saved in global structure also we can compare the current stack of the thread with the one which is copied and check if the thread is progressing.
Non-yielding IOCP Listener
An identical algorithm is used to detect non-yielding I/O completion routines, counting completed I/O completion routines instead of number of yields. Scheduler Monitor takes a dump when it notices the IOCP has not moved for 10 seconds. Analyzing Non-yielding IOCP Listener is also same as analyzing non-yielding scheduler dump
Let us step in to analysis of non-yielding scheduler dump which I got in SQL Server 2012
Sample 1
When a non-yielding scheduler dump is generated following error message is logged in SQL Error log and SQLDump000n.mdmp is generated in log folder.
{
* *******************************************************************************
*
* BEGIN STACK DUMP:
* 04/16/12 10:09:58 spid 6256
*
* Non-yielding Scheduler
*
* *******************************************************************************
Process 0:0:0 (0x1cb0) Worker 0x0000003054F62160 appears to be non-yielding on Scheduler 0. Thread creation time: 12979065797278. Approx Thread CPU Used: kernel 0 ms, user 0 ms. Process Utilization 0%. System Idle 97%. Interval: 70110 ms.
}
To analyze the dump download and Install Windows Debugger from This link
Step 1:
Open Windbg . Choose File menu –> select Open crash dump –>Select the Dump file (SQLDump000#.mdmp)
Microsoft (R) Windows Debugger Version 6.11.0001.404 X86
Copyright (c) Microsoft Corporation. All rights reserved.
Loading Dump File [C:\Users\karthick \Desktop\Karthick\SQLDump0009.mdmp]
User Mini Dump File: Only registers, stack and portions of memory are available
Comment: ‘Stack Trace’
Comment: ‘Non-yielding Scheduler’ èType of the dump
Symbol search path is: *** Invalid ***
Executable search path is:
Windows 7 Version 7601 (Service Pack 1) MP (24 procs) Free x64
Product: Server, suite: Enterprise TerminalServer SingleUserTS à Windows version and system information
Machine Name:
Debug session time: Mon Apr 16 09:09:59.000 2012 (GMT-7)
System Uptime: 9 days 15:57:03.155
Process Uptime: 0 days 0:06:48.000
……………………………………………………….
……………………………..
Step 2:
on command window type
.sympath srv*c:\Websymbols*http://msdl.microsoft.com/download/symbols;
Step 3:
Type .reload /f and hit enter. This will force debugger to immediately load all the symbols.
Step 4:
Verify if symbols are loaded for SQL Server by using the debugger command lmvm
0:146> lmvm sqlservr
start end module name
00000000`ffad0000 00000000`ffb0e000 sqlservr T (pdb symbols) c:\websymbols\sqlservr.pdb\21553ADC31784A4D933974A386EE2E052\sqlservr.pdb
Loaded symbol image file: sqlservr.exe
Image path: C:\Program Files\Microsoft SQL Server\MSSQL11.S1\MSSQL\Binn\sqlservr.exe
Image name: sqlservr.exe
Timestamp: Fri Apr 06 08:19:38 2012 (4F7F098A)
CheckSum: 00036498
ImageSize: 0003E000
File version: 2011.110.2316.0
Product version: 11.0.2316.0 èSQL Server Version
File flags: 0 (Mask 3F)
File OS: 40000 NT Base
File type: 1.0 App
File date: 00000000.00000000
Translations: 0000.04b0 0000.04e4 0409.04b0 0409.04e4
Step 5:
Use !findstack command to find scheduler monitor thread (sqlservr!SQL_SOSNonYieldSchedulerCallback )
0:146> !findstack sqlservr!SQL_SOSNonYieldSchedulerCallback
Thread 006, 1 frame(s) match è Thread ID of scheduler monitor.
* 07 00000000336be420 000007fee36e0955 sqlservr!SQL_SOSNonYieldSchedulerCallback+0x47f
Step 6:
Switch to scheduler monitor thread using ~[threadID]s command
0:146> ~[006]s
ntdll!NtWaitForSingleObject+0xa:
00000000`76d3135a c3 ret
Step 7:
Use kC or kP command to look at the stack on scheduler monitor thread.
0:006> kP Child-SP RetAddr Call Site
00000000`3369c218 000007fe`fcd210ac ntdll!NtWaitForSingleObject+0xa
00000000`3369c220 00000000`ffaeecce KERNELBASE!WaitForSingleObjectEx+0x79
00000000`3369c2c0 00000000`ffaef1a4 sqlservr!CDmpDump::DumpInternal+0x20e
00000000`3369c360 000007fe`dbe50794 sqlservr!CDmpDump::Dump+0x24
00000000`3369c3a0 000007fe`dbe511e6 sqllang!SQLDumperLibraryInvoke+0x2e4
00000000`3369c640 000007fe`dbe16ddb sqllang!CImageHelper::DoMiniDump+0x426
00000000`3369c830 00000000`ffae307f sqllang!stackTrace+0xbdb
00000000`3369e270 000007fe`e36e0955 sqlservr!SQL_SOSNonYieldSchedulerCallback+0x47f
00000000`336be430 000007fe`e36866da sqldk!SOS_Scheduler::ExecuteNonYieldSchedulerCallbacks+0x375
00000000`336bebf0 000007fe`e364b53f sqldk!SchedulerMonitor::CheckScheduler+0x307
00000000`336bed60 000007fe`e364aa8f sqldk!SchedulerMonitor::CheckSchedulers+0x211
00000000`336bf1f0 000007fe`e371c779 sqldk!SchedulerMonitor::Run+0xfb
00000000`336bf320 000007fe`e3642f10 sqldk!SchedulerMonitor::EntryPoint+0x9
00000000`336bf350 000007fe`e3642d04 sqldk!SOS_Task::Param::Execute+0x21e
00000000`336bf950 000007fe`e36429e6 sqldk!SOS_Scheduler::RunTask+0xa8
00000000`336bf9c0 000007fe`e365a29f sqldk!SOS_Scheduler::ProcessTasks+0x299
00000000`336bfa40 000007fe`e365a3b0 sqldk!SchedulerManager::WorkerEntryPoint+0x261
00000000`336bfae0 000007fe`e3659fcf sqldk!SystemThread::RunWorker+0x8f
00000000`336bfb10 000007fe`e365aaf8 sqldk!SystemThreadDispatcher::ProcessWorker+0x3c8
00000000`336bfbc0 00000000`76ad652d sqldk!SchedulerManager::ThreadEntryPoint+0x236
Step 8:
Switch to the thread which is reported as Non-Yielding in SQL Server error log using ~~[ThreadID]s command.
Recollect the below error in SQL error log. Process 0:0:0 (0x1cb0) Worker 0x0000003054F62160 appears to be non-yielding on Scheduler 0.
0:006> ~~[0x1cb0]s
ntdll!NtWaitForSingleObject+0xa:
00000000`76d3135a c3 ret
Step 9:
Look at the current stack of Non-yielding thread. using kc command
0:146> kc 10
Call Site
ntdll!NtWaitForSingleObject
KERNELBASE!WaitForSingleObjectEx
sqldk!SOS_Scheduler::SwitchContext
sqldk!SOS_Scheduler::SuspendNonPreemptive
sqldk!WorkDispatcher::DequeueTask
sqldk!SOS_Scheduler::ProcessTasks
sqldk!SchedulerManager::WorkerEntryPoint
sqldk!SystemThread::RunWorker
sqldk!SystemThreadDispatcher::ProcessWorker
sqldk!SchedulerManager::ThreadEntryPoint
kernel32!BaseThreadInitThunk
ntdll!RtlUserThreadStart
Recollect the information which we discussed earlier in this blog. When the SQL Server decides to take the minidump on nonyield occurrence it copies the CONTEXT of the nonyielding thread to a global structure and then initiates the dump because Sometimes it is possible that by the time SQLDumper gets the dump, the non-yielding thread has already yielded. So to get the exact snapshot of the thread we need to trust on CONTEXT saved in global structure also we can compare the current stack of the thread with the one which is copied and check if the thread is progressing.
Look at the above stack it cannot be Non-Yield thread because we see SwitchNonPreemptive and SwitchContex in the thread.
{
SwitchPreemptive or SuspendNonPreemptive forces another worker to become owner of the scheduler. It does this by making the head of the runnable list the new owner and removing the current worker from logical scheduler control. The worker transitions ownership and is removed from SQL scheduler control until the external activity is complete. When the external activity is complete, the worker returns to the end of the runnable list by calling SwitchNonPreemptive.
}
Step 10:
Search for the copied stack structure using X commad
0:146> X sqlmin!*copiedStack*
000007fe`df11bfe0 sqlmin!g_copiedStackInfo = <no type information>
It is sqlmin!g_copiedStackInfo in this dump because this is dump is from SQL Server2012. In earlier versions of SQL Server it is sqlservr!g_copiedStackInfo
Step 11:
We know copied CONTEXT is stored in g_CopiedStackInfo how to find the valid offset in this structure ? If the CONTEXT is valid Rip,Rbp and RSP registers has to be valid if the dump is from X64 system and Eip,Ebp and Esp has to be valid if it is X86 system.
Let us do dd on sqlmin!g_copiedStackInfo (remember it is sqlservr!g_copiedStackInfo in SQL2008/2005/2000)
0:146> dd sqlmin!g_copiedStackInfo
000007fe`df11bfe0 00000001 00000000 3369e2e0 00000000
000007fe`df11bff0 0000a998 00000000 00000000 00000000
000007fe`df11c000 00000000 00000000 00000000 00000000
000007fe`df11c010 00000000 00000000 00000000 00000000
000007fe`df11c020 00000000 00000000 00000000 00000000
000007fe`df11c030 0010000b 00001f80 00000033 00000000
000007fe`df11c040 002b0000 00000246 00000000 00000000
000007fe`df11c050 00000000 00000000 00000000 00000000
Step 12:
Let us dump each address with context and verify if Rip,Rbp and RSP registers are valid. This dump is from 64-bit SQL Server so we are using Rip,Rbp and RSP registers. If the dump is from x86 system use Eip,Ebp and Esp .
0:146> dt 000007fe`df11bfe0 CONTEXT Rip Rsp Rbp èType cast 000007fe`df11bfe0 with CONTEXT . RsP,Rbp and Rip is invalid so this is not valid offset.
ole32!CONTEXT
+0x098 Rsp : 2
+0x0a0 Rbp : 0x5a4d
+0x0f8 Rip : 0
0:146> dt 000007fe`df11bff0 CONTEXT Rip Rsp Rbp è Type cast 000007fe`df11bff0 with CONTEXT . RIP is invalid
ole32!CONTEXT
+0x098 Rsp : 0x72120000
+0x0a0 Rbp : 0x3369e3cc
+0x0f8 Rip : 0xf2
0:146> dt 000007fe`df11c000 CONTEXT Rip Rsp Rbp è Type cast 000007fe`df11c000 with CONTEXT . RIP ,RSP and Rbp is valid. So this should be a valid context.Let us switch to this context and verify
ole32!CONTEXT
+0x098 Rsp : 0x3369e2e0
+0x0a0 Rbp : 0x3369e498
+0x0f8 Rip : 0x76d3139a
Now we know 000007fe`df11c000 is valid context. So 000007fe`df11c000 -sqlmin!g_copiedStackInfo =0x20 hence we can use .cxr sqlmin!g_copiedStackInfo+0X20 directly to switch to the context of copied stack.
Step 13:
Switch the context of copied stack using .cxr 000007fe`df11c000 or .cxr sqlmin!g_copiedStackInfo+0X20
0:146> .cxr 000007fe`df11c000
rax=0000000000000002 rbx=000000003369e3cc rcx=0000000000005a4d
rdx=0000000072120000 rsi=000000000000026c rdi=0000000000000000
rip=0000000076d3139a rsp=000000003369e2e0 rbp=000000003369e498
r8=00000000000000b0 r9=0000000084a85310 r10=0000000000000000
r11=0000000000000000 r12=0000000000000000 r13=0000000000000004
r14=00000000000000f2 r15=0000000000000001
iopl=0 nv up ei pl zr na po nc
cs=0033 ss=002b ds=0000 es=0000 fs=0000 gs=0000 efl=00000246
ntdll!NtWriteFile+0xa:
00000000`76d3139a c3 ret
Step 14:
Dump the stack of copied context using kP or kc (kc displays clean stack trace. each display line includes only the module name and the function name)
0:146> Kc
*** Stack trace for last set context – .thread/.cxr resets it
Call Site
ntdll!NtWriteFile
KERNELBASE!WriteFile
kernel32!WriteFileImplementation
sqllang!CErrorReportingManager::WriteToErrLog
sqllang!CErrorReportingManager::SendErrorToErrLog
sqllang!CErrorReportingManager::CwchFormatAndPrint
sqllang!ReportLoginFailure
sqllang!FRedoLogin
sqllang!login
sqllang!process_login_finish
sqllang!process_commands
sqldk!SOS_Task::Param::Execute
sqldk!SOS_Scheduler::RunTask
sqldk!SOS_Scheduler::ProcessTasks
sqldk!SchedulerManager::WorkerEntryPoint
sqldk!SystemThread::RunWorker
sqldk!SystemThreadDispatcher::ProcessWorker
sqldk!SchedulerManager::ThreadEntryPoint
kernel32!BaseThreadInitThunk
ntdll!RtlUserThreadStart
Now compare the current stack and the copied stack to see if the thread has progressed after No-Yield condition. Stack look completely different So the Non-Yield thread has progressed and completed .It is doing new work now. Also to understand why the thread was Non-Yielding look at the copied stack and not the current unless both the stacks are same.
Current thread stack which we dumped using the thread ID in SQL Errorlog.
0:146> kc
|
Copied thread stack which SQL Server copied to global structure before generating the dump.
0:146> Kc 10
|
ntdll!NtWaitForSingleObject
|
KERNELBASE!WriteFile
|
KERNELBASE!WaitForSingleObjectEx
|
kernel32!WriteFileImplementation
|
sqldk!SOS_Scheduler::SwitchContext
|
sqllang!CErrorReportingManager::WriteToErrLog
|
sqldk!SOS_Scheduler::SuspendNonPreemptive
|
sqllang!CErrorReportingManager::SendErrorToErrLog
|
sqldk!WorkDispatcher::DequeueTask
|
sqllang!CErrorReportingManager::CwchFormatAndPrint
|
sqldk!SOS_Scheduler::ProcessTasks
|
sqllang!ReportLoginFailure
|
sqldk!SchedulerManager::WorkerEntryPoint
|
sqllang!FRedoLogin
|
sqldk!SystemThread::RunWorker
|
sqllang!login
|
sqldk!SystemThreadDispatcher::ProcessWorker
|
sqllang!process_login_finish
|
sqldk!SchedulerManager::ThreadEntryPoint
|
sqllang!process_commands
|
kernel32!BaseThreadInitThunk
|
sqldk!SOS_Task::Param::Execute
|
ntdll!RtlUserThreadStart
|
sqldk!SOS_Scheduler::RunTask
|
|
sqldk!SOS_Scheduler::ProcessTasks
|
|
sqldk!SchedulerManager::WorkerEntryPoint
|
|
sqldk!SystemThread::RunWorker
|
|
sqldk!SystemThreadDispatcher::ProcessWorker
|
|
sqldk!SchedulerManager::ThreadEntryPoint
|
Now let us read the copied stack and understand what would have caused a Non-Yield condition (read from bottom to top)
ntdll!NtWriteFile -> WriteFile function is at top of the stack and did not complete in expected time.
KERNELBASE!WriteFile
kernel32!WriteFileImplementation
sqllang!CErrorReportingManager::WriteToErrLog ->Write the error to errorlog
sqllang!CErrorReportingManager::SendErrorToErrLog ->Send the error to SQL Server errorlog
sqllang!CErrorReportingManager::CwchFormatAndPrint ->format the error
sqllang!ReportLoginFailure ->Login failed
sqllang!FRedoLogin
sqllang!login ->Login task is processed
From the above stack we are able to understand SQL Server is writing login failed information to SQL Error log (Synchronously) and the writefile function has taken long time and did not return.So there is Non-Yield scheduler dump.
When will writefile operation take long time?
When there is Disk bottleneck. So the obvious solution for this issue is to fix the performance of the disk.
Similarly there could be numerous other reasons for Non-Yield condition so look at the stack of your Non-Yield scheduler dump using the method above and make out what could have caused the Non-Yield condition.
Also refer THIS LINK to check if your stack matches with any of the known issues in SQL Server.
To Be continued…………………………
Related posts:
How to Analyze "Deadlocked Schedulers" Dumps?
Non-yielding IOCP Listener, Non-yielding Scheduler and non-yielding resource monitor known issues and fixes
SQL Server generated Access Violation dumps while accessing oracle linked servers.
SQL Server Latch & Debugging latch time out
If you liked this post, do like us on Face Book at https://www.facebook.com/mssqlwiki and join our FaceBook group https://www.facebook.com/mssqlwiki#!/groups/454762937884205/
Thank you,
Karthick P.K |My Facebook Page |My Site| Blog space| Twitter
Like this:
Like Loading...