SQL Server Latch & Debugging latch time out
Posted by Karthick P.K on September 7, 2012
In a multithreaded process what would happens when a one thread updates a data or index page in memory while second thread is reading the same page?
What will happen when 1st thread reads a data/index page in memory while 2nd thread is freeing the same page from memory?
Answer: We would end up with data or data structure inconsistency. To avoid inconsistency SQL Server uses Synchronization Mechanisms like Locks,Latches and Spinlocks.
We will discuss few key points about latches and how to debug latch timeout dumps in this blog.
What is Latch ?
They control the concurrent access to data pages and structures by multiple threads. Latches provide physical data consistency of data pages and provide synchronization for data structures. Latches are not controllable by user like locks.
Types of the Latch:
Buffer (BUF) Latch
Used to synchronize access to BUF structures and their associated database pages.
Buffer “IO” Latch
A subset of BUF latches used when the BUF and associated data/index page is in the middle of an IO operation (Reading page from disk or writing page to disk).
Non-Buffer (Non-BUF) Latch
These are latches that are used to synchronize general in-memory data structures generally used by queries/tasks executed by parallel threads, auto grow operations , shrink operations etc.
Latch modes
Keep (KP) Latches
Used to ensure that the page is not released from memory while it is in use.
Shared (SH) Latches
Used for read-only access to data structures and prevent write access by others threads.
This mode allows shared access.
SH is compatible with KP, SH, and UP. It should be noted that although in general SH implies read-only access, it is not always the case. For buffer latches SH is the minimum mode required in order to read a data page.
Update (UP) Latches
Allows read access to the data structure(Compatible with SH and KP), but prevents other EX-latch access.
Used for write operations when torn page detection is off and when AWE is not enabled.
Exclusive (EX) Latches
Prevents any read activity from occurring on the latched structure. EX is only compatible with KP.
Used during read IO during write IO when torn page detection is on or AWE is enabled.
Destroy (DT) Latches
Used when removing BUFs from the buffer pool, either by adding them to the free list or unmapping AWE buffers.
Latch compatibility
KP | SH | UP | EX | DT | |
KP | Y | Y | Y | Y | N |
SH | Y | Y | Y | N | N |
UP | Y | Y | N | N | N |
EX | Y | N | N | N | N |
DT | N | N | N | N | N |
How do you identify Latch contention?
Latch contention can be identified using below wait types in sysprocesses.
PAGEIOLATCH_*: This waittype in sysprocesses indicates that SQL Server is waiting on a physical I/O of a buffer pool page to complete.
1. PAGEIOLATCH_* are commonly solved by tuning the queries which are performing heavy IO (Commonly by adding, changing and removing indexes (or) statistics to reduce the amount of physical IO).
2. Identifying if there is disk bottleneck and fixing them (Pageiolatch wait times (ex > 30 ms))
PAGELATCH_*: This waittype in sysprocesses indicates that SQL Server is waiting on access to a database page, but the page is not undergoing physical IO.
1. This problem is normally caused by a large number of sessions attempting to access the same physical page at the same time. We should Look at the wait resource of the spid. The wait_resource is the page number (the format is dbid:file:pageno)
that is being accessed.
2. We can use DBCC PAGE to identify object or type of the page in which we have the contention. Also it will help us to determine whether contention is for allocation, data or text.
3. If the pages that SQL Server is most frequently waiting on are in tempdb database ,check the wait resource column for a page number in dbid 2. You may be facing tempdb allocation latch contention mentioned in http://support.microsoft.com/kb/328551
4. If the page is in a user database, check to see if the table has a clustered index on a monotonic key such as an identity where all threads are contending for the same page at the end of the table. In this case we need to choose a different
clustered index key to spread the work across different pages.
LATCH_*: Non-buf latch waits can be caused by variety of things. We can use the wait resource column in sysprocesses to determine the type of latch involved(KB 822101).
1. A very common LATCH_EX wait is due to running a profiler trace or sp_trace_getdata Refer KB 929728 for more information.
2. Auto Grow and auto shrink.
When a latch is requested by thread and If that latch cannot be granted immediately because of some other thread holding a incompatible latch on same page or data structure then the requestor must wait for the latch to be grantable. Warning messages like one below is printed in SQL Server error log and a mini dump with all the threads is captures if the wait interval reaches 5 minutes (waittime 300). The warning message differs for buffer and non-buffer latches.
844: Time out occurred while waiting for buffer latch — type %d, bp %p, page %d:%d, stat %#x, database id: %d, allocation unit id: %I64d%ls, task 0x%p : %d, waittime %d, flags 0x%I64x, owning task 0x%p. Continuing to wait.
846: A time-out occurred while waiting for buffer latch — type %d, bp %p, page %d:%d, stat %#x, database id: %d, allocation unit Id: %I64d%ls, task 0x%p : %d, waittime %d, flags 0x%I64x, owning task 0x%p. Not continuing to wait.
847: Timeout occurred while waiting for latch: class ‘%ls’, id %p, type %d, Task 0x%p : %d, waittime %d, flags 0x%I64x, owning task 0x%p. Continuing to wait.
Break up of above warning
type
The latch mode of the current latch acquire request. This is a numerical value with the following mapping: 0 – NL (not used); 1 – KP; 2 – SH; 3 – UP; 4 – EX; 5 – DT.
task
Task for which we are trying to acquire latch.
Waittime
The total time waited for this latch acquire request in seconds.
owning task
The address of the Task that owns the latch, if available.
bp (Buffer latches only)
The address of the BUF structure corresponding to this buffer latch.
page (Buffer latches only.)
The page id for the page currently contained in the BUF structure.
database id (Buffer latches only.)
The database id for the page in the BUF.
Like troubleshooting blocking issues in SQL Server when there is a latch contention or timeout dump identify the owner of latch and troubleshoot why the latch is held by the owner for long time.
When there is latch timeout dump you will see a warning message similar to one below. Warning error message printed in SQL server errorlog before the dump is very important to find the owner thread of latch.
{
2012-01-18 00:52:03.16 spid69 A time-out occurred while waiting for buffer latch — type 4, bp 00000000ECFDAA00, page 1:6088, stat 0x4c1010f, database id: 4, allocation unit Id: 72057594043367424, task 0x0000000006E096D8 : 0, waittime 300, flags 0x19,
owning task 0x0000000006E08328. Not continuing to wait.
spid21s **Dump thread – spid = 21, PSS = 0x0000000094622B60, EC = 0x0000000094622B70
spid21s ***Stack Dump being sent to E:\Data\Disk1\MSSQL.1\MSSQL\LOG\SQLDump0009.txt
spid21s * *******************************************************************************
spid21s * BEGIN STACK DUMP:
spid21s * 02/28/12 00:32:03 spid 21
spid21s * Latch timeout
Timeout occurred while waiting for latch: class ‘ACCESS_METHODS_HOBT_COUNT’, id 00000002D8C32E70, type 2, Task 0x00000000008FCBC8 : 7, waittime 300, flags 0x1a, owning task 0x00000000050E1288. Continuing to wait.
Timeout occurred while waiting for latch: class ‘ACCESS_METHODS_HOBT_VIRTUAL_ROOT’, id 00000002D8C32E70, type 2, Task 0x00000000008FCBC8 : 7, waittime 300, flags 0x1a, owning task 0x00000000050E1288. Continuing to wait.
}
From the error message above we can easily understand we are trying to acquire latch on database id: 4, page 1:6088 (6088 page of first file) and we timed out because task 0x0000000006E08328 (owning task 0x0000000006E08328 in warning message) is holding a latch on it.
Note: Task is simply a work request to be performed by the thread. (such as system tasks, login task, Ghost cleanup task etc.). Threads which execute the task will take required latches on need.
Let us see how to analyze latch timeout dump and get the owning thread of the Latch using the owning task 0x0000000006E08328.
To analyze the dump download and Install Windows Debugger from This link
Step 1:
Open Windbg . Choose File menu –> select Open crash dump –>Select the Dump file (SQLDump000#.mdmp)
Step 2:
on command window type
.sympath srv*c:\Websymbols*http://msdl.microsoft.com/download/symbols;
Step 3:
Type .reload /f and hit enter. This will force debugger to immediately load all the symbols.
Step 4:
Verify if symbols are loaded for SQL Server by using the debugger command lmvm
0:002> lmvm sqlservr
start end module name
00000000`01000000 00000000`03679000 sqlservr T (pdb symbols) c:\websymbols\sqlservr.pdb\21E4AC6E96294A529C9D99826B5A7C032\sqlservr.pdb
Loaded symbol image file: sqlservr.exe
Image path: C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Binn\sqlservr.exe
Image name: sqlservr.exe
Timestamp: Wed Oct 07 21:15:52 2009 (4ACD6778)
CheckSum: 025FEB5E
ImageSize: 02679000
File version: 2005.90.4266.0
Product version: 9.0.4266.0
File flags: 0 (Mask 3F)
File OS: 40000 NT Base
File type: 1.0 App
File date: 00000000.00000000
Translations: 0000.04b0 0000.04e4 0409.04b0 0409.04e4
Step 5:
Use the below command to search thread stack to identify the thread which has reference to the owning task and it will be the thread which is owning the latch. Replace 0X0000000006E08328 with owning task in your errorlog
~*e .echo ThreadId:; ?? @$tid; r? @$t1 = ((ntdll!_NT_TIB *)@$teb)->StackLimit; r? @$t2 = ((ntdll!_NT_TIB *)@$teb)->StackBase; s -d @$t1 @$t2 0X0000000006E08328
ThreadId:
unsigned int 0x93c
ThreadId:
unsigned int 0x9a0
ThreadId:
unsigned int 0x9b4
00000000`091fdaf0 06e08328 00000000 00000000 00000000 (……………
00000000`091fdcb8 06e08328 00000000 091fdd70 00000000 (…….p…….
00000000`091fded0 06e08328 00000000 06e0e798 00000000 (……………
00000000`091fdf38 06e08328 00000000 00000002 00000000 (……………
00000000`091fec60 06e08328 00000000 0168883a 00000000 (…….:.h…..
00000000`091ff260 06e08328 00000000 000007d0 00000000 (……………
00000000`091ff2d0 06e08328 00000000 00000020 00000000 (……. …….
00000000`091ff5f8 06e08328 00000000 800306c0 00000000 (……………
00000000`091ff6c0 06e08328 00000000 00000000 00000000 (……………
00000000`091ff930 06e08328 00000000 00000000 00000001 (……………
00000000`091ff9b8 06e08328 00000000 00000000 00000000 (……………
00000000`091ffa38 06e08328 00000000 00000000 00000000 (……………
00000000`091ffc10 06e08328 00000000 03684080 00000000 (……..@h…..
00000000`091ffc90 06e08328 00000000 00000000 00000000 (……………
ThreadId:
unsigned int 0x9b8
ThreadId:
unsigned int 0x9bc
ThreadId:
unsigned int 0x9c0
……………
…………..
Step 6:
From the above out put we see thread 0x9b4 has reference to the pointer of owning task and it will be the thread which is owning the latch. Let us switch to the thread(0x9b4 ) which is executing the owning task and
then go through the stack to see why the thread is owning the latch for long time.
Step 7:
~~[0x9b4]s ==> Switching to the thread (Replace 0x9b4 with your thread id which has reference to the po
ntdll!ZwWaitForSingleObject+0xa:
00000000`77ef047a c3 ret
Step 8:
0:002> kC ==> Print the stack
Call Site
ntdll!ZwWaitForSingleObject
kernel32!WaitForSingleObjectEx
sqlservr!SOS_Scheduler::SwitchContext
sqlservr!SOS_Scheduler::Suspend
sqlservr!SOS_Event::Wait
sqlservr!BPool::FlushCache
sqlservr!checkpoint2
sqlservr!alloca_probe
sqlservr!ProcessCheckpointRequest
sqlservr!CheckpointLoop
sqlservr!ckptproc
sqlservr!SOS_Task::Param::Execute
sqlservr!SOS_Scheduler::RunTask
sqlservr!SOS_Scheduler::ProcessTasks
sqlservr!SchedulerManager::WorkerEntryPoint
sqlservr!SystemThread::RunWorker
sqlservr!SystemThreadDispatcher::ProcessWorker
sqlservr!SchedulerManager::ThreadEntryPoint
msvcr80!endthreadex
msvcr80!endthreadex
From the above stack we can understand that the thread which is owning the latch is executing checkpoint and flushing cache (Dirty buffers) to disk. If flushing buffers to disk (checkpoint) is taking a long time, then obviously there is disk bottleneck.
Similarly for any other latch time out issues first identify the owner thread of latch, read the stack of owner thread to understand the task performed by owner thread and troubleshoot the performance of task performed by owner thread.
If you want to see the stack of thread which is waiting, then pickup the task (task 0x0000000006E096D8 )from latch timeout warning message in errorlog instead of owning task (task 0x0000000006E08328) and use the command mentioned in step 5.
I hope this post will help you to learn and debug the latch timeout issues.
If you liked this post, do like us on Facebook at https://www.facebook.com/mssqlwiki and join our Facebook group https://www.facebook.com/mssqlwiki#!/groups/454762937884205/
Related posts: Non-yielding IOCP Listener, Non-yielding Scheduler and non-yielding resource monitor known issues and fixes
Thank you,
Karthick P.K |My Facebook Page |My Site| Blog space| Twitter
Kranthi Kumar said
Hi karthi. for troubleshooting do we need to have crash dump of same day. Coz i have an issue where a job is getting failed due to buffer latch type 4 Error details added below.
2012-10-12 04:25:30.18 spid475 A time-out occurred while waiting for buffer latch — type 4, bp 0000000106FCE180, page 1:375475, stat 0xcc1000f, database id: 18, allocation unit Id: 72057594293452800, task 0x0000000006A312E8 : 0, waittime 300, flags 0x1a, owning task 0x0000000006A125C8. Not continuing to wait.
which happened on 12 Oct but i have crash dump on 30th Sep. So pls help me to further investgation
Karthick P.K said
Do you hav the errorlogs of 30th Sep?
Jeff Bennett said
Hi,Karthi.
This is a great article. Thanks. Is there a 64-bit version of the debugger?
Thanks again
Jeff
Karthick P.K said
yes there is. But to open a dump file you can use any version of debugger(32-Bit debugger can beused to load and analyze 64-bit dump. You need not worry about version unless you do live debugging) > Date: Tue, 13 Nov 2012 15:25:53 +0000 > To: karthick_pk83@hotmail.com >
Jeff Bennett said
Another question: I was reluctant to run this on the production server, so I tried it on my workstation, but got all kinds of Symbol file related errors. Do I have to run this on the affected server? Thanks again
Karthick P.K said
You can copy the dump to any systems and run it. you can ignore symbole errors as long as sqlservr.pdb is loaded as mentioned in step4 > Date: Tue, 13 Nov 2012 17:04:49 +0000 > To: karthick_pk83@hotmail.com >
John Healey said
I’m not able to load the symbols into the debugger.
0:039> .sympath srv*c:\Websymbols*http://msdl.microsoft.com/download/symbols;
Symbol search path is: srv*c:\Websymbols*http://msdl.microsoft.com/download/symbols
Expanded Symbol search path is: srv*c:\websymbols*http://msdl.microsoft.com/download/symbols
0:039> .reload /f
.*** ERROR: Symbol file could not be found. Defaulted to export symbols for sqlservr.exe –
Not sure why I’m not able to use/load/find the sqlservr.pdb symbol file.
Any suggestions will be much appreciated.
Karthick P.K said
Can you paste the output of >>lmvm sqlservr
ryannadesign.com said
This specific article SQL Server Latch & Debugging
latch time out at MSSQLWIKI, has got extremely great info
and I actually discovered precisely what
I had been researching for. Many Thanks.
How to Analyze "Deadlocked Schedulers" Dumps? « MSSQLWIKI said
[…] SQL Server Latch & Debugging latch time out […]
Non-yielding IOCP Listener, Non-yielding Scheduler and non-yielding resource monitor known issues and fixes « MSSQLWIKI said
[…] SQL Server Latch & Debugging latch time out […]
Michael Zilberstein : Corruption case said
[…] This article was extremely helpful in analyzing this dump with WinDbg tool – flow and somewhat cryptic commands described there easily pinpointed the guilty thread and its call stack: […]
Azad Sale said
Good learning experience and as always clear and crisp explanation.
Thanks a ton Karthick.
raj said
Hi Karthik
I getting below Error:
A time-out occurred while waiting for buffer latch — type 4, bp 0000000295FE6F80, page 1:129408, stat 0xc0000f, database id: 2, allocation unit Id: 6488064, task 0x00000002194A0088 : 0, waittime 300, flags 0x1000000039, owning task 0x00000002A2120748. Not continuing to wait.
A time-out occurred while waiting for buffer latch — type 3, bp 0000000295FE6F80, page 1:129408, stat 0xc0000f, database id: 2, allocation unit Id: 6488064, task 0x0000000190952748 : 0, waittime 300, flags 0x1000000039, owning task 0x00000002A2120748. Not continuing to wait.
Here database ID 2 means temp db
After that i have installed Windbg –> Open Crash Dump file was not available.
I have checked in ERROR LOG but no luck Could you please suggest asp.. its already escalated.
Waiting for valueble reply… Thanks so much.
Karthick P.K said
You mean there is no dump in SQL Server errorlog folder?
Top SQL Server blogs from MSSQLWIKI « MSSQLWIKI said
[…] SQL Server Latch & Debugging latch time out […]
waste clearance said
It’s the best time to make some plans for thhe long run and
it’s time to be happy. I hazve read this post
and if I may I want to recommend you some interesting issues or
tips. Perhaps yyou cann write next articles relating to this article.
I wish to learn even mre issues approximately it!
garrulouschair104.shutterfly.com said
Je suis tombé sur ton site par hasard : je
ne le regrette point !!
Brian Kang said
Hi Karthik,
I got the same error and followed this instruction and got below stack. It is slightly different than what you showed here. Can you check below stack and share what do you see? Is it our Log backup job crashed due while truncating it? If so, was it caused by IO issue?
0:227> kC
Call Site
ntdll!NtWaitForSingleObject
KERNELBASE!WaitForSingleObjectEx
sqlservr!CDmpDump::DumpInternal
sqlservr!CDmpDump::DumpFilter
sqlservr!CDmpDump::Dump
msvcr100!__C_specific_handler
ntdll!RtlpExecuteHandlerForException
ntdll!RtlDispatchException
ntdll!RtlRaiseException
KERNELBASE!RaiseException
sqlservr!CDmpDump::Dump
sqllang!SQLDumperLibraryInvoke
sqllang!CImageHelper::DoMiniDump
sqllang!stackTrace
sqlmin!LatchBase::DumpOnTimeoutIfNeeded
sqlmin!LatchBase::PrintWarning
sqlmin!LatchBase::Suspend
sqlmin!LatchBase::AcquireInternal
sqlmin!BPool::PageReadyToWrite
sqlmin!BPool::WriteMultiple
sqlmin!BPool::FlushCache
sqlmin!HardenAndLogCheckpoint
sqlmin!SafeCheckpoint
sqlmin!BackupLogOperation::EndSyncCommitAndTruncateLog
sqlmin!BackupEntry::BackupLog
sqllang!CStmtDumpXact::XretExecute
sqllang!CMsqlExecContext::ExecuteStmts
sqllang!CMsqlExecContext::FExecute
sqllang!CSQLSource::Execute
sqllang!process_request
sqllang!process_commands
sqldk!SOS_Task::Param::Execute
sqldk!SOS_Scheduler::RunTask
sqldk!SOS_Scheduler::ProcessTasks
sqldk!SchedulerManager::WorkerEntryPoint
sqldk!SystemThread::RunWorker
sqldk!SystemThreadDispatcher::ProcessWorker
sqldk!SchedulerManager::ThreadEntryPoint
kernel32!BaseThreadInitThunk
ntdll!RtlUserThreadStart
Thanks much for your help.
Maria Lourdes said
We also ran into a latch timeout issue and went through the WinDebug steps and the thread stack shows the following. We started having these latch timeout issues ever since we migrated to SSD. Can you share please what you see might be the issue? Thanks so much!
ntdll!NtWaitForSingleObject
KERNELBASE!WaitForSingleObjectEx
sqlservr!CDmpDump::DumpInternal
sqlservr!CDmpDump::DumpFilter
sqlservr!CDmpDump::Dump
msvcr100!__C_specific_handler
ntdll!RtlpExecuteHandlerForException
ntdll!RtlDispatchException
ntdll!RtlRaiseException
KERNELBASE!RaiseException
sqlservr!CDmpDump::Dump
sqllang!SQLDumperLibraryInvoke
sqllang!CImageHelper::DoMiniDump
sqllang!stackTrace
sqlmin!LatchBase::DumpOnTimeoutIfNeeded
sqlmin!LatchBase::PrintWarning
sqlmin!LatchBase::Suspend
sqlmin!LatchBase::AcquireInternal
sqlmin!BUF::AcquireLatch
sqlmin!BPool::Get
sqlmin!IndexPageManager::GetPageWithKey
sqlmin!GetRowForKeyValue
sqlmin!IndexDataSetSession::GetRowByKeyValue
sqlmin!IndexDataSetSession::FetchRowByKeyValueInternal
sqlmin!RowsetNewSS::FetchRowByKeyValue
sqlmin!CValFetchByKey::ManipData
sqltses!CEsExec::GeneralEval4
sqlmin!CQScanSimpleUpdateNew::GetRow
sqlmin!CQueryScan::GetRow
sqllang!CXStmtQuery::ErsqExecuteQuery
sqllang!CXStmtDML::XretDMLExecute
sqllang!CXStmtDML::XretExecute
sqllang!CMsqlExecContext::ExecuteStmts
sqllang!CMsqlExecContext::FExecute
sqllang!CSQLSource::Execute
sqllang!ExecuteSql
sqllang!CSpecProc::ExecuteSpecial
sqllang!CSpecProc::Execute
sqllang!process_request
sqllang!process_commands
sqldk!SOS_Task::Param::Execute
sqldk!SOS_Scheduler::RunTask
sqldk!SOS_Scheduler::ProcessTasks
sqldk!SchedulerManager::WorkerEntryPoint
sqldk!SystemThread::RunWorker
sqldk!SystemThreadDispatcher::ProcessWorker
sqldk!SchedulerManager::ThreadEntryPoint
kernel32!BaseThreadInitThunk
ntdll!RtlUserThreadStart
Yagnesh said
Step 5 command is not working for me. I replaced task address with my task address but it returned nothing. I am on 64bit debugger. Can you please help me to sort out this.
What is a mommy massage said
What is a mommy massage
blog topic