MSSQLWIKI

Karthick P.K on SQL Server

Archive for September, 2013

Tempdb latch contention

Posted by Karthick P.K on September 17, 2013

You might see page latch contention in tempdb when you repeatedly drop and create tempdb objects (temp tables, table variables, etc.).

When you notice PAGELATCH_* contention on tempdb (the wait resource in sysprocesses starts with 2:), check whether the latch wait is on a PFS, GAM or SGAM page. When there is latch contention on tempdb you will see a lot of sessions waiting on PAGELATCH_*, similar to the ones below.

In the output below the sessions are waiting on resource 2:15:121320. Decoding the wait resource: 2 is the database id of tempdb, 15 is the file number, and 121320 is the page number. 121320 is a multiple of 8088, so it is a PFS page. If the page is not a PFS page, use the same approach to check whether it is a GAM or SGAM page.

Wait type          Wait resource
PAGELATCH_UP       2:15:121320
PAGELATCH_UP       2:15:121320

How to identify if a page is a PFS, GAM or SGAM page?

PFS page: A PFS page occurs once every 8,088 pages. SQL Server places a PFS page on the first page of every PFS interval (8,088 pages). The only time a PFS page is not the first page in its interval is in the first interval of a file, where the file header page is first and the PFS page is second (page IDs start from 0, so the first PFS page is at page ID 1). If the page number divides evenly by 8,088 (page number modulo 8088 = 0), the page is a PFS page.

GAM page: The first GAM page is page 2 in the data file, and the next GAM page is placed one GAM interval (511,232 pages) later. If the page number is 2, or the page number modulo 511232 = 0, the page is a GAM page.

SGAM page: The first SGAM page is page 3 in the data file, and the next SGAM page is placed one GAM interval later. If the page number is 3, or (page number - 1) modulo 511232 = 0, the page is an SGAM page.
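Here is a minimal T-SQL sketch that applies this arithmetic to the sessions currently waiting; it reads sys.dm_os_waiting_tasks and classifies the tempdb page behind each PAGELATCH_* wait using the intervals described above:

SELECT t.session_id,
       t.wait_type,
       t.wait_duration_ms,
       t.resource_description,
       CASE
           WHEN t.page_id = 1 OR t.page_id % 8088 = 0          THEN 'PFS page'
           WHEN t.page_id = 2 OR t.page_id % 511232 = 0        THEN 'GAM page'
           WHEN t.page_id = 3 OR (t.page_id - 1) % 511232 = 0  THEN 'SGAM page'
           ELSE 'Not an allocation page'
       END AS page_type
FROM
(
    SELECT session_id,
           wait_type,
           wait_duration_ms,
           resource_description,
           -- the page number is the part after the second colon in dbid:fileid:pageid
           CAST(RIGHT(resource_description,
                      LEN(resource_description) - CHARINDEX(':', resource_description, 3)) AS bigint) AS page_id
    FROM sys.dm_os_waiting_tasks
    WHERE wait_type LIKE 'PAGELATCH_%'
      AND resource_description LIKE '2:%'   -- tempdb is database id 2
) AS t;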


How to resolve?

1. Increase the number of tempdb data files and size them equally (see the sketch after this list). As a general rule, if the number of logical processors is less than or equal to 8, use the same number of data files as logical processors. If the number of logical processors is greater than 8, use 8 data files; if contention continues, increase the number of data files in multiples of 4 (you may not see further improvement beyond 32 files).

2. Enable server-side trace flag 1118 (uniform allocation).

3. If you still see latch contention on the PFS page after following the above two steps, the only option is to modify your application to limit its tempdb usage.

4. If you see contention on 2:1:103 (page 103 belongs to the system table sys.sysmultiobjrefs, which tracks relationships between created objects in every database), the only way to reduce contention on this page is to reduce the number of such relationships. For example, creating a lot of temp tables with primary keys can cause this contention, because the relationship between each table and its PK constraint has to be recorded in sys.sysmultiobjrefs.
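For steps 1 and 2, here is a minimal sketch of the commands involved. The file name, path and sizes are placeholders to adapt to your environment; adding -T1118 as a startup parameter is what keeps the trace flag enabled across restarts.

-- Add an equally sized tempdb data file (repeat per additional file).
ALTER DATABASE tempdb
ADD FILE (NAME = N'tempdev2', FILENAME = N'T:\TempDB\tempdev2.ndf', SIZE = 4096MB, FILEGROWTH = 512MB);

-- Resize the existing file so round-robin allocation stays even.
ALTER DATABASE tempdb
MODIFY FILE (NAME = N'tempdev', SIZE = 4096MB, FILEGROWTH = 512MB);

-- Enable trace flag 1118 (uniform extent allocation) globally until the next restart;
-- add -T1118 as a SQL Server startup parameter to make it permanent.
DBCC TRACEON (1118, -1);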

What’s the best practice?

1. Create multiple tempdb data files instead of one large file, and size them equally, on all your SQL Server instances.

2. Make TF 1118 (uniform allocation) the default. (The extra space required by this trace flag shouldn’t really matter, as the amount of additional space required is minimal and storage is not that expensive these days.)

 

If you liked this post, do like us on Facebook at https://www.facebook.com/mssqlwiki and join our Facebook group

Thank you,

Karthick P.K |My Facebook Page |My Site| Blog space| Twitter

Disclaimer:

The views expressed on this website/blog are mine alone and do not reflect the views of my company or anyone else. All postings on this blog are provided “AS IS” with no warranties, and confer no rights.

Posted in Performance, Space management, SQL General, SQL Server Engine | 1 Comment »

Troubleshooting Transactional replication Latency using Agent Statistics

Posted by Prabhakar Bhaskaran on September 13, 2013

Troubleshooting latency issues in replication is a black box for many DBAs. In this post I will explain how you can leverage the agent statistics to troubleshoot latency issues.

Before we decode the agent statistics, let's take a look at some of the basics that will help us troubleshoot replication performance issues more effectively.

The following MSDN diagram depicts the transactional replication architecture in a simple manner.

Transactional replication components and data flow

Troubleshooting latency issues is a multi-step process; the first step is to identify which agent is slow:

  • Log reader Agent (Publisher to Distributor)
  • Distribution Agent (Distributor to Subscriber)

So the problem can be in either the Log Reader Agent or the Distribution Agent; we can identify which by simply inserting a tracer token.
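If you have not used tracer tokens before, here is a minimal sketch (run at the Publisher, in the publication database; the publication name N'MyPublication' is a placeholder):

-- Post a tracer token into the log for this publication.
DECLARE @tokenId int;
EXEC sys.sp_posttracertoken
     @publication     = N'MyPublication',
     @tracer_token_id = @tokenId OUTPUT;

-- Once the token has flowed through, check where the time was spent:
-- distributor_latency points at the Log Reader, subscriber_latency at the Distribution Agent.
EXEC sys.sp_helptracertokenhistory
     @publication = N'MyPublication',
     @tracer_id   = @tokenId;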

Once we find the problematic agent, the next step is to identify which particular thread within the agent is causing the issue.

Let me introduce the important threads in each replication agent and what they do, in a nutshell.

Log Reader Agent

Reader thread – Scans the publisher database transaction log using sp_replcmds.

Writer thread – Adds the queued transactions to the distribution database using sp_MSadd_repl_commands.

Distribution Agent

Reader thread – Finds the watermark in the MSreplication_subscriptions table (on the subscriber) and uses this information to retrieve pending commands from the distribution database. It uses the stored procedure sp_MSget_repl_commands to do this.

Writer thread – Uses batched RPC calls to write the commands to the subscriber database.

Now that we understand the threads in the replication agents, let's assume we have already identified which agent is slow by inserting a tracer token. The next step is to dig deeper to the thread level; this is where the replication agent statistics come to our rescue.

Agent statistics entries are appended to the history tables every 5 minutes by default. They provide a historical view of how the agent has been performing, and the last 3 days of data are kept. You can keep data for more days by changing the history retention period.

MSlogreader_history

MSdistribution_history

The above two tables are located in the distribution database. The statistics information is added as an XML blob in the comments column of these tables.
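A minimal sketch for pulling the latest statistics entries from these tables (assuming the distribution database is named distribution; add an agent_id filter if you run multiple agents):

USE distribution;

SELECT TOP (10) time, comments
FROM dbo.MSlogreader_history
WHERE comments LIKE '<stats%'
ORDER BY time DESC;

SELECT TOP (10) time, comments
FROM dbo.MSdistribution_history
WHERE comments LIKE '<stats%'
ORDER BY time DESC;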

Now, let's take a look at how to decipher this XML data for each agent.

Log Reader Agent statistics

<stats state="1" work="948" idle="351940">
  <reader fetch="859" wait="0"/>
  <writer write="822" wait="395390"/>
  <sincelaststats elapsedtime="300" work="49" cmds="176998" cmdspersec="3543.000000">
    <reader fetch="17" wait="0"/>
    <writer write="29" wait="350833"/>
  </sincelaststats>
</stats>

– State = 1 means the stats were captured after a batch commit

– Work = cumulative time spent by the agent since restart, minus idle time

– Idle = time spent waiting to call sp_replcmds

– Reader fetch = time spent executing sp_replcmds

– Reader wait = time spent waiting for the writer thread to release the buffer

– Writer write = time spent writing commands into the distribution database

– Writer wait = time spent waiting for the reader thread to populate the buffer

Note: Each thread has its own buffer, 40 KB in size.

Here we need to look at the wait times to understand where the bottleneck exists. For example, if the wait time for the reader thread is high, it essentially means your writer thread is slow, since the reader thread is waiting for the writer to release the buffer. Similarly, if the wait time for the writer thread is high, your reader thread is performing slowly.

The simple way to decode this is,

HIGH wait time on reader thread = writer thread is slow (the thread which writes the commands to the distribution database)

HIGH wait time on writer thread = reader thread is slow (the thread which scans the transaction log)
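To compare the two wait times over a period without reading raw XML, here is a minimal sketch that shreds the comments column of MSlogreader_history (swap in MSdistribution_history to do the same for the Distribution Agent):

SELECT h.time,
       h.stats_xml.value('(/stats/reader/@wait)[1]', 'bigint') AS reader_wait,
       h.stats_xml.value('(/stats/writer/@wait)[1]', 'bigint') AS writer_wait
FROM
(
    SELECT time, CAST(comments AS xml) AS stats_xml
    FROM distribution.dbo.MSlogreader_history
    WHERE comments LIKE '<stats%'
) AS h
ORDER BY h.time DESC;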

Distribution Agent Statistics

<stats state="1" work="154" idle="351464">
  <reader fetch="144" wait="11"/>
  <writer write="12" wait="338"/>
  <sincelaststats elapsedtime="305" work="10" cmds="81262" cmdspersec="8041.000000">
    <reader fetch="0" wait="9"/>
    <writer write="10" wait="0"/>
  </sincelaststats>
</stats>

– State = 1 means the stats were captured after a batch commit

– Work = cumulative time spent by the agent since restart, minus idle time (seconds)

– Idle = time spent waiting to call sp_MSget_repl_commands

– Reader fetch = time spent executing sp_MSget_repl_commands

– Reader wait = time spent waiting for the writer thread to release the buffer

– Writer write = time spent writing commands to the subscriber database

– Writer wait = time spent waiting for the reader thread to populate the buffer

Decoding the wait times works exactly the same way as for the Log Reader Agent.

HIGH wait time on reader thread = writer thread is slow (the thread which writes to the subscriber database using batched RPC calls)

HIGH wait time on writer thread = reader thread is slow (the thread which retrieves the pending commands from the distribution database)

Distributor Writer thread Slow Scenario

We can understand these concepts better by looking at example statistics. In the case below, I explicitly started a transaction on a subscriber table to simulate blocking on the subscriber side, making the writer thread of the Distribution Agent wait and latency build up.
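A hypothetical sketch of that simulation (dbo.ReplTest is a placeholder for one of the replicated article tables on the subscriber):

-- Hold an exclusive table lock on a subscriber article so the Distribution Agent
-- writer thread blocks behind it.
BEGIN TRANSACTION;
SELECT TOP (1) * FROM dbo.ReplTest WITH (TABLOCKX, HOLDLOCK);

-- Leave the transaction open while you watch the agent statistics, then release it:
-- ROLLBACK TRANSACTION;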

This is how the stats looked:

<stats state="1" work="755" idle="354505">
  <reader fetch="153" wait="604"/>
  <writer write="613" wait="346"/>
  <sincelaststats elapsedtime="636" work="515" cmds="45033" cmdspersec="87.000000">
    <reader fetch="0" wait="515"/>
    <writer write="515" wait="0"/>
  </sincelaststats>
</stats>

We can clearly see the reader thread wait time is high (515), which means the writer thread is slow, since we simulated blocking on the subscriber side.

Similarly, we can simulate blocking on the replication tables MSrepl_commands and MSrepl_transactions, which will cause the Log Reader Agent's writer thread to be slow, and the stats will show a high reader thread wait time.

OK, now we have isolated the source of the bottleneck at the thread level. From here we can follow the standard performance troubleshooting approach described in this whitepaper to troubleshoot the slowness of the replication session.

For instance, check out the video where Joe Sack talks about using Extended Events to troubleshoot Distribution Agent writer thread slowness.

In Summary

1. Find which agent is causing the slowness using a tracer token.

2. Leverage the agent statistics to narrow the problem down to the thread level.

3. Follow the standard performance troubleshooting approach to resolve the issue.

Thanks for reading! I hope this helps you troubleshoot replication performance better next time.

Posted in Performance, Replication, SQL General | 2 Comments »