Troubleshooting latency issues in replication is black box for many DBA’s, In this post I will explain how you can leverage the agent statistics to troubleshoot the latency issues.
Before understanding how to decode the agent statistics, lets take a look at the some of the basic things which will help us to troubleshoot the replication performance issue in better way.
The following MSDN diagram depicts the transactional replication architecture in simple manner.
Troubleshooting latency issues is multi step approach, first step is identify which agent is slow,
- Log reader Agent (Publisher to Distributor)
- Distribution Agent (Distributor to Subscriber)
So, the problem can be either log reader or distribution agent, we can identify this by just simply inserting the tracer token.
Once we find out the problematic agent the next step is to identify within the agent which particular thread causing the issue.
Let me introduce you to the important threads and its work on these replication agents in nutshell.
Log Reader Agent
Reader Thread – It scans the publisher database transaction log using sp_replcmds
Writer Thread – Add the queued transactions to Distribution database using sp_MSadd_repl_commands
Reader thread – It finds the watermark from the table Msreplication_subscriptions(on subscriber) and uses this information to retrieve pending commands from the Distribution database. It basically uses the stored procedure sp_MSget_replcommands to achieve it.
Writer thread – Writer thread uses the Batched RPC calls to write the information to subscriber database.
Now that we understood the threads in the replication agents. let’s assume we already identified which agent is slow by inserting tracer token. Next is to dig deeper on thread level, this is where our replication agent statistics comes to rescue us.
Agent statistics entries appended to history tables every 5 minutes by default. It provides the historical view of how the agent has been performing and keeps the last 3 days data. You can keep for more days by changing the history retention period.
the above two tables are located in Distribution database. The statistics information is added as XML blob in comments column of these tables.
Now, lets take a look at how to decipher this XML Data for each agents.
Log Reader Agent statistics
<stats state=”1″ work=”948″ idle=”351940″ >
<reader fetch=”859″ wait=”0″/>
<writer write=”822″ wait=”395390″/>
<sincelaststats elapsedtime=”300″ work=”49″ cmds=”176998″ cmdspersec=”3543.000000″><reader fetch=”17″ wait=”0″/><writer write=”29″ wait=”350833″/></sincelaststats></stats>
– State = 1 means stats after batch commit
–Work = cumulative time spent by the agent since restart – idle time
–Idle = Time spent waiting to call sp_replcmds
–Reader fetch = Time to do execute sp_replcmds
Wait = Time spent waiting on writer to release buffer
–Writer write = Time spent writing commands into distribution database
Wait = Time spent waiting on reader to populate buffer
Note: Each thread will have their own buffer with 40k in size.
Here,we need to look at the wait time to understand where the bottleneck exist.For example, if you notice wait time for Reader thread is high then it essentially means your writer thread is slow since reader thread is waiting for writer to release the buffer. Similarly, if you notice high wait time for writer thread then your reader thread is performing slow.
The simple way to decode this is,
HIGH wait time on Reader thread = Writer thread is slow ( thread which writes the commands to distribution database)
HIGH Wait time on Writer thread = Reader thread is slow ( thread which scans the transaction log)
Distribution Agent Statistics
<stats state=”1″ work=”154″ idle=”351464″>
<reader fetch=”144″ wait=”11″/>
<writer write=”12″ wait=”338″/>
<sincelaststats elapsedtime=”305″ work=”10″ cmds=”81262″ cmdspersec=”8041.000000″><reader fetch=”0″ wait=”9″/><writer write=”10″ wait=”0″/></sincelaststats></stats>
– State =1 means stats after a batch commit
– Work = cumulative time spend by the agent since restart – idle time (seconds)
– Idle = Time spend waiting to call sp_msget_repl_commands
– Reader fetch = Time to do execute sp_msget_repl_commands
Wait = Time spent waiting on writer to release buffer.
– Writer write = Time spend writing commands into distribution database
Wait = Time spent waiting on reader to populate buffer.
Similar to log reader agent, the decoding of wait time is same way we did for log reader agent.
HIGH wait time on Reader thread = Writer thread is slow ( thread which writes the subscriber database using batched RPC Calls)
HIGH wait time on Writer thread = Reader thread is slow ( thread which takes the pending commands from Distribution database)
Distributor Writer thread Slow Scenario
We would be able to understand this concepts better by looking at the example statistics, In this below case, I explicitly started the transaction on subscriber table to simulate blocking at the subscriber side making the writer thread of distribution agent to wait and build up latency.
This is how stats looked,
- <stats state=”1″ work=”755″ idle=”354505″>
- <reader fetch=”153″ wait=”604″/>
- <writer write=”613″ wait=”346″/>
- <sincelaststats elapsedtime=”636″ work=”515″ cmds=”45033″ cmdspersec=”87.000000″><reader fetch=”0″ wait=”515″/><writer write=”515″ wait=”0″/></sincelaststats></stats>
We can clearly see Reader thread wait time is high(515) which means writer thread is slow since we simulated the blocking on subscriber side.
Similarly,we can simulate the blocking on replication tables msrepl_commands and msrepl_transactions which will cause Log reader writer thread to be slow and stats will show Reader thread wait time as high.
Ok, now we isolated the source of bottleneck in thread level, After this we can just follow the standard performance troubleshooting approach described in this Whitepaper to troubleshoot the slowness of the replication session.
For instance, check out the video where Joe Sack talks about using Extended events to troubleshoot the Distributor writer thread slowness.
1. Find which agent is causing slowness using tracer token.
2. Leverage the Agent statistics to narrow down problem to thread level .
3. Follow standard performance troubleshooting approach to resolve the issue.
Thanks for reading! I hope this will help you to troubleshoot the replication performance better next time.