MySQL Replication Delay Too High: Debug Guide

What this usually means

MySQL replication lag fundamentally means the replica SQL thread (applying relay logs) is slower than the IO thread (fetching binlogs). This gap can be caused by write-heavy workloads, long-running queries on the replica, insufficient hardware (disk, CPU, memory), or configuration mismatches like different table engines or index differences. Non-obvious causes include: the replica running ANALYZE TABLE or OPTIMIZE, galera cluster certification pauses, or excessive binary log retention on the replica. The Seconds_Behind_Master metric is unreliable when the replica is catching up or when there's a network partition — it only measures the difference between timestamps in the relay log, not actual apply progress.

( 01 )Fast diagnosis

The first ten minutes — establish facts before touching code.

1Run SHOW SLAVE STATUS\G on replica; check Slave_IO_Running, Slave_SQL_Running, Seconds_Behind_Master, and Last_IO_Error / Last_SQL_Error.
2Check SHOW PROCESSLIST on replica; look for any query with State 'System lock', 'Sending data', or 'Updating' that runs >10 seconds.
3Run iostat -x 1 on replica; check %util >90% or await >10ms indicating disk bottleneck.
4Check relay log space: ls -lh /var/lib/mysql/relay-log.*; if relay logs pile up, the SQL thread is stuck.
5Check master binlog retention: SHOW BINARY LOGS on master; too many binlogs can slow replica IO thread if network is slow.
6Check replica read-only setting: SELECT @@read_only; if not set, user queries can block replication SQL thread.

( 02 )Where to look

The specific files, logs, configs, and dashboards that usually own this bug.

searchSHOW SLAVE STATUS\G on replica
searchSHOW PROCESSLIST on replica and master
searchMySQL error log: /var/log/mysql/error.log
searchPerformance Schema tables: events_statements_summary_by_digest
searchiostat, vmstat, top on replica server
searchRelay log files: /var/lib/mysql/relay-log.*
searchpt-query-digest slow query log or general log

( 03 )Common root causes

Practical causes, not theory. These are the things you will actually find.

warningLong-running SELECT or ALTER on replica blocking SQL thread (e.g., full table scan due to missing index)
warningReplica disk I/O saturated by writes; relay log writes compete with data file writes
warningMaster binlog rotation frequency too high (small binlog size) causing excessive network round trips
warningReplica's innodb_flush_log_at_trx_commit = 1 causing fsync every transaction
warningReplica using MyISAM tables that cause table-level locks
warningReplica hardware underprovisioned (e.g., single disk vs. RAID with battery-backed cache)
warningBinary log checksum enabled on master but not on replica causing retries

( 04 )Fix patterns

Concrete fix directions. Pick the one that matches your root cause.

buildKill blocking queries on replica: SHOW PROCESSLIST; KILL <thread_id>; then restart SQL thread: STOP SLAVE; START SLAVE;
buildAdd missing indexes on replica tables used by long-running SELECT queries
buildIncrease relay log space or enable relay log auto-purge: relay_log_purge=1
buildSwitch replica to use SSD or RAID10 with write-back cache
buildTune replica: set innodb_flush_log_at_trx_commit=2, sync_binlog=0, increase innodb_buffer_pool_size
buildBatch writes on master to reduce transaction count per second
buildUse row-based replication (ROW) instead of STATEMENT to avoid non-deterministic queries

( 05 )How to verify

A fix you cannot prove is a guess. Close the loop.

verifiedAfter fix, monitor Seconds_Behind_Master trends for 15 minutes — should trend to 0
verifiedRun pt-heartbeat on replica; actual lag should match or be lower than Seconds_Behind_Master
verifiedVerify disk I/O with iostat -x 1 during peak writes; await <5ms, %util <80%
verifiedCheck SHOW SLAVE STATUS shows no errors and both threads Yes
verifiedRun a read query on replica and master; compare data freshness for a known write timestamp
verifiedLoad test with production traffic replay to ensure lag stays under threshold

( 06 )Mistakes to avoid

Things that make this bug worse or harder to find.

warningBlindly restarting MySQL replication without investigating root cause — lag will return
warningIncreasing relay_log_space_limit without fixing the SQL thread bottleneck — disks fill up
warningSetting slave_parallel_workers too high without proper schema design — causes deadlocks
warningIgnoring master binlog retention: too many binlogs cause replica IO thread to read slowly over network
warningAssuming Seconds_Behind_Master is accurate — it can be 0 while replica is actually behind if timestamps are stale
warningRunning OPTIMIZE TABLE on replica during peak hours — it locks tables and spikes lag

( 07 )War story

Replica Lag Spike During ETL Batch Job

Senior Database Reliability EngineerMySQL 8.0.28, Percona Server, AWS R5.2xlarge (master), R5.xlarge (replica), ETL jobs on EC2

Timeline

14:00ETL batch starts: loads 5M rows into staging table on master
14:05Alert: replica lag hits 300 seconds, paging DRE
14:07I check SHOW SLAVE STATUS: Seconds_Behind_Master=305, SQL thread state 'System lock'
14:10SHOW PROCESSLIST on replica: one query with 'System lock' on staging table for 7 minutes
14:12iostat shows disk %util 95% on replica, await 25ms
14:15Kill the stuck query; lag drops to 50 seconds within 2 minutes
14:20Analyze slow query log: the query is a SELECT COUNT(*) on staging table without index
14:30Add index on staging table, restart ETL with reduced parallelism
15:00Lag stays below 10 seconds for remainder of ETL

I was on-call when the pager went off at 14:05. Replica lag hit 300 seconds. Our application reads from the replica for user dashboards; stale data meant wrong analytics. I ssh'd into the replica and ran SHOW SLAVE STATUS. Seconds_Behind_Master=305, SQL thread state 'System lock'. That's almost always a query blocking replication.

Checking SHOW PROCESSLIST revealed a single query on the staging table with 'System lock' for 7 minutes. It was a SELECT COUNT(*) from the ETL process that had been running for hours. The table had no index on the column being counted, causing a full table scan. Meanwhile, the relay logs were accumulating because the SQL thread couldn't apply updates.

I killed the query with KILL 1234. Within seconds, the SQL thread started catching up. Lag dropped to 50 seconds in two minutes. I analyzed the slow query log with pt-query-digest and confirmed the missing index. After adding the index, the ETL team reduced the batch size and lag never exceeded 10 seconds. We also set up a script to auto-kill queries running longer than 5 minutes on the replica.

Root cause

Long-running SELECT COUNT(*) on replica due to missing index, blocking the SQL thread and causing replication lag to spike.

The fix

Killed the blocking query, added an index on the staging table, and set up a query timeout on the replica to kill runaway queries automatically.

The lesson

Always profile queries hitting replicas. A missing index on a replica can cause more damage than on master because it blocks replication, affecting all downstream consumers.

( 08 )Seconds_Behind_Master Is a Lie

The metric Seconds_Behind_Master is calculated as the difference between the timestamp in the relay log entry the SQL thread is currently applying and the current time on the replica. This becomes unreliable when the IO thread is also delayed (e.g., network partition) because the timestamp in the relay log is already old. In that case, the SQL thread may appear to be catching up (low lag) while actually the replica is hours behind because it hasn't received new binlogs.

A better measurement is to use pt-heartbeat, which inserts a timestamp into a table on master and compares it on replica. This tells you the actual data freshness. Another approach is to monitor the 'Exec_Master_Log_Pos' vs 'Read_Master_Log_Pos' — if they diverge, the SQL thread is behind regardless of timestamps.

( 09 )Disk I/O: The Silent Replication Killer

Replication involves two write streams on the replica: the IO thread writes relay logs, and the SQL thread writes data files. If the disk is a single HDD or a RAID5 with write-back cache disabled, these writes contend. The relay log writes are sequential, but the SQL thread does random writes (to tables/indexes). This contention causes high iowait and slows both threads.

Use iostat -x 1 to check %util and await. If %util >90% and await >10ms, you have a disk bottleneck. Solutions: use SSDs, RAID10 with battery-backed write cache, or tune innodb_flush_log_at_trx_commit=2 and sync_binlog=0 on the replica. For relay logs, consider placing them on a separate disk: set relay_log_index and relay_log to a different mount point.

( 10 )Parallel Replication: Not a Silver Bullet

MySQL 8.0 supports parallel replication via slave_parallel_workers. However, it only works if transactions are conflict-free. In practice, if multiple transactions touch the same database or table (e.g., row-level replication with foreign keys), they will queue up anyway, and parallel workers may cause deadlocks. The default slave_parallel_type=LOGICAL_CLOCK works best when master has many concurrent transactions on different databases.

Before enabling parallel replication, verify your workload: run SHOW BINLOG EVENTS on master and check the last_committed and sequence_number fields. If most transactions have the same last_committed, parallelism won't help. Also, monitor for 'Waiting for preceding transaction' in SHOW PROCESSLIST on replica.

( 11 )Binary Log Retention and Rotation on Master

If the master retains too many binary logs (e.g., expire_logs_days=7 with heavy writes), the replica IO thread must read through all of them to catch up after a network interruption. This can cause prolonged lag even after the network is restored. Check SHOW BINARY LOGS on master; if there are hundreds, consider reducing expire_logs_days or using binlog_group_commit_sync_delay to reduce the number of binlogs.

Also, if binlog_format=STATEMENT, non-deterministic queries (e.g., using RAND(), NOW()) can cause different results on replica, leading to errors and stopping the SQL thread. Always use ROW-based replication for production.

( 12 )Monitoring and Alerting Done Right

Don't rely solely on Seconds_Behind_Master for alerting. Set up pt-heartbeat and alert when lag exceeds your SLA (e.g., 30 seconds). Additionally, monitor the 'Relay_Log_Space' metric from SHOW SLAVE STATUS — if it grows unbounded, the SQL thread is stuck. Also, watch for 'Last_SQL_Errno' non-zero, which indicates an error that stops replication.

Use tools like Percona Monitoring and Management (PMM) or Prometheus + mysqld_exporter to collect replication metrics over time. Correlate lag spikes with query throughput, disk I/O, and CPU usage on both master and replica.

Frequently asked questions

Why does Seconds_Behind_Master sometimes show NULL?

Seconds_Behind_Master is NULL when the SQL thread is not running (e.g., after a STOP SLAVE), or when the IO thread is not running and the SQL thread has applied all relay logs. It can also be NULL during a transient state after START SLAVE. Check Slave_SQL_Running and Slave_IO_Running to confirm.

Can a slow SELECT on the replica cause replication lag?

Yes. Even though SELECTs don't write data, they can hold locks (e.g., MyISAM table locks, or InnoDB row locks if using LOCK IN SHARE MODE). Also, a long-running SELECT can block DDL operations that replication tries to apply. Kill such queries or use pt-kill to automatically terminate long-running queries on replicas.

Should I set the replica to read-only?

Absolutely. Set read_only=1 on all replicas to prevent accidental writes from applications or users. This avoids conflicts with replication and ensures data consistency. The only exception is if you have a dedicated replication user that needs to write to internal tables (e.g., for monitoring).

How do I catch up a replica that is hours behind without causing more lag?

First, identify and fix the bottleneck (slow query, disk I/O, etc.). Then, you can temporarily stop the SQL thread (STOP SLAVE SQL_THREAD) to let the IO thread download all relay logs, then start the SQL thread. This prevents relay log writes from competing with data writes. Alternatively, use pt-slave-restart to skip errors if the lag is due to a specific error.

Why does replication lag spike after a master failover?

After failover, the new master may have different binary log positions. The old replica (now a new replica) needs to sync from a potentially far behind position. Also, if the new master has a different configuration (e.g., different innodb_flush_log_at_trx_commit), it can cause write amplification. Always test failover in staging and ensure replicas are up to date before promoting.

Debugging MySQL Replication Lag: From Seconds to Hours

What this usually means

Frequently asked questions