Lesson 1 of 12 10 minDatabase Core

MySQL Production Incident Playbook: Diagnosing Slow Queries, Deadlocks, and Replication Lag

A battle-tested playbook for diagnosing and resolving MySQL production incidents. Covers slow query analysis with EXPLAIN, deadlock debugging via InnoDB status, replication lag investigation, and connection pool exhaustion.

Reading Mode

Hide the curriculum rail and keep the lesson centered for focused reading.

Premium outcome

From relational fundamentals to query plans, transactions, and scaling.

Backend engineers who want stronger SQL fundamentals and production database depth.

What you unlock

  • Better mental models for indexes, joins, transactions, and query optimization
  • A complete path through core MySQL operational and architectural concepts
  • More confidence choosing schema and storage trade-offs in interviews and production

Mental Model

A database incident is not a mystery; it is a sequence of events with a root cause that leaves evidence. Your job as an engineer is to follow the evidence trail systematically before touching anything.

The worst thing you can do during a MySQL incident is start changing things without a diagnosis. Random configuration changes in a degraded system often make things worse. This playbook gives you a systematic investigation protocol for the four most common MySQL production incidents.


Incident Type 1: Sudden Slowdown / High Latency

Symptoms: Query latency spikes from 10ms to 2000ms. Error rates increase. CPU is high or I/O wait is high.

Step 1: See What's Running Right Now

-- Show all currently executing queries
SHOW FULL PROCESSLIST;

-- Or with more detail (requires performance_schema)
SELECT 
  processlist_id,
  processlist_user,
  processlist_db,
  processlist_command,
  processlist_time,
  processlist_state,
  LEFT(processlist_info, 200) AS query_preview
FROM performance_schema.processlist
WHERE processlist_command != 'Sleep'
ORDER BY processlist_time DESC;

What to look for:

  • Queries with processlist_time > 10 seconds
  • State: Waiting for table metadata lock — something holds a metadata lock (often an uncommitted DDL or long transaction)
  • State: Sending data with very high time — data volume problem or missing index
  • State: Locked — row-level or table-level lock contention

Step 2: Kill the Blocking Query

-- Identify the blocking thread
SELECT 
  r.trx_id AS waiting_trx_id,
  r.trx_mysql_thread_id AS waiting_thread,
  r.trx_query AS waiting_query,
  b.trx_id AS blocking_trx_id,
  b.trx_mysql_thread_id AS blocking_thread,
  b.trx_query AS blocking_query,
  b.trx_started AS blocking_started,
  TIMESTAMPDIFF(SECOND, b.trx_started, NOW()) AS blocking_seconds
FROM information_schema.innodb_lock_waits w
JOIN information_schema.innodb_trx b ON b.trx_id = w.blocking_trx_id
JOIN information_schema.innodb_trx r ON r.trx_id = w.requesting_trx_id;

-- Kill the blocking thread (CAREFUL: confirm this is safe first)
KILL 12345; -- blocking_thread ID

Warning: Never kill a thread without understanding why it's there. A long-running UPDATE might be legitimate — a data migration, for example. Killing it causes a rollback that might take as long to complete as the original query.

Step 3: EXPLAIN the Slow Query

-- Run EXPLAIN on the slow query (without ANALYZE — it's non-destructive)
EXPLAIN SELECT * FROM orders o 
  JOIN users u ON o.user_id = u.id
  WHERE o.status = 'pending' 
  AND o.created_at > '2026-01-01';

Red flags in EXPLAIN output:

Column Red Flag Meaning
type ALL Full table scan — no index used
type index Full index scan — better, but still slow on large tables
rows > 100,000 MySQL will examine that many rows
Extra Using filesort Sort can't use an index — expensive on large result sets
Extra Using temporary MySQL needs a temp table — very expensive
key NULL No index selected

Step 4: EXPLAIN ANALYZE for Confirmed Queries

Once you identify the slow query, run EXPLAIN ANALYZE in a non-production environment (it actually executes the query):

-- Shows actual vs estimated row counts — reveals optimizer lies
EXPLAIN ANALYZE 
  SELECT * FROM orders WHERE status = 'pending' ORDER BY created_at;

-- Output shows:
-- -> Sort: orders.created_at  (actual time=2847.123..2847.456 rows=89234 loops=1)
--     -> Filter: (orders.status = 'pending')  
--         -> Table scan on orders  (cost=23847.00 rows=234891 actual time=0.023..2201.342 rows=89234 loops=1)

The actual vs estimated rows gap tells you if the optimizer is working with stale statistics — if so, run ANALYZE TABLE orders;.


Incident Type 2: Deadlock

Symptoms: Application receives ERROR 1213 (40001): Deadlock found when trying to get lock; try restarting transaction.

Deadlocks are deterministic: they happen when two transactions acquire locks in opposite orders. Understanding this is essential before you can fix them.

Step 1: Read the InnoDB Deadlock Info

-- Show the last deadlock (stored since last restart)
SHOW ENGINE INNODB STATUS\G

-- Look for the LATEST DETECTED DEADLOCK section
-- It shows:
-- 1. Transaction A: what it held, what it waited for
-- 2. Transaction B: what it held, what it waited for
-- 3. Which transaction was rolled back (the "victim")

Reading the deadlock output:

TRANSACTION 1, ACTIVE 2 sec starting index read
MySQL thread id 156, OS thread handle 140234891776
LOCK WAIT 3 lock struct(s), heap size 1136, 2 row lock(s)

RECORD LOCKS space id 89 page no 5 n bits 72 index PRIMARY of table `db`.`orders`
trx id 421897 lock_mode X locks rec but not gap waiting

TRANSACTION 2, ACTIVE 1 sec
MySQL thread id 157
1 lock struct(s), heap size 1136, 1 row lock(s)
RECORD LOCKS space id 89 page no 5 n bits 72 index PRIMARY of table `db`.`orders`
trx id 421896 lock_mode X locks rec but not gap

Step 2: Identify the Lock Order Problem

Most deadlocks have this pattern:

Time →

Transaction A                     Transaction B
──────────────────────────────────────────────────
LOCK orders WHERE id = 100        
                                  LOCK orders WHERE id = 200
WAIT for orders WHERE id = 200   
                                  WAIT for orders WHERE id = 100  ← DEADLOCK

The universal fix: Ensure all code paths that lock multiple rows do so in the same order — typically by ascending primary key.

Step 3: Fix the Deadlock in Code

// BEFORE (deadlock-prone): Different threads might acquire in different orders
public void transferFunds(long fromId, long toId, BigDecimal amount) {
    Account from = accountRepo.findByIdForUpdate(fromId); // LOCK id=from
    Account to = accountRepo.findByIdForUpdate(toId);     // LOCK id=to
    // Thread 1: locks 100 then 200
    // Thread 2: locks 200 then 100 → DEADLOCK
}

// AFTER (deadlock-safe): Always lock in ascending ID order
public void transferFunds(long fromId, long toId, BigDecimal amount) {
    long firstId = Math.min(fromId, toId);
    long secondId = Math.max(fromId, toId);
    
    Account first = accountRepo.findByIdForUpdate(firstId);
    Account second = accountRepo.findByIdForUpdate(secondId);
    
    Account from = (fromId == firstId) ? first : second;
    Account to = (toId == firstId) ? first : second;
    // Now: Both threads always lock the smaller ID first → no deadlock possible
}

Step 4: Add Deadlock Monitoring

-- Track deadlock frequency over time
SELECT 
  variable_value AS deadlocks_since_restart
FROM performance_schema.global_status
WHERE variable_name = 'Innodb_deadlocks';

-- Set up a periodic check in your monitoring
-- Alert if deadlock rate > 10/minute

Incident Type 3: Replication Lag

Symptoms: Read replica is returning stale data. Seconds_Behind_Master is growing. Replica queries are reading data that was written minutes or hours ago.

Step 1: Check Current Lag

-- On the replica
SHOW REPLICA STATUS\G

-- Key fields:
-- Seconds_Behind_Master: Current lag in seconds
--   0 = in sync
--   NULL = replication is stopped or not connected
--   Large number = replica is catching up

-- Replica_SQL_Running: YES = applying binlogs
-- Replica_IO_Running: YES = receiving binlogs from primary

Step 2: Diagnose the Bottleneck

-- Is the replica IO thread stuck? (Can't receive binlogs)
-- Check: Last_IO_Error, Last_IO_Errno

-- Is the replica SQL thread stuck? (Can't apply binlogs)
-- Check: Last_SQL_Error, Last_SQL_Errno

-- Is a single long transaction causing lag?
SELECT 
  trx_started,
  TIMESTAMPDIFF(SECOND, trx_started, NOW()) AS running_seconds,
  trx_query
FROM information_schema.innodb_trx
WHERE TIMESTAMPDIFF(SECOND, trx_started, NOW()) > 30
ORDER BY running_seconds DESC;

Step 3: Common Causes and Fixes

Cause 1: Long-running primary transaction

A single 30-second UPDATE on the primary causes a 30-second gap on the replica (single-threaded applier is blocked until the full transaction replays).

-- Fix: Break large updates into batches on the primary
-- BEFORE (blocks replica for 60 seconds):
UPDATE orders SET status = 'archived' WHERE created_at < '2025-01-01';
-- This might affect 10 million rows

-- AFTER (replica lag stays < 1 second):
-- Run in a loop until done:
UPDATE orders SET status = 'archived' 
WHERE created_at < '2025-01-01' 
AND status != 'archived'
LIMIT 1000;
-- Sleep 100ms between batches

Cause 2: Missing index on replica

If the replica has a different schema than the primary (e.g., a migration ran on primary but not replica), row-based replication must scan the full table to find the rows to update.

-- Check indexes match
-- On primary:
SHOW INDEX FROM orders;
-- On replica:
SHOW INDEX FROM orders;
-- Fix: Add missing index to replica

Cause 3: Replica I/O bound

The replica disk can't keep up with the primary's write rate.

-- Enable parallel replication (MySQL 5.7+)
-- On replica:
SET GLOBAL slave_parallel_workers = 8;
SET GLOBAL slave_parallel_type = 'LOGICAL_CLOCK';

-- Persist in my.cnf:
slave_parallel_workers = 8
slave_parallel_type = LOGICAL_CLOCK

Incident Type 4: Connection Pool Exhaustion

Symptoms: Application returns "Too many connections" or connection pool timeout errors. Database max_connections is hit.

Step 1: Check Current Connection State

-- Current connections vs maximum
SHOW VARIABLES LIKE 'max_connections';
SHOW STATUS LIKE 'Threads_connected';
SHOW STATUS LIKE 'Threads_running'; -- actively executing (not sleeping)

-- See connection distribution by user and host
SELECT user, host, COUNT(*) as connections 
FROM information_schema.processlist 
GROUP BY user, host
ORDER BY connections DESC;

-- Find sleeping connections holding locks (common problem)
SELECT id, user, time, state, info
FROM information_schema.processlist
WHERE command = 'Sleep' AND time > 30
ORDER BY time DESC;

Step 2: Identify the Leak

-- Check if connections aren't being returned properly
-- High Threads_connected + low Threads_running = connection leak

-- Check connection wait time
SHOW STATUS LIKE 'Connection_errors_max_connections';
-- Non-zero = connections are being rejected

-- Check for connections from unexpected sources
SELECT host, COUNT(*) FROM information_schema.processlist 
GROUP BY host ORDER BY 2 DESC;

Step 3: Immediate Relief

-- Kill all idle connections older than 60 seconds (emergency only)
SELECT CONCAT('KILL ', id, ';')
FROM information_schema.processlist
WHERE command = 'Sleep' AND time > 60;
-- Copy and execute the output carefully

Step 4: Root Cause Fix — HikariCP Configuration

// Spring Boot HikariCP configuration
spring:
  datasource:
    hikari:
      maximum-pool-size: 20          # (CPU cores * 2) + disk spindles is a good starting point
      minimum-idle: 5
      connection-timeout: 30000      # 30 seconds max wait for a connection
      idle-timeout: 600000           # Return idle connections after 10 minutes
      max-lifetime: 1800000          # Recycle connections every 30 minutes
      keepalive-time: 60000          # Test idle connections every 60 seconds
      connection-test-query: SELECT 1 # Validate connection is alive
      leak-detection-threshold: 60000 # Warn if a connection is held > 60 seconds

The critical insight on pool sizing: max-pool-size should equal the number of connections your database can effectively serve. For most workloads: (CPU_cores × 2) + effective_disk_spindles. A pool size of 200 on a 4-core database server will cause resource contention. 20–50 connections per application instance is usually correct.


Setting Up Proactive Monitoring

Don't wait for incidents — monitor these metrics continuously:

-- Key metrics to track every 60 seconds:
SELECT
  (SELECT variable_value FROM performance_schema.global_status WHERE variable_name = 'Threads_connected') AS connections,
  (SELECT variable_value FROM performance_schema.global_status WHERE variable_name = 'Threads_running') AS active_queries,
  (SELECT variable_value FROM performance_schema.global_status WHERE variable_name = 'Innodb_row_lock_waits') AS lock_waits,
  (SELECT variable_value FROM performance_schema.global_status WHERE variable_name = 'Innodb_deadlocks') AS deadlocks,
  (SELECT variable_value FROM performance_schema.global_status WHERE variable_name = 'Slow_queries') AS slow_queries;

Alert thresholds:

  • Threads_connected > 80% of max_connections → Warning
  • Threads_running > 50 → Warning (too many concurrent queries)
  • Seconds_Behind_Master > 30 → Warning on replica
  • Innodb_deadlocks rate > 1/minute → Investigation needed
  • Slow_queries rate > 10/minute → Check slow_query_log

Key Takeaways

  • SHOW PROCESSLIST + EXPLAIN ANALYZE are your first tools in any MySQL performance incident — never guess without data.
  • Deadlocks are always reproducible if you understand the lock acquisition order; the fix is always about consistent ordering.
  • Replication lag > 10 seconds means your read replicas are serving stale data — always monitor and alert on Seconds_Behind_Master.

Want to track your progress?

Sign in to save your progress, track completed lessons, and pick up where you left off.