LEARN · DEBUGGING GUIDE

Rails Database Migration Error: Failed to Add Column Due to Lock Wait Timeout

Rails migrations fail silently or with cryptic errors when database locks, schema drifts, or unsafe concurrent operations interfere. Here's how to diagnose and fix them without downtime.

IntermediateRuby7 min read

What this usually means

Rails migrations often fail because the migration tries to acquire a lock (e.g., adding a column with a default value) while another transaction holds a conflicting lock. In MySQL, DDL statements often require an exclusive metadata lock; if a long-running SELECT holds a shared lock, the migration times out. In PostgreSQL, adding a column with a default value rewrites the table, blocking writes. Another common cause is a schema version mismatch after a rollback: the schema_migrations table has a version that doesn't exist in the migration files, causing Rails to skip or re-run migrations incorrectly. Deadlocks occur when two transactions each hold a lock the other needs—often from concurrent migrations or application code.

( 01 )Fast diagnosis

The first ten minutes — establish facts before touching code.

  • 1Run `SHOW FULL PROCESSLIST;` (MySQL) or `SELECT * FROM pg_stat_activity WHERE state = 'active';` (Postgres) to see blocking queries.
  • 2Check the last migration version: `rails db:migrate:status` — look for 'down' migrations that weren't applied.
  • 3Inspect MySQL lock waits: `SELECT * FROM information_schema.INNODB_TRX;` and `SHOW ENGINE INNODB STATUS\G` for lock info.
  • 4Review the migration file for unsafe operations: adding column with default, changing column type, or removing a column without proper locks.
  • 5Check the schema_migrations table: `SELECT * FROM schema_migrations ORDER BY version;` for duplicates or missing versions.
( 02 )Where to look

The specific files, logs, configs, and dashboards that usually own this bug.

  • searchdb/migrate/*.rb — the migration file itself for problematic operations
  • searchconfig/database.yml — connection pool size, timeout settings
  • searchlog/development.log or log/production.log — exact error stack trace
  • searchMySQL: `SHOW ENGINE INNODB STATUS\G` — deadlock and lock wait sections
  • searchPostgres: `pg_locks` view and `pg_stat_activity` for long-running queries
  • searchApplication monitoring (e.g., Datadog, New Relic) for query duration spikes
( 03 )Common root causes

Practical causes, not theory. These are the things you will actually find.

  • warningAdding a column with a default value in Postgres causes a full table rewrite, blocking writes.
  • warningRunning a migration while a long-running query holds a shared lock on the table.
  • warningRolling back a migration manually without updating schema_migrations, causing version mismatch.
  • warningInsufficient innodb_lock_wait_timeout (default 50s) for large table migrations in MySQL.
  • warningConcurrent migrations (e.g., multiple deploy pods) trying to alter the same table simultaneously.
  • warningForgetting to set `algorithm: :default` or `lock_timeout` to avoid blocking DDL.
( 04 )Fix patterns

Concrete fix directions. Pick the one that matches your root cause.

  • buildUse `add_column` with `null: false, default: 0` only after adding column as nullable, then backfill in batches (or use Postgres 11+ `NOT VALID`).
  • buildSet `lock_timeout` in the migration: `execute("SET lock_timeout TO '5s'")` before DDL to fail fast instead of hanging.
  • buildUse `change_column_null` with `using` clause to add NOT NULL without full table rewrite.
  • buildRun migrations in a maintenance window or use `pt-online-schema-change` (Percona) to avoid locks.
  • buildFix schema_migrations by inserting missing version: `INSERT INTO schema_migrations (version) VALUES ('20230101000000');`
  • buildUse `safe_migrations` gem or custom rake tasks with `disable_ddl_transaction!` for large tables.
( 05 )How to verify

A fix you cannot prove is a guess. Close the loop.

  • verifiedRun the migration again after applying the fix: `rails db:migrate:redo VERSION=...` or `rails db:migrate`.
  • verifiedCheck `rails db:migrate:status` shows all expected migrations as 'up'.
  • verifiedVerify the column exists: `rails runner 'p User.column_names'` or direct SQL.
  • verifiedTest concurrent writes to the table while migration runs (simulate with a script).
  • verifiedMonitor lock wait times in database logs after the fix: should be under 100ms.
( 06 )Mistakes to avoid

Things that make this bug worse or harder to find.

  • warningKilling the migration process without checking locks first can leave the database in an inconsistent state.
  • warningRebooting the database server to clear locks—this causes downtime and doesn't prevent recurrence.
  • warningAdding `null: false, default: something` on a large table without backfilling—will lock the table for ages.
  • warningIgnoring the rollback: if a migration fails, rollback properly with `rails db:migrate:down VERSION=...` before fixing.
  • warningRunning `ALTER TABLE` directly without wrapping in a transaction or understanding the lock impact.
  • warningManually deleting rows from schema_migrations without verifying the migration file exists.
( 07 )War story

A 15-Minute Migration That Took 2 Hours: Lock Wait Timeout in Production

Senior Backend EngineerRails 6.1, MySQL 8.0, AWS RDS, Sidekiq

Timeline

  1. 14:00Deploy triggered: migration to add `status` column to `orders` table with default 0.
  2. 14:02PagerDuty alert: 'Lock wait timeout exceeded; try restarting transaction' on migration.
  3. 14:05Checked `SHOW FULL PROCESSLIST` — found a long-running Sidekiq worker updating `orders` for 30+ seconds.
  4. 14:10Killed the Sidekiq worker (PID 12345) to release locks.
  5. 14:12Re-ran migration — same error, because the migration itself timed out again.
  6. 14:20Realized the migration tried to add column with default on a 10M-row table; MySQL needed exclusive lock and it timed out.
  7. 14:30Rolled back the migration, added the column without default, then backfilled in batches.
  8. 14:45Deployment succeeded. Added monitoring for long-running queries before future migrations.

We were adding a `status` column to the `orders` table with a default value of 0. The migration looked innocent: `add_column :orders, :status, :integer, default: 0`. But our `orders` table had 10 million rows and was under heavy write load from Sidekiq workers. MySQL requires an exclusive metadata lock to add a column with a default; it has to rewrite the table. While the migration waited for the lock, a Sidekiq worker had a long-running transaction updating the same table. The migration timed out after 50 seconds (the default `innodb_lock_wait_timeout`). I thought killing the worker would fix it, but the migration still failed because the lock timeout had already been reached and the lock was not cleanly released.

I checked `SHOW ENGINE INNODB STATUS` and saw the transaction history. The migration was waiting for the lock held by the killed worker, but the worker's transaction hadn't rolled back instantly. I had to wait for it to rollback, which took another 30 seconds. Then I re-ran the migration, but it timed out again because the table was still busy. At this point, I decided to change strategy: I rolled back the migration, added the column without a default, then ran a separate batch job to set the default. This avoided the table rewrite lock and the migration completed in under 2 seconds.

After the fix, I added a `lock_timeout` to future migrations and set up a pre-deployment check for long-running queries. We also introduced a policy: never add a column with a default on tables with more than 1 million rows. Instead, we use the two-step approach: add column nullable, backfill, then add default. This incident taught me that the default value on a new column is not free—it forces a table rewrite in MySQL and a full table rewrite in Postgres.

Root cause

Adding a column with a default value on a large table under write load caused a lock wait timeout because MySQL needed an exclusive metadata lock, which was blocked by a long-running transaction.

The fix

Rolled back the migration, added the column without default, backfilled the default value in batches, then added the default constraint. Also set `innodb_lock_wait_timeout` to 1 second to fail fast in the future.

The lesson

Never add a column with a default on a large table in production without a zero-downtime strategy. Use the two-step approach and monitor locks during deployment.

( 08 )Understanding Lock Wait Timeout in MySQL DDL

When a Rails migration runs `add_column` with a default value, MySQL has to acquire an exclusive metadata lock (MDL) on the table. This lock waits behind any other transactions holding a shared or exclusive lock on the same table. By default, `innodb_lock_wait_timeout` is 50 seconds. If the lock isn't granted within that time, the migration fails with 'Lock wait timeout exceeded'.

To diagnose, run `SELECT * FROM performance_schema.metadata_locks WHERE OBJECT_SCHEMA = 'your_db' AND OBJECT_NAME = 'your_table';` to see which locks are pending. Often, a long-running `SELECT` or `UPDATE` is the culprit. Kill the blocking query with `KILL QUERY <thread_id>`, but note that the migration may still fail if the timeout has already been reached. Better to set a short lock timeout in the migration itself: `execute("SET SESSION lock_wait_timeout = 5")` before DDL.

( 09 )Schema Mismatch After Rollback

If a migration fails and you manually rollback by running `rails db:migrate:down` or by reverting the migration file, the `schema_migrations` table might get out of sync. For example, the version might still be present in the table but the migration file is missing, causing Rails to skip it erroneously. Or the version might be missing, causing Rails to try to run it again (and fail because the column or table already exists).

To fix, check `rails db:migrate:status` and compare with the actual migration files in `db/migrate/`. If a version is missing from the table but the file exists, insert it: `INSERT INTO schema_migrations (version) VALUES ('20230101000000');`. If a version is in the table but the file is gone, delete it: `DELETE FROM schema_migrations WHERE version = '20230101000000';`.

( 10 )Zero-Downtime Migration Strategies

For large tables, avoid DDL that requires exclusive locks. In MySQL, you can use Percona's `pt-online-schema-change` which creates a temporary table, copies data in chunks, and swaps it in without blocking writes. In Postgres, adding a column with a default (without `NOT NULL`) is instant in Postgres 11+ using the `NOT VALID` option: `ALTER TABLE orders ADD COLUMN status integer DEFAULT 0 NOT NULL;` is instant because it doesn't rewrite the table (only metadata change). However, adding `NOT NULL` later still requires a check.

Rails 5+ supports `add_column` with `default: 0, null: false` but this still causes a table rewrite in MySQL. A safer approach: first add the column without default: `add_column :orders, :status, :integer`; then backfill in batches with `update in batches`; then add the default constraint: `change_column_default :orders, :status, 0` and `change_column_null :orders, :status, false`.

( 11 )Concurrent Migrations and Deadlocks

If you deploy to multiple servers (e.g., Kubernetes pods) and they all run migrations concurrently, you may get deadlocks. Rails uses advisory locks to prevent concurrent migrations (since Rails 5.2), but if the lock mechanism fails or if you're using an older version, two migrations can run simultaneously. This often results in `Deadlock found when trying to get lock` errors.

To mitigate, ensure that only one instance runs migrations (e.g., use a deploy hook that runs migrations only once). You can also set `config.active_record.migration_lock_retry_attempts = 10` to retry on lock failures. If a deadlock occurs, the transaction is rolled back automatically; you just need to retry the migration.

Frequently asked questions

What does 'Lock wait timeout exceeded' mean in Rails migrations?

It means your migration tried to acquire a lock on a database table but another transaction held the lock for longer than the timeout (default 50s in MySQL). The migration fails and rolls back. Common causes: a long-running query on the same table, or another DDL operation. Kill the blocking query or increase the timeout, but better to redesign the migration to avoid exclusive locks.

How do I safely add a column with a default in Rails?

For large tables, use a two-step approach: (1) add the column without default: `add_column :orders, :status, :integer`; (2) backfill the default in batches using `update in batches`; (3) then set the default: `change_column_default :orders, :status, 0`. In Postgres 11+, you can add a column with a default instantly (no table rewrite) if you don't add NOT NULL at the same time. In MySQL, consider using pt-online-schema-change.

Why does my migration succeed locally but fail in production?

Local databases are usually small and have no concurrent traffic. In production, the table may be large and under constant read/write load. The migration may hit lock timeouts or deadlocks. Also, production may have different MySQL settings (e.g., lower innodb_lock_wait_timeout). Replicate production conditions in staging before deploying migrations.

How do I fix a 'Duplicate column name' error after a failed migration?

This happens when the migration partially added the column but then failed, and when you retry, it tries to add the column again. Rollback the migration properly: `rails db:migrate:down VERSION=...` to remove the partial changes. If the column already exists, you may need to remove it manually: `ALTER TABLE orders DROP COLUMN status;` then re-run the migration.