Backup and Restore
Backup and Restore
Section titled “Backup and Restore”This guide covers how to back up T4 data, how to restore a cluster after failure, and how to create zero-copy snapshots using branching.
How backups work
Section titled “How backups work”T4 writes checkpoints to S3 automatically. A checkpoint contains:
- A Pebble database snapshot (SST files + metadata).
- A manifest JSON file pointing to the checkpoint’s SST files and the revision at which it was taken.
- An index JSON file with the term and revision.
Checkpoints are written:
- Automatically on a configurable interval (
CheckpointInterval, default 15 min). - Optionally after a configurable number of WAL entries (
CheckpointEntries). - On leader promotion (startup checkpoint).
SST files are content-addressed — identical content is stored once regardless of how many checkpoints reference it. Checkpoints are cheap: only changed SSTs are written.
No manual backup steps are needed. As long as the node can write to S3, backups are continuous and automatic.
Listing available checkpoints
Section titled “Listing available checkpoints”t4 restore list \ --s3-bucket my-bucket \ --s3-prefix t4/Output:
CHECKPOINT REVISION TERMcheckpoint/0000000001/00000000000000000050/manifest.json 50 1checkpoint/0000000001/00000000000000000100/manifest.json 100 1 (latest)The (latest) marker shows the checkpoint referenced by manifest/latest.
Restoring the latest checkpoint
Section titled “Restoring the latest checkpoint”Use this to recover a node that lost its local disk, or to spin up a replacement node.
t4 restore checkpoint \ --s3-bucket my-bucket \ --s3-prefix t4/ \ --data-dir /var/lib/t4-restoredThen start the node normally — it replays any WAL segments on S3 that are newer than the restored checkpoint before joining the cluster:
t4 run \ --data-dir /var/lib/t4-restored \ --s3-bucket my-bucket \ --s3-prefix t4/ \ --listen 0.0.0.0:3379Restoring a specific earlier checkpoint (point-in-time rollback)
Section titled “Restoring a specific earlier checkpoint (point-in-time rollback)”Use this when recent writes caused data corruption and you want to roll back to a known-good revision.
# 1. List checkpoints to find the target revision.t4 restore list --s3-bucket my-bucket --s3-prefix t4/
# 2. Download the target checkpoint to a fresh directory.t4 restore checkpoint \ --s3-bucket my-bucket \ --s3-prefix t4/ \ --checkpoint checkpoint/0000000001/00000000000000000050/manifest.json \ --data-dir /var/lib/t4-pitr
# 3. Inspect the restored data offline, without starting a server.t4 inspect meta --data-dir /var/lib/t4-pitrt4 inspect list --data-dir /var/lib/t4-pitr --prefix /t4 inspect history --data-dir /var/lib/t4-pitr /your/key
# 4. Optionally start a verification node if you want to use etcd clients too.t4 run --data-dir /var/lib/t4-pitr --listen 0.0.0.0:3380etcdctl --endpoints=localhost:3380 get --prefix /
# 5. When satisfied, promote to a new production prefix.t4 run \ --data-dir /var/lib/t4-pitr \ --s3-bucket my-bucket \ --s3-prefix t4-recovered/ \ --listen 0.0.0.0:3379Caution: using the original S3 prefix in step 5 replays all WAL segments after revision 50 — this is recovery, not rollback. To stay at revision 50, use a fresh prefix as shown above.
Restoring a multi-node cluster
Section titled “Restoring a multi-node cluster”When the entire cluster fails:
- Restore the latest checkpoint to each node’s data directory (or start fresh — nodes recover automatically from S3 on startup).
- Start all nodes pointing at the same S3 bucket and prefix.
- They race to acquire the S3 leader lock. One wins, the others follow.
# On each node (same bucket+prefix, different data-dir and peer address):t4 run \ --data-dir /var/lib/t4 \ --s3-bucket my-bucket \ --s3-prefix t4/ \ --peer-listen 0.0.0.0:3380 \ --advertise-peer <this-node-ip>:3380 \ --listen 0.0.0.0:3379No manual restore step is needed — each node runs t4 restore checkpoint internally on startup if its local Pebble database is absent.
Zero-copy branching (snapshot without download)
Section titled “Zero-copy branching (snapshot without download)”Branching lets you fork a database at a checkpoint without copying SST files in S3. The branch node reads inherited SSTs from the source prefix and writes its own new SSTs to a separate prefix.
# 1. Register a branch (prints checkpoint key).t4 branch fork \ --s3-bucket my-bucket \ --s3-prefix t4/ \ --branch-id staging
# 2. Start the branch node with the printed checkpoint key.t4 run \ --data-dir /var/lib/t4-staging \ --s3-bucket my-bucket \ --s3-prefix t4-staging/ \ --branch-prefix t4/ \ --branch-checkpoint <key-from-step-1> \ --listen 0.0.0.0:3379
# 3. When done, remove the branch registration so source GC can reclaim space.t4 branch unfork \ --s3-bucket my-bucket \ --s3-prefix t4/ \ --branch-id stagingUse cases: test environments, schema migrations, CI data fixtures, analytics read replicas.
Point-in-time restore using S3 versioning
Section titled “Point-in-time restore using S3 versioning”If your S3 bucket has versioning enabled, you can restore to any point in time — not just to a checkpoint boundary.
t4 run \ --data-dir /var/lib/t4-pitr \ --s3-bucket my-bucket \ --s3-prefix t4/ \ --restore-point-time "2024-06-01T12:00:00Z" \ --listen 0.0.0.0:3380See API reference — RestorePoint for the Go library equivalent.
Backup retention and GC
Section titled “Backup retention and GC”Old checkpoints and WAL segments accumulate in S3 unless explicitly pruned. Use t4 gc to remove objects outside a retention window:
t4 gc \ --s3-bucket my-bucket \ --s3-prefix t4/ \ --keep 5This keeps the 5 most recent checkpoints and deletes everything else in three passes:
- Old checkpoint archives beyond the
--keepwindow. - SST files exclusively referenced by the deleted checkpoints (orphans not needed by any surviving checkpoint or active branch).
- WAL segments whose entire revision range is covered by the latest surviving checkpoint.
Use t4 status first to see current counts:
t4 status --s3-bucket my-bucket --s3-prefix t4/Branch safety: GC reads the branch registry before deleting. A checkpoint pinned by an active
t4 branch forkis never deleted, nor are its SST files. Callt4 branch unforkonly after the branch node is fully decommissioned.
See Storage management — Garbage collection for retention recommendations.
Verifying a restore
Section titled “Verifying a restore”After restoring, verify that data is intact before promoting the node:
# Check the revision at which the node is running.curl -s http://localhost:9090/metrics | grep t4_revision
# Or inspect the restored local data directory directly.t4 inspect meta --data-dir /var/lib/t4-restoredt4 inspect count --data-dir /var/lib/t4-restored --prefix /
# Spot-check key data.etcdctl --endpoints=localhost:3379 get --prefix /your/prefix --limit 100
# Count total keys.etcdctl --endpoints=localhost:3379 get --prefix / --count-onlySee also
Section titled “See also”- Operations guide — Restore from checkpoint, Point-in-time recovery workflow, Branching
- Failure scenarios — WAL and checkpoint corruption recovery
- API reference — RestorePoint, BranchPoint