Backup and Restore

This guide covers how to back up T4 data, how to restore a cluster after failure, and how to create zero-copy snapshots using branching.

How backups work

T4 writes checkpoints to S3 automatically. A checkpoint contains:

A Pebble database snapshot (SST files + metadata).
A manifest JSON file pointing to the checkpoint’s SST files and the revision at which it was taken.
An index JSON file with the term and revision.

Checkpoints are written:

Automatically on a configurable interval (CheckpointInterval, default 15 min).
Optionally after a configurable number of WAL entries (CheckpointEntries).
On leader promotion (startup checkpoint).

SST files are content-addressed: identical content is stored once, regardless of how many checkpoints reference it. This is what makes checkpoints cheap, because each checkpoint writes only the SSTs that are not already stored.
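To illustrate the deduplication effect of content addressing, here is a minimal sketch. It is not T4's implementation: it assumes the object key is a SHA-256 digest of the file content and uses a local directory in place of S3. The function name store_sst is hypothetical.

```shell
#!/bin/sh
# Sketch only: content-addressed storage, with a local directory standing in
# for S3. Assumption: the object key is the SHA-256 digest of the content.
# T4's real key scheme is not described in this guide.
# Note: sha256sum is part of GNU coreutils; on macOS, `shasum -a 256` is the
# usual substitute.

# store_sst STORE_DIR FILE
# Copies FILE into STORE_DIR under its digest unless an object with that
# digest already exists, then prints the digest.
store_sst() {
  store_dir=$1
  file=$2
  key=$(sha256sum "$file" | cut -d ' ' -f 1)
  if [ ! -e "$store_dir/$key" ]; then
    cp "$file" "$store_dir/$key"
  fi
  printf '%s\n' "$key"
}
```

Storing two files with identical bytes yields the same key and a single stored object, so a second checkpoint that references an unchanged SST adds no new data.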

No manual backup steps are needed. As long as the node can write to S3, backups are continuous and automatic.

Listing available checkpoints

t4 restore list \
  --s3-bucket my-bucket \
  --s3-prefix t4/

Output:

CHECKPOINT                                                                  REVISION    TERM
checkpoint/0000000001/00000000000000000050/manifest.json                          50       1
checkpoint/0000000001/00000000000000000100/manifest.json                         100       1  (latest)

The (latest) marker shows the checkpoint referenced by manifest/latest.
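For scripting, the key of the latest checkpoint can be pulled out of this listing. The helper below is a small sketch that assumes the column layout shown in the sample output (key first, "(latest)" as the last field of the tagged row); latest_checkpoint is a hypothetical name, not a t4 subcommand.

```shell
#!/bin/sh
# latest_checkpoint
# Reads `t4 restore list` output on stdin and prints the key of the row
# tagged "(latest)". Assumes the layout shown in the sample output above.
latest_checkpoint() {
  awk '$NF == "(latest)" { print $1 }'
}

# Usage (requires t4 and S3 access, so it is left as a comment):
#   t4 restore list --s3-bucket my-bucket --s3-prefix t4/ | latest_checkpoint
```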

Restoring the latest checkpoint

Use this to recover a node that lost its local disk, or to spin up a replacement node.

t4 restore checkpoint \
  --s3-bucket my-bucket \
  --s3-prefix t4/ \
  --data-dir /var/lib/t4-restored

Then start the node normally — it replays any WAL segments on S3 that are newer than the restored checkpoint before joining the cluster:

t4 run \
  --data-dir  /var/lib/t4-restored \
  --s3-bucket my-bucket \
  --s3-prefix t4/ \
  --listen    0.0.0.0:3379

Restoring a specific earlier checkpoint (point-in-time rollback)

Use this when recent writes caused data corruption and you want to roll back to a known-good revision.

# 1. List checkpoints to find the target revision.
t4 restore list --s3-bucket my-bucket --s3-prefix t4/

# 2. Download the target checkpoint to a fresh directory.
t4 restore checkpoint \
  --s3-bucket my-bucket \
  --s3-prefix t4/ \
  --checkpoint checkpoint/0000000001/00000000000000000050/manifest.json \
  --data-dir /var/lib/t4-pitr

# 3. Inspect the restored data offline, without starting a server.
t4 inspect meta --data-dir /var/lib/t4-pitr
t4 inspect list --data-dir /var/lib/t4-pitr --prefix /
t4 inspect history --data-dir /var/lib/t4-pitr /your/key

# 4. Optionally start a verification node (in a separate terminal) if you also want to query the data with etcd clients.
t4 run --data-dir /var/lib/t4-pitr --listen 0.0.0.0:3380
etcdctl --endpoints=localhost:3380 get --prefix /

# 5. When satisfied, stop any verification node from step 4, then promote to a new production prefix.
t4 run \
  --data-dir  /var/lib/t4-pitr \
  --s3-bucket my-bucket \
  --s3-prefix t4-recovered/ \
  --listen    0.0.0.0:3379

Caution: using the original S3 prefix in step 5 replays all WAL segments after revision 50 — this is recovery, not rollback. To stay at revision 50, use a fresh prefix as shown above.
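Choosing the target checkpoint in step 1 can be scripted. The sketch below picks the newest checkpoint at or before a given revision from the t4 restore list output. It assumes the three-column layout shown in the listing section; the function name checkpoint_at_or_before is hypothetical and not part of the t4 CLI.

```shell
#!/bin/sh
# checkpoint_at_or_before REVISION
# Reads `t4 restore list` output on stdin and prints the key of the newest
# checkpoint whose REVISION column is <= REVISION. Prints nothing if no
# checkpoint qualifies.
# Limitation: awk compares revisions as floating-point numbers, so values
# above 2^53 may not compare exactly.
checkpoint_at_or_before() {
  awk -v target="$1" '
    # Skip the header: only rows with a numeric second column are considered.
    $2 ~ /^[0-9]+$/ && $2 + 0 <= target + 0 && $2 + 0 >= best + 0 {
      best = $2
      key = $1
    }
    END { if (key != "") print key }
  '
}

# Usage (requires t4 and S3 access, so it is left as a comment):
#   t4 restore list --s3-bucket my-bucket --s3-prefix t4/ | checkpoint_at_or_before 75
```

With the sample listing shown earlier, a target of 75 selects the revision-50 checkpoint. The printed key can then be passed to --checkpoint in step 2.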

Restoring a multi-node cluster

When the entire cluster fails:

Restore the latest checkpoint to each node’s data directory (or start fresh — nodes recover automatically from S3 on startup).
Start all nodes pointing at the same S3 bucket and prefix.
They race to acquire the S3 leader lock. One wins, the others follow.

# On each node (same bucket+prefix, different data-dir and peer address):
t4 run \
  --data-dir  /var/lib/t4 \
  --s3-bucket my-bucket \
  --s3-prefix t4/ \
  --peer-listen 0.0.0.0:3380 \
  --advertise-peer <this-node-ip>:3380 \
  --listen 0.0.0.0:3379

No manual restore step is needed: if a node's local Pebble database is absent at startup, the node performs the same restore as t4 restore checkpoint before it joins the cluster.

Zero-copy branching (snapshot without download)

Branching lets you fork a database at a checkpoint without copying SST files in S3. The branch node reads inherited SSTs from the source prefix and writes its own new SSTs to a separate prefix.

# 1. Register a branch (prints checkpoint key).
t4 branch fork \
  --s3-bucket my-bucket \
  --s3-prefix t4/ \
  --branch-id staging

# 2. Start the branch node with the printed checkpoint key.
t4 run \
  --data-dir  /var/lib/t4-staging \
  --s3-bucket my-bucket \
  --s3-prefix t4-staging/ \
  --branch-prefix t4/ \
  --branch-checkpoint <key-from-step-1> \
  --listen 0.0.0.0:3379

# 3. When done, remove the branch registration so source GC can reclaim space.
t4 branch unfork \
  --s3-bucket my-bucket \
  --s3-prefix t4/ \
  --branch-id staging

Use cases: test environments, schema migrations, CI data fixtures, analytics read replicas.
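The branch's read path can be pictured as a two-level lookup. The sketch below is a conceptual illustration only, with local directories standing in for the branch and source S3 prefixes; how T4 actually locates inherited SSTs is not described in this guide, and resolve_sst is a hypothetical name.

```shell
#!/bin/sh
# resolve_sst BRANCH_DIR SOURCE_DIR NAME
# Prints the path to read NAME from: the branch's own directory if the branch
# wrote that file, otherwise the inherited copy in the source directory.
# Returns non-zero if the file exists in neither place.
resolve_sst() {
  if [ -e "$1/$3" ]; then
    printf '%s\n' "$1/$3"
  elif [ -e "$2/$3" ]; then
    printf '%s\n' "$2/$3"
  else
    return 1
  fi
}
```

Nothing is copied when the branch is created: inherited files are read in place, and only files written after the fork land in the branch's own location.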

Point-in-time restore using S3 versioning

If your S3 bucket has versioning enabled, you can restore to any point in time — not just to a checkpoint boundary.

t4 run \
  --data-dir  /var/lib/t4-pitr \
  --s3-bucket my-bucket \
  --s3-prefix t4/ \
  --restore-point-time "2024-06-01T12:00:00Z" \
  --listen 0.0.0.0:3380

See API reference — RestorePoint for the Go library equivalent.

Backup retention and GC

Old checkpoints and WAL segments accumulate in S3 unless explicitly pruned. Use t4 gc to remove objects outside a retention window:

t4 gc \
  --s3-bucket my-bucket \
  --s3-prefix t4/ \
  --keep 5

This keeps the 5 most recent checkpoints and removes data that is no longer needed in three passes:

Old checkpoint archives beyond the --keep window.
SST files exclusively referenced by the deleted checkpoints (orphans not needed by any surviving checkpoint or active branch).
WAL segments whose entire revision range is covered by the latest surviving checkpoint.
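Before running GC, it can help to preview which checkpoints fall outside the --keep window. The sketch below approximates the first pass only, working from t4 restore list output. It assumes that "most recent" follows the ordering of the zero-padded checkpoint keys, and it ignores branch pins, so a checkpoint it marks for deletion may in fact be retained. The name gc_preview is hypothetical.

```shell
#!/bin/sh
# gc_preview KEEP
# Reads `t4 restore list` output on stdin and prints "keep <key>" or
# "delete <key>" for each checkpoint, keeping the KEEP newest.
# Assumptions: keys sort oldest-to-newest lexicographically (they are
# zero-padded), and no checkpoint is pinned by a branch.
gc_preview() {
  awk '$2 ~ /^[0-9]+$/ { print $1 }' |
    LC_ALL=C sort |
    awk -v keep="$1" '
      { key[NR] = $0 }
      END {
        for (i = 1; i <= NR; i++) {
          action = (i > NR - keep) ? "keep" : "delete"
          print action, key[i]
        }
      }
    '
}

# Usage (requires t4 and S3 access, so it is left as a comment):
#   t4 restore list --s3-bucket my-bucket --s3-prefix t4/ | gc_preview 5
```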

Use t4 status first to see current counts:

t4 status --s3-bucket my-bucket --s3-prefix t4/

Branch safety: GC reads the branch registry before deleting. A checkpoint pinned by an active t4 branch fork is never deleted, nor are its SST files. Call t4 branch unfork only after the branch node is fully decommissioned, because once the registration is removed, GC may reclaim SSTs that the branch still reads.

See Storage management — Garbage collection for retention recommendations.

Verifying a restore

After restoring, verify that data is intact before promoting the node:

# Check the revision at which the node is running.
curl -s http://localhost:9090/metrics | grep t4_revision

# Or inspect the restored local data directory directly.
t4 inspect meta --data-dir /var/lib/t4-restored
t4 inspect count --data-dir /var/lib/t4-restored --prefix /

# Spot-check key data.
etcdctl --endpoints=localhost:3379 get --prefix /your/prefix --limit 100

# Count total keys.
etcdctl --endpoints=localhost:3379 get --prefix / --count-only
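To use the revision check in a script, the metric value has to be extracted from the metrics output. The helper below is a sketch that assumes the endpoint serves the Prometheus text format without timestamps and that the metric is named exactly t4_revision (the grep above only shows that the name starts with that string). The function name metric_value is hypothetical.

```shell
#!/bin/sh
# metric_value NAME
# Reads Prometheus text exposition on stdin and prints the value of the first
# sample whose metric name is exactly NAME. Labels are ignored.
# Assumption: samples carry no trailing timestamp, so the value is the last
# field on the line.
metric_value() {
  awk -v name="$1" '
    /^#/ { next }               # skip HELP and TYPE comment lines
    {
      metric = $1
      sub(/[{].*/, "", metric)  # strip any label set from the metric name
      if (metric == name) { print $NF; exit }
    }
  '
}

# Usage (requires a running node, so it is left as a comment):
#   curl -s http://localhost:9090/metrics | metric_value t4_revision
```

The printed value can be compared against the REVISION column of the checkpoint that was restored. Keep in mind that a node that has replayed WAL segments will report a higher revision than the checkpoint itself.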
