Operations Guide

Single-node with S3

The simplest durable deployment. A single node writes WAL segments and checkpoints to S3. If the node is replaced or its disk is lost, it recovers automatically on the next start.

t4 run \
  --data-dir  /var/lib/t4 \
  --listen    0.0.0.0:3379  \
  --s3-bucket my-bucket     \
  --s3-prefix t4/

AWS credentials are resolved from the standard chain: AWS_* environment variables, ~/.aws/credentials, instance profile (EC2/ECS), workload identity (EKS).

MinIO or other S3-compatible stores

t4 run \
  --data-dir    /var/lib/t4 \
  --listen      0.0.0.0:3379   \
  --s3-bucket   my-bucket      \
  --s3-prefix   t4/        \
  --s3-endpoint http://minio:9000

Multi-node cluster

Multi-node mode requires:

A shared S3 bucket (leader election lock + WAL archive).
Each node has a unique --node-id and a --peer-listen address reachable by all other nodes.

All nodes run the same command. At startup they race to acquire the S3 leader lock; the winner becomes the leader, the rest become followers.

Three-node example

# Node A
t4 run \
  --data-dir       /var/lib/t4     \
  --listen         0.0.0.0:3379        \
  --s3-bucket      my-bucket           \
  --s3-prefix      t4/             \
  --node-id        node-a              \
  --peer-listen    0.0.0.0:3380        \
  --advertise-peer node-a.internal:3380

# Node B
t4 run \
  --data-dir       /var/lib/t4     \
  --listen         0.0.0.0:3379        \
  --s3-bucket      my-bucket           \
  --s3-prefix      t4/             \
  --node-id        node-b              \
  --peer-listen    0.0.0.0:3380        \
  --advertise-peer node-b.internal:3380

# Node C — same as B, different --node-id and --advertise-peer

Leader election and failover

On startup each node reads the S3 lock. If absent, it issues an atomic conditional PUT (If-None-Match: *); only one concurrent writer wins. The winner becomes the leader and records LastSeenNano = now() in the lock so followers see it as immediately alive.
The leader streams WAL entries to all followers over the peer port (default 3380). Followers apply entries and serve local reads.
A follower that observes --follower-max-retries consecutive stream failures (~10 s at default 5 × 2 s) checks the lock’s LastSeenNano. If stale (older than LeaderLivenessTTL = 6 s), it attempts a takeover using If-Match: <etag> — only the candidate that read the same ETag wins the race. The new leader records its own address and LastSeenNano.
The former leader periodically re-reads the S3 lock (--leader-watch-interval-sec, default 300 s) and on every follower disconnect. Each check reads the lock with its ETag, then — if still the owner — writes a liveness touch using If-Match: <etag>. If the conditional touch is rejected (ErrPreconditionFailed), a new leader has taken over between the Read and the Touch: the old leader steps down immediately. This closes the Read→Touch split-brain race without a second round-trip.
Writes sent to a follower are automatically forwarded to the current leader and the result is returned to the caller.

Leader election uses atomic conditional PUT (If-None-Match/If-Match on the leader-lock object). There is no TTL polling — the only S3 election writes are at startup, on leader takeover, and during liveness touches while followers are disconnected. In cluster mode, writes additionally require quorum ACK from all connected followers before returning to the caller.

S3 operation budget

The default settings are useful for estimating monthly S3 request volume:

Setting	Default	Monthly count over 30 days
Month length	30 days	2,592,000 seconds
`SegmentMaxAge`	10 s	Up to 259,200 age-based WAL rotations/uploads with continuous writes
`SegmentMaxSize`	50 MB	Adds more WAL uploads if the workload writes 50 MB before 10 s elapses
`CheckpointInterval`	15 min	2,880 checkpoint cycles
`LeaderWatchInterval`	5 min	8,640 periodic leader lock reads
`FollowerRetryInterval`	2 s	1,296,000 liveness poll ticks if a follower is disconnected all month

In a healthy cluster, ordinary writes do not wait for S3. The steady-state S3 cost is mostly asynchronous WAL upload:

Source	Approximate monthly S3 operations
WAL archive	~259,200 PUTs for continuous low/medium write traffic, plus size-based rotations if WAL reaches 50 MB before 10 s
Periodic leader lock watch	~8,640 GETs
Checkpoints	2,880 checkpoint cycles; each writes several small objects plus any new SST files for changed data

Follower disconnects add incident traffic, not baseline traffic. While a follower is disconnected, the leader performs a fenced liveness check/touch every 2 seconds: one GET plus one conditional PUT per tick. A month-long disconnect would be about 1,296,000 GETs and 1,296,000 PUTs.

Single-node synchronous S3 durability has a different cost profile. With WALSyncUpload=true, each acknowledged AppendBatch is uploaded before returning, so the cost is close to one WAL PUT per write batch:

Write rate, no meaningful batching	Approximate WAL PUTs/month
1 write/s	2,592,000
10 writes/s	25,920,000
100 writes/s	259,200,000

If the local disk is durable and you can accept an upload window, WALSyncUpload=false moves single-node uploads to the segment rotation cadence: roughly the same ~259,200 WAL PUTs/month for continuous low/medium writes, plus extra PUTs when size-based rotation triggers sooner than 10 seconds.

Adding a node to a running cluster

Start a new node with a fresh --data-dir and the same S3 bucket. It will:

Read the S3 manifest and restore the latest checkpoint.
Replay any WAL segments uploaded since the checkpoint.
Lose the election (leader already holds the lock) and become a follower.
Receive the live WAL stream from the leader to catch up to the current revision.

No manual registration or cluster membership changes are required.

Scaling down

Close the nodes you want to remove. The remaining nodes continue without any configuration change. If the leader is among the removed nodes, a follower will take over.

mTLS between peers

Provide a shared CA and a per-node certificate/key pair (all PEM format):

t4 run \
  ... \
  --peer-tls-ca   /etc/t4/tls/ca.crt  \
  --peer-tls-cert /etc/t4/tls/node.crt \
  --peer-tls-key  /etc/t4/tls/node.key

Both the leader’s gRPC server and the follower’s gRPC client use these files. The same CA must be used on all nodes. TLS 1.3 is required; mutual authentication is enforced.

Embedded library

Pass credentials.TransportCredentials directly:

serverCreds, clientCreds, err := buildTLS(caFile, certFile, keyFile)

node, err := t4.Open(t4.Config{
...
PeerServerTLS: serverCreds,
PeerClientTLS: clientCreds,
})

Client TLS

By default the etcd gRPC port (3379) is plaintext. Enable TLS to encrypt traffic between clients and the server.

Server-only TLS (encryption, no client cert required)

t4 run \
  ... \
  --client-tls-cert /etc/t4/tls/server.crt \
  --client-tls-key  /etc/t4/tls/server.key

Clients connect with TLS but are not required to present a certificate. Use this when clients are etcd-compatible tools or libraries that support TLS but not mTLS.

etcdctl --endpoints=https://localhost:3379 \
        --cacert /etc/t4/tls/ca.crt \
        put /hello world

Mutual TLS (mTLS, client cert required)

Add --client-tls-ca to require clients to present a certificate signed by the given CA:

t4 run \
  ... \
  --client-tls-cert /etc/t4/tls/server.crt \
  --client-tls-key  /etc/t4/tls/server.key \
  --client-tls-ca   /etc/t4/tls/ca.crt

etcdctl --endpoints=https://localhost:3379 \
        --cacert  /etc/t4/tls/ca.crt  \
        --cert    /etc/t4/tls/client.crt \
        --key     /etc/t4/tls/client.key \
        put /hello world

Client TLS and peer mTLS are independent — each uses its own cert/key/CA and can be enabled or disabled separately.

Authentication and RBAC

T4 implements the etcd v3 Auth API: username/password authentication with bearer tokens, and role-based access control scoped to key prefixes. Auth state (users, roles, enabled flag) is stored in Pebble and flows through the WAL, so it is replicated to followers and included in S3 checkpoints. Bearer tokens are persisted to Pebble and survive node restarts — clients do not need to re-authenticate after a restart.

Enable auth with --auth-enabled:

t4 run \
  ... \
  --auth-enabled \
  --token-ttl 300   # bearer token lifetime in seconds (default: 300)

Initial setup

Auth cannot be enabled unless a root user exists. Bootstrap with etcdctl:

ETCDCTL_API=3 etcdctl --endpoints=localhost:3379 user add root
# Enter password at prompt

ETCDCTL_API=3 etcdctl --endpoints=localhost:3379 auth enable

Once enabled, all KV and Watch requests require a valid bearer token. The root user has unconditional access to all keys via the built-in root role.

Note: The root user and root role cannot be deleted while auth is enabled.

Authenticating

ETCDCTL_API=3 etcdctl --endpoints=localhost:3379 \
  --user root:yourpassword \
  put /hello world

The etcd client library handles token acquisition and refresh automatically when --user is provided. Tokens expire after --token-ttl seconds; the client re-authenticates transparently.

Managing users

# Create a user
etcdctl --endpoints=localhost:3379 --user root:pass user add alice

# List users
etcdctl --endpoints=localhost:3379 --user root:pass user list

# Delete a user
etcdctl --endpoints=localhost:3379 --user root:pass user delete alice

# Change password
etcdctl --endpoints=localhost:3379 --user root:pass user passwd alice

Managing roles

# Create a role
etcdctl --endpoints=localhost:3379 --user root:pass role add reader

# Grant read access to a key prefix
etcdctl --endpoints=localhost:3379 --user root:pass \
  role grant-permission reader read /data/ --prefix

# Grant write access to a specific key
etcdctl --endpoints=localhost:3379 --user root:pass \
  role grant-permission reader write /config/app

# Grant read+write access to a prefix
etcdctl --endpoints=localhost:3379 --user root:pass \
  role grant-permission writer readwrite /app/ --prefix

# Revoke a permission
etcdctl --endpoints=localhost:3379 --user root:pass \
  role revoke-permission reader /data/ --prefix

# List roles
etcdctl --endpoints=localhost:3379 --user root:pass role list

# Inspect a role's permissions
etcdctl --endpoints=localhost:3379 --user root:pass role get reader

# Delete a role
etcdctl --endpoints=localhost:3379 --user root:pass role delete reader

Assigning roles to users

# Grant a role
etcdctl --endpoints=localhost:3379 --user root:pass \
  user grant-role alice reader

# Revoke a role
etcdctl --endpoints=localhost:3379 --user root:pass \
  user revoke-role alice reader

# List a user's roles
etcdctl --endpoints=localhost:3379 --user root:pass user get alice

RBAC rule evaluation

A request is permitted when the authenticated user has at least one role whose permissions cover the requested key and operation type:

Operation	Required permission
`Range` (Get / List)	`read`
`Put`	`write`
`DeleteRange`	`write`
`Txn`	`write`
`Watch`	`read`

A permission entry covers a key when:

Exact key (--prefix omitted): the key matches exactly.
Prefix range (--prefix): the key starts with the permission’s key prefix (computed as rangeEnd = prefix[:-1] + chr(ord(prefix[-1])+1)).
Open-ended range (rangeEnd = "\x00"): all keys ≥ the permission key.

The root role always passes all checks regardless of the key.

Auth namespace protection

Keys under the \x00auth/ prefix are reserved for internal auth storage. Access to these keys via the KV service is blocked for all users, including root. Attempting to read or write them returns PermissionDenied.

Rate limiting

To protect against brute-force attacks, T4 enforces a per-username rate limit on failed authentication attempts:

5 consecutive failures within a 5-minute window triggers a 15-minute lockout for that username.
Subsequent Authenticate calls during the lockout period return an error without checking the password.
The lockout state is in-memory only and resets on node restart (intentional: a restart is already a privileged operation).
All authentication outcomes are recorded in the t4_auth_attempts_total metric with a result label (success, fail, locked).

Disabling auth

etcdctl --endpoints=localhost:3379 --user root:pass auth disable

Or restart the node without --auth-enabled. Auth state (users, roles) is preserved in Pebble — re-enabling auth later restores the same configuration.

Full example: read-only service account

# 1. Create the role with read access to /config/
etcdctl --user root:pass role add config-reader
etcdctl --user root:pass role grant-permission config-reader read /config/ --prefix

# 2. Create the user and assign the role
etcdctl --user root:pass user add svc-account
etcdctl --user root:pass user grant-role svc-account config-reader

# 3. The service account can read /config/ but not write
etcdctl --user svc-account:pass get /config/timeout   # OK
etcdctl --user svc-account:pass put /config/timeout 60s  # PermissionDenied
etcdctl --user svc-account:pass get /secrets/key         # PermissionDenied

Observability

t4 run --metrics-addr 0.0.0.0:9090 ...

Endpoints

Path	Description
`GET /metrics`	Prometheus metrics
`GET /healthz`	200 once the node has started
`GET /readyz`	200 when the node is ready to serve reads

Inspecting S3 storage state

t4 status reads directly from S3 (no running node required) and prints the current checkpoint, object counts, and any registered branch forks:

t4 status \
  --s3-bucket my-bucket \
  --s3-prefix t4/

S3 status  s3://my-bucket/t4/

Latest checkpoint
  key:       checkpoint/0000000001/00000000000000000100/manifest.json
  revision:  100
  term:      1

Storage objects
  checkpoints: 7
  WAL segments: 43

Branch forks
  (none)

Use this to confirm a node is checkpointing regularly and to estimate how much storage GC will reclaim.

Grafana dashboard

A pre-built Grafana dashboard is available for download.

To import it:

In Grafana, go to Dashboards → Import.
Upload the JSON file or paste its contents.
Select your Prometheus datasource when prompted.
Set the job variable to match the scrape job name for your T4 instances (default: t4).

The dashboard contains six sections:

Section	Panels
Cluster Health	Leader count (split-brain indicator), current revision, node roles, max follower lag, elections/hr, resyncs/hr
Write Performance	Throughput by op type, error rate, p50/p95/p99 write latency
Watch Performance	Active watches/prefixes, p50/p95/p99 watch scan latency, revision span, scanned vs matched log entries
Replication	Per-follower lag over time, forwarded write rate, forward round-trip latency
WAL & Checkpoints	Upload rate, upload errors, upload duration, checkpoint frequency
Object Store (S3)	Op rate by type, error rate, p50/p95/p99 latency

Prometheus metrics

Metric	Type	Labels	Description
`t4_writes_total`	counter	`op`	Completed write operations
`t4_write_errors_total`	counter	`op`	Write operations that returned an error
`t4_write_duration_seconds`	histogram	`op`	Write latency (WAL + apply)
`t4_forwarded_writes_total`	counter	`op`	Writes forwarded from follower to leader
`t4_forward_duration_seconds`	histogram	`op`	Forwarded write round-trip latency
`t4_current_revision`	gauge	—	Latest applied revision
`t4_compact_revision`	gauge	—	Compaction watermark
`t4_role`	gauge	`role`	1 for the active role (`leader`/`follower`/`single`)
`t4_wal_uploads_total`	counter	—	WAL segments successfully uploaded
`t4_wal_upload_errors_total`	counter	—	Failed WAL segment uploads
`t4_wal_upload_duration_seconds`	histogram	—	WAL segment upload latency
`t4_wal_gc_segments_total`	counter	—	WAL segments deleted from S3 after checkpointing
`t4_checkpoints_total`	counter	—	Checkpoints written to S3
`t4_elections_total`	counter	`outcome`	Election attempts (`won`/`lost`)
`t4_follower_resyncs_total`	counter	`reason`	Full resync events triggered on followers (`behind_leader_start` / `ring_buffer_miss` / `stream_gap`)
`t4_auth_attempts_total`	counter	`result`	Authentication attempts (`success` / `fail` / `locked`)
`t4_watch_active`	gauge	—	Currently active watch subscriptions
`t4_watch_active_prefixes`	gauge	—	Distinct prefixes with active watch subscriptions
`t4_watch_scan_duration_seconds`	histogram	—	Time spent scanning revision logs for watch delivery
`t4_watch_scan_revision_span`	histogram	—	Revision span covered by each watch scan
`t4_watch_scan_entries_total`	counter	`result`	Watch scan entries by result (`scanned` / `matched`)

op label values: put, create, update, delete, compact.

Performance

Numbers are from go test -bench=. -benchtime=5s on an Apple M4 Pro (12 cores, NVMe SSD). All tests use in-process loopback — no real network or S3.

Single-node (no peers, no S3)

Write latency is dominated by a single WAL fsync (~8 ms on NVMe). Concurrent writers are automatically batched by the commitLoop into a single fsync per drain cycle (group commit).

Operation	Throughput	Latency
`Put` (serial)	~123 writes/s	8.1 ms
`Put` (12 concurrent writers)	~750 writes/s	1.3 ms avg
`Put` (192 concurrent writers)	~11,600 writes/s	86 µs avg
`Get` / `LinearizableGet` (leader)	~2,300,000 reads/s	0.43 µs
`List` (100 keys)	~27,900 ops/s	36 µs

3-node cluster (localhost loopback)

Write latency = leader WAL fsync + quorum ACK round-trip (follower WAL fsync + network). On loopback, both nodes share the same SSD so each write costs roughly two sequential fsyncs (~16 ms).

Operation	Throughput	Latency
`Put` (serial)	~43 writes/s	23 ms
`Put` (12 concurrent writers)	~224 writes/s	4.5 ms avg
`LinearizableGet` (follower)	~18,100 reads/s	55 µs

With group commit, the per-write overhead of the quorum ACK round-trip disappears almost entirely under load — 12 concurrent writers improve from 43 to 224 writes/s by batching many writes into one ACK round.

Impact of real-world latency

Write latency scales with inter-node RTT and S3 latency (single-node only):

Scenario	Additional latency	Notes
Cluster, same-host loopback	+15 ms	loopback gRPC + follower fsync
Cluster, LAN (1 ms RTT)	+9 ms	≈ follower fsync + 2× 0.5 ms network
Cluster, cross-AZ (5 ms RTT)	+18 ms	≈ follower fsync + 2× 5 ms network
Cluster, cross-region (50 ms RTT)	+108 ms	high-latency links hurt serial throughput most
Single-node, S3 upload	+100–500 ms	sync upload per WAL segment — use cluster mode for low latency

In cluster mode, S3 uploads are async (disaster-recovery only) and add zero latency to the write path. Single-node mode uploads each WAL segment to S3 synchronously; write latency is dominated by S3 round-trip, not local fsync. For low-latency single-node deployments without S3, latency is entirely local disk (~8 ms NVMe).

Read latency on a follower includes one ForwardGetRevision gRPC call to the leader to obtain the current revision, then a local Pebble lookup. On localhost this costs ~55 µs; on LAN expect ~1–2 ms; on cross-AZ ~10 ms.

Durability and recovery

What is durable

A write is durable when it has been:

fsynced to the leader’s WAL and ACKed by all connected followers (cluster mode) — the entry exists on at least two nodes’ WALs before the caller sees success, or
fsynced to the local WAL and the WAL segment has been uploaded to S3 (single-node mode).

In cluster mode S3 is disaster-recovery only (both nodes fail simultaneously). WAL uploads are fully async and do not affect write latency. In single-node mode without S3, durability depends entirely on local disk.

Recovery procedure

On startup, T4 always performs:

Read manifest/latest from S3 → get the latest checkpoint key and revision.
If the local Pebble database is absent, restore the checkpoint from S3.
Open the local Pebble database.
Replay all local WAL segments (.wal files in <data-dir>/wal/) that are newer than the checkpoint.
Replay any WAL segments uploaded to S3 that are newer than the checkpoint and not already replayed locally.
Run leader election (cluster mode) or become single-node.

Steps 4–5 ensure that no committed write is lost even if the node is killed between WAL writes and checkpoint creation.

S3 unavailability

In cluster mode, S3 uploads are fully async — WAL segments and checkpoints are uploaded in the background without blocking writes. In single-node mode, each WAL segment is uploaded to S3 synchronously before the write is acknowledged. In both modes, on restart local WAL segments are replayed first, so no data written to the local WAL is lost even if it was never uploaded to S3.

Storage management

Local database size

As a sizing reference, a local Pebble database with 1,000,000 live keys and 256-byte varied values measured about 317 MiB on disk after compaction (roughly 332 bytes per key). This includes T4’s live key index and revision log entries. The test used keys like /size/000000001, so longer keys add directly to the total.

For the same 256-byte value size:

Live keys	Local DB size	Basis
1 million	~317 MiB	measured
10 million	~3.0 GiB	measured
100 million	~30 GiB	extrapolated
1 billion	~300 GiB	extrapolated

For 1,000,000 live keys:

Value size	Local DB size	Bytes per key
64 B	~95 MiB	~100 B
128 B	~159 MiB	~167 B
256 B	~317 MiB	~332 B
1 KiB	~1.04 GiB	~1,118 B
4 KiB	~4.0 GiB	~4,279 B

Actual size varies with key length, value compressibility, update/delete history, and Pebble compaction state. S3 stores the same logical database as checkpoint SSTs plus retained WAL segments; a single latest checkpoint is usually in the same range as the compacted local DB, while total S3 usage depends on checkpoint retention, WAL retention, and branch-pinned SSTs.

Garbage collection

Old checkpoints and WAL segments accumulate in S3 unless explicitly pruned. Run t4 gc periodically (e.g. daily via cron) to reclaim storage:

t4 gc \
  --s3-bucket my-bucket \
  --s3-prefix t4/ \
  --keep 3

--keep (default: 3) sets how many of the most recent checkpoints to retain. The command performs three passes in order:

Checkpoint GC — deletes old checkpoint archives beyond the --keep window.
Orphan SST GC — deletes SST files exclusively referenced by the deleted checkpoints.
WAL segment GC — deletes WAL segments whose entire revision range is covered by the latest surviving checkpoint.

Output:

GC complete
  checkpoints deleted: 4
  orphan SSTs deleted: 31
  WAL segments deleted: 18

Branch safety

Before deleting any checkpoint, t4 gc reads all active branch registrations. Any checkpoint pinned by an active branch is skipped unconditionally — even if it falls outside the --keep window. The SSTs it references are also excluded from orphan deletion.

Call t4 branch fork before running GC on the source.
Call t4 branch unfork only after the branch node is fully decommissioned.

Branching

Branches let you fork a database at any checkpoint with zero S3 data copies. SST files are content-addressed and shared between the source and all branches — no data is duplicated.

Requirements

S3 versioning is not required.
The source database must have at least one checkpoint.

Creating a branch (CLI)

# 1. Register the branch against the source store.
#    Prints the checkpoint key — save it.
t4 branch fork \
  --s3-bucket my-bucket \
  --s3-prefix t4/ \
  --branch-id my-branch

# Output: checkpoint/0000000001/00000000000000000100/manifest.json

# 2. Start the branch node, pointing it at the source.
t4 run \
  --data-dir          /var/lib/t4-branch \
  --listen            0.0.0.0:3379 \
  --s3-bucket         my-bucket \
  --s3-prefix         t4-branch/ \
  --branch-prefix     t4/ \
  --branch-checkpoint checkpoint/0000000001/00000000000000000100/manifest.json

On first boot the branch node downloads SSTs and Pebble metadata from the source prefix. On subsequent restarts --branch-checkpoint is ignored (the local data directory already exists).

Creating a branch (Go library)

import "github.com/t4db/t4"
import "github.com/t4db/t4/pkg/object"

sourceStore := object.NewS3Store(object.S3Config{Bucket: "my-bucket", Prefix: "t4/"})
branchStore := object.NewS3Store(object.S3Config{Bucket: "my-bucket", Prefix: "t4-branch/"})

// Register and get the checkpoint key.
cpKey, err := t4.Fork(ctx, sourceStore, "my-branch")
if err != nil {
log.Fatal(err)
}

// Start the branch node.
node, err := t4.Open(t4.Config{
DataDir:       "/var/lib/t4-branch",
ObjectStore:   branchStore,
AncestorStore: sourceStore,
BranchPoint: &t4.BranchPoint{
SourceStore:   sourceStore,
CheckpointKey: cpKey,
},
})

Forking from a specific checkpoint

By default Fork uses the latest checkpoint. To fork from an earlier revision, call checkpoint.RegisterBranch directly with the specific key:

# CLI
t4 branch fork \
  --s3-bucket my-bucket --s3-prefix t4/ \
  --branch-id my-branch \
  --checkpoint checkpoint/0000000001/00000000000000000050/manifest.json

// Go — use the internal package directly for a specific key
import "github.com/t4db/t4/internal/checkpoint"

cpKey := "checkpoint/0000000001/00000000000000000050/manifest.json"
if err := checkpoint.RegisterBranch(ctx, sourceStore, "my-branch", cpKey); err != nil {
log.Fatal(err)
}

### Removing a branch

When the branch is no longer needed, unregister it so the source's GC can reclaim unused SSTs:

```bash
t4 branch unfork \
  --s3-bucket my-bucket \
  --s3-prefix t4/ \
  --branch-id my-branch

if err := t4.Unfork(ctx, sourceStore, "my-branch"); err != nil {
log.Fatal(err)
}

Use cases

Point-in-time recovery — fork from a checkpoint taken before a bad write, validate, then promote.

Blue/green migrations — run a schema migration against a branch with production data, test it, then cut over.

DR drills — spin up a replica in a different region from a fork, verify integrity, then shut it down.

Parallel testing — fork the same production snapshot for multiple independent test runs.

Point-in-time restore (S3 versioning)

Note: this mechanism requires S3 versioning to be enabled on the bucket. For most use cases, Branching is simpler and does not require versioning.

RestorePoint bootstraps a new node from a specific set of S3 object version IDs captured at a past moment. See api.md — Point-in-time restore for the Go API.

Requirements

S3 versioning must be enabled on the bucket before the first write.

Capturing a restore point

# Find the current checkpoint key.
aws s3 cp s3://my-bucket/source-prefix/manifest/latest - | jq .

# Download that checkpoint index and record version IDs for the index,
# every listed sst_files object, and every checkpoint-local pebble_meta file.
aws s3api list-object-versions \
  --bucket my-bucket \
  --prefix source-prefix/checkpoint/ \
  --query 'Versions[?IsLatest==`true`].[Key,VersionId]' \
  --output json

aws s3api list-object-versions \
  --bucket my-bucket \
  --prefix source-prefix/sst/ \
  --query 'Versions[?IsLatest==`true`].[Key,VersionId]' \
  --output json

# List WAL segments and their version IDs.
aws s3api list-object-versions \
  --bucket my-bucket \
  --prefix source-prefix/wal/ \
  --query 'Versions[?IsLatest==`true`].[Key,VersionId]' \
  --output json

The checkpoint index alone is not a complete restore point. Include the versions for all SST and Pebble metadata objects referenced by the index, otherwise source checkpoint GC can delete live metadata before the restore runs.