# Troubleshooting

## Diagnosing a node
### Check health and readiness
```sh
curl http://localhost:9090/healthz   # 200 = started
curl http://localhost:9090/readyz    # 200 = ready to serve reads
```

### Inspect Prometheus metrics
```sh
curl -s http://localhost:9090/metrics | grep t4_
```

Key metrics to check:
| Metric | What it means |
|---|---|
| `t4_role{role="leader"}` | 1 if this node is the leader |
| `t4_current_revision` | Highest applied revision |
| `t4_wal_upload_errors_total` | Failed S3 WAL uploads (should be 0) |
| `t4_elections_total{outcome="won"}` | How many elections this node has won |
| `t4_write_errors_total` | Write errors by operation type |
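If you script against the metrics endpoint, the Prometheus text exposition format is simple enough to parse directly. A minimal Go sketch (the `parseMetrics` helper and the sample values are illustrative, not part of T4):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseMetrics extracts metric name -> value from Prometheus text
// exposition lines, skipping comments and blank lines. Labels stay
// part of the key (e.g. `t4_role{role="leader"}`).
func parseMetrics(body string) map[string]float64 {
	out := make(map[string]float64)
	for _, line := range strings.Split(body, "\n") {
		line = strings.TrimSpace(line)
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		i := strings.LastIndex(line, " ")
		if i < 0 {
			continue
		}
		v, err := strconv.ParseFloat(line[i+1:], 64)
		if err != nil {
			continue
		}
		out[line[:i]] = v
	}
	return out
}

func main() {
	sample := `# HELP t4_current_revision Highest applied revision
t4_current_revision 1042
t4_role{role="leader"} 1
t4_wal_upload_errors_total 0`
	m := parseMetrics(sample)
	fmt.Println(m["t4_current_revision"])   // 1042
	fmt.Println(m[`t4_role{role="leader"}`]) // 1
}
```

Pipe the output of `curl -s http://localhost:9090/metrics` into a program like this to alert on `t4_wal_upload_errors_total` going nonzero.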
### Enable debug logging

```sh
t4 run ... --log-level debug
```

Trace-level logging prints every WAL entry, follower ACK, and S3 operation — useful for diagnosing election or replication issues but very verbose in production.
## Startup issues

### failed to read S3 manifest

The node can’t reach S3 on startup.
Check:

- S3 credentials (`T4_S3_ACCESS_KEY_ID`, `T4_S3_SECRET_ACCESS_KEY`, or IRSA)
- Bucket name and prefix (`--s3-bucket`, `--s3-prefix`)
- Endpoint URL (`--s3-endpoint` for MinIO/Ceph)
- Network connectivity from the node to S3
```sh
aws s3 ls s3://my-bucket/t4/manifest/
```

### data directory not empty but no local manifest

You pointed a new node at a data directory that contains partial or corrupted state.
Fix: either clear the data directory and let the node restore from S3, or investigate what’s in it:
```sh
ls -la /var/lib/t4/
rm -rf /var/lib/t4/   # wipe and restore from S3 on next start
```

### Node starts but never becomes ready (`/readyz` returns 503)

The node is stuck in WAL replay or waiting for leader election.
Check:

- S3 connectivity (WAL replay reads old segments from S3)
- If multi-node: are the other nodes running? A follower won’t become ready until it can sync with the leader.
- Disk space: WAL replay writes to local Pebble; if the disk is full, replay will fail.
## Write errors

### ErrNotLeader

A write was sent to a follower and the leader is unreachable for forwarding.
Causes:
- Leader is down; a new election is in progress (wait a few seconds)
- Network partition between follower and leader
- Firewall blocking the peer port (default 3380)
Check:

```sh
curl http://leader-node:9090/healthz
etcdctl --endpoints=follower:3379 endpoint status
```

### ErrKeyExists (from Create)

The key already exists. This is expected behavior for Create — use Put if you want an unconditional write, or Update if you want compare-and-swap.
### Writes stall / very high latency (single-node with S3)

`WALSyncUpload=true` (the default) blocks each write until the WAL segment is uploaded to S3. If S3 is slow or unreachable, writes stall.
Fix options:

- Set `--wal-sync-upload=false` if your local storage is durable (e.g. a persistent volume)
- Check S3 latency: `aws s3 cp /dev/urandom s3://my-bucket/test --expected-size 1 2>&1 | tail -1`
- Use a cluster — in cluster mode the leader always uses async S3 uploads
### Writes stall in cluster mode (high replication latency)

The leader waits for follower ACKs before returning. High inter-node RTT or a slow follower disk increases write latency.
Check:

```sh
# Follower applied revision vs leader current revision
curl -s http://leader:9090/metrics | grep t4_current_revision
curl -s http://follower:9090/metrics | grep t4_current_revision
```

A large gap means the follower is behind. Check:

- Network RTT between nodes
- Follower disk throughput (WAL fsync)
- `t4_wal_upload_errors_total` on followers
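Turning the two scraped `t4_current_revision` values into a lag check is a small amount of code. A sketch (the helper names and the alert threshold are illustrative):

```go
package main

import "fmt"

// replicationLag is how many revisions a follower is behind the
// leader, from t4_current_revision scraped on each node.
func replicationLag(leaderRev, followerRev int64) int64 {
	if followerRev >= leaderRev {
		return 0 // follower caught up (or scraped slightly later)
	}
	return leaderRev - followerRev
}

// behind applies an alerting threshold: a small transient gap is
// normal, a large persistent one means the follower can't keep up.
func behind(leaderRev, followerRev, maxLag int64) bool {
	return replicationLag(leaderRev, followerRev) > maxLag
}

func main() {
	fmt.Println(replicationLag(1042, 1040)) // 2
	fmt.Println(behind(1042, 1040, 100))    // false: transient gap
	fmt.Println(behind(5042, 1040, 100))    // true: follower far behind
}
```

Alert on the gap staying above the threshold across several scrapes, not on a single sample.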
## Election and leadership issues

### Leader keeps changing (frequent elections)

Symptom: the `t4_elections_total` counter is incrementing rapidly.
Causes:

- `LeaderLivenessTTL` is too short relative to `FollowerRetryInterval`
- Intermittent S3 connectivity causing leader lock reads to fail
- Network flaps between leader and followers
Check:

```sh
curl -s http://any-node:9090/metrics | grep t4_elections_total
```

Fix:
- Increase `--follower-max-retries` (default 5) to tolerate transient failures
- Check S3 latency and error rates
- Check network stability between nodes
### Follower can’t reach leader (peer connection refused)

```
failed to connect to peer: dial tcp node-a:3380: connection refused
```

Check:
- `--peer-listen` is set on the leader
- `--advertise-peer` is reachable from the follower (not `0.0.0.0`)
- Firewall rules allow traffic on port 3380
- TLS config matches (both or neither have `--peer-tls-*`)
### Split-brain suspected

T4 prevents split-brain via conditional S3 PUT. If you suspect two nodes both believe they are the leader:

```sh
# Read the current leader lock from S3
aws s3 cp s3://my-bucket/t4/leader-lock - | jq .
```

The lock contains the current leader’s address and `LastSeenNano`. Only one node can own the lock at a time — the one whose conditional PUT succeeded last.
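The mechanism is easy to model: a conditional PUT succeeds only if the caller's precondition still matches the stored version, so exactly one contender can win each round. A toy in-memory version (not T4's actual lock code — S3 enforces the precondition server-side):

```go
package main

import (
	"fmt"
	"sync"
)

// condStore mimics a conditional PUT (If-Match on a version/ETag):
// the write is rejected unless the caller saw the current version.
type condStore struct {
	mu      sync.Mutex
	version int64
	owner   string
}

// acquire tries to take the lock at the version the caller last read.
func (s *condStore) acquire(node string, sawVersion int64) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if sawVersion != s.version {
		return false // precondition failed: someone else won first
	}
	s.version++
	s.owner = node
	return true
}

func main() {
	s := &condStore{}
	// Both candidates read version 0, then race to PUT.
	fmt.Println(s.acquire("node-a", 0)) // true: node-a wins
	fmt.Println(s.acquire("node-b", 0)) // false: stale precondition
	fmt.Println(s.owner)                // node-a
}
```

The loser's PUT fails atomically at the store, which is why two simultaneous owners are impossible as long as every acquisition goes through the conditional write.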
## WAL and checkpoint issues

### WAL segment upload failed

S3 uploads are failing. Writes will stall in single-node sync mode.
Check:

```sh
curl -s http://localhost:9090/metrics | grep t4_wal_upload_errors_total
```

Common causes:
- S3 bucket does not exist or wrong region
- Credentials expired (IRSA token, static key)
- S3 request throttling (check S3 CloudWatch metrics)
### checkpoint failed: S3 upload error

Checkpoints are failing. WAL segments won’t be GC’d, and S3 usage will grow without bound.

Fix: same as WAL upload failures — resolve S3 access. WAL segments are preserved until the next successful checkpoint, so no data is lost.
### Disk full

WAL segments accumulate on disk until they’re uploaded to S3 and a checkpoint is written. If S3 is unreachable for a long time, the local disk fills up.

Check:

```sh
du -sh /var/lib/t4/wal/
ls /var/lib/t4/wal/ | wc -l
```

Fix: restore S3 connectivity. Once uploads resume, segments are deleted after the next checkpoint.
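The retention rule described here can be stated as a predicate: a segment is safe to delete locally only once it is both uploaded to S3 and fully covered by the last checkpoint. A sketch with illustrative types (not T4's actual GC code):

```go
package main

import "fmt"

type segment struct {
	lastRev  int64 // highest revision contained in the segment
	uploaded bool  // segment is present in S3
}

// deletable returns the indexes of segments safe to remove locally:
// uploaded to S3 AND at or below the checkpoint revision.
func deletable(segs []segment, checkpointRev int64) []int {
	var idx []int
	for i, s := range segs {
		if s.uploaded && s.lastRev <= checkpointRev {
			idx = append(idx, i)
		}
	}
	return idx
}

func main() {
	segs := []segment{
		{lastRev: 100, uploaded: true},
		{lastRev: 200, uploaded: true},
		{lastRev: 300, uploaded: false}, // S3 unreachable: stuck on disk
	}
	fmt.Println(deletable(segs, 250)) // [0 1]
}
```

This is why a long S3 outage fills the disk: the `uploaded` half of the predicate stays false, so nothing ever qualifies for deletion.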
## Data and consistency issues

### Follower returns stale data

By default, follower reads use the Linearizable consistency mode — the follower syncs to the leader’s current revision before serving the read. If you’re seeing stale reads, the follower may be configured with Serializable consistency.
Check:

```sh
t4 run ... --read-consistency linearizable
```

Or in Go:

```go
node, _ := t4.Open(t4.Config{
	ReadConsistency: t4.ReadConsistencyLinearizable,
})
```

### Key was written but Get returns nil
Section titled “Key was written but Get returns nil”- The key may have been compacted. Check
t4_compact_revisionmetric vs the revision of the write. - If using a follower with
Serializableconsistency, the read may be behind the write. - If using
Watch, the event may have already been processed before theGet.
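The compaction check in the first bullet is a plain comparison between the write's revision and the scraped `t4_compact_revision`. An illustrative helper (not a T4 API):

```go
package main

import "fmt"

// maybeCompacted reports whether compaction could explain a missing
// read: revisions at or below t4_compact_revision may have been
// garbage-collected and are no longer individually retrievable.
func maybeCompacted(writeRev, compactRev int64) bool {
	return writeRev <= compactRev
}

func main() {
	fmt.Println(maybeCompacted(900, 1000))  // true: write predates compaction
	fmt.Println(maybeCompacted(1100, 1000)) // false: look at consistency mode instead
}
```

If the comparison comes back false, compaction is ruled out and the remaining two bullets (read consistency, Watch ordering) are the likelier causes.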
### Lost write after node restart

If `WALSyncUpload=false` and the node crashed before the pending WAL segment was uploaded, that segment is lost.

Prevention: reduce `CheckpointInterval` to shrink the window of un-uploaded WAL, or enable `WALSyncUpload=true` for single-node deployments without durable local storage.
## Performance issues

### Low write throughput (single writer)

T4’s reactive group-commit is optimized for concurrent writers. A single serial writer pays the full WAL fsync cost per write (~8 ms on NVMe).
Options:

- Batch writes at the application layer
- Use `Put` in parallel goroutines — throughput scales well with concurrency
- If using S3 sync upload, switch to `--wal-sync-upload=false` with a durable local disk
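Application-layer batching from the first option can be as simple as coalescing writes until a size threshold and then issuing one bulk write, so one fsync covers many keys. A sketch (the `batcher` type and its `flush` callback are illustrative, not a T4 API):

```go
package main

import "fmt"

// batcher coalesces writes so one flush (and hence one WAL fsync)
// covers many keys. flush stands in for whatever bulk write the
// application performs against the store.
type batcher struct {
	buf   []string
	limit int
	flush func([]string)
}

// add buffers one write and flushes when the batch is full.
func (b *batcher) add(kv string) {
	b.buf = append(b.buf, kv)
	if len(b.buf) >= b.limit {
		b.flushNow()
	}
}

// flushNow force-flushes a partial batch (call on shutdown or a timer).
func (b *batcher) flushNow() {
	if len(b.buf) == 0 {
		return
	}
	b.flush(b.buf)
	b.buf = nil
}

func main() {
	flushes := 0
	b := &batcher{limit: 4, flush: func(kvs []string) { flushes++ }}
	for i := 0; i < 10; i++ {
		b.add(fmt.Sprintf("k%d=v", i))
	}
	b.flushNow()          // flush the tail of 2
	fmt.Println(flushes)  // 3: two full batches of 4 plus the tail
}
```

In production you would also flush on a short timer so a half-full batch never waits indefinitely.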
See Benchmarks for a detailed performance analysis.
### High read latency on followers

Linearizable follower reads require a `ForwardGetRevision` RPC to the leader. Latency = leader RTT + local Pebble lookup.

For workloads that tolerate slightly stale reads, switch to serializable:

```sh
t4 run ... --read-consistency serializable
```

### Memory usage growing
T4 buffers recent WAL entries in memory for follower catch-up (`PeerBufferSize`, default 10,000 entries). If entries are large, this can use significant memory.

Reduce it:

```go
node, _ := t4.Open(t4.Config{
	PeerBufferSize: 1000,
})
```

Trade-off: a smaller buffer means followers that fall behind must restore from the S3 checkpoint instead of replaying from memory.
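The trade-off can be modeled with the buffer's revision window: a follower whose revision has fallen out of the window cannot replay from memory. A sketch with illustrative types (not T4 internals):

```go
package main

import "fmt"

// peerBuffer models the in-memory catch-up window: the last `size`
// WAL entries by revision. Followers asking for older revisions must
// restore from the S3 checkpoint instead.
type peerBuffer struct {
	size     int
	firstRev int64 // oldest revision still buffered
	lastRev  int64
}

// append records a new entry and slides the window forward.
func (b *peerBuffer) append(rev int64) {
	b.lastRev = rev
	b.firstRev = rev - int64(b.size) + 1
	if b.firstRev < 1 {
		b.firstRev = 1
	}
}

// canReplay reports whether a follower at followerRev can catch up
// from memory (it needs every entry from followerRev+1 onward).
func (b *peerBuffer) canReplay(followerRev int64) bool {
	return followerRev+1 >= b.firstRev
}

func main() {
	b := &peerBuffer{size: 1000}
	for rev := int64(1); rev <= 5000; rev++ {
		b.append(rev)
	}
	// The buffer now holds revisions 4001..5000.
	fmt.Println(b.canReplay(4500)) // true: replay from memory
	fmt.Println(b.canReplay(3000)) // false: restore from S3 checkpoint
}
```

So shrinking `PeerBufferSize` trades memory for a narrower window in which a lagging follower can avoid the much slower checkpoint restore.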