MeshDB 0.2.0
MeshDB 0.2.0 is out. The headline is operational hardening — the knobs and surfaces you need to actually run a cluster: graceful drain, k8s probes, per-query budgets, plan cache, consistent backup with offline verification, TLS cert hot-reload, mTLS, OpenTelemetry, an audit log, and a full multi-raft mode. There are no breaking changes to the on-disk format, network protocols, or config shape; upgrade is a process restart on the same data directory.
Install the latest release with:
cargo install meshdb-server
The reference container image is darkspar/meshdb-server:0.2.0 on Docker Hub. The new deploy/ tree in the repo ships production-shaped Kubernetes manifests, a Prometheus alert bundle, an importable Grafana dashboard, and a day-2 RUNBOOK.md.
Multi-Raft mode
mode = "multi-raft" is the third cluster shape: sharding with
per-partition replication. Each partition has its own openraft group
replicated across replication_factor peers; DDL and cluster
membership ride a separate metadata Raft group spanning every peer.
Cross-partition writes use a Spanner-style 2PC where both PREPARE
and COMMIT are proposed through the partition Raft, so staged state
is replicated by the time PREPARE-ACK returns and there’s no
in-doubt window dependent on a participant log fsync.
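As a point of reference, the two knobs that define this shape are plain TOML settings. A minimal sketch, assuming they sit at the top level of the server config (mode and replication_factor are the keys named in these notes; their placement in the file is an assumption):

```toml
# Multi-raft cluster shape: shard the graph, replicate each partition.
mode = "multi-raft"        # the third cluster shape
replication_factor = 3     # peers per partition Raft group
```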
The mechanics that make it useful in practice:
- Server-side proxying. Single-partition writes arriving on a non-leader peer are proxied to the partition leader via internal MeshWrite::ForwardWrite; DDL through MeshWrite::ForwardDdl. Bolt clients see one consistent endpoint for the lifetime of a session, no client-visible redirects.
- DDL barrier. Every peer tracks the highest meta-Raft index it's seen committed; partition writes await the local meta replica to catch up before applying, so a CREATE INDEX issued through any peer is guaranteed visible by the time a follow-up write lands.
- Per-partition snapshots. Each partition's applier packs only its own nodes and edges (~1/N of cluster data), so a new replica catching up via InstallSnapshot doesn't download the full graph.
- Dynamic rebalancing. add_partition_replica / remove_partition_replica wrap openraft's per-partition change_membership; instantiate_partition_group spins up a new partition Raft + RocksDB dir + applier on a peer that didn't bootstrap with that partition, no restart required. The cluster's persisted view of placement (PartitionReplicaMap) updates atomically via ClusterCommand::SetPartitionReplicas.
- Linearizable reads. Scatter-gather reads in multi-raft and routing modes serialize through the partition leader for linearizability; learner / read-replica APIs let operators add non-voting replicas for read fan-out.
- Periodic 2PC recovery. A 60s loop re-resolves any in-doubt PREPAREs left by a coordinator that crashed mid-flight while the cluster stayed up.
Operational hardening
The full new surface, all configurable through TOML (consolidated example configs follow the list):
- Graceful drain on SIGTERM / SIGINT. Every partition this peer leads steps down before the listener exits. Bounded by shutdown_drain_timeout_seconds (default 30). Pairs with the k8s terminationGracePeriodSeconds for zero-drop rolling restarts.
- K8s liveness / readiness probes. /livez and /readyz on the metrics endpoint. The shutdown path flips readiness before draining, so the kubelet pulls a peer out of rotation before it stops accepting traffic.
- Per-query budgets. query_timeout_seconds (deadline on every Cypher RUN), query_max_rows (cap on result-set size, ResourceExhausted instead of OOM), max_concurrent_queries (per-peer concurrency cap).
- Cypher plan cache. plan_cache_size = N skips parse + plan on repeated parametrised queries. Self-correcting on every peer when DDL changes the schema: entries are keyed on a fingerprint hashed from the local index registry.
- Cluster-wide consistent backup. MeshService::take_cluster_backup snapshots every Raft group across every peer in parallel; the manifest serializes to JSON. The new meshdb-server validate-backup --manifest=path --data-dir=path --peer-id=N CLI subcommand confirms a restored data dir matches the manifest before you bring the cluster back up.
- TLS cert hot-reload. Set [bolt_tls] reload_interval_seconds and / or [grpc_tls] reload_interval_seconds and the listener swaps in rotated certs without restarting. Pairs with cert-manager / ACME pipelines.
- gRPC mTLS. [grpc_tls] client_ca_path requires every inbound TLS handshake to present a client cert chained to the bundle. Outbound peer endpoints automatically present the configured identity.
- Cluster auth. [cluster_auth] token adds a shared-secret bearer token on every inter-peer and admin RPC, validated in constant time.
- OpenTelemetry / OTLP. [tracing] otlp_endpoint = "..." exports every tracing::span (gRPC handlers, Raft applies, the Cypher executor) through the OpenTelemetry SDK's batch span processor. service_name and sample_rate are operator-tunable.
- Audit log. audit_log_path writes one fsync'd JSONL record per admin operation (drain, backup, …) with timestamp + initiator + structured args + success/error.
- RocksDB tuning. A new [storage] section exposes max_open_files, write_buffer_size_bytes, max_write_buffer_number, bloom_filter_bits_per_key, keep_log_file_num. Defaults match the historical hand-applied values, so existing deployments see no behavior change unless they opt in.
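A peer that opts into the runtime and storage knobs might carry something like the following. The key names are the ones listed above; the values and their placement in the TOML tree are illustrative guesses, not a confirmed schema.

```toml
# Runtime / query knobs (illustrative values; defaults apply when unset).
shutdown_drain_timeout_seconds = 30    # graceful drain bound on SIGTERM/SIGINT
query_timeout_seconds = 30             # deadline on every Cypher RUN
query_max_rows = 100000                # ResourceExhausted instead of OOM
max_concurrent_queries = 64            # per-peer concurrency cap
plan_cache_size = 256                  # cached parse + plan entries
audit_log_path = "/var/lib/meshdb/audit.jsonl"

# RocksDB tuning -- defaults match the historical hand-applied values.
[storage]
max_open_files = 1024
write_buffer_size_bytes = 67108864
max_write_buffer_number = 4
bloom_filter_bits_per_key = 10
keep_log_file_num = 10
```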
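Likewise for the security and observability surface, again using only the section and key names from the list, with placeholder paths and values:

```toml
# TLS cert hot-reload (intervals and paths are placeholders).
[bolt_tls]
reload_interval_seconds = 300

[grpc_tls]
reload_interval_seconds = 300
client_ca_path = "/etc/meshdb/tls/client-ca.pem"  # require client certs on inbound gRPC

# Shared-secret bearer token on inter-peer and admin RPCs.
[cluster_auth]
token = "change-me"

# OpenTelemetry / OTLP export of tracing spans.
[tracing]
otlp_endpoint = "http://otel-collector:4317"
service_name = "meshdb"
sample_rate = 0.1
```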
Cypher: allShortestPaths locked in
allShortestPaths(...) is now a supported planner + executor
surface. The shared layered-BFS operator builds a parent DAG that
records every (parent, edge) pair at each shortest-path level,
and the reconstruction walk enumerates every minimum-length path
(vs. shortestPath(...)’s single-path early exit). Cycles, parallel
edges, undirected patterns, reverse direction, edge-type unions,
and self-loops are all covered by integration tests. The capability
was present in code but documented as deferred; this release sheds
the deferral and locks in the behavior with paranoia tests.
Production deployment artifacts
The new deploy/ tree:
- k8s/: ConfigMap with operational knobs set, headless + Bolt ClusterIP services, a 3-pod StatefulSet with anti-affinity, OrderedReady, and livez/readyz probes, plus a PodDisruptionBudget (max 1 unavailable).
- grafana/meshdb-cluster.json: importable dashboard covering query rate, p99 latency, leader skew, apply lag, in-doubt PREPAREs, DDL gate, forward writes.
- prometheus/alerts.yaml: peer-down, meta-without-leader, apply lag, in-doubt PREPAREs, DDL gate timeout, sustained p99 > 1s.
- RUNBOOK.md: day-2 ops, covering rolling upgrades, drain, backup/restore, TLS rotation, adding/removing peers, and common pitfalls.
Upgrading from 0.1.0
No action required beyond a binary restart. No data migration, no
config changes, no protocol changes. The [storage],
[cluster_auth], [tracing], audit_log_path,
shutdown_drain_timeout_seconds, query_timeout_seconds,
query_max_rows, max_concurrent_queries, and plan_cache_size
fields are all additive — leave them unset and the server behaves
exactly as it did in 0.1.0.
What’s next
- Auto leader balancer (blocked on openraft 0.9 exposing transfer_leader / TimeoutNow).
- Audit-log coverage of the cluster-level paths (add_partition_replica, drain_peer, learner ops): the service-level paths land first; the cluster-level paths are a mechanical follow-up.
- Pull-based result streaming through Bolt PULL: query_max_rows is the bound today; the executor still materializes rows in Vec<Row> before the response goes out.
- A CI perf-regression gate for the criterion benches that shipped in 0.1.0.
- meshdb-server backup and meshdb-server cluster add-peer CLI subcommands so backup-trigger and peer-add aren't gRPC-only.