Kontinuum Node — Observability
Metrics inventory, log conventions, tracing spans, and alerts. A P1 prerequisite for production readiness and SRE onboarding.
Audience: SRE / DevOps · node developers (where metrics are emitted) · ops engineers (dashboards).
Related documents:
- `operations.md`: §15.1 Grafana + Directus + operator chat bot stack; §16.1 DR plan
- `configuration.md`: `[observability]` section
- `api-contracts.md`: `/metrics` endpoint exposure
- `testing.md`: performance benchmarks tied to SLOs
Stack
| Tier | Component | Where it lives |
|---|---|---|
| Metrics export | Prometheus exporter (metrics-exporter-prometheus) | kontinuum-node-server + kontinuum-node-admin |
| Metrics storage | Prometheus | Centralized (org-side) |
| Visualization | Grafana | Centralized |
| Logs | Structured JSON logs → stdout | Per-process, collected via Loki / Vector / fluentbit |
| Traces | OpenTelemetry (OTLP) | Centralized OTEL collector (optional) |
| Alerts | Alertmanager (Prometheus stack) | Routes to operator chat bot + PagerDuty / Slack |
Rust ecosystem:
- `metrics 0.24`: abstraction layer for metrics (counter / gauge / histogram).
- `metrics-exporter-prometheus 0.16`: HTTP exporter.
- `tracing 0.1` + `tracing-subscriber 0.3`: structured logs + spans.
- `tracing-opentelemetry` (optional): OTLP integration.
Metrics inventory
All metrics are namespaced as `kontinuum_node_<subsystem>_<metric_name>`. Labels are kept minimal to avoid cardinality explosion.
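As a minimal sketch of how the naming rule could be enforced in one place, the helper below builds a metric name from its parts and rejects malformed input. `metric_name` and `is_valid_part` are hypothetical names for illustration; they are not part of the `metrics` crate, and the lowercase snake_case check is an assumption rather than a stated project rule.

```rust
/// Prefix shared by all node metrics, taken from the naming rule above.
const NODE_PREFIX: &str = "kontinuum_node";

/// Returns true if `part` is non-empty, starts with a lowercase ASCII letter,
/// and otherwise contains only lowercase letters, digits, or underscores.
fn is_valid_part(part: &str) -> bool {
    let mut chars = part.chars();
    match chars.next() {
        Some(c) if c.is_ascii_lowercase() => {}
        _ => return false,
    }
    chars.all(|c| c.is_ascii_lowercase() || c.is_ascii_digit() || c == '_')
}

/// Builds `kontinuum_node_<subsystem>_<metric_name>`.
/// Returns `None` if either part fails the (assumed) snake_case check.
fn metric_name(subsystem: &str, name: &str) -> Option<String> {
    if is_valid_part(subsystem) && is_valid_part(name) {
        Some(format!("{}_{}_{}", NODE_PREFIX, subsystem, name))
    } else {
        None
    }
}
```

Routing every registration through one such function would make a naming drift show up as a `None` at startup instead of as a stray series in Prometheus.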
Network / DHT
| Metric | Type | Labels | Description |
|---|---|---|---|
| `kontinuum_node_dht_lookups_total` | counter | dht_namespace, result | DHT lookups by namespace; result is success / timeout / error |
| `kontinuum_node_dht_lookup_duration_seconds` | histogram | dht_namespace | DHT lookup latency distribution |
| `kontinuum_node_dht_records_count` | gauge | dht_namespace | Records stored per namespace |
| `kontinuum_node_dht_routing_table_size` | gauge | dht_namespace | k-bucket entry count per namespace |
| `kontinuum_node_dht_put_total` | counter | dht_namespace, result | DhtPut operations |
| `kontinuum_node_dht_quarantined_records` | gauge | dht_namespace | Records in quarantine for atomic transactions |
| `kontinuum_node_peers_connected` | gauge | tier | Live peer connections by tier |
| `kontinuum_node_peers_discovered_total` | counter | - | Total peers ever discovered |
| `kontinuum_node_peer_handshakes_total` | counter | result | NodeHello handshakes (accepted / rejected) |
Storage
| Metric | Type | Labels | Description |
|---|---|---|---|
| `kontinuum_node_storage_used_bytes` | gauge | kind | Used storage by kind (dht / cache / blobs / mailbox) |
| `kontinuum_node_storage_capacity_bytes` | gauge | kind | Capacity by kind |
| `kontinuum_node_s3_requests_total` | counter | operation, result | PUT / GET / DELETE / HEAD; result is succeeded / 4xx / 5xx |
| `kontinuum_node_s3_request_duration_seconds` | histogram | operation | Request latency |
| `kontinuum_node_s3_bytes_read_total` | counter | - | Read traffic out (egress) |
| `kontinuum_node_s3_bytes_written_total` | counter | - | Write traffic in (ingress) |
| `kontinuum_node_presign_requests_total` | counter | result | Auth-shim presign requests |
| `kontinuum_node_presign_auth_failures_total` | counter | reason | Invalid signature / not member / etc. |
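To keep the `result` label bounded, raw HTTP status codes can be collapsed into a small fixed set before the counter is incremented. The following is a sketch only: `result_label` is a hypothetical helper, and the `"other"` bucket for 1xx/3xx responses is an assumption not found in the table above.

```rust
/// Maps an HTTP status code to a bounded `result` label value.
/// "succeeded", "4xx" and "5xx" come from the Storage table; "other" is an
/// assumed catch-all so that unexpected codes never create new label values.
fn result_label(status: u16) -> &'static str {
    match status {
        200..=299 => "succeeded",
        400..=499 => "4xx",
        500..=599 => "5xx",
        _ => "other",
    }
}
```

With four operations and at most four result values, `kontinuum_node_s3_requests_total` stays at no more than 16 series per node under this scheme.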
Replication
| Metric | Type | Labels | Description |
|---|---|---|---|
| `kontinuum_node_replication_factor_current` | histogram | space_id | Actual RF distribution per space (bucketed) |
| `kontinuum_node_replication_repair_jobs_pending` | gauge | - | Pending repair tasks |
| `kontinuum_node_replication_push_total` | counter | result | ReplicaPush operations |
| `kontinuum_node_replication_push_duration_seconds` | histogram | - | Push latency |
| `kontinuum_node_replication_bonus_blobs` | gauge | - | Blobs replicated through the friend-replication bonus |
Mailbox
| Metric | Type | Labels | Description |
|---|---|---|---|
| `kontinuum_node_mailbox_entries_total` | gauge | identity_id_hash | Inbox depth per identity (hash, for cardinality) |
| `kontinuum_node_mailbox_bytes_used` | gauge | identity_id_hash | Bytes used per identity |
| `kontinuum_node_mailbox_deposits_total` | counter | msg_type, result | Inbound deposits |
| `kontinuum_node_mailbox_gc_entries_removed_total` | counter | - | GC sweep removals |
| `kontinuum_node_mailbox_cursor_updates_total` | counter | - | Cursor sync events |
| `kontinuum_node_device_cursors_active` | gauge | - | Live device cursors |
Note on the `identity_id_hash` label: never expose a raw `identity_id` in metrics (privacy). Use `blake3(identity_id)[:8]`, which gives enough cardinality for troubleshooting without leaking the full identity.
Cert / Lifecycle
| Metric | Type | Labels | Description |
|---|---|---|---|
| `kontinuum_node_cert_state` | gauge | state | Current cert state (1 = in this state, 0 = not) |
| `kontinuum_node_cert_valid_until_seconds` | gauge | - | Unix epoch time at which the cert expires |
| `kontinuum_node_cert_revocations_received_total` | counter | source | CRL updates received (dht / gossip / direct) |
| `kontinuum_node_lapse_stage` | gauge | stage | Active lapse stage (1 = active, 0 = inactive) |
Anti-entropy
| Metric | Type | Labels | Description |
|---|---|---|---|
| `kontinuum_node_anti_entropy_cycles_total` | counter | result | AE gossip cycles |
| `kontinuum_node_anti_entropy_divergences_detected_total` | counter | - | Mismatched root hashes detected |
| `kontinuum_node_anti_entropy_records_reconciled_total` | counter | - | Records pushed / pulled via AE |
| `kontinuum_node_anti_entropy_cycle_duration_seconds` | histogram | - | Time per cycle |
Challenge / PoS
| Metric | Type | Labels | Description |
|---|---|---|---|
| `kontinuum_node_challenges_issued_total` | counter | - | Challenges this node issued |
| `kontinuum_node_challenges_responded_total` | counter | outcome | Responses (passed / failed / timeout) |
| `kontinuum_node_challenge_pass_rate` | gauge | peer_id_hash | Rolling pass rate per peer (for targeting) |
| `kontinuum_node_peer_scores` | histogram | - | Reputation distribution |
Rendezvous
| Metric | Type | Labels | Description |
|---|---|---|---|
| `kontinuum_node_rendezvous_publishes_total` | counter | result | Presence publish requests |
| `kontinuum_node_rendezvous_lookups_total` | counter | result | Lookup requests |
| `kontinuum_node_rendezvous_active_presences` | gauge | - | Currently live presence tokens |
| `kontinuum_node_rendezvous_rate_limit_hits_total` | counter | bucket | 429 responses (per-IP, per-identity) |
Re-encryption
| Metric | Type | Labels | Description |
|---|---|---|---|
| `kontinuum_node_reencryption_jobs_pending` | gauge | - | Pending re-encryption tasks |
| `kontinuum_node_reencryption_blobs_processed_total` | counter | - | Successfully re-encrypted blobs |
| `kontinuum_node_reencryption_duration_seconds` | histogram | - | Per-blob re-encryption time |
Resource utilization
| Metric | Type | Labels | Description |
|---|---|---|---|
| `kontinuum_node_cpu_usage_pct` | gauge | - | Process CPU % |
| `kontinuum_node_memory_bytes` | gauge | kind | RSS / VSZ / heap |
| `kontinuum_node_open_connections` | gauge | - | TCP/QUIC open connections |
| `kontinuum_node_goroutines_count` (N/A for Rust) | - | - | - |
| `kontinuum_node_tokio_workers_busy_pct` | gauge | - | Tokio runtime utilization |
| `kontinuum_node_db_pool_connections_in_use` | gauge | db | r2d2 pool state |
| `kontinuum_node_db_query_duration_seconds` | histogram | operation | DB query latency |
Admin / Billing (admin process only)
| Metric | Type | Labels | Description |
|---|---|---|---|
| `kontinuum_admin_billing_events_received_total` | counter | event_type, result | Webhook events |
| `kontinuum_admin_billing_processing_duration_seconds` | histogram | event_type | Processing time |
| `kontinuum_admin_cert_issuances_total` | counter | tier, result | Cert issuance flow |
| `kontinuum_admin_tier0_request_duration_seconds` | histogram | - | RPC latency to anchors |
| `kontinuum_admin_tier0_signatures_collected` | histogram | - | Multi-sig collection (1, 2, 3, ...) |
| `kontinuum_admin_rest_requests_total` | counter | endpoint, method, status | REST API traffic |
| `kontinuum_admin_rest_request_duration_seconds` | histogram | endpoint | Latency per endpoint |
Log conventions
Format
JSON only in production (`log_format = "json"` in config). The text format is for dev only.
Each record is a single line of JSON with no pretty-printing (the example is expanded here for readability):
    {
      "timestamp": "2026-05-19T10:30:00.123Z",
      "level": "info",
      "target": "kontinuum_node_server::dht::global",
      "message": "Peer connected",
      "span": {
        "name": "handle_peer_connect",
        "peer_id": "12D3KooW...",
        "tier": 1
      },
      "fields": {
        "remote_addr": "10.0.0.5:4001",
        "protocol_version": 1
      },
      "trace_id": "abc123...", // if OTLP is enabled
      "span_id": "def456..."
    }

Privacy in logs
Never log:
- Raw `identity_id` (use the truncated hash `blake3(identity_id)[:8]`)
- Raw `signing_key` (never)
- Plaintext blob content (hashes only)
- BIP39 recovery phrases
- HMAC / API secrets

OK to log:
- `node_id` (public)
- `peer_id` (libp2p public)
- Cert metadata (issued_at, valid_until, tier)
- Blob hashes (random-looking, no identity leak)
- Truncated identity hash (`b3hash:abc12345`)
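The truncated identity hash could be produced by a small helper like the one sketched here. It assumes the caller has already computed the 32-byte BLAKE3 digest of `identity_id` (for example with the `blake3` crate), and it reads `[:8]` as 8 hex characters, i.e. 4 bytes, to match the `b3hash:abc12345` example. `identity_label` is a hypothetical name.

```rust
/// Formats the log-safe identity label from a precomputed 32-byte digest.
/// Assumption: `digest` is BLAKE3(identity_id), computed by the caller;
/// this sketch only truncates and hex-encodes it.
fn identity_label(digest: &[u8; 32]) -> String {
    // First 4 bytes -> 8 lowercase hex characters.
    let hex: String = digest[..4].iter().map(|b| format!("{:02x}", b)).collect();
    format!("b3hash:{}", hex)
}
```

Keeping the hashing and formatting in one function means the raw `identity_id` never has to be passed to a logging call site.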
Levels
| Level | When |
|---|---|
| `trace` | Per-frame wire details, fine-grained DHT routing decisions (dev only) |
| `debug` | Internal state transitions; not emitted in production by default |
| `info` | Significant events: peer connect/disconnect, cert lifecycle, GC stats |
| `warn` | Recoverable anomalies: rate-limit hits, transient peer failures, slow queries |
| `error` | Unrecoverable per-request errors; requires investigation |
| `fatal` | Process unable to continue; exit immediately |
Structured fields (consistent across messages)
| Field | Type | Example | Notes |
|---|---|---|---|
| `node_id` | string | `1f8b...` | Always present (own node id) |
| `peer_id` | string | `12D3KooW...` | When the event involves a remote peer |
| `identity_id_hash` | string | `b3hash:abc12345` | Truncated; never the full identity_id |
| `space_id` | string | `1f8b...` | For Space-scoped events |
| `blob_hash` | string | `f3e5...` | For blob-scoped events |
| `request_id` | string | `req_xyz789` | Correlates an admin REST request with its log entries |
| `error.kind` | string | `INVALID_SIGNATURE` | Machine-parseable error class |
| `error.cause` | string | `signature mismatch` | Human-readable |
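To illustrate the single-line JSON requirement together with the structured fields above, here is a standard-library-only sketch of a record formatter. In the real node this job would presumably fall to the JSON formatter of `tracing-subscriber`; `json_escape` and `log_line` are hypothetical helpers, and only string-valued fields are handled.

```rust
/// Escapes a string for use inside a JSON string literal.
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            // Remaining control characters use the \u00XX form.
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Renders one log record as a single line of JSON.
/// `fields` holds string-valued structured fields such as `node_id` or `request_id`.
fn log_line(
    timestamp: &str,
    level: &str,
    target: &str,
    message: &str,
    fields: &[(&str, &str)],
) -> String {
    let mut line = format!(
        "{{\"timestamp\":\"{}\",\"level\":\"{}\",\"target\":\"{}\",\"message\":\"{}\",\"fields\":{{",
        json_escape(timestamp),
        json_escape(level),
        json_escape(target),
        json_escape(message)
    );
    for (i, (key, value)) in fields.iter().enumerate() {
        if i > 0 {
            line.push(',');
        }
        line.push_str(&format!("\"{}\":\"{}\"", json_escape(key), json_escape(value)));
    }
    line.push_str("}}");
    line
}
```

Because newlines inside messages are escaped, each record stays on one line, which is what line-oriented shippers such as Loki, Vector, or fluentbit expect.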
Tracing spans
OpenTelemetry-compatible tracing for cross-process flows (admin REST → Tier 0 RPC → DHT publish).
Span hierarchy
    admin_rest_request{ endpoint=/nodes/provision }
    ├── billing_lookup_subscription
    │   └── http_outbound{ url=billing-api.com }
    ├── tier0_multi_sig_collection
    │   ├── tier0_anchor_sign{ anchor_id=1 }
    │   │   └── http_outbound
    │   ├── tier0_anchor_sign{ anchor_id=2 }
    │   └── tier0_anchor_sign{ anchor_id=3 }
    └── dht_publish{ key=node:1f8b... }
        └── kademlia_provider_record_put

Sampling
- Production: 1% baseline sampling, plus 100% of errors and slow requests (> 1 s).
- Staging: 10% sampling.
- Dev: 100%.
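The sampling policy above can be sketched as a pure decision function. Two points are assumptions of this sketch rather than statements from the policy: the baseline decision is derived from the trace id so that every process seeing the same trace agrees, and the always-keep rule for errors and slow requests is applied in all environments. Since the duration is only known when the request finishes, this models a decision taken at the end of a trace. `Env` and `should_sample` are hypothetical names.

```rust
use std::time::Duration;

/// Deployment environments named in the sampling policy.
#[derive(Clone, Copy)]
enum Env {
    Production,
    Staging,
    Dev,
}

/// Baseline sampling percentage per environment (1%, 10%, 100%).
fn baseline_percent(env: Env) -> u64 {
    match env {
        Env::Production => 1,
        Env::Staging => 10,
        Env::Dev => 100,
    }
}

/// Decides whether to keep a finished trace.
/// Errors and requests slower than 1 s are always kept; everything else is
/// kept when `trace_id % 100` falls below the environment's baseline.
fn should_sample(env: Env, trace_id: u64, is_error: bool, duration: Duration) -> bool {
    if is_error || duration > Duration::from_secs(1) {
        return true;
    }
    trace_id % 100 < baseline_percent(env)
}
```

Deriving the decision from the trace id only gives an even 1% if trace ids are uniformly distributed, which holds for randomly generated ids.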
Span attributes
Span attributes inherit the log conventions above: the same field names are used for consistency.
Alerts
Severity model
| Level | Action | Examples |
|---|---|---|
| info | Slack notification, no page | Cert renewing soon (30-day warning) |
| warning | Slack + operator chat bot (operator on duty) | Slow query, elevated peer churn rate |
| critical | Page (PagerDuty) + Slack + operator chat bot | Node down, cert revocation failed |
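The severity model maps directly onto a routing table. The sketch below encodes it as data; in practice this mapping would live in the Alertmanager routing configuration, so the `Severity`, `Channel`, and `route` names are purely illustrative.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Severity {
    Info,
    Warning,
    Critical,
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum Channel {
    Slack,
    OperatorChatBot,
    PagerDuty,
}

/// Returns the notification channels for a severity, following the table above.
fn route(severity: Severity) -> &'static [Channel] {
    match severity {
        Severity::Info => &[Channel::Slack],
        Severity::Warning => &[Channel::Slack, Channel::OperatorChatBot],
        Severity::Critical => &[
            Channel::PagerDuty,
            Channel::Slack,
            Channel::OperatorChatBot,
        ],
    }
}
```

Only `Critical` includes `PagerDuty`, which keeps paging reserved for the conditions listed under critical alerts.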
Critical alerts
| Alert name | Condition | Action |
|---|---|---|
| node_down | `up{job="kontinuum-node"} == 0` for ≥ 1 min | Page on-call SRE |
| dht_lookup_p99_high | `histogram_quantile(0.99, ...) > 2s` for ≥ 5 min | Page |
| mailbox_storage_full | `kontinuum_node_mailbox_bytes_used / mailbox_capacity > 0.95` | Page (data loss risk) |
| cert_expired | `kontinuum_node_cert_valid_until_seconds < time()` | Page |
| tier0_unreachable | All anchor health checks fail for ≥ 2 min | Page (cert issuance broken) |
| disk_full_imminent | `node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.05` | Page |
| peer_count_low | `kontinuum_node_peers_connected < 5` for ≥ 10 min | Page (isolation risk) |
| replication_factor_degraded | `histogram_quantile(0.50, replication_factor) < target_rf - 1` | Page (durability risk) |
Warning alerts
| Alert name | Condition |
|---|---|
| cert_expiring_soon | `cert_valid_until - time()` < 7 days |
| dht_lookup_p99_elevated | p99 > 500ms for ≥ 10 min |
| challenge_failure_rate_high | `challenge_failure_rate > 0.05` for ≥ 30 min |
| re_encryption_backlog | `reencryption_jobs_pending > 1000` |
| anti_entropy_divergences_elevated | `rate(divergences_detected[5m])` > 10/min |
| rate_limit_hits_high | `rate(rate_limit_hits_total[5m])` > 100/min |
Info alerts
| Alert name | Condition |
|---|---|
| cert_renewing | `cert_valid_until - time()` < 30 days |
| new_partner_node_provisioned | Counter increment |
| lapse_timeline_started | `lapse_stage` transitions |
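The three certificate alerts (`cert_expired`, `cert_expiring_soon`, `cert_renewing`) share one input, the time left until `valid_until`, and differ only in threshold. A sketch of that tiering, with hypothetical names `CertAlert` and `classify_cert` and Unix-epoch seconds as inputs:

```rust
/// Seconds in one day.
const DAY: u64 = 24 * 60 * 60;

#[derive(Debug, PartialEq)]
enum CertAlert {
    Expired,      // critical: valid_until is in the past
    ExpiringSoon, // warning: fewer than 7 days left
    Renewing,     // info: fewer than 30 days left
    Healthy,      // no alert
}

/// Classifies a certificate by remaining validity, using the thresholds
/// from the alert tables above. Both arguments are Unix epoch seconds.
fn classify_cert(valid_until: u64, now: u64) -> CertAlert {
    if valid_until < now {
        return CertAlert::Expired;
    }
    let remaining = valid_until - now;
    if remaining < 7 * DAY {
        CertAlert::ExpiringSoon
    } else if remaining < 30 * DAY {
        CertAlert::Renewing
    } else {
        CertAlert::Healthy
    }
}
```

Returning only the highest applicable tier is a design choice of this sketch; as written, the alert conditions above overlap, so a cert with 3 days left would match both the warning and the info rule unless the rules inhibit each other.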
Dashboards
Grafana dashboard layout
Dashboard 1 — Network Overview (org-wide):
| Panel | Metric | Visualization |
|---|---|---|
| Active nodes by tier | sum by (tier) (up) | Stacked bar |
| DHT lookup p50/p95/p99 (global aggregated) | Histogram quantiles | Time series |
| Peer connections distribution | histogram_quantile(peers_connected) | Heatmap |
| Geo distribution (nodes per zone) | sum by (geo_zone) (up) | World map |
| Active alerts | Alertmanager status | Table |
Dashboard 2 — Single Node Detail:
| Panel | Metric |
|---|---|
| Cert state + valid_until | cert_state, cert_valid_until_seconds |
| Storage usage by kind (donut) | storage_used_bytes by kind |
| DHT records per namespace (list) | dht_records_count by namespace |
| Recent log entries (5 min) | Loki query |
| Active peers list | (Custom — node info endpoint) |
| Mailbox depth top 10 identities | topk(10, mailbox_entries_total) |
Dashboard 3 — Storage / S3:
| Panel | Metric |
|---|---|
| Bytes in/out (rate) | rate(s3_bytes_*_total[5m]) |
| Operations rate (PUT/GET/DELETE) | rate(s3_requests_total[5m]) by operation |
| Auth failure rate | rate(presign_auth_failures_total[5m]) |
| 4xx / 5xx error rate | `rate(s3_requests_total{result=~"4xx\|5xx"}[5m])` |
Dashboard 4 — Replication:
| Panel | Metric |
|---|---|
| RF distribution | replication_factor_current histogram |
| Repair jobs backlog | replication_repair_jobs_pending |
| Push rate | rate(replication_push_total[5m]) |
| Friend-replication bonus active | replication_bonus_blobs |
Dashboard 5 — Admin / Billing:
| Panel | Metric |
|---|---|
| Webhook events / hour | rate(billing_events_received_total[1h]) |
| Cert issuance success rate | rate(cert_issuances_total{result="success"}) / rate(cert_issuances_total) |
| Tier 0 RPC latency | tier0_request_duration_seconds |
| Multi-sig collection time | tier0_signatures_collected histogram |
| REST API traffic | rate(rest_requests_total[5m]) by endpoint |
Dashboard JSON storage
In `kontinuum-node/deploy/grafana/dashboards/`:

    deploy/grafana/dashboards/
    ├── 01-network-overview.json
    ├── 02-node-detail.json
    ├── 03-storage-s3.json
    ├── 04-replication.json
    ├── 05-admin-billing.json
    └── README.md

The JSON files are version-controlled and importable into Grafana through provisioning.
SLO definitions
Service Level Objectives are measurable targets that serve as the basis for alerts.
| SLO | Target | Measurement window | Burn rate alert |
|---|---|---|---|
| Node availability | 99.9% | 30 days | 2h / 6h burn rate |
| DHT lookup p99 latency | < 500ms | 5 min rolling | If > 1s for 5 min |
| S3 presign latency p99 | < 100ms | 5 min rolling | If > 250ms for 5 min |
| Mailbox deposit success rate | 99.5% | 5 min rolling | If < 99% for 10 min |
| Replication factor compliance | 99.9% blobs at target_rf | 1h | If < 99% for 30 min |
| Cert renewal lead time | ≥ 7 days early | At renewal | N/A (info alert) |
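For the availability SLO, the error budget and burn rate follow from the target alone. The sketch below uses the usual definitions (budget = allowed failure fraction × window; burn rate = observed error ratio ÷ allowed error ratio). The function names are hypothetical, and the table does not state which burn-rate thresholds the 2h / 6h windows use.

```rust
/// Allowed downtime in minutes for an availability target over a window.
/// Example: a 99.9% target over 30 days allows about 43.2 minutes.
fn error_budget_minutes(target: f64, window_days: f64) -> f64 {
    (1.0 - target) * window_days * 24.0 * 60.0
}

/// Burn rate: observed error ratio divided by the ratio the SLO allows.
/// A value of 1.0 consumes the budget exactly over the full window.
fn burn_rate(observed_error_ratio: f64, target: f64) -> f64 {
    observed_error_ratio / (1.0 - target)
}
```

A burn rate of 2.0 sustained over the whole window would exhaust a 30-day budget in 15 days, which is why short windows are paired with high thresholds in multi-window alerting.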
Implementation checklist
- [ ] In `server/Cargo.toml` + `admin/Cargo.toml`: `metrics`, `metrics-exporter-prometheus`, `tracing`, `tracing-subscriber`, optionally `tracing-opentelemetry`.
- [ ] `server/src/observability/mod.rs`: register all metrics at startup.
- [ ] Helper macros / functions for consistent `identity_id_hash` labeling.
- [ ] Privacy linter: CI check that no `identity_id` or `signing_key` references appear in `tracing::*` calls (custom Clippy lint or grep).
- [ ] Grafana dashboards JSON in `deploy/grafana/dashboards/`.
- [ ] Alertmanager rules in `deploy/prometheus/alerts.yml`.
- [ ] Recording rules for aggregations in `deploy/prometheus/rules.yml`.
- [ ] Tests: metrics are emitted correctly, verified by unit tests + integration assertions.
- [ ] Loki / Vector config example in `deploy/observability/log-shipping.yml`.
Open implementation questions
- OTLP vs Prometheus + Loki separately. OTLP is unified (metrics + logs + traces in one protocol) but adds a collector dependency. Decision: Prometheus + Loki for v1.0 (simpler ops); OTLP migration is post-v1.0.
- Sampling strategy for traces. 1% baseline + 100% of errors is standard, but it can miss low-frequency slow requests. Tail-sampling collector?
- High-cardinality labels. `identity_id_hash` (4-byte truncated) gives ~4B unique values, which is too high for Prometheus. The alternative is to emit without an identity label and aggregate through logs / traces only. Decision: identity labels only in logs/traces, none in metrics (revise the inventory above).
- Per-tenant metrics in family mode. Each tenant is a separate label value, which is a cardinality risk with many tenants. Decision: aggregate per host node; individual breakdown goes through logs.
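The cardinality figures in the questions above can be checked with a little arithmetic. This sketch computes the value space of a truncated hash and the worst-case series count for one metric as the product of its label cardinalities; both function names are hypothetical.

```rust
/// Number of distinct values of a hash truncated to `bytes` bytes, i.e. 2^(8 * bytes).
/// Returns `None` if the count does not fit in a u64.
fn truncated_hash_space(bytes: u32) -> Option<u64> {
    let bits = bytes.checked_mul(8)?;
    1u64.checked_shl(bits)
}

/// Upper bound on the number of time series for one metric:
/// the product of the cardinalities of its labels.
fn max_series(label_cardinalities: &[u64]) -> Option<u64> {
    label_cardinalities
        .iter()
        .try_fold(1u64, |acc, &n| acc.checked_mul(n))
}
```

For a 4-byte hash the space is 4,294,967,296 values, matching the "~4B" estimate, whereas a metric labeled only by `operation` (4 values) and `result` (3 values) tops out at 12 series.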