Kontinuum Node — Observability

Metrics inventory, log conventions, tracing spans, and alerts. A P1 prerequisite for production readiness and SRE onboarding.

Audience: SRE / DevOps · node developers (where to emit metrics) · ops engineers (dashboards).

Stack

| Tier | Component | Where it lives |
|---|---|---|
| Metrics export | Prometheus exporter (metrics-exporter-prometheus) | kontinuum-node-server + kontinuum-node-admin |
| Metrics storage | Prometheus | Centralized (org-side) |
| Visualization | Grafana | Centralized |
| Logs | Structured JSON logs → stdout | Per-process, collected via Loki / Vector / fluentbit |
| Traces | OpenTelemetry (OTLP) | Centralized OTEL collector (optional) |
| Alerts | Alertmanager (Prometheus stack) | Routes to operator chat bot + PagerDuty / Slack |

Rust ecosystem:

  • metrics 0.24: abstraction layer for metrics (counter / gauge / histogram).
  • metrics-exporter-prometheus 0.16: HTTP exporter.
  • tracing 0.1 + tracing-subscriber 0.3: structured logs + spans.
  • tracing-opentelemetry (optional): OTLP integration.

Metrics inventory

Node metrics are namespaced as kontinuum_node_<subsystem>_<metric_name>; admin-process metrics use the kontinuum_admin_ prefix instead. Labels are kept minimal to avoid a cardinality explosion.
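
The naming rule above can be enforced with a small helper. This is a minimal sketch, not the project's actual code: the function name `node_metric_name` and the rule that each part may only contain lowercase ASCII letters, digits, and underscores are assumptions made for illustration.

```rust
/// Builds a metric name following the `kontinuum_node_<subsystem>_<metric_name>` pattern.
/// Hypothetical helper: returns None if a part is empty or contains characters
/// outside [a-z0-9_] (an assumed, conservative subset of valid Prometheus names).
fn node_metric_name(subsystem: &str, metric: &str) -> Option<String> {
    let valid = |part: &str| {
        !part.is_empty()
            && part
                .chars()
                .all(|c| c.is_ascii_lowercase() || c.is_ascii_digit() || c == '_')
    };
    if valid(subsystem) && valid(metric) {
        Some(format!("kontinuum_node_{subsystem}_{metric}"))
    } else {
        None
    }
}
```

For example, `node_metric_name("dht", "lookups_total")` yields the first name in the Network / DHT table, while an uppercase or empty part is rejected.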

Network / DHT

| Metric | Type | Labels | Description |
|---|---|---|---|
| kontinuum_node_dht_lookups_total | counter | dht_namespace, result | DHT lookups by namespace, success/timeout/error |
| kontinuum_node_dht_lookup_duration_seconds | histogram | dht_namespace | DHT lookup latency distribution |
| kontinuum_node_dht_records_count | gauge | dht_namespace | Records stored per namespace |
| kontinuum_node_dht_routing_table_size | gauge | dht_namespace | k-buckets entry count per namespace |
| kontinuum_node_dht_put_total | counter | dht_namespace, result | DhtPut operations |
| kontinuum_node_dht_quarantined_records | gauge | dht_namespace | Records in quarantine for atomic transactions |
| kontinuum_node_peers_connected | gauge | tier | Live peer connections by tier |
| kontinuum_node_peers_discovered_total | counter | | Total peers ever discovered |
| kontinuum_node_peer_handshakes_total | counter | result | NodeHello handshakes: accepted/rejected |

Storage

| Metric | Type | Labels | Description |
|---|---|---|---|
| kontinuum_node_storage_used_bytes | gauge | kind | Used storage by kind (dht/cache/blobs/mailbox) |
| kontinuum_node_storage_capacity_bytes | gauge | kind | Capacity by kind |
| kontinuum_node_s3_requests_total | counter | operation, result | PUT/GET/DELETE/HEAD: succeeded/4xx/5xx |
| kontinuum_node_s3_request_duration_seconds | histogram | operation | Request latency |
| kontinuum_node_s3_bytes_read_total | counter | | Read traffic out (egress) |
| kontinuum_node_s3_bytes_written_total | counter | | Write traffic in (ingress) |
| kontinuum_node_presign_requests_total | counter | result | Auth-shim presign requests |
| kontinuum_node_presign_auth_failures_total | counter | reason | Invalid signature / not member / etc. |

Replication

| Metric | Type | Labels | Description |
|---|---|---|---|
| kontinuum_node_replication_factor_current | histogram | space_id | Actual RF distribution per space (bucketed) |
| kontinuum_node_replication_repair_jobs_pending | gauge | | Pending repair tasks |
| kontinuum_node_replication_push_total | counter | result | ReplicaPush operations |
| kontinuum_node_replication_push_duration_seconds | histogram | | Push latency |
| kontinuum_node_replication_bonus_blobs | gauge | | Blobs replicated via the friend-replication bonus |

Mailbox

| Metric | Type | Labels | Description |
|---|---|---|---|
| kontinuum_node_mailbox_entries_total | gauge | identity_id_hash | Inbox depth per identity (hashed to bound cardinality) |
| kontinuum_node_mailbox_bytes_used | gauge | identity_id_hash | Bytes used per identity |
| kontinuum_node_mailbox_deposits_total | counter | msg_type, result | Inbound deposits |
| kontinuum_node_mailbox_gc_entries_removed_total | counter | | GC sweep removals |
| kontinuum_node_mailbox_cursor_updates_total | counter | | Cursor sync events |
| kontinuum_node_device_cursors_active | gauge | | Live device cursors |

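Note on the identity_id_hash label: never expose a raw identity_id in metrics (privacy). Use blake3(identity_id)[:8], the first 8 hex characters (4 bytes) of the digest, which is enough to tell identities apart during troubleshooting without leaking the full identity. Open implementation question 3 proposes dropping identity labels from metrics altogether.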
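
A labeling helper for the truncated hash might look like the sketch below. It assumes the caller has already computed the 32-byte blake3 digest of the identity_id (the blake3 crate is not used here, to keep the example standard-library only). The function name is hypothetical, and the `b3hash:` prefix is borrowed from the log field example further down, not from a metrics specification.

```rust
/// Formats the first 4 bytes (8 hex characters) of a precomputed digest as a
/// label value. Assumption: `digest` is blake3(identity_id), computed by the caller.
fn identity_id_hash_label(digest: &[u8; 32]) -> String {
    // Lowercase hex, two characters per byte, first four bytes only.
    let hex: String = digest[..4].iter().map(|b| format!("{b:02x}")).collect();
    format!("b3hash:{hex}")
}
```

Keeping this in one function means metrics and logs cannot drift apart in how they truncate the hash.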

Cert / Lifecycle

| Metric | Type | Labels | Description |
|---|---|---|---|
| kontinuum_node_cert_state | gauge | state | Current cert state (1 = in this state, 0 = not) |
| kontinuum_node_cert_valid_until_seconds | gauge | | Unix epoch at which the cert expires |
| kontinuum_node_cert_revocations_received_total | counter | source | CRL updates received (dht / gossip / direct) |
| kontinuum_node_lapse_stage | gauge | stage | Active lapse stage (1 = active, 0 = inactive) |

Anti-entropy

| Metric | Type | Labels | Description |
|---|---|---|---|
| kontinuum_node_anti_entropy_cycles_total | counter | result | AE gossip cycles |
| kontinuum_node_anti_entropy_divergences_detected_total | counter | | Mismatched root hashes detected |
| kontinuum_node_anti_entropy_records_reconciled_total | counter | | Records pushed/pulled via AE |
| kontinuum_node_anti_entropy_cycle_duration_seconds | histogram | | Time per cycle |

Challenge / PoS

| Metric | Type | Labels | Description |
|---|---|---|---|
| kontinuum_node_challenges_issued_total | counter | | Challenges this node issued |
| kontinuum_node_challenges_responded_total | counter | outcome | Responses (passed/failed/timeout) |
| kontinuum_node_challenge_pass_rate | gauge | peer_id_hash | Rolling pass rate per peer (for targeting) |
| kontinuum_node_peer_scores | histogram | | Reputation distribution |

Rendezvous

| Metric | Type | Labels | Description |
|---|---|---|---|
| kontinuum_node_rendezvous_publishes_total | counter | result | Presence publish requests |
| kontinuum_node_rendezvous_lookups_total | counter | result | Lookup requests |
| kontinuum_node_rendezvous_active_presences | gauge | | Currently-live presence tokens |
| kontinuum_node_rendezvous_rate_limit_hits_total | counter | bucket | 429 responses (per-IP, per-identity) |

Re-encryption

| Metric | Type | Labels | Description |
|---|---|---|---|
| kontinuum_node_reencryption_jobs_pending | gauge | | Pending re-encryption tasks |
| kontinuum_node_reencryption_blobs_processed_total | counter | | Successfully re-encrypted |
| kontinuum_node_reencryption_duration_seconds | histogram | | Per-blob re-encryption time |

Resource utilization

| Metric | Type | Labels | Description |
|---|---|---|---|
| kontinuum_node_cpu_usage_pct | gauge | | Process CPU % |
| kontinuum_node_memory_bytes | gauge | kind | RSS / VSZ / heap |
| kontinuum_node_open_connections | gauge | | TCP/QUIC open connections |
| kontinuum_node_goroutines_count (N/A for Rust) | - | - | - |
| kontinuum_node_tokio_workers_busy_pct | gauge | | Tokio runtime utilization |
| kontinuum_node_db_pool_connections_in_use | gauge | db | r2d2 pool state |
| kontinuum_node_db_query_duration_seconds | histogram | operation | DB query latency |

Admin / Billing (admin process only)

| Metric | Type | Labels | Description |
|---|---|---|---|
| kontinuum_admin_billing_events_received_total | counter | event_type, result | Webhook events |
| kontinuum_admin_billing_processing_duration_seconds | histogram | event_type | Processing time |
| kontinuum_admin_cert_issuances_total | counter | tier, result | Cert issuance flow |
| kontinuum_admin_tier0_request_duration_seconds | histogram | | RPC latency to anchors |
| kontinuum_admin_tier0_signatures_collected | histogram | | Multi-sig collection (1, 2, 3, ...) |
| kontinuum_admin_rest_requests_total | counter | endpoint, method, status | REST API traffic |
| kontinuum_admin_rest_request_duration_seconds | histogram | endpoint | Latency per endpoint |

Log conventions

Format

JSON only in production (log_format = "json" in config). The text format is for dev only.

Each record is single-line JSON with no pretty-printing. The example below is expanded across lines only for readability; trace_id and span_id are present only when OTLP is enabled:

```json
{
    "timestamp": "2026-05-19T10:30:00.123Z",
    "level": "info",
    "target": "kontinuum_node_server::dht::global",
    "message": "Peer connected",
    "span": {
        "name": "handle_peer_connect",
        "peer_id": "12D3KooW...",
        "tier": 1
    },
    "fields": {
        "remote_addr": "10.0.0.5:4001",
        "protocol_version": 1
    },
    "trace_id": "abc123...",
    "span_id": "def456..."
}
```
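
To illustrate why a record stays on one line even when a message contains a newline, here is a minimal standard-library sketch of JSON string escaping. In practice the JSON formatter of tracing-subscriber would normally produce these records; `json_escape` and `log_line` are hypothetical names, and only four of the fields from the example record are emitted.

```rust
/// Escapes a string for use inside a JSON string literal.
fn json_escape(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for c in input.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            // Remaining control characters use the \u00XX form.
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Renders a reduced log record (illustrative subset of fields) as one line.
fn log_line(timestamp: &str, level: &str, target: &str, message: &str) -> String {
    format!(
        "{{\"timestamp\":\"{}\",\"level\":\"{}\",\"target\":\"{}\",\"message\":\"{}\"}}",
        json_escape(timestamp),
        json_escape(level),
        json_escape(target),
        json_escape(message)
    )
}
```

Because every raw newline is escaped, line-oriented shippers such as Loki or Vector can treat each stdout line as exactly one record.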

Privacy in logs

Never log:

  • Raw identity_id (use the truncated hash blake3(identity_id)[:8])
  • Raw signing_key (never)
  • Plaintext blob content (hashes only)
  • BIP39 recovery phrases
  • HMAC / API secrets

OK to log:

  • node_id (public)
  • peer_id (libp2p public)
  • Cert metadata (issued_at, valid_until, tier)
  • Blob hashes (random-looking, no identity leak)
  • Truncated identity hash (b3hash:abc12345)
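
The two lists above can back a simple deny-list check, for instance in a test or a CI step. The sketch below is illustrative only: the field names in `FORBIDDEN_FIELDS` are assumptions derived from the list (only identity_id and signing_key are named explicitly in this document), and matching is by exact field name.

```rust
/// Field names that must never appear in a log record.
/// Assumption: apart from identity_id and signing_key, these names are invented
/// for illustration and should be replaced with the real field names.
const FORBIDDEN_FIELDS: [&str; 5] = [
    "identity_id",
    "signing_key",
    "blob_content",
    "recovery_phrase",
    "api_secret",
];

/// Exact-match check, so `identity_id_hash` stays allowed.
fn is_forbidden_field(name: &str) -> bool {
    FORBIDDEN_FIELDS.iter().any(|f| *f == name)
}

/// Returns the first forbidden field name found in a record, if any.
fn first_forbidden<'a>(fields: &[&'a str]) -> Option<&'a str> {
    fields.iter().copied().find(|f| is_forbidden_field(f))
}
```

An exact match is deliberate: a substring match on `identity_id` would also reject the permitted `identity_id_hash` field.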

Levels

| Level | When |
|---|---|
| trace | Per-frame wire details, fine-grained DHT routing decisions (dev only) |
| debug | Internal state transitions, not emitted in production by default |
| info | Significant events: peer connect/disconnect, cert lifecycle, GC stats |
| warn | Recoverable anomalies: rate-limit hits, transient peer failures, slow queries |
| error | Unrecoverable per-request errors, requires investigation |
| fatal | Process unable to continue, exits immediately |

Structured fields (consistent across messages)

| Field | Type | Example | Notes |
|---|---|---|---|
| node_id | string | 1f8b... | Always present (own node id) |
| peer_id | string | 12D3KooW... | When the event involves a remote peer |
| identity_id_hash | string | b3hash:abc12345 | Truncated, never the full identity_id |
| space_id | string | 1f8b... | For Space-scoped events |
| blob_hash | string | f3e5... | For blob-scoped events |
| request_id | string | req_xyz789 | Correlates an admin REST request with its log entries |
| error.kind | string | INVALID_SIGNATURE | Machine-parseable error class |
| error.cause | string | signature mismatch | Human-readable |

Tracing spans

OpenTelemetry-compatible tracing for cross-process flows (admin REST → Tier 0 RPC → DHT publish).

Span hierarchy

admin_rest_request{ endpoint=/nodes/provision }
├── billing_lookup_subscription
│   └── http_outbound{ url=billing-api.com }
├── tier0_multi_sig_collection
│   ├── tier0_anchor_sign{ anchor_id=1 }
│   │   └── http_outbound
│   ├── tier0_anchor_sign{ anchor_id=2 }
│   └── tier0_anchor_sign{ anchor_id=3 }
└── dht_publish{ key=node:1f8b... }
    └── kademlia_provider_record_put

Sampling

  • Production: 1% baseline sampling + 100% of errors / slow requests (>1s).
  • Staging: 10% sampling.
  • Dev: 100%.
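
The production policy can be expressed as a small decision function. This is a sketch under stated assumptions: the trace id is reduced to a `u64` (real OpenTelemetry trace ids are 128-bit), the baseline is given as a whole percentage, and the decision is made once the span's outcome and duration are known, which in practice implies tail-based sampling. `should_sample` is a hypothetical name.

```rust
use std::time::Duration;

/// Keeps every error and every request slower than 1s; otherwise keeps a
/// deterministic `baseline_percent` share of traces, keyed on the trace id.
fn should_sample(trace_id: u64, is_error: bool, duration: Duration, baseline_percent: u64) -> bool {
    if is_error || duration > Duration::from_secs(1) {
        return true;
    }
    // Same trace id always gives the same answer, so all processes agree.
    trace_id % 100 < baseline_percent
}
```

With `baseline_percent` set to 1, 10, or 100 the same function covers the production, staging, and dev settings listed above.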

Span attributes

Span attributes inherit the log conventions above: the same field names are used for consistency.


Alerts

Severity model

| Level | Action | Examples |
|---|---|---|
| info | Slack notification, no page | Cert renewing soon (30-day warning) |
| warning | Slack + operator chat bot (operator on duty) | Slow query, elevated peer churn rate |
| critical | Page (PagerDuty) + Slack + operator chat bot | Node down, cert revocation failed |
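
The severity model maps each level to a fixed set of notification channels. The sketch below encodes that mapping as data; the type and function names are hypothetical, and real routing would live in the Alertmanager configuration, not in node code.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Severity {
    Info,
    Warning,
    Critical,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Channel {
    Slack,
    OperatorChatBot,
    PagerDuty,
}

/// Channels notified for a given severity, following the table above.
fn route(severity: Severity) -> &'static [Channel] {
    match severity {
        Severity::Info => &[Channel::Slack],
        Severity::Warning => &[Channel::Slack, Channel::OperatorChatBot],
        Severity::Critical => &[Channel::PagerDuty, Channel::Slack, Channel::OperatorChatBot],
    }
}
```

Only `Severity::Critical` includes PagerDuty, which matches the rule that info and warning alerts never page.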

Critical alerts

| Alert name | Condition | Action |
|---|---|---|
| node_down | up{job="kontinuum-node"} == 0 for ≥ 1 min | Page on-call SRE |
| dht_lookup_p99_high | histogram_quantile(0.99, ...) > 2s for ≥ 5 min | Page |
| mailbox_storage_full | kontinuum_node_mailbox_bytes_used / mailbox_capacity > 0.95 | Page (data loss risk) |
| cert_expired | kontinuum_node_cert_valid_until_seconds < time() | Page |
| tier0_unreachable | All anchor health checks fail for ≥ 2 min | Page (cert issuance broken) |
| disk_full_imminent | node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.05 | Page |
| peer_count_low | kontinuum_node_peers_connected < 5 for ≥ 10 min | Page (isolation risk) |
| replication_factor_degraded | histogram_quantile(0.50, replication_factor) < target_rf - 1 | Page (durability risk) |
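
Several conditions above rely on histogram_quantile. As a rough illustration of what that function estimates, the sketch below interpolates linearly inside the cumulative bucket that contains the target rank. It is a simplified model for building intuition, not Prometheus's implementation, and the bucket bounds in the usage note are invented.

```rust
/// Estimates quantile `q` from cumulative buckets given as
/// (upper_bound_seconds, cumulative_count), sorted by bound, with
/// f64::INFINITY as the last bound. Returns None for empty or invalid input.
fn estimate_quantile(q: f64, buckets: &[(f64, f64)]) -> Option<f64> {
    let total = buckets.last()?.1;
    if total <= 0.0 || !(0.0..=1.0).contains(&q) {
        return None;
    }
    let rank = q * total;
    // First bucket whose cumulative count reaches the rank.
    let idx = buckets.iter().position(|&(_, count)| count >= rank)?;
    let (upper, count) = buckets[idx];
    if upper.is_infinite() {
        // Rank falls into the overflow bucket: report the highest finite bound.
        return if idx > 0 { Some(buckets[idx - 1].0) } else { None };
    }
    let (lower, prev_count) = if idx == 0 { (0.0, 0.0) } else { buckets[idx - 1] };
    let in_bucket = count - prev_count;
    if in_bucket <= 0.0 {
        return Some(upper);
    }
    Some(lower + (upper - lower) * (rank - prev_count) / in_bucket)
}
```

With buckets `[(0.5, 90), (1.0, 95), (2.0, 99), (5.0, 100), (+Inf, 100)]` the p99 estimate lands at the 2.0s bound, which shows how strongly the result depends on where bucket boundaries sit relative to the alert threshold.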

Warning alerts

| Alert name | Condition |
|---|---|
| cert_expiring_soon | cert_valid_until - time() < 7 days |
| dht_lookup_p99_elevated | p99 > 500ms for ≥ 10 min |
| challenge_failure_rate_high | challenge_failure_rate > 0.05 for ≥ 30 min |
| re_encryption_backlog | reencryption_jobs_pending > 1000 |
| anti_entropy_divergences_elevated | rate(divergences_detected[5m]) > 10/min |
| rate_limit_hits_high | rate(rate_limit_hits_total[5m]) > 100/min |

Info alerts

| Alert name | Condition |
|---|---|
| cert_renewing | cert_valid_until - time() < 30 days |
| new_partner_node_provisioned | Counter increment |
| lapse_timeline_started | lapse_stage transitions |
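
The three cert alerts (cert_renewing, cert_expiring_soon, cert_expired) form a ladder on the same gauge. The sketch below shows the thresholds as plain arithmetic on Unix timestamps; `CertAlert` and `classify_cert` are hypothetical names, and the real rules would be PromQL expressions.

```rust
#[derive(Debug, PartialEq, Eq)]
enum CertAlert {
    NoAlert,
    Renewing,     // info: less than 30 days left
    ExpiringSoon, // warning: less than 7 days left
    Expired,      // critical: valid_until is in the past
}

const DAY_SECS: i64 = 86_400;

/// Classifies a certificate by the time remaining until `valid_until_unix`.
fn classify_cert(valid_until_unix: i64, now_unix: i64) -> CertAlert {
    let remaining = valid_until_unix - now_unix;
    if remaining < 0 {
        CertAlert::Expired
    } else if remaining < 7 * DAY_SECS {
        CertAlert::ExpiringSoon
    } else if remaining < 30 * DAY_SECS {
        CertAlert::Renewing
    } else {
        CertAlert::NoAlert
    }
}
```

Checking the most severe condition first ensures a cert with three days left raises only the warning, not the info alert as well.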

Dashboards

Grafana dashboard layout

Dashboard 1 — Network Overview (org-wide):

| Panel | Metric | Visualization |
|---|---|---|
| Active nodes by tier | sum by (tier) (up) | Stacked bar |
| DHT lookup p50/p95/p99 (global aggregated) | Histogram quantiles | Time series |
| Peer connections distribution | peers_connected across nodes (a gauge, so plotted directly, not via histogram_quantile) | Heatmap |
| Geo distribution (nodes per zone) | sum by (geo_zone) (up) | World map |
| Active alerts | Alertmanager status | Table |

Dashboard 2 — Single Node Detail:

| Panel | Metric |
|---|---|
| Cert state + valid_until | cert_state, cert_valid_until_seconds |
| Storage usage by kind (donut) | storage_used_bytes by kind |
| DHT records per namespace (list) | dht_records_count by namespace |
| Recent log entries (5 min) | Loki query |
| Active peers list | (Custom: node info endpoint) |
| Mailbox depth top 10 identities | topk(10, mailbox_entries_total) |

Dashboard 3 — Storage / S3:

| Panel | Metric |
|---|---|
| Bytes in/out (rate) | rate(s3_bytes_*_total[5m]) |
| Operations rate (PUT/GET/DELETE) | rate(s3_requests_total[5m]) by operation |
| Auth failure rate | rate(presign_auth_failures_total[5m]) |
| 4xx / 5xx error rate | rate(s3_requests_total{result=~"4xx... |

Dashboard 4 — Replication:

| Panel | Metric |
|---|---|
| RF distribution | replication_factor_current histogram |
| Repair jobs backlog | replication_repair_jobs_pending |
| Push rate | rate(replication_push_total[5m]) |
| Friend-replication bonus active | replication_bonus_blobs |

Dashboard 5 — Admin / Billing:

| Panel | Metric |
|---|---|
| Webhook events / hour | increase(billing_events_received_total[1h]) |
| Cert issuance success rate | sum(rate(cert_issuances_total{result="success"}[5m])) / sum(rate(cert_issuances_total[5m])) |
| Tier 0 RPC latency | tier0_request_duration_seconds |
| Multi-sig collection time | tier0_signatures_collected histogram |
| REST API traffic | rate(rest_requests_total[5m]) by endpoint |

Dashboard JSON storage

In kontinuum-node/deploy/grafana/dashboards/:

deploy/grafana/dashboards/
├── 01-network-overview.json
├── 02-node-detail.json
├── 03-storage-s3.json
├── 04-replication.json
├── 05-admin-billing.json
└── README.md

The JSON files are version-controlled and importable into Grafana via provisioning.


SLO definitions

Service Level Objectives: measurable targets that form the basis for alerts.

| SLO | Target | Measurement window | Burn rate alert |
|---|---|---|---|
| Node availability | 99.9% | 30 days | 2h / 6h burn rate |
| DHT lookup p99 latency | < 500ms | 5 min rolling | If > 1s for 5 min |
| S3 presign latency p99 | < 100ms | 5 min rolling | If > 250ms for 5 min |
| Mailbox deposit success rate | 99.5% | 5 min rolling | If < 99% for 10 min |
| Replication factor compliance | 99.9% blobs at target_rf | 1h | If < 99% for 30 min |
| Cert renewal lead time | ≥ 7 days early | At renewal | N/A (info alert) |
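
For the availability SLO, the error budget and burn rate follow from simple arithmetic: 99.9% over 30 days leaves 0.1% of 43,200 minutes, about 43.2 minutes of allowed downtime. The sketch below shows the two calculations; the function names are hypothetical, and no particular burn-rate threshold is implied because this document only names the 2h / 6h windows.

```rust
/// Allowed downtime (error budget) in minutes for an availability SLO
/// expressed as a ratio (0.999 for 99.9%) over a window in days.
fn error_budget_minutes(slo: f64, window_days: f64) -> f64 {
    (1.0 - slo) * window_days * 24.0 * 60.0
}

/// Burn rate: observed error ratio divided by the error ratio the SLO allows.
/// A burn rate of 1.0 spends the budget exactly over the full window.
fn burn_rate(observed_error_ratio: f64, slo: f64) -> f64 {
    observed_error_ratio / (1.0 - slo)
}
```

As an example, an observed error ratio of 1.44% against the 99.9% target is a burn rate of about 14.4, meaning the 30-day budget would be gone in roughly two days if that rate persisted.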

Implementation checklist

  • [ ] In server/Cargo.toml + admin/Cargo.toml: metrics, metrics-exporter-prometheus, tracing, tracing-subscriber, optionally tracing-opentelemetry.
  • [ ] server/src/observability/mod.rs: register all metrics at startup.
  • [ ] Helper macros / functions for consistent identity_id_hash labeling.
  • [ ] Privacy linter: CI check that no identity_id or signing_key references appear in tracing::* calls (custom Clippy lint or grep).
  • [ ] Grafana dashboards JSON in deploy/grafana/dashboards/.
  • [ ] Alertmanager rules in deploy/prometheus/alerts.yml.
  • [ ] Recording rules for aggregations in deploy/prometheus/rules.yml.
  • [ ] Tests: metrics emitted correctly, verified by unit tests + integration assertions.
  • [ ] Loki / Vector config example in deploy/observability/log-shipping.yml.

Open implementation questions

  1. OTLP vs Prometheus + Loki separately. OTLP is unified (metrics + logs + traces in one protocol) but adds a collector dependency. Decision: Prometheus + Loki for v1.0 (simpler ops); OTLP migration is post-v1.0.
  2. Sampling strategy for traces. 1% baseline + 100% of errors is standard, but it can miss low-frequency slow requests. Tail-sampling collector?
  3. High-cardinality labels. identity_id_hash (truncated to 4 bytes) allows ~4B (2^32) unique values, which is too high for Prometheus. The alternative is to emit metrics without an identity label and aggregate through logs / traces only. Decision: identity labels only in logs/traces, none in metrics (revise the inventory above).
  4. Per-tenant metrics in family-mode. Each tenant is a separate label value, so cardinality is at risk with many tenants. Decision: aggregate per host node; individual breakdown goes through logs.
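
The cardinality concern in question 3 can be made concrete with two small calculations: the size of a truncated hash space, and the worst-case series count of a metric as the product of its label cardinalities. This is an illustrative sketch; the function names and the example label counts are assumptions.

```rust
/// Number of distinct values a hash truncated to `bytes` bytes can take.
/// Returns None if the result does not fit in a u128.
fn truncated_hash_space(bytes: u32) -> Option<u128> {
    bytes.checked_mul(8).and_then(|bits| 1u128.checked_shl(bits))
}

/// Worst-case number of time series for one metric: the product of the
/// number of distinct values of each label.
fn max_series(label_cardinalities: &[u64]) -> u128 {
    label_cardinalities.iter().map(|&c| c as u128).product()
}
```

A 4-byte hash gives 4,294,967,296 possible label values, whereas a metric with, say, 4 namespaces and 3 result values tops out at 12 series, which is why identity-scoped data fits logs and traces better than Prometheus labels.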