Kontinuum Node — Observability

Metrics inventory, log conventions, tracing spans, and alerts. A P1 prerequisite for production readiness and SRE onboarding.

Audience: SRE / DevOps · node developers (where to emit metrics) · ops engineers (dashboards).

Related documents:

operations.md: §15.1 Grafana + Directus + operator chat bot stack; §16.1 DR plan
configuration.md: [observability] section
api-contracts.md: /metrics endpoint exposure
testing.md: performance benchmarks tied to the SLOs

Stack

Tier	Component	Where it lives
Metrics export	Prometheus exporter (`metrics-exporter-prometheus`)	`kontinuum-node-server` + `kontinuum-node-admin`
Metrics storage	Prometheus	Centralized (org-side)
Visualization	Grafana	Centralized
Logs	Structured JSON logs → stdout	Per-process, collected via Loki / Vector / fluentbit
Traces	OpenTelemetry (OTLP)	Centralized OTEL collector (optional)
Alerts	Alertmanager (Prometheus stack)	Routes to operator chat bot + PagerDuty / Slack

Rust ecosystem:

metrics 0.24: abstraction layer for metrics (counter / gauge / histogram).
metrics-exporter-prometheus 0.16: HTTP exporter.
tracing 0.1 + tracing-subscriber 0.3: structured logs + spans.
tracing-opentelemetry (optional): OTLP integration.

Metrics inventory

All node metrics are namespaced as kontinuum_node_<subsystem>_<metric_name>; admin-process metrics use the kontinuum_admin_ prefix instead. Labels are kept minimal to avoid cardinality explosion.
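The naming rule above can be enforced with a small helper. The following is a minimal sketch, not the project's actual code: the function names and the conservative character check are assumptions made for illustration.

```rust
/// Prefix shared by node-process metrics, as stated in the naming rule above.
const NODE_PREFIX: &str = "kontinuum_node";

/// Returns true if `part` is non-empty and contains only lowercase ASCII
/// letters, digits, and underscores (a conservative subset of what
/// Prometheus accepts in metric names; the restriction is an assumption).
fn is_valid_part(part: &str) -> bool {
    !part.is_empty()
        && part
            .chars()
            .all(|c| c.is_ascii_lowercase() || c.is_ascii_digit() || c == '_')
}

/// Builds `kontinuum_node_<subsystem>_<metric_name>`, or returns `None`
/// if either part fails the character check above.
fn metric_name(subsystem: &str, name: &str) -> Option<String> {
    if is_valid_part(subsystem) && is_valid_part(name) {
        Some(format!("{}_{}_{}", NODE_PREFIX, subsystem, name))
    } else {
        None
    }
}
```

For example, `metric_name("dht", "lookups_total")` yields the first name in the Network / DHT table, while a subsystem such as `"DHT"` is rejected.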

Network / DHT

Metric	Type	Labels	Description
`kontinuum_node_dht_lookups_total`	counter	`dht_namespace`, `result`	DHT lookups by namespace, success/timeout/error
`kontinuum_node_dht_lookup_duration_seconds`	histogram	`dht_namespace`	DHT lookup latency distribution
`kontinuum_node_dht_records_count`	gauge	`dht_namespace`	Records stored per namespace
`kontinuum_node_dht_routing_table_size`	gauge	`dht_namespace`	k-buckets entry count per namespace
`kontinuum_node_dht_put_total`	counter	`dht_namespace`, `result`	DhtPut operations
`kontinuum_node_dht_quarantined_records`	gauge	`dht_namespace`	Records in quarantine for atomic transactions
`kontinuum_node_peers_connected`	gauge	`tier`	Live peer connections by tier
`kontinuum_node_peers_discovered_total`	counter		Total peers ever discovered
`kontinuum_node_peer_handshakes_total`	counter	`result`	NodeHello handshakes — accepted/rejected

Storage

Metric	Type	Labels	Description
`kontinuum_node_storage_used_bytes`	gauge	`kind`	Used storage by kind (dht/cache/blobs/mailbox)
`kontinuum_node_storage_capacity_bytes`	gauge	`kind`	Capacity by kind
`kontinuum_node_s3_requests_total`	counter	`operation`, `result`	PUT/GET/DELETE/HEAD — succeeded/4xx/5xx
`kontinuum_node_s3_request_duration_seconds`	histogram	`operation`	Request latency
`kontinuum_node_s3_bytes_read_total`	counter		Read traffic out (egress)
`kontinuum_node_s3_bytes_written_total`	counter		Write traffic in (ingress)
`kontinuum_node_presign_requests_total`	counter	`result`	Auth-shim presign requests
`kontinuum_node_presign_auth_failures_total`	counter	`reason`	Invalid signature / not member / etc.

Replication

Metric	Type	Labels	Description
`kontinuum_node_replication_factor_current`	histogram	`space_id`	Actual RF distribution per space (bucketed)
`kontinuum_node_replication_repair_jobs_pending`	gauge		Pending repair tasks
`kontinuum_node_replication_push_total`	counter	`result`	ReplicaPush operations
`kontinuum_node_replication_push_duration_seconds`	histogram		Push latency
`kontinuum_node_replication_bonus_blobs`	gauge		Blobs replicated via the friend-replication bonus

Mailbox

Metric	Type	Labels	Description
`kontinuum_node_mailbox_entries_total`	gauge	`identity_id_hash`	Inbox depth per identity (hashed label to bound cardinality)
`kontinuum_node_mailbox_bytes_used`	gauge	`identity_id_hash`	Bytes used per identity
`kontinuum_node_mailbox_deposits_total`	counter	`msg_type`, `result`	Inbound deposits
`kontinuum_node_mailbox_gc_entries_removed_total`	counter		GC sweep removals
`kontinuum_node_mailbox_cursor_updates_total`	counter		Cursor sync events
`kontinuum_node_device_cursors_active`	gauge		Live device cursors

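Note on the identity_id_hash label: never expose a raw identity_id in metrics (privacy). Use blake3(identity_id)[:8], the first 8 hex characters (4 bytes) of the digest, as in b3hash:abc12345. This gives enough cardinality for troubleshooting and does not leak the full identity.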
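The truncation described in the note can be sketched as a small formatting helper. This sketch assumes the caller has already computed the BLAKE3 digest (for example with the `blake3` crate) and reads `[:8]` as 8 hex characters, which matches the `b3hash:abc12345` example; the function name is hypothetical.

```rust
/// Formats the identity label from an already-computed digest.
/// The caller is assumed to pass the bytes of `blake3(identity_id)`;
/// this sketch does not compute the hash itself.
///
/// The result is `b3hash:` followed by the first 8 hex characters
/// (4 bytes) of the digest.
fn identity_label(digest: &[u8]) -> Option<String> {
    // Fewer than 4 bytes cannot yield 8 hex characters.
    let head = digest.get(..4)?;
    let mut out = String::from("b3hash:");
    for byte in head {
        out.push_str(&format!("{:02x}", byte));
    }
    Some(out)
}
```

Keeping the formatting in one place means every metric and log line produces the same label for the same identity.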

Cert / Lifecycle

Metric	Type	Labels	Description
`kontinuum_node_cert_state`	gauge	`state`	Current cert state (1 = in this state, 0 = not)
`kontinuum_node_cert_valid_until_seconds`	gauge		Unix epoch time at which the cert expires
`kontinuum_node_cert_revocations_received_total`	counter	`source`	CRL updates received (dht / gossip / direct)
`kontinuum_node_lapse_stage`	gauge	`stage`	Active lapse stage (1 = active, 0 = inactive)
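The `state` and `stage` gauges above use a one-hot encoding: one series per label value, with exactly one of them set to 1. A minimal sketch of producing those samples follows; the state names are hypothetical, since this document does not enumerate them.

```rust
/// Hypothetical cert state names; the document does not list the real ones.
const CERT_STATES: [&str; 4] = ["valid", "expiring", "expired", "revoked"];

/// Returns one `(state, value)` sample per known state: 1.0 for the
/// current state and 0.0 for every other one.
fn cert_state_samples(current: &str) -> Vec<(&'static str, f64)> {
    CERT_STATES
        .iter()
        .map(|&state| (state, if state == current { 1.0 } else { 0.0 }))
        .collect()
}
```

Emitting the zero-valued series explicitly, rather than omitting them, keeps every state visible in queries and avoids stale series lingering at 1 after a transition.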

Anti-entropy

Metric	Type	Labels	Description
`kontinuum_node_anti_entropy_cycles_total`	counter	`result`	AE gossip cycles
`kontinuum_node_anti_entropy_divergences_detected_total`	counter		Mismatch root hashes detected
`kontinuum_node_anti_entropy_records_reconciled_total`	counter		Records pushed/pulled via AE
`kontinuum_node_anti_entropy_cycle_duration_seconds`	histogram		Time per cycle

Challenge / PoS

Metric	Type	Labels	Description
`kontinuum_node_challenges_issued_total`	counter		Challenges this node issued
`kontinuum_node_challenges_responded_total`	counter	`outcome`	Responses (passed/failed/timeout)
`kontinuum_node_challenge_pass_rate`	gauge	`peer_id_hash`	Rolling pass rate per peer (for targeting)
`kontinuum_node_peer_scores`	histogram		Reputation distribution
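One way to compute a rolling pass rate like `kontinuum_node_challenge_pass_rate` is a fixed-size window of recent outcomes per peer. This is a sketch under assumptions: the document does not say whether the window is count-based or time-based, and the type name and window size are hypothetical.

```rust
use std::collections::VecDeque;

/// Pass rate over the most recent `capacity` challenge outcomes for one peer.
/// A count-based window is an assumption made for this sketch.
struct RollingPassRate {
    outcomes: VecDeque<bool>,
    capacity: usize,
}

impl RollingPassRate {
    fn new(capacity: usize) -> Self {
        Self { outcomes: VecDeque::with_capacity(capacity), capacity }
    }

    /// Records one outcome, evicting the oldest once the window is full.
    fn record(&mut self, passed: bool) {
        if self.capacity == 0 {
            return;
        }
        if self.outcomes.len() == self.capacity {
            self.outcomes.pop_front();
        }
        self.outcomes.push_back(passed);
    }

    /// Fraction of passed challenges in the window, or `None` if it is empty.
    fn rate(&self) -> Option<f64> {
        if self.outcomes.is_empty() {
            return None;
        }
        let passed = self.outcomes.iter().filter(|&&p| p).count();
        Some(passed as f64 / self.outcomes.len() as f64)
    }
}
```

Returning `None` for an empty window lets the caller skip the gauge update instead of reporting a misleading 0 for a peer that has not been challenged yet.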

Rendezvous

Metric	Type	Labels	Description
`kontinuum_node_rendezvous_publishes_total`	counter	`result`	Presence publish requests
`kontinuum_node_rendezvous_lookups_total`	counter	`result`	Lookup requests
`kontinuum_node_rendezvous_active_presences`	gauge		Currently-live presence tokens
`kontinuum_node_rendezvous_rate_limit_hits_total`	counter	`bucket`	429 responses (per-IP, per-identity)

Re-encryption

Metric	Type	Description
`kontinuum_node_reencryption_jobs_pending`	gauge	Pending re-encryption tasks
`kontinuum_node_reencryption_blobs_processed_total`	counter	Successfully re-encrypted
`kontinuum_node_reencryption_duration_seconds`	histogram	Per-blob re-encryption time

Resource utilization

Metric	Type	Labels	Description
`kontinuum_node_cpu_usage_pct`	gauge		Process CPU %
`kontinuum_node_memory_bytes`	gauge	`kind`	RSS / VSZ / heap
`kontinuum_node_open_connections`	gauge		TCP/QUIC open connections
`kontinuum_node_goroutines_count`	-	-	N/A for Rust
`kontinuum_node_tokio_workers_busy_pct`	gauge		Tokio runtime utilization
`kontinuum_node_db_pool_connections_in_use`	gauge	`db`	r2d2 pool state
`kontinuum_node_db_query_duration_seconds`	histogram	`operation`	DB query latency

Admin / Billing (admin process only)

Metric	Type	Labels	Description
`kontinuum_admin_billing_events_received_total`	counter	`event_type`, `result`	Webhook events
`kontinuum_admin_billing_processing_duration_seconds`	histogram	`event_type`	Processing time
`kontinuum_admin_cert_issuances_total`	counter	`tier`, `result`	Cert issuance flow
`kontinuum_admin_tier0_request_duration_seconds`	histogram		RPC latency to anchors
`kontinuum_admin_tier0_signatures_collected`	histogram		Multi-sig collection (1, 2, 3, ...)
`kontinuum_admin_rest_requests_total`	counter	`endpoint`, `method`, `status`	REST API traffic
`kontinuum_admin_rest_request_duration_seconds`	histogram	`endpoint`	Latency per endpoint

Log conventions

Format

Production uses JSON only (log_format = "json" in the config). The text format is for dev only.

Each record is a single line of JSON with no pretty-printing (the example below is expanded across lines only for readability). The trace_id and span_id fields are present only when OTLP is enabled.

json

{
    "timestamp": "2026-05-19T10:30:00.123Z",
    "level": "info",
    "target": "kontinuum_node_server::dht::global",
    "message": "Peer connected",
    "span": {
        "name": "handle_peer_connect",
        "peer_id": "12D3KooW...",
        "tier": 1
    },
    "fields": {
        "remote_addr": "10.0.0.5:4001",
        "protocol_version": 1
    },
    "trace_id": "abc123...",
    "span_id": "def456..."
}
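To illustrate the single-line requirement, the sketch below renders a subset of the fields above as one JSON line using only the standard library. It is illustrative only: in practice the JSON formatter of `tracing-subscriber` would normally produce these records, and the function names here are hypothetical.

```rust
/// Escapes a string for embedding in a JSON string literal, so that
/// newlines and quotes in a message cannot break the one-record-per-line rule.
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            // Remaining control characters use the \uXXXX form.
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Renders one log record as a single JSON line with four of the
/// top-level fields from the example record.
fn render_log_line(timestamp: &str, level: &str, target: &str, message: &str) -> String {
    format!(
        "{{\"timestamp\":\"{}\",\"level\":\"{}\",\"target\":\"{}\",\"message\":\"{}\"}}",
        json_escape(timestamp),
        json_escape(level),
        json_escape(target),
        json_escape(message)
    )
}
```

Escaping embedded newlines is what keeps line-oriented shippers such as Loki, Vector, or fluentbit from splitting one record into several.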

Privacy in logs

Never log:

Raw identity_id (use the truncated hash blake3(identity_id)[:8] instead)
Raw signing_key (never)
Plaintext blob content (только hashes)
BIP39 recovery phrases
HMAC / API secrets

OK to log:

node_id (public)
peer_id (libp2p public)
Cert metadata (issued_at, valid_until, tier)
Blob hashes (random-looking, no identity leak)
Truncated identity hash (b3hash:abc12345)
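The two lists above lend themselves to a simple deny-list check on structured field names, for example in a test or a CI step. The sketch below is an assumption-laden illustration: `recovery_phrase` and `api_secret` are hypothetical field names standing in for the BIP39 and HMAC / API secret entries, and exact-name matching is only a first line of defense.

```rust
/// Field names that must never appear in a log record.
/// `recovery_phrase` and `api_secret` are hypothetical names chosen
/// for this sketch; only `identity_id` and `signing_key` come from the text.
const FORBIDDEN_FIELDS: [&str; 4] =
    ["identity_id", "signing_key", "recovery_phrase", "api_secret"];

/// Returns the forbidden field names present in `fields`, in input order.
/// Matching is exact, so the allowed `identity_id_hash` is not flagged.
fn forbidden_fields<'a>(fields: &[&'a str]) -> Vec<&'a str> {
    fields
        .iter()
        .copied()
        .filter(|name| FORBIDDEN_FIELDS.iter().any(|f| f == name))
        .collect()
}
```

A check like this catches forbidden field names only; it cannot detect a raw identity smuggled into a free-form message string.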

Levels

Level	When
`trace`	Per-frame wire details, fine-grained DHT routing decisions (dev only)
`debug`	Internal state transitions, not emitted in production by default
`info`	Significant events: peer connect/disconnect, cert lifecycle, GC stats
`warn`	Recoverable anomalies: rate-limit hits, transient peer failures, slow queries
`error`	Unrecoverable per-request errors, requires investigation
`fatal`	Process unable to continue; exit immediately (`tracing` has no `fatal` level, so this would have to be emitted at `error`)

Structured fields (consistent across messages)

Field	Type	Example	Notes
`node_id`	string	`1f8b...`	Always present (own node id)
`peer_id`	string	`12D3KooW...`	When the event involves a remote peer
`identity_id_hash`	string	`b3hash:abc12345`	Truncated, never the full identity_id
`space_id`	string	`1f8b...`	For Space-scoped events
`blob_hash`	string	`f3e5...`	For blob-scoped events
`request_id`	string	`req_xyz789`	Correlates an admin REST request with its log entries
`error.kind`	string	`INVALID_SIGNATURE`	Machine-parseable error class
`error.cause`	string	`signature mismatch`	Human-readable

Tracing spans

OpenTelemetry-compatible tracing for cross-process flows (admin REST → Tier 0 RPC → DHT publish).

Span hierarchy

admin_rest_request{ endpoint=/nodes/provision }
├── billing_lookup_subscription
│   └── http_outbound{ url=billing-api.com }
├── tier0_multi_sig_collection
│   ├── tier0_anchor_sign{ anchor_id=1 }
│   │   └── http_outbound
│   ├── tier0_anchor_sign{ anchor_id=2 }
│   └── tier0_anchor_sign{ anchor_id=3 }
└── dht_publish{ key=node:1f8b... }
    └── kademlia_provider_record_put

Sampling

Production: 1% baseline sampling + 100% for errors / slow requests (>1s).
Staging: 10% sampling.
Dev: 100%.
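The sampling rules above can be expressed as a single decision function. This is a sketch, not the project's sampler: it assumes the decision is made once the request outcome and duration are known (tail-style), that `trace_id_low` holds uniformly distributed bits of the trace id, and that the "always keep errors and slow requests" rule applies in every environment, which the text states only for production.

```rust
use std::time::Duration;

#[derive(Clone, Copy)]
enum Environment {
    Production,
    Staging,
    Dev,
}

/// Decides whether to keep a trace.
/// `trace_id_low` is assumed to be uniformly distributed, so `% 100`
/// approximates a percentage bucket and gives a deterministic decision
/// for all spans that share the same trace id.
fn should_sample(env: Environment, is_error: bool, duration: Duration, trace_id_low: u64) -> bool {
    // Errors and slow requests (> 1 s) are always kept (assumed for all environments).
    if is_error || duration > Duration::from_secs(1) {
        return true;
    }
    let baseline_pct = match env {
        Environment::Production => 1,
        Environment::Staging => 10,
        Environment::Dev => 100,
    };
    trace_id_low % 100 < baseline_pct
}
```

Deriving the decision from the trace id rather than a random number keeps every process in a cross-process flow consistent about whether a given trace is kept.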

Span attributes

Span attributes inherit the log conventions above: the same field names are used for consistency.

Alerts

Severity model

Level	Action	Examples
`info`	Slack notification, no page	Cert renewing soon (30 days warning)
`warning`	Slack + operator chat bot (operator on duty)	Slow query, peer churn rate elevated
`critical`	Page (PagerDuty) + Slack + operator chat bot	Node down, cert revocation failed

Critical alerts

Alert name	Condition	Action
`node_down`	`up{job="kontinuum-node"} == 0` for ≥ 1 min	Page on-call SRE
`dht_lookup_p99_high`	`histogram_quantile(0.99, ...)` > 2s for ≥ 5 min	Page
`mailbox_storage_full`	`kontinuum_node_mailbox_bytes_used / mailbox_capacity > 0.95`	Page (data loss risk)
`cert_expired`	`kontinuum_node_cert_valid_until_seconds < time()`	Page
`tier0_unreachable`	All anchor health checks fail for ≥ 2 min	Page (cert issuance broken)
`disk_full_imminent`	`node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.05`	Page
`peer_count_low`	`kontinuum_node_peers_connected < 5` for ≥ 10 min	Page (isolation risk)
`replication_factor_degraded`	`histogram_quantile(0.50, replication_factor) < target_rf - 1`	Page (durability risk)
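Several conditions above carry a duration qualifier (for example, ≥ 10 min for `peer_count_low`). In Prometheus this corresponds to the `for` clause of an alerting rule: the alert fires only after the condition has held on every evaluation for that long. The sketch below models that behavior in isolation; the type and method names are hypothetical.

```rust
use std::time::Duration;

/// Tracks whether a condition has held continuously for at least `hold`.
struct ForClause {
    hold: Duration,
    /// Time (seconds since an arbitrary epoch) at which the condition
    /// most recently became true, if it is currently true.
    since: Option<u64>,
}

impl ForClause {
    fn new(hold: Duration) -> Self {
        Self { hold, since: None }
    }

    /// Feeds one evaluation at time `now_secs`; returns true when firing.
    fn evaluate(&mut self, condition: bool, now_secs: u64) -> bool {
        if !condition {
            // Any false evaluation resets the timer.
            self.since = None;
            return false;
        }
        let start = *self.since.get_or_insert(now_secs);
        now_secs.saturating_sub(start) >= self.hold.as_secs()
    }
}
```

The reset on a single false evaluation is why a flapping condition never pages: the hold period restarts each time the condition clears.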

Warning alerts

Alert name	Condition
`cert_expiring_soon`	`cert_valid_until - time() < 7 days`
`dht_lookup_p99_elevated`	`p99 > 500ms` for ≥ 10 min
`challenge_failure_rate_high`	`challenge_failure_rate > 0.05` for ≥ 30 min
`re_encryption_backlog`	`reencryption_jobs_pending > 1000`
`anti_entropy_divergences_elevated`	`rate(divergences_detected[5m]) > 10/min`
`rate_limit_hits_high`	`rate(rate_limit_hits_total[5m]) > 100/min`

Info alerts

Alert name	Condition
`cert_renewing`	`cert_valid_until - time() < 30 days`
`new_partner_node_provisioned`	Counter increment
`lapse_timeline_started`	`lapse_stage` transitions
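The three certificate alerts (`cert_expired`, `cert_expiring_soon`, `cert_renewing`) share one input, the time remaining until `cert_valid_until`. The sketch below maps that value to the most severe matching alert; the enum and function names are hypothetical, and the real rules would be evaluated by Prometheus, where the lower-severity conditions remain true at the same time.

```rust
#[derive(Debug, PartialEq)]
enum CertAlert {
    Healthy,
    Renewing,     // info: less than 30 days remaining
    ExpiringSoon, // warning: less than 7 days remaining
    Expired,      // critical: valid_until is in the past
}

const DAY_SECS: i64 = 86_400;

/// Classifies a cert by remaining validity; both arguments are Unix epoch seconds.
/// Returns only the most severe matching level.
fn classify_cert(valid_until: i64, now: i64) -> CertAlert {
    let remaining = valid_until - now;
    if remaining < 0 {
        CertAlert::Expired
    } else if remaining < 7 * DAY_SECS {
        CertAlert::ExpiringSoon
    } else if remaining < 30 * DAY_SECS {
        CertAlert::Renewing
    } else {
        CertAlert::Healthy
    }
}
```

Because the thresholds nest, a setup that wants only one notification per cert would need inhibition rules so that the higher severity suppresses the lower ones.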

Dashboards

Grafana dashboard layout

Dashboard 1 — Network Overview (org-wide):

Panel	Metric	Visualization
Active nodes by tier	`sum by (tier) (up)`	Stacked bar
DHT lookup p50/p95/p99 (global aggregated)	Histogram quantiles	Time series
Peer connections distribution	`histogram_quantile(peers_connected)`	Heatmap
Geo distribution (nodes per zone)	`sum by (geo_zone) (up)`	World map
Active alerts	Alertmanager status	Table

Dashboard 2 — Single Node Detail:

Panel	Metric
Cert state + valid_until	`cert_state`, `cert_valid_until_seconds`
Storage usage by kind (donut)	`storage_used_bytes by kind`
DHT records per namespace (list)	`dht_records_count by namespace`
Recent log entries (5 min)	Loki query
Active peers list	(Custom — node info endpoint)
Mailbox depth top 10 identities	`topk(10, mailbox_entries_total)`

Dashboard 3 — Storage / S3:

Panel	Metric
Bytes in/out (rate)	`rate(s3_bytes_*_total[5m])`
Operations rate (PUT/GET/DELETE)	`rate(s3_requests_total[5m]) by operation`
Auth failure rate	`rate(presign_auth_failures_total[5m])`
4xx / 5xx error rate	`rate(s3_requests_total{result=~"4xx|5xx"}[5m])`

Dashboard 4 — Replication:

Panel	Metric
RF distribution	`replication_factor_current` histogram
Repair jobs backlog	`replication_repair_jobs_pending`
Push rate	`rate(replication_push_total[5m])`
Friend-replication bonus active	`replication_bonus_blobs`

Dashboard 5 — Admin / Billing:

Panel	Metric
Webhook events / hour	`rate(billing_events_received_total[1h])`
Cert issuance success rate	`rate(cert_issuances_total{result="success"}) / rate(cert_issuances_total)`
Tier 0 RPC latency	`tier0_request_duration_seconds`
Multi-sig collection time	`tier0_signatures_collected` histogram
REST API traffic	`rate(rest_requests_total[5m]) by endpoint`

Dashboard JSON storage

In kontinuum-node/deploy/grafana/dashboards/:

deploy/grafana/dashboards/
├── 01-network-overview.json
├── 02-node-detail.json
├── 03-storage-s3.json
├── 04-replication.json
├── 05-admin-billing.json
└── README.md

The JSON files are version-controlled and can be imported into Grafana via provisioning.

SLO definitions

Service Level Objectives are measurable targets that form the basis for alerts.

SLO	Target	Measurement window	Burn rate alert
Node availability	99.9%	30 days	2h / 6h burn rate
DHT lookup p99 latency	< 500ms	5 min rolling	If > 1s for 5 min
S3 presign latency p99	< 100ms	5 min rolling	If > 250ms for 5 min
Mailbox deposit success rate	99.5%	5 min rolling	If < 99% for 10 min
Replication factor compliance	99.9% blobs at target_rf	1h	If < 99% for 30 min
Cert renewal lead time	≥ 7 days early	At renewal	N/A (info alert)
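For the availability row, the target translates into an error budget: 99.9% over 30 days allows 0.1% of 2,592,000 s, that is 2,592 s or 43.2 minutes of downtime. The burn rate compares the observed error ratio with the budgeted one. The sketch below shows both calculations; the function names are hypothetical, and the thresholds behind the "2h / 6h burn rate" alert are not specified in this document.

```rust
/// Allowed "bad" seconds in a window for a target given as a fraction in [0, 1].
fn error_budget_secs(target: f64, window_secs: f64) -> f64 {
    (1.0 - target) * window_secs
}

/// Burn rate: observed error ratio divided by the budgeted error ratio.
/// A value of 1.0 consumes the budget exactly by the end of the window;
/// a value of 10.0 consumes it ten times faster.
fn burn_rate(observed_error_ratio: f64, target: f64) -> f64 {
    observed_error_ratio / (1.0 - target)
}
```

Alerting on burn rate over two window lengths is a common way to catch both fast outages and slow leaks without paging on brief blips.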

Implementation checklist

[ ] In server/Cargo.toml + admin/Cargo.toml: metrics, metrics-exporter-prometheus, tracing, tracing-subscriber, optionally tracing-opentelemetry.
[ ] server/src/observability/mod.rs: register all metrics at startup.
[ ] Helper macros / functions for consistent identity_id_hash labeling.
[ ] Privacy linter: CI check that no identity_id or signing_key references appear in tracing::* calls (custom Clippy lint or grep).
[ ] Grafana dashboards JSON in deploy/grafana/dashboards/.
[ ] Alerting rules (evaluated by Prometheus, routed via Alertmanager) in deploy/prometheus/alerts.yml.
[ ] Recording rules for aggregations in deploy/prometheus/rules.yml.
[ ] Tests: verify that metrics are emitted correctly via unit tests + integration assertions.
[ ] Loki / Vector config example in deploy/observability/log-shipping.yml.

Open implementation questions

OTLP vs Prometheus + Loki separately. OTLP is unified (metrics + logs + traces in one protocol) but adds a collector dependency. Decision: Prometheus + Loki for v1.0 (simpler ops); OTLP migration is post-v1.0.
Sampling strategy for traces. 1% baseline + 100% errors is the standard approach, but it can miss low-frequency slow requests. Tail-sampling collector?
High-cardinality labels. identity_id_hash (4-byte truncated) yields ~4B unique values, which is too high for Prometheus. The alternative is to emit metrics without an identity label and break down by identity only through logs / traces. Decision: identity labels only in logs/traces, none in metrics (revise the inventory above).
Per-tenant metrics in family-mode. Each tenant is a separate label value, which is a cardinality risk with many tenants. Decision: aggregate per host node; individual breakdown through logs.
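The "~4B unique values" figure in the cardinality question follows from the label format: 8 hex characters give 16^8 = 2^32 = 4,294,967,296 possible values. A small sketch of that arithmetic, with a hypothetical function name:

```rust
/// Number of distinct values of a label made of `hex_chars` hexadecimal
/// characters, or `None` if the result does not fit in a `u64`.
fn hex_label_cardinality(hex_chars: u32) -> Option<u64> {
    16u64.checked_pow(hex_chars)
}
```

This is an upper bound on the label space; the number of series actually created is bounded by the number of identities a node sees, but each labeled metric multiplies it again.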
