👁

OBS

P0 · Foundation · Planned · 9 FRs at 10/10 audit · ships in parallel with BRAIN/AUTH waves
Owner: CTO seat → interim CEO

The observability spine. Every module emits through here. The question "is CyberOS healthy?" has exactly one answer source — and so does "what did the AI just decide on behalf of this tenant?".

OBS plays three strategic roles simultaneously. Role 1, three-pillars unified pane: each of the 22 modules emits OpenTelemetry; logs (Loki), metrics (Prometheus), traces (Tempo), and AI traces (LangSmith) all correlate by trace_id and tenant_id; one dashboard answers "what happened" across the full request path from external MCP client to Postgres write. Role 2, auto-runbook router: alerts don't page people directly; they trigger a CUO triage skill that consults the runbook catalogue, suggests the first step, and creates a BRAIN audit row; only un-runbookable alerts page on-call, and the rest become self-service tickets in CHAT. Role 3, compliance evidence surface: a read-only view of the BRAIN audit chain, scoped per-regulator, satisfies EU AI Act Art. 12 (decision logging), PDPL Art. 14 (DSAR transparency), and SOC 2 CC7.2 (monitoring) without exposing the underlying operational data. The planned implementation lives at cyberos/services/obs/; the P0 · slice 1 build sequence puts OBS alongside AI Gateway as one of the first two P0 modules to ship.

OBS is the shared telemetry plane: logs, metrics, traces, and AI-trace observability for every CyberOS module. Operationally, OpenTelemetry SDKs in every service ship to a single OTel collector that fans out — logs to Loki, metrics to Prometheus, traces to Tempo. Grafana renders dashboards (per-module SLO, per-tenant cost, per-region health). LangSmith captures full LLM call traces independently from the operational pipeline so AI debugging doesn't require correlating across three tools. Alert Manager fans critical alerts to PagerDuty, mid alerts to #cyberos-alerts, low signals into the CUO morning digest. The audit chain — owned by BRAIN — is exposed via a separate read-only OBS surface for regulators (PDPL Art. 14, EU AI Act Art. 12). Tenant scoping is enforced at the query proxy so a member of tenant A cannot see tenant B's logs.

Strategic role: Observability spine (3 pillars + AI traces + audit)
Status: Planned (P0 · design phase · slice 1)
Stack: LGTM + LangSmith (Loki · Grafana · Tempo · Prometheus)
Trace exporter: OTel SDK (in every service binary)
Correlation key: trace_id × tenant_id (propagated via W3C TraceContext)
Retention: 7d / 90d / 1y (hot / warm / cold tiers)
Tenant isolation: Proxy-enforced (no cross-tenant reads)
SLO targets: ≥ 99.5% (platform composite SLO)
Auto-runbook coverage: ≥ 60% by P1 (alerts → CUO triage skill)
Compliance surfaces: EU AI Act · PDPL · SOC 2 (read-only regulator views)
Depends on: BRAIN + AUTH (P0 · slice 2) · AI Gateway (P0 · slice 1)
Used by: All 22 modules (every service emits OTel)
0

The bigger picture — three strategic roles

OBS is one of the two earliest P0 modules to ship (P0 · slice 1, alongside AI Gateway), because any debugging requires it. The naive read is "it's the Grafana installation." The real read is: this is the protocol-level guarantee that any incident — operational, AI-decision, or compliance — has exactly one investigation surface; not three competing dashboards from three vendors.

Role 1 · Three-pillars unified pane
Logs + metrics + traces + AI traces, one trace_id

Every module emits OpenTelemetry. The collector fans out to Loki (logs), Prometheus (metrics), Tempo (operational traces), LangSmith (AI traces). All four are correlated by W3C TraceContext trace_id + tenant_id. One click in Grafana takes you from a slow API trace to the LLM call inside it to the BRAIN audit row that recorded the decision. No more "the AI did something weird, but I can only see the HTTP request."

Role 2 · Auto-runbook router
Alerts route through CUO before paging humans

An alert fires. Before PagerDuty pages someone, the alert hits CUO's obs.triage-alert@1 skill. CUO consults the runbook catalogue (in KB), suggests the first step, attaches the relevant traces + dashboard links, and either (a) creates a self-service ticket in CHAT if confidence ≥ 0.70 and severity is P2 or lower, or (b) escalates to PagerDuty with the suggested first step pre-attached. Target: ≥ 60% of alerts auto-runbookable by P1; on-call load drops, the runbook catalogue grows, and the audit trail is complete.

Role 3 · Compliance evidence surface
Read-only regulator view of the BRAIN audit chain

Auditors don't need access to Postgres. They need a scoped, time-bounded, exportable view of the audit chain that proves "the AI made this decision at this time under this persona, and a human confirmed it before it executed." OBS surfaces this via the /compliance/{regulator} read-only endpoint per regulator (EU AI Act / PDPL / SOC 2 / ISO 27001). Each regulator sees only the chained audit rows + retention metadata they're entitled to.

OBS in the runtime — every module emits, OBS correlates

flowchart TB
  subgraph emitters["Emitters (all 22 modules)"]
    CUO["🎯 CUO"]
    BRAIN["🧠 BRAIN"]
    SKILL["🛠 Skill"]
    AI["⚡ AI Gateway"]
    AUTH["🔐 AUTH"]
    MCP["🔌 MCP"]
    PROJ["📋 PROJ"]
    N["… 15 more"]
  end
  COLLECT["OTel Collector<br/>(W3C TraceContext)"]
  OBS["👁 OBS<br/>LGTM + LangSmith"]
  subgraph stores["Stores (per pillar)"]
    LOKI["Loki<br/>logs"]
    PROM["Prometheus<br/>metrics"]
    TEMPO["Tempo<br/>traces"]
    LS["LangSmith<br/>AI traces"]
  end
  subgraph consumers["Consumers"]
    GRAF["Grafana dashboards"]
    ALERT["Alert Manager"]
    CUO_TRIAGE["🎯 CUO<br/>obs.triage-alert@1"]
    PAGER["PagerDuty"]
    CHAT["💬 CHAT digest"]
    COMP["Compliance views<br/>(EU AI Act / PDPL / SOC 2)"]
  end
  CUO --> COLLECT
  BRAIN --> COLLECT
  SKILL --> COLLECT
  AI --> COLLECT
  AUTH --> COLLECT
  MCP --> COLLECT
  PROJ --> COLLECT
  N --> COLLECT
  COLLECT --> OBS
  OBS --> LOKI
  OBS --> PROM
  OBS --> TEMPO
  OBS --> LS
  OBS --> GRAF
  OBS --> ALERT
  ALERT --> CUO_TRIAGE
  CUO_TRIAGE -.->|"conf ≥ 0.70"| CHAT
  CUO_TRIAGE -.->|"conf below 0.70 OR P1"| PAGER
  OBS --> COMP
  classDef hub fill:#ede9fe,stroke:#5b21b6,stroke-width:3px,color:#3b0764
  classDef mod fill:#e0e7ff,stroke:#3730a3
  classDef store fill:#f5ede6,stroke:#45210e
  classDef consumer fill:#fef6e0,stroke:#9c750a
  class OBS,COLLECT hub
  class CUO,BRAIN,SKILL,AI,AUTH,MCP,PROJ,N mod
  class LOKI,PROM,TEMPO,LS store
  class GRAF,ALERT,CUO_TRIAGE,PAGER,CHAT,COMP consumer

The OTel Collector is the only fan-in; OBS is the only correlation point. Removing OBS means each module re-implements observability — and the implementations diverge on tenant scoping, retention, and trace correlation.

Auto vs human-in-loop operations matrix

| Operation | How it happens | Why this split |
| --- | --- | --- |
| Log / metric / trace ingestion | Auto from every module's OTel SDK | The platform is unobservable without 100% coverage; SDKs are non-optional in every service. |
| Alert evaluation | Auto in Prometheus + Alert Manager | Threshold-based; CUO triage layered on top, not instead of. |
| CUO triage skill invocation | Auto when alert fires | The skill is read-only: it cannot fix things, only suggest. Human-in-loop preserved for action. |
| Self-service ticket creation (CHAT) | Auto when CUO confidence ≥ 0.70 AND severity ≤ P2 | Reduces page fatigue; on-call sees only what triage couldn't handle. |
| PagerDuty escalation | Auto for P0/P1 OR CUO confidence < 0.70 | The cases that need human judgment still get human judgment. |
| Runbook execution | Human always (runbooks are suggestions, not actions) | Auto-remediation is a separate skill universe (eventually); OBS suggests, humans execute. |
| Regulator view export | Human request via compliance ticket → auto-generated bundle | The bundle is a snapshot of the audit chain + retention metadata; the trigger is always a regulator request, never auto. |
| SLO budget tracking | Auto against per-module SLO contracts | Burn rate alerts when budget consumption exceeds projection; surfaces to CTO weekly. |
| AI persona drift detection | Auto via LangSmith → CUO comparison | If a persona's responses diverge from baseline by > 0.30 cosine, alert. Human-confirms whether drift is intentional. |
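
The drift check in the last row can be sketched as follows. This is a minimal illustration, assuming "0.30 cosine" means cosine distance (1 minus cosine similarity) between an embedding of the baseline responses and an embedding of the current ones. The function names and the embedding inputs are hypothetical; the source does not say how the vectors are produced.

```python
import math

DRIFT_THRESHOLD = 0.30  # threshold quoted in the operations matrix


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("a zero vector has no direction")
    return 1.0 - dot / (norm_a * norm_b)


def persona_drifted(baseline, current, threshold=DRIFT_THRESHOLD):
    """True when the current embedding has moved past the drift threshold.

    A True result only raises an alert; a human still confirms whether
    the drift is intentional.
    """
    return cosine_distance(baseline, current) > threshold
```

Two orthogonal embeddings give a distance of 1.0 and would alert; a small perturbation of the baseline stays well under 0.30.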
1

Why OBS exists

Production observability is one of the line items that, if not centralised early, fragments quickly: one team picks Datadog, another picks Honeycomb, the AI team picks LangSmith, the compliance team asks for an audit-log dashboard that nobody owns. Centralise the platform, let every module emit OpenTelemetry, give compliance read-only audit views, and the question "is the platform healthy?" has one answer instead of five.

📊
LGTM is enough

Loki + Grafana + Tempo + Prometheus = the full operational picture. Self-hosted; runs on Fargate + S3.

🧠
AI traces are different

LangSmith captures full prompt + completion + tool-call chains. Operational tracing alone won't tell you why an agent made a bad decision.

⚖️
Compliance is a first-class view

EU AI Act Art. 12 + PDPL Art. 14 demand decision logging that regulators can inspect — OBS owns the read-only audit surface.

The bet: pay the LGTM operational cost once, plug LangSmith in beside it, and you get incident response, SLO tracking, AI debugging, and compliance evidence from one plane. The alternative — three different SaaS tools, each with its own auth and bill — is a money-and-context drain that compounds with every new module.

2

What it does — 5W1H2C5M

| Axis | Question | Answer |
| --- | --- | --- |
| 5W · What | What is OBS? | A self-hosted LGTM stack (Loki, Grafana, Tempo, Prometheus) plus LangSmith for AI-trace observability, plus a small Rust query proxy that enforces tenant scoping on every read, plus Alert Manager for routing. |
| 5W · Who | Who reads it? | Operators: CTO + on-call engineers (dashboards, alerts). Module owners: for their SLO dashboards. Tenant admins: for their own tenant's cost + usage dashboards. Compliance: read-only audit surface. Auditors: per-engagement scope. |
| 5W · When | When does it run? | 24/7. OTel collector receives spans/logs/metrics in real time; alert evaluation every 30 s; dashboards refresh on user request or 30 s auto-refresh. |
| 5W · Where | Where does it run? | Self-hosted on AWS in SG-1 (P0). LangSmith is a managed SaaS (zero-retention contract); audit-log view is served from BRAIN reads via the query proxy. |
| 5W · Why | Why a separate plane? | So no module has to think about "where do my logs go?"; they emit OTel, the plane handles fan-out, retention, query, and alerting. |
| 1H · How | How does it work? | Services emit OTel; collector splits by signal type; Loki / Tempo / Prometheus ingest; Grafana queries via the tenant-aware proxy; LangSmith ingests AI traces over its own SDK; Alert Manager evaluates rules and routes; audit-log surface reads BRAIN binlog. |
| 2C · Cost | Cost? | P0: ~$130/month (S3 hot-tier storage + Fargate for query proxy + LangSmith starter). 50-tenant: ~$700/month including S3 cold tier + Grafana Enterprise (optional). |
| 2C · Constraints | Constraints? | (a) PII redaction before log shipping (≥ 99.5% recall). (b) Tenant queries cannot bypass scope. (c) EU AI Act Art. 12 decision logs retained ≥ 6 months. (d) Audit-log surface is read-only for everyone. |
| 5M · Materials | Stack? | OpenTelemetry SDK (Rust + Python) · OTel Collector · Loki 3.x · Tempo 2.x · Prometheus 2.x · Grafana 11.x · LangSmith · Alert Manager · S3 (Loki / Tempo backing). |
| 5M · Methods | Method choices? | OTel for everything except AI traces (LangSmith). Trace-id propagation via W3C TraceContext. PII redaction at the collector. Tenant_id injected as a label by the collector based on JWT inspection. |
| 5M · Machines | Deployment? | Loki + Tempo on S3-backed object storage; Prometheus on a single Fargate task (P0); Grafana on Fargate; query proxy on Fargate. |
| 5M · Manpower | Who maintains? | 0.3 FTE CTO at P0. P1+: dedicated SRE/on-call rotation. |
| 5M · Measurement | How measured? | N(FR pending) (platform availability ≥ 99.5%), N(FR pending) (SLO dashboard ≤ 60 s freshness), N(FR pending) (log PII recall ≥ 99.5%). |
2.5

Three-pillars unified pane — logs · metrics · traces · AI traces

The naive multi-tool stack (Datadog for metrics + Honeycomb for traces + Sentry for errors + LangSmith for AI) costs $50k/month at small scale and forces engineers to context-switch across vendors during incidents. The CyberOS stack is self-hosted LGTM plus LangSmith (a managed SaaS under a zero-retention contract), all correlated by trace_id × tenant_id × persona_version. One Grafana, one investigation surface.

Pillar × signal-type mapping

| Pillar | Store | Signals captured | Cardinality budget | Retention |
| --- | --- | --- | --- | --- |
| Logs | Loki | Structured JSON; req/resp/error/event lines from every service | 1 GB/tenant/day default; per-tenant tunable | 7d hot · 90d S3 warm · 1y Glacier cold |
| Metrics | Prometheus + Mimir (P1+) | RED metrics (rate/errors/duration) per service · USE metrics (utilisation/saturation/errors) per host · business KPIs | 1M active series/tenant; LIMIT_EXCEEDED at 1.5M | 15d local Prom · 1y Mimir |
| Traces (operational) | Tempo | OpenTelemetry spans · cross-service HTTP/gRPC propagation | Tail-based sampling: 10% of successful traces by default; per-tenant adjustable; 100% on errors | 14d hot · 90d S3 warm |
| AI traces | LangSmith | LLM prompt + completion + tool calls + decision rationale; tied to operational trace_id | 100% sampling for AI calls (volume is low); cost-bounded by AI Gateway budget | 90d hot · 1y cold (compliance) |
| Events (BRAIN audit) | BRAIN binlog | Decision audit rows · chained · signed | uncapped (append-only) | indefinite (compliance) |

Cross-pillar correlation example — one investigation, four pillars

An engineer investigates "why did this Member's CHAT message take 8 s?". The investigation path:

  1. Metric (Prom): chat_send_p95_ms{tenant=acme} = 8200 at 14:32 — spike confirmed.
  2. Trace (Tempo): pick a slow trace from the spike window → see the request hit MCP Gateway, then CUO, then AI Gateway → AI Gateway took 7.4s of the 8.2s.
  3. AI trace (LangSmith): click the AI Gateway span → see the LLM call → primary provider returned 429 (rate limit), failover added 6.8s.
  4. Log (Loki): same trace_id → see the structured log {level: warn, msg: "provider rate limit", provider: bedrock, retry_after_s: 6}.
  5. BRAIN audit row: same trace_id → confirm the ai.invocation audit row carries failover_path: fallback; root cause identified, runbook says "increase Bedrock quota for this tenant".

5 minutes from "p95 alert" to "open the quota-increase ticket." Without correlation, the same investigation takes hours across separate tool consoles.

Tenant query proxy — the isolation guarantee

All four pillars (Loki, Prom, Tempo, LangSmith) are queried through a Rust proxy that:

  • Validates the JWT (per AUTH §2.7) and extracts tenant_id + scope_grants.
  • Injects a mandatory tenant=<id> label filter into every LogQL / PromQL / TraceQL query.
  • Rejects queries that attempt to use the tenant label in their own filter (prevents bypass via crafted query).
  • Audits the query itself (compliance evidence that the engineer queried only their authorised tenants).
  • Rate-limits at 100 QPS/Member to prevent runaway dashboards from overloading the stack (an accidental denial of service).

Cross-tenant query attempts return 403_TENANT_SCOPE_VIOLATION + emit an audit row + page CSO. This is the protocol-level guarantee that an SRE on the engineering team cannot accidentally (or maliciously) see another tenant's logs.
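
The inject-and-reject behaviour described above can be sketched in a few lines. This is a simplified illustration in Python rather than the planned Rust proxy, and it assumes queries whose label selectors are plain `{...}` groups; a production proxy would parse LogQL / PromQL / TraceQL properly instead of using regular expressions. The names `scope_query` and `TenantScopeViolation` are hypothetical.

```python
import re

# Any matcher on the reserved label, e.g. tenant="x" or tenant=~"a|b".
_TENANT_MATCHER = re.compile(r'\btenant\s*(=~|!~|!=|=)')
# A brace-delimited label selector with no nested braces.
_SELECTOR = re.compile(r'\{([^{}]*)\}')


class TenantScopeViolation(Exception):
    """Raised when a caller-supplied query touches the tenant label."""


def scope_query(query: str, tenant_id: str) -> str:
    """Reject queries that mention the tenant label, then inject it."""
    if _TENANT_MATCHER.search(query):
        # Conservative: this also rejects the text inside string literals.
        raise TenantScopeViolation("403_TENANT_SCOPE_VIOLATION")
    if not re.fullmatch(r'[A-Za-z0-9_-]+', tenant_id):
        raise ValueError("unexpected characters in tenant_id")

    def inject(match):
        existing = match.group(1).strip()
        scoped = f'tenant="{tenant_id}"'
        return "{" + (f"{scoped}, {existing}" if existing else scoped) + "}"

    scoped_query, count = _SELECTOR.subn(inject, query)
    if count == 0:
        raise ValueError("no label selector found to scope")
    return scoped_query
```

For example, `scope_query('{app="chat"} |= "error"', "acme")` yields `{tenant="acme", app="chat"} |= "error"`, while any query that already carries a `tenant` matcher is refused before it reaches a backend. Rejecting first and injecting second means the caller can never widen the scope the JWT grants.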

2.6

Auto-runbook router — alerts that don't page humans

The default behaviour of every monitoring tool is "alert fires → PagerDuty → human pages." That model assumes the human knows what to do; in practice, 60-80% of alerts are repeats with known remediation. OBS inverts the default: alerts route through CUO's obs.triage-alert@1 skill first, which consults the runbook catalogue (in KB), suggests the first step, and either auto-creates a CHAT ticket OR escalates to PagerDuty. Humans only see the alerts that genuinely need human judgment.

The 6-step routing sequence

sequenceDiagram
  autonumber
  participant P as Prometheus / Loki rules
  participant AM as Alert Manager
  participant C as 🎯 CUO obs.triage-alert@1
  participant KB as 📚 KB runbook catalogue
  participant CH as 💬 CHAT (#cyberos-alerts)
  participant PD as PagerDuty
  participant B as 🧠 BRAIN audit
  P->>AM: alert fires (severity, labels, value)
  AM->>C: invoke triage skill (alert payload + trace_id)
  C->>KB: query runbook by alert signature (similar past alerts)
  KB-->>C: top-3 runbook matches + confidence scores
  C->>C: confidence = max(matches)
  alt confidence ≥ 0.70 AND severity ≤ P2
    C->>CH: post ticket with runbook link + suggested first step
    C->>B: audit row (alert · runbook · routing=self-service)
  else confidence below 0.70 OR severity ∈ {P0, P1}
    C->>PD: escalate with suggested first step pre-attached
    C->>B: audit row (alert · routing=pagerduty)
  end
  Note over CH,PD: on-call sees only alerts that need human judgment

Alert severity × routing matrix

| Severity | Default routing | Override | Examples |
| --- | --- | --- | --- |
| P0 (down) | Always PagerDuty + CHAT | Cannot suppress | Platform unreachable · BRAIN write down · tenant data loss |
| P1 (impaired) | PagerDuty with runbook attached | Confidence ≥ 0.90 → CHAT-only with on-call notified | p95 latency > 3× target · provider down · auth degraded |
| P2 (warn) | CUO triage → CHAT or PagerDuty | Conf ≥ 0.70 → CHAT self-service | SLO budget burn rate > 2× · cost cap approaching 80% · cache hit rate dropping |
| P3 (info) | Daily CUO digest in CHAT | Conf ≥ 0.50 → bundled in digest | Slow query trend · persona-version drift < 0.30 · stale memory citations |
| P4 (dev) | Slack only (dev-channel) | | CI flakes · staging warnings · review reminders |
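
Read as a decision function, the matrix above might look like the sketch below. The destination strings and the `route_alert` name are hypothetical labels chosen for illustration. The matrix does not say what happens to a P3 alert below 0.50 confidence, so this sketch sends every P3 to the digest.

```python
from enum import Enum


class Severity(Enum):
    P0 = "p0"
    P1 = "p1"
    P2 = "p2"
    P3 = "p3"
    P4 = "p4"


def route_alert(severity: Severity, confidence: float) -> list:
    """Return the destinations for an alert, following the severity matrix."""
    if severity is Severity.P0:
        # P0 cannot be suppressed, whatever the triage confidence.
        return ["pagerduty", "chat"]
    if severity is Severity.P1:
        if confidence >= 0.90:
            return ["chat", "notify_on_call"]
        return ["pagerduty"]
    if severity is Severity.P2:
        return ["chat"] if confidence >= 0.70 else ["pagerduty"]
    if severity is Severity.P3:
        # Assumption: the < 0.50 case is unspecified, so always digest.
        return ["cuo_digest"]
    return ["dev_channel"]
```

Keeping the thresholds in one pure function makes the routing easy to unit-test and easy to audit against the table.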

Runbook catalogue grows by itself

When CUO can't auto-route an alert (confidence < 0.70), the resulting PagerDuty incident becomes a runbook-authoring trigger. Post-incident, the on-call adds a runbook entry; CUO's confidence on similar future alerts rises. The feedback loop:

  1. Novel alert → PagerDuty → human handles → resolves → writes runbook (or skill-author drafts one from the resolution).
  2. Runbook indexed in KB with the alert signature embedding.
  3. Next similar alert → CUO finds the runbook → confidence ≥ 0.70 → CHAT self-service.
  4. On-call load decreases over time; runbook catalogue grows.

Target: ≥ 60% of alerts auto-runbookable by P1 exit; ≥ 80% by P2. The KPI dashboard surfaces "% of alerts that never paged anyone" as the headline metric.

2.7

Compliance evidence surface — read-only regulator views

The standard pattern when an auditor arrives is "let me give you a read-only Postgres account and good luck." It's a security incident waiting to happen. OBS exposes per-regulator scoped read-only views over the BRAIN audit chain so auditors see exactly what they need (chained, exportable, tamper-evident) without touching the underlying operational data.

Regulator view × audit scope matrix

| Regulator | What they see | Retention requirement | Export format |
| --- | --- | --- | --- |
| EU AI Act (Art. 12) | All ai.invocation rows for EU-residency tenants · persona-version · decision rationale · human-confirm events | ≥ 6 months active, indefinite cold | Signed JSON bundle + chain-of-custody manifest |
| PDPL (Art. 14 DSAR) | Per-subject filtered audit rows · all decision events involving the subject's data | ≥ 1 year | JSON or CSV per subject · multilingual labels (vi + en) |
| SOC 2 Type II | Access events · privilege changes · backup events · incident audit · SLO breaches | ≥ 1 year per auditor period | CSV bundle + auditor's quarterly evidence package |
| ISO 27001:2022 | Information security audit rows · risk register updates · control changes | ≥ 3 years | JSON + signed manifest |
| GDPR (Art. 30 RoPA) | Processing activity audit · cross-border transfer events · data-subject rights events | indefinite while processing | Signed JSON |
| Vietnam Decree 13/2023 (Art. 17) | Processing log scoped to Vietnamese data subjects | ≥ 2 years | JSON or PDF (signed) |

Per-view scoping mechanism

compliance_views:
  eu_ai_act:
    audit_kinds: [ai.invocation, ai.failover_triggered, ai.degraded_mode]
    filter: { tenant.residency: eu-1 }
    fields_visible: [seq, ts_ns, op, extra.tenant_id, extra.agent_persona, extra.module,
                     extra.usage, extra.redaction_applied, extra.failover_path,
                     prev_chain, chain]
    fields_hidden: [extra.prompt_hash, extra.response_hash]   # hashes only proven, not exposed
    retention_min_days: 180
    export_format: signed_json
    auditor_account: requires_csr_signoff
  pdpl_dsar:
    audit_kinds: ["*"]   # all kinds; DSAR is data-subject-scoped, not kind-scoped (quoted: a bare * is a YAML alias marker)
    filter: { subject_in_extra: $subject_id }
    fields_visible: all_except_pii_hashes
    retention_min_days: 365
    export_format: json_or_csv
    auditor_account: tenant_dpo_only
  soc2_cc7_2:
    audit_kinds: [auth.*, access.*, backup.*, incident.*]
    fields_visible: [seq, ts_ns, op, extra.actor, extra.action, extra.target]
    retention_min_days: 365
    export_format: csv_bundle
    auditor_account: external_auditor_only

Chain-of-custody manifest

Every regulator export includes a manifest that proves the export is a faithful slice of the BRAIN chain. The manifest contains:

  • Time window: ISO 8601 range; rows outside the window are excluded.
  • Filter predicates: the exact compliance-view filter that was applied.
  • Row count: N rows exported, sequence numbers N_min..N_max.
  • Chain anchors: the BRAIN chain hash at the start and end of the window (proves no rows were inserted or removed between those points).
  • Export signer: Ed25519 signature by OBS with the export request audit row hash.
  • Auditor identity: JWT subject of the auditor requesting the export.

An auditor can verify the export independently: recompute the chain hashes from the exported rows + the manifest anchors, and confirm they match. This is the protocol-level guarantee that compliance evidence cannot be tampered with mid-export.
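
A sketch of that independent verification is shown below. The source does not define how a chain hash is computed, so this example assumes each row's `chain` is SHA-256 over the previous hash bytes followed by the canonical JSON of the row's remaining fields. The `link_hash` and `verify_export` names are hypothetical, and the Ed25519 signature check on the manifest is left out.

```python
import hashlib
import json


def link_hash(prev_chain: str, row: dict) -> str:
    """Assumed chain rule: SHA-256(prev_chain bytes + canonical row JSON)."""
    body = {k: v for k, v in row.items() if k not in ("prev_chain", "chain")}
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(bytes.fromhex(prev_chain) + canonical).hexdigest()


def verify_export(rows: list, start_anchor: str, end_anchor: str) -> bool:
    """Check that rows form an unbroken chain between the manifest anchors."""
    expected_prev = start_anchor
    for row in rows:
        if row["prev_chain"] != expected_prev:
            return False  # a row was inserted, removed, or reordered
        if link_hash(row["prev_chain"], row) != row["chain"]:
            return False  # row content was altered after it was chained
        expected_prev = row["chain"]
    return expected_prev == end_anchor
```

Because every hash depends on the one before it, editing or dropping any exported row changes a downstream hash and the final comparison against the end anchor fails.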

3

Architecture

Every CyberOS service ships OTel SDK in-process. The collector receives all signals, applies PII redaction, tags with tenant_id, and fans out to Loki (logs), Tempo (traces), Prometheus (metrics). LangSmith receives AI-trace data directly from AI Gateway. Grafana renders dashboards via a Rust tenant-aware query proxy. Alert Manager evaluates Prometheus rules and routes.

graph TB
  subgraph SERVICES ["Every CyberOS service"]
    SVC1["🔐 AUTH"]
    SVC2["🧠 AI Gateway"]
    SVC3["🔌 MCP Gateway"]
    SVC4["💬 CHAT"]
    SVCN["… 18 more modules"]
  end
  subgraph COLLECTOR ["OTel Collector (Fargate · per-region)"]
    REC["receivers<br/>OTLP grpc/http"]
    RED["redactor processor<br/>PII scrub · ≥ 99.5%"]
    TAG["tenant_tag processor<br/>JWT → tenant_id label"]
    SAMP["sampler<br/>tail-based for traces"]
    EXP["exporters"]
  end
  subgraph LGTM ["LGTM backends (S3-backed)"]
    LOKI[("Loki<br/>logs · 7d hot · 90d warm")]
    TEMPO[("Tempo<br/>traces · 7d hot · 30d warm")]
    PROM[("Prometheus<br/>metrics · 15d local · 1y in Mimir P1+")]
  end
  subgraph DASH ["Grafana + Query Proxy"]
    QP["tenant_query_proxy.rs<br/>Rust · enforces tenant scope on every query"]
    GRAF["Grafana 11.x<br/>dashboards"]
  end
  subgraph ALERT ["Alert Manager"]
    AM["alertmanager.yml<br/>routes by severity"]
    PD["PagerDuty<br/>critical"]
    CHAT["CHAT bot<br/>mid"]
    DIG["CUO digest<br/>low"]
  end
  subgraph AI ["AI trace plane"]
    LANG["LangSmith SaaS<br/>(zero-retention)"]
  end
  subgraph AUDIT ["Audit surface"]
    BR["🧠 BRAIN<br/>binlog"]
    AS["audit_view.rs<br/>read-only · auditor-scoped"]
  end
  SVC1 --> REC
  SVC2 --> REC
  SVC3 --> REC
  SVC4 --> REC
  SVCN --> REC
  REC --> RED
  RED --> TAG
  TAG --> SAMP
  SAMP --> EXP
  EXP --> LOKI
  EXP --> TEMPO
  EXP --> PROM
  PROM --> AM
  AM --> PD
  AM --> CHAT
  AM --> DIG
  GRAF --> QP
  QP --> LOKI
  QP --> TEMPO
  QP --> PROM
  SVC2 -. LangSmith SDK .-> LANG
  BR --> AS
  AS --> GRAF
  classDef shipped fill:#f5ede6,stroke:#45210e
  classDef planned fill:#fef6e0,stroke:#7c3aed
  classDef store fill:#f5f3ff,stroke:#7c3aed
  class BR shipped
  class REC,RED,TAG,SAMP,EXP,QP,GRAF,AM,AS planned
  class LOKI,TEMPO,PROM,LANG store
  class PD,CHAT,DIG planned
  class SVC1,SVC2,SVC3,SVC4,SVCN planned

Internal components

| Component | Where | Responsibility |
| --- | --- | --- |
| OTel Collector | services/obs/collector/ | Receives OTLP from every service. Applies PII redaction, tenant tagging, tail-based sampling. Fans out to Loki/Tempo/Prometheus. |
| redactor processor | collector/processors/redactor.go | Presidio-equivalent PII scrubber in Go. Recall ≥ 99.5%. Same rule set as AI Gateway redactor. |
| tenant_tag processor | collector/processors/tenant_tag.go | Inspects span attributes for tenant_id (from JWT context); adds as standard label. Sources of truth: tenant.id attribute. |
| sampler | collector/processors/sampler.go | Tail-based: keeps 100% of error traces, samples 10% of successful ones. |
| Loki | backend | Log storage. S3-backed. Compressed gzip. 7d hot · 90d warm. |
| Tempo | backend | Trace storage. S3-backed. 7d hot · 30d warm. |
| Prometheus | backend | Metrics. Local 15d. Mimir for 1y at P1+. |
| tenant_query_proxy.rs | services/obs/query-proxy/ | Rust axum service. Every query (from Grafana or API) is intercepted; tenant_id from JWT injected as label filter; cross-tenant queries rejected with 403. |
| Grafana | frontend | 11.x. Per-module SLO dashboards + per-tenant cost dashboards + read-only audit-log view (datasource: BRAIN). |
| Alert Manager | backend | Routes alerts by severity. PagerDuty + CHAT + CUO digest integrations. |
| SLO engine | services/obs/slo/ | Sloth-based. SLO definitions in YAML committed to repo. Burn-rate alerts generated automatically. |
| cost_pipeline.py | services/obs/cost/ | Daily cost roll-up from AWS Cost Explorer + AI Gateway DuckDB + storage metrics. Per-tenant breakdown. |
| audit_view.rs | services/obs/audit/ | Read-only audit-log API; consumes BRAIN binlog; exposes Grafana datasource so compliance can query in the same UI as operations. |
| LangSmith client | integrated in AI Gateway | Sends prompt/completion/tool-call traces directly to LangSmith. Zero-retention contract in place. |
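
The sampler's policy (keep every error trace, keep roughly 10% of successful ones) can be illustrated with a small decision function. This is a sketch in Python rather than the planned Go processor, and it assumes the decision is made once the whole trace is known and is keyed on a hash of the trace id so that every collector replica reaches the same verdict. `keep_trace` is a hypothetical name.

```python
import hashlib


def keep_trace(trace_id: str, has_error: bool, success_rate: float = 0.10) -> bool:
    """Tail-sampling decision for one completed trace."""
    if has_error:
        return True  # error traces are always kept
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform in [0, 2**64)
    # Keep the trace when its bucket falls in the lowest success_rate share.
    return bucket < int(success_rate * 2**64)
```

Hashing the trace id instead of drawing a random number keeps the decision reproducible, which helps when an engineer asks why a particular trace is missing from Tempo.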
4

Data model

OBS is mostly streaming: its "data model" is the schema of OTel signals plus SLO and alert configuration. The diagram below shows the entity relationships.

erDiagram
  TENANT ||--o{ LOG_STREAM : "owns"
  TENANT ||--o{ METRIC_SERIES : "owns"
  TENANT ||--o{ TRACE : "owns"
  TENANT ||--o{ SLO_TARGET : "defines"
  TENANT ||--o{ ALERT_RULE : "defines"
  SERVICE ||--o{ LOG_STREAM : "produces"
  SERVICE ||--o{ METRIC_SERIES : "exposes"
  SERVICE ||--o{ SPAN : "emits"
  TRACE ||--|{ SPAN : "contains"
  ALERT_RULE ||--o{ ALERT_INSTANCE : "fires"
  SLO_TARGET ||--o{ SLO_BURN : "calculated by"
  AI_TRACE ||--o{ LLM_SPAN : "contains"
  AI_TRACE ||--o{ TOOL_CALL_SPAN : "contains"
  LOG_STREAM {
    string label_set "service=auth,tenant=acme,env=prod"
    timestamp ts
    string level "DEBUG | INFO | WARN | ERROR"
    string message
    obj attributes
    string trace_id
  }
  METRIC_SERIES {
    string name "http_requests_total"
    string label_set
    string type "counter | gauge | histogram"
    float value
    timestamp ts
  }
  TRACE {
    string trace_id PK
    int64 duration_ns
    string root_service
    int span_count
    timestamp start_ts
    string tenant_id
  }
  SPAN {
    string span_id PK
    string trace_id FK
    string parent_span_id
    string service_name
    string operation
    int64 duration_ns
    obj attributes
    string status "ok | error"
    timestamp start_ts
  }
  AI_TRACE {
    string id PK "LangSmith run id"
    string trace_id "correlates with OTel"
    string persona
    string persona_version
    string model_id
    int tokens_in
    int tokens_out
    string outcome "ok | err"
    timestamp ts
  }
  LLM_SPAN {
    string id PK
    string ai_trace_id FK
    string messages_in_hash
    string completion_hash
    int tokens
  }
  TOOL_CALL_SPAN {
    string id PK
    string ai_trace_id FK
    string tool_name
    string args_hash
    string result_hash
  }
  SLO_TARGET {
    string id PK
    string service
    string indicator "availability | latency | error_rate"
    float target_pct
    string window "28d | 7d"
  }
  SLO_BURN {
    string slo_id FK
    float budget_remaining_pct
    float burn_rate_short
    float burn_rate_long
    timestamp ts
  }
  ALERT_RULE {
    string id PK
    string promql
    string severity "critical | warning | info"
    int for_seconds
    obj routing
  }
  ALERT_INSTANCE {
    string id PK
    string rule_id FK
    string state "pending | firing | resolved"
    timestamp started_at
    timestamp resolved_at
    obj labels
  }

Canonical OTel attribute schema

| Attribute | Type | Required | Purpose |
| --- | --- | --- | --- |
| tenant.id | string (UUID) | YES | Tenant scoping; load-bearing for all queries. |
| tenant.slug | string | SHOULD | Human-readable label. |
| actor.id | string | YES | Subject (user / agent / service). |
| actor.kind | "human" / "agent" / "service" | YES | Authentication shape. |
| persona.version | string | if agent | e.g. cuo-v2.3.1. |
| module | string | YES | e.g. brain, auth, chat. |
| service.name | string | YES | OTel standard. |
| service.version | string | YES | OTel standard. |
| deployment.environment | "dev" / "staging" / "prod" | YES | OTel standard. |
| cyberos.severity_class | "p0" / "p1" / "p2" / "p3" | SHOULD | For alert routing. |
| cyberos.cost_usd | float | if applicable | For per-tenant cost dashboards. |
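
A check against this schema could look like the following sketch. It covers only the YES rows, the enumerated values, and the "if agent" rule; `validate_attributes` is a hypothetical helper, not part of the OTel SDK, and where such a check would run (SDK wrapper, collector processor, CI lint) is not specified in the source.

```python
import uuid

REQUIRED = (
    "tenant.id", "actor.id", "actor.kind", "module",
    "service.name", "service.version", "deployment.environment",
)
ALLOWED_VALUES = {
    "actor.kind": {"human", "agent", "service"},
    "deployment.environment": {"dev", "staging", "prod"},
    "cyberos.severity_class": {"p0", "p1", "p2", "p3"},
}


def validate_attributes(attrs: dict) -> list:
    """Return a list of schema problems; an empty list means the set is valid."""
    problems = []
    for key in REQUIRED:
        if not attrs.get(key):
            problems.append(f"missing required attribute: {key}")
    if attrs.get("tenant.id"):
        try:
            uuid.UUID(str(attrs["tenant.id"]))
        except ValueError:
            problems.append("tenant.id is not a valid UUID")
    for key, allowed in ALLOWED_VALUES.items():
        if key in attrs and attrs[key] not in allowed:
            problems.append(f"{key} has unexpected value: {attrs[key]!r}")
    if attrs.get("actor.kind") == "agent" and not attrs.get("persona.version"):
        problems.append("persona.version is required when actor.kind is 'agent'")
    return problems
```

Returning every problem at once, rather than failing on the first, gives module owners a complete list to fix in one pass.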
5

API surface

Query API (Grafana-compatible, tenant-scoped)

All queries flow through tenant_query_proxy.rs, which extracts tenant_id from the caller's JWT and rewrites the query to inject {tenant_id="…"} label filter. Cross-tenant queries return 403.

| Method | Path | Purpose |
| --- | --- | --- |
| POST | /api/v1/loki/query | LogQL query (Grafana datasource). |
| POST | /api/v1/loki/query_range | Range LogQL query. |
| POST | /api/v1/prom/query | PromQL query. |
| POST | /api/v1/prom/query_range | Range PromQL. |
| POST | /api/v1/tempo/api/search | Tempo trace search. |
| GET | /api/v1/tempo/api/traces/{id} | Get full trace by id. |
| POST | /api/v1/audit/query | BRAIN audit-log query (read-only). |
| GET | /api/v1/slo | List SLO targets for tenant. |
| GET | /api/v1/slo/{id}/burn | Burn-rate for a specific SLO. |
| GET | /api/v1/cost/mtd | MTD cost breakdown for tenant. |
| GET | /api/v1/alerts/active | Active alerts for tenant. |
| POST | /api/v1/alerts/{id}/silence | Silence an alert (operator scope). |
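
For readers unfamiliar with the burn-rate figure returned by the SLO endpoint in the table above, the arithmetic can be sketched as below. This uses the common definition (observed error ratio divided by the error budget implied by the target); how the planned Sloth-based engine computes it internally is not stated in the source, and both function names are hypothetical.

```python
def burn_rate(errors: int, total: int, target_pct: float) -> float:
    """Observed error ratio divided by the error budget (1 - target)."""
    if total <= 0:
        return 0.0
    budget = 1.0 - target_pct / 100.0
    if budget <= 0:
        raise ValueError("a 100% target leaves no error budget")
    return (errors / total) / budget


def budget_remaining_pct(errors: int, total: int, target_pct: float) -> float:
    """Share of the error budget still unspent over the same window."""
    if total <= 0:
        return 100.0
    allowed_errors = (1.0 - target_pct / 100.0) * total
    if allowed_errors <= 0:
        raise ValueError("a 100% target leaves no error budget")
    return max(0.0, 1.0 - errors / allowed_errors) * 100.0
```

With the 99.5% platform target, 10 failures in 1,000 requests is an error ratio of 1% against a budget of 0.5%, so the burn rate is about 2: the budget is being spent twice as fast as the SLO allows.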

GraphQL subgraph (federated)

extend schema
 @link(url: "https://specs.apollo.dev/federation/v2.5", import: ["@key", "@requiresScopes"])

type SLO @key(fields: "id") {
 id: ID!
 service: String!
 indicator: SLOIndicator!
 targetPct: Float!
 window: String!
 currentPct: Float!
 budgetRemainingPct: Float!
 burnRateShort: Float!
 burnRateLong: Float!
}

type Alert @key(fields: "id") {
 id: ID!
 ruleName: String!
 severity: Severity!
 state: AlertState!
 startedAt: DateTime!
 resolvedAt: DateTime
 labels: JSON!
}

type CostReport @key(fields: "tenantId month") {
 tenantId: ID!
 month: String!
 totalUsdCost: Float!
 infraUsdCost: Float!
 aiUsdCost: Float!
 storageUsdCost: Float!
 byService: [ServiceCost!]!
}

type ServiceCost {
 service: String!
 usdCost: Float!
}

enum SLOIndicator { AVAILABILITY LATENCY ERROR_RATE THROUGHPUT }
enum Severity { CRITICAL WARNING INFO }
enum AlertState { PENDING FIRING RESOLVED }

type Query {
 slos(service: String): [SLO!]! @requiresScopes(scopes: [["obs.read"]])
 alertsActive: [Alert!]! @requiresScopes(scopes: [["obs.read"]])
 costMTD: CostReport! @requiresScopes(scopes: [["obs.cost_read"]])
 trace(id: String!): Trace @requiresScopes(scopes: [["obs.read"]])
}

OTel ingest endpoints

| Method | Path | Purpose |
| --- | --- | --- |
| POST | /v1/logs | OTLP logs ingest (collector). |
| POST | /v1/metrics | OTLP metrics ingest. |
| POST | /v1/traces | OTLP traces ingest. |
| GET | /metrics | Prometheus scrape endpoint (collector self-telemetry). |
| GET | /health | Liveness + signal counts. |
6

Key flows

Flow 1 — Log ingestion (PII-scrubbed, tenant-tagged)

sequenceDiagram
  autonumber
  participant SVC as 🔐 AUTH service
  participant SDK as OTel SDK (Rust)
  participant COL as OTel Collector
  participant RED as redactor
  participant TAG as tenant_tag
  participant LOKI as Loki
  SVC->>SDK: log!("login attempt for user={email} cccd={cccd}")
  SDK->>SDK: attach trace_id, span_id, service.name, tenant.id
  SDK->>COL: OTLP /v1/logs
  COL->>RED: scan attributes + message
  RED->>RED: find vn.cccd, en.email → replace sentinels
  RED-->>COL: redacted record (email=[REDACTED], cccd=[REDACTED])
  COL->>TAG: ensure tenant_id label present
  TAG-->>COL: labelled
  COL->>LOKI: push (label_set, ts, message)
  LOKI-->>COL: 204 No Content

(FR pending): PII recall ≥ 99.5%. Redaction at the collector is the last point at which PII can be stopped before it lands on S3.
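
The redaction step in Flow 1 can be illustrated with two patterns. This is a deliberately small sketch, assuming a CCCD is a standalone run of 12 digits and using a simple email pattern; the planned redactor is described as Presidio-equivalent and would carry a much larger rule set. The `redact` name and both patterns are illustrative assumptions.

```python
import re

# Simplified email pattern; real-world addresses have more edge cases.
EMAIL = re.compile(r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}')
# Assumption: a CCCD number is exactly 12 digits, not part of a longer run.
CCCD = re.compile(r'(?<!\d)\d{12}(?!\d)')

SENTINEL = "[REDACTED]"


def redact(message: str) -> str:
    """Replace email addresses and CCCD-like numbers with a sentinel."""
    message = EMAIL.sub(SENTINEL, message)
    return CCCD.sub(SENTINEL, message)
```

Applied to the log line from the diagram with real values filled in, the record that reaches Loki reads `login attempt for user=[REDACTED] cccd=[REDACTED]`.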

Flow 2 — Metric scrape + alert evaluation

sequenceDiagram
  autonumber
  participant SVC as 🧠 AI Gateway
  participant PROM as Prometheus
  participant AM as Alert Manager
  participant PD as PagerDuty
  participant CHAT as #cyberos-alerts
  PROM->>SVC: GET /metrics (every 15 s)
  SVC-->>PROM: ai_request_latency_p95_seconds 2.3<br/>ai_provider_error_total 47
  loop alert eval every 30 s
    PROM->>PROM: evaluate rule: ai_request_latency_p95_seconds above 2 for 5m
    alt firing
      PROM->>AM: alert {severity=critical, service=ai-gateway}
      AM->>AM: route by labels
      AM->>PD: page on-call (severity=critical)
      AM->>CHAT: post message
    end
  end
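
The "above 2 for 5m" rule in Flow 2 only fires after the condition has held continuously for five minutes. The sketch below simulates that behaviour over a list of samples taken at the 30 s evaluation interval. It is an illustration of the `for` semantics, not Prometheus code, and `alert_states` is a hypothetical name.

```python
def alert_states(samples, threshold=2.0, for_seconds=300, interval=30):
    """Return the alert state at each evaluation tick.

    The alert is 'pending' while the condition holds but has not yet
    held for for_seconds, 'firing' once it has, and 'inactive' otherwise.
    """
    states = []
    active_since = None
    for tick, value in enumerate(samples):
        now = tick * interval
        if value > threshold:
            if active_since is None:
                active_since = now  # condition just became true
            held_for = now - active_since
            states.append("firing" if held_for >= for_seconds else "pending")
        else:
            active_since = None  # any dip resets the timer
            states.append("inactive")
    return states
```

With a steady p95 of 2.3 s, the first ten evaluations are pending and the eleventh (300 s after the first breach) fires. A single healthy sample in between resets the clock, which is what filters out short spikes.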

Flow 3 — Trace propagation across modules

sequenceDiagram
  autonumber
  participant U as User SPA
  participant AR as Apollo Router
  participant AUTH as AUTH RBAC
  participant CHAT as CHAT service
  participant AI as AI Gateway
  participant BR as BRAIN
  participant TEMPO as Tempo
  U->>AR: POST /graphql (mutation sendMessage)
  AR->>AR: start trace · trace_id=t1
  AR->>AUTH: RBAC.Check (carries trace_id)
  AUTH-->>AR: allow
  AR->>CHAT: forward request (carries trace_id)
  CHAT->>AI: summarise (carries trace_id)
  AI->>BR: get persona (carries trace_id)
  BR-->>AI: persona
  AI-->>CHAT: completion
  CHAT->>BR: append message row
  CHAT-->>AR: ok
  AR-->>U: ok
  Note over AR,TEMPO: each service emits its own span<br/>all share trace_id=t1
  AR->>TEMPO: span (root)
  AUTH->>TEMPO: span
  CHAT->>TEMPO: span
  AI->>TEMPO: span
  BR->>TEMPO: span (x2)

(FR pending): end-to-end trace continuity verified. W3C TraceContext is propagated through every internal call, so a single trace_id stitches the whole transaction together.
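How a single trace_id survives each hop can be sketched with the W3C `traceparent` header, whose version-00 layout is `version-traceid-parentid-flags` in lowercase hex. The helper names below are hypothetical, and in practice the OTel SDK propagators do this work; the sketch only accepts version 00.

```python
import re
import secrets

# traceparent, version 00: 2 hex version, 32 hex trace-id, 16 hex parent-id, 2 hex flags.
TRACEPARENT_RE = re.compile(r"^(00)-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace at the edge (the router in Flow 3)."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"


def child_traceparent(incoming: str) -> str:
    """Build the header for an outgoing call: same trace-id, fresh span id."""
    match = TRACEPARENT_RE.match(incoming)
    if match is None:
        raise ValueError("malformed traceparent")
    version, trace_id, parent_id, flags = match.groups()
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        raise ValueError("all-zero ids are invalid")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"
```

Each service calls something like `child_traceparent` on the header it received before making its own downstream call, which is why every span in the flow reports the same trace_id.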

Flow 4 — Alert escalation (severity-based routing)

sequenceDiagram
    autonumber
    participant R as Prometheus rule
    participant AM as Alert Manager
    participant PD as PagerDuty
    participant CHAT as CHAT bot
    participant CUO as CUO digest queue
    participant OPS as On-call engineer
    R->>AM: alert {severity, service, summary}
    alt severity=critical (SLO burn fast, error_budget below 5%)
        AM->>PD: trigger incident
        AM->>CHAT: post #incidents
        PD->>OPS: page
        OPS->>CHAT: acknowledge
        OPS->>OPS: investigate via Grafana
    else severity=warning (burn slow, budget below 30%)
        AM->>CHAT: post #cyberos-alerts
        Note over OPS: handled async; SLA 4h
    else severity=info (trend, advisory)
        AM->>CUO: enqueue for morning digest
        Note over CUO: surfaced in CEO morning brief
    end

(FR pending): PagerDuty for critical, CHAT (#cyberos-alerts) for warning, CUO digest for info-level trends.
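The three branches of Flow 4 amount to a static severity-to-destination table. The sketch below shows one way to express it. The destination strings are made-up labels, and treating an unknown severity as critical is an assumption added here so that nothing is dropped silently.

```python
# Hypothetical destination labels; the severities come from Flow 4.
ROUTES = {
    "critical": ["pagerduty", "chat:#incidents"],
    "warning": ["chat:#cyberos-alerts"],
    "info": ["cuo:morning-digest"],
}


def route_alert(alert: dict) -> list:
    """Return the destinations for one alert based on its severity label."""
    severity = alert.get("severity")
    # Assumption: a missing or unknown severity is escalated rather than ignored.
    return list(ROUTES.get(severity, ROUTES["critical"]))
```

A table like this is also the natural fallback when the triage skill is unavailable, since it needs nothing but the alert's own labels.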

Flow 5 — Audit-log query (compliance review)

sequenceDiagram
    autonumber
    participant AUD as Auditor
    participant G as Grafana
    participant QP as query_proxy
    participant AUTH as AUTH RBAC
    participant AV as audit_view.rs
    participant BR as 🧠 BRAIN binlog
    AUD->>G: open "Audit · 2026-Q1" dashboard
    G->>QP: GET /api/v1/audit/query?since=2026-01-01&actor=stephen@…
    QP->>AUTH: RBAC.Check(action="obs.audit_read", resource=…)
    AUTH-->>QP: allow (auditor scope)
    QP->>AV: query(filter)
    AV->>BR: walk binlog from seq=12000
    BR-->>AV: rows matching filter
    AV-->>QP: ChainedAuditRow
    QP-->>G: rows + inclusion proofs (optional)
    G-->>AUD: table view with verify-button per row

EU AI Act Art. 12: decision logs retained ≥ 6 months; PDPL Art. 14 DSAR; auditors get read-only access scoped by engagement.
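The per-row verify button in Flow 5 implies that each audit row can be checked against its predecessor. The document does not specify BRAIN's chain construction, so the sketch below assumes the simplest possible one: each row stores the SHA-256 of the previous row's hash concatenated with its own canonical JSON payload. Field names and the genesis value are hypothetical.

```python
import hashlib
import json
from typing import Optional

GENESIS = "0" * 64  # assumed anchor for the first row


def row_hash(prev_hash: str, payload: dict) -> str:
    """Hash of the previous hash plus the canonical JSON of this row's payload."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + canonical).encode("utf-8")).hexdigest()


def append_row(chain: list, payload: dict) -> None:
    """Append a payload to the chain, linking it to the previous row."""
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append({"seq": len(chain), "payload": payload,
                  "prev_hash": prev, "hash": row_hash(prev, payload)})


def first_broken_seq(chain: list) -> Optional[int]:
    """Return the seq of the first row that fails verification, or None."""
    prev = GENESIS
    for row in chain:
        if row["prev_hash"] != prev or row["hash"] != row_hash(prev, row["payload"]):
            return row["seq"]
        prev = row["hash"]
    return None
```

Because every hash depends on the one before it, editing any historical row invalidates that row and everything after it. An auditor can therefore detect tampering without access to the operational data.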

7

Alert lifecycle

Alerts traverse an eight-state lifecycle (Inactive, Pending, Firing, Acknowledged, Silenced, Resolving, Resolved, Postmortem). Every state transition emits a metric for SLO compliance tracking.

stateDiagram-v2
    [*] --> Inactive: rule loaded
    Inactive --> Pending: condition met within for_seconds window
    Pending --> Inactive: condition cleared before window
    Pending --> Firing: condition held for for_seconds duration
    Firing --> Acknowledged: on-call ack
    Firing --> Silenced: operator silences
    Acknowledged --> Resolving: investigation in progress
    Resolving --> Resolved: condition cleared
    Silenced --> Firing: silence expires
    Resolved --> Inactive
    Resolved --> Postmortem: severity = critical
    Postmortem --> [*]
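The lifecycle can be written down as a transition table, which makes illegal moves easy to reject and gives one obvious place to count transitions. In the sketch below the event names are hypothetical labels for the edges in the diagram, and the `Counter` stands in for whatever metric the real system would emit.

```python
from collections import Counter

# (current state, event) -> next state, one entry per edge in the diagram.
TRANSITIONS = {
    ("inactive", "condition_met"): "pending",
    ("pending", "condition_cleared"): "inactive",
    ("pending", "for_elapsed"): "firing",
    ("firing", "ack"): "acknowledged",
    ("firing", "silence"): "silenced",
    ("acknowledged", "investigate"): "resolving",
    ("resolving", "condition_cleared"): "resolved",
    ("silenced", "silence_expired"): "firing",
    ("resolved", "close"): "inactive",
    ("resolved", "open_postmortem"): "postmortem",
}

transition_counter = Counter()  # stand-in for a per-transition metric


def step(state: str, event: str, severity: str = "warning") -> str:
    """Apply one event and return the next state, or raise on an illegal move."""
    if event == "open_postmortem" and severity != "critical":
        raise ValueError("postmortem is only opened for critical alerts")
    next_state = TRANSITIONS.get((state, event))
    if next_state is None:
        raise ValueError(f"illegal transition: {state} on {event}")
    transition_counter[(state, next_state)] += 1
    return next_state
```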

SLO catalogue (P0)

| Service | Indicator | Target | Window | Owner |
|---|---|---|---|---|
| Platform (aggregate) | availability | ≥ 99.5% | 28d rolling | CTO |
| CHAT | availability | ≥ 99.9% | 28d | CTO |
| BRAIN search | availability | ≥ 99.5% | 28d | CDO |
| AUTH | availability | ≥ 99.95% | 28d | CSO |
| AI Gateway | availability | ≥ 99.9% | 28d | CTO |
| AI Gateway | latency p95 | ≤ 2 s | 28d | CTO |
| MCP Gateway | availability | ≥ 99.95% | 28d | CTO |
| MCP Gateway | write tool p95 | ≤ 1 s | 28d | CTO |
| GraphQL Router | latency p95 | ≤ 400 ms | 28d | CTO |
| Backup RPO | recovery point | ≤ 1 h | continuous | CTO |
| Backup RTO | recovery time | ≤ 4 h | continuous | CTO |
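
An availability target translates directly into an error budget: the share of the window that may be spent unavailable. The helper below is a small worked example of that arithmetic; the function name is hypothetical, and it treats the budget as wall-clock minutes, which is one common convention rather than something the catalogue specifies.

```python
def error_budget_minutes(target_pct: float, window_days: int = 28) -> float:
    """Minutes of allowed unavailability for a target over a rolling window."""
    window_minutes = window_days * 24 * 60   # 28 d = 40,320 min
    return window_minutes * (100.0 - target_pct) / 100.0


if __name__ == "__main__":
    for service, target in [("platform", 99.5), ("chat", 99.9), ("auth", 99.95)]:
        print(f"{service}: {round(error_budget_minutes(target), 2)} min per 28 d")
```

Over 28 days, 99.5% leaves about 201.6 minutes of budget, 99.9% about 40.3 minutes, and 99.95% about 20.2 minutes. The AUTH and MCP Gateway targets are therefore roughly ten times less forgiving than the platform aggregate.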
8

Functional Requirements

The CyberOS FR catalogue is being rebuilt one feature at a time via the open feature-request-author Agent Skill.

Previous FR enumerations were archived 2026-05-14 and are no longer reflected on this page. Specific FRs land here as they are re-authored.

9

Non-Functional Requirements

| NFR ID | Concern | Target | Measurement |
|---|---|---|---|
| N(FR pending) | Platform availability (28-day rolling) | ≥ 99.5% | SLO target · burn-rate alerts |
| N(FR pending) | CHAT availability | ≥ 99.9% | SLO |
| N(FR pending) | BRAIN search availability | ≥ 99.5% | SLO |
| N(FR pending) | Backup RPO | ≤ 1 h | scheduled backup audit |
| N(FR pending) | Backup RTO | ≤ 4 h | quarterly restore drill |
| N(FR pending) | Cross-region failover (P3) | ≤ 24 h | annual DR drill |
| N(FR pending) | SLO dashboard refresh latency | ≤ 60 s | monitor synthetic SLO breach |
| N(FR pending) | Log ingest end-to-end latency | ≤ 30 s p95 | synthetic log → query |
| N(FR pending) | Trace ingest end-to-end | ≤ 60 s p95 | synthetic trace |
| N(FR pending) | Log PII redaction recall | ≥ 99.5% | test set |
| N(FR pending) | Log PII redaction precision | ≥ 95% | test set |
| N(FR pending) | OBS plane availability | ≥ 99.5% | SLO (recursive) |
| N(FR pending) | Decision-log retention | ≥ 180 d | config audit · S3 lifecycle |
| N(FR pending) | Cross-tenant query leakage | = 0 | property-based test |
| N(FR pending) | OBS plane infra cost (P0) | ≤ $130/month | cost dashboard |

10

Dependencies

graph LR
    subgraph upstream ["OBS depends on"]
        AUTH["🔐 AUTH<br/>tenant + scope verification"]
        BRAIN["🧠 BRAIN<br/>audit-log surface"]
        S3["☁️ S3<br/>Loki/Tempo storage"]
        LANGSMITH["LangSmith SaaS<br/>AI traces"]
    end
    OBS["👁 OBS"]
    subgraph emitters ["Every CyberOS service emits OTel"]
        AUTH2["AUTH"]
        AI["AI"]
        MCP["MCP"]
        CHAT["CHAT"]
        BR2["BRAIN"]
        SK["Skill"]
        OTH["…all 22"]
    end
    subgraph consumers ["Consumers"]
        OPS["On-call ops"]
        COMP["Compliance"]
        AUDIT["External auditor"]
        CEO["CEO morning digest"]
    end
    AUTH --> OBS
    BRAIN --> OBS
    S3 --> OBS
    LANGSMITH --> OBS
    AUTH2 --> OBS
    AI --> OBS
    MCP --> OBS
    CHAT --> OBS
    BR2 --> OBS
    SK --> OBS
    OTH --> OBS
    OBS --> OPS
    OBS --> COMP
    OBS --> AUDIT
    OBS --> CEO
    classDef shipped fill:#f5ede6,stroke:#45210e
    classDef planned fill:#fef6e0,stroke:#7c3aed
    class BRAIN,SK shipped
    class OBS,AUTH,AI,MCP,CHAT,AUTH2,BR2,OTH planned
    class S3,LANGSMITH planned
11

Compliance scope

| Regulation / standard | Article / clause | OBS feature |
|---|---|---|
| EU AI Act | Art. 12: Logging | Decision-log retention ≥ 6 months; LangSmith trace per AI decision. |
| EU AI Act | Art. 13: Transparency | Audit-log surface available to deployers (tenant admins). |
| EU AI Act | Art. 14: Human oversight | Per-tenant alerting flags anomalous agent behaviour. |
| Vietnam PDPL | Art. 14: DSAR | Per-subject log + decision export via audit-log surface. |
| Vietnam Decree 13/2023 | Art. 17: Processing log | Audit-log surface materialises the processing log for the regulator. |
| GDPR | Art. 30: Records of processing | BRAIN audit chain + OBS audit-view = records of processing. |
| GDPR | Art. 32: Security of processing | PII redaction on logs; tenant-scoped queries; mTLS to collectors. |
| GDPR | Art. 33: Breach notification | Alert routing surfaces breaches; OBS provides forensic timeline. |
| ISO/IEC 27001:2022 | A.8.15: Logging | Centralised structured logs; integrity via BRAIN chain. |
| ISO/IEC 27001:2022 | A.8.16: Monitoring activities | Per-module SLO + alert pipeline. |
| ISO/IEC 42001 (AIMS) | § 9.1: Performance evaluation | LangSmith + AI Gateway metrics ≡ AI system performance KPIs. |
| SOC 2 Type II | CC7.2: Monitoring controls | SLO dashboards · alert routing · audit-log retention. |
| SOC 2 Type II | CC7.3: Detection | Alert manager + on-call rotation. |

12

Risk entries

| ID | Risk | Likelihood | Impact | Owner | Mitigation |
|---|---|---|---|---|---|
| R-OBS-001 | PII leaks into Loki/Tempo via missed redaction rule | Medium | High | CSO | Recall ≥ 99.5% gated in CI; quarterly red-team; opt-in encryption at rest for sensitive log streams. |
| R-OBS-002 | Cross-tenant log leakage via crafted query | Low | Catastrophic | CSO | Query-proxy property-based test gate; tenant_id always injected from JWT not user input. |
| R-OBS-003 | LangSmith outage blinds AI debugging | Medium | Medium | CTO | Local OTel trace mirror retained 7 d; LangSmith is for deep analysis, not primary. |
| R-OBS-004 | Alert fatigue (too many warnings) | High | Medium | CTO | Burn-rate alerting (Sloth) instead of static thresholds; quarterly alert review. |
| R-OBS-005 | S3 retention misconfig → decision logs purged early | Low | High | CTO | Lifecycle policy declared in Terraform; CI gate verifies ≥ 180 d retention for decision-log bucket. |
| R-OBS-006 | Grafana credential leak: broad audit-log access | Low | High | CSO | Grafana auth via OIDC SSO; per-folder scope; auditors get time-bound access. |
| R-OBS-007 | Trace-id loss across async boundary → broken span tree | Medium | Low | CTO | OTel context propagation in every async runtime crate; CI test verifies multi-hop trace continuity. |
| R-OBS-008 | Prometheus disk full → metrics gap | Medium | Medium | CTO | 15-d retention with auto-eviction; alert on free-disk < 30%; long-term in Mimir at P1+. |
| R-OBS-009 | OTel SDK version drift across modules | Medium | Low | CTO | Pin SDK version in shared crate / package; Renovate alerts on upstream releases. |
| R-OBS-010 | Cost-pipeline mis-attributes spend to wrong tenant | Medium | Medium | CFO | tenant_id required in every spend event; reconciliation gate against AWS bill monthly. |
| R-OBS-011 | Auto-runbook router miscategorises P0 as P2 → critical alert silenced | Low | Critical | CTO | P0/P1 always escalate to PagerDuty regardless of triage confidence; severity is set by Prometheus rule, never modifiable by CUO; routing audit row makes silent-suppression detectable. |
| R-OBS-012 | Compliance export tampering: auditor receives modified bundle | Low | Critical | DPO | Ed25519 signed manifest with chain anchors + per-row hashes; auditor verifies independently; tampering attempt detected at signature verification. |
| R-OBS-013 | CUO triage skill goes down → all alerts fall back to PagerDuty (page storm) | Medium | Medium | CTO | Triage skill has graceful-degrade: when unavailable, alerts route via static severity → routing table (the pre-CUO behaviour); on-call notified that triage is offline. |
| R-OBS-014 | LangSmith data retention violates EU residency (data shipped to US) | Low | High | DPO | EU-residency tenants route AI traces to a self-hosted LangSmith-compatible store in EU-1; ZDR + DPA confirmed before any third-party SaaS is enabled. |
| R-OBS-015 | Trace sampling drops the wrong tail (errors sampled out) | Medium | Medium | CTO | Tail-based sampling: 100% on errors + slow traces; head-based 10% for normal traffic; CI test verifies error-trace coverage = 100%. |
| R-OBS-016 | Persona-drift detector false-positive triggers Lumi rollback unnecessarily | Medium | Medium | CPO | Drift detector requires 3 consecutive windows above threshold; rollback is a candidate-version proposal, not auto-applied; human confirms. |
| R-OBS-017 | Cross-pillar correlation breaks when service uses async runtime without OTel context propagation | Medium | Medium | CTO | OTel context-propagation middleware required in every service template; CI test verifies trace_id continuity across > 2 hops; PR check blocks merge if missing. |
| R-OBS-018 | Query proxy DOS via expensive LogQL queries from one Member | Medium | Medium | CTO | Per-Member 100 QPS limit + per-query 30 s timeout + complexity analyser refuses unbounded scans; cost shown to user before execution. |
| R-OBS-019 | Runbook catalogue drift: runbook says "increase Bedrock quota" but tenant uses Vertex | Medium | Low | CTO | Runbooks tagged with applicability conditions (provider, region, severity); CUO triage filters runbooks before suggestion; stale runbooks flagged for review quarterly. |
| R-OBS-020 | SLO budget burn rate alarm fires during planned maintenance → noise | Medium | Low | CTO | Maintenance windows declared in OBS (per service); SLO calc excludes declared windows; un-declared maintenance still alerts (catches "forgot to declare"). |
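
The sampling policy in R-OBS-015 (keep every error and slow trace, keep roughly 10% of the rest) can be sketched as a single keep/drop decision. This is a simplification: it applies the 10% baseline at the same decision point instead of at the head. The 2,000 ms "slow" threshold and the hash-bucket scheme are assumptions, not values taken from the document.

```python
import hashlib

SLOW_MS = 2000          # assumption: what counts as a "slow" trace
BASELINE_PERCENT = 10   # baseline keep rate for normal traffic


def keep_trace(trace_id: str, has_error: bool, duration_ms: float) -> bool:
    """Decide whether a completed trace is retained."""
    if has_error or duration_ms >= SLOW_MS:
        return True  # errors and slow traces are always kept
    # Deterministic bucket in [0, 100) derived from the trace id, so every
    # component reaches the same decision for the same trace.
    bucket = int(hashlib.sha256(trace_id.encode("utf-8")).hexdigest()[:8], 16) % 100
    return bucket < BASELINE_PERCENT
```

Deriving the bucket from the trace id rather than from a random draw keeps the decision reproducible, which is what lets a CI test assert that error-trace coverage is exactly 100%.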
13

KPIs

| KPI | Formula | Source | Target |
|---|---|---|---|
| Platform availability (28d) | 1 − error_minutes / total_minutes | Prometheus | ≥ 99.5% |
| SLO dashboard freshness | last_scrape_age | Prometheus | ≤ 60 s |
| Log ingest p95 latency | histogram | collector | ≤ 30 s |
| PII redaction recall | TP / (TP + FN) | CI gate | ≥ 99.5% |
| Cross-tenant query rejections | count | query_proxy | tracked; 0 successful breaches |
| Alert false-positive rate | fp / (fp + tp) | weekly review | ≤ 20% |
| MTTR (critical) | resolved_at − fired_at | PagerDuty | ≤ 60 min |
| Error-budget remaining (per SLO) | 1 − burned / budget | SLO engine | > 0 throughout window |
| Decision-log retention compliance | days_retained | S3 lifecycle | ≥ 180 d |
| Auto-runbook coverage | (alerts auto-routed to CHAT) / total alerts | obs.triage-alert@1 audit rows | ≥ 0.60 by P1 exit; ≥ 0.80 by P2 |
| P0/P1 false-suppression rate | P0/P1 alerts that didn't page | cross-check Prom rules vs PagerDuty events | = 0 (hard floor) |
| Compliance export verification rate | exports passing auditor's manifest re-verification / total exports | auditor reports | = 1.0 (hard floor) |
| Cross-pillar correlation completeness | traces with all 4 pillars present / total traces | OBS coverage probe | ≥ 0.95 |
| Tail-sampling error coverage | error traces with full trace retained / total error traces | Tempo | = 1.0 (hard floor) |
| Persona-drift detector precision | true-positive drifts / total flagged | quarterly human review | ≥ 0.80 (don't cry wolf on personas) |
| Query proxy violation rejections | cross-tenant query attempts / total queries | query proxy audit | tracked; spike = active threat |
| MTTR for self-service tickets (P2/P3) | resolved_at − ticket_created_at | CHAT ticket events | ≤ 4 h median |
| Dogfooding alert acknowledgement (internal) | internal alerts ACK'd within 5 min | filtered to tenant_id=org:cyberskill | ≥ 0.90 (we live by this) |
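
Several of the ratio KPIs above reduce to counts from a labelled test set or a weekly review. The sketch below computes the redaction recall and precision and the alert false-positive rate, then applies the stated targets as a pass/fail gate. The function names and the gate itself are hypothetical; how the CI gate is actually wired is not described in the document.

```python
def ratio(numerator: int, denominator: int) -> float:
    """Safe division: an empty denominator yields 0.0 instead of raising."""
    return numerator / denominator if denominator else 0.0


def redaction_recall(tp: int, fn: int) -> float:
    return ratio(tp, tp + fn)          # TP / (TP + FN)


def redaction_precision(tp: int, fp: int) -> float:
    return ratio(tp, tp + fp)          # TP / (TP + FP)


def alert_false_positive_rate(fp: int, tp: int) -> float:
    return ratio(fp, fp + tp)          # fp / (fp + tp)


def redaction_gate(tp: int, fp: int, fn: int) -> bool:
    """Pass only if recall is at least 99.5% and precision at least 95%."""
    return redaction_recall(tp, fn) >= 0.995 and redaction_precision(tp, fp) >= 0.95
```

For example, a test set with 1,995 caught items, 5 missed, and 80 false alarms passes (recall 0.9975, precision about 0.961). The same set with 20 missed items fails on recall.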
14

RACI matrix

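| Activity | CEO | CTO | CSO | CDO | CFO | DPO |
|---|---|---|---|---|---|---|
| Stack design + deployment | I | A/R | C | C | I | I |
| SLO definition | A | R | C | C | I | I |
| Alert rule maintenance | I | A/R | C | I | I | I |
| PII redaction rule maintenance | I | C | C | A/R | I | C |
| On-call rotation | I | A/R | C | I | I | I |
| Cost pipeline + reconciliation | I | C | I | I | A/R | I |
| Audit-log surface design | I | C | C | C | I | A/R |
| Compliance review (AI Act, PDPL) | I | C | C | C | I | A/R |

15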

Planned CLI surface

The operator CLI is cyberos-obs; it sits alongside the standard Grafana, Loki, and Prometheus CLIs.

1. Tail logs for a tenant

$ cyberos-obs logs tail --tenant acme --service auth --since 5m

2026-05-14T07:19:02Z INFO auth login_attempt subject=[REDACTED:email] trace=t_3ab9
2026-05-14T07:19:02Z INFO auth login_success aal=aal3 trace=t_3ab9
2026-05-14T07:19:03Z INFO rbac check action=brain.put decision=allow trace=t_3ab9
…

2. SLO status

$ cyberos-obs slo status

SERVICE         INDICATOR     TARGET  CURRENT  BUDGET  BURN
platform        availability  99.5%   99.94%   99%     0.06× (28d)
chat            availability  99.9%   99.97%   71%     1.2× (warning)
auth            availability  99.95%  100%     100%    0×
ai-gateway      latency p95   2 s     1.4 s    ok      -
brain-search    availability  99.5%   99.99%   99%     0×
mcp-gateway     write p95     1 s     0.42 s   ok      -
graphql-router  latency p95   400 ms  280 ms   ok      -

3. Active alerts

$ cyberos-obs alerts active

ALERT                                 SEVERITY  STARTED  STATUS
ChatErrorBudgetBurnFast               warning   5m ago   firing
AIProviderLatencyHigh                 info      12m ago  firing
S3LifecycleStaleConfig (cost-bucket)  info      2h ago   silenced

4. Per-tenant cost MTD

$ cyberos-obs cost mtd --tenant acme

TENANT: acme
MONTH: 2026-05
─────────────────────────────────────
Infra: $182.40
 Fargate (chat) $52.10
 Fargate (auth) $48.20
 RDS Postgres $42.10
 S3 storage $24.00
 Other $16.00
AI: $97.42 (cap $150 · 64.9%)
Storage: $24.00
─────────────────────────────────────
TOTAL: $303.82

5. Trace lookup by id

$ cyberos-obs trace get t_3ab9c8d4

trace_id: t_3ab9c8d4
duration: 412 ms
spans:
 apollo-router sendMessage(graphql) 412 ms
 ├─ auth RBAC.Check 8 ms
 ├─ chat CreateMessage 286 ms
 │ ├─ brain put_message 12 ms
 │ └─ ai-gateway summariseSync 260 ms
 │ ├─ tenant_policy 3 ms
 │ ├─ redactor 2 ms
 │ └─ bedrock invoke 254 ms
 └─ chat FanoutMentions 14 ms

6. Audit-log query (compliance)

$ cyberos-obs audit query --since 2026-04-01 --action 'brain.delete' --format jsonl
{"seq":12031,"action":"brain.delete","actor":"stephen@…","mode":"tombstone","path":"memories/…","ts":"…"}
{"seq":12102,"action":"brain.delete","actor":"dpo@…","mode":"purge","reason":"DSAR-2026-014","path":"memories/…","ts":"…"}
…
[query] 47 rows · chain integrity verified

7. SLO definition (YAML)

# cyberos-obs/slo/ai-gateway-latency.yml
slo:
  id: ai-gateway-latency-p95
  service: ai-gateway
  indicator: latency_p95
  target: 2.0 # seconds
  window: 28d
  alerts:
    burn_rate_fast:
      severity: critical
      route: pagerduty
      threshold: 2.0 # 2x burn over 1h
    burn_rate_slow:
      severity: warning
      route: chat
      threshold: 1.0 # 1x burn over 6h
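
The two thresholds in the YAML are burn rates: the observed bad-event ratio divided by the ratio the SLO allows. The sketch below shows that arithmetic and the resulting fast/slow classification. It assumes the SLO is expressed as a success-ratio objective (for example 99.9% of requests within the latency target), which the YAML does not state; the function names are hypothetical.

```python
from typing import Optional, Tuple


def burn_rate(bad: int, total: int, objective: float) -> float:
    """Observed bad-event ratio divided by the allowed ratio (1 - objective)."""
    if total == 0:
        return 0.0
    return (bad / total) / (1.0 - objective)


def classify(fast_1h: float, slow_6h: float) -> Optional[Tuple[str, str]]:
    """Map the two burn rates to (severity, route), mirroring the YAML thresholds."""
    if fast_1h >= 2.0:
        return ("critical", "pagerduty")
    if slow_6h >= 1.0:
        return ("warning", "chat")
    return None
```

A burn rate of 1.0 means the budget would be exhausted exactly at the end of the window. At 2.0 it would be gone in half the window, which is why the fast rule pages.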
16

Phase status & estimates

Status: Planned (P0 · design phase · P0 · slice 1)
Est. LoC: ~3,000 (Rust query proxy + Go collector configs)
SLOs at P0: ~11 (platform + per-module)
P0 budget: ~$130/mo (LGTM hosting + LangSmith starter)
Decision-log retention: ≥ 180 d (EU AI Act Art. 12)
PII recall target: ≥ 99.5% (FR pending)
| Capability | Status |
|---|---|
| OTel Collector + LGTM backends | planned · P0 |
| PII redaction processor | planned · P0 |
| Tenant-tag processor | planned · P0 |
| tenant_query_proxy (Rust) | planned · P0 |
| Grafana dashboards (per-module SLO) | planned · P0 |
| Per-tenant cost dashboards | planned · P0 |
| Alert Manager + PagerDuty routing | planned · P0 |
| Audit-log surface (read-only) | planned · P0 |
| LangSmith integration | planned · P0 |
| SLO-as-code (Sloth-style) | planned · P0 |
| Auto-pause feature flags on burn | planned · P1 |
| Mimir for 1y metric retention | planned · P1+ |
| Multi-region active-active | planned · P3+ |

17

References

  • Bigger picture (§0 above): 3 strategic roles + emitter/consumer Mermaid + auto-vs-human matrix.
  • Three-pillars unified pane (§2.5 above): pillar × signal-type table + cross-pillar correlation example + tenant query proxy guarantee.
  • Auto-runbook router (§2.6 above): 6-step routing sequence + severity × routing matrix + runbook-catalogue growth loop.
  • Compliance evidence surface (§2.7 above): regulator view × audit scope matrix + per-view scoping YAML + chain-of-custody manifest.
  • Cross-module page links: brain.html · auth.html · ai.html · cuo.html · mcp.html · kb.html · chat.html · proj.html
  • Build-readiness audit: archive/2026-05-14/AUDIT_AND_PLAN.md (archived; see cyberos/CHANGELOG.md) — OBS placed at P0 · slice 1 alongside AI Gateway as the two earliest P0 modules.
  • Research review: archive/2026-05-14/RESEARCH_REVIEW.md (archived; see cyberos/CHANGELOG.md) — OBS rated "Strong" (9/10); auto-runbook-router framing flagged as the most differentiated operational play.
  • BRAIN auto-sync vision: BRAIN_AUTOSYNC_DESIGN.md §8 — OBS reads from BRAIN's audit chain for compliance exports; never writes back (BRAIN is upstream).
  • FR authoring discipline: modules/skill/feature-request-audit/AUTHORING_DISCIPLINE.md.
  • EU AI Act (Reg. 2024/1689) — Art. 12 logging, Art. 13 transparency, Art. 14 human oversight, Art. 26 deployer obligations.
  • ISO/IEC 27001:2022 — A.8.15 logging, A.8.16 monitoring activities.
  • ISO/IEC 42001 (AIMS) — § 9.1 performance evaluation.
  • SOC 2 Type II — CC7.2 monitoring, CC7.3 evaluation, CC7.4 incident response evidence.
  • Vietnam PDPL (Law 91/2025) — Art. 14 DSAR transparency, Art. 20 security obligations.
  • Vietnam Decree 13/2023 — Art. 17 processing log requirement.
  • GDPR (EU 2016/679) — Art. 30 Records of Processing Activities.
  • OpenTelemetry — specification + Rust + Python SDKs; W3C TraceContext propagation.
  • Grafana Loki + Tempo + Mimir — upstream LGTM stack.
  • LangSmith — managed AI-trace observability (EU-residency tenants use self-hosted equivalent).
  • Sloth — SLO-as-code engine (Prometheus rule generator).
  • Architecture context: infrastructure.html#obs.