docs: workload refactor + observability progress

Two design + handoff docs:

- docs/WORKLOAD_REFACTOR_TODO.md — status-at-a-glance table
  showing what's done (volume scopes, kind-aware editors,
  vendor webhook parsing, chain-panel CSS, Log Rules panel)
  and what's still pending (static source inline port + the
  hard legacy cutover gated on it; codemap entries; /apps
  page-level i18n; Priority 4 integration tests).

- docs/LOGSCAN_AND_TRIGGERS_TODO.md — companion design + status
  doc for the two Observability features. Records the
  loop-prevention invariant (event_log = system observing
  itself, webhook_deliveries = system talking to outside) so
  the next contributor doesn't accidentally break it by adding
  a new EventLog subscriber that re-publishes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-11 22:18:51 +03:00

Log Scanner + Event Triggers — Design Handoff

Two related features. They can ship independently, but were designed together because they share the event_log seam.

A. Log scanner — tail container logs, match against rules, emit event_log entries. Producer of events.
B. Event triggers — turn event_log entries into webhook / notification dispatches. Consumer of events. Generalizes the existing RegisterPersistentLogger pattern.

Either half is useful alone:

A without B = errors get surfaced in the events UI, no external delivery.
B without A = manual + reconciler + deploy events can drive notifications.

Recommended ship order: B first (smaller, self-contained generalization), then A (more moving parts, depends on container-lifecycle hooks).

A. Log scanner — BACKEND LANDED

Status:

Schema + store CRUD — internal/store/log_scan_rules.go + log_scan_rules table added to the observabilityTables block. Includes the EffectiveLogScanRules(workloadID) helper that resolves global rules minus per-workload overrides plus workload-only additions in one Go-side pass.
Stream-selectable docker reads — internal/docker/container.go ContainerLogsOpts accepts a ContainerLogOptions{ShowStdout, ShowStderr, Follow, Tail} so the scanner can subscribe to one stream when a rule scopes itself to stdout or stderr. The legacy ContainerLogs is preserved as a thin wrapper for back-compat.
Engine — internal/logscanner/engine.go: per-rule cooldown (keyed on container+rule), per-container token bucket (default 10 events / 60s, override-able), regex match per line, hits returned for the manager to persist. Pure logic, fully unit-tested.
Tail goroutine — internal/logscanner/tail.go: per-container loop reading docker's multiplexed log frames (with TTY fallback), strips the prepended RFC3339 timestamp, runs every line through the engine + snapshot. Exits on container stop or context cancel.
Manager — internal/logscanner/manager.go: 5s polling diff against ListContainers(state=running), atomic.Pointer[Snapshot] hot-reload, structural HitEmitter that writes event_log rows AND publishes EventLog on the bus (so event-trigger dispatchers can pick them up immediately).
API — internal/api/log_scan_rules.go: full CRUD, /test endpoint accepting {"sample_line": "..."} and returning matched/captures, plus GET /api/workloads/{id}/effective-rules for the workload detail page's future Log Rules tab. Admin-gated mutations.
Wired in main.go before the API server is constructed so the reload callback is plugged via apiServer.SetLogScanReloader.
Loop-prevention — Same boundary as feature B: scanner publishes EventLog events, dispatcher consumes them, neither writes to event_log on the consume side.
Tests — internal/logscanner/{engine,rules}_test.go cover cooldown isolation, token bucket refill, stream filtering, override-replaces-global, disabled-override-suppresses-global, compile-error reporting. internal/store/log_scan_rules_test.go covers validation + cascade delete.

Frontend still pending — /log-scan-rules pages, regex test box component, Log Rules tab on /apps/[id], i18n keys. Not touched this turn.

Where it plugs in

internal/docker/container.go:362 already exposes ContainerLogs(ctx, id, follow=true, tail). The existing SSE handler at internal/api/workloads.go:43 (streamWorkloadContainerLogs) is per-viewer and dies on browser disconnect — do not hook the scanner there. The scanner is a separate long-lived subsystem owned by the server process.

Minor required change to ContainerLogs: expose ShowStdout / ShowStderr as caller-controlled. Currently hardcoded to true/true. Single existing caller passes "both" → no friction. Add an options struct or two booleans.

New package: `internal/logscanner/`

internal/logscanner/
  manager.go    — Manager: map[containerID]*tail, lifecycle hooks
  tail.go       — per-container goroutine; reads logs, fans to engine
  engine.go     — rule evaluation + cooldown + rate limit
  rules.go      — Rule struct, regex compile cache, effective-set resolver

Manager lifecycle. Subscribes to container start/stop signals. Options for the signal source:

1. Add a ContainerStarted / ContainerStopped event type to the bus and publish from the reconciler + deployer. Cleanest, but adds two event types.
2. Manager polls docker.ListContainers every N seconds and diffs. Lazier, robust to missed signals, slightly higher idle CPU. Probably fine.

Pick (1) if you want zero-latency start, (2) if you want fewer moving parts. Defaulting to (2) with 5s poll — Docker container starts already take seconds; sub-second matching is not a requirement.
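The polling option boils down to a set difference between "containers we already tail" and "containers Docker reports as running". A minimal sketch of that diff step follows; diffRunning and its signature are hypothetical names for illustration and are not taken from manager.go.

```go
package main

import "sort"

// diffRunning compares the container IDs that already have a tail with the
// IDs reported as running by the latest poll. It returns the IDs that need a
// new tail (start) and the IDs whose tail should be cancelled (stop).
// Hypothetical helper: the real manager may structure this differently.
func diffRunning(tailing map[string]struct{}, running []string) (start, stop []string) {
	seen := make(map[string]struct{}, len(running))
	for _, id := range running {
		if _, dup := seen[id]; dup {
			continue // tolerate duplicate IDs in the poll result
		}
		seen[id] = struct{}{}
		if _, ok := tailing[id]; !ok {
			start = append(start, id)
		}
	}
	for id := range tailing {
		if _, ok := seen[id]; !ok {
			stop = append(stop, id)
		}
	}
	// Map iteration order is random; sort so callers see stable output.
	sort.Strings(start)
	sort.Strings(stop)
	return start, stop
}
```

Called once per poll tick, the start list would spawn tails and the stop list would cancel their contexts. Because the diff is recomputed from scratch each tick, a missed start or stop is corrected on the next poll.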

Tail goroutine. On container start: open ContainerLogs(follow=true, tail="0") with stdout/stderr filters per rules in scope. Read line-by-line via bufio.Scanner. For each line: run through engine. On container stop or ctx cancel: drain and exit.
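The per-line part of the tail loop could look roughly like the sketch below. It assumes the stream has already been demultiplexed into plain text and that each line starts with an RFC3339 timestamp followed by a single space; splitTimestamp and scanLines are hypothetical names, not the functions in tail.go.

```go
package main

import (
	"bufio"
	"io"
	"strings"
	"time"
)

// splitTimestamp separates a leading RFC3339 timestamp from the log text.
// If the line has no parseable timestamp prefix it is returned unchanged
// with ok=false. Assumption: timestamp and text are separated by one space.
func splitTimestamp(line string) (ts time.Time, text string, ok bool) {
	prefix, rest, found := strings.Cut(line, " ")
	if !found {
		return time.Time{}, line, false
	}
	parsed, err := time.Parse(time.RFC3339Nano, prefix)
	if err != nil {
		return time.Time{}, line, false
	}
	return parsed, rest, true
}

// scanLines reads r line by line and passes each timestamp-stripped line to
// handle. It returns when r reaches EOF or the scanner reports an error.
func scanLines(r io.Reader, handle func(text string)) error {
	sc := bufio.NewScanner(r)
	// Raise the default 64 KiB token limit so long log lines are not rejected.
	sc.Buffer(make([]byte, 0, 64*1024), 1024*1024)
	for sc.Scan() {
		_, text, _ := splitTimestamp(sc.Text())
		handle(text)
	}
	return sc.Err()
}
```

In this sketch, context cancellation would be handled by closing the underlying reader, which makes sc.Scan return false and ends the loop.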

Engine. Holds compiled regexes per rule. For each line:

Walk effective ruleset for this workload (see schema below).
For each matching rule: check cooldown (map[ruleID]time.Time, mutex guarded). If cooled down, insert event_log row + publish + update timestamp.
Per-container token bucket (default: 10 events/min/container) to prevent catastrophic event_log floods if a regex is too greedy.

Schema

Single table, global + override pattern. No separate "overrides" table.

CREATE TABLE log_scan_rules (
  id               INTEGER PRIMARY KEY AUTOINCREMENT,
  workload_id      TEXT,                  -- NULL = global rule
  overrides_id     INTEGER,               -- if set, this row overrides a global rule for one workload
  name             TEXT NOT NULL,
  pattern          TEXT NOT NULL,         -- regex, compiled at load
  severity         TEXT NOT NULL,         -- info|warn|error
  streams          TEXT NOT NULL DEFAULT 'all',  -- all|stdout|stderr
  cooldown_seconds INTEGER NOT NULL DEFAULT 60,
  enabled          INTEGER NOT NULL DEFAULT 1,
  created_at       TEXT NOT NULL,
  FOREIGN KEY (workload_id) REFERENCES workloads(id) ON DELETE CASCADE,
  FOREIGN KEY (overrides_id) REFERENCES log_scan_rules(id) ON DELETE CASCADE
);
CREATE INDEX idx_log_scan_rules_workload ON log_scan_rules(workload_id);
CREATE INDEX idx_log_scan_rules_overrides ON log_scan_rules(overrides_id);

Effective ruleset for workload X:

1. All rows where workload_id IS NULL AND overrides_id IS NULL (pure globals), minus any global that has a row with workload_id = X AND overrides_id = global.id.
2. Plus all rows where workload_id = X AND overrides_id IS NULL (workload-only additions).
3. Plus all override rows where workload_id = X AND overrides_id IS NOT NULL (substitute for the global; their fields win, including enabled=false to disable the global for this workload).

A pure SQL implementation is doable with a LEFT JOIN ... WHERE override.id IS NULL for step 1 plus a UNION ALL for steps 2 and 3. Or compute in Go after two simpler queries — fine since rule counts will be small.
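A minimal sketch of the Go-side resolution, assuming all candidate rows have already been loaded into memory. The Rule struct below is a trimmed, hypothetical stand-in for the store model (nil pointers represent SQL NULL), and resolveEffective is not the actual EffectiveLogScanRules helper.

```go
package main

import "sort"

// Rule carries only the log_scan_rules columns needed for resolution.
// Hypothetical shape; nil pointers stand in for SQL NULL.
type Rule struct {
	ID          int64
	WorkloadID  *string
	OverridesID *int64
	Name        string
	Enabled     bool
}

// resolveEffective returns the rows that apply to workloadID:
// workload-only additions, override rows, and every pure global that is not
// overridden for this workload. Disabled rows are kept in the result so the
// caller can tell "disabled by override" apart from "no rule".
func resolveEffective(all []Rule, workloadID string) []Rule {
	overridden := make(map[int64]bool)
	var out []Rule

	// Pass 1: rows scoped to this workload (additions and overrides).
	for _, r := range all {
		if r.WorkloadID == nil || *r.WorkloadID != workloadID {
			continue
		}
		if r.OverridesID != nil {
			overridden[*r.OverridesID] = true
		}
		out = append(out, r)
	}
	// Pass 2: pure globals that no override row replaced.
	for _, r := range all {
		if r.WorkloadID == nil && r.OverridesID == nil && !overridden[r.ID] {
			out = append(out, r)
		}
	}
	sort.Slice(out, func(i, j int) bool { return out[i].ID < out[j].ID })
	return out
}
```

The engine would then skip rows with Enabled == false, which is how an override with enabled=false silences a global rule for one workload.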

Output

Scanner calls store.InsertEvent with:

Source = "logscan"
Severity from the matched rule
Message = raw matched line (truncated to ~500 chars)
Metadata JSON = {"workload_id": ..., "container_id": ..., "rule_id": ..., "rule_name": ..., "captures": {...}}

Then bus.Publish(EventLog, payload). This reuses exactly the path internal/events/bus.go:158 (RegisterPersistentLogger) already established. SSE clients see it live, and the dispatcher from feature B picks it up.
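The event fields listed above could be assembled roughly as follows. The struct and helper names are hypothetical; only the JSON keys come from the list. The sketch also assumes "~500 chars" can be approximated as 500 bytes cut on a UTF-8 boundary.

```go
package main

import (
	"encoding/json"
	"unicode/utf8"
)

// maxMessageBytes approximates the "~500 chars" limit in bytes (assumption).
const maxMessageBytes = 500

// truncate cuts s to at most max bytes without splitting a UTF-8 sequence.
func truncate(s string, max int) string {
	if len(s) <= max {
		return s
	}
	cut := max
	for cut > 0 && !utf8.RuneStart(s[cut]) {
		cut-- // back up until the cut lands on the first byte of a rune
	}
	return s[:cut]
}

// hitMetadata mirrors the metadata JSON keys described for logscan events.
type hitMetadata struct {
	WorkloadID  string            `json:"workload_id"`
	ContainerID string            `json:"container_id"`
	RuleID      int64             `json:"rule_id"`
	RuleName    string            `json:"rule_name"`
	Captures    map[string]string `json:"captures"`
}

// buildMetadata serializes the metadata for storage alongside the event.
func buildMetadata(m hitMetadata) (string, error) {
	b, err := json.Marshal(m)
	if err != nil {
		return "", err
	}
	return string(b), nil
}
```

Truncating on a rune boundary avoids storing a message that ends in a partial multi-byte character, which would otherwise show up as a replacement glyph in the events UI.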

Hot-reload

When a rule is created/updated/deleted via the API, the manager must rebuild the effective ruleset for affected containers. Cheapest path: a single *atomic.Pointer[ruleSnapshot] shared across tails, replaced wholesale on any rule change. Each tail dereferences the snapshot per line — no locking on the hot path.

B. Event triggers — BACKEND LANDED

Status:

Schema + store CRUD — internal/store/event_triggers.go + table creation in internal/store/store.go observabilityTables. Model: EventTrigger in internal/store/models.go.
Dispatcher — internal/events/dispatcher.go RegisterEventTriggerDispatcher(bus, triggerSource, notifier). Filter eval is AND-composed across severity (CSV), source (CSV), and optional message regex. Compiled regexes are memoized.
Webhook delivery — extended notify.Notifier with SendPayload(url, secret, eventType, payload) which reuses the existing HMAC + headers infra (X-Hub-Signature-256, etc.). New TierEventTrigger tier is recorded for telemetry / audit.
Loop-prevention — dispatcher does not call InsertEvent. Delivery outcomes go through the notifier's existing logging only.
API — internal/api/event_triggers.go with admin-gated mutations:

GET    /api/event-triggers
POST   /api/event-triggers
GET    /api/event-triggers/{id}
PATCH  /api/event-triggers/{id}
DELETE /api/event-triggers/{id}
POST   /api/event-triggers/{id}/test     — synthetic event_log → notifier.SendSyncForTest

Wired in main.go next to RegisterPersistentLogger.
Tests — internal/events/dispatcher_test.go: 10 cases covering filter eval, regex caching, dispatcher fan-out, unsupported action_type, trigger-source errors. CSV filter helper has dedicated table-driven coverage.

Frontend still pending — /event-triggers list + detail + new pages, the Send-test UX, i18n keys. Not touched this turn.

Where it plugs in

Mirrors the RegisterPersistentLogger shape at internal/events/bus.go:158:

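func RegisterEventTriggerDispatcher(b *Bus, triggers TriggerSource, notifier Notifier) func() {
    // Subscribe only to EventLog events; other event types never reach the loop.
    sub := b.Subscribe(func(evt Event) bool { return evt.Type == EventLog })
    go func() {
        for evt := range sub {
            payload, ok := evt.Payload.(EventLogPayload)
            if !ok {
                continue // ignore payloads of an unexpected type
            }
            for _, t := range triggers.Enabled() {
                if t.matches(payload) {
                    // Delivery outcomes are recorded by the notifier, never via InsertEvent.
                    notifier.Send(t.ActionTarget, buildBody(t, payload))
                }
            }
        }
    }()
    // The returned func unsubscribes the dispatcher from the bus.
    return func() { b.Unsubscribe(sub) }
}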

Reuses the existing notifier at internal/notify/notifier.go — including the signed-delivery and webhook_deliveries audit trail.
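For readers unfamiliar with signed delivery, the sketch below shows the general shape of an HMAC-SHA256 body signature. It assumes the X-Hub-Signature-256 header follows the widely used "sha256=" + hex digest convention; whether the notifier uses exactly this format is not confirmed by this document, and sign / verify are hypothetical helper names.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
)

// sign returns the header value for body under secret.
// Assumed format: "sha256=" followed by the hex-encoded HMAC-SHA256 digest.
func sign(secret, body []byte) string {
	mac := hmac.New(sha256.New, secret)
	mac.Write(body) // hash.Hash writes never return an error
	return "sha256=" + hex.EncodeToString(mac.Sum(nil))
}

// verify recomputes the signature and compares it in constant time, which
// is what a webhook receiver would do with the incoming header.
func verify(secret, body []byte, header string) bool {
	return hmac.Equal([]byte(sign(secret, body)), []byte(header))
}
```

A receiver must verify against the raw request bytes before any JSON parsing, since re-serializing the payload can change whitespace or key order and invalidate the digest.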

Schema

CREATE TABLE event_triggers (
  id                    INTEGER PRIMARY KEY AUTOINCREMENT,
  name                  TEXT NOT NULL,
  filter_severity       TEXT,            -- nullable; comma-list like 'warn,error'
  filter_source         TEXT,            -- nullable; comma-list like 'logscan,deploy'
  filter_message_regex  TEXT,            -- nullable; matched against message
  action_type           TEXT NOT NULL,   -- 'webhook' | 'notification_channel'
  action_target         TEXT NOT NULL,   -- URL or channel ID
  enabled               INTEGER NOT NULL DEFAULT 1,
  created_at            TEXT NOT NULL
);

Filters AND together. Empty filters match all.
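The AND-composition can be sketched as below. The struct and function names are hypothetical, whitespace around CSV items is trimmed as an assumption, and the regex is compiled on every call for brevity, whereas the landed dispatcher memoizes compiled regexes.

```go
package main

import (
	"regexp"
	"strings"
)

// trigger holds only the three filter columns from event_triggers.
// An empty string stands in for SQL NULL.
type trigger struct {
	FilterSeverity     string // comma-list, e.g. "warn,error"
	FilterSource       string // comma-list, e.g. "logscan,deploy"
	FilterMessageRegex string
}

type eventLogPayload struct {
	Severity string
	Source   string
	Message  string
}

// csvContains reports whether value appears in the comma-list.
// An empty list places no constraint and therefore matches everything.
func csvContains(csv, value string) bool {
	if strings.TrimSpace(csv) == "" {
		return true
	}
	for _, item := range strings.Split(csv, ",") {
		if strings.TrimSpace(item) == value {
			return true
		}
	}
	return false
}

// matches returns true only if every non-empty filter accepts the event.
// An invalid regex is reported as an error rather than a silent non-match.
func (t trigger) matches(p eventLogPayload) (bool, error) {
	if !csvContains(t.FilterSeverity, p.Severity) || !csvContains(t.FilterSource, p.Source) {
		return false, nil
	}
	if t.FilterMessageRegex == "" {
		return true, nil
	}
	re, err := regexp.Compile(t.FilterMessageRegex)
	if err != nil {
		return false, err
	}
	return re.MatchString(p.Message), nil
}
```

Checking the cheap CSV filters before the regex means most non-matching events are rejected without touching the regex engine.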

Loop-prevention

Critical constraint: the dispatcher must not write to event_log. All delivery successes / failures land in webhook_deliveries (existing table) so the audit trail is preserved without risking trigger recursion. Keeps the boundary crisp:

event_log = system observing itself
webhook_deliveries = system talking to the outside

If a user-visible "trigger fired" entry is desired in the events UI, add a read-only join from webhook_deliveries into the events page rather than writing event_log rows.

What to defer

Item	Why	Add when
Multi-line stack trace coalescing	Real rabbit hole (which lines belong together?).	Real user pain.
Capture-group templating in messages (`{{.captures.code}}`)	v1 stores captures in metadata, displays raw line.	Once real rules exist and patterns emerge.
Backfilling history search	This is Loki/Grafana scope-creep.	Never (push to Loki instead if it comes up).
Per-rule alert routing	v1 fans out by `(severity, source)` filter on trigger side.	When users want one rule → one channel.
YAML config-as-code	Tinyforge is UI-driven everywhere else.	Probably never.
Retry / backoff on trigger delivery failure	Notifier already handles delivery; whether triggers retry is a separate question.	If trigger reliability becomes an SLO.

UI footprint

All boolean inputs use ToggleSwitch per project CLAUDE.md. All destructive actions use ConfirmDialog per memory note (no inline Yes/No strips).

New pages

/log-scan-rules — list with severity / workload filter, "+ New rule" button.
- Detail page: name, pattern (regex with live test box that takes a sample log line), severity, streams, cooldown, enabled toggle, scope picker (global / workload).
/event-triggers — list, "+ New trigger" button.
- Detail page: name, filters (severity multiselect, source multiselect, optional message regex), action type, action target, enabled toggle.

Augmentations

Workload detail page (/apps/[id]): new "Log Rules" tab/panel listing effective rules for this workload. Each global shows an "Override for this workload" button. Each override / workload-only shows edit + delete.
Events page (/events): entries with source=logscan get a small icon + tooltip showing rule name. Click → jumps to rule detail.
Settings sidebar: links to /log-scan-rules and /event-triggers under a new "Observability" group.

i18n keys to add

Roughly 40–60 keys across en.json + ru.json. Namespace: logscan.* and triggers.*.

API surface

GET    /api/log-scan-rules                 — list (filter: ?workload_id=, ?global=true)
POST   /api/log-scan-rules                 — create
GET    /api/log-scan-rules/{id}            — detail
PATCH  /api/log-scan-rules/{id}            — update
DELETE /api/log-scan-rules/{id}            — delete
POST   /api/log-scan-rules/{id}/test       — body: {sample_line}; returns matched: bool, captures
GET    /api/workloads/{id}/effective-rules — computed effective ruleset for a workload

GET    /api/event-triggers                 — list
POST   /api/event-triggers                 — create
GET    /api/event-triggers/{id}            — detail
PATCH  /api/event-triggers/{id}            — update
DELETE /api/event-triggers/{id}            — delete
POST   /api/event-triggers/{id}/test       — dispatches a synthetic event to verify the action target

POST .../test endpoints are worth shipping in v1 — they make the rule / trigger editing UX dramatically nicer and avoid "did I get the regex right?" deploy-and-pray cycles.
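The core of a rule /test handler can be small. The sketch below compiles the pattern, runs it against the sample line, and collects capture groups; testPattern and the testResult shape are hypothetical, and keying unnamed groups by their index is an assumption about how captures might be presented.

```go
package main

import (
	"regexp"
	"strconv"
)

// testResult is a hypothetical response body for the /test endpoint.
type testResult struct {
	Matched  bool              `json:"matched"`
	Captures map[string]string `json:"captures,omitempty"`
}

// testPattern compiles pattern and applies it to sampleLine. A compile
// error is returned to the caller so the UI can show it next to the input.
func testPattern(pattern, sampleLine string) (testResult, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return testResult{}, err
	}
	m := re.FindStringSubmatch(sampleLine)
	if m == nil {
		return testResult{Matched: false}, nil
	}
	caps := make(map[string]string)
	for i, name := range re.SubexpNames() {
		if i == 0 {
			continue // index 0 is the whole match, not a capture group
		}
		if name == "" {
			name = strconv.Itoa(i) // unnamed groups are keyed by position
		}
		caps[name] = m[i]
	}
	return testResult{Matched: true, Captures: caps}, nil
}
```

Go's regexp package uses the `(?P<name>...)` syntax for named groups, so a rule such as `exit code (?P<code>\d+)` would yield a "code" entry in the captures map.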

File pointers (when work starts)

Backend, new:

internal/logscanner/{manager,tail,engine,rules}.go
internal/api/log_scan_rules.go
internal/api/event_triggers.go
internal/store/log_scan_rules.go
internal/store/event_triggers.go
internal/events/dispatcher.go (or extend bus.go with RegisterEventTriggerDispatcher)

Backend, modified:

internal/docker/container.go:362 — expose stream selection on ContainerLogs
internal/api/router.go — register new routes
cmd/server/main.go — wire RegisterEventTriggerDispatcher next to RegisterPersistentLogger, start logscanner.Manager
migrations: internal/store/migrations/00XX_log_scan_rules.sql, 00XX_event_triggers.sql

Frontend, new:

web/src/routes/log-scan-rules/+page.svelte, [id]/+page.svelte, new/+page.svelte
web/src/routes/event-triggers/+page.svelte, [id]/+page.svelte, new/+page.svelte
web/src/lib/components/LogRulePanel.svelte (workload detail tab)
web/src/lib/components/RegexTestBox.svelte (reusable)

Frontend, modified:

web/src/routes/apps/[id]/+page.svelte — add Log Rules tab
web/src/routes/events/+page.svelte — logscan source icon + rule tooltip
web/src/routes/+layout.svelte — Observability nav group
web/src/lib/i18n/{en,ru}.json — new key namespaces
web/src/lib/api.ts, web/src/lib/types.ts — typed clients

Open questions to revisit before coding

Container start/stop signal source — bus events (low latency, two new event types) vs polling (simpler, ~5s latency). Tentative: polling.
Trigger delivery retry — does the dispatcher retry on webhook failure, or is one shot enough since webhook_deliveries records failures? Tentative: one shot v1; revisit if reliability complaints surface.
Where does the "logscan source icon" link go on the events page: the rule detail page, or the workload's effective-rules tab? The latter is probably more useful since it shows context.

Memory pointer

Add a memory after this lands describing the event_log = observe-self, webhook_deliveries = talk-to-outside boundary — it's the kind of invariant that's easy to violate accidentally when adding new event types later.
