# Log Scanner + Event Triggers — Design Handoff Two related features. They can ship independently, but were designed together because they share the event_log seam. - **A. Log scanner** — tail container logs, match against rules, emit event_log entries. Producer of events. - **B. Event triggers** — turn event_log entries into webhook / notification dispatches. Consumer of events. Generalizes the existing `RegisterPersistentLogger` pattern. Either half is useful alone: - A without B = errors get surfaced in the events UI, no external delivery. - B without A = manual + reconciler + deploy events can drive notifications. Recommended ship order: B first (smaller, self-contained generalization), then A (more moving parts, depends on container-lifecycle hooks). --- ## A. Log scanner — BACKEND LANDED Status: - **Schema + store CRUD** — `internal/store/log_scan_rules.go` + `log_scan_rules` table added to the `observabilityTables` block. Includes the `EffectiveLogScanRules(workloadID)` helper that resolves global rules minus per-workload overrides plus workload- only additions in one Go-side pass. - **Stream-selectable docker reads** — `internal/docker/container.go` `ContainerLogsOpts` accepts a `ContainerLogOptions{ShowStdout, ShowStderr, Follow, Tail}` so the scanner can subscribe to one stream when a rule scopes itself to stdout or stderr. The legacy `ContainerLogs` is preserved as a thin wrapper for back-compat. - **Engine** — `internal/logscanner/engine.go`: per-rule cooldown (keyed on container+rule), per-container token bucket (default 10 events / 60s, override-able), regex match per line, hits returned for the manager to persist. Pure logic, fully unit-tested. - **Tail goroutine** — `internal/logscanner/tail.go`: per-container loop reading docker's multiplexed log frames (with TTY fallback), strips the prepended RFC3339 timestamp, runs every line through the engine + snapshot. Exits on container stop or context cancel. - **Manager** — `internal/logscanner/manager.go`: 5s polling diff against `ListContainers(state=running)`, atomic.Pointer[Snapshot] hot-reload, structural HitEmitter that writes event_log rows AND publishes `EventLog` on the bus (so event-trigger dispatchers can pick them up immediately). - **API** — `internal/api/log_scan_rules.go`: full CRUD, `/test` endpoint accepting `{"sample_line": "..."}` and returning matched/captures, plus `GET /api/workloads/{id}/effective-rules` for the workload detail page's future Log Rules tab. Admin-gated mutations. - **Wired in main.go** before the API server is constructed so the reload callback is plugged via `apiServer.SetLogScanReloader`. - **Loop-prevention** — Same boundary as feature B: scanner publishes EventLog events, dispatcher consumes them, neither writes to event_log on the consume side. - **Tests** — `internal/logscanner/{engine,rules}_test.go` cover cooldown isolation, token bucket refill, stream filtering, override-replaces-global, disabled-override-suppresses-global, compile-error reporting. `internal/store/log_scan_rules_test.go` covers validation + cascade delete. **Frontend still pending** — `/log-scan-rules` pages, regex test box component, Log Rules tab on `/apps/[id]`, i18n keys. Not touched this turn. ### Where it plugs in [internal/docker/container.go:362](../internal/docker/container.go#L362) already exposes `ContainerLogs(ctx, id, follow=true, tail)`. The existing SSE handler at [internal/api/workloads.go:43](../internal/api/workloads.go#L43) (`streamWorkloadContainerLogs`) is per-viewer and dies on browser disconnect — **do not hook the scanner there**. The scanner is a separate long-lived subsystem owned by the server process. Minor required change to `ContainerLogs`: expose `ShowStdout` / `ShowStderr` as caller-controlled. Currently hardcoded to `true`/`true`. Single existing caller passes "both" → no friction. Add an options struct or two booleans. ### New package: `internal/logscanner/` ``` internal/logscanner/ manager.go — Manager: map[containerID]*tail, lifecycle hooks tail.go — per-container goroutine; reads logs, fans to engine engine.go — rule evaluation + cooldown + rate limit rules.go — Rule struct, regex compile cache, effective-set resolver ``` **Manager lifecycle.** Subscribes to container start/stop signals. Options for the signal source: 1. Add a `ContainerStarted` / `ContainerStopped` event type to the bus and publish from the reconciler + deployer. Cleanest, but adds two event types. 2. Manager polls `docker.ListContainers` every N seconds and diffs. Lazier, robust to missed signals, slightly higher idle CPU. Probably fine. Pick (1) if you want zero-latency start, (2) if you want fewer moving parts. Defaulting to **(2) with 5s poll** — Docker container starts already take seconds; sub-second matching is not a requirement. **Tail goroutine.** On container start: open `ContainerLogs(follow=true, tail="0")` with stdout/stderr filters per rules in scope. Read line-by-line via `bufio.Scanner`. For each line: run through engine. On container stop or ctx cancel: drain and exit. **Engine.** Holds compiled regexes per rule. For each line: - Walk effective ruleset for this workload (see schema below). - For each matching rule: check cooldown (`map[ruleID]time.Time`, mutex guarded). If cooled down, insert event_log row + publish + update timestamp. - Per-container token bucket (default: 10 events/min/container) to prevent catastrophic event_log floods if a regex is too greedy. ### Schema Single table, global + override pattern. No separate "overrides" table. ```sql CREATE TABLE log_scan_rules ( id INTEGER PRIMARY KEY AUTOINCREMENT, workload_id TEXT, -- NULL = global rule overrides_id INTEGER, -- if set, this row overrides a global rule for one workload name TEXT NOT NULL, pattern TEXT NOT NULL, -- regex, compiled at load severity TEXT NOT NULL, -- info|warn|error streams TEXT NOT NULL DEFAULT 'all', -- all|stdout|stderr cooldown_seconds INTEGER NOT NULL DEFAULT 60, enabled INTEGER NOT NULL DEFAULT 1, created_at TEXT NOT NULL, FOREIGN KEY (workload_id) REFERENCES workloads(id) ON DELETE CASCADE, FOREIGN KEY (overrides_id) REFERENCES log_scan_rules(id) ON DELETE CASCADE ); CREATE INDEX idx_log_scan_rules_workload ON log_scan_rules(workload_id); CREATE INDEX idx_log_scan_rules_overrides ON log_scan_rules(overrides_id); ``` **Effective ruleset for workload X:** 1. All rows where `workload_id IS NULL AND overrides_id IS NULL` (pure globals), *minus* any global that has a row with `workload_id = X AND overrides_id = global.id`. 2. Plus all rows where `workload_id = X AND overrides_id IS NULL` (workload-only additions). 3. Plus all override rows where `workload_id = X AND overrides_id IS NOT NULL` (substitute for the global; their fields win, including `enabled=false` to disable the global for this workload). A pure SQL implementation is doable with a `LEFT JOIN ... WHERE override.id IS NULL` for step 1 plus a `UNION ALL` for steps 2 and 3. Or compute in Go after two simpler queries — fine since rule counts will be small. ### Output Scanner calls `store.InsertEvent` with: - `Source = "logscan"` - `Severity` from the matched rule - `Message` = raw matched line (truncated to ~500 chars) - `Metadata` JSON = `{"workload_id": ..., "container_id": ..., "rule_id": ..., "rule_name": ..., "captures": {...}}` Then `bus.Publish(EventLog, payload)`. This reuses exactly the path [internal/events/bus.go:158](../internal/events/bus.go#L158) (`RegisterPersistentLogger`) already established. SSE clients see it live, and the dispatcher from feature B picks it up. ### Hot-reload When a rule is created/updated/deleted via the API, the manager must rebuild the effective ruleset for affected containers. Cheapest path: a single `*atomic.Pointer[ruleSnapshot]` shared across tails, replaced wholesale on any rule change. Each tail dereferences the snapshot per line — no locking on the hot path. --- ## B. Event triggers — BACKEND LANDED Status: - **Schema + store CRUD** — `internal/store/event_triggers.go` + table creation in `internal/store/store.go` `observabilityTables`. Model: `EventTrigger` in `internal/store/models.go`. - **Dispatcher** — `internal/events/dispatcher.go` `RegisterEventTriggerDispatcher(bus, triggerSource, notifier)`. Filter eval is AND-composed across severity (CSV), source (CSV), and optional message regex. Compiled regexes are memoized. - **Webhook delivery** — extended `notify.Notifier` with `SendPayload(url, secret, eventType, payload)` which reuses the existing HMAC + headers infra (`X-Hub-Signature-256`, etc.). New `TierEventTrigger` tier is recorded for telemetry / audit. - **Loop-prevention** — dispatcher does **not** call `InsertEvent`. Delivery outcomes go through the notifier's existing logging only. - **API** — `internal/api/event_triggers.go` with admin-gated mutations: ```http GET /api/event-triggers POST /api/event-triggers GET /api/event-triggers/{id} PATCH /api/event-triggers/{id} DELETE /api/event-triggers/{id} POST /api/event-triggers/{id}/test — synthetic event_log → notifier.SendSyncForTest ``` - **Wired in main.go** next to `RegisterPersistentLogger`. - **Tests** — `internal/events/dispatcher_test.go`: 10 cases covering filter eval, regex caching, dispatcher fan-out, unsupported action_type, trigger-source errors. CSV filter helper has dedicated table-driven coverage. **Frontend still pending** — `/event-triggers` list + detail + new pages, the Send-test UX, i18n keys. Not touched this turn. ### Where it plugs in Mirrors the `RegisterPersistentLogger` shape at [internal/events/bus.go:158](../internal/events/bus.go#L158): ```go func RegisterEventTriggerDispatcher(b *Bus, triggers TriggerSource, notifier Notifier) func() { sub := b.Subscribe(func(evt Event) bool { return evt.Type == EventLog }) go func() { for evt := range sub { payload, ok := evt.Payload.(EventLogPayload) if !ok { continue } for _, t := range triggers.Enabled() { if t.matches(payload) { notifier.Send(t.ActionTarget, buildBody(t, payload)) } } } }() return func() { b.Unsubscribe(sub) } } ``` Reuses the existing notifier at [internal/notify/notifier.go](../internal/notify/notifier.go) — including the signed-delivery and `webhook_deliveries` audit trail. ### Schema ```sql CREATE TABLE event_triggers ( id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT NOT NULL, filter_severity TEXT, -- nullable; comma-list like 'warn,error' filter_source TEXT, -- nullable; comma-list like 'logscan,deploy' filter_message_regex TEXT, -- nullable; matched against message action_type TEXT NOT NULL, -- 'webhook' | 'notification_channel' action_target TEXT NOT NULL, -- URL or channel ID enabled INTEGER NOT NULL DEFAULT 1, created_at TEXT NOT NULL ); ``` Filters AND together. Empty filters match all. ### Loop-prevention **Critical constraint: the dispatcher must not write to event_log.** All delivery successes / failures land in `webhook_deliveries` (existing table) so the audit trail is preserved without risking trigger recursion. Keeps the boundary crisp: - `event_log` = system observing itself - `webhook_deliveries` = system talking to the outside If a user-visible "trigger fired" entry is desired in the events UI, add a *read-only join* from `webhook_deliveries` into the events page rather than writing event_log rows. --- ## What to defer | Item | Why | Add when | |---|---|---| | Multi-line stack trace coalescing | Real rabbit hole (which lines belong together?). | Real user pain. | | Capture-group templating in messages (`{{.captures.code}}`) | v1 stores captures in metadata, displays raw line. | Once real rules exist and patterns emerge. | | Backfilling history search | This is Loki/Grafana scope-creep. | Never (push to Loki instead if it comes up). | | Per-rule alert routing | v1 fans out by `(severity, source)` filter on trigger side. | When users want one rule → one channel. | | YAML config-as-code | Tinyforge is UI-driven everywhere else. | Probably never. | | Retry / backoff on trigger delivery failure | Notifier already handles delivery; whether *triggers* retry is a separate question. | If trigger reliability becomes an SLO. | --- ## UI footprint All boolean inputs use `ToggleSwitch` per project CLAUDE.md. All destructive actions use `ConfirmDialog` per memory note (no inline Yes/No strips). ### New pages - **`/log-scan-rules`** — list with severity / workload filter, "+ New rule" button. - Detail page: name, pattern (regex with live test box that takes a sample log line), severity, streams, cooldown, enabled toggle, scope picker (global / workload). - **`/event-triggers`** — list, "+ New trigger" button. - Detail page: name, filters (severity multiselect, source multiselect, optional message regex), action type, action target, enabled toggle. ### Augmentations - **Workload detail page** (`/apps/[id]`): new "Log Rules" tab/panel listing effective rules for this workload. Each global shows an "Override for this workload" button. Each override / workload-only shows edit + delete. - **Events page** (`/events`): entries with `source=logscan` get a small icon + tooltip showing rule name. Click → jumps to rule detail. - **Settings sidebar**: links to `/log-scan-rules` and `/event-triggers` under a new "Observability" group. ### i18n keys to add Roughly 40–60 keys across `en.json` + `ru.json`. Namespace: `logscan.*` and `triggers.*`. --- ## API surface ``` GET /api/log-scan-rules — list (filter: ?workload_id=, ?global=true) POST /api/log-scan-rules — create GET /api/log-scan-rules/{id} — detail PATCH /api/log-scan-rules/{id} — update DELETE /api/log-scan-rules/{id} — delete POST /api/log-scan-rules/{id}/test — body: {sample_line}; returns matched: bool, captures GET /api/workloads/{id}/effective-rules — computed effective ruleset for a workload GET /api/event-triggers — list POST /api/event-triggers — create GET /api/event-triggers/{id} — detail PATCH /api/event-triggers/{id} — update DELETE /api/event-triggers/{id} — delete POST /api/event-triggers/{id}/test — dispatches a synthetic event to verify the action target ``` `POST .../test` endpoints are worth shipping in v1 — they make the rule / trigger editing UX dramatically nicer and avoid "did I get the regex right?" deploy-and-pray cycles. --- ## File pointers (when work starts) **Backend, new:** - `internal/logscanner/{manager,tail,engine,rules}.go` - `internal/api/log_scan_rules.go` - `internal/api/event_triggers.go` - `internal/store/log_scan_rules.go` - `internal/store/event_triggers.go` - `internal/events/dispatcher.go` (or extend `bus.go` with `RegisterEventTriggerDispatcher`) **Backend, modified:** - [internal/docker/container.go:362](../internal/docker/container.go#L362) — expose stream selection on `ContainerLogs` - [internal/api/router.go](../internal/api/router.go) — register new routes - [cmd/server/main.go](../cmd/server/main.go) — wire `RegisterEventTriggerDispatcher` next to `RegisterPersistentLogger`, start `logscanner.Manager` - migrations: `internal/store/migrations/00XX_log_scan_rules.sql`, `00XX_event_triggers.sql` **Frontend, new:** - `web/src/routes/log-scan-rules/+page.svelte`, `[id]/+page.svelte`, `new/+page.svelte` - `web/src/routes/event-triggers/+page.svelte`, `[id]/+page.svelte`, `new/+page.svelte` - `web/src/lib/components/LogRulePanel.svelte` (workload detail tab) - `web/src/lib/components/RegexTestBox.svelte` (reusable) **Frontend, modified:** - `web/src/routes/apps/[id]/+page.svelte` — add Log Rules tab - `web/src/routes/events/+page.svelte` — logscan source icon + rule tooltip - `web/src/routes/+layout.svelte` — Observability nav group - `web/src/lib/i18n/{en,ru}.json` — new key namespaces - `web/src/lib/api.ts`, `web/src/lib/types.ts` — typed clients --- ## Open questions to revisit before coding 1. **Container start/stop signal source** — bus events (low latency, two new event types) vs polling (simpler, ~5s latency). Tentative: polling. 2. **Trigger delivery retry** — does the dispatcher retry on webhook failure, or is one shot enough since `webhook_deliveries` records failures? Tentative: one shot v1; revisit if reliability complaints surface. 3. **Where does the "logscan source icon" link go on the events page** — rule detail page, or the workload's effective-rules tab? Latter is probably more useful since it shows context. --- ## Memory pointer Add a memory after this lands describing the event_log = observe-self, webhook_deliveries = talk-to-outside boundary — it's the kind of invariant that's easy to violate accidentally when adding new event types later.