Two design + handoff docs: - docs/WORKLOAD_REFACTOR_TODO.md — status-at-a-glance table showing what's done (volume scopes, kind-aware editors, vendor webhook parsing, chain-panel CSS, Log Rules panel) and what's still pending (static source inline port + the hard legacy cutover gated on it; codemap entries; /apps page-level i18n; Priority 4 integration tests). - docs/LOGSCAN_AND_TRIGGERS_TODO.md — companion design + status doc for the two Observability features. Records the loop-prevention invariant (event_log = system observing itself, webhook_deliveries = system talking to outside) so the next contributor doesn't accidentally break it by adding a new EventLog subscriber that re-publishes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
17 KiB
Log Scanner + Event Triggers — Design Handoff
Two related features. They can ship independently, but were designed together because they share the event_log seam.
- A. Log scanner — tail container logs, match against rules, emit event_log entries. Producer of events.
- B. Event triggers — turn event_log entries into webhook / notification
dispatches. Consumer of events. Generalizes the existing
RegisterPersistentLoggerpattern.
Either half is useful alone:
- A without B = errors get surfaced in the events UI, no external delivery.
- B without A = manual + reconciler + deploy events can drive notifications.
Recommended ship order: B first (smaller, self-contained generalization), then A (more moving parts, depends on container-lifecycle hooks).
A. Log scanner — BACKEND LANDED
Status:
- Schema + store CRUD —
internal/store/log_scan_rules.go+log_scan_rulestable added to theobservabilityTablesblock. Includes theEffectiveLogScanRules(workloadID)helper that resolves global rules minus per-workload overrides plus workload- only additions in one Go-side pass. - Stream-selectable docker reads —
internal/docker/container.goContainerLogsOptsaccepts aContainerLogOptions{ShowStdout, ShowStderr, Follow, Tail}so the scanner can subscribe to one stream when a rule scopes itself to stdout or stderr. The legacyContainerLogsis preserved as a thin wrapper for back-compat. - Engine —
internal/logscanner/engine.go: per-rule cooldown (keyed on container+rule), per-container token bucket (default 10 events / 60s, override-able), regex match per line, hits returned for the manager to persist. Pure logic, fully unit-tested. - Tail goroutine —
internal/logscanner/tail.go: per-container loop reading docker's multiplexed log frames (with TTY fallback), strips the prepended RFC3339 timestamp, runs every line through the engine + snapshot. Exits on container stop or context cancel. - Manager —
internal/logscanner/manager.go: 5s polling diff againstListContainers(state=running), atomic.Pointer[Snapshot] hot-reload, structural HitEmitter that writes event_log rows AND publishesEventLogon the bus (so event-trigger dispatchers can pick them up immediately). - API —
internal/api/log_scan_rules.go: full CRUD,/testendpoint accepting{"sample_line": "..."}and returning matched/captures, plusGET /api/workloads/{id}/effective-rulesfor the workload detail page's future Log Rules tab. Admin-gated mutations. - Wired in main.go before the API server is constructed so the
reload callback is plugged via
apiServer.SetLogScanReloader. - Loop-prevention — Same boundary as feature B: scanner publishes EventLog events, dispatcher consumes them, neither writes to event_log on the consume side.
- Tests —
internal/logscanner/{engine,rules}_test.gocover cooldown isolation, token bucket refill, stream filtering, override-replaces-global, disabled-override-suppresses-global, compile-error reporting.internal/store/log_scan_rules_test.gocovers validation + cascade delete.
Frontend still pending — /log-scan-rules pages, regex test box
component, Log Rules tab on /apps/[id], i18n keys. Not touched this
turn.
Where it plugs in
internal/docker/container.go:362 already
exposes ContainerLogs(ctx, id, follow=true, tail). The existing SSE handler at
internal/api/workloads.go:43
(streamWorkloadContainerLogs) is per-viewer and dies on browser disconnect —
do not hook the scanner there. The scanner is a separate long-lived
subsystem owned by the server process.
Minor required change to ContainerLogs: expose ShowStdout / ShowStderr as
caller-controlled. Currently hardcoded to true/true. Single existing caller
passes "both" → no friction. Add an options struct or two booleans.
New package: internal/logscanner/
internal/logscanner/
manager.go — Manager: map[containerID]*tail, lifecycle hooks
tail.go — per-container goroutine; reads logs, fans to engine
engine.go — rule evaluation + cooldown + rate limit
rules.go — Rule struct, regex compile cache, effective-set resolver
Manager lifecycle. Subscribes to container start/stop signals. Options for the signal source:
- Add a
ContainerStarted/ContainerStoppedevent type to the bus and publish from the reconciler + deployer. Cleanest, but adds two event types. - Manager polls
docker.ListContainersevery N seconds and diffs. Lazier, robust to missed signals, slightly higher idle CPU. Probably fine.
Pick (1) if you want zero-latency start, (2) if you want fewer moving parts. Defaulting to (2) with 5s poll — Docker container starts already take seconds; sub-second matching is not a requirement.
Tail goroutine. On container start: open ContainerLogs(follow=true, tail="0") with stdout/stderr filters per rules in scope. Read line-by-line via
bufio.Scanner. For each line: run through engine. On container stop or ctx
cancel: drain and exit.
Engine. Holds compiled regexes per rule. For each line:
- Walk effective ruleset for this workload (see schema below).
- For each matching rule: check cooldown (
map[ruleID]time.Time, mutex guarded). If cooled down, insert event_log row + publish + update timestamp. - Per-container token bucket (default: 10 events/min/container) to prevent catastrophic event_log floods if a regex is too greedy.
Schema
Single table, global + override pattern. No separate "overrides" table.
CREATE TABLE log_scan_rules (
id INTEGER PRIMARY KEY AUTOINCREMENT,
workload_id TEXT, -- NULL = global rule
overrides_id INTEGER, -- if set, this row overrides a global rule for one workload
name TEXT NOT NULL,
pattern TEXT NOT NULL, -- regex, compiled at load
severity TEXT NOT NULL, -- info|warn|error
streams TEXT NOT NULL DEFAULT 'all', -- all|stdout|stderr
cooldown_seconds INTEGER NOT NULL DEFAULT 60,
enabled INTEGER NOT NULL DEFAULT 1,
created_at TEXT NOT NULL,
FOREIGN KEY (workload_id) REFERENCES workloads(id) ON DELETE CASCADE,
FOREIGN KEY (overrides_id) REFERENCES log_scan_rules(id) ON DELETE CASCADE
);
CREATE INDEX idx_log_scan_rules_workload ON log_scan_rules(workload_id);
CREATE INDEX idx_log_scan_rules_overrides ON log_scan_rules(overrides_id);
Effective ruleset for workload X:
- All rows where
workload_id IS NULL AND overrides_id IS NULL(pure globals), minus any global that has a row withworkload_id = X AND overrides_id = global.id. - Plus all rows where
workload_id = X AND overrides_id IS NULL(workload-only additions). - Plus all override rows where
workload_id = X AND overrides_id IS NOT NULL(substitute for the global; their fields win, includingenabled=falseto disable the global for this workload).
A pure SQL implementation is doable with a LEFT JOIN ... WHERE override.id IS NULL for step 1 plus a UNION ALL for steps 2 and 3. Or compute in Go after
two simpler queries — fine since rule counts will be small.
Output
Scanner calls store.InsertEvent with:
Source = "logscan"Severityfrom the matched ruleMessage= raw matched line (truncated to ~500 chars)MetadataJSON ={"workload_id": ..., "container_id": ..., "rule_id": ..., "rule_name": ..., "captures": {...}}
Then bus.Publish(EventLog, payload). This reuses exactly the path
internal/events/bus.go:158
(RegisterPersistentLogger) already established. SSE clients see it live, and
the dispatcher from feature B picks it up.
Hot-reload
When a rule is created/updated/deleted via the API, the manager must rebuild
the effective ruleset for affected containers. Cheapest path: a single
*atomic.Pointer[ruleSnapshot] shared across tails, replaced wholesale on any
rule change. Each tail dereferences the snapshot per line — no locking on the
hot path.
B. Event triggers — BACKEND LANDED
Status:
- Schema + store CRUD —
internal/store/event_triggers.go+ table creation ininternal/store/store.goobservabilityTables. Model:EventTriggerininternal/store/models.go. - Dispatcher —
internal/events/dispatcher.goRegisterEventTriggerDispatcher(bus, triggerSource, notifier). Filter eval is AND-composed across severity (CSV), source (CSV), and optional message regex. Compiled regexes are memoized. - Webhook delivery — extended
notify.NotifierwithSendPayload(url, secret, eventType, payload)which reuses the existing HMAC + headers infra (X-Hub-Signature-256, etc.). NewTierEventTriggertier is recorded for telemetry / audit. - Loop-prevention — dispatcher does not call
InsertEvent. Delivery outcomes go through the notifier's existing logging only. - API —
internal/api/event_triggers.gowith admin-gated mutations:
GET /api/event-triggers
POST /api/event-triggers
GET /api/event-triggers/{id}
PATCH /api/event-triggers/{id}
DELETE /api/event-triggers/{id}
POST /api/event-triggers/{id}/test — synthetic event_log → notifier.SendSyncForTest
- Wired in main.go next to
RegisterPersistentLogger. - Tests —
internal/events/dispatcher_test.go: 10 cases covering filter eval, regex caching, dispatcher fan-out, unsupported action_type, trigger-source errors. CSV filter helper has dedicated table-driven coverage.
Frontend still pending — /event-triggers list + detail + new
pages, the Send-test UX, i18n keys. Not touched this turn.
Where it plugs in
Mirrors the RegisterPersistentLogger shape at
internal/events/bus.go:158:
func RegisterEventTriggerDispatcher(b *Bus, triggers TriggerSource, notifier Notifier) func() {
sub := b.Subscribe(func(evt Event) bool { return evt.Type == EventLog })
go func() {
for evt := range sub {
payload, ok := evt.Payload.(EventLogPayload)
if !ok { continue }
for _, t := range triggers.Enabled() {
if t.matches(payload) {
notifier.Send(t.ActionTarget, buildBody(t, payload))
}
}
}
}()
return func() { b.Unsubscribe(sub) }
}
Reuses the existing notifier at
internal/notify/notifier.go — including the
signed-delivery and webhook_deliveries audit trail.
Schema
CREATE TABLE event_triggers (
id INTEGER PRIMARY KEY AUTOINCREMENT,
name TEXT NOT NULL,
filter_severity TEXT, -- nullable; comma-list like 'warn,error'
filter_source TEXT, -- nullable; comma-list like 'logscan,deploy'
filter_message_regex TEXT, -- nullable; matched against message
action_type TEXT NOT NULL, -- 'webhook' | 'notification_channel'
action_target TEXT NOT NULL, -- URL or channel ID
enabled INTEGER NOT NULL DEFAULT 1,
created_at TEXT NOT NULL
);
Filters AND together. Empty filters match all.
Loop-prevention
Critical constraint: the dispatcher must not write to event_log. All
delivery successes / failures land in webhook_deliveries (existing table) so
the audit trail is preserved without risking trigger recursion. Keeps the
boundary crisp:
event_log= system observing itselfwebhook_deliveries= system talking to the outside
If a user-visible "trigger fired" entry is desired in the events UI, add a
read-only join from webhook_deliveries into the events page rather than
writing event_log rows.
What to defer
| Item | Why | Add when |
|---|---|---|
| Multi-line stack trace coalescing | Real rabbit hole (which lines belong together?). | Real user pain. |
Capture-group templating in messages ({{.captures.code}}) |
v1 stores captures in metadata, displays raw line. | Once real rules exist and patterns emerge. |
| Backfilling history search | This is Loki/Grafana scope-creep. | Never (push to Loki instead if it comes up). |
| Per-rule alert routing | v1 fans out by (severity, source) filter on trigger side. |
When users want one rule → one channel. |
| YAML config-as-code | Tinyforge is UI-driven everywhere else. | Probably never. |
| Retry / backoff on trigger delivery failure | Notifier already handles delivery; whether triggers retry is a separate question. | If trigger reliability becomes an SLO. |
UI footprint
All boolean inputs use ToggleSwitch per project CLAUDE.md. All destructive
actions use ConfirmDialog per memory note (no inline Yes/No strips).
New pages
/log-scan-rules— list with severity / workload filter, "+ New rule" button.- Detail page: name, pattern (regex with live test box that takes a sample log line), severity, streams, cooldown, enabled toggle, scope picker (global / workload).
/event-triggers— list, "+ New trigger" button.- Detail page: name, filters (severity multiselect, source multiselect, optional message regex), action type, action target, enabled toggle.
Augmentations
- Workload detail page (
/apps/[id]): new "Log Rules" tab/panel listing effective rules for this workload. Each global shows an "Override for this workload" button. Each override / workload-only shows edit + delete. - Events page (
/events): entries withsource=logscanget a small icon- tooltip showing rule name. Click → jumps to rule detail.
- Settings sidebar: links to
/log-scan-rulesand/event-triggersunder a new "Observability" group.
i18n keys to add
Roughly 40–60 keys across en.json + ru.json. Namespace: logscan.* and
triggers.*.
API surface
GET /api/log-scan-rules — list (filter: ?workload_id=, ?global=true)
POST /api/log-scan-rules — create
GET /api/log-scan-rules/{id} — detail
PATCH /api/log-scan-rules/{id} — update
DELETE /api/log-scan-rules/{id} — delete
POST /api/log-scan-rules/{id}/test — body: {sample_line}; returns matched: bool, captures
GET /api/workloads/{id}/effective-rules — computed effective ruleset for a workload
GET /api/event-triggers — list
POST /api/event-triggers — create
GET /api/event-triggers/{id} — detail
PATCH /api/event-triggers/{id} — update
DELETE /api/event-triggers/{id} — delete
POST /api/event-triggers/{id}/test — dispatches a synthetic event to verify the action target
POST .../test endpoints are worth shipping in v1 — they make the rule /
trigger editing UX dramatically nicer and avoid "did I get the regex right?"
deploy-and-pray cycles.
File pointers (when work starts)
Backend, new:
internal/logscanner/{manager,tail,engine,rules}.gointernal/api/log_scan_rules.gointernal/api/event_triggers.gointernal/store/log_scan_rules.gointernal/store/event_triggers.gointernal/events/dispatcher.go(or extendbus.gowithRegisterEventTriggerDispatcher)
Backend, modified:
- internal/docker/container.go:362 — expose stream selection on
ContainerLogs - internal/api/router.go — register new routes
- cmd/server/main.go — wire
RegisterEventTriggerDispatchernext toRegisterPersistentLogger, startlogscanner.Manager - migrations:
internal/store/migrations/00XX_log_scan_rules.sql,00XX_event_triggers.sql
Frontend, new:
web/src/routes/log-scan-rules/+page.svelte,[id]/+page.svelte,new/+page.svelteweb/src/routes/event-triggers/+page.svelte,[id]/+page.svelte,new/+page.svelteweb/src/lib/components/LogRulePanel.svelte(workload detail tab)web/src/lib/components/RegexTestBox.svelte(reusable)
Frontend, modified:
web/src/routes/apps/[id]/+page.svelte— add Log Rules tabweb/src/routes/events/+page.svelte— logscan source icon + rule tooltipweb/src/routes/+layout.svelte— Observability nav groupweb/src/lib/i18n/{en,ru}.json— new key namespacesweb/src/lib/api.ts,web/src/lib/types.ts— typed clients
Open questions to revisit before coding
- Container start/stop signal source — bus events (low latency, two new event types) vs polling (simpler, ~5s latency). Tentative: polling.
- Trigger delivery retry — does the dispatcher retry on webhook failure,
or is one shot enough since
webhook_deliveriesrecords failures? Tentative: one shot v1; revisit if reliability complaints surface. - Where does the "logscan source icon" link go on the events page — rule detail page, or the workload's effective-rules tab? Latter is probably more useful since it shows context.
Memory pointer
Add a memory after this lands describing the event_log = observe-self, webhook_deliveries = talk-to-outside boundary — it's the kind of invariant that's easy to violate accidentally when adding new event types later.