30133bc1eb
Build / build (push) Successful in 10m40s
Two design + handoff docs: - docs/WORKLOAD_REFACTOR_TODO.md — status-at-a-glance table showing what's done (volume scopes, kind-aware editors, vendor webhook parsing, chain-panel CSS, Log Rules panel) and what's still pending (static source inline port + the hard legacy cutover gated on it; codemap entries; /apps page-level i18n; Priority 4 integration tests). - docs/LOGSCAN_AND_TRIGGERS_TODO.md — companion design + status doc for the two Observability features. Records the loop-prevention invariant (event_log = system observing itself, webhook_deliveries = system talking to outside) so the next contributor doesn't accidentally break it by adding a new EventLog subscriber that re-publishes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
386 lines
17 KiB
Markdown
386 lines
17 KiB
Markdown
# Log Scanner + Event Triggers — Design Handoff
|
||
|
||
Two related features. They can ship independently, but were designed together
|
||
because they share the event_log seam.
|
||
|
||
- **A. Log scanner** — tail container logs, match against rules, emit event_log
|
||
entries. Producer of events.
|
||
- **B. Event triggers** — turn event_log entries into webhook / notification
|
||
dispatches. Consumer of events. Generalizes the existing
|
||
`RegisterPersistentLogger` pattern.
|
||
|
||
Either half is useful alone:
|
||
- A without B = errors get surfaced in the events UI, no external delivery.
|
||
- B without A = manual + reconciler + deploy events can drive notifications.
|
||
|
||
Recommended ship order: B first (smaller, self-contained generalization), then
|
||
A (more moving parts, depends on container-lifecycle hooks).
|
||
|
||
---
|
||
|
||
## A. Log scanner — BACKEND LANDED
|
||
|
||
Status:
|
||
|
||
- **Schema + store CRUD** — `internal/store/log_scan_rules.go` +
|
||
`log_scan_rules` table added to the `observabilityTables` block.
|
||
Includes the `EffectiveLogScanRules(workloadID)` helper that
|
||
resolves global rules minus per-workload overrides plus workload-
|
||
only additions in one Go-side pass.
|
||
- **Stream-selectable docker reads** — `internal/docker/container.go`
|
||
`ContainerLogsOpts` accepts a `ContainerLogOptions{ShowStdout,
|
||
ShowStderr, Follow, Tail}` so the scanner can subscribe to one
|
||
stream when a rule scopes itself to stdout or stderr. The legacy
|
||
`ContainerLogs` is preserved as a thin wrapper for back-compat.
|
||
- **Engine** — `internal/logscanner/engine.go`: per-rule cooldown
|
||
(keyed on container+rule), per-container token bucket (default 10
|
||
events / 60s, override-able), regex match per line, hits returned
|
||
for the manager to persist. Pure logic, fully unit-tested.
|
||
- **Tail goroutine** — `internal/logscanner/tail.go`: per-container
|
||
loop reading docker's multiplexed log frames (with TTY fallback),
|
||
strips the prepended RFC3339 timestamp, runs every line through the
|
||
engine + snapshot. Exits on container stop or context cancel.
|
||
- **Manager** — `internal/logscanner/manager.go`: 5s polling diff
|
||
against `ListContainers(state=running)`, atomic.Pointer[Snapshot]
|
||
hot-reload, structural HitEmitter that writes event_log rows AND
|
||
publishes `EventLog` on the bus (so event-trigger dispatchers can
|
||
pick them up immediately).
|
||
- **API** — `internal/api/log_scan_rules.go`: full CRUD,
|
||
`/test` endpoint accepting `{"sample_line": "..."}` and returning
|
||
matched/captures, plus
|
||
`GET /api/workloads/{id}/effective-rules` for the workload detail
|
||
page's future Log Rules tab. Admin-gated mutations.
|
||
- **Wired in main.go** before the API server is constructed so the
|
||
reload callback is plugged via `apiServer.SetLogScanReloader`.
|
||
- **Loop-prevention** — Same boundary as feature B: scanner publishes
|
||
EventLog events, dispatcher consumes them, neither writes to
|
||
event_log on the consume side.
|
||
- **Tests** — `internal/logscanner/{engine,rules}_test.go` cover
|
||
cooldown isolation, token bucket refill, stream filtering,
|
||
override-replaces-global, disabled-override-suppresses-global,
|
||
compile-error reporting. `internal/store/log_scan_rules_test.go`
|
||
covers validation + cascade delete.
|
||
|
||
**Frontend still pending** — `/log-scan-rules` pages, regex test box
|
||
component, Log Rules tab on `/apps/[id]`, i18n keys. Not touched this
|
||
turn.
|
||
|
||
### Where it plugs in
|
||
|
||
[internal/docker/container.go:362](../internal/docker/container.go#L362) already
|
||
exposes `ContainerLogs(ctx, id, follow=true, tail)`. The existing SSE handler at
|
||
[internal/api/workloads.go:43](../internal/api/workloads.go#L43)
|
||
(`streamWorkloadContainerLogs`) is per-viewer and dies on browser disconnect —
|
||
**do not hook the scanner there**. The scanner is a separate long-lived
|
||
subsystem owned by the server process.
|
||
|
||
Minor required change to `ContainerLogs`: expose `ShowStdout` / `ShowStderr` as
|
||
caller-controlled. Currently hardcoded to `true`/`true`. Single existing caller
|
||
passes "both" → no friction. Add an options struct or two booleans.
|
||
|
||
### New package: `internal/logscanner/`
|
||
|
||
```
|
||
internal/logscanner/
|
||
manager.go — Manager: map[containerID]*tail, lifecycle hooks
|
||
tail.go — per-container goroutine; reads logs, fans to engine
|
||
engine.go — rule evaluation + cooldown + rate limit
|
||
rules.go — Rule struct, regex compile cache, effective-set resolver
|
||
```
|
||
|
||
**Manager lifecycle.** Subscribes to container start/stop signals. Options for
|
||
the signal source:
|
||
1. Add a `ContainerStarted` / `ContainerStopped` event type to the bus and
|
||
publish from the reconciler + deployer. Cleanest, but adds two event types.
|
||
2. Manager polls `docker.ListContainers` every N seconds and diffs. Lazier,
|
||
robust to missed signals, slightly higher idle CPU. Probably fine.
|
||
|
||
Pick (1) if you want zero-latency start, (2) if you want fewer moving parts.
|
||
Defaulting to **(2) with 5s poll** — Docker container starts already take
|
||
seconds; sub-second matching is not a requirement.
|
||
|
||
**Tail goroutine.** On container start: open `ContainerLogs(follow=true,
|
||
tail="0")` with stdout/stderr filters per rules in scope. Read line-by-line via
|
||
`bufio.Scanner`. For each line: run through engine. On container stop or ctx
|
||
cancel: drain and exit.
|
||
|
||
**Engine.** Holds compiled regexes per rule. For each line:
|
||
- Walk effective ruleset for this workload (see schema below).
|
||
- For each matching rule: check cooldown (`map[ruleID]time.Time`, mutex
|
||
guarded). If cooled down, insert event_log row + publish + update timestamp.
|
||
- Per-container token bucket (default: 10 events/min/container) to prevent
|
||
catastrophic event_log floods if a regex is too greedy.
|
||
|
||
### Schema
|
||
|
||
Single table, global + override pattern. No separate "overrides" table.
|
||
|
||
```sql
|
||
CREATE TABLE log_scan_rules (
|
||
id INTEGER PRIMARY KEY AUTOINCREMENT,
|
||
workload_id TEXT, -- NULL = global rule
|
||
overrides_id INTEGER, -- if set, this row overrides a global rule for one workload
|
||
name TEXT NOT NULL,
|
||
pattern TEXT NOT NULL, -- regex, compiled at load
|
||
severity TEXT NOT NULL, -- info|warn|error
|
||
streams TEXT NOT NULL DEFAULT 'all', -- all|stdout|stderr
|
||
cooldown_seconds INTEGER NOT NULL DEFAULT 60,
|
||
enabled INTEGER NOT NULL DEFAULT 1,
|
||
created_at TEXT NOT NULL,
|
||
FOREIGN KEY (workload_id) REFERENCES workloads(id) ON DELETE CASCADE,
|
||
FOREIGN KEY (overrides_id) REFERENCES log_scan_rules(id) ON DELETE CASCADE
|
||
);
|
||
CREATE INDEX idx_log_scan_rules_workload ON log_scan_rules(workload_id);
|
||
CREATE INDEX idx_log_scan_rules_overrides ON log_scan_rules(overrides_id);
|
||
```
|
||
|
||
**Effective ruleset for workload X:**
|
||
1. All rows where `workload_id IS NULL AND overrides_id IS NULL` (pure globals),
|
||
*minus* any global that has a row with `workload_id = X AND overrides_id = global.id`.
|
||
2. Plus all rows where `workload_id = X AND overrides_id IS NULL` (workload-only additions).
|
||
3. Plus all override rows where `workload_id = X AND overrides_id IS NOT NULL`
|
||
(substitute for the global; their fields win, including `enabled=false` to
|
||
disable the global for this workload).
|
||
|
||
A pure SQL implementation is doable with a `LEFT JOIN ... WHERE override.id IS
|
||
NULL` for step 1 plus a `UNION ALL` for steps 2 and 3. Or compute in Go after
|
||
two simpler queries — fine since rule counts will be small.
|
||
|
||
### Output
|
||
|
||
Scanner calls `store.InsertEvent` with:
|
||
- `Source = "logscan"`
|
||
- `Severity` from the matched rule
|
||
- `Message` = raw matched line (truncated to ~500 chars)
|
||
- `Metadata` JSON = `{"workload_id": ..., "container_id": ..., "rule_id": ..., "rule_name": ..., "captures": {...}}`
|
||
|
||
Then `bus.Publish(EventLog, payload)`. This reuses exactly the path
|
||
[internal/events/bus.go:158](../internal/events/bus.go#L158)
|
||
(`RegisterPersistentLogger`) already established. SSE clients see it live, and
|
||
the dispatcher from feature B picks it up.
|
||
|
||
### Hot-reload
|
||
|
||
When a rule is created/updated/deleted via the API, the manager must rebuild
|
||
the effective ruleset for affected containers. Cheapest path: a single
|
||
`*atomic.Pointer[ruleSnapshot]` shared across tails, replaced wholesale on any
|
||
rule change. Each tail dereferences the snapshot per line — no locking on the
|
||
hot path.
|
||
|
||
---
|
||
|
||
## B. Event triggers — BACKEND LANDED
|
||
|
||
Status:
|
||
|
||
- **Schema + store CRUD** — `internal/store/event_triggers.go` + table
|
||
creation in `internal/store/store.go` `observabilityTables`. Model:
|
||
`EventTrigger` in `internal/store/models.go`.
|
||
- **Dispatcher** — `internal/events/dispatcher.go`
|
||
`RegisterEventTriggerDispatcher(bus, triggerSource, notifier)`.
|
||
Filter eval is AND-composed across severity (CSV), source (CSV), and
|
||
optional message regex. Compiled regexes are memoized.
|
||
- **Webhook delivery** — extended `notify.Notifier` with
|
||
`SendPayload(url, secret, eventType, payload)` which reuses the
|
||
existing HMAC + headers infra (`X-Hub-Signature-256`, etc.). New
|
||
`TierEventTrigger` tier is recorded for telemetry / audit.
|
||
- **Loop-prevention** — dispatcher does **not** call `InsertEvent`.
|
||
Delivery outcomes go through the notifier's existing logging only.
|
||
- **API** — `internal/api/event_triggers.go` with admin-gated mutations:
|
||
|
||
```http
|
||
GET /api/event-triggers
|
||
POST /api/event-triggers
|
||
GET /api/event-triggers/{id}
|
||
PATCH /api/event-triggers/{id}
|
||
DELETE /api/event-triggers/{id}
|
||
POST /api/event-triggers/{id}/test — synthetic event_log → notifier.SendSyncForTest
|
||
```
|
||
|
||
- **Wired in main.go** next to `RegisterPersistentLogger`.
|
||
- **Tests** — `internal/events/dispatcher_test.go`: 10 cases covering
|
||
filter eval, regex caching, dispatcher fan-out, unsupported
|
||
action_type, trigger-source errors. CSV filter helper has dedicated
|
||
table-driven coverage.
|
||
|
||
**Frontend still pending** — `/event-triggers` list + detail + new
|
||
pages, the Send-test UX, i18n keys. Not touched this turn.
|
||
|
||
### Where it plugs in
|
||
|
||
Mirrors the `RegisterPersistentLogger` shape at
|
||
[internal/events/bus.go:158](../internal/events/bus.go#L158):
|
||
|
||
```go
|
||
func RegisterEventTriggerDispatcher(b *Bus, triggers TriggerSource, notifier Notifier) func() {
|
||
sub := b.Subscribe(func(evt Event) bool { return evt.Type == EventLog })
|
||
go func() {
|
||
for evt := range sub {
|
||
payload, ok := evt.Payload.(EventLogPayload)
|
||
if !ok { continue }
|
||
for _, t := range triggers.Enabled() {
|
||
if t.matches(payload) {
|
||
notifier.Send(t.ActionTarget, buildBody(t, payload))
|
||
}
|
||
}
|
||
}
|
||
}()
|
||
return func() { b.Unsubscribe(sub) }
|
||
}
|
||
```
|
||
|
||
Reuses the existing notifier at
|
||
[internal/notify/notifier.go](../internal/notify/notifier.go) — including the
|
||
signed-delivery and `webhook_deliveries` audit trail.
|
||
|
||
### Schema
|
||
|
||
```sql
|
||
CREATE TABLE event_triggers (
|
||
id INTEGER PRIMARY KEY AUTOINCREMENT,
|
||
name TEXT NOT NULL,
|
||
filter_severity TEXT, -- nullable; comma-list like 'warn,error'
|
||
filter_source TEXT, -- nullable; comma-list like 'logscan,deploy'
|
||
filter_message_regex TEXT, -- nullable; matched against message
|
||
action_type TEXT NOT NULL, -- 'webhook' | 'notification_channel'
|
||
action_target TEXT NOT NULL, -- URL or channel ID
|
||
enabled INTEGER NOT NULL DEFAULT 1,
|
||
created_at TEXT NOT NULL
|
||
);
|
||
```
|
||
|
||
Filters AND together. Empty filters match all.
|
||
|
||
### Loop-prevention
|
||
|
||
**Critical constraint: the dispatcher must not write to event_log.** All
|
||
delivery successes / failures land in `webhook_deliveries` (existing table) so
|
||
the audit trail is preserved without risking trigger recursion. Keeps the
|
||
boundary crisp:
|
||
|
||
- `event_log` = system observing itself
|
||
- `webhook_deliveries` = system talking to the outside
|
||
|
||
If a user-visible "trigger fired" entry is desired in the events UI, add a
|
||
*read-only join* from `webhook_deliveries` into the events page rather than
|
||
writing event_log rows.
|
||
|
||
---
|
||
|
||
## What to defer
|
||
|
||
| Item | Why | Add when |
|
||
|---|---|---|
|
||
| Multi-line stack trace coalescing | Real rabbit hole (which lines belong together?). | Real user pain. |
|
||
| Capture-group templating in messages (`{{.captures.code}}`) | v1 stores captures in metadata, displays raw line. | Once real rules exist and patterns emerge. |
|
||
| Backfilling history search | This is Loki/Grafana scope-creep. | Never (push to Loki instead if it comes up). |
|
||
| Per-rule alert routing | v1 fans out by `(severity, source)` filter on trigger side. | When users want one rule → one channel. |
|
||
| YAML config-as-code | Tinyforge is UI-driven everywhere else. | Probably never. |
|
||
| Retry / backoff on trigger delivery failure | Notifier already handles delivery; whether *triggers* retry is a separate question. | If trigger reliability becomes an SLO. |
|
||
|
||
---
|
||
|
||
## UI footprint
|
||
|
||
All boolean inputs use `ToggleSwitch` per project CLAUDE.md. All destructive
|
||
actions use `ConfirmDialog` per memory note (no inline Yes/No strips).
|
||
|
||
### New pages
|
||
|
||
- **`/log-scan-rules`** — list with severity / workload filter, "+ New rule" button.
|
||
- Detail page: name, pattern (regex with live test box that takes a sample log line), severity, streams, cooldown, enabled toggle, scope picker (global / workload).
|
||
- **`/event-triggers`** — list, "+ New trigger" button.
|
||
- Detail page: name, filters (severity multiselect, source multiselect, optional message regex), action type, action target, enabled toggle.
|
||
|
||
### Augmentations
|
||
|
||
- **Workload detail page** (`/apps/[id]`): new "Log Rules" tab/panel listing
|
||
effective rules for this workload. Each global shows an "Override for this
|
||
workload" button. Each override / workload-only shows edit + delete.
|
||
- **Events page** (`/events`): entries with `source=logscan` get a small icon
|
||
+ tooltip showing rule name. Click → jumps to rule detail.
|
||
- **Settings sidebar**: links to `/log-scan-rules` and `/event-triggers` under
|
||
a new "Observability" group.
|
||
|
||
### i18n keys to add
|
||
|
||
Roughly 40–60 keys across `en.json` + `ru.json`. Namespace: `logscan.*` and
|
||
`triggers.*`.
|
||
|
||
---
|
||
|
||
## API surface
|
||
|
||
```
|
||
GET /api/log-scan-rules — list (filter: ?workload_id=, ?global=true)
|
||
POST /api/log-scan-rules — create
|
||
GET /api/log-scan-rules/{id} — detail
|
||
PATCH /api/log-scan-rules/{id} — update
|
||
DELETE /api/log-scan-rules/{id} — delete
|
||
POST /api/log-scan-rules/{id}/test — body: {sample_line}; returns matched: bool, captures
|
||
GET /api/workloads/{id}/effective-rules — computed effective ruleset for a workload
|
||
|
||
GET /api/event-triggers — list
|
||
POST /api/event-triggers — create
|
||
GET /api/event-triggers/{id} — detail
|
||
PATCH /api/event-triggers/{id} — update
|
||
DELETE /api/event-triggers/{id} — delete
|
||
POST /api/event-triggers/{id}/test — dispatches a synthetic event to verify the action target
|
||
```
|
||
|
||
`POST .../test` endpoints are worth shipping in v1 — they make the rule /
|
||
trigger editing UX dramatically nicer and avoid "did I get the regex right?"
|
||
deploy-and-pray cycles.
|
||
|
||
---
|
||
|
||
## File pointers (when work starts)
|
||
|
||
**Backend, new:**
|
||
- `internal/logscanner/{manager,tail,engine,rules}.go`
|
||
- `internal/api/log_scan_rules.go`
|
||
- `internal/api/event_triggers.go`
|
||
- `internal/store/log_scan_rules.go`
|
||
- `internal/store/event_triggers.go`
|
||
- `internal/events/dispatcher.go` (or extend `bus.go` with `RegisterEventTriggerDispatcher`)
|
||
|
||
**Backend, modified:**
|
||
- [internal/docker/container.go:362](../internal/docker/container.go#L362) — expose stream selection on `ContainerLogs`
|
||
- [internal/api/router.go](../internal/api/router.go) — register new routes
|
||
- [cmd/server/main.go](../cmd/server/main.go) — wire `RegisterEventTriggerDispatcher` next to `RegisterPersistentLogger`, start `logscanner.Manager`
|
||
- migrations: `internal/store/migrations/00XX_log_scan_rules.sql`, `00XX_event_triggers.sql`
|
||
|
||
**Frontend, new:**
|
||
- `web/src/routes/log-scan-rules/+page.svelte`, `[id]/+page.svelte`, `new/+page.svelte`
|
||
- `web/src/routes/event-triggers/+page.svelte`, `[id]/+page.svelte`, `new/+page.svelte`
|
||
- `web/src/lib/components/LogRulePanel.svelte` (workload detail tab)
|
||
- `web/src/lib/components/RegexTestBox.svelte` (reusable)
|
||
|
||
**Frontend, modified:**
|
||
- `web/src/routes/apps/[id]/+page.svelte` — add Log Rules tab
|
||
- `web/src/routes/events/+page.svelte` — logscan source icon + rule tooltip
|
||
- `web/src/routes/+layout.svelte` — Observability nav group
|
||
- `web/src/lib/i18n/{en,ru}.json` — new key namespaces
|
||
- `web/src/lib/api.ts`, `web/src/lib/types.ts` — typed clients
|
||
|
||
---
|
||
|
||
## Open questions to revisit before coding
|
||
|
||
1. **Container start/stop signal source** — bus events (low latency, two new
|
||
event types) vs polling (simpler, ~5s latency). Tentative: polling.
|
||
2. **Trigger delivery retry** — does the dispatcher retry on webhook failure,
|
||
or is one shot enough since `webhook_deliveries` records failures? Tentative:
|
||
one shot v1; revisit if reliability complaints surface.
|
||
3. **Where does the "logscan source icon" link go on the events page** — rule
|
||
detail page, or the workload's effective-rules tab? Latter is probably more
|
||
useful since it shows context.
|
||
|
||
---
|
||
|
||
## Memory pointer
|
||
|
||
Add a memory after this lands describing the event_log = observe-self,
|
||
webhook_deliveries = talk-to-outside boundary — it's the kind of invariant
|
||
that's easy to violate accidentally when adding new event types later.
|