tiny-forge/docs/LOGSCAN_AND_TRIGGERS_TODO.md
alexei.dolgolyov 30133bc1eb
docs: workload refactor + observability progress
Two design + handoff docs:

- docs/WORKLOAD_REFACTOR_TODO.md — status-at-a-glance table
  showing what's done (volume scopes, kind-aware editors,
  vendor webhook parsing, chain-panel CSS, Log Rules panel)
  and what's still pending (static source inline port + the
  hard legacy cutover gated on it; codemap entries; /apps
  page-level i18n; Priority 4 integration tests).

- docs/LOGSCAN_AND_TRIGGERS_TODO.md — companion design + status
  doc for the two Observability features. Records the
  loop-prevention invariant (event_log = system observing
  itself, webhook_deliveries = system talking to outside) so
  the next contributor doesn't accidentally break it by adding
  a new EventLog subscriber that re-publishes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 22:18:51 +03:00


Log Scanner + Event Triggers — Design Handoff

Two related features. They can ship independently, but were designed together because they share the event_log seam.

  • A. Log scanner — tail container logs, match against rules, emit event_log entries. Producer of events.
  • B. Event triggers — turn event_log entries into webhook / notification dispatches. Consumer of events. Generalizes the existing RegisterPersistentLogger pattern.

Either half is useful alone:

  • A without B = errors get surfaced in the events UI, no external delivery.
  • B without A = manual + reconciler + deploy events can drive notifications.

Recommended ship order: B first (smaller, self-contained generalization), then A (more moving parts, depends on container-lifecycle hooks).


A. Log scanner — BACKEND LANDED

Status:

  • Schema + store CRUD: internal/store/log_scan_rules.go + log_scan_rules table added to the observabilityTables block. Includes the EffectiveLogScanRules(workloadID) helper that resolves global rules minus per-workload overrides plus workload-only additions in one Go-side pass.
  • Stream-selectable docker reads: internal/docker/container.go ContainerLogsOpts accepts a ContainerLogOptions{ShowStdout, ShowStderr, Follow, Tail} so the scanner can subscribe to one stream when a rule scopes itself to stdout or stderr. The legacy ContainerLogs is preserved as a thin wrapper for back-compat.
  • Engine: internal/logscanner/engine.go: per-rule cooldown (keyed on container+rule), per-container token bucket (default 10 events / 60s, overridable), regex match per line, hits returned for the manager to persist. Pure logic, fully unit-tested.
  • Tail goroutine: internal/logscanner/tail.go: per-container loop reading docker's multiplexed log frames (with TTY fallback), strips the prepended RFC3339 timestamp, runs every line through the engine + snapshot. Exits on container stop or context cancel.
  • Manager: internal/logscanner/manager.go: 5s polling diff against ListContainers(state=running), atomic.Pointer[Snapshot] hot-reload, structural HitEmitter that writes event_log rows AND publishes EventLog on the bus (so event-trigger dispatchers can pick them up immediately).
  • API: internal/api/log_scan_rules.go: full CRUD, /test endpoint accepting {"sample_line": "..."} and returning matched/captures, plus GET /api/workloads/{id}/effective-rules for the workload detail page's future Log Rules tab. Admin-gated mutations.
  • Wired in main.go before the API server is constructed so the reload callback is plugged via apiServer.SetLogScanReloader.
  • Loop-prevention: same boundary as feature B. The scanner publishes EventLog events, the dispatcher consumes them, and neither writes to event_log on the consume side.
  • Tests: internal/logscanner/{engine,rules}_test.go cover cooldown isolation, token bucket refill, stream filtering, override-replaces-global, disabled-override-suppresses-global, compile-error reporting. internal/store/log_scan_rules_test.go covers validation + cascade delete.

Frontend still pending: /log-scan-rules pages, regex test box component, Log Rules tab on /apps/[id], i18n keys. Not touched this turn.

Where it plugs in

internal/docker/container.go:362 already exposes ContainerLogs(ctx, id, follow=true, tail). The existing SSE handler at internal/api/workloads.go:43 (streamWorkloadContainerLogs) is per-viewer and dies on browser disconnect — do not hook the scanner there. The scanner is a separate long-lived subsystem owned by the server process.

Minor required change to ContainerLogs: expose ShowStdout / ShowStderr as caller-controlled. Currently hardcoded to true/true. Single existing caller passes "both" → no friction. Add an options struct or two booleans.

New package: internal/logscanner/

internal/logscanner/
  manager.go    — Manager: map[containerID]*tail, lifecycle hooks
  tail.go       — per-container goroutine; reads logs, fans to engine
  engine.go     — rule evaluation + cooldown + rate limit
  rules.go      — Rule struct, regex compile cache, effective-set resolver

Manager lifecycle. Subscribes to container start/stop signals. Options for the signal source:

  1. Add a ContainerStarted / ContainerStopped event type to the bus and publish from the reconciler + deployer. Cleanest, but adds two event types.
  2. Manager polls docker.ListContainers every N seconds and diffs. Lazier, robust to missed signals, slightly higher idle CPU. Probably fine.

Pick (1) if you want zero-latency start, (2) if you want fewer moving parts. Defaulting to (2) with 5s poll — Docker container starts already take seconds; sub-second matching is not a requirement.
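The polling option boils down to a set difference between the containers the manager already tails and the containers the latest poll reports as running. A minimal sketch of that diff follows; diffRunning and its signature are hypothetical names for illustration, not the actual manager.go code.

```go
package main

import "sort"

// diffRunning compares the container IDs that already have a tail with the
// IDs reported by the latest poll. It returns the IDs that need a new tail
// (started) and the IDs whose tail should be cancelled (stopped).
// Hypothetical helper: the real manager may structure this differently.
func diffRunning(tracked map[string]struct{}, running []string) (started, stopped []string) {
	seen := make(map[string]struct{}, len(running))
	for _, id := range running {
		if _, dup := seen[id]; dup {
			continue // ignore duplicate IDs within one poll result
		}
		seen[id] = struct{}{}
		if _, ok := tracked[id]; !ok {
			started = append(started, id)
		}
	}
	for id := range tracked {
		if _, ok := seen[id]; !ok {
			stopped = append(stopped, id)
		}
	}
	// Sort for deterministic ordering; map iteration order is random.
	sort.Strings(started)
	sort.Strings(stopped)
	return started, stopped
}
```

Because every poll recomputes the full diff, a missed start or stop is corrected on the next tick, which is the robustness argument for option (2).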

Tail goroutine. On container start: open ContainerLogs(follow=true, tail="0") with stdout/stderr filters per rules in scope. Read line-by-line via bufio.Scanner. For each line: run through engine. On container stop or ctx cancel: drain and exit.

Engine. Holds compiled regexes per rule. For each line:

  • Walk effective ruleset for this workload (see schema below).
  • For each matching rule: check cooldown (map[ruleID]time.Time, mutex guarded). If cooled down, insert event_log row + publish + update timestamp.
  • Per-container token bucket (default: 10 events/min/container) to prevent catastrophic event_log floods if a regex is too greedy.

Schema

Single table, global + override pattern. No separate "overrides" table.

CREATE TABLE log_scan_rules (
  id               INTEGER PRIMARY KEY AUTOINCREMENT,
  workload_id      TEXT,                  -- NULL = global rule
  overrides_id     INTEGER,               -- if set, this row overrides a global rule for one workload
  name             TEXT NOT NULL,
  pattern          TEXT NOT NULL,         -- regex, compiled at load
  severity         TEXT NOT NULL,         -- info|warn|error
  streams          TEXT NOT NULL DEFAULT 'all',  -- all|stdout|stderr
  cooldown_seconds INTEGER NOT NULL DEFAULT 60,
  enabled          INTEGER NOT NULL DEFAULT 1,
  created_at       TEXT NOT NULL,
  FOREIGN KEY (workload_id) REFERENCES workloads(id) ON DELETE CASCADE,
  FOREIGN KEY (overrides_id) REFERENCES log_scan_rules(id) ON DELETE CASCADE
);
CREATE INDEX idx_log_scan_rules_workload ON log_scan_rules(workload_id);
CREATE INDEX idx_log_scan_rules_overrides ON log_scan_rules(overrides_id);

Effective ruleset for workload X:

  1. All rows where workload_id IS NULL AND overrides_id IS NULL (pure globals), minus any global that has a row with workload_id = X AND overrides_id = global.id.
  2. Plus all rows where workload_id = X AND overrides_id IS NULL (workload-only additions).
  3. Plus all override rows where workload_id = X AND overrides_id IS NOT NULL (substitute for the global; their fields win, including enabled=false to disable the global for this workload).

A pure SQL implementation is doable with a LEFT JOIN ... WHERE override.id IS NULL for step 1 plus a UNION ALL for steps 2 and 3. Or compute in Go after two simpler queries — fine since rule counts will be small.
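The Go-side variant of the three steps above might look like the following sketch. The Rule struct is a trimmed, hypothetical stand-in for the store model (empty WorkloadID and zero OverridesID represent SQL NULL), and effectiveRules is an illustrative name rather than the landed EffectiveLogScanRules helper.

```go
package main

import "sort"

// Rule is a trimmed, hypothetical view of a log_scan_rules row.
type Rule struct {
	ID          int64
	WorkloadID  string // "" stands for NULL (global rule)
	OverridesID int64  // 0 stands for NULL (not an override)
	Name        string
	Enabled     bool
}

// effectiveRules resolves the ruleset for one workload from all rows:
// workload-only additions and overrides are taken as-is, and a pure global
// is included only when the workload has no override pointing at it.
// Disabled rows are kept so a UI can show them; the engine is expected to
// skip rows with Enabled == false.
func effectiveRules(all []Rule, workloadID string) []Rule {
	overridden := make(map[int64]bool)
	var out []Rule
	for _, r := range all {
		if r.WorkloadID == "" || r.WorkloadID != workloadID {
			continue
		}
		if r.OverridesID != 0 {
			overridden[r.OverridesID] = true // step 3: override replaces its global
		}
		out = append(out, r) // steps 2 and 3
	}
	for _, r := range all {
		if r.WorkloadID == "" && r.OverridesID == 0 && !overridden[r.ID] {
			out = append(out, r) // step 1: globals not overridden for this workload
		}
	}
	sort.Slice(out, func(i, j int) bool { return out[i].ID < out[j].ID })
	return out
}
```

With a disabled override for global rule 1 on workload "w1", the result for "w1" contains the override row instead of the global, so the global is effectively switched off for that workload only.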

Output

Scanner calls store.InsertEvent with:

  • Source = "logscan"
  • Severity from the matched rule
  • Message = raw matched line (truncated to ~500 chars)
  • Metadata JSON = {"workload_id": ..., "container_id": ..., "rule_id": ..., "rule_name": ..., "captures": {...}}

Then bus.Publish(EventLog, payload). This reuses exactly the path internal/events/bus.go:158 (RegisterPersistentLogger) already established. SSE clients see it live, and the dispatcher from feature B picks it up.
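To make the row shape concrete, here is a minimal sketch of building the message and metadata fields. The hitMetadata struct, truncate, and encodeMetadata are hypothetical helpers; the JSON keys come from the list above, and the 500 limit is applied in bytes here (the text says roughly 500 chars), cutting on a rune boundary so the stored message stays valid UTF-8.

```go
package main

import (
	"encoding/json"
	"unicode/utf8"
)

// maxMessageBytes is an assumed limit; the design only says "~500 chars".
const maxMessageBytes = 500

// hitMetadata mirrors the metadata keys listed in the design.
type hitMetadata struct {
	WorkloadID  string            `json:"workload_id"`
	ContainerID string            `json:"container_id"`
	RuleID      int64             `json:"rule_id"`
	RuleName    string            `json:"rule_name"`
	Captures    map[string]string `json:"captures"`
}

// truncate cuts s to at most max bytes without splitting a UTF-8 rune.
func truncate(s string, max int) string {
	if len(s) <= max {
		return s
	}
	// Back up until the cut point sits on the first byte of a rune.
	for max > 0 && !utf8.RuneStart(s[max]) {
		max--
	}
	return s[:max]
}

// encodeMetadata renders the metadata column as a JSON string.
func encodeMetadata(m hitMetadata) (string, error) {
	b, err := json.Marshal(m)
	if err != nil {
		return "", err
	}
	return string(b), nil
}
```

A caller would pass truncate(line, maxMessageBytes) as the message and the encodeMetadata result as the metadata when inserting the event.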

Hot-reload

When a rule is created/updated/deleted via the API, the manager must rebuild the effective ruleset for affected containers. Cheapest path: a single *atomic.Pointer[ruleSnapshot] shared across tails, replaced wholesale on any rule change. Each tail dereferences the snapshot per line — no locking on the hot path.


B. Event triggers — BACKEND LANDED

Status:

  • Schema + store CRUD: internal/store/event_triggers.go + table creation in internal/store/store.go observabilityTables. Model: EventTrigger in internal/store/models.go.
  • Dispatcher: internal/events/dispatcher.go RegisterEventTriggerDispatcher(bus, triggerSource, notifier). Filter eval is AND-composed across severity (CSV), source (CSV), and optional message regex. Compiled regexes are memoized.
  • Webhook delivery: extended notify.Notifier with SendPayload(url, secret, eventType, payload) which reuses the existing HMAC + headers infra (X-Hub-Signature-256, etc.). New TierEventTrigger tier is recorded for telemetry / audit.
  • Loop-prevention: dispatcher does not call InsertEvent. Delivery outcomes go through the notifier's existing logging only.
  • API: internal/api/event_triggers.go with admin-gated mutations:
GET    /api/event-triggers
POST   /api/event-triggers
GET    /api/event-triggers/{id}
PATCH  /api/event-triggers/{id}
DELETE /api/event-triggers/{id}
POST   /api/event-triggers/{id}/test     — synthetic event_log → notifier.SendSyncForTest
  • Wired in main.go next to RegisterPersistentLogger.
  • Tests: internal/events/dispatcher_test.go: 10 cases covering filter eval, regex caching, dispatcher fan-out, unsupported action_type, trigger-source errors. CSV filter helper has dedicated table-driven coverage.

Frontend still pending: /event-triggers list + detail + new pages, the Send-test UX, i18n keys. Not touched this turn.

Where it plugs in

Mirrors the RegisterPersistentLogger shape at internal/events/bus.go:158:

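func RegisterEventTriggerDispatcher(b *Bus, triggers TriggerSource, notifier Notifier) func() {
    // Subscribe only to EventLog events; every other bus event is ignored.
    sub := b.Subscribe(func(evt Event) bool { return evt.Type == EventLog })
    go func() {
        for evt := range sub {
            payload, ok := evt.Payload.(EventLogPayload)
            if !ok { continue } // skip payloads of an unexpected type
            // Fan out to every enabled trigger whose filters match.
            for _, t := range triggers.Enabled() {
                if t.matches(payload) {
                    // Deliver only; never write to event_log here (loop prevention).
                    notifier.Send(t.ActionTarget, buildBody(t, payload))
                }
            }
        }
    }()
    // The caller invokes the returned func to unsubscribe on shutdown.
    return func() { b.Unsubscribe(sub) }
}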

Reuses the existing notifier at internal/notify/notifier.go — including the signed-delivery and webhook_deliveries audit trail.
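For readers unfamiliar with signed delivery, the sketch below shows the widely used X-Hub-Signature-256 convention: the header value is "sha256=" followed by the hex HMAC-SHA256 of the raw request body, keyed with the shared secret. This illustrates the convention only; whether Tinyforge's notifier formats the header exactly this way is an assumption, and signBody / verifySignature are hypothetical names.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
)

// signBody returns the header value a sender would attach to a delivery,
// following the "sha256=<hex hmac>" convention.
func signBody(secret string, body []byte) string {
	mac := hmac.New(sha256.New, []byte(secret))
	mac.Write(body) // hash.Hash.Write never returns an error
	return "sha256=" + hex.EncodeToString(mac.Sum(nil))
}

// verifySignature is what a receiver would run against the header it got.
// hmac.Equal compares in constant time to avoid leaking match position.
func verifySignature(secret string, body []byte, header string) bool {
	expected := signBody(secret, body)
	return hmac.Equal([]byte(expected), []byte(header))
}
```

A receiver must verify against the exact bytes it received, before any JSON re-encoding, because re-serialising the payload can change the bytes and invalidate the signature.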

Schema

CREATE TABLE event_triggers (
  id                    INTEGER PRIMARY KEY AUTOINCREMENT,
  name                  TEXT NOT NULL,
  filter_severity       TEXT,            -- nullable; comma-list like 'warn,error'
  filter_source         TEXT,            -- nullable; comma-list like 'logscan,deploy'
  filter_message_regex  TEXT,            -- nullable; matched against message
  action_type           TEXT NOT NULL,   -- 'webhook' | 'notification_channel'
  action_target         TEXT NOT NULL,   -- URL or channel ID
  enabled               INTEGER NOT NULL DEFAULT 1,
  created_at            TEXT NOT NULL
);

Filters AND together. Empty filters match all.
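The filter semantics can be sketched as follows. The trigger and eventLogEntry structs are trimmed, hypothetical stand-ins for the real models, CSV items are compared exactly after trimming whitespace (an assumption), and the regex is compiled on every call for brevity where the landed dispatcher memoizes it.

```go
package main

import (
	"regexp"
	"strings"
)

// trigger holds only the filter columns of an event_triggers row.
type trigger struct {
	FilterSeverity     string // "" or comma-list like "warn,error"
	FilterSource       string // "" or comma-list like "logscan,deploy"
	FilterMessageRegex string // "" or a regex matched against the message
}

// eventLogEntry is a hypothetical minimal view of an event_log payload.
type eventLogEntry struct {
	Severity, Source, Message string
}

// csvContains reports whether v is in the comma-separated list.
// An empty list means "no filter" and therefore matches everything.
func csvContains(list, v string) bool {
	if strings.TrimSpace(list) == "" {
		return true
	}
	for _, item := range strings.Split(list, ",") {
		if strings.TrimSpace(item) == v {
			return true
		}
	}
	return false
}

// matches ANDs the three filters together; an invalid regex is reported
// as an error rather than silently treated as a non-match.
func (t trigger) matches(e eventLogEntry) (bool, error) {
	if !csvContains(t.FilterSeverity, e.Severity) || !csvContains(t.FilterSource, e.Source) {
		return false, nil
	}
	if t.FilterMessageRegex == "" {
		return true, nil
	}
	re, err := regexp.Compile(t.FilterMessageRegex)
	if err != nil {
		return false, err
	}
	return re.MatchString(e.Message), nil
}
```

A trigger with all three filters empty therefore fires on every event_log entry, which is worth surfacing in the editor UI since it is easy to create by accident.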

Loop-prevention

Critical constraint: the dispatcher must not write to event_log. All delivery successes / failures land in webhook_deliveries (existing table) so the audit trail is preserved without risking trigger recursion. Keeps the boundary crisp:

  • event_log = system observing itself
  • webhook_deliveries = system talking to the outside

If a user-visible "trigger fired" entry is desired in the events UI, add a read-only join from webhook_deliveries into the events page rather than writing event_log rows.


What to defer

| Item | Why | Add when |
|---|---|---|
| Multi-line stack trace coalescing | Real rabbit hole (which lines belong together?). | Real user pain. |
| Capture-group templating in messages ({{.captures.code}}) | v1 stores captures in metadata, displays raw line. | Once real rules exist and patterns emerge. |
| Backfilling history search | This is Loki/Grafana scope-creep. | Never (push to Loki instead if it comes up). |
| Per-rule alert routing | v1 fans out by (severity, source) filter on trigger side. | When users want one rule → one channel. |
| YAML config-as-code | Tinyforge is UI-driven everywhere else. | Probably never. |
| Retry / backoff on trigger delivery failure | Notifier already handles delivery; whether triggers retry is a separate question. | If trigger reliability becomes an SLO. |

UI footprint

All boolean inputs use ToggleSwitch per project CLAUDE.md. All destructive actions use ConfirmDialog per memory note (no inline Yes/No strips).

New pages

  • /log-scan-rules — list with severity / workload filter, "+ New rule" button.
    • Detail page: name, pattern (regex with live test box that takes a sample log line), severity, streams, cooldown, enabled toggle, scope picker (global / workload).
  • /event-triggers — list, "+ New trigger" button.
    • Detail page: name, filters (severity multiselect, source multiselect, optional message regex), action type, action target, enabled toggle.

Augmentations

  • Workload detail page (/apps/[id]): new "Log Rules" tab/panel listing effective rules for this workload. Each global shows an "Override for this workload" button. Each override / workload-only shows edit + delete.
  • Events page (/events): entries with source=logscan get a small icon + tooltip showing rule name. Click → jumps to rule detail.
  • Settings sidebar: links to /log-scan-rules and /event-triggers under a new "Observability" group.

i18n keys to add

Roughly 40-60 keys across en.json + ru.json. Namespaces: logscan.* and triggers.*.


API surface

GET    /api/log-scan-rules                 — list (filter: ?workload_id=, ?global=true)
POST   /api/log-scan-rules                 — create
GET    /api/log-scan-rules/{id}            — detail
PATCH  /api/log-scan-rules/{id}            — update
DELETE /api/log-scan-rules/{id}            — delete
POST   /api/log-scan-rules/{id}/test       — body: {sample_line}; returns matched: bool, captures
GET    /api/workloads/{id}/effective-rules — computed effective ruleset for a workload

GET    /api/event-triggers                 — list
POST   /api/event-triggers                 — create
GET    /api/event-triggers/{id}            — detail
PATCH  /api/event-triggers/{id}            — update
DELETE /api/event-triggers/{id}            — delete
POST   /api/event-triggers/{id}/test       — dispatches a synthetic event to verify the action target

POST .../test endpoints are worth shipping in v1 — they make the rule / trigger editing UX dramatically nicer and avoid "did I get the regex right?" deploy-and-pray cycles.
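The core of the rule /test endpoint is small enough to sketch. This version assumes captures are reported by named group, which the design does not spell out; testResult and testPattern are hypothetical names, and HTTP decoding and encoding are omitted.

```go
package main

import "regexp"

// testResult mirrors the documented response fields: matched and captures.
type testResult struct {
	Matched  bool
	Captures map[string]string
}

// testPattern compiles pattern, runs it against sampleLine, and collects
// named capture groups. A compile error is returned to the caller so the
// API can report it instead of answering "no match".
func testPattern(pattern, sampleLine string) (testResult, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return testResult{}, err
	}
	m := re.FindStringSubmatch(sampleLine)
	if m == nil {
		return testResult{Matched: false}, nil
	}
	caps := make(map[string]string)
	for i, name := range re.SubexpNames() {
		if i == 0 || name == "" {
			continue // index 0 is the whole match; unnamed groups are skipped
		}
		caps[name] = m[i]
	}
	return testResult{Matched: true, Captures: caps}, nil
}
```

For example, the pattern `HTTP (?P<code>5\d\d)` against the sample line "upstream returned HTTP 502" yields a match with the capture code set to 502.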


File pointers (when work starts)

Backend, new:

  • internal/logscanner/{manager,tail,engine,rules}.go
  • internal/api/log_scan_rules.go
  • internal/api/event_triggers.go
  • internal/store/log_scan_rules.go
  • internal/store/event_triggers.go
  • internal/events/dispatcher.go (or extend bus.go with RegisterEventTriggerDispatcher)

Backend, modified:

Frontend, new:

  • web/src/routes/log-scan-rules/+page.svelte, [id]/+page.svelte, new/+page.svelte
  • web/src/routes/event-triggers/+page.svelte, [id]/+page.svelte, new/+page.svelte
  • web/src/lib/components/LogRulePanel.svelte (workload detail tab)
  • web/src/lib/components/RegexTestBox.svelte (reusable)

Frontend, modified:

  • web/src/routes/apps/[id]/+page.svelte — add Log Rules tab
  • web/src/routes/events/+page.svelte — logscan source icon + rule tooltip
  • web/src/routes/+layout.svelte — Observability nav group
  • web/src/lib/i18n/{en,ru}.json — new key namespaces
  • web/src/lib/api.ts, web/src/lib/types.ts — typed clients

Open questions to revisit before coding

  1. Container start/stop signal source — bus events (low latency, two new event types) vs polling (simpler, ~5s latency). Tentative: polling.
  2. Trigger delivery retry — does the dispatcher retry on webhook failure, or is one shot enough since webhook_deliveries records failures? Tentative: one shot v1; revisit if reliability complaints surface.
  3. Where does the "logscan source icon" link go on the events page — rule detail page, or the workload's effective-rules tab? Latter is probably more useful since it shows context.

Memory pointer

Add a memory after this lands describing the event_log = observe-self, webhook_deliveries = talk-to-outside boundary — it's the kind of invariant that's easy to violate accidentally when adding new event types later.