Two design + handoff docs:

- docs/WORKLOAD_REFACTOR_TODO.md: status-at-a-glance table showing what's done
  (volume scopes, kind-aware editors, vendor webhook parsing, chain-panel CSS,
  Log Rules panel) and what's still pending (static source inline port + the
  hard legacy cutover gated on it; codemap entries; /apps page-level i18n;
  Priority 4 integration tests).
- docs/LOGSCAN_AND_TRIGGERS_TODO.md: companion design + status doc for the two
  Observability features. Records the loop-prevention invariant (event_log =
  system observing itself, webhook_deliveries = system talking to outside) so
  the next contributor doesn't accidentally break it by adding a new EventLog
  subscriber that re-publishes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -0,0 +1,385 @@
# Log Scanner + Event Triggers - Design Handoff

Two related features. They can ship independently, but were designed together
because they share the event_log seam.

- **A. Log scanner** - tail container logs, match against rules, emit event_log
  entries. Producer of events.
- **B. Event triggers** - turn event_log entries into webhook / notification
  dispatches. Consumer of events. Generalizes the existing
  `RegisterPersistentLogger` pattern.

Either half is useful alone:
- A without B = errors get surfaced in the events UI, no external delivery.
- B without A = manual + reconciler + deploy events can drive notifications.

Recommended ship order: B first (smaller, self-contained generalization), then
A (more moving parts, depends on container-lifecycle hooks).

---

## A. Log scanner - BACKEND LANDED

Status:

- **Schema + store CRUD** - `internal/store/log_scan_rules.go` +
  `log_scan_rules` table added to the `observabilityTables` block.
  Includes the `EffectiveLogScanRules(workloadID)` helper that
  resolves global rules minus per-workload overrides plus workload-only
  additions in one Go-side pass.
- **Stream-selectable docker reads** - `internal/docker/container.go`
  `ContainerLogsOpts` accepts a `ContainerLogOptions{ShowStdout,
  ShowStderr, Follow, Tail}` so the scanner can subscribe to one
  stream when a rule scopes itself to stdout or stderr. The legacy
  `ContainerLogs` is preserved as a thin wrapper for back-compat.
- **Engine** - `internal/logscanner/engine.go`: per-rule cooldown
  (keyed on container+rule), per-container token bucket (default 10
  events / 60s, overridable), regex match per line, hits returned
  for the manager to persist. Pure logic, fully unit-tested.
- **Tail goroutine** - `internal/logscanner/tail.go`: per-container
  loop that reads docker's multiplexed log frames (with TTY fallback),
  strips the prepended RFC3339 timestamp, and runs every line through the
  engine + snapshot. Exits on container stop or context cancel.
- **Manager** - `internal/logscanner/manager.go`: 5s polling diff
  against `ListContainers(state=running)`, `atomic.Pointer[Snapshot]`
  hot-reload, structural HitEmitter that writes event_log rows AND
  publishes `EventLog` on the bus (so event-trigger dispatchers can
  pick them up immediately).
- **API** - `internal/api/log_scan_rules.go`: full CRUD,
  `/test` endpoint accepting `{"sample_line": "..."}` and returning
  matched/captures, plus
  `GET /api/workloads/{id}/effective-rules` for the workload detail
  page's future Log Rules tab. Admin-gated mutations.
- **Wired in main.go** before the API server is constructed so the
  reload callback is plugged via `apiServer.SetLogScanReloader`.
- **Loop-prevention** - Same boundary as feature B: scanner publishes
  EventLog events, dispatcher consumes them, neither writes to
  event_log on the consume side.
- **Tests** - `internal/logscanner/{engine,rules}_test.go` cover
  cooldown isolation, token bucket refill, stream filtering,
  override-replaces-global, disabled-override-suppresses-global,
  compile-error reporting. `internal/store/log_scan_rules_test.go`
  covers validation + cascade delete.

**Frontend still pending** - `/log-scan-rules` pages, regex test box
component, Log Rules tab on `/apps/[id]`, i18n keys. Not touched this
turn.

### Where it plugs in

[internal/docker/container.go:362](../internal/docker/container.go#L362) already
exposes `ContainerLogs(ctx, id, follow=true, tail)`. The existing SSE handler at
[internal/api/workloads.go:43](../internal/api/workloads.go#L43)
(`streamWorkloadContainerLogs`) is per-viewer and dies on browser disconnect:
**do not hook the scanner there**. The scanner is a separate long-lived
subsystem owned by the server process.

Minor required change to `ContainerLogs`: expose `ShowStdout` / `ShowStderr` as
caller-controlled. They were hardcoded to `true`/`true`, and the single existing
caller passes "both", so there is no friction. Add an options struct or two
booleans (this landed as the `ContainerLogsOpts` options struct listed under
Status).

### New package: `internal/logscanner/`

```
internal/logscanner/
  manager.go   - Manager: map[containerID]*tail, lifecycle hooks
  tail.go      - per-container goroutine; reads logs, fans to engine
  engine.go    - rule evaluation + cooldown + rate limit
  rules.go     - Rule struct, regex compile cache, effective-set resolver
```

**Manager lifecycle.** Subscribes to container start/stop signals. Options for
the signal source:
1. Add a `ContainerStarted` / `ContainerStopped` event type to the bus and
   publish from the reconciler + deployer. Cleanest, but adds two event types.
2. Manager polls `docker.ListContainers` every N seconds and diffs. Lazier,
   robust to missed signals, slightly higher idle CPU. Probably fine.

Pick (1) if you want zero-latency start, (2) if you want fewer moving parts.
Defaulting to **(2) with 5s poll** - Docker container starts already take
seconds; sub-second matching is not a requirement.
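A minimal sketch of the polling diff in option (2), assuming the manager keeps a set of container IDs it is already tailing. The name `diffRunning` and its signature are illustrative, not the landed manager code.

```go
package main

// diffRunning compares the container IDs that already have a tail against
// the IDs currently reported as running. It returns the IDs that need a new
// tail and the IDs whose tails should be cancelled.
// Hypothetical helper: the real manager may track richer state per tail.
func diffRunning(tailed map[string]struct{}, running []string) (start, stop []string) {
	seen := make(map[string]struct{}, len(running))
	for _, id := range running {
		seen[id] = struct{}{}
		if _, ok := tailed[id]; !ok {
			start = append(start, id)
		}
	}
	for id := range tailed {
		if _, ok := seen[id]; !ok {
			stop = append(stop, id)
		}
	}
	return start, stop
}
```

Called once per poll tick, this makes a missed start or stop self-correcting on the next tick, which is the robustness argument for option (2).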

**Tail goroutine.** On container start: open `ContainerLogs(follow=true,
tail="0")` with stdout/stderr filters per rules in scope. Read line-by-line via
`bufio.Scanner`. For each line: run through engine. On container stop or ctx
cancel: drain and exit.
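The Status list notes that the tail strips the prepended RFC3339 timestamp before matching. A small sketch of that per-line step, assuming the timestamp and the log text are separated by a single space; `stripTimestamp` is a hypothetical name, not the function in `tail.go`.

```go
package main

import (
	"strings"
	"time"
)

// stripTimestamp removes a leading RFC3339 timestamp and returns the
// remaining log text. Lines without a parseable timestamp prefix are
// returned unchanged, so rules still see the full line.
func stripTimestamp(line string) string {
	prefix, rest, found := strings.Cut(line, " ")
	if !found {
		return line
	}
	// RFC3339Nano also accepts timestamps without fractional seconds.
	if _, err := time.Parse(time.RFC3339Nano, prefix); err != nil {
		return line
	}
	return rest
}
```

Stripping before matching keeps user-written patterns anchored with `^` working against the application's own output rather than against the timestamp.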

**Engine.** Holds compiled regexes per rule. For each line:
- Walk effective ruleset for this workload (see schema below).
- For each matching rule: check cooldown (`map[ruleID]time.Time`, mutex
  guarded; the landed engine keys it on container+rule). If cooled down,
  insert event_log row + publish + update timestamp.
- Per-container token bucket (default: 10 events/min/container) to prevent
  catastrophic event_log floods if a regex is too greedy.
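The per-line steps above can be sketched as follows. This is an illustration under stated assumptions, not the code in `engine.go`: it keeps one `engine` value per container (so cooldown state is implicitly container+rule and no mutex is shown), and it passes time as Unix seconds to keep the example deterministic. All type and function names are hypothetical.

```go
package main

import "regexp"

// rule carries the fields the engine needs from a log_scan_rules row.
type rule struct {
	ID              int64
	Pattern         *regexp.Regexp
	CooldownSeconds int64
}

// newRule compiles the pattern once, at load time.
func newRule(id int64, pattern string, cooldownSeconds int64) (rule, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return rule{}, err
	}
	return rule{ID: id, Pattern: re, CooldownSeconds: cooldownSeconds}, nil
}

// engine holds the state for ONE container (assumption of this sketch).
type engine struct {
	lastFired map[int64]int64 // rule ID -> Unix second of the last emitted hit
	tokens    float64
	capacity  float64
	perSecond float64
	lastFill  int64
}

// newEngine builds a bucket that allows `capacity` events per window.
func newEngine(capacity, windowSeconds, nowUnix int64) *engine {
	return &engine{
		lastFired: make(map[int64]int64),
		tokens:    float64(capacity),
		capacity:  float64(capacity),
		perSecond: float64(capacity) / float64(windowSeconds),
		lastFill:  nowUnix,
	}
}

// takeToken refills the bucket for the elapsed time, then spends one token.
func (e *engine) takeToken(nowUnix int64) bool {
	if elapsed := nowUnix - e.lastFill; elapsed > 0 {
		e.tokens += float64(elapsed) * e.perSecond
		if e.tokens > e.capacity {
			e.tokens = e.capacity
		}
		e.lastFill = nowUnix
	}
	if e.tokens < 1 {
		return false
	}
	e.tokens--
	return true
}

// match returns the IDs of rules that should emit an event for this line.
func (e *engine) match(rules []rule, line string, nowUnix int64) []int64 {
	var hits []int64
	for _, r := range rules {
		if !r.Pattern.MatchString(line) {
			continue
		}
		if last, ok := e.lastFired[r.ID]; ok && nowUnix-last < r.CooldownSeconds {
			continue // still cooling down
		}
		if !e.takeToken(nowUnix) {
			break // bucket empty: drop the rest of this line's hits
		}
		e.lastFired[r.ID] = nowUnix
		hits = append(hits, r.ID)
	}
	return hits
}
```

The cooldown is checked before the token is spent, so a noisy rule that is already cooling down does not consume budget that other rules on the same container could use.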

### Schema

Single table, global + override pattern. No separate "overrides" table.

```sql
CREATE TABLE log_scan_rules (
  id               INTEGER PRIMARY KEY AUTOINCREMENT,
  workload_id      TEXT,                         -- NULL = global rule
  overrides_id     INTEGER,                      -- if set, this row overrides a global rule for one workload
  name             TEXT NOT NULL,
  pattern          TEXT NOT NULL,                -- regex, compiled at load
  severity         TEXT NOT NULL,                -- info|warn|error
  streams          TEXT NOT NULL DEFAULT 'all',  -- all|stdout|stderr
  cooldown_seconds INTEGER NOT NULL DEFAULT 60,
  enabled          INTEGER NOT NULL DEFAULT 1,
  created_at       TEXT NOT NULL,
  FOREIGN KEY (workload_id) REFERENCES workloads(id) ON DELETE CASCADE,
  FOREIGN KEY (overrides_id) REFERENCES log_scan_rules(id) ON DELETE CASCADE
);
CREATE INDEX idx_log_scan_rules_workload ON log_scan_rules(workload_id);
CREATE INDEX idx_log_scan_rules_overrides ON log_scan_rules(overrides_id);
```

**Effective ruleset for workload X:**
1. All rows where `workload_id IS NULL AND overrides_id IS NULL` (pure globals),
   *minus* any global that has a row with `workload_id = X AND overrides_id = global.id`.
2. Plus all rows where `workload_id = X AND overrides_id IS NULL` (workload-only additions).
3. Plus all override rows where `workload_id = X AND overrides_id IS NOT NULL`
   (substitute for the global; their fields win, including `enabled = 0` to
   disable the global for this workload).

A pure SQL implementation is doable with a `LEFT JOIN ... WHERE override.id IS
NULL` for step 1 plus a `UNION ALL` for steps 2 and 3. Or compute in Go after
two simpler queries - fine since rule counts will be small.
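A sketch of the Go-side variant of the three steps above. It is not the landed `EffectiveLogScanRules` helper: the `Rule` struct is a simplified stand-in, `""` and `0` stand in for SQL `NULL`, and disabled rows are dropped from the result on the assumption that the caller is the scanner (a UI listing would likely keep them).

```go
package main

// Rule is a simplified, hypothetical view of a log_scan_rules row.
type Rule struct {
	ID          int64
	WorkloadID  string // "" stands in for NULL (global rule)
	OverridesID int64  // 0 stands in for NULL (not an override)
	Name        string
	Enabled     bool
}

// effectiveRules resolves the active rule set for one workload.
// workloadID is assumed to be non-empty.
func effectiveRules(all []Rule, workloadID string) []Rule {
	overridden := make(map[int64]bool)
	var out []Rule

	// Steps 2 and 3: rows owned by this workload. An override masks its
	// global even when the override itself is disabled.
	for _, r := range all {
		if r.WorkloadID != workloadID {
			continue
		}
		if r.OverridesID != 0 {
			overridden[r.OverridesID] = true
		}
		if r.Enabled {
			out = append(out, r)
		}
	}

	// Step 1: pure globals that this workload has not overridden.
	for _, r := range all {
		if r.WorkloadID == "" && r.OverridesID == 0 && r.Enabled && !overridden[r.ID] {
			out = append(out, r)
		}
	}
	return out
}
```

Two passes over a small slice keep the logic readable; with rule counts in the tens, the cost is negligible compared with the regex matching that follows.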

### Output

Scanner calls `store.InsertEvent` with:
- `Source = "logscan"`
- `Severity` from the matched rule
- `Message` = raw matched line (truncated to ~500 chars)
- `Metadata` JSON = `{"workload_id": ..., "container_id": ..., "rule_id": ..., "rule_name": ..., "captures": {...}}`

Then `bus.Publish(EventLog, payload)`. This reuses exactly the path
[internal/events/bus.go:158](../internal/events/bus.go#L158)
(`RegisterPersistentLogger`) already established. SSE clients see it live, and
the dispatcher from feature B picks it up.
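To make the `Message` and `Metadata` fields concrete, here is a hedged sketch of how a hit could be prepared before `store.InsertEvent` is called. The 500-byte cap, the `hitMetadata` struct, and both function names are assumptions; only the JSON key names come from the list above.

```go
package main

import (
	"encoding/json"
	"unicode/utf8"
)

// maxMessageBytes is an assumed cap; the design only says "~500 chars".
const maxMessageBytes = 500

// truncateMessage cuts line to at most maxMessageBytes bytes without
// splitting a multi-byte UTF-8 character.
func truncateMessage(line string) string {
	if len(line) <= maxMessageBytes {
		return line
	}
	cut := maxMessageBytes
	for cut > 0 && !utf8.RuneStart(line[cut]) {
		cut--
	}
	return line[:cut]
}

// hitMetadata mirrors the Metadata JSON keys listed in this section.
type hitMetadata struct {
	WorkloadID  string            `json:"workload_id"`
	ContainerID string            `json:"container_id"`
	RuleID      int64             `json:"rule_id"`
	RuleName    string            `json:"rule_name"`
	Captures    map[string]string `json:"captures"`
}

// encodeMetadata serializes the metadata for the event_log row.
func encodeMetadata(m hitMetadata) (string, error) {
	b, err := json.Marshal(m)
	if err != nil {
		return "", err
	}
	return string(b), nil
}
```

Truncating on a rune boundary matters because log lines are arbitrary bytes from user containers, and an invalid UTF-8 tail would otherwise leak into the events UI.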

### Hot-reload

When a rule is created/updated/deleted via the API, the manager must rebuild
the effective ruleset for affected containers. Cheapest path: a single
`*atomic.Pointer[ruleSnapshot]` shared across tails, replaced wholesale on any
rule change. Each tail dereferences the snapshot per line - no locking on the
hot path.
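A minimal sketch of the snapshot swap, using `sync/atomic`'s generic `Pointer` type (Go 1.19+). The `ruleSnapshot` fields and the `snapshotHolder` wrapper are illustrative assumptions, not the manager's actual types.

```go
package main

import "sync/atomic"

// ruleSnapshot is an immutable view of the compiled rules.
// The fields here are placeholders for whatever the manager precomputes.
type ruleSnapshot struct {
	Version  int
	Patterns []string
}

// snapshotHolder is shared by all tails.
type snapshotHolder struct {
	p atomic.Pointer[ruleSnapshot]
}

// Replace publishes a new snapshot; called on any rule create/update/delete.
func (h *snapshotHolder) Replace(s *ruleSnapshot) { h.p.Store(s) }

// Current returns the snapshot to use for the next line. It never blocks.
// It returns nil until the first Replace call.
func (h *snapshotHolder) Current() *ruleSnapshot { return h.p.Load() }
```

Because a snapshot is never mutated after `Replace`, a tail that loaded the old pointer can finish its current line safely while newer lines pick up the new rules.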

---

## B. Event triggers - BACKEND LANDED

Status:

- **Schema + store CRUD** - `internal/store/event_triggers.go` + table
  creation in `internal/store/store.go` `observabilityTables`. Model:
  `EventTrigger` in `internal/store/models.go`.
- **Dispatcher** - `internal/events/dispatcher.go`
  `RegisterEventTriggerDispatcher(bus, triggerSource, notifier)`.
  Filter eval is AND-composed across severity (CSV), source (CSV), and
  optional message regex. Compiled regexes are memoized.
- **Webhook delivery** - extended `notify.Notifier` with
  `SendPayload(url, secret, eventType, payload)` which reuses the
  existing HMAC + headers infra (`X-Hub-Signature-256`, etc.). New
  `TierEventTrigger` tier is recorded for telemetry / audit.
- **Loop-prevention** - dispatcher does **not** call `InsertEvent`.
  Delivery outcomes go through the notifier's existing logging only.
- **API** - `internal/api/event_triggers.go` with admin-gated mutations:

  ```http
  GET    /api/event-triggers
  POST   /api/event-triggers
  GET    /api/event-triggers/{id}
  PATCH  /api/event-triggers/{id}
  DELETE /api/event-triggers/{id}
  POST   /api/event-triggers/{id}/test   (synthetic event_log → notifier.SendSyncForTest)
  ```

- **Wired in main.go** next to `RegisterPersistentLogger`.
- **Tests** - `internal/events/dispatcher_test.go`: 10 cases covering
  filter eval, regex caching, dispatcher fan-out, unsupported
  action_type, trigger-source errors. CSV filter helper has dedicated
  table-driven coverage.

**Frontend still pending** - `/event-triggers` list + detail + new
pages, the Send-test UX, i18n keys. Not touched this turn.

### Where it plugs in

Mirrors the `RegisterPersistentLogger` shape at
[internal/events/bus.go:158](../internal/events/bus.go#L158):

```go
// Design sketch: Bus, Event, EventLogPayload, TriggerSource and Notifier are
// the project's own types; matches and buildBody are trigger-side helpers.
func RegisterEventTriggerDispatcher(b *Bus, triggers TriggerSource, notifier Notifier) func() {
	// Subscribe only to EventLog events.
	sub := b.Subscribe(func(evt Event) bool { return evt.Type == EventLog })
	go func() {
		for evt := range sub {
			payload, ok := evt.Payload.(EventLogPayload)
			if !ok {
				continue
			}
			// Fan out to every enabled trigger whose filters match.
			for _, t := range triggers.Enabled() {
				if t.matches(payload) {
					// Never call InsertEvent here (see Loop-prevention).
					notifier.Send(t.ActionTarget, buildBody(t, payload))
				}
			}
		}
	}()
	// The returned func stops the dispatcher.
	return func() { b.Unsubscribe(sub) }
}
```

Reuses the existing notifier at
[internal/notify/notifier.go](../internal/notify/notifier.go) - including the
signed-delivery and `webhook_deliveries` audit trail.
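For readers unfamiliar with signed delivery, the sketch below shows the common `X-Hub-Signature-256` convention: an HMAC-SHA256 of the raw body, hex-encoded and prefixed with `sha256=`. That format is the GitHub-style convention; whether the project's notifier produces exactly this value is an assumption, and `signBody` / `verifySignature` are hypothetical names.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
)

// signBody computes the header value for a webhook body.
// Assumed format: "sha256=" + hex(HMAC-SHA256(secret, body)).
func signBody(secret string, body []byte) string {
	mac := hmac.New(sha256.New, []byte(secret))
	mac.Write(body)
	return "sha256=" + hex.EncodeToString(mac.Sum(nil))
}

// verifySignature is what a receiver would run against the header it got.
// hmac.Equal compares in constant time to avoid leaking timing information.
func verifySignature(secret string, body []byte, header string) bool {
	return hmac.Equal([]byte(signBody(secret, body)), []byte(header))
}
```

A receiver must verify against the exact bytes it received, before any JSON re-encoding, or the signatures will not match.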

### Schema

```sql
CREATE TABLE event_triggers (
  id                   INTEGER PRIMARY KEY AUTOINCREMENT,
  name                 TEXT NOT NULL,
  filter_severity      TEXT,               -- nullable; comma-list like 'warn,error'
  filter_source        TEXT,               -- nullable; comma-list like 'logscan,deploy'
  filter_message_regex TEXT,               -- nullable; matched against message
  action_type          TEXT NOT NULL,      -- 'webhook' | 'notification_channel'
  action_target        TEXT NOT NULL,      -- URL or channel ID
  enabled              INTEGER NOT NULL DEFAULT 1,
  created_at           TEXT NOT NULL
);
```

Filters AND together. Empty filters match all.
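The filter semantics can be sketched as below. This is an illustration, not the landed dispatcher: the `trigger` struct is a cut-down stand-in for `EventTrigger`, and the regex is compiled on every call for brevity, whereas the Status list says the real dispatcher memoizes compiled regexes.

```go
package main

import (
	"regexp"
	"strings"
)

// trigger holds only the filter columns of an event_triggers row.
type trigger struct {
	FilterSeverity     string // comma-list, "" = match all
	FilterSource       string // comma-list, "" = match all
	FilterMessageRegex string // "" = match all
}

// csvContains reports whether value appears in the comma-separated list.
// An empty list matches everything.
func csvContains(list, value string) bool {
	if strings.TrimSpace(list) == "" {
		return true
	}
	for _, item := range strings.Split(list, ",") {
		if strings.TrimSpace(item) == value {
			return true
		}
	}
	return false
}

// matches ANDs the three filters together.
func (t trigger) matches(severity, source, message string) (bool, error) {
	if !csvContains(t.FilterSeverity, severity) || !csvContains(t.FilterSource, source) {
		return false, nil
	}
	if t.FilterMessageRegex == "" {
		return true, nil
	}
	re, err := regexp.Compile(t.FilterMessageRegex)
	if err != nil {
		return false, err
	}
	return re.MatchString(message), nil
}
```

Checking the two cheap CSV filters first means the regex only runs for events that already passed severity and source.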

### Loop-prevention

**Critical constraint: the dispatcher must not write to event_log.** All
delivery successes / failures land in `webhook_deliveries` (existing table) so
the audit trail is preserved without risking trigger recursion. This keeps the
boundary crisp:

- `event_log` = system observing itself
- `webhook_deliveries` = system talking to the outside

If a user-visible "trigger fired" entry is desired in the events UI, add a
*read-only join* from `webhook_deliveries` into the events page rather than
writing event_log rows.

---

## What to defer

| Item | Why | Add when |
|---|---|---|
| Multi-line stack trace coalescing | Real rabbit hole (which lines belong together?). | Real user pain. |
| Capture-group templating in messages (`{{.captures.code}}`) | v1 stores captures in metadata, displays raw line. | Once real rules exist and patterns emerge. |
| Backfilling history search | This is Loki/Grafana scope-creep. | Never (push to Loki instead if it comes up). |
| Per-rule alert routing | v1 fans out by `(severity, source)` filter on trigger side. | When users want one rule → one channel. |
| YAML config-as-code | Tinyforge is UI-driven everywhere else. | Probably never. |
| Retry / backoff on trigger delivery failure | Notifier already handles delivery; whether *triggers* retry is a separate question. | If trigger reliability becomes an SLO. |

---

## UI footprint

All boolean inputs use `ToggleSwitch` per project CLAUDE.md. All destructive
actions use `ConfirmDialog` per memory note (no inline Yes/No strips).

### New pages

- **`/log-scan-rules`** - list with severity / workload filter, "+ New rule" button.
  - Detail page: name, pattern (regex with live test box that takes a sample log line), severity, streams, cooldown, enabled toggle, scope picker (global / workload).
- **`/event-triggers`** - list, "+ New trigger" button.
  - Detail page: name, filters (severity multiselect, source multiselect, optional message regex), action type, action target, enabled toggle.

### Augmentations

- **Workload detail page** (`/apps/[id]`): new "Log Rules" tab/panel listing
  effective rules for this workload. Each global shows an "Override for this
  workload" button. Each override / workload-only shows edit + delete.
- **Events page** (`/events`): entries with `source=logscan` get a small icon +
  tooltip showing rule name. Click → jumps to rule detail.
- **Settings sidebar**: links to `/log-scan-rules` and `/event-triggers` under
  a new "Observability" group.

### i18n keys to add

Roughly 40–60 keys across `en.json` + `ru.json`. Namespaces: `logscan.*` and
`triggers.*`.

---

## API surface

```
GET    /api/log-scan-rules                      - list (filter: ?workload_id=, ?global=true)
POST   /api/log-scan-rules                      - create
GET    /api/log-scan-rules/{id}                 - detail
PATCH  /api/log-scan-rules/{id}                 - update
DELETE /api/log-scan-rules/{id}                 - delete
POST   /api/log-scan-rules/{id}/test            - body: {sample_line}; returns matched: bool, captures
GET    /api/workloads/{id}/effective-rules      - computed effective ruleset for a workload

GET    /api/event-triggers                      - list
POST   /api/event-triggers                      - create
GET    /api/event-triggers/{id}                 - detail
PATCH  /api/event-triggers/{id}                 - update
DELETE /api/event-triggers/{id}                 - delete
POST   /api/event-triggers/{id}/test            - dispatches a synthetic event to verify the action target
```

`POST .../test` endpoints are worth shipping in v1 - they make the rule /
trigger editing UX dramatically nicer and avoid "did I get the regex right?"
deploy-and-pray cycles.

---

## File pointers (when work starts)

**Backend, new:**
- `internal/logscanner/{manager,tail,engine,rules}.go`
- `internal/api/log_scan_rules.go`
- `internal/api/event_triggers.go`
- `internal/store/log_scan_rules.go`
- `internal/store/event_triggers.go`
- `internal/events/dispatcher.go` (or extend `bus.go` with `RegisterEventTriggerDispatcher`)

**Backend, modified:**
- [internal/docker/container.go:362](../internal/docker/container.go#L362) - expose stream selection on `ContainerLogs`
- [internal/api/router.go](../internal/api/router.go) - register new routes
- [cmd/server/main.go](../cmd/server/main.go) - wire `RegisterEventTriggerDispatcher` next to `RegisterPersistentLogger`, start `logscanner.Manager`
- migrations: `internal/store/migrations/00XX_log_scan_rules.sql`, `00XX_event_triggers.sql`

**Frontend, new:**
- `web/src/routes/log-scan-rules/+page.svelte`, `[id]/+page.svelte`, `new/+page.svelte`
- `web/src/routes/event-triggers/+page.svelte`, `[id]/+page.svelte`, `new/+page.svelte`
- `web/src/lib/components/LogRulePanel.svelte` (workload detail tab)
- `web/src/lib/components/RegexTestBox.svelte` (reusable)

**Frontend, modified:**
- `web/src/routes/apps/[id]/+page.svelte` - add Log Rules tab
- `web/src/routes/events/+page.svelte` - logscan source icon + rule tooltip
- `web/src/routes/+layout.svelte` - Observability nav group
- `web/src/lib/i18n/{en,ru}.json` - new key namespaces
- `web/src/lib/api.ts`, `web/src/lib/types.ts` - typed clients

---

## Open questions to revisit before coding

1. **Container start/stop signal source** - bus events (low latency, two new
   event types) vs polling (simpler, ~5s latency). Tentative: polling.
2. **Trigger delivery retry** - does the dispatcher retry on webhook failure,
   or is one shot enough since `webhook_deliveries` records failures? Tentative:
   one shot v1; revisit if reliability complaints surface.
3. **Where does the "logscan source icon" link go on the events page** - rule
   detail page, or the workload's effective-rules tab? Latter is probably more
   useful since it shows context.

---

## Memory pointer

Add a memory after this lands describing the event_log = observe-self,
webhook_deliveries = talk-to-outside boundary - it's the kind of invariant
that's easy to violate accidentally when adding new event types later.
@@ -0,0 +1,334 @@
# Workload-First Refactor - Remaining Work

Handoff for resuming the refactor. The plugin architecture (Source × Trigger),
`/api/workloads` surface, `/apps` UI, env/volume/webhook/logs/chain panels,
multi-face proxy routes, blue-green image deploys, schema-driven wizard, and
test coverage on triggers / image helpers / webhook parser / store upserts are
**already landed and live**. What follows is what's still pending, in priority
order.

## Status at a glance

| Item | Priority | Status |
| ---- | -------- | ------ |
| Static source inline port | 1 | **PENDING** - only remaining blocker for hard cutover |
| Hard legacy cutover | 1 | **PENDING** - gated by static port (volume scopes blocker is resolved) |
| Generalized volume scopes | 2 | DONE |
| Kind-aware editors (compose / image / static) | 2 | DONE |
| Vendor-specific webhook parsing | 2 | DONE |
| Chain-panel CSS | 3 | DONE |
| Log Rules panel on `/apps/[id]` | adjacent | DONE - uses `getEffectiveLogScanRules` + per-workload override action |
| i18n for `/apps/*` page strings | 3 | **PARTIAL** - Log Rules panel + Observability surfaces i18n'd; `apps.*` namespace still pending |
| Docs / codemap entries for `internal/workload/plugin/` | 3 | **PENDING** |
| API-handler / dispatcher / compose-source / static-backend tests | 4 | **PENDING** |
| Triggers as first-class reusable entities (post-cutover) | 5 | **PENDING** |

Cross-references to the adjacent Observability work (Event Triggers + Log
Scanner backend + drop-counter stats panel) live in
[docs/LOGSCAN_AND_TRIGGERS_TODO.md](LOGSCAN_AND_TRIGGERS_TODO.md).

## Priority 1 - Architecture unlock

### Static source inline port - ~2150 LOC across 8 files

The current `internal/workload/plugin/source/static/` delegates to
`staticsite.Manager` via a phantom-row adapter
(`cmd/server/static_backend.go`) that keeps a synthetic row in the legacy
`static_sites` table per workload. This works but blocks the hard cutover:
you can't drop `static_sites` until the adapter is gone.

To port inline, the deploy pipeline body has to move into
`internal/workload/plugin/source/static/`:

| Source file | Lines | What to keep / port |
| --- | --- | --- |
| `internal/staticsite/manager.go` | 834 | Deploy / Stop / status pipeline. State should move to `containers` rows + `workload_env` instead of `static_sites`. |
| `internal/staticsite/gitea_content.go` | 360 | Keep as helper - Gitea content download/listing. |
| `internal/staticsite/github_provider.go` | 276 | Keep as helper. |
| `internal/staticsite/gitlab_provider.go` | 254 | Keep as helper. |
| `internal/staticsite/healthcheck.go` | 111 | Convert to plugin Reconcile body. |
| `internal/staticsite/markdown.go` | 83 | Keep as helper. |
| `internal/staticsite/provider.go` | 171 | Keep - provider abstraction. |
| `internal/staticsite/deno/` | (sub-pkg) | Keep - Dockerfile + router.ts codegen. |

Estimated as its own dedicated turn (or two). Strategy: keep the provider
abstraction + helpers exported; rewrite only `Manager.Deploy` body into a new
`source/static/deploy.go` that operates against `plugin.Workload` directly and
writes container rows + workload_env rather than the `static_sites` table.

### Hard legacy cutover

Sole remaining blocker is the static source inline port above. The
generalized-volume-scopes blocker is resolved (legacy `ResolvePath`
stays in place for legacy callers and dies with the cutover). When the
static port lands:

- Delete `/api/projects`, `/api/stacks`, `/api/sites`, `/api/stages` handlers.
- Drop tables: `projects`, `stages`, `stacks`, `stack_revisions`,
  `stack_deploys`, `static_sites`, `static_site_secrets`, `deploys`,
  `poll_states`.
- Delete `internal/stack/`, `internal/staticsite/` packages.
- Delete frontend `/projects`, `/sites`, `/stacks` routes.
- Delete legacy `volume.ResolvePath` + `internal/api/volume_browser.go`
  callers (the only remaining users).

## Priority 2 - Behavior gaps

### ~~Generalized volume scopes~~ - DONE

Landed: `internal/volume.ResolveWorkloadPath` (workload-keyed; sits next to the
legacy `ResolvePath` so legacy code paths keep working) plus the wired-through
`computeMounts` in `internal/workload/plugin/source/image/image.go`. All
`VolumeScope` values are now honored at deploy time:

- `absolute` - host bind, validated against `settings.AllowedVolumePaths`.
- `ephemeral` - tmpfs.
- `instance` - per-tag dir under `<base>/<workload>-<idShort>/instance-<tag>/<source>`.
- `stage`, `project` - both collapse to `<base>/<workload>-<idShort>/<source>`.
- `project_named` - Docker named volume prefixed `tf-<idShort>-<name>`.
- `named` - Docker named volume by raw name.
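The scope table above can be read as a single switch. The sketch below is only an illustration of that mapping, not `ResolveWorkloadPath` itself: `resolveMount`, the `mount` struct, and its parameters are hypothetical, the allow-list check for `absolute` is omitted, and `path.Join` is used so the example is deterministic (the real resolver is described as portable across Linux and Windows, so it presumably uses OS-aware path handling).

```go
package main

import (
	"fmt"
	"path"
)

// mount describes what gets attached to the container.
type mount struct {
	Kind   string // "bind", "tmpfs", or "volume"
	Source string // host path or Docker volume name; empty for tmpfs
}

// resolveMount maps a VolumeScope value to a mount.
// idShort is the shortened workload ID used in directory and volume names.
func resolveMount(scope, base, workload, idShort, tag, source string) (mount, error) {
	root := path.Join(base, workload+"-"+idShort)
	switch scope {
	case "absolute":
		// Validation against settings.AllowedVolumePaths is omitted here.
		return mount{Kind: "bind", Source: source}, nil
	case "ephemeral":
		return mount{Kind: "tmpfs"}, nil
	case "instance":
		return mount{Kind: "bind", Source: path.Join(root, "instance-"+tag, source)}, nil
	case "stage", "project":
		return mount{Kind: "bind", Source: path.Join(root, source)}, nil
	case "project_named":
		return mount{Kind: "volume", Source: "tf-" + idShort + "-" + source}, nil
	case "named":
		return mount{Kind: "volume", Source: source}, nil
	}
	return mount{}, fmt.Errorf("unknown volume scope %q", scope)
}
```

Returning an error for unknown scopes, rather than defaulting to a bind mount, avoids silently exposing a host path when a new scope value is added without resolver support.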

Test coverage: `internal/volume/resolver_test.go` (table-driven, portable
across Linux and Windows). The legacy `ResolvePath` stays in place for legacy
deployer + volume-browser callers and dies with the hard cutover.

### ~~Kind-aware editors on `/apps/new` and `/apps/[id]` edit~~ - DONE

All three Source plugins now have hand-rolled forms on both pages, with
an "Advanced JSON" toggle preserved as the power-user escape hatch.
Submit logic marshals form fields back into the same JSON shape the
backend already expects - no API or store changes required.

**Principle:** the plugin contract makes new Source / Trigger kinds cheap
on the backend, but the UI is not cheap by default - every kind needs a
paired hand-rolled form to be daily-driver usable. The shared JSON
editor is the fallback for power users and brand-new plugins, not the
end state. New Source / Trigger merge requests should treat "ship the
kind-aware form" as part of done, not a follow-up.

**Landed:**

- `compose`: YAML textarea + project_name input on both `/apps/new`
  and `/apps/[id]`.
- `image`: form fields for image / port / healthcheck / default_tag /
  registry_name / cpu_limit / memory_limit / max_instances on both
  pages. Registry name is a select populated from `/api/registries`
  (with text-input fallback when the list is empty). env + volumes
  stay in their detail-page panels and round-trip through the form
  via `imageFormBody` so manual edits aren't clobbered.
- `static`: provider select (gitea / github / gitlab), base URL,
  repo_owner / repo_name (both required), branch (default "main"),
  folder_path, access_token (password input, for private repos),
  mode radio (static / deno), render_markdown checkbox. The
  storage_enabled / storage_limit_mb fields aren't surfaced as
  form controls yet, but they round-trip through `staticFormBody`
  so values set via the raw JSON editor survive form edits.

**Still pending forms:** none - all three Source plugins now have
hand-rolled forms on both `/apps/new` and `/apps/[id]`.

The raw JSON editor stays available behind the "Advanced JSON" toggle
(shipped with compose) so the plugin's full sample is still reachable
for power users and for any new plugin kind without a hand-rolled form.

Effort: per-kind form roughly half a turn each; can land incrementally.
Touches `web/src/routes/apps/new/+page.svelte` and the edit block in
`web/src/routes/apps/[id]/+page.svelte`. The Svelte side keeps
serializing into the same `source_config` JSON shape the backend
already expects - no API or store change required.

### ~~Vendor-specific webhook parsing for `/api/webhook/workloads/{secret}`~~ - DONE

Landed: `internal/webhook/vendor_parsers.go` plus rewrites in
`internal/webhook/handler.go` `buildInboundEvent`. The dispatch order is now:

1. Empty body → manual event.
2. Vendor-specific parsers, short-circuit on a recognized `X-*-Event`
   header - Gitea package, GitHub `package` / `registry_package`, GitHub
   push, Gitea push, GitLab `Push Hook` / `Tag Push Hook`.
3. Generic simple-body fallback: top-level `image` or top-level `ref` -
   what the legacy CI integrations already send.
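The three-step order above can be sketched as a classifier that decides which parser family owns a request. This is an illustration, not `buildInboundEvent`: `classify` and `vendorHeaders` are hypothetical names, the concrete header names are the ones these vendors document for webhook events, and the order in which vendors are probed is an assumption of the sketch.

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
	"net/http"
)

// vendorHeaders lists the event headers probed in step 2, in probe order.
var vendorHeaders = []struct{ Vendor, Header string }{
	{"gitea", "X-Gitea-Event"},
	{"github", "X-GitHub-Event"},
	{"gitlab", "X-Gitlab-Event"},
}

// classify returns the parser family for a request plus the vendor event
// name (step 2) or the matched top-level key (step 3).
func classify(h http.Header, body []byte) (parser, event string, err error) {
	// Step 1: empty body means a manual event.
	if len(bytes.TrimSpace(body)) == 0 {
		return "manual", "", nil
	}
	// Step 2: a recognized vendor header claims the request.
	for _, v := range vendorHeaders {
		if ev := h.Get(v.Header); ev != "" {
			return v.Vendor, ev, nil
		}
	}
	// Step 3: generic fallback on a top-level "image" or "ref" key.
	var probe map[string]json.RawMessage
	if err := json.Unmarshal(body, &probe); err != nil {
		return "", "", err
	}
	for _, key := range []string{"image", "ref"} {
		if _, ok := probe[key]; ok {
			return "generic", key, nil
		}
	}
	return "", "", errors.New("unrecognized webhook payload")
}
```

Once step 2 claims a request, the caller should treat that vendor's parser as final and surface its parse errors instead of retrying with the generic parser.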

Vendor parsers can populate fields the generic parser cannot: image
digest, `GitEvent.Vendor`, registry host. When a vendor parser claims a
request (header matches) it is authoritative: a malformed Gitea
package payload surfaces as an error rather than silently falling
through to the generic parser. Test coverage:
`internal/webhook/vendor_parsers_test.go` covers each vendor branch +
the routed-via-`buildInboundEvent` integration cases.

Open follow-ups deferred to future turns:

- GitLab Container Registry events use a custom envelope outside the
  webhook event surface - handle if a user reports needing it.
- Docker Hub webhook (push event) uses `{"push_data": {"tag": ...}, "repository": {...}}` - add when there's a user request.
|
||||||
|
|
||||||
|
## Priority 3 — Polish

### ~~Chain-panel CSS~~ — DONE

Landed: rules for `.chain-row`, `.chain-card` (with hover/transform on
anchors), `.chain-self` (brand-tinted highlight), `.chain-name`,
`.chain-label` (70px fixed-width mono column), `.chain-children-list`
(flex-wrap), plus a sub-600px stack to keep the panel usable on narrow
screens. Appended at the end of the `<style>` block in
`web/src/routes/apps/[id]/+page.svelte`.

### Docs / codemap entries

Nothing exists under `docs/CODEMAPS/` for `internal/workload/plugin/` yet.
The entry should cover:

- The Source × Trigger contract + registry pattern (`init()` + blank-import in
  `cmd/server/main.go`).
- How a new Source kind is added (write `init()` registration, blank-import,
  add to wizard via `SchemaSample`).
- The dispatcher seam: `deployer.DispatchPlugin` / `DispatchTeardown` /
  `DispatchReconcile` and how the reconciler / webhook ingress / API
  handlers all flow through it.

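For the codemap, the `init()` + blank-import registry pattern could be illustrated with something like the sketch below. It assumes a simplified contract: `Source`, `Register`, `Lookup`, `Kinds`, and `imageSource` are hypothetical names, not the actual API in `internal/workload/plugin/`. Everything is collapsed into one file so it runs standalone; in the real layout the `init()` would live in the plugin's own package.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// Source is a hypothetical, minimal version of the plugin contract.
type Source interface {
	Kind() string
}

var (
	mu      sync.RWMutex
	sources = map[string]Source{}
)

// Register adds a Source under its kind. It panics on duplicates so a
// conflicting registration fails at startup instead of at dispatch time.
func Register(s Source) {
	mu.Lock()
	defer mu.Unlock()
	if _, dup := sources[s.Kind()]; dup {
		panic("duplicate source kind: " + s.Kind())
	}
	sources[s.Kind()] = s
}

// Lookup returns the Source registered for kind, if any.
func Lookup(kind string) (Source, bool) {
	mu.RLock()
	defer mu.RUnlock()
	s, ok := sources[kind]
	return s, ok
}

// Kinds lists the registered kinds in sorted order.
func Kinds() []string {
	mu.RLock()
	defer mu.RUnlock()
	kinds := make([]string, 0, len(sources))
	for k := range sources {
		kinds = append(kinds, k)
	}
	sort.Strings(kinds)
	return kinds
}

// imageSource is a hypothetical plugin.
type imageSource struct{}

func (imageSource) Kind() string { return "image" }

// In a real layout this init() sits in the plugin package, and the server
// binary triggers it with a blank import: import _ "<plugin package path>".
func init() { Register(imageSource{}) }

func main() {
	fmt.Println(Kinds())
}
```
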
`README.md` should mention `/apps` as the new user surface and that
`/projects` / `/sites` / `/stacks` carry `Deprecation: true` headers.

### i18n: page-level strings — PARTIAL

Already i18n'd:

- `nav.apps`, `nav.eventTriggers`, `nav.logScanRules` — top nav labels.
- Log Rules panel on `/apps/[id]` reuses `logscan.panel.*` keys
  (shipped with the Observability work).
- All `/event-triggers/*` and `/log-scan-rules/*` page strings — keys
  live under `triggers.*` and `logscan.*` namespaces in
  `web/src/lib/i18n/{en,ru}.json`.

Still hardcoded English:

- `/apps/+page.svelte` — list page (hero, lede, stats, empty state,
  table headers, status pills).
- `/apps/new/+page.svelte` — wizard labels, form copy, kind-aware
  form rows (compose / image / static all hardcoded English today).
- `/apps/[id]/+page.svelte` — detail page sections (chain, env,
  volumes, webhook, manual deploy, danger zone) — the Log Rules
  panel embedded inside it is the only i18n'd section.

Roughly 80–100 keys across the three `/apps/*` pages once extracted.
Namespace: `apps.*` (with sub-namespaces `apps.list.*`, `apps.new.*`,
`apps.detail.*`, `apps.form.*`).

## Priority 4 — Tests we still don't have

Solid pure-function coverage landed in the prior turn. Still missing:

- **API-handler integration tests** for `/api/workloads/*` (CRUD, deploy,
  env, volumes, webhook, chain, promote-from). Pattern: in-memory store +
  fake deployer + fake docker / proxy / dns providers, exercise via
  `httptest`.
- **Deployer dispatcher**: `DispatchPlugin` / `DispatchTeardown` /
  `DispatchReconcile` with a fake Source registered.
- **Compose source**: `composeProjectName` sanitizer, `writeYAMLIfChanged`
  short-circuit. (Both pure; just need fixtures.)
- **Static source Backend adapter** in `cmd/server/static_backend.go`.

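The API-handler pattern from the first bullet (in-memory store + fake deployer, exercised via `httptest`) might look roughly like this. It is a sketch under assumptions: `memStore`, `fakeDeployer`, `newAPI`, and `exercise` are hypothetical, and the toy handler only stands in for the real `/api/workloads` handlers, including the assumption that create triggers a deploy.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// memStore is a hypothetical in-memory store fake.
type memStore struct{ names []string }

// fakeDeployer records deploy calls instead of touching docker.
type fakeDeployer struct{ deployed []string }

func (d *fakeDeployer) Deploy(name string) { d.deployed = append(d.deployed, name) }

// newAPI wires a toy handler against the fakes (assumed routes and payloads).
func newAPI(st *memStore, d *fakeDeployer) http.Handler {
	mux := http.NewServeMux()
	mux.HandleFunc("/api/workloads", func(w http.ResponseWriter, r *http.Request) {
		switch r.Method {
		case http.MethodPost:
			var in struct {
				Name string `json:"name"`
			}
			if err := json.NewDecoder(r.Body).Decode(&in); err != nil || in.Name == "" {
				http.Error(w, "invalid workload", http.StatusBadRequest)
				return
			}
			st.names = append(st.names, in.Name)
			d.Deploy(in.Name)
			w.WriteHeader(http.StatusCreated)
		case http.MethodGet:
			json.NewEncoder(w).Encode(st.names)
		default:
			w.WriteHeader(http.StatusMethodNotAllowed)
		}
	})
	return mux
}

// exercise sends one request through the handler without opening a socket
// and returns the recorded status code and body.
func exercise(h http.Handler, method, path, body string) (int, string) {
	req := httptest.NewRequest(method, path, strings.NewReader(body))
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code, rec.Body.String()
}

func main() {
	st, d := &memStore{}, &fakeDeployer{}
	api := newAPI(st, d)
	code, _ := exercise(api, http.MethodPost, "/api/workloads", `{"name":"web"}`)
	fmt.Println(code, d.deployed)
}
```

Because the fakes are plain structs, a test can assert on both the HTTP response and the side effects (what was stored, what the deployer was asked to do) in the same case.
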
## Priority 5 — Post-cutover roadmap

### Triggers as first-class reusable entities

Today a trigger's config lives embedded in the workload row
(`workload.trigger_kind` plus `workload.trigger_config` JSON via the plugin
contract). One workload owns exactly one trigger; one trigger serves exactly
one workload. This couples two concepts that users increasingly want
orthogonal:

- One **inbound webhook** fanning out to several workloads (a single CI push
  rebuilds dev + staging together).
- One **registry watcher** driving multiple workloads off the same image
  (different tag filters per binding, shared poll state).
- One **schedule** kicking off a batch of jobs.
- One **git push** filter shared by sibling stack services.

**Direction:** promote triggers to their own table with a join.

- `triggers` — `id`, `kind` (registry / git / webhook / schedule / manual /
  log_scan), `config` JSON, `secret`, `created_at`, audit fields.
- `workload_trigger_bindings` — `workload_id`, `trigger_id`, `binding_config`
  JSON (per-binding overrides: tag filter, path filter, branch filter), plus
  ordering / enabled flag.

The dispatcher seam stays unchanged — `deployer.DispatchPlugin` still receives
a `(Workload, TriggerEvent)` pair; the only change is that the event's source
is resolved through the binding row instead of the workload row.

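As a rough sketch of the proposed shape, the two tables and the fan-out lookup could be modeled as below. The struct fields mirror the columns listed above; `Position` is a hypothetical name for the ordering column, and `boundWorkloads` is an illustrative helper rather than planned store API. It resolves one trigger to the enabled workloads bound to it, in binding order.

```go
package main

import (
	"fmt"
	"sort"
)

// Trigger mirrors the proposed `triggers` row (audit fields omitted).
type Trigger struct {
	ID     int64
	Kind   string // registry / git / webhook / schedule / manual / log_scan
	Config string // raw JSON
	Secret string
}

// Binding mirrors the proposed `workload_trigger_bindings` row.
type Binding struct {
	WorkloadID    int64
	TriggerID     int64
	BindingConfig string // raw JSON with per-binding overrides
	Position      int    // hypothetical name for the ordering column
	Enabled       bool
}

// boundWorkloads returns the workload IDs a trigger fans out to:
// enabled bindings only, sorted by Position.
func boundWorkloads(bindings []Binding, triggerID int64) []int64 {
	var matched []Binding
	for _, b := range bindings {
		if b.TriggerID == triggerID && b.Enabled {
			matched = append(matched, b)
		}
	}
	sort.SliceStable(matched, func(i, j int) bool {
		return matched[i].Position < matched[j].Position
	})
	ids := make([]int64, 0, len(matched))
	for _, b := range matched {
		ids = append(ids, b.WorkloadID)
	}
	return ids
}

func main() {
	hook := Trigger{ID: 1, Kind: "webhook"}
	bindings := []Binding{
		{WorkloadID: 20, TriggerID: 1, Position: 2, Enabled: true},  // staging
		{WorkloadID: 10, TriggerID: 1, Position: 1, Enabled: true},  // dev
		{WorkloadID: 30, TriggerID: 1, Position: 3, Enabled: false}, // disabled
		{WorkloadID: 40, TriggerID: 2, Position: 1, Enabled: true},  // other trigger
	}
	fmt.Println(boundWorkloads(bindings, hook.ID)) // [10 20]
}
```
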
**UX principle: first-class on the backend, inline by default in the UI.**
The workload create/edit form still has an "Add trigger" control that creates
a fresh trigger record in one step, so the 1:1 case (git push → this workload)
feels unchanged from today. Reuse is **opt-in** via a "Pick existing trigger"
picker on the same control. Triggers also get their own list/detail pages under
`/triggers` so the fan-out cases are discoverable and centrally manageable
(rotate secret once, audit once).

**Per-kind modal applies, same rule as Source plugins** — the create/edit
form for a trigger switches body by `kind` (git: repo / branch / path;
registry: image / tag regex; webhook: secret + payload preview; schedule:
cron). The backend side is cheap; the UI needs a paired hand-rolled form per
kind. Treat "ship the kind-aware form" as part of done for any new trigger kind.

**Migration:** clean break (no migration) per the workload-first memory —
at cutover, each workload's embedded trigger config becomes a single
auto-created trigger record with a single binding row. No user-visible change
on day one; reuse becomes possible thereafter.

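The cutover step described in the Migration note, one trigger record plus one binding row per workload, could be sketched as follows. All types and `splitEmbedded` are hypothetical, and the sequential ID assignment is only for illustration; the real store would allocate IDs.

```go
package main

import "fmt"

// Workload carries the embedded trigger fields as they exist today.
type Workload struct {
	ID            int64
	TriggerKind   string
	TriggerConfig string // raw JSON
}

// Trigger and Binding are trimmed-down versions of the proposed rows.
type Trigger struct {
	ID     int64
	Kind   string
	Config string
}

type Binding struct {
	WorkloadID int64
	TriggerID  int64
	Enabled    bool
}

// splitEmbedded turns each workload's embedded trigger into exactly one
// trigger row and one binding row, preserving today's 1:1 behavior.
func splitEmbedded(workloads []Workload) ([]Trigger, []Binding) {
	triggers := make([]Trigger, 0, len(workloads))
	bindings := make([]Binding, 0, len(workloads))
	for i, w := range workloads {
		id := int64(i + 1) // illustrative ID allocation
		triggers = append(triggers, Trigger{ID: id, Kind: w.TriggerKind, Config: w.TriggerConfig})
		bindings = append(bindings, Binding{WorkloadID: w.ID, TriggerID: id, Enabled: true})
	}
	return triggers, bindings
}

func main() {
	ws := []Workload{
		{ID: 10, TriggerKind: "git", TriggerConfig: `{"branch":"main"}`},
		{ID: 20, TriggerKind: "registry", TriggerConfig: `{}`},
	}
	triggers, bindings := splitEmbedded(ws)
	fmt.Println(len(triggers), len(bindings))
}
```
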
**Sequencing:** lands **after** the Priority 1 hard cutover. The embedded
trigger config works fine for the 1:1 case that dominates today; the
static-source inline port is the higher-value blocker. Treat this as the
next major arc once cutover ships.

**Touch points to expect:**

- `internal/workload/plugin/trigger/*` — kind handlers stay; only their input
  shape changes (read from binding + trigger row, not workload row).
- `internal/store/` — new `triggers` + `workload_trigger_bindings` tables and
  CRUD; remove `trigger_kind` / `trigger_config` from the workload row.
- `internal/api/workloads.go` — adapt the workload create/edit handlers to
  accept either "inline new trigger" or "bind existing trigger" payloads.
- New `/api/triggers` surface + `/triggers` frontend pages.
- `internal/webhook/handler.go` — inbound webhook now resolves to a trigger,
  fans out to all bound workloads.
- `internal/reconciler/reconciler.go` — registry watchers iterate triggers,
  not workloads; each trigger may fire N bindings.

## Open architectural questions

### Stages chain vs explicit Stage entity

`parent_workload_id` is now the canonical mechanism for stage chains
(dev → staging → prod). Decision deferred: do we need a separate `Stage`
entity at all, or is the chain sufficient? For now the chain appears to
cover the use case — `promote-from` works, the UI shows the relationship.
The legacy `stages` table can probably be dropped entirely once cutover
proceeds.

### `Container.extra_json` evolution

Currently only the image source uses it (per-face proxy route IDs). If
other sources gain similar needs (compose service health metadata, static
build SHAs), the schema there should stay versionless and additive — every
reader must tolerate unknown keys. Document this in the source plugin
guide alongside the codemap entry.

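A minimal sketch of what "versionless and additive" means for readers and writers of `extra_json`, assuming hypothetical key names (`route_ids`, `build_sha`) and helper names (`readImageExtra`, `setKey`). The reader decodes only the keys it knows, and the writer round-trips everything else untouched.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// imageExtra is a hypothetical typed view; the route_ids key name is assumed.
type imageExtra struct {
	RouteIDs map[string]string `json:"route_ids"`
}

// readImageExtra decodes only the keys this reader knows about.
// encoding/json ignores unknown keys by default, which is what keeps
// additive changes safe for older readers.
func readImageExtra(raw string) (imageExtra, error) {
	var e imageExtra
	if raw == "" {
		return e, nil
	}
	err := json.Unmarshal([]byte(raw), &e)
	return e, err
}

// setKey rewrites one top-level key and carries every other key through
// unchanged, including keys this writer does not understand.
func setKey(raw, key string, value interface{}) (string, error) {
	all := map[string]json.RawMessage{}
	if raw != "" {
		if err := json.Unmarshal([]byte(raw), &all); err != nil {
			return "", err
		}
	}
	if all == nil { // a JSON null resets the map to nil
		all = map[string]json.RawMessage{}
	}
	encoded, err := json.Marshal(value)
	if err != nil {
		return "", err
	}
	all[key] = encoded
	out, err := json.Marshal(all)
	return string(out), err
}

func main() {
	raw := `{"build_sha":"abc123","route_ids":{"web":"r1"}}`
	e, _ := readImageExtra(raw)
	fmt.Println(e.RouteIDs["web"])
	updated, _ := setKey(raw, "route_ids", map[string]string{"web": "r2"})
	fmt.Println(updated)
}
```

Decoding into a fixed struct and re-marshaling it would silently drop keys written by other sources, which is why the writer goes through `map[string]json.RawMessage` instead.
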
## File pointers for the next session

- Plugin contracts: `internal/workload/plugin/{plugin,source,trigger,types,registry}.go`
- Source implementations: `internal/workload/plugin/source/{image,compose,static}/`
- Trigger implementations: `internal/workload/plugin/trigger/{registry,git,manual}/`
- Dispatcher: `internal/deployer/dispatch.go`
- Webhook ingress (plugin path): `internal/webhook/handler.go` `handlePluginWorkloadWebhook`
- Reconciler hook: `internal/reconciler/reconciler.go` `reconcilePluginWorkloads`
- Static backend adapter (to be deleted post-port): `cmd/server/static_backend.go`
- Frontend pages: `web/src/routes/apps/+page.svelte`, `web/src/routes/apps/new/+page.svelte`, `web/src/routes/apps/[id]/+page.svelte`
- Tests: `internal/workload/plugin/trigger/*/*_test.go`, `internal/workload/plugin/source/image/image_helpers_test.go`, `internal/webhook/inbound_event_test.go`, `internal/store/workload_env_test.go`

## Memory pointer

Memory at
`C:/Users/Alexei/.claude/projects/c--Users-Alexei-Documents-docker-watcher/memory/`
already covers the Workload-first decision and the no-migration constraint.
Refresh as the cutover lands.