feat(apps): per-workload deploy history, rollback, and resource metrics

Two additions to the app detail page, each backed by a per-workload
endpoint.

Deploy history + rollback:
- New deploy_history table — a structured, version-pinned ledger of every
  dispatch (success AND failure), distinct from the free-text event_log.
  Recorded at the single DispatchPlugin choke point so every source kind
  is covered. The raw deploy error is never persisted (it can carry
  registry-auth / compose-stdout secrets) — only a generic marker, with
  detail going to slog. Pruned to the newest N per workload; cascade-
  deleted with the workload.
- GET /api/workloads/{id}/deploys lists the ledger; POST .../rollback
  (admin) replays a prior successful deploy's pinned reference as a
  rollback-reason dispatch. Phase 1 is image-source only (RollbackCapable);
  git-built sources need checkout-by-commit, a later phase.
- DeployHistoryPanel.svelte renders the ledger with confirm-gated rollback.

Per-workload metrics:
- ListContainerStatsSamplesByWorkload joins the existing container stats
  samples through the containers index; GET /api/workloads/{id}/stats/history
  aggregates CPU/memory per timestamp across the workload's containers.
- WorkloadMetricsPanel.svelte reuses ResourceChart (CPU% + memory MiB,
  windowed, 15s poll).

en/ru i18n added with parity. Tests: store CRUD + cascade + workload-scoped
join, deployer recording (incl. secret-non-leak on failure), API rollback
guards, and per-timestamp aggregation. Plans under docs/plans/.
2026-06-19 16:22:12 +03:00
parent c8e71a0c34
commit 0c4c338bfe
23 changed files with 1828 additions and 0 deletions
@@ -0,0 +1,223 @@
# Deploy History + One-Click Rollback — Implementation Plan
**Status:** planned (review incorporated) · **Feature rank:** #1 · **Date:** 2026-06-19
## Review findings incorporated (adversarial pass)
- **BLOCKER — never persist the raw deploy error** (it can carry registry-auth bytes /
compose stdout — see `compose.go` SECURITY comment + `workloads_plugin.go:198`).
`deploy_history.error` only ever gets a **fixed generic marker**
(`"deploy failed (see server logs)"`) on failure; the raw error goes to `slog` only.
`capDeployStatus(err.Error())` is rejected.
- **BLOCKER — don't double-count metrics.** `DispatchPlugin` already calls
`metrics.DeploysTotal.Inc(...)`; recording slots into the **existing** outcome block,
not a re-added metrics line.
- **FIX — no runtime-state store getter exists.** static/dockerfile `LastCommitSHA`
lives in `containers.extra_json` on a deterministic-ID row
(`GetContainerByID(w.ID+":site")` / `+":dockerfile"`, decode `ExtraJSON`). Moot for
Phase-1 rollback (image-only) but the resolver must use this, not a fictional getter.
- **FIX — cascade is distrusted here.** `DeleteWorkload` explicitly deletes containers
rather than relying on the FK. Match that: add `DELETE FROM deploy_history WHERE
workload_id = ?` inside the `DeleteWorkload` transaction, and make the cascade test a
hard gate.
- **FIX — keep recording off the hot path's tail.** `DispatchPlugin` runs synchronously
on the request goroutine, so the cheap INSERT stays inline while `PruneDeployHistory`
runs in its own goroutine. Attempts rejected while draining (`beginDispatch` fails)
record nothing, which is correct: a deploy that never ran must not appear as a
rollback target.
- **FIX — pagination:** use `parseLimit(raw, 50, 200)` (not the unclamped
`listWorkloadEvents` style); parse `offset` separately, clamp negatives to 0.
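The pagination rule in the last finding can be sketched as two small pure functions. This is a minimal illustration, not the repo's code: `clampLimit` is a hypothetical stand-in for the existing `parseLimit` helper (whose real signature is only known here as `parseLimit(raw, 50, 200)`), and `clampOffset` is an assumed name for the separate offset parse.

```go
package main

import "strconv"

// clampLimit is a hypothetical stand-in for parseLimit: it falls back to def
// when raw is empty, not a number, or below 1, and caps the result at ceiling.
func clampLimit(raw string, def, ceiling int) int {
	n, err := strconv.Atoi(raw)
	if err != nil || n < 1 {
		return def
	}
	if n > ceiling {
		return ceiling
	}
	return n
}

// clampOffset parses the offset on its own; empty, invalid, or negative
// values all become 0.
func clampOffset(raw string) int {
	n, err := strconv.Atoi(raw)
	if err != nil || n < 0 {
		return 0
	}
	return n
}
```

With these, `?limit=9999&offset=-5` would resolve to a limit of 200 and an offset of 0, so a client cannot force an unbounded scan.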
## Problem
Tinyforge has *failure* rollback (a failed deploy unwinds its own new container —
[image.go:258](../../internal/workload/plugin/source/image/image.go)), but **no way to
revert a *successful* deploy to a prior version.** Blue-green's `enforceMaxInstances`
deletes the old container rows after cutover, so once `v3` replaces `v2` there is no
record of `v2` and nothing to roll back to. The only "history" is free-text
`event_log` rows (`"deployed"`) — not structured, not version-pinned, not replayable.
This is the single most-requested capability for any deploy tool, and the plumbing is
90% there: every deploy flows through one choke point, and the manual-deploy endpoint
already accepts a `reference` override.
## Key architectural facts (verified against current code)
- **Single dispatch choke point:** `Deployer.DispatchPlugin(ctx, w, intent)` in
[internal/deployer/dispatch.go](../../internal/deployer/dispatch.go) routes *every*
source kind and already computes a success/failure `outcome`. This is where history
is recorded.
- **`intent.Reference` is the version handle:** image source resolves
`tag := intent.Reference` (falling back to `DefaultTag`/`latest`). The manual deploy
endpoint ([workloads_plugin.go](../../internal/api/workloads_plugin.go)) already accepts
`{reference, note}` and builds a `manual` intent. **Rollback = deploy with a pinned
reference + a distinct reason.**
- **Effective vs requested reference:** for a *manual* image deploy `intent.Reference`
is often `""` (means `DefaultTag`). The *effective* deployed tag is written onto the
freshest container row (`store.Container.ImageTag`). For static/dockerfile the
effective version is `runtime_state.LastCommitSHA`, resolved inside the source.
- **Built-from-source sources don't honor a SHA reference on Deploy** — static and
dockerfile clone `cfg.Branch` HEAD and capture `latestSHA`; they cannot yet check out
an arbitrary commit. So **SHA-pinned rollback for them needs a source change (later
phase).** Image-tag rollback works today.
- **Migration pattern:** additive statements in `runMigrations()` /
`workloadTables` in [store.go](../../internal/store/store.go); workload-scoped tables
use `REFERENCES workloads(id) ON DELETE CASCADE`. Per-table CRUD lives in its own
`internal/store/<table>.go`, model in `models.go`.
- **Idempotency note:** the image source's same-tag short-circuit returns *before* it
arms its `EmitDeployEvent` defer, so a no-op deploy emits no timeline event. History
recorded at `DispatchPlugin` will still log it as a `success` attempt — acceptable
(history = ledger of attempts), but called out so the divergence is intentional.
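The effective-versus-requested distinction above could be captured in a pure function along these lines. It is a sketch under assumptions: the function name and parameters are hypothetical, and the caller is assumed to have already looked up the newest container's image tag and the persisted commit SHA.

```go
package main

// resolveEffectiveReference returns the version that should be written to the
// ledger. All names here are illustrative, not taken from the codebase.
//
//   - image: prefer the tag on the newest container row, because the requested
//     reference is often "" (meaning DefaultTag/latest).
//   - static/dockerfile: fall back to the last built commit SHA when nothing
//     was requested.
//   - compose/unknown: keep the requested value, which may be "".
func resolveEffectiveReference(sourceKind, requested, newestImageTag, lastCommitSHA string) string {
	ref := requested
	switch sourceKind {
	case "image":
		if newestImageTag != "" {
			ref = newestImageTag
		}
	case "static", "dockerfile":
		if ref == "" {
			ref = lastCommitSHA
		}
	}
	return ref
}
```

Keeping the lookups outside the function makes the rules unit-testable without a store.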
## Scope
### Phase 1 (this plan)
1. Persistent, structured **deploy-history ledger** for **all** source kinds (success
*and* failure) — powers an audit timeline and the rollback action.
2. **One-click rollback** for the **image** source (redeploy a pinned tag).
3. Read-only history panel on `/apps/[id]`; rollback button shown only for entries that
are `success` + have a non-empty reference + a rollback-capable source kind.
### Explicitly out of scope (future phases, table already supports them)
- SHA-pinned rebuild rollback for static/dockerfile (needs source checkout-by-commit).
- Config-snapshot rollback for compose (no artifact reference).
- Promotion (dev→staging→prod) — separate feature, will reuse this ledger.
## Data model
New table `deploy_history` (added to `workloadTables` in `runMigrations`):
```sql
CREATE TABLE IF NOT EXISTS deploy_history (
id INTEGER PRIMARY KEY AUTOINCREMENT,
workload_id TEXT NOT NULL REFERENCES workloads(id) ON DELETE CASCADE,
source_kind TEXT NOT NULL DEFAULT '',
reference TEXT NOT NULL DEFAULT '', -- effective artifact: image tag | commit sha | ''
reason TEXT NOT NULL DEFAULT '', -- manual|registry-push|git-push|cron|rollback|promote
triggered_by TEXT NOT NULL DEFAULT '',
note TEXT NOT NULL DEFAULT '',
outcome TEXT NOT NULL DEFAULT '', -- success | failure
error TEXT NOT NULL DEFAULT '', -- fixed generic marker on failure; never the raw error
started_at TEXT NOT NULL DEFAULT '',
finished_at TEXT NOT NULL DEFAULT ''
);
CREATE INDEX IF NOT EXISTS idx_deploy_history_workload
ON deploy_history(workload_id, id DESC);
```
**Why a dedicated table (not `event_log`):** structured + queryable, version-pinned,
carries the replayable `reference`, and its retention is independent of the human event
feed. `event_log` stays the free-text timeline; `deploy_history` is the version ledger.
Go model in `models.go` (`DeployHistoryEntry`, mirrors `MetricAlertRule` style).
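As a rough picture of the Go model, the struct below mirrors the table column for column. Only the type name `DeployHistoryEntry` comes from the plan; the field names, the JSON tags, and the `roundTrip` helper are assumptions made for illustration.

```go
package main

import "encoding/json"

// DeployHistoryEntry mirrors the deploy_history columns one-to-one.
// Field names and JSON tags are assumed, not copied from models.go.
type DeployHistoryEntry struct {
	ID          int64  `json:"id"`
	WorkloadID  string `json:"workload_id"`
	SourceKind  string `json:"source_kind"`
	Reference   string `json:"reference"`
	Reason      string `json:"reason"`
	TriggeredBy string `json:"triggered_by"`
	Note        string `json:"note"`
	Outcome     string `json:"outcome"`
	Error       string `json:"error"`
	StartedAt   string `json:"started_at"`
	FinishedAt  string `json:"finished_at"`
}

// roundTrip encodes an entry to JSON and decodes it again, which is a cheap
// way to check that every column survives the API boundary unchanged.
func roundTrip(e DeployHistoryEntry) (DeployHistoryEntry, error) {
	raw, err := json.Marshal(e)
	if err != nil {
		return DeployHistoryEntry{}, err
	}
	var out DeployHistoryEntry
	err = json.Unmarshal(raw, &out)
	return out, err
}
```

Because every column is `NOT NULL DEFAULT ''`, plain `string` fields are enough and no nullable wrapper types are needed.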
## Backend changes
### 1. Store — `internal/store/deploy_history.go` (new) + `models.go` + `store.go`
- `DeployHistoryEntry` struct.
- `InsertDeployHistory(e DeployHistoryEntry) (DeployHistoryEntry, error)`.
- `ListDeployHistory(workloadID string, limit, offset int) ([]DeployHistoryEntry, error)`
— ordered `id DESC`; default/clamped limit (e.g. 50, max 200) via existing `parseLimit`
conventions at the API layer.
- `GetDeployHistory(id int64) (DeployHistoryEntry, error)` — for rollback lookup;
`ErrNotFound` on miss.
- `PruneDeployHistory(workloadID string, keep int) error` — keep newest `keep` per
workload (mirror the stats-prune pattern). Called best-effort after insert.
- Migration: append `CREATE TABLE` + index to `workloadTables`.
- Table test `deploy_history_test.go` (insert/list/get/prune, cascade-on-workload-delete).
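The keep-newest-N rule behind `PruneDeployHistory` might look like the following. Both pieces are assumptions: `pruneSQL` is one possible SQLite-style statement (the real query may differ), and `idsToPrune` is a hypothetical in-memory model of the same rule that is easy to test.

```go
package main

import "sort"

// pruneSQL is one possible statement for the prune (assumed, SQLite dialect).
// It takes three bind parameters: workload id, workload id, keep.
const pruneSQL = `DELETE FROM deploy_history
WHERE workload_id = ?
  AND id NOT IN (
    SELECT id FROM deploy_history
    WHERE workload_id = ?
    ORDER BY id DESC
    LIMIT ?
  )`

// idsToPrune models the same rule in memory: given one workload's row ids,
// it returns the ids that fall outside the newest `keep`, newest first.
func idsToPrune(ids []int64, keep int) []int64 {
	if keep < 0 {
		keep = 0
	}
	sorted := append([]int64(nil), ids...) // copy so the caller's slice is untouched
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] > sorted[j] })
	if len(sorted) <= keep {
		return nil
	}
	return sorted[keep:]
}
```

Ordering by `id DESC` rather than by timestamp matches the list query and the `idx_deploy_history_workload` index.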
### 2. Deployer — record at the choke point (`internal/deployer/dispatch.go`)
Wrap the existing `src.Deploy(...)` call:
```go
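started := store.Now()
err = src.Deploy(ctx, d.PluginDeps(), w, intent)
// Derive the outcome once; the metric and the history row both use it.
outcome := "success"
if err != nil {
	outcome = "failure"
}
metrics.DeploysTotal.Inc(w.SourceKind, outcome) // existing line, not re-added

// Best-effort: a store failure is logged and never blocks or fails the deploy.
d.recordDeployHistory(w, intent, outcome, err, started)
return err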
```
- `recordDeployHistory` resolves the **effective reference** and inserts a row.
Best-effort: a store failure is logged, never propagated (same contract as
`maybeBackupBeforeDeploy` and `EmitDeployEvent`).
- **Effective-reference resolver** (`internal/deployer/deploy_ref.go`, unit-tested):
1. start from `intent.Reference`;
2. `image`: read newest `ListContainersByWorkload(w.ID)` row (by `CreatedAt`), prefer
its `ImageTag` when non-empty — captures the `DefaultTag`/`latest` resolution;
3. `static`/`dockerfile`: when still empty, read `LastCommitSHA` from the
deterministic-ID container row (`GetContainerByID(w.ID+":site")` / `+":dockerfile"`,
decode `ExtraJSON`);
4. `compose`/unknown: leave as-is (may be `""`).
- **Error sanitization:** on failure, `error` holds only the fixed generic marker
`"deploy failed (see server logs)"`; a capped copy such as
`capDeployStatus(err.Error())` is not acceptable. The raw error keeps going to `slog`
only. (The deploy error already carries a generic client message; the wrapped detail
must not be persisted because it can echo registry-auth / compose-stdout bytes, the
same caller contract documented on `EmitDeployEvent`.)
- Recording does **not** run for `DispatchReconcile` (periodic, not a deploy) or
`DispatchTeardown`.
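The no-raw-error rule for the recorded row can be sketched as below. The marker string is the one named in the review findings; everything else (`historyRow`, `outcomeFor`, `errSample`) is hypothetical and exists only to make the example self-contained.

```go
package main

import (
	"errors"
	"log/slog"
)

// genericDeployError is the only text ever stored in deploy_history.error.
const genericDeployError = "deploy failed (see server logs)"

// historyRow holds just the two fields this sketch cares about (assumed shape).
type historyRow struct {
	Outcome string
	Error   string
}

// errSample stands in for a raw deploy error that embeds a credential.
var errSample = errors.New("pull denied: authorization: Basic dXNlcjpwYXNz")

// outcomeFor maps a deploy error to the persisted outcome. The raw error is
// sent to slog only; the row gets the fixed marker, never err.Error().
func outcomeFor(deployErr error) historyRow {
	if deployErr == nil {
		return historyRow{Outcome: "success"}
	}
	slog.Error("deploy failed", "err", deployErr)
	return historyRow{Outcome: "failure", Error: genericDeployError}
}
```

A test for the real recorder could assert the same property: the stored `error` equals the marker and does not contain any byte of the original message.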
### 3. API — `internal/api/deploy_history.go` (new) + `router.go`
- `GET /api/workloads/{id}/deploys?limit=&offset=` → `listWorkloadDeploys` (read; any
authenticated user — mirrors `listWorkloadEvents`). Uses `parseLimit`.
- `POST /api/workloads/{id}/rollback` → `rollbackWorkload` (`auth.AdminOnly`), body
`{deploy_id}`:
1. load workload (404 if missing; 400 if `source_kind == ""`);
2. `GetDeployHistory(deploy_id)`; 404 if missing, 400 if its `workload_id` ≠ path id
(no cross-workload replay);
3. guard: `outcome == "success"`, `reference != ""`, and `source_kind` is
rollback-capable (`image` in Phase 1) → else 400 with a clear message;
4. build `manual`-shaped intent `{Reason: "rollback", Reference: row.reference,
Metadata: {"note": "rollback to " + row.reference, "rollback_of": <id>},
TriggeredBy: actor}`;
5. `deployer.DispatchPlugin(...)`; 202 on accept (same shape as deploy).
- Register both routes inside the existing `r.Route("/workloads/{id}", …)` block in
[router.go](../../internal/api/router.go), next to `/deploy` and `/events`.
- A `RollbackCapable(sourceKind) bool` helper (single source of truth, shared with the
list response so the frontend can render the button state without hardcoding kinds).
- The list response includes a per-entry `rollbackable bool` computed server-side.
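The rollback guards in steps 2 and 3 reduce to a few comparisons, sketched here as a pure function. `RollbackCapable` is named in the plan; `deployRow`, `checkRollback`, and the error values are hypothetical, and the handler is assumed to map every non-nil result to a 400.

```go
package main

import "errors"

// RollbackCapable reports whether a source kind supports rollback.
// Phase 1 covers the image source only.
func RollbackCapable(sourceKind string) bool {
	return sourceKind == "image"
}

// deployRow carries the ledger fields the guards inspect (assumed shape).
type deployRow struct {
	WorkloadID string
	SourceKind string
	Reference  string
	Outcome    string
}

// Illustrative error values; the real messages may differ.
var (
	errCrossWorkload   = errors.New("deploy does not belong to this workload")
	errNotSuccessful   = errors.New("only successful deploys can be rolled back to")
	errNoReference     = errors.New("deploy has no pinned reference")
	errUnsupportedKind = errors.New("source kind does not support rollback yet")
)

// checkRollback returns nil when row is a valid rollback target for the
// workload id taken from the request path.
func checkRollback(row deployRow, pathWorkloadID string) error {
	switch {
	case row.WorkloadID != pathWorkloadID:
		return errCrossWorkload
	case row.Outcome != "success":
		return errNotSuccessful
	case row.Reference == "":
		return errNoReference
	case !RollbackCapable(row.SourceKind):
		return errUnsupportedKind
	}
	return nil
}
```

The per-entry `rollbackable` flag in the list response could reuse the last three checks so the button state and the POST guard cannot drift apart.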
## Frontend changes (`web/`)
- **`DeployHistoryPanel.svelte`** (new, in `lib/components/`): table of entries —
short reference, reason badge, `outcome` `StatusBadge` (ok/bad), `triggered_by`,
relative time. For `rollbackable` rows a **Roll back** button → `ConfirmDialog`
("Roll back <name> to <reference>?") → `POST …/rollback {deploy_id}` → `Toast` +
refresh history and container state. Loading via `Skeleton`; `EmptyState` when no
rows. Reuses existing components only.
- Mount the panel on **`/apps/[id]`** alongside the activity timeline (it is the
*structured, actionable* sibling of the free-text timeline).
- **i18n:** add keys under a `deployHistory.*` namespace to **both**
`web/src/lib/i18n/en.json` and `ru.json` (parity is mandatory but a mismatch is not a
build error, so verify it manually per CLAUDE.md).
- API client: add `listDeploys(id, params)` and `rollback(id, deployId)` to the existing
workload API module.
## Testing
- **Store:** `deploy_history_test.go` — insert/list ordering, get, prune-keeps-newest,
cascade delete with workload.
- **Deployer:** extend `deployer` tests — `DispatchPlugin` writes one `success` row and
one `failure` row (with sanitized error); reconcile/teardown write none. Resolver unit
test (`deploy_ref_test.go`) for the image read-back + empty fallbacks.
- **API:** rollback guards — cross-workload id → 400; non-success/empty-ref/
non-image → 400; happy path → 202 and a `rollback`-reason history row appears.
- **Web:** keep it light (the panel is mostly presentational); a `sourceForms`-style
pure-logic unit only if a non-trivial helper emerges.
- Gates: `go build ./...`, `go vet ./internal/...`, `go test ./internal/...`,
`cd web && npm run check && npm run test`, then `./scripts/dev-server.sh`.
## Risks / mitigations
- **Recording must never break a deploy** → best-effort insert, errors only logged
(matches existing `EmitDeployEvent` / pre-deploy-backup contracts).
- **Secret leakage via `error`** → store only the fixed generic marker; raw error to
`slog` only.
- **Unbounded growth** → `PruneDeployHistory` keeps newest N per workload.
- **Rollback to a vanished image tag** → the image source's `PullImage` fails and its
own failure-rollback leaves the live container untouched; the rollback attempt is
recorded as `failure`. No special handling needed.
- **No-op rollback (target already running, `MaxInstances>1`)** → image short-circuit
returns `nil`; recorded as `success`. Acceptable.
## Rollout
Single PR. Additive migration (no destructive DDL). No settings changes. Backward
compatible: existing workloads simply start accumulating history on their next deploy.
@@ -0,0 +1,84 @@
# Per-Workload Metrics Graph — Implementation Plan
**Status:** planned · **Feature rank:** #2 · **Date:** 2026-06-19
## Problem
Stats are collected per container (`container_stats_samples`, CPU/mem/net/disk) and
charted **globally** on the dashboard (`SystemResourcesCard` + `ResourceChart`), but
`/apps/[id]` shows only live snapshots — there's no per-workload "is my app leaking
memory / pegging CPU over the last few hours" view. This is a daily question and the
data already exists; we just need a per-workload query + a panel that reuses the chart.
## Verified facts
- `ContainerStatsSample.OwnerID` == the **container row id** (`containers.id`), confirmed
by `lookupInstanceName` → `GetContainerByID(sm.OwnerID)` in
[stats_history.go](../../internal/api/stats_history.go). `OwnerType` ∈ {instance, site}.
- Each sample's `ts` is that container's own Docker-stats `Timestamp.Unix()`
([collector.go](../../internal/stats/collector.go)) — NOT one shared tick stamp. In a
multi-container tick the per-second truncation usually collapses them to the same
integer `ts`, so per-`ts` aggregation works; a ±1s split at a second boundary is
cosmetic for a trend line. (Reviewer-corrected.) The handler 404s on an unknown
workload id but returns `[]` for a known workload with no samples yet.
- `ResourceChart.svelte` takes a fully-built `EChartsOption` from the parent; the parent
owns series/axes (see `SystemResourcesCard`). Reads stay available when Docker is down
(samples come from SQLite, not the daemon).
- Per-workload reads (`/events`, `/runtime-state`) are open to any authenticated user;
this endpoint follows suit (no `AdminOnly`).
## Backend
1. **Store** `ListContainerStatsSamplesByWorkload(workloadID string, sinceTS int64)`:
```sql
SELECT cs.container_id, cs.owner_type, cs.owner_id, cs.ts,
cs.cpu_percent, cs.memory_usage, cs.memory_limit,
cs.network_rx, cs.network_tx, cs.block_read, cs.block_write
FROM container_stats_samples cs
JOIN containers c ON c.id = cs.owner_id
WHERE c.workload_id = ? AND cs.ts >= ?
ORDER BY cs.ts ASC
```
Returns `[]ContainerStatsSample`.
2. **API** — `getWorkloadStatsHistory` (GET `/api/workloads/{id}/stats/history?window=`):
reuse `parseWindow`/`sinceTimestamp`; aggregate samples **per ts** into a compact
series so multi-container workloads (compose) sum correctly:
```go
type workloadStatsPoint struct {
TS int64 `json:"ts"`
CPUPercent float64 `json:"cpu_percent"` // sum across the workload's containers
MemoryUsage int64 `json:"memory_usage"` // sum bytes
MemoryLimit int64 `json:"memory_limit"` // max (effective ceiling)
}
```
For a known workload the handler returns `[]` rather than 503 when stats are disabled,
Docker was down, or the workload is new (an unknown id still 404s). Register in the
`/workloads/{id}` route block.
3. **Tests** — store: join scopes to the right workload (A's samples ≠ B's); API:
per-ts aggregation sums two containers at the same tick.
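The per-`ts` aggregation described in step 2 could be implemented roughly as follows. `workloadStatsPoint` is the type from the plan; `statsSample` is a trimmed, hypothetical stand-in for `ContainerStatsSample`, and `aggregateByTS` is an assumed name.

```go
package main

import "sort"

// statsSample is a reduced stand-in for ContainerStatsSample (assumed fields).
type statsSample struct {
	TS          int64
	CPUPercent  float64
	MemoryUsage int64
	MemoryLimit int64
}

type workloadStatsPoint struct {
	TS          int64   `json:"ts"`
	CPUPercent  float64 `json:"cpu_percent"`
	MemoryUsage int64   `json:"memory_usage"`
	MemoryLimit int64   `json:"memory_limit"`
}

// aggregateByTS folds samples that share a timestamp into one point:
// CPU and memory usage are summed, memory limit takes the maximum.
// The result is sorted by ts and is never nil, so it encodes as [] when empty.
func aggregateByTS(samples []statsSample) []workloadStatsPoint {
	byTS := make(map[int64]*workloadStatsPoint)
	for _, s := range samples {
		p, ok := byTS[s.TS]
		if !ok {
			p = &workloadStatsPoint{TS: s.TS}
			byTS[s.TS] = p
		}
		p.CPUPercent += s.CPUPercent
		p.MemoryUsage += s.MemoryUsage
		if s.MemoryLimit > p.MemoryLimit {
			p.MemoryLimit = s.MemoryLimit
		}
	}
	out := make([]workloadStatsPoint, 0, len(byTS))
	for _, p := range byTS {
		out = append(out, *p)
	}
	sort.Slice(out, func(i, j int) bool { return out[i].TS < out[j].TS })
	return out
}
```

Samples that land on either side of a second boundary stay as two adjacent points, which matches the plan's note that a ±1s split is cosmetic for a trend line.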
## Frontend
4. **api.ts** — `WorkloadStatsPoint` type + `fetchWorkloadStatsHistory(id, window, signal)`.
5. **`WorkloadMetricsPanel.svelte`** — window selector (30m / 2h / 6h), fetch + 15s poll
(mirror `SystemResourcesCard`), build an `EChartsOption` with **two series**: CPU %
on the left axis, Memory (MiB) on the right axis (absolute bytes, because
`memory_limit` is often 0/unlimited so a % would divide by zero). `EmptyState` / hint
when there are no samples. Render via `ResourceChart`. Mount on `/apps/[id]` near the
deploy-history panel.
6. **i18n** — `apps.detail.metrics.*` in both en.json and ru.json (parity mandatory).
## Risks / mitigations
- **Docker down / stats disabled** → empty series, friendly hint (no error). SQLite read
path is independent of the daemon.
- **memory_limit = 0 (unlimited)** → plot absolute MiB, not %, to avoid div-by-zero.
- **Sparse sampling** → chart shows whatever ticks exist; window selector lets the user
widen. No interpolation.
- **Auth** → read-only, any authenticated user (consistent with other per-workload reads).
## Rollout
Single change set, additive, no migration. Reuses the existing `echarts` dependency and
`ResourceChart` component.