tiny-forge/docs/plans/DEPLOY_HISTORY_ROLLBACK_PLAN.md
alexei.dolgolyov 0c4c338bfe feat(apps): per-workload deploy history, rollback, and resource metrics
Two additions to the app detail page, each backed by a per-workload
endpoint.

Deploy history + rollback:
- New deploy_history table — a structured, version-pinned ledger of every
  dispatch (success AND failure), distinct from the free-text event_log.
  Recorded at the single DispatchPlugin choke point so every source kind
  is covered. The raw deploy error is never persisted (it can carry
  registry-auth / compose-stdout secrets) — only a generic marker, with
  detail going to slog. Pruned to the newest N per workload; cascade-
  deleted with the workload.
- GET /api/workloads/{id}/deploys lists the ledger; POST .../rollback
  (admin) replays a prior successful deploy's pinned reference as a
  rollback-reason dispatch. Phase 1 is image-source only (RollbackCapable);
  git-built sources need checkout-by-commit, a later phase.
- DeployHistoryPanel.svelte renders the ledger with confirm-gated rollback.

Per-workload metrics:
- ListContainerStatsSamplesByWorkload joins the existing container stats
  samples through the containers index; GET /api/workloads/{id}/stats/history
  aggregates CPU/memory per timestamp across the workload's containers.
- WorkloadMetricsPanel.svelte reuses ResourceChart (CPU% + memory MiB,
  windowed, 15s poll).

en/ru i18n added with parity. Tests: store CRUD + cascade + workload-scoped
join, deployer recording (incl. secret-non-leak on failure), API rollback
guards, and per-timestamp aggregation. Plans under docs/plans/.
2026-06-19 16:22:12 +03:00


# Deploy History + One-Click Rollback — Implementation Plan
**Status:** planned (review incorporated) · **Feature rank:** #1 · **Date:** 2026-06-19
## Review findings incorporated (adversarial pass)
- **BLOCKER — never persist the raw deploy error** (it can carry registry-auth bytes /
compose stdout — see `compose.go` SECURITY comment + `workloads_plugin.go:198`).
`deploy_history.error` only ever gets a **fixed generic marker**
(`"deploy failed (see server logs)"`) on failure; the raw error goes to `slog` only.
`capDeployStatus(err.Error())` is rejected.
- **BLOCKER — don't double-count metrics.** `DispatchPlugin` already calls
`metrics.DeploysTotal.Inc(...)`; recording slots into the **existing** outcome block,
not a re-added metrics line.
- **FIX — no runtime-state store getter exists.** static/dockerfile `LastCommitSHA`
lives in `containers.extra_json` on a deterministic-ID row
(`GetContainerByID(w.ID+":site")` / `+":dockerfile"`, decode `ExtraJSON`). Moot for
Phase-1 rollback (image-only) but the resolver must use this, not a fictional getter.
- **FIX — cascade is distrusted here.** `DeleteWorkload` explicitly deletes containers
rather than relying on the FK. Match that: add `DELETE FROM deploy_history WHERE
workload_id = ?` inside the `DeleteWorkload` transaction, and make the cascade test a
hard gate.
- **FIX — keep recording off the hot path's tail.** `DispatchPlugin` runs synchronously
on the request goroutine; the INSERT is cheap but `PruneDeployHistory` runs in a
goroutine. Draining-rejected attempts (beginDispatch fail) record nothing — correct,
a never-run deploy must not appear as a rollback target.
- **FIX — pagination:** use `parseLimit(raw, 50, 200)` (not the unclamped
`listWorkloadEvents` style); parse `offset` separately, clamp negatives to 0.
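
The pagination fix can be sketched as below. `clampLimit` and `clampOffset` are hypothetical stand-ins written for illustration; the real `parseLimit` helper's signature and behavior are only assumed from the `parseLimit(raw, 50, 200)` call above.

```go
package main

import (
	"fmt"
	"strconv"
)

// clampLimit is a hypothetical stand-in for the existing parseLimit helper:
// it returns def for empty, invalid, or non-positive input and caps at max.
func clampLimit(raw string, def, max int) int {
	n, err := strconv.Atoi(raw)
	if err != nil || n <= 0 {
		return def
	}
	if n > max {
		return max
	}
	return n
}

// clampOffset parses offset separately; invalid or negative values become 0.
func clampOffset(raw string) int {
	n, err := strconv.Atoi(raw)
	if err != nil || n < 0 {
		return 0
	}
	return n
}

func main() {
	// prints: 50 200 0
	fmt.Println(clampLimit("", 50, 200), clampLimit("1000", 50, 200), clampOffset("-5"))
}
```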
## Problem
Tinyforge has *failure* rollback (a failed deploy unwinds its own new container —
[image.go:258](../../internal/workload/plugin/source/image/image.go)), but **no way to
revert a *successful* deploy to a prior version.** Blue-green's `enforceMaxInstances`
deletes the old container rows after cutover, so once `v3` replaces `v2` there is no
record of `v2` and nothing to roll back to. The only "history" is free-text
`event_log` rows (`"deployed"`) — not structured, not version-pinned, not replayable.
This is the single most-requested capability for any deploy tool, and the plumbing is
90% there: every deploy flows through one choke point, and the manual-deploy endpoint
already accepts a `reference` override.
## Key architectural facts (verified against current code)
- **Single dispatch choke point:** `Deployer.DispatchPlugin(ctx, w, intent)` in
[internal/deployer/dispatch.go](../../internal/deployer/dispatch.go) routes *every*
source kind and already computes a success/failure `outcome`. This is where history
is recorded.
- **`intent.Reference` is the version handle:** image source resolves
`tag := intent.Reference` (falling back to `DefaultTag`/`latest`). The manual deploy
endpoint ([workloads_plugin.go](../../internal/api/workloads_plugin.go)) already accepts
`{reference, note}` and builds a `manual` intent. **Rollback = deploy with a pinned
reference + a distinct reason.**
- **Effective vs requested reference:** for a *manual* image deploy `intent.Reference`
is often `""` (means `DefaultTag`). The *effective* deployed tag is written onto the
freshest container row (`store.Container.ImageTag`). For static/dockerfile the
effective version is `runtime_state.LastCommitSHA`, resolved inside the source.
- **Built-from-source sources don't honor a SHA reference on Deploy** — static and
dockerfile clone `cfg.Branch` HEAD and capture `latestSHA`; they cannot yet check out
an arbitrary commit. So **SHA-pinned rollback for them needs a source change (later
phase).** Image-tag rollback works today.
- **Migration pattern:** additive statements in `runMigrations()` /
`workloadTables` in [store.go](../../internal/store/store.go); workload-scoped tables
use `REFERENCES workloads(id) ON DELETE CASCADE`. Per-table CRUD lives in its own
`internal/store/<table>.go`, model in `models.go`.
- **Idempotency note:** the image source's same-tag short-circuit returns *before* it
arms its `EmitDeployEvent` defer, so a no-op deploy emits no timeline event. History
recorded at `DispatchPlugin` will still log it as a `success` attempt — acceptable
(history = ledger of attempts), but called out so the divergence is intentional.
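
The effective-versus-requested distinction for the image source can be sketched as a small pure function. This is an illustration only: `container` is a trimmed, hypothetical stand-in for `store.Container`, and treating `CreatedAt` as a UTC RFC 3339 string (so that string comparison orders rows by time) is an assumption, not something the plan confirms.

```go
package main

import "fmt"

// container is a hypothetical, trimmed stand-in for store.Container.
type container struct {
	ImageTag  string
	CreatedAt string // assumed UTC RFC 3339, which sorts lexicographically
}

// effectiveImageRef returns the tag that was actually deployed: the ImageTag
// of the newest container row when it is non-empty, otherwise the requested
// reference (which may itself be "").
func effectiveImageRef(requested string, rows []container) string {
	var newest *container
	for i := range rows {
		if newest == nil || rows[i].CreatedAt > newest.CreatedAt {
			newest = &rows[i]
		}
	}
	if newest != nil && newest.ImageTag != "" {
		return newest.ImageTag
	}
	return requested
}

func main() {
	rows := []container{
		{ImageTag: "v2", CreatedAt: "2026-06-18T10:00:00Z"},
		{ImageTag: "v3", CreatedAt: "2026-06-19T10:00:00Z"},
	}
	fmt.Println(effectiveImageRef("", rows))  // prints: v3
	fmt.Println(effectiveImageRef("v1", nil)) // prints: v1
}
```

With an empty requested reference (the `DefaultTag` case), the ledger still records `v3`, which is what makes the row replayable later.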
## Scope
### Phase 1 (this plan)
1. Persistent, structured **deploy-history ledger** for **all** source kinds (success
*and* failure) — powers an audit timeline and the rollback action.
2. **One-click rollback** for the **image** source (redeploy a pinned tag).
3. Read-only history panel on `/apps/[id]`; rollback button shown only for entries that
are `success` + have a non-empty reference + a rollback-capable source kind.
### Explicitly out of scope (future phases, table already supports them)
- SHA-pinned rebuild rollback for static/dockerfile (needs source checkout-by-commit).
- Config-snapshot rollback for compose (no artifact reference).
- Promotion (dev→staging→prod) — separate feature, will reuse this ledger.
## Data model
New table `deploy_history` (added to `workloadTables` in `runMigrations`):
```sql
CREATE TABLE IF NOT EXISTS deploy_history (
    id           INTEGER PRIMARY KEY AUTOINCREMENT,
    workload_id  TEXT NOT NULL REFERENCES workloads(id) ON DELETE CASCADE,
    source_kind  TEXT NOT NULL DEFAULT '',
    reference    TEXT NOT NULL DEFAULT '', -- effective artifact: image tag | commit sha | ''
    reason       TEXT NOT NULL DEFAULT '', -- manual|registry-push|git-push|cron|rollback|promote
    triggered_by TEXT NOT NULL DEFAULT '',
    note         TEXT NOT NULL DEFAULT '',
    outcome      TEXT NOT NULL DEFAULT '', -- success | failure
    error        TEXT NOT NULL DEFAULT '', -- truncated, secret-free
    started_at   TEXT NOT NULL DEFAULT '',
    finished_at  TEXT NOT NULL DEFAULT ''
);
CREATE INDEX IF NOT EXISTS idx_deploy_history_workload
    ON deploy_history(workload_id, id DESC);
```
**Why a dedicated table (not `event_log`):** structured + queryable, version-pinned,
carries the replayable `reference`, and its retention is independent of the human event
feed. `event_log` stays the free-text timeline; `deploy_history` is the version ledger.
Go model in `models.go` (`DeployHistoryEntry`, mirrors `MetricAlertRule` style).
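
A possible shape for the Go model, mirroring the columns above one to one. The field names and JSON tags are assumptions made for illustration; only the `DeployHistoryEntry` name and the `int64` id (from `GetDeployHistory(id int64)`) come from this plan.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// DeployHistoryEntry is a sketch of one deploy_history row.
// JSON tag names are assumed to match the column names.
type DeployHistoryEntry struct {
	ID          int64  `json:"id"`
	WorkloadID  string `json:"workload_id"`
	SourceKind  string `json:"source_kind"`
	Reference   string `json:"reference"` // effective artifact: image tag, commit sha, or ""
	Reason      string `json:"reason"`    // manual, registry-push, git-push, cron, rollback, promote
	TriggeredBy string `json:"triggered_by"`
	Note        string `json:"note"`
	Outcome     string `json:"outcome"` // success or failure
	Error       string `json:"error"`   // generic marker only, never the raw error
	StartedAt   string `json:"started_at"`
	FinishedAt  string `json:"finished_at"`
}

func main() {
	e := DeployHistoryEntry{
		ID:         1,
		WorkloadID: "w1",
		SourceKind: "image",
		Reference:  "v2",
		Reason:     "manual",
		Outcome:    "success",
	}
	b, err := json.MarshalIndent(e, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(b))
}
```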
## Backend changes
### 1. Store — `internal/store/deploy_history.go` (new) + `models.go` + `store.go`
- `DeployHistoryEntry` struct.
- `InsertDeployHistory(e DeployHistoryEntry) (DeployHistoryEntry, error)`.
- `ListDeployHistory(workloadID string, limit, offset int) ([]DeployHistoryEntry, error)`
— ordered `id DESC`; default/clamped limit (e.g. 50, max 200) via existing `parseLimit`
conventions at the API layer.
- `GetDeployHistory(id int64) (DeployHistoryEntry, error)` — for rollback lookup;
`ErrNotFound` on miss.
- `PruneDeployHistory(workloadID string, keep int) error` — keep newest `keep` per
workload (mirror the stats-prune pattern). Called best-effort after insert.
- Migration: append `CREATE TABLE` + index to `workloadTables`.
- Table test `deploy_history_test.go` (insert/list/get/prune, cascade-on-workload-delete).
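
One way `PruneDeployHistory` could keep the newest `keep` rows per workload is a single `DELETE ... NOT IN (SELECT ... LIMIT ?)` statement. The sketch below is an assumption about the shape, not the stats-prune code it is meant to mirror: `execer`, `pruneSQL`, and the rejection of `keep <= 0` are all hypothetical choices made for the example. `execer` is satisfied by `*sql.DB`, and the `recorder` fake lets the example run without a database driver.

```go
package main

import (
	"database/sql"
	"fmt"
)

// pruneSQL deletes every row of a workload except its newest `keep` ids.
const pruneSQL = `
DELETE FROM deploy_history
WHERE workload_id = ?
  AND id NOT IN (
    SELECT id FROM deploy_history
    WHERE workload_id = ?
    ORDER BY id DESC
    LIMIT ?
  )`

// execer is a hypothetical seam; *sql.DB and *sql.Tx both satisfy it.
type execer interface {
	Exec(query string, args ...any) (sql.Result, error)
}

// pruneDeployHistory keeps the newest `keep` rows for one workload.
// Rejecting keep <= 0 (an assumption) avoids wiping the whole ledger by accident.
func pruneDeployHistory(db execer, workloadID string, keep int) error {
	if keep <= 0 {
		return fmt.Errorf("keep must be positive, got %d", keep)
	}
	_, err := db.Exec(pruneSQL, workloadID, workloadID, keep)
	return err
}

// recorder is a fake execer that only captures the bound arguments.
type recorder struct{ args []any }

func (r *recorder) Exec(query string, args ...any) (sql.Result, error) {
	r.args = args
	return nil, nil
}

func main() {
	r := &recorder{}
	err := pruneDeployHistory(r, "w1", 100)
	fmt.Println(err, r.args)
}
```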
### 2. Deployer — record at the choke point (`internal/deployer/dispatch.go`)
Wrap the existing `src.Deploy(...)` call:
```go
started := store.Now()
err = src.Deploy(ctx, d.PluginDeps(), w, intent)

// Existing outcome block: the metrics line is already here and is not re-added.
outcome := "success"
if err != nil {
	outcome = "failure"
}
metrics.DeploysTotal.Inc(w.SourceKind, outcome)
d.recordDeployHistory(w, intent, outcome, err, started) // best-effort, never blocks
return err
```
- `recordDeployHistory` resolves the **effective reference** and inserts a row.
Best-effort: a store failure is logged, never propagated (same contract as
`maybeBackupBeforeDeploy` and `EmitDeployEvent`).
- **Effective-reference resolver** (`internal/deployer/deploy_ref.go`, unit-tested):
1. start from `intent.Reference`;
2. `image`: read newest `ListContainersByWorkload(w.ID)` row (by `CreatedAt`), prefer
its `ImageTag` when non-empty — captures the `DefaultTag`/`latest` resolution;
3. `static`/`dockerfile`: when still empty, read the persisted `LastCommitSHA` from
`containers.extra_json` via `GetContainerByID(w.ID+":site")` / `+":dockerfile"` (there is
no dedicated runtime-state getter);
4. `compose`/unknown: leave as-is (may be `""`).
- **Error sanitization:** on failure, store only the fixed generic marker
`"deploy failed (see server logs)"` in `error`; the raw error goes to `slog` only.
Capping the raw message with `capDeployStatus(err.Error())` is not enough, because the
wrapped detail can echo registry-auth / compose-stdout bytes (the same caller contract
documented on `EmitDeployEvent`).
- Recording does **not** run for `DispatchReconcile` (periodic, not a deploy) or
`DispatchTeardown`.
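
The recording contract (generic marker on failure, raw error to `slog` only, store failures swallowed) can be sketched as follows. Everything here is a hypothetical stand-in for illustration: `historyRow`, `inserter`, `recordOutcome`, and `errExample` are not names from the codebase. The example assumes Go 1.21 or newer for `log/slog`.

```go
package main

import (
	"errors"
	"fmt"
	"log/slog"
)

// genericDeployError is the only text ever persisted for a failed deploy.
const genericDeployError = "deploy failed (see server logs)"

// errExample simulates a raw deploy error that carries a secret.
var errExample = errors.New("registry auth failed: token=hunter2")

// historyRow is a trimmed, hypothetical stand-in for DeployHistoryEntry.
type historyRow struct {
	Outcome string
	Error   string
}

// inserter is a hypothetical seam for the store's insert call.
type inserter func(historyRow) error

// recordOutcome builds and inserts the ledger row. It never returns an error:
// a failed insert is logged and dropped so recording cannot break a deploy.
func recordOutcome(insert inserter, workloadID string, deployErr error) historyRow {
	row := historyRow{Outcome: "success"}
	if deployErr != nil {
		row.Outcome = "failure"
		row.Error = genericDeployError // never deployErr.Error()
		slog.Error("deploy failed", "workload", workloadID, "err", deployErr)
	}
	if err := insert(row); err != nil {
		slog.Warn("deploy history insert failed", "workload", workloadID, "err", err)
	}
	return row
}

func main() {
	row := recordOutcome(func(historyRow) error { return nil }, "w1", errExample)
	fmt.Printf("%s: %q\n", row.Outcome, row.Error)
}
```

The raw error appears only in the log line on stderr; the row that would reach the database holds the fixed marker.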
### 3. API — `internal/api/deploy_history.go` (new) + `router.go`
- `GET /api/workloads/{id}/deploys?limit=&offset=` → `listWorkloadDeploys` (read; any
authenticated user, mirroring `listWorkloadEvents`). Uses `parseLimit`.
- `POST /api/workloads/{id}/rollback` → `rollbackWorkload` (`auth.AdminOnly`), body
`{deploy_id}`:
1. load workload (404 if missing; 400 if `source_kind == ""`);
2. `GetDeployHistory(deploy_id)`; 404 if missing, 400 if its `workload_id` ≠ path id
(no cross-workload replay);
3. guard: `outcome == "success"`, `reference != ""`, and `source_kind` is
rollback-capable (`image` in Phase 1) → else 400 with a clear message;
4. build `manual`-shaped intent `{Reason: "rollback", Reference: row.reference,
Metadata: {"note": "rollback to " + row.reference, "rollback_of": <id>},
TriggeredBy: actor}`;
5. `deployer.DispatchPlugin(...)`; 202 on accept (same shape as deploy).
- Register both routes inside the existing `r.Route("/workloads/{id}", …)` block in
[router.go](../../internal/api/router.go), next to `/deploy` and `/events`.
- A `RollbackCapable(sourceKind) bool` helper (single source of truth, shared with the
list response so the frontend can render the button state without hardcoding kinds).
- The list response includes a per-entry `rollbackable bool` computed server-side.
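
The rollback guards from steps 2 and 3 can be sketched as one pure check that the handler would map to a 400 response. `deployRow`, `rollbackCapable`, and `checkRollback` are hypothetical names used for illustration, and the error messages are placeholders rather than the final wording.

```go
package main

import (
	"errors"
	"fmt"
)

// deployRow is a trimmed, hypothetical stand-in for DeployHistoryEntry.
type deployRow struct {
	WorkloadID string
	SourceKind string
	Reference  string
	Outcome    string
}

// rollbackCapable is the single source of truth for Phase 1: image only.
func rollbackCapable(sourceKind string) bool {
	return sourceKind == "image"
}

// checkRollback returns nil when the row is a valid rollback target for the
// workload in the request path; any non-nil error corresponds to a 400.
func checkRollback(pathWorkloadID string, row deployRow) error {
	switch {
	case row.WorkloadID != pathWorkloadID:
		return errors.New("deploy does not belong to this workload")
	case row.Outcome != "success":
		return errors.New("only a successful deploy can be a rollback target")
	case row.Reference == "":
		return errors.New("deploy has no pinned reference")
	case !rollbackCapable(row.SourceKind):
		return fmt.Errorf("source kind %q does not support rollback yet", row.SourceKind)
	}
	return nil
}

func main() {
	good := deployRow{WorkloadID: "w1", SourceKind: "image", Reference: "v2", Outcome: "success"}
	fmt.Println(checkRollback("w1", good)) // prints: <nil>
	fmt.Println(checkRollback("w2", good)) // cross-workload replay is rejected
}
```

The per-entry `rollbackable` flag in the list response could reuse the same check with the row's own `WorkloadID`, so the button state and the endpoint cannot drift apart.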
## Frontend changes (`web/`)
- **`DeployHistoryPanel.svelte`** (new, in `lib/components/`): table of entries —
short reference, reason badge, `outcome` `StatusBadge` (ok/bad), `triggered_by`,
relative time. For `rollbackable` rows a **Roll back** button → `ConfirmDialog`
("Roll back <name> to <reference>?") → `POST …/rollback {deploy_id}` → `Toast` +
refresh history and container state. Loading via `Skeleton`; `EmptyState` when no
rows. Reuses existing components only.
- Mount the panel on **`/apps/[id]`** alongside the activity timeline (it is the
*structured, actionable* sibling of the free-text timeline).
- **i18n:** add keys under a `deployHistory.*` namespace to **both**
`web/src/lib/i18n/en.json` and `ru.json` (parity is mandatory, but a mismatch is not a
build error, so verify manually per CLAUDE.md).
- API client: add `listDeploys(id, params)` and `rollback(id, deployId)` to the existing
workload API module.
## Testing
- **Store:** `deploy_history_test.go` — insert/list ordering, get, prune-keeps-newest,
cascade delete with workload.
- **Deployer:** extend `deployer` tests — `DispatchPlugin` writes one `success` row and
one `failure` row (with sanitized error); reconcile/teardown write none. Resolver unit
test (`deploy_ref_test.go`) for the image read-back + empty fallbacks.
- **API:** rollback guards — cross-workload id → 400; non-success/empty-ref/
non-image → 400; happy path → 202 and a `rollback`-reason history row appears.
- **Web:** keep it light (the panel is mostly presentational); a `sourceForms`-style
pure-logic unit only if a non-trivial helper emerges.
- Gates: `go build ./...`, `go vet ./internal/...`, `go test ./internal/...`,
`cd web && npm run check && npm run test`, then `./scripts/dev-server.sh`.
## Risks / mitigations
- **Recording must never break a deploy** → best-effort insert, errors only logged
(matches existing `EmitDeployEvent` / pre-deploy-backup contracts).
- **Secret leakage via `error`** → store only the fixed generic marker; raw error to
`slog` only.
- **Unbounded growth** → `PruneDeployHistory` keeps newest N per workload.
- **Rollback to a vanished image tag** → the image source's `PullImage` fails and its
own failure-rollback leaves the live container untouched; the rollback attempt is
recorded as `failure`. No special handling needed.
- **No-op rollback (target already running, `MaxInstances>1`)** → image short-circuit
returns `nil`; recorded as `success`. Acceptable.
## Rollout
Single PR. Additive migration (no destructive DDL). No settings changes. Backward
compatible: existing workloads simply start accumulating history on their next deploy.