# Deploy History + One-Click Rollback: Implementation Plan

**Status:** planned (review incorporated) · **Feature rank:** #1 · **Date:** 2026-06-19

## Review findings incorporated (adversarial pass)

- **BLOCKER: never persist the raw deploy error** (it can carry registry-auth bytes / compose stdout; see the `compose.go` SECURITY comment + `workloads_plugin.go:198`). `deploy_history.error` only ever gets a **fixed generic marker** (`"deploy failed (see server logs)"`) on failure; the raw error goes to `slog` only. `capDeployStatus(err.Error())` is rejected.
- **BLOCKER: don't double-count metrics.** `DispatchPlugin` already calls `metrics.DeploysTotal.Inc(...)`; recording slots into the **existing** outcome block, not a re-added metrics line.
- **FIX: no runtime-state store getter exists.** The static/dockerfile `LastCommitSHA` lives in `containers.extra_json` on a deterministic-ID row (`GetContainerByID(w.ID+":site")` / `+":dockerfile"`, decode `ExtraJSON`). Moot for Phase-1 rollback (image-only), but the resolver must use this, not a fictional getter.
- **FIX: cascade is distrusted here.** `DeleteWorkload` explicitly deletes containers rather than relying on the FK. Match that: add `DELETE FROM deploy_history WHERE workload_id = ?` inside the `DeleteWorkload` transaction, and make the cascade test a hard gate.
- **FIX: keep recording off the hot path's tail.** `DispatchPlugin` runs synchronously on the request goroutine; the INSERT is cheap, but `PruneDeployHistory` runs in a goroutine. Draining-rejected attempts (`beginDispatch` fail) record nothing, which is correct: a never-run deploy must not appear as a rollback target.
- **FIX: pagination.** Use `parseLimit(raw, 50, 200)` (not the unclamped `listWorkloadEvents` style); parse `offset` separately, clamp negatives to 0.
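
The pagination finding above can be sketched in a few lines. This is a minimal illustration, not the project's code: the plan names `parseLimit(raw, 50, 200)` but does not show its signature or behavior, so `clampLimit` and `parseOffset` below are hypothetical stand-ins that assume "default on bad input, cap at max".

```go
package main

import (
	"fmt"
	"strconv"
)

// clampLimit is a hypothetical stand-in for parseLimit(raw, def, max).
// Assumed behavior: empty, malformed, or non-positive input falls back
// to def; values above max are capped at max.
func clampLimit(raw string, def, max int) int {
	n, err := strconv.Atoi(raw)
	if err != nil || n <= 0 {
		return def
	}
	if n > max {
		return max
	}
	return n
}

// parseOffset parses the offset on its own; malformed or negative
// input is clamped to 0, as the review finding requires.
func parseOffset(raw string) int {
	n, err := strconv.Atoi(raw)
	if err != nil || n < 0 {
		return 0
	}
	return n
}

func main() {
	// Missing limit, oversized limit, negative offset.
	fmt.Println(clampLimit("", 50, 200), clampLimit("500", 50, 200), parseOffset("-3"))
}
```

Keeping the offset parser separate from the limit parser means a bad `offset` never changes the page size, and a bad `limit` never shifts the window.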
## Problem

Tinyforge has *failure* rollback (a failed deploy unwinds its own new container; see [image.go:258](../../internal/workload/plugin/source/image/image.go)), but **no way to revert a *successful* deploy to a prior version.** Blue-green's `enforceMaxInstances` deletes the old container rows after cutover, so once `v3` replaces `v2` there is no record of `v2` and nothing to roll back to. The only "history" is free-text `event_log` rows (`"deployed"`): not structured, not version-pinned, not replayable.

This is one of the most-requested capabilities for a deploy tool, and most of the plumbing is already in place: every deploy flows through one choke point, and the manual-deploy endpoint already accepts a `reference` override.

## Key architectural facts (verified against current code)

- **Single dispatch choke point:** `Deployer.DispatchPlugin(ctx, w, intent)` in [internal/deployer/dispatch.go](../../internal/deployer/dispatch.go) routes *every* source kind and already computes a success/failure `outcome`. This is where history is recorded.
- **`intent.Reference` is the version handle:** the image source resolves `tag := intent.Reference` (falling back to `DefaultTag`/`latest`). The manual deploy endpoint ([workloads_plugin.go](../../internal/api/workloads_plugin.go)) already accepts `{reference, note}` and builds a `manual` intent. **Rollback = deploy with a pinned reference + a distinct reason.**
- **Effective vs requested reference:** for a *manual* image deploy `intent.Reference` is often `""` (meaning `DefaultTag`). The *effective* deployed tag is written onto the freshest container row (`store.Container.ImageTag`). For static/dockerfile the effective version is `LastCommitSHA`, resolved inside the source and persisted in `containers.extra_json` (there is no runtime-state store getter).
- **Built-from-source sources don't honor a SHA reference on Deploy:** static and dockerfile clone `cfg.Branch` HEAD and capture `latestSHA`; they cannot yet check out an arbitrary commit.
  So **SHA-pinned rollback for them needs a source change (later phase).** Image-tag rollback works today.
- **Migration pattern:** additive statements in `runMigrations()` / `workloadTables` in [store.go](../../internal/store/store.go); workload-scoped tables use `REFERENCES workloads(id) ON DELETE CASCADE`. Per-table CRUD lives in its own `internal/store/.go`, model in `models.go`.
- **Idempotency note:** the image source's same-tag short-circuit returns *before* it arms its `EmitDeployEvent` defer, so a no-op deploy emits no timeline event. History recorded at `DispatchPlugin` will still log it as a `success` attempt. That is acceptable (history = ledger of attempts), but it is called out so the divergence is intentional.

## Scope

### Phase 1 (this plan)

1. Persistent, structured **deploy-history ledger** for **all** source kinds (success *and* failure); it powers an audit timeline and the rollback action.
2. **One-click rollback** for the **image** source (redeploy a pinned tag).
3. Read-only history panel on `/apps/[id]`; rollback button shown only for entries that are `success` + have a non-empty reference + a rollback-capable source kind.

### Explicitly out of scope (future phases, table already supports them)

- SHA-pinned rebuild rollback for static/dockerfile (needs source checkout-by-commit).
- Config-snapshot rollback for compose (no artifact reference).
- Promotion (dev→staging→prod): separate feature, will reuse this ledger.
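
The Phase 1 rule for when an entry may show a rollback button (successful outcome, non-empty reference, rollback-capable source kind) can be sketched as a pure predicate. This is a sketch under assumptions: the `entry` struct and both function names are illustrative, and only `image` is treated as rollback-capable, as Phase 1 states.

```go
package main

import "fmt"

// entry carries only the fields the eligibility check needs.
// The shape is hypothetical; the real model is defined by the plan's store layer.
type entry struct {
	SourceKind string
	Reference  string
	Outcome    string
}

// rollbackCapable reports whether a source kind supports rollback.
// Phase 1 covers only the image source.
func rollbackCapable(sourceKind string) bool {
	return sourceKind == "image"
}

// rollbackable applies all three Phase 1 conditions to one history entry.
func rollbackable(e entry) bool {
	return e.Outcome == "success" && e.Reference != "" && rollbackCapable(e.SourceKind)
}

func main() {
	entries := []entry{
		{SourceKind: "image", Reference: "v2", Outcome: "success"},
		{SourceKind: "image", Reference: "v3", Outcome: "failure"},
		{SourceKind: "image", Reference: "", Outcome: "success"},
		{SourceKind: "static", Reference: "abc123", Outcome: "success"},
	}
	for _, e := range entries {
		fmt.Printf("%s %q %s -> rollbackable=%v\n", e.SourceKind, e.Reference, e.Outcome, rollbackable(e))
	}
}
```

Computing this once on the server keeps the list of capable source kinds in one place, so later phases can widen it without touching the frontend.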
## Data model

New table `deploy_history` (added to `workloadTables` in `runMigrations`):

```sql
CREATE TABLE IF NOT EXISTS deploy_history (
    id           INTEGER PRIMARY KEY AUTOINCREMENT,
    workload_id  TEXT NOT NULL REFERENCES workloads(id) ON DELETE CASCADE,
    source_kind  TEXT NOT NULL DEFAULT '',
    reference    TEXT NOT NULL DEFAULT '',  -- effective artifact: image tag | commit sha | ''
    reason       TEXT NOT NULL DEFAULT '',  -- manual|registry-push|git-push|cron|rollback|promote
    triggered_by TEXT NOT NULL DEFAULT '',
    note         TEXT NOT NULL DEFAULT '',
    outcome      TEXT NOT NULL DEFAULT '',  -- success | failure
    error        TEXT NOT NULL DEFAULT '',  -- fixed generic marker on failure; never the raw error
    started_at   TEXT NOT NULL DEFAULT '',
    finished_at  TEXT NOT NULL DEFAULT ''
);
CREATE INDEX IF NOT EXISTS idx_deploy_history_workload ON deploy_history(workload_id, id DESC);
```

**Why a dedicated table (not `event_log`):** structured + queryable, version-pinned, carries the replayable `reference`, and its retention is independent of the human event feed. `event_log` stays the free-text timeline; `deploy_history` is the version ledger.

Go model in `models.go` (`DeployHistoryEntry`, mirrors `MetricAlertRule` style).

## Backend changes

### 1. Store: `internal/store/deploy_history.go` (new) + `models.go` + `store.go`

- `DeployHistoryEntry` struct.
- `InsertDeployHistory(e DeployHistoryEntry) (DeployHistoryEntry, error)`.
- `ListDeployHistory(workloadID string, limit, offset int) ([]DeployHistoryEntry, error)`: ordered `id DESC`; the limit is defaulted and clamped at the API layer via `parseLimit(raw, 50, 200)`.
- `GetDeployHistory(id int64) (DeployHistoryEntry, error)`: for rollback lookup; `ErrNotFound` on miss.
- `PruneDeployHistory(workloadID string, keep int) error`: keep the newest `keep` rows per workload (mirror the stats-prune pattern). Called best-effort after insert, in a goroutine.
- `DeleteWorkload`: add an explicit `DELETE FROM deploy_history WHERE workload_id = ?` inside its transaction rather than relying on the FK cascade.
- Migration: append `CREATE TABLE` + index to `workloadTables`.
- Table test `deploy_history_test.go` (insert/list/get/prune, cascade-on-workload-delete).
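
To make the store contract concrete, here is a rough sketch of a model mirroring the `deploy_history` columns, plus the keep-newest rule that `PruneDeployHistory` is meant to enforce. Everything here is an assumption for illustration: the field names and JSON tags are guesses based on the column names, and `idsToPrune` is a hypothetical in-memory helper, not the SQL the store would actually run.

```go
package main

import (
	"fmt"
	"sort"
)

// DeployHistoryEntry mirrors the deploy_history columns.
// Field names and JSON tags are assumptions derived from the schema.
type DeployHistoryEntry struct {
	ID          int64  `json:"id"`
	WorkloadID  string `json:"workload_id"`
	SourceKind  string `json:"source_kind"`
	Reference   string `json:"reference"`
	Reason      string `json:"reason"`
	TriggeredBy string `json:"triggered_by"`
	Note        string `json:"note"`
	Outcome     string `json:"outcome"`
	Error       string `json:"error"`
	StartedAt   string `json:"started_at"`
	FinishedAt  string `json:"finished_at"`
}

// idsToPrune returns the IDs a keep-newest prune would delete for one
// workload: everything except the `keep` highest IDs. Rows belonging to
// other workloads are never selected.
func idsToPrune(entries []DeployHistoryEntry, workloadID string, keep int) []int64 {
	var ids []int64
	for _, e := range entries {
		if e.WorkloadID == workloadID {
			ids = append(ids, e.ID)
		}
	}
	// Newest first: AUTOINCREMENT ids grow with insertion order.
	sort.Slice(ids, func(i, j int) bool { return ids[i] > ids[j] })
	if keep < 0 {
		keep = 0
	}
	if len(ids) <= keep {
		return nil
	}
	return ids[keep:]
}

func main() {
	var entries []DeployHistoryEntry
	for id := int64(1); id <= 5; id++ {
		entries = append(entries, DeployHistoryEntry{ID: id, WorkloadID: "w1"})
	}
	entries = append(entries, DeployHistoryEntry{ID: 6, WorkloadID: "w2"})
	fmt.Println(idsToPrune(entries, "w1", 2)) // prints [3 2 1]
}
```

Ordering by `id` rather than by timestamp matches the table's `(workload_id, id DESC)` index and avoids ties between rows recorded in the same second.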
### 2. Deployer: record at the choke point (`internal/deployer/dispatch.go`)

Wrap the existing `src.Deploy(...)` call:

```go
started := store.Now()
err = src.Deploy(ctx, d.PluginDeps(), w, intent)
outcome := "success"
if err != nil {
	outcome = "failure"
}
metrics.DeploysTotal.Inc(w.SourceKind, outcome)         // existing call, not re-added
d.recordDeployHistory(w, intent, outcome, err, started) // best-effort, never blocks
return err
```

- `recordDeployHistory` resolves the **effective reference** and inserts a row. Best-effort: a store failure is logged, never propagated (same contract as `maybeBackupBeforeDeploy` and `EmitDeployEvent`).
- **Effective-reference resolver** (`internal/deployer/deploy_ref.go`, unit-tested):
  1. start from `intent.Reference`;
  2. `image`: read the newest `ListContainersByWorkload(w.ID)` row (by `CreatedAt`) and prefer its `ImageTag` when non-empty; this captures the `DefaultTag`/`latest` resolution;
  3. `static`/`dockerfile`: when still empty, read the persisted `LastCommitSHA` from the container row's `ExtraJSON` (`GetContainerByID(w.ID+":site")` / `+":dockerfile"`); no dedicated runtime-state getter exists;
  4. `compose`/unknown: leave as-is (may be `""`).
- **Error sanitization:** on failure, store only the fixed generic marker `"deploy failed (see server logs)"` in `error`; `capDeployStatus(err.Error())` is not used here. The raw error keeps going to `slog` only. (The deploy error already carries a generic client message; the wrapped detail must not be persisted because it can echo registry-auth / compose-stdout bytes, the same caller contract documented on `EmitDeployEvent`.)
- Recording does **not** run for `DispatchReconcile` (periodic, not a deploy) or `DispatchTeardown`.

### 3. API: `internal/api/deploy_history.go` (new) + `router.go`

- `GET /api/workloads/{id}/deploys?limit=&offset=` → `listWorkloadDeploys` (read; any authenticated user, the same access as `listWorkloadEvents`). Uses `parseLimit(raw, 50, 200)`; `offset` is parsed separately, with negatives clamped to 0.
- `POST /api/workloads/{id}/rollback` → `rollbackWorkload` (`auth.AdminOnly`), body `{deploy_id}`:
  1. load workload (404 if missing; 400 if `source_kind == ""`);
  2. `GetDeployHistory(deploy_id)`; 404 if missing, 400 if its `workload_id` ≠ path id (no cross-workload replay);
  3. guard: `outcome == "success"`, `reference != ""`, and `source_kind` is rollback-capable (`image` in Phase 1) → else 400 with a clear message;
  4. build `manual`-shaped intent `{Reason: "rollback", Reference: row.reference, Metadata: {"note": "rollback to " + row.reference, "rollback_of": }, TriggeredBy: actor}`;
  5. `deployer.DispatchPlugin(...)`; 202 on accept (same shape as deploy).
- Register both routes inside the existing `r.Route("/workloads/{id}", …)` block in [router.go](../../internal/api/router.go), next to `/deploy` and `/events`.
- A `RollbackCapable(sourceKind) bool` helper (single source of truth, shared with the list response so the frontend can render the button state without hardcoding kinds).
- The list response includes a per-entry `rollbackable bool` computed server-side.

## Frontend changes (`web/`)

- **`DeployHistoryPanel.svelte`** (new, in `lib/components/`): table of entries with short reference, reason badge, `outcome` `StatusBadge` (ok/bad), `triggered_by`, relative time. For `rollbackable` rows a **Roll back** button → `ConfirmDialog` ("Roll back to ?") → `POST …/rollback {deploy_id}` → `Toast` + refresh history and container state. Loading via `Skeleton`; `EmptyState` when no rows. Reuses existing components only.
- Mount the panel on **`/apps/[id]`** alongside the activity timeline (it is the *structured, actionable* sibling of the free-text timeline).
- **i18n:** add keys under a `deployHistory.*` namespace to **both** `web/src/lib/i18n/en.json` and `ru.json` (parity is mandatory but is not enforced by the build; verify manually per CLAUDE.md).
- API client: add `listDeploys(id, params)` and `rollback(id, deployId)` to the existing workload API module.

## Testing

- **Store:** `deploy_history_test.go` covering insert/list ordering, get, prune-keeps-newest, cascade delete with workload.
- **Deployer:** extend `deployer` tests so that `DispatchPlugin` writes one `success` row and one `failure` row (with the generic error marker); reconcile/teardown write none. Resolver unit test (`deploy_ref_test.go`) for the image read-back + empty fallbacks.
- **API:** rollback guards: cross-workload id → 400; non-success / empty-ref / non-image → 400; happy path → 202 and a `rollback`-reason history row appears.
- **Web:** keep it light (the panel is mostly presentational); a `sourceForms`-style pure-logic unit only if a non-trivial helper emerges.
- Gates: `go build ./...`, `go vet ./internal/...`, `go test ./internal/...`, `cd web && npm run check && npm run test`, then `./scripts/dev-server.sh`.

## Risks / mitigations

- **Recording must never break a deploy** → best-effort insert, errors only logged (matches existing `EmitDeployEvent` / pre-deploy-backup contracts).
- **Secret leakage via `error`** → store only the fixed generic marker; raw error to `slog` only.
- **Unbounded growth** → `PruneDeployHistory` keeps newest N per workload.
- **Rollback to a vanished image tag** → the image source's `PullImage` fails and its own failure-rollback leaves the live container untouched; the rollback attempt is recorded as `failure`. No special handling needed.
- **No-op rollback (target already running, `MaxInstances>1`)** → image short-circuit returns `nil`; recorded as `success`. Acceptable.

## Rollout

Single PR. Additive migration (no destructive DDL). No settings changes. Backward compatible: existing workloads simply start accumulating history on their next deploy.