tiny-forge/docs/plans/DEPLOY_HISTORY_ROLLBACK_PLAN.md
alexei.dolgolyov 0c4c338bfe feat(apps): per-workload deploy history, rollback, and resource metrics
Two additions to the app detail page, each backed by a per-workload
endpoint.

Deploy history + rollback:
- New deploy_history table — a structured, version-pinned ledger of every
  dispatch (success AND failure), distinct from the free-text event_log.
  Recorded at the single DispatchPlugin choke point so every source kind
  is covered. The raw deploy error is never persisted (it can carry
  registry-auth / compose-stdout secrets) — only a generic marker, with
  detail going to slog. Pruned to the newest N per workload; cascade-
  deleted with the workload.
- GET /api/workloads/{id}/deploys lists the ledger; POST .../rollback
  (admin) replays a prior successful deploy's pinned reference as a
  rollback-reason dispatch. Phase 1 is image-source only (RollbackCapable);
  git-built sources need checkout-by-commit, a later phase.
- DeployHistoryPanel.svelte renders the ledger with confirm-gated rollback.

Per-workload metrics:
- ListContainerStatsSamplesByWorkload joins the existing container stats
  samples through the containers index; GET /api/workloads/{id}/stats/history
  aggregates CPU/memory per timestamp across the workload's containers.
- WorkloadMetricsPanel.svelte reuses ResourceChart (CPU% + memory MiB,
  windowed, 15s poll).

en/ru i18n added with parity. Tests: store CRUD + cascade + workload-scoped
join, deployer recording (incl. secret-non-leak on failure), API rollback
guards, and per-timestamp aggregation. Plans under docs/plans/.
2026-06-19 16:22:12 +03:00

Deploy History + One-Click Rollback — Implementation Plan

Status: planned (review incorporated) · Feature rank: #1 · Date: 2026-06-19

Review findings incorporated (adversarial pass)

  • BLOCKER — never persist the raw deploy error (it can carry registry-auth bytes / compose stdout — see compose.go SECURITY comment + workloads_plugin.go:198). deploy_history.error only ever gets a fixed generic marker ("deploy failed (see server logs)") on failure; the raw error goes to slog only. capDeployStatus(err.Error()) is rejected.
  • BLOCKER — don't double-count metrics. DispatchPlugin already calls metrics.DeploysTotal.Inc(...); recording slots into the existing outcome block, not a re-added metrics line.
  • FIX — no runtime-state store getter exists. static/dockerfile LastCommitSHA lives in containers.extra_json on a deterministic-ID row (GetContainerByID(w.ID+":site") / +":dockerfile", decode ExtraJSON). Moot for Phase-1 rollback (image-only) but the resolver must use this, not a fictional getter.
  • FIX — cascade is distrusted here. DeleteWorkload explicitly deletes containers rather than relying on the FK. Match that: add DELETE FROM deploy_history WHERE workload_id = ? inside the DeleteWorkload transaction, and make the cascade test a hard gate.
  • FIX — keep recording off the hot path's tail. DispatchPlugin runs synchronously on the request goroutine; the INSERT is cheap but PruneDeployHistory runs in a goroutine. Draining-rejected attempts (beginDispatch fail) record nothing — correct, a never-run deploy must not appear as a rollback target.
  • FIX — pagination: use parseLimit(raw, 50, 200) (not the unclamped listWorkloadEvents style); parse offset separately, clamp negatives to 0.
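The pagination finding can be sketched as follows. This is a minimal illustration, not the repo's code: clampLimit is a hypothetical stand-in for the existing parseLimit(raw, 50, 200), whose exact behavior is assumed here (default on empty or invalid input, clamp to the maximum), and parseOffset is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"strconv"
)

// clampLimit is a hypothetical stand-in for parseLimit(raw, def, max).
// Assumed behavior: empty, malformed, or non-positive input yields def;
// values above max are clamped to max.
func clampLimit(raw string, def, max int) int {
	n, err := strconv.Atoi(raw)
	if err != nil || n <= 0 {
		return def
	}
	if n > max {
		return max
	}
	return n
}

// parseOffset (hypothetical name) parses the offset separately from the
// limit; malformed or negative values become 0.
func parseOffset(raw string) int {
	n, err := strconv.Atoi(raw)
	if err != nil || n < 0 {
		return 0
	}
	return n
}

func main() {
	// prints: 50 200 0
	fmt.Println(clampLimit("", 50, 200), clampLimit("1000", 50, 200), parseOffset("-5"))
}
```

Keeping the two parsers separate means a bad offset can never widen the page size, and an oversized limit can never bypass the 200-row cap.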

Problem

Tinyforge has failure rollback (a failed deploy unwinds its own new container — image.go:258), but no way to revert a successful deploy to a prior version. Blue-green's enforceMaxInstances deletes the old container rows after cutover, so once v3 replaces v2 there is no record of v2 and nothing to roll back to. The only "history" is free-text event_log rows ("deployed") — not structured, not version-pinned, not replayable.

Reverting to a known-good version is a core capability for a deploy tool, and most of the plumbing already exists: every deploy flows through one choke point, and the manual-deploy endpoint already accepts a reference override.

Key architectural facts (verified against current code)

  • Single dispatch choke point: Deployer.DispatchPlugin(ctx, w, intent) in internal/deployer/dispatch.go routes every source kind and already computes a success/failure outcome. This is where history is recorded.
  • intent.Reference is the version handle: image source resolves tag := intent.Reference (falling back to DefaultTag/latest). The manual deploy endpoint (workloads_plugin.go) already accepts {reference, note} and builds a manual intent. Rollback = deploy with a pinned reference + a distinct reason.
  • Effective vs requested reference: for a manual image deploy intent.Reference is often "" (meaning DefaultTag). The effective deployed tag is written onto the freshest container row (store.Container.ImageTag). For static/dockerfile the effective version is LastCommitSHA, which is resolved inside the source and persisted in containers.extra_json on the deterministic-ID container row (there is no runtime-state store getter).
  • Built-from-source sources don't honor a SHA reference on Deploy — static and dockerfile clone cfg.Branch HEAD and capture latestSHA; they cannot yet check out an arbitrary commit. So SHA-pinned rollback for them needs a source change (later phase). Image-tag rollback works today.
  • Migration pattern: additive statements in runMigrations() / workloadTables in store.go; workload-scoped tables use REFERENCES workloads(id) ON DELETE CASCADE. Per-table CRUD lives in its own internal/store/<table>.go, model in models.go.
  • Idempotency note: the image source's same-tag short-circuit returns before it arms its EmitDeployEvent defer, so a no-op deploy emits no timeline event. History recorded at DispatchPlugin will still log it as a success attempt — acceptable (history = ledger of attempts), but called out so the divergence is intentional.

Scope

Phase 1 (this plan)

  1. Persistent, structured deploy-history ledger for all source kinds (success and failure) — powers an audit timeline and the rollback action.
  2. One-click rollback for the image source (redeploy a pinned tag).
  3. Read-only history panel on /apps/[id]; rollback button shown only for entries that are success + have a non-empty reference + a rollback-capable source kind.

Explicitly out of scope (future phases, table already supports them)

  • SHA-pinned rebuild rollback for static/dockerfile (needs source checkout-by-commit).
  • Config-snapshot rollback for compose (no artifact reference).
  • Promotion (dev→staging→prod) — separate feature, will reuse this ledger.

Data model

New table deploy_history (added to workloadTables in runMigrations):

CREATE TABLE IF NOT EXISTS deploy_history (
    id            INTEGER PRIMARY KEY AUTOINCREMENT,
    workload_id   TEXT NOT NULL REFERENCES workloads(id) ON DELETE CASCADE,
    source_kind   TEXT NOT NULL DEFAULT '',
    reference     TEXT NOT NULL DEFAULT '',   -- effective artifact: image tag | commit sha | ''
    reason        TEXT NOT NULL DEFAULT '',   -- manual|registry-push|git-push|cron|rollback|promote
    triggered_by  TEXT NOT NULL DEFAULT '',
    note          TEXT NOT NULL DEFAULT '',
    outcome       TEXT NOT NULL DEFAULT '',   -- success | failure
    error         TEXT NOT NULL DEFAULT '',   -- fixed generic marker on failure, never the raw error
    started_at    TEXT NOT NULL DEFAULT '',
    finished_at   TEXT NOT NULL DEFAULT ''
);
CREATE INDEX IF NOT EXISTS idx_deploy_history_workload
    ON deploy_history(workload_id, id DESC);

Why a dedicated table (not event_log): structured + queryable, version-pinned, carries the replayable reference, and its retention is independent of the human event feed. event_log stays the free-text timeline; deploy_history is the version ledger.

Go model in models.go (DeployHistoryEntry, mirrors MetricAlertRule style).
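A possible shape for that model, mapping one field to each deploy_history column. The Go field names and JSON tags below are assumptions made for illustration; the plan only fixes the column names and the type name DeployHistoryEntry.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// DeployHistoryEntry mirrors one deploy_history row.
// Field names and JSON tags are assumed, not taken from the repo.
type DeployHistoryEntry struct {
	ID          int64  `json:"id"`
	WorkloadID  string `json:"workload_id"`
	SourceKind  string `json:"source_kind"`
	Reference   string `json:"reference"`    // effective artifact: image tag, commit SHA, or ""
	Reason      string `json:"reason"`       // manual, registry-push, git-push, cron, rollback, promote
	TriggeredBy string `json:"triggered_by"`
	Note        string `json:"note"`
	Outcome     string `json:"outcome"`      // "success" or "failure"
	Error       string `json:"error"`        // fixed generic marker only
	StartedAt   string `json:"started_at"`   // TEXT timestamps, as in the schema
	FinishedAt  string `json:"finished_at"`
}

func main() {
	e := DeployHistoryEntry{
		ID: 1, WorkloadID: "w1", SourceKind: "image",
		Reference: "v2", Reason: "manual", Outcome: "success",
	}
	b, err := json.Marshal(e)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(b))
}
```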

Backend changes

1. Store — internal/store/deploy_history.go (new) + models.go + store.go

  • DeployHistoryEntry struct.
  • InsertDeployHistory(e DeployHistoryEntry) (DeployHistoryEntry, error).
  • ListDeployHistory(workloadID string, limit, offset int) ([]DeployHistoryEntry, error): ordered id DESC. The API layer supplies limit via parseLimit(raw, 50, 200) (default 50, max 200) and a separately parsed offset with negatives clamped to 0.
  • GetDeployHistory(id int64) (DeployHistoryEntry, error) — for rollback lookup; ErrNotFound on miss.
  • PruneDeployHistory(workloadID string, keep int) error — keep newest keep per workload (mirror the stats-prune pattern). Called best-effort after insert.
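  • Migration: append CREATE TABLE + index to workloadTables. Also add DELETE FROM deploy_history WHERE workload_id = ? inside the DeleteWorkload transaction, matching how it already deletes containers explicitly rather than relying on the FK cascade.
  • Table test deploy_history_test.go (insert/list/get/prune, plus delete-with-workload as a hard gate).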
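The intended list and prune semantics can be illustrated with an in-memory stand-in for the table. This sketch is not the SQL-backed store: memLedger and entry are hypothetical types, and the real functions would run queries against deploy_history. It only demonstrates the contract described above: id DESC ordering, limit/offset paging, and pruning to the newest keep rows of one workload without touching others.

```go
package main

import (
	"fmt"
	"sort"
)

// entry is a trimmed, hypothetical stand-in for DeployHistoryEntry.
type entry struct {
	ID         int64
	WorkloadID string
	Reference  string
}

// memLedger is an in-memory stand-in for the deploy_history table.
type memLedger struct {
	nextID int64
	rows   []entry
}

// Insert assigns the next auto-increment ID and stores the row.
func (l *memLedger) Insert(e entry) entry {
	l.nextID++
	e.ID = l.nextID
	l.rows = append(l.rows, e)
	return e
}

// List returns one workload's rows ordered by ID descending.
func (l *memLedger) List(workloadID string, limit, offset int) []entry {
	if limit < 0 {
		limit = 0
	}
	if offset < 0 {
		offset = 0
	}
	var out []entry
	for _, r := range l.rows {
		if r.WorkloadID == workloadID {
			out = append(out, r)
		}
	}
	sort.Slice(out, func(i, j int) bool { return out[i].ID > out[j].ID })
	if offset >= len(out) {
		return nil
	}
	out = out[offset:]
	if limit < len(out) {
		out = out[:limit]
	}
	return out
}

// Prune keeps the newest keep rows of one workload; other workloads are untouched.
func (l *memLedger) Prune(workloadID string, keep int) {
	keepIDs := make(map[int64]bool)
	for _, r := range l.List(workloadID, keep, 0) {
		keepIDs[r.ID] = true
	}
	kept := l.rows[:0] // in-place filter: writes never pass the read index
	for _, r := range l.rows {
		if r.WorkloadID != workloadID || keepIDs[r.ID] {
			kept = append(kept, r)
		}
	}
	l.rows = kept
}

func main() {
	l := &memLedger{}
	l.Insert(entry{WorkloadID: "a", Reference: "v1"})
	l.Insert(entry{WorkloadID: "a", Reference: "v2"})
	l.Insert(entry{WorkloadID: "b", Reference: "x1"})
	l.Insert(entry{WorkloadID: "a", Reference: "v3"})
	l.Prune("a", 2)
	// prints: 2 1 v3
	fmt.Println(len(l.List("a", 10, 0)), len(l.List("b", 10, 0)), l.List("a", 1, 0)[0].Reference)
}
```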

2. Deployer — record at the choke point (internal/deployer/dispatch.go)

Wrap the existing src.Deploy(...) call:

started := store.Now()
err = src.Deploy(ctx, d.PluginDeps(), w, intent)
outcome := "success"
if err != nil {
	outcome = "failure"
}
// Existing counter: recording slots in beside it, no second Inc is added.
metrics.DeploysTotal.Inc(w.SourceKind, outcome)
// Best-effort: a store failure is logged, never returned to the caller.
d.recordDeployHistory(w, intent, outcome, err, started)
return err

  • recordDeployHistory resolves the effective reference and inserts a row. Best-effort: a store failure is logged, never propagated (same contract as maybeBackupBeforeDeploy and EmitDeployEvent).
  • Effective-reference resolver (internal/deployer/deploy_ref.go, unit-tested):
    1. start from intent.Reference;
    2. image: read newest ListContainersByWorkload(w.ID) row (by CreatedAt), prefer its ImageTag when non-empty — captures the DefaultTag/latest resolution;
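    3. static/dockerfile: when still empty, read LastCommitSHA from containers.extra_json on the deterministic-ID row (GetContainerByID(w.ID+":site") or w.ID+":dockerfile", then decode ExtraJSON); no runtime-state store getter exists;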
    4. compose/unknown: leave as-is (may be "").
  • Error sanitization: on failure, deploy_history.error receives only the fixed generic marker "deploy failed (see server logs)". capDeployStatus(err.Error()) is not used, because capping the length does not remove secrets. The raw error goes to slog only. (The deploy error already carries a generic client message; the wrapped detail must not be persisted because it can echo registry-auth / compose-stdout bytes, the same caller contract documented on EmitDeployEvent.)
  • Recording does not run for DispatchReconcile (periodic, not a deploy) or DispatchTeardown.
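The resolver order and the error rule above can be sketched as two pure functions. This is an illustration under stated assumptions: refInputs, effectiveReference, and recordedError are hypothetical names, the source-kind strings ("image", "static", "dockerfile") are assumed values, and the store reads (newest container row, decoded ExtraJSON) are represented as plain inputs so the logic can be unit-tested without a database.

```go
package main

import (
	"errors"
	"fmt"
	"log/slog"
)

// failureMarker is the only error text that may be persisted.
const failureMarker = "deploy failed (see server logs)"

// errSample is sample data standing in for a secret-bearing deploy error.
var errSample = errors.New("registry auth failed: token=abc123")

// refInputs carries what the resolver would read from the store (hypothetical type).
type refInputs struct {
	SourceKind     string
	IntentRef      string // intent.Reference; often "" for a manual image deploy
	NewestImageTag string // ImageTag of the newest container row for the workload
	LastCommitSHA  string // decoded from containers.extra_json
}

// effectiveReference applies the resolver order: start from the intent,
// prefer the deployed image tag, and fall back to the commit SHA for built sources.
func effectiveReference(in refInputs) string {
	ref := in.IntentRef
	switch in.SourceKind {
	case "image":
		if in.NewestImageTag != "" {
			ref = in.NewestImageTag
		}
	case "static", "dockerfile":
		if ref == "" {
			ref = in.LastCommitSHA
		}
	}
	return ref // compose/unknown: left as-is, possibly ""
}

// recordedError logs the raw error and returns only the generic marker.
func recordedError(err error) string {
	if err == nil {
		return ""
	}
	slog.Error("deploy failed", "error", err) // raw detail goes to the log only
	return failureMarker
}

func main() {
	fmt.Println(effectiveReference(refInputs{SourceKind: "image", NewestImageTag: "v3"}))
	fmt.Println(recordedError(errSample))
}
```

Returning a constant rather than a transformed copy of err makes the non-leak property easy to assert in a test: the stored string never depends on the error's content.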

3. API — internal/api/deploy_history.go (new) + router.go

  • GET /api/workloads/{id}/deploys?limit=&offset= → listWorkloadDeploys (read; any authenticated user, the same access level as listWorkloadEvents). Uses parseLimit(raw, 50, 200) for limit; offset is parsed separately with negatives clamped to 0.
  • POST /api/workloads/{id}/rollback → rollbackWorkload (auth.AdminOnly), body {deploy_id}:
    1. load workload (404 if missing; 400 if source_kind == "");
    2. GetDeployHistory(deploy_id); 404 if missing, 400 if its workload_id ≠ path id (no cross-workload replay);
    3. guard: outcome == "success", reference != "", and source_kind is rollback-capable (image in Phase 1) → else 400 with a clear message;
    4. build manual-shaped intent {Reason: "rollback", Reference: row.reference, Metadata: {"note": "rollback to " + row.reference, "rollback_of": <id>}, TriggeredBy: actor};
    5. deployer.DispatchPlugin(...); 202 on accept (same shape as deploy).
  • Register both routes inside the existing r.Route("/workloads/{id}", …) block in router.go, next to /deploy and /events.
  • A RollbackCapable(sourceKind) bool helper (single source of truth, shared with the list response so the frontend can render the button state without hardcoding kinds).
  • The list response includes a per-entry rollbackable bool computed server-side.
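The guard sequence for rollback, and the shared rollbackable flag, can be sketched as below. This is a hedged illustration: historyRow, checkRollback, rollbackable, and the sentinel errors are hypothetical names, and the "image" kind string is an assumed value; only RollbackCapable is named in the plan. Each non-nil result would map to a 400 response in the handler.

```go
package main

import (
	"errors"
	"fmt"
)

// historyRow is a trimmed, hypothetical stand-in for a deploy_history row.
type historyRow struct {
	ID         int64
	WorkloadID string
	SourceKind string
	Reference  string
	Outcome    string
}

// RollbackCapable is the single source of truth for which kinds can roll back.
// Phase 1: image only.
func RollbackCapable(sourceKind string) bool {
	return sourceKind == "image"
}

var (
	errWrongWorkload = errors.New("deploy belongs to a different workload")
	errNotSuccess    = errors.New("only successful deploys can be rolled back to")
	errNoReference   = errors.New("deploy has no pinned reference")
	errNotCapable    = errors.New("source kind does not support rollback yet")
)

// checkRollback returns nil when row may be replayed for the workload in the URL path.
func checkRollback(pathID string, row historyRow) error {
	switch {
	case row.WorkloadID != pathID:
		return errWrongWorkload // no cross-workload replay
	case row.Outcome != "success":
		return errNotSuccess
	case row.Reference == "":
		return errNoReference
	case !RollbackCapable(row.SourceKind):
		return errNotCapable
	}
	return nil
}

// rollbackable computes the per-entry flag returned by the list endpoint.
func rollbackable(row historyRow) bool {
	return row.Outcome == "success" && row.Reference != "" && RollbackCapable(row.SourceKind)
}

func main() {
	row := historyRow{ID: 4, WorkloadID: "w1", SourceKind: "image", Reference: "v2", Outcome: "success"}
	fmt.Println(checkRollback("w1", row) == nil, rollbackable(row))
	fmt.Println(checkRollback("w2", row))
}
```

Because the list flag and the POST guard share the same predicates, a button the frontend renders as enabled should not be rejected by the handler for a reason the list response could have known.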

Frontend changes (web/)

  • DeployHistoryPanel.svelte (new, in lib/components/): table of entries with short reference, reason badge, outcome StatusBadge (ok/bad), triggered_by, relative time. For rollbackable rows a Roll back button → ConfirmDialog ("Roll back to ?") → POST …/rollback {deploy_id} → Toast + refresh history and container state. Loading via Skeleton; EmptyState when no rows. Reuses existing components only.
  • Mount the panel on /apps/[id] alongside the activity timeline (it is the structured, actionable sibling of the free-text timeline).
  • i18n: add keys under a deployHistory.* namespace to both web/src/lib/i18n/en.json and ru.json (parity is mandatory but is not enforced by the build, so verify it manually per CLAUDE.md).
  • API client: add listDeploys(id, params) and rollback(id, deployId) to the existing workload API module.

Testing

  • Store: deploy_history_test.go — insert/list ordering, get, prune-keeps-newest, cascade delete with workload.
  • Deployer: extend deployer tests so that DispatchPlugin writes one success row and one failure row (carrying only the generic marker, never the raw error text); reconcile/teardown write none. Resolver unit test (deploy_ref_test.go) for the image read-back + empty fallbacks.
  • API (rollback guards): cross-workload id → 400; non-success/empty-ref/non-image → 400; happy path → 202 and a rollback-reason history row appears.
  • Web: keep it light (the panel is mostly presentational); a sourceForms-style pure-logic unit only if a non-trivial helper emerges.
  • Gates: go build ./..., go vet ./internal/..., go test ./internal/..., cd web && npm run check && npm run test, then ./scripts/dev-server.sh.

Risks / mitigations

  • Recording must never break a deploy → best-effort insert, errors only logged (matches existing EmitDeployEvent / pre-deploy-backup contracts).
  • Secret leakage via error → store only the fixed generic marker; raw error to slog only.
  • Unbounded growth → PruneDeployHistory keeps newest N per workload.
  • Rollback to a vanished image tag → the image source's PullImage fails and its own failure-rollback leaves the live container untouched; the rollback attempt is recorded as failure. No special handling needed.
  • No-op rollback (target already running, MaxInstances>1) → image short-circuit returns nil; recorded as success. Acceptable.

Rollout

Single PR. Additive migration (no destructive DDL). No settings changes. Backward compatible: existing workloads simply start accumulating history on their next deploy.