Two additions to the app detail page, each backed by a per-workload
endpoint.
Deploy history + rollback:
- New deploy_history table — a structured, version-pinned ledger of every
dispatch (success AND failure), distinct from the free-text event_log.
Recorded at the single DispatchPlugin choke point so every source kind
is covered. The raw deploy error is never persisted (it can carry
registry-auth / compose-stdout secrets) — only a generic marker, with
detail going to slog. Pruned to the newest N per workload; cascade-
deleted with the workload.
- GET /api/workloads/{id}/deploys lists the ledger; POST .../rollback
(admin) replays a prior successful deploy's pinned reference as a
rollback-reason dispatch. Phase 1 is image-source only (RollbackCapable);
git-built sources need checkout-by-commit, a later phase.
- DeployHistoryPanel.svelte renders the ledger with confirm-gated rollback.
Per-workload metrics:
- ListContainerStatsSamplesByWorkload joins the existing container stats
samples through the containers index; GET /api/workloads/{id}/stats/history
aggregates CPU/memory per timestamp across the workload's containers.
- WorkloadMetricsPanel.svelte reuses ResourceChart (CPU% + memory MiB,
windowed, 15s poll).
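A minimal sketch of the per-timestamp aggregation described above. The Sample and Point types, their field names, and the choice to sum (rather than average) across containers are assumptions made for illustration; they are not taken from the actual store or API code.

```go
package main

import (
	"fmt"
	"sort"
)

// Sample is a hypothetical per-container stats row (field names assumed).
type Sample struct {
	ContainerID string
	Timestamp   string // assumed sortable text, e.g. RFC 3339 in UTC
	CPUPercent  float64
	MemoryBytes int64
}

// Point is one aggregated chart point for the whole workload.
type Point struct {
	Timestamp  string
	CPUPercent float64
	MemoryMiB  float64
}

// aggregateByTimestamp sums CPU and memory across all samples that share a
// timestamp and returns the resulting points oldest-first.
func aggregateByTimestamp(samples []Sample) []Point {
	byTS := make(map[string]*Point)
	for _, s := range samples {
		p, ok := byTS[s.Timestamp]
		if !ok {
			p = &Point{Timestamp: s.Timestamp}
			byTS[s.Timestamp] = p
		}
		p.CPUPercent += s.CPUPercent
		p.MemoryMiB += float64(s.MemoryBytes) / (1024 * 1024)
	}
	out := make([]Point, 0, len(byTS))
	for _, p := range byTS {
		out = append(out, *p)
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Timestamp < out[j].Timestamp })
	return out
}

func main() {
	points := aggregateByTimestamp([]Sample{
		{ContainerID: "a", Timestamp: "2026-06-19T10:00:15Z", CPUPercent: 10, MemoryBytes: 1 << 20},
		{ContainerID: "b", Timestamp: "2026-06-19T10:00:15Z", CPUPercent: 5, MemoryBytes: 2 << 20},
	})
	for _, p := range points {
		fmt.Printf("%s cpu=%.1f%% mem=%.1fMiB\n", p.Timestamp, p.CPUPercent, p.MemoryMiB)
	}
}
```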
en/ru i18n added with parity. Tests: store CRUD + cascade + workload-scoped
join, deployer recording (incl. secret-non-leak on failure), API rollback
guards, and per-timestamp aggregation. Plans under docs/plans/.
Deploy History + One-Click Rollback — Implementation Plan
Status: planned (review incorporated) · Feature rank: #1 · Date: 2026-06-19
Review findings incorporated (adversarial pass)
- BLOCKER: never persist the raw deploy error (it can carry registry-auth bytes / compose stdout; see the compose.go SECURITY comment + workloads_plugin.go:198). deploy_history.error only ever gets a fixed generic marker ("deploy failed (see server logs)") on failure; the raw error goes to slog only. capDeployStatus(err.Error()) is rejected.
- BLOCKER: don't double-count metrics. DispatchPlugin already calls metrics.DeploysTotal.Inc(...); recording slots into the existing outcome block, not a re-added metrics line.
- FIX: no runtime-state store getter exists. static/dockerfile LastCommitSHA lives in containers.extra_json on a deterministic-ID row (GetContainerByID(w.ID+":site") / +":dockerfile", decode ExtraJSON). Moot for Phase-1 rollback (image-only) but the resolver must use this, not a fictional getter.
- FIX: cascade is distrusted here. DeleteWorkload explicitly deletes containers rather than relying on the FK. Match that: add DELETE FROM deploy_history WHERE workload_id = ? inside the DeleteWorkload transaction, and make the cascade test a hard gate.
- FIX: keep recording off the hot path's tail. DispatchPlugin runs synchronously on the request goroutine; the INSERT is cheap but PruneDeployHistory runs in a goroutine. Draining-rejected attempts (beginDispatch fail) record nothing, which is correct: a never-run deploy must not appear as a rollback target.
- FIX: pagination. Use parseLimit(raw, 50, 200) (not the unclamped listWorkloadEvents style); parse offset separately, clamp negatives to 0.
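
The first blocker can be sketched as a tiny helper. This is an illustration only: persistedDeployError and errExample are hypothetical names, and the only detail taken from the plan is the fixed marker text and the rule that the raw error goes to slog and nowhere else.

```go
package main

import (
	"errors"
	"fmt"
	"log/slog"
)

// deployFailedMarker is the fixed generic text quoted in the review finding.
const deployFailedMarker = "deploy failed (see server logs)"

// errExample stands in for a raw deploy error that embeds a secret.
var errExample = errors.New("pull failed: registry auth token=s3cr3t")

// persistedDeployError (hypothetical name) returns the only value that may be
// written to deploy_history.error. The raw error is logged, never returned.
func persistedDeployError(workloadID string, err error) string {
	if err == nil {
		return ""
	}
	slog.Error("deploy failed", "workload_id", workloadID, "error", err)
	return deployFailedMarker
}

func main() {
	// The stored value never contains any part of the raw error text.
	fmt.Println(persistedDeployError("w1", errExample))
}
```

Because the stored string is a constant, there is nothing to truncate or scrub, which is why a length cap on err.Error() is not sufficient on its own.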
Problem
Tinyforge has failure rollback (a failed deploy unwinds its own new container —
image.go:258), but no way to
revert a successful deploy to a prior version. Blue-green's enforceMaxInstances
deletes the old container rows after cutover, so once v3 replaces v2 there is no
record of v2 and nothing to roll back to. The only "history" is free-text
event_log rows ("deployed") — not structured, not version-pinned, not replayable.
This is the single most-requested capability for any deploy tool, and the plumbing is
90% there: every deploy flows through one choke point, and the manual-deploy endpoint
already accepts a reference override.
Key architectural facts (verified against current code)
- Single dispatch choke point: Deployer.DispatchPlugin(ctx, w, intent) in internal/deployer/dispatch.go routes every source kind and already computes a success/failure outcome. This is where history is recorded.
- intent.Reference is the version handle: image source resolves tag := intent.Reference (falling back to DefaultTag/latest). The manual deploy endpoint (workloads_plugin.go) already accepts {reference, note} and builds a manual intent. Rollback = deploy with a pinned reference + a distinct reason.
- Effective vs requested reference: for a manual image deploy intent.Reference is often "" (means DefaultTag). The effective deployed tag is written onto the freshest container row (store.Container.ImageTag). For static/dockerfile the effective version is runtime_state.LastCommitSHA, resolved inside the source.
- Built-from-source sources don't honor a SHA reference on Deploy: static and dockerfile clone cfg.Branch HEAD and capture latestSHA; they cannot yet check out an arbitrary commit. So SHA-pinned rollback for them needs a source change (later phase). Image-tag rollback works today.
- Migration pattern: additive statements in runMigrations() / workloadTables in store.go; workload-scoped tables use REFERENCES workloads(id) ON DELETE CASCADE. Per-table CRUD lives in its own internal/store/<table>.go, model in models.go.
- Idempotency note: the image source's same-tag short-circuit returns before it arms its EmitDeployEvent defer, so a no-op deploy emits no timeline event. History recorded at DispatchPlugin will still log it as a success attempt. That is acceptable (history = ledger of attempts), but called out so the divergence is intentional.
Scope
Phase 1 (this plan)
- Persistent, structured deploy-history ledger for all source kinds (success and failure) — powers an audit timeline and the rollback action.
- One-click rollback for the image source (redeploy a pinned tag).
- Read-only history panel on /apps/[id]; rollback button shown only for entries that are success + have a non-empty reference + a rollback-capable source kind.
Explicitly out of scope (future phases, table already supports them)
- SHA-pinned rebuild rollback for static/dockerfile (needs source checkout-by-commit).
- Config-snapshot rollback for compose (no artifact reference).
- Promotion (dev→staging→prod) — separate feature, will reuse this ledger.
Data model
New table deploy_history (added to workloadTables in runMigrations):
CREATE TABLE IF NOT EXISTS deploy_history (
id INTEGER PRIMARY KEY AUTOINCREMENT,
workload_id TEXT NOT NULL REFERENCES workloads(id) ON DELETE CASCADE,
source_kind TEXT NOT NULL DEFAULT '',
reference TEXT NOT NULL DEFAULT '', -- effective artifact: image tag | commit sha | ''
reason TEXT NOT NULL DEFAULT '', -- manual|registry-push|git-push|cron|rollback|promote
triggered_by TEXT NOT NULL DEFAULT '',
note TEXT NOT NULL DEFAULT '',
outcome TEXT NOT NULL DEFAULT '', -- success | failure
error TEXT NOT NULL DEFAULT '', -- fixed generic marker on failure; never the raw error
started_at TEXT NOT NULL DEFAULT '',
finished_at TEXT NOT NULL DEFAULT ''
);
CREATE INDEX IF NOT EXISTS idx_deploy_history_workload
ON deploy_history(workload_id, id DESC);
Why a dedicated table (not event_log): structured + queryable, version-pinned,
carries the replayable reference, and its retention is independent of the human event
feed. event_log stays the free-text timeline; deploy_history is the version ledger.
Go model in models.go (DeployHistoryEntry, mirrors MetricAlertRule style).
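
A possible shape for the Go model, mirroring the table columns one-to-one. The JSON tag names and the encodeEntry helper are assumptions for illustration; the real struct should follow whatever conventions models.go already uses.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// DeployHistoryEntry maps one row of deploy_history.
// Field names follow the columns; the JSON tags are assumed.
type DeployHistoryEntry struct {
	ID          int64  `json:"id"`
	WorkloadID  string `json:"workload_id"`
	SourceKind  string `json:"source_kind"`
	Reference   string `json:"reference"` // image tag, commit SHA, or ""
	Reason      string `json:"reason"`
	TriggeredBy string `json:"triggered_by"`
	Note        string `json:"note"`
	Outcome     string `json:"outcome"` // "success" or "failure"
	Error       string `json:"error"`   // generic marker only
	StartedAt   string `json:"started_at"`
	FinishedAt  string `json:"finished_at"`
}

// encodeEntry (hypothetical helper) renders an entry as JSON text.
func encodeEntry(e DeployHistoryEntry) (string, error) {
	b, err := json.Marshal(e)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	s, err := encodeEntry(DeployHistoryEntry{
		ID: 1, WorkloadID: "w1", SourceKind: "image",
		Reference: "v2", Reason: "manual", Outcome: "success",
	})
	if err != nil {
		fmt.Println("encode failed:", err)
		return
	}
	fmt.Println(s)
}
```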
Backend changes
1. Store — internal/store/deploy_history.go (new) + models.go + store.go
- DeployHistoryEntry struct.
- InsertDeployHistory(e DeployHistoryEntry) (DeployHistoryEntry, error).
- ListDeployHistory(workloadID string, limit, offset int) ([]DeployHistoryEntry, error): ordered id DESC; default/clamped limit (e.g. 50, max 200) via existing parseLimit conventions at the API layer.
- GetDeployHistory(id int64) (DeployHistoryEntry, error): for rollback lookup; ErrNotFound on miss.
- PruneDeployHistory(workloadID string, keep int) error: keep the newest keep rows per workload (mirror the stats-prune pattern). Called best-effort after insert.
- Migration: append CREATE TABLE + index to workloadTables.
- Table test deploy_history_test.go (insert/list/get/prune, cascade-on-workload-delete).
2. Deployer — record at the choke point (internal/deployer/dispatch.go)
Wrap the existing src.Deploy(...) call:
started := store.Now()
err = src.Deploy(ctx, d.PluginDeps(), w, intent)
outcome := "success"
if err != nil {
	outcome = "failure"
}
metrics.DeploysTotal.Inc(w.SourceKind, outcome) // existing call; do not add a second Inc
d.recordDeployHistory(w, intent, outcome, err, started) // best-effort, never blocks
return err
- recordDeployHistory resolves the effective reference and inserts a row. Best-effort: a store failure is logged, never propagated (same contract as maybeBackupBeforeDeploy and EmitDeployEvent).
- Effective-reference resolver (internal/deployer/deploy_ref.go, unit-tested):
  - start from intent.Reference;
  - image: read the newest ListContainersByWorkload(w.ID) row (by CreatedAt), prefer its ImageTag when non-empty (this captures the DefaultTag/latest resolution);
  - static/dockerfile: when still empty, read the persisted LastCommitSHA. Per the review finding above there is no dedicated runtime-state getter; the value lives in containers.extra_json on the deterministic-ID row;
  - compose/unknown: leave as-is (may be "").
- Error sanitization: store only a short, secret-free error. The capDeployStatus cap (256 runes) was the original idea, but per the review blocker capDeployStatus(err.Error()) is rejected: the stored value is the fixed generic marker, and the raw error keeps going to slog only. (The deploy error already carries a generic client message; the wrapped detail must not be persisted verbatim because it can echo registry-auth / compose-stdout bytes, the same caller contract documented on EmitDeployEvent.)
- Recording does not run for DispatchReconcile (periodic, not a deploy) or DispatchTeardown.
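
The resolver steps above can be sketched as a pure function. Everything here is an assumption made for illustration: the Container stand-in type, the literal source-kind strings, CreatedAt being same-format UTC text (so string comparison matches time order), and passing lastCommitSHA in as an argument instead of decoding it from the store.

```go
package main

import "fmt"

// Container is a stand-in holding only the fields the resolver reads.
type Container struct {
	ImageTag  string
	CreatedAt string // assumed RFC 3339 UTC text, so ">" means "newer"
}

// effectiveReference (hypothetical signature) returns the reference to record
// in deploy_history for one dispatch.
func effectiveReference(sourceKind, requested string, containers []Container, lastCommitSHA string) string {
	switch sourceKind {
	case "image":
		// Prefer the tag on the newest container row; it reflects the
		// DefaultTag/latest resolution when the request left it empty.
		var newest *Container
		for i := range containers {
			if newest == nil || containers[i].CreatedAt > newest.CreatedAt {
				newest = &containers[i]
			}
		}
		if newest != nil && newest.ImageTag != "" {
			return newest.ImageTag
		}
	case "static", "dockerfile":
		// Only fall back to the persisted SHA when nothing was requested.
		if requested == "" {
			return lastCommitSHA
		}
	}
	// compose/unknown, or no better information: keep the request as-is.
	return requested
}

func main() {
	rows := []Container{
		{ImageTag: "v2", CreatedAt: "2026-06-01T10:00:00Z"},
		{ImageTag: "v3", CreatedAt: "2026-06-02T10:00:00Z"},
	}
	fmt.Println(effectiveReference("image", "", rows, ""))
}
```

Keeping the resolver free of store calls makes the unit test in deploy_ref_test.go a plain table test; the caller would do the lookups and pass the results in.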
3. API — internal/api/deploy_history.go (new) + router.go
- GET /api/workloads/{id}/deploys?limit=&offset= → listWorkloadDeploys (read; any authenticated user; mirrors listWorkloadEvents). Uses parseLimit.
- POST /api/workloads/{id}/rollback → rollbackWorkload (auth.AdminOnly), body {deploy_id}:
  - load workload (404 if missing; 400 if source_kind == "");
  - GetDeployHistory(deploy_id); 404 if missing, 400 if its workload_id ≠ path id (no cross-workload replay);
  - guard: outcome == "success", reference != "", and source_kind is rollback-capable (image in Phase 1) → else 400 with a clear message;
  - build manual-shaped intent {Reason: "rollback", Reference: row.reference, Metadata: {"note": "rollback to " + row.reference, "rollback_of": <id>}, TriggeredBy: actor};
  - deployer.DispatchPlugin(...); 202 on accept (same shape as deploy).
- Register both routes inside the existing r.Route("/workloads/{id}", …) block in router.go, next to /deploy and /events.
- A RollbackCapable(sourceKind) bool helper (single source of truth, shared with the list response so the frontend can render the button state without hardcoding kinds).
- The list response includes a per-entry rollbackable bool computed server-side.
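
The rollback guards above can be sketched as one validation function that runs after the history row has been loaded. checkRollback, the trimmed DeployHistoryEntry, and the message strings are hypothetical; only the status codes and the order of checks follow the plan.

```go
package main

import (
	"fmt"
	"net/http"
)

// DeployHistoryEntry is trimmed to the fields the guards inspect.
type DeployHistoryEntry struct {
	ID         int64
	WorkloadID string
	SourceKind string
	Reference  string
	Outcome    string
}

// RollbackCapable is the single source of truth for which source kinds can
// be rolled back. Phase 1 supports the image source only.
func RollbackCapable(sourceKind string) bool {
	return sourceKind == "image"
}

// checkRollback (hypothetical) returns the HTTP status the handler should
// send and, for rejections, a message for the client.
func checkRollback(pathWorkloadID string, row DeployHistoryEntry) (int, string) {
	switch {
	case row.WorkloadID != pathWorkloadID:
		return http.StatusBadRequest, "deploy belongs to a different workload"
	case row.Outcome != "success":
		return http.StatusBadRequest, "only successful deploys can be rolled back to"
	case row.Reference == "":
		return http.StatusBadRequest, "deploy has no pinned reference"
	case !RollbackCapable(row.SourceKind):
		return http.StatusBadRequest, "source kind does not support rollback yet"
	}
	return http.StatusAccepted, ""
}

func main() {
	row := DeployHistoryEntry{ID: 7, WorkloadID: "w1", SourceKind: "image", Reference: "v2", Outcome: "success"}
	code, msg := checkRollback("w1", row)
	fmt.Println(code, msg)
}
```

The same RollbackCapable call, combined with the outcome and reference checks, is what the list handler would use to fill the per-entry rollbackable flag.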
Frontend changes (web/)
- DeployHistoryPanel.svelte (new, in lib/components/): table of entries with short reference, reason badge, outcome StatusBadge (ok/bad), triggered_by, relative time. For rollbackable rows a Roll back button → ConfirmDialog ("Roll back to ?") → POST …/rollback {deploy_id} → Toast + refresh history and container state. Loading via Skeleton; EmptyState when no rows. Reuses existing components only.
- Mount the panel on /apps/[id] alongside the activity timeline (it is the structured, actionable sibling of the free-text timeline).
- i18n: add keys under a deployHistory.* namespace to both web/src/lib/i18n/en.json and ru.json (parity is mandatory but is not enforced by the build; verify manually per CLAUDE.md).
- API client: add listDeploys(id, params) and rollback(id, deployId) to the existing workload API module.
Testing
- Store: deploy_history_test.go covers insert/list ordering, get, prune-keeps-newest, cascade delete with workload.
- Deployer: extend deployer tests so that DispatchPlugin writes one success row and one failure row (with sanitized error); reconcile/teardown write none. Resolver unit test (deploy_ref_test.go) for the image read-back + empty fallbacks.
- API: rollback guards. Cross-workload id → 400; non-success/empty-ref/non-image → 400; happy path → 202 and a rollback-reason history row appears.
- Web: keep it light (the panel is mostly presentational); a sourceForms-style pure-logic unit only if a non-trivial helper emerges.
- Gates: go build ./..., go vet ./internal/..., go test ./internal/..., cd web && npm run check && npm run test, then ./scripts/dev-server.sh.
Risks / mitigations
- Recording must never break a deploy → best-effort insert, errors only logged (matches existing EmitDeployEvent / pre-deploy-backup contracts).
- Secret leakage via error → store only a capped, generic reason; raw error to slog only.
- Unbounded growth → PruneDeployHistory keeps newest N per workload.
- Rollback to a vanished image tag → the image source's PullImage fails and its own failure-rollback leaves the live container untouched; the rollback attempt is recorded as failure. No special handling needed.
- No-op rollback (target already running, MaxInstances>1) → image short-circuit returns nil; recorded as success. Acceptable.
Rollout
Single PR. Additive migration (no destructive DDL). No settings changes. Backward compatible: existing workloads simply start accumulating history on their next deploy.