0c4c338bfe
Two additions to the app detail page, each backed by a per-workload
endpoint.
Deploy history + rollback:
- New deploy_history table — a structured, version-pinned ledger of every
dispatch (success AND failure), distinct from the free-text event_log.
Recorded at the single DispatchPlugin choke point so every source kind
is covered. The raw deploy error is never persisted (it can carry
registry-auth / compose-stdout secrets) — only a generic marker, with
detail going to slog. Pruned to the newest N per workload; cascade-
deleted with the workload.
- GET /api/workloads/{id}/deploys lists the ledger; POST .../rollback
(admin) replays a prior successful deploy's pinned reference as a
rollback-reason dispatch. Phase 1 is image-source only (RollbackCapable);
git-built sources need checkout-by-commit, a later phase.
- DeployHistoryPanel.svelte renders the ledger with confirm-gated rollback.
Per-workload metrics:
- ListContainerStatsSamplesByWorkload joins the existing container stats
samples through the containers index; GET /api/workloads/{id}/stats/history
aggregates CPU/memory per timestamp across the workload's containers.
- WorkloadMetricsPanel.svelte reuses ResourceChart (CPU% + memory MiB,
windowed, 15s poll).
en/ru i18n added with parity. Tests: store CRUD + cascade + workload-scoped
join, deployer recording (incl. secret-non-leak on failure), API rollback
guards, and per-timestamp aggregation. Plans under docs/plans/.
224 lines
13 KiB
Markdown
# Deploy History + One-Click Rollback: Implementation Plan

**Status:** planned (review incorporated) · **Feature rank:** #1 · **Date:** 2026-06-19

## Review findings incorporated (adversarial pass)

- **BLOCKER: never persist the raw deploy error** (it can carry registry-auth bytes /
  compose stdout; see `compose.go` SECURITY comment + `workloads_plugin.go:198`).
  `deploy_history.error` only ever gets a **fixed generic marker**
  (`"deploy failed (see server logs)"`) on failure; the raw error goes to `slog` only.
  `capDeployStatus(err.Error())` is rejected.
- **BLOCKER: don't double-count metrics.** `DispatchPlugin` already calls
  `metrics.DeploysTotal.Inc(...)`; recording slots into the **existing** outcome block,
  not a re-added metrics line.
- **FIX: no runtime-state store getter exists.** static/dockerfile `LastCommitSHA`
  lives in `containers.extra_json` on a deterministic-ID row
  (`GetContainerByID(w.ID+":site")` / `+":dockerfile"`, decode `ExtraJSON`). Moot for
  Phase-1 rollback (image-only) but the resolver must use this, not a fictional getter.
- **FIX: cascade is distrusted here.** `DeleteWorkload` explicitly deletes containers
  rather than relying on the FK. Match that: add `DELETE FROM deploy_history WHERE
  workload_id = ?` inside the `DeleteWorkload` transaction, and make the cascade test a
  hard gate.
- **FIX: keep recording off the hot path's tail.** `DispatchPlugin` runs synchronously
  on the request goroutine. The INSERT is cheap and stays inline, but
  `PruneDeployHistory` runs in a separate goroutine. Draining-rejected attempts
  (`beginDispatch` fail) record nothing, which is correct: a never-run deploy must not
  appear as a rollback target.
- **FIX: pagination.** Use `parseLimit(raw, 50, 200)` (not the unclamped
  `listWorkloadEvents` style); parse `offset` separately, clamp negatives to 0.

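The pagination rule in the last finding can be sketched as follows. This is a minimal illustration, not the project's code: the body of `parseLimit` is an assumption about how the existing helper behaves (default on empty or invalid input, cap at the maximum), and `parseOffset` is a hypothetical name for the separate offset parse.

```go
package main

import (
	"fmt"
	"strconv"
)

// parseLimit is an assumed stand-in for the project's helper of the same name:
// it returns def when raw is empty, non-numeric, or below 1, and caps at max.
func parseLimit(raw string, def, max int) int {
	n, err := strconv.Atoi(raw)
	if err != nil || n < 1 {
		return def
	}
	if n > max {
		return max
	}
	return n
}

// parseOffset (hypothetical name) parses the offset on its own;
// invalid or negative values are clamped to 0.
func parseOffset(raw string) int {
	n, err := strconv.Atoi(raw)
	if err != nil || n < 0 {
		return 0
	}
	return n
}

func main() {
	// limit missing -> default; limit too large -> capped; negative offset -> 0
	fmt.Println(parseLimit("", 50, 200), parseLimit("1000", 50, 200), parseOffset("-5"))
}
```

Keeping the two parses separate means a bad `offset` never affects the clamped `limit`, and vice versa.
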
## Problem

Tinyforge has *failure* rollback (a failed deploy unwinds its own new container; see
[image.go:258](../../internal/workload/plugin/source/image/image.go)), but **no way to
revert a *successful* deploy to a prior version.** Blue-green's `enforceMaxInstances`
deletes the old container rows after cutover, so once `v3` replaces `v2` there is no
record of `v2` and nothing to roll back to. The only "history" is free-text
`event_log` rows (`"deployed"`): not structured, not version-pinned, not replayable.

Reverting to a known-good version is a core capability for a deploy tool, and most of
the plumbing is already in place: every deploy flows through one choke point, and the
manual-deploy endpoint already accepts a `reference` override.

## Key architectural facts (verified against current code)

- **Single dispatch choke point:** `Deployer.DispatchPlugin(ctx, w, intent)` in
  [internal/deployer/dispatch.go](../../internal/deployer/dispatch.go) routes *every*
  source kind and already computes a success/failure `outcome`. This is where history
  is recorded.
- **`intent.Reference` is the version handle:** image source resolves
  `tag := intent.Reference` (falling back to `DefaultTag`/`latest`). The manual deploy
  endpoint ([workloads_plugin.go](../../internal/api/workloads_plugin.go)) already accepts
  `{reference, note}` and builds a `manual` intent. **Rollback = deploy with a pinned
  reference + a distinct reason.**
- **Effective vs requested reference:** for a *manual* image deploy `intent.Reference`
  is often `""` (means `DefaultTag`). The *effective* deployed tag is written onto the
  freshest container row (`store.Container.ImageTag`). For static/dockerfile the
  effective version is the `LastCommitSHA` runtime state, which is resolved inside the
  source and persisted in `containers.extra_json` (there is no dedicated store getter;
  see the review findings).
- **Built-from-source sources don't honor a SHA reference on Deploy:** static and
  dockerfile clone `cfg.Branch` HEAD and capture `latestSHA`; they cannot yet check out
  an arbitrary commit. So **SHA-pinned rollback for them needs a source change (later
  phase).** Image-tag rollback works today.
- **Migration pattern:** additive statements in `runMigrations()` /
  `workloadTables` in [store.go](../../internal/store/store.go); workload-scoped tables
  use `REFERENCES workloads(id) ON DELETE CASCADE`. Per-table CRUD lives in its own
  `internal/store/<table>.go`, model in `models.go`.
- **Idempotency note:** the image source's same-tag short-circuit returns *before* it
  arms its `EmitDeployEvent` defer, so a no-op deploy emits no timeline event. History
  recorded at `DispatchPlugin` will still log it as a `success` attempt. That is
  acceptable (history = ledger of attempts), but it is called out so the divergence is
  intentional.

## Scope

### Phase 1 (this plan)
1. Persistent, structured **deploy-history ledger** for **all** source kinds (success
   *and* failure), which powers an audit timeline and the rollback action.
2. **One-click rollback** for the **image** source (redeploy a pinned tag).
3. Read-only history panel on `/apps/[id]`; rollback button shown only for entries that
   are `success` + have a non-empty reference + a rollback-capable source kind.

### Explicitly out of scope (future phases, table already supports them)
- SHA-pinned rebuild rollback for static/dockerfile (needs source checkout-by-commit).
- Config-snapshot rollback for compose (no artifact reference).
- Promotion (dev→staging→prod): separate feature, will reuse this ledger.

## Data model

New table `deploy_history` (added to `workloadTables` in `runMigrations`):

```sql
CREATE TABLE IF NOT EXISTS deploy_history (
    id           INTEGER PRIMARY KEY AUTOINCREMENT,
    workload_id  TEXT NOT NULL REFERENCES workloads(id) ON DELETE CASCADE,
    source_kind  TEXT NOT NULL DEFAULT '',
    reference    TEXT NOT NULL DEFAULT '',  -- effective artifact: image tag | commit sha | ''
    reason       TEXT NOT NULL DEFAULT '',  -- manual|registry-push|git-push|cron|rollback|promote
    triggered_by TEXT NOT NULL DEFAULT '',
    note         TEXT NOT NULL DEFAULT '',
    outcome      TEXT NOT NULL DEFAULT '',  -- success | failure
    error        TEXT NOT NULL DEFAULT '',  -- fixed generic marker on failure; never the raw error
    started_at   TEXT NOT NULL DEFAULT '',
    finished_at  TEXT NOT NULL DEFAULT ''
);
CREATE INDEX IF NOT EXISTS idx_deploy_history_workload
    ON deploy_history(workload_id, id DESC);
```

**Why a dedicated table (not `event_log`):** structured + queryable, version-pinned,
carries the replayable `reference`, and its retention is independent of the human event
feed. `event_log` stays the free-text timeline; `deploy_history` is the version ledger.

Go model in `models.go` (`DeployHistoryEntry`, mirrors `MetricAlertRule` style).

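As a rough sketch of what the Go model could look like, the struct below maps one field to each column of the table. The field names and JSON tags are assumptions chosen to match the column names; the real `DeployHistoryEntry` may differ, for example in how timestamps are typed.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// DeployHistoryEntry mirrors the deploy_history columns one-to-one.
// Field names and JSON tags are assumptions, not the project's actual model.
type DeployHistoryEntry struct {
	ID          int64  `json:"id"`
	WorkloadID  string `json:"workload_id"`
	SourceKind  string `json:"source_kind"`
	Reference   string `json:"reference"` // effective artifact: image tag, commit SHA, or ""
	Reason      string `json:"reason"`
	TriggeredBy string `json:"triggered_by"`
	Note        string `json:"note"`
	Outcome     string `json:"outcome"` // "success" or "failure"
	Error       string `json:"error"`   // fixed generic marker only, never the raw error
	StartedAt   string `json:"started_at"`
	FinishedAt  string `json:"finished_at"`
}

func main() {
	e := DeployHistoryEntry{
		ID:         1,
		WorkloadID: "w1",
		SourceKind: "image",
		Reference:  "v2",
		Reason:     "manual",
		Outcome:    "success",
	}
	b, err := json.Marshal(e)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(b))
}
```

Because every column is `NOT NULL DEFAULT ''`, plain `string` fields with their zero value line up with the schema and no nullable wrapper types are needed.
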
## Backend changes

### 1. Store: `internal/store/deploy_history.go` (new) + `models.go` + `store.go`
- `DeployHistoryEntry` struct.
- `InsertDeployHistory(e DeployHistoryEntry) (DeployHistoryEntry, error)`.
- `ListDeployHistory(workloadID string, limit, offset int) ([]DeployHistoryEntry, error)`,
  ordered `id DESC`; default/clamped limit (e.g. 50, max 200) via existing `parseLimit`
  conventions at the API layer.
- `GetDeployHistory(id int64) (DeployHistoryEntry, error)` for rollback lookup;
  `ErrNotFound` on miss.
- `PruneDeployHistory(workloadID string, keep int) error` keeps the newest `keep` rows
  per workload (mirror the stats-prune pattern). Called best-effort after insert, in a
  separate goroutine so it stays off the request path.
- `DeleteWorkload`: add an explicit `DELETE FROM deploy_history WHERE workload_id = ?`
  inside its transaction (per the review findings) rather than relying on the FK
  cascade alone.
- Migration: append `CREATE TABLE` + index to `workloadTables`.
- Table test `deploy_history_test.go` (insert/list/get/prune, cascade-on-workload-delete).

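The keep-newest-N semantics of `PruneDeployHistory` can be illustrated in memory. This is a sketch only: `idsToPrune` is a hypothetical helper, and it relies on the assumption that a higher `AUTOINCREMENT` id always means a newer row. The SQL in the comment is one possible way to express the same rule in SQLite, not the statement the project uses.

```go
package main

import (
	"fmt"
	"sort"
)

// idsToPrune (hypothetical) returns the row ids that a keep-newest-N prune
// would delete for one workload. "Newest" is taken to mean "highest id".
//
// One possible SQL equivalent (an assumption, not the project's statement):
//
//	DELETE FROM deploy_history
//	WHERE workload_id = ?
//	  AND id NOT IN (SELECT id FROM deploy_history
//	                 WHERE workload_id = ? ORDER BY id DESC LIMIT ?)
func idsToPrune(ids []int64, keep int) []int64 {
	if keep < 0 {
		keep = 0
	}
	// Copy before sorting so the caller's slice is left untouched.
	sorted := append([]int64(nil), ids...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] > sorted[j] }) // newest first
	if len(sorted) <= keep {
		return nil
	}
	return sorted[keep:]
}

func main() {
	// Five rows, keep the newest three: ids 2 and 1 are pruned.
	fmt.Println(idsToPrune([]int64{1, 2, 3, 4, 5}, 3))
}
```
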
### 2. Deployer: record at the choke point (`internal/deployer/dispatch.go`)
Wrap the existing `src.Deploy(...)` call:
```go
started := store.Now()
err = src.Deploy(ctx, d.PluginDeps(), w, intent)
// Existing outcome block: the metrics line is already there and is not re-added.
outcome := "success"
if err != nil {
	outcome = "failure"
}
metrics.DeploysTotal.Inc(w.SourceKind, outcome)
// New: best-effort ledger write; a store failure is logged and never fails the deploy.
d.recordDeployHistory(w, intent, outcome, err, started)
return err
```
- `recordDeployHistory` resolves the **effective reference** and inserts a row.
  Best-effort: a store failure is logged, never propagated (same contract as
  `maybeBackupBeforeDeploy` and `EmitDeployEvent`).
- **Effective-reference resolver** (`internal/deployer/deploy_ref.go`, unit-tested):
  1. start from `intent.Reference`;
  2. `image`: read the newest `ListContainersByWorkload(w.ID)` row (by `CreatedAt`) and
     prefer its `ImageTag` when non-empty, which captures the `DefaultTag`/`latest`
     resolution;
  3. `static`/`dockerfile`: when still empty, read the persisted `LastCommitSHA` from
     the deterministic-ID container row (`GetContainerByID(w.ID+":site")` /
     `+":dockerfile"`, decode `ExtraJSON`); there is no dedicated runtime-state getter;
  4. `compose`/unknown: leave as-is (may be `""`).
- **Error sanitization:** on failure, store only the fixed generic marker
  `"deploy failed (see server logs)"` in `error`; the raw error goes to `slog` only.
  Capping the raw message (`capDeployStatus(err.Error())`, 256 runes) is not enough and
  is rejected: the deploy error already carries a generic client message, but the
  wrapped detail can echo registry-auth / compose-stdout bytes, so it must not be
  persisted even in truncated form (same caller contract documented on
  `EmitDeployEvent`).
- Recording does **not** run for `DispatchReconcile` (periodic, not a deploy) or
  `DispatchTeardown`.

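The resolver steps above can be sketched as a pure function. Everything here is illustrative: `container` is a trimmed-down, hypothetical stand-in for `store.Container`, `effectiveReference` is a made-up name, and the caller is assumed to have already loaded the container rows and decoded `LastCommitSHA` from `ExtraJSON`. Comparing `CreatedAt` as strings assumes fixed-width timestamps in one zone.

```go
package main

import "fmt"

// container is a hypothetical stand-in for store.Container.
type container struct {
	ImageTag  string
	CreatedAt string // assumed fixed-width UTC timestamp, so string order is time order
}

// effectiveReference (hypothetical) resolves what was actually deployed.
func effectiveReference(sourceKind, requested string, rows []container, lastCommitSHA string) string {
	switch sourceKind {
	case "image":
		// Prefer the newest container row's tag: it reflects DefaultTag/latest resolution.
		var newest *container
		for i := range rows {
			if newest == nil || rows[i].CreatedAt > newest.CreatedAt {
				newest = &rows[i]
			}
		}
		if newest != nil && newest.ImageTag != "" {
			return newest.ImageTag
		}
	case "static", "dockerfile":
		// Only fall back to the persisted commit SHA when nothing was requested.
		if requested == "" {
			return lastCommitSHA
		}
	}
	// compose/unknown, or no better information: keep the requested value (may be "").
	return requested
}

func main() {
	rows := []container{
		{ImageTag: "v1", CreatedAt: "2026-01-01T00:00:00Z"},
		{ImageTag: "v2", CreatedAt: "2026-02-01T00:00:00Z"},
	}
	fmt.Println(effectiveReference("image", "", rows, ""))         // newest row's tag
	fmt.Println(effectiveReference("static", "", nil, "abc123"))   // persisted SHA
	fmt.Println(effectiveReference("compose", "", nil, "") == "")  // stays empty
}
```

Keeping the function free of store calls makes the unit test in `deploy_ref_test.go` a matter of passing in rows, which fits the "unit-tested" requirement.
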
### 3. API: `internal/api/deploy_history.go` (new) + `router.go`
- `GET /api/workloads/{id}/deploys?limit=&offset=` → `listWorkloadDeploys` (read; any
  authenticated user, mirroring `listWorkloadEvents`). Uses `parseLimit(raw, 50, 200)`
  for `limit`; `offset` is parsed separately and negatives are clamped to 0.
- `POST /api/workloads/{id}/rollback` → `rollbackWorkload` (`auth.AdminOnly`), body
  `{deploy_id}`:
  1. load workload (404 if missing; 400 if `source_kind == ""`);
  2. `GetDeployHistory(deploy_id)`; 404 if missing, 400 if its `workload_id` ≠ path id
     (no cross-workload replay);
  3. guard: `outcome == "success"`, `reference != ""`, and `source_kind` is
     rollback-capable (`image` in Phase 1) → else 400 with a clear message;
  4. build `manual`-shaped intent `{Reason: "rollback", Reference: row.reference,
     Metadata: {"note": "rollback to " + row.reference, "rollback_of": <id>},
     TriggeredBy: actor}`;
  5. `deployer.DispatchPlugin(...)`; 202 on accept (same shape as deploy).
- Register both routes inside the existing `r.Route("/workloads/{id}", …)` block in
  [router.go](../../internal/api/router.go), next to `/deploy` and `/events`.
- A `RollbackCapable(sourceKind) bool` helper (single source of truth, shared with the
  list response so the frontend can render the button state without hardcoding kinds).
- The list response includes a per-entry `rollbackable bool` computed server-side.

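Guard steps 2 and 3 of the rollback handler can be expressed as one small check that returns an error for every case the plan maps to HTTP 400. This is a sketch under assumptions: `historyRow` and `checkRollback` are hypothetical names, the error messages are invented, and `RollbackCapable` is shown with the Phase 1 rule (image only).

```go
package main

import (
	"errors"
	"fmt"
)

// historyRow is a hypothetical, trimmed-down view of a deploy_history row.
type historyRow struct {
	WorkloadID string
	SourceKind string
	Reference  string
	Outcome    string
}

// RollbackCapable reports whether a source kind supports rollback.
// Phase 1 rule from the plan: image only.
func RollbackCapable(sourceKind string) bool {
	return sourceKind == "image"
}

// checkRollback (hypothetical) returns nil when the row may be replayed for
// the workload in the request path; any error here would map to HTTP 400.
func checkRollback(pathWorkloadID string, row historyRow) error {
	switch {
	case row.WorkloadID != pathWorkloadID:
		return errors.New("deploy does not belong to this workload")
	case row.Outcome != "success":
		return errors.New("only successful deploys can be rolled back to")
	case row.Reference == "":
		return errors.New("deploy has no pinned reference")
	case !RollbackCapable(row.SourceKind):
		return errors.New("source kind does not support rollback yet")
	}
	return nil
}

func main() {
	ok := historyRow{WorkloadID: "w1", SourceKind: "image", Reference: "v2", Outcome: "success"}
	fmt.Println(checkRollback("w1", ok))  // <nil>
	fmt.Println(checkRollback("w2", ok))  // cross-workload replay is refused
}
```

The same predicate, minus the path check, could feed the server-computed `rollbackable` flag in the list response so the button state and the handler never disagree.
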
## Frontend changes (`web/`)

- **`DeployHistoryPanel.svelte`** (new, in `lib/components/`): table of entries with
  short reference, reason badge, `outcome` `StatusBadge` (ok/bad), `triggered_by`,
  relative time. For `rollbackable` rows a **Roll back** button → `ConfirmDialog`
  ("Roll back <name> to <reference>?") → `POST …/rollback {deploy_id}` → `Toast` +
  refresh history and container state. Loading via `Skeleton`; `EmptyState` when no
  rows. Reuses existing components only.
- Mount the panel on **`/apps/[id]`** alongside the activity timeline (it is the
  *structured, actionable* sibling of the free-text timeline).
- **i18n:** add keys under a `deployHistory.*` namespace to **both**
  `web/src/lib/i18n/en.json` and `ru.json`. Parity is mandatory, but a missing key does
  not fail the build, so verify it manually per CLAUDE.md.
- API client: add `listDeploys(id, params)` and `rollback(id, deployId)` to the existing
  workload API module.

## Testing

- **Store:** `deploy_history_test.go` covers insert/list ordering, get, prune-keeps-newest,
  cascade delete with workload.
- **Deployer:** extend `deployer` tests so that `DispatchPlugin` writes one `success`
  row and one `failure` row (whose `error` is the fixed generic marker, with no part of
  the raw error); reconcile/teardown write none. Resolver unit test
  (`deploy_ref_test.go`) for the image read-back + empty fallbacks.
- **API:** rollback guards: cross-workload id → 400; non-success/empty-ref/
  non-image → 400; happy path → 202 and a `rollback`-reason history row appears.
- **Web:** keep it light (the panel is mostly presentational); a `sourceForms`-style
  pure-logic unit only if a non-trivial helper emerges.
- Gates: `go build ./...`, `go vet ./internal/...`, `go test ./internal/...`,
  `cd web && npm run check && npm run test`, then `./scripts/dev-server.sh`.

## Risks / mitigations

- **Recording must never break a deploy** → best-effort insert, errors only logged
  (matches existing `EmitDeployEvent` / pre-deploy-backup contracts).
- **Secret leakage via `error`** → store only the fixed generic marker, never a capped
  or truncated copy of the raw error; raw error to `slog` only.
- **Unbounded growth** → `PruneDeployHistory` keeps newest N per workload.
- **Rollback to a vanished image tag** → the image source's `PullImage` fails and its
  own failure-rollback leaves the live container untouched; the rollback attempt is
  recorded as `failure`. No special handling needed.
- **No-op rollback (target already running, `MaxInstances>1`)** → image short-circuit
  returns `nil`; recorded as `success`. Acceptable.

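The secret-leakage mitigation can be sketched as a single helper that splits the error into what is logged and what is persisted. The marker string comes from the plan; `historyError`, `deployFailedMarker`, and `errDemo` are hypothetical names, and the demo error text is made up to stand in for a message that embeds credentials.

```go
package main

import (
	"errors"
	"fmt"
	"log/slog"
)

// deployFailedMarker is the only text ever stored in deploy_history.error.
const deployFailedMarker = "deploy failed (see server logs)"

// errDemo is an invented error whose text embeds something secret-looking.
var errDemo = errors.New("pull failed: registry auth dXNlcjpwYXNz rejected")

// historyError (hypothetical) logs the raw error and returns the value to
// persist: "" on success, the fixed marker on failure. The raw message is
// never returned, not even truncated.
func historyError(workloadID string, err error) string {
	if err == nil {
		return ""
	}
	slog.Error("deploy failed", "workload_id", workloadID, "err", err)
	return deployFailedMarker
}

func main() {
	fmt.Println(historyError("w1", errDemo)) // prints the marker; detail goes to the log only
}
```

Returning a constant rather than a sanitized substring means there is no truncation boundary or pattern list to get wrong: nothing derived from the raw error can reach the table.
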
## Rollout

Single PR. Additive migration (no destructive DDL). No settings changes. Backward
compatible: existing workloads simply start accumulating history on their next deploy.