feat(apps): per-workload deploy history, rollback, and resource metrics
Two additions to the app detail page, each backed by a per-workload
endpoint.
Deploy history + rollback:
- New deploy_history table — a structured, version-pinned ledger of every
dispatch (success AND failure), distinct from the free-text event_log.
Recorded at the single DispatchPlugin choke point so every source kind
is covered. The raw deploy error is never persisted (it can carry
registry-auth / compose-stdout secrets) — only a generic marker, with
detail going to slog. Pruned to the newest N per workload; cascade-
deleted with the workload.
- GET /api/workloads/{id}/deploys lists the ledger; POST .../rollback
(admin) replays a prior successful deploy's pinned reference as a
rollback-reason dispatch. Phase 1 is image-source only (RollbackCapable);
git-built sources need checkout-by-commit, a later phase.
- DeployHistoryPanel.svelte renders the ledger with confirm-gated rollback.
Per-workload metrics:
- ListContainerStatsSamplesByWorkload joins the existing container stats
samples through the containers index; GET /api/workloads/{id}/stats/history
aggregates CPU/memory per timestamp across the workload's containers.
- WorkloadMetricsPanel.svelte reuses ResourceChart (CPU% + memory MiB,
windowed, 15s poll).
en/ru i18n added with parity. Tests: store CRUD + cascade + workload-scoped
join, deployer recording (incl. secret-non-leak on failure), API rollback
guards, and per-timestamp aggregation. Plans under docs/plans/.
@@ -0,0 +1,223 @@
|
||||
# Deploy History + One-Click Rollback — Implementation Plan
|
||||
|
||||
**Status:** planned (review incorporated) · **Feature rank:** #1 · **Date:** 2026-06-19
|
||||
|
||||
## Review findings incorporated (adversarial pass)
|
||||
|
||||
- **BLOCKER — never persist the raw deploy error** (it can carry registry-auth bytes /
|
||||
compose stdout — see `compose.go` SECURITY comment + `workloads_plugin.go:198`).
|
||||
`deploy_history.error` only ever gets a **fixed generic marker**
|
||||
(`"deploy failed (see server logs)"`) on failure; the raw error goes to `slog` only.
|
||||
`capDeployStatus(err.Error())` is rejected.
|
||||
- **BLOCKER — don't double-count metrics.** `DispatchPlugin` already calls
|
||||
`metrics.DeploysTotal.Inc(...)`; recording slots into the **existing** outcome block,
|
||||
not a re-added metrics line.
|
||||
- **FIX — no runtime-state store getter exists.** static/dockerfile `LastCommitSHA`
|
||||
lives in `containers.extra_json` on a deterministic-ID row
|
||||
(`GetContainerByID(w.ID+":site")` / `+":dockerfile"`, decode `ExtraJSON`). Moot for
|
||||
Phase-1 rollback (image-only) but the resolver must use this, not a fictional getter.
|
||||
- **FIX — cascade is distrusted here.** `DeleteWorkload` explicitly deletes containers
|
||||
rather than relying on the FK. Match that: add `DELETE FROM deploy_history WHERE
|
||||
workload_id = ?` inside the `DeleteWorkload` transaction, and make the cascade test a
|
||||
hard gate.
|
||||
- **FIX — keep recording off the hot path's tail.** `DispatchPlugin` runs synchronously
|
||||
on the request goroutine, so the INSERT (cheap) stays inline while `PruneDeployHistory`
runs in a separate goroutine. Draining-rejected attempts (`beginDispatch` fails) record
nothing, which is correct: a never-run deploy must not appear as a rollback target.
|
||||
- **FIX — pagination:** use `parseLimit(raw, 50, 200)` (not the unclamped
|
||||
`listWorkloadEvents` style); parse `offset` separately, clamp negatives to 0.
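
The pagination fix above could look roughly like the sketch below. It is an illustration under stated assumptions, not the project's code: `clampLimit` and `parseOffset` are hypothetical names (the real `parseLimit` signature is not shown in this plan), and the sketch assumes a default of 50, a ceiling of 200, and malformed or negative offsets clamped to 0.

```go
package main

import (
	"fmt"
	"strconv"
)

// clampLimit is a hypothetical stand-in for parseLimit(raw, def, max).
// It falls back to def on empty, malformed, or non-positive input and
// never returns more than max.
func clampLimit(raw string, def, max int) int {
	n, err := strconv.Atoi(raw)
	if err != nil || n <= 0 {
		return def
	}
	if n > max {
		return max
	}
	return n
}

// parseOffset parses the offset on its own; malformed or negative
// values are clamped to 0 rather than rejected.
func parseOffset(raw string) int {
	n, err := strconv.Atoi(raw)
	if err != nil || n < 0 {
		return 0
	}
	return n
}

func main() {
	// Empty limit, oversized limit, negative offset.
	fmt.Println(clampLimit("", 50, 200), clampLimit("1000", 50, 200), parseOffset("-5"))
}
```

Clamping instead of returning 400 keeps the list endpoint forgiving for hand-typed query strings; whether the real handler should reject bad input instead is a design choice this plan does not settle.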
|
||||
|
||||
|
||||
## Problem
|
||||
|
||||
Tinyforge has *failure* rollback (a failed deploy unwinds its own new container —
|
||||
[image.go:258](../../internal/workload/plugin/source/image/image.go)), but **no way to
|
||||
revert a *successful* deploy to a prior version.** Blue-green's `enforceMaxInstances`
|
||||
deletes the old container rows after cutover, so once `v3` replaces `v2` there is no
|
||||
record of `v2` and nothing to roll back to. The only "history" is free-text
|
||||
`event_log` rows (`"deployed"`) — not structured, not version-pinned, not replayable.
|
||||
|
||||
This is the single most-requested capability for any deploy tool, and the plumbing is
|
||||
90% there: every deploy flows through one choke point, and the manual-deploy endpoint
|
||||
already accepts a `reference` override.
|
||||
|
||||
## Key architectural facts (verified against current code)
|
||||
|
||||
- **Single dispatch choke point:** `Deployer.DispatchPlugin(ctx, w, intent)` in
|
||||
[internal/deployer/dispatch.go](../../internal/deployer/dispatch.go) routes *every*
|
||||
source kind and already computes a success/failure `outcome`. This is where history
|
||||
is recorded.
|
||||
- **`intent.Reference` is the version handle:** image source resolves
|
||||
`tag := intent.Reference` (falling back to `DefaultTag`/`latest`). The manual deploy
|
||||
endpoint ([workloads_plugin.go](../../internal/api/workloads_plugin.go)) already accepts
|
||||
`{reference, note}` and builds a `manual` intent. **Rollback = deploy with a pinned
|
||||
reference + a distinct reason.**
|
||||
- **Effective vs requested reference:** for a *manual* image deploy `intent.Reference`
|
||||
is often `""` (means `DefaultTag`). The *effective* deployed tag is written onto the
|
||||
freshest container row (`store.Container.ImageTag`). For static/dockerfile the
|
||||
effective version is `runtime_state.LastCommitSHA`, resolved inside the source.
|
||||
- **Built-from-source sources don't honor a SHA reference on Deploy** — static and
|
||||
dockerfile clone `cfg.Branch` HEAD and capture `latestSHA`; they cannot yet check out
|
||||
an arbitrary commit. So **SHA-pinned rollback for them needs a source change (later
|
||||
phase).** Image-tag rollback works today.
|
||||
- **Migration pattern:** additive statements in `runMigrations()` /
|
||||
`workloadTables` in [store.go](../../internal/store/store.go); workload-scoped tables
|
||||
use `REFERENCES workloads(id) ON DELETE CASCADE`. Per-table CRUD lives in its own
|
||||
`internal/store/<table>.go`, model in `models.go`.
|
||||
- **Idempotency note:** the image source's same-tag short-circuit returns *before* it
|
||||
arms its `EmitDeployEvent` defer, so a no-op deploy emits no timeline event. History
|
||||
recorded at `DispatchPlugin` will still log it as a `success` attempt — acceptable
|
||||
(history = ledger of attempts), but called out so the divergence is intentional.
|
||||
|
||||
## Scope
|
||||
|
||||
### Phase 1 (this plan)
|
||||
1. Persistent, structured **deploy-history ledger** for **all** source kinds (success
|
||||
*and* failure) — powers an audit timeline and the rollback action.
|
||||
2. **One-click rollback** for the **image** source (redeploy a pinned tag).
|
||||
3. Read-only history panel on `/apps/[id]`; rollback button shown only for entries that
|
||||
are `success` + have a non-empty reference + a rollback-capable source kind.
|
||||
|
||||
### Explicitly out of scope (future phases, table already supports them)
|
||||
- SHA-pinned rebuild rollback for static/dockerfile (needs source checkout-by-commit).
|
||||
- Config-snapshot rollback for compose (no artifact reference).
|
||||
- Promotion (dev→staging→prod) — separate feature, will reuse this ledger.
|
||||
|
||||
## Data model
|
||||
|
||||
New table `deploy_history` (added to `workloadTables` in `runMigrations`):
|
||||
|
||||
```sql
CREATE TABLE IF NOT EXISTS deploy_history (
    id           INTEGER PRIMARY KEY AUTOINCREMENT,
    workload_id  TEXT NOT NULL REFERENCES workloads(id) ON DELETE CASCADE,
    source_kind  TEXT NOT NULL DEFAULT '',
    reference    TEXT NOT NULL DEFAULT '',   -- effective artifact: image tag | commit sha | ''
    reason       TEXT NOT NULL DEFAULT '',   -- manual|registry-push|git-push|cron|rollback|promote
    triggered_by TEXT NOT NULL DEFAULT '',
    note         TEXT NOT NULL DEFAULT '',
    outcome      TEXT NOT NULL DEFAULT '',   -- success | failure
    error        TEXT NOT NULL DEFAULT '',   -- truncated, secret-free
    started_at   TEXT NOT NULL DEFAULT '',
    finished_at  TEXT NOT NULL DEFAULT ''
);
CREATE INDEX IF NOT EXISTS idx_deploy_history_workload
    ON deploy_history(workload_id, id DESC);
```
|
||||
|
||||
**Why a dedicated table (not `event_log`):** structured + queryable, version-pinned,
|
||||
carries the replayable `reference`, and its retention is independent of the human event
|
||||
feed. `event_log` stays the free-text timeline; `deploy_history` is the version ledger.
|
||||
|
||||
Go model in `models.go` (`DeployHistoryEntry`, mirrors `MetricAlertRule` style).
|
||||
|
||||
## Backend changes
|
||||
|
||||
### 1. Store — `internal/store/deploy_history.go` (new) + `models.go` + `store.go`
|
||||
- `DeployHistoryEntry` struct.
|
||||
- `InsertDeployHistory(e DeployHistoryEntry) (DeployHistoryEntry, error)`.
|
||||
- `ListDeployHistory(workloadID string, limit, offset int) ([]DeployHistoryEntry, error)`
|
||||
— ordered `id DESC`; default/clamped limit (e.g. 50, max 200) via existing `parseLimit`
|
||||
conventions at the API layer.
|
||||
- `GetDeployHistory(id int64) (DeployHistoryEntry, error)` — for rollback lookup;
|
||||
`ErrNotFound` on miss.
|
||||
- `PruneDeployHistory(workloadID string, keep int) error` — keep newest `keep` per
|
||||
workload (mirror the stats-prune pattern). Called best-effort after insert.
|
||||
- Migration: append `CREATE TABLE` + index to `workloadTables`.
|
||||
- Table test `deploy_history_test.go` (insert/list/get/prune, cascade-on-workload-delete).
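
One possible shape for the prune step is sketched below. The SQL, the `execer` interface, and the `pruneDeployHistory` name are assumptions for illustration (the plan only says to mirror the stats-prune pattern, which is not shown here). The statement assumes SQLite, which accepts `LIMIT` inside a subquery, and treats "newest" as highest `id`, matching the `id DESC` list order.

```go
package main

import (
	"database/sql"
	"fmt"
)

// execer is the small subset of *sql.DB / *sql.Tx this sketch needs;
// both standard types satisfy it.
type execer interface {
	Exec(query string, args ...any) (sql.Result, error)
}

// pruneSQL deletes every row of one workload except its newest `keep`
// rows (assumed statement, not taken from the codebase).
const pruneSQL = `
DELETE FROM deploy_history
WHERE workload_id = ?
  AND id NOT IN (
    SELECT id FROM deploy_history
    WHERE workload_id = ?
    ORDER BY id DESC
    LIMIT ?
  )`

// pruneDeployHistory is a hypothetical shape for PruneDeployHistory.
// Rejecting keep <= 0 is a choice made for this sketch so that a bad
// setting cannot wipe the whole ledger.
func pruneDeployHistory(db execer, workloadID string, keep int) error {
	if keep <= 0 {
		return fmt.Errorf("keep must be positive, got %d", keep)
	}
	_, err := db.Exec(pruneSQL, workloadID, workloadID, keep)
	return err
}

// recordingExec is a test double that captures the bound arguments.
type recordingExec struct{ args []any }

func (r *recordingExec) Exec(_ string, args ...any) (sql.Result, error) {
	r.args = args
	return nil, nil
}

func main() {
	rec := &recordingExec{}
	if err := pruneDeployHistory(rec, "w1", 100); err != nil {
		fmt.Println("prune failed:", err)
		return
	}
	fmt.Println(rec.args...)
}
```

Taking an interface rather than `*sql.DB` lets the same function run inside a transaction or against a fake in unit tests.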
|
||||
|
||||
### 2. Deployer — record at the choke point (`internal/deployer/dispatch.go`)
|
||||
Wrap the existing `src.Deploy(...)` call:
|
||||
```go
started := store.Now()
err = src.Deploy(ctx, d.PluginDeps(), w, intent)

outcome := "success"
if err != nil {
	outcome = "failure"
}
metrics.DeploysTotal.Inc(w.SourceKind, outcome)
d.recordDeployHistory(w, intent, outcome, err, started) // best-effort, never blocks
return err
```
|
||||
- `recordDeployHistory` resolves the **effective reference** and inserts a row.
|
||||
Best-effort: a store failure is logged, never propagated (same contract as
|
||||
`maybeBackupBeforeDeploy` and `EmitDeployEvent`).
|
||||
- **Effective-reference resolver** (`internal/deployer/deploy_ref.go`, unit-tested):
|
||||
1. start from `intent.Reference`;
|
||||
2. `image`: read newest `ListContainersByWorkload(w.ID)` row (by `CreatedAt`), prefer
|
||||
its `ImageTag` when non-empty — captures the `DefaultTag`/`latest` resolution;
|
||||
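3. `static`/`dockerfile`: when still empty, read `LastCommitSHA` from `containers.extra_json`
on the deterministic-ID row (`GetContainerByID(w.ID+":site")` / `+":dockerfile"`, decode
`ExtraJSON`); no dedicated runtime-state getter exists;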
|
||||
4. `compose`/unknown: leave as-is (may be `""`).
|
||||
- **Error sanitization:** on failure, store only the fixed generic marker
(`"deploy failed (see server logs)"`) in `error`. Do not store
`capDeployStatus(err.Error())` or any other truncation of the raw error; the raw error
goes to `slog` only. (The deploy error already carries a generic client message; the
wrapped detail must not be persisted, even capped, because it can echo registry-auth /
compose-stdout bytes. This is the same caller contract documented on `EmitDeployEvent`.)
|
||||
- Recording does **not** run for `DispatchReconcile` (periodic, not a deploy) or
|
||||
`DispatchTeardown`.
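
The resolver steps above can be sketched as a pure function, which is also what makes it easy to unit-test. Everything here is illustrative: `refInputs` and its fields are hypothetical names for values the real resolver would read from the intent and the store, and the source-kind strings are assumed to match `w.SourceKind`.

```go
package main

import "fmt"

// refInputs bundles what the resolver would look up before deciding.
// All names are hypothetical.
type refInputs struct {
	SourceKind      string // "image", "static", "dockerfile", "compose", ...
	IntentReference string // intent.Reference; may be ""
	NewestImageTag  string // ImageTag of the newest container row, if any
	LastCommitSHA   string // persisted commit SHA for built-from-source kinds
}

// effectiveReference applies the four resolver steps in order.
func effectiveReference(in refInputs) string {
	ref := in.IntentReference // step 1: start from the requested reference
	switch in.SourceKind {
	case "image":
		// step 2: the tag actually written to the newest container wins,
		// which captures the DefaultTag/latest resolution.
		if in.NewestImageTag != "" {
			ref = in.NewestImageTag
		}
	case "static", "dockerfile":
		// step 3: fill in the commit SHA only when nothing was requested.
		if ref == "" {
			ref = in.LastCommitSHA
		}
	}
	// step 4: compose/unknown kinds keep whatever the intent carried.
	return ref
}

func main() {
	// A manual image deploy with an empty reference resolves to the tag
	// recorded on the newest container row.
	fmt.Println(effectiveReference(refInputs{SourceKind: "image", NewestImageTag: "v3"}))
}
```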
|
||||
|
||||
### 3. API — `internal/api/deploy_history.go` (new) + `router.go`
|
||||
- `GET /api/workloads/{id}/deploys?limit=&offset=` → `listWorkloadDeploys` (read; any
|
||||
authenticated user — mirrors `listWorkloadEvents`). Uses `parseLimit`.
|
||||
- `POST /api/workloads/{id}/rollback` → `rollbackWorkload` (`auth.AdminOnly`), body
|
||||
`{deploy_id}`:
|
||||
1. load workload (404 if missing; 400 if `source_kind == ""`);
|
||||
2. `GetDeployHistory(deploy_id)`; 404 if missing, 400 if its `workload_id` ≠ path id
|
||||
(no cross-workload replay);
|
||||
3. guard: `outcome == "success"`, `reference != ""`, and `source_kind` is
|
||||
rollback-capable (`image` in Phase 1) → else 400 with a clear message;
|
||||
4. build `manual`-shaped intent `{Reason: "rollback", Reference: row.reference,
|
||||
Metadata: {"note": "rollback to " + row.reference, "rollback_of": <id>},
|
||||
TriggeredBy: actor}`;
|
||||
5. `deployer.DispatchPlugin(...)`; 202 on accept (same shape as deploy).
|
||||
- Register both routes inside the existing `r.Route("/workloads/{id}", …)` block in
|
||||
[router.go](../../internal/api/router.go), next to `/deploy` and `/events`.
|
||||
- A `RollbackCapable(sourceKind) bool` helper (single source of truth, shared with the
|
||||
list response so the frontend can render the button state without hardcoding kinds).
|
||||
- The list response includes a per-entry `rollbackable bool` computed server-side.
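
The rollback guards in step 2 and step 3 could be factored into one check, sketched below. The struct fields, error values, and function names are assumptions for illustration; per the plan both failures map to a 400, and the same `rollbackCapable` helper would feed the per-entry `rollbackable` flag in the list response.

```go
package main

import (
	"errors"
	"fmt"
)

// deployHistoryEntry carries only the fields the guards need.
// Field names are assumed to mirror the deploy_history columns.
type deployHistoryEntry struct {
	ID         int64
	WorkloadID string
	SourceKind string
	Reference  string
	Outcome    string
}

// rollbackCapable is the single source of truth for which source kinds
// can be rolled back; Phase 1 allows only image.
func rollbackCapable(sourceKind string) bool {
	return sourceKind == "image"
}

var (
	errWrongWorkload   = errors.New("deploy entry belongs to a different workload")
	errNotRollbackable = errors.New("deploy entry is not a valid rollback target")
)

// checkRollback returns nil when the entry may be replayed for the
// workload named in the request path.
func checkRollback(pathWorkloadID string, e deployHistoryEntry) error {
	if e.WorkloadID != pathWorkloadID {
		return errWrongWorkload // no cross-workload replay
	}
	if e.Outcome != "success" || e.Reference == "" || !rollbackCapable(e.SourceKind) {
		return errNotRollbackable
	}
	return nil
}

func main() {
	e := deployHistoryEntry{ID: 7, WorkloadID: "w1", SourceKind: "image", Reference: "v2", Outcome: "success"}
	fmt.Println(checkRollback("w1", e) == nil, checkRollback("w2", e))
}
```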
|
||||
|
||||
## Frontend changes (`web/`)
|
||||
|
||||
- **`DeployHistoryPanel.svelte`** (new, in `lib/components/`): table of entries —
|
||||
short reference, reason badge, `outcome` `StatusBadge` (ok/bad), `triggered_by`,
|
||||
relative time. For `rollbackable` rows a **Roll back** button → `ConfirmDialog`
|
||||
("Roll back <name> to <reference>?") → `POST …/rollback {deploy_id}` → `Toast` +
|
||||
refresh history and container state. Loading via `Skeleton`; `EmptyState` when no
|
||||
rows. Reuses existing components only.
|
||||
- Mount the panel on **`/apps/[id]`** alongside the activity timeline (it is the
|
||||
*structured, actionable* sibling of the free-text timeline).
|
||||
- **i18n:** add keys under a `deployHistory.*` namespace to **both**
`web/src/lib/i18n/en.json` and `ru.json` (parity is mandatory but a mismatch is not a
build error, so verify manually per CLAUDE.md).
|
||||
- API client: add `listDeploys(id, params)` and `rollback(id, deployId)` to the existing
|
||||
workload API module.
|
||||
|
||||
## Testing
|
||||
|
||||
- **Store:** `deploy_history_test.go` — insert/list ordering, get, prune-keeps-newest,
|
||||
cascade delete with workload.
|
||||
- **Deployer:** extend `deployer` tests — `DispatchPlugin` writes one `success` row and
|
||||
one `failure` row (with sanitized error); reconcile/teardown write none. Resolver unit
|
||||
test (`deploy_ref_test.go`) for the image read-back + empty fallbacks.
|
||||
- **API:** rollback guards — cross-workload id → 400; non-success/empty-ref/
|
||||
non-image → 400; happy path → 202 and a `rollback`-reason history row appears.
|
||||
- **Web:** keep it light (the panel is mostly presentational); a `sourceForms`-style
|
||||
pure-logic unit only if a non-trivial helper emerges.
|
||||
- Gates: `go build ./...`, `go vet ./internal/...`, `go test ./internal/...`,
|
||||
`cd web && npm run check && npm run test`, then `./scripts/dev-server.sh`.
|
||||
|
||||
## Risks / mitigations
|
||||
|
||||
- **Recording must never break a deploy** → best-effort insert, errors only logged
|
||||
(matches existing `EmitDeployEvent` / pre-deploy-backup contracts).
|
||||
- **Secret leakage via `error`** → store only the fixed generic marker, never a capped
copy of the raw error; raw error to `slog` only.
|
||||
- **Unbounded growth** → `PruneDeployHistory` keeps newest N per workload.
|
||||
- **Rollback to a vanished image tag** → the image source's `PullImage` fails and its
|
||||
own failure-rollback leaves the live container untouched; the rollback attempt is
|
||||
recorded as `failure`. No special handling needed.
|
||||
- **No-op rollback (target already running, `MaxInstances>1`)** → image short-circuit
|
||||
returns `nil`; recorded as `success`. Acceptable.
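
The secret-leakage mitigation above amounts to a very small function, sketched here. The marker string is the one quoted in this plan; the function name, the log message, and the sample error are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"log/slog"
)

// genericDeployError is the only failure text that is ever persisted.
const genericDeployError = "deploy failed (see server logs)"

// errSample stands in for a raw deploy error that echoes credentials.
var errSample = errors.New("pull failed: registry auth header rejected")

// historyError returns the value to store in deploy_history.error.
// The raw error is sent to the structured log and deliberately dropped
// from the return value, so it cannot reach the database.
func historyError(workloadID string, err error) string {
	if err == nil {
		return ""
	}
	slog.Error("deploy failed", "workload_id", workloadID, "err", err)
	return genericDeployError
}

func main() {
	fmt.Println(historyError("w1", errSample))
}
```

Because the function never derives its return value from `err`, a non-leak test only has to assert that the stored string equals the marker, whatever the raw error contained.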
|
||||
|
||||
## Rollout
|
||||
|
||||
Single PR. Additive migration (no destructive DDL). No settings changes. Backward
|
||||
compatible: existing workloads simply start accumulating history on their next deploy.
|
||||
@@ -0,0 +1,84 @@
|
||||
# Per-Workload Metrics Graph — Implementation Plan
|
||||
|
||||
**Status:** planned · **Feature rank:** #2 · **Date:** 2026-06-19
|
||||
|
||||
## Problem
|
||||
|
||||
Stats are collected per container (`container_stats_samples`, CPU/mem/net/disk) and
|
||||
charted **globally** on the dashboard (`SystemResourcesCard` + `ResourceChart`), but
|
||||
`/apps/[id]` shows only live snapshots — there's no per-workload "is my app leaking
|
||||
memory / pegging CPU over the last few hours" view. This is a daily question and the
|
||||
data already exists; we just need a per-workload query + a panel that reuses the chart.
|
||||
|
||||
## Verified facts
|
||||
|
||||
- `ContainerStatsSample.OwnerID` == the **container row id** (`containers.id`), confirmed
|
||||
by `lookupInstanceName` → `GetContainerByID(sm.OwnerID)` in
|
||||
[stats_history.go](../../internal/api/stats_history.go). `OwnerType` ∈ {instance, site}.
|
||||
- Each sample's `ts` is that container's own Docker-stats `Timestamp.Unix()`
|
||||
([collector.go](../../internal/stats/collector.go)) — NOT one shared tick stamp. In a
|
||||
multi-container tick the per-second truncation usually collapses them to the same
|
||||
integer `ts`, so per-`ts` aggregation works; a ±1s split at a second boundary is
|
||||
cosmetic for a trend line. (Reviewer-corrected.) The handler 404s on an unknown
|
||||
workload id but returns `[]` for a known workload with no samples yet.
|
||||
- `ResourceChart.svelte` takes a fully-built `EChartsOption` from the parent; the parent
|
||||
owns series/axes (see `SystemResourcesCard`). Reads stay available when Docker is down
|
||||
(samples come from SQLite, not the daemon).
|
||||
- Per-workload reads (`/events`, `/runtime-state`) are open to any authenticated user;
|
||||
this endpoint follows suit (no `AdminOnly`).
|
||||
|
||||
## Backend
|
||||
|
||||
1. **Store** — `ListContainerStatsSamplesByWorkload(workloadID string, sinceTS int64)`:
|
||||
   ```sql
   SELECT cs.container_id, cs.owner_type, cs.owner_id, cs.ts,
          cs.cpu_percent, cs.memory_usage, cs.memory_limit,
          cs.network_rx, cs.network_tx, cs.block_read, cs.block_write
   FROM container_stats_samples cs
   JOIN containers c ON c.id = cs.owner_id
   WHERE c.workload_id = ? AND cs.ts >= ?
   ORDER BY cs.ts ASC
   ```
|
||||
Returns `[]ContainerStatsSample`.
|
||||
|
||||
2. **API** — `getWorkloadStatsHistory` (GET `/api/workloads/{id}/stats/history?window=`):
|
||||
reuse `parseWindow`/`sinceTimestamp`; aggregate samples **per ts** into a compact
|
||||
series so multi-container workloads (compose) sum correctly:
|
||||
   ```go
   type workloadStatsPoint struct {
   	TS          int64   `json:"ts"`
   	CPUPercent  float64 `json:"cpu_percent"`  // sum across the workload's containers
   	MemoryUsage int64   `json:"memory_usage"` // sum bytes
   	MemoryLimit int64   `json:"memory_limit"` // max (effective ceiling)
   }
   ```
|
||||
   For a known workload, always returns `[]` (never 503) when there is nothing to show:
   stats disabled, Docker was down, or the workload is new. An unknown workload id is a
   404. Register in the `/workloads/{id}` route block.
|
||||
|
||||
3. **Tests** — store: join scopes to the right workload (A's samples ≠ B's); API:
|
||||
per-ts aggregation sums two containers at the same tick.
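
The per-`ts` aggregation described in the API step could be implemented along these lines. This is a sketch: `sample` is a trimmed, hypothetical stand-in for `ContainerStatsSample`, and `aggregateByTS` is an assumed name. CPU and memory usage are summed, and the memory limit takes the maximum, as the struct comments in this plan specify.

```go
package main

import (
	"fmt"
	"sort"
)

// sample is a trimmed stand-in for ContainerStatsSample (assumed fields).
type sample struct {
	TS          int64
	CPUPercent  float64
	MemoryUsage int64
	MemoryLimit int64
}

type workloadStatsPoint struct {
	TS          int64   `json:"ts"`
	CPUPercent  float64 `json:"cpu_percent"`
	MemoryUsage int64   `json:"memory_usage"`
	MemoryLimit int64   `json:"memory_limit"`
}

// aggregateByTS folds all samples sharing a timestamp into one point and
// returns the points in ascending ts order. The result is never nil, so
// it encodes as [] rather than null when there are no samples.
func aggregateByTS(samples []sample) []workloadStatsPoint {
	byTS := make(map[int64]*workloadStatsPoint)
	for _, s := range samples {
		p, ok := byTS[s.TS]
		if !ok {
			p = &workloadStatsPoint{TS: s.TS}
			byTS[s.TS] = p
		}
		p.CPUPercent += s.CPUPercent
		p.MemoryUsage += s.MemoryUsage
		if s.MemoryLimit > p.MemoryLimit {
			p.MemoryLimit = s.MemoryLimit
		}
	}
	out := make([]workloadStatsPoint, 0, len(byTS))
	for _, p := range byTS {
		out = append(out, *p)
	}
	sort.Slice(out, func(i, j int) bool { return out[i].TS < out[j].TS })
	return out
}

func main() {
	points := aggregateByTS([]sample{
		{TS: 100, CPUPercent: 1.5, MemoryUsage: 10, MemoryLimit: 0},
		{TS: 100, CPUPercent: 2.5, MemoryUsage: 20, MemoryLimit: 512},
		{TS: 115, CPUPercent: 0.5, MemoryUsage: 5, MemoryLimit: 0},
	})
	fmt.Printf("%+v\n", points)
}
```

Since the store query already orders by `ts`, the final sort is only needed because map iteration order is random; a single pass over the ordered rows would work as well.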
|
||||
|
||||
## Frontend
|
||||
|
||||
4. **api.ts** — `WorkloadStatsPoint` type + `fetchWorkloadStatsHistory(id, window, signal)`.
|
||||
5. **`WorkloadMetricsPanel.svelte`** — window selector (30m / 2h / 6h), fetch + 15s poll
|
||||
(mirror `SystemResourcesCard`), build an `EChartsOption` with **two series**: CPU %
|
||||
on the left axis, Memory (MiB) on the right axis (an absolute value rather than a
percentage, because `memory_limit` is often 0/unlimited so a % would divide by zero).
`EmptyState` or a hint
|
||||
when there are no samples. Render via `ResourceChart`. Mount on `/apps/[id]` near the
|
||||
deploy-history panel.
|
||||
6. **i18n** — `apps.detail.metrics.*` in both en.json and ru.json (parity mandatory).
|
||||
|
||||
## Risks / mitigations
|
||||
|
||||
- **Docker down / stats disabled** → empty series, friendly hint (no error). SQLite read
|
||||
path is independent of the daemon.
|
||||
- **memory_limit = 0 (unlimited)** → plot absolute MiB, not %, to avoid div-by-zero.
|
||||
- **Sparse sampling** → chart shows whatever ticks exist; window selector lets the user
|
||||
widen. No interpolation.
|
||||
- **Auth** → read-only, any authenticated user (consistent with other per-workload reads).
|
||||
|
||||
## Rollout
|
||||
|
||||
Single change set, additive, no migration. Reuses the existing `echarts` dependency and
|
||||
`ResourceChart` component.
|
||||