tiny-forge/docs/plans/WORKLOAD_METRICS_GRAPH_PLAN.md
alexei.dolgolyov 0c4c338bfe feat(apps): per-workload deploy history, rollback, and resource metrics
Two additions to the app detail page, each backed by a per-workload
endpoint.

Deploy history + rollback:
- New deploy_history table — a structured, version-pinned ledger of every
  dispatch (success AND failure), distinct from the free-text event_log.
  Recorded at the single DispatchPlugin choke point so every source kind
  is covered. The raw deploy error is never persisted (it can carry
  registry-auth / compose-stdout secrets) — only a generic marker, with
  detail going to slog. Pruned to the newest N per workload; cascade-
  deleted with the workload.
- GET /api/workloads/{id}/deploys lists the ledger; POST .../rollback
  (admin) replays a prior successful deploy's pinned reference as a
  rollback-reason dispatch. Phase 1 is image-source only (RollbackCapable);
  git-built sources need checkout-by-commit, a later phase.
- DeployHistoryPanel.svelte renders the ledger with confirm-gated rollback.

Per-workload metrics:
- ListContainerStatsSamplesByWorkload joins the existing container stats
  samples through the containers index; GET /api/workloads/{id}/stats/history
  aggregates CPU/memory per timestamp across the workload's containers.
- WorkloadMetricsPanel.svelte reuses ResourceChart (CPU% + memory MiB,
  windowed, 15s poll).

en/ru i18n added with parity. Tests: store CRUD + cascade + workload-scoped
join, deployer recording (incl. secret-non-leak on failure), API rollback
guards, and per-timestamp aggregation. Plans under docs/plans/.
2026-06-19 16:22:12 +03:00


# Per-Workload Metrics Graph — Implementation Plan
**Status:** planned · **Feature rank:** #2 · **Date:** 2026-06-19
## Problem
Stats are collected per container (`container_stats_samples`, CPU/mem/net/disk) and
charted **globally** on the dashboard (`SystemResourcesCard` + `ResourceChart`), but
`/apps/[id]` shows only live snapshots — there's no per-workload "is my app leaking
memory / pegging CPU over the last few hours" view. This is a daily question and the
data already exists; we just need a per-workload query + a panel that reuses the chart.
## Verified facts
- `ContainerStatsSample.OwnerID` == the **container row id** (`containers.id`), confirmed
by `lookupInstanceName` → `GetContainerByID(sm.OwnerID)` in
[stats_history.go](../../internal/api/stats_history.go). `OwnerType` ∈ {instance, site}.
- Each sample's `ts` is that container's own Docker-stats `Timestamp.Unix()`
([collector.go](../../internal/stats/collector.go)) — NOT one shared tick stamp. In a
multi-container tick the per-second truncation usually collapses them to the same
integer `ts`, so per-`ts` aggregation works; a ±1s split at a second boundary is
cosmetic for a trend line. (Reviewer-corrected.) The handler 404s on an unknown
workload id but returns `[]` for a known workload with no samples yet.
- `ResourceChart.svelte` takes a fully-built `EChartsOption` from the parent; the parent
owns series/axes (see `SystemResourcesCard`). Reads stay available when Docker is down
(samples come from SQLite, not the daemon).
- Per-workload reads (`/events`, `/runtime-state`) are open to any authenticated user;
this endpoint follows suit (no `AdminOnly`).
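
The `[]` response for a known workload with no samples has one Go-specific pitfall: `encoding/json` marshals a nil slice as `null`, not `[]`. A minimal sketch of guarding against that, assuming a hypothetical helper named `encodePoints` (not part of the existing code) and the point type defined in the Backend section:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// workloadStatsPoint mirrors the response type proposed in this plan.
type workloadStatsPoint struct {
	TS          int64   `json:"ts"`
	CPUPercent  float64 `json:"cpu_percent"`
	MemoryUsage int64   `json:"memory_usage"`
	MemoryLimit int64   `json:"memory_limit"`
}

// encodePoints is a hypothetical helper: it swaps a nil slice for an
// empty one so the JSON body is "[]" rather than "null".
func encodePoints(points []workloadStatsPoint) ([]byte, error) {
	if points == nil {
		points = []workloadStatsPoint{}
	}
	return json.Marshal(points)
}

func main() {
	body, err := encodePoints(nil)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body)) // prints []
}
```

With this guard the frontend can always treat the body as an array and decide on the empty-state hint from its length.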
## Backend
1. **Store**: `ListContainerStatsSamplesByWorkload(workloadID string, sinceTS int64)` runs:
```sql
SELECT cs.container_id, cs.owner_type, cs.owner_id, cs.ts,
       cs.cpu_percent, cs.memory_usage, cs.memory_limit,
       cs.network_rx, cs.network_tx, cs.block_read, cs.block_write
FROM container_stats_samples cs
JOIN containers c ON c.id = cs.owner_id
WHERE c.workload_id = ? AND cs.ts >= ?
ORDER BY cs.ts ASC
```
Returns `[]ContainerStatsSample`.
2. **API** — `getWorkloadStatsHistory` (GET `/api/workloads/{id}/stats/history?window=`):
reuse `parseWindow`/`sinceTimestamp`; aggregate samples **per ts** into a compact
series so multi-container workloads (compose) sum correctly:
```go
type workloadStatsPoint struct {
	TS          int64   `json:"ts"`
	CPUPercent  float64 `json:"cpu_percent"`  // sum across the workload's containers
	MemoryUsage int64   `json:"memory_usage"` // sum bytes
	MemoryLimit int64   `json:"memory_limit"` // max (effective ceiling)
}
```
For a known workload, always returns a JSON array (never 503): it is `[]` when stats are
disabled, Docker was down, or the workload is new. An unknown workload id still gets a 404.
Register in the `/workloads/{id}` route block.
3. **Tests** — store: join scopes to the right workload (A's samples ≠ B's); API:
per-ts aggregation sums two containers at the same tick.
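
The per-`ts` aggregation in step 2 can be sketched as follows. This is an illustration under stated assumptions, not the actual handler code: `aggregateByTS` is a hypothetical name, and `ContainerStatsSample` is trimmed here to the four fields the aggregation reads.

```go
package main

import (
	"fmt"
	"sort"
)

// ContainerStatsSample is a trimmed stand-in for the real struct,
// which also carries owner and network / block I/O fields.
type ContainerStatsSample struct {
	TS          int64
	CPUPercent  float64
	MemoryUsage int64
	MemoryLimit int64
}

type workloadStatsPoint struct {
	TS          int64   `json:"ts"`
	CPUPercent  float64 `json:"cpu_percent"`
	MemoryUsage int64   `json:"memory_usage"`
	MemoryLimit int64   `json:"memory_limit"`
}

// aggregateByTS (hypothetical) folds per-container samples into one
// point per timestamp: CPU and memory usage are summed, memory limit
// takes the maximum. The result is non-nil and sorted by TS ascending.
func aggregateByTS(samples []ContainerStatsSample) []workloadStatsPoint {
	byTS := make(map[int64]*workloadStatsPoint)
	for _, s := range samples {
		p, ok := byTS[s.TS]
		if !ok {
			p = &workloadStatsPoint{TS: s.TS}
			byTS[s.TS] = p
		}
		p.CPUPercent += s.CPUPercent
		p.MemoryUsage += s.MemoryUsage
		if s.MemoryLimit > p.MemoryLimit {
			p.MemoryLimit = s.MemoryLimit
		}
	}
	out := make([]workloadStatsPoint, 0, len(byTS))
	for _, p := range byTS {
		out = append(out, *p)
	}
	sort.Slice(out, func(i, j int) bool { return out[i].TS < out[j].TS })
	return out
}

func main() {
	samples := []ContainerStatsSample{
		{TS: 100, CPUPercent: 12.5, MemoryUsage: 1 << 20, MemoryLimit: 0},
		{TS: 100, CPUPercent: 7.5, MemoryUsage: 2 << 20, MemoryLimit: 4 << 20},
		{TS: 101, CPUPercent: 3, MemoryUsage: 1 << 20, MemoryLimit: 0},
	}
	for _, p := range aggregateByTS(samples) {
		fmt.Printf("%+v\n", p)
	}
}
```

Grouping through a map keeps the function independent of input order, although the store query already returns rows sorted by `ts`. The same function doubles as the unit under test for the "two containers at the same tick" case in step 3.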
## Frontend
4. **api.ts** — `WorkloadStatsPoint` type + `fetchWorkloadStatsHistory(id, window, signal)`.
5. **`WorkloadMetricsPanel.svelte`** — window selector (30m / 2h / 6h), fetch + 15s poll
(mirror `SystemResourcesCard`), build an `EChartsOption` with **two series**: CPU %
on the left axis, Memory (MiB) on the right axis (an absolute value rather than a %, because
`memory_limit` is often 0/unlimited, so a % would divide by zero). `EmptyState` / hint
when there are no samples. Render via `ResourceChart`. Mount on `/apps/[id]` near the
deploy-history panel.
6. **i18n** — `apps.detail.metrics.*` in both en.json and ru.json (parity mandatory).
## Risks / mitigations
- **Docker down / stats disabled** → empty series, friendly hint (no error). SQLite read
path is independent of the daemon.
- **memory_limit = 0 (unlimited)** → plot absolute MiB, not %, to avoid div-by-zero.
- **Sparse sampling** → chart shows whatever ticks exist; window selector lets the user
widen. No interpolation.
- **Auth** → read-only, any authenticated user (consistent with other per-workload reads).
## Rollout
Single change set, additive, no migration. Reuses the existing `echarts` dependency and
`ResourceChart` component.