tiny-forge/docs/plans/WORKLOAD_METRICS_GRAPH_PLAN.md
alexei.dolgolyov 0c4c338bfe feat(apps): per-workload deploy history, rollback, and resource metrics
Two additions to the app detail page, each backed by a per-workload
endpoint.

Deploy history + rollback:
- New deploy_history table — a structured, version-pinned ledger of every
  dispatch (success AND failure), distinct from the free-text event_log.
  Recorded at the single DispatchPlugin choke point so every source kind
  is covered. The raw deploy error is never persisted (it can carry
  registry-auth / compose-stdout secrets) — only a generic marker, with
  detail going to slog. Pruned to the newest N per workload; cascade-
  deleted with the workload.
- GET /api/workloads/{id}/deploys lists the ledger; POST .../rollback
  (admin) replays a prior successful deploy's pinned reference as a
  rollback-reason dispatch. Phase 1 is image-source only (RollbackCapable);
  git-built sources need checkout-by-commit, a later phase.
- DeployHistoryPanel.svelte renders the ledger with confirm-gated rollback.

Per-workload metrics:
- ListContainerStatsSamplesByWorkload joins the existing container stats
  samples through the containers index; GET /api/workloads/{id}/stats/history
  aggregates CPU/memory per timestamp across the workload's containers.
- WorkloadMetricsPanel.svelte reuses ResourceChart (CPU% + memory MiB,
  windowed, 15s poll).

en/ru i18n added with parity. Tests: store CRUD + cascade + workload-scoped
join, deployer recording (incl. secret-non-leak on failure), API rollback
guards, and per-timestamp aggregation. Plans under docs/plans/.
2026-06-19 16:22:12 +03:00


Per-Workload Metrics Graph — Implementation Plan

Status: planned · Feature rank: #2 · Date: 2026-06-19

Problem

Stats are collected per container (container_stats_samples, CPU/mem/net/disk) and charted globally on the dashboard (SystemResourcesCard + ResourceChart), but /apps/[id] shows only live snapshots — there's no per-workload "is my app leaking memory / pegging CPU over the last few hours" view. This is a daily question and the data already exists; we just need a per-workload query + a panel that reuses the chart.

Verified facts

  • ContainerStatsSample.OwnerID == the container row id (containers.id), confirmed by lookupInstanceName → GetContainerByID(sm.OwnerID) in stats_history.go. OwnerType ∈ {instance, site}.
  • Each sample's ts is that container's own Docker-stats Timestamp.Unix() (collector.go) — NOT one shared tick stamp. In a multi-container tick the per-second truncation usually collapses them to the same integer ts, so per-ts aggregation works; a ±1s split at a second boundary is cosmetic for a trend line. (Reviewer-corrected.) The handler 404s on an unknown workload id but returns [] for a known workload with no samples yet.
  • ResourceChart.svelte takes a fully-built EChartsOption from the parent; the parent owns series/axes (see SystemResourcesCard). Reads stay available when Docker is down (samples come from SQLite, not the daemon).
  • Per-workload reads (/events, /runtime-state) are open to any authenticated user; this endpoint follows suit (no AdminOnly).

Backend

  1. Store: ListContainerStatsSamplesByWorkload(workloadID string, sinceTS int64):

    SELECT cs.container_id, cs.owner_type, cs.owner_id, cs.ts,
           cs.cpu_percent, cs.memory_usage, cs.memory_limit,
           cs.network_rx, cs.network_tx, cs.block_read, cs.block_write
    FROM container_stats_samples cs
    JOIN containers c ON c.id = cs.owner_id
    WHERE c.workload_id = ? AND cs.ts >= ?
    ORDER BY cs.ts ASC
    

    Returns []ContainerStatsSample.

  2. API: getWorkloadStatsHistory (GET /api/workloads/{id}/stats/history?window=): reuse parseWindow/sinceTimestamp, then aggregate samples per ts into a compact series so multi-container workloads (compose) sum correctly:

    type workloadStatsPoint struct {
        TS          int64   `json:"ts"`
        CPUPercent  float64 `json:"cpu_percent"`   // sum across the workload's containers
        MemoryUsage int64   `json:"memory_usage"`  // sum bytes
        MemoryLimit int64   `json:"memory_limit"`  // max (effective ceiling)
    }
    

    For a known workload the handler always returns a JSON array (never 503); the array is empty when stats are disabled, Docker was down, or the workload is new. An unknown workload id still yields 404. Register in the /workloads/{id} route block.

  3. Tests — store: join scopes to the right workload (A's samples ≠ B's); API: per-ts aggregation sums two containers at the same tick.
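
The per-ts aggregation in step 2 could look roughly like the sketch below. The sample type and the aggregateByTS and encodePoints helpers are assumptions made for illustration; only workloadStatsPoint comes from the plan. The sketch also returns a non-nil slice, because encoding/json encodes a nil slice as null rather than [].

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// sample carries only the ContainerStatsSample fields the aggregation
// needs. The field names are assumptions for this sketch.
type sample struct {
	TS          int64
	CPUPercent  float64
	MemoryUsage int64
	MemoryLimit int64
}

type workloadStatsPoint struct {
	TS          int64   `json:"ts"`
	CPUPercent  float64 `json:"cpu_percent"`  // sum across the workload's containers
	MemoryUsage int64   `json:"memory_usage"` // sum bytes
	MemoryLimit int64   `json:"memory_limit"` // max (effective ceiling)
}

// aggregateByTS folds all samples sharing a ts into one point:
// CPU and memory usage are summed, the memory limit is the maximum.
func aggregateByTS(samples []sample) []workloadStatsPoint {
	byTS := make(map[int64]*workloadStatsPoint, len(samples))
	for _, s := range samples {
		p, ok := byTS[s.TS]
		if !ok {
			p = &workloadStatsPoint{TS: s.TS}
			byTS[s.TS] = p
		}
		p.CPUPercent += s.CPUPercent
		p.MemoryUsage += s.MemoryUsage
		if s.MemoryLimit > p.MemoryLimit {
			p.MemoryLimit = s.MemoryLimit
		}
	}
	// Allocate a non-nil slice so an empty result encodes as [] and not null.
	out := make([]workloadStatsPoint, 0, len(byTS))
	for _, p := range byTS {
		out = append(out, *p)
	}
	// Map iteration order is random, so restore ascending ts order.
	sort.Slice(out, func(i, j int) bool { return out[i].TS < out[j].TS })
	return out
}

// encodePoints returns the JSON body the endpoint would send.
func encodePoints(points []workloadStatsPoint) (string, error) {
	b, err := json.Marshal(points)
	return string(b), err
}

func main() {
	points := aggregateByTS([]sample{
		{TS: 100, CPUPercent: 12.5, MemoryUsage: 1 << 20, MemoryLimit: 0},
		{TS: 100, CPUPercent: 7.5, MemoryUsage: 2 << 20, MemoryLimit: 4 << 20},
		{TS: 115, CPUPercent: 3, MemoryUsage: 1 << 20, MemoryLimit: 0},
	})
	body, _ := encodePoints(points)
	fmt.Println(body)

	empty, _ := encodePoints(aggregateByTS(nil))
	fmt.Println(empty) // []
}
```

Grouping through a map keeps the function independent of input order, at the cost of a final sort; since the store query already orders by ts, a single linear pass would be a reasonable alternative.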

Frontend

  1. api.ts: WorkloadStatsPoint type + fetchWorkloadStatsHistory(id, window, signal).
  2. WorkloadMetricsPanel.svelte: window selector (30m / 2h / 6h), fetch + 15s poll (mirroring SystemResourcesCard), and an EChartsOption with two series: CPU % on the left axis and memory in MiB on the right axis. Memory is plotted as an absolute value rather than a percentage because memory_limit is often 0 (unlimited), so a percentage would divide by zero. Show an EmptyState hint when there are no samples. Render via ResourceChart. Mount on /apps/[id] near the deploy-history panel.
  3. i18n: apps.detail.metrics.* in both en.json and ru.json (parity mandatory).

Risks / mitigations

  • Docker down / stats disabled → empty series, friendly hint (no error). SQLite read path is independent of the daemon.
  • memory_limit = 0 (unlimited) → plot absolute MiB, not %, to avoid div-by-zero.
  • Sparse sampling → chart shows whatever ticks exist; window selector lets the user widen. No interpolation.
  • Auth → read-only, any authenticated user (consistent with other per-workload reads).
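
The window selector ultimately becomes the sinceTS lower bound of the store query. The sketch below shows one way that mapping might work; sinceForWindow and allowedWindows are hypothetical names, and the project's existing parseWindow/sinceTimestamp helpers may behave differently (for example, by defaulting instead of rejecting an unknown window).

```go
package main

import (
	"fmt"
	"time"
)

// allowedWindows mirrors the panel's selector options (30m / 2h / 6h).
// Hypothetical: the real parseWindow may accept other values.
var allowedWindows = map[string]time.Duration{
	"30m": 30 * time.Minute,
	"2h":  2 * time.Hour,
	"6h":  6 * time.Hour,
}

// sinceForWindow returns the Unix-seconds lower bound for a window that
// ends at nowUnix. The bool is false for an unsupported window, leaving
// the caller to decide whether to reject the request or fall back.
func sinceForWindow(window string, nowUnix int64) (int64, bool) {
	d, ok := allowedWindows[window]
	if !ok {
		return 0, false
	}
	return nowUnix - int64(d/time.Second), true
}

func main() {
	now := time.Now().Unix()
	for _, w := range []string{"30m", "2h", "6h", "7d"} {
		since, ok := sinceForWindow(w, now)
		fmt.Println(w, since, ok)
	}
}
```

Widening the window only moves the lower bound back; it never invents points, which matches the "no interpolation" choice for sparse sampling.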

Rollout

Single change set, additive, no migration. Reuses the existing echarts dependency and ResourceChart component.