Files
tiny-forge/docs/plans/workload-refactor.md
T
alexei.dolgolyov cba2149aa9 refactor(workload): finalize containers index + post-review hardening
Wraps up the workload refactor with the fixes that came out of the multi-agent
code review (see docs/plans/workload-refactor.md "What actually shipped").

Backend:
- store.ReconcileContainer: separate write path so the 30s reconciler tick no
  longer overwrites deployer-owned fields (subdomain, proxy_route_id,
  npm_proxy_id, image_tag).
- Container.stage_id column + index; ListProxyRoutes / ListContainersByStageID
  join via stage_id (survives stage rename), with legacy fallback to
  (project_id, role=stage_name).
- Reconciler: workload-existence check (rejects forged tinyforge.workload.id
  labels), skips inventing project-kind rows, child-context cancel before
  wg.Wait() on shutdown.
- Transactional CRUD across projects / stacks / static_sites: parent UPDATE
  and workload sync land in one transaction so secret rotations are durable.
- Webhook routing reads exclusively through workloads.webhook_secret; legacy
  GetProjectByWebhookSecret / GetStaticSiteByWebhookSecret fallback removed.
- store.GetStackByComposeProjectName + indexed lookup (no more full-table
  stack scan per compose container per tick).
- store.ListMissingSweepRows: filtered query for the missing-sweep.
- /api/instances/* handlers verify (workload_id, role) match URL
  (project_id, stage_name) before mutating — closes the cross-project
  hijack the security review flagged.
- extra_json no longer referenced from Go (column kept on disk for now).

Frontend:
- WorkloadContainers.svelte: generic detail-page panel reusable by stack and
  site detail pages.
- Containers page polish: client-side kind/state filters over an unfiltered
  fetch, URL-synced filters, race-safe loads via sequence number, EN+RU i18n,
  sidebar counter via navCounts.containers.

Misc:
- scripts/dev-server.sh: tolerate empty netstat grep result.
- .gitignore: ignore docker-watcher binaries, .claude/worktrees/, .facts-sync.json.
2026-05-09 15:44:41 +03:00

225 lines
19 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Workload Refactor — Compressed Plan
Status: Shipped (with explicit deferrals — see "What actually shipped" at the bottom)
Owner: alexei.dolgolyov
Date: 2026-05-07
Last updated: 2026-05-09 (post multi-agent review fixes)
## Goal
Unify `Project`, `Stack`, and `StaticSite` under a single `Workload` primitive, and introduce a normalized `containers` index so every Tinyforge-managed container has one canonical row. This unblocks a global Containers view today and lets future workload kinds (cron jobs, one-shot tasks, databases-as-resource, functions) plug in without another tab/store/deployer branch.
## Why this is the compressed plan
The original 8-PR plan was designed for a live system with dual-writes and soak periods. Tinyforge has no production users yet, so all defenses against live runtime state collapse: no external label consumers, no third-party CI hitting webhook URLs, no orphaned containers to recover. Everything ships in 3 PRs against a clean slate. Solo-dev reversibility is preserved by branching, not by dual-write gymnastics.
## Target architecture
- `Workload` is the unifying primitive with `kind ∈ {project, stack, site, …}`. Each existing Project/Stack/StaticSite becomes a Workload row.
- `containers` is a normalized index: every Tinyforge-managed container has one row with `workload_id`, `workload_kind`, `role`, Docker container ID, host, state, last_seen.
- Optional `apps` table (thin nullable `app_id` on Workload) added empty; UI gated behind a feature flag, defer indefinitely until pull.
- Stable Docker labels: `tinyforge.workload.id`, `tinyforge.workload.kind`, `tinyforge.role`, `tinyforge.managed`. Legacy `tinyforge.project` / `tinyforge.stage` / `tinyforge.instance-id` are removed in the same wave.
- Global `/containers` UI route; per-workload container panel becomes a shared `<WorkloadContainers>` component reused by project, stack, and site detail pages.
## Schema
Appended to `internal/store/store.go::runMigrations()` as additive `CREATE TABLE` statements (idempotent via `CREATE TABLE IF NOT EXISTS`).
```sql
CREATE TABLE IF NOT EXISTS workloads (
id TEXT PRIMARY KEY,
kind TEXT NOT NULL, -- 'project' | 'stack' | 'site'
ref_id TEXT NOT NULL, -- FK into projects/stacks/static_sites by kind
name TEXT NOT NULL,
app_id TEXT, -- nullable FK into apps.id
notification_url TEXT NOT NULL DEFAULT '',
notification_secret TEXT NOT NULL DEFAULT '',
webhook_secret TEXT NOT NULL DEFAULT '',
webhook_signing_secret TEXT NOT NULL DEFAULT '',
webhook_require_signature INTEGER NOT NULL DEFAULT 0,
created_at TEXT NOT NULL,
updated_at TEXT NOT NULL,
UNIQUE(kind, ref_id)
);
CREATE INDEX IF NOT EXISTS idx_workloads_app_id ON workloads(app_id);
CREATE INDEX IF NOT EXISTS idx_workloads_kind ON workloads(kind);
CREATE TABLE IF NOT EXISTS containers (
id TEXT PRIMARY KEY,
workload_id TEXT NOT NULL,
workload_kind TEXT NOT NULL, -- denormalized for filtered queries
role TEXT NOT NULL, -- stage name (project), service name (stack), '' (site)
container_id TEXT NOT NULL DEFAULT '', -- Docker ID, '' between create+start
image_ref TEXT NOT NULL DEFAULT '',
host TEXT NOT NULL DEFAULT 'local',
state TEXT NOT NULL DEFAULT '', -- running | stopped | failed | removing | missing
port INTEGER NOT NULL DEFAULT 0,
last_seen_at TEXT NOT NULL DEFAULT '',
extra_json TEXT NOT NULL DEFAULT '{}', -- {subdomain, npm_proxy_id, proxy_route_id, ...}
created_at TEXT NOT NULL,
updated_at TEXT NOT NULL
);
CREATE INDEX IF NOT EXISTS idx_containers_workload ON containers(workload_id);
CREATE INDEX IF NOT EXISTS idx_containers_state ON containers(state);
CREATE INDEX IF NOT EXISTS idx_containers_container_id ON containers(container_id);
CREATE TABLE IF NOT EXISTS apps (
id TEXT PRIMARY KEY,
name TEXT NOT NULL,
description TEXT NOT NULL DEFAULT '',
created_at TEXT NOT NULL,
updated_at TEXT NOT NULL
);
```
`extra_json` carries kind-specific fields (`subdomain`, `npm_proxy_id`, `proxy_route_id`) so the spine stays narrow. SQLite JSON1 is required for queries against `extra_json`; verify the driver in `go.mod` supports it before committing — fall back to dedicated columns if not.
## PR 1 — Spine: schema, Workload package, reconciler
Single PR, lands the data layer end-to-end. No dual-writes; project/stack/site CRUD writes directly to `workloads`.
### New files
- `internal/store/workloads.go``CreateWorkload`, `GetWorkloadByID`, `GetWorkloadByRef(kind, refID)`, `ListWorkloads`, `UpdateWorkload`, `DeleteWorkload`.
- `internal/store/containers.go``UpsertContainer`, `GetContainerByDockerID`, `ListContainersByWorkload`, `ListContainers(filter)`, `MarkContainerMissing`, new `ListProxyRoutes` (mirrors the join shape from `internal/store/instances.go::ListProxyRoutes`, reading `extra_json` via `json_extract`).
- `internal/store/apps.go` — minimal CRUD; not wired anywhere yet.
- `internal/workload/workload.go``Workload` interface (`ID`, `Kind`, `Name`, `Deploy`, `Stop`, `Start`, `Delete`, `Containers`).
- `internal/workload/adapters/project_adapter.go` — wraps `internal/deployer`.
- `internal/workload/adapters/stack_adapter.go` — wraps `internal/stack/manager.go`.
- `internal/workload/adapters/site_adapter.go` — wraps `internal/staticsite/manager.go`.
- `internal/reconciler/reconciler.go` — single writer to `containers`. Reads `docker ps --filter label=tinyforge.managed`, groups by `(workload.id, role)`, upserts rows, marks absent rows `state='missing'`. Boot-time one-shot run + 30s tick.
- `internal/reconciler/reconciler_test.go` — table-driven tests with a fake Docker client.
### Modified files
- `internal/store/store.go::runMigrations` — append the three `CREATE TABLE` statements (after line ~165 where the existing migrations end).
- `internal/store/models.go` — add `Workload`, `Container`, `App` structs.
- `internal/store/projects.go``CreateProject`, `UpdateProject`, `DeleteProject` wrap the write in `s.db.Begin()` and also write the matching `workloads` row. Webhook/notification secret setters update `workloads.webhook_secret` / `webhook_signing_secret` / `notification_secret` directly.
- `internal/store/stacks.go` — same Workload write on `CreateStack` / `UpdateStack` / `DeleteStack`.
- `internal/store/static_sites.go` — same.
- `internal/docker/client.go` — add label constants `LabelWorkloadID`, `LabelWorkloadKind`, `LabelRole`, `LabelManaged`. **Remove** the old `LabelProject`, `LabelStage`, `LabelInstanceID` writes from the deployer.
- `internal/deployer/deployer.go` (label injection ~line 388) — emit only the new labels.
- `internal/deployer/bluegreen.go` (~line 97) — same.
- `internal/stack/manager.go` — after `docker compose up`, stamp new labels on each compose-managed container via `docker container update --label-add`. Compose's own `com.docker.compose.service` becomes `role`.
- `internal/staticsite/manager.go` — stamp new labels at container start.
- `internal/store/instances.go`**delete this file**. The deployer no longer creates instance rows; reconciler owns container state.
- `internal/api/instances.go`**delete or alias** to `/api/containers` filtered by workload. Solo dev → delete is cleaner.
- `internal/api/proxies.go` — switch the `ListProxyRoutes` import to `containers.ListProxyRoutes`.
- `internal/api/docker.go::buildActiveImagesSet` (~line 251) — replace the `ListAllInstances` walk with a single `containers.image_ref` query.
- `internal/api/stale.go`, `internal/stale/scanner.go` — read from `containers` instead of `instances`.
- `internal/webhook/matcher.go` — query `workloads.webhook_secret` directly.
- `cmd/server/main.go` — start the reconciler goroutine after `store.New`. Drop any startup code that touched `instances`.
### Tests
- Extend `internal/store/store_test.go` with `TestCreateProjectAlsoCreatesWorkload`, `TestDeleteProjectCascadesWorkload`, `TestUpsertContainerIdempotent`, `TestListProxyRoutesShape`.
- New `internal/reconciler/reconciler_test.go` with a `dockerClient` interface and a fake — assert that a slice of `types.Container` produces the expected `containers` upserts.
- Run the existing test suite under `-race`.
### Deliverable
System builds, deploys a project end-to-end, deploys a stack end-to-end, deploys a static site end-to-end. `containers` table reflects reality after each deploy and after a 30s reconciler tick. The legacy `instances` table is gone.
## PR 2 — API + frontend
### New files
- `internal/api/workloads.go``GET /api/workloads`, `GET /api/workloads/{id}`, `GET /api/workloads/{id}/containers`, `PATCH /api/workloads/{id}` (sets `app_id` and notification/webhook config).
- `internal/api/containers.go``GET /api/containers?workload_id=&kind=&state=&app_id=`, `GET /api/containers/{id}`.
- `internal/api/apps.go``GET /api/apps`, `POST /api/apps`, `PATCH /api/apps/{id}`, `DELETE /api/apps/{id}` (gated by settings flag `features.apps_grouping=true`).
- `web/src/routes/containers/+page.svelte` — global filterable table. Reuses table patterns from `web/src/routes/proxies/+page.svelte` and `web/src/routes/containers/stale/+page.svelte` (the existing `stale/` route stays untouched).
- `web/src/lib/components/WorkloadContainers.svelte` — shared container panel. Takes `workloadId` prop, hits `/api/workloads/{id}/containers`. Handles 1..N container rows.
### Modified files
- `internal/api/router.go` — register the new endpoints. Remove `/api/instances` registration.
- `web/src/routes/projects/[id]/+page.svelte` — replace the inline instance list with `<WorkloadContainers workloadId={...}/>`.
- `web/src/routes/stacks/[id]/+page.svelte` — same.
- `web/src/routes/sites/[id]/+page.svelte` — same.
- Top nav component (find under `web/src/lib/components/`) — insert a "Containers" tab between "Projects" and "Stacks". Existing tabs stay.
- `web/src/lib/api.ts` (or wherever API client functions live) — add `listWorkloads`, `getWorkload`, `listContainers`, `getContainer`, `listApps`. Remove instance-shaped helpers.
- `web/src/lib/types.ts` — add `Workload`, `Container`, `App` types. Remove `Instance` once unreferenced.
### Deliverable
User-visible: a `Containers` tab in the top nav showing every running container with kind/state/workload filters, links into the owning project/stack/site detail page, and a per-workload container panel that looks identical on all three detail pages.
## PR 3 — Polish + optional Apps UI
Defer indefinitely if no pull. Lands as a single PR when wanted.
### Scope
- Apps UI: `web/src/routes/apps/+page.svelte`, `[id]/+page.svelte`. Workload detail pages get an "App" dropdown to assign `app_id`. Gated by `features.apps_grouping=true` in settings.
- Drop any leftover dead code referencing `Instance` types.
- Documentation: update `CLAUDE.md` and `README.md` to describe the Workload model.
- Optional: consolidate `internal/deployer` and `internal/stack/manager` into a single orchestrator. **Out of scope for this refactor** — adapters wrap the existing kind-specific code and that's fine. Revisit only if the duplication starts hurting.
## What's explicitly deferred
- Deployer + stack-manager consolidation.
- Apps UI (schema added in PR 1, UI in PR 3 behind flag).
- Multi-host containers (`containers.host` exists but is always `'local'`).
- Workload-kind plugin model — the adapter registry has three hardcoded entries.
- Webhook secret handling for old per-project URLs that may already be in CI configs (no users yet → don't care).
## Risks (compressed)
- **SQLite JSON1 availability.** Verify the driver in `go.mod` supports `json_extract` before committing to `extra_json`. If not, hoist `subdomain`, `npm_proxy_id`, `proxy_route_id` to dedicated columns on `containers`.
- **`ListProxyRoutes` shape regression.** The new query reads from `containers` + `workloads` instead of `instances` + `projects` + `stages`. Worth a golden-output test before flipping `internal/api/proxies.go` over.
- **Stack containers and label stamping.** `docker container update --label-add` is required to label compose-managed containers post-up. If the local Docker engine version doesn't support it, fall back to relying on `com.docker.compose.project` + `com.docker.compose.service` for reconciler joins.
- **Boot-time backfill from `docker ps`.** First run needs to populate `containers` from currently-running containers using the legacy `tinyforge.instance-id` and `com.docker.compose.project` labels (since pre-refactor containers don't have the new labels). Solo-dev workaround: `docker compose down` test workloads, run the new binary against an empty Docker host, redeploy.
## Concrete file paths
Modified:
- `internal/store/store.go` (migrations at line ~75165)
- `internal/store/projects.go`, `stacks.go`, `static_sites.go`, `models.go`, `store_test.go`
- `internal/docker/client.go`
- `internal/deployer/deployer.go` (~line 388), `internal/deployer/bluegreen.go` (~line 97)
- `internal/stack/manager.go`, `internal/staticsite/manager.go`
- `internal/api/router.go`, `proxies.go`, `docker.go` (`buildActiveImagesSet` at line 251), `stale.go`
- `internal/stale/scanner.go`, `internal/webhook/matcher.go`
- `cmd/server/main.go`
- `web/src/routes/projects/[id]/+page.svelte`, `stacks/[id]/+page.svelte`, `sites/[id]/+page.svelte`
- `web/src/lib/api.ts`, `web/src/lib/types.ts`
- Top nav component in `web/src/lib/components/`
Created:
- `internal/store/workloads.go`, `containers.go`, `apps.go`
- `internal/workload/workload.go`, `adapters/project_adapter.go`, `adapters/stack_adapter.go`, `adapters/site_adapter.go`
- `internal/reconciler/reconciler.go`, `reconciler_test.go`
- `internal/api/workloads.go`, `containers.go`, `apps.go`
- `web/src/routes/containers/+page.svelte`
- `web/src/lib/components/WorkloadContainers.svelte`
Deleted:
- `internal/store/instances.go`
- `internal/api/instances.go`
## What actually shipped (2026-05-09)
After a multi-agent code review caught several issues, the refactor landed with the following deviations from the original plan. They are documented here so a future reader doesn't have to reconstruct them from git log.
### Deferred / dropped
- **`internal/workload/` package + adapters.** The plan called for a `Workload` interface (`Deploy`, `Stop`, `Start`, `Delete`, `Containers`) with `project_adapter.go`, `stack_adapter.go`, `site_adapter.go`. **Not built.** The adapters would have been thin pass-throughs to the existing kind-specific code; the duplication is real but small and the per-kind paths still type-check cleanly. The data-layer "Workload" (DB row) is the only Workload primitive today. Revisit if the per-kind branching becomes painful.
- **`internal/api/instances.go` URL space.** Plan said "delete or alias to /api/containers." **Kept alive** but every handler that mutates a container now calls `resolveAndAuthorizeInstance` to verify the row's `(workload_id, role)` match the URL's `(project_id, stage_name)` — closes the cross-project hijack the security review flagged. URL renaming deferred until the frontend `InstanceCard` is renamed too (next refactor wave).
- **`InstanceCard.svelte` rename.** The component is now generic enough to be `ContainerCard`, but the rename would touch 3+ call sites and i18n keys. Deferred.
- **`extra_json` SQL column.** Schema still has the column (NOT NULL DEFAULT '{}'); Go code no longer references it (struct field, scan, INSERT, UPDATE all dropped). When/if a kind-specific need surfaces, hoist a dedicated column rather than re-introducing JSON1.
### Built but not in the original plan
- **`Container.stage_id` column** + index + ListProxyRoutes / ListContainersByStageID join. Survives stage renames; the original plan joined on `stages.name = containers.role` which would orphan rows on rename. The deployer populates `stage_id` for project containers; stack/site rows leave it empty.
- **`store.ReconcileContainer`** — separate write path for the reconciler. The original `UpsertContainer` ON CONFLICT clause overwrote `subdomain`, `proxy_route_id`, `npm_proxy_id`, `image_tag` from the reconciler's empty values on every 30s tick, silently wiping deployer state. `ReconcileContainer` only updates Docker-derived fields on conflict (`container_id`, `image_ref`, `state`, `port`, `last_seen_at`, `updated_at`).
- **Workload-existence check in the reconciler** — a `tinyforge.workload.id` label that doesn't resolve to a known workload is now rejected. Anyone with Docker socket access could otherwise spawn a container with a forged label and steal the canonical row for an existing workload.
- **Project-kind row invention skipped.** When the reconciler sees a container with `tinyforge.workload.kind=project` and no existing row matches the docker container ID, it skips the upsert (deployer is the authoritative writer for project rows). Inventing a deterministic-key row would race with `MaxInstances > 1` deploys.
- **Reconciler shutdown ordering** — `Stop()` cancels its child context before `wg.Wait()` so a hung `docker ps` doesn't block process shutdown.
- **Transactional CRUD + workload sync.** Every `Create*`, `Update*`, `Delete*`, and `Set*Secret` path on `projects` / `stacks` / `static_sites` now wraps the parent UPDATE and the workload row sync in a single transaction. Closes the rotation-durability gap the security review flagged.
- **Workload-only webhook lookup.** The legacy fallback (`GetProjectByWebhookSecret`, `GetStaticSiteByWebhookSecret`) is gone — webhook routing reads exclusively through `workloads.webhook_secret`, so a rotation that didn't commit doesn't get silently accepted.
- **`store.GetStackByComposeProjectName`** + indexed lookup. Reconciler used to do a full-table stack scan per compose container per tick.
- **`store.ListMissingSweepRows`** — filtered query (`container_id != '' AND state != 'missing'`) so the missing-sweep reads only candidate rows instead of the whole index.
- **`web/src/lib/components/WorkloadContainers.svelte`** — generic detail-page panel reusable by stack and site detail pages. Project detail keeps its stage-grouped `InstanceCard` layout (containers there are sharded per-stage, not flat).
- **Containers page polish** — kind/state filters now apply client-side over an unfiltered fetch (so tab counters reflect the whole population), URL-synced filters (`?kind=stack&state=running`) for shareable links, race-safe loads via a sequence number, full i18n with EN+RU strings, and a counter badge in the sidebar via `navCounts.containers`.
- **`stage_id` migration.** New rows get `stage_id` from the deployer; legacy rows fall back to the (project_id, role=stage_name) join inside `ListContainersByStageID`.