Files
tiny-forge/docs/plans/workload-refactor.md
T
alexei.dolgolyov cba2149aa9 refactor(workload): finalize containers index + post-review hardening
Wraps up the workload refactor with the fixes that came out of the multi-agent
code review (see docs/plans/workload-refactor.md "What actually shipped").

Backend:
- store.ReconcileContainer: separate write path so the 30s reconciler tick no
  longer overwrites deployer-owned fields (subdomain, proxy_route_id,
  npm_proxy_id, image_tag).
- Container.stage_id column + index; ListProxyRoutes / ListContainersByStageID
  join via stage_id (survives stage rename), with legacy fallback to
  (project_id, role=stage_name).
- Reconciler: workload-existence check (rejects forged tinyforge.workload.id
  labels), skips inventing project-kind rows, child-context cancel before
  wg.Wait() on shutdown.
- Transactional CRUD across projects / stacks / static_sites: parent UPDATE
  and workload sync land in one transaction so secret rotations are durable.
- Webhook routing reads exclusively through workloads.webhook_secret; legacy
  GetProjectByWebhookSecret / GetStaticSiteByWebhookSecret fallback removed.
- store.GetStackByComposeProjectName + indexed lookup (no more full-table
  stack scan per compose container per tick).
- store.ListMissingSweepRows: filtered query for the missing-sweep.
- /api/instances/* handlers verify (workload_id, role) match URL
  (project_id, stage_name) before mutating — closes the cross-project
  hijack the security review flagged.
- extra_json no longer referenced from Go (column kept on disk for now).

Frontend:
- WorkloadContainers.svelte: generic detail-page panel reusable by stack and
  site detail pages.
- Containers page polish: client-side kind/state filters over an unfiltered
  fetch, URL-synced filters, race-safe loads via sequence number, EN+RU i18n,
  sidebar counter via navCounts.containers.

Misc:
- scripts/dev-server.sh: tolerate empty netstat grep result.
- .gitignore: ignore docker-watcher binaries, .claude/worktrees/, .facts-sync.json.
2026-05-09 15:44:41 +03:00

19 KiB
Raw Blame History

Workload Refactor — Compressed Plan

Status: Shipped (with explicit deferrals — see "What actually shipped" at the bottom) Owner: alexei.dolgolyov Date: 2026-05-07 Last updated: 2026-05-09 (post multi-agent review fixes)

Goal

Unify Project, Stack, and StaticSite under a single Workload primitive, and introduce a normalized containers index so every Tinyforge-managed container has one canonical row. This unblocks a global Containers view today and lets future workload kinds (cron jobs, one-shot tasks, databases-as-resource, functions) plug in without another tab/store/deployer branch.

Why this is the compressed plan

The original 8-PR plan was designed for a live system with dual-writes and soak periods. Tinyforge has no production users yet, so all defenses against live runtime state collapse: no external label consumers, no third-party CI hitting webhook URLs, no orphaned containers to recover. Everything ships in 3 PRs against a clean slate. Solo-dev reversibility is preserved by branching, not by dual-write gymnastics.

Target architecture

  • Workload is the unifying primitive with kind ∈ {project, stack, site, …}. Each existing Project/Stack/StaticSite becomes a Workload row.
  • containers is a normalized index: every Tinyforge-managed container has one row with workload_id, workload_kind, role, Docker container ID, host, state, last_seen.
  • Optional apps table (thin nullable app_id on Workload) added empty; UI gated behind a feature flag, defer indefinitely until pull.
  • Stable Docker labels: tinyforge.workload.id, tinyforge.workload.kind, tinyforge.role, tinyforge.managed. Legacy tinyforge.project / tinyforge.stage / tinyforge.instance-id are removed in the same wave.
  • Global /containers UI route; per-workload container panel becomes a shared <WorkloadContainers> component reused by project, stack, and site detail pages.

Schema

Appended to internal/store/store.go::runMigrations() as additive CREATE TABLE statements (idempotent via CREATE TABLE IF NOT EXISTS).

CREATE TABLE IF NOT EXISTS workloads (
    id              TEXT PRIMARY KEY,
    kind            TEXT NOT NULL,                 -- 'project' | 'stack' | 'site'
    ref_id          TEXT NOT NULL,                 -- FK into projects/stacks/static_sites by kind
    name            TEXT NOT NULL,
    app_id          TEXT,                          -- nullable FK into apps.id
    notification_url        TEXT NOT NULL DEFAULT '',
    notification_secret     TEXT NOT NULL DEFAULT '',
    webhook_secret          TEXT NOT NULL DEFAULT '',
    webhook_signing_secret  TEXT NOT NULL DEFAULT '',
    webhook_require_signature INTEGER NOT NULL DEFAULT 0,
    created_at      TEXT NOT NULL,
    updated_at      TEXT NOT NULL,
    UNIQUE(kind, ref_id)
);
CREATE INDEX IF NOT EXISTS idx_workloads_app_id ON workloads(app_id);
CREATE INDEX IF NOT EXISTS idx_workloads_kind   ON workloads(kind);

CREATE TABLE IF NOT EXISTS containers (
    id              TEXT PRIMARY KEY,
    workload_id     TEXT NOT NULL,
    workload_kind   TEXT NOT NULL,                 -- denormalized for filtered queries
    role            TEXT NOT NULL,                 -- stage name (project), service name (stack), '' (site)
    container_id    TEXT NOT NULL DEFAULT '',      -- Docker ID, '' between create+start
    image_ref       TEXT NOT NULL DEFAULT '',
    host            TEXT NOT NULL DEFAULT 'local',
    state           TEXT NOT NULL DEFAULT '',      -- running | stopped | failed | removing | missing
    port            INTEGER NOT NULL DEFAULT 0,
    last_seen_at    TEXT NOT NULL DEFAULT '',
    extra_json      TEXT NOT NULL DEFAULT '{}',    -- {subdomain, npm_proxy_id, proxy_route_id, ...}
    created_at      TEXT NOT NULL,
    updated_at      TEXT NOT NULL
);
CREATE INDEX IF NOT EXISTS idx_containers_workload     ON containers(workload_id);
CREATE INDEX IF NOT EXISTS idx_containers_state        ON containers(state);
CREATE INDEX IF NOT EXISTS idx_containers_container_id ON containers(container_id);

CREATE TABLE IF NOT EXISTS apps (
    id          TEXT PRIMARY KEY,
    name        TEXT NOT NULL,
    description TEXT NOT NULL DEFAULT '',
    created_at  TEXT NOT NULL,
    updated_at  TEXT NOT NULL
);

extra_json carries kind-specific fields (subdomain, npm_proxy_id, proxy_route_id) so the spine stays narrow. SQLite JSON1 is required for queries against extra_json; verify the driver in go.mod supports it before committing — fall back to dedicated columns if not.

PR 1 — Spine: schema, Workload package, reconciler

Single PR, lands the data layer end-to-end. No dual-writes; project/stack/site CRUD writes directly to workloads.

New files

  • internal/store/workloads.goCreateWorkload, GetWorkloadByID, GetWorkloadByRef(kind, refID), ListWorkloads, UpdateWorkload, DeleteWorkload.
  • internal/store/containers.goUpsertContainer, GetContainerByDockerID, ListContainersByWorkload, ListContainers(filter), MarkContainerMissing, new ListProxyRoutes (mirrors the join shape from internal/store/instances.go::ListProxyRoutes, reading extra_json via json_extract).
  • internal/store/apps.go — minimal CRUD; not wired anywhere yet.
  • internal/workload/workload.goWorkload interface (ID, Kind, Name, Deploy, Stop, Start, Delete, Containers).
  • internal/workload/adapters/project_adapter.go — wraps internal/deployer.
  • internal/workload/adapters/stack_adapter.go — wraps internal/stack/manager.go.
  • internal/workload/adapters/site_adapter.go — wraps internal/staticsite/manager.go.
  • internal/reconciler/reconciler.go — single writer to containers. Reads docker ps --filter label=tinyforge.managed, groups by (workload.id, role), upserts rows, marks absent rows state='missing'. Boot-time one-shot run + 30s tick.
  • internal/reconciler/reconciler_test.go — table-driven tests with a fake Docker client.

Modified files

  • internal/store/store.go::runMigrations — append the three CREATE TABLE statements (after line ~165 where the existing migrations end).
  • internal/store/models.go — add Workload, Container, App structs.
  • internal/store/projects.goCreateProject, UpdateProject, DeleteProject wrap the write in s.db.Begin() and also write the matching workloads row. Webhook/notification secret setters update workloads.webhook_secret / webhook_signing_secret / notification_secret directly.
  • internal/store/stacks.go — same Workload write on CreateStack / UpdateStack / DeleteStack.
  • internal/store/static_sites.go — same.
  • internal/docker/client.go — add label constants LabelWorkloadID, LabelWorkloadKind, LabelRole, LabelManaged. Remove the old LabelProject, LabelStage, LabelInstanceID writes from the deployer.
  • internal/deployer/deployer.go (label injection ~line 388) — emit only the new labels.
  • internal/deployer/bluegreen.go (~line 97) — same.
  • internal/stack/manager.go — after docker compose up, stamp new labels on each compose-managed container via docker container update --label-add. Compose's own com.docker.compose.service becomes role.
  • internal/staticsite/manager.go — stamp new labels at container start.
  • internal/store/instances.godelete this file. The deployer no longer creates instance rows; reconciler owns container state.
  • internal/api/instances.godelete or alias to /api/containers filtered by workload. Solo dev → delete is cleaner.
  • internal/api/proxies.go — switch the ListProxyRoutes import to containers.ListProxyRoutes.
  • internal/api/docker.go::buildActiveImagesSet (~line 251) — replace the ListAllInstances walk with a single containers.image_ref query.
  • internal/api/stale.go, internal/stale/scanner.go — read from containers instead of instances.
  • internal/webhook/matcher.go — query workloads.webhook_secret directly.
  • cmd/server/main.go — start the reconciler goroutine after store.New. Drop any startup code that touched instances.

Tests

  • Extend internal/store/store_test.go with TestCreateProjectAlsoCreatesWorkload, TestDeleteProjectCascadesWorkload, TestUpsertContainerIdempotent, TestListProxyRoutesShape.
  • New internal/reconciler/reconciler_test.go with a dockerClient interface and a fake — assert that a slice of types.Container produces the expected containers upserts.
  • Run the existing test suite under -race.

Deliverable

System builds, deploys a project end-to-end, deploys a stack end-to-end, deploys a static site end-to-end. containers table reflects reality after each deploy and after a 30s reconciler tick. The legacy instances table is gone.

PR 2 — API + frontend

New files

  • internal/api/workloads.goGET /api/workloads, GET /api/workloads/{id}, GET /api/workloads/{id}/containers, PATCH /api/workloads/{id} (sets app_id and notification/webhook config).
  • internal/api/containers.goGET /api/containers?workload_id=&kind=&state=&app_id=, GET /api/containers/{id}.
  • internal/api/apps.goGET /api/apps, POST /api/apps, PATCH /api/apps/{id}, DELETE /api/apps/{id} (gated by settings flag features.apps_grouping=true).
  • web/src/routes/containers/+page.svelte — global filterable table. Reuses table patterns from web/src/routes/proxies/+page.svelte and web/src/routes/containers/stale/+page.svelte (the existing stale/ route stays untouched).
  • web/src/lib/components/WorkloadContainers.svelte — shared container panel. Takes workloadId prop, hits /api/workloads/{id}/containers. Handles 1..N container rows.

Modified files

  • internal/api/router.go — register the new endpoints. Remove /api/instances registration.
  • web/src/routes/projects/[id]/+page.svelte — replace the inline instance list with <WorkloadContainers workloadId={...}/>.
  • web/src/routes/stacks/[id]/+page.svelte — same.
  • web/src/routes/sites/[id]/+page.svelte — same.
  • Top nav component (find under web/src/lib/components/) — insert a "Containers" tab between "Projects" and "Stacks". Existing tabs stay.
  • web/src/lib/api.ts (or wherever API client functions live) — add listWorkloads, getWorkload, listContainers, getContainer, listApps. Remove instance-shaped helpers.
  • web/src/lib/types.ts — add Workload, Container, App types. Remove Instance once unreferenced.

Deliverable

User-visible: a Containers tab in the top nav showing every running container with kind/state/workload filters, links into the owning project/stack/site detail page, and a per-workload container panel that looks identical on all three detail pages.

PR 3 — Polish + optional Apps UI

Defer indefinitely if no pull. Lands as a single PR when wanted.

Scope

  • Apps UI: web/src/routes/apps/+page.svelte, [id]/+page.svelte. Workload detail pages get an "App" dropdown to assign app_id. Gated by features.apps_grouping=true in settings.
  • Drop any leftover dead code referencing Instance types.
  • Documentation: update CLAUDE.md and README.md to describe the Workload model.
  • Optional: consolidate internal/deployer and internal/stack/manager into a single orchestrator. Out of scope for this refactor — adapters wrap the existing kind-specific code and that's fine. Revisit only if the duplication starts hurting.

What's explicitly deferred

  • Deployer + stack-manager consolidation.
  • Apps UI (schema added in PR 1, UI in PR 3 behind flag).
  • Multi-host containers (containers.host exists but is always 'local').
  • Workload-kind plugin model — the adapter registry has three hardcoded entries.
  • Webhook secret handling for old per-project URLs that may already be in CI configs (no users yet → don't care).

Risks (compressed)

  • SQLite JSON1 availability. Verify the driver in go.mod supports json_extract before committing to extra_json. If not, hoist subdomain, npm_proxy_id, proxy_route_id to dedicated columns on containers.
  • ListProxyRoutes shape regression. The new query reads from containers + workloads instead of instances + projects + stages. Worth a golden-output test before flipping internal/api/proxies.go over.
  • Stack containers and label stamping. docker container update --label-add is required to label compose-managed containers post-up. If the local Docker engine version doesn't support it, fall back to relying on com.docker.compose.project + com.docker.compose.service for reconciler joins.
  • Boot-time backfill from docker ps. First run needs to populate containers from currently-running containers using the legacy tinyforge.instance-id and com.docker.compose.project labels (since pre-refactor containers don't have the new labels). Solo-dev workaround: docker compose down test workloads, run the new binary against an empty Docker host, redeploy.

Concrete file paths

Modified:

  • internal/store/store.go (migrations at line ~75165)
  • internal/store/projects.go, stacks.go, static_sites.go, models.go, store_test.go
  • internal/docker/client.go
  • internal/deployer/deployer.go (~line 388), internal/deployer/bluegreen.go (~line 97)
  • internal/stack/manager.go, internal/staticsite/manager.go
  • internal/api/router.go, proxies.go, docker.go (buildActiveImagesSet at line 251), stale.go
  • internal/stale/scanner.go, internal/webhook/matcher.go
  • cmd/server/main.go
  • web/src/routes/projects/[id]/+page.svelte, stacks/[id]/+page.svelte, sites/[id]/+page.svelte
  • web/src/lib/api.ts, web/src/lib/types.ts
  • Top nav component in web/src/lib/components/

Created:

  • internal/store/workloads.go, containers.go, apps.go
  • internal/workload/workload.go, adapters/project_adapter.go, adapters/stack_adapter.go, adapters/site_adapter.go
  • internal/reconciler/reconciler.go, reconciler_test.go
  • internal/api/workloads.go, containers.go, apps.go
  • web/src/routes/containers/+page.svelte
  • web/src/lib/components/WorkloadContainers.svelte

Deleted:

  • internal/store/instances.go
  • internal/api/instances.go

What actually shipped (2026-05-09)

After a multi-agent code review caught several issues, the refactor landed with the following deviations from the original plan. They are documented here so a future reader doesn't have to reconstruct them from git log.

Deferred / dropped

  • internal/workload/ package + adapters. The plan called for a Workload interface (Deploy, Stop, Start, Delete, Containers) with project_adapter.go, stack_adapter.go, site_adapter.go. Not built. The adapters would have been thin pass-throughs to the existing kind-specific code; the duplication is real but small and the per-kind paths still type-check cleanly. The data-layer "Workload" (DB row) is the only Workload primitive today. Revisit if the per-kind branching becomes painful.
  • internal/api/instances.go URL space. Plan said "delete or alias to /api/containers." Kept alive but every handler that mutates a container now calls resolveAndAuthorizeInstance to verify the row's (workload_id, role) match the URL's (project_id, stage_name) — closes the cross-project hijack the security review flagged. URL renaming deferred until the frontend InstanceCard is renamed too (next refactor wave).
  • InstanceCard.svelte rename. The component is now generic enough to be ContainerCard, but the rename would touch 3+ call sites and i18n keys. Deferred.
  • extra_json SQL column. Schema still has the column (NOT NULL DEFAULT '{}'); Go code no longer references it (struct field, scan, INSERT, UPDATE all dropped). When/if a kind-specific need surfaces, hoist a dedicated column rather than re-introducing JSON1.

Built but not in the original plan

  • Container.stage_id column + index + ListProxyRoutes / ListContainersByStageID join. Survives stage renames; the original plan joined on stages.name = containers.role which would orphan rows on rename. The deployer populates stage_id for project containers; stack/site rows leave it empty.
  • store.ReconcileContainer — separate write path for the reconciler. The original UpsertContainer ON CONFLICT clause overwrote subdomain, proxy_route_id, npm_proxy_id, image_tag from the reconciler's empty values on every 30s tick, silently wiping deployer state. ReconcileContainer only updates Docker-derived fields on conflict (container_id, image_ref, state, port, last_seen_at, updated_at).
  • Workload-existence check in the reconciler — a tinyforge.workload.id label that doesn't resolve to a known workload is now rejected. Anyone with Docker socket access could otherwise spawn a container with a forged label and steal the canonical row for an existing workload.
  • Project-kind row invention skipped. When the reconciler sees a container with tinyforge.workload.kind=project and no existing row matches the docker container ID, it skips the upsert (deployer is the authoritative writer for project rows). Inventing a deterministic-key row would race with MaxInstances > 1 deploys.
  • Reconciler shutdown orderingStop() cancels its child context before wg.Wait() so a hung docker ps doesn't block process shutdown.
  • Transactional CRUD + workload sync. Every Create*, Update*, Delete*, and Set*Secret path on projects / stacks / static_sites now wraps the parent UPDATE and the workload row sync in a single transaction. Closes the rotation-durability gap the security review flagged.
  • Workload-only webhook lookup. The legacy fallback (GetProjectByWebhookSecret, GetStaticSiteByWebhookSecret) is gone — webhook routing reads exclusively through workloads.webhook_secret, so a rotation that didn't commit doesn't get silently accepted.
  • store.GetStackByComposeProjectName + indexed lookup. Reconciler used to do a full-table stack scan per compose container per tick.
  • store.ListMissingSweepRows — filtered query (container_id != '' AND state != 'missing') so the missing-sweep reads only candidate rows instead of the whole index.
  • web/src/lib/components/WorkloadContainers.svelte — generic detail-page panel reusable by stack and site detail pages. Project detail keeps its stage-grouped InstanceCard layout (containers there are sharded per-stage, not flat).
  • Containers page polish — kind/state filters now apply client-side over an unfiltered fetch (so tab counters reflect the whole population), URL-synced filters (?kind=stack&state=running) for shareable links, race-safe loads via a sequence number, full i18n with EN+RU strings, and a counter badge in the sidebar via navCounts.containers.
  • stage_id migration. New rows get stage_id from the deployer; legacy rows fall back to the (project_id, role=stage_name) join inside ListContainersByStageID.