Commit Graph

13 Commits

Author SHA1 Message Date
alexei.dolgolyov 279f373f80 docs(extra_json): policy doc for containers.extra_json evolution
New CODEMAPS/container-extra-json.md documents the contract every
source plugin must follow when reading or writing containers.extra_json.
Closes the open architectural question that was tracked in
WORKLOAD_REFACTOR_TODO.md.

Covers:
- Schema position (column default, four write-path normalization
  sites) and ownership model (per-source row keys, current writers).
- Reader rules: tolerate unknown keys via default json.Unmarshal,
  tolerate decode failure where first-class columns suffice.
- Writer patterns: wholesale-overwrite (image source, single-writer
  short-lived rows) vs preserve-unknown-keys (static source, RMW with
  generic-map round-trip). Preserve-unknown-keys is the recommended
  default for new sources.
- Concurrency: SetMaxOpenConns(1) + WAL gives atomic per-row writes
  and consistent reader snapshots, but does NOT serialize multi-
  goroutine RMW — a per-workload sync.Mutex is required for that
  (fenced by TestSaveState_ConcurrentWritesDoNotLoseUpdates).
- What extra_json is NOT for (workload config, cross-source state,
  queryable data, secrets) and a checklist for adding a new field.
- Pointers to every example in tree: image's containerExtra writer/
  reader, static's saveState round-trip, workload_runtime.go's
  decode-and-tolerate consumer.

WORKLOAD_REFACTOR_TODO Container.extra_json question flipped to DONE.
CODEMAPS/INDEX bumped + entry linked.

Reviewer pass (code-reviewer subagent) caught one HIGH factual error
(wrong cross-source consumer claim) and several MEDIUM/LOW drifts;
all addressed inline before commit.
2026-05-16 22:00:41 +03:00
alexei.dolgolyov ea55d31177 feat(discovery+runtime): restore static-site wizard discovery + close /sites/[id] feature parity
Build / build (push) Successful in 10m43s
Two-stage feature arc closing the gaps left by the hard legacy cutover.
The static-site creation wizard regains its auto-discovery + connection-test
flow; /apps/[id] grows the runtime/storage/lifecycle surface the legacy
/sites/[id] page used to expose.

Backend (Go)
- internal/api/discovery.go: six admin-gated endpoints wrapping
  staticsite.GitProvider — POST /api/discovery/git/{detect-provider,
  test-connection,repos,branches,tree} + GET /api/discovery/image/conflicts.
  Identifier validation (validateGitIdent / validateGitBranch) at the
  boundary so provider URL interpolation cannot be hijacked via `..`.
  Upstream errors scrubbed: detailed slog on the server, generic 502 to
  the client (mitigates token-reflection-in-error-page).
- internal/api/workload_runtime.go: four endpoints —
  GET /api/workloads/{id}/runtime-state decodes containers.extra_json for
  static workloads; GET /api/workloads/{id}/storage execs `du -sb /app/data`
  with a 30s in-process cache (storageProbeCache) so polling can't turn
  into per-request execs; POST /api/workloads/{id}/{stop,start} iterate
  ListContainersByWorkload and call docker.StopContainer / StartContainer,
  returning 200 / 409 (nothing to act on) / 502 (all failed).
- internal/staticsite/safehttp.go: NewSafeHTTPClient + ValidateBaseURL +
  blockReason. DialContext re-resolves hostnames and refuses loopback /
  link-local / multicast / unspecified addresses. RFC1918 + ULA explicitly
  allowed (self-hosted Gitea on LAN is the dominant deployment).
  Replaced four raw &http.Client{} constructions in the provider files.
- internal/staticsite/gitlab_provider.go: url.PathEscape each segment in
  the raw-file URL builder for parity with projectPath().
- Test coverage: 26 cases in discovery_test.go (image-tag stripping,
  source-config decoding, conflict scenarios, validator boundaries,
  scheme rejection), 14 in workload_runtime_test.go (404 / 409 / nil-docker
  / probe-cache), 16 in safehttp_test.go (URL validation + block-reason
  policy matrix + live dial against loopback + AWS metadata literals).

Frontend (Svelte 5 + runes)
- web/src/lib/api.ts: typed wrappers for every endpoint, AbortSignal
  threaded through post(); ApiError exported so callers can narrow on
  e.status; new DetectedGitProvider narrow union.
- web/src/routes/apps/new/+page.svelte: static-form discovery controls
  (auto-detect provider, test connection, repo / branch / folder
  EntityPickers, Deno auto-detect); image-form conflict panel with
  debounced lookup + double-click submit guard ("Forge anyway") + Inspect
  button that pre-fills port/healthcheck; English error fallbacks routed
  through apps.new.errors.* (en + ru).
- web/src/routes/apps/[id]/+page.svelte: runtime-state panel + storage
  panel + Stop / Start / Open-site toolbar; universal live-state badge
  in the hero lede for image/compose/static (RUNNING / TRANSITIONING /
  STOPPED / NOT DEPLOYED / MIXED · n/m RUNNING); ContainerStats panel
  per row (auto-collapsing native <details> when N > 2); read-only
  webhook bindings summary card; responsive toolbar overflow with native
  <details> at <640px (z-index 100 above sticky nav).
- web/src/app.css: project-wide .forge-btn-ghost:focus-visible outline.

Hardening from go-reviewer + security-reviewer + typescript-reviewer +
frontend-design UI/UX subagents (0 CRITICAL, all HIGH/BLOCKER addressed
inline, IMPORTANT applied before commit):
- AbortController + per-call sequence tokens on every long-running
  fetch (loadRuntimeState / loadStorage / loadTriggerMeta / inspectImage /
  listImageConflicts) plus onDestroy cleanup so late resolves cannot
  mutate dead component state.
- doStop / doStart snapshot and restore `error` across the finally-block
  reload so a load()-cleared message doesn't hide a real failure.
- triggersById refreshed after inline trigger creation so the webhook
  card doesn't silently exclude the just-created trigger.
- Live-state badge wraps in role=status / aria-live=polite (no redundant
  aria-label).
- Webhook row has a single click target (was two pointing at the same URL).
- Empty webhook section hides entirely.
- Dropped role=menu / role=menuitem from the overflow menu (they would
  promise arrow-key nav we don't wire; native Tab + ESC carry it).

Doc
- docs/CODEMAPS/INDEX.md + new docs/CODEMAPS/discovery-and-runtime.md
  map the endpoint surface, security posture, frontend integration
  patterns, and an "add a new probe" recipe.

Verification
- svelte-check: 0 errors, 3 pre-existing a11y warnings.
- go build + go vet + go test ./...: all green.
- i18n parity: en + ru at 1413 keys each.
- Live smoke against :8090: 404 / 409 / 502 envelopes correct, discovery
  sanity passes, ProbeError surfaces on no-container path.
2026-05-16 21:35:51 +03:00
alexei.dolgolyov 39e1e36510 feat(triggers): add schedule trigger kind + internal scheduler
Build / build (push) Successful in 10m42s
Fourth trigger kind alongside registry/git/manual. Recurring time-interval
fires driven by a new internal/scheduler tick loop (default 30s, clamped
to 5m). Goes through the same webhook.Handler.FanOutForTrigger seam as
inbound HTTP webhooks, so per-binding concurrency, outcome accounting,
and config-merge semantics are identical.

Schema: triggers.last_fired_at TEXT column (additive ALTER for existing
DBs). Scheduler persists last_fired_at BEFORE dispatch so a panicking
Match cannot wedge a tight loop; failed deploys wait one full interval
before retry — correct trade-off for a periodic refresh trigger.

Frontend: TriggerKindForm + /triggers/new + /triggers/[id] gain the
schedule kind (4-col card grid, preset chips Hourly/Daily/Weekly,
custom interval input matched to Go time.ParseDuration syntax, optional
pinned reference). /triggers/[id] surfaces "last fired" on schedule rows.
EN+RU i18n in parity.

Review fixes from go-reviewer / security-reviewer / typescript-reviewer:
- Scheduler Start/Stop wrapped in sync.Once (no goroutine leak / double-
  cancel panic on shutdown re-entry).
- shouldFire rejects sub-MinInterval as defense-in-depth against
  hand-inserted rows that bypassed Validate.
- fire() asserts trigger Kind=="schedule" before dispatching.
- Aligned isValidInterval regex across all three frontend sites; reject
  the unsupported "d" unit (Go time.ParseDuration doesn't accept it).
- formatLastFired falls back to lastFiredNever on malformed timestamps
  rather than leaking raw bytes into the UI.
- main.go scheduler closure logs per-fire deployed/errored counts.
2026-05-16 11:24:05 +03:00
alexei.dolgolyov e3c7b13d58 chore(workload): close the workload-first arc — apps i18n + codemap + tests
Build / build (push) Successful in 10m36s
Closes the workload-first refactor by landing the Priority 3 polish
items and the Priority 4 test gap. Net: ~2,400 lines added,
~350 lines modified across 13 files.

Priority 3 — polish
- apps.* i18n namespace: 276 new keys across apps.list.* (27),
  apps.new.* (91, sibling of existing apps.new.triggers.*), and
  apps.detail.* (158, sibling of existing apps.detail.bindings.*).
  EN+RU at 1314 keys each, perfectly in sync. /apps, /apps/new,
  /apps/[id] now render entirely from i18n.
- New codemap docs/CODEMAPS/workload-plugin.md (238 lines):
  Source × Trigger contract, dispatch seam, webhook fan-out path,
  recipes for adding a new Source or Trigger kind. Plus
  docs/CODEMAPS/INDEX.md gateway.

Priority 4 — tests
- internal/api/workloads_test.go (new, ~30 subtests): /api/workloads
  CRUD + deploy + delete + env + volumes + chain + promote-from +
  triggers list/inline-bind + auth gating + standalone /api/triggers
  CRUD (create / dup-409 / kind filter / delete). Uses real
  POST handlers via httptest.NewServer + a fake plugin source
  registered under "testfakesource".
- internal/deployer/dispatch_test.go (new, 11 tests):
  DispatchPlugin / DispatchTeardown / DispatchReconcile happy +
  unknown-kind + propagated-error each; PluginDeps wiring; a real
  2s-bounded RWMutex deadlock probe on PluginDeps vs SetDNSProvider.
- internal/workload/plugin/source/compose/compose_test.go (new,
  ~26 subtests): composeProjectName sanitization,
  writeYAML / writeYAMLIfChanged hash short-circuit, Validate happy
  + bad inputs, Kind / SchemaSample.

Coverage delta on the workload-plugin path:
- internal/api: 1.1% → 16.0%
- internal/deployer: 0% → 54.1%
- internal/workload/plugin/source/compose: 0% → 38.5%
- Trigger plugins already at 87-95% from the trigger-split work.

Production fix surfaced by the tests
- store.CreateWorkload now self-references RefID = ID when caller
  leaves RefID empty (the typical plugin-native path). The api
  layer's broken backfill loop (called UpdateWorkload, which
  deliberately omits ref_id) is gone. Multiple sibling plugin
  workloads can now coexist under the UNIQUE(kind, ref_id) constraint.

Review fixes addressed before commit
- CRITICAL: deadlock-detect test gained a real 2s time.After (was
  selecting on context.Background().Done() which never fires).
- HIGH: happy-path test now hard-asserts RefID = ID (was a t.Logf
  that would silently pass after a production fix).
- HIGH: standalone /api/triggers CRUD coverage added (was bypassed
  by the workload-side bind flow).
- HIGH: seedWorkload bypass deleted; tests now go through the
  real POST /api/workloads handler.
- MEDIUM: withTempDir restore is a no-op (t.Setenv auto-restores);
  dead `old := os.Getenv(...)` capture removed.
- MEDIUM: list-workloads test now asserts ID membership, not just
  count.

Doc
- WORKLOAD_REFACTOR_TODO: all three Priority 1 items, Priority 3
  polish, and Priority 4 tests marked DONE. The workload-first arc
  is closed.
2026-05-16 06:42:43 +03:00
alexei.dolgolyov 739b67856a feat(cutover): hard legacy cutover — drop projects/stacks/sites/deploys
Build / build (push) Successful in 10m39s
The clean-break delete that closes the workload-first refactor arc.
Net diff: ~30 backend files deleted, ~20 modified, ~12k LOC removed
on the Go side; entire /projects /stacks /sites /deploy frontend
trees gone; ~6.7k LOC removed on the Svelte/TypeScript side.

Backend
- API handlers gone: internal/api/{projects,stages,stage_env,stacks,
  static_sites,deploys,instances,volume_browser}.go
- Store CRUD + tests gone: internal/store/{projects,stages,stage_env,
  stacks,static_sites,static_site_secrets,deploys,poll_state,volumes,
  workload_sync}.go (+ _test.go siblings)
- Legacy deployer pipeline gone: internal/deployer/{bluegreen,promote,
  rollback,subdomain,resolver_test}.go; deployer.go trimmed to just the
  dispatch surface used by the plugin pipeline
- internal/staticsite/{manager,healthcheck}.go and
  internal/stack/manager.go gone (the rest of those packages stay as
  helpers imported by the static + compose plugins)
- internal/registry/poller.go gone (legacy registry poller)
- internal/volume.ResolvePath gone; ResolveWorkloadPath stays
- internal/webhook: handleWebhook (project) + handleSiteWebhook (site)
  gone; only POST /api/webhook/triggers/{secret} remains
- workload-side webhook URL handlers (getWorkloadWebhook +
  regenerateWorkloadWebhook + EnsureWorkloadWebhookSecret +
  SetWorkloadWebhookSecret + GetWorkloadByWebhookSecret) gone — they
  minted URLs that would 404 against the new trigger-only ingress
- cmd/server/main.go: dropped staticsite.Manager, stack.Manager,
  staticsite.HealthChecker, registry poller, SetSiteSyncTriggerer,
  SetStaticSiteManager, SetStackManager, wireStaticBackend
- store/store.go: idempotent DROP TABLE IF EXISTS for every legacy
  table (projects, stages, stage_env, volumes, deploys, deploy_logs,
  poll_states, stacks, stack_revisions, stack_deploys, static_sites,
  static_site_secrets); FK order children-then-parents
- store/models.go: dropped Project, Stage, Deploy, DeployLog, StageEnv,
  Volume, StaticSite, StaticSiteSecret, Stack, StackRevision,
  StackDeploy types; kept WorkloadKind constants as documented strings
- internal/store/helpers.go (new): BoolToInt, rowScanner,
  GenerateWebhookSecret extracted from deleted CRUD files
- internal/api/secrets.go (new): forwards to store.GenerateWebhookSecret
  so api + store paths share one secret-generation impl (no
  panic-vs-UUID-fallback divergence)
- internal/reconciler/reconciler.go: dropped legacy stack-by-compose
  + static-site label paths; only canonical tinyforge.workload.id
  dispatch remains
- providers (gitea_content/github_provider/gitlab_provider) gained
  path-traversal rejection on every tree entry
- internal/webhook ParsedImage / ParseImageRef demoted to package-
  private (no external callers)

Frontend
- /projects /stacks /sites /deploy routes deleted (entire trees)
- ProjectCard / InstanceCard / StaleContainerCard components deleted
- api.ts: dropped every project/stage/stack/site/deploy/instance
  helper + types (Project, Stage, Stack, StaticSite, Deploy,
  Instance, Volume, etc.); kept Workload, Container, App, Settings,
  Registry, EventTrigger, LogScanRule, webhook envelopes
- WorkloadWebhook type + getWorkloadWebhook/regenerateWorkloadWebhook
  api functions gone (mirror of the backend deletion above)
- web/src/routes/+layout.svelte: dropped /projects /sites /stacks
  /deploy nav entries, trimmed quick-nav keymap
- web/src/routes/+page.svelte: dashboard rewrite — reads
  listWorkloads + listContainers only; 4-card stat grid
  (workloads/running/failed/stale) + recent workloads strip
- navCounts.ts, SystemHealthCard.svelte, ContainerLogs.svelte,
  ContainerStats.svelte, StatusBadge.svelte, TagCombobox.svelte,
  proxies/+page.svelte, containers/+page.svelte all rewired to the
  workload-first surface
- AbortController plumbing on dashboard, nav-counts, stale page,
  SystemHealthCard so navigation doesn't leave dangling fetches
- i18n: dropped projects.*, projectDetail.*, envEditor.*,
  volumeEditor.*, volumeBrowser.*, quickDeploy.*, sites.*, stacks.*,
  instance.*, confirm.* namespaces; en/ru parity preserved (1042
  keys each)

Hardening from go-reviewer + security-reviewer + typescript-reviewer
subagent passes (0 CRITICAL across all three; 1 HIGH + ~12 MEDIUM
addressed inline before commit):

- Sec H1: dead-end workload webhook URL handlers (would mint URLs
  that 404 the new trigger-only ingress) deleted across backend +
  frontend
- Go M1: IsTerminalDeployStatus dropped (no production callers)
- Go M2: ParsedImage/ParseImageRef lowercased (in-package only)
- Go M6: generateWebhookSecret unified — api shim forwards to
  store.GenerateWebhookSecret
- Doc/comment freshness: stage_id (no longer FK), ProxyRoute legacy
  field names, workloadIDRow rationale, webhook_deliveries.target_type
  enum, WebhookDeliveryLog component header

Doc
- WORKLOAD_REFACTOR_TODO: cutover marked DONE; all three Priority 1
  items are now shipped. Next focus is Priority 3 polish (apps.* i18n
  + codemap entries) and Priority 4 tests.

Behavioral notes for operators upgrading from a pre-cutover build
- Existing rows in the dropped tables disappear on first boot.
- Legacy webhook URLs at /api/webhook/{secret} and
  /api/webhook/sites/{secret} return 404; CI configs must repoint to
  /api/webhook/triggers/{secret} (the trigger-split boot backfill
  lifted any embedded workload secret onto a Trigger row, so the
  secret value itself carries over).
- Frontend routes /projects /stacks /sites /deploy are gone; nav
  links replaced with /apps and /triggers.
2026-05-16 06:00:21 +03:00
alexei.dolgolyov 234c3c711e feat(static): inline static-source plugin; drop phantom-row adapter
Build / build (push) Successful in 10m43s
Lift the static-site deploy pipeline from internal/staticsite/manager.go
into internal/workload/plugin/source/static/ so plugin-native static
workloads operate directly on plugin.Workload + the containers table +
workload_env. The cmd/server/static_backend.go phantom-row adapter is
gone; the legacy static_sites table is no longer touched on plugin
deploys.

Backend
- new state.go: runtimeState (last_commit_sha, last_sync_at,
  last_error, status) persisted in containers.extra_json under the
  deterministic row id <workloadID>:site
- per-workload sync.Mutex serializes saveState read-modify-write so
  parallel deploys for the same workload can't race container_id /
  proxy_route_id writes
- extra_json round-trips through map[string]json.RawMessage so
  unknown keys survive — typed runtimeStateKeys are stripped before
  merge so clearing a typed field actually drops the key
- new env.go reads workload_env (replaces static_site_secrets for
  plugin-native sites); decrypt-failure logs and skips one entry
  rather than failing the whole deploy
- new build.go ports prepareDenoBuild + prepareStaticBuild + copyDir;
  copyDir uses filepath.WalkDir + Lstat to refuse symlinks and
  non-regular files
- new deploy.go is the ~300-line core; intent.Reason gates force vs
  skip-if-no-changes; success-path saveState failure rolls back
  container + proxy route and writes "failed" state (no orphans)
- new teardown.go combines Remove + Stop; idempotent on
  never-deployed workloads
- new reconcile.go refreshes container state from Docker; flips
  runtimeState.Status to failed when the container is missing/crashed

Hardening (from go-reviewer + security-reviewer subagent passes;
1 CRITICAL + 5 HIGH + 3 MEDIUM addressed before merge)
- path-traversal defense in all 3 providers (gitea_content,
  github_provider, gitlab_provider): reject tree entries whose
  resolved local path escapes destDir
- verifyDownloadInsideRoot walks the build dir post-download as a
  second line of defense
- sanitizeError redacts the access token, collapses to one line, and
  clamps to 240 bytes before persisting to extra_json or fanning out
  to the notification webhook
- container/image/volume names suffixed with workload-id short prefix
  (workload name is not UNIQUE in schema)
- primaryDomain reads settings.Domain to complete a bare subdomain
  face into a full FQDN (matches legacy Manager behavior)
- ctx-aware health-check sleep
- json.Marshal for event metadata (was fmt.Sprintf JSON template)
- strings.HasPrefix for failed-status detection (was brittle slice
  expression)

Wire-up
- cmd/server/main.go: removed wireStaticBackend(...) call; existing
  blank import on _ ".../source/static" drives init() registration
- cmd/server/static_backend.go deleted

Doc
- WORKLOAD_REFACTOR_TODO: static port marked DONE; next focus is
  the hard legacy cutover (drop /api/projects, /api/stacks,
  /api/sites, /api/stages + their tables, internal/stack +
  internal/staticsite packages, frontend /projects /stacks /sites)

Behavior notes for operators
- plugin-native static workloads no longer write to static_sites;
  legacy /api/sites/* still serves original rows unchanged
- legacy tinyforge.static-site / .static-site-name container labels
  dropped on plugin deploys; canonical tinyforge.workload.id / .kind
  cover ownership
- container/image/volume names gained an 8-char ID suffix
  (e.g. dw-site-mysite-a1b2c3d4); legacy-deployed sites keep the
  old shape until redeployed through the plugin path
2026-05-16 02:56:23 +03:00
alexei.dolgolyov 2aff22f565 feat(triggers): first-class triggers + bindings with fan-out webhook
Build / build (push) Successful in 10m39s
Promote triggers from embedded workload fields to standalone records
joined to workloads via workload_trigger_bindings. One trigger (webhook,
registry watcher, git push, manual) now fans out to many workloads with
per-binding config overrides (top-level JSON merge, binding wins).

Backend
- new triggers + workload_trigger_bindings tables with ON DELETE CASCADE
- boot-time backfill of embedded trigger config inside per-workload tx
- store.ErrUnique sentinel translates SQLite UNIQUE at store boundary
- /api/triggers CRUD + /api/triggers/{id}/{webhook,bindings}
- /api/bindings/{id} update/delete; /api/workloads/{id}/triggers list+bind
- bindTriggerToWorkload accepts trigger_id or inline {kind,name,config}
- inline-create uses CreateTriggerWithBindingTx (no orphan triggers)
- validateBindingConfig enforces 8 KiB cap + plugin Validate on merged
- ListTriggersWithBindingCount + ListBindings*WithNames remove N+1
- POST /api/webhook/triggers/{secret} resolves trigger then fans out
- bounded worker pool (4) per request; per-binding error isolation
- outcome accounting: deployed / skipped / no-match / errored
- legacy /api/webhook/workloads/{secret} route removed (clean break;
  backfill keeps secrets resolvable at the new /triggers/{secret} path)
- reconciler gate dropped from (Source && Trigger) to Source only
- MergeJSONConfig returns freshly allocated slices (no fan-out aliasing)
- WithEffectiveTrigger lets existing Trigger.Match contract stay unchanged

Frontend
- /triggers list, new wizard, [id] detail (bindings, webhook rotate)
- workload create wizard: NEW / PICK / SKIP trigger modes
- workload detail: bindings panel + Add-trigger modal (inline / pick)
- per-binding override editor with merged-preview + 8 KiB guard
- "OVERRIDES n FIELDS" row badge when binding_config is non-empty
- shared TriggerKindForm component (registry / git / manual + JSON)
- 3 raw <input type=checkbox> replaced with <ToggleSwitch>
- full EN + RU i18n: redeployTriggers.*, apps.detail.bindings.*,
  apps.new.triggers.*, nav.triggers; event-triggers nav disambiguated

Doc
- WORKLOAD_REFACTOR_TODO: trigger-split marked DONE; next focus is
  the static-source inline port + hard legacy cutover (Priority 1)
2026-05-16 02:24:31 +03:00
alexei.dolgolyov 30133bc1eb docs: workload refactor + observability progress
Build / build (push) Successful in 10m40s
Two design + handoff docs:

- docs/WORKLOAD_REFACTOR_TODO.md — status-at-a-glance table
  showing what's done (volume scopes, kind-aware editors,
  vendor webhook parsing, chain-panel CSS, Log Rules panel)
  and what's still pending (static source inline port + the
  hard legacy cutover gated on it; codemap entries; /apps
  page-level i18n; Priority 4 integration tests).

- docs/LOGSCAN_AND_TRIGGERS_TODO.md — companion design + status
  doc for the two Observability features. Records the
  loop-prevention invariant (event_log = system observing
  itself, webhook_deliveries = system talking to outside) so
  the next contributor doesn't accidentally break it by adding
  a new EventLog subscriber that re-publishes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 22:18:51 +03:00
alexei.dolgolyov cba2149aa9 refactor(workload): finalize containers index + post-review hardening
Wraps up the workload refactor with the fixes that came out of the multi-agent
code review (see docs/plans/workload-refactor.md "What actually shipped").

Backend:
- store.ReconcileContainer: separate write path so the 30s reconciler tick no
  longer overwrites deployer-owned fields (subdomain, proxy_route_id,
  npm_proxy_id, image_tag).
- Container.stage_id column + index; ListProxyRoutes / ListContainersByStageID
  join via stage_id (survives stage rename), with legacy fallback to
  (project_id, role=stage_name).
- Reconciler: workload-existence check (rejects forged tinyforge.workload.id
  labels), skips inventing project-kind rows, child-context cancel before
  wg.Wait() on shutdown.
- Transactional CRUD across projects / stacks / static_sites: parent UPDATE
  and workload sync land in one transaction so secret rotations are durable.
- Webhook routing reads exclusively through workloads.webhook_secret; legacy
  GetProjectByWebhookSecret / GetStaticSiteByWebhookSecret fallback removed.
- store.GetStackByComposeProjectName + indexed lookup (no more full-table
  stack scan per compose container per tick).
- store.ListMissingSweepRows: filtered query for the missing-sweep.
- /api/instances/* handlers verify (workload_id, role) match URL
  (project_id, stage_name) before mutating — closes the cross-project
  hijack the security review flagged.
- extra_json no longer referenced from Go (column kept on disk for now).

Frontend:
- WorkloadContainers.svelte: generic detail-page panel reusable by stack and
  site detail pages.
- Containers page polish: client-side kind/state filters over an unfiltered
  fetch, URL-synced filters, race-safe loads via sequence number, EN+RU i18n,
  sidebar counter via navCounts.containers.

Misc:
- scripts/dev-server.sh: tolerate empty netstat grep result.
- .gitignore: ignore docker-watcher binaries, .claude/worktrees/, .facts-sync.json.
2026-05-09 15:44:41 +03:00
alexei.dolgolyov f54a6ecee3 feat(workload): add Workload/Container/App store foundation
Introduces the data layer for the Workload refactor (see
docs/plans/workload-refactor.md): three new tables and store
methods, no behavior changes elsewhere yet.

- workloads: unifying primitive over Project/Stack/StaticSite,
  paired via UNIQUE(kind, ref_id). Notification + webhook config
  hosted here so it lives in one place across kinds.
- containers: normalized index of every Tinyforge-managed
  container with first-class subdomain/proxy_route_id/npm_proxy_id
  columns (heavily queried by ListProxyRoutes / stale detection).
- apps: optional grouping of workloads; schema only, no UI in v1.

Foundation only — deployer surgery, reconciler, and consumer
switchover land in the next commit.
2026-05-09 13:22:25 +03:00
alexei.dolgolyov 0405ecd9ce feat(notify): HMAC-signed outgoing webhooks with per-tier secrets and test sender
Build / build (push) Successful in 10m36s
Outgoing notifications were bare POSTs with no auth and no way to verify
they came from Tinyforge. They also went out from one global URL only,
even though stages had a notification_url field, and static-site sync
emitted no events at all.

Schema: add notification_url + notification_secret (lazy-generated) to
settings, projects, stages and static_sites. Migrations are additive.

Notifier: SendSigned computes HMAC-SHA256 over the exact body bytes and
sends X-Hub-Signature-256 (GitHub-compatible — receivers built for
GitHub/Gitea/Forgejo verify out of the box). Aux headers
X-Tinyforge-Event/Delivery/Timestamp/Tier are advisory and not signed.
Empty secret => unsigned send for back-compat.

Resolution: deploys fall through stage > project > settings, sites fall
through site > settings. The secret travels with the URL that sourced
it, so any tier can sign even when its parents are unsigned. Site sync
events now actually emit (site_sync_success / site_sync_failure).

API: 12 new endpoints — {GET secret, POST regenerate, POST disable,
POST test} for each of the 4 tiers. SendSyncForTest returns
status_code/latency_ms/signature_sent/delivery_id/response_snippet so
the UI surfaces receiver feedback inline.

UI: shared OutgoingWebhookPanel.svelte fits the existing card aesthetic.
Signing-state pill, secret reveal-on-demand, regenerate/disable behind
ConfirmDialog modals (not inline strips — too easy to misclick), send-
test result card with colour-coded status. Wired into Settings →
Integrations, project edit form, per-stage edit, and per-site detail.
EN + RU i18n.

Tests: round-trip (sender signs, receiver verifies), tampered-body and
wrong-secret rejection, unsigned-send omits header, send-test surfaces
4xx, concurrent fan-out via Drain. Resolver precedence locked for both
deploy and site paths.

Docs: docs/webhooks.md with header reference, verifier snippets in
Node/Python/Go, and a recipe for the service-to-notification-bridge
generic webhook provider.
2026-05-07 02:03:32 +03:00
alexei.dolgolyov a4362b842d fix: harden security, fix concurrency bugs, and address review findings
Build / build (push) Successful in 11m42s
Security:
- rate limit /api/webhook routes per-IP and cap concurrent site syncs
- global SSE connection cap (256) with new sse_gate
- validate ?tail= and cap JSON log responses at 4 MiB
- strip ANSI/CSI/OSC and control bytes from streamed log lines
- redact webhook secret from request log middleware
- scrub host details from /api/health for non-admin viewers
- drop container_id from /api/system/stats/top for non-admins
- generate webhook secrets via crypto/rand; require >=32 chars on insert
- verify iid path consistency in streamContainerLogs
- LimitReader on site webhook body; reject malformed non-empty bodies

Concurrency / correctness:
- stats collector: Stop() no longer hangs without Start(), semaphore
  acquired in parent loop so ctx cancellation short-circuits the queue,
  in-flight tick cancellable via shared base context, zero-ts guard
- webhook handler: replace fire-and-forget goroutine with WaitGroup-tracked
  workers + Drain() wired into graceful shutdown
- $derived(() => ...) mis-idiom fixed in ContainerStats / InstanceCard /
  ProjectCard (returned function instead of value)
- SystemResourcesCard: rename `window` and `t` locals to avoid shadowing
  globalThis.window and the i18n `t` import

Quality / performance:
- replace O(n^2) insertion sort with sort.Slice in stats top
- runMigrations only swallows duplicate-column / already-exists errors
- PruneStatsSamplesBefore wrapped in a transaction
- collapse N+1 in unusedImageStats / pruneImages to one ListAllInstances
  pass; surface DB errors instead of silently treating them as inactive
- run Docker Info + DiskUsage in parallel via errgroup
- container log SSE emits `: ping` heartbeat every 20 s
- imageMatches case-insensitive on registry host (RFC behaviour)
- log warning on invalid stage tag pattern instead of silent skip
- reject malformed non-empty site webhook payloads

Frontend / i18n:
- shared formatBytes utility replaces three local copies
- statsInterval store drives dynamic "no samples / collection disabled"
  copy across ContainerStats and SystemResourcesCard
- top consumers row now shows owner_name (project/stage or site name)
- drop seven `as any` casts on the Settings type; add cloudflare_api_token
  write-only field
- move "Service status", "Docker daemon", "Docker unreachable",
  "Proxy unreachable", "reachable", and "Docker daemon is not reachable."
  strings into en/ru i18n bundles
2026-05-07 00:56:14 +03:00
alexei.dolgolyov 670948f113 fix: address code review findings for DNS management
- CRITICAL: Change DNS zones endpoint from GET to POST to avoid
  leaking API token in URL query parameters
- HIGH: Add sync.RWMutex to protect dnsProvider field in Server,
  Deployer, and proxy Manager against concurrent read/write races
- HIGH: Capture old DNS provider reference synchronously before
  launching background cleanup goroutine
- HIGH: Use getDNS()/getDNSProviderLocked() accessors instead of
  direct field reads in all DNS operations
2026-04-02 14:54:15 +03:00