chore: post-merge cleanup — remove merged plan folders (volume-snapshot-restore, gitops)
This commit is contained in:
@@ -1,115 +0,0 @@
|
||||
# Tinyforge GitOps v1 — config-as-code for repo-backed workloads
|
||||
|
||||
**Status:** ✅ Complete (squash-merged to main 2026-06-21)
|
||||
**Branch:** `feat/gitops-config-as-code`
|
||||
**Mode:** Automated · Orchestrator (hybrid: backend built direct, Phase 3 via frontend implementer) · Incremental
|
||||
**Started:** 2026-06-21
|
||||
|
||||
## Summary
|
||||
|
||||
A `dockerfile` or `static` workload can opt in to reading its **deploy config** from a
|
||||
`.tinyforge.yml` file in its own repo. Tinyforge fetches the file, shows it, computes
|
||||
**drift** vs the live config, and lets an admin **sync** (repo → live) with one explicit
|
||||
action. The repo becomes the source of truth for the *declared* fields; the UI locks
|
||||
those fields and renders a drift view.
|
||||
|
||||
### v1 scope (deliberate)
|
||||
|
||||
- **In:** opt-in per workload (dockerfile/static only); `.tinyforge.yml` declares only
|
||||
**source_config-resident** fields (`port`, `healthcheck`, `deploy_strategy`,
|
||||
`resources.{cpu_limit,memory_limit}`, `max_instances`); manual explicit sync;
|
||||
declared-field drift view; GitOps-managed badge + read-only gate.
|
||||
- **Out (documented future seams):** `env`/`faces` declaration (separate stores —
|
||||
needs typed multi-target apply); auto-apply-on-deploy (must be a Source-plugin concern,
|
||||
not a dispatch concern); multi-workload reconcile with create/delete (Framing B);
|
||||
image/compose sources (not git-backed / overlapping config surface).
|
||||
|
||||
### `.tinyforge.yml` v1 schema
|
||||
|
||||
```yaml
|
||||
version: 1
|
||||
deploy:
|
||||
port: 8080
|
||||
healthcheck: /healthz
|
||||
deploy_strategy: blue-green # "" | recreate | blue-green (validated per source)
|
||||
resources: { cpu_limit: 0.5, memory_limit: 256 }
|
||||
max_instances: 1
|
||||
```
|
||||
|
||||
No repo location, no tokens, no secrets — those stay in the encrypted DB.
|
||||
|
||||
## Design constraints (from the adversarial review — non-negotiable)
|
||||
|
||||
- **C1** Overlay is a typed `ApplyPlan{SourceConfigPatch}` routed to `source_config` only.
|
||||
env/faces are NOT in source_config (they live in `workload_env` / `public_faces`), so
|
||||
they are cut from v1; the typed plan reserves their slots for later.
|
||||
- **C2** No `env` in v1 → no secrets-in-repo hole.
|
||||
- **C3** No auto-apply-on-deploy in v1 (SHA is resolved *inside* `src.Deploy`; image has
|
||||
no repo). Future auto-apply lands as a Source-plugin concern.
|
||||
- **C4** Sync is explicit-action only, with a hard gate:
|
||||
parse → build overlay → **omitted-field-preserving** deep-merge onto a fresh copy of the
|
||||
live source_config → run `Source.Validate` on the *merged* result → persist in one
|
||||
transaction only if valid.
|
||||
- **C5** Drift is computed **only over declared leaves**, post-normalization
|
||||
(`deploy_strategy:"" == "recreate"`; YAML int vs JSON coercion). Omitted = unmanaged.
|
||||
- Reuse `staticsite.NewGitProvider` (inherits SSRF defense); add a size-capped
|
||||
`DownloadFile`. Route all fetch errors through the existing `sanitizeError(msg, token)`.
|
||||
Distinct `no_file` status. Sync audit is NOT `deploy_history` (rollback assumes
|
||||
deployable rows). Gate enable to `dockerfile|static`. Derive read-only fields from the
|
||||
declared overlay leaves (no provenance column). 4 additive `gitops_*` columns only.
|
||||
|
||||
## Phases
|
||||
|
||||
| # | Title | Subplan | Status |
|
||||
|---|-------|---------|--------|
|
||||
| 1 | GitOps core (backend, no UI/mutation) | [phase-1-core.md](phase-1-core.md) | ✅ Done |
|
||||
| 2 | Store + API (manual sync) | [phase-2-api.md](phase-2-api.md) | ✅ Done |
|
||||
| 3 | Frontend experience (UI/UX showcase) | [phase-3-frontend.md](phase-3-frontend.md) | ✅ Done |
|
||||
| 4 | Hardening + docs + final review | [phase-4-hardening.md](phase-4-hardening.md) | ✅ Done |
|
||||
|
||||
## Phase progress log
|
||||
|
||||
- **Phase 1 — Done (2026-06-21).** Migration (4 additive `gitops_*` columns) + `Workload`
|
||||
read path. New `internal/gitops` package: `Spec`/`ParseSpec` (KnownFields rejects
|
||||
unknown keys incl. env/faces attempts), source-aware `ApplyPlan`/`BuildPlan`
|
||||
(dockerfile: port/healthcheck/deploy_strategy; static: deploy_strategy only — `resources`/
|
||||
`max_instances` dropped after confirming they aren't on dockerfile/static configs),
|
||||
`MergeAndValidate` (omitted-field-preserving + validate-then-commit), `Drift`
|
||||
(declared-only, normalized), `Fetch` (no_file/fetch_failed/invalid statuses, token-redacted).
|
||||
`DownloadFile` added to the `GitProvider` interface + 3 impls (64 KiB cap, ErrFileNotFound,
|
||||
SSRF-safe client reused, GitHub raw media type). Independent go-review: **APPROVE**, no
|
||||
CRITICAL/HIGH; M1 (GitLab doc comment) fixed; M2 (validate GitOpsPath at write) carried
|
||||
into Phase 2. 28/28 packages green.
|
||||
- **Phase 2 — Done (2026-06-21).** Store setters `SetWorkloadGitOps` / `RecordGitOpsSync`
|
||||
(targeted column updates — disjoint from `UpdateWorkload`, so neither writer clobbers the
|
||||
other). API: `GET /gitops` (single rich payload: status + raw + live drift + meta — folded
|
||||
the separate `/drift` endpoint in to avoid a double fetch), `PUT /gitops` (admin,
|
||||
enable/disable + path, rejects non-eligible source + traversal/URL-injection paths),
|
||||
`POST /gitops/sync` (admin: fetch → MergeAndValidate → UpdateWorkload → RecordGitOpsSync →
|
||||
event-log audit). Sync recorded to `event_log` (not `deploy_history` — review S6). Tests:
|
||||
store round-trip + `validGitOpsPath` + `planFields`. Independent **security review:
|
||||
clean, no CRITICAL/HIGH** (token never leaks, SSRF locked by safe dialer, authZ correct,
|
||||
no field loss); LOW-1 (path query/fragment injection) hardened in `validGitOpsPath`. Full
|
||||
backend suite green.
|
||||
- **Phase 3 — Done (2026-06-21).** Built by a frontend implementation agent.
|
||||
`GitOpsPanel.svelte` (self-fetching panel: status pill, purpose-built field-level drift
|
||||
view — repo→live per declared field on the forge/ember palette, `.tinyforge.yml` preview,
|
||||
enable `ToggleSwitch`, "Sync now" via `ConfirmDialog`, all five status states). api.ts
|
||||
fetchers + `GitOpsStatus`/`GitOpsDriftEntry`; `gitops_*` on the `Workload` TS type;
|
||||
GitOps-managed badge on the detail hero + apps list (payload already carries
|
||||
`gitops_enabled`); read-only edit-form banner (banner-only — hard-disabling inputs would
|
||||
need prop-threading through all 4 source forms; deferred). Backend `managed_fields` added
|
||||
to `GET /gitops` for the gate. i18n `apps.detail.gitops.*` en+ru (parity 1804/1804).
|
||||
Independent ts-review: one HIGH (`isAdmin` hardcoded true) + 2 MEDIUM — **all fixed**:
|
||||
real role wired via `getCurrentUser()` (panel default now `false`), stale-guard on the
|
||||
edit-open fetch, misleading `eligible` comment trimmed. check 0 errors · build · 26/26.
|
||||
- **Phase 4 — Done (2026-06-21).** Concurrent-sync guard (review S5): a per-workload
|
||||
`keyedMutex` on `Server`; `syncWorkloadGitOps` locks by id and loads the row inside the
|
||||
lock, serializing the read→merge→write so two syncs can't race. Docs: `docs/gitops.md`
|
||||
(enable flow, v1 schema, drift/sync semantics, explicit "not in v1": env/faces, auto-apply,
|
||||
multi-workload, image/compose). Backend green. Final comprehensive review + merge gate
|
||||
next.
|
||||
|
||||
## Amendment log
|
||||
|
||||
_(plan changes require approval + an entry here)_
|
||||
@@ -1,57 +0,0 @@
|
||||
# Phase 1 — GitOps core (backend, no UI, no mutations)
|
||||
|
||||
Pure logic + fetch. No HTTP endpoints, no DB writes to workloads yet (migration only).
|
||||
|
||||
## Tasks
|
||||
|
||||
- [ ] **Migration**: append 4 additive columns to the workloads-table migration list in
|
||||
`internal/store/store.go` (idempotent `ALTER TABLE workloads ADD COLUMN`):
|
||||
- `gitops_enabled INTEGER NOT NULL DEFAULT 0`
|
||||
- `gitops_path TEXT NOT NULL DEFAULT '.tinyforge.yml'`
|
||||
- `gitops_last_sync_at TEXT NOT NULL DEFAULT ''`
|
||||
- `gitops_commit_sha TEXT NOT NULL DEFAULT ''`
|
||||
- Add the fields to the `Workload` struct in `internal/store/models.go` + the
|
||||
scan/insert/update column lists in `internal/store/workloads.go` (read path now;
|
||||
write path for the setters lands in Phase 2).
|
||||
- [ ] **`internal/gitops` package — `spec.go`**: `Spec` struct mirroring the v1 schema
|
||||
(`Version int`, `Deploy DeploySpec{Port *int, Healthcheck *string, DeployStrategy *string,
|
||||
Resources *ResourceSpec{CpuLimit *float64, MemoryLimit *int}, MaxInstances *int}`).
|
||||
Pointers so "omitted" is distinguishable from "zero". `ParseSpec([]byte) (Spec, error)`
|
||||
using `gopkg.in/yaml.v3` with `KnownFields(true)` to reject unknown keys; reject
|
||||
`version != 1`.
|
||||
- [ ] **`apply.go`**: typed `ApplyPlan{ SourceConfigPatch map[string]any }` (env/faces
|
||||
slots reserved in a comment). `BuildPlan(spec) ApplyPlan` maps only the **present**
|
||||
(non-nil) declared fields to their `source_config` JSON keys (`port`, `healthcheck`,
|
||||
`deploy_strategy`, `cpu_limit`, `memory_limit`, `max_instances`).
|
||||
- [ ] **`merge.go`**: `MergeAndValidate(liveConfig json.RawMessage, plan ApplyPlan,
|
||||
validate func(json.RawMessage) error) (json.RawMessage, error)` — deep-copy live →
|
||||
overlay only the patch keys (omitted-field-preserving) → marshal → run `validate` on the
|
||||
**merged** result → return merged or error. Never returns a partial/empty config.
|
||||
- [ ] **`drift.go`**: `Drift(spec Spec, liveConfig json.RawMessage) ([]DriftEntry, error)`
|
||||
where `DriftEntry{Field string /*dotted path*/, RepoValue string, LiveValue string}`.
|
||||
Compare **only declared leaves**, post-normalization:
|
||||
- `deploy_strategy` via the source's effective-default rule (`"" == "recreate"` for
|
||||
dockerfile/static) — import or replicate `effectiveStrategy` semantics.
|
||||
- numeric coercion (YAML int vs JSON number) compared by value, not raw string.
|
||||
- [ ] **Provider `DownloadFile`**: add `DownloadFile(ctx, owner, repo, ref, path string,
|
||||
maxBytes int64) ([]byte, error)` to the `GitProvider` interface in
|
||||
`internal/staticsite/provider.go` and implement for Gitea, GitHub, GitLab using each
|
||||
provider's existing raw-file endpoint + the **safe HTTP client**. Cap at 64 KiB.
|
||||
Distinguish 404 (file absent) from transport/5xx errors.
|
||||
- [ ] **`fetch.go`**: `Fetch(ctx, deps, w) (Result, error)` where
|
||||
`Result{ Raw []byte, Spec Spec, CommitSHA string, Status string /* ok|no_file|fetch_failed */ }`.
|
||||
Decrypt `access_token`, build provider via `NewGitProvider`, `GetLatestCommitSHA`, then
|
||||
`DownloadFile(gitops_path)`. Missing file → `no_file` (NOT an error). All errors routed
|
||||
through the existing `sanitizeError(msg, token)` so the token never leaks.
|
||||
- [ ] **Unit tests** (`*_test.go`): ParseSpec (valid/unknown-key/bad-version);
|
||||
MergeAndValidate (omitted-field preserved, invalid merged config rejected, no clobber);
|
||||
Drift (declared-only, deploy_strategy normalization, numeric coercion, no false positive
|
||||
on undeclared keys); a redaction test mirroring `static/helpers_test.go`.
|
||||
|
||||
## Verify
|
||||
|
||||
- `go build ./...`, `go vet ./internal/...`, `go test ./internal/...` green.
|
||||
|
||||
## Handoff notes
|
||||
|
||||
_(filled after implementation)_
|
||||
@@ -1,35 +0,0 @@
|
||||
# Phase 2 — Store + API (manual sync, explicit-action)
|
||||
|
||||
## Tasks
|
||||
|
||||
- [ ] **Store setters** (`internal/store/workloads.go` or a new `gitops.go`):
|
||||
- `SetWorkloadGitOps(id string, enabled bool, path string) error` — gated to
|
||||
dockerfile/static at the API layer.
|
||||
- `RecordGitOpsSync(id, commitSHA, syncedAt string) error`.
|
||||
- All writes re-read the row first / use targeted column updates (avoid full-row
|
||||
clobber races — review S5).
|
||||
- [ ] **Sync audit** (NOT deploy_history): a small `gitops_sync_audit` table
|
||||
(`id, workload_id, outcome, commit_sha, drift_count, error, created_at`) with an insert
|
||||
helper. Errors stored as generic markers only (secret-safe). _(Or reuse the event log if
|
||||
cleaner — pick one and note it.)_
|
||||
- [ ] **API handlers** (`internal/api/gitops.go`, wired in `internal/api/router.go`):
|
||||
- `GET /api/workloads/{id}/gitops` → `{ enabled, path, status, raw, parsed, commit_sha,
|
||||
last_sync_at, drift_count }` (calls `gitops.Fetch` + `gitops.Drift`).
|
||||
- `GET /api/workloads/{id}/gitops/drift` → `[]DriftEntry`.
|
||||
- `POST /api/workloads/{id}/gitops/sync` (`auth.AdminOnly`) → `Fetch` →
|
||||
`MergeAndValidate` → `UpdateWorkload` (single txn) → `RecordGitOpsSync` + audit.
|
||||
Returns the applied summary. Secret-safe errors.
|
||||
- `PUT /api/workloads/{id}/gitops` (`auth.AdminOnly`) → enable/disable + path; **reject
|
||||
if source_kind ∉ {dockerfile, static}** with a clear 400.
|
||||
- [ ] **Validation**: path must be a repo-relative file (no `..`, no leading `/`, sane
|
||||
length); `enabled` only when the source is git-backed and has repo coords.
|
||||
|
||||
## Verify
|
||||
|
||||
- `go build ./...`, `go vet ./internal/...`, `go test ./internal/...` green.
|
||||
- Handler tests: admin-gate on sync/put, no_file path, secret-safe error on a failed
|
||||
fetch, drift_count surfaced, non-git source rejected by PUT.
|
||||
|
||||
## Handoff notes
|
||||
|
||||
_(filled after implementation)_
|
||||
@@ -1,38 +0,0 @@
|
||||
# Phase 3 — Frontend experience (frontend-design + UI/UX agent showcase)
|
||||
|
||||
Built by the **frontend implementer agent** under the frontend-design skill. Must follow
|
||||
project conventions: Svelte 5 runes, `ToggleSwitch` for booleans, `ConfirmDialog` for the
|
||||
sync action, `$t` with **en+ru parity**, the `.panel` vocabulary from `DeployHistoryPanel`.
|
||||
|
||||
## Tasks
|
||||
|
||||
- [ ] **`web/src/lib/api.ts`**: `GitOpsStatus` + `DriftEntry` interfaces; `fetchWorkloadGitOps(id)`,
|
||||
`fetchWorkloadDrift(id)`, `syncWorkloadGitOps(id)`, `setWorkloadGitOps(id, {enabled, path})`
|
||||
following the existing `get<T>`/`post<T>` typed-fetcher pattern (mirror `fetchWorkloadDeploys`).
|
||||
- [ ] **`GitOpsPanel.svelte`** (mounted on `apps/[id]` near the other panels): the
|
||||
centerpiece. Sections:
|
||||
- Header: title + status pill (`synced` / `N changes` / `no file` / `fetch failed`) +
|
||||
last-sync/commit meta + enable/disable `ToggleSwitch`.
|
||||
- **Drift view** — the design focus. For each declared field show repo-value vs
|
||||
live-value with a clean/changed state. Distinctive, legible, on-brand (forge tokens,
|
||||
`--forge-mono`, the `--color-warning`/`--color-success` hues already used). No diff
|
||||
library exists — design a purpose-built field-level diff (NOT a generic `<pre>` dump).
|
||||
- Rendered `.tinyforge.yml` preview (the `.code-area`/editor frame vocabulary).
|
||||
- "Sync now" button → `ConfirmDialog` ("apply repo config to live") → `syncWorkloadGitOps`
|
||||
→ toast + refresh. Admin-only affordance.
|
||||
- `no_file` / `fetch_failed` empty states (clear, not alarming).
|
||||
- [ ] **GitOps-managed badge** on apps list rows (`apps/+page.svelte`, only dockerfile/static)
|
||||
and the detail hero — reuse the `.badge` chip vocabulary.
|
||||
- [ ] **Read-only gate** on the source-config edit form: when managed, lock exactly the
|
||||
fields the synced overlay declares (derive from the drift/parsed payload) + a banner
|
||||
("managed by `.tinyforge.yml` — edit the file and sync").
|
||||
- [ ] **i18n**: `apps.detail.gitops.*` in BOTH `en.json` and `ru.json` (verify parity).
|
||||
|
||||
## Verify
|
||||
|
||||
- `npm run check` (0 errors), `npm run build`, `npm run test` green; i18n key parity equal.
|
||||
- Restart dev server (`./scripts/dev-server.sh`).
|
||||
|
||||
## Handoff notes
|
||||
|
||||
_(filled after implementation)_
|
||||
@@ -1,25 +0,0 @@
|
||||
# Phase 4 — Hardening + docs + final review
|
||||
|
||||
## Tasks
|
||||
|
||||
- [ ] **Concurrent-sync guard** (review S5): per-workload sync mutex (or re-read-then-apply
|
||||
with a compare) so a sync racing the edit-form save / another sync can't silently lose
|
||||
writes.
|
||||
- [ ] **File-size + path hardening**: confirm the 64 KiB `DownloadFile` cap is enforced
|
||||
across all three providers; confirm `gitops_path` validation rejects traversal.
|
||||
- [ ] **Security-reviewer pass**: SSRF (verify the fetch goes through `NewGitProvider`/the
|
||||
safe client, never raw `http.Get`), secret handling (token never logged/persisted/leaked
|
||||
in errors — `sanitizeError`), admin-gating on sync + put.
|
||||
- [ ] **Docs**: `docs/gitops.md` (or extend `docs/plans/`): the `.tinyforge.yml` v1 schema
|
||||
reference, how to enable, the sync flow, and an explicit **"not in v1"** section
|
||||
(env/faces, auto-apply-on-deploy, multi-workload Framing B) with the future seams noted.
|
||||
- [ ] **Final comprehensive review** + (if triggered) security review, then present for the
|
||||
merge gate.
|
||||
|
||||
## Verify
|
||||
|
||||
- Full backend + frontend build/test/vet green; dev server healthy on :8090.
|
||||
|
||||
## Handoff notes
|
||||
|
||||
_(filled after implementation)_
|
||||
@@ -1,60 +0,0 @@
|
||||
# CONTEXT — Volume Snapshot Restore
|
||||
|
||||
Working memory across phases. The orchestrator owns this file.
|
||||
|
||||
## Settings (from PLAN.md header)
|
||||
|
||||
- Mode: **Automated** · Execution: **Hybrid** (backend Direct, Phase 4 frontend implementer) · Strategy: **Incremental**
|
||||
- Base: `main` · Branch: `feature/volume-snapshot-restore` · Remote: origin (Gitea)
|
||||
- Build: `go build ./...` · Test: `go test ./internal/...` + `npm run test` · Lint: `go vet ./internal/...` + `npm run check`
|
||||
|
||||
## Key codebase facts (verified during planning)
|
||||
|
||||
- **Deploy choke point:** every deploy entrypoint calls `deployer.DispatchPlugin` →
|
||||
put the per-workload lock there (C1). Entrypoints: `deployPluginWorkload`,
|
||||
`rollbackWorkload`, `promoteFromWorkload`, `dispatchGeneric`, webhook
|
||||
`fireBinding`/`handlePreviewIntent`.
|
||||
- **`activeWg`/`drainMu`** in `deployer.go` = global drain barrier, NOT a per-workload lock.
|
||||
- **Image idempotency short-circuit** (`image.go` Deploy ~L170-181) only fires for a
|
||||
*verified-running* container → after stop, redeploy makes a fresh container; blue-green
|
||||
`enforceMaxInstances` reaps the old stopped one. ⇒ stop→swap→redeploy (C4) is correct.
|
||||
- **Scope resolution** (`internal/volume/resolver.go`): stage/project → `<base>/<workload>/<source>`
|
||||
(shared per-workload dir); absolute → operator's allowed path. Stage tmp/old siblings under
|
||||
the live dir's PARENT so renames are same-fs (R2).
|
||||
- **`volsnap.Engine`** has `e.mu` taken by Create/Delete/pruneWorkload/CleanOrphans.
|
||||
`Restore` must NOT hold `e.mu` (R1).
|
||||
- **Archive layout:** gzip tar, each volume under integer subdir `0/`,`1/`…, `manifest.json`
|
||||
at root = `[]SnapshotVolume{Index,Target,Scope,Source}`. `supportedScopes` =
|
||||
absolute/stage/project (volumes.go).
|
||||
- **Precedent:** `internal/api/backups.go` `restoreBackup` — X-Confirm-Restore==id,
|
||||
`restoreInFlight` CAS→409, pre-restore safety backup, atomic rename swap.
|
||||
- **Composition root:** `cmd/server/main.go` constructs `deployer.New` + `volsnap.New` +
|
||||
`docker` + `store`; calls `CleanOrphans` at startup (wire `RecoverInterruptedRestores` there).
|
||||
- **Frontend:** `WorkloadSnapshotsPanel.svelte`; api fns `web/src/lib/api.ts` ~L581;
|
||||
i18n `apps.detail.snapshots.*` in en.json + ru.json.
|
||||
- `golang.org/x/sys v0.33.0` already in go.mod (indirect); build-tag precedent exists
|
||||
(`lockfile_windows.go`/`lockfile_unix.go`).
|
||||
|
||||
## Decisions / invariants
|
||||
|
||||
- `Engine.Restore` holds NO `e.mu`; per-workload `Lifecycle.Lock` is the serialization.
|
||||
- Extract ALL tmp dirs BEFORE any rename; swap is pure renames; journal tracks per-volume `swapped`.
|
||||
- Pre-restore snapshot captured AFTER stop, BEFORE first rename (durable escape hatch).
|
||||
- Redeploy pins the newest-running container's tag (same version back up).
|
||||
- Mixed per-volume state after a mid-restore crash is an accepted v1 limit (each volume intact; pre-restore snapshot = full revert).
|
||||
|
||||
## Deferred / out of scope
|
||||
|
||||
- Named/project_named/instance/ephemeral scopes (consistent with capture).
|
||||
- Non-image sources.
|
||||
- Fully-atomic all-volumes-or-nothing restore (v1 is per-volume atomic + journal recovery).
|
||||
|
||||
## Failed approaches / gotchas
|
||||
|
||||
- (none yet)
|
||||
|
||||
## Phase handoffs
|
||||
|
||||
- Phase 1 → 2: _(filled after Phase 1)_
|
||||
- Phase 2 → 3: _(filled after Phase 2)_
|
||||
- Phase 3 → 4: _(filled after Phase 3)_
|
||||
@@ -1,113 +0,0 @@
|
||||
# Feature: Volume Snapshot Restore (backlog #6)
|
||||
|
||||
**Branch:** `feature/volume-snapshot-restore`
|
||||
**Base branch:** `main`
|
||||
**Created:** 2026-06-22
|
||||
**Status:** 🟡 In Progress
|
||||
**Strategy:** Incremental
|
||||
**Mode:** Automated
|
||||
**Execution:** Hybrid — backend (Phases 1–3) Direct by the orchestrator; Phase 4 via the frontend implementer
|
||||
**Remote:** origin (https://git.dolgolyov-family.by/alexei.dolgolyov/tiny-forge.git)
|
||||
|
||||
## Summary
|
||||
|
||||
Restore a previously-captured volume snapshot (gzip tar of an image workload's host-bind
|
||||
data volumes) back onto the live volume directories, then bring the app back up. Capture
|
||||
already ships (`internal/volsnap`); restore is greenfield and **data-loss-sensitive** — a
|
||||
wrong design is permanent data loss, so the design was adversarially plan-reviewed twice
|
||||
(prior session + this phase breakdown).
|
||||
|
||||
**Scope (deliberate):** image-source workloads only; volume scopes `absolute` / `stage` /
|
||||
`project` only — driven off the SAME `volsnap.supportedScopes` constant capture uses. Named
|
||||
/ project_named (Docker named volumes), instance, and ephemeral scopes are out (consistent
|
||||
with capture).
|
||||
|
||||
## Mandatory design fixes (non-negotiable — a wrong design = permanent data loss)
|
||||
|
||||
- **C1** Serialize via a per-workload `keyedMutex` (the `internal/api/gitops.go` pattern,
|
||||
extracted to `internal/keyedmutex`) keyed by workload id, gating EVERY deploy entrypoint.
|
||||
All entrypoints funnel through `deployer.DispatchPlugin` (verified: deploy, rollback,
|
||||
promote, generic-hooks, webhook fireBinding/handlePreviewIntent), so the lock lives there.
|
||||
NOT `activeWg` (a global drain barrier, not a per-workload lock).
|
||||
- **C2** Extract-to-temp + atomic rename-swap (extract→`.tmp`, rename live→`.old`, rename
|
||||
`.tmp`→live), NEVER in-place. Mirrors `internal/api/backups.go` restore precedent.
|
||||
- **C3** All-or-nothing pre-flight re-resolution via `volume.ResolveWorkloadPath` — abort
|
||||
BEFORE stopping containers if ANY manifest volume doesn't resolve (config drift =
|
||||
corruption). Runs before `Lock`/`StopContainers`.
|
||||
- **C4** Image containers are recreated, not reused → **stop → swap → redeploy** (re-dispatch
|
||||
via `DispatchPlugin`/`RedeployLocked`), NOT `StartContainer(oldID)`. Verified: image
|
||||
source's idempotency short-circuit only fires for a *verified-running* container, so a
|
||||
redeploy after stop creates a fresh container on restored data; `enforceMaxInstances`
|
||||
reaps the old stopped one.
|
||||
- **C5** Disk-space pre-check, **per target filesystem** (peak = live + extracted coexist).
|
||||
- **C6** Treat the archive as UNTRUSTED on extract: zip-slip `HasPrefix` containment,
|
||||
reject symlink/hardlink/device/fifo/socket entries, manifest-index bounds, decompression-
|
||||
bomb cap. Require an `X-Confirm-Restore: <sid>` header like the DB restore (CSRF guard).
|
||||
|
||||
### Folded-in (also mandatory)
|
||||
- Single-flight per-workload CAS → 409 (different apps may restore concurrently).
|
||||
- Auto-capture a pre-restore snapshot, **durably committed before the first destructive
|
||||
rename** (the operator's clean escape hatch).
|
||||
- Logic lives in `Engine.Restore` (engine), not the API handler.
|
||||
|
||||
### Resolutions from the phase-breakdown plan review (2026-06-22)
|
||||
- **R1 (e.mu deadlock):** `Engine.Restore` does NOT hold `e.mu`; per-workload `Lifecycle.Lock`
|
||||
is the serialization. `Create`'s own `e.mu` guards only the pre-restore archive write.
|
||||
- **R2 (cross-device / containment):** stage `tmp`+`old` as siblings under the **live dir's
|
||||
own parent** (same filesystem ⇒ atomic rename). Detect `EXDEV` → abort/rollback loudly.
|
||||
- **R3 (crash window):** durable pre-restore snapshot before any rename; **extract all tmp
|
||||
dirs first, then pure renames**; restore-journal + startup `RecoverInterruptedRestores()`
|
||||
sweep (revert `live-missing→.old`, clean orphan tmp).
|
||||
- **R4:** C5 checks per-target-filesystem; `StopContainers` returns newest-running tag so
|
||||
redeploy pins the same version, and marks rows stopped; `Engine.Restore` re-validates the
|
||||
workload AFTER acquiring the lock; best-effort audit event emitted.
|
||||
|
||||
## Build & Test Commands
|
||||
|
||||
- **Build:** `go build ./...`
|
||||
- **Test:** `go test ./internal/...` (backend); from `web/`: `npm run test`
|
||||
- **Lint:** `go vet ./internal/...`; from `web/`: `npm run check`
|
||||
- **Frontend build:** from `web/`: `npm run build`
|
||||
- **Dev:** `./scripts/dev-server.sh` (port 8090; restart after every build)
|
||||
|
||||
## Phases
|
||||
|
||||
- [x] Phase 1: Restore engine primitives + path-safe extractor + unit tests [domain: backend] → [subplan](./phase-1-engine-primitives.md)
|
||||
- [x] Phase 2: Engine.Restore orchestration + lifecycle/locking + rollback [domain: backend] → [subplan](./phase-2-lifecycle-locking.md)
|
||||
- [x] Phase 3: API endpoint + CSRF header + single-flight + wiring + tests [domain: backend] → [subplan](./phase-3-api.md)
|
||||
- [x] Phase 4: UI Restore button + ConfirmDialog + i18n en+ru [domain: frontend] → [subplan](./phase-4-frontend.md)
|
||||
|
||||
## Parallelizable Phase Groups (Orchestrator mode only)
|
||||
|
||||
None — strictly sequential. Each phase depends on the prior (P2 needs P1 primitives + the
|
||||
Lifecycle seam; P3 wires the adapter + needs `Engine.SetLifecycle`; P4 needs the endpoint).
|
||||
|
||||
## Phase Progress Log
|
||||
|
||||
| Phase | Domain | Status | Review | Build | Committed |
|
||||
|-------|--------|--------|--------|-------|-----------|
|
||||
| Phase 1: engine primitives | backend | ✅ Done | ✅ Passed (APPROVE w/ notes) | ✅ Passed | ⬜ |
|
||||
| Phase 2: lifecycle/locking | backend | ✅ Done | ✅ Passed (APPROVE w/ notes) | ✅ Passed | ⬜ |
|
||||
| Phase 3: API endpoint | backend | ✅ Done | ✅ Passed (go: APPROVE w/ notes; security: fixed CRITICAL) | ✅ Passed | ⬜ |
|
||||
| Phase 4: frontend | frontend | ✅ Done | ✅ Passed (ts: APPROVE) | ✅ Passed (check 0 err, build, 26 tests) | ⬜ |
|
||||
|
||||
## Outstanding Warnings
|
||||
|
||||
| Phase | Warning | Severity | Status (open / resolved / accepted) |
|
||||
|-------|---------|----------|-------------------------------------|
|
||||
| (design) | Mid-restore crash can leave a per-volume MIXED state (some restored, some original); each volume is individually intact and the pre-restore snapshot is the full escape hatch. | 🟡 | accepted (documented v1 limit) |
|
||||
| 2→3 | **B1 (was Blocker):** `RecoverInterruptedRestores()` + `SetLifecycle()` MUST be wired at startup BEFORE the API server serves — restore endpoint must not be reachable without them. | 🔴→tracked | open — HARD Phase 3 prerequisite |
|
||||
| 2 | W3 residual: the swap-failure-after-partial-swap ORCHESTRATION branch (rollbackSwaps glue) is covered by primitive unit tests + recovery test + extract-failure orchestration test, but not a full mid-swap fault-injection (needs an fs-fault seam not worth the production complexity). | 🟡 | accepted |
|
||||
|
||||
## Final Review
|
||||
|
||||
- [x] Comprehensive code review — ✅ READY TO MERGE (no blockers/warnings; 3 non-blocking notes)
|
||||
- [x] Security review (untrusted-archive extraction + CSRF + admin gating) — CRITICAL found & fixed (manifest-Source path traversal); re-derive from current config + containment
|
||||
- [x] All Outstanding Warnings resolved or consciously accepted
|
||||
- [x] Full build passes (`go build ./...`, `npm run build`)
|
||||
- [x] Full test suite passes (`go test ./internal/...`, `npm run test` 26, `npm run check` 0 err)
|
||||
- [ ] Merged to `main` (squash)
|
||||
|
||||
## Amendment Log
|
||||
|
||||
_(none yet)_
|
||||
@@ -1,96 +0,0 @@
|
||||
# Phase 1: Restore engine primitives + path-safe extractor + unit tests
|
||||
|
||||
**Status:** ✅ Complete
|
||||
**Parent plan:** [PLAN.md](./PLAN.md)
|
||||
**Domain:** backend
|
||||
|
||||
## Objective
|
||||
|
||||
Build the dangerous filesystem primitives in isolation, fully unit-tested, with NO
|
||||
docker/lifecycle wiring. Each is a pure function over directories + the store + a parsed
|
||||
manifest. No caller yet (exercised by tests so not "unused"). Zero behavior change to
|
||||
existing capture.
|
||||
|
||||
## Tasks
|
||||
|
||||
- [ ] **`internal/volsnap/extract.go`** — `safeExtractIndex(archivePath string, index int, dest string, bombCap int64) (int64, error)`:
|
||||
open the gzip tar, extract only entries under the `"<index>/"` prefix into `dest`, return
|
||||
bytes written. UNTRUSTED-input guards (C6):
|
||||
- zip-slip: `target := filepath.Join(dest, rel)`; require `strings.HasPrefix(filepath.Clean(target)+sep, cleanDest+sep)` (or `target == cleanDest`); reject otherwise.
|
||||
- allow ONLY `tar.TypeReg` + `tar.TypeDir`; reject symlink/hardlink/char/block/fifo/socket with an error (never follow).
|
||||
- decompression-bomb cap: running byte counter; abort when it would exceed `bombCap`.
|
||||
- create parent dirs as needed; files `0o600`, dirs `0o700` (data dirs; ownership is the container's concern).
|
||||
- skip `manifest.json` and any entry whose leading path segment ≠ `index`.
|
||||
- [ ] **`internal/volsnap/restore.go`** (primitives only — NO orchestration):
|
||||
- `archiveUncompressedSize(archivePath string, bombCap int64) (int64, error)` — header-only sizing pass summing `hdr.Size`, enforcing `bombCap` (feeds C5). Per-index sizes too (`map[int]int64`) so C5 can check per filesystem.
|
||||
- `parseManifest(snap store.VolumeSnapshot) ([]SnapshotVolume, error)`.
|
||||
- `preflightResolve(w store.Workload, settings store.Settings, manifest []SnapshotVolume) ([]resolvedVol, error)` — ALL-OR-NOTHING (C3): for every manifest volume require `supportedScopes[scope]` AND `volume.ResolveWorkloadPath` succeeds; on first failure return an error naming target+scope+reason (abort signal). `resolvedVol{Index, Target, Scope, LivePath}`. Reuses the SAME `supportedScopes` map.
|
||||
- swap helpers (C2 + R2 + R3): staging is **sibling to the live dir's parent** so renames are same-filesystem. `stagingRoot(live string) string` = `filepath.Join(filepath.Dir(live), ".tf-restore-"+token)`. `swapVolumeDir(live, tmp, old string) error` = rename(live→old) then rename(tmp→live); detect `EXDEV`/cross-device and return a clear error WITHOUT having moved anything irreversibly (check device equivalence up-front or treat the rename error as fatal/rollback). `rollbackSwaps(done []swap) error` = for each completed swap in reverse, rename(live→discard), rename(old→live).
|
||||
- `freeDiskBytes(path string) (uint64, error)` — platform helper. Build-tag split mirroring the repo's `lockfile_windows.go`/`lockfile_unix.go` precedent: `disk_unix.go` (`//go:build !windows`, `syscall.Statfs`) + `disk_windows.go` (`golang.org/x/sys/windows.GetDiskFreeSpaceEx`). Production target is Linux.
|
||||
- [ ] **Constants:** `maxRestoreUncompressedBytes` (decompression-bomb cap) + `diskFreeSafetyMargin` named consts with rationale comments.
|
||||
|
||||
## Files to Modify/Create
|
||||
|
||||
- `internal/volsnap/extract.go` — untrusted extractor (new)
|
||||
- `internal/volsnap/restore.go` — primitives: sizing, manifest parse, preflight, swap/rollback, free-disk (new)
|
||||
- `internal/volsnap/disk_unix.go`, `internal/volsnap/disk_windows.go` — free-disk platform split (new)
|
||||
- `internal/volsnap/extract_test.go`, `internal/volsnap/restore_test.go` — unit tests (new)
|
||||
- `go.mod` — `golang.org/x/sys` promoted indirect→direct (already present v0.33.0)
|
||||
|
||||
## Acceptance Criteria
|
||||
|
||||
- Zip-slip (`../`, absolute, `..\\` on win), symlink, hardlink, device, fifo entries all rejected by `safeExtractIndex`.
|
||||
- Decompression-bomb cap aborts extraction + sizing past the cap.
|
||||
- Happy-path extract round-trip restores file tree + contents byte-for-byte under `dest`.
|
||||
- `swapVolumeDir` + `rollbackSwaps`: full and PARTIAL-swap rollback leave the original live dirs byte-identical.
|
||||
- `preflightResolve` is all-or-nothing: one unresolvable/unsupported-scope volume → error, and the caller renames nothing.
|
||||
- `archiveUncompressedSize` matches the real extracted total.
|
||||
- `go test ./internal/volsnap/...`, `go build ./...`, `go vet ./internal/...` all green.
|
||||
|
||||
## Notes
|
||||
|
||||
- Open the archive once per pass; on Unix an open fd survives a concurrent `Delete` unlink (defence against a racing snapshot delete); Windows refuses delete of an open file. Acceptable.
|
||||
- `safeExtractIndex` writes into a caller-provided `dest` (the staging `tmp`), never directly onto the live path — the swap is a separate step (C2).
|
||||
|
||||
## Review Checklist
|
||||
|
||||
- [ ] All tasks completed
|
||||
- [ ] Code follows project conventions (gofmt, wrapped errors, small funcs)
|
||||
- [ ] No unintended side effects (no change to Create/List/Delete)
|
||||
- [ ] Build passes
|
||||
- [ ] Tests pass (new + existing)
|
||||
|
||||
## Handoff to Next Phase
|
||||
|
||||
Implemented files: `extract.go` (`safeExtractIndex`, `stripIndexPrefix`, `leadingIndex`,
|
||||
`withinDir`), `restore.go` (`parseManifest`, `preflightResolve`, `archiveUncompressedSize`,
|
||||
`swap`/`swapVolumeDir`/`rollbackSwaps`/`stagingDirs`, consts `maxRestoreUncompressedBytes`
|
||||
= 50 GiB, `diskFreeHeadroomBytes` = 256 MiB), `disk_unix.go`/`disk_windows.go`
|
||||
(`freeDiskBytes`). Tests in `extract_test.go` + `restore_test.go`. `go.mod`: `x/sys` →
|
||||
direct.
|
||||
|
||||
**API contract for Phase 2 (Engine.Restore):**
|
||||
- `safeExtractIndex(archivePath, index, dest, bombCap)` — extracts ONE volume's subtree into
|
||||
a FRESH `dest` (uses `O_EXCL`); returns bytes written. Call once per resolved volume into
|
||||
its `tmp` staging dir.
|
||||
- `preflightResolve(w, settings, manifest)` → `[]resolvedVol{Index,Target,Scope,LivePath}`,
|
||||
ALL-OR-NOTHING; already rejects unsupported scopes AND negative indices. Run BEFORE
|
||||
Lock/StopContainers.
|
||||
- `stagingDirs(live, token, index)` → `(tmp, old)` siblings of `filepath.Dir(live)` (same-fs
|
||||
⇒ atomic rename). Use a per-restore `token`.
|
||||
- `swapVolumeDir(live, tmp, old)` → `(hadOld, err)`; self-reverts the first rename on failure
|
||||
(live never left missing). Collect each completed swap into `[]swap{live,old,tmp,hadOld}`
|
||||
and call `rollbackSwaps(done)` on any later failure.
|
||||
- `archiveUncompressedSize(archivePath, bombCap)` → `(perIndex map[int]int64, total, err)`
|
||||
for the C5 per-filesystem free-disk check. NOTE: it's a LOWER-BOUND (ignores dir/inode
|
||||
overhead) — treat as advisory; the staged-extract+swap is the real net.
|
||||
- `freeDiskBytes(path)` — pass the live dir's PARENT (where tmp/old land).
|
||||
|
||||
**Phase 2 must:** extract ALL tmp dirs first, THEN swap all (shrinks the destructive
|
||||
window); validate each manifest index maps to an existing archive subtree (W2 — only the
|
||||
negative check is done so far); the disk pre-check should sum per-target-filesystem.
|
||||
|
||||
**Review (go-reviewer, APPROVE WITH NOTES):** no blockers. Addressed in-phase: W2 (negative
|
||||
index reject), W3 (explicit second-rename self-revert test), W4 (stagingDirs test), N1/N2/N4
|
||||
(comments + sparse-type rejection test). W1 (disk estimate is lower-bound) folded into Phase
|
||||
2 guidance above.
|
||||
@@ -1,106 +0,0 @@
|
||||
# Phase 2: Engine.Restore orchestration + lifecycle/locking + rollback
|
||||
|
||||
**Status:** ✅ Complete
|
||||
**Parent plan:** [PLAN.md](./PLAN.md)
|
||||
**Domain:** backend
|
||||
|
||||
## Objective
|
||||
|
||||
Wire the Phase 1 primitives into the full **stop → swap → redeploy** sequence under a
|
||||
per-workload lock, with crash-safe rollback (journal + recovery sweep) and a durable
|
||||
pre-restore auto-capture. Define the `Lifecycle` seam; modify the Deployer for per-workload
|
||||
locking + an unlocked redeploy.
|
||||
|
||||
## Tasks
|
||||
|
||||
- [ ] **`internal/keyedmutex/keyedmutex.go`** — extract the `gitops.go` pattern into a shared
|
||||
package: `type Mutex` with `Lock(key string) func()` and `TryLock(key string) (func(), bool)`
|
||||
(the Try variant serves the Phase 3 API single-flight → 409). Unit test both.
|
||||
- [ ] **Deployer locking (C1)** in `internal/deployer/`:
|
||||
- add `workloadLocks keyedmutex.Mutex` field.
|
||||
- refactor `DispatchPlugin` → `unlock := d.workloadLocks.Lock(w.ID); defer unlock(); return d.dispatchLocked(ctx, w, intent)`; move the current body into unexported `dispatchLocked`.
|
||||
- wrap `DispatchTeardown` in the same per-workload lock.
|
||||
- do NOT lock `DispatchReconcile` (periodic; image Reconcile is a no-op; reconciler `markMissingRows` only flips labels = benign; locking it would stall the reconcile loop behind long deploys).
|
||||
- expose `func (d *Deployer) LockWorkload(id string) func()` and `func (d *Deployer) RedeployLocked(ctx, w, intent) error` (= `dispatchLocked`, doc: "caller already holds the workload lock; calling DispatchPlugin would deadlock").
|
||||
- [ ] **`volsnap.Lifecycle` interface** (in volsnap):
|
||||
- `Lock(workloadID string) func()`
|
||||
- `StopContainers(ctx, workloadID string) (runningTag string, err error)` — stop every running container for the workload; return the **newest-running** container's `ImageTag` (so redeploy pins the same version; empty ⇒ source default). Mark stopped rows `State="stopped"`.
|
||||
- `Redeploy(ctx, w store.Workload, reference string) error` — unlocked re-dispatch, Reason `"restore"`, Reference=tag.
|
||||
- [ ] **`Engine.Restore(ctx, snapshotID, workloadID string) error`** in `internal/volsnap/restore.go`
|
||||
(engine owns it). Sequence — **does NOT hold `e.mu`** (R1):
|
||||
1. load snap; verify `snap.WorkloadID == workloadID`; load workload + settings; require `source_kind=="image"`.
|
||||
2. `parseManifest`; `preflightResolve` (C3 — abort if any fails); `archiveUncompressedSize` + per-filesystem `freeDiskBytes` pre-check (C5/R4 — abort).
|
||||
3. `unlock := lc.Lock(workloadID); defer unlock()` (C1).
|
||||
4. **re-validate** the workload still exists (R4 — teardown may have won the lock); abort if gone.
|
||||
5. `tag, _ := lc.StopContainers(ctx, workloadID)` (C4 stop).
|
||||
6. **durably** capture pre-restore snapshot: `e.Create(w, settings, "pre-restore")` (folded; AFTER stop = quiesced; BEFORE any rename = R3). `Create` takes its own `e.mu` — Restore must hold none.
|
||||
7. write **restore journal** `<snapDir>/restore-<workloadID>.json` (snapshotID, per-volume {live, old, tmp, swapped:false}).
|
||||
8. **extract ALL** volumes to their `tmp` staging dirs (`safeExtractIndex`) — R3 (shrinks the destructive window to pure renames).
|
||||
9. **swap** each volume (`swapVolumeDir`), updating the journal `swapped=true` per volume.
|
||||
10. on ANY error in 8–9 → `rollbackSwaps` + `lc.Redeploy(ctx, w, tag)` + delete journal + return wrapped error.
|
||||
11. success → `lc.Redeploy(ctx, w, tag)` (C4 redeploy); remove `.old` staging dirs (reclaim disk); delete journal; best-effort audit event (`store.InsertEvent` source `"volsnap"`).
|
||||
- `Engine.SetLifecycle(lc Lifecycle)` setter; `Restore` errors clearly if lifecycle is nil.
|
||||
- [ ] **`Engine.RecoverInterruptedRestores() (int, error)`** (R3) — startup sweep, mirrors
|
||||
`CleanOrphans`: for each `restore-*.json` journal, per volume: if `swapped` → remove `old`+`tmp`;
|
||||
else if live missing && old exists → rename old→live (revert mid-rename crash), remove tmp;
|
||||
else (live present, not swapped) → remove tmp. Delete journal. Log loudly. (Wiring at startup
|
||||
happens in Phase 3's main.go change, beside `CleanOrphans`.)
|
||||
|
||||
## Files to Modify/Create
|
||||
|
||||
- `internal/keyedmutex/keyedmutex.go` (+ `_test.go`) — shared lock (new)
|
||||
- `internal/deployer/deployer.go`, `internal/deployer/dispatch.go` — workloadLocks, dispatchLocked, LockWorkload, RedeployLocked, locked Teardown
|
||||
- `internal/volsnap/restore.go` — Lifecycle interface, Engine.Restore, RecoverInterruptedRestores, SetLifecycle, journal type
|
||||
- `internal/volsnap/restore_test.go` — fake-Lifecycle orchestration tests (extends Phase 1 file)
|
||||
- `internal/api/gitops.go` — (optional, low-risk) migrate `keyedMutex`→`keyedmutex.Mutex` for DRY
|
||||
|
||||
## Acceptance Criteria
|
||||
|
||||
- Lock re-entrancy: `Engine.Restore` → `RedeployLocked` does NOT re-acquire the workload lock (no deadlock). All existing deployer tests still pass (lock is externally transparent).
|
||||
- **Happy-path orchestration test uses the REAL `Engine.Create` (real store + `t.TempDir()`)** for the pre-restore capture so the `e.mu` deadlock (R1) would fail `go test`, not prod. Asserts call order: preflight → lock → stop → create → extract-all → swap-all → redeploy → cleanup.
|
||||
- Rollback test: a swap fails midway → originals restored, redeploy called, journal deleted, error returned.
|
||||
- Preflight-fail test: lock/stop NEVER called (abort before lock).
|
||||
- Disk-pre-check-fail test: abort before lock.
|
||||
- `RecoverInterruptedRestores` test: simulate journals in each crash state → correct revert/keep/cleanup.
|
||||
- `go build ./...`, `go vet ./internal/...`, `go test ./internal/...` green.
|
||||
|
||||
## Notes
|
||||
|
||||
- ⚠️ The Deployer lock change touches the hot deploy path — verify no existing path re-enters `DispatchPlugin` under a held lock (webhook preview = sequential teardown-then-deploy on the child, not nested — confirmed safe).
|
||||
- The API single-flight (Phase 3) is a fast 409 reject; the deployer lock is the real mutex — they compose (document).
|
||||
|
||||
## Review Checklist
|
||||
|
||||
- [ ] All tasks completed
|
||||
- [ ] Code follows project conventions
|
||||
- [ ] No unintended side effects (existing deploy/teardown behavior unchanged externally)
|
||||
- [ ] Build passes
|
||||
- [ ] Tests pass (new + existing)
|
||||
|
||||
## Handoff to Next Phase
|
||||
|
||||
Implemented: `internal/keyedmutex` (Lock+TryLock, tested); deployer `workloadLocks` +
|
||||
`dispatchLocked` + `LockWorkload` + `RedeployLocked`, `DispatchPlugin`/`DispatchTeardown`
|
||||
now per-workload-locked (reconciler intentionally NOT). `volsnap.Lifecycle` interface,
|
||||
`Engine.Restore`, `restoreJournal` (atomic write — W1), `RecoverInterruptedRestores`,
|
||||
`recoverVolume`, `checkDiskSpace`, `SetLifecycle`. Tests: `restore_engine_test.go`
|
||||
(happy/real-Create, redeploy-fail, preflight-abort, extract-fail-after-lock, nil-lifecycle,
|
||||
wrong-workload, recovery×3 states), `keyedmutex_test.go`. Full `go test ./internal/...` green.
|
||||
|
||||
**Review (go-reviewer, APPROVE WITH NOTES):** no functional blockers in this diff. Verified:
|
||||
no lock re-entrancy/`e.mu` self-deadlock, no prune-race (extract-all precedes `e.Create`),
|
||||
recovery state machine doesn't revert good data. Addressed in-phase: W1 (atomic journal),
|
||||
W3 (extract-failure orchestration test). Residual W3 (mid-swap fault injection) accepted.
|
||||
|
||||
**🔴 HARD PREREQUISITES for Phase 3 (B1 + N1 from review):**
|
||||
1. Wire `snapshotEngine.RecoverInterruptedRestores()` at startup in `cmd/server/main.go`,
|
||||
BEFORE the API server serves — beside the existing `CleanOrphans()` call (~main.go:333).
|
||||
Without it the journal/WAL protects nothing — a crash mid-restore is unrecovered.
|
||||
2. Wire `snapshotEngine.SetLifecycle(adapter)` strictly BEFORE serving (same place as
|
||||
`SetSnapshotEngine`) so the `e.lifecycle` field is safely published (no race).
|
||||
3. The restore endpoint MUST NOT be reachable until both are wired.
|
||||
|
||||
**Lifecycle adapter (Phase 3, main.go) maps:** `Lock`→`deployer.LockWorkload`;
|
||||
`StopContainers`→`store.ListContainersByWorkload` + `docker.StopContainer` each running +
|
||||
`UpdateContainerState(...,"stopped")` + return newest-running `ImageTag`;
|
||||
`Redeploy`→`deployer.RedeployLocked` with a `restore`-reason intent (Reference=tag).
|
||||
@@ -1,76 +0,0 @@
|
||||
# Phase 3: API endpoint + CSRF header + single-flight + wiring + tests
|
||||
|
||||
**Status:** ✅ Complete
|
||||
**Parent plan:** [PLAN.md](./PLAN.md)
|
||||
**Domain:** backend
|
||||
|
||||
## Objective
|
||||
|
||||
Expose restore over HTTP behind the destructive-action guards, and wire the real
|
||||
`Lifecycle` adapter + startup recovery sweep at the composition root.
|
||||
|
||||
## Tasks
|
||||
|
||||
- [ ] **`restoreWorkloadSnapshot` handler** in `internal/api/volume_snapshots.go`:
|
||||
`POST /api/workloads/{id}/snapshots/{sid}/restore`.
|
||||
- require `X-Confirm-Restore: <sid>` header == path `sid` (C6, mirror `backups.go` `restoreBackup`); mismatch → 400.
|
||||
- per-workload single-flight via `keyedmutex.TryLock(id)` (or a `sync.Map` LoadOrStore) → 409 if a restore for this workload is already running; release on completion. (Different apps restore concurrently.)
|
||||
- load snapshot via `snapshotEngine.Get(sid)`; verify `snap.WorkloadID == id`; load workload; require `source_kind=="image"` (else 400).
|
||||
- call `s.snapshotEngine.Restore(r.Context(), id, sid)`. Synchronous (mirrors `deployPluginWorkload` which blocks on dispatch). `ErrNoSnapshotData`-class / client-actionable → 400; success → 200 `{"status":"restored",...}`; other → 500 with a generic message (no internal detail leak — mirror existing handlers; raw error to `slog`).
|
||||
- [ ] **Route registration** in `internal/api/router.go` under the admin group on `/workloads/{id}`, beside the existing `POST /snapshots`. Admin-gated.
|
||||
- [ ] **Lifecycle adapter + recovery wiring** in `cmd/server/main.go` (composition root — already
|
||||
imports deployer + docker + store + volsnap):
|
||||
- a small struct implementing `volsnap.Lifecycle`: `Lock`→`deployer.LockWorkload`; `StopContainers`→`store.ListContainersByWorkload` + `docker.StopContainer` each running + `store.UpdateContainerState(...,"stopped")` + return newest-running `ImageTag`; `Redeploy`→`deployer.RedeployLocked` with `toPluginWorkload`/`WorkloadFromStore` + a `restore`-reason intent.
|
||||
- `snapshotEngine.SetLifecycle(adapter)`.
|
||||
- call `snapshotEngine.RecoverInterruptedRestores()` at startup beside the existing `CleanOrphans()` call; log result.
|
||||
- keeps the `api` package decoupled from `deployer` (router.go deliberately avoids importing deployer).
|
||||
|
||||
## Files to Modify/Create
|
||||
|
||||
- `internal/api/volume_snapshots.go` — `restoreWorkloadSnapshot` handler + single-flight field on `Server`
|
||||
- `internal/api/router.go` — route registration
|
||||
- `cmd/server/main.go` — Lifecycle adapter, `SetLifecycle`, `RecoverInterruptedRestores` startup wiring
|
||||
- `internal/api/volume_snapshots_test.go` — handler tests (extends existing file)
|
||||
|
||||
## Acceptance Criteria
|
||||
|
||||
- Missing/mismatched `X-Confirm-Restore` → 400.
|
||||
- Concurrent second restore for the same workload → 409.
|
||||
- Non-image workload → 400; snapshot belonging to another workload → 400/404.
|
||||
- Happy path → 200 (with a fake engine/lifecycle or a seeded real engine).
|
||||
- `go build ./...`, `go vet ./internal/...`, `go test ./internal/...` green; existing snapshot tests pass.
|
||||
|
||||
## Notes
|
||||
|
||||
- The handler is thin: validation + single-flight + delegate to `Engine.Restore`. All the dangerous logic stays in the engine (folded requirement).
|
||||
- Confirm the admin group + `X-Confirm-Restore` together match the DB-restore threat model (custom header defeats CSRF form/img posts; admin JWT + AdminOnly gates authz).
|
||||
|
||||
## Review Checklist
|
||||
|
||||
- [ ] All tasks completed
|
||||
- [ ] Code follows project conventions
|
||||
- [ ] No unintended side effects
|
||||
- [ ] Build passes
|
||||
- [ ] Tests pass (new + existing)
|
||||
|
||||
## Handoff to Next Phase
|
||||
|
||||
Implemented: `restoreWorkloadSnapshot` handler (`internal/api/volume_snapshots.go`) — admin,
|
||||
`X-Confirm-Restore: <sid>` header, per-workload single-flight (`volRestoreInFlight.TryLock`
|
||||
→ 409); route `POST /api/workloads/{id}/snapshots/{sid}/restore`; `restoreLifecycle` adapter
|
||||
(`cmd/server/restore_lifecycle.go`); main.go wires `SetLifecycle` + `RecoverInterruptedRestores`
|
||||
BEFORE serving (B1/N1 resolved). Tests: header miss/mismatch→400, wrong-workload→400,
|
||||
non-image→400, not-found→404, happy-path→200, single-flight→409.
|
||||
|
||||
**SECURITY FIX (review BLOCK → resolved):** the security review caught a CRITICAL — the
|
||||
manifest's persisted `Source`/`Scope` (attacker-influenceable) was being trusted to compute
|
||||
the destructive swap target, so `Source:"../../etc"` could clobber `/etc`. Fixed by
|
||||
re-deriving the swap target from the workload's CURRENT config keyed by container Target path
|
||||
(`volumesByTarget` shared helper) + a `pathWithinBase` containment assertion for base-relative
|
||||
scopes. Regression guards: `TestPreflightResolve_IgnoresManifestSource`,
|
||||
`TestPreflightResolve_AllOrNothing`. Both reviews now clear.
|
||||
|
||||
**API for Phase 4 (frontend):** `POST /api/workloads/{id}/snapshots/{sid}/restore` with header
|
||||
`X-Confirm-Restore: <sid>` (REQUIRED — mirror how the DB restore sends it). 200 `{status:"restored"}`,
|
||||
400 (header/ownership/non-image), 404 (not found), 409 (already in progress), 500 (engine).
|
||||
Restore is synchronous (blocks until done, like deploy).
|
||||
@@ -1,70 +0,0 @@
|
||||
# Phase 4: UI Restore button + ConfirmDialog + i18n en+ru
|
||||
|
||||
**Status:** ⬜ Not Started
|
||||
**Parent plan:** [PLAN.md](./PLAN.md)
|
||||
**Domain:** frontend
|
||||
|
||||
Built by the **frontend implementer agent**. Must follow project conventions: Svelte 5
|
||||
runes, `ConfirmDialog` for the destructive action (NEVER `window.confirm`), `$t` with
|
||||
**en+ru parity**, the existing `.panel`/`.forge-btn-ghost` vocabulary in
|
||||
`WorkloadSnapshotsPanel.svelte`.
|
||||
|
||||
## Tasks
|
||||
|
||||
- [ ] **`web/src/lib/api.ts`**: `restoreSnapshot(workloadId: string, sid: string): Promise<void>`
|
||||
— POST `/api/workloads/${workloadId}/snapshots/${sid}/restore` with header
|
||||
`X-Confirm-Restore: ${sid}`. Check how the DB restore sends its `X-Confirm-Restore` header
|
||||
and reuse that fetch mechanism (the typed `post<T>` may need a header-capable variant or a
|
||||
raw `fetch` like `download` already uses).
|
||||
- [ ] **`WorkloadSnapshotsPanel.svelte`**:
|
||||
- add a **Restore** action per snapshot row (beside Download/Delete) → opens `ConfirmDialog`.
|
||||
- ConfirmDialog: strong destructive copy — title + message making clear it **overwrites
|
||||
live data and restarts the app**, and that a **pre-restore snapshot is auto-captured**;
|
||||
`confirmVariant="danger"`.
|
||||
- on confirm: call `restoreSnapshot`, show a "restoring…" busy state (disable row actions),
|
||||
toast success/failure, then `load()` to refresh.
|
||||
- update the file's top comment (currently "Restore is intentionally NOT here yet") to
|
||||
reflect that restore now ships.
|
||||
- [ ] **i18n**: add `apps.detail.snapshots.restore`, `.restoring`, `.restored`,
|
||||
`.restoreFailed`, `.confirmRestoreTitle`, `.confirmRestoreMessage` to BOTH
|
||||
`web/src/lib/i18n/en.json` and `web/src/lib/i18n/ru.json`. Verify parity manually (a missing
|
||||
key is NOT a build error — `$t` returns the key string).
|
||||
|
||||
## Files to Modify/Create
|
||||
|
||||
- `web/src/lib/api.ts` — `restoreSnapshot`
|
||||
- `web/src/lib/components/WorkloadSnapshotsPanel.svelte` — Restore button + ConfirmDialog + busy state
|
||||
- `web/src/lib/i18n/en.json`, `web/src/lib/i18n/ru.json` — restore keys (parity)
|
||||
|
||||
## Acceptance Criteria
|
||||
|
||||
- `npm run check` 0 errors; `npm run build` succeeds; `npm run test` green.
|
||||
- en/ru key parity equal (every new key in both files).
|
||||
- ConfirmDialog used (no native confirm/alert); danger variant; copy warns about data overwrite + app restart.
|
||||
- Restore button disabled while a restore is in flight.
|
||||
- Restart dev server (`./scripts/dev-server.sh`).
|
||||
|
||||
## Notes
|
||||
|
||||
- Only image-source workloads expose snapshots, so no source-kind gating is needed in the panel beyond what already exists.
|
||||
- Keep the Restore button visually subordinate to Download but clearly destructive (danger styling) — it's the most dangerous action in the panel.
|
||||
|
||||
## Review Checklist
|
||||
|
||||
- [ ] All tasks completed
|
||||
- [ ] Code follows project conventions (ToggleSwitch/ConfirmDialog rules, runes)
|
||||
- [ ] No unintended side effects
|
||||
- [ ] Build passes (`npm run check` + `npm run build`)
|
||||
- [ ] Tests pass; i18n parity verified
|
||||
|
||||
## Handoff to Next Phase
|
||||
|
||||
Implemented: `api.restoreSnapshot(workloadId, sid)` (POST + `X-Confirm-Restore` header,
|
||||
mirrors `restoreBackup`); `WorkloadSnapshotsPanel.svelte` Restore button per row →
|
||||
`ConfirmDialog` (danger, warns: overwrites live data + restarts app + auto pre-restore
|
||||
snapshot) → `doRestore` with busy state (`restoringId` disables all row actions, active row
|
||||
shows "Restoring…"); i18n `apps.detail.snapshots.restore*` in en+ru (parity verified).
|
||||
|
||||
Verify: `npm run check` 0 errors, `npm run build` OK, `npm run test` 26 pass; i18n parity
|
||||
equal; dev server restarted on :9000. typescript-reviewer: APPROVE (no blockers; one cosmetic
|
||||
wording note on the success toast addressed). Final phase — done.
|
||||
Reference in New Issue
Block a user