feat(volsnap): volume snapshot restore (backlog #6)

Restore a captured volume snapshot onto an image workload's live host-bind data volumes, then redeploy. This is the most destructive workload action, built to the adversarially-reviewed design (C1–C6) with all data-loss guards.

- Engine.Restore (engine-owned): all-or-nothing pre-flight re-resolution from the workload's CURRENT config (never the tamperable manifest), per-filesystem disk pre-check, per-workload lock, container quiesce, extract-to-tmp, durable pre-restore snapshot, write-ahead journal, atomic rename swap, redeploy, and crash-recovery sweep (RecoverInterruptedRestores) wired before serving.
- internal/keyedmutex: shared per-key lock; deployer now serializes every deploy entrypoint per workload via DispatchPlugin (+ LockWorkload/RedeployLocked for the restore re-dispatch, no deadlock).
- Untrusted-archive extractor: zip-slip containment, type allow-list (reg/dir only), decompression-bomb cap, manifest-index bounds.
- POST /api/workloads/{id}/snapshots/{sid}/restore: admin, X-Confirm-Restore header (CSRF), per-workload single-flight (409).
- WebUI: Restore button + danger ConfirmDialog + busy state + i18n (en/ru).

Scope: image-source only; scopes absolute/stage/project (driven off the same supportedScopes constant capture uses).

Plan-reviewed before coding; per-phase go/security/ts reviews; final review READY TO MERGE. Security review caught + fixed a CRITICAL manifest-Source path traversal (re-derive target from current config + base containment).

Plan: plans/volume-snapshot-restore/
2026-06-22 17:23:52 +03:00
parent 8a5f69af87
commit 1c47030854
33 changed files with 2825 additions and 34 deletions
@@ -0,0 +1,113 @@
+# Feature: Volume Snapshot Restore (backlog #6)
+
+**Branch:** `feature/volume-snapshot-restore`
+**Base branch:** `main`
+**Created:** 2026-06-22
+**Status:** 🟡 In Progress
+**Strategy:** Incremental
+**Mode:** Automated
+**Execution:** Hybrid — backend (Phases 1–3) Direct by the orchestrator; Phase 4 via the frontend implementer
+**Remote:** origin (https://git.dolgolyov-family.by/alexei.dolgolyov/tiny-forge.git)
+
+## Summary
+
+Restore a previously-captured volume snapshot (gzip tar of an image workload's host-bind
+data volumes) back onto the live volume directories, then bring the app back up. Capture
+already ships (`internal/volsnap`); restore is greenfield and **data-loss-sensitive** — a
+wrong design is permanent data loss, so the design was adversarially plan-reviewed twice
+(prior session + this phase breakdown).
+
+**Scope (deliberate):** image-source workloads only; volume scopes `absolute` / `stage` /
+`project` only, driven off the SAME `volsnap.supportedScopes` constant capture uses. Out of
+scope: named / project_named (Docker named volumes), instance, and ephemeral scopes
+(consistent with capture).
+
+## Mandatory design fixes (non-negotiable — a wrong design = permanent data loss)
+
+- **C1** Serialize via a per-workload `keyedMutex` (the `internal/api/gitops.go` pattern,
+  extracted to `internal/keyedmutex`) keyed by workload id, gating EVERY deploy entrypoint.
+  All entrypoints funnel through `deployer.DispatchPlugin` (verified: deploy, rollback,
+  promote, generic-hooks, webhook fireBinding/handlePreviewIntent), so the lock lives there.
+  NOT `activeWg` (a global drain barrier, not a per-workload lock).
+- **C2** Extract-to-temp + atomic rename-swap (extract→`.tmp`, rename live→`.old`, rename
+  `.tmp`→live), NEVER in-place. Mirrors `internal/api/backups.go` restore precedent.
+- **C3** All-or-nothing pre-flight re-resolution via `volume.ResolveWorkloadPath` — abort
+  BEFORE stopping containers if ANY manifest volume doesn't resolve (config drift =
+  corruption). Runs before `Lock`/`StopContainers`.
+- **C4** Image containers are recreated, not reused → **stop → swap → redeploy** (re-dispatch
+  via `DispatchPlugin`/`RedeployLocked`), NOT `StartContainer(oldID)`. Verified: image
+  source's idempotency short-circuit only fires for a *verified-running* container, so a
+  redeploy after stop creates a fresh container on restored data; `enforceMaxInstances`
+  reaps the old stopped one.
+- **C5** Disk-space pre-check, **per target filesystem** (at peak, the live data and the extracted copy coexist on disk).
+- **C6** Treat the archive as UNTRUSTED on extract: zip-slip `HasPrefix` containment,
+  reject symlink/hardlink/device/fifo/socket entries, manifest-index bounds, decompression-
+  bomb cap. Require an `X-Confirm-Restore: <sid>` header like the DB restore (CSRF guard).
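
The C6 extraction rules above could be enforced per archive entry roughly as follows. This is a sketch under stated assumptions, not the project's extractor: `safeTarget` and `copyCapped` are hypothetical names, the root is assumed to be the temporary extraction directory, and the byte budget is assumed to be shared across the whole archive.

```go
package main

import (
	"archive/tar"
	"errors"
	"fmt"
	"io"
	"path/filepath"
	"strings"
)

// safeTarget vets one archive entry and returns the path it may be written to.
// root is the temporary extraction directory; name and typeflag come from the
// (untrusted) tar header. Hypothetical helper, for illustration only.
func safeTarget(root, name string, typeflag byte) (string, error) {
	// Type allow-list: regular files and directories only. Symlinks,
	// hardlinks, devices, fifos and everything else are rejected.
	switch typeflag {
	case tar.TypeReg, tar.TypeDir:
	default:
		return "", fmt.Errorf("entry %q: type %q is not allowed", name, typeflag)
	}
	cleanRoot := filepath.Clean(root)
	dest := filepath.Join(cleanRoot, name) // Join also cleans, collapsing ".." segments
	// Zip-slip containment: the cleaned destination must be the root itself
	// or lie strictly beneath it. The trailing separator stops a sibling such
	// as "<root>-evil" from passing the prefix test.
	if dest != cleanRoot && !strings.HasPrefix(dest, cleanRoot+string(filepath.Separator)) {
		return "", fmt.Errorf("entry %q escapes the extraction root", name)
	}
	return dest, nil
}

// copyCapped copies src to dst while charging the bytes against a shared
// budget, and fails once the archive has produced more than the budget allows.
func copyCapped(dst io.Writer, src io.Reader, budget *int64) error {
	// Read at most one byte past the budget so an overrun is detectable.
	n, err := io.Copy(dst, io.LimitReader(src, *budget+1))
	*budget -= n
	if err != nil {
		return err
	}
	if *budget < 0 {
		return errors.New("archive exceeds the decompressed-size cap")
	}
	return nil
}
```

An extract loop would call `safeTarget` on every header before touching the filesystem and route every file body through `copyCapped`, so a single oversized or escaping entry aborts the restore while only the temporary directory has been written.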
+
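The per-workload lock from C1 can be pictured with a minimal per-key mutex. This is a sketch only, not the contents of `internal/keyedmutex`: the `KeyedMutex` type, a `Lock` that returns an unlock func, and the `PeakConcurrency` helper are hypothetical names, and reference counting of entries is an assumption made so the map does not grow with every workload id ever seen.

```go
package main

import "sync"

// KeyedMutex hands out one mutex per key: work on different keys runs in
// parallel, work on the same key is serialized. Hypothetical sketch.
type KeyedMutex struct {
	mu      sync.Mutex
	entries map[string]*entry
}

type entry struct {
	mu   sync.Mutex
	refs int // holders plus waiters; the entry is dropped at zero
}

func NewKeyedMutex() *KeyedMutex {
	return &KeyedMutex{entries: make(map[string]*entry)}
}

// Lock blocks until the lock for key is held and returns its unlock func.
func (k *KeyedMutex) Lock(key string) (unlock func()) {
	k.mu.Lock()
	e, ok := k.entries[key]
	if !ok {
		e = &entry{}
		k.entries[key] = e
	}
	e.refs++ // registered before blocking, so the entry cannot be deleted under us
	k.mu.Unlock()

	e.mu.Lock()
	return func() {
		e.mu.Unlock()
		k.mu.Lock()
		e.refs--
		if e.refs == 0 {
			delete(k.entries, key)
		}
		k.mu.Unlock()
	}
}

// Len reports how many keys currently have a holder or a waiter.
func (k *KeyedMutex) Len() int {
	k.mu.Lock()
	defer k.mu.Unlock()
	return len(k.entries)
}

// PeakConcurrency starts n goroutines that all lock the same key and returns
// the largest number observed inside the critical section at once.
func PeakConcurrency(k *KeyedMutex, key string, n int) int {
	var inside, peak int // only touched while holding the key's lock
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			unlock := k.Lock(key)
			defer unlock()
			inside++
			if inside > peak {
				peak = inside
			}
			inside--
		}()
	}
	wg.Wait()
	return peak
}
```

A dispatcher would take the lock at the top of its shared entrypoint (`unlock := km.Lock(workloadID); defer unlock()`), which is why funnelling every deploy path through one function keeps the lock placement simple.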
+### Folded-in (also mandatory)
+- Single-flight per workload via CAS: a second concurrent restore of the same workload gets 409 (different apps may restore concurrently).
+- Auto-capture a pre-restore snapshot, **durably committed before the first destructive
+  rename** (the operator's clean escape hatch).
+- Logic lives in `Engine.Restore` (engine), not the API handler.
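
The admission checks for the restore endpoint (confirmation header, then single-flight) might look like the sketch below. The names `inFlight` and `admit` are hypothetical, `sync.Map.LoadOrStore` stands in for whatever CAS the handler really uses, and returning 400 for a missing or mismatched `X-Confirm-Restore` value is an assumption; only the 409 for a concurrent restore comes from the plan.

```go
package main

import (
	"net/http"
	"sync"
)

// inFlight records which workloads currently have a restore running.
type inFlight struct{ m sync.Map }

// TryAcquire claims the slot for id. It returns false when another restore
// already holds it; LoadOrStore makes the check-and-set a single atomic step.
func (f *inFlight) TryAcquire(id string) bool {
	_, alreadyRunning := f.m.LoadOrStore(id, struct{}{})
	return !alreadyRunning
}

// Release frees the slot for id.
func (f *inFlight) Release(id string) { f.m.Delete(id) }

// admit decides whether a restore request may proceed. confirmHeader is the
// value of X-Confirm-Restore, which must equal the snapshot id in the URL.
// On success it returns 200 and a release func the caller must invoke when
// the restore finishes; otherwise release is nil.
func admit(f *inFlight, workloadID, snapshotID, confirmHeader string) (status int, release func()) {
	if confirmHeader == "" || confirmHeader != snapshotID {
		return http.StatusBadRequest, nil // assumed status; not stated in the plan
	}
	if !f.TryAcquire(workloadID) {
		return http.StatusConflict, nil // 409: restore already in flight for this workload
	}
	return http.StatusOK, func() { f.Release(workloadID) }
}
```

Because the guard is keyed by workload id, two different workloads pass `admit` at the same time, while a repeated click on the same workload's Restore button is turned away until the first call releases its slot.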
+
+### Resolutions from the phase-breakdown plan review (2026-06-22)
+- **R1 (e.mu deadlock):** `Engine.Restore` does NOT hold `e.mu`; per-workload `Lifecycle.Lock`
+  is the serialization. `Create`'s own `e.mu` guards only the pre-restore archive write.
+- **R2 (cross-device / containment):** stage `tmp`+`old` as siblings under the **live dir's
+  own parent** (same filesystem ⇒ atomic rename). Detect `EXDEV` → abort/rollback loudly.
+- **R3 (crash window):** durable pre-restore snapshot before any rename; **extract all tmp
+  dirs first, then pure renames**; restore-journal + startup `RecoverInterruptedRestores()`
+  sweep (if a live dir is missing, rename its `.old` back into place; remove orphaned tmp dirs).
+- **R4:** C5 checks per-target-filesystem; `StopContainers` returns newest-running tag so
+  redeploy pins the same version, and marks rows stopped; `Engine.Restore` re-validates the
+  workload AFTER acquiring the lock; best-effort audit event emitted.
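
The rename swap (C2, R2) and the per-volume part of the recovery sweep (R3) can be sketched as below. This is an illustration under assumptions, not the engine's code: `swapDir`, `recoverDir`, and `demo` are hypothetical names, the journal is left out, and `tmp` / `old` are assumed to be siblings of the live directory so both renames stay on one filesystem.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
	"syscall"
)

// swapDir parks the live directory at old, then promotes tmp to live.
// If the promotion fails, the original is renamed back into place.
func swapDir(live, tmp, old string) error {
	if err := os.Rename(live, old); err != nil {
		return wrapRename("park live dir", err)
	}
	if err := os.Rename(tmp, live); err != nil {
		if rbErr := os.Rename(old, live); rbErr != nil {
			return fmt.Errorf("promote failed (%v) and rollback failed: %w", err, rbErr)
		}
		return wrapRename("promote tmp dir", err)
	}
	return nil
}

// wrapRename makes a cross-device rename loud instead of a generic failure.
func wrapRename(step string, err error) error {
	if errors.Is(err, syscall.EXDEV) {
		return fmt.Errorf("%s: staging dir is on another filesystem, rename is not atomic: %w", step, err)
	}
	return fmt.Errorf("%s: %w", step, err)
}

// recoverDir is the startup sweep for one volume: when live is missing but
// old exists, the swap was interrupted, so old is renamed back. Any leftover
// tmp directory is removed either way.
func recoverDir(live, tmp, old string) error {
	if _, err := os.Stat(live); errors.Is(err, os.ErrNotExist) {
		if _, err := os.Stat(old); err == nil {
			if err := os.Rename(old, live); err != nil {
				return fmt.Errorf("revert %s: %w", old, err)
			}
		}
	}
	return os.RemoveAll(tmp)
}

// demo builds a throwaway volume, then either swaps in staged data or
// simulates a crash after the first rename and runs recovery. It returns
// the content visible at the live path afterwards.
func demo(crashMidSwap bool) (string, error) {
	parent, err := os.MkdirTemp("", "volswap")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(parent)
	live := filepath.Join(parent, "data")
	tmp, old := live+".tmp", live+".old"
	for dir, body := range map[string]string{live: "original", tmp: "restored"} {
		if err := os.Mkdir(dir, 0o755); err != nil {
			return "", err
		}
		if err := os.WriteFile(filepath.Join(dir, "f"), []byte(body), 0o644); err != nil {
			return "", err
		}
	}
	if crashMidSwap {
		// Crash window: live is already parked at .old, tmp not yet promoted.
		if err := os.Rename(live, old); err != nil {
			return "", err
		}
		err = recoverDir(live, tmp, old)
	} else {
		err = swapDir(live, tmp, old)
	}
	if err != nil {
		return "", err
	}
	b, err := os.ReadFile(filepath.Join(live, "f"))
	return string(b), err
}
```

Calling `demo(false)` leaves the restored data at the live path, while `demo(true)` shows the sweep putting the original back. Extracting every volume to its tmp dir before the first rename keeps the crash window down to the renames themselves.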
+
+## Build & Test Commands
+
+- **Build:** `go build ./...`
+- **Test:** `go test ./internal/...` (backend); from `web/`: `npm run test`
+- **Lint:** `go vet ./internal/...`; from `web/`: `npm run check`
+- **Frontend build:** from `web/`: `npm run build`
+- **Dev:** `./scripts/dev-server.sh` (port 8090; restart after every build)
+
+## Phases
+
+- [x] Phase 1: Restore engine primitives + path-safe extractor + unit tests [domain: backend] → [subplan](./phase-1-engine-primitives.md)
+- [x] Phase 2: Engine.Restore orchestration + lifecycle/locking + rollback [domain: backend] → [subplan](./phase-2-lifecycle-locking.md)
+- [x] Phase 3: API endpoint + CSRF header + single-flight + wiring + tests [domain: backend] → [subplan](./phase-3-api.md)
+- [x] Phase 4: UI Restore button + ConfirmDialog + i18n en+ru [domain: frontend] → [subplan](./phase-4-frontend.md)
+
+## Parallelizable Phase Groups (Orchestrator mode only)
+
+None — strictly sequential. Each phase depends on the prior (P2 needs P1 primitives + the
+Lifecycle seam; P3 wires the adapter + needs `Engine.SetLifecycle`; P4 needs the endpoint).
+
+## Phase Progress Log
+
+| Phase | Domain | Status | Review | Build | Committed |
+|-------|--------|--------|--------|-------|-----------|
+| Phase 1: engine primitives | backend | ✅ Done | ✅ Passed (APPROVE w/ notes) | ✅ Passed | ⬜ |
+| Phase 2: lifecycle/locking | backend | ✅ Done | ✅ Passed (APPROVE w/ notes) | ✅ Passed | ⬜ |
+| Phase 3: API endpoint | backend | ✅ Done | ✅ Passed (go: APPROVE w/ notes; security: fixed CRITICAL) | ✅ Passed | ⬜ |
+| Phase 4: frontend | frontend | ✅ Done | ✅ Passed (ts: APPROVE) | ✅ Passed (check 0 err, build, 26 tests) | ⬜ |
+
+## Outstanding Warnings
+
+| Phase | Warning | Severity | Status (open / resolved / accepted) |
+|-------|---------|----------|-------------------------------------|
+| (design) | Mid-restore crash can leave a per-volume MIXED state (some restored, some original); each volume is individually intact and the pre-restore snapshot is the full escape hatch. | 🟡 | accepted (documented v1 limit) |
+| 2→3 | **B1 (was Blocker):** `RecoverInterruptedRestores()` + `SetLifecycle()` MUST be wired at startup BEFORE the API server serves — restore endpoint must not be reachable without them. | 🔴→tracked | open — HARD Phase 3 prerequisite |
+| 2 | W3 residual: the swap-failure-after-partial-swap ORCHESTRATION branch (rollbackSwaps glue) is covered by primitive unit tests + recovery test + extract-failure orchestration test, but not a full mid-swap fault-injection (needs an fs-fault seam not worth the production complexity). | 🟡 | accepted |
+
+## Final Review
+
+- [x] Comprehensive code review — ✅ READY TO MERGE (no blockers/warnings; 3 non-blocking notes)
+- [x] Security review (untrusted-archive extraction + CSRF + admin gating) — CRITICAL found & fixed (manifest-Source path traversal); re-derive from current config + containment
+- [x] All Outstanding Warnings resolved or consciously accepted
+- [x] Full build passes (`go build ./...`, `npm run build`)
+- [x] Full test suite passes (`go test ./internal/...`, `npm run test` 26, `npm run check` 0 err)
+- [ ] Merged to `main` (squash)
+
+## Amendment Log
+
+_(none yet)_