feat(volsnap): volume snapshot restore (backlog #6)

Restore a captured volume snapshot onto an image workload's live host-bind
data volumes, then redeploy — the most destructive workload action, built to
the adversarially-reviewed design (C1–C6) with all data-loss guards.

- Engine.Restore (engine-owned): all-or-nothing pre-flight re-resolution from
  the workload's CURRENT config (never the tamperable manifest), per-filesystem
  disk pre-check, per-workload lock, container quiesce, extract-to-tmp, durable
  pre-restore snapshot, write-ahead journal, atomic rename swap, redeploy, and
  crash-recovery sweep (RecoverInterruptedRestores) wired before serving.
- internal/keyedmutex: shared per-key lock; deployer now serializes every
  deploy entrypoint per workload via DispatchPlugin (+ LockWorkload/RedeployLocked
  for the restore re-dispatch, no deadlock).
- Untrusted-archive extractor: zip-slip containment, type allow-list (reg/dir
  only), decompression-bomb cap, manifest-index bounds.
- POST /api/workloads/{id}/snapshots/{sid}/restore: admin, X-Confirm-Restore
  header (CSRF), per-workload single-flight (409).
- WebUI: Restore button + danger ConfirmDialog + busy state + i18n (en/ru).

Scope: image-source only; scopes absolute/stage/project (driven off the same
supportedScopes constant capture uses).

Plan-reviewed before coding; per-phase go/security/ts reviews; final review
READY TO MERGE. Security review caught + fixed a CRITICAL manifest-Source path
traversal (re-derive target from current config + base containment).

Plan: plans/volume-snapshot-restore/
2026-06-22 17:23:52 +03:00
parent 8a5f69af87
commit 1c47030854
33 changed files with 2825 additions and 34 deletions
@@ -0,0 +1,113 @@
# Feature: Volume Snapshot Restore (backlog #6)
**Branch:** `feature/volume-snapshot-restore`
**Base branch:** `main`
**Created:** 2026-06-22
**Status:** 🟡 In Progress
**Strategy:** Incremental
**Mode:** Automated
**Execution:** Hybrid — backend (Phases 1–3) Direct by the orchestrator; Phase 4 via the frontend implementer
**Remote:** origin (https://git.dolgolyov-family.by/alexei.dolgolyov/tiny-forge.git)
## Summary
Restore a previously-captured volume snapshot (gzip tar of an image workload's host-bind
data volumes) back onto the live volume directories, then bring the app back up. Capture
already ships (`internal/volsnap`); restore is greenfield and **data-loss-sensitive** — a
wrong design is permanent data loss, so the design was adversarially plan-reviewed twice
(prior session + this phase breakdown).
**Scope (deliberate):** image-source workloads only; volume scopes `absolute` / `stage` /
`project` only — driven off the SAME `volsnap.supportedScopes` constant capture uses. Named
/ project_named (Docker named volumes), instance, and ephemeral scopes are out (consistent
with capture).
## Mandatory design fixes (non-negotiable — a wrong design = permanent data loss)
- **C1** Serialize via a per-workload `keyedMutex` (the `internal/api/gitops.go` pattern,
extracted to `internal/keyedmutex`) keyed by workload id, gating EVERY deploy entrypoint.
All entrypoints funnel through `deployer.DispatchPlugin` (verified: deploy, rollback,
promote, generic-hooks, webhook fireBinding/handlePreviewIntent), so the lock lives there.
NOT `activeWg` (a global drain barrier, not a per-workload lock).
- **C2** Extract-to-temp + atomic rename-swap (extract→`.tmp`, rename live→`.old`, rename
`.tmp`→live), NEVER in-place. Mirrors `internal/api/backups.go` restore precedent.
- **C3** All-or-nothing pre-flight re-resolution via `volume.ResolveWorkloadPath` — abort
BEFORE stopping containers if ANY manifest volume doesn't resolve (config drift =
corruption). Runs before `Lock`/`StopContainers`.
- **C4** Image containers are recreated, not reused → **stop → swap → redeploy** (re-dispatch
via `DispatchPlugin`/`RedeployLocked`), NOT `StartContainer(oldID)`. Verified: image
source's idempotency short-circuit only fires for a *verified-running* container, so a
redeploy after stop creates a fresh container on restored data; `enforceMaxInstances`
reaps the old stopped one.
- **C5** Disk-space pre-check, **per target filesystem** (peak = live + extracted coexist).
- **C6** Treat the archive as UNTRUSTED on extract: zip-slip `HasPrefix` containment,
reject symlink/hardlink/device/fifo/socket entries, manifest-index bounds, decompression-
bomb cap. Require an `X-Confirm-Restore: <sid>` header like the DB restore (CSRF guard).
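
The per-entry checks in C6 can be sketched as follows. This is a minimal illustration, not the project's extractor: `checkEntry`, `sizeBudget`, `errUnsafeEntry`, and the example paths are hypothetical names, and the real cap value is not stated in this plan. The point worth showing is that the `HasPrefix` test has to compare against the root plus a path separator, otherwise a sibling directory with the same prefix passes.

```go
package main

import (
	"archive/tar"
	"errors"
	"fmt"
	"path/filepath"
	"strings"
)

// errUnsafeEntry marks an archive entry the extractor must refuse.
var errUnsafeEntry = errors.New("unsafe archive entry")

// checkEntry (hypothetical helper) validates one tar entry before anything is
// written and returns the path it may be extracted to under destRoot.
func checkEntry(destRoot, name string, typeflag byte) (string, error) {
	// Type allow-list: regular files and directories only, so symlinks,
	// hardlinks, devices, fifos, and sockets are all rejected.
	switch typeflag {
	case tar.TypeReg, tar.TypeDir:
	default:
		return "", fmt.Errorf("%w: entry %q has disallowed type %q", errUnsafeEntry, name, typeflag)
	}
	root := filepath.Clean(destRoot)
	target := filepath.Join(root, name) // Join also collapses ".." segments.
	// Compare against root plus a separator so that a sibling such as
	// "/data/app/vol.tmp-evil" cannot pass as being inside "/data/app/vol.tmp".
	if target != root && !strings.HasPrefix(target, root+string(filepath.Separator)) {
		return "", fmt.Errorf("%w: entry %q escapes %q", errUnsafeEntry, name, root)
	}
	return target, nil
}

// sizeBudget (hypothetical) is a running allowance of bytes the whole
// extraction may still write; it stands in for the decompression-bomb cap.
type sizeBudget struct{ remaining int64 }

// charge deducts an entry's declared size or fails once the cap is exceeded.
func (b *sizeBudget) charge(n int64) error {
	if n < 0 || n > b.remaining {
		return fmt.Errorf("%w: extraction would exceed the size cap", errUnsafeEntry)
	}
	b.remaining -= n
	return nil
}

func main() {
	const root = "/data/app/vol.tmp" // illustrative staging dir
	entries := []struct {
		name string
		typ  byte
	}{
		{"db/data.sqlite", tar.TypeReg},
		{"../vol.tmp-evil/x", tar.TypeReg},
		{"link", tar.TypeSymlink},
	}
	for _, e := range entries {
		target, err := checkEntry(root, e.name, e.typ)
		fmt.Printf("%-20s -> %q, err=%v\n", e.name, target, err)
	}
	budget := sizeBudget{remaining: 1024} // cap chosen for the demo only
	fmt.Println(budget.charge(512), budget.charge(1024))
}
```

Running every check before the first write keeps a rejected archive from leaving partial output in the staging directory.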
### Folded-in (also mandatory)
- Single-flight per-workload CAS → 409 (different apps may restore concurrently).
- Auto-capture a pre-restore snapshot, **durably committed before the first destructive
rename** (the operator's clean escape hatch).
- Logic lives in `Engine.Restore` (engine), not the API handler.
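
A per-workload single-flight guard of the kind described above might look like the sketch below. It is an assumption-laden illustration: `restoreFlights`, `tryBegin`, `end`, and `restoreStatus` are hypothetical names, and only the 409 mapping comes from this plan (the 200 and 500 codes are placeholders). It uses `sync.Map.LoadOrStore` as the atomic claim, which may differ from the CAS primitive the real handler uses.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"sync"
)

// restoreFlights (hypothetical) allows at most one in-flight restore per
// workload id while leaving different workloads independent.
type restoreFlights struct {
	active sync.Map // workload id -> struct{}
}

// tryBegin atomically claims the slot for id. It never blocks: false means a
// restore for that workload is already running.
func (f *restoreFlights) tryBegin(id string) bool {
	_, loaded := f.active.LoadOrStore(id, struct{}{})
	return !loaded
}

// end releases the slot so a later restore of the same workload can start.
func (f *restoreFlights) end(id string) {
	f.active.Delete(id)
}

// restoreStatus runs restore under the guard and returns the HTTP status a
// handler would send. Only the 409 is taken from the plan.
func restoreStatus(f *restoreFlights, workloadID string, restore func() error) int {
	if !f.tryBegin(workloadID) {
		return http.StatusConflict
	}
	defer f.end(workloadID)
	if err := restore(); err != nil {
		return http.StatusInternalServerError
	}
	return http.StatusOK
}

func main() {
	f := &restoreFlights{}
	var nested, other int
	outer := restoreStatus(f, "app-a", func() error {
		// While app-a is restoring, a second request for app-a is refused,
		// but a restore of app-b proceeds.
		nested = restoreStatus(f, "app-a", func() error { return nil })
		other = restoreStatus(f, "app-b", func() error { return nil })
		return nil
	})
	fmt.Println(outer, nested, other) // prints: 200 409 200
	fmt.Println(restoreStatus(f, "app-a", func() error { return errors.New("boom") }))
}
```

The guard is a try-acquire rather than a blocking lock, so a duplicate request fails fast instead of queueing a second destructive restore behind the first.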
### Resolutions from the phase-breakdown plan review (2026-06-22)
- **R1 (e.mu deadlock):** `Engine.Restore` does NOT hold `e.mu`; per-workload `Lifecycle.Lock`
is the serialization. `Create`'s own `e.mu` guards only the pre-restore archive write.
- **R2 (cross-device / containment):** stage `tmp`+`old` as siblings under the **live dir's
own parent** (same filesystem ⇒ atomic rename). Detect `EXDEV` → abort/rollback loudly.
- **R3 (crash window):** durable pre-restore snapshot before any rename; **extract all tmp
dirs first, then pure renames**; restore-journal + startup `RecoverInterruptedRestores()`
sweep (when the live dir is missing, rename `.old` back to live; remove orphaned tmp dirs).
- **R4:** C5 checks per-target-filesystem; `StopContainers` returns newest-running tag so
redeploy pins the same version, and marks rows stopped; `Engine.Restore` re-validates the
workload AFTER acquiring the lock; best-effort audit event emitted.
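
The sibling-staged rename swap from C2/R2 and the revert step from R3 can be sketched as below. This is a simplified illustration under stated assumptions: `swapIn`, `recoverSwap`, `describe`, and `demo` are hypothetical names, the `.tmp`/`.old` suffixes are appended directly to the live path, and recovery here only inspects the filesystem, whereas the plan describes a journal-driven sweep.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
	"syscall"
)

// describe wraps a rename error and calls out EXDEV, which would mean the
// staging dir is not on the live dir's filesystem (so no atomic rename).
func describe(step string, err error) error {
	if errors.Is(err, syscall.EXDEV) {
		return fmt.Errorf("%s: staging is on another filesystem, aborting: %w", step, err)
	}
	return fmt.Errorf("%s: %w", step, err)
}

// swapIn promotes the fully extracted live+".tmp" to live with two renames,
// rolling the first one back if the second fails.
func swapIn(live string) error {
	tmp, old := live+".tmp", live+".old"
	if err := os.Rename(live, old); err != nil {
		return describe("live to old", err)
	}
	if err := os.Rename(tmp, live); err != nil {
		if rbErr := os.Rename(old, live); rbErr != nil {
			return errors.Join(describe("tmp to live", err), describe("rollback", rbErr))
		}
		return describe("tmp to live", err)
	}
	return nil
}

// recoverSwap approximates the startup sweep for one volume: if live is
// missing but ".old" exists, put it back; then remove any orphaned ".tmp".
func recoverSwap(live string) error {
	tmp, old := live+".tmp", live+".old"
	if _, err := os.Stat(live); errors.Is(err, os.ErrNotExist) {
		if _, statErr := os.Stat(old); statErr == nil {
			if err := os.Rename(old, live); err != nil {
				return describe("old to live", err)
			}
		}
	}
	return os.RemoveAll(tmp)
}

// demo performs one clean swap, then simulates a crash between the two
// renames on a second volume and recovers it.
func demo() (string, string, error) {
	root, err := os.MkdirTemp("", "restore-demo")
	if err != nil {
		return "", "", err
	}
	defer os.RemoveAll(root)
	write := func(dir, content string) error {
		if err := os.MkdirAll(dir, 0o755); err != nil {
			return err
		}
		return os.WriteFile(filepath.Join(dir, "data"), []byte(content), 0o644)
	}
	read := func(dir string) string {
		b, _ := os.ReadFile(filepath.Join(dir, "data"))
		return string(b)
	}
	vol1, vol2 := filepath.Join(root, "vol1"), filepath.Join(root, "vol2")
	for dir, content := range map[string]string{
		vol1: "current", vol1 + ".tmp": "restored",
		vol2: "original", vol2 + ".tmp": "half-applied",
	} {
		if err := write(dir, content); err != nil {
			return "", "", err
		}
	}
	if err := swapIn(vol1); err != nil {
		return "", "", err
	}
	// Simulated crash: vol2 was moved aside but its tmp dir never promoted.
	if err := os.Rename(vol2, vol2+".old"); err != nil {
		return "", "", err
	}
	if err := recoverSwap(vol2); err != nil {
		return "", "", err
	}
	return read(vol1), read(vol2), nil
}

func main() {
	fmt.Println(demo())
}
```

Because all extraction happens before `swapIn` is called, the destructive window is limited to two renames per volume, which is what makes a filesystem-level revert feasible.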
## Build & Test Commands
- **Build:** `go build ./...`
- **Test:** `go test ./internal/...` (backend); from `web/`: `npm run test`
- **Lint:** `go vet ./internal/...`; from `web/`: `npm run check`
- **Frontend build:** from `web/`: `npm run build`
- **Dev:** `./scripts/dev-server.sh` (port 8090; restart after every build)
## Phases
- [x] Phase 1: Restore engine primitives + path-safe extractor + unit tests [domain: backend] → [subplan](./phase-1-engine-primitives.md)
- [x] Phase 2: Engine.Restore orchestration + lifecycle/locking + rollback [domain: backend] → [subplan](./phase-2-lifecycle-locking.md)
- [x] Phase 3: API endpoint + CSRF header + single-flight + wiring + tests [domain: backend] → [subplan](./phase-3-api.md)
- [x] Phase 4: UI Restore button + ConfirmDialog + i18n en+ru [domain: frontend] → [subplan](./phase-4-frontend.md)
## Parallelizable Phase Groups (Orchestrator mode only)
None — strictly sequential. Each phase depends on the prior (P2 needs P1 primitives + the
Lifecycle seam; P3 wires the adapter + needs `Engine.SetLifecycle`; P4 needs the endpoint).
## Phase Progress Log
| Phase | Domain | Status | Review | Build | Committed |
|-------|--------|--------|--------|-------|-----------|
| Phase 1: engine primitives | backend | ✅ Done | ✅ Passed (APPROVE w/ notes) | ✅ Passed | ⬜ |
| Phase 2: lifecycle/locking | backend | ✅ Done | ✅ Passed (APPROVE w/ notes) | ✅ Passed | ⬜ |
| Phase 3: API endpoint | backend | ✅ Done | ✅ Passed (go: APPROVE w/ notes; security: fixed CRITICAL) | ✅ Passed | ⬜ |
| Phase 4: frontend | frontend | ✅ Done | ✅ Passed (ts: APPROVE) | ✅ Passed (check 0 err, build, 26 tests) | ⬜ |
## Outstanding Warnings
| Phase | Warning | Severity | Status (open / resolved / accepted) |
|-------|---------|----------|-------------------------------------|
| (design) | Mid-restore crash can leave a per-volume MIXED state (some restored, some original); each volume is individually intact and the pre-restore snapshot is the full escape hatch. | 🟡 | accepted (documented v1 limit) |
| 2→3 | **B1 (was Blocker):** `RecoverInterruptedRestores()` + `SetLifecycle()` MUST be wired at startup BEFORE the API server serves — restore endpoint must not be reachable without them. | 🔴→tracked | resolved (wired before serving in Phase 3) |
| 2 | W3 residual: the swap-failure-after-partial-swap ORCHESTRATION branch (rollbackSwaps glue) is covered by primitive unit tests + recovery test + extract-failure orchestration test, but not a full mid-swap fault-injection (needs an fs-fault seam not worth the production complexity). | 🟡 | accepted |
## Final Review
- [x] Comprehensive code review — ✅ READY TO MERGE (no blockers/warnings; 3 non-blocking notes)
- [x] Security review (untrusted-archive extraction + CSRF + admin gating) — CRITICAL found & fixed (manifest-Source path traversal); re-derive from current config + containment
- [x] All Outstanding Warnings resolved or consciously accepted
- [x] Full build passes (`go build ./...`, `npm run build`)
- [x] Full test suite passes (`go test ./internal/...`, `npm run test` 26, `npm run check` 0 err)
- [ ] Merged to `main` (squash)
## Amendment Log
_(none yet)_