tiny-forge/plans/volume-snapshot-restore/PLAN.md
alexei.dolgolyov 1c47030854 feat(volsnap): volume snapshot restore (backlog #6)
Restore a captured volume snapshot onto an image workload's live host-bind
data volumes, then redeploy — the most destructive workload action, built to
the adversarially-reviewed design (C1–C6) with all data-loss guards.

- Engine.Restore (engine-owned): all-or-nothing pre-flight re-resolution from
  the workload's CURRENT config (never the tamperable manifest), per-filesystem
  disk pre-check, per-workload lock, container quiesce, extract-to-tmp, durable
  pre-restore snapshot, write-ahead journal, atomic rename swap, redeploy, and
  crash-recovery sweep (RecoverInterruptedRestores) wired before serving.
- internal/keyedmutex: shared per-key lock; deployer now serializes every
  deploy entrypoint per workload via DispatchPlugin (+ LockWorkload/RedeployLocked
  for the restore re-dispatch, no deadlock).
- Untrusted-archive extractor: zip-slip containment, type allow-list (reg/dir
  only), decompression-bomb cap, manifest-index bounds.
- POST /api/workloads/{id}/snapshots/{sid}/restore: admin, X-Confirm-Restore
  header (CSRF), per-workload single-flight (409).
- WebUI: Restore button + danger ConfirmDialog + busy state + i18n (en/ru).

Scope: image-source only; scopes absolute/stage/project (driven off the same
supportedScopes constant capture uses).

Plan-reviewed before coding; per-phase go/security/ts reviews; final review
READY TO MERGE. Security review caught + fixed a CRITICAL manifest-Source path
traversal (re-derive target from current config + base containment).

Plan: plans/volume-snapshot-restore/
2026-06-22 17:23:52 +03:00


Feature: Volume Snapshot Restore (backlog #6)

Branch: feature/volume-snapshot-restore
Base branch: main
Created: 2026-06-22
Status: 🟡 In Progress
Strategy: Incremental
Mode: Automated
Execution: Hybrid. Backend (Phases 1–3): Direct by the orchestrator; Phase 4: via the frontend implementer
Remote: origin (https://git.dolgolyov-family.by/alexei.dolgolyov/tiny-forge.git)

Summary

Restore a previously-captured volume snapshot (gzip tar of an image workload's host-bind data volumes) back onto the live volume directories, then bring the app back up. Capture already ships (internal/volsnap); restore is greenfield and data-loss-sensitive — a wrong design is permanent data loss, so the design was adversarially plan-reviewed twice (prior session + this phase breakdown).

Scope (deliberate): image-source workloads only; volume scopes absolute / stage / project only — driven off the SAME volsnap.supportedScopes constant capture uses. Named / project_named (Docker named volumes), instance, and ephemeral scopes are out (consistent with capture).

Mandatory design fixes (non-negotiable — a wrong design = permanent data loss)

  • C1 Serialize via a per-workload keyedMutex (the internal/api/gitops.go pattern, extracted to internal/keyedmutex) keyed by workload id, gating EVERY deploy entrypoint. All entrypoints funnel through deployer.DispatchPlugin (verified: deploy, rollback, promote, generic-hooks, webhook fireBinding/handlePreviewIntent), so the lock lives there. NOT activeWg (a global drain barrier, not a per-workload lock).
  • C2 Extract-to-temp + atomic rename-swap (extract→.tmp, rename live→.old, rename .tmp→live), NEVER in-place. Mirrors internal/api/backups.go restore precedent.
  • C3 All-or-nothing pre-flight re-resolution via volume.ResolveWorkloadPath — abort BEFORE stopping containers if ANY manifest volume doesn't resolve (config drift = corruption). Runs before Lock/StopContainers.
  • C4 Image containers are recreated, not reused → stop → swap → redeploy (re-dispatch via DispatchPlugin/RedeployLocked), NOT StartContainer(oldID). Verified: image source's idempotency short-circuit only fires for a verified-running container, so a redeploy after stop creates a fresh container on restored data; enforceMaxInstances reaps the old stopped one.
  • C5 Disk-space pre-check, per target filesystem (peak = live + extracted coexist).
  • C6 Treat the archive as UNTRUSTED on extract: zip-slip HasPrefix containment, reject symlink/hardlink/device/fifo/socket entries, manifest-index bounds, decompression-bomb cap. Require an X-Confirm-Restore: <sid> header like the DB restore (CSRF guard).
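
The C6 extraction guards can be sketched as below. This is a minimal illustration under stated assumptions, not the project's extractor: `safeTarget`, `allowedType`, `checkEntry`, and `errUnsafeEntry` are hypothetical names, and the byte budget is a stand-in for the decompression-bomb cap. A declared header size is only a first gate; a real extractor would also bound the bytes it actually copies (for example with `io.LimitReader`).

```go
package main

import (
	"archive/tar"
	"errors"
	"fmt"
	"path/filepath"
	"strings"
)

// errUnsafeEntry marks an archive entry the extractor refuses to write.
var errUnsafeEntry = errors.New("unsafe archive entry")

// safeTarget (hypothetical helper) joins an entry name onto destRoot and
// rejects any result that lands outside destRoot (zip-slip containment).
func safeTarget(destRoot, name string) (string, error) {
	root := filepath.Clean(destRoot)
	target := filepath.Join(root, name) // Join also resolves ".." segments
	// Compare against root plus a separator so "/data/vol-evil" is not
	// mistaken for a child of "/data/vol".
	if target != root && !strings.HasPrefix(target, root+string(filepath.Separator)) {
		return "", fmt.Errorf("%w: %q escapes %q", errUnsafeEntry, name, root)
	}
	return target, nil
}

// allowedType is the reg/dir-only allow-list: symlinks, hardlinks, devices,
// FIFOs, and every other type are rejected.
func allowedType(typeflag byte) bool {
	return typeflag == tar.TypeReg || typeflag == tar.TypeDir
}

// checkEntry validates one header before any bytes are written. budget is the
// number of bytes the extractor is still willing to accept (assumed cap).
func checkEntry(destRoot string, hdr *tar.Header, budget int64) (string, error) {
	if !allowedType(hdr.Typeflag) {
		return "", fmt.Errorf("%w: %q has disallowed type %q", errUnsafeEntry, hdr.Name, hdr.Typeflag)
	}
	if hdr.Size < 0 || hdr.Size > budget {
		return "", fmt.Errorf("%w: %q exceeds remaining byte budget", errUnsafeEntry, hdr.Name)
	}
	return safeTarget(destRoot, hdr.Name)
}

func main() {
	entries := []*tar.Header{
		{Name: "db/app.sqlite", Typeflag: tar.TypeReg, Size: 1024},
		{Name: "../../etc/passwd", Typeflag: tar.TypeReg, Size: 10},
		{Name: "link", Typeflag: tar.TypeSymlink},
	}
	for _, h := range entries {
		target, err := checkEntry("/srv/app/data", h, 1<<20)
		fmt.Println(target, err)
	}
}
```

Checking the type before the path means a rejected symlink is never resolved or created, so it cannot later be used to redirect a write that would otherwise pass the containment check.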

Folded-in (also mandatory)

  • Single-flight per-workload CAS → 409 (different apps may restore concurrently).
  • Auto-capture a pre-restore snapshot, durably committed before the first destructive rename (the operator's clean escape hatch).
  • Logic lives in Engine.Restore (engine), not the API handler.
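
The per-workload single-flight rule can be sketched with an atomic claim keyed by workload id. This is an assumption-laden illustration rather than the project's handler: `inFlight`, `tryAcquire`, `release`, and `restoreStatus` are hypothetical names, and `sync.Map.LoadOrStore` stands in for whatever CAS primitive the real code uses.

```go
package main

import (
	"fmt"
	"net/http"
	"sync"
)

// inFlight (hypothetical) records which workloads have a restore running.
type inFlight struct {
	m sync.Map // workload id -> struct{}
}

// tryAcquire atomically claims the slot for id. It reports false when another
// restore for the same workload already holds it.
func (f *inFlight) tryAcquire(id string) bool {
	_, loaded := f.m.LoadOrStore(id, struct{}{})
	return !loaded
}

// release frees the slot so a later restore of the same workload can run.
func (f *inFlight) release(id string) { f.m.Delete(id) }

// restoreStatus maps the single-flight outcome to an HTTP status: 409 when a
// restore for this workload is already running, otherwise the result of run.
func restoreStatus(f *inFlight, id string, run func() error) int {
	if !f.tryAcquire(id) {
		return http.StatusConflict
	}
	defer f.release(id)
	if err := run(); err != nil {
		return http.StatusInternalServerError
	}
	return http.StatusOK
}

func main() {
	f := &inFlight{}
	noop := func() error { return nil }
	first := restoreStatus(f, "app-a", func() error {
		// While app-a is mid-restore, a second attempt on app-a is refused,
		// but a restore of a different workload proceeds.
		fmt.Println("same workload while busy:", restoreStatus(f, "app-a", noop))
		fmt.Println("other workload while busy:", restoreStatus(f, "app-b", noop))
		return nil
	})
	fmt.Println("first restore:", first)
}
```

Keying the claim by workload id is what lets different apps restore concurrently while a second request for the same app fails fast instead of queueing behind a destructive operation.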

Resolutions from the phase-breakdown plan review (2026-06-22)

  • R1 (e.mu deadlock): Engine.Restore does NOT hold e.mu; per-workload Lifecycle.Lock is the serialization. Create's own e.mu guards only the pre-restore archive write.
  • R2 (cross-device / containment): stage tmp+old as siblings under the live dir's own parent (same filesystem ⇒ atomic rename). Detect EXDEV → abort/rollback loudly.
  • R3 (crash window): durable pre-restore snapshot before any rename; extract all tmp dirs first, then pure renames; restore-journal + startup RecoverInterruptedRestores() sweep (revert live-missing→.old, clean orphan tmp).
  • R4: C5 checks per-target-filesystem; StopContainers returns newest-running tag so redeploy pins the same version, and marks rows stopped; Engine.Restore re-validates the workload AFTER acquiring the lock; best-effort audit event emitted.
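
The sibling staging and rename swap from C2 and R2 can be sketched as below. This is a simplified illustration, not the project's code: `swapPaths`, `swapIn`, `isCrossDevice`, and `runDemo` are hypothetical names, the journal and pre-restore snapshot are omitted, and the `.tmp` / `.old` suffixes follow the naming used in C2.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
	"syscall"
)

// swapPaths derives the staging paths as siblings of live, so all three paths
// share one parent directory and therefore one filesystem.
func swapPaths(live string) (tmp, old string) {
	live = filepath.Clean(live)
	return live + ".tmp", live + ".old"
}

// isCrossDevice reports whether a rename failed because source and target are
// on different filesystems.
func isCrossDevice(err error) bool { return errors.Is(err, syscall.EXDEV) }

// swapIn moves live aside, then moves the extracted tmp dir into place.
// If the second rename fails, it puts the original back before returning.
func swapIn(live string) error {
	tmp, old := swapPaths(live)
	if err := os.Rename(live, old); err != nil {
		return fmt.Errorf("move live aside: %w", err)
	}
	if err := os.Rename(tmp, live); err != nil {
		if rbErr := os.Rename(old, live); rbErr != nil {
			return fmt.Errorf("swap failed (%v) and rollback failed: %w", err, rbErr)
		}
		if isCrossDevice(err) {
			return fmt.Errorf("staging dir is on another filesystem, rolled back: %w", err)
		}
		return fmt.Errorf("swap in restored data, rolled back: %w", err)
	}
	return nil
}

// runDemo builds a live dir and an extracted tmp dir under a scratch root,
// swaps them, and returns what the live dir holds afterwards.
func runDemo() (string, error) {
	root, err := os.MkdirTemp("", "swap-demo")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(root)
	live := filepath.Join(root, "data")
	tmp, _ := swapPaths(live)
	for dir, content := range map[string]string{live: "original", tmp: "restored"} {
		if err := os.MkdirAll(dir, 0o755); err != nil {
			return "", err
		}
		if err := os.WriteFile(filepath.Join(dir, "state.txt"), []byte(content), 0o644); err != nil {
			return "", err
		}
	}
	if err := swapIn(live); err != nil {
		return "", err
	}
	b, err := os.ReadFile(filepath.Join(live, "state.txt"))
	return string(b), err
}

func main() {
	got, err := runDemo()
	fmt.Println(got, err)
}
```

Between the two renames the live path does not exist, which is the crash window the restore journal and the startup sweep in R3 are meant to cover.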

Build & Test Commands

  • Build: go build ./...
  • Test: go test ./internal/... (backend); from web/: npm run test
  • Lint: go vet ./internal/...; from web/: npm run check
  • Frontend build: from web/: npm run build
  • Dev: ./scripts/dev-server.sh (port 8090; restart after every build)

Phases

  • Phase 1: Restore engine primitives + path-safe extractor + unit tests [domain: backend] → subplan
  • Phase 2: Engine.Restore orchestration + lifecycle/locking + rollback [domain: backend] → subplan
  • Phase 3: API endpoint + CSRF header + single-flight + wiring + tests [domain: backend] → subplan
  • Phase 4: UI Restore button + ConfirmDialog + i18n en+ru [domain: frontend] → subplan

Parallelizable Phase Groups (Orchestrator mode only)

None — strictly sequential. Each phase depends on the prior (P2 needs P1 primitives + the Lifecycle seam; P3 wires the adapter + needs Engine.SetLifecycle; P4 needs the endpoint).

Phase Progress Log

| Phase | Domain | Status | Review | Build | Committed |
| --- | --- | --- | --- | --- | --- |
| Phase 1: engine primitives | backend | Done | Passed (APPROVE w/ notes) | Passed | |
| Phase 2: lifecycle/locking | backend | Done | Passed (APPROVE w/ notes) | Passed | |
| Phase 3: API endpoint | backend | Done | Passed (go: APPROVE w/ notes; security: fixed CRITICAL) | Passed | |
| Phase 4: frontend | frontend | Done | Passed (ts: APPROVE) | Passed (check 0 err, build, 26 tests) | |

Outstanding Warnings

| Phase | Warning | Severity | Status (open / resolved / accepted) |
| --- | --- | --- | --- |
| (design) | Mid-restore crash can leave a per-volume MIXED state (some restored, some original); each volume is individually intact and the pre-restore snapshot is the full escape hatch. | 🟡 | accepted (documented v1 limit) |
| 2→3 | B1 (was Blocker): RecoverInterruptedRestores() + SetLifecycle() MUST be wired at startup BEFORE the API server serves; the restore endpoint must not be reachable without them. | 🔴→tracked | open (HARD Phase 3 prerequisite) |
| 2 | W3 residual: the swap-failure-after-partial-swap ORCHESTRATION branch (rollbackSwaps glue) is covered by primitive unit tests + recovery test + extract-failure orchestration test, but not a full mid-swap fault-injection (needs an fs-fault seam not worth the production complexity). | 🟡 | accepted |

Final Review

  • Comprehensive code review — READY TO MERGE (no blockers/warnings; 3 non-blocking notes)
  • Security review (untrusted-archive extraction + CSRF + admin gating) — CRITICAL found & fixed (manifest-Source path traversal); re-derive from current config + containment
  • All Outstanding Warnings resolved or consciously accepted
  • Full build passes (go build ./..., npm run build)
  • Full test suite passes (go test ./internal/..., npm run test 26, npm run check 0 err)
  • Merged to main (squash)

Amendment Log

(none yet)