# Feature: Volume Snapshot Restore (backlog #6)

**Branch:** `feature/volume-snapshot-restore`
**Base branch:** `main`
**Created:** 2026-06-22
**Status:** 🟡 In Progress
**Strategy:** Incremental
**Mode:** Automated
**Execution:** Hybrid: backend (Phases 1–3) Direct by the orchestrator; Phase 4 via the frontend implementer
**Remote:** origin (https://git.dolgolyov-family.by/alexei.dolgolyov/tiny-forge.git)

## Summary

Restore a previously-captured volume snapshot (gzip tar of an image workload's host-bind data volumes) back onto the live volume directories, then bring the app back up. Capture already ships (`internal/volsnap`); restore is greenfield and **data-loss-sensitive**: a wrong design is permanent data loss, so the design was adversarially plan-reviewed twice (prior session + this phase breakdown).

**Scope (deliberate):** image-source workloads only; volume scopes `absolute` / `stage` / `project` only, driven off the SAME `volsnap.supportedScopes` constant capture uses. Named / project_named (Docker named volumes), instance, and ephemeral scopes are out (consistent with capture).

## Mandatory design fixes (non-negotiable: a wrong design = permanent data loss)

- **C1** Serialize via a per-workload `keyedMutex` (the `internal/api/gitops.go` pattern, extracted to `internal/keyedmutex`) keyed by workload id, gating EVERY deploy entrypoint. All entrypoints funnel through `deployer.DispatchPlugin` (verified: deploy, rollback, promote, generic-hooks, webhook fireBinding/handlePreviewIntent), so the lock lives there. NOT `activeWg` (a global drain barrier, not a per-workload lock).
- **C2** Extract-to-temp + atomic rename-swap (extract→`.tmp`, rename live→`.old`, rename `.tmp`→live), NEVER in-place. Mirrors `internal/api/backups.go` restore precedent.
- **C3** All-or-nothing pre-flight re-resolution via `volume.ResolveWorkloadPath`: abort BEFORE stopping containers if ANY manifest volume doesn't resolve (config drift = corruption). Runs before `Lock`/`StopContainers`.
- **C4** Image containers are recreated, not reused → **stop → swap → redeploy** (re-dispatch via `DispatchPlugin`/`RedeployLocked`), NOT `StartContainer(oldID)`. Verified: image source's idempotency short-circuit only fires for a *verified-running* container, so a redeploy after stop creates a fresh container on restored data; `enforceMaxInstances` reaps the old stopped one.
- **C5** Disk-space pre-check, **per target filesystem** (peak = live + extracted coexist).
- **C6** Treat the archive as UNTRUSTED on extract: zip-slip `HasPrefix` containment, reject symlink/hardlink/device/fifo/socket entries, manifest-index bounds, decompression-bomb cap. Require an `X-Confirm-Restore: ` header like the DB restore (CSRF guard).

### Folded-in (also mandatory)

- Single-flight per-workload CAS → 409 (different apps may restore concurrently).
- Auto-capture a pre-restore snapshot, **durably committed before the first destructive rename** (the operator's clean escape hatch).
- Logic lives in `Engine.Restore` (engine), not the API handler.

### Resolutions from the phase-breakdown plan review (2026-06-22)

- **R1 (e.mu deadlock):** `Engine.Restore` does NOT hold `e.mu`; per-workload `Lifecycle.Lock` is the serialization. `Create`'s own `e.mu` guards only the pre-restore archive write.
- **R2 (cross-device / containment):** stage `tmp`+`old` as siblings under the **live dir's own parent** (same filesystem ⇒ atomic rename). Detect `EXDEV` → abort/rollback loudly.
- **R3 (crash window):** durable pre-restore snapshot before any rename; **extract all tmp dirs first, then pure renames**; restore-journal + startup `RecoverInterruptedRestores()` sweep (revert `live-missing→.old`, clean orphan tmp).
- **R4:** C5 checks per-target-filesystem; `StopContainers` returns newest-running tag so redeploy pins the same version, and marks rows stopped; `Engine.Restore` re-validates the workload AFTER acquiring the lock; best-effort audit event emitted.

## Build & Test Commands

- **Build:** `go build ./...`
- **Test:** `go test ./internal/...` (backend); from `web/`: `npm run test`
- **Lint:** `go vet ./internal/...`; from `web/`: `npm run check`
- **Frontend build:** from `web/`: `npm run build`
- **Dev:** `./scripts/dev-server.sh` (port 8090; restart after every build)

## Phases

- [x] Phase 1: Restore engine primitives + path-safe extractor + unit tests [domain: backend] → [subplan](./phase-1-engine-primitives.md)
- [x] Phase 2: Engine.Restore orchestration + lifecycle/locking + rollback [domain: backend] → [subplan](./phase-2-lifecycle-locking.md)
- [x] Phase 3: API endpoint + CSRF header + single-flight + wiring + tests [domain: backend] → [subplan](./phase-3-api.md)
- [x] Phase 4: UI Restore button + ConfirmDialog + i18n en+ru [domain: frontend] → [subplan](./phase-4-frontend.md)

## Parallelizable Phase Groups (Orchestrator mode only)

None; strictly sequential. Each phase depends on the prior (P2 needs P1 primitives + the Lifecycle seam; P3 wires the adapter + needs `Engine.SetLifecycle`; P4 needs the endpoint).
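
For orientation, the containment and entry-type checks that the Phase 1 path-safe extractor must enforce (design fix C6) could look roughly like the sketch below. This is an illustrative assumption, not the code in `internal/volsnap`: the names `safeTarget`, `allowedType`, `checkHeader`, and `errUnsafeEntry` are hypothetical, and the manifest-index bounds and decompression-bomb cap are not shown.

```go
package main

import (
	"archive/tar"
	"errors"
	"fmt"
	"path/filepath"
	"strings"
)

// errUnsafeEntry is a hypothetical sentinel for rejected archive entries.
var errUnsafeEntry = errors.New("unsafe archive entry")

// safeTarget joins an entry name onto root and rejects any result that
// escapes root. The prefix check includes a trailing separator so that a
// sibling such as "/data/vol-evil" does not pass as being under "/data/vol".
func safeTarget(root, name string) (string, error) {
	cleanRoot := filepath.Clean(root)
	target := filepath.Join(cleanRoot, name) // Join also cleans ".." segments
	if target != cleanRoot &&
		!strings.HasPrefix(target, cleanRoot+string(filepath.Separator)) {
		return "", fmt.Errorf("%w: %q escapes %q", errUnsafeEntry, name, root)
	}
	return target, nil
}

// allowedType accepts only regular files and directories; symlinks,
// hardlinks, devices, fifos and every other entry type are rejected.
func allowedType(flag byte) bool {
	return flag == tar.TypeReg || flag == tar.TypeDir
}

// checkHeader applies both checks to one tar header and returns the
// on-disk path the entry may be written to inside the temp directory.
func checkHeader(root string, hdr *tar.Header) (string, error) {
	if !allowedType(hdr.Typeflag) {
		return "", fmt.Errorf("%w: %q has type %q", errUnsafeEntry, hdr.Name, hdr.Typeflag)
	}
	return safeTarget(root, hdr.Name)
}
```

An extractor would call `checkHeader(tmpDir, hdr)` for every header returned by `tar.Reader.Next` and abort the whole restore on the first error, before any rename touches the live directory.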
## Phase Progress Log

| Phase | Domain | Status | Review | Build | Committed |
|-------|--------|--------|--------|-------|-----------|
| Phase 1: engine primitives | backend | ✅ Done | ✅ Passed (APPROVE w/ notes) | ✅ Passed | ⬜ |
| Phase 2: lifecycle/locking | backend | ✅ Done | ✅ Passed (APPROVE w/ notes) | ✅ Passed | ⬜ |
| Phase 3: API endpoint | backend | ✅ Done | ✅ Passed (go: APPROVE w/ notes; security: fixed CRITICAL) | ✅ Passed | ⬜ |
| Phase 4: frontend | frontend | ✅ Done | ✅ Passed (ts: APPROVE) | ✅ Passed (check 0 err, build, 26 tests) | ⬜ |

## Outstanding Warnings

| Phase | Warning | Severity | Status (open / resolved / accepted) |
|-------|---------|----------|-------------------------------------|
| (design) | Mid-restore crash can leave a per-volume MIXED state (some restored, some original); each volume is individually intact and the pre-restore snapshot is the full escape hatch. | 🟡 | accepted (documented v1 limit) |
| 2→3 | **B1 (was Blocker):** `RecoverInterruptedRestores()` + `SetLifecycle()` MUST be wired at startup BEFORE the API server serves; the restore endpoint must not be reachable without them. | 🔴→tracked | open (HARD Phase 3 prerequisite) |
| 2 | W3 residual: the swap-failure-after-partial-swap ORCHESTRATION branch (rollbackSwaps glue) is covered by primitive unit tests + recovery test + extract-failure orchestration test, but not a full mid-swap fault-injection (needs an fs-fault seam not worth the production complexity). | 🟡 | accepted |

## Final Review

- [x] Comprehensive code review: ✅ READY TO MERGE (no blockers/warnings; 3 non-blocking notes)
- [x] Security review (untrusted-archive extraction + CSRF + admin gating): CRITICAL found & fixed (manifest-Source path traversal); re-derive from current config + containment
- [x] All Outstanding Warnings resolved or consciously accepted
- [x] Full build passes (`go build ./...`, `npm run build`)
- [x] Full test suite passes (`go test ./internal/...`, `npm run test` 26, `npm run check` 0 err)
- [ ] Merged to `main` (squash)

## Amendment Log

_(none yet)_