# Feature: Volume Snapshot Restore (backlog #6)

**Branch:** `feature/volume-snapshot-restore`
**Base branch:** `main`
**Created:** 2026-06-22
**Status:** 🟡 In Progress
**Strategy:** Incremental
**Mode:** Automated
**Execution:** Hybrid (backend Phases 1–3 run Direct by the orchestrator; Phase 4 via the frontend implementer)
**Remote:** origin (https://git.dolgolyov-family.by/alexei.dolgolyov/tiny-forge.git)

## Summary

Restore a previously captured volume snapshot (a gzipped tar of an image workload's host-bind
data volumes) back onto the live volume directories, then bring the app back up. Capture
already ships (`internal/volsnap`); restore is greenfield and **data-loss-sensitive**: a
wrong design means permanent data loss, so the design went through adversarial plan review
twice (in the prior session and again for this phase breakdown).

**Scope (deliberate):** image-source workloads only; volume scopes `absolute` / `stage` /
`project` only, driven off the SAME `volsnap.supportedScopes` constant capture uses. The
named / project_named (Docker named volumes), instance, and ephemeral scopes are out of
scope, consistent with capture.

## Mandatory design fixes (non-negotiable — a wrong design = permanent data loss)

- **C1** Serialize via a per-workload `keyedMutex` (the `internal/api/gitops.go` pattern,
  extracted to `internal/keyedmutex`) keyed by workload id, gating EVERY deploy entrypoint.
  All entrypoints funnel through `deployer.DispatchPlugin` (verified: deploy, rollback,
  promote, generic-hooks, webhook fireBinding/handlePreviewIntent), so the lock lives there.
  NOT `activeWg` (a global drain barrier, not a per-workload lock).
- **C2** Extract-to-temp + atomic rename-swap (extract→`.tmp`, rename live→`.old`, rename
  `.tmp`→live), NEVER in-place. Mirrors `internal/api/backups.go` restore precedent.
- **C3** All-or-nothing pre-flight re-resolution via `volume.ResolveWorkloadPath` — abort
  BEFORE stopping containers if ANY manifest volume doesn't resolve (config drift =
  corruption). Runs before `Lock`/`StopContainers`.
- **C4** Image containers are recreated, not reused → **stop → swap → redeploy** (re-dispatch
  via `DispatchPlugin`/`RedeployLocked`), NOT `StartContainer(oldID)`. Verified: image
  source's idempotency short-circuit only fires for a *verified-running* container, so a
  redeploy after stop creates a fresh container on restored data; `enforceMaxInstances`
  reaps the old stopped one.
- **C5** Disk-space pre-check, **per target filesystem** (peak = live + extracted coexist).
- **C6** Treat the archive as UNTRUSTED on extract: zip-slip `HasPrefix` containment,
  reject symlink/hardlink/device/fifo/socket entries, manifest-index bounds, and a
  decompression-bomb cap. Require an `X-Confirm-Restore: <sid>` header, as the DB restore
  does (CSRF guard).
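
The C6 entry checks can be sketched roughly as follows. This is a minimal illustration, not the shipped extractor: `safeJoin`, `allowedType`, `checkEntry`, and `errUnsafeEntry` are hypothetical names, and the size cap here only inspects the declared header size (a real extractor would also bound the bytes actually copied, for example with `io.LimitReader`).

```go
package main

import (
	"archive/tar"
	"errors"
	"fmt"
	"path/filepath"
	"strings"
)

// errUnsafeEntry marks an archive entry the extractor refuses to write.
// (Hypothetical name, for illustration only.)
var errUnsafeEntry = errors.New("unsafe archive entry")

// safeJoin resolves an entry name under root and rejects any result that
// lands outside root (the zip-slip check).
func safeJoin(root, name string) (string, error) {
	cleanRoot := filepath.Clean(root)
	target := filepath.Join(cleanRoot, name) // Join cleans "." and ".." segments
	inside := target == cleanRoot ||
		strings.HasPrefix(target, cleanRoot+string(filepath.Separator))
	if !inside {
		return "", fmt.Errorf("%w: %q escapes %q", errUnsafeEntry, name, cleanRoot)
	}
	return target, nil
}

// allowedType reports whether a tar entry type may be materialized.
// Only regular files and directories pass; links, devices, fifos and
// every other type are rejected.
func allowedType(flag byte) bool {
	return flag == tar.TypeReg || flag == tar.TypeDir
}

// checkEntry validates one header before any bytes are written.
// written is the running total already extracted; maxBytes is the cap.
func checkEntry(root string, hdr *tar.Header, written, maxBytes int64) (string, error) {
	if !allowedType(hdr.Typeflag) {
		return "", fmt.Errorf("%w: entry %q has type %q", errUnsafeEntry, hdr.Name, hdr.Typeflag)
	}
	if hdr.Size < 0 || hdr.Size > maxBytes-written {
		return "", fmt.Errorf("%w: entry %q exceeds the size cap", errUnsafeEntry, hdr.Name)
	}
	return safeJoin(root, hdr.Name)
}

func main() {
	root := "/data/vol" // assumed example path
	for _, hdr := range []*tar.Header{
		{Name: "db/file.txt", Typeflag: tar.TypeReg, Size: 10},
		{Name: "../../etc/passwd", Typeflag: tar.TypeReg, Size: 10},
		{Name: "link", Typeflag: tar.TypeSymlink, Linkname: "/etc"},
	} {
		target, err := checkEntry(root, hdr, 0, 1<<20)
		fmt.Println(hdr.Name, "->", target, err)
	}
}
```

Comparing against `cleanRoot` plus a trailing separator matters: a bare `HasPrefix(target, cleanRoot)` would accept a sibling such as `/data/vol2` when the root is `/data/vol`.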

### Folded-in (also mandatory)
- Single-flight per-workload CAS → 409 (different apps may restore concurrently).
- Auto-capture a pre-restore snapshot, **durably committed before the first destructive
  rename** (the operator's clean escape hatch).
- Logic lives in `Engine.Restore` (engine), not the API handler.
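
One way to picture the single-flight guard is below. It is a sketch under assumptions rather than the project's code: `restoreFlights` and its methods are hypothetical, and `sync.Map.LoadOrStore` stands in for the compare-and-set, which may differ from the primitive actually used.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"sync"
)

// restoreFlights records which workloads currently have a restore running.
// (Hypothetical type, for illustration only.)
type restoreFlights struct {
	active sync.Map // workload id -> struct{}
}

// tryAcquire claims the slot for workloadID. LoadOrStore is atomic, so only
// one of several concurrent callers for the same id sees loaded == false.
func (f *restoreFlights) tryAcquire(workloadID string) bool {
	_, loaded := f.active.LoadOrStore(workloadID, struct{}{})
	return !loaded
}

// release frees the slot so a later restore of the same workload can run.
func (f *restoreFlights) release(workloadID string) {
	f.active.Delete(workloadID)
}

// run executes restore under the single-flight guard and maps the outcome
// to an HTTP status code.
func (f *restoreFlights) run(workloadID string, restore func() error) int {
	if !f.tryAcquire(workloadID) {
		return http.StatusConflict // a restore for this workload is in flight
	}
	defer f.release(workloadID)
	if err := restore(); err != nil {
		return http.StatusInternalServerError
	}
	return http.StatusOK
}

func main() {
	var flights restoreFlights
	status := flights.run("app-a", func() error {
		// While app-a is restoring, a second request for app-a is refused,
		// but app-b is free to start.
		fmt.Println("nested app-a:", flights.run("app-a", func() error { return nil }))
		fmt.Println("nested app-b:", flights.run("app-b", func() error { return nil }))
		return nil
	})
	fmt.Println("outer app-a:", status)
	fmt.Println("failing app-a:", flights.run("app-a", func() error { return errors.New("boom") }))
}
```

The guard is keyed by workload id, so it rejects only a second restore of the same app; it is separate from the per-workload deploy lock in C1, which makes callers wait instead of failing fast.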

### Resolutions from the phase-breakdown plan review (2026-06-22)
- **R1 (e.mu deadlock):** `Engine.Restore` does NOT hold `e.mu`; per-workload `Lifecycle.Lock`
  is the serialization. `Create`'s own `e.mu` guards only the pre-restore archive write.
- **R2 (cross-device / containment):** stage `tmp`+`old` as siblings under the **live dir's
  own parent** (same filesystem ⇒ atomic rename). Detect `EXDEV` → abort/rollback loudly.
- **R3 (crash window):** durable pre-restore snapshot before any rename; **extract all tmp
  dirs first, then pure renames**; restore-journal + startup `RecoverInterruptedRestores()`
  sweep (where a live dir is missing, rename its `.old` back into place; clean up orphaned
  tmp dirs).
- **R4:** C5 checks disk space per target filesystem; `StopContainers` returns the
  newest-running tag so the redeploy pins the same version, and marks the rows stopped;
  `Engine.Restore` re-validates the workload AFTER acquiring the lock; a best-effort audit
  event is emitted.
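
The rename-swap from C2 and R2 might look roughly like this for a single volume. It is a sketch only: `swapDir`, `describeRename`, and `demoSwap` are hypothetical names, the `.tmp` / `.old` suffixes are taken from the shorthand above rather than from the real staging names, and the journal and multi-volume rollback are left out.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
	"syscall"
)

// describeRename adds context to a rename failure and calls out EXDEV,
// which means the staging dir is not on the live dir's filesystem.
func describeRename(step string, err error) error {
	if errors.Is(err, syscall.EXDEV) {
		return fmt.Errorf("%s: cross-device rename, aborting: %w", step, err)
	}
	return fmt.Errorf("%s: %w", step, err)
}

// swapDir moves live aside to live+".old" and promotes the already
// extracted live+".tmp" into its place. Both staging dirs are siblings of
// live, so they share its parent directory and filesystem.
func swapDir(live string) error {
	live = filepath.Clean(live)
	tmp, old := live+".tmp", live+".old"
	if err := os.Rename(live, old); err != nil {
		return describeRename("live->old", err)
	}
	if err := os.Rename(tmp, live); err != nil {
		// Put the original data back so the volume is not left missing.
		if rbErr := os.Rename(old, live); rbErr != nil {
			return fmt.Errorf("swap failed (%v); rollback also failed: %w", err, rbErr)
		}
		return describeRename("tmp->live", err)
	}
	return nil
}

// demoSwap builds a throwaway live/tmp pair, swaps them, and returns the
// file contents now under live and under the preserved .old copy.
func demoSwap() (string, string, error) {
	parent, err := os.MkdirTemp("", "swap-demo")
	if err != nil {
		return "", "", err
	}
	defer os.RemoveAll(parent)

	live := filepath.Join(parent, "vol")
	dirs := map[string]string{live: "original", live + ".tmp": "restored"}
	for dir, content := range dirs {
		if err := os.Mkdir(dir, 0o755); err != nil {
			return "", "", err
		}
		if err := os.WriteFile(filepath.Join(dir, "v"), []byte(content), 0o644); err != nil {
			return "", "", err
		}
	}
	if err := swapDir(live); err != nil {
		return "", "", err
	}
	liveData, err := os.ReadFile(filepath.Join(live, "v"))
	if err != nil {
		return "", "", err
	}
	oldData, err := os.ReadFile(filepath.Join(live+".old", "v"))
	if err != nil {
		return "", "", err
	}
	return string(liveData), string(oldData), nil
}

func main() {
	liveData, oldData, err := demoSwap()
	fmt.Println("live:", liveData, "old:", oldData, "err:", err)
}
```

Because extraction has already finished before `swapDir` runs, the destructive window is limited to two renames, and a failure of the second one is reverted by a third.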

## Build & Test Commands

- **Build:** `go build ./...`
- **Test:** `go test ./internal/...` (backend); from `web/`: `npm run test`
- **Lint:** `go vet ./internal/...`; from `web/`: `npm run check`
- **Frontend build:** from `web/`: `npm run build`
- **Dev:** `./scripts/dev-server.sh` (port 8090; restart after every build)

## Phases

- [x] Phase 1: Restore engine primitives + path-safe extractor + unit tests [domain: backend] → [subplan](./phase-1-engine-primitives.md)
- [x] Phase 2: Engine.Restore orchestration + lifecycle/locking + rollback [domain: backend] → [subplan](./phase-2-lifecycle-locking.md)
- [x] Phase 3: API endpoint + CSRF header + single-flight + wiring + tests [domain: backend] → [subplan](./phase-3-api.md)
- [x] Phase 4: UI Restore button + ConfirmDialog + i18n en+ru [domain: frontend] → [subplan](./phase-4-frontend.md)

## Parallelizable Phase Groups (Orchestrator mode only)

None — strictly sequential. Each phase depends on the prior (P2 needs P1 primitives + the
Lifecycle seam; P3 wires the adapter + needs `Engine.SetLifecycle`; P4 needs the endpoint).

## Phase Progress Log

| Phase | Domain | Status | Review | Build | Committed |
|-------|--------|--------|--------|-------|-----------|
| Phase 1: engine primitives | backend | ✅ Done | ✅ Passed (APPROVE w/ notes) | ✅ Passed | ⬜ |
| Phase 2: lifecycle/locking | backend | ✅ Done | ✅ Passed (APPROVE w/ notes) | ✅ Passed | ⬜ |
| Phase 3: API endpoint | backend | ✅ Done | ✅ Passed (go: APPROVE w/ notes; security: fixed CRITICAL) | ✅ Passed | ⬜ |
| Phase 4: frontend | frontend | ✅ Done | ✅ Passed (ts: APPROVE) | ✅ Passed (check 0 err, build, 26 tests) | ⬜ |

## Outstanding Warnings

| Phase | Warning | Severity | Status (open / resolved / accepted) |
|-------|---------|----------|-------------------------------------|
| (design) | Mid-restore crash can leave a per-volume MIXED state (some restored, some original); each volume is individually intact and the pre-restore snapshot is the full escape hatch. | 🟡 | accepted (documented v1 limit) |
| 2→3 | **B1 (was Blocker):** `RecoverInterruptedRestores()` + `SetLifecycle()` MUST be wired at startup BEFORE the API server serves — restore endpoint must not be reachable without them. | 🔴→tracked | open — HARD Phase 3 prerequisite |
| 2 | W3 residual: the swap-failure-after-partial-swap ORCHESTRATION branch (rollbackSwaps glue) is covered by primitive unit tests + recovery test + extract-failure orchestration test, but not a full mid-swap fault-injection (needs an fs-fault seam not worth the production complexity). | 🟡 | accepted |

## Final Review

- [x] Comprehensive code review — ✅ READY TO MERGE (no blockers/warnings; 3 non-blocking notes)
- [x] Security review (untrusted-archive extraction + CSRF + admin gating) — CRITICAL found & fixed (manifest-Source path traversal); fix: re-derive from current config + containment check
- [x] All Outstanding Warnings resolved or consciously accepted
- [x] Full build passes (`go build ./...`, `npm run build`)
- [x] Full test suite passes (`go test ./internal/...`, `npm run test` 26, `npm run check` 0 err)
- [ ] Merged to `main` (squash)

## Amendment Log

_(none yet)_