tiny-forge/plans/volume-snapshot-restore/PLAN.md
alexei.dolgolyov 1c47030854 feat(volsnap): volume snapshot restore (backlog #6)
Restore a captured volume snapshot onto an image workload's live host-bind
data volumes, then redeploy — the most destructive workload action, built to
the adversarially-reviewed design (C1–C6) with all data-loss guards.

- Engine.Restore (engine-owned): all-or-nothing pre-flight re-resolution from
  the workload's CURRENT config (never the tamperable manifest), per-filesystem
  disk pre-check, per-workload lock, container quiesce, extract-to-tmp, durable
  pre-restore snapshot, write-ahead journal, atomic rename swap, redeploy, and
  crash-recovery sweep (RecoverInterruptedRestores) wired before serving.
- internal/keyedmutex: shared per-key lock; deployer now serializes every
  deploy entrypoint per workload via DispatchPlugin (+ LockWorkload/RedeployLocked
  for the restore re-dispatch, no deadlock).
- Untrusted-archive extractor: zip-slip containment, type allow-list (reg/dir
  only), decompression-bomb cap, manifest-index bounds.
- POST /api/workloads/{id}/snapshots/{sid}/restore: admin, X-Confirm-Restore
  header (CSRF), per-workload single-flight (409).
- WebUI: Restore button + danger ConfirmDialog + busy state + i18n (en/ru).

Scope: image-source only; scopes absolute/stage/project (driven off the same
supportedScopes constant capture uses).

Plan-reviewed before coding; per-phase go/security/ts reviews; final review
READY TO MERGE. Security review caught + fixed a CRITICAL manifest-Source path
traversal (re-derive target from current config + base containment).

Plan: plans/volume-snapshot-restore/
2026-06-22 17:23:52 +03:00

# Feature: Volume Snapshot Restore (backlog #6)
**Branch:** `feature/volume-snapshot-restore`
**Base branch:** `main`
**Created:** 2026-06-22
**Status:** 🟡 In Progress
**Strategy:** Incremental
**Mode:** Automated
**Execution:** Hybrid: backend (Phases 1–3) Direct by the orchestrator; Phase 4 via the frontend implementer
**Remote:** origin (https://git.dolgolyov-family.by/alexei.dolgolyov/tiny-forge.git)
## Summary
Restore a previously-captured volume snapshot (gzip tar of an image workload's host-bind
data volumes) back onto the live volume directories, then bring the app back up. Capture
already ships (`internal/volsnap`); restore is greenfield and **data-loss-sensitive** — a
wrong design is permanent data loss, so the design was adversarially plan-reviewed twice
(prior session + this phase breakdown).
**Scope (deliberate):** image-source workloads only; volume scopes `absolute` / `stage` /
`project` only — driven off the SAME `volsnap.supportedScopes` constant capture uses. Named
/ project_named (Docker named volumes), instance, and ephemeral scopes are out (consistent
with capture).
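A minimal sketch of the shared scope gate, assuming `supportedScopes` is a set of scope names; the real constant's type and the exact scope identifiers are not shown in this plan, so the map and the `scopeSupported` helper below are hypothetical. The point is only that capture and restore consult one definition, so they cannot drift apart.

```go
package main

import "fmt"

// supportedScopes stands in for volsnap.supportedScopes. Its real type is an
// assumption here; the plan only states that capture and restore share it.
var supportedScopes = map[string]bool{
	"absolute": true,
	"stage":    true,
	"project":  true,
}

// scopeSupported (hypothetical helper) reports whether a volume scope is
// eligible for both capture and restore.
func scopeSupported(scope string) bool {
	return supportedScopes[scope]
}

func main() {
	scopes := []string{"absolute", "stage", "project", "named", "project_named", "instance", "ephemeral"}
	for _, s := range scopes {
		fmt.Printf("%-14s supported=%v\n", s, scopeSupported(s))
	}
}
```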
## Mandatory design fixes (non-negotiable — a wrong design = permanent data loss)
- **C1** Serialize via a per-workload `keyedMutex` (the `internal/api/gitops.go` pattern,
extracted to `internal/keyedmutex`) keyed by workload id, gating EVERY deploy entrypoint.
All entrypoints funnel through `deployer.DispatchPlugin` (verified: deploy, rollback,
promote, generic-hooks, webhook fireBinding/handlePreviewIntent), so the lock lives there.
NOT `activeWg` (a global drain barrier, not a per-workload lock).
- **C2** Extract-to-temp + atomic rename-swap (extract→`.tmp`, rename live→`.old`, rename
`.tmp`→live), NEVER in-place. Mirrors `internal/api/backups.go` restore precedent.
- **C3** All-or-nothing pre-flight re-resolution via `volume.ResolveWorkloadPath` — abort
BEFORE stopping containers if ANY manifest volume doesn't resolve (config drift =
corruption). Runs before `Lock`/`StopContainers`.
- **C4** Image containers are recreated, not reused → **stop → swap → redeploy** (re-dispatch
via `DispatchPlugin`/`RedeployLocked`), NOT `StartContainer(oldID)`. Verified: image
source's idempotency short-circuit only fires for a *verified-running* container, so a
redeploy after stop creates a fresh container on restored data; `enforceMaxInstances`
reaps the old stopped one.
- **C5** Disk-space pre-check, **per target filesystem** (peak = live + extracted coexist).
- **C6** Treat the archive as UNTRUSTED on extract: zip-slip `HasPrefix` containment,
reject symlink/hardlink/device/fifo/socket entries, manifest-index bounds, decompression-
bomb cap. Require an `X-Confirm-Restore: <sid>` header like the DB restore (CSRF guard).
### Folded-in (also mandatory)
- Single-flight per-workload CAS → 409 (different apps may restore concurrently).
- Auto-capture a pre-restore snapshot, **durably committed before the first destructive
rename** (the operator's clean escape hatch).
- Logic lives in `Engine.Restore` (engine), not the API handler.
### Resolutions from the phase-breakdown plan review (2026-06-22)
- **R1 (e.mu deadlock):** `Engine.Restore` does NOT hold `e.mu`; per-workload `Lifecycle.Lock`
is the serialization. `Create`'s own `e.mu` guards only the pre-restore archive write.
- **R2 (cross-device / containment):** stage `tmp`+`old` as siblings under the **live dir's
own parent** (same filesystem ⇒ atomic rename). Detect `EXDEV` → abort/rollback loudly.
- **R3 (crash window):** durable pre-restore snapshot before any rename; **extract all tmp
dirs first, then pure renames**; restore-journal + startup `RecoverInterruptedRestores()`
sweep (where the live dir is missing, rename `.old` back to live; clean up orphaned tmp dirs).
- **R4:** C5 checks per-target-filesystem; `StopContainers` returns newest-running tag so
redeploy pins the same version, and marks rows stopped; `Engine.Restore` re-validates the
workload AFTER acquiring the lock; best-effort audit event emitted.
## Build & Test Commands
- **Build:** `go build ./...`
- **Test:** `go test ./internal/...` (backend); from `web/`: `npm run test`
- **Lint:** `go vet ./internal/...`; from `web/`: `npm run check`
- **Frontend build:** from `web/`: `npm run build`
- **Dev:** `./scripts/dev-server.sh` (port 8090; restart after every build)
## Phases
- [x] Phase 1: Restore engine primitives + path-safe extractor + unit tests [domain: backend] → [subplan](./phase-1-engine-primitives.md)
- [x] Phase 2: Engine.Restore orchestration + lifecycle/locking + rollback [domain: backend] → [subplan](./phase-2-lifecycle-locking.md)
- [x] Phase 3: API endpoint + CSRF header + single-flight + wiring + tests [domain: backend] → [subplan](./phase-3-api.md)
- [x] Phase 4: UI Restore button + ConfirmDialog + i18n en+ru [domain: frontend] → [subplan](./phase-4-frontend.md)
## Parallelizable Phase Groups (Orchestrator mode only)
None — strictly sequential. Each phase depends on the prior (P2 needs P1 primitives + the
Lifecycle seam; P3 wires the adapter + needs `Engine.SetLifecycle`; P4 needs the endpoint).
## Phase Progress Log
| Phase | Domain | Status | Review | Build | Committed |
|-------|--------|--------|--------|-------|-----------|
| Phase 1: engine primitives | backend | ✅ Done | ✅ Passed (APPROVE w/ notes) | ✅ Passed | ⬜ |
| Phase 2: lifecycle/locking | backend | ✅ Done | ✅ Passed (APPROVE w/ notes) | ✅ Passed | ⬜ |
| Phase 3: API endpoint | backend | ✅ Done | ✅ Passed (go: APPROVE w/ notes; security: fixed CRITICAL) | ✅ Passed | ⬜ |
| Phase 4: frontend | frontend | ✅ Done | ✅ Passed (ts: APPROVE) | ✅ Passed (check 0 err, build, 26 tests) | ⬜ |
## Outstanding Warnings
| Phase | Warning | Severity | Status (open / resolved / accepted) |
|-------|---------|----------|-------------------------------------|
| (design) | Mid-restore crash can leave a per-volume MIXED state (some restored, some original); each volume is individually intact and the pre-restore snapshot is the full escape hatch. | 🟡 | accepted (documented v1 limit) |
| 2→3 | **B1 (was Blocker):** `RecoverInterruptedRestores()` + `SetLifecycle()` MUST be wired at startup BEFORE the API server serves — restore endpoint must not be reachable without them. | 🔴→tracked | open — HARD Phase 3 prerequisite |
| 2 | W3 residual: the swap-failure-after-partial-swap ORCHESTRATION branch (rollbackSwaps glue) is covered by primitive unit tests + recovery test + extract-failure orchestration test, but not a full mid-swap fault-injection (needs an fs-fault seam not worth the production complexity). | 🟡 | accepted |
## Final Review
- [x] Comprehensive code review — ✅ READY TO MERGE (no blockers/warnings; 3 non-blocking notes)
- [x] Security review (untrusted-archive extraction + CSRF + admin gating) — CRITICAL found & fixed (manifest-Source path traversal); re-derive from current config + containment
- [x] All Outstanding Warnings resolved or consciously accepted
- [x] Full build passes (`go build ./...`, `npm run build`)
- [x] Full test suite passes (`go test ./internal/...`, `npm run test` 26, `npm run check` 0 err)
- [ ] Merged to `main` (squash)
## Amendment Log
_(none yet)_