Restore a captured volume snapshot onto an image workload's live host-bind
data volumes, then redeploy — the most destructive workload action, built to
the adversarially-reviewed design (C1–C6) with all data-loss guards.
- Engine.Restore (engine-owned): all-or-nothing pre-flight re-resolution from
the workload's CURRENT config (never the tamperable manifest), per-filesystem
disk pre-check, per-workload lock, container quiesce, extract-to-tmp, durable
pre-restore snapshot, write-ahead journal, atomic rename swap, redeploy, and
crash-recovery sweep (RecoverInterruptedRestores) wired before serving.
- internal/keyedmutex: shared per-key lock; deployer now serializes every
deploy entrypoint per workload via DispatchPlugin (+ LockWorkload/RedeployLocked
for the restore re-dispatch, no deadlock).
- Untrusted-archive extractor: zip-slip containment, type allow-list (reg/dir
only), decompression-bomb cap, manifest-index bounds.
- POST /api/workloads/{id}/snapshots/{sid}/restore: admin, X-Confirm-Restore
header (CSRF), per-workload single-flight (409).
- WebUI: Restore button + danger ConfirmDialog + busy state + i18n (en/ru).
Scope: image-source only; scopes absolute/stage/project (driven off the same
supportedScopes constant capture uses).
Plan-reviewed before coding; per-phase go/security/ts reviews; final review
READY TO MERGE. Security review caught + fixed a CRITICAL manifest-Source path
traversal (re-derive target from current config + base containment).
Plan: plans/volume-snapshot-restore/
Feature: Volume Snapshot Restore (backlog #6)
Branch: feature/volume-snapshot-restore
Base branch: main
Created: 2026-06-22
Status: 🟡 In Progress
Strategy: Incremental
Mode: Automated
Execution: Hybrid — backend (Phases 1–3) Direct by the orchestrator; Phase 4 via the frontend implementer
Remote: origin (https://git.dolgolyov-family.by/alexei.dolgolyov/tiny-forge.git)
Summary
Restore a previously-captured volume snapshot (gzip tar of an image workload's host-bind
data volumes) back onto the live volume directories, then bring the app back up. Capture
already ships (internal/volsnap); restore is greenfield and data-loss-sensitive — a
wrong design is permanent data loss, so the design was adversarially plan-reviewed twice
(prior session + this phase breakdown).
Scope (deliberate): image-source workloads only; volume scopes absolute / stage /
project only — driven off the SAME volsnap.supportedScopes constant capture uses. Named
/ project_named (Docker named volumes), instance, and ephemeral scopes are out (consistent
with capture).
Mandatory design fixes (non-negotiable — a wrong design = permanent data loss)
- C1 Serialize via a per-workload keyedMutex (the internal/api/gitops.go pattern, extracted to internal/keyedmutex) keyed by workload id, gating EVERY deploy entrypoint. All entrypoints funnel through deployer.DispatchPlugin (verified: deploy, rollback, promote, generic-hooks, webhook fireBinding/handlePreviewIntent), so the lock lives there. NOT activeWg (a global drain barrier, not a per-workload lock).
- C2 Extract-to-temp + atomic rename-swap (extract→.tmp, rename live→.old, rename .tmp→live), NEVER in-place. Mirrors the internal/api/backups.go restore precedent.
- C3 All-or-nothing pre-flight re-resolution via volume.ResolveWorkloadPath: abort BEFORE stopping containers if ANY manifest volume doesn't resolve (config drift = corruption). Runs before Lock/StopContainers.
- C4 Image containers are recreated, not reused → stop → swap → redeploy (re-dispatch via DispatchPlugin/RedeployLocked), NOT StartContainer(oldID). Verified: image source's idempotency short-circuit only fires for a verified-running container, so a redeploy after stop creates a fresh container on restored data; enforceMaxInstances reaps the old stopped one.
- C5 Disk-space pre-check, per target filesystem (peak = live + extracted coexist).
- C6 Treat the archive as UNTRUSTED on extract: zip-slip HasPrefix containment, reject symlink/hardlink/device/fifo/socket entries, manifest-index bounds, decompression-bomb cap. Require an X-Confirm-Restore: <sid> header like the DB restore (CSRF guard).
Folded-in (also mandatory)
- Single-flight per-workload CAS → 409 (different apps may restore concurrently).
- Auto-capture a pre-restore snapshot, durably committed before the first destructive rename (the operator's clean escape hatch).
- Logic lives in Engine.Restore (engine), not the API handler.
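The single-flight rule and the confirm header can be sketched as follows. This is an illustration under stated assumptions, not the project's handler: restoreGate, restoreHandler and the run callback are hypothetical names, the 400 status for a bad header is an assumption, and the routing uses the Go 1.22+ net/http ServeMux pattern syntax rather than whatever router the project uses. sync.Map.LoadOrStore stands in for the CAS, since it is an atomic insert-if-absent.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
)

// restoreGate tracks workloads that currently have a restore in flight.
type restoreGate struct{ inFlight sync.Map }

// tryAcquire claims the slot for id; it returns false if already claimed.
func (g *restoreGate) tryAcquire(id string) bool {
	_, loaded := g.inFlight.LoadOrStore(id, struct{}{})
	return !loaded
}

func (g *restoreGate) release(id string) { g.inFlight.Delete(id) }

// restoreHandler checks the confirm header, then the per-workload gate,
// and only then calls run (a stand-in for the engine-owned restore).
func restoreHandler(g *restoreGate, run func(id, sid string) error) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		id, sid := r.PathValue("id"), r.PathValue("sid")
		if r.Header.Get("X-Confirm-Restore") != sid {
			// Status code is an assumption; the plan only requires the header.
			http.Error(w, "missing or mismatched X-Confirm-Restore", http.StatusBadRequest)
			return
		}
		if !g.tryAcquire(id) {
			http.Error(w, "restore already in progress", http.StatusConflict)
			return
		}
		defer g.release(id)
		if err := run(id, sid); err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		w.WriteHeader(http.StatusOK)
	}
}

// demo issues two overlapping restores for the same workload and returns
// the status codes of the first and the second request.
func demo() (first, second int) {
	g := &restoreGate{}
	started, unblock := make(chan struct{}), make(chan struct{})
	mux := http.NewServeMux()
	mux.Handle("POST /api/workloads/{id}/snapshots/{sid}/restore",
		restoreHandler(g, func(id, sid string) error {
			close(started) // signal that the first restore holds the gate
			<-unblock
			return nil
		}))
	post := func() int {
		req := httptest.NewRequest(http.MethodPost, "/api/workloads/w1/snapshots/s1/restore", nil)
		req.Header.Set("X-Confirm-Restore", "s1")
		rec := httptest.NewRecorder()
		mux.ServeHTTP(rec, req)
		return rec.Code
	}
	done := make(chan int)
	go func() { done <- post() }()
	<-started
	second = post() // same workload while the first is running
	close(unblock)
	first = <-done
	return first, second
}

func main() {
	first, second := demo()
	fmt.Println("first:", first, "second:", second)
}
```

Because the gate is keyed by workload id, a restore for a different workload would not be rejected, which matches the note that different apps may restore concurrently.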
Resolutions from the phase-breakdown plan review (2026-06-22)
- R1 (e.mu deadlock): Engine.Restore does NOT hold e.mu; per-workload Lifecycle.Lock is the serialization. Create's own e.mu guards only the pre-restore archive write.
- R2 (cross-device / containment): stage tmp+old as siblings under the live dir's own parent (same filesystem ⇒ atomic rename). Detect EXDEV → abort/rollback loudly.
- R3 (crash window): durable pre-restore snapshot before any rename; extract all tmp dirs first, then pure renames; restore-journal + startup RecoverInterruptedRestores() sweep (revert live-missing→.old, clean orphan tmp).
- R4: C5 checks per-target-filesystem; StopContainers returns newest-running tag so redeploy pins the same version, and marks rows stopped; Engine.Restore re-validates the workload AFTER acquiring the lock; best-effort audit event emitted.
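The sibling-staged swap from R2 and R3 might look roughly like this. It is a sketch under assumptions: swapIn and explain are hypothetical names, the .tmp directory is assumed to be fully extracted already, and the journal write and the later cleanup of .old are left out.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
	"syscall"
)

// swapIn replaces live with the already-extracted live+".tmp" using two
// renames. Both staging dirs are siblings of live, so they share its
// filesystem. Hypothetical helper; journaling is not shown.
func swapIn(live string) error {
	tmp, old := live+".tmp", live+".old"
	if err := os.Rename(live, old); err != nil {
		return explain(err)
	}
	if err := os.Rename(tmp, live); err != nil {
		// Roll back so the live directory is not left missing.
		if rbErr := os.Rename(old, live); rbErr != nil {
			return fmt.Errorf("swap failed (%v) and rollback failed: %w", err, rbErr)
		}
		return explain(err)
	}
	return nil // live+".old" is kept; its removal is out of scope here
}

// explain makes a cross-device failure loud instead of letting a caller
// fall back to a non-atomic copy.
func explain(err error) error {
	if errors.Is(err, syscall.EXDEV) {
		return fmt.Errorf("cross-device rename, aborting restore: %w", err)
	}
	return err
}

// demo builds a live dir and a staged dir in a temp root, swaps them, and
// returns the content now visible at the live path.
func demo() (string, error) {
	root, err := os.MkdirTemp("", "swap-demo")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(root)

	live := filepath.Join(root, "data")
	for dir, content := range map[string]string{live: "original", live + ".tmp": "restored"} {
		if err := os.Mkdir(dir, 0o755); err != nil {
			return "", err
		}
		if err := os.WriteFile(filepath.Join(dir, "f"), []byte(content), 0o644); err != nil {
			return "", err
		}
	}
	if err := swapIn(live); err != nil {
		return "", err
	}
	b, err := os.ReadFile(filepath.Join(live, "f"))
	return string(b), err
}

func main() {
	got, err := demo()
	fmt.Println(got, err)
}
```

Between the two renames the live path briefly does not exist, which is the crash window the restore journal and the startup sweep are there to cover.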
Build & Test Commands
- Build: go build ./...
- Test: go test ./internal/... (backend); from web/: npm run test
- Lint: go vet ./internal/...; from web/: npm run check
- Frontend build: from web/: npm run build
- Dev: ./scripts/dev-server.sh (port 8090; restart after every build)
Phases
- Phase 1: Restore engine primitives + path-safe extractor + unit tests [domain: backend] → subplan
- Phase 2: Engine.Restore orchestration + lifecycle/locking + rollback [domain: backend] → subplan
- Phase 3: API endpoint + CSRF header + single-flight + wiring + tests [domain: backend] → subplan
- Phase 4: UI Restore button + ConfirmDialog + i18n en+ru [domain: frontend] → subplan
Parallelizable Phase Groups (Orchestrator mode only)
None — strictly sequential. Each phase depends on the prior (P2 needs P1 primitives + the
Lifecycle seam; P3 wires the adapter + needs Engine.SetLifecycle; P4 needs the endpoint).
Phase Progress Log
| Phase | Domain | Status | Review | Build | Committed |
|---|---|---|---|---|---|
| Phase 1: engine primitives | backend | ✅ Done | ✅ Passed (APPROVE w/ notes) | ✅ Passed | ⬜ |
| Phase 2: lifecycle/locking | backend | ✅ Done | ✅ Passed (APPROVE w/ notes) | ✅ Passed | ⬜ |
| Phase 3: API endpoint | backend | ✅ Done | ✅ Passed (go: APPROVE w/ notes; security: fixed CRITICAL) | ✅ Passed | ⬜ |
| Phase 4: frontend | frontend | ✅ Done | ✅ Passed (ts: APPROVE) | ✅ Passed (check 0 err, build, 26 tests) | ⬜ |
Outstanding Warnings
| Phase | Warning | Severity | Status (open / resolved / accepted) |
|---|---|---|---|
| (design) | Mid-restore crash can leave a per-volume MIXED state (some restored, some original); each volume is individually intact and the pre-restore snapshot is the full escape hatch. | 🟡 | accepted (documented v1 limit) |
| 2→3 | B1 (was Blocker): RecoverInterruptedRestores() + SetLifecycle() MUST be wired at startup BEFORE the API server serves; the restore endpoint must not be reachable without them. | 🔴→tracked | open (HARD Phase 3 prerequisite) |
| 2 | W3 residual: the swap-failure-after-partial-swap ORCHESTRATION branch (rollbackSwaps glue) is covered by primitive unit tests + recovery test + extract-failure orchestration test, but not a full mid-swap fault-injection (needs an fs-fault seam not worth the production complexity). | 🟡 | accepted |
Final Review
- Comprehensive code review — ✅ READY TO MERGE (no blockers/warnings; 3 non-blocking notes)
- Security review (untrusted-archive extraction + CSRF + admin gating) — CRITICAL found & fixed (manifest-Source path traversal); re-derive from current config + containment
- All Outstanding Warnings resolved or consciously accepted
- Full build passes (go build ./..., npm run build)
- Full test suite passes (go test ./internal/...; npm run test: 26 tests; npm run check: 0 errors)
- Merged to main (squash)
Amendment Log
(none yet)