1c47030854
Restore a captured volume snapshot onto an image workload's live host-bind
data volumes, then redeploy — the most destructive workload action, built to
the adversarially-reviewed design (C1–C6) with all data-loss guards.
- Engine.Restore (engine-owned): all-or-nothing pre-flight re-resolution from
the workload's CURRENT config (never the tamperable manifest), per-filesystem
disk pre-check, per-workload lock, container quiesce, extract-to-tmp, durable
pre-restore snapshot, write-ahead journal, atomic rename swap, redeploy, and
crash-recovery sweep (RecoverInterruptedRestores) wired before serving.
- internal/keyedmutex: shared per-key lock; deployer now serializes every
deploy entrypoint per workload via DispatchPlugin (+ LockWorkload/RedeployLocked
for the restore re-dispatch, no deadlock).
- Untrusted-archive extractor: zip-slip containment, type allow-list (reg/dir
only), decompression-bomb cap, manifest-index bounds.
- POST /api/workloads/{id}/snapshots/{sid}/restore: admin, X-Confirm-Restore
header (CSRF), per-workload single-flight (409).
- WebUI: Restore button + danger ConfirmDialog + busy state + i18n (en/ru).
Scope: image-source only; scopes absolute/stage/project (driven off the same
supportedScopes constant capture uses).
Plan-reviewed before coding; per-phase go/security/ts reviews; final review
READY TO MERGE. Security review caught + fixed a CRITICAL manifest-Source path
traversal (re-derive target from current config + base containment).
Plan: plans/volume-snapshot-restore/
114 lines
7.1 KiB
Markdown
# Feature: Volume Snapshot Restore (backlog #6)

**Branch:** `feature/volume-snapshot-restore`
**Base branch:** `main`
**Created:** 2026-06-22
**Status:** 🟡 In Progress
**Strategy:** Incremental
**Mode:** Automated
**Execution:** Hybrid: backend (Phases 1–3) Direct by the orchestrator; Phase 4 via the frontend implementer
**Remote:** origin (https://git.dolgolyov-family.by/alexei.dolgolyov/tiny-forge.git)

## Summary

Restore a previously captured volume snapshot (gzip tar of an image workload's host-bind
data volumes) back onto the live volume directories, then bring the app back up. Capture
already ships (`internal/volsnap`); restore is greenfield and **data-loss-sensitive**: a
wrong design means permanent data loss, so the design was adversarially plan-reviewed twice
(prior session + this phase breakdown).

**Scope (deliberate):** image-source workloads only; volume scopes `absolute` / `stage` /
`project` only, driven off the SAME `volsnap.supportedScopes` constant that capture uses.
Named / project_named (Docker named volumes), instance, and ephemeral scopes are out
(consistent with capture).

## Mandatory design fixes (non-negotiable: a wrong design = permanent data loss)

- **C1** Serialize via a per-workload `keyedMutex` (the `internal/api/gitops.go` pattern,
  extracted to `internal/keyedmutex`) keyed by workload id, gating EVERY deploy entrypoint.
  All entrypoints funnel through `deployer.DispatchPlugin` (verified: deploy, rollback,
  promote, generic-hooks, webhook fireBinding/handlePreviewIntent), so the lock lives there.
  NOT `activeWg` (a global drain barrier, not a per-workload lock).
- **C2** Extract-to-temp + atomic rename-swap (extract→`.tmp`, rename live→`.old`, rename
  `.tmp`→live), NEVER in-place. Mirrors `internal/api/backups.go` restore precedent.
- **C3** All-or-nothing pre-flight re-resolution via `volume.ResolveWorkloadPath`: abort
  BEFORE stopping containers if ANY manifest volume doesn't resolve (config drift =
  corruption). Runs before `Lock`/`StopContainers`.
- **C4** Image containers are recreated, not reused → **stop → swap → redeploy** (re-dispatch
  via `DispatchPlugin`/`RedeployLocked`), NOT `StartContainer(oldID)`. Verified: image
  source's idempotency short-circuit only fires for a *verified-running* container, so a
  redeploy after stop creates a fresh container on restored data; `enforceMaxInstances`
  reaps the old stopped one.
- **C5** Disk-space pre-check, **per target filesystem** (peak = live + extracted coexist).
- **C6** Treat the archive as UNTRUSTED on extract: zip-slip `HasPrefix` containment,
  reject symlink/hardlink/device/fifo/socket entries, manifest-index bounds,
  decompression-bomb cap. Require an `X-Confirm-Restore: <sid>` header like the DB restore
  (CSRF guard).

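The C6 containment and type allow-list checks can be sketched as follows. This is a minimal illustration, not the extractor in `internal/volsnap`: the names `safeTarget`, `allowedType`, and `errUnsafeEntry` and the base path `/srv/data/vol` are hypothetical. It shows why a bare `HasPrefix` is not enough: the prefix must end in a path separator, otherwise a sibling such as `vol-evil` passes as "inside" `vol`.

```go
package main

import (
	"archive/tar"
	"errors"
	"fmt"
	"path/filepath"
	"strings"
)

// errUnsafeEntry is a hypothetical sentinel for rejected archive entries.
var errUnsafeEntry = errors.New("unsafe archive entry")

// safeTarget joins an untrusted entry name onto base and rejects any result
// that lands outside base. filepath.Join cleans the path, so ".." segments
// are resolved before the prefix check runs. (A base of "/" is not handled.)
func safeTarget(base, name string) (string, error) {
	cleanBase := filepath.Clean(base)
	target := filepath.Join(cleanBase, name)
	// Compare against base + separator so "/srv/data/vol-evil" does not
	// count as being inside "/srv/data/vol".
	if target != cleanBase && !strings.HasPrefix(target, cleanBase+string(filepath.Separator)) {
		return "", fmt.Errorf("%w: %q escapes %q", errUnsafeEntry, name, base)
	}
	return target, nil
}

// allowedType implements the allow-list: regular files and directories only.
// Symlinks, hardlinks, devices, fifos and everything else are refused.
func allowedType(flag byte) bool {
	return flag == tar.TypeReg || flag == tar.TypeDir
}

func main() {
	names := []string{"db/data.sqlite", "../vol-evil/x", "a/../../etc/passwd"}
	for _, name := range names {
		_, err := safeTarget("/srv/data/vol", name)
		fmt.Printf("%q contained=%v\n", name, err == nil)
	}
	fmt.Println("symlink allowed:", allowedType(tar.TypeSymlink))
}
```

An allow-list is used rather than a deny-list so that any entry type not explicitly named is rejected by default.
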
### Folded-in (also mandatory)

- Single-flight per-workload CAS → 409 (different apps may restore concurrently).
- Auto-capture a pre-restore snapshot, **durably committed before the first destructive
  rename** (the operator's clean escape hatch).
- Logic lives in `Engine.Restore` (engine), not the API handler.

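The single-flight rule above can be sketched with `sync.Map.LoadOrStore` acting as the compare-and-swap. This is an illustration under assumptions, not the handler code: `inFlight`, `tryAcquire`, `release`, and `restoreStatus` are hypothetical names, and mapping a failed restore to 500 is an assumption; only the 409 on a concurrent restore of the same workload comes from the plan.

```go
package main

import (
	"fmt"
	"net/http"
	"sync"
)

// inFlight tracks workloads that currently have a restore running.
type inFlight struct{ m sync.Map }

// tryAcquire claims the slot for id. LoadOrStore is atomic, so exactly one
// caller sees loaded == false and wins; every other caller is refused.
func (f *inFlight) tryAcquire(id string) bool {
	_, loaded := f.m.LoadOrStore(id, struct{}{})
	return !loaded
}

// release frees the slot so a later restore of the same workload can run.
func (f *inFlight) release(id string) { f.m.Delete(id) }

// restoreStatus returns the HTTP status a handler would send for one restore
// attempt. run stands in for the call into the engine.
func restoreStatus(f *inFlight, id string, run func() error) int {
	if !f.tryAcquire(id) {
		return http.StatusConflict // 409: a restore for this workload is in progress
	}
	defer f.release(id)
	if err := run(); err != nil {
		return http.StatusInternalServerError // assumed mapping, not from the plan
	}
	return http.StatusOK
}

func main() {
	f := &inFlight{}
	noop := func() error { return nil }
	var same, other int
	outer := restoreStatus(f, "app-a", func() error {
		same = restoreStatus(f, "app-a", noop)  // same workload while busy
		other = restoreStatus(f, "app-b", noop) // different workload
		return nil
	})
	fmt.Println(outer, same, other)
}
```

Keying the guard by workload id is what lets different apps restore concurrently while a second request for the same app is refused immediately instead of queuing behind the lock.
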
### Resolutions from the phase-breakdown plan review (2026-06-22)

- **R1 (e.mu deadlock):** `Engine.Restore` does NOT hold `e.mu`; per-workload `Lifecycle.Lock`
  is the serialization. `Create`'s own `e.mu` guards only the pre-restore archive write.
- **R2 (cross-device / containment):** stage `tmp`+`old` as siblings under the **live dir's
  own parent** (same filesystem ⇒ atomic rename). Detect `EXDEV` → abort/rollback loudly.
- **R3 (crash window):** durable pre-restore snapshot before any rename; **extract all tmp
  dirs first, then pure renames**; restore-journal + startup `RecoverInterruptedRestores()`
  sweep (revert `live-missing→.old`, clean orphan tmp).
- **R4:** C5 checks per-target-filesystem; `StopContainers` returns newest-running tag so
  redeploy pins the same version, and marks rows stopped; `Engine.Restore` re-validates the
  workload AFTER acquiring the lock; best-effort audit event emitted.

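The sibling-staged rename swap from C2 and R2 might look roughly like this. It is a sketch under assumptions rather than the engine's code: `swapDir`, `wrapRename`, and `demo` are hypothetical names, the journal and the durable pre-restore snapshot are left out, and a stale `.old` directory from an earlier crash is not handled here (the plan assigns that to the startup recovery sweep).

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
	"syscall"
)

// swapDir replaces live with the already-extracted staged directory using two
// renames and returns the path the previous contents were moved to.
func swapDir(live, staged string) (string, error) {
	// Staging must share live's parent so both renames stay on one filesystem.
	if filepath.Dir(live) != filepath.Dir(staged) {
		return "", fmt.Errorf("staged dir %q is not a sibling of %q", staged, live)
	}
	old := live + ".old"
	if err := os.Rename(live, old); err != nil {
		return "", wrapRename(err)
	}
	if err := os.Rename(staged, live); err != nil {
		// Second rename failed: move the original back so live is never missing.
		if rbErr := os.Rename(old, live); rbErr != nil {
			return "", errors.Join(wrapRename(err), fmt.Errorf("rollback failed: %w", rbErr))
		}
		return "", wrapRename(err)
	}
	return old, nil
}

// wrapRename makes a cross-device failure explicit instead of falling back to a copy.
func wrapRename(err error) error {
	if errors.Is(err, syscall.EXDEV) {
		return fmt.Errorf("cross-device rename, refusing to copy: %w", err)
	}
	return err
}

// demo builds a throwaway layout in a temp dir, swaps it, and reports what the
// live and .old directories contain afterwards.
func demo() (string, string, error) {
	root, err := os.MkdirTemp("", "restore-swap-")
	if err != nil {
		return "", "", err
	}
	defer os.RemoveAll(root)
	live := filepath.Join(root, "vol")
	staged := filepath.Join(root, "vol.tmp")
	for dir, content := range map[string]string{live: "current", staged: "restored"} {
		if err := os.MkdirAll(dir, 0o755); err != nil {
			return "", "", err
		}
		if err := os.WriteFile(filepath.Join(dir, "data.txt"), []byte(content), 0o644); err != nil {
			return "", "", err
		}
	}
	old, err := swapDir(live, staged)
	if err != nil {
		return "", "", err
	}
	liveData, err := os.ReadFile(filepath.Join(live, "data.txt"))
	if err != nil {
		return "", "", err
	}
	oldData, err := os.ReadFile(filepath.Join(old, "data.txt"))
	if err != nil {
		return "", "", err
	}
	return string(liveData), string(oldData), nil
}

func main() {
	liveData, oldData, err := demo()
	fmt.Println(liveData, oldData, err)
}
```

Because extraction finishes before the first rename, the destructive window shrinks to two metadata operations, and the `.old` directory stays in place as the revert source until the restore is confirmed.
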
## Build & Test Commands

- **Build:** `go build ./...`
- **Test:** `go test ./internal/...` (backend); from `web/`: `npm run test`
- **Lint:** `go vet ./internal/...`; from `web/`: `npm run check`
- **Frontend build:** from `web/`: `npm run build`
- **Dev:** `./scripts/dev-server.sh` (port 8090; restart after every build)

## Phases

- [x] Phase 1: Restore engine primitives + path-safe extractor + unit tests [domain: backend] → [subplan](./phase-1-engine-primitives.md)
- [x] Phase 2: Engine.Restore orchestration + lifecycle/locking + rollback [domain: backend] → [subplan](./phase-2-lifecycle-locking.md)
- [x] Phase 3: API endpoint + CSRF header + single-flight + wiring + tests [domain: backend] → [subplan](./phase-3-api.md)
- [x] Phase 4: UI Restore button + ConfirmDialog + i18n en+ru [domain: frontend] → [subplan](./phase-4-frontend.md)

## Parallelizable Phase Groups (Orchestrator mode only)

None: strictly sequential. Each phase depends on the prior (P2 needs P1 primitives + the
Lifecycle seam; P3 wires the adapter + needs `Engine.SetLifecycle`; P4 needs the endpoint).

## Phase Progress Log

| Phase | Domain | Status | Review | Build | Committed |
|-------|--------|--------|--------|-------|-----------|
| Phase 1: engine primitives | backend | ✅ Done | ✅ Passed (APPROVE w/ notes) | ✅ Passed | ⬜ |
| Phase 2: lifecycle/locking | backend | ✅ Done | ✅ Passed (APPROVE w/ notes) | ✅ Passed | ⬜ |
| Phase 3: API endpoint | backend | ✅ Done | ✅ Passed (go: APPROVE w/ notes; security: fixed CRITICAL) | ✅ Passed | ⬜ |
| Phase 4: frontend | frontend | ✅ Done | ✅ Passed (ts: APPROVE) | ✅ Passed (check 0 err, build, 26 tests) | ⬜ |

## Outstanding Warnings

| Phase | Warning | Severity | Status (open / resolved / accepted) |
|-------|---------|----------|-------------------------------------|
| (design) | Mid-restore crash can leave a per-volume MIXED state (some restored, some original); each volume is individually intact and the pre-restore snapshot is the full escape hatch. | 🟡 | accepted (documented v1 limit) |
| 2→3 | **B1 (was Blocker):** `RecoverInterruptedRestores()` + `SetLifecycle()` MUST be wired at startup BEFORE the API server serves; the restore endpoint must not be reachable without them. | 🔴→tracked | open (HARD Phase 3 prerequisite) |
| 2 | W3 residual: the swap-failure-after-partial-swap ORCHESTRATION branch (rollbackSwaps glue) is covered by primitive unit tests + recovery test + extract-failure orchestration test, but not by a full mid-swap fault-injection (needs an fs-fault seam not worth the production complexity). | 🟡 | accepted |

## Final Review

- [x] Comprehensive code review: ✅ READY TO MERGE (no blockers/warnings; 3 non-blocking notes)
- [x] Security review (untrusted-archive extraction + CSRF + admin gating): CRITICAL found & fixed (manifest-Source path traversal); re-derive from current config + containment
- [x] All Outstanding Warnings resolved or consciously accepted
- [x] Full build passes (`go build ./...`, `npm run build`)
- [x] Full test suite passes (`go test ./internal/...`, `npm run test` 26, `npm run check` 0 err)
- [ ] Merged to `main` (squash)

## Amendment Log

_(none yet)_