tiny-forge/plans/volume-snapshot-restore/phase-2-lifecycle-locking.md
alexei.dolgolyov 1c47030854 feat(volsnap): volume snapshot restore (backlog #6)
Restore a captured volume snapshot onto an image workload's live host-bind
data volumes, then redeploy — the most destructive workload action, built to
the adversarially-reviewed design (C1–C6) with all data-loss guards.

- Engine.Restore (engine-owned): all-or-nothing pre-flight re-resolution from
  the workload's CURRENT config (never the tamperable manifest), per-filesystem
  disk pre-check, per-workload lock, container quiesce, extract-to-tmp, durable
  pre-restore snapshot, write-ahead journal, atomic rename swap, redeploy, and
  crash-recovery sweep (RecoverInterruptedRestores) wired before serving.
- internal/keyedmutex: shared per-key lock; deployer now serializes every
  deploy entrypoint per workload via DispatchPlugin (+ LockWorkload/RedeployLocked
  for the restore re-dispatch, no deadlock).
- Untrusted-archive extractor: zip-slip containment, type allow-list (reg/dir
  only), decompression-bomb cap, manifest-index bounds.
- POST /api/workloads/{id}/snapshots/{sid}/restore: admin, X-Confirm-Restore
  header (CSRF), per-workload single-flight (409).
- WebUI: Restore button + danger ConfirmDialog + busy state + i18n (en/ru).

Scope: image-source only; scopes absolute/stage/project (driven off the same
supportedScopes constant capture uses).

Plan-reviewed before coding; per-phase go/security/ts reviews; final review
READY TO MERGE. Security review caught + fixed a CRITICAL manifest-Source path
traversal (re-derive target from current config + base containment).

Plan: plans/volume-snapshot-restore/
2026-06-22 17:23:52 +03:00

# Phase 2: Engine.Restore orchestration + lifecycle/locking + rollback
**Status:** ✅ Complete
**Parent plan:** [PLAN.md](./PLAN.md)
**Domain:** backend
## Objective
Wire the Phase 1 primitives into the full **stop → swap → redeploy** sequence under a
per-workload lock, with crash-safe rollback (journal + recovery sweep) and a durable
pre-restore auto-capture. Define the `Lifecycle` seam; modify the Deployer for per-workload
locking + an unlocked redeploy.
## Tasks
- [ ] **`internal/keyedmutex/keyedmutex.go`** — extract the `gitops.go` pattern into a shared
package: `type Mutex` with `Lock(key string) func()` and `TryLock(key string) (func(), bool)`
(the Try variant serves the Phase 3 API single-flight → 409). Unit test both.
- [ ] **Deployer locking (C1)** in `internal/deployer/`:
  - add `workloadLocks keyedmutex.Mutex` field.
  - refactor `DispatchPlugin` → `unlock := d.workloadLocks.Lock(w.ID); defer unlock(); return d.dispatchLocked(ctx, w, intent)`; move the current body into unexported `dispatchLocked`.
  - wrap `DispatchTeardown` in the same per-workload lock.
  - do NOT lock `DispatchReconcile` (periodic; image Reconcile is a no-op; reconciler `markMissingRows` only flips labels = benign; locking it would stall the reconcile loop behind long deploys).
  - expose `func (d *Deployer) LockWorkload(id string) func()` and `func (d *Deployer) RedeployLocked(ctx, w, intent) error` (= `dispatchLocked`, doc: "caller already holds the workload lock; calling DispatchPlugin would deadlock").
- [ ] **`volsnap.Lifecycle` interface** (in volsnap):
  - `Lock(workloadID string) func()`
  - `StopContainers(ctx, workloadID string) (runningTag string, err error)`: stop every running container for the workload; return the **newest-running** container's `ImageTag` (so redeploy pins the same version; empty ⇒ source default). Mark stopped rows `State="stopped"`.
  - `Redeploy(ctx, w store.Workload, reference string) error`: unlocked re-dispatch, Reason `"restore"`, Reference=tag.
- [ ] **`Engine.Restore(ctx, snapshotID, workloadID string) error`** in `internal/volsnap/restore.go`
  (engine owns it). Sequence (**does NOT hold `e.mu`**, R1):
  1. load snap; verify `snap.WorkloadID == workloadID`; load workload + settings; require `source_kind=="image"`.
  2. `parseManifest`; `preflightResolve` (C3: abort if any fails); `archiveUncompressedSize` + per-filesystem `freeDiskBytes` pre-check (C5/R4: abort).
  3. `unlock := lc.Lock(workloadID); defer unlock()` (C1).
  4. **re-validate** the workload still exists (R4: teardown may have won the lock); abort if gone.
  5. `tag, _ := lc.StopContainers(ctx, workloadID)` (C4 stop).
  6. **durably** capture pre-restore snapshot: `e.Create(w, settings, "pre-restore")` (folded; AFTER stop = quiesced; BEFORE any rename = R3). `Create` takes its own `e.mu`, so Restore must hold none.
  7. write **restore journal** `<snapDir>/restore-<workloadID>.json` (snapshotID, per-volume {live, old, tmp, swapped:false}).
  8. **extract ALL** volumes to their `tmp` staging dirs (`safeExtractIndex`); R3: this shrinks the destructive window to pure renames.
  9. **swap** each volume (`swapVolumeDir`), updating the journal `swapped=true` per volume.
  10. on ANY error in steps 8-9 → `rollbackSwaps` + `lc.Redeploy(ctx, w, tag)` + delete journal + return wrapped error.
  11. success → `lc.Redeploy(ctx, w, tag)` (C4 redeploy); remove `.old` staging dirs (reclaim disk); delete journal; best-effort audit event (`store.InsertEvent` source `"volsnap"`).
  - `Engine.SetLifecycle(lc Lifecycle)` setter; `Restore` errors clearly if lifecycle is nil.
- [ ] **`Engine.RecoverInterruptedRestores() (int, error)`** (R3) — startup sweep, mirrors
`CleanOrphans`: for each `restore-*.json` journal, per volume: if `swapped` → remove `old`+`tmp`;
else if live missing && old exists → rename old→live (revert mid-rename crash), remove tmp;
else (live present, not swapped) → remove tmp. Delete journal. Log loudly. (Wiring at startup
happens in Phase 3's main.go change, beside `CleanOrphans`.)
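
The per-volume branch of the recovery sweep can be read as a small pure decision function. The sketch below is illustrative only: `volumeEntry`, `action`, and `decide` are hypothetical names (the plan specifies just the journal fields `live`, `old`, `tmp`, `swapped`), and the real `recoverVolume` would perform the renames and removals rather than return a description of them.

```go
package main

import "fmt"

// volumeEntry mirrors one per-volume record of the restore journal.
// The Go field names are assumptions made for this sketch.
type volumeEntry struct {
	Live, Old, Tmp string
	Swapped        bool
}

// action describes what the sweep should do for one volume.
type action struct {
	RevertOldToLive bool // rename old -> live
	RemoveOld       bool
	RemoveTmp       bool
}

// decide applies the three-way branch from the task description.
// exists reports whether a path is present on disk; it is injected so the
// logic can be exercised without touching the filesystem.
func decide(v volumeEntry, exists func(string) bool) action {
	switch {
	case v.Swapped:
		// The swap completed: restored data is live, staging dirs are leftovers.
		return action{RemoveOld: true, RemoveTmp: true}
	case !exists(v.Live) && exists(v.Old):
		// Crash between the two renames: put the original data back.
		return action{RevertOldToLive: true, RemoveTmp: true}
	default:
		// Not swapped and nothing to revert: only the staging copy is removed.
		return action{RemoveTmp: true}
	}
}

func main() {
	v := volumeEntry{Live: "/data/vol", Old: "/data/vol.old", Tmp: "/data/vol.tmp"}
	// Simulate a crash after live was renamed to old, before tmp became live.
	onDisk := map[string]bool{"/data/vol.old": true, "/data/vol.tmp": true}
	fmt.Printf("%+v\n", decide(v, func(p string) bool { return onDisk[p] }))
}
```

Keeping the decision separate from the filesystem calls is one way to make the "simulate journals in each crash state" test cheap to write; it is a design suggestion, not a description of the shipped code.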
## Files to Modify/Create
- `internal/keyedmutex/keyedmutex.go` (+ `_test.go`) — shared lock (new)
- `internal/deployer/deployer.go`, `internal/deployer/dispatch.go` — workloadLocks, dispatchLocked, LockWorkload, RedeployLocked, locked Teardown
- `internal/volsnap/restore.go` — Lifecycle interface, Engine.Restore, RecoverInterruptedRestores, SetLifecycle, journal type
- `internal/volsnap/restore_test.go` — fake-Lifecycle orchestration tests (extends Phase 1 file)
- `internal/api/gitops.go` (optional, low-risk): migrate `keyedMutex` → `keyedmutex.Mutex` for DRY
## Acceptance Criteria
- Lock re-entrancy: `Engine.Restore` → `RedeployLocked` does NOT re-acquire the workload lock (no deadlock). All existing deployer tests still pass (lock is externally transparent).
- **Happy-path orchestration test uses the REAL `Engine.Create` (real store + `t.TempDir()`)** for the pre-restore capture so the `e.mu` deadlock (R1) would fail `go test`, not prod. Asserts call order: preflight → lock → stop → create → extract-all → swap-all → redeploy → cleanup.
- Rollback test: a swap fails midway → originals restored, redeploy called, journal deleted, error returned.
- Preflight-fail test: lock/stop NEVER called (abort before lock).
- Disk-pre-check-fail test: abort before lock.
- `RecoverInterruptedRestores` test: simulate journals in each crash state → correct revert/keep/cleanup.
- `go build ./...`, `go vet ./internal/...`, `go test ./internal/...` green.
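
The lock re-entrancy criterion rests on splitting each entrypoint into a locking wrapper and an unlocked body. A minimal sketch of that split follows, with simplified signatures: the plan's methods also take `ctx`, the workload, and an intent, and the `deployer` type, the `restore` helper, and the `deploys` log are stand-ins invented for this example.

```go
package main

import (
	"fmt"
	"sync"
)

// deployer is a stand-in for the real Deployer.
type deployer struct {
	mu      sync.Mutex
	locks   map[string]*sync.Mutex
	deploys []string // record of dispatches, for demonstration only
}

// LockWorkload blocks until the caller owns the workload's lock and
// returns the function that releases it.
func (d *deployer) LockWorkload(id string) func() {
	d.mu.Lock()
	if d.locks == nil {
		d.locks = make(map[string]*sync.Mutex)
	}
	l, ok := d.locks[id]
	if !ok {
		l = &sync.Mutex{}
		d.locks[id] = l
	}
	d.mu.Unlock()
	l.Lock()
	return l.Unlock
}

// DispatchPlugin is the public entrypoint: lock, then run the body.
func (d *deployer) DispatchPlugin(id, reason string) error {
	unlock := d.LockWorkload(id)
	defer unlock()
	return d.dispatchLocked(id, reason)
}

// RedeployLocked runs the body without locking. The caller must already
// hold the workload lock; calling DispatchPlugin instead would deadlock.
func (d *deployer) RedeployLocked(id, reason string) error {
	return d.dispatchLocked(id, reason)
}

func (d *deployer) dispatchLocked(id, reason string) error {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.deploys = append(d.deploys, id+":"+reason)
	return nil
}

// restore mimics only the lock scope of Engine.Restore.
func restore(d *deployer, id string) error {
	unlock := d.LockWorkload(id)
	defer unlock()
	// stop, capture, extract, and swap would happen here
	return d.RedeployLocked(id, "restore")
}

func main() {
	d := &deployer{}
	if err := restore(d, "w1"); err != nil {
		fmt.Println("restore failed:", err)
		return
	}
	if err := d.DispatchPlugin("w1", "manual"); err != nil {
		fmt.Println("dispatch failed:", err)
		return
	}
	fmt.Println(d.deploys)
}
```

Because `sync.Mutex` is not re-entrant, replacing `RedeployLocked` with `DispatchPlugin` inside `restore` would hang, which is the failure mode a real-lock test is meant to surface.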
## Notes
- ⚠️ The Deployer lock change touches the hot deploy path — verify no existing path re-enters `DispatchPlugin` under a held lock (webhook preview = sequential teardown-then-deploy on the child, not nested — confirmed safe).
- The API single-flight (Phase 3) is a fast 409 reject; the deployer lock is the real mutex — they compose (document).
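
How the two locks compose can be sketched with a small keyed mutex. This is an assumption-laden illustration, not the contents of `internal/keyedmutex`: it never evicts map entries, and `apiFlights`, `workloadLocks`, and `restoreStatus` are hypothetical names. It relies on `sync.Mutex.TryLock`, available since Go 1.18.

```go
package main

import (
	"fmt"
	"sync"
)

// Mutex is a per-key lock; the zero value is ready to use.
type Mutex struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

// get returns the mutex for key, creating it on first use.
func (m *Mutex) get(key string) *sync.Mutex {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.locks == nil {
		m.locks = make(map[string]*sync.Mutex)
	}
	l, ok := m.locks[key]
	if !ok {
		l = &sync.Mutex{}
		m.locks[key] = l
	}
	return l
}

// Lock blocks until key is held and returns the unlock function.
func (m *Mutex) Lock(key string) func() {
	l := m.get(key)
	l.Lock()
	return l.Unlock
}

// TryLock acquires key only if it is free right now.
func (m *Mutex) TryLock(key string) (func(), bool) {
	l := m.get(key)
	if !l.TryLock() {
		return nil, false
	}
	return l.Unlock, true
}

var (
	apiFlights    Mutex // API single-flight: fast reject
	workloadLocks Mutex // deployer lock: the real mutex
)

// restoreStatus returns the HTTP status a restore request would receive.
func restoreStatus(workloadID string, restore func()) int {
	release, ok := apiFlights.TryLock(workloadID)
	if !ok {
		return 409 // another restore for this workload is in flight
	}
	defer release()
	unlock := workloadLocks.Lock(workloadID) // waits behind deploys and teardowns
	defer unlock()
	restore()
	return 200
}

func main() {
	var inner int
	outer := restoreStatus("w1", func() {
		// A second request for the same workload arrives mid-restore.
		inner = restoreStatus("w1", func() {})
	})
	fmt.Println(outer, inner)
}
```

The single-flight check uses its own `Mutex` instance, so a rejected request never waits on the deployer lock, while an accepted request still queues behind any deploy already running for that workload.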
## Review Checklist
- [ ] All tasks completed
- [ ] Code follows project conventions
- [ ] No unintended side effects (existing deploy/teardown behavior unchanged externally)
- [ ] Build passes
- [ ] Tests pass (new + existing)
## Handoff to Next Phase
Implemented: `internal/keyedmutex` (Lock+TryLock, tested); deployer `workloadLocks` +
`dispatchLocked` + `LockWorkload` + `RedeployLocked`, `DispatchPlugin`/`DispatchTeardown`
now per-workload-locked (reconciler intentionally NOT). `volsnap.Lifecycle` interface,
`Engine.Restore`, `restoreJournal` (atomic write — W1), `RecoverInterruptedRestores`,
`recoverVolume`, `checkDiskSpace`, `SetLifecycle`. Tests: `restore_engine_test.go`
(happy/real-Create, redeploy-fail, preflight-abort, extract-fail-after-lock, nil-lifecycle,
wrong-workload, recovery×3 states), `keyedmutex_test.go`. Full `go test ./internal/...` green.
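
An atomic journal write (W1) is commonly done as write-to-temp, fsync, then rename over the target. The sketch below assumes that technique; the source does not show how `restoreJournal` is persisted. The struct layout, JSON tags, and the names `journalVolume`, `writeJournalAtomic`, `readJournal`, and `demo` are hypothetical, and a stricter variant would also fsync the parent directory.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

// journalVolume and restoreJournal follow the fields named in the plan;
// the JSON tags are assumptions.
type journalVolume struct {
	Live    string `json:"live"`
	Old     string `json:"old"`
	Tmp     string `json:"tmp"`
	Swapped bool   `json:"swapped"`
}

type restoreJournal struct {
	SnapshotID string          `json:"snapshot_id"`
	Volumes    []journalVolume `json:"volumes"`
}

// writeJournalAtomic replaces path so readers see either the old or the
// new journal, never a partial file.
func writeJournalAtomic(path string, j restoreJournal) error {
	data, err := json.Marshal(j)
	if err != nil {
		return err
	}
	// The temp name starts with "." so a restore-*.json glob skips it.
	f, err := os.CreateTemp(filepath.Dir(path), ".journal-*")
	if err != nil {
		return err
	}
	tmp := f.Name()
	defer os.Remove(tmp) // fails harmlessly once the rename has succeeded
	if _, err := f.Write(data); err != nil {
		f.Close()
		return err
	}
	if err := f.Sync(); err != nil { // flush to disk before publishing
		f.Close()
		return err
	}
	if err := f.Close(); err != nil {
		return err
	}
	return os.Rename(tmp, path)
}

func readJournal(path string) (restoreJournal, error) {
	var j restoreJournal
	data, err := os.ReadFile(path)
	if err != nil {
		return j, err
	}
	err = json.Unmarshal(data, &j)
	return j, err
}

// demo writes a journal, marks the volume swapped, rewrites, and reads back.
func demo() (restoreJournal, error) {
	dir, err := os.MkdirTemp("", "volsnap-journal")
	if err != nil {
		return restoreJournal{}, err
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "restore-w1.json")
	j := restoreJournal{SnapshotID: "s1", Volumes: []journalVolume{
		{Live: "/data/vol", Old: "/data/vol.old", Tmp: "/data/vol.tmp"},
	}}
	if err := writeJournalAtomic(path, j); err != nil {
		return restoreJournal{}, err
	}
	j.Volumes[0].Swapped = true
	if err := writeJournalAtomic(path, j); err != nil {
		return restoreJournal{}, err
	}
	return readJournal(path)
}

func main() {
	j, err := demo()
	fmt.Printf("%+v %v\n", j, err)
}
```
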
**Review (go-reviewer, APPROVE WITH NOTES):** no functional blockers in this diff. Verified:
no lock re-entrancy/`e.mu` self-deadlock, no prune-race (extract-all precedes `e.Create`),
recovery state machine doesn't revert good data. Addressed in-phase: W1 (atomic journal),
W3 (extract-failure orchestration test). Residual W3 (mid-swap fault injection) accepted.
**🔴 HARD PREREQUISITES for Phase 3 (B1 + N1 from review):**
1. Wire `snapshotEngine.RecoverInterruptedRestores()` at startup in `cmd/server/main.go`,
BEFORE the API server serves — beside the existing `CleanOrphans()` call (~main.go:333).
Without it the journal/WAL protects nothing — a crash mid-restore is unrecovered.
2. Wire `snapshotEngine.SetLifecycle(adapter)` strictly BEFORE serving (same place as
`SetSnapshotEngine`) so the `e.lifecycle` field is safely published (no race).
3. The restore endpoint MUST NOT be reachable until both are wired.
**Lifecycle adapter (Phase 3, main.go) maps:** `Lock` → `deployer.LockWorkload`;
`StopContainers` → `store.ListContainersByWorkload` + `docker.StopContainer` each running +
`UpdateContainerState(...,"stopped")` + return newest-running `ImageTag`;
`Redeploy` → `deployer.RedeployLocked` with a `restore`-reason intent (Reference=tag).
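
The `StopContainers` mapping can be sketched over an in-memory slice. Everything here is a stand-in: the `container` type, the `stop` callback (in place of `docker.StopContainer`), and in particular the use of a `CreatedAt` timestamp to decide which running container is "newest", since the plan does not say how that is determined.

```go
package main

import (
	"fmt"
	"time"
)

// container is a hypothetical row shape for this sketch.
type container struct {
	ID, ImageTag, State string
	CreatedAt           time.Time
}

// stopContainers returns the image tag of the newest running container,
// then stops every running container and marks it "stopped" in place.
// The tag is chosen before any stop call so it does not depend on how far
// the stop loop gets.
func stopContainers(cs []container, stop func(id string) error) (string, error) {
	tag, found := "", false
	var newest time.Time
	for _, c := range cs {
		if c.State != "running" {
			continue
		}
		if !found || c.CreatedAt.After(newest) {
			tag, newest, found = c.ImageTag, c.CreatedAt, true
		}
	}
	for i := range cs {
		if cs[i].State != "running" {
			continue
		}
		if err := stop(cs[i].ID); err != nil {
			return tag, fmt.Errorf("stop %s: %w", cs[i].ID, err)
		}
		cs[i].State = "stopped"
	}
	return tag, nil // empty tag means nothing was running: use the source default
}

// demo runs stopContainers over a fixed set of rows.
func demo() (string, []container, error) {
	t0 := time.Date(2026, 1, 1, 0, 0, 0, 0, time.UTC)
	cs := []container{
		{ID: "a", ImageTag: "v1", State: "running", CreatedAt: t0},
		{ID: "b", ImageTag: "v2", State: "running", CreatedAt: t0.Add(time.Hour)},
		{ID: "c", ImageTag: "v3", State: "stopped", CreatedAt: t0.Add(2 * time.Hour)},
	}
	tag, err := stopContainers(cs, func(string) error { return nil })
	return tag, cs, err
}

func main() {
	tag, cs, err := demo()
	fmt.Println(tag, len(cs), err)
}
```

The already-stopped container `c` is newer than `b` but is ignored, so the redeploy would pin `v2`, the version that was actually serving when the restore began.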