feat(volsnap): volume snapshot restore (backlog #6)
Restore a captured volume snapshot onto an image workload's live host-bind
data volumes, then redeploy — the most destructive workload action, built to
the adversarially-reviewed design (C1–C6) with all data-loss guards.
- Engine.Restore (engine-owned): all-or-nothing pre-flight re-resolution from
the workload's CURRENT config (never the tamperable manifest), per-filesystem
disk pre-check, per-workload lock, container quiesce, extract-to-tmp, durable
pre-restore snapshot, write-ahead journal, atomic rename swap, redeploy, and
crash-recovery sweep (RecoverInterruptedRestores) wired before serving.
- internal/keyedmutex: shared per-key lock; deployer now serializes every
deploy entrypoint per workload via DispatchPlugin (+ LockWorkload/RedeployLocked
for the restore re-dispatch, no deadlock).
- Untrusted-archive extractor: zip-slip containment, type allow-list (reg/dir
only), decompression-bomb cap, manifest-index bounds.
- POST /api/workloads/{id}/snapshots/{sid}/restore: admin, X-Confirm-Restore
header (CSRF), per-workload single-flight (409).
- WebUI: Restore button + danger ConfirmDialog + busy state + i18n (en/ru).
Scope: image-source only; scopes absolute/stage/project (driven off the same
supportedScopes constant capture uses).
Plan-reviewed before coding; per-phase go/security/ts reviews; final review
READY TO MERGE. Security review caught + fixed a CRITICAL manifest-Source path
traversal (re-derive target from current config + base containment).
Plan: plans/volume-snapshot-restore/
This commit is contained in:
@@ -0,0 +1,76 @@
|
||||
# Phase 3: API endpoint + CSRF header + single-flight + wiring + tests
|
||||
|
||||
**Status:** ✅ Complete
|
||||
**Parent plan:** [PLAN.md](./PLAN.md)
|
||||
**Domain:** backend
|
||||
|
||||
## Objective
|
||||
|
||||
Expose restore over HTTP behind the destructive-action guards, and wire the real
|
||||
`Lifecycle` adapter + startup recovery sweep at the composition root.
|
||||
|
||||
## Tasks
|
||||
|
||||
- [ ] **`restoreWorkloadSnapshot` handler** in `internal/api/volume_snapshots.go`:
|
||||
`POST /api/workloads/{id}/snapshots/{sid}/restore`.
|
||||
- require `X-Confirm-Restore: <sid>` header == path `sid` (C6, mirror `backups.go` `restoreBackup`); mismatch → 400.
|
||||
- per-workload single-flight via `keyedmutex.TryLock(id)` (or a `sync.Map` LoadOrStore) → 409 if a restore for this workload is already running; release on completion. (Different apps restore concurrently.)
|
||||
- load snapshot via `snapshotEngine.Get(sid)`; verify `snap.WorkloadID == id`; load workload; require `source_kind=="image"` (else 400).
|
||||
- call `s.snapshotEngine.Restore(r.Context(), id, sid)`. Synchronous (mirrors `deployPluginWorkload` which blocks on dispatch). `ErrNoSnapshotData`-class / client-actionable → 400; success → 200 `{"status":"restored",...}`; other → 500 with a generic message (no internal detail leak — mirror existing handlers; raw error to `slog`).
|
||||
- [ ] **Route registration** in `internal/api/router.go` under the admin group on `/workloads/{id}`, beside the existing `POST /snapshots`. Admin-gated.
|
||||
- [ ] **Lifecycle adapter + recovery wiring** in `cmd/server/main.go` (composition root — already
|
||||
imports deployer + docker + store + volsnap):
|
||||
- a small struct implementing `volsnap.Lifecycle`: `Lock`→`deployer.LockWorkload`; `StopContainers`→`store.ListContainersByWorkload` + `docker.StopContainer` each running + `store.UpdateContainerState(...,"stopped")` + return newest-running `ImageTag`; `Redeploy`→`deployer.RedeployLocked` with `toPluginWorkload`/`WorkloadFromStore` + a `restore`-reason intent.
|
||||
- `snapshotEngine.SetLifecycle(adapter)`.
|
||||
- call `snapshotEngine.RecoverInterruptedRestores()` at startup beside the existing `CleanOrphans()` call; log result.
|
||||
- keeps the `api` package decoupled from `deployer` (router.go deliberately avoids importing deployer).
|
||||
|
||||
## Files to Modify/Create
|
||||
|
||||
- `internal/api/volume_snapshots.go` — `restoreWorkloadSnapshot` handler + single-flight field on `Server`
|
||||
- `internal/api/router.go` — route registration
|
||||
- `cmd/server/main.go` — Lifecycle adapter, `SetLifecycle`, `RecoverInterruptedRestores` startup wiring
|
||||
- `internal/api/volume_snapshots_test.go` — handler tests (extends existing file)
|
||||
|
||||
## Acceptance Criteria
|
||||
|
||||
- Missing/mismatched `X-Confirm-Restore` → 400.
|
||||
- Concurrent second restore for the same workload → 409.
|
||||
- Non-image workload → 400; snapshot belonging to another workload → 400/404.
|
||||
- Happy path → 200 (with a fake engine/lifecycle or a seeded real engine).
|
||||
- `go build ./...`, `go vet ./internal/...`, `go test ./internal/...` green; existing snapshot tests pass.
|
||||
|
||||
## Notes
|
||||
|
||||
- The handler is thin: validation + single-flight + delegate to `Engine.Restore`. All the dangerous logic stays in the engine (folded requirement).
|
||||
- Confirm the admin group + `X-Confirm-Restore` together match the DB-restore threat model (custom header defeats CSRF form/img posts; admin JWT + AdminOnly gates authz).
|
||||
|
||||
## Review Checklist
|
||||
|
||||
- [ ] All tasks completed
|
||||
- [ ] Code follows project conventions
|
||||
- [ ] No unintended side effects
|
||||
- [ ] Build passes
|
||||
- [ ] Tests pass (new + existing)
|
||||
|
||||
## Handoff to Next Phase
|
||||
|
||||
Implemented: `restoreWorkloadSnapshot` handler (`internal/api/volume_snapshots.go`) — admin,
|
||||
`X-Confirm-Restore: <sid>` header, per-workload single-flight (`volRestoreInFlight.TryLock`
|
||||
→ 409); route `POST /api/workloads/{id}/snapshots/{sid}/restore`; `restoreLifecycle` adapter
|
||||
(`cmd/server/restore_lifecycle.go`); main.go wires `SetLifecycle` + `RecoverInterruptedRestores`
|
||||
BEFORE serving (B1/N1 resolved). Tests: header miss/mismatch→400, wrong-workload→400,
|
||||
non-image→400, not-found→404, happy-path→200, single-flight→409.
|
||||
|
||||
**SECURITY FIX (review BLOCK → resolved):** the security review caught a CRITICAL — the
|
||||
manifest's persisted `Source`/`Scope` (attacker-influenceable) was being trusted to compute
|
||||
the destructive swap target, so `Source:"../../etc"` could clobber `/etc`. Fixed by
|
||||
re-deriving the swap target from the workload's CURRENT config keyed by container Target path
|
||||
(`volumesByTarget` shared helper) + a `pathWithinBase` containment assertion for base-relative
|
||||
scopes. Regression guards: `TestPreflightResolve_IgnoresManifestSource`,
|
||||
`TestPreflightResolve_AllOrNothing`. Both reviews now clear.
|
||||
|
||||
**API for Phase 4 (frontend):** `POST /api/workloads/{id}/snapshots/{sid}/restore` with header
|
||||
`X-Confirm-Restore: <sid>` (REQUIRED — mirror how the DB restore sends it). 200 `{status:"restored"}`,
|
||||
400 (header/ownership/non-image), 404 (not found), 409 (already in progress), 500 (engine).
|
||||
Restore is synchronous (blocks until done, like deploy).
|
||||
Reference in New Issue
Block a user