Files

T

alexei.dolgolyov 1c47030854 feat(volsnap): volume snapshot restore (backlog #6 )

Restore a captured volume snapshot onto an image workload's live host-bind
data volumes, then redeploy — the most destructive workload action, built to
the adversarially-reviewed design (C1–C6) with all data-loss guards.

- Engine.Restore (engine-owned): all-or-nothing pre-flight re-resolution from
  the workload's CURRENT config (never the tamperable manifest), per-filesystem
  disk pre-check, per-workload lock, container quiesce, extract-to-tmp, durable
  pre-restore snapshot, write-ahead journal, atomic rename swap, redeploy, and
  crash-recovery sweep (RecoverInterruptedRestores) wired before serving.
- internal/keyedmutex: shared per-key lock; deployer now serializes every
  deploy entrypoint per workload via DispatchPlugin (+ LockWorkload/RedeployLocked
  for the restore re-dispatch, no deadlock).
- Untrusted-archive extractor: zip-slip containment, type allow-list (reg/dir
  only), decompression-bomb cap, manifest-index bounds.
- POST /api/workloads/{id}/snapshots/{sid}/restore: admin, X-Confirm-Restore
  header (CSRF), per-workload single-flight (409).
- WebUI: Restore button + danger ConfirmDialog + busy state + i18n (en/ru).

Scope: image-source only; scopes absolute/stage/project (driven off the same
supportedScopes constant capture uses).

Plan-reviewed before coding; per-phase go/security/ts reviews; final review
READY TO MERGE. Security review caught + fixed a CRITICAL manifest-Source path
traversal (re-derive target from current config + base containment).

Plan: plans/volume-snapshot-restore/

2026-06-22 17:23:52 +03:00

4.9 KiB

Raw Blame History

Phase 3: API endpoint + CSRF header + single-flight + wiring + tests

Status: ✅ Complete Parent plan: PLAN.md Domain: backend

Objective

Expose restore over HTTP behind the destructive-action guards, and wire the real Lifecycle adapter + startup recovery sweep at the composition root.

Tasks

restoreWorkloadSnapshot handler in internal/api/volume_snapshots.go: POST /api/workloads/{id}/snapshots/{sid}/restore.
- require X-Confirm-Restore: <sid> header == path sid (C6, mirror backups.go restoreBackup); mismatch → 400.
- per-workload single-flight via keyedmutex.TryLock(id) (or a sync.Map LoadOrStore) → 409 if a restore for this workload is already running; release on completion. (Different apps restore concurrently.)
- load snapshot via snapshotEngine.Get(sid); verify snap.WorkloadID == id; load workload; require source_kind=="image" (else 400).
- call s.snapshotEngine.Restore(r.Context(), id, sid). Synchronous (mirrors deployPluginWorkload which blocks on dispatch). ErrNoSnapshotData-class / client-actionable → 400; success → 200 {"status":"restored",...}; other → 500 with a generic message (no internal detail leak — mirror existing handlers; raw error to slog).
Route registration in internal/api/router.go under the admin group on /workloads/{id}, beside the existing POST /snapshots. Admin-gated.
Lifecycle adapter + recovery wiring in cmd/server/main.go (composition root — already imports deployer + docker + store + volsnap):
- a small struct implementing volsnap.Lifecycle: Lock→deployer.LockWorkload; StopContainers→store.ListContainersByWorkload + docker.StopContainer each running + store.UpdateContainerState(...,"stopped") + return newest-running ImageTag; Redeploy→deployer.RedeployLocked with toPluginWorkload/WorkloadFromStore + a restore-reason intent.
- snapshotEngine.SetLifecycle(adapter).
- call snapshotEngine.RecoverInterruptedRestores() at startup beside the existing CleanOrphans() call; log result.
- keeps the api package decoupled from deployer (router.go deliberately avoids importing deployer).

Files to Modify/Create

internal/api/volume_snapshots.go — restoreWorkloadSnapshot handler + single-flight field on Server
internal/api/router.go — route registration
cmd/server/main.go — Lifecycle adapter, SetLifecycle, RecoverInterruptedRestores startup wiring
internal/api/volume_snapshots_test.go — handler tests (extends existing file)

Acceptance Criteria

Missing/mismatched X-Confirm-Restore → 400.
Concurrent second restore for the same workload → 409.
Non-image workload → 400; snapshot belonging to another workload → 400/404.
Happy path → 200 (with a fake engine/lifecycle or a seeded real engine).
go build ./..., go vet ./internal/..., go test ./internal/... green; existing snapshot tests pass.

Notes

The handler is thin: validation + single-flight + delegate to Engine.Restore. All the dangerous logic stays in the engine (folded requirement).
Confirm the admin group + X-Confirm-Restore together match the DB-restore threat model (custom header defeats CSRF form/img posts; admin JWT + AdminOnly gates authz).

Review Checklist

All tasks completed
Code follows project conventions
No unintended side effects
Build passes
Tests pass (new + existing)

Handoff to Next Phase

Implemented: restoreWorkloadSnapshot handler (internal/api/volume_snapshots.go) — admin, X-Confirm-Restore: <sid> header, per-workload single-flight (volRestoreInFlight.TryLock → 409); route POST /api/workloads/{id}/snapshots/{sid}/restore; restoreLifecycle adapter (cmd/server/restore_lifecycle.go); main.go wires SetLifecycle + RecoverInterruptedRestores BEFORE serving (B1/N1 resolved). Tests: header miss/mismatch→400, wrong-workload→400, non-image→400, not-found→404, happy-path→200, single-flight→409.

SECURITY FIX (review BLOCK → resolved): the security review caught a CRITICAL — the manifest's persisted Source/Scope (attacker-influenceable) was being trusted to compute the destructive swap target, so Source:"../../etc" could clobber /etc. Fixed by re-deriving the swap target from the workload's CURRENT config keyed by container Target path (volumesByTarget shared helper) + a pathWithinBase containment assertion for base-relative scopes. Regression guards: TestPreflightResolve_IgnoresManifestSource, TestPreflightResolve_AllOrNothing. Both reviews now clear.

API for Phase 4 (frontend): POST /api/workloads/{id}/snapshots/{sid}/restore with header X-Confirm-Restore: <sid> (REQUIRED — mirror how the DB restore sends it). 200 {status:"restored"}, 400 (header/ownership/non-image), 404 (not found), 409 (already in progress), 500 (engine). Restore is synchronous (blocks until done, like deploy).

4.9 KiB Raw Blame History