tiny-forge/plans/volume-snapshot-restore/CONTEXT.md
alexei.dolgolyov 1c47030854 feat(volsnap): volume snapshot restore (backlog #6)
Restore a captured volume snapshot onto an image workload's live host-bind
data volumes, then redeploy — the most destructive workload action, built to
the adversarially-reviewed design (C1–C6) with all data-loss guards.

- Engine.Restore (engine-owned): all-or-nothing pre-flight re-resolution from
  the workload's CURRENT config (never the tamperable manifest), per-filesystem
  disk pre-check, per-workload lock, container quiesce, extract-to-tmp, durable
  pre-restore snapshot, write-ahead journal, atomic rename swap, redeploy, and
  crash-recovery sweep (RecoverInterruptedRestores) wired before serving.
- internal/keyedmutex: shared per-key lock; deployer now serializes every
  deploy entrypoint per workload via DispatchPlugin (+ LockWorkload/RedeployLocked
  for the restore re-dispatch, no deadlock).
- Untrusted-archive extractor: zip-slip containment, type allow-list (reg/dir
  only), decompression-bomb cap, manifest-index bounds.
- POST /api/workloads/{id}/snapshots/{sid}/restore: admin, X-Confirm-Restore
  header (CSRF), per-workload single-flight (409).
- WebUI: Restore button + danger ConfirmDialog + busy state + i18n (en/ru).

Scope: image-source only; scopes absolute/stage/project (driven off the same
supportedScopes constant capture uses).

Plan-reviewed before coding; per-phase go/security/ts reviews; final review
READY TO MERGE. Security review caught + fixed a CRITICAL manifest-Source path
traversal (re-derive target from current config + base containment).
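
The base-containment half of that fix can be illustrated with a minimal sketch. It is not the repository's code: `containedJoin` and `ErrEscapesBase` are hypothetical names, and the sketch only covers lexical containment (symlinks are not resolved), assuming the target has already been re-derived from the workload's current config.

```go
package main

import (
	"errors"
	"path/filepath"
	"strings"
)

// ErrEscapesBase is a hypothetical sentinel error used only in this sketch.
var ErrEscapesBase = errors.New("path escapes base directory")

// containedJoin joins rel onto base and rejects any result that lands
// outside base. The check is purely lexical: it does not follow symlinks.
func containedJoin(base, rel string) (string, error) {
	if filepath.IsAbs(rel) {
		return "", ErrEscapesBase
	}
	joined := filepath.Join(base, rel) // Join also cleans ".." segments
	back, err := filepath.Rel(base, joined)
	if err != nil {
		return "", err
	}
	// A relative path that is ".." or starts with "../" points above base.
	if back == ".." || strings.HasPrefix(back, ".."+string(filepath.Separator)) {
		return "", ErrEscapesBase
	}
	return joined, nil
}
```

Comparing against the result of `filepath.Rel` rather than a raw string prefix avoids treating `/srv/data-other` as if it were inside `/srv/data`.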

Plan: plans/volume-snapshot-restore/
2026-06-22 17:23:52 +03:00


CONTEXT — Volume Snapshot Restore

Working memory across phases. The orchestrator owns this file.

Settings (from PLAN.md header)

  • Mode: Automated · Execution: Hybrid (backend Direct, Phase 4 frontend implementer) · Strategy: Incremental
  • Base: main · Branch: feature/volume-snapshot-restore · Remote: origin (Gitea)
  • Build: go build ./... · Test: go test ./internal/... + npm run test · Lint: go vet ./internal/... + npm run check

Key codebase facts (verified during planning)

  • Deploy choke point: every deploy entrypoint calls deployer.DispatchPlugin → put the per-workload lock there (C1). Entrypoints: deployPluginWorkload, rollbackWorkload, promoteFromWorkload, dispatchGeneric, webhook fireBinding/handlePreviewIntent.
  • activeWg/drainMu in deployer.go = global drain barrier, NOT a per-workload lock.
  • Image idempotency short-circuit (image.go Deploy ~L170-181) only fires for a verified-running container → after stop, redeploy makes a fresh container; blue-green enforceMaxInstances reaps the old stopped one. ⇒ stop→swap→redeploy (C4) is correct.
  • Scope resolution (internal/volume/resolver.go): stage/project → <base>/<workload>/<source> (shared per-workload dir); absolute → operator's allowed path. Place the tmp/old sibling dirs under the live dir's PARENT so renames stay on the same filesystem (R2).
  • volsnap.Engine has e.mu taken by Create/Delete/pruneWorkload/CleanOrphans. Restore must NOT hold e.mu (R1).
  • Archive layout: gzip tar, each volume under integer subdir 0/,1/…, manifest.json at root = []SnapshotVolume{Index,Target,Scope,Source}. supportedScopes = absolute/stage/project (volumes.go).
  • Precedent: internal/api/backups.go restoreBackup — X-Confirm-Restore==id, restoreInFlight CAS→409, pre-restore safety backup, atomic rename swap.
  • Composition root: cmd/server/main.go constructs deployer.New + volsnap.New + docker + store; calls CleanOrphans at startup (wire RecoverInterruptedRestores there).
  • Frontend: WorkloadSnapshotsPanel.svelte; api fns web/src/lib/api.ts ~L581; i18n apps.detail.snapshots.* in en.json + ru.json.
  • golang.org/x/sys v0.33.0 already in go.mod (indirect); build-tag precedent exists (lockfile_windows.go/lockfile_unix.go).
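
A per-workload lock at the deploy choke point (C1) could look roughly like the sketch below. This is an illustration only, not the actual internal/keyedmutex package: the `KeyedMutex` type, its `Lock` method and the reference-counted map are assumptions made for the example.

```go
package main

import "sync"

// KeyedMutex is a hypothetical per-key lock: callers using the same key
// serialize, callers using different keys do not block each other.
type KeyedMutex struct {
	mu    sync.Mutex // guards the map and the ref counts
	locks map[string]*entry
}

type entry struct {
	mu   sync.Mutex
	refs int // holders plus waiters; the entry is dropped at zero
}

// Lock blocks until the lock for key is held and returns its unlock func.
func (k *KeyedMutex) Lock(key string) (unlock func()) {
	k.mu.Lock()
	if k.locks == nil {
		k.locks = make(map[string]*entry)
	}
	e, ok := k.locks[key]
	if !ok {
		e = &entry{}
		k.locks[key] = e
	}
	e.refs++
	k.mu.Unlock()

	e.mu.Lock() // wait outside k.mu so other keys are never blocked
	return func() {
		e.mu.Unlock()
		k.mu.Lock()
		e.refs--
		if e.refs == 0 {
			delete(k.locks, key) // keep the map from growing without bound
		}
		k.mu.Unlock()
	}
}
```

Under these assumptions, a dispatch function would call `unlock := km.Lock(workloadID)` on entry and `defer unlock()`. A caller that already holds the lock needs a separate non-locking path, since `sync.Mutex` is not reentrant.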

Decisions / invariants

  • Engine.Restore holds NO e.mu; per-workload Lifecycle.Lock is the serialization.
  • Extract ALL tmp dirs BEFORE any rename; swap is pure renames; journal tracks per-volume swapped.
  • Pre-restore snapshot captured AFTER stop, BEFORE first rename (durable escape hatch).
  • Redeploy pins the tag of the newest running container, so the same version comes back up.
  • Mixed per-volume state after a mid-restore crash is an accepted v1 limit (each volume intact; pre-restore snapshot = full revert).

Deferred / out of scope

  • Named/project_named/instance/ephemeral scopes (consistent with capture).
  • Non-image sources.
  • Fully-atomic all-volumes-or-nothing restore (v1 is per-volume atomic + journal recovery).

Failed approaches / gotchas

  • (none yet)

Phase handoffs

  • Phase 1 → 2: (filled after Phase 1)
  • Phase 2 → 3: (filled after Phase 2)
  • Phase 3 → 4: (filled after Phase 3)