feat(snapshots): capture app data-volume snapshots

Add per-workload capture of host-bind data volumes as downloadable tar.gz archives: a new internal/volsnap engine (enumerate host-bind volumes via the computeMounts merge, archive with archive/tar+gzip skipping symlinks/special files, per-workload retention + startup orphan cleanup), a volume_snapshots table + store CRUD, admin-gated API (list/snapshotable/create/download/delete), and a Snapshots panel on /apps/[id] that shows coverage and which volumes are skipped (and why). Scope: image-source apps, host-bind scopes (absolute/stage/project); Docker named volumes, tmpfs, and instance scope are surfaced as not-yet-supported. Restore is a separate later phase. Download/FilePath are containment-checked; create returns a typed no-data error (400) vs generic 500. Covered by archiver unit tests + full API e2e.
chore: add .vex.toml vex index config
2026-06-02 14:56:10 +03:00 · 2026-06-02 13:34:42 +03:00 · 2026-06-02 13:34:42 +03:00 · 2026-06-02 13:34:05 +03:00 · 2026-05-29 16:11:46 +03:00 · 2026-05-29 15:26:09 +03:00
174 changed files with 24289 additions and 3055 deletions
@@ -6,7 +6,10 @@ data/
 .env
 tinyforge
 tinyforge.exe
+/cli
+/cli.exe
 server.exe
+tinyforge-server.exe
 docker-watcher
 docker-watcher.exe
 docker-watcher.exe~
@@ -0,0 +1,57 @@
+# vex configuration — https://github.com/tenatarika/vex
+#
+# Place this file in your project root as .vex.toml
+
+# Glob patterns to exclude from indexing (gitignore syntax, on top of .gitignore)
+# exclude = [
+#     "vendor/**",
+#     "node_modules/**",
+#     "*.generated.go",
+#     "dist/**",
+# ]
+
+# Default output format: "text", "json", or "compact"
+# format = "text"
+
+# Enable semantic embeddings by default (slower indexing, enables meaning-based search)
+semantic = true
+
+# Automatically run `vex update` before search if the index is stale
+auto_update = true
+
+# Embedder used for semantic indexing. Known IDs: minilm-l6-v2 (default).
+# Changing the embedder requires a full reindex.
+# embedder = "minilm-l6-v2"
+
+# Cache directory override. Defaults to the platform cache location.
+#   macOS:   ~/Library/Caches/vex
+#   Linux:   $XDG_CACHE_HOME/vex   (fallback: ~/.cache/vex)
+#   Windows: %LOCALAPPDATA%\vex    (fallback: %USERPROFILE%\AppData\Local\vex)
+# Accepts absolute paths, "~/..." or paths relative to this file (e.g. "./.vex/cache").
+# Can also be overridden per-invocation with --cache-dir or $VEX_CACHE_DIR.
+# cache_dir = "./.vex/cache"
+
+# Store the index inside the project as `<project>/.vex_cache/`. Useful when
+# the cache should travel with the project (e.g. on a moved or renamed
+# directory). vex writes a `.gitignore` inside it so contents are not
+# committed. Overridden by `cache_dir`, `--cache-dir`, or $VEX_CACHE_DIR.
+# local_cache = false
+
+# Thread count for parallel indexing (index/update/watch).
+#   * unset  — 80% of available cores, rounded up (default, leaves headroom)
+#   * 0      — use all cores (explicit opt-in to max throughput)
+#   * N      — exactly N workers
+# Overridable per-invocation with `-j/--jobs` or $VEX_JOBS.
+# jobs = 4
+
+# Build the persistent call-graph section. Disabling falls back to live-scan
+# for `vex callers`/`vex callees` (slower per-query, but saves indexing
+# time on large monorepos). The opt-out is persisted in the manifest so
+# `vex update` does not silently re-add the section.
+# Per-invocation override: `vex index --no-call-graph`.
+# call_graph = true
+
+# Build the BM25 channel. Disabling drops the third RRF channel and keeps
+# only structural (+ semantic). Same persistence rules as `call_graph`.
+# Per-invocation override: `vex index --no-bm25`.
+# bm25 = true
@@ -12,3 +12,13 @@ Start/restart with: `./scripts/dev-server.sh`
 ## Frontend

 - **Boolean inputs use `ToggleSwitch`** (`$lib/components/ToggleSwitch.svelte`) — the slide-style switch is the unified control across the WebUI. Do not introduce raw `<input type="checkbox">` elements; place a `<ToggleSwitch>` next to a label/help block instead.
+- **Confirmations & destructive actions use `ConfirmDialog`** (`$lib/components/ConfirmDialog.svelte`) — never native `window.confirm` / `alert`. For navigation guards (e.g. the unsaved-changes prompt on `/apps/new`), `cancel()` the navigation in `beforeNavigate`, open `ConfirmDialog`, and re-issue the navigation with a bypass flag on confirm. Native `beforeunload` is acceptable only for hard tab-close/reload, where the browser forbids custom UI.
+- **Source-config shape: `$lib/workload/sourceForms.ts`** is the single source of truth (seed/serialize/validity for image/compose/static/dockerfile), consumed by both `/apps/new` and `/apps/[id]`. Don't re-inline seed/serialize logic.
+- **"App" = workload with `source_kind !== ''`.** Triggers are first-class bindings (`workload_trigger_bindings`), NOT on the workload row — never gate app lists/counts on `trigger_kind` (it's empty for plugin workloads). Legacy pre-cutover `kind:project/stack/site` rows have an empty `source_kind` and must be excluded everywhere.
+- **i18n parity is mandatory** — every key in BOTH `web/src/lib/i18n/{en,ru}.json`. A missing key is NOT a build error (`$t` returns the key string), so verify parity manually.
+
+## Build & Test
+
+- Frontend (from `web/`): `npm run check` (svelte-check — expect 0 errors), `npm run build`, `npm run test` (vitest; pure-logic units like `sourceForms.test.ts`).
+- Backend (repo root): `go build ./...`, `go vet ./internal/...`, `go test ./internal/...`.
+- `./scripts/dev-server.sh` rebuilds the SPA + restarts the Go server on :8090; it kills the prior process, so a previous background dev-server task reporting **exit 1 is expected**, not a failure.
@@ -11,6 +11,15 @@ Self-hosted deployment platform with a web dashboard. Deploy Docker containers f
 - **Multi-stage projects** (dev, staging, prod) with tag pattern matching
 - **Real-time deploy logs** via SSE streaming

+### Branch Preview Environments
+
+Get an isolated, throwaway deploy for every feature branch:
+
+- Add a **branch pattern** (e.g. `feat/*`) to a workload's **git trigger** (Triggers panel → git trigger → *Branch pattern*).
+- Pushing to any branch matching the pattern deploys an **isolated per-branch preview** — a child workload that inherits the source config, served at a **slug-prefixed subdomain** (`feat-login-app.example.com`) so previews never collide with each other or the main deploy.
+- Previews are **automatically torn down** when the branch is deleted upstream.
+- Manage live previews from the app's **Preview environments** panel (`/apps/[id]`): open each branch's URL or tear it down manually. A torn-down preview is recreated on the next push to its branch.
+
 ### Static Sites

 Deploy static sites and Deno-powered APIs directly from Git repositories:
@@ -106,6 +115,46 @@ curl -X POST https://your-domain/api/webhook/<secret> \
 3. Enter your provider's Issuer URL, Client ID, and Client Secret
 4. Set the Redirect URL to `https://your-domain/api/auth/oidc/callback`

+## CLI
+
+`tinyforge` is a terminal client for driving a server from the shell, built on the same HTTP API as the web UI.
+
+### Build
+
+```bash
+go build -o tinyforge ./cmd/cli      # ./tinyforge (tinyforge.exe on Windows)
+```
+
+### Usage
+
+```bash
+# Log in once — caches a 24h token in ~/.tinyforge/config.json (mode 0600)
+tinyforge login --base-url http://localhost:8090
+# ...or non-interactively (no password echo / shell-history leak):
+TINYFORGE_PASSWORD=… tinyforge login --base-url http://localhost:8090 --user admin
+
+tinyforge apps                              # list apps + container state
+tinyforge deploy my-app                     # deploy and wait for completion
+tinyforge deploy my-app --ref v1.2.3 --note "hotfix"
+tinyforge logs my-app -f                    # follow logs (Ctrl-C to stop)
+tinyforge status                            # server health + current user
+tinyforge status my-app                     # one app's containers
+tinyforge logout                            # revoke + clear the cached token
+```
+
+### Server & token resolution
+
+| Setting  | Flag         | Env               | Default                  |
+| -------- | ------------ | ----------------- | ------------------------ |
+| Base URL | `--base-url` | `TINYFORGE_URL`   | `http://localhost:8080`  |
+| Token    | `--token`    | `TINYFORGE_TOKEN` | cached by `login`        |
+| Config   | `--config`   | `TINYFORGE_CONFIG`| `~/.tinyforge/config.json` |
+
+### Notes
+
+- Login returns a **24h JWT** — there is no long-lived API token yet, so unattended use re-logs in when the token expires. `deploy` / `stop` / `start` require an **admin** account.
+- The token is sent as an `Authorization: Bearer` header (never placed in the URL) and the config file is written with `0600` permissions.
+
 ## Development

 ```bash
@@ -0,0 +1,149 @@
+package main
+
+import (
+	"context"
+	"flag"
+	"fmt"
+	"os"
+	"sort"
+	"strings"
+	"text/tabwriter"
+	"time"
+)
+
+func runApps(args []string) error {
+	// Accept an optional "list" subcommand: `tinyforge apps` == `tinyforge apps list`.
+	if len(args) > 0 && args[0] == "list" {
+		args = args[1:]
+	}
+	fs := flag.NewFlagSet("apps", flag.ExitOnError)
+	g := addGlobalFlags(fs)
+	fs.Usage = func() {
+		fmt.Fprint(os.Stderr, "Usage: tinyforge apps [list] [--base-url URL]\n\nList apps (workloads with a source) and their container state.\n")
+		fs.PrintDefaults()
+	}
+	if err := fs.Parse(args); err != nil {
+		return err
+	}
+
+	sess, err := newSession(g)
+	if err != nil {
+		return err
+	}
+	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
+	defer cancel()
+
+	var workloads []Workload
+	if err := sess.client.doJSON(ctx, "GET", "/api/workloads", nil, &workloads); err != nil {
+		return err
+	}
+
+	// One extra call fetches every container so state can be shown without an
+	// N+1 per-app request.
+	var containers []Container
+	if err := sess.client.doJSON(ctx, "GET", "/api/containers", nil, &containers); err != nil {
+		return err
+	}
+	byWorkload := map[string][]Container{}
+	for _, c := range containers {
+		byWorkload[c.WorkloadID] = append(byWorkload[c.WorkloadID], c)
+	}
+
+	apps := make([]Workload, 0, len(workloads))
+	for _, w := range workloads {
+		if w.isApp() {
+			apps = append(apps, w)
+		}
+	}
+	sort.Slice(apps, func(i, j int) bool { return apps[i].Name < apps[j].Name })
+
+	if len(apps) == 0 {
+		fmt.Println("No apps yet. Create one in the web UI, then deploy with 'tinyforge deploy <app>'.")
+		return nil
+	}
+
+	tw := tabwriter.NewWriter(os.Stdout, 0, 2, 2, ' ', 0)
+	fmt.Fprintln(tw, "NAME\tSOURCE\tSTATE\tID")
+	for _, w := range apps {
+		fmt.Fprintf(tw, "%s\t%s\t%s\t%s\n", w.Name, w.SourceKind, stateSummary(byWorkload[w.ID]), idShort(w.ID))
+	}
+	return tw.Flush()
+}
+
+// stateSummary condenses a workload's containers into one status word.
+func stateSummary(cs []Container) string {
+	if len(cs) == 0 {
+		return "—"
+	}
+	running := 0
+	for _, c := range cs {
+		if c.State == "running" {
+			running++
+		}
+	}
+	switch {
+	case running == len(cs):
+		return "running"
+	case running == 0:
+		return cs[0].State // e.g. stopped / failed / missing
+	default:
+		return fmt.Sprintf("%d/%d running", running, len(cs))
+	}
+}
+
+// resolveApp maps a user-supplied reference (name, full id, or id prefix) to a
+// single app workload. Exact id wins, then exact name, then a unique id prefix.
+func resolveApp(ctx context.Context, c *Client, ref string) (Workload, error) {
+	var workloads []Workload
+	if err := c.doJSON(ctx, "GET", "/api/workloads", nil, &workloads); err != nil {
+		return Workload{}, err
+	}
+
+	var byID, byName, byPrefix []Workload
+	for _, w := range workloads {
+		if !w.isApp() {
+			continue
+		}
+		switch {
+		case w.ID == ref:
+			byID = append(byID, w)
+		case strings.EqualFold(w.Name, ref):
+			byName = append(byName, w)
+		case len(ref) >= 6 && strings.HasPrefix(w.ID, ref):
+			byPrefix = append(byPrefix, w)
+		}
+	}
+
+	if len(byID) == 1 {
+		return byID[0], nil
+	}
+	if len(byName) == 1 {
+		return byName[0], nil
+	}
+	if len(byName) > 1 {
+		return Workload{}, ambiguousErr(ref, byName)
+	}
+	if len(byPrefix) == 1 {
+		return byPrefix[0], nil
+	}
+	if len(byPrefix) > 1 {
+		return Workload{}, ambiguousErr(ref, byPrefix)
+	}
+	return Workload{}, fmt.Errorf("no app matching %q (try 'tinyforge apps list')", ref)
+}
+
+func ambiguousErr(ref string, matches []Workload) error {
+	var b strings.Builder
+	fmt.Fprintf(&b, "%q matches multiple apps; use the id:\n", ref)
+	for _, w := range matches {
+		fmt.Fprintf(&b, "  %s  %s\n", idShort(w.ID), w.Name)
+	}
+	return fmt.Errorf("%s", strings.TrimRight(b.String(), "\n"))
+}
+
+func idShort(id string) string {
+	if len(id) > 8 {
+		return id[:8]
+	}
+	return id
+}
@@ -0,0 +1,232 @@
+package main
+
+import (
+	"bufio"
+	"bytes"
+	"context"
+	"encoding/json"
+	"errors"
+	"fmt"
+	"io"
+	"net/http"
+	"strings"
+)
+
+// apiError carries the server's error message plus the HTTP status, so callers
+// can distinguish auth failures (401) from other errors without losing the
+// server's message (e.g. "invalid credentials" vs "invalid or expired token").
+type apiError struct {
+	status int
+	msg    string
+}
+
+func (e *apiError) Error() string { return e.msg }
+
+// isAuthError reports whether err is a 401 from the API.
+func isAuthError(err error) bool {
+	var ae *apiError
+	return errors.As(err, &ae) && ae.status == http.StatusUnauthorized
+}
+
+// Client talks to the Tinyforge HTTP API. It has no global timeout so that
+// long synchronous deploys and follow streams work; callers pass a context
+// with the appropriate deadline.
+type Client struct {
+	baseURL string
+	token   string
+	http    *http.Client
+}
+
+func newClient(baseURL, token string) *Client {
+	return &Client{
+		baseURL: strings.TrimRight(baseURL, "/"),
+		token:   token,
+		http:    &http.Client{},
+	}
+}
+
+// apiEnvelope mirrors the server's response wrapper. The server's struct is
+// unexported, so the CLI defines its own matching shape. Data is deferred so a
+// single decode path serves every endpoint.
+type apiEnvelope struct {
+	Success bool            `json:"success"`
+	Data    json.RawMessage `json:"data"`
+	Error   string          `json:"error"`
+}
+
+// SessionToken is the data payload of POST /api/auth/login.
+type SessionToken struct {
+	Token     string `json:"token"`
+	ExpiresAt string `json:"expires_at"`
+}
+
+// User is the data payload of GET /api/auth/me.
+type User struct {
+	ID       string `json:"id"`
+	Username string `json:"username"`
+	Email    string `json:"email"`
+	Role     string `json:"role"`
+}
+
+// Workload is the subset of the workload row the CLI needs. An "app" is a
+// workload with a non-empty SourceKind.
+type Workload struct {
+	ID         string `json:"id"`
+	Name       string `json:"name"`
+	Kind       string `json:"kind"`
+	AppID      string `json:"app_id"`
+	SourceKind string `json:"source_kind"`
+	CreatedAt  string `json:"created_at"`
+}
+
+func (w Workload) isApp() bool { return w.SourceKind != "" }
+
+// Container is the subset of a container row the CLI needs. State is one of
+// running|stopped|failed|missing|starting|created|restarting|paused|...
+type Container struct {
+	ID          string `json:"id"`
+	WorkloadID  string `json:"workload_id"`
+	Role        string `json:"role"`
+	ContainerID string `json:"container_id"`
+	ImageRef    string `json:"image_ref"`
+	State       string `json:"state"`
+	Port        int    `json:"port"`
+	Subdomain   string `json:"subdomain"`
+	CreatedAt   string `json:"created_at"`
+}
+
+// DeployResult is the data payload of POST /api/workloads/{id}/deploy.
+type DeployResult struct {
+	WorkloadID  string `json:"workload_id"`
+	Reference   string `json:"reference"`
+	TriggeredBy string `json:"triggered_by"`
+}
+
+// doJSON performs a JSON request and unwraps the response envelope. body may be
+// nil. out may be nil when the caller does not need the data payload. A 401
+// maps to errNotAuthenticated; any other non-success surfaces the server's
+// error message.
+func (c *Client) doJSON(ctx context.Context, method, path string, body, out any) error {
+	var reqBody io.Reader
+	if body != nil {
+		buf, err := json.Marshal(body)
+		if err != nil {
+			return fmt.Errorf("encode request: %w", err)
+		}
+		reqBody = bytes.NewReader(buf)
+	}
+
+	req, err := http.NewRequestWithContext(ctx, method, c.baseURL+path, reqBody)
+	if err != nil {
+		return fmt.Errorf("build request: %w", err)
+	}
+	if body != nil {
+		req.Header.Set("Content-Type", "application/json")
+	}
+	c.authorize(req)
+
+	resp, err := c.http.Do(req)
+	if err != nil {
+		return fmt.Errorf("%s %s: %w", method, path, err)
+	}
+	defer resp.Body.Close()
+
+	raw, err := io.ReadAll(io.LimitReader(resp.Body, 8<<20))
+	if err != nil {
+		return fmt.Errorf("read response: %w", err)
+	}
+
+	var env apiEnvelope
+	if jsonErr := json.Unmarshal(raw, &env); jsonErr != nil {
+		// Non-JSON body (e.g. a proxy error page). Surface status + a snippet,
+		// preserving auth-error typing for 401s with an unparseable body.
+		if resp.StatusCode >= 400 {
+			return &apiError{status: resp.StatusCode, msg: fmt.Sprintf(
+				"%s %s: unexpected response (status %d): %s", method, path, resp.StatusCode, snippet(raw))}
+		}
+		return fmt.Errorf("%s %s: decode response: %w", method, path, jsonErr)
+	}
+	if resp.StatusCode >= 400 || !env.Success {
+		msg := env.Error
+		if msg == "" {
+			msg = fmt.Sprintf("%s %s: request failed (status %d)", method, path, resp.StatusCode)
+		}
+		return &apiError{status: resp.StatusCode, msg: msg}
+	}
+	if out != nil && len(env.Data) > 0 {
+		if err := json.Unmarshal(env.Data, out); err != nil {
+			return fmt.Errorf("decode response data: %w", err)
+		}
+	}
+	return nil
+}
+
+// authorize attaches the bearer token. Using the Authorization header (rather
+// than a ?token= query param) keeps the JWT out of server and proxy logs.
+func (c *Client) authorize(req *http.Request) {
+	if c.token != "" {
+		req.Header.Set("Authorization", "Bearer "+c.token)
+	}
+}
+
+// streamSSE opens an SSE stream and invokes onData for each `data:` payload.
+// Comment lines (heartbeats, beginning with ':') and blanks are skipped. The
+// stream ends on EOF, context cancellation, or when onData returns an error.
+func (c *Client) streamSSE(ctx context.Context, path string, onData func(payload []byte) error) error {
+	req, err := http.NewRequestWithContext(ctx, http.MethodGet, c.baseURL+path, nil)
+	if err != nil {
+		return fmt.Errorf("build request: %w", err)
+	}
+	req.Header.Set("Accept", "text/event-stream")
+	c.authorize(req)
+
+	resp, err := c.http.Do(req)
+	if err != nil {
+		return fmt.Errorf("GET %s: %w", path, err)
+	}
+	defer resp.Body.Close()
+
+	if resp.StatusCode != http.StatusOK {
+		raw, _ := io.ReadAll(io.LimitReader(resp.Body, 1<<20))
+		var env apiEnvelope
+		msg := fmt.Sprintf("GET %s: stream failed (status %d)", path, resp.StatusCode)
+		if json.Unmarshal(raw, &env) == nil && env.Error != "" {
+			msg = env.Error
+		}
+		return &apiError{status: resp.StatusCode, msg: msg}
+	}
+
+	scanner := bufio.NewScanner(resp.Body)
+	scanner.Buffer(make([]byte, 0, 64<<10), 2<<20) // tolerate long log lines
+	for scanner.Scan() {
+		line := scanner.Text()
+		if line == "" || strings.HasPrefix(line, ":") {
+			continue // blank separator or SSE comment/heartbeat
+		}
+		data, ok := strings.CutPrefix(line, "data:")
+		if !ok {
+			continue // ignore event:/id: fields — the API uses default events
+		}
+		if err := onData([]byte(strings.TrimPrefix(data, " "))); err != nil {
+			return err
+		}
+	}
+	if err := scanner.Err(); err != nil && !errors.Is(err, context.Canceled) {
+		return fmt.Errorf("read stream: %w", err)
+	}
+	return nil
+}
+
+// snippet returns a short, single-line view of an unexpected response body.
+func snippet(b []byte) string {
+	const max = 200
+	s := strings.TrimSpace(string(b))
+	s = strings.ReplaceAll(s, "\n", " ")
+	if len(s) > max {
+		s = s[:max] + "…"
+	}
+	if s == "" {
+		s = "(empty body)"
+	}
+	return s
+}
@@ -0,0 +1,148 @@
+package main
+
+import (
+	"bytes"
+	"encoding/json"
+	"flag"
+	"fmt"
+	"os"
+	"path/filepath"
+)
+
+// defaultBaseURL matches the server's default LISTEN_ADDR (:8080). The dev
+// server runs on :8090; point at it with --base-url or $TINYFORGE_URL.
+const defaultBaseURL = "http://localhost:8080"
+
+// Config is the persisted CLI state at ~/.tinyforge/config.json.
+type Config struct {
+	BaseURL   string `json:"base_url"`
+	Token     string `json:"token"`
+	ExpiresAt string `json:"expires_at"`
+}
+
+// globals holds the cross-cutting flags every command accepts.
+type globals struct {
+	baseURL    *string
+	token      *string
+	configPath *string
+}
+
+// addGlobalFlags registers the shared flags on a command's flag set.
+func addGlobalFlags(fs *flag.FlagSet) *globals {
+	return &globals{
+		baseURL:    fs.String("base-url", "", "Tinyforge server URL (default $TINYFORGE_URL or "+defaultBaseURL+")"),
+		token:      fs.String("token", "", "auth token (default $TINYFORGE_TOKEN or cached config)"),
+		configPath: fs.String("config", "", "config file path (default $TINYFORGE_CONFIG or ~/.tinyforge/config.json)"),
+	}
+}
+
+// configFilePath resolves the config file location with precedence:
+// --config flag > $TINYFORGE_CONFIG > ~/.tinyforge/config.json.
+func configFilePath(g *globals) (string, error) {
+	if g != nil && *g.configPath != "" {
+		return *g.configPath, nil
+	}
+	if env := os.Getenv("TINYFORGE_CONFIG"); env != "" {
+		return env, nil
+	}
+	home, err := os.UserHomeDir()
+	if err != nil {
+		return "", fmt.Errorf("locate home directory: %w", err)
+	}
+	return filepath.Join(home, ".tinyforge", "config.json"), nil
+}
+
+// loadConfig reads the config file. A missing file yields a zero Config and no
+// error — first run is not a failure.
+func loadConfig(path string) (Config, error) {
+	var cfg Config
+	data, err := os.ReadFile(path)
+	if err != nil {
+		if os.IsNotExist(err) {
+			return cfg, nil
+		}
+		return cfg, fmt.Errorf("read config %s: %w", path, err)
+	}
+	// An empty or whitespace-only file (e.g. freshly touched) is treated as
+	// "no config yet" rather than a parse error.
+	if len(bytes.TrimSpace(data)) == 0 {
+		return cfg, nil
+	}
+	if err := json.Unmarshal(data, &cfg); err != nil {
+		return cfg, fmt.Errorf("parse config %s: %w", path, err)
+	}
+	return cfg, nil
+}
+
+// saveConfig writes the config file with 0600 permissions, since it holds a
+// bearer token. The parent directory is created if absent.
+func saveConfig(path string, cfg Config) error {
+	if dir := filepath.Dir(path); dir != "" {
+		if err := os.MkdirAll(dir, 0o700); err != nil {
+			return fmt.Errorf("create config dir: %w", err)
+		}
+	}
+	data, err := json.MarshalIndent(cfg, "", "  ")
+	if err != nil {
+		return fmt.Errorf("encode config: %w", err)
+	}
+	if err := os.WriteFile(path, append(data, '\n'), 0o600); err != nil {
+		return fmt.Errorf("write config %s: %w", path, err)
+	}
+	// os.WriteFile only applies the mode when creating the file; Chmod ensures
+	// 0600 even when overwriting a pre-existing, looser-permissioned config.
+	if err := os.Chmod(path, 0o600); err != nil {
+		return fmt.Errorf("secure config %s: %w", path, err)
+	}
+	return nil
+}
+
+// resolveBaseURL applies precedence: --base-url > $TINYFORGE_URL > config > default.
+func resolveBaseURL(g *globals, cfg Config) string {
+	if g != nil && *g.baseURL != "" {
+		return *g.baseURL
+	}
+	if env := os.Getenv("TINYFORGE_URL"); env != "" {
+		return env
+	}
+	if cfg.BaseURL != "" {
+		return cfg.BaseURL
+	}
+	return defaultBaseURL
+}
+
+// resolveToken applies precedence: --token > $TINYFORGE_TOKEN > config.
+func resolveToken(g *globals, cfg Config) string {
+	if g != nil && *g.token != "" {
+		return *g.token
+	}
+	if env := os.Getenv("TINYFORGE_TOKEN"); env != "" {
+		return env
+	}
+	return cfg.Token
+}
+
+// session bundles the resolved client with the loaded config and its path, so
+// commands can both make requests and persist updates (e.g. login).
+type session struct {
+	client     *Client
+	cfg        Config
+	configPath string
+}
+
+// newSession loads config and builds a client with resolved base URL + token.
+func newSession(g *globals) (*session, error) {
+	path, err := configFilePath(g)
+	if err != nil {
+		return nil, err
+	}
+	cfg, err := loadConfig(path)
+	if err != nil {
+		return nil, err
+	}
+	return &session{
+		client:     newClient(resolveBaseURL(g, cfg), resolveToken(g, cfg)),
+		cfg:        cfg,
+		configPath: path,
+	}, nil
+}
@@ -0,0 +1,73 @@
+package main
+
+import (
+	"context"
+	"flag"
+	"fmt"
+	"os"
+	"time"
+)
+
+func runDeploy(args []string) error {
+	fs := flag.NewFlagSet("deploy", flag.ExitOnError)
+	g := addGlobalFlags(fs)
+	ref := fs.String("ref", "", "image tag / git ref / source-specific deploy target")
+	note := fs.String("note", "", "free-text note recorded with the deploy")
+	timeout := fs.Duration("timeout", 15*time.Minute, "max time to wait for the deploy to finish")
+	fs.Usage = func() {
+		fmt.Fprint(os.Stderr, "Usage: tinyforge deploy <app> [--ref TAG] [--note TEXT] [--timeout DUR]\n\n"+
+			"Trigger a deploy and wait for it to finish. Requires an admin token.\n")
+		fs.PrintDefaults()
+	}
+	if err := fs.Parse(args); err != nil {
+		return err
+	}
+	if fs.NArg() != 1 {
+		fs.Usage()
+		return fmt.Errorf("expected exactly one app (name or id)")
+	}
+
+	sess, err := newSession(g)
+	if err != nil {
+		return err
+	}
+
+	// Resolve the app on a short deadline; the deploy itself gets the full one.
+	resolveCtx, cancelResolve := context.WithTimeout(context.Background(), 30*time.Second)
+	defer cancelResolve()
+	app, err := resolveApp(resolveCtx, sess.client, fs.Arg(0))
+	if err != nil {
+		return err
+	}
+
+	body := map[string]string{}
+	if *ref != "" {
+		body["reference"] = *ref
+	}
+	if *note != "" {
+		body["note"] = *note
+	}
+
+	fmt.Printf("Deploying %s%s…\n", app.Name, refSuffix(*ref))
+
+	// The endpoint returns 202 but blocks until the deploy completes, so a
+	// success here means it finished; allow plenty of time for pull/build.
+	ctx, cancel := context.WithTimeout(context.Background(), *timeout)
+	defer cancel()
+
+	var result DeployResult
+	if err := sess.client.doJSON(ctx, "POST", "/api/workloads/"+app.ID+"/deploy", body, &result); err != nil {
+		return err
+	}
+
+	fmt.Printf("Deploy of %s completed (triggered by %s).\n", app.Name, result.TriggeredBy)
+	fmt.Printf("Follow with: tinyforge logs %s -f\n", app.Name)
+	return nil
+}
+
+func refSuffix(ref string) string {
+	if ref == "" {
+		return ""
+	}
+	return fmt.Sprintf(" @ %s", ref)
+}
@@ -0,0 +1,136 @@
+package main
+
+import (
+	"bufio"
+	"context"
+	"flag"
+	"fmt"
+	"os"
+	"strings"
+	"time"
+)
+
+func runLogin(args []string) error {
+	fs := flag.NewFlagSet("login", flag.ExitOnError)
+	g := addGlobalFlags(fs)
+	user := fs.String("user", "", "username (prompted if omitted)")
+	pass := fs.String("password", "", "password (insecure; prefer $TINYFORGE_PASSWORD or the prompt)")
+	fs.Usage = func() {
+		fmt.Fprint(os.Stderr, "Usage: tinyforge login [--user U] [--password P] [--base-url URL]\n\n"+
+			"Authenticate against the server and cache the token.\n")
+		fs.PrintDefaults()
+	}
+	if err := fs.Parse(args); err != nil {
+		return err
+	}
+
+	sess, err := newSession(g)
+	if err != nil {
+		return err
+	}
+
+	username := *user
+	if username == "" {
+		username, err = promptLine("Username: ")
+		if err != nil {
+			return err
+		}
+	}
+
+	password := *pass
+	if password == "" {
+		password = os.Getenv("TINYFORGE_PASSWORD")
+	}
+	if password == "" {
+		password, err = promptPassword("Password: ")
+		if err != nil {
+			return err
+		}
+	}
+	if username == "" || password == "" {
+		return fmt.Errorf("username and password are required")
+	}
+
+	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
+	defer cancel()
+
+	var tok SessionToken
+	body := map[string]string{"username": username, "password": password}
+	if err := sess.client.doJSON(ctx, "POST", "/api/auth/login", body, &tok); err != nil {
+		return err
+	}
+
+	// Persist the resolved base URL alongside the token so later commands need
+	// no flags. The token file is written 0600 by saveConfig.
+	sess.cfg.BaseURL = sess.client.baseURL
+	sess.cfg.Token = tok.Token
+	sess.cfg.ExpiresAt = tok.ExpiresAt
+	if err := saveConfig(sess.configPath, sess.cfg); err != nil {
+		return err
+	}
+
+	fmt.Printf("Logged in to %s as %s.\n", sess.client.baseURL, username)
+	if exp := friendlyExpiry(tok.ExpiresAt); exp != "" {
+		fmt.Printf("Token valid until %s.\n", exp)
+	}
+	return nil
+}
+
+func runLogout(args []string) error {
+	fs := flag.NewFlagSet("logout", flag.ExitOnError)
+	g := addGlobalFlags(fs)
+	if err := fs.Parse(args); err != nil {
+		return err
+	}
+
+	sess, err := newSession(g)
+	if err != nil {
+		return err
+	}
+	if sess.client.token == "" {
+		fmt.Println("Not logged in.")
+		return nil
+	}
+
+	ctx, cancel := context.WithTimeout(context.Background(), 15*time.Second)
+	defer cancel()
+
+	// Best-effort server-side revocation; clear the local token regardless.
+	revokeErr := sess.client.doJSON(ctx, "POST", "/api/auth/logout", nil, nil)
+
+	sess.cfg.Token = ""
+	sess.cfg.ExpiresAt = ""
+	if err := saveConfig(sess.configPath, sess.cfg); err != nil {
+		return err
+	}
+
+	if revokeErr != nil {
+		fmt.Printf("Cleared local token (server revocation skipped: %v).\n", revokeErr)
+		return nil
+	}
+	fmt.Println("Logged out.")
+	return nil
+}
+
+// promptLine reads a single trimmed line from stdin.
+func promptLine(label string) (string, error) {
+	fmt.Fprint(os.Stderr, label)
+	r := bufio.NewReader(os.Stdin)
+	line, err := r.ReadString('\n')
+	if err != nil && line == "" {
+		return "", fmt.Errorf("read input: %w", err)
+	}
+	return strings.TrimSpace(line), nil
+}
+
+// friendlyExpiry formats an RFC3339 expiry as a local time, best-effort.
+func friendlyExpiry(s string) string {
+	if s == "" {
+		return ""
+	}
+	t, err := time.Parse(time.RFC3339, s)
+	if err != nil {
+		return s
+	}
+	return t.Local().Format("2006-01-02 15:04 MST")
+}
@@ -0,0 +1,143 @@
+package main
+
+import (
+	"context"
+	"encoding/json"
+	"flag"
+	"fmt"
+	"net/url"
+	"os"
+	"os/signal"
+	"strings"
+	"time"
+)
+
+func runLogs(args []string) error {
+	fs := flag.NewFlagSet("logs", flag.ExitOnError)
+	g := addGlobalFlags(fs)
+	follow := fs.Bool("f", false, "follow the log stream (Ctrl-C to stop)")
+	tail := fs.Int("tail", 200, "number of trailing lines to show (max 5000)")
+	container := fs.String("container", "", "container row id/prefix or role (when an app has several)")
+	fs.Usage = func() {
+		fmt.Fprint(os.Stderr, "Usage: tinyforge logs <app> [-f] [--tail N] [--container CID]\n\nPrint or follow a container's logs.\n")
+		fs.PrintDefaults()
+	}
+	if err := fs.Parse(args); err != nil {
+		return err
+	}
+	if fs.NArg() != 1 {
+		fs.Usage()
+		return fmt.Errorf("expected exactly one app (name or id)")
+	}
+
+	sess, err := newSession(g)
+	if err != nil {
+		return err
+	}
+
+	resolveCtx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
+	defer cancel()
+
+	app, err := resolveApp(resolveCtx, sess.client, fs.Arg(0))
+	if err != nil {
+		return err
+	}
+
+	var containers []Container
+	if err := sess.client.doJSON(resolveCtx, "GET", "/api/workloads/"+app.ID+"/containers", nil, &containers); err != nil {
+		return err
+	}
+	target, err := chooseContainer(containers, *container)
+	if err != nil {
+		return err
+	}
+
+	q := url.Values{}
+	q.Set("tail", fmt.Sprintf("%d", *tail))
+	base := "/api/workloads/" + app.ID + "/containers/" + target.ID + "/logs"
+
+	if !*follow {
+		var lines []string
+		if err := sess.client.doJSON(resolveCtx, "GET", base+"?"+q.Encode(), nil, &lines); err != nil {
+			return err
+		}
+		for _, line := range lines {
+			fmt.Println(line)
+		}
+		return nil
+	}
+
+	// Follow: stream until EOF or Ctrl-C.
+	q.Set("follow", "true")
+	ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt)
+	defer stop()
+
+	err = sess.client.streamSSE(ctx, base+"?"+q.Encode(), func(payload []byte) error {
+		var frame struct {
+			Line string `json:"line"`
+		}
+		if json.Unmarshal(payload, &frame) != nil {
+			return nil // ignore frames we can't parse
+		}
+		fmt.Println(frame.Line)
+		return nil
+	})
+	if ctx.Err() != nil { // user interrupted — clean exit
+		return nil
+	}
+	return err
+}
+
+// chooseContainer selects which container to read. With an explicit selector,
+// it matches the row id (exact or prefix) or the role. Otherwise it uses the
+// sole container, or the sole running one, and errors with a list when the
+// choice is ambiguous.
+func chooseContainer(cs []Container, selector string) (Container, error) {
+	if len(cs) == 0 {
+		return Container{}, fmt.Errorf("app has no containers yet — deploy it first")
+	}
+
+	if selector != "" {
+		var matches []Container
+		for _, c := range cs {
+			if c.ID == selector || strings.EqualFold(c.Role, selector) ||
+				(len(selector) >= 6 && strings.HasPrefix(c.ID, selector)) {
+				matches = append(matches, c)
+			}
+		}
+		switch len(matches) {
+		case 1:
+			return matches[0], nil
+		case 0:
+			return Container{}, fmt.Errorf("no container matching %q\n%s", selector, containerList(cs))
+		default:
+			return Container{}, fmt.Errorf("%q matches multiple containers\n%s", selector, containerList(cs))
+		}
+	}
+
+	if len(cs) == 1 {
+		return cs[0], nil
+	}
+	var running []Container
+	for _, c := range cs {
+		if c.State == "running" {
+			running = append(running, c)
+		}
+	}
+	if len(running) == 1 {
+		return running[0], nil
+	}
+	return Container{}, fmt.Errorf("app has %d containers; pick one with --container:\n%s", len(cs), containerList(cs))
+}
+
+func containerList(cs []Container) string {
+	var b strings.Builder
+	for _, c := range cs {
+		role := c.Role
+		if role == "" {
+			role = "(default)"
+		}
+		fmt.Fprintf(&b, "  %s  %-12s %s\n", idShort(c.ID), role, c.State)
+	}
+	return strings.TrimRight(b.String(), "\n")
+}
@@ -0,0 +1,95 @@
+// Command tinyforge is a terminal client for a Tinyforge server.
+//
+// It drives the existing HTTP API: log in to obtain a 24h JWT, then list
+// apps, trigger deploys, stream logs, and check status. The token is cached
+// in ~/.tinyforge/config.json (mode 0600) so subsequent commands reuse it.
+//
+// Usage:
+//
+//	tinyforge login [--user U] [--password P]
+//	tinyforge apps [list]
+//	tinyforge deploy <app> [--ref TAG] [--note TEXT]
+//	tinyforge logs <app> [-f] [--tail N] [--container CID]
+//	tinyforge status [<app>]
+//	tinyforge logout
+//	tinyforge version
+//
+// The target server is resolved from --base-url, then $TINYFORGE_URL, then the
+// saved config, then http://localhost:8080.
+package main
+
+import (
+	"fmt"
+	"os"
+)
+
+// version is the CLI build version. Overridable at build time via
+// -ldflags "-X main.version=...".
+var version = "dev"
+
+func main() {
+	if len(os.Args) < 2 {
+		usage(os.Stderr)
+		os.Exit(2)
+	}
+
+	cmd, args := os.Args[1], os.Args[2:]
+
+	var err error
+	switch cmd {
+	case "login":
+		err = runLogin(args)
+	case "logout":
+		err = runLogout(args)
+	case "apps":
+		err = runApps(args)
+	case "deploy":
+		err = runDeploy(args)
+	case "logs":
+		err = runLogs(args)
+	case "status":
+		err = runStatus(args)
+	case "version", "--version", "-v":
+		fmt.Printf("tinyforge %s\n", version)
+	case "help", "-h", "--help":
+		usage(os.Stdout)
+	default:
+		fmt.Fprintf(os.Stderr, "tinyforge: unknown command %q\n\n", cmd)
+		usage(os.Stderr)
+		os.Exit(2)
+	}
+
+	if err != nil {
+		// Authenticated commands that hit a 401 get a re-login hint; the login
+		// command itself surfaces the server message ("invalid credentials").
+		if cmd != "login" && isAuthError(err) {
+			err = fmt.Errorf("%w — run 'tinyforge login'", err)
+		}
+		fmt.Fprintf(os.Stderr, "tinyforge: %v\n", err)
+		os.Exit(1)
+	}
+}
+
+func usage(w *os.File) {
+	fmt.Fprint(w, `tinyforge — terminal client for a Tinyforge server
+
+Usage:
+  tinyforge <command> [flags]
+
+Commands:
+  login              Authenticate and cache a token
+  logout             Revoke the cached token and clear it
+  apps [list]        List your apps (workloads with a source)
+  deploy <app>       Trigger a deploy (waits for completion)
+  logs <app>         Print container logs (use -f to follow)
+  status [<app>]     Show server health, or one app's containers
+  version            Print the CLI version
+
+Global flags (accepted by any command):
+  --base-url URL     Server URL (default $TINYFORGE_URL or http://localhost:8080)
+  --token TOKEN      Auth token (default $TINYFORGE_TOKEN or cached config)
+  --config PATH      Config file (default $TINYFORGE_CONFIG or ~/.tinyforge/config.json)
+
+Run "tinyforge <command> -h" for command-specific flags.
+`)
+}
@@ -0,0 +1,38 @@
+//go:build !windows
+
+package main
+
+import (
+	"bufio"
+	"fmt"
+	"os"
+	"os/exec"
+	"strings"
+)
+
+// promptPassword reads a password from stdin with echo disabled via stty. If
+// stty is unavailable (no tty, missing binary), it falls back to an echoed
+// read so the command still works in pipes/CI.
+func promptPassword(label string) (string, error) {
+	fmt.Fprint(os.Stderr, label)
+
+	echoDisabled := stty("-echo") == nil
+	if echoDisabled {
+		defer func() {
+			_ = stty("echo")
+			fmt.Fprintln(os.Stderr) // the Enter keystroke was not echoed
+		}()
+	}
+
+	line, err := bufio.NewReader(os.Stdin).ReadString('\n')
+	if err != nil && line == "" {
+		return "", fmt.Errorf("read password: %w", err)
+	}
+	return strings.TrimRight(line, "\r\n"), nil
+}
+
+func stty(arg string) error {
+	cmd := exec.Command("stty", arg)
+	cmd.Stdin = os.Stdin
+	return cmd.Run()
+}
@@ -0,0 +1,45 @@
+//go:build windows
+
+package main
+
+import (
+	"bufio"
+	"fmt"
+	"os"
+	"strings"
+	"syscall"
+	"unsafe"
+)
+
+// enableEchoInput is the Windows console mode bit that echoes typed input.
+const enableEchoInput = 0x0004
+
+// promptPassword reads a password from the console with echo disabled, using
+// kernel32 directly so no third-party dependency is needed. If the console
+// mode cannot be changed (e.g. piped stdin), it falls back to an echoed read.
+func promptPassword(label string) (string, error) {
+	fmt.Fprint(os.Stderr, label)
+
+	kernel32 := syscall.NewLazyDLL("kernel32.dll")
+	getConsoleMode := kernel32.NewProc("GetConsoleMode")
+	setConsoleMode := kernel32.NewProc("SetConsoleMode")
+	handle := syscall.Handle(os.Stdin.Fd())
+
+	var mode uint32
+	echoDisabled := false
+	if r, _, _ := getConsoleMode.Call(uintptr(handle), uintptr(unsafe.Pointer(&mode))); r != 0 {
+		if ret, _, _ := setConsoleMode.Call(uintptr(handle), uintptr(mode&^enableEchoInput)); ret != 0 {
+			echoDisabled = true
+			defer setConsoleMode.Call(uintptr(handle), uintptr(mode))
+		}
+	}
+
+	line, err := bufio.NewReader(os.Stdin).ReadString('\n')
+	if echoDisabled {
+		fmt.Fprintln(os.Stderr) // the Enter keystroke was not echoed
+	}
+	if err != nil && line == "" {
+		return "", fmt.Errorf("read password: %w", err)
+	}
+	return strings.TrimRight(line, "\r\n"), nil
+}
@@ -0,0 +1,122 @@
+package main
+
+import (
+	"context"
+	"flag"
+	"fmt"
+	"os"
+	"text/tabwriter"
+	"time"
+)
+
+func runStatus(args []string) error {
+	fs := flag.NewFlagSet("status", flag.ExitOnError)
+	g := addGlobalFlags(fs)
+	fs.Usage = func() {
+		fmt.Fprint(os.Stderr, "Usage: tinyforge status [<app>]\n\nWith no app: server health and the logged-in user.\nWith an app: that app's containers.\n")
+		fs.PrintDefaults()
+	}
+	if err := fs.Parse(args); err != nil {
+		return err
+	}
+
+	sess, err := newSession(g)
+	if err != nil {
+		return err
+	}
+	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
+	defer cancel()
+
+	if fs.NArg() == 0 {
+		return serverStatus(ctx, sess)
+	}
+	return appStatus(ctx, sess.client, fs.Arg(0))
+}
+
+func serverStatus(ctx context.Context, sess *session) error {
+	fmt.Printf("Server:  %s\n", sess.client.baseURL)
+
+	var me User
+	if err := sess.client.doJSON(ctx, "GET", "/api/auth/me", nil, &me); err != nil {
+		fmt.Printf("User:    not logged in (%v)\n", err)
+	} else {
+		fmt.Printf("User:    %s (%s)\n", me.Username, me.Role)
+	}
+	if exp := friendlyExpiry(sess.cfg.ExpiresAt); exp != "" {
+		fmt.Printf("Token:   valid until %s\n", exp)
+	}
+
+	var health map[string]any
+	if err := sess.client.doJSON(ctx, "GET", "/api/health", nil, &health); err != nil {
+		return err
+	}
+	fmt.Printf("DB:      %s\n", connState(health, "database"))
+	docker := connState(health, "docker")
+	if v := nestedString(health, "docker", "version"); v != "" {
+		docker += " (v" + v + ")"
+	}
+	fmt.Printf("Docker:  %s\n", docker)
+	if _, ok := health["proxy"]; ok {
+		fmt.Printf("Proxy:   %s\n", connState(health, "proxy"))
+	}
+	return nil
+}
+
+func appStatus(ctx context.Context, c *Client, ref string) error {
+	app, err := resolveApp(ctx, c, ref)
+	if err != nil {
+		return err
+	}
+	var containers []Container
+	if err := c.doJSON(ctx, "GET", "/api/workloads/"+app.ID+"/containers", nil, &containers); err != nil {
+		return err
+	}
+
+	fmt.Printf("%s  (%s, %s)\n", app.Name, app.SourceKind, idShort(app.ID))
+	if len(containers) == 0 {
+		fmt.Println("No containers — not deployed yet.")
+		return nil
+	}
+
+	tw := tabwriter.NewWriter(os.Stdout, 0, 2, 2, ' ', 0)
+	fmt.Fprintln(tw, "ROLE\tSTATE\tIMAGE\tPORT\tSUBDOMAIN\tCONTAINER")
+	for _, c := range containers {
+		role := c.Role
+		if role == "" {
+			role = "(default)"
+		}
+		port := ""
+		if c.Port != 0 {
+			port = fmt.Sprintf("%d", c.Port)
+		}
+		fmt.Fprintf(tw, "%s\t%s\t%s\t%s\t%s\t%s\n",
+			role, c.State, c.ImageRef, port, c.Subdomain, idShort(c.ID))
+	}
+	return tw.Flush()
+}
+
+// connState reads health[section].connected and renders connected/disconnected,
+// appending the section's error string when present.
+func connState(health map[string]any, section string) string {
+	m, ok := health[section].(map[string]any)
+	if !ok {
+		return "unknown"
+	}
+	connected, _ := m["connected"].(bool)
+	if connected {
+		return "connected"
+	}
+	if msg, ok := m["error"].(string); ok && msg != "" {
+		return "disconnected (" + msg + ")"
+	}
+	return "disconnected"
+}
+
+func nestedString(m map[string]any, section, key string) string {
+	sub, ok := m[section].(map[string]any)
+	if !ok {
+		return ""
+	}
+	s, _ := sub[key].(string)
+	return s
+}
@@ -28,6 +28,7 @@ import (
 	"github.com/alexei/tinyforge/internal/health"
 	"github.com/alexei/tinyforge/internal/logging"
 	"github.com/alexei/tinyforge/internal/logscanner"
+	"github.com/alexei/tinyforge/internal/metricalert"
 	"github.com/alexei/tinyforge/internal/notify"
 	"github.com/alexei/tinyforge/internal/npm"
 	"github.com/alexei/tinyforge/internal/proxy"
@@ -36,6 +37,7 @@ import (
 	"github.com/alexei/tinyforge/internal/stale"
 	"github.com/alexei/tinyforge/internal/stats"
 	"github.com/alexei/tinyforge/internal/store"
+	"github.com/alexei/tinyforge/internal/volsnap"
 	"github.com/alexei/tinyforge/internal/webhook"
 	"github.com/alexei/tinyforge/internal/workload/plugin"

@@ -43,6 +45,7 @@ import (
 	// itself with internal/workload/plugin. Adding a new Source or Trigger
 	// is a matter of dropping a new package and adding it to this list.
 	_ "github.com/alexei/tinyforge/internal/workload/plugin/source/compose"
+	_ "github.com/alexei/tinyforge/internal/workload/plugin/source/dockerfile"
 	_ "github.com/alexei/tinyforge/internal/workload/plugin/source/image"
 	_ "github.com/alexei/tinyforge/internal/workload/plugin/source/static"
 	_ "github.com/alexei/tinyforge/internal/workload/plugin/trigger/git"
@@ -62,6 +65,20 @@ func main() {
 		os.Exit(1)
 	}

+	// Acquire single-instance lockfile BEFORE opening the DB. SQLite +
+	// SetMaxOpenConns(1) does not protect against two Tinyforge processes
+	// sharing a data directory; without this guard a misconfigured
+	// systemd unit, container restart race, or `tinyforge` shell typo can
+	// silently double-fire schedulers, double-poll registries, and
+	// corrupt `extra_json` RMW. The lockfile is a PID file under
+	// $DATA_DIR/tinyforge.lock — collisions with dead PIDs are reclaimed.
+	releaseLock, err := store.AcquireLockfile(dataDir)
+	if err != nil {
+		slog.Error("could not acquire data-dir lock", "data_dir", dataDir, "error", err)
+		os.Exit(1)
+	}
+	defer releaseLock()
+
 	// Open database.
 	dbPath := filepath.Join(dataDir, "tinyforge.db")
 	db, err := store.New(dbPath)
@@ -78,6 +95,21 @@ func main() {
 		os.Exit(1)
 	}

+	// One-shot migration: rewrite every legacy unprefixed-hex secret
+	// in the DB into the new tf1: envelope form. Idempotent (gated by
+	// schema_versions version 2). Lets the rest of the codebase treat
+	// envelope-presence as a stable invariant for future key rotations.
+	// Failures here are logged but non-fatal: a partial migration just
+	// means some columns keep working through Decrypt's legacy
+	// fallback until the next manual save re-encrypts them.
+	if err := db.MigrateSecretsToEnvelope(store.EnvelopeMigrator{
+		HasEnvelope: crypto.HasEnvelope,
+		Decrypt:     func(v string) (string, error) { return crypto.Decrypt(encKey, v) },
+		Encrypt:     func(v string) (string, error) { return crypto.Encrypt(encKey, v) },
+	}); err != nil {
+		slog.Warn("secrets envelope migration", "error", err)
+	}
+
 	// Import seed config on first launch (idempotent).
 	seedPath := envOrDefault("SEED_FILE", "./tinyforge.yaml")
 	if err := config.ImportSeed(db, seedPath); err != nil {
@@ -197,7 +229,8 @@ func main() {
 			switch {
 			case r.Deployed:
 				deployed++
-			case r.Reason == webhook.ReasonBindingDisabled, r.Reason == webhook.ReasonNoMatch:
+			case r.Reason == webhook.ReasonBindingDisabled, r.Reason == webhook.ReasonNoMatch,
+				r.Reason == webhook.ReasonPreviewNoop:
 				// not a failure — silent
 			default:
 				errored++
@@ -291,6 +324,19 @@ func main() {
 	}
 	dep.SetPreDeployBackuper(backupEngine)

+	// Initialize volume-snapshot engine (per-workload data-volume archives).
+	snapshotEngine, err := volsnap.New(db, dataDir)
+	if err != nil {
+		slog.Error("create snapshot engine", "error", err)
+		os.Exit(1)
+	}
+	// Reclaim snapshot files orphaned by workload deletes (rows CASCADE, files don't).
+	if cleaned, err := snapshotEngine.CleanOrphans(); err != nil {
+		slog.Warn("snapshots: clean orphans on startup", "error", err)
+	} else if cleaned > 0 {
+		slog.Info("snapshots: cleaned orphan files on startup", "count", cleaned)
+	}
+
 	// Clean orphaned backup files and prune on startup.
 	if cleaned, err := backupEngine.CleanOrphans(); err != nil {
 		slog.Warn("backup: clean orphans on startup", "error", err)
@@ -359,11 +405,20 @@ func main() {
 	}
 	defer logScanMgr.Stop()

+	// Metric-alert manager: evaluates threshold rules against recent
+	// container stats samples and emits event_log entries on breach.
+	// The store satisfies RuleSource/SampleSource/EventSink; the event
+	// bus is the Publisher.
+	metricAlertMgr := metricalert.New(db, db, db, eventBus)
+	metricAlertMgr.Start()
+	defer metricAlertMgr.Stop()
+
 	// Build API server.
 	apiServer := api.NewServer(db, dockerClient, npmClient, proxyProvider, dep, notifier, webhookHandler, eventBus, encKey)
 	apiServer.SetStaleScanner(staleScanner)
 	apiServer.SetLogScanReloader(logScanMgr)
 	apiServer.SetBackupEngine(backupEngine)
+	apiServer.SetSnapshotEngine(snapshotEngine)
 	apiServer.SetDBPath(dbPath)
 	apiServer.SetBackupSettingsChangedCallback(scheduleAutobackup)
 	apiServer.SetDNSProvider(dnsProvider)
@@ -420,6 +475,7 @@ func main() {
 	eventBus.Unsubscribe(notifySub)
 	staleScanner.Stop()
 	statsCollector.Stop()
+	metricAlertMgr.Stop()

 	// Drain in-progress deploys and notifications.
 	dep.Drain()
@@ -1,6 +1,6 @@
 # Tinyforge Codemaps — Index

-**Last Updated:** 2026-05-16
+**Last Updated:** 2026-05-16 (added `container-extra-json` policy doc)

 This directory contains architectural maps of key Tinyforge subsystems. Each codemap focuses on one major area: core data types, contract surfaces, integration points, and recipes for extending the system.

@@ -8,6 +8,7 @@ This directory contains architectural maps of key Tinyforge subsystems. Each cod

 - **[Workload Plugin](./workload-plugin.md)** — Source × Trigger plugin contracts; registry lookups; webhook fan-out; how to add new kinds.
 - **[Discovery & Runtime API](./discovery-and-runtime.md)** — `/api/discovery/*` helpers (Git provider probe, repo/branch/tree pickers, image conflicts); `/api/workloads/{id}/runtime-state` + `/storage` + `/stop` + `/start`; SSRF-safe HTTP client in `internal/staticsite`.
+- **[`containers.extra_json` Evolution Policy](./container-extra-json.md)** — Ownership model, reader/writer rules, wholesale-overwrite vs preserve-unknown-keys patterns, concurrency invariants; checklist for adding a new field without breaking older deployers.

 ## Cross-References

@@ -0,0 +1,105 @@
+# `containers.extra_json` — Evolution Policy
+
+**Last Updated:** 2026-05-16
+
+`extra_json` is a TEXT column on the `containers` table that source plugins use to persist source-specific runtime state that hasn't been promoted to a first-class column. It is the single forward-compatibility seam between the canonical container row and per-source needs that arise after a schema is in production.
+
+This doc captures the rules every reader and writer must follow so new sources can extend the blob without breaking older ones.
+
+## Schema position
+
+- Column: `containers.extra_json TEXT NOT NULL DEFAULT '{}'` ([`internal/store/store.go:233`](../../internal/store/store.go#L233)).
+- All four write paths (`CreateContainer`, `UpsertContainer`, `ReconcileContainer`, `UpdateContainer`) normalize `""` → `'{}'` before the SQL exec — readers can assume a non-empty JSON object string and never need to handle SQL `NULL` or the empty-string edge.
+- Defined on the `Container` model: [`internal/store/models.go:342-347`](../../internal/store/models.go#L342-L347).
+
+## Ownership model
+
+**One container row → one owning source.** Sources never write to a row that belongs to another source. In practice:
+
+| Source kind | Row key                                | Number of rows per workload | Writes `extra_json` today?  |
+| ----------- | -------------------------------------- | --------------------------- | --------------------------- |
+| `static`    | deterministic `<workloadID>:site`      | exactly 1                   | yes (preserve-unknown-keys) |
+| `image`     | UUID per deployed container            | 1 + N (blue-green rolls)    | yes (wholesale-overwrite)   |
+| `compose`   | deterministic `<workloadID>:<service>` | N (one per compose service) | no — left at `'{}'` default |
+
+Two sources cannot contend on the same row, so the policy below is concerned with **forward compatibility across versions of the same source**, not cross-source contention. When compose (or any future source) starts writing `extra_json`, the same rules apply.
+
+## Reader rules — ALL readers
+
+1. **Tolerate unknown keys.** Decode into a typed struct using `encoding/json`; Go's default unmarshaller silently drops unknown keys, which is the desired behaviour. Never use `json.Decoder.DisallowUnknownFields()` on `extra_json`.
+2. **Tolerate decode failure as non-fatal where the row's first-class columns are useful.** A corrupted `extra_json` is debug-logged and the reader falls back to zero state — see `workload_runtime.go:118-133` for the canonical pattern. The container's `ContainerID`, `State`, `ProxyRouteID`, etc. live in their own columns and are still trustworthy.
+3. **Tolerate `''` and `'{}'`.** Both are equivalent to "no extras yet". Readers must short-circuit before json.Unmarshal to avoid `unexpected end of JSON input` on the empty case.
+
+## Writer rules — by mutation style
+
+Two distinct write patterns live in the codebase today. Pick the one that matches your source's needs.
+
+### Wholesale-overwrite (image source pattern)
+
+When the writer owns 100% of the blob's shape and discards old contents on every write:
+
+```go
+// internal/workload/plugin/source/image/image.go:341-343
+extra := containerExtra{ProxyRoutes: faceRoutes}
+if b, err := json.Marshal(extra); err == nil {
+    created.ExtraJSON = string(b)
+}
+```
+
+- Cheap and simple.
+- **Loses unknown keys written by future versions of the same source.** Only use when you are certain no other writer (including a future version of this code) needs to round-trip an unknown key.
+- The `containerExtra` struct must be **additive-only**: never rename or remove a field once shipped, and never change its JSON type. Mark new fields with `omitempty` so older readers downgrading to an older codebase don't see surprise nulls.
+
+### Preserve-unknown-keys (static source pattern)
+
+When future versions of the source (or sibling writers) may add fields and the current writer must round-trip them:
+
+```go
+// internal/workload/plugin/source/static/state.go saveState
+//   1. Decode existing blob into map[string]json.RawMessage.
+//   2. Strip every key the current typed-state struct owns
+//      (runtimeStateKeys) so a cleared field actually drops.
+//   3. Apply caller's mutate() to the typed state.
+//   4. Re-marshal typed state, splice its keys back into the
+//      generic map (overwriting any historical sibling).
+//   5. Marshal the merged map back into extra_json.
+```
+
+- Slightly more expensive (two round-trips through `json`).
+- Preserves keys the current writer doesn't know about — required for safe rolling deploys where a newer instance writes a new key, an older instance then reads, mutates, and writes back.
+- Must declare the typed key set explicitly (`runtimeStateKeys`) so step 2 can strip them. This invariant is fenced by `TestRuntimeState_JSONTagsRoundTrip` in [`state_integration_test.go`](../../internal/workload/plugin/source/static/state_integration_test.go).
+
+**Default to preserve-unknown-keys for any new source.** Wholesale-overwrite is acceptable for the image source today because the row's lifetime is short (replaced on every blue-green roll) and only one writer touches it. Sources whose container rows are long-lived (static, future compose-with-stateful-services) should preserve unknown keys.
+
+## Concurrency
+
+`UpsertContainer` is atomic at the SQL layer — SQLite serializes statements through one connection ([`internal/store/store.go:55`](../../internal/store/store.go#L55) `SetMaxOpenConns(1)`) with WAL mode enabled ([`store.go:60`](../../internal/store/store.go#L60)). That guarantees no torn write on a single row, and concurrent readers see a consistent snapshot — they read either the pre- or post-write state, never a half-applied one.
+
+What that does **not** guarantee is atomic read-modify-write across two Go goroutines. The static source serializes its RMW through a per-workload `sync.Mutex` keyed by workload ID (`internal/workload/plugin/source/static/state.go` `lockFor` + `saveState`). Any source that does its own read-modify-write on `extra_json` must do the same — verified in `TestSaveState_ConcurrentWritesDoNotLoseUpdates` (which loses 15+ markers per 20-writer run when the mutex is disabled, as confirmed in commit `ef62a41`).
+
+If a future source is purely wholesale-overwrite from a single writer, no lock is needed.
+
+## What `extra_json` is NOT for
+
+- **Workload-level config.** Workload config goes in `workloads.source_config` and is the operator's surface.
+- **Cross-source state.** If two sources need the same data, promote it to a column.
+- **Anything queryable.** SQLite can JSON-path `extra_json` but no index supports it; readers always pull the column wholesale and parse in Go.
+- **Secrets.** Anything sensitive lives in `workload_env` (per-entry encrypt flag) or another encrypted table.
+
+## Adding a new field — checklist
+
+1. Add the field to your source's typed struct with `omitempty` and a stable `json:"snake_case"` tag.
+2. If you use the **preserve-unknown-keys** pattern, add the JSON key to your `*Keys` slice (the equivalent of `runtimeStateKeys`).
+3. Confirm older readers (older deploys of the same binary) still parse the blob — `encoding/json` should drop the unknown key silently. Add a regression test if there's any doubt.
+4. Document the new field in this codemap if it's load-bearing for cross-source code (e.g., the proxy_routes map drives `ListProxyRoutes`).
+
+## Pointers
+
+- Container model + `ExtraJSON` comment: [`internal/store/models.go:342-347`](../../internal/store/models.go#L342-L347)
+- Schema declaration: [`internal/store/store.go:233`](../../internal/store/store.go#L233)
+- Store-level normalization (`'{}'` default) across all four write paths: [`internal/store/containers.go:42-43`](../../internal/store/containers.go#L42-L43) (CreateContainer), `:77-78` (UpsertContainer), `:129-130` (ReconcileContainer), `:321-322` (UpdateContainer).
+- Wholesale-overwrite writer + struct: [`image.go:341-343`](../../internal/workload/plugin/source/image/image.go#L341-L343) writes; [`image.go:481-487`](../../internal/workload/plugin/source/image/image.go#L481-L487) defines `containerExtra`; [`image.go:449-456`](../../internal/workload/plugin/source/image/image.go#L449-L456) reads it back in Teardown.
+- Preserve-unknown-keys example + concurrency lock: [`internal/workload/plugin/source/static/state.go`](../../internal/workload/plugin/source/static/state.go).
+- Canonical "decode-and-tolerate" consumer (the only cross-source reader in tree today): [`internal/api/workload_runtime.go:118-133`](../../internal/api/workload_runtime.go#L118-L133) decodes the static-only typed fields and falls back to first-class columns when the blob is empty, missing keys, or malformed.
+
+Note: no cross-source consumer reads `extra_json` in `internal/store/`. The proxy/route data exposed by `ListProxyRoutes` ([`containers.go:196`](../../internal/store/containers.go#L196)) comes from first-class columns (`proxy_route_id`, `subdomain`, `port`); the `proxy_routes` map inside `extra_json` is read only by the image source's own Teardown for cleanup.
@@ -500,13 +500,15 @@ covers the use case — `promote-from` works, the UI shows the relationship.
 Probably can leave the legacy `stages` table dropped entirely once cutover
 proceeds.

-### `Container.extra_json` evolution
+### ~~`Container.extra_json` evolution~~ — DONE (2026-05-16)

-Currently only the image source uses it (per-face proxy route IDs). If
-other sources gain similar needs (compose service health metadata, static
-build SHAs), the schema there should stay versionless and additive — every
-reader must tolerate unknown keys. Document this in the source plugin
-guide alongside the codemap entry.
+Both writer patterns now have an active example in-tree (image source
+clobbers, static source preserves) and the policy is documented in
+[`docs/CODEMAPS/container-extra-json.md`](CODEMAPS/container-extra-json.md):
+ownership model, wholesale-overwrite vs preserve-unknown-keys, reader
+tolerance for unknown keys + decode failure, the per-workload mutex
+requirement for any read-modify-write writer, and a checklist for adding
+a new field without breaking older deployers.

 ## File pointers for the next session

@@ -16,13 +16,12 @@ import (
 )

 // rateLimitedLogin wraps the login handler with per-IP rate limiting.
+// Uses clientIP() so X-Forwarded-For is honored only when the request
+// arrives from a configured trusted-proxy CIDR — preventing remote
+// attackers from spoofing the header to bypass the per-IP login limiter.
 func (s *Server) rateLimitedLogin(rl *rateLimiter) http.HandlerFunc {
 	return func(w http.ResponseWriter, r *http.Request) {
-		ip := r.RemoteAddr
-		if fwd := r.Header.Get("X-Forwarded-For"); fwd != "" {
-			ip = fwd
-		}
-		if !rl.allow(ip) {
+		if !rl.allow(clientIP(r)) {
 			respondError(w, http.StatusTooManyRequests, "too many login attempts, try again later")
 			return
 		}
@@ -1,7 +1,6 @@
 package api

 import (
-	"io"
 	"log/slog"
 	"net/http"
 	"os"
@@ -118,7 +117,22 @@ func (s *Server) deleteBackup(w http.ResponseWriter, r *http.Request) {
 }

 // restoreBackup handles POST /api/backups/{id}/restore.
-// This replaces the current database with the backup and triggers a graceful shutdown.
+//
+// Restore happens in three documented stages so a failure at any stage
+// leaves the live DB intact:
+//
+//  1. PRE-FLIGHT (sync, before the HTTP response): PrepareRestore opens
+//     the candidate read-only and runs `PRAGMA integrity_check`. If it
+//     fails the live DB is untouched and we return 400 with the reason.
+//
+//  2. SAFETY NET: a pre-restore backup of the LIVE DB is created so the
+//     operator can roll back even if the candidate is later discovered
+//     to be missing data.
+//
+//  3. SWAP (async, after the response is flushed): close the live DB,
+//     atomic-rename the candidate over the live path, wipe WAL/SHM,
+//     trigger graceful shutdown. supervisord / systemd / docker
+//     restart=on-failure brings the process back with the new DB.
 func (s *Server) restoreBackup(w http.ResponseWriter, r *http.Request) {
 	if s.backupEngine == nil {
 		respondError(w, http.StatusServiceUnavailable, "backup engine not initialized")
@@ -126,13 +140,44 @@ func (s *Server) restoreBackup(w http.ResponseWriter, r *http.Request) {
 	}

 	id := chi.URLParam(r, "id")
-	restorePath, err := s.backupEngine.RestorePath(id)
-	if err != nil {
-		respondError(w, http.StatusNotFound, "backup not found: "+err.Error())
+
+	// CSRF / accidental-fire guard: the restore endpoint is the most
+	// destructive surface in the API (replaces the whole DB). Even
+	// though it sits behind AdminOnly + Bearer JWT, a blind cross-site
+	// POST or a misclicked button in any open admin tab can fire it.
+	// Require the operator's client to echo X-Confirm-Restore: <id>
+	// — matching the path param — so a CSRF post-form / image-src
+	// trick can't trigger restore (browsers don't let cross-origin
+	// requests set custom headers without a preflight).
+	if confirm := r.Header.Get("X-Confirm-Restore"); confirm != id {
+		respondError(w, http.StatusBadRequest,
+			"missing or mismatched X-Confirm-Restore header (must equal backup id)")
 		return
 	}

-	// Create a safety backup before restore so the user can undo if needed.
+	// Single-flight guard: a rapid double-click would otherwise spawn
+	// two goroutines racing s.store.Close() and the candidate-over-
+	// live rename. CAS to true here; if someone else won, return 409.
+	if !s.restoreInFlight.CompareAndSwap(false, true) {
+		respondError(w, http.StatusConflict, "a restore is already in progress")
+		return
+	}
+	// Do NOT release the flag — the restore path triggers shutdown.
+	// A failed restore is also terminal (the DB may be closed); a
+	// fresh process boot is the recovery path.
+	// PRE-FLIGHT: refuse before touching anything if the candidate is
+	// not a valid SQLite database or fails integrity_check. This is the
+	// guard the prior code lacked — a corrupt backup would silently
+	// overwrite a healthy live DB.
+	restorePath, err := s.backupEngine.PrepareRestore(id)
+	if err != nil {
+		respondError(w, http.StatusBadRequest, err.Error())
+		return
+	}
+
+	// SAFETY NET: pre-restore snapshot of the live DB. A failure here
+	// is logged but does not abort — the integrity-checked candidate
+	// is still safer than refusing to restore.
 	if _, err := s.backupEngine.CreateBackup("pre-restore"); err != nil {
 		slog.Warn("failed to create pre-restore backup", "error", err)
 	}
@@ -153,41 +198,37 @@ func (s *Server) restoreBackup(w http.ResponseWriter, r *http.Request) {
 	go func() {
 		time.Sleep(500 * time.Millisecond)

-		// Close the current database to release locks.
+		// Once we begin closing the live DB the process can no longer serve
+		// requests against a sane store, so EVERY exit path from here must
+		// trigger shutdown. Returning early would leave the server limping
+		// on a closed/half-swapped database with no path to recovery except
+		// an external kill. shutdownFunc → graceful shutdown → main returns
+		// → deferred releaseLock()/db.Close() run, and the supervisor reopens
+		// whatever DB is on disk on the next boot.
+		triggerShutdown := func() {
+			if s.shutdownFunc != nil {
+				s.shutdownFunc()
+			}
+		}
+
+		// Close the current database to release locks. AtomicReplaceDB
+		// expects the live file to be unmapped before swap (especially
+		// important on Windows where open files cannot be renamed over).
 		if err := s.store.Close(); err != nil {
-			slog.Error("restore: failed to close database", "error", err)
+			slog.Error("restore: failed to close database, restarting", "error", err)
+			triggerShutdown()
 			return
 		}

-		// Copy the backup file over the main database using streaming (no full read into memory).
-		src, err := os.Open(restorePath)
-		if err != nil {
-			slog.Error("restore: failed to open backup file", "error", err)
+		if err := s.backupEngine.AtomicReplaceDB(restorePath, s.dbPath); err != nil {
+			slog.Error("restore: atomic replace failed, restarting", "error", err)
+			triggerShutdown()
 			return
 		}
-		defer src.Close()
-
-		dst, err := os.Create(s.dbPath)
-		if err != nil {
-			slog.Error("restore: failed to create database file", "error", err)
-			return
-		}
-		defer dst.Close()
-
-		if _, err := io.Copy(dst, src); err != nil {
-			slog.Error("restore: failed to copy backup to database", "error", err)
-			return
-		}
-
-		// Remove WAL and SHM files to ensure clean state.
-		os.Remove(s.dbPath + "-wal")
-		os.Remove(s.dbPath + "-shm")

 		slog.Info("restore: database replaced, triggering shutdown")

 		// Signal the server to shut down gracefully so it can be restarted.
-		if s.shutdownFunc != nil {
-			s.shutdownFunc()
-		}
+		triggerShutdown()
 	}()
 }
@@ -9,6 +9,7 @@ import (
 	"strings"
 	"time"

+	"github.com/alexei/tinyforge/internal/docker"
 	"github.com/alexei/tinyforge/internal/staticsite"
 )

@@ -350,6 +351,54 @@ func (s *Server) listImageConflicts(w http.ResponseWriter, r *http.Request) {
 	respondJSON(w, http.StatusOK, conflicts)
 }

+// inspectImageRequest is the body for POST /api/discovery/image/inspect.
+type inspectImageRequest struct {
+	Image string `json:"image"`
+}
+
+// inspectImageResponse mirrors the frontend InspectResult shape the
+// new-app wizard pre-fills from: the first exposed port (parsed to int,
+// 0 when none) and the image's HEALTHCHECK command string.
+type inspectImageResponse struct {
+	Port        int    `json:"port"`
+	Healthcheck string `json:"healthcheck"`
+}
+
+// inspectImageMetadata inspects a LOCAL image and returns its first
+// exposed port + healthcheck so the wizard can pre-fill those fields.
+// POST /api/discovery/image/inspect.
+//
+// This inspects local images only — it does not pull. When the image is
+// not present locally the docker call fails; we return a generic,
+// non-leaky 400 rather than the git-specific upstreamError so a raw
+// docker daemon string (which may echo the ref) never reaches the client.
+func (s *Server) inspectImageMetadata(w http.ResponseWriter, r *http.Request) {
+	var req inspectImageRequest
+	if !decodeJSON(w, r, &req) {
+		return
+	}
+	image := strings.TrimSpace(req.Image)
+	if image == "" {
+		respondError(w, http.StatusBadRequest, "image is required")
+		return
+	}
+
+	ctx, cancel := context.WithTimeout(r.Context(), discoveryTimeout)
+	defer cancel()
+
+	info, err := s.docker.InspectImage(ctx, image)
+	if err != nil {
+		slog.Warn("inspect image metadata failed", "error", err)
+		respondError(w, http.StatusBadRequest, "could not inspect image — make sure it is pulled locally and the reference is correct")
+		return
+	}
+
+	respondJSON(w, http.StatusOK, inspectImageResponse{
+		Port:        docker.ExtractPort(info.ExposedPorts),
+		Healthcheck: info.Healthcheck,
+	})
+}
+
 // stripImageTag returns the image reference with the trailing :tag
 // removed, taking care to leave a registry port (e.g. registry:5000/foo)
 // intact. Digest references (image@sha256:...) are returned unchanged.
@@ -348,3 +348,32 @@ func (s *Server) pruneImages(w http.ResponseWriter, r *http.Request) {
 		"space_reclaimed_mb": reclaimedBytes / (1024 * 1024),
 	})
 }
+
+// pruneBuildCache handles POST /api/docker/prune-build-cache. It removes
+// unused Docker build-cache records daemon-wide (all=false), so an app's next
+// rebuild still hits its warm cache. The build cache is regenerable by
+// definition — pruning only forces slower rebuilds, never data loss — and the
+// dockerfile/static deploy paths never reclaim it on teardown, so it grows
+// monotonically until pruned here.
+func (s *Server) pruneBuildCache(w http.ResponseWriter, r *http.Request) {
+	if s.docker == nil {
+		respondError(w, http.StatusServiceUnavailable, "Docker is not available")
+		return
+	}
+
+	result, err := s.docker.PruneBuildCache(r.Context(), false)
+	if err != nil {
+		slog.Error("prune: build cache", "error", err)
+		respondError(w, http.StatusInternalServerError, "internal server error")
+		return
+	}
+
+	slog.Info("prune: build cache",
+		"caches_deleted", result.CachesDeleted,
+		"space_reclaimed_mb", result.SpaceReclaimed/(1024*1024))
+
+	respondJSON(w, http.StatusOK, map[string]any{
+		"caches_deleted":     result.CachesDeleted,
+		"space_reclaimed_mb": result.SpaceReclaimed / (1024 * 1024),
+	})
+}
@@ -37,6 +37,36 @@ func (s *Server) listEventLog(w http.ResponseWriter, r *http.Request) {
 	respondJSON(w, http.StatusOK, events)
 }

+// listWorkloadEvents handles GET /api/workloads/{id}/events — the per-app
+// activity/deploy timeline. The workload id is pinned from the path, so a
+// client cannot widen the scope to other workloads or the global feed.
+// Supports the same severity/limit/offset query params as listEventLog.
+func (s *Server) listWorkloadEvents(w http.ResponseWriter, r *http.Request) {
+	id := chi.URLParam(r, "id")
+	if id == "" {
+		respondError(w, http.StatusBadRequest, "workload id is required")
+		return
+	}
+
+	q := r.URL.Query()
+	limit, _ := strconv.Atoi(q.Get("limit"))
+	offset, _ := strconv.Atoi(q.Get("offset"))
+
+	events, err := s.store.ListEvents(store.EventLogFilter{
+		WorkloadID: id,
+		Severity:   q.Get("severity"),
+		Limit:      limit,
+		Offset:     offset,
+	})
+	if err != nil {
+		slog.Error("failed to list workload events", "workload", id, "error", err)
+		respondError(w, http.StatusInternalServerError, "failed to list events")
+		return
+	}
+
+	respondJSON(w, http.StatusOK, events)
+}
+
 // getEventLogStats handles GET /api/events/log/stats.
 func (s *Server) getEventLogStats(w http.ResponseWriter, r *http.Request) {
 	stats, err := s.store.GetEventStats()
@@ -0,0 +1,64 @@
+package api
+
+import (
+	"context"
+	"log/slog"
+	"net/http"
+	"time"
+
+	"github.com/alexei/tinyforge/internal/metrics"
+)
+
+// livez always returns 200 if the process is up. Used by container
+// orchestrators / load balancers / Docker HEALTHCHECK as the "is the
+// binary alive" probe. Intentionally does NOT touch the DB or Docker —
+// a slow DB must not cause restart loops.
+func (s *Server) livez(w http.ResponseWriter, _ *http.Request) {
+	w.Header().Set("Content-Type", "text/plain; charset=utf-8")
+	_, _ = w.Write([]byte("ok\n"))
+}
+
+// readyz returns 200 only when the process can actually serve traffic:
+// SQLite is reachable, the encryption key is loaded, the deployer is
+// not draining. The response body is intentionally minimal — the
+// specific failing probe name is recorded in slog (operator-visible)
+// rather than returned to unauthenticated callers. This avoids handing
+// reconnaissance to an attacker who can hit /readyz during an outage
+// ("DB down" vs "encryption key missing" leaks operational state).
+func (s *Server) readyz(w http.ResponseWriter, r *http.Request) {
+	ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
+	defer cancel()
+
+	// DB ping: cheap and exact — exercises the connection pool, file
+	// lock, and busy-timeout. A failing ping means SQLite WAL is wedged
+	// or the data dir is gone.
+	if err := s.store.DB().PingContext(ctx); err != nil {
+		slog.Warn("readyz: db ping failed", "error", err)
+		w.Header().Set("Content-Type", "text/plain; charset=utf-8")
+		w.WriteHeader(http.StatusServiceUnavailable)
+		_, _ = w.Write([]byte("not ready\n"))
+		return
+	}
+
+	// Encryption key sanity: if it's zero we cannot decrypt any stored
+	// secret, so the deployer paths will all explode at first use.
+	if s.encKey == ([32]byte{}) {
+		slog.Warn("readyz: encryption key not loaded")
+		w.Header().Set("Content-Type", "text/plain; charset=utf-8")
+		w.WriteHeader(http.StatusServiceUnavailable)
+		_, _ = w.Write([]byte("not ready\n"))
+		return
+	}
+
+	w.Header().Set("Content-Type", "text/plain; charset=utf-8")
+	_, _ = w.Write([]byte("ready\n"))
+}
+
+// metricsExport writes the process-wide metrics registry in Prometheus
+// text format. Admin-only by router placement; surface is intentionally
+// thin (no histograms / quantiles, only counters) to keep the binary
+// dependency-free.
+func (s *Server) metricsExport(w http.ResponseWriter, _ *http.Request) {
+	w.Header().Set("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
+	_ = metrics.DefaultRegistry.WritePrometheus(w)
+}
@@ -0,0 +1,235 @@
+// Package api: metric-alert rule HTTP handlers. The evaluator lives in
+// internal/metricalert; this file is the REST surface that lets
+// operators create, edit, and delete threshold rules. Mirrors the
+// log-scan rule handlers.
+package api
+
+import (
+	"errors"
+	"net/http"
+	"strconv"
+	"strings"
+
+	"github.com/go-chi/chi/v5"
+
+	"github.com/alexei/tinyforge/internal/store"
+)
+
+// metricAlertRuleInput is the JSON shape accepted by POST + PATCH.
+// Pointers distinguish "absent" from explicit empty/zero. WorkloadID is
+// immutable on update (per store.UpdateMetricAlertRule) so it only takes
+// effect on create.
+type metricAlertRuleInput struct {
+	WorkloadID      *string  `json:"workload_id"`
+	Name            *string  `json:"name"`
+	Metric          *string  `json:"metric"`
+	Comparator      *string  `json:"comparator"`
+	Threshold       *float64 `json:"threshold"`
+	Severity        *string  `json:"severity"`
+	CooldownSeconds *int     `json:"cooldown_seconds"`
+	Enabled         *bool    `json:"enabled"`
+}
+
+// listMetricAlertRules handles GET /api/metric-alert-rules. Optional
+// query filter `workload_id=...` returns rules applying to that workload
+// (its own rows plus globals).
+func (s *Server) listMetricAlertRules(w http.ResponseWriter, r *http.Request) {
+	if wlID := r.URL.Query().Get("workload_id"); wlID != "" {
+		out, err := s.store.ListMetricAlertRulesByWorkload(wlID)
+		if err != nil {
+			respondError(w, http.StatusInternalServerError, "list metric alert rules")
+			return
+		}
+		respondJSON(w, http.StatusOK, out)
+		return
+	}
+	out, err := s.store.ListMetricAlertRules()
+	if err != nil {
+		respondError(w, http.StatusInternalServerError, "list metric alert rules")
+		return
+	}
+	respondJSON(w, http.StatusOK, out)
+}
+
+// getMetricAlertRule handles GET /api/metric-alert-rules/{id}.
+func (s *Server) getMetricAlertRule(w http.ResponseWriter, r *http.Request) {
+	id, ok := parseMetricAlertRuleID(w, r)
+	if !ok {
+		return
+	}
+	rule, err := s.store.GetMetricAlertRule(id)
+	if err != nil {
+		mapStoreError(w, err, "metric alert rule")
+		return
+	}
+	respondJSON(w, http.StatusOK, rule)
+}
+
+// createMetricAlertRule handles POST /api/metric-alert-rules.
+func (s *Server) createMetricAlertRule(w http.ResponseWriter, r *http.Request) {
+	var in metricAlertRuleInput
+	if !decodeJSON(w, r, &in) {
+		return
+	}
+	rule := store.MetricAlertRule{
+		WorkloadID:      derefString(in.WorkloadID),
+		Name:            derefString(in.Name),
+		Metric:          derefString(in.Metric),
+		Comparator:      derefString(in.Comparator),
+		Threshold:       derefFloat64(in.Threshold),
+		Severity:        firstNonEmpty(derefString(in.Severity), store.LogScanSeverityWarn),
+		CooldownSeconds: derefIntDefault(in.CooldownSeconds, 300),
+		Enabled:         in.Enabled == nil || *in.Enabled,
+	}
+	if msg := validateMetricAlertInput(rule); msg != "" {
+		respondError(w, http.StatusBadRequest, msg)
+		return
+	}
+	out, err := s.store.CreateMetricAlertRule(rule)
+	if err != nil {
+		if isMetricAlertValidationErr(err) {
+			respondError(w, http.StatusBadRequest, err.Error())
+			return
+		}
+		respondError(w, http.StatusInternalServerError, "create metric alert rule")
+		return
+	}
+	respondJSON(w, http.StatusCreated, out)
+}
+
+// updateMetricAlertRule handles PATCH /api/metric-alert-rules/{id}.
+// workload_id is immutable; name/metric/comparator/threshold/severity/
+// cooldown/enabled are individually overridable.
+func (s *Server) updateMetricAlertRule(w http.ResponseWriter, r *http.Request) {
+	id, ok := parseMetricAlertRuleID(w, r)
+	if !ok {
+		return
+	}
+	existing, err := s.store.GetMetricAlertRule(id)
+	if err != nil {
+		mapStoreError(w, err, "metric alert rule")
+		return
+	}
+	var in metricAlertRuleInput
+	if !decodeJSON(w, r, &in) {
+		return
+	}
+	if in.Name != nil {
+		existing.Name = *in.Name
+	}
+	if in.Metric != nil && *in.Metric != "" {
+		existing.Metric = *in.Metric
+	}
+	if in.Comparator != nil && *in.Comparator != "" {
+		existing.Comparator = *in.Comparator
+	}
+	if in.Threshold != nil {
+		existing.Threshold = *in.Threshold
+	}
+	if in.Severity != nil && *in.Severity != "" {
+		existing.Severity = *in.Severity
+	}
+	if in.CooldownSeconds != nil {
+		existing.CooldownSeconds = *in.CooldownSeconds
+	}
+	if in.Enabled != nil {
+		existing.Enabled = *in.Enabled
+	}
+	if msg := validateMetricAlertInput(existing); msg != "" {
+		respondError(w, http.StatusBadRequest, msg)
+		return
+	}
+	out, err := s.store.UpdateMetricAlertRule(existing)
+	if err != nil {
+		if errors.Is(err, store.ErrNotFound) {
+			respondNotFound(w, "metric alert rule")
+			return
+		}
+		if isMetricAlertValidationErr(err) {
+			respondError(w, http.StatusBadRequest, err.Error())
+			return
+		}
+		respondError(w, http.StatusInternalServerError, "update metric alert rule")
+		return
+	}
+	respondJSON(w, http.StatusOK, out)
+}
+
+// deleteMetricAlertRule handles DELETE /api/metric-alert-rules/{id}.
+func (s *Server) deleteMetricAlertRule(w http.ResponseWriter, r *http.Request) {
+	id, ok := parseMetricAlertRuleID(w, r)
+	if !ok {
+		return
+	}
+	if err := s.store.DeleteMetricAlertRule(id); err != nil {
+		mapStoreError(w, err, "metric alert rule")
+		return
+	}
+	w.WriteHeader(http.StatusNoContent)
+}
+
+// validateMetricAlertInput does boundary validation so we return a
+// clear 400 before hitting the store. The store re-validates the same
+// invariants as a backstop.
+func validateMetricAlertInput(rule store.MetricAlertRule) string {
+	if strings.TrimSpace(rule.Name) == "" {
+		return "name is required"
+	}
+	switch rule.Metric {
+	case store.MetricCPUPercent, store.MetricMemoryPercent, store.MetricMemoryBytes:
+	default:
+		return "invalid metric: must be cpu_percent, memory_percent, or memory_bytes"
+	}
+	switch rule.Comparator {
+	case store.MetricComparatorGT, store.MetricComparatorLT:
+	default:
+		return "invalid comparator: must be gt or lt"
+	}
+	switch rule.Severity {
+	case store.LogScanSeverityInfo, store.LogScanSeverityWarn, store.LogScanSeverityError, "":
+	default:
+		return "invalid severity: must be info, warn, or error"
+	}
+	if rule.CooldownSeconds < 0 {
+		return "cooldown_seconds must be >= 0"
+	}
+	return ""
+}
+
+// isMetricAlertValidationErr maps the store's validation errors to 400
+// rather than 500 without leaking driver text.
+func isMetricAlertValidationErr(err error) bool {
+	if err == nil {
+		return false
+	}
+	msg := err.Error()
+	for _, needle := range []string{
+		"name is required",
+		"invalid metric",
+		"invalid comparator",
+		"invalid severity",
+		"cooldown_seconds must be",
+	} {
+		if strings.Contains(msg, needle) {
+			return true
+		}
+	}
+	return false
+}
+
+func parseMetricAlertRuleID(w http.ResponseWriter, r *http.Request) (int64, bool) {
+	raw := chi.URLParam(r, "id")
+	id, err := strconv.ParseInt(raw, 10, 64)
+	if err != nil || id <= 0 {
+		respondError(w, http.StatusBadRequest, "invalid rule id")
+		return 0, false
+	}
+	return id, true
+}
+
+func derefFloat64(p *float64) float64 {
+	if p == nil {
+		return 0
+	}
+	return *p
+}
@@ -1,14 +1,119 @@
 package api

 import (
+	"context"
+	"crypto/rand"
+	"encoding/hex"
 	"log/slog"
+	"net"
 	"net/http"
+	"os"
 	"runtime/debug"
 	"strings"
 	"sync"
 	"time"
+
+	"github.com/alexei/tinyforge/internal/metrics"
 )

+// requestIDKey is the context key under which the generated/forwarded
+// X-Request-ID is stored. Exported indirectly via RequestIDFromContext
+// so handlers and services downstream of the API layer can thread it
+// into their own slog calls without re-extracting from headers.
+type requestIDKeyType struct{}
+
+var requestIDKey = requestIDKeyType{}
+
+// RequestIDFromContext returns the correlation ID for the request, or
+// "" when called outside the API request path.
+func RequestIDFromContext(ctx context.Context) string {
+	if v, ok := ctx.Value(requestIDKey).(string); ok {
+		return v
+	}
+	return ""
+}
+
+// requestID middleware ensures every request has a stable correlation
+// ID. Honors a caller-supplied X-Request-ID when the request comes from
+// a trusted proxy AND the value matches a safe character set; otherwise
+// generates a fresh 128-bit ID. The ID is echoed back as X-Request-ID
+// and stitched into every subsequent slog call via the context value
+// the `logging` middleware reads.
+//
+// Format clamp: a compromised reverse proxy (or one that mis-parses an
+// untrusted header) could forward an ID containing newlines, semicolons,
+// or other separator characters. Those would corrupt structured log
+// parsers that assume one record per line / key-value. Restricting to
+// `[A-Za-z0-9._-]{1,64}` covers UUIDs, hex IDs, and trace-context IDs
+// without any sharp edges.
+func requestID(next http.Handler) http.Handler {
+	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
+		rid := r.Header.Get("X-Request-ID")
+		if rid == "" || !isTrustedPeer(r) || !isValidRequestID(rid) {
+			rid = newRequestID()
+		}
+		w.Header().Set("X-Request-ID", rid)
+		ctx := context.WithValue(r.Context(), requestIDKey, rid)
+		next.ServeHTTP(w, r.WithContext(ctx))
+	})
+}
+
+// isValidRequestID enforces `[A-Za-z0-9._-]{1,64}` without compiling a
+// regex on the request path. Single linear scan, no allocations.
+func isValidRequestID(s string) bool {
+	if len(s) == 0 || len(s) > 64 {
+		return false
+	}
+	for i := 0; i < len(s); i++ {
+		c := s[i]
+		switch {
+		case c >= 'A' && c <= 'Z':
+		case c >= 'a' && c <= 'z':
+		case c >= '0' && c <= '9':
+		case c == '.' || c == '_' || c == '-':
+		default:
+			return false
+		}
+	}
+	return true
+}
+
+// isTrustedPeer is a thin wrapper around the TRUSTED_PROXY_CIDRS allow-
+// list — we honor a forwarded request-id only from upstreams we already
+// trust for X-Forwarded-For. Otherwise an internet client could spam
+// log files with attacker-chosen IDs.
+func isTrustedPeer(r *http.Request) bool {
+	peer := r.RemoteAddr
+	if host, _, err := net.SplitHostPort(peer); err == nil {
+		peer = host
+	}
+	if len(trustedProxyCIDRs) == 0 {
+		return false
+	}
+	ip := net.ParseIP(peer)
+	if ip == nil {
+		return false
+	}
+	for _, n := range trustedProxyCIDRs {
+		if n.Contains(ip) {
+			return true
+		}
+	}
+	return false
+}
+
+func newRequestID() string {
+	var b [16]byte
+	if _, err := rand.Read(b[:]); err != nil {
+		// Fall back to time-based suffix if crypto/rand is unavailable
+		// — extremely unlikely outside of broken environments, but the
+		// ID is for tracing not security, so a deterministic fallback
+		// is preferable to a panic.
+		return "ts-" + time.Now().UTC().Format("20060102T150405.000000000")
+	}
+	return hex.EncodeToString(b[:])
+}
+
 // logging is an HTTP middleware that logs every request with method, path,
 // status code, and duration. Webhook URLs are redacted before being logged
 // because the secret is the only authenticator — leaking it to log
@@ -20,15 +125,58 @@ func logging(next http.Handler) http.Handler {

 		next.ServeHTTP(wrapped, r)

-		slog.Info("http request",
+		fields := []any{
 			"method", r.Method,
 			"path", redactPath(r.URL.Path),
 			"status", wrapped.status,
 			"duration", time.Since(start).String(),
-		)
+		}
+		if rq := redactQuery(r.URL.RawQuery); rq != "" {
+			fields = append(fields, "query", rq)
+		}
+		if rid := RequestIDFromContext(r.Context()); rid != "" {
+			fields = append(fields, "request_id", rid)
+		}
+		slog.Info("http request", fields...)
+
+		// Lightweight per-request counter. Bucket by status class so
+		// the cardinality stays at 5 × #methods regardless of how many
+		// distinct response codes we emit.
+		metrics.HTTPRequestsTotal.Inc(bucketMethod(r.Method), statusClass(wrapped.status))
 	})
 }

+// bucketMethod normalises HTTP method names against the standard set
+// so a malicious client cannot spam arbitrary method tokens (RFC 7230
+// allows any token) and inflate the metrics map. Anything off the
+// allow-list collapses to "other".
+func bucketMethod(m string) string {
+	switch m {
+	case "GET", "POST", "PUT", "PATCH", "DELETE", "HEAD", "OPTIONS", "CONNECT", "TRACE":
+		return m
+	}
+	return "other"
+}
+
+// statusClass buckets a status code into "1xx".."5xx" / "other". Keeps
+// metrics cardinality bounded so a chatty endpoint can't explode the
+// metrics map with one series per distinct response code.
+func statusClass(code int) string {
+	switch {
+	case code >= 100 && code < 200:
+		return "1xx"
+	case code >= 200 && code < 300:
+		return "2xx"
+	case code >= 300 && code < 400:
+		return "3xx"
+	case code >= 400 && code < 500:
+		return "4xx"
+	case code >= 500 && code < 600:
+		return "5xx"
+	}
+	return "other"
+}
+
 // redactPath strips secrets from URL paths that carry them in segments.
 // Only the canonical /api/webhook/triggers/{secret} surface remains after
 // the hard cutover.
@@ -40,6 +188,45 @@ func redactPath(path string) string {
 	return path
 }

+// redactQueryKeys is the case-insensitive set of query-parameter names whose
+// values are masked before a URL lands in the request log. `token` is used by
+// SSE/EventSource when a custom header can't be set; the rest are
+// defence-in-depth against sensitive values ever appearing in a query string.
+var redactQueryKeys = map[string]struct{}{
+	"token":         {},
+	"secret":        {},
+	"password":      {},
+	"passwd":        {},
+	"api_key":       {},
+	"apikey":        {},
+	"access_token":  {},
+	"client_secret": {},
+	"sig":           {},
+	"signature":     {},
+}
+
+// redactQuery masks the values of sensitive query parameters (see
+// redactQueryKeys) in a URL's raw query before it lands in the request log.
+// Key matching is case-insensitive. Returns the input unchanged when there is
+// nothing to redact so a malformed URL surfaces naturally.
+func redactQuery(rawQuery string) string {
+	if rawQuery == "" {
+		return ""
+	}
+	parts := strings.Split(rawQuery, "&")
+	for i, p := range parts {
+		eq := strings.IndexByte(p, '=')
+		if eq < 0 {
+			continue
+		}
+		key := strings.ToLower(p[:eq])
+		if _, ok := redactQueryKeys[key]; ok {
+			parts[i] = p[:eq+1] + "***"
+		}
+	}
+	return strings.Join(parts, "&")
+}
+
 // recovery is an HTTP middleware that catches panics and returns a 500 response.
 func recovery(next http.Handler) http.Handler {
 	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
@@ -54,16 +241,49 @@ func recovery(next http.Handler) http.Handler {
 }

 // securityHeaders sets standard security headers on all responses.
+//
+// Strict-Transport-Security is emitted only when the request arrived
+// over HTTPS (direct TLS or forwarded). Emitting HSTS over plain HTTP
+// is harmless to compliant browsers but flags as an issue in scanners
+// and confuses some reverse proxies.
+//
+// The CSP keeps `'unsafe-inline'` for now because SvelteKit injects
+// inline boot scripts and styles; removing it requires a nonce-based
+// strategy threaded through the SvelteKit handle hook. Tracked as a
+// follow-up; documented in the security report.
 func securityHeaders(next http.Handler) http.Handler {
 	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
 		w.Header().Set("X-Content-Type-Options", "nosniff")
 		w.Header().Set("X-Frame-Options", "DENY")
 		w.Header().Set("Referrer-Policy", "strict-origin-when-cross-origin")
-		w.Header().Set("Content-Security-Policy", "default-src 'self'; script-src 'self' 'unsafe-inline'; style-src 'self' 'unsafe-inline'; img-src 'self' data:; connect-src 'self'; font-src 'self'")
+		w.Header().Set("Permissions-Policy", "camera=(), microphone=(), geolocation=(), payment=()")
+		w.Header().Set("Content-Security-Policy",
+			"default-src 'self'; "+
+				"script-src 'self' 'unsafe-inline'; "+
+				"style-src 'self' 'unsafe-inline'; "+
+				"img-src 'self' data:; "+
+				"connect-src 'self'; "+
+				"font-src 'self'; "+
+				"frame-ancestors 'none'; "+
+				"base-uri 'self'; "+
+				"form-action 'self'")
+		if isHTTPS(r) {
+			w.Header().Set("Strict-Transport-Security", "max-age=31536000; includeSubDomains")
+		}
 		next.ServeHTTP(w, r)
 	})
 }

+func isHTTPS(r *http.Request) bool {
+	if r.TLS != nil {
+		return true
+	}
+	if r.Header.Get("X-Forwarded-Proto") == "https" {
+		return true
+	}
+	return false
+}
+
 // cors is an HTTP middleware that handles CORS for same-origin requests.
 // The frontend is served from the same origin, so cross-origin requests are not expected.
 func cors(next http.Handler) http.Handler {
@@ -164,10 +384,7 @@ func jsonContentType(next http.Handler) http.Handler {
 func rateLimitMiddleware(rl *rateLimiter) func(http.Handler) http.Handler {
 	return func(next http.Handler) http.Handler {
 		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
-			ip := r.RemoteAddr
-			if fwd := r.Header.Get("X-Forwarded-For"); fwd != "" {
-				ip = fwd
-			}
+			ip := clientIP(r)
 			if !rl.allow(ip) {
 				respondError(w, http.StatusTooManyRequests, "rate limit exceeded")
 				return
@@ -177,6 +394,100 @@ func rateLimitMiddleware(rl *rateLimiter) func(http.Handler) http.Handler {
 	}
 }

+// trustedProxyCIDRs is the parsed allow-list of upstream proxy networks
+// whose X-Forwarded-For header we honor. Set TRUSTED_PROXY_CIDRS to a
+// comma-separated list of CIDRs (e.g. "127.0.0.1/32,10.0.0.0/8") to
+// enable. When unset (the default) X-Forwarded-For is ignored entirely
+// and rate limiting + audit logging use r.RemoteAddr — preventing a
+// remote attacker from spoofing the header to bypass per-IP limiters.
+var trustedProxyCIDRs = parseTrustedProxyCIDRs(os.Getenv("TRUSTED_PROXY_CIDRS"))
+
+func parseTrustedProxyCIDRs(raw string) []*net.IPNet {
+	raw = strings.TrimSpace(raw)
+	if raw == "" {
+		return nil
+	}
+	var nets []*net.IPNet
+	for _, p := range strings.Split(raw, ",") {
+		p = strings.TrimSpace(p)
+		if p == "" {
+			continue
+		}
+		// Allow bare IPs as /32 (IPv4) or /128 (IPv6).
+		if !strings.Contains(p, "/") {
+			if ip := net.ParseIP(p); ip != nil {
+				if ip.To4() != nil {
+					p += "/32"
+				} else {
+					p += "/128"
+				}
+			}
+		}
+		_, n, err := net.ParseCIDR(p)
+		if err != nil {
+			slog.Warn("ignoring invalid TRUSTED_PROXY_CIDRS entry", "value", p, "error", err)
+			continue
+		}
+		nets = append(nets, n)
+	}
+	return nets
+}
+
+// clientIP returns the per-request "client" address used for rate-limit
+// keying and audit attribution. X-Forwarded-For is honored ONLY when the
+// direct peer (r.RemoteAddr) belongs to a configured trusted-proxy CIDR;
+// otherwise the header is ignored to prevent header-spoofing bypasses.
+func clientIP(r *http.Request) string {
+	peer := r.RemoteAddr
+	if host, _, err := net.SplitHostPort(peer); err == nil {
+		peer = host
+	}
+	if len(trustedProxyCIDRs) == 0 {
+		return peer
+	}
+	peerIP := net.ParseIP(peer)
+	if peerIP == nil || !isTrustedProxy(peerIP) {
+		return peer
+	}
+	fwd := r.Header.Get("X-Forwarded-For")
+	if fwd == "" {
+		return peer
+	}
+	// Walk X-Forwarded-For from the RIGHTMOST entry (the address closest to
+	// us, appended by our trusted peer) leftward, skipping entries that are
+	// themselves trusted proxies, and return the first untrusted address.
+	// The LEFTMOST entry is fully client-controlled — trusting it (as a
+	// naive `fwd[:firstComma]` does) lets an attacker spoof their rate-limit
+	// and audit identity by prepending a forged value, defeating the per-IP
+	// login limiter.
+	parts := strings.Split(fwd, ",")
+	for i := len(parts) - 1; i >= 0; i-- {
+		candidate := strings.TrimSpace(parts[i])
+		ip := net.ParseIP(candidate)
+		if ip == nil {
+			continue
+		}
+		if isTrustedProxy(ip) {
+			continue
+		}
+		return candidate
+	}
+	// Every forwarded entry was a trusted proxy (or unparseable) — fall back
+	// to the direct peer.
+	return peer
+}
+
+// isTrustedProxy reports whether ip falls within a configured
+// trusted-proxy CIDR.
+func isTrustedProxy(ip net.IP) bool {
+	for _, n := range trustedProxyCIDRs {
+		if n.Contains(ip) {
+			return true
+		}
+	}
+	return false
+}
+
 // statusRecorder wraps http.ResponseWriter to capture the status code.
 type statusRecorder struct {
 	http.ResponseWriter
@@ -4,6 +4,7 @@ import (
 	"context"
 	"log/slog"
 	"sync"
+	"sync/atomic"

 	"github.com/go-chi/chi/v5"

@@ -18,6 +19,7 @@ import (
 	"github.com/alexei/tinyforge/internal/proxy"
 	"github.com/alexei/tinyforge/internal/stale"
 	"github.com/alexei/tinyforge/internal/store"
+	"github.com/alexei/tinyforge/internal/volsnap"
 	"github.com/alexei/tinyforge/internal/webhook"
 	"github.com/alexei/tinyforge/internal/workload/plugin"
 )
@@ -55,12 +57,20 @@ type Server struct {
 	onDNSProviderChanged DNSProviderChangedFunc

 	backupEngine            *backup.Engine
+	snapshotEngine          *volsnap.Engine
 	sseGate                 *sseGate
 	logScanReloader         LogScanReloader
 	dbPath                  string
 	shutdownFunc            func()                                // called after restore to trigger graceful shutdown
 	onBackupSettingsChanged func(enabled bool, intervalHours int) // called when backup settings change
 	onProxyProviderChanged  func(provider proxy.Provider)         // called when proxy provider changes
+
+	// restoreInFlight is a process-wide guard against double-firing
+	// the restore endpoint. A rapid double-click would otherwise
+	// schedule two goroutines racing s.store.Close() and the
+	// candidate-over-live rename. CAS to true at the entry point;
+	// reject the second caller with 409 Conflict.
+	restoreInFlight atomic.Bool
 }

 // NewServer creates a new API Server with all required dependencies.
@@ -111,6 +121,11 @@ func (s *Server) SetBackupEngine(engine *backup.Engine) {
 	s.backupEngine = engine
 }

+// SetSnapshotEngine sets the volume-snapshot engine on the server.
+func (s *Server) SetSnapshotEngine(engine *volsnap.Engine) {
+	s.snapshotEngine = engine
+}
+
 // SetDBPath sets the database file path (needed for restore).
 func (s *Server) SetDBPath(path string) {
 	s.dbPath = path
@@ -157,13 +172,32 @@ func (s *Server) SetDNSProviderChangedCallback(fn DNSProviderChangedFunc) {

 // initOIDCProvider creates an OIDC provider from settings. Errors are logged, not fatal.
 func (s *Server) initOIDCProvider(ctx context.Context, as store.AuthSettings) {
-	// Decrypt the OIDC client secret if it's encrypted.
+	// Decrypt the OIDC client secret. The prior code did a try-decrypt
+	// and silently treated failures as plaintext — under a rotated key
+	// that sent ciphertext upstream to the OP. Now:
+	//   - If the value carries the tf1: envelope → fail loud on
+	//     decrypt failure (rotated key / corrupted ciphertext).
+	//   - If the value is unprefixed (legacy ciphertext from v0 or true
+	//     plaintext from an old migration) → try decrypt; on failure
+	//     accept as plaintext (the only safe legacy interpretation).
 	clientSecret := as.OIDCClientSecret
 	if clientSecret != "" {
-		if decrypted, err := crypto.Decrypt(s.encKey, clientSecret); err == nil {
+		switch {
+		case crypto.HasEnvelope(clientSecret):
+			decrypted, err := crypto.Decrypt(s.encKey, clientSecret)
+			if err != nil {
+				slog.Error("OIDC client secret could not be decrypted — refusing to initialize provider",
+					"error", err,
+					"hint", "rotate ENCRYPTION_KEY back, OR re-save OIDC settings to re-encrypt with the current key")
+				return
+			}
 			clientSecret = decrypted
+		default:
+			// Legacy v0 value: try decrypt; on failure assume plaintext.
+			if decrypted, err := crypto.Decrypt(s.encKey, clientSecret); err == nil {
+				clientSecret = decrypted
+			}
 		}
-		// If decrypt fails, assume it's already plaintext (migration scenario).
 	}
 	provider, err := auth.NewOIDCProvider(ctx, auth.OIDCConfig{
 		IssuerURL:    as.OIDCIssuerURL,
@@ -183,12 +217,29 @@ func (s *Server) initOIDCProvider(ctx context.Context, as store.AuthSettings) {
 func (s *Server) Router() chi.Router {
 	r := chi.NewRouter()

-	// Global middleware.
+	// Global middleware. requestID runs first so every downstream log
+	// line (and the access log emitted by `logging`) carries the same
+	// correlation id, plus the response carries it back on the
+	// X-Request-ID header for the operator to grep across services.
+	r.Use(requestID)
 	r.Use(recovery)
 	r.Use(securityHeaders)
 	r.Use(logging)
 	r.Use(cors)

+	// Unauthenticated health probes — mounted at the root so container
+	// orchestrators / load balancers can hit them without knowing about
+	// the /api prefix. /livez intentionally does no work and stays
+	// unbounded; /readyz pings the DB and is rate-limited to keep an
+	// unauthenticated flood from serialising behind SQLite's single
+	// writer connection (busy-timeout = 5s) and log-amplifying every
+	// request via the structured access log. The 10-per-minute budget
+	// is the existing rateLimiter default — generous for k8s readiness
+	// probes (typically every 5-10s), restrictive for an attacker.
+	r.Get("/livez", s.livez)
+	readyLimiter := newRateLimiter()
+	r.With(rateLimitMiddleware(readyLimiter)).Get("/readyz", s.readyz)
+
 	loginLimiter := newRateLimiter()
 	webhookLimiter := newRateLimiter()

@@ -232,6 +283,7 @@ func (s *Server) Router() chi.Router {
 				r.Post("/discovery/git/branches", s.listGitBranches)
 				r.Post("/discovery/git/tree", s.listGitTree)
 				r.Get("/discovery/image/conflicts", s.listImageConflicts)
+				r.Post("/discovery/image/inspect", s.inspectImageMetadata)
 			})

 			// Read-only endpoints (any authenticated user).
@@ -245,16 +297,18 @@ func (s *Server) Router() chi.Router {
 			r.Get("/events/log/stats", s.getEventLogStats)
 			r.Get("/registries", s.listRegistries)
 			r.Route("/registries/{id}", func(r chi.Router) {
+				// All registry probes are admin-gated. The /tags and
+				// /images endpoints used to be open to any authenticated
+				// user, but they make outbound requests using the
+				// admin-encrypted registry token — a viewer could
+				// effectively drive arbitrary requests against a private
+				// registry under admin credentials.
+				r.Use(auth.AdminOnly)
 				r.Get("/tags/*", s.listRegistryTags)
 				r.Get("/images", s.listRegistryImages)
-
-				// Admin-only registry mutations.
-				r.Group(func(r chi.Router) {
-					r.Use(auth.AdminOnly)
-					r.Put("/", s.updateRegistry)
-					r.Delete("/", s.deleteRegistry)
-					r.Post("/test", s.testRegistry)
-				})
+				r.Put("/", s.updateRegistry)
+				r.Delete("/", s.deleteRegistry)
+				r.Post("/test", s.testRegistry)
 			})
 			r.Get("/settings", s.getSettings)
 			r.Get("/settings/npm-certificates", s.listNpmCertificates)
@@ -282,11 +336,23 @@ func (s *Server) Router() chi.Router {
 				r.With(auth.AdminOnly).Post("/start", s.startPluginWorkload)
 				r.With(auth.AdminOnly).Delete("/", s.deletePluginWorkload)

+				// Volume snapshots (admin-only). Capture/list a workload's
+				// host-bind data volumes; {sid}-scoped download/delete live
+				// in the global admin group alongside backups.
+				r.With(auth.AdminOnly).Get("/snapshots", s.listWorkloadSnapshots)
+				r.With(auth.AdminOnly).Get("/snapshotable", s.getWorkloadSnapshotable)
+				r.With(auth.AdminOnly).Post("/snapshots", s.createWorkloadSnapshot)
+
 				// Runtime view: per-source persisted state + storage usage.
 				// Read-only; safe for any authenticated user.
 				r.Get("/runtime-state", s.getWorkloadRuntimeState)
 				r.Get("/storage", s.getWorkloadStorage)

+				// Per-workload activity / deploy timeline (read-only). Scoped
+				// to this workload's event-log rows; the global feed lives at
+				// /events/log.
+				r.Get("/events", s.listWorkloadEvents)
+
 				// Per-workload env vars. Listing open to authenticated readers;
 				// mutations admin-gated. Encrypted values are write-only after store.
 				r.Get("/env", s.listWorkloadEnv)
@@ -312,6 +378,15 @@ func (s *Server) Router() chi.Router {
 				// of /triggers/{id}/bindings keyed on the workload side.
 				r.Get("/triggers", s.listBindingsForWorkload)
 				r.With(auth.AdminOnly).Post("/triggers", s.bindTriggerToWorkload)
+
+				// Per-workload notification routes — multi-destination
+				// fan-out (Slack channel + Discord webhook + ...). When
+				// zero rows are configured the dispatcher falls back to
+				// the legacy single-URL columns on the workload row.
+				r.Get("/notifications", s.listWorkloadNotifications)
+				r.With(auth.AdminOnly).Post("/notifications", s.createWorkloadNotification)
+				r.With(auth.AdminOnly).Put("/notifications/{nid}", s.updateWorkloadNotification)
+				r.With(auth.AdminOnly).Delete("/notifications/{nid}", s.deleteWorkloadNotification)
 			})

 			// Global container index, joined to workload + app names.
@@ -370,6 +445,26 @@ func (s *Server) Router() chi.Router {
 				r.Post("/log-scan-rules/{id}/test", s.testLogScanRule)
 			})

+			// Metric-alert rules.
+			r.Get("/metric-alert-rules", s.listMetricAlertRules)
+			r.Get("/metric-alert-rules/{id}", s.getMetricAlertRule)
+			r.Group(func(r chi.Router) {
+				r.Use(auth.AdminOnly)
+				r.Post("/metric-alert-rules", s.createMetricAlertRule)
+				r.Patch("/metric-alert-rules/{id}", s.updateMetricAlertRule)
+				r.Delete("/metric-alert-rules/{id}", s.deleteMetricAlertRule)
+			})
+
+			// Shared secrets (env vars shared across workloads by scope).
+			r.Get("/shared-secrets", s.listSharedSecrets)
+			r.Get("/shared-secrets/{id}", s.getSharedSecret)
+			r.Group(func(r chi.Router) {
+				r.Use(auth.AdminOnly)
+				r.Post("/shared-secrets", s.createSharedSecret)
+				r.Patch("/shared-secrets/{id}", s.updateSharedSecret)
+				r.Delete("/shared-secrets/{id}", s.deleteSharedSecret)
+			})
+
 			// System resources (read-only).
 			r.Get("/system/stats", s.getSystemStats)
 			r.Get("/system/stats/history", s.getSystemStatsHistory)
@@ -379,6 +474,12 @@ func (s *Server) Router() chi.Router {
 			r.Group(func(r chi.Router) {
 				r.Use(auth.AdminOnly)

+				// Prometheus-format metrics export. Admin-only so the
+				// counter cardinality cannot be enumerated by a low-trust
+				// viewer to map internal endpoints / sources / outcomes.
+				// Scrape with bearer auth from your Prometheus job.
+				r.Get("/metrics", s.metricsExport)
+
 				// Config export (reveals registry/global details).
 				r.Get("/config/export", s.exportConfig)

@@ -414,6 +515,7 @@ func (s *Server) Router() chi.Router {

 				// Docker management.
 				r.Post("/docker/prune-images", s.pruneImages)
+				r.Post("/docker/prune-build-cache", s.pruneBuildCache)

 				// NPM connection test.
 				r.Post("/settings/npm/test", s.testNpmConnection)
@@ -431,6 +533,11 @@ func (s *Server) Router() chi.Router {
 				r.Get("/backups/{id}/download", s.downloadBackup)
 				r.Delete("/backups/{id}", s.deleteBackup)
 				r.Post("/backups/{id}/restore", s.restoreBackup)
+
+				// Volume-snapshot download/delete (workload-scoped capture +
+				// list live under /workloads/{id}/snapshots).
+				r.Get("/snapshots/{sid}/download", s.downloadSnapshot)
+				r.Delete("/snapshots/{sid}", s.deleteSnapshot)
 			})
 		})
 	})
@@ -0,0 +1,272 @@
+package api
+
+import (
+	"errors"
+	"log/slog"
+	"net/http"
+	"strings"
+
+	"github.com/go-chi/chi/v5"
+
+	"github.com/alexei/tinyforge/internal/crypto"
+	"github.com/alexei/tinyforge/internal/store"
+)
+
+// sharedSecretRow is the JSON shape returned to clients. The secret value is
+// NEVER returned — once stored it is write-only (mirroring workload_env). The
+// has_value flag lets the UI show whether a value is set without exposing it;
+// to rotate, the operator submits a new value.
+type sharedSecretRow struct {
+	ID          string `json:"id"`
+	Name        string `json:"name"`
+	HasValue    bool   `json:"has_value"`
+	Encrypted   bool   `json:"encrypted"`
+	Scope       string `json:"scope"`
+	AppID       string `json:"app_id"`
+	Description string `json:"description"`
+	Enabled     bool   `json:"enabled"`
+	CreatedAt   string `json:"created_at"`
+	UpdatedAt   string `json:"updated_at"`
+}
+
+func toSharedSecretRow(sec store.SharedSecret) sharedSecretRow {
+	return sharedSecretRow{
+		ID:          sec.ID,
+		Name:        sec.Name,
+		HasValue:    sec.Value != "",
+		Encrypted:   sec.Encrypted,
+		Scope:       sec.Scope,
+		AppID:       sec.AppID,
+		Description: sec.Description,
+		Enabled:     sec.Enabled,
+		CreatedAt:   sec.CreatedAt,
+		UpdatedAt:   sec.UpdatedAt,
+	}
+}
+
+// listSharedSecrets handles GET /api/shared-secrets. Values are redacted.
+func (s *Server) listSharedSecrets(w http.ResponseWriter, r *http.Request) {
+	rows, err := s.store.ListSharedSecrets()
+	if err != nil {
+		respondError(w, http.StatusInternalServerError, "list shared secrets")
+		return
+	}
+	out := make([]sharedSecretRow, 0, len(rows))
+	for _, sec := range rows {
+		out = append(out, toSharedSecretRow(sec))
+	}
+	respondJSON(w, http.StatusOK, out)
+}
+
+// getSharedSecret handles GET /api/shared-secrets/{id}. Value is redacted.
+func (s *Server) getSharedSecret(w http.ResponseWriter, r *http.Request) {
+	sec, err := s.store.GetSharedSecret(chi.URLParam(r, "id"))
+	if err != nil {
+		if errors.Is(err, store.ErrNotFound) {
+			respondNotFound(w, "shared secret")
+			return
+		}
+		respondError(w, http.StatusInternalServerError, "get shared secret")
+		return
+	}
+	respondJSON(w, http.StatusOK, toSharedSecretRow(sec))
+}
+
+// createSharedSecretRequest is the POST body. Encrypted=true (the default for
+// a non-empty value) causes the value to be encrypted at rest with the global
+// key before it ever reaches the store.
+type createSharedSecretRequest struct {
+	Name        string `json:"name"`
+	Value       string `json:"value"`
+	Encrypted   *bool  `json:"encrypted"` // defaults true
+	Scope       string `json:"scope"`     // global | app
+	AppID       string `json:"app_id"`    // required when scope == app
+	Description string `json:"description"`
+	Enabled     *bool  `json:"enabled"` // defaults true
+}
+
+func (s *Server) createSharedSecret(w http.ResponseWriter, r *http.Request) {
+	var req createSharedSecretRequest
+	if !decodeJSONStrict(w, r, &req) {
+		return
+	}
+	req.Name = strings.TrimSpace(req.Name)
+	if !validEnvKey(req.Name) {
+		respondError(w, http.StatusBadRequest, "name must be a valid env key [A-Za-z_][A-Za-z0-9_]*")
+		return
+	}
+	if msg := validateSharedSecretScope(req.Scope, req.AppID); msg != "" {
+		respondError(w, http.StatusBadRequest, msg)
+		return
+	}
+
+	encrypted := true
+	if req.Encrypted != nil {
+		encrypted = *req.Encrypted
+	}
+	enabled := true
+	if req.Enabled != nil {
+		enabled = *req.Enabled
+	}
+
+	value, err := s.encryptSecretValue(req.Value, encrypted)
+	if err != nil {
+		respondError(w, http.StatusInternalServerError, "encrypt value")
+		return
+	}
+
+	sec, err := s.store.CreateSharedSecret(store.SharedSecret{
+		Name:        req.Name,
+		Value:       value,
+		Encrypted:   encrypted,
+		Scope:       req.Scope,
+		AppID:       strings.TrimSpace(req.AppID),
+		Description: req.Description,
+		Enabled:     enabled,
+	})
+	if err != nil {
+		if errors.Is(err, store.ErrUnique) {
+			respondError(w, http.StatusConflict, "a shared secret with this scope and name already exists")
+			return
+		}
+		respondError(w, http.StatusInternalServerError, "create shared secret")
+		return
+	}
+	respondJSON(w, http.StatusCreated, toSharedSecretRow(sec))
+}
+
+// updateSharedSecretRequest is the PATCH body. Every field is optional; nil
+// means "leave unchanged". A nil Value preserves the stored ciphertext (so a
+// metadata-only edit can't accidentally blank a secret); a non-nil Value
+// rotates it (re-encrypted under the effective Encrypted flag).
+type updateSharedSecretRequest struct {
+	Name        *string `json:"name"`
+	Value       *string `json:"value"`
+	Encrypted   *bool   `json:"encrypted"`
+	Scope       *string `json:"scope"`
+	AppID       *string `json:"app_id"`
+	Description *string `json:"description"`
+	Enabled     *bool   `json:"enabled"`
+}
+
+func (s *Server) updateSharedSecret(w http.ResponseWriter, r *http.Request) {
+	id := chi.URLParam(r, "id")
+	existing, err := s.store.GetSharedSecret(id)
+	if err != nil {
+		if errors.Is(err, store.ErrNotFound) {
+			respondNotFound(w, "shared secret")
+			return
+		}
+		respondError(w, http.StatusInternalServerError, "get shared secret")
+		return
+	}
+
+	var req updateSharedSecretRequest
+	if !decodeJSONStrict(w, r, &req) {
+		return
+	}
+
+	merged := existing
+	if req.Name != nil {
+		merged.Name = strings.TrimSpace(*req.Name)
+		if !validEnvKey(merged.Name) {
+			respondError(w, http.StatusBadRequest, "name must be a valid env key [A-Za-z_][A-Za-z0-9_]*")
+			return
+		}
+	}
+	if req.Encrypted != nil {
+		merged.Encrypted = *req.Encrypted
+	}
+	if req.Scope != nil {
+		merged.Scope = *req.Scope
+	}
+	if req.AppID != nil {
+		merged.AppID = strings.TrimSpace(*req.AppID)
+	}
+	if req.Description != nil {
+		merged.Description = *req.Description
+	}
+	if req.Enabled != nil {
+		merged.Enabled = *req.Enabled
+	}
+	if msg := validateSharedSecretScope(merged.Scope, merged.AppID); msg != "" {
+		respondError(w, http.StatusBadRequest, msg)
+		return
+	}
+
+	// Value handling: only (re)encrypt when the caller supplied a new value.
+	// Otherwise keep the stored ciphertext untouched — but if the Encrypted
+	// flag flipped without a new value we cannot transcode the opaque stored
+	// bytes, so reject that ambiguous request rather than corrupting the row.
+	if req.Value != nil {
+		v, encErr := s.encryptSecretValue(*req.Value, merged.Encrypted)
+		if encErr != nil {
+			respondError(w, http.StatusInternalServerError, "encrypt value")
+			return
+		}
+		merged.Value = v
+	} else if req.Encrypted != nil && *req.Encrypted != existing.Encrypted {
+		respondError(w, http.StatusBadRequest, "changing 'encrypted' requires resubmitting 'value'")
+		return
+	}
+
+	sec, err := s.store.UpdateSharedSecret(merged)
+	if err != nil {
+		if errors.Is(err, store.ErrNotFound) {
+			respondNotFound(w, "shared secret")
+			return
+		}
+		if errors.Is(err, store.ErrUnique) {
+			respondError(w, http.StatusConflict, "a shared secret with this scope and name already exists")
+			return
+		}
+		respondError(w, http.StatusInternalServerError, "update shared secret")
+		return
+	}
+	respondJSON(w, http.StatusOK, toSharedSecretRow(sec))
+}
+
+func (s *Server) deleteSharedSecret(w http.ResponseWriter, r *http.Request) {
+	id := chi.URLParam(r, "id")
+	if err := s.store.DeleteSharedSecret(id); err != nil {
+		if errors.Is(err, store.ErrNotFound) {
+			respondNotFound(w, "shared secret")
+			return
+		}
+		respondError(w, http.StatusInternalServerError, "delete shared secret")
+		return
+	}
+	respondJSON(w, http.StatusOK, map[string]string{"deleted": id})
+}
+
+// encryptSecretValue encrypts value with the global key when encrypted is set
+// and the value is non-empty; otherwise it returns the value unchanged. An
+// empty value stays empty (no value set) regardless of the flag.
+func (s *Server) encryptSecretValue(value string, encrypted bool) (string, error) {
+	if !encrypted || value == "" {
+		return value, nil
+	}
+	enc, err := crypto.Encrypt(s.encKey, value)
+	if err != nil {
+		slog.Error("encrypt shared secret value", "error", err)
+		return "", err
+	}
+	return enc, nil
+}
+
+// validateSharedSecretScope returns a non-empty 400 message when the scope /
+// app_id pairing is invalid; "" when valid. Mirrors the store-side invariant
+// so the API rejects with a clear message before hitting the store.
+func validateSharedSecretScope(scope, appID string) string {
+	switch scope {
+	case store.SharedSecretScopeGlobal:
+		return ""
+	case store.SharedSecretScopeApp:
+		if strings.TrimSpace(appID) == "" {
+			return "app_id is required when scope is 'app'"
+		}
+		return ""
+	default:
+		return "scope must be 'global' or 'app'"
+	}
+}
@@ -32,9 +32,26 @@ func (s *Server) streamEvents(w http.ResponseWriter, r *http.Request) {
 	w.WriteHeader(http.StatusOK)
 	flusher.Flush()

-	// Subscribe to instance status, deploy status, and persistent event log events.
+	// Build logs are high-volume: a single verbose `docker build` can emit
+	// thousands of lines. Streaming them to EVERY connection would flood each
+	// subscriber's bounded bus buffer and evict status/log events for ALL
+	// clients. So build logs are delivered ONLY to connections that opt in
+	// with ?workload_id=<id>, and only for that workload. Connections without
+	// the param (e.g. the global dashboard) never receive build-log frames.
+	buildLogWorkloadID := r.URL.Query().Get("workload_id")
 	sub := s.eventBus.Subscribe(func(evt events.Event) bool {
-		return evt.Type == events.EventInstanceStatus || evt.Type == events.EventDeployStatus || evt.Type == events.EventLog
+		switch evt.Type {
+		case events.EventInstanceStatus, events.EventDeployStatus, events.EventLog:
+			return true
+		case events.EventBuildLog:
+			if buildLogWorkloadID == "" {
+				return false
+			}
+			p, ok := evt.Payload.(events.BuildLogPayload)
+			return ok && p.WorkloadID == buildLogWorkloadID
+		default:
+			return false
+		}
 	})
 	defer s.eventBus.Unsubscribe(sub)

@@ -89,12 +89,16 @@ func toTriggerViewWithCount(row store.TriggerWithBindingCount) triggerView {
 // triggerRequest is the create/update body. Config is opaque per kind.
 // Auto-generates a webhook secret on create when WebhookEnabled is true;
 // the secret is exposed only via the /webhook subresource.
+//
+// WebhookRequireSignature is a *bool so we can distinguish "field omitted
+// by client" (nil → apply secure default of true when webhook is enabled)
+// from an explicit opt-out (false → respected).
 type triggerRequest struct {
 	Kind                    string          `json:"kind"`
 	Name                    string          `json:"name"`
 	Config                  json.RawMessage `json:"config"`
 	WebhookEnabled          bool            `json:"webhook_enabled"`
-	WebhookRequireSignature bool            `json:"webhook_require_signature"`
+	WebhookRequireSignature *bool           `json:"webhook_require_signature,omitempty"`
 }

 // Same per-blob caps used on the workload pluginWorkloadRequest path —
@@ -134,12 +138,26 @@ func (s *Server) getTrigger(w http.ResponseWriter, r *http.Request) {
 // buildTriggerFromRequest assembles a store.Trigger ready for insert.
 // Centralized so the standalone create endpoint and the inline-bind
 // endpoint cannot drift on secret-generation defaults.
+//
+// SECURITY: a new trigger with webhook enabled defaults to require_signature
+// = true. Operators can opt out at create time for receivers that do not
+// support HMAC, but the safer default avoids the "freshly-created trigger
+// accepts unsigned posts to its URL" footgun.
 func buildTriggerFromRequest(req triggerRequest) store.Trigger {
+	// Secure default: if webhook is enabled and the operator did NOT
+	// explicitly set require_signature, force it on. Explicit false is
+	// preserved (legacy receivers without HMAC support still work).
+	requireSig := false
+	if req.WebhookRequireSignature != nil {
+		requireSig = *req.WebhookRequireSignature
+	} else if req.WebhookEnabled {
+		requireSig = true
+	}
 	t := store.Trigger{
 		Kind:                    req.Kind,
 		Name:                    strings.TrimSpace(req.Name),
 		Config:                  string(req.Config),
-		WebhookRequireSignature: req.WebhookRequireSignature,
+		WebhookRequireSignature: requireSig,
 	}
 	if req.WebhookEnabled {
 		t.WebhookSecret = generateWebhookSecret()
@@ -199,7 +217,13 @@ func (s *Server) updateTrigger(w http.ResponseWriter, r *http.Request) {
 	if len(req.Config) > 0 {
 		existing.Config = string(req.Config)
 	}
-	existing.WebhookRequireSignature = req.WebhookRequireSignature
+	if req.WebhookRequireSignature != nil {
+		existing.WebhookRequireSignature = *req.WebhookRequireSignature
+	} else if req.WebhookEnabled && !existing.WebhookRequireSignature {
+		// Re-enabling webhook without specifying the signature flag —
+		// take the secure default.
+		existing.WebhookRequireSignature = true
+	}
 	wasEnabled := existing.WebhookSecret != ""
 	if req.WebhookEnabled && !wasEnabled {
 		// false→true transition: rotate both secrets so re-enabling
@@ -0,0 +1,177 @@
+package api
+
+import (
+	"encoding/json"
+	"errors"
+	"io"
+	"log/slog"
+	"net/http"
+	"os"
+	"path/filepath"
+
+	"github.com/go-chi/chi/v5"
+
+	"github.com/alexei/tinyforge/internal/store"
+	"github.com/alexei/tinyforge/internal/volsnap"
+)
+
+// listWorkloadSnapshots handles GET /api/workloads/{id}/snapshots.
+func (s *Server) listWorkloadSnapshots(w http.ResponseWriter, r *http.Request) {
+	if s.snapshotEngine == nil {
+		respondError(w, http.StatusServiceUnavailable, "snapshot engine not initialized")
+		return
+	}
+	id := chi.URLParam(r, "id")
+	snaps, err := s.snapshotEngine.List(id)
+	if err != nil {
+		slog.Error("snapshots: list", "workload", id, "error", err)
+		respondError(w, http.StatusInternalServerError, "internal server error")
+		return
+	}
+	respondJSON(w, http.StatusOK, snaps)
+}
+
+// snapshotableVolume is the sanitized view of a volume in the snapshotable
+// response — it omits the resolved host path so internal layout is not leaked.
+type snapshotableVolume struct {
+	Target string `json:"target"`
+	Scope  string `json:"scope"`
+	Source string `json:"source"`
+}
+
+// getWorkloadSnapshotable handles GET /api/workloads/{id}/snapshotable. It
+// tells the UI which volumes can be snapshotted and which are skipped (and
+// why), so users are never misled about coverage.
+func (s *Server) getWorkloadSnapshotable(w http.ResponseWriter, r *http.Request) {
+	if s.snapshotEngine == nil {
+		respondError(w, http.StatusServiceUnavailable, "snapshot engine not initialized")
+		return
+	}
+	id := chi.URLParam(r, "id")
+	workload, err := s.store.GetWorkloadByID(id)
+	if err != nil {
+		respondError(w, http.StatusNotFound, "workload not found")
+		return
+	}
+	settings, err := s.store.GetSettings()
+	if err != nil {
+		respondError(w, http.StatusInternalServerError, "internal server error")
+		return
+	}
+	refs, skipped, err := volsnap.SnapshotableVolumes(s.store, workload, settings)
+	if err != nil {
+		slog.Error("snapshots: enumerate", "workload", id, "error", err)
+		respondError(w, http.StatusInternalServerError, "internal server error")
+		return
+	}
+
+	volumes := make([]snapshotableVolume, 0, len(refs))
+	for _, ref := range refs {
+		volumes = append(volumes, snapshotableVolume{Target: ref.Target, Scope: ref.Scope, Source: ref.Source})
+	}
+	if skipped == nil {
+		skipped = []volsnap.SkippedVolume{}
+	}
+	respondJSON(w, http.StatusOK, map[string]any{
+		"volumes": volumes,
+		"skipped": skipped,
+	})
+}
+
+// createWorkloadSnapshot handles POST /api/workloads/{id}/snapshots.
+func (s *Server) createWorkloadSnapshot(w http.ResponseWriter, r *http.Request) {
+	if s.snapshotEngine == nil {
+		respondError(w, http.StatusServiceUnavailable, "snapshot engine not initialized")
+		return
+	}
+	id := chi.URLParam(r, "id")
+	workload, err := s.store.GetWorkloadByID(id)
+	if err != nil {
+		respondError(w, http.StatusNotFound, "workload not found")
+		return
+	}
+	settings, err := s.store.GetSettings()
+	if err != nil {
+		respondError(w, http.StatusInternalServerError, "internal server error")
+		return
+	}
+
+	var body struct {
+		Label string `json:"label"`
+	}
+	if r.ContentLength != 0 {
+		if err := json.NewDecoder(io.LimitReader(r.Body, 1<<20)).Decode(&body); err != nil && !errors.Is(err, io.EOF) {
+			respondError(w, http.StatusBadRequest, "invalid JSON body")
+			return
+		}
+	}
+
+	snap, err := s.snapshotEngine.Create(workload, settings, body.Label)
+	if err != nil {
+		// "no snapshottable volume data" is client-actionable (400, safe to
+		// echo). Any other error is server-side: log the detail, return a
+		// generic 500 so internal paths / DB text never reach the client.
+		if errors.Is(err, volsnap.ErrNoSnapshotData) {
+			respondError(w, http.StatusBadRequest, err.Error())
+			return
+		}
+		slog.Error("snapshots: create", "workload", id, "error", err)
+		respondError(w, http.StatusInternalServerError, "internal server error")
+		return
+	}
+	respondJSON(w, http.StatusCreated, snap)
+}
+
+// deleteSnapshot handles DELETE /api/snapshots/{sid}.
+func (s *Server) deleteSnapshot(w http.ResponseWriter, r *http.Request) {
+	if s.snapshotEngine == nil {
+		respondError(w, http.StatusServiceUnavailable, "snapshot engine not initialized")
+		return
+	}
+	sid := chi.URLParam(r, "sid")
+	if err := s.snapshotEngine.Delete(sid); err != nil {
+		if errors.Is(err, store.ErrNotFound) {
+			respondError(w, http.StatusNotFound, "snapshot not found")
+			return
+		}
+		respondError(w, http.StatusInternalServerError, "failed to delete snapshot")
+		return
+	}
+	respondJSON(w, http.StatusOK, map[string]string{"status": "deleted"})
+}
+
+// downloadSnapshot handles GET /api/snapshots/{sid}/download, streaming the
+// tar.gz archive. The resolved path is containment-checked against the
+// snapshot directory.
+func (s *Server) downloadSnapshot(w http.ResponseWriter, r *http.Request) {
+	if s.snapshotEngine == nil {
+		respondError(w, http.StatusServiceUnavailable, "snapshot engine not initialized")
+		return
+	}
+	sid := chi.URLParam(r, "sid")
+	snap, err := s.snapshotEngine.Get(sid)
+	if err != nil {
+		respondError(w, http.StatusNotFound, "snapshot not found")
+		return
+	}
+	path, err := s.snapshotEngine.FilePath(snap)
+	if err != nil {
+		respondError(w, http.StatusForbidden, "access denied")
+		return
+	}
+	f, err := os.Open(path)
+	if err != nil {
+		respondError(w, http.StatusNotFound, "snapshot file not found on disk")
+		return
+	}
+	defer f.Close()
+	stat, err := f.Stat()
+	if err != nil {
+		respondError(w, http.StatusInternalServerError, "failed to read snapshot file")
+		return
+	}
+	name := filepath.Base(snap.Filename)
+	w.Header().Set("Content-Type", "application/gzip")
+	w.Header().Set("Content-Disposition", "attachment; filename=\""+name+"\"")
+	http.ServeContent(w, r, name, stat.ModTime(), f)
+}
@@ -0,0 +1,178 @@
+package api
+
+import (
+	"io"
+	"net/http"
+	"net/http/httptest"
+	"os"
+	"path/filepath"
+	"testing"
+
+	"github.com/alexei/tinyforge/internal/auth"
+	"github.com/alexei/tinyforge/internal/store"
+	"github.com/alexei/tinyforge/internal/volsnap"
+	"github.com/alexei/tinyforge/internal/webhook"
+)
+
+// newSnapshotEnv builds an API test env with the volume-snapshot engine wired
+// (the shared newAPITestEnv does not wire it). dataDir holds the snapshot
+// archives; baseVol is where host-bind volume directories resolve.
+func newSnapshotEnv(t *testing.T) (*apiTestEnv, string) {
+	t.Helper()
+	st, err := store.New(":memory:")
+	if err != nil {
+		t.Fatalf("create store: %v", err)
+	}
+	t.Cleanup(func() { st.Close() })
+
+	encKey := [32]byte{1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
+	dispatcher := &fakeAPIDispatcher{}
+	wh := webhook.NewHandler(st)
+	wh.SetPluginDispatcher(dispatcher)
+	srv := NewServer(st, nil, nil, nil, dispatcher, nil, wh, nil, encKey)
+
+	snapEng, err := volsnap.New(st, t.TempDir())
+	if err != nil {
+		t.Fatalf("snapshot engine: %v", err)
+	}
+	srv.SetSnapshotEngine(snapEng)
+
+	httpsrv := httptest.NewServer(srv.Router())
+	t.Cleanup(httpsrv.Close)
+
+	la := auth.NewLocalAuth(encKey)
+	tok, err := la.GenerateToken(auth.Claims{UserID: "u-admin", Username: "admin", Role: "admin"})
+	if err != nil {
+		t.Fatalf("mint token: %v", err)
+	}
+
+	baseVol := t.TempDir()
+	settings, _ := st.GetSettings()
+	settings.BaseVolumePath = baseVol
+	if err := st.UpdateSettings(settings); err != nil {
+		t.Fatalf("update settings: %v", err)
+	}
+
+	return &apiTestEnv{srv: httpsrv, store: st, dispatcher: dispatcher, adminToken: tok.Token, encKey: encKey}, baseVol
+}
+
+func TestVolumeSnapshots_EndToEnd(t *testing.T) {
+	e, baseVol := newSnapshotEnv(t)
+
+	w, err := e.store.CreateWorkload(store.Workload{
+		Name:         "data-app",
+		Kind:         "project",
+		SourceKind:   "image",
+		SourceConfig: `{"image":"registry.example.com/owner/app","port":8080}`,
+	})
+	if err != nil {
+		t.Fatalf("create workload: %v", err)
+	}
+	if _, err := e.store.SetWorkloadVolume(store.WorkloadVolume{
+		WorkloadID: w.ID, Target: "/data", Source: "data", Scope: "project",
+	}); err != nil {
+		t.Fatalf("set volume: %v", err)
+	}
+
+	// Materialize the resolved host-bind dir with a file so there is data to
+	// capture. Layout mirrors ResolveWorkloadPath for project scope:
+	// <baseVol>/<name>-<id8>/<source>.
+	id8 := w.ID
+	if len(id8) > 8 {
+		id8 = id8[:8]
+	}
+	hostDir := filepath.Join(baseVol, "data-app-"+id8, "data")
+	if err := os.MkdirAll(hostDir, 0o755); err != nil {
+		t.Fatal(err)
+	}
+	if err := os.WriteFile(filepath.Join(hostDir, "payload.txt"), []byte("important"), 0o644); err != nil {
+		t.Fatal(err)
+	}
+
+	// snapshotable lists the one host-bind volume.
+	resp := e.do(t, http.MethodGet, "/api/workloads/"+w.ID+"/snapshotable", nil)
+	if resp.StatusCode != http.StatusOK {
+		t.Fatalf("snapshotable status = %d", resp.StatusCode)
+	}
+	var snapable struct {
+		Volumes []map[string]string `json:"volumes"`
+		Skipped []map[string]string `json:"skipped"`
+	}
+	decodeEnvelope(t, resp, &snapable)
+	if len(snapable.Volumes) != 1 || snapable.Volumes[0]["target"] != "/data" {
+		t.Fatalf("expected 1 snapshotable volume /data, got %+v", snapable)
+	}
+
+	// Create a snapshot.
+	resp = e.do(t, http.MethodPost, "/api/workloads/"+w.ID+"/snapshots", map[string]string{"label": "before upgrade"})
+	if resp.StatusCode != http.StatusCreated {
+		t.Fatalf("create snapshot status = %d", resp.StatusCode)
+	}
+	var snap store.VolumeSnapshot
+	decodeEnvelope(t, resp, &snap)
+	if snap.ID == "" || snap.SizeBytes == 0 || snap.Label != "before upgrade" {
+		t.Fatalf("unexpected snapshot: %+v", snap)
+	}
+
+	// It appears in the list.
+	resp = e.do(t, http.MethodGet, "/api/workloads/"+w.ID+"/snapshots", nil)
+	var list []store.VolumeSnapshot
+	decodeEnvelope(t, resp, &list)
+	if len(list) != 1 || list[0].ID != snap.ID {
+		t.Fatalf("expected 1 snapshot in list, got %+v", list)
+	}
+
+	// Download streams a non-empty gzip archive (not the JSON envelope).
+	resp = e.do(t, http.MethodGet, "/api/snapshots/"+snap.ID+"/download", nil)
+	if resp.StatusCode != http.StatusOK {
+		t.Fatalf("download status = %d", resp.StatusCode)
+	}
+	if ct := resp.Header.Get("Content-Type"); ct != "application/gzip" {
+		t.Errorf("download content-type = %q, want application/gzip", ct)
+	}
+	data, _ := io.ReadAll(resp.Body)
+	resp.Body.Close()
+	if len(data) == 0 {
+		t.Error("download body is empty")
+	}
+
+	// Delete removes it.
+	resp = e.do(t, http.MethodDelete, "/api/snapshots/"+snap.ID, nil)
+	if resp.StatusCode != http.StatusOK {
+		t.Fatalf("delete status = %d", resp.StatusCode)
+	}
+	resp = e.do(t, http.MethodGet, "/api/workloads/"+w.ID+"/snapshots", nil)
+	var after []store.VolumeSnapshot
+	decodeEnvelope(t, resp, &after)
+	if len(after) != 0 {
+		t.Fatalf("expected 0 snapshots after delete, got %d", len(after))
+	}
+}
+
+func TestCreateSnapshot_NoVolumeData_Returns400(t *testing.T) {
+	e, _ := newSnapshotEnv(t)
+	w, err := e.store.CreateWorkload(store.Workload{
+		Name:         "no-vol-app",
+		Kind:         "project",
+		SourceKind:   "image",
+		SourceConfig: `{"image":"x","port":80}`,
+	})
+	if err != nil {
+		t.Fatalf("create workload: %v", err)
+	}
+	resp := e.do(t, http.MethodPost, "/api/workloads/"+w.ID+"/snapshots", nil)
+	if resp.StatusCode != http.StatusBadRequest {
+		t.Fatalf("expected 400 for an app with no snapshottable volumes, got %d", resp.StatusCode)
+	}
+	resp.Body.Close()
+}
+
+func TestSnapshotEndpoints_RequireWorkload(t *testing.T) {
+	e, _ := newSnapshotEnv(t)
+	// snapshotable on an unknown workload → 404.
+	resp := e.do(t, http.MethodGet, "/api/workloads/does-not-exist/snapshotable", nil)
+	if resp.StatusCode != http.StatusNotFound {
+		t.Fatalf("snapshotable unknown workload = %d, want 404", resp.StatusCode)
+	}
+	resp.Body.Close()
+}
@@ -13,18 +13,29 @@ import (
 	"github.com/alexei/tinyforge/internal/auth"
 	"github.com/alexei/tinyforge/internal/store"
 	"github.com/alexei/tinyforge/internal/workload/plugin"
+	"github.com/alexei/tinyforge/internal/workload/preview"
 )

 // chainNode is the lightweight shape returned by /chain — we deliberately
 // don't return full plugin.Workload values for ancestor/descendant rows
 // because the secret fields don't belong in a chain-traversal response.
+//
+// IsPreview / PreviewBranch surface branch-preview children to the UI so it
+// can render them in a dedicated "Preview environments" panel rather than as
+// undistinguished stage children. They are computed against the chain's
+// `self` workload via preview.IsPreviewChild — the canonical "this child is a
+// branch preview" test that reverses the MaterializeForBranch naming formula.
+// Both are zero-valued (false / "") for the parent and self nodes and for
+// operator-created stage children.
 type chainNode struct {
-	ID         string `json:"id"`
-	Name       string `json:"name"`
-	SourceKind string `json:"source_kind"`
-	TriggerKind string `json:"trigger_kind"`
-	CreatedAt  string `json:"created_at"`
-	UpdatedAt  string `json:"updated_at"`
+	ID            string `json:"id"`
+	Name          string `json:"name"`
+	SourceKind    string `json:"source_kind"`
+	TriggerKind   string `json:"trigger_kind"`
+	IsPreview     bool   `json:"is_preview"`
+	PreviewBranch string `json:"preview_branch,omitempty"`
+	CreatedAt     string `json:"created_at"`
+	UpdatedAt     string `json:"updated_at"`
 }

 func chainNodeOf(w store.Workload) chainNode {
@@ -38,6 +49,32 @@ func chainNodeOf(w store.Workload) chainNode {
 	}
 }

+// previewBranchOf extracts the branch a preview child was materialized for
+// from its source_config (the `branch` key MaterializeForBranch wrote).
+// Returns "" on a missing/malformed config — the caller only calls this for
+// rows preview.IsPreviewChild already confirmed, so a blank result just means
+// the JSON couldn't be decoded.
+func previewBranchOf(w store.Workload) string {
+	var cfg struct {
+		Branch string `json:"branch"`
+	}
+	if w.SourceConfig != "" {
+		_ = json.Unmarshal([]byte(w.SourceConfig), &cfg)
+	}
+	return cfg.Branch
+}
+
+// childChainNode builds a chainNode for a child row, marking it as a branch
+// preview (and attaching its branch) when it was materialized from `self`.
+func childChainNode(self, child store.Workload) chainNode {
+	node := chainNodeOf(child)
+	if preview.IsPreviewChild(self, child) {
+		node.IsPreview = true
+		node.PreviewBranch = previewBranchOf(child)
+	}
+	return node
+}
+
 // getWorkloadChain handles GET /api/workloads/{id}/chain.
 //
 // Returns the workload's parent (or nil), itself, and its direct children
@@ -76,7 +113,7 @@ func (s *Server) getWorkloadChain(w http.ResponseWriter, r *http.Request) {
 	}
 	children := make([]chainNode, 0, len(childRows))
 	for _, c := range childRows {
-		children = append(children, chainNodeOf(c))
+		children = append(children, childChainNode(self, c))
 	}

 	respondJSON(w, http.StatusOK, map[string]any{
@@ -0,0 +1,147 @@
+package api
+
+import (
+	"testing"
+
+	"github.com/alexei/tinyforge/internal/store"
+)
+
+// TestChildChainNode_MarksPreviewChildren verifies the /chain DTO builder
+// distinguishes branch-preview children (materialized by the preview package)
+// from operator-created stage children that merely share the parent link.
+// The discriminator is preview.IsPreviewChild, which reverses the
+// MaterializeForBranch naming formula: name == template.Name + "/" + slug.
+func TestChildChainNode_MarksPreviewChildren(t *testing.T) {
+	template := store.Workload{
+		ID:         "tmpl-1",
+		Name:       "myapp",
+		SourceKind: "dockerfile",
+	}
+
+	tests := []struct {
+		name       string
+		child      store.Workload
+		wantPrev   bool
+		wantBranch string
+	}{
+		{
+			name: "preview child is marked with its branch",
+			child: store.Workload{
+				ID:               "child-prev",
+				Name:             "myapp/feat-login",
+				SourceKind:       "dockerfile",
+				SourceConfig:     `{"branch":"feat/login","port":3000}`,
+				ParentWorkloadID: "tmpl-1",
+			},
+			wantPrev:   true,
+			wantBranch: "feat/login",
+		},
+		{
+			name: "operator-named stage child sharing the parent is not a preview",
+			child: store.Workload{
+				ID:               "child-stage",
+				Name:             "myapp-staging",
+				SourceKind:       "dockerfile",
+				SourceConfig:     `{"branch":"main"}`,
+				ParentWorkloadID: "tmpl-1",
+			},
+			wantPrev:   false,
+			wantBranch: "",
+		},
+		{
+			name: "child of a different parent is not a preview of self",
+			child: store.Workload{
+				ID:               "child-other",
+				Name:             "myapp/feat-login",
+				SourceKind:       "dockerfile",
+				SourceConfig:     `{"branch":"feat/login"}`,
+				ParentWorkloadID: "some-other-template",
+			},
+			wantPrev:   false,
+			wantBranch: "",
+		},
+		{
+			name: "child with no branch in source_config is not a preview",
+			child: store.Workload{
+				ID:               "child-nobranch",
+				Name:             "myapp/feat-login",
+				SourceKind:       "dockerfile",
+				SourceConfig:     `{}`,
+				ParentWorkloadID: "tmpl-1",
+			},
+			wantPrev:   false,
+			wantBranch: "",
+		},
+		{
+			// Same parent + a valid branch, but the name carries an extra
+			// suffix so it fails ONLY the slug-equality check (expected
+			// "myapp/feat-login", got "myapp/feat-login-staging"). The
+			// branch alone must not be enough to mark a preview.
+			name: "valid branch but name fails the slug match is not a preview",
+			child: store.Workload{
+				ID:               "child-slugmiss",
+				Name:             "myapp/feat-login-staging",
+				SourceKind:       "dockerfile",
+				SourceConfig:     `{"branch":"feat/login","port":3000}`,
+				ParentWorkloadID: "tmpl-1",
+			},
+			wantPrev:   false,
+			wantBranch: "",
+		},
+		{
+			// Uppercase + slash branch: slugifyBranch lowercases and maps
+			// "/" -> "-", so "Feature/Login" -> "feature-login" and the name
+			// "myapp/feature-login" matches. PreviewBranch must echo the RAW
+			// branch from source_config ("Feature/Login"), not the slug.
+			name: "uppercase slash branch matches and keeps raw branch",
+			child: store.Workload{
+				ID:               "child-upper",
+				Name:             "myapp/feature-login",
+				SourceKind:       "dockerfile",
+				SourceConfig:     `{"branch":"Feature/Login","port":8080}`,
+				ParentWorkloadID: "tmpl-1",
+			},
+			wantPrev:   true,
+			wantBranch: "Feature/Login",
+		},
+	}
+
+	for _, tc := range tests {
+		t.Run(tc.name, func(t *testing.T) {
+			node := childChainNode(template, tc.child)
+			if node.IsPreview != tc.wantPrev {
+				t.Errorf("IsPreview = %v, want %v", node.IsPreview, tc.wantPrev)
+			}
+			if node.PreviewBranch != tc.wantBranch {
+				t.Errorf("PreviewBranch = %q, want %q", node.PreviewBranch, tc.wantBranch)
+			}
+			// Base fields must always round-trip regardless of preview status.
+			if node.ID != tc.child.ID || node.Name != tc.child.Name {
+				t.Errorf("base fields mangled: got id=%q name=%q", node.ID, node.Name)
+			}
+		})
+	}
+}
+
+// TestPreviewBranchOf_ToleratesMalformedConfig confirms the branch extractor
+// returns "" rather than panicking on a missing or invalid source_config.
+func TestPreviewBranchOf_ToleratesMalformedConfig(t *testing.T) {
+	cases := []struct {
+		name string
+		cfg  string
+		want string
+	}{
+		{"valid branch", `{"branch":"release/v1"}`, "release/v1"},
+		{"empty config", ``, ""},
+		{"empty object", `{}`, ""},
+		{"malformed json", `{not-json`, ""},
+	}
+	for _, c := range cases {
+		t.Run(c.name, func(t *testing.T) {
+			got := previewBranchOf(store.Workload{SourceConfig: c.cfg})
+			if got != c.want {
+				t.Errorf("previewBranchOf(%q) = %q, want %q", c.cfg, got, c.want)
+			}
+		})
+	}
+}
@@ -0,0 +1,231 @@
+package api
+
+import (
+	"errors"
+	"log/slog"
+	"net/http"
+	"strings"
+
+	"github.com/go-chi/chi/v5"
+
+	"github.com/alexei/tinyforge/internal/crypto"
+	"github.com/alexei/tinyforge/internal/store"
+)
+
+// workloadNotificationRow is the JSON shape returned to clients. The
+// `secret_set` boolean replaces the actual ciphertext: once stored a
+// secret is write-only, mirroring how workload_env hides encrypted
+// values. Rotating means submitting a new value.
+type workloadNotificationRow struct {
+	ID         string `json:"id"`
+	WorkloadID string `json:"workload_id"`
+	Name       string `json:"name"`
+	URL        string `json:"url"`
+	SecretSet  bool   `json:"secret_set"`
+	EventTypes string `json:"event_types"`
+	Enabled    bool   `json:"enabled"`
+	SortOrder  int    `json:"sort_order"`
+	CreatedAt  string `json:"created_at"`
+	UpdatedAt  string `json:"updated_at"`
+}
+
+func toWorkloadNotificationRow(n store.WorkloadNotification) workloadNotificationRow {
+	return workloadNotificationRow{
+		ID:         n.ID,
+		WorkloadID: n.WorkloadID,
+		Name:       n.Name,
+		URL:        n.URL,
+		SecretSet:  n.Secret != "",
+		EventTypes: n.EventTypes,
+		Enabled:    n.Enabled,
+		SortOrder:  n.SortOrder,
+		CreatedAt:  n.CreatedAt,
+		UpdatedAt:  n.UpdatedAt,
+	}
+}
+
+func (s *Server) listWorkloadNotifications(w http.ResponseWriter, r *http.Request) {
+	id := chi.URLParam(r, "id")
+	if _, err := s.store.GetWorkloadByID(id); err != nil {
+		if errors.Is(err, store.ErrNotFound) {
+			respondNotFound(w, "workload")
+			return
+		}
+		respondError(w, http.StatusInternalServerError, "get workload")
+		return
+	}
+	rows, err := s.store.ListWorkloadNotifications(id)
+	if err != nil {
+		respondError(w, http.StatusInternalServerError, "list workload notifications")
+		return
+	}
+	out := make([]workloadNotificationRow, 0, len(rows))
+	for _, n := range rows {
+		out = append(out, toWorkloadNotificationRow(n))
+	}
+	respondJSON(w, http.StatusOK, out)
+}
+
+// workloadNotificationRequest is the POST/PUT body. Secret is the raw
+// plaintext webhook signing key; the server encrypts it at rest with
+// the global encryption key before INSERT. An empty Secret on UPDATE
+// leaves the stored secret untouched so the operator can edit the URL
+// or event filter without re-entering the secret each time.
+type workloadNotificationRequest struct {
+	Name       string `json:"name"`
+	URL        string `json:"url"`
+	Secret     string `json:"secret"`
+	EventTypes string `json:"event_types"`
+	Enabled    *bool  `json:"enabled"`
+	SortOrder  int    `json:"sort_order"`
+}
+
+func (s *Server) createWorkloadNotification(w http.ResponseWriter, r *http.Request) {
+	id := chi.URLParam(r, "id")
+	if _, err := s.store.GetWorkloadByID(id); err != nil {
+		if errors.Is(err, store.ErrNotFound) {
+			respondNotFound(w, "workload")
+			return
+		}
+		respondError(w, http.StatusInternalServerError, "get workload")
+		return
+	}
+	var req workloadNotificationRequest
+	if !decodeJSONStrict(w, r, &req) {
+		return
+	}
+	req.URL = strings.TrimSpace(req.URL)
+	req.Name = strings.TrimSpace(req.Name)
+	if req.URL == "" {
+		respondError(w, http.StatusBadRequest, "url is required")
+		return
+	}
+	encSecret := ""
+	if req.Secret != "" {
+		v, err := crypto.Encrypt(s.encKey, req.Secret)
+		if err != nil {
+			slog.Error("workload notifications: encrypt secret", "workload", id, "error", err)
+			respondError(w, http.StatusInternalServerError, "encrypt secret")
+			return
+		}
+		encSecret = v
+	}
+	enabled := true
+	if req.Enabled != nil {
+		enabled = *req.Enabled
+	}
+	created, err := s.store.CreateWorkloadNotification(store.WorkloadNotification{
+		WorkloadID: id,
+		Name:       req.Name,
+		URL:        req.URL,
+		Secret:     encSecret,
+		EventTypes: req.EventTypes,
+		Enabled:    enabled,
+		SortOrder:  req.SortOrder,
+	})
+	if err != nil {
+		slog.Error("workload notifications: create", "workload", id, "error", err)
+		respondError(w, http.StatusInternalServerError, "create workload notification")
+		return
+	}
+	respondJSON(w, http.StatusCreated, toWorkloadNotificationRow(created))
+}
+
+func (s *Server) updateWorkloadNotification(w http.ResponseWriter, r *http.Request) {
+	id := chi.URLParam(r, "id")
+	nid := chi.URLParam(r, "nid")
+	if _, err := s.store.GetWorkloadByID(id); err != nil {
+		if errors.Is(err, store.ErrNotFound) {
+			respondNotFound(w, "workload")
+			return
+		}
+		respondError(w, http.StatusInternalServerError, "get workload")
+		return
+	}
+	existing, err := s.store.GetWorkloadNotification(nid)
+	if err != nil {
+		if errors.Is(err, store.ErrNotFound) {
+			respondNotFound(w, "workload_notification")
+			return
+		}
+		respondError(w, http.StatusInternalServerError, "get workload_notification")
+		return
+	}
+	if existing.WorkloadID != id {
+		// Route mismatch — the row exists but under a different workload.
+		// Return 404 rather than 403 so we don't leak the existence of
+		// foreign rows to an unauthorised caller.
+		respondNotFound(w, "workload_notification")
+		return
+	}
+
+	var req workloadNotificationRequest
+	if !decodeJSONStrict(w, r, &req) {
+		return
+	}
+	req.URL = strings.TrimSpace(req.URL)
+	req.Name = strings.TrimSpace(req.Name)
+	if req.URL == "" {
+		respondError(w, http.StatusBadRequest, "url is required")
+		return
+	}
+
+	existing.Name = req.Name
+	existing.URL = req.URL
+	existing.EventTypes = req.EventTypes
+	existing.SortOrder = req.SortOrder
+	if req.Enabled != nil {
+		existing.Enabled = *req.Enabled
+	}
+	// Empty Secret on UPDATE preserves the stored ciphertext — explicit
+	// rotation requires sending the new plaintext. This avoids forcing
+	// the operator to re-enter their secret on every URL edit.
+	if req.Secret != "" {
+		v, err := crypto.Encrypt(s.encKey, req.Secret)
+		if err != nil {
+			slog.Error("workload notifications: encrypt secret", "workload", id, "error", err)
+			respondError(w, http.StatusInternalServerError, "encrypt secret")
+			return
+		}
+		existing.Secret = v
+	}
+
+	if err := s.store.UpdateWorkloadNotification(existing); err != nil {
+		if errors.Is(err, store.ErrNotFound) {
+			respondNotFound(w, "workload_notification")
+			return
+		}
+		slog.Error("workload notifications: update", "workload", id, "error", err)
+		respondError(w, http.StatusInternalServerError, "update workload notification")
+		return
+	}
+	respondJSON(w, http.StatusOK, toWorkloadNotificationRow(existing))
+}
+
+func (s *Server) deleteWorkloadNotification(w http.ResponseWriter, r *http.Request) {
+	id := chi.URLParam(r, "id")
+	nid := chi.URLParam(r, "nid")
+	existing, err := s.store.GetWorkloadNotification(nid)
+	if err != nil {
+		if errors.Is(err, store.ErrNotFound) {
+			respondNotFound(w, "workload_notification")
+			return
+		}
+		respondError(w, http.StatusInternalServerError, "get workload_notification")
+		return
+	}
+	if existing.WorkloadID != id {
+		respondNotFound(w, "workload_notification")
+		return
+	}
+	if err := s.store.DeleteWorkloadNotification(nid); err != nil {
+		if errors.Is(err, store.ErrNotFound) {
+			respondNotFound(w, "workload_notification")
+			return
+		}
+		slog.Error("workload notifications: delete", "workload", id, "error", err)
+		respondError(w, http.StatusInternalServerError, "delete workload notification")
+		return
+	}
+	respondJSON(w, http.StatusOK, map[string]any{"success": true})
+}
@@ -82,16 +82,27 @@ func (s *Server) getWorkloadRuntimeState(w http.ResponseWriter, r *http.Request)

 	payload := runtimeStatePayload{SourceKind: workload.SourceKind}

-	if workload.SourceKind != "static" {
+	// Both static and dockerfile sources persist their runtime state into
+	// containers.extra_json under a deterministic row id. The shapes
+	// match (status / last_commit_sha / last_sync_at / last_error) so the
+	// handler can decode them identically. The suffix differs per source
+	// kind: static uses ":site", dockerfile uses ":dockerfile".
+	var rowSuffix string
+	switch workload.SourceKind {
+	case "static":
+		rowSuffix = ":site"
+	case "dockerfile":
+		rowSuffix = ":dockerfile"
+	default:
 		respondJSON(w, http.StatusOK, payload)
 		return
 	}

-	// The static plugin owns one container row per workload at the
-	// deterministic ID <workloadID>:site. A missing row means the
-	// workload has never been deployed — return HasState=false so the
-	// UI can prompt the operator to deploy.
-	row, err := s.store.GetContainerByID(id + ":site")
+	// The owning plugin maintains one container row per workload at the
+	// deterministic ID. A missing row means the workload has never been
+	// deployed — return HasState=false so the UI can prompt the operator
+	// to deploy.
+	row, err := s.store.GetContainerByID(id + rowSuffix)
 	if err != nil {
 		if errors.Is(err, store.ErrNotFound) {
 			respondJSON(w, http.StatusOK, payload)
@@ -130,6 +130,13 @@ func TestGetWorkloadRuntimeState_MalformedExtraJSON_ReturnsContainerFieldsOnly(t
 		SourceKind:   "static",
 		SourceConfig: `{"provider":"gitea"}`,
 	})
+	// Seed a row with a valid extra_json first, then corrupt it via raw
+	// SQL. Prior to the write-side validateExtraJSON guard this test
+	// could pass a malformed string straight to UpsertContainer; the
+	// guard now rejects that at the boundary, which is the correct
+	// behaviour. The reader resilience this test verifies remains
+	// relevant for pre-existing bad rows from upgrades or external
+	// manipulation, so we still produce one via direct SQL.
 	if err := e.store.UpsertContainer(store.Container{
 		ID:           wl.ID + ":site",
 		WorkloadID:   wl.ID,
@@ -137,10 +144,16 @@ func TestGetWorkloadRuntimeState_MalformedExtraJSON_ReturnsContainerFieldsOnly(t
 		Host:         "local",
 		ContainerID:  "abc",
 		State:        "running",
-		ExtraJSON:    `{this is not json`,
+		ExtraJSON:    `{}`,
 	}); err != nil {
 		t.Fatalf("seed: %v", err)
 	}
+	if _, err := e.store.DB().Exec(
+		`UPDATE containers SET extra_json = ? WHERE id = ?`,
+		`{this is not json`, wl.ID+":site",
+	); err != nil {
+		t.Fatalf("corrupt extra_json: %v", err)
+	}
 	resp := e.do(t, http.MethodGet, "/api/workloads/"+wl.ID+"/runtime-state", nil)
 	if resp.StatusCode != http.StatusOK {
 		t.Fatalf("status = %d, want 200 (decode is non-fatal)", resp.StatusCode)
@@ -155,6 +168,57 @@ func TestGetWorkloadRuntimeState_MalformedExtraJSON_ReturnsContainerFieldsOnly(t
 	}
 }

+func TestGetWorkloadRuntimeState_DockerfileSourceDeployed_DecodesExtraJSON(t *testing.T) {
+	e := newAPITestEnv(t)
+	wl, err := e.store.CreateWorkload(store.Workload{
+		Kind:         string(store.WorkloadKindProject),
+		Name:         "build-app",
+		SourceKind:   "dockerfile",
+		SourceConfig: `{"provider":"gitea","port":3000}`,
+	})
+	if err != nil {
+		t.Fatalf("seed workload: %v", err)
+	}
+	extra, _ := json.Marshal(map[string]any{
+		"status":          "deployed",
+		"last_commit_sha": "deadbeef",
+		"last_sync_at":    "2026-05-23T10:00:00Z",
+		"last_error":      "",
+	})
+	if err := e.store.UpsertContainer(store.Container{
+		ID:           wl.ID + ":dockerfile",
+		WorkloadID:   wl.ID,
+		WorkloadKind: string(store.WorkloadKindBuild),
+		Host:         "local",
+		ContainerID:  "ffeeddcc",
+		State:        "running",
+		ExtraJSON:    string(extra),
+	}); err != nil {
+		t.Fatalf("seed container: %v", err)
+	}
+
+	resp := e.do(t, http.MethodGet, "/api/workloads/"+wl.ID+"/runtime-state", nil)
+	if resp.StatusCode != http.StatusOK {
+		t.Fatalf("status = %d, want 200", resp.StatusCode)
+	}
+	var got runtimeStatePayload
+	if errMsg := decodeEnvelope(t, resp, &got); errMsg != "" {
+		t.Fatalf("envelope error: %q", errMsg)
+	}
+	if !got.HasState {
+		t.Fatalf("HasState = false, want true")
+	}
+	if got.SourceKind != "dockerfile" {
+		t.Errorf("SourceKind = %q, want dockerfile", got.SourceKind)
+	}
+	if got.ContainerID != "ffeeddcc" || got.State != "running" {
+		t.Errorf("container fields = (%q,%q), want (ffeeddcc, running)", got.ContainerID, got.State)
+	}
+	if got.Status != "deployed" || got.LastCommitSHA != "deadbeef" {
+		t.Errorf("runtime fields = %+v, want deployed/deadbeef", got)
+	}
+}
+
 // =============================================================================
 // GET /api/workloads/{id}/storage
 // =============================================================================
@@ -14,6 +14,7 @@ import (
 	"github.com/alexei/tinyforge/internal/auth"
 	"github.com/alexei/tinyforge/internal/store"
 	"github.com/alexei/tinyforge/internal/workload/plugin"
+	"github.com/alexei/tinyforge/internal/workload/preview"
 )

 // pluginWorkloadRequest is the JSON body accepted by create + update.
@@ -227,6 +228,28 @@ func (s *Server) deletePluginWorkload(w http.ResponseWriter, r *http.Request) {
 		return
 	}

+	// Cascade-teardown any branch previews materialized from this workload
+	// so deleting a template does not orphan their containers, proxy routes,
+	// and rows. Operator-managed stage-chain children (which share the same
+	// parent link) are deliberately left alone — only previews are auto-owned
+	// by the template (see preview.IsPreviewChild).
+	if previews, err := preview.ListPreviewChildren(s.store, row); err != nil {
+		slog.Warn("delete workload: list preview children", "workload", id, "error", err)
+	} else {
+		for _, child := range previews {
+			if child.SourceKind != "" {
+				if err := s.deployer.DispatchTeardown(r.Context(), toPluginWorkload(child)); err != nil {
+					slog.Warn("delete workload: preview child teardown error",
+						"workload", id, "child", child.ID, "error", err)
+				}
+			}
+			if err := s.store.DeleteWorkload(child.ID); err != nil && !errors.Is(err, store.ErrNotFound) {
+				slog.Warn("delete workload: preview child delete error",
+					"workload", id, "child", child.ID, "error", err)
+			}
+		}
+	}
+
 	if row.SourceKind != "" {
 		if err := s.deployer.DispatchTeardown(r.Context(), toPluginWorkload(row)); err != nil {
 			slog.Warn("delete workload: teardown error",
@@ -85,9 +85,15 @@ func (la *LocalAuth) cleanBlacklist() {
 	}
 }

+// bcryptCost is the work factor used for new password hashes. Bumped from
+// the library default (10) to 12 so cost grows with hardware. Existing
+// hashes at lower costs still verify — bcrypt encodes the cost in the
+// stored hash itself.
+const bcryptCost = 12
+
 // HashPassword hashes a plaintext password using bcrypt.
 func HashPassword(password string) (string, error) {
-	hash, err := bcrypt.GenerateFromPassword([]byte(password), bcrypt.DefaultCost)
+	hash, err := bcrypt.GenerateFromPassword([]byte(password), bcryptCost)
 	if err != nil {
 		return "", fmt.Errorf("hash password: %w", err)
 	}
@@ -1,13 +1,17 @@
 package backup

 import (
+	"database/sql"
 	"fmt"
+	"io"
 	"log/slog"
 	"os"
 	"path/filepath"
 	"sync"
 	"time"

+	_ "modernc.org/sqlite" // read-only candidate inspection via PRAGMA integrity_check
+
 	"github.com/alexei/tinyforge/internal/store"
 )

@@ -129,6 +133,17 @@ func (e *Engine) RestorePath(id string) (string, error) {
 		return "", fmt.Errorf("get backup: %w", err)
 	}

+	// Filename comes from a DB row. Defence-in-depth: a backup file must live
+	// directly under backupDir, so reject any value carrying a path separator
+	// or traversal before joining. A poisoned row (future import path, manual
+	// insert) must never let restore read — and then atomically copy over the
+	// live DB — an arbitrary file. CreateBackup builds safe base names; this
+	// enforces the same invariant on read.
+	if backup.Filename == "" || backup.Filename == "." || backup.Filename == ".." ||
+		backup.Filename != filepath.Base(backup.Filename) {
+		return "", fmt.Errorf("backup: invalid filename %q", backup.Filename)
+	}
+
 	filePath := filepath.Join(e.backupDir, backup.Filename)
 	if _, err := os.Stat(filePath); err != nil {
 		return "", fmt.Errorf("backup file not found: %w", err)
@@ -137,6 +152,153 @@ func (e *Engine) RestorePath(id string) (string, error) {
 	return filePath, nil
 }

+// PrepareRestore validates a backup candidate before the caller swaps it
+// over the live DB. Runs three checks in order:
+//
+//  1. The candidate file exists and is non-empty.
+//  2. SQLite header magic matches (catches corrupted or partial downloads).
+//  3. `PRAGMA integrity_check` against a temp copy returns "ok"
+//     (catches WAL/page corruption that the header check misses).
+//
+// On success returns the candidate path. On failure returns a wrapped
+// error describing which probe rejected the file, so the operator can
+// see exactly why a "restore" was refused rather than getting a corrupt
+// DB at next boot.
+//
+// We use a *temp copy* for integrity_check because attaching the
+// candidate read-only into the live process would still hold a file
+// handle SQLite considers writable on Windows.
+func (e *Engine) PrepareRestore(id string) (string, error) {
+	path, err := e.RestorePath(id)
+	if err != nil {
+		return "", err
+	}
+
+	info, err := os.Stat(path)
+	if err != nil {
+		return "", fmt.Errorf("restore: stat candidate: %w", err)
+	}
+	if info.Size() < 100 {
+		return "", fmt.Errorf("restore: candidate %s is suspiciously small (%d bytes)", path, info.Size())
+	}
+
+	// SQLite file header: "SQLite format 3\x00" (16 bytes).
+	hdr, err := readHead(path, 16)
+	if err != nil {
+		return "", fmt.Errorf("restore: read header: %w", err)
+	}
+	if string(hdr) != "SQLite format 3\x00" {
+		return "", fmt.Errorf("restore: candidate %s is not a SQLite database (header mismatch)", path)
+	}
+
+	if err := integrityCheck(path); err != nil {
+		return "", fmt.Errorf("restore: integrity check failed: %w", err)
+	}
+
+	return path, nil
+}
+
+func readHead(path string, n int) ([]byte, error) {
+	f, err := os.Open(path)
+	if err != nil {
+		return nil, err
+	}
+	defer f.Close()
+	buf := make([]byte, n)
+	// io.ReadFull (not f.Read) guarantees the buffer is filled.
+	// A bare Read can short-return on some filesystems / on small
+	// files, which would skew the SQLite-header magic check below.
+	if _, err := io.ReadFull(f, buf); err != nil {
+		return nil, err
+	}
+	return buf, nil
+}
+
+// integrityCheck opens the candidate read-only and runs
+// `PRAGMA integrity_check`. We use immutable=1 so the driver does not
+// try to create WAL/SHM sidecars or upgrade the journal mode on the
+// candidate — both of which fail with "attempt to write a readonly
+// database" against a backup file. Anything other than the single row
+// `"ok"` is treated as corruption.
+func integrityCheck(path string) error {
+	db, err := sql.Open("sqlite", "file:"+path+"?mode=ro&immutable=1")
+	if err != nil {
+		return fmt.Errorf("open candidate: %w", err)
+	}
+	defer db.Close()
+
+	rows, err := db.Query("PRAGMA integrity_check")
+	if err != nil {
+		return fmt.Errorf("pragma integrity_check: %w", err)
+	}
+	defer rows.Close()
+
+	if !rows.Next() {
+		return fmt.Errorf("integrity_check returned no rows")
+	}
+	var result string
+	if err := rows.Scan(&result); err != nil {
+		return fmt.Errorf("scan integrity_check: %w", err)
+	}
+	if result != "ok" {
+		return fmt.Errorf("integrity_check: %s", result)
+	}
+	return nil
+}
+
+// AtomicReplaceDB writes a backup candidate into place atomically.
+// The caller is expected to:
+//  1. Call PrepareRestore(id) → candidatePath.
+//  2. Take a "pre-restore" backup of the current DB via CreateBackup.
+//  3. Close the live *sql.DB.
+//  4. Call AtomicReplaceDB(candidatePath, livePath).
+//  5. Trigger graceful shutdown; main() will re-open on next start.
+//
+// AtomicReplaceDB also wipes WAL/SHM sidecar files so the new DB starts
+// from a clean checkpoint state. Failure to remove sidecars is logged
+// but non-fatal — SQLite recreates them on open.
+func (e *Engine) AtomicReplaceDB(candidatePath, livePath string) error {
+	// Copy candidate to a tmp file next to the live DB, then rename
+	// atomically. On Windows os.Rename across volumes fails, so we
+	// keep tmp on the same dir as the destination.
+	tmp := livePath + ".restore.tmp"
+	if err := copyFile(candidatePath, tmp); err != nil {
+		return fmt.Errorf("copy candidate to %s: %w", tmp, err)
+	}
+	// Best-effort: remove WAL/SHM so SQLite re-checkpoints from the
+	// restored main file rather than a stale WAL pointing at the old
+	// DB's pages.
+	for _, sidecar := range []string{livePath + "-wal", livePath + "-shm"} {
+		if err := os.Remove(sidecar); err != nil && !os.IsNotExist(err) {
+			slog.Warn("restore: remove sidecar", "path", sidecar, "error", err)
+		}
+	}
+	if err := os.Rename(tmp, livePath); err != nil {
+		// Clean up tmp on rename failure so we don't leak a partial file.
+		_ = os.Remove(tmp)
+		return fmt.Errorf("rename %s → %s: %w", tmp, livePath, err)
+	}
+	slog.Info("restore: database file replaced atomically", "live", livePath)
+	return nil
+}
+
+func copyFile(src, dst string) error {
+	in, err := os.Open(src)
+	if err != nil {
+		return err
+	}
+	defer in.Close()
+	out, err := os.OpenFile(dst, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, 0o600)
+	if err != nil {
+		return err
+	}
+	if _, err := io.Copy(out, in); err != nil {
+		_ = out.Close()
+		return err
+	}
+	return out.Close()
+}
+
 // Prune removes old backups exceeding the retention count.
 // Returns the number of backups pruned.
 func (e *Engine) Prune(retentionCount int) (int, error) {
@@ -0,0 +1,113 @@
+package backup
+
+import (
+	"errors"
+	"os"
+	"path/filepath"
+	"strings"
+	"testing"
+
+	"github.com/alexei/tinyforge/internal/store"
+)
+
+// newTestEngine spins up an isolated store + engine pair for tests.
+// Each test gets its own tempdir so backup files do not collide.
+func newTestEngine(t *testing.T) (*Engine, *store.Store, string) {
+	t.Helper()
+	dir := t.TempDir()
+	dbPath := filepath.Join(dir, "tinyforge.db")
+	st, err := store.New(dbPath)
+	if err != nil {
+		t.Fatalf("store.New: %v", err)
+	}
+	t.Cleanup(func() { _ = st.Close() })
+
+	eng, err := New(st, dbPath, dir)
+	if err != nil {
+		t.Fatalf("backup.New: %v", err)
+	}
+	return eng, st, dbPath
+}
+
+func TestPrepareRestore_RejectsTinyFile(t *testing.T) {
+	eng, st, _ := newTestEngine(t)
+
+	// Plant a backup row with a tiny file masquerading as a backup.
+	tinyPath := filepath.Join(eng.BackupDir(), "tinyforge-manual-junk.db")
+	if err := os.WriteFile(tinyPath, []byte("hi"), 0o600); err != nil {
+		t.Fatalf("write tiny: %v", err)
+	}
+	bk, err := st.CreateBackup(store.Backup{
+		Filename:   "tinyforge-manual-junk.db",
+		SizeBytes:  2,
+		BackupType: "manual",
+	})
+	if err != nil {
+		t.Fatalf("CreateBackup row: %v", err)
+	}
+
+	if _, err := eng.PrepareRestore(bk.ID); err == nil {
+		t.Fatal("expected PrepareRestore to reject tiny file, got nil")
+	} else if !strings.Contains(err.Error(), "suspiciously small") {
+		t.Errorf("error = %v, want 'suspiciously small'", err)
+	}
+}
+
+func TestPrepareRestore_RejectsNonSQLite(t *testing.T) {
+	eng, st, _ := newTestEngine(t)
+
+	// 200 bytes of non-SQLite garbage: passes the size check, fails
+	// the header magic check.
+	garbagePath := filepath.Join(eng.BackupDir(), "tinyforge-manual-bogus.db")
+	junk := make([]byte, 200)
+	for i := range junk {
+		junk[i] = byte('x')
+	}
+	if err := os.WriteFile(garbagePath, junk, 0o600); err != nil {
+		t.Fatalf("write junk: %v", err)
+	}
+	bk, err := st.CreateBackup(store.Backup{
+		Filename:   "tinyforge-manual-bogus.db",
+		SizeBytes:  int64(len(junk)),
+		BackupType: "manual",
+	})
+	if err != nil {
+		t.Fatalf("CreateBackup row: %v", err)
+	}
+
+	if _, err := eng.PrepareRestore(bk.ID); err == nil {
+		t.Fatal("expected PrepareRestore to reject non-SQLite blob, got nil")
+	} else if !strings.Contains(err.Error(), "header") {
+		t.Errorf("error = %v, want header mismatch", err)
+	}
+}
+
+func TestPrepareRestore_AcceptsValidVacuumInto(t *testing.T) {
+	eng, _, _ := newTestEngine(t)
+
+	// A fresh CreateBackup from the engine itself is, by construction,
+	// a valid SQLite database — VACUUM INTO produces a clean copy.
+	bk, err := eng.CreateBackup("manual")
+	if err != nil {
+		t.Fatalf("CreateBackup: %v", err)
+	}
+	path, err := eng.PrepareRestore(bk.ID)
+	if err != nil {
+		t.Fatalf("PrepareRestore on valid backup: %v", err)
+	}
+	if path == "" {
+		t.Errorf("PrepareRestore returned empty path")
+	}
+}
+
+func TestPrepareRestore_UnknownID(t *testing.T) {
+	eng, _, _ := newTestEngine(t)
+
+	_, err := eng.PrepareRestore("nonexistent-id")
+	if err == nil {
+		t.Fatal("expected error for unknown id, got nil")
+	}
+	if errors.Is(err, store.ErrNotFound) {
+		// fine — wrapped through RestorePath
+	}
+}
@@ -10,11 +10,26 @@ import (
 	"fmt"
 	"io"
 	"os"
+	"strings"
 )

 // ErrNoKey is returned when ENCRYPTION_KEY is not set.
 var ErrNoKey = errors.New("ENCRYPTION_KEY environment variable is not set")

+// ErrDecryptFailed wraps any cipher.Open / decoder failure. Callers
+// upgrading from the silent-fallback pattern (treat-as-plaintext when
+// decrypt errored) MUST instead surface this — a rotated key would
+// otherwise silently leak ciphertext to upstream services as if it
+// were plaintext.
+var ErrDecryptFailed = errors.New("crypto: decrypt failed (wrong key, corrupted ciphertext, or unversioned legacy value)")
+
+// envelopeV1Prefix tags ciphertext produced by Encrypt going forward.
+// Older databases may carry unprefixed hex blobs from the v0 era; those
+// are still readable via Decrypt for backward compatibility, but every
+// new write goes through EncryptV1 and emits the prefix so a future key
+// rotation has a clean fail-loud signal.
+const envelopeV1Prefix = "tf1:"
+
 // DeriveKey computes a 32-byte AES-256 key from the given passphrase using SHA-256.
 // This is acceptable when ENCRYPTION_KEY is a high-entropy random string (e.g., 32+ hex chars).
 // For human-chosen passphrases, consider Argon2id or PBKDF2 with a salt instead.
@@ -35,7 +50,8 @@ func KeyFromEnv() ([32]byte, error) {
 }

 // Encrypt encrypts plaintext using AES-256-GCM with a random nonce.
-// The returned ciphertext is hex-encoded: nonce || ciphertext+tag.
+// Returns a versioned envelope (tf1:<hex>) so downstream readers can
+// distinguish ciphertext from accidentally-stored plaintext.
 func Encrypt(key [32]byte, plaintext string) (string, error) {
 	block, err := aes.NewCipher(key[:])
 	if err != nil {
@@ -53,14 +69,34 @@ func Encrypt(key [32]byte, plaintext string) (string, error) {
 	}

 	sealed := gcm.Seal(nonce, nonce, []byte(plaintext), nil)
-	return hex.EncodeToString(sealed), nil
+	return envelopeV1Prefix + hex.EncodeToString(sealed), nil
 }

-// Decrypt decrypts a hex-encoded ciphertext produced by Encrypt.
-func Decrypt(key [32]byte, ciphertextHex string) (string, error) {
-	data, err := hex.DecodeString(ciphertextHex)
+// HasEnvelope reports whether the value is a v1-prefixed ciphertext.
+// Useful for router-level "decrypt only if encrypted" decision points
+// that previously relied on `err == nil` from a try-decrypt — that
+// pattern silently masked rotated-key failures.
+func HasEnvelope(value string) bool {
+	return strings.HasPrefix(value, envelopeV1Prefix)
+}
+
+// Decrypt decrypts an envelope (tf1:<hex>). For backward compatibility
+// it also accepts unprefixed hex from the v0 era — but only when the
+// resulting plaintext is valid; a wrong key for legacy data now returns
+// ErrDecryptFailed instead of silently treating ciphertext as
+// plaintext.
+//
+// Callers MUST NOT swallow the error and fall back to "use as-is".
+// That pattern is the exact footgun the envelope versioning removes.
+func Decrypt(key [32]byte, ciphertext string) (string, error) {
+	hexBlob := ciphertext
+	if strings.HasPrefix(hexBlob, envelopeV1Prefix) {
+		hexBlob = hexBlob[len(envelopeV1Prefix):]
+	}
+
+	data, err := hex.DecodeString(hexBlob)
 	if err != nil {
-		return "", fmt.Errorf("decode hex: %w", err)
+		return "", fmt.Errorf("%w: decode hex: %v", ErrDecryptFailed, err)
 	}

 	block, err := aes.NewCipher(key[:])
@@ -75,15 +111,15 @@ func Decrypt(key [32]byte, ciphertextHex string) (string, error) {

 	nonceSize := gcm.NonceSize()
 	if len(data) < nonceSize {
-		return "", errors.New("ciphertext too short")
+		return "", fmt.Errorf("%w: ciphertext too short", ErrDecryptFailed)
 	}

 	nonce := data[:nonceSize]
-	ciphertext := data[nonceSize:]
+	body := data[nonceSize:]

-	plaintext, err := gcm.Open(nil, nonce, ciphertext, nil)
+	plaintext, err := gcm.Open(nil, nonce, body, nil)
 	if err != nil {
-		return "", fmt.Errorf("decrypt: %w", err)
+		return "", fmt.Errorf("%w: %v", ErrDecryptFailed, err)
 	}

 	return string(plaintext), nil
@@ -34,7 +34,19 @@ type Deployer struct {
 	dnsMu    sync.RWMutex
 	dns      dns.Provider // nil when wildcard DNS is active

+	// proxyMu protects hot-swap of d.proxy from runtime settings updates
+	// (SetProxyProvider) racing with PluginDeps() reads on the deploy path.
+	proxyMu sync.RWMutex
+
 	// Graceful shutdown: tracks in-progress deploys.
+	//
+	// drainMu serializes the "is-draining check + activeWg.Add(1)" in
+	// beginDispatch against the "set shuttingDown + Wait()" in Drain. Without
+	// it, a dispatch could pass the draining check, Drain could then flip the
+	// flag and start Wait() with a zero counter, and the dispatch could call
+	// Add(1) concurrently with Wait — a documented sync.WaitGroup misuse
+	// (panic risk) that also lets a deploy slip past the drain barrier.
+	drainMu      sync.Mutex
 	activeWg     sync.WaitGroup
 	shuttingDown atomic.Bool
 }
@@ -73,7 +85,11 @@ func New(
 }

 // SetProxyProvider updates the proxy provider at runtime (e.g., when settings change).
+// Guarded by proxyMu so concurrent deploys that read d.proxy via PluginDeps()
+// observe a coherent value (previously a torn-pointer race under -race).
 func (d *Deployer) SetProxyProvider(provider proxy.Provider) {
+	d.proxyMu.Lock()
+	defer d.proxyMu.Unlock()
 	d.proxy = provider
 }

@@ -110,8 +126,11 @@ func (d *Deployer) SetDNSProvider(provider dns.Provider) {

 // Drain waits for all in-progress deploys to complete. Call this during graceful shutdown.
 func (d *Deployer) Drain() {
-	if !d.shuttingDown.CompareAndSwap(false, true) {
-		// Already draining.
+	d.drainMu.Lock()
+	already := d.shuttingDown.Swap(true)
+	d.drainMu.Unlock()
+	if already {
+		slog.Info("deployer: drain already in progress")
 	}
 	slog.Info("deployer: draining in-progress deploys")
 	d.activeWg.Wait()
@@ -121,11 +140,17 @@ func (d *Deployer) Drain() {
 // ShuttingDown reports whether Drain() has been called.
 func (d *Deployer) ShuttingDown() bool { return d.shuttingDown.Load() }

-// rejectIfDraining is exposed in case any plugin wants the same hard-stop
-// behaviour the legacy pipeline used.
-func (d *Deployer) rejectIfDraining() error {
+// beginDispatch atomically rejects when draining and otherwise registers the
+// in-flight unit on activeWg. The shuttingDown check and the Add(1) MUST be
+// done together under drainMu (see the field comment): Drain sets the flag
+// under the same mutex before Wait(), so once Wait() observes a zero counter
+// no further Add can race it. Callers must defer d.activeWg.Done() on success.
+func (d *Deployer) beginDispatch() error {
+	d.drainMu.Lock()
+	defer d.drainMu.Unlock()
 	if d.shuttingDown.Load() {
 		return fmt.Errorf("deployer is shutting down, rejecting new deploy")
 	}
+	d.activeWg.Add(1)
 	return nil
 }
@@ -4,6 +4,7 @@ import (
 	"context"
 	"fmt"

+	"github.com/alexei/tinyforge/internal/metrics"
 	"github.com/alexei/tinyforge/internal/workload/plugin"
 )

@@ -14,16 +15,37 @@ import (
 // triggers + image deploys still go through the legacy path, while
 // /api/hooks/generic + the unified webhook ingress go through here.
 func (d *Deployer) DispatchPlugin(ctx context.Context, w plugin.Workload, intent plugin.DeploymentIntent) error {
+	if err := d.beginDispatch(); err != nil {
+		metrics.DeploysTotal.Inc(w.SourceKind, "rejected_draining")
+		return err
+	}
+	defer d.activeWg.Done()
 	src, err := plugin.GetSource(w.SourceKind)
 	if err != nil {
+		// Unknown source: use the constant "unknown" sentinel for the
+		// label so a typo-spam attack can't grow the metrics map with
+		// one series per bogus source_kind. The actual user-supplied
+		// value still surfaces via the wrapped error / event log.
+		metrics.DeploysTotal.Inc("unknown", "unknown_source")
 		return fmt.Errorf("dispatch %s: %w", w.Name, err)
 	}
-	return src.Deploy(ctx, d.PluginDeps(), w, intent)
+	err = src.Deploy(ctx, d.PluginDeps(), w, intent)
+	outcome := "success"
+	if err != nil {
+		outcome = "failure"
+	}
+	metrics.DeploysTotal.Inc(w.SourceKind, outcome)
+	return err
 }

 // DispatchTeardown routes a teardown call to the matching Source plugin.
-// Used when a workload is deleted.
+// Used when a workload is deleted. Tracked via activeWg so Drain() honours
+// in-progress teardowns just like deploys.
 func (d *Deployer) DispatchTeardown(ctx context.Context, w plugin.Workload) error {
+	if err := d.beginDispatch(); err != nil {
+		return err
+	}
+	defer d.activeWg.Done()
 	src, err := plugin.GetSource(w.SourceKind)
 	if err != nil {
 		return fmt.Errorf("dispatch teardown %s: %w", w.Name, err)
@@ -33,8 +55,17 @@ func (d *Deployer) DispatchTeardown(ctx context.Context, w plugin.Workload) erro

 // DispatchReconcile routes a Reconcile call. Periodic reconciler iterates
 // every Workload and calls this; idle Sources should make it a cheap
-// no-op.
+// no-op. Tracked via activeWg so a long-running reconcile blocks Drain().
 func (d *Deployer) DispatchReconcile(ctx context.Context, w plugin.Workload) error {
+	if err := d.beginDispatch(); err != nil {
+		// Silent skip — reconcile is a periodic tick, not a user-initiated
+		// action, so we don't want to surface "draining" errors back to the
+		// reconciler loop. The next tick after restart will catch up. Routing
+		// through beginDispatch keeps the activeWg.Add atomic with the drain
+		// check (see Drain) instead of a bare shuttingDown.Load + Add race.
+		return nil
+	}
+	defer d.activeWg.Done()
 	src, err := plugin.GetSource(w.SourceKind)
 	if err != nil {
 		return fmt.Errorf("dispatch reconcile %s: %w", w.Name, err)
@@ -52,10 +83,13 @@ func (d *Deployer) PluginDeps() plugin.Deps {
 	d.dnsMu.RLock()
 	dnsProvider := d.dns
 	d.dnsMu.RUnlock()
+	d.proxyMu.RLock()
+	proxyProvider := d.proxy
+	d.proxyMu.RUnlock()
 	return plugin.Deps{
 		Store:    d.store,
 		Docker:   d.docker,
-		Proxy:    d.proxy,
+		Proxy:    proxyProvider,
 		DNS:      dnsProvider,
 		Health:   d.health,
 		Notifier: d.notifier,
@@ -2,20 +2,58 @@ package docker

 import (
 	"archive/tar"
+	"bufio"
 	"context"
+	"encoding/json"
 	"fmt"
 	"io"
 	"os"
 	"path/filepath"
 	"strings"

+	"github.com/moby/moby/api/types/build"
 	"github.com/moby/moby/client"
 )

-// BuildImage builds a Docker image from a directory containing a Dockerfile.
-// The directory is packaged as a tar archive and sent to the Docker daemon.
-// The tag parameter is the image name:tag to apply (e.g., "dw-site-myapp:latest").
+// BuildImage builds a Docker image from a directory containing a Dockerfile
+// at the context root. Kept as a thin wrapper around BuildImageAt for the
+// static-site plugin which always emits its generated Dockerfile at the
+// context root. New code should prefer BuildImageAt so the Dockerfile path
+// is explicit.
 func (c *Client) BuildImage(ctx context.Context, contextDir, tag string) error {
+	return c.BuildImageAt(ctx, contextDir, "Dockerfile", tag, nil)
+}
+
+// BuildImageAt builds a Docker image from a tar of contextDir, using the
+// Dockerfile at `dockerfile` *inside* the context (typically "Dockerfile"
+// but may be e.g. "docker/Dockerfile" when the user-supplied repo layout
+// keeps Dockerfiles in a subfolder).
+//
+// The dockerfile argument is the path *relative to contextDir*. Empty
+// strings are normalised to "Dockerfile" so callers can pass through a
+// user config value without sanitising twice.
+//
+// logFn, if non-nil, is invoked for every non-empty `stream` line the
+// daemon emits during the build. Callers use this to forward live build
+// progress (e.g. SSE bus). Errors from the daemon are NOT delivered via
+// logFn — they surface as the returned error so the caller's failure
+// path stays the single source of truth.
+func (c *Client) BuildImageAt(ctx context.Context, contextDir, dockerfile, tag string, logFn func(line string)) error {
+	if dockerfile == "" {
+		dockerfile = "Dockerfile"
+	}
+	// Normalise to forward slashes — the tar entry names use them and the
+	// Docker daemon expects the same.
+	dockerfile = filepath.ToSlash(dockerfile)
+	// Defence-in-depth: the dockerfile path is relative to contextDir and
+	// is increasingly user/config-supplied (subfolder Dockerfiles). Reject
+	// absolute paths and any `..` traversal at the boundary so a value like
+	// "../../etc/passwd" can never be handed to the daemon's build options,
+	// regardless of which builder backend resolves it.
+	if filepath.IsAbs(dockerfile) || strings.HasPrefix(dockerfile, "/") ||
+		dockerfile == ".." || strings.HasPrefix(dockerfile, "../") || strings.Contains(dockerfile, "/../") {
+		return fmt.Errorf("docker build: invalid dockerfile path %q (must be relative to the build context, no traversal)", dockerfile)
+	}
 	// Create tar archive of the build context.
 	pr, pw := io.Pipe()

@@ -50,16 +88,14 @@ func (c *Client) BuildImage(ctx context.Context, contextDir, tag string) error {
 				return nil
 			}

-			file, err := os.Open(path)
-			if err != nil {
-				return fmt.Errorf("open %s: %w", path, err)
+			// Per-file close, NOT defer. `defer file.Close()` inside the
+			// WalkFunc only runs when the outer goroutine returns — for a
+			// build context with thousands of files (node_modules-heavy
+			// repo) that leaks one fd per file until the walk completes
+			// and trips EMFILE on default ulimit=1024 systems.
+			if err := streamFileIntoTar(tw, path, relPath); err != nil {
+				return err
 			}
-			defer file.Close()
-
-			if _, err := io.Copy(tw, file); err != nil {
-				return fmt.Errorf("copy %s to tar: %w", relPath, err)
-			}
-
 			return nil
 		})

@@ -69,8 +105,16 @@ func (c *Client) BuildImage(ctx context.Context, contextDir, tag string) error {
 		pw.CloseWithError(err)
 	}()

+	// Pin the legacy builder explicitly. On Docker Engine 23+ BuildKit
+	// is the default for the CLI, but the daemon honours the explicit
+	// Version field on ImageBuildOptions. Legacy builder does NOT support
+	// `RUN --mount=type=bind,source=/host` so a malicious Dockerfile
+	// cannot mount host paths into the build context. Switching to
+	// BuildKit later requires (a) Dockerfile-content validation to
+	// reject bind-mount hints, or (b) an explicit per-workload opt-in.
 	resp, err := c.api.ImageBuild(ctx, pr, client.ImageBuildOptions{
-		Dockerfile:  "Dockerfile",
+		Version:     build.BuilderV1,
+		Dockerfile:  dockerfile,
 		Tags:        []string{tag},
 		Remove:      true,
 		ForceRemove: true,
@@ -80,16 +124,71 @@ func (c *Client) BuildImage(ctx context.Context, contextDir, tag string) error {
 	}
 	defer resp.Body.Close()

-	// Read the build output to completion (required for the build to finish).
-	output, err := io.ReadAll(resp.Body)
-	if err != nil {
+	// Drain the daemon's NDJSON stream to completion. The stream MUST
+	// be read for the build to finish — closing the body early aborts
+	// the build. We parse line-by-line into the {Stream, Error} shape
+	// the daemon emits so an honest `{"error":"..."}` line surfaces
+	// without false positives from informational `{"stream":"error
+	// handling: retrying..."}` chatter that the old strings.Contains
+	// path would have flagged.
+	type buildLine struct {
+		Stream string `json:"stream,omitempty"`
+		Error  string `json:"error,omitempty"`
+	}
+	scanner := bufio.NewScanner(resp.Body)
+	// Some build steps emit single lines exceeding the default 64 KiB
+	// (e.g. a fat go-mod-download dump). Bump to 1 MiB so we don't
+	// silently truncate and miss the trailing error line.
+	scanner.Buffer(make([]byte, 64*1024), 1024*1024)
+	var firstErr string
+	for scanner.Scan() {
+		line := scanner.Bytes()
+		if len(line) == 0 {
+			continue
+		}
+		var bl buildLine
+		if err := json.Unmarshal(line, &bl); err != nil {
+			// Non-JSON line — daemon shouldn't produce these, but
+			// don't fail the build over a parse hiccup.
+			continue
+		}
+		if bl.Error != "" && firstErr == "" {
+			firstErr = bl.Error
+		}
+		if logFn != nil && bl.Stream != "" {
+			logFn(bl.Stream)
+		}
+	}
+	if err := scanner.Err(); err != nil {
 		return fmt.Errorf("read build output for %s: %w", tag, err)
 	}
-
-	// Check for error in build output.
-	if strings.Contains(string(output), `"error"`) {
-		return fmt.Errorf("build image %s: build errors in output", tag)
+	if firstErr != "" {
+		return fmt.Errorf("build image %s: %s", tag, firstErr)
 	}

 	return nil
 }
+
+// streamFileIntoTar opens path, copies its contents into the tar writer
+// under the given relPath header, and closes the file *before returning*
+// — i.e. once per file, not deferred to the end of the entire walk.
+// Extracted so the per-iteration close discipline is obvious at the
+// callsite and the file handle isn't accidentally hoisted into the
+// caller's defer stack via a future refactor.
+func streamFileIntoTar(tw *tar.Writer, path, relPath string) error {
+	file, err := os.Open(path)
+	if err != nil {
+		return fmt.Errorf("open %s: %w", path, err)
+	}
+	_, copyErr := io.Copy(tw, file)
+	// Close BEFORE returning so the fd is released even on copy
+	// failure. Capture both errors so the more-specific copy error
+	// wins when both fire.
+	if cerr := file.Close(); cerr != nil && copyErr == nil {
+		copyErr = cerr
+	}
+	if copyErr != nil {
+		return fmt.Errorf("copy %s to tar: %w", relPath, copyErr)
+	}
+	return nil
+}
@@ -108,3 +108,29 @@ func (c *Client) GetSystemStats(ctx context.Context) (SystemStats, error) {

 	return stats, nil
 }
+
+// BuildCachePruneResult reports the outcome of a build-cache prune.
+type BuildCachePruneResult struct {
+	CachesDeleted  int   `json:"caches_deleted"`  // number of cache records removed
+	SpaceReclaimed int64 `json:"space_reclaimed"` // bytes reclaimed
+}
+
+// PruneBuildCache deletes unused Docker build-cache records and returns the
+// number of records removed and bytes reclaimed. Docker's build-cache API is
+// prune-by-filter only — there is no surgical per-record eviction — so this
+// is the daemon-wide "prune unused" operation.
+//
+// When all is false (the default), only build cache not currently in use is
+// removed, so an app's next rebuild still hits its warm cache. When all is
+// true, every build-cache record is removed regardless of use, forcing a cold
+// rebuild for every app.
+func (c *Client) PruneBuildCache(ctx context.Context, all bool) (BuildCachePruneResult, error) {
+	res, err := c.api.BuildCachePrune(ctx, client.BuildCachePruneOptions{All: all})
+	if err != nil {
+		return BuildCachePruneResult{}, fmt.Errorf("prune build cache: %w", err)
+	}
+	return BuildCachePruneResult{
+		CachesDeleted:  len(res.Report.CachesDeleted),
+		SpaceReclaimed: int64(res.Report.SpaceReclaimed),
+	}, nil
+}
@@ -27,6 +27,13 @@ const (

 	// EventStackStatus is emitted when a compose stack status changes.
 	EventStackStatus EventType = "stack_status"
+
+	// EventBuildLog is emitted for each line of a streaming image build.
+	// Per-line events are ephemeral (not persisted to the event_log) — they
+	// exist to drive a live tail UI during the slow "building" phase of a
+	// dockerfile-source deploy. Subscribers should filter by WorkloadID
+	// because every dockerfile deploy on the box publishes on the same bus.
+	EventBuildLog EventType = "build_log"
 )

 // Event is a single event published on the bus.
@@ -62,12 +69,13 @@ type DeployStatusPayload struct {

 // EventLogPayload is the payload for EventLog events (audit trail).
 type EventLogPayload struct {
-	ID        int64  `json:"id"`
-	Source    string `json:"source"`
-	Severity  string `json:"severity"`
-	Message   string `json:"message"`
-	Metadata  string `json:"metadata"`
-	CreatedAt string `json:"created_at"`
+	ID         int64  `json:"id"`
+	Source     string `json:"source"`
+	WorkloadID string `json:"workload_id"`
+	Severity   string `json:"severity"`
+	Message    string `json:"message"`
+	Metadata   string `json:"metadata"`
+	CreatedAt  string `json:"created_at"`
 }

 // StaticSiteStatusPayload is the payload for EventStaticSiteStatus events.
@@ -77,6 +85,14 @@ type StaticSiteStatusPayload struct {
 	Status string `json:"status"`
 }

+// BuildLogPayload is the payload for EventBuildLog events. One event
+// per non-empty line read off the daemon's NDJSON build stream.
+type BuildLogPayload struct {
+	WorkloadID string `json:"workload_id"`
+	Line       string `json:"line"`
+	Stream     string `json:"stream,omitempty"`
+}
+
 // StackStatusPayload is the payload for EventStackStatus events.
 type StackStatusPayload struct {
 	StackID string `json:"stack_id"`
@@ -0,0 +1,349 @@
+// Package metricalert implements a background goroutine that
+// periodically evaluates operator-configured metric-threshold rules
+// against recent container stats samples. On breach (subject to a
+// per-rule-per-workload cooldown) it emits an event into the existing
+// event_log + event-bus pipeline — the same fan-out used by the
+// log-scanner — instead of building any new notification plumbing.
+package metricalert
+
+import (
+	"encoding/json"
+	"fmt"
+	"log/slog"
+	"sync"
+	"time"
+
+	"github.com/alexei/tinyforge/internal/events"
+	"github.com/alexei/tinyforge/internal/store"
+)
+
+// EvalInterval is how often the evaluator tick fires.
+const EvalInterval = 30 * time.Second
+
+// lookbackSeconds bounds how far back we pull samples each tick. Stats
+// are collected at most every few seconds (see internal/stats), so a
+// 120s window comfortably captures the latest reading per container
+// even if collection briefly stalls.
+const lookbackSeconds = 120
+
+// RuleSource is the read-side seam for fetching the current rule rows.
+// Real callers pass *store.Store; tests pass a fake.
+type RuleSource interface {
+	ListMetricAlertRules() ([]store.MetricAlertRule, error)
+}
+
+// SampleSource fetches the recent container stats samples to evaluate.
+type SampleSource interface {
+	ListAllRecentContainerStatsSamples(sinceTS int64) ([]store.ContainerStatsSample, error)
+}
+
+// EventSink writes a breach into event_log.
+type EventSink interface {
+	InsertEvent(store.EventLog) (store.EventLog, error)
+}
+
+// Publisher fans the breach out on the event bus. Matches *events.Bus.
+type Publisher interface {
+	Publish(events.Event)
+}
+
+// Source identifies metric-alert events in event_log + the bus.
+const eventSource = "metric_alert"
+
+// Manager owns the evaluation loop lifecycle. It mirrors
+// stats.Collector: a once-guarded Start/Stop pair with stop/done
+// channels and a single-goroutine run loop.
+type Manager struct {
+	rules   RuleSource
+	samples SampleSource
+	sink    EventSink
+	pub     Publisher
+
+	// now is swappable in tests so cooldown windows can be exercised
+	// deterministically. Defaults to time.Now.
+	now func() time.Time
+
+	// mu guards lastFired. The run loop is single-goroutine today, but
+	// Start/Stop and a future ReloadRules may touch shared state; the
+	// mutex is cheap insurance.
+	mu        sync.Mutex
+	lastFired map[string]time.Time // "ruleID:ownerID" -> last emit time
+
+	startOnce sync.Once
+	stopOnce  sync.Once
+	started   bool
+	stop      chan struct{}
+	done      chan struct{}
+}
+
+// New wires a manager with the supplied dependencies. Call Start to
+// begin evaluating.
+func New(rules RuleSource, samples SampleSource, sink EventSink, pub Publisher) *Manager {
+	return &Manager{
+		rules:     rules,
+		samples:   samples,
+		sink:      sink,
+		pub:       pub,
+		now:       time.Now,
+		lastFired: map[string]time.Time{},
+		stop:      make(chan struct{}),
+		done:      make(chan struct{}),
+	}
+}
+
+// Start launches the background loop. Returns immediately. The loop
+// exits when Stop is called. Safe to call multiple times — only the
+// first call has an effect.
+func (m *Manager) Start() {
+	m.startOnce.Do(func() {
+		m.started = true
+		go m.run()
+	})
+}
+
+// Stop signals the loop to exit and blocks until it has finished the
+// in-flight tick. If Start was never called, Stop returns immediately.
+func (m *Manager) Stop() {
+	m.stopOnce.Do(func() {
+		close(m.stop)
+		if !m.started {
+			close(m.done)
+		}
+	})
+	<-m.done
+}
+
+// run is the main loop. It evaluates once shortly after start, then on
+// every EvalInterval tick, until Stop is called.
+func (m *Manager) run() {
+	defer close(m.done)
+
+	// Settle delay so the app + first stats samples exist before the
+	// first evaluation.
+	select {
+	case <-time.After(3 * time.Second):
+	case <-m.stop:
+		return
+	}
+
+	ticker := time.NewTicker(EvalInterval)
+	defer ticker.Stop()
+	m.evaluate(m.now())
+	for {
+		select {
+		case <-m.stop:
+			return
+		case <-ticker.C:
+			m.evaluate(m.now())
+		}
+	}
+}
+
+// evaluate runs one pass: load rules + recent samples, reduce to the
+// freshest sample per (owner, container), and emit on breach subject to
+// cooldown. Best-effort throughout — a bad rule or sample never crashes
+// the loop.
+func (m *Manager) evaluate(now time.Time) {
+	rules, err := m.rules.ListMetricAlertRules()
+	if err != nil {
+		slog.Warn("metricalert: list rules", "error", err)
+		return
+	}
+	if len(rules) == 0 {
+		return
+	}
+
+	since := now.Unix() - lookbackSeconds
+	samples, err := m.samples.ListAllRecentContainerStatsSamples(since)
+	if err != nil {
+		slog.Warn("metricalert: list samples", "error", err)
+		return
+	}
+	latest := latestPerContainer(samples)
+	if len(latest) == 0 {
+		return
+	}
+
+	for _, rule := range rules {
+		if !rule.Enabled {
+			continue
+		}
+		for _, sample := range latest {
+			// Per-workload rules only match their workload; "" matches all.
+			if rule.WorkloadID != "" && rule.WorkloadID != sample.OwnerID {
+				continue
+			}
+			value, ok := metricValue(rule.Metric, sample)
+			if !ok {
+				continue // e.g. memory_percent with a zero limit
+			}
+			if !breached(rule.Comparator, value, rule.Threshold) {
+				continue
+			}
+			if m.coolingDown(rule, sample.OwnerID, now) {
+				continue
+			}
+			m.emit(rule, sample, value)
+			m.recordFire(rule, sample.OwnerID, now)
+		}
+	}
+}
+
+// latestPerContainer keeps only the most recent sample per
+// (OwnerID, ContainerID), so each container is judged on its freshest
+// reading rather than every historical row in the window.
+func latestPerContainer(samples []store.ContainerStatsSample) []store.ContainerStatsSample {
+	newest := map[string]store.ContainerStatsSample{}
+	for _, s := range samples {
+		key := s.OwnerID + "\x00" + s.ContainerID
+		if prev, ok := newest[key]; !ok || s.TS > prev.TS {
+			newest[key] = s
+		}
+	}
+	out := make([]store.ContainerStatsSample, 0, len(newest))
+	for _, s := range newest {
+		out = append(out, s)
+	}
+	return out
+}
+
+// metricValue resolves a rule's metric against a sample. The bool is
+// false when the sample can't be judged for that metric (memory_percent
+// with a zero/unknown limit) so the caller skips it instead of dividing
+// by zero.
+func metricValue(metric string, s store.ContainerStatsSample) (float64, bool) {
+	switch metric {
+	case store.MetricCPUPercent:
+		return s.CPUPercent, true
+	case store.MetricMemoryPercent:
+		if s.MemoryLimit <= 0 {
+			return 0, false
+		}
+		return float64(s.MemoryUsage) / float64(s.MemoryLimit) * 100, true
+	case store.MetricMemoryBytes:
+		return float64(s.MemoryUsage), true
+	default:
+		return 0, false
+	}
+}
+
+// breached returns whether value crosses threshold per the comparator.
+func breached(comparator string, value, threshold float64) bool {
+	switch comparator {
+	case store.MetricComparatorGT:
+		return value > threshold
+	case store.MetricComparatorLT:
+		return value < threshold
+	default:
+		return false
+	}
+}
+
+// cooldownKey is the per-rule-per-workload cooldown key.
+func cooldownKey(ruleID int64, ownerID string) string {
+	return fmt.Sprintf("%d:%s", ruleID, ownerID)
+}
+
+func (m *Manager) coolingDown(rule store.MetricAlertRule, ownerID string, now time.Time) bool {
+	if rule.CooldownSeconds <= 0 {
+		return false
+	}
+	m.mu.Lock()
+	defer m.mu.Unlock()
+	last, ok := m.lastFired[cooldownKey(rule.ID, ownerID)]
+	if !ok {
+		return false
+	}
+	return now.Sub(last) < time.Duration(rule.CooldownSeconds)*time.Second
+}
+
+func (m *Manager) recordFire(rule store.MetricAlertRule, ownerID string, now time.Time) {
+	m.mu.Lock()
+	m.lastFired[cooldownKey(rule.ID, ownerID)] = now
+	m.mu.Unlock()
+}
+
+// emit persists the breach as an event_log row and publishes it on the
+// bus. WorkloadID routes the alert to that app's activity timeline.
+// Metadata is JSON-marshalled (never string-concatenated). Any
+// marshal/insert failure is logged and skipped — emitting must never
+// crash the loop.
+func (m *Manager) emit(rule store.MetricAlertRule, sample store.ContainerStatsSample, value float64) {
+	message := formatMessage(rule, value)
+	meta := map[string]any{
+		"workload_id": sample.OwnerID,
+		"rule":        rule.Name,
+		"metric":      rule.Metric,
+		"value":       value,
+		"threshold":   rule.Threshold,
+		"comparator":  rule.Comparator,
+	}
+	metaJSON, err := json.Marshal(meta)
+	if err != nil {
+		slog.Error("metricalert: marshal metadata", "rule", rule.Name, "error", err)
+		return
+	}
+	severity := rule.Severity
+	if severity == "" {
+		severity = store.LogScanSeverityWarn
+	}
+	evt, err := m.sink.InsertEvent(store.EventLog{
+		Source:     eventSource,
+		Severity:   severity,
+		Message:    message,
+		WorkloadID: sample.OwnerID,
+		Metadata:   string(metaJSON),
+	})
+	if err != nil {
+		slog.Error("metricalert: persist event", "rule", rule.Name, "error", err)
+		return
+	}
+	if m.pub != nil {
+		m.pub.Publish(events.Event{
+			Type: events.EventLog,
+			Payload: events.EventLogPayload{
+				ID:         evt.ID,
+				Source:     eventSource,
+				WorkloadID: sample.OwnerID,
+				Severity:   severity,
+				Message:    message,
+				Metadata:   string(metaJSON),
+				CreatedAt:  evt.CreatedAt,
+			},
+		})
+	}
+}
+
+// formatMessage builds a concise, human, secret-free breach line. The
+// only operator-supplied text is rule.Name; the rest are numbers and
+// fixed labels.
+func formatMessage(rule store.MetricAlertRule, value float64) string {
+	label, unit := metricLabelUnit(rule.Metric)
+	word := comparatorWord(rule.Comparator)
+	return fmt.Sprintf("%s: %s is %.0f%s (threshold %s %.0f%s)",
+		rule.Name, label, value, unit, word, rule.Threshold, unit)
+}
+
+func metricLabelUnit(metric string) (label, unit string) {
+	switch metric {
+	case store.MetricCPUPercent:
+		return "CPU", "%"
+	case store.MetricMemoryPercent:
+		return "Memory", "%"
+	case store.MetricMemoryBytes:
+		return "Memory", " bytes"
+	default:
+		return metric, ""
+	}
+}
+
+func comparatorWord(comparator string) string {
+	switch comparator {
+	case store.MetricComparatorGT:
+		return ">"
+	case store.MetricComparatorLT:
+		return "<"
+	default:
+		return comparator
+	}
+}
@@ -0,0 +1,284 @@
+package metricalert
+
+import (
+	"testing"
+	"time"
+
+	"github.com/alexei/tinyforge/internal/events"
+	"github.com/alexei/tinyforge/internal/store"
+)
+
+// --- fakes -----------------------------------------------------------
+
+type fakeRules struct {
+	rules []store.MetricAlertRule
+	err   error
+}
+
+func (f *fakeRules) ListMetricAlertRules() ([]store.MetricAlertRule, error) {
+	return f.rules, f.err
+}
+
+type fakeSamples struct {
+	samples []store.ContainerStatsSample
+	err     error
+	since   int64 // captured arg of the last call
+}
+
+func (f *fakeSamples) ListAllRecentContainerStatsSamples(sinceTS int64) ([]store.ContainerStatsSample, error) {
+	f.since = sinceTS
+	return f.samples, f.err
+}
+
+type recordedEvent struct {
+	evt store.EventLog
+}
+
+type fakeSink struct {
+	events []recordedEvent
+	err    error
+	nextID int64
+}
+
+func (f *fakeSink) InsertEvent(e store.EventLog) (store.EventLog, error) {
+	if f.err != nil {
+		return store.EventLog{}, f.err
+	}
+	f.nextID++
+	e.ID = f.nextID
+	e.CreatedAt = "2026-05-29T00:00:00Z"
+	f.events = append(f.events, recordedEvent{evt: e})
+	return e, nil
+}
+
+type fakePublisher struct {
+	published []events.Event
+}
+
+func (f *fakePublisher) Publish(e events.Event) {
+	f.published = append(f.published, e)
+}
+
+func newManager(rules []store.MetricAlertRule, samples []store.ContainerStatsSample) (*Manager, *fakeSink, *fakePublisher) {
+	sink := &fakeSink{}
+	pub := &fakePublisher{}
+	m := New(&fakeRules{rules: rules}, &fakeSamples{samples: samples}, sink, pub)
+	return m, sink, pub
+}
+
+// --- tests -----------------------------------------------------------
+
+func TestEvaluate_BreachEmits(t *testing.T) {
+	rules := []store.MetricAlertRule{{
+		ID: 1, Name: "cpu-hot", Metric: store.MetricCPUPercent,
+		Comparator: store.MetricComparatorGT, Threshold: 80, Severity: "error",
+		CooldownSeconds: 300, Enabled: true,
+	}}
+	samples := []store.ContainerStatsSample{{
+		ContainerID: "c1", OwnerID: "w1", OwnerType: "instance", TS: 100, CPUPercent: 95,
+	}}
+	m, sink, pub := newManager(rules, samples)
+
+	m.evaluate(time.Unix(200, 0))
+
+	if len(sink.events) != 1 {
+		t.Fatalf("expected 1 event, got %d", len(sink.events))
+	}
+	got := sink.events[0].evt
+	if got.Source != "metric_alert" {
+		t.Errorf("source = %q, want metric_alert", got.Source)
+	}
+	if got.Severity != "error" {
+		t.Errorf("severity = %q, want error", got.Severity)
+	}
+	if got.WorkloadID != "w1" {
+		t.Errorf("workload_id = %q, want w1", got.WorkloadID)
+	}
+	if got.Metadata == "" || got.Metadata == "{}" {
+		t.Errorf("metadata should be populated JSON, got %q", got.Metadata)
+	}
+	if len(pub.published) != 1 {
+		t.Fatalf("expected 1 published event, got %d", len(pub.published))
+	}
+	payload, ok := pub.published[0].Payload.(events.EventLogPayload)
+	if !ok {
+		t.Fatalf("published payload is not EventLogPayload")
+	}
+	if payload.WorkloadID != "w1" || payload.Source != "metric_alert" {
+		t.Errorf("payload workload/source mismatch: %+v", payload)
+	}
+}
+
+func TestEvaluate_NoBreachNoEmit(t *testing.T) {
+	rules := []store.MetricAlertRule{{
+		ID: 1, Name: "cpu-hot", Metric: store.MetricCPUPercent,
+		Comparator: store.MetricComparatorGT, Threshold: 80, Enabled: true,
+	}}
+	samples := []store.ContainerStatsSample{{
+		ContainerID: "c1", OwnerID: "w1", TS: 100, CPUPercent: 10,
+	}}
+	m, sink, _ := newManager(rules, samples)
+
+	m.evaluate(time.Unix(200, 0))
+
+	if len(sink.events) != 0 {
+		t.Fatalf("expected no events for non-breach, got %d", len(sink.events))
+	}
+}
+
+func TestEvaluate_DisabledRuleSkipped(t *testing.T) {
+	rules := []store.MetricAlertRule{{
+		ID: 1, Name: "cpu-hot", Metric: store.MetricCPUPercent,
+		Comparator: store.MetricComparatorGT, Threshold: 80, Enabled: false,
+	}}
+	samples := []store.ContainerStatsSample{{ContainerID: "c1", OwnerID: "w1", TS: 100, CPUPercent: 95}}
+	m, sink, _ := newManager(rules, samples)
+
+	m.evaluate(time.Unix(200, 0))
+
+	if len(sink.events) != 0 {
+		t.Fatalf("disabled rule should not emit, got %d", len(sink.events))
+	}
+}
+
+func TestEvaluate_PerWorkloadScoping(t *testing.T) {
+	rules := []store.MetricAlertRule{{
+		ID: 1, Name: "w2-only", WorkloadID: "w2", Metric: store.MetricCPUPercent,
+		Comparator: store.MetricComparatorGT, Threshold: 80, Enabled: true,
+	}}
+	samples := []store.ContainerStatsSample{
+		{ContainerID: "c1", OwnerID: "w1", TS: 100, CPUPercent: 95}, // breach but wrong workload
+		{ContainerID: "c2", OwnerID: "w2", TS: 100, CPUPercent: 95}, // breach, correct workload
+	}
+	m, sink, _ := newManager(rules, samples)
+
+	m.evaluate(time.Unix(200, 0))
+
+	if len(sink.events) != 1 {
+		t.Fatalf("expected 1 event (only w2), got %d", len(sink.events))
+	}
+	if sink.events[0].evt.WorkloadID != "w2" {
+		t.Errorf("event should be scoped to w2, got %q", sink.events[0].evt.WorkloadID)
+	}
+}
+
+func TestEvaluate_GlobalRuleMatchesAll(t *testing.T) {
+	rules := []store.MetricAlertRule{{
+		ID: 1, Name: "global", WorkloadID: "", Metric: store.MetricCPUPercent,
+		Comparator: store.MetricComparatorGT, Threshold: 80, Enabled: true,
+	}}
+	samples := []store.ContainerStatsSample{
+		{ContainerID: "c1", OwnerID: "w1", TS: 100, CPUPercent: 95},
+		{ContainerID: "c2", OwnerID: "w2", TS: 100, CPUPercent: 95},
+	}
+	m, sink, _ := newManager(rules, samples)
+
+	m.evaluate(time.Unix(200, 0))
+
+	if len(sink.events) != 2 {
+		t.Fatalf("global rule should fire for both workloads, got %d", len(sink.events))
+	}
+}
+
+func TestEvaluate_MemoryPercentDivByZeroSkip(t *testing.T) {
+	rules := []store.MetricAlertRule{{
+		ID: 1, Name: "mem", Metric: store.MetricMemoryPercent,
+		Comparator: store.MetricComparatorGT, Threshold: 50, Enabled: true,
+	}}
+	samples := []store.ContainerStatsSample{{
+		ContainerID: "c1", OwnerID: "w1", TS: 100, MemoryUsage: 1000, MemoryLimit: 0,
+	}}
+	m, sink, _ := newManager(rules, samples)
+
+	m.evaluate(time.Unix(200, 0))
+
+	if len(sink.events) != 0 {
+		t.Fatalf("zero memory limit should be skipped for percent rule, got %d", len(sink.events))
+	}
+}
+
+func TestEvaluate_MemoryPercentBreaches(t *testing.T) {
+	rules := []store.MetricAlertRule{{
+		ID: 1, Name: "mem", Metric: store.MetricMemoryPercent,
+		Comparator: store.MetricComparatorGT, Threshold: 90, Enabled: true,
+	}}
+	samples := []store.ContainerStatsSample{{
+		ContainerID: "c1", OwnerID: "w1", TS: 100, MemoryUsage: 950, MemoryLimit: 1000, // 95%
+	}}
+	m, sink, _ := newManager(rules, samples)
+
+	m.evaluate(time.Unix(200, 0))
+
+	if len(sink.events) != 1 {
+		t.Fatalf("95%% should breach 90%% threshold, got %d events", len(sink.events))
+	}
+}
+
+func TestEvaluate_CooldownSuppressesSecondEmit(t *testing.T) {
+	rules := []store.MetricAlertRule{{
+		ID: 1, Name: "cpu-hot", Metric: store.MetricCPUPercent,
+		Comparator: store.MetricComparatorGT, Threshold: 80, CooldownSeconds: 300, Enabled: true,
+	}}
+	samples := []store.ContainerStatsSample{{ContainerID: "c1", OwnerID: "w1", TS: 100, CPUPercent: 95}}
+	m, sink, _ := newManager(rules, samples)
+
+	base := time.Unix(1000, 0)
+	m.evaluate(base)
+	// 10s later — still inside the 300s cooldown window.
+	m.evaluate(base.Add(10 * time.Second))
+
+	if len(sink.events) != 1 {
+		t.Fatalf("cooldown should suppress second emit, got %d events", len(sink.events))
+	}
+
+	// Past the window — should fire again.
+	m.evaluate(base.Add(301 * time.Second))
+	if len(sink.events) != 2 {
+		t.Fatalf("should re-fire after cooldown elapses, got %d events", len(sink.events))
+	}
+}
+
+func TestEvaluate_LatestSamplePerContainer(t *testing.T) {
+	// Two samples for the same container: an old non-breaching reading
+	// and a newer breaching one. Only the freshest should be judged.
+	rules := []store.MetricAlertRule{{
+		ID: 1, Name: "cpu-hot", Metric: store.MetricCPUPercent,
+		Comparator: store.MetricComparatorGT, Threshold: 80, Enabled: true,
+	}}
+	samples := []store.ContainerStatsSample{
+		{ContainerID: "c1", OwnerID: "w1", TS: 100, CPUPercent: 10},
+		{ContainerID: "c1", OwnerID: "w1", TS: 150, CPUPercent: 95},
+	}
+	m, sink, _ := newManager(rules, samples)
+
+	m.evaluate(time.Unix(200, 0))
+
+	if len(sink.events) != 1 {
+		t.Fatalf("expected exactly 1 event from freshest sample, got %d", len(sink.events))
+	}
+}
+
+func TestEvaluate_LessThanComparator(t *testing.T) {
+	rules := []store.MetricAlertRule{{
+		ID: 1, Name: "cpu-idle", Metric: store.MetricCPUPercent,
+		Comparator: store.MetricComparatorLT, Threshold: 5, Enabled: true,
+	}}
+	samples := []store.ContainerStatsSample{{ContainerID: "c1", OwnerID: "w1", TS: 100, CPUPercent: 1}}
+	m, sink, _ := newManager(rules, samples)
+
+	m.evaluate(time.Unix(200, 0))
+
+	if len(sink.events) != 1 {
+		t.Fatalf("1%% < 5%% threshold should breach lt rule, got %d events", len(sink.events))
+	}
+}
+
+func TestEvaluate_NoRulesNoFetch(t *testing.T) {
+	// With no rules there's nothing to do; we shouldn't even query samples.
+	samplesSrc := &fakeSamples{samples: nil}
+	m := New(&fakeRules{rules: nil}, samplesSrc, &fakeSink{}, &fakePublisher{})
+	m.evaluate(time.Unix(200, 0))
+	if samplesSrc.since != 0 {
+		t.Errorf("samples should not be queried when there are no rules")
+	}
+}
@@ -0,0 +1,250 @@
+// Package metrics provides a minimal Prometheus text-format exposition
+// of Tinyforge's operational counters. We deliberately do NOT import the
+// official client_golang library: the metrics set here is small, the text
+// format is simple, and avoiding the dependency keeps `tinyforge` a fast
+// single-binary install.
+//
+// Every counter is a sync/atomic.Int64 — cheap, lock-free, and safe to
+// touch from any goroutine. Histograms / gauges aren't modeled yet; the
+// few we need (request latency p50/p99) live downstream of slog and can
+// be added when the operator actually wants them.
+package metrics
+
+import (
+	"fmt"
+	"io"
+	"log/slog"
+	"sort"
+	"strings"
+	"sync"
+	"sync/atomic"
+)
+
+// Registry holds the process-wide counter set. A single zero-value
+// Registry is ready to use — see DefaultRegistry below for the
+// recommended way to grab the global handle.
+type Registry struct {
+	mu       sync.RWMutex
+	counters map[string]*counter
+}
+
+type counter struct {
+	name   string
+	help   string
+	labels []string // label names, ordered as declared at registration
+	series map[string]*atomic.Int64
+	// seriesMu only protects insertion of new label tuples — increments
+	// on existing tuples are lock-free via the atomic.
+	seriesMu sync.Mutex
+}
+
+// DefaultRegistry is the process-wide registry. All Tinyforge metrics
+// register against it. Tests can instantiate their own Registry.
+var DefaultRegistry = newRegistry()
+
+func newRegistry() *Registry {
+	return &Registry{counters: make(map[string]*counter)}
+}
+
+// NewCounter declares a counter on the default registry. Call once at
+// package init or during NewServer; subsequent calls with the same name
+// return the existing counter so re-registration is safe.
+//
+// label names define the dimensions; calls to Inc must pass values in
+// the same order. Use the empty slice for label-less counters.
+func NewCounter(name, help string, labels ...string) *Counter {
+	return DefaultRegistry.NewCounter(name, help, labels...)
+}
+
+// NewCounter on a specific Registry — useful in tests.
+func (r *Registry) NewCounter(name, help string, labels ...string) *Counter {
+	r.mu.Lock()
+	defer r.mu.Unlock()
+	if c, ok := r.counters[name]; ok {
+		return &Counter{c: c}
+	}
+	c := &counter{
+		name:   name,
+		help:   help,
+		labels: append([]string(nil), labels...),
+		series: make(map[string]*atomic.Int64),
+	}
+	r.counters[name] = c
+	return &Counter{c: c}
+}
+
+// Counter is the public handle returned by NewCounter. Pass it around as
+// a value — the underlying state lives on the registry.
+type Counter struct {
+	c *counter
+}
+
+// Inc atomically increments the counter for the given label values.
+// Passing the wrong number of values is a programmer error; we surface
+// it as a panic during testing rather than silently aggregating into a
+// bogus series.
+func (c Counter) Inc(labelValues ...string) {
+	c.Add(1, labelValues...)
+}
+
+// Add atomically adds delta. Negative delta is rejected (counters are
+// monotonic by definition).
+func (c Counter) Add(delta int64, labelValues ...string) {
+	if delta < 0 {
+		return
+	}
+	if len(labelValues) != len(c.c.labels) {
+		// Programmer error. This used to panic to surface the bug, but Add
+		// runs on hot paths (HTTP middleware, deploy dispatch) and several
+		// callers are off the request goroutine, where a panic would take
+		// down the whole process rather than a single request. Log loudly
+		// and drop the sample so a mislabeled call site can never crash the
+		// server; the bug still shows up immediately in the logs and in
+		// tests via the error output.
+		slog.Error("metrics: label count mismatch — dropping sample",
+			"counter", c.c.name, "want", len(c.c.labels), "got", len(labelValues))
+		return
+	}
+	key := encodeKey(labelValues)
+	c.c.seriesMu.Lock()
+	v, ok := c.c.series[key]
+	if !ok {
+		v = new(atomic.Int64)
+		c.c.series[key] = v
+	}
+	c.c.seriesMu.Unlock()
+	v.Add(delta)
+}
+
+// encodeKey joins label values with a 0x1f separator. Prometheus label
+// values may contain anything except `"` and `\n`, which we escape on
+// exposition only — the key here is just a map index.
+func encodeKey(values []string) string {
+	return strings.Join(values, "\x1f")
+}
+
+// WritePrometheus dumps the registry in the text exposition format
+// Prometheus / VictoriaMetrics / OpenMetrics understands. Stable
+// ordering: counters alphabetical by name; series alphabetical by
+// encoded label tuple.
+func (r *Registry) WritePrometheus(w io.Writer) error {
+	r.mu.RLock()
+	names := make([]string, 0, len(r.counters))
+	for n := range r.counters {
+		names = append(names, n)
+	}
+	r.mu.RUnlock()
+	sort.Strings(names)
+
+	for _, name := range names {
+		r.mu.RLock()
+		c := r.counters[name]
+		r.mu.RUnlock()
+		if err := writeCounter(w, c); err != nil {
+			return err
+		}
+	}
+	return nil
+}
+
+func writeCounter(w io.Writer, c *counter) error {
+	if _, err := fmt.Fprintf(w, "# HELP %s %s\n# TYPE %s counter\n", c.name, escapeHelp(c.help), c.name); err != nil {
+		return err
+	}
+	// Snapshot the series map under a SINGLE lock acquisition. The
+	// previous shape acquired+released seriesMu twice per emitted
+	// series (once for the key list, once per Load), contending with
+	// every hot-path Inc on the HTTP request path. The *atomic.Int64
+	// pointers are stable for the lifetime of the registry (we never
+	// delete entries), so reading them after the unlock is safe.
+	type sample struct {
+		key string
+		val *atomic.Int64
+	}
+	c.seriesMu.Lock()
+	samples := make([]sample, 0, len(c.series))
+	for k, v := range c.series {
+		samples = append(samples, sample{k, v})
+	}
+	c.seriesMu.Unlock()
+
+	sort.Slice(samples, func(i, j int) bool { return samples[i].key < samples[j].key })
+
+	for _, s := range samples {
+		val := s.val.Load()
+		labels := decodeKey(s.key, c.labels)
+		if labels == "" {
+			if _, err := fmt.Fprintf(w, "%s %d\n", c.name, val); err != nil {
+				return err
+			}
+			continue
+		}
+		if _, err := fmt.Fprintf(w, "%s{%s} %d\n", c.name, labels, val); err != nil {
+			return err
+		}
+	}
+	return nil
+}
+
+func decodeKey(key string, names []string) string {
+	if key == "" || len(names) == 0 {
+		return ""
+	}
+	values := strings.Split(key, "\x1f")
+	if len(values) != len(names) {
+		// Should not happen — encodeKey/decode are symmetric.
+		return ""
+	}
+	parts := make([]string, len(names))
+	for i, n := range names {
+		parts[i] = fmt.Sprintf(`%s="%s"`, n, escapeLabelValue(values[i]))
+	}
+	return strings.Join(parts, ",")
+}
+
+func escapeHelp(s string) string {
+	r := strings.NewReplacer("\\", "\\\\", "\n", "\\n")
+	return r.Replace(s)
+}
+
+func escapeLabelValue(s string) string {
+	r := strings.NewReplacer("\\", "\\\\", "\n", "\\n", `"`, `\"`)
+	return r.Replace(s)
+}
+
+// ── Pre-declared counters ────────────────────────────────────────────
+//
+// These are the counters Tinyforge surfaces to operators. Adding more is
+// a one-line NewCounter call at the call site — no central catalogue,
+// just keep names lowercase_snake with the `tinyforge_` prefix.
+
+var (
+	HTTPRequestsTotal = NewCounter(
+		"tinyforge_http_requests_total",
+		"Total HTTP requests handled, partitioned by method and outcome class.",
+		"method", "status_class",
+	)
+	DeploysTotal = NewCounter(
+		"tinyforge_deploys_total",
+		"Total deploys dispatched, partitioned by source kind and outcome.",
+		"source_kind", "outcome",
+	)
+	WebhookDeliveriesTotal = NewCounter(
+		"tinyforge_webhook_deliveries_total",
+		"Total inbound webhook deliveries, partitioned by outcome.",
+		"outcome",
+	)
+	SchedulerTicksTotal = NewCounter(
+		"tinyforge_scheduler_ticks_total",
+		"Total scheduler ticks. The dispatched counter is the success measure.",
+	)
+	SchedulerDispatchedTotal = NewCounter(
+		"tinyforge_scheduler_dispatched_total",
+		"Triggers actually dispatched by the scheduler.",
+	)
+	OutboundNotifyTotal = NewCounter(
+		"tinyforge_outbound_notify_total",
+		"Outbound notification dispatch attempts, partitioned by outcome.",
+		"outcome",
+	)
+)
@@ -16,6 +16,8 @@ import (
 	"time"

 	"github.com/google/uuid"
+
+	"github.com/alexei/tinyforge/internal/metrics"
 )

 // Event represents a deployment / site-sync notification payload.
@@ -83,17 +85,68 @@ type TestResult struct {
 // Notifications are fire-and-forget by default — failures are logged but do
 // not propagate. SendSyncForTest is the exception, used only by the manual
 // test endpoint.
+//
+// outboundSem caps the number of in-flight outbound notifications. Without
+// it a single burst (e.g. 1000 event triggers firing on a noisy log scan)
+// would spawn 1000 simultaneous TCP connections, which both DoSes the
+// receiver and exhausts local FDs.
 type Notifier struct {
-	httpClient *http.Client
-	wg         sync.WaitGroup
+	httpClient  *http.Client
+	wg          sync.WaitGroup
+	outboundSem chan struct{}
 }

+// maxOutboundNotifications bounds the in-flight outbound webhook fan-out.
+// Sized to keep small bursts non-blocking while preventing a runaway storm
+// from starving the rest of the process. Tunable later via settings if any
+// operator legitimately needs more concurrency.
+const maxOutboundNotifications = 32
+
 // New creates a Notifier with sensible defaults.
 func New() *Notifier {
+	// Transport with bounded host pooling so a slow receiver cannot pin
+	// arbitrarily many sockets open. MaxConnsPerHost mirrors the worker
+	// pool size; idle pruning keeps long-lived processes from holding
+	// stale TCP entries indefinitely.
+	//
+	// NOTE: we deliberately do NOT apply the staticsite SSRF dialer here.
+	// Notification URLs are admin-configured, and an admin already has
+	// Docker-socket (host-root-equivalent) access, so the SSRF surface adds
+	// nothing they couldn't already reach. Blocking loopback/private targets
+	// would instead break the common self-hosted pattern of notifying a
+	// same-host sidecar/bridge (e.g. service-to-notification-bridge on
+	// 127.0.0.1). See the security review (rated LOW / out of trust boundary).
+	tr := &http.Transport{
+		MaxIdleConns:        64,
+		MaxIdleConnsPerHost: 8,
+		MaxConnsPerHost:     maxOutboundNotifications,
+		IdleConnTimeout:     90 * time.Second,
+	}
 	return &Notifier{
 		httpClient: &http.Client{
-			Timeout: 10 * time.Second,
+			Timeout:   10 * time.Second,
+			Transport: tr,
 		},
+		outboundSem: make(chan struct{}, maxOutboundNotifications),
+	}
+}
+
+// acquireSlot reserves an outbound slot, respecting ctx so a backed-up
+// queue cannot starve a request that already has its own deadline.
+func (n *Notifier) acquireSlot(ctx context.Context) bool {
+	select {
+	case n.outboundSem <- struct{}{}:
+		return true
+	case <-ctx.Done():
+		return false
+	}
+}
+
+func (n *Notifier) releaseSlot() {
+	select {
+	case <-n.outboundSem:
+	default:
+		// Drained during shutdown — never block.
 	}
 }

@@ -128,8 +181,15 @@ func (n *Notifier) SendSigned(webhookURL, secret string, tier Tier, event Event)
 	n.wg.Add(1)
 	go func() {
 		defer n.wg.Done()
-		ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
+		ctx, cancel := context.WithTimeout(context.Background(), 15*time.Second)
 		defer cancel()
+		if !n.acquireSlot(ctx) {
+			slog.Warn("notify: dropped — outbound queue saturated",
+				"tier", tier, "host", safeHost(webhookURL), "delivery", delivery, "event", event.Type)
+			metrics.OutboundNotifyTotal.Inc("dropped")
+			return
+		}
+		defer n.releaseSlot()

 		_, err := n.doSend(ctx, webhookURL, secret, tier, delivery, event)
 		// URL host only — never log the secret or full URL with user-info.
@@ -138,11 +198,13 @@ func (n *Notifier) SendSigned(webhookURL, secret string, tier Tier, event Event)
 			slog.Warn("notify: webhook send failed",
 				"tier", tier, "host", host, "delivery", delivery,
 				"event", event.Type, "signed", secret != "", "error", err)
+			metrics.OutboundNotifyTotal.Inc("failure")
 			return
 		}
 		slog.Info("notify: webhook dispatched",
 			"tier", tier, "host", host, "delivery", delivery,
 			"event", event.Type, "signed", secret != "")
+		metrics.OutboundNotifyTotal.Inc("success")
 	}()
 }

@@ -166,8 +228,15 @@ func (n *Notifier) SendPayload(webhookURL, secret, eventType string, payload any
 	n.wg.Add(1)
 	go func() {
 		defer n.wg.Done()
-		ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
+		ctx, cancel := context.WithTimeout(context.Background(), 15*time.Second)
 		defer cancel()
+		if !n.acquireSlot(ctx) {
+			slog.Warn("notify: dropped trigger payload — outbound queue saturated",
+				"tier", TierEventTrigger, "host", safeHost(webhookURL), "delivery", delivery, "event", eventType)
+			metrics.OutboundNotifyTotal.Inc("dropped")
+			return
+		}
+		defer n.releaseSlot()

 		_, err := n.doSendRaw(ctx, webhookURL, secret, TierEventTrigger, delivery, eventType, timestamp, payload)
 		host := safeHost(webhookURL)
@@ -175,11 +244,13 @@ func (n *Notifier) SendPayload(webhookURL, secret, eventType string, payload any
 			slog.Warn("notify: trigger webhook send failed",
 				"tier", TierEventTrigger, "host", host, "delivery", delivery,
 				"event", eventType, "signed", secret != "", "error", err)
+			metrics.OutboundNotifyTotal.Inc("failure")
 			return
 		}
 		slog.Info("notify: trigger webhook dispatched",
 			"tier", TierEventTrigger, "host", host, "delivery", delivery,
 			"event", eventType, "signed", secret != "")
+		metrics.OutboundNotifyTotal.Inc("success")
 	}()
 }

@@ -17,6 +17,7 @@ import (
 	"context"
 	"encoding/json"
 	"errors"
+	"fmt"
 	"log/slog"
 	"sync"
 	"time"
@@ -110,17 +111,37 @@ func (r *Reconciler) ReconcileOnce(ctx context.Context) error {
 	if err != nil {
 		return err
 	}
+
+	// Load every workload ONCE per tick and index by ID. This replaces both
+	// the former N+1 GetWorkloadByID (one DB read per container) in the
+	// upsert loop and the second ListWorkloads("") in the plugin pass: net 1
+	// query per tick, 0 GetWorkloadByID.
+	//
+	// On error we return BEFORE the upsert loop and leave state untouched
+	// this tick (the next tick retries). We must NOT proceed with an empty
+	// map and fall through to markMissingRows: with no container resolving,
+	// `seen` would be empty and markMissingRows would flip EVERY live row to
+	// 'missing'. Aborting early is the safe choice.
+	rows, err := r.store.ListWorkloads("")
+	if err != nil {
+		return fmt.Errorf("reconciler: list workloads: %w", err)
+	}
+	byID := make(map[string]store.Workload, len(rows))
+	for _, w := range rows {
+		byID[w.ID] = w
+	}
+
 	seen := make(map[string]struct{}, len(items)) // container row IDs we touched

 	for _, item := range items {
-		rowID := r.upsertFromItem(item)
+		rowID := r.upsertFromItem(item, byID)
 		if rowID != "" {
 			seen[rowID] = struct{}{}
 		}
 	}

 	r.markMissingRows(seen)
-	r.reconcilePluginWorkloads(ctx)
+	r.reconcilePluginWorkloads(ctx, rows)
 	return nil
 }

@@ -137,15 +158,13 @@ func (r *Reconciler) ReconcileOnce(ctx context.Context) error {
 //
 // No-op when the plugin dispatcher hasn't been wired (boot-time race,
 // disabled deployments, tests).
-func (r *Reconciler) reconcilePluginWorkloads(ctx context.Context) {
+//
+// rows is the workload set already loaded once by ReconcileOnce — passed
+// through rather than re-queried so a tick costs a single ListWorkloads.
+func (r *Reconciler) reconcilePluginWorkloads(ctx context.Context, rows []store.Workload) {
 	if r.plugins == nil {
 		return
 	}
-	rows, err := r.store.ListWorkloads("")
-	if err != nil {
-		slog.Warn("reconciler: list workloads for plugin pass", "error", err)
-		return
-	}
 	for _, w := range rows {
 		if w.SourceKind == "" {
 			continue
@@ -214,9 +233,9 @@ func (r *Reconciler) loop(ctx context.Context) {
 // After the hard cutover only the canonical tinyforge.workload.id label
 // path is honored — every Source plugin labels its containers with the
 // workload identity at create time.
-func (r *Reconciler) upsertFromItem(item docker.ReconcileItem) string {
+func (r *Reconciler) upsertFromItem(item docker.ReconcileItem, byID map[string]store.Workload) string {
 	if id := item.Labels[docker.LabelWorkloadID]; id != "" {
-		return r.upsertByWorkloadLabel(item, id)
+		return r.upsertByWorkloadLabel(item, id, byID)
 	}
 	return ""
 }
@@ -233,9 +252,9 @@ func (r *Reconciler) upsertFromItem(item docker.ReconcileItem) string {
 // known workload row is silently ignored. Anyone with Docker socket access
 // could otherwise spawn a container with a forged label and steal the
 // canonical slot for an existing workload.
-func (r *Reconciler) upsertByWorkloadLabel(item docker.ReconcileItem, workloadID string) string {
-	w, err := r.store.GetWorkloadByID(workloadID)
-	if err != nil {
+func (r *Reconciler) upsertByWorkloadLabel(item docker.ReconcileItem, workloadID string, byID map[string]store.Workload) string {
+	w, ok := byID[workloadID]
+	if !ok {
 		// Forged or stale label — log once at debug; tick rate keeps logs quiet.
 		slog.Debug("reconciler: unknown workload_id label", "workload_id", workloadID, "container_id", item.ID)
 		return ""
@@ -257,6 +257,138 @@ func TestReconcileSkipsProjectInsertWithoutDeployerRow(t *testing.T) {
 	}
 }

+// TestReconcileBatchingPreservesBehavior locks Fix A: loading all workloads
+// once per tick (and resolving labels from that in-memory map instead of an
+// N+1 GetWorkloadByID) must produce the same outcome as the per-container
+// lookup did. With multiple containers across multiple workloads plus a forged
+// label and a stale row, after one ReconcileOnce: known-workload containers
+// are upserted with the snapshot State, the forged-label container is skipped,
+// and the absent stale row is flipped to missing.
+func TestReconcileBatchingPreservesBehavior(t *testing.T) {
+	st := newTestStore(t)
+
+	w1 := makeWorkload(t, st, "batch-a", "stack")
+	w2 := makeWorkload(t, st, "batch-b", "stack")
+
+	// A stale row for w2 whose container is gone — must be marked missing.
+	if err := st.UpsertContainer(store.Container{
+		ID: w2.ID + ":old", WorkloadID: w2.ID, WorkloadKind: "stack",
+		Role: "old", ContainerID: "docker-vanished", State: "running",
+	}); err != nil {
+		t.Fatalf("seed stale row: %v", err)
+	}
+
+	fake := &fakeDocker{items: []docker.ReconcileItem{
+		{
+			ID: "docker-a1", Name: "batch-a-web-1", Image: "nginx:1.27", State: "running",
+			Labels: map[string]string{
+				docker.LabelManaged:      "true",
+				docker.LabelWorkloadID:   w1.ID,
+				docker.LabelWorkloadKind: "stack",
+				docker.LabelRole:         "web",
+			},
+			Ports: []uint16{8080},
+		},
+		{
+			ID: "docker-b1", Name: "batch-b-api-1", Image: "redis:7", State: "exited",
+			Labels: map[string]string{
+				docker.LabelManaged:      "true",
+				docker.LabelWorkloadID:   w2.ID,
+				docker.LabelWorkloadKind: "stack",
+				docker.LabelRole:         "api",
+			},
+		},
+		{
+			// Forged label — no such workload. Must be skipped entirely.
+			ID: "docker-evil", Name: "evil", Image: "nginx", State: "running",
+			Labels: map[string]string{
+				docker.LabelManaged:      "true",
+				docker.LabelWorkloadID:   "wl-forged",
+				docker.LabelWorkloadKind: "stack",
+				docker.LabelRole:         "web",
+			},
+		},
+	}}
+
+	r := New(st, fake, 0)
+	if err := r.ReconcileOnce(context.Background()); err != nil {
+		t.Fatalf("ReconcileOnce: %v", err)
+	}
+
+	// w1: one row, bound to docker-a1, running.
+	w1Rows, _ := st.ListContainersByWorkload(w1.ID)
+	if len(w1Rows) != 1 {
+		t.Fatalf("w1: expected 1 row, got %d", len(w1Rows))
+	}
+	if w1Rows[0].ContainerID != "docker-a1" || w1Rows[0].State != "running" || w1Rows[0].Role != "web" {
+		t.Fatalf("w1 row wrong: %+v", w1Rows[0])
+	}
+
+	// w2: the new api container is present (exited→stopped); the stale row is missing.
+	api, _ := st.GetContainerByID(w2.ID + ":api")
+	if api.ContainerID != "docker-b1" || api.State != "stopped" {
+		t.Fatalf("w2 api row wrong: %+v", api)
+	}
+	old, _ := st.GetContainerByID(w2.ID + ":old")
+	if old.State != "missing" {
+		t.Fatalf("w2 stale row should be missing, got %q", old.State)
+	}
+
+	// Forged label produced no row anywhere.
+	all, _ := st.ListContainers(store.ContainerFilter{})
+	for _, c := range all {
+		if c.ContainerID == "docker-evil" {
+			t.Fatalf("forged-label container was adopted: %+v", c)
+		}
+	}
+}
+
+// TestReconcileSyncsImageContainerState locks the Fix B coupling: the generic
+// reconciler upsert pass — NOT image.Reconcile — is what syncs an image
+// container's State from the snapshot. An image container carries the
+// workload_id / kind=image / role=image labels at create time, so a present
+// container's row gets its State written here, proving the per-container
+// inspect formerly in image.Reconcile is redundant.
+func TestReconcileSyncsImageContainerState(t *testing.T) {
+	st := newTestStore(t)
+	w := makeWorkload(t, st, "img", "image")
+
+	// Deployer pre-created the image container row (running). Docker now
+	// reports it exited — the generic pass must sync it to stopped.
+	if err := st.UpsertContainer(store.Container{
+		ID: "img-deploy-uuid", WorkloadID: w.ID, WorkloadKind: "image",
+		Role: "image", ContainerID: "docker-img", State: "running",
+	}); err != nil {
+		t.Fatalf("seed image row: %v", err)
+	}
+
+	fake := &fakeDocker{items: []docker.ReconcileItem{{
+		ID: "docker-img", Image: "ghcr.io/owner/app:v1", State: "exited",
+		Labels: map[string]string{
+			docker.LabelManaged:      "true",
+			docker.LabelWorkloadID:   w.ID,
+			docker.LabelWorkloadKind: "image",
+			docker.LabelRole:         "image",
+		},
+		Ports: []uint16{3000},
+	}}}
+
+	// No plugin reconciler wired — proves the state sync comes from the
+	// generic upsert pass, not from image.Reconcile.
+	r := New(st, fake, 0)
+	if err := r.ReconcileOnce(context.Background()); err != nil {
+		t.Fatalf("ReconcileOnce: %v", err)
+	}
+
+	got, _ := st.GetContainerByID("img-deploy-uuid")
+	if got.State != "stopped" {
+		t.Fatalf("image container state not synced by generic pass: got %q want stopped", got.State)
+	}
+	if got.Port != 3000 || got.ImageRef != "ghcr.io/owner/app:v1" {
+		t.Fatalf("image container docker fields not synced: %+v", got)
+	}
+}
+
 func TestReconcileNormalizesState(t *testing.T) {
 	st := newTestStore(t)
 	w := makeWorkload(t, st, "norm", "stack")
@@ -27,6 +27,7 @@ import (
 	"sync"
 	"time"

+	"github.com/alexei/tinyforge/internal/metrics"
 	"github.com/alexei/tinyforge/internal/store"
 	"github.com/alexei/tinyforge/internal/workload/plugin"
 	"github.com/alexei/tinyforge/internal/workload/plugin/trigger/schedule"
@@ -124,6 +125,7 @@ func (s *Scheduler) loop(ctx context.Context) {
 // TickOnce runs a single sweep. Exposed for tests and for the boot
 // kick. On error per-trigger the loop continues with the next row.
 func (s *Scheduler) TickOnce(ctx context.Context) {
+	metrics.SchedulerTicksTotal.Inc()
 	rows, err := s.store.ListTriggers("schedule")
 	if err != nil {
 		slog.Warn("scheduler: list triggers", "error", err)
@@ -226,5 +228,6 @@ func (s *Scheduler) fire(ctx context.Context, t store.Trigger, now time.Time) {
 		slog.Warn("scheduler: dispatch", "trigger", t.Name, "error", err)
 		return
 	}
+	metrics.SchedulerDispatchedTotal.Inc()
 	slog.Info("scheduler: fired", "trigger", t.Name, "kind", t.Kind, "at", ts)
 }
@@ -92,17 +92,27 @@ func (c *Compose) Ps(ctx context.Context, projectName, yamlPath string) ([]Servi
 }

 // Logs runs `docker compose -p <projectName> logs --no-color --tail=<n> <service>`.
-// If service is empty, logs for all services are returned.
+// If service is empty, logs for all services are returned. The service arg
+// is preceded by `--` so a service name that begins with `-` cannot be
+// re-parsed as a flag by the docker CLI (flag-injection guard).
 func (c *Compose) Logs(ctx context.Context, projectName, service string, tail int) (string, error) {
 	args := []string{"logs", "--no-color", fmt.Sprintf("--tail=%d", tail)}
 	if service != "" {
-		args = append(args, service)
+		args = append(args, "--", service)
 	}
 	return c.run(ctx, projectName, args...)
 }

-// run executes `docker compose -p <projectName> <args...>` and returns combined output.
+// run executes `docker compose -p <projectName> <args...>` and returns
+// combined output. projectName is verified not to begin with `-` because
+// `docker compose -p '--foo'` would otherwise be re-parsed as a flag —
+// the callers already sanitize project names through projectNameSanitizer,
+// but a belt-and-braces refusal here means any future caller cannot
+// accidentally bypass the sanitizer.
 func (c *Compose) run(ctx context.Context, projectName string, args ...string) (string, error) {
+	if projectName == "" || strings.HasPrefix(projectName, "-") {
+		return "", fmt.Errorf("docker compose: refusing project name %q", projectName)
+	}
 	full := append([]string{"compose", "-p", projectName}, args...)
 	cmd := exec.CommandContext(ctx, c.binary, full...)
 	var buf bytes.Buffer
@@ -2,6 +2,7 @@ package stack

 import (
 	"fmt"
+	"strings"

 	"gopkg.in/yaml.v3"
 )
@@ -15,11 +16,25 @@ type ComposeSpec struct {
 }

 // ServiceSpec captures the subset of compose service fields we inspect.
+//
+// All host-escape-adjacent fields are decoded here even though Tinyforge
+// itself never reads them at runtime — surfacing them to Validate() is the
+// only way to *reject* them. Add new fields here when blocking a new
+// escape vector.
 type ServiceSpec struct {
-	Image      string            `yaml:"image,omitempty"`
-	Ports      []any             `yaml:"ports,omitempty"`
-	Labels     map[string]string `yaml:"labels,omitempty"`
-	Privileged bool              `yaml:"privileged,omitempty"`
+	Image       string            `yaml:"image,omitempty"`
+	Build       any               `yaml:"build,omitempty"` // banned — see Validate
+	Ports       []any             `yaml:"ports,omitempty"`
+	Labels      map[string]string `yaml:"labels,omitempty"`
+	Privileged  bool              `yaml:"privileged,omitempty"`
+	Volumes     []any             `yaml:"volumes,omitempty"`
+	NetworkMode string            `yaml:"network_mode,omitempty"`
+	Pid         string            `yaml:"pid,omitempty"`
+	Ipc         string            `yaml:"ipc,omitempty"`
+	UsernsMode  string            `yaml:"userns_mode,omitempty"`
+	CapAdd      []string          `yaml:"cap_add,omitempty"`
+	Devices     []any             `yaml:"devices,omitempty"`
+	SecurityOpt []string          `yaml:"security_opt,omitempty"`
 }

 // Parse decodes YAML into a ComposeSpec. Returns a descriptive error on failure.
@@ -35,10 +50,20 @@ func Parse(yamlText string) (ComposeSpec, error) {
 }

 // Validate enforces Tinyforge-level constraints beyond compose schema validity.
+// All blocked fields below are documented host-escape vectors: any one of
+// them on its own gives the container root on the host. Tinyforge already
+// owns the docker socket, so the threat model is "any admin == host root,"
+// and these blocks raise the bar for any *future* viewer-to-admin
+// escalation as well as honest-mistake guardrails.
+//
 // Current rules:
 //   - No service may set `privileged: true`.
-//   - Every service must declare an image (compose supports build: too, but
-//     Tinyforge v1 disallows building from context to avoid arbitrary-code exec).
+//   - Every service must declare an image (build contexts disallowed).
+//   - No host-IPC / host-PID / host-userns / host networking.
+//   - No `cap_add`, `security_opt`, `devices`.
+//   - `volumes` may not bind-mount the docker socket, /, /etc, /var, /proc,
+//     /sys, /root, or /home — list is conservative; operators with real
+//     bind-mount needs should ship a Source plugin or a dedicated wizard.
 func Validate(spec ComposeSpec) error {
 	for name, svc := range spec.Services {
 		if svc.Privileged {
@@ -47,6 +72,121 @@ func Validate(spec ComposeSpec) error {
 		if svc.Image == "" {
 			return fmt.Errorf("service %q: image is required (build contexts not supported)", name)
 		}
+		if svc.Build != nil {
+			return fmt.Errorf("service %q: build: is not supported (use image:)", name)
+		}
+		if isBlockedNamespaceMode(svc.NetworkMode) {
+			return fmt.Errorf("service %q: network_mode %q is not allowed", name, svc.NetworkMode)
+		}
+		if isBlockedNamespaceMode(svc.Pid) {
+			return fmt.Errorf("service %q: pid: %q is not allowed", name, svc.Pid)
+		}
+		if isBlockedNamespaceMode(svc.Ipc) {
+			return fmt.Errorf("service %q: ipc: %q is not allowed", name, svc.Ipc)
+		}
+		if isHostMode(svc.UsernsMode) {
+			return fmt.Errorf("service %q: userns_mode %q is not allowed", name, svc.UsernsMode)
+		}
+		if len(svc.CapAdd) > 0 {
+			return fmt.Errorf("service %q: cap_add is not allowed", name)
+		}
+		if len(svc.SecurityOpt) > 0 {
+			return fmt.Errorf("service %q: security_opt is not allowed", name)
+		}
+		if len(svc.Devices) > 0 {
+			return fmt.Errorf("service %q: devices is not allowed", name)
+		}
+		for _, v := range svc.Volumes {
+			if host, ok := bindMountHostPath(v); ok {
+				if isBlockedBindMount(host) {
+					return fmt.Errorf("service %q: bind-mounting %q is not allowed", name, host)
+				}
+			}
+		}
 	}
 	return nil
 }
+
+// isHostMode reports a host-namespace share, i.e. network_mode / pid / ipc /
+// userns_mode set to "host". (It deliberately does NOT match "host-gateway",
+// which is an extra_hosts value, not a namespace mode — matching it here only
+// produced misleading rejections.)
+func isHostMode(v string) bool {
+	return v == "host"
+}
+
+// isBlockedNamespaceMode reports a namespace mode that must be rejected for
+// network_mode / pid / ipc: either host sharing ("host") or joining another
+// container's / compose service's namespace ("container:<id>",
+// "service:<name>"). The container/service joins are a lateral-movement and
+// sandbox-escape vector — a malicious service could attach to a victim
+// container's network or PID namespace.
+func isBlockedNamespaceMode(v string) bool {
+	return isHostMode(v) ||
+		strings.HasPrefix(v, "container:") ||
+		strings.HasPrefix(v, "service:")
+}
+
+// bindMountHostPath extracts the host-side path from a compose volume
+// declaration. Compose accepts two shapes: a short string "src:dst[:mode]"
+// and a long form map with a "source" key. Returns ok=false for named
+// volumes (no host source).
+func bindMountHostPath(v any) (string, bool) {
+	switch t := v.(type) {
+	case string:
+		// "named:/in/container" has no '/' or '.' prefix on the source.
+		if t == "" {
+			return "", false
+		}
+		parts := strings.SplitN(t, ":", 3)
+		src := parts[0]
+		if strings.HasPrefix(src, "/") || strings.HasPrefix(src, ".") || strings.HasPrefix(src, "~") {
+			return src, true
+		}
+		return "", false
+	case map[string]any:
+		if typ, _ := t["type"].(string); typ != "" && typ != "bind" {
+			return "", false
+		}
+		if src, ok := t["source"].(string); ok {
+			if strings.HasPrefix(src, "/") || strings.HasPrefix(src, ".") || strings.HasPrefix(src, "~") {
+				return src, true
+			}
+		}
+	}
+	return "", false
+}
+
+// isBlockedBindMount returns true for paths that obviously escape the
+// container's intended sandbox. Conservative deny-list — operators with
+// legitimate bind-mount needs should write a dedicated Source plugin
+// rather than tunnel them through compose.
+func isBlockedBindMount(host string) bool {
+	// Normalize trailing slash so "/var" and "/var/" both match.
+	clean := strings.TrimRight(host, "/")
+	if clean == "" || clean == "/" {
+		return true
+	}
+	// Relative ("./x", "../x", ".") and home-relative ("~/...") sources are
+	// resolved by Docker against the compose working directory (which
+	// Tinyforge controls and never intends as a host-bind source) or left
+	// unexpanded — and "../" can climb out of that directory entirely. The
+	// absolute-prefix deny-list below can't see these, so reject them
+	// outright rather than give a false sense of coverage.
+	if strings.HasPrefix(clean, ".") || strings.HasPrefix(clean, "~") {
+		return true
+	}
+	// Specific blocked files / sockets.
+	switch clean {
+	case "/var/run/docker.sock", "/run/docker.sock":
+		return true
+	}
+	// Blocked prefixes (cover sub-paths too).
+	blocked := []string{"/etc", "/var", "/proc", "/sys", "/root", "/home", "/boot", "/dev"}
+	for _, p := range blocked {
+		if clean == p || strings.HasPrefix(clean, p+"/") {
+			return true
+		}
+	}
+	return false
+}
@@ -0,0 +1,55 @@
+package staticsite
+
+import (
+	"context"
+	"log/slog"
+)
+
+// CommitStatusReporter pushes deploy outcomes back to the git provider as a
+// commit status, gated on the per-workload report_commit_status flag. It is
+// strictly best-effort: every call is wrapped so a reporting failure is logged
+// at Warn and NEVER propagates to fail or block the deploy.
+//
+// The provider + identifiers are captured once at deploy start so the hot
+// transition points (pending/success/failure) read as one-liners. A nil
+// receiver (reporting disabled) makes Report a no-op, so callers don't have to
+// guard each transition.
+//
+// It lives in the staticsite package (alongside GitProvider / CommitStatus)
+// rather than the plugin package so the source plugins can share it without
+// staticsite taking a dependency on plugin. It is parameterized on primitives
+// (not plugin.Workload) for the same reason.
+type CommitStatusReporter struct {
+	provider  GitProvider
+	owner     string
+	repo      string
+	sha       string
+	targetURL string
+	enabled   bool
+}
+
+// NewCommitStatusReporter builds a reporter from the resolved deploy inputs.
+// When enabled is false (report_commit_status off) or the SHA is empty, the
+// returned reporter's Report method is inert.
+func NewCommitStatusReporter(provider GitProvider, owner, repo, sha, targetURL string, enabled bool) *CommitStatusReporter {
+	return &CommitStatusReporter{
+		provider:  provider,
+		owner:     owner,
+		repo:      repo,
+		sha:       sha,
+		targetURL: targetURL,
+		enabled:   enabled,
+	}
+}
+
+// Report sends one commit status, swallowing (and logging) any error. Safe to
+// call on a nil/disabled reporter or with a nil provider/empty SHA.
+func (r *CommitStatusReporter) Report(ctx context.Context, workloadName, workloadID string, status CommitStatus, description string) {
+	if r == nil || !r.enabled || r.provider == nil || r.sha == "" {
+		return
+	}
+	if err := r.provider.SetCommitStatus(ctx, r.owner, r.repo, r.sha, status, r.targetURL, description); err != nil {
+		slog.Warn("commit-status report failed (ignored)",
+			"workload", workloadName, "workload_id", workloadID, "status", string(status), "error", err)
+	}
+}
@@ -0,0 +1,122 @@
+package staticsite
+
+import (
+	"context"
+	"errors"
+	"testing"
+)
+
+// fakeReporterProvider is a stub GitProvider that records SetCommitStatus
+// calls. Only the methods the reporter exercises are meaningful; the rest
+// satisfy the interface and panic if ever hit so a mis-wired test is loud.
+type fakeReporterProvider struct {
+	calls   []reporterStatusCall
+	failErr error // when set, SetCommitStatus returns it (best-effort path)
+}
+
+type reporterStatusCall struct {
+	owner, repo, sha string
+	status           CommitStatus
+	targetURL, descr string
+}
+
+func (f *fakeReporterProvider) SetCommitStatus(_ context.Context, owner, repo, sha string, status CommitStatus, targetURL, description string) error {
+	f.calls = append(f.calls, reporterStatusCall{owner, repo, sha, status, targetURL, description})
+	return f.failErr
+}
+
+func (*fakeReporterProvider) Name() string { return "fake" }
+func (*fakeReporterProvider) TestConnection(context.Context, string, string) error {
+	panic("unused")
+}
+func (*fakeReporterProvider) ListRepos(context.Context, string) ([]RepoInfo, error) {
+	panic("unused")
+}
+func (*fakeReporterProvider) ListBranches(context.Context, string, string) ([]string, error) {
+	panic("unused")
+}
+func (*fakeReporterProvider) GetLatestCommitSHA(context.Context, string, string, string) (string, error) {
+	panic("unused")
+}
+func (*fakeReporterProvider) ListTree(context.Context, string, string, string) ([]FolderEntry, error) {
+	panic("unused")
+}
+func (*fakeReporterProvider) DownloadFolder(context.Context, string, string, string, string, string) error {
+	panic("unused")
+}
+
+// Enabled: forwards to the provider with the captured identifiers + target.
+func TestCommitStatusReporter_Enabled_Calls(t *testing.T) {
+	fp := &fakeReporterProvider{}
+	r := NewCommitStatusReporter(fp, "owner", "pages", "abc123", "https://app.example.com", true)
+
+	r.Report(context.Background(), "site", "wid-1", CommitStatusPending, "Tinyforge: deploying")
+	r.Report(context.Background(), "site", "wid-1", CommitStatusSuccess, "Tinyforge: deployed")
+
+	if len(fp.calls) != 2 {
+		t.Fatalf("calls = %d, want 2", len(fp.calls))
+	}
+	first := fp.calls[0]
+	if first.owner != "owner" || first.repo != "pages" || first.sha != "abc123" {
+		t.Errorf("identifiers wrong: %+v", first)
+	}
+	if first.status != CommitStatusPending {
+		t.Errorf("first status = %q, want pending", first.status)
+	}
+	if first.targetURL != "https://app.example.com" {
+		t.Errorf("targetURL = %q", first.targetURL)
+	}
+	if fp.calls[1].status != CommitStatusSuccess {
+		t.Errorf("second status = %q, want success", fp.calls[1].status)
+	}
+}
+
+// Disabled: the reporter is inert.
+func TestCommitStatusReporter_Disabled_NoCalls(t *testing.T) {
+	fp := &fakeReporterProvider{}
+	r := NewCommitStatusReporter(fp, "owner", "pages", "abc123", "", false)
+
+	r.Report(context.Background(), "site", "wid-1", CommitStatusSuccess, "x")
+	if len(fp.calls) != 0 {
+		t.Fatalf("expected no calls when disabled, got %d", len(fp.calls))
+	}
+}
+
+// An empty SHA (e.g. a provider that couldn't resolve the branch) must not
+// produce a status call even when reporting is enabled.
+func TestCommitStatusReporter_EmptySHA_NoCalls(t *testing.T) {
+	fp := &fakeReporterProvider{}
+	r := NewCommitStatusReporter(fp, "owner", "pages", "", "", true)
+
+	r.Report(context.Background(), "site", "wid-1", CommitStatusPending, "x")
+	if len(fp.calls) != 0 {
+		t.Fatalf("expected no calls with empty SHA, got %d", len(fp.calls))
+	}
+}
+
+// A provider error must be swallowed (best-effort) — Report never panics or
+// propagates. We assert it returns normally after a failing provider call.
+func TestCommitStatusReporter_ProviderError_Swallowed(t *testing.T) {
+	fp := &fakeReporterProvider{failErr: errors.New("boom")}
+	r := NewCommitStatusReporter(fp, "owner", "pages", "abc123", "", true)
+
+	// Should not panic / propagate.
+	r.Report(context.Background(), "site", "wid-1", CommitStatusFailure, "Tinyforge: deploy failed")
+	if len(fp.calls) != 1 {
+		t.Fatalf("expected the failing call to still be recorded, got %d", len(fp.calls))
+	}
+}
+
+// A nil reporter (constructed only when needed in some call paths) is safe.
+func TestCommitStatusReporter_NilSafe(t *testing.T) {
+	var r *CommitStatusReporter
+	// Must not panic.
+	r.Report(context.Background(), "site", "wid-1", CommitStatusSuccess, "x")
+}
+
+// A nil provider on an enabled reporter is also a no-op (defensive guard).
+func TestCommitStatusReporter_NilProvider_NoPanic(t *testing.T) {
+	r := NewCommitStatusReporter(nil, "owner", "pages", "abc123", "", true)
+	// Must not panic.
+	r.Report(context.Background(), "site", "wid-1", CommitStatusSuccess, "x")
+}
@@ -0,0 +1,331 @@
+package staticsite
+
+import (
+	"context"
+	"encoding/json"
+	"io"
+	"net/http"
+	"net/http/httptest"
+	"strings"
+	"testing"
+	"time"
+)
+
+// ── State mapping (pure) ────────────────────────────────────────────
+//
+// Each provider maps the provider-agnostic CommitStatus onto its own API
+// vocabulary. Gitea/GitHub accept the same four words; GitLab collapses
+// failure+error into "failed".
+
+func TestGiteaState_Mapping(t *testing.T) {
+	cases := map[CommitStatus]string{
+		CommitStatusPending: "pending",
+		CommitStatusSuccess: "success",
+		CommitStatusFailure: "failure",
+		CommitStatusError:   "error",
+		CommitStatus("???"): "pending", // unknown -> pending fallback
+	}
+	for in, want := range cases {
+		if got := giteaState(in); got != want {
+			t.Errorf("giteaState(%q) = %q, want %q", in, got, want)
+		}
+	}
+}
+
+func TestGitHubState_Mapping(t *testing.T) {
+	cases := map[CommitStatus]string{
+		CommitStatusPending: "pending",
+		CommitStatusSuccess: "success",
+		CommitStatusFailure: "failure",
+		CommitStatusError:   "error",
+		CommitStatus("???"): "pending",
+	}
+	for in, want := range cases {
+		if got := githubState(in); got != want {
+			t.Errorf("githubState(%q) = %q, want %q", in, got, want)
+		}
+	}
+}
+
+func TestGitLabState_Mapping(t *testing.T) {
+	cases := map[CommitStatus]string{
+		CommitStatusPending: "pending",
+		CommitStatusSuccess: "success",
+		CommitStatusFailure: "failed", // GitLab has no "failure"
+		CommitStatusError:   "failed", // error also collapses to "failed"
+		CommitStatus("???"): "pending",
+	}
+	for in, want := range cases {
+		if got := gitlabState(in); got != want {
+			t.Errorf("gitlabState(%q) = %q, want %q", in, got, want)
+		}
+	}
+}
+
+func TestTruncateDescription(t *testing.T) {
+	short := "Tinyforge: deploying"
+	if got := truncateDescription(short); got != short {
+		t.Errorf("short description mutated: %q", got)
+	}
+	long := strings.Repeat("x", 200)
+	got := truncateDescription(long)
+	if len([]rune(got)) > maxCommitStatusDescription {
+		t.Errorf("truncated length = %d runes, want <= %d", len([]rune(got)), maxCommitStatusDescription)
+	}
+	if !strings.HasSuffix(got, "…") {
+		t.Errorf("missing ellipsis on truncation: %q", got)
+	}
+}
+
+// ── Endpoint + body construction (httptest) ─────────────────────────
+//
+// The SSRF-safe client refuses loopback, so for these tests we swap the
+// provider's httpClient for a plain one pointed at httptest. This still
+// exercises the real URL/body construction inside each SetCommitStatus.
+
+type capturedRequest struct {
+	method string
+	path   string // r.URL.EscapedPath() — preserves %2F so GitLab's encoded project path is observable
+	rawQ   string
+	body   map[string]string
+	auth   string
+	token  string // PRIVATE-TOKEN (GitLab)
+}
+
+func newCaptureServer(t *testing.T, capture *capturedRequest) *httptest.Server {
+	t.Helper()
+	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
+		capture.method = r.Method
+		capture.path = r.URL.EscapedPath()
+		capture.rawQ = r.URL.RawQuery
+		capture.auth = r.Header.Get("Authorization")
+		capture.token = r.Header.Get("PRIVATE-TOKEN")
+		raw, _ := io.ReadAll(r.Body)
+		if len(raw) > 0 {
+			_ = json.Unmarshal(raw, &capture.body)
+		}
+		w.WriteHeader(http.StatusCreated)
+		_, _ = w.Write([]byte(`{}`))
+	}))
+}
+
+func TestGitea_SetCommitStatus_Request(t *testing.T) {
+	var cap capturedRequest
+	srv := newCaptureServer(t, &cap)
+	defer srv.Close()
+
+	f := NewGiteaContentFetcher(srv.URL, "tok123")
+	f.httpClient = srv.Client() // bypass SSRF guard for loopback test server
+
+	err := f.SetCommitStatus(context.Background(), "owner", "repo", "abc123",
+		CommitStatusSuccess, "https://app.example.com", "deployed")
+	if err != nil {
+		t.Fatalf("SetCommitStatus: %v", err)
+	}
+
+	if cap.method != http.MethodPost {
+		t.Errorf("method = %q, want POST", cap.method)
+	}
+	if want := "/api/v1/repos/owner/repo/statuses/abc123"; cap.path != want {
+		t.Errorf("path = %q, want %q", cap.path, want)
+	}
+	if cap.body["state"] != "success" {
+		t.Errorf("state = %q, want success", cap.body["state"])
+	}
+	if cap.body["context"] != "tinyforge" {
+		t.Errorf("context = %q, want tinyforge", cap.body["context"])
+	}
+	if cap.body["target_url"] != "https://app.example.com" {
+		t.Errorf("target_url = %q", cap.body["target_url"])
+	}
+	if cap.body["description"] != "deployed" {
+		t.Errorf("description = %q, want deployed", cap.body["description"])
+	}
+	if cap.auth != "token tok123" {
+		t.Errorf("auth = %q, want 'token tok123'", cap.auth)
+	}
+}
+
+func TestGitHub_SetCommitStatus_Request(t *testing.T) {
+	var cap capturedRequest
+	srv := newCaptureServer(t, &cap)
+	defer srv.Close()
+
+	// Force GHE-style apiBase so we hit the server's path; the github.com
+	// branch hard-codes api.github.com which the SSRF client would block.
+	g := NewGitHubProvider(srv.URL, "ghp_tok")
+	g.apiBase = srv.URL + "/api/v3"
+	g.httpClient = srv.Client()
+
+	err := g.SetCommitStatus(context.Background(), "octo", "cat", "deadbeef",
+		CommitStatusFailure, "", "failed")
+	if err != nil {
+		t.Fatalf("SetCommitStatus: %v", err)
+	}
+
+	if want := "/api/v3/repos/octo/cat/statuses/deadbeef"; cap.path != want {
+		t.Errorf("path = %q, want %q", cap.path, want)
+	}
+	if cap.body["state"] != "failure" {
+		t.Errorf("state = %q, want failure", cap.body["state"])
+	}
+	if cap.body["context"] != "tinyforge" {
+		t.Errorf("context = %q, want tinyforge", cap.body["context"])
+	}
+	if cap.auth != "Bearer ghp_tok" {
+		t.Errorf("auth = %q, want 'Bearer ghp_tok'", cap.auth)
+	}
+}
+
+func TestGitLab_SetCommitStatus_Request(t *testing.T) {
+	var cap capturedRequest
+	srv := newCaptureServer(t, &cap)
+	defer srv.Close()
+
+	g := NewGitLabProvider(srv.URL, "glpat-xyz")
+	g.httpClient = srv.Client()
+
+	err := g.SetCommitStatus(context.Background(), "grp", "proj", "cafe01",
+		CommitStatusError, "https://app.example.com", "boom")
+	if err != nil {
+		t.Fatalf("SetCommitStatus: %v", err)
+	}
+
+	// GitLab uses the URL-encoded project path + sha in the path, and the
+	// status metadata as query params.
+	if want := "/api/v4/projects/grp%2Fproj/statuses/cafe01"; cap.path != want {
+		t.Errorf("path = %q, want %q", cap.path, want)
+	}
+	q, err := parseQuery(cap.rawQ)
+	if err != nil {
+		t.Fatalf("parse query %q: %v", cap.rawQ, err)
+	}
+	if q["state"] != "failed" { // error -> failed
+		t.Errorf("state = %q, want failed", q["state"])
+	}
+	if q["name"] != "tinyforge" {
+		t.Errorf("name = %q, want tinyforge", q["name"])
+	}
+	if q["target_url"] != "https://app.example.com" {
+		t.Errorf("target_url = %q", q["target_url"])
+	}
+	if q["description"] != "boom" {
+		t.Errorf("description = %q, want boom", q["description"])
+	}
+	if cap.token != "glpat-xyz" {
+		t.Errorf("PRIVATE-TOKEN = %q, want glpat-xyz", cap.token)
+	}
+}
+
+// parseQuery is a tiny wrapper so the test reads the first value of each
+// query key without dragging net/url into every assertion.
+func parseQuery(raw string) (map[string]string, error) {
+	out := map[string]string{}
+	if raw == "" {
+		return out, nil
+	}
+	for _, pair := range strings.Split(raw, "&") {
+		kv := strings.SplitN(pair, "=", 2)
+		k := urlDecode(kv[0])
+		v := ""
+		if len(kv) == 2 {
+			v = urlDecode(kv[1])
+		}
+		if _, ok := out[k]; !ok {
+			out[k] = v
+		}
+	}
+	return out, nil
+}
+
+func urlDecode(s string) string {
+	dec, err := decodeQueryComponent(s)
+	if err != nil {
+		return s
+	}
+	return dec
+}
+
+// decodeQueryComponent decodes one application/x-www-form-urlencoded
+// component (handles %XX and '+'-as-space) without importing net/url here.
+func decodeQueryComponent(s string) (string, error) {
+	var b strings.Builder
+	for i := 0; i < len(s); i++ {
+		switch s[i] {
+		case '+':
+			b.WriteByte(' ')
+		case '%':
+			if i+2 >= len(s) {
+				return s, errPercent
+			}
+			hi, lo := fromHex(s[i+1]), fromHex(s[i+2])
+			if hi < 0 || lo < 0 {
+				return s, errPercent
+			}
+			b.WriteByte(byte(hi<<4 | lo))
+			i += 2
+		default:
+			b.WriteByte(s[i])
+		}
+	}
+	return b.String(), nil
+}
+
+var errPercent = &decodeErr{}
+
+type decodeErr struct{}
+
+func (*decodeErr) Error() string { return "bad percent-encoding" }
+
+func fromHex(c byte) int {
+	switch {
+	case c >= '0' && c <= '9':
+		return int(c - '0')
+	case c >= 'a' && c <= 'f':
+		return int(c-'a') + 10
+	case c >= 'A' && c <= 'F':
+		return int(c-'A') + 10
+	}
+	return -1
+}
+
+// TestSetCommitStatus_NonOK_ReturnsError verifies a non-2xx provider
+// response surfaces as an error (the deploy hook logs + swallows it, but
+// the provider method itself must report it).
+func TestSetCommitStatus_NonOK_ReturnsError(t *testing.T) {
+	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
+		w.WriteHeader(http.StatusUnauthorized)
+		_, _ = w.Write([]byte(`{"message":"bad token"}`))
+	}))
+	defer srv.Close()
+
+	f := NewGiteaContentFetcher(srv.URL, "tok")
+	f.httpClient = srv.Client()
+
+	err := f.SetCommitStatus(context.Background(), "o", "r", "sha", CommitStatusPending, "", "x")
+	if err == nil {
+		t.Fatal("expected error on 401, got nil")
+	}
+	if !strings.Contains(err.Error(), "401") {
+		t.Errorf("error missing status code: %v", err)
+	}
+}
+
+// TestSetCommitStatus_RespectsContext ensures the call honours context
+// cancellation (defensive — the deploy hook passes the deploy ctx).
+func TestSetCommitStatus_RespectsContext(t *testing.T) {
+	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
+		time.Sleep(200 * time.Millisecond)
+		w.WriteHeader(http.StatusCreated)
+	}))
+	defer srv.Close()
+
+	f := NewGiteaContentFetcher(srv.URL, "")
+	f.httpClient = srv.Client()
+
+	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Millisecond)
+	defer cancel()
+	if err := f.SetCommitStatus(ctx, "o", "r", "sha", CommitStatusPending, "", "x"); err == nil {
+		t.Fatal("expected context-deadline error, got nil")
+	}
+}
@@ -6,6 +6,7 @@ import (
 	"fmt"
 	"io"
 	"net/http"
+	"net/url"
 	"os"
 	"path/filepath"
 	"strings"
@@ -304,6 +305,54 @@ func (f *GiteaContentFetcher) TestConnection(ctx context.Context, owner, repo st
 	return nil
 }

+// SetCommitStatus reports a deploy status on a commit via Gitea's commit-
+// status API (also serves Forgejo/Gogs). The "context" field is fixed to
+// "tinyforge" so repeated deploys update one status row.
+func (f *GiteaContentFetcher) SetCommitStatus(ctx context.Context, owner, repo, sha string, status CommitStatus, targetURL, description string) error {
+	state := giteaState(status)
+	body, err := json.Marshal(map[string]string{
+		"state":       state,
+		"target_url":  targetURL,
+		"description": truncateDescription(description),
+		"context":     commitStatusContext,
+	})
+	if err != nil {
+		return fmt.Errorf("marshal status: %w", err)
+	}
+	// Path-escape each identifier so the URL shape matches the other
+	// provider methods and a hostile owner/repo/sha can't break out of
+	// the intended path. The SSRF-safe client guards the host.
+	apiURL := fmt.Sprintf("%s/api/v1/repos/%s/%s/statuses/%s",
+		f.baseURL, url.PathEscape(owner), url.PathEscape(repo), url.PathEscape(sha))
+	if err := postJSON(ctx, f.httpClient, apiURL, body, f.setAuth); err != nil {
+		return fmt.Errorf("set commit status: %w", err)
+	}
+	return nil
+}
+
+// setAuth applies the Gitea token header (no-op when the token is empty).
+func (f *GiteaContentFetcher) setAuth(req *http.Request) {
+	if f.token != "" {
+		req.Header.Set("Authorization", "token "+f.token)
+	}
+}
+
+// giteaState maps a provider-agnostic CommitStatus onto Gitea's API
+// vocabulary. Gitea accepts the same four words Tinyforge uses, so this is
+// a 1:1 mapping with a "pending" fallback for any unknown value.
+func giteaState(status CommitStatus) string {
+	switch status {
+	case CommitStatusSuccess:
+		return "success"
+	case CommitStatusFailure:
+		return "failure"
+	case CommitStatusError:
+		return "error"
+	default:
+		return "pending"
+	}
+}
+
 // doGet performs an authenticated GET request and returns the response body.
 func (f *GiteaContentFetcher) doGet(ctx context.Context, url string) ([]byte, error) {
 	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
@@ -6,6 +6,7 @@ import (
 	"fmt"
 	"io"
 	"net/http"
+	"net/url"
 	"os"
 	"path/filepath"
 	"strings"
@@ -115,6 +116,43 @@ func (g *GitHubProvider) TestConnection(ctx context.Context, owner, repo string)
 	return err
 }

+// SetCommitStatus reports a deploy status on a commit via GitHub's commit-
+// status API (works for github.com and GitHub Enterprise — apiBase already
+// carries the /api/v3 suffix for GHE). The "context" field is fixed to
+// "tinyforge" so repeated deploys update one status row.
+func (g *GitHubProvider) SetCommitStatus(ctx context.Context, owner, repo, sha string, status CommitStatus, targetURL, description string) error {
+	body, err := json.Marshal(map[string]string{
+		"state":       githubState(status),
+		"target_url":  targetURL,
+		"description": truncateDescription(description),
+		"context":     commitStatusContext,
+	})
+	if err != nil {
+		return fmt.Errorf("marshal status: %w", err)
+	}
+	apiURL := fmt.Sprintf("%s/repos/%s/%s/statuses/%s",
+		g.apiBase, url.PathEscape(owner), url.PathEscape(repo), url.PathEscape(sha))
+	if err := postJSON(ctx, g.httpClient, apiURL, body, g.setAuth); err != nil {
+		return fmt.Errorf("set commit status: %w", err)
+	}
+	return nil
+}
+
+// githubState maps a provider-agnostic CommitStatus onto GitHub's API
+// vocabulary. GitHub accepts the same four words Tinyforge uses.
+func githubState(status CommitStatus) string {
+	switch status {
+	case CommitStatusSuccess:
+		return "success"
+	case CommitStatusFailure:
+		return "failure"
+	case CommitStatusError:
+		return "error"
+	default:
+		return "pending"
+	}
+}
+
 func (g *GitHubProvider) ListBranches(ctx context.Context, owner, repo string) ([]string, error) {
 	var allBranches []string
 	page := 1
@@ -95,6 +95,45 @@ func (g *GitLabProvider) TestConnection(ctx context.Context, owner, repo string)
 	return err
 }

+// SetCommitStatus reports a deploy status on a commit via GitLab's commit-
+// status API. GitLab's state vocabulary differs (pending/running/success/
+// failed/canceled), so failure AND error both map to "failed". The status
+// metadata (name/target_url/description) is passed as query parameters,
+// which is how GitLab's POST .../statuses/{sha} endpoint accepts them.
+func (g *GitLabProvider) SetCommitStatus(ctx context.Context, owner, repo, sha string, status CommitStatus, targetURL, description string) error {
+	q := url.Values{}
+	q.Set("state", gitlabState(status))
+	q.Set("name", commitStatusContext)
+	if targetURL != "" {
+		q.Set("target_url", targetURL)
+	}
+	if description != "" {
+		q.Set("description", truncateDescription(description))
+	}
+	apiURL := fmt.Sprintf("%s/projects/%s/statuses/%s?%s",
+		g.apiBase, projectPath(owner, repo), url.PathEscape(sha), q.Encode())
+	// No JSON body — all fields ride as query params. Reuse postJSON for
+	// the SSRF-safe POST + 2xx handling; an empty body is valid here.
+	if err := postJSON(ctx, g.httpClient, apiURL, nil, g.setAuth); err != nil {
+		return fmt.Errorf("set commit status: %w", err)
+	}
+	return nil
+}
+
+// gitlabState maps a provider-agnostic CommitStatus onto GitLab's API
+// vocabulary. GitLab has no "failure"/"error" split — both map to
+// "failed".
+func gitlabState(status CommitStatus) string {
+	switch status {
+	case CommitStatusSuccess:
+		return "success"
+	case CommitStatusFailure, CommitStatusError:
+		return "failed"
+	default:
+		return "pending"
+	}
+}
+
 func (g *GitLabProvider) ListBranches(ctx context.Context, owner, repo string) ([]string, error) {
 	var allBranches []string
 	page := 1
@@ -1,6 +1,7 @@
 package staticsite

 import (
+	"bytes"
 	"context"
 	"fmt"
 	"io"
@@ -21,6 +22,40 @@ type RepoInfo struct {
 	HTMLURL     string `json:"html_url"`
 }

+// CommitStatus is the deploy outcome reported back to the git provider as
+// a commit status. The values are provider-agnostic; each implementation
+// maps them onto its own API vocabulary (Gitea/GitHub use the same four
+// words, GitLab collapses failure/error into "failed").
+type CommitStatus string
+
+const (
+	CommitStatusPending CommitStatus = "pending"
+	CommitStatusSuccess CommitStatus = "success"
+	CommitStatusFailure CommitStatus = "failure"
+	CommitStatusError   CommitStatus = "error"
+)
+
+// commitStatusContext is the status "context"/"name" key reported to every
+// provider so repeated deploys update the same status row rather than
+// piling up new ones.
+const commitStatusContext = "tinyforge"
+
+// maxCommitStatusDescription caps the human-readable description so a
+// provider can't reject the request for an over-long field.
+const maxCommitStatusDescription = 140
+
+// truncateDescription clamps a status description to the provider-safe
+// length, appending an ellipsis when it had to cut.
+func truncateDescription(s string) string {
+	if len(s) <= maxCommitStatusDescription {
+		return s
+	}
+	// Reserve room for the ellipsis rune; cut on a byte boundary that
+	// stays under the cap. Descriptions are short ASCII strings in
+	// practice, so a simple byte cut is fine here.
+	return s[:maxCommitStatusDescription-1] + "…"
+}
+
 // GitProvider abstracts Git hosting API operations.
 // Implementations exist for Gitea/Forgejo/Gogs, GitHub, and GitLab.
 type GitProvider interface {
@@ -45,6 +80,12 @@ type GitProvider interface {

 	// DownloadFolder downloads all files from a folder path to a local directory.
 	DownloadFolder(ctx context.Context, owner, repo, branch, folderPath, destDir string) error
+
+	// SetCommitStatus reports a deploy status on a commit. Best-effort;
+	// callers ignore errors beyond logging. targetURL and description are
+	// optional (pass "" to omit); description is truncated to a provider-
+	// safe length by the implementation.
+	SetCommitStatus(ctx context.Context, owner, repo, sha string, status CommitStatus, targetURL, description string) error
 }

 // ProviderType identifies a Git hosting provider.
@@ -135,6 +176,36 @@ func httpGet(ctx context.Context, client *http.Client, url string) (int, error)
 	return resp.StatusCode, nil
 }

+// postJSON is a shared helper for POSTing a JSON body to a provider API
+// endpoint with the caller's auth applied. It accepts any 2xx as success
+// (status APIs return 201 Created on Gitea/GitHub, 200/201 on GitLab) and
+// returns a status-code-only error on non-2xx — it must NOT echo the
+// response body: the deploy hook logs this error best-effort, and a
+// hostile/misconfigured provider could reflect the request's auth token
+// back in its body. The body bytes must already be marshalled by the caller.
+func postJSON(ctx context.Context, client *http.Client, url string, body []byte, authHeader func(r *http.Request)) error {
+	req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, bytes.NewReader(body))
+	if err != nil {
+		return fmt.Errorf("create request: %w", err)
+	}
+	if authHeader != nil {
+		authHeader(req)
+	}
+	req.Header.Set("Content-Type", "application/json")
+	req.Header.Set("Accept", "application/json")
+
+	resp, err := client.Do(req)
+	if err != nil {
+		return fmt.Errorf("execute request: %w", err)
+	}
+	defer resp.Body.Close()
+
+	if resp.StatusCode < 200 || resp.StatusCode >= 300 {
+		return fmt.Errorf("unexpected status %d", resp.StatusCode)
+	}
+	return nil
+}
+
 // downloadFileHTTP is a shared helper for downloading a file from a URL.
 func downloadFileHTTP(ctx context.Context, client *http.Client, url, localPath string, authHeader func(r *http.Request)) error {
 	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
@@ -50,34 +50,7 @@ func ValidateBaseURL(raw string) error {
 func NewSafeHTTPClient(timeout time.Duration) *http.Client {
 	dialer := &net.Dialer{Timeout: 10 * time.Second, KeepAlive: 30 * time.Second}
 	transport := &http.Transport{
-		DialContext: func(ctx context.Context, network, addr string) (net.Conn, error) {
-			host, port, err := net.SplitHostPort(addr)
-			if err != nil {
-				return nil, err
-			}
-			// If the caller passed a literal IP, skip the DNS round-trip.
-			if literal := net.ParseIP(host); literal != nil {
-				if reason := blockReason(literal); reason != "" {
-					return nil, fmt.Errorf("%w: %s (%s)", ErrBlockedAddress, literal, reason)
-				}
-				return dialer.DialContext(ctx, network, addr)
-			}
-			ips, err := net.DefaultResolver.LookupIPAddr(ctx, host)
-			if err != nil {
-				return nil, err
-			}
-			if len(ips) == 0 {
-				return nil, fmt.Errorf("no addresses for %s", host)
-			}
-			for _, ip := range ips {
-				if reason := blockReason(ip.IP); reason != "" {
-					return nil, fmt.Errorf("%w: %s (%s)", ErrBlockedAddress, ip.IP, reason)
-				}
-			}
-			// Bind to the first resolved IP so a rebind between resolution
-			// and connect cannot redirect the request to a blocked address.
-			return dialer.DialContext(ctx, network, net.JoinHostPort(ips[0].IP.String(), port))
-		},
+		DialContext:         SafeDialContext(dialer),
 		MaxIdleConns:        16,
 		IdleConnTimeout:     30 * time.Second,
 		TLSHandshakeTimeout: 10 * time.Second,
@@ -85,6 +58,43 @@ func NewSafeHTTPClient(timeout time.Duration) *http.Client {
 	return &http.Client{Timeout: timeout, Transport: transport}
 }

+// SafeDialContext returns a DialContext that rejects loopback, link-local,
+// multicast, unspecified, and cloud-metadata addresses at connect time,
+// re-resolving and binding to the resolved IP so a DNS rebind between
+// resolution and connect cannot slip through. Exposed so other transports
+// (e.g. the outbound notification client) can apply the same SSRF policy
+// without duplicating it or losing their own connection-pool tuning.
+func SafeDialContext(dialer *net.Dialer) func(ctx context.Context, network, addr string) (net.Conn, error) {
+	return func(ctx context.Context, network, addr string) (net.Conn, error) {
+		host, port, err := net.SplitHostPort(addr)
+		if err != nil {
+			return nil, err
+		}
+		// If the caller passed a literal IP, skip the DNS round-trip.
+		if literal := net.ParseIP(host); literal != nil {
+			if reason := blockReason(literal); reason != "" {
+				return nil, fmt.Errorf("%w: %s (%s)", ErrBlockedAddress, literal, reason)
+			}
+			return dialer.DialContext(ctx, network, addr)
+		}
+		ips, err := net.DefaultResolver.LookupIPAddr(ctx, host)
+		if err != nil {
+			return nil, err
+		}
+		if len(ips) == 0 {
+			return nil, fmt.Errorf("no addresses for %s", host)
+		}
+		for _, ip := range ips {
+			if reason := blockReason(ip.IP); reason != "" {
+				return nil, fmt.Errorf("%w: %s (%s)", ErrBlockedAddress, ip.IP, reason)
+			}
+		}
+		// Bind to the first resolved IP so a rebind between resolution
+		// and connect cannot redirect the request to a blocked address.
+		return dialer.DialContext(ctx, network, net.JoinHostPort(ips[0].IP.String(), port))
+	}
+}
+
 // blockReason returns a human label for why an IP is rejected, or ""
 // if the IP is allowed. Centralized so all callers share the same
 // policy.
@@ -92,6 +102,13 @@ func blockReason(ip net.IP) string {
 	if ip == nil {
 		return "nil address"
 	}
+	// Normalize IPv4-mapped IPv6 (::ffff:x.x.x.x) so the loopback / link-local
+	// classifiers below catch them. net.IP.To4() returns the 4-byte form for
+	// IPv4-mapped addresses; net's IsLoopback already handles this, but pin
+	// the conversion to avoid future surprises if the std-lib semantics drift.
+	if v4 := ip.To4(); v4 != nil {
+		ip = v4
+	}
 	switch {
 	case ip.IsLoopback():
 		return "loopback"
@@ -104,5 +121,22 @@ func blockReason(ip net.IP) string {
 	case ip.IsMulticast():
 		return "multicast"
 	}
+	// Cloud metadata endpoints — AWS / GCP / Azure are covered by the
+	// link-local block (169.254.169.254). The rest must be enumerated.
+	if metadataIPSet[ip.String()] {
+		return "cloud metadata endpoint"
+	}
 	return ""
 }
+
+// metadataIPSet enumerates well-known cloud metadata IPs that are NOT
+// covered by net.IP.IsLinkLocalUnicast. Updating this set is the lightest
+// way to keep up with new providers without changing the policy shape.
+var metadataIPSet = map[string]bool{
+	// Alibaba Cloud ECS metadata.
+	"100.100.100.200": true,
+	// Oracle Cloud Infrastructure metadata.
+	"192.0.0.192": true,
+	// AWS IMDS over IPv6 (ULA — not link-local, must be listed).
+	"fd00:ec2::254": true,
+}
@@ -234,17 +234,17 @@ func (c *Collector) sampleAll(ctx context.Context, targets []target) []store.Con
 	found := make([]bool, len(targets))

 	var wg sync.WaitGroup
+loop:
 	for i, t := range targets {
 		// Acquire the semaphore in the parent loop so ctx cancellation
 		// short-circuits the queue rather than spawning goroutines that
-		// block on an unreachable slot.
+		// block on an unreachable slot. The labelled break exits the for
+		// loop directly; a bare `break` inside `select` would only break
+		// the select and let the loop continue.
 		select {
 		case sem <- struct{}{}:
 		case <-ctx.Done():
-			break
-		}
-		if ctx.Err() != nil {
-			break
+			break loop
 		}
 		wg.Add(1)
 		go func(i int, t target) {
@@ -2,6 +2,7 @@ package store

 import (
 	"database/sql"
+	"encoding/json"
 	"errors"
 	"fmt"
 	"strings"
@@ -9,6 +10,22 @@ import (
 	"github.com/google/uuid"
 )

+// validateExtraJSON ensures the extra_json column never receives an
+// invalid JSON document. The codemap (docs/CODEMAPS/container-extra-json.md)
+// is explicit that readers tolerate unknown keys — but only if the value
+// is valid JSON at all. A buggy plugin writing `"not json"` would silently
+// break every reader, with no schema-level check to catch it. Guarding at
+// the store boundary keeps the invariant cheap and obvious.
+func validateExtraJSON(v string) error {
+	if v == "" {
+		return nil
+	}
+	if !json.Valid([]byte(v)) {
+		return fmt.Errorf("extra_json: not valid JSON (%d bytes)", len(v))
+	}
+	return nil
+}
+
 // containerColumns is the canonical column list for `containers` queries.
 // stage_id is populated by the deployer for project containers (so ListProxyRoutes
 // survives stage renames) and left empty for stacks and sites.
@@ -42,6 +59,9 @@ func (s *Store) CreateContainer(c Container) (Container, error) {
 	if c.ExtraJSON == "" {
 		c.ExtraJSON = "{}"
 	}
+	if err := validateExtraJSON(c.ExtraJSON); err != nil {
+		return Container{}, err
+	}

 	_, err := s.db.Exec(
 		`INSERT INTO containers (`+containerColumns+`)
@@ -77,6 +97,9 @@ func (s *Store) UpsertContainer(c Container) error {
 	if c.ExtraJSON == "" {
 		c.ExtraJSON = "{}"
 	}
+	if err := validateExtraJSON(c.ExtraJSON); err != nil {
+		return err
+	}

 	// SQLite UPSERT — INSERT...ON CONFLICT(id) DO UPDATE.
 	_, err := s.db.Exec(
@@ -129,6 +152,9 @@ func (s *Store) ReconcileContainer(c Container) error {
 	if c.ExtraJSON == "" {
 		c.ExtraJSON = "{}"
 	}
+	if err := validateExtraJSON(c.ExtraJSON); err != nil {
+		return err
+	}

 	// extra_json is deliberately NOT in the ON CONFLICT SET clause: the
 	// reconciler can't observe per-face route IDs from Docker, and
@@ -321,6 +347,9 @@ func (s *Store) UpdateContainer(c Container) error {
 	if c.ExtraJSON == "" {
 		c.ExtraJSON = "{}"
 	}
+	if err := validateExtraJSON(c.ExtraJSON); err != nil {
+		return err
+	}
 	result, err := s.db.Exec(
 		`UPDATE containers SET workload_id=?, workload_kind=?, role=?, stage_id=?, container_id=?,
 			image_ref=?, image_tag=?, host=?, state=?, port=?,
@@ -7,12 +7,13 @@ import (

 // EventLogFilter holds optional filters for listing event log entries.
 type EventLogFilter struct {
-	Severity string // Filter by severity (info, warn, error).
-	Source   string // Filter by source.
-	Since    string // Only events created at or after this timestamp.
-	Until    string // Only events created at or before this timestamp.
-	Limit    int    // Maximum number of results (default 50).
-	Offset   int    // Offset for pagination.
+	Severity   string // Filter by severity (info, warn, error).
+	Source     string // Filter by source.
+	WorkloadID string // Filter by owning workload (exact match).
+	Since      string // Only events created at or after this timestamp.
+	Until      string // Only events created at or before this timestamp.
+	Limit      int    // Maximum number of results (default 50).
+	Offset     int    // Offset for pagination.
 }

 // EventLogStats holds counts of event log entries by severity.
@@ -31,9 +32,9 @@ func (s *Store) InsertEvent(evt EventLog) (EventLog, error) {
 	}

 	result, err := s.db.Exec(
-		`INSERT INTO event_log (source, severity, message, metadata, created_at)
-		 VALUES (?, ?, ?, ?, ?)`,
-		evt.Source, evt.Severity, evt.Message, evt.Metadata, evt.CreatedAt,
+		`INSERT INTO event_log (source, workload_id, severity, message, metadata, created_at)
+		 VALUES (?, ?, ?, ?, ?, ?)`,
+		evt.Source, evt.WorkloadID, evt.Severity, evt.Message, evt.Metadata, evt.CreatedAt,
 	)
 	if err != nil {
 		return EventLog{}, fmt.Errorf("insert event: %w", err)
@@ -81,6 +82,10 @@ func (s *Store) ListEvents(filter EventLogFilter) ([]EventLog, error) {
 			conditions = append(conditions, "source IN ("+strings.Join(placeholders, ",")+")")
 		}
 	}
+	if filter.WorkloadID != "" {
+		conditions = append(conditions, "workload_id = ?")
+		args = append(args, filter.WorkloadID)
+	}
 	if filter.Since != "" {
 		conditions = append(conditions, "created_at >= ?")
 		args = append(args, filter.Since)
@@ -90,7 +95,7 @@ func (s *Store) ListEvents(filter EventLogFilter) ([]EventLog, error) {
 		args = append(args, filter.Until)
 	}

-	query := "SELECT id, source, severity, message, metadata, created_at FROM event_log"
+	query := "SELECT id, source, workload_id, severity, message, metadata, created_at FROM event_log"
 	if len(conditions) > 0 {
 		query += " WHERE " + strings.Join(conditions, " AND ")
 	}
@@ -114,7 +119,7 @@ func (s *Store) ListEvents(filter EventLogFilter) ([]EventLog, error) {
 	events := []EventLog{}
 	for rows.Next() {
 		var evt EventLog
-		if err := rows.Scan(&evt.ID, &evt.Source, &evt.Severity, &evt.Message, &evt.Metadata, &evt.CreatedAt); err != nil {
+		if err := rows.Scan(&evt.ID, &evt.Source, &evt.WorkloadID, &evt.Severity, &evt.Message, &evt.Metadata, &evt.CreatedAt); err != nil {
 			return nil, fmt.Errorf("scan event: %w", err)
 		}
 		events = append(events, evt)
@@ -0,0 +1,120 @@
+package store
+
+import (
+	"testing"
+)
+
+func TestInsertEvent_RoundTripsWorkloadID(t *testing.T) {
+	s := newTestStore(t)
+
+	in := EventLog{
+		Source:     "image",
+		WorkloadID: "wl-abc",
+		Severity:   "info",
+		Message:    "my-app: deployed",
+		Metadata:   `{"workload_id":"wl-abc"}`,
+	}
+	saved, err := s.InsertEvent(in)
+	if err != nil {
+		t.Fatalf("InsertEvent: %v", err)
+	}
+	if saved.ID == 0 {
+		t.Fatal("expected non-zero ID after insert")
+	}
+	if saved.WorkloadID != "wl-abc" {
+		t.Fatalf("returned WorkloadID = %q, want %q", saved.WorkloadID, "wl-abc")
+	}
+
+	rows, err := s.ListEvents(EventLogFilter{WorkloadID: "wl-abc"})
+	if err != nil {
+		t.Fatalf("ListEvents: %v", err)
+	}
+	if len(rows) != 1 {
+		t.Fatalf("got %d rows, want 1", len(rows))
+	}
+	got := rows[0]
+	if got.WorkloadID != "wl-abc" {
+		t.Errorf("WorkloadID = %q, want %q", got.WorkloadID, "wl-abc")
+	}
+	if got.Source != "image" || got.Severity != "info" || got.Message != "my-app: deployed" {
+		t.Errorf("round-trip mismatch: %+v", got)
+	}
+}
+
+func TestInsertEvent_DefaultsWorkloadIDToEmpty(t *testing.T) {
+	s := newTestStore(t)
+
+	// Non-deploy callers leave WorkloadID at its zero value; the column
+	// must accept "" (NOT NULL DEFAULT '').
+	saved, err := s.InsertEvent(EventLog{Source: "stale", Severity: "warn", Message: "x"})
+	if err != nil {
+		t.Fatalf("InsertEvent: %v", err)
+	}
+	if saved.WorkloadID != "" {
+		t.Fatalf("WorkloadID = %q, want empty", saved.WorkloadID)
+	}
+
+	rows, err := s.ListEvents(EventLogFilter{Source: "stale"})
+	if err != nil {
+		t.Fatalf("ListEvents: %v", err)
+	}
+	if len(rows) != 1 || rows[0].WorkloadID != "" {
+		t.Fatalf("expected one unscoped row, got %+v", rows)
+	}
+}
+
+func TestListEvents_FilterByWorkloadID(t *testing.T) {
+	s := newTestStore(t)
+
+	for _, e := range []EventLog{
+		{Source: "image", WorkloadID: "wl-1", Severity: "info", Message: "a"},
+		{Source: "image", WorkloadID: "wl-1", Severity: "error", Message: "b"},
+		{Source: "compose", WorkloadID: "wl-2", Severity: "info", Message: "c"},
+		{Source: "stale", WorkloadID: "", Severity: "warn", Message: "d"},
+	} {
+		if _, err := s.InsertEvent(e); err != nil {
+			t.Fatalf("InsertEvent %q: %v", e.Message, err)
+		}
+	}
+
+	// Filtering by wl-1 returns only its two rows.
+	rows, err := s.ListEvents(EventLogFilter{WorkloadID: "wl-1"})
+	if err != nil {
+		t.Fatalf("ListEvents wl-1: %v", err)
+	}
+	if len(rows) != 2 {
+		t.Fatalf("wl-1: got %d rows, want 2", len(rows))
+	}
+	for _, r := range rows {
+		if r.WorkloadID != "wl-1" {
+			t.Errorf("wl-1 filter leaked row with workload_id %q", r.WorkloadID)
+		}
+	}
+
+	// wl-2 returns exactly one row.
+	rows, err = s.ListEvents(EventLogFilter{WorkloadID: "wl-2"})
+	if err != nil {
+		t.Fatalf("ListEvents wl-2: %v", err)
+	}
+	if len(rows) != 1 || rows[0].Message != "c" {
+		t.Fatalf("wl-2: got %+v, want single row 'c'", rows)
+	}
+
+	// Combined workload + severity filter still narrows correctly.
+	rows, err = s.ListEvents(EventLogFilter{WorkloadID: "wl-1", Severity: "error"})
+	if err != nil {
+		t.Fatalf("ListEvents wl-1+error: %v", err)
+	}
+	if len(rows) != 1 || rows[0].Message != "b" {
+		t.Fatalf("wl-1+error: got %+v, want single row 'b'", rows)
+	}
+
+	// No filter returns all four rows (back-compat: unscoped query intact).
+	rows, err = s.ListEvents(EventLogFilter{})
+	if err != nil {
+		t.Fatalf("ListEvents all: %v", err)
+	}
+	if len(rows) != 4 {
+		t.Fatalf("unfiltered: got %d rows, want 4", len(rows))
+	}
+}
@@ -0,0 +1,171 @@
+package store
+
+import (
+	"errors"
+	"fmt"
+	"os"
+	"path/filepath"
+	"strconv"
+	"strings"
+)
+
+// ErrLockHeld is returned when another Tinyforge process appears to be
+// running against the same data directory. SQLite + SetMaxOpenConns(1)
+// makes this otherwise-silent collision a recipe for double-fired
+// schedulers, double-polled registries, and `extra_json` RMW corruption.
+var ErrLockHeld = errors.New("data directory is locked by another tinyforge process")
+
+// Lockfile is a portable PID file. AcquireLockfile takes it; the returned
+// Release function removes it. The contract:
+//
+//   - Lockfile is created with O_CREATE|O_EXCL — atomic on POSIX, atomic
+//     on NTFS / ReFS via the equivalent.
+//   - On collision, the existing file's PID is read; if the PID is dead,
+//     we treat the lock as stale (process crashed without cleanup),
+//     reclaim it, and proceed. Live PID → ErrLockHeld.
+//   - flock is intentionally not used: cross-platform consistency wins
+//     over advisory-lock semantics for the single-instance use case.
+type Lockfile struct {
+	path string
+}
+
+// AcquireLockfile creates a PID-file lock under dataDir. Returns a
+// Release function the caller must defer. If another live process holds
+// the lock, returns ErrLockHeld with a hint pointing at the lockfile.
+//
+// Reclaim atomicity: when the existing lockfile names a dead PID, the
+// replacement is serialized through an auxiliary reclaim lock (see
+// reclaimStaleLock) so that, of N processes booting concurrently against
+// the same stale lockfile, EXACTLY ONE reclaims it and the rest get
+// ErrLockHeld. A bare `os.Remove`+`O_EXCL` retry — or a rename, which is
+// "last-writer-wins" — cannot guarantee this: multiple reclaimers can each
+// end up believing they own the lock, defeating the single-instance guard.
+func AcquireLockfile(dataDir string) (release func(), err error) {
+	path := filepath.Join(dataDir, "tinyforge.lock")
+
+	// First try: clean acquire.
+	if rel, ok, err := tryCreateExclusive(path); ok {
+		return rel, nil
+	} else if err != nil {
+		return nil, err
+	}
+
+	// Existing lockfile — read PID and decide whether to reclaim.
+	pid, readErr := readLockPID(path)
+	if readErr == nil && processAlive(pid) {
+		return nil, fmt.Errorf("%w (held by pid %d, lockfile=%s)", ErrLockHeld, pid, path)
+	}
+	// Stale lock (dead pid) or malformed file — reclaim under serialization.
+	reason := "malformed existing lockfile"
+	if readErr == nil {
+		reason = fmt.Sprintf("stale lockfile (dead pid %d)", pid)
+	}
+	return reclaimStaleLock(path, reason)
+}
+
+// tryCreateExclusive attempts an atomic O_CREATE|O_EXCL create at path.
+// Returns (release, true, nil) on success; (nil, false, nil) when the
+// file already exists; (nil, false, err) on any other error.
+func tryCreateExclusive(path string) (func(), bool, error) {
+	f, openErr := os.OpenFile(path, os.O_CREATE|os.O_EXCL|os.O_WRONLY, 0o600)
+	if openErr != nil {
+		if os.IsExist(openErr) {
+			return nil, false, nil
+		}
+		return nil, false, fmt.Errorf("open lockfile: %w", openErr)
+	}
+	if _, err := fmt.Fprintf(f, "%d\n", os.Getpid()); err != nil {
+		_ = f.Close()
+		_ = os.Remove(path)
+		return nil, false, fmt.Errorf("write lockfile: %w", err)
+	}
+	if err := f.Close(); err != nil {
+		_ = os.Remove(path)
+		return nil, false, fmt.Errorf("close lockfile: %w", err)
+	}
+	return func() { _ = os.Remove(path) }, true, nil
+}
+
+// reclaimStaleLock replaces a stale/malformed lockfile with one holding our
+// PID, serialized by an auxiliary reclaim lock. Holding the reclaim lock
+// (O_EXCL) guarantees that only one process performs the remove-and-recreate
+// of the main lockfile at a time, so concurrent reclaimers cannot each end
+// up "owning" the lock the way a rename or unguarded remove+create would
+// allow. The reclaim lock is itself liveness-checked so a reclaimer that
+// crashed mid-reclaim cannot wedge startup forever.
+func reclaimStaleLock(lockPath, reason string) (func(), error) {
+	reclaimPath := lockPath + ".reclaim"
+	if err := acquireReclaimLock(reclaimPath); err != nil {
+		return nil, fmt.Errorf("%w (%v; %s)", ErrLockHeld, err, reason)
+	}
+	defer func() { _ = os.Remove(reclaimPath) }()
+
+	// Serialized now. Re-check the main lock: another process may have fully
+	// reclaimed it between our liveness probe and our taking the reclaim lock.
+	if pid, perr := readLockPID(lockPath); perr == nil && processAlive(pid) {
+		return nil, fmt.Errorf("%w (reclaimed by pid %d while we waited; %s)",
+			ErrLockHeld, pid, reason)
+	}
+
+	// Safe to replace: remove the stale file, then create a fresh exclusive
+	// one. Both run while we hold the reclaim lock, so no other reclaimer can
+	// observe the gap.
+	if err := os.Remove(lockPath); err != nil && !os.IsNotExist(err) {
+		return nil, fmt.Errorf("%w (could not remove stale lockfile %s: %v; %s)",
+			ErrLockHeld, lockPath, err, reason)
+	}
+	rel, ok, err := tryCreateExclusive(lockPath)
+	if err != nil {
+		return nil, err
+	}
+	if !ok {
+		// Should be impossible while we hold the reclaim lock; fail safe.
+		return nil, fmt.Errorf("%w (lockfile reappeared during reclaim of %s; %s)",
+			ErrLockHeld, lockPath, reason)
+	}
+	return rel, nil
+}
+
+// acquireReclaimLock takes the auxiliary reclaim lock with O_EXCL. An
+// existing reclaim lock is honoured only while its recorded PID is alive (a
+// genuine concurrent reclaim); a stale one (dead/foreign PID) is removed once
+// and re-attempted so a crashed reclaimer cannot block boot indefinitely. Of
+// concurrent callers, O_EXCL ensures at most one acquires it; the rest fail
+// and back off to ErrLockHeld.
+func acquireReclaimLock(reclaimPath string) error {
+	for attempt := 0; attempt < 2; attempt++ {
+		f, err := os.OpenFile(reclaimPath, os.O_CREATE|os.O_EXCL|os.O_WRONLY, 0o600)
+		if err == nil {
+			if _, werr := fmt.Fprintf(f, "%d\n", os.Getpid()); werr != nil {
+				_ = f.Close()
+				_ = os.Remove(reclaimPath)
+				return fmt.Errorf("write reclaim lock %s: %v", reclaimPath, werr)
+			}
+			return f.Close()
+		}
+		if !os.IsExist(err) {
+			return fmt.Errorf("create reclaim lock %s: %v", reclaimPath, err)
+		}
+		// Reclaim lock present. A live owner means a real concurrent reclaim.
+		if pid, perr := readLockPID(reclaimPath); perr == nil && processAlive(pid) {
+			return fmt.Errorf("concurrent reclaim in progress (pid %d)", pid)
+		}
+		// Stale reclaim lock — clear it and retry the exclusive create once.
+		if rerr := os.Remove(reclaimPath); rerr != nil && !os.IsNotExist(rerr) {
+			return fmt.Errorf("remove stale reclaim lock %s: %v", reclaimPath, rerr)
+		}
+	}
+	return fmt.Errorf("could not acquire reclaim lock %s after retry", reclaimPath)
+}
+
+func readLockPID(path string) (int, error) {
+	data, err := os.ReadFile(path)
+	if err != nil {
+		return 0, err
+	}
+	pidStr := strings.TrimSpace(string(data))
+	if pidStr == "" {
+		return 0, errors.New("empty lockfile")
+	}
+	return strconv.Atoi(pidStr)
+}
@@ -0,0 +1,137 @@
+package store
+
+import (
+	"errors"
+	"fmt"
+	"os"
+	"path/filepath"
+	"sync"
+	"testing"
+)
+
+func TestAcquireLockfile_FreshDir(t *testing.T) {
+	dir := t.TempDir()
+	release, err := AcquireLockfile(dir)
+	if err != nil {
+		t.Fatalf("AcquireLockfile: %v", err)
+	}
+	defer release()
+
+	// Lockfile should exist with our PID.
+	data, err := os.ReadFile(filepath.Join(dir, "tinyforge.lock"))
+	if err != nil {
+		t.Fatalf("read lockfile: %v", err)
+	}
+	want := fmt.Sprintf("%d\n", os.Getpid())
+	if string(data) != want {
+		t.Errorf("lockfile content = %q, want %q", data, want)
+	}
+}
+
+func TestAcquireLockfile_HeldByLivePID_Refused(t *testing.T) {
+	dir := t.TempDir()
+	// Plant a lockfile holding the current PID (which is obviously alive).
+	if err := os.WriteFile(filepath.Join(dir, "tinyforge.lock"),
+		[]byte(fmt.Sprintf("%d\n", os.Getpid())), 0o600); err != nil {
+		t.Fatalf("plant lockfile: %v", err)
+	}
+	release, err := AcquireLockfile(dir)
+	if err == nil {
+		release()
+		t.Fatal("expected ErrLockHeld, got nil")
+	}
+	if !errors.Is(err, ErrLockHeld) {
+		t.Errorf("error = %v, want wrap of ErrLockHeld", err)
+	}
+}
+
+func TestAcquireLockfile_StalePID_Reclaimed(t *testing.T) {
+	dir := t.TempDir()
+	// PID 1 is init/launchd/systemd on POSIX and the System Idle Process
+	// on Windows — never our process, and very unlikely to be dead. We
+	// use a deliberately-impossible PID instead: a 31-bit value far
+	// above any plausible system maximum.
+	stalePID := 2147483640
+	if err := os.WriteFile(filepath.Join(dir, "tinyforge.lock"),
+		[]byte(fmt.Sprintf("%d\n", stalePID)), 0o600); err != nil {
+		t.Fatalf("plant stale lockfile: %v", err)
+	}
+	release, err := AcquireLockfile(dir)
+	if err != nil {
+		t.Fatalf("expected reclaim of stale lock, got: %v", err)
+	}
+	defer release()
+
+	// Verify it now holds OUR pid, not the stale one.
+	data, err := os.ReadFile(filepath.Join(dir, "tinyforge.lock"))
+	if err != nil {
+		t.Fatalf("read lockfile after reclaim: %v", err)
+	}
+	want := fmt.Sprintf("%d\n", os.Getpid())
+	if string(data) != want {
+		t.Errorf("lockfile content after reclaim = %q, want %q", data, want)
+	}
+}
+
+func TestAcquireLockfile_ConcurrentReclaim_SingleWinner(t *testing.T) {
+	dir := t.TempDir()
+	// Plant a stale lockfile (impossibly high, certainly-dead PID), then have
+	// many goroutines race to reclaim it. Exactly one must win; the rest must
+	// be refused with ErrLockHeld. A "last-writer-wins" reclaim would let
+	// several goroutines all believe they own the lock.
+	stalePID := 2147483640
+	if err := os.WriteFile(filepath.Join(dir, "tinyforge.lock"),
+		[]byte(fmt.Sprintf("%d\n", stalePID)), 0o600); err != nil {
+		t.Fatalf("plant stale lockfile: %v", err)
+	}
+
+	const n = 16
+	var (
+		wg       sync.WaitGroup
+		mu       sync.Mutex
+		winners  int
+		releases []func()
+	)
+	start := make(chan struct{})
+	for i := 0; i < n; i++ {
+		wg.Add(1)
+		go func() {
+			defer wg.Done()
+			<-start
+			release, err := AcquireLockfile(dir)
+			if err != nil {
+				if !errors.Is(err, ErrLockHeld) {
+					t.Errorf("loser error = %v, want wrap of ErrLockHeld", err)
+				}
+				return
+			}
+			mu.Lock()
+			winners++
+			releases = append(releases, release)
+			mu.Unlock()
+		}()
+	}
+	close(start)
+	wg.Wait()
+
+	for _, r := range releases {
+		r()
+	}
+	if winners != 1 {
+		t.Fatalf("concurrent reclaim winners = %d, want exactly 1", winners)
+	}
+}
+
+func TestAcquireLockfile_ReleaseRemovesFile(t *testing.T) {
+	dir := t.TempDir()
+	release, err := AcquireLockfile(dir)
+	if err != nil {
+		t.Fatalf("AcquireLockfile: %v", err)
+	}
+	release()
+
+	path := filepath.Join(dir, "tinyforge.lock")
+	if _, err := os.Stat(path); !os.IsNotExist(err) {
+		t.Errorf("lockfile still present after release: %v", err)
+	}
+}
@@ -0,0 +1,33 @@
+//go:build !windows
+
+package store
+
+import (
+	"errors"
+	"os"
+	"syscall"
+)
+
+// processAlive checks whether the given PID belongs to a running process.
+// On POSIX, kill(pid, 0) sends no signal but returns ESRCH if the PID is
+// dead, EPERM if alive-but-foreign-owned (still "alive" for our purposes).
+//
+// os.FindProcess never returns a non-nil error on Linux / macOS / *BSD
+// for any PID value — it just records the integer. The probe is purely
+// the Signal(0) result. We keep the FindProcess call to obtain the
+// *os.Process handle Signal needs; we don't branch on its error.
+func processAlive(pid int) bool {
+	if pid <= 0 {
+		return false
+	}
+	proc, _ := os.FindProcess(pid)
+	if proc == nil {
+		return false
+	}
+	err := proc.Signal(syscall.Signal(0))
+	if err == nil {
+		return true
+	}
+	// EPERM = alive but not ours; ESRCH = dead.
+	return errors.Is(err, os.ErrPermission) || errors.Is(err, syscall.EPERM)
+}
@@ -0,0 +1,30 @@
+//go:build windows
+
+package store
+
+import (
+	"golang.org/x/sys/windows"
+)
+
+// processAlive returns true when the given PID is currently held by a
+// running Windows process. OpenProcess with PROCESS_QUERY_LIMITED_INFORMATION
+// is the supported way to check liveness without elevation.
+func processAlive(pid int) bool {
+	if pid <= 0 {
+		return false
+	}
+	h, err := windows.OpenProcess(windows.PROCESS_QUERY_LIMITED_INFORMATION, false, uint32(pid))
+	if err != nil {
+		return false
+	}
+	defer windows.CloseHandle(h)
+	var exitCode uint32
+	if err := windows.GetExitCodeProcess(h, &exitCode); err != nil {
+		// Conservative: if we can't ask, assume alive so we don't reclaim
+		// an active lock. Worst case the operator sees ErrLockHeld and
+		// removes the lockfile by hand.
+		return true
+	}
+	const stillActive = 259 // STILL_ACTIVE
+	return exitCode == stillActive
+}
@@ -0,0 +1,191 @@
+package store
+
+import (
+	"database/sql"
+	"errors"
+	"fmt"
+	"strings"
+)
+
+// CreateMetricAlertRule inserts a new rule row after validating its
+// metric/comparator/severity enums and rejecting negative cooldowns.
+func (s *Store) CreateMetricAlertRule(r MetricAlertRule) (MetricAlertRule, error) {
+	if err := validateMetricAlertRule(r); err != nil {
+		return MetricAlertRule{}, err
+	}
+	now := Now()
+	r.CreatedAt = now
+	r.UpdatedAt = now
+	res, err := s.db.Exec(
+		`INSERT INTO metric_alert_rules
+		   (workload_id, name, metric, comparator, threshold, severity,
+		    cooldown_seconds, enabled, created_at, updated_at)
+		 VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)`,
+		r.WorkloadID, r.Name, r.Metric, r.Comparator, r.Threshold, r.Severity,
+		r.CooldownSeconds, boolToInt(r.Enabled), r.CreatedAt, r.UpdatedAt,
+	)
+	if err != nil {
+		return MetricAlertRule{}, fmt.Errorf("insert metric alert rule: %w", err)
+	}
+	id, err := res.LastInsertId()
+	if err != nil {
+		return MetricAlertRule{}, fmt.Errorf("get metric alert rule id: %w", err)
+	}
+	r.ID = id
+	return r, nil
+}
+
+// ListMetricAlertRules returns every rule, ordered by id for stable UI
+// rendering.
+func (s *Store) ListMetricAlertRules() ([]MetricAlertRule, error) {
+	return s.queryMetricAlertRules(
+		`SELECT id, workload_id, name, metric, comparator, threshold, severity,
+		        cooldown_seconds, enabled, created_at, updated_at
+		 FROM metric_alert_rules ORDER BY id`,
+	)
+}
+
+// ListMetricAlertRulesByWorkload returns rules that apply to the given
+// workload: rows explicitly scoped to it plus global rows (workload_id
+// = ""). Useful for the workload detail page.
+func (s *Store) ListMetricAlertRulesByWorkload(workloadID string) ([]MetricAlertRule, error) {
+	return s.queryMetricAlertRules(
+		`SELECT id, workload_id, name, metric, comparator, threshold, severity,
+		        cooldown_seconds, enabled, created_at, updated_at
+		 FROM metric_alert_rules WHERE workload_id = ? OR workload_id = '' ORDER BY id`,
+		workloadID,
+	)
+}
+
+// GetMetricAlertRule fetches one rule by id or returns ErrNotFound.
+func (s *Store) GetMetricAlertRule(id int64) (MetricAlertRule, error) {
+	row := s.db.QueryRow(
+		`SELECT id, workload_id, name, metric, comparator, threshold, severity,
+		        cooldown_seconds, enabled, created_at, updated_at
+		 FROM metric_alert_rules WHERE id = ?`, id,
+	)
+	r, err := scanMetricAlertRuleRow(row)
+	if errors.Is(err, sql.ErrNoRows) {
+		return MetricAlertRule{}, fmt.Errorf("metric alert rule %d: %w", id, ErrNotFound)
+	}
+	if err != nil {
+		return MetricAlertRule{}, fmt.Errorf("query metric alert rule: %w", err)
+	}
+	return r, nil
+}
+
+// UpdateMetricAlertRule overwrites the editable columns of a rule row.
+// id and workload_id are immutable on update — change the scope of a
+// rule by deleting + recreating, mirroring the log-scan store.
+func (s *Store) UpdateMetricAlertRule(r MetricAlertRule) (MetricAlertRule, error) {
+	if r.ID == 0 {
+		return MetricAlertRule{}, fmt.Errorf("metric alert rule: id is required for update")
+	}
+	if err := validateMetricAlertRule(r); err != nil {
+		return MetricAlertRule{}, err
+	}
+	r.UpdatedAt = Now()
+	res, err := s.db.Exec(
+		`UPDATE metric_alert_rules
+		    SET name = ?, metric = ?, comparator = ?, threshold = ?, severity = ?,
+		        cooldown_seconds = ?, enabled = ?, updated_at = ?
+		  WHERE id = ?`,
+		r.Name, r.Metric, r.Comparator, r.Threshold, r.Severity,
+		r.CooldownSeconds, boolToInt(r.Enabled), r.UpdatedAt, r.ID,
+	)
+	if err != nil {
+		return MetricAlertRule{}, fmt.Errorf("update metric alert rule: %w", err)
+	}
+	n, _ := res.RowsAffected()
+	if n == 0 {
+		return MetricAlertRule{}, fmt.Errorf("metric alert rule %d: %w", r.ID, ErrNotFound)
+	}
+	return s.GetMetricAlertRule(r.ID)
+}
+
+// DeleteMetricAlertRule removes a rule by id, returning ErrNotFound when
+// no row matched.
+func (s *Store) DeleteMetricAlertRule(id int64) error {
+	res, err := s.db.Exec(`DELETE FROM metric_alert_rules WHERE id = ?`, id)
+	if err != nil {
+		return fmt.Errorf("delete metric alert rule: %w", err)
+	}
+	n, _ := res.RowsAffected()
+	if n == 0 {
+		return fmt.Errorf("metric alert rule %d: %w", id, ErrNotFound)
+	}
+	return nil
+}
+
+func (s *Store) queryMetricAlertRules(query string, args ...any) ([]MetricAlertRule, error) {
+	rows, err := s.db.Query(query, args...)
+	if err != nil {
+		return nil, fmt.Errorf("query metric alert rules: %w", err)
+	}
+	defer rows.Close()
+	out := []MetricAlertRule{}
+	for rows.Next() {
+		r, err := scanMetricAlertRuleRows(rows)
+		if err != nil {
+			return nil, err
+		}
+		out = append(out, r)
+	}
+	return out, rows.Err()
+}
+
+func scanMetricAlertRuleRows(rows *sql.Rows) (MetricAlertRule, error) {
+	var r MetricAlertRule
+	var enabled int
+	if err := rows.Scan(
+		&r.ID, &r.WorkloadID, &r.Name, &r.Metric, &r.Comparator, &r.Threshold, &r.Severity,
+		&r.CooldownSeconds, &enabled, &r.CreatedAt, &r.UpdatedAt,
+	); err != nil {
+		return MetricAlertRule{}, fmt.Errorf("scan metric alert rule: %w", err)
+	}
+	r.Enabled = enabled != 0
+	return r, nil
+}
+
+func scanMetricAlertRuleRow(row *sql.Row) (MetricAlertRule, error) {
+	var r MetricAlertRule
+	var enabled int
+	if err := row.Scan(
+		&r.ID, &r.WorkloadID, &r.Name, &r.Metric, &r.Comparator, &r.Threshold, &r.Severity,
+		&r.CooldownSeconds, &enabled, &r.CreatedAt, &r.UpdatedAt,
+	); err != nil {
+		return MetricAlertRule{}, err
+	}
+	r.Enabled = enabled != 0
+	return r, nil
+}
+
+// validateMetricAlertRule enforces the per-row invariants: a non-empty
+// name, a known metric/comparator, a valid severity (blank allowed so
+// the caller can default it), and a non-negative cooldown.
+func validateMetricAlertRule(r MetricAlertRule) error {
+	if strings.TrimSpace(r.Name) == "" {
+		return fmt.Errorf("metric alert rule: name is required")
+	}
+	switch r.Metric {
+	case MetricCPUPercent, MetricMemoryPercent, MetricMemoryBytes:
+	default:
+		return fmt.Errorf("metric alert rule: invalid metric %q", r.Metric)
+	}
+	switch r.Comparator {
+	case MetricComparatorGT, MetricComparatorLT:
+	default:
+		return fmt.Errorf("metric alert rule: invalid comparator %q", r.Comparator)
+	}
+	switch r.Severity {
+	case LogScanSeverityInfo, LogScanSeverityWarn, LogScanSeverityError:
+	case "":
+		// Default applied at the caller; allow blank.
+	default:
+		return fmt.Errorf("metric alert rule: invalid severity %q", r.Severity)
+	}
+	if r.CooldownSeconds < 0 {
+		return fmt.Errorf("metric alert rule: cooldown_seconds must be >= 0")
+	}
+	return nil
+}
@@ -0,0 +1,167 @@
+package store
+
+import (
+	"strings"
+	"testing"
+)
+
+func TestCreateMetricAlertRule_Validates(t *testing.T) {
+	s := newTestStore(t)
+	cases := []struct {
+		name    string
+		in      MetricAlertRule
+		wantErr string
+	}{
+		{
+			name:    "missing name",
+			in:      MetricAlertRule{Metric: MetricCPUPercent, Comparator: MetricComparatorGT},
+			wantErr: "name is required",
+		},
+		{
+			name:    "bad metric",
+			in:      MetricAlertRule{Name: "n", Metric: "load_avg", Comparator: MetricComparatorGT},
+			wantErr: "invalid metric",
+		},
+		{
+			name:    "bad comparator",
+			in:      MetricAlertRule{Name: "n", Metric: MetricCPUPercent, Comparator: "eq"},
+			wantErr: "invalid comparator",
+		},
+		{
+			name:    "bad severity",
+			in:      MetricAlertRule{Name: "n", Metric: MetricCPUPercent, Comparator: MetricComparatorGT, Severity: "loud"},
+			wantErr: "invalid severity",
+		},
+		{
+			name:    "negative cooldown",
+			in:      MetricAlertRule{Name: "n", Metric: MetricCPUPercent, Comparator: MetricComparatorGT, CooldownSeconds: -1},
+			wantErr: "cooldown_seconds must be",
+		},
+	}
+	for _, c := range cases {
+		t.Run(c.name, func(t *testing.T) {
+			_, err := s.CreateMetricAlertRule(c.in)
+			if err == nil {
+				t.Fatalf("expected error containing %q, got nil", c.wantErr)
+			}
+			if !strings.Contains(err.Error(), c.wantErr) {
+				t.Fatalf("error mismatch: got %q want substring %q", err.Error(), c.wantErr)
+			}
+		})
+	}
+}
+
+func TestCreateAndGetMetricAlertRule(t *testing.T) {
+	s := newTestStore(t)
+	r, err := s.CreateMetricAlertRule(MetricAlertRule{
+		Name: "cpu-hot", Metric: MetricCPUPercent, Comparator: MetricComparatorGT,
+		Threshold: 80, Severity: "warn", CooldownSeconds: 300, Enabled: true,
+	})
+	if err != nil {
+		t.Fatalf("create: %v", err)
+	}
+	if r.ID == 0 {
+		t.Fatal("id should be set")
+	}
+	got, err := s.GetMetricAlertRule(r.ID)
+	if err != nil {
+		t.Fatalf("get: %v", err)
+	}
+	if got.Metric != MetricCPUPercent || got.Comparator != MetricComparatorGT {
+		t.Errorf("metric/comparator mismatch: %q %q", got.Metric, got.Comparator)
+	}
+	if got.Threshold != 80 {
+		t.Errorf("threshold mismatch: %v", got.Threshold)
+	}
+	if !got.Enabled {
+		t.Error("enabled lost on round-trip")
+	}
+}
+
+func TestGetMetricAlertRule_NotFound(t *testing.T) {
+	s := newTestStore(t)
+	if _, err := s.GetMetricAlertRule(999); err == nil {
+		t.Fatal("expected ErrNotFound for missing rule")
+	}
+}
+
+func TestListMetricAlertRulesByWorkload(t *testing.T) {
+	s := newTestStore(t)
+	_, _ = s.CreateMetricAlertRule(MetricAlertRule{
+		Name: "global", Metric: MetricCPUPercent, Comparator: MetricComparatorGT,
+		Threshold: 90, Severity: "warn", Enabled: true,
+	})
+	_, _ = s.CreateMetricAlertRule(MetricAlertRule{
+		Name: "w1-mem", WorkloadID: "w1", Metric: MetricMemoryPercent, Comparator: MetricComparatorGT,
+		Threshold: 85, Severity: "error", Enabled: true,
+	})
+	_, _ = s.CreateMetricAlertRule(MetricAlertRule{
+		Name: "w2-mem", WorkloadID: "w2", Metric: MetricMemoryBytes, Comparator: MetricComparatorGT,
+		Threshold: 1000, Severity: "info", Enabled: true,
+	})
+
+	w1, err := s.ListMetricAlertRulesByWorkload("w1")
+	if err != nil {
+		t.Fatalf("by workload: %v", err)
+	}
+	// w1 sees its own rule + the global, but NOT w2's rule.
+	if len(w1) != 2 {
+		t.Fatalf("w1 should see 2 rules (own + global), got %d", len(w1))
+	}
+	for _, r := range w1 {
+		if r.WorkloadID == "w2" {
+			t.Errorf("w1 should not see w2's rule")
+		}
+	}
+}
+
+func TestUpdateMetricAlertRule(t *testing.T) {
+	s := newTestStore(t)
+	r, _ := s.CreateMetricAlertRule(MetricAlertRule{
+		Name: "n", Metric: MetricCPUPercent, Comparator: MetricComparatorGT,
+		Threshold: 80, Severity: "warn", Enabled: true,
+	})
+	r.Threshold = 95
+	r.Comparator = MetricComparatorLT
+	r.Enabled = false
+	got, err := s.UpdateMetricAlertRule(r)
+	if err != nil {
+		t.Fatalf("update: %v", err)
+	}
+	if got.Threshold != 95 {
+		t.Errorf("threshold not updated: %v", got.Threshold)
+	}
+	if got.Comparator != MetricComparatorLT {
+		t.Errorf("comparator not updated: %q", got.Comparator)
+	}
+	if got.Enabled {
+		t.Error("enabled=false not applied")
+	}
+}
+
+func TestUpdateMetricAlertRule_NotFound(t *testing.T) {
+	s := newTestStore(t)
+	_, err := s.UpdateMetricAlertRule(MetricAlertRule{
+		ID: 999, Name: "n", Metric: MetricCPUPercent, Comparator: MetricComparatorGT,
+	})
+	if err == nil {
+		t.Fatal("expected ErrNotFound updating missing rule")
+	}
+}
+
+func TestDeleteMetricAlertRule(t *testing.T) {
+	s := newTestStore(t)
+	r, _ := s.CreateMetricAlertRule(MetricAlertRule{
+		Name: "n", Metric: MetricCPUPercent, Comparator: MetricComparatorGT,
+		Threshold: 80, Severity: "warn", Enabled: true,
+	})
+	if err := s.DeleteMetricAlertRule(r.ID); err != nil {
+		t.Fatalf("delete: %v", err)
+	}
+	if _, err := s.GetMetricAlertRule(r.ID); err == nil {
+		t.Error("rule should be gone after delete")
+	}
+	if err := s.DeleteMetricAlertRule(r.ID); err == nil {
+		t.Error("expected ErrNotFound deleting already-deleted rule")
+	}
+}
@@ -14,60 +14,60 @@ type Registry struct {

 // Settings holds global application configuration (single-row pattern).
 type Settings struct {
-	Domain           string `json:"domain"`
-	ServerIP         string `json:"server_ip"`  // Docker host IP (for NPM remote forwarding)
-	PublicIP         string `json:"public_ip"`  // Public-facing IP for DNS A records (e.g., NPM/proxy host)
-	Network          string `json:"network"`
-	SubdomainPattern string `json:"subdomain_pattern"`
-	NotificationURL    string `json:"notification_url"`
-	NotificationSecret string `json:"-"` // outgoing-webhook signing secret; never serialized directly
-	NpmURL           string `json:"npm_url"`
-	NpmEmail         string `json:"npm_email"`
-	NpmPassword      string `json:"npm_password"`
-	PollingInterval  string `json:"polling_interval"`
-	BaseVolumePath   string `json:"base_volume_path"`
-	SSLCertificateID   int    `json:"ssl_certificate_id"`
-	StaleThresholdDays int    `json:"stale_threshold_days"`
-	AllowedVolumePaths string `json:"allowed_volume_paths"` // JSON array of allowed absolute paths
-	WildcardDNS          bool   `json:"wildcard_dns"`
-	DNSProvider          string `json:"dns_provider"`
-	CloudflareAPIToken   string `json:"cloudflare_api_token"`
-	CloudflareZoneID     string `json:"cloudflare_zone_id"`
-	NpmRemote           bool   `json:"npm_remote"`
-	NpmAccessListID     int    `json:"npm_access_list_id"`
-	ProxyProvider       string `json:"proxy_provider"`
-	TraefikEntrypoint   string `json:"traefik_entrypoint"`
-	TraefikCertResolver string `json:"traefik_cert_resolver"`
-	TraefikNetwork      string `json:"traefik_network"`
-	TraefikAPIURL       string `json:"traefik_api_url"`
-	ImagePruneThresholdMB int  `json:"image_prune_threshold_mb"`
-	BackupEnabled       bool   `json:"backup_enabled"`
-	BackupIntervalHours  int    `json:"backup_interval_hours"`
-	BackupRetentionCount int    `json:"backup_retention_count"`
+	Domain                string `json:"domain"`
+	ServerIP              string `json:"server_ip"` // Docker host IP (for NPM remote forwarding)
+	PublicIP              string `json:"public_ip"` // Public-facing IP for DNS A records (e.g., NPM/proxy host)
+	Network               string `json:"network"`
+	SubdomainPattern      string `json:"subdomain_pattern"`
+	NotificationURL       string `json:"notification_url"`
+	NotificationSecret    string `json:"-"` // outgoing-webhook signing secret; never serialized directly
+	NpmURL                string `json:"npm_url"`
+	NpmEmail              string `json:"npm_email"`
+	NpmPassword           string `json:"npm_password"`
+	PollingInterval       string `json:"polling_interval"`
+	BaseVolumePath        string `json:"base_volume_path"`
+	SSLCertificateID      int    `json:"ssl_certificate_id"`
+	StaleThresholdDays    int    `json:"stale_threshold_days"`
+	AllowedVolumePaths    string `json:"allowed_volume_paths"` // JSON array of allowed absolute paths
+	WildcardDNS           bool   `json:"wildcard_dns"`
+	DNSProvider           string `json:"dns_provider"`
+	CloudflareAPIToken    string `json:"cloudflare_api_token"`
+	CloudflareZoneID      string `json:"cloudflare_zone_id"`
+	NpmRemote             bool   `json:"npm_remote"`
+	NpmAccessListID       int    `json:"npm_access_list_id"`
+	ProxyProvider         string `json:"proxy_provider"`
+	TraefikEntrypoint     string `json:"traefik_entrypoint"`
+	TraefikCertResolver   string `json:"traefik_cert_resolver"`
+	TraefikNetwork        string `json:"traefik_network"`
+	TraefikAPIURL         string `json:"traefik_api_url"`
+	ImagePruneThresholdMB int    `json:"image_prune_threshold_mb"`
+	BackupEnabled         bool   `json:"backup_enabled"`
+	BackupIntervalHours   int    `json:"backup_interval_hours"`
+	BackupRetentionCount  int    `json:"backup_retention_count"`
 	// AutoBackupBeforeDeploy creates a "pre-deploy" Tinyforge DB backup
 	// at the start of every project deploy. Independent of BackupEnabled
 	// (which governs the periodic auto-backup cron).
-	AutoBackupBeforeDeploy bool `json:"auto_backup_before_deploy"`
-	StatsIntervalSeconds int    `json:"stats_interval_seconds"` // 0 disables collection
-	StatsRetentionHours  int    `json:"stats_retention_hours"`  // 0 disables collection
-	UpdatedAt            string `json:"updated_at"`
+	AutoBackupBeforeDeploy bool   `json:"auto_backup_before_deploy"`
+	StatsIntervalSeconds   int    `json:"stats_interval_seconds"` // 0 disables collection
+	StatsRetentionHours    int    `json:"stats_retention_hours"`  // 0 disables collection
+	UpdatedAt              string `json:"updated_at"`
 }

 // ContainerStatsSample is one persisted sample of container resource usage.
 // Cumulative counters (network, block I/O) require differencing two samples
 // to get rates; CPU is already a percent-since-previous-sample value.
 type ContainerStatsSample struct {
-	ContainerID     string `json:"container_id"`
-	OwnerType       string `json:"owner_type"` // "instance" or "site"
-	OwnerID         string `json:"owner_id"`
-	TS              int64  `json:"ts"` // Unix seconds UTC
+	ContainerID     string  `json:"container_id"`
+	OwnerType       string  `json:"owner_type"` // "instance" or "site"
+	OwnerID         string  `json:"owner_id"`
+	TS              int64   `json:"ts"` // Unix seconds UTC
 	CPUPercent      float64 `json:"cpu_percent"`
-	MemoryUsage     int64  `json:"memory_usage"`
-	MemoryLimit     int64  `json:"memory_limit"`
-	NetworkRxBytes  int64  `json:"network_rx_bytes"`
-	NetworkTxBytes  int64  `json:"network_tx_bytes"`
-	BlockReadBytes  int64  `json:"block_read_bytes"`
-	BlockWriteBytes int64  `json:"block_write_bytes"`
+	MemoryUsage     int64   `json:"memory_usage"`
+	MemoryLimit     int64   `json:"memory_limit"`
+	NetworkRxBytes  int64   `json:"network_rx_bytes"`
+	NetworkTxBytes  int64   `json:"network_tx_bytes"`
+	BlockReadBytes  int64   `json:"block_read_bytes"`
+	BlockWriteBytes int64   `json:"block_write_bytes"`
 }

 // SystemStatsSample is one persisted host-level snapshot that aggregates
@@ -91,6 +91,21 @@ type Backup struct {
 	CreatedAt  string `json:"created_at"`
 }

+// VolumeSnapshot is one captured archive of a workload's host-bind data
+// volumes. Unlike Backup (global, SQLite-specific) it is per-workload and the
+// archive is a tar.gz of the resolved volume directories. Manifest is a
+// JSON-encoded []SnapshotVolume describing what the archive covers, so a
+// future restore can re-resolve each target even if volume settings drift.
+type VolumeSnapshot struct {
+	ID         string `json:"id"`
+	WorkloadID string `json:"workload_id"`
+	Label      string `json:"label"`
+	Filename   string `json:"filename"`
+	SizeBytes  int64  `json:"size_bytes"`
+	Manifest   string `json:"manifest"` // JSON []SnapshotVolume
+	CreatedAt  string `json:"created_at"`
+}
+
 // DNSRecord tracks a DNS record managed by the application.
 type DNSRecord struct {
 	ID               string `json:"id"`
@@ -106,10 +121,12 @@ type DNSRecord struct {
 // page. The legacy field names (ProjectID, ProjectName, StageID,
 // StageName, InstanceID) are retained verbatim for the existing
 // frontend contract — after the workload-first cutover they map to:
-//   ProjectID/Name → workload id / workload name
-//   StageID/Name   → containers.stage_id / containers.role
-//   InstanceID     → container row id
-//   Source         → "instance" for image/compose, "static_site" for static
+//
+//	ProjectID/Name → workload id / workload name
+//	StageID/Name   → containers.stage_id / containers.role
+//	InstanceID     → container row id
+//	Source         → "instance" for image/compose, "static_site" for static
+//
 // Renaming would require a coordinated frontend change; deferred.
 type ProxyRoute struct {
 	Source       string `json:"source"`
@@ -157,6 +174,29 @@ type WorkloadEnv struct {
 	UpdatedAt  string `json:"updated_at"`
 }

+// SharedSecret is an env var shared across workloads by scope. Resolved
+// into a workload's container env as a low-precedence default (overridden
+// by image cfg.Env and workload_env).
+type SharedSecret struct {
+	ID          string `json:"id"`
+	Name        string `json:"name"`  // the env KEY
+	Value       string `json:"value"` // ciphertext when Encrypted; never returned decrypted by the API
+	Encrypted   bool   `json:"encrypted"`
+	Scope       string `json:"scope"`  // global | app
+	AppID       string `json:"app_id"` // set when scope == app; "" for global
+	Description string `json:"description"`
+	Enabled     bool   `json:"enabled"`
+	CreatedAt   string `json:"created_at"`
+	UpdatedAt   string `json:"updated_at"`
+}
+
+// Shared-secret scope enum: a secret is either applied to every workload
+// (global) or only to workloads whose app_id matches (app).
+const (
+	SharedSecretScopeGlobal = "global"
+	SharedSecretScopeApp    = "app"
+)
+
 // VolumeScope defines the sharing scope for a volume mount.
 // Valid scopes: instance, stage, project, project_named, named, ephemeral.
 type VolumeScope string
@@ -190,12 +230,13 @@ func IsValidVolumeScope(s string) bool {

 // EventLog represents a persistent event log entry.
 type EventLog struct {
-	ID        int64  `json:"id"`
-	Source    string `json:"source"`
-	Severity string `json:"severity"` // info, warn, error
-	Message  string `json:"message"`
-	Metadata string `json:"metadata"` // JSON-encoded structured data
-	CreatedAt string `json:"created_at"`
+	ID         int64  `json:"id"`
+	Source     string `json:"source"`
+	WorkloadID string `json:"workload_id"` // "" = unscoped (non-deploy events)
+	Severity   string `json:"severity"`    // info, warn, error
+	Message    string `json:"message"`
+	Metadata   string `json:"metadata"` // JSON-encoded structured data
+	CreatedAt  string `json:"created_at"`
 }

 // EventTrigger is a filter+action rule evaluated against EventLog
@@ -245,12 +286,12 @@ const (
 //     for this workload).
 type LogScanRule struct {
 	ID              int64  `json:"id"`
-	WorkloadID      string `json:"workload_id"`        // "" = global
-	OverridesID     int64  `json:"overrides_id"`       // 0 = not an override
+	WorkloadID      string `json:"workload_id"`  // "" = global
+	OverridesID     int64  `json:"overrides_id"` // 0 = not an override
 	Name            string `json:"name"`
-	Pattern         string `json:"pattern"`            // regex, compiled at load
-	Severity        string `json:"severity"`           // info|warn|error
-	Streams         string `json:"streams"`            // all|stdout|stderr
+	Pattern         string `json:"pattern"`  // regex, compiled at load
+	Severity        string `json:"severity"` // info|warn|error
+	Streams         string `json:"streams"`  // all|stdout|stderr
 	CooldownSeconds int    `json:"cooldown_seconds"`
 	Enabled         bool   `json:"enabled"`
 	CreatedAt       string `json:"created_at"`
@@ -274,16 +315,57 @@ const (
 	LogScanSeverityError = "error"
 )

+// MetricAlertRule fires an event when a container metric breaches a
+// threshold. Mirrors LogScanRule but evaluated against stats_samples
+// instead of log lines.
+type MetricAlertRule struct {
+	ID              int64   `json:"id"`
+	WorkloadID      string  `json:"workload_id"` // "" = applies to all workloads
+	Name            string  `json:"name"`
+	Metric          string  `json:"metric"`     // cpu_percent | memory_percent | memory_bytes
+	Comparator      string  `json:"comparator"` // gt | lt
+	Threshold       float64 `json:"threshold"`
+	Severity        string  `json:"severity"`         // info | warn | error
+	CooldownSeconds int     `json:"cooldown_seconds"` // min seconds between fires per (rule,workload)
+	Enabled         bool    `json:"enabled"`
+	CreatedAt       string  `json:"created_at"`
+	UpdatedAt       string  `json:"updated_at"`
+}
+
+// Metric-alert metric identifiers. cpu_percent + memory_percent are
+// 0–100 ratios; memory_bytes is an absolute usage figure. Validated in
+// the store on create/update.
+const (
+	MetricCPUPercent    = "cpu_percent"
+	MetricMemoryPercent = "memory_percent"
+	MetricMemoryBytes   = "memory_bytes"
+)
+
+// Metric-alert comparators. gt fires when the value exceeds the
+// threshold; lt when it falls below.
+const (
+	MetricComparatorGT = "gt"
+	MetricComparatorLT = "lt"
+)
+
 // WorkloadKind enumerates the legacy discriminator values written into
 // containers.workload_kind and workloads.kind. After the hard cutover the
 // backing project / stack / static_site tables are gone — these constants
 // are just strings used to filter the unified containers index in the UI.
+//
+// `build` is the dockerfile-source kind: a container built from a
+// Dockerfile in a Git repo. Operationally it looks like a site (one
+// container, one optional public face) but its origin is the build
+// pipeline, not a static-asset extract. Dashboard filters that need to
+// distinguish "I built this from source" from "I served files from a
+// repo" should key on this value.
 type WorkloadKind string

 const (
 	WorkloadKindProject WorkloadKind = "project"
 	WorkloadKindStack   WorkloadKind = "stack"
 	WorkloadKindSite    WorkloadKind = "site"
+	WorkloadKindBuild   WorkloadKind = "build"
 )

 // Workload is the unifying primitive that abstracts Project, Stack, and StaticSite.
@@ -300,13 +382,13 @@ type Workload struct {
 	Kind                    string `json:"kind"` // project | stack | site (legacy discriminator)
 	RefID                   string `json:"ref_id"`
 	Name                    string `json:"name"`
-	AppID                   string `json:"app_id"` // nullable; "" = unassigned (a.k.a. GroupID after rename)
-	SourceKind              string `json:"source_kind"`              // "" until plugin-mode populated
-	SourceConfig            string `json:"source_config"`            // JSON-encoded, decoded by the matching Source
+	AppID                   string `json:"app_id"`        // nullable; "" = unassigned (a.k.a. GroupID after rename)
+	SourceKind              string `json:"source_kind"`   // "" until plugin-mode populated
+	SourceConfig            string `json:"source_config"` // JSON-encoded, decoded by the matching Source
 	TriggerKind             string `json:"trigger_kind"`
-	TriggerConfig           string `json:"trigger_config"`           // JSON-encoded, decoded by the matching Trigger
-	PublicFaces             string `json:"public_faces"`             // JSON-encoded []PublicFace
-	ParentWorkloadID        string `json:"parent_workload_id"`       // "" = root; non-empty = stage chain
+	TriggerConfig           string `json:"trigger_config"`     // JSON-encoded, decoded by the matching Trigger
+	PublicFaces             string `json:"public_faces"`       // JSON-encoded []PublicFace
+	ParentWorkloadID        string `json:"parent_workload_id"` // "" = root; non-empty = stage chain
 	NotificationURL         string `json:"notification_url"`
 	NotificationSecret      string `json:"-"` // never serialized
 	WebhookSecret           string `json:"-"` // URL-identifier secret; never serialized
@@ -316,6 +398,31 @@ type Workload struct {
 	UpdatedAt               string `json:"updated_at"`
 }

+// WorkloadNotification is one configured outbound notification route for
+// a workload. Multiple rows per workload model the "one Slack channel
+// for failures, one Discord webhook for successes" routing the legacy
+// single notification_url column could not express.
+//
+// EventTypes is a comma-separated allow-list (e.g. "build_failure" or
+// "deploy_success,deploy_failure"). An empty EventTypes means the row
+// fires for every event type — the cheapest way to keep the existing
+// single-destination behaviour expressible in the new shape.
+//
+// Secret round-trips through the same crypto envelope as other stored
+// secrets; the API layer strips it from responses.
+type WorkloadNotification struct {
+	ID         string `json:"id"`
+	WorkloadID string `json:"workload_id"`
+	Name       string `json:"name"`
+	URL        string `json:"url"`
+	Secret     string `json:"-"`
+	EventTypes string `json:"event_types"`
+	Enabled    bool   `json:"enabled"`
+	SortOrder  int    `json:"sort_order"`
+	CreatedAt  string `json:"created_at"`
+	UpdatedAt  string `json:"updated_at"`
+}
+
 // Container is the normalized index of every Tinyforge-managed container.
 // Replaces the project-specific Instance table after migration. Subdomain/
 // proxy fields are hoisted as first-class columns because ListProxyRoutes,
@@ -360,11 +467,11 @@ type Container struct {
 // which workloads to fire.
 type Trigger struct {
 	ID                      string `json:"id"`
-	Kind                    string `json:"kind"`           // registry | git | manual | schedule | log_scan | ...
-	Name                    string `json:"name"`           // human-readable, unique
-	Config                  string `json:"config"`         // JSON-encoded, decoded by the matching plugin
-	WebhookSecret           string `json:"-"`              // URL-identifier secret; never serialized
-	WebhookSigningSecret    string `json:"-"`              // HMAC key; never serialized
+	Kind                    string `json:"kind"`   // registry | git | manual | schedule | log_scan | ...
+	Name                    string `json:"name"`   // human-readable, unique
+	Config                  string `json:"config"` // JSON-encoded, decoded by the matching plugin
+	WebhookSecret           string `json:"-"`      // URL-identifier secret; never serialized
+	WebhookSigningSecret    string `json:"-"`      // HMAC key; never serialized
 	WebhookRequireSignature bool   `json:"webhook_require_signature"`
 	// LastFiredAt is the RFC3339 wall-clock the scheduler last dispatched
 	// this trigger. Empty for never-fired or non-schedule triggers. The
@@ -400,4 +507,3 @@ type App struct {
 	CreatedAt   string `json:"created_at"`
 	UpdatedAt   string `json:"updated_at"`
 }
-
@@ -0,0 +1,186 @@
+package store
+
+import (
+	"database/sql"
+	"errors"
+	"fmt"
+	"strings"
+
+	"github.com/google/uuid"
+)
+
+// CreateSharedSecret inserts a new shared-secret row after validating its
+// scope/app_id pairing and non-empty name. The caller is responsible for
+// encrypting Value when Encrypted is set (mirroring workload_env) — the
+// store treats Value as opaque bytes.
+func (s *Store) CreateSharedSecret(sec SharedSecret) (SharedSecret, error) {
+	if err := validateSharedSecret(&sec); err != nil {
+		return SharedSecret{}, err
+	}
+	now := Now()
+	if sec.ID == "" {
+		sec.ID = uuid.New().String()
+	}
+	sec.CreatedAt = now
+	sec.UpdatedAt = now
+	_, err := s.db.Exec(
+		`INSERT INTO shared_secrets
+		   (id, name, value, encrypted, scope, app_id, description, enabled, created_at, updated_at)
+		 VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)`,
+		sec.ID, sec.Name, sec.Value, BoolToInt(sec.Encrypted), sec.Scope, sec.AppID,
+		sec.Description, BoolToInt(sec.Enabled), sec.CreatedAt, sec.UpdatedAt,
+	)
+	if err != nil {
+		return SharedSecret{}, fmt.Errorf("insert shared secret: %w", translateSQLError(err))
+	}
+	return sec, nil
+}
+
+// ListSharedSecrets returns every shared secret, ordered by scope then
+// name for stable UI rendering (globals first).
+func (s *Store) ListSharedSecrets() ([]SharedSecret, error) {
+	return s.querySharedSecrets(
+		`SELECT id, name, value, encrypted, scope, app_id, description, enabled, created_at, updated_at
+		 FROM shared_secrets
+		 ORDER BY CASE scope WHEN 'global' THEN 0 ELSE 1 END, app_id, name`,
+	)
+}
+
+// GetSharedSecret fetches one shared secret by id or returns ErrNotFound.
+func (s *Store) GetSharedSecret(id string) (SharedSecret, error) {
+	row := s.db.QueryRow(
+		`SELECT id, name, value, encrypted, scope, app_id, description, enabled, created_at, updated_at
+		 FROM shared_secrets WHERE id = ?`, id,
+	)
+	sec, err := scanSharedSecretRow(row)
+	if errors.Is(err, sql.ErrNoRows) {
+		return SharedSecret{}, fmt.Errorf("shared secret %s: %w", id, ErrNotFound)
+	}
+	if err != nil {
+		return SharedSecret{}, fmt.Errorf("query shared secret: %w", err)
+	}
+	return sec, nil
+}
+
+// UpdateSharedSecret overwrites the editable columns of a shared-secret
+// row. id is immutable; name/value/encrypted/scope/app_id/description/
+// enabled are overwritten wholesale (the API layer is responsible for
+// merging partial PATCH input onto the existing row first).
+func (s *Store) UpdateSharedSecret(sec SharedSecret) (SharedSecret, error) {
+	if sec.ID == "" {
+		return SharedSecret{}, fmt.Errorf("shared secret: id is required for update")
+	}
+	if err := validateSharedSecret(&sec); err != nil {
+		return SharedSecret{}, err
+	}
+	sec.UpdatedAt = Now()
+	res, err := s.db.Exec(
+		`UPDATE shared_secrets
+		    SET name = ?, value = ?, encrypted = ?, scope = ?, app_id = ?,
+		        description = ?, enabled = ?, updated_at = ?
+		  WHERE id = ?`,
+		sec.Name, sec.Value, BoolToInt(sec.Encrypted), sec.Scope, sec.AppID,
+		sec.Description, BoolToInt(sec.Enabled), sec.UpdatedAt, sec.ID,
+	)
+	if err != nil {
+		return SharedSecret{}, fmt.Errorf("update shared secret: %w", translateSQLError(err))
+	}
+	n, _ := res.RowsAffected()
+	if n == 0 {
+		return SharedSecret{}, fmt.Errorf("shared secret %s: %w", sec.ID, ErrNotFound)
+	}
+	return s.GetSharedSecret(sec.ID)
+}
+
+// DeleteSharedSecret removes a shared secret by id, returning ErrNotFound
+// when no row matched.
+func (s *Store) DeleteSharedSecret(id string) error {
+	res, err := s.db.Exec(`DELETE FROM shared_secrets WHERE id = ?`, id)
+	if err != nil {
+		return fmt.Errorf("delete shared secret: %w", err)
+	}
+	n, _ := res.RowsAffected()
+	if n == 0 {
+		return fmt.Errorf("shared secret %s: %w", id, ErrNotFound)
+	}
+	return nil
+}
+
+// ListApplicableSharedSecrets returns ENABLED secrets that apply to a
+// workload in the given app: all global secrets plus the app's own.
+// Ordered global-first so callers can overlay app on top of global.
+func (s *Store) ListApplicableSharedSecrets(appID string) ([]SharedSecret, error) {
+	return s.querySharedSecrets(
+		`SELECT id, name, value, encrypted, scope, app_id, description, enabled, created_at, updated_at
+		 FROM shared_secrets
+		 WHERE enabled = 1 AND (scope = 'global' OR (scope = 'app' AND app_id = ?))
+		 ORDER BY CASE scope WHEN 'global' THEN 0 ELSE 1 END, name`,
+		appID,
+	)
+}
+
+func (s *Store) querySharedSecrets(query string, args ...any) ([]SharedSecret, error) {
+	rows, err := s.db.Query(query, args...)
+	if err != nil {
+		return nil, fmt.Errorf("query shared secrets: %w", err)
+	}
+	defer rows.Close()
+	out := []SharedSecret{}
+	for rows.Next() {
+		sec, err := scanSharedSecretRows(rows)
+		if err != nil {
+			return nil, err
+		}
+		out = append(out, sec)
+	}
+	return out, rows.Err()
+}
+
+func scanSharedSecretRows(rows *sql.Rows) (SharedSecret, error) {
+	var sec SharedSecret
+	var enc, enabled int
+	if err := rows.Scan(
+		&sec.ID, &sec.Name, &sec.Value, &enc, &sec.Scope, &sec.AppID,
+		&sec.Description, &enabled, &sec.CreatedAt, &sec.UpdatedAt,
+	); err != nil {
+		return SharedSecret{}, fmt.Errorf("scan shared secret: %w", err)
+	}
+	sec.Encrypted = enc != 0
+	sec.Enabled = enabled != 0
+	return sec, nil
+}
+
+func scanSharedSecretRow(row *sql.Row) (SharedSecret, error) {
+	var sec SharedSecret
+	var enc, enabled int
+	if err := row.Scan(
+		&sec.ID, &sec.Name, &sec.Value, &enc, &sec.Scope, &sec.AppID,
+		&sec.Description, &enabled, &sec.CreatedAt, &sec.UpdatedAt,
+	); err != nil {
+		return SharedSecret{}, err
+	}
+	sec.Encrypted = enc != 0
+	sec.Enabled = enabled != 0
+	return sec, nil
+}
+
+// validateSharedSecret enforces the per-row invariants: a non-empty name,
+// a valid scope, and a coherent scope/app_id pairing. When scope==app an
+// app_id is required; when scope==global the app_id is forced blank so the
+// unique index (scope, app_id, name) stays consistent for globals.
+func validateSharedSecret(sec *SharedSecret) error {
+	if strings.TrimSpace(sec.Name) == "" {
+		return fmt.Errorf("shared secret: name is required")
+	}
+	switch sec.Scope {
+	case SharedSecretScopeGlobal:
+		sec.AppID = ""
+	case SharedSecretScopeApp:
+		if strings.TrimSpace(sec.AppID) == "" {
+			return fmt.Errorf("shared secret: app_id is required when scope is app")
+		}
+	default:
+		return fmt.Errorf("shared secret: invalid scope %q", sec.Scope)
+	}
+	return nil
+}
@@ -0,0 +1,218 @@
+package store
+
+import (
+	"errors"
+	"strings"
+	"testing"
+)
+
+func TestCreateSharedSecret_Validates(t *testing.T) {
+	s := newTestStore(t)
+	cases := []struct {
+		name    string
+		in      SharedSecret
+		wantErr string
+	}{
+		{
+			name:    "missing name",
+			in:      SharedSecret{Scope: SharedSecretScopeGlobal},
+			wantErr: "name is required",
+		},
+		{
+			name:    "invalid scope",
+			in:      SharedSecret{Name: "FOO", Scope: "team"},
+			wantErr: "invalid scope",
+		},
+		{
+			name:    "app scope without app_id",
+			in:      SharedSecret{Name: "FOO", Scope: SharedSecretScopeApp},
+			wantErr: "app_id is required",
+		},
+	}
+	for _, c := range cases {
+		t.Run(c.name, func(t *testing.T) {
+			_, err := s.CreateSharedSecret(c.in)
+			if err == nil {
+				t.Fatalf("expected error containing %q, got nil", c.wantErr)
+			}
+			if !strings.Contains(err.Error(), c.wantErr) {
+				t.Fatalf("error mismatch: got %q want substring %q", err.Error(), c.wantErr)
+			}
+		})
+	}
+}
+
+func TestCreateSharedSecret_GlobalForcesBlankAppID(t *testing.T) {
+	s := newTestStore(t)
+	got, err := s.CreateSharedSecret(SharedSecret{
+		Name: "GLOBAL_KEY", Value: "v", Scope: SharedSecretScopeGlobal,
+		AppID: "should-be-cleared", Enabled: true,
+	})
+	if err != nil {
+		t.Fatalf("create: %v", err)
+	}
+	if got.AppID != "" {
+		t.Errorf("global secret AppID = %q, want empty", got.AppID)
+	}
+	if got.ID == "" {
+		t.Error("id should be set")
+	}
+}
+
+func TestCreateAndGetSharedSecret(t *testing.T) {
+	s := newTestStore(t)
+	created, err := s.CreateSharedSecret(SharedSecret{
+		Name: "API_KEY", Value: "ciphertext", Encrypted: true,
+		Scope: SharedSecretScopeApp, AppID: "app1", Description: "d", Enabled: true,
+	})
+	if err != nil {
+		t.Fatalf("create: %v", err)
+	}
+	got, err := s.GetSharedSecret(created.ID)
+	if err != nil {
+		t.Fatalf("get: %v", err)
+	}
+	if got.Name != "API_KEY" || got.Value != "ciphertext" || !got.Encrypted {
+		t.Errorf("round-trip mismatch: %+v", got)
+	}
+	if got.Scope != SharedSecretScopeApp || got.AppID != "app1" {
+		t.Errorf("scope/app mismatch: %+v", got)
+	}
+	if !got.Enabled {
+		t.Error("enabled lost on round-trip")
+	}
+}
+
+func TestGetSharedSecret_NotFound(t *testing.T) {
+	s := newTestStore(t)
+	if _, err := s.GetSharedSecret("nope"); !errors.Is(err, ErrNotFound) {
+		t.Fatalf("expected ErrNotFound, got %v", err)
+	}
+}
+
+func TestUpdateSharedSecret(t *testing.T) {
+	s := newTestStore(t)
+	created, _ := s.CreateSharedSecret(SharedSecret{
+		Name: "K", Value: "v1", Scope: SharedSecretScopeGlobal, Enabled: true,
+	})
+	created.Value = "v2"
+	created.Description = "updated"
+	created.Enabled = false
+	got, err := s.UpdateSharedSecret(created)
+	if err != nil {
+		t.Fatalf("update: %v", err)
+	}
+	if got.Value != "v2" {
+		t.Errorf("value not updated: %q", got.Value)
+	}
+	if got.Description != "updated" {
+		t.Errorf("description not updated: %q", got.Description)
+	}
+	if got.Enabled {
+		t.Error("enabled=false not applied")
+	}
+}
+
+func TestUpdateSharedSecret_NotFound(t *testing.T) {
+	s := newTestStore(t)
+	_, err := s.UpdateSharedSecret(SharedSecret{
+		ID: "missing", Name: "K", Scope: SharedSecretScopeGlobal,
+	})
+	if !errors.Is(err, ErrNotFound) {
+		t.Fatalf("expected ErrNotFound updating missing secret, got %v", err)
+	}
+}
+
+func TestDeleteSharedSecret(t *testing.T) {
+	s := newTestStore(t)
+	created, _ := s.CreateSharedSecret(SharedSecret{
+		Name: "K", Value: "v", Scope: SharedSecretScopeGlobal, Enabled: true,
+	})
+	if err := s.DeleteSharedSecret(created.ID); err != nil {
+		t.Fatalf("delete: %v", err)
+	}
+	if _, err := s.GetSharedSecret(created.ID); !errors.Is(err, ErrNotFound) {
+		t.Fatalf("expected ErrNotFound after delete, got %v", err)
+	}
+	if err := s.DeleteSharedSecret(created.ID); !errors.Is(err, ErrNotFound) {
+		t.Fatalf("expected ErrNotFound deleting twice, got %v", err)
+	}
+}
+
+func TestSharedSecret_UniquePerScopeAppName(t *testing.T) {
+	s := newTestStore(t)
+	if _, err := s.CreateSharedSecret(SharedSecret{
+		Name: "DUP", Value: "a", Scope: SharedSecretScopeGlobal, Enabled: true,
+	}); err != nil {
+		t.Fatalf("first create: %v", err)
+	}
+	// Same scope+name collides on the unique index.
+	if _, err := s.CreateSharedSecret(SharedSecret{
+		Name: "DUP", Value: "b", Scope: SharedSecretScopeGlobal, Enabled: true,
+	}); err == nil {
+		t.Fatal("expected unique-index violation for duplicate global key")
+	}
+	// Same name under an app scope is a distinct row.
+	if _, err := s.CreateSharedSecret(SharedSecret{
+		Name: "DUP", Value: "c", Scope: SharedSecretScopeApp, AppID: "app1", Enabled: true,
+	}); err != nil {
+		t.Fatalf("app-scoped same name should be allowed: %v", err)
+	}
+}
+
+func TestListApplicableSharedSecrets(t *testing.T) {
+	s := newTestStore(t)
+	// Two globals (one disabled), one app1 secret, one app2 secret.
+	mustCreate(t, s, SharedSecret{Name: "G_ONE", Value: "g1", Scope: SharedSecretScopeGlobal, Enabled: true})
+	mustCreate(t, s, SharedSecret{Name: "G_OFF", Value: "off", Scope: SharedSecretScopeGlobal, Enabled: false})
+	mustCreate(t, s, SharedSecret{Name: "A_ONE", Value: "a1", Scope: SharedSecretScopeApp, AppID: "app1", Enabled: true})
+	mustCreate(t, s, SharedSecret{Name: "A_TWO", Value: "a2", Scope: SharedSecretScopeApp, AppID: "app2", Enabled: true})
+
+	got, err := s.ListApplicableSharedSecrets("app1")
+	if err != nil {
+		t.Fatalf("applicable: %v", err)
+	}
+	// app1 sees the enabled global + its own; not the disabled global, not app2's.
+	if len(got) != 2 {
+		t.Fatalf("want 2 applicable secrets, got %d: %+v", len(got), got)
+	}
+	// Global must come first so callers can overlay app on top.
+	if got[0].Name != "G_ONE" {
+		t.Errorf("expected global first, got %q", got[0].Name)
+	}
+	if got[1].Name != "A_ONE" {
+		t.Errorf("expected app1 secret second, got %q", got[1].Name)
+	}
+	for _, sec := range got {
+		if sec.AppID == "app2" {
+			t.Errorf("app1 must not see app2's secret: %+v", sec)
+		}
+		if !sec.Enabled {
+			t.Errorf("disabled secret leaked into applicable set: %+v", sec)
+		}
+	}
+}
+
+func TestListApplicableSharedSecrets_NoAppOnlyGlobals(t *testing.T) {
+	s := newTestStore(t)
+	mustCreate(t, s, SharedSecret{Name: "G", Value: "g", Scope: SharedSecretScopeGlobal, Enabled: true})
+	mustCreate(t, s, SharedSecret{Name: "A", Value: "a", Scope: SharedSecretScopeApp, AppID: "app1", Enabled: true})
+
+	// An ungrouped workload (appID == "") sees only globals.
+	got, err := s.ListApplicableSharedSecrets("")
+	if err != nil {
+		t.Fatalf("applicable: %v", err)
+	}
+	if len(got) != 1 || got[0].Name != "G" {
+		t.Fatalf("ungrouped workload should see only the global, got %+v", got)
+	}
+}
+
+func mustCreate(t *testing.T, s *Store, sec SharedSecret) SharedSecret {
+	t.Helper()
+	out, err := s.CreateSharedSecret(sec)
+	if err != nil {
+		t.Fatalf("create shared secret %q: %v", sec.Name, err)
+	}
+	return out
+}
@@ -55,11 +55,20 @@ func New(dbPath string) (*Store, error) {
 	db.SetMaxOpenConns(1)
 	db.SetConnMaxLifetime(0)

-	// Enable WAL mode and foreign keys for better concurrency and referential integrity.
+	// Enable WAL mode and foreign keys for better concurrency and
+	// referential integrity. `synchronous=NORMAL` pairs with WAL to skip
+	// the per-write fsync — the OS still flushes on checkpoint, durability
+	// is preserved across clean shutdowns, and crashes lose at most the
+	// last few committed transactions (acceptable for a tinyforge box).
+	// cache_size=-20000 = 20 MiB page cache, temp_store=MEMORY keeps
+	// indexer scratch off disk; both are pure perf knobs.
 	pragmas := []string{
 		"PRAGMA journal_mode=WAL",
+		"PRAGMA synchronous=NORMAL",
 		"PRAGMA foreign_keys=ON",
 		"PRAGMA busy_timeout=5000",
+		"PRAGMA cache_size=-20000",
+		"PRAGMA temp_store=MEMORY",
 	}
 	for _, p := range pragmas {
 		if _, err := db.Exec(p); err != nil {
@@ -169,6 +178,12 @@ func (s *Store) runMigrations() error {
 		// Empty string = never fired. Pre-trigger-split DBs land the column
 		// here so the scheduler can read/write it on first boot.
 		`ALTER TABLE triggers ADD COLUMN last_fired_at TEXT NOT NULL DEFAULT ''`,
+		// Per-app deploy/activity timeline: scope each event_log row to the
+		// workload that produced it so the dashboard can query a workload's
+		// deploy history. Empty string = unscoped (the existing non-deploy
+		// loggers don't set it). Additive ADD COLUMN — the loop below
+		// tolerates the "duplicate column" error on fully-migrated DBs.
+		`ALTER TABLE event_log ADD COLUMN workload_id TEXT NOT NULL DEFAULT ''`,
 		// Hard cutover: drop every legacy table. Idempotent — DROP TABLE
 		// IF EXISTS is a no-op once the table is gone. Operators upgrading
 		// from a pre-cutover build will lose any project / stack / static
@@ -269,6 +284,20 @@ func (s *Store) runMigrations() error {
 			updated_at  TEXT NOT NULL DEFAULT (datetime('now')),
 			UNIQUE(workload_id, target)
 		)`,
+		// volume_snapshots: per-workload archives of host-bind data
+		// volumes (tar.gz). Mirrors the backups table shape but scoped to a
+		// workload and self-describing via the manifest column so a restore
+		// can re-resolve each target. ON DELETE CASCADE so deleting an app
+		// drops its snapshot rows (the files are pruned separately).
+		`CREATE TABLE IF NOT EXISTS volume_snapshots (
+			id          TEXT PRIMARY KEY,
+			workload_id TEXT NOT NULL REFERENCES workloads(id) ON DELETE CASCADE,
+			label       TEXT NOT NULL DEFAULT '',
+			filename    TEXT NOT NULL,
+			size_bytes  INTEGER NOT NULL DEFAULT 0,
+			manifest    TEXT NOT NULL DEFAULT '[]',
+			created_at  TEXT NOT NULL DEFAULT (datetime('now'))
+		)`,
 		// triggers: first-class redeploy signal sources. Webhook secrets
 		// move from workload onto the trigger so one webhook URL can fan
 		// out to multiple workloads via workload_trigger_bindings.
@@ -284,6 +313,24 @@ func (s *Store) runMigrations() error {
 			created_at                  TEXT NOT NULL DEFAULT (datetime('now')),
 			updated_at                  TEXT NOT NULL DEFAULT (datetime('now'))
 		)`,
+		// workload_notifications: per-workload notification destinations.
+		// Each row is one route (Slack channel, Discord webhook, generic
+		// receiver, ...). event_types is a comma-separated allow-list —
+		// empty means "all events". When zero rows exist for a workload
+		// the dispatcher falls back to the legacy single notification_url
+		// column on workloads so existing setups keep working unchanged.
+		`CREATE TABLE IF NOT EXISTS workload_notifications (
+			id           TEXT PRIMARY KEY,
+			workload_id  TEXT NOT NULL REFERENCES workloads(id) ON DELETE CASCADE,
+			name         TEXT NOT NULL,
+			url          TEXT NOT NULL,
+			secret       TEXT NOT NULL DEFAULT '',
+			event_types  TEXT NOT NULL DEFAULT '',
+			enabled      INTEGER NOT NULL DEFAULT 1,
+			sort_order   INTEGER NOT NULL DEFAULT 0,
+			created_at   TEXT NOT NULL DEFAULT (datetime('now')),
+			updated_at   TEXT NOT NULL DEFAULT (datetime('now'))
+		)`,
 		// workload_trigger_bindings: many-to-many between workloads and
 		// triggers. binding_config is the per-binding override applied on
 		// top of trigger.config (top-level JSON merge, binding wins).
@@ -375,6 +422,43 @@ func (s *Store) runMigrations() error {
 		)`,
 		`CREATE INDEX IF NOT EXISTS idx_log_scan_rules_workload ON log_scan_rules(workload_id)`,
 		`CREATE INDEX IF NOT EXISTS idx_log_scan_rules_overrides ON log_scan_rules(overrides_id)`,
+		// metric_alert_rules: threshold rules the metric-alert manager
+		// evaluates against recent container stats samples. WorkloadID is
+		// nullable (via "" sentinel) so a global rule applies to every
+		// workload; a non-empty value scopes it to one workload.
+		`CREATE TABLE IF NOT EXISTS metric_alert_rules (
+			id                 INTEGER PRIMARY KEY AUTOINCREMENT,
+			workload_id        TEXT NOT NULL DEFAULT '',
+			name               TEXT NOT NULL DEFAULT '',
+			metric             TEXT NOT NULL,
+			comparator         TEXT NOT NULL,
+			threshold          REAL NOT NULL DEFAULT 0,
+			severity           TEXT NOT NULL DEFAULT 'warn',
+			cooldown_seconds   INTEGER NOT NULL DEFAULT 300,
+			enabled            INTEGER NOT NULL DEFAULT 1,
+			created_at         TEXT NOT NULL DEFAULT (datetime('now')),
+			updated_at         TEXT NOT NULL DEFAULT (datetime('now'))
+		)`,
+		`CREATE INDEX IF NOT EXISTS idx_metric_alert_rules_workload ON metric_alert_rules(workload_id)`,
+		// shared_secrets: env vars shared across workloads by scope. Scope
+		// "global" applies to every workload; "app" applies only to
+		// workloads whose app_id matches. Resolved into a workload's
+		// container env as a low-precedence default (see
+		// internal/workload/plugin/env.go).
+		`CREATE TABLE IF NOT EXISTS shared_secrets (
+			id           TEXT PRIMARY KEY,
+			name         TEXT NOT NULL,
+			value        TEXT NOT NULL DEFAULT '',
+			encrypted    INTEGER NOT NULL DEFAULT 1,
+			scope        TEXT NOT NULL,
+			app_id       TEXT NOT NULL DEFAULT '',
+			description  TEXT NOT NULL DEFAULT '',
+			enabled      INTEGER NOT NULL DEFAULT 1,
+			created_at   TEXT NOT NULL DEFAULT (datetime('now')),
+			updated_at   TEXT NOT NULL DEFAULT (datetime('now'))
+		)`,
+		`CREATE UNIQUE INDEX IF NOT EXISTS idx_shared_secrets_scope_name ON shared_secrets(scope, app_id, name)`,
+		`CREATE INDEX IF NOT EXISTS idx_shared_secrets_app ON shared_secrets(app_id)`,
 	}
 	for _, t := range observabilityTables {
 		if _, err := s.db.Exec(t); err != nil {
@@ -405,6 +489,7 @@ func (s *Store) runMigrations() error {
 		`CREATE INDEX IF NOT EXISTS idx_event_log_severity ON event_log(severity)`,
 		`CREATE INDEX IF NOT EXISTS idx_event_log_source ON event_log(source)`,
 		`CREATE INDEX IF NOT EXISTS idx_event_log_created_at ON event_log(created_at)`,
+		`CREATE INDEX IF NOT EXISTS idx_event_log_workload ON event_log(workload_id, created_at)`,
 		`CREATE INDEX IF NOT EXISTS idx_dns_records_consumer ON dns_records(consumer_type, consumer_id)`,
 		`CREATE INDEX IF NOT EXISTS idx_container_stats_owner_ts ON container_stats_samples(owner_type, owner_id, ts)`,
 		`CREATE INDEX IF NOT EXISTS idx_container_stats_container_ts ON container_stats_samples(container_id, ts)`,
@@ -422,11 +507,13 @@ func (s *Store) runMigrations() error {
 		`CREATE INDEX IF NOT EXISTS idx_containers_stage_id     ON containers(stage_id) WHERE stage_id != ''`,
 		`CREATE INDEX IF NOT EXISTS idx_workload_env_workload     ON workload_env(workload_id)`,
 		`CREATE INDEX IF NOT EXISTS idx_workload_volumes_workload ON workload_volumes(workload_id)`,
+		`CREATE INDEX IF NOT EXISTS idx_volume_snapshots_workload ON volume_snapshots(workload_id)`,
 		// Trigger-split indexes.
 		`CREATE INDEX IF NOT EXISTS idx_triggers_kind             ON triggers(kind)`,
 		`CREATE UNIQUE INDEX IF NOT EXISTS idx_triggers_webhook_secret ON triggers(webhook_secret) WHERE webhook_secret != ''`,
 		`CREATE INDEX IF NOT EXISTS idx_bindings_workload         ON workload_trigger_bindings(workload_id)`,
 		`CREATE INDEX IF NOT EXISTS idx_bindings_trigger          ON workload_trigger_bindings(trigger_id)`,
+		`CREATE INDEX IF NOT EXISTS idx_workload_notifs_workload  ON workload_notifications(workload_id)`,
 	}
 	for _, idx := range indexes {
 		if _, err := s.db.Exec(idx); err != nil {
@@ -434,13 +521,215 @@ func (s *Store) runMigrations() error {
 		}
 	}

-	if err := s.backfillTriggersFromWorkloads(); err != nil {
+	// schema_versions table gates one-shot data migrations like the
+	// trigger backfill below. Without this, the backfill scan ran on
+	// every boot even on fully-migrated DBs — wasted I/O and (more
+	// importantly) made it impossible to tell whether a "no rows
+	// processed" was a clean state or a missed-migration bug.
+	if _, err := s.db.Exec(`CREATE TABLE IF NOT EXISTS schema_versions (
+		version    INTEGER PRIMARY KEY,
+		applied_at TEXT NOT NULL DEFAULT (datetime('now'))
+	)`); err != nil {
+		return fmt.Errorf("create schema_versions: %w", err)
+	}
+
+	if err := s.runOnce(1, "trigger backfill", s.backfillTriggersFromWorkloads); err != nil {
+		// Backfill failure is non-fatal — we log and let the operator
+		// retry. The version is only recorded on success.
 		slog.Warn("trigger backfill", "error", err)
 	}

 	return nil
 }

+// runOnce executes fn at most one time per database lifetime, recording
+// success in schema_versions. Useful for data migrations whose source
+// table eventually disappears (so re-running becomes pointless or
+// dangerous).
+func (s *Store) runOnce(version int, label string, fn func() error) error {
+	var applied int
+	if err := s.db.QueryRow(`SELECT COUNT(*) FROM schema_versions WHERE version = ?`, version).Scan(&applied); err != nil {
+		return fmt.Errorf("check %s: %w", label, err)
+	}
+	if applied > 0 {
+		return nil
+	}
+	if err := fn(); err != nil {
+		return err
+	}
+	if _, err := s.db.Exec(`INSERT INTO schema_versions (version) VALUES (?)`, version); err != nil {
+		return fmt.Errorf("mark %s applied: %w", label, err)
+	}
+	slog.Info("schema migration applied", "version", version, "label", label)
+	return nil
+}
+
+// RunOnce is the public counterpart of runOnce, exposed so cmd/server can
+// gate post-store-open migrations (e.g. crypto re-encryption that needs
+// the ENCRYPTION_KEY which Store does not own) through the same
+// schema_versions ledger.
+func (s *Store) RunOnce(version int, label string, fn func() error) error {
+	return s.runOnce(version, label, fn)
+}
+
+// EnvelopeMigrator describes the contract a crypto package implements to
+// rewrite legacy unprefixed-hex ciphertext as versioned envelope values.
+// hasEnvelope reports whether a value already carries the new prefix.
+// decrypt returns plaintext for either form; encrypt always produces the
+// new envelope form. By accepting closures the store stays free of any
+// import on internal/crypto, mirroring the rest of the package layout.
+type EnvelopeMigrator struct {
+	HasEnvelope func(value string) bool
+	Decrypt     func(ciphertext string) (string, error)
+	Encrypt     func(plaintext string) (string, error)
+}
+
+// MigrateSecretsToEnvelope walks every column known to carry an encrypted
+// secret and rewrites legacy unprefixed-hex values into the new
+// envelope form using the current encryption key.
+//
+// Behaviour, per-row:
+//   - empty value → skip (no secret stored)
+//   - already-envelope value → skip (already migrated)
+//   - decrypt fails → skip (value is either plaintext from a v0 boot
+//     OR ciphertext from a rotated key; either way we cannot safely
+//     re-encrypt and leaving it alone preserves the existing read
+//     semantics)
+//   - decrypt succeeds → encrypt to envelope form + UPDATE
+//
+// The whole sweep runs in a single transaction so a power-loss
+// mid-migration leaves the DB in either the pre- or post-migration
+// state, never half. Idempotent via schema_versions version 2 — the
+// next boot is a no-op.
+//
+// Columns covered:
+//   - settings.npm_password
+//   - settings.cloudflare_api_token
+//   - auth_settings.oidc_client_secret
+//   - registries.token
+//   - workload_env.value WHERE encrypted=1
+func (s *Store) MigrateSecretsToEnvelope(m EnvelopeMigrator) error {
+	return s.runOnce(2, "secrets envelope migration", func() error {
+		tx, err := s.db.Begin()
+		if err != nil {
+			return fmt.Errorf("begin: %w", err)
+		}
+		defer func() { _ = tx.Rollback() }()
+
+		// Single-row tables (settings, auth_settings) — read-update inline.
+		singleRowColumns := []struct {
+			table, column string
+		}{
+			{"settings", "npm_password"},
+			{"settings", "cloudflare_api_token"},
+			{"auth_settings", "oidc_client_secret"},
+		}
+		for _, c := range singleRowColumns {
+			var v string
+			err := tx.QueryRow(
+				fmt.Sprintf(`SELECT %s FROM %s LIMIT 1`, c.column, c.table),
+			).Scan(&v)
+			if err != nil {
+				if errors.Is(err, sql.ErrNoRows) {
+					continue
+				}
+				// auth_settings may not exist on a brand-new DB until
+				// the OIDC code touches it; treat as nothing-to-migrate.
+				slog.Debug("envelope migration: column read skipped",
+					"table", c.table, "column", c.column, "error", err)
+				continue
+			}
+			migrated, ok := tryMigrate(m, v)
+			if !ok {
+				continue
+			}
+			if _, err := tx.Exec(
+				fmt.Sprintf(`UPDATE %s SET %s = ?`, c.table, c.column),
+				migrated,
+			); err != nil {
+				return fmt.Errorf("update %s.%s: %w", c.table, c.column, err)
+			}
+		}
+
+		// Multi-row: registries.token
+		if err := migrateRowColumn(tx, m,
+			`SELECT id, token FROM registries WHERE token != ''`,
+			`UPDATE registries SET token = ? WHERE id = ?`,
+		); err != nil {
+			return fmt.Errorf("registries.token: %w", err)
+		}
+
+		// Multi-row: workload_env.value WHERE encrypted=1
+		if err := migrateRowColumn(tx, m,
+			`SELECT id, value FROM workload_env WHERE encrypted = 1 AND value != ''`,
+			`UPDATE workload_env SET value = ? WHERE id = ?`,
+		); err != nil {
+			return fmt.Errorf("workload_env.value: %w", err)
+		}
+
+		if err := tx.Commit(); err != nil {
+			return fmt.Errorf("commit: %w", err)
+		}
+		return nil
+	})
+}
+
+// migrateRowColumn applies the envelope rewrite to every (id, value)
+// pair returned by selectQ. updateQ takes (newValue, id) as parameters.
+// Each row is its own attempt; one row failing migration (decrypt fail)
+// does not abort the others.
+func migrateRowColumn(tx *sql.Tx, m EnvelopeMigrator, selectQ, updateQ string) error {
+	rows, err := tx.Query(selectQ)
+	if err != nil {
+		return err
+	}
+	defer rows.Close()
+	type pending struct{ id, newValue string }
+	var updates []pending
+	for rows.Next() {
+		var id, value string
+		if err := rows.Scan(&id, &value); err != nil {
+			return err
+		}
+		newValue, ok := tryMigrate(m, value)
+		if !ok {
+			continue
+		}
+		updates = append(updates, pending{id, newValue})
+	}
+	if err := rows.Err(); err != nil {
+		return err
+	}
+	for _, u := range updates {
+		if _, err := tx.Exec(updateQ, u.newValue, u.id); err != nil {
+			return err
+		}
+	}
+	return nil
+}
+
+// tryMigrate returns the envelope-form ciphertext + true when the input
+// is a legacy unprefixed value that decrypts successfully with the
+// current key. Returns ("", false) for anything else: empty, already
+// envelope, plaintext, or decrypt-failed (rotated-key case).
+func tryMigrate(m EnvelopeMigrator, v string) (string, bool) {
+	if v == "" {
+		return "", false
+	}
+	if m.HasEnvelope(v) {
+		return "", false
+	}
+	plaintext, err := m.Decrypt(v)
+	if err != nil {
+		return "", false
+	}
+	enc, err := m.Encrypt(plaintext)
+	if err != nil {
+		return "", false
+	}
+	return enc, true
+}
+
 // backfillTriggersFromWorkloads converts embedded trigger config on
 // workload rows into standalone trigger + binding rows. Runs once per
 // boot and is idempotent — only workloads with non-empty trigger_kind
@@ -0,0 +1,146 @@
+package store
+
+import (
+	"database/sql"
+	"errors"
+	"fmt"
+
+	"github.com/google/uuid"
+)
+
+// CreateVolumeSnapshot inserts a snapshot metadata record. ID is generated
+// when empty; CreatedAt is stamped server-side.
+func (s *Store) CreateVolumeSnapshot(v VolumeSnapshot) (VolumeSnapshot, error) {
+	if v.WorkloadID == "" || v.Filename == "" {
+		return VolumeSnapshot{}, fmt.Errorf("volume_snapshot: workload_id and filename are required")
+	}
+	if v.ID == "" {
+		v.ID = uuid.New().String()
+	}
+	if v.Manifest == "" {
+		v.Manifest = "[]"
+	}
+	v.CreatedAt = Now()
+
+	if _, err := s.db.Exec(
+		`INSERT INTO volume_snapshots (id, workload_id, label, filename, size_bytes, manifest, created_at)
+		 VALUES (?, ?, ?, ?, ?, ?, ?)`,
+		v.ID, v.WorkloadID, v.Label, v.Filename, v.SizeBytes, v.Manifest, v.CreatedAt,
+	); err != nil {
+		return VolumeSnapshot{}, fmt.Errorf("insert volume snapshot: %w", err)
+	}
+	return v, nil
+}
+
+// GetVolumeSnapshot returns one snapshot by ID.
+func (s *Store) GetVolumeSnapshot(id string) (VolumeSnapshot, error) {
+	var v VolumeSnapshot
+	err := s.db.QueryRow(
+		`SELECT id, workload_id, label, filename, size_bytes, manifest, created_at
+		 FROM volume_snapshots WHERE id = ?`, id,
+	).Scan(&v.ID, &v.WorkloadID, &v.Label, &v.Filename, &v.SizeBytes, &v.Manifest, &v.CreatedAt)
+	if errors.Is(err, sql.ErrNoRows) {
+		return VolumeSnapshot{}, fmt.Errorf("volume snapshot %s: %w", id, ErrNotFound)
+	}
+	if err != nil {
+		return VolumeSnapshot{}, fmt.Errorf("query volume snapshot: %w", err)
+	}
+	return v, nil
+}
+
+// ListVolumeSnapshots returns a workload's snapshots, newest first.
+func (s *Store) ListVolumeSnapshots(workloadID string) ([]VolumeSnapshot, error) {
+	rows, err := s.db.Query(
+		`SELECT id, workload_id, label, filename, size_bytes, manifest, created_at
+		 FROM volume_snapshots WHERE workload_id = ? ORDER BY created_at DESC`, workloadID,
+	)
+	if err != nil {
+		return nil, fmt.Errorf("query volume snapshots: %w", err)
+	}
+	defer rows.Close()
+
+	out := []VolumeSnapshot{}
+	for rows.Next() {
+		v, err := scanVolumeSnapshot(rows)
+		if err != nil {
+			return nil, err
+		}
+		out = append(out, v)
+	}
+	return out, rows.Err()
+}
+
+// DeleteVolumeSnapshot removes one snapshot row by ID.
+func (s *Store) DeleteVolumeSnapshot(id string) error {
+	result, err := s.db.Exec(`DELETE FROM volume_snapshots WHERE id = ?`, id)
+	if err != nil {
+		return fmt.Errorf("delete volume snapshot: %w", err)
+	}
+	if n, _ := result.RowsAffected(); n == 0 {
+		return fmt.Errorf("volume snapshot %s: %w", id, ErrNotFound)
+	}
+	return nil
+}
+
+// CountVolumeSnapshots returns how many snapshots a workload has.
+func (s *Store) CountVolumeSnapshots(workloadID string) (int, error) {
+	var n int
+	if err := s.db.QueryRow(
+		`SELECT COUNT(*) FROM volume_snapshots WHERE workload_id = ?`, workloadID,
+	).Scan(&n); err != nil {
+		return 0, fmt.Errorf("count volume snapshots: %w", err)
+	}
+	return n, nil
+}
+
+// GetOldestVolumeSnapshots returns the N oldest snapshots for a workload, for
+// retention pruning.
+func (s *Store) GetOldestVolumeSnapshots(workloadID string, limit int) ([]VolumeSnapshot, error) {
+	rows, err := s.db.Query(
+		`SELECT id, workload_id, label, filename, size_bytes, manifest, created_at
+		 FROM volume_snapshots WHERE workload_id = ? ORDER BY created_at ASC LIMIT ?`, workloadID, limit,
+	)
+	if err != nil {
+		return nil, fmt.Errorf("query oldest volume snapshots: %w", err)
+	}
+	defer rows.Close()
+
+	out := []VolumeSnapshot{}
+	for rows.Next() {
+		v, err := scanVolumeSnapshot(rows)
+		if err != nil {
+			return nil, err
+		}
+		out = append(out, v)
+	}
+	return out, rows.Err()
+}
+
+// AllVolumeSnapshotFilenames returns every snapshot archive filename across all
+// workloads, for orphan-file reconciliation at startup.
+func (s *Store) AllVolumeSnapshotFilenames() ([]string, error) {
+	rows, err := s.db.Query(`SELECT filename FROM volume_snapshots`)
+	if err != nil {
+		return nil, fmt.Errorf("query snapshot filenames: %w", err)
+	}
+	defer rows.Close()
+
+	out := []string{}
+	for rows.Next() {
+		var name string
+		if err := rows.Scan(&name); err != nil {
+			return nil, fmt.Errorf("scan snapshot filename: %w", err)
+		}
+		out = append(out, name)
+	}
+	return out, rows.Err()
+}
+
+func scanVolumeSnapshot(rows *sql.Rows) (VolumeSnapshot, error) {
+	var v VolumeSnapshot
+	if err := rows.Scan(&v.ID, &v.WorkloadID, &v.Label, &v.Filename,
+		&v.SizeBytes, &v.Manifest, &v.CreatedAt); err != nil {
+		return VolumeSnapshot{}, fmt.Errorf("scan volume snapshot: %w", err)
+	}
+	return v, nil
+}
@@ -0,0 +1,159 @@
+package store
+
+import (
+	"database/sql"
+	"errors"
+	"fmt"
+	"strings"
+
+	"github.com/google/uuid"
+)
+
+const workloadNotificationColumns = `id, workload_id, name, url, secret,
+	event_types, enabled, sort_order, created_at, updated_at`
+
+func scanWorkloadNotification(scanner interface{ Scan(...any) error }) (WorkloadNotification, error) {
+	var n WorkloadNotification
+	var enabled int
+	err := scanner.Scan(
+		&n.ID, &n.WorkloadID, &n.Name, &n.URL, &n.Secret,
+		&n.EventTypes, &enabled, &n.SortOrder, &n.CreatedAt, &n.UpdatedAt,
+	)
+	n.Enabled = enabled != 0
+	return n, err
+}
+
+// CreateWorkloadNotification inserts a notification route. Returns the
+// populated row (with assigned id + timestamps) so callers don't need to
+// follow up with a Get.
+func (s *Store) CreateWorkloadNotification(n WorkloadNotification) (WorkloadNotification, error) {
+	if n.WorkloadID == "" {
+		return WorkloadNotification{}, fmt.Errorf("workload_id is required")
+	}
+	if n.URL == "" {
+		return WorkloadNotification{}, fmt.Errorf("url is required")
+	}
+	if n.ID == "" {
+		n.ID = uuid.New().String()
+	}
+	n.CreatedAt = Now()
+	n.UpdatedAt = n.CreatedAt
+
+	_, err := s.db.Exec(
+		`INSERT INTO workload_notifications (`+workloadNotificationColumns+`)
+		 VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)`,
+		n.ID, n.WorkloadID, n.Name, n.URL, n.Secret,
+		n.EventTypes, BoolToInt(n.Enabled), n.SortOrder, n.CreatedAt, n.UpdatedAt,
+	)
+	if err != nil {
+		return WorkloadNotification{}, fmt.Errorf("insert workload_notification: %w", err)
+	}
+	return n, nil
+}
+
+// ListWorkloadNotifications returns every notification row for a
+// workload ordered by (sort_order, created_at) so the UI stays stable
+// across reorderings.
+func (s *Store) ListWorkloadNotifications(workloadID string) ([]WorkloadNotification, error) {
+	rows, err := s.db.Query(
+		`SELECT `+workloadNotificationColumns+`
+		 FROM workload_notifications
+		 WHERE workload_id = ?
+		 ORDER BY sort_order, created_at`,
+		workloadID,
+	)
+	if err != nil {
+		return nil, fmt.Errorf("list workload_notifications: %w", err)
+	}
+	defer rows.Close()
+
+	out := []WorkloadNotification{}
+	for rows.Next() {
+		n, err := scanWorkloadNotification(rows)
+		if err != nil {
+			return nil, fmt.Errorf("scan workload_notification: %w", err)
+		}
+		out = append(out, n)
+	}
+	return out, rows.Err()
+}
+
+// GetWorkloadNotification fetches one notification row by id. Returns
+// ErrNotFound when the row does not exist so callers can return 404
+// cleanly.
+func (s *Store) GetWorkloadNotification(id string) (WorkloadNotification, error) {
+	n, err := scanWorkloadNotification(s.db.QueryRow(
+		`SELECT `+workloadNotificationColumns+`
+		 FROM workload_notifications WHERE id = ?`, id,
+	))
+	if errors.Is(err, sql.ErrNoRows) {
+		return WorkloadNotification{}, fmt.Errorf("workload_notification %s: %w", id, ErrNotFound)
+	}
+	if err != nil {
+		return WorkloadNotification{}, fmt.Errorf("query workload_notification: %w", err)
+	}
+	return n, nil
+}
+
+// UpdateWorkloadNotification rewrites an existing row. WorkloadID is
+// immutable — re-anchoring a route to a different workload would invite
+// silent reassignments after a paste-bug in the UI; recreate instead.
+func (s *Store) UpdateWorkloadNotification(n WorkloadNotification) error {
+	if n.ID == "" {
+		return fmt.Errorf("id is required")
+	}
+	if n.URL == "" {
+		return fmt.Errorf("url is required")
+	}
+	n.UpdatedAt = Now()
+	res, err := s.db.Exec(
+		`UPDATE workload_notifications
+		 SET name = ?, url = ?, secret = ?, event_types = ?,
+		     enabled = ?, sort_order = ?, updated_at = ?
+		 WHERE id = ?`,
+		n.Name, n.URL, n.Secret, n.EventTypes,
+		BoolToInt(n.Enabled), n.SortOrder, n.UpdatedAt, n.ID,
+	)
+	if err != nil {
+		return fmt.Errorf("update workload_notification: %w", err)
+	}
+	rows, _ := res.RowsAffected()
+	if rows == 0 {
+		return fmt.Errorf("workload_notification %s: %w", n.ID, ErrNotFound)
+	}
+	return nil
+}
+
+// DeleteWorkloadNotification drops a single notification row.
+// Idempotent: missing id returns ErrNotFound so the API can map it to
+// 404 cleanly.
+func (s *Store) DeleteWorkloadNotification(id string) error {
+	res, err := s.db.Exec(`DELETE FROM workload_notifications WHERE id = ?`, id)
+	if err != nil {
+		return fmt.Errorf("delete workload_notification: %w", err)
+	}
+	rows, _ := res.RowsAffected()
+	if rows == 0 {
+		return fmt.Errorf("workload_notification %s: %w", id, ErrNotFound)
+	}
+	return nil
+}
+
+// MatchesEventType returns true when the notification row's EventTypes
+// allow-list includes eventType (or is empty, meaning "match all").
+// Helper exported so the notification dispatcher can fan-out filtering
+// inline without duplicating the comma-split parser.
+func (n WorkloadNotification) MatchesEventType(eventType string) bool {
+	if !n.Enabled {
+		return false
+	}
+	if n.EventTypes == "" {
+		return true
+	}
+	for _, et := range strings.Split(n.EventTypes, ",") {
+		if strings.TrimSpace(et) == eventType {
+			return true
+		}
+	}
+	return false
+}
@@ -0,0 +1,170 @@
+package store
+
+import (
+	"errors"
+	"testing"
+)
+
+// seedWorkloadForNotifications creates a minimal workload row so the FK
+// constraint on workload_notifications is satisfied. Returns the new
+// workload's ID for tests to reference.
+func seedWorkloadForNotifications(t *testing.T, s *Store, name string) string {
+	t.Helper()
+	w, err := s.CreateWorkload(Workload{
+		Kind:       string(WorkloadKindProject),
+		Name:       name,
+		SourceKind: "image",
+	})
+	if err != nil {
+		t.Fatalf("seed workload: %v", err)
+	}
+	return w.ID
+}
+
+func TestCreateWorkloadNotification_RoundTrip(t *testing.T) {
+	s := newTestStore(t)
+	wlID := seedWorkloadForNotifications(t, s, "app1")
+
+	created, err := s.CreateWorkloadNotification(WorkloadNotification{
+		WorkloadID: wlID,
+		Name:       "Slack alerts",
+		URL:        "https://hooks.slack.test/x",
+		Secret:     "shh",
+		EventTypes: "deploy_failure,build_failure",
+		Enabled:    true,
+	})
+	if err != nil {
+		t.Fatalf("CreateWorkloadNotification: %v", err)
+	}
+	if created.ID == "" {
+		t.Fatal("expected ID to be assigned")
+	}
+
+	got, err := s.GetWorkloadNotification(created.ID)
+	if err != nil {
+		t.Fatalf("Get: %v", err)
+	}
+	if got.URL != "https://hooks.slack.test/x" || got.Name != "Slack alerts" {
+		t.Errorf("row mismatch: %+v", got)
+	}
+	if !got.Enabled {
+		t.Error("expected Enabled=true")
+	}
+	if got.EventTypes != "deploy_failure,build_failure" {
+		t.Errorf("event_types = %q", got.EventTypes)
+	}
+}
+
+func TestCreateWorkloadNotification_RejectsMissingURL(t *testing.T) {
+	s := newTestStore(t)
+	wlID := seedWorkloadForNotifications(t, s, "app1")
+	_, err := s.CreateWorkloadNotification(WorkloadNotification{
+		WorkloadID: wlID,
+		Name:       "broken",
+		URL:        "",
+	})
+	if err == nil {
+		t.Fatal("expected URL validation error")
+	}
+}
+
+func TestListWorkloadNotifications_SortedByOrder(t *testing.T) {
+	s := newTestStore(t)
+	wlID := seedWorkloadForNotifications(t, s, "app1")
+
+	// Insert out of order; ListWorkloadNotifications should return
+	// them sorted by SortOrder ascending.
+	_, _ = s.CreateWorkloadNotification(WorkloadNotification{
+		WorkloadID: wlID, Name: "C", URL: "https://c.test", SortOrder: 30,
+	})
+	_, _ = s.CreateWorkloadNotification(WorkloadNotification{
+		WorkloadID: wlID, Name: "A", URL: "https://a.test", SortOrder: 10,
+	})
+	_, _ = s.CreateWorkloadNotification(WorkloadNotification{
+		WorkloadID: wlID, Name: "B", URL: "https://b.test", SortOrder: 20,
+	})
+
+	rows, err := s.ListWorkloadNotifications(wlID)
+	if err != nil {
+		t.Fatalf("list: %v", err)
+	}
+	if len(rows) != 3 {
+		t.Fatalf("len = %d, want 3", len(rows))
+	}
+	if rows[0].Name != "A" || rows[1].Name != "B" || rows[2].Name != "C" {
+		t.Errorf("sort order wrong: %q %q %q", rows[0].Name, rows[1].Name, rows[2].Name)
+	}
+}
+
+func TestUpdateWorkloadNotification_PersistsChanges(t *testing.T) {
+	s := newTestStore(t)
+	wlID := seedWorkloadForNotifications(t, s, "app1")
+	n, _ := s.CreateWorkloadNotification(WorkloadNotification{
+		WorkloadID: wlID, Name: "old", URL: "https://old.test", Enabled: true,
+	})
+	n.Name = "new"
+	n.URL = "https://new.test"
+	n.Enabled = false
+	n.EventTypes = "deploy_success"
+	if err := s.UpdateWorkloadNotification(n); err != nil {
+		t.Fatalf("update: %v", err)
+	}
+	got, _ := s.GetWorkloadNotification(n.ID)
+	if got.Name != "new" || got.URL != "https://new.test" || got.Enabled {
+		t.Errorf("update did not persist: %+v", got)
+	}
+}
+
+func TestDeleteWorkloadNotification_ReturnsNotFoundForMissing(t *testing.T) {
+	s := newTestStore(t)
+	err := s.DeleteWorkloadNotification("nope")
+	if !errors.Is(err, ErrNotFound) {
+		t.Errorf("expected ErrNotFound, got %v", err)
+	}
+}
+
+func TestDeleteWorkloadNotification_CascadesFromWorkload(t *testing.T) {
+	s := newTestStore(t)
+	wlID := seedWorkloadForNotifications(t, s, "app1")
+	_, _ = s.CreateWorkloadNotification(WorkloadNotification{
+		WorkloadID: wlID, Name: "x", URL: "https://x.test",
+	})
+	if err := s.DeleteWorkload(wlID); err != nil {
+		t.Fatalf("delete workload: %v", err)
+	}
+	rows, err := s.ListWorkloadNotifications(wlID)
+	if err != nil {
+		t.Fatalf("list after cascade: %v", err)
+	}
+	if len(rows) != 0 {
+		t.Errorf("expected cascade delete to remove rows, got %d", len(rows))
+	}
+}
+
+func TestMatchesEventType_AllowList(t *testing.T) {
+	cases := []struct {
+		eventTypes string
+		probe      string
+		want       bool
+	}{
+		{"", "deploy_success", true},                          // empty = all
+		{"deploy_success,deploy_failure", "deploy_success", true},
+		{"deploy_success,deploy_failure", "build_failure", false},
+		{"build_failure", "build_failure", true},
+		{" deploy_success , build_failure ", "build_failure", true}, // whitespace tolerated
+	}
+	for _, c := range cases {
+		n := WorkloadNotification{Enabled: true, EventTypes: c.eventTypes}
+		got := n.MatchesEventType(c.probe)
+		if got != c.want {
+			t.Errorf("MatchesEventType(%q, %q) = %v, want %v", c.eventTypes, c.probe, got, c.want)
+		}
+	}
+}
+
+func TestMatchesEventType_DisabledNeverMatches(t *testing.T) {
+	n := WorkloadNotification{Enabled: false, EventTypes: ""}
+	if n.MatchesEventType("any") {
+		t.Error("disabled row should never match")
+	}
+}
@@ -173,11 +173,24 @@ func (s *Store) UpdateWorkload(w Workload) error {
 	return nil
 }

-// DeleteWorkload removes a workload row. Cascading deletes for the matching
-// project/stack/site row stay with the kind-specific Delete functions; this
-// only removes the workload entry.
+// DeleteWorkload removes a workload row. Cascading deletes for FK-backed
+// child tables (workload_env, workload_volumes, workload_trigger_bindings)
+// happen via SQLite's ON DELETE CASCADE. The `containers` table doesn't
+// yet have an FK to workloads (planned migration — see ops notes), so we
+// drop its rows explicitly here in the same transaction to prevent zombie
+// container rows from outliving their owning workload.
 func (s *Store) DeleteWorkload(id string) error {
-	result, err := s.db.Exec(`DELETE FROM workloads WHERE id = ?`, id)
+	tx, err := s.db.Begin()
+	if err != nil {
+		return fmt.Errorf("begin: %w", err)
+	}
+	defer func() { _ = tx.Rollback() }()
+
+	// Explicit container cleanup until the FK migration lands.
+	if _, err := tx.Exec(`DELETE FROM containers WHERE workload_id = ?`, id); err != nil {
+		return fmt.Errorf("delete containers: %w", err)
+	}
+	result, err := tx.Exec(`DELETE FROM workloads WHERE id = ?`, id)
 	if err != nil {
 		return fmt.Errorf("delete workload: %w", err)
 	}
@@ -188,6 +201,9 @@ func (s *Store) DeleteWorkload(id string) error {
 	if n == 0 {
 		return fmt.Errorf("workload %s: %w", id, ErrNotFound)
 	}
+	if err := tx.Commit(); err != nil {
+		return fmt.Errorf("commit: %w", err)
+	}
 	return nil
 }

@@ -0,0 +1,140 @@
+package volsnap
+
+import (
+	"archive/tar"
+	"compress/gzip"
+	"encoding/json"
+	"fmt"
+	"io"
+	"io/fs"
+	"os"
+	"path"
+	"path/filepath"
+)
+
+// writeArchive serializes the given host-bind volume directories into a
+// gzip-compressed tar at dest. Each volume's files live under an integer
+// subdirectory (its manifest Index); a manifest.json at the archive root makes
+// the archive self-describing. Returns the manifest describing what was
+// captured.
+//
+// Only regular files and directories are archived. Symlinks and special files
+// (devices, sockets, fifos) are skipped — this keeps capture safe and avoids
+// recording links whose targets would be meaningless or escape the volume on a
+// later restore. A torn snapshot is possible if the app writes during capture;
+// callers should surface that caveat.
+func writeArchive(dest string, refs []VolumeRef) ([]SnapshotVolume, error) {
+	// O_EXCL: never clobber an existing file (filenames are unique per call).
+	f, err := os.OpenFile(dest, os.O_CREATE|os.O_EXCL|os.O_WRONLY, 0o600)
+	if err != nil {
+		return nil, fmt.Errorf("create snapshot file: %w", err)
+	}
+	defer f.Close()
+
+	gz := gzip.NewWriter(f)
+	tw := tar.NewWriter(gz)
+
+	manifest := make([]SnapshotVolume, 0, len(refs))
+	for i, ref := range refs {
+		manifest = append(manifest, SnapshotVolume{Index: i, Target: ref.Target, Scope: ref.Scope, Source: ref.Source})
+		if err := addDir(tw, ref.HostPath, fmt.Sprintf("%d", i)); err != nil {
+			_ = tw.Close()
+			_ = gz.Close()
+			_ = f.Close()
+			os.Remove(dest)
+			return nil, err
+		}
+	}
+
+	if err := writeManifestEntry(tw, manifest); err != nil {
+		_ = tw.Close()
+		_ = gz.Close()
+		os.Remove(dest)
+		return nil, err
+	}
+
+	if err := tw.Close(); err != nil {
+		_ = gz.Close()
+		os.Remove(dest)
+		return nil, fmt.Errorf("finalize tar: %w", err)
+	}
+	if err := gz.Close(); err != nil {
+		os.Remove(dest)
+		return nil, fmt.Errorf("finalize gzip: %w", err)
+	}
+	if err := f.Close(); err != nil {
+		os.Remove(dest)
+		return nil, fmt.Errorf("close snapshot file: %w", err)
+	}
+	return manifest, nil
+}
+
+// addDir walks root and writes its regular files and directories into tw under
+// the given archive prefix.
+func addDir(tw *tar.Writer, root, prefix string) error {
+	return filepath.WalkDir(root, func(p string, d fs.DirEntry, walkErr error) error {
+		if walkErr != nil {
+			return fmt.Errorf("walk %s: %w", p, walkErr)
+		}
+		// Skip symlinks and special files; archive only dirs and regular files.
+		if d.Type()&fs.ModeSymlink != 0 {
+			return nil
+		}
+		if !d.IsDir() && !d.Type().IsRegular() {
+			return nil
+		}
+
+		rel, err := filepath.Rel(root, p)
+		if err != nil {
+			return fmt.Errorf("relativize %s: %w", p, err)
+		}
+		name := prefix
+		if rel != "." {
+			name = path.Join(prefix, filepath.ToSlash(rel))
+		}
+
+		info, err := d.Info()
+		if err != nil {
+			return fmt.Errorf("stat %s: %w", p, err)
+		}
+		hdr, err := tar.FileInfoHeader(info, "")
+		if err != nil {
+			return fmt.Errorf("tar header %s: %w", p, err)
+		}
+		hdr.Name = name
+		if d.IsDir() {
+			hdr.Name += "/"
+		}
+		if err := tw.WriteHeader(hdr); err != nil {
+			return fmt.Errorf("write tar header %s: %w", name, err)
+		}
+		if d.IsDir() {
+			return nil
+		}
+
+		src, err := os.Open(p)
+		if err != nil {
+			return fmt.Errorf("open %s: %w", p, err)
+		}
+		defer src.Close()
+		if _, err := io.Copy(tw, src); err != nil {
+			return fmt.Errorf("copy %s: %w", p, err)
+		}
+		return nil
+	})
+}
+
+func writeManifestEntry(tw *tar.Writer, manifest []SnapshotVolume) error {
+	data, err := json.MarshalIndent(manifest, "", "  ")
+	if err != nil {
+		return fmt.Errorf("encode manifest: %w", err)
+	}
+	hdr := &tar.Header{Name: "manifest.json", Mode: 0o600, Size: int64(len(data)), Typeflag: tar.TypeReg}
+	if err := tw.WriteHeader(hdr); err != nil {
+		return fmt.Errorf("write manifest header: %w", err)
+	}
+	if _, err := tw.Write(data); err != nil {
+		return fmt.Errorf("write manifest: %w", err)
+	}
+	return nil
+}
@@ -0,0 +1,117 @@
+package volsnap
+
+import (
+	"archive/tar"
+	"compress/gzip"
+	"io"
+	"os"
+	"path/filepath"
+	"testing"
+)
+
+func TestWriteArchiveRoundTrip(t *testing.T) {
+	root := t.TempDir()
+	mustWrite(t, filepath.Join(root, "a.txt"), "hello")
+	if err := os.MkdirAll(filepath.Join(root, "sub"), 0o755); err != nil {
+		t.Fatal(err)
+	}
+	mustWrite(t, filepath.Join(root, "sub", "b.txt"), "world")
+
+	dest := filepath.Join(t.TempDir(), "snap.tar.gz")
+	refs := []VolumeRef{{Target: "/data", Scope: "project", Source: "data", HostPath: root}}
+
+	manifest, err := writeArchive(dest, refs)
+	if err != nil {
+		t.Fatalf("writeArchive: %v", err)
+	}
+	if len(manifest) != 1 || manifest[0].Index != 0 || manifest[0].Target != "/data" || manifest[0].Scope != "project" {
+		t.Fatalf("unexpected manifest: %+v", manifest)
+	}
+
+	entries := readArchive(t, dest)
+	for _, want := range []string{"0/a.txt", "0/sub/b.txt", "manifest.json"} {
+		if _, ok := entries[want]; !ok {
+			keys := make([]string, 0, len(entries))
+			for k := range entries {
+				keys = append(keys, k)
+			}
+			t.Fatalf("archive missing %q; got %v", want, keys)
+		}
+	}
+	if got := entries["0/a.txt"]; got != "hello" {
+		t.Errorf("0/a.txt = %q, want %q", got, "hello")
+	}
+}
+
+func TestWriteArchiveRefusesExisting(t *testing.T) {
+	dest := filepath.Join(t.TempDir(), "snap.tar.gz")
+	mustWrite(t, dest, "existing")
+	if _, err := writeArchive(dest, nil); err == nil {
+		t.Fatal("expected error writing over an existing file (O_EXCL)")
+	}
+}
+
+func TestWriteArchiveSkipsSymlinks(t *testing.T) {
+	root := t.TempDir()
+	mustWrite(t, filepath.Join(root, "real.txt"), "data")
+	if err := os.Symlink(filepath.Join(root, "real.txt"), filepath.Join(root, "link.txt")); err != nil {
+		t.Skipf("symlinks unavailable on this platform: %v", err)
+	}
+
+	dest := filepath.Join(t.TempDir(), "snap.tar.gz")
+	if _, err := writeArchive(dest, []VolumeRef{{Target: "/d", Scope: "project", HostPath: root}}); err != nil {
+		t.Fatalf("writeArchive: %v", err)
+	}
+	entries := readArchive(t, dest)
+	if _, ok := entries["0/link.txt"]; ok {
+		t.Error("symlink should have been skipped, but it is in the archive")
+	}
+	if _, ok := entries["0/real.txt"]; !ok {
+		t.Error("regular file should be archived")
+	}
+}
+
+func mustWrite(t *testing.T, path, content string) {
+	t.Helper()
+	if err := os.WriteFile(path, []byte(content), 0o644); err != nil {
+		t.Fatal(err)
+	}
+}
+
+// readArchive returns a map of regular-file entry name -> content. Directory
+// entries are recorded with an empty string so their presence can be asserted.
+func readArchive(t *testing.T, path string) map[string]string {
+	t.Helper()
+	f, err := os.Open(path)
+	if err != nil {
+		t.Fatal(err)
+	}
+	defer f.Close()
+	gz, err := gzip.NewReader(f)
+	if err != nil {
+		t.Fatal(err)
+	}
+	defer gz.Close()
+
+	out := map[string]string{}
+	tr := tar.NewReader(gz)
+	for {
+		hdr, err := tr.Next()
+		if err == io.EOF {
+			break
+		}
+		if err != nil {
+			t.Fatal(err)
+		}
+		if hdr.Typeflag == tar.TypeDir {
+			out[hdr.Name] = ""
+			continue
+		}
+		data, err := io.ReadAll(tr)
+		if err != nil {
+			t.Fatal(err)
+		}
+		out[hdr.Name] = string(data)
+	}
+	return out
+}
@@ -0,0 +1,207 @@
+package volsnap
+
+import (
+	"encoding/json"
+	"errors"
+	"fmt"
+	"log/slog"
+	"os"
+	"path/filepath"
+	"strings"
+	"sync"
+	"time"
+
+	"github.com/google/uuid"
+
+	"github.com/alexei/tinyforge/internal/store"
+)
+
+// maxSnapshotsPerWorkload caps how many snapshots are retained per app. On
+// create, older snapshots beyond this count are pruned (best-effort) so volume
+// snapshots cannot grow the data disk without bound.
+const maxSnapshotsPerWorkload = 20
+
+// ErrNoSnapshotData is returned by Create when the workload has no resolved
+// host-bind volume directory to capture. It is a client-actionable condition
+// (HTTP 400), distinct from internal failures (HTTP 500).
+var ErrNoSnapshotData = errors.New("no snapshottable volume data for this app")
+
+// Engine creates and manages volume snapshots under <dataDir>/snapshots.
+type Engine struct {
+	mu      sync.Mutex
+	store   *store.Store
+	snapDir string
+}
+
+// New creates the snapshot engine, ensuring the snapshot directory exists.
+func New(st *store.Store, dataDir string) (*Engine, error) {
+	dir := filepath.Join(dataDir, "snapshots")
+	if err := os.MkdirAll(dir, 0o755); err != nil {
+		return nil, fmt.Errorf("create snapshot directory: %w", err)
+	}
+	return &Engine{store: st, snapDir: dir}, nil
+}
+
+// SnapDir returns the directory holding snapshot archives.
+func (e *Engine) SnapDir() string { return e.snapDir }
+
+// Create captures a snapshot of the workload's host-bind data volumes.
+func (e *Engine) Create(w store.Workload, settings store.Settings, label string) (store.VolumeSnapshot, error) {
+	refs, _, err := SnapshotableVolumes(e.store, w, settings)
+	if err != nil {
+		return store.VolumeSnapshot{}, fmt.Errorf("enumerate volumes: %w", err)
+	}
+	if len(refs) == 0 {
+		return store.VolumeSnapshot{}, ErrNoSnapshotData
+	}
+
+	e.mu.Lock()
+	defer e.mu.Unlock()
+
+	filename := fmt.Sprintf("%s-%s-%s.tar.gz",
+		idShort(w.ID), time.Now().UTC().Format("20060102-150405"), uuid.New().String()[:8])
+	dest := filepath.Join(e.snapDir, filename)
+
+	manifest, err := writeArchive(dest, refs)
+	if err != nil {
+		return store.VolumeSnapshot{}, err
+	}
+
+	info, err := os.Stat(dest)
+	if err != nil {
+		os.Remove(dest)
+		return store.VolumeSnapshot{}, fmt.Errorf("stat snapshot: %w", err)
+	}
+	manifestJSON, err := json.Marshal(manifest)
+	if err != nil {
+		os.Remove(dest)
+		return store.VolumeSnapshot{}, fmt.Errorf("encode manifest: %w", err)
+	}
+
+	row, err := e.store.CreateVolumeSnapshot(store.VolumeSnapshot{
+		WorkloadID: w.ID,
+		Label:      strings.TrimSpace(label),
+		Filename:   filename,
+		SizeBytes:  info.Size(),
+		Manifest:   string(manifestJSON),
+	})
+	if err != nil {
+		os.Remove(dest) // best-effort: don't leak an orphan file
+		return store.VolumeSnapshot{}, fmt.Errorf("record snapshot: %w", err)
+	}
+
+	slog.Info("volume snapshot created", "id", row.ID, "workload", w.ID,
+		"volumes", len(manifest), "size", info.Size())
+
+	e.pruneWorkload(w.ID)
+	return row, nil
+}
+
+// List returns a workload's snapshots, newest first.
+func (e *Engine) List(workloadID string) ([]store.VolumeSnapshot, error) {
+	return e.store.ListVolumeSnapshots(workloadID)
+}
+
+// Get returns one snapshot by id.
+func (e *Engine) Get(id string) (store.VolumeSnapshot, error) {
+	return e.store.GetVolumeSnapshot(id)
+}
+
+// Delete removes a snapshot's archive file and its metadata row.
+func (e *Engine) Delete(id string) error {
+	snap, err := e.store.GetVolumeSnapshot(id)
+	if err != nil {
+		return err
+	}
+	e.mu.Lock()
+	defer e.mu.Unlock()
+	if p, perr := e.FilePath(snap); perr == nil {
+		if rmErr := os.Remove(p); rmErr != nil && !os.IsNotExist(rmErr) {
+			slog.Warn("volume snapshot: remove file", "id", id, "error", rmErr)
+		}
+	}
+	return e.store.DeleteVolumeSnapshot(id)
+}
+
+// FilePath resolves a snapshot's archive path and verifies it stays within the
+// snapshot directory (defence-in-depth against a tampered filename column).
+func (e *Engine) FilePath(snap store.VolumeSnapshot) (string, error) {
+	base := filepath.Base(snap.Filename)
+	if base == "" || base == "." || base != snap.Filename {
+		return "", fmt.Errorf("invalid snapshot filename")
+	}
+	p := filepath.Join(e.snapDir, base)
+	abs, err := filepath.Abs(p)
+	if err != nil {
+		return "", err
+	}
+	absDir, _ := filepath.Abs(e.snapDir)
+	if !strings.HasPrefix(abs, absDir+string(filepath.Separator)) {
+		return "", fmt.Errorf("snapshot path escapes snapshot directory")
+	}
+	return abs, nil
+}
+
+// CleanOrphans removes snapshot archive files that have no metadata row,
+// reconciling on-disk files against the DB. Workload deletion CASCADEs the
+// volume_snapshots rows but cannot reach the files; this (run at startup)
+// reclaims them. Mirrors backup.Engine.CleanOrphans.
+func (e *Engine) CleanOrphans() (int, error) {
+	e.mu.Lock()
+	defer e.mu.Unlock()
+
+	entries, err := os.ReadDir(e.snapDir)
+	if err != nil {
+		return 0, fmt.Errorf("read snapshot dir: %w", err)
+	}
+	filenames, err := e.store.AllVolumeSnapshotFilenames()
+	if err != nil {
+		return 0, fmt.Errorf("list snapshot filenames: %w", err)
+	}
+	known := make(map[string]bool, len(filenames))
+	for _, f := range filenames {
+		known[f] = true
+	}
+
+	removed := 0
+	for _, ent := range entries {
+		if ent.IsDir() || known[ent.Name()] {
+			continue
+		}
+		if err := os.Remove(filepath.Join(e.snapDir, ent.Name())); err != nil {
+			slog.Warn("volume snapshot: remove orphan", "file", ent.Name(), "error", err)
+			continue
+		}
+		removed++
+	}
+	return removed, nil
+}
+
+// pruneWorkload deletes snapshots beyond maxSnapshotsPerWorkload for one
+// workload (oldest first). Best-effort: caller already holds e.mu.
+func (e *Engine) pruneWorkload(workloadID string) {
+	count, err := e.store.CountVolumeSnapshots(workloadID)
+	if err != nil || count <= maxSnapshotsPerWorkload {
+		return
+	}
+	oldest, err := e.store.GetOldestVolumeSnapshots(workloadID, count-maxSnapshotsPerWorkload)
+	if err != nil {
+		slog.Warn("volume snapshot: prune query", "workload", workloadID, "error", err)
+		return
+	}
+	for _, snap := range oldest {
+		if p, perr := e.FilePath(snap); perr == nil {
+			_ = os.Remove(p)
+		}
+		if derr := e.store.DeleteVolumeSnapshot(snap.ID); derr != nil {
+			slog.Warn("volume snapshot: prune delete", "id", snap.ID, "error", derr)
+		}
+	}
+}
+
+func idShort(id string) string {
+	if len(id) > 8 {
+		return id[:8]
+	}
+	return id
+}
@@ -0,0 +1,146 @@
+// Package volsnap captures and manages per-workload snapshots of an app's
+// host-bind data volumes. It is deliberately independent of internal/backup
+// (which is SQLite-specific): a snapshot here is a tar.gz of the resolved
+// volume directories, recorded in the volume_snapshots table.
+//
+// Phase 2a-i covers CAPTURE only (create/list/delete/download). The restore
+// path — which overwrites live data and needs container quiesce + atomic swap
+// — is intentionally a separate, later phase.
+package volsnap
+
+import (
+	"encoding/json"
+	"os"
+
+	"github.com/alexei/tinyforge/internal/store"
+	"github.com/alexei/tinyforge/internal/volume"
+)
+
+// supportedScopes are the host-bind volume scopes phase 2a-i can snapshot.
+// Each resolves to a real host directory the running container binds. Excluded
+// for now: instance (needs the deployed image tag to resolve a per-tag dir),
+// named/project_named (Docker named volumes — need a docker-run-tar primitive),
+// and ephemeral (tmpfs — no data to capture).
+var supportedScopes = map[string]bool{
+	string(store.VolumeScopeAbsolute): true,
+	string(store.VolumeScopeStage):    true,
+	string(store.VolumeScopeProject):  true,
+}
+
+// SnapshotVolume is one volume covered by a snapshot. It is persisted in the
+// snapshot row's manifest (JSON) and written into the archive so a future
+// restore can re-resolve the target even if volume settings drift. Index names
+// the archive subdirectory holding that volume's files.
+type SnapshotVolume struct {
+	Index  int    `json:"index"`
+	Target string `json:"target"`
+	Scope  string `json:"scope"`
+	Source string `json:"source"`
+}
+
+// VolumeRef is a resolved, on-disk host-bind volume eligible for snapshotting.
+type VolumeRef struct {
+	Target   string
+	Scope    string
+	Source   string
+	HostPath string
+}
+
+// SkippedVolume is a declared volume that cannot be snapshotted, with the
+// reason surfaced to the UI so users are never misled into thinking data is
+// captured when it is not.
+type SkippedVolume struct {
+	Target string `json:"target"`
+	Scope  string `json:"scope"`
+	Reason string `json:"reason"`
+}
+
+// scVolumes is the minimal shape parsed out of an image workload's
+// source_config — just enough to learn its declared volumes without importing
+// the image source package.
+type scVolumes struct {
+	Volumes []struct {
+		Source string `json:"source"`
+		Target string `json:"target"`
+		Scope  string `json:"scope"`
+		Name   string `json:"name"`
+	} `json:"volumes"`
+}
+
+// SnapshotableVolumes enumerates a workload's data volumes and splits them into
+// those that can be snapshotted now (resolved host-bind dirs that exist on
+// disk) and those that are skipped (with a reason). It mirrors the image
+// source's computeMounts merge: source_config volumes overlaid by persisted
+// workload_volumes rows (persisted wins on a target conflict).
+//
+// Only image-source workloads declare host-bind data volumes today; for any
+// other source kind both slices come back empty.
+func SnapshotableVolumes(st *store.Store, w store.Workload, settings store.Settings) (refs []VolumeRef, skipped []SkippedVolume, err error) {
+	if w.SourceKind != "image" {
+		return nil, nil, nil
+	}
+
+	byTarget := map[string]store.WorkloadVolume{}
+
+	var cfg scVolumes
+	if w.SourceConfig != "" {
+		// Best-effort: a malformed config simply yields no inline volumes; the
+		// persisted rows below still apply.
+		_ = json.Unmarshal([]byte(w.SourceConfig), &cfg)
+	}
+	for _, v := range cfg.Volumes {
+		if v.Target == "" {
+			continue
+		}
+		byTarget[v.Target] = store.WorkloadVolume{Source: v.Source, Target: v.Target, Scope: v.Scope, Name: v.Name}
+	}
+	persisted, perr := st.ListWorkloadVolumes(w.ID)
+	if perr != nil {
+		return nil, nil, perr
+	}
+	for _, p := range persisted {
+		byTarget[p.Target] = store.WorkloadVolume{Source: p.Source, Target: p.Target, Scope: p.Scope, Name: p.Name}
+	}
+
+	params := volume.ResolveWorkloadParams{
+		BasePath:           settings.BaseVolumePath,
+		WorkloadID:         w.ID,
+		WorkloadName:       w.Name,
+		AllowedVolumePaths: settings.AllowedVolumePaths,
+	}
+
+	for _, v := range byTarget {
+		if v.Target == "" {
+			continue
+		}
+		if !supportedScopes[v.Scope] {
+			skipped = append(skipped, SkippedVolume{Target: v.Target, Scope: v.Scope, Reason: skipReason(v.Scope)})
+			continue
+		}
+		hostPath, rerr := volume.ResolveWorkloadPath(v, params)
+		if rerr != nil {
+			skipped = append(skipped, SkippedVolume{Target: v.Target, Scope: v.Scope, Reason: rerr.Error()})
+			continue
+		}
+		info, serr := os.Stat(hostPath)
+		if serr != nil || !info.IsDir() {
+			skipped = append(skipped, SkippedVolume{Target: v.Target, Scope: v.Scope, Reason: "no data on disk yet"})
+			continue
+		}
+		refs = append(refs, VolumeRef{Target: v.Target, Scope: v.Scope, Source: v.Source, HostPath: hostPath})
+	}
+	return refs, skipped, nil
+}
+
+func skipReason(scope string) string {
+	switch scope {
+	case string(store.VolumeScopeInstance):
+		return "instance-scoped volumes are not yet snapshottable"
+	case string(store.VolumeScopeNamed), string(store.VolumeScopeProjectNamed):
+		return "Docker named volumes are not yet snapshottable"
+	case string(store.VolumeScopeEphemeral):
+		return "ephemeral (tmpfs) volumes hold no persistent data"
+	default:
+		return "unsupported volume scope"
+	}
+}
@@ -169,6 +169,18 @@ func SaveFile(rootPath, relativePath string, r io.Reader) error {

 // safePath resolves a relative path within rootPath and validates it doesn't escape.
 // Resolves symlinks to prevent symlink-based traversal attacks.
+//
+// The check used to be `strings.HasPrefix(absResolved, absRoot)` which has
+// a classic boundary bug: a sibling root at /data/vol10 would pass the
+// prefix test for /data/vol1. The fix enforces a separator boundary so
+// the only allowed cases are absResolved == absRoot OR absResolved begins
+// with absRoot + separator.
+//
+// For paths that don't yet exist (e.g. SaveFile creating a new file),
+// EvalSymlinks returns an error and we fall back to the lexical path.
+// In that case we walk every existing ancestor with EvalSymlinks too —
+// if any ancestor is a symlink that escapes the root, we reject. This
+// closes the prior gap where pre-planted symlinks could divert writes.
 func safePath(rootPath, relativePath string) (string, error) {
 	if relativePath == "" {
 		return rootPath, nil
@@ -176,7 +188,7 @@ func safePath(rootPath, relativePath string) (string, error) {

 	// Clean and ensure no traversal.
 	cleaned := filepath.Clean(relativePath)
-	if strings.Contains(cleaned, "..") {
+	if cleaned == ".." || strings.HasPrefix(cleaned, ".."+string(filepath.Separator)) || strings.Contains(cleaned, string(filepath.Separator)+".."+string(filepath.Separator)) {
 		return "", fmt.Errorf("path traversal not allowed")
 	}

@@ -191,18 +203,66 @@ func safePath(rootPath, relativePath string) (string, error) {
 		absRoot = realRoot
 	}

-	// Resolve the target path including symlinks.
+	// Resolve the target path. If the leaf doesn't exist (write path),
+	// walk parent directories — any of which may already be a symlink.
 	absResolved, err := filepath.Abs(absPath)
 	if err != nil {
 		return "", fmt.Errorf("resolve path: %w", err)
 	}
 	if realResolved, err := filepath.EvalSymlinks(absResolved); err == nil {
 		absResolved = realResolved
+	} else {
+		// Leaf missing — resolve the deepest existing ancestor and
+		// re-join the unresolved tail. This catches a pre-planted
+		// symlink in any parent dir. An error here means an ancestor
+		// could not be resolved (e.g. a symlink we cannot follow): we MUST
+		// reject rather than fall back to the lexical path, which still
+		// carries the absRoot prefix and would let a symlink ancestor that
+		// escapes the root slip past the boundary check below.
+		resolved, tailErr := resolveExistingAncestor(absResolved)
+		if tailErr != nil {
+			return "", fmt.Errorf("path traversal not allowed")
+		}
+		if resolved != "" {
+			absResolved = resolved
+		}
 	}

-	if !strings.HasPrefix(absResolved, absRoot) {
+	if absResolved != absRoot && !strings.HasPrefix(absResolved, absRoot+string(filepath.Separator)) {
 		return "", fmt.Errorf("path traversal not allowed")
 	}

 	return absPath, nil
 }
+
+// resolveExistingAncestor walks p upward until it finds an existing
+// directory, resolves its symlinks, then rejoins the missing tail.
+// Returns ("", nil) when no ancestor exists (vanishingly rare).
+func resolveExistingAncestor(p string) (string, error) {
+	tail := ""
+	cur := p
+	for {
+		if cur == "" || cur == "/" || cur == filepath.VolumeName(cur)+string(filepath.Separator) {
+			return "", nil
+		}
+		info, err := os.Lstat(cur)
+		if err == nil {
+			real, rerr := filepath.EvalSymlinks(cur)
+			if rerr != nil {
+				return "", rerr
+			}
+			_ = info
+			if tail == "" {
+				return real, nil
+			}
+			return filepath.Join(real, tail), nil
+		}
+		// Move one level up.
+		parent := filepath.Dir(cur)
+		if parent == cur {
+			return "", nil
+		}
+		tail = filepath.Join(filepath.Base(cur), tail)
+		cur = parent
+	}
+}
@@ -131,8 +131,14 @@ const maxWebhookBodyBytes = 256 * 1024 // 256 KiB
 // PluginDispatcher is what the plugin-workload webhook handler needs from
 // the deployer: the canonical Source-dispatch entry point plus access to
 // the same Deps bundle so Trigger.Match can read store / crypto.
+//
+// DispatchTeardown is required so the preview-deploy flow can tear down
+// an ephemeral per-branch child workload when its upstream branch is
+// deleted. Same teardown path the API /workloads/{id} DELETE route uses;
+// nil error on a clean teardown lets the caller delete the workload row.
 type PluginDispatcher interface {
 	DispatchPlugin(ctx context.Context, w pluginWorkload, intent pluginIntent) error
+	DispatchTeardown(ctx context.Context, w pluginWorkload) error
 	PluginDeps() pluginDeps
 }

@@ -13,8 +13,10 @@ import (

 	"github.com/go-chi/chi/v5"

+	"github.com/alexei/tinyforge/internal/metrics"
 	"github.com/alexei/tinyforge/internal/store"
 	"github.com/alexei/tinyforge/internal/workload/plugin"
+	"github.com/alexei/tinyforge/internal/workload/preview"
 )

 // maxTriggerFanOutConcurrency caps how many bindings dispatch in
@@ -44,6 +46,17 @@ const (
 	ReasonConfigError     = "config merge error"
 	ReasonMatchError      = "match error"
 	ReasonDispatchFailed  = "dispatch failed"
+	ReasonPreviewError    = "preview materialize error"
+	ReasonPreviewTorndown = "preview torn down"
+	// ReasonPreviewNoop: a branch-delete webhook arrived but no preview was
+	// ever materialized for that branch — a legitimate clean skip, distinct
+	// from "no binding matched" so it isn't misreported as a wiring problem.
+	ReasonPreviewNoop = "preview noop"
+	// ReasonPreviewOrphaned: the preview container was torn down but its
+	// workload row could not be deleted, leaving an orphan row. Surfaced
+	// distinctly so the partial failure is visible rather than masquerading
+	// as a clean teardown.
+	ReasonPreviewOrphaned = "preview torn down (row orphaned)"
 )

 // handleTriggerWebhook processes an inbound webhook for a first-class
@@ -172,7 +185,7 @@ func (h *Handler) handleTriggerWebhook(w http.ResponseWriter, r *http.Request) {
 		switch {
 		case r.Deployed:
 			deployed++
-		case r.Reason == ReasonBindingDisabled:
+		case r.Reason == ReasonBindingDisabled, r.Reason == ReasonPreviewNoop:
 			skipped++
 		case r.Reason == ReasonNoMatch:
 			noMatch++
@@ -194,8 +207,10 @@ func (h *Handler) handleTriggerWebhook(w http.ResponseWriter, r *http.Request) {
 	case noMatch == len(results)-skipped:
 		delivery.Detail = "no binding matched"
 	default:
-		delivery.Detail = fmt.Sprintf("matched=0 skipped=%d errored=%d", skipped, errored)
+		delivery.Detail = fmt.Sprintf("matched=0 skipped=%d errored=%d nomatch=%d",
+			skipped, errored, noMatch)
 	}
+	metrics.WebhookDeliveriesTotal.Inc(delivery.Outcome)
 	respondWebhookJSON(w, http.StatusOK, map[string]any{
 		"success":  true,
 		"trigger":  trg.Name,
@@ -326,6 +341,18 @@ func (h *Handler) fireBinding(
 	if intent.TriggeredBy == "" {
 		intent.TriggeredBy = "trigger-webhook"
 	}
+
+	// Preview-deploy fork: the git trigger plugin attaches preview_branch
+	// metadata when BranchPattern matches a non-baseline branch. Route
+	// the dispatch through a per-branch child workload rather than
+	// redeploying the parent template. The fork is intentionally before
+	// the dispatch so the template's container never gets clobbered by
+	// a feature-branch push.
+	if previewBranch := intent.Metadata["preview_branch"]; previewBranch != "" {
+		fired, reason := h.handlePreviewIntent(ctx, row, intent, previewBranch)
+		return fired, reason
+	}
+
 	if err := h.plugins.DispatchPlugin(ctx, pwl, *intent); err != nil {
 		slog.Warn("webhook: dispatch failed",
 			"trigger", trg.Name, "workload", row.Name, "error", err)
@@ -336,3 +363,72 @@ func (h *Handler) fireBinding(
 	return true, intent.Reason
 }

+// handlePreviewIntent dispatches an intent that targeted a non-baseline
+// branch on a preview-template workload. Two paths:
+//
+//  1. Branch deleted: find the matching preview workload, dispatch
+//     Teardown, then delete the workload row so the dashboard reflects
+//     the upstream state.
+//  2. Branch pushed: materialize (or reuse) the preview workload, then
+//     dispatch the deploy against it. The template workload itself is
+//     never deployed against a feature branch.
+//
+// On any error the helper logs and returns a generic reason — the
+// fan-out caller treats these the same as a normal dispatch failure.
+func (h *Handler) handlePreviewIntent(
+	ctx context.Context,
+	template store.Workload,
+	intent *plugin.DeploymentIntent,
+	branch string,
+) (bool, string) {
+	deleted := intent.Metadata["preview_deleted"] == "1"
+	if deleted {
+		child, ok, err := preview.FindPreviewForBranch(h.store, template.ID, branch)
+		if err != nil {
+			slog.Warn("webhook: preview lookup failed",
+				"template", template.Name, "branch", branch, "error", err)
+			return false, ReasonPreviewError
+		}
+		if !ok {
+			// Branch was deleted upstream but we never materialized a
+			// preview for it — nothing to do. Report as a distinct noop so
+			// it isn't bucketed as "no binding matched".
+			return false, ReasonPreviewNoop
+		}
+		childPwl := toPluginWorkload(child)
+		if err := h.plugins.DispatchTeardown(ctx, childPwl); err != nil {
+			slog.Warn("webhook: preview teardown dispatch failed",
+				"template", template.Name, "preview", child.Name, "error", err)
+			return false, ReasonDispatchFailed
+		}
+		if err := h.store.DeleteWorkload(child.ID); err != nil {
+			// Container is gone but the row is orphaned. Surface this as a
+			// distinct reason so the partial failure is visible rather than
+			// reported as a clean teardown; the operator can delete the row
+			// from the dashboard if it sticks around.
+			slog.Warn("webhook: preview row delete failed (orphaned row)",
+				"template", template.Name, "preview", child.Name, "error", err)
+			return true, ReasonPreviewOrphaned
+		}
+		slog.Info("webhook: preview torn down",
+			"template", template.Name, "branch", branch, "preview", child.Name)
+		return true, ReasonPreviewTorndown
+	}
+
+	child, err := preview.MaterializeForBranch(h.store, template, branch)
+	if err != nil {
+		slog.Warn("webhook: preview materialize failed",
+			"template", template.Name, "branch", branch, "error", err)
+		return false, ReasonPreviewError
+	}
+	childPwl := toPluginWorkload(child)
+	if err := h.plugins.DispatchPlugin(ctx, childPwl, *intent); err != nil {
+		slog.Warn("webhook: preview dispatch failed",
+			"template", template.Name, "preview", child.Name, "error", err)
+		return false, ReasonDispatchFailed
+	}
+	slog.Info("webhook: triggered preview deploy",
+		"template", template.Name, "branch", branch, "preview", child.Name, "reason", intent.Reason)
+	return true, intent.Reason
+}
+
@@ -327,6 +327,10 @@ func parseGitLabPushEvent(body []byte, headers http.Header) vendorParseResult {
 			Ref:       probe.Ref,
 			CommitSHA: probe.After,
 			Pusher:    pusher,
+			// GitLab does not emit `deleted: true`; the canonical signal
+			// is an all-zero `after` SHA. Same parser helper used for the
+			// GitHub / Gitea fallback so the two branches agree.
+			Deleted: isZeroSHA(probe.After),
 		},
 	}
 	if strings.HasPrefix(probe.Ref, "refs/heads/") {
@@ -346,6 +350,7 @@ func parseGenericGitPush(body []byte) (plugin.InboundEvent, error) {
 	var probe struct {
 		Ref        string `json:"ref"`
 		After      string `json:"after"`
+		Deleted    bool   `json:"deleted"`
 		Repository struct {
 			FullName string `json:"full_name"`
 			CloneURL string `json:"clone_url"`
@@ -370,6 +375,12 @@ func parseGenericGitPush(body []byte) (plugin.InboundEvent, error) {
 	if pusher == "" {
 		pusher = probe.Pusher.Username
 	}
+	// Branch / tag deletion is signalled either by the explicit
+	// `deleted: true` flag (GitHub / Gitea) or by an all-zero `after`
+	// SHA (older shapes). Both are honoured so the preview-deploy flow
+	// can tear down ephemeral workloads even when a vendor omits the
+	// boolean flag.
+	deleted := probe.Deleted || isZeroSHA(probe.After)
 	evt := plugin.InboundEvent{
 		Kind: "git-push",
 		Git: &plugin.GitEvent{
@@ -377,6 +388,7 @@ func parseGenericGitPush(body []byte) (plugin.InboundEvent, error) {
 			Ref:       probe.Ref,
 			CommitSHA: probe.After,
 			Pusher:    pusher,
+			Deleted:   deleted,
 		},
 	}
 	if strings.HasPrefix(probe.Ref, "refs/heads/") {
@@ -388,3 +400,19 @@ func parseGenericGitPush(body []byte) (plugin.InboundEvent, error) {
 	}
 	return evt, nil
 }
+
+// isZeroSHA returns true when sha is the canonical "no commit" sentinel
+// (40 zeros) that vendors emit on the `after` field of a branch- or
+// tag-delete push event. Length-tolerant because some test fixtures
+// truncate the SHA.
+func isZeroSHA(sha string) bool {
+	if sha == "" {
+		return false
+	}
+	for _, r := range sha {
+		if r != '0' {
+			return false
+		}
+	}
+	return len(sha) >= 7
+}
@@ -0,0 +1,100 @@
+package plugin
+
+import (
+	"log/slog"
+
+	"github.com/alexei/tinyforge/internal/crypto"
+)
+
+// ResolveSharedSecrets returns the applicable shared secrets (global, then
+// app-scoped overlaying global) as a decrypted KEY->VALUE map. Decrypt
+// failures log + skip the one entry (mirroring BuildWorkloadEnv). Best-effort:
+// a store error logs and returns an empty map so a shared-secret outage never
+// fails a deploy.
+//
+// The store orders the rows global-first (then app), so iterating in order and
+// writing into the map makes a later app-scoped entry with the same Name
+// overwrite the global default — the intended global < app precedence.
+//
+// NOTE: the compose plugin intentionally does NOT call this — compose env is
+// YAML-defined and shared-secret support for compose is an explicit
+// out-of-scope follow-up.
+func ResolveSharedSecrets(deps Deps, appID, sourceName string) map[string]string {
+	merged := map[string]string{}
+	rows, err := deps.Store.ListApplicableSharedSecrets(appID)
+	if err != nil {
+		slog.Warn(sourceName+": list shared secrets", "app", appID, "error", err)
+		return merged
+	}
+	for _, sec := range rows {
+		value := sec.Value
+		if sec.Encrypted {
+			decrypted, err := crypto.Decrypt(deps.EncKey, sec.Value)
+			if err != nil {
+				slog.Warn(sourceName+": decrypt shared secret",
+					"app", appID, "name", sec.Name, "error", err)
+				continue
+			}
+			value = decrypted
+		}
+		merged[sec.Name] = value
+	}
+	return merged
+}
+
+// BuildWorkloadEnv flattens the applicable shared secrets plus workload_env
+// rows into the KEY=VALUE list Docker expects. Shared by the source plugins
+// (static, dockerfile) so they all handle decrypt failures the same way.
+//
+// Shared secrets are the low-precedence base layer; workload_env rows overlay
+// them so a workload's own config always wins on a key conflict. A workload
+// with no applicable shared secrets starts from an empty base, so the output
+// is identical to the workload_env-only behavior that predated shared secrets.
+//
+// Encrypted rows are decrypted lazily so plaintext never lives in the store
+// output. A decrypt failure logs and skips the entry rather than failing the
+// whole deploy: bricking a sync/build because one rotated key missed an env
+// entry would be worse than running with the variable unset and surfacing the
+// warning.
+//
+// appID is the workload's app_id (plugin.Workload.GroupID), used to resolve
+// app-scoped shared secrets. sourceName is the slog prefix the caller wants on
+// the warning lines (e.g. "static source" / "dockerfile source") so existing
+// log scrapers keep matching the per-source message text.
+func BuildWorkloadEnv(deps Deps, workloadID, appID, sourceName string) []string {
+	// Base layer: shared secrets (global, then app overlaying global).
+	merged := ResolveSharedSecrets(deps, appID, sourceName)
+
+	rows, err := deps.Store.ListWorkloadEnv(workloadID)
+	if err != nil {
+		slog.Warn(sourceName+": list workload env", "workload", workloadID, "error", err)
+		// Still return whatever shared secrets resolved; a workload_env
+		// outage shouldn't drop the shared defaults a deploy already has.
+		return flattenEnvMap(merged)
+	}
+	for _, e := range rows {
+		value := e.Value
+		if e.Encrypted {
+			decrypted, err := crypto.Decrypt(deps.EncKey, e.Value)
+			if err != nil {
+				slog.Warn(sourceName+": decrypt env value",
+					"workload", workloadID, "key", e.Key, "error", err)
+				continue
+			}
+			value = decrypted
+		}
+		merged[e.Key] = value // workload_env overrides shared secrets
+	}
+	return flattenEnvMap(merged)
+}
+
+// flattenEnvMap turns a KEY->VALUE map into the KEY=VALUE slice Docker
+// expects. Order is unspecified (map iteration) — Docker treats env as a
+// set, and callers that need determinism sort downstream.
+func flattenEnvMap(m map[string]string) []string {
+	out := make([]string, 0, len(m))
+	for k, v := range m {
+		out = append(out, k+"="+v)
+	}
+	return out
+}
@@ -0,0 +1,103 @@
+package plugin
+
+import (
+	"encoding/json"
+	"fmt"
+	"log/slog"
+	"strings"
+
+	"github.com/alexei/tinyforge/internal/events"
+	"github.com/alexei/tinyforge/internal/store"
+)
+
+// maxDeployStatusRunes bounds the persisted status. This is a defense-in-depth
+// BACKSTOP, not a sanitizer.
+//
+// CALLER CONTRACT: deploy events are persisted indefinitely, rendered in the
+// per-app timeline, AND egress off-box — error-severity events are forwarded
+// to the global NotificationURL (cmd/server) and to operator-configured
+// event-trigger webhooks (internal/events/dispatcher). Callers MUST therefore
+// keep secrets and raw subprocess output (e.g. `docker compose` combined
+// stderr, which can echo the deployed app's own secret-bearing logs) OUT of
+// `status`; emit a curated, secret-free reason and keep verbose detail only in
+// the returned error (server logs + admin deploy result, neither of which
+// egresses). The cap below merely bounds blast radius if something slips
+// through — 256 runes keeps a meaningful reason without letting a status
+// become an unbounded sink.
+const maxDeployStatusRunes = 256
+
+// capDeployStatus truncates s to maxDeployStatusRunes runes, appending an
+// ellipsis when it had to cut. Operating on the rune slice keeps the cut on
+// a UTF-8 boundary so multibyte output can't be sliced mid-rune.
+func capDeployStatus(s string) string {
+	runes := []rune(s)
+	if len(runes) <= maxDeployStatusRunes {
+		return s
+	}
+	return string(runes[:maxDeployStatusRunes]) + "…"
+}
+
+// EmitDeployEvent records a workload-scoped deploy event in the event log
+// and publishes it on the bus. Best-effort: logs and returns on failure,
+// never blocks or fails the deploy. `source` is the per-kind event source
+// string ("image","compose","static_site","dockerfile"); `status` is a
+// short human status ("deploying","deployed","failed: <reason>").
+//
+// The metadata always carries workload_id so the per-app activity timeline
+// can be reconstructed even by consumers that only read the JSON blob, and
+// the dedicated workload_id column powers the indexed per-workload query.
+func EmitDeployEvent(deps Deps, w Workload, source, status string) {
+	// Audit logging is best-effort and must never crash a real deploy. The
+	// production Deps always wires both, but guard so a missing bus/store
+	// (e.g. a narrow unit test) degrades to a no-op instead of a panic.
+	if deps.Store == nil || deps.Events == nil {
+		return
+	}
+
+	// Derive severity from the raw status prefix BEFORE capping, then bound
+	// the status that actually gets persisted/displayed/published.
+	severity := "info"
+	if strings.HasPrefix(status, "failed") {
+		severity = "error"
+	}
+	status = capDeployStatus(status)
+	message := fmt.Sprintf("%s: %s", w.Name, status)
+
+	metaBytes, err := json.Marshal(map[string]string{
+		"workload_id":   w.ID,
+		"workload_name": w.Name,
+		"status":        status,
+	})
+	if err != nil {
+		slog.Error("plugin: marshal deploy event metadata",
+			"source", source, "workload", w.ID, "error", err)
+		metaBytes = []byte("{}")
+	}
+	metadata := string(metaBytes)
+
+	evt, err := deps.Store.InsertEvent(store.EventLog{
+		Source:     source,
+		Severity:   severity,
+		Message:    message,
+		Metadata:   metadata,
+		WorkloadID: w.ID,
+	})
+	if err != nil {
+		slog.Error("plugin: failed to persist deploy event log",
+			"source", source, "workload", w.ID, "error", err)
+		return
+	}
+
+	deps.Events.Publish(events.Event{
+		Type: events.EventLog,
+		Payload: events.EventLogPayload{
+			ID:         evt.ID,
+			Source:     source,
+			WorkloadID: w.ID,
+			Severity:   severity,
+			Message:    message,
+			Metadata:   metadata,
+			CreatedAt:  evt.CreatedAt,
+		},
+	})
+}
@@ -0,0 +1,167 @@
+package plugin
+
+import (
+	"encoding/json"
+	"strings"
+	"testing"
+	"unicode/utf8"
+
+	"github.com/alexei/tinyforge/internal/events"
+	"github.com/alexei/tinyforge/internal/store"
+)
+
+// capturePublisher records every event published on it so a test can
+// assert on the bus payload. Satisfies plugin.EventPublisher.
+type capturePublisher struct {
+	events []events.Event
+}
+
+func (c *capturePublisher) Publish(evt events.Event) {
+	c.events = append(c.events, evt)
+}
+
+// newEmitDeps builds a plugin.Deps backed by an in-memory store and a
+// capturing publisher. Mirrors the in-memory store pattern used by the
+// store + source-plugin tests.
+func newEmitDeps(t *testing.T) (Deps, *capturePublisher) {
+	t.Helper()
+	st, err := store.New(":memory:")
+	if err != nil {
+		t.Fatalf("open store: %v", err)
+	}
+	t.Cleanup(func() { _ = st.Close() })
+	pub := &capturePublisher{}
+	return Deps{Store: st, Events: pub}, pub
+}
+
+func TestEmitDeployEvent(t *testing.T) {
+	tests := []struct {
+		name         string
+		status       string
+		wantSeverity string
+	}{
+		{name: "deployed is info", status: "deployed", wantSeverity: "info"},
+		{name: "deploying is info", status: "deploying", wantSeverity: "info"},
+		{name: "failed is error", status: "failed: pull foo failed", wantSeverity: "error"},
+		{name: "failed bare is error", status: "failed", wantSeverity: "error"},
+	}
+
+	for _, tt := range tests {
+		t.Run(tt.name, func(t *testing.T) {
+			deps, pub := newEmitDeps(t)
+			w := Workload{ID: "wl-123", Name: "my-app"}
+
+			EmitDeployEvent(deps, w, "image", tt.status)
+
+			// Persisted row carries the workload scope + derived severity.
+			rows, err := deps.Store.ListEvents(store.EventLogFilter{WorkloadID: w.ID})
+			if err != nil {
+				t.Fatalf("ListEvents: %v", err)
+			}
+			if len(rows) != 1 {
+				t.Fatalf("got %d persisted events, want 1", len(rows))
+			}
+			got := rows[0]
+			if got.Severity != tt.wantSeverity {
+				t.Errorf("severity = %q, want %q", got.Severity, tt.wantSeverity)
+			}
+			if got.Source != "image" {
+				t.Errorf("source = %q, want %q", got.Source, "image")
+			}
+			if got.WorkloadID != w.ID {
+				t.Errorf("workload_id = %q, want %q", got.WorkloadID, w.ID)
+			}
+			wantMsg := w.Name + ": " + tt.status
+			if got.Message != wantMsg {
+				t.Errorf("message = %q, want %q", got.Message, wantMsg)
+			}
+
+			// Metadata JSON carries workload_id / workload_name / status.
+			var meta map[string]string
+			if err := json.Unmarshal([]byte(got.Metadata), &meta); err != nil {
+				t.Fatalf("unmarshal metadata %q: %v", got.Metadata, err)
+			}
+			if meta["workload_id"] != w.ID {
+				t.Errorf("metadata workload_id = %q, want %q", meta["workload_id"], w.ID)
+			}
+			if meta["workload_name"] != w.Name {
+				t.Errorf("metadata workload_name = %q, want %q", meta["workload_name"], w.Name)
+			}
+			if meta["status"] != tt.status {
+				t.Errorf("metadata status = %q, want %q", meta["status"], tt.status)
+			}
+
+			// The persisted row is also re-published on the bus as an
+			// EventLog so SSE clients see it live.
+			if len(pub.events) != 1 {
+				t.Fatalf("got %d published events, want 1", len(pub.events))
+			}
+			ev := pub.events[0]
+			if ev.Type != events.EventLog {
+				t.Errorf("event type = %q, want %q", ev.Type, events.EventLog)
+			}
+			payload, ok := ev.Payload.(events.EventLogPayload)
+			if !ok {
+				t.Fatalf("payload type = %T, want events.EventLogPayload", ev.Payload)
+			}
+			if payload.WorkloadID != w.ID {
+				t.Errorf("payload workload_id = %q, want %q", payload.WorkloadID, w.ID)
+			}
+			if payload.Severity != tt.wantSeverity {
+				t.Errorf("payload severity = %q, want %q", payload.Severity, tt.wantSeverity)
+			}
+			if payload.ID != got.ID {
+				t.Errorf("payload id = %d, want %d", payload.ID, got.ID)
+			}
+		})
+	}
+}
+
+// TestEmitDeployEvent_CapsLongStatus verifies a long failure status (e.g. one
+// embedding raw subprocess output) is bounded to maxDeployStatusRunes runes in
+// both the persisted message and metadata, cut on a UTF-8 boundary, while
+// severity is still derived from the original "failed" prefix.
+func TestEmitDeployEvent_CapsLongStatus(t *testing.T) {
+	deps, pub := newEmitDeps(t)
+	w := Workload{ID: "wl-cap", Name: "app"}
+
+	// Multibyte body so a naive byte-slice would corrupt a rune; prefix with
+	// "failed: " so the severity check exercises the pre-cap derivation.
+	longStatus := "failed: " + strings.Repeat("é", 400)
+	EmitDeployEvent(deps, w, "compose", longStatus)
+
+	rows, err := deps.Store.ListEvents(store.EventLogFilter{WorkloadID: w.ID})
+	if err != nil {
+		t.Fatalf("ListEvents: %v", err)
+	}
+	if len(rows) != 1 {
+		t.Fatalf("got %d events, want 1", len(rows))
+	}
+	got := rows[0]
+
+	if got.Severity != "error" {
+		t.Errorf("severity = %q, want error (derived from pre-cap prefix)", got.Severity)
+	}
+
+	var meta map[string]string
+	if err := json.Unmarshal([]byte(got.Metadata), &meta); err != nil {
+		t.Fatalf("unmarshal metadata: %v", err)
+	}
+	capped := meta["status"]
+	if rc := len([]rune(capped)); rc != maxDeployStatusRunes+1 { // +1 for the ellipsis rune
+		t.Errorf("capped status = %d runes, want %d", rc, maxDeployStatusRunes+1)
+	}
+	if !utf8.ValidString(capped) {
+		t.Errorf("capped status is not valid UTF-8: %q", capped)
+	}
+	if !strings.HasSuffix(capped, "…") {
+		t.Errorf("capped status missing ellipsis suffix: %q", capped)
+	}
+	wantMsg := w.Name + ": " + capped
+	if got.Message != wantMsg {
+		t.Errorf("message = %q, want %q", got.Message, wantMsg)
+	}
+	if len(pub.events) != 1 {
+		t.Fatalf("got %d published events, want 1", len(pub.events))
+	}
+}
@@ -0,0 +1,17 @@
+package plugin
+
+// IDShort returns the first 8 chars of a workload ID, used as the uniqueness
+// suffix on the Docker resources (container, image, volume) a source plugin
+// materializes. Workload names are not UNIQUE in the schema today; including
+// the ID short prevents two workloads with the same name from clobbering each
+// other's container, image, or storage volume.
+//
+// Shared by the source plugins (static, dockerfile). Each plugin still owns
+// its own container/image NAME format (the human-readable prefix differs by
+// source kind) — only the ID-short derivation is common.
+func IDShort(w Workload) string {
+	if len(w.ID) < 8 {
+		return w.ID
+	}
+	return w.ID[:8]
+}
@@ -0,0 +1,81 @@
+package plugin
+
+import (
+	"log/slog"
+
+	"github.com/alexei/tinyforge/internal/crypto"
+	"github.com/alexei/tinyforge/internal/notify"
+)
+
+// DispatchNotificationForWorkload sends `event` to every notification
+// route configured for the workload. Resolution order:
+//
+//  1. workload_notifications rows matching `event.Type` — multi-route
+//     fan-out (e.g. Slack alerts + Discord successes per workload).
+//  2. If zero matching rows AND the legacy single-URL columns on the
+//     workload row are set, send to that URL — backwards compat for
+//     installs that pre-date the new table.
+//  3. Otherwise, fall through to settings.notification_url so the global
+//     destination still fires for workloads with no per-row config.
+//
+// Secrets are decrypted via deps.EncKey before sending. A failed decrypt
+// degrades to "send unsigned" with a warning rather than dropping the
+// notification — the operator still gets the alert, they just need to
+// re-save the secret. Fire-and-forget: failures are logged inside
+// deps.Notifier and never bubble up here.
+//
+// Callers (static / dockerfile / image / compose plugins) pass an
+// already-populated Event; this helper does not synthesize the payload
+// shape, only the routing.
+func DispatchNotificationForWorkload(deps Deps, w Workload, event notify.Event) {
+	if deps.Notifier == nil {
+		return
+	}
+	rows, err := deps.Store.ListWorkloadNotifications(w.ID)
+	if err != nil {
+		slog.Warn("notify: list workload routes failed",
+			"workload", w.ID, "error", err)
+		rows = nil
+	}
+
+	matched := 0
+	for _, n := range rows {
+		if !n.MatchesEventType(event.Type) {
+			continue
+		}
+		matched++
+		secret := ""
+		if n.Secret != "" {
+			dec, derr := crypto.Decrypt(deps.EncKey, n.Secret)
+			if derr != nil {
+				slog.Warn("notify: decrypt workload secret failed — sending unsigned",
+					"workload", w.ID, "route", n.Name, "error", derr)
+			} else {
+				secret = dec
+			}
+		}
+		deps.Notifier.SendSigned(n.URL, secret, notify.TierSite, event)
+	}
+	if matched > 0 {
+		return
+	}
+
+	// Legacy fallback: single per-workload destination on workloads.notification_url.
+	if w.NotificationURL != "" {
+		deps.Notifier.SendSigned(w.NotificationURL, w.NotificationSecret, notify.TierSite, event)
+		return
+	}
+
+	// Global fallback so a one-line config in settings still notifies
+	// every workload without a per-row override.
+	settings, err := deps.Store.GetSettings()
+	if err != nil {
+		slog.Warn("notify: settings lookup for global fallback failed",
+			"workload", w.ID, "error", err)
+		return
+	}
+	if settings.NotificationURL == "" {
+		return
+	}
+	deps.Notifier.SendSigned(settings.NotificationURL, settings.NotificationSecret, notify.TierSettings, event)
+}
@@ -32,6 +32,23 @@ type Config struct {

 type source struct{}

+// composeRunner is the slice of stack.Compose this plugin actually
+// drives. Defined locally per the "interfaces where they are used"
+// idiom so the plugin can be unit-tested without a real docker compose
+// binary. `*stack.Compose` satisfies it implicitly.
+type composeRunner interface {
+	Up(ctx context.Context, projectName, yamlPath string) (string, error)
+	Down(ctx context.Context, projectName string, removeVolumes bool) (string, error)
+	Ps(ctx context.Context, projectName, yamlPath string) ([]stack.Service, error)
+}
+
+// newComposeRunner returns the runner the plugin should call. Tests
+// swap this var with a fake; production code never touches it. The
+// indirection costs one function-pointer dereference per Deploy /
+// Teardown / Reconcile call — negligible against the docker compose
+// exec it gates.
+var newComposeRunner = func() composeRunner { return stack.NewCompose("") }
+
 func init() { plugin.RegisterSource(&source{}) }

 func (*source) Kind() string { return "compose" }
@@ -67,7 +84,7 @@ func (*source) Validate(cfg json.RawMessage) error {
 // `docker compose -p <project> up -d`, then syncs one Container row per
 // service. The workload ID is the natural compose project name unless
 // the user supplied one explicitly.
-func (*source) Deploy(ctx context.Context, deps plugin.Deps, w plugin.Workload, intent plugin.DeploymentIntent) error {
+func (*source) Deploy(ctx context.Context, deps plugin.Deps, w plugin.Workload, intent plugin.DeploymentIntent) (err error) {
 	cfg, err := plugin.SourceConfigOf[Config](w)
 	if err != nil {
 		return fmt.Errorf("compose source: decode config: %w", err)
@@ -76,13 +93,36 @@ func (*source) Deploy(ctx context.Context, deps plugin.Deps, w plugin.Workload,
 		return fmt.Errorf("compose source: workload %s has empty compose_yaml", w.ID)
 	}

+	// compose.Deploy has no idempotency short-circuit (no "already up"
+	// fast path that returns nil), so every call past config validation
+	// is a real deploy. Arm the terminal audit emit here — after pure
+	// config-validation errors above (kept quiet, mirroring the image
+	// plugin) but before any real work — so all real failures and the
+	// success are captured for the per-app timeline. err is the named
+	// return.
+	defer func() {
+		if err != nil {
+			// SECURITY: the compose.Up failure wraps raw `docker compose`
+			// combined output (which can include the deployed app's own
+			// stderr — potentially secrets). Deploy events are persisted
+			// indefinitely AND egress to operator webhooks (the global
+			// NotificationURL + event-trigger actions), so the emitted
+			// status must NOT carry that output. The full detail still
+			// reaches the server log + admin deploy result via the returned
+			// err; the timeline records only a generic, secret-free reason.
+			plugin.EmitDeployEvent(deps, w, "compose", "failed")
+		} else {
+			plugin.EmitDeployEvent(deps, w, "compose", "deployed")
+		}
+	}()
+
 	projectName := composeProjectName(cfg.ComposeProjectName, w)
 	yamlPath, err := writeYAML(w.ID, cfg.ComposeYAML)
 	if err != nil {
 		return fmt.Errorf("compose source: write yaml: %w", err)
 	}

-	compose := stack.NewCompose("")
+	compose := newComposeRunner()
 	out, err := compose.Up(ctx, projectName, yamlPath)
 	if err != nil {
 		return fmt.Errorf("compose source: docker compose up: %w (output: %s)", err, truncate(out, 1024))
@@ -105,7 +145,7 @@ func (*source) Teardown(ctx context.Context, deps plugin.Deps, w plugin.Workload
 	cfg, _ := plugin.SourceConfigOf[Config](w)
 	projectName := composeProjectName(cfg.ComposeProjectName, w)

-	compose := stack.NewCompose("")
+	compose := newComposeRunner()
 	if _, err := compose.Down(ctx, projectName, true); err != nil {
 		// Log but proceed — the DB rows must not be orphaned.
 		slog.Warn("compose source: docker compose down", "workload", w.ID, "error", err)
@@ -139,7 +179,7 @@ func (*source) Reconcile(ctx context.Context, deps plugin.Deps, w plugin.Workloa
 	projectName := composeProjectName(cfg.ComposeProjectName, w)
 	yamlPath, _ := writeYAMLIfChanged(w.ID, cfg.ComposeYAML)

-	compose := stack.NewCompose("")
+	compose := newComposeRunner()
 	services, err := compose.Ps(ctx, projectName, yamlPath)
 	if err != nil {
 		// Likely no compose project running for this workload. Mark
@@ -162,7 +202,7 @@ func (*source) Reconcile(ctx context.Context, deps plugin.Deps, w plugin.Workloa

 // syncContainers shares its body with Reconcile minus the missing-row
 // fallback — Deploy expects compose ps to succeed since `up` just ran.
-func syncContainers(ctx context.Context, deps plugin.Deps, compose *stack.Compose, w plugin.Workload, projectName, yamlPath string) error {
+func syncContainers(ctx context.Context, deps plugin.Deps, compose composeRunner, w plugin.Workload, projectName, yamlPath string) error {
 	services, err := compose.Ps(ctx, projectName, yamlPath)
 	if err != nil {
 		return fmt.Errorf("compose ps: %w", err)
@@ -204,7 +244,17 @@ var projectNameSanitizer = regexp.MustCompile(`[^a-z0-9_-]`)

 func composeProjectName(explicit string, w plugin.Workload) string {
 	if explicit != "" {
-		return explicit
+		// Apply the same sanitizer to operator-supplied names so a value
+		// like "--foo" cannot reach the docker CLI and be re-parsed as a
+		// flag. Reuses the canonical lower+[^a-z0-9_-]→"-" + trim path.
+		san := strings.ToLower(explicit)
+		san = projectNameSanitizer.ReplaceAllString(san, "-")
+		san = strings.Trim(san, "-")
+		if san != "" {
+			return san
+		}
+		// Fall through to the derived name if sanitization stripped
+		// everything (operator passed e.g. "---" — degenerate input).
 	}
 	name := strings.ToLower(w.Name)
 	name = projectNameSanitizer.ReplaceAllString(name, "-")
@@ -0,0 +1,512 @@
+package compose
+
+import (
+	"context"
+	"encoding/json"
+	"errors"
+	"strings"
+	"sync"
+	"testing"
+
+	"github.com/alexei/tinyforge/internal/stack"
+	"github.com/alexei/tinyforge/internal/store"
+	"github.com/alexei/tinyforge/internal/workload/plugin"
+)
+
+// fakeRunner stands in for *stack.Compose. Every method records its
+// inputs and returns whatever the test set on the corresponding field.
+// Defaults are happy-path: empty services from Ps, no error from Up /
+// Down. Fields are slice-typed so a single fakeRunner can serve a
+// sequence of calls (Deploy issues Up + Ps in order).
+type fakeRunner struct {
+	mu sync.Mutex
+
+	upCalls    []runnerCall
+	upOuts     []string
+	upErrs     []error
+	downCalls  []runnerCall
+	downOuts   []string
+	downErrs   []error
+	psCalls    []runnerCall
+	psResults  [][]stack.Service
+	psErrs     []error
+	upCallIdx  int
+	psCallIdx  int
+	downCallI  int
+}
+
+type runnerCall struct {
+	ProjectName   string
+	YAMLPath      string
+	RemoveVolumes bool
+}
+
+func (f *fakeRunner) Up(_ context.Context, projectName, yamlPath string) (string, error) {
+	f.mu.Lock()
+	defer f.mu.Unlock()
+	f.upCalls = append(f.upCalls, runnerCall{ProjectName: projectName, YAMLPath: yamlPath})
+	out, err := pop(f.upOuts, f.upErrs, f.upCallIdx)
+	f.upCallIdx++
+	return out, err
+}
+
+func (f *fakeRunner) Down(_ context.Context, projectName string, removeVolumes bool) (string, error) {
+	f.mu.Lock()
+	defer f.mu.Unlock()
+	f.downCalls = append(f.downCalls, runnerCall{ProjectName: projectName, RemoveVolumes: removeVolumes})
+	out, err := pop(f.downOuts, f.downErrs, f.downCallI)
+	f.downCallI++
+	return out, err
+}
+
+func (f *fakeRunner) Ps(_ context.Context, projectName, yamlPath string) ([]stack.Service, error) {
+	f.mu.Lock()
+	defer f.mu.Unlock()
+	f.psCalls = append(f.psCalls, runnerCall{ProjectName: projectName, YAMLPath: yamlPath})
+
+	idx := f.psCallIdx
+	f.psCallIdx++
+	var svcs []stack.Service
+	if idx < len(f.psResults) {
+		svcs = f.psResults[idx]
+	}
+	var err error
+	if idx < len(f.psErrs) {
+		err = f.psErrs[idx]
+	}
+	return svcs, err
+}
+
+// pop returns the nth element of outs/errs or zero values when n is
+// past the end. Lets a test set a single expected response without
+// padding slices for every other call.
+func pop(outs []string, errs []error, n int) (string, error) {
+	var out string
+	if n < len(outs) {
+		out = outs[n]
+	}
+	var err error
+	if n < len(errs) {
+		err = errs[n]
+	}
+	return out, err
+}
+
+// withFakeRunner swaps newComposeRunner for the duration of one test
+// and restores the original on cleanup. Tests that need to inspect the
+// fake post-hoc keep the returned pointer.
+func withFakeRunner(t *testing.T, f *fakeRunner) {
+	t.Helper()
+	orig := newComposeRunner
+	newComposeRunner = func() composeRunner { return f }
+	t.Cleanup(func() { newComposeRunner = orig })
+}
+
+func testStore(t *testing.T) *store.Store {
+	t.Helper()
+	st, err := store.New(":memory:")
+	if err != nil {
+		t.Fatalf("open store: %v", err)
+	}
+	t.Cleanup(func() { _ = st.Close() })
+	return st
+}
+
+// seedWorkload creates the parent workload row that container rows FK
+// onto. Returns the workload's ID so callers can reuse it.
+func seedWorkload(t *testing.T, st *store.Store, name, yamlText string) string {
+	t.Helper()
+	cfg := Config{ComposeYAML: yamlText}
+	body, err := json.Marshal(cfg)
+	if err != nil {
+		t.Fatalf("marshal config: %v", err)
+	}
+	w, err := st.CreateWorkload(store.Workload{
+		Kind:         "plugin",
+		Name:         name,
+		SourceKind:   "compose",
+		SourceConfig: string(body),
+	})
+	if err != nil {
+		t.Fatalf("create workload: %v", err)
+	}
+	return w.ID
+}
+
+func TestDeploy_HappyPath(t *testing.T) {
+	withTempDir(t) // isolates the YAML scratch dir under t.TempDir()
+
+	deps := plugin.Deps{Store: testStore(t)}
+	yamlText := "services:\n  web:\n    image: nginx:alpine\n"
+	wid := seedWorkload(t, deps.Store, "myapp", yamlText)
+	w := plugin.Workload{
+		ID:           wid,
+		Name:         "myapp",
+		SourceKind:   "compose",
+		SourceConfig: mustMarshalConfig(t, Config{ComposeYAML: yamlText}),
+	}
+
+	fake := &fakeRunner{
+		psResults: [][]stack.Service{{
+			{Service: "web", State: "running", Status: "Up 5 seconds"},
+		}},
+	}
+	withFakeRunner(t, fake)
+
+	src := &source{}
+	if err := src.Deploy(context.Background(), deps, w, plugin.DeploymentIntent{}); err != nil {
+		t.Fatalf("Deploy: %v", err)
+	}
+
+	// Up called exactly once with the workload-derived project name.
+	if len(fake.upCalls) != 1 {
+		t.Fatalf("Up called %d times, want 1", len(fake.upCalls))
+	}
+	if !strings.HasPrefix(fake.upCalls[0].ProjectName, "tf-myapp-") {
+		t.Errorf("Up projectName = %q, want prefix tf-myapp-", fake.upCalls[0].ProjectName)
+	}
+	if !strings.HasSuffix(fake.upCalls[0].YAMLPath, "compose.yml") {
+		t.Errorf("Up yamlPath = %q, want suffix compose.yml", fake.upCalls[0].YAMLPath)
+	}
+
+	// Ps follows Up to enumerate the resulting containers.
+	if len(fake.psCalls) != 1 {
+		t.Fatalf("Ps called %d times, want 1", len(fake.psCalls))
+	}
+
+	// Service row written.
+	row, err := deps.Store.GetContainerByID(wid + ":web")
+	if err != nil {
+		t.Fatalf("get container row: %v", err)
+	}
+	if row.WorkloadID != wid {
+		t.Errorf("row.WorkloadID = %q, want %q", row.WorkloadID, wid)
+	}
+	if row.Role != "web" {
+		t.Errorf("row.Role = %q, want %q", row.Role, "web")
+	}
+	if row.State != "running" {
+		t.Errorf("row.State = %q, want %q", row.State, "running")
+	}
+}
+
+func TestDeploy_EmptyYAMLConfig_RejectsBeforeExec(t *testing.T) {
+	deps := plugin.Deps{Store: testStore(t)}
+	wid := seedWorkload(t, deps.Store, "empty", "services:\n  web:\n    image: x\n")
+	w := plugin.Workload{
+		ID:           wid,
+		Name:         "empty",
+		SourceKind:   "compose",
+		SourceConfig: mustMarshalConfig(t, Config{ComposeYAML: ""}),
+	}
+
+	fake := &fakeRunner{}
+	withFakeRunner(t, fake)
+
+	src := &source{}
+	err := src.Deploy(context.Background(), deps, w, plugin.DeploymentIntent{})
+	if err == nil {
+		t.Fatal("Deploy accepted empty compose_yaml")
+	}
+	if !strings.Contains(err.Error(), "empty compose_yaml") {
+		t.Errorf("error = %v, want substring \"empty compose_yaml\"", err)
+	}
+	if len(fake.upCalls) != 0 {
+		t.Errorf("Up should not have been called; got %d calls", len(fake.upCalls))
+	}
+}
+
+func TestDeploy_UpFailure_PropagatesAndIncludesTruncatedOutput(t *testing.T) {
+	withTempDir(t)
+
+	deps := plugin.Deps{Store: testStore(t)}
+	yamlText := "services:\n  web:\n    image: bad-image\n"
+	wid := seedWorkload(t, deps.Store, "fail", yamlText)
+	w := plugin.Workload{
+		ID:           wid,
+		Name:         "fail",
+		SourceKind:   "compose",
+		SourceConfig: mustMarshalConfig(t, Config{ComposeYAML: yamlText}),
+	}
+
+	bigOut := strings.Repeat("docker compose log noise ", 200) // > 1024 bytes
+	fake := &fakeRunner{
+		upOuts: []string{bigOut},
+		upErrs: []error{errors.New("exit status 1")},
+	}
+	withFakeRunner(t, fake)
+
+	src := &source{}
+	err := src.Deploy(context.Background(), deps, w, plugin.DeploymentIntent{})
+	if err == nil {
+		t.Fatal("Deploy accepted Up failure")
+	}
+	if !strings.Contains(err.Error(), "docker compose up") {
+		t.Errorf("error = %v, want substring \"docker compose up\"", err)
+	}
+	if !strings.Contains(err.Error(), "exit status 1") {
+		t.Errorf("error = %v, want wrapped Up err", err)
+	}
+	if !strings.Contains(err.Error(), "(truncated)") {
+		t.Errorf("error = %v, want truncated-output marker", err)
+	}
+	// Ps must not be called when Up failed.
+	if len(fake.psCalls) != 0 {
+		t.Errorf("Ps called %d times after Up failure; want 0", len(fake.psCalls))
+	}
+}
+
+func TestDeploy_UpSucceedsButPsFails_SurfacesError(t *testing.T) {
+	// `up` succeeded but enumerate failed — Deploy must surface so the UI
+	// doesn't show an empty containers index for a running stack.
+	withTempDir(t)
+
+	deps := plugin.Deps{Store: testStore(t)}
+	yamlText := "services:\n  web:\n    image: nginx\n"
+	wid := seedWorkload(t, deps.Store, "psfail", yamlText)
+	w := plugin.Workload{
+		ID:           wid,
+		Name:         "psfail",
+		SourceKind:   "compose",
+		SourceConfig: mustMarshalConfig(t, Config{ComposeYAML: yamlText}),
+	}
+
+	fake := &fakeRunner{
+		psErrs: []error{errors.New("compose ps boom")},
+	}
+	withFakeRunner(t, fake)
+
+	src := &source{}
+	err := src.Deploy(context.Background(), deps, w, plugin.DeploymentIntent{})
+	if err == nil {
+		t.Fatal("Deploy ignored Ps failure")
+	}
+	if !strings.Contains(err.Error(), "sync container rows") {
+		t.Errorf("error = %v, want substring \"sync container rows\"", err)
+	}
+}
+
+func TestTeardown_DropsContainerRows_EvenWhenDownFails(t *testing.T) {
+	// docker compose down failing must not orphan rows in the DB.
+	withTempDir(t)
+	deps := plugin.Deps{Store: testStore(t)}
+	wid := seedWorkload(t, deps.Store, "tdown", "services:\n  web:\n    image: nginx\n")
+
+	// Seed two service rows the way Deploy would.
+	for _, role := range []string{"web", "db"} {
+		if err := deps.Store.UpsertContainer(store.Container{
+			ID:           wid + ":" + role,
+			WorkloadID:   wid,
+			WorkloadKind: "compose",
+			Role:         role,
+			Host:         "local",
+			State:        "running",
+		}); err != nil {
+			t.Fatalf("seed container: %v", err)
+		}
+	}
+
+	fake := &fakeRunner{downErrs: []error{errors.New("compose project unknown")}}
+	withFakeRunner(t, fake)
+
+	src := &source{}
+	w := plugin.Workload{
+		ID:           wid,
+		Name:         "tdown",
+		SourceKind:   "compose",
+		SourceConfig: mustMarshalConfig(t, Config{ComposeYAML: "services:\n  web:\n    image: nginx\n"}),
+	}
+	if err := src.Teardown(context.Background(), deps, w); err != nil {
+		t.Fatalf("Teardown: %v", err)
+	}
+
+	// Down requested removeVolumes=true (matches the docstring claim).
+	if len(fake.downCalls) != 1 {
+		t.Fatalf("Down calls = %d, want 1", len(fake.downCalls))
+	}
+	if !fake.downCalls[0].RemoveVolumes {
+		t.Errorf("Down removeVolumes = false, want true (workload teardown is destructive)")
+	}
+
+	// Rows gone despite the Down error.
+	for _, role := range []string{"web", "db"} {
+		if _, err := deps.Store.GetContainerByID(wid + ":" + role); !errors.Is(err, store.ErrNotFound) {
+			t.Errorf("container row %q survived teardown: err=%v", role, err)
+		}
+	}
+}
+
+func TestTeardown_HappyPath(t *testing.T) {
+	withTempDir(t)
+	deps := plugin.Deps{Store: testStore(t)}
+	wid := seedWorkload(t, deps.Store, "tdown2", "services:\n  web:\n    image: nginx\n")
+
+	if err := deps.Store.UpsertContainer(store.Container{
+		ID:           wid + ":web",
+		WorkloadID:   wid,
+		WorkloadKind: "compose",
+		Role:         "web",
+		Host:         "local",
+		State:        "running",
+	}); err != nil {
+		t.Fatalf("seed: %v", err)
+	}
+
+	fake := &fakeRunner{}
+	withFakeRunner(t, fake)
+
+	src := &source{}
+	w := plugin.Workload{
+		ID:           wid,
+		Name:         "tdown2",
+		SourceKind:   "compose",
+		SourceConfig: mustMarshalConfig(t, Config{ComposeYAML: "services:\n  web:\n    image: nginx\n"}),
+	}
+	if err := src.Teardown(context.Background(), deps, w); err != nil {
+		t.Fatalf("Teardown: %v", err)
+	}
+	if len(fake.downCalls) != 1 {
+		t.Errorf("Down calls = %d, want 1", len(fake.downCalls))
+	}
+	if _, err := deps.Store.GetContainerByID(wid + ":web"); !errors.Is(err, store.ErrNotFound) {
+		t.Errorf("container row survived teardown: err=%v", err)
+	}
+}
+
+func TestReconcile_PsSuccess_UpsertsRows(t *testing.T) {
+	withTempDir(t)
+	deps := plugin.Deps{Store: testStore(t)}
+	yamlText := "services:\n  web:\n    image: nginx\n  db:\n    image: postgres\n"
+	wid := seedWorkload(t, deps.Store, "rec", yamlText)
+
+	fake := &fakeRunner{
+		psResults: [][]stack.Service{{
+			{Service: "web", State: "running"},
+			{Service: "db", State: "running"},
+		}},
+	}
+	withFakeRunner(t, fake)
+
+	src := &source{}
+	w := plugin.Workload{
+		ID:           wid,
+		Name:         "rec",
+		SourceKind:   "compose",
+		SourceConfig: mustMarshalConfig(t, Config{ComposeYAML: yamlText}),
+	}
+	if err := src.Reconcile(context.Background(), deps, w); err != nil {
+		t.Fatalf("Reconcile: %v", err)
+	}
+
+	for _, role := range []string{"web", "db"} {
+		row, err := deps.Store.GetContainerByID(wid + ":" + role)
+		if err != nil {
+			t.Errorf("row %q missing after reconcile: %v", role, err)
+			continue
+		}
+		if row.State != "running" {
+			t.Errorf("row %q state = %q, want \"running\"", role, row.State)
+		}
+	}
+}
+
+func TestReconcile_PsFailure_MarksExistingRowsMissing(t *testing.T) {
+	// When compose ps fails (project unknown to Docker), the reconciler
+	// flips existing rows to "missing" rather than deleting them — the UI
+	// surfaces the desync to the operator.
+	withTempDir(t)
+	deps := plugin.Deps{Store: testStore(t)}
+	yamlText := "services:\n  web:\n    image: nginx\n"
+	wid := seedWorkload(t, deps.Store, "missing", yamlText)
+
+	if err := deps.Store.UpsertContainer(store.Container{
+		ID:           wid + ":web",
+		WorkloadID:   wid,
+		WorkloadKind: "compose",
+		Role:         "web",
+		Host:         "local",
+		State:        "running",
+	}); err != nil {
+		t.Fatalf("seed: %v", err)
+	}
+
+	fake := &fakeRunner{psErrs: []error{errors.New("no such project")}}
+	withFakeRunner(t, fake)
+
+	src := &source{}
+	w := plugin.Workload{
+		ID:           wid,
+		Name:         "missing",
+		SourceKind:   "compose",
+		SourceConfig: mustMarshalConfig(t, Config{ComposeYAML: yamlText}),
+	}
+	if err := src.Reconcile(context.Background(), deps, w); err != nil {
+		t.Fatalf("Reconcile returned %v; should be nil even on Ps failure", err)
+	}
+
+	row, err := deps.Store.GetContainerByID(wid + ":web")
+	if err != nil {
+		t.Fatalf("row missing entirely (should be marked, not deleted): %v", err)
+	}
+	if row.State != "missing" {
+		t.Errorf("row.State = %q, want \"missing\"", row.State)
+	}
+}
+
+func TestReconcile_FallsBackToStatusWhenStateEmpty(t *testing.T) {
+	// Some compose versions populate Status (human string) but not State
+	// (enum) for non-running services. upsertServiceRow falls back to
+	// Status; verify that here.
+	withTempDir(t)
+	deps := plugin.Deps{Store: testStore(t)}
+	yamlText := "services:\n  worker:\n    image: alpine\n"
+	wid := seedWorkload(t, deps.Store, "fallback", yamlText)
+
+	fake := &fakeRunner{
+		psResults: [][]stack.Service{{
+			{Service: "worker", State: "", Status: "Exit 0"},
+		}},
+	}
+	withFakeRunner(t, fake)
+
+	src := &source{}
+	w := plugin.Workload{
+		ID:           wid,
+		Name:         "fallback",
+		SourceKind:   "compose",
+		SourceConfig: mustMarshalConfig(t, Config{ComposeYAML: yamlText}),
+	}
+	if err := src.Reconcile(context.Background(), deps, w); err != nil {
+		t.Fatalf("Reconcile: %v", err)
+	}
+
+	row, err := deps.Store.GetContainerByID(wid + ":worker")
+	if err != nil {
+		t.Fatalf("get row: %v", err)
+	}
+	if row.State != "Exit 0" {
+		t.Errorf("row.State = %q, want \"Exit 0\" (Status fallback)", row.State)
+	}
+}
+
+// mustMarshalConfig is a small helper that converts a Config to the
+// raw-JSON shape SourceConfig expects. Tests use it instead of
+// hand-rolling the string so a Config field rename can't drift the test
+// fixture from the production decoder.
+func mustMarshalConfig(t *testing.T, cfg Config) json.RawMessage {
+	t.Helper()
+	b, err := json.Marshal(cfg)
+	if err != nil {
+		t.Fatalf("marshal config: %v", err)
+	}
+	return json.RawMessage(b)
+}
+
+// Compile-time guards: *stack.Compose must continue to satisfy
+// composeRunner so the production path keeps building, and the fake
+// must continue to satisfy it too so a drift in the interface shape
+// fails the build here rather than at runtime.
+var (
+	_ composeRunner = (*stack.Compose)(nil)
+	_ composeRunner = (*fakeRunner)(nil)
+)
@@ -0,0 +1,17 @@
+package dockerfile
+
+import "testing"
+
+// The commit-status reporter itself now lives in internal/staticsite
+// (staticsite.CommitStatusReporter) and is covered by
+// internal/staticsite/commit_status_reporter_test.go. This file retains only
+// the dockerfile plugin's local statusTargetURL coverage.
+
+func TestStatusTargetURL(t *testing.T) {
+	if got := statusTargetURL(""); got != "" {
+		t.Errorf("empty domain -> %q, want \"\"", got)
+	}
+	if got := statusTargetURL("x.example.com"); got != "https://x.example.com" {
+		t.Errorf("got %q", got)
+	}
+}
@@ -0,0 +1,570 @@
+package dockerfile
+
+import (
+	"context"
+	"fmt"
+	"io"
+	"log/slog"
+	"os"
+	"strconv"
+	"strings"
+	"time"
+
+	"github.com/alexei/tinyforge/internal/crypto"
+	"github.com/alexei/tinyforge/internal/docker"
+	"github.com/alexei/tinyforge/internal/events"
+	"github.com/alexei/tinyforge/internal/notify"
+	"github.com/alexei/tinyforge/internal/proxy"
+	"github.com/alexei/tinyforge/internal/staticsite"
+	"github.com/alexei/tinyforge/internal/store"
+	"github.com/alexei/tinyforge/internal/workload/plugin"
+)
+
+// healthCheckDelay is the grace window after StartContainer before we
+// probe IsContainerRunning. Mirrors the static plugin's window — short
+// enough not to slow happy-path deploys, long enough to catch
+// crash-on-boot failures (missing env, bad CMD, port conflict).
+const healthCheckDelay = 3 * time.Second
+
+// deploy runs one end-to-end sync of a dockerfile workload:
+//
+//  1. fetch the latest commit SHA from the configured git provider
+//  2. skip if SHA + container + proxy are all still healthy
+//  3. clone the repo into a temp dir
+//  4. resolve the build context + Dockerfile location
+//  5. `docker build -t <tag> -f <dockerfile> <context>`
+//  6. recreate the container with the new image
+//  7. health-probe the container, surface logs on failure
+//  8. reconfigure the proxy route
+//  9. tear down the previous container (different ID) once we're sure
+//     the new one is healthy and proxied
+//
+// Each step writes its own status update so the dashboard's runtime-
+// state panel can show a useful intermediate state when the deploy
+// stalls on the slow step (almost always the build).
+func deploy(ctx context.Context, deps plugin.Deps, w plugin.Workload, intent plugin.DeploymentIntent) (retErr error) {
+	cfg, err := plugin.SourceConfigOf[Config](w)
+	if err != nil {
+		return fmt.Errorf("dockerfile source: decode config: %w", err)
+	}
+
+	prev, prevContainer, err := loadState(deps, w)
+	if err != nil {
+		return err
+	}
+
+	// Force a full rebuild on manual / promote / first-time deploys
+	// (no Reason at all also implies manual). Schedule / git triggers
+	// honour the unchanged-SHA short-circuit so cron polling does not
+	// rebuild minute-by-minute when nothing changed.
+	force := intent.Reason == "" || intent.Reason == "manual" || intent.Reason == "promote"
+
+	// Decrypt the access token if present. Token never escapes this
+	// frame: any error message routes through sanitizeError(_, token)
+	// which redacts the literal substring.
+	token := ""
+	if cfg.AccessToken != "" {
+		decrypted, derr := crypto.Decrypt(deps.EncKey, cfg.AccessToken)
+		if derr != nil {
+			slog.Warn("dockerfile source: failed to decrypt access token",
+				"workload", w.Name, "error", derr)
+		} else {
+			token = decrypted
+		}
+	}
+
+	provider, err := staticsite.NewGitProvider(staticsite.ProviderType(cfg.Provider), cfg.BaseURL, token)
+	if err != nil {
+		updateStatus(deps, w, "failed", prev.LastCommitSHA,
+			sanitizeError(fmt.Sprintf("create provider: %v", err), token))
+		return fmt.Errorf("create provider: %w", err)
+	}
+
+	latestSHA, err := provider.GetLatestCommitSHA(ctx, cfg.RepoOwner, cfg.RepoName, cfg.Branch)
+	if err != nil {
+		updateStatus(deps, w, "failed", prev.LastCommitSHA,
+			sanitizeError(fmt.Sprintf("fetch commit SHA: %v", err), token))
+		return fmt.Errorf("get latest commit: %w", err)
+	}
+
+	domain := primaryDomain(deps, w)
+
+	// Commit-status reporter (best-effort; gated on cfg.ReportCommitStatus).
+	// The deferred terminal report fires Success/Failure based on the
+	// deploy's outcome, but ONLY once an actual build/deploy began
+	// (deployStarted). The unchanged-SHA short-circuit below returns via
+	// healUnchanged before that flips, so no status is reported when
+	// nothing was built. retErr is the named return the defer inspects.
+	reporter := staticsite.NewCommitStatusReporter(provider, cfg.RepoOwner, cfg.RepoName, latestSHA, statusTargetURL(domain), cfg.ReportCommitStatus)
+	deployStarted := false
+	defer func() {
+		if !deployStarted {
+			return
+		}
+		if retErr != nil {
+			reporter.Report(ctx, w.Name, w.ID, staticsite.CommitStatusFailure, "Tinyforge: build failed")
+		} else {
+			reporter.Report(ctx, w.Name, w.ID, staticsite.CommitStatusSuccess, "Tinyforge: deployed")
+		}
+	}()
+
+	prevContainerID := ""
+	prevProxyRouteID := ""
+	if prevContainer != nil {
+		prevContainerID = prevContainer.ContainerID
+		prevProxyRouteID = prevContainer.ProxyRouteID
+	}
+	// Short-circuit: SHA unchanged AND container is still running AND
+	// (if there's a public face) the proxy route still exists. Manual
+	// deploys skip this entirely.
+	//
+	// We deliberately do NOT gate this on prev.Status == "deployed". A
+	// transient failure (e.g. a one-off proxy-check error) leaves the
+	// persisted status as "failed"; if we required "deployed" here, every
+	// subsequent cron/git poll with the same SHA would fall through to a
+	// full clone + docker build despite a perfectly healthy running
+	// container — a rebuild storm that burns CPU/disk until a new commit
+	// lands. Instead we trust the live container/proxy state and heal the
+	// stale status via healUnchanged.
+	if !force && latestSHA == prev.LastCommitSHA && prevContainerID != "" {
+		running, _ := deps.Docker.IsContainerRunning(ctx, prevContainerID)
+		switch {
+		case !running:
+			slog.Info("dockerfile: container not running, forcing redeploy", "workload", w.Name)
+		case domain != "":
+			proxyOK, perr := deps.Proxy.RouteExists(ctx, domain)
+			switch {
+			case perr != nil:
+				slog.Warn("dockerfile: proxy check failed, forcing redeploy",
+					"workload", w.Name, "error", perr)
+			case !proxyOK:
+				slog.Info("dockerfile: proxy route missing, forcing redeploy", "workload", w.Name)
+			default:
+				return healUnchanged(deps, w, prev, latestSHA)
+			}
+		default:
+			return healUnchanged(deps, w, prev, latestSHA)
+		}
+	}
+
+	// From here on a deploy is genuinely underway, so the deferred terminal
+	// status report should fire. Push a "pending" commit status (best-
+	// effort) and arm the deferred Success/Failure report.
+	updateStatus(deps, w, "syncing", prev.LastCommitSHA, "")
+	publishEvent(deps, w, "syncing")
+	deployStarted = true
+	reporter.Report(ctx, w.Name, w.ID, staticsite.CommitStatusPending, "Tinyforge: deploying")
+
+	// Clone the repo into a temp dir. We always download the entire
+	// repo tree (folderPath = ""); a ContextPath subset is applied
+	// at build time, not at download time, so a Dockerfile in
+	// `./docker/Dockerfile` with `ContextPath=""` still works.
+	cloneDir, err := os.MkdirTemp("", "tf-build-"+plugin.IDShort(w)+"-*")
+	if err != nil {
+		updateStatus(deps, w, "failed", prev.LastCommitSHA,
+			sanitizeError(fmt.Sprintf("create clone dir: %v", err), token))
+		return fmt.Errorf("create clone dir: %w", err)
+	}
+	defer os.RemoveAll(cloneDir)
+
+	if err := provider.DownloadFolder(ctx, cfg.RepoOwner, cfg.RepoName, cfg.Branch, "", cloneDir); err != nil {
+		updateStatus(deps, w, "failed", prev.LastCommitSHA,
+			sanitizeError(fmt.Sprintf("download repo: %v", err), token))
+		return fmt.Errorf("download repo: %w", err)
+	}
+
+	// Resolve the build context (with symlink-aware escape check) and
+	// verify the Dockerfile is actually present before sending the
+	// build off to the daemon.
+	contextDir, err := resolveContextDir(cloneDir, cfg.ContextPath)
+	if err != nil {
+		updateStatus(deps, w, "failed", latestSHA,
+			sanitizeError(fmt.Sprintf("resolve context: %v", err), token))
+		return fmt.Errorf("resolve context: %w", err)
+	}
+	if err := verifyDockerfileExists(contextDir, cfg.DockerfilePath); err != nil {
+		updateStatus(deps, w, "failed", latestSHA,
+			sanitizeError(err.Error(), token))
+		return err
+	}
+
+	imageTag := imageTagFor(w)
+	updateStatus(deps, w, "building", latestSHA, "")
+	publishEvent(deps, w, "building")
+	// Bridge per-line build output onto the event bus so /api/events
+	// subscribers (the dashboard's live tail) can show progress while
+	// the daemon chugs. The bus is non-blocking — slow subscribers drop
+	// events rather than backpressure the build — so this is safe to
+	// call from the hot scan loop.
+	logFn := func(line string) {
+		publishBuildLog(deps, w, line)
+	}
+	if err := deps.Docker.BuildImageAt(ctx, contextDir, cfg.DockerfilePath, imageTag, logFn); err != nil {
+		updateStatus(deps, w, "failed", latestSHA,
+			sanitizeError(fmt.Sprintf("docker build: %v", err), token))
+		return fmt.Errorf("docker build: %w", err)
+	}
+
+	env := plugin.BuildWorkloadEnv(deps, w.ID, w.GroupID, "dockerfile source")
+	containerPort := strconv.Itoa(cfg.Port)
+
+	settings, err := deps.Store.GetSettings()
+	if err != nil {
+		updateStatus(deps, w, "failed", latestSHA,
+			sanitizeError(fmt.Sprintf("get settings: %v", err), token))
+		return fmt.Errorf("get settings: %w", err)
+	}
+
+	networkName := settings.Network
+	networkID, err := deps.Docker.EnsureNetwork(ctx, networkName)
+	if err != nil {
+		updateStatus(deps, w, "failed", latestSHA,
+			sanitizeError(fmt.Sprintf("ensure network: %v", err), token))
+		return fmt.Errorf("ensure network: %w", err)
+	}
+
+	containerName := containerNameFor(w)
+
+	// Per-face proxy labels (Traefik consumes these; NPM ignores them).
+	labels := map[string]string{}
+	if domain != "" {
+		if l := deps.Proxy.ContainerLabels(domain, cfg.Port); l != nil {
+			for k, v := range l {
+				labels[k] = v
+			}
+		}
+	}
+
+	cc := docker.ContainerConfig{
+		Name:         containerName,
+		Image:        imageTag,
+		Env:          env,
+		ExposedPorts: []string{containerPort + "/tcp"},
+		NetworkName:  networkName,
+		NetworkID:    networkID,
+		Labels:       labels,
+		WorkloadID:   w.ID,
+		// Dockerfile workloads are tagged as "build" so the dashboard
+		// and any filtered query can distinguish them from static sites
+		// (which serve files) and image-source containers (which pull
+		// pre-built images from a registry).
+		WorkloadKind: string(store.WorkloadKindBuild),
+		Role:         "",
+	}
+
+	containerID, err := deps.Docker.CreateContainer(ctx, cc)
+	if err != nil {
+		// Name conflict — best-effort cleanup of any prior container
+		// (by ID first; by name as a fallback) and one retry.
+		if prevContainerID != "" {
+			deps.Docker.StopContainer(ctx, prevContainerID, 10)
+			deps.Docker.RemoveContainer(ctx, prevContainerID, true)
+		}
+		removeContainerByName(ctx, deps, containerName)
+
+		containerID, err = deps.Docker.CreateContainer(ctx, cc)
+		if err != nil {
+			updateStatus(deps, w, "failed", latestSHA,
+				sanitizeError(fmt.Sprintf("create container: %v", err), token))
+			return fmt.Errorf("create container: %w", err)
+		}
+	}
+
+	if err := deps.Docker.StartContainer(ctx, containerID); err != nil {
+		deps.Docker.RemoveContainer(ctx, containerID, true)
+		updateStatus(deps, w, "failed", latestSHA,
+			sanitizeError(fmt.Sprintf("start container: %v", err), token))
+		return fmt.Errorf("start container: %w", err)
+	}
+
+	// Brief health-check window — catch crash-on-boot. ctx-aware so a
+	// cancelled deploy returns promptly. On failure surface the tail
+	// of the container's logs as the error reason; that's almost
+	// always what the operator needs to debug.
+	select {
+	case <-ctx.Done():
+		deps.Docker.RemoveContainer(ctx, containerID, true)
+		updateStatus(deps, w, "failed", latestSHA, "deploy cancelled before health check")
+		return ctx.Err()
+	case <-time.After(healthCheckDelay):
+	}
+	running, runErr := deps.Docker.IsContainerRunning(ctx, containerID)
+	if runErr != nil || !running {
+		logMsg := "container exited immediately after start"
+		if logs, logErr := deps.Docker.ContainerLogs(ctx, containerID, false, "40"); logErr == nil {
+			buf, _ := io.ReadAll(logs)
+			logs.Close()
+			if len(buf) > 0 {
+				// Pass `env` so any decrypted KEY=VALUE pair that the
+				// container's startup output happens to echo (think
+				// `RUN echo $DB_PASSWORD` in a debug Dockerfile) is
+				// redacted before it lands in the operator-visible
+				// last_error field.
+				logMsg = sanitizeErrorWithSecrets(string(buf), token, env)
+			}
+		}
+		deps.Docker.RemoveContainer(ctx, containerID, true)
+		updateStatus(deps, w, "failed", latestSHA, logMsg)
+		return fmt.Errorf("container not running: %s", logMsg)
+	}
+
+	// Resolve proxy target: in-network DNS by default, NPM-remote
+	// override uses (settings.ServerIP, hostPort).
+	forwardHost := containerName
+	forwardPort := cfg.Port
+	if settings.NpmRemote && settings.ProxyProvider == "npm" {
+		if settings.ServerIP != "" {
+			hostPort, hpErr := deps.Docker.InspectContainerPort(ctx, containerID, containerPort+"/tcp")
+			if hpErr != nil {
+				slog.Warn("dockerfile: could not get host port for remote NPM",
+					"workload", w.Name, "error", hpErr)
+			} else {
+				forwardHost = settings.ServerIP
+				forwardPort = int(hostPort)
+			}
+		}
+	}
+
+	// Configure proxy if a domain is set. Replace any prior route
+	// in-place so traffic shifts atomically over to the new container.
+	proxyRouteID := prevProxyRouteID
+	if domain != "" {
+		if prevProxyRouteID != "" {
+			deps.Proxy.DeleteRoute(ctx, prevProxyRouteID)
+		}
+		routeID, rerr := deps.Proxy.ConfigureRoute(ctx, domain, forwardHost, forwardPort, proxy.RouteOptions{
+			SSLCertificateID: settings.SSLCertificateID,
+		})
+		if rerr != nil {
+			slog.Warn("dockerfile: failed to configure proxy",
+				"workload", w.Name, "domain", domain,
+				"target", fmt.Sprintf("%s:%d", forwardHost, forwardPort), "error", rerr)
+		} else {
+			proxyRouteID = routeID
+			slog.Info("dockerfile: proxy configured",
+				"workload", w.Name, "domain", domain,
+				"target", fmt.Sprintf("%s:%d", forwardHost, forwardPort), "routeID", routeID)
+		}
+	}
+
+	// Drop the previous container only after the new one is healthy
+	// + routed. Different-ID-than-previous tells us we created a
+	// fresh one (vs returning the same ID via UpsertContainer reuse).
+	if prevContainerID != "" && prevContainerID != containerID {
+		deps.Docker.StopContainer(ctx, prevContainerID, 10)
+		deps.Docker.RemoveContainer(ctx, prevContainerID, true)
+	}
+
+	// Single transactional write of new state + container metadata.
+	// On failure: tear down the just-created container + proxy route
+	// so we don't leave orphans behind for the next deploy to trip
+	// over.
+	if err := saveState(deps, w, func(rs *runtimeState, c *store.Container) {
+		rs.LastCommitSHA = latestSHA
+		rs.LastSyncAt = store.Now()
+		rs.LastError = ""
+		rs.Status = "deployed"
+
+		c.ContainerID = containerID
+		c.ProxyRouteID = proxyRouteID
+		c.Subdomain = domain
+		c.State = "running"
+		c.Port = cfg.Port
+		c.ImageRef = imageTag
+	}); err != nil {
+		slog.Error("dockerfile: failed to persist deploy state — rolling back",
+			"workload", w.Name, "error", err)
+		if proxyRouteID != "" {
+			deps.Proxy.DeleteRoute(ctx, proxyRouteID)
+		}
+		deps.Docker.StopContainer(ctx, containerID, 10)
+		deps.Docker.RemoveContainer(ctx, containerID, true)
+		updateStatus(deps, w, "failed", latestSHA,
+			sanitizeError(fmt.Sprintf("persist deploy state: %v", err), token))
+		return fmt.Errorf("persist deploy state: %w", err)
+	}
+
+	publishEvent(deps, w, "deployed")
+	dispatchBuildNotification(deps, w, domain, "deployed", "")
+
+	slog.Info("dockerfile deployed",
+		"workload", w.Name,
+		"sha", shortSHA(latestSHA),
+		"image", imageTag)
+	return nil
+}
+
+// statusTargetURL derives the https URL the commit status links back to —
+// the workload's primary public face, or "" when it has none.
+func statusTargetURL(domain string) string {
+	if domain == "" {
+		return ""
+	}
+	return "https://" + domain
+}
+
+// updateStatus writes the runtime-state status/error/commit and (on
+// terminal states) fires the side effects the static plugin's helper
+// does: failures land in the event log, and a "deployed" or "failed"
+// transition dispatches an outbound notification.
+//
+// The deploy success path calls saveState directly with the full
+// container metadata; this helper covers failure / intermediate
+// transitions where only state moves.
+func updateStatus(deps plugin.Deps, w plugin.Workload, status, commitSHA, errMsg string) {
+	if err := saveState(deps, w, func(rs *runtimeState, c *store.Container) {
+		rs.Status = status
+		rs.LastError = errMsg
+		if commitSHA != "" {
+			rs.LastCommitSHA = commitSHA
+		}
+		switch status {
+		case "deployed":
+			c.State = "running"
+		case "stopped":
+			c.State = "stopped"
+		case "failed":
+			c.State = "failed"
+		case "syncing", "building":
+			// Don't churn the container row's state during in-progress
+			// build/sync — leave whatever the previous deploy left.
+		}
+	}); err != nil {
+		slog.Error("dockerfile: failed to update status",
+			"id", w.ID, "status", status, "error", err)
+	}
+
+	if status == "failed" {
+		publishEvent(deps, w, "failed: "+errMsg)
+	}
+
+	if status == "deployed" || status == "failed" {
+		dispatchBuildNotification(deps, w, primaryDomain(deps, w), status, errMsg)
+	}
+}
+
+// dispatchBuildNotification fans the build event out to every
+// configured notification route for the workload. Multi-destination
+// fan-out (workload_notifications rows + legacy single URL + global
+// settings fallback) is centralised in plugin.DispatchNotificationForWorkload
+// so the routing rules are identical across source kinds.
+func dispatchBuildNotification(deps plugin.Deps, w plugin.Workload, domain, status, errMsg string) {
+	eventType := "build_success"
+	if status == "failed" {
+		eventType = "build_failure"
+	}
+	siteURL := ""
+	if domain != "" {
+		siteURL = "https://" + domain
+	}
+	plugin.DispatchNotificationForWorkload(deps, w, notify.Event{
+		Type:    eventType,
+		Project: w.Name,
+		URL:     siteURL,
+		Error:   errMsg,
+	})
+}
+
+// publishEvent records a workload-scoped deploy event in the audit log.
+// The InsertEvent + bus publish (and consistent message/metadata shape
+// across source kinds) is centralised in plugin.EmitDeployEvent so the
+// dashboard's audit feed and the per-workload timeline read identically
+// for image / compose / static / dockerfile deploys.
+func publishEvent(deps plugin.Deps, w plugin.Workload, status string) {
+	plugin.EmitDeployEvent(deps, w, "dockerfile", status)
+}
+
+// publishBuildLog emits one EventBuildLog per non-empty daemon "stream"
+// line. The trailing newline the daemon emits per line is trimmed so the
+// UI can render each event as its own row without smuggled blanks.
+// Strictly best-effort: the bus drops events under backpressure (slow
+// subscriber, no subscriber at all) and never blocks the build loop.
+func publishBuildLog(deps plugin.Deps, w plugin.Workload, line string) {
+	trimmed := strings.TrimRight(line, "\r\n")
+	if trimmed == "" {
+		return
+	}
+	deps.Events.Publish(events.Event{
+		Type: events.EventBuildLog,
+		Payload: events.BuildLogPayload{
+			WorkloadID: w.ID,
+			Line:       trimmed,
+			Stream:     "stdout",
+		},
+	})
+}
+
+// healUnchanged is the no-rebuild short-circuit result: the SHA matches and
+// the live container + proxy are healthy, so there is nothing to deploy. If a
+// prior transient failure left the persisted status as something other than
+// "deployed", repair it so the dashboard reflects reality and we stop treating
+// a healthy workload as failed. We heal via saveState directly (NOT
+// updateStatus) so this reconciliation does not fire a spurious build-success
+// notification on every poll.
+func healUnchanged(deps plugin.Deps, w plugin.Workload, prev runtimeState, latestSHA string) error {
+	slog.Info("dockerfile: no changes", "workload", w.Name, "sha", shortSHA(latestSHA))
+	if prev.Status == "deployed" {
+		return nil
+	}
+	if err := saveState(deps, w, func(rs *runtimeState, c *store.Container) {
+		rs.Status = "deployed"
+		rs.LastError = ""
+		c.State = "running"
+	}); err != nil {
+		slog.Warn("dockerfile: failed to heal stale status to deployed",
+			"workload", w.Name, "error", err)
+	}
+	return nil
+}
+
+// removeContainerByName enumerates Docker's view and best-effort drops
+// EVERY matching container so a name conflict in CreateContainer is
+// recoverable. Container names are unique per daemon, but the recovery
+// path exists precisely because a conflict occurred — a prior partial
+// deploy can leave more than one matching artifact, so we must not stop
+// at the first. Mirrors the static plugin's helper of the same name.
+func removeContainerByName(ctx context.Context, deps plugin.Deps, name string) {
+	containers, err := deps.Docker.ListContainers(ctx, nil)
+	if err != nil {
+		return
+	}
+	for _, c := range containers {
+		if c.Name == name {
+			deps.Docker.StopContainer(ctx, c.ID, 10)
+			deps.Docker.RemoveContainer(ctx, c.ID, true)
+		}
+	}
+}
+
+// primaryDomain mirrors the static plugin's helper of the same name —
+// derives an FQDN from the workload's first enabled public face, with
+// the same bare-subdomain + settings.Domain fall-through.
+func primaryDomain(deps plugin.Deps, w plugin.Workload) string {
+	for _, f := range w.PublicFaces {
+		if f.Subdomain == "" && f.Domain == "" {
+			continue
+		}
+		switch {
+		case f.Subdomain != "" && f.Domain != "":
+			return f.Subdomain + "." + f.Domain
+		case f.Subdomain == "" && f.Domain != "":
+			return f.Domain
+		case f.Subdomain != "" && f.Domain == "":
+			settings, err := deps.Store.GetSettings()
+			if err != nil || settings.Domain == "" {
+				return f.Subdomain
+			}
+			return f.Subdomain + "." + settings.Domain
+		}
+	}
+	return ""
+}
+
+// shortSHA truncates a commit SHA for log lines. Keeps the deploy log
+// readable without losing the "is this the same commit?" signal.
+func shortSHA(sha string) string {
+	if len(sha) > 8 {
+		return sha[:8]
+	}
+	return sha
+}
@@ -0,0 +1,136 @@
+// Package dockerfile implements the "dockerfile" source: a git-repo-backed
+// deployable that builds a Docker image from a user-supplied Dockerfile
+// and runs one container. This is the "self-hosted Vercel" Source —
+// users point at a Git repo containing a Dockerfile and Tinyforge
+// handles clone → build → run → proxy in one shot, with no external CI
+// pipeline.
+//
+// Architecturally the plugin sits between `static` (clones a Git repo,
+// builds an image, runs one container) and `image` (richer runtime
+// shape: ports, healthcheck, env, volumes). The deploy pipeline mirrors
+// static — same git-fetch, same image-tag/container-name shape, same
+// container-row state persistence — but the build step uses the
+// operator's Dockerfile instead of generating one.
+//
+// The full pipeline is implemented inline in this package
+// (deploy.go / teardown.go / reconcile.go) so a new dockerfile source
+// kind is usable immediately on init() — no separate registration step
+// in the deployer.
+package dockerfile
+
+import (
+	"context"
+	"encoding/json"
+	"fmt"
+	"strings"
+
+	"github.com/alexei/tinyforge/internal/workload/plugin"
+)
+
+// Config is the per-workload source config blob. Mirrors the shape of
+// the static plugin's Config so the UI wizard can largely reuse the
+// existing Git-discovery + branch-picker + repo-picker components.
+//
+// Build-side fields:
+//
+//   - DockerfilePath: path to the Dockerfile *within the context*
+//     directory. Defaults to "Dockerfile". Use e.g. "docker/Dockerfile"
+//     when the operator's repo keeps Dockerfiles in a subfolder.
+//   - ContextPath: subfolder of the cloned repo to use as the build
+//     context. Defaults to "" (repo root). Use e.g. "./api" when the
+//     repo's Dockerfile lives next to a backend service in a monorepo.
+//
+// Runtime-side fields:
+//
+//   - Port: container port the workload listens on. Required.
+//   - Healthcheck: optional curl-style probe; empty disables.
+//
+// Env vars and volume mounts are handled out-of-band via the
+// workload_env and workload_volumes tables, mirroring the image source.
+type Config struct {
+	Provider       string `json:"provider"`         // "gitea" | "github" | "gitlab"; "" = autodetect
+	BaseURL        string `json:"base_url"`         // e.g. https://git.example.com
+	RepoOwner      string `json:"repo_owner"`
+	RepoName       string `json:"repo_name"`
+	Branch         string `json:"branch"`
+	ContextPath    string `json:"context_path"`     // path within repo (root by default)
+	DockerfilePath string `json:"dockerfile_path"`  // relative to context_path; "Dockerfile" by default
+	AccessToken    string `json:"access_token"`     // encrypted; optional for public repos
+
+	Port        int    `json:"port"`
+	Healthcheck string `json:"healthcheck,omitempty"`
+
+	// ReportCommitStatus, when true, pushes the deploy outcome back to the
+	// git provider as a commit status (pending/success/failure) on the
+	// built SHA. Best-effort — a reporting failure never fails a deploy.
+	ReportCommitStatus bool `json:"report_commit_status"`
+}
+
+type source struct{}
+
+// Eager registration — the deploy pipeline lives entirely inside this
+// package, so the kind is usable as soon as init() fires.
+func init() { plugin.RegisterSource(&source{}) }
+
+func (*source) Kind() string { return "dockerfile" }
+
+func (*source) SchemaSample() any {
+	return Config{
+		Provider:       "gitea",
+		BaseURL:        "https://git.example.com",
+		RepoOwner:      "owner",
+		RepoName:       "myservice",
+		Branch:         "main",
+		ContextPath:    "",
+		DockerfilePath: "Dockerfile",
+		Port:           8080,
+	}
+}
+
+// Validate rejects obviously-malformed configs before the deploy
+// pipeline materializes a temp dir, downloads a repo, and burns
+// minutes of build time on input that was never going to work.
+func (*source) Validate(cfg json.RawMessage) error {
+	var c Config
+	if len(cfg) == 0 {
+		return fmt.Errorf("dockerfile source: config is required")
+	}
+	if err := json.Unmarshal(cfg, &c); err != nil {
+		return fmt.Errorf("dockerfile source: invalid json: %w", err)
+	}
+	if strings.TrimSpace(c.RepoOwner) == "" || strings.TrimSpace(c.RepoName) == "" {
+		return fmt.Errorf("dockerfile source: repo_owner and repo_name are required")
+	}
+	if c.Port <= 0 || c.Port > 65535 {
+		return fmt.Errorf("dockerfile source: port must be between 1 and 65535 (got %d)", c.Port)
+	}
+	// Defense in depth: a leading "/" or any ".." segment in
+	// DockerfilePath / ContextPath would escape the build context. The
+	// plugin's deploy() does its own normalization too; rejecting here
+	// gives the operator a clear error at save-time instead of a
+	// confusing "no such file" mid-build.
+	for _, p := range []string{c.DockerfilePath, c.ContextPath} {
+		if p == "" {
+			continue
+		}
+		if strings.HasPrefix(p, "/") {
+			return fmt.Errorf("dockerfile source: %q must be relative", p)
+		}
+		if strings.Contains(p, "..") {
+			return fmt.Errorf("dockerfile source: %q must not contain '..'", p)
+		}
+	}
+	return nil
+}
+
+func (*source) Deploy(ctx context.Context, deps plugin.Deps, w plugin.Workload, intent plugin.DeploymentIntent) error {
+	return deploy(ctx, deps, w, intent)
+}
+
+func (*source) Teardown(ctx context.Context, deps plugin.Deps, w plugin.Workload) error {
+	return teardown(ctx, deps, w)
+}
+
+func (*source) Reconcile(ctx context.Context, deps plugin.Deps, w plugin.Workload) error {
+	return reconcile(ctx, deps, w)
+}
--- a/Show More
+++ b/Show More
Author	SHA1	Message	Date
alexei.dolgolyov	6b45ed62bb	feat(snapshots): capture app data-volume snapshots Build / build (push) Successful in 10m59s Details Add per-workload capture of host-bind data volumes as downloadable tar.gz archives: a new internal/volsnap engine (enumerate host-bind volumes via the computeMounts merge, archive with archive/tar+gzip skipping symlinks/special files, per-workload retention + startup orphan cleanup), a volume_snapshots table + store CRUD, admin-gated API (list/snapshotable/create/download/delete), and a Snapshots panel on /apps/[id] that shows coverage and which volumes are skipped (and why). Scope: image-source apps, host-bind scopes (absolute/stage/project); Docker named volumes, tmpfs, and instance scope are surfaced as not-yet-supported. Restore is a separate later phase. Download/FilePath are containment-checked; create returns a typed no-data error (400) vs generic 500. Covered by archiver unit tests + full API e2e.	2026-06-02 14:56:10 +03:00
alexei.dolgolyov	2ba49b9bb6	chore: add .vex.toml vex index config Share the vex code-search config (semantic embeddings + auto-update on) so the index is reproducible across machines. No secrets.	2026-06-02 13:34:42 +03:00
alexei.dolgolyov	00503b4c0a	feat(cli): add tinyforge terminal client New zero-dependency Go CLI (cmd/cli) that drives the existing HTTP API: login/logout, apps list, deploy (synchronous, --timeout), logs (one-shot + -f SSE follow), and status. Caches a 24h JWT in ~/.tinyforge/config.json (0600, Chmod-enforced on overwrite); Bearer-header auth keeps the token out of server/proxy logs; no-echo password prompt (kernel32 on Windows, stty elsewhere). Server/token resolved via flags, TINYFORGE_URL/TINYFORGE_TOKEN env, or config. README CLI section + root-anchored .gitignore entries for the build output.	2026-06-02 13:34:42 +03:00
alexei.dolgolyov	97f338fba3	feat(maintenance): add Docker build-cache prune action Add an admin-only POST /api/docker/prune-build-cache endpoint plus a Settings > Maintenance danger-zone button to reclaim disk used by the Docker build cache (image + static-site builds), which previously grew unbounded with no UI lever. Prunes unused-only (all=false) so a warm cache is preserved for apps redeploying soon. Mirrors the existing prune-images vertical slice; full en/ru i18n parity.	2026-06-02 13:34:05 +03:00
alexei.dolgolyov	15e5b186cd	feat(secrets): scoped shared secrets rule-management UI (Phase 2) Completes scoped shared secrets end-to-end: /shared-secrets list/new/edit routes (mirroring metric-alert-rules) with an env-key name, a WRITE-ONLY value (password input; never pre-filled — the API returns only has_value; omitted on PATCH to keep the stored secret, provided to rotate; cleared after save), an encrypted toggle (flipping it requires re-entering the value, matching the server's 400 guard), a global\|app scope with an App-grouping picker (listApps), description, and enabled. 409 conflicts surface a friendly message. New "System" nav entry (IconKey) + api.ts client + full sharedsecrets.* i18n (en/ru parity). Reviewed: typescript APPROVE (0 CRITICAL/HIGH).	2026-05-29 16:11:46 +03:00
alexei.dolgolyov	fa6d5bd3ba	feat(secrets): scoped shared secrets — backend + API (Phase 1) Secrets defined once and applied to many workloads by scope (global or per-app), encrypted at rest and resolved into container env as a low-precedence default layer: global-shared < app-shared < image cfg.Env < workload_env. A workload with no applicable shared secrets is byte-identical to the prior workload_env-only behavior. - store: shared_secrets table + CRUD + ListApplicableSharedSecrets (enabled global + app, global-first), UNIQUE(scope,app_id,name). - plugin.ResolveSharedSecrets + integration into BuildWorkloadEnv (static/dockerfile) and image buildEnv; best-effort — a shared-secret store/decrypt error never fails a deploy, and values are never logged. - REST CRUD at /api/shared-secrets (reads authed, mutations AdminOnly); values encrypted at the boundary via crypto.Encrypt and never returned (only a has_value flag), mirroring workload_env. UNIQUE collisions 409. Compose is out of scope (YAML-defined env). Frontend rule UI is Phase 2. Reviewed: go + security APPROVE (0 CRITICAL/HIGH); two MEDIUMs fixed (translateSQLError -> 409, no driver-message leak). Deferred defense-in- depth: json:"-" on the model value + a description length cap.	2026-05-29 15:26:09 +03:00
alexei.dolgolyov	bd7a11d4e7	refactor(source): dedup shared helpers across static + dockerfile plugins Extract the verbatim-duplicated helpers into shared homes: - buildEnv -> plugin.BuildWorkloadEnv (base plugin pkg; a sourceName param preserves each plugin's slog prefix / log-scraper text) - idShort -> plugin.IDShort - commitStatusReporter -> staticsite.CommitStatusReporter, re-parameterized on primitives (owner/repo/sha/targetURL/enabled) so staticsite needs no dependency on the plugin package; reporter tests ported to staticsite (plus a new nil-provider case) containerNameFor/imageTagFor are intentionally left per-plugin: their prefixes differ (dw-site- vs tf-build-) and name real Docker resources, so merging them would risk mis-routing. Behavior-preserving; the static/dockerfile test suites pass unchanged. Reviewed: go APPROVE (0 CRITICAL/HIGH).	2026-05-29 14:57:30 +03:00
alexei.dolgolyov	7576f54e76	feat(alerts): metric-alert rule-management UI (Phase 2) Completes metric-threshold alerting end-to-end: /metric-alert-rules list/new/edit routes (mirroring log-scan-rules) with metric/comparator/ threshold fields, the workload scope picker, ToggleSwitch, and a ConfirmDialog delete flow; an api.ts MetricAlertRule CRUD client; an "Observe" nav entry; and a full metricalert.* i18n namespace (en/ru parity). Create-form cooldown defaults to 300s to match the server. Rules are now manageable in the WebUI; breaches already surface in the per-app activity timeline and fire any configured event-trigger webhook. Reviewed: typescript APPROVE (0 CRITICAL/HIGH).	2026-05-29 14:34:01 +03:00
alexei.dolgolyov	2e26f555c5	fix(dashboard): count Recent workloads by source_kind, not raw rows The Recent-workloads badge and empty-state guard used workloads.length (which includes legacy kind:site rows with empty source_kind) while the list renders pluginWorkloads (source_kind != ''). With one legacy row and no real workloads the badge showed "1" over an empty list and the empty state never appeared. Both now use pluginWorkloads, matching the list and the headline Total count.	2026-05-29 14:33:53 +03:00
alexei.dolgolyov	cdb9fd57d1	feat(alerts): metric-threshold alerting (backend + API) Operators can define metric-threshold alert rules (cpu_percent, memory_percent, memory_bytes; gt/lt) per-workload or global via /api/metric-alert-rules. A periodic evaluator (internal/metricalert, 30s tick) checks the freshest container stats sample per container against enabled rules and, on breach (per-rule-per-workload cooldown), emits into the existing event_log + bus pipeline (source "metric_alert", workload_id set). Alerts therefore surface on the global events page, the per-app activity timeline, and any configured event-trigger webhook -- no new notification plumbing. Mirrors the log_scan_rules store/API/route patterns and the stats.Collector lifecycle. Rule CRUD reads are authed, mutations AdminOnly. Frontend rule-config UI is a follow-up phase. Reviewed: go APPROVE (0 CRITICAL/HIGH).	2026-05-29 14:06:23 +03:00
alexei.dolgolyov	5c17885197	perf(reconciler): batch workloads per tick, drop redundant image inspect Load every workload once per tick into a map instead of a per-container GetWorkloadByID (N+1) in the upsert loop plus a second ListWorkloads in the plugin pass: one query per tick, zero GetWorkloadByID. The ListWorkloads error path returns before the missing-sweep so a failed load can't flip live container rows to 'missing'. image.Reconcile is now a no-op: the generic upsert+markMissing pass already syncs every labeled container's state from the single ListAllForReconciler (docker ps -a) snapshot earlier in the same tick, so the former per-container IsContainerRunning loop was N redundant Docker calls/tick. (Its no-op body sits in image.go, which landed with the preceding commit; the tests are here.) compose/static reconcile do non-redundant work and are intentionally untouched. Reviewed: go APPROVE.	2026-05-29 13:51:27 +03:00
alexei.dolgolyov	93b6911b34	feat(apps): per-app deploy/activity timeline Every deploy across all four source kinds now writes a workload-scoped event via a shared plugin.EmitDeployEvent helper (replacing the inline emit duplicated in static/dockerfile, standardizing static's metadata key site_id->workload_id, and adding emission to image+compose which were silent). New indexed event_log.workload_id column, EventLogFilter .WorkloadID, and GET /api/workloads/{id}/events (id pinned from path). Frontend: a forge "Activity" panel on /apps/[id] reusing EventLogEntry, live SSE prepend filtered by workload_id, load-more pagination, an All/Errors severity filter, and a shared toEventLogEntry mapper. en/ru i18n parity. Security: compose's failure status emits a generic reason instead of raw `docker compose up` output, which can echo app secrets and egresses to operator webhooks (NotificationURL + event-trigger actions); full detail stays only in the returned error. Rune-safe 256-rune status cap. Reviewed: go + typescript APPROVE; security HIGH fixed.	2026-05-29 13:51:17 +03:00
alexei.dolgolyov	3071cda512	feat(deploy): commit-status reporting to Git providers Report deploy status back to the Git provider as a commit status (pending/success/failure) for git-sourced workloads (static + dockerfile). - GitProvider.SetCommitStatus on gitea/github/gitlab over the existing SSRF-safe client; fixed "tinyforge" context so redeploys update one row. postJSON returns status-code-only errors (never echoes the upstream body, which a hostile provider could use to reflect the auth token into the best-effort log line). - Best-effort deploy hook: pending on deploy start, success/failure on outcome, gated on a per-workload report_commit_status flag. Never fails or blocks a deploy; emits nothing on the unchanged-SHA short-circuit. - UI ToggleSwitch (create + edit) + reportCommitStatus in sourceForms.ts + en/ru i18n. - Tests: per-provider state mapping + request shape; reporter gating (enabled/disabled/empty-SHA/nil/error-swallow). Reviewed via go-reviewer + security-reviewer (0 CRITICAL/HIGH; one MEDIUM body-echo log-leak fixed).	2026-05-29 11:37:56 +03:00
alexei.dolgolyov	410a131cec	feat(apps): stepped creation wizard, branch previews, and app-creation fixes This session (frontend focus): - Rebuild /apps/new as a 4-step wizard (Basics → Configure → Trigger → Review): WizardRail, SourceKindPicker card grid, AppManifest review, per-step validation, ConfirmDialog-based unsaved-changes guard. - Extract lib/workload/sourceForms.ts (single source of truth for source_config) + {Image,Compose,Static,Dockerfile}SourceForm + StaticDiscoveryWizard; fold the /apps/[id] edit form onto the same components (removes the duplication). Add vitest + sourceForms unit tests. - Branch preview environments UI: /chain is_preview/preview_branch + a Preview environments panel on /apps/[id] (per-branch URLs, ConfirmDialog teardown, armed state); RegistryImagePicker on the registry trigger and the image source. - Fixes: image-inspect 404 -> admin-gated POST /api/discovery/image/inspect; conflict-panel blur flicker; friendly localized discovery errors; CPU/Memory label hints; dashboard + /apps "Total workloads" count only source_kind workloads (drop stale trigger_kind gate); NPM cert/access-list name cache; EntityPicker empty-list guard. - Update CLAUDE.md frontend conventions + add a Build & Test section. Also captures pre-existing in-progress platform work (not from this session): workload notifications, Prometheus metrics export, store lockfile, health probes, backup hardening, and related store/webhook/scheduler changes.	2026-05-29 02:09:54 +03:00
alexei.dolgolyov	956943edbb	feat(proxies): per-row Triggers deep-link to /apps/[id]#bindings The proxies page now exposes the trigger bindings for each routed workload via a per-row action chip. Resolves the explicit "what's next" call-out in WORKLOAD_REFACTOR_TODO under Priority 3 polish. - Added id="bindings" to the existing trigger bindings <section> on /apps/[id]/+page.svelte so URL fragments resolve to the panel. - New triggersHref(route) helper in /proxies that builds /apps/{workload_id}#bindings; code-pointer comment explains the back-compat naming (ProxyRoute.project_id is actually the workload ID — see internal/store/models.go:110-113), so a future contributor doesn't trip on the mismatch and rip the helper out. - New right-aligned "Actions" column with a button-shaped link; defensive — falls back to — when project_id is absent. - Three new i18n keys under proxies.* (actions, viewTriggers, viewTriggersTitle) mirrored across EN + RU. Key parity now 1512 each. No backend change needed; ListProxyRoutes already selects w.id into ProxyRoute.project_id. Workload-aware batch endpoints (showing trigger counts inline) were deliberately out of scope for this half-turn — flagged as a future enhancement only if users want inline counts. Verification: svelte-check 0 errors + 3 pre-existing warnings in TagCombobox; go build + go test ./... all green across 20 packages.	2026-05-16 22:46:51 +03:00
alexei.dolgolyov	279f373f80	docs(extra_json): policy doc for containers.extra_json evolution New CODEMAPS/container-extra-json.md documents the contract every source plugin must follow when reading or writing containers.extra_json. Closes the open architectural question that was tracked in WORKLOAD_REFACTOR_TODO.md. Covers: - Schema position (column default, four write-path normalization sites) and ownership model (per-source row keys, current writers). - Reader rules: tolerate unknown keys via default json.Unmarshal, tolerate decode failure where first-class columns suffice. - Writer patterns: wholesale-overwrite (image source, single-writer short-lived rows) vs preserve-unknown-keys (static source, RMW with generic-map round-trip). Preserve-unknown-keys is the recommended default for new sources. - Concurrency: SetMaxOpenConns(1) + WAL gives atomic per-row writes and consistent reader snapshots, but does NOT serialize multi- goroutine RMW — a per-workload sync.Mutex is required for that (fenced by TestSaveState_ConcurrentWritesDoNotLoseUpdates). - What extra_json is NOT for (workload config, cross-source state, queryable data, secrets) and a checklist for adding a new field. - Pointers to every example in tree: image's containerExtra writer/ reader, static's saveState round-trip, workload_runtime.go's decode-and-tolerate consumer. WORKLOAD_REFACTOR_TODO Container.extra_json question flipped to DONE. CODEMAPS/INDEX bumped + entry linked. Reviewer pass (code-reviewer subagent) caught one HIGH factual error (wrong cross-source consumer claim) and several MEDIUM/LOW drifts; all addressed inline before commit.	2026-05-16 22:00:41 +03:00