# Performance & Database Review — `service-to-notification-bridge`

**Scope:** entire repo at `c:\Users\Alexei\Documents\service-to-notification-bridge`
**Backend:** FastAPI + SQLAlchemy async + SQLModel on SQLite (Postgres-compatible URL, but only SQLite branch is exercised in code).
**Frontend:** SvelteKit 5 (runes) static build served by the same FastAPI process.
**Reviewer:** Claude Opus 4.7 (1M context)

---

## Executive summary

1. **Indexing is in good shape.** FK columns and the dashboard/webhook hot paths have explicit composite indexes (`ix_event_log_user_created`, `ix_event_log_user_event_type_created`, `ix_deferred_dispatch_status_fire_at`, partial `ux_deferred_dispatch_pending`). The bulk of the "missing index" risk is already mitigated.
2. **No real migration tool.** The project runs a hand-rolled, 1880-line, idempotent migration script on every boot. It works, but it's brittle, slow on cold start, has no down-migrations, and the table-rebuild branches lose indexes silently. Move to Alembic before the next major schema change.
3. **`create_all` is still the source-of-truth for new schemas** (engine.py:63). That's an anti-pattern next to migration tooling: schema drift can silently appear between fresh installs and upgraded installs.
4. **Two real N+1 risks remain.** `_tracker_response` (notification_trackers.py:286-291) calls `_tt_response` per link, and `_refresh_telegram_chat_titles` (scheduler.py:229) issues per-chat `getChat` calls without bot-level batching guards. The big one in `load_link_data` was already fixed (good).
5. **SQLite PRAGMAs are mostly right but pool sizing is wrong.** WAL, `synchronous=NORMAL`, FK enforcement, busy_timeout, temp_store=MEMORY are all set. Missing: `cache_size`, `mmap_size`. The async engine uses SQLAlchemy's default pool with multiple writer connections — under WAL that still serializes, but it raises spurious BUSY pressure on long transactions (see #M3).
6. **Event-log retention exists and is correct** (30-day default, cron at 03:00 UTC), but `retention_days=0` disables it silently and there is no archival, no per-tenant cap, no row-count metric exposed to operators.
7. **Memory leak risk: `_dirty_bots`, `_last_update_id`, `_last_webhook_reclaim_at`, `_adaptive_state`, `_adaptive_max_skip`** in command_sync.py, telegram_poller.py, scheduler.py are unbounded module-level dicts. In a long-running process they grow without ever shrinking when entities are deleted.
8. **Frontend has no virtualization on long lists** — dashboard event stream, tracker history, target list. On a tenant with thousands of events the dashboard `{#each status.recent_events}` (with `(event.id)` key) still renders the whole page-set into DOM and re-runs derivations on every refresh.

---

## CRITICAL

### C1. `create_all` is the schema-of-record for new installs ([engine.py:60](packages/server/src/notify_bridge_server/database/engine.py))

```python
async def init_db() -> None:
    engine = get_engine()
    async with engine.begin() as conn:
        await conn.run_sync(SQLModel.metadata.create_all)
```

**What's wrong:** `init_db()` runs unconditionally on every boot before the migration script. New installs get the *current* model's CREATE TABLE statements — including FK declarations like `ondelete=SET NULL` — while upgraded installs only get what the (one-way) `migrate_*` scripts manage to inject via `ALTER TABLE`. Several migrations explicitly admit "this only takes effect on freshly created tables" (e.g. `migrate_eventlog_provider_fk` is a documented no-op). That means **the schema drift between a fresh install and a 6-month-old install is real and undocumented.**

**Impact:** stability — subtle bugs that reproduce only on upgraded installs (FK enforcement, cascade behavior, partial UNIQUE indexes); ops — restoring a backup from a fresh install onto an upgraded box, or vice-versa, can change observable behaviour.

**Fix:**
1. Adopt Alembic with autogenerate-from-models, lock the baseline migration to the current `SQLModel.metadata`, and stop calling `create_all` in production startup.
2. Keep the hand-rolled `migrate_*` chain as legacy data-migrations only (idempotent, runs once, then removed).
3. Add a CI check: spin up empty DB → run migrations → diff against `SQLModel.metadata` → fail if non-empty.

---

### C2. `migrate_schema` runs ~30+ idempotent `PRAGMA table_info` + ALTER probes on every cold start ([migrations.py:67-427](packages/server/src/notify_bridge_server/database/migrations.py))

`_has_column` issues a `PRAGMA table_info('<table>')` per check; `migrate_schema` calls it dozens of times serially inside one transaction. On a cold start this is the dominant boot latency. Worse, it forces a write txn on every boot even when nothing changes (because each migration opens `engine.begin()`).

**Impact:** startup cost — visible on Raspberry-Pi / NAS deployments; SQLite WAL checkpoint pressure on every boot when nothing changed; readiness probe grace window must accommodate this.

**Fix:**
1. Wire `schema_version` (already exists, `CURRENT_SCHEMA_VERSION=1`) as a real short-circuit — at the top of every `migrate_*`, return immediately if `schema_version >= N` for that migration.
2. Cache `PRAGMA table_info` results within a single migration run.
3. Better long-term: replace with Alembic; you already have the version table.

---

### C3. `_install_sqlite_pragmas` only fires on engine-pool `connect`, not when SQLAlchemy reuses pooled connections from a different event loop ([engine.py:18-38](packages/server/src/notify_bridge_server/database/engine.py))

The `@event.listens_for(engine.sync_engine, "connect")` hook only runs at connection creation. The default `aiosqlite` pool reuses connections — that's fine — but `connect_args["timeout"]=30` clashes with the in-PRAGMA `busy_timeout=10000` (10 s). Two different timeout settings is confusing and the lower wins.

**Impact:** stability under contention — under sustained writer contention you get `SQLITE_BUSY` *much* sooner than expected. The 30-s connect_args timeout is for connection *open*, the 10-s busy_timeout is what governs lock contention; users see "database is locked" errors after 10 s, not 30.

**Fix:** standardize on busy_timeout (raise to 30 s to match `connect_args`, or drop one and keep the other). Document the chosen value in a constant. Also add:

```python
cur.execute("PRAGMA cache_size=-65536")   # 64 MiB negative = kibibytes
cur.execute("PRAGMA mmap_size=268435456") # 256 MiB
cur.execute("PRAGMA wal_autocheckpoint=1000")
```

The 100k-asset album write pattern (`asset_ids` JSON blob) benefits significantly from a larger page cache and mmap; current defaults force a lot of SQLite-internal I/O.

---

## HIGH

### H1. Frontend dashboard event-stream lacks virtualization & double-fetches on filter changes ([+page.svelte:739](frontend/src/routes/+page.svelte))

`{#each status.recent_events as event, i (event.id)}` is keyed (good), but the page renders every event row with rich nested components (`EventDetailModal`, `MdiIcon`, etc.) for every paginate-back/forward. There's no row virtualization and the same data fetches re-run on every filter mutation (search input has a 300 ms debounce in `onSearchInput`, but `filterEventType`, `filterProviderId`, `filterSort`, `refreshSeconds` do not).

**Impact:** UX — choppy on tenants with 50+ events/page, perceptible filter-flicker; CPU — derivation cost on every status refresh.

**Fix:**
1. Wrap the events list in a tiny windowing component (svelte-virtual or a simple offset/limit windowed view — the API already supports it).
2. Debounce the entire filter-change branch, not just the search input (`$effect(() => { if (settled) { reload() }})` with a 100 ms guard).
3. The provider count map (`provider_event_counts`) is computed server-side for *all* matching events on every page request; cache it for `(user_id, filters)` in a 30-s in-memory dict server-side (see also #M2).

---

### H2. `provider_event_counts` aggregate query runs unbounded GROUP BY on every dashboard request ([status.py:84-103](packages/server/src/notify_bridge_server/api/status.py))

```python
provider_counts_query = (
    select(
        EventLog.provider_id,
        EventLog.provider_name,
        func.sum(func.coalesce(EventLog.assets_count, 1)).label("total"),
    )
    .where(EventLog.user_id == user.id)
    .group_by(EventLog.provider_id, EventLog.provider_name)
)
```

Every dashboard load (every 10–60 s by default — see `refreshIntervalItems`) runs `GROUP BY provider_id, provider_name` over *every* event the user ever owned. At 90 days × ~1 event/min/tracker this is hundreds of thousands of rows scanned per refresh per logged-in user.

**Impact:** latency — SQLite forces a full table scan + sort here because the only composite index is `(user_id, event_type, created_at DESC)`; cost — burns CPU on the bridge box for a metric that changes very slowly.

**Fix:**
1. Add `ix_event_log_user_provider (user_id, provider_id)` so the GROUP BY can be index-only.
2. Cache the result for `(user_id, filter_signature)` for 30 s in the same in-memory cache as #H1.
3. Long-term: materialize per-provider counts into an `event_counter` table maintained by triggers or an APScheduler job. The dashboard then reads at most a dozen rows.

---

### H3. `_tracker_response` issues one query per tracker-target link ([notification_trackers.py:286-291](packages/server/src/notify_bridge_server/api/notification_trackers.py))

```python
async def _tracker_response(session: AsyncSession, t: NotificationTracker) -> dict:
    result = await session.exec(
        select(NotificationTrackerTarget).where(NotificationTrackerTarget.tracker_id == t.id)
    )
    tracker_targets = [await _tt_response(session, tt) for tt in result.all()]
```

`_tt_response` (in notification_tracker_targets.py:12 — has 12 distinct `select`/`session.get` references) issues per-link follow-up SELECTs. Called from `create`, `update`, `delete` and `trigger` for a single tracker, so the practical N is small — but `_tt_response` is also called inside the bulk `list_notification_trackers` loop's downstream consumers, and any future bulk endpoint will multiply this badly.

**Impact:** latency on POST/PATCH responses; future regression risk.

**Fix:** rewrite `_tt_response` to accept pre-fetched maps (mirror the pattern in `dispatch_helpers.load_link_data`). Or, simpler: write a single eager-load helper using `selectinload(NotificationTrackerTarget.target)` once `relationship()` mappers are declared on the models.

---

### H4. `load_link_data` does not eagerly load target.config related entities — relies on `dict(target.config)` snapshotting ([dispatch_helpers.py:539-747](packages/server/src/notify_bridge_server/services/dispatch_helpers.py))

The function batch-loads receivers, telegram_chats, email_bots, matrix_bots up-front, but the broadcast-expansion branch in the active_links loop still issues `_resolve_target` per child target (line 715). That `_resolve_target` is called with all the pre-fetched maps, so it doesn't *query* per call — but it does build a fresh `target_config` dict per child. With a broadcast target containing 50 children fanning out 100 events/min this is constant garbage collection pressure.

**Impact:** GC pressure under load; not a correctness problem.

**Fix:** none required short-term. Long-term, add `selectinload` declarations on the relationship model so SQLAlchemy can co-fetch the chain. The code path is already well-batched.

---

### H5. `aiohttp.ClientSession` is constructed per-call inside `NotificationDispatcher._session_ctx` when no shared session is provided ([dispatcher.py:117-123](packages/core/src/notify_bridge_core/notifications/dispatcher.py))

```python
@contextlib.asynccontextmanager
async def _session_ctx(self) -> AsyncIterator[aiohttp.ClientSession]:
    if self._shared_session is not None and not self._shared_session.closed:
        yield self._shared_session
        return
    async with _new_session() as session:
        yield session
```

In server-side code paths (watcher, event_dispatch, deferred_dispatch) a shared session is always passed in, so this is harmless. But unit tests, the CLI, and any direct library user that instantiates `NotificationDispatcher` without a session pays the cost. Worse, the per-dispatch session creates a fresh TCP pool, fresh DNS resolver — defeating connection reuse to Telegram / Discord webhook hosts.

**Impact:** test slowness; correctness if a non-server consumer ever ships.

**Fix:** require the `session` parameter (`session: aiohttp.ClientSession` not `| None`). Or have the dispatcher lazily attach to a module-level `_default_session` cached by event loop id.

---

### H6. `WebhookPayloadLog` is pruned per-insert via a sub-select but the prune query has no UNIQUE/partial protection against duplicate inserts ([webhooks.py:404-418](packages/server/src/notify_bridge_server/api/webhooks.py))

The "keep newest `max_count` per provider, delete the rest" pattern uses `select(...).order_by(created_at DESC).limit(max_count)` as a subquery. Under SQLite this materializes the top-N then negates it — fine when max_count is 20. But this runs on every inbound webhook. For a busy Gitea/HA installation that's 60+ writes/min, each with a delete-by-sub-select. The `ix_webhook_payload_log_provider_created` index makes the read cheap, but the DELETE still rewrites pages.

**Impact:** write amplification on busy webhook tenants.

**Fix:** keep the prune but make it probabilistic — only run with `random.random() < 0.1` (10% chance per insert). The cap still holds in steady state, but the per-write cost drops 10×.

---

### H7. No retention/archival for `notification_tracker_state` and `deferred_dispatch` "fired"/"dropped" rows ([scheduler.py:332-364](packages/server/src/notify_bridge_server/services/scheduler.py))

`_cleanup_old_events` deletes `event_log`, `webhook_payload_log`, `action_execution` older than retention days. `deferred_dispatch` rows with `status IN ('fired', 'dropped')` are never deleted. `notification_tracker_state.asset_ids` for an immich tracker watching a deleted collection is also never reaped.

**Impact:** unbounded growth on long-running installs; `asset_ids` JSON blobs can be megabytes per collection.

**Fix:** extend `_cleanup_old_events` to also delete `DeferredDispatch.status != 'pending' AND fired_at < cutoff`. Add a separate housekeeping job that prunes `NotificationTrackerState` rows whose `collection_id` is no longer in `NotificationTracker.collection_ids`.

---

## MEDIUM

### M1. Sentinel value `bot_id=0` is a footgun ([models.py:69-73](packages/server/src/notify_bridge_server/database/models.py))

```python
# bot_id=0 is a sentinel meaning "Telegram has not yet returned a numeric
# ID for this bot" (i.e. token never validated). Multiple unverified bots
# may legitimately carry 0, so we only enforce uniqueness for non-sentinel
# values via a partial index added in migrate_uniqueness_constraints.
bot_id: int = Field(default=0, index=True)
```

Sentinel values on indexed columns hurt index selectivity (every unvalidated bot is the same row from the planner's perspective) and create maintenance burden. Worse, every code path that looks up by `bot_id` must remember to filter `bot_id != 0`.

**Impact:** maintainability; latent bug surface (one missed `!= 0` filter and an unverified bot is silently re-used).

**Fix:** change `bot_id: int | None` defaulting to None, drop the sentinel.

---

### M2. No request-scoped cache for `user.id` lookups inside one request ([api/*.py, throughout](packages/server/src/notify_bridge_server/api/))

The same `get_current_user` dependency runs JWT validation + a `session.get(User, id)` on every request. Many endpoints then do their *own* `user.id`-filtered SELECTs. There is no per-request memoization of the User row.

**Impact:** one extra SELECT per request, mostly noise — but it's free to fix.

**Fix:** in `get_current_user`, cache the User on `request.state.user`. Routes that take `user: User = Depends(...)` are unchanged.

---

### M3. SQLAlchemy async pool defaults serialize SQLite writers but the engine allows multiple connections ([engine.py:41-57](packages/server/src/notify_bridge_server/database/engine.py))

`create_async_engine` for SQLite defaults to a `StaticPool` of size 1 in newer SQLAlchemy versions, but older versions / different `aiosqlite` versions can default to `NullPool` (one connection per request) or a small QueuePool. The code does not pin this explicitly. Under WAL, multiple readers are fine but only one writer can hold the txn at a time — so a slow writer just makes other connections block on `busy_timeout`.

**Impact:** unpredictable behaviour across SQLAlchemy versions; sporadic `SQLITE_BUSY` under load.

**Fix:** explicitly configure the pool:

```python
from sqlalchemy.pool import StaticPool, AsyncAdaptedQueuePool

_engine = create_async_engine(
    url,
    echo=settings.debug,
    pool_pre_ping=True,
    connect_args=connect_args,
    poolclass=AsyncAdaptedQueuePool,
    pool_size=5,
    max_overflow=10,
    pool_recycle=3600,
)
```

For Postgres compatibility leave these as-is; for SQLite the right value is `StaticPool` + `connect_args={"check_same_thread": False}` to share one connection across the event loop (this is the supabase/pgbouncer pattern adapted for sqlite-async).

---

### M4. `_refresh_telegram_chat_titles` issues per-chat HTTP without per-bot bucketing ([scheduler.py:229-329](packages/server/src/notify_bridge_server/services/scheduler.py))

The job builds `tasks` as a flat list across all bots and runs them under a global `Semaphore(10)`. A bot with 50 chats and a slow Telegram response (rare but happens) can monopolize all 10 slots, starving every other bot. The semaphore should be per-bot.

**Impact:** the daily refresh can take much longer than intended on a multi-bot install with one degraded bot.

**Fix:** create one semaphore per bot:

```python
sems = {bot_id: asyncio.Semaphore(_CHAT_SYNC_CONCURRENCY) for bot_id in bot_tokens}
```

---

### M5. `event_log.collection_name.contains(search)` triggers full table scan on filter ([status.py:69-75](packages/server/src/notify_bridge_server/api/status.py))

The dashboard search input runs four `.contains(search)` clauses ORed together — these become `LIKE '%search%'` and cannot use a regular B-tree index. With 100k+ event_log rows the dashboard search becomes a multi-second operation.

**Impact:** UX — search feels broken on large installs; CPU on the bridge box.

**Fix:**
1. Limit the search to the most recent N days (e.g. retention/3) — most users only search recent events.
2. Add a SQLite FTS5 virtual table mirroring event_log's text columns, sync via triggers. Searches use `MATCH 'foo'` which is sub-millisecond on million-row tables.

---

### M6. `DeferredDispatch.event_payload` JSON blob can grow unbounded per row ([models.py:639-659](packages/server/src/notify_bridge_server/database/models.py), [deferred_dispatch.py:188-298](packages/server/src/notify_bridge_server/services/deferred_dispatch.py))

The asset-coalescing union path appends every new asset's full dict (filename, urls, tags, extra metadata) into `event_payload["added_assets"]`. A mass-import that adds 50k photos during a quiet window means one DeferredDispatch row with 50k asset entries.

**Impact:** memory blow-up at drain time (the whole JSON is parsed via `deserialize_event` into a Python list of `MediaAsset` dataclasses); could trip the drain timeout (`_DRAIN_DISPATCH_TIMEOUT_SECONDS=120`) on legitimate workloads.

**Fix:** cap the union at e.g. 500 assets per row; when crossed, emit a "more_truncated" sentinel into `payload["extra"]` so the rendered template can show "+45000 more". The `apply_tracking_display_filters` `max_assets_to_show` does cap it for delivery, but the *stored* payload is uncapped.

---

### M7. Per-tick `await get_app_timezone(session)` reads from the DB on every dispatch ([dispatch_helpers.py:146-150](packages/server/src/notify_bridge_server/services/dispatch_helpers.py))

Each tracker tick, each webhook, each defer evaluation calls `get_app_timezone` which calls `get_setting(session, "timezone")` which is a SELECT. The timezone setting rarely changes (manual setting), but the SELECT runs constantly.

**Impact:** noise on otherwise good caching.

**Fix:** cache the timezone in a module-level `(value, expires_at)` tuple with 60-s TTL, invalidated by `reschedule_cron_jobs_for_timezone_change`.

---

### M8. Unbounded in-memory dictionaries with no TTL or capacity ([scheduler.py:67-72](packages/server/src/notify_bridge_server/services/scheduler.py), [telegram_poller.py:31-35](packages/server/src/notify_bridge_server/services/telegram_poller.py), [command_sync.py:25](packages/server/src/notify_bridge_server/services/command_sync.py))

```python
_adaptive_state: dict[int, dict[str, int]] = {}
_adaptive_max_skip: dict[int, int] = {}
_last_update_id: dict[int, int] = {}
_last_webhook_reclaim_at: dict[int, float] = {}
_dirty_bots: dict[int, float] = {}
```

Each is keyed by tracker_id / bot_id. When a tracker or bot is deleted, the cleanup paths (`unschedule_tracker`, etc.) do remove some entries — but not all. `_last_update_id`, `_last_webhook_reclaim_at` are never cleared on bot deletion.

**Impact:** slow memory leak in long-running processes that create+delete trackers/bots frequently (e.g. test environments).

**Fix:** on tracker/bot deletion, explicitly clear all module dicts that key by that id. Or, simpler, switch each to `weakref.WeakValueDictionary` once the entity has a Python object representation, or to a TTLCache.

---

### M9. Bulk insert pattern in migrations uses one-statement-per-row ([migrations.py:566-588](packages/server/src/notify_bridge_server/database/migrations.py))

`migrate_tracker_targets` issues `INSERT INTO ... VALUES (...)` per row in a Python for-loop. On a tenant with 10k+ legacy rows this is slow even inside a single transaction.

**Impact:** one-shot, but rough on upgrade for big tenants.

**Fix:** use `executemany` / batch INSERTs:

```python
await conn.execute(text("INSERT INTO ... VALUES (...)"), batch_params)
```

This is mostly historical (the migration is idempotent and skipped on subsequent runs), but worth fixing if you're touching the file.

---

### M10. Missing index on `notification_tracker_state(notification_tracker_id, collection_id)` ([models.py:454-478](packages/server/src/notify_bridge_server/database/models.py))

`check_tracker` reads state per tracker; the existing `ix_notification_tracker_state.notification_tracker_id` index (declared via `index=True`) supports that. But every state read is `WHERE tracker_id = ? AND collection_id = ?` (implicitly via the resulting dict). A composite would help; SQLite can do index-only scans here.

**Impact:** small. SQLite's index intersection plus the fact that one tracker typically has <20 collections makes this a minor win.

**Fix:** add `(notification_tracker_id, collection_id)` composite index to the `_INDEXES` list.

---

## LOW

### L1. `SELECT *` semantics from `select(Model)` ORM is unavoidable but verbose ([throughout services/, api/])

SQLModel's `select(ModelClass)` is effectively `SELECT all columns`. For wide rows like `TrackingConfig` (~70 columns of boolean flags) that's a lot of bytes per dispatch evaluation. There are no API list endpoints that return `TrackingConfig` from a hot path, so this is mostly cosmetic — but for pages that only need a handful of columns (e.g. `status.py`'s `tracker_id, name` map) the explicit-column form is already used. Continue that pattern.

---

### L2. `EventLog.details` JSON dict is reconstructed on every dashboard read ([status.py:258](packages/server/src/notify_bridge_server/api/status.py))

`details: e.details or {}` serializes the JSON every time. SQLite returns this as a parsed Python dict already (JSON column), so the cost is low; just a note that this is a hot path.

---

### L3. `event_log.collection_id` and `details` have no indexes; some webhook commands filter on them ([commands/immich/events.py:43](packages/server/src/notify_bridge_server/commands/immich/events.py))

The history-by-tracker endpoint uses the composite `ix_event_log_user_event_type_created` plus a hit on `notification_tracker_id` — fine. But `events.py`'s "last assets_added for this collection" queries (`event_type='assets_added' AND collection_id=?`) cannot use any current index optimally.

**Fix:** add `(event_type, collection_id, created_at DESC)` if these queries are called by users frequently (Telegram `/assets <album>` etc.).

---

### L4. JSON column types not declared with `JSONB` semantics ([models.py: many](packages/server/src/notify_bridge_server/database/models.py))

SQLite has only `JSON` (text storage with `json_valid` checks). On Postgres you'd want `JSONB`. The codebase uses `Column(JSON)` from SQLModel which maps to native `JSONB` on Postgres — that's correct. No action needed.

---

### L5. The `setup` lifespan runs migrations *inside* the FastAPI lifespan synchronously ([main.py:62-122](packages/server/src/notify_bridge_server/main.py))

The migrations + seeds + scheduler boot all run before `_READY = True`. On a cold start with a big DB this can take 10+ s during which `/api/ready` returns 503. That's correct, but `/api/health` is also un-reachable because uvicorn hasn't started the workers yet (lifespan blocks startup). For orchestrators that probe `/api/health`, this means startup-grace must be tuned.

**Fix:** start the HTTP listener first, run migrations as a background task, expose readiness flag through `/api/ready` only.

---

### L6. `ServiceProvider.config`, `NotificationTarget.config`, `Tracker.filters` JSON columns store secrets unencrypted ([models.py:42, 349, 399](packages/server/src/notify_bridge_server/database/models.py))

API keys, refresh tokens, webhook secrets, SMTP passwords all live in `config` JSON. Visible to anyone with DB read access. This is a known design trade-off (`backup_secrets_mode` controls export behaviour) but worth flagging.

**Fix:** out of scope for this review; consider an at-rest encryption layer keyed off `secret_key` (Fernet) for `config["api_key"]`, `config["password"]`, `access_token`, etc. — but only if your threat model justifies the operational cost.

---

### L7. Frontend `caches.svelte.ts` has 30-s TTL but no cross-tab invalidation ([entity-cache.svelte.ts:14](frontend/src/lib/stores/entity-cache.svelte.ts))

Two browser tabs editing the same entity will see stale data for up to 30 s in the other tab. No `BroadcastChannel` listener.

**Fix:** add a `BroadcastChannel('notify-bridge-cache')` that calls `cache.invalidate()` on receipt. ~15 lines.

---

### L8. `providersCache.invalidate(); await load()` is two-step ([providers/+page.svelte:238, 250](frontend/src/routes/providers/+page.svelte))

`invalidate()` + immediate `fetch(true)` race against any in-flight request; the deduplication map handles it, but the explicit `await load()` is essentially `fetch(true)` directly. Simpler:

```typescript
providersCache.set(updatedList);  // or fetch(true)
```

Cosmetic.

---

### L9. `details["dispatch_status"]` is a string enum but not declared as one ([deferred_dispatch.py:619-624](packages/server/src/notify_bridge_server/services/deferred_dispatch.py))

`dispatch_status` takes values `"deferred"`, `"deferred_then_dropped"`, `"deferred_then_failed"`, `"delivered_after_quiet_hours"`, `"dropped_quiet_hours_nondeferrable"`. They're scattered as string literals. The dashboard renders them.

**Fix:** declare an `Enum` once and import from both server and frontend types.

---

### L10. No DB connection used by `/api/health` ([main.py:270-274](packages/server/src/notify_bridge_server/main.py))

`/api/health` returns instantly without checking the DB. That's correct for a liveness probe but the comment doesn't match common practice ("liveness = process up"). Pair this with #L5: orchestrators using `/api/health` for warm-up will mark the pod ready while migrations are still running.

**Fix:** keep liveness lightweight, document the readiness probe as the warm-up gate.

---

## Notes on what's already good

- Performance indexes (`_INDEXES` list) cover all the right hot paths.
- Composite `(status, fire_at)` index on `deferred_dispatch` plus partial unique `(link_id, collection_id, event_type) WHERE status='pending'` prevents the worst races.
- `load_link_data` is fully batched — the most complex hot path in the codebase looks clean.
- Shared `aiohttp.ClientSession` with DNS-rebinding-safe `PinnedResolver` is production-grade.
- Pre-migration `VACUUM INTO` snapshot is the right safety net for a hand-rolled migration chain.
- APScheduler defaults (`coalesce=True`, `misfire_grace_time=300`, `max_instances=1`) are correct production settings.
- Adaptive polling (skip-N-of-K when idle) with jitter is a thoughtful 4-tier scheduling design.
- Tracker cache (5-s TTL with explicit invalidation) and rendered-message per-locale cache are good fan-out optimizations.
- Migration idempotency is genuinely well-handled despite the rough tooling.
- Frontend `entity-cache` deduplication of in-flight requests is the right pattern.

---

## Priority recommendations (next 30 days)

1. **Adopt Alembic** (C1, C2) — eliminate `create_all` from prod, baseline the current schema, lock down new schema changes through autogenerate.
2. **Fix the dashboard aggregate query** (H1, H2, M5) — add the missing composite index, server-side cache the per-provider aggregate, virtualize the event list. This is the single biggest user-visible perf win.
3. **Cap `DeferredDispatch.event_payload` size + add retention for fired/dropped rows** (M6, H7) — closes off the worst-case memory and growth scenarios.
4. **Cleanup module-level dicts on entity deletion** (M8) — small fix, prevents a slow leak.
5. **Standardize SQLite PRAGMAs and pool config** (C3, M3) — predictable behaviour, fewer spurious BUSY errors.

---

*Reviewed against codebase at HEAD (`a20635a`).*