Files
notify-bridge/.claude/reviews/performance-db-review.md
alexei.dolgolyov 6a8f374678 feat: observability, per-receiver Telegram options, oversized-video fallback
Operability:
- Correlation IDs end-to-end: shared dispatch_id between log lines and
  EventLog rows (event/watcher/scheduled/deferred/action/HA/command paths)
  and a new X-Request-Id middleware that normalizes inbound ids and binds
  request_id into log context.
- dispatch_summary block merged into EventLog.details: per-target
  success/failure counts plus Telegram media delivered/skipped/failed and
  truncated error lists, so partial outcomes surface in the UI.
- Diagnostic mode: admin can flip one module to DEBUG for a bounded
  window with auto-revert (in-memory only; setup_logging() resets on
  boot, lifespan reverts on shutdown). New /diagnostic-mode endpoints
  plus DiagnosticsCassette UI on the settings page.

Telegram:
- Per-receiver options: disable_notification (silent send) and
  message_thread_id (forum-topic routing), wired through the dispatcher
  via a ContextVar so all four send sites (sendMessage / sendPhoto-Video-
  Document / sendMediaGroup / cache-hit POST) pick them up.
- send_large_videos_as_documents target setting: bypass the 50 MB
  sendVideo cap by falling back to sendDocument for oversized videos.
- sendMediaGroup byte-budget enforcement (TELEGRAM_MAX_GROUP_TOTAL_BYTES,
  45 MB) with per-item fallback on chunk failure so a stale file_id no
  longer silently drops a cached asset.

Tests:
- New: diagnostic_mode, dispatch_summary, request_correlation,
  telegram_media_group_partial, telegram_per_send_options.

Docs:
- .claude/reviews/: six-axis production-readiness review of v0.8.1.
- .claude/docs/functional-review-2026-05-28.md: focused review of
  Telegram/Immich/logging subsystems.
2026-05-28 15:19:31 +03:00

30 KiB
Raw Permalink Blame History

Performance & Database Review — service-to-notification-bridge

Scope: entire repo at c:\Users\Alexei\Documents\service-to-notification-bridge Backend: FastAPI + SQLAlchemy async + SQLModel on SQLite (Postgres-compatible URL, but only SQLite branch is exercised in code). Frontend: SvelteKit 5 (runes) static build served by the same FastAPI process. Reviewer: Claude Opus 4.7 (1M context)


Executive summary

  1. Indexing is in good shape. FK columns and the dashboard/webhook hot paths have explicit composite indexes (ix_event_log_user_created, ix_event_log_user_event_type_created, ix_deferred_dispatch_status_fire_at, partial ux_deferred_dispatch_pending). The bulk of the "missing index" risk is already mitigated.
  2. No real migration tool. The project runs a hand-rolled, 1880-line, idempotent migration script on every boot. It works, but it's brittle, slow on cold start, has no down-migrations, and the table-rebuild branches lose indexes silently. Move to Alembic before the next major schema change.
  3. create_all is still the source-of-truth for new schemas (engine.py:63). That's an anti-pattern next to migration tooling: schema drift can silently appear between fresh installs and upgraded installs.
  4. Two real N+1 risks remain. _tracker_response (notification_trackers.py:286-291) calls _tt_response per link, and _refresh_telegram_chat_titles (scheduler.py:229) issues per-chat getChat calls without bot-level batching guards. The big one in load_link_data was already fixed (good).
  5. SQLite PRAGMAs are mostly right but pool sizing is wrong. WAL, synchronous=NORMAL, FK enforcement, busy_timeout, temp_store=MEMORY are all set. Missing: cache_size, mmap_size. The async engine uses SQLAlchemy's default pool with multiple writer connections — under WAL that still serializes, but it raises spurious BUSY pressure on long transactions (see #M3).
  6. Event-log retention exists and is correct (30-day default, cron at 03:00 UTC), but retention_days=0 disables it silently and there is no archival, no per-tenant cap, no row-count metric exposed to operators.
  7. Memory leak risk: _dirty_bots, _last_update_id, _last_webhook_reclaim_at, _adaptive_state, _adaptive_max_skip in command_sync.py, telegram_poller.py, scheduler.py are unbounded module-level dicts. In a long-running process they grow without ever shrinking when entities are deleted.
  8. Frontend has no virtualization on long lists — dashboard event stream, tracker history, target list. On a tenant with thousands of events the dashboard {#each status.recent_events} (with (event.id) key) still renders the whole page-set into DOM and re-runs derivations on every refresh.

CRITICAL

C1. create_all is the schema-of-record for new installs (engine.py:60)

async def init_db() -> None:
    engine = get_engine()
    async with engine.begin() as conn:
        await conn.run_sync(SQLModel.metadata.create_all)

What's wrong: init_db() runs unconditionally on every boot before the migration script. New installs get the current model's CREATE TABLE statements — including FK declarations like ondelete=SET NULL — while upgraded installs only get what the (one-way) migrate_* scripts manage to inject via ALTER TABLE. Several migrations explicitly admit "this only takes effect on freshly created tables" (e.g. migrate_eventlog_provider_fk is a documented no-op). That means the schema drift between a fresh install and a 6-month-old install is real and undocumented.

Impact: stability — subtle bugs that reproduce only on upgraded installs (FK enforcement, cascade behavior, partial UNIQUE indexes); ops — restoring a backup from a fresh install onto an upgraded box, or vice-versa, can change observable behaviour.

Fix:

  1. Adopt Alembic with autogenerate-from-models, lock the baseline migration to the current SQLModel.metadata, and stop calling create_all in production startup.
  2. Keep the hand-rolled migrate_* chain as legacy data-migrations only (idempotent, runs once, then removed).
  3. Add a CI check: spin up empty DB → run migrations → diff against SQLModel.metadata → fail if non-empty.

C2. migrate_schema runs ~30+ idempotent PRAGMA table_info + ALTER probes on every cold start (migrations.py:67-427)

_has_column issues a PRAGMA table_info('<table>') per check; migrate_schema calls it dozens of times serially inside one transaction. On a cold start this is the dominant boot latency. Worse, it forces a write txn on every boot even when nothing changes (because each migration opens engine.begin()).

Impact: startup cost — visible on Raspberry-Pi / NAS deployments; SQLite WAL checkpoint pressure on every boot when nothing changed; readiness probe grace window must accommodate this.

Fix:

  1. Wire schema_version (already exists, CURRENT_SCHEMA_VERSION=1) as a real short-circuit — at the top of every migrate_*, return immediately if schema_version >= N for that migration.
  2. Cache PRAGMA table_info results within a single migration run.
  3. Better long-term: replace with Alembic; you already have the version table.

C3. _install_sqlite_pragmas only fires on engine-pool connect, not when SQLAlchemy reuses pooled connections from a different event loop (engine.py:18-38)

The @event.listens_for(engine.sync_engine, "connect") hook only runs at connection creation. The default aiosqlite pool reuses connections — that's fine — but connect_args["timeout"]=30 clashes with the in-PRAGMA busy_timeout=10000 (10 s). Two different timeout settings is confusing and the lower wins.

Impact: stability under contention — under sustained writer contention you get SQLITE_BUSY much sooner than expected. The 30-s connect_args timeout is for connection open, the 10-s busy_timeout is what governs lock contention; users see "database is locked" errors after 10 s, not 30.

Fix: standardize on busy_timeout (raise to 30 s to match connect_args, or drop one and keep the other). Document the chosen value in a constant. Also add:

cur.execute("PRAGMA cache_size=-65536")   # 64 MiB negative = kibibytes
cur.execute("PRAGMA mmap_size=268435456") # 256 MiB
cur.execute("PRAGMA wal_autocheckpoint=1000")

The 100k-asset album write pattern (asset_ids JSON blob) benefits significantly from a larger page cache and mmap; current defaults force a lot of SQLite-internal I/O.


HIGH

H1. Frontend dashboard event-stream lacks virtualization & double-fetches on filter changes (+page.svelte:739)

{#each status.recent_events as event, i (event.id)} is keyed (good), but the page renders every event row with rich nested components (EventDetailModal, MdiIcon, etc.) for every paginate-back/forward. There's no row virtualization and the same data fetches re-run on every filter mutation (search input has a 300 ms debounce in onSearchInput, but filterEventType, filterProviderId, filterSort, refreshSeconds do not).

Impact: UX — choppy on tenants with 50+ events/page, perceptible filter-flicker; CPU — derivation cost on every status refresh.

Fix:

  1. Wrap the events list in a tiny windowing component (svelte-virtual or a simple offset/limit windowed view — the API already supports it).
  2. Debounce the entire filter-change branch, not just the search input ($effect(() => { if (settled) { reload() }}) with a 100 ms guard).
  3. The provider count map (provider_event_counts) is computed server-side for all matching events on every page request; cache it for (user_id, filters) in a 30-s in-memory dict server-side (see also #M2).

H2. provider_event_counts aggregate query runs unbounded GROUP BY on every dashboard request (status.py:84-103)

provider_counts_query = (
    select(
        EventLog.provider_id,
        EventLog.provider_name,
        func.sum(func.coalesce(EventLog.assets_count, 1)).label("total"),
    )
    .where(EventLog.user_id == user.id)
    .group_by(EventLog.provider_id, EventLog.provider_name)
)

Every dashboard load (every 1060 s by default — see refreshIntervalItems) runs GROUP BY provider_id, provider_name over every event the user ever owned. At 90 days × ~1 event/min/tracker this is hundreds of thousands of rows scanned per refresh per logged-in user.

Impact: latency — SQLite forces a full table scan + sort here because the only composite index is (user_id, event_type, created_at DESC); cost — burns CPU on the bridge box for a metric that changes very slowly.

Fix:

  1. Add ix_event_log_user_provider (user_id, provider_id) so the GROUP BY can be index-only.
  2. Cache the result for (user_id, filter_signature) for 30 s in the same in-memory cache as #H1.
  3. Long-term: materialize per-provider counts into an event_counter table maintained by triggers or an APScheduler job. The dashboard then reads at most a dozen rows.

async def _tracker_response(session: AsyncSession, t: NotificationTracker) -> dict:
    result = await session.exec(
        select(NotificationTrackerTarget).where(NotificationTrackerTarget.tracker_id == t.id)
    )
    tracker_targets = [await _tt_response(session, tt) for tt in result.all()]

_tt_response (in notification_tracker_targets.py:12 — has 12 distinct select/session.get references) issues per-link follow-up SELECTs. Called from create, update, delete and trigger for a single tracker, so the practical N is small — but _tt_response is also called inside the bulk list_notification_trackers loop's downstream consumers, and any future bulk endpoint will multiply this badly.

Impact: latency on POST/PATCH responses; future regression risk.

Fix: rewrite _tt_response to accept pre-fetched maps (mirror the pattern in dispatch_helpers.load_link_data). Or, simpler: write a single eager-load helper using selectinload(NotificationTrackerTarget.target) once relationship() mappers are declared on the models.


The function batch-loads receivers, telegram_chats, email_bots, matrix_bots up-front, but the broadcast-expansion branch in the active_links loop still issues _resolve_target per child target (line 715). That _resolve_target is called with all the pre-fetched maps, so it doesn't query per call — but it does build a fresh target_config dict per child. With a broadcast target containing 50 children fanning out 100 events/min this is constant garbage collection pressure.

Impact: GC pressure under load; not a correctness problem.

Fix: none required short-term. Long-term, add selectinload declarations on the relationship model so SQLAlchemy can co-fetch the chain. The code path is already well-batched.


H5. aiohttp.ClientSession is constructed per-call inside NotificationDispatcher._session_ctx when no shared session is provided (dispatcher.py:117-123)

@contextlib.asynccontextmanager
async def _session_ctx(self) -> AsyncIterator[aiohttp.ClientSession]:
    if self._shared_session is not None and not self._shared_session.closed:
        yield self._shared_session
        return
    async with _new_session() as session:
        yield session

In server-side code paths (watcher, event_dispatch, deferred_dispatch) a shared session is always passed in, so this is harmless. But unit tests, the CLI, and any direct library user that instantiates NotificationDispatcher without a session pays the cost. Worse, the per-dispatch session creates a fresh TCP pool, fresh DNS resolver — defeating connection reuse to Telegram / Discord webhook hosts.

Impact: test slowness; correctness if a non-server consumer ever ships.

Fix: require the session parameter (session: aiohttp.ClientSession not | None). Or have the dispatcher lazily attach to a module-level _default_session cached by event loop id.


H6. WebhookPayloadLog is pruned per-insert via a sub-select but the prune query has no UNIQUE/partial protection against duplicate inserts (webhooks.py:404-418)

The "keep newest max_count per provider, delete the rest" pattern uses select(...).order_by(created_at DESC).limit(max_count) as a subquery. Under SQLite this materializes the top-N then negates it — fine when max_count is 20. But this runs on every inbound webhook. For a busy Gitea/HA installation that's 60+ writes/min, each with a delete-by-sub-select. The ix_webhook_payload_log_provider_created index makes the read cheap, but the DELETE still rewrites pages.

Impact: write amplification on busy webhook tenants.

Fix: keep the prune but make it probabilistic — only run with random.random() < 0.1 (10% chance per insert). The cap still holds in steady state, but the per-write cost drops 10×.


H7. No retention/archival for notification_tracker_state and deferred_dispatch "fired"/"dropped" rows (scheduler.py:332-364)

_cleanup_old_events deletes event_log, webhook_payload_log, action_execution older than retention days. deferred_dispatch rows with status IN ('fired', 'dropped') are never deleted. notification_tracker_state.asset_ids for an immich tracker watching a deleted collection is also never reaped.

Impact: unbounded growth on long-running installs; asset_ids JSON blobs can be megabytes per collection.

Fix: extend _cleanup_old_events to also delete DeferredDispatch.status != 'pending' AND fired_at < cutoff. Add a separate housekeeping job that prunes NotificationTrackerState rows whose collection_id is no longer in NotificationTracker.collection_ids.


MEDIUM

M1. Sentinel value bot_id=0 is a footgun (models.py:69-73)

# bot_id=0 is a sentinel meaning "Telegram has not yet returned a numeric
# ID for this bot" (i.e. token never validated). Multiple unverified bots
# may legitimately carry 0, so we only enforce uniqueness for non-sentinel
# values via a partial index added in migrate_uniqueness_constraints.
bot_id: int = Field(default=0, index=True)

Sentinel values on indexed columns hurt index selectivity (every unvalidated bot is the same row from the planner's perspective) and create maintenance burden. Worse, every code path that looks up by bot_id must remember to filter bot_id != 0.

Impact: maintainability; latent bug surface (one missed != 0 filter and an unverified bot is silently re-used).

Fix: change bot_id: int | None defaulting to None, drop the sentinel.


M2. No request-scoped cache for user.id lookups inside one request (api/*.py, throughout)

The same get_current_user dependency runs JWT validation + a session.get(User, id) on every request. Many endpoints then do their own user.id-filtered SELECTs. There is no per-request memoization of the User row.

Impact: one extra SELECT per request, mostly noise — but it's free to fix.

Fix: in get_current_user, cache the User on request.state.user. Routes that take user: User = Depends(...) are unchanged.


M3. SQLAlchemy async pool defaults serialize SQLite writers but the engine allows multiple connections (engine.py:41-57)

create_async_engine for SQLite defaults to a StaticPool of size 1 in newer SQLAlchemy versions, but older versions / different aiosqlite versions can default to NullPool (one connection per request) or a small QueuePool. The code does not pin this explicitly. Under WAL, multiple readers are fine but only one writer can hold the txn at a time — so a slow writer just makes other connections block on busy_timeout.

Impact: unpredictable behaviour across SQLAlchemy versions; sporadic SQLITE_BUSY under load.

Fix: explicitly configure the pool:

from sqlalchemy.pool import StaticPool, AsyncAdaptedQueuePool

_engine = create_async_engine(
    url,
    echo=settings.debug,
    pool_pre_ping=True,
    connect_args=connect_args,
    poolclass=AsyncAdaptedQueuePool,
    pool_size=5,
    max_overflow=10,
    pool_recycle=3600,
)

For Postgres compatibility leave these as-is; for SQLite the right value is StaticPool + connect_args={"check_same_thread": False} to share one connection across the event loop (this is the supabase/pgbouncer pattern adapted for sqlite-async).


M4. _refresh_telegram_chat_titles issues per-chat HTTP without per-bot bucketing (scheduler.py:229-329)

The job builds tasks as a flat list across all bots and runs them under a global Semaphore(10). A bot with 50 chats and a slow Telegram response (rare but happens) can monopolize all 10 slots, starving every other bot. The semaphore should be per-bot.

Impact: the daily refresh can take much longer than intended on a multi-bot install with one degraded bot.

Fix: create one semaphore per bot:

sems = {bot_id: asyncio.Semaphore(_CHAT_SYNC_CONCURRENCY) for bot_id in bot_tokens}

M5. event_log.collection_name.contains(search) triggers full table scan on filter (status.py:69-75)

The dashboard search input runs four .contains(search) clauses ORed together — these become LIKE '%search%' and cannot use a regular B-tree index. With 100k+ event_log rows the dashboard search becomes a multi-second operation.

Impact: UX — search feels broken on large installs; CPU on the bridge box.

Fix:

  1. Limit the search to the most recent N days (e.g. retention/3) — most users only search recent events.
  2. Add a SQLite FTS5 virtual table mirroring event_log's text columns, sync via triggers. Searches use MATCH 'foo' which is sub-millisecond on million-row tables.

M6. DeferredDispatch.event_payload JSON blob can grow unbounded per row (models.py:639-659, deferred_dispatch.py:188-298)

The asset-coalescing union path appends every new asset's full dict (filename, urls, tags, extra metadata) into event_payload["added_assets"]. A mass-import that adds 50k photos during a quiet window means one DeferredDispatch row with 50k asset entries.

Impact: memory blow-up at drain time (the whole JSON is parsed via deserialize_event into a Python list of MediaAsset dataclasses); could trip the drain timeout (_DRAIN_DISPATCH_TIMEOUT_SECONDS=120) on legitimate workloads.

Fix: cap the union at e.g. 500 assets per row; when crossed, emit a "more_truncated" sentinel into payload["extra"] so the rendered template can show "+45000 more". The apply_tracking_display_filters max_assets_to_show does cap it for delivery, but the stored payload is uncapped.


M7. Per-tick await get_app_timezone(session) reads from the DB on every dispatch (dispatch_helpers.py:146-150)

Each tracker tick, each webhook, each defer evaluation calls get_app_timezone which calls get_setting(session, "timezone") which is a SELECT. The timezone setting rarely changes (manual setting), but the SELECT runs constantly.

Impact: noise on otherwise good caching.

Fix: cache the timezone in a module-level (value, expires_at) tuple with 60-s TTL, invalidated by reschedule_cron_jobs_for_timezone_change.


M8. Unbounded in-memory dictionaries with no TTL or capacity (scheduler.py:67-72, telegram_poller.py:31-35, command_sync.py:25)

_adaptive_state: dict[int, dict[str, int]] = {}
_adaptive_max_skip: dict[int, int] = {}
_last_update_id: dict[int, int] = {}
_last_webhook_reclaim_at: dict[int, float] = {}
_dirty_bots: dict[int, float] = {}

Each is keyed by tracker_id / bot_id. When a tracker or bot is deleted, the cleanup paths (unschedule_tracker, etc.) do remove some entries — but not all. _last_update_id, _last_webhook_reclaim_at are never cleared on bot deletion.

Impact: slow memory leak in long-running processes that create+delete trackers/bots frequently (e.g. test environments).

Fix: on tracker/bot deletion, explicitly clear all module dicts that key by that id. Or, simpler, switch each to weakref.WeakValueDictionary once the entity has a Python object representation, or to a TTLCache.


M9. Bulk insert pattern in migrations uses one-statement-per-row (migrations.py:566-588)

migrate_tracker_targets issues INSERT INTO ... VALUES (...) per row in a Python for-loop. On a tenant with 10k+ legacy rows this is slow even inside a single transaction.

Impact: one-shot, but rough on upgrade for big tenants.

Fix: use executemany / batch INSERTs:

await conn.execute(text("INSERT INTO ... VALUES (...)"), batch_params)

This is mostly historical (the migration is idempotent and skipped on subsequent runs), but worth fixing if you're touching the file.


M10. Missing index on notification_tracker_state(notification_tracker_id, collection_id) (models.py:454-478)

check_tracker reads state per tracker; the existing ix_notification_tracker_state.notification_tracker_id index (declared via index=True) supports that. But every state read is WHERE tracker_id = ? AND collection_id = ? (implicitly via the resulting dict). A composite would help; SQLite can do index-only scans here.

Impact: small. SQLite's index intersection plus the fact that one tracker typically has <20 collections makes this a minor win.

Fix: add (notification_tracker_id, collection_id) composite index to the _INDEXES list.


LOW

L1. SELECT * semantics from select(Model) ORM is unavoidable but verbose ([throughout services/, api/])

SQLModel's select(ModelClass) is effectively SELECT all columns. For wide rows like TrackingConfig (~70 columns of boolean flags) that's a lot of bytes per dispatch evaluation. There are no API list endpoints that return TrackingConfig from a hot path, so this is mostly cosmetic — but for pages that only need a handful of columns (e.g. status.py's tracker_id, name map) the explicit-column form is already used. Continue that pattern.


L2. EventLog.details JSON dict is reconstructed on every dashboard read (status.py:258)

details: e.details or {} serializes the JSON every time. SQLite returns this as a parsed Python dict already (JSON column), so the cost is low; just a note that this is a hot path.


L3. event_log.collection_id and details have no indexes; some webhook commands filter on them (commands/immich/events.py:43)

The history-by-tracker endpoint uses the composite ix_event_log_user_event_type_created plus a hit on notification_tracker_id — fine. But events.py's "last assets_added for this collection" queries (event_type='assets_added' AND collection_id=?) cannot use any current index optimally.

Fix: add (event_type, collection_id, created_at DESC) if these queries are called by users frequently (Telegram /assets <album> etc.).


L4. JSON column types not declared with JSONB semantics (models.py: many)

SQLite has only JSON (text storage with json_valid checks). On Postgres you'd want JSONB. The codebase uses Column(JSON) from SQLModel which maps to native JSONB on Postgres — that's correct. No action needed.


L5. The setup lifespan runs migrations inside the FastAPI lifespan synchronously (main.py:62-122)

The migrations + seeds + scheduler boot all run before _READY = True. On a cold start with a big DB this can take 10+ s during which /api/ready returns 503. That's correct, but /api/health is also un-reachable because uvicorn hasn't started the workers yet (lifespan blocks startup). For orchestrators that probe /api/health, this means startup-grace must be tuned.

Fix: start the HTTP listener first, run migrations as a background task, expose readiness flag through /api/ready only.


L6. ServiceProvider.config, NotificationTarget.config, Tracker.filters JSON columns store secrets unencrypted (models.py:42, 349, 399)

API keys, refresh tokens, webhook secrets, SMTP passwords all live in config JSON. Visible to anyone with DB read access. This is a known design trade-off (backup_secrets_mode controls export behaviour) but worth flagging.

Fix: out of scope for this review; consider an at-rest encryption layer keyed off secret_key (Fernet) for config["api_key"], config["password"], access_token, etc. — but only if your threat model justifies the operational cost.


L7. Frontend caches.svelte.ts has 30-s TTL but no cross-tab invalidation (entity-cache.svelte.ts:14)

Two browser tabs editing the same entity will see stale data for up to 30 s in the other tab. No BroadcastChannel listener.

Fix: add a BroadcastChannel('notify-bridge-cache') that calls cache.invalidate() on receipt. ~15 lines.


L8. providersCache.invalidate(); await load() is two-step (providers/+page.svelte:238, 250)

invalidate() + immediate fetch(true) race against any in-flight request; the deduplication map handles it, but the explicit await load() is essentially fetch(true) directly. Simpler:

providersCache.set(updatedList);  // or fetch(true)

Cosmetic.


L9. details["dispatch_status"] is a string enum but not declared as one (deferred_dispatch.py:619-624)

dispatch_status takes values "deferred", "deferred_then_dropped", "deferred_then_failed", "delivered_after_quiet_hours", "dropped_quiet_hours_nondeferrable". They're scattered as string literals. The dashboard renders them.

Fix: declare an Enum once and import from both server and frontend types.


L10. No DB connection used by /api/health (main.py:270-274)

/api/health returns instantly without checking the DB. That's correct for a liveness probe but the comment doesn't match common practice ("liveness = process up"). Pair this with #L5: orchestrators using /api/health for warm-up will mark the pod ready while migrations are still running.

Fix: keep liveness lightweight, document the readiness probe as the warm-up gate.


Notes on what's already good

  • Performance indexes (_INDEXES list) cover all the right hot paths.
  • Composite (status, fire_at) index on deferred_dispatch plus partial unique (link_id, collection_id, event_type) WHERE status='pending' prevents the worst races.
  • load_link_data is fully batched — the most complex hot path in the codebase looks clean.
  • Shared aiohttp.ClientSession with DNS-rebinding-safe PinnedResolver is production-grade.
  • Pre-migration VACUUM INTO snapshot is the right safety net for a hand-rolled migration chain.
  • APScheduler defaults (coalesce=True, misfire_grace_time=300, max_instances=1) are correct production settings.
  • Adaptive polling (skip-N-of-K when idle) with jitter is a thoughtful 4-tier scheduling design.
  • Tracker cache (5-s TTL with explicit invalidation) and rendered-message per-locale cache are good fan-out optimizations.
  • Migration idempotency is genuinely well-handled despite the rough tooling.
  • Frontend entity-cache deduplication of in-flight requests is the right pattern.

Priority recommendations (next 30 days)

  1. Adopt Alembic (C1, C2) — eliminate create_all from prod, baseline the current schema, lock down new schema changes through autogenerate.
  2. Fix the dashboard aggregate query (H1, H2, M5) — add the missing composite index, server-side cache the per-provider aggregate, virtualize the event list. This is the single biggest user-visible perf win.
  3. Cap DeferredDispatch.event_payload size + add retention for fired/dropped rows (M6, H7) — closes off the worst-case memory and growth scenarios.
  4. Cleanup module-level dicts on entity deletion (M8) — small fix, prevents a slow leak.
  5. Standardize SQLite PRAGMAs and pool config (C3, M3) — predictable behaviour, fewer spurious BUSY errors.

Reviewed against codebase at HEAD (a20635a).