feat: observability, per-receiver Telegram options, oversized-video fallback

Operability:
- Correlation IDs end-to-end: shared dispatch_id between log lines and
  EventLog rows (event/watcher/scheduled/deferred/action/HA/command paths)
  and a new X-Request-Id middleware that normalizes inbound ids and binds
  request_id into log context.
- dispatch_summary block merged into EventLog.details: per-target
  success/failure counts plus Telegram media delivered/skipped/failed and
  truncated error lists, so partial outcomes surface in the UI.
- Diagnostic mode: admin can flip one module to DEBUG for a bounded
  window with auto-revert (in-memory only; setup_logging() resets on
  boot, lifespan reverts on shutdown). New /diagnostic-mode endpoints
  plus DiagnosticsCassette UI on the settings page.

Telegram:
- Per-receiver options: disable_notification (silent send) and
  message_thread_id (forum-topic routing), wired through the dispatcher
  via a ContextVar so all four send sites (sendMessage / sendPhoto-Video-
  Document / sendMediaGroup / cache-hit POST) pick them up.
- send_large_videos_as_documents target setting: bypass the 50 MB
  sendVideo cap by falling back to sendDocument for oversized videos.
- sendMediaGroup byte-budget enforcement (TELEGRAM_MAX_GROUP_TOTAL_BYTES,
  45 MB) with per-item fallback on chunk failure so a stale file_id no
  longer silently drops a cached asset.

Tests:
- New: diagnostic_mode, dispatch_summary, request_correlation,
  telegram_media_group_partial, telegram_per_send_options.

Docs:
- .claude/reviews/: six-axis production-readiness review of v0.8.1.
- .claude/docs/functional-review-2026-05-28.md: focused review of
  Telegram/Immich/logging subsystems.
2026-05-28 15:19:31 +03:00
parent 85a8f1e71c
commit 6a8f374678
39 changed files with 7239 additions and 142 deletions
@@ -0,0 +1,89 @@
# Production-Readiness Review — service-to-notification-bridge v0.8.1
**Date:** 2026-05-22 **Scope:** entire codebase (~70k LOC, 312 files)
**Branch:** master @ a20635a **Reviewers:** 6 parallel specialised agents
## Verdict
**Ship-readiness: nearly there.** The product is in materially better shape than a typical pre-1.0 — every security baseline is in place (sandboxed Jinja2, bcrypt+JWT, SSRF guard with DNS-rebinding mitigation, secret masking, signed webhooks, non-root Docker, owner-scoped queries) and the feature set is mature (deferred dispatch, quiet hours, fan-out caps, 429 backoff, Prometheus metrics). No CRITICAL security findings exist.
The work that *should* block shipping to wider users is concentrated in **three buckets**: (1) a handful of correctness defects that surface only under load or restart (duplicate-send class), (2) two secret-handling gaps (HA token returned cleartext, bot tokens/SMTP passwords unencrypted at rest), and (3) the schema-management story (`create_all` on boot + 1880-line hand-rolled migration script with no Alembic).
## Reports
| Axis | File | Findings | Top hit |
|---|---|---|---|
| Backend (Python) | [backend-review.md](backend-review.md) | 5C / 15H / 18M / 10L | `asyncio.create_task` GC in HA status logger |
| Frontend (TS/Svelte) | [frontend-review.md](frontend-review.md) | 2C / 10H / 19M / 7L | JWT access+refresh in `localStorage` |
| Security | [security-review.md](security-review.md) | 0C / 2H / multiple M | HA `access_token` not masked on `GET /providers/{id}` |
| Performance + DB | [performance-db-review.md](performance-db-review.md) | 3C / 7H / 10M / 10L | `SQLModel.metadata.create_all` on every boot |
| Bugs + features | [bugs-features-review.md](bugs-features-review.md) | 3C / 13H / 12M / 3L + 25 features | Webhook redelivery has no idempotency |
| UI/UX | [ui-ux-review.md](ui-ux-review.md) | ~33 across 13 axes | Five overlapping glass-card abstractions |
## Ship blockers (must fix before wider rollout)
Cross-cutting top 12 — verified across all six reviews:
1. **HA `access_token` returned in plaintext** on `GET /api/providers/{id}` — not in mask list. *(Security H-1, [providers.py:399-405](packages/server/src/notify_bridge_server/api/providers.py#L399))*
2. **Secrets unencrypted at rest** — Telegram bot tokens, SMTP passwords, HA tokens, webhook secrets stored as plain text in SQLite. Disk/snapshot/backup theft = full credential set. *(Security H-2)*
3. **Frontend JWT access + refresh in `localStorage`** — any future XSS exfiltrates the session in one call. Move to httpOnly cookie. *(Frontend C-1)*
4. **`asyncio.create_task` fire-and-forget** in `ha_subscription._on_status_change` — task may be GC'd before completion. *(Backend C-1, [ha_subscription.py:249](packages/server/src/notify_bridge_server/services/ha_subscription.py#L249))*
5. **Pre-auth 1 MiB body read** on Gitea + generic webhooks: DoS amplifier. Hoist the missing-secret and `auth_mode == none` checks above the body read; the HMAC check against `X-Hub-Signature` itself still needs the body. *(Backend C-3, [webhooks.py:167](packages/server/src/notify_bridge_server/api/webhooks.py#L167) + 449)*
6. **No webhook idempotency** — Gitea/Planka/generic don't dedupe by `X-Gitea-Delivery` / equivalent. Replays = duplicate sends. *(Bugs C-1)*
7. **Deferred-dispatch crash window**: `dispatch()` returns before `session.commit()`; restart re-fires. Wrap in idempotent "claim → send → ack" with a unique constraint. *(Bugs C-2)*
8. **Telegram `_last_update_id` in-memory only** — restart can replay or skip commands. Persist watermark. *(Bugs C-3)*
9. **`init_db` calls `SQLModel.metadata.create_all` on every boot** — causes schema drift between fresh and upgraded installs. Adopt Alembic. *(Perf C-1)*
10. **Template-preview endpoints bypass sandbox timeout** — authenticated user can wedge a worker with `{% for i in range(10**8) %}`. *(Security M-1)*
11. **Telegram webhook handler missing `session.rollback()`** in catch-all — leaves uncommitted writes. *(Backend C-2, [commands/webhook.py:162](packages/server/src/notify_bridge_server/commands/webhook.py#L162))*
12. **CLAUDE.md rule-8 violation**: `if (provider.type !== 'immich')` in `RuleEditor.svelte` silently disables people/album picker for other providers. *(Frontend C-2, [RuleEditor.svelte:57](frontend/src/routes/actions/RuleEditor.svelte#L57))*
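Blockers 6 and 7 both reduce to "record a unique key before sending". The following is a minimal sketch of that claim step, not the project's code: the `webhook_delivery` table and the `init_schema` / `claim_delivery` helpers are hypothetical names, and plain `sqlite3` stands in for the SQLModel session.

```python
import sqlite3


def init_schema(conn: sqlite3.Connection) -> None:
    # Hypothetical dedup table: one row per (provider, delivery id).
    conn.execute(
        "CREATE TABLE IF NOT EXISTS webhook_delivery ("
        " provider_id INTEGER NOT NULL,"
        " delivery_id TEXT NOT NULL,"
        " PRIMARY KEY (provider_id, delivery_id))"
    )


def claim_delivery(conn: sqlite3.Connection, provider_id: int, delivery_id: str) -> bool:
    """Return True only for the first caller that claims this delivery."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO webhook_delivery (provider_id, delivery_id) VALUES (?, ?)",
                (provider_id, delivery_id),
            )
        return True
    except sqlite3.IntegrityError:
        # The primary key already exists, so this is a redelivery.
        return False
```

A handler would call `claim_delivery` with the `X-Gitea-Delivery` value (or equivalent) and skip dispatch when it returns `False`. The same unique-key idea can back the "claim" step of the deferred-dispatch drain.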
## Next-tier priorities (HIGH — fix in the same release where practical)
13. Audit `backup_schema.PROVIDER_SECRET_FIELDS` so `webhook_secret`, `password`, `client_secret`, `refresh_token` are scrubbed on export. *(Backend C-5)*
14. Add `asyncio.Lock` around `bridge_self` failure-counter dicts. *(Backend C-4)*
15. Login rate-limit is per-IP only — slow rotated-source brute force succeeds. Add per-account lockout + raise password floor. *(Security M-2)*
16. Three frontend CRUD pages copy cache items into local `$state`, breaking the shared-cache invariant and forcing a full refetch per mutation. *(Frontend H-1/H-2)*
17. Uncancelled `setTimeout` chain in backup restart flow can `window.location.reload()` after navigation. *(Frontend H-5)*
18. Refresh-token race against `logout()` produces spurious "Unauthorized" toasts. *(Frontend H-6/H-7)*
19. Dashboard per-provider GROUP-BY aggregate runs unbounded on every refresh, no caching, no covering index. *(Perf H-1/H-2)*
20. Truncation/parse-mode escaping for Telegram (HTML-aware truncate, `_extract_retry_after` fractional seconds, forum `message_thread_id` routing, 403 "bot blocked" auto-disable). *(Bugs H-various)*
21. Five overlapping glass-card abstractions + radius drift (22/18/14/12 px) + ~71 legacy `rounded-md text-sm bg-…` form inputs that bypass the global Aurora `input{}` rule. *(UI/UX H-CONSIST-01..04)*
22. Hardcoded hex colors (`#059669`, `#ef4444`) in Snackbar/ConfirmModal/actions — bypasses theming. *(UI/UX H-CONSIST-03)*
23. Snackbar has no `aria-live`; nav lacks `aria-current="page"` — invisible to screen readers. *(UI/UX H-NAV-01, A11y)*
24. DST handling in overnight quiet-hours windows. *(Bugs H)*
## What's working well — keep doing this
- **Sandboxed Jinja2 everywhere** (security agent verified every `Environment()` instantiation is `SandboxedEnvironment`).
- **`PinnedResolver` SSRF defence** — handles CGNAT, IPv4-mapped IPv6, DNS rebinding.
- **JWT with `token_version` revocation** — bcrypt offloaded to worker thread, constant-time username probe.
- **Hardened Docker** — non-root, read-only root FS, `cap_drop: ALL`.
- **Aurora/Glass design identity** — distinctive (conic-gradient orb, Newsreader italic display serif, lavender/orchid palette, "signal stream"/"on watch"/"wires"/"pulse" editorial labels). Not generic AI admin work.
- **Frontend type discipline** — `svelte-check` clean, EN/RU exactly 1466 keys each, no `eval`/`innerHTML`/`var`/`==` anywhere.
- **Most SQL hot paths already batched** — `load_link_data` is fully fan-in/fan-out; partial unique indexes on deferred-dispatch are thoughtful.
- **Most v0.8.1 production-readiness items shipped** — fan-out caps, 429 backoff, parse_mode fallback, scheduler misfire grace, Prometheus, deep healthcheck, per-receiver render cache.
## Top missing features worth adding next
Pulled from the bugs-features report — full pitches in [bugs-features-review.md](bugs-features-review.md):
- **Template playground** — "send test against last event" + dry-run with sample payload.
- **Template versioning + rollback** with audit log.
- **Bulk operations** on targets/templates (currently row-by-row).
- **User-side snooze/mute via bot command** ("/mute 2h", "/snooze tonight").
- **Auto-disable receiver on Telegram 403 ("bot blocked")** with admin notification.
- **Rate-limit per target** (separate from global fan-out cap).
- **Weekly digest + per-target stats + per-provider error rate**.
- **Generic webhook provider** and **email / Discord / ntfy.sh / Matrix** channels.
- **Message dedup window** (kills duplicate sends from redelivery and scheduler misfires).
- **First-run "Getting Started" checklist** on empty dashboard (UI/UX).
## How to consume this review
Each report has clickable `file:line` markdown links. Recommended sequence:
1. Read this `README.md`.
2. Skim each report's Executive Summary (top 5-7 bullets).
3. Triage the **Ship blockers (1-12)** above into the next release branch as individual issues.
4. Schedule the **HIGH list (13-24)** for the release after.
5. Treat the feature ideas as a refresh of `.claude/docs/feature-backlog.md`.
@@ -0,0 +1,342 @@
# Backend Production-Readiness Review
Scope: packages/server/src/notify_bridge_server/ and packages/core/src/notify_bridge_core/ (~44k LOC, Python 3.11, FastAPI + SQLModel async + APScheduler + aiohttp).
## Executive Summary
- **Overall quality is high.** The Jinja2 sandbox is consistently applied (every Environment instantiation is SandboxedEnvironment), JWT auth uses bcrypt offloaded to a worker thread, SSRF guard exists with DNS-rebinding mitigation, secrets are masked in logs via a dedicated filter, and most async/SQL patterns show production-aware design (per-tracker sessions, batched IN-queries, partial unique indexes).
- **Top correctness risk: a fire-and-forget asyncio.create_task in ha_subscription._on_status_change** (no reference stored, GC can drop the task) plus thread-unsafe in-memory counters in bridge_self. Both bite on chatty HA installs.
- **Module-level dict caches shared across the event loop have small read-modify-write windows** in services/scheduler.py (adaptive state), services/bridge_self.py (failure counters), commands/handler.py (TTLCache rate limits), and command_sync._dirty_bots. Currently functional under low concurrency; risky under load.
- **Very large hot-path functions** — services/watcher.py:check_tracker (381 lines), services/dispatch_helpers.py:load_link_data (208 lines), the 1880-line database/migrations.py, and the 1365-line services/scheduler.py — concentrate too much logic in one place.
- **Provider-type hardcoding** persists in api/providers.py, services/__init__.py, services/action_runner.py, and services/manual_dispatch.py (if provider.type == "immich" chains). The watcher's _POLL_FACTORIES registry is the right model; extend it.
- **Webhook handlers read the request body BEFORE authenticating** in the Gitea and generic-webhook routes. The Planka route gets it right. Net impact: a peer that knows the URL but not the secret can drive a 1 MiB read per request.
- **autoescape is inconsistent**: True for runtime templates (renderer.py, commands/handler.py), False for preview / sample-context renders in api/template_configs.py, api/slot_helpers.py, and services/notifier.send_test_template_notification. Lower risk (admin-authored input) but mismatch invites surprise.
---
## CRITICAL
### [C-1] _on_status_change schedules an unstored task (GC + drop risk)
File: [packages/server/src/notify_bridge_server/services/ha_subscription.py:240-260](../../packages/server/src/notify_bridge_server/services/ha_subscription.py#L240)
The task created by asyncio.create_task(_record_ha_status(...)) at line 249 is not held anywhere. Python may garbage-collect a task whose only reference is the create_task return value before it completes (Python docs explicitly warn: save a reference to the result). Result: an HA disconnect/reconnect EventLog row silently disappears under memory pressure.
**Fix:** Module-level set[asyncio.Task], add the new task, remove via task.add_done_callback. ha_subscription.start_all already does this correctly (line 315-320); the pattern is already in-house.
### [C-2] Telegram-webhook handler returns 200 OK on uncommitted writes
File: [packages/server/src/notify_bridge_server/commands/webhook.py:130-169](../../packages/server/src/notify_bridge_server/commands/webhook.py#L130)
The catch-all at line 162 swallows handle_command exceptions and returns OK to Telegram. The request already called await session.commit() at line 96 (after save_chat_from_webhook), and any subsequent writes via the dispatcher use NEW sessions inside the command path. If a downstream session inside handle_command partially commits before raising, the dependency get_session does NOT roll back automatically — the context manager only closes.
**Fix:** Either explicitly session.rollback() in the except block, or wrap the per-request mutations in async with session.begin(): so the implicit transaction guarantees rollback on exception.
### [C-3] Gitea/generic webhook reads body BEFORE verifying secret is configured
File: [packages/server/src/notify_bridge_server/api/webhooks.py:167-178](../../packages/server/src/notify_bridge_server/api/webhooks.py#L167) and line 449-454
The sequence is: read 1 MiB raw_body, then check if webhook_secret is empty. A peer that learned the URL but has no secret drives a 1 MiB body read per request. Planka's handler at line 232+ validates the bearer token BEFORE the body read, which is the correct pattern.
**Fix:** Hoist the "if not webhook_secret" (Gitea) and "if auth_mode == none" short-circuit (generic) above _read_bounded_body. Gitea HMAC still needs the body — but bailing on a missing-config-side error first costs nothing.
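The reordering can be sketched in isolation. The function below is an illustrative assumption, not the handler in `webhooks.py`: `verify_webhook`, `Rejected`, and the `read_body` callback are hypothetical names, the early "missing signature header" check goes slightly beyond the fix above, and SHA-256 is assumed as the HMAC digest.

```python
import hashlib
import hmac
from typing import Callable, Optional


class Rejected(Exception):
    """Raised when a webhook request is refused."""


def verify_webhook(
    secret: str,
    signature: Optional[str],
    read_body: Callable[[], bytes],
) -> bytes:
    # Cheap checks first: neither of these needs the request body.
    if not secret:
        raise Rejected("webhook secret not configured")
    if not signature:
        raise Rejected("missing signature header")
    # Only now pay for the (bounded) body read.
    body = read_body()
    expected = hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    # Constant-time comparison on bytes avoids timing leaks.
    if not hmac.compare_digest(expected.encode(), signature.encode()):
        raise Rejected("bad signature")
    return body
```

Passing the body reader as a callback makes the ordering testable: a request rejected for configuration reasons never triggers the read.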
### [C-4] bridge_self in-memory counters are not async-safe
File: [packages/server/src/notify_bridge_server/services/bridge_self.py:186-230](../../packages/server/src/notify_bridge_server/services/bridge_self.py#L186)
record_poll_failure does _poll_failure_counts[tracker_id] = _poll_failure_counts.get(tracker_id, 0) + 1. These dicts are accessed concurrently from poll loop, HA push, webhook ingest, and dispatcher target-failure recording. Individual dict ops are atomic, but get + 1 + set is not when interleaved with another coroutine that touches the same key. Symptoms: missed threshold crossings, occasional double-emission. Same pattern in _target_failure_counts and _backlog_above_threshold.
**Fix:** Wrap mutating ops in an asyncio.Lock. The reset-and-re-arm semantics already assume serial access — make it explicit.
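A minimal sketch of the locked counter, assuming a hypothetical `FailureCounter` class in place of the module-level dicts. The lock keeps the read-modify-write and the threshold decision in one critical section, which matters as soon as an `await` sits between them.

```python
import asyncio


class FailureCounter:
    """Per-key failure counter with an explicit critical section."""

    def __init__(self, threshold: int) -> None:
        self._threshold = threshold
        self._counts: dict[int, int] = {}
        self._lock = asyncio.Lock()

    async def record(self, key: int) -> bool:
        """Increment the counter; return True exactly when the threshold is hit."""
        async with self._lock:
            count = self._counts.get(key, 0) + 1
            self._counts[key] = count
            return count == self._threshold

    async def reset(self, key: int) -> None:
        async with self._lock:
            self._counts.pop(key, None)
```

Creating the lock inside `__init__` (rather than at import time) ties it to the loop that is running when the counter is built.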
### [C-5] PROVIDER_SECRET_FIELDS audit needed for backup exports
File: [packages/server/src/notify_bridge_server/api/providers.py:617-625](../../packages/server/src/notify_bridge_server/api/providers.py#L617) and [services/backup_service.py:84-93](../../packages/server/src/notify_bridge_server/services/backup_service.py#L84)
_apply_secrets_provider redacts only fields named in PROVIDER_SECRET_FIELDS. The webhook flow uses a field called webhook_secret (Gitea, Planka, generic) — verify this is in PROVIDER_SECRET_FIELDS (defined in backup_schema.py). A backup export with secrets_mode=INCLUDE that misses webhook_secret leaks a token that grants webhook-forgery rights.
**Action:** Audit PROVIDER_SECRET_FIELDS. Specifically check it includes: api_key, api_token, access_token, webhook_secret, password, client_secret, refresh_token. The _provider_response mask list at api/providers.py:620 is a good cross-reference — both should be the same constant.
---
## HIGH
### [H-1] _compile_template lru_cache competes across tenants
File: [packages/server/src/notify_bridge_server/commands/handler.py:99-103](../../packages/server/src/notify_bridge_server/commands/handler.py#L99)
lru_cache(maxsize=256) keyed by raw template string. Edited templates remain cached. On a multi-tenant install one tenant's 256 distinct templates can evict another's. No invalidation on template-edit.
**Fix:** Drop the cache (Jinja compile is sub-ms) OR add an invalidation call from the template-edit endpoints. The notification renderer (renderer.py:31) uses 512 slots — same problem; consistent fix.
### [H-2] check_tracker is 381 lines with deep coupling
File: [packages/server/src/notify_bridge_server/services/watcher.py:263-644](../../packages/server/src/notify_bridge_server/services/watcher.py#L263)
Loads tracker, polls, writes state, persists EventLog, evaluates gates, defers, dispatches, records bridge_self, all in one function. Refactor candidates: _poll_phase, _persist_state_and_events, _dispatch_phase. This is the watcher's hot path; bugs here affect every tracker tick.
### [H-3] load_link_data returns untyped dict[str, Any]
File: [packages/server/src/notify_bridge_server/services/dispatch_helpers.py:539-747](../../packages/server/src/notify_bridge_server/services/dispatch_helpers.py#L539)
Five call sites consume ld["target_type"], ld.get("link_id"), etc. — no static guarantee against key typos.
**Fix:** Introduce a frozen @dataclass class LinkData. Same for per-receiver entries.
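As a sketch of the shape this could take: only `target_type` and `link_id` are named in the finding, so the field list here is deliberately incomplete and the `from_dict` adapter is a hypothetical migration aid.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class LinkData:
    """Typed, immutable replacement for the dict returned by load_link_data."""

    target_type: str
    link_id: Optional[int] = None

    @classmethod
    def from_dict(cls, raw: dict[str, Any]) -> "LinkData":
        # A missing required key fails here, once, instead of at each call site.
        return cls(target_type=raw["target_type"], link_id=raw.get("link_id"))
```

With this in place a typo such as `ld.target_typ` is caught by a type checker, and accidental mutation raises at runtime.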
### [H-4] N+1 in _resolve_command_context template-slot loop
File: [packages/server/src/notify_bridge_server/commands/handler.py:200-215](../../packages/server/src/notify_bridge_server/commands/handler.py#L200)
One SELECT per distinct command_template_config_id. Already batched for trackers/configs/providers — finish the job. Single WHERE config_id IN (...) query + Python pivot.
### [H-5] N+1 in backup_service.export_backup receiver loop
File: [packages/server/src/notify_bridge_server/services/backup_service.py:187-189](../../packages/server/src/notify_bridge_server/services/backup_service.py#L187)
50 targets = 51 SELECTs. Batch with WHERE target_id IN (...). Audit other sections of this 941-line file for the same pattern (templates -> slots, command configs -> slots).
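The batching pattern for both N+1 findings (H-4 and H-5) can be sketched with plain `sqlite3`. The `receiver` table, its columns, and the `receivers_by_target` helper are assumptions for illustration; the real code would express the same `IN (...)` through SQLModel.

```python
import sqlite3
from collections import defaultdict


def receivers_by_target(
    conn: sqlite3.Connection, target_ids: list[int]
) -> dict[int, list[str]]:
    """Fetch receivers for many targets in one query, then pivot in Python."""
    if not target_ids:
        return {}
    placeholders = ",".join("?" for _ in target_ids)
    rows = conn.execute(
        f"SELECT target_id, name FROM receiver "
        f"WHERE target_id IN ({placeholders}) ORDER BY id",
        target_ids,
    )
    grouped: dict[int, list[str]] = defaultdict(list)
    for target_id, name in rows:
        grouped[target_id].append(name)
    return dict(grouped)
```

Only the placeholder string is interpolated; the ids themselves stay bound parameters. For very large id lists the query would need chunking to stay under SQLite's bound-variable limit.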
### [H-6] _dirty_bots mutated from request and scheduler without a lock
File: [packages/server/src/notify_bridge_server/services/command_sync.py:25-95](../../packages/server/src/notify_bridge_server/services/command_sync.py#L25)
mark_bot_dirty runs in request handlers, _flush_dirty_bots on the scheduler executor. Currently safe (snapshot via ready = [...]) but fragile.
**Fix:** Snapshot under lock, or move to a thread-safe primitive.
### [H-7] HA reconnect cycle has no way for CRUD to short-circuit a stale supervisor
File: [packages/server/src/notify_bridge_server/services/ha_subscription.py:163-175](../../packages/server/src/notify_bridge_server/services/ha_subscription.py#L163)
Because provider state is reloaded only on reconnect, a disabled HA provider's supervisor keeps running on the 30s/300s cadence until its next reconnect attempt. CRUD endpoints should call reload_provider (defined at line 339); verify wiring.
### [H-8] Cached expunged ORM instances are footguns
File: [packages/server/src/notify_bridge_server/services/event_dispatch.py:75-107](../../packages/server/src/notify_bridge_server/services/event_dispatch.py#L75)
_load_trackers_cached returns expunged NotificationTracker rows. Future maintainer calling session.add(tracker) on a stale cached instance triggers DetachedInstance or silent re-INSERT. Document this strongly, ideally convert to a typed projection.
### [H-9] Pending-restore at startup has no timeout
File: [packages/server/src/notify_bridge_server/main.py:142-143](../../packages/server/src/notify_bridge_server/main.py#L142)
apply_pending_restore_if_any runs in lifespan; a partially-corrupt restore could block startup indefinitely. Container liveness probes then fail after grace.
**Fix:** asyncio.wait_for with a generous timeout, or kick off as background task while app starts.
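A minimal sketch of the timeout variant, assuming a hypothetical wrapper named `run_with_deadline`; the actual timeout value and what startup should do on expiry are decisions left to the maintainers.

```python
import asyncio


async def run_with_deadline(coro, timeout_s: float) -> bool:
    """Return True if coro finished in time, False if it was cancelled on timeout."""
    try:
        await asyncio.wait_for(coro, timeout=timeout_s)
        return True
    except asyncio.TimeoutError:
        # wait_for has already cancelled the inner coroutine at this point.
        return False
```

In the lifespan hook this would wrap `apply_pending_restore_if_any()`, with a `False` result logged loudly so a stuck restore is visible instead of silently blocking boot.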
### [H-10] Jinja2 render watchdog uses daemon thread that can pin a CPU forever
File: [packages/core/src/notify_bridge_core/templates/renderer.py:48-73](../../packages/core/src/notify_bridge_core/templates/renderer.py#L48)
Comment acknowledges the trade-off. Multiple concurrent runaway renders can exhaust CPU cores while callers think they timed out. Add a process-level BoundedSemaphore capping concurrent in-flight renders.
### [H-11] _aggregate drops all but the first error
File: [packages/server/src/notify_bridge_server/services/notifier.py:326-335](../../packages/server/src/notify_bridge_server/services/notifier.py#L326)
When all sends fail, only results[0] is returned. Distinct subsequent errors are lost.
**Fix:** Aggregate all errors into a details field.
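One way to keep every distinct error is sketched below. `SendResult` and `collect_errors` are hypothetical stand-ins; the real result type in `notifier.py` is not shown in this review.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SendResult:
    success: bool
    error: Optional[str] = None


def collect_errors(results: list[SendResult]) -> list[str]:
    """Distinct error messages from failed sends, in first-seen order."""
    seen: dict[str, None] = {}  # dict preserves insertion order
    for result in results:
        if not result.success and result.error:
            seen.setdefault(result.error, None)
    return list(seen)
```

The returned list could populate the proposed details field, so a timeout on one receiver no longer hides a 403 on another.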
### [H-12] Generic-webhook header dict materialised twice
File: [packages/server/src/notify_bridge_server/api/webhooks.py:456](../../packages/server/src/notify_bridge_server/api/webhooks.py#L456) and line 475
dict(request.headers) materialises full headers map, then _filter_headers and _redact_sensitive_body walk the payload. With a malicious peer sending many headers (Starlette default 100), bounded but wasteful.
### [H-13] SSRF redirect-walk has no aggregate wall-clock budget
File: [packages/core/src/notify_bridge_core/notifications/telegram/client.py:232-268](../../packages/core/src/notify_bridge_core/notifications/telegram/client.py#L232)
max_redirects = 3, each with 120s _DOWNLOAD_TIMEOUT. Worst case per request: 480s (the initial request plus 3 redirects, 120s each). _TARGET_TIMEOUT_S = 120s in the dispatcher caps the top-level case, but per-asset preloads inside media groups don't all share that cap.
### [H-14] Backlog recovery logic flips latch for in-flight users
File: [packages/server/src/notify_bridge_server/services/bridge_self.py:544-551](../../packages/server/src/notify_bridge_server/services/bridge_self.py#L544)
Recovery loop iterates all known users and flips to False for any not in counts_by_user. If a user transiently has no user_id set on deferred rows (legacy / orphaned), they're excluded from the GROUP BY and incorrectly marked recovered.
### [H-15] quiet_hours_status silently returns None on start == end
File: [packages/server/src/notify_bridge_server/services/dispatch_helpers.py:110-111](../../packages/server/src/notify_bridge_server/services/dispatch_helpers.py#L110)
The comment notes this is almost always a user mistake. Silent return means the user wonders why their notifications still arrive at all hours. Surface via WARNING log + UI hint.
---
## MEDIUM
### [M-1] register_commands_with_telegram chat overrides loop is sequential
File: [packages/server/src/notify_bridge_server/commands/handler.py:723-776](../../packages/server/src/notify_bridge_server/commands/handler.py#L723)
50 chats with overrides = 50 sequential Telegram round-trips. Use asyncio.gather with a semaphore as in _refresh_telegram_chat_titles.
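The bounded-concurrency pattern referred to here can be sketched generically. `gather_bounded` is a hypothetical helper name; the limit that suits Telegram's rate limits is not stated in this review.

```python
import asyncio


async def gather_bounded(coros, limit: int) -> list:
    """Run coroutines concurrently, with at most `limit` in flight at once."""
    sem = asyncio.Semaphore(limit)

    async def run(coro):
        async with sem:
            return await coro

    # gather preserves input order in its result list.
    return await asyncio.gather(*(run(c) for c in coros))
```

Each per-chat override call becomes one coroutine in the input, so 50 chats cost roughly 50 / limit round-trip times instead of 50.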
### [M-2] _run_provider exception backoff has no escalation
File: [packages/server/src/notify_bridge_server/services/ha_subscription.py:278-283](../../packages/server/src/notify_bridge_server/services/ha_subscription.py#L278)
Persistent bug in _emit reconnects every 30s forever. Add exponential backoff with cap and bridge_self alert after N failures.
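A capped exponential schedule might look like the sketch below. The 30s base and 300s cap are borrowed from the reconnect cadence mentioned elsewhere in this review and are assumptions here, as is the `backoff_delay` name.

```python
def backoff_delay(attempt: int, base_s: float = 30.0, cap_s: float = 300.0) -> float:
    """Delay before retry number `attempt` (0-based): base * 2**attempt, capped."""
    # Clamp the exponent so a very long outage cannot overflow the float.
    exponent = min(max(attempt, 0), 32)
    return min(cap_s, base_s * (2 ** exponent))
```

With the defaults, the first five delays are 30, 60, 120, 240, and 300 seconds. The attempt counter is also a natural trigger for the bridge_self alert after N consecutive failures.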
### [M-3] database/migrations.py is 1880 lines
File: [packages/server/src/notify_bridge_server/database/migrations.py](../../packages/server/src/notify_bridge_server/database/migrations.py)
Past the 800-line guideline. Split per-migration into database/migrations/<name>.py, list in main.py.
### [M-4] Locale-resolution logic duplicated
File: [packages/server/src/notify_bridge_server/services/dispatch_helpers.py:484-491](../../packages/server/src/notify_bridge_server/services/dispatch_helpers.py#L484) and [services/notifier.py:46](../../packages/server/src/notify_bridge_server/services/notifier.py#L46)
Two implementations of locale priority. One source of truth.
### [M-5] _normalize_locale duplicated across modules
File: [packages/server/src/notify_bridge_server/commands/handler.py:632](../../packages/server/src/notify_bridge_server/commands/handler.py#L632)
Five-line copy; move to commands/command_utils.py.
### [M-6] Provider-type if-chain in _test_provider_connection
File: [packages/server/src/notify_bridge_server/api/providers.py:203-250](../../packages/server/src/notify_bridge_server/api/providers.py#L203)
Same chain in services/__init__.py:_make_collection_provider. Both candidates for a single registry.
### [M-7] Secret masking exposes last 4 chars unconditionally
File: [packages/server/src/notify_bridge_server/api/providers.py:624](../../packages/server/src/notify_bridge_server/api/providers.py#L624) and [services/backup_service.py:81](../../packages/server/src/notify_bridge_server/services/backup_service.py#L81)
Fine for 32-char Immich keys. Returns half the value for short secrets. Use plain "***" for len(value) < 16.
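The length guard is small enough to sketch directly. The `mask_secret` name and the exact mask format (`***` plus the last four characters) are assumptions; only the "fully mask anything under 16 characters" rule comes from the finding.

```python
def mask_secret(value: str, min_len_for_tail: int = 16, tail: int = 4) -> str:
    """Mask a secret, revealing a short tail only when the value is long enough."""
    if len(value) < min_len_for_tail:
        # Short secrets: showing 4 characters would reveal too large a share.
        return "***"
    return "***" + value[-tail:]
```

Sharing one helper between `api/providers.py` and `backup_service.py` would also keep the two masking sites from drifting apart.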
### [M-8] Deprecated validate_outbound_url still imported
File: [packages/core/src/notify_bridge_core/providers/immich/client.py:14](../../packages/core/src/notify_bridge_core/providers/immich/client.py#L14)
The sync version uses blocking socket.getaddrinfo on the event loop. Migrate to avalidate_outbound_url.
### [M-9] Lazy cache init has confusing DCL comment
File: [packages/server/src/notify_bridge_server/services/watcher.py:81-113](../../packages/server/src/notify_bridge_server/services/watcher.py#L81)
Comment about Double-check after acquiring lock implies classic DCL. Under asyncio the unlocked first check is safe because there's no thread context switch, but reword the comment to clarify.
### [M-10] Dispatcher concurrency cap is per-dispatch, not process-wide
File: [packages/core/src/notify_bridge_core/notifications/dispatcher.py:58](../../packages/core/src/notify_bridge_core/notifications/dispatcher.py#L58)
_DISPATCH_CONCURRENCY = 16 is INSIDE dispatch(). HA storm = N events x min(M, 16) sends with no outer cap. Add a process-level semaphore in event_dispatch.py.
### [M-11] success=True returned for partial failures
File: [packages/server/src/notify_bridge_server/services/notifier.py:329-335](../../packages/server/src/notify_bridge_server/services/notifier.py#L329)
A test that fails on 1 of 3 receivers returns success=True with a partial_failures count. Introduce a status: "ok"|"partial"|"fail" field.
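The three-way status can be derived from the two counts. This is a sketch under the assumption that the caller already has succeeded/failed totals; `overall_status` is a hypothetical name.

```python
from typing import Literal

Status = Literal["ok", "partial", "fail"]


def overall_status(succeeded: int, failed: int) -> Status:
    """Collapse per-receiver outcomes into a single status value."""
    if failed == 0:
        # Note: zero receivers also lands here; callers may want to treat
        # that case separately.
        return "ok"
    return "partial" if succeeded > 0 else "fail"
```

Keeping `success` as a derived boolean (`status == "ok"`) would let existing callers continue to work while the UI switches to the richer field.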
### [M-12] Telegram command registration not retried on 429
File: [packages/server/src/notify_bridge_server/commands/handler.py:671-693](../../packages/server/src/notify_bridge_server/commands/handler.py#L671)
set_my_commands/delete_my_commands aren't retried. Adopt the retry-after handling that _upload_media has.
### [M-13] event_log_id_by_event keyed on id(event)
File: [packages/server/src/notify_bridge_server/services/watcher.py:417-464](../../packages/server/src/notify_bridge_server/services/watcher.py#L417)
CPython object-address as key works because events are held alive in scope, but a typed key would be safer.
### [M-14] Bcrypt-length error wording could be clearer
File: [packages/server/src/notify_bridge_server/auth/routes.py:69-81](../../packages/server/src/notify_bridge_server/auth/routes.py#L69)
User typing 70 ASCII + emoji gets rejected and doesn't understand why. Clarify the byte-count language.
### [M-15] CSP allows unsafe-inline for script-src
File: [packages/server/src/notify_bridge_server/main.py:186-201](../../packages/server/src/notify_bridge_server/main.py#L186)
Acknowledged. SvelteKit --csp build flag emits hashes; switching unblocks dropping unsafe-inline.
### [M-16] Telegram-webhook body size not capped
File: [packages/server/src/notify_bridge_server/commands/webhook.py:71](../../packages/server/src/notify_bridge_server/commands/webhook.py#L71)
update = await request.json() reads with no cap. Add _read_bounded_body pattern.
### [M-17] _log_command_event swallows DB failures invisibly
File: [packages/server/src/notify_bridge_server/commands/handler.py:353-357](../../packages/server/src/notify_bridge_server/commands/handler.py#L353)
Hard DB failure here is invisible. Add a metrics counter.
### [M-18] apply_tracking_display_filters is a 60-line if-branched function
File: [packages/server/src/notify_bridge_server/services/dispatch_helpers.py:350-405](../../packages/server/src/notify_bridge_server/services/dispatch_helpers.py#L350)
Split into _filter_favorites, _apply_order_and_limit, _strip_details_and_tags.
---
## LOW
### [L-1] from .database.models import * in main.py
File: [packages/server/src/notify_bridge_server/main.py:26](../../packages/server/src/notify_bridge_server/main.py#L26)
Comment is honest about purpose, but explicit imports or a single module import is clearer.
### [L-2] None comparisons
All comparisons verified to use is None via grep — no findings.
### [L-3] Magic numbers
Constants are well-named throughout (_TG_429_MAX_ATTEMPTS, _MAX_PENDING_PER_TRACKER, DEBOUNCE_SECONDS, etc.). Only nit: seconds=30 literal in scheduler.schedule_bot_polling could be promoted.
### [L-4] noqa E712 repeated 8+ times for SQLModel boolean comparisons
Switch to .is_(True) for SQLAlchemy idiom, or add E712 to project ruff config.
### [L-5] _check_same_origin is best-effort by design
Acceptable.
### [L-6] _normalize_host strips IPv6 zone IDs silently
File: [packages/core/src/notify_bridge_core/notifications/ssrf.py:105-106](../../packages/core/src/notify_bridge_core/notifications/ssrf.py#L105)
Debug log when stripping changes the host would help diagnose.
### [L-7] _compute_jitter cap of 30s might be tight on hourly polls
File: [packages/server/src/notify_bridge_server/services/scheduler.py:91-105](../../packages/server/src/notify_bridge_server/services/scheduler.py#L91)
Revisit if jitter-collision becomes a real-world issue.
### [L-8] SmtpConfig repr may leak password
File: [packages/server/src/notify_bridge_server/services/notifier.py:205-213](../../packages/server/src/notify_bridge_server/services/notifier.py#L205)
If SmtpConfig is a vanilla dataclass, repr() will leak the password. Verify in notify_bridge_core.notifications.email.client — add field(repr=False) or a custom __repr__.
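If it does turn out to be a plain dataclass, the `field(repr=False)` option looks like this. `SmtpConfigSketch` and its fields are illustrative assumptions, not the real `SmtpConfig` definition.

```python
from dataclasses import dataclass, field


@dataclass
class SmtpConfigSketch:
    host: str
    username: str
    # Excluded from the generated __repr__, so logs and tracebacks stay clean.
    password: str = field(repr=False)
```

The attribute itself is unchanged; only the auto-generated `repr()` omits it.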
### [L-9] noqa BLE001 count is high
49 occurrences across 26 files. Each defensible; consider narrowing where possible.
### [L-10] _normalize_for_json does not handle UUID/Decimal
File: [packages/server/src/notify_bridge_server/services/deferred_dispatch.py:124-133](../../packages/server/src/notify_bridge_server/services/deferred_dispatch.py#L124)
No current consumer emits these, but a fallback str() for unknown types would prevent future breakage.
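A sketch of the suggested fallback follows. The function here is a simplified stand-in, not the existing `_normalize_for_json`; its handling of containers is an assumption.

```python
import json
import uuid
from decimal import Decimal


def normalize_for_json(value):
    """Recursively convert a value into something json.dumps accepts."""
    if value is None or isinstance(value, (str, int, float, bool)):
        return value
    if isinstance(value, dict):
        return {str(k): normalize_for_json(v) for k, v in value.items()}
    if isinstance(value, (list, tuple, set)):
        return [normalize_for_json(v) for v in value]
    # Fallback for UUID, Decimal, datetime, and any other unknown type.
    return str(value)


sample = {"id": uuid.UUID(int=1), "price": Decimal("1.50"), "tags": ("a", "b")}
encoded = json.dumps(normalize_for_json(sample))
```

Stringifying `Decimal` rather than converting to `float` keeps the stored value exact, at the cost of the consumer having to parse it back.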
---
## Approval Verdict
**Block** — CRITICAL findings (C-1 unstored task, C-2 missing rollback, C-3 unauthenticated body read, C-4 racy counters, C-5 secret-mask audit) must be fixed before declaring production-ready. Once those are addressed, the HIGH findings can land in a follow-up.
## Quick Wins (low effort, high value)
1. **Wrap every fire-and-forget `asyncio.create_task` in a module-level set**: search for `asyncio.create_task(` with no assignment. Definite hit: `ha_subscription.py:249`.
2. **Move webhook-secret check before `_read_bounded_body`** in Gitea + generic webhook handlers: 5-line move per endpoint, eliminates pre-auth resource exhaustion.
3. **Add an `asyncio.Lock` around `_poll_failure_counts` and `_target_failure_counts`** mutations: eliminates C-4.
4. **Split `migrations.py`**: mechanical refactor, ~1 hour, improves blame/review.
5. **Batch the receiver query in `backup_service.export_backup`**: single `IN (...)` query, ~10x faster.
6. **Replace `from .database.models import *`** with explicit imports: small clarity win.
+714
@@ -0,0 +1,714 @@
# Bugs + Missing Features — Production-Readiness Review
Repo: `c:\Users\Alexei\Documents\service-to-notification-bridge` (v0.8.1 baseline)
Date: 2026-05-22
Scope: full repo (backend Python/FastAPI, Svelte 5 frontend, providers + dispatchers + bot commands)
---
## Executive summary
- **The code is in much better shape than typical pre-1.0 code.** Quiet-hours,
SSRF, JWT, secret redaction, rate-limit fan-out caps, partition-by-media-kind,
parse_mode retry, scheduler misfire-grace, Prometheus metrics, deep
healthcheck, and per-receiver render cache are all already implemented and
well-tested.
- **The single biggest shipping risk is webhook idempotency.** Gitea, Planka,
and the generic webhook endpoint all dispatch on every POST regardless of
redelivery — there is no `X-Gitea-Delivery` / `X-Hub-Delivery` dedup table.
An upstream retry storm sends the same notification N times.
- **The deferred-dispatch drain has a duplicate-send window** if the process
  dies between `dispatcher.dispatch()` returning and `session.commit()`:
  the row stays `pending` and the periodic catch-up scan re-drains it.
- **Telegram update offset (`_last_update_id`) is in-memory only** — on
restart, the bot replays already-handled updates or skips ones Telegram
has discarded. Combined with no per-update idempotency, this is a
duplicate-command surface.
- **Several Telegram features are silently unsupported**: forum threads
(`message_thread_id`), bot-blocked-by-user detection (403 → keep retrying
forever), and inline-button callback queries. None blocks shipping today
but each is a near-term ask from any real user.
- **No template versioning / dry-run / playground** — every template edit is
immediately live. There is no way to validate a new template against a
sample payload before flipping the switch, and no rollback path.
- **Frontend lacks bulk operations and import/export of templates+targets.**
An operator with 30 trackers cannot bulk-toggle, bulk-edit, or move a
template across users.
---
## Part A — Bugs and reliability issues
Severity legend: **CRITICAL** = data loss / duplicate user-visible messages /
silent stop-shipping; **HIGH** = wrong behavior under realistic conditions;
**MEDIUM** = degrades UX or operability; **LOW** = polish.
### CRITICAL
#### A1. Webhook redelivery causes duplicate notifications (no idempotency)
**Location**: `packages/server/src/notify_bridge_server/api/webhooks.py:156`
(`gitea_webhook`), `:225` (`planka_webhook`), `:427` (`generic_webhook`).
**Scenario**: Gitea retries a webhook after 30s if the bridge returns 5xx,
times out under load, or if the operator clicks "Test Delivery" twice. Every
retry produces a fresh notification because the handlers never check
`X-Gitea-Delivery` (Gitea's per-delivery UUID), nor do they record any
event_id/hash for `parse_generic_webhook` events.
**Fix**: Add a `webhook_delivery` table with `(provider_id, delivery_id)`
unique constraint and `created_at`. Insert before dispatch (`INSERT OR IGNORE`
on SQLite, `ON CONFLICT DO NOTHING` on Postgres); if the insert is a no-op,
return `{"ok": true, "skipped": "duplicate"}`. For Gitea use the
`X-Gitea-Delivery` header; for Planka use a hash of `event_type +
payload.id + payload.createdAt`; for generic webhooks use a configurable
JSONPath expression to derive an idempotency key, falling back to a SHA256 of
the raw body. TTL prune older than 7 days.
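
The insert-before-dispatch step can be sketched with the standard-library `sqlite3` module. This is an illustration under assumptions: the column types, `claim_delivery`, and `delivery_key` are hypothetical, and only the SQLite path (`INSERT OR IGNORE`) is shown.

```python
import hashlib
import sqlite3


def init_schema(conn: sqlite3.Connection) -> None:
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS webhook_delivery (
            provider_id INTEGER NOT NULL,
            delivery_id TEXT    NOT NULL,
            created_at  TEXT    NOT NULL DEFAULT CURRENT_TIMESTAMP,
            UNIQUE (provider_id, delivery_id)
        )
        """
    )


def delivery_key(headers: dict, raw_body: bytes) -> str:
    """Prefer Gitea's per-delivery UUID; fall back to a SHA-256 of the body."""
    lowered = {k.lower(): v for k, v in headers.items()}
    return lowered.get("x-gitea-delivery") or hashlib.sha256(raw_body).hexdigest()


def claim_delivery(conn: sqlite3.Connection, provider_id: int, key: str) -> bool:
    """Return True only the first time (provider_id, key) is seen."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO webhook_delivery (provider_id, delivery_id) "
        "VALUES (?, ?)",
        (provider_id, key),
    )
    conn.commit()
    # rowcount is 0 when the unique constraint turned the insert into a no-op.
    return cur.rowcount == 1
```

A handler would call `claim_delivery` first and return the `"skipped": "duplicate"` response when it yields `False`.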
#### A2. Deferred-dispatch drain can double-send on process crash
**Location**: `packages/server/src/notify_bridge_server/services/deferred_dispatch.py:721-758`.
**Scenario**: Inside `_process_row`, `dispatcher.dispatch()` actually
delivers the Telegram message (HTTP 200 returned, user phone buzzes).
The function then sets `row.status = "fired"` (line 734) but the surrounding
`session.commit()` (line 577) hasn't run yet. Process is killed (OOM,
SIGTERM during deploy, host reboot). On restart, `_run_deferred_drain_catchup`
re-fetches the still-`pending` row and dispatches it again — **the user gets
the same album twice**.
**Fix**: Either (a) record an outbound dedup key per-row before dispatch
(`row.dispatch_id = uuid4(); session.commit()` first), then ask the channel
client to send-or-no-op based on that ID; or (b) flip the row to a
`"in_flight"` state with a short timeout in a pre-dispatch transaction so a
restart sees it as poisoned and aborts. Option (a) is more correct but
needs per-channel cooperation; option (b) is the cheap fix.
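
Option (b) might look like the sketch below, again using `sqlite3` for a self-contained illustration. The two-column table and the `claim_row` / `process_row` helpers are assumptions, not the real `_process_row`. The trade-off: a crash between the claim and the send leaves the row `in_flight` and undelivered, so this swaps duplicate sends for at-most-once delivery until something re-queues stale claims.

```python
import sqlite3


def claim_row(conn: sqlite3.Connection, row_id: int) -> bool:
    """Flip pending -> in_flight and commit BEFORE any network send."""
    cur = conn.execute(
        "UPDATE deferred_dispatch SET status = 'in_flight' "
        "WHERE id = ? AND status = 'pending'",
        (row_id,),
    )
    conn.commit()
    return cur.rowcount == 1


def process_row(conn: sqlite3.Connection, row_id: int, send) -> str:
    if not claim_row(conn, row_id):
        return "skipped"  # already claimed, fired, or cancelled
    send(row_id)  # the process may die here; the row stays in_flight
    conn.execute(
        "UPDATE deferred_dispatch SET status = 'fired' WHERE id = ?",
        (row_id,),
    )
    conn.commit()
    return "fired"
```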
#### A3. Telegram update offset is in-memory only — restart replays or loses commands
**Location**: `packages/server/src/notify_bridge_server/services/telegram_poller.py:31`
(`_last_update_id: dict[int, int] = {}`).
**Scenario**: A user types `/random Family`. Telegram delivers update_id=4711.
The bridge processes the command, sends back the media, and crashes before
APScheduler ticks again. On restart, `_last_update_id` is empty, so we call
`getUpdates(offset=None)` → Telegram returns 4711 again → we send the user
the same album a second time. Conversely, if Telegram's 24-hour retention
expired during a long outage, we silently skip pending updates.
**Fix**: Persist last_update_id in DB (`telegram_bot.last_update_id` column).
Combine with A2-style command idempotency by inserting
`(bot_id, update_id)` into a dedup table before processing.
### HIGH
#### A4. Telegram "bot blocked by user" / "chat not found" never short-circuits
**Location**: `packages/core/src/notify_bridge_core/notifications/telegram/client.py`
(`send_message`, `_upload_media`, etc.). Errors with
`error_code == 403` (Forbidden, "Bot was blocked by the user") and 400
"chat not found" / "user is deactivated" are returned as failures but
never recorded so the receiver gets removed/disabled.
**Scenario**: A user blocks the bot. Every scheduled "Good morning memory"
fires a sendMessage that Telegram instantly 403s. Bridge logs an error,
moves on, repeats forever. The bridge_self target-failure counter eventually
fires but the underlying receiver is never disabled. With many such chats
the operator has no easy cleanup path.
**Fix**: In the dispatcher, on `error_code in (403, 400 with description
matching "chat not found"/"user is deactivated")`, automatically set
`TelegramChat.commands_enabled = False` and either flag the receiver as
`disabled` with reason `blocked_by_user` or surface it via a new
`/admin/blocked-chats` view. Also stop further retries that round.
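
The error classification itself is small. A sketch, assuming the dispatcher has the Bot API `error_code` and `description` at hand; `is_terminal_chat_error` is a hypothetical helper name.

```python
import re

# Descriptions named in this finding; matched case-insensitively because the
# exact capitalization of Telegram's error text is not guaranteed here.
_TERMINAL_400 = re.compile(r"chat not found|user is deactivated", re.IGNORECASE)


def is_terminal_chat_error(error_code: int, description: str) -> bool:
    """True when retrying this receiver cannot succeed without user action."""
    if error_code == 403:
        return True
    return error_code == 400 and bool(_TERMINAL_400.search(description))
```

Only when this returns `True` would the dispatcher disable the receiver and skip remaining retries for that round; other 400s (for example parse errors) stay on the normal failure path.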
#### A5. Telegram forum-thread (topic) routing not supported
**Location**: telegram client never accepts/sends `message_thread_id`.
**Scenario**: Operator points the bridge at a group's "Releases" forum
topic. Today every message lands in the General topic instead — there is
no way to specify the topic. This is a hard requirement for any non-trivial
group install. Currently `reply_parameters` is the only thread-adjacent
field used; `message_thread_id` is silently absent.
**Fix**: Add an optional `message_thread_id` per-receiver (or per-target)
config, pass through `send_message`, `_upload_media`, and `_post_media_group`.
Auto-extract from incoming command updates' `message.message_thread_id` so
the bot can reply into the same topic.
#### A6. `bot.token` read after commit without refresh in webhook flow
**Location**: `packages/server/src/notify_bridge_server/commands/webhook.py:92-97`.
**Scenario**: The comment acknowledges "AsyncSession expires instances on
commit" and snapshots `bot_id`/`bot_token` before commit, but `await
session.refresh(bot)` is also called after the commit. If `session.refresh`
fails (e.g. row was deleted by an admin concurrently — bot rotation), the
exception is caught as a warning and the rest of the handler still runs
using the stale local `bot_id`/`bot_token`. The window is small but real.
**Fix**: Remove the `session.refresh(bot)` since the snapshot already
covers everything the handler needs. The refresh adds risk for no gain.
#### A7. Deferred-dispatch coalescing has a JSON-mutation bug under concurrent defers
**Location**: `packages/server/src/notify_bridge_server/services/deferred_dispatch.py:307`
(`_find_pending_asset_rows`).
**Scenario**: Two near-simultaneous `assets_added` events for the same
`(link_id, collection_id)` from two upstream pollers (HA chat-bus +
periodic Immich). Both call `defer_event` concurrently. The two transactions
both see "no pending row", both `session.add(new_row)`, and SQLite cheerfully
inserts two rows. The drain then fires both, sending the same combined media
twice. Note that the partial UNIQUE index from v0.8.1 protects only the
`bridge_self` provider row, not the deferred queue.
**Fix**: Add a partial UNIQUE index `UNIQUE(link_id, collection_id, event_type)
WHERE status = 'pending'` on `deferred_dispatch`, then convert `defer_event`
to `INSERT ... ON CONFLICT (link_id, collection_id, event_type) DO UPDATE`
and merge `event_payload` inside the SQL or in a re-read+retry loop.
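
The re-read+retry variant can be sketched against SQLite as follows. The table is reduced to the relevant columns, and the `{"assets": [...]}` payload shape and this `defer_event` signature are assumptions made for illustration only.

```python
import json
import sqlite3

SCHEMA = """
CREATE TABLE deferred_dispatch (
    id            INTEGER PRIMARY KEY,
    link_id       INTEGER NOT NULL,
    collection_id TEXT    NOT NULL,
    event_type    TEXT    NOT NULL,
    status        TEXT    NOT NULL DEFAULT 'pending',
    event_payload TEXT    NOT NULL
);
CREATE UNIQUE INDEX uq_deferred_pending
    ON deferred_dispatch (link_id, collection_id, event_type)
    WHERE status = 'pending';
"""


def defer_event(conn, link_id, collection_id, event_type, asset_ids) -> str:
    key = (link_id, collection_id, event_type)
    for _ in range(3):  # bounded retry if the pending row vanishes mid-merge
        try:
            with conn:  # commits on success, rolls back on error
                conn.execute(
                    "INSERT INTO deferred_dispatch "
                    "(link_id, collection_id, event_type, event_payload) "
                    "VALUES (?, ?, ?, ?)",
                    (*key, json.dumps({"assets": list(asset_ids)})),
                )
            return "inserted"
        except sqlite3.IntegrityError:
            pass  # the partial unique index says a pending row exists
        with conn:
            row = conn.execute(
                "SELECT id, event_payload FROM deferred_dispatch "
                "WHERE link_id = ? AND collection_id = ? AND event_type = ? "
                "AND status = 'pending'",
                key,
            ).fetchone()
            if row is None:
                continue  # drained between INSERT and SELECT; insert again
            row_id, raw = row
            assets = json.loads(raw)["assets"]
            assets += [a for a in asset_ids if a not in assets]
            conn.execute(
                "UPDATE deferred_dispatch SET event_payload = ? WHERE id = ?",
                (json.dumps({"assets": assets}), row_id),
            )
        return "merged"
    raise RuntimeError("could not defer event after retries")
```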
#### A8. Quiet-hours overnight window + DST transition can produce wrong fire_at
**Location**: `packages/server/src/notify_bridge_server/services/dispatch_helpers.py:121-128`.
**Scenario**: A user in `Europe/Minsk` (UTC+3, no DST anymore) who sets quiet
hours 22:00-06:00 is unaffected. For a user in a DST-observing zone (e.g.
`America/New_York`), on the "spring forward" night where 2:00 → 3:00, an
event arriving at 02:30 local time gets `end_today = now_local.replace(hour=6,
minute=0)`. But `.replace()` ignores DST adjustments, so the resulting
`datetime` may sit in the skipped hour or have ambiguous DST status. Two
hours later, the dispatcher sees the quiet window as "still active" or "30
min ago" depending on the system.
**Fix**: After `.replace(hour=t_end.hour, minute=t_end.minute, ...)`,
normalize the result by round-tripping through UTC with `astimezone`
(`zoneinfo` has no `localize`; that is a `pytz` API) and explicitly handle
the `fold=` parameter. Add tests using
`zoneinfo.ZoneInfo("America/New_York")` and known DST transition dates.
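
A sketch of the normalization step using only the standard library. `wall_time_today` is a hypothetical helper, and pinning `fold=0` (interpret a skipped time with the pre-transition offset, pick the first occurrence of a repeated time) is one possible policy, not necessarily the one the project wants.

```python
from datetime import datetime, time, timezone
from zoneinfo import ZoneInfo


def wall_time_today(now_local: datetime, t: time) -> datetime:
    """Today's occurrence of wall time `t`, normalized across DST changes."""
    candidate = now_local.replace(
        hour=t.hour, minute=t.minute, second=0, microsecond=0, fold=0
    )
    # Round-trip through UTC: a nonexistent wall time (spring-forward gap)
    # comes back as the equivalent valid instant after the jump.
    return candidate.astimezone(timezone.utc).astimezone(now_local.tzinfo)


ny = ZoneInfo("America/New_York")
now = datetime(2024, 3, 10, 1, 0, tzinfo=ny)   # spring-forward night
end = wall_time_today(now, time(2, 30))        # 02:30 does not exist that day
```

Here `end` lands on 03:30 local time with the daylight offset, which gives the dispatcher a real instant to compare against instead of a wall time inside the gap.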
#### A9. Quiet-hours `start == end` returns None — silently no quiet hours
**Location**: `packages/server/src/notify_bridge_server/services/dispatch_helpers.py:110-111`.
**Scenario**: User UI submits `quiet_hours_start = "00:00"` and
`quiet_hours_end = "00:00"`, thinking "all day quiet". The function returns
`None` (no quiet window) — the user gets pinged at 3am even though the UI
says "quiet hours enabled". Same code path eats malformed times silently.
**Fix**: Bubble up `ValueError`/`malformed input` to the API validator on
write so the user gets a 422 with a specific error message rather than
silently broken behavior. Define `00:00-00:00` as "always quiet" or reject
it explicitly with a clear error.
#### A10. Telegram `_truncate` cuts mid-HTML-tag → parse_mode fallback then loses formatting
**Location**: `packages/core/src/notify_bridge_core/notifications/telegram/client.py:144-149`
(`_truncate`).
**Scenario**: A template renders to 4090 chars and an
`<a href="https://...">...</a>` straddles the 4096-character boundary. The
truncate function takes a flat string slice, so the final character may be
inside a tag → Telegram returns 400 "can't parse entities" → the retry
strips parse_mode → the user sees `<a href="...">` literally in their chat.
**Fix**: Make `_truncate` HTML-aware: if the cut point lands inside a tag,
back it up to the start of that tag, OR strip incomplete tags after
truncating. A simpler intermediate fix: pop any unclosed `<a>` /`<b>`/`<i>`
detected by a regex over the truncated string.
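
One way to combine both ideas, as a rough sketch rather than a replacement for the real `_truncate`: back off a partial tag, then drop everything from the first element left unclosed. `truncate_html` is a hypothetical name, lengths are counted in Python characters, and the input is assumed to be already-escaped Telegram HTML (no literal `<` outside tags).

```python
import re

_TAG = re.compile(r"<(/?)([a-zA-Z][\w-]*)[^<>]*>")


def truncate_html(text: str, limit: int, marker: str = "…") -> str:
    if len(text) <= limit:
        return text
    cut = text[: limit - len(marker)]
    # 1) If the slice ends inside a tag, back up to where that tag starts.
    last_open = cut.rfind("<")
    if last_open > cut.rfind(">"):
        cut = cut[:last_open]
    # 2) Track open tags; anything still open at the end is unclosed.
    stack = []
    for m in _TAG.finditer(cut):
        closing, name = m.group(1), m.group(2).lower()
        if closing:
            if stack and stack[-1][0] == name:
                stack.pop()
        else:
            stack.append((name, m.start()))
    if stack:
        # Drop the earliest unclosed element and everything after it.
        cut = cut[: stack[0][1]]
    return cut + marker
```

This loses the text of a link that straddles the limit, but the result always parses, so the parse_mode fallback is never triggered by truncation.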
#### A11. JSON-payload depth/size hardened in backup, not in webhooks
**Location**: `packages/server/src/notify_bridge_server/api/webhooks.py:43-71`
(`_read_bounded_body` only caps total bytes).
**Scenario**: Generic webhook accepts a 999KB payload (under the 1MB cap)
but with 50 levels of nesting. `json.loads` succeeds, then
`parse_generic_webhook` evaluates JSONPath expressions in a loop and the CPU
spends seconds chasing pointers. Multiple concurrent malicious requests can
peg the event loop.
**Fix**: Reuse the depth/node guards from
`packages/server/src/notify_bridge_server/services/backup_service.py`
(JSON depth cap 10, node count cap 100k). Either share the helper or
re-implement around `json.loads(object_pairs_hook=...)`.
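
A post-parse guard is the simplest form of this check. The sketch below is not the `backup_service.py` helper (its actual interface is not shown in this review); `check_json_limits`, `parse_bounded`, and `PayloadTooComplex` are hypothetical names, with the caps taken from the numbers above.

```python
import json


class PayloadTooComplex(ValueError):
    """Raised when a JSON document exceeds the depth or node budget."""


def check_json_limits(value, max_depth: int = 10, max_nodes: int = 100_000) -> None:
    # Iterative walk, so the guard itself cannot hit the recursion limit.
    stack = [(value, 1)]
    nodes = 0
    while stack:
        item, depth = stack.pop()
        nodes += 1
        if nodes > max_nodes:
            raise PayloadTooComplex(f"more than {max_nodes} nodes")
        if isinstance(item, dict):
            children = item.values()
        elif isinstance(item, list):
            children = item
        else:
            continue  # scalars have no children
        if depth > max_depth:
            raise PayloadTooComplex(f"nesting deeper than {max_depth}")
        stack.extend((child, depth + 1) for child in children)


def parse_bounded(raw: bytes, **limits):
    try:
        value = json.loads(raw)
    except RecursionError as exc:  # pathological nesting inside the parser
        raise PayloadTooComplex("nesting too deep to parse") from exc
    check_json_limits(value, **limits)
    return value
```

The webhook handler would call `parse_bounded` right after `_read_bounded_body` and map `PayloadTooComplex` to a 4xx response before any JSONPath evaluation runs.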
#### A12. Generic-webhook `auth_mode="none"` with `acknowledge_unauthenticated` is per-provider, not per-user
**Location**: `packages/server/src/notify_bridge_server/api/webhooks.py:294-323`.
**Scenario**: v0.8.1 added the `acknowledge_unauthenticated=true` opt-in,
but it's only stored in `provider.config` JSON. Per-provider storage would
suffice for a multi-user install where one user accepts unauthenticated
webhooks and another doesn't. But
because anyone with the webhook URL can also infer the token (URLs are not
secret in real deployments; they end up in upstream config files, logs,
build artifacts), `auth_mode="none"` is dangerous beyond "explicit opt-in":
an attacker who guesses the path can DoS the rate limiter by burning the
60/min budget.
**Fix**: Refuse to even create a `webhook` provider with `auth_mode="none"`
in production unless a separate environment guard
`NOTIFY_BRIDGE_ALLOW_UNAUTHENTICATED_WEBHOOKS` is set; AND drop the rate
limit to 10/min for `auth_mode="none"` providers.
#### A13. `_extract_retry_after` returns int but Telegram `retry_after` is fractional
**Location**: `packages/core/src/notify_bridge_core/notifications/telegram/client.py:59-78`.
**Scenario**: Modern Telegram sometimes returns `retry_after` as a float
(e.g. `1.5`). The current code does `int(group(1))` and `isinstance(ra,
(int, float))`. Regex `\d+` only matches integers. So a `1.5s` retry-after
becomes "no retry-after found" → fallback 1s sleep → retry too early → second
429 → eventually the bounded retry budget runs out.
**Fix**: Loosen the regex to `\d+(?:\.\d+)?` and `float(m.group(1))`,
preserve fractional via `await asyncio.sleep(retry_after + 1)` with float.
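
A sketch of the loosened parser. The real `_extract_retry_after` signature is not shown here, so this version assumes it receives the decoded Bot API error body (with the optional `parameters.retry_after` field and a `description` string).

```python
import re
from typing import Optional

_RETRY_AFTER = re.compile(r"retry after (\d+(?:\.\d+)?)", re.IGNORECASE)


def extract_retry_after(payload: dict) -> Optional[float]:
    """Return the server-requested delay in seconds, keeping fractions."""
    params = payload.get("parameters") or {}
    ra = params.get("retry_after")
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(ra, (int, float)) and not isinstance(ra, bool):
        return float(ra)
    match = _RETRY_AFTER.search(str(payload.get("description", "")))
    return float(match.group(1)) if match else None
```

The caller can then `await asyncio.sleep(delay + 1)` with the float value directly.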
#### A14. APScheduler date-job collision when two windows end at the exact same second
**Location**: `packages/server/src/notify_bridge_server/services/scheduler.py:1127-1132`
(`_drain_job_id_for`). The job id is keyed on `YYYYMMDDHHMMSS`. Comment in
code acknowledges "two trackers... seconds different ... would collide", but
two windows ending at the exact same second still collide on a single job id,
and `replace_existing=True` silently drops the second.
**Scenario**: 30 users with quiet_hours_end=`07:00`. All 30 windows end at
the same wall-clock second. Only one drain job is scheduled. That single
job fires `drain_deferred_due()` which scans all rows globally so all 30
get drained — actually fine. **But** if the global drain function ever
filters by user/tracker (a likely near-term change for multi-tenant), the
collision becomes silent data loss.
**Fix**: Either keep the global drain (and document the assumption) or
add a tracker_id segment to the job_id and let APScheduler dedup naturally.
#### A15. `_handle_webhook_conflict` reclaim races against a parallel admin action
**Location**: `packages/server/src/notify_bridge_server/services/telegram_poller.py:163-218`.
**Scenario**: Admin clicks "Switch to webhook mode" in the UI, which sets
`update_mode=webhook` and calls `set_webhook(...)`. Concurrently, the next
poll tick for the same bot hits the conflict, calls `delete_webhook` → the
admin's webhook is wiped 1s after they set it. The poll tick checks
`bot.update_mode != "polling"` *before* the conflict reclaim, but the
reload is best-effort and the conflict reclaim path runs unconditionally
once entered.
**Fix**: Re-check `bot.update_mode == "polling"` inside
`_handle_webhook_conflict` before calling `delete_webhook`; or take an
advisory lock on the bot row for the duration of the mode flip.
#### A16. Discord 2000-char split breaks on Unicode codepoint boundaries
**Location**: `packages/core/src/notify_bridge_core/notifications/discord/client.py:60-80`
(`_split_message`).
**Scenario**: A template renders to 2050 chars with emoji at position
1998-1999 (each emoji is 2 surrogates / multi-byte UTF-8). The split uses
`text.rfind("\n", 0, limit)` and falls back to character index `limit`,
which is a Python str index → that part is OK in CPython 3, but if the
content contains a grapheme cluster (emoji + zero-width-joiner + skin tone),
slicing at `limit` mid-cluster renders as the broken emoji "□" in Discord.
**Fix**: Use a grapheme-cluster boundary library (e.g. `regex` module with
`\X`) or at minimum back off to the previous whitespace if `limit` is
inside a likely cluster.
### MEDIUM
#### A17. Per-target failure counter does not distinguish receivers within a target
**Location**: `packages/server/src/notify_bridge_server/services/event_dispatch.py:311-333`.
**Scenario**: A target has 10 receivers. 1 chat is blocked, 9 work. Today
`maybe_emit_target_failure` is called for the target — but the success
counter (`record_target_success`) is also called for the same target on the
other 9. Net counter behavior depends on call order. With the
default-threshold 5, this oscillates.
**Fix**: Track success/failure per receiver, not per target; or only call
`maybe_emit_target_failure` when `all` receivers failed for the target.
#### A18. `_cleanup_old_events` does not delete cancelled `DeferredDispatch` rows
**Location**: `packages/server/src/notify_bridge_server/services/scheduler.py:332-364`.
**Scenario**: The daily cleanup deletes `EventLog`, `WebhookPayloadLog`,
`ActionExecution`. Cancelled / fired / dropped `DeferredDispatch` rows live
forever in the DB. Active install with chatty providers accumulates millions
of rows; eventually the `_load_pending_drain_jobs` query, `_trim_queue_if_needed`,
and the catch-up scan all degrade.
**Fix**: Add `delete(DeferredDispatch).where(status.in_(["fired", "dropped",
"cancelled"]), fired_at < cutoff)` to the cleanup.
#### A19. `random.shuffle(shuffled)` in `_sort_assets` uses non-deterministic seed
**Location**: `packages/server/src/notify_bridge_server/services/dispatch_helpers.py:317-320`.
**Scenario**: Two identical events arriving in close succession (deferred-
dispatch merge, then drain re-renders) shuffle into different orders. With
the deferred-dispatch coalescing logic, this produces a visual "they're not
the same album" surprise in the chat history.
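**Fix**: Seed a local `random.Random` with a stable per-event digest, e.g. a
SHA-256 of `event.event_type.value + event.collection_id + event.timestamp.isoformat()`.
Built-in `hash()` is salted per process for strings, so it would not stay
stable across restarts.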
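
A sketch of a deterministic shuffle keyed on the event. `stable_shuffle` and its flat string arguments are assumptions; the digest comes from `hashlib` so the seed is identical in every process.

```python
import hashlib
import random


def stable_shuffle(items, event_type: str, collection_id: str, timestamp_iso: str):
    """Shuffle a copy of `items` in an order fixed by the event identity."""
    material = "|".join((event_type, collection_id, timestamp_iso)).encode()
    seed = int.from_bytes(hashlib.sha256(material).digest()[:8], "big")
    shuffled = list(items)
    # A private Random instance leaves the global RNG state untouched.
    random.Random(seed).shuffle(shuffled)
    return shuffled
```

Two renders of the same event then produce the same album order, while different events still get different orders.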
#### A20. `_poll_tracker` swallows the exception and logs it via `_LOGGER.error`, not `_LOGGER.exception`
**Location**: `packages/server/src/notify_bridge_server/services/scheduler.py:657-666`.
**Scenario**: An exception in `check_tracker` is logged as `_LOGGER.error("Error
polling tracker %d: %s", tracker_id, e)` — no traceback. Production debugging
of "why is tracker 42 silently broken since yesterday" requires the stack.
**Fix**: Change to `_LOGGER.exception("Error polling tracker %d", tracker_id)`.
#### A21. Long bot commands → `/help` reply > 4096 chars truncates without warning
**Location**: `packages/server/src/notify_bridge_server/commands/handler.py:521-532`,
combined with `send_reply` → `send_telegram_message` → `_truncate` to 4096.
**Scenario**: A user with 20 enabled commands runs `/help`. Each command +
description (RU) crosses 250 chars → 5000 chars total → truncated mid-command.
The user sees a half-list that suggests we forgot half the commands.
**Fix**: Split `/help` over multiple messages by command category (provider).
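
A sketch of per-category splitting with a length guard. The `sections` mapping (provider name to pre-rendered help lines) and `split_help` are hypothetical; a single line longer than the limit would still overflow and needs separate handling.

```python
def split_help(sections: dict, limit: int = 4096) -> list:
    """One message per category, continued in extra messages when too long."""
    messages = []
    for category, lines in sections.items():
        current = category  # each message starts with its category header
        for line in lines:
            candidate = f"{current}\n{line}"
            if len(candidate) > limit:
                messages.append(current)
                current = f"{category} (cont.)\n{line}"
            else:
                current = candidate
        messages.append(current)
    return messages
```

Each returned string is sent as its own reply, so no message reaches `_truncate` at its limit.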
#### A22. `parse_command` truncates to 512 chars — long search queries lost
**Location**: `packages/server/src/notify_bridge_server/commands/parser.py:15`.
**Scenario**: `/search a very long query containing emoji 🎉 and more text that
the user really meant to send because they pasted a long string from somewhere…`
gets clipped to 512 chars silently. The trailing count parser then operates
on the truncated text, possibly extracting a count from mid-query.
**Fix**: Either reject `>512` with `parse_command` returning a sentinel
"too_long" tuple, or just stop truncating — the Telegram limit is already
4096 and we already truncate the response side.
#### A23. Periodic catch-up scan can dispatch a stale event payload
**Location**: `packages/server/src/notify_bridge_server/services/deferred_dispatch.py:628`
(`_process_row`).
**Scenario**: An `assets_added` event is deferred at 22:00. At 06:00 the
quiet window ends, drain re-fetches `link_data`. The assets in `event_payload`
include URLs and asset metadata. But the user has since deleted those photos
from Immich. The dispatcher tries to download → 404. Notification shows
"5 photos added to Album X" but the actual media fails to attach.
**Fix**: For `assets_added`, re-validate asset existence against the
provider before dispatch (one batched `getAssets` call). Drop missing IDs
from the event, mark with "delivered_after_quiet_hours" + extra hint
`"missing_count": N` in details. For deferred windows >12h this is the
right behavior; for shorter windows the lookup is wasted work, so gate on
`(now - deferred_at) >= timedelta(hours=6)`.
#### A24. Watcher / scheduler restart can lose adaptive polling state
**Location**: `packages/server/src/notify_bridge_server/services/scheduler.py:67-88`
(`_adaptive_state: dict`).
**Scenario**: Module-level dict resets on restart. A tracker that had ramped
up to 1-in-4 ticks goes back to every-tick polling. Over a fleet of 50
trackers in steady-state idle, this triggers a thundering herd of every-tick
polls right after deploy. Combined with no DB-level rate limiting on the
upstream Immich/Gitea API, it can rate-limit the operator out of their own
services for ~5min.
**Fix**: Either persist the adaptive state in `notification_tracker_state`
(cheap on shutdown via `atexit`) or stagger the initial ticks via
APScheduler's `next_run_time` instead of relying on the existing jitter.
#### A25. `defer_event` `return "cancelled"` logic is incorrect in some merge paths
**Location**: `packages/server/src/notify_bridge_server/services/deferred_dispatch.py:444`.
**Scenario**: The `cancelled` return branch checks `upd_added is None or
upd_added.status == "cancelled"` AND same for `upd_removed`. But if both
`upd_added` and `upd_removed` are `None` (i.e. there were no pending rows
to begin with), `fully_cancelled` is `False` → returns "merged". That's
fine. But the more subtle issue: an "insert" action with one of the rows
being cancelled returns "merged" — should be "inserted". The dashboard
"merged" status confuses the operator looking at why no defer row exists.
**Fix**: Rewrite as a clearer state machine: distinguish "inserted",
"merged_into_existing", "fully_cancelled".
#### A26. `_fetch_bytes` and `_safe_get` honor only 3 redirects with no Retry-After awareness
**Location**: `packages/core/src/notify_bridge_core/notifications/telegram/client.py:217-268`.
**Scenario**: Immich behind a CDN can chain `302 → 302 → 200`. With 4 hops
it falls through to "Too many redirects". A user complains "old photos
suddenly missing in notifications".
**Fix**: Bump to 5 redirects and surface the chain in the error string for
easier debugging.
#### A27. No structured event log filter UI for "show me all drops in the last hour"
**Location**: `packages/server/src/notify_bridge_server/api/status.py`.
`event_log` rows have a `details.dispatch_status` field, but no API filter
exposes it. The frontend can filter only globally on `event_type`.
**Scenario**: An operator sees "messages are missing today". They want to
filter event_log to `dispatch_status in (dropped_quiet_hours_nondeferrable,
deferred_then_dropped, deferred_then_failed)`. Today they can't.
**Fix**: Add `dispatch_status` and `dispatched=true|false` as first-class
event_log columns (denormalized from `details`), plus API + UI filter.
#### A28. `_render_cmd_template` falls back to `"[No template: X]"` user-visible text
**Location**: `packages/server/src/notify_bridge_server/commands/handler.py:111-115`.
**Scenario**: An operator removes a template slot by mistake. The next user
who runs `/random` sees `[No template: response_random]` in chat. Not just
ugly — it leaks internal slot names.
**Fix**: Show a friendly "Sorry, something went wrong on our side" + log at
error level. Better: refuse to disable the slot if it's referenced.
### LOW
#### A29. `_truncate`'s ellipsis can land inside a multi-byte char
The marker `"…"` is one Unicode codepoint (3 bytes UTF-8) but the truncate
counts characters, not bytes. Telegram counts UTF-16 code units, so for a
4090-char message ending in emoji, the calculation is off by a small constant.
Won't break sends but messages may end up slightly longer than `TELEGRAM_MAX_TEXT_LENGTH`
allows. Re-measure in UTF-16 code units (`len(s.encode('utf-16-le')) // 2`).
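
A sketch of measuring and truncating in UTF-16 code units. `utf16_len` and `truncate_utf16` are hypothetical helpers, not the existing `_truncate`.

```python
def utf16_len(text: str) -> int:
    """Length in UTF-16 code units (astral characters count as two)."""
    return len(text.encode("utf-16-le")) // 2


def truncate_utf16(text: str, limit: int, marker: str = "…") -> str:
    if utf16_len(text) <= limit:
        return text
    budget = limit - utf16_len(marker)
    kept = []
    used = 0
    for ch in text:
        units = 2 if ord(ch) > 0xFFFF else 1
        if used + units > budget:
            break  # never split a surrogate pair
        kept.append(ch)
        used += units
    return "".join(kept) + marker
```

For `"a🎉"`, `len` reports 2 while `utf16_len` reports 3, which is the small constant this finding describes.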
#### A30. `NotificationDispatcher._render_cache` set to fresh dict on every dispatch — comment says "reuse"
The instance attribute `self._render_cache` is reset to `{}` at the start
of every `_send_to_target` (line 245). The cache only helps across receivers
within one target, not across targets. The comment at line 111-115 implies
broader reuse. Either align comment with reality or actually share across
targets within one `dispatch()` call.
#### A31. Frontend `entity-cache.svelte.ts` doesn't propagate stale-cache errors
The shared `$state`-based caches return stale data silently if the underlying
fetch fails after a successful initial load. A user sees old target list
during an outage and is confused why edits aren't sticking.
---
## Part B — Missing functionality and "cool feature" gaps
Tier legend: **must-have** = blocks prod for any non-trivial install;
**nice-to-have** = clear value, ship in next minor; **aspirational** = ship
when v1.0+ slows down.
Effort: **S** ≈ 1-2 days; **M** ≈ 1 week; **L** ≈ 2+ weeks.
### Already in the backlog (post-v0.8.1 status check)
#### B1. Target-level quiet hours (per-target DND, multi-window, days-of-week, silent mode)
**Status**: Still missing in v0.8.1. The backlog item proposed a v1 cut
(target-level windows + `silent` mode for Telegram = `disable_notification=True`).
None of the proposed code paths exist:
- `notification_target.quiet_hours_json` column — not present.
- `disable_notification=True` plumbing through `TelegramClient.send_message`
— not present.
- Days-of-week filter — not present.
**Pitch**: Quiet hours bind to the *watcher* (tracking config); users want
DND at the *destination*. "Don't ping my phone at night, regardless of
which provider".
**Who benefits**: Every user. Today they have to recreate per-link windows.
**Effort**: **M** (1 week — backend dispatcher gate + frontend Aurora-style fieldset).
**Tier**: **must-have for prod**.
#### B2. Immich Smart Actions expansion (auto-favorite by person, auto-archive, share-link rotation)
**Status**: Auto-Organize exists; no other action descriptors are shipped.
**Pitch**: Reuse the existing action descriptor pipeline. Auto-favorite-by-person
is the smallest cut.
**Effort**: **M** per action (a few days each).
**Tier**: nice-to-have.
#### B3. Block-based template builder
**Status**: Not started. `JinjaEditor` is unchanged.
**Effort**: **L** — frontend-only but big.
**Tier**: aspirational.
### Newly identified — must-have for prod
#### B4. Webhook delivery dedup table + "Test Delivery" replay
**Pitch**: Add the dedup table from A1, plus a `/api/webhooks/{provider_id}/replay/{delivery_id}`
endpoint that admin can hit to re-dispatch a stored payload without the upstream
provider needing to resend. Combined with the existing `WebhookPayloadLog`,
this is "click to retest" in the UI.
**Who benefits**: Every webhook provider. Replay is invaluable for debugging
template edits.
**Effort**: **M**.
**Tier**: **must-have for prod**.
#### B5. "Send test message" / template playground
**Pitch**: From the template editor, click "Try this template against the
last received event" → render preview, optionally send to a sandbox chat.
Bypass dispatch but exercise the full Jinja pipeline.
**Who benefits**: Every template edit today is a leap of faith — the operator
modifies the template, waits for the next real event, hopes nothing breaks.
**Effort**: **S-M**. The preview infrastructure already exists
(`services/sample_context.py`); add a "send to chat X" button.
**Tier**: **must-have for prod**.
#### B6. Template versioning + rollback
**Pitch**: Auto-snapshot each template on save (last 10 revisions). UI shows
diff between version N and N-1, "Restore" button. Same for command templates.
**Who benefits**: An operator who tweaks a template at midnight and goofs
the syntax needs an undo button.
**Effort**: **M**. New `template_revision` table; new endpoints; UI button.
**Tier**: **must-have for prod**.
#### B7. Bulk operations on trackers / targets / links
**Pitch**: Multi-select in lists → "disable selected", "delete selected",
"export selected templates as JSON bundle", "move to user X".
**Who benefits**: Operators with >10 trackers. A common pain point: deploying
the bridge for a new family member requires N clicks per tracker.
**Effort**: **M** (frontend-heavy).
**Tier**: **must-have for prod**.
#### B8. Bot blocked / chat-not-found auto-disable + dashboard
**Pitch**: Detect Telegram 403 / 400 chat-related errors. Mark the receiver
or `TelegramChat` as `disabled_by_remote`. Surface in a "Stale receivers"
admin view with a "Try resending invite" / "Delete chat" button.
**Who benefits**: Every Telegram user. Today the bridge silently sprays
errors until a human looks.
**Effort**: **S**.
**Tier**: **must-have for prod**.
#### B9. Forum-thread (topic) routing for Telegram
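**Pitch**: Per-receiver `message_thread_id` field, auto-detected from incoming
command messages. UI: when adding a chat that's a forum (`getChat`'s
`is_forum`), show a topic selector. The Bot API has no method that lists a
chat's topics (`getForumTopicIconStickers` only returns the icon stickers),
so populate the selector from `message_thread_id` values seen in incoming
updates.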
**Who benefits**: Any group install where the user wants notifications in a
dedicated topic.
**Effort**: **M**.
**Tier**: **must-have for prod**.
#### B10. Telegram inline buttons + callback queries
**Pitch**: Templates can declare `{% buttons %}` with action descriptors.
Bridge listens for `callback_query` updates, dispatches to a registered
action (e.g. "Mark album as favorite", "Snooze this tracker for 1h", "Run
HA service light.turn_off").
**Who benefits**: Power users. Foundation for several other features
(Immich duplicate-cluster review, HA action button → service call, snooze).
**Effort**: **L**.
**Tier**: nice-to-have but unlocks the next 3 items.
#### B11. User snooze / mute via bot command
**Pitch**: `/snooze 1h` mutes the bot's outbound chat for 1h.
`/mute provider gitea` mutes a whole provider for that chat. `/wake` undoes.
Implemented as a per-receiver `snoozed_until` column.
**Effort**: **S-M**.
**Tier**: **must-have for prod** (user-side relief valve).
### Newly identified — nice-to-have
#### B12. Per-target / per-user rate limit (send-side)
**Pitch**: Cap outbound messages per minute per receiver. Existing 429
backoff handles Telegram's limit, but a runaway template / event-storm
provider can still spray the user's phone with 200 messages.
**Effort**: **S**. Token bucket per chat_id in `_send_telegram`.
**Tier**: nice-to-have.
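
A token bucket of the kind proposed in B12 could look like this. The class, the default numbers, and the injectable `clock` are illustrative assumptions; nothing here is taken from `_send_telegram`.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate_per_min` tokens/min."""

    def __init__(self, rate_per_min: float, capacity: int, clock=time.monotonic):
        self.rate = rate_per_min / 60.0  # tokens per second
        self.capacity = capacity
        self.tokens = float(capacity)
        self.clock = clock
        self.updated = clock()

    def allow(self) -> bool:
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# Hypothetical per-chat registry keyed on chat_id.
_buckets: dict = {}


def allow_send(chat_id: int, rate_per_min: float = 20, capacity: int = 5) -> bool:
    bucket = _buckets.setdefault(chat_id, TokenBucket(rate_per_min, capacity))
    return bucket.allow()
```

Messages rejected by `allow_send` could be dropped, queued, or coalesced; that policy is a separate decision from the limiter itself.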
#### B13. Message dedup window (idempotency key per outbound message)
**Pitch**: SHA256 of `(target_id, receiver_id, rendered_message,
event_collection_id)`. If the same key was sent in the last 5min, skip.
**Effort**: **S**.
**Tier**: nice-to-have (lots of overlap with A1+A2 but addresses the
end-of-pipeline dedup, after all coalescing).
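
The key derivation and window check can be sketched in memory as below. `DedupWindow` is a hypothetical class; a multi-process deployment would need the seen-set in the database or another shared store instead.

```python
import hashlib
import time


class DedupWindow:
    """Suppress identical outbound messages within `ttl_seconds`."""

    def __init__(self, ttl_seconds: float = 300, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._seen: dict = {}

    @staticmethod
    def key(target_id, receiver_id, rendered_message, event_collection_id) -> str:
        # Unit-separator joins avoid accidental collisions between fields.
        material = "\x1f".join(
            str(part)
            for part in (target_id, receiver_id, rendered_message, event_collection_id)
        )
        return hashlib.sha256(material.encode()).hexdigest()

    def should_send(self, *fields) -> bool:
        now = self.clock()
        # Prune expired keys so the map stays bounded by recent traffic.
        self._seen = {k: ts for k, ts in self._seen.items() if now - ts < self.ttl}
        k = self.key(*fields)
        if k in self._seen:
            return False
        self._seen[k] = now
        return True
```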
#### B14. Weekly digest / per-target stats / per-provider error rate
**Pitch**: Cron-based weekly summary email/Telegram. "Top 5 noisy trackers",
"Receivers with >X% failure rate", "Top 5 days of the week with the most
activity". Operator preventive maintenance.
**Effort**: **M**.
**Tier**: nice-to-have.
#### B15. Mobile-friendly minimal mode for the SPA
**Pitch**: The Aurora redesign is a lot for mobile. A "manage from phone"
minimal layout — list of trackers, click to toggle, click to mute. Stops
operators from needing a desktop to silence a chatty tracker at 1am.
**Effort**: **M**.
**Tier**: nice-to-have.
#### B16. Audit log of admin actions
**Pitch**: New `audit_log` table. Every create/update/delete on
`NotificationTracker`, `NotificationTarget`, `TemplateConfig`, `ServiceProvider`,
`TelegramBot`, `User`, etc. writes a row with `(user_id, action,
entity_type, entity_id, before_json, after_json, ip, ua)`. Admin UI tab.
**Effort**: **M**. SQLAlchemy event listeners on the affected models.
**Tier**: nice-to-have for multi-admin installs; must-have if any
compliance requirement applies.
#### B17. Health → not just /ready, but per-component status page
**Pitch**: `/api/health/components` returns `{providers: [{id, last_ok_at,
last_error}], targets: [{id, last_ok_at, last_error}], scheduler:
{job_count, next_fires}}`. Frontend "Status" tab.
**Effort**: **S-M**. The data is already in `EventLog` / scheduler API.
**Tier**: nice-to-have.
#### B18. Provider unreachable backoff + escalation
**Pitch**: Today `bridge_self` emits `bridge_self_poll_failures` after N
consecutive fails. Add (a) exponential backoff on the polling interval after
M failures so we don't hammer a down host, and (b) recovery notification
when the provider comes back.
**Effort**: **S**.
**Tier**: nice-to-have.
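
Parts (a) and (b) reduce to two small pure functions. The sketch below is in TypeScript for consistency with the other snippets (the poller itself is Python); the function names, the doubling factor, and the cap are assumptions chosen for illustration.

```ts
// Hypothetical helper: polling interval to use after `consecutiveFailures` failed polls.
function nextPollIntervalSec(
  baseIntervalSec: number,
  consecutiveFailures: number,
  failureThreshold: number, // "M" in the pitch
  maxIntervalSec: number,   // assumed upper bound so the provider is still probed
): number {
  if (consecutiveFailures < failureThreshold) return baseIntervalSec;
  // Double the interval for every failure at or beyond the threshold.
  const exponent = consecutiveFailures - failureThreshold + 1;
  return Math.min(maxIntervalSec, baseIntervalSec * Math.pow(2, exponent));
}

// A recovery notification is due on the first success that follows a streak
// long enough to have triggered backoff.
function shouldNotifyRecovery(previousFailures: number, failureThreshold: number): boolean {
  return previousFailures >= failureThreshold;
}
```

Capping the interval keeps recovery detection bounded: with a 60s base and a one-hour cap, a provider that comes back is noticed within an hour at worst.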
#### B19. RSS provider
**Pitch**: Generic RSS/Atom feed poller. One more provider, reuses event_dispatch.
Long-tail value (operator wants "notify me when a blog publishes").
**Effort**: **M**.
**Tier**: nice-to-have.
#### B20. Mobile push / FCM channel
**Pitch**: A dedicated FCM "Receiver" type so the user can ship their own
companion app. Today Telegram is the only realtime channel; email is too
slow; webhook out is for plumbing.
**Effort**: **L**.
**Tier**: aspirational.
### Newly identified — aspirational
#### B21. Conversation threading per source (one notification thread per album / repo)
**Pitch**: Use Telegram `reply_parameters` to chain all notifications about
"Album X" as a single thread that grows over time. Today every notification
is a top-level message. Threading turns the chat into a navigable history.
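**Effort**: **M**. Store `last_message_id` per `(target_id, collection_id)`,
pass it as `reply_parameters.message_id` (Bot API 7.0 replaced the standalone
`reply_to_message_id` parameter with `reply_parameters`).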
**Tier**: aspirational but a clear differentiator.
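
A minimal sketch of the bookkeeping, in TypeScript for consistency with the other snippets. The in-memory `Map` stands in for the proposed per-`(target_id, collection_id)` storage, and `buildPayload` / `recordSent` are hypothetical names; only the `reply_parameters: { message_id }` shape comes from the Telegram Bot API.

```ts
// Subset of the sendMessage payload relevant to threading.
type SendMessagePayload = {
  chat_id: string;
  text: string;
  reply_parameters?: { message_id: number };
};

// Stand-in for persistent storage of the last message per thread.
const lastMessageIds = new Map<string, number>();

function threadKey(targetId: number, collectionId: string): string {
  return `${targetId}:${collectionId}`;
}

// Build the payload, replying to the previous message about the same source if any.
function buildPayload(
  chatId: string,
  text: string,
  targetId: number,
  collectionId: string,
): SendMessagePayload {
  const previous = lastMessageIds.get(threadKey(targetId, collectionId));
  return previous === undefined
    ? { chat_id: chatId, text }
    : { chat_id: chatId, text, reply_parameters: { message_id: previous } };
}

// Call after a successful send with the message_id Telegram returned.
function recordSent(targetId: number, collectionId: string, messageId: number): void {
  lastMessageIds.set(threadKey(targetId, collectionId), messageId);
}
```

The first notification for a source goes out as a top-level message; every later one replies to its predecessor, which forms the chain described in the pitch.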
#### B22. A/B test variants for templates
**Pitch**: A template config can carry 2 variants. The dispatcher
hash-routes receivers to A or B; the dashboard shows "variant A's response
time / click rate / receiver mute rate".
**Effort**: **L**.
**Tier**: aspirational.
#### B23. Dark-launch a new template before enabling it
**Pitch**: "Send-to-sandbox-chat-only" toggle on a template config. The new
template renders against real events but only goes to one operator's chat
for 1 week. Then promote to production.
**Effort**: **M**. Builds on template versioning (B6).
**Tier**: aspirational.
#### B24. Scheduled template changes
**Pitch**: "On 2026-12-25 at 09:00, switch template_config X to draft Y".
Useful for holiday-themed greetings or batch migrations.
**Effort**: **M**.
**Tier**: aspirational.
#### B25. HA service-call from a Telegram inline button
**Pitch**: Building on B10. A template renders `{% button hass:light.turn_off
target=living_room %}`. User clicks → bridge calls HA `light.turn_off`.
**Effort**: **M** (after B10).
**Tier**: aspirational.
---
## Ship-blocker checklist (do not widen the user audience without these)
Order is rough priority (top first). Most are also called out in Part A.
1. **A1** — Webhook idempotency table (Gitea/Planka/generic). Without this,
one upstream retry storm can double-/quadruple-spray every user.
2. **A2** — Deferred-dispatch crash window. A redeploy mid-drain duplicates
every queued notification. Implement either the `dispatch_id`
pre-commit OR the `in_flight` state machine.
3. **A3** — Persist Telegram update offset. Same root cause class as A1/A2;
matters less if A1+A2 are fixed but should land together.
4. **A4 / B8** — Bot blocked / chat-not-found auto-disable. A user blocking
the bot must not generate infinite errors.
5. **A11** — Webhook JSON depth/node cap (mirror the backup guard).
6. **A9** — Quiet-hours `start == end` confirmation; either accept "always
quiet" semantics or reject in the API validator.
7. **A8** — DST handling in quiet-hours overnight window. Verify with
tests that include known transition timestamps.
8. **B5** — "Send test message" / template playground. Without this, every
template edit is a blind change against a live system.
9. **B6** — Template versioning + rollback. Pair with B5.
10. **A5 / B9** — Forum-thread (topic) routing. Any non-trivial Telegram
group install needs this.
11. **B11** — User snooze / mute via bot command. Relief valve when the
bridge gets too chatty.
12. **B7** — Bulk operations on trackers / targets / links. Operability
floor for any install with >10 trackers.
Everything else in Part B is upside, not a blocker.
+682
View File
@@ -0,0 +1,682 @@
# Frontend Production-Readiness Review
Scope: `frontend/src/**` (~26k lines, Svelte 5 runes + SvelteKit). `npm run check`
passes with exit code 0. The codebase is in good shape overall - i18n EN/RU keys
are 1:1 in sync (1466 each), Modal/Snackbar overlays follow the `position:fixed`
+ `z-index:9999` convention, no `eval`, no `innerHTML`, no string-interpolated
`setTimeout`, and the sanitizer (`lib/sanitize.ts`) is a sound DOMParser-based
allowlist. The issues below are real production risks layered on top of an
otherwise clean architecture.
## Executive Summary
- **Auth tokens live in `localStorage`** (`lib/api.ts`). Any XSS that bypasses
the (good) `sanitizePreview` allowlist - or sneaks past it via a future code
path - exfiltrates both access and refresh tokens. There is no httpOnly-cookie
alternative, no token rotation on refresh failure, and `redirectToLogin` only
fires once per session (a leaked refresh token can outlive that flag).
- **One real provider-hardcoding violation** (`routes/actions/RuleEditor.svelte`)
breaks the "descriptors only" rule in CLAUDE.md item 8 and silently disables
the people/album picker for any non-Immich provider - every other page is
clean.
- **Caches duplicated into local `$state`** on `notification-trackers`,
`command-trackers`, and `command-template-configs` pages - the cache is
populated but the page never re-reads it, so cross-page mutations (search
palette pre-warming) won't update the list and cache `invalidate()` becomes
useless. Convention #4 says "always use cache".
- **Three CRUD pages refetch all entities after every mutation** (full
`await load()` after upsert/delete) instead of using `cache.upsert()`/
`remove()` - defeats the optimistic-cache design and produces visible flicker
on slow connections.
- **Floating async work + N+1 patterns**: `providers/+page.svelte` fires N
parallel health checks without an AbortController (state writes continue
after navigation); `bots/TelegramBotTab.svelte` does a sequential
`for (const trk of trackers) { await api('/listeners') }` loop.
- **`backup/+page.svelte` post-restart health poll** keeps recursing for up to
120s with no unmount guard - if the user navigates away mid-restart, the
recursive `setTimeout` chain keeps calling `fetch('/api/health')` until it
reloads the page out from under whatever route they're on.
- **`api()` 30s timeout is per-request, hard-coded, with no observability** -
long-running provider operations (Immich bulk fetch, full backup export) hit
it silently and surface as `AbortError` with no telemetry.
---
## CRITICAL
### C1. JWT tokens stored in `localStorage` - XSS-exfiltratable
[lib/api.ts:78-91](frontend/src/lib/api.ts#L78-L91)
```ts
function getToken(): string | null {
return localStorage.getItem('access_token');
}
export function setTokens(access: string, refresh: string) {
localStorage.setItem('access_token', access);
localStorage.setItem('refresh_token', refresh);
}
```
Both the short-lived access token and the long-lived refresh token sit in
`localStorage`. Any successful XSS - including a future template-preview path
that escapes `sanitizePreview`, a vulnerable third-party CodeMirror extension,
or a Telegram bot username that ends up unescaped somewhere - reads both with a
single `localStorage.getItem` call.
**Fix:** Move to httpOnly + Secure + SameSite=Strict cookies set by the backend.
If a cookie-based session is infeasible for the deployment model, at minimum
move the refresh token to an httpOnly cookie and keep only the short-lived
access token in memory (a module-level `let accessToken` is XSS-readable but
not persistent across reloads, which limits the exfiltration window).
### C2. Provider type hardcoded in `RuleEditor.svelte` (convention violation)
[routes/actions/RuleEditor.svelte:55-67](frontend/src/routes/actions/RuleEditor.svelte#L55-L67)
```ts
async function loadProviderData() {
if (actionType !== 'auto_organize') return;
const provider = providersCache.items.find((p: any) => p.id === providerId);
if (!provider || provider.type !== 'immich') return;
...
```
CLAUDE.md item 8 explicitly forbids `if (type === 'immich')` in components -
this is the canonical example. As written, adding a second provider with
auto-organize support (Google Photos, future SmugMug, etc.) is a silent no-op:
the form renders with empty people/album lists and gives no error.
**Fix:** Add an `actionTypes` / `peopleFilter` capability flag to
`ProviderDescriptor`, or add a `supportsAutoOrganize: boolean` discriminator,
then check `getDescriptor(provider.type)?.supportsAutoOrganize` instead of the
literal string.
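
A minimal sketch of the discriminator variant. The descriptor shape, the registry contents, and `canLoadProviderData` are assumptions for illustration; the real `ProviderDescriptor` carries more fields and the real registry lives elsewhere.

```ts
// Hypothetical, trimmed-down descriptor; only the new capability flag matters here.
interface ProviderDescriptor {
  type: string;
  supportsAutoOrganize: boolean;
}

// Illustrative registry; the flag values are made up for the example.
const descriptors = new Map<string, ProviderDescriptor>([
  ['immich', { type: 'immich', supportsAutoOrganize: true }],
  ['gitea', { type: 'gitea', supportsAutoOrganize: false }],
]);

function getDescriptor(type: string): ProviderDescriptor | undefined {
  return descriptors.get(type);
}

// Replaces the `provider.type !== 'immich'` guard with a capability check.
function canLoadProviderData(actionType: string, providerType: string | undefined): boolean {
  if (actionType !== 'auto_organize' || providerType === undefined) return false;
  return getDescriptor(providerType)?.supportsAutoOrganize ?? false;
}
```

With this shape, a second auto-organize provider only needs its descriptor flag set; `RuleEditor.svelte` stays untouched.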
---
## HIGH
### H1. Caches imported but copied into local `$state` - invalidation no-op
[routes/notification-trackers/+page.svelte:33](frontend/src/routes/notification-trackers/+page.svelte#L33)
[routes/command-trackers/+page.svelte:27](frontend/src/routes/command-trackers/+page.svelte#L27)
[routes/command-template-configs/+page.svelte:51](frontend/src/routes/command-template-configs/+page.svelte#L51)
```ts
// notification-trackers - line 33
let allNotificationTrackers = $state<Tracker[]>([]);
// ...
[allNotificationTrackers] = await Promise.all([
api<Tracker[]>('/notification-trackers'),
...
]);
```
The cache modules expose `notificationTrackersCache`, `commandTrackersCache`,
and `commandTemplateConfigsCache` - populated by `+layout.svelte` on mount and
by the search palette - but these three pages don't read from them. They each
issue their own `api(...)` call and store the result locally. Side effects:
1. The cache shows stale data on every other page that reads it (dashboard nav
counts, search palette).
2. `commandTemplateConfigsCache.fetch(true)` is called on `command-template-configs`
`load()` but the result is then re-assigned from the function return value
into `allCmdTplConfigs` - the cache itself is updated, but the page has no
reactive link to it.
3. `cache.upsert()` / `cache.remove()` after mutations would short-circuit a
full refetch - but with the local-state copy, every save triggers a full
`await load()` (see H2).
**Fix:** Replace `let allX = $state([])` with `let allX = $derived(cache.items)`
(see how `targets/+page.svelte:147` does it correctly) and remove the parallel
`api()` call.
### H2. Full refetch after every mutation - cache.upsert/remove not used
[routes/providers/+page.svelte:238-250](frontend/src/routes/providers/+page.svelte#L238-L250)
[routes/actions/+page.svelte:139](frontend/src/routes/actions/+page.svelte#L139)
[routes/notification-trackers/+page.svelte:291](frontend/src/routes/notification-trackers/+page.svelte#L291)
[routes/targets/+page.svelte:476](frontend/src/routes/targets/+page.svelte#L476)
Every save/delete/toggle on these pages calls `cache.invalidate(); await load()`,
which re-fetches the entire list from the server. The cache exposes
`upsert(entity)` and `remove(id)` for exactly this case - the server already
returned the new entity (or 204), so the round-trip is wasted bandwidth and
produces a visible "list redraws" flash on slow links.
**Fix:** On POST/PUT response, `cache.upsert(savedEntity)`. On DELETE,
`cache.remove(id)`. Reserve `invalidate()` + `fetch()` for cases where the
mutation may have changed *other* entities (e.g. broadcast target updates
affect children).
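
The intended mutation path looks roughly like this. `EntityCache` is a hypothetical stand-in for the project's cache modules (which also handle `fetch()` and `invalidate()`); only the `upsert` / `remove` semantics are taken from the text above.

```ts
interface Entity {
  id: number;
}

// Hypothetical minimal cache showing upsert/remove without a refetch.
class EntityCache<T extends Entity> {
  items: T[] = [];

  // Replace the entity with the same id, or append it if it is new.
  // A new array is assigned so reactive consumers see a changed reference.
  upsert(entity: T): void {
    const exists = this.items.some((item) => item.id === entity.id);
    this.items = exists
      ? this.items.map((item) => (item.id === entity.id ? entity : item))
      : [...this.items, entity];
  }

  // Drop the entity after a successful DELETE.
  remove(id: number): void {
    this.items = this.items.filter((item) => item.id !== id);
  }
}
```

A save handler would then call `cache.upsert(await api(...))` and skip `load()` entirely, so the list updates in place with no second round trip.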
### H3. Provider health checks fire-and-forget - leak past navigation
[routes/providers/+page.svelte:175-181](frontend/src/routes/providers/+page.svelte#L175-L181)
```ts
for (const p of allProviders) {
health = { ...health, [p.id]: null };
api(`/providers/${p.id}/test`, { method: 'POST' })
.then((r: any) => { health = { ...health, [p.id]: r.ok }; })
.catch(() => { health = { ...health, [p.id]: false }; });
}
```
No `AbortController`, no unmount guard. If the user navigates away while N
slow Immich/Gitea probes are inflight, every probe still resolves and tries to
write to the (now-detached) `health` `$state`. With Svelte 5 runes this won't
crash, but it does waste backend connections (Immich health checks call the
real API) and may trigger duplicate probes on quick back/forward navigation.
**Fix:** Pass `{ signal: controller.signal }` to `api()` (already supported -
see `lib/api.ts:150`), abort in `onDestroy`. Or use `cache.probeAll()` driven
from a single store so revisiting the page reuses the previous result.
### H4. Sequential awaits for independent fetches - N+1 in TelegramBotTab
[routes/bots/TelegramBotTab.svelte:215-223](frontend/src/routes/bots/TelegramBotTab.svelte#L215-L223)
```ts
const trackers = await api<CommandTrackerSummary[]>('/command-trackers');
const matched: CommandTrackerSummary[] = [];
for (const trk of trackers) {
try {
const listeners = await api<ListenerEntry[]>(`/command-trackers/${trk.id}/listeners`);
const hasBot = listeners.some(...);
if (hasBot) matched.push(trk);
} catch (e) { console.warn(...); }
}
```
For a deployment with 20 command trackers, opening the listener section on a
bot triggers 20 serial `GET /command-trackers/{id}/listeners` requests -
visibly slow over a high-latency link.
**Fix:** Either expose a single backend endpoint
(`GET /command-trackers/listeners?bot_id=X`) or run the loop through
`Promise.all(trackers.map(trk => api(...).catch(() => null)))` and filter
afterwards.
### H5. Post-restart health poll keeps running after unmount
[routes/settings/backup/+page.svelte:117-139](frontend/src/routes/settings/backup/+page.svelte#L117-L139)
```ts
async function applyAndRestart(): Promise<void> {
await api('/backup/apply-restart', { method: 'POST' });
restartingOverlay = true;
const startedAt = Date.now();
let attempts = 0;
const poll = async (): Promise<void> => {
attempts += 1;
try {
const res = await fetch('/api/health');
if (res.ok && Date.now() - startedAt > 2000) {
window.location.reload();
return;
}
} catch { /* still down */ }
if (attempts < 120) setTimeout(poll, 1000);
};
setTimeout(poll, 1500);
}
```
The recursive `setTimeout(poll, 1000)` chain has no cancellation. If the user
navigates to another route between `apply-restart` and the next health probe,
the chain keeps firing for up to 120s and eventually calls
`window.location.reload()` from a route the user has since moved away from.
Side effects:
1. Unauthenticated `fetch('/api/health')` calls keep going while the user is
on `/login`.
2. A user who hit "restart later" on a different tab will still get reloaded
from the original tab's poll.
**Fix:** Capture `controller = new AbortController()` and pass to `fetch`,
`onDestroy(() => controller.abort())`. Also store the timeout handle and
`clearTimeout` it on destroy.
### H6. Token refresh races with logout in a sneaky edge case
[lib/api.ts:97-127](frontend/src/lib/api.ts#L97-L127)
The dedupe via `refreshPromise` is correct *for the refresh itself*, but the
outer `api()` reads `getToken()` before awaiting `refreshAccessToken()`. Three
concurrent requests that all 401 will all queue on the same refresh promise,
then *all* retry - fine. But if the refresh succeeds and an unrelated
`clearTokens()` (from `logout()`) fires between the refresh resolving and the
retry running, the retry uses an empty `Authorization: Bearer ` header. The
result is "ApiError: HTTP 401" surfaced via snackbar even though the redirect
to `/login` already happened.
**Fix:** Either re-check `isAuthenticated()` immediately before the retry, or
make `clearTokens()` cancel an inflight `refreshPromise`.
### H7. `AuthRedirectError` is thrown but not consistently caught
[lib/api.ts:165-170](frontend/src/lib/api.ts#L165-L170)
Most pages use the pattern `catch (err: unknown) { snackError(errMsg(err)); }` -
which catches `AuthRedirectError` too and shows "Unauthorized - redirecting
to login" in a snackbar that the user sees *as* the route changes. The error
class exists specifically to be distinguished, but only one or two call sites
actually check `instanceof AuthRedirectError` before showing a snackbar.
**Fix:** Make `errMsg()` (or a new helper) return `null` for `AuthRedirectError`
and have snackbar helpers ignore null messages. Or filter in the snackbar
store.
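
One way the first option could look. `AuthRedirectError` here is a local stand-in for the class in `lib/api.ts`, and `errMsgOrNull`, `snackError`, and the `shown` array are hypothetical names used to keep the sketch self-contained.

```ts
// Stand-in for the class exported by lib/api.ts.
class AuthRedirectError extends Error {}

// Returns null for errors the user should never see in a snackbar.
function errMsgOrNull(err: unknown): string | null {
  if (err instanceof AuthRedirectError) return null;
  return err instanceof Error ? err.message : String(err);
}

// Stand-in for the snackbar store: collects what would be displayed.
const shown: string[] = [];

function snackError(message: string | null): void {
  if (message === null) return; // redirect already in progress, stay quiet
  shown.push(message);
}
```

Existing call sites keep their `catch (err) { snackError(errMsgOrNull(err)); }` shape, so the filtering happens in one place instead of at every `catch`.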
### H8. `api()` JSON-decode failure surfaces the raw parser error
[lib/api.ts:189](frontend/src/lib/api.ts#L189)
```ts
return res.json();
```
When the backend, or a misconfigured reverse proxy in front of it, answers with a
success status but a non-JSON body (for example an HTML error page), `res.json()`
rejects with a `SyntaxError: Unexpected token < in JSON at position 0`. The page
shows the raw parser message in a snackbar, which is confusing UX.
**Fix:** Wrap `res.json()` in try/catch and throw a typed `ApiError("Backend
returned non-JSON response", 502)` so the UI can show a clean message.
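
A sketch of the wrapper, assuming `ApiError` takes `(message, status)` as the fix suggests; the class below is a local stand-in, and `parseJsonBody` is a hypothetical helper. It parses from text so the failure path is synchronous and easy to test.

```ts
// Stand-in for the project's ApiError; the real constructor may differ.
class ApiError extends Error {
  status: number;

  constructor(message: string, status: number) {
    super(message);
    this.name = 'ApiError';
    this.status = status;
  }
}

// Parse a response body, converting parser failures into a typed ApiError.
function parseJsonBody<T>(text: string): T {
  try {
    return JSON.parse(text) as T;
  } catch {
    throw new ApiError('Backend returned non-JSON response', 502);
  }
}
```

Inside `api()` the last line would become something like `return parseJsonBody<T>(await res.text());`.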
### H9. Email/Matrix bot tabs strip secrets via `as any`
[routes/bots/EmailBotTab.svelte:84](frontend/src/routes/bots/EmailBotTab.svelte#L84)
[routes/bots/MatrixBotTab.svelte:79](frontend/src/routes/bots/MatrixBotTab.svelte#L79)
```ts
if (!body.smtp_password) delete (body as any).smtp_password;
if (editingMatrix && !body.access_token) delete (body as any).access_token;
```
The `as any` bypass exists because the body type doesn't allow `delete` on a
required field. The intent - "don't send a blank secret which would overwrite
the stored one" - is correct, but the cast hides a real risk: if the field
name ever changes (`smtp_password` -> `smtpPassword`), the `delete` is a no-op
and the blank field is sent.
**Fix:** Build `body` as `Partial<...>` from the start and only conditionally
include the secret field.
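
A sketch for the email case. Only `smtp_password` comes from the source; the other form fields and the `buildEmailBotBody` helper are hypothetical.

```ts
// Hypothetical form shape; only smtp_password is taken from the real component.
interface EmailBotForm {
  name: string;
  smtp_host: string;
  smtp_password: string;
}

// Request body: same fields, but the secret is optional.
type EmailBotBody = Omit<EmailBotForm, 'smtp_password'> & { smtp_password?: string };

// Include the secret only when the user actually typed one.
function buildEmailBotBody(form: EmailBotForm): EmailBotBody {
  const { smtp_password, ...rest } = form;
  return smtp_password ? { ...rest, smtp_password } : rest;
}
```

Because the field is destructured by name, renaming it on the form type becomes a compile error here instead of a silent no-op `delete`.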
### H10. `template-configs` hardcodes a slot name
[routes/template-configs/+page.svelte:228](frontend/src/routes/template-configs/+page.svelte#L228)
```ts
.map(s => ({ key: s.name, label: ..., rows: s.name === 'message_assets_added' ? 10 : 3, isDateFormat: false }))
```
Special-casing one Immich slot name inside a provider-agnostic component is
the same pattern CLAUDE.md item 8 forbids for components, scoped to template
configs. Other providers' "large" slots (Gitea PR descriptions, Planka card
content) would render in 3-row editors that the author probably didn't intend.
**Fix:** Add a `rows?: number` field to the backend slot definition and read
it via `notification_slots[].rows`.
---
## MEDIUM
### M1. Three placeholder strings hardcoded English in shared components
[lib/components/EntitySelect.svelte:18](frontend/src/lib/components/EntitySelect.svelte#L18)
[lib/components/IconGridSelect.svelte:16](frontend/src/lib/components/IconGridSelect.svelte#L16)
[lib/components/MultiEntitySelect.svelte:16](frontend/src/lib/components/MultiEntitySelect.svelte#L16)
```ts
placeholder = 'Select...',
```
These defaults render `Select...` in RU locale when a caller doesn't pass an
explicit placeholder. The convention (CLAUDE.md item 5) prescribes plain text
selectors but says nothing about translation - these still need to flow through
`t()`.
**Fix:** Drop the hardcoded default from the `$props()` destructuring and
resolve the fallback in the template instead: `placeholder ??
t('common.selectPlaceholder')`, with `common.selectPlaceholder` added to
both locales.
### M2. `EntitySelect.noneLabel` defaults to a decorative em-dash literal
[lib/components/EntitySelect.svelte:20](frontend/src/lib/components/EntitySelect.svelte#L20)
```
noneLabel = (em-dash literal),
```
CLAUDE.md item 5 calls out decorative dashes specifically. `LinkedTargetsSection`
already overrides this with `t('common.noneDefault')` (good), but other
consumers that do not override get the bare em-dash. It also fails the
localizable smell test.
**Fix:** Default to `t('common.none')`.
### M3. `lib/auth.svelte.ts` logout does a full page reload, losing UX continuity
[lib/auth.svelte.ts:54-61](frontend/src/lib/auth.svelte.ts#L54-L61)
```ts
export function logout() {
clearTokens();
clearAllCaches();
user = null;
if (typeof window !== 'undefined') {
window.location.href = '/login';
}
}
```
`window.location.href` triggers a hard reload - the SvelteKit router exists
specifically to avoid this. Side effects: any inflight requests get cancelled
without proper cleanup, the splash-loader flashes between the two pages, and
the search-palette / overlays do not get a chance to close gracefully.
**Fix:** `goto('/login', { invalidateAll: true, replaceState: true })`.
### M4. `+layout.svelte` auto-expand `$effect` writes during read
[routes/+layout.svelte:336-342](frontend/src/routes/+layout.svelte#L336-L342)
The effect reads `expandedGroups` (via `expandedGroups[entry.key]`) and writes
to `expandedGroups`. Svelte 5 dedupes the write back to the same set of keys,
but the pattern is fragile - adding any side effect that re-derives from
`expandedGroups` here would loop. It also persists to localStorage in
`toggleGroup` but not from this effect - so auto-expansion stays in memory only.
**Fix:** Compute the next state in a single pass and write once; either
include the localStorage save, or move the auto-expand into the initial
hydration block.
### M5. `commandTemplateConfigsCache.fetch(true)` result discarded; cache populated but unused
[routes/command-template-configs/+page.svelte:208](frontend/src/routes/command-template-configs/+page.svelte#L208)
The `Promise.all` destructures `cfgs` from `commandTemplateConfigsCache.fetch(true)`
but then writes `allCmdTplConfigs = cfgs` instead of $derived-reading the cache.
The cache is updated (good) but this page never reads it (bad - see H1).
**Fix:** Same fix as H1 - use `$derived(commandTemplateConfigsCache.items)`.
### M6. Dashboard search debounce timeout not cleared on filter change
[routes/+page.svelte:268-272](frontend/src/routes/+page.svelte#L268-L272)
If the user changes the type/provider filter (`applyFilters` runs synchronously
from the `$effect` at line 249) while a search debounce is pending, the pending
timeout still fires 300ms later and triggers an identical request. Not a leak,
just a wasted call.
**Fix:** Clear `searchTimeout` from `applyFilters()` as well.
### M7. Dashboard `Promise.all` destructure uses empty middle slot
[routes/+page.svelte:283-287](frontend/src/routes/+page.svelte#L283-L287)
```ts
const [statusRes, , chartRes] = await Promise.all([
api<DashboardStatus>(`/status?limit=${eventsLimit}`),
providersCache.fetch(),
api<{ days: ... }>('/status/chart'),
]);
```
The empty middle slot is brittle - anyone reordering for readability silently
swaps `statusRes` and `chartRes`. Trivially avoided.
**Fix:** Either await `providersCache.fetch()` separately (it caches anyway),
or `const [statusRes, _providers, chartRes] = ...` with an explicit `_providers`
local.
### M8. `actions/+page.svelte` derives `actionTypes` from a function-in-derived
[routes/actions/+page.svelte:78-81](frontend/src/routes/actions/+page.svelte#L78-L81)
```ts
let actionTypes = $derived((() => {
const caps = capabilitiesCache.items[selectedProviderType];
return caps?.action_types || [];
})());
```
The IIFE is unnecessary; `$derived` already runs the expression on every
dependency change. Reads as a refactor leftover.
**Fix:** `let actionTypes = $derived(capabilitiesCache.items[selectedProviderType]?.action_types ?? []);`
### M9. `RuleEditor.svelte` mutates rule object in `toggleRule` then sends to API
[routes/actions/RuleEditor.svelte:105-108](frontend/src/routes/actions/RuleEditor.svelte#L105-L108)
```ts
async function toggleRule(rule: ActionRule) {
rule.enabled = !rule.enabled;
await updateRule(rule);
}
```
Direct mutation of the prop violates the immutability rule (coding-style.md).
If the API call fails, the local state is already flipped - the UI shows the
new value even though the server still has the old one.
**Fix:** `await updateRule({ ...rule, enabled: !rule.enabled })`. After
successful response, `await loadRules()` (already happens) re-syncs.
### M10. `+layout.svelte` filter functions use `as any[]` four times
[routes/+layout.svelte:145-151](frontend/src/routes/+layout.svelte#L145-L151)
```ts
notification_trackers: filterById(notificationTrackersCache.items as any[]).length,
```
The cast exists because `filterById<T extends { provider_id?: number }>` is
narrower than the cache item types. The proper fix is a single base interface
`{ provider_id?: number }` on the relevant types so the cast goes away.
### M11. `setLocale` does not update `<html lang>` attr
[lib/i18n/index.svelte.ts:31-36](frontend/src/lib/i18n/index.svelte.ts#L31-L36)
Screen readers and browser translation extensions rely on `<html lang="en">`.
The app never sets it, so switching to RU leaves accessibility tooling thinking
the page is still English.
**Fix:** `document.documentElement.lang = locale` in `setLocale`.
### M12. `Modal.svelte` focus restore does not verify element still in DOM
[lib/components/Modal.svelte:43-45](frontend/src/lib/components/Modal.svelte#L43-L45)
If the previously focused element has been removed from the DOM between modal
open and close (common with optimistic UI updates that rerender the source
button), `.focus()` is a silent no-op on a detached node. Focus ends up on
`<body>` and the next Tab restarts from the top of the page.
**Fix:** `if (... && document.contains(previouslyFocused)) previouslyFocused.focus()`,
else focus a sensible fallback (the trigger that opened the page).
### M13. TimezoneSelector ticks at 1s - wakes the event loop forever
[lib/components/TimezoneSelector.svelte:33-37](frontend/src/lib/components/TimezoneSelector.svelte#L33-L37)
```ts
let tickHandle: ReturnType<typeof setInterval> | null = null;
onMount(() => {
tickHandle = setInterval(() => { now = new Date(); }, 1000);
});
```
A 1Hz tick is fine for visible UI; the issue is it keeps running even when
the selector dropdown is closed (the time display is only visible when the
dropdown is open). Battery impact is non-trivial on mobile for what is
essentially a hidden component.
**Fix:** Start/stop the interval based on `open` state, or use
`requestAnimationFrame` driven by `IntersectionObserver`.
### M14. Backup file download builds blob from JSON without size guard
[routes/settings/backup/+page.svelte:269-281](frontend/src/routes/settings/backup/+page.svelte#L269-L281)
```ts
const data = await api(`/backup/files/${filename}`);
const blob = new Blob([JSON.stringify(data, null, 2)], { type: 'application/json' });
```
For a deployment with hundreds of providers/trackers, the JSON serialization
of the entire backup happens in-memory in a single string before the Blob
constructor - wasted memory peak and a frozen tab on slow machines. Worse,
`api()` parses the JSON and then `JSON.stringify` re-serializes it.
**Fix:** Use `fetchAuth()` for the download path and take the response body as
a Blob directly (`await res.blob()`), skipping the parse and re-serialize
round trip.
### M15. Modal focus-trap query selector includes disabled inputs
[lib/components/Modal.svelte:62-67](frontend/src/lib/components/Modal.svelte#L62-L67)
Re-querying the DOM on every Tab keystroke is fine, but the selector also
matches disabled inputs (common in long forms with submit-in-progress), so
focus can land on them.
**Fix:** Add `:not([disabled])` to the selector.
### M16. i18n `resolve` uses `any` for the recursion accumulator
[lib/i18n/index.svelte.ts:55-62](frontend/src/lib/i18n/index.svelte.ts#L55-L62)
```ts
function resolve(obj: any, path: string): string | undefined {
```
`obj: unknown` plus a runtime check would let TS narrow `current` properly and
catch the case where someone accidentally passes a `string` (returns undefined
silently today).
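
A sketch of the `unknown`-typed version. The body of the existing `resolve` is not shown in this review, so the dotted-path walk below is an assumption about what it does.

```ts
// Walk a dotted path through nested objects; return the leaf only if it is a string.
function resolve(obj: unknown, path: string): string | undefined {
  let current: unknown = obj;
  for (const key of path.split('.')) {
    // Runtime check lets TS narrow `current` before indexing into it.
    if (typeof current !== 'object' || current === null) return undefined;
    current = (current as Record<string, unknown>)[key];
  }
  return typeof current === 'string' ? current : undefined;
}
```

Passing a bare string, or a path that ends on a non-string leaf, now returns `undefined` through an explicit branch rather than by accident.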
### M17. Tracker name auto-set string concat - English-only
[routes/notification-trackers/+page.svelte:82-84](frontend/src/routes/notification-trackers/+page.svelte#L82-L84)
[routes/command-trackers/+page.svelte:69-71](frontend/src/routes/command-trackers/+page.svelte#L69-L71)
```ts
form.name = provider ? `${provider.name} Tracker` : 'Tracker';
form.name = provider ? `${provider.name} Commands` : 'Commands';
```
Defaults the tracker name to "Provider Name Tracker" / "Provider Name Commands",
English only. Russian users get an English suffix on the auto-generated
name. Inconsistent with the rest of the i18n discipline.
**Fix:** Use `t('notificationTracker.defaultName').replace('{name}', provider.name)`.
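
A self-contained sketch of the interpolation. The `messages` table, the Russian string, and this tiny `t()` are stand-ins for the real i18n module. The one deliberate difference from the one-liner above is the function replacer: with a string replacement, `String.prototype.replace` interprets sequences such as `$&` inside the provider name.

```ts
// Stand-in message catalogue; the real keys live in the EN/RU locale files.
const messages: Record<string, Record<string, string>> = {
  en: { 'notificationTracker.defaultName': '{name} Tracker' },
  ru: { 'notificationTracker.defaultName': 'Трекер {name}' },
};

let locale = 'en';

// Minimal lookup that falls back to the key itself.
function t(key: string): string {
  return messages[locale]?.[key] ?? key;
}

// Function replacer: the provider name is inserted literally, `$` sequences included.
function defaultTrackerName(providerName: string): string {
  return t('notificationTracker.defaultName').replace('{name}', () => providerName);
}
```

Keeping the whole phrase in the catalogue also lets each locale choose its own word order, which string concatenation cannot do.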
### M18. topbar-action store not cleared on auth state change
[routes/providers/+page.svelte:160-167](frontend/src/routes/providers/+page.svelte#L160-L167)
Each page sets a topbar CTA in `onMount` and clears it in `onDestroy`. If
`logout()` is called from inside the page (via the search palette, etc.), the
page never destroys cleanly and the topbar action sticks into the login screen.
Defensive `topbarAction.clear()` in `logout()` would plug this.
### M19. Many `: any` and `as any` types in critical paths
[routes/users/+page.svelte:62](frontend/src/routes/users/+page.svelte#L62)
[routes/command-trackers/+page.svelte:27](frontend/src/routes/command-trackers/+page.svelte#L27)
[routes/providers/+page.svelte:179](frontend/src/routes/providers/+page.svelte#L179)
[lib/providers/types.ts:120](frontend/src/lib/providers/types.ts#L120)
64 occurrences of `: any` / `as any` across 20 files. None are in
security-sensitive paths, but they remove type safety in exactly the call
sites that shape API requests (`body: any = { ... }`). Recommended cleanup
task, not a blocker.
---
## LOW
### L1. `+page.svelte` event types hardcoded in three parallel maps
[routes/+page.svelte:475-512](frontend/src/routes/+page.svelte#L475-L512)
`eventLabels`, `eventIcons`, and `eventGradients` are three parallel dicts
keyed by the same set of strings. Adding a new event type requires editing
three places (plus i18n). A single `EVENT_META` object would be more
maintainable.
### L2. `TestMenu.svelte` uses z-index 9998 instead of 9999
[routes/notification-trackers/TestMenu.svelte:25](frontend/src/routes/notification-trackers/TestMenu.svelte#L25)
```svelte
<div style="position:fixed; top:0; left:0; right:0; bottom:0; z-index:9998;"
```
The convention says 9999 for overlays. Using 9998 was probably intentional
(so the menu sits above the backdrop), but the cleaner pattern is to give the
backdrop a slightly lower stacking context inside the same parent.
### L3. `console.warn` left in production-bound code
14 `console.warn`/`console.error` occurrences. Most are guarded by a
"failed to load" + UI fallback - legitimate debug noise. Recommend wiring to
a structured logger before public release; current state is acceptable for an
internal tool but spam-prone in DevTools.
### L4. Dashboard `setTimeout(animateCount, 200)` is uncancelled
[routes/+page.svelte:290-299](frontend/src/routes/+page.svelte#L290-L299)
The 200ms delay before triggering count animations is uncancelled. Navigating
away during the first 200ms means the count animation `requestAnimationFrame`
chain still runs against a stale `status` reference. Cosmetic only.
### L5. `app.html` inline theme bootstrap reads `localStorage` without try/catch
[src/app.html:12](frontend/src/app.html#L12)
Theme is hydrated synchronously in `<head>` to avoid FOUC - fine - but if
localStorage is blocked (Safari private mode, some enterprise policies) the
inline script throws and the rest of the head bootstrap may be skipped.
### L6. EventChart computes activeTypes and hasData from same loop twice
[lib/components/EventChart.svelte:46-49](frontend/src/lib/components/EventChart.svelte#L46-L49)
`hasData` and `activeTypes` traverse the same data twice. Single-pass
derivation would be cheaper for the rare "many days of events" case.
### L7. Single-letter t shadowing in +layout.svelte
`+layout.svelte:140` uses `for (const t of targets)` inside `navCounts`, which
shadows the imported i18n function `t`. Svelte 5 does not flag it (inner scope
wins), but it confuses search/grep and breaks IDE go-to-definition. Several
other pages use single-letter `t` as iteration var (`actions/+page.svelte`,
`command-trackers/+page.svelte`, `targets/+page.svelte`). Recommend `target` /
`tracker` for legibility.
---
## Notes & non-findings
- **Modal overlay convention** (CLAUDE.md #2): Modal.svelte, Snackbar,
IconPicker, IconGridSelect, MultiEntitySelect, EntitySelect, TimezoneSelector,
EventChart, Hint, SearchPalette, and TestMenu all use `position:fixed` with
`z-index: 9999` (or 9998 for the TestMenu backdrop - see L2). Convention
upheld.
- **@html usage** - only three call sites, all pipe through `sanitizePreview`,
which is a DOMParser-based allowlist limited to `B`, `I`, `CODE`, `PRE`, `A`,
`BR` with `https?://` href validation. Safe.
- **i18n parity**: EN and RU JSON have the exact same 1466 keys - no orphans.
- **Selector placeholders**: `LinkedTargetsSection` correctly uses
`t('common.noneDefault')`, no em-dash leaks in user-facing flows (only
defaults inside shared components - see M1/M2).
- **svelte-check passes** (exit 0) - no type errors at the strict level the
project compiles with.
- **No eval, new Function, or string-setTimeout**: dynamic code execution
surface is clean.
- **No var declarations**, no `==` (loose equality) outside generated CSS.
- **AbortController usage**: present in `lib/api.ts` for the canonical fetch
wrapper - the rest of the codebase could lean on it more (see H3, H5).
+436
@@ -0,0 +1,436 @@
# Performance & Database Review — `service-to-notification-bridge`
**Scope:** entire repo at `c:\Users\Alexei\Documents\service-to-notification-bridge`
**Backend:** FastAPI + SQLAlchemy async + SQLModel on SQLite (Postgres-compatible URL, but only SQLite branch is exercised in code).
**Frontend:** SvelteKit 5 (runes) static build served by the same FastAPI process.
**Reviewer:** Claude Opus 4.7 (1M context)
---
## Executive summary
1. **Indexing is in good shape.** FK columns and the dashboard/webhook hot paths have explicit composite indexes (`ix_event_log_user_created`, `ix_event_log_user_event_type_created`, `ix_deferred_dispatch_status_fire_at`, partial `ux_deferred_dispatch_pending`). The bulk of the "missing index" risk is already mitigated.
2. **No real migration tool.** The project runs a hand-rolled, 1880-line, idempotent migration script on every boot. It works, but it's brittle, slow on cold start, has no down-migrations, and the table-rebuild branches lose indexes silently. Move to Alembic before the next major schema change.
3. **`create_all` is still the source-of-truth for new schemas** (engine.py:63). That's an anti-pattern next to migration tooling: schema drift can silently appear between fresh installs and upgraded installs.
4. **Two real N+1 risks remain.** `_tracker_response` (notification_trackers.py:286-291) calls `_tt_response` per link, and `_refresh_telegram_chat_titles` (scheduler.py:229) issues per-chat `getChat` calls without bot-level batching guards. The big one in `load_link_data` was already fixed (good).
5. **SQLite PRAGMAs are mostly right but pool sizing is wrong.** WAL, `synchronous=NORMAL`, FK enforcement, busy_timeout, temp_store=MEMORY are all set. Missing: `cache_size`, `mmap_size`. The async engine uses SQLAlchemy's default pool with multiple writer connections — under WAL that still serializes, but it raises spurious BUSY pressure on long transactions (see #M3).
6. **Event-log retention exists and is correct** (30-day default, cron at 03:00 UTC), but `retention_days=0` disables it silently and there is no archival, no per-tenant cap, no row-count metric exposed to operators.
7. **Memory leak risk: `_dirty_bots`, `_last_update_id`, `_last_webhook_reclaim_at`, `_adaptive_state`, `_adaptive_max_skip`** in command_sync.py, telegram_poller.py, scheduler.py are unbounded module-level dicts. In a long-running process they grow without ever shrinking when entities are deleted.
8. **Frontend has no virtualization on long lists** — dashboard event stream, tracker history, target list. On a tenant with thousands of events the dashboard `{#each status.recent_events}` (with `(event.id)` key) still renders the whole page-set into DOM and re-runs derivations on every refresh.
---
## CRITICAL
### C1. `create_all` is the schema-of-record for new installs ([engine.py:60](packages/server/src/notify_bridge_server/database/engine.py))
```python
async def init_db() -> None:
    engine = get_engine()
    async with engine.begin() as conn:
        await conn.run_sync(SQLModel.metadata.create_all)
```
**What's wrong:** `init_db()` runs unconditionally on every boot before the migration script. New installs get the *current* model's CREATE TABLE statements — including FK declarations like `ondelete=SET NULL` — while upgraded installs only get what the (one-way) `migrate_*` scripts manage to inject via `ALTER TABLE`. Several migrations explicitly admit "this only takes effect on freshly created tables" (e.g. `migrate_eventlog_provider_fk` is a documented no-op). That means **the schema drift between a fresh install and a 6-month-old install is real and undocumented.**
**Impact:** stability — subtle bugs that reproduce only on upgraded installs (FK enforcement, cascade behavior, partial UNIQUE indexes); ops — restoring a backup from a fresh install onto an upgraded box, or vice-versa, can change observable behaviour.
**Fix:**
1. Adopt Alembic with autogenerate-from-models, lock the baseline migration to the current `SQLModel.metadata`, and stop calling `create_all` in production startup.
2. Keep the hand-rolled `migrate_*` chain as legacy data-migrations only (idempotent, runs once, then removed).
3. Add a CI check: spin up empty DB → run migrations → diff against `SQLModel.metadata` → fail if non-empty.
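The CI check in step 3 could look roughly like the sketch below. It uses only the standard-library `sqlite3` module; `schema_snapshot` and `schema_drift` are hypothetical helper names, not code from the repo, and a real check would build one database via `create_all` and the other via the migration chain. It compares columns only; foreign keys and indexes would need `PRAGMA foreign_key_list` / `PRAGMA index_list` in the same style.

```python
import sqlite3


def schema_snapshot(conn: sqlite3.Connection) -> dict[str, set[tuple]]:
    """Map each user table to its set of (column, type, notnull, pk) tuples."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'"
        )
    ]
    snapshot = {}
    for table in tables:
        # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
        cols = conn.execute(f"PRAGMA table_info('{table}')").fetchall()
        snapshot[table] = {(c[1], c[2].upper(), c[3], c[5]) for c in cols}
    return snapshot


def schema_drift(
    fresh: sqlite3.Connection, migrated: sqlite3.Connection
) -> dict[str, set[tuple]]:
    """Return, per table, the columns that exist in only one of the two databases."""
    a, b = schema_snapshot(fresh), schema_snapshot(migrated)
    drift = {}
    for table in a.keys() | b.keys():
        diff = a.get(table, set()) ^ b.get(table, set())
        if diff:
            drift[table] = diff
    return drift
```

A CI job would fail the build whenever `schema_drift(...)` returns a non-empty dict.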
---
### C2. `migrate_schema` runs ~30+ idempotent `PRAGMA table_info` + ALTER probes on every cold start ([migrations.py:67-427](packages/server/src/notify_bridge_server/database/migrations.py))
`_has_column` issues a `PRAGMA table_info('<table>')` per check; `migrate_schema` calls it dozens of times serially inside one transaction. On a cold start this is the dominant boot latency. Worse, it forces a write txn on every boot even when nothing changes (because each migration opens `engine.begin()`).
**Impact:** startup cost — visible on Raspberry-Pi / NAS deployments; SQLite WAL checkpoint pressure on every boot when nothing changed; readiness probe grace window must accommodate this.
**Fix:**
1. Wire `schema_version` (already exists, `CURRENT_SCHEMA_VERSION=1`) as a real short-circuit — at the top of every `migrate_*`, return immediately if `schema_version >= N` for that migration.
2. Cache `PRAGMA table_info` results within a single migration run.
3. Better long-term: replace with Alembic; you already have the version table.
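Steps 1 and 2 can be sketched together as below, again with plain `sqlite3`. The single-column `schema_version` table layout, the `ColumnCache` class, and the toy `event_log.details` migration are assumptions made for illustration; only the `CURRENT_SCHEMA_VERSION` name comes from the codebase.

```python
import sqlite3

CURRENT_SCHEMA_VERSION = 1  # mirrors the constant named in the review


def get_schema_version(conn: sqlite3.Connection) -> int:
    """Read the stored version; 0 means the database predates versioning."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_version (version INTEGER NOT NULL)"
    )
    row = conn.execute("SELECT MAX(version) FROM schema_version").fetchone()
    return row[0] or 0


class ColumnCache:
    """Caches PRAGMA table_info per table for the duration of one migration run."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self._conn = conn
        self._columns: dict[str, set[str]] = {}

    def has_column(self, table: str, column: str) -> bool:
        if table not in self._columns:
            rows = self._conn.execute(f"PRAGMA table_info('{table}')").fetchall()
            self._columns[table] = {r[1] for r in rows}
        return column in self._columns[table]

    def invalidate(self, table: str) -> None:
        # Call after any ALTER TABLE so the next probe re-reads the schema.
        self._columns.pop(table, None)


def run_migrations(conn: sqlite3.Connection) -> bool:
    """Return True if migration work ran, False if the version check skipped it."""
    if get_schema_version(conn) >= CURRENT_SCHEMA_VERSION:
        return False  # short-circuit: no probes, no ALTERs
    cache = ColumnCache(conn)
    conn.execute("CREATE TABLE IF NOT EXISTS event_log (id INTEGER PRIMARY KEY)")
    if not cache.has_column("event_log", "details"):
        conn.execute("ALTER TABLE event_log ADD COLUMN details TEXT")
        cache.invalidate("event_log")
    conn.execute(
        "INSERT INTO schema_version (version) VALUES (?)", (CURRENT_SCHEMA_VERSION,)
    )
    conn.commit()
    return True
```

On the second and later boots `run_migrations` returns after a single SELECT.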
---
### C3. `_install_sqlite_pragmas` only fires on engine-pool `connect`, not when SQLAlchemy reuses pooled connections from a different event loop ([engine.py:18-38](packages/server/src/notify_bridge_server/database/engine.py))
The `@event.listens_for(engine.sync_engine, "connect")` hook runs only at connection creation. The default `aiosqlite` pool reuses connections, which is fine, but `connect_args["timeout"]=30` clashes with the in-PRAGMA `busy_timeout=10000` (10 s). In Python's `sqlite3` module (which `aiosqlite` wraps) the `timeout` argument is itself applied as the busy timeout, so both settings control the same knob and the PRAGMA, which runs last, wins.
**Impact:** stability under contention: with sustained writer contention you get `SQLITE_BUSY` *much* sooner than the `connect_args` value suggests. The 30-s value is silently overridden by the 10-s PRAGMA, so users see "database is locked" errors after 10 s, not 30.
**Fix:** standardize on busy_timeout (raise to 30 s to match `connect_args`, or drop one and keep the other). Document the chosen value in a constant. Also add:
```python
cur.execute("PRAGMA cache_size=-65536")  # 64 MiB; a negative value is a size in KiB
cur.execute("PRAGMA mmap_size=268435456") # 256 MiB
cur.execute("PRAGMA wal_autocheckpoint=1000")
```
The 100k-asset album write pattern (`asset_ids` JSON blob) benefits significantly from a larger page cache and mmap; current defaults force a lot of SQLite-internal I/O.
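To check which lock-wait timeout is actually in force, the value can be read back from a live connection. A minimal sketch with the standard-library `sqlite3` module (which `aiosqlite` wraps); `BUSY_TIMEOUT_MS`, `open_connection` and `effective_busy_timeout_ms` are illustrative names, not repo code. It relies on `sqlite3.connect(timeout=...)` being applied as SQLite's busy timeout, so a later `PRAGMA busy_timeout` replaces it.

```python
import sqlite3

BUSY_TIMEOUT_MS = 30_000  # single source of truth for the lock-wait time


def open_connection(path: str) -> sqlite3.Connection:
    # sqlite3.connect's `timeout` (in seconds) is applied as the busy timeout.
    conn = sqlite3.connect(path, timeout=BUSY_TIMEOUT_MS / 1000)
    # A later PRAGMA replaces that value, so derive it from the same constant.
    conn.execute(f"PRAGMA busy_timeout={BUSY_TIMEOUT_MS}")
    return conn


def effective_busy_timeout_ms(conn: sqlite3.Connection) -> int:
    """Return the busy timeout (milliseconds) currently in force on this connection."""
    return conn.execute("PRAGMA busy_timeout").fetchone()[0]
```

Logging `effective_busy_timeout_ms(...)` once at startup makes the chosen value visible to operators.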
---
## HIGH
### H1. Frontend dashboard event-stream lacks virtualization & double-fetches on filter changes ([+page.svelte:739](frontend/src/routes/+page.svelte))
`{#each status.recent_events as event, i (event.id)}` is keyed (good), but the page renders every event row with rich nested components (`EventDetailModal`, `MdiIcon`, etc.) for every paginate-back/forward. There's no row virtualization and the same data fetches re-run on every filter mutation (search input has a 300 ms debounce in `onSearchInput`, but `filterEventType`, `filterProviderId`, `filterSort`, `refreshSeconds` do not).
**Impact:** UX — choppy on tenants with 50+ events/page, perceptible filter-flicker; CPU — derivation cost on every status refresh.
**Fix:**
1. Wrap the events list in a tiny windowing component (svelte-virtual or a simple offset/limit windowed view — the API already supports it).
2. Debounce the entire filter-change branch, not just the search input (`$effect(() => { if (settled) { reload() }})` with a 100 ms guard).
3. The provider count map (`provider_event_counts`) is computed server-side for *all* matching events on every page request; cache it for `(user_id, filters)` in a 30-s in-memory dict server-side (see also #M2).
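The 30-s server-side cache from step 3 might be shaped like this. `TTLCache` and `filter_signature` are hypothetical names; the injectable `clock` exists only to make the sketch testable. Note that, as written, it has no size cap, so a production version should bound the number of keys (see M8 for why).

```python
import time
from typing import Any, Callable, Hashable


class TTLCache:
    """Tiny in-memory cache; entries expire `ttl` seconds after being stored."""

    def __init__(self, ttl: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl
        self._clock = clock
        self._entries: dict[Hashable, tuple[float, Any]] = {}

    def get_or_compute(self, key: Hashable, compute: Callable[[], Any]) -> Any:
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]  # still fresh
        value = compute()
        self._entries[key] = (now + self._ttl, value)
        return value

    def invalidate(self, key: Hashable = None) -> None:
        # No key: drop everything; otherwise drop just that entry.
        if key is None:
            self._entries.clear()
        else:
            self._entries.pop(key, None)


def filter_signature(filters: dict[str, Any]) -> tuple:
    """Order-independent, hashable signature of a filter dict."""
    return tuple(sorted(filters.items()))
```

The endpoint would call `cache.get_or_compute((user.id, filter_signature(filters)), run_query)` instead of running the aggregate directly.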
---
### H2. `provider_event_counts` aggregate query runs unbounded GROUP BY on every dashboard request ([status.py:84-103](packages/server/src/notify_bridge_server/api/status.py))
```python
provider_counts_query = (
    select(
        EventLog.provider_id,
        EventLog.provider_name,
        func.sum(func.coalesce(EventLog.assets_count, 1)).label("total"),
    )
    .where(EventLog.user_id == user.id)
    .group_by(EventLog.provider_id, EventLog.provider_name)
)
```
Every dashboard load (every 10-60 s by default — see `refreshIntervalItems`) runs `GROUP BY provider_id, provider_name` over *every* event the user ever owned. At 90 days × ~1 event/min/tracker this is hundreds of thousands of rows scanned per refresh per logged-in user.
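**Impact:** latency: SQLite has to visit every row the user owns and sort them for the GROUP BY, because no existing index covers `provider_id` (the dashboard composite is `(user_id, event_type, created_at DESC)`); cost: burns CPU on the bridge box for a metric that changes very slowly.
**Fix:**
1. Add `ix_event_log_user_provider (user_id, provider_id)` to support the GROUP BY. Because the query also reads `provider_name` and `assets_count`, it only becomes index-only if those columns are appended to the index.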
2. Cache the result for `(user_id, filter_signature)` for 30 s in the same in-memory cache as #H1.
3. Long-term: materialize per-provider counts into an `event_counter` table maintained by triggers or an APScheduler job. The dashboard then reads at most a dozen rows.
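The effect of the index in step 1 can be checked with `EXPLAIN QUERY PLAN`. The sketch below uses a cut-down `event_log` table (column set assumed from the query above) and appends `provider_name` and `assets_count` to the suggested index so that the aggregate can be answered from the index alone; that four-column shape is an assumption of this sketch. The exact plan wording varies between SQLite versions, so only the index name is inspected.

```python
import sqlite3

QUERY = """
    SELECT provider_id, provider_name, SUM(COALESCE(assets_count, 1)) AS total
    FROM event_log
    WHERE user_id = ?
    GROUP BY provider_id, provider_name
"""


def query_plan(conn: sqlite3.Connection, user_id: int) -> list[str]:
    rows = conn.execute("EXPLAIN QUERY PLAN " + QUERY, (user_id,)).fetchall()
    return [row[3] for row in rows]  # fourth column is the readable "detail"


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE event_log (id INTEGER PRIMARY KEY, user_id INTEGER, "
    "provider_id INTEGER, provider_name TEXT, assets_count INTEGER)"
)
conn.executemany(
    "INSERT INTO event_log (user_id, provider_id, provider_name, assets_count) "
    "VALUES (?, ?, ?, ?)",
    [(1, 10, "immich", 3), (1, 10, "immich", None), (1, 11, "gitea", 2), (2, 10, "immich", 5)],
)

plan_before = query_plan(conn, 1)  # no usable index yet
conn.execute(
    "CREATE INDEX ix_event_log_user_provider "
    "ON event_log (user_id, provider_id, provider_name, assets_count)"
)
plan_after = query_plan(conn, 1)   # should now mention the new index
totals = conn.execute(QUERY, (1,)).fetchall()
```

Comparing `plan_before` with `plan_after` on a copy of a production database is a cheap way to confirm the planner picks the index up before shipping the migration.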
---
### H3. `_tracker_response` issues one query per tracker-target link ([notification_trackers.py:286-291](packages/server/src/notify_bridge_server/api/notification_trackers.py))
```python
async def _tracker_response(session: AsyncSession, t: NotificationTracker) -> dict:
    result = await session.exec(
        select(NotificationTrackerTarget).where(NotificationTrackerTarget.tracker_id == t.id)
    )
    tracker_targets = [await _tt_response(session, tt) for tt in result.all()]
```
`_tt_response` (in notification_tracker_targets.py:12 — has 12 distinct `select`/`session.get` references) issues per-link follow-up SELECTs. Called from `create`, `update`, `delete` and `trigger` for a single tracker, so the practical N is small — but `_tt_response` is also called inside the bulk `list_notification_trackers` loop's downstream consumers, and any future bulk endpoint will multiply this badly.
**Impact:** latency on POST/PATCH responses; future regression risk.
**Fix:** rewrite `_tt_response` to accept pre-fetched maps (mirror the pattern in `dispatch_helpers.load_link_data`). Or, simpler: write a single eager-load helper using `selectinload(NotificationTrackerTarget.target)` once `relationship()` mappers are declared on the models.
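The "pre-fetched maps" idea, reduced to its core: fetch the links, collect the ids they reference, load those rows in one `IN (...)` query, then assemble the response from a dict. The sketch uses plain `sqlite3` and assumed table names (`notification_target`, `notification_tracker_target`) rather than the project's SQLModel classes.

```python
import sqlite3


def load_targets_by_id(conn: sqlite3.Connection, target_ids: list[int]) -> dict[int, dict]:
    """Fetch all referenced targets in one query and index them by id."""
    ids = sorted(set(target_ids))
    if not ids:
        return {}
    placeholders = ", ".join("?" for _ in ids)
    rows = conn.execute(
        f"SELECT id, name FROM notification_target WHERE id IN ({placeholders})", ids
    ).fetchall()
    return {row[0]: {"id": row[0], "name": row[1]} for row in rows}


def tracker_response(conn: sqlite3.Connection, tracker_id: int) -> dict:
    """Build the response with two queries in total, however many links exist."""
    links = conn.execute(
        "SELECT id, target_id FROM notification_tracker_target "
        "WHERE tracker_id = ? ORDER BY id",
        (tracker_id,),
    ).fetchall()
    targets = load_targets_by_id(conn, [target_id for _, target_id in links])
    return {
        "id": tracker_id,
        "tracker_targets": [
            {"id": link_id, "target": targets.get(target_id)}
            for link_id, target_id in links
        ],
    }
```

For very large id sets the `IN` list should be chunked, since SQLite limits the number of bound parameters per statement.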
---
### H4. `load_link_data` does not eagerly load target.config related entities — relies on `dict(target.config)` snapshotting ([dispatch_helpers.py:539-747](packages/server/src/notify_bridge_server/services/dispatch_helpers.py))
The function batch-loads receivers, telegram_chats, email_bots, matrix_bots up-front, but the broadcast-expansion branch in the active_links loop still issues `_resolve_target` per child target (line 715). That `_resolve_target` is called with all the pre-fetched maps, so it doesn't *query* per call — but it does build a fresh `target_config` dict per child. With a broadcast target containing 50 children fanning out 100 events/min this is constant garbage collection pressure.
**Impact:** GC pressure under load; not a correctness problem.
**Fix:** none required short-term. Long-term, add `selectinload` declarations on the relationship model so SQLAlchemy can co-fetch the chain. The code path is already well-batched.
---
### H5. `aiohttp.ClientSession` is constructed per-call inside `NotificationDispatcher._session_ctx` when no shared session is provided ([dispatcher.py:117-123](packages/core/src/notify_bridge_core/notifications/dispatcher.py))
```python
@contextlib.asynccontextmanager
async def _session_ctx(self) -> AsyncIterator[aiohttp.ClientSession]:
    if self._shared_session is not None and not self._shared_session.closed:
        yield self._shared_session
        return
    async with _new_session() as session:
        yield session
```
In server-side code paths (watcher, event_dispatch, deferred_dispatch) a shared session is always passed in, so this is harmless. But unit tests, the CLI, and any direct library user that instantiates `NotificationDispatcher` without a session pays the cost. Worse, the per-dispatch session creates a fresh TCP pool, fresh DNS resolver — defeating connection reuse to Telegram / Discord webhook hosts.
**Impact:** test slowness; correctness if a non-server consumer ever ships.
**Fix:** require the `session` parameter (`session: aiohttp.ClientSession` not `| None`). Or have the dispatcher lazily attach to a module-level `_default_session` cached by event loop id.
---
### H6. `WebhookPayloadLog` is pruned per-insert via a sub-select but the prune query has no UNIQUE/partial protection against duplicate inserts ([webhooks.py:404-418](packages/server/src/notify_bridge_server/api/webhooks.py))
The "keep newest `max_count` per provider, delete the rest" pattern uses `select(...).order_by(created_at DESC).limit(max_count)` as a subquery. Under SQLite this materializes the top-N then negates it — fine when max_count is 20. But this runs on every inbound webhook. For a busy Gitea/HA installation that's 60+ writes/min, each with a delete-by-sub-select. The `ix_webhook_payload_log_provider_created` index makes the read cheap, but the DELETE still rewrites pages.
**Impact:** write amplification on busy webhook tenants.
**Fix:** keep the prune but make it probabilistic: only run it when `random.random() < 0.1` (10% chance per insert). The table can then briefly exceed `max_count` between prunes, but the cap holds on average and the amortized prune cost drops roughly 10×.
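A rough sketch of the probabilistic prune, using `sqlite3` and a simplified `webhook_payload_log` table. The `id` column stands in for `created_at` ordering, and `insert_payload`, `prune` and the injectable `rng` parameter are illustrative assumptions, not the handler in `webhooks.py`.

```python
import random
import sqlite3
from typing import Callable

PRUNE_PROBABILITY = 0.1  # value taken from the suggestion above


def prune(conn: sqlite3.Connection, provider_id: int, max_count: int) -> None:
    """Delete everything except the newest `max_count` rows for this provider."""
    conn.execute(
        "DELETE FROM webhook_payload_log WHERE provider_id = ? AND id NOT IN ("
        "  SELECT id FROM webhook_payload_log WHERE provider_id = ?"
        "  ORDER BY id DESC LIMIT ?)",
        (provider_id, provider_id, max_count),
    )


def insert_payload(
    conn: sqlite3.Connection,
    provider_id: int,
    body: str,
    max_count: int,
    rng: Callable[[], float] = random.random,
) -> None:
    conn.execute(
        "INSERT INTO webhook_payload_log (provider_id, body) VALUES (?, ?)",
        (provider_id, body),
    )
    # Only about one insert in ten pays for the DELETE.
    if rng() < PRUNE_PROBABILITY:
        prune(conn, provider_id, max_count)
    conn.commit()
```

Between prunes the table can overshoot `max_count`; with a 10% probability the expected overshoot is on the order of ten rows per provider.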
---
### H7. No retention/archival for `notification_tracker_state` and `deferred_dispatch` "fired"/"dropped" rows ([scheduler.py:332-364](packages/server/src/notify_bridge_server/services/scheduler.py))
`_cleanup_old_events` deletes `event_log`, `webhook_payload_log`, `action_execution` older than retention days. `deferred_dispatch` rows with `status IN ('fired', 'dropped')` are never deleted. `notification_tracker_state.asset_ids` for an immich tracker watching a deleted collection is also never reaped.
**Impact:** unbounded growth on long-running installs; `asset_ids` JSON blobs can be megabytes per collection.
**Fix:** extend `_cleanup_old_events` to also delete `DeferredDispatch.status != 'pending' AND fired_at < cutoff`. Add a separate housekeeping job that prunes `NotificationTrackerState` rows whose `collection_id` is no longer in `NotificationTracker.collection_ids`.
---
## MEDIUM
### M1. Sentinel value `bot_id=0` is a footgun ([models.py:69-73](packages/server/src/notify_bridge_server/database/models.py))
```python
# bot_id=0 is a sentinel meaning "Telegram has not yet returned a numeric
# ID for this bot" (i.e. token never validated). Multiple unverified bots
# may legitimately carry 0, so we only enforce uniqueness for non-sentinel
# values via a partial index added in migrate_uniqueness_constraints.
bot_id: int = Field(default=0, index=True)
```
Sentinel values on indexed columns hurt index selectivity (every unvalidated bot is the same row from the planner's perspective) and create maintenance burden. Worse, every code path that looks up by `bot_id` must remember to filter `bot_id != 0`.
**Impact:** maintainability; latent bug surface (one missed `!= 0` filter and an unverified bot is silently re-used).
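**Fix:** change the field to `bot_id: int | None` with a default of `None`, and drop the sentinel. NULLs never collide in a UNIQUE index, so the partial index can become a plain unique index.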
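Why NULL works better than `0` here can be shown in a few lines of `sqlite3`. The `telegram_bot` table below is reduced to three columns, and `insert_bot` / `find_bot` are hypothetical helpers for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE telegram_bot (id INTEGER PRIMARY KEY, name TEXT, bot_id INTEGER)"
)
# A plain UNIQUE index is enough: SQLite treats NULLs as distinct from each other.
conn.execute("CREATE UNIQUE INDEX ux_telegram_bot_bot_id ON telegram_bot (bot_id)")
conn.executemany(
    "INSERT INTO telegram_bot (name, bot_id) VALUES (?, ?)",
    [("unverified-a", None), ("unverified-b", None), ("verified", 123456)],
)


def insert_bot(name: str, bot_id) -> bool:
    """Return False when the insert violates the unique index."""
    try:
        conn.execute(
            "INSERT INTO telegram_bot (name, bot_id) VALUES (?, ?)", (name, bot_id)
        )
        return True
    except sqlite3.IntegrityError:
        return False


def find_bot(bot_id):
    # `bot_id = NULL` is never true in SQL, so unverified bots can never be
    # matched by accident and no `!= 0` guard is needed.
    row = conn.execute(
        "SELECT name FROM telegram_bot WHERE bot_id = ?", (bot_id,)
    ).fetchone()
    return row[0] if row else None
```

Lookups for unverified bots then have to be explicit (`WHERE bot_id IS NULL`), which is the point: the special case becomes visible in the query.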
---
### M2. No request-scoped cache for `user.id` lookups inside one request ([api/*.py, throughout](packages/server/src/notify_bridge_server/api/))
The same `get_current_user` dependency runs JWT validation + a `session.get(User, id)` on every request. Many endpoints then do their *own* `user.id`-filtered SELECTs. There is no per-request memoization of the User row.
**Impact:** one extra SELECT per request, mostly noise — but it's free to fix.
**Fix:** in `get_current_user`, cache the User on `request.state.user`. Routes that take `user: User = Depends(...)` are unchanged.
---
### M3. SQLAlchemy async pool defaults serialize SQLite writers but the engine allows multiple connections ([engine.py:41-57](packages/server/src/notify_bridge_server/database/engine.py))
The code does not pin a pool class. Which pool `create_async_engine` picks for SQLite by default depends on the SQLAlchemy version and on whether the database is in-memory or file-based (`StaticPool`, `NullPool`, and a queue pool have each been the default in some configuration). Under WAL, multiple readers are fine but only one writer can hold the txn at a time, so a slow writer just makes other connections block on `busy_timeout`.
**Impact:** unpredictable behaviour across SQLAlchemy versions; sporadic `SQLITE_BUSY` under load.
**Fix:** explicitly configure the pool:
```python
from sqlalchemy.pool import StaticPool, AsyncAdaptedQueuePool
_engine = create_async_engine(
    url,
    echo=settings.debug,
    pool_pre_ping=True,
    connect_args=connect_args,
    poolclass=AsyncAdaptedQueuePool,
    pool_size=5,
    max_overflow=10,
    pool_recycle=3600,
)
```
These values suit Postgres. For SQLite the right choice is `poolclass=StaticPool` with `connect_args={"check_same_thread": False}` (and without `pool_size`/`max_overflow`, which `StaticPool` does not accept), so that one connection is shared across the event loop (this is the supabase/pgbouncer pattern adapted for sqlite-async).
---
### M4. `_refresh_telegram_chat_titles` issues per-chat HTTP without per-bot bucketing ([scheduler.py:229-329](packages/server/src/notify_bridge_server/services/scheduler.py))
The job builds `tasks` as a flat list across all bots and runs them under a global `Semaphore(10)`. A bot with 50 chats and a slow Telegram response (rare but happens) can monopolize all 10 slots, starving every other bot. The semaphore should be per-bot.
**Impact:** the daily refresh can take much longer than intended on a multi-bot install with one degraded bot.
**Fix:** create one semaphore per bot:
```python
sems = {bot_id: asyncio.Semaphore(_CHAT_SYNC_CONCURRENCY) for bot_id in bot_tokens}
```
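Expanded into a runnable shape, the per-bot bucketing could look like the following. `refresh_all`, the `chats_by_bot` mapping and the `fetch_title` callback are stand-ins for the real job and its `getChat` call; `CHAT_SYNC_CONCURRENCY` mirrors `_CHAT_SYNC_CONCURRENCY` with an arbitrary value.

```python
import asyncio
from typing import Awaitable, Callable

CHAT_SYNC_CONCURRENCY = 2  # stand-in for _CHAT_SYNC_CONCURRENCY


async def refresh_all(
    chats_by_bot: dict[int, list[int]],
    fetch_title: Callable[[int, int], Awaitable[str]],
) -> list[tuple[int, int, str]]:
    """Refresh every chat; a slow bot can only exhaust its own semaphore."""
    sems = {
        bot_id: asyncio.Semaphore(CHAT_SYNC_CONCURRENCY) for bot_id in chats_by_bot
    }

    async def refresh_one(bot_id: int, chat_id: int) -> tuple[int, int, str]:
        async with sems[bot_id]:
            return bot_id, chat_id, await fetch_title(bot_id, chat_id)

    tasks = [
        refresh_one(bot_id, chat_id)
        for bot_id, chats in chats_by_bot.items()
        for chat_id in chats
    ]
    return await asyncio.gather(*tasks)
```

Total concurrency now scales with the number of bots, so on installs with many bots it may be worth keeping a larger global semaphore as an outer bound as well.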
---
### M5. `event_log.collection_name.contains(search)` triggers full table scan on filter ([status.py:69-75](packages/server/src/notify_bridge_server/api/status.py))
The dashboard search input runs four `.contains(search)` clauses ORed together — these become `LIKE '%search%'` and cannot use a regular B-tree index. With 100k+ event_log rows the dashboard search becomes a multi-second operation.
**Impact:** UX — search feels broken on large installs; CPU on the bridge box.
**Fix:**
1. Limit the search to the most recent N days (e.g. retention/3) — most users only search recent events.
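2. Add a SQLite FTS5 virtual table mirroring event_log's text columns, sync via triggers. Searches use `MATCH 'foo'`, which is index-backed and typically far faster than a `LIKE '%foo%'` scan on large tables. Note that FTS5 matches tokens and prefixes, not arbitrary substrings, so search semantics change slightly.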
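A minimal sketch of option 2, assuming the SQLite build has FTS5 compiled in (most CPython distributions do, but it is not guaranteed). The two-column `event_log`, the `event_log_fts` table and the trigger names are illustrative; an `AFTER UPDATE` trigger is omitted for brevity and would be needed in practice.

```python
import sqlite3

SCHEMA = """
CREATE TABLE event_log (
    id INTEGER PRIMARY KEY,
    collection_name TEXT,
    provider_name TEXT
);
-- External-content FTS5 table: the index lives here, the text stays in event_log.
CREATE VIRTUAL TABLE event_log_fts USING fts5(
    collection_name, provider_name,
    content='event_log', content_rowid='id'
);
CREATE TRIGGER event_log_ai AFTER INSERT ON event_log BEGIN
    INSERT INTO event_log_fts (rowid, collection_name, provider_name)
    VALUES (new.id, new.collection_name, new.provider_name);
END;
CREATE TRIGGER event_log_ad AFTER DELETE ON event_log BEGIN
    INSERT INTO event_log_fts (event_log_fts, rowid, collection_name, provider_name)
    VALUES ('delete', old.id, old.collection_name, old.provider_name);
END;
"""


def search_events(conn: sqlite3.Connection, term: str) -> list[int]:
    """Return ids of events whose indexed text contains `term` as a token/phrase."""
    # Quote the term so user input is treated as a phrase, not FTS5 query syntax.
    phrase = '"' + term.replace('"', '""') + '"'
    rows = conn.execute(
        "SELECT id FROM event_log WHERE id IN ("
        "  SELECT rowid FROM event_log_fts WHERE event_log_fts MATCH ?"
        ") ORDER BY id",
        (phrase,),
    ).fetchall()
    return [row[0] for row in rows]
```

Because matching is token-based, a search for `vac` would not match `Vacation` unless the query is turned into a prefix query (`vac*`), which differs from the current `LIKE '%vac%'` behaviour.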
---
### M6. `DeferredDispatch.event_payload` JSON blob can grow unbounded per row ([models.py:639-659](packages/server/src/notify_bridge_server/database/models.py), [deferred_dispatch.py:188-298](packages/server/src/notify_bridge_server/services/deferred_dispatch.py))
The asset-coalescing union path appends every new asset's full dict (filename, urls, tags, extra metadata) into `event_payload["added_assets"]`. A mass-import that adds 50k photos during a quiet window means one DeferredDispatch row with 50k asset entries.
**Impact:** memory blow-up at drain time (the whole JSON is parsed via `deserialize_event` into a Python list of `MediaAsset` dataclasses); could trip the drain timeout (`_DRAIN_DISPATCH_TIMEOUT_SECONDS=120`) on legitimate workloads.
**Fix:** cap the union at e.g. 500 assets per row; when crossed, emit a "more_truncated" sentinel into `payload["extra"]` so the rendered template can show "+45000 more". The `apply_tracking_display_filters` `max_assets_to_show` does cap it for delivery, but the *stored* payload is uncapped.
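The capped union might be sketched as follows. It assumes each asset dict carries a unique `"id"` key and stores the overflow as an integer under `payload["extra"]["more_truncated"]`; both are assumptions of this sketch, as is the `merge_added_assets` name.

```python
MAX_STORED_ASSETS = 500  # cap suggested above; tune per deployment


def merge_added_assets(
    payload: dict, new_assets: list[dict], cap: int = MAX_STORED_ASSETS
) -> dict:
    """Union new assets into the payload, storing at most `cap` and counting the rest."""
    assets = payload.setdefault("added_assets", [])
    extra = payload.setdefault("extra", {})
    seen = {asset["id"] for asset in assets}
    for asset in new_assets:
        if asset["id"] in seen:
            continue  # already stored (or already counted during this call)
        seen.add(asset["id"])
        if len(assets) < cap:
            assets.append(asset)
        else:
            # Ids of dropped assets are not retained, so an asset reported again
            # in a later call is counted twice; the figure is an upper bound.
            extra["more_truncated"] = extra.get("more_truncated", 0) + 1
    return payload
```

The template can then render `+{{ extra.more_truncated }} more` when the key is present.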
---
### M7. Per-tick `await get_app_timezone(session)` reads from the DB on every dispatch ([dispatch_helpers.py:146-150](packages/server/src/notify_bridge_server/services/dispatch_helpers.py))
Each tracker tick, each webhook, each defer evaluation calls `get_app_timezone` which calls `get_setting(session, "timezone")` which is a SELECT. The timezone setting rarely changes (manual setting), but the SELECT runs constantly.
**Impact:** noise on otherwise good caching.
**Fix:** cache the timezone in a module-level `(value, expires_at)` tuple with 60-s TTL, invalidated by `reschedule_cron_jobs_for_timezone_change`.
---
### M8. Unbounded in-memory dictionaries with no TTL or capacity ([scheduler.py:67-72](packages/server/src/notify_bridge_server/services/scheduler.py), [telegram_poller.py:31-35](packages/server/src/notify_bridge_server/services/telegram_poller.py), [command_sync.py:25](packages/server/src/notify_bridge_server/services/command_sync.py))
```python
_adaptive_state: dict[int, dict[str, int]] = {}
_adaptive_max_skip: dict[int, int] = {}
_last_update_id: dict[int, int] = {}
_last_webhook_reclaim_at: dict[int, float] = {}
_dirty_bots: dict[int, float] = {}
```
Each is keyed by tracker_id / bot_id. When a tracker or bot is deleted, the cleanup paths (`unschedule_tracker`, etc.) do remove some entries — but not all. `_last_update_id`, `_last_webhook_reclaim_at` are never cleared on bot deletion.
**Impact:** slow memory leak in long-running processes that create+delete trackers/bots frequently (e.g. test environments).
**Fix:** on tracker/bot deletion, explicitly clear all module dicts that key by that id. Or, simpler, switch each to `weakref.WeakValueDictionary` once the entity has a Python object representation, or to a TTLCache.
---
### M9. Bulk insert pattern in migrations uses one-statement-per-row ([migrations.py:566-588](packages/server/src/notify_bridge_server/database/migrations.py))
`migrate_tracker_targets` issues `INSERT INTO ... VALUES (...)` per row in a Python for-loop. On a tenant with 10k+ legacy rows this is slow even inside a single transaction.
**Impact:** one-shot, but rough on upgrade for big tenants.
**Fix:** use `executemany` / batch INSERTs:
```python
await conn.execute(text("INSERT INTO ... VALUES (...)"), batch_params)
```
This is mostly historical (the migration is idempotent and skipped on subsequent runs), but worth fixing if you're touching the file.
---
### M10. Missing index on `notification_tracker_state(notification_tracker_id, collection_id)` ([models.py:454-478](packages/server/src/notify_bridge_server/database/models.py))
`check_tracker` reads state per tracker; the existing `ix_notification_tracker_state.notification_tracker_id` index (declared via `index=True`) supports that. But every state read is `WHERE tracker_id = ? AND collection_id = ?` (implicitly via the resulting dict). A composite would help; SQLite can do index-only scans here.
**Impact:** small. The single-column index already narrows each lookup to one tracker's rows, and one tracker typically has <20 collections, so this is a minor win.
**Fix:** add `(notification_tracker_id, collection_id)` composite index to the `_INDEXES` list.
---
## LOW
### L1. `SELECT *` semantics from `select(Model)` ORM is unavoidable but verbose ([throughout services/, api/])
SQLModel's `select(ModelClass)` is effectively `SELECT all columns`. For wide rows like `TrackingConfig` (~70 columns of boolean flags) that's a lot of bytes per dispatch evaluation. There are no API list endpoints that return `TrackingConfig` from a hot path, so this is mostly cosmetic — but for pages that only need a handful of columns (e.g. `status.py`'s `tracker_id, name` map) the explicit-column form is already used. Continue that pattern.
---
### L2. `EventLog.details` JSON dict is reconstructed on every dashboard read ([status.py:258](packages/server/src/notify_bridge_server/api/status.py))
`details: e.details or {}` serializes the JSON every time. The JSON column type already returns this as a parsed Python dict, so the cost is low; just a note that this is a hot path.
---
### L3. `event_log.collection_id` and `details` have no indexes; some webhook commands filter on them ([commands/immich/events.py:43](packages/server/src/notify_bridge_server/commands/immich/events.py))
The history-by-tracker endpoint uses the composite `ix_event_log_user_event_type_created` plus a hit on `notification_tracker_id` — fine. But `events.py`'s "last assets_added for this collection" queries (`event_type='assets_added' AND collection_id=?`) cannot use any current index optimally.
**Fix:** add `(event_type, collection_id, created_at DESC)` if these queries are called by users frequently (Telegram `/assets <album>` etc.).
---
### L4. JSON column types not declared with `JSONB` semantics ([models.py: many](packages/server/src/notify_bridge_server/database/models.py))
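SQLite has only `JSON` (text storage with `json_valid` checks). On Postgres you'd want `JSONB`. The codebase uses the generic `Column(JSON)` from SQLModel, which renders as plain `JSON` on Postgres, not `JSONB`; getting `JSONB` would need a dialect variant such as `JSON().with_variant(JSONB(), "postgresql")`. No action needed while SQLite is the only exercised backend.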
---
### L5. The `setup` lifespan runs migrations *inside* the FastAPI lifespan synchronously ([main.py:62-122](packages/server/src/notify_bridge_server/main.py))
The migrations + seeds + scheduler boot all run before `_READY = True`. On a cold start with a big DB this can take 10+ s during which `/api/ready` returns 503. That's correct, but `/api/health` is also unreachable because uvicorn hasn't started the workers yet (lifespan blocks startup). For orchestrators that probe `/api/health`, this means startup-grace must be tuned.
**Fix:** start the HTTP listener first, run migrations as a background task, expose readiness flag through `/api/ready` only.
---
### L6. `ServiceProvider.config`, `NotificationTarget.config`, `Tracker.filters` JSON columns store secrets unencrypted ([models.py:42, 349, 399](packages/server/src/notify_bridge_server/database/models.py))
API keys, refresh tokens, webhook secrets, SMTP passwords all live in `config` JSON. Visible to anyone with DB read access. This is a known design trade-off (`backup_secrets_mode` controls export behaviour) but worth flagging.
**Fix:** out of scope for this review; consider an at-rest encryption layer keyed off `secret_key` (Fernet) for `config["api_key"]`, `config["password"]`, `access_token`, etc. — but only if your threat model justifies the operational cost.
---
### L7. Frontend `entity-cache.svelte.ts` has 30-s TTL but no cross-tab invalidation ([entity-cache.svelte.ts:14](frontend/src/lib/stores/entity-cache.svelte.ts))
Two browser tabs editing the same entity will see stale data for up to 30 s in the other tab. No `BroadcastChannel` listener.
**Fix:** add a `BroadcastChannel('notify-bridge-cache')` that calls `cache.invalidate()` on receipt. ~15 lines.
---
### L8. `providersCache.invalidate(); await load()` is two-step ([providers/+page.svelte:238, 250](frontend/src/routes/providers/+page.svelte))
`invalidate()` + immediate `fetch(true)` race against any in-flight request; the deduplication map handles it, but the explicit `await load()` is essentially `fetch(true)` directly. Simpler:
```typescript
providersCache.set(updatedList); // or fetch(true)
```
Cosmetic.
---
### L9. `details["dispatch_status"]` is a string enum but not declared as one ([deferred_dispatch.py:619-624](packages/server/src/notify_bridge_server/services/deferred_dispatch.py))
`dispatch_status` takes values `"deferred"`, `"deferred_then_dropped"`, `"deferred_then_failed"`, `"delivered_after_quiet_hours"`, `"dropped_quiet_hours_nondeferrable"`. They're scattered as string literals. The dashboard renders them.
**Fix:** declare an `Enum` once on the server and mirror it in the frontend types (for example by generating the TypeScript union from it), so the literals live in one place.
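On the server side this could be as small as the sketch below. The five values are the literals listed above; the `DispatchStatus` class name and the `set_dispatch_status` helper are assumptions of this sketch.

```python
from enum import Enum


class DispatchStatus(str, Enum):
    DEFERRED = "deferred"
    DEFERRED_THEN_DROPPED = "deferred_then_dropped"
    DEFERRED_THEN_FAILED = "deferred_then_failed"
    DELIVERED_AFTER_QUIET_HOURS = "delivered_after_quiet_hours"
    DROPPED_QUIET_HOURS_NONDEFERRABLE = "dropped_quiet_hours_nondeferrable"


def set_dispatch_status(details: dict, status: DispatchStatus) -> dict:
    # Store the plain string so the JSON column and the dashboard keep seeing
    # exactly the literals they see today.
    details["dispatch_status"] = status.value
    return details
```

Parsing goes the other way with `DispatchStatus(details["dispatch_status"])`, which raises `ValueError` on an unknown literal instead of letting a typo through.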
---
### L10. No DB connection used by `/api/health` ([main.py:270-274](packages/server/src/notify_bridge_server/main.py))
`/api/health` returns instantly without checking the DB. That's correct for a liveness probe but the comment doesn't match common practice ("liveness = process up"). Pair this with #L5: orchestrators using `/api/health` for warm-up will mark the pod ready while migrations are still running.
**Fix:** keep liveness lightweight, document the readiness probe as the warm-up gate.
---
## Notes on what's already good
- Performance indexes (`_INDEXES` list) cover all the right hot paths.
- Composite `(status, fire_at)` index on `deferred_dispatch` plus partial unique `(link_id, collection_id, event_type) WHERE status='pending'` prevents the worst races.
- `load_link_data` is fully batched — the most complex hot path in the codebase looks clean.
- Shared `aiohttp.ClientSession` with DNS-rebinding-safe `PinnedResolver` is production-grade.
- Pre-migration `VACUUM INTO` snapshot is the right safety net for a hand-rolled migration chain.
- APScheduler defaults (`coalesce=True`, `misfire_grace_time=300`, `max_instances=1`) are correct production settings.
- Adaptive polling (skip-N-of-K when idle) with jitter is a thoughtful 4-tier scheduling design.
- Tracker cache (5-s TTL with explicit invalidation) and rendered-message per-locale cache are good fan-out optimizations.
- Migration idempotency is genuinely well-handled despite the rough tooling.
- Frontend `entity-cache` deduplication of in-flight requests is the right pattern.
---
## Priority recommendations (next 30 days)
1. **Adopt Alembic** (C1, C2) — eliminate `create_all` from prod, baseline the current schema, lock down new schema changes through autogenerate.
2. **Fix the dashboard aggregate query** (H1, H2, M5) — add the missing composite index, server-side cache the per-provider aggregate, virtualize the event list. This is the single biggest user-visible perf win.
3. **Cap `DeferredDispatch.event_payload` size + add retention for fired/dropped rows** (M6, H7) — closes off the worst-case memory and growth scenarios.
4. **Clean up module-level dicts on entity deletion** (M8) — small fix, prevents a slow leak.
5. **Standardize SQLite PRAGMAs and pool config** (C3, M3) — predictable behaviour, fewer spurious BUSY errors.
---
*Reviewed against codebase at HEAD (`a20635a`).*
+312
@@ -0,0 +1,312 @@
# Security Review — notify-bridge v0.8.1
Reviewer: security-reviewer (Opus 4.7) — 2026-05-22
Branch: master @ a20635a
Scope: `packages/server`, `packages/core`, `frontend/src`, `Dockerfile`, `docker-compose.yml`, `.gitea/workflows/`, env handling.
---
## Executive Summary
- **Overall posture is strong.** The project applies many non-obvious controls correctly: Jinja2 `SandboxedEnvironment` on every render path; `bcrypt` with a 72-byte length guard and constant-time login (dummy hash on missing user); JWT with `token_version` revocation; SSRF guard with CGNAT, IPv4-mapped-IPv6 unwrapping, and a `PinnedResolver` that defeats DNS rebinding; secret-masking log filter; path-traversal-safe backup file resolver; security headers + CSP; non-root Docker user; required `SECRET_KEY` >= 32 chars with a rejection list; non-default Telegram webhook secret enforced; HMAC signature checks on Gitea/Generic webhooks; provider-config secret masking on GET; ownership checks (`get_owned_entity`) on every parameterised route I sampled.
- **HIGH — Home Assistant `access_token` is not masked.** It is stored in `provider.config`, never added to the mask list in `_provider_response`, never added to the placeholder-drop list in `update_provider`. Any logged-in user can `GET /api/providers/{id}` and read their HA token in cleartext, and a partial save will wipe it. Trivial fix.
- **HIGH — Secrets at rest are plaintext.** Telegram bot tokens (`telegram_bot.token`), provider configs containing `api_key`/`api_token`/`webhook_secret`/`access_token`/SMTP passwords, and email-bot SMTP passwords are stored unencrypted in SQLite. Disk theft, an unrelated read primitive, or any backup leak exposes all credentials. The masking on the API is good UX, but the DB itself has no encryption-at-rest. The exported JSON backup respects a `secrets_mode` flag (good) but the live DB does not.
- **MEDIUM — Template-preview endpoints bypass the timeout/size watchdog.** `template_configs.preview_config`, `template_configs.preview_raw`, `command_template_configs.preview_raw`, and `notifier.send_test_template_notification` construct fresh `SandboxedEnvironment(autoescape=False)` instances and call `.render(...)` directly. The hardened helper `render_template()` (timeout, source cap, output cap, autoescape) is bypassed. A logged-in user can wedge a worker thread with a CPU-heavy template such as `{% for i in range(100000) %}{% for j in range(100000) %}x{% endfor %}{% endfor %}` (nested, because the sandbox caps a single `range()` call at 100,000 items). Single-tenant deployment limits the blast radius, but the renderer should be the single chokepoint.
- **MEDIUM — Login rate limit is per-IP only.** `POST /api/auth/login @ 5/min` keys on `get_remote_address`. An attacker behind a proxy / NAT, or one that rotates source IPs (cheap on residential / cloud), trivially bypasses it. There is no per-username lockout, no exponential backoff, no captcha. Combined with no MFA, this leaves the admin account, protected only by a password (8-char minimum, no complexity requirement), vulnerable to a slow online dictionary attack.
- **MEDIUM / LOW — Several smaller findings**: webhook payload logs persist source payload (now with key-level redaction, but the redactor is name-based and will miss high-entropy secret values in non-obvious keys); no replay protection on inbound webhooks (no nonce/timestamp window); the `/api/auth/setup` 3/min limit + JWT issuance race window is hardened with a transaction count guard (good); the dummy bcrypt hash literal used for timing-equalisation is malformed, so `bcrypt.checkpw` raises `ValueError`, `_verify_password` swallows it and returns `False` without running the KDF, and the timing of the two failure paths is not actually equalised; CSP allows `script-src 'unsafe-inline'` (necessary for SvelteKit hydration, acceptable risk acknowledged in code).
---
## Findings
### CRITICAL
_None found._
---
### HIGH
#### H-1. Home Assistant access_token leaked in provider GET responses
- CWE: CWE-522 (Insufficiently Protected Credentials), CWE-200 (Exposure of Sensitive Information)
- Files:
- [`packages/server/src/notify_bridge_server/api/providers.py:616-624`](../../packages/server/src/notify_bridge_server/api/providers.py) — `_provider_response` masks `("api_key", "api_token", "webhook_secret", "password", "client_secret", "refresh_token")` but **not** `access_token`.
- [`packages/server/src/notify_bridge_server/api/providers.py:399-405`](../../packages/server/src/notify_bridge_server/api/providers.py) — `update_provider` also omits `access_token` from the placeholder-drop list. The two lists are currently consistent with each other, but they must change together: masking the field in the response without also dropping the placeholder on update would let a saved `***` value overwrite the real token.
- Scenario: Any user authenticated to the bridge (any role) calls `GET /api/providers/{id}` for an HA provider they own and the response includes `config.access_token` in cleartext. The HA long-lived token grants full control of the user's Home Assistant instance (lights, locks, cameras, scripts, devices). In a multi-user deployment, even within the same admin account, a stolen JWT exfiltrates the HA token; in a single-user deployment, any read primitive (XSS via a future template feature, an MITM on an HTTPS misconfiguration) gives the same result.
- Remediation: Add `access_token` to both lists.
```python
# providers.py:_provider_response
for secret_field in (
    "api_key", "api_token", "webhook_secret", "password",
    "client_secret", "refresh_token", "access_token",  # <-- add
):
    ...
# providers.py:update_provider
for secret_field in (
    "api_key", "api_token", "webhook_secret", "password",
    "client_secret", "refresh_token", "access_token",  # <-- add
):
    value = incoming.get(secret_field)
    if isinstance(value, str) and value.startswith("***"):
        incoming.pop(secret_field, None)
```
Better still: replace the hand-maintained tuple with a single module-level constant `_PROVIDER_SECRET_FIELDS` referenced from both call sites, plus a unit test that asserts every field declared on the per-provider Pydantic configs whose name contains a denylisted substring (`token`, `secret`, `password`, `key`, `credential`) is in the set. That prevents the next provider type from re-introducing the same gap.
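A minimal sketch of that consolidation follows. The helper names (`mask_secrets`, `drop_placeholders`, `unlisted_secret_like_fields`) and the exact `***` placeholder are assumptions for illustration; only `_PROVIDER_SECRET_FIELDS` and the substring denylist come from the recommendation above.

```python
# Hypothetical single source of truth for both call sites.
_PROVIDER_SECRET_FIELDS = frozenset({
    "api_key", "api_token", "webhook_secret", "password",
    "client_secret", "refresh_token", "access_token",
})
# Substrings that suggest a field holds a credential.
_SECRET_NAME_HINTS = ("token", "secret", "password", "key", "credential")
_MASK = "***"  # assumed placeholder; update_provider only checks the "***" prefix


def mask_secrets(config: dict) -> dict:
    """Return a copy of config with every non-empty secret field masked."""
    return {
        k: (_MASK if k in _PROVIDER_SECRET_FIELDS and v else v)
        for k, v in config.items()
    }


def drop_placeholders(incoming: dict) -> dict:
    """Return a copy without secret fields that still carry the mask."""
    return {
        k: v for k, v in incoming.items()
        if not (k in _PROVIDER_SECRET_FIELDS
                and isinstance(v, str) and v.startswith(_MASK))
    }


def unlisted_secret_like_fields(field_names) -> list:
    """Names that look like credentials but are missing from the shared set."""
    return sorted(
        name for name in field_names
        if any(hint in name.lower() for hint in _SECRET_NAME_HINTS)
        and name not in _PROVIDER_SECRET_FIELDS
    )
```

A regression test could then feed the field names of every provider config model into `unlisted_secret_like_fields` and assert that the result is empty.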
#### H-2. Secrets stored in plaintext at rest
- CWE: CWE-312 (Cleartext Storage of Sensitive Information), CWE-256 (Plaintext Storage of a Password)
- Files:
- [`packages/server/src/notify_bridge_server/database/models.py:54-84`](../../packages/server/src/notify_bridge_server/database/models.py) — `TelegramBot.token: str`
- [`packages/server/src/notify_bridge_server/database/models.py:87-100`](../../packages/server/src/notify_bridge_server/database/models.py) — `MatrixBot` (access_token in config)
- `ServiceProvider.config: dict[str, Any]` (JSON column) holds Immich `api_key`, Gitea `webhook_secret` + `api_token`, Google Photos `client_secret` + `refresh_token`, HA `access_token`, etc.
- `EmailBot.smtp_password: str` (per [`api/email_bots.py:142`](../../packages/server/src/notify_bridge_server/api/email_bots.py))
- Scenario: An attacker who can read the SQLite file (compromised host, mis-permissioned backup volume, snapshot artifact in `data_dir/backups/`, leaked debug dump) gets every credential the bridge speaks: Telegram bot tokens (full bot control), Immich/Gitea/Planka API keys (read all photos / repos), Google Photos refresh tokens (long-lived, hard to revoke at scale), HA long-lived tokens (smart-home), SMTP passwords. The pre-migrate VACUUM-INTO snapshots (`packages/server/src/notify_bridge_server/database/snapshot.py`) inherit the same plaintext exposure and live alongside the active DB.
- Remediation options, in order of effort:
1. **Short term**: document the threat in `OPERATIONS.md`, enforce file-system permissions on `/data` (the Dockerfile chowns to appuser already, but the host bind-mount must be `chmod 700`), and ensure backups are encrypted at the storage layer (S3 SSE / Borg / restic).
2. **Better**: column-level encryption with a key derived from `NOTIFY_BRIDGE_SECRET_KEY` (or a separate `NOTIFY_BRIDGE_DB_ENCRYPTION_KEY`). Use the `cryptography` library's `Fernet` for each sensitive column; envelope the secret JSON keys, not the whole row, so `WHERE` clauses and existing migrations keep working. Add a one-shot migration that re-encrypts existing rows.
3. **Best**: encrypt with a KMS-backed key (HashiCorp Vault Transit, AWS KMS) and rotate per-secret data keys. This is overkill for a homelab-style deployment but mandatory if the bridge is ever multi-tenant.
- Skeleton for option 2:
```python
# new file packages/server/src/notify_bridge_server/security/secretbox.py
import base64
import hashlib

from cryptography.fernet import Fernet

from .config import settings


def _key() -> bytes:
    # Derive a deterministic Fernet key from secret_key. Anyone with secret_key
    # can decrypt (same threat model as JWT signing), but anyone with the DB
    # alone cannot.
    h = hashlib.sha256(settings.secret_key.encode()).digest()
    return base64.urlsafe_b64encode(h)


_fernet = Fernet(_key())


def encrypt_secret(plaintext: str) -> str:
    return _fernet.encrypt(plaintext.encode()).decode()


def decrypt_secret(ciphertext: str) -> str:
    return _fernet.decrypt(ciphertext.encode()).decode()
```
Apply at write time in `update_provider` / `create_provider`, decrypt at read time inside `make_immich_provider`, `make_gitea_provider`, the Telegram client constructor, etc. Add a migration that scans every `ServiceProvider.config` JSON and re-encrypts the listed keys in place.
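The per-key enveloping and the idempotent re-encryption pass might look roughly like the sketch below. The `enc:v1:` prefix, the `SECRET_KEYS` tuple, and both function names are hypothetical; the cipher is passed in as a callable so the logic can be exercised without the `cryptography` dependency, and only top-level keys of the config are handled.

```python
from typing import Callable

# Hypothetical marker: lets the migration skip values that are already encrypted.
_ENC_PREFIX = "enc:v1:"
# Illustrative list; in practice this would be the shared secret-field set.
SECRET_KEYS = (
    "api_key", "api_token", "webhook_secret", "access_token",
    "client_secret", "refresh_token", "password",
)


def encrypt_config(config: dict, encrypt: Callable[[str], str]) -> dict:
    """Return a copy with plaintext secret values replaced by prefixed ciphertext."""
    out = dict(config)
    for key in SECRET_KEYS:
        value = out.get(key)
        if isinstance(value, str) and value and not value.startswith(_ENC_PREFIX):
            out[key] = _ENC_PREFIX + encrypt(value)
    return out


def decrypt_config(config: dict, decrypt: Callable[[str], str]) -> dict:
    """Return a copy with prefixed ciphertext replaced by plaintext.

    Values without the prefix are passed through, so rows that have not been
    migrated yet keep working.
    """
    out = dict(config)
    for key in SECRET_KEYS:
        value = out.get(key)
        if isinstance(value, str) and value.startswith(_ENC_PREFIX):
            out[key] = decrypt(value[len(_ENC_PREFIX):])
    return out
```

Because non-secret keys stay in cleartext, existing queries and migrations that read fields such as a provider URL would be unaffected.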
---
### MEDIUM
#### M-1. Template preview endpoints skip the renderer watchdog
- CWE: CWE-400 (Uncontrolled Resource Consumption), CWE-1333 (Inefficient Regular Expression Complexity — analogous)
- Files:
- [`packages/server/src/notify_bridge_server/api/template_configs.py:608-613`](../../packages/server/src/notify_bridge_server/api/template_configs.py) — `preview_config` calls `SandboxedEnvironment(autoescape=False).from_string(template_body).render(...)` directly.
- [`packages/server/src/notify_bridge_server/api/slot_helpers.py:72-90`](../../packages/server/src/notify_bridge_server/api/slot_helpers.py) — `render_template_preview` (used by `/preview-raw` for both notification and command templates).
- [`packages/server/src/notify_bridge_server/services/notifier.py:494-499`](../../packages/server/src/notify_bridge_server/services/notifier.py) — `send_test_template_notification`.
- The hardened helper [`packages/core/src/notify_bridge_core/templates/renderer.py:48-108`](../../packages/core/src/notify_bridge_core/templates/renderer.py) (with timeout, length caps, output cap) is **not** used here.
- Scenario: An authenticated admin submits a CPU-heavy template to `POST /api/template-configs/preview-raw`, for example `{% for i in range(100000) %}{% for j in range(100000) %}x{% endfor %}{% endfor %}` (nested, because the sandbox's `range` is capped at 100,000 items per call). Jinja2 has no built-in timeout, and the sandbox restricts unsafe attribute access but does not bound CPU time. The request ties up a FastAPI executor thread until the worker is OOM-killed or the client times out. Repeat to DoS the API.
- Remediation: Route every render through a single, hardened helper.
```python
# Use the existing core helper consistently
from notify_bridge_core.templates.renderer import render_template
rendered = render_template(template_str, context) # already has timeout + caps
```
For the strict-undefined two-pass validation in `render_template_preview`, fold the watchdog into the helper itself rather than skipping it.
#### M-2. Login rate limit is per-IP only
- CWE: CWE-307 (Improper Restriction of Excessive Authentication Attempts)
- Files: [`packages/server/src/notify_bridge_server/auth/routes.py:140-157`](../../packages/server/src/notify_bridge_server/auth/routes.py).
- Scenario: `@limiter.limit("5/minute")` keyed on `get_remote_address` gives 5 attempts per source IP per minute = ~7,200/day per IP. An attacker rotating across 10 IPs (cheap cloud, residential proxies, even a Tor exit pool) gets 72,000/day. With the 8-character minimum password and no complexity requirement, a common 8-character password is reachable in days, not centuries. There is no per-username lockout, no captcha, no MFA.
- Remediation:
1. Add a per-username sliding-window limiter on top of the per-IP one. Use a second `Limiter` whose `key_func` returns the lower-cased username from the body. Re-check after parsing the body.
2. Add an exponential lockout: after N consecutive failures for a username, require a cooldown (record in a `LoginFailure` table or in-memory TTLCache).
3. Document and recommend deploying behind a reverse proxy that adds CAPTCHA / WAF rate-limiting for login (Cloudflare Turnstile is cheap).
4. Track and log failed logins (auth-event audit trail) with src IP + username + timestamp.
```python
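# Sketch: a second limiter that keys by username from the parsed body.
from cachetools import TTLCache  # assumed dependency
from fastapi import HTTPException

# In-memory TTLCache: 10 attempts per username, expiring 15 minutes after the last one
_username_attempts = TTLCache(maxsize=10_000, ttl=15 * 60)

async def _check_username_quota(username: str) -> None:
    attempts = _username_attempts.get(username, 0)
    if attempts >= 10:
        raise HTTPException(429, "Too many attempts for this account")
    _username_attempts[username] = attempts + 1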
```
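The exponential lockout from item 2 could be computed with a small pure function, sketched here. The function name and the defaults (5 free attempts, 30-second base, one-hour cap) are illustrative assumptions, not values taken from the codebase.

```python
def lockout_seconds(
    consecutive_failures: int,
    free_attempts: int = 5,
    base: int = 30,
    cap: int = 3600,
) -> int:
    """Cooldown to impose after the given number of consecutive failures.

    The first free_attempts failures cost nothing; each failure after that
    doubles the cooldown, starting at base seconds and never exceeding cap.
    """
    if consecutive_failures <= free_attempts:
        return 0
    exponent = consecutive_failures - free_attempts - 1
    return min(cap, base * 2 ** exponent)
```

Keeping the schedule separate from storage means the same function works whether failures are counted in a `LoginFailure` table or in an in-memory cache.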
#### M-3. Webhook payload log redactor is keyword-based, misses value-based secrets
- CWE: CWE-532 (Insertion of Sensitive Information into Log File)
- Files: [`packages/server/src/notify_bridge_server/api/webhooks.py:326-358`](../../packages/server/src/notify_bridge_server/api/webhooks.py).
- Scenario: `_redact_sensitive_body` walks the JSON and redacts values whose **keys** contain `token`, `auth`, `key`, `secret`, etc. A webhook provider that ships secrets under an innocent key (e.g. `"oauth_state": "ya29.a0..."`, `"continuation": "ABCDE..."`, `"x_state": "..."`) leaves the secret in the persisted payload log. The log row is admin-readable and exported in backups.
- Remediation: Layer a high-entropy value detector on top of the key matcher (e.g. redact anything matching `[A-Za-z0-9_\-+/=]{32,}` with Shannon entropy ≥ 3.5 bits per character). At a minimum, also redact values with known token prefixes (`ya29.`, `xoxb-`, `ghp_`, `glpat-`, `sk-`, `Bearer `).
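A value-based detector along those lines might look like this sketch. The function names are hypothetical, and the 3.5-bit threshold and prefix list are the starting points suggested above; both would need tuning against real payloads to keep false positives down.

```python
import math
import re
from collections import Counter

# Long runs of token-alphabet characters, as proposed in the remediation.
_TOKEN_RE = re.compile(r"[A-Za-z0-9_\-+/=]{32,}")
_KNOWN_PREFIXES = ("ya29.", "xoxb-", "ghp_", "glpat-", "sk-", "Bearer ")


def shannon_entropy(value: str) -> float:
    """Shannon entropy of the string, in bits per character."""
    if not value:
        return 0.0
    total = len(value)
    return -sum(
        (n / total) * math.log2(n / total) for n in Counter(value).values()
    )


def looks_like_secret(value: str, min_entropy: float = 3.5) -> bool:
    """True for values with a known token prefix or a long high-entropy body."""
    if value.startswith(_KNOWN_PREFIXES):
        return True
    return (
        _TOKEN_RE.fullmatch(value) is not None
        and shannon_entropy(value) >= min_entropy
    )
```

The redactor would call `looks_like_secret` on every string leaf whose key did not already match the name-based list.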
#### M-4. Webhook ingestion has no replay protection
- CWE: CWE-294 (Authentication Bypass by Capture-replay)
- Files: [`packages/server/src/notify_bridge_server/api/webhooks.py`](../../packages/server/src/notify_bridge_server/api/webhooks.py) — Gitea/Planka/Generic.
- Scenario: An attacker who once intercepts a signed Gitea push event (network downgrade, log leak from a proxy, exfil from the Gitea side) can replay it indefinitely. The HMAC stays valid; the bridge has no nonce / timestamp window / delivery-ID cache. With a webhook that fires `assets_added` it's just noise. With a webhook that triggers an action (planka card-created → `/api/actions/{id}/execute` chained logic), it could be more.
- Remediation: For Gitea, store the last N `X-Gitea-Delivery` UUIDs per provider and reject duplicates; enforce this with a unique index and cap the table size. For the generic webhook, add an optional `replay_window_seconds` + a timestamp-extracting JSONPath in the provider config. Compare signatures with a constant-time string compare.
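The duplicate-delivery check and the timestamp window can be sketched as follows. This is an in-memory stand-in for the table-plus-index approach, so it would not survive restarts or span multiple workers; the class name and defaults are assumptions.

```python
import time
from collections import OrderedDict


class DeliveryReplayGuard:
    """Remembers the last max_ids delivery IDs per provider and rejects repeats."""

    def __init__(self, max_ids: int = 1000, window_seconds: float = 300.0):
        self._max_ids = max_ids
        self._window = window_seconds
        self._seen = {}  # provider_id -> OrderedDict of delivery IDs, oldest first

    def accept(self, provider_id, delivery_id, sent_at=None, now=None) -> bool:
        now = time.time() if now is None else now
        # Optional timestamp window (the generic-webhook replay_window_seconds idea).
        if sent_at is not None and abs(now - sent_at) > self._window:
            return False
        seen = self._seen.setdefault(provider_id, OrderedDict())
        if delivery_id in seen:
            return False  # replay of a delivery that was already processed
        seen[delivery_id] = None
        if len(seen) > self._max_ids:
            seen.popitem(last=False)  # evict the oldest ID to bound memory
        return True
```

Once an ID has been evicted it would be accepted again, which is why a bounded ID cache works best when paired with the timestamp window.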
#### M-5. `bcrypt.checkpw` dummy-hash literal is malformed
- CWE: CWE-208 (Observable Timing Discrepancy) — partial.
- Files: [`packages/server/src/notify_bridge_server/auth/routes.py:147-152`](../../packages/server/src/notify_bridge_server/auth/routes.py).
- Scenario: When the username doesn't exist, the code calls `_verify_password(body.password, "$2b$12$" + "a" * 53)`. That hash is not a real bcrypt hash; `bcrypt.checkpw` raises `ValueError` which `_verify_password` swallows and returns `False`. The exception path is *faster* than a real bcrypt verify (no key schedule), so timing of "user does not exist" differs from "user exists, wrong password" — a maintainer changing the swallow behaviour later could regress this entirely.
- Remediation: Cache one valid dummy bcrypt hash at module load time so the verify path actually runs the KDF.
```python
_DUMMY_BCRYPT_HASH = bcrypt.hashpw(b"x", bcrypt.gensalt()).decode()  # module load
...
password_ok = await _verify_password(
    body.password,
    user.hashed_password if user else _DUMMY_BCRYPT_HASH,
)
```
#### M-6. Setup endpoint relies on `User.id != 0` filter — robust but a single typo breaks it
- CWE: CWE-302 (Authentication Bypass by Assumed-Immutable Data) — defence-in-depth.
- Files: [`packages/server/src/notify_bridge_server/auth/routes.py:97-119`](../../packages/server/src/notify_bridge_server/auth/routes.py).
- Scenario: `POST /api/auth/setup` is gated by "no users with id != 0". The `__system__` sentinel is id=0. If a future migration changes the sentinel id, or the `WHERE` clause is dropped during a refactor, setup re-opens silently and an internet-reachable bridge would let an attacker claim the admin account.
- Remediation: Add a defence-in-depth flag `AppSetting.setup_completed=true` set during the first successful setup, and require it to be unset (in addition to the count check). This bakes the invariant into a single boolean that's easier to audit.
#### M-7. Anonymous Prometheus metrics endpoint leaks operational data
- CWE: CWE-200 (Exposure of Sensitive Information to an Unauthorized Actor)
- Files: [`packages/server/src/notify_bridge_server/api/metrics.py:138-159`](../../packages/server/src/notify_bridge_server/api/metrics.py).
- Notes: This is **documented and gated** by `NOTIFY_BRIDGE_METRICS_ENABLED`, and the comment explicitly says scrapers don't authenticate. Acceptable when the API port is firewalled to the scraper. Surface it here as informational so an operator who exposes the API directly to the internet (e.g. via reverse-proxy without an ACL) doesn't accidentally expose dispatch rates, provider names, queue depths.
- Remediation: keep the env flag, but additionally allow `metrics_basic_auth_user` / `metrics_basic_auth_password` as a soft credential check on the endpoint so a "default enabled, default protected" mode is possible. Document the threat in `OPERATIONS.md` next to the env var.
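A soft Basic Auth check for the metrics endpoint could be as small as the sketch below. The function name is hypothetical, and the expected credentials would come from the proposed `metrics_basic_auth_user` / `metrics_basic_auth_password` settings; for brevity the scheme match is case-sensitive.

```python
import base64
import binascii
import hmac


def metrics_auth_ok(authorization_header, expected_user: str, expected_password: str) -> bool:
    """Validate an HTTP Basic Authorization header in constant time."""
    if not authorization_header or not authorization_header.startswith("Basic "):
        return False
    try:
        raw = base64.b64decode(authorization_header[len("Basic "):], validate=True)
        decoded = raw.decode("utf-8")
    except (binascii.Error, UnicodeDecodeError):
        return False
    user, sep, password = decoded.partition(":")
    if not sep:
        return False
    # Evaluate both comparisons so a wrong username does not short-circuit.
    user_ok = hmac.compare_digest(user.encode(), expected_user.encode())
    password_ok = hmac.compare_digest(password.encode(), expected_password.encode())
    return user_ok and password_ok
```

Prometheus scrape configs support Basic Auth natively, so this would not require changes on the scraper side beyond adding the credentials.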
---
### LOW
#### L-1. CSP allows `'unsafe-inline'` for scripts
- CWE: CWE-1021 (Improper Restriction of Rendered UI Layers or Frames) — adjacent.
- File: [`packages/server/src/notify_bridge_server/main.py:186-201`](../../packages/server/src/notify_bridge_server/main.py).
- Notes: Comment explicitly justifies it — SvelteKit static adapter emits an inline bootstrap. Acceptable, but `'strict-dynamic'` with a per-page nonce (or moving the bootstrap into a hashed external module) eliminates the gap entirely. Track as INFO unless future XSS-injection paths emerge.
#### L-2. CSP `style-src 'unsafe-inline'` allows inline-style XSS payloads
- CWE: CWE-79 (Cross-site Scripting) — defence-in-depth.
- Same file as L-1. Inline styles are not directly executable, but they are a known vector for click-jacking and data-exfil via CSS selectors. Same remediation path: nonce-based CSP.
#### L-3. `frame-ancestors 'none'` and `X-Frame-Options: DENY` are both set (no finding)
- INFO only. Both `X-Frame-Options: DENY` and `frame-ancestors 'none'` are set; modern browsers honour CSP, legacy ones honour XFO. Good.
#### L-4. Webhook `_filter_headers` allowlist accepts unknown `X-*` headers
- CWE: CWE-532
- File: [`packages/server/src/notify_bridge_server/api/webhooks.py:361-374`](../../packages/server/src/notify_bridge_server/api/webhooks.py).
- Notes: The filter strips known sensitive headers, then accepts any `X-*`. A custom auth header like `X-Custom-Authentication: <token>` would slip past the substring check if the name doesn't contain `auth`/`token`/`key`/`secret`/etc. Low risk because the well-known providers we support don't ship such headers, but a misconfigured generic webhook will leave a credential in the log row.
- Remediation: invert the policy and keep an explicit allowlist of known-safe `X-*` headers. Even `X-Forwarded-For` is borderline, since it can carry PII.
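An allowlist-based filter might look like the following sketch. The header set and function name are illustrative assumptions; the real list would be built from the headers each supported provider is known to send.

```python
# Illustrative allowlist, stored lower-case because header names are case-insensitive.
_SAFE_LOGGED_HEADERS = frozenset({
    "content-type",
    "user-agent",
    "x-gitea-event",
    "x-gitea-delivery",
})


def filter_headers_for_log(headers: dict) -> dict:
    """Keep only headers that are explicitly known to be safe to persist."""
    return {
        name: value
        for name, value in headers.items()
        if name.lower() in _SAFE_LOGGED_HEADERS
    }
```

With this shape, an unknown header such as `X-Custom-Authentication` is dropped by default instead of relying on a substring match to catch it.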
#### L-5. `external_url` setting is not validated against an allow-list
- CWE: CWE-918 (SSRF), CWE-79 (XSS in the rendered Telegram webhook URL).
- File: [`packages/server/src/notify_bridge_server/api/app_settings.py:329-339`](../../packages/server/src/notify_bridge_server/api/app_settings.py) reads, [`packages/server/src/notify_bridge_server/api/telegram_bots.py:247`](../../packages/server/src/notify_bridge_server/api/telegram_bots.py) writes it into the registered Telegram webhook URL.
- Notes: An admin can set `external_url` to anything. The value is used to build the URL passed to Telegram in `setWebhook`. Telegram itself enforces an HTTPS-only allow-list, so the actual risk is bounded. Still — validate scheme + host + that it doesn't include credentials or fragments.
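The scheme, host, credential, and fragment checks could be sketched as below. The function name is hypothetical, and requiring `https` is an assumption based on the Telegram webhook use; a deployment that uses `external_url` for other purposes might need to relax it.

```python
from urllib.parse import urlsplit


def validate_external_url(value: str) -> str:
    """Return the normalised URL, or raise ValueError if it is not acceptable."""
    candidate = value.strip()
    parts = urlsplit(candidate)
    if parts.scheme != "https":
        raise ValueError("external_url must use https")
    if not parts.hostname:
        raise ValueError("external_url must include a host")
    if parts.username is not None or parts.password is not None:
        raise ValueError("external_url must not embed credentials")
    if parts.fragment:
        raise ValueError("external_url must not include a fragment")
    # Strip trailing slashes so path joins behave predictably.
    return candidate.rstrip("/")
```

Running this when the setting is saved, not only when it is read, would keep an invalid value from ever reaching the `setWebhook` call.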
#### L-6. Bot token GET endpoint is intentional but worth auditing
- File: [`packages/server/src/notify_bridge_server/api/telegram_bots.py:148-156`](../../packages/server/src/notify_bridge_server/api/telegram_bots.py).
- Notes: `GET /api/telegram-bots/{bot_id}/token` returns the full Telegram bot token to the owner. Used by the frontend to construct webhook URLs. Limiting to a single short-lived nonce per `register_bot_webhook` flow would be safer than exposing the token directly. Currently INFO; revisit if a multi-user role model lands.
#### L-7. SQLite journal mode + backup snapshot file permissions
- File: [`packages/server/src/notify_bridge_server/database/snapshot.py:60-95`](../../packages/server/src/notify_bridge_server/database/snapshot.py).
- Notes: Snapshots are written via `VACUUM INTO 'path'`. They land in `data_dir/backups/` with default umask permissions. In the Docker image the dir is owned by `appuser` and only that user runs the process, so this is fine. On a host bind-mount, an operator who forgets to lock down `/data` exposes every credential in every snapshot to anyone with shell access. Document this in `OPERATIONS.md`.
#### L-8. No CSRF token on state-changing endpoints
- CWE: CWE-352
- Notes: The API uses `Authorization: Bearer <jwt>` exclusively (no cookies). Browsers don't auto-attach `Authorization` headers cross-origin, so this is **not** classical CSRF-exploitable. Combined with strict CORS (`allow_credentials=True`, explicit origin allowlist, wildcard rejected on startup) and the `Origin`/`Referer` same-host check on the backup endpoints, the practical risk is essentially zero. INFO only.
---
### INFO / NEEDS VERIFICATION
#### N-1. Jinja2 `SandboxedEnvironment` is the standard sandbox — confirm it covers your threat model
- The sandbox blocks `__class__`, `__mro__`, etc., but it is well-known that Jinja2's sandbox is not a security boundary against a determined attacker who can author templates. The threat model here is "templates are admin-authored, so we trust them but use the sandbox as defence-in-depth"; that is reasonable. Document explicitly in `OPERATIONS.md` that anyone with template-edit permission has effective RCE on the worker thread (`{{ foo.__init__.__globals__... }}` style escapes have been published in the past; new ones surface periodically).
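- Verification: run `bandit -r packages/` and `safety check` against the pinned `jinja2` version. Jinja2 CVEs to track: `CVE-2024-34064` (`xmlattr` filter, fixed in 3.1.4) and the sandbox breakouts `CVE-2024-56326` (fixed in 3.1.5) and `CVE-2025-27516` (fixed in 3.1.6), plus any later disclosures. A floor of `jinja2>=3.1.4` therefore still admits versions with known sandbox escapes; require `jinja2>=3.1.6` or newer.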
#### N-2. `apscheduler<4`
- Notes: The pin `apscheduler>=3.10,<4` keeps the bridge on the 3.x line, which is in maintenance. No known CVEs as of this review. Track when 4.x stabilises and migrate.
#### N-3. `python-multipart>=0.0.9`
- Notes: This package had high-severity bugs prior to 0.0.6. The minimum here is 0.0.9 — good.
#### N-4. No signed-image / SBOM on the container
- Notes: The `release.yml` workflow builds and pushes a multi-tag image but does not sign with cosign or emit an SBOM. For an internet-facing deployment, consider adding `cosign sign` against the image digest, and `syft packages` to emit an SBOM at release time. INFO only.
#### N-5. Frontend dependencies use caret (`^`) ranges
- Notes: `package.json` uses `^x.y.z`. CI runs `npm ci` against `package-lock.json`, so reproducibility is fine at build time. There is no `npm audit` step in `.gitea/workflows/build.yml`. Add `npm audit --audit-level=high` to the frontend build job.
#### N-6. `NOTIFY_BRIDGE_ALLOW_PRIVATE_URLS=1` is a footgun
- File: [`packages/core/src/notify_bridge_core/notifications/ssrf.py:39-52`](../../packages/core/src/notify_bridge_core/notifications/ssrf.py).
- Notes: When set, the SSRF guard becomes a no-op. The warning at boot is the only mitigation. Acceptable for the documented homelab use-case; document that the env flag must NEVER be set on an internet-reachable instance, and consider refusing to enable it when `cors_allowed_origins` resolves to a non-loopback host (defence-in-depth interlock).
#### N-7. Verify the auth flow at the WebSocket boundary
- File: [`packages/core/src/notify_bridge_core/providers/home_assistant/client.py:54-83`](../../packages/core/src/notify_bridge_core/providers/home_assistant/client.py).
- The `_ws_url_from_base` correctly strips userinfo before connecting and `_redact` defangs error messages — verify that `wss://` URLs go through SSRF validation (currently the HA URL is validated by `AnyHttpUrl` at config time but I did not find a call to `avalidate_outbound_url_full` on the HA WS connect path; the resolver would not pin a host the validator never saw).
- Action: confirm by reading `ha_subscription.py` for explicit validation, or add a check that calls `avalidate_outbound_url_full` against the derived `ws_url` (treating `ws`/`wss` like `http`/`https` for the block-range check) before `ws_connect`.
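Treating `ws`/`wss` like `http`/`https` for the block-range check could be done with a small mapping step before validation, sketched here. The helper name is hypothetical, and how its result is handed to `avalidate_outbound_url_full` depends on that function's actual signature, which was not checked for this sketch.

```python
from urllib.parse import urlsplit, urlunsplit

_WS_TO_HTTP = {"ws": "http", "wss": "https"}


def http_equivalent(ws_url: str) -> str:
    """Map a ws:// or wss:// URL to its http(s) twin for SSRF validation."""
    parts = urlsplit(ws_url)
    scheme = _WS_TO_HTTP.get(parts.scheme)
    if scheme is None:
        raise ValueError(f"not a WebSocket URL: {parts.scheme!r}")
    return urlunsplit((scheme, parts.netloc, parts.path, parts.query, parts.fragment))
```

The HA connect path would validate `http_equivalent(ws_url)` first and only call `ws_connect` when validation passes, so the resolver pins a host the validator has actually seen.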
---
## Prioritised Fix List (Top 10)
1. **HIGH H-1** — Add `access_token` to the secret-mask list in `providers._provider_response` and the placeholder-drop list in `providers.update_provider`. Add a regression test that GETs an HA provider and asserts the response does not contain the cleartext token.
2. **HIGH H-2** — Implement column-level encryption for `TelegramBot.token`, `MatrixBot` access tokens, `EmailBot.smtp_password`, and the sensitive keys inside `ServiceProvider.config`. Use Fernet with a key derived from `SECRET_KEY`. Write a one-shot migration.
3. **MEDIUM M-1** — Replace the ad-hoc `SandboxedEnvironment(...).render()` calls in the four preview/test paths with the single hardened `render_template()` helper that already has timeout + size caps.
4. **MEDIUM M-2** — Add per-username login lockout (TTL cache or DB-backed) on top of the per-IP `5/minute`. Log failed login attempts.
5. **MEDIUM M-5** — Replace the malformed dummy bcrypt literal in `login()` with a real bcrypt hash computed once at module load so the timing-equalisation actually runs the KDF.
6. **MEDIUM M-3** — Strengthen `_redact_sensitive_body` with a value-entropy heuristic and well-known token-prefix matching.
7. **MEDIUM M-4** — Add replay protection on Gitea webhooks via the `X-Gitea-Delivery` header (small table + partial unique index).
8. **MEDIUM M-7** — Make the metrics endpoint require either a flag or a Basic Auth credential; document in `OPERATIONS.md` that the API port should not be internet-exposed when metrics are on.
9. **MEDIUM M-6** — Add a defence-in-depth `setup_completed` boolean in `app_setting` and check it in `/api/auth/setup` in addition to the count.
10. **N-5** — Add `npm audit --audit-level=high` to the frontend build job in `.gitea/workflows/build.yml` so dependency CVEs land in CI.
---
## What was confirmed safe (worth keeping)
- JWT design: HS256 with `iss`/`aud`/`exp`/`type`/`sub`/`ver`; refresh/access split; `token_version` revocation on role change, username change, and password change.
- bcrypt with 72-byte length guard; CPU-bound work run in a thread.
- SSRF guard with: scheme allowlist, IPv4-mapped-IPv6 unwrap, CGNAT block, IDN normalisation, async resolver, `PinnedResolver` to defeat DNS rebinding.
- SQL access goes through SQLModel/SQLAlchemy with bind parameters; the only `f"..."` SQL is in DDL (column adds, index creates, `VACUUM INTO`) using server-controlled identifiers — sampled and clean.
- Sandbox is `SandboxedEnvironment` everywhere a user-controllable template is rendered (six locations checked).
- Frontend `{@html}` is wrapped in `sanitizePreview()` everywhere (`tracking-configs`, `template-configs`, `command-template-configs`).
- Provider config secrets are masked on GET (except H-1).
- `_resolve_backup_file` rejects `..`, NUL, separators, and enforces `relative_to(base)`.
- CORS rejects wildcard with credentials at startup; secret_key default values are rejected with a clear error.
- Docker: non-root user, `read_only: true`, `tmpfs: /tmp`, `no-new-privileges`, `cap_drop: ALL`, resource limits, healthcheck on `/api/ready`.
- Logging: `SecretMaskingFilter` masks Telegram bot tokens, `Authorization`, `x-api-key`, `password`, `secret`, `access_token`, `refresh_token` from formatted messages, exception text, and stack traces.
- Telegram webhook: secret token mandatory, refused on missing config, opaque `webhook_path_id` separate from bot token.
- Inbound generic webhook: refuses `auth_mode="none"` unless an explicit acknowledgment field is set; auto-generates a strong secret if missing for `bearer_token`/`hmac_sha256`.
- Inbound payload size capped at 1 MiB with a streaming check that doesn't trust `Content-Length`.
---
## Methodology
- Manual code review of every authentication, authorization, webhook ingestion, template rendering, secret-handling, and outbound HTTP path under `packages/`.
- Cross-checked CORS / CSP / security headers and rate-limiter configuration in `main.py` + `auth/routes.py`.
- Sampled API routes for ownership enforcement (`get_owned_entity` / `_get_user_provider` / `_get_user_bot`) — all sampled routes apply it; no IDOR found.
- Grepped for `Environment(` / `jinja2.Environment` / `f"..."` SQL / `{@html}` / `subprocess` / `eval` / `os.system` / known-bad patterns.
- Reviewed CI workflows for secret leakage in env blocks and image-signing posture.
- Reviewed Dockerfile + docker-compose for least-privilege and read-only root.
- No dynamic testing performed; static review only. Run `pytest` (already gated in CI) + `bandit -r packages/` + `npm audit` in CI to backstop this review.
+408
@@ -0,0 +1,408 @@
# UI / UX Design Review — Notify Bridge frontend
**Reviewed**: 2026-05-22
**Scope**: SvelteKit frontend at `frontend/`, "Aurora / Glass" aesthetic, en + ru locales.
**Reviewer method**: Read `app.css`, `+layout.svelte`, dashboard, login, setup, providers, targets, users, settings (parent), settings/IdentityCassette, notification-trackers, template-configs, actions, bots, plus shared components (Card, Button, Modal, ConfirmModal, AuthLayout, PageHeader, EmptyState, Loading, Snackbar). Cross-cutting Grep passes for inputs, border-radius, ARIA, sort, hex colors.
---
## Executive summary
- **Aurora design language is real and distinctive.** Newsreader display serif + Geist variable sans + Geist Mono, conic-gradient brand orb, animated radial-gradient aurora background (`body::before` 28s drift), gradient pill chips, glow-pulse dots, and the lavender/orchid/mint/citrus/coral/sky palette together give the product a clear visual identity. This is **not** generic admin-template AI slop — the dashboard hero, signal-stream rows, provider deck, and the `PageHeader` "subpage-hero" pattern all carry intentional character that the user will remember.
- **Consistency is the weakest axis.** Five overlapping card container abstractions (`.hero-card`, `.panel`, `.glass`, `Card.svelte`, settings `.cassette`/`.identity`) re-implement the same frosted-glass recipe with diverging radius (22 / 18 / 14 / 12 px) and padding (1.25/1.4 vs 1.3/1.4 vs 2/2.4 rem). A `--radius: 1rem` token is declared but unused. Pick one card module + one radius scale (e.g. `--radius-card: 22px`, `--radius-input: 12px`, `--radius-pill: 999px`).
- **Forms have not been migrated to Aurora.** ~71 occurrences across 17 files still use the legacy raw class string `border border-[var(--color-border)] rounded-md text-sm bg-[var(--color-background)]` instead of the global `input { ... }` rule already in `app.css` (which uses `--color-input-bg`, `--color-rule-strong`, 0.625rem radius, glow focus ring). Result: `rounded-md` (6px) fields next to 22px-radius cards, solid opaque backgrounds inside frosted-glass cards. Removing the override class would auto-restyle every form to match. **HIGH** priority, mostly mechanical.
- **Hardcoded hex colors leak through.** Snackbar uses `#059669` / `#ef4444` / `#3b82f6` / `#f59e0b` instead of `--color-mint/coral/sky/citrus`. ConfirmModal uses a raw `rgba(239, 68, 68, 0.3)` glow. Actions page uses `#059669` for the enabled dot. All bypass theming — they will look wrong in light theme.
- **Snackbar is invisible to screen readers.** No `role="status"` / `aria-live="polite"` / `aria-live="assertive"` on the toast container. Critical confirmations (saved, deleted, error) are never announced. **HIGH** accessibility fix, one-line.
- **No `aria-current="page"` anywhere in the nav.** Active state is conveyed only visually (gradient left bar + glow), so assistive technology has no programmatic way to identify the current page.
- **No sortable columns, no multi-select bulk actions, anywhere in the app.** Lists rely entirely on `IconGridSelect` sort widgets (newest / oldest, etc.) and per-row icon buttons. For a notification routing system that may accumulate dozens of trackers / targets / configs, this scales poorly.
- **Localization parity is solid structurally** (en.json = ru.json = 1577 lines), but that is line-count parity only. Several places (hero title, brand row with provider name, stat-card label/value flex) have no length guard for the longer Russian translations, so visible truncation or wrapping is likely.
- **Onboarding is a single screen.** After `/setup` lands you on `/` with `0 providers` and a hero saying "all clear" — the most important first-run moment shows nothing to do. No checklist, no empty-dashboard CTA panel, no tour.
- **Power-user standouts**: the ⌘K SearchPalette (present and wired through the topbar), the global provider filter, and reduced-motion media-query support. These three deserve credit, and the palette and filter should be more discoverable (there is no in-app hint that they exist).
---
## Findings by area
### 1. Design quality vs generic AI aesthetic
#### F-DESIGN-01 — Aurora identity is strong and self-consistent at the macro level [LOW / commendation]
- **Files**: [`frontend/src/app.css`](frontend/src/app.css), [`frontend/src/routes/+layout.svelte`](frontend/src/routes/+layout.svelte), [`frontend/src/routes/+page.svelte`](frontend/src/routes/+page.svelte)
- **State**: Newsreader display serif italic with linear-gradient text-clip is used in hero titles, panel titles, modal titles. Conic brand orb is unique. Aurora drift on body::before is a 28s slow loop that's never busy. The "signal" / "wires" / "on watch" / "pulse" / "stream" / "compose" semantic naming on the dashboard is editorial, not generic admin copy.
- **Verdict**: Keep all of this. Lean *further* into it on the subpages — most list pages currently default back to plain "PageHeader + Card list" without inheriting the dashboard's editorial flavor.
#### F-DESIGN-02 — Italic-serif emphasis loses impact on smaller subpage titles [LOW]
- **Files**: [`frontend/src/lib/components/PageHeader.svelte`](frontend/src/lib/components/PageHeader.svelte) (lines 132-147)
- **State**: `subpage-hero__title` is 2.15rem with italic emphasis on a gradient. At that size the gradient italic word is legible but loses the editorial drama it has at the 3rem dashboard hero. Russian translations (`em` words like *«операторы»*) sometimes look cramped because letter-spacing -0.025em is shared with the much larger dashboard hero.
- **Suggestion**: Use a separate letter-spacing scale per font size step, or drop italic emphasis on titles below ~2rem and use color-only emphasis there.
---
### 2. Visual consistency
#### F-CONSIST-01 — Five overlapping card abstractions [HIGH]
- **Files**: [`frontend/src/app.css`](frontend/src/app.css) `.glass`, [`frontend/src/lib/components/Card.svelte`](frontend/src/lib/components/Card.svelte), [`frontend/src/lib/components/PageHeader.svelte`](frontend/src/lib/components/PageHeader.svelte) `.subpage-hero`, [`frontend/src/routes/+page.svelte`](frontend/src/routes/+page.svelte) `.hero-card` / `.panel` / `.stat-card`, [`frontend/src/routes/settings/IdentityCassette.svelte`](frontend/src/routes/settings/IdentityCassette.svelte) `.identity` + `.glass`
- **State**: Six places re-declare the same recipe: `background: var(--color-glass); backdrop-filter: blur(28px) saturate(160%); border: 1px solid var(--color-border); border-radius: 22px; box-shadow: var(--shadow-card);` followed by an `::after` highlight overlay. Card.svelte even has its own 22px radius next to the global `.glass` 22px radius — they would diverge silently if either gets touched.
- **Suggestion**: Consolidate into one `<GlassPanel>` component (or `.glass-card` utility) with variants `default | hero | panel | cassette` for padding/radius differences. Delete the duplicated `::after` overlays. The pattern is good — it's just *copy-pasted* 5+ times.
#### F-CONSIST-02 — Border-radius drift, no scale [HIGH]
- **Files**: [`frontend/src/routes/+layout.svelte`](frontend/src/routes/+layout.svelte), [`frontend/src/routes/+page.svelte`](frontend/src/routes/+page.svelte), [`frontend/src/app.css`](frontend/src/app.css)
- **State**: Radii used: 22, 18, 14, 12, 11, 10, 9, 8, 7, 6, 3, 2 px + 0.3, 0.5, 0.625, 0.85, 1 rem + 9999px. `--radius: 1rem` is declared in the theme but never consumed: no component reads it.
- **Suggestion**: Define and *use* `--radius-card: 22px; --radius-panel: 18px; --radius-pill: 999px; --radius-input: 12px; --radius-chip: 8px; --radius-tile: 6px;`. Refactor in passes — start with `Card.svelte`, `Button.svelte`, `Modal.svelte`, `ConfirmModal.svelte`.
#### F-CONSIST-03 — Hardcoded hex colors bypass theming [HIGH]
- **Files**:
- [`frontend/src/lib/components/Snackbar.svelte`](frontend/src/lib/components/Snackbar.svelte) lines 26-31: `#059669 / #ef4444 / #3b82f6 / #f59e0b`
- [`frontend/src/lib/components/ConfirmModal.svelte`](frontend/src/lib/components/ConfirmModal.svelte) line 70: `box-shadow: 0 0 16px rgba(239, 68, 68, 0.3)`
- [`frontend/src/routes/actions/+page.svelte`](frontend/src/routes/actions/+page.svelte) line 379: `style="background: {action.enabled ? '#059669' : 'var(--color-muted-foreground)'}"`
- 25 files in `frontend/src/routes/**` contain `#xxx` literals
- **State**: These colors are NOT the Aurora palette — `#059669` is emerald-600, our mint is `#7ee8c4`. In light theme the user sees green-on-green that wasn't intended.
- **Suggestion**: Replace all status hexes with `--color-mint/coral/sky/citrus/orchid`. Add a stylelint rule `color-no-hex` scoped to `src/**/*.svelte` to prevent regression.
#### F-CONSIST-04 — Form input styling not migrated to Aurora [HIGH]
- **Files**: 17 routes, ~71 occurrences. Examples: [`frontend/src/routes/users/+page.svelte`](frontend/src/routes/users/+page.svelte) lines 137, 141, 190, 207; [`frontend/src/routes/providers/+page.svelte`](frontend/src/routes/providers/+page.svelte) lines 303, 309, 323, 333; [`frontend/src/routes/notification-trackers/TrackerForm.svelte`](frontend/src/routes/notification-trackers/TrackerForm.svelte); [`frontend/src/routes/targets/TargetForm.svelte`](frontend/src/routes/targets/TargetForm.svelte).
- **State**: `class="w-full px-3 py-2 border border-[var(--color-border)] rounded-md text-sm bg-[var(--color-background)]"` is repeated 71+ times. This overrides the global `input { ... }` rule that *already* uses Aurora glass styling.
- **Suggestion**: Delete the class string in all these places. The global rule kicks in and forms instantly look correct. Cross-check that `Tailwind`'s preflight isn't interfering. Spot-check one page (e.g. `users/+page.svelte`), confirm visually, then mass-delete via Grep/Edit.
#### F-CONSIST-05 — ConfirmModal duplicates Button.svelte logic [MEDIUM]
- **Files**: [`frontend/src/lib/components/ConfirmModal.svelte`](frontend/src/lib/components/ConfirmModal.svelte)
- **State**: Its `.confirm-btn-cancel` and `.confirm-btn-delete` re-implement what `Button variant="secondary"` and `Button variant="danger"` already provide. The danger button even uses raw `rgba(239,68,68,...)` instead of `--color-error-fg`.
- **Suggestion**: `<Button variant="secondary" onclick={oncancel}>{cancel}</Button>` and `<Button variant="danger" onclick={onconfirm}>{confirm}</Button>`. Removes ~35 lines of CSS.
#### F-CONSIST-06 — AuthLayout uses a different glass recipe [MEDIUM]
- **Files**: [`frontend/src/lib/components/AuthLayout.svelte`](frontend/src/lib/components/AuthLayout.svelte) (line 68 `.auth-card`)
- **State**: `border-radius: 1rem`, `padding: 2rem`, `backdrop-filter: blur(8px)` (vs the 28px elsewhere), plus its own auth-bg gradient mesh + 32px-grid background that nothing else in the app uses. Has its own `.auth-input` / `.auth-submit` / `.auth-label` / `.auth-error` design language.
- **State pt 2**: Login/setup ends up looking *more* like generic SaaS than the dashboard does. The brand orb from the sidebar isn't on the login screen — instead a small lavender mdi-lan icon in a square.
- **Suggestion**: Reuse the conic brand orb. Use the same glass recipe (28px blur, 22px radius) for `.auth-card`. Either drop the dot-grid `.auth-grid` (it reads as a generic "futuristic SaaS" template) or use it as a deliberate flair on the dashboard hero too.
---
### 3. Information hierarchy
#### F-HIER-01 — Stat cards do triple duty (KPI + nav link + filter context) without ranking [MEDIUM]
- **Files**: [`frontend/src/routes/+page.svelte`](frontend/src/routes/+page.svelte) lines 571-645
- **State**: All four stat cards have the same visual weight, same accent intensity (`STAT_ACCENTS[idx]`), and rotate accents by index. When the global provider filter is active the first stat card morphs into a "literal value" card showing provider name (1rem font, very different visual). The accent rotation creates a rainbow row that doesn't carry meaning — events `total` has no semantic reason to be orchid vs. providers being lavender.
- **Suggestion**: Tie accent color to entity type (providers=primary, trackers=mint, targets=sky, throughput=citrus) so the same accent recurs throughout the app for the same concept. Keep the morph behavior but design a distinct "filtered context" stat-card variant — a smaller, narrower chip — so it doesn't compete visually.
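
A minimal sketch of the entity-to-accent mapping suggested above. The entity kinds come from the suggestion; the CSS custom property names are assumptions (`--color-primary` in particular is not confirmed by this review), and `accentFor` is a hypothetical helper.

```typescript
// Entity kinds from the suggestion above. Token names are assumptions:
// this review names mint/sky/citrus, while --color-primary is a guess.
type EntityKind = "providers" | "trackers" | "targets" | "throughput";

const ENTITY_ACCENT: Record<EntityKind, string> = {
  providers: "var(--color-primary)",
  trackers: "var(--color-mint)",
  targets: "var(--color-sky)",
  throughput: "var(--color-citrus)",
};

// Look up the accent by what the card represents, not by its index in the row.
function accentFor(kind: EntityKind): string {
  return ENTITY_ACCENT[kind];
}
```

Keying on the entity rather than `STAT_ACCENTS[idx]` means reordering the stat cards would not change which color a concept carries.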
#### F-HIER-02 — Hero title and meter compete for attention at desktop width [LOW]
- **Files**: [`frontend/src/routes/+page.svelte`](frontend/src/routes/+page.svelte) lines 1047-1068, 1078-1086
- **State**: Both the `.hero-title` and `.hero-meter-value` are 3rem 500-weight in two different fonts. Side-by-side they create two focal points.
- **Suggestion**: Shrink `.hero-meter-value` to 2.4rem and use it as a *secondary* read; let the editorial title be the single dominant element.
#### F-HIER-03 — Pulse chart panel rarely meaningful on first launch [LOW]
- **Files**: [`frontend/src/routes/+page.svelte`](frontend/src/routes/+page.svelte) lines 909-927
- **State**: On a fresh install the chart is an empty 0-events grid taking 250-400px vertical space. No empty-state copy inside `EventChart`.
- **Suggestion**: When `chartDays` has all-zero values, replace with a small "No events recorded in the last 30 days — once a tracker fires, the pulse will appear here" inline empty state.
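
The all-zero check could be as small as the sketch below. The element shape of `chartDays` is not shown in this review, so `ChartDay` is a hypothetical type and `isChartEmpty` a hypothetical helper.

```typescript
// Hypothetical shape: the real chartDays element type is not shown here.
interface ChartDay {
  date: string;
  count: number;
}

// True when there is nothing worth plotting. An empty array also counts
// as empty, because Array.prototype.every returns true for [].
function isChartEmpty(days: readonly ChartDay[]): boolean {
  return days.every((day) => day.count === 0);
}
```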
---
### 4. Navigation & wayfinding
#### F-NAV-01 — No `aria-current="page"` on active nav links [HIGH a11y]
- **Files**: [`frontend/src/routes/+layout.svelte`](frontend/src/routes/+layout.svelte) lines 498-533, 591-597, 632-658
- **State**: Active state is conveyed via `.active` class + a gradient left-bar div. Screen readers cannot announce it. Grep for `aria-current` across the whole frontend: zero matches.
- **Suggestion**: Add `aria-current={isActive(child.href) ? 'page' : undefined}` to every nav `<a>`.
#### F-NAV-02 — No breadcrumb on subpages [MEDIUM]
- **Files**: [`frontend/src/lib/components/PageHeader.svelte`](frontend/src/lib/components/PageHeader.svelte)
- **State**: The `crumb` prop only renders a single mono-uppercase tag (e.g. "ROUTING · AUTOMATION") — it's decorative, not navigational. There's no actual breadcrumb chain. For `/template-configs`, `/command-template-configs`, `/tracking-configs`, `/command-configs`, etc., a user landing via deep link has no parent-link to return to.
- **Suggestion**: Make the crumb a real breadcrumb (≤3 levels: `Notifications → Templates` or `Commands → Configs`). Render the prior level as a clickable `<a>`.
#### F-NAV-03 — Deep linking via `?type=<targetType>` and `?tab=<botType>` doesn't update page title [LOW]
- **State**: `/targets?type=email` and `/bots?tab=matrix` change the active sidebar item but the `<PageHeader>` title for those pages is generic ("Targets" / "Bots").
- **Suggestion**: When `activeType` is set, derive the title from it: "Email targets" / "Matrix bots". Improves browser tab titles and the in-page title.
#### F-NAV-04 — Collapsed sidebar tooltip wraps for long Russian translations [LOW]
- **State**: Tooltips for collapsed sidebar nav items use the browser-native `title=` attribute, which gives no glass-style chip. They will use the OS tooltip styling, which clashes with the Aurora aesthetic and clips long ru labels.
- **Suggestion**: Build a small custom tooltip component (or use existing portal helper) for collapsed-sidebar nav. Keep `title` as fallback for `prefers-reduced-motion` users.
---
### 5. Form UX
#### F-FORM-01 — No inline field-level validation, only post-submit error banners [MEDIUM]
- **Files**: [`frontend/src/routes/providers/+page.svelte`](frontend/src/routes/providers/+page.svelte), [`frontend/src/routes/users/+page.svelte`](frontend/src/routes/users/+page.svelte), [`frontend/src/routes/targets/TargetForm.svelte`](frontend/src/routes/targets/TargetForm.svelte)
- **State**: Forms rely on HTML5 `required` / `minlength` browser validation plus a single `ErrorBanner` shown after submit failure. Native browser validation tooltips are pale and don't match Aurora.
- **Suggestion**: Add a per-field `<FieldError>` slot below labels for inline validation (URL syntax, email format, port range). The settings page already has a nice pattern (`url-field-valid` class on `IdentityCassette`) — generalize it.
#### F-FORM-02 — Save feedback inconsistent across pages [MEDIUM]
- **Files**: Settings uses a sticky `SaveBar` with dirty tracking ([`frontend/src/routes/settings/+page.svelte`](frontend/src/routes/settings/+page.svelte) lines 77-84, 208-214). Most other forms have inline Save buttons inside the card. Some show snackbar success ("snack.userCreated"), some don't.
- **Suggestion**: Standardize: (a) inline "Save" inside the card *plus* (b) snackbar success message *plus* (c) optional sticky SaveBar for multi-field admin forms. Document the pattern in `.claude/docs/frontend-architecture.md`.
#### F-FORM-03 — Forms auto-name from descriptor but offer no way to unlock it back to auto-name [LOW]
- **Files**: [`frontend/src/routes/providers/+page.svelte`](frontend/src/routes/providers/+page.svelte) lines 136-141 + 303; [`frontend/src/routes/actions/+page.svelte`](frontend/src/routes/actions/+page.svelte) lines 50-56
- **State**: Once user types in the Name field, `nameManuallyEdited` becomes true and the auto-fill stops permanently — no way to ask "go back to default name".
- **Suggestion**: Add a tiny "↺ reset" link next to the name input when `nameManuallyEdited && form.name !== descriptor.defaultName`.
#### F-FORM-04 — No optimistic UI; rows disappear / appear only after server roundtrip [LOW]
- **State**: After delete/create, pages refetch via `cache.fetch(true)`. Visible 200-400ms blank state.
- **Suggestion**: Optimistic insert/remove in the cache stores, with snackbar undo for destructive ops.
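
One possible shape for an optimistic delete with rollback, as a sketch only. The cache store API is not shown in this review, so `optimisticRemove`, `removeWithRollback`, and the `apply` / `deleteOnServer` callbacks are all hypothetical, and numeric ids are an assumption.

```typescript
interface HasId {
  id: number; // assumption: entities carry a numeric id
}

interface OptimisticResult<T> {
  items: T[];
  rollback: () => T[];
}

// Pure step: compute the list without the row, and keep a snapshot to restore.
function optimisticRemove<T extends HasId>(items: readonly T[], id: number): OptimisticResult<T> {
  const snapshot = [...items];
  return {
    items: items.filter((item) => item.id !== id),
    rollback: () => snapshot,
  };
}

// Effectful step: update the UI first, then call the server; restore on failure.
async function removeWithRollback<T extends HasId>(
  items: readonly T[],
  id: number,
  deleteOnServer: (id: number) => Promise<void>,
  apply: (next: T[]) => void,
): Promise<boolean> {
  const { items: next, rollback } = optimisticRemove(items, id);
  apply(next);
  try {
    await deleteOnServer(id);
    return true;
  } catch {
    apply(rollback());
    return false;
  }
}
```

A snackbar undo could reuse the same `rollback` snapshot, provided the server call is deferred until the undo window closes.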
#### F-FORM-05 — Login form omits `autofocus` on username [LOW]
- **Files**: [`frontend/src/routes/login/+page.svelte`](frontend/src/routes/login/+page.svelte) line 99
- **Suggestion**: Add `autofocus` to the username input. Saves one keystroke on every login.
---
### 6. Modals & overlays
#### F-MODAL-01 — Modal.svelte is well-built [LOW / commendation]
- **Files**: [`frontend/src/lib/components/Modal.svelte`](frontend/src/lib/components/Modal.svelte)
- **State**: Portal mount, focus trap, focus restoration, Escape, Tab cycling, `aria-modal="true"`, `aria-labelledby`, body scroll containment via `overscroll-behavior: contain`, transition (250ms in/out), 80vh max-height. This is the strongest single component in the codebase.
- **Verdict**: Reuse as the foundation for every overlay. Currently `BlockedByModal`, `EventDetailModal`, `SharedLinkModal`, `ConfirmModal` all do — good.
#### F-MODAL-02 — Modal backdrop has `role="button"` [LOW]
- **Files**: [`frontend/src/lib/components/Modal.svelte`](frontend/src/lib/components/Modal.svelte) line 96
- **State**: The backdrop is a `<div>` with `role="button"`, `tabindex="-1"`, and an onclick to close. That's a common pattern to silence Svelte's a11y warnings, but a screen reader announces "Close, button" twice (once for backdrop, once for the explicit X button).
- **Suggestion**: Drop `role="button"` and `aria-label` from the backdrop; the explicit Close button is enough. Or use `<button class="modal-backdrop">` instead of a div.
#### F-MODAL-03 — Modal panel uses solid `#131520` instead of glass [LOW]
- **Files**: [`frontend/src/lib/components/Modal.svelte`](frontend/src/lib/components/Modal.svelte) lines 150-151
- **State**: `--modal-solid-bg: #131520;` is a deliberate choice (probably for readability) but it breaks visual consistency with the rest of the app. The Aurora drift behind it is invisible.
- **Suggestion**: Use `var(--color-glass-elev)` over the blurred backdrop. Or, if the solid choice was deliberate, document why so the next developer doesn't "fix" it.
#### F-MODAL-04 — Confirm-modal "delete" hover uses raw rgba [MEDIUM]
- **Files**: [`frontend/src/lib/components/ConfirmModal.svelte`](frontend/src/lib/components/ConfirmModal.svelte) line 70
- **State**: `box-shadow: 0 0 16px rgba(239, 68, 68, 0.3);` — not themed.
- **Suggestion**: Use `box-shadow: 0 0 16px color-mix(in srgb, var(--color-coral) 40%, transparent);`.
---
### 7. Empty / loading / error states
#### F-STATE-01 — `Loading.svelte` is a single shimmer pattern [MEDIUM]
- **Files**: [`frontend/src/lib/components/Loading.svelte`](frontend/src/lib/components/Loading.svelte)
- **State**: Three or four 4rem shimmer bars. Used as `<Loading />` on virtually every page including hero pages. Doesn't match the actual layout the user will see — looks like a row list even on settings.
- **Suggestion**: Add layout-aware variants: `<Loading shape="hero" />`, `<Loading shape="grid" cols={4} />`, `<Loading shape="list" rows={5} />`. Reduces layout shift on first paint.
#### F-STATE-02 — `EmptyState.svelte` is plain and undifferentiated [MEDIUM]
- **Files**: [`frontend/src/lib/components/EmptyState.svelte`](frontend/src/lib/components/EmptyState.svelte)
- **State**: 10-line component: dimmed icon + message. No CTA, no illustration, no flavor. The dashboard's inline `.empty-state` (lines 1300-1319 of `+page.svelte`) is richer (has a CTA link) but isn't reused.
- **Suggestion**: Extend `EmptyState` to accept a `cta` slot and a `tone` (with subtle gradient blob behind the icon). On `/providers` empty: "No providers yet — connect Immich, Nextcloud, or Home Assistant to start tracking events" with an "+ Add provider" CTA.
#### F-STATE-03 — Many list pages have no error-recovery action [MEDIUM]
- **Files**: Throughout — most pages have a `loadError` state that renders `<Card><ErrorBanner /></Card>` but no "Retry" button.
- **Suggestion**: `ErrorBanner` should accept an `onRetry` prop and surface a retry button. Standardize across pages.
#### F-STATE-04 — `EventChart` no empty state [LOW]
- See F-HIER-03.
---
### 8. Accessibility
#### F-A11Y-01 — Snackbar has no aria-live [HIGH]
- **Files**: [`frontend/src/lib/components/Snackbar.svelte`](frontend/src/lib/components/Snackbar.svelte) lines 35-63
- **State**: Snack container is a plain `<div use:portal>`. Success / error toasts never reach screen readers. Three other files have proper aria-live; this critical one doesn't.
- **Fix**: `<div use:portal class="snackbar-container" role="status" aria-live="polite" aria-label={t('snackbar.region')}>`. Use `aria-live="assertive"` for `snack.type === 'error'`.
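
A sketch of how the per-toast attributes might be derived. Only `snack.type === 'error'` is confirmed by this review; the other kind names are assumptions inferred from the four status colors, and `liveRegionAttrs` is a hypothetical helper.

```typescript
// Only "error" is confirmed by the review; the other kinds are assumed.
type SnackKind = "success" | "error" | "info" | "warning";

interface LiveRegionAttrs {
  role: "status" | "alert";
  "aria-live": "polite" | "assertive";
}

// Errors interrupt the screen reader; everything else waits its turn.
function liveRegionAttrs(kind: SnackKind): LiveRegionAttrs {
  return kind === "error"
    ? { role: "alert", "aria-live": "assertive" }
    : { role: "status", "aria-live": "polite" };
}
```

In a Svelte template the result could be spread onto each toast element, for example `{...liveRegionAttrs(snack.type)}`.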
#### F-A11Y-02 — No `aria-current="page"` on nav links [HIGH]
- See F-NAV-01.
#### F-A11Y-03 — Custom focus outlines partially overridden [MEDIUM]
- **Files**: [`frontend/src/app.css`](frontend/src/app.css) lines 237-241 (global `button:focus-visible` outline 2px primary + offset 2px), [`frontend/src/routes/+layout.svelte`](frontend/src/routes/+layout.svelte) line 894 (`.nav-link { border-radius: 12px !important }`), [`frontend/src/routes/+page.svelte`](frontend/src/routes/+page.svelte) lines 1351-1354 (`.signal-row--clickable:focus-visible { outline-offset: -2px }`).
- **State**: Inverted offset `-2px` makes the focus ring sit *inside* the row, which against the glass-strong hover-bg ends up nearly invisible at certain accent positions.
- **Suggestion**: Use `outline-offset: 2px` consistently, with a `box-shadow: 0 0 0 2px var(--color-glass)` ring if needed for contrast.
#### F-A11Y-04 — `prefers-reduced-motion` is honored — commendation [LOW]
- **Files**: [`frontend/src/app.css`](frontend/src/app.css) lines 484-507, [`frontend/src/routes/+layout.svelte`](frontend/src/routes/+layout.svelte) lines 837-840
- **State**: Aurora drift, brand-version pulse, stagger entrances, signal-row hover transitions, paginator transitions all gated. Smooth scroll override too. Solid implementation.
#### F-A11Y-05 — Color contrast risk on glass surfaces [MEDIUM]
- **State**: `--color-muted-foreground: #b6b2d4` on `--color-glass: rgba(255,255,255,0.04)` over the aurora gradient. In the brightest hot-spot of the aurora background (where the `#b8a7ff` lavender peaks), `#b6b2d4` may fail WCAG AA (4.5:1 for body text). Hasn't been measured.
- **Suggestion**: Run a contrast pass with `--color-muted-foreground` against the brightest part of the aurora background. Likely need to bump it to ~`#cfcae8` for dark theme.
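
The contrast pass can be scripted with the WCAG 2.x relative-luminance formula. The sketch below compares two opaque `#rrggbb` colors; treating the aurora peak as opaque `#b8a7ff` is a pessimistic assumption, since the real backdrop is a translucent gradient blended with the page background. The function names are illustrative.

```typescript
// Convert one 8-bit sRGB channel to linear light (sRGB transfer function).
function channelToLinear(value8: number): number {
  const c = value8 / 255;
  return c <= 0.04045 ? c / 12.92 : Math.pow((c + 0.055) / 1.055, 2.4);
}

// WCAG relative luminance of an opaque #rrggbb color.
function relativeLuminance(hex: string): number {
  if (!/^#[0-9a-fA-F]{6}$/.test(hex)) {
    throw new Error(`expected #rrggbb, got ${hex}`);
  }
  const r = channelToLinear(parseInt(hex.slice(1, 3), 16));
  const g = channelToLinear(parseInt(hex.slice(3, 5), 16));
  const b = channelToLinear(parseInt(hex.slice(5, 7), 16));
  return 0.2126 * r + 0.7152 * g + 0.0722 * b;
}

// WCAG contrast ratio: (lighter + 0.05) / (darker + 0.05), from 1 to 21.
function contrastRatio(foreground: string, background: string): number {
  const a = relativeLuminance(foreground);
  const b = relativeLuminance(background);
  return (Math.max(a, b) + 0.05) / (Math.min(a, b) + 0.05);
}

const WCAG_AA_BODY_TEXT = 4.5;

function passesAaBodyText(foreground: string, background: string): boolean {
  return contrastRatio(foreground, background) >= WCAG_AA_BODY_TEXT;
}
```

Under the opaque-peak assumption, `passesAaBodyText("#b6b2d4", "#b8a7ff")` is false, because both colors are light lavenders of similar luminance. A realistic measurement would first composite the aurora gradient and the glass layer over the base background, then feed the resulting color into `contrastRatio`.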
#### F-A11Y-06 — Toggle switch has no label association [LOW]
- **Files**: [`frontend/src/app.css`](frontend/src/app.css) lines 513-556
- **State**: `.toggle-switch` wraps an `<input type="checkbox">` and a visual `.toggle-track` `<span>`. There's no visible label text or `aria-label` requirement in the global utility. Callers may forget to pass one.
- **Suggestion**: Lift into a `<Toggle>` component requiring a `label` prop.
---
### 9. Responsive design
#### F-RESP-01 — Sidebar collapse breakpoint is fine; mobile bottom nav covers gracefully [LOW / commendation]
- **Files**: [`frontend/src/routes/+layout.svelte`](frontend/src/routes/+layout.svelte) lines 589-668, 1136-1168
- **State**: Below 767px the desktop sidebar hides and mobile bottom-nav appears with primary 4 keys + search + more. Mobile "More" panel mirrors the full desktop tree. Solid.
#### F-RESP-02 — Hero meter wraps awkwardly between 720-880px [LOW]
- **Files**: [`frontend/src/routes/+page.svelte`](frontend/src/routes/+page.svelte) lines 1119-1130
- **State**: Below 880px the hero collapses to one column, but the meter row pills wrap to a third row on Russian translations of "providers/targets/armed".
- **Suggestion**: Add an intermediate breakpoint (`max-width: 1024px`) where pill labels switch from `"5 providers"` to a tooltip-only count.
#### F-RESP-03 — Stat-card grid stays 2-wide across the whole tablet range [MEDIUM]
- **Files**: [`frontend/src/routes/+page.svelte`](frontend/src/routes/+page.svelte) line 590 `grid-cols-1 sm:grid-cols-2 lg:grid-cols-4`
- **State**: Between 640-1024px stat cards are 2-wide. At tablet sizes the cards become huge and dilute the dashboard density.
- **Suggestion**: Cap stat-card max-width at ~300px or switch to `auto-fit, minmax(200px, 1fr)` so they don't grow uncontrollably.
#### F-RESP-04 — List rows don't gracefully truncate webhook URLs on mobile [LOW]
- **Files**: [`frontend/src/routes/providers/+page.svelte`](frontend/src/routes/providers/+page.svelte) lines 392410
- **State**: Secondary text line shows full webhook URL with `break-all` which on very narrow viewports gives a 4-line wrap.
- **Suggestion**: Use the `shortenUrl()` helper (already defined for the meta-tile path) on the narrow-screen secondary line too.
---
### 10. Onboarding
#### F-ONBOARD-01 — Setup → empty dashboard with no guidance [HIGH]
- **Files**: [`frontend/src/routes/setup/+page.svelte`](frontend/src/routes/setup/+page.svelte), [`frontend/src/routes/+page.svelte`](frontend/src/routes/+page.svelte)
- **State**: After `/setup` the user lands on `/` with 0 providers, hero says *"all clear"* (literally "Nothing to do"). Wasted first impression.
- **Suggestion**: First-run detection (`providersCache.items.length === 0 && targetsCache.items.length === 0`) replaces the dashboard hero with a 3-4 step "Getting started" checklist: (1) Add a provider · (2) Connect a bot · (3) Create a target · (4) Wire your first tracker. Each step is a CTA card. Persist completion to localStorage so it disappears once finished.
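
A sketch of the first-run check and the checklist model described above. The step labels and the first-run condition come from the suggestion; `SetupCounts`, the step ids, and the helper names are hypothetical, and the hrefs assume the route names mentioned elsewhere in this review.

```typescript
// Hypothetical input: item counts read from the existing cache stores.
interface SetupCounts {
  providers: number;
  bots: number;
  targets: number;
  trackers: number;
}

interface GettingStartedStep {
  id: string;
  label: string;
  href: string; // assumed routes
  done: boolean;
}

// Same condition as the suggestion: no providers and no targets yet.
function isFirstRun(counts: SetupCounts): boolean {
  return counts.providers === 0 && counts.targets === 0;
}

function gettingStartedSteps(counts: SetupCounts): GettingStartedStep[] {
  return [
    { id: "provider", label: "Add a provider", href: "/providers", done: counts.providers > 0 },
    { id: "bot", label: "Connect a bot", href: "/bots", done: counts.bots > 0 },
    { id: "target", label: "Create a target", href: "/targets", done: counts.targets > 0 },
    { id: "tracker", label: "Wire your first tracker", href: "/notification-trackers", done: counts.trackers > 0 },
  ];
}

// Once every step is done the checklist can be dismissed for good.
function allStepsDone(steps: readonly GettingStartedStep[]): boolean {
  return steps.every((step) => step.done);
}
```

Deriving `done` from live counts keeps the checklist honest even if localStorage is cleared; the persisted flag would then only record that the user has dismissed it.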
#### F-ONBOARD-02 — No in-app discovery of ⌘K palette [MEDIUM]
- **Files**: [`frontend/src/routes/+layout.svelte`](frontend/src/routes/+layout.svelte) lines 678-682
- **State**: Topbar shows `⌘K` / `Ctrl K` chip but only that. No "Press ⌘K to jump to any page" hint anywhere.
- **Suggestion**: First-visit toast: "Tip: Press ⌘K from anywhere to search providers, trackers, and pages". Dismissible.
#### F-ONBOARD-03 — Login screen has no help / forgot-password / docs link [LOW]
- **Files**: [`frontend/src/routes/login/+page.svelte`](frontend/src/routes/login/+page.svelte)
- **State**: Plain username + password. For self-hosted users who lost the admin password, there's no link to the recovery docs.
- **Suggestion**: Small "Need help?" link to docs (the `/docs` route exists).
---
### 11. Microcopy
#### F-COPY-01 — Dashboard hero copy is editorial — commendation [LOW]
- "Live · throughput 24h · armed · providers" reads more like a control-room dashboard than CRUD admin. Keep doing this on the rest of the app.
#### F-COPY-02 — Many subpages use literal entity-name copy [MEDIUM]
- E.g. "Add provider" / "Add target" / "Add tracker" / "Add user". Editorial would be "Connect a provider" / "Define a target" / "Wire a tracker" / "Invite a user". Lean into verbs that match the dashboard's "wires / signals / on watch" vocabulary.
#### F-COPY-03 — Russian translations match en line-count but no length QA visible [LOW]
- Line counts match exactly (1577 lines each), but that is structural parity, not visual parity. Russian tends to run 20-30% longer for the same concept, so the flagged places (hero title em, stat-card values, sidebar nav labels) likely have layout issues.
- **Suggestion**: Set up a Playwright snapshot test that switches locale=ru and screenshots dashboard + a representative list page to catch overflow visually.
---
### 12. Localization parity
#### F-LOCALE-01 — "Notify Bridge" wordmark stays in English [LOW / correct]
- Brand. Don't translate.
#### F-LOCALE-02 — Provider type label not localized in list rows [LOW]
- **Files**: [`frontend/src/routes/providers/+page.svelte`](frontend/src/routes/providers/+page.svelte) line 391
- **State**: Type pill shows raw `provider.type` value (e.g. "immich", "nextcloud") — not localized.
- **Suggestion**: Use `getDescriptor(type).defaultName` or ``t(`providers.type${PascalName}`)``, which exists per project conventions.
#### F-LOCALE-03 — Mixed Cyrillic glitches in source [LOW]
- **Files**: [`frontend/src/routes/login/+page.svelte`](frontend/src/routes/login/+page.svelte) line 42 (`вЂ"` instead of em-dash in a comment), [`frontend/src/routes/users/+page.svelte`](frontend/src/routes/users/+page.svelte) line 166 (`В·` instead of `·`)
- **State**: Encoding-corrupt characters in source comments and one user-facing dot. Pre-existing — files were probably edited with the wrong encoding at some point.
- **Suggestion**: Grep `вЂ` / `В·` across the repo and fix. Add a pre-commit hook that fails on these mojibake sequences in `.svelte` / `.ts` / `.json` (the corrupted text is itself valid UTF-8, so a plain encoding check would not catch it).
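
A hook along these lines could scan file contents for the two sequences named above. This is a sketch: `findMojibake` is a hypothetical helper, the pattern list covers only the glitches reported in this finding, and wiring it into a pre-commit runner is left out.

```typescript
// Sequences that appear when UTF-8 bytes are decoded as Windows-1251.
// Written as escapes so this file does not trip its own check:
// "\u0432\u0402" is the prefix of a mis-decoded dash or quote,
// "\u0412\u00b7" is a mis-decoded middle dot.
const MOJIBAKE_PATTERNS: readonly string[] = ["\u0432\u0402", "\u0412\u00b7"];

interface MojibakeHit {
  line: number;   // 1-based
  column: number; // 1-based
  pattern: string;
}

// Report every occurrence of a known mojibake sequence in the given text.
function findMojibake(text: string): MojibakeHit[] {
  const hits: MojibakeHit[] = [];
  text.split("\n").forEach((lineText, index) => {
    for (const pattern of MOJIBAKE_PATTERNS) {
      let from = lineText.indexOf(pattern);
      while (from !== -1) {
        hits.push({ line: index + 1, column: from + 1, pattern });
        from = lineText.indexOf(pattern, from + pattern.length);
      }
    }
  });
  return hits;
}
```

Neither sequence should occur in legitimate Russian copy, so `ru.json` can be scanned with the same patterns.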
---
### 13. Power-user features
#### F-POWER-01 — No sortable columns anywhere [MEDIUM]
- Confirmed by Grep: no `aria-sort` / `sortable` / `onSort` in the codebase. Lists are sorted by `IconGridSelect` widget (newest / oldest / name).
- **Suggestion**: For long lists (trackers, targets), add column-header sort affordance. Even minimal: clicking the "Name" or "Provider" header re-sorts. Use cache state so sort persists across nav.
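
A minimal sketch of column-header sorting, assuming string-valued columns. `SortState`, `toggleSort`, `sortRows`, and `ariaSort` are hypothetical names; persisting the state in the cache stores is not shown.

```typescript
type SortDir = "asc" | "desc";

interface SortState<K extends string> {
  key: K;
  dir: SortDir;
}

// Clicking the active column flips direction; clicking another starts ascending.
function toggleSort<K extends string>(state: SortState<K>, clicked: K): SortState<K> {
  if (state.key !== clicked) return { key: clicked, dir: "asc" };
  return { key: clicked, dir: state.dir === "asc" ? "desc" : "asc" };
}

// Returns a sorted copy; the input array is left untouched.
function sortRows<K extends string, T extends Record<K, string>>(
  rows: readonly T[],
  state: SortState<K>,
): T[] {
  const sign = state.dir === "asc" ? 1 : -1;
  return [...rows].sort((a, b) => sign * a[state.key].localeCompare(b[state.key]));
}

// Value for the aria-sort attribute on each column header.
function ariaSort<K extends string>(
  state: SortState<K>,
  column: K,
): "ascending" | "descending" | "none" {
  if (state.key !== column) return "none";
  return state.dir === "asc" ? "ascending" : "descending";
}
```

Emitting `aria-sort` from the same state object keeps the visual sort indicator and the accessible one from drifting apart.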
#### F-POWER-02 — No multi-select bulk actions [MEDIUM]
- Grep for `bulkAction` / `selectAll`: only the locale files contain those strings (likely as i18n keys that are never used). No checkbox UI.
- **Suggestion**: Add a checkbox column on `targets`, `notification-trackers`, `command-trackers`, `actions` pages. Bulk-enable / bulk-delete are the obvious ones.
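
The selection state behind a checkbox column could be modeled as below. This is a sketch under assumptions: numeric ids, and hypothetical helper names (`toggleSelected`, `selectAllIds`, `headerCheckboxState`).

```typescript
// Toggle one row; returns a new Set so reactive stores see a fresh reference.
function toggleSelected(selected: ReadonlySet<number>, id: number): Set<number> {
  const next = new Set(selected);
  if (!next.delete(id)) next.add(id);
  return next;
}

// Select every currently visible row.
function selectAllIds(visibleIds: readonly number[]): Set<number> {
  return new Set(visibleIds);
}

type HeaderCheckboxState = "none" | "some" | "all";

// Drives the header checkbox: unchecked, indeterminate, or checked.
function headerCheckboxState(
  selected: ReadonlySet<number>,
  visibleIds: readonly number[],
): HeaderCheckboxState {
  const hits = visibleIds.filter((id) => selected.has(id)).length;
  if (hits === 0) return "none";
  return hits === visibleIds.length ? "all" : "some";
}
```

Computing the header state from visible ids only means an active filter never reports "all" for rows the user cannot see.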
#### F-POWER-03 — ⌘K palette is the strongest power feature, under-promoted [MEDIUM]
- See F-ONBOARD-02.
#### F-POWER-04 — Sidebar group expand/collapse is persisted but no "expand all / collapse all" [LOW]
- **Files**: [`frontend/src/routes/+layout.svelte`](frontend/src/routes/+layout.svelte) lines 263-269
- **Suggestion**: Add a right-click menu on a group header, or a tiny "collapse all" icon at the bottom of the nav rail.
#### F-POWER-05 — No keyboard shortcuts beyond ⌘K [LOW]
- **Suggestion**: `n` for new, `g + p` for "go providers", `g + t` for trackers, `?` to show shortcut sheet. Document in the palette.
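
Two-key sequences such as `g` then `p` need a small state machine. The sketch below uses the bindings from the suggestion; the action names and `handleShortcutKey` are hypothetical. A real handler would also need to ignore keystrokes while focus is inside an input or textarea, which is not shown.

```typescript
type ShortcutAction = "new" | "go-providers" | "go-trackers" | "show-help";

// Keys in a sequence are joined with a space: "g p" means g, then p.
const SHORTCUTS: ReadonlyMap<string, ShortcutAction> = new Map<string, ShortcutAction>([
  ["n", "new"],
  ["g p", "go-providers"],
  ["g t", "go-trackers"],
  ["?", "show-help"],
]);

interface ShortcutStep {
  pending: string | null; // prefix typed so far, or null
  action: ShortcutAction | null;
}

// Feed one key at a time, passing back the previous step's `pending` value.
function handleShortcutKey(pending: string | null, key: string): ShortcutStep {
  const combo = pending === null ? key : `${pending} ${key}`;
  const action = SHORTCUTS.get(combo);
  if (action !== undefined) return { pending: null, action };
  // Keep waiting only if some binding starts with what has been typed so far.
  const isPrefix = Array.from(SHORTCUTS.keys()).some((k) => k.startsWith(`${combo} `));
  // A key that breaks a sequence is dropped rather than re-interpreted.
  return { pending: isPrefix ? combo : null, action: null };
}
```

Keeping the bindings in one map also gives the `?` shortcut sheet and the palette a single source to render from.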
---
## Production polish checklist (top 15, prioritized)
1. **[HIGH]** Add `role="status" aria-live="polite"` to Snackbar container; `assertive` for error toasts. (F-A11Y-01) — one-line fix.
2. **[HIGH]** Add `aria-current="page"` to every nav link in `+layout.svelte`. (F-NAV-01, F-A11Y-02)
3. **[HIGH]** Mass-replace the legacy form-input class (`border border-[var(--color-border)] rounded-md text-sm bg-[var(--color-background)]`) with nothing — let the global `input { ... }` style win. 17 files, ~71 occurrences. (F-CONSIST-04)
4. **[HIGH]** Replace hardcoded hex colors (`#059669`, `#ef4444`, `#3b82f6`, `#f59e0b`, `rgba(239,68,68,...)`) with Aurora palette tokens in `Snackbar.svelte`, `ConfirmModal.svelte`, `actions/+page.svelte`, and any remaining sites. (F-CONSIST-03)
5. **[HIGH]** First-run onboarding: when `providersCache.items.length === 0`, replace dashboard hero with a 4-step "Getting started" checklist. (F-ONBOARD-01)
6. **[HIGH]** Consolidate the 5 glass-card abstractions into a single `<GlassPanel variant=...>` component; delete redundant `::after` overlays. (F-CONSIST-01)
7. **[HIGH]** Introduce a radius scale (`--radius-card / panel / pill / input / chip / tile`) and refactor `Card.svelte`, `Button.svelte`, `Modal.svelte`, `ConfirmModal.svelte` to use it. (F-CONSIST-02)
8. **[MEDIUM]** Rewrite `ConfirmModal.svelte` to use `<Button variant="secondary">` and `<Button variant="danger">` instead of its own buttons. (F-CONSIST-05)
9. **[MEDIUM]** Add layout-aware `<Loading shape="hero|grid|list">` variants to reduce first-paint layout shift. (F-STATE-01)
10. **[MEDIUM]** Extend `<EmptyState>` with `cta` slot and provider-/tracker-/target-specific copy + a contextual CTA. (F-STATE-02)
11. **[MEDIUM]** Visual length-QA pass for Russian — at least dashboard hero, providers list, settings hero, stat-cards. Playwright screenshot test. (F-COPY-03, F-LOCALE-02)
12. **[MEDIUM]** Implement column-header sort on `notification-trackers`, `targets`, `actions`. Persist in cache state. (F-POWER-01)
13. **[MEDIUM]** Add multi-select bulk actions (enable/disable, delete) to `targets`, `notification-trackers`, `command-trackers`. (F-POWER-02)
14. **[MEDIUM]** Audit contrast: `--color-muted-foreground` over brightest aurora peak; likely bump dark-theme value from `#b6b2d4` to ~`#cfcae8`. (F-A11Y-05)
15. **[MEDIUM]** Replace inline browser-native `title=` tooltips on the collapsed sidebar with a custom Aurora-styled tooltip (using the existing portal helper). (F-NAV-04)
### Quick wins (bonus, under an hour each)
- Add `autofocus` to the username input on `/login`. (F-FORM-05)
- Fix `вЂ"` / `В·` Cyrillic encoding glitches in `login/+page.svelte` and `users/+page.svelte`. (F-LOCALE-03)
- Drop `role="button"` from Modal backdrop. (F-MODAL-02)
- Replace `provider.type` raw label in provider list rows with localized descriptor name. (F-LOCALE-02)
- Add inline empty-state copy to `EventChart` when all `chartDays` values are 0. (F-HIER-03)
---
## What's working — keep doing it
- The conic-gradient brand orb, animated aurora background, Newsreader italic emphasis, gradient pill chips, glow-pulse dots — distinctive identity.
- `Modal.svelte` (focus trap, restore, portal, escape, scroll containment).
- `prefers-reduced-motion` honored across every animation surface.
- Global ⌘K search palette, global provider filter, persisted sidebar state, persisted nav-group expansion.
- Editorial copy on dashboard (`signal stream`, `on watch`, `pulse`, `wires`, `compose`).
- Snackbar with detail-toggle expansion for error context.
- Mobile "More" panel that mirrors the full desktop nav tree.
- 6-file template-variable sync rule honored by project conventions.
- `i18n` parity at 1577 lines for both locales.
End of review.