notify-bridge/.claude/reviews/backend-review.md

# Backend Production-Readiness Review

Scope: packages/server/src/notify_bridge_server/ and packages/core/src/notify_bridge_core/ (~44k LOC, Python 3.11, FastAPI + SQLModel async + APScheduler + aiohttp).

## Executive Summary

- **Overall quality is high.** The Jinja2 sandbox is consistently applied (every Environment instantiation is SandboxedEnvironment), JWT auth uses bcrypt offloaded to a worker thread, SSRF guard exists with DNS-rebinding mitigation, secrets are masked in logs via a dedicated filter, and most async/SQL patterns show production-aware design (per-tracker sessions, batched IN-queries, partial unique indexes).
- **Top correctness risk: a fire-and-forget asyncio.create_task in ha_subscription._on_status_change** (no reference stored, GC can drop the task) plus thread-unsafe in-memory counters in bridge_self. Both bite on chatty HA installs.
- **Module-level dict caches shared across the event loop have small read-modify-write windows** in services/scheduler.py (adaptive state), services/bridge_self.py (failure counters), commands/handler.py (TTLCache rate limits), and command_sync._dirty_bots. Currently functional under low concurrency; risky under load.
- **Very large hot-path functions** — services/watcher.py:check_tracker (381 lines), services/dispatch_helpers.py:load_link_data (208 lines), the 1880-line database/migrations.py, and the 1365-line services/scheduler.py — concentrate too much logic in one place.
- **Provider-type hardcoding** persists in api/providers.py, `services/__init__.py`, services/action_runner.py, and services/manual_dispatch.py (chains of `if provider.type == "immich"`). The watcher's _POLL_FACTORIES registry is the right model; extend it.
- **Webhook handlers read the request body BEFORE authenticating** in the Gitea and generic-webhook routes. The Planka route gets it right. Net impact: a peer that knows the URL but not the secret can drive a 1 MiB read per request.
- **autoescape is inconsistent**: True for runtime templates (renderer.py, commands/handler.py), False for preview / sample-context renders in api/template_configs.py, api/slot_helpers.py, and services/notifier.send_test_template_notification. Lower risk (admin-authored input) but mismatch invites surprise.

---

## CRITICAL

### [C-1] _on_status_change schedules an unstored task (GC + drop risk)

File: [packages/server/src/notify_bridge_server/services/ha_subscription.py:240-260](../../packages/server/src/notify_bridge_server/services/ha_subscription.py#L240)

The task created by asyncio.create_task(_record_ha_status(...)) at line 249 is not held anywhere. The event loop keeps only weak references to tasks, so a task whose create_task return value is discarded can be garbage-collected before it completes (the Python docs explicitly warn: save a reference to the result). Result: an HA disconnect/reconnect EventLog row silently disappears under memory pressure.

**Fix:** Keep a module-level set[asyncio.Task], add each new task to it, and remove it again via task.add_done_callback. ha_subscription.start_all already does this correctly (lines 315-320), so the pattern is already in-house.

### [C-2] Telegram-webhook handler returns 200 OK on uncommitted writes

File: [packages/server/src/notify_bridge_server/commands/webhook.py:130-169](../../packages/server/src/notify_bridge_server/commands/webhook.py#L130)

The catch-all at line 162 swallows handle_command exceptions and returns OK to Telegram. By then the request has already called await session.commit() at line 96 (after save_chat_from_webhook), and subsequent writes via the dispatcher use new sessions inside the command path. If one of those downstream sessions partially commits before handle_command raises, nothing undoes the partial write: the get_session dependency does not roll back automatically, because its context manager only closes the session.

**Fix:** Either call `await session.rollback()` explicitly in the except block, or wrap the per-request mutations in `async with session.begin():` so that the transaction is rolled back when an exception propagates.

### [C-3] Gitea/generic webhook reads body BEFORE verifying secret is configured

File: [packages/server/src/notify_bridge_server/api/webhooks.py:167-178](../../packages/server/src/notify_bridge_server/api/webhooks.py#L167) and lines 449-454

The sequence is: read up to 1 MiB of raw_body, then check whether webhook_secret is empty. A peer that learned the URL but has no secret drives a 1 MiB body read per request. Planka's handler at line 232+ validates the bearer token BEFORE the body read, which is the correct pattern.

**Fix:** Hoist the `if not webhook_secret` check (Gitea) and the `if auth_mode == "none"` short-circuit (generic) above _read_bounded_body. The Gitea HMAC still needs the body, but rejecting a request for missing server-side configuration first costs nothing.
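A framework-free sketch of the intended ordering. Everything here is illustrative: `read_verified_body`, `WebhookRejected`, the status codes, and the SHA-256 hex signature format are assumptions, and `read_body` stands in for _read_bounded_body.

```python
import asyncio
import hashlib
import hmac
from typing import Awaitable, Callable


class WebhookRejected(Exception):
    """Aborts the request; status is the HTTP code to return."""

    def __init__(self, status: int, reason: str) -> None:
        super().__init__(reason)
        self.status = status


async def read_verified_body(
    webhook_secret: str,
    signature_hex: str,
    read_body: Callable[[], Awaitable[bytes]],
) -> bytes:
    # 1. Configuration check first: no I/O has happened yet.
    if not webhook_secret:
        raise WebhookRejected(503, "webhook secret not configured")
    # 2. Only now pay for the bounded body read; the HMAC needs the bytes.
    raw_body = await read_body()
    # 3. Constant-time comparison of expected and supplied signatures.
    expected = hmac.new(webhook_secret.encode(), raw_body, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected.encode(), signature_hex.encode()):
        raise WebhookRejected(401, "invalid signature")
    return raw_body


async def demo(secret: str) -> tuple:
    """Return (status, number_of_body_reads) for a request signed with 's3cret'."""
    reads = 0
    body = b'{"ok": true}'

    async def read_body() -> bytes:
        nonlocal reads
        reads += 1
        return body

    signature = hmac.new(b"s3cret", body, hashlib.sha256).hexdigest()
    try:
        await read_verified_body(secret, signature, read_body)
    except WebhookRejected as exc:
        return exc.status, reads
    return 200, reads
```

With an empty secret the demo returns before `read_body` is ever called, which is the property the finding asks for.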

### [C-4] bridge_self in-memory counters are not async-safe

File: [packages/server/src/notify_bridge_server/services/bridge_self.py:186-230](../../packages/server/src/notify_bridge_server/services/bridge_self.py#L186)

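record_poll_failure does _poll_failure_counts[tracker_id] = _poll_failure_counts.get(tracker_id, 0) + 1. These dicts are accessed concurrently from the poll loop, HA push, webhook ingest, and dispatcher target-failure recording. Individual dict ops are atomic, and on a single event loop this one-line increment cannot be interleaved because it contains no await. The get + 1 + set sequence becomes racy as soon as an await separates the read from the write (for example, if the reset-and-re-arm logic awaits between checking and updating) or a caller runs on another thread, such as a scheduler executor. Symptoms: missed threshold crossings, occasional double-emission. Same pattern in _target_failure_counts and _backlog_above_threshold.

**Fix:** Wrap mutating ops in an asyncio.Lock (or a threading.Lock if any caller runs off the event-loop thread, since asyncio.Lock is not thread-safe). The reset-and-re-arm semantics already assume serial access, so make that explicit.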
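One way the locked counter could look, as a sketch only. `FailureCounter` and its methods are hypothetical names, and the lock used here serialises coroutines on a single event loop; callers on other threads would need a threading.Lock instead.

```python
import asyncio


class FailureCounter:
    """Per-key failure counts with a serialised read-modify-write."""

    def __init__(self, threshold: int) -> None:
        self._threshold = threshold
        self._counts: dict = {}
        self._lock = asyncio.Lock()

    async def record_failure(self, tracker_id: int) -> bool:
        """Increment; return True only on the call that crosses the threshold."""
        async with self._lock:
            count = self._counts.get(tracker_id, 0) + 1
            self._counts[tracker_id] = count
            return count == self._threshold

    async def reset(self, tracker_id: int) -> None:
        """Re-arm the key after a success."""
        async with self._lock:
            self._counts.pop(tracker_id, None)


async def demo() -> int:
    counter = FailureCounter(threshold=3)
    crossed = await asyncio.gather(
        *(counter.record_failure(42) for _ in range(10))
    )
    # Number of calls that reported a threshold crossing.
    return sum(crossed)
```

Returning the crossing flag from inside the lock is what prevents double-emission: exactly one caller observes `count == threshold`.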

### [C-5] PROVIDER_SECRET_FIELDS audit needed for backup exports

File: [packages/server/src/notify_bridge_server/api/providers.py:617-625](../../packages/server/src/notify_bridge_server/api/providers.py#L617) and [services/backup_service.py:84-93](../../packages/server/src/notify_bridge_server/services/backup_service.py#L84)

_apply_secrets_provider redacts only fields named in PROVIDER_SECRET_FIELDS. The webhook flow uses a field called webhook_secret (Gitea, Planka, generic) — verify this is in PROVIDER_SECRET_FIELDS (defined in backup_schema.py). A backup export with secrets_mode=INCLUDE that misses webhook_secret leaks a token that grants webhook-forgery rights.

**Action:** Audit PROVIDER_SECRET_FIELDS. Specifically check it includes: api_key, api_token, access_token, webhook_secret, password, client_secret, refresh_token. The _provider_response mask list at api/providers.py:620 is a good cross-reference — both should be the same constant.

---

## HIGH

### [H-1] _compile_template lru_cache competes across tenants

File: [packages/server/src/notify_bridge_server/commands/handler.py:99-103](../../packages/server/src/notify_bridge_server/commands/handler.py#L99)

lru_cache(maxsize=256) is keyed by the raw template string. Because the key is the string itself, an edited template compiles under a new key while the stale entry lingers until evicted; nothing invalidates it on template edit. On a multi-tenant install, one tenant's 256 distinct templates can evict another's.

**Fix:** Drop the cache (Jinja compile is sub-ms) OR add an invalidation call from the template-edit endpoints. The notification renderer (renderer.py:31) uses 512 slots and has the same problem; apply the same fix to both.

### [H-2] check_tracker is 381 lines with deep coupling

File: [packages/server/src/notify_bridge_server/services/watcher.py:263-644](../../packages/server/src/notify_bridge_server/services/watcher.py#L263)

Loads the tracker, polls, writes state, persists EventLog, evaluates gates, defers, dispatches, and records bridge_self, all in one function. Refactor candidates: _poll_phase, _persist_state_and_events, _dispatch_phase. This is the watcher's hot path; bugs here affect every tracker tick.

### [H-3] load_link_data returns untyped dict[str, Any]

File: [packages/server/src/notify_bridge_server/services/dispatch_helpers.py:539-747](../../packages/server/src/notify_bridge_server/services/dispatch_helpers.py#L539)

Five call sites consume ld["target_type"], ld.get("link_id"), etc., with no static guarantee against key typos.

**Fix:** Introduce a frozen dataclass, LinkData (declared with @dataclass(frozen=True)). Do the same for per-receiver entries.
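A sketch of what the typed replacement might look like. Only `target_type` and `link_id` come from the call sites quoted above; the `receivers` field and its type are hypothetical placeholders.

```python
from dataclasses import FrozenInstanceError, dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class LinkData:
    """Immutable, typed replacement for the dict[str, Any] payload."""

    target_type: str
    link_id: Optional[int] = None
    # Hypothetical field: per-receiver entries would get their own dataclass.
    receivers: Tuple[str, ...] = ()


ld = LinkData(target_type="telegram", link_id=7)
```

A misspelled attribute such as `ld.target_typ` is now flagged by a type checker and raises AttributeError at runtime, and assigning to a field raises FrozenInstanceError.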

### [H-4] N+1 in _resolve_command_context template-slot loop

File: [packages/server/src/notify_bridge_server/commands/handler.py:200-215](../../packages/server/src/notify_bridge_server/commands/handler.py#L200)

One SELECT per distinct command_template_config_id. Already batched for trackers/configs/providers — finish the job. Single WHERE config_id IN (...) query + Python pivot.

### [H-5] N+1 in backup_service.export_backup receiver loop

File: [packages/server/src/notify_bridge_server/services/backup_service.py:187-189](../../packages/server/src/notify_bridge_server/services/backup_service.py#L187)

50 targets = 51 SELECTs. Batch with WHERE target_id IN (...). Audit other sections of this 941-line file for the same pattern (templates -> slots, command configs -> slots).
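The batching can be sketched with the standard-library sqlite3 module. The `receiver` table, its columns, and `load_receivers_batched` are hypothetical; the real code would issue the equivalent SQLModel query.

```python
import sqlite3


def load_receivers_batched(conn: sqlite3.Connection, target_ids: list) -> dict:
    """Fetch receivers for all targets in one query, grouped by target_id."""
    by_target: dict = {tid: [] for tid in target_ids}
    if not target_ids:
        return by_target
    # One placeholder per id keeps the query parameterised.
    placeholders = ",".join("?" * len(target_ids))
    rows = conn.execute(
        "SELECT target_id, address FROM receiver "
        f"WHERE target_id IN ({placeholders}) ORDER BY id",
        target_ids,
    )
    # Pivot the flat result set in Python.
    for target_id, address in rows:
        by_target[target_id].append(address)
    return by_target
```

Targets with no receivers still appear in the result with an empty list, so callers do not need a second lookup. Very large id lists may need chunking to stay under the database's bound-parameter limit.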

### [H-6] _dirty_bots mutated from request and scheduler without a lock

File: [packages/server/src/notify_bridge_server/services/command_sync.py:25-95](../../packages/server/src/notify_bridge_server/services/command_sync.py#L25)

mark_bot_dirty runs in request handlers, _flush_dirty_bots on the scheduler executor. Currently safe (snapshot via ready = [...]) but fragile.

**Fix:** Snapshot under lock, or move to a thread-safe primitive.

### [H-7] HA reconnect cycle has no way for CRUD to short-circuit a stale supervisor

File: [packages/server/src/notify_bridge_server/services/ha_subscription.py:163-175](../../packages/server/src/notify_bridge_server/services/ha_subscription.py#L163)

Because configuration is reloaded only on reconnect, a disabled HA provider keeps its supervisor retrying at the 30s/300s cadence until the next reconnect attempt picks up the change. CRUD endpoints should call reload_provider (defined at line 339); verify that this is wired up.

### [H-8] Cached expunged ORM instances are footguns

File: [packages/server/src/notify_bridge_server/services/event_dispatch.py:75-107](../../packages/server/src/notify_bridge_server/services/event_dispatch.py#L75)

_load_trackers_cached returns expunged NotificationTracker rows. A future maintainer calling session.add(tracker) on a stale cached instance risks a DetachedInstanceError or a silent re-INSERT. Document this strongly, or ideally convert to a typed projection.

### [H-9] Pending-restore at startup has no timeout

File: [packages/server/src/notify_bridge_server/main.py:142-143](../../packages/server/src/notify_bridge_server/main.py#L142)

apply_pending_restore_if_any runs in lifespan; a partially corrupt restore could block startup indefinitely. Container liveness probes then fail once the grace period expires.

**Fix:** Wrap the call in asyncio.wait_for with a generous timeout, or kick it off as a background task while the app starts.
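A minimal sketch of the timeout variant. `run_with_budget` and the demo steps are hypothetical; what to do when the budget is exceeded (fail fast, skip the restore, alert) is left to the caller.

```python
import asyncio


async def run_with_budget(step, timeout_s: float) -> bool:
    """Run an async startup step; return False if it exceeds its budget."""
    try:
        # wait_for cancels the step when the timeout expires.
        await asyncio.wait_for(step(), timeout=timeout_s)
        return True
    except asyncio.TimeoutError:
        return False


async def demo() -> tuple:
    async def quick() -> None:
        await asyncio.sleep(0)

    async def stuck() -> None:
        await asyncio.sleep(3600)  # simulates a restore that never finishes

    return (
        await run_with_budget(quick, 1.0),
        await run_with_budget(stuck, 0.05),
    )
```

Note that cancellation only takes effect at an await point, so a restore step that blocks the event loop synchronously would still need to be moved to a thread.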

### [H-10] Jinja2 render watchdog uses daemon thread that can pin a CPU forever

File: [packages/core/src/notify_bridge_core/templates/renderer.py:48-73](../../packages/core/src/notify_bridge_core/templates/renderer.py#L48)

Comment acknowledges the trade-off. Multiple concurrent runaway renders can exhaust CPU cores while callers think they timed out. Add a process-level BoundedSemaphore capping concurrent in-flight renders.

### [H-11] _aggregate drops all but the first error

File: [packages/server/src/notify_bridge_server/services/notifier.py:326-335](../../packages/server/src/notify_bridge_server/services/notifier.py#L326)

When all sends fail, only results[0] is returned. Distinct subsequent errors are lost.

**Fix:** Aggregate all errors into a details field.
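A sketch of an aggregate that keeps every error. `SendResult` and the returned keys are hypothetical shapes, and the three-valued `status` follows the suggestion in M-11.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SendResult:
    success: bool
    error: Optional[str] = None


def aggregate(results: list) -> dict:
    """Summarise per-receiver results without dropping any error."""
    errors = [r.error or "unknown error" for r in results if not r.success]
    if not errors:
        status = "ok"  # also covers the empty-results case
    elif len(errors) == len(results):
        status = "fail"
    else:
        status = "partial"
    return {
        "status": status,
        "failed": len(errors),
        "total": len(results),
        "details": errors,
    }
```

For example, one timeout and one 403 among three sends yields status "partial" with both messages in `details`.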

### [H-12] Generic-webhook header dict materialised twice

File: [packages/server/src/notify_bridge_server/api/webhooks.py:456](../../packages/server/src/notify_bridge_server/api/webhooks.py#L456) and line 475

dict(request.headers) materialises the full headers map, then _filter_headers and _redact_sensitive_body walk the payload. With a malicious peer sending many headers (Starlette default 100), the cost is bounded but wasteful.

### [H-13] SSRF redirect-walk has no aggregate wall-clock budget

File: [packages/core/src/notify_bridge_core/notifications/telegram/client.py:232-268](../../packages/core/src/notify_bridge_core/notifications/telegram/client.py#L232)

max_redirects = 3, each hop with the 120s _DOWNLOAD_TIMEOUT. Worst case per request: 480s (the initial request plus three redirects, 4 x 120s). _TARGET_TIMEOUT_S = 120s in the dispatcher caps the top-level case, but per-asset preloads inside media groups don't all share that cap.
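One possible shape for an aggregate budget, sketched without any HTTP client. `Deadline` and `per_hop_timeout` are hypothetical helpers; the redirect loop would pass the returned value as each hop's timeout.

```python
import time


class Deadline:
    """Wall-clock budget shared by every hop of one download."""

    def __init__(self, budget_s: float, clock=time.monotonic) -> None:
        self._clock = clock
        self._expires_at = clock() + budget_s

    def remaining(self) -> float:
        return max(0.0, self._expires_at - self._clock())


def per_hop_timeout(deadline: Deadline, hop_timeout_s: float) -> float:
    """Timeout for the next hop: the per-hop cap or what is left, whichever is smaller."""
    remaining = deadline.remaining()
    if remaining <= 0:
        raise TimeoutError("redirect budget exhausted")
    return min(hop_timeout_s, remaining)
```

With a 150s budget and a 120s per-hop cap, the first hop may use 120s, a second hop gets at most the 30s that remain, and a third is refused.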

### [H-14] Backlog recovery logic flips latch for in-flight users

File: [packages/server/src/notify_bridge_server/services/bridge_self.py:544-551](../../packages/server/src/notify_bridge_server/services/bridge_self.py#L544)

The recovery loop iterates all known users and flips the latch to False for any user not in counts_by_user. If a user transiently has no user_id set on deferred rows (legacy / orphaned), they're excluded from the GROUP BY and incorrectly marked recovered.

### [H-15] quiet_hours_status silently returns None on start == end

File: [packages/server/src/notify_bridge_server/services/dispatch_helpers.py:110-111](../../packages/server/src/notify_bridge_server/services/dispatch_helpers.py#L110)

The comment notes this is almost always a user mistake. Silent return means the user wonders why their notifications still arrive at all hours. Surface via WARNING log + UI hint.

---

## MEDIUM

### [M-1] register_commands_with_telegram chat overrides loop is sequential

File: [packages/server/src/notify_bridge_server/commands/handler.py:723-776](../../packages/server/src/notify_bridge_server/commands/handler.py#L723)

50 chats with overrides = 50 sequential Telegram round-trips. Use asyncio.gather with a semaphore as in _refresh_telegram_chat_titles.
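A sketch of bounded fan-out, assuming each per-chat call can be wrapped as a zero-argument coroutine function. `gather_bounded` is a hypothetical helper, not the existing _refresh_telegram_chat_titles code.

```python
import asyncio


async def gather_bounded(coro_fns, limit: int) -> list:
    """Run zero-argument coroutine functions with at most `limit` in flight."""
    sem = asyncio.Semaphore(limit)

    async def run(fn):
        async with sem:
            return await fn()

    # return_exceptions=True so one failing chat does not cancel the rest.
    return await asyncio.gather(*(run(fn) for fn in coro_fns), return_exceptions=True)


async def demo() -> tuple:
    active = 0
    peak = 0

    async def job(i: int) -> int:
        nonlocal active, peak
        active += 1
        peak = max(peak, active)
        await asyncio.sleep(0.001)  # stands in for a Telegram round-trip
        active -= 1
        return i

    results = await gather_bounded([lambda i=i: job(i) for i in range(10)], limit=3)
    return results, peak
```

Results come back in submission order, and failures appear as exception objects in the list rather than aborting the batch.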

### [M-2] _run_provider exception backoff has no escalation

File: [packages/server/src/notify_bridge_server/services/ha_subscription.py:278-283](../../packages/server/src/notify_bridge_server/services/ha_subscription.py#L278)

A persistent bug in _emit causes a reconnect every 30s forever. Add exponential backoff with a cap, and raise a bridge_self alert after N failures.
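A sketch of the delay calculation, assuming a 30s base and a 300s cap (the cadence figures quoted in H-7). `backoff_delay` is a hypothetical helper, and jitter is omitted for brevity.

```python
def backoff_delay(failures: int, base_s: float = 30.0, cap_s: float = 300.0) -> float:
    """Delay before the next attempt after `failures` consecutive failures."""
    # Clamp the exponent so a very long outage cannot overflow the float math.
    exponent = min(max(0, failures - 1), 32)
    return min(cap_s, base_s * (2 ** exponent))
```

Successive failures wait 30s, 60s, 120s, 240s, then stay at 300s; resetting the failure count on a successful connect returns to the 30s base.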

### [M-3] database/migrations.py is 1880 lines

File: [packages/server/src/notify_bridge_server/database/migrations.py](../../packages/server/src/notify_bridge_server/database/migrations.py)

Past the 800-line guideline. Split per-migration into database/migrations/<name>.py, list in main.py.

### [M-4] Locale-resolution logic duplicated

File: [packages/server/src/notify_bridge_server/services/dispatch_helpers.py:484-491](../../packages/server/src/notify_bridge_server/services/dispatch_helpers.py#L484) and [services/notifier.py:46](../../packages/server/src/notify_bridge_server/services/notifier.py#L46)

Two implementations of locale priority. One source of truth.

### [M-5] _normalize_locale duplicated across modules

File: [packages/server/src/notify_bridge_server/commands/handler.py:632](../../packages/server/src/notify_bridge_server/commands/handler.py#L632)

Five-line copy; move to commands/command_utils.py.

### [M-6] Provider-type if-chain in _test_provider_connection

File: [packages/server/src/notify_bridge_server/api/providers.py:203-250](../../packages/server/src/notify_bridge_server/api/providers.py#L203)

Same chain in services/__init__.py:_make_collection_provider. Both candidates for a single registry.

### [M-7] Secret masking exposes last 4 chars unconditionally

File: [packages/server/src/notify_bridge_server/api/providers.py:624](../../packages/server/src/notify_bridge_server/api/providers.py#L624) and [services/backup_service.py:81](../../packages/server/src/notify_bridge_server/services/backup_service.py#L81)

Fine for 32-char Immich keys, but for short secrets the last 4 characters are a large share of the value (half of an 8-character secret). Use plain `"***"` for len(value) < 16.
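A sketch of the length-aware mask. `mask_secret` and the exact output format are hypothetical; the existing mask format in api/providers.py may differ.

```python
def mask_secret(value: str, min_len_for_suffix: int = 16, visible: int = 4) -> str:
    """Mask a secret, revealing a short suffix only when the secret is long."""
    if len(value) < min_len_for_suffix:
        # Short secret: any visible suffix would leak too much of it.
        return "***"
    return "***" + value[-visible:]
```

Using one helper from both the API response path and the backup path would also keep the two masks from drifting apart.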

### [M-8] Deprecated validate_outbound_url still imported

File: [packages/core/src/notify_bridge_core/providers/immich/client.py:14](../../packages/core/src/notify_bridge_core/providers/immich/client.py#L14)

The sync version uses blocking socket.getaddrinfo on the event loop. Migrate to avalidate_outbound_url.

### [M-9] Lazy cache init has confusing DCL comment

File: [packages/server/src/notify_bridge_server/services/watcher.py:81-113](../../packages/server/src/notify_bridge_server/services/watcher.py#L81)

The comment about "Double-check after acquiring lock" implies classic DCL. Under asyncio, the unlocked first check is safe because there's no thread context switch between coroutines; rename or reword to make that explicit.

### [M-10] Dispatcher concurrency cap is per-dispatch, not process-wide

File: [packages/core/src/notify_bridge_core/notifications/dispatcher.py:58](../../packages/core/src/notify_bridge_core/notifications/dispatcher.py#L58)

_DISPATCH_CONCURRENCY = 16 is applied INSIDE dispatch(), so it limits each call separately. An HA storm of N events with M targets each can run up to N x min(M, 16) sends at once, with no outer cap. Add a process-level semaphore in event_dispatch.py.

### [M-11] success=True returned for partial failures

File: [packages/server/src/notify_bridge_server/services/notifier.py:329-335](../../packages/server/src/notify_bridge_server/services/notifier.py#L329)

A test that fails on 1 of 3 receivers returns success=True with a partial_failures count. Introduce a status: "ok"|"partial"|"fail" field.

### [M-12] Telegram command registration not retried on 429

File: [packages/server/src/notify_bridge_server/commands/handler.py:671-693](../../packages/server/src/notify_bridge_server/commands/handler.py#L671)

set_my_commands/delete_my_commands aren't retried. Adopt the retry-after handling that _upload_media has.

### [M-13] event_log_id_by_event keyed on id(event)

File: [packages/server/src/notify_bridge_server/services/watcher.py:417-464](../../packages/server/src/notify_bridge_server/services/watcher.py#L417)

CPython object-address as key works because events are held alive in scope, but a typed key would be safer.

### [M-14] Bcrypt-length error wording could be clearer

File: [packages/server/src/notify_bridge_server/auth/routes.py:69-81](../../packages/server/src/notify_bridge_server/auth/routes.py#L69)

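A user typing 70 ASCII characters plus an emoji gets rejected and doesn't understand why: a typical emoji takes 4 bytes in UTF-8, so the password is 74 bytes and exceeds bcrypt's 72-byte limit even though it looks like 71 characters. Clarify the byte-count language.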
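A sketch of an error message that states the byte count explicitly. `password_length_error` and the message wording are hypothetical; the 72-byte figure is bcrypt's input limit.

```python
from typing import Optional

BCRYPT_MAX_BYTES = 72


def password_length_error(password: str) -> Optional[str]:
    """Return a user-facing error if the UTF-8 encoding exceeds bcrypt's limit."""
    n_bytes = len(password.encode("utf-8"))
    if n_bytes <= BCRYPT_MAX_BYTES:
        return None
    return (
        f"Password is {n_bytes} bytes when encoded; the maximum is "
        f"{BCRYPT_MAX_BYTES} bytes. Emoji and accented characters "
        "count as more than one byte each."
    )
```

Reporting both the measured size and the limit lets the user see why a password that looks short enough is still rejected.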

### [M-15] CSP allows unsafe-inline for script-src

File: [packages/server/src/notify_bridge_server/main.py:186-201](../../packages/server/src/notify_bridge_server/main.py#L186)

Acknowledged. SvelteKit's kit.csp option in svelte.config.js emits hashes or nonces for its inline scripts; switching to it unblocks dropping unsafe-inline.

### [M-16] Telegram-webhook body size not capped

File: [packages/server/src/notify_bridge_server/commands/webhook.py:71](../../packages/server/src/notify_bridge_server/commands/webhook.py#L71)

update = await request.json() reads with no cap. Add _read_bounded_body pattern.

### [M-17] _log_command_event swallows DB failures invisibly

File: [packages/server/src/notify_bridge_server/commands/handler.py:353-357](../../packages/server/src/notify_bridge_server/commands/handler.py#L353)

Hard DB failure here is invisible. Add a metrics counter.

### [M-18] apply_tracking_display_filters is a 60-line if-branched function

File: [packages/server/src/notify_bridge_server/services/dispatch_helpers.py:350-405](../../packages/server/src/notify_bridge_server/services/dispatch_helpers.py#L350)

Split into _filter_favorites, _apply_order_and_limit, _strip_details_and_tags.

---

## LOW

### [L-1] from .database.models import * in main.py

File: [packages/server/src/notify_bridge_server/main.py:26](../../packages/server/src/notify_bridge_server/main.py#L26)

Comment is honest about purpose, but explicit imports or a single module import is clearer.

### [L-2] None comparisons

All comparisons verified to use is None via grep — no findings.

### [L-3] Magic numbers

Constants are well-named throughout (_TG_429_MAX_ATTEMPTS, _MAX_PENDING_PER_TRACKER, DEBOUNCE_SECONDS, etc.). Only nit: seconds=30 literal in scheduler.schedule_bot_polling could be promoted.

### [L-4] noqa E712 repeated 8+ times for SQLModel boolean comparisons

Switch to .is_(True), the SQLAlchemy idiom, or add E712 to the ignore list in the project's ruff config.

### [L-5] _check_same_origin is best-effort by design

Acceptable.

### [L-6] _normalize_host strips IPv6 zone IDs silently

File: [packages/core/src/notify_bridge_core/notifications/ssrf.py:105-106](../../packages/core/src/notify_bridge_core/notifications/ssrf.py#L105)

Debug log when stripping changes the host would help diagnose.

### [L-7] _compute_jitter cap of 30s might be tight on hourly polls

File: [packages/server/src/notify_bridge_server/services/scheduler.py:91-105](../../packages/server/src/notify_bridge_server/services/scheduler.py#L91)

Revisit if jitter-collision becomes a real-world issue.

### [L-8] SmtpConfig repr may leak password

File: [packages/server/src/notify_bridge_server/services/notifier.py:205-213](../../packages/server/src/notify_bridge_server/services/notifier.py#L205)

If SmtpConfig is a vanilla dataclass, repr() will leak the password. Verify in notify_bridge_core.notifications.email.client — add field(repr=False) or a custom __repr__.
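A sketch of the `field(repr=False)` option. `SmtpConfigSketch` and its fields are hypothetical and do not reflect the real SmtpConfig definition.

```python
from dataclasses import dataclass, field


@dataclass
class SmtpConfigSketch:
    host: str
    username: str
    # repr=False keeps the value out of repr(), and therefore out of
    # log lines and tracebacks that format the object.
    password: str = field(repr=False)


cfg = SmtpConfigSketch(host="smtp.example.org", username="bridge", password="hunter2")
```

The field still takes part in `__init__` and `__eq__`; only the generated repr omits it.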

### [L-9] noqa BLE001 count is high

49 occurrences across 26 files. Each defensible; consider narrowing where possible.

### [L-10] _normalize_for_json does not handle UUID/Decimal

File: [packages/server/src/notify_bridge_server/services/deferred_dispatch.py:124-133](../../packages/server/src/notify_bridge_server/services/deferred_dispatch.py#L124)

No current consumer emits these, but a fallback str() for unknown types would prevent future breakage.
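A sketch of the fallback using json.dumps with `default=str`. This is an illustration of the idea rather than a patch to _normalize_for_json, which is a custom function; `dumps_with_fallback` is a hypothetical name.

```python
import json
import uuid
from decimal import Decimal


def dumps_with_fallback(payload) -> str:
    """Serialise to JSON, stringifying any type json does not know natively."""
    # default= is only called for objects json cannot encode itself.
    return json.dumps(payload, default=str)


encoded = dumps_with_fallback(
    {"id": uuid.UUID("12345678-1234-5678-1234-567812345678"), "amount": Decimal("1.50")}
)
```

The trade-off is that unknown types are silently stringified, so round-tripping loses the original type; that is usually acceptable for a deferred-dispatch payload but worth a comment at the call site.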

---

## Approval Verdict

**Block** — CRITICAL findings (C-1 unstored task, C-2 missing rollback, C-3 unauthenticated body read, C-4 racy counters, C-5 secret-mask audit) must be fixed before declaring production-ready. Once those are addressed, the HIGH findings can land in a follow-up.

## Quick Wins (low effort, high value)

1. **Wrap every fire-and-forget asyncio.create_task in a module-level set** — search for asyncio.create_task( with no assignment. Definite hit: ha_subscription.py:249.
2. **Move webhook-secret check before _read_bounded_body** in Gitea + generic webhook handlers — 5-line move per endpoint, eliminates pre-auth resource exhaustion.
3. **Add an asyncio.Lock around _poll_failure_counts and _target_failure_counts** mutations — eliminates C-4.
4. **Split migrations.py** — mechanical refactor, ~1 hour, improves blame/review.
5. **Batch the receiver query in backup_service.export_backup** — single IN (...) query, ~10x faster.
6. **Replace from .database.models import \*** with explicit imports — small clarity win.