notify-bridge/.claude/reviews/backend-review.md
alexei.dolgolyov 6a8f374678 feat: observability, per-receiver Telegram options, oversized-video fallback
Operability:
- Correlation IDs end-to-end: shared dispatch_id between log lines and
  EventLog rows (event/watcher/scheduled/deferred/action/HA/command paths)
  and a new X-Request-Id middleware that normalizes inbound ids and binds
  request_id into log context.
- dispatch_summary block merged into EventLog.details: per-target
  success/failure counts plus Telegram media delivered/skipped/failed and
  truncated error lists, so partial outcomes surface in the UI.
- Diagnostic mode: admin can flip one module to DEBUG for a bounded
  window with auto-revert (in-memory only; setup_logging() resets on
  boot, lifespan reverts on shutdown). New /diagnostic-mode endpoints
  plus DiagnosticsCassette UI on the settings page.

Telegram:
- Per-receiver options: disable_notification (silent send) and
  message_thread_id (forum-topic routing), wired through the dispatcher
  via a ContextVar so all four send sites (sendMessage / sendPhoto-Video-
  Document / sendMediaGroup / cache-hit POST) pick them up.
- send_large_videos_as_documents target setting: bypass the 50 MB
  sendVideo cap by falling back to sendDocument for oversized videos.
- sendMediaGroup byte-budget enforcement (TELEGRAM_MAX_GROUP_TOTAL_BYTES,
  45 MB) with per-item fallback on chunk failure so a stale file_id no
  longer silently drops a cached asset.

Tests:
- New: diagnostic_mode, dispatch_summary, request_correlation,
  telegram_media_group_partial, telegram_per_send_options.

Docs:
- .claude/reviews/: six-axis production-readiness review of v0.8.1.
- .claude/docs/functional-review-2026-05-28.md: focused review of
  Telegram/Immich/logging subsystems.
2026-05-28 15:19:31 +03:00

Backend Production-Readiness Review

Scope: packages/server/src/notify_bridge_server/ and packages/core/src/notify_bridge_core/ (~44k LOC, Python 3.11, FastAPI + SQLModel async + APScheduler + aiohttp).

Executive Summary

  • Overall quality is high. The Jinja2 sandbox is consistently applied (every Environment instantiation is SandboxedEnvironment), JWT auth uses bcrypt offloaded to a worker thread, SSRF guard exists with DNS-rebinding mitigation, secrets are masked in logs via a dedicated filter, and most async/SQL patterns show production-aware design (per-tracker sessions, batched IN-queries, partial unique indexes).
  • Top correctness risk: a fire-and-forget asyncio.create_task in ha_subscription._on_status_change (no reference stored, GC can drop the task) plus thread-unsafe in-memory counters in bridge_self. Both bite on chatty HA installs.
  • Module-level dict caches shared across the event loop have small read-modify-write windows in services/scheduler.py (adaptive state), services/bridge_self.py (failure counters), commands/handler.py (TTLCache rate limits), and command_sync._dirty_bots. Currently functional under low concurrency; risky under load.
  • Very large hot-path functions — services/watcher.py:check_tracker (381 lines), services/dispatch_helpers.py:load_link_data (208 lines), the 1880-line database/migrations.py, and the 1365-line services/scheduler.py — concentrate too much logic in one place.
  • Provider-type hardcoding persists in api/providers.py, services/__init__.py, services/action_runner.py, and services/manual_dispatch.py (if provider.type == "immich" chains). The watcher's _POLL_FACTORIES registry is the right model; extend it.
  • Webhook handlers read the request body BEFORE authenticating in the Gitea and generic-webhook routes. The Planka route gets it right. Net impact: a peer that knows the URL but not the secret can drive a 1 MiB read per request.
  • autoescape is inconsistent: True for runtime templates (renderer.py, commands/handler.py), False for preview / sample-context renders in api/template_configs.py, api/slot_helpers.py, and services/notifier.send_test_template_notification. Lower risk (admin-authored input) but mismatch invites surprise.

CRITICAL

[C-1] _on_status_change schedules an unstored task (GC + drop risk)

File: packages/server/src/notify_bridge_server/services/ha_subscription.py:240-260

The task created by asyncio.create_task(_record_ha_status(...)) at line 249 is not held anywhere. The event loop keeps only weak references to tasks, so a task with no other reference can be garbage-collected before it completes (the Python docs explicitly warn: save a reference to the result). Result: an HA disconnect/reconnect EventLog row can silently disappear if a GC cycle runs while the task is still pending.

Fix: Module-level set[asyncio.Task], add the new task, remove via task.add_done_callback. ha_subscription.start_all already does this correctly (line 315-320); the pattern is already in-house.

[C-2] Telegram-webhook handler returns 200 OK on uncommitted writes

File: packages/server/src/notify_bridge_server/commands/webhook.py:130-169

The catch-all at line 162 swallows handle_command exceptions and returns OK to Telegram. The request already called await session.commit() at line 96 (after save_chat_from_webhook), and any subsequent writes via the dispatcher use NEW sessions inside the command path. If a downstream session inside handle_command partially commits before raising, the dependency get_session does NOT roll back automatically — the context manager only closes.

Fix: Either explicitly session.rollback() in the except block, or wrap the per-request mutations in async with session.begin(): so the implicit transaction guarantees rollback on exception.

[C-3] Gitea/generic webhook reads body BEFORE verifying secret is configured

File: packages/server/src/notify_bridge_server/api/webhooks.py:167-178 and line 449-454

The sequence is: read up to 1 MiB of raw_body, then check whether webhook_secret is empty. A peer that learned the URL but has no secret drives a 1 MiB body read per request. Planka's handler at line 232+ validates the bearer token BEFORE the body read, which is the correct pattern.

Fix: Hoist the "if not webhook_secret" (Gitea) and "if auth_mode == none" short-circuit (generic) above _read_bounded_body. Gitea HMAC still needs the body — but bailing on a missing-config-side error first costs nothing.
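
A framework-free sketch of the reordering, assuming a synchronous stand-in for the request object. FakeRequest, WebhookRejected, handle_webhook, and the status codes are hypothetical; the real handlers are async FastAPI routes that use _read_bounded_body.

```python
from dataclasses import dataclass

MAX_BODY_BYTES = 1024 * 1024  # 1 MiB cap, the figure quoted in this finding


class WebhookRejected(Exception):
    def __init__(self, status: int, reason: str) -> None:
        super().__init__(reason)
        self.status = status


@dataclass
class FakeRequest:
    """Stand-in request that counts how often the body is read."""
    body: bytes
    reads: int = 0

    def read_bounded(self, limit: int) -> bytes:
        self.reads += 1
        if len(self.body) > limit:
            raise WebhookRejected(413, "body too large")
        return self.body


def handle_webhook(request: FakeRequest, webhook_secret: str) -> bytes:
    # Cheap configuration check first: no body I/O when no secret is set.
    if not webhook_secret:
        raise WebhookRejected(403, "webhook secret not configured")
    # Only now pay for the body read; HMAC verification would follow here.
    return request.read_bounded(MAX_BODY_BYTES)
```

With this ordering a request against an unconfigured endpoint is rejected with reads still at 0.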

[C-4] bridge_self in-memory counters are not async-safe

File: packages/server/src/notify_bridge_server/services/bridge_self.py:186-230

record_poll_failure does _poll_failure_counts[tracker_id] = _poll_failure_counts.get(tracker_id, 0) + 1. These dicts are accessed concurrently from poll loop, HA push, webhook ingest, and dispatcher target-failure recording. Individual dict ops are atomic, but get + 1 + set is not when interleaved with another coroutine that touches the same key. Symptoms: missed threshold crossings, occasional double-emission. Same pattern in _target_failure_counts and _backlog_above_threshold.

Fix: Wrap mutating ops in an asyncio.Lock. The reset-and-re-arm semantics already assume serial access — make it explicit.
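
A minimal sketch of the locked counter, assuming a hypothetical FailureCounters class and threshold. Two caveats: a single event loop only switches coroutines at await points, so the lock matters most once the read-modify-write spans an await (for example, a DB write between reading and storing the count); and asyncio.Lock does not protect against access from other threads.

```python
import asyncio


class FailureCounters:
    """Per-tracker failure counts with explicit serial access (hypothetical)."""

    def __init__(self, threshold: int) -> None:
        self._threshold = threshold
        self._counts: dict[int, int] = {}
        self._lock = asyncio.Lock()

    async def record_failure(self, tracker_id: int) -> bool:
        """Increment the count; return True only on the call that reaches the threshold."""
        async with self._lock:
            count = self._counts.get(tracker_id, 0) + 1
            self._counts[tracker_id] = count
            return count == self._threshold

    async def reset(self, tracker_id: int) -> None:
        async with self._lock:
            self._counts.pop(tracker_id, None)


async def demo() -> int:
    counters = FailureCounters(threshold=3)
    # Ten concurrent failures for the same tracker.
    crossings = await asyncio.gather(
        *(counters.record_failure(7) for _ in range(10))
    )
    return sum(crossings)  # number of calls that reported a crossing
```

Returning the crossing signal from inside the locked section is what prevents both a missed threshold and a double emission.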

[C-5] PROVIDER_SECRET_FIELDS audit needed for backup exports

File: packages/server/src/notify_bridge_server/api/providers.py:617-625 and services/backup_service.py:84-93

_apply_secrets_provider redacts only fields named in PROVIDER_SECRET_FIELDS. The webhook flow uses a field called webhook_secret (Gitea, Planka, generic) — verify this is in PROVIDER_SECRET_FIELDS (defined in backup_schema.py). A backup export with secrets_mode=INCLUDE that misses webhook_secret leaks a token that grants webhook-forgery rights.

Action: Audit PROVIDER_SECRET_FIELDS. Specifically check it includes: api_key, api_token, access_token, webhook_secret, password, client_secret, refresh_token. The _provider_response mask list at api/providers.py:620 is a good cross-reference — both should be the same constant.


HIGH

[H-1] _compile_template lru_cache competes across tenants

File: packages/server/src/notify_bridge_server/commands/handler.py:99-103

lru_cache(maxsize=256) keyed by raw template string. Edited templates remain cached. On a multi-tenant install, one tenant's 256 distinct templates can evict another's. No invalidation on template edit.

Fix: Drop the cache (Jinja compile is sub-ms) OR add an invalidation call from the template-edit endpoints. The notification renderer (renderer.py:31) uses 512 slots — same problem; consistent fix.

[H-2] check_tracker is 381 lines with deep coupling

File: packages/server/src/notify_bridge_server/services/watcher.py:263-644

Loads tracker, polls, writes state, persists EventLog, evaluates gates, defers, dispatches, and records bridge_self, all in one function. Refactor candidates: _poll_phase, _persist_state_and_events, _dispatch_phase. This is the watcher's hot path; bugs here affect every tracker tick.

[H-3] load_link_data returns an untyped dict

File: packages/server/src/notify_bridge_server/services/dispatch_helpers.py:539-747

Five call sites consume ld["target_type"], ld.get("link_id"), etc. — no static guarantee against key typos.

Fix: Introduce a frozen @dataclass class LinkData. Same for per-receiver entries.
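
A sketch of the typed container, assuming only the two keys named above. The real payload has more fields, and from_dict is a hypothetical migration helper.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class LinkData:
    """Typed replacement for the ld dict; only two illustrative fields shown."""
    target_type: str
    link_id: Optional[int] = None

    @classmethod
    def from_dict(cls, raw: dict[str, Any]) -> "LinkData":
        # A missing required key fails loudly here (KeyError) instead of
        # surfacing later at one of the call sites.
        return cls(target_type=raw["target_type"], link_id=raw.get("link_id"))
```

Attribute typos such as ld.target_typ then become errors a type checker can flag, and frozen=True rejects accidental mutation by downstream consumers.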

[H-4] N+1 in _resolve_command_context template-slot loop

File: packages/server/src/notify_bridge_server/commands/handler.py:200-215

One SELECT per distinct command_template_config_id. Already batched for trackers/configs/providers — finish the job. Single WHERE config_id IN (...) query + Python pivot.

[H-5] N+1 in backup_service.export_backup receiver loop

File: packages/server/src/notify_bridge_server/services/backup_service.py:187-189

50 targets = 51 SELECTs. Batch with WHERE target_id IN (...). Audit other sections of this 941-line file for the same pattern (templates -> slots, command configs -> slots).
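
The batching idea, sketched with the standard-library sqlite3 module as a stand-in for the project's SQLModel session. The receiver table, its columns, and both function names are assumptions made for the demo.

```python
import sqlite3
from collections import defaultdict


def load_receivers_batched(
    conn: sqlite3.Connection, target_ids: list[int]
) -> dict[int, list[str]]:
    """Fetch receivers for all targets in one query, then group in Python."""
    if not target_ids:
        return {}
    # One placeholder per id; values are still bound, never interpolated.
    placeholders = ", ".join("?" for _ in target_ids)
    rows = conn.execute(
        "SELECT target_id, name FROM receiver "
        f"WHERE target_id IN ({placeholders}) ORDER BY id",
        target_ids,
    ).fetchall()
    grouped: dict[int, list[str]] = defaultdict(list)
    for target_id, name in rows:
        grouped[target_id].append(name)
    return dict(grouped)


def make_demo_db() -> sqlite3.Connection:
    """In-memory database with a hypothetical receiver table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE receiver (id INTEGER PRIMARY KEY, target_id INTEGER, name TEXT)"
    )
    conn.executemany(
        "INSERT INTO receiver (target_id, name) VALUES (?, ?)",
        [(1, "alice"), (1, "bob"), (2, "carol"), (3, "dave")],
    )
    return conn
```

For very large id lists the IN clause would need chunking, since databases cap the number of bound parameters per statement.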

[H-6] _dirty_bots mutated from request and scheduler without a lock

File: packages/server/src/notify_bridge_server/services/command_sync.py:25-95

mark_bot_dirty runs in request handlers, _flush_dirty_bots on the scheduler executor. Currently safe (snapshot via ready = [...]) but fragile.

Fix: Snapshot under lock, or move to a thread-safe primitive.

[H-7] HA reconnect cycle has no way for CRUD to short-circuit a stale supervisor

File: packages/server/src/notify_bridge_server/services/ha_subscription.py:163-175

Provider config is reloaded only on reconnect, so a disabled HA provider keeps retrying at the 30s/300s cadence until its next reconnect attempt picks up the change. CRUD endpoints should call reload_provider (defined at line 339); verify the wiring.

[H-8] Cached expunged ORM instances are footguns

File: packages/server/src/notify_bridge_server/services/event_dispatch.py:75-107

_load_trackers_cached returns expunged NotificationTracker rows. Future maintainer calling session.add(tracker) on a stale cached instance triggers DetachedInstance or silent re-INSERT. Document this strongly, ideally convert to a typed projection.

[H-9] Pending-restore at startup has no timeout

File: packages/server/src/notify_bridge_server/main.py:142-143

apply_pending_restore_if_any runs in lifespan; a partially-corrupt restore could block startup indefinitely. Container liveness probes then fail after grace.

Fix: asyncio.wait_for with a generous timeout, or kick off as background task while app starts.
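
A sketch of the timeout option, assuming a hypothetical run_startup_step wrapper around the restore coroutine. Note that wait_for can only interrupt the step at an await point; a restore stuck in blocking synchronous code would not be cancelled by it.

```python
import asyncio
from typing import Awaitable, Callable


async def run_startup_step(
    step: Callable[[], Awaitable[None]], timeout_s: float
) -> bool:
    """Run one startup step; return False instead of hanging if it overruns."""
    try:
        await asyncio.wait_for(step(), timeout=timeout_s)
        return True
    except asyncio.TimeoutError:
        # The step has been cancelled; the caller decides whether to
        # continue booting or abort with a clear error.
        return False
```

In the lifespan handler this would wrap apply_pending_restore_if_any, with a False result logged at ERROR before deciding how to proceed.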

[H-10] Jinja2 render watchdog uses daemon thread that can pin a CPU forever

File: packages/core/src/notify_bridge_core/templates/renderer.py:48-73

Comment acknowledges the trade-off. Multiple concurrent runaway renders can exhaust CPU cores while callers think they timed out. Add a process-level BoundedSemaphore capping concurrent in-flight renders.
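
One way the cap could look, as a sketch: the slot is released when the render thread actually ends, not when the caller times out, so runaway renders keep occupying slots and new work is refused. render_with_watchdog, RenderBusy, and the limit of 2 are hypothetical.

```python
import threading

_MAX_INFLIGHT_RENDERS = 2  # hypothetical process-wide cap
_render_slots = threading.BoundedSemaphore(_MAX_INFLIGHT_RENDERS)


class RenderBusy(RuntimeError):
    """Raised when too many renders are still running."""


def render_with_watchdog(render_fn, timeout_s: float):
    # Non-blocking acquire: refuse new work rather than queue behind
    # renders that may never finish.
    if not _render_slots.acquire(blocking=False):
        raise RenderBusy("too many renders in flight")
    result: dict = {}

    def worker() -> None:
        try:
            result["value"] = render_fn()
        except Exception as exc:  # hand the error back to the caller
            result["error"] = exc
        finally:
            # Released only when the render really ends, even if the
            # caller gave up long ago.
            _render_slots.release()

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout_s)
    if thread.is_alive():
        raise TimeoutError("render exceeded its time budget")
    if "error" in result:
        raise result["error"]
    return result["value"]
```

The trade-off is that legitimate renders are rejected while runaway ones hold every slot, which is arguably preferable to pinning every core.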

[H-11] _aggregate drops all but the first error

File: packages/server/src/notify_bridge_server/services/notifier.py:326-335

When all sends fail, only results[0] is returned. Distinct subsequent errors are lost.

Fix: Aggregate all errors into a details field.
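
A sketch of the aggregation, assuming hypothetical SendResult and AggregateResult shapes; the real result type in notifier.py may differ.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SendResult:
    success: bool
    error: Optional[str] = None


@dataclass
class AggregateResult:
    success: bool
    error: Optional[str] = None
    details: list[str] = field(default_factory=list)


def aggregate(results: list[SendResult]) -> AggregateResult:
    failed = [r for r in results if not r.success]
    if not failed:
        return AggregateResult(success=True)
    # Keep every distinct error, in first-seen order.
    errors = list(dict.fromkeys(r.error or "unknown error" for r in failed))
    return AggregateResult(
        # Assumption: success stays True while at least one send worked.
        success=len(failed) < len(results),
        error=errors[0],
        details=errors,
    )
```

The error field still carries the first message, so existing consumers keep working while details exposes the rest.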

[H-12] Generic-webhook header dict materialised twice

File: packages/server/src/notify_bridge_server/api/webhooks.py:456 and line 475

dict(request.headers) materialises full headers map, then _filter_headers and _redact_sensitive_body walk the payload. With a malicious peer sending many headers (Starlette default 100), bounded but wasteful.

[H-13] SSRF redirect-walk has no aggregate wall-clock budget

File: packages/core/src/notify_bridge_core/notifications/telegram/client.py:232-268

max_redirects = 3, each hop with a 120s _DOWNLOAD_TIMEOUT. Worst case per request: 480s (the initial request plus three redirects). _TARGET_TIMEOUT_S = 120s in the dispatcher caps the top-level case, but per-asset preloads inside media groups don't all share that cap.

[H-14] Backlog recovery logic flips latch for in-flight users

File: packages/server/src/notify_bridge_server/services/bridge_self.py:544-551

The recovery loop iterates all known users and flips the latch to False for any not in counts_by_user. If a user transiently has no user_id set on deferred rows (legacy / orphaned), they're excluded from the GROUP BY and incorrectly marked recovered.

[H-15] quiet_hours_status silently returns None on start == end

File: packages/server/src/notify_bridge_server/services/dispatch_helpers.py:110-111

The comment notes this is almost always a user mistake. Silent return means the user wonders why their notifications still arrive at all hours. Surface via WARNING log + UI hint.


MEDIUM

[M-1] register_commands_with_telegram chat overrides loop is sequential

File: packages/server/src/notify_bridge_server/commands/handler.py:723-776

50 chats with overrides = 50 sequential Telegram round-trips. Use asyncio.gather with a semaphore as in _refresh_telegram_chat_titles.
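
The gather-with-semaphore pattern in general form. gather_bounded is a hypothetical helper; how _refresh_telegram_chat_titles implements it is not shown in this review.

```python
import asyncio
from typing import Any, Awaitable, Callable


async def gather_bounded(
    calls: list[Callable[[], Awaitable[Any]]], limit: int
) -> list[Any]:
    """Run all calls concurrently, with at most `limit` in flight at once."""
    sem = asyncio.Semaphore(limit)

    async def run(call: Callable[[], Awaitable[Any]]) -> Any:
        async with sem:
            return await call()

    # return_exceptions=True so one failing chat does not abort the rest;
    # failures come back as exception objects in the result list.
    return await asyncio.gather(*(run(c) for c in calls), return_exceptions=True)
```

Results come back in input order, so each one can be matched to its chat and failures logged individually.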

[M-2] _run_provider exception backoff has no escalation

File: packages/server/src/notify_bridge_server/services/ha_subscription.py:278-283

Persistent bug in _emit reconnects every 30s forever. Add exponential backoff with cap and bridge_self alert after N failures.
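
A sketch of capped exponential backoff with an alert trigger. The 30s base and 300s cap mirror the cadence mentioned under H-7 but are assumptions here, as are the function names and the alert threshold.

```python
def backoff_delay(attempt: int, base_s: float = 30.0, cap_s: float = 300.0) -> float:
    """Delay before retry number `attempt` (1-based): base, 2x, 4x, ... up to cap."""
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    # Clamp the exponent so very long outages cannot overflow the float.
    exponent = min(attempt - 1, 32)
    return min(cap_s, base_s * 2 ** exponent)


def should_alert(consecutive_failures: int, alert_after: int = 5) -> bool:
    """True exactly once, on the failure that reaches the alert threshold."""
    return consecutive_failures == alert_after
```

Adding random jitter to the delay would also help if many providers fail at the same moment.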

[M-3] database/migrations.py is 1880 lines

File: packages/server/src/notify_bridge_server/database/migrations.py

Past the 800-line guideline. Split into one module per migration under database/migrations/ and list them in main.py.

[M-4] Locale-resolution logic duplicated

File: packages/server/src/notify_bridge_server/services/dispatch_helpers.py:484-491 and services/notifier.py:46

Two implementations of locale priority. One source of truth.

[M-5] _normalize_locale duplicated across modules

File: packages/server/src/notify_bridge_server/commands/handler.py:632

Five-line copy; move to commands/command_utils.py.

[M-6] Provider-type if-chain in _test_provider_connection

File: packages/server/src/notify_bridge_server/api/providers.py:203-250

Same chain in services/__init__.py:_make_collection_provider. Both candidates for a single registry.

[M-7] Secret masking exposes last 4 chars unconditionally

File: packages/server/src/notify_bridge_server/api/providers.py:624 and services/backup_service.py:81

Fine for 32-char Immich keys. Returns half the value for short secrets. Use plain "***" for len(value) < 16.
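
A sketch of the length-aware mask. mask_secret is a hypothetical name, and the exact output format of the existing mask is an assumption.

```python
def mask_secret(value: str, min_len_for_tail: int = 16, tail: int = 4) -> str:
    """Mask a secret, revealing the last `tail` chars only for long values."""
    if len(value) < min_len_for_tail:
        # Short secret: a 4-char tail would reveal too much of it.
        return "***"
    return "***" + value[-tail:]
```

Sharing one helper between api/providers.py and backup_service.py would keep the two masks from drifting apart.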

[M-8] Deprecated validate_outbound_url still imported

File: packages/core/src/notify_bridge_core/providers/immich/client.py:14

The sync version uses blocking socket.getaddrinfo on the event loop. Migrate to avalidate_outbound_url.

[M-9] Lazy cache init has confusing DCL comment

File: packages/server/src/notify_bridge_server/services/watcher.py:81-113

The comment about "Double-check after acquiring lock" implies classic DCL. Under asyncio, the unlocked first check is safe because there's no thread context switch, but rename to clarify.

[M-10] Dispatcher concurrency cap is per-dispatch, not process-wide

File: packages/core/src/notify_bridge_core/notifications/dispatcher.py:58

_DISPATCH_CONCURRENCY = 16 is INSIDE dispatch(). HA storm = N events x min(M, 16) sends with no outer cap. Add a process-level semaphore in event_dispatch.py.

[M-11] success=True returned for partial failures

File: packages/server/src/notify_bridge_server/services/notifier.py:329-335

A test that fails on 1 of 3 receivers returns success=True with a partial_failures count. Introduce a status: "ok"|"partial"|"fail" field.
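
The three-state field could be derived as sketched below. derive_status and its parameters are hypothetical, and treating "nothing attempted" as ok is an assumption.

```python
from typing import Literal

Status = Literal["ok", "partial", "fail"]


def derive_status(sent: int, failed: int) -> Status:
    """Map per-receiver counts to a three-state outcome."""
    if failed == 0:
        return "ok"  # also covers the nothing-attempted case (assumption)
    if sent == 0:
        return "fail"
    return "partial"
```

Keeping success alongside status for one release would let existing API clients migrate gradually.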

[M-12] Telegram command registration not retried on 429

File: packages/server/src/notify_bridge_server/commands/handler.py:671-693

set_my_commands/delete_my_commands aren't retried. Adopt the retry-after handling that _upload_media has.

[M-13] event_log_id_by_event keyed on id(event)

File: packages/server/src/notify_bridge_server/services/watcher.py:417-464

CPython object-address as key works because events are held alive in scope, but a typed key would be safer.

[M-14] Bcrypt-length error wording could be clearer

File: packages/server/src/notify_bridge_server/auth/routes.py:69-81

A user typing 70 ASCII characters plus an emoji gets rejected and doesn't understand why. Clarify the byte-count language.
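
A sketch of a clearer message, assuming the check enforces bcrypt's 72-byte input limit. password_length_error and the wording are hypothetical.

```python
from typing import Optional

BCRYPT_MAX_BYTES = 72  # bcrypt only considers the first 72 bytes of input


def password_length_error(password: str) -> Optional[str]:
    """Return a user-facing error, or None if the password fits."""
    size = len(password.encode("utf-8"))
    if size <= BCRYPT_MAX_BYTES:
        return None
    return (
        f"Password is {size} bytes when encoded as UTF-8; the limit is "
        f"{BCRYPT_MAX_BYTES} bytes. Non-ASCII characters such as emoji "
        "count as 2 to 4 bytes each."
    )
```

Reporting the measured byte size makes the 70-ASCII-plus-emoji case (74 bytes) self-explanatory.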

[M-15] CSP allows unsafe-inline for script-src

File: packages/server/src/notify_bridge_server/main.py:186-201

Acknowledged. SvelteKit --csp build flag emits hashes; switching unblocks dropping unsafe-inline.

[M-16] Telegram-webhook body size not capped

File: packages/server/src/notify_bridge_server/commands/webhook.py:71

update = await request.json() reads with no cap. Add _read_bounded_body pattern.

[M-17] _log_command_event swallows DB failures invisibly

File: packages/server/src/notify_bridge_server/commands/handler.py:353-357

Hard DB failure here is invisible. Add a metrics counter.

[M-18] apply_tracking_display_filters is a 60-line if-branched function

File: packages/server/src/notify_bridge_server/services/dispatch_helpers.py:350-405

Split into _filter_favorites, _apply_order_and_limit, _strip_details_and_tags.


LOW

[L-1] from .database.models import * in main.py

File: packages/server/src/notify_bridge_server/main.py:26

Comment is honest about purpose, but explicit imports or a single module import is clearer.

[L-2] None comparisons

All comparisons verified to use is None via grep — no findings.

[L-3] Magic numbers

Constants are well-named throughout (_TG_429_MAX_ATTEMPTS, _MAX_PENDING_PER_TRACKER, DEBOUNCE_SECONDS, etc.). Only nit: seconds=30 literal in scheduler.schedule_bot_polling could be promoted.

[L-4] noqa E712 repeated 8+ times for SQLModel boolean comparisons

Switch to .is_(True), the SQLAlchemy idiom, or add E712 to the ignore list in the project's ruff config.

[L-5] _check_same_origin is best-effort by design

Acceptable.

[L-6] _normalize_host strips IPv6 zone IDs silently

File: packages/core/src/notify_bridge_core/notifications/ssrf.py:105-106

Debug log when stripping changes the host would help diagnose.

[L-7] _compute_jitter cap of 30s might be tight on hourly polls

File: packages/server/src/notify_bridge_server/services/scheduler.py:91-105

Revisit if jitter-collision becomes a real-world issue.

[L-8] SmtpConfig repr may leak password

File: packages/server/src/notify_bridge_server/services/notifier.py:205-213

If SmtpConfig is a vanilla dataclass, repr() will leak the password. Verify in notify_bridge_core.notifications.email.client — add field(repr=False) or a custom repr.
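
The field(repr=False) option, sketched on a stand-in class. SmtpConfigSketch and its fields are hypothetical; the real SmtpConfig definition still needs to be checked.

```python
from dataclasses import dataclass, field


@dataclass
class SmtpConfigSketch:
    host: str
    username: str
    # Excluded from the generated __repr__, so logging the object or
    # showing it in a traceback does not print the secret.
    password: str = field(repr=False)
```

Equality comparison and normal attribute access still include the password; only the repr changes.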

[L-9] noqa BLE001 count is high

49 occurrences across 26 files. Each defensible; consider narrowing where possible.

[L-10] _normalize_for_json does not handle UUID/Decimal

File: packages/server/src/notify_bridge_server/services/deferred_dispatch.py:124-133

No current consumer emits these, but a fallback str() for unknown types would prevent future breakage.
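
One way to get the fallback is json's default hook, sketched below. This is an alternative technique rather than the existing _normalize_for_json logic, and both function names are hypothetical.

```python
import json
from datetime import date, datetime
from typing import Any


def _json_fallback(value: Any) -> str:
    """Called by json.dumps only for types it cannot serialise natively."""
    if isinstance(value, (datetime, date)):
        return value.isoformat()
    # UUID, Decimal, Path, enums, ...: str() is lossy but never raises here.
    return str(value)


def dumps_lenient(payload: Any) -> str:
    return json.dumps(payload, default=_json_fallback)
```

The conversion is one-way: a Decimal comes back as a string on load, so consumers that need the original type must convert explicitly.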


Approval Verdict

Block — CRITICAL findings (C-1 unstored task, C-2 missing rollback, C-3 unauthenticated body read, C-4 racy counters, C-5 secret-mask audit) must be fixed before declaring production-ready. Once those are addressed, the HIGH findings can land in a follow-up.

Quick Wins (low effort, high value)

  1. Track every fire-and-forget asyncio.create_task in a module-level set: search for asyncio.create_task( calls with no assignment. Definite hit: ha_subscription.py:249.
  2. Move webhook-secret check before _read_bounded_body in Gitea + generic webhook handlers — 5-line move per endpoint, eliminates pre-auth resource exhaustion.
  3. Add an asyncio.Lock around _poll_failure_counts and _target_failure_counts mutations — eliminates C-4.
  4. Split migrations.py — mechanical refactor, ~1 hour, improves blame/review.
  5. Batch the receiver query in backup_service.export_backup — single IN (...) query, ~10x faster.
  6. Replace from .database.models import * with explicit imports — small clarity win.