Operability:
- Correlation IDs end-to-end: shared dispatch_id between log lines and EventLog rows (event/watcher/scheduled/deferred/action/HA/command paths) and a new X-Request-Id middleware that normalizes inbound ids and binds request_id into log context.
- dispatch_summary block merged into EventLog.details: per-target success/failure counts plus Telegram media delivered/skipped/failed and truncated error lists, so partial outcomes surface in the UI.
- Diagnostic mode: admin can flip one module to DEBUG for a bounded window with auto-revert (in-memory only; setup_logging() resets on boot, lifespan reverts on shutdown). New /diagnostic-mode endpoints plus DiagnosticsCassette UI on the settings page.
Telegram:
- Per-receiver options: disable_notification (silent send) and message_thread_id (forum-topic routing), wired through the dispatcher via a ContextVar so all four send sites (sendMessage / sendPhoto-Video-Document / sendMediaGroup / cache-hit POST) pick them up.
- send_large_videos_as_documents target setting: bypass the 50 MB sendVideo cap by falling back to sendDocument for oversized videos.
- sendMediaGroup byte-budget enforcement (TELEGRAM_MAX_GROUP_TOTAL_BYTES, 45 MB) with per-item fallback on chunk failure so a stale file_id no longer silently drops a cached asset.
Tests:
- New: diagnostic_mode, dispatch_summary, request_correlation, telegram_media_group_partial, telegram_per_send_options.
Docs:
- .claude/reviews/: six-axis production-readiness review of v0.8.1.
- .claude/docs/functional-review-2026-05-28.md: focused review of Telegram/Immich/logging subsystems.
Backend Production-Readiness Review
Scope: packages/server/src/notify_bridge_server/ and packages/core/src/notify_bridge_core/ (~44k LOC, Python 3.11, FastAPI + SQLModel async + APScheduler + aiohttp).
Executive Summary
- Overall quality is high. The Jinja2 sandbox is consistently applied (every Environment instantiation is SandboxedEnvironment), JWT auth uses bcrypt offloaded to a worker thread, SSRF guard exists with DNS-rebinding mitigation, secrets are masked in logs via a dedicated filter, and most async/SQL patterns show production-aware design (per-tracker sessions, batched IN-queries, partial unique indexes).
- Top correctness risk: a fire-and-forget asyncio.create_task in ha_subscription._on_status_change (no reference stored, GC can drop the task) plus thread-unsafe in-memory counters in bridge_self. Both bite on chatty HA installs.
- Module-level dict caches shared across the event loop have small read-modify-write windows in services/scheduler.py (adaptive state), services/bridge_self.py (failure counters), commands/handler.py (TTLCache rate limits), and command_sync._dirty_bots. Currently functional under low concurrency; risky under load.
- Very large hot-path functions — services/watcher.py:check_tracker (381 lines), services/dispatch_helpers.py:load_link_data (208 lines), the 1880-line database/migrations.py, and the 1365-line services/scheduler.py — concentrate too much logic in one place.
- Provider-type hardcoding persists in api/providers.py, services/__init__.py, services/action_runner.py, and services/manual_dispatch.py (if provider.type == "immich" chains). The watcher's _POLL_FACTORIES registry is the right model; extend it.
- Webhook handlers read the request body BEFORE authenticating in the Gitea and generic-webhook routes. The Planka route gets it right. Net impact: a peer that knows the URL but not the secret can drive a 1 MiB read per request.
- autoescape is inconsistent: True for runtime templates (renderer.py, commands/handler.py), False for preview / sample-context renders in api/template_configs.py, api/slot_helpers.py, and services/notifier.send_test_template_notification. Lower risk (admin-authored input) but mismatch invites surprise.
CRITICAL
[C-1] _on_status_change schedules an unstored task (GC + drop risk)
File: packages/server/src/notify_bridge_server/services/ha_subscription.py:240-260
The task created by asyncio.create_task(_record_ha_status(...)) at line 249 is not held anywhere. Python may garbage-collect a task whose only reference is the create_task return value before it completes (Python docs explicitly warn: save a reference to the result). Result: an HA disconnect/reconnect EventLog row silently disappears under memory pressure.
Fix: Keep a module-level set[asyncio.Task], add each new task to it, and remove it via task.add_done_callback. ha_subscription.start_all already does this correctly (lines 315-320), so the pattern is already in-house.
[C-2] Telegram-webhook handler returns 200 OK on uncommitted writes
File: packages/server/src/notify_bridge_server/commands/webhook.py:130-169
The catch-all at line 162 swallows handle_command exceptions and returns OK to Telegram. The request already called await session.commit() at line 96 (after save_chat_from_webhook), and any subsequent writes via the dispatcher use NEW sessions inside the command path. If a downstream session inside handle_command partially commits before raising, the dependency get_session does NOT roll back automatically — the context manager only closes.
Fix: Either explicitly session.rollback() in the except block, or wrap the per-request mutations in async with session.begin(): so the implicit transaction guarantees rollback on exception.
[C-3] Gitea/generic webhook reads body BEFORE verifying secret is configured
File: packages/server/src/notify_bridge_server/api/webhooks.py:167-178 and line 449-454
The sequence is: read a raw_body of up to 1 MiB, then check whether webhook_secret is empty. A peer that learned the URL but has no secret drives a 1 MiB body read per request. Planka's handler at line 232+ validates the bearer token BEFORE the body read; that is the correct pattern.
Fix: Hoist the "if not webhook_secret" check (Gitea) and the "if auth_mode == none" short-circuit (generic) above _read_bounded_body. Gitea HMAC verification still needs the body, but rejecting a missing-secret configuration first costs nothing.
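The ordering can be sketched as follows, assuming an HMAC-SHA256 hex signature. The function names, the Rejected exception and the 503/401 status codes are assumptions made for illustration; the real handlers may use a different header format and different codes.

```python
import hashlib
import hmac
from typing import Callable


class Rejected(Exception):
    """Request refused; status mirrors the HTTP status a handler would return."""

    def __init__(self, status: int) -> None:
        super().__init__(f"rejected with {status}")
        self.status = status


def verify_webhook(secret: str, signature_hex: str, read_body: Callable[[], bytes]) -> bytes:
    # 1. Configuration check first: no request bytes are touched.
    if not secret:
        raise Rejected(503)
    # 2. A missing signature header can also be refused before the read.
    if not signature_hex:
        raise Rejected(401)
    # 3. Only now pay for the (bounded) body read that HMAC verification needs.
    body = read_body()
    expected = hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected.encode(), signature_hex.encode()):
        raise Rejected(401)
    return body


# Demo: count how often the body is actually read.
body_reads: list = []


def _read_body() -> bytes:
    body_reads.append(1)
    return b'{"action": "opened"}'


def status_of(secret: str, signature_hex: str) -> int:
    try:
        verify_webhook(secret, signature_hex, _read_body)
        return 200
    except Rejected as exc:
        return exc.status


good_signature = hmac.new(b"s3cret", b'{"action": "opened"}', hashlib.sha256).hexdigest()
```

In the demo, the two early rejections leave body_reads empty; the body is read only once a secret is configured and a signature is present.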
[C-4] bridge_self in-memory counters are not async-safe
File: packages/server/src/notify_bridge_server/services/bridge_self.py:186-230
record_poll_failure does _poll_failure_counts[tracker_id] = _poll_failure_counts.get(tracker_id, 0) + 1. These dicts are accessed concurrently from poll loop, HA push, webhook ingest, and dispatcher target-failure recording. Individual dict ops are atomic, but get + 1 + set is not when interleaved with another coroutine that touches the same key. Symptoms: missed threshold crossings, occasional double-emission. Same pattern in _target_failure_counts and _backlog_above_threshold.
Fix: Wrap mutating ops in an asyncio.Lock. The reset-and-re-arm semantics already assume serial access — make it explicit.
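A sketch of a lock-guarded counter, under the assumption that an await can sit between the read and the write (that is the point at which coroutines interleave). FailureCounter and its methods are hypothetical names; the asyncio.sleep(0) stands in for whatever awaitable work the real code performs.

```python
import asyncio


class FailureCounter:
    """Per-key failure counter whose read-modify-write is serialized by a lock."""

    def __init__(self, threshold: int) -> None:
        self._threshold = threshold
        self._counts: dict = {}
        self._lock = asyncio.Lock()

    async def record_failure(self, key: int) -> bool:
        """Return True only for the call that reaches the threshold."""
        async with self._lock:
            current = self._counts.get(key, 0)
            # An await between read and write is where coroutines interleave;
            # the lock keeps other writers out for the whole section.
            await asyncio.sleep(0)
            self._counts[key] = current + 1
            return current + 1 == self._threshold

    async def count(self, key: int) -> int:
        async with self._lock:
            return self._counts.get(key, 0)


async def main() -> tuple:
    counter = FailureCounter(threshold=3)  # created inside the running loop
    crossings = await asyncio.gather(*(counter.record_failure(7) for _ in range(10)))
    return await counter.count(7), sum(crossings)
```

With ten concurrent failures the counter ends at 10 and the threshold crossing is reported exactly once. Without the lock, every coroutine would read 0 before any of them wrote.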
[C-5] PROVIDER_SECRET_FIELDS audit needed for backup exports
File: packages/server/src/notify_bridge_server/api/providers.py:617-625 and services/backup_service.py:84-93
_apply_secrets_provider redacts only fields named in PROVIDER_SECRET_FIELDS. The webhook flow uses a field called webhook_secret (Gitea, Planka, generic) — verify this is in PROVIDER_SECRET_FIELDS (defined in backup_schema.py). A backup export with secrets_mode=INCLUDE that misses webhook_secret leaks a token that grants webhook-forgery rights.
Action: Audit PROVIDER_SECRET_FIELDS. Specifically check it includes: api_key, api_token, access_token, webhook_secret, password, client_secret, refresh_token. The _provider_response mask list at api/providers.py:620 is a good cross-reference — both should be the same constant.
HIGH
[H-1] _compile_template lru_cache competes across tenants
File: packages/server/src/notify_bridge_server/commands/handler.py:99-103
lru_cache(maxsize=256) is keyed by the raw template string. Superseded versions of edited templates stay cached until evicted. On a multi-tenant install one tenant's 256 distinct templates can evict another's. There is no invalidation on template edit.
Fix: Drop the cache (Jinja compile is sub-ms) OR add an invalidation call from the template-edit endpoints. The notification renderer (renderer.py:31) uses 512 slots — same problem; consistent fix.
[H-2] check_tracker is 381 lines with deep coupling
File: packages/server/src/notify_bridge_server/services/watcher.py:263-644
Loads tracker, polls, writes state, persists EventLog, evaluates gates, defers, dispatches, records bridge_self: all in one function. Refactor candidates: _poll_phase, _persist_state_and_events, _dispatch_phase. This is the watcher's hot path; bugs here affect every tracker tick.
[H-3] load_link_data returns untyped dict[str, Any]
File: packages/server/src/notify_bridge_server/services/dispatch_helpers.py:539-747
Five call sites consume ld["target_type"], ld.get("link_id"), etc. — no static guarantee against key typos.
Fix: Introduce a frozen @dataclass class LinkData. Same for per-receiver entries.
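A minimal sketch of such a dataclass. Only target_type and link_id are taken from the finding; the real dict carries more keys, and from_mapping is a hypothetical adapter for migrating call sites gradually.

```python
from dataclasses import FrozenInstanceError, dataclass
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class LinkData:
    target_type: str                # required: a missing key fails at construction
    link_id: Optional[int] = None   # optional, mirrors ld.get("link_id")

    @classmethod
    def from_mapping(cls, raw: Dict[str, Any]) -> "LinkData":
        # Hypothetical adapter from the existing untyped dict.
        return cls(target_type=raw["target_type"], link_id=raw.get("link_id"))


ld = LinkData.from_mapping({"target_type": "telegram", "link_id": 3})
```

A typo such as ld.target_typ now fails in a type checker (and raises AttributeError at runtime) instead of quietly returning None from a .get() call.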
[H-4] N+1 in _resolve_command_context template-slot loop
File: packages/server/src/notify_bridge_server/commands/handler.py:200-215
One SELECT per distinct command_template_config_id. Already batched for trackers/configs/providers — finish the job. Single WHERE config_id IN (...) query + Python pivot.
[H-5] N+1 in backup_service.export_backup receiver loop
File: packages/server/src/notify_bridge_server/services/backup_service.py:187-189
50 targets = 51 SELECTs. Batch with WHERE target_id IN (...). Audit other sections of this 941-line file for the same pattern (templates -> slots, command configs -> slots).
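The batching idea, sketched against the standard-library sqlite3 module rather than the project's SQLModel session. The receiver table and its columns are invented for the demo; the same single-query-plus-pivot shape applies to the H-4 slot lookup.

```python
import sqlite3
from collections import defaultdict
from typing import Dict, List


def receivers_by_target(conn: sqlite3.Connection, target_ids: List[int]) -> Dict[int, List[str]]:
    """Fetch receivers for all targets in one query, then group in Python."""
    if not target_ids:
        return {}
    placeholders = ",".join("?" for _ in target_ids)
    rows = conn.execute(
        "SELECT target_id, name FROM receiver "
        f"WHERE target_id IN ({placeholders}) ORDER BY id",
        target_ids,
    )
    grouped: Dict[int, List[str]] = defaultdict(list)
    for target_id, name in rows:
        grouped[target_id].append(name)
    # Targets without receivers still get an (empty) entry.
    return {tid: grouped.get(tid, []) for tid in target_ids}


# Demo on an in-memory database with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE receiver (id INTEGER PRIMARY KEY, target_id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO receiver (target_id, name) VALUES (?, ?)",
    [(1, "alice"), (2, "bob"), (1, "carol")],
)
result = receivers_by_target(conn, [1, 2, 3])
```

For very large id lists the IN clause may need chunking, since databases limit the number of bound parameters per statement.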
[H-6] _dirty_bots mutated from request and scheduler without a lock
File: packages/server/src/notify_bridge_server/services/command_sync.py:25-95
mark_bot_dirty runs in request handlers, _flush_dirty_bots on the scheduler executor. Currently safe (snapshot via ready = [...]) but fragile.
Fix: Snapshot under lock, or move to a thread-safe primitive.
[H-7] HA reconnect cycle has no way for CRUD to short-circuit a stale supervisor
File: packages/server/src/notify_bridge_server/services/ha_subscription.py:163-175
Because configuration is reloaded only on reconnect, the supervisor of a disabled HA provider keeps retrying at the 30s/300s cadence and only notices the change at its next reconnect attempt. CRUD endpoints should call reload_provider (defined at line 339); verify the wiring.
[H-8] Cached expunged ORM instances are footguns
File: packages/server/src/notify_bridge_server/services/event_dispatch.py:75-107
_load_trackers_cached returns expunged NotificationTracker rows. A future maintainer who calls session.add(tracker) on a stale cached instance risks a DetachedInstanceError or a silent re-INSERT. Document this strongly; ideally convert to a typed projection.
[H-9] Pending-restore at startup has no timeout
File: packages/server/src/notify_bridge_server/main.py:142-143
apply_pending_restore_if_any runs in lifespan; a partially-corrupt restore could block startup indefinitely. Container liveness probes then fail once their grace period expires.
Fix: asyncio.wait_for with a generous timeout, or kick off as background task while app starts.
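A sketch of the wait_for variant. run_with_deadline and the two fake restore coroutines are hypothetical; the timeout values are chosen only to keep the demo fast.

```python
import asyncio
from typing import Awaitable, Callable


async def run_with_deadline(step: Callable[[], Awaitable[None]], timeout_s: float) -> bool:
    """Run a startup step; return False if it did not finish in time."""
    try:
        await asyncio.wait_for(step(), timeout=timeout_s)
        return True
    except asyncio.TimeoutError:
        # wait_for has already cancelled the step at this point.
        return False


async def _quick_restore() -> None:
    await asyncio.sleep(0)


async def _stuck_restore() -> None:
    await asyncio.sleep(60)  # stands in for a restore that never completes


async def main() -> tuple:
    ok = await run_with_deadline(_quick_restore, timeout_s=1.0)
    stuck = await run_with_deadline(_stuck_restore, timeout_s=0.05)
    return ok, stuck
```

Note that wait_for cancels the step on timeout, so this is only appropriate if an interrupted restore leaves the database in a recoverable state; otherwise the background-task option is the safer choice.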
[H-10] Jinja2 render watchdog uses daemon thread that can pin a CPU forever
File: packages/core/src/notify_bridge_core/templates/renderer.py:48-73
Comment acknowledges the trade-off. Multiple concurrent runaway renders can exhaust CPU cores while callers think they timed out. Add a process-level BoundedSemaphore capping concurrent in-flight renders.
[H-11] _aggregate drops all but the first error
File: packages/server/src/notify_bridge_server/services/notifier.py:326-335
When all sends fail, only results[0] is returned. Distinct subsequent errors are lost.
Fix: Aggregate all errors into a details field.
[H-12] Generic-webhook header dict materialised twice
File: packages/server/src/notify_bridge_server/api/webhooks.py:456 and line 475
dict(request.headers) materialises full headers map, then _filter_headers and _redact_sensitive_body walk the payload. With a malicious peer sending many headers (Starlette default 100), bounded but wasteful.
[H-13] SSRF redirect-walk has no aggregate wall-clock budget
File: packages/core/src/notify_bridge_core/notifications/telegram/client.py:232-268
max_redirects = 3, each hop with the 120s _DOWNLOAD_TIMEOUT. Worst case per request: 480s (the initial request plus three redirects). _TARGET_TIMEOUT_S = 120s in the dispatcher caps the top-level case, but per-asset preloads inside media groups don't all share that cap.
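One way to add an aggregate budget is a shared deadline that shrinks each hop's timeout. This is a sketch: fetch_hop is a hypothetical callable returning (body, next_url_or_None), the 150s default budget is an arbitrary assumption, and the per-hop SSRF re-validation is omitted.

```python
import asyncio
from typing import Awaitable, Callable, Optional, Tuple

Hop = Callable[[str], Awaitable[Tuple[bytes, Optional[str]]]]


async def fetch_with_budget(fetch_hop: Hop, url: str, *, max_redirects: int = 3,
                            per_hop_timeout_s: float = 120.0,
                            total_budget_s: float = 150.0) -> bytes:
    loop = asyncio.get_running_loop()
    deadline = loop.time() + total_budget_s
    for _ in range(max_redirects + 1):  # initial request plus redirects
        remaining = deadline - loop.time()
        if remaining <= 0:
            raise asyncio.TimeoutError("redirect walk exceeded its total budget")
        # Each hop gets the smaller of its own timeout and what is left overall.
        body, next_url = await asyncio.wait_for(
            fetch_hop(url), timeout=min(per_hop_timeout_s, remaining)
        )
        if next_url is None:
            return body
        url = next_url
    raise RuntimeError("too many redirects")


async def _slow_chain(url: str):
    # Fake transport: three redirects (0 -> 1 -> 2 -> 3), 50 ms per hop.
    await asyncio.sleep(0.05)
    hop = int(url)
    return (b"payload", None) if hop >= 3 else (b"", str(hop + 1))


async def demo(total_budget_s: float) -> str:
    try:
        await fetch_with_budget(_slow_chain, "0", total_budget_s=total_budget_s)
        return "ok"
    except asyncio.TimeoutError:
        return "timeout"
```

With a generous budget the four-request chain completes; with a budget shorter than the chain, the walk is cut off mid-redirect regardless of the per-hop timeout.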
[H-14] Backlog recovery logic flips latch for in-flight users
File: packages/server/src/notify_bridge_server/services/bridge_self.py:544-551
Recovery loop iterates all known users and flips to False for any not in counts_by_user. If a user transiently has no user_id set on deferred rows (legacy / orphaned), they're excluded from the GROUP BY and incorrectly marked recovered.
[H-15] quiet_hours_status silently returns None on start == end
File: packages/server/src/notify_bridge_server/services/dispatch_helpers.py:110-111
The comment notes this is almost always a user mistake. Silent return means the user wonders why their notifications still arrive at all hours. Surface via WARNING log + UI hint.
MEDIUM
[M-1] register_commands_with_telegram chat overrides loop is sequential
File: packages/server/src/notify_bridge_server/commands/handler.py:723-776
50 chats with overrides = 50 sequential Telegram round-trips. Use asyncio.gather with a semaphore as in _refresh_telegram_chat_titles.
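A sketch of the gather-with-semaphore shape. gather_bounded and fake_call are hypothetical; the limit of 5 is an arbitrary choice, and how _refresh_telegram_chat_titles is actually written is not shown in this review.

```python
import asyncio
from typing import Awaitable, Callable, List


async def gather_bounded(jobs: List[Callable[[], Awaitable[object]]], limit: int) -> list:
    sem = asyncio.Semaphore(limit)

    async def run(job):
        async with sem:
            return await job()

    # return_exceptions=True keeps one failing chat from hiding the other results.
    return await asyncio.gather(*(run(job) for job in jobs), return_exceptions=True)


async def main():
    in_flight = 0
    peak = 0

    async def fake_call(chat_id: int) -> int:
        # Stand-in for one Telegram round-trip; tracks peak concurrency.
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)
        in_flight -= 1
        return chat_id

    jobs = [lambda cid=cid: fake_call(cid) for cid in range(20)]
    results = await gather_bounded(jobs, limit=5)
    return results, peak
```

Results come back in submission order, and no more than `limit` calls are in flight at once, which keeps the burst within whatever rate the remote API tolerates.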
[M-2] _run_provider exception backoff has no escalation
File: packages/server/src/notify_bridge_server/services/ha_subscription.py:278-283
A persistent bug in _emit triggers a reconnect every 30s forever. Add exponential backoff with a cap, and raise a bridge_self alert after N failures.
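A sketch of capped exponential backoff. The 30s base matches the interval named above and the 300s cap borrows the upper cadence mentioned in H-7; both defaults, the alert threshold of 5, and the function names are assumptions.

```python
def reconnect_delay(consecutive_failures: int, base_s: float = 30.0, cap_s: float = 300.0) -> float:
    """Delay before the next attempt: base, doubled per failure, clamped to cap."""
    if consecutive_failures < 1:
        return 0.0
    exponent = min(consecutive_failures - 1, 16)  # bound the power to avoid huge numbers
    return min(cap_s, base_s * (2 ** exponent))


def should_alert(consecutive_failures: int, alert_after: int = 5) -> bool:
    """True exactly once, on the failure that reaches the alert threshold."""
    return consecutive_failures == alert_after


delays = [reconnect_delay(n) for n in range(1, 7)]
```

Adding random jitter to the delay would further avoid synchronized reconnects when several providers fail together.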
[M-3] database/migrations.py is 1880 lines
File: packages/server/src/notify_bridge_server/database/migrations.py
Past the 800-line guideline. Split into one file per migration under database/migrations/ and list them in main.py.
[M-4] Locale-resolution logic duplicated
File: packages/server/src/notify_bridge_server/services/dispatch_helpers.py:484-491 and services/notifier.py:46
Two implementations of locale priority. One source of truth.
[M-5] _normalize_locale duplicated across modules
File: packages/server/src/notify_bridge_server/commands/handler.py:632
Five-line copy; move to commands/command_utils.py.
[M-6] Provider-type if-chain in _test_provider_connection
File: packages/server/src/notify_bridge_server/api/providers.py:203-250
Same chain in services/__init__.py:_make_collection_provider. Both candidates for a single registry.
[M-7] Secret masking exposes last 4 chars unconditionally
File: packages/server/src/notify_bridge_server/api/providers.py:624 and services/backup_service.py:81
Fine for 32-char Immich keys. Returns half the value for short secrets. Use plain "***" for len(value) < 16.
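A sketch of the length-gated mask. The "***" prefix plus four trailing characters is an assumed output format; the existing mask helper may format its output differently.

```python
def mask_secret(value: str, min_len_for_tail: int = 16, tail: int = 4) -> str:
    """Show the last `tail` characters only when the secret is long enough."""
    if len(value) < min_len_for_tail:
        return "***"
    return "***" + value[-tail:]
```

An 8-character secret now yields a bare "***", while a 32-character key still shows its last four characters for identification.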
[M-8] Deprecated validate_outbound_url still imported
File: packages/core/src/notify_bridge_core/providers/immich/client.py:14
The sync version uses blocking socket.getaddrinfo on the event loop. Migrate to avalidate_outbound_url.
[M-9] Lazy cache init has confusing DCL comment
File: packages/server/src/notify_bridge_server/services/watcher.py:81-113
The "Double-check after acquiring lock" comment implies classic DCL. Under asyncio the unlocked first check is safe because there is no thread context switch, but reword the comment to make that clear.
[M-10] Dispatcher concurrency cap is per-dispatch, not process-wide
File: packages/core/src/notify_bridge_core/notifications/dispatcher.py:58
_DISPATCH_CONCURRENCY = 16 is INSIDE dispatch(). HA storm = N events x min(M, 16) sends with no outer cap. Add a process-level semaphore in event_dispatch.py.
[M-11] success=True returned for partial failures
File: packages/server/src/notify_bridge_server/services/notifier.py:329-335
A test that fails on 1 of 3 receivers returns success=True with a partial_failures count. Introduce a status: "ok"|"partial"|"fail" field.
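A sketch of a three-state aggregate that also keeps every distinct error, which addresses the lost errors noted in H-11. SendResult and AggregateResult are hypothetical stand-ins for whatever notifier.py returns today.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SendResult:
    success: bool
    error: Optional[str] = None


@dataclass
class AggregateResult:
    status: str  # "ok", "partial", or "fail"
    failed: int = 0
    errors: List[str] = field(default_factory=list)


def aggregate(results: List[SendResult]) -> AggregateResult:
    failures = [r for r in results if not r.success]
    # dict.fromkeys de-duplicates while preserving first-seen order.
    errors = list(dict.fromkeys(r.error or "unknown error" for r in failures))
    if not failures:
        status = "ok"
    elif len(failures) == len(results):
        status = "fail"
    else:
        status = "partial"
    return AggregateResult(status=status, failed=len(failures), errors=errors)


mixed = aggregate([
    SendResult(True),
    SendResult(False, "timeout"),
    SendResult(False, "403 forbidden"),
])
```

An empty result list maps to "ok" in this sketch; a caller that treats "no receivers" as an error would need to check for that case separately.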
[M-12] Telegram command registration not retried on 429
File: packages/server/src/notify_bridge_server/commands/handler.py:671-693
set_my_commands/delete_my_commands aren't retried. Adopt the retry-after handling that _upload_media has.
[M-13] event_log_id_by_event keyed on id(event)
File: packages/server/src/notify_bridge_server/services/watcher.py:417-464
CPython object-address as key works because events are held alive in scope, but a typed key would be safer.
[M-14] Bcrypt-length error wording could be clearer
File: packages/server/src/notify_bridge_server/auth/routes.py:69-81
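A user typing 70 ASCII characters plus an emoji gets rejected and doesn't understand why: the limit counts UTF-8 bytes, not characters, and bcrypt only uses the first 72 bytes. Clarify the byte-count language.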
[M-15] CSP allows unsafe-inline for script-src
File: packages/server/src/notify_bridge_server/main.py:186-201
Acknowledged. SvelteKit's kit.csp configuration (hash mode) emits script hashes; switching to it unblocks dropping unsafe-inline.
[M-16] Telegram-webhook body size not capped
File: packages/server/src/notify_bridge_server/commands/webhook.py:71
update = await request.json() reads with no cap. Add _read_bounded_body pattern.
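The bounded-read idea, sketched over a generic async iterator of byte chunks (Starlette's request.stream() yields chunks in this shape). read_bounded and BodyTooLarge are hypothetical names and do not reproduce the project's _read_bounded_body.

```python
import asyncio
from typing import AsyncIterator


class BodyTooLarge(Exception):
    pass


async def read_bounded(chunks: AsyncIterator[bytes], limit: int) -> bytes:
    """Accumulate chunks, aborting as soon as the total would exceed limit."""
    buf = bytearray()
    async for chunk in chunks:
        if len(buf) + len(chunk) > limit:
            raise BodyTooLarge(f"body exceeds {limit} bytes")
        buf += chunk
    return bytes(buf)


async def _fake_stream(total: int, chunk_size: int = 4):
    # Stand-in for a request body arriving in small chunks.
    for offset in range(0, total, chunk_size):
        yield b"x" * min(chunk_size, total - offset)


async def demo(total: int, limit: int) -> str:
    try:
        body = await read_bounded(_fake_stream(total), limit)
        return f"read {len(body)}"
    except BodyTooLarge:
        return "too large"
```

The JSON parse then runs on the returned bytes, so an oversized update is rejected before it is fully buffered.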
[M-17] _log_command_event swallows DB failures invisibly
File: packages/server/src/notify_bridge_server/commands/handler.py:353-357
Hard DB failure here is invisible. Add a metrics counter.
[M-18] apply_tracking_display_filters is a 60-line if-branched function
File: packages/server/src/notify_bridge_server/services/dispatch_helpers.py:350-405
Split into _filter_favorites, _apply_order_and_limit, _strip_details_and_tags.
LOW
[L-1] from .database.models import * in main.py
File: packages/server/src/notify_bridge_server/main.py:26
Comment is honest about purpose, but explicit imports or a single module import is clearer.
[L-2] None comparisons
All comparisons verified to use is None via grep — no findings.
[L-3] Magic numbers
Constants are well-named throughout (_TG_429_MAX_ATTEMPTS, _MAX_PENDING_PER_TRACKER, DEBOUNCE_SECONDS, etc.). Only nit: seconds=30 literal in scheduler.schedule_bot_polling could be promoted.
[L-4] noqa E712 repeated 8+ times for SQLModel boolean comparisons
Switch to .is_(True), the SQLAlchemy idiom, or add E712 to the ignore list in the project ruff config.
[L-5] _check_same_origin is best-effort by design
Acceptable.
[L-6] _normalize_host strips IPv6 zone IDs silently
File: packages/core/src/notify_bridge_core/notifications/ssrf.py:105-106
Debug log when stripping changes the host would help diagnose.
[L-7] _compute_jitter cap of 30s might be tight on hourly polls
File: packages/server/src/notify_bridge_server/services/scheduler.py:91-105
Revisit if jitter-collision becomes a real-world issue.
[L-8] SmtpConfig repr may leak password
File: packages/server/src/notify_bridge_server/services/notifier.py:205-213
If SmtpConfig is a vanilla dataclass, repr() will leak the password. Verify in notify_bridge_core.notifications.email.client — add field(repr=False) or a custom repr.
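A sketch of the field(repr=False) option. SmtpSettings is a hypothetical stand-in; the real SmtpConfig fields are not shown in this review.

```python
from dataclasses import dataclass, field


@dataclass
class SmtpSettings:
    host: str
    username: str
    # Excluded from the generated __repr__, still a normal attribute.
    password: str = field(repr=False)


cfg = SmtpSettings(host="smtp.example.org", username="bridge", password="hunter2")
```

The password remains readable as cfg.password but no longer appears when the object is logged or shown in a traceback.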
[L-9] noqa BLE001 count is high
49 occurrences across 26 files. Each defensible; consider narrowing where possible.
[L-10] _normalize_for_json does not handle UUID/Decimal
File: packages/server/src/notify_bridge_server/services/deferred_dispatch.py:124-133
No current consumer emits these, but a fallback str() for unknown types would prevent future breakage.
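A sketch of a normalizer with a str() fallback. This is a hypothetical reimplementation for illustration; the existing _normalize_for_json may already handle further types (for example datetimes) that are not reproduced here.

```python
import json
from decimal import Decimal
from typing import Any
from uuid import UUID


def normalize_for_json(value: Any) -> Any:
    if value is None or isinstance(value, (str, int, float, bool)):
        return value
    if isinstance(value, dict):
        return {str(k): normalize_for_json(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [normalize_for_json(v) for v in value]
    # Fallback: anything unrecognised (UUID, Decimal, ...) becomes its string form.
    return str(value)


payload = normalize_for_json(
    {"id": UUID("12345678-1234-5678-1234-567812345678"), "price": Decimal("1.50"), "tags": ("a", 1)}
)
encoded = json.dumps(payload)
```

The fallback trades type fidelity for robustness: a Decimal comes back as a string on reload, so consumers that need the numeric value must parse it explicitly.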
Approval Verdict
Block — CRITICAL findings (C-1 unstored task, C-2 missing rollback, C-3 unauthenticated body read, C-4 racy counters, C-5 secret-mask audit) must be fixed before declaring production-ready. Once those are addressed, the HIGH findings can land in a follow-up.
Quick Wins (low effort, high value)
- Wrap every fire-and-forget asyncio.create_task in a module-level set — search for asyncio.create_task( with no assignment. Definite hit: ha_subscription.py:249.
- Move webhook-secret check before _read_bounded_body in Gitea + generic webhook handlers — 5-line move per endpoint, eliminates pre-auth resource exhaustion.
- Add an asyncio.Lock around _poll_failure_counts and _target_failure_counts mutations — eliminates C-4.
- Split migrations.py — mechanical refactor, ~1 hour, improves blame/review.
- Batch the receiver query in backup_service.export_backup — single IN (...) query, ~10x faster.
- Replace from .database.models import * with explicit imports — small clarity win.