feat: observability, per-receiver Telegram options, oversized-video fallback

Operability:
- Correlation IDs end-to-end: shared dispatch_id between log lines and
  EventLog rows (event/watcher/scheduled/deferred/action/HA/command paths)
  and a new X-Request-Id middleware that normalizes inbound ids and binds
  request_id into log context.
- dispatch_summary block merged into EventLog.details: per-target
  success/failure counts plus Telegram media delivered/skipped/failed and
  truncated error lists, so partial outcomes surface in the UI.
- Diagnostic mode: admin can flip one module to DEBUG for a bounded
  window with auto-revert (in-memory only; setup_logging() resets on
  boot, lifespan reverts on shutdown). New /diagnostic-mode endpoints
  plus DiagnosticsCassette UI on the settings page.

Telegram:
- Per-receiver options: disable_notification (silent send) and
  message_thread_id (forum-topic routing), wired through the dispatcher
  via a ContextVar so all four send sites (sendMessage / sendPhoto-Video-
  Document / sendMediaGroup / cache-hit POST) pick them up.
- send_large_videos_as_documents target setting: bypass the 50 MB
  sendVideo cap by falling back to sendDocument for oversized videos.
- sendMediaGroup byte-budget enforcement (TELEGRAM_MAX_GROUP_TOTAL_BYTES,
  45 MB) with per-item fallback on chunk failure so a stale file_id no
  longer silently drops a cached asset.

Tests:
- New: diagnostic_mode, dispatch_summary, request_correlation,
  telegram_media_group_partial, telegram_per_send_options.

Docs:
- .claude/reviews/: six-axis production-readiness review of v0.8.1.
- .claude/docs/functional-review-2026-05-28.md: focused review of
  Telegram/Immich/logging subsystems.
2026-05-28 15:19:31 +03:00
parent 85a8f1e71c
commit 6a8f374678
39 changed files with 7239 additions and 142 deletions
@@ -0,0 +1,435 @@
# Functional Review — Telegram, Immich, Logging (2026-05-28)
Snapshot review of three subsystems, with prioritised improvement candidates.
Pairs with [feature-backlog.md](feature-backlog.md) — items here are
infrastructure that unlocks several backlog features.
All citations are from the working tree at commit `85a8f1e` (master). Two
files (`packages/core/src/notify_bridge_core/notifications/telegram/client.py`,
`media.py`) had uncommitted changes at review time — see Telegram §
"In-flight work".
---
## 1. Telegram infrastructure
### Telegram — what works well
- Single chokepoint `TelegramClient`
([packages/core/src/notify_bridge_core/notifications/telegram/client.py](../../packages/core/src/notify_bridge_core/notifications/telegram/client.py))
covers text/photo/video/document/media-group, with 429-aware retry,
parse-error retry, file_id cache, multi-bot per-token instances,
polling + webhook modes, and bot-command registration.
- CLAUDE.md rule #6 satisfied for the production paths.
- Caption length, group sizing, parse-mode fallback all enforced.
### In-flight work
Byte-budget sub-chunking for media groups
(`TELEGRAM_MAX_GROUP_TOTAL_BYTES` in
[media.py](../../packages/core/src/notify_bridge_core/notifications/telegram/media.py))
with per-item fallback inside `_send_media_group`. Logic is coherent;
before commit, verify `_build_media_items` callers still match the new
signature (caption no longer injected at fetch time).
### Gaps, ranked by user-visible value
1. **No inline keyboards / `callback_query` handlers** — zero infra for
"Favorite / Archive / Dismiss" buttons on Immich notifications.
Biggest UX unlock; prerequisite for several Immich smart actions.
2. **No edit-in-place** (`editMessageText` not wrapped). Pairs naturally
with deferred dispatch / quiet hours coalescing — 5 separate
"asset added" messages become 1 edited message.
3. **`disable_notification` (silent send) not exposed** — already a
Telegram primitive; slots into the quiet-hours `silent` mode the
backlog already mentions.
4. **`message_thread_id` (forum topics)** — single field per receiver;
unblocks supergroup-with-topics users.
5. **Direct `TelegramClient(...)` constructions** in
[api/telegram_bots.py:314,394,404,412](../../packages/server/src/notify_bridge_server/api/telegram_bots.py)
bypass `get_telegram_client()` — violates CLAUDE.md rule #6 and
skips the shared file_id cache.
6. **Per-command authorization**: `commands_enabled` is all-or-nothing
per chat; no per-command allowlist or admin gate.
7. **Long-message splitting**: `send_message` silently truncates at
4096 ([client.py:492](../../packages/core/src/notify_bridge_core/notifications/telegram/client.py)).
8. **No parse-mode per target** — HTML hardcoded.
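Gaps 3 and 4 both come down to two optional fields on the outgoing Bot API payload. The sketch below shows one way to normalise and attach them. `disable_notification` and `message_thread_id` are real Bot API parameters, but the helper names `coerce_thread_id` and `apply_send_options` are hypothetical and are not taken from the codebase.

```python
from __future__ import annotations

from typing import Any


def coerce_thread_id(value: Any) -> int | None:
    """Return a positive forum-topic id, or None when the field should be omitted.

    Hypothetical helper: treats bools, blanks, zero and negatives as "unset".
    """
    if isinstance(value, bool):  # bool is an int subclass, so reject it explicitly
        return None
    if isinstance(value, str):
        value = value.strip()
        if not value:
            return None
    try:
        thread_id = int(value)
    except (TypeError, ValueError, OverflowError):
        return None
    return thread_id if thread_id > 0 else None


def apply_send_options(
    payload: dict[str, Any],
    *,
    disable_notification: bool = False,
    message_thread_id: Any = None,
) -> dict[str, Any]:
    """Return a copy of the payload with the optional per-receiver fields added."""
    out = dict(payload)
    if disable_notification:
        out["disable_notification"] = True
    thread_id = coerce_thread_id(message_thread_id)
    if thread_id is not None:
        out["message_thread_id"] = thread_id
    return out
```

Omitting the keys entirely when they are unset keeps the request identical to today's payload for receivers that never configure either option.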
---
## 2. Immich
### Immich — what works well
- Mature polling pipeline: incremental delta-fetch via `updatedAfter`,
pending-asset tracking, fingerprint fast-path skip, fallback to full
fetch on count-decrease
([providers/immich/provider.py](../../packages/core/src/notify_bridge_core/providers/immich/provider.py)).
- Rich bot commands (status / albums / events / people / search / latest
/ random / favorites / summary / memory) with full asset context
(CLAUDE.md rule #10 satisfied).
- `auto_organize` action is well-shaped: AND person + smart-query union,
exclusions, type/date/favorite filters, 500-asset batched add,
idempotent diff against album asset_ids, dry-run, `ActionExecution`
log.
- Three scheduled features wired: periodic summaries, scheduled-asset
delivery, Memory/On-This-Day (with native Immich memory API + fallback).
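The "idempotent diff" and "500-asset batched add" behaviour of `auto_organize` can be illustrated as follows. This is a sketch of the general technique only. The function names are hypothetical and the real executor may be structured differently.

```python
from __future__ import annotations

from collections.abc import Iterable, Iterator

BATCH_SIZE = 500  # batch size quoted in the review


def missing_asset_ids(
    candidates: Iterable[str], album_asset_ids: Iterable[str]
) -> list[str]:
    """Return candidate ids not yet in the album, de-duplicated, in input order."""
    existing = set(album_asset_ids)
    seen: set[str] = set()
    missing: list[str] = []
    for asset_id in candidates:
        if asset_id in existing or asset_id in seen:
            continue
        seen.add(asset_id)
        missing.append(asset_id)
    return missing


def batched(ids: list[str], size: int = BATCH_SIZE) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` ids."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]
```

Because only the difference is sent, re-running the same rule against an album that already contains every match produces no batches at all, which is what makes the action safe to repeat.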
### Highest-leverage candidates
1. **Webhook ingestion**: `webhook_based=False` at
[capabilities.py:46](../../packages/core/src/notify_bridge_core/providers/capabilities.py).
Sub-second latency vs the current 5-min poll. New
`/api/webhooks/immich/{secret}` route + parser + capability flip.
2. **Share-link expiry monitoring + auto-rotate action** — links
silently break today; data is already fetched per event
([provider.py:541-569](../../packages/core/src/notify_bridge_core/providers/immich/provider.py)).
3. **Duplicate cluster digest** — Immich >= 1.100 `/api/duplicates` is
unused; pairs with inline buttons for "merge / ignore 30d".
4. **Auto-favorite by person** (already in backlog) — smallest delta on
the existing `auto_organize` executor.
5. **Per-person notification subscription** — tracker-config filter,
reuses existing `asset.people` data.
6. **Album auto-curation from Inbox** — date-based target album name,
move (not copy); needs the Immich move endpoint (currently we only
`add_assets_to_album`).
7. **Storage / job-queue alerts**: `/api/server/stats` and `/api/jobs`
unused; lightweight poll + threshold = "disk full" / "transcoding
stalled" notifications.
8. **Smart-action infra polish** — descriptors are reusable, but the
rule editor is JSON-shaped, action-run statistics aren't aggregated,
and dry-run shows counts not the asset list. Address before adding 5
more action types.
---
## 3. Logging
### What's already in place
In [logging_setup.py](../../packages/server/src/notify_bridge_server/logging_setup.py):
- `dictConfig` with `JsonFormatter` (line-delimited JSON) toggleable via
`NOTIFY_BRIDGE_LOG_FORMAT=json`.
- `SecretMaskingFilter` redacts Telegram bot tokens + Authorization /
api_key / password / refresh_token across `msg`, `exc_text`,
`stack_info`.
- ContextVar-driven record factory injects `request_id`, `command`,
`chat_id`, `bot_id`, `dispatch_id` on every record. Text format:
`[req=- cmd=- bot=- chat=- disp=-]`.
- Per-module overrides via `NOTIFY_BRIDGE_LOG_LEVELS` env or DB
`AppSetting`. Live runtime patch via `apply_log_levels()` — no
restart.
- Noisy libs pre-quieted (sqlalchemy, aiohttp, apscheduler, urllib3,
asyncio, httpx, httpcore, PIL, uvicorn.access).
Plus:
- `EventLog` table with structured rows (event_type, status,
assets_count, details JSON, FKs to tracker/provider/action/
command_tracker/bot), `event_log_retention_days=30` default, daily
APScheduler cleanup `_cleanup_old_events`
([scheduler.py:332](../../packages/server/src/notify_bridge_server/services/scheduler.py)).
- Prometheus counter `notify_bridge_event_log_total{status,event_type}`.
- Frontend viewer with filters at
[api/status.py](../../packages/server/src/notify_bridge_server/api/status.py).
- `bind_log_context` actually used in: dispatcher (dispatch_id),
telegram_poller (bot/chat/command/request_id), webhook commands.
### Gaps, ordered by debug-pain payoff
1. **No FastAPI request-ID middleware.** `request_id_var` is set only
in webhook + Telegram poller paths. Every REST call from the SPA
logs as `req=-`. Tiny middleware (read `X-Request-Id` or
`uuid4()`, bind context, echo header) closes this whole-app blind
spot.
2. **`dispatch_id` is in log lines but NOT persisted on the `EventLog`
row.** Means you can find the failed row in the UI but can't grep
stderr for the matching `disp=...`. Stash it in `details.dispatch_id`
(no migration needed) — biggest cross-surface correlation win.
3. **HTTP access log is uvicorn default**
(`access_log=not _cfg.debug` at
[main.py:419](../../packages/server/src/notify_bridge_server/main.py)).
Doesn't include `request_id`, latency, user, status as structured
fields. Replace with a small `RequestLoggerMiddleware` that emits
`method`, `path`, `status`, `latency_ms`, `request_id`.
4. **Telegram media-group failures log richly but aren't linked to the
resulting `EventLog` row.** The dispatcher result-aggregation work
in flight is the right place to dump `errors[]` into
`EventLog.details.errors`.
5. **In-browser log access is missing.** EventLog rows are visible, but
raw logger output requires container/SSH access. A bounded
in-memory ring-buffer endpoint (admin-only, last N lines, filtered
by context fields) would mean ~90% of triage stays in the UI.
6. **No "diagnostic mode" UI.** The runtime `apply_log_levels()` is
great but only reachable through the app-settings JSON editor.
A "Debug for 15 minutes: `notify_bridge_core.notifications.telegram.client`"
button with auto-revert is a few-hours job.
7. **`EventLog.details` is freeform.** Frontend already destructures
`dispatch_status`, `deferred_until`, `deferred_for_seconds`,
`original_event_log_id`
([types.ts:238-261](../../frontend/src/lib/types.ts)). Define a
typed `EventLogDetails` per `event_type` (Pydantic at the boundary)
— prevents drift between providers.
8. **No log rotation**: `StreamHandler(sys.stderr)` only. Fine in
containers, brittle on bare-metal. Optional `RotatingFileHandler`
opt-in via env.
9. **No slow-query / outbound-HTTP timing logs.**
`sqlalchemy.engine=WARNING` by default; no per-query duration log.
Same for outbound calls to Immich / Telegram. A
"duration_ms >= N" threshold logger would surface "why is this
dispatch slow" without flipping global DEBUG.
10. **Action dry-run output is logger-only.** Could be streamed into
the action editor.
11. **Poll-result not persisted.** Webhook payloads are logged
([api/webhook_logs.py](../../packages/server/src/notify_bridge_server/api/webhook_logs.py)),
but Immich/Google-Photos poll cycles emit no
"last poll: 0 changes / 245ms" row. A lightweight
`provider_poll_log` (small table or ring buffer) would answer
"is the poller actually running" without reading stderr.
---
## Recommended sequencing
| # | Item | Status | Why first |
| --- | --- | --- | --- |
| 1 | Request-ID middleware + persist `dispatch_id` on `EventLog` | **SHIPPED 2026-05-28** | Unlocks the rest of the debug story; ~2 hours combined |
| 2 | Finish in-flight Telegram byte-budget chunking + write `errors[]` into `EventLog.details` | **SHIPPED 2026-05-28** | Already half-done; aligns with #1 |
| 3 | Telegram inline keyboards + `callback_query` handler | not started | Prereq for several Immich smart actions |
| 4 | Telegram `disable_notification` + `message_thread_id` per target | **SHIPPED 2026-05-28** | Small, also feeds the open Quiet Hours v1 backlog item |
| 5 | Immich webhook ingestion | not started | 5-min → sub-second; biggest user-facing latency win |
| 6 | Immich share-link expiry + auto-rotate (using #3) | not started | Real silent-breakage today |
| 7 | Diagnostic-mode UI (live log-level toggle with auto-revert) | **SHIPPED 2026-05-28** | Shifts triage to the browser |
| 8 | Immich duplicate digest + auto-favorite by person | not started | Both ride on #3 |
Items 1-4 are infrastructure that unlocks items 5-8. Items 1, 2, and 4 also
smooth the Quiet Hours v1 / target-level windows work that's top of the
backlog, so they are worth landing before that feature: quiet hours can then
dispatch through edited messages and silent sends from day one.
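For item 1 in the table, the core of a request-ID middleware is a small normalisation step: accept a safe inbound `X-Request-Id`, otherwise mint a fresh one. The sketch below shows that step in isolation. The allowed character set and the 64-character cap are assumptions, and the function names are hypothetical. Excluding `:` from inbound values keeps clients from imitating server-minted ids that carry a `req:` or `disp:` prefix.

```python
from __future__ import annotations

import re
import uuid

# Assumed policy: letters, digits, dot, underscore, hyphen; 1 to 64 characters.
# The colon is deliberately absent so inbound ids cannot look server-minted.
_SAFE_REQUEST_ID = re.compile(r"[A-Za-z0-9._-]{1,64}")


def mint_request_id() -> str:
    """Create a server-side id of the form req:<12 hex>."""
    return f"req:{uuid.uuid4().hex[:12]}"


def normalize_request_id(inbound: str | None) -> str:
    """Echo a safe inbound id unchanged, or replace it with a minted one."""
    if inbound is not None and _SAFE_REQUEST_ID.fullmatch(inbound):
        return inbound
    return mint_request_id()
```

A middleware would call `normalize_request_id` on the inbound header, bind the result into the logging context, and set the same value on the response header.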
---
## Decision log
- **2026-05-28** — Review completed. Starting work on item #1
(request-id middleware + persist `dispatch_id` on `EventLog`).
- **2026-05-28** — Item #1 shipped. Summary of the change:
- New helpers in
[packages/core/src/notify_bridge_core/log_context.py](../../packages/core/src/notify_bridge_core/log_context.py):
`ensure_dispatch_id()` (reuse existing or mint a new
`disp:<12 hex>`) and `enrich_details_with_correlation(details)`
(shallow-copy a details dict and merge active `dispatch_id` /
`request_id` from the ContextVar snapshot).
- New `RequestContextMiddleware` in
[packages/server/src/notify_bridge_server/main.py](../../packages/server/src/notify_bridge_server/main.py)
that reads inbound `X-Request-Id` (charset/length validated, `:`
excluded so a client can't masquerade as a server-minted id),
falls back to `req:<12 hex>`, binds the value via
`bind_log_context`, and echoes it back as the response header.
Added LAST so it's the outermost middleware.
- Outer entry points now bind a `dispatch_id` via a thin wrapper
function (`check_tracker`, `dispatch_provider_event`,
`dispatch_scheduled_for_tracker`, `_process_row`, `run_action`).
All 10 `EventLog(...)` creation sites wrap their `details=`
payload in `enrich_details_with_correlation(...)`.
- Switched `NotificationDispatcher.dispatch` to use
`ensure_dispatch_id()` instead of inline `uuid.uuid4()`.
- New tests in
[packages/server/tests/test_request_correlation.py](../../packages/server/tests/test_request_correlation.py)
(12 tests) covering header echo, charset validation, prefix-
masquerade rejection, helper merge semantics. All 239 server
tests green.
- Reviewed by `python-reviewer` subagent (no CRITICAL/HIGH; 3 MEDIUM
and 1 LOW addressed: PEP 8 imports moved to top of main.py;
`RequestResponseEndpoint` type added to `dispatch`; `:` dropped
from the request-id charset; shallow-copy caveat documented).
- Live smoke verified: generated id `req:a9b9821f5aab` on plain
request; safe inbound `my-trace-abc123` echoed unchanged;
`disp:fake12345678` correctly replaced; watcher tick log lines now
show distinct `disp=disp:<hex>` per tracker check.
- **2026-05-28** — Item #2 shipped. Summary of the change:
- Confirmed the in-flight Telegram byte-budget media-group chunking
in
[telegram/client.py](../../packages/core/src/notify_bridge_core/notifications/telegram/client.py)
is complete (15/15 media-group tests pass). Deleted the now-unused
`split_media_by_upload_size()` from
[telegram/media.py](../../packages/core/src/notify_bridge_core/notifications/telegram/media.py).
- New module
[services/dispatch_summary.py](../../packages/server/src/notify_bridge_server/services/dispatch_summary.py)
with `summarize_dispatch_results()` (aggregator),
`attach_summary_in_place()` (in-session) and
`record_dispatch_summary_async()` (post-commit). Captures
`targets_attempted/succeeded/failed`, per-target `errors`,
media-group `media{delivered,skipped,failed}` counts and
`media_errors[]` from the new
`TelegramClient._send_media_group` partial-failure path.
Bounded: 20 errors / 20 media errors / 500-char message cap with
explicit `…[truncated]` marker.
- Wired at 4 dispatch sites:
- `event_dispatch.py`: accumulates per-target results across all
tracking-config groups, attaches summary in-session before
commit.
- `deferred_dispatch.py`: inlines summary into the new EventLog
row's `details` for both `delivered_after_quiet_hours` and
`deferred_then_failed` paths.
- `scheduled_dispatch.py`: inlines summary into the cron-fire
EventLog row's `details`.
- `watcher.py`: follow-up `record_dispatch_summary_async` in a
fresh session because the EventLog row was committed before
dispatch.
- Frontend type drift fixed:
[types.ts](../../frontend/src/lib/types.ts) gets new
`DispatchSummary`, `DispatchSummaryError`,
`DispatchSummaryMediaError` interfaces plus `dispatch_id` /
`request_id` / `dispatch_summary` keys on `EventLog.details`.
- New tests in
[tests/test_dispatch_summary.py](../../packages/server/tests/test_dispatch_summary.py)
(10 tests): empty/all-success/mixed/media-counts/sub-errors/
truncation/long-message-trim/in-place attach/no-results no-op/
malformed sub-error. All 249 server tests green.
- Reviewed by `python-reviewer` subagent (no CRITICAL; 2 HIGH + 3
MEDIUM addressed: `asyncio.CancelledError` re-raise in the
best-effort catch; late `from .dispatch_summary import …` calls
hoisted to top of each file; empty-results contract changed from
"zero-count summary attached" to "no key written"; truncation
marker upgraded to `…[truncated]` for operator clarity;
`flag_modified` comment tightened).
- Live smoke: backend restarts cleanly, watcher tick log lines
continue showing `disp=disp:<hex>` correlation, no startup
errors.
- **2026-05-28** — Item #4 shipped. Summary of the change:
- `TelegramReceiver` dataclass in
[receiver.py](../../packages/core/src/notify_bridge_core/notifications/receiver.py)
gains `disable_notification: bool = False` and
`message_thread_id: int | None = None`. New
`_coerce_telegram_thread_id` helper collapses Telegram's "general
topic" sentinels (`0`, negatives, blanks, bools) to `None` so the
Bot API just omits the field — matches the frontend's `<= 0 → unset`
behaviour.
- `TelegramClient`
([client.py](../../packages/core/src/notify_bridge_core/notifications/telegram/client.py))
gets a frozen `_SendOptions` + `_send_options_var` `ContextVar`
pattern for the deep media paths (`_upload_media`,
`_post_media_group`, `_send_from_cache`) that can't easily plumb
kwargs through. `send_notification` binds the var; the 3 deep
builders read it via `_apply_send_opts_to_payload` /
`_apply_send_opts_to_form`. `send_message` is a leaf and just
inlines its kwargs into the JSON body directly (no ContextVar
needed there).
- Dispatcher
([dispatcher.py](../../packages/core/src/notify_bridge_core/notifications/dispatcher.py))
passes `receiver.disable_notification` / `receiver.message_thread_id`
into `client.send_message(...)` and `client.send_notification(...)`.
- Frontend: new inline per-Telegram-receiver options panel in
[ReceiverSection.svelte](../../frontend/src/routes/targets/ReceiverSection.svelte)
triggered by a cog icon. Silent + thread-id indicators (bell-off
icon, `#N` badge) on the row when set. `+page.svelte` handlers
PUT the merged config to `/api/targets/{id}/receivers/{rid}`.
5 new i18n keys in `en.json` / `ru.json`.
- New tests in
[test_telegram_per_send_options.py](../../packages/server/tests/test_telegram_per_send_options.py)
— 19 tests: factory + thread-id coercion table (including bool
rejection and `0`/negative collapse), payload/form helper merge
semantics, bind/reset under exceptions, concurrent-task isolation
via `asyncio.gather`, end-to-end `send_message` payload assertions.
All 270 server tests green.
- Reviewed by `python-reviewer` subagent (no CRITICAL; 2 HIGH + 1
MEDIUM + 1 LOW addressed: dead ContextVar bind in `send_message`
removed in favor of inline kwarg injection; re-entrant bind from
`send_notification → send_message` auto-resolved by the same fix;
`message_thread_id=0` collapse aligns backend with frontend;
`_coerce_telegram_thread_id` rejects `bool` input).
- Live smoke: backend restarts cleanly, no errors in startup log.
- **2026-05-28** — Holistic `code-reviewer` pass over the full session
diff (Features 1+2+4+7) caught a real HIGH that the per-feature
Python-narrow reviews missed: ``summarize_dispatch_results`` in
Feature 2 was reading the wrong dict shape. The dispatcher's
``_aggregate_results`` wraps per-receiver dicts under
``result["results"]`` and renames the Telegram media counts to
``media_delivered_count`` / ``media_skipped_count`` /
``media_failed_count``. The summarizer was reading the top-level
``delivered_count``, which is always absent in production aggregated
output — meaning the ``dispatch_summary.media`` block was silently
zero / missing for every real dispatch, and the ``media_errors``
list never populated. The unit tests passed because they
hand-constructed leaf-shaped dicts that masked the wrong-shape
read. Fixed in
[dispatch_summary.py](../../packages/server/src/notify_bridge_server/services/dispatch_summary.py)
by drilling into ``result["results"]`` per-receiver leaves and
preferring ``media_*_count`` field names with fallback to the
top-level names. Receiver index added to ``media_errors`` entries
when drilling. New integration tests in
[test_dispatch_summary.py](../../packages/server/tests/test_dispatch_summary.py)
use the real dispatcher envelope so a future shape regression fails
loudly. Also addressed MEDIUM findings: ``attach_summary_in_place``
/ ``record_dispatch_summary_async`` now skip when a caller has
pre-set ``dispatch_summary`` (mirrors the "caller wins" rule in
``enrich_details_with_correlation``); ``ReceiverSection.svelte``
props for the Telegram options panel are now optional + gated
internally so the component stays portable; TS type for
``editingReceiverOptions.message_thread_id`` is ``number | ''``
with proper coercion in ``openEditReceiver``. 294/294 server tests
green; backend restarts clean.
- **2026-05-28** — Item #5 NOT shipped. Reason: Immich has no
outbound webhook feature. The closest thing is `POST /sync/stream`
(a server-streaming sync API designed for first-party Immich
clients), and adopting it would (a) take 1-2 days of new
subscription-manager infrastructure, (b) couple us to an API with no
third-party stability contract, and (c) deliver 5-min → sub-second
latency on photo notifications which is rarely critical. If
someone later actually needs lower latency, dropping the default
``scan_interval`` is a 5-minute alternative that gets 80% of the
win for 1% of the cost. Skipped in favour of #7.
- **2026-05-28** — Item #7 shipped. Summary of the change:
- New service module
[services/diagnostic_mode.py](../../packages/server/src/notify_bridge_server/services/diagnostic_mode.py)
with `set_diagnostic` / `revert_diagnostic` / `revert_all` /
`list_active`. State is in-memory only — restart wipes overrides
(`setup_logging` re-applies the DB baseline at boot). Modules go
through an allowlist (`notify_bridge_*`, `sqlalchemy`, `aiohttp`,
`apscheduler`, `urllib3`, `httpx`, `httpcore`, `asyncio`, `PIL`,
`uvicorn`, `starlette`, `fastapi`) so a button press can't flip
root. Duration clamped to `[1, 240]` minutes. Baseline derivation
walks the dotted parents so
`sqlalchemy.engine.Engine` correctly inherits `sqlalchemy.engine`
→ WARNING rather than falling through to root.
- 3 new admin-only endpoints under `/api/settings/diagnostic-mode`
in
[api/app_settings.py](../../packages/server/src/notify_bridge_server/api/app_settings.py):
`GET` (list active), `POST` (activate, 400 on invalid input),
`DELETE /{module:path}` (manual revert, 404 if not active).
- Auto-revert uses APScheduler's date trigger with `misfire_grace_time=60`,
falling back to a strongly-referenced asyncio task (stored in a
module-level set with `add_done_callback(discard)`) when the
scheduler isn't running. `_expire_callback` re-reads `log_levels`
from the DB at fire time, so an admin who edits overrides mid-window
sees the new baseline restored — not a stale snapshot.
- `revert_all` is wired into the FastAPI lifespan shutdown in
[main.py](../../packages/server/src/notify_bridge_server/main.py)
so a clean stop / hot-reload leaves the world tidy.
- New frontend
[DiagnosticsCassette.svelte](../../frontend/src/routes/settings/DiagnosticsCassette.svelte)
sits below `LoggingCassette` in the settings page. Quick-pick
module dropdown + custom-text fallback, duration chip group (5m /
15m / 30m / 1h / 2h), Activate button. Active list with countdown
updated by a 1s ticker; resyncs from the backend every 30s based
on elapsed time (not modulo-of-now, which the prior version had
wrong). Manual revert via undo-icon button on each row.
- 15 new i18n keys in `en.json` / `ru.json`.
- 20 new tests in
[test_diagnostic_mode.py](../../packages/server/tests/test_diagnostic_mode.py)
— service-module unit tests + 4 FastAPI smoke tests via
`dependency_overrides[require_admin]` exercising the router /
path converter / HTTPException paths. All 290 server tests green.
- Reviewed by `python-reviewer` subagent (no CRITICAL; 3 HIGH +
3 MEDIUM addressed: fallback task retention in a module-level set
to prevent GC; prefix-walk for `_baseline_for` so sub-loggers
inherit parent defaults; `revert_all` wired into lifespan
shutdown; `list_active` now sweeps expired entries; DB
`log_levels` re-read at revert time instead of snapshot at
activation; frontend resync uses elapsed time. LOW items
addressed: scheduler-unavailable paths log at DEBUG instead of
silently passing; test cleanup of dead `_MIN_DURATION_MINUTES`
mutation).
- Live smoke: backend restarts cleanly, no errors in startup log.
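The baseline derivation and duration clamp described for item #7 can be sketched as below. This is an illustration of the prefix-walk idea, not the shipped `_baseline_for`. The function names, the `overrides` mapping shape, and the INFO root default are assumptions.

```python
from __future__ import annotations

import logging

MIN_MINUTES, MAX_MINUTES = 1, 240  # clamp range quoted in the decision log


def baseline_for(
    module: str, overrides: dict[str, int], root_level: int = logging.INFO
) -> int:
    """Find the level a logger should revert to by walking its dotted parents.

    Checks `a.b.c`, then `a.b`, then `a`, and falls back to the root level.
    """
    parts = module.split(".")
    while parts:
        name = ".".join(parts)
        if name in overrides:
            return overrides[name]
        parts.pop()
    return root_level


def clamp_minutes(requested: int) -> int:
    """Clamp a requested diagnostic window into the allowed range."""
    return max(MIN_MINUTES, min(MAX_MINUTES, requested))
```

With `{"sqlalchemy.engine": logging.WARNING}` as the override map, `sqlalchemy.engine.Engine` reverts to WARNING because the walk stops at its nearest configured parent.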
+89
@@ -0,0 +1,89 @@
# Production-Readiness Review — service-to-notification-bridge v0.8.1
**Date:** 2026-05-22 **Scope:** entire codebase (~70k LOC, 312 files)
**Branch:** master @ a20635a **Reviewers:** 6 parallel specialised agents
## Verdict
**Ship-readiness: nearly there.** The product is in materially better shape than a typical pre-1.0 — every security baseline is in place (sandboxed Jinja2, bcrypt+JWT, SSRF guard with DNS-rebinding mitigation, secret masking, signed webhooks, non-root Docker, owner-scoped queries) and the feature set is mature (deferred dispatch, quiet hours, fan-out caps, 429 backoff, Prometheus metrics). No CRITICAL security findings exist.
The work that *should* block shipping to wider users is concentrated in **three buckets**: (1) a handful of correctness defects that surface only under load or restart (duplicate-send class), (2) two secret-handling gaps (HA token returned cleartext, bot tokens/SMTP passwords unencrypted at rest), and (3) the schema-management story (`create_all` on boot + 1880-line hand-rolled migration script with no Alembic).
## Reports
| Axis | File | Findings | Top hit |
|---|---|---|---|
| Backend (Python) | [backend-review.md](backend-review.md) | 5C / 15H / 18M / 10L | `asyncio.create_task` GC in HA status logger |
| Frontend (TS/Svelte) | [frontend-review.md](frontend-review.md) | 2C / 10H / 19M / 7L | JWT access+refresh in `localStorage` |
| Security | [security-review.md](security-review.md) | 0C / 2H / multiple M | HA `access_token` not masked on `GET /providers/{id}` |
| Performance + DB | [performance-db-review.md](performance-db-review.md) | 3C / 7H / 10M / 10L | `SQLModel.metadata.create_all` on every boot |
| Bugs + features | [bugs-features-review.md](bugs-features-review.md) | 3C / 13H / 12M / 3L + 25 features | Webhook redelivery has no idempotency |
| UI/UX | [ui-ux-review.md](ui-ux-review.md) | ~33 across 13 axes | Five overlapping glass-card abstractions |
## Ship blockers (must fix before wider rollout)
Cross-cutting top 12 — verified across all six reviews:
1. **HA `access_token` returned in plaintext** on `GET /api/providers/{id}` — not in mask list. *(Security H-1, [providers.py:399-405](packages/server/src/notify_bridge_server/api/providers.py#L399))*
2. **Secrets unencrypted at rest** — Telegram bot tokens, SMTP passwords, HA tokens, webhook secrets stored as plain text in SQLite. Disk/snapshot/backup theft = full credential set. *(Security H-2)*
3. **Frontend JWT access + refresh in `localStorage`** — any future XSS exfiltrates the session in one call. Move to httpOnly cookie. *(Frontend C-1)*
4. **`asyncio.create_task` fire-and-forget** in `ha_subscription._on_status_change` — task may be GC'd before completion. *(Backend C-1, [ha_subscription.py:249](packages/server/src/notify_bridge_server/services/ha_subscription.py#L249))*
5. **Pre-auth 1 MiB body read** on Gitea + generic webhooks — DoS amplifier. Verify `X-Hub-Signature` before reading body. *(Backend C-3, [webhooks.py:167](packages/server/src/notify_bridge_server/api/webhooks.py#L167) + 449)*
6. **No webhook idempotency** — Gitea/Planka/generic don't dedupe by `X-Gitea-Delivery` / equivalent. Replays = duplicate sends. *(Bugs C-1)*
7. **Deferred-dispatch crash window**: `dispatch()` returns before `session.commit()`; restart re-fires. Wrap in an idempotent "claim → send → ack" with a unique constraint. *(Bugs C-2)*
8. **Telegram `_last_update_id` in-memory only** — restart can replay or skip commands. Persist watermark. *(Bugs C-3)*
9. **`init_db` calls `SQLModel.metadata.create_all` on every boot** — causes schema drift between fresh and upgraded installs. Adopt Alembic. *(Perf C-1)*
10. **Template-preview endpoints bypass sandbox timeout** — authenticated user can wedge a worker with `{% for i in range(10**8) %}`. *(Security M-1)*
11. **Telegram webhook handler missing `session.rollback()`** in catch-all — leaves uncommitted writes. *(Backend C-2, [commands/webhook.py:162](packages/server/src/notify_bridge_server/commands/webhook.py#L162))*
12. **CLAUDE.md rule-8 violation**: `if (provider.type !== 'immich')` in `RuleEditor.svelte` silently disables people/album picker for other providers. *(Frontend C-2, [RuleEditor.svelte:57](frontend/src/routes/actions/RuleEditor.svelte#L57))*
## Next-tier priorities (HIGH — fix in the same release where practical)
13. Audit `backup_schema.PROVIDER_SECRET_FIELDS` so `webhook_secret`, `password`, `client_secret`, `refresh_token` are scrubbed on export. *(Backend C-5)*
14. Add `asyncio.Lock` around `bridge_self` failure-counter dicts. *(Backend C-4)*
15. Login rate-limit is per-IP only — slow rotated-source brute force succeeds. Add per-account lockout + raise password floor. *(Security M-2)*
16. Three frontend CRUD pages copy cache items into local `$state`, breaking the shared-cache invariant and forcing a full refetch per mutation. *(Frontend H-1/H-2)*
17. Uncancelled `setTimeout` chain in backup restart flow can `window.location.reload()` after navigation. *(Frontend H-5)*
18. Refresh-token race against `logout()` produces spurious "Unauthorized" toasts. *(Frontend H-6/H-7)*
19. Dashboard per-provider GROUP-BY aggregate runs unbounded on every refresh, no caching, no covering index. *(Perf H-1/H-2)*
20. Truncation/parse-mode escaping for Telegram (HTML-aware truncate, `_extract_retry_after` fractional seconds, forum `message_thread_id` routing, 403 "bot blocked" auto-disable). *(Bugs H-various)*
21. Five overlapping glass-card abstractions + radius drift (22/18/14/12 px) + ~71 legacy `rounded-md text-sm bg-…` form inputs that bypass the global Aurora `input{}` rule. *(UI/UX H-CONSIST-01..04)*
22. Hardcoded hex colors (`#059669`, `#ef4444`) in Snackbar/ConfirmModal/actions — bypasses theming. *(UI/UX H-CONSIST-03)*
23. Snackbar has no `aria-live`; nav lacks `aria-current="page"` — invisible to screen readers. *(UI/UX H-NAV-01, A11y)*
24. DST handling in overnight quiet-hours windows. *(Bugs H)*
## What's working well — keep doing this
- **Sandboxed Jinja2 everywhere** (security agent verified every `Environment()` instantiation is `SandboxedEnvironment`).
- **`PinnedResolver` SSRF defence** — handles CGNAT, IPv4-mapped IPv6, DNS rebinding.
- **JWT with `token_version` revocation** — bcrypt offloaded to worker thread, constant-time username probe.
- **Hardened Docker** — non-root, read-only root FS, `cap_drop: ALL`.
- **Aurora/Glass design identity** — distinctive (conic-gradient orb, Newsreader italic display serif, lavender/orchid palette, "signal stream"/"on watch"/"wires"/"pulse" editorial labels). Not generic AI admin work.
- **Frontend type discipline** — `svelte-check` clean, EN/RU exactly 1466 keys each, no `eval`/`innerHTML`/`var`/`==` anywhere.
- **Most SQL hot paths already batched** — `load_link_data` is fully fan-in/fan-out; partial unique indexes on deferred-dispatch are thoughtful.
- **Most v0.8.1 production-readiness items shipped** — fan-out caps, 429 backoff, parse_mode fallback, scheduler misfire grace, Prometheus, deep healthcheck, per-receiver render cache.
## Top missing features worth adding next
Pulled from the bugs-features report — full pitches in [bugs-features-review.md](bugs-features-review.md):
- **Template playground** — "send test against last event" + dry-run with sample payload.
- **Template versioning + rollback** with audit log.
- **Bulk operations** on targets/templates (currently row-by-row).
- **User-side snooze/mute via bot command** ("/mute 2h", "/snooze tonight").
- **Auto-disable receiver on Telegram 403 ("bot blocked")** with admin notification.
- **Rate-limit per target** (separate from global fan-out cap).
- **Weekly digest + per-target stats + per-provider error rate**.
- **Generic webhook provider** and **email / Discord / ntfy.sh / Matrix** channels.
- **Message dedup window** (kills duplicate sends from redelivery and scheduler misfires).
- **First-run "Getting Started" checklist** on empty dashboard (UI/UX).
## How to consume this review
Each report has clickable `file:line` markdown links. Recommended sequence:
1. Read this `README.md`.
2. Skim each report's Executive Summary (top 5-7 bullets).
3. Triage the **Ship blockers (1-12)** above into the next release branch as individual issues.
4. Schedule the **HIGH list (13-24)** for the release after.
5. Treat the feature ideas as a refresh of `.claude/docs/feature-backlog.md`.
@@ -0,0 +1,342 @@
# Backend Production-Readiness Review
Scope: packages/server/src/notify_bridge_server/ and packages/core/src/notify_bridge_core/ (~44k LOC, Python 3.11, FastAPI + SQLModel async + APScheduler + aiohttp).
## Executive Summary
- **Overall quality is high.** The Jinja2 sandbox is consistently applied (every Environment instantiation is SandboxedEnvironment), JWT auth uses bcrypt offloaded to a worker thread, SSRF guard exists with DNS-rebinding mitigation, secrets are masked in logs via a dedicated filter, and most async/SQL patterns show production-aware design (per-tracker sessions, batched IN-queries, partial unique indexes).
- **Top correctness risk: a fire-and-forget asyncio.create_task in ha_subscription._on_status_change** (no reference stored, GC can drop the task) plus thread-unsafe in-memory counters in bridge_self. Both bite on chatty HA installs.
- **Module-level dict caches shared across the event loop have small read-modify-write windows** in services/scheduler.py (adaptive state), services/bridge_self.py (failure counters), commands/handler.py (TTLCache rate limits), and command_sync._dirty_bots. Currently functional under low concurrency; risky under load.
- **Very large hot-path functions** — services/watcher.py:check_tracker (381 lines), services/dispatch_helpers.py:load_link_data (208 lines), the 1880-line database/migrations.py, and the 1365-line services/scheduler.py — concentrate too much logic in one place.
- **Provider-type hardcoding** persists in api/providers.py, services/__init__.py, services/action_runner.py, and services/manual_dispatch.py (chains of if provider.type == "immich"). The watcher's _POLL_FACTORIES registry is the right model; extend it.
- **Webhook handlers read the request body BEFORE authenticating** in the Gitea and generic-webhook routes. The Planka route gets it right. Net impact: a peer that knows the URL but not the secret can drive a 1 MiB read per request.
- **autoescape is inconsistent**: True for runtime templates (renderer.py, commands/handler.py), False for preview / sample-context renders in api/template_configs.py, api/slot_helpers.py, and services/notifier.send_test_template_notification. Lower risk (admin-authored input) but mismatch invites surprise.
---
## CRITICAL
### [C-1] _on_status_change schedules an unstored task (GC + drop risk)
File: [packages/server/src/notify_bridge_server/services/ha_subscription.py:240-260](../../packages/server/src/notify_bridge_server/services/ha_subscription.py#L240)
The task created by asyncio.create_task(_record_ha_status(...)) at line 249 is not held anywhere. Python may garbage-collect a task whose only reference is the create_task return value before it completes (Python docs explicitly warn: save a reference to the result). Result: an HA disconnect/reconnect EventLog row silently disappears under memory pressure.
**Fix:** Module-level set[asyncio.Task], add the new task, remove via task.add_done_callback. ha_subscription.start_all already does this correctly (line 315-320); the pattern is already in-house.
### [C-2] Telegram-webhook handler returns 200 OK on uncommitted writes
File: [packages/server/src/notify_bridge_server/commands/webhook.py:130-169](../../packages/server/src/notify_bridge_server/commands/webhook.py#L130)
The catch-all at line 162 swallows handle_command exceptions and returns OK to Telegram. The request already called await session.commit() at line 96 (after save_chat_from_webhook), and any subsequent writes via the dispatcher use NEW sessions inside the command path. If a downstream session inside handle_command partially commits before raising, the dependency get_session does NOT roll back automatically — the context manager only closes.
**Fix:** Either explicitly session.rollback() in the except block, or wrap the per-request mutations in async with session.begin(): so the implicit transaction guarantees rollback on exception.
### [C-3] Gitea/generic webhook reads body BEFORE verifying secret is configured
File: [packages/server/src/notify_bridge_server/api/webhooks.py:167-178](../../packages/server/src/notify_bridge_server/api/webhooks.py#L167) and line 449-454
The sequence is: read up to 1 MiB of raw_body, then check whether webhook_secret is empty. A peer that has learned the URL but has no secret can drive a 1 MiB body read per request. Planka's handler at line 232+ validates the bearer token BEFORE the body read, which is the correct pattern.
**Fix:** Hoist the "if not webhook_secret" check (Gitea) and the "if auth_mode == none" short-circuit (generic) above _read_bounded_body. Gitea's HMAC verification still needs the body, but rejecting requests for a provider with no configured secret before reading anything costs nothing.
### [C-4] bridge_self in-memory counters are not async-safe
File: [packages/server/src/notify_bridge_server/services/bridge_self.py:186-230](../../packages/server/src/notify_bridge_server/services/bridge_self.py#L186)
record_poll_failure does _poll_failure_counts[tracker_id] = _poll_failure_counts.get(tracker_id, 0) + 1. These dicts are accessed concurrently from poll loop, HA push, webhook ingest, and dispatcher target-failure recording. Individual dict ops are atomic, but get + 1 + set is not when interleaved with another coroutine that touches the same key. Symptoms: missed threshold crossings, occasional double-emission. Same pattern in _target_failure_counts and _backlog_above_threshold.
**Fix:** Wrap mutating ops in an asyncio.Lock. The reset-and-re-arm semantics already assume serial access — make it explicit.
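One possible shape for the locked counter, as a sketch only. `FailureCounter` and its threshold semantics are assumptions for illustration; the real module uses plain dicts such as _poll_failure_counts. Whether the race is reachable depends on an `await` (or another thread) sitting between the read and the write, which should be confirmed against the actual code.

```python
import asyncio


class FailureCounter:
    """Per-key failure counts whose read-modify-write is serialised."""

    def __init__(self, threshold: int) -> None:
        self._threshold = threshold
        self._counts: dict[int, int] = {}
        self._lock = asyncio.Lock()

    async def record_failure(self, key: int) -> bool:
        """Increment; return True only on the call that reaches the threshold."""
        async with self._lock:
            count = self._counts.get(key, 0) + 1
            self._counts[key] = count
            return count == self._threshold

    async def reset(self, key: int) -> None:
        """Clear the count so the threshold can fire again later."""
        async with self._lock:
            self._counts.pop(key, None)
```

Returning the threshold crossing from inside the lock is what prevents both the missed-crossing and the double-emission symptom: exactly one caller sees `True`.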
### [C-5] PROVIDER_SECRET_FIELDS audit needed for backup exports
File: [packages/server/src/notify_bridge_server/api/providers.py:617-625](../../packages/server/src/notify_bridge_server/api/providers.py#L617) and [services/backup_service.py:84-93](../../packages/server/src/notify_bridge_server/services/backup_service.py#L84)
_apply_secrets_provider redacts only fields named in PROVIDER_SECRET_FIELDS. The webhook flow uses a field called webhook_secret (Gitea, Planka, generic) — verify this is in PROVIDER_SECRET_FIELDS (defined in backup_schema.py). A backup export with secrets_mode=INCLUDE that misses webhook_secret leaks a token that grants webhook-forgery rights.
**Action:** Audit PROVIDER_SECRET_FIELDS. Specifically check it includes: api_key, api_token, access_token, webhook_secret, password, client_secret, refresh_token. The _provider_response mask list at api/providers.py:620 is a good cross-reference — both should be the same constant.
---
## HIGH
### [H-1] _compile_template lru_cache competes across tenants
File: [packages/server/src/notify_bridge_server/commands/handler.py:99-103](../../packages/server/src/notify_bridge_server/commands/handler.py#L99)
lru_cache(maxsize=256) keyed by raw template string. Because the key is the raw string, an edited template compiles under a new key and the old entry lingers until it is evicted. On a multi-tenant install one tenant's 256 distinct templates can evict another's. No invalidation on template-edit.
**Fix:** Drop the cache (Jinja compile is sub-ms) OR add an invalidation call from the template-edit endpoints. The notification renderer (renderer.py:31) uses 512 slots — same problem; consistent fix.
### [H-2] check_tracker is 381 lines with deep coupling
File: [packages/server/src/notify_bridge_server/services/watcher.py:263-644](../../packages/server/src/notify_bridge_server/services/watcher.py#L263)
Loads tracker, polls, writes state, persists EventLog, evaluates gates, defers, dispatches, records bridge_self — all in one function. Refactor candidates: _poll_phase, _persist_state_and_events, _dispatch_phase. This is the watcher's hot path; bugs here affect every tracker tick.
### [H-3] load_link_data returns untyped dict[str, Any]
File: [packages/server/src/notify_bridge_server/services/dispatch_helpers.py:539-747](../../packages/server/src/notify_bridge_server/services/dispatch_helpers.py#L539)
Five call sites consume ld["target_type"], ld.get("link_id"), etc. — no static guarantee against key typos.
**Fix:** Introduce a frozen @dataclass class LinkData. Same for per-receiver entries.
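A sketch of what the typed projection could look like. Only `target_type` and `link_id` are named in this review; the `from_mapping` helper is hypothetical and the real structure will have more fields.

```python
from dataclasses import dataclass, FrozenInstanceError
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class LinkData:
    """Immutable view of one link; attribute typos fail at type-check time."""

    target_type: str
    link_id: Optional[int] = None

    @classmethod
    def from_mapping(cls, raw: Mapping[str, Any]) -> "LinkData":
        # Required keys use [] so a missing key fails loudly at the boundary.
        return cls(target_type=raw["target_type"], link_id=raw.get("link_id"))
```

Converting at the edge of load_link_data lets the five call sites migrate one at a time from `ld["target_type"]` to `ld.target_type`.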
### [H-4] N+1 in _resolve_command_context template-slot loop
File: [packages/server/src/notify_bridge_server/commands/handler.py:200-215](../../packages/server/src/notify_bridge_server/commands/handler.py#L200)
One SELECT per distinct command_template_config_id. Already batched for trackers/configs/providers — finish the job. Single WHERE config_id IN (...) query + Python pivot.
### [H-5] N+1 in backup_service.export_backup receiver loop
File: [packages/server/src/notify_bridge_server/services/backup_service.py:187-189](../../packages/server/src/notify_bridge_server/services/backup_service.py#L187)
50 targets = 51 SELECTs. Batch with WHERE target_id IN (...). Audit other sections of this 941-line file for the same pattern (templates -> slots, command configs -> slots).
### [H-6] _dirty_bots mutated from request and scheduler without a lock
File: [packages/server/src/notify_bridge_server/services/command_sync.py:25-95](../../packages/server/src/notify_bridge_server/services/command_sync.py#L25)
mark_bot_dirty runs in request handlers, _flush_dirty_bots on the scheduler executor. Currently safe (snapshot via ready = [...]) but fragile.
**Fix:** Snapshot under lock, or move to a thread-safe primitive.
### [H-7] HA reconnect cycle has no way for CRUD to short-circuit a stale supervisor
File: [packages/server/src/notify_bridge_server/services/ha_subscription.py:163-175](../../packages/server/src/notify_bridge_server/services/ha_subscription.py#L163)
Because provider config is reloaded only on reconnect, a provider disabled via CRUD keeps its supervisor retrying at the 30s/300s cadence until the next reconnect attempt picks up the change. CRUD endpoints should call reload_provider (defined at line 339) — verify wiring.
### [H-8] Cached expunged ORM instances are footguns
File: [packages/server/src/notify_bridge_server/services/event_dispatch.py:75-107](../../packages/server/src/notify_bridge_server/services/event_dispatch.py#L75)
_load_trackers_cached returns expunged NotificationTracker rows. Future maintainer calling session.add(tracker) on a stale cached instance triggers DetachedInstance or silent re-INSERT. Document this strongly, ideally convert to a typed projection.
### [H-9] Pending-restore at startup has no timeout
File: [packages/server/src/notify_bridge_server/main.py:142-143](../../packages/server/src/notify_bridge_server/main.py#L142)
apply_pending_restore_if_any runs in lifespan; a partially-corrupt restore could block startup indefinitely. Container liveness probes then fail after grace.
**Fix:** asyncio.wait_for with a generous timeout, or kick off as background task while app starts.
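A minimal sketch of the timeout variant. `run_with_startup_budget` is a hypothetical wrapper, and the right budget for a restore is a judgement call that this review does not settle.

```python
import asyncio


async def run_with_startup_budget(coro, timeout_s: float) -> bool:
    """Await coro, giving up after timeout_s seconds.

    Returns True if it completed, False if it was cancelled by the timeout.
    """
    try:
        await asyncio.wait_for(coro, timeout=timeout_s)
        return True
    except asyncio.TimeoutError:
        return False
```

On `False` the lifespan hook could log at ERROR and continue booting, so liveness probes pass and the operator can inspect the failed restore through the UI.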
### [H-10] Jinja2 render watchdog uses daemon thread that can pin a CPU forever
File: [packages/core/src/notify_bridge_core/templates/renderer.py:48-73](../../packages/core/src/notify_bridge_core/templates/renderer.py#L48)
Comment acknowledges the trade-off. Multiple concurrent runaway renders can exhaust CPU cores while callers think they timed out. Add a process-level BoundedSemaphore capping concurrent in-flight renders.
### [H-11] _aggregate drops all but the first error
File: [packages/server/src/notify_bridge_server/services/notifier.py:326-335](../../packages/server/src/notify_bridge_server/services/notifier.py#L326)
When all sends fail, only results[0] is returned. Distinct subsequent errors are lost.
**Fix:** Aggregate all errors into a details field.
### [H-12] Generic-webhook header dict materialised twice
File: [packages/server/src/notify_bridge_server/api/webhooks.py:456](../../packages/server/src/notify_bridge_server/api/webhooks.py#L456) and line 475
dict(request.headers) materialises full headers map, then _filter_headers and _redact_sensitive_body walk the payload. With a malicious peer sending many headers (Starlette default 100), bounded but wasteful.
### [H-13] SSRF redirect-walk has no aggregate wall-clock budget
File: [packages/core/src/notify_bridge_core/notifications/telegram/client.py:232-268](../../packages/core/src/notify_bridge_core/notifications/telegram/client.py#L232)
max_redirects = 3, each with 120s _DOWNLOAD_TIMEOUT. Worst case per request: 480s (the initial request plus three redirects). _TARGET_TIMEOUT_S = 120s in the dispatcher caps the top-level case, but per-asset preloads inside media groups don't all share that cap.
### [H-14] Backlog recovery logic flips latch for in-flight users
File: [packages/server/src/notify_bridge_server/services/bridge_self.py:544-551](../../packages/server/src/notify_bridge_server/services/bridge_self.py#L544)
Recovery loop iterates all known users and flips to False for any not in counts_by_user. If a user transiently has no user_id set on deferred rows (legacy / orphaned), they're excluded from the GROUP BY and incorrectly marked recovered.
### [H-15] quiet_hours_status silently returns None on start == end
File: [packages/server/src/notify_bridge_server/services/dispatch_helpers.py:110-111](../../packages/server/src/notify_bridge_server/services/dispatch_helpers.py#L110)
The comment notes this is almost always a user mistake. Silent return means the user wonders why their notifications still arrive at all hours. Surface via WARNING log + UI hint.
---
## MEDIUM
### [M-1] register_commands_with_telegram chat overrides loop is sequential
File: [packages/server/src/notify_bridge_server/commands/handler.py:723-776](../../packages/server/src/notify_bridge_server/commands/handler.py#L723)
50 chats with overrides = 50 sequential Telegram round-trips. Use asyncio.gather with a semaphore as in _refresh_telegram_chat_titles.
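A sketch of bounded fan-out with `asyncio.gather` and a semaphore. `gather_bounded` is a hypothetical helper; the pattern inside _refresh_telegram_chat_titles may differ in detail.

```python
import asyncio
from typing import Awaitable, Callable, Sequence


async def gather_bounded(
    coro_fns: Sequence[Callable[[], Awaitable]], limit: int
) -> list:
    """Run zero-argument coroutine functions with at most `limit` in flight."""
    sem = asyncio.Semaphore(limit)

    async def run(fn):
        async with sem:
            return await fn()

    # return_exceptions=True keeps one failed chat from cancelling the rest.
    return await asyncio.gather(*(run(fn) for fn in coro_fns), return_exceptions=True)
```

Results come back in input order, so each per-chat error can still be paired with its chat id afterwards.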
### [M-2] _run_provider exception backoff has no escalation
File: [packages/server/src/notify_bridge_server/services/ha_subscription.py:278-283](../../packages/server/src/notify_bridge_server/services/ha_subscription.py#L278)
Persistent bug in _emit reconnects every 30s forever. Add exponential backoff with cap and bridge_self alert after N failures.
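The escalation could be as small as the following sketch. The 30s base and 300s cap mirror the cadence mentioned under H-7; the function name and the doubling factor are assumptions.

```python
def backoff_delay(attempt: int, base_s: float = 30.0, cap_s: float = 300.0) -> float:
    """Delay before retry number `attempt` (0-based): doubles, then plateaus."""
    return min(cap_s, base_s * (2 ** attempt))
```

The caller would reset `attempt` to 0 after a clean connection and raise the bridge_self alert once `attempt` passes the chosen N.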
### [M-3] database/migrations.py is 1880 lines
File: [packages/server/src/notify_bridge_server/database/migrations.py](../../packages/server/src/notify_bridge_server/database/migrations.py)
Past the 800-line guideline. Split per-migration into database/migrations/<name>.py, list in main.py.
### [M-4] Locale-resolution logic duplicated
File: [packages/server/src/notify_bridge_server/services/dispatch_helpers.py:484-491](../../packages/server/src/notify_bridge_server/services/dispatch_helpers.py#L484) and [services/notifier.py:46](../../packages/server/src/notify_bridge_server/services/notifier.py#L46)
Two implementations of locale priority. One source of truth.
### [M-5] _normalize_locale duplicated across modules
File: [packages/server/src/notify_bridge_server/commands/handler.py:632](../../packages/server/src/notify_bridge_server/commands/handler.py#L632)
Five-line copy; move to commands/command_utils.py.
### [M-6] Provider-type if-chain in _test_provider_connection
File: [packages/server/src/notify_bridge_server/api/providers.py:203-250](../../packages/server/src/notify_bridge_server/api/providers.py#L203)
Same chain in services/__init__.py:_make_collection_provider. Both candidates for a single registry.
### [M-7] Secret masking exposes last 4 chars unconditionally
File: [packages/server/src/notify_bridge_server/api/providers.py:624](../../packages/server/src/notify_bridge_server/api/providers.py#L624) and [services/backup_service.py:81](../../packages/server/src/notify_bridge_server/services/backup_service.py#L81)
Fine for 32-char Immich keys. Returns half the value for short secrets. Use plain "***" for len(value) < 16.
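A sketch of the length-aware mask. `mask_secret` and the `***` + tail output format are assumptions; the current helper's exact format is not shown in this review.

```python
def mask_secret(value: str, *, min_len_for_tail: int = 16, tail: int = 4) -> str:
    """Mask a secret, revealing a short tail only when the value is long."""
    if not value:
        return ""
    if len(value) < min_len_for_tail:
        # Short secrets: showing any characters would reveal too much.
        return "***"
    return "***" + value[-tail:]
```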
### [M-8] Deprecated validate_outbound_url still imported
File: [packages/core/src/notify_bridge_core/providers/immich/client.py:14](../../packages/core/src/notify_bridge_core/providers/immich/client.py#L14)
The sync version uses blocking socket.getaddrinfo on the event loop. Migrate to avalidate_outbound_url.
### [M-9] Lazy cache init has confusing DCL comment
File: [packages/server/src/notify_bridge_server/services/watcher.py:81-113](../../packages/server/src/notify_bridge_server/services/watcher.py#L81)
The comment "Double-check after acquiring lock" implies classic DCL. Under asyncio the unlocked first check is safe because there's no thread context switch, but the comment should be reworded to say so.
### [M-10] Dispatcher concurrency cap is per-dispatch, not process-wide
File: [packages/core/src/notify_bridge_core/notifications/dispatcher.py:58](../../packages/core/src/notify_bridge_core/notifications/dispatcher.py#L58)
_DISPATCH_CONCURRENCY = 16 is INSIDE dispatch(). HA storm = N events x min(M, 16) sends with no outer cap. Add a process-level semaphore in event_dispatch.py.
### [M-11] success=True returned for partial failures
File: [packages/server/src/notify_bridge_server/services/notifier.py:329-335](../../packages/server/src/notify_bridge_server/services/notifier.py#L329)
A test that fails on 1 of 3 receivers returns success=True with a partial_failures count. Introduce a status: "ok"|"partial"|"fail" field.
### [M-12] Telegram command registration not retried on 429
File: [packages/server/src/notify_bridge_server/commands/handler.py:671-693](../../packages/server/src/notify_bridge_server/commands/handler.py#L671)
set_my_commands/delete_my_commands aren't retried. Adopt the retry-after handling that _upload_media has.
### [M-13] event_log_id_by_event keyed on id(event)
File: [packages/server/src/notify_bridge_server/services/watcher.py:417-464](../../packages/server/src/notify_bridge_server/services/watcher.py#L417)
CPython object-address as key works because events are held alive in scope, but a typed key would be safer.
### [M-14] Bcrypt-length error wording could be clearer
File: [packages/server/src/notify_bridge_server/auth/routes.py:69-81](../../packages/server/src/notify_bridge_server/auth/routes.py#L69)
A user typing 70 ASCII characters plus an emoji gets rejected and doesn't understand why. Clarify the byte-count language.
### [M-15] CSP allows unsafe-inline for script-src
File: [packages/server/src/notify_bridge_server/main.py:186-201](../../packages/server/src/notify_bridge_server/main.py#L186)
Acknowledged. SvelteKit --csp build flag emits hashes; switching unblocks dropping unsafe-inline.
### [M-16] Telegram-webhook body size not capped
File: [packages/server/src/notify_bridge_server/commands/webhook.py:71](../../packages/server/src/notify_bridge_server/commands/webhook.py#L71)
update = await request.json() reads with no cap. Add _read_bounded_body pattern.
### [M-17] _log_command_event swallows DB failures invisibly
File: [packages/server/src/notify_bridge_server/commands/handler.py:353-357](../../packages/server/src/notify_bridge_server/commands/handler.py#L353)
Hard DB failure here is invisible. Add a metrics counter.
### [M-18] apply_tracking_display_filters is a 60-line if-branched function
File: [packages/server/src/notify_bridge_server/services/dispatch_helpers.py:350-405](../../packages/server/src/notify_bridge_server/services/dispatch_helpers.py#L350)
Split into _filter_favorites, _apply_order_and_limit, _strip_details_and_tags.
---
## LOW
### [L-1] from .database.models import * in main.py
File: [packages/server/src/notify_bridge_server/main.py:26](../../packages/server/src/notify_bridge_server/main.py#L26)
Comment is honest about purpose, but explicit imports or a single module import is clearer.
### [L-2] None comparisons
All comparisons verified to use is None via grep — no findings.
### [L-3] Magic numbers
Constants are well-named throughout (_TG_429_MAX_ATTEMPTS, _MAX_PENDING_PER_TRACKER, DEBOUNCE_SECONDS, etc.). Only nit: seconds=30 literal in scheduler.schedule_bot_polling could be promoted.
### [L-4] noqa E712 repeated 8+ times for SQLModel boolean comparisons
Switch to .is_(True) for SQLAlchemy idiom, or add E712 to project ruff config.
### [L-5] _check_same_origin is best-effort by design
Acceptable.
### [L-6] _normalize_host strips IPv6 zone IDs silently
File: [packages/core/src/notify_bridge_core/notifications/ssrf.py:105-106](../../packages/core/src/notify_bridge_core/notifications/ssrf.py#L105)
Debug log when stripping changes the host would help diagnose.
### [L-7] _compute_jitter cap of 30s might be tight on hourly polls
File: [packages/server/src/notify_bridge_server/services/scheduler.py:91-105](../../packages/server/src/notify_bridge_server/services/scheduler.py#L91)
Revisit if jitter-collision becomes a real-world issue.
### [L-8] SmtpConfig repr may leak password
File: [packages/server/src/notify_bridge_server/services/notifier.py:205-213](../../packages/server/src/notify_bridge_server/services/notifier.py#L205)
If SmtpConfig is a vanilla dataclass, repr() will leak the password. Verify in notify_bridge_core.notifications.email.client — add field(repr=False) or a custom __repr__.
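A sketch of the `field(repr=False)` option. `SmtpConfigSketch` and its fields are illustrative stand-ins, not the real SmtpConfig definition.

```python
from dataclasses import dataclass, field


@dataclass
class SmtpConfigSketch:
    host: str
    username: str
    # repr=False keeps the value out of repr(), and therefore out of
    # log lines and tracebacks that format the object.
    password: str = field(repr=False)
```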
### [L-9] noqa BLE001 count is high
49 occurrences across 26 files. Each defensible; consider narrowing where possible.
### [L-10] _normalize_for_json does not handle UUID/Decimal
File: [packages/server/src/notify_bridge_server/services/deferred_dispatch.py:124-133](../../packages/server/src/notify_bridge_server/services/deferred_dispatch.py#L124)
No current consumer emits these, but a fallback str() for unknown types would prevent future breakage.
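A sketch of the `str()` fallback. This `normalize_for_json` is a simplified stand-in written for illustration; the branches of the real _normalize_for_json are not shown in this review.

```python
import uuid
from datetime import datetime
from decimal import Decimal


def normalize_for_json(value):
    """Recursively convert a value into JSON-serialisable primitives."""
    if value is None or isinstance(value, (str, int, float, bool)):
        return value
    if isinstance(value, dict):
        return {str(k): normalize_for_json(v) for k, v in value.items()}
    if isinstance(value, (list, tuple, set)):
        return [normalize_for_json(v) for v in value]
    if isinstance(value, datetime):
        return value.isoformat()
    # Fallback: UUID, Decimal and any future type degrade to their string form.
    return str(value)
```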
---
## Approval Verdict
**Block** — CRITICAL findings (C-1 unstored task, C-2 missing rollback, C-3 unauthenticated body read, C-4 racy counters, C-5 secret-mask audit) must be fixed before declaring production-ready. Once those are addressed, the HIGH findings can land in a follow-up.
## Quick Wins (low effort, high value)
1. **Wrap every fire-and-forget asyncio.create_task in a module-level set** — search for asyncio.create_task( with no assignment. Definite hit: ha_subscription.py:249.
2. **Move webhook-secret check before _read_bounded_body** in Gitea + generic webhook handlers — 5-line move per endpoint, eliminates pre-auth resource exhaustion.
3. **Add an asyncio.Lock around _poll_failure_counts and _target_failure_counts** mutations — eliminates C-4.
4. **Split migrations.py** — mechanical refactor, ~1 hour, improves blame/review.
5. **Batch the receiver query in backup_service.export_backup** — single IN (...) query, ~10x faster.
6. **Replace from .database.models import \*** with explicit imports — small clarity win.
@@ -0,0 +1,714 @@
# Bugs + Missing Features — Production-Readiness Review
Repo: `c:\Users\Alexei\Documents\service-to-notification-bridge` (v0.8.1 baseline)
Date: 2026-05-22
Scope: full repo (backend Python/FastAPI, Svelte 5 frontend, providers + dispatchers + bot commands)
---
## Executive summary
- **The code is in much better shape than typical pre-1.0 code.** Quiet-hours,
SSRF, JWT, secret redaction, rate-limit fan-out caps, partition-by-media-kind,
parse_mode retry, scheduler misfire-grace, Prometheus metrics, deep
healthcheck, and per-receiver render cache are all already implemented and
well-tested.
- **The single biggest shipping risk is webhook idempotency.** Gitea, Planka,
  and the generic webhook endpoint all dispatch on every POST regardless of
  redelivery; there is no `X-Gitea-Delivery` / `X-GitHub-Delivery` dedup table.
  An upstream retry storm sends the same notification N times.
- **The deferred-dispatch drain has a duplicate-send window** if the process
  dies between `dispatcher.dispatch()` returning and `session.commit()`:
  the row stays `pending` and the periodic catch-up scan re-drains it.
- **Telegram update offset (`_last_update_id`) is in-memory only** — on
restart, the bot replays already-handled updates or skips ones Telegram
has discarded. Combined with no per-update idempotency, this is a
duplicate-command surface.
- **Several Telegram features are silently unsupported**: forum threads
(`message_thread_id`), bot-blocked-by-user detection (403 → keep retrying
forever), and inline-button callback queries. None blocks shipping today
but each is a near-term ask from any real user.
- **No template versioning / dry-run / playground** — every template edit is
immediately live. There is no way to validate a new template against a
sample payload before flipping the switch, and no rollback path.
- **Frontend lacks bulk operations and import/export of templates+targets.**
An operator with 30 trackers cannot bulk-toggle, bulk-edit, or move a
template across users.
---
## Part A — Bugs and reliability issues
Severity legend: **CRITICAL** = data loss / duplicate user-visible messages /
silent stop-shipping; **HIGH** = wrong behavior under realistic conditions;
**MEDIUM** = degrades UX or operability; **LOW** = polish.
### CRITICAL
#### A1. Webhook redelivery causes duplicate notifications (no idempotency)
**Location**: `packages/server/src/notify_bridge_server/api/webhooks.py:156`
(`gitea_webhook`), `:225` (`planka_webhook`), `:427` (`generic_webhook`).
**Scenario**: Gitea retries a webhook after 30s if the bridge returns 5xx,
times out under load, or if the operator clicks "Test Delivery" twice. Every
retry produces a fresh notification because the handlers never check
`X-Gitea-Delivery` (Gitea's per-delivery UUID), nor do they record any
event_id/hash for `parse_generic_webhook` events.
**Fix**: Add a `webhook_delivery` table with `(provider_id, delivery_id)`
unique constraint and `created_at`. Insert before dispatch (`INSERT OR IGNORE`
on SQLite, `ON CONFLICT DO NOTHING` on Postgres); if the insert is a no-op,
return `{"ok": true, "skipped": "duplicate"}`. For Gitea use the
`X-Gitea-Delivery` header; for Planka use a hash of `event_type +
payload.id + payload.createdAt`; for generic webhooks use a configurable
JSONPath expression to derive an idempotency key, falling back to a SHA256 of
the raw body. TTL prune older than 7 days.
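A sketch of the claim-before-dispatch step using the standard-library `sqlite3` module. The table layout follows the fix above; `claim_delivery` and `fallback_key` are hypothetical helper names, and the real code would go through the project's async session rather than a raw connection.

```python
import hashlib
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS webhook_delivery (
    provider_id INTEGER NOT NULL,
    delivery_id TEXT NOT NULL,
    created_at  TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    UNIQUE (provider_id, delivery_id)
)
"""


def fallback_key(raw_body: bytes) -> str:
    """Idempotency key for payloads that carry no delivery id."""
    return hashlib.sha256(raw_body).hexdigest()


def claim_delivery(conn: sqlite3.Connection, provider_id: int, delivery_id: str) -> bool:
    """Return True if this delivery is new, False if it was already seen."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO webhook_delivery (provider_id, delivery_id) VALUES (?, ?)",
        (provider_id, delivery_id),
    )
    conn.commit()
    # rowcount is 1 when the row was inserted and 0 when the insert was ignored.
    return cur.rowcount == 1
```

A handler would call `claim_delivery` first and return the `"skipped": "duplicate"` response when it yields `False`, so the dispatch path never runs twice for one delivery.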
#### A2. Deferred-dispatch drain can double-send on process crash
**Location**: `packages/server/src/notify_bridge_server/services/deferred_dispatch.py:721-758`.
**Scenario**: Inside `_process_row`, `dispatcher.dispatch()` actually
delivers the Telegram message (HTTP 200 returned, the user's phone buzzes).
The function then sets `row.status = "fired"` (line 734) but the surrounding
`session.commit()` (line 577) hasn't run yet. Process is killed (OOM,
SIGTERM during deploy, host reboot). On restart, `_run_deferred_drain_catchup`
re-fetches the still-`pending` row and dispatches it again — **the user gets
the same album twice**.
**Fix**: Either (a) record an outbound dedup key per-row before dispatch
(`row.dispatch_id = uuid4(); session.commit()` first), then ask the channel
client to send-or-no-op based on that ID; or (b) flip the row to a
`"in_flight"` state with a short timeout in a pre-dispatch transaction so a
restart sees it as poisoned and aborts. Option (a) is more correct but
needs per-channel cooperation; option (b) is the cheap fix.
#### A3. Telegram update offset is in-memory only — restart replays or loses commands
**Location**: `packages/server/src/notify_bridge_server/services/telegram_poller.py:31`
(`_last_update_id: dict[int, int] = {}`).
**Scenario**: A user types `/random Family`. Telegram delivers update_id=4711.
The bridge processes the command, sends back the media, and crashes before
APScheduler ticks again. On restart, `_last_update_id` is empty, so we call
`getUpdates(offset=None)` → Telegram returns 4711 again → we send the user
the same album a second time. Conversely, if Telegram's 24-hour retention
expired during a long outage, we silently skip pending updates.
**Fix**: Persist last_update_id in DB (`telegram_bot.last_update_id` column).
Combine with A2-style command idempotency by inserting
`(bot_id, update_id)` into a dedup table before processing.
### HIGH
#### A4. Telegram "bot blocked by user" / "chat not found" never short-circuits
**Location**: `packages/core/src/notify_bridge_core/notifications/telegram/client.py`
(`send_message`, `_upload_media`, etc.). Errors with
`error_code == 403` (Forbidden, "Bot was blocked by the user") and 400
"chat not found" / "user is deactivated" are returned as failures but
never acted on, so the receiver is never removed or disabled.
**Scenario**: A user blocks the bot. Every scheduled "Good morning memory"
fires a sendMessage that Telegram instantly 403s. Bridge logs an error,
moves on, repeats forever. The bridge_self target-failure counter eventually
fires but the underlying receiver is never disabled. With many such chats
the operator has no easy cleanup path.
**Fix**: In the dispatcher, on `error_code in (403, 400 with description
matching "chat not found"/"user is deactivated")`, automatically set
`TelegramChat.commands_enabled = False` and either flag the receiver as
`disabled` with reason `blocked_by_user` or surface it via a new
`/admin/blocked-chats` view. Also stop further retries for that round.
#### A5. Telegram forum-thread (topic) routing not supported
**Location**: telegram client never accepts/sends `message_thread_id`.
**Scenario**: Operator points the bridge at a group's "Releases" forum
topic. Today every message lands in the General topic instead — there is
no way to specify the topic. This is a hard requirement for any non-trivial
group install. Currently `reply_parameters` is the only thread-adjacent
field used; `message_thread_id` is silently absent.
**Fix**: Add an optional `message_thread_id` per-receiver (or per-target)
config, pass through `send_message`, `_upload_media`, and `_post_media_group`.
Auto-extract from incoming command updates' `message.message_thread_id` so
the bot can reply into the same topic.
#### A6. `bot.token` read after commit without refresh in webhook flow
**Location**: `packages/server/src/notify_bridge_server/commands/webhook.py:92-97`.
**Scenario**: The comment acknowledges "AsyncSession expires instances on
commit" and snapshots `bot_id`/`bot_token` before commit, but `await
session.refresh(bot)` is also called after the commit. If `session.refresh`
fails (e.g. row was deleted by an admin concurrently — bot rotation), the
exception is caught as a warning and the rest of the handler still runs
using the stale local `bot_id`/`bot_token`. The window is small but real.
**Fix**: Remove the `session.refresh(bot)` since the snapshot already
covers everything the handler needs. The refresh adds risk for no gain.
#### A7. Deferred-dispatch coalescing has a JSON-mutation bug under concurrent defers
**Location**: `packages/server/src/notify_bridge_server/services/deferred_dispatch.py:307`
(`_find_pending_asset_rows`).
**Scenario**: Two near-simultaneous `assets_added` events for the same
`(link_id, collection_id)` from two upstream pollers (HA chat-bus +
periodic Immich). Both call `defer_event` concurrently. The two transactions
both see "no pending row", both `session.add(new_row)`, and SQLite cheerfully
inserts two rows. The drain then fires both, sending the same combined media
twice. Note that the partial UNIQUE index from v0.8.1 protects only the
`bridge_self` provider row, not the deferred queue.
**Fix**: Add a partial UNIQUE index `UNIQUE(link_id, collection_id, event_type)
WHERE status = 'pending'` on `deferred_dispatch`, then convert `defer_event`
to `INSERT ... ON CONFLICT (link_id, collection_id, event_type) WHERE status = 'pending' DO UPDATE`
(the conflict target has to repeat the index predicate, otherwise a partial
index is not matched) and merge `event_payload` inside the SQL or in a
re-read+retry loop.
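
A minimal sketch of the partial-index upsert, using the stdlib `sqlite3` module against an in-memory database. The table layout, the `asset_ids` column (a comma-separated stand-in for `event_payload`) and the helper names `make_db` / `defer_event` are assumptions for illustration, not the bridge's actual schema or API.

```python
import sqlite3

SCHEMA = """
CREATE TABLE deferred_dispatch (
    id            INTEGER PRIMARY KEY,
    link_id       INTEGER NOT NULL,
    collection_id TEXT    NOT NULL,
    event_type    TEXT    NOT NULL,
    status        TEXT    NOT NULL DEFAULT 'pending',
    asset_ids     TEXT    NOT NULL   -- hypothetical stand-in for event_payload
);
-- At most one pending row per (link, collection, event type).
CREATE UNIQUE INDEX uq_deferred_pending
    ON deferred_dispatch (link_id, collection_id, event_type)
    WHERE status = 'pending';
"""

# The conflict target repeats the index predicate so the partial index matches.
UPSERT = """
INSERT INTO deferred_dispatch (link_id, collection_id, event_type, asset_ids)
VALUES (?, ?, ?, ?)
ON CONFLICT (link_id, collection_id, event_type) WHERE status = 'pending'
DO UPDATE SET asset_ids = asset_ids || ',' || excluded.asset_ids
"""


def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def defer_event(conn, link_id, collection_id, event_type, asset_ids):
    """Insert a pending row, or merge into the existing one atomically."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            UPSERT, (link_id, collection_id, event_type, ",".join(asset_ids))
        )
```

With this shape, two racing defers for the same key end in one pending row; once that row leaves `pending`, the next defer inserts a fresh row.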
#### A8. Quiet-hours overnight window + DST transition can produce wrong fire_at
**Location**: `packages/server/src/notify_bridge_server/services/dispatch_helpers.py:121-128`.
**Scenario**: User in `Europe/Minsk` (UTC+3, no DST anymore) sets quiet
hours 22:00-06:00 and is unaffected. For a user in a DST-observing zone (e.g.
`America/New_York`), on the "spring forward" night where 02:00 → 03:00, an
event arriving at 01:30 local time gets `end_today = now_local.replace(hour=6,
minute=0)`. But `.replace()` only swaps wall-clock fields and ignores DST
adjustments: when the configured boundary falls in the transition hour, the
resulting `datetime` may sit in the skipped hour or have ambiguous DST status.
Two hours later, the dispatcher sees the quiet window as "still active" or "30
min ago" depending on the system.
**Fix**: After `.replace(hour=t_end.hour, minute=t_end.minute, ...)`, normalise
the result by round-tripping through UTC with `astimezone` (`zoneinfo` has no
counterpart to pytz's `tz.localize`) and explicitly handle the `fold=`
attribute. Add tests using
`zoneinfo.ZoneInfo("America/New_York")` and known DST transition dates.
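
A sketch of the normalisation step, assuming `now_local` is an aware `datetime` carrying a `ZoneInfo` tzinfo. The helper name `window_end` is hypothetical and the real `dispatch_helpers.py` code may be structured differently.

```python
from datetime import datetime, time, timezone
from zoneinfo import ZoneInfo


def window_end(now_local: datetime, t_end: time) -> datetime:
    """Return today's quiet-hours end as a wall-clock time that really exists."""
    tz = now_local.tzinfo
    candidate = now_local.replace(
        hour=t_end.hour, minute=t_end.minute, second=0, microsecond=0, fold=0
    )
    # Round-trip through UTC. A time inside a spring-forward gap comes back
    # shifted past the gap; an ambiguous fall-back time resolves to the first
    # occurrence because fold=0 was forced above.
    return candidate.astimezone(timezone.utc).astimezone(tz)
```

For `America/New_York` on 2024-03-10, a 02:30 boundary comes back as 03:30 EDT, while a 06:00 boundary is returned unchanged.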
#### A9. Quiet-hours `start == end` returns None — silently no quiet hours
**Location**: `packages/server/src/notify_bridge_server/services/dispatch_helpers.py:110-111`.
**Scenario**: User UI submits `quiet_hours_start = "00:00"` and
`quiet_hours_end = "00:00"`, thinking "all day quiet". The function returns
`None` (no quiet window) — the user gets pinged at 3am even though the UI
says "quiet hours enabled". Same code path eats malformed times silently.
**Fix**: Bubble up `ValueError`/`malformed input` to the API validator on
write so the user gets a 422 with a specific error message rather than
silently broken behavior. Define `00:00-00:00` as "always quiet" or reject
it explicitly with a clear error.
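
A sketch of write-time validation that takes the "reject explicitly" branch. `QuietHoursError` and `parse_quiet_hours` are hypothetical names; the API layer is assumed to map the exception to a 422 response.

```python
from datetime import datetime, time
from typing import Tuple


class QuietHoursError(ValueError):
    """Raised on write so the API layer can return a specific 422 message."""


def parse_quiet_hours(start: str, end: str) -> Tuple[time, time]:
    try:
        t_start = datetime.strptime(start, "%H:%M").time()
        t_end = datetime.strptime(end, "%H:%M").time()
    except (TypeError, ValueError) as exc:
        # Malformed input is surfaced instead of being treated as "no window".
        raise QuietHoursError(
            f"quiet hours must be HH:MM, got {start!r} and {end!r}"
        ) from exc
    if t_start == t_end:
        raise QuietHoursError(
            "quiet_hours_start and quiet_hours_end must differ"
        )
    return t_start, t_end
```

If "always quiet" is the preferred semantics, the `t_start == t_end` branch would return a sentinel instead of raising; either way the decision lives in one place.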
#### A10. Telegram `_truncate` cuts mid-HTML-tag → parse_mode fallback then loses formatting
**Location**: `packages/core/src/notify_bridge_core/notifications/telegram/client.py:144-149`
(`_truncate`).
**Scenario**: A template renders to just over 4096 chars and an
`<a href="https://...">...</a>` straddles the 4096-char boundary. The
truncate function takes a flat string slice, so the final character may be
inside a tag → Telegram returns 400 "can't parse entities" → the retry
strips parse_mode → the user sees `<a href="...">` literally in their chat.
**Fix**: Make `_truncate` HTML-aware: scan from the right and back the cut
off to the start of any tag it would split, OR strip incomplete tags after
truncating. A simpler intermediate fix: pop any unclosed `<a>`/`<b>`/`<i>`
detected by a regex over the truncated string.
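
A sketch of a tag-aware truncate along these lines. It assumes literal `<` and `>` in the message text are already escaped as entities, so every raw `<` starts a tag, and it counts Python characters rather than UTF-16 units. Half-cut entities such as `&am` are not handled. `truncate_html` and its helpers are hypothetical names, not the existing `_truncate`.

```python
import re

_TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9-]*)[^<>]*>")


def _drop_split_tag(cut: str) -> str:
    """Remove a trailing tag that the slice cut in half."""
    last_open = cut.rfind("<")
    return cut[:last_open] if last_open > cut.rfind(">") else cut


def _closers_for(cut: str) -> str:
    """Build closing tags for everything still open at the end of `cut`."""
    stack = []
    for match in _TAG.finditer(cut):
        closing, name = match.group(1), match.group(2).lower()
        if not closing:
            stack.append(name)
        elif stack and stack[-1] == name:
            stack.pop()
    return "".join(f"</{name}>" for name in reversed(stack))


def truncate_html(text: str, limit: int, marker: str = "…") -> str:
    if len(text) <= limit:
        return text
    budget = limit - len(marker)
    while budget > 0:
        cut = _drop_split_tag(text[:budget])
        result = cut + marker + _closers_for(cut)
        if len(result) <= limit:
            return result
        # The closing tags pushed us over; shrink the slice and retry.
        budget -= len(result) - limit
    return marker[:limit]
```

The loop exists because appending closing tags can itself exceed the limit, so the slice shrinks until the marker and closers fit.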
#### A11. JSON-payload depth/size hardened in backup, not in webhooks
**Location**: `packages/server/src/notify_bridge_server/api/webhooks.py:43-71`
(`_read_bounded_body` only caps total bytes).
**Scenario**: Generic webhook accepts a 999KB payload (under the 1MB cap)
but with 50 levels of nesting. `json.loads` succeeds, then
`parse_generic_webhook` evaluates JSONPath expressions in a loop and the CPU
spends seconds chasing pointers. Multiple concurrent malicious requests can
peg the event loop.
**Fix**: Reuse the depth/node guards from
`packages/server/src/notify_bridge_server/services/backup_service.py`
(JSON depth cap 10, node count cap 100k). Either share the helper or
re-implement around `json.loads(object_pairs_hook=...)`.
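
A sketch of a depth/node guard applied after parsing, using an explicit stack so the check itself cannot recurse deeply. The limits mirror the figures quoted above; `PayloadTooComplex`, `check_json_shape` and `loads_bounded` are hypothetical names, and the real helper in `backup_service.py` may differ.

```python
import json
from typing import Any


class PayloadTooComplex(ValueError):
    """Payload exceeds the allowed nesting depth or node count."""


def check_json_shape(value: Any, max_depth: int = 10, max_nodes: int = 100_000) -> None:
    stack = [(value, 1)]  # (node, depth of that node)
    nodes = 0
    while stack:
        node, depth = stack.pop()
        nodes += 1
        if nodes > max_nodes:
            raise PayloadTooComplex(f"more than {max_nodes} JSON nodes")
        if isinstance(node, dict):
            children = list(node.values())
        elif isinstance(node, list):
            children = node
        else:
            continue  # scalars do not add depth
        if depth > max_depth:
            raise PayloadTooComplex(f"nesting deeper than {max_depth}")
        stack.extend((child, depth + 1) for child in children)


def loads_bounded(raw: bytes, max_depth: int = 10, max_nodes: int = 100_000) -> Any:
    try:
        value = json.loads(raw)
    except RecursionError as exc:
        # Extremely deep input can exhaust the parser's own recursion limit.
        raise PayloadTooComplex("nesting too deep to parse") from exc
    check_json_shape(value, max_depth, max_nodes)
    return value
```

Calling `loads_bounded` right after `_read_bounded_body` would reject the payload before any JSONPath evaluation runs.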
#### A12. Generic-webhook `auth_mode="none"` with `acknowledge_unauthenticated` is per-provider, not per-user
**Location**: `packages/server/src/notify_bridge_server/api/webhooks.py:294-323`.
**Scenario**: v0.8.1 added the `acknowledge_unauthenticated=true` opt-in,
but it's only stored in `provider.config` JSON. For a multi-user install where
one user accepts unauthenticated webhooks and another doesn't, that
per-provider scope would suffice. But with `auth_mode="none"` the webhook URL
is the only credential, and URLs are not secret in real deployments (they end
up in upstream config files, logs, build artifacts), so `auth_mode="none"` is
dangerous beyond "explicit opt-in": an attacker who learns or guesses the path
can DoS the rate limiter by burning the 60/min budget.
**Fix**: Refuse to even create a `webhook` provider with `auth_mode="none"`
in production unless a separate environment guard
`NOTIFY_BRIDGE_ALLOW_UNAUTHENTICATED_WEBHOOKS` is set; AND drop the rate
limit to 10/min for `auth_mode="none"` providers.
#### A13. `_extract_retry_after` returns int but Telegram `retry_after` is fractional
**Location**: `packages/core/src/notify_bridge_core/notifications/telegram/client.py:59-78`.
**Scenario**: Modern Telegram sometimes returns `retry_after` as a float
(e.g. `1.5`). The current code does `int(group(1))` and `isinstance(ra,
(int, float))`. Regex `\d+` only matches integers. So a `1.5s` retry-after
becomes "no retry-after found" → fallback 1s sleep → retry too early → second
429 → eventually the bounded retry budget runs out.
**Fix**: Loosen the regex to `\d+(?:\.\d+)?`, parse with `float(m.group(1))`,
and preserve the fractional part by passing the float straight to
`await asyncio.sleep(retry_after + 1)`.
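
A sketch of a float-tolerant extractor. It assumes the error body has the usual Bot API shape (`parameters.retry_after` plus a `description` containing "retry after N"); the function name and the order of the two lookups are illustrative, not the current `_extract_retry_after`.

```python
import re
from typing import Optional

_RETRY_AFTER = re.compile(r"retry after (\d+(?:\.\d+)?)", re.IGNORECASE)


def extract_retry_after(payload: dict) -> Optional[float]:
    """Return the retry delay in seconds, or None when none is present."""
    value = (payload.get("parameters") or {}).get("retry_after")
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return float(value)
    match = _RETRY_AFTER.search(str(payload.get("description", "")))
    return float(match.group(1)) if match else None
```

The caller can then sleep for `retry_after + 1` without truncating the fraction.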
#### A14. APScheduler date-job collision when two windows end at the exact same second
**Location**: `packages/server/src/notify_bridge_server/services/scheduler.py:1127-1132`
(`_drain_job_id_for`). The job id is keyed on `YYYYMMDDHHMMSS`. Comment in
code acknowledges "two trackers... seconds different ... would collide", but
two windows ending at the exact same second still collide on a single job id,
and `replace_existing=True` silently drops the second.
**Scenario**: 30 users with quiet_hours_end=`07:00`. All 30 windows end at
the same wall-clock second. Only one drain job is scheduled. That single
job fires `drain_deferred_due()` which scans all rows globally so all 30
get drained — actually fine. **But** if the global drain function ever
filters by user/tracker (a likely near-term change for multi-tenant), the
collision becomes silent data loss.
**Fix**: Either keep the global drain (and document the assumption) or
add a tracker_id segment to the job_id and let APScheduler dedup naturally.
#### A15. `_handle_webhook_conflict` reclaim races against a parallel admin action
**Location**: `packages/server/src/notify_bridge_server/services/telegram_poller.py:163-218`.
**Scenario**: Admin clicks "Switch to webhook mode" in the UI, which sets
`update_mode=webhook` and calls `set_webhook(...)`. Concurrently, the next
poll tick for the same bot hits the conflict, calls `delete_webhook` → the
admin's webhook is wiped 1s after they set it. The poll tick checks
`bot.update_mode != "polling"` *before* the conflict reclaim, but the
reload is best-effort and the conflict reclaim path runs unconditionally
once entered.
**Fix**: Re-check `bot.update_mode == "polling"` inside
`_handle_webhook_conflict` before calling `delete_webhook`; or take an
advisory lock on the bot row for the duration of the mode flip.
#### A16. Discord 2000-char split breaks on Unicode codepoint boundaries
**Location**: `packages/core/src/notify_bridge_core/notifications/discord/client.py:60-80`
(`_split_message`).
**Scenario**: A template renders to 2050 chars with emoji at position
1998-1999 (each emoji is 2 surrogates / multi-byte UTF-8). The split uses
`text.rfind("\n", 0, limit)` and falls back to character index `limit`,
which is a Python str index → that part is OK in CPython 3, but if the
content contains a grapheme cluster (emoji + zero-width-joiner + skin tone),
slicing at `limit` mid-cluster renders as the broken emoji "□" in Discord.
**Fix**: Use a grapheme-cluster boundary library (e.g. `regex` module with
`\X`) or at minimum back off to the previous whitespace if `limit` is
inside a likely cluster.
### MEDIUM
#### A17. Per-target failure counter does not distinguish receivers within a target
**Location**: `packages/server/src/notify_bridge_server/services/event_dispatch.py:311-333`.
**Scenario**: A target has 10 receivers. 1 chat is blocked, 9 work. Today
`maybe_emit_target_failure` is called for the target — but the success
counter (`record_target_success`) is also called for the same target on the
other 9. Net counter behavior depends on call order. With the
default-threshold 5, this oscillates.
**Fix**: Track success/failure per receiver, not per target; or only call
`maybe_emit_target_failure` when `all` receivers failed for the target.
#### A18. `_cleanup_old_events` does not delete cancelled `DeferredDispatch` rows
**Location**: `packages/server/src/notify_bridge_server/services/scheduler.py:332-364`.
**Scenario**: The daily cleanup deletes `EventLog`, `WebhookPayloadLog`,
`ActionExecution`. Cancelled / fired / dropped `DeferredDispatch` rows live
forever in the DB. Active install with chatty providers accumulates millions
of rows; eventually the `_load_pending_drain_jobs` query, `_trim_queue_if_needed`,
and the catch-up scan all degrade.
**Fix**: Add `delete(DeferredDispatch).where(status.in_(["fired", "dropped",
"cancelled"]), fired_at < cutoff)` to the cleanup.
#### A19. `random.shuffle(shuffled)` in `_sort_assets` uses non-deterministic seed
**Location**: `packages/server/src/notify_bridge_server/services/dispatch_helpers.py:317-320`.
**Scenario**: Two identical events arriving in close succession (deferred-
dispatch merge, then drain re-renders) shuffle into different orders. With
the deferred-dispatch coalescing logic, this produces a visual "they're not
the same album" surprise in the chat history.
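**Fix**: Seed a local `random.Random` with a stable per-event digest, e.g.
`hashlib.sha256` over
`event.event_type.value + event.collection_id + event.timestamp.isoformat()`.
The built-in `hash()` is salted per process for strings, so it is not stable
across restarts.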
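
A sketch of a deterministic shuffle keyed on event identity. It derives the seed from `hashlib.sha256` because Python's built-in `hash()` is randomised per process for strings. `stable_shuffle` is a hypothetical helper, and the key parts passed in are assumed to be the event type, collection id and ISO timestamp.

```python
import hashlib
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def stable_shuffle(items: Sequence[T], *key_parts: str) -> List[T]:
    """Shuffle a copy of `items`; the same key always yields the same order."""
    digest = hashlib.sha256("\x1f".join(key_parts).encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))  # local RNG only
    shuffled = list(items)
    rng.shuffle(shuffled)
    return shuffled
```

Using a local `random.Random` also avoids reseeding the module-level generator that other code may share.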
#### A20. `_poll_tracker` swallows the exception and logs it via `_LOGGER.error`, not `_LOGGER.exception`
**Location**: `packages/server/src/notify_bridge_server/services/scheduler.py:657-666`.
**Scenario**: An exception in `check_tracker` is logged as `_LOGGER.error("Error
polling tracker %d: %s", tracker_id, e)` — no traceback. Production debugging
of "why is tracker 42 silently broken since yesterday" requires the stack.
**Fix**: Change to `_LOGGER.exception("Error polling tracker %d", tracker_id)`.
#### A21. Long bot commands → `/help` reply > 4096 chars truncates without warning
**Location**: `packages/server/src/notify_bridge_server/commands/handler.py:521-532`,
combined with `send_reply` → `send_telegram_message` → `_truncate` to 4096.
**Scenario**: A user with 20 enabled commands runs `/help`. Each command +
description (RU) crosses 250 chars → 5000 chars total → truncated mid-command.
The user sees a half-list that suggests we forgot half the commands.
**Fix**: Split `/help` over multiple messages by command category (provider).
#### A22. `parse_command` truncates to 512 chars — long search queries lost
**Location**: `packages/server/src/notify_bridge_server/commands/parser.py:15`.
**Scenario**: `/search a very long query containing emoji 🎉 and more text that
the user really meant to send because they pasted a long string from somewhere…`
gets clipped to 512 chars silently. The trailing count parser then operates
on the truncated text, possibly extracting a count from mid-query.
**Fix**: Either reject `>512` with `parse_command` returning a sentinel
"too_long" tuple, or just stop truncating — the Telegram limit is already
4096 and we already truncate the response side.
#### A23. Periodic catch-up scan can dispatch a stale event payload
**Location**: `packages/server/src/notify_bridge_server/services/deferred_dispatch.py:628`
(`_process_row`).
**Scenario**: An `assets_added` event is deferred at 22:00. At 06:00 the
quiet window ends, drain re-fetches `link_data`. The assets in `event_payload`
include URLs and asset metadata. But the user has since deleted those photos
from Immich. The dispatcher tries to download → 404. Notification shows
"5 photos added to Album X" but the actual media fails to attach.
**Fix**: For `assets_added`, re-validate asset existence against the
provider before dispatch (one batched `getAssets` call). Drop missing IDs
from the event, mark with "delivered_after_quiet_hours" + extra hint
`"missing_count": N` in details. For deferred windows >12h this is the
right behavior; for shorter windows the lookup is wasted work, so gate on
`(now - deferred_at) >= timedelta(hours=6)` (a `timedelta` has no `.hours`
attribute).
#### A24. Watcher / scheduler restart can lose adaptive polling state
**Location**: `packages/server/src/notify_bridge_server/services/scheduler.py:67-88`
(`_adaptive_state: dict`).
**Scenario**: Module-level dict resets on restart. A tracker that had ramped
up to 1-in-4 ticks goes back to every-tick polling. Over a fleet of 50
trackers in steady-state idle, this triggers a thundering herd of every-tick
polls right after deploy. Combined with no DB-level rate limiting on the
upstream Immich/Gitea API, it can rate-limit the operator out of their own
services for ~5min.
**Fix**: Either persist the adaptive state in `notification_tracker_state`
(cheap on shutdown via `atexit`) or stagger the initial ticks via
APScheduler's `next_run_time` instead of relying on the existing jitter.
#### A25. `defer_event` `return "cancelled"` logic is incorrect in some merge paths
**Location**: `packages/server/src/notify_bridge_server/services/deferred_dispatch.py:444`.
**Scenario**: The `cancelled` return branch checks `upd_added is None or
upd_added.status == "cancelled"` AND same for `upd_removed`. But if both
`upd_added` and `upd_removed` are `None` (i.e. there were no pending rows
to begin with), `fully_cancelled` is `False` → returns "merged". That's
fine. But the more subtle issue: an "insert" action with one of the rows
being cancelled returns "merged" — should be "inserted". The dashboard
"merged" status confuses the operator looking at why no defer row exists.
**Fix**: Rewrite as a clearer state machine: distinguish "inserted",
"merged_into_existing", "fully_cancelled".
#### A26. `_fetch_bytes` and `_safe_get` honor only 3 redirects with no Retry-After awareness
**Location**: `packages/core/src/notify_bridge_core/notifications/telegram/client.py:217-268`.
**Scenario**: Immich behind a CDN can chain `302 → 302 → 200`. With 4 hops
it falls through to "Too many redirects". A user complains "old photos
suddenly missing in notifications".
**Fix**: Bump to 5 redirects and surface the chain in the error string for
easier debugging.
#### A27. No structured event log filter UI for "show me all drops in the last hour"
**Location**: `packages/server/src/notify_bridge_server/api/status.py`.
`event_log` rows have a `details.dispatch_status` field, but no API filter
exposes it. The frontend can fetch only via the global filter on `event_type`.
**Scenario**: An operator sees "messages are missing today". They want to
filter event_log to `dispatch_status in (dropped_quiet_hours_nondeferrable,
deferred_then_dropped, deferred_then_failed)`. Today they can't.
**Fix**: Add `dispatch_status` and `dispatched=true|false` as first-class
event_log columns (denormalized from `details`), plus API + UI filter.
#### A28. `_render_cmd_template` falls back to `"[No template: X]"` user-visible text
**Location**: `packages/server/src/notify_bridge_server/commands/handler.py:111-115`.
**Scenario**: An operator removes a template slot by mistake. The next user
who runs `/random` sees `[No template: response_random]` in chat. Not just
ugly — it leaks internal slot names.
**Fix**: Show a friendly "Sorry, something went wrong on our side" + log at
error level. Better: refuse to disable the slot if it's referenced.
### LOW
#### A29. `_truncate`'s ellipsis can land inside a multi-byte char
The marker `"…"` is one Unicode codepoint (3 bytes UTF-8) but the truncate
counts characters, not bytes. Telegram counts UTF-16 code units, so for a
4090-char message ending in emoji, the calculation is off by a small constant.
Won't break sends but messages may end up slightly longer than `TELEGRAM_MAX_TEXT_LENGTH`
allows. Re-measure in UTF-16 code units (`len(s.encode('utf-16-le')) // 2`).
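
A sketch of measuring and truncating in UTF-16 code units, as suggested above. `utf16_len` and `truncate_utf16` are hypothetical helpers; the sketch only handles the length budget, not HTML tags or grapheme clusters.

```python
def utf16_len(text: str) -> int:
    """Length in UTF-16 code units."""
    return len(text.encode("utf-16-le")) // 2


def truncate_utf16(text: str, limit: int, marker: str = "…") -> str:
    if utf16_len(text) <= limit:
        return text
    budget = limit - utf16_len(marker)
    kept = []
    used = 0
    for ch in text:
        units = 2 if ord(ch) > 0xFFFF else 1  # astral chars need a surrogate pair
        if used + units > budget:
            break
        kept.append(ch)
        used += units
    return "".join(kept) + marker
```

Iterating by code point keeps surrogate pairs intact, so a trailing emoji is either kept whole or dropped.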
#### A30. `NotificationDispatcher._render_cache` set to fresh dict on every dispatch — comment says "reuse"
The instance attribute `self._render_cache` is reset to `{}` at the start
of every `_send_to_target` (line 245). The cache only helps across receivers
within one target, not across targets. The comment at lines 111-115 implies
broader reuse. Either align comment with reality or actually share across
targets within one `dispatch()` call.
#### A31. Frontend `entity-cache.svelte.ts` doesn't propagate stale-cache errors
The shared `$state`-based caches return stale data silently if the underlying
fetch fails after a successful initial load. A user sees old target list
during an outage and is confused why edits aren't sticking.
---
## Part B — Missing functionality and "cool feature" gaps
Tier legend: **must-have** = blocks prod for any non-trivial install;
**nice-to-have** = clear value, ship in next minor; **aspirational** = ship
when v1.0+ slows down.
Effort: **S** ≈ 1-2 days; **M** ≈ 1 week; **L** ≈ 2+ weeks.
### Already in the backlog (post-v0.8.1 status check)
#### B1. Target-level quiet hours (per-target DND, multi-window, days-of-week, silent mode)
**Status**: Still missing in v0.8.1. The backlog item proposed a v1 cut
(target-level windows + `silent` mode for Telegram = `disable_notification=True`).
None of the proposed code paths exist:
- `notification_target.quiet_hours_json` column — not present.
- `disable_notification=True` plumbing through `TelegramClient.send_message`
— not present.
- Days-of-week filter — not present.
**Pitch**: Quiet hours bind to the *watcher* (tracking config); users want
DND at the *destination*. "Don't ping my phone at night, regardless of
which provider".
**Who benefits**: Every user. Today they have to recreate per-link windows.
**Effort**: **M** (1 week — backend dispatcher gate + frontend Aurora-style fieldset).
**Tier**: **must-have for prod**.
#### B2. Immich Smart Actions expansion (auto-favorite by person, auto-archive, share-link rotation)
**Status**: Auto-Organize exists; no other action descriptors are shipped.
**Pitch**: Reuse the existing action descriptor pipeline. Auto-favorite-by-person
is the smallest cut.
**Effort**: **M** per action (a few days each).
**Tier**: nice-to-have.
#### B3. Block-based template builder
**Status**: Not started. `JinjaEditor` is unchanged.
**Effort**: **L** — frontend-only but big.
**Tier**: aspirational.
### Newly identified — must-have for prod
#### B4. Webhook delivery dedup table + "Test Delivery" replay
**Pitch**: Add the dedup table from A1, plus a `/api/webhooks/{provider_id}/replay/{delivery_id}`
endpoint that admin can hit to re-dispatch a stored payload without the upstream
provider needing to resend. Combined with the existing `WebhookPayloadLog`,
this is "click to retest" in the UI.
**Who benefits**: Every webhook provider. Replay is invaluable for debugging
template edits.
**Effort**: **M**.
**Tier**: **must-have for prod**.
#### B5. "Send test message" / template playground
**Pitch**: From the template editor, click "Try this template against the
last received event" → render preview, optionally send to a sandbox chat.
Bypass dispatch but exercise the full Jinja pipeline.
**Who benefits**: Every template edit today is a leap of faith — the operator
modifies the template, waits for the next real event, hopes nothing breaks.
**Effort**: **S-M**. The preview infrastructure already exists
(`services/sample_context.py`); add a "send to chat X" button.
**Tier**: **must-have for prod**.
#### B6. Template versioning + rollback
**Pitch**: Auto-snapshot each template on save (last 10 revisions). UI shows
diff between version N and N-1, "Restore" button. Same for command templates.
**Who benefits**: An operator who tweaks a template at midnight and goofs
the syntax needs an undo button.
**Effort**: **M**. New `template_revision` table; new endpoints; UI button.
**Tier**: **must-have for prod**.
#### B7. Bulk operations on trackers / targets / links
**Pitch**: Multi-select in lists → "disable selected", "delete selected",
"export selected templates as JSON bundle", "move to user X".
**Who benefits**: Operators with >10 trackers. A common pain point: deploying
the bridge for a new family member requires N clicks per tracker.
**Effort**: **M** (frontend-heavy).
**Tier**: **must-have for prod**.
#### B8. Bot blocked / chat-not-found auto-disable + dashboard
**Pitch**: Detect Telegram 403 / 400 chat-related errors. Mark the receiver
or `TelegramChat` as `disabled_by_remote`. Surface in a "Stale receivers"
admin view with a "Try resending invite" / "Delete chat" button.
**Who benefits**: Every Telegram user. Today the bridge silently sprays
errors until a human looks.
**Effort**: **S**.
**Tier**: **must-have for prod**.
#### B9. Forum-thread (topic) routing for Telegram
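**Pitch**: Per-receiver `message_thread_id` field, auto-detected from incoming
command messages. UI: when adding a chat that's a forum (`getChat`'s
`is_forum`), show a topic selector populated from the `message_thread_id`
values seen in incoming updates. The Bot API has no method that lists a chat's
topics; `getForumTopicIconStickers` only returns the available topic icons.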
**Who benefits**: Any group install where the user wants notifications in a
dedicated topic.
**Effort**: **M**.
**Tier**: **must-have for prod**.
#### B10. Telegram inline buttons + callback queries
**Pitch**: Templates can declare `{% buttons %}` with action descriptors.
Bridge listens for `callback_query` updates, dispatches to a registered
action (e.g. "Mark album as favorite", "Snooze this tracker for 1h", "Run
HA service light.turn_off").
**Who benefits**: Power users. Foundation for several other features
(Immich duplicate-cluster review, HA action button → service call, snooze).
**Effort**: **L**.
**Tier**: nice-to-have but unlocks the next 3 items.
#### B11. User snooze / mute via bot command
**Pitch**: `/snooze 1h` mutes the bot's outbound chat for 1h.
`/mute provider gitea` mutes a whole provider for that chat. `/wake` undoes.
Implemented as a per-receiver `snoozed_until` column.
**Effort**: **S-M**.
**Tier**: **must-have for prod** (user-side relief valve).
### Newly identified — nice-to-have
#### B12. Per-target / per-user rate limit (send-side)
**Pitch**: Cap outbound messages per minute per receiver. Existing 429
backoff handles Telegram's limit, but a runaway template / event-storm
provider can still spray the user's phone with 200 messages.
**Effort**: **S**. Token bucket per chat_id in `_send_telegram`.
**Tier**: nice-to-have.
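
A sketch of the per-chat token bucket. The class name, the `capacity` / `refill_per_sec` parameters and the injectable `clock` are assumptions for illustration; where it hooks into `_send_telegram` is left open.

```python
import time
from typing import Callable, Dict, Tuple


class ChatRateLimiter:
    """One token bucket per chat_id; each send costs one token."""

    def __init__(
        self,
        capacity: float,
        refill_per_sec: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._capacity = float(capacity)
        self._refill = float(refill_per_sec)
        self._clock = clock
        self._buckets: Dict[str, Tuple[float, float]] = {}  # chat_id -> (tokens, last_ts)

    def allow(self, chat_id: str) -> bool:
        now = self._clock()
        tokens, last = self._buckets.get(chat_id, (self._capacity, now))
        # Refill for the elapsed time, never above capacity.
        tokens = min(self._capacity, tokens + (now - last) * self._refill)
        allowed = tokens >= 1.0
        if allowed:
            tokens -= 1.0
        self._buckets[chat_id] = (tokens, now)
        return allowed
```

A denied send could be dropped, deferred, or folded into a digest message; that policy is a separate decision from the bucket itself.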
#### B13. Message dedup window (idempotency key per outbound message)
**Pitch**: SHA256 of `(target_id, receiver_id, rendered_message,
event_collection_id)`. If the same key was sent in the last 5min, skip.
**Effort**: **S**.
**Tier**: nice-to-have (lots of overlap with A1+A2 but addresses the
end-of-pipeline dedup, after all coalescing).
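
A sketch of the idempotency key and an in-memory 5-minute window. `dedup_key` and `DedupWindow` are hypothetical names; a multi-process deployment would need a shared store instead of a dict.

```python
import hashlib
import time
from typing import Callable, Dict


def dedup_key(target_id: int, receiver_id: int, rendered_message: str,
              event_collection_id: str) -> str:
    h = hashlib.sha256()
    for part in (str(target_id), str(receiver_id), rendered_message, event_collection_id):
        data = part.encode("utf-8")
        # Length prefix keeps ("ab", "c") distinct from ("a", "bc").
        h.update(len(data).to_bytes(8, "big"))
        h.update(data)
    return h.hexdigest()


class DedupWindow:
    def __init__(self, ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._seen: Dict[str, float] = {}

    def should_send(self, key: str) -> bool:
        """True the first time a key is seen within the TTL, False otherwise."""
        now = self._clock()
        # Drop expired entries so the dict does not grow without bound.
        self._seen = {k: ts for k, ts in self._seen.items() if now - ts < self._ttl}
        if key in self._seen:
            return False
        self._seen[key] = now
        return True
```

The window is measured from the first send of a key, so a steady stream of duplicates is suppressed only for the TTL and then allowed through once.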
#### B14. Weekly digest / per-target stats / per-provider error rate
**Pitch**: Cron-based weekly summary email/Telegram. "Top 5 noisy trackers",
"Receivers with >X% failure rate", "Top 5 days of the week with the most
activity". Operator preventive maintenance.
**Effort**: **M**.
**Tier**: nice-to-have.
#### B15. Mobile-friendly minimal mode for the SPA
**Pitch**: The Aurora redesign is a lot for mobile. A "manage from phone"
minimal layout — list of trackers, click to toggle, click to mute. Stops
operators from needing a desktop to silence a chatty tracker at 1am.
**Effort**: **M**.
**Tier**: nice-to-have.
#### B16. Audit log of admin actions
**Pitch**: New `audit_log` table. Every create/update/delete on
`NotificationTracker`, `NotificationTarget`, `TemplateConfig`, `ServiceProvider`,
`TelegramBot`, `User`, etc. writes a row with `(user_id, action,
entity_type, entity_id, before_json, after_json, ip, ua)`. Admin UI tab.
**Effort**: **M**. SQLAlchemy event listeners on the affected models.
**Tier**: nice-to-have for multi-admin installs; must-have if any
compliance requirement.
#### B17. Health → not just /ready, but per-component status page
**Pitch**: `/api/health/components` returns `{providers: [{id, last_ok_at,
last_error}], targets: [{id, last_ok_at, last_error}], scheduler:
{job_count, next_fires}}`. Frontend "Status" tab.
**Effort**: **S-M**. The data is already in `EventLog` / scheduler API.
**Tier**: nice-to-have.
#### B18. Provider unreachable backoff + escalation
**Pitch**: Today `bridge_self` emits `bridge_self_poll_failures` after N
consecutive fails. Add (a) exponential backoff on the polling interval after
M failures so we don't hammer a down host, and (b) recovery notification
when the provider comes back.
**Effort**: **S**.
**Tier**: nice-to-have.
#### B19. RSS provider
**Pitch**: Generic RSS/Atom feed poller. One more provider, reuses event_dispatch.
Long-tail value (operator wants "notify me when a blog publishes").
**Effort**: **M**.
**Tier**: nice-to-have.
#### B20. Mobile push / FCM channel
**Pitch**: A dedicated FCM "Receiver" type so the user can ship their own
companion app. Today Telegram is the only realtime channel; email is too
slow; webhook out is for plumbing.
**Effort**: **L**.
**Tier**: aspirational.
### Newly identified — aspirational
#### B21. Conversation threading per source (one notification thread per album / repo)
**Pitch**: Use Telegram `reply_parameters` to chain all notifications about
"Album X" as a single thread that grows over time. Today every notification
is a top-level message. Threading turns the chat into a navigable history.
**Effort**: **M**. Store `last_message_id` per `(target_id, collection_id)`,
pass it as `reply_parameters.message_id`.
**Tier**: aspirational but a clear differentiator.
#### B22. A/B test variants for templates
**Pitch**: A template config can carry 2 variants. The dispatcher
hash-routes receivers to A or B; the dashboard shows "variant A's response
time / click rate / receiver mute rate".
**Effort**: **L**.
**Tier**: aspirational.
#### B23. Dark-launch a new template before enabling it
**Pitch**: "Send-to-sandbox-chat-only" toggle on a template config. The new
template renders against real events but only goes to one operator's chat
for 1 week. Then promote to production.
**Effort**: **M**. Builds on template versioning (B6).
**Tier**: aspirational.
#### B24. Scheduled template changes
**Pitch**: "On 2026-12-25 at 09:00, switch template_config X to draft Y".
Useful for holiday-themed greetings or batch migrations.
**Effort**: **M**.
**Tier**: aspirational.
#### B25. HA service-call from a Telegram inline button
**Pitch**: Building on B10. A template renders `{% button hass:light.turn_off
target=living_room %}`. User clicks → bridge calls HA `light.turn_off`.
**Effort**: **M** (after B10).
**Tier**: aspirational.
---
## Ship-blocker checklist (do not widen user audience without)
Order is rough priority (top first). Most are also called out in Part A.
1. **A1** — Webhook idempotency table (Gitea/Planka/generic). Without this,
one upstream retry storm can double-/quadruple-spray every user.
2. **A2** — Deferred-dispatch crash window. A redeploy mid-drain duplicates
every queued notification. Implement either the `dispatch_id`
pre-commit OR the `in_flight` state machine.
3. **A3** — Persist Telegram update offset. Same root cause class as A1/A2;
matters less if A1+A2 are fixed but should land together.
4. **A4 / B8** — Bot blocked / chat-not-found auto-disable. A user blocking
the bot must not generate infinite errors.
5. **A11** — Webhook JSON depth/node cap (mirror the backup guard).
6. **A9** — Quiet-hours `start == end` confirmation; either accept "always
quiet" semantics or reject in the API validator.
7. **A8** — DST handling in quiet-hours overnight window. Verify with
tests that include known transition timestamps.
8. **B5** — "Send test message" / template playground. Without this, every
template edit is a blind change against a live system.
9. **B6** — Template versioning + rollback. Pair with B5.
10. **A5 / B9** — Forum-thread (topic) routing. Any non-trivial Telegram
group install needs this.
11. **B11** — User snooze / mute via bot command. Relief valve when the
bridge gets too chatty.
12. **B7** — Bulk operations on trackers / targets / links. Operability
floor for any install with >10 trackers.
Everything else in Part B is upside, not a blocker.
@@ -0,0 +1,682 @@
# Frontend Production-Readiness Review
Scope: `frontend/src/**` (~26k lines, Svelte 5 runes + SvelteKit). `npm run check`
passes with exit code 0. The codebase is in good shape overall - i18n EN/RU keys
are 1:1 in sync (1466 each), Modal/Snackbar overlays follow the `position:fixed`
+ `z-index:9999` convention, no `eval`, no `innerHTML`, no string-interpolated
`setTimeout`, and the sanitizer (`lib/sanitize.ts`) is a sound DOMParser-based
allowlist. The issues below are real production risks layered on top of an
otherwise clean architecture.
## Executive Summary
- **Auth tokens live in `localStorage`** (`lib/api.ts`). Any XSS that bypasses
the (good) `sanitizePreview` allowlist - or sneaks past it via a future code
path - exfiltrates both access and refresh tokens. There is no httpOnly-cookie
alternative, no token rotation on refresh failure, and `redirectToLogin` only
fires once per session (a leaked refresh token can outlive that flag).
- **One real provider-hardcoding violation** (`routes/actions/RuleEditor.svelte`)
breaks the "descriptors only" rule in CLAUDE.md item 8 and silently disables
the people/album picker for any non-Immich provider - every other page is
clean.
- **Caches duplicated into local `$state`** on `notification-trackers`,
`command-trackers`, and `command-template-configs` pages - the cache is
populated but the page never re-reads it, so cross-page mutations (search
palette pre-warming) won't update the list and cache `invalidate()` becomes
useless. Convention #4 says "always use cache".
- **Three CRUD pages refetch all entities after every mutation** (full
`await load()` after upsert/delete) instead of using `cache.upsert()`/
`remove()` - defeats the optimistic-cache design and produces visible flicker
on slow connections.
- **Floating async work + N+1 patterns**: `providers/+page.svelte` fires N
parallel health checks without an AbortController (state writes continue
after navigation); `bots/TelegramBotTab.svelte` does a sequential
`for (const trk of trackers) { await api('/listeners') }` loop.
- **`backup/+page.svelte` post-restart health poll** keeps recursing for up to
120s with no unmount guard - if the user navigates away mid-restart, the
recursive `setTimeout` chain keeps calling `fetch('/api/health')` until it
reloads the page out from under whatever route they're on.
- **`api()` 30s timeout is per-request, hard-coded, with no observability** -
long-running provider operations (Immich bulk fetch, full backup export) hit
it silently and surface as `AbortError` with no telemetry.
---
## CRITICAL
### C1. JWT tokens stored in `localStorage` - XSS-exfiltratable
[lib/api.ts:78-91](frontend/src/lib/api.ts#L78-L91)
```ts
function getToken(): string | null {
  return localStorage.getItem('access_token');
}
export function setTokens(access: string, refresh: string) {
  localStorage.setItem('access_token', access);
  localStorage.setItem('refresh_token', refresh);
}
```
Both the short-lived access token and the long-lived refresh token sit in
`localStorage`. Any successful XSS - including a future template-preview path
that escapes `sanitizePreview`, a vulnerable third-party CodeMirror extension,
or a Telegram bot username that ends up unescaped somewhere - reads both with a
single `localStorage.getItem` call.
**Fix:** Move to httpOnly + Secure + SameSite=Strict cookies set by the backend.
If a cookie-based session is infeasible for the deployment model, at minimum
move the refresh token to an httpOnly cookie and keep only the short-lived
access token in memory (a module-level `let accessToken` is XSS-readable but
not persistent across reloads, which limits the exfiltration window).
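The fallback variant can be sketched as follows. This is a minimal illustration, not the project's code: the `/api/auth/refresh` endpoint, its `{ access_token }` response shape, and the helper names are all assumptions made for the example.

```ts
// Access token lives only in module memory; it is gone after a reload.
let accessToken: string | null = null;

export function getAccessToken(): string | null {
  return accessToken;
}

export function setAccessToken(token: string | null): void {
  accessToken = token;
}

// Hypothetical endpoint: the browser attaches the httpOnly refresh cookie
// by itself, so JavaScript never sees the refresh token.
export async function refreshAccessToken(
  fetchImpl: typeof fetch = fetch,
): Promise<string | null> {
  const res = await fetchImpl('/api/auth/refresh', {
    method: 'POST',
    credentials: 'include',
  });
  if (!res.ok) {
    setAccessToken(null);
    return null;
  }
  const body = (await res.json()) as { access_token: string };
  setAccessToken(body.access_token);
  return body.access_token;
}
```

On page load the app would call `refreshAccessToken()` once to repopulate the in-memory token, since nothing survives a reload.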
### C2. Provider type hardcoded in `RuleEditor.svelte` (convention violation)
[routes/actions/RuleEditor.svelte:55-67](frontend/src/routes/actions/RuleEditor.svelte#L55-L67)
```ts
async function loadProviderData() {
if (actionType !== 'auto_organize') return;
const provider = providersCache.items.find((p: any) => p.id === providerId);
if (!provider || provider.type !== 'immich') return;
...
```
CLAUDE.md item 8 explicitly forbids `if (type === 'immich')` in components -
this is the canonical example. As written, adding a second provider with
auto-organize support (Google Photos, future SmugMug, etc.) is a silent no-op:
the form renders with empty people/album lists and gives no error.
**Fix:** Add an `actionTypes` / `peopleFilter` capability flag to
`ProviderDescriptor`, or add a `supportsAutoOrganize: boolean` discriminator,
then check `getDescriptor(provider.type)?.supportsAutoOrganize` instead of the
literal string.
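A minimal sketch of the capability check, assuming a descriptor registry keyed by provider type. The `ProviderDescriptor` fields and the sample registry entries are hypothetical; only `getDescriptor` and `supportsAutoOrganize` are named in the fix above.

```ts
// Hypothetical descriptor shape for illustration.
interface ProviderDescriptor {
  type: string;
  supportsAutoOrganize?: boolean;
}

// Sample registry; real descriptors would live next to each provider.
const descriptors: Record<string, ProviderDescriptor> = {
  immich: { type: 'immich', supportsAutoOrganize: true },
  gitea: { type: 'gitea' },
};

export function getDescriptor(type: string): ProviderDescriptor | undefined {
  return descriptors[type];
}

// Components ask about the capability, never about the provider name.
export function canAutoOrganize(providerType: string): boolean {
  return getDescriptor(providerType)?.supportsAutoOrganize === true;
}
```

With this shape, a second provider opts in by setting the flag in its own descriptor, and `RuleEditor.svelte` needs no change.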
---
## HIGH
### H1. Caches imported but copied into local `$state` - invalidation no-op
[routes/notification-trackers/+page.svelte:33](frontend/src/routes/notification-trackers/+page.svelte#L33)
[routes/command-trackers/+page.svelte:27](frontend/src/routes/command-trackers/+page.svelte#L27)
[routes/command-template-configs/+page.svelte:51](frontend/src/routes/command-template-configs/+page.svelte#L51)
```ts
// notification-trackers - line 33
let allNotificationTrackers = $state<Tracker[]>([]);
// ...
[allNotificationTrackers] = await Promise.all([
api<Tracker[]>('/notification-trackers'),
...
]);
```
The cache modules expose `notificationTrackersCache`, `commandTrackersCache`,
and `commandTemplateConfigsCache` - populated by `+layout.svelte` on mount and
by the search palette - but these three pages don't read from them. They each
issue their own `api(...)` call and store the result locally. Side effects:
1. The cache shows stale data on every other page that reads it (dashboard nav
counts, search palette).
2. `commandTemplateConfigsCache.fetch(true)` is called on `command-template-configs`
`load()` but the result is then re-assigned from the function return value
into `allCmdTplConfigs` - the cache itself is updated, but the page has no
reactive link to it.
3. `cache.upsert()` / `cache.remove()` after mutations would short-circuit a
full refetch - but with the local-state copy, every save triggers a full
`await load()` (see H2).
**Fix:** Replace `let allX = $state([])` with `let allX = $derived(cache.items)`
(see how `targets/+page.svelte:147` does it correctly) and remove the parallel
`api()` call.
### H2. Full refetch after every mutation - cache.upsert/remove not used
[routes/providers/+page.svelte:238-250](frontend/src/routes/providers/+page.svelte#L238-L250)
[routes/actions/+page.svelte:139](frontend/src/routes/actions/+page.svelte#L139)
[routes/notification-trackers/+page.svelte:291](frontend/src/routes/notification-trackers/+page.svelte#L291)
[routes/targets/+page.svelte:476](frontend/src/routes/targets/+page.svelte#L476)
Every save/delete/toggle on these pages calls `cache.invalidate(); await load()`,
which re-fetches the entire list from the server. The cache exposes
`upsert(entity)` and `remove(id)` for exactly this case - the server already
returned the new entity (or 204), so the round-trip is wasted bandwidth and
produces a visible "list redraws" flash on slow links.
**Fix:** On POST/PUT response, `cache.upsert(savedEntity)`. On DELETE,
`cache.remove(id)`. Reserve `invalidate()` + `fetch()` for cases where the
mutation may have changed *other* entities (e.g. broadcast target updates
affect children).
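The intended cache behaviour can be illustrated with a small stand-in class. It is an assumption-laden sketch: the real cache is presumably reactive and has more methods, and `EntityCache` is a name invented here.

```ts
interface Entity {
  id: number;
}

// Hypothetical stand-in for the app's cache; reactivity is left out.
export class EntityCache<T extends Entity> {
  items: T[] = [];

  // Insert a new entity or replace the existing one with the same id.
  upsert(entity: T): void {
    const exists = this.items.some((item) => item.id === entity.id);
    this.items = exists
      ? this.items.map((item) => (item.id === entity.id ? entity : item))
      : [...this.items, entity];
  }

  // Drop the entity with the given id; unknown ids are ignored.
  remove(id: number): void {
    this.items = this.items.filter((item) => item.id !== id);
  }
}
```

A save handler would then call `cache.upsert(saved)` with the entity returned by the POST/PUT instead of refetching the list.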
### H3. Provider health checks fire-and-forget - leak past navigation
[routes/providers/+page.svelte:175-181](frontend/src/routes/providers/+page.svelte#L175-L181)
```ts
for (const p of allProviders) {
health = { ...health, [p.id]: null };
api(`/providers/${p.id}/test`, { method: 'POST' })
.then((r: any) => { health = { ...health, [p.id]: r.ok }; })
.catch(() => { health = { ...health, [p.id]: false }; });
}
```
No `AbortController`, no unmount guard. If the user navigates away while N
slow Immich/Gitea probes are inflight, every probe still resolves and tries to
write to the (now-detached) `health` `$state`. With Svelte 5 runes this won't
crash, but it does waste backend connections (Immich health checks call the
real API) and may trigger duplicate probes on quick back/forward navigation.
**Fix:** Pass `{ signal: controller.signal }` to `api()` (already supported -
see `lib/api.ts:150`), abort in `onDestroy`. Or use `cache.probeAll()` driven
from a single store so revisiting the page reuses the previous result.
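One way to structure the abortable variant is sketched below. `runHealthChecks` and its callback signature are hypothetical; the only assumption taken from the review is that `api()` accepts a `signal` option.

```ts
type Probe = (id: number, signal: AbortSignal) => Promise<boolean>;

// Runs one probe per provider and reports results through `onResult`.
// Results that arrive after `signal` is aborted are dropped, so nothing
// writes to page state once the page is gone.
export async function runHealthChecks(
  ids: number[],
  probe: Probe,
  signal: AbortSignal,
  onResult: (id: number, ok: boolean) => void,
): Promise<void> {
  await Promise.all(
    ids.map(async (id) => {
      let ok: boolean;
      try {
        ok = await probe(id, signal);
      } catch {
        ok = false; // network error or abort
      }
      if (!signal.aborted) onResult(id, ok);
    }),
  );
}
```

In the page, the component would create one `AbortController`, pass `controller.signal` both here and into each `api()` call inside `probe`, and call `controller.abort()` from `onDestroy`.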
### H4. Sequential awaits for independent fetches - N+1 in TelegramBotTab
[routes/bots/TelegramBotTab.svelte:215-223](frontend/src/routes/bots/TelegramBotTab.svelte#L215-L223)
```ts
const trackers = await api<CommandTrackerSummary[]>('/command-trackers');
const matched: CommandTrackerSummary[] = [];
for (const trk of trackers) {
try {
const listeners = await api<ListenerEntry[]>(`/command-trackers/${trk.id}/listeners`);
const hasBot = listeners.some(...);
if (hasBot) matched.push(trk);
} catch (e) { console.warn(...); }
}
```
For a deployment with 20 command trackers, opening the listener section on a
bot triggers 20 serial `GET /command-trackers/{id}/listeners` requests -
visibly slow over a high-latency link.
**Fix:** Either expose a single backend endpoint
(`GET /command-trackers/listeners?bot_id=X`) or run the loop through
`Promise.all(trackers.map(trk => api(...).catch(() => null)))` and filter
afterwards.
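The `Promise.all` option might look like the sketch below. The `bot_id` field on a listener is an assumption (the original predicate is elided in the snippet above), and `trackersForBot` is a name chosen for the example.

```ts
interface TrackerSummary {
  id: number;
}

// Hypothetical listener shape; the real match condition is not shown
// in the reviewed snippet.
interface ListenerEntry {
  bot_id: number;
}

export async function trackersForBot<T extends TrackerSummary>(
  trackers: T[],
  fetchListeners: (trackerId: number) => Promise<ListenerEntry[]>,
  botId: number,
): Promise<T[]> {
  // Start every request at once; a failed request counts as "no listeners".
  const listenerLists = await Promise.all(
    trackers.map((trk) =>
      fetchListeners(trk.id).catch(() => [] as ListenerEntry[]),
    ),
  );
  return trackers.filter((_, i) =>
    (listenerLists[i] ?? []).some((l) => l.bot_id === botId),
  );
}
```

This keeps the request count at N but brings wall-clock time down to roughly one round-trip; the single backend endpoint remains the better fix when N is large.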
### H5. Post-restart health poll keeps running after unmount
[routes/settings/backup/+page.svelte:117-139](frontend/src/routes/settings/backup/+page.svelte#L117-L139)
```ts
async function applyAndRestart(): Promise<void> {
await api('/backup/apply-restart', { method: 'POST' });
restartingOverlay = true;
const startedAt = Date.now();
let attempts = 0;
const poll = async (): Promise<void> => {
attempts += 1;
try {
const res = await fetch('/api/health');
if (res.ok && Date.now() - startedAt > 2000) {
window.location.reload();
return;
}
} catch { /* still down */ }
if (attempts < 120) setTimeout(poll, 1000);
};
setTimeout(poll, 1500);
}
```
The recursive `setTimeout(poll, 1000)` chain has no cancellation. If the user
navigates to another route between `apply-restart` and the next health probe,
the chain keeps firing for up to 120s and eventually calls
`window.location.reload()` from a route the user has since moved away from.
Side effects:
1. Unauthenticated `fetch('/api/health')` calls keep going while the user is
on `/login`.
2. A user who hit "restart later" on a different tab will still get reloaded
from the original tab's poll.
**Fix:** Capture `controller = new AbortController()` and pass to `fetch`,
`onDestroy(() => controller.abort())`. Also store the timeout handle and
`clearTimeout` it on destroy.
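A cancellable version of the poll loop could be shaped like this. It is a sketch under assumptions: `startHealthPoll` and its options object are invented for the example, and the health check itself is injected rather than hard-coded.

```ts
interface PollOptions {
  check: () => Promise<boolean>; // resolves true once the backend is healthy
  onHealthy: () => void;
  intervalMs: number;
  maxAttempts: number;
}

// Starts polling and returns a function that cancels it.
export function startHealthPoll(opts: PollOptions): () => void {
  let cancelled = false;
  let attempts = 0;
  let handle: ReturnType<typeof setTimeout> | undefined;

  const poll = async (): Promise<void> => {
    if (cancelled) return;
    attempts += 1;
    let healthy = false;
    try {
      healthy = await opts.check();
    } catch {
      // still down
    }
    if (cancelled) return; // cancelled while the check was in flight
    if (healthy) {
      opts.onHealthy();
      return;
    }
    if (attempts < opts.maxAttempts) {
      handle = setTimeout(poll, opts.intervalMs);
    }
  };

  handle = setTimeout(poll, opts.intervalMs);
  return () => {
    cancelled = true;
    if (handle !== undefined) clearTimeout(handle);
  };
}
```

In the backup page, `check` would wrap `fetch('/api/health')` together with the existing two-second uptime guard, `onHealthy` would trigger the reload, and the returned function would be called from `onDestroy`.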
### H6. Token refresh races with logout in an edge case
[lib/api.ts:97-127](frontend/src/lib/api.ts#L97-L127)
The dedupe via `refreshPromise` is correct *for the refresh itself*, but the
outer `api()` reads `getToken()` before awaiting `refreshAccessToken()`. Three
concurrent requests that all 401 will all queue on the same refresh promise,
then *all* retry - fine. But if the refresh succeeds and an unrelated
`clearTokens()` (from `logout()`) fires between the refresh resolving and the
retry running, the retry uses an empty `Authorization: Bearer ` header. The
result is "ApiError: HTTP 401" surfaced via snackbar even though the redirect
to `/login` already happened.
**Fix:** Either re-check `isAuthenticated()` immediately before the retry, or
make `clearTokens()` cancel an inflight `refreshPromise`.
### H7. `AuthRedirectError` is thrown but not consistently caught
[lib/api.ts:165-170](frontend/src/lib/api.ts#L165-L170)
Most pages use the pattern `catch (err: unknown) { snackError(errMsg(err)); }` -
which catches `AuthRedirectError` too and shows "Unauthorized - redirecting
to login" in a snackbar that the user sees *as* the route changes. The error
class exists specifically to be distinguished, but only one or two call sites
actually check `instanceof AuthRedirectError` before showing a snackbar.
**Fix:** Make `errMsg()` (or a new helper) return `null` for `AuthRedirectError`
and have snackbar helpers ignore null messages. Or filter in the snackbar
store.
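The first option can be sketched as below. The class here is a stand-in for the one in `lib/api.ts`, and the `snackError` signature with an injected `show` callback is hypothetical, chosen so the example runs on its own.

```ts
// Stand-in for the class exported by lib/api.ts.
export class AuthRedirectError extends Error {
  constructor(message = 'Unauthorized') {
    super(message);
    // Keeps `instanceof` working when compiled to ES5.
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

// Returns null when the error should not be shown to the user.
export function errMsg(err: unknown): string | null {
  if (err instanceof AuthRedirectError) return null;
  if (err instanceof Error) return err.message;
  return String(err);
}

// Hypothetical snackbar helper that ignores null messages.
export function snackError(
  message: string | null,
  show: (text: string) => void,
): void {
  if (message !== null) show(message);
}
```

Existing call sites of the form `snackError(errMsg(err))` would keep working unchanged; only the two helpers learn about the redirect case.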
### H8. `api()` JSON-decode failure surfaces the raw parser error
[lib/api.ts:189](frontend/src/lib/api.ts#L189)
```ts
return res.json();
```
When the backend returns a `200 OK` with a non-JSON body (proxy error page,
HTML 502 from a misconfigured reverse proxy in front), `res.json()` rejects
with a `SyntaxError: Unexpected token < in JSON at position 0`. The page
shows the raw parser message in a snackbar, which is confusing UX.
**Fix:** Wrap `res.json()` in try/catch and throw a typed `ApiError("Backend
returned non-JSON response", 502)` so the UI can show a clean message.
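A minimal sketch of the wrapper, assuming `ApiError` takes `(message, status)` as the fix text suggests. The `parseJson` helper and the narrow `JsonBody` interface are introduced here only for the example.

```ts
// Stand-in for the app's ApiError.
export class ApiError extends Error {
  readonly status: number;

  constructor(message: string, status: number) {
    super(message);
    // Keeps `instanceof` working when compiled to ES5.
    Object.setPrototypeOf(this, new.target.prototype);
    this.status = status;
  }
}

// Minimal view of a fetch Response, so the helper is easy to test.
interface JsonBody {
  json(): Promise<unknown>;
}

export async function parseJson<T>(res: JsonBody): Promise<T> {
  try {
    return (await res.json()) as T;
  } catch {
    // Body was not valid JSON (for example an HTML error page).
    throw new ApiError('Backend returned non-JSON response', 502);
  }
}
```

Inside `api()`, the final `return res.json();` would become `return parseJson<T>(res);`.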
### H9. Email/Matrix bot tabs strip secrets via `as any`
[routes/bots/EmailBotTab.svelte:84](frontend/src/routes/bots/EmailBotTab.svelte#L84)
[routes/bots/MatrixBotTab.svelte:79](frontend/src/routes/bots/MatrixBotTab.svelte#L79)
```ts
if (!body.smtp_password) delete (body as any).smtp_password;
if (editingMatrix && !body.access_token) delete (body as any).access_token;
```
The `as any` bypass exists because the body type doesn't allow `delete` on a
required field. The intent - "don't send a blank secret which would overwrite
the stored one" - is correct, but the cast hides a real risk: if the field
name ever changes (`smtp_password` -> `smtpPassword`), the `delete` is a no-op
and the blank field is sent.
**Fix:** Build `body` as `Partial<...>` from the start and only conditionally
include the secret field.
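The conditional-include pattern could look like this for the email tab. Only `smtp_password` comes from the reviewed code; the other form fields and the `buildEmailBotBody` name are placeholders.

```ts
// Hypothetical form shape; only `smtp_password` is taken from the review.
interface EmailBotForm {
  name: string;
  smtp_host: string;
  smtp_password: string; // blank means "keep the stored secret"
}

type EmailBotBody = Omit<EmailBotForm, 'smtp_password'> & {
  smtp_password?: string;
};

export function buildEmailBotBody(form: EmailBotForm): EmailBotBody {
  const { smtp_password, ...rest } = form;
  // Only include the secret when the user typed one.
  return smtp_password ? { ...rest, smtp_password } : rest;
}
```

Because the field name appears in the destructuring pattern, renaming it in `EmailBotForm` becomes a compile error here instead of a silent no-op.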
### H10. `template-configs` hardcodes a slot name
[routes/template-configs/+page.svelte:228](frontend/src/routes/template-configs/+page.svelte#L228)
```ts
.map(s => ({ key: s.name, label: ..., rows: s.name === 'message_assets_added' ? 10 : 3, isDateFormat: false }))
```
Special-casing one Immich slot name inside a provider-agnostic component is
the same pattern CLAUDE.md item 8 forbids for components, scoped to template
configs. Other providers' "large" slots (Gitea PR descriptions, Planka card
content) would render in 3-row editors that the author probably didn't intend.
**Fix:** Add a `rows?: number` field to the backend slot definition and read
it via `notification_slots[].rows`.
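On the frontend side, reading the proposed field might be as small as the sketch below. The `NotificationSlot` shape and `toEditorFields` are assumptions for illustration; `rows` is the field the fix proposes.

```ts
// Hypothetical slot shape as it would arrive from the backend.
interface NotificationSlot {
  name: string;
  rows?: number;
}

const DEFAULT_ROWS = 3;

export function toEditorFields(slots: NotificationSlot[]) {
  return slots.map((s) => ({
    key: s.name,
    // Provider-declared height, falling back to the compact default.
    rows: s.rows ?? DEFAULT_ROWS,
    isDateFormat: false,
  }));
}
```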
---
## MEDIUM
### M1. Three placeholder strings hardcoded English in shared components
[lib/components/EntitySelect.svelte:18](frontend/src/lib/components/EntitySelect.svelte#L18)
[lib/components/IconGridSelect.svelte:16](frontend/src/lib/components/IconGridSelect.svelte#L16)
[lib/components/MultiEntitySelect.svelte:16](frontend/src/lib/components/MultiEntitySelect.svelte#L16)
```ts
placeholder = 'Select...',
```
These defaults render `Select...` in RU locale when a caller doesn't pass an
explicit placeholder. The convention (CLAUDE.md item 5) prescribes plain text
selectors but says nothing about translation - these still need to flow through
`t()`.
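**Fix:** Drop the literal default and resolve it where the prop is rendered:
`{placeholder ?? t('common.selectPlaceholder')}` (`$props()` is only valid as a
top-level declaration initializer, so it cannot be read inline), with
`common.selectPlaceholder` added to both locales.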
### M2. `EntitySelect.noneLabel` defaults to a decorative em-dash literal
[lib/components/EntitySelect.svelte:20](frontend/src/lib/components/EntitySelect.svelte#L20)
```
noneLabel = (em-dash literal),
```
CLAUDE.md item 5 calls out decorative dashes specifically. `LinkedTargetsSection`
already overrides this with `t('common.noneDefault')` (good), but other
consumers that do not override get the bare em-dash. It also fails the
localizable smell test.
**Fix:** Default to `t('common.none')`.
### M3. `lib/auth.svelte.ts` logout does a full page reload, losing UX continuity
[lib/auth.svelte.ts:54-61](frontend/src/lib/auth.svelte.ts#L54-L61)
```ts
export function logout() {
clearTokens();
clearAllCaches();
user = null;
if (typeof window !== 'undefined') {
window.location.href = '/login';
}
}
```
`window.location.href` triggers a hard reload - the SvelteKit router exists
specifically to avoid this. Side effects: any inflight requests get cancelled
without proper cleanup, the splash-loader flashes between the two pages, and
the search-palette / overlays do not get a chance to close gracefully.
**Fix:** `goto('/login', { invalidateAll: true, replaceState: true })`.
### M4. `+layout.svelte` auto-expand `$effect` writes during read
[routes/+layout.svelte:336-342](frontend/src/routes/+layout.svelte#L336-L342)
The effect reads `expandedGroups` (via `expandedGroups[entry.key]`) and writes
to `expandedGroups`. Svelte 5 dedupes the write back to the same set of keys,
but the pattern is fragile - adding any side effect that re-derives from
`expandedGroups` here would loop. It also persists to localStorage in
`toggleGroup` but not from this effect - so auto-expansion stays in memory only.
**Fix:** Compute the next state in a single pass and write once; either
include the localStorage save, or move the auto-expand into the initial
hydration block.
### M5. `commandTemplateConfigsCache.fetch(true)` result discarded; cache populated but unused
[routes/command-template-configs/+page.svelte:208](frontend/src/routes/command-template-configs/+page.svelte#L208)
The `Promise.all` destructures `cfgs` from `commandTemplateConfigsCache.fetch(true)`
but then writes `allCmdTplConfigs = cfgs` instead of $derived-reading the cache.
The cache is updated (good) but this page never reads it (bad - see H1).
**Fix:** Same fix as H1 - use `$derived(commandTemplateConfigsCache.items)`.
### M6. Dashboard search debounce timeout not cleared on filter change
[routes/+page.svelte:268-272](frontend/src/routes/+page.svelte#L268-L272)
If the user changes the type/provider filter (`applyFilters` runs synchronously
from the `$effect` at line 249) while a search debounce is pending, the pending
timeout still fires 300ms later and triggers an identical request. Not a leak,
just a wasted call.
**Fix:** Clear `searchTimeout` from `applyFilters()` as well.
### M7. Dashboard `Promise.all` destructure uses empty middle slot
[routes/+page.svelte:283-287](frontend/src/routes/+page.svelte#L283-L287)
```ts
const [statusRes, , chartRes] = await Promise.all([
api<DashboardStatus>(`/status?limit=${eventsLimit}`),
providersCache.fetch(),
api<{ days: ... }>('/status/chart'),
]);
```
The empty middle slot is brittle - anyone reordering for readability silently
swaps `statusRes` and `chartRes`. Trivially avoided.
**Fix:** Either await `providersCache.fetch()` separately (it caches anyway),
or `const [statusRes, _providers, chartRes] = ...` with an explicit `_providers`
local.
### M8. `actions/+page.svelte` derives `actionTypes` from a function-in-derived
[routes/actions/+page.svelte:78-81](frontend/src/routes/actions/+page.svelte#L78-L81)
```ts
let actionTypes = $derived((() => {
const caps = capabilitiesCache.items[selectedProviderType];
return caps?.action_types || [];
})());
```
The IIFE is unnecessary; `$derived` already runs the expression on every
dependency change. Reads as a refactor leftover.
**Fix:** `let actionTypes = $derived(capabilitiesCache.items[selectedProviderType]?.action_types ?? []);`
### M9. `RuleEditor.svelte` mutates rule object in `toggleRule` then sends to API
[routes/actions/RuleEditor.svelte:105-108](frontend/src/routes/actions/RuleEditor.svelte#L105-L108)
```ts
async function toggleRule(rule: ActionRule) {
rule.enabled = !rule.enabled;
await updateRule(rule);
}
```
Direct mutation of the prop violates the immutability rule (coding-style.md).
If the API call fails, the local state is already flipped - the UI shows the
new value even though the server still has the old one.
**Fix:** `await updateRule({ ...rule, enabled: !rule.enabled })`. After
successful response, `await loadRules()` (already happens) re-syncs.
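A non-mutating version, sketched with the update call injected so it can run standalone. The `ActionRule` fields beyond `enabled` are assumed.

```ts
// Hypothetical minimal rule shape.
interface ActionRule {
  id: number;
  enabled: boolean;
}

export async function toggleRule(
  rule: ActionRule,
  updateRule: (next: ActionRule) => Promise<void>,
): Promise<ActionRule> {
  const next = { ...rule, enabled: !rule.enabled };
  await updateRule(next); // throws on failure, leaving `rule` untouched
  return next;
}
```

If the request fails, the caller still holds the original object, so the UI keeps showing the value the server has.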
### M10. `+layout.svelte` filter functions use `as any[]` four times
[routes/+layout.svelte:145-151](frontend/src/routes/+layout.svelte#L145-L151)
```ts
notification_trackers: filterById(notificationTrackersCache.items as any[]).length,
```
The cast exists because `filterById<T extends { provider_id?: number }>` is
narrower than the cache item types. The proper fix is a single base interface
`{ provider_id?: number }` on the relevant types so the cast goes away.
### M11. `setLocale` does not update `<html lang>` attr
[lib/i18n/index.svelte.ts:31-36](frontend/src/lib/i18n/index.svelte.ts#L31-L36)
Screen readers and browser translation extensions rely on `<html lang="en">`.
The app never sets it, so switching to RU leaves accessibility tooling thinking
the page is still English.
**Fix:** `document.documentElement.lang = locale` in `setLocale`.
### M12. `Modal.svelte` focus restore does not verify element still in DOM
[lib/components/Modal.svelte:43-45](frontend/src/lib/components/Modal.svelte#L43-L45)
If the previously focused element has been removed from the DOM between modal
open and close (common with optimistic UI updates that rerender the source
button), `.focus()` is a silent no-op on a detached node. Focus ends up on
`<body>` and the next Tab restarts from the top of the page.
**Fix:** `if (... && document.contains(previouslyFocused)) previouslyFocused.focus()`,
else focus a sensible fallback (the trigger that opened the page).
### M13. TimezoneSelector ticks at 1s - wakes the event loop forever
[lib/components/TimezoneSelector.svelte:33-37](frontend/src/lib/components/TimezoneSelector.svelte#L33-L37)
```ts
let tickHandle: ReturnType<typeof setInterval> | null = null;
onMount(() => {
tickHandle = setInterval(() => { now = new Date(); }, 1000);
});
```
A 1Hz tick is fine for visible UI; the issue is it keeps running even when
the selector dropdown is closed (the time display is only visible when the
dropdown is open). Battery impact is non-trivial on mobile for what is
essentially a hidden component.
**Fix:** Start/stop the interval based on `open` state, or use
`requestAnimationFrame` driven by `IntersectionObserver`.
### M14. Backup file download builds blob from JSON without size guard
[routes/settings/backup/+page.svelte:269-281](frontend/src/routes/settings/backup/+page.svelte#L269-L281)
```ts
const data = await api(`/backup/files/${filename}`);
const blob = new Blob([JSON.stringify(data, null, 2)], { type: 'application/json' });
```
For a deployment with hundreds of providers/trackers, the JSON serialization
of the entire backup happens in-memory in a single string before the Blob
constructor - wasted memory peak and a frozen tab on slow machines. Worse,
`api()` parses the JSON and then `JSON.stringify` re-serializes it.
**Fix:** Use `fetchAuth()` for the download path and pipe the response stream
straight into a Blob (`new Blob([await res.arrayBuffer()])`).
### M15. Modal focus-trap query selector includes disabled inputs
[lib/components/Modal.svelte:62-67](frontend/src/lib/components/Modal.svelte#L62-L67)
Re-querying the DOM on every Tab keystroke is OK but means disabled inputs
(common in long forms with submit-in-progress) are included in the trap and
focus can land on them. The selector should add `:not([disabled])`.
### M16. i18n resolve uses any for the recursion accumulator
[lib/i18n/index.svelte.ts:55-62](frontend/src/lib/i18n/index.svelte.ts#L55-L62)
```ts
function resolve(obj: any, path: string): string | undefined {
```
`obj: unknown` plus a runtime check would let TS narrow `current` properly and
catch the case where someone accidentally passes a `string` (returns undefined
silently today).
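A possible `unknown`-based version is sketched below. The original function body is not shown in the review, so the dotted-path traversal here is an assumption about what it does.

```ts
export function resolve(obj: unknown, path: string): string | undefined {
  let current: unknown = obj;
  for (const key of path.split('.')) {
    // Stop as soon as we hit something that cannot have properties.
    if (typeof current !== 'object' || current === null) return undefined;
    current = (current as Record<string, unknown>)[key];
  }
  return typeof current === 'string' ? current : undefined;
}
```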
### M17. Tracker name auto-set string concat - English-only
[routes/notification-trackers/+page.svelte:82-84](frontend/src/routes/notification-trackers/+page.svelte#L82-L84)
[routes/command-trackers/+page.svelte:69-71](frontend/src/routes/command-trackers/+page.svelte#L69-L71)
```ts
form.name = provider ? `${provider.name} Tracker` : 'Tracker';
form.name = provider ? `${provider.name} Commands` : 'Commands';
```
Defaults the tracker name to "Provider Name Tracker" / "Provider Name Commands",
which is English-only. Russian users get an English suffix on the auto-generated
name. Inconsistent with the rest of the i18n discipline.
**Fix:** Use `t('notificationTracker.defaultName').replace('{name}', provider.name)`.
### M18. topbar-action store not cleared on auth state change
[routes/providers/+page.svelte:160-167](frontend/src/routes/providers/+page.svelte#L160-L167)
Each page sets a topbar CTA in `onMount` and clears it in `onDestroy`. If
`logout()` is called from inside the page (via the search palette, etc.), the
page never destroys cleanly and the topbar action sticks into the login screen.
Defensive `topbarAction.clear()` in `logout()` would plug this.
### M19. Many `: any` and `as any` types in critical paths
[routes/users/+page.svelte:62](frontend/src/routes/users/+page.svelte#L62)
[routes/command-trackers/+page.svelte:27](frontend/src/routes/command-trackers/+page.svelte#L27)
[routes/providers/+page.svelte:179](frontend/src/routes/providers/+page.svelte#L179)
[lib/providers/types.ts:120](frontend/src/lib/providers/types.ts#L120)
64 occurrences of `: any` / `as any` across 20 files. None are in
security-sensitive paths, but they remove type safety in exactly the call
sites that shape API requests (`body: any = { ... }`). Recommended cleanup
task, not a blocker.
---
## LOW
### L1. +page.svelte event types hardcoded in three parallel maps
[routes/+page.svelte:475-512](frontend/src/routes/+page.svelte#L475-L512)
`eventLabels`, `eventIcons`, and `eventGradients` are three parallel dicts
keyed by the same set of strings. Adding a new event type requires editing
three places (plus i18n). A single `EVENT_META` object would be more
maintainable.
### L2. TestMenu.svelte uses z-index 9998 instead of 9999
[routes/notification-trackers/TestMenu.svelte:25](frontend/src/routes/notification-trackers/TestMenu.svelte#L25)
```svelte
<div style="position:fixed; top:0; left:0; right:0; bottom:0; z-index:9998;"
```
The convention says 9999 for overlays. Using 9998 was probably intentional
(so the menu sits above the backdrop), but the cleaner pattern is to give the
backdrop a slightly lower stacking context inside the same parent.
### L3. console.warn left in production-bound code
14 `console.warn`/`console.error` occurrences. Most are guarded by a
"failed to load" + UI fallback - legitimate debug noise. Recommend wiring to
a structured logger before public release; current state is acceptable for an
internal tool but spam-prone in DevTools.
### L4. Dashboard setTimeout(animateCount, 200) is uncancelled
[routes/+page.svelte:290-299](frontend/src/routes/+page.svelte#L290-L299)
The 200ms delay before triggering count animations is uncancelled. Navigating
away during the first 200ms means the count animation `requestAnimationFrame`
chain still runs against a stale `status` reference. Cosmetic only.
### L5. app.html inline theme bootstrap reads localStorage without try/catch
[src/app.html:12](frontend/src/app.html#L12)
Theme is hydrated synchronously in `<head>` to avoid FOUC - fine - but if
localStorage is blocked (Safari private mode, some enterprise policies) the
inline script throws and the rest of the head bootstrap may be skipped.
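A guarded read could be sketched as follows. The inline script in `app.html` is plain JavaScript, so this TypeScript helper only illustrates the shape of the guard; `safeGetItem` and the storage key used in the usage note are hypothetical.

```ts
// Minimal view of the Storage API used here.
interface ReadableStorage {
  getItem(key: string): string | null;
}

// Returns the stored value, or `fallback` when storage is blocked or empty.
// Storage is passed as a getter because merely touching `localStorage`
// can throw in restricted contexts.
export function safeGetItem(
  getStorage: () => ReadableStorage,
  key: string,
  fallback: string,
): string {
  try {
    return getStorage().getItem(key) ?? fallback;
  } catch {
    return fallback;
  }
}
```

A call such as `safeGetItem(() => localStorage, 'theme', 'light')` then always yields a usable theme name.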
### L6. EventChart computes activeTypes and hasData from same loop twice
[lib/components/EventChart.svelte:46-49](frontend/src/lib/components/EventChart.svelte#L46-L49)
`hasData` and `activeTypes` traverse the same data twice. Single-pass
derivation would be cheaper for the rare "many days of events" case.
### L7. Single-letter t shadowing in +layout.svelte
`+layout.svelte:140` uses `for (const t of targets)` inside `navCounts`, which
shadows the imported i18n function `t`. Svelte 5 does not flag it (inner scope
wins), but it confuses search/grep and breaks IDE go-to-definition. Several
other pages use single-letter `t` as iteration var (`actions/+page.svelte`,
`command-trackers/+page.svelte`, `targets/+page.svelte`). Recommend `target` /
`tracker` for legibility.
---
## Notes & non-findings
- **Modal overlay convention** (CLAUDE.md #2): Modal.svelte, Snackbar,
IconPicker, IconGridSelect, MultiEntitySelect, EntitySelect, TimezoneSelector,
EventChart, Hint, SearchPalette, and TestMenu all use `position:fixed` with
`z-index: 9999` (or 9998 for the TestMenu backdrop - see L2). Convention
upheld.
- **@html usage** - only three call sites, all pipe through `sanitizePreview`,
which is a DOMParser-based allowlist limited to `B`, `I`, `CODE`, `PRE`, `A`,
`BR` with `https?://` href validation. Safe.
- **i18n parity**: EN and RU JSON have the exact same 1466 keys - no orphans.
- **Selector placeholders**: `LinkedTargetsSection` correctly uses
`t('common.noneDefault')`, no em-dash leaks in user-facing flows (only
defaults inside shared components - see M1/M2).
- **svelte-check passes** (exit 0) - no type errors at the strict level the
project compiles with.
- **No eval, new Function, or string-setTimeout**: dynamic code execution
surface is clean.
- **No var declarations**, no `==` (loose equality) outside generated CSS.
- **AbortController usage**: present in `lib/api.ts` for the canonical fetch
wrapper - the rest of the codebase could lean on it more (see H3, H5).
@@ -0,0 +1,436 @@
# Performance & Database Review — `service-to-notification-bridge`
**Scope:** entire repo at `c:\Users\Alexei\Documents\service-to-notification-bridge`
**Backend:** FastAPI + SQLAlchemy async + SQLModel on SQLite (Postgres-compatible URL, but only SQLite branch is exercised in code).
**Frontend:** SvelteKit 5 (runes) static build served by the same FastAPI process.
**Reviewer:** Claude Opus 4.7 (1M context)
---
## Executive summary
1. **Indexing is in good shape.** FK columns and the dashboard/webhook hot paths have explicit composite indexes (`ix_event_log_user_created`, `ix_event_log_user_event_type_created`, `ix_deferred_dispatch_status_fire_at`, partial `ux_deferred_dispatch_pending`). The bulk of the "missing index" risk is already mitigated.
2. **No real migration tool.** The project runs a hand-rolled, 1880-line, idempotent migration script on every boot. It works, but it's brittle, slow on cold start, has no down-migrations, and the table-rebuild branches lose indexes silently. Move to Alembic before the next major schema change.
3. **`create_all` is still the source-of-truth for new schemas** (engine.py:63). That's an anti-pattern next to migration tooling: schema drift can silently appear between fresh installs and upgraded installs.
4. **Two real N+1 risks remain.** `_tracker_response` (notification_trackers.py:286-291) calls `_tt_response` per link, and `_refresh_telegram_chat_titles` (scheduler.py:229) issues per-chat `getChat` calls without bot-level batching guards. The big one in `load_link_data` was already fixed (good).
5. **SQLite PRAGMAs are mostly right but pool sizing is wrong.** WAL, `synchronous=NORMAL`, FK enforcement, busy_timeout, temp_store=MEMORY are all set. Missing: `cache_size`, `mmap_size`. The async engine uses SQLAlchemy's default pool with multiple writer connections — under WAL that still serializes, but it raises spurious BUSY pressure on long transactions (see #M3).
6. **Event-log retention exists and is correct** (30-day default, cron at 03:00 UTC), but `retention_days=0` disables it silently and there is no archival, no per-tenant cap, no row-count metric exposed to operators.
7. **Memory leak risk: `_dirty_bots`, `_last_update_id`, `_last_webhook_reclaim_at`, `_adaptive_state`, `_adaptive_max_skip`** in command_sync.py, telegram_poller.py, scheduler.py are unbounded module-level dicts. In a long-running process they grow without ever shrinking when entities are deleted.
8. **Frontend has no virtualization on long lists** — dashboard event stream, tracker history, target list. On a tenant with thousands of events the dashboard `{#each status.recent_events}` (with `(event.id)` key) still renders the whole page-set into DOM and re-runs derivations on every refresh.
---
## CRITICAL
### C1. `create_all` is the schema-of-record for new installs ([engine.py:60](packages/server/src/notify_bridge_server/database/engine.py))
```python
async def init_db() -> None:
engine = get_engine()
async with engine.begin() as conn:
await conn.run_sync(SQLModel.metadata.create_all)
```
**What's wrong:** `init_db()` runs unconditionally on every boot before the migration script. New installs get the *current* model's CREATE TABLE statements — including FK declarations like `ondelete=SET NULL` — while upgraded installs only get what the (one-way) `migrate_*` scripts manage to inject via `ALTER TABLE`. Several migrations explicitly admit "this only takes effect on freshly created tables" (e.g. `migrate_eventlog_provider_fk` is a documented no-op). That means **the schema drift between a fresh install and a 6-month-old install is real and undocumented.**
**Impact:** stability — subtle bugs that reproduce only on upgraded installs (FK enforcement, cascade behavior, partial UNIQUE indexes); ops — restoring a backup from a fresh install onto an upgraded box, or vice-versa, can change observable behaviour.
**Fix:**
1. Adopt Alembic with autogenerate-from-models, lock the baseline migration to the current `SQLModel.metadata`, and stop calling `create_all` in production startup.
2. Keep the hand-rolled `migrate_*` chain as legacy data-migrations only (idempotent, runs once, then removed).
3. Add a CI check: spin up empty DB → run migrations → diff against `SQLModel.metadata` → fail if non-empty.
---
### C2. `migrate_schema` runs ~30+ idempotent `PRAGMA table_info` + ALTER probes on every cold start ([migrations.py:67-427](packages/server/src/notify_bridge_server/database/migrations.py))
`_has_column` issues a `PRAGMA table_info('<table>')` per check; `migrate_schema` calls it dozens of times serially inside one transaction. On a cold start this is the dominant boot latency. Worse, it forces a write txn on every boot even when nothing changes (because each migration opens `engine.begin()`).
**Impact:** startup cost — visible on Raspberry-Pi / NAS deployments; SQLite WAL checkpoint pressure on every boot when nothing changed; readiness probe grace window must accommodate this.
**Fix:**
1. Wire `schema_version` (already exists, `CURRENT_SCHEMA_VERSION=1`) as a real short-circuit — at the top of every `migrate_*`, return immediately if `schema_version >= N` for that migration.
2. Cache `PRAGMA table_info` results within a single migration run.
3. Better long-term: replace with Alembic; you already have the version table.
---
### C3. `_install_sqlite_pragmas` only fires on engine-pool `connect`, not when SQLAlchemy reuses pooled connections from a different event loop ([engine.py:18-38](packages/server/src/notify_bridge_server/database/engine.py))
The `@event.listens_for(engine.sync_engine, "connect")` hook only runs at connection creation. The default `aiosqlite` pool reuses connections, which is fine, but `connect_args["timeout"]=30` clashes with the in-PRAGMA `busy_timeout=10000` (10 s). Having two different timeout settings is confusing, and the 10 s PRAGMA is the one that takes effect because it runs after the connection is opened.
**Impact:** stability under contention: under sustained writer contention you get `SQLITE_BUSY` *much* sooner than expected. In Python's `sqlite3`, the `timeout` connect argument also sets the lock-wait (busy) timeout rather than a connection-open timeout, so the later `PRAGMA busy_timeout=10000` replaces the 30 s value; users see "database is locked" errors after 10 s, not 30.
**Fix:** standardize on busy_timeout (raise to 30 s to match `connect_args`, or drop one and keep the other). Document the chosen value in a constant. Also add:
```python
cur.execute("PRAGMA cache_size=-65536") # negative value = size in KiB, i.e. 64 MiB
cur.execute("PRAGMA mmap_size=268435456") # 256 MiB
cur.execute("PRAGMA wal_autocheckpoint=1000")
```
The 100k-asset album write pattern (`asset_ids` JSON blob) benefits significantly from a larger page cache and mmap; current defaults force a lot of SQLite-internal I/O.
---
## HIGH
### H1. Frontend dashboard event-stream lacks virtualization & double-fetches on filter changes ([+page.svelte:739](frontend/src/routes/+page.svelte))
`{#each status.recent_events as event, i (event.id)}` is keyed (good), but the page renders every event row with rich nested components (`EventDetailModal`, `MdiIcon`, etc.) for every paginate-back/forward. There's no row virtualization and the same data fetches re-run on every filter mutation (search input has a 300 ms debounce in `onSearchInput`, but `filterEventType`, `filterProviderId`, `filterSort`, `refreshSeconds` do not).
**Impact:** UX — choppy on tenants with 50+ events/page, perceptible filter-flicker; CPU — derivation cost on every status refresh.
**Fix:**
1. Wrap the events list in a tiny windowing component (svelte-virtual or a simple offset/limit windowed view — the API already supports it).
2. Debounce the entire filter-change branch, not just the search input (`$effect(() => { if (settled) { reload() }})` with a 100 ms guard).
3. The provider count map (`provider_event_counts`) is computed server-side for *all* matching events on every page request; cache it for `(user_id, filters)` in a 30-s in-memory dict server-side (see also #H2).
---
### H2. `provider_event_counts` aggregate query runs unbounded GROUP BY on every dashboard request ([status.py:84-103](packages/server/src/notify_bridge_server/api/status.py))
```python
provider_counts_query = (
    select(
        EventLog.provider_id,
        EventLog.provider_name,
        func.sum(func.coalesce(EventLog.assets_count, 1)).label("total"),
    )
    .where(EventLog.user_id == user.id)
    .group_by(EventLog.provider_id, EventLog.provider_name)
)
```
Every dashboard load (every 10-60 s by default; see `refreshIntervalItems`) runs `GROUP BY provider_id, provider_name` over *every* event the user ever owned. At 90 days × ~1 event/min/tracker this is hundreds of thousands of rows scanned per refresh per logged-in user.
**Impact:** latency — SQLite forces a full table scan + sort here because the only composite index is `(user_id, event_type, created_at DESC)`; cost — burns CPU on the bridge box for a metric that changes very slowly.
**Fix:**
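1. Add `ix_event_log_user_provider (user_id, provider_id)` so the GROUP BY can walk the index in order instead of sorting. For a truly index-only plan the index must also cover `provider_name` and `assets_count`, since the query reads both.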
2. Cache the result for `(user_id, filter_signature)` for 30 s in the same in-memory cache as #H1.
3. Long-term: materialize per-provider counts into an `event_counter` table maintained by triggers or an APScheduler job. The dashboard then reads at most a dozen rows.
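The 30-s cache from step 2 could look roughly like the sketch below. `TTLCache` is a hypothetical helper, not something that exists in the codebase, and the key shape `(user_id, filter_signature)` is taken from the wording above. The clock is injectable so the expiry logic can be tested without sleeping.

```python
import time
from typing import Any, Callable, Hashable


class TTLCache:
    """Tiny in-process cache whose keys are (user_id, filter_signature) tuples."""

    def __init__(
        self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic
    ) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: dict[tuple[int, Hashable], tuple[float, Any]] = {}

    def get_or_compute(
        self, key: tuple[int, Hashable], compute: Callable[[], Any]
    ) -> Any:
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]  # still fresh: skip the GROUP BY entirely
        value = compute()
        self._entries[key] = (now + self._ttl, value)
        return value

    def invalidate_user(self, user_id: int) -> None:
        """Drop every cached aggregate for one user, e.g. after a bulk delete."""
        for key in [k for k in self._entries if k[0] == user_id]:
            del self._entries[key]
```

Expired entries are only replaced when the same key is requested again, so a production version would also want a size bound or periodic sweep.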
---
### H3. `_tracker_response` issues one query per tracker-target link ([notification_trackers.py:286-291](packages/server/src/notify_bridge_server/api/notification_trackers.py))
```python
async def _tracker_response(session: AsyncSession, t: NotificationTracker) -> dict:
    result = await session.exec(
        select(NotificationTrackerTarget).where(NotificationTrackerTarget.tracker_id == t.id)
    )
    tracker_targets = [await _tt_response(session, tt) for tt in result.all()]
```
`_tt_response` (in notification_tracker_targets.py:12 — has 12 distinct `select`/`session.get` references) issues per-link follow-up SELECTs. Called from `create`, `update`, `delete` and `trigger` for a single tracker, so the practical N is small — but `_tt_response` is also called inside the bulk `list_notification_trackers` loop's downstream consumers, and any future bulk endpoint will multiply this badly.
**Impact:** latency on POST/PATCH responses; future regression risk.
**Fix:** rewrite `_tt_response` to accept pre-fetched maps (mirror the pattern in `dispatch_helpers.load_link_data`). Or, simpler: write a single eager-load helper using `selectinload(NotificationTrackerTarget.target)` once `relationship()` mappers are declared on the models.
---
### H4. `load_link_data` does not eagerly load target.config related entities — relies on `dict(target.config)` snapshotting ([dispatch_helpers.py:539-747](packages/server/src/notify_bridge_server/services/dispatch_helpers.py))
The function batch-loads receivers, telegram_chats, email_bots, matrix_bots up-front, but the broadcast-expansion branch in the active_links loop still issues `_resolve_target` per child target (line 715). That `_resolve_target` is called with all the pre-fetched maps, so it doesn't *query* per call — but it does build a fresh `target_config` dict per child. With a broadcast target containing 50 children fanning out 100 events/min this is constant garbage collection pressure.
**Impact:** GC pressure under load; not a correctness problem.
**Fix:** none required short-term. Long-term, add `selectinload` declarations on the relationship model so SQLAlchemy can co-fetch the chain. The code path is already well-batched.
---
### H5. `aiohttp.ClientSession` is constructed per-call inside `NotificationDispatcher._session_ctx` when no shared session is provided ([dispatcher.py:117-123](packages/core/src/notify_bridge_core/notifications/dispatcher.py))
```python
@contextlib.asynccontextmanager
async def _session_ctx(self) -> AsyncIterator[aiohttp.ClientSession]:
    if self._shared_session is not None and not self._shared_session.closed:
        yield self._shared_session
        return
    async with _new_session() as session:
        yield session
```
In server-side code paths (watcher, event_dispatch, deferred_dispatch) a shared session is always passed in, so this is harmless. But unit tests, the CLI, and any direct library user that instantiates `NotificationDispatcher` without a session pays the cost. Worse, the per-dispatch session creates a fresh TCP pool, fresh DNS resolver — defeating connection reuse to Telegram / Discord webhook hosts.
**Impact:** test slowness; correctness if a non-server consumer ever ships.
**Fix:** require the `session` parameter (`session: aiohttp.ClientSession` not `| None`). Or have the dispatcher lazily attach to a module-level `_default_session` cached by event loop id.
---
### H6. `WebhookPayloadLog` is pruned per-insert via a sub-select but the prune query has no UNIQUE/partial protection against duplicate inserts ([webhooks.py:404-418](packages/server/src/notify_bridge_server/api/webhooks.py))
The "keep newest `max_count` per provider, delete the rest" pattern uses `select(...).order_by(created_at DESC).limit(max_count)` as a subquery. Under SQLite this materializes the top-N then negates it — fine when max_count is 20. But this runs on every inbound webhook. For a busy Gitea/HA installation that's 60+ writes/min, each with a delete-by-sub-select. The `ix_webhook_payload_log_provider_created` index makes the read cheap, but the DELETE still rewrites pages.
**Impact:** write amplification on busy webhook tenants.
**Fix:** keep the prune but make it probabilistic — only run with `random.random() < 0.1` (10% chance per insert). The cap still holds in steady state, but the per-write cost drops 10×.
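A sketch of the probabilistic prune, assuming plain `sqlite3` and a simplified, hypothetical table layout (`id`, `provider_id`, `body`); the real model orders by `created_at`, which is replaced here by the autoincrement `id` to keep the example short. The random source is a parameter so the behaviour is deterministic under test.

```python
import random
import sqlite3
from typing import Callable

PRUNE_PROBABILITY = 0.1  # the 10% chance suggested above

# Hypothetical, simplified schema used only for this sketch.
SCHEMA = """
CREATE TABLE webhook_payload_log (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    provider_id INTEGER NOT NULL,
    body TEXT NOT NULL
)
"""


def insert_payload(
    conn: sqlite3.Connection,
    provider_id: int,
    body: str,
    max_count: int,
    rng: Callable[[], float] = random.random,
) -> bool:
    """Insert one row; prune older rows only on a fraction of calls.

    Returns True when the prune ran.
    """
    conn.execute(
        "INSERT INTO webhook_payload_log (provider_id, body) VALUES (?, ?)",
        (provider_id, body),
    )
    if rng() >= PRUNE_PROBABILITY:
        return False
    # Keep the newest max_count rows for this provider, delete the rest.
    conn.execute(
        """
        DELETE FROM webhook_payload_log
        WHERE provider_id = ?
          AND id NOT IN (
              SELECT id FROM webhook_payload_log
              WHERE provider_id = ?
              ORDER BY id DESC
              LIMIT ?
          )
        """,
        (provider_id, provider_id, max_count),
    )
    return True
```

The trade-off is that the table can overshoot `max_count` between prunes (by about ten rows per provider on average at a 10% rate), so any code that assumes an exact cap on read should apply its own `LIMIT`.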
---
### H7. No retention/archival for `notification_tracker_state` and `deferred_dispatch` "fired"/"dropped" rows ([scheduler.py:332-364](packages/server/src/notify_bridge_server/services/scheduler.py))
`_cleanup_old_events` deletes `event_log`, `webhook_payload_log`, `action_execution` older than retention days. `deferred_dispatch` rows with `status IN ('fired', 'dropped')` are never deleted. `notification_tracker_state.asset_ids` for an immich tracker watching a deleted collection is also never reaped.
**Impact:** unbounded growth on long-running installs; `asset_ids` JSON blobs can be megabytes per collection.
**Fix:** extend `_cleanup_old_events` to also delete `DeferredDispatch.status != 'pending' AND fired_at < cutoff`. Add a separate housekeeping job that prunes `NotificationTrackerState` rows whose `collection_id` is no longer in `NotificationTracker.collection_ids`.
---
## MEDIUM
### M1. Sentinel value `bot_id=0` is a footgun ([models.py:69-73](packages/server/src/notify_bridge_server/database/models.py))
```python
# bot_id=0 is a sentinel meaning "Telegram has not yet returned a numeric
# ID for this bot" (i.e. token never validated). Multiple unverified bots
# may legitimately carry 0, so we only enforce uniqueness for non-sentinel
# values via a partial index added in migrate_uniqueness_constraints.
bot_id: int = Field(default=0, index=True)
```
Sentinel values on indexed columns hurt index selectivity (every unvalidated bot is the same row from the planner's perspective) and create maintenance burden. Worse, every code path that looks up by `bot_id` must remember to filter `bot_id != 0`.
**Impact:** maintainability; latent bug surface (one missed `!= 0` filter and an unverified bot is silently re-used).
**Fix:** change `bot_id: int | None` defaulting to None, drop the sentinel.
---
### M2. No request-scoped cache for `user.id` lookups inside one request ([api/*.py, throughout](packages/server/src/notify_bridge_server/api/))
The same `get_current_user` dependency runs JWT validation + a `session.get(User, id)` on every request. Many endpoints then do their *own* `user.id`-filtered SELECTs. There is no per-request memoization of the User row.
**Impact:** one extra SELECT per request, mostly noise — but it's free to fix.
**Fix:** in `get_current_user`, cache the User on `request.state.user`. Routes that take `user: User = Depends(...)` are unchanged.
---
### M3. SQLAlchemy async pool defaults serialize SQLite writers but the engine allows multiple connections ([engine.py:41-57](packages/server/src/notify_bridge_server/database/engine.py))
The default pool that `create_async_engine` picks for SQLite has changed across SQLAlchemy releases: depending on the version and on whether the database is a file or in-memory, it can be `NullPool` (one connection per checkout), `StaticPool` (a single shared connection), or a small queue pool. The code does not pin this explicitly. Under WAL, multiple readers are fine but only one writer can hold the txn at a time, so a slow writer just makes other connections block on `busy_timeout`.
**Impact:** unpredictable behaviour across SQLAlchemy versions; sporadic `SQLITE_BUSY` under load.
**Fix:** explicitly configure the pool:
```python
from sqlalchemy.pool import StaticPool, AsyncAdaptedQueuePool
_engine = create_async_engine(
    url,
    echo=settings.debug,
    pool_pre_ping=True,
    connect_args=connect_args,
    poolclass=AsyncAdaptedQueuePool,
    pool_size=5,
    max_overflow=10,
    pool_recycle=3600,
)
```
The values above suit a server database such as Postgres. For SQLite the right choice is `StaticPool` + `connect_args={"check_same_thread": False}` to share one connection across the event loop (this is the supabase/pgbouncer pattern adapted for sqlite-async); `pool_size` and `max_overflow` do not apply to `StaticPool` and must be dropped in that case.
---
### M4. `_refresh_telegram_chat_titles` issues per-chat HTTP without per-bot bucketing ([scheduler.py:229-329](packages/server/src/notify_bridge_server/services/scheduler.py))
The job builds `tasks` as a flat list across all bots and runs them under a global `Semaphore(10)`. A bot with 50 chats and a slow Telegram response (rare but happens) can monopolize all 10 slots, starving every other bot. The semaphore should be per-bot.
**Impact:** the daily refresh can take much longer than intended on a multi-bot install with one degraded bot.
**Fix:** create one semaphore per bot:
```python
sems = {bot_id: asyncio.Semaphore(_CHAT_SYNC_CONCURRENCY) for bot_id in bot_tokens}
```
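How the per-bot semaphore map might be wired into the fan-out, as a sketch. `refresh_all`, `chats_by_bot`, and `fetch_title` are hypothetical names; the limit of 10 mirrors the global `Semaphore(10)` mentioned above, and the real job's task-building code may look different.

```python
import asyncio
from collections.abc import Awaitable, Callable

_CHAT_SYNC_CONCURRENCY = 10  # assumed to match the current global limit


async def refresh_all(
    chats_by_bot: dict[int, list[int]],
    fetch_title: Callable[[int, int], Awaitable[str]],
    per_bot_limit: int = _CHAT_SYNC_CONCURRENCY,
) -> dict[tuple[int, int], str]:
    """Fetch every chat title, limiting concurrency per bot rather than globally."""
    sems = {bot_id: asyncio.Semaphore(per_bot_limit) for bot_id in chats_by_bot}

    async def one(bot_id: int, chat_id: int) -> tuple[tuple[int, int], str]:
        # A slow bot can only exhaust its own semaphore, never another bot's.
        async with sems[bot_id]:
            return (bot_id, chat_id), await fetch_title(bot_id, chat_id)

    pairs = await asyncio.gather(
        *(one(bot_id, chat_id)
          for bot_id, chat_ids in chats_by_bot.items()
          for chat_id in chat_ids)
    )
    return dict(pairs)
```

Total concurrency now scales with the number of bots, so an install with many bots may still want a generous global ceiling on top of the per-bot limit.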
---
### M5. `event_log.collection_name.contains(search)` triggers full table scan on filter ([status.py:69-75](packages/server/src/notify_bridge_server/api/status.py))
The dashboard search input runs four `.contains(search)` clauses ORed together — these become `LIKE '%search%'` and cannot use a regular B-tree index. With 100k+ event_log rows the dashboard search becomes a multi-second operation.
**Impact:** UX — search feels broken on large installs; CPU on the bridge box.
**Fix:**
1. Limit the search to the most recent N days (e.g. retention/3) — most users only search recent events.
2. Add a SQLite FTS5 virtual table mirroring event_log's text columns, sync via triggers. Searches use `MATCH 'foo'` which is sub-millisecond on million-row tables.
---
### M6. `DeferredDispatch.event_payload` JSON blob can grow unbounded per row ([models.py:639-659](packages/server/src/notify_bridge_server/database/models.py), [deferred_dispatch.py:188-298](packages/server/src/notify_bridge_server/services/deferred_dispatch.py))
The asset-coalescing union path appends every new asset's full dict (filename, urls, tags, extra metadata) into `event_payload["added_assets"]`. A mass-import that adds 50k photos during a quiet window means one DeferredDispatch row with 50k asset entries.
**Impact:** memory blow-up at drain time (the whole JSON is parsed via `deserialize_event` into a Python list of `MediaAsset` dataclasses); could trip the drain timeout (`_DRAIN_DISPATCH_TIMEOUT_SECONDS=120`) on legitimate workloads.
**Fix:** cap the union at e.g. 500 assets per row; when crossed, emit a "more_truncated" sentinel into `payload["extra"]` so the rendered template can show "+45000 more". The `apply_tracking_display_filters` `max_assets_to_show` does cap it for delivery, but the *stored* payload is uncapped.
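A sketch of the capped union, under the assumption that each asset dict carries a unique `"id"` key (the real payload shape is not shown in this review). `coalesce_assets` is a hypothetical helper; the 500 limit and the `more_truncated` key come from the fix described above.

```python
MAX_STORED_ASSETS = 500  # cap suggested above


def coalesce_assets(payload: dict, new_assets: list[dict]) -> dict:
    """Merge new assets into payload["added_assets"] without exceeding the cap.

    Assets beyond the cap are counted in payload["extra"]["more_truncated"]
    instead of being stored, so a template can render "+N more".
    """
    stored = payload.setdefault("added_assets", [])
    extra = payload.setdefault("extra", {})
    seen = {asset["id"] for asset in stored}
    for asset in new_assets:
        if asset["id"] in seen:
            continue  # already coalesced earlier
        seen.add(asset["id"])
        if len(stored) < MAX_STORED_ASSETS:
            stored.append(asset)
        else:
            extra["more_truncated"] = extra.get("more_truncated", 0) + 1
    return payload
```

One limitation of this simple form: ids of truncated assets are not remembered between calls, so an asset that was dropped once and reported again later is counted twice in `more_truncated`. The count is therefore an upper bound rather than an exact figure.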
---
### M7. Per-tick `await get_app_timezone(session)` reads from the DB on every dispatch ([dispatch_helpers.py:146-150](packages/server/src/notify_bridge_server/services/dispatch_helpers.py))
Each tracker tick, each webhook, each defer evaluation calls `get_app_timezone` which calls `get_setting(session, "timezone")` which is a SELECT. The timezone setting rarely changes (manual setting), but the SELECT runs constantly.
**Impact:** noise on otherwise good caching.
**Fix:** cache the timezone in a module-level `(value, expires_at)` tuple with 60-s TTL, invalidated by `reschedule_cron_jobs_for_timezone_change`.
---
### M8. Unbounded in-memory dictionaries with no TTL or capacity ([scheduler.py:67-72](packages/server/src/notify_bridge_server/services/scheduler.py), [telegram_poller.py:31-35](packages/server/src/notify_bridge_server/services/telegram_poller.py), [command_sync.py:25](packages/server/src/notify_bridge_server/services/command_sync.py))
```python
_adaptive_state: dict[int, dict[str, int]] = {}
_adaptive_max_skip: dict[int, int] = {}
_last_update_id: dict[int, int] = {}
_last_webhook_reclaim_at: dict[int, float] = {}
_dirty_bots: dict[int, float] = {}
```
Each is keyed by tracker_id / bot_id. When a tracker or bot is deleted, the cleanup paths (`unschedule_tracker`, etc.) do remove some entries — but not all. `_last_update_id`, `_last_webhook_reclaim_at` are never cleared on bot deletion.
**Impact:** slow memory leak in long-running processes that create+delete trackers/bots frequently (e.g. test environments).
**Fix:** on tracker/bot deletion, explicitly clear all module dicts that key by that id. Or, simpler, switch each to `weakref.WeakValueDictionary` once the entity has a Python object representation, or to a TTLCache.
---
### M9. Bulk insert pattern in migrations uses one-statement-per-row ([migrations.py:566-588](packages/server/src/notify_bridge_server/database/migrations.py))
`migrate_tracker_targets` issues `INSERT INTO ... VALUES (...)` per row in a Python for-loop. On a tenant with 10k+ legacy rows this is slow even inside a single transaction.
**Impact:** one-shot, but rough on upgrade for big tenants.
**Fix:** use `executemany` / batch INSERTs:
```python
await conn.execute(text("INSERT INTO ... VALUES (...)"), batch_params)
```
This is mostly historical (the migration is idempotent and skipped on subsequent runs), but worth fixing if you're touching the file.
---
### M10. Missing index on `notification_tracker_state(notification_tracker_id, collection_id)` ([models.py:454-478](packages/server/src/notify_bridge_server/database/models.py))
`check_tracker` reads state per tracker; the existing `ix_notification_tracker_state.notification_tracker_id` index (declared via `index=True`) supports that. But every state read is `WHERE tracker_id = ? AND collection_id = ?` (implicitly via the resulting dict). A composite would help; SQLite can do index-only scans here.
**Impact:** small. SQLite's index intersection plus the fact that one tracker typically has <20 collections makes this a minor win.
**Fix:** add `(notification_tracker_id, collection_id)` composite index to the `_INDEXES` list.
---
## LOW
### L1. `SELECT *` semantics from `select(Model)` ORM is unavoidable but verbose ([throughout services/, api/])
SQLModel's `select(ModelClass)` is effectively `SELECT all columns`. For wide rows like `TrackingConfig` (~70 columns of boolean flags) that's a lot of bytes per dispatch evaluation. There are no API list endpoints that return `TrackingConfig` from a hot path, so this is mostly cosmetic — but for pages that only need a handful of columns (e.g. `status.py`'s `tracker_id, name` map) the explicit-column form is already used. Continue that pattern.
---
### L2. `EventLog.details` JSON dict is reconstructed on every dashboard read ([status.py:258](packages/server/src/notify_bridge_server/api/status.py))
`details: e.details or {}` serializes the JSON every time. SQLite returns this as a parsed Python dict already (JSON column), so the cost is low; just a note that this is a hot path.
---
### L3. `event_log.collection_id` and `details` have no indexes; some webhook commands filter on them ([commands/immich/events.py:43](packages/server/src/notify_bridge_server/commands/immich/events.py))
The history-by-tracker endpoint uses the composite `ix_event_log_user_event_type_created` plus a hit on `notification_tracker_id` — fine. But `events.py`'s "last assets_added for this collection" queries (`event_type='assets_added' AND collection_id=?`) cannot use any current index optimally.
**Fix:** add `(event_type, collection_id, created_at DESC)` if these queries are called by users frequently (Telegram `/assets <album>` etc.).
---
### L4. JSON column types not declared with `JSONB` semantics ([models.py: many](packages/server/src/notify_bridge_server/database/models.py))
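SQLite has only `JSON` (text storage with `json_valid` checks). On Postgres you'd want `JSONB`. The codebase uses `Column(JSON)` from SQLModel; note that SQLAlchemy's generic `JSON` type maps to Postgres `JSON`, not `JSONB`, so a Postgres port would need `JSON().with_variant(JSONB(), "postgresql")`. No action needed for SQLite deployments.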
---
### L5. The `setup` lifespan runs migrations *inside* the FastAPI lifespan synchronously ([main.py:62-122](packages/server/src/notify_bridge_server/main.py))
The migrations + seeds + scheduler boot all run before `_READY = True`. On a cold start with a big DB this can take 10+ s during which `/api/ready` returns 503. That's correct, but `/api/health` is also unreachable because uvicorn does not begin serving requests until lifespan startup completes. For orchestrators that probe `/api/health`, this means startup-grace must be tuned.
**Fix:** start the HTTP listener first, run migrations as a background task, expose readiness flag through `/api/ready` only.
---
### L6. `ServiceProvider.config`, `NotificationTarget.config`, `Tracker.filters` JSON columns store secrets unencrypted ([models.py:42, 349, 399](packages/server/src/notify_bridge_server/database/models.py))
API keys, refresh tokens, webhook secrets, SMTP passwords all live in `config` JSON. Visible to anyone with DB read access. This is a known design trade-off (`backup_secrets_mode` controls export behaviour) but worth flagging.
**Fix:** out of scope for this review; consider an at-rest encryption layer keyed off `secret_key` (Fernet) for `config["api_key"]`, `config["password"]`, `access_token`, etc. — but only if your threat model justifies the operational cost.
---
### L7. Frontend `entity-cache.svelte.ts` has 30-s TTL but no cross-tab invalidation ([entity-cache.svelte.ts:14](frontend/src/lib/stores/entity-cache.svelte.ts))
Two browser tabs editing the same entity will see stale data for up to 30 s in the other tab. No `BroadcastChannel` listener.
**Fix:** add a `BroadcastChannel('notify-bridge-cache')` that calls `cache.invalidate()` on receipt. ~15 lines.
---
### L8. `providersCache.invalidate(); await load()` is two-step ([providers/+page.svelte:238, 250](frontend/src/routes/providers/+page.svelte))
`invalidate()` + immediate `fetch(true)` race against any in-flight request; the deduplication map handles it, but the explicit `await load()` is essentially `fetch(true)` directly. Simpler:
```typescript
providersCache.set(updatedList); // or fetch(true)
```
Cosmetic.
---
### L9. `details["dispatch_status"]` is a string enum but not declared as one ([deferred_dispatch.py:619-624](packages/server/src/notify_bridge_server/services/deferred_dispatch.py))
`dispatch_status` takes values `"deferred"`, `"deferred_then_dropped"`, `"deferred_then_failed"`, `"delivered_after_quiet_hours"`, `"dropped_quiet_hours_nondeferrable"`. They're scattered as string literals. The dashboard renders them.
**Fix:** declare an `Enum` once and import from both server and frontend types.
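On the server side the enum could be as small as the sketch below. The five values are the ones listed above; the class name `DispatchStatus` and the helper `details_for` are hypothetical. Mixing in `str` keeps the members JSON-serialisable, so existing `EventLog.details` rows and the dashboard see the same strings as before.

```python
import json
from enum import Enum


class DispatchStatus(str, Enum):
    DEFERRED = "deferred"
    DEFERRED_THEN_DROPPED = "deferred_then_dropped"
    DEFERRED_THEN_FAILED = "deferred_then_failed"
    DELIVERED_AFTER_QUIET_HOURS = "delivered_after_quiet_hours"
    DROPPED_QUIET_HOURS_NONDEFERRABLE = "dropped_quiet_hours_nondeferrable"


def details_for(status: DispatchStatus) -> str:
    """Serialise a details fragment exactly as the string literals did."""
    return json.dumps({"dispatch_status": status})
```

Parsing a stored value back with `DispatchStatus(raw)` raises `ValueError` on anything outside the set, which turns a typo in a literal into an immediate failure instead of a silently unrendered dashboard state.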
---
### L10. No DB connection used by `/api/health` ([main.py:270-274](packages/server/src/notify_bridge_server/main.py))
`/api/health` returns instantly without checking the DB. That's correct for a liveness probe but the comment doesn't match common practice ("liveness = process up"). Pair this with #L5: orchestrators using `/api/health` for warm-up will mark the pod ready while migrations are still running.
**Fix:** keep liveness lightweight, document the readiness probe as the warm-up gate.
---
## Notes on what's already good
- Performance indexes (`_INDEXES` list) cover all the right hot paths.
- Composite `(status, fire_at)` index on `deferred_dispatch` plus partial unique `(link_id, collection_id, event_type) WHERE status='pending'` prevents the worst races.
- `load_link_data` is fully batched — the most complex hot path in the codebase looks clean.
- Shared `aiohttp.ClientSession` with DNS-rebinding-safe `PinnedResolver` is production-grade.
- Pre-migration `VACUUM INTO` snapshot is the right safety net for a hand-rolled migration chain.
- APScheduler defaults (`coalesce=True`, `misfire_grace_time=300`, `max_instances=1`) are correct production settings.
- Adaptive polling (skip-N-of-K when idle) with jitter is a thoughtful 4-tier scheduling design.
- Tracker cache (5-s TTL with explicit invalidation) and rendered-message per-locale cache are good fan-out optimizations.
- Migration idempotency is genuinely well-handled despite the rough tooling.
- Frontend `entity-cache` deduplication of in-flight requests is the right pattern.
---
## Priority recommendations (next 30 days)
1. **Adopt Alembic** (C1, C2) — eliminate `create_all` from prod, baseline the current schema, lock down new schema changes through autogenerate.
2. **Fix the dashboard aggregate query** (H1, H2, M5) — add the missing composite index, server-side cache the per-provider aggregate, virtualize the event list. This is the single biggest user-visible perf win.
3. **Cap `DeferredDispatch.event_payload` size + add retention for fired/dropped rows** (M6, H7) — closes off the worst-case memory and growth scenarios.
4. **Cleanup module-level dicts on entity deletion** (M8) — small fix, prevents a slow leak.
5. **Standardize SQLite PRAGMAs and pool config** (C3, M3) — predictable behaviour, fewer spurious BUSY errors.
---
*Reviewed against codebase at HEAD (`a20635a`).*
@@ -0,0 +1,312 @@
# Security Review — notify-bridge v0.8.1
Reviewer: security-reviewer (Opus 4.7) — 2026-05-22
Branch: master @ a20635a
Scope: `packages/server`, `packages/core`, `frontend/src`, `Dockerfile`, `docker-compose.yml`, `.gitea/workflows/`, env handling.
---
## Executive Summary
- **Overall posture is strong.** The project applies many non-obvious controls correctly: Jinja2 `SandboxedEnvironment` on every render path; `bcrypt` with a 72-byte length guard and constant-time login (dummy hash on missing user); JWT with `token_version` revocation; SSRF guard with CGNAT, IPv4-mapped-IPv6 unwrapping, and a `PinnedResolver` that defeats DNS rebinding; secret-masking log filter; path-traversal-safe backup file resolver; security headers + CSP; non-root Docker user; required `SECRET_KEY` >= 32 chars with a rejection list; non-default Telegram webhook secret enforced; HMAC signature checks on Gitea/Generic webhooks; provider-config secret masking on GET; ownership checks (`get_owned_entity`) on every parameterised route I sampled.
- **HIGH — Home Assistant `access_token` is not masked.** It is stored in `provider.config`, never added to the mask list in `_provider_response`, never added to the placeholder-drop list in `update_provider`. Any logged-in user can `GET /api/providers/{id}` and read their HA token in cleartext, and a partial save will wipe it. Trivial fix.
- **HIGH — Secrets at rest are plaintext.** Telegram bot tokens (`telegram_bot.token`), provider configs containing `api_key`/`api_token`/`webhook_secret`/`access_token`/SMTP passwords, and email-bot SMTP passwords are stored unencrypted in SQLite. Disk theft, an unrelated read primitive, or any backup leak exposes all credentials. The masking on the API is good UX, but the DB itself has no encryption-at-rest. The exported JSON backup respects a `secrets_mode` flag (good) but the live DB does not.
- **MEDIUM — Template-preview endpoints bypass the timeout/size watchdog.** `template_configs.preview_config`, `template_configs.preview_raw`, `command_template_configs.preview_raw`, and `notifier.send_test_template_notification` construct fresh `SandboxedEnvironment(autoescape=False)` instances and call `.render(...)` directly. The hardened helper `render_template()` (timeout, source cap, output cap, autoescape) is bypassed. A logged-in user can wedge a worker thread with `{% for i in range(10**8) %}x{% endfor %}`. Single-tenant deployment limits the blast radius, but the renderer should be the single chokepoint.
- **MEDIUM — Login rate limit is per-IP only.** `POST /api/auth/login @ 5/min` keys on `get_remote_address`. An attacker behind a proxy / NAT, or one that rotates source IPs (cheap on residential / cloud), trivially bypasses it. There is no per-username lockout, no exponential backoff, no captcha. Combined with no MFA, this leaves the admin account vulnerable to a slow online dictionary attack from a single password (8-char minimum, no complexity requirement).
- **LOW / INFO — Several smaller findings**: webhook payload logs persist source payload (now with key-level redaction, but the redactor is name-based and will miss high-entropy secret values in non-obvious keys); no replay protection on inbound webhooks (no nonce/timestamp window); the `/api/auth/setup` 3/min limit + JWT issuance race window is hardened with a transaction count guard (good), but the dummy bcrypt hash literal used for timing-equalisation is malformed and `bcrypt.checkpw` returns `False` via `ValueError` — the swallowed exception still equalises timing, but a maintainer could regress this; CSP allows `script-src 'unsafe-inline'` (necessary for SvelteKit hydration, acceptable risk acknowledged in code).
---
## Findings
### CRITICAL
_None found._
---
### HIGH
#### H-1. Home Assistant access_token leaked in provider GET responses
- CWE: CWE-522 (Insufficiently Protected Credentials), CWE-200 (Exposure of Sensitive Information)
- Files:
- [`packages/server/src/notify_bridge_server/api/providers.py:616-624`](../../packages/server/src/notify_bridge_server/api/providers.py) — `_provider_response` masks `("api_key", "api_token", "webhook_secret", "password", "client_secret", "refresh_token")` but **not** `access_token`.
- [`packages/server/src/notify_bridge_server/api/providers.py:399-405`](../../packages/server/src/notify_bridge_server/api/providers.py) — `update_provider` also omits `access_token` from the placeholder-drop list, so the response masking is consistent here, but if you fix one you must fix the other.
- Scenario: Any user authenticated to the bridge (any role) calls `GET /api/providers/{id}` for an HA provider they own and the response includes `config.access_token` in cleartext. The HA long-lived token grants full control of the user's Home Assistant instance (lights, locks, cameras, scripts, devices). In a multi-user deployment, even within the same admin account, a stolen JWT exfiltrates the HA token; in a single-user deployment, any read primitive (XSS via a future template feature, an MITM on an HTTPS misconfiguration) gives the same result.
- Remediation: Add `access_token` to both lists.
```python
# providers.py:_provider_response
for secret_field in (
    "api_key", "api_token", "webhook_secret", "password",
    "client_secret", "refresh_token", "access_token",  # <-- add
):
    ...
# providers.py:update_provider
for secret_field in (
    "api_key", "api_token", "webhook_secret", "password",
    "client_secret", "refresh_token", "access_token",  # <-- add
):
    value = incoming.get(secret_field)
    if isinstance(value, str) and value.startswith("***"):
        incoming.pop(secret_field, None)
```
Better still: replace the hand-maintained tuple with a single module-level constant `_PROVIDER_SECRET_FIELDS` referenced from both call sites, plus a unit test that asserts every field declared on the per-provider Pydantic configs whose name appears in a denylist (`token`, `secret`, `password`, `key`, `credential`) is in the set. That prevents the next provider type from re-introducing the same gap.
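A sketch of that shared constant and the guard test's core check. The constant name comes from the paragraph above; `mask_config`, `drop_placeholders`, and `unlisted_sensitive_fields` are hypothetical helpers, and the `"***"` mask is an assumption based on the `startswith("***")` check in `update_provider` (the real mask may keep a suffix of the value).

```python
from collections.abc import Iterable

_PROVIDER_SECRET_FIELDS = frozenset({
    "api_key", "api_token", "webhook_secret", "password",
    "client_secret", "refresh_token", "access_token",
})
_SENSITIVE_NAME_PARTS = ("token", "secret", "password", "key", "credential")


def mask_config(config: dict) -> dict:
    """Return a copy with every non-empty secret replaced by a placeholder."""
    masked = dict(config)
    for field in _PROVIDER_SECRET_FIELDS:
        value = masked.get(field)
        if isinstance(value, str) and value:
            masked[field] = "***"
    return masked


def drop_placeholders(incoming: dict) -> dict:
    """Discard secrets the client echoed back in masked form, so they are not saved."""
    return {
        key: value
        for key, value in incoming.items()
        if not (
            key in _PROVIDER_SECRET_FIELDS
            and isinstance(value, str)
            and value.startswith("***")
        )
    }


def unlisted_sensitive_fields(field_names: Iterable[str]) -> set[str]:
    """Field names that look secret but are missing from the shared set."""
    return {
        name
        for name in field_names
        if any(part in name.lower() for part in _SENSITIVE_NAME_PARTS)
        and name not in _PROVIDER_SECRET_FIELDS
    }
```

The unit test would feed `unlisted_sensitive_fields` the field names of every provider config model and assert the result is empty; a name such as `public_key` that matches by accident would need an explicit allowlist entry.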
#### H-2. Secrets stored in plaintext at rest
- CWE: CWE-312 (Cleartext Storage of Sensitive Information), CWE-256 (Plaintext Storage of a Password)
- Files:
- [`packages/server/src/notify_bridge_server/database/models.py:54-84`](../../packages/server/src/notify_bridge_server/database/models.py) — `TelegramBot.token: str`
- [`packages/server/src/notify_bridge_server/database/models.py:87-100`](../../packages/server/src/notify_bridge_server/database/models.py) — `MatrixBot` (access_token in config)
- `ServiceProvider.config: dict[str, Any]` (JSON column) holds Immich `api_key`, Gitea `webhook_secret` + `api_token`, Google Photos `client_secret` + `refresh_token`, HA `access_token`, etc.
- `EmailBot.smtp_password: str` (per [`api/email_bots.py:142`](../../packages/server/src/notify_bridge_server/api/email_bots.py))
- Scenario: An attacker who can read the SQLite file (compromised host, mis-permissioned backup volume, snapshot artifact in `data_dir/backups/`, leaked debug dump) gets every credential the bridge speaks: Telegram bot tokens (full bot control), Immich/Gitea/Planka API keys (read all photos / repos), Google Photos refresh tokens (long-lived, hard to revoke at scale), HA long-lived tokens (smart-home), SMTP passwords. The pre-migrate VACUUM-INTO snapshots (`packages/server/src/notify_bridge_server/database/snapshot.py`) inherit the same plaintext exposure and live alongside the active DB.
- Remediation options, in order of effort:
1. **Short term**: document the threat in `OPERATIONS.md`, enforce file-system permissions on `/data` (the Dockerfile chowns to appuser already, but the host bind-mount must be `chmod 700`), and ensure backups are encrypted at the storage layer (S3 SSE / Borg / restic).
2. **Better**: column-level encryption with a key derived from `NOTIFY_BRIDGE_SECRET_KEY` (or a separate `NOTIFY_BRIDGE_DB_ENCRYPTION_KEY`). Use the `cryptography` library's `Fernet` for each sensitive column; envelope the secret JSON keys, not the whole row, so `WHERE` clauses and existing migrations keep working. Add a one-shot migration that re-encrypts existing rows.
3. **Best**: encrypt with a KMS-backed key (HashiCorp Vault Transit, AWS KMS) and rotate per-secret data keys. This is overkill for a homelab-style deployment but mandatory if the bridge is ever multi-tenant.
- Skeleton for option 2:
```python
# new file packages/server/src/notify_bridge_server/security/secretbox.py
import base64
import hashlib

from cryptography.fernet import Fernet, InvalidToken

from .config import settings


def _key() -> bytes:
    # Derive a deterministic Fernet key from secret_key. Anyone with secret_key
    # can decrypt (same threat model as JWT signing), but anyone with the DB
    # alone cannot.
    h = hashlib.sha256(settings.secret_key.encode()).digest()
    return base64.urlsafe_b64encode(h)


_fernet = Fernet(_key())


def encrypt_secret(plaintext: str) -> str:
    return _fernet.encrypt(plaintext.encode()).decode()


def decrypt_secret(ciphertext: str) -> str:
    return _fernet.decrypt(ciphertext.encode()).decode()
```
Apply at write time in `update_provider` / `create_provider`, decrypt at read time inside `make_immich_provider`, `make_gitea_provider`, the Telegram client constructor, etc. Add a migration that scans every `ServiceProvider.config` JSON and re-encrypts the listed keys in place.
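The in-place re-encryption step of that migration could look roughly like the sketch below. It is stdlib-only and makes two assumptions that are not in the codebase: a hypothetical `SENSITIVE_KEYS` set, and an `enc:v1:` marker prefix so that re-running the migration does not double-encrypt. The `encrypt` argument stands in for `encrypt_secret`.

```python
from typing import Any, Callable

# Hypothetical: config keys treated as secrets, and a marker that tags
# already-encrypted values so the migration can be re-run safely.
SENSITIVE_KEYS = {"api_key", "api_token", "webhook_secret",
                  "client_secret", "refresh_token", "access_token"}
MARKER = "enc:v1:"


def encrypt_config(config: dict[str, Any],
                   encrypt: Callable[[str], str]) -> dict[str, Any]:
    """Return a copy of `config` with sensitive string values encrypted."""
    out = dict(config)
    for key, value in config.items():
        if key not in SENSITIVE_KEYS or not isinstance(value, str):
            continue  # leave non-secret and non-string values alone
        if not value or value.startswith(MARKER):
            continue  # empty, or already migrated
        out[key] = MARKER + encrypt(value)
    return out
```

Because non-secret keys stay in plaintext, any query that filters on them keeps working, which is the reason for encrypting values rather than the whole JSON blob.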
---
### MEDIUM
#### M-1. Template preview endpoints skip the renderer watchdog
- CWE: CWE-400 (Uncontrolled Resource Consumption), CWE-1333 (Inefficient Regular Expression Complexity — analogous)
- Files:
- [`packages/server/src/notify_bridge_server/api/template_configs.py:608-613`](../../packages/server/src/notify_bridge_server/api/template_configs.py) — `preview_config` calls `SandboxedEnvironment(autoescape=False).from_string(template_body).render(...)` directly.
- [`packages/server/src/notify_bridge_server/api/slot_helpers.py:72-90`](../../packages/server/src/notify_bridge_server/api/slot_helpers.py) — `render_template_preview` (used by `/preview-raw` for both notification and command templates).
- [`packages/server/src/notify_bridge_server/services/notifier.py:494-499`](../../packages/server/src/notify_bridge_server/services/notifier.py) — `send_test_template_notification`.
- The hardened helper [`packages/core/src/notify_bridge_core/templates/renderer.py:48-108`](../../packages/core/src/notify_bridge_core/templates/renderer.py) (with timeout, length caps, output cap) is **not** used here.
- Scenario: An authenticated admin submits nested loops such as `{% for i in range(100000) %}{% for j in range(100000) %}x{% endfor %}{% endfor %}` to `POST /api/template-configs/preview-raw`. Jinja2 has no built-in timeout, and the sandbox caps a single `range()` at 100,000 items but does not limit nesting, so it blocks attribute access but not CPU. The request ties up the FastAPI event loop's executor thread until the worker is OOM-killed or the client times out. Repeat to DoS the API.
- Remediation: Route every render through a single, hardened helper.
```python
# Use the existing core helper consistently
from notify_bridge_core.templates.renderer import render_template
rendered = render_template(template_str, context) # already has timeout + caps
```
For the strict-undefined two-pass validation in `render_template_preview`, fold the watchdog into the helper itself rather than skipping it.
#### M-2. Login rate limit is per-IP only
- CWE: CWE-307 (Improper Restriction of Excessive Authentication Attempts)
- Files: [`packages/server/src/notify_bridge_server/auth/routes.py:140-157`](../../packages/server/src/notify_bridge_server/auth/routes.py).
- Scenario: `@limiter.limit("5/minute")` keyed on `get_remote_address` gives 5 attempts per source IP per minute = ~7,200/day per IP. An attacker rotating across 10 IPs (cheap cloud, residential proxies, even a Tor exit pool) gets 72,000/day. With the 8-character minimum password and no complexity requirement, a common 8-character password is reachable in days, not centuries. There is no per-username lockout, no captcha, no MFA.
- Remediation:
1. Add a per-username sliding-window limiter on top of the per-IP one. Use a second `Limiter` whose `key_func` returns the lower-cased username from the body. Re-check after parsing the body.
2. Add an exponential lockout: after N consecutive failures for a username, require a cooldown (record in a `LoginFailure` table or in-memory TTLCache).
3. Document and recommend deploying behind a reverse proxy that adds CAPTCHA / WAF rate-limiting for login (Cloudflare Turnstile is cheap).
4. Track and log failed logins (auth-event audit trail) with src IP + username + timestamp.
```python
# Sketch: a second limiter that keys by username from the parsed body.
# `_username_attempts` is assumed to be a TTL-expiring mapping
# (e.g. an in-memory TTLCache with a 15-minute TTL).
async def _check_username_quota(username: str) -> None:
    # 10 attempts per username per 15 minutes
    attempts = _username_attempts.get(username, 0)
    if attempts >= 10:
        raise HTTPException(429, "Too many attempts for this account")
    _username_attempts[username] = attempts + 1
```
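For remediation step 1, a self-contained sliding-window variant might look like the following. This is a minimal stdlib-only sketch, not the project's limiter: the class name `UsernameLimiter` and the injectable `clock` parameter are illustrative assumptions, and the store is per-process and unbounded, so a real deployment would need eviction or a shared backend.

```python
import time
from collections import defaultdict, deque
from typing import Callable


class UsernameLimiter:
    """Allow at most `limit` attempts per username within `window` seconds."""

    def __init__(self, limit: int = 10, window: float = 900.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.limit = limit
        self.window = window
        self.clock = clock
        self._hits: dict[str, deque[float]] = defaultdict(deque)

    def allow(self, username: str) -> bool:
        key = username.strip().lower()  # "Admin" and "admin " share a quota
        now = self.clock()
        hits = self._hits[key]
        # Drop attempts that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True
```

The login handler would call `allow(body.username)` after parsing the body and return 429 when it is `False`, in addition to the existing per-IP check.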
#### M-3. Webhook payload log redactor is keyword-based, misses value-based secrets
- CWE: CWE-532 (Insertion of Sensitive Information into Log File)
- Files: [`packages/server/src/notify_bridge_server/api/webhooks.py:326-358`](../../packages/server/src/notify_bridge_server/api/webhooks.py).
- Scenario: `_redact_sensitive_body` walks the JSON and redacts values whose **keys** contain `token`, `auth`, `key`, `secret`, etc. A webhook provider that ships secrets under an innocent key (e.g. `"oauth_state": "ya29.a0..."`, `"continuation": "ABCDE..."`, `"x_state": "..."`) leaves the secret in the persisted payload log. The log row is admin-readable and exported in backups.
- Remediation: Layer a high-entropy value detector on top of the key matcher (e.g. any value matching `[A-Za-z0-9_\-+/=]{32,}` with Shannon entropy ≥ 3.5 bits per character). At a minimum, also redact values with known prefixes (`ya29.`, `xoxb-`, `ghp_`, `glpat-`, `sk-`, `Bearer `).
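A value-based detector along these lines could be sketched as follows. The function names and the exact prefix tuple are illustrative assumptions; the regex and the 3.5 threshold come from the remediation text, and both would need tuning against real payloads to keep false positives down.

```python
import math
import re
from collections import Counter

KNOWN_PREFIXES = ("ya29.", "xoxb-", "ghp_", "glpat-", "sk-", "Bearer ")
TOKEN_RE = re.compile(r"[A-Za-z0-9_\-+/=]{32,}")


def shannon_entropy(s: str) -> float:
    """Entropy in bits per character of the string's own character distribution."""
    counts = Counter(s)
    n = len(s)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def looks_like_secret(value: str, min_entropy: float = 3.5) -> bool:
    # Cheap check first: well-known credential prefixes.
    if value.startswith(KNOWN_PREFIXES):
        return True
    # Otherwise require a long token-shaped string that is also high-entropy.
    return bool(TOKEN_RE.fullmatch(value)) and shannon_entropy(value) >= min_entropy
```

The redactor would call `looks_like_secret` on every string leaf whose key did not already match the keyword list.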
#### M-4. Webhook ingestion has no replay protection
- CWE: CWE-294 (Authentication Bypass by Capture-replay)
- Files: [`packages/server/src/notify_bridge_server/api/webhooks.py`](../../packages/server/src/notify_bridge_server/api/webhooks.py) — Gitea/Planka/Generic.
- Scenario: An attacker who once intercepts a signed Gitea push event (network downgrade, log leak from a proxy, exfil from the Gitea side) can replay it indefinitely. The HMAC stays valid; the bridge has no nonce / timestamp window / delivery-ID cache. With a webhook that fires `assets_added` it's just noise. With a webhook that triggers an action (planka card-created → `/api/actions/{id}/execute` chained logic), it could be more.
- Remediation: For Gitea, store the last N `X-Gitea-Delivery` UUIDs per provider and reject duplicates; cap with a partial unique index. For the generic webhook, add an optional `replay_window_seconds` + a timestamp-extracting JSONPath in the provider config. Keep the signature check a constant-time string compare.
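The duplicate-delivery check can be illustrated with an in-memory stand-in. This is a sketch only: `DeliveryCache` and `signature_ok` are hypothetical names, and a per-process cache does not survive restarts or span multiple workers, which is why the remediation above points at a database table.

```python
import hmac
from collections import OrderedDict


class DeliveryCache:
    """Remember the last `capacity` delivery IDs and flag repeats."""

    def __init__(self, capacity: int = 1000) -> None:
        self.capacity = capacity
        self._seen: OrderedDict[str, None] = OrderedDict()

    def is_replay(self, delivery_id: str) -> bool:
        if delivery_id in self._seen:
            return True
        self._seen[delivery_id] = None
        if len(self._seen) > self.capacity:
            self._seen.popitem(last=False)  # evict the oldest ID
        return False


def signature_ok(expected_hex: str, received_hex: str) -> bool:
    # compare_digest avoids leaking the first mismatching position via timing.
    return hmac.compare_digest(expected_hex, received_hex)
```

The replay check should run only after the signature verifies, so unauthenticated senders cannot push legitimate IDs out of the cache.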
#### M-5. `bcrypt.checkpw` dummy-hash literal is malformed
- CWE: CWE-208 (Observable Timing Discrepancy) — partial.
- Files: [`packages/server/src/notify_bridge_server/auth/routes.py:147-152`](../../packages/server/src/notify_bridge_server/auth/routes.py).
- Scenario: When the username doesn't exist, the code calls `_verify_password(body.password, "$2b$12$" + "a" * 53)`. That hash is not a real bcrypt hash; `bcrypt.checkpw` raises `ValueError` which `_verify_password` swallows and returns `False`. The exception path is *faster* than a real bcrypt verify (no key schedule), so timing of "user does not exist" differs from "user exists, wrong password" — a maintainer changing the swallow behaviour later could regress this entirely.
- Remediation: Cache one valid dummy bcrypt hash at module load time so the verify path actually runs the KDF.
```python
_DUMMY_BCRYPT_HASH = bcrypt.hashpw(b"x", bcrypt.gensalt()).decode()  # module load
...
password_ok = await _verify_password(
    body.password,
    user.hashed_password if user else _DUMMY_BCRYPT_HASH,
)
```
#### M-6. Setup endpoint relies on `User.id != 0` filter — robust but a single typo breaks it
- CWE: CWE-302 (Authentication Bypass by Assumed-Immutable Data); defence-in-depth.
- Files: [`packages/server/src/notify_bridge_server/auth/routes.py:97-119`](../../packages/server/src/notify_bridge_server/auth/routes.py).
- Scenario: `POST /api/auth/setup` is gated by "no users with id != 0". The `__system__` sentinel is id=0. If a future migration changes the sentinel id, or the `WHERE` clause is dropped during a refactor, setup re-opens silently and an internet-reachable bridge would let an attacker claim the admin account.
- Remediation: Add a defence-in-depth flag `AppSetting.setup_completed=true` set during the first successful setup, and require it to be unset (in addition to the count check). This bakes the invariant into a single boolean that's easier to audit.
#### M-7. Anonymous Prometheus metrics endpoint leaks operational data
- CWE: CWE-200 (Exposure of Sensitive Information to an Unauthorized Actor)
- Files: [`packages/server/src/notify_bridge_server/api/metrics.py:138-159`](../../packages/server/src/notify_bridge_server/api/metrics.py).
- Notes: This is **documented and gated** by `NOTIFY_BRIDGE_METRICS_ENABLED`, and the comment explicitly says scrapers don't authenticate. Acceptable when the API port is firewalled to the scraper. Surface it here as informational so an operator who exposes the API directly to the internet (e.g. via reverse-proxy without an ACL) doesn't accidentally expose dispatch rates, provider names, queue depths.
- Remediation: keep the env flag, but additionally allow `metrics_basic_auth_user` / `metrics_basic_auth_password` as a soft credential check on the endpoint so a "default enabled, default protected" mode is possible. Document the threat in `OPERATIONS.md` next to the env var.
---
### LOW
#### L-1. CSP allows `'unsafe-inline'` for scripts
- CWE: CWE-1021 (Improper Restriction of Rendered UI Layers or Frames) — adjacent.
- File: [`packages/server/src/notify_bridge_server/main.py:186-201`](../../packages/server/src/notify_bridge_server/main.py).
- Notes: Comment explicitly justifies it — SvelteKit static adapter emits an inline bootstrap. Acceptable, but `'strict-dynamic'` with a per-page nonce (or moving the bootstrap into a hashed external module) eliminates the gap entirely. Track as INFO unless future XSS-injection paths emerge.
#### L-2. CSP `style-src 'unsafe-inline'` allows inline-style XSS payloads
- CWE: CWE-79 (Cross-site Scripting) — defence-in-depth.
- Same file as L-1. Inline styles are not directly executable, but they are a known vector for click-jacking and data-exfil via CSS selectors. Same remediation path: nonce-based CSP.
#### L-3. `frame-ancestors 'none'` and `X-Frame-Options: DENY` are both set (no finding)
- INFO only. Both `X-Frame-Options: DENY` and `frame-ancestors 'none'` are set; modern browsers honour CSP, legacy ones honour XFO. Good.
#### L-4. Webhook `_filter_headers` allowlist accepts unknown `X-*` headers
- CWE: CWE-532
- File: [`packages/server/src/notify_bridge_server/api/webhooks.py:361-374`](../../packages/server/src/notify_bridge_server/api/webhooks.py).
- Notes: The filter strips known sensitive headers, then accepts any `X-*`. A custom auth header like `X-Custom-Authentication: <token>` would slip past the substring check if the name doesn't contain `auth`/`token`/`key`/`secret`/etc. Low risk because the well-known providers we support don't ship such headers, but a misconfigured generic webhook will leave a credential in the log row.
- Remediation: invert the policy and keep only an explicit allowlist of known-safe `X-*` headers. Even `X-Forwarded-For` is borderline, since it can carry PII.
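An inverted, allowlist-only filter is short enough to sketch. The contents of `SAFE_HEADERS` below are an assumption for illustration; the real list would be derived from what each supported provider actually sends.

```python
# Hypothetical allowlist: only these header names are persisted to the log row.
SAFE_HEADERS = frozenset({
    "content-type",
    "user-agent",
    "x-gitea-event",
    "x-gitea-delivery",
})


def filter_headers(headers: dict[str, str]) -> dict[str, str]:
    """Keep only allowlisted headers; everything else is dropped, not masked."""
    return {k: v for k, v in headers.items() if k.lower() in SAFE_HEADERS}
```

With this shape, an unexpected `X-Custom-Authentication` header is dropped by default instead of relying on a substring match to catch it.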
#### L-5. `external_url` setting is not validated against an allow-list
- CWE: CWE-918 (SSRF), CWE-79 (XSS in the rendered Telegram webhook URL).
- File: [`packages/server/src/notify_bridge_server/api/app_settings.py:329-339`](../../packages/server/src/notify_bridge_server/api/app_settings.py) reads, [`packages/server/src/notify_bridge_server/api/telegram_bots.py:247`](../../packages/server/src/notify_bridge_server/api/telegram_bots.py) writes it into the registered Telegram webhook URL.
- Notes: An admin can set `external_url` to anything. The value is used to build the URL passed to Telegram in `setWebhook`. Telegram itself enforces an HTTPS-only allow-list, so the actual risk is bounded. Still — validate scheme + host + that it doesn't include credentials or fragments.
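The validation suggested here could be sketched with the standard library. `validate_external_url` is a hypothetical helper name, and requiring `https` is an assumption based on the Telegram webhook use described above.

```python
from urllib.parse import urlsplit


def validate_external_url(raw: str) -> str:
    """Return a normalised external_url or raise ValueError."""
    url = raw.strip()
    parts = urlsplit(url)
    if parts.scheme != "https":
        raise ValueError("external_url must use https")
    if not parts.hostname:
        raise ValueError("external_url must include a host")
    if parts.username or parts.password:
        raise ValueError("external_url must not embed credentials")
    if parts.fragment:
        raise ValueError("external_url must not contain a fragment")
    return url.rstrip("/")
```

Running this at write time in the settings endpoint keeps a malformed value from ever reaching the `setWebhook` call.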
#### L-6. Bot token GET endpoint is intentional but worth auditing
- File: [`packages/server/src/notify_bridge_server/api/telegram_bots.py:148-156`](../../packages/server/src/notify_bridge_server/api/telegram_bots.py).
- Notes: `GET /api/telegram-bots/{bot_id}/token` returns the full Telegram bot token to the owner. Used by the frontend to construct webhook URLs. Limiting to a single short-lived nonce per `register_bot_webhook` flow would be safer than exposing the token directly. Currently INFO; revisit if a multi-user role model lands.
#### L-7. SQLite journal mode + backup snapshot file permissions
- File: [`packages/server/src/notify_bridge_server/database/snapshot.py:60-95`](../../packages/server/src/notify_bridge_server/database/snapshot.py).
- Notes: Snapshots are written via `VACUUM INTO 'path'`. They land in `data_dir/backups/` with default umask permissions. In the Docker image the dir is owned by `appuser` and only that user runs the process, so this is fine. On a host bind-mount, an operator who forgets to lock down `/data` exposes every credential in every snapshot to anyone with shell access. Document this in `OPERATIONS.md`.
#### L-8. No CSRF token on state-changing endpoints
- CWE: CWE-352
- Notes: The API uses `Authorization: Bearer <jwt>` exclusively (no cookies). Browsers don't auto-attach `Authorization` headers cross-origin, so this is **not** classical CSRF-exploitable. Combined with strict CORS (`allow_credentials=True`, explicit origin allowlist, wildcard rejected on startup) and the `Origin`/`Referer` same-host check on the backup endpoints, the practical risk is essentially zero. INFO only.
---
### INFO / NEEDS VERIFICATION
#### N-1. Jinja2 `SandboxedEnvironment` is the standard sandbox — confirm it covers your threat model
- The sandbox blocks `__class__`, `__mro__`, etc., but it is well-known that Jinja2's sandbox is not a security boundary against a determined attacker who can author templates. The threat model here is "templates are admin-authored, so we trust them but use the sandbox as defence-in-depth"; that is reasonable. Document explicitly in `OPERATIONS.md` that anyone with template-edit permission has effective RCE on the worker thread (`{{ foo.__init__.__globals__... }}` style escapes have been published in the past; new ones surface periodically).
- Verification: run `bandit -r packages/` and `safety check` against the pinned `jinja2` version. Sandbox-specific CVEs to track: `CVE-2024-56326` (fixed in 3.1.5) and `CVE-2025-27516` (fixed in 3.1.6); `CVE-2024-34064` is an `xmlattr` attribute-injection bug fixed in 3.1.4, not a sandbox escape. Treat `jinja2>=3.1.6` as the floor and re-check for newer disclosures at each release.
#### N-2. `apscheduler<4`
- Notes: The pin `apscheduler>=3.10,<4` keeps the bridge on the 3.x line, which is in maintenance. No known CVEs as of this review. Track when 4.x stabilises and migrate.
#### N-3. `python-multipart>=0.0.9`
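- Notes: This package has had DoS-class bugs: `CVE-2024-24762` (ReDoS, fixed in 0.0.7) and `CVE-2024-53981` (multipart-boundary parsing, fixed in 0.0.18). The 0.0.9 minimum covers the first but not the second; raise the floor to `>=0.0.18`.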
#### N-4. No signed-image / SBOM on the container
- Notes: The `release.yml` workflow builds and pushes a multi-tag image but does not sign with cosign or emit an SBOM. For an internet-facing deployment, consider adding `cosign sign` against the image digest, and `syft packages` to emit an SBOM at release time. INFO only.
#### N-5. Frontend dependencies are pinned via caret (`^`) ranges
- Notes: `package.json` uses `^x.y.z`. CI runs `npm ci` against `package-lock.json`, so reproducibility is fine at build time. There is no `npm audit` step in `.gitea/workflows/build.yml`. Add `npm audit --audit-level=high` to the frontend build job.
#### N-6. `NOTIFY_BRIDGE_ALLOW_PRIVATE_URLS=1` is a footgun
- File: [`packages/core/src/notify_bridge_core/notifications/ssrf.py:39-52`](../../packages/core/src/notify_bridge_core/notifications/ssrf.py).
- Notes: When set, the SSRF guard becomes a no-op. The warning at boot is the only mitigation. Acceptable for the documented homelab use-case; document that the env flag must NEVER be set on an internet-reachable instance, and consider refusing to enable it when `cors_allowed_origins` resolves to a non-loopback host (defence-in-depth interlock).
#### N-7. Verify the auth flow at the WebSocket boundary
- File: [`packages/core/src/notify_bridge_core/providers/home_assistant/client.py:54-83`](../../packages/core/src/notify_bridge_core/providers/home_assistant/client.py).
- The `_ws_url_from_base` correctly strips userinfo before connecting and `_redact` defangs error messages — verify that `wss://` URLs go through SSRF validation (currently the HA URL is validated by `AnyHttpUrl` at config time but I did not find a call to `avalidate_outbound_url_full` on the HA WS connect path; the resolver would not pin a host the validator never saw).
- Action: confirm by reading `ha_subscription.py` for explicit validation, or add a check that calls `avalidate_outbound_url_full` against the derived `ws_url` (treating `ws`/`wss` like `http`/`https` for the block-range check) before `ws_connect`.
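Treating `ws`/`wss` like `http`/`https` for validation purposes amounts to a scheme rewrite before the URL reaches the validator. A minimal sketch, assuming a hypothetical helper named `http_equivalent` whose result is then passed to the existing SSRF validator:

```python
from urllib.parse import urlsplit, urlunsplit

_WS_TO_HTTP = {"ws": "http", "wss": "https"}


def http_equivalent(ws_url: str) -> str:
    """Map a ws:// or wss:// URL to its http(s) form for block-range checks."""
    parts = urlsplit(ws_url)
    scheme = _WS_TO_HTTP.get(parts.scheme)
    if scheme is None:
        raise ValueError(f"not a WebSocket URL: {parts.scheme!r}")
    # Host, port, and path are unchanged; only the scheme is swapped.
    return urlunsplit(parts._replace(scheme=scheme))
```

The connect path would validate `http_equivalent(ws_url)` and only then open the socket against the original `ws_url`.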
---
## Prioritised Fix List (Top 10)
1. **HIGH H-1** — Add `access_token` to the secret-mask list in `providers._provider_response` and the placeholder-drop list in `providers.update_provider`. Add a regression test that GETs an HA provider and asserts the response does not contain the cleartext token.
2. **HIGH H-2** — Implement column-level encryption for `TelegramBot.token`, `MatrixBot` access tokens, `EmailBot.smtp_password`, and the sensitive keys inside `ServiceProvider.config`. Use Fernet with a key derived from `SECRET_KEY`. Write a one-shot migration.
3. **MEDIUM M-1** — Replace the ad-hoc `SandboxedEnvironment(...).render()` calls in the preview/test paths listed under M-1 with the single hardened `render_template()` helper that already has timeout + size caps.
4. **MEDIUM M-2** — Add per-username login lockout (TTL cache or DB-backed) on top of the per-IP `5/minute`. Log failed login attempts.
5. **MEDIUM M-5** — Replace the malformed dummy bcrypt literal in `login()` with a real bcrypt hash computed once at module load so the timing-equalisation actually runs the KDF.
6. **MEDIUM M-3** — Strengthen `_redact_sensitive_body` with a value-entropy heuristic and well-known token-prefix matching.
7. **MEDIUM M-4** — Add replay protection on Gitea webhooks via the `X-Gitea-Delivery` header (small table + partial unique index).
8. **MEDIUM M-7** — Make the metrics endpoint require either a flag or a Basic Auth credential; document in `OPERATIONS.md` that the API port should not be internet-exposed when metrics are on.
9. **MEDIUM M-6** — Add a defence-in-depth `setup_completed` boolean in `app_setting` and check it in `/api/auth/setup` in addition to the count.
10. **N-5** — Add `npm audit --audit-level=high` to the frontend build job in `.gitea/workflows/build.yml` so dependency CVEs land in CI.
---
## What was confirmed safe (worth keeping)
- JWT design: HS256 with `iss`/`aud`/`exp`/`type`/`sub`/`ver`; refresh/access split; `token_version` revocation on role change, username change, and password change.
- bcrypt with 72-byte length guard; CPU-bound work run in a thread.
- SSRF guard with: scheme allowlist, IPv6-mapped-v4 unwrap, CGNAT block, IDN normalisation, async resolver, `PinnedResolver` to defeat DNS rebinding.
- SQL access goes through SQLModel/SQLAlchemy with bind parameters; the only `f"..."` SQL is in DDL (column adds, index creates, `VACUUM INTO`) using server-controlled identifiers — sampled and clean.
- Sandbox is `SandboxedEnvironment` everywhere a user-controllable template is rendered (six locations checked).
- Frontend `{@html}` is wrapped in `sanitizePreview()` everywhere (`tracking-configs`, `template-configs`, `command-template-configs`).
- Provider config secrets are masked on GET (except H-1).
- `_resolve_backup_file` rejects `..`, NUL, separators, and enforces `relative_to(base)`.
- CORS rejects wildcard with credentials at startup; secret_key default values are rejected with a clear error.
- Docker: non-root user, `read_only: true`, `tmpfs: /tmp`, `no-new-privileges`, `cap_drop: ALL`, resource limits, healthcheck on `/api/ready`.
- Logging: `SecretMaskingFilter` masks Telegram bot tokens, `Authorization`, `x-api-key`, `password`, `secret`, `access_token`, `refresh_token` from formatted messages, exception text, and stack traces.
- Telegram webhook: secret token mandatory, refused on missing config, opaque `webhook_path_id` separate from bot token.
- Inbound generic webhook: refuses `auth_mode="none"` unless an explicit acknowledgment field is set; auto-generates a strong secret if missing for `bearer_token`/`hmac_sha256`.
- Inbound payload size capped at 1 MiB with a streaming check that doesn't trust `Content-Length`.
---
## Methodology
- Manual code review of every authentication, authorization, webhook ingestion, template rendering, secret-handling, and outbound HTTP path under `packages/`.
- Cross-checked CORS / CSP / security headers and rate-limiter configuration in `main.py` + `auth/routes.py`.
- Sampled API routes for ownership enforcement (`get_owned_entity` / `_get_user_provider` / `_get_user_bot`) — all sampled routes apply it; no IDOR found.
- Grepped for `Environment(` / `jinja2.Environment` / `f"..."` SQL / `{@html}` / `subprocess` / `eval` / `os.system` / known-bad patterns.
- Reviewed CI workflows for secret leakage in env blocks and image-signing posture.
- Reviewed Dockerfile + docker-compose for least-privilege and read-only root.
- No dynamic testing performed; static review only. Run `pytest` (already gated in CI) + `bandit -r packages/` + `npm audit` in CI to backstop this review.
@@ -0,0 +1,408 @@
# UI / UX Design Review — Notify Bridge frontend
**Reviewed**: 2026-05-22
**Scope**: SvelteKit frontend at `frontend/`, "Aurora / Glass" aesthetic, en + ru locales.
**Reviewer method**: Read `app.css`, `+layout.svelte`, dashboard, login, setup, providers, targets, users, settings (parent), settings/IdentityCassette, notification-trackers, template-configs, actions, bots, plus shared components (Card, Button, Modal, ConfirmModal, AuthLayout, PageHeader, EmptyState, Loading, Snackbar). Cross-cutting Grep passes for inputs, border-radius, ARIA, sort, hex colors.
---
## Executive summary
- **Aurora design language is real and distinctive.** Newsreader display serif + Geist variable sans + Geist Mono, conic-gradient brand orb, animated radial-gradient aurora background (`body::before` 28s drift), gradient pill chips, glow-pulse dots, and the lavender/orchid/mint/citrus/coral/sky palette together give the product a clear visual identity. This is **not** generic admin-template AI slop — the dashboard hero, signal-stream rows, provider deck, and the `PageHeader` "subpage-hero" pattern all carry intentional character that the user will remember.
- **Consistency is the weakest axis.** Five overlapping card container abstractions (`.hero-card`, `.panel`, `.glass`, `Card.svelte`, settings `.cassette`/`.identity`) re-implement the same frosted-glass recipe with diverging radius (22 / 18 / 14 / 12 px) and padding (1.25/1.4 vs 1.3/1.4 vs 2/2.4 rem). A `--radius: 1rem` token is declared but unused. Pick one card module + one radius scale (e.g. `--radius-card: 22px`, `--radius-input: 12px`, `--radius-pill: 999px`).
- **Forms have not been migrated to Aurora.** ~71 occurrences across 17 files still use the legacy raw class string `border border-[var(--color-border)] rounded-md text-sm bg-[var(--color-background)]` instead of the global `input { ... }` rule already in `app.css` (which uses `--color-input-bg`, `--color-rule-strong`, 0.625rem radius, glow focus ring). Result: `rounded-md` (6px) fields next to 22px-radius cards, and solid opaque backgrounds inside frosted-glass cards. Removing the override class would auto-restyle every form to match. **HIGH** priority, mostly mechanical.
- **Hardcoded hex colors leak through.** Snackbar uses `#059669` / `#ef4444` / `#3b82f6` / `#f59e0b` instead of `--color-mint/coral/sky/citrus`. ConfirmModal uses a raw `rgba(239, 68, 68, 0.3)` glow. Actions page uses `#059669` for the enabled dot. All bypass theming — they will look wrong in light theme.
- **Snackbar is invisible to screen readers.** No `role="status"` / `aria-live="polite"` / `aria-live="assertive"` on the toast container. Critical confirmations (saved, deleted, error) are never announced. **HIGH** accessibility fix, one-line.
- **No `aria-current="page"` anywhere in the nav.** Active state is conveyed only visually (border-radius bar + glow), so assistive technology cannot tell which page is current.
- **No sortable columns, no multi-select bulk actions, anywhere in the app.** Lists rely entirely on `IconGridSelect` sort widgets (newest / oldest, etc.) and per-row icon buttons. For a notification routing system that may accumulate dozens of trackers / targets / configs, this scales poorly.
- **Localization parity is solid string-for-string** (en.json = ru.json = 1577 lines). Russian strings render correctly, but several places (hero title, brand row with provider name, stat-card label/value flex) have no length guard for the longer Russian translations, so visible truncation or wrapping is likely.
- **Onboarding is a single screen.** After `/setup` lands you on `/` with `0 providers` and a hero saying "all clear" — the most important first-run moment shows nothing to do. No checklist, no empty-dashboard CTA panel, no tour.
- **Power-user standouts**: the ⌘K SearchPalette (wired through the topbar), the global provider filter, and reduced-motion media-query support. These three deserve credit and should be more discoverable (there is no in-app hint that they exist).
---
## Findings by area
### 1. Design quality vs generic AI aesthetic
#### F-DESIGN-01 — Aurora identity is strong and self-consistent at the macro level [LOW / commendation]
- **Files**: [`frontend/src/app.css`](frontend/src/app.css), [`frontend/src/routes/+layout.svelte`](frontend/src/routes/+layout.svelte), [`frontend/src/routes/+page.svelte`](frontend/src/routes/+page.svelte)
- **State**: Newsreader display serif italic with linear-gradient text-clip is used in hero titles, panel titles, modal titles. Conic brand orb is unique. Aurora drift on body::before is a 28s slow loop that's never busy. The "signal" / "wires" / "on watch" / "pulse" / "stream" / "compose" semantic naming on the dashboard is editorial, not generic admin copy.
- **Verdict**: Keep all of this. Lean *further* into it on the subpages — most list pages currently default back to plain "PageHeader + Card list" without inheriting the dashboard's editorial flavor.
#### F-DESIGN-02 — Italic-serif emphasis loses impact on smaller subpage titles [LOW]
- **Files**: [`frontend/src/lib/components/PageHeader.svelte`](frontend/src/lib/components/PageHeader.svelte) (lines 132-147)
- **State**: `subpage-hero__title` is 2.15rem with italic emphasis on a gradient. At that size the gradient italic word is legible but loses the editorial drama it has at the 3rem dashboard hero. Russian translations (`em` words like *«операторы»*) sometimes look cramped because letter-spacing -0.025em is shared with the much larger dashboard hero.
- **Suggestion**: Use a separate letter-spacing scale per font size step, or drop italic emphasis on titles below ~2rem and use color-only emphasis there.
---
### 2. Visual consistency
#### F-CONSIST-01 — Five overlapping card abstractions [HIGH]
- **Files**: [`frontend/src/app.css`](frontend/src/app.css) `.glass`, [`frontend/src/lib/components/Card.svelte`](frontend/src/lib/components/Card.svelte), [`frontend/src/lib/components/PageHeader.svelte`](frontend/src/lib/components/PageHeader.svelte) `.subpage-hero`, [`frontend/src/routes/+page.svelte`](frontend/src/routes/+page.svelte) `.hero-card` / `.panel` / `.stat-card`, [`frontend/src/routes/settings/IdentityCassette.svelte`](frontend/src/routes/settings/IdentityCassette.svelte) `.identity` + `.glass`
- **State**: Six places re-declare the same recipe: `background: var(--color-glass); backdrop-filter: blur(28px) saturate(160%); border: 1px solid var(--color-border); border-radius: 22px; box-shadow: var(--shadow-card);` followed by an `::after` highlight overlay. Card.svelte even has its own 22px radius next to the global `.glass` 22px radius — they would diverge silently if either gets touched.
- **Suggestion**: Consolidate into one `<GlassPanel>` component (or `.glass-card` utility) with variants `default | hero | panel | cassette` for padding/radius differences. Delete the duplicated `::after` overlays. The pattern is good — it's just *copy-pasted* 5+ times.
#### F-CONSIST-02 — Border-radius drift, no scale [HIGH]
- **Files**: [`frontend/src/routes/+layout.svelte`](frontend/src/routes/+layout.svelte), [`frontend/src/routes/+page.svelte`](frontend/src/routes/+page.svelte), [`frontend/src/app.css`](frontend/src/app.css)
- **State**: Radii used: 22, 18, 14, 12, 11, 10, 9, 8, 7, 6, 3, 2 px + 0.3, 0.5, 0.625, 0.85, 1 rem + 9999px. `--radius: 1rem` is declared in the theme and re-declared, but no component reads it.
- **Suggestion**: Define and *use* `--radius-card: 22px; --radius-panel: 18px; --radius-pill: 999px; --radius-input: 12px; --radius-chip: 8px; --radius-tile: 6px;`. Refactor in passes — start with `Card.svelte`, `Button.svelte`, `Modal.svelte`, `ConfirmModal.svelte`.
#### F-CONSIST-03 — Hardcoded hex colors bypass theming [HIGH]
- **Files**:
- [`frontend/src/lib/components/Snackbar.svelte`](frontend/src/lib/components/Snackbar.svelte) lines 26-31: `#059669 / #ef4444 / #3b82f6 / #f59e0b`
- [`frontend/src/lib/components/ConfirmModal.svelte`](frontend/src/lib/components/ConfirmModal.svelte) line 70: `box-shadow: 0 0 16px rgba(239, 68, 68, 0.3)`
- [`frontend/src/routes/actions/+page.svelte`](frontend/src/routes/actions/+page.svelte) line 379: `style="background: {action.enabled ? '#059669' : 'var(--color-muted-foreground)'}"`
- 25 files in `frontend/src/routes/**` contain `#xxx` literals
- **State**: These colors are NOT the Aurora palette — `#059669` is emerald-600, our mint is `#7ee8c4`. In light theme the user sees green-on-green that wasn't intended.
- **Suggestion**: Replace all status hexes with `--color-mint/coral/sky/citrus/orchid`. Add a stylelint rule `color-no-hex` scoped to `src/**/*.svelte` to prevent regression.
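Until a lint rule is in place, the remaining literals can be inventoried with a small script. The sketch below is a heuristic, not a parser: `find_hex_colors` is a hypothetical helper, and it will also flag fragment links such as `href="#add"`, so its output needs a manual pass.

```python
import re

# 3-, 4-, 6- or 8-digit hex literals, ending at a word boundary so that
# Svelte block tags like {#each ...} are not matched.
HEX_RE = re.compile(r"#(?:[0-9a-fA-F]{8}|[0-9a-fA-F]{6}|[0-9a-fA-F]{3,4})\b")


def find_hex_colors(source: str) -> list[tuple[int, str]]:
    """Return (line_number, literal) pairs for every hex colour in `source`."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in HEX_RE.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits
```

Feeding each `.svelte` file under `frontend/src` through this function gives a per-file worklist for the token migration.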
#### F-CONSIST-04 — Form input styling not migrated to Aurora [HIGH]
- **Files**: 17 routes, ~71 occurrences. Examples: [`frontend/src/routes/users/+page.svelte`](frontend/src/routes/users/+page.svelte) lines 137, 141, 190, 207; [`frontend/src/routes/providers/+page.svelte`](frontend/src/routes/providers/+page.svelte) lines 303, 309, 323, 333; [`frontend/src/routes/notification-trackers/TrackerForm.svelte`](frontend/src/routes/notification-trackers/TrackerForm.svelte); [`frontend/src/routes/targets/TargetForm.svelte`](frontend/src/routes/targets/TargetForm.svelte).
- **State**: `class="w-full px-3 py-2 border border-[var(--color-border)] rounded-md text-sm bg-[var(--color-background)]"` is repeated 71+ times. This overrides the global `input { ... }` rule that *already* uses Aurora glass styling.
- **Suggestion**: Delete the class string in all these places. The global rule kicks in and forms instantly look correct. Cross-check that `Tailwind`'s preflight isn't interfering. Spot-check one page (e.g. `users/+page.svelte`), confirm visually, then mass-delete via Grep/Edit.
#### F-CONSIST-05 — ConfirmModal duplicates Button.svelte logic [MEDIUM]
- **Files**: [`frontend/src/lib/components/ConfirmModal.svelte`](frontend/src/lib/components/ConfirmModal.svelte)
- **State**: Its `.confirm-btn-cancel` and `.confirm-btn-delete` re-implement what `Button variant="secondary"` and `Button variant="danger"` already provide. The danger button even uses raw `rgba(239,68,68,...)` instead of `--color-error-fg`.
- **Suggestion**: `<Button variant="secondary" onclick={oncancel}>{cancel}</Button>` and `<Button variant="danger" onclick={onconfirm}>{confirm}</Button>`. Removes ~35 lines of CSS.
#### F-CONSIST-06 — AuthLayout uses a different glass recipe [MEDIUM]
- **Files**: [`frontend/src/lib/components/AuthLayout.svelte`](frontend/src/lib/components/AuthLayout.svelte) (line 68 `.auth-card`)
- **State**: `border-radius: 1rem`, `padding: 2rem`, `backdrop-filter: blur(8px)` (vs the 28px elsewhere), plus its own auth-bg gradient mesh + 32px-grid background that nothing else in the app uses. Has its own `.auth-input` / `.auth-submit` / `.auth-label` / `.auth-error` design language.
- **State pt 2**: Login/setup ends up looking *more* like generic SaaS than the dashboard does. The brand orb from the sidebar isn't on the login screen — instead a small lavender mdi-lan icon in a square.
- **Suggestion**: Reuse the conic brand orb. Use the same glass recipe (28px blur, 22px radius) for `.auth-card`. Either drop the dot-grid `.auth-grid` (it reads as a generic "futuristic SaaS" template) or use it as a deliberate flair on the dashboard hero too.
---
### 3. Information hierarchy
#### F-HIER-01 — Stat cards do triple duty (KPI + nav link + filter context) without ranking [MEDIUM]
- **Files**: [`frontend/src/routes/+page.svelte`](frontend/src/routes/+page.svelte) lines 571-645
- **State**: All four stat cards have the same visual weight, same accent intensity (`STAT_ACCENTS[idx]`), and rotate accents by index. When the global provider filter is active the first stat card morphs into a "literal value" card showing provider name (1rem font, very different visual). The accent rotation creates a rainbow row that doesn't carry meaning — events `total` has no semantic reason to be orchid vs. providers being lavender.
- **Suggestion**: Tie accent color to entity type (providers=primary, trackers=mint, targets=sky, throughput=citrus) so the same accent recurs throughout the app for the same concept. Keep the morph behavior but design a distinct "filtered context" stat-card variant — a smaller, narrower chip — so it doesn't compete visually.
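
The entity-to-accent mapping suggested above could be sketched as a single lookup table. The names `EntityKind`, `ENTITY_ACCENT` and `accentFor` are hypothetical and do not exist in the codebase; the accent names are the ones listed in the suggestion.

```typescript
// Hypothetical sketch: one accent per entity type instead of rotation by index.
type EntityKind = 'providers' | 'trackers' | 'targets' | 'throughput';

// Accent names taken from the suggestion; how they map to CSS tokens is an assumption.
const ENTITY_ACCENT: Record<EntityKind, string> = {
  providers: 'primary',
  trackers: 'mint',
  targets: 'sky',
  throughput: 'citrus',
};

// Returns the same accent for the same concept wherever it is rendered.
function accentFor(kind: EntityKind): string {
  return ENTITY_ACCENT[kind];
}
```

A stat card would then ask for `accentFor('trackers')` rather than `STAT_ACCENTS[idx]`, so reordering cards never changes their color.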
#### F-HIER-02 — Hero title and meter compete for attention at desktop width [LOW]
- **Files**: [`frontend/src/routes/+page.svelte`](frontend/src/routes/+page.svelte) lines 1047-1068, 1078-1086
- **State**: Both the `.hero-title` and `.hero-meter-value` are 3rem 500-weight in two different fonts. Side-by-side they create two focal points.
- **Suggestion**: Shrink `.hero-meter-value` to 2.4rem and use it as a *secondary* read; let the editorial title be the single dominant element.
#### F-HIER-03 — Pulse chart panel rarely meaningful on first launch [LOW]
- **Files**: [`frontend/src/routes/+page.svelte`](frontend/src/routes/+page.svelte) lines 909-927
- **State**: On a fresh install the chart is an empty 0-events grid taking 250-400px vertical space. No empty-state copy inside `EventChart`.
- **Suggestion**: When `chartDays` has all-zero values, replace with a small "No events recorded in the last 30 days — once a tracker fires, the pulse will appear here" inline empty state.
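
A minimal sketch of the all-zero check, assuming the chart data can be reduced to a plain array of per-day counts. The real shape of `chartDays` is not shown in this review, so `isChartEmpty` and its parameter are illustrative only.

```typescript
// Assumption: one numeric event count per day. An empty array is also treated as "no data".
function isChartEmpty(dayCounts: readonly number[]): boolean {
  return dayCounts.every((count) => count === 0);
}
```

The component could branch on this before rendering the grid and show the inline empty-state copy instead.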
---
### 4. Navigation & wayfinding
#### F-NAV-01 — No `aria-current="page"` on active nav links [HIGH a11y]
- **Files**: [`frontend/src/routes/+layout.svelte`](frontend/src/routes/+layout.svelte) lines 498-533, 591-597, 632-658
- **State**: Active state is conveyed via `.active` class + a gradient left-bar div. Screen readers cannot announce it. Grep for `aria-current` across the whole frontend: zero matches.
- **Suggestion**: Add `aria-current={isActive(child.href) ? 'page' : undefined}` to every nav `<a>`.
#### F-NAV-02 — No breadcrumb on subpages [MEDIUM]
- **Files**: [`frontend/src/lib/components/PageHeader.svelte`](frontend/src/lib/components/PageHeader.svelte)
- **State**: The `crumb` prop only renders a single mono-uppercase tag (e.g. "ROUTING · AUTOMATION") — it's decorative, not navigational. There's no actual breadcrumb chain. For `/template-configs`, `/command-template-configs`, `/tracking-configs`, `/command-configs`, etc., a user landing via deep link has no parent-link to return to.
- **Suggestion**: Make the crumb a real breadcrumb (≤3 levels: `Notifications → Templates` or `Commands → Configs`). Render the prior level as a clickable `<a>`.
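
One way to derive such a chain is to walk a parent map upward from the current route. This is a sketch under assumptions: `RouteMeta`, `EXAMPLE_ROUTES`, `buildCrumbs` and the `/notifications` parent path are hypothetical, and the app's real route hierarchy may differ.

```typescript
interface RouteMeta {
  label: string;
  parent?: string;
}

interface Crumb {
  label: string;
  href?: string; // absent for the current page, which is rendered as plain text
}

// Hypothetical hierarchy, for illustration only.
const EXAMPLE_ROUTES: Record<string, RouteMeta> = {
  '/notifications': { label: 'Notifications' },
  '/template-configs': { label: 'Templates', parent: '/notifications' },
};

// Walks from the current path up through its parents, capped at maxLevels.
function buildCrumbs(
  path: string,
  routes: Record<string, RouteMeta>,
  maxLevels = 3,
): Crumb[] {
  const chain: Crumb[] = [];
  let current: string | undefined = path;
  while (current !== undefined && chain.length < maxLevels) {
    const meta: RouteMeta | undefined = routes[current];
    if (!meta) break;
    // The first entry pushed is the current page; every later one is a linked ancestor.
    chain.unshift(
      chain.length === 0 ? { label: meta.label } : { label: meta.label, href: current },
    );
    current = meta.parent;
  }
  return chain;
}
```

`PageHeader` could render every crumb that has an `href` as an `<a>` and the final one as text.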
#### F-NAV-03 — Deep linking via `?type=<targetType>` and `?tab=<botType>` doesn't update page title [LOW]
- **State**: `/targets?type=email` and `/bots?tab=matrix` change the active sidebar item but the `<PageHeader>` title for those pages is generic ("Targets" / "Bots").
- **Suggestion**: When `activeType` is set, derive the title from it: "Email targets" / "Matrix bots". Improves browser tab titles and the in-page title.
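
A rough sketch of the title derivation, assuming English-only string handling. `pageTitle` is a hypothetical helper; a real implementation would presumably go through `t()` so that the Russian locale gets correct word order.

```typescript
// Builds "Email targets" from base "Targets" and activeType "email".
function pageTitle(base: string, activeType?: string): string {
  if (!activeType) return base;
  const typeLabel = activeType.charAt(0).toUpperCase() + activeType.slice(1);
  return `${typeLabel} ${base.toLowerCase()}`;
}
```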
#### F-NAV-04 — Collapsed sidebar tooltip wraps for long Russian translations [LOW]
- **State**: Tooltips for collapsed sidebar nav items use the browser-native `title=` attribute, which gives no glass-style chip. They will use the OS tooltip styling, which clashes with the Aurora aesthetic and clips long ru labels.
- **Suggestion**: Build a small custom tooltip component (or use existing portal helper) for collapsed-sidebar nav. Keep `title` as fallback for `prefers-reduced-motion` users.
---
### 5. Form UX
#### F-FORM-01 — No inline field-level validation, only post-submit error banners [MEDIUM]
- **Files**: [`frontend/src/routes/providers/+page.svelte`](frontend/src/routes/providers/+page.svelte), [`frontend/src/routes/users/+page.svelte`](frontend/src/routes/users/+page.svelte), [`frontend/src/routes/targets/TargetForm.svelte`](frontend/src/routes/targets/TargetForm.svelte)
- **State**: Forms rely on HTML5 `required` / `minlength` browser validation plus a single `ErrorBanner` shown after submit failure. Native browser validation tooltips are pale and don't match Aurora.
- **Suggestion**: Add a per-field `<FieldError>` slot below labels for inline validation (URL syntax, email format, port range). The settings page already has a nice pattern (`url-field-valid` class on `IdentityCassette`) — generalize it.
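
The per-field checks could be small pure functions that return an error message or `null`, which a `<FieldError>` slot would then display. This is a sketch only: `FieldCheck`, `checkHttpUrl` and `checkPort` are hypothetical names, and the messages would need i18n keys in practice.

```typescript
// A check returns null when the value is acceptable, otherwise a message.
type FieldCheck = (value: string) => string | null;

// Uses the standard URL constructor, which throws on unparseable input.
const checkHttpUrl: FieldCheck = (value) => {
  try {
    const url = new URL(value);
    return url.protocol === 'http:' || url.protocol === 'https:'
      ? null
      : 'URL must use http or https';
  } catch {
    return 'Not a valid URL';
  }
};

// Accepts only whole numbers in the valid TCP/UDP port range.
const checkPort: FieldCheck = (value) => {
  const port = Number(value);
  return /^\d+$/.test(value) && port >= 1 && port <= 65535
    ? null
    : 'Port must be between 1 and 65535';
};
```

Running these on `input` or `blur` gives feedback before submit, while the existing `ErrorBanner` stays in place for server-side failures.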
#### F-FORM-02 — Save feedback inconsistent across pages [MEDIUM]
- **Files**: Settings uses a sticky `SaveBar` with dirty tracking ([`frontend/src/routes/settings/+page.svelte`](frontend/src/routes/settings/+page.svelte) lines 77-84, 208-214). Most other forms have inline Save buttons inside the card. Some show snackbar success ("snack.userCreated"), some don't.
- **Suggestion**: Standardize: (a) inline "Save" inside the card *plus* (b) snackbar success message *plus* (c) optional sticky SaveBar for multi-field admin forms. Document the pattern in `.claude/docs/frontend-architecture.md`.
#### F-FORM-03 — Forms auto-name from descriptor but offer no way to unlock it back to auto-name [LOW]
- **Files**: [`frontend/src/routes/providers/+page.svelte`](frontend/src/routes/providers/+page.svelte) lines 136-141 + 303; [`frontend/src/routes/actions/+page.svelte`](frontend/src/routes/actions/+page.svelte) lines 50-56
- **State**: Once user types in the Name field, `nameManuallyEdited` becomes true and the auto-fill stops permanently — no way to ask "go back to default name".
- **Suggestion**: Add a tiny "↺ reset" link next to the name input when `nameManuallyEdited && form.name !== descriptor.defaultName`.
#### F-FORM-04 — No optimistic UI; rows disappear / appear only after server roundtrip [LOW]
- **State**: After delete/create, pages refetch via `cache.fetch(true)`. Visible 200-400ms blank state.
- **Suggestion**: Optimistic insert/remove in the cache stores, with snackbar undo for destructive ops.
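
A minimal sketch of the optimistic-remove half of this idea, with rollback if the server call fails. Everything here is hypothetical (`optimisticRemove`, the `apply` callback, the `deleteOnServer` function); the real cache stores may expose a different API, and the snackbar undo is not shown.

```typescript
interface Row {
  id: number;
}

// Removes the row from local state immediately, then confirms with the server.
// Resolves to true on success; on failure restores the previous list and resolves to false.
async function optimisticRemove<T extends Row>(
  items: T[],
  id: number,
  apply: (next: T[]) => void,
  deleteOnServer: (id: number) => Promise<void>,
): Promise<boolean> {
  const previous = items;
  apply(items.filter((row) => row.id !== id)); // UI updates before the roundtrip
  try {
    await deleteOnServer(id);
    return true;
  } catch {
    apply(previous); // roll back so the row reappears
    return false;
  }
}
```

The caller would show an error snack when the result is `false`, so the reappearing row is explained.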
#### F-FORM-05 — Login form omits `autofocus` on username [LOW]
- **Files**: [`frontend/src/routes/login/+page.svelte`](frontend/src/routes/login/+page.svelte) line 99
- **Suggestion**: Add `autofocus` to the username input. Saves one keystroke on every login.
---
### 6. Modals & overlays
#### F-MODAL-01 — Modal.svelte is well-built [LOW / commendation]
- **Files**: [`frontend/src/lib/components/Modal.svelte`](frontend/src/lib/components/Modal.svelte)
- **State**: Portal mount, focus trap, focus restoration, Escape, Tab cycling, `aria-modal="true"`, `aria-labelledby`, body scroll containment via `overscroll-behavior: contain`, transition (250ms in/out), 80vh max-height. This is the strongest single component in the codebase.
- **Verdict**: Reuse as the foundation for every overlay. Currently `BlockedByModal`, `EventDetailModal`, `SharedLinkModal`, `ConfirmModal` all do — good.
#### F-MODAL-02 — Modal backdrop has `role="button"` [LOW]
- **Files**: [`frontend/src/lib/components/Modal.svelte`](frontend/src/lib/components/Modal.svelte) line 96
- **State**: The backdrop is a `<div>` with `role="button"`, `tabindex="-1"`, and an onclick to close. That's a common pattern to silence Svelte's a11y warnings, but a screen reader announces "Close, button" twice (once for backdrop, once for the explicit X button).
- **Suggestion**: Drop `role="button"` and `aria-label` from the backdrop; the explicit Close button is enough. Or use `<button class="modal-backdrop">` instead of a div.
#### F-MODAL-03 — Modal panel uses solid `#131520` instead of glass [LOW]
- **Files**: [`frontend/src/lib/components/Modal.svelte`](frontend/src/lib/components/Modal.svelte) lines 150-151
- **State**: `--modal-solid-bg: #131520;` is a deliberate choice (probably for readability) but it breaks visual consistency with the rest of the app. The Aurora drift behind it is invisible.
- **Suggestion**: Use `var(--color-glass-elev)` over the blurred backdrop. Or, if the solid choice was deliberate, document why so the next developer doesn't "fix" it.
#### F-MODAL-04 — Confirm-modal "delete" hover uses raw rgba [MEDIUM]
- **Files**: [`frontend/src/lib/components/ConfirmModal.svelte`](frontend/src/lib/components/ConfirmModal.svelte) line 70
- **State**: `box-shadow: 0 0 16px rgba(239, 68, 68, 0.3);` — not themed.
- **Suggestion**: Use `box-shadow: 0 0 16px color-mix(in srgb, var(--color-coral) 40%, transparent);`.
---
### 7. Empty / loading / error states
#### F-STATE-01 — `Loading.svelte` is a single shimmer pattern [MEDIUM]
- **Files**: [`frontend/src/lib/components/Loading.svelte`](frontend/src/lib/components/Loading.svelte)
- **State**: Three or four 4rem shimmer bars. Used as `<Loading />` on virtually every page including hero pages. Doesn't match the actual layout the user will see — looks like a row list even on settings.
- **Suggestion**: Add layout-aware variants: `<Loading shape="hero" />`, `<Loading shape="grid" cols={4} />`, `<Loading shape="list" rows={5} />`. Reduces layout shift on first paint.
#### F-STATE-02 — `EmptyState.svelte` is plain and undifferentiated [MEDIUM]
- **Files**: [`frontend/src/lib/components/EmptyState.svelte`](frontend/src/lib/components/EmptyState.svelte)
- **State**: 10-line component: dimmed icon + message. No CTA, no illustration, no flavor. The dashboard's inline `.empty-state` (lines 1300-1319 of `+page.svelte`) is richer (has a CTA link) but isn't reused.
- **Suggestion**: Extend `EmptyState` to accept a `cta` slot and a `tone` (with subtle gradient blob behind the icon). On `/providers` empty: "No providers yet — connect Immich, Nextcloud, or Home Assistant to start tracking events" with an "+ Add provider" CTA.
#### F-STATE-03 — Many list pages have no error-recovery action [MEDIUM]
- **Files**: Throughout — most pages have a `loadError` state that renders `<Card><ErrorBanner /></Card>` but no "Retry" button.
- **Suggestion**: `ErrorBanner` should accept an `onRetry` prop and surface a retry button. Standardize across pages.
#### F-STATE-04 — `EventChart` no empty state [LOW]
- See F-HIER-03.
---
### 8. Accessibility
#### F-A11Y-01 — Snackbar has no aria-live [HIGH]
- **Files**: [`frontend/src/lib/components/Snackbar.svelte`](frontend/src/lib/components/Snackbar.svelte) lines 35-63
- **State**: Snack container is a plain `<div use:portal>`. Success / error toasts never reach screen readers. Three other files have proper aria-live; this critical one doesn't.
- **Fix**: `<div use:portal class="snackbar-container" role="region" aria-live="polite" aria-label={t('snackbar.region')}>`. Use `aria-live="assertive"` for `snack.type === 'error'`.
#### F-A11Y-02 — No `aria-current="page"` on nav links [HIGH]
- See F-NAV-01.
#### F-A11Y-03 — Custom focus outlines partially overridden [MEDIUM]
- **Files**: [`frontend/src/app.css`](frontend/src/app.css) lines 237-241 (global `button:focus-visible` outline 2px primary + offset 2px), [`frontend/src/routes/+layout.svelte`](frontend/src/routes/+layout.svelte) line 894 (`.nav-link { border-radius: 12px !important }`), [`frontend/src/routes/+page.svelte`](frontend/src/routes/+page.svelte) lines 1351-1354 (`.signal-row--clickable:focus-visible { outline-offset: -2px }`).
- **State**: Inverted offset `-2px` makes the focus ring sit *inside* the row, which against the glass-strong hover-bg ends up nearly invisible at certain accent positions.
- **Suggestion**: Use `outline-offset: 2px` consistently with a `box-shadow: 0 0 0 2px var(--color-glass)` ringer if needed for contrast.
#### F-A11Y-04 — `prefers-reduced-motion` is honored — commendation [LOW]
- **Files**: [`frontend/src/app.css`](frontend/src/app.css) lines 484-507, [`frontend/src/routes/+layout.svelte`](frontend/src/routes/+layout.svelte) lines 837-840
- **State**: Aurora drift, brand-version pulse, stagger entrances, signal-row hover transitions, paginator transitions all gated. Smooth scroll override too. Solid implementation.
#### F-A11Y-05 — Color contrast risk on glass surfaces [MEDIUM]
- **State**: `--color-muted-foreground: #b6b2d4` on `--color-glass: rgba(255,255,255,0.04)` over the aurora gradient. In the brightest hot-spot of the aurora background (where the `#b8a7ff` lavender peaks), `#b6b2d4` may fail WCAG AA (4.5:1 for body text). Hasn't been measured.
- **Suggestion**: Run a contrast pass with `--color-muted-foreground` against the brightest part of the aurora background. Likely need to bump it to ~`#cfcae8` for dark theme.
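
The contrast pass can be scripted with the WCAG relative-luminance formula. The sketch below treats the background as a flat opaque color, which is a simplifying assumption: the real surface is a 4% white glass layer blended over a gradient, so sampled pixel colors from a screenshot would be a better input than the raw palette values.

```typescript
// Converts one 8-bit sRGB channel to linear light.
function channelToLinear(channel8: number): number {
  const c = channel8 / 255;
  return c <= 0.04045 ? c / 12.92 : Math.pow((c + 0.055) / 1.055, 2.4);
}

// Relative luminance of a #rrggbb color (0 = black, 1 = white).
function relativeLuminance(hex: string): number {
  const n = parseInt(hex.replace('#', ''), 16);
  const r = (n >> 16) & 0xff;
  const g = (n >> 8) & 0xff;
  const b = n & 0xff;
  return 0.2126 * channelToLinear(r) + 0.7152 * channelToLinear(g) + 0.0722 * channelToLinear(b);
}

// WCAG contrast ratio: (lighter + 0.05) / (darker + 0.05), ranging from 1 to 21.
function contrastRatio(foreground: string, background: string): number {
  const a = relativeLuminance(foreground);
  const b = relativeLuminance(background);
  const lighter = Math.max(a, b);
  const darker = Math.min(a, b);
  return (lighter + 0.05) / (darker + 0.05);
}

// AA threshold for body text, as cited in the finding above.
function passesAaBodyText(foreground: string, background: string): boolean {
  return contrastRatio(foreground, background) >= 4.5;
}
```

Feeding it `#b6b2d4` against a pure `#b8a7ff` background fails the check, which is the hypothetical worst case the finding worries about; the measured value on the composited page will differ.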
#### F-A11Y-06 — Toggle switch has no label association [LOW]
- **Files**: [`frontend/src/app.css`](frontend/src/app.css) lines 513-556
- **State**: `.toggle-switch` wraps an `<input type="checkbox">` and a visual `.toggle-track` `<span>`. There's no visible label text or `aria-label` requirement in the global utility. Callers may forget to pass one.
- **Suggestion**: Lift into a `<Toggle>` component requiring a `label` prop.
---
### 9. Responsive design
#### F-RESP-01 — Sidebar collapse breakpoint is fine; mobile bottom nav covers gracefully [LOW / commendation]
- **Files**: [`frontend/src/routes/+layout.svelte`](frontend/src/routes/+layout.svelte) lines 589-668, 1136-1168
- **State**: Below 767px the desktop sidebar hides and mobile bottom-nav appears with primary 4 keys + search + more. Mobile "More" panel mirrors the full desktop tree. Solid.
#### F-RESP-02 — Hero meter wraps awkwardly between 720-880px [LOW]
- **Files**: [`frontend/src/routes/+page.svelte`](frontend/src/routes/+page.svelte) lines 1119-1130
- **State**: Below 880px the hero collapses to one column, but the meter row pills wrap to a third row on Russian translations of "providers/targets/armed".
- **Suggestion**: Add an intermediate breakpoint (`max-width: 1024px`) where pill labels switch from `"5 providers"` to a tooltip-only count.
#### F-RESP-03 — Stat-card grid stays at 2 columns between sm: and lg: [MEDIUM]
- **Files**: [`frontend/src/routes/+page.svelte`](frontend/src/routes/+page.svelte) line 590 `grid-cols-1 sm:grid-cols-2 lg:grid-cols-4`
- **State**: Between 640-1024px stat cards are 2-wide. At tablet sizes the cards become huge and dilute the dashboard density.
- **Suggestion**: Cap stat-card max-width at ~300px or switch to `auto-fit, minmax(200px, 1fr)` so they don't grow uncontrollably.
#### F-RESP-04 — List rows don't gracefully truncate webhook URLs on mobile [LOW]
- **Files**: [`frontend/src/routes/providers/+page.svelte`](frontend/src/routes/providers/+page.svelte) lines 392-410
- **State**: Secondary text line shows full webhook URL with `break-all` which on very narrow viewports gives a 4-line wrap.
- **Suggestion**: Use the `shortenUrl()` helper (already defined for the meta-tile path) on the narrow-screen secondary line too.
---
### 10. Onboarding
#### F-ONBOARD-01 — Setup → empty dashboard with no guidance [HIGH]
- **Files**: [`frontend/src/routes/setup/+page.svelte`](frontend/src/routes/setup/+page.svelte), [`frontend/src/routes/+page.svelte`](frontend/src/routes/+page.svelte)
- **State**: After `/setup` the user lands on `/` with 0 providers, hero says *"all clear"* (literally "Nothing to do"). Wasted first impression.
- **Suggestion**: First-run detection (`providersCache.items.length === 0 && targetsCache.items.length === 0`) replaces the dashboard hero with a 3-4 step "Getting started" checklist: (1) Add a provider · (2) Connect a bot · (3) Create a target · (4) Wire your first tracker. Each step is a CTA card. Persist completion to localStorage so it disappears once finished.
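
A sketch of the first-run gate, written against a minimal storage interface so it can be exercised outside the browser. `KeyValueStore`, `ONBOARDING_DONE_KEY`, `shouldShowOnboarding` and `markOnboardingDone` are hypothetical names; in the app the store would be `localStorage` and the counts would come from `providersCache.items.length` and `targetsCache.items.length`.

```typescript
// The subset of the Web Storage API this sketch needs.
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

// Hypothetical storage key.
const ONBOARDING_DONE_KEY = 'onboarding.done';

// Show the checklist only on a truly empty install that has not finished onboarding.
function shouldShowOnboarding(
  providerCount: number,
  targetCount: number,
  store: KeyValueStore,
): boolean {
  if (store.getItem(ONBOARDING_DONE_KEY) === '1') return false;
  return providerCount === 0 && targetCount === 0;
}

// Called once the last checklist step is completed.
function markOnboardingDone(store: KeyValueStore): void {
  store.setItem(ONBOARDING_DONE_KEY, '1');
}
```

Checking the persisted flag first means deleting every provider later does not bring the checklist back.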
#### F-ONBOARD-02 — No in-app discovery of ⌘K palette [MEDIUM]
- **Files**: [`frontend/src/routes/+layout.svelte`](frontend/src/routes/+layout.svelte) lines 678-682
- **State**: Topbar shows `⌘K` / `Ctrl K` chip but only that. No "Press ⌘K to jump to any page" hint anywhere.
- **Suggestion**: First-visit toast: "Tip: Press ⌘K from anywhere to search providers, trackers, and pages". Dismissible.
#### F-ONBOARD-03 — Login screen has no help / forgot-password / docs link [LOW]
- **Files**: [`frontend/src/routes/login/+page.svelte`](frontend/src/routes/login/+page.svelte)
- **State**: Plain username + password. For self-hosted users who lost the admin password, there's no link to the recovery docs.
- **Suggestion**: Small "Need help?" link to docs (the `/docs` route exists).
---
### 11. Microcopy
#### F-COPY-01 — Dashboard hero copy is editorial — commendation [LOW]
- "Live · throughput 24h · armed · providers" reads more like a control-room dashboard than CRUD admin. Keep doing this on the rest of the app.
#### F-COPY-02 — Many subpages use literal entity-name copy [MEDIUM]
- E.g. "Add provider" / "Add target" / "Add tracker" / "Add user". Editorial would be "Connect a provider" / "Define a target" / "Wire a tracker" / "Invite a user". Lean into verbs that match the dashboard's "wires / signals / on watch" vocabulary.
#### F-COPY-03 — Russian translations match en line-count but no length QA visible [LOW]
- File sizes match exactly (1577 lines each). That's just structural parity, not visual parity. Russian tends to be 20-30% longer for the same concept; flagged places likely have layout issues (hero title em, stat-card values, sidebar nav labels).
- **Suggestion**: Set up a Playwright snapshot test that switches locale=ru and screenshots dashboard + a representative list page to catch overflow visually.
---
### 12. Localization parity
#### F-LOCALE-01 — "Notify Bridge" wordmark stays in English [LOW / correct]
- Brand. Don't translate.
#### F-LOCALE-02 — Provider type label not localized in list rows [LOW]
- **Files**: [`frontend/src/routes/providers/+page.svelte`](frontend/src/routes/providers/+page.svelte) line 391
- **State**: Type pill shows raw `provider.type` value (e.g. "immich", "nextcloud") — not localized.
- **Suggestion**: Use `getDescriptor(type).defaultName` or `t(\`providers.type${PascalName}\`)` which exists per project conventions.
#### F-LOCALE-03 — Mixed Cyrillic glitches in source [LOW]
- **Files**: [`frontend/src/routes/login/+page.svelte`](frontend/src/routes/login/+page.svelte) line 42 (`вЂ"` instead of an em-dash in a comment), [`frontend/src/routes/users/+page.svelte`](frontend/src/routes/users/+page.svelte) line 166 (`В·` instead of `·`)
- **State**: Encoding-corrupt characters in source comments and one user-facing dot. Pre-existing — files were probably edited with the wrong encoding at some point.
- **Suggestion**: Grep `вЂ` / `В·` across the repo and fix. Add a pre-commit hook that fails on non-UTF8 chars in `.svelte` / `.ts` / `.json`.
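
The check behind such a hook could be as small as the sketch below, which flags the two sequences named above. `findMojibakeLines` is a hypothetical helper; wiring it into an actual pre-commit runner (reading staged files, exit codes) is left out. The pattern is written with Unicode escapes so the checker's own source cannot be corrupted the same way.

```typescript
// U+0432 U+0402 is the start of a mis-decoded em-dash; U+0412 U+00B7 is a mis-decoded middle dot.
const MOJIBAKE_PATTERN = /\u0432\u0402|\u0412\u00b7/;

// Returns the 1-based line numbers that contain a known mojibake sequence.
function findMojibakeLines(text: string): number[] {
  const hits: number[] = [];
  text.split('\n').forEach((line, index) => {
    if (MOJIBAKE_PATTERN.test(line)) hits.push(index + 1);
  });
  return hits;
}
```

Note that these sequences are valid UTF-8, so a plain "is this file UTF-8" test would not catch them; matching the known patterns is what makes the hook useful here.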
---
### 13. Power-user features
#### F-POWER-01 — No sortable columns anywhere [MEDIUM]
- Confirmed by Grep: no `aria-sort` / `sortable` / `onSort` in the codebase. Lists are sorted by `IconGridSelect` widget (newest / oldest / name).
- **Suggestion**: For long lists (trackers, targets), add column-header sort affordance. Even minimal: clicking the "Name" or "Provider" header re-sorts. Use cache state so sort persists across nav.
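
A minimal sketch of header-click sorting, assuming string columns and a two-state direction. `SortDir`, `sortRows` and `ariaSortValue` are hypothetical; persisting the chosen column in cache state is not shown.

```typescript
type SortDir = 'asc' | 'desc';

// Returns a sorted copy so the cached source array is never mutated.
function sortRows<T>(rows: readonly T[], key: (row: T) => string, dir: SortDir): T[] {
  const sign = dir === 'asc' ? 1 : -1;
  return [...rows].sort((a, b) => sign * key(a).localeCompare(key(b)));
}

// Maps the direction to the value expected by the aria-sort attribute on the active header.
function ariaSortValue(dir: SortDir): 'ascending' | 'descending' {
  return dir === 'asc' ? 'ascending' : 'descending';
}
```

Setting `aria-sort` on the active column header at the same time would cover the accessibility side of this finding.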
#### F-POWER-02 — No multi-select bulk actions [MEDIUM]
- Grep for `bulkAction` / `selectAll`: only the locale files contain those strings (likely as i18n keys that are never used). No checkbox UI.
- **Suggestion**: Add a checkbox column on `targets`, `notification-trackers`, `command-trackers`, `actions` pages. Bulk-enable / bulk-delete are the obvious ones.
#### F-POWER-03 — ⌘K palette is the strongest power feature, under-promoted [MEDIUM]
- See F-ONBOARD-02.
#### F-POWER-04 — Sidebar group expand/collapse is persisted but no "expand all / collapse all" [LOW]
- **Files**: [`frontend/src/routes/+layout.svelte`](frontend/src/routes/+layout.svelte) lines 263-269
- **Suggestion**: Add a right-click menu on a group header, or a tiny "collapse all" icon at the bottom of the nav rail.
#### F-POWER-05 — No keyboard shortcuts beyond ⌘K [LOW]
- **Suggestion**: `n` for new, `g + p` for "go providers", `g + t` for trackers, `?` to show shortcut sheet. Document in the palette.
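
Two-key sequences such as `g` then `p` need a little state between keystrokes. The sketch below is one possible matcher, under assumptions: `createShortcutMatcher`, `EXAMPLE_BINDINGS`, the action names and the one-second timeout are all hypothetical. The caller passes the key and a timestamp, which keeps the logic independent of DOM events.

```typescript
// Hypothetical bindings using the keys proposed above; sequences are written "first second".
const EXAMPLE_BINDINGS = new Map<string, string>([
  ['n', 'new'],
  ['g p', 'go:providers'],
  ['g t', 'go:trackers'],
  ['?', 'help'],
]);

// Returns a function that takes a key and a timestamp (ms) and yields an action name or null.
function createShortcutMatcher(bindings: Map<string, string>, timeoutMs = 1000) {
  let pendingKey: string | null = null;
  let pendingAt = 0;

  return function onKey(key: string, now: number): string | null {
    // Try to complete a sequence started by the previous key, if it has not expired.
    if (pendingKey !== null && now - pendingAt <= timeoutMs) {
      const action = bindings.get(`${pendingKey} ${key}`);
      pendingKey = null;
      if (action !== undefined) return action;
    }
    pendingKey = null;

    // Single-key binding.
    const single = bindings.get(key);
    if (single !== undefined) return single;

    // Otherwise remember the key if it could start a sequence.
    if (Array.from(bindings.keys()).some((combo) => combo.startsWith(`${key} `))) {
      pendingKey = key;
      pendingAt = now;
    }
    return null;
  };
}
```

A real handler would also need to ignore keystrokes while focus is inside an input or textarea, which this sketch does not cover.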
---
## Production polish checklist (top 15, prioritized)
1. **[HIGH]** Add `role="status" aria-live="polite"` to Snackbar container; `assertive` for error toasts. (F-A11Y-01) — one-line fix.
2. **[HIGH]** Add `aria-current="page"` to every nav link in `+layout.svelte`. (F-NAV-01, F-A11Y-02)
3. **[HIGH]** Mass-replace the legacy form-input class (`border border-[var(--color-border)] rounded-md text-sm bg-[var(--color-background)]`) with nothing — let the global `input { ... }` style win. 17 files, ~71 occurrences. (F-CONSIST-04)
4. **[HIGH]** Replace hardcoded hex colors (`#059669`, `#ef4444`, `#3b82f6`, `#f59e0b`, `rgba(239,68,68,...)`) with Aurora palette tokens in `Snackbar.svelte`, `ConfirmModal.svelte`, `actions/+page.svelte`, and any remaining sites. (F-CONSIST-03)
5. **[HIGH]** First-run onboarding: when `providersCache.items.length === 0`, replace dashboard hero with a 4-step "Getting started" checklist. (F-ONBOARD-01)
6. **[HIGH]** Consolidate the 5 glass-card abstractions into a single `<GlassPanel variant=...>` component; delete redundant `::after` overlays. (F-CONSIST-01)
7. **[HIGH]** Introduce a radius scale (`--radius-card / panel / pill / input / chip / tile`) and refactor `Card.svelte`, `Button.svelte`, `Modal.svelte`, `ConfirmModal.svelte` to use it. (F-CONSIST-02)
8. **[MEDIUM]** Rewrite `ConfirmModal.svelte` to use `<Button variant="secondary">` and `<Button variant="danger">` instead of its own buttons. (F-CONSIST-05)
9. **[MEDIUM]** Add layout-aware `<Loading shape="hero|grid|list">` variants to reduce first-paint layout shift. (F-STATE-01)
10. **[MEDIUM]** Extend `<EmptyState>` with `cta` slot and provider-/tracker-/target-specific copy + a contextual CTA. (F-STATE-02)
11. **[MEDIUM]** Visual length-QA pass for Russian — at least dashboard hero, providers list, settings hero, stat-cards. Playwright screenshot test. (F-COPY-03, F-LOCALE-02)
12. **[MEDIUM]** Implement column-header sort on `notification-trackers`, `targets`, `actions`. Persist in cache state. (F-POWER-01)
13. **[MEDIUM]** Add multi-select bulk actions (enable/disable, delete) to `targets`, `notification-trackers`, `command-trackers`. (F-POWER-02)
14. **[MEDIUM]** Audit contrast: `--color-muted-foreground` over brightest aurora peak; likely bump dark-theme value from `#b6b2d4` to ~`#cfcae8`. (F-A11Y-05)
15. **[MEDIUM]** Replace inline browser-native `title=` tooltips on the collapsed sidebar with a custom Aurora-styled tooltip (using the existing portal helper). (F-NAV-04)
### Quick wins (bonus, under an hour each)
- Add `autofocus` to the username input on `/login`. (F-FORM-05)
- Fix `вЂ"` / `В·` Cyrillic encoding glitches in `login/+page.svelte` and `users/+page.svelte`. (F-LOCALE-03)
- Drop `role="button"` from Modal backdrop. (F-MODAL-02)
- Replace `provider.type` raw label in provider list rows with localized descriptor name. (F-LOCALE-02)
- Add inline empty-state copy to `EventChart` when all `chartDays` values are 0. (F-HIER-03)
---
## What's working — keep doing it
- The conic-gradient brand orb, animated aurora background, Newsreader italic emphasis, gradient pill chips, glow-pulse dots — distinctive identity.
- `Modal.svelte` (focus trap, restore, portal, escape, scroll containment).
- `prefers-reduced-motion` honored across every animation surface.
- Global ⌘K search palette, global provider filter, persisted sidebar state, persisted nav-group expansion.
- Editorial copy on dashboard (`signal stream`, `on watch`, `pulse`, `wires`, `compose`).
- Snackbar with detail-toggle expansion for error context.
- Mobile "More" panel that mirrors the full desktop nav tree.
- 6-file template-variable sync rule honored by project conventions.
- `i18n` parity at 1577 lines for both locales.
End of review.
+2 -2
@@ -1,6 +1,6 @@
{
-"last_commit": "cfdafa9c2b49ea64496e9355d92337dbbb70db93",
-"last_sync": "2026-05-16T00:00:00Z",
+"last_commit": "04fe8124fcc3f783038b9aaac393b6c62c68e22a",
+"last_sync": "2026-05-16T20:04:00Z",
"tracked_files": {
"gitea-python-ci-cd.md": "sha256:9f1f57e1b0d909143e20cb3f21ac9c4d75b45f2992ec002645540f94c4920851",
"gitea-release-workflow.md": "sha256:5eb64789fca062b2138ca7661b942c9fc9c304f63326844ff6f6724e7e05b08c"
+22
@@ -480,6 +480,7 @@
"videoWarning": "Video size warning",
"disableUrlPreview": "Disable link previews",
"sendLargeAsDocuments": "Send large photos as documents",
"sendLargeVideosAsDocuments": "Send oversized videos as documents (bypass 50 MB limit)",
"chatAction": "Chat action",
"chatActionNone": "None (no action)",
"chatActionTyping": "Typing",
@@ -509,6 +510,11 @@
"confirmDeleteReceiver": "Delete this receiver?",
"receiverEnabled": "Receiver enabled",
"receiverDisabled": "Receiver disabled",
"telegramOptions": "Telegram options",
"telegramOptionsSaved": "Telegram options saved",
"telegramDisableNotification": "Send silently (no sound / vibration)",
"telegramThreadId": "Forum topic ID",
"telegramThreadIdPlaceholder": "Leave empty for general topic",
"groupNoBot": "No bot linked",
"groupDirect": "Direct delivery",
"groupBotMissing": "Unknown bot",
@@ -897,6 +903,22 @@
"identityHeadline": "How this instance presents itself to bots, webhooks, and recipients",
"telegramHeadline": "Webhook authentication and media cache tuning",
"loggingHeadline": "Verbosity, output format, and per-module overrides",
"diagnostics": "Diagnostics",
"diagnosticsHeadline": "Temporary DEBUG for one module, auto-reverted",
"diagnosticsHint": "Use to investigate a specific dispatch failure without flooding stderr. The chosen module flips to DEBUG immediately and reverts to its baseline (your per-module overrides or the noisy-library defaults) when the window ends. Restarts also reset.",
"diagModuleQuick": "Module (quick pick)",
"diagModuleCustom": "Or a custom module name",
"diagModuleCustomPlaceholder": "e.g. notify_bridge_server.services.deferred_dispatch",
"diagModuleRequired": "Pick a module first",
"diagDuration": "Duration",
"diagActivate": "Activate DEBUG",
"diagActivated": "Diagnostic mode activated",
"diagActivateFailed": "Failed to activate diagnostic mode",
"diagActive": "Active overrides",
"diagRevertsIn": "Reverts in",
"diagRevertNow": "Revert now",
"diagReverted": "Diagnostic mode reverted",
"diagRevertFailed": "Failed to revert diagnostic mode",
"heroNoUrl": "External URL not set",
"heroNoLocales": "no locales",
"copy": "Copy",
+22
@@ -480,6 +480,7 @@
"videoWarning": "Предупреждение о размере видео",
"disableUrlPreview": "Отключить превью ссылок",
"sendLargeAsDocuments": "Отправлять большие фото как документы",
"sendLargeVideosAsDocuments": "Отправлять видео сверх лимита как документы (обход 50 МБ)",
"chatAction": "Действие в чате",
"chatActionNone": "Нет (без действия)",
"chatActionTyping": "Печатает",
@@ -509,6 +510,11 @@
"confirmDeleteReceiver": "Удалить этого получателя?",
"receiverEnabled": "Получатель включён",
"receiverDisabled": "Получатель отключён",
"telegramOptions": "Параметры Telegram",
"telegramOptionsSaved": "Параметры Telegram сохранены",
"telegramDisableNotification": "Отправлять без звука и вибрации",
"telegramThreadId": "ID темы форума",
"telegramThreadIdPlaceholder": "Оставьте пустым для общей темы",
"groupNoBot": "Без привязки к боту",
"groupDirect": "Прямая доставка",
"groupBotMissing": "Неизвестный бот",
@@ -897,6 +903,22 @@
"identityHeadline": "Как этот сервер представляется ботам, вебхукам и получателям",
"telegramHeadline": "Аутентификация вебхуков и настройка медиакэша",
"loggingHeadline": "Подробность, формат вывода и переопределения по модулям",
"diagnostics": "Диагностика",
"diagnosticsHeadline": "Временный DEBUG для одного модуля с авто-возвратом",
"diagnosticsHint": "Включите, чтобы разобраться в конкретной ошибке отправки без заливания stderr. Выбранный модуль немедленно переходит в DEBUG и возвращается к базовому уровню (вашим переопределениям или умолчаниям для шумных библиотек) по истечении окна. При перезапуске сервера всё сбрасывается.",
"diagModuleQuick": "Модуль (быстрый выбор)",
"diagModuleCustom": "Или произвольное имя модуля",
"diagModuleCustomPlaceholder": "напр. notify_bridge_server.services.deferred_dispatch",
"diagModuleRequired": "Сначала выберите модуль",
"diagDuration": "Длительность",
"diagActivate": "Включить DEBUG",
"diagActivated": "Режим диагностики включён",
"diagActivateFailed": "Не удалось включить режим диагностики",
"diagActive": "Активные переопределения",
"diagRevertsIn": "Вернётся через",
"diagRevertNow": "Вернуть сейчас",
"diagReverted": "Режим диагностики отменён",
"diagRevertFailed": "Не удалось отменить режим диагностики",
"heroNoUrl": "Внешний URL не задан",
"heroNoLocales": "нет локалей",
"copy": "Копировать",
+32
@@ -235,6 +235,35 @@ export type DispatchStatus =
| 'deferred_then_failed'
| 'suppressed_quiet_hours_nondeferrable';
export interface DispatchSummaryError {
  index: number;
  error: string;
}
export interface DispatchSummaryMediaError {
  target_index: number;
  kind?: string;
  chunk?: number;
  item_index?: number;
  error?: string;
  code?: number;
}
export interface DispatchSummary {
  targets_attempted: number;
  targets_succeeded: number;
  targets_failed: number;
  errors?: DispatchSummaryError[];
  errors_truncated?: number;
  media?: {
    delivered: number;
    skipped: number;
    failed: number;
  };
  media_errors?: DispatchSummaryMediaError[];
  media_errors_truncated?: number;
}
export interface EventLog {
id: number;
event_type: string;
@@ -256,6 +285,9 @@ export interface EventLog {
deferred_until?: string;
original_event_log_id?: number | null;
deferred_for_seconds?: number;
dispatch_id?: string;
request_id?: string;
dispatch_summary?: DispatchSummary;
};
created_at: string;
}
@@ -14,6 +14,7 @@
import ReleaseCassette from './ReleaseCassette.svelte';
import CacheLedger from './CacheLedger.svelte';
import LoggingCassette from './LoggingCassette.svelte';
import DiagnosticsCassette from './DiagnosticsCassette.svelte';
import SaveBar from './SaveBar.svelte';
interface CacheBucketStats {
@@ -203,6 +204,8 @@
bind:logFormat={settings.log_format}
bind:logLevels={settings.log_levels}
/>
<DiagnosticsCassette />
</div>
<SaveBar
@@ -0,0 +1,424 @@
<script lang="ts">
import { onMount, onDestroy } from 'svelte';
import { slide } from 'svelte/transition';
import { api } from '$lib/api';
import { t } from '$lib/i18n';
import MdiIcon from '$lib/components/MdiIcon.svelte';
import IconButton from '$lib/components/IconButton.svelte';
import IconGridSelect from '$lib/components/IconGridSelect.svelte';
import { snackSuccess, snackError } from '$lib/stores/snackbar.svelte';
interface ActiveOverride {
  module: string;
  baseline_level: string;
  current_level: string;
  activated_at: string;
  expires_at: string;
  remaining_seconds: number;
}
// Modules ship with shortcuts; users can also type a freeform name
// matching the backend allowlist (notify_bridge_*, sqlalchemy.*, etc.).
// Icons let the IconGridSelect render each entry as a visual chip
// instead of a bare text list — same pattern as the surrounding
// log-level / log-format selectors.
const QUICK_MODULES: { value: string; icon: string; label: string; desc?: string }[] = [
  { value: 'notify_bridge_core.notifications.telegram.client', icon: 'mdiSend', label: 'Telegram client' },
  { value: 'notify_bridge_core.notifications.dispatcher', icon: 'mdiCallSplit', label: 'Dispatcher' },
  { value: 'notify_bridge_core.providers.immich', icon: 'mdiImageMultiple', label: 'Immich provider' },
  { value: 'notify_bridge_server.services.watcher', icon: 'mdiEyeOutline', label: 'Watcher' },
  { value: 'notify_bridge_server.services.deferred_dispatch', icon: 'mdiClockOutline', label: 'Deferred dispatch' },
  { value: 'notify_bridge_server.services.scheduled_dispatch', icon: 'mdiCalendarClock', label: 'Scheduled dispatch' },
  { value: 'sqlalchemy.engine', icon: 'mdiDatabase', label: 'SQLAlchemy engine (SQL)' },
  { value: 'aiohttp.client', icon: 'mdiWeb', label: 'aiohttp client' },
];
const DURATION_PRESETS: { minutes: number; label: string }[] = [
  { minutes: 5, label: '5m' },
  { minutes: 15, label: '15m' },
  { minutes: 30, label: '30m' },
  { minutes: 60, label: '1h' },
  { minutes: 120, label: '2h' },
];
let active = $state<ActiveOverride[]>([]);
let pickedModule = $state(QUICK_MODULES[0].value);
let customModule = $state('');
let pickedMinutes = $state(30);
let submitting = $state(false);
let tickHandle: ReturnType<typeof setInterval> | null = null;
// Resync from the backend every N seconds so a server-side auto-revert
// is reflected even if we missed a tick. Tracked as elapsed-time so the
// 1s ticker can drift without breaking the cadence.
const RESYNC_EVERY_SECONDS = 30;
let lastResyncAt = Date.now();
async function refresh(): Promise<void> {
try {
const data = await api<{ active: ActiveOverride[] }>(
'/settings/diagnostic-mode',
{ method: 'GET' },
);
active = data.active || [];
} catch (err: unknown) {
// Errors are swallowed deliberately: the settings page already shows a
// banner when the API is unreachable, and the next resync retries.
}
}
function tick(): void {
// Cheap local countdown so the UI doesn't poll the server every second
// to render a clock. A full refresh runs every RESYNC_EVERY_SECONDS;
// activate() and revert() update the local list directly.
if (active.length === 0) return;
const now = Date.now();
active = active
.map(a => ({
...a,
remaining_seconds: Math.max(
0,
Math.floor((new Date(a.expires_at).getTime() - now) / 1000),
),
}))
.filter(a => a.remaining_seconds > 0);
}
function startTicker(): void {
if (tickHandle != null) return;
tickHandle = setInterval(() => {
tick();
const now = Date.now();
if (now - lastResyncAt >= RESYNC_EVERY_SECONDS * 1000) {
lastResyncAt = now;
void refresh();
}
}, 1000);
}
function stopTicker(): void {
if (tickHandle != null) {
clearInterval(tickHandle);
tickHandle = null;
}
}
onMount(() => {
lastResyncAt = Date.now();
void refresh();
startTicker();
});
onDestroy(() => {
stopTicker();
});
function effectiveModule(): string {
return (customModule.trim() || pickedModule).trim();
}
async function activate(): Promise<void> {
const mod = effectiveModule();
if (!mod) {
snackError(t('settings.diagModuleRequired'));
return;
}
submitting = true;
try {
const entry = await api<ActiveOverride>('/settings/diagnostic-mode', {
method: 'POST',
body: JSON.stringify({ module: mod, duration_minutes: pickedMinutes }),
});
// Replace any existing row for this module with the new schedule.
active = [
...active.filter(a => a.module !== entry.module),
entry,
];
customModule = '';
snackSuccess(t('settings.diagActivated'));
} catch (err: unknown) {
const msg = err instanceof Error ? err.message : String(err);
snackError(msg || t('settings.diagActivateFailed'));
} finally {
submitting = false;
}
}
async function revert(module: string): Promise<void> {
try {
await api(`/settings/diagnostic-mode/${encodeURIComponent(module)}`, {
method: 'DELETE',
});
active = active.filter(a => a.module !== module);
snackSuccess(t('settings.diagReverted'));
} catch (err: unknown) {
const msg = err instanceof Error ? err.message : String(err);
snackError(msg || t('settings.diagRevertFailed'));
}
}
function formatRemaining(seconds: number): string {
if (seconds <= 0) return '0s';
const mins = Math.floor(seconds / 60);
const secs = seconds % 60;
if (mins >= 60) {
const hours = Math.floor(mins / 60);
const remMins = mins % 60;
return `${hours}h ${remMins}m`;
}
if (mins > 0) return `${mins}m ${secs}s`;
return `${secs}s`;
}
</script>
<section class="diag glass">
<header class="diag-head">
<div class="diag-eyebrow">
<MdiIcon name="mdiBugOutline" size={12} />
<span>{t('settings.diagnostics')}</span>
</div>
<h3 class="diag-title">{t('settings.diagnosticsHeadline')}</h3>
<p class="diag-sub">{t('settings.diagnosticsHint')}</p>
</header>
<!-- Compose new override -->
<div class="diag-compose">
<div class="diag-label">
<span>{t('settings.diagModuleQuick')}</span>
<IconGridSelect items={QUICK_MODULES} bind:value={pickedModule} columns={2} compact />
</div>
<label class="diag-label">
<span>{t('settings.diagModuleCustom')}</span>
<input
bind:value={customModule}
type="text"
autocomplete="off"
spellcheck="false"
placeholder={t('settings.diagModuleCustomPlaceholder')}
class="diag-input"
/>
</label>
<div class="diag-label">
<span>{t('settings.diagDuration')}</span>
<div class="diag-duration-chips">
{#each DURATION_PRESETS as preset (preset.minutes)}
<button
type="button"
class="diag-chip"
class:diag-chip-active={pickedMinutes === preset.minutes}
onclick={() => (pickedMinutes = preset.minutes)}
>
{preset.label}
</button>
{/each}
</div>
</div>
<button
type="button"
onclick={activate}
disabled={submitting}
class="diag-activate"
>
<MdiIcon name="mdiPlay" size={14} />
<span>{submitting ? t('common.loading') : t('settings.diagActivate')}</span>
</button>
</div>
<!-- Active overrides list -->
{#if active.length > 0}
<div class="diag-active" in:slide={{ duration: 180 }}>
<div class="diag-active-head">
<MdiIcon name="mdiTimerSandComplete" size={12} />
<span>{t('settings.diagActive')}</span>
</div>
{#each active as ov (ov.module)}
<div class="diag-row">
<div class="diag-row-info">
<code class="diag-row-module">{ov.module}</code>
<span class="diag-row-meta">
{t('settings.diagRevertsIn')} <strong>{formatRemaining(ov.remaining_seconds)}</strong>
<span class="diag-row-baseline">{ov.baseline_level}</span>
</span>
</div>
<IconButton
icon="mdiUndoVariant"
title={t('settings.diagRevertNow')}
onclick={() => revert(ov.module)}
size={16}
/>
</div>
{/each}
</div>
{/if}
</section>
<style>
.diag {
padding: 1.5rem 1.6rem 1.4rem;
display: flex;
flex-direction: column;
gap: 1.15rem;
}
.diag-head {
position: relative;
z-index: 1;
}
.diag-eyebrow {
display: inline-flex;
align-items: center;
gap: 0.35rem;
font-family: var(--font-mono);
font-size: 0.62rem;
text-transform: uppercase;
letter-spacing: 0.18em;
color: var(--color-muted-foreground);
margin-bottom: 0.45rem;
}
.diag-title {
margin: 0;
font-family: var(--font-display);
font-style: italic;
font-weight: 400;
font-size: 1.15rem;
line-height: 1.35;
letter-spacing: -0.015em;
color: var(--color-foreground);
max-width: 38ch;
}
.diag-sub {
margin: 0.45rem 0 0 0;
font-size: 0.78rem;
color: var(--color-muted-foreground);
max-width: 56ch;
}
.diag-compose {
position: relative;
z-index: 1;
display: flex;
flex-direction: column;
gap: 0.7rem;
padding-top: 0.4rem;
border-top: 1px solid var(--color-border);
}
.diag-label {
display: flex;
flex-direction: column;
gap: 0.32rem;
}
.diag-label > span {
font-size: 0.74rem;
font-weight: 500;
color: var(--color-foreground);
}
.diag-input {
width: 100%;
font-family: var(--font-mono);
font-size: 0.78rem;
padding: 0.45rem 0.7rem;
border: 1px solid var(--color-border);
border-radius: 8px;
background: var(--color-glass);
color: var(--color-foreground);
}
.diag-duration-chips {
display: flex;
flex-wrap: wrap;
gap: 0.35rem;
}
.diag-chip {
padding: 0.32rem 0.75rem;
border-radius: 999px;
border: 1px solid var(--color-border);
background: transparent;
color: var(--color-muted-foreground);
font-family: var(--font-mono);
font-size: 0.72rem;
cursor: pointer;
transition: background 0.15s, color 0.15s, border-color 0.15s;
}
.diag-chip:hover {
background: var(--color-glass-strong);
color: var(--color-foreground);
}
.diag-chip-active {
background: color-mix(in srgb, var(--color-primary) 12%, transparent);
color: var(--color-primary);
border-color: color-mix(in srgb, var(--color-primary) 45%, var(--color-border));
}
.diag-activate {
display: inline-flex;
align-items: center;
justify-content: center;
gap: 0.4rem;
align-self: flex-start;
padding: 0.55rem 1.1rem;
border-radius: 10px;
border: 1px solid color-mix(in srgb, var(--color-primary) 45%, var(--color-border));
background: color-mix(in srgb, var(--color-primary) 12%, transparent);
color: var(--color-primary);
font-family: var(--font-display);
font-style: italic;
font-size: 0.85rem;
cursor: pointer;
transition: background 0.15s, color 0.15s, border-color 0.15s;
}
.diag-activate:hover {
background: color-mix(in srgb, var(--color-primary) 18%, transparent);
border-color: color-mix(in srgb, var(--color-primary) 65%, var(--color-border));
}
.diag-activate:disabled {
opacity: 0.5;
cursor: not-allowed;
}
.diag-active {
display: flex;
flex-direction: column;
gap: 0.4rem;
padding-top: 0.55rem;
border-top: 1px solid var(--color-border);
}
.diag-active-head {
display: inline-flex;
align-items: center;
gap: 0.3rem;
font-family: var(--font-mono);
font-size: 0.58rem;
text-transform: uppercase;
letter-spacing: 0.18em;
color: var(--color-muted-foreground);
}
.diag-row {
display: flex;
align-items: center;
justify-content: space-between;
gap: 0.6rem;
padding: 0.5rem 0.65rem;
border-radius: 10px;
border: 1px solid var(--color-border);
background: var(--color-glass-strong);
}
.diag-row-info {
display: flex;
flex-direction: column;
gap: 0.2rem;
min-width: 0;
}
.diag-row-module {
font-family: var(--font-mono);
font-size: 0.78rem;
color: var(--color-foreground);
word-break: break-all;
}
.diag-row-meta {
font-size: 0.72rem;
color: var(--color-muted-foreground);
}
.diag-row-baseline {
font-family: var(--font-mono);
font-size: 0.7rem;
margin-left: 0.4rem;
opacity: 0.7;
}
</style>
@@ -166,7 +166,7 @@
const defaultForm = () => ({
name: '', icon: '', bot_id: 0, bot_token: '',
max_media_to_send: 50, max_media_per_group: 10, media_delay: 500, max_asset_size: 50,
-disable_url_preview: true, send_large_photos_as_documents: false, ai_captions: false, chat_action: 'typing',
+disable_url_preview: true, send_large_photos_as_documents: false, send_large_videos_as_documents: false, ai_captions: false, chat_action: 'typing',
// Discord/Slack shared settings
username: '',
// ntfy shared settings
@@ -407,7 +407,7 @@
bot_id: c.bot_id || 0, bot_token: '',
max_media_to_send: c.max_media_to_send ?? 50, max_media_per_group: c.max_media_per_group ?? 10,
media_delay: c.media_delay ?? 500, max_asset_size: c.max_asset_size ?? 50,
-disable_url_preview: c.disable_url_preview ?? false, send_large_photos_as_documents: c.send_large_photos_as_documents ?? false,
+disable_url_preview: c.disable_url_preview ?? false, send_large_photos_as_documents: c.send_large_photos_as_documents ?? false, send_large_videos_as_documents: c.send_large_videos_as_documents ?? false,
ai_captions: c.ai_captions ?? false, chat_action: tgt.chat_action ?? c.chat_action ?? 'typing',
// discord/slack
username: c.username || '',
@@ -448,6 +448,7 @@
max_media_to_send: form.max_media_to_send, max_media_per_group: form.max_media_per_group,
media_delay: form.media_delay, max_asset_size: form.max_asset_size,
disable_url_preview: form.disable_url_preview, send_large_photos_as_documents: form.send_large_photos_as_documents,
send_large_videos_as_documents: form.send_large_videos_as_documents,
ai_captions: form.ai_captions,
};
} else if (formType === 'webhook') {
@@ -603,6 +604,63 @@
} catch (err: unknown) { snackError(errMsg(err)); }
}
// Per-Telegram-receiver options panel: silent send + forum thread id.
// Edits the receiver's config dict in place via PUT.
let editingReceiverId = $state<number | null>(null);
// ``<input type="number">`` binds either a ``number`` or empty string
// when the field is blank — model both so TS strict mode and the save
// path's ``Number(raw)`` coercion agree.
let editingReceiverOptions = $state<{ disable_notification: boolean; message_thread_id: number | '' }>({
disable_notification: false,
message_thread_id: '',
});
function openEditReceiver(_targetId: number, receiver: TargetReceiver) {
editingReceiverId = receiver.id;
// Empty string maps to "no thread" — the form's <input type=number>
// produces '' for an empty field, which we normalize to null on save.
const raw = receiver.config?.message_thread_id;
const parsed = raw == null || raw === '' ? '' : Number(raw);
editingReceiverOptions = {
disable_notification: Boolean(receiver.config?.disable_notification),
message_thread_id: typeof parsed === 'number' && Number.isFinite(parsed) ? parsed : '',
};
}
function cancelEditReceiver() {
editingReceiverId = null;
}
async function saveEditReceiver(targetId: number, receiverId: number) {
const target = allTargets.find(t => t.id === targetId);
const receiver = target?.receivers?.find(r => r.id === receiverId);
if (!receiver) return;
// Merge new options into the existing config so we don't lose the chat_id
// or any other receiver-specific keys (language_code on Telegram).
const newConfig: Record<string, any> = { ...receiver.config };
newConfig.disable_notification = editingReceiverOptions.disable_notification;
const raw = editingReceiverOptions.message_thread_id;
if (raw === '' || raw == null) {
delete newConfig.message_thread_id;
} else {
const parsed = Number(raw);
if (Number.isFinite(parsed) && parsed > 0) {
newConfig.message_thread_id = Math.trunc(parsed);
} else {
delete newConfig.message_thread_id;
}
}
try {
await api(`/targets/${targetId}/receivers/${receiverId}`, {
method: 'PUT',
body: JSON.stringify({ config: newConfig }),
});
editingReceiverId = null;
await load();
snackSuccess(t('targets.telegramOptionsSaved'));
} catch (err: unknown) { snackError(errMsg(err)); }
}
async function toggleBroadcastChild(targetId: number, childId: number) {
const tgt = allTargets.find(t => t.id === targetId);
if (!tgt) return;
@@ -753,6 +811,8 @@
{receiverBotChats}
{receiverTesting}
{receiverLabel}
{editingReceiverId}
bind:editingReceiverOptions
onopenReceiverForm={openReceiverForm}
onsaveReceiver={saveReceiver}
oncancelReceiver={() => addingReceiverForTarget = null}
@@ -762,6 +822,9 @@
onloadBotChats={loadReceiverBotChats}
onchangeReceiverForm={(f) => receiverForm = f}
ontoggleBroadcastChild={toggleBroadcastChild}
onopenEditReceiver={openEditReceiver}
oncancelEditReceiver={cancelEditReceiver}
onsaveEditReceiver={saveEditReceiver}
/>
</div>
{/if}
@@ -16,6 +16,12 @@
receiverBotChats: Record<number, TelegramChat[]>;
receiverTesting: Record<number, boolean>;
receiverLabel: (target: NotificationTarget, recv: TargetReceiver) => string;
// Telegram-only editing state. Optional so a future caller that
// reuses this component for a non-Telegram target page doesn't have
// to pass dead props; the cog button only renders when both the
// target type matches AND the handlers are wired.
editingReceiverId?: number | null;
editingReceiverOptions?: Record<string, any>;
onopenReceiverForm: (targetId: number, targetType: string) => void;
onsaveReceiver: (targetId: number) => void;
oncancelReceiver: () => void;
@@ -25,6 +31,9 @@
onloadBotChats: (botId: number) => void;
onchangeReceiverForm: (form: Record<string, any>) => void;
ontoggleBroadcastChild?: (targetId: number, childId: number) => void;
onopenEditReceiver?: (targetId: number, receiver: TargetReceiver) => void;
oncancelEditReceiver?: () => void;
onsaveEditReceiver?: (targetId: number, receiverId: number) => void;
}
let {
@@ -37,6 +46,8 @@
receiverBotChats,
receiverTesting,
receiverLabel,
editingReceiverId,
editingReceiverOptions = $bindable(),
onopenReceiverForm,
onsaveReceiver,
oncancelReceiver,
@@ -46,6 +57,9 @@
onloadBotChats,
onchangeReceiverForm,
ontoggleBroadcastChild,
onopenEditReceiver,
oncancelEditReceiver,
onsaveEditReceiver,
}: Props = $props();
</script>
@@ -92,11 +106,25 @@
{#if (recv as any).language_code || recv.config?.language_code}
<span class="text-xs px-1 py-0.5 rounded bg-[var(--color-muted)] text-[var(--color-muted-foreground)]">{((recv as any).language_code || recv.config.language_code).toUpperCase()}</span>
{/if}
{#if target.type === 'telegram' && recv.config?.disable_notification}
<MdiIcon name="mdiBellOff" size={12} />
{/if}
{#if target.type === 'telegram' && recv.config?.message_thread_id != null && recv.config?.message_thread_id !== ''}
<span class="text-xs px-1 py-0.5 rounded bg-[var(--color-muted)] text-[var(--color-muted-foreground)]" title={t('targets.telegramThreadId')}>#{recv.config.message_thread_id}</span>
{/if}
</div>
<div class="flex items-center gap-1">
<IconButton icon="mdiSend" title={t('targets.test')}
onclick={() => ontestReceiver(target.id, recv.id)}
disabled={receiverTesting[recv.id]} size={16} />
{#if target.type === 'telegram' && onopenEditReceiver != null}
<IconButton
icon="mdiCog"
title={t('targets.telegramOptions')}
onclick={() => onopenEditReceiver!(target.id, recv)}
size={16}
/>
{/if}
<IconButton
icon={recv.enabled ? 'mdiToggleSwitch' : 'mdiToggleSwitchOff'}
title={recv.enabled ? t('targets.receiverDisabled') : t('targets.receiverEnabled')}
@@ -112,6 +140,31 @@
/>
</div>
</div>
{#if target.type === 'telegram' && editingReceiverId === recv.id && editingReceiverOptions != null && onsaveEditReceiver != null && oncancelEditReceiver != null}
<div in:slide={{ duration: 150 }} class="mb-2 ml-6 mr-2 p-2 rounded-md border border-[var(--color-border)] bg-[var(--color-background)]">
<label class="flex items-center gap-2 text-sm mb-2 cursor-pointer">
<input type="checkbox" bind:checked={editingReceiverOptions.disable_notification} />
<span>{t('targets.telegramDisableNotification')}</span>
</label>
<label class="flex flex-col gap-1 text-sm mb-2">
<span>{t('targets.telegramThreadId')}</span>
<input type="number" min="1" inputmode="numeric"
bind:value={editingReceiverOptions.message_thread_id}
placeholder={t('targets.telegramThreadIdPlaceholder')}
class="w-full px-2 py-1 border border-[var(--color-border)] rounded-md text-sm bg-[var(--color-background)]" />
</label>
<div class="flex gap-2">
<button type="button" onclick={() => onsaveEditReceiver!(target.id, recv.id)}
class="px-3 py-1 bg-[var(--color-primary)] text-[var(--color-primary-foreground)] rounded-md text-xs font-medium hover:opacity-90">
{t('common.save')}
</button>
<button type="button" onclick={oncancelEditReceiver}
class="px-3 py-1 border border-[var(--color-border)] rounded-md text-xs hover:bg-[var(--color-muted)]">
{t('targets.cancel')}
</button>
</div>
</div>
{/if}
{/each}
<!-- Telegram: chat picker palette opens directly from the "Add receiver" button — no inline section. -->
@@ -23,6 +23,7 @@
max_asset_size: number;
disable_url_preview: boolean;
send_large_photos_as_documents: boolean;
send_large_videos_as_documents: boolean;
ai_captions: boolean;
chat_action: string;
username: string;
@@ -131,6 +132,7 @@
</div>
<label class="flex items-center gap-2 text-sm col-span-2"><input type="checkbox" bind:checked={form.disable_url_preview} /> {t('targets.disableUrlPreview')}</label>
<label class="flex items-center gap-2 text-sm col-span-2"><input type="checkbox" bind:checked={form.send_large_photos_as_documents} /> {t('targets.sendLargeAsDocuments')}</label>
<label class="flex items-center gap-2 text-sm col-span-2"><input type="checkbox" bind:checked={form.send_large_videos_as_documents} /> {t('targets.sendLargeVideosAsDocuments')}</label>
</div>
{/if}
</div>
@@ -14,6 +14,7 @@ Kept in ``notify_bridge_core`` so core modules (``TelegramClient``,
from __future__ import annotations
import uuid
from contextlib import contextmanager
from contextvars import ContextVar, Token
from typing import Any, Iterator
@@ -56,6 +57,22 @@ def bind_log_context(**kwargs: Any) -> Iterator[None]:
var.reset(tok)
def ensure_dispatch_id() -> str:
"""Return the bound ``dispatch_id`` if one is active, else a new one.
Format matches :class:`NotificationDispatcher.dispatch` (``disp:<12 hex>``)
so logs and ``EventLog.details.dispatch_id`` use a single shape. Callers
typically wrap a top-level handler with::
with bind_log_context(dispatch_id=ensure_dispatch_id()):
...
so nested calls inherit the same id and any ``EventLog`` row written
inside the block can be correlated with the dispatcher's log lines.
"""
return dispatch_id_var.get() or f"disp:{uuid.uuid4().hex[:12]}"
def current_log_context() -> dict[str, Any]:
"""Return a snapshot of the currently-bound context values (non-None)."""
snap: dict[str, Any] = {}
@@ -64,3 +81,43 @@ def current_log_context() -> dict[str, Any]:
if val is not None:
snap[key] = val
return snap
# Keys copied onto ``EventLog.details`` so an operator can grep stderr for
# the matching ``disp=``/``req=`` log lines after spotting a row in the UI.
# Kept narrow on purpose — ``chat_id``/``bot_id``/``command`` are already
# represented by dedicated EventLog columns.
_CORRELATION_KEYS = ("dispatch_id", "request_id")
def enrich_details_with_correlation(
details: dict[str, Any] | None,
) -> dict[str, Any]:
"""Return a (shallow) copy of ``details`` with active correlation IDs merged in.
Use this when constructing an ``EventLog.details`` dict so the persisted
row carries the same ``dispatch_id`` / ``request_id`` that the stderr log
lines emitted during the same dispatch carry. The mapping makes it
possible to jump from a row in the dashboard to the corresponding log
lines without server-side correlation.
Existing keys in ``details`` are NOT overwritten — callers can pin a
specific value (e.g. a synthetic dispatch_id for a backfilled row) by
setting it themselves before calling.
The copy is shallow. Nested mutable values (lists, dicts) are shared with
the input — fine for the all-scalar dicts every current call site passes,
but callers that intend to mutate after this returns should ``deepcopy``
themselves.
"""
result: dict[str, Any] = dict(details or {})
for key in _CORRELATION_KEYS:
if key in result:
continue
var = _VAR_MAP.get(key)
if var is None:
continue
val = var.get()
if val is not None:
result[key] = val
return result
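The two helpers above can be exercised together. The following is a minimal, self-contained sketch rather than the module itself: `dispatch_id_var`, `request_id_var`, and `_VAR_MAP` are re-created locally because their real definitions sit outside this hunk, so their shape here is an assumption. It shows that an id bound by the caller is reused, and that a key already present in `details` is left alone.

```python
from __future__ import annotations

import uuid
from contextvars import ContextVar
from typing import Any

# Stand-ins for the module-level vars; the real definitions are not in this hunk.
dispatch_id_var: ContextVar[str | None] = ContextVar("dispatch_id", default=None)
request_id_var: ContextVar[str | None] = ContextVar("request_id", default=None)
_VAR_MAP = {"dispatch_id": dispatch_id_var, "request_id": request_id_var}
_CORRELATION_KEYS = ("dispatch_id", "request_id")


def ensure_dispatch_id() -> str:
    """Reuse the bound dispatch id, or mint a new ``disp:<12 hex>`` one."""
    return dispatch_id_var.get() or f"disp:{uuid.uuid4().hex[:12]}"


def enrich_details_with_correlation(details: dict[str, Any] | None) -> dict[str, Any]:
    """Shallow-copy ``details`` and add active ids without overwriting keys."""
    result: dict[str, Any] = dict(details or {})
    for key in _CORRELATION_KEYS:
        val = _VAR_MAP[key].get()
        if key not in result and val is not None:
            result[key] = val
    return result


# Nothing bound yet, so a fresh id is generated.
fresh = ensure_dispatch_id()

# Bind an id the way a top-level handler would, then build EventLog details.
token = dispatch_id_var.set("disp:0123456789ab")
try:
    reused = ensure_dispatch_id()
    enriched = enrich_details_with_correlation({"status": "sent"})
    # A caller-pinned id (e.g. for a backfilled row) wins over the bound one.
    pinned = enrich_details_with_correlation({"dispatch_id": "disp:backfilled00"})
finally:
    dispatch_id_var.reset(token)
```

Since `request_id_var` is never bound in this sketch, no `request_id` key appears in the enriched dict; only non-`None` values are copied.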
@@ -5,13 +5,12 @@ from __future__ import annotations
import asyncio
import contextlib
import logging
-import uuid
from dataclasses import dataclass, field
from typing import Any, AsyncIterator, Awaitable, Callable, Final
import aiohttp
-from notify_bridge_core.log_context import bind_log_context, dispatch_id_var
+from notify_bridge_core.log_context import bind_log_context, ensure_dispatch_id
from notify_bridge_core.models.events import ServiceEvent
from notify_bridge_core.templates.context import build_template_context
from notify_bridge_core.templates.renderer import render_template
@@ -132,7 +131,7 @@ class NotificationDispatcher:
Returns one result per target. Per-target failures are isolated;
a single bad target cannot poison the batch.
"""
-new_id = dispatch_id_var.get() or f"disp:{uuid.uuid4().hex[:12]}"
+new_id = ensure_dispatch_id()
with bind_log_context(dispatch_id=new_id):
_LOGGER.info(
@@ -341,6 +340,7 @@ class NotificationDispatcher:
max_size_mb = target.config.get("max_asset_size")
max_size_bytes = max_size_mb * 1024 * 1024 if max_size_mb else None
send_large_as_docs = target.config.get("send_large_photos_as_documents", False)
send_large_videos_as_docs = target.config.get("send_large_videos_as_documents", False)
if not bot_token:
return {"success": False, "error": "Missing bot_token"}
@@ -392,6 +392,8 @@ class NotificationDispatcher:
chat_id=receiver.chat_id,
text=message,
disable_web_page_preview=bool(disable_preview),
disable_notification=receiver.disable_notification,
message_thread_id=receiver.message_thread_id,
)
if not text_result.get("success"):
_LOGGER.warning(
@@ -409,22 +411,45 @@ class NotificationDispatcher:
chunk_delay=chunk_delay,
max_asset_data_size=max_size_bytes,
send_large_photos_as_documents=send_large_as_docs,
send_large_videos_as_documents=send_large_videos_as_docs,
chat_action=chat_action or None,
disable_notification=receiver.disable_notification,
message_thread_id=receiver.message_thread_id,
)
-if not media_result.get("success"):
+delivered = media_result.get("delivered_count", 0)
+skipped = media_result.get("skipped_count", 0)
+failed = media_result.get("failed_count", 0)
+media_success = media_result.get("success", False)
+has_partial_loss = skipped > 0 or failed > 0
+if not media_success:
_LOGGER.warning(
-"Text sent OK but media failed for chat %s: %s",
-receiver.chat_id, media_result.get("error"),
+"Text sent OK but media failed for chat %s "
+"(delivered=%d skipped=%d failed=%d): %s",
+receiver.chat_id, delivered, skipped, failed,
+media_result.get("error"),
)
elif has_partial_loss:
_LOGGER.warning(
"Partial media delivery for chat %s "
"(delivered=%d skipped=%d failed=%d)",
receiver.chat_id, delivered, skipped, failed,
)
if not media_success or has_partial_loss:
# Preserve both outcomes — text succeeded, media
-# didn't. Operators losing media-failure detail
-# in the result dict made root-cause analysis
+# partially or fully didn't. Operators losing
+# media-failure detail made root-cause analysis
# impossible.
return {
"success": True,
"message_id": text_result.get("message_id"),
"media_error": media_result.get("error"),
"media_failed_at_chunk": media_result.get("failed_at_chunk"),
"media_delivered_count": delivered,
"media_skipped_count": skipped,
"media_failed_count": failed,
"media_errors": media_result.get("errors"),
}
return text_result
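The branch above keeps the text send marked as successful while attaching the media outcome to the result. A minimal sketch of that folding step follows; `merge_media_outcome` is a hypothetical helper name, not something in the commit, and it assumes the fall-through case returns the text result unchanged.

```python
from __future__ import annotations

from typing import Any


def merge_media_outcome(
    text_result: dict[str, Any], media_result: dict[str, Any],
) -> dict[str, Any]:
    """Hypothetical helper mirroring the dispatcher branch above."""
    delivered = media_result.get("delivered_count", 0)
    skipped = media_result.get("skipped_count", 0)
    failed = media_result.get("failed_count", 0)
    media_success = media_result.get("success", False)
    has_partial_loss = skipped > 0 or failed > 0
    if media_success and not has_partial_loss:
        # Clean delivery: nothing extra to report.
        return text_result
    # Text went out, media did not (fully): keep both outcomes visible.
    return {
        "success": True,
        "message_id": text_result.get("message_id"),
        "media_error": media_result.get("error"),
        "media_failed_at_chunk": media_result.get("failed_at_chunk"),
        "media_delivered_count": delivered,
        "media_skipped_count": skipped,
        "media_failed_count": failed,
        "media_errors": media_result.get("errors"),
    }


text_ok = {"success": True, "message_id": 42}
clean = merge_media_outcome(text_ok, {"success": True, "delivered_count": 3})
partial = merge_media_outcome(
    text_ok,
    {"success": True, "delivered_count": 2, "skipped_count": 1, "failed_count": 0},
)
```

Note that a media result reporting `success: True` still takes the detailed path whenever any item was skipped or failed, which is what lets partial outcomes reach the `dispatch_summary` block.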
@@ -20,9 +20,21 @@ class Receiver:
@dataclass
class TelegramReceiver(Receiver):
-"""Telegram chat receiver."""
+"""Telegram chat receiver.
``disable_notification`` toggles Telegram's ``disable_notification=true``
flag — the message is delivered without an audible / vibration alert.
Useful for low-priority chats that the user reads but doesn't want to
be paged by.
``message_thread_id`` routes the send into a specific forum topic on a
supergroup with topics enabled. ``None`` means "general topic" (default
Telegram behaviour).
"""
chat_id: str = ""
disable_notification: bool = False
message_thread_id: int | None = None
@dataclass
@@ -80,9 +92,30 @@ def _coerce_int(value: Any, default: int) -> int:
return default
def _coerce_telegram_thread_id(value: Any) -> int | None:
"""Coerce a config value to a positive Telegram forum-topic id.
The Bot API treats omission, ``0``, and negative values all as
"general topic", so we collapse them to ``None`` for consistency
with the frontend (which rejects ``<= 0``). Booleans are explicitly
rejected so ``int(True) == 1`` doesn't silently route a misconfigured
chat into topic #1.
"""
if value is None or value == "" or isinstance(value, bool):
return None
try:
n = int(value)
except (TypeError, ValueError):
return None
return n if n > 0 else None
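To make the coercion rules concrete, the snippet below runs the function on a handful of inputs. The function body is copied from the hunk so the snippet runs on its own; the sample values are illustrative and not taken from the test suite.

```python
from __future__ import annotations

from typing import Any


def _coerce_telegram_thread_id(value: Any) -> int | None:
    """Copy of the helper above: positive int, or None for everything else."""
    if value is None or value == "" or isinstance(value, bool):
        return None
    try:
        n = int(value)
    except (TypeError, ValueError):
        return None
    return n if n > 0 else None


# Illustrative inputs: valid ids, blanks, non-positive numbers, and junk.
samples: list[Any] = [7, "12", " 3 ", 0, -5, "", None, True, "abc", 4.9, [1]]
coerced = [_coerce_telegram_thread_id(v) for v in samples]
```

One behaviour worth knowing: a float such as `4.9` is truncated to `4` rather than rejected, because `int()` accepts floats. Whether that matters depends on what the config store can actually contain.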
_RECEIVER_FACTORIES: dict[str, _ReceiverFactory] = {
"telegram": lambda locale, config: TelegramReceiver(
-locale=locale, config=config, chat_id=str(config.get("chat_id", "")),
+locale=locale, config=config,
+chat_id=str(config.get("chat_id", "")),
+disable_notification=bool(config.get("disable_notification", False)),
+message_thread_id=_coerce_telegram_thread_id(config.get("message_thread_id")),
),
"webhook": lambda locale, config: WebhookReceiver(
locale=locale, config=config,
@@ -3,12 +3,14 @@
from __future__ import annotations
import asyncio
import contextlib
import json
import logging
import mimetypes
import re
from contextvars import ContextVar
from dataclasses import dataclass, field
-from typing import Any, Callable, Final
+from typing import Any, Callable, Final, Iterator
import aiohttp
from aiohttp import FormData
@@ -19,6 +21,7 @@ from .cache import TelegramFileCache
from .media import (
TELEGRAM_API_BASE_URL,
TELEGRAM_MAX_CAPTION_LENGTH,
TELEGRAM_MAX_GROUP_TOTAL_BYTES,
TELEGRAM_MAX_PHOTO_SIZE,
TELEGRAM_MAX_TEXT_LENGTH,
TELEGRAM_MAX_VIDEO_SIZE,
@@ -27,7 +30,6 @@ from .media import (
extract_asset_id_from_url,
is_asset_cache_key,
is_asset_id,
-split_media_by_upload_size,
)
_LOGGER = logging.getLogger(__name__)
@@ -56,6 +58,68 @@ _UPLOAD_TIMEOUT: Final = aiohttp.ClientTimeout(total=120, connect=10)
_DOWNLOAD_TIMEOUT: Final = aiohttp.ClientTimeout(total=120, connect=10)
# ---------------------------------------------------------------------------
# Per-send options (disable_notification, message_thread_id, …)
# ---------------------------------------------------------------------------
#
# These are properties of a single send, not of the bot or the client, and
# they fan out into the JSON / multipart payload at four different sites
# (sendMessage, sendPhoto/Video/Document, sendMediaGroup, cache-hit POST).
# Rather than threading the kwargs through every internal helper, we bind
# them on a ContextVar inside the public ``send_message`` / ``send_notification``
# entry points; the payload builders read the var when constructing the
# request. ContextVar propagation isolates concurrent ``asyncio.gather``
# fan-outs in the dispatcher (one task per receiver) — each task sees the
# value its own caller bound.
@dataclass(frozen=True)
class _SendOptions:
"""Per-send Telegram flags applied to every API call within one send.
``disable_notification`` maps to Bot API ``disable_notification=true``
— the chat receives the message silently. ``message_thread_id`` routes
the message into a specific forum-topic on supergroups with topics
enabled; ``None`` means "general topic" (Bot API omits the field).
"""
disable_notification: bool = False
message_thread_id: int | None = None
_send_options_var: ContextVar[_SendOptions] = ContextVar(
"_tg_send_options", default=_SendOptions(),
)
@contextlib.contextmanager
def _bind_send_options(opts: _SendOptions) -> Iterator[None]:
"""Bind per-send options for the duration of the ``with`` block."""
token = _send_options_var.set(opts)
try:
yield
finally:
_send_options_var.reset(token)
def _apply_send_opts_to_payload(payload: dict[str, Any]) -> None:
"""Merge the active per-send options into a JSON request body."""
opts = _send_options_var.get()
if opts.disable_notification:
payload["disable_notification"] = True
if opts.message_thread_id is not None:
payload["message_thread_id"] = opts.message_thread_id
def _apply_send_opts_to_form(form: FormData) -> None:
"""Merge the active per-send options into a multipart form payload."""
opts = _send_options_var.get()
if opts.disable_notification:
form.add_field("disable_notification", "true")
if opts.message_thread_id is not None:
form.add_field("message_thread_id", str(opts.message_thread_id))
def _extract_retry_after(result: dict[str, Any]) -> int | None:
"""Return the retry_after seconds from a Telegram error response.
@@ -135,10 +199,27 @@ class _MediaItem:
keyed by position. Bundling these together prevents the
``media_json`` and ``cache_info`` lists from drifting out of
alignment under future edits.
``source_url`` and ``download_headers`` let the per-item fallback
re-download a cache-hit item if its ``file_id`` POST returns
transient errors — without them, a stale ``file_id`` would silently
lose a cached asset that the original single-item path would have
recovered.
"""
media_json: dict[str, Any]
cache_info: tuple[str, str, str | None, int] | None
attachment: tuple[str, bytes, str, str] | None # (name, data, filename, content_type)
source_url: str | None = None
download_headers: dict[str, str] | None = None
@property
def upload_bytes(self) -> int:
"""Bytes this item contributes to a multipart sendMediaGroup payload.
Cached items (referenced by ``file_id``) contribute 0 since
Telegram serves them server-side without us re-uploading.
"""
return len(self.attachment[1]) if self.attachment else 0
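The `upload_bytes` property is what a byte-budget check would sum per group. The actual chunking code is not part of this hunk, so the following is only a sketch of one plausible approach (greedy packing), not the commit's implementation. `Item` and `chunk_by_budget` are hypothetical names, and the constant's value is assumed from the "45 MB" in the commit message.

```python
from __future__ import annotations

from dataclasses import dataclass

# Assumed value; the real constant lives in media.py and is not shown here.
TELEGRAM_MAX_GROUP_TOTAL_BYTES = 45 * 1024 * 1024


@dataclass
class Item:
    """Hypothetical stand-in for _MediaItem; only the upload size matters."""
    name: str
    upload_bytes: int  # 0 for cache hits referenced by file_id


def chunk_by_budget(
    items: list[Item], max_items: int, max_bytes: int,
) -> list[list[Item]]:
    """Greedily pack items into groups bounded by count and upload bytes."""
    chunks: list[list[Item]] = []
    current: list[Item] = []
    used = 0
    for item in items:
        over_count = len(current) >= max_items
        over_bytes = used + item.upload_bytes > max_bytes
        # Close the current group before it would exceed either limit.
        if current and (over_count or over_bytes):
            chunks.append(current)
            current, used = [], 0
        current.append(item)
        used += item.upload_bytes
    if current:
        chunks.append(current)
    return chunks


MB = 1024 * 1024
items = [
    Item("a", 20 * MB), Item("b", 20 * MB), Item("cached", 0),
    Item("c", 10 * MB), Item("d", 1 * MB),
]
groups = chunk_by_budget(items, max_items=10, max_bytes=TELEGRAM_MAX_GROUP_TOTAL_BYTES)
names = [[i.name for i in g] for g in groups]
small = chunk_by_budget([Item(str(i), 0) for i in range(5)], max_items=2, max_bytes=10)
```

Cached items cost nothing against the byte budget but still count toward the item limit. A group that ends up with a single item would need to go through a single-item send, since `sendMediaGroup` expects 2 to 10 items.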
def _truncate(text: str, limit: int, *, marker: str = "") -> str:
@@ -302,6 +383,7 @@ class TelegramClient:
payload["caption"] = _truncate(caption, TELEGRAM_MAX_CAPTION_LENGTH)
if reply_to_message_id is not None:
payload["reply_parameters"] = {"message_id": reply_to_message_id}
_apply_send_opts_to_payload(payload)
try:
async with self._session.post(
self._api_url(kind.api_method), json=payload, timeout=_API_TIMEOUT,
@@ -351,6 +433,7 @@ class TelegramClient:
f.add_field("caption", capped_caption)
if reply_to_message_id is not None:
f.add_field("reply_parameters", json.dumps({"message_id": reply_to_message_id}))
_apply_send_opts_to_form(f)
return f
for attempt in range(1, _TG_429_MAX_ATTEMPTS + 1):
@@ -415,18 +498,54 @@ class TelegramClient:
chunk_delay: int = 0,
max_asset_data_size: int | None = None,
send_large_photos_as_documents: bool = False,
send_large_videos_as_documents: bool = False,
chat_action: str | None = "typing",
*,
disable_notification: bool = False,
message_thread_id: int | None = None,
) -> NotificationResult:
if not assets:
return await self.send_message(
chat_id, caption or "", reply_to_message_id,
disable_web_page_preview, parse_mode,
disable_notification=disable_notification,
message_thread_id=message_thread_id,
)
keepalive: _KeepaliveHandle | None = None
if chat_action:
keepalive = self.start_chat_action_keepalive(chat_id, chat_action)
# Bind for the whole media-send fan-out — every internal helper
# (_send_photo / _send_video / _send_document / _send_media_group /
# _post_media_group / _send_from_cache / _upload_media) reads the
# current value when it constructs its request payload.
opts = _SendOptions(
disable_notification=disable_notification,
message_thread_id=message_thread_id,
)
with _bind_send_options(opts):
return await self._send_notification_body(
chat_id, assets, caption, reply_to_message_id, parse_mode,
max_group_size, chunk_delay, max_asset_data_size,
send_large_photos_as_documents, send_large_videos_as_documents,
keepalive,
)
async def _send_notification_body(
self,
chat_id: str,
assets: list[dict[str, Any]],
caption: str | None,
reply_to_message_id: int | None,
parse_mode: str,
max_group_size: int,
chunk_delay: int,
max_asset_data_size: int | None,
send_large_photos_as_documents: bool,
send_large_videos_as_documents: bool,
keepalive: _KeepaliveHandle | None,
) -> NotificationResult:
try:
if len(assets) == 1 and assets[0].get("type") == "photo":
return await self._send_photo(
@@ -443,6 +562,7 @@ class TelegramClient:
assets[0].get("content_type"), assets[0].get("cache_key"),
download_headers=assets[0].get("headers"),
preloaded_data=assets[0].get("data"),
send_large_videos_as_documents=send_large_videos_as_documents,
)
if len(assets) == 1 and assets[0].get("type", "document") == "document":
url = assets[0].get("url")
@@ -465,7 +585,7 @@ class TelegramClient:
return await self._send_media_group(
chat_id, assets, caption, reply_to_message_id, max_group_size,
chunk_delay, parse_mode, max_asset_data_size,
send_large_photos_as_documents,
send_large_photos_as_documents, send_large_videos_as_documents,
)
finally:
await self.stop_keepalive(keepalive)
@@ -477,6 +597,9 @@ class TelegramClient:
reply_to_message_id: int | None = None,
disable_web_page_preview: bool | None = None,
parse_mode: str = "HTML",
*,
disable_notification: bool = False,
message_thread_id: int | None = None,
) -> NotificationResult:
if not text:
_LOGGER.warning("send_message called with empty text — using placeholder")
@@ -490,7 +613,19 @@ class TelegramClient:
payload["reply_parameters"] = {"message_id": reply_to_message_id}
if disable_web_page_preview:
payload["link_preview_options"] = {"is_disabled": True}
# sendMessage is a leaf call — its kwargs go straight into the
# JSON body. The ContextVar pattern is reserved for the deeper
# media paths (``_upload_media`` / ``_post_media_group`` /
# ``_send_from_cache``) that can't easily plumb kwargs through.
if disable_notification:
payload["disable_notification"] = True
if message_thread_id is not None:
payload["message_thread_id"] = message_thread_id
return await self._post_send_message(payload)
async def _post_send_message(
self, payload: dict[str, Any],
) -> NotificationResult:
url = self._api_url("sendMessage")
try:
async with self._session.post(url, json=payload, timeout=_API_TIMEOUT) as response:
@@ -651,6 +786,7 @@ class TelegramClient:
max_asset_data_size: int | None = None, content_type: str | None = None,
cache_key: str | None = None, download_headers: dict[str, str] | None = None,
preloaded_data: bytes | None = None,
send_large_videos_as_documents: bool = False,
) -> NotificationResult:
if not url:
return {"success": False, "error": "Missing 'url' for video"}
@@ -672,6 +808,18 @@ class TelegramClient:
if max_asset_data_size is not None and len(data) > max_asset_data_size:
return {"success": False, "error": "Video exceeds size limit", "skipped": True}
if len(data) > TELEGRAM_MAX_VIDEO_SIZE:
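# Telegram's sendVideo hard-caps at 50 MB. When the operator opts
# in we deliver the bytes as a document instead of silently
# dropping the asset. Loses inline playback but preserves delivery.
# Caveat: the hosted Bot API documents the same 50 MB upload cap
# for sendDocument; the 2 GB ceiling applies only behind a
# self-hosted Bot API server, so the fallback is most useful there.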
if send_large_videos_as_documents:
filename = url.split("/")[-1].split("?")[0] or "video.mp4"
if "." not in filename:
filename = "video.mp4"
return await self._send_document(
chat_id, data, filename, caption, reply_to_message_id,
parse_mode, url, content_type, cache_key,
)
return {
"success": False,
"error": f"Video exceeds Telegram's {TELEGRAM_MAX_VIDEO_SIZE // (1024*1024)} MB limit",
@@ -723,6 +871,7 @@ class TelegramClient:
caption: str | None = None, reply_to_message_id: int | None = None,
max_group_size: int = 10, chunk_delay: int = 0, parse_mode: str = "HTML",
max_asset_data_size: int | None = None, send_large_photos_as_documents: bool = False,
send_large_videos_as_documents: bool = False,
) -> NotificationResult:
# Telegram rejects mixed photo/video + document in a single
# sendMediaGroup. Split before chunking so a malformed input
@@ -730,75 +879,293 @@ class TelegramClient:
partitions = self._partition_media_by_kind(assets)
all_message_ids: list[int] = []
first_chunk_overall = True
errors: list[dict[str, Any]] = []
delivered = 0
skipped = 0
failed = 0
first_send = True
# Oversized videos that the operator wants delivered as
# documents. Sent after all media-group chunks finish so
# they ride out on their own (Telegram refuses to mix
# documents with photo/video in one group).
deferred_documents: list[_MediaItem] = []
# Caption + reply_to are "spent" on the first send attempt,
# mirroring the prior contract. If that first attempt fails
# entirely (the group POST and the first item's per-item
# fallback), they're lost, same as before. Tracking these as
# standalone flags (rather than deriving from ``chunk_idx==0``)
# keeps the semantics right across multiple partitions.
caption_pending = bool(caption)
reply_pending = reply_to_message_id is not None
async def maybe_delay() -> None:
nonlocal first_send
if not first_send and chunk_delay > 0:
await asyncio.sleep(chunk_delay / 1000)
first_send = False
for partition in partitions:
chunks = [
partition[i:i + max_group_size]
for i in range(0, len(partition), max_group_size)
]
for chunk_idx, chunk in enumerate(chunks):
if not first_chunk_overall and chunk_delay > 0:
await asyncio.sleep(chunk_delay / 1000)
# Single-item chunk → use the simpler send_photo/video path.
if len(chunk) == 1:
item = chunk[0]
chunk_caption = caption if first_chunk_overall else None
chunk_reply = reply_to_message_id if first_chunk_overall else None
if item.get("type") == "photo":
result = await self._send_photo(
chat_id, item.get("url"), chunk_caption, chunk_reply, parse_mode,
max_asset_data_size, send_large_photos_as_documents,
item.get("content_type"), item.get("cache_key"),
download_headers=item.get("headers"),
preloaded_data=item.get("data"),
)
elif item.get("type") == "video":
result = await self._send_video(
chat_id, item.get("url"), chunk_caption, chunk_reply, parse_mode,
max_asset_data_size,
item.get("content_type"), item.get("cache_key"),
download_headers=item.get("headers"),
preloaded_data=item.get("data"),
)
else:
first_chunk_overall = False
continue
first_chunk_overall = False
if not result.get("success"):
result["failed_at_chunk"] = chunk_idx + 1
return result
if result.get("message_id") is not None:
all_message_ids.append(result["message_id"])
continue
items = await self._build_media_items(
chunk, max_asset_data_size, caption if first_chunk_overall else None,
parse_mode,
# Fetch + filter the parent chunk. Skipped items
# (oversized, bad photo, failed download) never enter
# ``items`` — count them so the operator-facing result
# reflects what actually went out vs got dropped.
# Oversized videos opted into doc-fallback get
# deferred — they're delivered (eventually) so they
# don't count as skipped.
items, chunk_deferred = await self._build_media_items(
chunk, max_asset_data_size, send_large_videos_as_documents,
)
deferred_documents.extend(chunk_deferred)
skipped += len(chunk) - len(items) - len(chunk_deferred)
if not items:
_LOGGER.warning(
"sendMediaGroup skipped — chunk %d/%d had %d input items but 0 usable (all filtered/failed)",
"sendMediaGroup: chunk %d/%d had %d input items but 0 usable",
chunk_idx + 1, len(chunks), len(chunk),
)
first_chunk_overall = False
continue
chunk_msg_ids, chunk_err = await self._post_media_group(
chat_id, items, reply_to_message_id if first_chunk_overall else None,
chunk_idx, len(chunks),
# Split the chunk into sub-chunks that each fit under
# Telegram's per-request byte cap. Per-item filtering
# alone can't prevent 413s when several legal-sized
# items together bust the envelope.
sub_chunks = self._split_items_by_byte_budget(
items, TELEGRAM_MAX_GROUP_TOTAL_BYTES,
)
first_chunk_overall = False
if chunk_err is not None:
return chunk_err
all_message_ids.extend(chunk_msg_ids)
if len(sub_chunks) > 1:
_LOGGER.info(
"sendMediaGroup: byte-budget split chunk %d/%d into %d sub-chunks",
chunk_idx + 1, len(chunks), len(sub_chunks),
)
if not all_message_ids:
_LOGGER.warning(
"sendMediaGroup completed with 0 message_ids — nothing was delivered",
for sub_items in sub_chunks:
await maybe_delay()
sub_caption = caption if caption_pending else None
sub_reply = reply_to_message_id if reply_pending else None
caption_pending = False
reply_pending = False
if sub_caption:
self._attach_caption_to_first(
sub_items, sub_caption, parse_mode,
)
msg_ids, err = await self._post_media_group(
chat_id, sub_items, sub_reply, chunk_idx, len(chunks),
)
if err is None:
all_message_ids.extend(msg_ids)
delivered += len(sub_items)
continue
# Telegram rejected the sub-chunk after our
# pre-flight passed (content / transient / rate).
# Try each item as its own message so partial
# delivery survives the chunk-level failure.
# Record the chunk-level cause first so the
# operator-visible ``errors`` list reads in
# cause-then-consequence order.
_LOGGER.warning(
"sendMediaGroup chunk %d/%d failed (%s) — falling back to per-item",
chunk_idx + 1, len(chunks), err.get("error"),
)
errors.append({
"kind": "chunk",
"chunk": chunk_idx + 1,
"error": err.get("error", "unknown"),
"code": err.get("error_code"),
})
for item_idx, item in enumerate(sub_items):
item_caption = sub_caption if item_idx == 0 else None
item_reply = sub_reply if item_idx == 0 else None
# No ``maybe_delay()`` here: per-item retries
# are a recovery path where added latency
# only widens the outage window — the
# individual sendPhoto/sendVideo calls have
# their own 429 backoff in ``_upload_media``.
item_result = await self._send_item_individually(
chat_id, item, item_caption, item_reply, parse_mode,
)
if item_result.get("success"):
delivered += 1
mid = item_result.get("message_id")
if mid is not None:
all_message_ids.append(mid)
else:
failed += 1
errors.append({
"kind": "item",
"chunk": chunk_idx + 1,
"item_index": item_idx,
"error": item_result.get("error", "unknown"),
})
# Deferred oversized-videos-as-documents: send each on its own
# via sendDocument. They couldn't ride in the media group
# because Telegram refuses to mix document with photo/video.
# Sending them one by one also means a failed item doesn't
# poison its siblings.
for deferred in deferred_documents:
await maybe_delay()
d_caption = caption if caption_pending else None
d_reply = reply_to_message_id if reply_pending else None
caption_pending = False
reply_pending = False
d_result = await self._send_item_individually(
chat_id, deferred, d_caption, d_reply, parse_mode,
)
return {"success": False, "error": "no_items_delivered"}
return {"success": True, "message_ids": all_message_ids}
if d_result.get("success"):
delivered += 1
mid = d_result.get("message_id")
if mid is not None:
all_message_ids.append(mid)
else:
failed += 1
errors.append({
"kind": "deferred_document",
"error": d_result.get("error", "unknown"),
})
if delivered == 0:
if skipped > 0 and not errors:
msg = f"all {skipped} item(s) filtered before send"
elif errors:
msg = errors[0].get("error", "no_items_delivered")
else:
msg = "no_items_delivered"
_LOGGER.warning(
"sendMediaGroup delivered 0 items (skipped=%d failed=%d)",
skipped, failed,
)
return {
"success": False,
"error": msg,
"message_ids": [],
"delivered_count": 0,
"skipped_count": skipped,
"failed_count": failed,
"errors": errors or None,
"failed_at_chunk": errors[0].get("chunk") if errors else None,
}
return {
"success": True,
"message_ids": all_message_ids,
"delivered_count": delivered,
"skipped_count": skipped,
"failed_count": failed,
"errors": errors or None,
}
@staticmethod
def _split_items_by_byte_budget(
items: list[_MediaItem], max_bytes: int,
) -> list[list[_MediaItem]]:
"""Greedy-pack ``items`` into sub-chunks under ``max_bytes`` each.
Cached items (``upload_bytes == 0``) are free and never force a
split. A single item that on its own exceeds the budget is
placed alone — letting Telegram return a precise error rather
than dropping it silently. Order is preserved so caption
attachment stays deterministic.
"""
if not items:
return []
groups: list[list[_MediaItem]] = []
current: list[_MediaItem] = []
current_size = 0
for item in items:
cost = item.upload_bytes
if current and current_size + cost > max_bytes:
groups.append(current)
current = []
current_size = 0
current.append(item)
current_size += cost
if current:
groups.append(current)
return groups
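To illustrate the greedy packing described in the docstring above, here is a small self-contained sketch. `Item` is a hypothetical stand-in for `_MediaItem` that carries only a name and an `upload_bytes` value, and the sizes are invented for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Item:  # hypothetical stand-in for _MediaItem
    name: str
    upload_bytes: int  # 0 for cache hits referenced by file_id


def split_by_byte_budget(items: List[Item], max_bytes: int) -> List[List[Item]]:
    """Greedy-pack items, in order, into groups of at most max_bytes each."""
    groups: List[List[Item]] = []
    current: List[Item] = []
    current_size = 0
    for item in items:
        cost = item.upload_bytes
        # Close the current group only if it is non-empty and would overflow.
        if current and current_size + cost > max_bytes:
            groups.append(current)
            current, current_size = [], 0
        current.append(item)
        current_size += cost
    if current:
        groups.append(current)
    return groups


MB = 1024 * 1024
items = [
    Item("a", 20 * MB),
    Item("cached", 0),      # free: never forces a split
    Item("b", 20 * MB),
    Item("c", 10 * MB),     # would push the first group to 50 MB
    Item("huge", 60 * MB),  # over budget on its own, so it is placed alone
]
packed = [[i.name for i in g] for g in split_by_byte_budget(items, 45 * MB)]
# packed == [["a", "cached", "b"], ["c"], ["huge"]]
```

Order is preserved, so the caption still lands on the first item of the first group.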
@staticmethod
def _attach_caption_to_first(
items: list[_MediaItem], caption: str, parse_mode: str,
) -> None:
"""Inject caption + parse_mode into the first item's media_json.
Telegram displays the caption of the first media-group item; the
rest are ignored. Idempotent — re-attaching simply overwrites.
"""
if not items:
return
items[0].media_json["caption"] = _truncate(caption, TELEGRAM_MAX_CAPTION_LENGTH)
items[0].media_json["parse_mode"] = parse_mode
async def _send_item_individually(
self, chat_id: str, item: _MediaItem,
caption: str | None, reply_to_message_id: int | None,
parse_mode: str,
) -> NotificationResult:
"""Send one ``_MediaItem`` as a standalone sendPhoto/sendVideo/sendDocument.
Used as the per-item fallback when sendMediaGroup itself
rejects a sub-chunk after pre-flight passed. Reuses already-
fetched bytes for fresh items; for cache-hit items that fail
the file_id POST, re-downloads from ``source_url`` so a stale
``file_id`` doesn't silently lose an asset — the original
single-item path does the same recovery.
"""
media_type = item.media_json.get("type") or "photo"
if media_type == "photo":
kind = _PHOTO_KIND
elif media_type == "video":
kind = _VIDEO_KIND
else:
kind = _DOCUMENT_KIND
cache: TelegramFileCache | None = None
cache_key: str | None = None
thumbhash: str | None = None
if item.cache_info is not None:
ck, _ck_type, ck_thumb, _ck_size = item.cache_info
cache = self._get_cache_for_key(ck)
cache_key = ck
thumbhash = ck_thumb
# Cached items have no attachment bytes — POST the file_id
# reference first; if that fails transiently, re-download via
# source_url and upload fresh. This matches what _send_photo /
# _send_video do for their cache path.
if item.attachment is None:
file_id = item.media_json.get("media", "")
if file_id and not file_id.startswith("attach://"):
cached_result = await self._send_from_cache(
kind, chat_id, file_id, caption, reply_to_message_id, parse_mode,
)
if cached_result is not None:
return cached_result
if not item.source_url:
return {"success": False, "error": "Cached fallback send failed (no source URL)"}
data, err = await self._safe_get(
self._resolve_url(item.source_url), item.download_headers,
)
if data is None:
return {"success": False, "error": f"Re-download failed: {err}"}
return await self._upload_media(
kind, chat_id, data,
kind.default_filename, kind.default_content_type,
caption, reply_to_message_id, parse_mode,
cache, cache_key, thumbhash,
)
_, data, filename, content_type = item.attachment
return await self._upload_media(
kind, chat_id, data, filename, content_type,
caption, reply_to_message_id, parse_mode,
cache, cache_key, thumbhash,
)
@staticmethod
def _partition_media_by_kind(
@@ -830,23 +1197,40 @@ class TelegramClient:
self,
chunk: list[dict[str, Any]],
max_asset_data_size: int | None,
first_caption: str | None,
parse_mode: str,
) -> list[_MediaItem]:
send_large_videos_as_documents: bool = False,
) -> tuple[list[_MediaItem], list[_MediaItem]]:
"""Fetch + filter a chunk and return aligned media-group items.
Returns ``(items, deferred_documents)`` — ``items`` go into
sendMediaGroup, ``deferred_documents`` are oversized videos
retagged as documents (when the caller opted in) that will be
sent individually via ``_send_item_individually`` *after* the
group sends. Telegram rejects mixing documents with photo/video
in one group, so they have to ride out separately.
Concurrency is bounded by ``_MEDIA_FETCH_CONCURRENCY`` so peak
memory stays predictable. Per-fetch exceptions are isolated via
``return_exceptions=True`` so a single failed download cannot
cancel its peers.
Caption injection is intentionally NOT performed here — callers
attach the caption after byte-budget sub-splitting so it lands
on the first item of the first delivered sub-chunk.
"""
sem = asyncio.Semaphore(_MEDIA_FETCH_CONCURRENCY)
async def fetch(idx: int, item: dict[str, Any]) -> tuple[int, dict | None, bytes | None]:
async def fetch(
idx: int, item: dict[str, Any],
) -> tuple[int, dict | None, bytes | None, bool]:
"""Returns ``(idx, cached_entry, data, defer_as_document)``.
``defer_as_document=True`` signals "video bytes valid but
too big for sendVideo — caller should send as document".
"""
url = item.get("url")
if not url:
_LOGGER.warning("Media skipped: missing url (idx=%d type=%s)", idx, item.get("type"))
return idx, None, None
return idx, None, None, False
media_type = item.get("type", "photo")
custom_cache_key = item.get("cache_key")
@@ -860,7 +1244,7 @@ class TelegramClient:
)
cached = item_cache.get(ck, thumbhash=item_thumbhash) if item_cache else None
if cached and cached.get("file_id"):
return idx, cached, None
return idx, cached, None, False
preloaded = item.get("data")
data: bytes | None
@@ -874,34 +1258,40 @@ class TelegramClient:
"Media skipped: download failed (idx=%d type=%s): %s",
idx, media_type, err,
)
return idx, None, None
return idx, None, None, False
if max_asset_data_size and len(data) > max_asset_data_size:
_LOGGER.warning(
"Media skipped: size %d exceeds max_asset_data_size %d (idx=%d type=%s)",
len(data), max_asset_data_size, idx, media_type,
)
return idx, None, None
return idx, None, None, False
if media_type == "video" and len(data) > TELEGRAM_MAX_VIDEO_SIZE:
if send_large_videos_as_documents:
_LOGGER.info(
"Video %d bytes over Telegram limit (idx=%d) — deferring as document",
len(data), idx,
)
return idx, None, data, True
_LOGGER.warning(
"Media skipped: video %d bytes exceeds Telegram limit %d (idx=%d)",
len(data), TELEGRAM_MAX_VIDEO_SIZE, idx,
)
return idx, None, None
return idx, None, None, False
if media_type == "photo":
exceeds, reason, _, _ = check_photo_limits(data)
if exceeds:
_LOGGER.warning(
"Media skipped: photo %s (idx=%d)", reason, idx,
)
return idx, None, None
return idx, None, data
return idx, None, None, False
return idx, None, data, False
raw = await asyncio.gather(
*(fetch(i, item) for i, item in enumerate(chunk)),
return_exceptions=True,
)
results: list[tuple[int, dict | None, bytes | None]] = []
results: list[tuple[int, dict | None, bytes | None, bool]] = []
for entry in raw:
if isinstance(entry, Exception):
_LOGGER.warning("Media fetch raised: %s", redact_exc(entry))
@@ -909,8 +1299,9 @@ class TelegramClient:
results.append(entry)
items: list[_MediaItem] = []
deferred_documents: list[_MediaItem] = []
upload_idx = 0
for idx, cached_entry, data in results:
for idx, cached_entry, data, defer_as_document in results:
item = chunk[idx]
url = item.get("url")
if not url:
@@ -918,6 +1309,35 @@ class TelegramClient:
media_type = item.get("type") or "photo"
custom_cache_key = item.get("cache_key")
# Deferred videos-as-documents are NEVER cache hits (the
# cache lookup branch returns early before the size check),
# so we always have fresh bytes here. Retag the
# media_json so ``_send_item_individually`` routes via
# ``_DOCUMENT_KIND`` to /sendDocument.
if defer_as_document and data is not None:
ct = item.get("content_type") or "video/mp4"
# Best-effort filename preserves the original
# extension so Telegram clients give it a sensible
# icon and the recipient can re-open it.
fname = url.split("/")[-1].split("?")[0] or "video.mp4"
if "." not in fname:
fname = "video.mp4"
ck = custom_cache_key or extract_asset_id_from_url(url) or url
ck_is_asset = is_asset_cache_key(ck)
bare_ck = asset_id_from_cache_key(ck) if ck_is_asset else ck
th = (
self._thumbhash_resolver(bare_ck)
if ck_is_asset and self._thumbhash_resolver else None
)
deferred_documents.append(_MediaItem(
media_json={"type": "document", "media": "attach://deferred"},
cache_info=(ck, "document", th, len(data)),
attachment=("deferred", data, fname, ct),
source_url=url,
download_headers=item.get("headers"),
))
continue
if cached_entry and cached_entry.get("file_id"):
mij: dict[str, Any] = {"type": media_type, "media": cached_entry["file_id"]}
cache_info: tuple[str, str, str | None, int] | None = None
@@ -940,14 +1360,14 @@ class TelegramClient:
else:
continue
if first_caption and not items:
# Only the first usable item in the first chunk receives
# the caption, per Telegram's media-group semantics.
mij["caption"] = _truncate(first_caption, TELEGRAM_MAX_CAPTION_LENGTH)
mij["parse_mode"] = parse_mode
items.append(_MediaItem(media_json=mij, cache_info=cache_info, attachment=attachment))
return items
items.append(_MediaItem(
media_json=mij,
cache_info=cache_info,
attachment=attachment,
source_url=url,
download_headers=item.get("headers"),
))
return items, deferred_documents
async def _post_media_group(
self,
@@ -973,6 +1393,7 @@ class TelegramClient:
for name, payload, filename, ct in attachments:
f.add_field(name, payload, filename=filename, content_type=ct)
f.add_field("media", json.dumps(media_json))
_apply_send_opts_to_form(f)
return f
for attempt in range(1, _TG_429_MAX_ATTEMPTS + 1):
@@ -13,6 +13,11 @@ _LOGGER = logging.getLogger(__name__)
TELEGRAM_API_BASE_URL: Final = "https://api.telegram.org/bot"
TELEGRAM_MAX_PHOTO_SIZE: Final = 10 * 1024 * 1024 # 10 MB
TELEGRAM_MAX_VIDEO_SIZE: Final = 50 * 1024 * 1024 # 50 MB
# Telegram's sendMediaGroup envelope tops out near 50 MB total (multipart
# bytes including form overhead). 45 MB keeps a safety margin so we don't
# eat 413s when the per-item budget admits items that, summed, would
# bust Telegram's request cap.
TELEGRAM_MAX_GROUP_TOTAL_BYTES: Final = 45 * 1024 * 1024 # 45 MB
TELEGRAM_MAX_DIMENSION_SUM: Final = 10000
# Telegram message-text limit (sendMessage) and caption limit
# (sendPhoto/sendVideo/sendDocument/first item of sendMediaGroup).
@@ -126,36 +131,6 @@ def build_telegram_asset_entry(
return entry
def split_media_by_upload_size(
media_items: list[tuple], max_upload_size: int
) -> list[list[tuple]]:
"""Split media items into sub-groups respecting upload size limit."""
if not media_items:
return []
groups: list[list[tuple]] = []
current_group: list[tuple] = []
current_size = 0
for item in media_items:
media_ref = item[1]
is_cached = item[4]
item_size = 0 if is_cached else (len(media_ref) if isinstance(media_ref, bytes) else 0)
if current_group and current_size + item_size > max_upload_size:
groups.append(current_group)
current_group = []
current_size = 0
current_group.append(item)
current_size += item_size
if current_group:
groups.append(current_group)
return groups
def check_photo_limits(
data: bytes,
) -> tuple[bool, str | None, int | None, int | None]:
@@ -315,6 +315,63 @@ async def clear_telegram_cache(
return result
class DiagnosticActivateBody(BaseModel):
module: str
duration_minutes: int = 30
@router.get("/diagnostic-mode")
async def list_diagnostic_overrides(
user: User = Depends(require_admin),
):
"""List currently-active temporary DEBUG overrides + their countdown.
Drives the dashboard panel that lets admins toggle a module to DEBUG
for a bounded window with auto-revert.
"""
from ..services.diagnostic_mode import list_active
return {"active": list_active()}
@router.post("/diagnostic-mode")
async def activate_diagnostic_override(
body: DiagnosticActivateBody,
user: User = Depends(require_admin),
):
"""Flip ``module`` to DEBUG and schedule an auto-revert.
Re-activating an already-active module replaces the prior schedule.
Returns the new entry shape so the UI can render countdown without
a follow-up GET. The service module reads the current ``log_levels``
setting at activation and at revert so an admin who edits overrides
mid-window doesn't see a stale baseline restored.
"""
from ..services.diagnostic_mode import set_diagnostic
try:
entry = await set_diagnostic(body.module, body.duration_minutes)
except ValueError as err:
raise HTTPException(status_code=400, detail=str(err)) from err
return entry
@router.delete("/diagnostic-mode/{module:path}")
async def revert_diagnostic_override(
module: str,
user: User = Depends(require_admin),
):
"""Manually revert a single module before its window ends.
Returns 404 when no override was active so the caller can fall through
to a friendly "nothing to revert" UX without parsing booleans.
"""
from ..services.diagnostic_mode import revert_diagnostic
if not await revert_diagnostic(module):
raise HTTPException(
status_code=404, detail=f"No active override for {module!r}",
)
return {"reverted": module}
@router.get("/locales")
async def get_supported_locales(
user: User = Depends(get_current_user),
@@ -13,6 +13,7 @@ from jinja2.sandbox import SandboxedEnvironment
from sqlmodel import select
from sqlmodel.ext.asyncio.session import AsyncSession
from notify_bridge_core.log_context import enrich_details_with_correlation
from notify_bridge_core.notifications.telegram.client import TelegramClient
from ..database.engine import get_engine
from ..database.models import (
@@ -347,7 +348,7 @@ async def _log_command_event(
collection_id=str(chat_id),
collection_name=_format_command_subject(cmd, args),
assets_count=media_total,
details=details,
details=enrich_details_with_correlation(details),
))
await session.commit()
except Exception: # noqa: BLE001 — diagnostic only, never block reply
@@ -1,6 +1,7 @@
"""Notify Bridge Server — FastAPI application entry point."""
import logging
import uuid
from contextlib import asynccontextmanager
from fastapi import FastAPI
@@ -8,6 +9,11 @@ from fastapi.middleware.cors import CORSMiddleware
from slowapi import _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.middleware import SlowAPIMiddleware
from starlette.middleware.base import BaseHTTPMiddleware, RequestResponseEndpoint
from starlette.requests import Request as StarletteRequest
from starlette.responses import Response as StarletteResponse
from notify_bridge_core.log_context import bind_log_context
from .config import settings as _log_cfg
from .logging_setup import setup_logging
@@ -163,6 +169,16 @@ async def lifespan(app: FastAPI):
_READY = False
from .services.ha_subscription import stop_all as stop_ha_subscriptions
await stop_ha_subscriptions()
# Restore the DB-configured baseline level for any temporary DEBUG
# overrides before the engine is disposed, so a graceful shutdown
# leaves logger state tidy. Overrides are in-memory only and
# setup_logging() resets them at boot anyway, but reverting
# explicitly here is cheaper than relying on a re-init.
from .services.diagnostic_mode import revert_all as revert_diagnostics
try:
await revert_diagnostics()
except Exception: # pragma: no cover — never block shutdown on this.
_LOGGER.exception("Failed to revert diagnostic overrides during shutdown")
scheduler = get_scheduler()
if scheduler.running:
scheduler.shutdown(wait=True)
@@ -178,9 +194,55 @@ _APP_VERSION = _resolve_version()
app = FastAPI(title="Notify Bridge", version=_APP_VERSION, lifespan=lifespan)
# --- Security headers ---
from starlette.middleware.base import BaseHTTPMiddleware
from starlette.requests import Request as StarletteRequest
from starlette.responses import Response as StarletteResponse
# Bounded character set for accepted inbound X-Request-Id values. Anything
# outside this is replaced with a server-generated id so a malicious header
# can't smuggle CR/LF into log lines or break grep-by-field parsing.
# ``:`` is intentionally excluded so an inbound value can't masquerade as a
# server-minted ``disp:<hex>`` / ``req:<hex>`` id and confuse operator greps.
_REQUEST_ID_MAX_LEN = 64
_REQUEST_ID_ALLOWED = set(
"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_"
)
def _normalize_request_id(raw: str | None) -> str:
if not raw:
return f"req:{uuid.uuid4().hex[:12]}"
raw = raw.strip()
if not raw or len(raw) > _REQUEST_ID_MAX_LEN:
return f"req:{uuid.uuid4().hex[:12]}"
if not all(c in _REQUEST_ID_ALLOWED for c in raw):
return f"req:{uuid.uuid4().hex[:12]}"
return raw
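The normalization rules above can be exercised in isolation. The sketch below re-implements them under the same limits (64 characters; letters, digits, `-` and `_`) purely for illustration. The names `normalize_request_id`, `MAX_LEN` and `ALLOWED` are stand-ins, and the sample inputs are invented.

```python
import string
import uuid
from typing import Optional

MAX_LEN = 64
ALLOWED = set(string.ascii_letters + string.digits + "-_")


def normalize_request_id(raw: Optional[str]) -> str:
    """Keep a well-formed inbound id; otherwise mint a fresh req:<12 hex> id."""
    candidate = (raw or "").strip()
    if candidate and len(candidate) <= MAX_LEN and all(c in ALLOWED for c in candidate):
        return candidate
    return f"req:{uuid.uuid4().hex[:12]}"


kept = normalize_request_id("  upstream-abc_123  ")    # surrounding whitespace stripped
injected = normalize_request_id("evil\r\nX-Admin: 1")  # CR/LF rejected
spoofed = normalize_request_id("req:deadbeef0000")     # ':' is not in the allowed set
too_long = normalize_request_id("a" * 65)              # over the length cap
missing = normalize_request_id(None)                   # no header at all
```

In this sketch only `kept` survives as sent. The other four are replaced with server-minted ids, so an inbound value can never carry the `req:` prefix.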
class RequestContextMiddleware(BaseHTTPMiddleware):
"""Bind a per-request ``request_id`` ContextVar and echo it back.
Reads ``X-Request-Id`` from the inbound request (so an upstream proxy
with its own correlation system can propagate its id), falling back to
a short random ``req:<12 hex>`` value. Always sets the same id on the
response ``X-Request-Id`` header so the SPA can surface it for
operator-friendly bug reports.
Bound via :func:`bind_log_context` so the id appears on every log line
emitted during request handling (``[req=...]``) and is picked up by
:func:`notify_bridge_core.log_context.enrich_details_with_correlation`
when an ``EventLog`` row is written during the same request.
"""
async def dispatch(
self,
request: StarletteRequest,
call_next: RequestResponseEndpoint,
) -> StarletteResponse:
req_id = _normalize_request_id(request.headers.get("x-request-id"))
with bind_log_context(request_id=req_id):
response: StarletteResponse = await call_next(request)
response.headers["X-Request-Id"] = req_id
return response
_CSP = (
@@ -238,6 +300,12 @@ app.add_middleware(
allow_headers=["*"],
)
# Request-ID middleware is added LAST so it becomes the outermost wrapper —
# every other middleware (CORS, rate limit, security headers) then logs with
# the request_id already bound, and CORS preflight responses also carry the
# X-Request-Id echo header.
app.add_middleware(RequestContextMiddleware)
# Register routes — static paths before parameterized
app.include_router(auth_router)
app.include_router(template_vars_router)
@@ -9,6 +9,11 @@ from typing import Any
from sqlmodel import select
from sqlmodel.ext.asyncio.session import AsyncSession
from notify_bridge_core.log_context import (
bind_log_context,
ensure_dispatch_id,
enrich_details_with_correlation,
)
from notify_bridge_core.providers.action_executor import ActionResult
from ..database.engine import get_engine
@@ -27,6 +32,15 @@ async def run_action(
action_id: int, *, trigger: str = "scheduled"
) -> ActionResult:
"""Load an action from DB, execute it, and save the execution log."""
# One dispatch_id per action run so the EventLog row (and any inner log
# lines emitted by the action executor) share a correlation id.
with bind_log_context(dispatch_id=ensure_dispatch_id()):
return await _run_action_impl(action_id, trigger=trigger)
async def _run_action_impl(
action_id: int, *, trigger: str = "scheduled"
) -> ActionResult:
engine = get_engine()
# ------------------------------------------------------------------
@@ -142,7 +156,7 @@ async def run_action(
# without a separate action_name renderer.
collection_name=action.name,
assets_count=action_result.total_items_affected,
details={
details=enrich_details_with_correlation({
"action_type": action.action_type,
"trigger": trigger,
"rules_processed": action_result.rules_processed,
@@ -150,7 +164,7 @@ async def run_action(
"rules_failed": action_result.rules_failed,
"error": action_result.error or "",
"execution_id": execution_id,
},
}),
))
await session.commit()
@@ -33,6 +33,11 @@ from sqlalchemy.orm.attributes import flag_modified
from sqlmodel import select
from sqlmodel.ext.asyncio.session import AsyncSession
from notify_bridge_core.log_context import (
bind_log_context,
ensure_dispatch_id,
enrich_details_with_correlation,
)
from notify_bridge_core.models.events import EventType, ServiceEvent
from notify_bridge_core.models.media import MediaAsset, MediaType
from notify_bridge_core.notifications.dispatcher import (
@@ -56,6 +61,7 @@ from .dispatch_helpers import (
load_link_data,
resolve_provider_credential,
)
from .dispatch_summary import summarize_dispatch_results
_LOGGER = logging.getLogger(__name__)
@@ -616,12 +622,12 @@ async def _mark_dropped(
collection_name=payload.get("collection_name", ""),
assets_count=int(payload.get("added_count", 0))
or int(payload.get("removed_count", 0)),
details={
details=enrich_details_with_correlation({
"dispatch_status": "deferred_then_dropped",
"reason": reason,
"original_event_log_id": row.event_log_id,
"provider_type": payload.get("provider_type", ""),
},
}),
))
@@ -644,6 +650,28 @@ async def _process_row(
entry produces its own target_config so a broadcast deferred row fans
out to all current children at drain time.
"""
# Bind a fresh dispatch_id per drained row so the EventLog rows written
# by the success/drop paths AND the inner dispatcher's log lines share
# one id. Each deferred row is a logically separate dispatch attempt.
with bind_log_context(dispatch_id=ensure_dispatch_id()):
await _process_row_impl(
session, row, tracker, provider_id, provider_name,
provider_config, app_tz, link_by_id, dispatcher, stats,
)
async def _process_row_impl(
session: AsyncSession,
row: DeferredDispatch,
tracker: NotificationTracker,
provider_id: int,
provider_name: str,
provider_config: dict[str, Any],
app_tz: str,
link_by_id: dict[int, list[dict[str, Any]]],
dispatcher: NotificationDispatcher,
stats: dict[str, int],
) -> None:
expanded = link_by_id.get(row.link_id)
if not expanded:
# Link removed/disabled between defer and drain.
@@ -735,6 +763,8 @@ async def _process_row(
row.fired_at = datetime.now(timezone.utc)
session.add(row)
summary = summarize_dispatch_results(results)
if success:
stats["fired"] += 1
session.add(EventLog(
@@ -747,14 +777,15 @@ async def _process_row(
collection_id=row.collection_id,
collection_name=event.collection_name,
assets_count=event.added_count or event.removed_count or 0,
details={
details=enrich_details_with_correlation({
"dispatch_status": "delivered_after_quiet_hours",
"original_event_log_id": row.event_log_id,
"deferred_for_seconds": int(
(row.fired_at - row.created_at).total_seconds()
),
"provider_type": event.provider_type.value,
},
"dispatch_summary": summary,
}),
))
else:
stats["dropped"] += 1
@@ -769,12 +800,13 @@ async def _process_row(
collection_id=row.collection_id,
collection_name=event.collection_name,
assets_count=event.added_count or event.removed_count or 0,
details={
details=enrich_details_with_correlation({
"dispatch_status": "deferred_then_failed",
"reason": str(first_err)[:200],
"original_event_log_id": row.event_log_id,
"provider_type": event.provider_type.value,
},
"dispatch_summary": summary,
}),
))
@@ -0,0 +1,381 @@
"""Temporary per-module DEBUG overrides with auto-revert.
The runtime ``apply_log_levels()`` API in ``logging_setup`` already lets
admins flip a module to DEBUG, but the existing path requires editing the
``log_levels`` DB setting and remembering to revert it. Operators end up
either forgetting (leaving DEBUG-flooded logs in production) or never
turning it on (debugging through stderr only).
This module gives the dashboard a cheap toggle: "give me DEBUG for
``notify_bridge_core.notifications.telegram.client`` for 30 minutes".
It applies the level immediately, then schedules a one-shot job at
``now + 30 min`` that reverts to whatever level that module would
normally have under the current DB-configured ``log_levels``.
State is in-memory only. A server restart wipes every active override,
which is the right semantic: ``setup_logging`` re-applies the
DB-configured baseline at boot, so a forgotten override can never
silently carry across a deploy. The lifespan shutdown also calls
:func:`revert_all` to cleanly restore baselines before the process
exits — useful for hot-reload dev loops where the server restarts in
place.
"""
from __future__ import annotations
import asyncio
import logging
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Any
from sqlmodel.ext.asyncio.session import AsyncSession
from ..database.engine import get_engine
from ..logging_setup import (
_NOISY_LIBRARY_DEFAULTS,
parse_level_overrides,
)
_LOGGER = logging.getLogger(__name__)
# Limits picked to match what "an operator clicked this button" looks like.
# One minute is enough to reproduce a single failing dispatch; four hours is
# long enough for a slow-rolling incident without risking a forgotten
# override outliving a workday.
_MIN_DURATION_MINUTES = 1
_MAX_DURATION_MINUTES = 240
# Allowlist of module namespaces an operator can flip. Lets us catch typos
# and blocks ``""`` (root) — flipping the root logger to DEBUG floods
# stderr with stuff the operator probably didn't want (boto3, jinja2,
# every dependency). Anything matching is accepted, anything else is
# rejected with a 400.
_ALLOWED_PREFIXES = (
    "notify_bridge_core",
    "notify_bridge_server",
    "sqlalchemy",
    "aiohttp",
    "apscheduler",
    "urllib3",
    "httpx",
    "httpcore",
    "asyncio",
    "PIL",
    "uvicorn",
    "starlette",
    "fastapi",
)
@dataclass(frozen=True)
class _Override:
    """One active DEBUG override.

    ``baseline_level`` is what the module had at activation time; it is
    used for the dashboard's "→ WARNING" display. The actual revert path
    re-reads the current DB-configured ``log_levels`` so a setting change
    made *while* the override is active is honored at expiry.
    """

    module: str
    baseline_level: str
    activated_at: datetime
    expires_at: datetime
# Module name → active override. Mutated only from the asyncio thread.
_active: dict[str, _Override] = {}
# Strong references for background tasks created via the asyncio-timer
# fallback path. CPython's event loop holds only weak refs, so a task
# without an external retainer can be GC'd before it fires. Tasks are
# discarded automatically when they complete.
_bg_tasks: set[asyncio.Task[None]] = set()
def _is_allowed(module: str) -> bool:
    if not module:
        return False
    return any(module == p or module.startswith(p + ".") for p in _ALLOWED_PREFIXES)
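The allowlist check matches on dotted boundaries, not raw string prefixes. A small self-contained sketch of the same rule, using a shortened prefix tuple chosen only for illustration:

```python
# Shortened, illustrative subset of the allowlist.
ALLOWED_PREFIXES = ("notify_bridge_core", "sqlalchemy", "httpx")


def is_allowed(module: str) -> bool:
    """Accept an allowlisted namespace or any dotted child of one."""
    if not module:
        return False  # the empty name is the root logger, always rejected
    return any(
        module == prefix or module.startswith(prefix + ".")
        for prefix in ALLOWED_PREFIXES
    )


checks = {
    name: is_allowed(name)
    for name in ("", "sqlalchemy", "sqlalchemy.engine", "sqlalchemyx", "boto3")
}
```

Appending `"."` before the `startswith` test is what keeps a look-alike such as `sqlalchemyx` from slipping through.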
def _normalize_level_name(lvl: int) -> str:
    """Return a canonical string for a logging level code."""
    name = logging.getLevelName(lvl)
    if isinstance(name, str) and name and not name.startswith("Level "):
        return name
    return "INFO"
def _walk_dotted(name: str) -> list[str]:
    """Return ``name`` followed by progressively shorter dotted prefixes.

    ``"sqlalchemy.engine.Engine"`` →
    ``["sqlalchemy.engine.Engine", "sqlalchemy.engine", "sqlalchemy"]``.

    Mirrors Python's logger-hierarchy traversal so a sub-logger inherits
    its parent's override / noisy default rather than falling through to
    the root level.
    """
    out = [name]
    while "." in name:
        name = name.rsplit(".", 1)[0]
        out.append(name)
    return out


def _baseline_for(module: str, db_log_levels: str | None) -> str:
    """The level ``module`` would have if no diagnostic override were active.

    Precedence per dotted-parent walk:
    1. Explicit DB ``log_levels`` entry (most specific wins).
    2. Curated noisy-library default in ``_NOISY_LIBRARY_DEFAULTS``.
    3. Root logger effective level.
    """
    overrides = parse_level_overrides(db_log_levels or "")
    for candidate in _walk_dotted(module):
        if candidate in overrides:
            return overrides[candidate]
        if candidate in _NOISY_LIBRARY_DEFAULTS:
            return _NOISY_LIBRARY_DEFAULTS[candidate]
    root_level = logging.getLogger().getEffectiveLevel()
    return _normalize_level_name(root_level)
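The baseline lookup can be traced with a standalone sketch. It assumes the overrides have already been parsed into a dict and uses a made-up noisy-default table; `NOISY_DEFAULTS` and the `root_level` parameter are illustrative stand-ins for `_NOISY_LIBRARY_DEFAULTS` and the root logger's effective level:

```python
from __future__ import annotations

# Hypothetical noisy-library defaults, for illustration only.
NOISY_DEFAULTS = {"sqlalchemy": "WARNING"}


def walk_dotted(name: str) -> list[str]:
    """Return ``name`` and each shorter dotted prefix, most specific first."""
    out = [name]
    while "." in name:
        name = name.rsplit(".", 1)[0]
        out.append(name)
    return out


def baseline_for(
    module: str, overrides: dict[str, str], root_level: str = "INFO",
) -> str:
    """Resolve the level a module falls back to once an override ends."""
    for candidate in walk_dotted(module):
        if candidate in overrides:
            return overrides[candidate]       # explicit setting wins
        if candidate in NOISY_DEFAULTS:
            return NOISY_DEFAULTS[candidate]  # then the curated default
    return root_level                         # nothing matched on the walk


path = walk_dotted("sqlalchemy.engine.Engine")
inherited = baseline_for("sqlalchemy.engine.Engine", {})
explicit = baseline_for("sqlalchemy.engine.Engine", {"sqlalchemy.engine": "ERROR"})
fallback = baseline_for("notify_bridge_core.dispatch", {})
```

Because both checks run for each candidate before moving to the parent, a more specific entry of either kind is found before a less specific one.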
async def _read_db_log_levels() -> str:
    """Snapshot the current ``log_levels`` setting in a short-lived session.

    Called at activation AND at revert time so the revert reflects any
    setting change made while the override was active. Best-effort: a
    DB hiccup degrades to empty (no DB overrides), which makes the
    revert use noisy-library defaults. That is safer than crashing the
    timer.
    """
    try:
        from ..api.app_settings import get_setting
        async with AsyncSession(get_engine()) as session:
            return await get_setting(session, "log_levels") or ""
    except Exception:  # noqa: BLE001
        _LOGGER.debug(
            "diagnostic_mode: failed to read log_levels from DB; "
            "revert will use noisy-library defaults",
            exc_info=True,
        )
        return ""
def list_active() -> list[dict[str, Any]]:
    """Snapshot the currently active overrides for the dashboard.

    Also sweeps any entry whose ``expires_at`` is in the past, which
    protects against a scheduler misfire that left a ghost row in
    ``_active``.
    """
    now = datetime.now(timezone.utc)
    out: list[dict[str, Any]] = []
    expired: list[str] = []
    for module, ov in _active.items():
        if ov.expires_at <= now:
            expired.append(module)
            continue
        out.append({
            "module": ov.module,
            "baseline_level": ov.baseline_level,
            "current_level": "DEBUG",
            "activated_at": ov.activated_at.isoformat(),
            "expires_at": ov.expires_at.isoformat(),
            "remaining_seconds": int((ov.expires_at - now).total_seconds()),
        })
    for module in expired:
        _active.pop(module, None)
    return out


def is_active(module: str) -> bool:
    ov = _active.get(module)
    if ov is None:
        return False
    return ov.expires_at > datetime.now(timezone.utc)
async def set_diagnostic(
    module: str,
    duration_minutes: int,
) -> dict[str, Any]:
    """Activate a DEBUG override for ``module`` lasting ``duration_minutes``.

    Re-activating an already-active module replaces the prior schedule
    (a clicked-twice button resets the window rather than stacking).
    Returns the dashboard-ready dict; raises ``ValueError`` on bad input
    so the API layer can surface a 400 with a precise message.
    """
    if not _is_allowed(module):
        raise ValueError(
            f"Module {module!r} is not in the diagnostic allowlist",
        )
    if not (_MIN_DURATION_MINUTES <= duration_minutes <= _MAX_DURATION_MINUTES):
        raise ValueError(
            f"duration_minutes must be between {_MIN_DURATION_MINUTES} and "
            f"{_MAX_DURATION_MINUTES}",
        )

    db_log_levels = await _read_db_log_levels()
    baseline = _baseline_for(module, db_log_levels)
    now = datetime.now(timezone.utc)
    expires_at = now + timedelta(minutes=duration_minutes)

    # Apply DEBUG immediately. ``logging.getLogger(name).setLevel`` is the
    # same primitive ``apply_log_levels`` uses, so the two mechanisms stay
    # consistent.
    logging.getLogger(module).setLevel("DEBUG")

    # Replace any prior schedule for this module before recording the new one.
    _remove_scheduled(module)
    _active[module] = _Override(
        module=module,
        baseline_level=baseline,
        activated_at=now,
        expires_at=expires_at,
    )
    _schedule_revert(module, expires_at)
    _LOGGER.info(
        "Diagnostic mode: %s set to DEBUG (was %s) for %d min, expires at %s",
        module, baseline, duration_minutes, expires_at.isoformat(),
    )
    return {
        "module": module,
        "baseline_level": baseline,
        "current_level": "DEBUG",
        "activated_at": now.isoformat(),
        "expires_at": expires_at.isoformat(),
        "remaining_seconds": int((expires_at - now).total_seconds()),
    }
async def revert_diagnostic(module: str) -> bool:
    """Immediately end the override for ``module``. Returns ``False`` if
    no override was active (so callers can return a 404)."""
    ov = _active.pop(module, None)
    if ov is None:
        return False
    _remove_scheduled(module)
    db_log_levels = await _read_db_log_levels()
    target = _baseline_for(module, db_log_levels)
    logging.getLogger(module).setLevel(target)
    _LOGGER.info(
        "Diagnostic mode: %s reverted from DEBUG back to %s (manual)",
        module, target,
    )
    return True


async def revert_all() -> int:
    """Revert every active override. Wired into the lifespan shutdown so a
    server stop / hot-reload leaves the world in a clean state. Also
    callable from a debug endpoint if we ever add one."""
    count = 0
    for module in list(_active.keys()):
        if await revert_diagnostic(module):
            count += 1
    return count
# ---------------------------------------------------------------------------
# APScheduler glue — wired here so the API layer doesn't import scheduler.
# ---------------------------------------------------------------------------
_JOB_PREFIX = "diag_revert::"
def _job_id_for(module: str) -> str:
    return _JOB_PREFIX + module


def _remove_scheduled(module: str) -> None:
    """Drop a previously-scheduled revert job for ``module``, if any.

    Best-effort: scheduler isn't always available in tests; a missing job
    is the normal path on first-time activation. Logged at DEBUG so an
    operator chasing a scheduler problem still sees the trail.
    """
    try:
        from .scheduler import get_scheduler
        scheduler = get_scheduler()
    except Exception:  # noqa: BLE001
        _LOGGER.debug(
            "diagnostic_mode: scheduler not yet available for remove(%s)",
            module, exc_info=True,
        )
        return
    job_id = _job_id_for(module)
    try:
        scheduler.remove_job(job_id)
    except Exception:  # noqa: BLE001
        # JobLookupError or scheduler not running.
        _LOGGER.debug(
            "diagnostic_mode: no prior schedule to remove for %s",
            module, exc_info=True,
        )
def _schedule_revert(module: str, when: datetime) -> None:
    """Schedule the auto-revert one-shot.

    Falls back to a strongly-referenced ``asyncio`` task if the
    APScheduler instance isn't running (tests, very early startup) so the
    revert still happens.
    """
    try:
        from .scheduler import get_scheduler
        scheduler = get_scheduler()
        if scheduler.running:
            scheduler.add_job(
                _expire_callback,
                trigger="date",
                run_date=when,
                args=[module],
                id=_job_id_for(module),
                replace_existing=True,
                misfire_grace_time=60,
            )
            return
    except Exception:  # noqa: BLE001
        # Fall through to the task path.
        _LOGGER.debug(
            "diagnostic_mode: scheduler unavailable; using asyncio fallback",
            exc_info=True,
        )

    # Fallback: in-process timer. Retain the task in a module-level set so
    # CPython doesn't GC it before the timer fires.
    delay = max(0.0, (when - datetime.now(timezone.utc)).total_seconds())

    async def _wait_and_expire() -> None:
        try:
            await asyncio.sleep(delay)
        except asyncio.CancelledError:
            return
        await _expire_callback(module)

    try:
        loop = asyncio.get_running_loop()
    except RuntimeError:
        return
    task = loop.create_task(_wait_and_expire())
    _bg_tasks.add(task)
    task.add_done_callback(_bg_tasks.discard)
async def _expire_callback(module: str) -> None:
    """Fired by the scheduler at ``expires_at``. Re-applies the baseline.

    Re-reads ``log_levels`` from the DB so a setting change made while
    the window was active is honored at revert time (instead of using a
    stale snapshot taken at activation).
    """
    ov = _active.pop(module, None)
    db_log_levels = await _read_db_log_levels()
    target = _baseline_for(module, db_log_levels)
    logging.getLogger(module).setLevel(target)
    _LOGGER.info(
        "Diagnostic mode: %s auto-reverted from DEBUG to %s (was active=%s)",
        module, target, ov is not None,
    )
@@ -0,0 +1,255 @@
"""Aggregate per-target dispatch results into an ``EventLog.details`` summary.
Every dispatch site (``event_dispatch``, ``watcher``, ``deferred_dispatch``,
``scheduled_dispatch``) calls :func:`NotificationDispatcher.dispatch` and
gets back a ``list[dict]`` — one entry per target. Each entry has at minimum
``success: bool`` and (on failure) ``error: str``. Telegram media-group
sends additionally include ``delivered_count``, ``skipped_count``,
``failed_count``, ``errors`` and ``failed_at_chunk`` so a partial delivery
is observable from the result.
Historically the dashboard only saw the per-row ``status`` derived at
EventLog insert time — partial failures (one target out of three failed,
two assets out of ten dropped) showed up as a generic success/failure and
the operator had to read stderr to find the cause. This module collapses
the per-target dicts into a small ``dispatch_summary`` block that's merged
into ``EventLog.details`` after the dispatch completes, so the same
information surfaces in the UI without re-reading logs.
"""
from __future__ import annotations
import asyncio
import logging
from typing import Any
from sqlalchemy.orm.attributes import flag_modified
from sqlmodel.ext.asyncio.session import AsyncSession
from ..database.models import EventLog
_LOGGER = logging.getLogger(__name__)
# Bound the error list we stash on the row. A pathological dispatch (50
# targets, 50 media items each, all failing) would otherwise bloat the
# row past anything useful — and the dashboard renders a fixed-height
# strip anyway. Excess entries are summarized as ``errors_truncated``.
_MAX_ERRORS = 20
_MAX_MEDIA_ERRORS = 20
# Cap error message length to avoid pathological payloads in the row.
_MAX_ERROR_MSG_LEN = 500
# Distinct sentinel so an operator scanning the dashboard can tell our
# clipping apart from a literal ``…`` that often appears in upstream API
# error text (Telegram does this in some Bad Request messages).
_TRUNCATION_MARKER = "…[truncated]"
def _trim(value: Any) -> Any:
    """Truncate string values to keep the persisted summary bounded."""
    if isinstance(value, str) and len(value) > _MAX_ERROR_MSG_LEN:
        return value[:_MAX_ERROR_MSG_LEN] + _TRUNCATION_MARKER
    return value
def summarize_dispatch_results(
    results: list[dict[str, Any]],
) -> dict[str, Any]:
    """Aggregate per-target dispatch results into a compact summary dict.

    The shape is intentionally narrow so it round-trips cleanly through
    SQLite JSON storage and stays cheap to render in the dashboard.

    Returns a dict with keys:

    * ``targets_attempted`` / ``targets_succeeded`` / ``targets_failed``:
      counts across the results list.
    * ``errors``: per-target failure entries
      (``[{index, error}, ...]``), capped at ``_MAX_ERRORS``.
    * ``media``: present only when at least one result reports media
      counts. ``{delivered, skipped, failed}``.
    * ``media_errors``: per-item / per-chunk failure entries from the
      Telegram media-group fallback, capped at ``_MAX_MEDIA_ERRORS``.
    * ``errors_truncated`` / ``media_errors_truncated``: count of dropped
      entries when the corresponding cap was hit. Present only when > 0.

    Input shape: each entry is what ``NotificationDispatcher._aggregate_results``
    returns for one target: ``{success, receivers, successes, failures,
    results: [per-receiver, ...], errors?, error?}``. Media counts live
    on each per-receiver dict under ``media_delivered_count`` /
    ``media_skipped_count`` / ``media_failed_count`` / ``media_errors``,
    so the walk drills one level deeper than the obvious top-level reads.
    For backward compat with simpler call sites that pass a single leaf
    dict (the Telegram media-group result directly), the leaf shape is
    accepted as a fallback when ``results`` is absent.
    """
    if not results:
        # Empty results = nothing to summarize. Returning ``{}`` lets the
        # callers' ``if summary`` / ``if results`` guards keep the row
        # clean rather than stamping a misleading zero-counts block.
        return {}

    succeeded = 0
    failed = 0
    errors: list[dict[str, Any]] = []
    media_delivered = 0
    media_skipped = 0
    media_failed = 0
    media_errors: list[dict[str, Any]] = []
    has_media_counts = False
    errors_dropped = 0
    media_errors_dropped = 0

    for index, result in enumerate(results):
        if result.get("success"):
            succeeded += 1
        else:
            failed += 1
            if len(errors) < _MAX_ERRORS:
                errors.append({
                    "index": index,
                    "error": _trim(result.get("error", "unknown")),
                })
            else:
                errors_dropped += 1

        # Per-receiver detail is bundled under ``results`` by the
        # dispatcher's ``_aggregate_results``. Walk it when present; fall
        # back to reading the leaf shape directly so older callers and
        # direct-test fixtures keep working.
        per_receiver = result.get("results")
        leaves: list[dict[str, Any]]
        if isinstance(per_receiver, list):
            leaves = [r for r in per_receiver if isinstance(r, dict)]
        else:
            leaves = [result]

        for receiver_index, leaf in enumerate(leaves):
            # The dispatcher's Telegram path renames the media counters
            # to ``media_*`` to disambiguate them from the surrounding
            # text-message result. Accept both names so a future provider
            # that surfaces top-level counts (single-shot text+media)
            # also gets picked up.
            d = leaf.get("media_delivered_count")
            if d is None:
                d = leaf.get("delivered_count")
            s = leaf.get("media_skipped_count")
            if s is None:
                s = leaf.get("skipped_count")
            f = leaf.get("media_failed_count")
            if f is None:
                f = leaf.get("failed_count")
            if d is not None or s is not None or f is not None:
                has_media_counts = True
                media_delivered += int(d or 0)
                media_skipped += int(s or 0)
                media_failed += int(f or 0)

            sub_errors = leaf.get("media_errors") or leaf.get("errors") or []
            for sub in sub_errors:
                if not isinstance(sub, dict):
                    # ``_aggregate_results`` populates a string list at
                    # the target level; only dict entries carry structured
                    # per-chunk / per-item detail worth keeping here.
                    continue
                if len(media_errors) >= _MAX_MEDIA_ERRORS:
                    media_errors_dropped += 1
                    continue
                entry: dict[str, Any] = {"target_index": index}
                # Only stamp the receiver index when we actually drilled
                # into a multi-receiver target. Single-leaf fallbacks
                # leave the key off so the existing one-target tests
                # stay shape-compatible.
                if len(leaves) > 1 or isinstance(per_receiver, list):
                    entry["receiver_index"] = receiver_index
                entry.update({k: _trim(v) for k, v in sub.items()})
                media_errors.append(entry)

    summary: dict[str, Any] = {
        "targets_attempted": len(results),
        "targets_succeeded": succeeded,
        "targets_failed": failed,
    }
    if errors:
        summary["errors"] = errors
    if errors_dropped:
        summary["errors_truncated"] = errors_dropped
    if has_media_counts:
        summary["media"] = {
            "delivered": media_delivered,
            "skipped": media_skipped,
            "failed": media_failed,
        }
    if media_errors:
        summary["media_errors"] = media_errors
    if media_errors_dropped:
        summary["media_errors_truncated"] = media_errors_dropped
    return summary
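To make the input and output shapes concrete, here is a reduced sketch that keeps only the target counts and the media counters (no error lists, caps, or trimming). `summarize_counts` and the sample `results` are illustrative, not the module's own function or real dispatcher output:

```python
from __future__ import annotations

from typing import Any


def summarize_counts(results: list[dict[str, Any]]) -> dict[str, Any]:
    """Reduced illustration: target counts plus aggregated media counters."""
    if not results:
        return {}
    summary: dict[str, Any] = {
        "targets_attempted": len(results),
        "targets_succeeded": 0,
        "targets_failed": 0,
    }
    media = {"delivered": 0, "skipped": 0, "failed": 0}
    saw_media = False
    for result in results:
        key = "targets_succeeded" if result.get("success") else "targets_failed"
        summary[key] += 1
        # Drill into per-receiver dicts when present, else treat the
        # target-level dict itself as the single leaf.
        per_receiver = result.get("results")
        leaves = per_receiver if isinstance(per_receiver, list) else [result]
        for leaf in leaves:
            for name in media:
                value = leaf.get(f"media_{name}_count")
                if value is None:
                    value = leaf.get(f"{name}_count")
                if value is not None:
                    saw_media = True
                    media[name] += int(value)
    if saw_media:
        summary["media"] = media
    return summary


results = [
    {
        "success": True,
        "results": [
            {
                "media_delivered_count": 8,
                "media_skipped_count": 1,
                "media_failed_count": 1,
            },
        ],
    },
    {"success": False, "error": "chat not found"},
]
summary = summarize_counts(results)
```

The second target carries no media keys at all, so it affects only the target counts; the `media` block appears because the first target reported counters.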
def attach_summary_in_place(
    row: EventLog, results: list[dict[str, Any]],
) -> None:
    """Merge a dispatch summary into ``row.details`` before its session commits.

    Use when the EventLog row is still attached to a session that has not
    yet committed; the caller's session.commit() carries the update.
    """
    summary = summarize_dispatch_results(results)
    if not summary:
        return
    details = dict(row.details or {})
    # Don't overwrite a summary that a caller / previous pass already
    # set explicitly. That's the same "caller wins" rule the correlation
    # enricher follows in ``log_context.py``.
    if "dispatch_summary" in details:
        return
    details["dispatch_summary"] = summary
    row.details = details
    # Identity-changing reassignment above is enough for SQLAlchemy to mark
    # the column dirty. ``flag_modified`` is belt-and-suspenders against a
    # future refactor that switches this to in-place mutation.
    flag_modified(row, "details")
async def record_dispatch_summary_async(
    session: AsyncSession,
    event_log_id: int | None,
    results: list[dict[str, Any]],
) -> None:
    """Best-effort update of an already-committed ``EventLog`` row.

    Used by call sites where the row was committed in an earlier
    transaction (the polling watcher commits its EventLog rows before
    invoking the dispatcher, so we need a follow-up update).

    Best-effort: a DB hiccup here must never abort the wider dispatch
    flow. The row keeps its prior status / details and the operator
    can still trace the dispatch in stderr using the ``dispatch_id``
    correlation written at insert time.
    """
    if event_log_id is None or not results:
        return
    summary = summarize_dispatch_results(results)
    if not summary:
        return
    try:
        row = await session.get(EventLog, event_log_id)
        if row is None:
            return
        details = dict(row.details or {})
        if "dispatch_summary" in details:
            return
        details["dispatch_summary"] = summary
        row.details = details
        flag_modified(row, "details")
        session.add(row)
        await session.commit()
    except asyncio.CancelledError:
        # Cancellation must propagate so APScheduler can drain shutdown.
        # Swallowing it here would pin the task and leave the row in an
        # indeterminate state.
        raise
    except Exception:  # noqa: BLE001
        _LOGGER.exception(
            "Failed to record dispatch_summary on event_log %s", event_log_id,
        )
@@ -20,6 +20,11 @@ from typing import Any, Awaitable, Callable
from sqlmodel import select
from sqlmodel.ext.asyncio.session import AsyncSession
from notify_bridge_core.log_context import (
bind_log_context,
ensure_dispatch_id,
enrich_details_with_correlation,
)
from notify_bridge_core.models.events import ServiceEvent
from notify_bridge_core.notifications.dispatcher import (
NotificationDispatcher,
@@ -36,6 +41,7 @@ from .dispatch_helpers import (
load_link_data,
resolve_provider_credential,
)
from .dispatch_summary import attach_summary_in_place
_LOGGER = logging.getLogger(__name__)
@@ -141,6 +147,31 @@ async def dispatch_provider_event(
int
Number of successfully dispatched notifications across all trackers.
"""
# Bind a dispatch_id for the whole event so every EventLog row written
# below — and every log line emitted by the inner dispatcher — share the
# same correlation id. The dispatcher's own ``ensure_dispatch_id()`` call
# reuses this id rather than generating its own.
with bind_log_context(dispatch_id=ensure_dispatch_id()):
return await _dispatch_provider_event_impl(
engine, provider_id, provider_name, provider_config,
event, detail_keys, filter_fn,
)
async def _dispatch_provider_event_impl(
engine: Any,
provider_id: int,
provider_name: str,
provider_config: dict[str, Any],
event: ServiceEvent,
detail_keys: tuple[str, ...],
filter_fn: FilterFn,
) -> int:
"""Implementation body for :func:`dispatch_provider_event`.
Split out so the public function can wrap the body in
:func:`bind_log_context` without re-indenting the entire flow.
"""
dispatched = 0
# Drain-scheduling is best-effort: a scheduling failure must not roll
# back the persisted defer rows (startup catch-up re-establishes them).
@@ -188,10 +219,10 @@ async def dispatch_provider_event(
collection_id=event.collection_id,
collection_name=event.collection_name,
assets_count=0,
details={
details=enrich_details_with_correlation({
"provider_type": event.provider_type.value,
**extra_details,
},
}),
)
session.add(event_log_row)
await session.flush()
@@ -294,6 +325,11 @@ async def dispatch_provider_event(
event.provider_type.value != "bridge_self"
)
# Accumulate per-target results across every tracking-config
# group so the EventLog row carries a single ``dispatch_summary``
# covering the full fan-out (not just the last group).
all_results: list[dict[str, Any]] = []
for tc, target_entries in groups.values():
if not target_entries:
continue
@@ -308,6 +344,7 @@ async def dispatch_provider_event(
"Dispatcher raised for tracker %d: %s", tracker.id, err,
)
continue
all_results.extend(results)
for entry, r in zip(target_entries, results):
_, target_id, target_name = entry
if r.get("success"):
@@ -332,6 +369,12 @@ async def dispatch_provider_event(
"bridge_self target-failure emission failed",
)
# Merge the aggregated per-target results onto the EventLog row
# while the session still owns it. The commit below carries the
# ``dispatch_summary`` block alongside the row's original fields.
if all_results:
attach_summary_in_place(event_log_row, all_results)
await session.commit()
# Schedule drain jobs OUTSIDE the DB session so an APScheduler hiccup
@@ -28,6 +28,7 @@ from typing import Any
from sqlmodel import select
from sqlmodel.ext.asyncio.session import AsyncSession
from notify_bridge_core.log_context import enrich_details_with_correlation
from notify_bridge_core.models.events import ServiceEvent
from notify_bridge_core.providers.home_assistant import (
HomeAssistantAuthError,
@@ -139,11 +140,11 @@ async def _record_ha_status(
collection_id="",
collection_name="",
assets_count=0,
details={
details=enrich_details_with_correlation({
"provider_type": "home_assistant",
"ha_status": state,
"ha_status_detail": detail or "",
},
}),
))
await session.commit()
except Exception: # noqa: BLE001
@@ -29,6 +29,11 @@ from zoneinfo import ZoneInfo, ZoneInfoNotFoundError
from sqlmodel import select
from sqlmodel.ext.asyncio.session import AsyncSession
from notify_bridge_core.log_context import (
bind_log_context,
ensure_dispatch_id,
enrich_details_with_correlation,
)
from notify_bridge_core.models.events import EventType
from notify_bridge_core.notifications.dispatcher import (
NotificationDispatcher,
@@ -51,6 +56,7 @@ from .dispatch_helpers import (
load_link_data,
resolve_provider_credential,
)
from .dispatch_summary import summarize_dispatch_results
from .manual_dispatch import build_immich_dispatch_events
_LOGGER = logging.getLogger(__name__)
@@ -135,12 +141,12 @@ async def _log_skip(
collection_id="",
collection_name="",
assets_count=0,
details={
details=enrich_details_with_correlation({
"kind": kind,
"trigger": "cron",
"status": "skipped",
"skip_reason": reason,
},
}),
))
await session.commit()
@@ -164,6 +170,15 @@ async def dispatch_scheduled_for_tracker(
the slot is disabled on the tracker's default tracking config, or no link
has a ``TemplateConfig`` with the corresponding slot row.
"""
# Bind a dispatch_id for the whole cron fire so the EventLog "skipped" /
# "sent" rows AND the inner dispatcher log lines share one correlation id.
with bind_log_context(dispatch_id=ensure_dispatch_id()):
await _dispatch_scheduled_for_tracker_impl(tracker_id, kind)
async def _dispatch_scheduled_for_tracker_impl(
tracker_id: int, kind: ScheduledKind
) -> None:
engine = get_engine()
async with AsyncSession(engine) as session:
tracker = await session.get(NotificationTracker, tracker_id)
@@ -390,6 +405,9 @@ async def dispatch_scheduled_for_tracker(
any_sent = True
successes = sum(1 for r in results if isinstance(r, dict) and r.get("success"))
summary = summarize_dispatch_results(
[r for r in results if isinstance(r, dict)],
)
async with AsyncSession(engine) as session:
session.add(EventLog(
user_id=tracker_user_id,
@@ -401,7 +419,7 @@ async def dispatch_scheduled_for_tracker(
collection_id=event.collection_id,
collection_name=event.collection_name,
assets_count=event.added_count or 0,
details={
details=enrich_details_with_correlation({
"kind": kind,
"slot": slot_name,
"trigger": "cron",
@@ -410,7 +428,8 @@ async def dispatch_scheduled_for_tracker(
"status": "sent",
"targets_dispatched": total_targets,
"targets_succeeded": successes,
},
"dispatch_summary": summary,
}),
))
await session.commit()
@@ -95,6 +95,7 @@ async def send_telegram_media(
chunk_delay: int = 0,
max_asset_data_size: int | None = None,
send_large_photos_as_documents: bool = False,
send_large_videos_as_documents: bool = False,
chat_action: str | None = "typing",
thumbhash_resolver: Callable[[str], str | None] | None = None,
) -> NotificationResult:
@@ -116,6 +117,7 @@ async def send_telegram_media(
chunk_delay=chunk_delay,
max_asset_data_size=max_asset_data_size,
send_large_photos_as_documents=send_large_photos_as_documents,
send_large_videos_as_documents=send_large_videos_as_documents,
chat_action=chat_action,
)
@@ -9,6 +9,11 @@ from typing import Any, Awaitable, Callable
from sqlmodel import select
from sqlmodel.ext.asyncio.session import AsyncSession
from notify_bridge_core.log_context import (
bind_log_context,
ensure_dispatch_id,
enrich_details_with_correlation,
)
from notify_bridge_core.models.events import ServiceEvent
from notify_bridge_core.notifications.dispatcher import NotificationDispatcher, TargetConfig
from notify_bridge_core.notifications.telegram.cache import TelegramFileCache
@@ -30,6 +35,7 @@ from .dispatch_helpers import (
load_link_data,
resolve_provider_credential,
)
from .dispatch_summary import record_dispatch_summary_async
_LOGGER = logging.getLogger(__name__)
@@ -262,6 +268,13 @@ _POLL_FACTORIES: dict[str, PollerFactory] = {
async def check_tracker(tracker_id: int) -> dict[str, Any]:
"""Poll a tracker's provider for changes and dispatch notifications."""
# Bind a per-tick dispatch_id so the EventLog row written for each detected
# change carries the same correlation id as the dispatcher's log lines.
with bind_log_context(dispatch_id=ensure_dispatch_id()):
return await _check_tracker_impl(tracker_id)
async def _check_tracker_impl(tracker_id: int) -> dict[str, Any]:
engine = get_engine()
# Load all DB data eagerly before entering aiohttp context
@@ -457,7 +470,7 @@ async def check_tracker(tracker_id: int) -> dict[str, Any]:
collection_id=event.collection_id,
collection_name=event.collection_name,
assets_count=assets_count,
-details=details,
+details=enrich_details_with_correlation(details),
)
session.add(log)
await session.flush()
@@ -605,6 +618,10 @@ async def check_tracker(tracker_id: int) -> dict[str, Any]:
event.provider_type.value != "bridge_self"
)
# Per-event accumulator so the summary write covers every
# tracking-config group, not just the last one.
event_results: list[dict[str, Any]] = []
for tc, target_entries in groups.values():
if not target_entries:
continue
@@ -616,6 +633,7 @@ async def check_tracker(tracker_id: int) -> dict[str, Any]:
continue
target_configs = [entry[0] for entry in target_entries]
results = await dispatcher.dispatch(shaped_event, target_configs)
event_results.extend(results)
for entry, r in zip(target_entries, results):
_, target_id, target_name = entry
if r.get("success"):
@@ -637,6 +655,15 @@ async def check_tracker(tracker_id: int) -> dict[str, Any]:
"bridge_self target-failure emission failed",
)
# The EventLog row was committed in the earlier session block
# so we run a tiny follow-up UPDATE in a fresh session. Best-
# effort: a failure here logs but does not abort the watcher.
if event_log_id is not None and event_results:
async with AsyncSession(engine) as summary_session:
await record_dispatch_summary_async(
summary_session, event_log_id, event_results,
)
return {
"status": "ok",
"events_detected": len(events),
@@ -0,0 +1,372 @@
"""Temporary per-module DEBUG overrides with auto-revert.
Covers the in-memory service module + a smoke pass over the API layer
using ``dependency_overrides`` to bypass auth. The APScheduler glue is
exercised via the fallback asyncio-timer path since tests run without a
running scheduler.
"""
from __future__ import annotations
import asyncio
import logging
from datetime import datetime, timedelta, timezone
from typing import Any
import pytest
from fastapi.testclient import TestClient
# ---------------------------------------------------------------------------
# Test scaffolding
# ---------------------------------------------------------------------------
def _reset_state() -> None:
"""Clear the module-level ``_active`` dict between tests so prior
activations don't bleed across cases."""
from notify_bridge_server.services import diagnostic_mode as svc
svc._active.clear()
@pytest.fixture(autouse=True)
def _stub_db_read(monkeypatch):
"""Default every test to a fixed empty ``log_levels`` snapshot.
A test that wants to exercise DB-override precedence overrides this
fixture by re-patching the function explicitly.
"""
async def fake() -> str:
return ""
from notify_bridge_server.services import diagnostic_mode as svc
monkeypatch.setattr(svc, "_read_db_log_levels", fake)
def _patch_db_read(monkeypatch, value: str) -> None:
"""Override the auto-applied fixture for a single test that needs a
non-empty ``log_levels`` value."""
async def fake() -> str:
return value
from notify_bridge_server.services import diagnostic_mode as svc
monkeypatch.setattr(svc, "_read_db_log_levels", fake)
# ---------------------------------------------------------------------------
# Unit tests — service module
# ---------------------------------------------------------------------------
@pytest.mark.asyncio
async def test_set_diagnostic_applies_debug_immediately(tmp_data_dir) -> None: # noqa: ARG001
from notify_bridge_server.services.diagnostic_mode import set_diagnostic
_reset_state()
module = "notify_bridge_core.notifications.telegram.client"
entry = await set_diagnostic(module, duration_minutes=30)
assert entry["module"] == module
assert entry["current_level"] == "DEBUG"
assert entry["remaining_seconds"] > 60 * 29
assert logging.getLogger(module).level == logging.DEBUG
@pytest.mark.asyncio
async def test_set_diagnostic_rejects_unlisted_module(tmp_data_dir) -> None: # noqa: ARG001
"""Only the documented namespaces should be flippable from the UI."""
from notify_bridge_server.services.diagnostic_mode import set_diagnostic
_reset_state()
with pytest.raises(ValueError, match="allowlist"):
await set_diagnostic("some_random_third_party", 30)
@pytest.mark.asyncio
async def test_set_diagnostic_rejects_root_logger(tmp_data_dir) -> None: # noqa: ARG001
"""The empty string would target root — explicitly disallowed."""
from notify_bridge_server.services.diagnostic_mode import set_diagnostic
_reset_state()
with pytest.raises(ValueError, match="allowlist"):
await set_diagnostic("", 30)
@pytest.mark.asyncio
async def test_set_diagnostic_rejects_unreasonable_durations(tmp_data_dir) -> None: # noqa: ARG001
from notify_bridge_server.services.diagnostic_mode import set_diagnostic
_reset_state()
with pytest.raises(ValueError, match="duration_minutes"):
await set_diagnostic("notify_bridge_core", 0)
with pytest.raises(ValueError, match="duration_minutes"):
await set_diagnostic("notify_bridge_core", 9999)
@pytest.mark.asyncio
async def test_baseline_from_db_override(tmp_data_dir, monkeypatch) -> None: # noqa: ARG001
"""``log_levels`` setting wins over the noisy-library default."""
from notify_bridge_server.services.diagnostic_mode import set_diagnostic
_reset_state()
_patch_db_read(monkeypatch, "sqlalchemy.engine=ERROR")
entry = await set_diagnostic("sqlalchemy.engine", duration_minutes=15)
assert entry["baseline_level"] == "ERROR"
@pytest.mark.asyncio
async def test_baseline_from_noisy_default(tmp_data_dir) -> None: # noqa: ARG001
"""No DB override falls through to the curated noisy-lib quiet list."""
from notify_bridge_server.services.diagnostic_mode import set_diagnostic
_reset_state()
entry = await set_diagnostic("sqlalchemy.engine", duration_minutes=15)
assert entry["baseline_level"] == "WARNING"
@pytest.mark.asyncio
async def test_baseline_prefix_walks_for_submodule(tmp_data_dir, monkeypatch) -> None: # noqa: ARG001
"""A sub-logger like ``sqlalchemy.engine.Engine`` inherits its parent's
noisy-default level (WARNING), not the root INFO."""
from notify_bridge_server.services.diagnostic_mode import set_diagnostic
_reset_state()
entry = await set_diagnostic(
"sqlalchemy.engine.Engine", duration_minutes=15,
)
assert entry["baseline_level"] == "WARNING"
@pytest.mark.asyncio
async def test_baseline_prefix_walks_for_db_override(tmp_data_dir, monkeypatch) -> None: # noqa: ARG001
"""An explicit ``log_levels`` entry covers all sub-loggers below it."""
from notify_bridge_server.services.diagnostic_mode import set_diagnostic
_reset_state()
_patch_db_read(
monkeypatch, "notify_bridge_core.notifications=ERROR",
)
entry = await set_diagnostic(
"notify_bridge_core.notifications.telegram.client",
duration_minutes=15,
)
assert entry["baseline_level"] == "ERROR"
@pytest.mark.asyncio
async def test_set_diagnostic_twice_replaces_schedule(tmp_data_dir) -> None: # noqa: ARG001
"""Clicking the button twice extends, doesn't stack."""
from notify_bridge_server.services.diagnostic_mode import (
list_active, set_diagnostic,
)
_reset_state()
module = "notify_bridge_core"
await set_diagnostic(module, 5)
first_active = list_active()
assert len(first_active) == 1
first_expires = first_active[0]["expires_at"]
# Sleep just long enough to make the timestamps distinct, then re-set.
await asyncio.sleep(0.05)
await set_diagnostic(module, 60)
second_active = list_active()
assert len(second_active) == 1
assert second_active[0]["expires_at"] != first_expires
assert second_active[0]["remaining_seconds"] > 30 * 60
@pytest.mark.asyncio
async def test_manual_revert_restores_baseline(tmp_data_dir) -> None: # noqa: ARG001
from notify_bridge_server.services.diagnostic_mode import (
revert_diagnostic, set_diagnostic,
)
_reset_state()
module = "sqlalchemy.engine"
await set_diagnostic(module, 30)
assert logging.getLogger(module).level == logging.DEBUG
reverted = await revert_diagnostic(module)
assert reverted is True
# noisy-library default is WARNING (30)
assert logging.getLogger(module).level == logging.WARNING
@pytest.mark.asyncio
async def test_revert_reads_db_at_revert_time(tmp_data_dir, monkeypatch) -> None: # noqa: ARG001
"""Editing ``log_levels`` while the override is active is honored when
the revert fires — not the snapshot taken at activation time."""
from notify_bridge_server.services.diagnostic_mode import (
revert_diagnostic, set_diagnostic,
)
_reset_state()
module = "sqlalchemy.engine"
_patch_db_read(monkeypatch, "")
await set_diagnostic(module, 30)
# Operator edits the setting mid-window — bump to ERROR.
_patch_db_read(monkeypatch, "sqlalchemy.engine=ERROR")
assert await revert_diagnostic(module) is True
assert logging.getLogger(module).level == logging.ERROR
@pytest.mark.asyncio
async def test_manual_revert_no_active_returns_false(tmp_data_dir) -> None: # noqa: ARG001
from notify_bridge_server.services.diagnostic_mode import revert_diagnostic
_reset_state()
assert await revert_diagnostic("notify_bridge_core") is False
@pytest.mark.asyncio
async def test_auto_revert_after_window_elapses(tmp_data_dir) -> None: # noqa: ARG001
"""The asyncio-timer fallback fires near ``expires_at`` and restores
the baseline. Uses a sub-second window so the test stays fast.
Bypasses ``set_diagnostic`` (which clamps to minutes) by populating the
``_active`` dict and calling ``_schedule_revert`` directly.
"""
from notify_bridge_server.services import diagnostic_mode as svc
_reset_state()
module = "sqlalchemy.engine"
baseline = svc._baseline_for(module, db_log_levels="")
now = datetime.now(timezone.utc)
expires = now + timedelta(seconds=0.3)
logging.getLogger(module).setLevel("DEBUG")
svc._active[module] = svc._Override(
module=module,
baseline_level=baseline,
activated_at=now,
expires_at=expires,
)
svc._schedule_revert(module, expires)
await asyncio.sleep(0.5)
assert module not in svc._active
assert logging.getLogger(module).level == logging.WARNING
@pytest.mark.asyncio
async def test_fallback_task_retained_until_fire(tmp_data_dir) -> None: # noqa: ARG001
"""The asyncio fallback path must keep a strong reference to its task
so CPython doesn't GC it before the timer fires."""
from notify_bridge_server.services import diagnostic_mode as svc
_reset_state()
when = datetime.now(timezone.utc) + timedelta(seconds=10)
svc._schedule_revert("notify_bridge_core", when)
# The retainer set should hold exactly the task we just queued.
assert len(svc._bg_tasks) == 1
# Cancel it to clean up; the done-callback will drop it.
for task in list(svc._bg_tasks):
task.cancel()
await asyncio.sleep(0)
def test_list_active_omits_and_sweeps_expired(tmp_data_dir) -> None: # noqa: ARG001
"""Expired entries are filtered AND removed so a delayed scheduler
fire doesn't leave ghost rows in ``_active`` forever."""
from notify_bridge_server.services import diagnostic_mode as svc
_reset_state()
past = datetime.now(timezone.utc) - timedelta(minutes=1)
svc._active["sqlalchemy.engine"] = svc._Override(
module="sqlalchemy.engine",
baseline_level="WARNING",
activated_at=past - timedelta(minutes=30),
expires_at=past,
)
assert svc.list_active() == []
assert "sqlalchemy.engine" not in svc._active
@pytest.mark.asyncio
async def test_revert_all_clears_every_override(tmp_data_dir) -> None: # noqa: ARG001
from notify_bridge_server.services.diagnostic_mode import (
list_active, revert_all, set_diagnostic,
)
_reset_state()
await set_diagnostic("notify_bridge_core", 30)
await set_diagnostic("sqlalchemy.engine", 30)
assert len(list_active()) == 2
count = await revert_all()
assert count == 2
assert list_active() == []
# ---------------------------------------------------------------------------
# API smoke — bypasses auth via dependency_overrides
# ---------------------------------------------------------------------------
@pytest.fixture
def _admin_client(tmp_data_dir): # noqa: ARG001
"""Yield a TestClient with ``require_admin`` short-circuited.
Keeps the auth-flow's SQLAlchemy/greenlet issues out of the picture
while still exercising the FastAPI router, path converters, and the
``HTTPException`` paths.
"""
_reset_state()
from notify_bridge_server.auth.dependencies import require_admin
from notify_bridge_server.database.models import User
from notify_bridge_server.main import app
fake = User(
id=1, username="admin",
password_hash="x", role="admin", token_version=0,
)
app.dependency_overrides[require_admin] = lambda: fake
with TestClient(app) as client:
yield client
app.dependency_overrides.pop(require_admin, None)
_reset_state()
def test_api_post_rejects_unlisted_module_with_400(_admin_client: TestClient) -> None:
resp = _admin_client.post(
"/api/settings/diagnostic-mode",
json={"module": "evil.namespace", "duration_minutes": 15},
)
assert resp.status_code == 400
assert "allowlist" in resp.json().get("detail", "")
def test_api_post_rejects_huge_duration_with_400(_admin_client: TestClient) -> None:
resp = _admin_client.post(
"/api/settings/diagnostic-mode",
json={"module": "notify_bridge_core", "duration_minutes": 99999},
)
assert resp.status_code == 400
def test_api_delete_unknown_returns_404(_admin_client: TestClient) -> None:
resp = _admin_client.delete(
"/api/settings/diagnostic-mode/notify_bridge_core",
)
assert resp.status_code == 404
def test_api_delete_handles_dotted_module_path(_admin_client: TestClient) -> None:
"""``{module:path}`` lets dotted names survive URL routing intact."""
target = "notify_bridge_core.notifications.telegram.client"
_admin_client.post(
"/api/settings/diagnostic-mode",
json={"module": target, "duration_minutes": 15},
)
resp = _admin_client.delete(f"/api/settings/diagnostic-mode/{target}")
assert resp.status_code == 200, resp.text
assert resp.json()["reverted"] == target
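The baseline tests above imply a lookup order for the level that a reverted logger returns to: an explicit `log_levels` entry first, then a curated noisy-library default, then the root level, each matched by walking dotted prefixes. The following is a minimal sketch of that lookup, not the service's actual `_baseline_for`. The function names, the comma-separated `name=LEVEL` syntax, the contents of `NOISY_DEFAULTS`, and the rule that any override prefix beats any noisy default are assumptions made for illustration.

```python
from __future__ import annotations

from typing import Iterator

# Assumption: only the one entry the tests reveal; the real list is not shown.
NOISY_DEFAULTS: dict[str, str] = {"sqlalchemy.engine": "WARNING"}


def parse_levels(spec: str) -> dict[str, str]:
    """Parse 'a.b=ERROR,c=DEBUG' into a dict (separator is an assumption)."""
    levels: dict[str, str] = {}
    for part in spec.split(","):
        name, sep, level = part.partition("=")
        if sep and name.strip() and level.strip():
            levels[name.strip()] = level.strip().upper()
    return levels


def dotted_prefixes(module: str) -> Iterator[str]:
    """Yield 'a.b.c', then 'a.b', then 'a' (most specific first)."""
    parts = module.split(".")
    for end in range(len(parts), 0, -1):
        yield ".".join(parts[:end])


def sketch_baseline_for(module: str, db_log_levels: str, root: str = "INFO") -> str:
    """Return the level a logger should fall back to after a DEBUG window."""
    overrides = parse_levels(db_log_levels)
    # Operator overrides are consulted before the noisy-library defaults.
    for table in (overrides, NOISY_DEFAULTS):
        for name in dotted_prefixes(module):
            if name in table:
                return table[name]
    return root
```

Re-reading the setting at revert time, as `test_revert_reads_db_at_revert_time` expects, would amount to calling a helper like this with the current value rather than one captured at activation.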
@@ -0,0 +1,357 @@
"""Aggregation of per-target dispatch results into ``EventLog.details``.
Covers ``summarize_dispatch_results`` and ``attach_summary_in_place``.
The async ``record_dispatch_summary_async`` is exercised through the
in-process update path; the watcher-style flow is covered indirectly via
the full server tests.
"""
from __future__ import annotations
from typing import Any
import pytest
def test_summarize_empty_returns_empty(tmp_data_dir) -> None: # noqa: ARG001
"""Empty results = nothing to summarize. Callers can short-circuit
on the falsy return so a row with zero dispatches doesn't get a
misleading zero-counts block."""
from notify_bridge_server.services.dispatch_summary import (
summarize_dispatch_results,
)
assert summarize_dispatch_results([]) == {}
def test_summarize_all_success_no_errors_block(tmp_data_dir) -> None: # noqa: ARG001
from notify_bridge_server.services.dispatch_summary import (
summarize_dispatch_results,
)
results = [
{"success": True, "message_id": 1},
{"success": True, "message_id": 2},
]
summary = summarize_dispatch_results(results)
assert summary["targets_attempted"] == 2
assert summary["targets_succeeded"] == 2
assert summary["targets_failed"] == 0
assert "errors" not in summary
assert "media" not in summary
def test_summarize_mixed_records_only_failures(tmp_data_dir) -> None: # noqa: ARG001
from notify_bridge_server.services.dispatch_summary import (
summarize_dispatch_results,
)
results = [
{"success": True},
{"success": False, "error": "Bad Request: chat not found"},
{"success": False, "error": "timeout"},
]
summary = summarize_dispatch_results(results)
assert summary["targets_succeeded"] == 1
assert summary["targets_failed"] == 2
assert summary["errors"] == [
{"index": 1, "error": "Bad Request: chat not found"},
{"index": 2, "error": "timeout"},
]
def test_summarize_media_counts_aggregate(tmp_data_dir) -> None: # noqa: ARG001
"""Media counts from a Telegram media-group success are merged."""
from notify_bridge_server.services.dispatch_summary import (
summarize_dispatch_results,
)
results = [
{
"success": True,
"delivered_count": 5,
"skipped_count": 1,
"failed_count": 0,
},
{
"success": True,
"delivered_count": 3,
"skipped_count": 0,
"failed_count": 0,
},
]
summary = summarize_dispatch_results(results)
assert summary["media"] == {"delivered": 8, "skipped": 1, "failed": 0}
def test_summarize_sub_errors_carry_target_index(tmp_data_dir) -> None: # noqa: ARG001
"""Per-chunk/per-item failures from a partial media-group send are flattened."""
from notify_bridge_server.services.dispatch_summary import (
summarize_dispatch_results,
)
results = [
{"success": True, "delivered_count": 1, "skipped_count": 0, "failed_count": 0},
{
"success": True, # group landed but with partial failure
"delivered_count": 2,
"skipped_count": 0,
"failed_count": 1,
"errors": [
{"kind": "chunk", "chunk": 1, "error": "Bad Request: ..."},
{"kind": "item", "chunk": 1, "item_index": 2, "error": "media not found"},
],
},
]
summary = summarize_dispatch_results(results)
assert summary["media_errors"] == [
{"target_index": 1, "kind": "chunk", "chunk": 1, "error": "Bad Request: ..."},
{
"target_index": 1,
"kind": "item",
"chunk": 1,
"item_index": 2,
"error": "media not found",
},
]
def test_summarize_caps_errors_and_reports_truncation(tmp_data_dir) -> None: # noqa: ARG001
from notify_bridge_server.services.dispatch_summary import (
summarize_dispatch_results,
)
results: list[dict[str, Any]] = [
{"success": False, "error": f"err {i}"} for i in range(25)
]
summary = summarize_dispatch_results(results)
assert len(summary["errors"]) == 20
assert summary["errors_truncated"] == 5
def test_summarize_trims_long_error_messages(tmp_data_dir) -> None: # noqa: ARG001
"""A pathological multi-KB error string is bounded so the row stays small."""
from notify_bridge_server.services.dispatch_summary import (
summarize_dispatch_results,
)
long_err = "x" * 2000
results = [{"success": False, "error": long_err}]
summary = summarize_dispatch_results(results)
persisted = summary["errors"][0]["error"]
assert persisted.endswith("…[truncated]")
# 500 char body + the explicit "…[truncated]" marker.
assert len(persisted) == 500 + len("…[truncated]")
@pytest.mark.asyncio
async def test_attach_summary_in_place_mutates_details_dict(tmp_data_dir) -> None: # noqa: ARG001
"""In-session call merges the summary without losing original keys."""
from notify_bridge_server.database.models import EventLog
from notify_bridge_server.services.dispatch_summary import (
attach_summary_in_place,
)
row = EventLog(
event_type="assets_added",
collection_id="abc",
collection_name="Album",
details={"provider_type": "immich", "added_count": 3},
)
attach_summary_in_place(row, [{"success": True}, {"success": False, "error": "x"}])
assert row.details["provider_type"] == "immich"
assert row.details["added_count"] == 3
assert row.details["dispatch_summary"] == {
"targets_attempted": 2,
"targets_succeeded": 1,
"targets_failed": 1,
"errors": [{"index": 1, "error": "x"}],
}
@pytest.mark.asyncio
async def test_attach_summary_in_place_with_no_results_is_noop(tmp_data_dir) -> None: # noqa: ARG001
"""Empty results → no ``dispatch_summary`` key written. Original
details survive untouched."""
from notify_bridge_server.database.models import EventLog
from notify_bridge_server.services.dispatch_summary import (
attach_summary_in_place,
)
row = EventLog(
event_type="assets_added",
collection_id="abc",
collection_name="Album",
details={"k": "v"},
)
attach_summary_in_place(row, [])
assert row.details == {"k": "v"}
assert "dispatch_summary" not in row.details
def test_summarize_handles_malformed_sub_errors(tmp_data_dir) -> None: # noqa: ARG001
"""A non-dict sub-error entry is silently skipped, not crashed on."""
from notify_bridge_server.services.dispatch_summary import (
summarize_dispatch_results,
)
results = [
{
"success": True,
"delivered_count": 1,
"errors": ["not a dict", {"kind": "item", "error": "real"}],
},
]
summary = summarize_dispatch_results(results)
assert summary["media_errors"] == [
{"target_index": 0, "kind": "item", "error": "real"}
]
# ---------------------------------------------------------------------------
# Integration: real dispatcher output shape from ``_aggregate_results``
# ---------------------------------------------------------------------------
#
# The dispatcher wraps each Telegram fan-out in a per-target envelope:
#
# {
# "success": True,
# "receivers": 2,
# "successes": 2,
# "failures": 0,
# "results": [<per-receiver dict>, ...], # ← media counts live HERE
# }
#
# These tests use that exact shape so a future refactor of the dispatcher
# doesn't silently zero out the dashboard's ``dispatch_summary.media``
# block. Earlier versions of this file passed leaf dicts directly, which
# masked the wrong-shape read in production.
def test_summarize_drills_into_aggregated_per_receiver_dicts(tmp_data_dir) -> None: # noqa: ARG001
"""Media counts on per-receiver leaves are summed across receivers."""
from notify_bridge_server.services.dispatch_summary import (
summarize_dispatch_results,
)
# Two targets, each with two Telegram receivers.
results = [
{
"success": True,
"receivers": 2,
"successes": 2,
"failures": 0,
"results": [
{
"success": True,
"message_id": 100,
"media_delivered_count": 5,
"media_skipped_count": 1,
"media_failed_count": 0,
},
{
"success": True,
"message_id": 101,
"media_delivered_count": 3,
"media_skipped_count": 0,
"media_failed_count": 0,
},
],
},
]
summary = summarize_dispatch_results(results)
assert summary["media"] == {"delivered": 8, "skipped": 1, "failed": 0}
def test_summarize_collects_aggregated_media_errors_with_receiver_index(
tmp_data_dir, # noqa: ARG001
) -> None:
"""Per-chunk / per-item media errors carry both target AND receiver index."""
from notify_bridge_server.services.dispatch_summary import (
summarize_dispatch_results,
)
results = [
{
"success": True,
"receivers": 1,
"successes": 1,
"failures": 0,
"results": [
{
"success": True,
"message_id": 200,
"media_delivered_count": 2,
"media_failed_count": 1,
"media_errors": [
{"kind": "chunk", "chunk": 1, "error": "Bad Request"},
{"kind": "item", "chunk": 1, "item_index": 2,
"error": "media not found"},
],
},
],
},
]
summary = summarize_dispatch_results(results)
assert summary["media_errors"] == [
{"target_index": 0, "receiver_index": 0, "kind": "chunk",
"chunk": 1, "error": "Bad Request"},
{"target_index": 0, "receiver_index": 0, "kind": "item",
"chunk": 1, "item_index": 2, "error": "media not found"},
]
def test_summarize_aggregated_target_errors_list_is_safely_ignored(
tmp_data_dir, # noqa: ARG001
) -> None:
"""``_aggregate_results`` stamps a flat ``errors: [str, ...]`` at the
target level on failure. The summarizer must not try to treat the
strings as structured sub-errors."""
from notify_bridge_server.services.dispatch_summary import (
summarize_dispatch_results,
)
results = [
{
"success": False,
"receivers": 2,
"successes": 0,
"failures": 2,
"error": "All receivers failed",
"errors": ["chat_not_found", "blocked_by_user"],
"results": [
{"success": False, "error": "chat_not_found"},
{"success": False, "error": "blocked_by_user"},
],
},
]
summary = summarize_dispatch_results(results)
assert summary["targets_failed"] == 1
assert summary["errors"] == [
{"index": 0, "error": "All receivers failed"},
]
# The string list at the target level is ignored — the per-receiver
# errors are already represented by the target-level error message.
assert "media_errors" not in summary
assert "media" not in summary
@pytest.mark.asyncio
async def test_attach_summary_in_place_skips_when_already_set(
tmp_data_dir, # noqa: ARG001
) -> None:
"""Caller-set ``dispatch_summary`` wins — the same "caller pins"
rule that ``enrich_details_with_correlation`` follows."""
from notify_bridge_server.database.models import EventLog
from notify_bridge_server.services.dispatch_summary import (
attach_summary_in_place,
)
row = EventLog(
event_type="assets_added",
collection_id="abc",
collection_name="Album",
details={"dispatch_summary": {"pinned": True}},
)
attach_summary_in_place(row, [{"success": True}])
assert row.details["dispatch_summary"] == {"pinned": True}
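The flat-shape tests above pin down the counting, capping, and trimming rules closely enough to sketch them. This is a simplified illustration, not the real `summarize_dispatch_results`: it deliberately ignores the nested per-receiver `results` envelope and the media counters. The name `summarize_flat` and the constant names are hypothetical; the values 20 and 500 and the truncation marker are taken from the assertions in the tests.

```python
from __future__ import annotations

from typing import Any

MAX_ERRORS = 20                     # cap implied by the truncation test
MAX_ERROR_CHARS = 500               # body length implied by the trimming test
TRUNCATION_MARKER = "…[truncated]"  # marker string asserted in the tests


def _trim(message: str) -> str:
    """Bound a single error string so the persisted row stays small."""
    if len(message) <= MAX_ERROR_CHARS:
        return message
    return message[:MAX_ERROR_CHARS] + TRUNCATION_MARKER


def summarize_flat(results: list[dict[str, Any]]) -> dict[str, Any]:
    """Summarize per-target results; returns {} when there is nothing to report."""
    if not results:
        return {}
    failures = [
        {"index": i, "error": _trim(str(r.get("error", "")))}
        for i, r in enumerate(results)
        if not r.get("success")
    ]
    summary: dict[str, Any] = {
        "targets_attempted": len(results),
        "targets_succeeded": len(results) - len(failures),
        "targets_failed": len(failures),
    }
    if failures:
        # Keep the first MAX_ERRORS entries and report how many were dropped.
        summary["errors"] = failures[:MAX_ERRORS]
        if len(failures) > MAX_ERRORS:
            summary["errors_truncated"] = len(failures) - MAX_ERRORS
    return summary
```

Returning an empty dict for empty input lets a caller skip the write entirely, which matches the no-op behaviour the `attach_summary_in_place` tests describe.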
@@ -0,0 +1,158 @@
"""Request-ID middleware + EventLog dispatch_id correlation.
Covers two halves of the same correlation story:
* ``RequestContextMiddleware`` generates / accepts an inbound request id,
binds it onto the log-context ContextVar for the duration of the request,
and echoes it back as the ``X-Request-Id`` response header.
* ``enrich_details_with_correlation`` merges the active ``dispatch_id`` and
``request_id`` into an ``EventLog.details`` dict so the persisted row can
be cross-referenced with the stderr log lines emitted during the same
dispatch.
"""
from __future__ import annotations
import re
import pytest
from fastapi.testclient import TestClient
_REQ_ID_PATTERN = re.compile(r"^req:[0-9a-f]{12}$")
def test_response_carries_generated_request_id(tmp_data_dir) -> None: # noqa: ARG001
"""No inbound header → server generates ``req:<12 hex>`` and echoes it."""
from notify_bridge_server.main import app
with TestClient(app) as client:
resp = client.get("/api/health")
assert resp.status_code == 200
req_id = resp.headers.get("X-Request-Id")
assert req_id is not None
assert _REQ_ID_PATTERN.match(req_id), (
f"generated id {req_id!r} should match req:<12 hex>"
)
def test_response_echoes_safe_inbound_request_id(tmp_data_dir) -> None: # noqa: ARG001
"""A well-formed inbound ``X-Request-Id`` is preserved unchanged."""
from notify_bridge_server.main import app
inbound = "abc-123_XYZ_trace"
with TestClient(app) as client:
resp = client.get("/api/health", headers={"X-Request-Id": inbound})
assert resp.status_code == 200
assert resp.headers.get("X-Request-Id") == inbound
def test_colon_prefixed_inbound_id_is_replaced(tmp_data_dir) -> None: # noqa: ARG001
"""``:`` is reserved for server-minted ids — a colon in the inbound value
must trigger replacement so a client can't masquerade as ``disp:...``."""
from notify_bridge_server.main import app
with TestClient(app) as client:
resp = client.get(
"/api/health", headers={"X-Request-Id": "disp:fake12345678"},
)
assert resp.status_code == 200
echoed = resp.headers.get("X-Request-Id", "")
assert echoed != "disp:fake12345678"
assert _REQ_ID_PATTERN.match(echoed)
@pytest.mark.parametrize(
"bad_value",
[
# CRLF injection attempt — would split log lines / inject headers.
"abc\r\ninjected: yes",
# Way too long.
"x" * 256,
# Disallowed characters.
"<script>alert(1)</script>",
# Empty after stripping.
" ",
],
)
def test_unsafe_inbound_request_id_is_replaced(
tmp_data_dir, bad_value: str, # noqa: ARG001
) -> None:
"""An attacker-controlled id must not flow into logs verbatim."""
from notify_bridge_server.main import app
with TestClient(app) as client:
resp = client.get("/api/health", headers={"X-Request-Id": bad_value})
assert resp.status_code == 200
echoed = resp.headers.get("X-Request-Id", "")
assert echoed != bad_value, "unsafe id was passed through unchanged"
assert _REQ_ID_PATTERN.match(echoed), (
f"replacement id {echoed!r} should match req:<12 hex>"
)
def test_enrich_details_merges_active_correlation_ids() -> None:
"""Within a ``bind_log_context`` block, the helper copies the active ids."""
from notify_bridge_core.log_context import (
bind_log_context,
enrich_details_with_correlation,
)
with bind_log_context(
dispatch_id="disp:deadbeef0001",
request_id="req:cafecafe0002",
):
result = enrich_details_with_correlation({"existing": "value"})
assert result == {
"existing": "value",
"dispatch_id": "disp:deadbeef0001",
"request_id": "req:cafecafe0002",
}
def test_enrich_details_does_not_overwrite_explicit_keys() -> None:
"""If the caller pre-set a correlation key, the helper leaves it alone."""
from notify_bridge_core.log_context import (
bind_log_context,
enrich_details_with_correlation,
)
with bind_log_context(dispatch_id="disp:newvalue00001"):
result = enrich_details_with_correlation({"dispatch_id": "disp:pinned"})
assert result["dispatch_id"] == "disp:pinned"
def test_enrich_details_no_context_returns_copy() -> None:
"""Outside any binding, the helper returns the dict unchanged but copied."""
from notify_bridge_core.log_context import enrich_details_with_correlation
original = {"key": "value"}
result = enrich_details_with_correlation(original)
assert result == original
# Mutating the result must not leak into the caller's dict.
result["extra"] = "added"
assert "extra" not in original
def test_enrich_details_handles_none() -> None:
"""``None`` is accepted (callers may build details lazily)."""
from notify_bridge_core.log_context import enrich_details_with_correlation
assert enrich_details_with_correlation(None) == {}
def test_ensure_dispatch_id_generates_or_reuses() -> None:
"""Fresh call produces a new id; inside a bind it returns the bound one."""
from notify_bridge_core.log_context import (
bind_log_context,
ensure_dispatch_id,
)
fresh = ensure_dispatch_id()
assert fresh.startswith("disp:")
assert len(fresh) == len("disp:") + 12
with bind_log_context(dispatch_id="disp:bound00000001"):
assert ensure_dispatch_id() == "disp:bound00000001"
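The middleware tests above describe an accept-or-replace rule for inbound `X-Request-Id` values. A minimal sketch of one way to implement such a rule follows. The function name, the exact allowed character set, and the 128-character ceiling are assumptions; the tests only establish that `abc-123_XYZ_trace` passes, that colons, CRLF, angle brackets, whitespace-only values, and a 256-character value are rejected, and that replacements look like `req:` plus 12 hex digits.

```python
from __future__ import annotations

import re
import secrets

# Assumption: a conservative allowlist. The colon is excluded on purpose so
# a client cannot forge ids that look server-minted ("req:...", "disp:...").
_SAFE_INBOUND = re.compile(r"[A-Za-z0-9_.\-]{1,128}")


def normalize_request_id(inbound: str | None) -> str:
    """Return the inbound id if it is safe to log, else mint a fresh one."""
    candidate = (inbound or "").strip()
    if _SAFE_INBOUND.fullmatch(candidate):
        return candidate
    # token_hex(6) yields 12 lowercase hex characters.
    return f"req:{secrets.token_hex(6)}"
```

Using `fullmatch` rather than `match` matters here: a value such as `abc` followed by a CRLF and more text has a safe prefix, and only a whole-string check rejects it.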
@@ -0,0 +1,511 @@
"""Tests for partial-delivery resilience in TelegramClient._send_media_group.
Covers the three independent failure modes that previously aborted the
whole send:
1. **Per-item oversize**: one item over ``max_asset_data_size`` is
silently dropped; siblings still deliver. ``skipped_count`` reflects
the drop.
2. **Combined chunk over Telegram's byte envelope**: pre-flight splits
into byte-budgeted sub-chunks, avoiding the 413 entirely.
3. **Telegram-side chunk rejection after pre-flight**: fall back to
sending each item individually so partial delivery still happens.
"""
from __future__ import annotations
from typing import Any
from unittest.mock import patch
import aiohttp
import pytest
from aioresponses import aioresponses
from notify_bridge_core.notifications.telegram.client import (
TelegramClient,
_MediaItem,
)
from notify_bridge_core.notifications.telegram.media import (
TELEGRAM_MAX_GROUP_TOTAL_BYTES,
)
BOT_TOKEN = "TEST_TOKEN"
TG = f"https://api.telegram.org/bot{BOT_TOKEN}"
CHAT_ID = "-1001234567890"
# ---------------------------------------------------------------------------
# Pure unit tests for the new helpers
# ---------------------------------------------------------------------------
def _item(upload_bytes: int, media_type: str = "photo") -> _MediaItem:
"""Build a synthetic _MediaItem with the given upload byte cost."""
if upload_bytes == 0:
return _MediaItem(
media_json={"type": media_type, "media": "file_id_cached"},
cache_info=None,
attachment=None,
)
return _MediaItem(
media_json={"type": media_type, "media": "attach://x"},
cache_info=("ck", media_type, None, upload_bytes),
attachment=("x", b"\x00" * upload_bytes, "f.jpg", "image/jpeg"),
)
def test_split_empty_returns_empty() -> None:
assert TelegramClient._split_items_by_byte_budget([], 1000) == []
def test_split_fits_in_single_group() -> None:
items = [_item(10), _item(20), _item(30)]
groups = TelegramClient._split_items_by_byte_budget(items, 100)
assert len(groups) == 1
assert sum(it.upload_bytes for it in groups[0]) == 60
def test_split_packs_greedily_across_budget() -> None:
# Three items @ 40 each, budget 100 → groups of [40,40] and [40].
items = [_item(40), _item(40), _item(40)]
groups = TelegramClient._split_items_by_byte_budget(items, 100)
assert [len(g) for g in groups] == [2, 1]
assert sum(it.upload_bytes for it in groups[0]) == 80
assert sum(it.upload_bytes for it in groups[1]) == 40
def test_split_oversized_single_item_kept_alone() -> None:
# An item that exceeds the budget on its own goes alone — Telegram
# gets to return a precise per-item error instead of silently
# dropping it client-side.
items = [_item(200)]
groups = TelegramClient._split_items_by_byte_budget(items, 100)
assert len(groups) == 1
assert groups[0][0].upload_bytes == 200
def test_split_cached_items_are_free() -> None:
# Cached items contribute 0 bytes — they never force a split.
items = [_item(0), _item(0), _item(0)]
groups = TelegramClient._split_items_by_byte_budget(items, 10)
assert len(groups) == 1
assert len(groups[0]) == 3
def test_split_mixes_cached_and_fresh_correctly() -> None:
# Cached items piggyback freely into whatever group they land in.
items = [_item(40), _item(0), _item(40), _item(0), _item(40)]
groups = TelegramClient._split_items_by_byte_budget(items, 100)
# [40, 0, 40] = 80 bytes (fits), next 0 fits, next 40 starts new.
assert [len(g) for g in groups] == [4, 1]
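# The split tests above pin down a greedy, order-preserving packing rule.
# A minimal sketch of that rule follows, assuming only what the tests
# assert. ``Item`` and ``split_by_byte_budget`` are hypothetical names used
# for illustration; the real ``TelegramClient._split_items_by_byte_budget``
# may be implemented differently.

```python
from dataclasses import dataclass


@dataclass
class Item:
    """Hypothetical stand-in for _MediaItem; only the byte cost matters here."""

    upload_bytes: int


def split_by_byte_budget(items: list[Item], budget: int) -> list[list[Item]]:
    """Greedily pack items, in order, into groups whose byte sum stays within budget."""
    groups: list[list[Item]] = []
    current: list[Item] = []
    used = 0
    for item in items:
        # Close the current group only if it already holds something and
        # this item would push it over budget. An item larger than the
        # budget therefore still lands in a group of its own.
        if current and used + item.upload_bytes > budget:
            groups.append(current)
            current, used = [], 0
        current.append(item)
        used += item.upload_bytes
    if current:
        groups.append(current)
    return groups
```

# Zero-byte (cached) items never trigger the overflow check, which is why
# they ride along in whatever group is open when they are reached.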
def test_attach_caption_to_first_idempotent() -> None:
items = [_item(10), _item(10)]
TelegramClient._attach_caption_to_first(items, "Hello", "HTML")
assert items[0].media_json["caption"] == "Hello"
assert items[0].media_json["parse_mode"] == "HTML"
assert "caption" not in items[1].media_json
# Re-attaching overwrites in-place, doesn't duplicate.
TelegramClient._attach_caption_to_first(items, "Bye", "MarkdownV2")
assert items[0].media_json["caption"] == "Bye"
assert items[0].media_json["parse_mode"] == "MarkdownV2"
def test_attach_caption_truncates_to_telegram_limit() -> None:
from notify_bridge_core.notifications.telegram.media import (
TELEGRAM_MAX_CAPTION_LENGTH,
)
items = [_item(10)]
long_caption = "A" * (TELEGRAM_MAX_CAPTION_LENGTH + 500)
TelegramClient._attach_caption_to_first(items, long_caption, "HTML")
assert len(items[0].media_json["caption"]) <= TELEGRAM_MAX_CAPTION_LENGTH
def test_attach_caption_no_items_is_noop() -> None:
TelegramClient._attach_caption_to_first([], "x", "HTML") # must not raise
# ---------------------------------------------------------------------------
# Integration tests for the full _send_media_group flow
# ---------------------------------------------------------------------------
def _png_bytes(size: int) -> bytes:
    """Minimal valid PNG header + pad bytes to reach the requested size.

    Required so ``check_photo_limits`` can identify the bytes as an
    image rather than rejecting them. The PIL inspection only reads the
    header, so padding with zeros is harmless.
    """
    # 8-byte PNG signature + IHDR chunk for a 1x1 image (zero-padded
    # to size). Pillow accepts this well enough to read the dimensions;
    # the remaining bytes after IHDR are treated as trailing garbage.
    sig = b"\x89PNG\r\n\x1a\n"
    ihdr = bytes.fromhex(
        # length=13, type=IHDR, w=1, h=1, depth=8, color=2 (RGB),
        # compression=0, filter=0, interlace=0, crc=907753de
        # (the CRC is split across the two string literals below)
        "0000000d49484452000000010000000108020000009077"
        "53de"
    )
    base = sig + ihdr
    if len(base) >= size:
        return base[:size]
    return base + b"\x00" * (size - len(base))
async def _build_client(session: aiohttp.ClientSession) -> TelegramClient:
return TelegramClient(session, BOT_TOKEN)
@pytest.mark.asyncio
async def test_oversized_item_skipped_others_delivered() -> None:
"""One item over max_asset_data_size is dropped; siblings still go."""
mock_url_big = "http://assets.test/big.jpg"
mock_url_a = "http://assets.test/a.jpg"
mock_url_b = "http://assets.test/b.jpg"
max_size = 1_000_000 # 1 MB cap
# We pre-load bytes via the asset dict so we don't have to mock the
# asset HTTP server. Telegram side is mocked so sendMediaGroup
# returns a clean 200 with two message IDs.
assets = [
{"type": "photo", "url": mock_url_big, "data": _png_bytes(2_000_000)},
{"type": "photo", "url": mock_url_a, "data": _png_bytes(50_000)},
{"type": "photo", "url": mock_url_b, "data": _png_bytes(50_000)},
]
with aioresponses() as mocked:
mocked.post(
f"{TG}/sendMediaGroup",
payload={
"ok": True,
"result": [
{"message_id": 100, "photo": [{"file_id": "fa"}]},
{"message_id": 101, "photo": [{"file_id": "fb"}]},
],
},
)
async with aiohttp.ClientSession() as sess:
client = await _build_client(sess)
result = await client._send_media_group(
CHAT_ID, assets, max_asset_data_size=max_size,
)
assert result["success"] is True
assert result["delivered_count"] == 2
assert result["skipped_count"] == 1
assert result["failed_count"] == 0
assert result["message_ids"] == [100, 101]
@pytest.mark.asyncio
async def test_byte_budget_splits_into_sub_chunks() -> None:
"""Three items whose combined size exceeds the byte budget are pre-split into 2 calls."""
# Sized so 2 fit (sum < budget) but 3 don't (sum > budget) →
# [2 items, 1 item] split.
per_item = TELEGRAM_MAX_GROUP_TOTAL_BYTES // 3 + 1
# Use generated PNGs so check_photo_limits doesn't reject them as
# malformed; the size doesn't matter for the photo dimension check
# since the PNG header advertises 1x1.
assets = [
{"type": "photo", "url": f"http://t/{i}.jpg", "data": _png_bytes(per_item)}
for i in range(3)
]
calls: list[int] = []
def _ok_response_for_n(n: int) -> dict[str, Any]:
return {
"ok": True,
"result": [
{"message_id": 200 + i, "photo": [{"file_id": f"x{i}"}]}
for i in range(n)
],
}
with aioresponses() as mocked:
# We don't know the item count per call up front, so respond with
# 10-item payloads; this test never asserts on message IDs, so the
# surplus entries are harmless.
mocked.post(
f"{TG}/sendMediaGroup",
payload=_ok_response_for_n(10),
repeat=True,
)
async with aiohttp.ClientSession() as sess:
client = await _build_client(sess)
# Patch out check_photo_limits so the zero-padded PNG bodies cannot
# be rejected by photo validation; only the byte budget is under test.
with patch(
"notify_bridge_core.notifications.telegram.client.check_photo_limits",
return_value=(False, None, None, None),
):
result = await client._send_media_group(CHAT_ID, assets)
# Count outbound sendMediaGroup calls via the mock registry.
req_log = mocked.requests
send_calls = [
k for k in req_log if k[1].path.endswith("/sendMediaGroup")
]
assert len(send_calls) >= 1
# aioresponses keys its request log by (method, URL); each value is
# the list of calls made to that key, so collect the list lengths.
for k in send_calls:
calls.append(len(req_log[k]))
assert result["success"] is True
# Pre-split avoided 413 entirely.
assert result["failed_count"] == 0
# The 3 items went out across 2 sub-chunks (2+1).
assert sum(calls) == 2
@pytest.mark.asyncio
async def test_chunk_413_falls_back_to_per_item() -> None:
"""If Telegram 413s a chunk anyway, retry each item individually."""
assets = [
{"type": "photo", "url": f"http://t/{i}.jpg", "data": _png_bytes(50_000)}
for i in range(2)
]
with aioresponses() as mocked:
# The group send fails hard (Telegram-side rejection).
mocked.post(
f"{TG}/sendMediaGroup",
status=413,
payload={"ok": False, "error_code": 413, "description": "Request Entity Too Large"},
)
# Per-item fallback: two sendPhoto calls succeed.
mocked.post(
f"{TG}/sendPhoto",
payload={"ok": True, "result": {"message_id": 300, "photo": [{"file_id": "z0"}]}},
)
mocked.post(
f"{TG}/sendPhoto",
payload={"ok": True, "result": {"message_id": 301, "photo": [{"file_id": "z1"}]}},
)
async with aiohttp.ClientSession() as sess:
client = await _build_client(sess)
with patch(
"notify_bridge_core.notifications.telegram.client.check_photo_limits",
return_value=(False, None, None, None),
):
result = await client._send_media_group(CHAT_ID, assets)
assert result["success"] is True
assert result["delivered_count"] == 2
assert result["failed_count"] == 0
# We still record the original chunk-level error for diagnostics,
# tagged with kind="chunk" so operators can distinguish cause from
# per-item consequences.
assert result["errors"] is not None
chunk_errors = [e for e in result["errors"] if e.get("kind") == "chunk"]
assert len(chunk_errors) == 1
assert "Request Entity Too Large" in str(chunk_errors[0]["error"])
@pytest.mark.asyncio
async def test_chunk_failure_with_per_item_partial_failure() -> None:
"""Per-item fallback can itself partially fail; we report both."""
assets = [
{"type": "photo", "url": f"http://t/{i}.jpg", "data": _png_bytes(50_000)}
for i in range(2)
]
with aioresponses() as mocked:
mocked.post(
f"{TG}/sendMediaGroup",
status=400,
payload={"ok": False, "error_code": 400, "description": "Bad Request"},
)
# First per-item OK, second fails.
mocked.post(
f"{TG}/sendPhoto",
payload={"ok": True, "result": {"message_id": 400, "photo": [{"file_id": "p0"}]}},
)
mocked.post(
f"{TG}/sendPhoto",
status=400,
payload={"ok": False, "error_code": 400, "description": "PHOTO_INVALID_DIMENSIONS"},
)
async with aiohttp.ClientSession() as sess:
client = await _build_client(sess)
with patch(
"notify_bridge_core.notifications.telegram.client.check_photo_limits",
return_value=(False, None, None, None),
):
result = await client._send_media_group(CHAT_ID, assets)
# At least one item delivered → overall success.
assert result["success"] is True
assert result["delivered_count"] == 1
assert result["failed_count"] == 1
assert result["message_ids"] == [400]
# The failed item carries its index so operators can correlate
# with the original asset list.
item_errors = [e for e in result["errors"] if e.get("kind") == "item"]
assert len(item_errors) == 1
assert item_errors[0]["item_index"] == 1
@pytest.mark.asyncio
async def test_document_chunk_failure_falls_back_to_sendDocument() -> None:
"""Document items must hit /sendDocument in fallback, not /sendVideo.
Regression guard: an earlier draft routed any non-photo through
_VIDEO_KIND, silently misrouting documents to the video endpoint
where Telegram would reject them with a confusing error.
"""
assets = [
{"type": "document", "url": f"http://t/f{i}.bin", "data": b"\x00" * 50_000}
for i in range(2)
]
with aioresponses() as mocked:
mocked.post(
f"{TG}/sendMediaGroup",
status=400,
payload={"ok": False, "error_code": 400, "description": "Bad Request"},
)
mocked.post(
f"{TG}/sendDocument",
payload={"ok": True, "result": {"message_id": 500, "document": {"file_id": "d0"}}},
)
mocked.post(
f"{TG}/sendDocument",
payload={"ok": True, "result": {"message_id": 501, "document": {"file_id": "d1"}}},
)
async with aiohttp.ClientSession() as sess:
client = await _build_client(sess)
result = await client._send_media_group(CHAT_ID, assets)
# No /sendVideo or /sendPhoto calls should have been made.
for key in mocked.requests:
assert "/sendVideo" not in key[1].path
assert "/sendPhoto" not in key[1].path
assert result["success"] is True
assert result["delivered_count"] == 2
assert result["message_ids"] == [500, 501]
@pytest.mark.asyncio
async def test_oversized_video_deferred_as_document_when_opted_in() -> None:
"""Oversized videos are sent as documents post-chunk when the flag is set.
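Telegram caps sendVideo at 50 MB; the 2 GB sendDocument ceiling this
fallback relies on is the limit documented for a local Bot API server
(the hosted Bot API documents 50 MB for sendDocument as well). With
``send_large_videos_as_documents=True``, an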
oversized video should be deferred out of the media group, then
delivered as its own document send instead of being silently
dropped. Other items in the same group must ride through the
normal sendMediaGroup path unaffected.
"""
# 60 MB exceeds the 50 MB sendVideo cap but is under document's 2 GB cap.
oversized_video = b"\x00" * (60 * 1024 * 1024)
assets = [
{"type": "video", "url": "http://t/big.mp4", "data": oversized_video,
"content_type": "video/mp4"},
{"type": "photo", "url": "http://t/a.jpg", "data": _png_bytes(50_000)},
{"type": "photo", "url": "http://t/b.jpg", "data": _png_bytes(50_000)},
]
with aioresponses() as mocked:
# The 2 photos ride out in sendMediaGroup together.
mocked.post(
f"{TG}/sendMediaGroup",
payload={
"ok": True,
"result": [
{"message_id": 700, "photo": [{"file_id": "p0"}]},
{"message_id": 701, "photo": [{"file_id": "p1"}]},
],
},
)
# The deferred video lands as a document after the chunk.
mocked.post(
f"{TG}/sendDocument",
payload={"ok": True, "result": {"message_id": 702, "document": {"file_id": "d0"}}},
)
async with aiohttp.ClientSession() as sess:
client = await _build_client(sess)
with patch(
"notify_bridge_core.notifications.telegram.client.check_photo_limits",
return_value=(False, None, None, None),
):
result = await client._send_media_group(
CHAT_ID, assets,
send_large_videos_as_documents=True,
)
# sendVideo must NOT have been called — the oversized video
# bypasses sendVideo entirely and goes straight to sendDocument.
for key in mocked.requests:
assert "/sendVideo" not in key[1].path
assert result["success"] is True
assert result["delivered_count"] == 3
assert result["skipped_count"] == 0
assert result["failed_count"] == 0
assert sorted(result["message_ids"]) == [700, 701, 702]
@pytest.mark.asyncio
async def test_oversized_video_skipped_when_flag_off() -> None:
"""Without the opt-in flag, oversized videos are dropped (legacy behavior)."""
oversized_video = b"\x00" * (60 * 1024 * 1024)
assets = [
{"type": "video", "url": "http://t/big.mp4", "data": oversized_video,
"content_type": "video/mp4"},
{"type": "photo", "url": "http://t/a.jpg", "data": _png_bytes(50_000)},
]
with aioresponses() as mocked:
mocked.post(
f"{TG}/sendMediaGroup",
payload={
"ok": True,
"result": [{"message_id": 800, "photo": [{"file_id": "p0"}]}],
},
)
async with aiohttp.ClientSession() as sess:
client = await _build_client(sess)
with patch(
"notify_bridge_core.notifications.telegram.client.check_photo_limits",
return_value=(False, None, None, None),
):
result = await client._send_media_group(CHAT_ID, assets)
# No sendDocument call either — video is simply dropped.
for key in mocked.requests:
assert "/sendDocument" not in key[1].path
assert result["success"] is True
assert result["delivered_count"] == 1
assert result["skipped_count"] == 1
@pytest.mark.asyncio
async def test_all_items_oversized_returns_failure() -> None:
"""When every asset is filtered before send, success is False."""
assets = [
{"type": "photo", "url": "http://t/big.jpg", "data": _png_bytes(5_000_000)}
for _ in range(2)
]
async with aiohttp.ClientSession() as sess:
client = await _build_client(sess)
# No HTTP mock needed — nothing should reach Telegram.
result = await client._send_media_group(
CHAT_ID, assets, max_asset_data_size=1_000_000,
)
assert result["success"] is False
assert result["delivered_count"] == 0
assert result["skipped_count"] == 2
assert result["failed_count"] == 0
assert "filtered" in result["error"]
@@ -0,0 +1,249 @@
"""Per-send Telegram options (`disable_notification`, `message_thread_id`).
Verifies the ContextVar-based plumbing inside ``TelegramClient`` so the
two new flags actually land in the request payloads at all four send
paths (sendMessage, single-asset send, media-group, cache-hit POST) and
that concurrent ``asyncio.gather`` fan-outs in the dispatcher don't leak
options between tasks.
"""
from __future__ import annotations
import asyncio
import json
from typing import Any
import pytest
from aiohttp import FormData
def test_telegram_receiver_factory_reads_new_fields() -> None:
"""The receiver factory turns config-dict keys into typed fields."""
from notify_bridge_core.notifications.receiver import (
TelegramReceiver, build_receiver,
)
recv = build_receiver(
"telegram",
{
"chat_id": "12345",
"disable_notification": True,
"message_thread_id": "7", # string form, common from JSON UI
},
)
assert isinstance(recv, TelegramReceiver)
assert recv.chat_id == "12345"
assert recv.disable_notification is True
assert recv.message_thread_id == 7
def test_telegram_receiver_factory_defaults_when_missing() -> None:
"""Missing keys default to off / general topic."""
from notify_bridge_core.notifications.receiver import (
TelegramReceiver, build_receiver,
)
recv = build_receiver("telegram", {"chat_id": "12345"})
assert isinstance(recv, TelegramReceiver)
assert recv.disable_notification is False
assert recv.message_thread_id is None
@pytest.mark.parametrize(
"raw_thread, expected",
[
(None, None),
("", None),
("not-a-number", None),
("42", 42),
(42, 42),
# ``0`` is Telegram's "general topic" sentinel — collapse to None
# so the Bot API just omits the field, matching the frontend's
# ``<= 0 → unset`` behaviour.
("0", None),
(0, None),
(-5, None),
# bool would otherwise pass through as int(True)==1 / int(False)==0
# and silently route into topic #1; reject explicitly.
(True, None),
(False, None),
],
)
def test_telegram_receiver_thread_id_coercion(raw_thread: Any, expected: Any) -> None:
from notify_bridge_core.notifications.receiver import build_receiver
recv = build_receiver(
"telegram",
{"chat_id": "1", "message_thread_id": raw_thread},
)
assert recv.message_thread_id == expected # type: ignore[attr-defined]
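# The parametrized cases above amount to a small coercion table. One way
# to satisfy it is sketched below. ``coerce_thread_id`` is a hypothetical
# helper name; how ``build_receiver`` actually performs the coercion is
# not shown in this diff.

```python
from typing import Any, Optional


def coerce_thread_id(raw: Any) -> Optional[int]:
    """Return a positive forum-topic id, or None when the value is unusable."""
    # bool is a subclass of int, so it must be rejected before int() runs;
    # otherwise True would silently become topic 1.
    if raw is None or isinstance(raw, bool):
        return None
    try:
        value = int(raw)
    except (TypeError, ValueError):
        # Empty strings and non-numeric text end up here.
        return None
    # Zero and negative values are treated as "unset" so the field is
    # omitted from the request entirely.
    return value if value > 0 else None
```

# Checking ``bool`` first matters because ``int(True) == 1`` would
# otherwise pass the positivity test.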
def test_apply_send_opts_to_payload_merges_when_bound() -> None:
"""Inside ``_bind_send_options``, payload helper writes the two keys."""
from notify_bridge_core.notifications.telegram.client import (
_SendOptions,
_apply_send_opts_to_payload,
_bind_send_options,
)
payload: dict[str, Any] = {"chat_id": "1"}
with _bind_send_options(_SendOptions(disable_notification=True, message_thread_id=7)):
_apply_send_opts_to_payload(payload)
assert payload["disable_notification"] is True
assert payload["message_thread_id"] == 7
def test_apply_send_opts_to_payload_omits_when_default() -> None:
"""No bind = no flags written (Bot API treats omission as default)."""
from notify_bridge_core.notifications.telegram.client import (
_apply_send_opts_to_payload,
)
payload: dict[str, Any] = {"chat_id": "1"}
_apply_send_opts_to_payload(payload)
assert "disable_notification" not in payload
assert "message_thread_id" not in payload
def test_apply_send_opts_to_form_merges_when_bound() -> None:
"""Multipart payload helper writes the two fields when bound."""
from notify_bridge_core.notifications.telegram.client import (
_SendOptions,
_apply_send_opts_to_form,
_bind_send_options,
)
form = FormData()
with _bind_send_options(_SendOptions(disable_notification=True, message_thread_id=42)):
_apply_send_opts_to_form(form)
# aiohttp.FormData stores fields as ``(MultiDict{name, ...}, headers, value)``.
name_to_value = {}
for type_opts, _headers, value in form._fields: # type: ignore[attr-defined]
name_to_value[type_opts.get("name")] = value
assert name_to_value.get("disable_notification") == "true"
assert name_to_value.get("message_thread_id") == "42"
def test_bind_send_options_resets_on_exit() -> None:
"""Token-reset semantics: the var is restored even after a raise."""
from notify_bridge_core.notifications.telegram.client import (
_SendOptions,
_bind_send_options,
_send_options_var,
)
default = _send_options_var.get()
try:
with _bind_send_options(_SendOptions(disable_notification=True)):
raise RuntimeError("boom")
except RuntimeError:
pass
assert _send_options_var.get() == default
@pytest.mark.asyncio
async def test_concurrent_binds_do_not_leak_between_tasks() -> None:
"""Two ``asyncio.gather`` tasks see only their own bound options.
This is the load-bearing invariant for the dispatcher's per-receiver
fan-out: one chat with ``disable_notification=True`` must not silence
a peer chat in the same dispatch.
"""
from notify_bridge_core.notifications.telegram.client import (
_SendOptions,
_apply_send_opts_to_payload,
_bind_send_options,
)
results: list[dict[str, Any]] = []
async def run_with(opts: _SendOptions, label: str) -> None:
payload: dict[str, Any] = {"label": label}
with _bind_send_options(opts):
# Yield to the loop to interleave with the sibling task.
await asyncio.sleep(0)
_apply_send_opts_to_payload(payload)
results.append(payload)
await asyncio.gather(
run_with(_SendOptions(disable_notification=True, message_thread_id=1), "silent"),
run_with(_SendOptions(disable_notification=False, message_thread_id=2), "loud"),
)
by_label = {r["label"]: r for r in results}
assert by_label["silent"].get("disable_notification") is True
assert by_label["silent"].get("message_thread_id") == 1
assert "disable_notification" not in by_label["loud"] # False → omitted
assert by_label["loud"].get("message_thread_id") == 2
@pytest.mark.asyncio
async def test_send_message_passes_options_into_payload() -> None:
"""``send_message(disable_notification=True, message_thread_id=N)``
surfaces both keys in the JSON request body."""
from notify_bridge_core.notifications.telegram.client import TelegramClient
captured: dict[str, Any] = {}
class _FakeResp:
status = 200
async def json(self) -> dict[str, Any]:
return {"ok": True, "result": {"message_id": 99}}
async def __aenter__(self) -> "_FakeResp":
return self
async def __aexit__(self, *args: Any) -> None:
return None
class _FakeSession:
def post(self, url: str, *, json: dict[str, Any] | None = None, **_kw: Any) -> _FakeResp:
captured["url"] = url
captured["json"] = json
return _FakeResp()
client = TelegramClient(_FakeSession(), "TEST:token") # type: ignore[arg-type]
result = await client.send_message(
chat_id="123",
text="hello",
disable_notification=True,
message_thread_id=5,
)
assert result["success"] is True
payload = captured["json"]
assert payload["disable_notification"] is True
assert payload["message_thread_id"] == 5
@pytest.mark.asyncio
async def test_send_message_without_options_omits_keys() -> None:
"""Default kwargs leave the payload Bot-API-clean."""
from notify_bridge_core.notifications.telegram.client import TelegramClient
captured: dict[str, Any] = {}
class _FakeResp:
status = 200
async def json(self) -> dict[str, Any]:
return {"ok": True, "result": {"message_id": 1}}
async def __aenter__(self) -> "_FakeResp":
return self
async def __aexit__(self, *args: Any) -> None:
return None
class _FakeSession:
def post(self, url: str, *, json: dict[str, Any] | None = None, **_kw: Any) -> _FakeResp:
captured["json"] = json
return _FakeResp()
client = TelegramClient(_FakeSession(), "TEST:token") # type: ignore[arg-type]
await client.send_message(chat_id="123", text="hello")
payload = captured["json"]
assert "disable_notification" not in payload
assert "message_thread_id" not in payload