alexei.dolgolyov 6a8f374678 feat: observability, per-receiver Telegram options, oversized-video fallback

Operability:
- Correlation IDs end-to-end: shared dispatch_id between log lines and
  EventLog rows (event/watcher/scheduled/deferred/action/HA/command paths)
  and a new X-Request-Id middleware that normalizes inbound ids and binds
  request_id into log context.
- dispatch_summary block merged into EventLog.details: per-target
  success/failure counts plus Telegram media delivered/skipped/failed and
  truncated error lists, so partial outcomes surface in the UI.
- Diagnostic mode: admin can flip one module to DEBUG for a bounded
  window with auto-revert (in-memory only; setup_logging() resets on
  boot, lifespan reverts on shutdown). New /diagnostic-mode endpoints
  plus DiagnosticsCassette UI on the settings page.

Telegram:
- Per-receiver options: disable_notification (silent send) and
  message_thread_id (forum-topic routing), wired through the dispatcher
  via a ContextVar so all four send sites (sendMessage / sendPhoto-Video-
  Document / sendMediaGroup / cache-hit POST) pick them up.
- send_large_videos_as_documents target setting: bypass the 50 MB
  sendVideo cap by falling back to sendDocument for oversized videos.
- sendMediaGroup byte-budget enforcement (TELEGRAM_MAX_GROUP_TOTAL_BYTES,
  45 MB) with per-item fallback on chunk failure so a stale file_id no
  longer silently drops a cached asset.

Tests:
- New: diagnostic_mode, dispatch_summary, request_correlation,
  telegram_media_group_partial, telegram_per_send_options.

Docs:
- .claude/reviews/: six-axis production-readiness review of v0.8.1.
- .claude/docs/functional-review-2026-05-28.md: focused review of
  Telegram/Immich/logging subsystems.

2026-05-28 15:19:31 +03:00

Bugs + Missing Features — Production-Readiness Review

Repo: c:\Users\Alexei\Documents\service-to-notification-bridge (v0.8.1 baseline)
Date: 2026-05-22
Scope: full repo (backend Python/FastAPI, Svelte 5 frontend, providers + dispatchers + bot commands)

Executive summary

The code is in much better shape than typical pre-1.0 code. Quiet-hours, SSRF, JWT, secret redaction, rate-limit fan-out caps, partition-by-media-kind, parse_mode retry, scheduler misfire-grace, Prometheus metrics, deep healthcheck, and per-receiver render cache are all already implemented and well-tested.
The single biggest shipping risk is webhook idempotency. Gitea, Planka, and the generic webhook endpoint all dispatch on every POST regardless of redelivery — there is no X-Gitea-Delivery / X-Hub-Delivery dedup table. An upstream retry storm sends the same notification N times.
The deferred-dispatch drain has a duplicate-send window if the process dies between dispatcher.dispatch() returning and session.commit() — the row stays pending and the periodic catch-up scan re-drains it.
Telegram update offset (_last_update_id) is in-memory only — on restart, the bot replays already-handled updates or skips ones Telegram has discarded. Combined with no per-update idempotency, this is a duplicate-command surface.
Several Telegram features are silently unsupported: forum threads (message_thread_id), bot-blocked-by-user detection (403 → keep retrying forever), and inline-button callback queries. None blocks shipping today but each is a near-term ask from any real user.
No template versioning / dry-run / playground — every template edit is immediately live. There is no way to validate a new template against a sample payload before flipping the switch, and no rollback path.
Frontend lacks bulk operations and import/export of templates+targets. An operator with 30 trackers cannot bulk-toggle, bulk-edit, or move a template across users.

Part A — Bugs and reliability issues

Severity legend: CRITICAL = data loss / duplicate user-visible messages / silent stop-shipping; HIGH = wrong behavior under realistic conditions; MEDIUM = degrades UX or operability; LOW = polish.

CRITICAL

A1. Webhook redelivery causes duplicate notifications (no idempotency)

Location: packages/server/src/notify_bridge_server/api/webhooks.py:156 (gitea_webhook), :225 (planka_webhook), :427 (generic_webhook). Scenario: Gitea retries a webhook after 30s if the bridge returns 5xx, times out under load, or if the operator clicks "Test Delivery" twice. Every retry produces a fresh notification because the handlers never check X-Gitea-Delivery (Gitea's per-delivery UUID), nor do they record any event_id/hash for parse_generic_webhook events. Fix: Add a webhook_delivery table with (provider_id, delivery_id) unique constraint and created_at. Insert before dispatch (INSERT OR IGNORE on SQLite, ON CONFLICT DO NOTHING on Postgres); if the insert is a no-op, return {"ok": true, "skipped": "duplicate"}. For Gitea use the X-Gitea-Delivery header; for Planka use a hash of event_type + payload.id + payload.createdAt; for generic webhooks use a configurable JSONPath expression to derive an idempotency key, falling back to a SHA256 of the raw body. TTL prune older than 7 days.
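
The insert-before-dispatch step can be sketched with the standard-library sqlite3 module. This is only an illustration under assumptions: the column types, the claim_delivery helper name, and the raw-body SHA256 fallback for a missing delivery id are hypothetical, not the repo's actual schema or code.

```python
import hashlib
import sqlite3
from typing import Optional


def init_schema(conn: sqlite3.Connection) -> None:
    # Hypothetical table shape; only the (provider_id, delivery_id) unique
    # constraint and created_at come from the proposal above.
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS webhook_delivery (
            provider_id INTEGER NOT NULL,
            delivery_id TEXT NOT NULL,
            created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
            UNIQUE (provider_id, delivery_id)
        )
        """
    )


def claim_delivery(
    conn: sqlite3.Connection,
    provider_id: int,
    delivery_id: Optional[str],
    raw_body: bytes,
) -> bool:
    """Return True if this delivery is new, False if it was already seen."""
    # Fall back to a hash of the raw body when the provider sends no id.
    key = delivery_id or hashlib.sha256(raw_body).hexdigest()
    cur = conn.execute(
        "INSERT OR IGNORE INTO webhook_delivery (provider_id, delivery_id) "
        "VALUES (?, ?)",
        (provider_id, key),
    )
    conn.commit()
    # rowcount is 0 when the unique constraint made the insert a no-op.
    return cur.rowcount == 1
```

A handler would call claim_delivery first and return {"ok": true, "skipped": "duplicate"} when it yields False, so the dedup decision is committed before any dispatch starts.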

A2. Deferred-dispatch drain can double-send on process crash

Location: packages/server/src/notify_bridge_server/services/deferred_dispatch.py:721-758. Scenario: Inside _process_row, dispatcher.dispatch() actually delivers the Telegram message (HTTP 200 returned, user phone buzzes). The function then sets row.status = "fired" (line 734) but the surrounding session.commit() (line 577) hasn't run yet. The process is killed (OOM, SIGTERM during deploy, host reboot). On restart, _run_deferred_drain_catchup re-fetches the still-pending row and dispatches it again, so the user gets the same album twice. Fix: Either (a) record an outbound dedup key per-row before dispatch (row.dispatch_id = uuid4(); session.commit() first), then ask the channel client to send-or-no-op based on that ID; or (b) flip the row to an "in_flight" state with a short timeout in a pre-dispatch transaction so a restart sees it as poisoned and aborts. Option (a) is more correct but needs per-channel cooperation; option (b) is the cheap fix.

A3. Telegram update offset is in-memory only — restart replays or loses commands

Location: packages/server/src/notify_bridge_server/services/telegram_poller.py:31 (_last_update_id: dict[int, int] = {}). Scenario: A user types /random Family. Telegram delivers update_id=4711. The bridge processes the command, sends back the media, and crashes before APScheduler ticks again. On restart, _last_update_id is empty, so we call getUpdates(offset=None) → Telegram returns 4711 again → we send the user the same album a second time. Conversely, if Telegram's 24-hour retention expired during a long outage, we silently skip pending updates. Fix: Persist last_update_id in DB (telegram_bot.last_update_id column). Combine with A2-style command idempotency by inserting (bot_id, update_id) into a dedup table before processing.

HIGH

A4. Telegram "bot blocked by user" / "chat not found" never short-circuits

Location: packages/core/src/notify_bridge_core/notifications/telegram/client.py (send_message, _upload_media, etc.). Errors with error_code == 403 (Forbidden, "Bot was blocked by the user") and 400 "chat not found" / "user is deactivated" are returned as failures, but nothing records them in a way that lets the receiver be removed or disabled. Scenario: A user blocks the bot. Every scheduled "Good morning memory" fires a sendMessage that Telegram instantly 403s. The bridge logs an error, moves on, and repeats forever. The bridge_self target-failure counter eventually fires but the underlying receiver is never disabled. With many such chats the operator has no easy cleanup path. Fix: In the dispatcher, on error_code in (403, 400 with description matching "chat not found"/"user is deactivated"), automatically set TelegramChat.commands_enabled = False and either flag the receiver as disabled with reason blocked_by_user or surface it via a new /admin/blocked-chats view. Also stop further retries that round.
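
A minimal sketch of the classification step, assuming the client exposes Telegram's error_code and description fields to the dispatcher. The function name and the marker tuple are hypothetical; the matched descriptions are the ones named above.

```python
# Descriptions that indicate the chat can never be reached again.
PERMANENT_400_MARKERS = ("chat not found", "user is deactivated")


def is_permanent_chat_error(error_code: int, description: str) -> bool:
    """Return True when retrying the same chat is pointless."""
    if error_code == 403:
        # Forbidden, e.g. "bot was blocked by the user".
        return True
    if error_code == 400:
        lowered = description.lower()
        return any(marker in lowered for marker in PERMANENT_400_MARKERS)
    return False
```

The dispatcher could then disable the receiver and skip the remaining retries for that round whenever this returns True, while leaving ordinary 400s (such as parse errors) on the normal failure path.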

A5. Telegram forum-thread (topic) routing not supported

Location: telegram client never accepts/sends message_thread_id. Scenario: Operator points the bridge at a group's "Releases" forum topic. Today every message lands in the General topic instead — there is no way to specify the topic. This is a hard requirement for any non-trivial group install. Currently reply_parameters is the only thread-adjacent field used; message_thread_id is silently absent. Fix: Add an optional message_thread_id per-receiver (or per-target) config, pass through send_message, _upload_media, and _post_media_group. Auto-extract from incoming command updates' message.message_thread_id so the bot can reply into the same topic.

A6. `bot.token` read after commit without refresh in webhook flow

Location: packages/server/src/notify_bridge_server/commands/webhook.py:92-97. Scenario: The comment acknowledges "AsyncSession expires instances on commit" and snapshots bot_id/bot_token before commit, but await session.refresh(bot) is also called after the commit. If session.refresh fails (e.g. row was deleted by an admin concurrently — bot rotation), the exception is caught as a warning and the rest of the handler still runs using the stale local bot_id/bot_token. The window is small but real. Fix: Remove the session.refresh(bot) since the snapshot already covers everything the handler needs. The refresh adds risk for no gain.

A7. Deferred-dispatch coalescing has a JSON-mutation bug under concurrent defers

Location: packages/server/src/notify_bridge_server/services/deferred_dispatch.py:307 (_find_pending_asset_rows). Scenario: Two near-simultaneous assets_added events for the same (link_id, collection_id) from two upstream pollers (HA chat-bus + periodic Immich). Both call defer_event concurrently. The two transactions both see "no pending row", both session.add(new_row), and SQLite cheerfully inserts two rows. The drain then fires both, sending the same combined media twice. Note that the partial UNIQUE index from v0.8.1 protects only the bridge_self provider row, not the deferred queue. Fix: Add a partial UNIQUE index UNIQUE(link_id, collection_id, event_type) WHERE status = 'pending' on deferred_dispatch, then convert defer_event to INSERT ... ON CONFLICT (link_id, collection_id, event_type) DO UPDATE and merge event_payload inside the SQL or in a re-read+retry loop.

A8. Quiet-hours overnight window + DST transition can produce wrong fire_at

Location: packages/server/src/notify_bridge_server/services/dispatch_helpers.py:121-128. Scenario: A user in Europe/Minsk (UTC+3, no DST anymore) who sets quiet hours 22:00-06:00 is unaffected. For a user in a DST-observing zone (e.g. America/New_York), on the "spring forward" night where 2:00 → 3:00, an event arriving at 01:30 local time gets end_today = now_local.replace(hour=6, minute=0). But .replace() is plain wall-clock substitution and ignores DST adjustments: a window end that falls between 02:00 and 03:00 sits in the skipped hour, and on the "fall back" night it has ambiguous DST status. Two hours later, the dispatcher sees the quiet window as "still active" or "30 min ago" depending on the system. Fix: After .replace(hour=t_end.hour, minute=t_end.minute, ...), normalize the result by round-tripping through UTC with astimezone (zoneinfo has no pytz-style tz.localize) and explicitly handle the fold= parameter. Add tests using zoneinfo.ZoneInfo("America/New_York") and known DST transition dates.
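
One way to make the window end DST-safe is to convert it to a UTC instant as soon as it is computed. The sketch below assumes now_local carries a zoneinfo tzinfo; quiet_window_end is a hypothetical helper, not the function in dispatch_helpers.py.

```python
from datetime import datetime, time, timedelta, timezone
from zoneinfo import ZoneInfo


def quiet_window_end(now_local: datetime, t_end: time) -> datetime:
    """Return the next occurrence of wall-clock t_end as a UTC instant."""
    # fold=0 picks the first occurrence of an ambiguous (repeated) wall time.
    candidate = now_local.replace(
        hour=t_end.hour, minute=t_end.minute, second=0, microsecond=0, fold=0
    )
    # Same-tzinfo comparison is done on wall-clock values.
    if candidate <= now_local:
        candidate += timedelta(days=1)
    # Converting to UTC resolves the offset for that date, so a wall time
    # inside a skipped hour maps to a real instant after the gap.
    return candidate.astimezone(timezone.utc)


ny = ZoneInfo("America/New_York")
now = datetime(2024, 3, 10, 1, 30, tzinfo=ny)  # spring-forward night
print(quiet_window_end(now, time(6, 0)))   # 06:00 EDT
print(quiet_window_end(now, time(2, 30)))  # nonexistent wall time
```

Storing and comparing fire_at in UTC means the later "is the window still active" check does not depend on wall-clock arithmetic at all. The example requires the system time zone database (or the tzdata package) to be available.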

A9. Quiet-hours `start == end` returns None — silently no quiet hours

Location: packages/server/src/notify_bridge_server/services/dispatch_helpers.py:110-111. Scenario: User UI submits quiet_hours_start = "00:00" and quiet_hours_end = "00:00", thinking "all day quiet". The function returns None (no quiet window) — the user gets pinged at 3am even though the UI says "quiet hours enabled". Same code path eats malformed times silently. Fix: Bubble up ValueError/malformed input to the API validator on write so the user gets a 422 with a specific error message rather than silently broken behavior. Define 00:00-00:00 as "always quiet" or reject it explicitly with a clear error.
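
A write-time validator along these lines would turn both failure modes into explicit errors. It is a sketch that takes the "reject 00:00-00:00" option; parse_quiet_hours is a hypothetical name, and mapping the ValueError to a 422 is left to the API layer.

```python
from datetime import time
from typing import Tuple


def parse_quiet_hours(start: str, end: str) -> Tuple[time, time]:
    """Parse HH:MM strings and reject empty or malformed windows."""
    try:
        t_start = time.fromisoformat(start)
        t_end = time.fromisoformat(end)
    except ValueError as exc:
        raise ValueError(
            f"quiet hours must be HH:MM, got {start!r}-{end!r}"
        ) from exc
    if t_start == t_end:
        # An equal start and end is ambiguous ("never" vs "always quiet").
        raise ValueError("quiet_hours_start and quiet_hours_end must differ")
    return t_start, t_end
```

An overnight window such as 22:00-06:00 passes through unchanged; only the degenerate and malformed cases are refused.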

A10. Telegram `_truncate` cuts mid-HTML-tag → parse_mode fallback then loses formatting

Location: packages/core/src/notify_bridge_core/notifications/telegram/client.py:144-149 (_truncate). Scenario: A template renders to more than 4096 chars and an <a href="https://...">...</a> straddles the 4096-character boundary. The truncate function takes a flat string slice, so the final character may be inside a tag → Telegram returns 400 "can't parse entities" → the retry strips parse_mode → the user sees <a href="..."> literally in their chat. Fix: Make _truncate HTML-aware: scan from the right and back the cut off to the start of any tag it would split, OR strip incomplete tags after truncating. A simpler intermediate fix: drop any unclosed <a>/<b>/<i> detected by a regex over the truncated string.
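
The "strip incomplete tags, then close what is still open" variant can be sketched as follows. It assumes the rendered text follows Telegram's HTML rules (literal < and & are always escaped) and counts Python characters rather than UTF-16 units; truncate_html is a hypothetical replacement, not the existing _truncate.

```python
import re

_TAG_RE = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9-]*)[^<>]*>")
_PARTIAL_ENTITY_RE = re.compile(r"&[#\w]*$")


def _closing_tags(fragment: str) -> str:
    """Return closing tags for every tag still open at the end of fragment."""
    stack = []
    for match in _TAG_RE.finditer(fragment):
        closing, name = match.group(1), match.group(2).lower()
        if closing:
            if stack and stack[-1] == name:
                stack.pop()
        else:
            stack.append(name)
    return "".join(f"</{name}>" for name in reversed(stack))


def truncate_html(text: str, limit: int = 4096, marker: str = "…") -> str:
    if len(text) <= limit:
        return text
    budget = limit - len(marker)
    while budget > 0:
        cut = text[:budget]
        # Drop a trailing partial tag such as '<a href="https://exa'.
        last_open = cut.rfind("<")
        if last_open > cut.rfind(">"):
            cut = cut[:last_open]
        # Drop a trailing partial entity such as '&am'.
        cut = _PARTIAL_ENTITY_RE.sub("", cut)
        candidate = cut + marker + _closing_tags(cut)
        if len(candidate) <= limit:
            return candidate
        # The closing tags pushed us over; shrink the budget and retry.
        budget -= len(candidate) - limit
    return marker[:limit]
```

Because the closing tags are counted against the limit, the result never exceeds it, and the parse_mode fallback should no longer be triggered by truncation alone.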

A11. JSON-payload depth/size hardened in backup, not in webhooks

Location: packages/server/src/notify_bridge_server/api/webhooks.py:43-71 (_read_bounded_body only caps total bytes). Scenario: Generic webhook accepts a 999KB payload (under the 1MB cap) but with 50 levels of nesting. json.loads succeeds, then parse_generic_webhook evaluates JSONPath expressions in a loop and the CPU spends seconds chasing pointers. Multiple concurrent malicious requests can peg the event loop. Fix: Reuse the depth/node guards from packages/server/src/notify_bridge_server/services/backup_service.py (JSON depth cap 10, node count cap 100k). Either share the helper or re-implement around json.loads(object_pairs_hook=...).
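
If sharing the backup_service helper is impractical, a post-parse guard is short to write. The sketch below is an independent illustration using the caps quoted above (depth 10, 100k nodes); load_bounded, check_json_shape, and PayloadTooComplex are hypothetical names.

```python
import json
from typing import Any


class PayloadTooComplex(ValueError):
    """Raised when a JSON payload exceeds the depth or node budget."""


def check_json_shape(value: Any, max_depth: int = 10, max_nodes: int = 100_000) -> None:
    # Iterative walk so the guard itself cannot overflow the call stack.
    stack = [(value, 1)]
    nodes = 0
    while stack:
        item, depth = stack.pop()
        nodes += 1
        if nodes > max_nodes:
            raise PayloadTooComplex(f"more than {max_nodes} nodes")
        if isinstance(item, dict):
            children = list(item.values())
        elif isinstance(item, list):
            children = item
        else:
            continue
        if depth > max_depth:
            raise PayloadTooComplex(f"nesting deeper than {max_depth}")
        stack.extend((child, depth + 1) for child in children)


def load_bounded(raw: str, max_depth: int = 10, max_nodes: int = 100_000) -> Any:
    try:
        value = json.loads(raw)
    except RecursionError as exc:
        # Extremely deep nesting can exhaust the parser's recursion limit.
        raise PayloadTooComplex("nesting too deep to parse") from exc
    check_json_shape(value, max_depth, max_nodes)
    return value
```

Running the check before any JSONPath evaluation keeps the expensive traversal off payloads that would be rejected anyway.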

A12. Generic-webhook `auth_mode="none"` with `acknowledge_unauthenticated` is per-provider, not per-user

Location: packages/server/src/notify_bridge_server/api/webhooks.py:294-323. Scenario: v0.8.1 added the acknowledge_unauthenticated=true opt-in, but it's only stored in provider.config JSON. For a multi-user install where one user accepts unauthenticated webhooks and another doesn't, per-provider storage would suffice. But webhook URLs are not secret in real deployments (they end up in upstream config files, logs, build artifacts), so anyone who obtains the URL also obtains the token in it. That makes auth_mode="none" dangerous beyond "explicit opt-in": an attacker who learns or guesses the path can DoS the rate limiter by burning the 60/min budget. Fix: Refuse to even create a webhook provider with auth_mode="none" in production unless a separate environment guard NOTIFY_BRIDGE_ALLOW_UNAUTHENTICATED_WEBHOOKS is set; AND drop the rate limit to 10/min for auth_mode="none" providers.

A13. `_extract_retry_after` returns int but Telegram `retry_after` is fractional

Location: packages/core/src/notify_bridge_core/notifications/telegram/client.py:59-78. Scenario: Modern Telegram sometimes returns retry_after as a float (e.g. 1.5). The current code does int(group(1)) and isinstance(ra, (int, float)). Regex \d+ only matches integers. So a 1.5s retry-after becomes "no retry-after found" → fallback 1s sleep → retry too early → second 429 → eventually the bounded retry budget runs out. Fix: Loosen the regex to \d+(?:\.\d+)? and float(m.group(1)), preserve fractional via await asyncio.sleep(retry_after + 1) with float.
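
The loosened parsing could look like the sketch below. It assumes the error response is the decoded Bot API JSON, where retry_after lives under parameters and is echoed in the description text; the function name mirrors _extract_retry_after but this is not the repo's code.

```python
import re
from typing import Optional

# Accepts "retry after 35" as well as "retry after 1.5".
_RETRY_AFTER_RE = re.compile(r"retry after (\d+(?:\.\d+)?)", re.IGNORECASE)


def extract_retry_after(response: dict) -> Optional[float]:
    """Return the retry delay in seconds, or None if none is present."""
    params = response.get("parameters") or {}
    ra = params.get("retry_after")
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(ra, (int, float)) and not isinstance(ra, bool):
        return float(ra)
    match = _RETRY_AFTER_RE.search(response.get("description") or "")
    return float(match.group(1)) if match else None
```

The caller can pass the returned float straight to asyncio.sleep, which accepts fractional seconds.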

A14. APScheduler date-job collision when two windows end at the exact same second

Location: packages/server/src/notify_bridge_server/services/scheduler.py:1127-1132 (_drain_job_id_for). The job id is keyed on YYYYMMDDHHMMSS. Comment in code acknowledges "two trackers... seconds different ... would collide", but two windows ending at the exact same second still collide on a single job id — replace_existing=True silently drops the second. Scenario: 30 users with quiet_hours_end=07:00. All 30 windows end at the same wall-clock second. Only one drain job is scheduled. That single job fires drain_deferred_due() which scans all rows globally so all 30 get drained — actually fine. But if the global drain function ever filters by user/tracker (a likely near-term change for multi-tenant), the collision becomes silent data loss. Fix: Either keep the global drain (and document the assumption) or add a tracker_id segment to the job_id and let APScheduler dedup naturally.

A15. `_handle_webhook_conflict` reclaim races against a parallel admin action

Location: packages/server/src/notify_bridge_server/services/telegram_poller.py:163-218. Scenario: Admin clicks "Switch to webhook mode" in the UI, which sets update_mode=webhook and calls set_webhook(...). Concurrently, the next poll tick for the same bot hits the conflict, calls delete_webhook → the admin's webhook is wiped 1s after they set it. The poll tick checks bot.update_mode != "polling" before the conflict reclaim, but the reload is best-effort and the conflict reclaim path runs unconditionally once entered. Fix: Re-check bot.update_mode == "polling" inside _handle_webhook_conflict before calling delete_webhook; or take an advisory lock on the bot row for the duration of the mode flip.

A16. Discord 2000-char split breaks on Unicode codepoint boundaries

Location: packages/core/src/notify_bridge_core/notifications/discord/client.py:60-80 (_split_message). Scenario: A template renders to 2050 chars with emoji at position 1998-1999 (each emoji is 2 surrogates / multi-byte UTF-8). The split uses text.rfind("\n", 0, limit) and falls back to character index limit, which is a Python str index → that part is OK in CPython 3, but if the content contains a grapheme cluster (emoji + zero-width-joiner + skin tone), slicing at limit mid-cluster renders as the broken emoji "□" in Discord. Fix: Use a grapheme-cluster boundary library (e.g. regex module with \X) or at minimum back off to the previous whitespace if limit is inside a likely cluster.

MEDIUM

A17. Per-target failure counter does not distinguish receivers within a target

Location: packages/server/src/notify_bridge_server/services/event_dispatch.py:311-333. Scenario: A target has 10 receivers. 1 chat is blocked, 9 work. Today maybe_emit_target_failure is called for the target — but the success counter (record_target_success) is also called for the same target on the other 9. Net counter behavior depends on call order. With the default-threshold 5, this oscillates. Fix: Track success/failure per receiver, not per target; or only call maybe_emit_target_failure when all receivers failed for the target.

A18. `_cleanup_old_events` does not delete cancelled `DeferredDispatch` rows

Location: packages/server/src/notify_bridge_server/services/scheduler.py:332-364. Scenario: The daily cleanup deletes EventLog, WebhookPayloadLog, ActionExecution. Cancelled / fired / dropped DeferredDispatch rows live forever in the DB. Active install with chatty providers accumulates millions of rows; eventually the _load_pending_drain_jobs query, _trim_queue_if_needed, and the catch-up scan all degrade. Fix: Add delete(DeferredDispatch).where(status.in_(["fired", "dropped", "cancelled"]), fired_at < cutoff) to the cleanup.

A19. `random.shuffle(shuffled)` in `_sort_assets` uses non-deterministic seed

Location: packages/server/src/notify_bridge_server/services/dispatch_helpers.py:317-320. Scenario: Two identical events arriving in close succession (deferred-dispatch merge, then drain re-renders) shuffle into different orders. With the deferred-dispatch coalescing logic, this produces a visual "they're not the same album" surprise in the chat history. Fix: Seed random with a stable per-event digest of event.event_type.value, event.collection_id, and event.timestamp.isoformat(). Do not use the built-in hash() for this: string hashes are salted per process (PYTHONHASHSEED), so the order would still change across restarts. Derive the seed from hashlib.sha256 over those fields instead.
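
A deterministic shuffle can be sketched with hashlib and a private random.Random instance. The function name and the plain-string parameters are hypothetical stand-ins for the event fields mentioned above.

```python
import hashlib
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def stable_shuffle(
    items: Sequence[T], event_type: str, collection_id: str, timestamp_iso: str
) -> List[T]:
    """Shuffle items in an order that depends only on the event identity."""
    key = f"{event_type}|{collection_id}|{timestamp_iso}".encode("utf-8")
    # sha256 is stable across processes, unlike the salted built-in hash().
    seed = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    shuffled = list(items)
    # A private Random instance leaves the global generator untouched.
    random.Random(seed).shuffle(shuffled)
    return shuffled
```

Two renders of the same event then produce the same album order, while different events still get different orders.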

A20. `_poll_tracker` swallows exception, drops it at `_LOGGER.error` not `exception`

Location: packages/server/src/notify_bridge_server/services/scheduler.py:657-666. Scenario: An exception in check_tracker is logged as _LOGGER.error("Error polling tracker %d: %s", tracker_id, e) — no traceback. Production debugging of "why is tracker 42 silently broken since yesterday" requires the stack. Fix: Change to _LOGGER.exception("Error polling tracker %d", tracker_id).

A21. Long bot commands → `/help` reply > 4096 chars truncates without warning

Location: packages/server/src/notify_bridge_server/commands/handler.py:521-532, combined with send_reply → send_telegram_message → _truncate to 4096. Scenario: A user with 20 enabled commands runs /help. Each command + description (RU) crosses 250 chars → 5000 chars total → truncated mid-command. The user sees a half-list that suggests we forgot half the commands. Fix: Split /help over multiple messages by command category (provider).
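
A simpler variant of the split, by line boundary rather than by command category, can be sketched like this. split_message is a hypothetical helper, it assumes one help entry per line, and the limit is measured in Python characters.

```python
from typing import Iterable, List


def split_message(lines: Iterable[str], limit: int = 4096) -> List[str]:
    """Pack lines into as few chunks as possible, each at most limit chars."""
    chunks: List[str] = []
    current = ""
    for line in lines:
        candidate = line if not current else current + "\n" + line
        if len(candidate) <= limit:
            current = candidate
            continue
        if current:
            chunks.append(current)
        # A single line longer than the limit is hard-split.
        while len(line) > limit:
            chunks.append(line[:limit])
            line = line[limit:]
        current = line
    if current:
        chunks.append(current)
    return chunks
```

Each chunk is then sent as its own message, so no command description is cut in the middle.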

A22. `parse_command` truncates to 512 chars — long search queries lost

Location: packages/server/src/notify_bridge_server/commands/parser.py:15. Scenario: /search a very long query containing emoji 🎉 and more text that the user really meant to send because they pasted a long string from somewhere… gets clipped to 512 chars silently. The trailing count parser then operates on the truncated text, possibly extracting a count from mid-query. Fix: Either reject >512 with parse_command returning a sentinel "too_long" tuple, or just stop truncating — the Telegram limit is already 4096 and we already truncate the response side.

A23. Periodic catch-up scan can dispatch a stale event payload

Location: packages/server/src/notify_bridge_server/services/deferred_dispatch.py:628 (_process_row). Scenario: An assets_added event is deferred at 22:00. At 06:00 the quiet window ends, drain re-fetches link_data. The assets in event_payload include URLs and asset metadata. But the user has since deleted those photos from Immich. The dispatcher tries to download → 404. Notification shows "5 photos added to Album X" but the actual media fails to attach. Fix: For assets_added, re-validate asset existence against the provider before dispatch (one batched getAssets call). Drop missing IDs from the event, mark with "delivered_after_quiet_hours" + extra hint "missing_count": N in details. For deferred windows >12h this is the right behavior; for shorter windows the lookup is wasted work, so gate on (now - deferred_at) >= timedelta(hours=6).

A24. Watcher / scheduler restart can lose adaptive polling state

Location: packages/server/src/notify_bridge_server/services/scheduler.py:67-88 (_adaptive_state: dict). Scenario: Module-level dict resets on restart. A tracker that had ramped up to 1-in-4 ticks goes back to every-tick polling. Over a fleet of 50 trackers in steady-state idle, this triggers a thundering herd of every-tick polls right after deploy. Combined with no DB-level rate limiting on the upstream Immich/Gitea API, it can rate-limit the operator out of their own services for ~5min. Fix: Either persist the adaptive state in notification_tracker_state (cheap on shutdown via atexit) or stagger the initial ticks via APScheduler's next_run_time instead of relying on the existing jitter.

A25. `defer_event` `return "cancelled"` logic is incorrect in some merge paths

Location: packages/server/src/notify_bridge_server/services/deferred_dispatch.py:444. Scenario: The cancelled return branch checks upd_added is None or upd_added.status == "cancelled" AND same for upd_removed. But if both upd_added and upd_removed are None (i.e. there were no pending rows to begin with), fully_cancelled is False → returns "merged". That's fine. But the more subtle issue: an "insert" action with one of the rows being cancelled returns "merged" — should be "inserted". The dashboard "merged" status confuses the operator looking at why no defer row exists. Fix: Rewrite as a clearer state machine: distinguish "inserted", "merged_into_existing", "fully_cancelled".

A26. `_fetch_bytes` and `_safe_get` honor only 3 redirects with no Retry-After awareness

Location: packages/core/src/notify_bridge_core/notifications/telegram/client.py:217-268. Scenario: Immich behind a CDN can chain 302 → 302 → 200. With 4 hops it falls through to "Too many redirects". A user complains "old photos suddenly missing in notifications". Fix: Bump to 5 redirects and surface the chain in the error string for easier debugging.

A27. No structured event log filter UI for "show me all drops in the last hour"

Location: packages/server/src/notify_bridge_server/api/status.py — event_log rows have details.dispatch_status field but no API filter exposes it. The frontend can fetch only via global filter on event_type. Scenario: An operator sees "messages are missing today". They want to filter event_log to dispatch_status in (dropped_quiet_hours_nondeferrable, deferred_then_dropped, deferred_then_failed). Today they can't. Fix: Add dispatch_status and dispatched=true|false as first-class event_log columns (denormalized from details), plus API + UI filter.

A28. `_render_cmd_template` falls back to `"[No template: X]"` user-visible text

Location: packages/server/src/notify_bridge_server/commands/handler.py:111-115. Scenario: An operator removes a template slot by mistake. The next user who runs /random sees [No template: response_random] in chat. Not just ugly — it leaks internal slot names. Fix: Show a friendly "Sorry, something went wrong on our side" + log at error level. Better: refuse to disable the slot if it's referenced.

LOW

A29. `_truncate`'s ellipsis can land inside a multi-byte char

The marker "…" is one Unicode codepoint (3 bytes UTF-8) but the truncate counts characters, not bytes. Telegram counts UTF-16 code units, so for a 4090-char message ending in emoji, the calculation is off by a small constant. Won't break sends but messages may end up slightly longer than TELEGRAM_MAX_TEXT_LENGTH allows. Re-measure in UTF-16 code units (len(s.encode('utf-16-le')) // 2).
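
Measuring and cutting in UTF-16 code units can be sketched as below. truncate_utf16 is a hypothetical helper, it assumes the input contains no lone surrogates, and it ignores HTML tags entirely.

```python
def utf16_len(s: str) -> int:
    """Length of s in UTF-16 code units."""
    return len(s.encode("utf-16-le")) // 2


def truncate_utf16(s: str, limit: int, marker: str = "…") -> str:
    """Truncate s so the result is at most limit UTF-16 code units."""
    if utf16_len(s) <= limit:
        return s
    budget = limit - utf16_len(marker)
    out = []
    used = 0
    for ch in s:
        # Characters outside the BMP take a surrogate pair (2 units).
        units = 2 if ord(ch) > 0xFFFF else 1
        if used + units > budget:
            break
        out.append(ch)
        used += units
    return "".join(out) + marker
```

Iterating by Python character means a surrogate pair is never split, so an astral-plane emoji is either kept whole or dropped.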

A30. `NotificationDispatcher._render_cache` set to fresh dict on every dispatch — comment says "reuse"

The instance attribute self._render_cache is reset to {} at the start of every _send_to_target (line 245). The cache only helps across receivers within one target, not across targets. The comment at line 111-115 implies broader reuse. Either align comment with reality or actually share across targets within one dispatch() call.

A31. Frontend `entity-cache.svelte.ts` doesn't propagate stale-cache errors

The shared $state-based caches return stale data silently if the underlying fetch fails after a successful initial load. A user sees old target list during an outage and is confused why edits aren't sticking.

Part B — Missing functionality and "cool feature" gaps

Tier legend: must-have = blocks prod for any non-trivial install; nice-to-have = clear value, ship in next minor; aspirational = ship when v1.0+ slows down. Effort: S ≈ 1-2 days; M ≈ 1 week; L ≈ 2+ weeks.

Already in the backlog (post-v0.8.1 status check)

B1. Target-level quiet hours (per-target DND, multi-window, days-of-week, silent mode)

Status: Still missing in v0.8.1. The backlog item proposed a v1 cut (target-level windows + silent mode for Telegram = disable_notification=True). None of the proposed code paths exist:

notification_target.quiet_hours_json column — not present.
disable_notification=True plumbing through TelegramClient.send_message — not present.
Days-of-week filter — not present.

Pitch: Quiet hours bind to the watcher (tracking config); users want DND at the destination. "Don't ping my phone at night, regardless of which provider". Who benefits: Every user. Today they have to recreate per-link windows. Effort: M (1 week — backend dispatcher gate + frontend Aurora-style fieldset). Tier: must-have for prod.

Status: Auto-Organize exists; no other action descriptors are shipped. Pitch: Reuse the existing action descriptor pipeline. Auto-favorite-by-person is the smallest cut. Effort: M per action (a few days each). Tier: nice-to-have.

B3. Block-based template builder

Status: Not started. JinjaEditor is unchanged. Effort: L — frontend-only but big. Tier: aspirational.

Newly identified — must-have for prod

B4. Webhook delivery dedup table + "Test Delivery" replay

Pitch: Add the dedup table from A1, plus a /api/webhooks/{provider_id}/replay/{delivery_id} endpoint that admin can hit to re-dispatch a stored payload without the upstream provider needing to resend. Combined with the existing WebhookPayloadLog, this is "click to retest" in the UI. Who benefits: Every webhook provider. Replay is invaluable for debugging template edits. Effort: M. Tier: must-have for prod.

B5. "Send test message" / template playground

Pitch: From the template editor, click "Try this template against the last received event" → render preview, optionally send to a sandbox chat. Bypass dispatch but exercise the full Jinja pipeline. Who benefits: Every template edit today is a leap of faith — the operator modifies the template, waits for the next real event, hopes nothing breaks. Effort: S-M. The preview infrastructure already exists (services/sample_context.py); add a "send to chat X" button. Tier: must-have for prod.

B6. Template versioning + rollback

Pitch: Auto-snapshot each template on save (last 10 revisions). UI shows diff between version N and N-1, "Restore" button. Same for command templates. Who benefits: An operator who tweaks a template at midnight and goofs the syntax needs an undo button. Effort: M. New template_revision table; new endpoints; UI button. Tier: must-have for prod.

B7. Bulk operations on trackers / targets / links

Pitch: Multi-select in lists → "disable selected", "delete selected", "export selected templates as JSON bundle", "move to user X". Who benefits: Operators with >10 trackers. A common pain point: deploying the bridge for a new family member requires N clicks per tracker. Effort: M (frontend-heavy). Tier: must-have for prod.

B8. Bot blocked / chat-not-found auto-disable + dashboard

Pitch: Detect Telegram 403 / 400 chat-related errors. Mark the receiver or TelegramChat as disabled_by_remote. Surface in a "Stale receivers" admin view with a "Try resending invite" / "Delete chat" button. Who benefits: Every Telegram user. Today the bridge silently sprays errors until a human looks. Effort: S. Tier: must-have for prod.

B9. Forum-thread (topic) routing for Telegram

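Pitch: Per-receiver message_thread_id field, auto-detected from incoming command messages. UI: when adding a chat that's a forum (getChat's is_forum), show a topic selector. Note that the Bot API has no method that lists a forum's topics (getForumTopicIconStickers only returns the stickers usable as topic icons), so the selector has to be populated from topics the bot has observed in incoming updates (message_thread_id, forum_topic_created service messages). Who benefits: Any group install where the user wants notifications in a dedicated topic. Effort: M. Tier: must-have for prod.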

B10. Telegram inline buttons + callback queries

Pitch: Templates can declare {% buttons %} with action descriptors. Bridge listens for callback_query updates, dispatches to a registered action (e.g. "Mark album as favorite", "Snooze this tracker for 1h", "Run HA service light.turn_off"). Who benefits: Power users. Foundation for several other features (Immich duplicate-cluster review, HA action button → service call, snooze). Effort: L. Tier: nice-to-have but unlocks the next 3 items.

B11. User snooze / mute via bot command

Pitch: /snooze 1h mutes the bot's outbound chat for 1h. /mute provider gitea mutes a whole provider for that chat. /wake undoes. Implemented as a per-receiver snoozed_until column. Effort: S-M. Tier: must-have for prod (user-side relief valve).

Newly identified — nice-to-have

B12. Per-target / per-user rate limit (send-side)

Pitch: Cap outbound messages per minute per receiver. Existing 429 backoff handles Telegram's limit, but a runaway template / event-storm provider can still spray the user's phone with 200 messages. Effort: S. Token bucket per chat_id in _send_telegram. Tier: nice-to-have.
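
A per-chat token bucket is small enough to sketch in full. The class names, the default rate, and the injectable clock are assumptions for illustration; where it hooks into _send_telegram is left open.

```python
import time
from typing import Callable, Dict


class TokenBucket:
    def __init__(self, rate_per_sec: float, capacity: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate_per_sec
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def try_acquire(self) -> bool:
        """Take one token if available; never blocks."""
        now = self.clock()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class PerChatLimiter:
    """Keeps one bucket per chat_id."""

    def __init__(self, rate_per_sec: float = 0.5, capacity: int = 20,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._rate = rate_per_sec
        self._capacity = capacity
        self._clock = clock
        self._buckets: Dict[int, TokenBucket] = {}

    def allow(self, chat_id: int) -> bool:
        bucket = self._buckets.get(chat_id)
        if bucket is None:
            bucket = TokenBucket(self._rate, self._capacity, self._clock)
            self._buckets[chat_id] = bucket
        return bucket.try_acquire()
```

The capacity sets how large a burst is tolerated and the rate sets the sustained ceiling; what to do with a refused message (drop, queue, or coalesce) is a separate policy decision.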

B13. Message dedup window (idempotency key per outbound message)

Pitch: SHA256 of (target_id, receiver_id, rendered_message, event_collection_id). If the same key was sent in the last 5min, skip. Effort: S. Tier: nice-to-have (lots of overlap with A1+A2 but addresses the end-of-pipeline dedup, after all coalescing).
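
An in-memory version of this window might look like the sketch below. SendDedup is a hypothetical class, the state is lost on restart, and a production version would probably want a shared or persistent store.

```python
import hashlib
import time
from typing import Callable, Dict


class SendDedup:
    def __init__(self, window_sec: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.window = window_sec
        self.clock = clock
        self._seen: Dict[str, float] = {}

    @staticmethod
    def make_key(target_id: int, receiver_id: int,
                 rendered_message: str, event_collection_id: str) -> str:
        # The unit separator keeps field boundaries unambiguous.
        raw = "\x1f".join(
            [str(target_id), str(receiver_id), rendered_message,
             event_collection_id]
        )
        return hashlib.sha256(raw.encode("utf-8")).hexdigest()

    def should_send(self, target_id: int, receiver_id: int,
                    rendered_message: str, event_collection_id: str) -> bool:
        now = self.clock()
        # Forget keys that have aged out of the window.
        self._seen = {k: ts for k, ts in self._seen.items()
                      if now - ts < self.window}
        key = self.make_key(target_id, receiver_id, rendered_message,
                            event_collection_id)
        if key in self._seen:
            return False
        self._seen[key] = now
        return True
```

Because the key includes the rendered message, two different notifications about the same collection are not suppressed; only byte-identical repeats within the window are.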

B14. Weekly digest / per-target stats / per-provider error rate

Pitch: Cron-based weekly summary email/Telegram. "Top 5 noisy trackers", "Receivers with >X% failure rate", "Top 5 days of the week with the most activity". Operator preventive maintenance. Effort: M. Tier: nice-to-have.

B15. Mobile-friendly minimal mode for the SPA

Pitch: The Aurora redesign is a lot for mobile. A "manage from phone" minimal layout — list of trackers, click to toggle, click to mute. Stops operators from needing a desktop to silence a chatty tracker at 1am. Effort: M. Tier: nice-to-have.

B16. Audit log of admin actions

Pitch: New audit_log table. Every create/update/delete on NotificationTracker, NotificationTarget, TemplateConfig, ServiceProvider, TelegramBot, User, etc. writes a row with (user_id, action, entity_type, entity_id, before_json, after_json, ip, ua). Admin UI tab. Effort: M. SQLAlchemy event listeners on the affected models. Tier: nice-to-have for multi-admin installs; must-have if any compliance requirement.

B17. Health → not just /ready, but per-component status page

Pitch: /api/health/components returns {providers: [{id, last_ok_at, last_error}], targets: [{id, last_ok_at, last_error}], scheduler: {job_count, next_fires}}. Frontend "Status" tab. Effort: S-M. The data is already in EventLog / scheduler API. Tier: nice-to-have.

B18. Provider unreachable backoff + escalation

Pitch: Today bridge_self emits bridge_self_poll_failures after N consecutive fails. Add (a) exponential backoff on the polling interval after M failures so we don't hammer a down host, and (b) recovery notification when the provider comes back. Effort: S. Tier: nice-to-have.
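
One way to sketch (a) and (b) together. PollHealth and its parameters are hypothetical, and the choice that the interval starts doubling on the first failure past the threshold is an assumption, not something the current poller does.

```python
from typing import Tuple


class PollHealth:
    """Tracks consecutive poll failures for one provider."""

    def __init__(self, base_sec: float, backoff_after: int = 3, max_sec: float = 3600.0):
        self.base_sec = base_sec
        self.backoff_after = backoff_after
        self.max_sec = max_sec
        self.failures = 0

    def next_interval(self) -> float:
        # Each failure beyond the threshold doubles the interval, up to max_sec.
        extra = max(0, self.failures - self.backoff_after)
        # Cap the exponent so a very long outage cannot overflow.
        return min(self.max_sec, self.base_sec * (2 ** min(extra, 32)))

    def record(self, ok: bool) -> Tuple[float, bool]:
        """Return (seconds until next poll, whether to send a recovery notice)."""
        notify_recovery = False
        if ok:
            # Only announce recovery if the outage had reached the threshold.
            notify_recovery = self.failures >= self.backoff_after
            self.failures = 0
        else:
            self.failures += 1
        return self.next_interval(), notify_recovery


health = PollHealth(base_sec=60)
for outcome in (False, False, False, False, True):
    print(health.record(outcome))
```

Gating the recovery notice on the same threshold as the failure event keeps a single flaky poll from producing a "provider is back" message nobody asked for.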

B19. RSS provider

Pitch: Generic RSS/Atom feed poller. One more provider, reuses event_dispatch. Long-tail value (operator wants "notify me when a blog publishes"). Effort: M. Tier: nice-to-have.

B20. Mobile push / FCM channel

Pitch: A dedicated FCM "Receiver" type so the user can ship their own companion app. Today Telegram is the only realtime channel; email is too slow; webhook out is for plumbing. Effort: L. Tier: aspirational.

Newly identified — aspirational

B21. Conversation threading per source (one notification thread per album / repo)

Pitch: Use Telegram reply_parameters to chain all notifications about "Album X" as a single thread that grows over time. Today every notification is a top-level message. Threading turns the chat into a navigable history. Effort: M. Store last_message_id per (target_id, collection_id), pass it as reply_parameters.message_id (the current form of the older reply_to_message_id parameter). Tier: aspirational but a clear differentiator.

B22. A/B test variants for templates

Pitch: A template config can carry 2 variants. The dispatcher hash-routes receivers to A or B; the dashboard shows "variant A's response time / click rate / receiver mute rate". Effort: L. Tier: aspirational.
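
Hash-routing can be sketched in a few lines. pick_variant and the choice of hashing template_config_id together with receiver_id are assumptions; the point is only that assignment is deterministic, so a receiver keeps seeing the same variant across restarts.

```python
import hashlib
from typing import Sequence


def pick_variant(template_config_id: int, receiver_id: int,
                 variants: Sequence[str] = ("A", "B")) -> str:
    """Deterministically map a receiver to one variant of a template config."""
    # Including the config id means a receiver is not always "A" for every template.
    digest = hashlib.sha256(f"{template_config_id}:{receiver_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[bucket]


print(pick_variant(5, 1001), pick_variant(5, 1002))
```

A stable hash is used rather than Python's built-in hash(), which is salted per process for strings and would reshuffle receivers on every restart.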

B23. Dark-launch a new template before enabling it

Pitch: "Send-to-sandbox-chat-only" toggle on a template config. The new template renders against real events but only goes to one operator's chat for 1 week. Then promote to production. Effort: M. Builds on template versioning (B6). Tier: aspirational.

B24. Scheduled template changes

Pitch: "On 2026-12-25 at 09:00, switch template_config X to draft Y". Useful for holiday-themed greetings or batch migrations. Effort: M. Tier: aspirational.

B25. HA service-call from a Telegram inline button

Pitch: Building on B10. A template renders {% button hass:light.turn_off target=living_room %}. User clicks → bridge calls HA light.turn_off. Effort: M (after B10). Tier: aspirational.

Ship-blocker checklist (do not widen user audience without)

Order is rough priority (top first). Most are also called out in Part A.

A1 — Webhook idempotency table (Gitea/Planka/generic). Without this, one upstream retry storm can double-/quadruple-spray every user.
A2 — Deferred-dispatch crash window. A redeploy mid-drain duplicates every queued notification. Implement either the dispatch_id pre-commit OR the in_flight state machine.
A3 — Persist Telegram update offset. Same root cause class as A1/A2; matters less if A1+A2 are fixed but should land together.
A4 / B8 — Bot blocked / chat-not-found auto-disable. A user blocking the bot must not generate infinite errors.
A11 — Webhook JSON depth/node cap (mirror the backup guard).
A9 — Quiet-hours start == end confirmation; either accept "always quiet" semantics or reject in the API validator.
A8 — DST handling in quiet-hours overnight window. Verify with tests that include known transition timestamps.
B5 — "Send test message" / template playground. Without this, every template edit is a blind change against a live system.
B6 — Template versioning + rollback. Pair with B5.
A5 / B9 — Forum-thread (topic) routing. Any non-trivial Telegram group install needs this.
B11 — User snooze / mute via bot command. Relief valve when the bridge gets too chatty.
B7 — Bulk operations on trackers / targets / links. Operability floor for any install with >10 trackers.

Everything else in Part B is upside, not a blocker.
