Operability:
- Correlation IDs end-to-end: shared dispatch_id between log lines and EventLog rows (event/watcher/scheduled/deferred/action/HA/command paths) and a new X-Request-Id middleware that normalizes inbound ids and binds request_id into log context.
- dispatch_summary block merged into EventLog.details: per-target success/failure counts plus Telegram media delivered/skipped/failed and truncated error lists, so partial outcomes surface in the UI.
- Diagnostic mode: admin can flip one module to DEBUG for a bounded window with auto-revert (in-memory only; setup_logging() resets on boot, lifespan reverts on shutdown). New /diagnostic-mode endpoints plus DiagnosticsCassette UI on the settings page.
Telegram:
- Per-receiver options: disable_notification (silent send) and message_thread_id (forum-topic routing), wired through the dispatcher via a ContextVar so all four send sites (sendMessage / sendPhoto-Video-Document / sendMediaGroup / cache-hit POST) pick them up.
- send_large_videos_as_documents target setting: bypass the 50 MB sendVideo cap by falling back to sendDocument for oversized videos.
- sendMediaGroup byte-budget enforcement (TELEGRAM_MAX_GROUP_TOTAL_BYTES, 45 MB) with per-item fallback on chunk failure so a stale file_id no longer silently drops a cached asset.
Tests:
- New: diagnostic_mode, dispatch_summary, request_correlation, telegram_media_group_partial, telegram_per_send_options.
Docs:
- .claude/reviews/: six-axis production-readiness review of v0.8.1.
- .claude/docs/functional-review-2026-05-28.md: focused review of Telegram/Immich/logging subsystems.
Bugs + Missing Features — Production-Readiness Review
Repo: c:\Users\Alexei\Documents\service-to-notification-bridge (v0.8.1 baseline)
Date: 2026-05-22
Scope: full repo (backend Python/FastAPI, Svelte 5 frontend, providers + dispatchers + bot commands)
Executive summary
- The code is in much better shape than typical pre-1.0 code. Quiet-hours, SSRF, JWT, secret redaction, rate-limit fan-out caps, partition-by-media-kind, parse_mode retry, scheduler misfire-grace, Prometheus metrics, deep healthcheck, and per-receiver render cache are all already implemented and well-tested.
- The single biggest shipping risk is webhook idempotency. Gitea, Planka, and the generic webhook endpoint all dispatch on every POST regardless of redelivery: there is no X-Gitea-Delivery/X-Hub-Delivery dedup table. An upstream retry storm sends the same notification N times.
- The deferred-dispatch drain has a duplicate-send window if the process dies between dispatcher.dispatch() returning and session.commit(): the row stays pending and the periodic catch-up scan re-drains it.
- Telegram update offset (_last_update_id) is in-memory only: on restart, the bot replays already-handled updates or skips ones Telegram has discarded. Combined with no per-update idempotency, this is a duplicate-command surface.
- Several Telegram features are silently unsupported: forum threads (message_thread_id), bot-blocked-by-user detection (403 → keep retrying forever), and inline-button callback queries. None blocks shipping today but each is a near-term ask from any real user.
- No template versioning / dry-run / playground: every template edit is immediately live. There is no way to validate a new template against a sample payload before flipping the switch, and no rollback path.
- Frontend lacks bulk operations and import/export of templates+targets. An operator with 30 trackers cannot bulk-toggle, bulk-edit, or move a template across users.
Part A — Bugs and reliability issues
Severity legend: CRITICAL = data loss / duplicate user-visible messages / silent stop-shipping; HIGH = wrong behavior under realistic conditions; MEDIUM = degrades UX or operability; LOW = polish.
CRITICAL
A1. Webhook redelivery causes duplicate notifications (no idempotency)
Location: packages/server/src/notify_bridge_server/api/webhooks.py:156
(gitea_webhook), :225 (planka_webhook), :427 (generic_webhook).
Scenario: Gitea retries a webhook after 30s if the bridge returns 5xx,
times out under load, or if the operator clicks "Test Delivery" twice. Every
retry produces a fresh notification because the handlers never check
X-Gitea-Delivery (Gitea's per-delivery UUID), nor do they record any
event_id/hash for parse_generic_webhook events.
Fix: Add a webhook_delivery table with (provider_id, delivery_id)
unique constraint and created_at. Insert before dispatch (INSERT OR IGNORE
on SQLite, ON CONFLICT DO NOTHING on Postgres); if the insert is a no-op,
return {"ok": true, "skipped": "duplicate"}. For Gitea use the
X-Gitea-Delivery header; for Planka use a hash of event_type + payload.id + payload.createdAt; for generic webhooks use a configurable
JSONPath expression to derive an idempotency key, falling back to a SHA256 of
the raw body. TTL prune older than 7 days.
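The insert-before-dispatch check can be sketched with the standard-library sqlite3 module. This is a minimal illustration, not the bridge's actual code: the function names (init_schema, claim_delivery, body_fallback_key, prune_deliveries) are hypothetical, and the table layout follows the description above.

```python
import hashlib
import sqlite3


def init_schema(conn: sqlite3.Connection) -> None:
    # Hypothetical layout: one row per (provider, delivery) pair ever accepted.
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS webhook_delivery (
            provider_id INTEGER NOT NULL,
            delivery_id TEXT    NOT NULL,
            created_at  TEXT    NOT NULL DEFAULT CURRENT_TIMESTAMP,
            UNIQUE (provider_id, delivery_id)
        )
        """
    )


def body_fallback_key(raw_body: bytes) -> str:
    # Last-resort idempotency key when the upstream sends no delivery id.
    return hashlib.sha256(raw_body).hexdigest()


def claim_delivery(conn: sqlite3.Connection, provider_id: int, delivery_id: str) -> bool:
    """Return True if this delivery is new, False if it was already recorded."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO webhook_delivery (provider_id, delivery_id) VALUES (?, ?)",
        (provider_id, delivery_id),
    )
    conn.commit()
    # rowcount is 1 when the row was inserted and 0 when the insert was ignored.
    return cur.rowcount == 1


def prune_deliveries(conn: sqlite3.Connection, days: int = 7) -> int:
    # Drop keys older than the retention window; returns the number of rows removed.
    cur = conn.execute(
        "DELETE FROM webhook_delivery WHERE created_at < datetime('now', ?)",
        (f"-{days} days",),
    )
    conn.commit()
    return cur.rowcount
```

A handler would call claim_delivery first and return {"ok": true, "skipped": "duplicate"} when it yields False.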
A2. Deferred-dispatch drain can double-send on process crash
Location: packages/server/src/notify_bridge_server/services/deferred_dispatch.py:721-758.
Scenario: Inside _process_row, dispatcher.dispatch() actually
delivers the Telegram message (HTTP 200 returned, user phone buzzes).
The function then sets row.status = "fired" (line 734) but the surrounding
session.commit() (line 577) hasn't run yet. Process is killed (OOM,
SIGTERM during deploy, host reboot). On restart, _run_deferred_drain_catchup
re-fetches the still-pending row and dispatches it again — the user gets
the same album twice.
Fix: Either (a) record an outbound dedup key per-row before dispatch
(row.dispatch_id = uuid4(); session.commit() first), then ask the channel
client to send-or-no-op based on that ID; or (b) flip the row to an
"in_flight" state with a short timeout in a pre-dispatch transaction so a
restart sees it as poisoned and aborts. Option (a) is more correct but
needs per-channel cooperation; option (b) is the cheap fix.
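Option (b) could look roughly like the following sqlite3 sketch. The claimed_at column, the 'failed' status, and all function names are assumptions made for illustration; only the "in_flight" idea comes from the text above.

```python
import sqlite3

# Hypothetical, trimmed-down table; the real deferred_dispatch has more columns.
SCHEMA = """
CREATE TABLE deferred_dispatch (
    id         INTEGER PRIMARY KEY,
    status     TEXT NOT NULL DEFAULT 'pending',
    claimed_at TEXT
)
"""


def claim_row(conn: sqlite3.Connection, row_id: int) -> bool:
    """Flip pending -> in_flight and commit BEFORE any network I/O happens."""
    cur = conn.execute(
        "UPDATE deferred_dispatch "
        "SET status = 'in_flight', claimed_at = CURRENT_TIMESTAMP "
        "WHERE id = ? AND status = 'pending'",
        (row_id,),
    )
    conn.commit()
    return cur.rowcount == 1  # False: someone else claimed it, or it is not pending


def finish_row(conn: sqlite3.Connection, row_id: int, ok: bool) -> None:
    # Record the outcome once the channel client has returned.
    conn.execute(
        "UPDATE deferred_dispatch SET status = ? WHERE id = ? AND status = 'in_flight'",
        ("fired" if ok else "failed", row_id),
    )
    conn.commit()


def stale_in_flight(conn: sqlite3.Connection, timeout_minutes: int = 10) -> list:
    # Rows stuck in_flight past the timeout were probably interrupted by a crash.
    rows = conn.execute(
        "SELECT id FROM deferred_dispatch "
        "WHERE status = 'in_flight' AND claimed_at < datetime('now', ?)",
        (f"-{timeout_minutes} minutes",),
    ).fetchall()
    return [r[0] for r in rows]
```

After a restart, the catch-up scan would treat rows returned by stale_in_flight as poisoned (flag for review) instead of re-sending them.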
A3. Telegram update offset is in-memory only — restart replays or loses commands
Location: packages/server/src/notify_bridge_server/services/telegram_poller.py:31
(_last_update_id: dict[int, int] = {}).
Scenario: A user types /random Family. Telegram delivers update_id=4711.
The bridge processes the command, sends back the media, and crashes before
APScheduler ticks again. On restart, _last_update_id is empty, so we call
getUpdates(offset=None) → Telegram returns 4711 again → we send the user
the same album a second time. Conversely, if Telegram's 24-hour retention
expired during a long outage, we silently skip pending updates.
Fix: Persist last_update_id in DB (telegram_bot.last_update_id column).
Combine with A2-style command idempotency by inserting
(bot_id, update_id) into a dedup table before processing.
HIGH
A4. Telegram "bot blocked by user" / "chat not found" never short-circuits
Location: packages/core/src/notify_bridge_core/notifications/telegram/client.py
(send_message, _upload_media, etc.). Errors with
error_code == 403 (Forbidden, "Bot was blocked by the user") and 400
"chat not found" / "user is deactivated" are returned as failures, but
nothing records them, so the receiver is never removed or disabled.
Scenario: A user blocks the bot. Every scheduled "Good morning memory"
fires a sendMessage that Telegram instantly 403s. Bridge logs an error,
moves on, repeats forever. The bridge_self target-failure counter eventually
fires but the underlying receiver is never disabled. With many such chats
the operator has no easy cleanup path.
Fix: In the dispatcher, on error_code in (403, 400 with description matching "chat not found"/"user is deactivated"), automatically set
TelegramChat.commands_enabled = False and either flag the receiver as
disabled with reason blocked_by_user or surface it via a new
/admin/blocked-chats view. Also stop further retries in that round.
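The classification step might be sketched as below. The function name and the marker tuple are hypothetical. Treating every 403 as permanent is a simplifying assumption, since Telegram also uses 403 for cases such as the bot being removed from a group.

```python
from typing import Optional

# Description fragments that indicate the chat will never accept messages again.
PERMANENT_400_MARKERS = ("chat not found", "user is deactivated")


def is_permanent_chat_error(error_code: int, description: Optional[str]) -> bool:
    """Return True when retrying this receiver cannot succeed."""
    desc = (description or "").lower()
    if error_code == 403:
        return True
    return error_code == 400 and any(marker in desc for marker in PERMANENT_400_MARKERS)
```

The dispatcher would check this before scheduling another retry and, on True, mark the receiver disabled with a reason such as blocked_by_user.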
A5. Telegram forum-thread (topic) routing not supported
Location: telegram client never accepts/sends message_thread_id.
Scenario: Operator points the bridge at a group's "Releases" forum
topic. Today every message lands in the General topic instead — there is
no way to specify the topic. This is a hard requirement for any non-trivial
group install. Currently reply_parameters is the only thread-adjacent
field used; message_thread_id is silently absent.
Fix: Add an optional message_thread_id per-receiver (or per-target)
config, pass through send_message, _upload_media, and _post_media_group.
Auto-extract from incoming command updates' message.message_thread_id so
the bot can reply into the same topic.
A6. bot.token read after commit without refresh in webhook flow
Location: packages/server/src/notify_bridge_server/commands/webhook.py:92-97.
Scenario: The comment acknowledges "AsyncSession expires instances on
commit" and snapshots bot_id/bot_token before commit, but await session.refresh(bot) is also called after the commit. If session.refresh
fails (e.g. row was deleted by an admin concurrently — bot rotation), the
exception is caught as a warning and the rest of the handler still runs
using the stale local bot_id/bot_token. The window is small but real.
Fix: Remove the session.refresh(bot) since the snapshot already
covers everything the handler needs. The refresh adds risk for no gain.
A7. Deferred-dispatch coalescing has a JSON-mutation bug under concurrent defers
Location: packages/server/src/notify_bridge_server/services/deferred_dispatch.py:307
(_find_pending_asset_rows).
Scenario: Two near-simultaneous assets_added events for the same
(link_id, collection_id) from two upstream pollers (HA chat-bus +
periodic Immich). Both call defer_event concurrently. The two transactions
both see "no pending row", both session.add(new_row), and SQLite cheerfully
inserts two rows. The drain then fires both, sending the same combined media
twice. Note that the partial UNIQUE index from v0.8.1 protects only the
bridge_self provider row, not the deferred queue.
Fix: Add a partial UNIQUE index UNIQUE(link_id, collection_id, event_type) WHERE status = 'pending' on deferred_dispatch, then convert defer_event
to INSERT ... ON CONFLICT (link_id, collection_id, event_type) DO UPDATE
and merge event_payload inside the SQL or in a re-read+retry loop.
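A SQLite sketch of the partial index plus upsert follows. One detail worth noting: when the unique index is partial, the ON CONFLICT target has to repeat the index predicate (WHERE status = 'pending'), otherwise the index is not matched. To keep the merge expressible in one statement, the payload is modeled here as a comma-separated asset_ids column, which is a simplification and not the real event_payload format.

```python
import sqlite3

SCHEMA = """
CREATE TABLE deferred_dispatch (
    id            INTEGER PRIMARY KEY,
    link_id       INTEGER NOT NULL,
    collection_id TEXT    NOT NULL,
    event_type    TEXT    NOT NULL,
    status        TEXT    NOT NULL DEFAULT 'pending',
    asset_ids     TEXT    NOT NULL
);
CREATE UNIQUE INDEX uq_deferred_pending
    ON deferred_dispatch (link_id, collection_id, event_type)
    WHERE status = 'pending';
"""

# The conflict target repeats the partial-index predicate so the index is matched.
UPSERT = """
INSERT INTO deferred_dispatch (link_id, collection_id, event_type, asset_ids)
VALUES (?, ?, ?, ?)
ON CONFLICT (link_id, collection_id, event_type) WHERE status = 'pending'
DO UPDATE SET asset_ids = deferred_dispatch.asset_ids || ',' || excluded.asset_ids
"""


def defer_event(conn, link_id, collection_id, event_type, asset_ids):
    """Insert a pending row, or merge into the one pending row that already exists."""
    with conn:  # commits on success, rolls back on error
        conn.execute(UPSERT, (link_id, collection_id, event_type, ",".join(asset_ids)))
```

Once a row leaves the pending state it no longer participates in the index, so the next defer for the same key inserts a fresh row.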
A8. Quiet-hours overnight window + DST transition can produce wrong fire_at
Location: packages/server/src/notify_bridge_server/services/dispatch_helpers.py:121-128.
Scenario: A user in Europe/Minsk (UTC+3, no DST anymore) with quiet
hours 22:00-06:00 is unaffected. For a user in a DST-observing zone (e.g.
America/New_York), on the "spring forward" night where 2:00 → 3:00, the
window boundary is computed as now_local.replace(hour=t_end.hour, minute=t_end.minute).
But .replace() only swaps wall-clock fields and ignores DST adjustments: a
boundary inside the skipped hour (e.g. 02:30) names a local time that does
not exist, and on the "fall back" night a boundary inside the repeated hour
is ambiguous. Depending on how the system resolves it, the dispatcher later
sees the quiet window as "still active" or as already over.
Fix: After .replace(hour=t_end.hour, minute=t_end.minute, ...), normalize
the result. zoneinfo has no tz.localize (that is the pytz API); round-trip
through UTC via astimezone instead, and explicitly handle the fold=
parameter. Add tests using
zoneinfo.ZoneInfo("America/New_York") and known DST transition dates.
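A small zoneinfo sketch of the normalization, under the assumption that a UTC round-trip is an acceptable way to resolve nonexistent wall times. The helper names are hypothetical; the dates are the 2024 US transitions.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo


def normalize_local(dt: datetime) -> datetime:
    """Round-trip through UTC so a nonexistent wall time moves to a real one."""
    return dt.astimezone(timezone.utc).astimezone(dt.tzinfo)


def is_imaginary(dt: datetime) -> bool:
    """True if dt names a wall-clock time skipped by a DST jump."""
    return normalize_local(dt).replace(tzinfo=None) != dt.replace(tzinfo=None)


tz = ZoneInfo("America/New_York")

# Spring forward, 2024-03-10: 02:00-03:00 local does not exist.
now_local = datetime(2024, 3, 10, 1, 30, tzinfo=tz)
boundary = now_local.replace(hour=2, minute=30)  # .replace() happily builds it
fixed = normalize_local(boundary)                # lands after the gap

# Fall back, 2024-11-03: 01:30 happens twice; fold picks which one.
first = datetime(2024, 11, 3, 1, 30, tzinfo=tz)  # fold=0, first occurrence
second = first.replace(fold=1)                   # second occurrence
```

With fold=0 the gap time is interpreted using the pre-transition offset, which is why the round-trip moves it forward by the size of the jump.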
A9. Quiet-hours start == end returns None — silently no quiet hours
Location: packages/server/src/notify_bridge_server/services/dispatch_helpers.py:110-111.
Scenario: User UI submits quiet_hours_start = "00:00" and
quiet_hours_end = "00:00", thinking "all day quiet". The function returns
None (no quiet window) — the user gets pinged at 3am even though the UI
says "quiet hours enabled". Same code path eats malformed times silently.
Fix: Bubble up ValueError/malformed input to the API validator on
write so the user gets a 422 with a specific error message rather than
silently broken behavior. Define 00:00-00:00 as "always quiet" or reject
it explicitly with a clear error.
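A write-time validator could be as small as the sketch below. It takes the "reject explicitly" branch for start == end; the function name and error wording are illustrative only, and an API layer would translate the ValueError into a 422.

```python
from datetime import time


def parse_quiet_hours(start: str, end: str) -> tuple:
    """Parse HH:MM strings; raise ValueError instead of silently disabling the window."""
    try:
        t_start = time.fromisoformat(start)
        t_end = time.fromisoformat(end)
    except ValueError as exc:
        raise ValueError(f"quiet hours must be valid HH:MM times, got {start!r}-{end!r}") from exc
    if t_start == t_end:
        # Ambiguous: could mean "never quiet" or "always quiet". Make the user choose.
        raise ValueError("quiet_hours_start and quiet_hours_end must differ")
    return t_start, t_end
```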
A10. Telegram _truncate cuts mid-HTML-tag → parse_mode fallback then loses formatting
Location: packages/core/src/notify_bridge_core/notifications/telegram/client.py:144-149
(_truncate).
Scenario: A template renders to just over 4096 chars and an
<a href="https://...">...</a> straddles the 4096-character boundary. The
truncate function takes a flat string slice, so the final character may be
inside a tag → Telegram returns 400 "can't parse entities" → the retry
strips parse_mode → the user sees <a href="..."> literally in their chat.
Fix: Make _truncate HTML-aware: if the cut lands inside a tag, back up to
the start of that tag, OR strip incomplete tags after
truncating. A simpler intermediate fix: pop any unclosed <a>/<b>/<i>
detected by a regex over the truncated string.
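One possible shape for an HTML-aware truncate is sketched below. It is not the existing _truncate; the name truncate_html and the regex are assumptions. It drops a partially sliced tag and then closes whatever tags are still open. The closing tags are appended after the length check, on the assumption that Telegram measures the limit after entity parsing, so markup does not count toward it.

```python
import re

# Matches simple opening/closing tags such as <b>, </b>, <a href="...">.
TAG_RE = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9-]*)[^<>]*>")


def truncate_html(text: str, limit: int = 4096, marker: str = "…") -> str:
    if len(text) <= limit:
        return text
    cut = text[: limit - len(marker)]

    # If the slice ends inside a tag (a "<" with no ">" after it), drop the partial tag.
    last_open = cut.rfind("<")
    if last_open > cut.rfind(">"):
        cut = cut[:last_open]

    # Walk the remaining tags to find which ones are still open at the cut point.
    stack = []
    for match in TAG_RE.finditer(cut):
        closing, name = match.group(1), match.group(2).lower()
        if closing:
            if stack and stack[-1] == name:
                stack.pop()
        else:
            stack.append(name)

    closers = "".join(f"</{name}>" for name in reversed(stack))
    return cut + marker + closers
```

The sketch does not handle a cut that lands inside an HTML entity such as &amp;amp; that case would need a similar back-off.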
A11. JSON-payload depth/size hardened in backup, not in webhooks
Location: packages/server/src/notify_bridge_server/api/webhooks.py:43-71
(_read_bounded_body only caps total bytes).
Scenario: Generic webhook accepts a 999KB payload (under the 1MB cap)
but with 50 levels of nesting. json.loads succeeds, then
parse_generic_webhook evaluates JSONPath expressions in a loop and the CPU
spends seconds chasing pointers. Multiple concurrent malicious requests can
peg the event loop.
Fix: Reuse the depth/node guards from
packages/server/src/notify_bridge_server/services/backup_service.py
(JSON depth cap 10, node count cap 100k). Either share the helper or
re-implement around json.loads(object_pairs_hook=...).
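A standalone guard in the spirit of the backup-service limits might look like this. It is a sketch that walks the already-parsed value iteratively. The function name is hypothetical and the defaults mirror the caps quoted above.

```python
import json


def check_json_limits(value, max_depth: int = 10, max_nodes: int = 100_000) -> int:
    """Walk a parsed JSON value; raise ValueError if it is too deep or too large.

    Returns the number of nodes visited. Iterative on purpose, so the guard
    itself cannot hit Python's recursion limit.
    """
    stack = [(value, 1)]
    nodes = 0
    while stack:
        item, depth = stack.pop()
        nodes += 1
        if nodes > max_nodes:
            raise ValueError(f"JSON payload exceeds {max_nodes} nodes")
        if isinstance(item, (dict, list)):
            if depth > max_depth:
                raise ValueError(f"JSON payload nested deeper than {max_depth} levels")
            children = item.values() if isinstance(item, dict) else item
            stack.extend((child, depth + 1) for child in children)
    return nodes
```

A webhook handler would call this right after json.loads and before any JSONPath evaluation, returning a 4xx on ValueError.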
A12. Generic-webhook auth_mode="none" with acknowledge_unauthenticated is per-provider, not per-user
Location: packages/server/src/notify_bridge_server/api/webhooks.py:294-323.
Scenario: v0.8.1 added the acknowledge_unauthenticated=true opt-in,
but it's only stored in provider.config JSON. For a multi-user install where
one user accepts unauthenticated webhooks and another doesn't, that per-provider flag would suffice. But
because anyone with the webhook URL can also infer the token (URLs are not
secret in real deployments — they end up in upstream config files, logs,
build artifacts), auth_mode="none" is dangerous beyond "explicit opt-in":
an attacker who guesses the path can DoS the rate limiter by burning the
60/min budget.
Fix: Refuse to even create a webhook provider with auth_mode="none"
in production unless a separate environment guard
NOTIFY_BRIDGE_ALLOW_UNAUTHENTICATED_WEBHOOKS is set; AND drop the rate
limit to 10/min for auth_mode="none" providers.
A13. _extract_retry_after returns int but Telegram retry_after is fractional
Location: packages/core/src/notify_bridge_core/notifications/telegram/client.py:59-78.
Scenario: Modern Telegram sometimes returns retry_after as a float
(e.g. 1.5). The current code does int(group(1)) and isinstance(ra, (int, float)). Regex \d+ only matches integers. So a 1.5s retry-after
becomes "no retry-after found" → fallback 1s sleep → retry too early → second
429 → eventually the bounded retry budget runs out.
Fix: Loosen the regex to \d+(?:\.\d+)? and parse with float(m.group(1));
preserve the fractional part by passing the float to await asyncio.sleep(retry_after + 1).
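A float-tolerant version of the extraction could look like the sketch below. The payload shape (parameters.retry_after plus a "retry after N" description) follows Telegram's error responses; the function name is a stand-in for the real helper.

```python
import re
from typing import Optional

_RETRY_RE = re.compile(r"retry after (\d+(?:\.\d+)?)", re.IGNORECASE)


def extract_retry_after(payload: dict) -> Optional[float]:
    """Return the retry delay in seconds as a float, or None if absent."""
    params = payload.get("parameters") or {}
    value = params.get("retry_after")
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return float(value)
    match = _RETRY_RE.search(payload.get("description") or "")
    return float(match.group(1)) if match else None
```

The caller can then pass the result (plus any safety margin) straight to asyncio.sleep, which accepts floats.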
A14. APScheduler date-job collision when two windows end at the exact same second
Location: packages/server/src/notify_bridge_server/services/scheduler.py:1127-1132
(_drain_job_id_for). The job id is keyed on YYYYMMDDHHMMSS. Comment in
code acknowledges "two trackers... seconds different ... would collide", but
two windows ending at the exact same second still collide on a single job id
— replace_existing=True silently drops the second.
Scenario: 30 users with quiet_hours_end=07:00. All 30 windows end at
the same wall-clock second. Only one drain job is scheduled. That single
job fires drain_deferred_due() which scans all rows globally so all 30
get drained — actually fine. But if the global drain function ever
filters by user/tracker (a likely near-term change for multi-tenant), the
collision becomes silent data loss.
Fix: Either keep the global drain (and document the assumption) or
add a tracker_id segment to the job_id and let APScheduler dedup naturally.
A15. _handle_webhook_conflict reclaim races against a parallel admin action
Location: packages/server/src/notify_bridge_server/services/telegram_poller.py:163-218.
Scenario: Admin clicks "Switch to webhook mode" in the UI, which sets
update_mode=webhook and calls set_webhook(...). Concurrently, the next
poll tick for the same bot hits the conflict, calls delete_webhook → the
admin's webhook is wiped 1s after they set it. The poll tick checks
bot.update_mode != "polling" before the conflict reclaim, but the
reload is best-effort and the conflict reclaim path runs unconditionally
once entered.
Fix: Re-check bot.update_mode == "polling" inside
_handle_webhook_conflict before calling delete_webhook; or take an
advisory lock on the bot row for the duration of the mode flip.
A16. Discord 2000-char split breaks on Unicode codepoint boundaries
Location: packages/core/src/notify_bridge_core/notifications/discord/client.py:60-80
(_split_message).
Scenario: A template renders to 2050 chars with emoji at position
1998-1999 (each emoji is 2 surrogates / multi-byte UTF-8). The split uses
text.rfind("\n", 0, limit) and falls back to character index limit,
which is a Python str index → that part is OK in CPython 3, but if the
content contains a grapheme cluster (emoji + zero-width-joiner + skin tone),
slicing at limit mid-cluster renders as the broken emoji "□" in Discord.
Fix: Use a grapheme-cluster boundary library (e.g. regex module with
\X) or at minimum back off to the previous whitespace if limit is
inside a likely cluster.
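Where pulling in the third-party regex module is not an option, a standard-library heuristic can cover the common cases. The sketch below backs the cut position off past zero-width joiners, combining marks (which include variation selectors), and skin-tone modifiers. It is an approximation and not full grapheme segmentation: regional-indicator flag pairs, for example, are not handled. The function names are hypothetical.

```python
import unicodedata

ZWJ = "\u200d"


def _is_extender(ch: str) -> bool:
    """Code points that attach to the previous one and must not start a chunk."""
    return (
        ch == ZWJ
        or unicodedata.category(ch) in ("Mn", "Me", "Mc")  # combining marks, VS16
        or "\U0001F3FB" <= ch <= "\U0001F3FF"               # emoji skin-tone modifiers
    )


def safe_cut(text: str, limit: int) -> int:
    """Return a split index <= limit that does not fall inside a likely cluster."""
    cut = min(limit, len(text))
    while 0 < cut < len(text) and (_is_extender(text[cut]) or text[cut - 1] == ZWJ):
        cut -= 1
    return cut
```

A splitter would use text[:safe_cut(text, limit)] for the chunk and fall back to a hard cut if the result is 0.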
MEDIUM
A17. Per-target failure counter does not distinguish receivers within a target
Location: packages/server/src/notify_bridge_server/services/event_dispatch.py:311-333.
Scenario: A target has 10 receivers. 1 chat is blocked, 9 work. Today
maybe_emit_target_failure is called for the target — but the success
counter (record_target_success) is also called for the same target on the
other 9. Net counter behavior depends on call order. With the
default-threshold 5, this oscillates.
Fix: Track success/failure per receiver, not per target; or only call
maybe_emit_target_failure when all receivers failed for the target.
A18. _cleanup_old_events does not delete cancelled DeferredDispatch rows
Location: packages/server/src/notify_bridge_server/services/scheduler.py:332-364.
Scenario: The daily cleanup deletes EventLog, WebhookPayloadLog,
ActionExecution. Cancelled / fired / dropped DeferredDispatch rows live
forever in the DB. Active install with chatty providers accumulates millions
of rows; eventually the _load_pending_drain_jobs query, _trim_queue_if_needed,
and the catch-up scan all degrade.
Fix: Add delete(DeferredDispatch).where(status.in_(["fired", "dropped", "cancelled"]), fired_at < cutoff) to the cleanup.
A19. random.shuffle(shuffled) in _sort_assets uses non-deterministic seed
Location: packages/server/src/notify_bridge_server/services/dispatch_helpers.py:317-320.
Scenario: Two identical events arriving in close succession (deferred-
dispatch merge, then drain re-renders) shuffle into different orders. With
the deferred-dispatch coalescing logic, this produces a visual "they're not
the same album" surprise in the chat history.
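Fix: Seed random with a stable per-event digest of
event.event_type.value + event.collection_id + event.timestamp.isoformat().
Use hashlib for the digest rather than the built-in hash(), which is salted
per process for strings and therefore changes across restarts.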
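A deterministic shuffle can be sketched as follows. Because Python salts the built-in hash() of strings per process, the seed here is derived with hashlib so it stays the same across restarts. The function name and the flat string arguments are illustrative stand-ins for the event fields.

```python
import hashlib
import random


def stable_shuffle(items, event_type: str, collection_id: str, timestamp_iso: str) -> list:
    """Shuffle a copy of items with a seed derived from stable event fields."""
    key = f"{event_type}|{collection_id}|{timestamp_iso}".encode("utf-8")
    # First 8 bytes of the digest give a 64-bit seed that is identical in every process.
    seed = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # private RNG, global random state untouched
    return shuffled
```

Using a private random.Random instance also avoids disturbing other code that relies on the module-level generator.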
A20. _poll_tracker swallows exception, logs it with _LOGGER.error instead of _LOGGER.exception
Location: packages/server/src/notify_bridge_server/services/scheduler.py:657-666.
Scenario: An exception in check_tracker is logged as _LOGGER.error("Error polling tracker %d: %s", tracker_id, e) — no traceback. Production debugging
of "why is tracker 42 silently broken since yesterday" requires the stack.
Fix: Change to _LOGGER.exception("Error polling tracker %d", tracker_id).
A21. Long bot commands → /help reply > 4096 chars truncates without warning
Location: packages/server/src/notify_bridge_server/commands/handler.py:521-532,
combined with send_reply → send_telegram_message → _truncate to 4096.
Scenario: A user with 20 enabled commands runs /help. Each command +
description (RU) crosses 250 chars → 5000 chars total → truncated mid-command.
The user sees a half-list that suggests we forgot half the commands.
Fix: Split /help over multiple messages by command category (provider).
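Independent of how commands are grouped, the splitting itself can be done on line boundaries so no entry is cut in half. The helper below is a sketch with a hypothetical name; grouping by provider would simply decide the order of the input lines.

```python
def split_lines(lines, limit: int = 4096) -> list:
    """Pack lines into chunks of at most `limit` characters, breaking between lines."""
    chunks, current = [], ""
    for line in lines:
        candidate = line if not current else f"{current}\n{line}"
        if len(candidate) <= limit:
            current = candidate
            continue
        if current:
            chunks.append(current)
        # A single line longer than the limit is hard-split into limit-sized pieces.
        while len(line) > limit:
            chunks.append(line[:limit])
            line = line[limit:]
        current = line
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk would be sent as its own message, in order.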
A22. parse_command truncates to 512 chars — long search queries lost
Location: packages/server/src/notify_bridge_server/commands/parser.py:15.
Scenario: /search a very long query containing emoji 🎉 and more text that the user really meant to send because they pasted a long string from somewhere…
gets clipped to 512 chars silently. The trailing count parser then operates
on the truncated text, possibly extracting a count from mid-query.
Fix: Either reject >512 with parse_command returning a sentinel
"too_long" tuple, or just stop truncating — the Telegram limit is already
4096 and we already truncate the response side.
A23. Periodic catch-up scan can dispatch a stale event payload
Location: packages/server/src/notify_bridge_server/services/deferred_dispatch.py:628
(_process_row).
Scenario: An assets_added event is deferred at 22:00. At 06:00 the
quiet window ends, drain re-fetches link_data. The assets in event_payload
include URLs and asset metadata. But the user has since deleted those photos
from Immich. The dispatcher tries to download → 404. Notification shows
"5 photos added to Album X" but the actual media fails to attach.
Fix: For assets_added, re-validate asset existence against the
provider before dispatch (one batched getAssets call). Drop missing IDs
from the event, mark with "delivered_after_quiet_hours" + extra hint
"missing_count": N in details. For deferred windows >12h this is the
right behavior; for shorter windows the lookup is wasted work, so gate on
(now - deferred_at) >= timedelta(hours=6) (a timedelta has no .hours attribute).
A24. Watcher / scheduler restart can lose adaptive polling state
Location: packages/server/src/notify_bridge_server/services/scheduler.py:67-88
(_adaptive_state: dict).
Scenario: Module-level dict resets on restart. A tracker that had backed
off to polling on 1-in-4 ticks goes back to every-tick polling. Over a fleet of 50
trackers in steady-state idle, this triggers a thundering herd of every-tick
polls right after deploy. Combined with no DB-level rate limiting on the
upstream Immich/Gitea API, it can rate-limit the operator out of their own
services for ~5min.
Fix: Either persist the adaptive state in notification_tracker_state
(cheap on shutdown via atexit) or stagger the initial ticks via
APScheduler's next_run_time instead of relying on the existing jitter.
A25. defer_event return "cancelled" logic is incorrect in some merge paths
Location: packages/server/src/notify_bridge_server/services/deferred_dispatch.py:444.
Scenario: The cancelled return branch checks upd_added is None or upd_added.status == "cancelled" AND same for upd_removed. But if both
upd_added and upd_removed are None (i.e. there were no pending rows
to begin with), fully_cancelled is False → returns "merged". That's
fine. But the more subtle issue: an "insert" action with one of the rows
being cancelled returns "merged" — should be "inserted". The dashboard
"merged" status confuses the operator looking at why no defer row exists.
Fix: Rewrite as a clearer state machine: distinguish "inserted",
"merged_into_existing", "fully_cancelled".
A26. _fetch_bytes and _safe_get honor only 3 redirects with no Retry-After awareness
Location: packages/core/src/notify_bridge_core/notifications/telegram/client.py:217-268.
Scenario: Immich behind a CDN can chain 302 → 302 → 200. With 4 hops
it falls through to "Too many redirects". A user complains "old photos
suddenly missing in notifications".
Fix: Bump to 5 redirects and surface the chain in the error string for
easier debugging.
A27. No structured event log filter UI for "show me all drops in the last hour"
Location: packages/server/src/notify_bridge_server/api/status.py —
event_log rows have details.dispatch_status field but no API filter
exposes it. The frontend can fetch only via global filter on event_type.
Scenario: An operator sees "messages are missing today". They want to
filter event_log to dispatch_status in (dropped_quiet_hours_nondeferrable, deferred_then_dropped, deferred_then_failed). Today they can't.
Fix: Add dispatch_status and dispatched=true|false as first-class
event_log columns (denormalized from details), plus API + UI filter.
A28. _render_cmd_template falls back to "[No template: X]" user-visible text
Location: packages/server/src/notify_bridge_server/commands/handler.py:111-115.
Scenario: An operator removes a template slot by mistake. The next user
who runs /random sees [No template: response_random] in chat. Not just
ugly — it leaks internal slot names.
Fix: Show a friendly "Sorry, something went wrong on our side" + log at
error level. Better: refuse to disable the slot if it's referenced.
LOW
A29. _truncate's ellipsis can land inside a multi-byte char
The marker "…" is one Unicode codepoint (3 bytes UTF-8) but the truncate
counts characters, not bytes. Telegram counts UTF-16 code units, so for a
4090-char message ending in emoji, the calculation is off by a small constant.
Won't break sends but messages may end up slightly longer than TELEGRAM_MAX_TEXT_LENGTH
allows. Re-measure in UTF-16 code units (len(s.encode('utf-16-le')) // 2).
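Measuring and truncating in UTF-16 code units can be sketched as below. The function names are hypothetical; the only assumption carried over from the text is that the limit is counted in UTF-16 units.

```python
def utf16_len(s: str) -> int:
    """Length of s in UTF-16 code units (astral characters count as 2)."""
    return len(s.encode("utf-16-le")) // 2


def truncate_utf16(s: str, limit: int, marker: str = "…") -> str:
    """Truncate so the result, marker included, fits in `limit` UTF-16 units."""
    if utf16_len(s) <= limit:
        return s
    budget = limit - utf16_len(marker)
    out, used = [], 0
    for ch in s:
        units = 2 if ord(ch) > 0xFFFF else 1
        if used + units > budget:
            break
        out.append(ch)
        used += units
    return "".join(out) + marker
```

Iterating by code point means a surrogate pair is never split; a character is either kept whole or dropped.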
A30. NotificationDispatcher._render_cache set to fresh dict on every dispatch — comment says "reuse"
The instance attribute self._render_cache is reset to {} at the start
of every _send_to_target (line 245). The cache only helps across receivers
within one target, not across targets. The comment at line 111-115 implies
broader reuse. Either align comment with reality or actually share across
targets within one dispatch() call.
A31. Frontend entity-cache.svelte.ts doesn't propagate stale-cache errors
The shared $state-based caches return stale data silently if the underlying
fetch fails after a successful initial load. A user sees old target list
during an outage and is confused why edits aren't sticking.
Part B — Missing functionality and "cool feature" gaps
Tier legend: must-have = blocks prod for any non-trivial install; nice-to-have = clear value, ship in next minor; aspirational = ship when v1.0+ slows down. Effort: S ≈ 1-2 days; M ≈ 1 week; L ≈ 2+ weeks.
Already in the backlog (post-v0.8.1 status check)
B1. Target-level quiet hours (per-target DND, multi-window, days-of-week, silent mode)
Status: Still missing in v0.8.1. The backlog item proposed a v1 cut
(target-level windows + silent mode for Telegram = disable_notification=True).
None of the proposed code paths exist:
- notification_target.quiet_hours_json column: not present.
- disable_notification=True plumbing through TelegramClient.send_message: not present.
- Days-of-week filter: not present.
Pitch: Quiet hours bind to the watcher (tracking config); users want DND at the destination. "Don't ping my phone at night, regardless of which provider".
Who benefits: Every user. Today they have to recreate per-link windows.
Effort: M (1 week: backend dispatcher gate + frontend Aurora-style fieldset).
Tier: must-have for prod.
B2. Immich Smart Actions expansion (auto-favorite by person, auto-archive, share-link rotation)
Status: Auto-Organize exists; no other action descriptors are shipped.
Pitch: Reuse the existing action descriptor pipeline. Auto-favorite-by-person is the smallest cut.
Effort: M per action (a few days each).
Tier: nice-to-have.
B3. Block-based template builder
Status: Not started. JinjaEditor is unchanged.
Effort: L — frontend-only but big.
Tier: aspirational.
Newly identified — must-have for prod
B4. Webhook delivery dedup table + "Test Delivery" replay
Pitch: Add the dedup table from A1, plus a /api/webhooks/{provider_id}/replay/{delivery_id}
endpoint that admin can hit to re-dispatch a stored payload without the upstream
provider needing to resend. Combined with the existing WebhookPayloadLog,
this is "click to retest" in the UI.
Who benefits: Every webhook provider. Replay is invaluable for debugging
template edits.
Effort: M.
Tier: must-have for prod.
B5. "Send test message" / template playground
Pitch: From the template editor, click "Try this template against the
last received event" → render preview, optionally send to a sandbox chat.
Bypass dispatch but exercise the full Jinja pipeline.
Who benefits: Every template edit today is a leap of faith — the operator
modifies the template, waits for the next real event, hopes nothing breaks.
Effort: S-M. The preview infrastructure already exists
(services/sample_context.py); add a "send to chat X" button.
Tier: must-have for prod.
B6. Template versioning + rollback
Pitch: Auto-snapshot each template on save (last 10 revisions). UI shows
diff between version N and N-1, "Restore" button. Same for command templates.
Who benefits: An operator who tweaks a template at midnight and goofs
the syntax needs an undo button.
Effort: M. New template_revision table; new endpoints; UI button.
Tier: must-have for prod.
B7. Bulk operations on trackers / targets / links
Pitch: Multi-select in lists → "disable selected", "delete selected", "export selected templates as JSON bundle", "move to user X".
Who benefits: Operators with >10 trackers. A common pain point: deploying the bridge for a new family member requires N clicks per tracker.
Effort: M (frontend-heavy).
Tier: must-have for prod.
B8. Bot blocked / chat-not-found auto-disable + dashboard
Pitch: Detect Telegram 403 / 400 chat-related errors. Mark the receiver
or TelegramChat as disabled_by_remote. Surface in a "Stale receivers"
admin view with a "Try resending invite" / "Delete chat" button.
Who benefits: Every Telegram user. Today the bridge silently sprays
errors until a human looks.
Effort: S.
Tier: must-have for prod.
B9. Forum-thread (topic) routing for Telegram
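Pitch: Per-receiver message_thread_id field, auto-detected from incoming
command messages. UI: when adding a chat that's a forum (getChat's is_forum),
show a topic selector. The Bot API has no method that lists a forum's topics
(getForumTopicIconStickers returns only the icon stickers), so the selector
would have to be populated from topics seen in incoming messages.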
Who benefits: Any group install where the user wants notifications in a
dedicated topic.
Effort: M.
Tier: must-have for prod.
B10. Telegram inline buttons + callback queries
Pitch: Templates can declare {% buttons %} with action descriptors.
Bridge listens for callback_query updates, dispatches to a registered
action (e.g. "Mark album as favorite", "Snooze this tracker for 1h", "Run
HA service light.turn_off").
Who benefits: Power users. Foundation for several other features
(Immich duplicate-cluster review, HA action button → service call, snooze).
Effort: L.
Tier: nice-to-have but unlocks the next 3 items.
B11. User snooze / mute via bot command
Pitch: /snooze 1h mutes the bot's outbound chat for 1h.
/mute provider gitea mutes a whole provider for that chat. /wake undoes.
Implemented as a per-receiver snoozed_until column.
Effort: S-M.
Tier: must-have for prod (user-side relief valve).
Newly identified — nice-to-have
B12. Per-target / per-user rate limit (send-side)
Pitch: Cap outbound messages per minute per receiver. Existing 429
backoff handles Telegram's limit, but a runaway template / event-storm
provider can still spray the user's phone with 200 messages.
Effort: S. Token bucket per chat_id in _send_telegram.
Tier: nice-to-have.
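A per-chat token bucket is small enough to sketch in full. The class below is illustrative: the parameter names and the injectable clock are assumptions, and a real integration would keep one bucket per chat_id in a dict.

```python
import time


class TokenBucket:
    """Allow short bursts, then throttle to a steady per-minute rate."""

    def __init__(self, rate_per_min: float, burst: int, clock=time.monotonic):
        self.rate = rate_per_min / 60.0   # tokens added per second
        self.capacity = float(burst)      # maximum tokens that can accumulate
        self.tokens = float(burst)        # start full so the first burst passes
        self.clock = clock
        self.updated = clock()

    def allow(self) -> bool:
        """Consume one token if available; return False when the send should wait."""
        now = self.clock()
        # Refill for the time elapsed since the last call, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

When allow() returns False the sender can either drop, queue, or coalesce the message; which of those is right depends on the event type.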
B13. Message dedup window (idempotency key per outbound message)
Pitch: SHA256 of (target_id, receiver_id, rendered_message, event_collection_id). If the same key was sent in the last 5min, skip.
Effort: S.
Tier: nice-to-have (lots of overlap with A1+A2 but addresses the
end-of-pipeline dedup, after all coalescing).
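The key derivation and the 5-minute window might be sketched like this. The in-memory dict is an assumption chosen for brevity; a multi-process deployment would need shared storage. All names are hypothetical.

```python
import hashlib
import json


def outbound_key(target_id, receiver_id, rendered_message: str, event_collection_id) -> str:
    """SHA256 over the four fields, serialized unambiguously via JSON."""
    blob = json.dumps(
        [target_id, receiver_id, rendered_message, event_collection_id],
        ensure_ascii=False,
        separators=(",", ":"),
    )
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


class DedupWindow:
    """Remember recently sent keys and suppress repeats inside the window."""

    def __init__(self, window_seconds: float = 300.0):
        self.window = window_seconds
        self._seen = {}  # key -> timestamp of the send that was allowed

    def should_send(self, key: str, now: float) -> bool:
        # Forget entries that have aged out of the window.
        self._seen = {k: t for k, t in self._seen.items() if now - t < self.window}
        if key in self._seen:
            return False
        self._seen[key] = now
        return True
```

Serializing the tuple as a JSON list avoids collisions that plain string concatenation of the fields could produce.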
B14. Weekly digest / per-target stats / per-provider error rate
Pitch: Cron-based weekly summary email/Telegram. "Top 5 noisy trackers", "Receivers with >X% failure rate", "Top 5 days of the week with the most activity". Operator preventive maintenance.
Effort: M.
Tier: nice-to-have.
B15. Mobile-friendly minimal mode for the SPA
Pitch: The Aurora redesign is a lot for mobile. A "manage from phone" minimal layout: list of trackers, click to toggle, click to mute. Stops operators from needing a desktop to silence a chatty tracker at 1am.
Effort: M.
Tier: nice-to-have.
B16. Audit log of admin actions
Pitch: New audit_log table. Every create/update/delete on
NotificationTracker, NotificationTarget, TemplateConfig, ServiceProvider,
TelegramBot, User, etc. writes a row with (user_id, action, entity_type, entity_id, before_json, after_json, ip, ua). Admin UI tab.
Effort: M. SQLAlchemy event listeners on the affected models.
Tier: nice-to-have for multi-admin installs; must-have if any
compliance requirement applies.
B17. Health → not just /ready, but per-component status page
Pitch: /api/health/components returns {providers: [{id, last_ok_at, last_error}], targets: [{id, last_ok_at, last_error}], scheduler: {job_count, next_fires}}. Frontend "Status" tab.
Effort: S-M. The data is already in EventLog / scheduler API.
Tier: nice-to-have.
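One way the per-component entries in B17 might be derived from event rows, as a sketch only. The event dict shape (at, ok, error) is an assumption and does not reflect the actual EventLog schema. It also assumes last_error means the most recent failure, regardless of whether a success followed it.

```python
from datetime import datetime, timezone
from typing import Iterable, Optional, TypedDict


class ComponentStatus(TypedDict):
    id: int
    last_ok_at: Optional[str]
    last_error: Optional[str]


def summarize_component(component_id: int, events: Iterable[dict]) -> ComponentStatus:
    """Reduce event rows to the {id, last_ok_at, last_error} shape."""
    last_ok: Optional[datetime] = None
    last_fail: Optional[dict] = None
    for event in events:
        if event["ok"]:
            if last_ok is None or event["at"] > last_ok:
                last_ok = event["at"]
        elif last_fail is None or event["at"] > last_fail["at"]:
            last_fail = event
    return {
        "id": component_id,
        "last_ok_at": last_ok.isoformat() if last_ok else None,
        "last_error": last_fail["error"] if last_fail else None,
    }
```

A frontend "Status" tab could then mark a component as degraded when the latest failure is newer than last_ok_at, which would require returning the failure timestamp as well.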
B18. Provider unreachable backoff + escalation
Pitch: Today bridge_self emits bridge_self_poll_failures after N
consecutive fails. Add (a) exponential backoff on the polling interval after
M failures so we don't hammer a down host, and (b) recovery notification
when the provider comes back.
Effort: S.
Tier: nice-to-have.
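The two halves of B18 can be sketched as pure functions, assuming the interval doubles for each failure at or beyond the threshold M and is capped at a maximum. The function names, the doubling factor, and the cap are illustrative choices, not existing bridge behaviour.

```python
def next_poll_interval(base_seconds: int, consecutive_failures: int,
                       threshold: int, max_seconds: int) -> int:
    """Polling interval to use after the given number of consecutive failures."""
    if consecutive_failures < threshold:
        # Below M failures: keep the configured interval.
        return base_seconds
    # From the M-th failure on, double per additional failure, up to the cap.
    exponent = consecutive_failures - threshold + 1
    return min(max_seconds, base_seconds * 2 ** exponent)


def should_notify_recovery(previous_failures: int, threshold: int, poll_ok: bool) -> bool:
    """Send a recovery notice only if the outage had reached the threshold."""
    return poll_ok and previous_failures >= threshold
```

For example, with a 60-second base, a threshold of 3, and a 3600-second cap, the interval stays at 60 for the first two failures, becomes 120 on the third, 240 on the fourth, and settles at 3600 once the doubling exceeds the cap.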
B19. RSS provider
Pitch: Generic RSS/Atom feed poller. One more provider, reuses event_dispatch. Long-tail value (operator wants "notify me when a blog publishes").
Effort: M.
Tier: nice-to-have.
B20. Mobile push / FCM channel
Pitch: A dedicated FCM "Receiver" type so the user can ship their own companion app. Today Telegram is the only realtime channel; email is too slow; webhook out is for plumbing.
Effort: L.
Tier: aspirational.
Newly identified — aspirational
B21. Conversation threading per source (one notification thread per album / repo)
Pitch: Use Telegram reply_parameters to chain all notifications about
"Album X" as a single thread that grows over time. Today every notification
is a top-level message. Threading turns the chat into a navigable history.
Effort: M. Store last_message_id per (target_id, collection_id) and
pass it as reply_parameters.message_id (the successor to the older
reply_to_message_id parameter).
Tier: aspirational but a clear differentiator.
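A sketch of the B21 bookkeeping, assuming an in-memory map keyed by (target_id, collection_id). ThreadIndex is a hypothetical name and a real implementation would persist the mapping. The reply_parameters object with message_id and allow_sending_without_reply follows the Telegram Bot API; setting the latter lets the send succeed even if the earlier message was deleted.

```python
from typing import Dict, Tuple


class ThreadIndex:
    """Tracks the last message sent per (target_id, collection_id)."""

    def __init__(self) -> None:
        self._last: Dict[Tuple[int, str], int] = {}

    def build_payload(self, target_id: int, collection_id: str,
                      chat_id: int, text: str) -> dict:
        """sendMessage payload that replies to the previous message, if any."""
        payload: dict = {"chat_id": chat_id, "text": text}
        last = self._last.get((target_id, collection_id))
        if last is not None:
            payload["reply_parameters"] = {
                "message_id": last,
                "allow_sending_without_reply": True,
            }
        return payload

    def record_sent(self, target_id: int, collection_id: str, message_id: int) -> None:
        """Call with the message_id returned by Telegram after a successful send."""
        self._last[(target_id, collection_id)] = message_id
```

The first notification for a collection goes out as a top-level message, and each later one replies to its predecessor, which produces the chain described in the pitch.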
B22. A/B test variants for templates
Pitch: A template config can carry 2 variants. The dispatcher hash-routes receivers to A or B; the dashboard shows "variant A's response time / click rate / receiver mute rate".
Effort: L.
Tier: aspirational.
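The hash-routing step in B22 could be as small as the sketch below, assuming the assignment is derived from the template config id and receiver id so that a receiver always lands on the same variant. assign_variant is a hypothetical name.

```python
import hashlib
from typing import Sequence


def assign_variant(template_config_id: int, receiver_id: int,
                   variants: Sequence[str] = ("A", "B")) -> str:
    """Deterministically map a receiver to one variant of a template config."""
    digest = hashlib.sha256(f"{template_config_id}:{receiver_id}".encode("utf-8")).digest()
    # Use the first 8 bytes as an integer and bucket by variant count.
    bucket = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[bucket]
```

Including the template config id in the hash input means a receiver who gets variant A for one template is not systematically placed in A for every other template. A stable digest is used instead of Python's built-in hash(), which is salted per process for strings.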
B23. Dark-launch a new template before enabling it
Pitch: "Send-to-sandbox-chat-only" toggle on a template config. The new template renders against real events but only goes to one operator's chat for 1 week. Then promote to production.
Effort: M. Builds on template versioning (B6).
Tier: aspirational.
B24. Scheduled template changes
Pitch: "On 2026-12-25 at 09:00, switch template_config X to draft Y". Useful for holiday-themed greetings or batch migrations.
Effort: M.
Tier: aspirational.
B25. HA service-call from a Telegram inline button
Pitch: Building on B10. A template renders {% button hass:light.turn_off target=living_room %}. User clicks → bridge calls HA light.turn_off.
Effort: M (after B10).
Tier: aspirational.
Ship-blocker checklist (do not widen user audience without)
Order is rough priority (top first). Most are also called out in Part A.
- A1 — Webhook idempotency table (Gitea/Planka/generic). Without this, one upstream retry storm can double-/quadruple-spray every user.
- A2 — Deferred-dispatch crash window. A redeploy mid-drain duplicates every queued notification. Implement either the dispatch_id pre-commit OR the in_flight state machine.
- A3 — Persist Telegram update offset. Same root cause class as A1/A2; matters less if A1+A2 are fixed but should land together.
- A4 / B8 — Bot blocked / chat-not-found auto-disable. A user blocking the bot must not generate infinite errors.
- A11 — Webhook JSON depth/node cap (mirror the backup guard).
- A9 — Quiet-hours start == end confirmation; either accept "always quiet" semantics or reject in the API validator.
- A8 — DST handling in quiet-hours overnight window. Verify with tests that include known transition timestamps.
- B5 — "Send test message" / template playground. Without this, every template edit is a blind change against a live system.
- B6 — Template versioning + rollback. Pair with B5.
- A5 / B9 — Forum-thread (topic) routing. Any non-trivial Telegram group install needs this.
- B11 — User snooze / mute via bot command. Relief valve when the bridge gets too chatty.
- B7 — Bulk operations on trackers / targets / links. Operability floor for any install with >10 trackers.
Everything else in Part B is upside, not a blocker.