notify-bridge/.claude/reviews/bugs-features-review.md

# Bugs + Missing Features — Production-Readiness Review

Repo: `c:\Users\Alexei\Documents\service-to-notification-bridge` (v0.8.1 baseline)
Date: 2026-05-22
Scope: full repo (backend Python/FastAPI, Svelte 5 frontend, providers + dispatchers + bot commands)

---

## Executive summary

- **The code is in much better shape than typical pre-1.0 code.** Quiet-hours,
  SSRF, JWT, secret redaction, rate-limit fan-out caps, partition-by-media-kind,
  parse_mode retry, scheduler misfire-grace, Prometheus metrics, deep
  healthcheck, and per-receiver render cache are all already implemented and
  well-tested.
- **The single biggest shipping risk is webhook idempotency.** Gitea, Planka,
  and the generic webhook endpoint all dispatch on every POST regardless of
  redelivery: there is no `X-Gitea-Delivery` / `X-GitHub-Delivery` dedup table.
  An upstream retry storm sends the same notification N times.
- **The deferred-dispatch drain has a duplicate-send window** if the process
  dies between `dispatcher.dispatch()` returning and `session.commit()` —
  the row stays `pending` and the periodic catch-up scan re-drains it.
- **Telegram update offset (`_last_update_id`) is in-memory only** — on
  restart, the bot replays already-handled updates or skips ones Telegram
  has discarded. Combined with no per-update idempotency, this is a
  duplicate-command surface.
- **Several Telegram features are silently unsupported**: forum threads
  (`message_thread_id`), bot-blocked-by-user detection (403 → keep retrying
  forever), and inline-button callback queries. None blocks shipping today
  but each is a near-term ask from any real user.
- **No template versioning / dry-run / playground** — every template edit is
  immediately live. There is no way to validate a new template against a
  sample payload before flipping the switch, and no rollback path.
- **Frontend lacks bulk operations and import/export of templates+targets.**
  An operator with 30 trackers cannot bulk-toggle, bulk-edit, or move a
  template across users.

---

## Part A — Bugs and reliability issues

Severity legend: **CRITICAL** = data loss / duplicate user-visible messages /
silent stop-shipping; **HIGH** = wrong behavior under realistic conditions;
**MEDIUM** = degrades UX or operability; **LOW** = polish.

### CRITICAL

#### A1. Webhook redelivery causes duplicate notifications (no idempotency)

**Location**: `packages/server/src/notify_bridge_server/api/webhooks.py:156`
(`gitea_webhook`), `:225` (`planka_webhook`), `:427` (`generic_webhook`).
**Scenario**: Gitea retries a webhook after 30s if the bridge returns 5xx,
times out under load, or if the operator clicks "Test Delivery" twice. Every
retry produces a fresh notification because the handlers never check
`X-Gitea-Delivery` (Gitea's per-delivery UUID), nor do they record any
event_id/hash for `parse_generic_webhook` events.
**Fix**: Add a `webhook_delivery` table with `(provider_id, delivery_id)`
unique constraint and `created_at`. Insert before dispatch (`INSERT OR IGNORE`
on SQLite, `ON CONFLICT DO NOTHING` on Postgres); if the insert is a no-op,
return `{"ok": true, "skipped": "duplicate"}`. For Gitea use the
`X-Gitea-Delivery` header; for Planka use a hash of `event_type +
payload.id + payload.createdAt`; for generic webhooks use a configurable
JSONPath expression to derive an idempotency key, falling back to a SHA256 of
the raw body. TTL prune older than 7 days.
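
The insert-before-dispatch step can be sketched with the standard-library
`sqlite3` module. This is a minimal illustration, not the bridge's actual
schema: the table layout, `delivery_key`, and `claim_delivery` are hypothetical
names, and only the Gitea header plus the SHA-256 fallback are shown.

```python
import hashlib
import sqlite3


def open_dedup_db() -> sqlite3.Connection:
    """Create an in-memory stand-in for the proposed webhook_delivery table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE webhook_delivery (
            provider_id INTEGER NOT NULL,
            delivery_id TEXT NOT NULL,
            created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
            UNIQUE (provider_id, delivery_id)
        )
        """
    )
    return conn


def delivery_key(headers: dict, raw_body: bytes) -> str:
    """Prefer the upstream delivery ID; fall back to a SHA-256 of the raw body."""
    lowered = {k.lower(): v for k, v in headers.items()}
    return lowered.get("x-gitea-delivery") or hashlib.sha256(raw_body).hexdigest()


def claim_delivery(conn: sqlite3.Connection, provider_id: int, key: str) -> bool:
    """Return True only for the first caller that records this delivery."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO webhook_delivery (provider_id, delivery_id) "
        "VALUES (?, ?)",
        (provider_id, key),
    )
    conn.commit()
    # rowcount is 0 when the UNIQUE constraint turned the insert into a no-op.
    return cur.rowcount == 1
```

A handler would call `claim_delivery` first and return the
`{"ok": true, "skipped": "duplicate"}` response whenever it yields `False`.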

#### A2. Deferred-dispatch drain can double-send on process crash

**Location**: `packages/server/src/notify_bridge_server/services/deferred_dispatch.py:721-758`.
**Scenario**: Inside `_process_row`, `dispatcher.dispatch()` actually
delivers the Telegram message (HTTP 200 returned, the user's phone buzzes).
The function then sets `row.status = "fired"` (line 734) but the surrounding
`session.commit()` (line 577) hasn't run yet. Process is killed (OOM,
SIGTERM during deploy, host reboot). On restart, `_run_deferred_drain_catchup`
re-fetches the still-`pending` row and dispatches it again — **the user gets
the same album twice**.
**Fix**: Either (a) record an outbound dedup key per-row before dispatch
(`row.dispatch_id = uuid4(); session.commit()` first), then ask the channel
client to send-or-no-op based on that ID; or (b) flip the row to a
`"in_flight"` state with a short timeout in a pre-dispatch transaction so a
restart sees it as poisoned and aborts. Option (a) is more correct but
needs per-channel cooperation; option (b) is the cheap fix.
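
Option (b) amounts to a compare-and-set on the row status that is committed
before any network call. The sketch below uses `sqlite3` and a reduced
two-column table. `make_queue`, `claim_row`, and `drain_one` are hypothetical
names, and the real drain would go through the async SQLAlchemy session.

```python
import sqlite3


def make_queue() -> sqlite3.Connection:
    """In-memory stand-in for deferred_dispatch with one pending row."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE deferred_dispatch (id INTEGER PRIMARY KEY, status TEXT NOT NULL)"
    )
    conn.execute("INSERT INTO deferred_dispatch (id, status) VALUES (1, 'pending')")
    conn.commit()
    return conn


def claim_row(conn: sqlite3.Connection, row_id: int) -> bool:
    """Flip pending -> in_flight; at most one caller can win the claim."""
    cur = conn.execute(
        "UPDATE deferred_dispatch SET status = 'in_flight' "
        "WHERE id = ? AND status = 'pending'",
        (row_id,),
    )
    conn.commit()  # committed BEFORE dispatch, so a crash leaves 'in_flight'
    return cur.rowcount == 1


def drain_one(conn: sqlite3.Connection, row_id: int, send) -> str:
    """Dispatch a row once; rows that are not claimable are skipped."""
    if not claim_row(conn, row_id):
        return "skipped"
    send(row_id)
    conn.execute("UPDATE deferred_dispatch SET status = 'fired' WHERE id = ?", (row_id,))
    conn.commit()
    return "fired"
```

The trade-off is that a crash between the claim and the send now loses the
message instead of duplicating it, so stuck `in_flight` rows still need a
timeout and an operator-visible status.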

#### A3. Telegram update offset is in-memory only — restart replays or loses commands

**Location**: `packages/server/src/notify_bridge_server/services/telegram_poller.py:31`
(`_last_update_id: dict[int, int] = {}`).
**Scenario**: A user types `/random Family`. Telegram delivers update_id=4711.
The bridge processes the command, sends back the media, and crashes before
APScheduler ticks again. On restart, `_last_update_id` is empty, so we call
`getUpdates(offset=None)` → Telegram returns 4711 again → we send the user
the same album a second time. Conversely, if Telegram's 24-hour retention
expired during a long outage, we silently skip pending updates.
**Fix**: Persist last_update_id in DB (`telegram_bot.last_update_id` column).
Combine with A2-style command idempotency by inserting
`(bot_id, update_id)` into a dedup table before processing.

### HIGH

#### A4. Telegram "bot blocked by user" / "chat not found" never short-circuits

**Location**: `packages/core/src/notify_bridge_core/notifications/telegram/client.py`
(`send_message`, `_upload_media`, etc.). Errors with
`error_code == 403` (Forbidden, "Bot was blocked by the user") and 400
"chat not found" / "user is deactivated" are returned as failures, but
nothing records them in a way that would get the receiver removed or disabled.
**Scenario**: A user blocks the bot. Every scheduled "Good morning memory"
fires a sendMessage that Telegram instantly 403s. Bridge logs an error,
moves on, repeats forever. The bridge_self target-failure counter eventually
fires but the underlying receiver is never disabled. With many such chats
the operator has no easy cleanup path.
**Fix**: In the dispatcher, on `error_code in (403, 400 with description
matching "chat not found"/"user is deactivated")`, automatically set
`TelegramChat.commands_enabled = False` and either flag the receiver as
`disabled` with reason `blocked_by_user` or surface it via a new
`/admin/blocked-chats` view. Also stop further retries that round.
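
The classification step could look like the sketch below. The function name
and the marker tuple are hypothetical. The description strings are the ones
quoted in this item, matched case-insensitively.

```python
# Substrings that mark a 400 response as permanent for this chat.
PERMANENT_400_MARKERS = ("chat not found", "user is deactivated")


def is_permanent_failure(error_code: int, description: str) -> bool:
    """True when retrying cannot help and the receiver should be disabled."""
    if error_code == 403:
        # Forbidden, e.g. "bot was blocked by the user".
        return True
    lowered = description.lower()
    return error_code == 400 and any(m in lowered for m in PERMANENT_400_MARKERS)
```

A dispatcher would check this once per failed send and, on `True`, disable the
receiver and skip the remaining retries for that round.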

#### A5. Telegram forum-thread (topic) routing not supported

**Location**: telegram client never accepts/sends `message_thread_id`.
**Scenario**: Operator points the bridge at a group's "Releases" forum
topic. Today every message lands in the General topic instead — there is
no way to specify the topic. This is a hard requirement for any non-trivial
group install. Currently `reply_parameters` is the only thread-adjacent
field used; `message_thread_id` is silently absent.
**Fix**: Add an optional `message_thread_id` per-receiver (or per-target)
config, pass through `send_message`, `_upload_media`, and `_post_media_group`.
Auto-extract from incoming command updates' `message.message_thread_id` so
the bot can reply into the same topic.

#### A6. `bot.token` read after commit without refresh in webhook flow

**Location**: `packages/server/src/notify_bridge_server/commands/webhook.py:92-97`.
**Scenario**: The comment acknowledges "AsyncSession expires instances on
commit" and snapshots `bot_id`/`bot_token` before commit, but `await
session.refresh(bot)` is also called after the commit. If `session.refresh`
fails (e.g. row was deleted by an admin concurrently — bot rotation), the
exception is caught as a warning and the rest of the handler still runs
using the stale local `bot_id`/`bot_token`. The window is small but real.
**Fix**: Remove the `session.refresh(bot)` since the snapshot already
covers everything the handler needs. The refresh adds risk for no gain.

#### A7. Deferred-dispatch coalescing has a JSON-mutation bug under concurrent defers

**Location**: `packages/server/src/notify_bridge_server/services/deferred_dispatch.py:307`
(`_find_pending_asset_rows`).
**Scenario**: Two near-simultaneous `assets_added` events for the same
`(link_id, collection_id)` from two upstream pollers (HA chat-bus +
periodic Immich). Both call `defer_event` concurrently. The two transactions
both see "no pending row", both `session.add(new_row)`, and SQLite cheerfully
inserts two rows. The drain then fires both, sending the same combined media
twice. Note that the partial UNIQUE index from v0.8.1 protects only the
`bridge_self` provider row, not the deferred queue.
**Fix**: Add a partial UNIQUE index `UNIQUE(link_id, collection_id, event_type)
WHERE status = 'pending'` on `deferred_dispatch`, then convert `defer_event`
to `INSERT ... ON CONFLICT (link_id, collection_id, event_type) WHERE status =
'pending' DO UPDATE` (the conflict target has to repeat the index predicate for
a partial index to be matched) and merge `event_payload` inside the SQL or in
a re-read+retry loop.
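
The re-read+retry variant can be sketched with `sqlite3`. The reduced table,
the `{"assets": [...]}` payload shape, and the function names are assumptions
made for illustration. The partial index is what turns the second concurrent
insert into an `IntegrityError` instead of a duplicate row.

```python
import json
import sqlite3


def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE deferred_dispatch (id INTEGER PRIMARY KEY, link_id INTEGER, "
        "collection_id TEXT, event_type TEXT, status TEXT NOT NULL DEFAULT 'pending', "
        "event_payload TEXT NOT NULL)"
    )
    # At most one pending row per (link, collection, event type).
    conn.execute(
        "CREATE UNIQUE INDEX uq_pending ON deferred_dispatch "
        "(link_id, collection_id, event_type) WHERE status = 'pending'"
    )
    return conn


def defer_event(conn, link_id, collection_id, event_type, asset_ids) -> str:
    key = (link_id, collection_id, event_type)
    for _ in range(3):
        try:
            with conn:  # commits on success, rolls back on error
                conn.execute(
                    "INSERT INTO deferred_dispatch "
                    "(link_id, collection_id, event_type, event_payload) "
                    "VALUES (?, ?, ?, ?)",
                    key + (json.dumps({"assets": list(asset_ids)}),),
                )
            return "inserted"
        except sqlite3.IntegrityError:
            with conn:
                row = conn.execute(
                    "SELECT id, event_payload FROM deferred_dispatch WHERE link_id = ? "
                    "AND collection_id = ? AND event_type = ? AND status = 'pending'",
                    key,
                ).fetchone()
                if row is None:
                    continue  # the pending row was drained meanwhile; retry insert
                merged = json.loads(row[1])["assets"]
                merged += [a for a in asset_ids if a not in merged]
                conn.execute(
                    "UPDATE deferred_dispatch SET event_payload = ? WHERE id = ?",
                    (json.dumps({"assets": merged}), row[0]),
                )
            return "merged"
    raise RuntimeError("could not defer event after 3 attempts")
```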

#### A8. Quiet-hours overnight window + DST transition can produce wrong fire_at

**Location**: `packages/server/src/notify_bridge_server/services/dispatch_helpers.py:121-128`.
**Scenario**: A user in `Europe/Minsk` (UTC+3, no DST anymore) with quiet
hours 22:00-06:00 is unaffected. For a user in a DST-observing zone (e.g.
`America/New_York`), on the "spring forward" night where 2:00 → 3:00, an
event arriving around the transition gets `end_today =
now_local.replace(hour=6, minute=0)`. But `.replace()` only swaps wall-clock
fields and performs no DST normalization, so the resulting `datetime` may sit
in the skipped hour or have ambiguous DST status. Two hours later, the
dispatcher sees the quiet window as "still active" or "30 min ago" depending
on the system.
**Fix**: After `.replace(hour=t_end.hour, minute=t_end.minute, ...)`,
normalize the result by round-tripping through UTC
(`dt.astimezone(timezone.utc).astimezone(tz)`); `tz.localize` is a pytz API and
does not exist on `zoneinfo.ZoneInfo`. Explicitly handle the `fold=` parameter
for ambiguous times. Add tests using
`zoneinfo.ZoneInfo("America/New_York")` and known DST transition dates.
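
A minimal sketch of the UTC round-trip, assuming the host has an IANA time
zone database available to `zoneinfo`. `normalize_local` and `quiet_end_utc`
are hypothetical helper names, and 2024-03-10 is the US spring-forward date
used as the test case.

```python
from datetime import datetime, time, timezone
from zoneinfo import ZoneInfo


def normalize_local(dt: datetime, tz: ZoneInfo) -> datetime:
    """Map a possibly non-existent wall time onto a real instant in tz."""
    return dt.astimezone(timezone.utc).astimezone(tz)


def quiet_end_utc(now_local: datetime, t_end: time, tz: ZoneInfo) -> datetime:
    """Quiet-window end for today's local date, returned as a UTC instant."""
    local = now_local.astimezone(tz)
    end_local = local.replace(
        hour=t_end.hour, minute=t_end.minute, second=0, microsecond=0, fold=0
    )
    # Comparing in UTC avoids same-tzinfo comparisons, which Python performs
    # on wall-clock fields only.
    return end_local.astimezone(timezone.utc)


ny = ZoneInfo("America/New_York")
# 02:30 does not exist on this date; the round-trip lands on 03:30 EDT.
skipped = datetime(2024, 3, 10, 2, 30, tzinfo=ny)
fixed = normalize_local(skipped, ny)
```

Keeping `fire_at` in UTC from this point on means later comparisons do not
depend on how the local zone treats the transition.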

#### A9. Quiet-hours `start == end` returns None — silently no quiet hours

**Location**: `packages/server/src/notify_bridge_server/services/dispatch_helpers.py:110-111`.
**Scenario**: User UI submits `quiet_hours_start = "00:00"` and
`quiet_hours_end = "00:00"`, thinking "all day quiet". The function returns
`None` (no quiet window) — the user gets pinged at 3am even though the UI
says "quiet hours enabled". Same code path eats malformed times silently.
**Fix**: Bubble up `ValueError`/`malformed input` to the API validator on
write so the user gets a 422 with a specific error message rather than
silently broken behavior. Define `00:00-00:00` as "always quiet" or reject
it explicitly with a clear error.
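
A write-time validator along these lines would surface both problems. It is a
sketch that picks the "reject" option; treating `00:00-00:00` as "always quiet"
is the other valid choice. `parse_quiet_window` is a hypothetical name, and
the API layer would map the `ValueError` to a 422.

```python
from datetime import time


def parse_quiet_window(start: str, end: str) -> tuple:
    """Parse two HH:MM strings, raising ValueError instead of returning None."""
    try:
        t_start = time.fromisoformat(start)
        t_end = time.fromisoformat(end)
    except ValueError as exc:
        raise ValueError(
            f"quiet hours must be HH:MM, got {start!r} and {end!r}"
        ) from exc
    if t_start == t_end:
        # An empty window is ambiguous ("never" vs "always"), so refuse it.
        raise ValueError("quiet_hours_start and quiet_hours_end must differ")
    return t_start, t_end
```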

#### A10. Telegram `_truncate` cuts mid-HTML-tag → parse_mode fallback then loses formatting

**Location**: `packages/core/src/notify_bridge_core/notifications/telegram/client.py:144-149`
(`_truncate`).
**Scenario**: A template renders to slightly more than 4096 chars and an
`<a href="https://...">...</a>` straddles the 4096-character boundary. The
truncate function takes a flat string slice, so the final character may be
inside a tag → Telegram returns 400 "can't parse entities" → the retry
strips parse_mode → the user sees `<a href="...">` literally in their chat.
**Fix**: Make `_truncate` HTML-aware: scan from the right and abandon
truncation at the start of any tag boundary, OR strip incomplete tags after
truncating. A simpler intermediate fix: pop any unclosed `<a>` /`<b>`/`<i>`
detected by a regex over the truncated string.
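
The "strip incomplete tags, then close what is still open" variant can be
sketched as below. `truncate_html` is a hypothetical replacement, not the
existing `_truncate`. It assumes literal `<` in text is already escaped as
`&lt;` (as Telegram's HTML mode requires), counts Python characters rather
than UTF-16 units, and does not repair a half-cut entity such as `&am`.

```python
import re

# Matches opening and closing tags; group 1 is "/" for closing tags.
_TAG_RE = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9-]*)[^<>]*>")


def truncate_html(text: str, limit: int, marker: str = "…") -> str:
    """Truncate to at most `limit` chars without leaving broken or open tags."""
    if len(text) <= limit:
        return text
    budget = limit - len(marker)
    while True:
        cut = text[: max(budget, 0)]
        # Drop a tag that the slice cut in half.
        last_open = cut.rfind("<")
        if last_open > cut.rfind(">"):
            cut = cut[:last_open]
        # Collect tags that are opened but not closed inside the kept part.
        stack = []
        for m in _TAG_RE.finditer(cut):
            name = m.group(2).lower()
            if m.group(1):
                if stack and stack[-1] == name:
                    stack.pop()
            else:
                stack.append(name)
        result = cut + marker + "".join(f"</{n}>" for n in reversed(stack))
        if len(result) <= limit or budget <= 0:
            return result
        # The closing tags pushed us over the limit; shrink and try again.
        budget -= len(result) - limit
```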

#### A11. JSON-payload depth/size hardened in backup, not in webhooks

**Location**: `packages/server/src/notify_bridge_server/api/webhooks.py:43-71`
(`_read_bounded_body` only caps total bytes).
**Scenario**: Generic webhook accepts a 999KB payload (under the 1MB cap)
but with 50 levels of nesting. `json.loads` succeeds, then
`parse_generic_webhook` evaluates JSONPath expressions in a loop and the CPU
spends seconds chasing pointers. Multiple concurrent malicious requests can
peg the event loop.
**Fix**: Reuse the depth/node guards from
`packages/server/src/notify_bridge_server/services/backup_service.py`
(JSON depth cap 10, node count cap 100k). Either share the helper or
re-implement around `json.loads(object_pairs_hook=...)`.
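
A post-parse guard is the simplest shape for such a helper. The sketch below
does not reproduce the `backup_service.py` implementation, which is not shown
here. The names are hypothetical and the limits are the ones quoted above
(depth 10, 100k nodes).

```python
import json


class PayloadTooComplex(ValueError):
    """Raised when a decoded JSON value is too deep or has too many nodes."""


def check_json_shape(value, max_depth: int = 10, max_nodes: int = 100_000) -> int:
    """Walk the decoded value iteratively; return the node count if in bounds."""
    stack = [(value, 1)]
    nodes = 0
    while stack:
        item, depth = stack.pop()
        nodes += 1
        if nodes > max_nodes:
            raise PayloadTooComplex(f"more than {max_nodes} nodes")
        if isinstance(item, dict):
            children = list(item.values())
        elif isinstance(item, list):
            children = item
        else:
            continue  # scalars have no children
        if depth > max_depth:
            raise PayloadTooComplex(f"nesting deeper than {max_depth}")
        stack.extend((child, depth + 1) for child in children)
    return nodes


def load_bounded(raw: bytes):
    """Decode JSON, then enforce the depth and node limits."""
    try:
        value = json.loads(raw)
    except RecursionError as exc:
        # Extremely deep input can exhaust the decoder before the walk runs.
        raise PayloadTooComplex("nesting exceeds the interpreter limit") from exc
    check_json_shape(value)
    return value
```

Because the walk is iterative, the guard itself cannot be pushed into a
`RecursionError` by the payload it is checking.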

#### A12. Generic-webhook `auth_mode="none"` with `acknowledge_unauthenticated` is per-provider, not per-user

**Location**: `packages/server/src/notify_bridge_server/api/webhooks.py:294-323`.
**Scenario**: v0.8.1 added the `acknowledge_unauthenticated=true` opt-in,
but it's only stored in `provider.config` JSON, so the acknowledgement is per
provider rather than per user. For a multi-user install where one user accepts
unauthenticated webhooks and another doesn't, that alone would suffice. But
webhook URLs are not secret in real deployments (they end up in upstream
config files, logs, build artifacts), so anyone who sees the URL effectively
holds the token. That makes `auth_mode="none"` dangerous beyond "explicit
opt-in": an attacker who learns or guesses the path can DoS legitimate traffic
by burning the 60/min rate-limit budget.
**Fix**: Refuse to even create a `webhook` provider with `auth_mode="none"`
in production unless a separate environment guard
`NOTIFY_BRIDGE_ALLOW_UNAUTHENTICATED_WEBHOOKS` is set; AND drop the rate
limit to 10/min for `auth_mode="none"` providers.

#### A13. `_extract_retry_after` returns int but Telegram `retry_after` is fractional

**Location**: `packages/core/src/notify_bridge_core/notifications/telegram/client.py:59-78`.
**Scenario**: Modern Telegram sometimes returns `retry_after` as a float
(e.g. `1.5`). The current code does `int(group(1))` and `isinstance(ra,
(int, float))`. Regex `\d+` only matches integers. So a `1.5s` retry-after
becomes "no retry-after found" → fallback 1s sleep → retry too early → second
429 → eventually the bounded retry budget runs out.
**Fix**: Loosen the regex to `\d+(?:\.\d+)?` and `float(m.group(1))`,
preserve fractional via `await asyncio.sleep(retry_after + 1)` with float.
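
A float-tolerant extractor might look like this. It is a sketch, not the
current `_extract_retry_after`: the function name is hypothetical, and it
assumes the error payload carries either `parameters.retry_after` or a
"retry after N" phrase in `description`.

```python
import re
from typing import Optional

_RETRY_AFTER_RE = re.compile(r"retry after (\d+(?:\.\d+)?)", re.IGNORECASE)


def extract_retry_after(payload: dict) -> Optional[float]:
    """Return the retry delay in seconds as a float, or None if absent."""
    ra = (payload.get("parameters") or {}).get("retry_after")
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(ra, (int, float)) and not isinstance(ra, bool):
        return float(ra)
    match = _RETRY_AFTER_RE.search(str(payload.get("description", "")))
    return float(match.group(1)) if match else None
```

The caller can then pass `retry_after + 1` straight to `asyncio.sleep`, which
accepts floats.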

#### A14. APScheduler date-job collision when two windows end at the exact same second

**Location**: `packages/server/src/notify_bridge_server/services/scheduler.py:1127-1132`
(`_drain_job_id_for`). The job id is keyed on `YYYYMMDDHHMMSS`. Comment in
code acknowledges "two trackers... seconds different ... would collide", but
two windows ending at the exact same second still collide on a single job id
— `replace_existing=True` silently drops the second.
**Scenario**: 30 users with quiet_hours_end=`07:00`. All 30 windows end at
the same wall-clock second. Only one drain job is scheduled. That single
job fires `drain_deferred_due()` which scans all rows globally so all 30
get drained — actually fine. **But** if the global drain function ever
filters by user/tracker (a likely near-term change for multi-tenant), the
collision becomes silent data loss.
**Fix**: Either keep the global drain (and document the assumption) or
add a tracker_id segment to the job_id and let APScheduler dedup naturally.

#### A15. `_handle_webhook_conflict` reclaim races against a parallel admin action

**Location**: `packages/server/src/notify_bridge_server/services/telegram_poller.py:163-218`.
**Scenario**: Admin clicks "Switch to webhook mode" in the UI, which sets
`update_mode=webhook` and calls `set_webhook(...)`. Concurrently, the next
poll tick for the same bot hits the conflict, calls `delete_webhook` → the
admin's webhook is wiped 1s after they set it. The poll tick checks
`bot.update_mode != "polling"` *before* the conflict reclaim, but the
reload is best-effort and the conflict reclaim path runs unconditionally
once entered.
**Fix**: Re-check `bot.update_mode == "polling"` inside
`_handle_webhook_conflict` before calling `delete_webhook`; or take an
advisory lock on the bot row for the duration of the mode flip.

#### A16. Discord 2000-char split breaks on Unicode codepoint boundaries

**Location**: `packages/core/src/notify_bridge_core/notifications/discord/client.py:60-80`
(`_split_message`).
**Scenario**: A template renders to 2050 chars with an emoji sequence around
position 1998-1999. The split uses `text.rfind("\n", 0, limit)` and falls back
to character index `limit`. That is a Python `str` index, so it never splits a
single code point (Python 3 strings are indexed by code point, not by UTF-16
surrogate or UTF-8 byte). But if the content contains a multi-code-point
grapheme cluster (emoji + zero-width-joiner + skin tone), slicing at `limit`
mid-cluster renders as the broken emoji "□" in Discord.
**Fix**: Use a grapheme-cluster boundary library (e.g. `regex` module with
`\X`) or at minimum back off to the previous whitespace if `limit` is
inside a likely cluster.

### MEDIUM

#### A17. Per-target failure counter does not distinguish receivers within a target

**Location**: `packages/server/src/notify_bridge_server/services/event_dispatch.py:311-333`.
**Scenario**: A target has 10 receivers. 1 chat is blocked, 9 work. Today
`maybe_emit_target_failure` is called for the target — but the success
counter (`record_target_success`) is also called for the same target on the
other 9. Net counter behavior depends on call order. With the default
threshold of 5, the counter oscillates instead of converging.
**Fix**: Track success/failure per receiver, not per target; or only call
`maybe_emit_target_failure` when `all` receivers failed for the target.

#### A18. `_cleanup_old_events` does not delete cancelled `DeferredDispatch` rows

**Location**: `packages/server/src/notify_bridge_server/services/scheduler.py:332-364`.
**Scenario**: The daily cleanup deletes `EventLog`, `WebhookPayloadLog`,
`ActionExecution`. Cancelled / fired / dropped `DeferredDispatch` rows live
forever in the DB. Active install with chatty providers accumulates millions
of rows; eventually the `_load_pending_drain_jobs` query, `_trim_queue_if_needed`,
and the catch-up scan all degrade.
**Fix**: Add `delete(DeferredDispatch).where(status.in_(["fired", "dropped",
"cancelled"]), fired_at < cutoff)` to the cleanup.

#### A19. `random.shuffle(shuffled)` in `_sort_assets` uses non-deterministic seed

**Location**: `packages/server/src/notify_bridge_server/services/dispatch_helpers.py:317-320`.
**Scenario**: Two identical events arriving in close succession (deferred-
dispatch merge, then drain re-renders) shuffle into different orders. With
the deferred-dispatch coalescing logic, this produces a visual "they're not
the same album" surprise in the chat history.
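**Fix**: Shuffle with a dedicated `random.Random` seeded from a stable
per-event digest, e.g. a `hashlib.sha256` over
`event.event_type.value + event.collection_id + event.timestamp.isoformat()`.
The built-in `hash()` is salted per process for `str` inputs, so it would not
give the same order across restarts.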
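
A restart-stable shuffle can be sketched with `hashlib` and a private
`random.Random` instance, which also avoids reseeding the global generator.
`stable_shuffle` and its key parts are hypothetical names, and the real event
fields would be passed in as strings.

```python
import hashlib
import random


def stable_shuffle(items, *key_parts: str) -> list:
    """Shuffle a copy of items; the order depends only on key_parts."""
    digest = hashlib.sha256("|".join(key_parts).encode("utf-8")).digest()
    # A private generator keeps the seed from leaking into other callers.
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    shuffled = list(items)
    rng.shuffle(shuffled)
    return shuffled


assets = ["a1", "a2", "a3", "a4", "a5"]
first = stable_shuffle(assets, "assets_added", "album-1", "2026-05-22T06:00:00")
second = stable_shuffle(assets, "assets_added", "album-1", "2026-05-22T06:00:00")
```

Both calls return the same order, and the input list is left untouched.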

#### A20. `_poll_tracker` swallows the exception and logs it with `_LOGGER.error`, not `_LOGGER.exception`

**Location**: `packages/server/src/notify_bridge_server/services/scheduler.py:657-666`.
**Scenario**: An exception in `check_tracker` is logged as `_LOGGER.error("Error
polling tracker %d: %s", tracker_id, e)` — no traceback. Production debugging
of "why is tracker 42 silently broken since yesterday" requires the stack.
**Fix**: Change to `_LOGGER.exception("Error polling tracker %d", tracker_id)`.
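
The difference is easy to demonstrate with the standard `logging` module. In
this sketch `poll_safely` and `demo` are hypothetical names, and `demo` only
captures the log output so the traceback can be inspected.

```python
import io
import logging


def poll_safely(tracker_id: int, check, logger: logging.Logger) -> None:
    """Run one poll; on failure log the message plus the full traceback."""
    try:
        check(tracker_id)
    except Exception:
        # .exception() logs at ERROR level and appends the active traceback.
        logger.exception("Error polling tracker %d", tracker_id)


def demo() -> str:
    """Trigger a failing poll and return what ended up in the log."""
    stream = io.StringIO()
    logger = logging.getLogger("poll_demo")
    logger.propagate = False
    logger.addHandler(logging.StreamHandler(stream))

    def broken_check(tracker_id: int) -> None:
        raise KeyError("album_id")

    poll_safely(42, broken_check, logger)
    return stream.getvalue()
```

The captured text contains the formatted message followed by the
`Traceback (most recent call last):` block pointing at `broken_check`.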

#### A21. Long bot commands → `/help` reply > 4096 chars truncates without warning

**Location**: `packages/server/src/notify_bridge_server/commands/handler.py:521-532`,
combined with `send_reply` → `send_telegram_message` → `_truncate` to 4096.
**Scenario**: A user with 20 enabled commands runs `/help`. Each command +
description (RU) crosses 250 chars → 5000 chars total → truncated mid-command.
The user sees a half-list that suggests we forgot half the commands.
**Fix**: Split `/help` over multiple messages by command category (provider).

#### A22. `parse_command` truncates to 512 chars — long search queries lost

**Location**: `packages/server/src/notify_bridge_server/commands/parser.py:15`.
**Scenario**: `/search a very long query containing emoji 🎉 and more text that
the user really meant to send because they pasted a long string from somewhere…`
gets clipped to 512 chars silently. The trailing count parser then operates
on the truncated text, possibly extracting a count from mid-query.
**Fix**: Either reject `>512` with `parse_command` returning a sentinel
"too_long" tuple, or just stop truncating — the Telegram limit is already
4096 and we already truncate the response side.

#### A23. Periodic catch-up scan can dispatch a stale event payload

**Location**: `packages/server/src/notify_bridge_server/services/deferred_dispatch.py:628`
(`_process_row`).
**Scenario**: An `assets_added` event is deferred at 22:00. At 06:00 the
quiet window ends, drain re-fetches `link_data`. The assets in `event_payload`
include URLs and asset metadata. But the user has since deleted those photos
from Immich. The dispatcher tries to download → 404. Notification shows
"5 photos added to Album X" but the actual media fails to attach.
**Fix**: For `assets_added`, re-validate asset existence against the
provider before dispatch (one batched `getAssets` call). Drop missing IDs
from the event, mark with "delivered_after_quiet_hours" + extra hint
`"missing_count": N` in details. For long deferral windows (>12h) this is
clearly the right behavior; for short windows the lookup is wasted work, so
gate on `(now - deferred_at) >= timedelta(hours=6)` (a `timedelta` has no
`.hours` attribute).

#### A24. Watcher / scheduler restart can lose adaptive polling state

**Location**: `packages/server/src/notify_bridge_server/services/scheduler.py:67-88`
(`_adaptive_state: dict`).
**Scenario**: Module-level dict resets on restart. A tracker that had ramped
up to 1-in-4 ticks goes back to every-tick polling. Over a fleet of 50
trackers in steady-state idle, this triggers a thundering herd of every-tick
polls right after deploy. Combined with no DB-level rate limiting on the
upstream Immich/Gitea API, it can rate-limit the operator out of their own
services for ~5min.
**Fix**: Either persist the adaptive state in `notification_tracker_state`
(cheap on shutdown via `atexit`) or stagger the initial ticks via
APScheduler's `next_run_time` instead of relying on the existing jitter.

#### A25. `defer_event` `return "cancelled"` logic is incorrect in some merge paths

**Location**: `packages/server/src/notify_bridge_server/services/deferred_dispatch.py:444`.
**Scenario**: The `cancelled` return branch checks `upd_added is None or
upd_added.status == "cancelled"` AND same for `upd_removed`. But if both
`upd_added` and `upd_removed` are `None` (i.e. there were no pending rows
to begin with), `fully_cancelled` is `False` → returns "merged". That's
fine. But the more subtle issue: an "insert" action with one of the rows
being cancelled returns "merged" — should be "inserted". The dashboard
"merged" status confuses the operator looking at why no defer row exists.
**Fix**: Rewrite as a clearer state machine: distinguish "inserted",
"merged_into_existing", "fully_cancelled".

#### A26. `_fetch_bytes` and `_safe_get` honor only 3 redirects with no Retry-After awareness

**Location**: `packages/core/src/notify_bridge_core/notifications/telegram/client.py:217-268`.
**Scenario**: Immich behind a CDN can chain several `302` responses before
the final `200`. With 4 hops it falls through to "Too many redirects". A user
complains "old photos suddenly missing in notifications".
**Fix**: Bump to 5 redirects and surface the chain in the error string for
easier debugging.

#### A27. No structured event log filter UI for "show me all drops in the last hour"

**Location**: `packages/server/src/notify_bridge_server/api/status.py` —
`event_log` rows have `details.dispatch_status` field but no API filter
exposes it. The frontend can fetch only via global filter on `event_type`.
**Scenario**: An operator sees "messages are missing today". They want to
filter event_log to `dispatch_status in (dropped_quiet_hours_nondeferrable,
deferred_then_dropped, deferred_then_failed)`. Today they can't.
**Fix**: Add `dispatch_status` and `dispatched=true|false` as first-class
event_log columns (denormalized from `details`), plus API + UI filter.

#### A28. `_render_cmd_template` falls back to `"[No template: X]"` user-visible text

**Location**: `packages/server/src/notify_bridge_server/commands/handler.py:111-115`.
**Scenario**: An operator removes a template slot by mistake. The next user
who runs `/random` sees `[No template: response_random]` in chat. Not just
ugly — it leaks internal slot names.
**Fix**: Show a friendly "Sorry, something went wrong on our side" + log at
error level. Better: refuse to disable the slot if it's referenced.

### LOW

#### A29. `_truncate`'s ellipsis can land inside a multi-byte char

The marker `"…"` is one Unicode codepoint (3 bytes UTF-8) but the truncate
counts characters, not bytes. Telegram counts UTF-16 code units, so for a
4090-char message ending in emoji, the calculation is off by a small constant.
Won't break sends but messages may end up slightly longer than `TELEGRAM_MAX_TEXT_LENGTH`
allows. Re-measure in UTF-16 code units (`len(s.encode('utf-16-le')) // 2`).
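
Measuring and truncating in UTF-16 code units could be sketched as follows.
`utf16_len` and `truncate_utf16` are hypothetical helpers, and the input is
assumed to contain no lone surrogates.

```python
def utf16_len(s: str) -> int:
    """Length of s in UTF-16 code units."""
    return len(s.encode("utf-16-le")) // 2


def truncate_utf16(text: str, limit: int, marker: str = "…") -> str:
    """Truncate so that the result, marker included, fits in `limit` units."""
    if utf16_len(text) <= limit:
        return text
    budget = limit - utf16_len(marker)
    kept = []
    used = 0
    for ch in text:
        # Code points above U+FFFF take a surrogate pair (2 units).
        units = 2 if ord(ch) > 0xFFFF else 1
        if used + units > budget:
            break
        kept.append(ch)
        used += units
    return "".join(kept) + marker
```

For example, `utf16_len("a🎉")` is 3 although `len("a🎉")` is 2, which is the
gap this item describes.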

#### A30. `NotificationDispatcher._render_cache` set to fresh dict on every dispatch — comment says "reuse"

The instance attribute `self._render_cache` is reset to `{}` at the start
of every `_send_to_target` (line 245). The cache only helps across receivers
within one target, not across targets. The comment at line 111-115 implies
broader reuse. Either align comment with reality or actually share across
targets within one `dispatch()` call.

#### A31. Frontend `entity-cache.svelte.ts` doesn't propagate stale-cache errors

The shared `$state`-based caches return stale data silently if the underlying
fetch fails after a successful initial load. A user sees old target list
during an outage and is confused why edits aren't sticking.

---

## Part B — Missing functionality and "cool feature" gaps

Tier legend: **must-have** = blocks prod for any non-trivial install;
**nice-to-have** = clear value, ship in next minor; **aspirational** = ship
when v1.0+ slows down.
Effort: **S** ≈ 1-2 days; **M** ≈ 1 week; **L** ≈ 2+ weeks.

### Already in the backlog (post-v0.8.1 status check)

#### B1. Target-level quiet hours (per-target DND, multi-window, days-of-week, silent mode)

**Status**: Still missing in v0.8.1. The backlog item proposed a v1 cut
(target-level windows + `silent` mode for Telegram = `disable_notification=True`).
None of the proposed code paths exist:
- `notification_target.quiet_hours_json` column — not present.
- `disable_notification=True` plumbing through `TelegramClient.send_message`
  — not present.
- Days-of-week filter — not present.

**Pitch**: Quiet hours bind to the *watcher* (tracking config); users want
DND at the *destination*. "Don't ping my phone at night, regardless of
which provider".
**Who benefits**: Every user. Today they have to recreate per-link windows.
**Effort**: **M** (1 week — backend dispatcher gate + frontend Aurora-style fieldset).
**Tier**: **must-have for prod**.

#### B2. Immich Smart Actions expansion (auto-favorite by person, auto-archive, share-link rotation)

**Status**: Auto-Organize exists; no other action descriptors are shipped.
**Pitch**: Reuse the existing action descriptor pipeline. Auto-favorite-by-person
is the smallest cut.
**Effort**: **M** per action (a few days each).
**Tier**: nice-to-have.

#### B3. Block-based template builder

**Status**: Not started. `JinjaEditor` is unchanged.
**Effort**: **L** — frontend-only but big.
**Tier**: aspirational.

### Newly identified — must-have for prod

#### B4. Webhook delivery dedup table + "Test Delivery" replay

**Pitch**: Add the dedup table from A1, plus a `/api/webhooks/{provider_id}/replay/{delivery_id}`
endpoint that admin can hit to re-dispatch a stored payload without the upstream
provider needing to resend. Combined with the existing `WebhookPayloadLog`,
this is "click to retest" in the UI.
**Who benefits**: Every webhook provider. Replay is invaluable for debugging
template edits.
**Effort**: **M**.
**Tier**: **must-have for prod**.

#### B5. "Send test message" / template playground

**Pitch**: From the template editor, click "Try this template against the
last received event" → render preview, optionally send to a sandbox chat.
Bypass dispatch but exercise the full Jinja pipeline.
**Who benefits**: Every template edit today is a leap of faith — the operator
modifies the template, waits for the next real event, hopes nothing breaks.
**Effort**: **S-M**. The preview infrastructure already exists
(`services/sample_context.py`); add a "send to chat X" button.
**Tier**: **must-have for prod**.

#### B6. Template versioning + rollback

**Pitch**: Auto-snapshot each template on save (last 10 revisions). UI shows
diff between version N and N-1, "Restore" button. Same for command templates.
**Who benefits**: An operator who tweaks a template at midnight and goofs
the syntax needs an undo button.
**Effort**: **M**. New `template_revision` table; new endpoints; UI button.
**Tier**: **must-have for prod**.

#### B7. Bulk operations on trackers / targets / links

**Pitch**: Multi-select in lists → "disable selected", "delete selected",
"export selected templates as JSON bundle", "move to user X".
**Who benefits**: Operators with >10 trackers. A common pain point: deploying
the bridge for a new family member requires N clicks per tracker.
**Effort**: **M** (frontend-heavy).
**Tier**: **must-have for prod**.

#### B8. Bot blocked / chat-not-found auto-disable + dashboard

**Pitch**: Detect Telegram 403 / 400 chat-related errors. Mark the receiver
or `TelegramChat` as `disabled_by_remote`. Surface in a "Stale receivers"
admin view with a "Try resending invite" / "Delete chat" button.
**Who benefits**: Every Telegram user. Today the bridge silently sprays
errors until a human looks.
**Effort**: **S**.
**Tier**: **must-have for prod**.

#### B9. Forum-thread (topic) routing for Telegram

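**Pitch**: Per-receiver `message_thread_id` field, auto-detected from incoming
command messages. UI: when adding a chat that's a forum (`getChat` reports
`is_forum`), show a topic selector. The Bot API has no method that lists a
chat's topics (`getForumTopicIconStickers` only returns the stickers usable as
topic icons), so the selector would have to be populated from
`message_thread_id` values and `forum_topic_created` service messages the bot
has already seen.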
**Who benefits**: Any group install where the user wants notifications in a
dedicated topic.
**Effort**: **M**.
**Tier**: **must-have for prod**.

#### B10. Telegram inline buttons + callback queries

**Pitch**: Templates can declare `{% buttons %}` with action descriptors.
Bridge listens for `callback_query` updates, dispatches to a registered
action (e.g. "Mark album as favorite", "Snooze this tracker for 1h", "Run
HA service light.turn_off").
**Who benefits**: Power users. Foundation for several other features
(Immich duplicate-cluster review, HA action button → service call, snooze).
**Effort**: **L**.
**Tier**: nice-to-have but unlocks the next 3 items.

#### B11. User snooze / mute via bot command

**Pitch**: `/snooze 1h` mutes the bot's outbound chat for 1h.
`/mute provider gitea` mutes a whole provider for that chat. `/wake` undoes.
Implemented as a per-receiver `snoozed_until` column.
**Effort**: **S-M**.
**Tier**: **must-have for prod** (user-side relief valve).

### Newly identified — nice-to-have

#### B12. Per-target / per-user rate limit (send-side)

**Pitch**: Cap outbound messages per minute per receiver. Existing 429
backoff handles Telegram's limit, but a runaway template / event-storm
provider can still spray the user's phone with 200 messages.
**Effort**: **S**. Token bucket per chat_id in `_send_telegram`.
**Tier**: nice-to-have.
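
A per-chat token bucket is small enough to sketch in full. The class names,
the default rate, and the injectable `clock` are assumptions for illustration.
The real `_send_telegram` integration would decide whether to drop, queue, or
coalesce a message when `allow` returns `False`.

```python
import time


class TokenBucket:
    """Classic token bucket: `burst` capacity, refilled at `rate_per_min`."""

    def __init__(self, rate_per_min: float, burst: int, clock=time.monotonic):
        self.rate = rate_per_min / 60.0  # tokens per second
        self.capacity = float(burst)
        self.tokens = float(burst)
        self.clock = clock
        self.updated = clock()

    def allow(self) -> bool:
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class PerChatLimiter:
    """One independent bucket per chat_id, created on first use."""

    def __init__(self, rate_per_min: float = 20, burst: int = 5, clock=time.monotonic):
        self._rate = rate_per_min
        self._burst = burst
        self._clock = clock
        self._buckets = {}

    def allow(self, chat_id: int) -> bool:
        bucket = self._buckets.get(chat_id)
        if bucket is None:
            bucket = TokenBucket(self._rate, self._burst, self._clock)
            self._buckets[chat_id] = bucket
        return bucket.allow()
```

Passing a fake `clock` makes the limiter deterministic in tests, which is why
it is a constructor argument rather than a direct `time.monotonic()` call.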

#### B13. Message dedup window (idempotency key per outbound message)

**Pitch**: SHA256 of `(target_id, receiver_id, rendered_message,
event_collection_id)`. If the same key was sent in the last 5min, skip.
**Effort**: **S**.
**Tier**: nice-to-have (lots of overlap with A1+A2 but addresses the
end-of-pipeline dedup, after all coalescing).
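
The key derivation and the 5-minute window can be sketched in memory. The
names are hypothetical, and a multi-process deployment would need the window
in the database or another shared store instead of a dict.

```python
import hashlib
import time


def outbound_key(target_id: int, receiver_id: int, rendered_message: str,
                 event_collection_id: str) -> str:
    """SHA-256 over the four fields, length-prefixed to avoid ambiguity."""
    digest = hashlib.sha256()
    for part in (str(target_id), str(receiver_id), rendered_message,
                 event_collection_id):
        data = part.encode("utf-8")
        digest.update(len(data).to_bytes(8, "big"))
        digest.update(data)
    return digest.hexdigest()


class DedupWindow:
    """Remembers keys for ttl_seconds; a repeat inside the window is skipped."""

    def __init__(self, ttl_seconds: float = 300.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._seen = {}

    def should_send(self, key: str) -> bool:
        now = self._clock()
        # Forget entries that have aged out of the window.
        self._seen = {k: t for k, t in self._seen.items() if now - t < self._ttl}
        if key in self._seen:
            return False
        self._seen[key] = now
        return True
```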

#### B14. Weekly digest / per-target stats / per-provider error rate

**Pitch**: Cron-based weekly summary email/Telegram. "Top 5 noisy trackers",
"Receivers with >X% failure rate", "Top 5 days of the week with the most
activity". Operator preventive maintenance.
**Effort**: **M**.
**Tier**: nice-to-have.

#### B15. Mobile-friendly minimal mode for the SPA

**Pitch**: The Aurora redesign is a lot for mobile. A "manage from phone"
minimal layout — list of trackers, click to toggle, click to mute. Stops
operators from needing a desktop to silence a chatty tracker at 1am.
**Effort**: **M**.
**Tier**: nice-to-have.

#### B16. Audit log of admin actions

**Pitch**: New `audit_log` table. Every create/update/delete on
`NotificationTracker`, `NotificationTarget`, `TemplateConfig`, `ServiceProvider`,
`TelegramBot`, `User`, etc. writes a row with `(user_id, action,
entity_type, entity_id, before_json, after_json, ip, ua)`. Admin UI tab.
**Effort**: **M**. SQLAlchemy event listeners on the affected models.
**Tier**: nice-to-have for multi-admin installs; must-have if any
compliance requirement applies.

#### B17. Health → not just /ready, but per-component status page

**Pitch**: `/api/health/components` returns `{providers: [{id, last_ok_at,
last_error}], targets: [{id, last_ok_at, last_error}], scheduler:
{job_count, next_fires}}`. Frontend "Status" tab.
**Effort**: **S-M**. The data is already in `EventLog` / scheduler API.
**Tier**: nice-to-have.

#### B18. Provider unreachable backoff + escalation

**Pitch**: Today `bridge_self` emits `bridge_self_poll_failures` after N
consecutive failures. Add (a) exponential backoff on the polling interval
after M failures so we don't hammer a down host, and (b) a recovery
notification when the provider comes back.
**Effort**: **S**.
**Tier**: nice-to-have.
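
One way to express the backoff and the recovery trigger, as a sketch. The function names, the doubling factor, and the cap are assumptions; `threshold` plays the role of M above.

```python
def next_poll_interval(base_sec: float, consecutive_failures: int,
                       threshold: int, max_sec: float) -> float:
    """Keep the normal interval below the threshold, then double per failure."""
    if consecutive_failures < threshold:
        return float(base_sec)
    # First failure at the threshold doubles the interval once.
    exponent = min(consecutive_failures - threshold + 1, 32)  # bound the power
    return float(min(max_sec, base_sec * 2 ** exponent))


def should_notify_recovery(previous_failures: int, threshold: int) -> bool:
    """On a successful poll: announce recovery only if backoff had kicked in."""
    return previous_failures >= threshold
```

With a 60 s base interval, a threshold of 3, and a 1 h cap, the interval stays at 60 s for the first two failures, then goes 120 s, 240 s, and so on up to 3600 s. Adding random jitter would avoid several providers retrying in lockstep, but is omitted here.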

#### B19. RSS provider

**Pitch**: Generic RSS/Atom feed poller. One more provider; reuses
`event_dispatch`. Long-tail value (operator wants "notify me when a blog
publishes").
**Effort**: **M**.
**Tier**: nice-to-have.

#### B20. Mobile push / FCM channel

**Pitch**: A dedicated FCM "Receiver" type so the user can ship their own
companion app. Today Telegram is the only realtime channel; email is too
slow; webhook out is for plumbing.
**Effort**: **L**.
**Tier**: aspirational.

### Newly identified — aspirational

#### B21. Conversation threading per source (one notification thread per album / repo)

**Pitch**: Use Telegram `reply_parameters` to chain all notifications about
"Album X" as a single thread that grows over time. Today every notification
is a top-level message. Threading turns the chat into a navigable history.
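**Effort**: **M**. Store `last_message_id` per `(target_id, collection_id)`,
pass as `reply_parameters.message_id` (the legacy `reply_to_message_id`
field was superseded by `reply_parameters` in Bot API 7.0).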
**Tier**: aspirational but a clear differentiator.
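
A sketch of the bookkeeping, assuming a `sendMessage`-style payload dict. `ThreadStore` is a hypothetical in-memory stand-in for the per-`(target_id, collection_id)` storage; `reply_parameters` with `message_id` and `allow_sending_without_reply` are Bot API fields.

```python
from typing import Dict, Optional, Tuple


class ThreadStore:
    """Tracks the last sent message per (target_id, collection_id)."""

    def __init__(self) -> None:
        self._last: Dict[Tuple[int, str], int] = {}

    def build_payload(self, target_id: int, collection_id: str,
                      chat_id: int, text: str) -> dict:
        payload: dict = {"chat_id": chat_id, "text": text}
        previous: Optional[int] = self._last.get((target_id, collection_id))
        if previous is not None:
            payload["reply_parameters"] = {
                "message_id": previous,
                # Still deliver if the earlier message was deleted.
                "allow_sending_without_reply": True,
            }
        return payload

    def record_sent(self, target_id: int, collection_id: str, message_id: int) -> None:
        """Call with the message_id returned by a successful send."""
        self._last[(target_id, collection_id)] = message_id
```

The first notification for a collection goes out as a top-level message; each later one replies to its predecessor, which produces the chain described in the pitch.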

#### B22. A/B test variants for templates

**Pitch**: A template config can carry 2 variants. The dispatcher
hash-routes receivers to A or B; the dashboard shows "variant A's response
time / click rate / receiver mute rate".
**Effort**: **L**.
**Tier**: aspirational.
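
Hash-routing can be as small as the sketch below. The function name and the choice of SHA256 over `template_config_id:receiver_id` are assumptions. The point is that assignment is deterministic, so a receiver always sees the same variant.

```python
import hashlib
from typing import Sequence


def assign_variant(template_config_id: int, receiver_id: int,
                   variants: Sequence[str] = ("A", "B")) -> str:
    """Deterministically map a receiver to one variant of a template config."""
    digest = hashlib.sha256(f"{template_config_id}:{receiver_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[bucket]
```

Python's built-in `hash()` is unsuitable here because string hashing is randomized per process, so assignments would change on every restart. Including `template_config_id` in the hashed input keeps a receiver from always landing in the same arm across different experiments.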

#### B23. Dark-launch a new template before enabling it

**Pitch**: "Send-to-sandbox-chat-only" toggle on a template config. The new
template renders against real events but only goes to one operator's chat
for 1 week. Then promote to production.
**Effort**: **M**. Builds on template versioning (B6).
**Tier**: aspirational.

#### B24. Scheduled template changes

**Pitch**: "On 2026-12-25 at 09:00, switch template_config X to draft Y".
Useful for holiday-themed greetings or batch migrations.
**Effort**: **M**.
**Tier**: aspirational.

#### B25. HA service-call from a Telegram inline button

**Pitch**: Building on B10. A template renders `{% button hass:light.turn_off
target=living_room %}`. User clicks → bridge calls HA `light.turn_off`.
**Effort**: **M** (after B10).
**Tier**: aspirational.

---

## Ship-blocker checklist (do not widen user audience without)

Order is rough priority (top first). Most are also called out in Part A.

1. **A1** — Webhook idempotency table (Gitea/Planka/generic). Without this,
   one upstream retry storm can double-/quadruple-spray every user.
2. **A2** — Deferred-dispatch crash window. A redeploy mid-drain duplicates
   every queued notification. Implement either the `dispatch_id`
   pre-commit OR the `in_flight` state machine.
3. **A3** — Persist Telegram update offset. Same root cause class as A1/A2;
   matters less if A1+A2 are fixed but should land together.
4. **A4 / B8** — Bot blocked / chat-not-found auto-disable. A user blocking
   the bot must not generate infinite errors.
5. **A11** — Webhook JSON depth/node cap (mirror the backup guard).
6. **A9** — Quiet-hours `start == end` confirmation; either accept "always
   quiet" semantics or reject in the API validator.
7. **A8** — DST handling in quiet-hours overnight window. Verify with
   tests that include known transition timestamps.
8. **B5** — "Send test message" / template playground. Without this, every
   template edit is made blind against a live system.
9. **B6** — Template versioning + rollback. Pair with B5.
10. **A5 / B9** — Forum-thread (topic) routing. Any non-trivial Telegram
    group install needs this.
11. **B11** — User snooze / mute via bot command. Relief valve when the
    bridge gets too chatty.
12. **B7** — Bulk operations on trackers / targets / links. Operability
    floor for any install with >10 trackers.

Everything else in Part B is upside, not a blocker.