# Bugs + Missing Features — Production-Readiness Review

Repo: `c:\Users\Alexei\Documents\service-to-notification-bridge` (v0.8.1 baseline)
Date: 2026-05-22
Scope: full repo (backend Python/FastAPI, Svelte 5 frontend, providers + dispatchers + bot commands)

---

## Executive summary

- **The code is in much better shape than typical pre-1.0 code.** Quiet-hours,
  SSRF, JWT, secret redaction, rate-limit fan-out caps, partition-by-media-kind,
  parse_mode retry, scheduler misfire-grace, Prometheus metrics, deep
  healthcheck, and per-receiver render cache are all already implemented and
  well-tested.
- **The single biggest shipping risk is webhook idempotency.** Gitea, Planka,
  and the generic webhook endpoint all dispatch on every POST regardless of
  redelivery — there is no `X-Gitea-Delivery` / `X-GitHub-Delivery` dedup table.
  An upstream retry storm sends the same notification N times.
- **The deferred-dispatch drain has a duplicate-send window** if the process
  dies between `dispatcher.dispatch()` returning and `session.commit()` —
  the row stays `pending` and the periodic catch-up scan re-drains it.
- **Telegram update offset (`_last_update_id`) is in-memory only** — on
  restart, the bot replays already-handled updates or skips ones Telegram
  has discarded. Combined with no per-update idempotency, this is a
  duplicate-command surface.
- **Several Telegram features are silently unsupported**: forum threads
  (`message_thread_id`), bot-blocked-by-user detection (403 → keep retrying
  forever), and inline-button callback queries. None blocks shipping today
  but each is a near-term ask from any real user.
- **No template versioning / dry-run / playground** — every template edit is
  immediately live. There is no way to validate a new template against a
  sample payload before flipping the switch, and no rollback path.
- **Frontend lacks bulk operations and import/export of templates+targets.**
  An operator with 30 trackers cannot bulk-toggle, bulk-edit, or move a
  template across users.

---

## Part A — Bugs and reliability issues

Severity legend: **CRITICAL** = data loss / duplicate user-visible messages /
silent stop-shipping; **HIGH** = wrong behavior under realistic conditions;
**MEDIUM** = degrades UX or operability; **LOW** = polish.

### CRITICAL

#### A1. Webhook redelivery causes duplicate notifications (no idempotency)

**Location**: `packages/server/src/notify_bridge_server/api/webhooks.py:156`
(`gitea_webhook`), `:225` (`planka_webhook`), `:427` (`generic_webhook`).
**Scenario**: Gitea retries a webhook after 30s if the bridge returns 5xx or
times out under load, and the operator can click "Test Delivery" twice. Every
repeat produces a fresh notification because the handlers never check
`X-Gitea-Delivery` (Gitea's per-delivery UUID), nor do they record any
event_id/hash for `parse_generic_webhook` events.
**Fix**: Add a `webhook_delivery` table with a `(provider_id, delivery_id)`
unique constraint and `created_at`. Insert before dispatch (`INSERT OR IGNORE`
on SQLite, `ON CONFLICT DO NOTHING` on Postgres); if the insert is a no-op,
return `{"ok": true, "skipped": "duplicate"}`. For Gitea use the
`X-Gitea-Delivery` header; for Planka use a hash of `event_type +
payload.id + payload.createdAt`; for generic webhooks use a configurable
JSONPath expression to derive an idempotency key, falling back to a SHA256 of
the raw body. Prune rows older than 7 days.

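
The insert-before-dispatch check from A1 can be sketched with the standard-library `sqlite3` module. This is a minimal illustration, not the project's code: the table and column names come from the fix text, while `init_schema`, `delivery_key`, `claim_delivery`, and `prune_deliveries` are hypothetical helpers, and a real handler would go through the project's own session layer.

```python
from __future__ import annotations

import hashlib
import sqlite3


def init_schema(conn: sqlite3.Connection) -> None:
    # Hypothetical schema following the A1 fix: one row per delivery seen.
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS webhook_delivery (
            provider_id INTEGER NOT NULL,
            delivery_id TEXT NOT NULL,
            created_at  TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
            UNIQUE (provider_id, delivery_id)
        )
        """
    )


def delivery_key(header_value: str | None, raw_body: bytes) -> str:
    """Prefer the upstream delivery id; fall back to a SHA-256 of the raw body."""
    return header_value or hashlib.sha256(raw_body).hexdigest()


def claim_delivery(conn: sqlite3.Connection, provider_id: int, delivery_id: str) -> bool:
    """Return True only the first time (provider_id, delivery_id) is seen."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO webhook_delivery (provider_id, delivery_id) VALUES (?, ?)",
        (provider_id, delivery_id),
    )
    conn.commit()
    # rowcount is 0 when the unique constraint made the insert a no-op.
    return cur.rowcount == 1


def prune_deliveries(conn: sqlite3.Connection, days: int = 7) -> int:
    """Delete dedup rows older than `days`; returns the number of rows removed."""
    cur = conn.execute(
        "DELETE FROM webhook_delivery WHERE created_at < datetime('now', ?)",
        (f"-{days} days",),
    )
    conn.commit()
    return cur.rowcount
```

A handler would call `claim_delivery` first and return the `skipped: duplicate` response when it yields `False`, so the dispatch path is never entered for a repeat.
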
#### A2. Deferred-dispatch drain can double-send on process crash

**Location**: `packages/server/src/notify_bridge_server/services/deferred_dispatch.py:721-758`.
**Scenario**: Inside `_process_row`, `dispatcher.dispatch()` actually
delivers the Telegram message (HTTP 200 returned, the user's phone buzzes).
The function then sets `row.status = "fired"` (line 734) but the surrounding
`session.commit()` (line 577) hasn't run yet. Process is killed (OOM,
SIGTERM during deploy, host reboot). On restart, `_run_deferred_drain_catchup`
re-fetches the still-`pending` row and dispatches it again — **the user gets
the same album twice**.
**Fix**: Either (a) record an outbound dedup key per-row before dispatch
(`row.dispatch_id = uuid4(); session.commit()` first), then ask the channel
client to send-or-no-op based on that ID; or (b) flip the row to a
`"in_flight"` state with a short timeout in a pre-dispatch transaction so a
restart sees it as poisoned and aborts. Option (a) is more correct but
needs per-channel cooperation; option (b) is the cheap fix.

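
Option (b) amounts to a compare-and-set on the row status that is committed before any network call. The sketch below uses `sqlite3` and an assumed, simplified `deferred_dispatch` schema; `claim_for_dispatch`, `mark_fired`, and the `claimed_at` column are hypothetical names, and the timeout handling for stuck `in_flight` rows is left out.

```python
import sqlite3

# Assumed minimal schema; the real table has more columns.
SCHEMA = """
CREATE TABLE deferred_dispatch (
    id         INTEGER PRIMARY KEY,
    status     TEXT NOT NULL DEFAULT 'pending',
    claimed_at TEXT
);
"""


def claim_for_dispatch(conn: sqlite3.Connection, row_id: int) -> bool:
    """Flip pending -> in_flight and commit before the message is sent."""
    cur = conn.execute(
        "UPDATE deferred_dispatch "
        "SET status = 'in_flight', claimed_at = CURRENT_TIMESTAMP "
        "WHERE id = ? AND status = 'pending'",
        (row_id,),
    )
    conn.commit()
    # 0 rows updated means another drain (or a pre-crash run) already claimed it.
    return cur.rowcount == 1


def mark_fired(conn: sqlite3.Connection, row_id: int) -> None:
    """Record the successful send; only valid for rows this drain claimed."""
    conn.execute(
        "UPDATE deferred_dispatch SET status = 'fired' "
        "WHERE id = ? AND status = 'in_flight'",
        (row_id,),
    )
    conn.commit()
```

After a crash between the two calls the row is left `in_flight`, so the catch-up scan (which selects `pending`) does not resend it; what to do with rows whose `claimed_at` is older than the timeout is a policy decision.
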
#### A3. Telegram update offset is in-memory only — restart replays or loses commands

**Location**: `packages/server/src/notify_bridge_server/services/telegram_poller.py:31`
(`_last_update_id: dict[int, int] = {}`).
**Scenario**: A user types `/random Family`. Telegram delivers update_id=4711.
The bridge processes the command, sends back the media, and crashes before
APScheduler ticks again. On restart, `_last_update_id` is empty, so we call
`getUpdates(offset=None)` → Telegram returns 4711 again → we send the user
the same album a second time. Conversely, if Telegram's 24-hour retention
expired during a long outage, we silently skip pending updates.
**Fix**: Persist last_update_id in DB (`telegram_bot.last_update_id` column).
Combine with A2-style command idempotency by inserting
`(bot_id, update_id)` into a dedup table before processing.

### HIGH

#### A4. Telegram "bot blocked by user" / "chat not found" never short-circuits

**Location**: `packages/core/src/notify_bridge_core/notifications/telegram/client.py`
(`send_message`, `_upload_media`, etc.). Errors with
`error_code == 403` (Forbidden, "Bot was blocked by the user") and 400
"chat not found" / "user is deactivated" are returned as failures but
never recorded, so the receiver is never removed or disabled.
**Scenario**: A user blocks the bot. Every scheduled "Good morning memory"
fires a sendMessage that Telegram instantly 403s. Bridge logs an error,
moves on, repeats forever. The bridge_self target-failure counter eventually
fires but the underlying receiver is never disabled. With many such chats
the operator has no easy cleanup path.
**Fix**: In the dispatcher, on `error_code in (403, 400 with description
matching "chat not found"/"user is deactivated")`, automatically set
`TelegramChat.commands_enabled = False` and either flag the receiver as
`disabled` with reason `blocked_by_user` or surface it via a new
`/admin/blocked-chats` view. Also stop further retries that round.

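
The error test in the A4 fix is written as pseudocode; one way to make it concrete is a small predicate over the `error_code` and `description` fields of a Telegram error response. The function name and the marker list are illustrative assumptions; the real set of descriptions worth matching should be taken from observed responses.

```python
# Description fragments treated as "this chat is gone" (illustrative list
# taken from the two cases named in A4).
_PERMANENT_400_MARKERS = ("chat not found", "user is deactivated")


def is_permanent_chat_failure(error_code: int, description: str) -> bool:
    """True when retrying the same chat cannot succeed without user action."""
    if error_code == 403:
        # Forbidden: blocked by the user, kicked from the group, and similar.
        return True
    if error_code == 400:
        lowered = description.lower()
        return any(marker in lowered for marker in _PERMANENT_400_MARKERS)
    return False
```

The dispatcher could call this once per failed send and, on `True`, disable the receiver and skip the remaining retries for that round.
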
#### A5. Telegram forum-thread (topic) routing not supported

**Location**: telegram client never accepts/sends `message_thread_id`.
**Scenario**: Operator points the bridge at a group's "Releases" forum
topic. Today every message lands in the General topic instead — there is
no way to specify the topic. This is a hard requirement for any non-trivial
group install. Currently `reply_parameters` is the only thread-adjacent
field used; `message_thread_id` is silently absent.
**Fix**: Add an optional `message_thread_id` per-receiver (or per-target)
config, pass through `send_message`, `_upload_media`, and `_post_media_group`.
Auto-extract from incoming command updates' `message.message_thread_id` so
the bot can reply into the same topic.

#### A6. `bot.token` read after commit without refresh in webhook flow

**Location**: `packages/server/src/notify_bridge_server/commands/webhook.py:92-97`.
**Scenario**: The comment acknowledges "AsyncSession expires instances on
commit" and snapshots `bot_id`/`bot_token` before commit, but `await
session.refresh(bot)` is also called after the commit. If `session.refresh`
fails (e.g. row was deleted by an admin concurrently — bot rotation), the
exception is caught as a warning and the rest of the handler still runs
using the stale local `bot_id`/`bot_token`. The window is small but real.
**Fix**: Remove the `session.refresh(bot)` since the snapshot already
covers everything the handler needs. The refresh adds risk for no gain.

#### A7. Deferred-dispatch coalescing has a JSON-mutation bug under concurrent defers

**Location**: `packages/server/src/notify_bridge_server/services/deferred_dispatch.py:307`
(`_find_pending_asset_rows`).
**Scenario**: Two near-simultaneous `assets_added` events for the same
`(link_id, collection_id)` from two upstream pollers (HA chat-bus +
periodic Immich). Both call `defer_event` concurrently. The two transactions
both see "no pending row", both `session.add(new_row)`, and SQLite cheerfully
inserts two rows. The drain then fires both, sending the same combined media
twice. Note that the partial UNIQUE index from v0.8.1 protects only the
`bridge_self` provider row, not the deferred queue.
**Fix**: Add a partial UNIQUE index `UNIQUE(link_id, collection_id, event_type)
WHERE status = 'pending'` on `deferred_dispatch`, then convert `defer_event`
to `INSERT ... ON CONFLICT (link_id, collection_id, event_type) WHERE status =
'pending' DO UPDATE` (the conflict target has to repeat the index predicate to
match a partial index) and merge `event_payload` inside the SQL or in a
re-read+retry loop.

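
The partial-index upsert from A7 can be tried out in isolation with `sqlite3`. The schema below is a reduced, assumed version of `deferred_dispatch` whose payload is a plain JSON list of asset ids; `defer_event` here is a hypothetical stand-in for the real function, and it merges in Python instead of in SQL.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE deferred_dispatch (
        id            INTEGER PRIMARY KEY,
        link_id       INTEGER NOT NULL,
        collection_id TEXT NOT NULL,
        event_type    TEXT NOT NULL,
        status        TEXT NOT NULL DEFAULT 'pending',
        event_payload TEXT NOT NULL
    );
    -- At most one pending row per (link, collection, event type).
    CREATE UNIQUE INDEX uq_deferred_pending
        ON deferred_dispatch (link_id, collection_id, event_type)
        WHERE status = 'pending';
    """
)

# The WHERE clause in the conflict target mirrors the index predicate so the
# upsert can be matched to the partial unique index.
UPSERT = """
INSERT INTO deferred_dispatch (link_id, collection_id, event_type, event_payload)
VALUES (?, ?, ?, ?)
ON CONFLICT (link_id, collection_id, event_type) WHERE status = 'pending'
DO UPDATE SET event_payload = excluded.event_payload
"""


def defer_event(link_id: int, collection_id: str, event_type: str, asset_ids: list) -> None:
    """Merge asset ids into the single pending row, creating it if needed."""
    with conn:  # commits on success, rolls back on error
        row = conn.execute(
            "SELECT event_payload FROM deferred_dispatch "
            "WHERE link_id = ? AND collection_id = ? AND event_type = ? "
            "AND status = 'pending'",
            (link_id, collection_id, event_type),
        ).fetchone()
        existing = json.loads(row[0]) if row else []
        merged = sorted(set(existing) | set(asset_ids))
        conn.execute(UPSERT, (link_id, collection_id, event_type, json.dumps(merged)))
```

The unique index is what prevents the second row; the read-then-merge step is still a read-modify-write, so under real concurrency it needs the re-read+retry loop (or an in-SQL merge) mentioned in the fix.
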
#### A8. Quiet-hours overnight window + DST transition can produce wrong fire_at

**Location**: `packages/server/src/notify_bridge_server/services/dispatch_helpers.py:121-128`.
**Scenario**: A user in `Europe/Minsk` (UTC+3, no DST anymore) with quiet
hours 22:00-06:00 is unaffected. For a user in a DST-observing zone (e.g.
`America/New_York`), on the "spring forward" night where 2:00 → 3:00, an
event arriving just before the jump gets `end_today = now_local.replace(hour=6,
minute=0)`. But `.replace()` only swaps wall-clock fields and applies no DST
adjustment, so a boundary configured inside the skipped hour yields a
nonexistent local time, and one inside the repeated autumn hour yields an
ambiguous one. Two hours later, the dispatcher sees the quiet window as
"still active" or "30 min ago" depending on the system.
**Fix**: After `.replace(hour=t_end.hour, minute=t_end.minute, ...)`,
normalize the result by round-tripping it through UTC with `astimezone` and
explicitly handle the `fold=` parameter. (`tz.localize` is a pytz API and is
not available on `zoneinfo` objects.) Add tests using
`zoneinfo.ZoneInfo("America/New_York")` and known DST transition dates.

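
The normalization step in A8 can be illustrated with the standard `zoneinfo` module. `local_boundary` is a hypothetical helper, not the function in `dispatch_helpers.py`; it shows how a wall-clock time that falls into the spring-forward gap is moved to a real instant by a UTC round trip.

```python
from datetime import datetime, time, timezone
from zoneinfo import ZoneInfo


def local_boundary(now_local: datetime, boundary: time) -> datetime:
    """Return today's `boundary` wall-clock time as a valid aware datetime."""
    swapped = now_local.replace(
        hour=boundary.hour,
        minute=boundary.minute,
        second=0,
        microsecond=0,
        fold=0,  # for a repeated (ambiguous) hour, pick the first occurrence
    )
    # replace() does no DST validation. Converting to UTC and back maps a
    # nonexistent wall-clock time to the instant just past the gap.
    return swapped.astimezone(timezone.utc).astimezone(now_local.tzinfo)


tz = ZoneInfo("America/New_York")
before_jump = datetime(2024, 3, 10, 1, 30, tzinfo=tz)  # US spring-forward night
in_gap = local_boundary(before_jump, time(2, 30))      # 02:30 does not exist
regular = local_boundary(before_jump, time(6, 0))
```

Whether a boundary inside the gap should resolve to the later instant (as here) or be rejected at configuration time is a product decision; the point is that the choice is made explicitly.
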
#### A9. Quiet-hours `start == end` returns None — silently no quiet hours

**Location**: `packages/server/src/notify_bridge_server/services/dispatch_helpers.py:110-111`.
**Scenario**: User UI submits `quiet_hours_start = "00:00"` and
`quiet_hours_end = "00:00"`, thinking "all day quiet". The function returns
`None` (no quiet window) — the user gets pinged at 3am even though the UI
says "quiet hours enabled". Same code path eats malformed times silently.
**Fix**: Raise a `ValueError` for malformed input and bubble it up to the API
validator on write so the user gets a 422 with a specific error message
rather than silently broken behavior. Define `00:00-00:00` as "always quiet"
or reject it explicitly with a clear error.

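
A write-time validator along the lines of the A9 fix might look like the following. It takes the "reject explicitly" branch for `start == end`, which is only one of the two options the fix allows; `parse_quiet_hours` is a hypothetical name, and mapping the `ValueError` to a 422 is left to the API layer.

```python
from datetime import time


def parse_quiet_hours(start: str, end: str) -> tuple[time, time]:
    """Parse HH:MM strings, raising ValueError instead of returning None."""
    try:
        t_start = time.fromisoformat(start)
        t_end = time.fromisoformat(end)
    except ValueError as exc:
        raise ValueError(
            f"quiet hours must be valid HH:MM times, got {start!r} and {end!r}"
        ) from exc
    if t_start == t_end:
        # Ambiguous: could mean "never" or "always". Refuse instead of guessing.
        raise ValueError("quiet_hours_start and quiet_hours_end must differ")
    return t_start, t_end
```

An overnight window such as 22:00-06:00 passes unchanged, since `start > end` is a legitimate configuration.
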
#### A10. Telegram `_truncate` cuts mid-HTML-tag → parse_mode fallback then loses formatting

**Location**: `packages/core/src/notify_bridge_core/notifications/telegram/client.py:144-149`
(`_truncate`).
**Scenario**: A template renders to just over 4096 chars and an
`<a href="https://...">...</a>` straddles the 4096-character boundary. The
truncate function takes a flat string slice, so the final character may be
inside a tag → Telegram returns 400 "can't parse entities" → the retry
strips parse_mode → the user sees `<a href="...">` literally in their chat.
**Fix**: Make `_truncate` HTML-aware: scan from the right and back off to
the start of any tag the cut lands in, OR strip incomplete tags after
truncating. A simpler intermediate fix: pop any unclosed `<a>`/`<b>`/`<i>`
detected by a regex over the truncated string.

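
A tag-aware truncation in the spirit of the A10 fix could be sketched as below. `truncate_html` is a hypothetical replacement, not the existing `_truncate`; it assumes well-formed, properly nested input, counts Python characters like the current code, and does not handle a cut that lands inside an HTML entity such as `&amp;`.

```python
import re

# Matches simple opening and closing tags such as <b>, </b>, <a href="...">.
_TAG = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9-]*)[^<>]*>")


def _closing_tags(fragment: str) -> str:
    """Return the closing tags for elements still open at the end of `fragment`."""
    stack = []
    for match in _TAG.finditer(fragment):
        is_close, name = match.group(1) == "/", match.group(2).lower()
        if not is_close:
            stack.append(name)
        elif stack and stack[-1] == name:
            stack.pop()
    return "".join(f"</{name}>" for name in reversed(stack))


def truncate_html(text: str, limit: int, marker: str = "…") -> str:
    """Shorten `text` to at most `limit` chars without leaving broken markup."""
    if len(text) <= limit:
        return text
    budget = limit - len(marker)
    while budget > 0:
        cut = text[:budget]
        # A '<' with no later '>' means the slice ended inside a tag: drop it.
        last_open = cut.rfind("<")
        if last_open > cut.rfind(">"):
            cut = cut[:last_open]
        closers = _closing_tags(cut)
        if len(cut) + len(marker) + len(closers) <= limit:
            return cut + marker + closers
        budget = len(cut) - 1  # closers did not fit; retry with a shorter cut
    return marker[:limit]
```

Closing the open elements keeps the formatting of the surviving text, which is the part the parse_mode fallback currently throws away.
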
#### A11. JSON-payload depth/size hardened in backup, not in webhooks

**Location**: `packages/server/src/notify_bridge_server/api/webhooks.py:43-71`
(`_read_bounded_body` only caps total bytes).
**Scenario**: Generic webhook accepts a 999KB payload (under the 1MB cap)
but with 50 levels of nesting. `json.loads` succeeds, then
`parse_generic_webhook` evaluates JSONPath expressions in a loop and the CPU
spends seconds chasing pointers. Multiple concurrent malicious requests can
peg the event loop.
**Fix**: Reuse the depth/node guards from
`packages/server/src/notify_bridge_server/services/backup_service.py`
(JSON depth cap 10, node count cap 100k). Either share the helper or
re-implement around `json.loads(object_pairs_hook=...)`.

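
The guard in `backup_service.py` is not reproduced in this review, so the following is an independent sketch of what a depth and node-count check can look like. It walks the parsed value iteratively instead of using `object_pairs_hook`; the function names are assumptions, and the limits default to the caps quoted in A11.

```python
import json


def check_json_shape(value, max_depth: int = 10, max_nodes: int = 100_000) -> None:
    """Raise ValueError if `value` nests too deeply or has too many nodes."""
    nodes = 0
    stack = [(value, 1)]  # (node, container depth if the node is a container)
    while stack:
        current, depth = stack.pop()
        nodes += 1
        if nodes > max_nodes:
            raise ValueError(f"JSON payload exceeds {max_nodes} nodes")
        if isinstance(current, dict):
            children = current.values()
        elif isinstance(current, list):
            children = current
        else:
            continue  # scalars have no children
        if depth > max_depth:
            raise ValueError(f"JSON payload nests deeper than {max_depth} levels")
        stack.extend((child, depth + 1) for child in children)


def load_bounded_json(raw: bytes, max_depth: int = 10, max_nodes: int = 100_000):
    """Parse `raw` and enforce the shape limits before any further processing."""
    try:
        value = json.loads(raw)
    except RecursionError as exc:
        # Extremely deep input can exhaust the parser's recursion limit.
        raise ValueError("JSON payload nests too deeply to parse") from exc
    check_json_shape(value, max_depth=max_depth, max_nodes=max_nodes)
    return value
```

Because the walk uses an explicit stack, the check itself cannot hit Python's recursion limit on hostile input.
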
#### A12. Generic-webhook `auth_mode="none"` with `acknowledge_unauthenticated` is per-provider, not per-user

**Location**: `packages/server/src/notify_bridge_server/api/webhooks.py:294-323`.
**Scenario**: v0.8.1 added the `acknowledge_unauthenticated=true` opt-in,
but it's only stored in `provider.config` JSON. That would suffice for a
multi-user install where one user accepts unauthenticated delivery and
another doesn't. But because anyone with the webhook URL can also infer the
token (URLs are not secret in real deployments — they end up in upstream
config files, logs, build artifacts), `auth_mode="none"` is dangerous beyond
"explicit opt-in": an attacker who guesses the path can DoS the rate limiter
by burning the 60/min budget.
**Fix**: Refuse to even create a `webhook` provider with `auth_mode="none"`
in production unless a separate environment guard
`NOTIFY_BRIDGE_ALLOW_UNAUTHENTICATED_WEBHOOKS` is set; AND drop the rate
limit to 10/min for `auth_mode="none"` providers.

#### A13. `_extract_retry_after` returns int but Telegram `retry_after` is fractional

**Location**: `packages/core/src/notify_bridge_core/notifications/telegram/client.py:59-78`.
**Scenario**: Modern Telegram sometimes returns `retry_after` as a float
(e.g. `1.5`). The current code does `int(group(1))` and `isinstance(ra,
(int, float))`. Regex `\d+` only matches integers. So a `1.5s` retry-after
becomes "no retry-after found" → fallback 1s sleep → retry too early → second
429 → eventually the bounded retry budget runs out.
**Fix**: Loosen the regex to `\d+(?:\.\d+)?`, parse with `float(m.group(1))`,
and preserve the fractional part by passing the float straight to
`await asyncio.sleep(retry_after + 1)`.

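
The loosened parsing from A13 is small enough to show in full. `extract_retry_after` below is a standalone sketch, not the existing `_extract_retry_after`; it assumes the usual Telegram error shape with an optional `parameters.retry_after` field and a `description` string of the form "Too Many Requests: retry after N".

```python
from __future__ import annotations

import re

# Accepts both "retry after 5" and "retry after 1.5".
_RETRY_AFTER_RE = re.compile(r"retry after (\d+(?:\.\d+)?)", re.IGNORECASE)


def extract_retry_after(payload: dict) -> float | None:
    """Return the server-requested delay in seconds, or None if absent."""
    params = payload.get("parameters") or {}
    ra = params.get("retry_after")
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(ra, (int, float)) and not isinstance(ra, bool):
        return float(ra)
    match = _RETRY_AFTER_RE.search(payload.get("description") or "")
    return float(match.group(1)) if match else None
```

The caller can then sleep for `retry_after + 1` without any integer conversion along the way.
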
#### A14. APScheduler date-job collision when two windows end at the exact same second

**Location**: `packages/server/src/notify_bridge_server/services/scheduler.py:1127-1132`
(`_drain_job_id_for`). The job id is keyed on `YYYYMMDDHHMMSS`. Comment in
code acknowledges "two trackers... seconds different ... would collide", but
two windows ending at the exact same second still collide on a single job id
— `replace_existing=True` silently drops the second.
**Scenario**: 30 users with quiet_hours_end=`07:00`. All 30 windows end at
the same wall-clock second. Only one drain job is scheduled. That single
job fires `drain_deferred_due()` which scans all rows globally so all 30
get drained — actually fine. **But** if the global drain function ever
filters by user/tracker (a likely near-term change for multi-tenant), the
collision becomes silent data loss.
**Fix**: Either keep the global drain (and document the assumption) or
add a tracker_id segment to the job_id and let APScheduler dedup naturally.

#### A15. `_handle_webhook_conflict` reclaim races against a parallel admin action

**Location**: `packages/server/src/notify_bridge_server/services/telegram_poller.py:163-218`.
**Scenario**: Admin clicks "Switch to webhook mode" in the UI, which sets
`update_mode=webhook` and calls `set_webhook(...)`. Concurrently, the next
poll tick for the same bot hits the conflict, calls `delete_webhook` → the
admin's webhook is wiped 1s after they set it. The poll tick checks
`bot.update_mode != "polling"` *before* the conflict reclaim, but the
reload is best-effort and the conflict reclaim path runs unconditionally
once entered.
**Fix**: Re-check `bot.update_mode == "polling"` inside
`_handle_webhook_conflict` before calling `delete_webhook`; or take an
advisory lock on the bot row for the duration of the mode flip.

#### A16. Discord 2000-char split breaks on Unicode codepoint boundaries

**Location**: `packages/core/src/notify_bridge_core/notifications/discord/client.py:60-80`
(`_split_message`).
**Scenario**: A template renders to 2050 chars with emoji at position
1998-1999 (each emoji is 2 surrogates / multi-byte UTF-8). The split uses
`text.rfind("\n", 0, limit)` and falls back to character index `limit`,
which is a Python str index → that part is OK in CPython 3, but if the
content contains a grapheme cluster (emoji + zero-width-joiner + skin tone),
slicing at `limit` mid-cluster renders as the broken emoji "□" in Discord.
**Fix**: Use a grapheme-cluster boundary library (e.g. `regex` module with
`\X`) or at minimum back off to the previous whitespace if `limit` is
inside a likely cluster.

### MEDIUM

#### A17. Per-target failure counter does not distinguish receivers within a target

**Location**: `packages/server/src/notify_bridge_server/services/event_dispatch.py:311-333`.
**Scenario**: A target has 10 receivers. 1 chat is blocked, 9 work. Today
`maybe_emit_target_failure` is called for the target — but the success
counter (`record_target_success`) is also called for the same target on the
other 9. Net counter behavior depends on call order. With the
default-threshold 5, this oscillates.
**Fix**: Track success/failure per receiver, not per target; or only call
`maybe_emit_target_failure` when `all` receivers failed for the target.

#### A18. `_cleanup_old_events` does not delete cancelled `DeferredDispatch` rows

**Location**: `packages/server/src/notify_bridge_server/services/scheduler.py:332-364`.
**Scenario**: The daily cleanup deletes `EventLog`, `WebhookPayloadLog`,
`ActionExecution`. Cancelled / fired / dropped `DeferredDispatch` rows live
forever in the DB. Active install with chatty providers accumulates millions
of rows; eventually the `_load_pending_drain_jobs` query, `_trim_queue_if_needed`,
and the catch-up scan all degrade.
**Fix**: Add `delete(DeferredDispatch).where(status.in_(["fired", "dropped",
"cancelled"]), fired_at < cutoff)` to the cleanup.

#### A19. `random.shuffle(shuffled)` in `_sort_assets` uses non-deterministic seed

**Location**: `packages/server/src/notify_bridge_server/services/dispatch_helpers.py:317-320`.
**Scenario**: Two identical events arriving in close succession (deferred-
dispatch merge, then drain re-renders) shuffle into different orders. With
the deferred-dispatch coalescing logic, this produces a visual "they're not
the same album" surprise in the chat history.
**Fix**: Seed a local `random.Random` with a stable per-event digest of
`event.event_type.value + event.collection_id + event.timestamp.isoformat()`,
for example via `hashlib.sha256`. Python's built-in `hash()` is salted per
process for strings, so it would not reproduce the same order after a restart.

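
A stable, restart-proof shuffle for A19 can be built from `hashlib` and a private `random.Random` instance. `stable_shuffle` and its parameters are illustrative; the three key fields mirror the ones named in the fix.

```python
import hashlib
import random


def stable_shuffle(items, event_type: str, collection_id: str, timestamp_iso: str) -> list:
    """Shuffle `items` in an order that depends only on the event identity."""
    key = f"{event_type}|{collection_id}|{timestamp_iso}".encode("utf-8")
    # Use a cryptographic digest, not hash(): str hashes change between
    # interpreter runs unless PYTHONHASHSEED is pinned.
    seed = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    shuffled = list(items)
    # A private Random instance leaves the global random state untouched.
    random.Random(seed).shuffle(shuffled)
    return shuffled
```

Two renders of the same event then produce the same album order, while different events still get different-looking orders.
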
#### A20. `_poll_tracker` swallows exception and logs it via `_LOGGER.error`, not `_LOGGER.exception`

**Location**: `packages/server/src/notify_bridge_server/services/scheduler.py:657-666`.
**Scenario**: An exception in `check_tracker` is logged as `_LOGGER.error("Error
polling tracker %d: %s", tracker_id, e)` — no traceback. Production debugging
of "why is tracker 42 silently broken since yesterday" requires the stack.
**Fix**: Change to `_LOGGER.exception("Error polling tracker %d", tracker_id)`.

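
The difference the A20 fix makes is easy to see by capturing the log output. The snippet is a toy reproduction with a hypothetical `poll` function and logger name, not the scheduler code.

```python
import io
import logging

stream = io.StringIO()
logger = logging.getLogger("scheduler_demo")
logger.addHandler(logging.StreamHandler(stream))
logger.setLevel(logging.ERROR)
logger.propagate = False  # keep the demo output out of the root logger


def poll(tracker_id: int) -> None:
    try:
        raise RuntimeError("upstream returned malformed JSON")
    except Exception:
        # exception() logs at ERROR level and appends the active traceback.
        logger.exception("Error polling tracker %d", tracker_id)


poll(42)
output = stream.getvalue()
```

`output` now contains the message followed by the full `Traceback (most recent call last)` block, which `logger.error(..., e)` would have reduced to the exception text alone.
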
#### A21. Long bot commands → `/help` reply > 4096 chars truncates without warning

**Location**: `packages/server/src/notify_bridge_server/commands/handler.py:521-532`,
combined with `send_reply` → `send_telegram_message` → `_truncate` to 4096.
**Scenario**: A user with 20 enabled commands runs `/help`. Each command +
description (RU) crosses 250 chars → 5000 chars total → truncated mid-command.
The user sees a half-list that suggests we forgot half the commands.
**Fix**: Split `/help` over multiple messages by command category (provider).

#### A22. `parse_command` truncates to 512 chars — long search queries lost

**Location**: `packages/server/src/notify_bridge_server/commands/parser.py:15`.
**Scenario**: `/search a very long query containing emoji 🎉 and more text that
the user really meant to send because they pasted a long string from somewhere…`
gets clipped to 512 chars silently. The trailing count parser then operates
on the truncated text, possibly extracting a count from mid-query.
**Fix**: Either reject `>512` with `parse_command` returning a sentinel
"too_long" tuple, or just stop truncating — the Telegram limit is already
4096 and we already truncate the response side.

#### A23. Periodic catch-up scan can dispatch a stale event payload

**Location**: `packages/server/src/notify_bridge_server/services/deferred_dispatch.py:628`
(`_process_row`).
**Scenario**: An `assets_added` event is deferred at 22:00. At 06:00 the
quiet window ends, drain re-fetches `link_data`. The assets in `event_payload`
include URLs and asset metadata. But the user has since deleted those photos
from Immich. The dispatcher tries to download → 404. Notification shows
"5 photos added to Album X" but the actual media fails to attach.
**Fix**: For `assets_added`, re-validate asset existence against the
provider before dispatch (one batched `getAssets` call). Drop missing IDs
from the event, mark with "delivered_after_quiet_hours" + extra hint
`"missing_count": N` in details. For deferred windows >12h this is the
right behavior; for shorter windows the lookup is wasted work, so gate on
`now - deferred_at >= timedelta(hours=6)`.

#### A24. Watcher / scheduler restart can lose adaptive polling state

**Location**: `packages/server/src/notify_bridge_server/services/scheduler.py:67-88`
(`_adaptive_state: dict`).
**Scenario**: Module-level dict resets on restart. A tracker that had ramped
up to 1-in-4 ticks goes back to every-tick polling. Over a fleet of 50
trackers in steady-state idle, this triggers a thundering herd of every-tick
polls right after deploy. Combined with no DB-level rate limiting on the
upstream Immich/Gitea API, it can rate-limit the operator out of their own
services for ~5min.
**Fix**: Either persist the adaptive state in `notification_tracker_state`
(cheap on shutdown via `atexit`) or stagger the initial ticks via
APScheduler's `next_run_time` instead of relying on the existing jitter.

#### A25. `defer_event` `return "cancelled"` logic is incorrect in some merge paths

**Location**: `packages/server/src/notify_bridge_server/services/deferred_dispatch.py:444`.
**Scenario**: The `cancelled` return branch checks `upd_added is None or
upd_added.status == "cancelled"` AND same for `upd_removed`. But if both
`upd_added` and `upd_removed` are `None` (i.e. there were no pending rows
to begin with), `fully_cancelled` is `False` → returns "merged". That's
fine. But the more subtle issue: an "insert" action with one of the rows
being cancelled returns "merged" — should be "inserted". The dashboard
"merged" status confuses the operator looking at why no defer row exists.
**Fix**: Rewrite as a clearer state machine: distinguish "inserted",
"merged_into_existing", "fully_cancelled".

#### A26. `_fetch_bytes` and `_safe_get` honor only 3 redirects with no Retry-After awareness

**Location**: `packages/core/src/notify_bridge_core/notifications/telegram/client.py:217-268`.
**Scenario**: Immich behind a CDN can chain `302 → 302 → 200`. With 4 hops
it falls through to "Too many redirects". A user complains "old photos
suddenly missing in notifications".
**Fix**: Bump to 5 redirects and surface the chain in the error string for
easier debugging.

#### A27. No structured event log filter UI for "show me all drops in the last hour"

**Location**: `packages/server/src/notify_bridge_server/api/status.py` —
`event_log` rows have `details.dispatch_status` field but no API filter
exposes it. The frontend can fetch only via global filter on `event_type`.
**Scenario**: An operator sees "messages are missing today". They want to
filter event_log to `dispatch_status in (dropped_quiet_hours_nondeferrable,
deferred_then_dropped, deferred_then_failed)`. Today they can't.
**Fix**: Add `dispatch_status` and `dispatched=true|false` as first-class
event_log columns (denormalized from `details`), plus API + UI filter.

#### A28. `_render_cmd_template` falls back to `"[No template: X]"` user-visible text

**Location**: `packages/server/src/notify_bridge_server/commands/handler.py:111-115`.
**Scenario**: An operator removes a template slot by mistake. The next user
who runs `/random` sees `[No template: response_random]` in chat. Not just
ugly — it leaks internal slot names.
**Fix**: Show a friendly "Sorry, something went wrong on our side" + log at
error level. Better: refuse to disable the slot if it's referenced.

### LOW

#### A29. `_truncate`'s ellipsis can land inside a multi-byte char

The marker `"…"` is one Unicode codepoint (3 bytes UTF-8) but the truncate
counts characters, not bytes. Telegram counts UTF-16 code units, so for a
4090-char message ending in emoji, the calculation is off by a small constant.
Won't break sends but messages may end up slightly longer than `TELEGRAM_MAX_TEXT_LENGTH`
allows. Re-measure in UTF-16 code units (`len(s.encode('utf-16-le')) // 2`).

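
Measuring in UTF-16 code units, as A29 suggests, changes both the length check and the cut position. The helpers below are a sketch under that assumption; `utf16_len` and `truncate_utf16` are hypothetical names, not functions from the Telegram client.

```python
def utf16_len(text: str) -> int:
    """Length of `text` in UTF-16 code units."""
    return len(text.encode("utf-16-le")) // 2


def truncate_utf16(text: str, limit: int, marker: str = "…") -> str:
    """Truncate so the result, marker included, fits in `limit` code units."""
    if utf16_len(text) <= limit:
        return text
    budget = limit - utf16_len(marker)
    kept = []
    used = 0
    for ch in text:
        # Characters outside the Basic Multilingual Plane take two code units.
        width = 2 if ord(ch) > 0xFFFF else 1
        if used + width > budget:
            break
        kept.append(ch)
        used += width
    return "".join(kept) + marker
```

Iterating over Python characters means a surrogate pair is either kept whole or dropped whole, so the cut cannot split an emoji's two code units.
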
#### A30. `NotificationDispatcher._render_cache` set to fresh dict on every dispatch — comment says "reuse"

The instance attribute `self._render_cache` is reset to `{}` at the start
of every `_send_to_target` (line 245). The cache only helps across receivers
within one target, not across targets. The comment at line 111-115 implies
broader reuse. Either align comment with reality or actually share across
targets within one `dispatch()` call.

#### A31. Frontend `entity-cache.svelte.ts` doesn't propagate stale-cache errors

The shared `$state`-based caches return stale data silently if the underlying
fetch fails after a successful initial load. A user sees old target list
during an outage and is confused why edits aren't sticking.

---

## Part B — Missing functionality and "cool feature" gaps

Tier legend: **must-have** = blocks prod for any non-trivial install;
**nice-to-have** = clear value, ship in next minor; **aspirational** = ship
when v1.0+ slows down.
Effort: **S** ≈ 1-2 days; **M** ≈ 1 week; **L** ≈ 2+ weeks.

### Already in the backlog (post-v0.8.1 status check)

#### B1. Target-level quiet hours (per-target DND, multi-window, days-of-week, silent mode)

**Status**: Still missing in v0.8.1. The backlog item proposed a v1 cut
(target-level windows + `silent` mode for Telegram = `disable_notification=True`).
None of the proposed code paths exist:
- `notification_target.quiet_hours_json` column — not present.
- `disable_notification=True` plumbing through `TelegramClient.send_message`
  — not present.
- Days-of-week filter — not present.

**Pitch**: Quiet hours bind to the *watcher* (tracking config); users want
DND at the *destination*. "Don't ping my phone at night, regardless of
which provider".
**Who benefits**: Every user. Today they have to recreate per-link windows.
**Effort**: **M** (1 week — backend dispatcher gate + frontend Aurora-style fieldset).
**Tier**: **must-have for prod**.

#### B2. Immich Smart Actions expansion (auto-favorite by person, auto-archive, share-link rotation)

**Status**: Auto-Organize exists; no other action descriptors are shipped.
**Pitch**: Reuse the existing action descriptor pipeline. Auto-favorite-by-person
is the smallest cut.
**Effort**: **M** per action (a few days each).
**Tier**: nice-to-have.

#### B3. Block-based template builder

**Status**: Not started. `JinjaEditor` is unchanged.
**Effort**: **L** — frontend-only but big.
**Tier**: aspirational.

### Newly identified — must-have for prod

#### B4. Webhook delivery dedup table + "Test Delivery" replay

**Pitch**: Add the dedup table from A1, plus a `/api/webhooks/{provider_id}/replay/{delivery_id}`
endpoint that admin can hit to re-dispatch a stored payload without the upstream
provider needing to resend. Combined with the existing `WebhookPayloadLog`,
this is "click to retest" in the UI.
**Who benefits**: Every webhook provider. Replay is invaluable for debugging
template edits.
**Effort**: **M**.
**Tier**: **must-have for prod**.

#### B5. "Send test message" / template playground

**Pitch**: From the template editor, click "Try this template against the
last received event" → render preview, optionally send to a sandbox chat.
Bypass dispatch but exercise the full Jinja pipeline.
**Who benefits**: Every template edit today is a leap of faith — the operator
modifies the template, waits for the next real event, hopes nothing breaks.
**Effort**: **S-M**. The preview infrastructure already exists
(`services/sample_context.py`); add a "send to chat X" button.
**Tier**: **must-have for prod**.

#### B6. Template versioning + rollback

**Pitch**: Auto-snapshot each template on save (last 10 revisions). UI shows
diff between version N and N-1, "Restore" button. Same for command templates.
**Who benefits**: An operator who tweaks a template at midnight and goofs
the syntax needs an undo button.
**Effort**: **M**. New `template_revision` table; new endpoints; UI button.
**Tier**: **must-have for prod**.

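To make the snapshot / diff / restore flow in B6 concrete, here is a small in-memory sketch. `RevisionHistory` is a hypothetical stand-in for the proposed `template_revision` table; a real implementation would persist rows and prune old ones in SQL.

```python
import difflib
from collections import deque

MAX_REVISIONS = 10  # "last 10 revisions" from the pitch


class RevisionHistory:
    """In-memory stand-in for a per-template revision table."""

    def __init__(self, max_revisions: int = MAX_REVISIONS) -> None:
        # deque(maxlen=...) silently drops the oldest snapshot when full.
        self._revisions = deque(maxlen=max_revisions)

    def save(self, body: str) -> None:
        # Skip no-op saves so identical bodies do not evict useful history.
        if self._revisions and self._revisions[-1] == body:
            return
        self._revisions.append(body)

    def revisions(self) -> list:
        return list(self._revisions)

    def diff_latest(self) -> str:
        """Unified diff between version N-1 and version N ('' if fewer than two)."""
        if len(self._revisions) < 2:
            return ""
        previous, current = self._revisions[-2], self._revisions[-1]
        return "".join(
            difflib.unified_diff(
                previous.splitlines(keepends=True),
                current.splitlines(keepends=True),
                fromfile="N-1",
                tofile="N",
            )
        )

    def restore_previous(self) -> str:
        """Re-save version N-1 as a new revision, so the restore can itself be undone."""
        if len(self._revisions) < 2:
            raise LookupError("no earlier revision to restore")
        previous = self._revisions[-2]
        self._revisions.append(previous)
        return previous
```

Recording a restore as a new revision, rather than deleting version N, is one possible design choice: it keeps the broken edit around for inspection.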
#### B7. Bulk operations on trackers / targets / links

**Pitch**: Multi-select in lists → "disable selected", "delete selected",
"export selected templates as JSON bundle", "move to user X".
**Who benefits**: Operators with >10 trackers. A common pain point: deploying
the bridge for a new family member requires N clicks per tracker.
**Effort**: **M** (frontend-heavy).
**Tier**: **must-have for prod**.

#### B8. Bot blocked / chat-not-found auto-disable + dashboard

**Pitch**: Detect Telegram 403 / 400 chat-related errors (for example
`Forbidden: bot was blocked by the user` or `Bad Request: chat not found`).
Mark the receiver or `TelegramChat` as `disabled_by_remote`. Surface it in a
"Stale receivers" admin view with a "Try resending invite" / "Delete chat" button.
**Who benefits**: Every Telegram user. Today the bridge silently sprays
errors until a human looks.
**Effort**: **S**.
**Tier**: **must-have for prod**.

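The detection step in B8 could start as a small classifier over the `error_code` and `description` fields of a failed Bot API response. This is a sketch only: the function name is hypothetical, and the description fragments are an assumed, non-exhaustive list that should be checked against errors actually seen in `EventLog`.

```python
# Assumed, non-exhaustive fragments of Telegram error descriptions that
# indicate the chat is gone or unreachable until a human intervenes.
_CHAT_GONE_MARKERS = (
    "chat not found",
    "bot was blocked by the user",
    "bot was kicked",
    "user is deactivated",
)


def should_disable_receiver(error_code: int, description: str) -> bool:
    """Decide whether a failed send should mark the receiver disabled_by_remote."""
    text = description.lower()
    if error_code == 403:
        # Forbidden: the bot has no right to message this chat at all.
        return True
    if error_code == 400:
        # Bad Request covers many things (bad markup, oversized media, ...);
        # only chat-related failures should disable the receiver.
        return any(marker in text for marker in _CHAT_GONE_MARKERS)
    return False
```

Errors such as 429 or a 400 caused by malformed markup return `False`, so a template bug does not disable a healthy receiver.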
#### B9. Forum-thread (topic) routing for Telegram

**Pitch**: Per-receiver `message_thread_id` field, auto-detected from incoming
command messages. UI: when adding a chat that `getChat` reports as `is_forum`,
show a topic selector populated from topics the bridge has already observed
(`forum_topic_created` service messages, or `message_thread_id` on incoming
messages). The Bot API has no method that lists a chat's topics;
`getForumTopicIconStickers` only returns the stickers usable as topic icons.
**Who benefits**: Any group install where the user wants notifications in a
dedicated topic.
**Effort**: **M**.
**Tier**: **must-have for prod**.

#### B10. Telegram inline buttons + callback queries

**Pitch**: Templates can declare `{% buttons %}` with action descriptors.
Bridge listens for `callback_query` updates, dispatches to a registered
action (e.g. "Mark album as favorite", "Snooze this tracker for 1h", "Run
HA service light.turn_off").
**Who benefits**: Power users. Foundation for several other features
(Immich duplicate-cluster review, HA action button → service call, snooze).
**Effort**: **L**.
**Tier**: nice-to-have, but a prerequisite for B25 and the other button-driven
features listed above.

#### B11. User snooze / mute via bot command

**Pitch**: `/snooze 1h` mutes the bot's outbound chat for 1h.
`/mute provider gitea` mutes a whole provider for that chat. `/wake` undoes.
Implemented as a per-receiver `snoozed_until` column.
**Effort**: **S-M**.
**Tier**: **must-have for prod** (user-side relief valve).

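As an illustration of the `/snooze` half of B11, the sketch below turns a command such as `/snooze 1h` into a `snoozed_until` timestamp. The accepted units (`m`, `h`, `d`) and the function names are assumptions; the proposal does not specify a duration grammar.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumed grammar: a positive integer followed by m (minutes), h (hours) or d (days).
_DURATION_RE = re.compile(r"^(\d+)([mhd])$")
_UNIT_SECONDS = {"m": 60, "h": 3600, "d": 86400}


def parse_snooze(text: str, now: datetime) -> Optional[datetime]:
    """Return the snoozed_until value for a /snooze command, or None if malformed."""
    parts = text.strip().split()
    if len(parts) != 2:
        return None
    # In groups Telegram may deliver the command as "/snooze@botname".
    if parts[0].split("@", 1)[0] != "/snooze":
        return None
    match = _DURATION_RE.match(parts[1].lower())
    if match is None or int(match.group(1)) == 0:
        return None
    seconds = int(match.group(1)) * _UNIT_SECONDS[match.group(2)]
    return now + timedelta(seconds=seconds)


def is_snoozed(snoozed_until: Optional[datetime], now: datetime) -> bool:
    """True while the receiver should be skipped by the dispatcher."""
    return snoozed_until is not None and now < snoozed_until
```

Under this sketch, `/wake` would simply set `snoozed_until` back to `NULL`.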
### Newly identified — nice-to-have
#### B12. Per-target / per-user rate limit (send-side)

**Pitch**: Cap outbound messages per minute per receiver. The existing 429
backoff handles Telegram's limit, but a runaway template / event-storm
provider can still spray the user's phone with 200 messages.
**Effort**: **S**. Token bucket per `chat_id` in `_send_telegram`.
**Tier**: nice-to-have.

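A token bucket per `chat_id`, as suggested for B12, might look like the sketch below. The class name, the default of 20 messages per 60 seconds, and the injectable `clock` are assumptions made for illustration; how `_send_telegram` would call it is not shown.

```python
import time
from typing import Callable, Dict, Tuple


class ReceiverRateLimiter:
    """Token bucket per chat_id: `capacity` sends per `per_seconds`, refilled continuously."""

    def __init__(self, capacity: int = 20, per_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._capacity = float(capacity)
        self._refill_rate = capacity / per_seconds  # tokens per second
        self._clock = clock
        # chat_id -> (tokens remaining, time of last update)
        self._buckets: Dict[int, Tuple[float, float]] = {}

    def allow(self, chat_id: int) -> bool:
        """Consume one token for chat_id; return False when the bucket is empty."""
        now = self._clock()
        tokens, updated_at = self._buckets.get(chat_id, (self._capacity, now))
        # Refill for the elapsed time, never exceeding the bucket capacity.
        tokens = min(self._capacity, tokens + (now - updated_at) * self._refill_rate)
        allowed = tokens >= 1.0
        if allowed:
            tokens -= 1.0
        self._buckets[chat_id] = (tokens, now)
        return allowed
```

What happens to a rejected message (drop, defer, or coalesce into a summary) is a separate decision that the pitch leaves open.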
#### B13. Message dedup window (idempotency key per outbound message)

**Pitch**: SHA256 of `(target_id, receiver_id, rendered_message,
event_collection_id)`. If the same key was sent in the last 5 min, skip.
**Effort**: **S**.
**Tier**: nice-to-have (overlaps heavily with A1+A2, but covers
end-of-pipeline dedup, after all coalescing).

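The idempotency key in B13 can be sketched as follows. JSON-encoding the tuple before hashing is an assumption made here so that field boundaries stay unambiguous; `DedupWindow` is a hypothetical in-memory helper, and a multi-process deployment would need a shared store instead.

```python
import hashlib
import json
import time
from typing import Callable, Dict, Optional


def dedup_key(target_id: int, receiver_id: int, rendered_message: str,
              event_collection_id: Optional[str]) -> str:
    """SHA256 over the four fields named in the pitch."""
    # A JSON list keeps (1, 23) and (12, 3) from producing the same byte string.
    payload = json.dumps(
        [target_id, receiver_id, rendered_message, event_collection_id],
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class DedupWindow:
    """Remember keys for `window_seconds` (5 minutes by default) and reject repeats."""

    def __init__(self, window_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._window = window_seconds
        self._clock = clock
        self._seen: Dict[str, float] = {}  # key -> time first sent

    def should_send(self, key: str) -> bool:
        now = self._clock()
        # Drop keys whose window has elapsed.
        self._seen = {k: t for k, t in self._seen.items() if now - t < self._window}
        if key in self._seen:
            return False
        self._seen[key] = now
        return True
```

In this sketch the window is measured from the first send and is not extended by suppressed duplicates.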
#### B14. Weekly digest / per-target stats / per-provider error rate

**Pitch**: Cron-based weekly summary email/Telegram. "Top 5 noisy trackers",
"Receivers with >X% failure rate", "Top 5 days of the week with the most
activity". Operator preventive maintenance.
**Effort**: **M**.
**Tier**: nice-to-have.

#### B15. Mobile-friendly minimal mode for the SPA

**Pitch**: The Aurora redesign is a lot for mobile. A "manage from phone"
minimal layout: a list of trackers, click to toggle, click to mute. Stops
operators from needing a desktop to silence a chatty tracker at 1am.
**Effort**: **M**.
**Tier**: nice-to-have.

#### B16. Audit log of admin actions

**Pitch**: New `audit_log` table. Every create/update/delete on
`NotificationTracker`, `NotificationTarget`, `TemplateConfig`, `ServiceProvider`,
`TelegramBot`, `User`, etc. writes a row with `(user_id, action,
entity_type, entity_id, before_json, after_json, ip, ua)`. Admin UI tab.
**Effort**: **M**. SQLAlchemy event listeners on the affected models.
**Tier**: nice-to-have for multi-admin installs; must-have if any
compliance requirement applies.

#### B17. Health → not just /ready, but per-component status page

**Pitch**: `/api/health/components` returns `{providers: [{id, last_ok_at,
last_error}], targets: [{id, last_ok_at, last_error}], scheduler:
{job_count, next_fires}}`. Frontend "Status" tab.
**Effort**: **S-M**. The data is already in `EventLog` / scheduler API.
**Tier**: nice-to-have.

#### B18. Provider unreachable backoff + escalation

**Pitch**: Today `bridge_self` emits `bridge_self_poll_failures` after N
consecutive failures. Add (a) exponential backoff on the polling interval after
M failures so we don't hammer a down host, and (b) a recovery notification
when the provider comes back.
**Effort**: **S**.
**Tier**: nice-to-have.

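Both halves of B18 reduce to small pure functions. The sketch below assumes a doubling factor, a one-hour cap, and hypothetical function names; none of these values come from the codebase.

```python
def next_poll_interval(base_interval: float, consecutive_failures: int,
                       backoff_after: int = 3, factor: float = 2.0,
                       max_interval: float = 3600.0) -> float:
    """Polling interval in seconds, backing off once `backoff_after` (M) failures pile up."""
    if consecutive_failures < backoff_after:
        return base_interval
    # The M-th failure doubles the interval once, the next one twice, and so on.
    # The exponent is clamped so a very long outage cannot overflow the float.
    doublings = min(consecutive_failures - backoff_after + 1, 32)
    return min(max_interval, base_interval * factor ** doublings)


def should_notify_recovery(failures_before_success: int, alert_threshold: int) -> bool:
    """True when a poll succeeds after the failure alert (N failures) has already fired."""
    return failures_before_success >= alert_threshold
```

Gating the recovery message on the alert threshold means users hear "provider is back" only if they were previously told it was down.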
#### B19. RSS provider

**Pitch**: Generic RSS/Atom feed poller. One more provider, reuses event_dispatch.
Long-tail value (operator wants "notify me when a blog publishes").
**Effort**: **M**.
**Tier**: nice-to-have.

#### B20. Mobile push / FCM channel

**Pitch**: A dedicated FCM "Receiver" type so the user can ship their own
companion app. Today Telegram is the only realtime channel; email is too
slow; webhook out is for plumbing.
**Effort**: **L**.
**Tier**: aspirational.

### Newly identified — aspirational
#### B21. Conversation threading per source (one notification thread per album / repo)

**Pitch**: Use Telegram `reply_parameters` to chain all notifications about
"Album X" as a single thread that grows over time. Today every notification
is a top-level message. Threading turns the chat into a navigable history.
**Effort**: **M**. Store `last_message_id` per `(target_id, collection_id)`,
pass it as `reply_parameters.message_id` (`reply_to_message_id` on Bot API
versions before 7.0).
**Tier**: aspirational but a clear differentiator.

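A sketch of the threading bookkeeping described in B21: keep the last sent message id per `(target_id, collection_id)` and attach it to the next `sendMessage` payload. `ThreadStore` and `build_send_message_payload` are hypothetical names, and the in-memory dict stands in for whatever table would persist the ids.

```python
from typing import Dict, Optional, Tuple

ThreadKey = Tuple[int, str]  # (target_id, collection_id)


class ThreadStore:
    """In-memory stand-in for a persisted last_message_id lookup."""

    def __init__(self) -> None:
        self._last: Dict[ThreadKey, int] = {}

    def get(self, target_id: int, collection_id: str) -> Optional[int]:
        return self._last.get((target_id, collection_id))

    def remember(self, target_id: int, collection_id: str, message_id: int) -> None:
        self._last[(target_id, collection_id)] = message_id


def build_send_message_payload(chat_id: int, text: str,
                               last_message_id: Optional[int]) -> dict:
    """Build sendMessage parameters, replying to the previous message when one is known."""
    payload: dict = {"chat_id": chat_id, "text": text}
    if last_message_id is not None:
        payload["reply_parameters"] = {
            "message_id": last_message_id,
            # Still deliver if the earlier message was deleted in the meantime.
            "allow_sending_without_reply": True,
        }
    return payload
```

After each successful send, the dispatcher would call `remember(...)` with the `message_id` from Telegram's response so the chain keeps growing.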
#### B22. A/B test variants for templates

**Pitch**: A template config can carry 2 variants. The dispatcher
hash-routes receivers to A or B; the dashboard shows "variant A's response
time / click rate / receiver mute rate".
**Effort**: **L**.
**Tier**: aspirational.

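The hash-routing step in B22 can be made deterministic with a stable digest, so a receiver always sees the same variant across restarts. This is a sketch under assumptions: the function name and the choice to salt the hash with the template config id are illustrative.

```python
import hashlib
from typing import Sequence


def assign_variant(template_config_id: int, receiver_id: int,
                   variants: Sequence[str] = ("A", "B")) -> str:
    """Map a receiver to a variant, stable across processes and restarts."""
    # hashlib is used instead of the built-in hash(), which is salted per process
    # for strings. Including the template id decorrelates assignments between
    # experiments, so a receiver is not always in group A.
    digest = hashlib.sha256(f"{template_config_id}:{receiver_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[bucket]
```

The split is only approximately even, which is usually acceptable for an A/B comparison with more than a handful of receivers.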
#### B23. Dark-launch a new template before enabling it

**Pitch**: "Send-to-sandbox-chat-only" toggle on a template config. The new
template renders against real events but only goes to one operator's chat
for 1 week. Then promote to production.
**Effort**: **M**. Builds on template versioning (B6).
**Tier**: aspirational.

#### B24. Scheduled template changes

**Pitch**: "On 2026-12-25 at 09:00, switch template_config X to draft Y".
Useful for holiday-themed greetings or batch migrations.
**Effort**: **M**.
**Tier**: aspirational.

#### B25. HA service-call from a Telegram inline button

**Pitch**: Building on B10. A template renders `{% button hass:light.turn_off
target=living_room %}`. User clicks → bridge calls HA `light.turn_off`.
**Effort**: **M** (after B10).
**Tier**: aspirational.

---

## Ship-blocker checklist (do not widen user audience without)

Order is rough priority (top first). Most are also called out in Part A.

1. **A1**: Webhook idempotency table (Gitea/Planka/generic). Without this,
   one upstream retry storm can double-/quadruple-spray every user.
2. **A2**: Deferred-dispatch crash window. A redeploy mid-drain duplicates
   every queued notification. Implement either the `dispatch_id`
   pre-commit OR the `in_flight` state machine.
3. **A3**: Persist Telegram update offset. Same root cause class as A1/A2;
   matters less if A1+A2 are fixed but should land together.
4. **A4 / B8**: Bot blocked / chat-not-found auto-disable. A user blocking
   the bot must not generate infinite errors.
5. **A11**: Webhook JSON depth/node cap (mirror the backup guard).
6. **A9**: Quiet-hours `start == end` confirmation; either accept "always
   quiet" semantics or reject in the API validator.
7. **A8**: DST handling in quiet-hours overnight window. Verify with
   tests that include known transition timestamps.
8. **B5**: "Send test message" / template playground. Without this, every
   template edit is a blind change against a live system.
9. **B6**: Template versioning + rollback. Pair with B5.
10. **A5 / B9**: Forum-thread (topic) routing. Any non-trivial Telegram
    group install needs this.
11. **B11**: User snooze / mute via bot command. Relief valve when the
    bridge gets too chatty.
12. **B7**: Bulk operations on trackers / targets / links. Operability
    floor for any install with >10 trackers.

Everything else in Part B is upside, not a blocker.