feat: observability, per-receiver Telegram options, oversized-video fallback

Operability:
- Correlation IDs end-to-end: shared dispatch_id between log lines and EventLog rows (event/watcher/scheduled/deferred/action/HA/command paths) and a new X-Request-Id middleware that normalizes inbound ids and binds request_id into log context.
- dispatch_summary block merged into EventLog.details: per-target success/failure counts plus Telegram media delivered/skipped/failed and truncated error lists, so partial outcomes surface in the UI.
- Diagnostic mode: admin can flip one module to DEBUG for a bounded window with auto-revert (in-memory only; setup_logging() resets on boot, lifespan reverts on shutdown). New /diagnostic-mode endpoints plus DiagnosticsCassette UI on the settings page.

Telegram:
- Per-receiver options: disable_notification (silent send) and message_thread_id (forum-topic routing), wired through the dispatcher via a ContextVar so all four send sites (sendMessage / sendPhoto-Video-Document / sendMediaGroup / cache-hit POST) pick them up.
- send_large_videos_as_documents target setting: bypass the 50 MB sendVideo cap by falling back to sendDocument for oversized videos.
- sendMediaGroup byte-budget enforcement (TELEGRAM_MAX_GROUP_TOTAL_BYTES, 45 MB) with per-item fallback on chunk failure so a stale file_id no longer silently drops a cached asset.

Tests:
- New: diagnostic_mode, dispatch_summary, request_correlation, telegram_media_group_partial, telegram_per_send_options.

Docs:
- .claude/reviews/: six-axis production-readiness review of v0.8.1.
- .claude/docs/functional-review-2026-05-28.md: focused review of Telegram/Immich/logging subsystems.
@@ -0,0 +1,89 @@
# Production-Readiness Review: service-to-notification-bridge v0.8.1

**Date:** 2026-05-22 **Scope:** entire codebase (~70k LOC, 312 files)
**Branch:** master @ a20635a **Reviewers:** 6 parallel specialised agents

## Verdict

**Ship-readiness: nearly there.** The product is in materially better shape than a typical pre-1.0 release: the core security baselines are in place (sandboxed Jinja2, bcrypt+JWT, SSRF guard with DNS-rebinding mitigation, secret masking, signed webhooks, non-root Docker, owner-scoped queries) and the feature set is mature (deferred dispatch, quiet hours, fan-out caps, 429 backoff, Prometheus metrics). No CRITICAL security findings exist.

The work that *should* block shipping to wider users is concentrated in **three buckets**: (1) a handful of correctness defects that surface only under load or restart (duplicate-send class), (2) two secret-handling gaps (HA token returned cleartext, bot tokens/SMTP passwords unencrypted at rest), and (3) the schema-management story (`create_all` on boot plus a 1880-line hand-rolled migration script, with no Alembic).

## Reports

| Axis | File | Findings | Top hit |
|---|---|---|---|
| Backend (Python) | [backend-review.md](backend-review.md) | 5C / 15H / 18M / 10L | `asyncio.create_task` GC in HA status logger |
| Frontend (TS/Svelte) | [frontend-review.md](frontend-review.md) | 2C / 10H / 19M / 7L | JWT access+refresh in `localStorage` |
| Security | [security-review.md](security-review.md) | 0C / 2H / multiple M | HA `access_token` not masked on `GET /providers/{id}` |
| Performance + DB | [performance-db-review.md](performance-db-review.md) | 3C / 7H / 10M / 10L | `SQLModel.metadata.create_all` on every boot |
| Bugs + features | [bugs-features-review.md](bugs-features-review.md) | 3C / 13H / 12M / 3L + 25 features | Webhook redelivery has no idempotency |
| UI/UX | [ui-ux-review.md](ui-ux-review.md) | ~33 across 13 axes | Five overlapping glass-card abstractions |

## Ship blockers (must fix before wider rollout)

Cross-cutting top 12, verified across all six reviews:

1. **HA `access_token` returned in plaintext** on `GET /api/providers/{id}`: the field is missing from the mask list. *(Security H-1, [providers.py:399-405](packages/server/src/notify_bridge_server/api/providers.py#L399))*
2. **Secrets unencrypted at rest**: Telegram bot tokens, SMTP passwords, HA tokens, and webhook secrets are stored as plain text in SQLite. Theft of a disk, snapshot, or backup exposes the full credential set. *(Security H-2)*
3. **Frontend JWT access + refresh tokens in `localStorage`**: any future XSS exfiltrates the session in one call. Move to an httpOnly cookie. *(Frontend C-1)*
4. **`asyncio.create_task` fire-and-forget** in `ha_subscription._on_status_change`: the event loop keeps only a weak reference to the task, so it may be GC'd before completion. *(Backend C-1, [ha_subscription.py:249](packages/server/src/notify_bridge_server/services/ha_subscription.py#L249))*
5. **Pre-auth 1 MiB body read** on Gitea + generic webhooks: a DoS amplifier. An HMAC signature cannot be checked without the body, so reject requests that lack a signature header or exceed a tighter `Content-Length` cap before buffering, then verify `X-Hub-Signature` over the bytes actually read. *(Backend C-3, [webhooks.py:167](packages/server/src/notify_bridge_server/api/webhooks.py#L167) and line 449)*
6. **No webhook idempotency**: the Gitea/Planka/generic handlers don't dedupe by `X-Gitea-Delivery` or an equivalent delivery ID. Replays = duplicate sends. *(Bugs C-1)*
7. **Deferred-dispatch crash window**: `dispatch()` returns before `session.commit()`, so a restart in between re-fires the send. Wrap in an idempotent "claim → send → ack" sequence with a unique constraint. *(Bugs C-2)*
8. **Telegram `_last_update_id` in-memory only**: a restart can replay or skip commands. Persist the watermark. *(Bugs C-3)*
9. **`init_db` calls `SQLModel.metadata.create_all` on every boot**: `create_all` only creates missing tables and never alters existing ones, so fresh and upgraded installs drift apart. Adopt Alembic. *(Perf C-1)*
10. **Template-preview endpoints bypass sandbox timeout**: an authenticated user can wedge a worker with `{% for i in range(10**8) %}`. *(Security M-1)*
11. **Telegram webhook handler missing `session.rollback()`** in its catch-all: leaves uncommitted writes on the session. *(Backend C-2, [commands/webhook.py:162](packages/server/src/notify_bridge_server/commands/webhook.py#L162))*
12. **CLAUDE.md rule-8 violation**: `if (provider.type !== 'immich')` in `RuleEditor.svelte` silently disables the people/album picker for other providers. *(Frontend C-2, [RuleEditor.svelte:57](frontend/src/routes/actions/RuleEditor.svelte#L57))*

## Next-tier priorities (HIGH: fix in the same release where practical)

13. Audit `backup_schema.PROVIDER_SECRET_FIELDS` so `webhook_secret`, `password`, `client_secret`, `refresh_token` are scrubbed on export. *(Backend C-5)*
14. Add `asyncio.Lock` around `bridge_self` failure-counter dicts. *(Backend C-4)*
15. Login rate-limit is per-IP only, so a slow brute force from rotating source IPs succeeds. Add per-account lockout and raise the password floor. *(Security M-2)*
16. Three frontend CRUD pages copy cache items into local `$state`, breaking the shared-cache invariant and forcing a full refetch per mutation. *(Frontend H-1/H-2)*
17. Uncancelled `setTimeout` chain in the backup restart flow can call `window.location.reload()` after navigation. *(Frontend H-5)*
18. Refresh-token race against `logout()` produces spurious "Unauthorized" toasts. *(Frontend H-6/H-7)*
19. Dashboard per-provider GROUP-BY aggregate runs unbounded on every refresh, with no caching and no covering index. *(Perf H-1/H-2)*
20. Telegram truncation and parse-mode escaping (HTML-aware truncate, fractional seconds in `_extract_retry_after`, forum `message_thread_id` routing, auto-disable on 403 "bot blocked"). *(Bugs H-various)*
21. Five overlapping glass-card abstractions + radius drift (22/18/14/12 px) + ~71 legacy `rounded-md text-sm bg-…` form inputs that bypass the global Aurora `input{}` rule. *(UI/UX H-CONSIST-01..04)*
22. Hardcoded hex colors (`#059669`, `#ef4444`) in Snackbar/ConfirmModal/actions bypass theming. *(UI/UX H-CONSIST-03)*
23. Snackbar has no `aria-live` and nav lacks `aria-current="page"`, so both are invisible to screen readers. *(UI/UX H-NAV-01, A11y)*
24. DST handling in overnight quiet-hours windows. *(Bugs H)*
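
Item 20 mentions fractional seconds in `_extract_retry_after`. The sketch below shows one way such a helper could behave, assuming the Telegram error payload carries `parameters.retry_after` and, as a fallback, a "retry after N" phrase in `description`. The function name `extract_retry_after`, its `default` parameter, and the regex are illustrative assumptions, not the project's code.

```python
import math
import re

# Matches "retry after 5" or "retry after 2.5" in an error description.
_RETRY_AFTER_RE = re.compile(r"retry after (\d+(?:\.\d+)?)", re.IGNORECASE)


def extract_retry_after(payload: dict, default: float = 1.0) -> int:
    """Return whole seconds to wait, rounding fractional values up."""
    value = (payload.get("parameters") or {}).get("retry_after")
    if value is None:
        # Fall back to the human-readable description, if it has a number.
        match = _RETRY_AFTER_RE.search(payload.get("description") or "")
        value = match.group(1) if match else default
    try:
        seconds = float(value)
    except (TypeError, ValueError):
        seconds = default
    # Round up so we never retry early, and always wait at least one second.
    return max(1, math.ceil(seconds))
```

Rounding up rather than truncating is the point of the sketch: truncating `2.3` to `2` would retry inside the flood-control window and likely earn another 429.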

## What's working well: keep doing this

- **Sandboxed Jinja2 everywhere** (security agent verified every `Environment()` instantiation is `SandboxedEnvironment`).
- **`PinnedResolver` SSRF defence**: handles CGNAT, IPv4-mapped IPv6, DNS rebinding.
- **JWT with `token_version` revocation**: bcrypt offloaded to worker thread, constant-time username probe.
- **Hardened Docker**: non-root, read-only root FS, `cap_drop: ALL`.
- **Aurora/Glass design identity**: distinctive (conic-gradient orb, Newsreader italic display serif, lavender/orchid palette, "signal stream"/"on watch"/"wires"/"pulse" editorial labels). Not generic AI admin work.
- **Frontend type discipline**: `svelte-check` clean, EN/RU exactly 1466 keys each, no `eval`/`innerHTML`/`var`/`==` anywhere.
- **Most SQL hot paths already batched**: `load_link_data` is fully fan-in/fan-out; partial unique indexes on deferred-dispatch are thoughtful.
- **Most v0.8.1 production-readiness items shipped**: fan-out caps, 429 backoff, parse_mode fallback, scheduler misfire grace, Prometheus, deep healthcheck, per-receiver render cache.
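
To make the SSRF bullet concrete, here is a minimal sketch of the kind of address check such a guard needs, using only the standard-library `ipaddress` module. It is not the project's `PinnedResolver`; the function name `is_forbidden_target` is hypothetical, and the check is deliberately not exhaustive (multicast, for example, would need its own rule).

```python
import ipaddress


def is_forbidden_target(address: str) -> bool:
    """Return True if `address` must not be contacted by outbound requests."""
    ip = ipaddress.ip_address(address)
    # Unwrap IPv4-mapped IPv6 (::ffff:a.b.c.d) so the IPv4 rules apply.
    if isinstance(ip, ipaddress.IPv6Address) and ip.ipv4_mapped is not None:
        ip = ip.ipv4_mapped
    # `is_global` is False for loopback, RFC 1918, link-local and the
    # CGNAT range 100.64.0.0/10, so one check covers all of them.
    return not ip.is_global
```

A check like this only defends against DNS rebinding if it runs on the address that is actually connected to: resolve once, validate the result, and connect to that same IP rather than resolving the hostname a second time.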

## Top missing features worth adding next

Pulled from the bugs-features report; full pitches are in [bugs-features-review.md](bugs-features-review.md):

- **Template playground**: "send test against last event" + dry-run with sample payload.
- **Template versioning + rollback** with audit log.
- **Bulk operations** on targets/templates (currently row-by-row).
- **User-side snooze/mute via bot command** ("/mute 2h", "/snooze tonight").
- **Auto-disable receiver on Telegram 403 ("bot blocked")** with admin notification.
- **Rate-limit per target** (separate from global fan-out cap).
- **Weekly digest + per-target stats + per-provider error rate**.
- **Generic webhook provider** and **email / Discord / ntfy.sh / Matrix** channels.
- **Message dedup window** (kills duplicate sends from redelivery and scheduler misfires).
- **First-run "Getting Started" checklist** on empty dashboard (UI/UX).
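
The message dedup window in the list above can be sketched as a small TTL cache keyed by a delivery identifier, for example the `X-Gitea-Delivery` header value. The class below is an in-memory illustration only: `DedupWindow`, its parameters, and the key format are assumptions, and a production version would more likely rely on a unique constraint in the database so that the window survives restarts.

```python
import time
from typing import Callable, Dict


class DedupWindow:
    """Remember keys for `ttl` seconds and report repeats within that window."""

    def __init__(self, ttl: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl
        self._clock = clock  # injectable so tests can control time
        self._seen: Dict[str, float] = {}  # key -> expiry timestamp

    def is_duplicate(self, key: str) -> bool:
        now = self._clock()
        # Evict expired keys so memory stays bounded by recent traffic.
        self._seen = {k: exp for k, exp in self._seen.items() if exp > now}
        if key in self._seen:
            return True
        self._seen[key] = now + self._ttl
        return False
```

A handler would call `is_duplicate(delivery_id)` before dispatching and skip the send when it returns `True`.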

## How to consume this review

Each report has clickable `file:line` markdown links. Recommended sequence:

1. Read this `README.md`.
2. Skim each report's Executive Summary (top 5-7 bullets).
3. Triage the **Ship blockers (1-12)** above into the next release branch as individual issues.
4. Schedule the **HIGH list (13-24)** for the same release where practical, otherwise for the release after.
5. Treat the feature ideas as a refresh of `.claude/docs/feature-backlog.md`.