Files

T

alexei.dolgolyov 6a8f374678 feat: observability, per-receiver Telegram options, oversized-video fallback

Operability:
- Correlation IDs end-to-end: shared dispatch_id between log lines and
  EventLog rows (event/watcher/scheduled/deferred/action/HA/command paths)
  and a new X-Request-Id middleware that normalizes inbound ids and binds
  request_id into log context.
- dispatch_summary block merged into EventLog.details: per-target
  success/failure counts plus Telegram media delivered/skipped/failed and
  truncated error lists, so partial outcomes surface in the UI.
- Diagnostic mode: admin can flip one module to DEBUG for a bounded
  window with auto-revert (in-memory only; setup_logging() resets on
  boot, lifespan reverts on shutdown). New /diagnostic-mode endpoints
  plus DiagnosticsCassette UI on the settings page.

Telegram:
- Per-receiver options: disable_notification (silent send) and
  message_thread_id (forum-topic routing), wired through the dispatcher
  via a ContextVar so all four send sites (sendMessage / sendPhoto-Video-
  Document / sendMediaGroup / cache-hit POST) pick them up.
- send_large_videos_as_documents target setting: bypass the 50 MB
  sendVideo cap by falling back to sendDocument for oversized videos.
- sendMediaGroup byte-budget enforcement (TELEGRAM_MAX_GROUP_TOTAL_BYTES,
  45 MB) with per-item fallback on chunk failure so a stale file_id no
  longer silently drops a cached asset.

Tests:
- New: diagnostic_mode, dispatch_summary, request_correlation,
  telegram_media_group_partial, telegram_per_send_options.

Docs:
- .claude/reviews/: six-axis production-readiness review of v0.8.1.
- .claude/docs/functional-review-2026-05-28.md: focused review of
  Telegram/Immich/logging subsystems.

2026-05-28 15:19:31 +03:00

24 KiB

Raw Blame History

Functional Review — Telegram, Immich, Logging (2026-05-28)

Snapshot review of three subsystems, with prioritised improvement candidates. Pairs with feature-backlog.md — items here are infrastructure that unlocks several backlog features.

All citations are from the working tree at commit 85a8f1e (master). Two files (packages/core/src/notify_bridge_core/notifications/telegram/client.py, media.py) had uncommitted changes at review time — see Telegram § "In-flight work".

1. Telegram infrastructure

Telegram — what works well

Single chokepoint TelegramClient (packages/core/src/notify_bridge_core/notifications/telegram/client.py) covers text/photo/video/document/media-group, with 429-aware retry, parse-error retry, file_id cache, multi-bot per-token instances, polling + webhook modes, and bot-command registration.
CLAUDE.md rule #6 satisfied for the production paths.
Caption length, group sizing, parse-mode fallback all enforced.

In-flight work

Byte-budget sub-chunking for media groups (TELEGRAM_MAX_GROUP_TOTAL_BYTES in media.py) with per-item fallback inside _send_media_group. Logic is coherent; before commit, verify _build_media_items callers still match the new signature (caption no longer injected at fetch time).

Gaps, ranked by user-visible value

No inline keyboards / callback_query handlers — zero infra for "Favorite / Archive / Dismiss" buttons on Immich notifications. Biggest UX unlock; prerequisite for several Immich smart actions.
No edit-in-place (editMessageText not wrapped). Pairs naturally with deferred dispatch / quiet hours coalescing — 5 separate "asset added" messages become 1 edited message.
disable_notification (silent send) not exposed — already a Telegram primitive; slots into the quiet-hours silent mode the backlog already mentions.
message_thread_id (forum topics) — single field per receiver; unblocks supergroup-with-topics users.
Direct TelegramClient(...) constructions in api/telegram_bots.py:314,394,404,412 bypass get_telegram_client() — violates CLAUDE.md rule #6 and skips the shared file_id cache.
Per-command authorization — commands_enabled is all-or-nothing per chat; no per-command allowlist or admin gate.
Long-message splitting — send_message silently truncates at 4096 (client.py:492).
No parse-mode per target — HTML hardcoded.

2. Immich

Immich — what works well

Mature polling pipeline: incremental delta-fetch via updatedAfter, pending-asset tracking, fingerprint fast-path skip, fallback to full fetch on count-decrease (providers/immich/provider.py).
Rich bot commands (status / albums / events / people / search / latest / random / favorites / summary / memory) with full asset context (CLAUDE.md rule #10 satisfied).
auto_organize action is well-shaped: AND person + smart-query union, exclusions, type/date/favorite filters, 500-asset batched add, idempotent diff against album asset_ids, dry-run, ActionExecution log.
Three scheduled features wired: periodic summaries, scheduled-asset delivery, Memory/On-This-Day (with native Immich memory API + fallback).

Highest-leverage candidates

Webhook ingestion — webhook_based=False at capabilities.py:46. Sub-second latency vs the current 5-min poll. New /api/webhooks/immich/{secret} route + parser + capability flip.
Share-link expiry monitoring + auto-rotate action — links silently break today; data is already fetched per event (provider.py:541-569).
Duplicate cluster digest — Immich >= 1.100 /api/duplicates is unused; pairs with inline buttons for "merge / ignore 30d".
Auto-favorite by person (already in backlog) — smallest delta on the existing auto_organize executor.
Per-person notification subscription — tracker-config filter, reuses existing asset.people data.
Album auto-curation from Inbox — date-based target album name, move (not copy); needs the Immich move endpoint (currently we only add_assets_to_album).
Storage / job-queue alerts — /api/server/stats and /api/jobs unused; lightweight poll + threshold = "disk full" / "transcoding stalled" notifications.
Smart-action infra polish — descriptors are reusable, but the rule editor is JSON-shaped, action-run statistics aren't aggregated, and dry-run shows counts not the asset list. Address before adding 5 more action types.

3. Logging

What's already in place

In logging_setup.py:

dictConfig with JsonFormatter (line-delimited JSON) toggleable via NOTIFY_BRIDGE_LOG_FORMAT=json.
SecretMaskingFilter redacts Telegram bot tokens + Authorization / api_key / password / refresh_token across msg, exc_text, stack_info.
ContextVar-driven record factory injects request_id, command, chat_id, bot_id, dispatch_id on every record. Text format: [req=- cmd=- bot=- chat=- disp=-].
Per-module overrides via NOTIFY_BRIDGE_LOG_LEVELS env or DB AppSetting. Live runtime patch via apply_log_levels() — no restart.
Noisy libs pre-quieted (sqlalchemy, aiohttp, apscheduler, urllib3, asyncio, httpx, httpcore, PIL, uvicorn.access).

Plus:

EventLog table with structured rows (event_type, status, assets_count, details JSON, FKs to tracker/provider/action/ command_tracker/bot), event_log_retention_days=30 default, daily APScheduler cleanup _cleanup_old_events (scheduler.py:332).
Prometheus counter notify_bridge_event_log_total{status,event_type}.
Frontend viewer with filters at api/status.py.
bind_log_context actually used in: dispatcher (dispatch_id), telegram_poller (bot/chat/command/request_id), webhook commands.

Gaps, ordered by debug-pain payoff

No FastAPI request-ID middleware. request_id_var is set only in webhook + Telegram poller paths. Every REST call from the SPA logs as req=-. Tiny middleware (read X-Request-Id or uuid4(), bind context, echo header) closes this whole-app blind spot.
dispatch_id is in log lines but NOT persisted on the EventLog row. Means you can find the failed row in the UI but can't grep stderr for the matching disp=.... Stash it in details.dispatch_id (no migration needed) — biggest cross-surface correlation win.
HTTP access log is uvicorn default (access_log=not _cfg.debug at main.py:419). Doesn't include request_id, latency, user, status as structured fields. Replace with a small RequestLoggerMiddleware that emits method, path, status, latency_ms, request_id.
Telegram media-group failures log richly but aren't linked to the resulting EventLog row. The dispatcher result-aggregation work in flight is the right place to dump errors[] into EventLog.details.errors.
In-browser log access is missing. EventLog rows are visible, but raw logger output requires container/SSH access. A bounded in-memory ring-buffer endpoint (admin-only, last N lines, filtered by context fields) would mean ~90% of triage stays in the UI.
No "diagnostic mode" UI. The runtime apply_log_levels() is great but only reachable through the app-settings JSON editor. A "Debug for 15 minutes: notify_bridge_core.notifications.telegram.client" button with auto-revert is a few-hours job.
EventLog.details is freeform. Frontend already destructures dispatch_status, deferred_until, deferred_for_seconds, original_event_log_id (types.ts:238-261). Define a typed EventLogDetails per event_type (Pydantic at the boundary) — prevents drift between providers.
No log rotation — StreamHandler(sys.stderr) only. Fine in containers, brittle on bare-metal. Optional RotatingFileHandler opt-in via env.
No slow-query / outbound-HTTP timing logs. sqlalchemy.engine=WARNING by default; no per-query duration log. Same for outbound calls to Immich / Telegram. A "duration_ms >= N" threshold logger would surface "why is this dispatch slow" without flipping global DEBUG.
Action dry-run output is logger-only. Could be streamed into the action editor.
Poll-result not persisted. Webhook payloads are logged (api/webhook_logs.py), but Immich/Google-Photos poll cycles emit no "last poll: 0 changes / 245ms" row. A lightweight provider_poll_log (small table or ring buffer) would answer "is the poller actually running" without reading stderr.

Recommended sequencing

#	Item	Status	Why first
1	Request-ID middleware + persist `dispatch_id` on `EventLog`	SHIPPED 2026-05-28	Unlocks the rest of the debug story; ~2 hours combined
2	Finish in-flight Telegram byte-budget chunking + write `errors[]` into `EventLog.details`	SHIPPED 2026-05-28	Already half-done; aligns with #1
3	Telegram inline keyboards + `callback_query` handler	not started	Prereq for several Immich smart actions
4	Telegram `disable_notification` + `message_thread_id` per target	SHIPPED 2026-05-28	Small, also feeds the open Quiet Hours v1 backlog item
5	Immich webhook ingestion	not started	5-min → sub-second; biggest user-facing latency win
6	Immich share-link expiry + auto-rotate (using #3)	not started	Real silent-breakage today
7	Diagnostic-mode UI (live log-level toggle with auto-revert)	SHIPPED 2026-05-28	Shifts triage to the browser
8	Immich duplicate digest + auto-favorite by person	not started	Both ride on #3

Items 1–4 are infrastructure that unlocks 5–8. Items 1, 2, 4 also smooth the Quiet Hours v1 / target-level windows that's top of the backlog — worth landing before that feature so quiet hours can dispatch through edited messages and silent sends from day one.

Decision log

2026-05-28 — Review completed. Starting work on item #1 (request-id middleware + persist dispatch_id on EventLog).
2026-05-28 — Item #1 shipped. Summary of the change:
- New helpers in packages/core/src/notify_bridge_core/log_context.py: ensure_dispatch_id() (reuse existing or mint a new disp:<12 hex>) and enrich_details_with_correlation(details) (shallow-copy a details dict and merge active dispatch_id / request_id from the ContextVar snapshot).
- New RequestContextMiddleware in packages/server/src/notify_bridge_server/main.py that reads inbound X-Request-Id (charset/length validated, : excluded so a client can't masquerade as a server-minted id), falls back to req:<12 hex>, binds the value via bind_log_context, and echoes it back as the response header. Added LAST so it's the outermost middleware.
- Outer entry points now bind a dispatch_id via a thin wrapper function (check_tracker, dispatch_provider_event, dispatch_scheduled_for_tracker, _process_row, run_action). All 10 EventLog(...) creation sites wrap their details= payload in enrich_details_with_correlation(...).
- Switched NotificationDispatcher.dispatch to use ensure_dispatch_id() instead of inline uuid.uuid4().
- New tests in packages/server/tests/test_request_correlation.py (12 tests) covering header echo, charset validation, prefix- masquerade rejection, helper merge semantics. All 239 server tests green.
- Reviewed by python-reviewer subagent (no CRITICAL/HIGH; 3 MEDIUM and 1 LOW addressed: PEP 8 imports moved to top of main.py; RequestResponseEndpoint type added to dispatch; : dropped from the request-id charset; shallow-copy caveat documented).
- Live smoke verified: generated id req:a9b9821f5aab on plain request; safe inbound my-trace-abc123 echoed unchanged; disp:fake12345678 correctly replaced; watcher tick log lines now show distinct disp=disp:<hex> per tracker check.
2026-05-28 — Item #2 shipped. Summary of the change:
- Confirmed the in-flight Telegram byte-budget media-group chunking in telegram/client.py is complete (15/15 media-group tests pass). Deleted the now-unused split_media_by_upload_size() from telegram/media.py.
- New module services/dispatch_summary.py with summarize_dispatch_results() (aggregator), attach_summary_in_place() (in-session) and record_dispatch_summary_async() (post-commit). Captures targets_attempted/succeeded/failed, per-target errors, media-group media{delivered,skipped,failed} counts and media_errors[] from the new TelegramClient._send_media_group partial-failure path. Bounded: 20 errors / 20 media errors / 500-char message cap with explicit …[truncated] marker.
- Wired at 4 dispatch sites:
  - event_dispatch.py: accumulates per-target results across all tracking-config groups, attaches summary in-session before commit.
  - deferred_dispatch.py: inlines summary into the new EventLog row's details for both delivered_after_quiet_hours and deferred_then_failed paths.
  - scheduled_dispatch.py: inlines summary into the cron-fire EventLog row's details.
  - watcher.py: follow-up record_dispatch_summary_async in a fresh session because the EventLog row was committed before dispatch.
- Frontend type drift fixed: types.ts gets new DispatchSummary, DispatchSummaryError, DispatchSummaryMediaError interfaces plus dispatch_id / request_id / dispatch_summary keys on EventLog.details.
- New tests in tests/test_dispatch_summary.py (10 tests): empty/all-success/mixed/media-counts/sub-errors/ truncation/long-message-trim/in-place attach/no-results no-op/ malformed sub-error. All 249 server tests green.
- Reviewed by python-reviewer subagent (no CRITICAL; 2 HIGH + 3 MEDIUM addressed: asyncio.CancelledError re-raise in the best-effort catch; late from .dispatch_summary import … calls hoisted to top of each file; empty-results contract changed from "zero-count summary attached" to "no key written"; truncation marker upgraded to …[truncated] for operator clarity; flag_modified comment tightened).
- Live smoke: backend restarts cleanly, watcher tick log lines continue showing disp=disp:<hex> correlation, no startup errors.
2026-05-28 — Item #4 shipped. Summary of the change:
- TelegramReceiver dataclass in receiver.py gains disable_notification: bool = False and message_thread_id: int | None = None. New _coerce_telegram_thread_id helper collapses Telegram's "general topic" sentinels (0, negatives, blanks, bools) to None so the Bot API just omits the field — matches the frontend's <= 0 → unset behaviour.
- TelegramClient (client.py) gets a frozen _SendOptions + _send_options_var ContextVar pattern for the deep media paths (_upload_media, _post_media_group, _send_from_cache) that can't easily plumb kwargs through. send_notification binds the var; the 3 deep builders read it via _apply_send_opts_to_payload / _apply_send_opts_to_form. send_message is a leaf and just inlines its kwargs into the JSON body directly (no ContextVar needed there).
- Dispatcher (dispatcher.py) passes receiver.disable_notification / receiver.message_thread_id into client.send_message(...) and client.send_notification(...).
- Frontend: new inline per-Telegram-receiver options panel in ReceiverSection.svelte triggered by a cog icon. Silent + thread-id indicators (bell-off icon, #N badge) on the row when set. +page.svelte handlers PUT the merged config to /api/targets/{id}/receivers/{rid}. 5 new i18n keys in en.json / ru.json.
- New tests in test_telegram_per_send_options.py — 19 tests: factory + thread-id coercion table (including bool rejection and 0/negative collapse), payload/form helper merge semantics, bind/reset under exceptions, concurrent-task isolation via asyncio.gather, end-to-end send_message payload assertions. All 270 server tests green.
- Reviewed by python-reviewer subagent (no CRITICAL; 2 HIGH + 1 MEDIUM + 1 LOW addressed: dead ContextVar bind in send_message removed in favor of inline kwarg injection; re-entrant bind from send_notification → send_message auto-resolved by the same fix; message_thread_id=0 collapse aligns backend with frontend; _coerce_telegram_thread_id rejects bool input).
- Live smoke: backend restarts cleanly, no errors in startup log.
2026-05-28 — Holistic code-reviewer pass over the full session diff (Features 1+2+4+7) caught a real HIGH that the per-feature Python-narrow reviews missed: summarize_dispatch_results in Feature 2 was reading the wrong dict shape. The dispatcher's _aggregate_results wraps per-receiver dicts under result["results"] and renames the Telegram media counts to media_delivered_count / media_skipped_count / media_failed_count. The summarizer was reading the top-level delivered_count, which is always absent in production aggregated output — meaning the dispatch_summary.media block was silently zero / missing for every real dispatch, and the media_errors list never populated. The unit tests passed because they hand-constructed leaf-shaped dicts that masked the wrong-shape read. Fixed in dispatch_summary.py by drilling into result["results"] per-receiver leaves and preferring media_*_count field names with fallback to the top-level names. Receiver index added to media_errors entries when drilling. New integration tests in test_dispatch_summary.py use the real dispatcher envelope so a future shape regression fails loudly. Also addressed MEDIUM findings: attach_summary_in_place / record_dispatch_summary_async now skip when a caller has pre-set dispatch_summary (mirrors the "caller wins" rule in enrich_details_with_correlation); ReceiverSection.svelte props for the Telegram options panel are now optional + gated internally so the component stays portable; TS type for editingReceiverOptions.message_thread_id is number | '' with proper coercion in openEditReceiver. 294/294 server tests green; backend restarts clean.
2026-05-28 — Item #5 NOT shipped. Reason: Immich has no outbound webhook feature. The closest thing is POST /sync/stream (a server-streaming sync API designed for first-party Immich clients), and adopting it would (a) take 1-2 days of new subscription-manager infrastructure, (b) couple us to an API with no third-party stability contract, and (c) deliver 5-min → sub-second latency on photo notifications which is rarely critical. If someone later actually needs lower latency, dropping the default scan_interval is a 5-minute alternative that gets 80% of the win for 1% of the cost. Skipped in favour of #7.
2026-05-28 — Item #7 shipped. Summary of the change:
- New service module services/diagnostic_mode.py with set_diagnostic / revert_diagnostic / revert_all / list_active. State is in-memory only — restart wipes overrides (setup_logging re-applies the DB baseline at boot). Modules go through an allowlist (notify_bridge_*, sqlalchemy, aiohttp, apscheduler, urllib3, httpx, httpcore, asyncio, PIL, uvicorn, starlette, fastapi) so a button press can't flip root. Duration clamped to [1, 240] minutes. Baseline derivation walks the dotted parents so sqlalchemy.engine.Engine correctly inherits sqlalchemy.engine → WARNING rather than falling through to root.
- 3 new admin-only endpoints under /api/settings/diagnostic-mode in api/app_settings.py: GET (list active), POST (activate, 400 on invalid input), DELETE /{module:path} (manual revert, 404 if not active).
- Auto-revert uses APScheduler's date trigger with misfire_grace_time=60, falling back to a strongly-referenced asyncio task (stored in a module-level set with add_done_callback(discard)) when the scheduler isn't running. _expire_callback re-reads log_levels from the DB at fire time, so an admin who edits overrides mid-window sees the new baseline restored — not a stale snapshot.
- revert_all is wired into the FastAPI lifespan shutdown in main.py so a clean stop / hot-reload leaves the world tidy.
- New frontend DiagnosticsCassette.svelte sits below LoggingCassette in the settings page. Quick-pick module dropdown + custom-text fallback, duration chip group (5m / 15m / 30m / 1h / 2h), Activate button. Active list with countdown updated by a 1s ticker; resyncs from the backend every 30s based on elapsed time (not modulo-of-now, which the prior version had wrong). Manual revert via undo-icon button on each row.
- 15 new i18n keys in en.json / ru.json.
- 20 new tests in test_diagnostic_mode.py — service-module unit tests + 4 FastAPI smoke tests via dependency_overrides[require_admin] exercising the router / path converter / HTTPException paths. All 290 server tests green.
- Reviewed by python-reviewer subagent (no CRITICAL; 3 HIGH + 3 MEDIUM addressed: fallback task retention in a module-level set to prevent GC; prefix-walk for _baseline_for so sub-loggers inherit parent defaults; revert_all wired into lifespan shutdown; list_active now sweeps expired entries; DB log_levels re-read at revert time instead of snapshot at activation; frontend resync uses elapsed time. LOW items addressed: scheduler-unavailable paths log at DEBUG instead of silently passing; test cleanup of dead _MIN_DURATION_MINUTES mutation).
- Live smoke: backend restarts cleanly, no errors in startup log.

24 KiB Raw Blame History Unescape Escape