Operability:
- Correlation IDs end-to-end: shared dispatch_id between log lines and EventLog rows (event/watcher/scheduled/deferred/action/HA/command paths) and a new X-Request-Id middleware that normalizes inbound ids and binds request_id into log context.
- dispatch_summary block merged into EventLog.details: per-target success/failure counts plus Telegram media delivered/skipped/failed and truncated error lists, so partial outcomes surface in the UI.
- Diagnostic mode: admin can flip one module to DEBUG for a bounded window with auto-revert (in-memory only; setup_logging() resets on boot, lifespan reverts on shutdown). New /diagnostic-mode endpoints plus DiagnosticsCassette UI on the settings page.

Telegram:
- Per-receiver options: disable_notification (silent send) and message_thread_id (forum-topic routing), wired through the dispatcher via a ContextVar so all four send sites (sendMessage / sendPhoto-Video-Document / sendMediaGroup / cache-hit POST) pick them up.
- send_large_videos_as_documents target setting: bypass the 50 MB sendVideo cap by falling back to sendDocument for oversized videos.
- sendMediaGroup byte-budget enforcement (TELEGRAM_MAX_GROUP_TOTAL_BYTES, 45 MB) with per-item fallback on chunk failure so a stale file_id no longer silently drops a cached asset.

Tests:
- New: diagnostic_mode, dispatch_summary, request_correlation, telegram_media_group_partial, telegram_per_send_options.

Docs:
- .claude/reviews/: six-axis production-readiness review of v0.8.1.
- .claude/docs/functional-review-2026-05-28.md: focused review of Telegram/Immich/logging subsystems.
Functional Review — Telegram, Immich, Logging (2026-05-28)
Snapshot review of three subsystems, with prioritised improvement candidates. Pairs with feature-backlog.md — items here are infrastructure that unlocks several backlog features.
All citations are from the working tree at commit 85a8f1e (master). Two
files (packages/core/src/notify_bridge_core/notifications/telegram/client.py,
media.py) had uncommitted changes at review time — see Telegram §
"In-flight work".
1. Telegram infrastructure
Telegram — what works well
- Single chokepoint: TelegramClient (packages/core/src/notify_bridge_core/notifications/telegram/client.py) covers text/photo/video/document/media-group, with 429-aware retry, parse-error retry, file_id cache, multi-bot per-token instances, polling + webhook modes, and bot-command registration.
- CLAUDE.md rule #6 satisfied for the production paths.
- Caption length, group sizing, parse-mode fallback all enforced.
In-flight work
Byte-budget sub-chunking for media groups (TELEGRAM_MAX_GROUP_TOTAL_BYTES in media.py) with per-item fallback inside _send_media_group. Logic is coherent; before commit, verify that _build_media_items callers still match the new signature (caption no longer injected at fetch time).
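
The byte-budget sub-chunking described above can be pictured as a greedy packer. The following is a minimal sketch, not the code in media.py: `MediaItem` and `chunk_by_budget` are hypothetical names, and the 45 MB value is assumed to mean 45 * 1024 * 1024 bytes. The 10-item cap is the Bot API limit for a media group.

```python
from dataclasses import dataclass

# Assumed limits: 10 items per media group is the Bot API cap; the byte
# budget stands in for TELEGRAM_MAX_GROUP_TOTAL_BYTES (value assumed).
MAX_GROUP_ITEMS = 10
MAX_GROUP_TOTAL_BYTES = 45 * 1024 * 1024


@dataclass(frozen=True)
class MediaItem:
    """Hypothetical stand-in for a fetched asset."""
    name: str
    size_bytes: int


def chunk_by_budget(items, max_items=MAX_GROUP_ITEMS, max_bytes=MAX_GROUP_TOTAL_BYTES):
    """Greedily pack items into ordered chunks that respect both limits."""
    chunks, current, current_bytes = [], [], 0
    for item in items:
        over_count = len(current) >= max_items
        over_bytes = bool(current) and current_bytes + item.size_bytes > max_bytes
        if over_count or over_bytes:
            chunks.append(current)
            current, current_bytes = [], 0
        # An item larger than the whole budget still lands in a chunk of its
        # own; the caller decides whether to send it individually or skip it.
        current.append(item)
        current_bytes += item.size_bytes
    if current:
        chunks.append(current)
    return chunks
```

A chunk of one item cannot be sent as a media group, so a caller would presumably route single-item chunks through the individual send path, which is also where a per-item fallback would go after a chunk failure.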
Gaps, ranked by user-visible value
- No inline keyboards / callback_query handlers: zero infra for "Favorite / Archive / Dismiss" buttons on Immich notifications. Biggest UX unlock; prerequisite for several Immich smart actions.
- No edit-in-place (editMessageText not wrapped). Pairs naturally with deferred dispatch / quiet hours coalescing: 5 separate "asset added" messages become 1 edited message.
- disable_notification (silent send) not exposed: already a Telegram primitive; slots into the quiet-hours silent mode the backlog already mentions.
- message_thread_id (forum topics): single field per receiver; unblocks supergroup-with-topics users.
- Direct TelegramClient(...) constructions in api/telegram_bots.py:314,394,404,412 bypass get_telegram_client(): violates CLAUDE.md rule #6 and skips the shared file_id cache.
- Per-command authorization: commands_enabled is all-or-nothing per chat; no per-command allowlist or admin gate.
- Long-message splitting: send_message silently truncates at 4096 (client.py:492).
- No parse-mode per target: HTML hardcoded.
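
For the long-message gap, one possible shape for a splitter is sketched below. `split_message` is a hypothetical helper, not an existing function in client.py. It counts Python characters, which may not match how Telegram measures length, and it does not protect HTML tags that straddle a cut, so it would need extra care with the hardcoded HTML parse mode.

```python
TELEGRAM_MAX_MESSAGE_CHARS = 4096  # the limit cited for send_message above


def split_message(text, limit=TELEGRAM_MAX_MESSAGE_CHARS):
    """Split text into parts of at most `limit` characters.

    Prefers the last newline inside the window, then the last space, and
    only hard-cuts when neither exists.
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    parts = []
    remaining = text
    while len(remaining) > limit:
        window = remaining[:limit]
        cut = window.rfind("\n")
        if cut <= 0:
            cut = window.rfind(" ")
        if cut <= 0:
            cut = limit  # no natural break point: hard cut
        parts.append(remaining[:cut])
        if cut < limit:
            # Drop the separator we split on so it does not lead the next part.
            remaining = remaining[cut:].lstrip("\n ")
        else:
            remaining = remaining[cut:]
    if remaining:
        parts.append(remaining)
    return parts
```

Sending the parts in order as separate messages would replace the silent truncation with visible continuation messages.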
2. Immich
Immich — what works well
- Mature polling pipeline: incremental delta-fetch via updatedAfter, pending-asset tracking, fingerprint fast-path skip, fallback to full fetch on count-decrease (providers/immich/provider.py).
- Rich bot commands (status / albums / events / people / search / latest / random / favorites / summary / memory) with full asset context (CLAUDE.md rule #10 satisfied).
- auto_organize action is well-shaped: AND person + smart-query union, exclusions, type/date/favorite filters, 500-asset batched add, idempotent diff against album asset_ids, dry-run, ActionExecution log.
- Three scheduled features wired: periodic summaries, scheduled-asset delivery, Memory/On-This-Day (with native Immich memory API + fallback).
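
The "idempotent diff plus 500-asset batches" behaviour credited to auto_organize above can be illustrated with a small pure function. This is a sketch under assumptions: `plan_album_adds` is a hypothetical name and the real executor may structure the work differently.

```python
BATCH_SIZE = 500  # batch size cited for auto_organize above


def plan_album_adds(candidate_ids, album_asset_ids, batch_size=BATCH_SIZE):
    """Return batches of asset ids that still need adding to the album.

    Ids already in the album are skipped, and duplicates among the
    candidates are collapsed, so re-running with the same input after a
    successful add yields no batches.
    """
    existing = set(album_asset_ids)
    seen = set()
    to_add = []
    for asset_id in candidate_ids:
        if asset_id in existing or asset_id in seen:
            continue
        seen.add(asset_id)
        to_add.append(asset_id)
    return [to_add[i:i + batch_size] for i in range(0, len(to_add), batch_size)]
```

A dry run could report the flattened plan without issuing any add call, which is one way to show the asset list rather than only counts.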
Highest-leverage candidates
- Webhook ingestion: webhook_based=False at capabilities.py:46. Sub-second latency vs the current 5-min poll. New /api/webhooks/immich/{secret} route + parser + capability flip.
- Share-link expiry monitoring + auto-rotate action: links silently break today; data is already fetched per event (provider.py:541-569).
- Duplicate cluster digest: Immich >= 1.100 /api/duplicates is unused; pairs with inline buttons for "merge / ignore 30d".
- Auto-favorite by person (already in backlog): smallest delta on the existing auto_organize executor.
- Per-person notification subscription: tracker-config filter, reuses existing asset.people data.
- Album auto-curation from Inbox: date-based target album name, move (not copy); needs the Immich move endpoint (currently we only add_assets_to_album).
- Storage / job-queue alerts: /api/server/stats and /api/jobs unused; lightweight poll + threshold = "disk full" / "transcoding stalled" notifications.
- Smart-action infra polish: descriptors are reusable, but the rule editor is JSON-shaped, action-run statistics aren't aggregated, and dry-run shows counts not the asset list. Address before adding 5 more action types.
3. Logging
What's already in place
In logging_setup.py:
- dictConfig with JsonFormatter (line-delimited JSON) toggleable via NOTIFY_BRIDGE_LOG_FORMAT=json.
- SecretMaskingFilter redacts Telegram bot tokens + Authorization / api_key / password / refresh_token across msg, exc_text, stack_info.
- ContextVar-driven record factory injects request_id, command, chat_id, bot_id, dispatch_id on every record. Text format: [req=- cmd=- bot=- chat=- disp=-].
- Per-module overrides via NOTIFY_BRIDGE_LOG_LEVELS env or DB AppSetting. Live runtime patch via apply_log_levels(), no restart.
- Noisy libs pre-quieted (sqlalchemy, aiohttp, apscheduler, urllib3, asyncio, httpx, httpcore, PIL, uvicorn.access).
Plus:
- EventLog table with structured rows (event_type, status, assets_count, details JSON, FKs to tracker/provider/action/command_tracker/bot), event_log_retention_days=30 default, daily APScheduler cleanup _cleanup_old_events (scheduler.py:332).
- Prometheus counter notify_bridge_event_log_total{status,event_type}.
- Frontend viewer with filters at api/status.py.
- bind_log_context actually used in: dispatcher (dispatch_id), telegram_poller (bot/chat/command/request_id), webhook commands.
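
The ContextVar-driven record factory listed above follows a standard-library pattern that can be sketched briefly. This is an illustration, not the code in logging_setup.py: the `CONTEXT_VARS` mapping, `install_context_factory` and `LOG_FORMAT` are assumed names, and only the five field names and the text layout come from the review.

```python
import contextvars
import logging

# One ContextVar per field; "-" mirrors the placeholder in the text format.
CONTEXT_VARS = {
    name: contextvars.ContextVar(name, default="-")
    for name in ("request_id", "command", "chat_id", "bot_id", "dispatch_id")
}

LOG_FORMAT = (
    "[req=%(request_id)s cmd=%(command)s bot=%(bot_id)s "
    "chat=%(chat_id)s disp=%(dispatch_id)s] %(message)s"
)

_base_factory = logging.getLogRecordFactory()


def _context_record_factory(*args, **kwargs):
    """Build the record as usual, then stamp the current context onto it."""
    record = _base_factory(*args, **kwargs)
    for name, var in CONTEXT_VARS.items():
        setattr(record, name, var.get())
    return record


def install_context_factory():
    """Make every LogRecord created from now on carry the context fields."""
    logging.setLogRecordFactory(_context_record_factory)
```

Because the values are read when the record is created, each asyncio task or thread sees its own bindings, and formatters can reference the fields without any per-call `extra=` argument.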
Gaps, ordered by debug-pain payoff
- No FastAPI request-ID middleware. request_id_var is set only in webhook + Telegram poller paths. Every REST call from the SPA logs as req=-. Tiny middleware (read X-Request-Id or uuid4(), bind context, echo header) closes this whole-app blind spot.
- dispatch_id is in log lines but NOT persisted on the EventLog row. Means you can find the failed row in the UI but can't grep stderr for the matching disp=... value. Stash it in details.dispatch_id (no migration needed); biggest cross-surface correlation win.
- HTTP access log is uvicorn default (access_log=not _cfg.debug at main.py:419). Doesn't include request_id, latency, user, status as structured fields. Replace with a small RequestLoggerMiddleware that emits method, path, status, latency_ms, request_id.
- Telegram media-group failures log richly but aren't linked to the resulting EventLog row. The dispatcher result-aggregation work in flight is the right place to dump errors[] into EventLog.details.errors.
- In-browser log access is missing. EventLog rows are visible, but raw logger output requires container/SSH access. A bounded in-memory ring-buffer endpoint (admin-only, last N lines, filtered by context fields) would mean ~90% of triage stays in the UI.
- No "diagnostic mode" UI. The runtime apply_log_levels() is great but only reachable through the app-settings JSON editor. A "Debug for 15 minutes: notify_bridge_core.notifications.telegram.client" button with auto-revert is a few-hours job.
- EventLog.details is freeform. Frontend already destructures dispatch_status, deferred_until, deferred_for_seconds, original_event_log_id (types.ts:238-261). Define a typed EventLogDetails per event_type (Pydantic at the boundary); prevents drift between providers.
- No log rotation: StreamHandler(sys.stderr) only. Fine in containers, brittle on bare-metal. Optional RotatingFileHandler opt-in via env.
- No slow-query / outbound-HTTP timing logs. sqlalchemy.engine=WARNING by default; no per-query duration log. Same for outbound calls to Immich / Telegram. A "duration_ms >= N" threshold logger would surface "why is this dispatch slow" without flipping global DEBUG.
- Action dry-run output is logger-only. Could be streamed into the action editor.
- Poll-result not persisted. Webhook payloads are logged (api/webhook_logs.py), but Immich/Google-Photos poll cycles emit no "last poll: 0 changes / 245ms" row. A lightweight provider_poll_log (small table or ring buffer) would answer "is the poller actually running" without reading stderr.
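
The bounded in-memory ring buffer suggested for in-browser log access could be built on a standard logging handler. The sketch below is one possible shape, with `RingBufferHandler` and `tail` as hypothetical names; an admin-only endpoint would then serialise the output of `tail`.

```python
import logging
from collections import deque


class RingBufferHandler(logging.Handler):
    """Keep the most recent log records in memory, oldest evicted first."""

    def __init__(self, capacity=1000):
        super().__init__()
        self._records = deque(maxlen=capacity)

    def emit(self, record):
        # Store a small dict rather than the record itself so the buffer
        # holds no references to exception or stack objects.
        self._records.append({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "request_id": getattr(record, "request_id", None),
            "dispatch_id": getattr(record, "dispatch_id", None),
        })

    def tail(self, n=100, **filters):
        """Return the last n rows whose fields equal every given filter."""
        self.acquire()  # same lock that guards emit()
        try:
            rows = list(self._records)
        finally:
            self.release()
        for key, wanted in filters.items():
            rows = [row for row in rows if row.get(key) == wanted]
        return rows[-n:] if n > 0 else []
```

Memory use is bounded by `capacity`, and filtering by `request_id` or `dispatch_id` only works if those attributes are already stamped onto records, as the existing record factory does.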
Recommended sequencing
| # | Item | Status | Why first |
|---|---|---|---|
| 1 | Request-ID middleware + persist dispatch_id on EventLog | SHIPPED 2026-05-28 | Unlocks the rest of the debug story; ~2 hours combined |
| 2 | Finish in-flight Telegram byte-budget chunking + write errors[] into EventLog.details | SHIPPED 2026-05-28 | Already half-done; aligns with #1 |
| 3 | Telegram inline keyboards + callback_query handler | not started | Prereq for several Immich smart actions |
| 4 | Telegram disable_notification + message_thread_id per target | SHIPPED 2026-05-28 | Small, also feeds the open Quiet Hours v1 backlog item |
| 5 | Immich webhook ingestion | not started | 5-min → sub-second; biggest user-facing latency win |
| 6 | Immich share-link expiry + auto-rotate (using #3) | not started | Real silent-breakage today |
| 7 | Diagnostic-mode UI (live log-level toggle with auto-revert) | SHIPPED 2026-05-28 | Shifts triage to the browser |
| 8 | Immich duplicate digest + auto-favorite by person | not started | Both ride on #3 |
Items 1–4 are infrastructure that unlocks 5–8. Items 1, 2 and 4 also smooth the way for the Quiet Hours v1 / target-level windows feature that is top of the backlog; they are worth landing before that feature so quiet hours can dispatch through edited messages and silent sends from day one.
Decision log
- 2026-05-28: Review completed. Starting work on item #1 (request-id middleware + persist dispatch_id on EventLog).
- 2026-05-28: Item #1 shipped. Summary of the change:
  - New helpers in packages/core/src/notify_bridge_core/log_context.py: ensure_dispatch_id() (reuse existing or mint a new disp:<12 hex>) and enrich_details_with_correlation(details) (shallow-copy a details dict and merge active dispatch_id / request_id from the ContextVar snapshot).
  - New RequestContextMiddleware in packages/server/src/notify_bridge_server/main.py that reads inbound X-Request-Id (charset/length validated, with ":" excluded so a client can't masquerade as a server-minted id), falls back to req:<12 hex>, binds the value via bind_log_context, and echoes it back as the response header. Added LAST so it's the outermost middleware.
  - Outer entry points now bind a dispatch_id via a thin wrapper function (check_tracker, dispatch_provider_event, dispatch_scheduled_for_tracker, _process_row, run_action). All 10 EventLog(...) creation sites wrap their details= payload in enrich_details_with_correlation(...).
  - Switched NotificationDispatcher.dispatch to use ensure_dispatch_id() instead of inline uuid.uuid4().
  - New tests in packages/server/tests/test_request_correlation.py (12 tests) covering header echo, charset validation, prefix-masquerade rejection, helper merge semantics. All 239 server tests green.
  - Reviewed by python-reviewer subagent (no CRITICAL/HIGH; 3 MEDIUM and 1 LOW addressed: PEP 8 imports moved to top of main.py; RequestResponseEndpoint type added to dispatch; ":" dropped from the request-id charset; shallow-copy caveat documented).
  - Live smoke verified: generated id req:a9b9821f5aab on plain request; safe inbound my-trace-abc123 echoed unchanged; disp:fake12345678 correctly replaced; watcher tick log lines now show distinct disp=disp:<hex> per tracker check.
- 2026-05-28: Item #2 shipped. Summary of the change:
  - Confirmed the in-flight Telegram byte-budget media-group chunking in telegram/client.py is complete (15/15 media-group tests pass). Deleted the now-unused split_media_by_upload_size() from telegram/media.py.
  - New module services/dispatch_summary.py with summarize_dispatch_results() (aggregator), attach_summary_in_place() (in-session) and record_dispatch_summary_async() (post-commit). Captures targets_attempted/succeeded/failed, per-target errors, media-group media{delivered,skipped,failed} counts and media_errors[] from the new TelegramClient._send_media_group partial-failure path. Bounded: 20 errors / 20 media errors / 500-char message cap with explicit …[truncated] marker.
  - Wired at 4 dispatch sites:
    - event_dispatch.py: accumulates per-target results across all tracking-config groups, attaches summary in-session before commit.
    - deferred_dispatch.py: inlines summary into the new EventLog row's details for both delivered_after_quiet_hours and deferred_then_failed paths.
    - scheduled_dispatch.py: inlines summary into the cron-fire EventLog row's details.
    - watcher.py: follow-up record_dispatch_summary_async in a fresh session because the EventLog row was committed before dispatch.
  - Frontend type drift fixed: types.ts gets new DispatchSummary, DispatchSummaryError, DispatchSummaryMediaError interfaces plus dispatch_id / request_id / dispatch_summary keys on EventLog.details.
  - New tests in tests/test_dispatch_summary.py (10 tests): empty/all-success/mixed/media-counts/sub-errors/truncation/long-message-trim/in-place attach/no-results no-op/malformed sub-error. All 249 server tests green.
  - Reviewed by python-reviewer subagent (no CRITICAL; 2 HIGH + 3 MEDIUM addressed: asyncio.CancelledError re-raise in the best-effort catch; late from .dispatch_summary import … calls hoisted to top of each file; empty-results contract changed from "zero-count summary attached" to "no key written"; truncation marker upgraded to …[truncated] for operator clarity; flag_modified comment tightened).
  - Live smoke: backend restarts cleanly, watcher tick log lines continue showing disp=disp:<hex> correlation, no startup errors.
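
The bounding rules in the item #2 entry above (20 errors, 500-character messages, explicit marker, no key for empty results) can be sketched as follows. This is not the code in dispatch_summary.py: the `success` and `error` keys and the `target_index` field are hypothetical, and the per-receiver `results` envelope with `media_*_count` names follows the shape described in the holistic-review entry of this log.

```python
MAX_ERRORS = 20
MAX_MESSAGE_CHARS = 500
TRUNCATION_MARKER = "…[truncated]"


def _trim(message, limit=MAX_MESSAGE_CHARS):
    """Cap a message and make the cut visible to the operator."""
    text = str(message)
    if len(text) <= limit:
        return text
    return text[:limit] + TRUNCATION_MARKER


def summarize_dispatch_results(results):
    """Aggregate per-target results; return None when there is nothing to say."""
    if not results:
        return None  # empty-results contract: the caller writes no key at all
    summary = {
        "targets_attempted": 0,
        "targets_succeeded": 0,
        "targets_failed": 0,
        "errors": [],
        "media": {"delivered": 0, "skipped": 0, "failed": 0},
    }
    for index, result in enumerate(results):
        summary["targets_attempted"] += 1
        if result.get("success"):  # hypothetical key
            summary["targets_succeeded"] += 1
        else:
            summary["targets_failed"] += 1
            if len(summary["errors"]) < MAX_ERRORS:
                message = _trim(result.get("error", "unknown error"))
                summary["errors"].append({"target_index": index, "message": message})
        # Drill into per-receiver leaves; prefer media_*_count, fall back
        # to the bare *_count names.
        for leaf in result.get("results", []):
            for key in summary["media"]:
                count = leaf.get(f"media_{key}_count", leaf.get(f"{key}_count", 0))
                summary["media"][key] += int(count or 0)
    return summary
```

Feeding the function the same envelope the dispatcher produces, rather than hand-built leaf dicts, is what keeps a shape mismatch from passing unnoticed in tests.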
- 2026-05-28: Item #4 shipped. Summary of the change:
  - TelegramReceiver dataclass in receiver.py gains disable_notification: bool = False and message_thread_id: int | None = None. New _coerce_telegram_thread_id helper collapses Telegram's "general topic" sentinels (0, negatives, blanks, bools) to None so the Bot API just omits the field; this matches the frontend's <= 0 → unset behaviour.
  - TelegramClient (client.py) gets a frozen _SendOptions + _send_options_var ContextVar pattern for the deep media paths (_upload_media, _post_media_group, _send_from_cache) that can't easily plumb kwargs through. send_notification binds the var; the 3 deep builders read it via _apply_send_opts_to_payload / _apply_send_opts_to_form. send_message is a leaf and just inlines its kwargs into the JSON body directly (no ContextVar needed there).
  - Dispatcher (dispatcher.py) passes receiver.disable_notification / receiver.message_thread_id into client.send_message(...) and client.send_notification(...).
  - Frontend: new inline per-Telegram-receiver options panel in ReceiverSection.svelte triggered by a cog icon. Silent + thread-id indicators (bell-off icon, #N badge) on the row when set. +page.svelte handlers PUT the merged config to /api/targets/{id}/receivers/{rid}. 5 new i18n keys in en.json / ru.json.
  - New tests in test_telegram_per_send_options.py (19 tests): factory + thread-id coercion table (including bool rejection and 0/negative collapse), payload/form helper merge semantics, bind/reset under exceptions, concurrent-task isolation via asyncio.gather, end-to-end send_message payload assertions. All 270 server tests green.
  - Reviewed by python-reviewer subagent (no CRITICAL; 2 HIGH + 1 MEDIUM + 1 LOW addressed: dead ContextVar bind in send_message removed in favor of inline kwarg injection; re-entrant bind from send_notification → send_message auto-resolved by the same fix; message_thread_id=0 collapse aligns backend with frontend; _coerce_telegram_thread_id rejects bool input).
  - Live smoke: backend restarts cleanly, no errors in startup log.
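
The two mechanisms in the item #4 entry above, thread-id coercion and ContextVar-scoped send options, can be sketched together. The names below are illustrative variants of the private helpers mentioned (`coerce_thread_id`, `SendOptions`, `bind_send_options`, `apply_send_opts_to_payload`), and the bodies are assumptions. Only the Bot API field names `disable_notification` and `message_thread_id` are taken as given.

```python
import contextvars
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Optional


def coerce_thread_id(value):
    """Collapse 0, negatives, blanks, bools and junk to None."""
    if value is None or isinstance(value, bool):
        return None  # bool is an int subclass, so reject it explicitly
    if isinstance(value, str):
        value = value.strip()
        if not value:
            return None
    try:
        number = int(value)
    except (TypeError, ValueError):
        return None
    return number if number > 0 else None


@dataclass(frozen=True)
class SendOptions:
    disable_notification: bool = False
    message_thread_id: Optional[int] = None


_send_options_var = contextvars.ContextVar("send_options", default=SendOptions())


@contextmanager
def bind_send_options(options):
    """Bind options for the duration of one send, even if it raises."""
    token = _send_options_var.set(options)
    try:
        yield
    finally:
        _send_options_var.reset(token)


def apply_send_opts_to_payload(payload):
    """Return a copy of payload with the active options merged in."""
    opts = _send_options_var.get()
    merged = dict(payload)
    if opts.disable_notification:
        merged["disable_notification"] = True
    if opts.message_thread_id is not None:
        merged["message_thread_id"] = opts.message_thread_id
    return merged
```

Each asyncio task runs in its own copy of the context, so concurrent sends bound to different options do not see each other's values, which is the isolation property the new tests check with asyncio.gather.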
- 2026-05-28: Holistic code-reviewer pass over the full session diff (Features 1+2+4+7) caught a real HIGH that the per-feature Python-narrow reviews missed: summarize_dispatch_results in Feature 2 was reading the wrong dict shape. The dispatcher's _aggregate_results wraps per-receiver dicts under result["results"] and renames the Telegram media counts to media_delivered_count / media_skipped_count / media_failed_count. The summarizer was reading the top-level delivered_count, which is always absent in production aggregated output, meaning the dispatch_summary.media block was silently zero / missing for every real dispatch, and the media_errors list never populated. The unit tests passed because they hand-constructed leaf-shaped dicts that masked the wrong-shape read. Fixed in dispatch_summary.py by drilling into result["results"] per-receiver leaves and preferring media_*_count field names with fallback to the top-level names. Receiver index added to media_errors entries when drilling. New integration tests in test_dispatch_summary.py use the real dispatcher envelope so a future shape regression fails loudly. Also addressed MEDIUM findings: attach_summary_in_place / record_dispatch_summary_async now skip when a caller has pre-set dispatch_summary (mirrors the "caller wins" rule in enrich_details_with_correlation); ReceiverSection.svelte props for the Telegram options panel are now optional + gated internally so the component stays portable; TS type for editingReceiverOptions.message_thread_id is number | '' with proper coercion in openEditReceiver. 294/294 server tests green; backend restarts clean.
- 2026-05-28: Item #5 NOT shipped. Reason: Immich has no outbound webhook feature. The closest thing is POST /sync/stream (a server-streaming sync API designed for first-party Immich clients), and adopting it would (a) take 1-2 days of new subscription-manager infrastructure, (b) couple us to an API with no third-party stability contract, and (c) deliver 5-min → sub-second latency on photo notifications which is rarely critical. If someone later actually needs lower latency, dropping the default scan_interval is a 5-minute alternative that gets 80% of the win for 1% of the cost. Skipped in favour of #7.
- 2026-05-28: Item #7 shipped. Summary of the change:
  - New service module services/diagnostic_mode.py with set_diagnostic / revert_diagnostic / revert_all / list_active. State is in-memory only: restart wipes overrides (setup_logging re-applies the DB baseline at boot). Modules go through an allowlist (notify_bridge_*, sqlalchemy, aiohttp, apscheduler, urllib3, httpx, httpcore, asyncio, PIL, uvicorn, starlette, fastapi) so a button press can't flip root. Duration clamped to [1, 240] minutes. Baseline derivation walks the dotted parents so sqlalchemy.engine.Engine correctly inherits sqlalchemy.engine → WARNING rather than falling through to root.
  - 3 new admin-only endpoints under /api/settings/diagnostic-mode in api/app_settings.py: GET (list active), POST (activate, 400 on invalid input), DELETE /{module:path} (manual revert, 404 if not active).
  - Auto-revert uses APScheduler's date trigger with misfire_grace_time=60, falling back to a strongly-referenced asyncio task (stored in a module-level set with add_done_callback(discard)) when the scheduler isn't running. _expire_callback re-reads log_levels from the DB at fire time, so an admin who edits overrides mid-window sees the new baseline restored, not a stale snapshot.
  - revert_all is wired into the FastAPI lifespan shutdown in main.py so a clean stop / hot-reload leaves the world tidy.
  - New frontend DiagnosticsCassette.svelte sits below LoggingCassette in the settings page. Quick-pick module dropdown + custom-text fallback, duration chip group (5m / 15m / 30m / 1h / 2h), Activate button. Active list with countdown updated by a 1s ticker; resyncs from the backend every 30s based on elapsed time (not modulo-of-now, which the prior version had wrong). Manual revert via undo-icon button on each row.
  - 15 new i18n keys in en.json / ru.json.
  - 20 new tests in test_diagnostic_mode.py: service-module unit tests + 4 FastAPI smoke tests via dependency_overrides[require_admin] exercising the router / path converter / HTTPException paths. All 290 server tests green.
  - Reviewed by python-reviewer subagent (no CRITICAL; 3 HIGH + 3 MEDIUM addressed: fallback task retention in a module-level set to prevent GC; prefix-walk for _baseline_for so sub-loggers inherit parent defaults; revert_all wired into lifespan shutdown; list_active now sweeps expired entries; DB log_levels re-read at revert time instead of snapshot at activation; frontend resync uses elapsed time. LOW items addressed: scheduler-unavailable paths log at DEBUG instead of silently passing; test cleanup of dead _MIN_DURATION_MINUTES mutation).
  - Live smoke: backend restarts cleanly, no errors in startup log.
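
The three guard rails in the item #7 entry above (allowlist, duration clamp, dotted-parent baseline) are small enough to sketch as pure functions. This is an illustration under assumptions, not the code in diagnostic_mode.py: `is_allowed`, `clamp_minutes` and `baseline_for` are hypothetical names, the allowlist is matched on the first dotted segment, and the root default of INFO is assumed.

```python
ALLOWED_LIBRARIES = frozenset({
    "sqlalchemy", "aiohttp", "apscheduler", "urllib3", "httpx", "httpcore",
    "asyncio", "PIL", "uvicorn", "starlette", "fastapi",
})
MIN_MINUTES, MAX_MINUTES = 1, 240


def is_allowed(module):
    """Accept notify_bridge_* packages and listed libraries, never root."""
    top = module.split(".", 1)[0]
    return top.startswith("notify_bridge_") or top in ALLOWED_LIBRARIES


def clamp_minutes(minutes):
    """Force the requested window into [1, 240] minutes."""
    return max(MIN_MINUTES, min(MAX_MINUTES, int(minutes)))


def baseline_for(module, overrides, root_level="INFO"):
    """Find the level to restore by walking up the dotted logger name.

    The nearest configured ancestor wins, so a sub-logger inherits its
    parent's override instead of falling through to the root level.
    """
    name = module
    while name:
        if name in overrides:
            return overrides[name]
        name = name.rpartition(".")[0]  # "a.b.c" -> "a.b" -> "a" -> ""
    return root_level
```

With `overrides = {"sqlalchemy.engine": "WARNING"}`, reverting `sqlalchemy.engine.Engine` restores WARNING, while a module with no configured ancestor returns to the root level.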