feat(logging): production-grade logging with context vars, secret masking, and runtime level control

Boot-time logging was a three-line basicConfig stub with no timestamps,
no correlation, and silent drops at every layer of the Telegram send
path: a /random command that delivered text but no media left zero
evidence in the log. This replaces the setup and closes every silent
drop encountered end-to-end.

New infrastructure:
- notify_bridge_core.log_context:
  request_id/command/chat_id/bot_id/dispatch_id ContextVars with a
  bind_log_context() context manager so deep call sites
  (TelegramClient, NotificationDispatcher) inherit the correlation tag
  without threading args through.
- notify_bridge_server.logging_setup: dictConfig-based setup with a
  LogRecordFactory that tags every record, a SecretMaskingFilter that
  redacts /botN:TOKEN plus Authorization/x-api-key/password/secret in
  messages AND tracebacks, a JSON formatter for aggregators, text
  formatter with grep-friendly
  [req=... cmd=... bot=... chat=... disp=...] prefix, and default
  dampening for sqlalchemy/aiohttp/apscheduler/urllib3/PIL.

Runtime control:
- NOTIFY_BRIDGE_LOG_LEVEL / _FORMAT / _LEVELS env vars (boot).
- DB-backed log_level / log_format / log_levels AppSettings, applied on
  boot after migrations and live via apply_log_levels() when edited in
  the settings UI (format still requires restart, logs a WARN).
- Frontend settings page gains a Logging card (level dropdown, format
  dropdown, per-module overrides); en/ru i18n keys added.

Call-site fixes (/random media-group blind spot and adjacent):
- TelegramClient._fetch_asset: every silent drop now WARN-logs with
  reason (missing url, HTTP non-200, size/dimension limits,
  ClientError).
- TelegramClient._send_media_group: WARN on "chunk had N items but 0
  usable", ERROR on sendMediaGroup non-ok/transport with full context;
  returns success=False + "no_items_delivered" instead of success=True
  with an empty message_ids list so callers can distinguish.
- TelegramClient.send_message / _upload_media / _send_from_cache: ERROR
  on non-ok + transport failures with status/code/desc; DEBUG for
  cache-hit fallbacks.
- NotificationDispatcher.dispatch: generates a dispatch_id, binds it,
  logs start/finish with failure count, uses exc_info for target
  failures.
- commands/handler: missing/failed templates -> ERROR + exc_info;
  send_reply and send_media_group errors upgraded WARNING -> ERROR with
  chat/error_code context; rate-limit and truncation cases logged with
  full context.
- commands/webhook and services/telegram_poller:
  bind_log_context(request_id=tg:<update_id>, command, chat_id, bot_id),
  INFO on receive/dispatch/completion with duration, exc_info on raise,
  INFO when commands disabled.
- commands/immich: INFO when album scope is empty; WARN per asset
  dropped from media payload and a summary WARN when "N assets in, 0
  out".
2026-04-23 14:41:26 +03:00
parent 1f880daa0c
commit f50d465c0e
15 changed files with 831 additions and 63 deletions
@@ -108,13 +108,18 @@ def _render_cmd_template(
    """Render a locale-aware command template. Falls back to 'en'."""
    template_str = _resolve_template(templates, slot_name, locale)
    if not template_str:
-        _LOGGER.warning("No command template found for slot '%s' locale '%s'", slot_name, locale)
+        # Missing template = user sees "[No template: X]" — this is an ERROR,
+        # not a warning. Broken replies must stand out in production logs.
+        _LOGGER.error("No command template found for slot '%s' locale '%s'", slot_name, locale)
        return f"[No template: {slot_name}]"
    try:
        tmpl = _compile_template(template_str)
        return tmpl.render(**context)
-    except Exception as e:
-        _LOGGER.warning("Failed to render command template '%s': %s", slot_name, e)
+    except Exception:
+        _LOGGER.error(
+            "Failed to render command template '%s' locale=%s — user will see a broken reply",
+            slot_name, locale, exc_info=True,
+        )
        return f"[Template error: {slot_name}]"


@@ -296,6 +301,10 @@ async def handle_command(
    # Rate limit check (once per command, shared across all trackers)
    wait = _check_rate_limit(bot.id, chat_id, cmd, rate_limits)
    if wait is not None:
+        _LOGGER.info(
+            "Rate-limited /%s for bot=%d chat=%s — %ds cooldown remaining",
+            cmd, bot.id, chat_id, wait,
+        )
        text_resp = _render_cmd_template(merged_templates, "rate_limited", locale, {"wait": wait})
        return [CommandResponse(text=text_resp)]

@@ -322,8 +331,8 @@ async def handle_command(
    for tracker, config, provider, listener in ctx_tuples:
        if len(responses) >= _MAX_RESPONSES_PER_COMMAND:
            _LOGGER.warning(
-                "Truncated command responses at %d for bot %d cmd /%s",
-                _MAX_RESPONSES_PER_COMMAND, bot.id, cmd,
+                "Truncated command responses at %d for bot=%d chat=%s cmd=/%s (listener context size=%d)",
+                _MAX_RESPONSES_PER_COMMAND, bot.id, chat_id, cmd, len(ctx_tuples),
            )
            break

@@ -418,7 +427,12 @@ async def send_reply(
        disable_web_page_preview=True,
    )
    if not result.get("success"):
-        _LOGGER.warning("Telegram reply failed: %s", result.get("error"))
+        # User-visible failure: the bot's reply never reached the chat.
+        _LOGGER.error(
+            "Telegram reply failed (chat=%s reply_to=%s len=%d): code=%s error=%r",
+            chat_id, reply_to_message_id, len(text or ""),
+            result.get("error_code"), result.get("error"),
+        )


 async def send_media_group(
@@ -442,6 +456,14 @@ async def send_media_group(
    assets hit the cache and skip the re-upload.
    """
    if not media_items:
+        # This is what happened in the /random blind spot: the text reply
+        # was sent, but the media follow-up was silently skipped because
+        # the caller passed an empty media list. Surface it so we can see
+        # it in the log and correlate with the text message.
+        _LOGGER.warning(
+            "send_media_group called with 0 items (chat=%s reply_to=%s) — no media will be delivered",
+            chat_id, reply_to_message_id,
+        )
        return

    from ..services.telegram_send import send_telegram_media
@@ -452,7 +474,13 @@ async def send_media_group(
        chat_action=None,
    )
    if not result.get("success"):
-        _LOGGER.warning("Telegram media group failed: %s", result.get("error"))
+        # User-visible failure: media promised by the text reply never arrived.
+        _LOGGER.error(
+            "Telegram media group failed (chat=%s items=%d reply_to=%s): code=%s error=%r failed_at_chunk=%s",
+            chat_id, len(media_items), reply_to_message_id,
+            result.get("error_code"), result.get("error"),
+            result.get("failed_at_chunk"),
+        )


 async def register_commands_with_telegram(bot: TelegramBot) -> bool: