feat: production readiness — security, perf, bug fixes, bridge self-monitoring

Comprehensive multi-area pass driven by a parallel 8-agent production review. Frontend, backend, database, security, performance, operational, plus a new self-monitoring feature.

## Critical fixes
- Planka webhook: reads bounded raw body (was NameError on every call)
- HA quiet hours: ha_state_changed/automation_triggered/service_called/event_fired added to deferrable set (were silently dropped)
- DNS-rebinding SSRF: PinnedResolver wired into shared aiohttp session
- Telegram inbound webhook: secret now mandatory (401 without)
- Generic webhook: auth_mode="none" requires explicit acknowledge_unauthenticated=true; per-IP rate limit 60/min
- svelte-check: 5 null-narrowing errors in EventDetailModal fixed
- Provider hardcoding: Immich-only block extracted to descriptor featureDiscoveryHint
- command_sync: snapshot+expunge bot before exiting AsyncSession

## Bug fixes
- notifier asyncio.gather(return_exceptions=True): one bad chat no longer cancels peer sends
- NotificationDispatcher hoisted out of per-tracker loop
- Provider credential resolution unified across all 5 dispatch sites
- HA asyncio.shield now drains inner task on cancellation
- Provider construction switched from if/elif ladder to factory registry
- NUT first poll seeds silently (no spurious ups_on_battery)
- Quiet-hours gate: event-type-disabled now wins over deferral
- APScheduler drain job ID resolution upgraded to seconds
- HA on_status_change wired through to EventLog
- Webhook payload rollback failures now logged (not swallowed)
- Batched receivers/chats/bots in load_link_data (was per-target N+1)
- flag_modified on JSON column reassignments in deferred_dispatch

## Database
- UNIQUE indexes on service_provider.webhook_token, telegram_bot.webhook_path_id, partial UNIQUE on telegram_bot.bot_id, telegram_chat(bot_id, chat_id), notification_tracker_target unique link, partial UNIQUE on bridge_self provider per user
- Composite ix_event_log_user_event_type_created index
- save_chat_from_webhook switched to ON CONFLICT DO UPDATE
- ondelete=CASCADE on user-id FKs (model annotation; app-side cascade delete added for existing data)
- delete_notification_tracker converted from N+1 to bulk DELETE/UPDATE
- Module-level asyncio.Lock replaced with lazy _get_lock() pattern
- VACUUM INTO snapshot now PRAGMA integrity_check verified

## Performance
- Jinja2 template compilation LRU cached (lru_cache maxsize=512)
- Per-locale render cache in NotificationDispatcher (skips re-rendering identical content for receivers sharing a locale)
- Tracker list cached per provider_id with 5s TTL + explicit invalidation on tracker CRUD (relieves HA chat-bus rate query pressure)
- Nav-counts collapsed from 16 round-trips to single UNION ALL
- HA event_log: skip persisting empty assets_added/removed events

## Security hardening
- Mass-assignment guard on Action create/update; cron sub-minute reject
- Backup JSON depth/node-count cap (depth ≤ 10, nodes ≤ 100k)
- _sanitize_config extended to all JSON-typed fields on backup import
- Telegram _safe_get walks redirects manually with SSRF revalidation
- Bcrypt 72-byte password length cap with clear 422
- Webhook payload body redaction; sensitive substring set extended with oauth/client_secret/webhook_secret/csrf in both header filter and template extras filter

## Frontend
- 76 catch (err: any) sites converted to errMsg(err) helper
- globalProviderFilter: pure getter; reconciliation moved to one-time $effect in +layout
- Provider-filter binding: removed paired $effects + _syncingFilter flag, now one-way derived
- entity-cache: separate _refreshing flag for background re-fetches
- api.ts 401 handling: AuthRedirectError class + dedup _redirecting flag, goto() instead of window.location.href
- a11y: aria-expanded on mobile More, role=switch + aria-checked on Telegram bot toggles

## Tests & operations
- CI pytest gate added to .gitea/workflows/build.yml + release.yml (wheel-built install to dodge editable-install slowness)
- /api/ready upgraded to deep healthcheck (db SELECT 1, scheduler.running, HA supervisor presence) returning {ready, checks, errors, version}
- /api/metrics endpoint with prometheus_client (deferred_pending, event_log_total, dispatch_duration, poll_failures, send_failures)
- New OPERATIONS.md covering deploy, healthchecks, metrics, backup/restore procedures, log handling, common scenarios, upgrade flow
- New tests: test_bridge_self (11), test_gitea_parser (9), test_planka_parser (6), test_immich_change_detector (6), test_backup_roundtrip (1)

## New feature: bridge self-monitoring
- New bridge_self provider type: internal sink for bridge health events
- Three event types: bridge_self_poll_failures (consecutive tracker poll failures), bridge_self_deferred_backlog (pending count crosses threshold), bridge_self_target_failures (consecutive 5xx/network failures per target)
- Per-user thresholds (defaults: 3 / 100 / 5) configurable via the provider config form
- Auto-seeded on user create + /setup + boot backfill for existing users
- Anti-spam: counters reset after emission; backlog uses transition latch
- Self-loop guard: bridge_self failures don't count toward target-failure thresholds (logged only); wire to your own Telegram/Email/Matrix to get notified when polls/dispatches/sends fail
- 6 default templates (3 events × 2 locales), tracking config columns with backfill migration, frontend descriptor (excluded from "create provider" wizard since auto-managed)

Operator-visible behavior changes (call out in release notes):
- NOTIFY_BRIDGE_TELEGRAM_WEBHOOK_SECRET now REQUIRED for webhook mode
- Existing webhook providers with auth_mode="none" need explicit opt-in
- Generic webhook endpoint rate-limited 60/min per source IP
- HA disconnect/reconnect writes ha_status_* EventLog rows
- Every user gets a bridge_self provider; wire it to a target to receive failure alerts

Pre-existing test failures (test_ssrf, test_release_provider) on Python 3.13 are unrelated; CI runs on 3.12.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 02:16:49 +03:00
parent 22127e2a59
commit 10d30fc956
97 changed files with 5423 additions and 821 deletions
@@ -71,6 +71,12 @@ class EventType(str, Enum):
    HA_SERVICE_CALLED = "ha_service_called"
    HA_EVENT_FIRED = "ha_event_fired"

+    # Bridge self-monitoring events — emitted by the bridge itself when
+    # internal failures cross configured thresholds.
+    BRIDGE_SELF_POLL_FAILURES = "bridge_self_poll_failures"
+    BRIDGE_SELF_DEFERRED_BACKLOG = "bridge_self_deferred_backlog"
+    BRIDGE_SELF_TARGET_FAILURES = "bridge_self_target_failures"
+

@dataclass
 class ServiceEvent:
@@ -107,6 +107,12 @@ class NotificationDispatcher:
        # Optional shared session owned by the caller; when supplied we reuse
        # its connection pool instead of opening a fresh per-dispatch session.
        self._shared_session = session
+        # Per-dispatch render cache, keyed by locale. Reset by
+        # ``_send_to_target`` and filled lazily inside ``_message_for_receiver``
+        # so a 100-receiver fan-out renders each unique locale once.
+        # Initialized to empty so handlers called outside the normal
+        # dispatch path (tests) still see a valid dict.
+        self._render_cache: dict[str, str] = {}

    @contextlib.asynccontextmanager
    async def _session_ctx(self) -> AsyncIterator[aiohttp.ClientSession]:
@@ -198,20 +204,49 @@ class NotificationDispatcher:
    def _message_for_receiver(
        self, receiver: Receiver, default_message: str,
        event: ServiceEvent, target: TargetConfig,
+        cache: dict[str, str] | None = None,
    ) -> str:
-        if receiver.locale and receiver.locale != target.locale:
-            return self._render_message(event, target, receiver.locale)
-        return default_message
+        """Render message respecting receiver locale, with optional cache.
+
+        The ``cache`` dict (typically ``self._render_cache``, reset by
+        ``_send_to_target`` and passed in by the per-channel ``_send_*``
+        handlers) memoizes per-locale renders so a 100-receiver fan-out
+        renders each non-default locale once instead of once per receiver.
+        """
+        loc = receiver.locale or target.locale
+        if loc == target.locale:
+            return default_message
+        if cache is not None:
+            cached = cache.get(loc)
+            if cached is not None:
+                return cached
+            rendered = self._render_message(event, target, loc)
+            cache[loc] = rendered
+            return rendered
+        return self._render_message(event, target, loc)

    async def _send_to_target(
        self, event: ServiceEvent, target: TargetConfig
    ) -> dict[str, Any]:
-        """Dispatch to a single target via the registered handler."""
+        """Dispatch to a single target via the registered handler.
+
+        Resets the per-locale render cache on the dispatcher for the
+        duration of the send; handlers pass it to ``_message_for_receiver``.
+        The cache is keyed by receiver locale; the default locale's render
+        lives in ``default_message`` and is short-circuited before any
+        cache lookup.
+        """
        default_message = self._render_message(event, target, target.locale)
        send_method = _PROVIDER_HANDLERS.get(target.type)
        if send_method is None:
            return {"success": False, "error": f"Unknown target type: {target.type}"}
-        return await send_method(self, target, default_message, event)
+        # Stash the cache on the dispatcher instance for the duration of
+        # this dispatch; handlers pass it to _message_for_receiver.
+        # Avoids changing every _send_* signature, but is only correct
+        # while dispatches on one instance do not overlap.
+        self._render_cache = {}
+        try:
+            return await send_method(self, target, default_message, event)
+        finally:
+            self._render_cache = {}

    # ------------------------------------------------------------------
    # Asset preload (Telegram-specific)
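
The per-locale memoization in the hunk above can be illustrated in isolation. This is a minimal sketch, not the dispatcher's code: `message_for_locale` and `fake_render` are hypothetical names, and the template rendering is replaced by a string format so the example runs on its own.

```python
from typing import Callable, Optional


def message_for_locale(
    locale: Optional[str],
    default_locale: str,
    default_message: str,
    render: Callable[[str], str],
    cache: dict,
) -> str:
    """Return the message for one receiver, rendering each locale at most once."""
    loc = locale or default_locale
    if loc == default_locale:
        # The default locale was rendered up front; no cache lookup needed.
        return default_message
    if loc not in cache:
        cache[loc] = render(loc)
    return cache[loc]


calls = []  # records every real render so the saving is visible


def fake_render(loc: str) -> str:
    # Hypothetical stand-in for the template render step.
    calls.append(loc)
    return f"[{loc}] disk almost full"


cache: dict = {}  # one dict per dispatch, owned by the caller
receivers = ["en", None, "de", "fr"] * 25  # 100 receivers, default locale "en"
messages = [
    message_for_locale(r, "en", "[en] disk almost full", fake_render, cache)
    for r in receivers
]
```

In this sketch the cache is a local variable handed to the helper, so two dispatches that happen to overlap never see each other's renders.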
@@ -352,7 +387,7 @@ class NotificationDispatcher:
            async def send_one(receiver: Receiver) -> dict[str, Any]:
                if not isinstance(receiver, TelegramReceiver) or not receiver.chat_id:
                    return {"success": False, "error": "Invalid telegram receiver"}
-                message = self._message_for_receiver(receiver, default_message, event, target)
+                message = self._message_for_receiver(receiver, default_message, event, target, cache=self._render_cache)
                text_result = await client.send_message(
                    chat_id=receiver.chat_id,
                    text=message,
@@ -407,7 +442,7 @@ class NotificationDispatcher:
            async def send_one(receiver: Receiver) -> dict[str, Any]:
                if not isinstance(receiver, WebhookReceiver) or not receiver.url:
                    return {"success": False, "error": "Invalid webhook receiver"}
-                message = self._message_for_receiver(receiver, default_message, event, target)
+                message = self._message_for_receiver(receiver, default_message, event, target, cache=self._render_cache)
                payload = {
                    "message": message,
                    "event_type": event.event_type.value,
@@ -450,7 +485,7 @@ class NotificationDispatcher:
        async def send_one(receiver: Receiver) -> dict[str, Any]:
            if not isinstance(receiver, EmailReceiver) or not receiver.email:
                return {"success": False, "error": "Invalid email receiver"}
-            message = self._message_for_receiver(receiver, default_message, event, target)
+            message = self._message_for_receiver(receiver, default_message, event, target, cache=self._render_cache)
            # body_html=None lets EmailClient build a safely-escaped HTML
            # alternative from body_text instead of trusting user content.
            return await email_client.send(
@@ -479,7 +514,7 @@ class NotificationDispatcher:
            async def send_one(receiver: Receiver) -> dict[str, Any]:
                if not isinstance(receiver, DiscordReceiver) or not receiver.webhook_url:
                    return {"success": False, "error": "Invalid discord receiver"}
-                message = self._message_for_receiver(receiver, default_message, event, target)
+                message = self._message_for_receiver(receiver, default_message, event, target, cache=self._render_cache)
                return await client.send(receiver.webhook_url, message, username=username)

            results = await self._fan_out(target.receivers, send_one)
@@ -501,7 +536,7 @@ class NotificationDispatcher:
            async def send_one(receiver: Receiver) -> dict[str, Any]:
                if not isinstance(receiver, SlackReceiver) or not receiver.webhook_url:
                    return {"success": False, "error": "Invalid slack receiver"}
-                message = self._message_for_receiver(receiver, default_message, event, target)
+                message = self._message_for_receiver(receiver, default_message, event, target, cache=self._render_cache)
                return await client.send(receiver.webhook_url, message, username=username)

            results = await self._fan_out(target.receivers, send_one)
@@ -530,7 +565,7 @@ class NotificationDispatcher:
            async def send_one(receiver: Receiver) -> dict[str, Any]:
                if not isinstance(receiver, NtfyReceiver) or not receiver.topic:
                    return {"success": False, "error": "Invalid ntfy receiver"}
-                message = self._message_for_receiver(receiver, default_message, event, target)
+                message = self._message_for_receiver(receiver, default_message, event, target, cache=self._render_cache)
                return await client.send(
                    server_url, receiver.topic, message,
                    title=title, priority=receiver.priority, auth_token=auth_token,
@@ -563,7 +598,7 @@ class NotificationDispatcher:
            async def send_one(receiver: Receiver) -> dict[str, Any]:
                if not isinstance(receiver, MatrixReceiver) or not receiver.room_id:
                    return {"success": False, "error": "Invalid matrix receiver"}
-                message = self._message_for_receiver(receiver, default_message, event, target)
+                message = self._message_for_receiver(receiver, default_message, event, target, cache=self._render_cache)
                # body_html is the same plain text — Matrix accepts the
                # raw message as both ``body`` and ``formatted_body``.
                # If templates emit HTML in the future, generate a
@@ -222,21 +222,48 @@ class TelegramClient:
        """SSRF-guarded GET that returns ``(data, error)``.

        Validates the URL via ``avalidate_outbound_url`` before any HTTP
-        traffic. Errors are returned (not raised) and stripped of any
-        embedded secrets before they propagate to the operator-visible
-        result dict.
+        traffic. Redirects are walked manually so each ``Location`` is
+        re-validated — without this an attacker-controlled origin could
+        302 to a private-IP target after the initial guard passed.
+        Errors are returned (not raised) and stripped of any embedded
+        secrets before they propagate to the operator-visible result
+        dict.
        """
+        max_redirects = 3
+        current_url = url
        try:
-            await avalidate_outbound_url(url)
+            await avalidate_outbound_url(current_url)
        except UnsafeURLError as err:
            return None, f"Unsafe URL: {redact_exc(err)}"
        try:
-            async with self._session.get(
-                url, headers=headers or {}, timeout=_DOWNLOAD_TIMEOUT,
-            ) as resp:
-                if resp.status != 200:
-                    return None, f"HTTP {resp.status}"
-                return await resp.read(), None
+            for _ in range(max_redirects + 1):
+                async with self._session.get(
+                    current_url,
+                    headers=headers or {},
+                    timeout=_DOWNLOAD_TIMEOUT,
+                    allow_redirects=False,
+                ) as resp:
+                    if resp.status in (301, 302, 303, 307, 308):
+                        loc = resp.headers.get("Location")
+                        if not loc:
+                            return None, f"HTTP {resp.status} without Location header"
+                        # ``resp.url`` is a yarl.URL; ``.join`` resolves
+                        # relative redirects (``/foo/bar``) against it.
+                        from yarl import URL as _URL
+                        try:
+                            next_url = str(resp.url.join(_URL(loc)))
+                        except (ValueError, TypeError):
+                            return None, "Malformed redirect Location"
+                        try:
+                            await avalidate_outbound_url(next_url)
+                        except UnsafeURLError as err:
+                            return None, f"Unsafe redirect: {redact_exc(err)}"
+                        current_url = next_url
+                        continue
+                    if resp.status != 200:
+                        return None, f"HTTP {resp.status}"
+                    return await resp.read(), None
+            return None, f"Too many redirects (>{max_redirects})"
        except (aiohttp.ClientError, asyncio.TimeoutError, OSError) as err:
            return None, redact_exc(err)
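
The redirect walk above can be sketched without a network. This is an illustration under stated assumptions, not the client's code: `check_url` is a hypothetical stand-in for `avalidate_outbound_url` that only rejects non-HTTP schemes and literal non-public IPs (no DNS resolution), `fetch` is an injected fake transport, and `urllib.parse.urljoin` replaces the `yarl` join.

```python
import ipaddress
from typing import Callable, Optional
from urllib.parse import urljoin, urlsplit

REDIRECT_STATUSES = (301, 302, 303, 307, 308)


class UnsafeRedirect(ValueError):
    """Raised when a hop fails validation or the redirect chain is too long."""


def check_url(url: str) -> None:
    # Hypothetical validator: scheme check plus literal-IP check only.
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        raise UnsafeRedirect(f"unsupported URL: {url}")
    try:
        ip = ipaddress.ip_address(parts.hostname)
    except ValueError:
        return  # a hostname; resolving it is out of scope for this sketch
    if not ip.is_global:
        raise UnsafeRedirect(f"non-public address: {ip}")


def walk_redirects(
    start_url: str,
    fetch: Callable[[str], tuple[int, Optional[str]]],
    max_redirects: int = 3,
) -> tuple[str, int]:
    """Follow redirects by hand, validating every hop before requesting it."""
    current = start_url
    check_url(current)
    for _ in range(max_redirects + 1):
        status, location = fetch(current)
        if status in REDIRECT_STATUSES:
            if not location:
                raise UnsafeRedirect(f"HTTP {status} without Location header")
            current = urljoin(current, location)  # resolves relative targets
            check_url(current)
            continue
        return current, status
    raise UnsafeRedirect(f"too many redirects (>{max_redirects})")


# Fake transport: maps a URL to (status, Location header).
hops = {
    "https://example.com/a": (302, "/b"),
    "https://example.com/b": (200, None),
}
final_url, final_status = walk_redirects("https://example.com/a", lambda u: hops[u])
```

The point of the pattern is that validation runs on every hop, so a public origin cannot bounce the request to an internal address after the first check has passed.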

@@ -22,6 +22,7 @@ class ServiceProviderType(str, Enum):
    GOOGLE_PHOTOS = "google_photos"
    WEBHOOK = "webhook"
    HOME_ASSISTANT = "home_assistant"
+    BRIDGE_SELF = "bridge_self"


 # Callback signature for push-style providers: a coroutine that accepts a
@@ -0,0 +1,39 @@
+"""Bridge self-monitoring service provider.
+
+Unlike external providers (Immich, Gitea, NUT, ...), the ``bridge_self``
+provider does not connect to any remote service. Its sole purpose is to
+give operators a configurable surface (thresholds, notification slots,
+trackers, and targets) for events that the bridge itself emits when its
+internal subsystems fail.
+
+Three failure conditions are surfaced as :class:`ServiceEvent` instances
+through the same dispatch pipeline that all other providers use:
+
+* ``bridge_self_poll_failures``    — N consecutive poll failures for
+  any tracker exceed the configured threshold.
+* ``bridge_self_deferred_backlog`` — pending ``deferred_dispatch`` row
+  count crosses the configured threshold.
+* ``bridge_self_target_failures``  — N consecutive 5xx / network failures
+  for a single notification target.
+
+Events are constructed by ``services/bridge_self.py`` on the server side
+(it owns DB access for looking up the bridge_self provider per user)
+and then fed into ``dispatch_provider_event`` like any other event.
+"""
+
+from notify_bridge_core.providers.base import ServiceProviderType
+from notify_bridge_core.templates.variables import registry
+
+from .event_parser import build_event
+from .provider import BRIDGE_SELF_VARIABLES, BridgeSelfServiceProvider
+
+# Register variables so the validator and template-vars API see them.
+registry.register_provider_variables(
+    ServiceProviderType.BRIDGE_SELF, BRIDGE_SELF_VARIABLES,
+)
+
+__all__ = [
+    "BRIDGE_SELF_VARIABLES",
+    "BridgeSelfServiceProvider",
+    "build_event",
+]
@@ -0,0 +1,89 @@
+"""Bridge self-monitoring event parser.
+
+The bridge generates these events from internal subsystems (watcher,
+scheduler, dispatcher) — the parser turns a flat payload dict into the
+generic :class:`ServiceEvent` shape that the rest of the dispatch
+pipeline expects.
+
+Payload shape::
+
+    {
+        "failure_type": "poll_failures" | "deferred_backlog" | "target_failures",
+        "subject_id":   int,        # tracker_id, target_id, or 0
+        "subject_name": str,
+        "count":        int,        # consecutive failures or pending count
+        "threshold":    int,
+        "last_error":   str,        # may be empty
+        "details":      dict[str, Any],  # extra context
+    }
+"""
+
+from __future__ import annotations
+
+from datetime import datetime, timezone
+from typing import Any
+
+from notify_bridge_core.models.events import EventType, ServiceEvent
+from notify_bridge_core.providers.base import ServiceProviderType
+
+
+# Defensive cap on the persisted error message; very long tracebacks would
+# bloat the EventLog details JSON column otherwise.
+_MAX_ERROR_LEN = 1000
+
+
+_FAILURE_TYPE_TO_EVENT: dict[str, EventType] = {
+    "poll_failures": EventType.BRIDGE_SELF_POLL_FAILURES,
+    "deferred_backlog": EventType.BRIDGE_SELF_DEFERRED_BACKLOG,
+    "target_failures": EventType.BRIDGE_SELF_TARGET_FAILURES,
+}
+
+
+def build_event(
+    payload: dict[str, Any],
+    *,
+    provider_name: str = "Bridge Self-Monitoring",
+    timestamp: datetime | None = None,
+) -> ServiceEvent | None:
+    """Convert a self-monitoring payload dict into a ServiceEvent.
+
+    Returns None for malformed payloads, such as a non-dict payload or an
+    unknown or missing failure_type; other missing keys fall back to zero
+    or empty defaults. The caller drops a None result without raising.
+    """
+    if not isinstance(payload, dict):
+        return None
+    failure_type = payload.get("failure_type")
+    event_type = _FAILURE_TYPE_TO_EVENT.get(str(failure_type) if failure_type else "")
+    if event_type is None:
+        return None
+
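+    try:
+        subject_id = int(payload.get("subject_id") or 0)
+        count = int(payload.get("count") or 0)
+        threshold = int(payload.get("threshold") or 0)
+    except (TypeError, ValueError):
+        # Non-numeric values are malformed too; return None rather than raise.
+        return None
+    subject_name = str(payload.get("subject_name") or "")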
+    last_error = str(payload.get("last_error") or "")[:_MAX_ERROR_LEN]
+    details = payload.get("details") if isinstance(payload.get("details"), dict) else {}
+
+    when = timestamp or datetime.now(timezone.utc)
+
+    return ServiceEvent(
+        event_type=event_type,
+        provider_type=ServiceProviderType.BRIDGE_SELF,
+        provider_name=provider_name,
+        # ``collection_id`` / ``collection_name`` are required fields on
+        # ServiceEvent; we use the subject so quiet-hours / dedupe logic
+        # treats different subjects as distinct streams.
+        collection_id=str(subject_id),
+        collection_name=subject_name or str(failure_type),
+        timestamp=when,
+        extra={
+            "failure_type": str(failure_type),
+            "subject_id": subject_id,
+            "subject_name": subject_name,
+            "count": count,
+            "threshold": threshold,
+            "last_error": last_error,
+            "details": dict(details),
+        },
+    )
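
The payload shape documented at the top of this file can be exercised with a standalone sketch of the same normalisation. `NormalizedPayload` and `normalize` are hypothetical names standing in for `ServiceEvent` and `build_event`, which need the `notify_bridge_core` package. Treating non-numeric counts as malformed is a choice made for this example.

```python
from dataclasses import dataclass
from typing import Any, Optional

MAX_ERROR_LEN = 1000  # same cap as the parser uses for last_error
KNOWN_FAILURE_TYPES = {"poll_failures", "deferred_backlog", "target_failures"}


@dataclass(frozen=True)
class NormalizedPayload:
    """Hypothetical stand-in for the fields the parser stores in ``extra``."""

    failure_type: str
    subject_id: int
    count: int
    threshold: int
    last_error: str


def normalize(payload: Any) -> Optional[NormalizedPayload]:
    """Coerce a flat payload dict; return None instead of raising."""
    if not isinstance(payload, dict):
        return None
    failure_type = payload.get("failure_type")
    if not isinstance(failure_type, str) or failure_type not in KNOWN_FAILURE_TYPES:
        return None
    try:
        subject_id = int(payload.get("subject_id") or 0)
        count = int(payload.get("count") or 0)
        threshold = int(payload.get("threshold") or 0)
    except (TypeError, ValueError):
        return None  # non-numeric field: treated as malformed in this sketch
    last_error = str(payload.get("last_error") or "")[:MAX_ERROR_LEN]
    return NormalizedPayload(failure_type, subject_id, count, threshold, last_error)


ok = normalize({
    "failure_type": "poll_failures",
    "subject_id": "42",          # numeric strings are coerced
    "count": 3,
    "threshold": 3,
    "last_error": "x" * 5000,    # truncated to MAX_ERROR_LEN
})
```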
@@ -0,0 +1,148 @@
+"""Bridge self-monitoring service provider — emits internal-failure events.
+
+This is a passive provider: it does not connect to anything, never polls,
+and never subscribes. It exists so the rest of the bridge's CRUD / config /
+template / target plumbing has a single ``ServiceProvider`` to attach
+self-monitoring trackers and notification slots to.
+
+Events are constructed by the server-side helper
+``services/bridge_self.emit_bridge_self_event`` and pushed into
+``dispatch_provider_event`` directly — the provider itself is not asked
+to produce events.
+"""
+
+from __future__ import annotations
+
+from typing import Any
+
+from notify_bridge_core.models.events import ServiceEvent
+from notify_bridge_core.providers.base import (
+    ServiceProvider,
+    ServiceProviderType,
+)
+from notify_bridge_core.templates.variables import TemplateVariableDefinition
+
+
+# Default thresholds, used when the bridge_self provider's ``config`` JSON
+# omits the corresponding key.
+DEFAULT_POLL_FAILURE_THRESHOLD = 3
+DEFAULT_DEFERRED_BACKLOG_THRESHOLD = 100
+DEFAULT_TARGET_FAILURE_THRESHOLD = 5
+
+
+# Template variables exposed to bridge_self templates.
+BRIDGE_SELF_VARIABLES: list[TemplateVariableDefinition] = [
+    TemplateVariableDefinition(
+        name="failure_type",
+        type="string",
+        description="Which self-monitoring condition fired",
+        example="poll_failures",
+        provider_type=ServiceProviderType.BRIDGE_SELF,
+    ),
+    TemplateVariableDefinition(
+        name="subject_id",
+        type="int",
+        description="ID of the affected entity (tracker_id, target_id, or 0)",
+        example="42",
+        provider_type=ServiceProviderType.BRIDGE_SELF,
+    ),
+    TemplateVariableDefinition(
+        name="subject_name",
+        type="string",
+        description="Human-readable name of the affected entity",
+        example="My Immich Tracker",
+        provider_type=ServiceProviderType.BRIDGE_SELF,
+    ),
+    TemplateVariableDefinition(
+        name="count",
+        type="int",
+        description="Consecutive failure count or current backlog size",
+        example="3",
+        provider_type=ServiceProviderType.BRIDGE_SELF,
+    ),
+    TemplateVariableDefinition(
+        name="threshold",
+        type="int",
+        description="Configured threshold that was crossed",
+        example="3",
+        provider_type=ServiceProviderType.BRIDGE_SELF,
+    ),
+    TemplateVariableDefinition(
+        name="last_error",
+        type="string",
+        description="Last underlying error message (truncated)",
+        example="Connection refused",
+        provider_type=ServiceProviderType.BRIDGE_SELF,
+    ),
+    TemplateVariableDefinition(
+        name="details",
+        type="dict",
+        description="Extra structured context for the event",
+        example='{"provider_id": 7}',
+        provider_type=ServiceProviderType.BRIDGE_SELF,
+    ),
+]
+
+
+class BridgeSelfServiceProvider(ServiceProvider):
+    """Passive provider — exposes nothing remote, holds only thresholds.
+
+    Polling is a no-op and ``connect`` always succeeds; the bridge itself
+    is what generates events for this provider.
+    """
+
+    provider_type = ServiceProviderType.BRIDGE_SELF
+    supports_subscription = False
+
+    def __init__(self, name: str = "Bridge Self-Monitoring") -> None:
+        self._name = name
+
+    async def connect(self) -> bool:
+        return True
+
+    async def disconnect(self) -> None:
+        return None
+
+    async def poll(
+        self,
+        collection_ids: list[str],
+        tracker_state: dict[str, Any],
+    ) -> tuple[list[ServiceEvent], dict[str, Any]]:
+        # No external service to poll. Returning an empty event list honours
+        # the ``poll`` contract, so accidental scheduling is a clean no-op.
+        return [], tracker_state
+
+    def get_available_variables(self) -> list[TemplateVariableDefinition]:
+        return list(BRIDGE_SELF_VARIABLES)
+
+    def get_provider_config_schema(self) -> dict[str, Any]:
+        return {
+            "type": "object",
+            "properties": {
+                "poll_failure_threshold": {
+                    "type": "integer",
+                    "minimum": 1,
+                    "default": DEFAULT_POLL_FAILURE_THRESHOLD,
+                    "description": "Consecutive tracker poll failures before alerting",
+                },
+                "deferred_backlog_threshold": {
+                    "type": "integer",
+                    "minimum": 1,
+                    "default": DEFAULT_DEFERRED_BACKLOG_THRESHOLD,
+                    "description": "Pending deferred_dispatch rows before alerting",
+                },
+                "target_failure_threshold": {
+                    "type": "integer",
+                    "minimum": 1,
+                    "default": DEFAULT_TARGET_FAILURE_THRESHOLD,
+                    "description": "Consecutive target send failures before alerting",
+                },
+            },
+            "required": [],
+        }
+
+    async def list_collections(self) -> list[dict[str, Any]]:
+        # No collection concept — operators don't pick anything for this provider.
+        return []
+
+    async def test_connection(self) -> dict[str, Any]:
+        return {"ok": True, "message": "Bridge self-monitoring is always available"}
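
The thresholds in the schema above feed the anti-spam behaviour the commit message describes: counters reset after emission, and the backlog alert uses a transition latch. The code that does this lives in `services/bridge_self.py` and is not shown in this diff, so the following is only a sketch with hypothetical class names. Comparing with `>=` is an assumption about what "crosses the threshold" means.

```python
from dataclasses import dataclass, field


@dataclass
class ConsecutiveFailureCounter:
    """Fires once per run of ``threshold`` consecutive failures per subject."""

    threshold: int = 3
    _counts: dict = field(default_factory=dict)

    def record(self, subject_id: int, ok: bool) -> bool:
        if ok:
            self._counts.pop(subject_id, None)  # any success ends the streak
            return False
        count = self._counts.get(subject_id, 0) + 1
        if count >= self.threshold:
            self._counts[subject_id] = 0  # reset after emission
            return True
        self._counts[subject_id] = count
        return False


@dataclass
class BacklogLatch:
    """Fires only on the transition from below to at-or-above the threshold."""

    threshold: int = 100
    _tripped: bool = False

    def observe(self, pending: int) -> bool:
        above = pending >= self.threshold
        fire = above and not self._tripped
        self._tripped = above  # re-arms once the backlog drops below again
        return fire


polls = ConsecutiveFailureCounter(threshold=3)
poll_alerts = [polls.record(7, ok=False) for _ in range(4)]

backlog = BacklogLatch(threshold=100)
backlog_alerts = [backlog.observe(n) for n in (50, 120, 130, 80, 101)]
```

With these rules a tracker that keeps failing alerts once every `threshold` polls rather than on every poll, and a backlog that stays high alerts once until it has drained below the threshold.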
@@ -514,6 +514,39 @@ HOME_ASSISTANT_CAPABILITIES = ProviderCapabilities(
 )


+# ---------------------------------------------------------------------------
+# Bridge self-monitoring capabilities
+# ---------------------------------------------------------------------------
+
+BRIDGE_SELF_CAPABILITIES = ProviderCapabilities(
+    provider_type="bridge_self",
+    display_name="Bridge Self-Monitoring",
+    webhook_based=False,
+    supported_filters=[],
+    notification_slots=[
+        {
+            "name": "message_bridge_self_poll_failures",
+            "description": "Tracker poll failures crossed threshold",
+        },
+        {
+            "name": "message_bridge_self_deferred_backlog",
+            "description": "Deferred dispatch backlog crossed threshold",
+        },
+        {
+            "name": "message_bridge_self_target_failures",
+            "description": "Target send failures crossed threshold",
+        },
+    ],
+    events=[
+        {"name": "bridge_self_poll_failures", "description": "Tracker poll failures"},
+        {"name": "bridge_self_deferred_backlog", "description": "Deferred backlog high"},
+        {"name": "bridge_self_target_failures", "description": "Target send failures"},
+    ],
+    command_slots=[],
+    commands=[],
+)
+
+
 # ---------------------------------------------------------------------------
 # Registry
 # ---------------------------------------------------------------------------
@@ -527,6 +560,7 @@ _REGISTRY: dict[str, ProviderCapabilities] = {
    "google_photos": GOOGLE_PHOTOS_CAPABILITIES,
    "webhook": WEBHOOK_CAPABILITIES,
    "home_assistant": HOME_ASSISTANT_CAPABILITIES,
+    "bridge_self": BRIDGE_SELF_CAPABILITIES,
 }


@@ -10,7 +10,7 @@ arrive. The lifecycle is owned by the server-side subscription manager
 from __future__ import annotations

 import logging
-from typing import Any
+from typing import Any, Callable

 import aiohttp

@@ -25,6 +25,12 @@ from notify_bridge_core.templates.variables import TemplateVariableDefinition
 from .client import HomeAssistantWSClient
 from .event_parser import parse_event

+
+# Status callback signature: ``(state, detail)`` where ``state`` is one of
+# ``"connected"`` / ``"disconnected"`` and ``detail`` is an optional already-
+# redacted reason string (or None on connect).
+StatusChangeCallback = Callable[[str, str | None], None]
+
 _LOGGER = logging.getLogger(__name__)


@@ -229,7 +235,11 @@ class HomeAssistantServiceProvider(ServiceProvider):
        # — the subscription manager owns this provider's lifecycle instead.
        return [], tracker_state

-    async def subscribe(self, emit: EventEmitCallback) -> None:
+    async def subscribe(
+        self,
+        emit: EventEmitCallback,
+        on_status_change: StatusChangeCallback | None = None,
+    ) -> None:
        async def _on_event(ha_event: dict[str, Any]) -> None:
            event = parse_event(
                ha_event,
@@ -252,6 +262,7 @@ class HomeAssistantServiceProvider(ServiceProvider):
            on_event=_on_event,
            event_types=self._event_types,
            refresh_areas=_refresh_areas,
+            on_status_change=on_status_change,
        )

    def get_available_variables(self) -> list[TemplateVariableDefinition]:
@@ -29,10 +29,21 @@ _LOGGER = logging.getLogger(__name__)
 # calls per poll cycle.  TTL is conservative (1h) and a hashed key keeps the
 # raw api_key out of dict keys in case of a memory dump.
 _USERS_CACHE_TTL_SECONDS = 3600
-_users_cache_lock = asyncio.Lock()
+# Lazy init: on Python < 3.10, ``asyncio.Lock()`` at module import binds to
+# whichever event loop is current at import time (often none, or the wrong
+# one when tests spin up dedicated loops); newer versions bind on first
+# contended use instead. Defer creation to first use.
+_users_cache_lock: asyncio.Lock | None = None
 _users_cache: dict[str, tuple[float, dict[str, str]]] = {}


+def _get_users_cache_lock() -> asyncio.Lock:
+    """Return the module users-cache lock, creating it on first call."""
+    global _users_cache_lock
+    if _users_cache_lock is None:
+        _users_cache_lock = asyncio.Lock()
+    return _users_cache_lock
+
+
 def _users_cache_key(url: str, api_key: str) -> str:
    digest = hashlib.sha256(f"{url}|{api_key}".encode("utf-8")).hexdigest()
    return digest[:32]
@@ -51,7 +62,7 @@ async def _get_cached_users(
    if entry is not None and (now - entry[0]) < _USERS_CACHE_TTL_SECONDS:
        return entry[1]

-    async with _users_cache_lock:
+    async with _get_users_cache_lock():
        # Re-check after acquiring the lock — another coroutine may have
        # refreshed the entry while we waited.
        entry = _users_cache.get(key)
@@ -200,10 +200,28 @@ class NutServiceProvider(ServiceProvider):
        try:
            for ups_name in collection_ids:
                prev = tracker_state.get(ups_name, {})
+                # First-ever observation has no baseline — emitting transition
+                # events for whatever flags the device happens to carry would
+                # spam the user with "OB"/"LB"/"REPLBATT" alerts on every fresh
+                # tracker even when nothing changed. Seed state silently and
+                # skip event emission until the next poll provides a baseline.
+                is_first_observation = ups_name not in tracker_state
                try:
                    variables = await client.list_var(ups_name)
                    data = NutUpsData.from_variables(ups_name, variables)

+                    if is_first_observation:
+                        new_state[ups_name] = {
+                            "name": data.description or ups_name,
+                            "status": data.status,
+                            "battery_charge": data.battery_charge,
+                            "comms_ok": True,
+                            "asset_ids": [],
+                            "pending_asset_ids": [],
+                            "shared": False,
+                        }
+                        continue
+
                    # Check for comms restored
                    if not prev.get("comms_ok", True):
                        events.append(self._make_event(
@@ -35,6 +35,10 @@ _SENSITIVE_EXTRA_TOKENS: tuple[str, ...] = (
    "bearer",
    "private_key",
    "access_key",
+    "oauth",
+    "client_secret",
+    "webhook_secret",
+    "csrf",
 )


@@ -0,0 +1,6 @@
+⚠️ <b>Deferred dispatch backlog high</b>
+Pending notifications: <b>{{ count }}</b>
+Threshold: <b>{{ threshold }}</b>
+{%- if last_error %}
+<i>Note:</i> <code>{{ last_error }}</code>
+{%- endif %}
@@ -0,0 +1,6 @@
+🚨 <b>Tracker poll failures</b>
+<b>{{ subject_name }}</b> (id <code>{{ subject_id }}</code>)
+<b>{{ count }}</b> consecutive failures (threshold {{ threshold }})
+{%- if last_error %}
+<i>Last error:</i> <code>{{ last_error }}</code>
+{%- endif %}
@@ -0,0 +1,6 @@
+📡 <b>Target send failures</b>
+<b>{{ subject_name }}</b> (id <code>{{ subject_id }}</code>)
+<b>{{ count }}</b> consecutive failures (threshold {{ threshold }})
+{%- if last_error %}
+<i>Last error:</i> <code>{{ last_error }}</code>
+{%- endif %}
@@ -79,6 +79,11 @@ PROVIDER_SLOT_FILE_MAP: dict[str, dict[str, str]] = {
        "message_ha_service_called": "ha_service_called.jinja2",
        "message_ha_event_fired": "ha_event_fired.jinja2",
    },
+    "bridge_self": {
+        "message_bridge_self_poll_failures": "bridge_self_poll_failures.jinja2",
+        "message_bridge_self_deferred_backlog": "bridge_self_deferred_backlog.jinja2",
+        "message_bridge_self_target_failures": "bridge_self_target_failures.jinja2",
+    },
 }

 # Backward-compatible alias
@@ -0,0 +1,6 @@
+⚠️ <b>Очередь отложенной отправки растёт</b>
+Ожидают отправки: <b>{{ count }}</b>
+Порог: <b>{{ threshold }}</b>
+{%- if last_error %}
+<i>Примечание:</i> <code>{{ last_error }}</code>
+{%- endif %}
@@ -0,0 +1,6 @@
+🚨 <b>Сбои опроса трекера</b>
+<b>{{ subject_name }}</b> (id <code>{{ subject_id }}</code>)
+Подряд сбоев: <b>{{ count }}</b> (порог {{ threshold }})
+{%- if last_error %}
+<i>Последняя ошибка:</i> <code>{{ last_error }}</code>
+{%- endif %}
@@ -0,0 +1,6 @@
+📡 <b>Сбои отправки в адресат</b>
+<b>{{ subject_name }}</b> (id <code>{{ subject_id }}</code>)
+Подряд сбоев: <b>{{ count }}</b> (порог {{ threshold }})
+{%- if last_error %}
+<i>Последняя ошибка:</i> <code>{{ last_error }}</code>
+{%- endif %}
@@ -13,6 +13,7 @@ from __future__ import annotations

 import logging
 import threading
+from functools import lru_cache
 from typing import Any

 import jinja2
@@ -27,6 +28,19 @@ RENDER_TIMEOUT_SECONDS = 2.0
 _env = SandboxedEnvironment(autoescape=True)


+@lru_cache(maxsize=512)
+def _compile_cached(template_str: str) -> jinja2.Template:
+    """Compile and cache Jinja2 templates, keyed by source text.
+
+    Hot paths (NotificationDispatcher fan-out, periodic dispatch) re-render
+    the same template string for every event; ``_env.from_string`` parses
+    the source from scratch each time (roughly milliseconds per call). The
+    512-entry LRU cache is sized to hold the templates of a busy install
+    while keeping memory bounded.
+    """
+    return _env.from_string(template_str)
+
+
 class TemplateRenderTimeout(jinja2.TemplateError):
    """Raised when a template exceeds the configured render budget."""

@@ -74,7 +88,7 @@ def render_template(template_str: str, context: dict[str, Any]) -> str:
        )
        return "[Template too large]"
    try:
-        compiled = _env.from_string(template_str)
+        compiled = _compile_cached(template_str)
        output = _render_with_timeout(compiled, context)
    except TemplateRenderTimeout as e:
        _LOGGER.error("Template render timeout: %s", e)
@@ -27,6 +27,9 @@ def validate_template(
        "has_oversized_videos", "max_video_size", "max_video_size_mb",
        "added_assets", "assets", "albums",
        "raw_payload", "event_type_raw", "source_ip",
+        # bridge_self self-monitoring variables.
+        "failure_type", "subject_id", "subject_name", "count",
+        "threshold", "last_error", "details",
    }
    allowed = available | runtime_vars