feat: production readiness — security, perf, bug fixes, bridge self-monitoring

Comprehensive multi-area pass driven by a parallel 8-agent production
review. Covers frontend, backend, database, security, performance, and
operations, plus a new self-monitoring feature.

## Critical fixes
- Planka webhook: reads bounded raw body (was NameError on every call)
- HA quiet hours: ha_state_changed/automation_triggered/service_called/
  event_fired added to deferrable set (were silently dropped)
- DNS-rebinding SSRF: PinnedResolver wired into shared aiohttp session
- Telegram inbound webhook: secret now mandatory (401 without)
- Generic webhook: auth_mode="none" requires explicit
  acknowledge_unauthenticated=true; per-IP rate limit 60/min
- svelte-check: 5 null-narrowing errors in EventDetailModal fixed
- Provider hardcoding: Immich-only block extracted to descriptor
  featureDiscoveryHint
- command_sync: snapshot+expunge bot before exiting AsyncSession
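
The per-IP limit on the generic webhook can be pictured as a sliding-window
counter. A minimal sketch, assuming an in-memory store; `SlidingWindowLimiter`
and its method names are hypothetical and the bridge's actual limiter may be
built differently:

```python
from __future__ import annotations

import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Hypothetical limiter: at most `limit` hits per `window` seconds per key."""

    def __init__(self, limit: int = 60, window: float = 60.0) -> None:
        self.limit = limit
        self.window = window
        self._hits: dict[str, deque[float]] = defaultdict(deque)

    def allow(self, key: str, now: float | None = None) -> bool:
        """Record a hit for `key` and report whether it is within the limit."""
        now = time.monotonic() if now is None else now
        hits = self._hits[key]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True
```

A request handler would call `allow(source_ip)` and answer 429 when it returns
False. Rejected hits are not recorded, so a blocked client recovers as soon as
its oldest accepted hit leaves the window.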

## Bug fixes
- notifier asyncio.gather(return_exceptions=True) — one bad chat no longer
  cancels peer sends
- NotificationDispatcher hoisted out of per-tracker loop
- Provider credential resolution unified across all 5 dispatch sites
- HA asyncio.shield now drains inner task on cancellation
- Provider construction switched from if/elif ladder to factory registry
- NUT first poll seeds silently (no spurious ups_on_battery)
- Quiet-hours gate: event-type-disabled now wins over deferral
- APScheduler drain job ID resolution upgraded to seconds
- HA on_status_change wired through to EventLog
- Webhook payload rollback failures now logged (not swallowed)
- Batched receivers/chats/bots in load_link_data (was per-target N+1)
- flag_modified on JSON column reassignments in deferred_dispatch
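
The notifier fix relies on `asyncio.gather(..., return_exceptions=True)`
returning a failure as a value next to the other results instead of raising it
out of the gather. A small sketch; `send` and `send_all` are hypothetical
stand-ins for the real notifier code:

```python
import asyncio


async def send(chat_id: int) -> str:
    # Hypothetical sender: chat 2 always fails.
    if chat_id == 2:
        raise RuntimeError("chat 2 unreachable")
    await asyncio.sleep(0)
    return f"sent:{chat_id}"


async def send_all(chat_ids: list[int]) -> list[dict[str, object]]:
    # Exceptions come back as list items, in the same order as the inputs.
    results = await asyncio.gather(
        *(send(c) for c in chat_ids), return_exceptions=True
    )
    return [
        {
            "chat_id": c,
            "success": not isinstance(r, BaseException),
            "detail": str(r),
        }
        for c, r in zip(chat_ids, results)
    ]


outcome = asyncio.run(send_all([1, 2, 3]))
```

Each entry in `outcome` carries its own success flag, so the caller can log the
failed chat and still report the sends that went through.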

## Database
- UNIQUE indexes on service_provider.webhook_token,
  telegram_bot.webhook_path_id, partial UNIQUE on telegram_bot.bot_id,
  telegram_chat(bot_id, chat_id), notification_tracker_target unique link,
  partial UNIQUE on bridge_self provider per user
- Composite ix_event_log_user_event_type_created index
- save_chat_from_webhook switched to ON CONFLICT DO UPDATE
- ondelete=CASCADE on user-id FKs (model annotation; app-side cascade
  delete added for existing data)
- delete_notification_tracker converted from N+1 to bulk DELETE/UPDATE
- Module-level asyncio.Lock replaced with lazy _get_lock() pattern
- VACUUM INTO snapshot now PRAGMA integrity_check verified
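
The `ON CONFLICT DO UPDATE` switch depends on the new UNIQUE index on
`telegram_chat(bot_id, chat_id)`. A self-contained sketch using the standard
`sqlite3` module; the `title` column and the `save_chat` helper are assumptions
made for illustration, not the project's schema or ORM code:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE telegram_chat ("
    " bot_id INTEGER NOT NULL,"
    " chat_id TEXT NOT NULL,"
    " title TEXT,"  # assumed column, for illustration only
    " UNIQUE (bot_id, chat_id))"
)


def save_chat(bot_id: int, chat_id: str, title: str) -> None:
    # Insert the chat, or update the existing row when the unique pair exists.
    conn.execute(
        "INSERT INTO telegram_chat (bot_id, chat_id, title) VALUES (?, ?, ?) "
        "ON CONFLICT (bot_id, chat_id) DO UPDATE SET title = excluded.title",
        (bot_id, chat_id, title),
    )


save_chat(1, "-100", "Ops")
save_chat(1, "-100", "Ops (renamed)")
rows = conn.execute("SELECT bot_id, chat_id, title FROM telegram_chat").fetchall()
```

The second call updates the existing row rather than failing on the unique
constraint, which removes the need for a separate select-then-insert step.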

## Performance
- Jinja2 template compilation LRU cached (lru_cache maxsize=512)
- Per-locale render cache in NotificationDispatcher (skips re-rendering
  identical content for receivers sharing a locale)
- Tracker list cached per provider_id with 5s TTL + explicit invalidation
  on tracker CRUD (relieves HA chat-bus rate query pressure)
- Nav-counts collapsed from 16 round-trips to single UNION ALL
- HA event_log: skip persisting empty assets_added/removed events
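
The tracker-list cache combines a short TTL with explicit invalidation. A
minimal sketch of that pattern, assuming a synchronous loader and an injectable
clock; `TTLCache` and its methods are hypothetical names, not the bridge's
implementation:

```python
from __future__ import annotations

import time
from typing import Callable, Generic, TypeVar

T = TypeVar("T")


class TTLCache(Generic[T]):
    """Hypothetical per-key cache with a time-to-live and manual invalidation."""

    def __init__(
        self, ttl: float = 5.0, clock: Callable[[], float] = time.monotonic
    ) -> None:
        self._ttl = ttl
        self._clock = clock
        self._entries: dict[int, tuple[float, T]] = {}

    def get_or_load(self, key: int, loader: Callable[[], T]) -> T:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # still fresh
        value = loader()
        self._entries[key] = (now, value)
        return value

    def invalidate(self, key: int) -> None:
        # Called from tracker create/update/delete so edits show up at once.
        self._entries.pop(key, None)
```

The TTL bounds how stale a list can get if an invalidation is missed, while the
explicit `invalidate` call keeps CRUD changes visible immediately.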

## Security hardening
- Mass-assignment guard on Action create/update; cron sub-minute reject
- Backup JSON depth/node-count cap (depth ≤ 10, nodes ≤ 100k)
- _sanitize_config extended to all JSON-typed fields on backup import
- Telegram _safe_get walks redirects manually with SSRF revalidation
- Bcrypt 72-byte password length cap with clear 422
- Webhook payload body redaction; sensitive substring set extended with
  oauth/client_secret/webhook_secret/csrf in both header filter and
  template extras filter
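
The backup depth/node-count cap can be sketched as an iterative walk over the
parsed JSON value. This is an illustration under assumptions: the names
`check_json_limits` and `BackupTooComplex` are hypothetical, and counting the
root as depth 1 is a choice made here, not something the commit specifies:

```python
from typing import Any

MAX_DEPTH = 10
MAX_NODES = 100_000


class BackupTooComplex(ValueError):
    """Hypothetical error raised when a backup document exceeds the caps."""


def check_json_limits(
    value: Any, max_depth: int = MAX_DEPTH, max_nodes: int = MAX_NODES
) -> int:
    """Return the node count, or raise if depth or node count is exceeded."""
    nodes = 0
    stack = [(value, 1)]  # (node, depth); the root counts as depth 1
    while stack:
        current, depth = stack.pop()
        nodes += 1
        if depth > max_depth:
            raise BackupTooComplex(f"depth exceeds {max_depth}")
        if nodes > max_nodes:
            raise BackupTooComplex(f"node count exceeds {max_nodes}")
        if isinstance(current, dict):
            stack.extend((v, depth + 1) for v in current.values())
        elif isinstance(current, list):
            stack.extend((v, depth + 1) for v in current)
    return nodes
```

Using an explicit stack instead of recursion means the check itself cannot hit
Python's recursion limit on a hostile document.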

## Frontend
- 76 catch (err: any) sites converted to errMsg(err) helper
- globalProviderFilter: pure getter; reconciliation moved to one-time
  $effect in +layout
- Provider-filter binding: removed paired $effects + _syncingFilter flag,
  now one-way derived
- entity-cache: separate _refreshing flag for background re-fetches
- api.ts 401 handling: AuthRedirectError class + dedup _redirecting flag,
  goto() instead of window.location.href
- a11y: aria-expanded on mobile More, role=switch + aria-checked on
  Telegram bot toggles

## Tests & operations
- CI pytest gate added to .gitea/workflows/build.yml + release.yml
  (wheel-built install to dodge editable-install slowness)
- /api/ready upgraded to deep healthcheck (db SELECT 1, scheduler.running,
  HA supervisor presence) returning {ready, checks, errors, version}
- /api/metrics endpoint with prometheus_client (deferred_pending,
  event_log_total, dispatch_duration, poll_failures, send_failures)
- New OPERATIONS.md covering deploy, healthchecks, metrics, backup/restore
  procedures, log handling, common scenarios, upgrade flow
- New tests: test_bridge_self (11), test_gitea_parser (9),
  test_planka_parser (6), test_immich_change_detector (6),
  test_backup_roundtrip (1)
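
The deep healthcheck returns `{ready, checks, errors, version}`. One way such a
payload could be assembled is sketched below; the `build_ready_payload` helper,
the callable-per-check shape, and the exact contents of `checks` and `errors`
are assumptions, since the commit message only names the top-level keys:

```python
from typing import Callable


def build_ready_payload(
    checks: dict[str, Callable[[], bool]], version: str
) -> dict[str, object]:
    """Run each named check and fold the outcomes into one response body."""
    results: dict[str, bool] = {}
    errors: dict[str, str] = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception as exc:
            # A crashing check counts as failed; expose only the error class.
            results[name] = False
            errors[name] = type(exc).__name__
    return {
        "ready": all(results.values()),
        "checks": results,
        "errors": errors,
        "version": version,
    }
```

An endpoint would return this body with a 200 when `ready` is true and a 503
otherwise, so orchestrators can act on the status code alone.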

## New feature: bridge self-monitoring
- New bridge_self provider type — internal sink for bridge health events
- Three event types: bridge_self_poll_failures (consecutive tracker poll
  failures), bridge_self_deferred_backlog (pending count crosses
  threshold), bridge_self_target_failures (consecutive 5xx/network
  failures per target)
- Per-user thresholds (defaults: 3 / 100 / 5) configurable via the
  provider config form
- Auto-seeded on user create + /setup + boot backfill for existing users
- Anti-spam: counters reset after emission; backlog uses transition latch
- Self-loop guard: bridge_self failures don't count toward target-failure
  thresholds (logged only) — wire to your own Telegram/Email/Matrix to
  get notified when polls/dispatches/sends fail
- 6 default templates (3 events × 2 locales), tracking config columns
  with backfill migration, frontend descriptor (excluded from "create
  provider" wizard since auto-managed)
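
The two anti-spam mechanisms can be sketched as follows. Both classes are
hypothetical illustrations, and treating "crosses the threshold" as
`>= threshold` is an assumption; the real logic lives in
`services/bridge_self.py` and may differ:

```python
class ConsecutiveFailureCounter:
    """Counts consecutive failures per subject; resets after each alert."""

    def __init__(self, threshold: int) -> None:
        self.threshold = threshold
        self._counts: dict[int, int] = {}

    def record(self, subject_id: int, ok: bool) -> bool:
        """Return True when an alert should be emitted for this subject."""
        if ok:
            self._counts.pop(subject_id, None)  # success breaks the streak
            return False
        count = self._counts.get(subject_id, 0) + 1
        if count >= self.threshold:
            self._counts[subject_id] = 0  # reset after emission
            return True
        self._counts[subject_id] = count
        return False


class BacklogLatch:
    """Fires once when the backlog rises to the threshold, then stays quiet."""

    def __init__(self, threshold: int) -> None:
        self.threshold = threshold
        self._above = False

    def observe(self, pending: int) -> bool:
        above = pending >= self.threshold
        fire = above and not self._above  # only the below-to-above transition
        self._above = above
        return fire
```

With a threshold of 3, a continuously failing tracker alerts on every third
failure rather than on every poll, and a backlog that stays high alerts once
until it drops below the threshold and rises again.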

Operator-visible behavior changes (call out in release notes):
- NOTIFY_BRIDGE_TELEGRAM_WEBHOOK_SECRET now REQUIRED for webhook mode
- Existing webhook providers with auth_mode="none" need explicit opt-in
- Generic webhook endpoint rate-limited 60/min per source IP
- HA disconnect/reconnect writes ha_status_* EventLog rows
- Every user gets a bridge_self provider — wire it to a target to
  receive failure alerts

Pre-existing test failures (test_ssrf, test_release_provider) on
Python 3.13 are unrelated; CI runs on 3.12.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 02:16:49 +03:00
parent 22127e2a59
commit 10d30fc956
97 changed files with 5423 additions and 821 deletions
@@ -71,6 +71,12 @@ class EventType(str, Enum):
HA_SERVICE_CALLED = "ha_service_called"
HA_EVENT_FIRED = "ha_event_fired"
# Bridge self-monitoring events — emitted by the bridge itself when
# internal failures cross configured thresholds.
BRIDGE_SELF_POLL_FAILURES = "bridge_self_poll_failures"
BRIDGE_SELF_DEFERRED_BACKLOG = "bridge_self_deferred_backlog"
BRIDGE_SELF_TARGET_FAILURES = "bridge_self_target_failures"
@dataclass
class ServiceEvent:
@@ -107,6 +107,12 @@ class NotificationDispatcher:
# Optional shared session owned by the caller; when supplied we reuse
# its connection pool instead of opening a fresh per-dispatch session.
self._shared_session = session
# Per-dispatch render cache, keyed by locale. Reset by
# ``_send_to_target`` for each dispatch and filled inside
# ``_message_for_receiver``, so a 100-receiver fan-out renders each
# unique locale once.
# Initialized to empty so handlers called outside the normal
# dispatch path (tests) still see a valid dict.
self._render_cache: dict[str, str] = {}
@contextlib.asynccontextmanager
async def _session_ctx(self) -> AsyncIterator[aiohttp.ClientSession]:
@@ -198,20 +204,49 @@ class NotificationDispatcher:
def _message_for_receiver(
self, receiver: Receiver, default_message: str,
event: ServiceEvent, target: TargetConfig,
cache: dict[str, str] | None = None,
) -> str:
if receiver.locale and receiver.locale != target.locale:
return self._render_message(event, target, receiver.locale)
return default_message
"""Render message respecting receiver locale, with optional cache.
The ``cache`` dict (typically created in ``_send_to_target`` and
threaded through the per-channel ``_send_*`` handlers) memoizes
per-locale renders so a 100-receiver fan-out with two locales
renders twice instead of one hundred times.
"""
loc = receiver.locale or target.locale
if loc == target.locale:
return default_message
if cache is not None:
cached = cache.get(loc)
if cached is not None:
return cached
rendered = self._render_message(event, target, loc)
cache[loc] = rendered
return rendered
return self._render_message(event, target, loc)
async def _send_to_target(
self, event: ServiceEvent, target: TargetConfig
) -> dict[str, Any]:
"""Dispatch to a single target via the registered handler."""
"""Dispatch to a single target via the registered handler.
Creates a fresh per-locale render cache for this dispatch and stashes
it on the instance, where the send handlers read it as
``self._render_cache``. The cache is keyed by receiver locale; the
default locale's render lives in ``default_message`` and is
short-circuited before any cache lookup.
"""
default_message = self._render_message(event, target, target.locale)
send_method = _PROVIDER_HANDLERS.get(target.type)
if send_method is None:
return {"success": False, "error": f"Unknown target type: {target.type}"}
return await send_method(self, target, default_message, event)
# Stash the cache on the dispatcher instance for the duration of
# this dispatch; handlers pass it to _message_for_receiver.
# Avoids changing every _send_* signature. The attribute is already
# annotated in __init__, so it is not re-annotated here.
self._render_cache = {}
try:
return await send_method(self, target, default_message, event)
finally:
self._render_cache = {}
# ------------------------------------------------------------------
# Asset preload (Telegram-specific)
@@ -352,7 +387,7 @@ class NotificationDispatcher:
async def send_one(receiver: Receiver) -> dict[str, Any]:
if not isinstance(receiver, TelegramReceiver) or not receiver.chat_id:
return {"success": False, "error": "Invalid telegram receiver"}
message = self._message_for_receiver(receiver, default_message, event, target)
message = self._message_for_receiver(receiver, default_message, event, target, cache=self._render_cache)
text_result = await client.send_message(
chat_id=receiver.chat_id,
text=message,
@@ -407,7 +442,7 @@ class NotificationDispatcher:
async def send_one(receiver: Receiver) -> dict[str, Any]:
if not isinstance(receiver, WebhookReceiver) or not receiver.url:
return {"success": False, "error": "Invalid webhook receiver"}
message = self._message_for_receiver(receiver, default_message, event, target)
message = self._message_for_receiver(receiver, default_message, event, target, cache=self._render_cache)
payload = {
"message": message,
"event_type": event.event_type.value,
@@ -450,7 +485,7 @@ class NotificationDispatcher:
async def send_one(receiver: Receiver) -> dict[str, Any]:
if not isinstance(receiver, EmailReceiver) or not receiver.email:
return {"success": False, "error": "Invalid email receiver"}
message = self._message_for_receiver(receiver, default_message, event, target)
message = self._message_for_receiver(receiver, default_message, event, target, cache=self._render_cache)
# body_html=None lets EmailClient build a safely-escaped HTML
# alternative from body_text instead of trusting user content.
return await email_client.send(
@@ -479,7 +514,7 @@ class NotificationDispatcher:
async def send_one(receiver: Receiver) -> dict[str, Any]:
if not isinstance(receiver, DiscordReceiver) or not receiver.webhook_url:
return {"success": False, "error": "Invalid discord receiver"}
message = self._message_for_receiver(receiver, default_message, event, target)
message = self._message_for_receiver(receiver, default_message, event, target, cache=self._render_cache)
return await client.send(receiver.webhook_url, message, username=username)
results = await self._fan_out(target.receivers, send_one)
@@ -501,7 +536,7 @@ class NotificationDispatcher:
async def send_one(receiver: Receiver) -> dict[str, Any]:
if not isinstance(receiver, SlackReceiver) or not receiver.webhook_url:
return {"success": False, "error": "Invalid slack receiver"}
message = self._message_for_receiver(receiver, default_message, event, target)
message = self._message_for_receiver(receiver, default_message, event, target, cache=self._render_cache)
return await client.send(receiver.webhook_url, message, username=username)
results = await self._fan_out(target.receivers, send_one)
@@ -530,7 +565,7 @@ class NotificationDispatcher:
async def send_one(receiver: Receiver) -> dict[str, Any]:
if not isinstance(receiver, NtfyReceiver) or not receiver.topic:
return {"success": False, "error": "Invalid ntfy receiver"}
message = self._message_for_receiver(receiver, default_message, event, target)
message = self._message_for_receiver(receiver, default_message, event, target, cache=self._render_cache)
return await client.send(
server_url, receiver.topic, message,
title=title, priority=receiver.priority, auth_token=auth_token,
@@ -563,7 +598,7 @@ class NotificationDispatcher:
async def send_one(receiver: Receiver) -> dict[str, Any]:
if not isinstance(receiver, MatrixReceiver) or not receiver.room_id:
return {"success": False, "error": "Invalid matrix receiver"}
message = self._message_for_receiver(receiver, default_message, event, target)
message = self._message_for_receiver(receiver, default_message, event, target, cache=self._render_cache)
# body_html is the same plain text — Matrix accepts the
# raw message as both ``body`` and ``formatted_body``.
# If templates emit HTML in the future, generate a
@@ -222,21 +222,48 @@ class TelegramClient:
"""SSRF-guarded GET that returns ``(data, error)``.
Validates the URL via ``avalidate_outbound_url`` before any HTTP
traffic. Errors are returned (not raised) and stripped of any
embedded secrets before they propagate to the operator-visible
result dict.
traffic. Redirects are walked manually so each ``Location`` is
re-validated — without this an attacker-controlled origin could
302 to a private-IP target after the initial guard passed.
Errors are returned (not raised) and stripped of any embedded
secrets before they propagate to the operator-visible result
dict.
"""
max_redirects = 3
current_url = url
try:
await avalidate_outbound_url(url)
await avalidate_outbound_url(current_url)
except UnsafeURLError as err:
return None, f"Unsafe URL: {redact_exc(err)}"
try:
async with self._session.get(
url, headers=headers or {}, timeout=_DOWNLOAD_TIMEOUT,
) as resp:
if resp.status != 200:
return None, f"HTTP {resp.status}"
return await resp.read(), None
for _ in range(max_redirects + 1):
async with self._session.get(
current_url,
headers=headers or {},
timeout=_DOWNLOAD_TIMEOUT,
allow_redirects=False,
) as resp:
if resp.status in (301, 302, 303, 307, 308):
loc = resp.headers.get("Location")
if not loc:
return None, f"HTTP {resp.status} without Location header"
# ``resp.url`` is a yarl.URL; ``.join`` resolves
# relative redirects (``/foo/bar``) against it.
from yarl import URL as _URL
try:
next_url = str(resp.url.join(_URL(loc)))
except (ValueError, TypeError):
return None, "Malformed redirect Location"
try:
await avalidate_outbound_url(next_url)
except UnsafeURLError as err:
return None, f"Unsafe redirect: {redact_exc(err)}"
current_url = next_url
continue
if resp.status != 200:
return None, f"HTTP {resp.status}"
return await resp.read(), None
return None, f"Too many redirects (>{max_redirects})"
except (aiohttp.ClientError, asyncio.TimeoutError, OSError) as err:
return None, redact_exc(err)
@@ -22,6 +22,7 @@ class ServiceProviderType(str, Enum):
GOOGLE_PHOTOS = "google_photos"
WEBHOOK = "webhook"
HOME_ASSISTANT = "home_assistant"
BRIDGE_SELF = "bridge_self"
# Callback signature for push-style providers: a coroutine that accepts a
@@ -0,0 +1,39 @@
"""Bridge self-monitoring service provider.
Unlike external providers (Immich, Gitea, NUT, ...), the ``bridge_self``
provider does not connect to any remote service. Its sole purpose is to
give operators a configurable surface (thresholds + notification slots
+ trackers + targets) for events that the bridge itself emits when its
internal subsystems fail.
Three failure conditions are surfaced as :class:`ServiceEvent` instances
through the same dispatch pipeline that all other providers use:
* ``bridge_self_poll_failures`` — N consecutive poll failures for
any tracker exceed the configured threshold.
* ``bridge_self_deferred_backlog`` — pending ``deferred_dispatch`` row
count crosses the configured threshold.
* ``bridge_self_target_failures`` — N consecutive 5xx / network failures
for a single notification target.
Events are constructed by ``services/bridge_self.py`` on the server side
(it owns DB access for looking up the bridge_self provider per user)
and then fed into ``dispatch_provider_event`` like any other event.
"""
from notify_bridge_core.providers.base import ServiceProviderType
from notify_bridge_core.templates.variables import registry
from .event_parser import build_event
from .provider import BRIDGE_SELF_VARIABLES, BridgeSelfServiceProvider
# Register variables so the validator and template-vars API see them.
registry.register_provider_variables(
ServiceProviderType.BRIDGE_SELF, BRIDGE_SELF_VARIABLES,
)
__all__ = [
"BRIDGE_SELF_VARIABLES",
"BridgeSelfServiceProvider",
"build_event",
]
@@ -0,0 +1,89 @@
"""Bridge self-monitoring event parser.
The bridge generates these events from internal subsystems (watcher,
scheduler, dispatcher) — the parser turns a flat payload dict into the
generic :class:`ServiceEvent` shape that the rest of the dispatch
pipeline expects.
Payload shape::
{
"failure_type": "poll_failures" | "deferred_backlog" | "target_failures",
"subject_id": int, # tracker_id, target_id, or 0
"subject_name": str,
"count": int, # consecutive failures or pending count
"threshold": int,
"last_error": str, # may be empty
"details": dict[str, Any], # extra context
}
"""
from __future__ import annotations
from datetime import datetime, timezone
from typing import Any
from notify_bridge_core.models.events import EventType, ServiceEvent
from notify_bridge_core.providers.base import ServiceProviderType
# Defensive cap on the persisted error message; very long tracebacks would
# bloat the EventLog details JSON column otherwise.
_MAX_ERROR_LEN = 1000
_FAILURE_TYPE_TO_EVENT: dict[str, EventType] = {
"poll_failures": EventType.BRIDGE_SELF_POLL_FAILURES,
"deferred_backlog": EventType.BRIDGE_SELF_DEFERRED_BACKLOG,
"target_failures": EventType.BRIDGE_SELF_TARGET_FAILURES,
}
def build_event(
payload: dict[str, Any],
*,
provider_name: str = "Bridge Self-Monitoring",
timestamp: datetime | None = None,
) -> ServiceEvent | None:
"""Convert a self-monitoring payload dict into a ServiceEvent.
Returns None when the payload is not a dict, when ``failure_type`` is
unknown or missing, or when a numeric field cannot be converted to
``int``. Other missing keys fall back to ``0`` / ``""``. The caller
drops a None result without raising, so a misbehaving emitter can
never tip over the dispatch pipeline.
"""
if not isinstance(payload, dict):
return None
failure_type = payload.get("failure_type")
event_type = _FAILURE_TYPE_TO_EVENT.get(str(failure_type) if failure_type else "")
if event_type is None:
return None
try:
subject_id = int(payload.get("subject_id") or 0)
subject_name = str(payload.get("subject_name") or "")
count = int(payload.get("count") or 0)
threshold = int(payload.get("threshold") or 0)
except (TypeError, ValueError, OverflowError):
# Non-numeric subject_id / count / threshold: treat as malformed.
return None
last_error = str(payload.get("last_error") or "")[:_MAX_ERROR_LEN]
details = payload.get("details") if isinstance(payload.get("details"), dict) else {}
when = timestamp or datetime.now(timezone.utc)
return ServiceEvent(
event_type=event_type,
provider_type=ServiceProviderType.BRIDGE_SELF,
provider_name=provider_name,
# ``collection_id`` / ``collection_name`` are required fields on
# ServiceEvent; we use the subject so quiet-hours / dedupe logic
# treats different subjects as distinct streams.
collection_id=str(subject_id),
collection_name=subject_name or str(failure_type),
timestamp=when,
extra={
"failure_type": str(failure_type),
"subject_id": subject_id,
"subject_name": subject_name,
"count": count,
"threshold": threshold,
"last_error": last_error,
"details": dict(details),
},
)
@@ -0,0 +1,148 @@
"""Bridge self-monitoring service provider — emits internal-failure events.
This is a passive provider: it does not connect to anything, never polls,
and never subscribes. It exists so the rest of the bridge's CRUD / config /
template / target plumbing has a single ``ServiceProvider`` to attach
self-monitoring trackers and notification slots to.
Events are constructed by the server-side helper
``services/bridge_self.emit_bridge_self_event`` and pushed into
``dispatch_provider_event`` directly — the provider itself is not asked
to produce events.
"""
from __future__ import annotations
from typing import Any
from notify_bridge_core.models.events import ServiceEvent
from notify_bridge_core.providers.base import (
ServiceProvider,
ServiceProviderType,
)
from notify_bridge_core.templates.variables import TemplateVariableDefinition
# Configuration keys recognised on the bridge_self provider's ``config`` JSON.
DEFAULT_POLL_FAILURE_THRESHOLD = 3
DEFAULT_DEFERRED_BACKLOG_THRESHOLD = 100
DEFAULT_TARGET_FAILURE_THRESHOLD = 5
# Template variables exposed to bridge_self templates.
BRIDGE_SELF_VARIABLES: list[TemplateVariableDefinition] = [
TemplateVariableDefinition(
name="failure_type",
type="string",
description="Which self-monitoring condition fired",
example="poll_failures",
provider_type=ServiceProviderType.BRIDGE_SELF,
),
TemplateVariableDefinition(
name="subject_id",
type="int",
description="ID of the affected entity (tracker_id, target_id, or 0)",
example="42",
provider_type=ServiceProviderType.BRIDGE_SELF,
),
TemplateVariableDefinition(
name="subject_name",
type="string",
description="Human-readable name of the affected entity",
example="My Immich Tracker",
provider_type=ServiceProviderType.BRIDGE_SELF,
),
TemplateVariableDefinition(
name="count",
type="int",
description="Consecutive failure count or current backlog size",
example="3",
provider_type=ServiceProviderType.BRIDGE_SELF,
),
TemplateVariableDefinition(
name="threshold",
type="int",
description="Configured threshold that was crossed",
example="3",
provider_type=ServiceProviderType.BRIDGE_SELF,
),
TemplateVariableDefinition(
name="last_error",
type="string",
description="Last underlying error message (truncated)",
example="Connection refused",
provider_type=ServiceProviderType.BRIDGE_SELF,
),
TemplateVariableDefinition(
name="details",
type="dict",
description="Extra structured context for the event",
example='{"provider_id": 7}',
provider_type=ServiceProviderType.BRIDGE_SELF,
),
]
class BridgeSelfServiceProvider(ServiceProvider):
"""Passive provider — exposes nothing remote, holds only thresholds.
Polling is a no-op and ``connect`` always succeeds; the bridge itself
is what generates events for this provider.
"""
provider_type = ServiceProviderType.BRIDGE_SELF
supports_subscription = False
def __init__(self, name: str = "Bridge Self-Monitoring") -> None:
self._name = name
async def connect(self) -> bool:
return True
async def disconnect(self) -> None:
return None
async def poll(
self,
collection_ids: list[str],
tracker_state: dict[str, Any],
) -> tuple[list[ServiceEvent], dict[str, Any]]:
# No external service to poll. Returning empty keeps the contract
# so accidental scheduling no-ops cleanly.
return [], tracker_state
def get_available_variables(self) -> list[TemplateVariableDefinition]:
return list(BRIDGE_SELF_VARIABLES)
def get_provider_config_schema(self) -> dict[str, Any]:
return {
"type": "object",
"properties": {
"poll_failure_threshold": {
"type": "integer",
"minimum": 1,
"default": DEFAULT_POLL_FAILURE_THRESHOLD,
"description": "Consecutive tracker poll failures before alerting",
},
"deferred_backlog_threshold": {
"type": "integer",
"minimum": 1,
"default": DEFAULT_DEFERRED_BACKLOG_THRESHOLD,
"description": "Pending deferred_dispatch rows before alerting",
},
"target_failure_threshold": {
"type": "integer",
"minimum": 1,
"default": DEFAULT_TARGET_FAILURE_THRESHOLD,
"description": "Consecutive target send failures before alerting",
},
},
"required": [],
}
async def list_collections(self) -> list[dict[str, Any]]:
# No collection concept — operators don't pick anything for this provider.
return []
async def test_connection(self) -> dict[str, Any]:
return {"ok": True, "message": "Bridge self-monitoring is always available"}
@@ -514,6 +514,39 @@ HOME_ASSISTANT_CAPABILITIES = ProviderCapabilities(
)
# ---------------------------------------------------------------------------
# Bridge self-monitoring capabilities
# ---------------------------------------------------------------------------
BRIDGE_SELF_CAPABILITIES = ProviderCapabilities(
provider_type="bridge_self",
display_name="Bridge Self-Monitoring",
webhook_based=False,
supported_filters=[],
notification_slots=[
{
"name": "message_bridge_self_poll_failures",
"description": "Tracker poll failures crossed threshold",
},
{
"name": "message_bridge_self_deferred_backlog",
"description": "Deferred dispatch backlog crossed threshold",
},
{
"name": "message_bridge_self_target_failures",
"description": "Target send failures crossed threshold",
},
],
events=[
{"name": "bridge_self_poll_failures", "description": "Tracker poll failures"},
{"name": "bridge_self_deferred_backlog", "description": "Deferred backlog high"},
{"name": "bridge_self_target_failures", "description": "Target send failures"},
],
command_slots=[],
commands=[],
)
# ---------------------------------------------------------------------------
# Registry
# ---------------------------------------------------------------------------
@@ -527,6 +560,7 @@ _REGISTRY: dict[str, ProviderCapabilities] = {
"google_photos": GOOGLE_PHOTOS_CAPABILITIES,
"webhook": WEBHOOK_CAPABILITIES,
"home_assistant": HOME_ASSISTANT_CAPABILITIES,
"bridge_self": BRIDGE_SELF_CAPABILITIES,
}
@@ -10,7 +10,7 @@ arrive. The lifecycle is owned by the server-side subscription manager
from __future__ import annotations
import logging
from typing import Any
from typing import Any, Callable
import aiohttp
@@ -25,6 +25,12 @@ from notify_bridge_core.templates.variables import TemplateVariableDefinition
from .client import HomeAssistantWSClient
from .event_parser import parse_event
# Status callback signature: ``(state, detail)`` where ``state`` is one of
# ``"connected"`` / ``"disconnected"`` and ``detail`` is an optional already-
# redacted reason string (or None on connect).
StatusChangeCallback = Callable[[str, str | None], None]
_LOGGER = logging.getLogger(__name__)
@@ -229,7 +235,11 @@ class HomeAssistantServiceProvider(ServiceProvider):
# — the subscription manager owns this provider's lifecycle instead.
return [], tracker_state
async def subscribe(self, emit: EventEmitCallback) -> None:
async def subscribe(
self,
emit: EventEmitCallback,
on_status_change: StatusChangeCallback | None = None,
) -> None:
async def _on_event(ha_event: dict[str, Any]) -> None:
event = parse_event(
ha_event,
@@ -252,6 +262,7 @@ class HomeAssistantServiceProvider(ServiceProvider):
on_event=_on_event,
event_types=self._event_types,
refresh_areas=_refresh_areas,
on_status_change=on_status_change,
)
def get_available_variables(self) -> list[TemplateVariableDefinition]:
@@ -29,10 +29,21 @@ _LOGGER = logging.getLogger(__name__)
# calls per poll cycle. TTL is conservative (1h) and a hashed key keeps the
# raw api_key out of dict keys in case of a memory dump.
_USERS_CACHE_TTL_SECONDS = 3600
_users_cache_lock = asyncio.Lock()
# Lazy init: on Python versions before 3.10, ``asyncio.Lock()`` binds to
# whichever event loop is current when it is constructed, which at module
# import is often none, or the wrong one when tests spin up dedicated
# loops. Defer creation to first use.
_users_cache_lock: asyncio.Lock | None = None
_users_cache: dict[str, tuple[float, dict[str, str]]] = {}
def _get_users_cache_lock() -> asyncio.Lock:
"""Return the module users-cache lock, creating it on first call."""
global _users_cache_lock
if _users_cache_lock is None:
_users_cache_lock = asyncio.Lock()
return _users_cache_lock
def _users_cache_key(url: str, api_key: str) -> str:
digest = hashlib.sha256(f"{url}|{api_key}".encode("utf-8")).hexdigest()
return digest[:32]
@@ -51,7 +62,7 @@ async def _get_cached_users(
if entry is not None and (now - entry[0]) < _USERS_CACHE_TTL_SECONDS:
return entry[1]
async with _users_cache_lock:
async with _get_users_cache_lock():
# Re-check after acquiring the lock — another coroutine may have
# refreshed the entry while we waited.
entry = _users_cache.get(key)
@@ -200,10 +200,28 @@ class NutServiceProvider(ServiceProvider):
try:
for ups_name in collection_ids:
prev = tracker_state.get(ups_name, {})
# First-ever observation has no baseline — emitting transition
# events for whatever flags the device happens to carry would
# spam the user with "OB"/"LB"/"REPLBATT" alerts on every fresh
# tracker even when nothing changed. Seed state silently and
# skip event emission until the next poll provides a baseline.
is_first_observation = ups_name not in tracker_state
try:
variables = await client.list_var(ups_name)
data = NutUpsData.from_variables(ups_name, variables)
if is_first_observation:
new_state[ups_name] = {
"name": data.description or ups_name,
"status": data.status,
"battery_charge": data.battery_charge,
"comms_ok": True,
"asset_ids": [],
"pending_asset_ids": [],
"shared": False,
}
continue
# Check for comms restored
if not prev.get("comms_ok", True):
events.append(self._make_event(
@@ -35,6 +35,10 @@ _SENSITIVE_EXTRA_TOKENS: tuple[str, ...] = (
"bearer",
"private_key",
"access_key",
"oauth",
"client_secret",
"webhook_secret",
"csrf",
)
@@ -0,0 +1,6 @@
⚠️ <b>Deferred dispatch backlog high</b>
Pending notifications: <b>{{ count }}</b>
Threshold: <b>{{ threshold }}</b>
{%- if last_error %}
<i>Note:</i> <code>{{ last_error }}</code>
{%- endif %}
@@ -0,0 +1,6 @@
🚨 <b>Tracker poll failures</b>
<b>{{ subject_name }}</b> (id <code>{{ subject_id }}</code>)
<b>{{ count }}</b> consecutive failures (threshold {{ threshold }})
{%- if last_error %}
<i>Last error:</i> <code>{{ last_error }}</code>
{%- endif %}
@@ -0,0 +1,6 @@
📡 <b>Target send failures</b>
<b>{{ subject_name }}</b> (id <code>{{ subject_id }}</code>)
<b>{{ count }}</b> consecutive failures (threshold {{ threshold }})
{%- if last_error %}
<i>Last error:</i> <code>{{ last_error }}</code>
{%- endif %}
@@ -79,6 +79,11 @@ PROVIDER_SLOT_FILE_MAP: dict[str, dict[str, str]] = {
"message_ha_service_called": "ha_service_called.jinja2",
"message_ha_event_fired": "ha_event_fired.jinja2",
},
"bridge_self": {
"message_bridge_self_poll_failures": "bridge_self_poll_failures.jinja2",
"message_bridge_self_deferred_backlog": "bridge_self_deferred_backlog.jinja2",
"message_bridge_self_target_failures": "bridge_self_target_failures.jinja2",
},
}
# Backward-compatible alias
@@ -0,0 +1,6 @@
⚠️ <b>Очередь отложенной отправки растёт</b>
Ожидают отправки: <b>{{ count }}</b>
Порог: <b>{{ threshold }}</b>
{%- if last_error %}
<i>Примечание:</i> <code>{{ last_error }}</code>
{%- endif %}
@@ -0,0 +1,6 @@
🚨 <b>Сбои опроса трекера</b>
<b>{{ subject_name }}</b> (id <code>{{ subject_id }}</code>)
Подряд сбоев: <b>{{ count }}</b> (порог {{ threshold }})
{%- if last_error %}
<i>Последняя ошибка:</i> <code>{{ last_error }}</code>
{%- endif %}
@@ -0,0 +1,6 @@
📡 <b>Сбои отправки в адресат</b>
<b>{{ subject_name }}</b> (id <code>{{ subject_id }}</code>)
Подряд сбоев: <b>{{ count }}</b> (порог {{ threshold }})
{%- if last_error %}
<i>Последняя ошибка:</i> <code>{{ last_error }}</code>
{%- endif %}
@@ -13,6 +13,7 @@ from __future__ import annotations
import logging
import threading
from functools import lru_cache
from typing import Any
import jinja2
@@ -27,6 +28,19 @@ RENDER_TIMEOUT_SECONDS = 2.0
_env = SandboxedEnvironment(autoescape=True)
@lru_cache(maxsize=512)
def _compile_cached(template_str: str) -> jinja2.Template:
"""Compile + cache Jinja2 templates by source text.
Hot paths (NotificationDispatcher fan-out, periodic dispatch) re-render
the same template string for every event; ``_env.from_string`` parses
the source from scratch each time (~ms each). The 512-entry cache is
large enough to hold every template across a busy install while
keeping memory bounded.
"""
return _env.from_string(template_str)
class TemplateRenderTimeout(jinja2.TemplateError):
"""Raised when a template exceeds the configured render budget."""
@@ -74,7 +88,7 @@ def render_template(template_str: str, context: dict[str, Any]) -> str:
)
return "[Template too large]"
try:
compiled = _env.from_string(template_str)
compiled = _compile_cached(template_str)
output = _render_with_timeout(compiled, context)
except TemplateRenderTimeout as e:
_LOGGER.error("Template render timeout: %s", e)
@@ -27,6 +27,9 @@ def validate_template(
"has_oversized_videos", "max_video_size", "max_video_size_mb",
"added_assets", "assets", "albums",
"raw_payload", "event_type_raw", "source_ip",
# bridge_self self-monitoring variables.
"failure_type", "subject_id", "subject_name", "count",
"threshold", "last_error", "details",
}
allowed = available | runtime_vars