feat: production readiness — security, perf, bug fixes, bridge self-monitoring
Comprehensive multi-area pass driven by a parallel 8-agent production
review. Covers frontend, backend, database, security, performance, and
operations, plus a new self-monitoring feature.
## Critical fixes
- Planka webhook: reads bounded raw body (was NameError on every call)
- HA quiet hours: ha_state_changed/automation_triggered/service_called/
event_fired added to deferrable set (were silently dropped)
- DNS-rebinding SSRF: PinnedResolver wired into shared aiohttp session
- Telegram inbound webhook: secret now mandatory (401 without)
- Generic webhook: auth_mode="none" requires explicit
acknowledge_unauthenticated=true; per-IP rate limit 60/min
- svelte-check: 5 null-narrowing errors in EventDetailModal fixed
- Provider hardcoding: Immich-only block extracted to descriptor
featureDiscoveryHint
- command_sync: snapshot+expunge bot before exiting AsyncSession
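The per-IP limit on the generic webhook can be pictured as a sliding window of recent hit timestamps per source address. The sketch below is an illustration only: `PerIpRateLimiter`, its `allow` method, and the injectable `clock` are hypothetical names, not the bridge's actual code, and the real limiter may use a different algorithm.

```python
import time
from collections import defaultdict, deque
from typing import Callable


class PerIpRateLimiter:
    """Sliding-window limiter: at most `limit` hits per `window` seconds per IP."""

    def __init__(
        self,
        limit: int = 60,
        window: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._limit = limit
        self._window = window
        self._clock = clock  # injectable so tests can control time
        self._hits: dict[str, deque] = defaultdict(deque)

    def allow(self, ip: str) -> bool:
        now = self._clock()
        hits = self._hits[ip]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self._window:
            hits.popleft()
        if len(hits) >= self._limit:
            return False  # caller would answer with HTTP 429
        hits.append(now)
        return True
```

Each source address gets its own window, so one noisy sender does not consume the budget of another.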
## Bug fixes
- notifier asyncio.gather(return_exceptions=True) — one bad chat no longer
cancels peer sends
- NotificationDispatcher hoisted out of per-tracker loop
- Provider credential resolution unified across all 5 dispatch sites
- HA asyncio.shield now drains inner task on cancellation
- Provider construction switched from if/elif ladder to factory registry
- NUT first poll seeds silently (no spurious ups_on_battery)
- Quiet-hours gate: event-type-disabled now wins over deferral
- APScheduler drain job ID resolution upgraded to seconds
- HA on_status_change wired through to EventLog
- Webhook payload rollback failures now logged (not swallowed)
- Batched receivers/chats/bots in load_link_data (was per-target N+1)
- flag_modified on JSON column reassignments in deferred_dispatch
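The `return_exceptions=True` change in the notifier can be illustrated with a small fan-out. This is a minimal sketch under assumptions: `send_to_chat` and `send_all` are hypothetical stand-ins for the real per-chat senders. With the flag set, a failing send is returned in place as an exception object instead of being raised out of `gather`, so the outcome of every other send is still collected.

```python
import asyncio


async def send_to_chat(chat_id: int) -> str:
    # Hypothetical sender: chat 2 stands in for a chat that always fails.
    await asyncio.sleep(0)
    if chat_id == 2:
        raise RuntimeError("chat unreachable")
    return f"sent:{chat_id}"


async def send_all(chat_ids: list) -> list:
    outcomes = await asyncio.gather(
        *(send_to_chat(c) for c in chat_ids),
        return_exceptions=True,
    )
    results = []
    for chat_id, outcome in zip(chat_ids, outcomes):
        if isinstance(outcome, BaseException):
            # The failure is recorded per chat instead of aborting the batch.
            results.append({"chat_id": chat_id, "success": False, "error": str(outcome)})
        else:
            results.append({"chat_id": chat_id, "success": True})
    return results
```

Calling `asyncio.run(send_all([1, 2, 3]))` yields one result per chat, with only the second marked as failed.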
## Database
- UNIQUE indexes on service_provider.webhook_token,
telegram_bot.webhook_path_id, partial UNIQUE on telegram_bot.bot_id,
telegram_chat(bot_id, chat_id), notification_tracker_target unique link,
partial UNIQUE on bridge_self provider per user
- Composite ix_event_log_user_event_type_created index
- save_chat_from_webhook switched to ON CONFLICT DO UPDATE
- ondelete=CASCADE on user-id FKs (model annotation; app-side cascade
delete added for existing data)
- delete_notification_tracker converted from N+1 to bulk DELETE/UPDATE
- Module-level asyncio.Lock replaced with lazy _get_lock() pattern
- VACUUM INTO snapshot now PRAGMA integrity_check verified
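The combination of a UNIQUE index on `telegram_chat(bot_id, chat_id)` and `ON CONFLICT DO UPDATE` can be sketched with the standard-library `sqlite3` module. This assumes SQLite 3.24 or newer; the `title` column, the index name, and the `save_chat` helper are made up for illustration and do not reflect the project's real schema or ORM code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE telegram_chat ("
    " id INTEGER PRIMARY KEY,"
    " bot_id INTEGER NOT NULL,"
    " chat_id TEXT NOT NULL,"
    " title TEXT)"
)
# The unique index is what gives ON CONFLICT a target to match against.
conn.execute(
    "CREATE UNIQUE INDEX ux_telegram_chat_bot_chat"
    " ON telegram_chat (bot_id, chat_id)"
)


def save_chat(bot_id: int, chat_id: str, title: str) -> None:
    # Insert a new chat, or update the existing row for the same (bot, chat).
    conn.execute(
        "INSERT INTO telegram_chat (bot_id, chat_id, title) VALUES (?, ?, ?)"
        " ON CONFLICT (bot_id, chat_id) DO UPDATE SET title = excluded.title",
        (bot_id, chat_id, title),
    )


save_chat(1, "-100123", "Ops")
save_chat(1, "-100123", "Ops (renamed)")
rows = conn.execute("SELECT bot_id, chat_id, title FROM telegram_chat").fetchall()
```

Two webhook deliveries for the same chat end up as a single row, without a separate SELECT-then-INSERT race.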
## Performance
- Jinja2 template compilation LRU cached (lru_cache maxsize=512)
- Per-locale render cache in NotificationDispatcher (skips re-rendering
identical content for receivers sharing a locale)
- Tracker list cached per provider_id with 5s TTL + explicit invalidation
on tracker CRUD (relieves HA chat-bus rate query pressure)
- Nav-counts collapsed from 16 round-trips to single UNION ALL
- HA event_log: skip persisting empty assets_added/removed events
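The tracker-list cache (5s TTL plus explicit invalidation on CRUD) follows a common pattern that can be sketched as below. `TtlCache`, `get`, `invalidate`, and the `loader` callback are hypothetical names chosen for this example; the actual cache in the bridge may be structured differently.

```python
import time
from typing import Any, Callable


class TtlCache:
    """Cache values per key for `ttl` seconds, with explicit invalidation."""

    def __init__(self, ttl: float = 5.0, clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl
        self._clock = clock  # injectable so tests can control time
        self._entries: dict = {}  # key -> (stored_at, value)

    def get(self, key: int, loader: Callable[[], Any]) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # fresh hit: skip the database query
        value = loader()
        self._entries[key] = (now, value)
        return value

    def invalidate(self, key: int) -> None:
        # Called from tracker create/update/delete so edits are visible at once.
        self._entries.pop(key, None)
```

The TTL bounds staleness for writes that bypass the invalidation hook, while the explicit `invalidate` call keeps normal CRUD immediately consistent.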
## Security hardening
- Mass-assignment guard on Action create/update; cron sub-minute reject
- Backup JSON depth/node-count cap (depth ≤ 10, nodes ≤ 100k)
- _sanitize_config extended to all JSON-typed fields on backup import
- Telegram _safe_get walks redirects manually with SSRF revalidation
- Bcrypt 72-byte password length cap with clear 422
- Webhook payload body redaction; sensitive substring set extended with
oauth/client_secret/webhook_secret/csrf in both header filter and
template extras filter
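A depth and node-count cap on imported backup JSON could look roughly like this. It is a sketch under assumptions: `check_json_limits` is a hypothetical helper, and the way depth is counted here (the top-level container is depth 1, scalars do not add depth) may differ from the bridge's own definition.

```python
import json
from typing import Any

MAX_DEPTH = 10
MAX_NODES = 100_000


def check_json_limits(value: Any, max_depth: int = MAX_DEPTH, max_nodes: int = MAX_NODES) -> None:
    """Raise ValueError if parsed JSON nests too deeply or has too many nodes."""
    nodes = 0
    # Explicit stack instead of recursion, so the checker itself stays shallow.
    stack = [(value, 1)]
    while stack:
        current, depth = stack.pop()
        nodes += 1
        if nodes > max_nodes:
            raise ValueError(f"too many nodes (> {max_nodes})")
        if isinstance(current, dict):
            children = list(current.values())
        elif isinstance(current, list):
            children = current
        else:
            continue  # scalar: counted as a node, adds no depth
        if depth > max_depth:
            raise ValueError(f"nesting too deep (> {max_depth})")
        stack.extend((child, depth + 1) for child in children)
```

A caller would run `check_json_limits(json.loads(raw))` before handing the payload to the import logic and map `ValueError` to a client error.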
## Frontend
- 76 catch (err: any) sites converted to errMsg(err) helper
- globalProviderFilter: pure getter; reconciliation moved to one-time
$effect in +layout
- Provider-filter binding: removed paired $effects + _syncingFilter flag,
now one-way derived
- entity-cache: separate _refreshing flag for background re-fetches
- api.ts 401 handling: AuthRedirectError class + dedup _redirecting flag,
goto() instead of window.location.href
- a11y: aria-expanded on mobile More, role=switch + aria-checked on
Telegram bot toggles
## Tests & operations
- CI pytest gate added to .gitea/workflows/build.yml + release.yml
(wheel-built install to dodge editable-install slowness)
- /api/ready upgraded to deep healthcheck (db SELECT 1, scheduler.running,
HA supervisor presence) returning {ready, checks, errors, version}
- /api/metrics endpoint with prometheus_client (deferred_pending,
event_log_total, dispatch_duration, poll_failures, send_failures)
- New OPERATIONS.md covering deploy, healthchecks, metrics, backup/restore
procedures, log handling, common scenarios, upgrade flow
- New tests: test_bridge_self (11), test_gitea_parser (9),
test_planka_parser (6), test_immich_change_detector (6),
test_backup_roundtrip (1)
## New feature: bridge self-monitoring
- New bridge_self provider type — internal sink for bridge health events
- Three event types: bridge_self_poll_failures (consecutive tracker poll
failures), bridge_self_deferred_backlog (pending count crosses
threshold), bridge_self_target_failures (consecutive 5xx/network
failures per target)
- Per-user thresholds (defaults: 3 / 100 / 5) configurable via the
provider config form
- Auto-seeded on user create + /setup + boot backfill for existing users
- Anti-spam: counters reset after emission; backlog uses transition latch
- Self-loop guard: bridge_self failures don't count toward target-failure
thresholds (logged only)
- Wire the provider to your own Telegram/Email/Matrix target to get
notified when polls/dispatches/sends fail
- 6 default templates (3 events × 2 locales), tracking config columns
with backfill migration, frontend descriptor (excluded from "create
provider" wizard since auto-managed)
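The two anti-spam mechanisms named above (counters that reset after emission, and a transition latch for the backlog) can be sketched as follows. `FailureCounter` and `BacklogLatch` are hypothetical classes for illustration; treating "crosses the threshold" as `>=` is an assumption, and the real logic lives in the bridge's server-side helper.

```python
from dataclasses import dataclass, field


@dataclass
class FailureCounter:
    """Counts consecutive failures per subject; resets after each alert."""

    threshold: int = 3
    _counts: dict = field(default_factory=dict)

    def record(self, subject_id: int, ok: bool) -> bool:
        """Return True when an alert should be emitted for this subject."""
        if ok:
            self._counts.pop(subject_id, None)  # success breaks the streak
            return False
        count = self._counts.get(subject_id, 0) + 1
        if count >= self.threshold:
            self._counts[subject_id] = 0  # reset after emission (anti-spam)
            return True
        self._counts[subject_id] = count
        return False


@dataclass
class BacklogLatch:
    """Fires once when the backlog rises to the threshold, then stays quiet."""

    threshold: int = 100
    _tripped: bool = False

    def observe(self, pending: int) -> bool:
        above = pending >= self.threshold
        fire = above and not self._tripped  # only the upward transition fires
        self._tripped = above  # re-arms once the backlog drops below again
        return fire
```

With the default poll threshold of 3, a tracker that keeps failing alerts on every third consecutive failure rather than on every poll, and a backlog that stays high alerts once until it recovers.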
Operator-visible behavior changes (call out in release notes):
- NOTIFY_BRIDGE_TELEGRAM_WEBHOOK_SECRET now REQUIRED for webhook mode
- Existing webhook providers with auth_mode="none" need explicit opt-in
- Generic webhook endpoint rate-limited 60/min per source IP
- HA disconnect/reconnect writes ha_status_* EventLog rows
- Every user gets a bridge_self provider — wire it to a target to
receive failure alerts
Pre-existing test failures (test_ssrf, test_release_provider) on
Python 3.13 are unrelated; CI runs on 3.12.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -71,6 +71,12 @@ class EventType(str, Enum):
     HA_SERVICE_CALLED = "ha_service_called"
     HA_EVENT_FIRED = "ha_event_fired"
 
+    # Bridge self-monitoring events — emitted by the bridge itself when
+    # internal failures cross configured thresholds.
+    BRIDGE_SELF_POLL_FAILURES = "bridge_self_poll_failures"
+    BRIDGE_SELF_DEFERRED_BACKLOG = "bridge_self_deferred_backlog"
+    BRIDGE_SELF_TARGET_FAILURES = "bridge_self_target_failures"
+
 
 @dataclass
 class ServiceEvent:
@@ -107,6 +107,12 @@ class NotificationDispatcher:
         # Optional shared session owned by the caller; when supplied we reuse
         # its connection pool instead of opening a fresh per-dispatch session.
         self._shared_session = session
+        # Per-dispatch render cache, keyed by locale. Populated by
+        # ``_send_to_target`` and consumed inside ``_message_for_receiver``
+        # so a 100-receiver fan-out renders each unique locale once.
+        # Initialized to empty so handlers called outside the normal
+        # dispatch path (tests) still see a valid dict.
+        self._render_cache: dict[str, str] = {}
 
     @contextlib.asynccontextmanager
     async def _session_ctx(self) -> AsyncIterator[aiohttp.ClientSession]:
@@ -198,20 +204,49 @@ class NotificationDispatcher:
     def _message_for_receiver(
         self, receiver: Receiver, default_message: str,
         event: ServiceEvent, target: TargetConfig,
+        cache: dict[str, str] | None = None,
     ) -> str:
-        if receiver.locale and receiver.locale != target.locale:
-            return self._render_message(event, target, receiver.locale)
-        return default_message
+        """Render message respecting receiver locale, with optional cache.
+
+        The ``cache`` dict (typically created in ``_send_to_target`` and
+        threaded through the per-channel ``_send_*`` handlers) memoizes
+        per-locale renders so a 100-receiver fan-out with two locales
+        renders twice instead of one hundred times.
+        """
+        loc = receiver.locale or target.locale
+        if loc == target.locale:
+            return default_message
+        if cache is not None:
+            cached = cache.get(loc)
+            if cached is not None:
+                return cached
+            rendered = self._render_message(event, target, loc)
+            cache[loc] = rendered
+            return rendered
+        return self._render_message(event, target, loc)
 
     async def _send_to_target(
         self, event: ServiceEvent, target: TargetConfig
     ) -> dict[str, Any]:
-        """Dispatch to a single target via the registered handler."""
+        """Dispatch to a single target via the registered handler.
+
+        Builds a per-locale render cache once and threads it through the
+        send handler. The cache is keyed by receiver locale; the default
+        locale's render lives in ``default_message`` and is short-circuited
+        before any cache lookup.
+        """
         default_message = self._render_message(event, target, target.locale)
         send_method = _PROVIDER_HANDLERS.get(target.type)
         if send_method is None:
             return {"success": False, "error": f"Unknown target type: {target.type}"}
-        return await send_method(self, target, default_message, event)
+        # Stash the cache on the dispatcher instance for the duration of
+        # this dispatch — handlers pick it up via _message_for_receiver.
+        # Avoids changing every _send_* signature.
+        self._render_cache: dict[str, str] = {}
+        try:
+            return await send_method(self, target, default_message, event)
+        finally:
+            self._render_cache = {}
 
     # ------------------------------------------------------------------
     # Asset preload (Telegram-specific)
@@ -352,7 +387,7 @@ class NotificationDispatcher:
         async def send_one(receiver: Receiver) -> dict[str, Any]:
             if not isinstance(receiver, TelegramReceiver) or not receiver.chat_id:
                 return {"success": False, "error": "Invalid telegram receiver"}
-            message = self._message_for_receiver(receiver, default_message, event, target)
+            message = self._message_for_receiver(receiver, default_message, event, target, cache=self._render_cache)
             text_result = await client.send_message(
                 chat_id=receiver.chat_id,
                 text=message,
@@ -407,7 +442,7 @@ class NotificationDispatcher:
         async def send_one(receiver: Receiver) -> dict[str, Any]:
             if not isinstance(receiver, WebhookReceiver) or not receiver.url:
                 return {"success": False, "error": "Invalid webhook receiver"}
-            message = self._message_for_receiver(receiver, default_message, event, target)
+            message = self._message_for_receiver(receiver, default_message, event, target, cache=self._render_cache)
             payload = {
                 "message": message,
                 "event_type": event.event_type.value,
@@ -450,7 +485,7 @@ class NotificationDispatcher:
         async def send_one(receiver: Receiver) -> dict[str, Any]:
             if not isinstance(receiver, EmailReceiver) or not receiver.email:
                 return {"success": False, "error": "Invalid email receiver"}
-            message = self._message_for_receiver(receiver, default_message, event, target)
+            message = self._message_for_receiver(receiver, default_message, event, target, cache=self._render_cache)
             # body_html=None lets EmailClient build a safely-escaped HTML
             # alternative from body_text instead of trusting user content.
             return await email_client.send(
@@ -479,7 +514,7 @@ class NotificationDispatcher:
         async def send_one(receiver: Receiver) -> dict[str, Any]:
             if not isinstance(receiver, DiscordReceiver) or not receiver.webhook_url:
                 return {"success": False, "error": "Invalid discord receiver"}
-            message = self._message_for_receiver(receiver, default_message, event, target)
+            message = self._message_for_receiver(receiver, default_message, event, target, cache=self._render_cache)
             return await client.send(receiver.webhook_url, message, username=username)
 
         results = await self._fan_out(target.receivers, send_one)
@@ -501,7 +536,7 @@ class NotificationDispatcher:
         async def send_one(receiver: Receiver) -> dict[str, Any]:
             if not isinstance(receiver, SlackReceiver) or not receiver.webhook_url:
                 return {"success": False, "error": "Invalid slack receiver"}
-            message = self._message_for_receiver(receiver, default_message, event, target)
+            message = self._message_for_receiver(receiver, default_message, event, target, cache=self._render_cache)
             return await client.send(receiver.webhook_url, message, username=username)
 
         results = await self._fan_out(target.receivers, send_one)
@@ -530,7 +565,7 @@ class NotificationDispatcher:
         async def send_one(receiver: Receiver) -> dict[str, Any]:
             if not isinstance(receiver, NtfyReceiver) or not receiver.topic:
                 return {"success": False, "error": "Invalid ntfy receiver"}
-            message = self._message_for_receiver(receiver, default_message, event, target)
+            message = self._message_for_receiver(receiver, default_message, event, target, cache=self._render_cache)
             return await client.send(
                 server_url, receiver.topic, message,
                 title=title, priority=receiver.priority, auth_token=auth_token,
@@ -563,7 +598,7 @@ class NotificationDispatcher:
         async def send_one(receiver: Receiver) -> dict[str, Any]:
             if not isinstance(receiver, MatrixReceiver) or not receiver.room_id:
                 return {"success": False, "error": "Invalid matrix receiver"}
-            message = self._message_for_receiver(receiver, default_message, event, target)
+            message = self._message_for_receiver(receiver, default_message, event, target, cache=self._render_cache)
             # body_html is the same plain text — Matrix accepts the
             # raw message as both ``body`` and ``formatted_body``.
             # If templates emit HTML in the future, generate a
@@ -222,21 +222,48 @@ class TelegramClient:
         """SSRF-guarded GET that returns ``(data, error)``.
 
         Validates the URL via ``avalidate_outbound_url`` before any HTTP
-        traffic. Errors are returned (not raised) and stripped of any
-        embedded secrets before they propagate to the operator-visible
-        result dict.
+        traffic. Redirects are walked manually so each ``Location`` is
+        re-validated — without this an attacker-controlled origin could
+        302 to a private-IP target after the initial guard passed.
+        Errors are returned (not raised) and stripped of any embedded
+        secrets before they propagate to the operator-visible result
+        dict.
         """
+        max_redirects = 3
+        current_url = url
         try:
-            await avalidate_outbound_url(url)
+            await avalidate_outbound_url(current_url)
         except UnsafeURLError as err:
             return None, f"Unsafe URL: {redact_exc(err)}"
         try:
-            async with self._session.get(
-                url, headers=headers or {}, timeout=_DOWNLOAD_TIMEOUT,
-            ) as resp:
-                if resp.status != 200:
-                    return None, f"HTTP {resp.status}"
-                return await resp.read(), None
+            for _ in range(max_redirects + 1):
+                async with self._session.get(
+                    current_url,
+                    headers=headers or {},
+                    timeout=_DOWNLOAD_TIMEOUT,
+                    allow_redirects=False,
+                ) as resp:
+                    if resp.status in (301, 302, 303, 307, 308):
+                        loc = resp.headers.get("Location")
+                        if not loc:
+                            return None, f"HTTP {resp.status} without Location header"
+                        # ``resp.url`` is a yarl.URL; ``.join`` resolves
+                        # relative redirects (``/foo/bar``) against it.
+                        from yarl import URL as _URL
+                        try:
+                            next_url = str(resp.url.join(_URL(loc)))
+                        except (ValueError, TypeError):
+                            return None, "Malformed redirect Location"
+                        try:
+                            await avalidate_outbound_url(next_url)
+                        except UnsafeURLError as err:
+                            return None, f"Unsafe redirect: {redact_exc(err)}"
+                        current_url = next_url
+                        continue
+                    if resp.status != 200:
+                        return None, f"HTTP {resp.status}"
+                    return await resp.read(), None
+            return None, f"Too many redirects (>{max_redirects})"
         except (aiohttp.ClientError, asyncio.TimeoutError, OSError) as err:
             return None, redact_exc(err)
 
@@ -22,6 +22,7 @@ class ServiceProviderType(str, Enum):
     GOOGLE_PHOTOS = "google_photos"
     WEBHOOK = "webhook"
     HOME_ASSISTANT = "home_assistant"
+    BRIDGE_SELF = "bridge_self"
 
 
 # Callback signature for push-style providers: a coroutine that accepts a
@@ -0,0 +1,39 @@
+"""Bridge self-monitoring service provider.
+
+Unlike external providers (Immich, Gitea, NUT, ...), the ``bridge_self``
+provider does not connect to any remote service. Its sole purpose is to
+give operators a configurable surface (thresholds + notification slots
++ trackers + targets) for events that the bridge itself emits when its
+internal subsystems fail.
+
+Three failure conditions are surfaced as :class:`ServiceEvent` instances
+through the same dispatch pipeline that all other providers use:
+
+* ``bridge_self_poll_failures`` — N consecutive poll failures for
+  any tracker exceed the configured threshold.
+* ``bridge_self_deferred_backlog`` — pending ``deferred_dispatch`` row
+  count crosses the configured threshold.
+* ``bridge_self_target_failures`` — N consecutive 5xx / network failures
+  for a single notification target.
+
+Events are constructed by ``services/bridge_self.py`` on the server side
+(it owns DB access for looking up the bridge_self provider per user)
+and then fed into ``dispatch_provider_event`` like any other event.
+"""
+
+from notify_bridge_core.providers.base import ServiceProviderType
+from notify_bridge_core.templates.variables import registry
+
+from .event_parser import build_event
+from .provider import BRIDGE_SELF_VARIABLES, BridgeSelfServiceProvider
+
+# Register variables so the validator and template-vars API see them.
+registry.register_provider_variables(
+    ServiceProviderType.BRIDGE_SELF, BRIDGE_SELF_VARIABLES,
+)
+
+__all__ = [
+    "BRIDGE_SELF_VARIABLES",
+    "BridgeSelfServiceProvider",
+    "build_event",
+]
@@ -0,0 +1,89 @@
+"""Bridge self-monitoring event parser.
+
+The bridge generates these events from internal subsystems (watcher,
+scheduler, dispatcher) — the parser turns a flat payload dict into the
+generic :class:`ServiceEvent` shape that the rest of the dispatch
+pipeline expects.
+
+Payload shape::
+
+    {
+        "failure_type": "poll_failures" | "deferred_backlog" | "target_failures",
+        "subject_id": int,  # tracker_id, target_id, or 0
+        "subject_name": str,
+        "count": int,  # consecutive failures or pending count
+        "threshold": int,
+        "last_error": str,  # may be empty
+        "details": dict[str, Any],  # extra context
+    }
+"""
+
+from __future__ import annotations
+
+from datetime import datetime, timezone
+from typing import Any
+
+from notify_bridge_core.models.events import EventType, ServiceEvent
+from notify_bridge_core.providers.base import ServiceProviderType
+
+
+# Defensive cap on the persisted error message; very long tracebacks would
+# bloat the EventLog details JSON column otherwise.
+_MAX_ERROR_LEN = 1000
+
+
+_FAILURE_TYPE_TO_EVENT: dict[str, EventType] = {
+    "poll_failures": EventType.BRIDGE_SELF_POLL_FAILURES,
+    "deferred_backlog": EventType.BRIDGE_SELF_DEFERRED_BACKLOG,
+    "target_failures": EventType.BRIDGE_SELF_TARGET_FAILURES,
+}
+
+
+def build_event(
+    payload: dict[str, Any],
+    *,
+    provider_name: str = "Bridge Self-Monitoring",
+    timestamp: datetime | None = None,
+) -> ServiceEvent | None:
+    """Convert a self-monitoring payload dict into a ServiceEvent.
+
+    Returns None for malformed payloads (unknown failure_type or missing
+    keys) — the caller drops without raising so a misbehaving emitter
+    can never tip over the dispatch pipeline.
+    """
+    if not isinstance(payload, dict):
+        return None
+    failure_type = payload.get("failure_type")
+    event_type = _FAILURE_TYPE_TO_EVENT.get(str(failure_type) if failure_type else "")
+    if event_type is None:
+        return None
+
+    subject_id = int(payload.get("subject_id") or 0)
+    subject_name = str(payload.get("subject_name") or "")
+    count = int(payload.get("count") or 0)
+    threshold = int(payload.get("threshold") or 0)
+    last_error = str(payload.get("last_error") or "")[:_MAX_ERROR_LEN]
+    details = payload.get("details") if isinstance(payload.get("details"), dict) else {}
+
+    when = timestamp or datetime.now(timezone.utc)
+
+    return ServiceEvent(
+        event_type=event_type,
+        provider_type=ServiceProviderType.BRIDGE_SELF,
+        provider_name=provider_name,
+        # ``collection_id`` / ``collection_name`` are required fields on
+        # ServiceEvent; we use the subject so quiet-hours / dedupe logic
+        # treats different subjects as distinct streams.
+        collection_id=str(subject_id),
+        collection_name=subject_name or str(failure_type),
+        timestamp=when,
+        extra={
+            "failure_type": str(failure_type),
+            "subject_id": subject_id,
+            "subject_name": subject_name,
+            "count": count,
+            "threshold": threshold,
+            "last_error": last_error,
+            "details": dict(details),
+        },
+    )
@@ -0,0 +1,148 @@
+"""Bridge self-monitoring service provider — emits internal-failure events.
+
+This is a passive provider: it does not connect to anything, never polls,
+and never subscribes. It exists so the rest of the bridge's CRUD / config /
+template / target plumbing has a single ``ServiceProvider`` to attach
+self-monitoring trackers and notification slots to.
+
+Events are constructed by the server-side helper
+``services/bridge_self.emit_bridge_self_event`` and pushed into
+``dispatch_provider_event`` directly — the provider itself is not asked
+to produce events.
+"""
+
+from __future__ import annotations
+
+from typing import Any
+
+from notify_bridge_core.models.events import ServiceEvent
+from notify_bridge_core.providers.base import (
+    ServiceProvider,
+    ServiceProviderType,
+)
+from notify_bridge_core.templates.variables import TemplateVariableDefinition
+
+
+# Configuration keys recognised on the bridge_self provider's ``config`` JSON.
+DEFAULT_POLL_FAILURE_THRESHOLD = 3
+DEFAULT_DEFERRED_BACKLOG_THRESHOLD = 100
+DEFAULT_TARGET_FAILURE_THRESHOLD = 5
+
+
+# Template variables exposed to bridge_self templates.
+BRIDGE_SELF_VARIABLES: list[TemplateVariableDefinition] = [
+    TemplateVariableDefinition(
+        name="failure_type",
+        type="string",
+        description="Which self-monitoring condition fired",
+        example="poll_failures",
+        provider_type=ServiceProviderType.BRIDGE_SELF,
+    ),
+    TemplateVariableDefinition(
+        name="subject_id",
+        type="int",
+        description="ID of the affected entity (tracker_id, target_id, or 0)",
+        example="42",
+        provider_type=ServiceProviderType.BRIDGE_SELF,
+    ),
+    TemplateVariableDefinition(
+        name="subject_name",
+        type="string",
+        description="Human-readable name of the affected entity",
+        example="My Immich Tracker",
+        provider_type=ServiceProviderType.BRIDGE_SELF,
+    ),
+    TemplateVariableDefinition(
+        name="count",
+        type="int",
+        description="Consecutive failure count or current backlog size",
+        example="3",
+        provider_type=ServiceProviderType.BRIDGE_SELF,
+    ),
+    TemplateVariableDefinition(
+        name="threshold",
+        type="int",
+        description="Configured threshold that was crossed",
+        example="3",
+        provider_type=ServiceProviderType.BRIDGE_SELF,
+    ),
+    TemplateVariableDefinition(
+        name="last_error",
+        type="string",
+        description="Last underlying error message (truncated)",
+        example="Connection refused",
+        provider_type=ServiceProviderType.BRIDGE_SELF,
+    ),
+    TemplateVariableDefinition(
+        name="details",
+        type="dict",
+        description="Extra structured context for the event",
+        example='{"provider_id": 7}',
+        provider_type=ServiceProviderType.BRIDGE_SELF,
+    ),
+]
+
+
+class BridgeSelfServiceProvider(ServiceProvider):
+    """Passive provider — exposes nothing remote, holds only thresholds.
+
+    Polling is a no-op and ``connect`` always succeeds; the bridge itself
+    is what generates events for this provider.
+    """
+
+    provider_type = ServiceProviderType.BRIDGE_SELF
+    supports_subscription = False
+
+    def __init__(self, name: str = "Bridge Self-Monitoring") -> None:
+        self._name = name
+
+    async def connect(self) -> bool:
+        return True
+
+    async def disconnect(self) -> None:
+        return None
+
+    async def poll(
+        self,
+        collection_ids: list[str],
+        tracker_state: dict[str, Any],
+    ) -> tuple[list[ServiceEvent], dict[str, Any]]:
+        # No external service to poll. Returning empty keeps the contract
+        # so accidental scheduling no-ops cleanly.
+        return [], tracker_state
+
+    def get_available_variables(self) -> list[TemplateVariableDefinition]:
+        return list(BRIDGE_SELF_VARIABLES)
+
+    def get_provider_config_schema(self) -> dict[str, Any]:
+        return {
+            "type": "object",
+            "properties": {
+                "poll_failure_threshold": {
+                    "type": "integer",
+                    "minimum": 1,
+                    "default": DEFAULT_POLL_FAILURE_THRESHOLD,
+                    "description": "Consecutive tracker poll failures before alerting",
+                },
+                "deferred_backlog_threshold": {
+                    "type": "integer",
+                    "minimum": 1,
+                    "default": DEFAULT_DEFERRED_BACKLOG_THRESHOLD,
+                    "description": "Pending deferred_dispatch rows before alerting",
+                },
+                "target_failure_threshold": {
+                    "type": "integer",
+                    "minimum": 1,
+                    "default": DEFAULT_TARGET_FAILURE_THRESHOLD,
+                    "description": "Consecutive target send failures before alerting",
+                },
+            },
+            "required": [],
+        }
+
+    async def list_collections(self) -> list[dict[str, Any]]:
+        # No collection concept — operators don't pick anything for this provider.
+        return []
+
+    async def test_connection(self) -> dict[str, Any]:
+        return {"ok": True, "message": "Bridge self-monitoring is always available"}
@@ -514,6 +514,39 @@ HOME_ASSISTANT_CAPABILITIES = ProviderCapabilities(
 )
 
 
+# ---------------------------------------------------------------------------
+# Bridge self-monitoring capabilities
+# ---------------------------------------------------------------------------
+
+BRIDGE_SELF_CAPABILITIES = ProviderCapabilities(
+    provider_type="bridge_self",
+    display_name="Bridge Self-Monitoring",
+    webhook_based=False,
+    supported_filters=[],
+    notification_slots=[
+        {
+            "name": "message_bridge_self_poll_failures",
+            "description": "Tracker poll failures crossed threshold",
+        },
+        {
+            "name": "message_bridge_self_deferred_backlog",
+            "description": "Deferred dispatch backlog crossed threshold",
+        },
+        {
+            "name": "message_bridge_self_target_failures",
+            "description": "Target send failures crossed threshold",
+        },
+    ],
+    events=[
+        {"name": "bridge_self_poll_failures", "description": "Tracker poll failures"},
+        {"name": "bridge_self_deferred_backlog", "description": "Deferred backlog high"},
+        {"name": "bridge_self_target_failures", "description": "Target send failures"},
+    ],
+    command_slots=[],
+    commands=[],
+)
+
+
 # ---------------------------------------------------------------------------
 # Registry
 # ---------------------------------------------------------------------------
@@ -527,6 +560,7 @@ _REGISTRY: dict[str, ProviderCapabilities] = {
     "google_photos": GOOGLE_PHOTOS_CAPABILITIES,
     "webhook": WEBHOOK_CAPABILITIES,
     "home_assistant": HOME_ASSISTANT_CAPABILITIES,
+    "bridge_self": BRIDGE_SELF_CAPABILITIES,
 }
 
 
@@ -10,7 +10,7 @@ arrive. The lifecycle is owned by the server-side subscription manager
 from __future__ import annotations
 
 import logging
-from typing import Any
+from typing import Any, Callable
 
 import aiohttp
 
@@ -25,6 +25,12 @@ from notify_bridge_core.templates.variables import TemplateVariableDefinition
 from .client import HomeAssistantWSClient
 from .event_parser import parse_event
 
+
+# Status callback signature: ``(state, detail)`` where ``state`` is one of
+# ``"connected"`` / ``"disconnected"`` and ``detail`` is an optional already-
+# redacted reason string (or None on connect).
+StatusChangeCallback = Callable[[str, str | None], None]
+
 _LOGGER = logging.getLogger(__name__)
 
 
@@ -229,7 +235,11 @@ class HomeAssistantServiceProvider(ServiceProvider):
         # — the subscription manager owns this provider's lifecycle instead.
         return [], tracker_state
 
-    async def subscribe(self, emit: EventEmitCallback) -> None:
+    async def subscribe(
+        self,
+        emit: EventEmitCallback,
+        on_status_change: StatusChangeCallback | None = None,
+    ) -> None:
         async def _on_event(ha_event: dict[str, Any]) -> None:
             event = parse_event(
                 ha_event,
@@ -252,6 +262,7 @@ class HomeAssistantServiceProvider(ServiceProvider):
             on_event=_on_event,
             event_types=self._event_types,
             refresh_areas=_refresh_areas,
+            on_status_change=on_status_change,
         )
 
     def get_available_variables(self) -> list[TemplateVariableDefinition]:
@@ -29,10 +29,21 @@ _LOGGER = logging.getLogger(__name__)
 # calls per poll cycle. TTL is conservative (1h) and a hashed key keeps the
 # raw api_key out of dict keys in case of a memory dump.
 _USERS_CACHE_TTL_SECONDS = 3600
-_users_cache_lock = asyncio.Lock()
+# Lazy init: ``asyncio.Lock()`` at module import binds to whichever event
+# loop is current at import time (often none, or the wrong one when tests
+# spin up dedicated loops). Defer creation to first use.
+_users_cache_lock: asyncio.Lock | None = None
 _users_cache: dict[str, tuple[float, dict[str, str]]] = {}
 
 
+def _get_users_cache_lock() -> asyncio.Lock:
+    """Return the module users-cache lock, creating it on first call."""
+    global _users_cache_lock
+    if _users_cache_lock is None:
+        _users_cache_lock = asyncio.Lock()
+    return _users_cache_lock
+
+
 def _users_cache_key(url: str, api_key: str) -> str:
     digest = hashlib.sha256(f"{url}|{api_key}".encode("utf-8")).hexdigest()
     return digest[:32]
@@ -51,7 +62,7 @@ async def _get_cached_users(
     if entry is not None and (now - entry[0]) < _USERS_CACHE_TTL_SECONDS:
         return entry[1]
 
-    async with _users_cache_lock:
+    async with _get_users_cache_lock():
         # Re-check after acquiring the lock — another coroutine may have
         # refreshed the entry while we waited.
         entry = _users_cache.get(key)
@@ -200,10 +200,28 @@ class NutServiceProvider(ServiceProvider):
         try:
             for ups_name in collection_ids:
                 prev = tracker_state.get(ups_name, {})
+                # First-ever observation has no baseline — emitting transition
+                # events for whatever flags the device happens to carry would
+                # spam the user with "OB"/"LB"/"REPLBATT" alerts on every fresh
+                # tracker even when nothing changed. Seed state silently and
+                # skip event emission until the next poll provides a baseline.
+                is_first_observation = ups_name not in tracker_state
                 try:
                     variables = await client.list_var(ups_name)
                     data = NutUpsData.from_variables(ups_name, variables)
 
+                    if is_first_observation:
+                        new_state[ups_name] = {
+                            "name": data.description or ups_name,
+                            "status": data.status,
+                            "battery_charge": data.battery_charge,
+                            "comms_ok": True,
+                            "asset_ids": [],
+                            "pending_asset_ids": [],
+                            "shared": False,
+                        }
+                        continue
+
                     # Check for comms restored
                     if not prev.get("comms_ok", True):
                         events.append(self._make_event(
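
Review note on the NUT hunk: the "seed silently on first observation" rule can be shown with a much smaller model. This is a sketch under assumptions, not the provider's code. `diff_status` is a hypothetical helper, state is reduced to the status string only, and events are plain `(ups_name, flag)` tuples rather than the provider's event objects.

```python
def diff_status(
    tracker_state: dict[str, dict[str, str]],
    observed: dict[str, str],
) -> tuple[list[tuple[str, str]], dict[str, dict[str, str]]]:
    """Return (events, new_state) for one poll.

    An event is emitted only for status flags that appear relative to a
    previously stored baseline. A UPS seen for the first time is stored
    without emitting anything.
    """
    events: list[tuple[str, str]] = []
    new_state = dict(tracker_state)
    for ups_name, status in observed.items():
        if ups_name not in tracker_state:
            # No baseline yet: seed state and skip event emission.
            new_state[ups_name] = {"status": status}
            continue
        prev_flags = set(tracker_state[ups_name]["status"].split())
        for flag in sorted(set(status.split()) - prev_flags):
            events.append((ups_name, flag))
        new_state[ups_name] = {"status": status}
    return events, new_state
```

A fresh tracker that first sees `"OB LB"` records it silently; the same flags produce events only when they appear after an `"OL"` baseline.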
@@ -35,6 +35,10 @@ _SENSITIVE_EXTRA_TOKENS: tuple[str, ...] = (
     "bearer",
     "private_key",
     "access_key",
+    "oauth",
+    "client_secret",
+    "webhook_secret",
+    "csrf",
 )
 
 
@@ -0,0 +1,6 @@
+⚠️ <b>Deferred dispatch backlog high</b>
+Pending notifications: <b>{{ count }}</b>
+Threshold: <b>{{ threshold }}</b>
+{%- if last_error %}
+<i>Note:</i> <code>{{ last_error }}</code>
+{%- endif %}
@@ -0,0 +1,6 @@
+🚨 <b>Tracker poll failures</b>
+<b>{{ subject_name }}</b> (id <code>{{ subject_id }}</code>)
+<b>{{ count }}</b> consecutive failures (threshold {{ threshold }})
+{%- if last_error %}
+<i>Last error:</i> <code>{{ last_error }}</code>
+{%- endif %}
@@ -0,0 +1,6 @@
+📡 <b>Target send failures</b>
+<b>{{ subject_name }}</b> (id <code>{{ subject_id }}</code>)
+<b>{{ count }}</b> consecutive failures (threshold {{ threshold }})
+{%- if last_error %}
+<i>Last error:</i> <code>{{ last_error }}</code>
+{%- endif %}
@@ -79,6 +79,11 @@ PROVIDER_SLOT_FILE_MAP: dict[str, dict[str, str]] = {
         "message_ha_service_called": "ha_service_called.jinja2",
         "message_ha_event_fired": "ha_event_fired.jinja2",
     },
+    "bridge_self": {
+        "message_bridge_self_poll_failures": "bridge_self_poll_failures.jinja2",
+        "message_bridge_self_deferred_backlog": "bridge_self_deferred_backlog.jinja2",
+        "message_bridge_self_target_failures": "bridge_self_target_failures.jinja2",
+    },
 }
 
 # Backward-compatible alias
@@ -0,0 +1,6 @@
+⚠️ <b>Очередь отложенной отправки растёт</b>
+Ожидают отправки: <b>{{ count }}</b>
+Порог: <b>{{ threshold }}</b>
+{%- if last_error %}
+<i>Примечание:</i> <code>{{ last_error }}</code>
+{%- endif %}
@@ -0,0 +1,6 @@
+🚨 <b>Сбои опроса трекера</b>
+<b>{{ subject_name }}</b> (id <code>{{ subject_id }}</code>)
+Подряд сбоев: <b>{{ count }}</b> (порог {{ threshold }})
+{%- if last_error %}
+<i>Последняя ошибка:</i> <code>{{ last_error }}</code>
+{%- endif %}
@@ -0,0 +1,6 @@
+📡 <b>Сбои отправки в адресат</b>
+<b>{{ subject_name }}</b> (id <code>{{ subject_id }}</code>)
+Подряд сбоев: <b>{{ count }}</b> (порог {{ threshold }})
+{%- if last_error %}
+<i>Последняя ошибка:</i> <code>{{ last_error }}</code>
+{%- endif %}
@@ -13,6 +13,7 @@ from __future__ import annotations
 
 import logging
 import threading
+from functools import lru_cache
 from typing import Any
 
 import jinja2
@@ -27,6 +28,19 @@ RENDER_TIMEOUT_SECONDS = 2.0
 _env = SandboxedEnvironment(autoescape=True)
 
 
+@lru_cache(maxsize=512)
+def _compile_cached(template_str: str) -> jinja2.Template:
+    """Compile + cache Jinja2 templates by source text.
+
+    Hot paths (NotificationDispatcher fan-out, periodic dispatch) re-render
+    the same template string for every event; ``_env.from_string`` parses
+    the source from scratch each time (~ms each). The 512-entry cache is
+    large enough to hold every template across a busy install while
+    keeping memory bounded.
+    """
+    return _env.from_string(template_str)
+
+
 class TemplateRenderTimeout(jinja2.TemplateError):
     """Raised when a template exceeds the configured render budget."""
 
@@ -74,7 +88,7 @@ def render_template(template_str: str, context: dict[str, Any]) -> str:
         )
         return "[Template too large]"
     try:
-        compiled = _env.from_string(template_str)
+        compiled = _compile_cached(template_str)
         output = _render_with_timeout(compiled, context)
     except TemplateRenderTimeout as e:
         _LOGGER.error("Template render timeout: %s", e)
@@ -27,6 +27,9 @@ def validate_template(
         "has_oversized_videos", "max_video_size", "max_video_size_mb",
         "added_assets", "assets", "albums",
         "raw_payload", "event_type_raw", "source_ip",
+        # bridge_self self-monitoring variables.
+        "failure_type", "subject_id", "subject_name", "count",
+        "threshold", "last_error", "details",
     }
     allowed = available | runtime_vars
 