feat: production readiness — security, perf, bug fixes, bridge self-monitoring

Comprehensive multi-area pass driven by a parallel 8-agent production
review. It covers frontend, backend, database, security, performance, and
operations, plus a new self-monitoring feature.

## Critical fixes
- Planka webhook: reads bounded raw body (was NameError on every call)
- HA quiet hours: ha_state_changed/automation_triggered/service_called/
  event_fired added to deferrable set (were silently dropped)
- DNS-rebinding SSRF: PinnedResolver wired into shared aiohttp session
- Telegram inbound webhook: secret now mandatory (401 without)
- Generic webhook: auth_mode="none" requires explicit
  acknowledge_unauthenticated=true; per-IP rate limit 60/min
- svelte-check: 5 null-narrowing errors in EventDetailModal fixed
- Provider hardcoding: Immich-only block extracted to descriptor
  featureDiscoveryHint
- command_sync: snapshot+expunge bot before exiting AsyncSession
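
For reviewers: the per-IP limit on the generic webhook (60/min) can be pictured as a sliding window keyed by source IP. This is a minimal sketch of that idea, not the code in this commit; `SlidingWindowLimiter`, `allow`, and the injectable `clock` are hypothetical names chosen for illustration.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` hits per `window` seconds per key (hypothetical helper)."""

    def __init__(self, limit: int = 60, window: float = 60.0, clock=time.monotonic) -> None:
        self.limit = limit
        self.window = window
        self._clock = clock  # injectable so tests can control time
        self._hits = defaultdict(deque)  # key -> timestamps of accepted hits

    def allow(self, key: str) -> bool:
        now = self._clock()
        hits = self._hits[key]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False  # caller would answer with HTTP 429
        hits.append(now)
        return True
```

A handler would call `allow(source_ip)` before reading the body and reject the request when it returns `False`. Idle keys are never evicted in this sketch, so a long-running process would need some cleanup on top.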

## Bug fixes
- notifier asyncio.gather(return_exceptions=True) — one bad chat no longer
  cancels peer sends
- NotificationDispatcher hoisted out of per-tracker loop
- Provider credential resolution unified across all 5 dispatch sites
- HA asyncio.shield now drains inner task on cancellation
- Provider construction switched from if/elif ladder to factory registry
- NUT first poll seeds silently (no spurious ups_on_battery)
- Quiet-hours gate: event-type-disabled now wins over deferral
- APScheduler drain job ID resolution upgraded to seconds
- HA on_status_change wired through to EventLog
- Webhook payload rollback failures now logged (not swallowed)
- Batched receivers/chats/bots in load_link_data (was per-target N+1)
- flag_modified on JSON column reassignments in deferred_dispatch
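
The notifier change relies on `asyncio.gather(..., return_exceptions=True)`, which returns failures as values so every peer result stays available to the caller. A minimal sketch of the pattern follows; `send` and `send_all` are hypothetical stand-ins, and the result dict shape is an assumption modelled on the `success`/`error` keys visible in the diff.

```python
import asyncio


async def send(chat_id: int) -> str:
    """Stand-in for a per-chat send; chat 2 simulates a failing chat."""
    await asyncio.sleep(0)
    if chat_id == 2:
        raise RuntimeError(f"chat {chat_id} rejected the message")
    return f"sent to {chat_id}"


async def send_all(chat_ids: list[int]) -> list[dict]:
    # With return_exceptions=True, gather() hands back exceptions in the
    # result list instead of raising the first one to the caller.
    outcomes = await asyncio.gather(
        *(send(chat_id) for chat_id in chat_ids),
        return_exceptions=True,
    )
    results = []
    for chat_id, outcome in zip(chat_ids, outcomes):
        if isinstance(outcome, BaseException):
            results.append({"chat_id": chat_id, "success": False, "error": str(outcome)})
        else:
            results.append({"chat_id": chat_id, "success": True})
    return results
```

Running `asyncio.run(send_all([1, 2, 3]))` yields one entry per chat, with only the middle one marked as failed.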

## Database
- UNIQUE indexes on service_provider.webhook_token,
  telegram_bot.webhook_path_id, partial UNIQUE on telegram_bot.bot_id,
  telegram_chat(bot_id, chat_id), notification_tracker_target unique link,
  partial UNIQUE on bridge_self provider per user
- Composite ix_event_log_user_event_type_created index
- save_chat_from_webhook switched to ON CONFLICT DO UPDATE
- ondelete=CASCADE on user-id FKs (model annotation; app-side cascade
  delete added for existing data)
- delete_notification_tracker converted from N+1 to bulk DELETE/UPDATE
- Module-level asyncio.Lock replaced with lazy _get_lock() pattern
- VACUUM INTO snapshot now PRAGMA integrity_check verified
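
The `ON CONFLICT DO UPDATE` switch depends on the new UNIQUE index on `telegram_chat(bot_id, chat_id)`. The sketch below shows the interaction with the standard-library `sqlite3` module; the `title` column, the index name `uq_chat`, and the `save_chat` helper are assumptions for illustration, and the real `save_chat_from_webhook` goes through the ORM.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE telegram_chat ("
    " id INTEGER PRIMARY KEY,"
    " bot_id INTEGER NOT NULL,"
    " chat_id TEXT NOT NULL,"
    " title TEXT)"
)
# The upsert needs a unique constraint to detect the conflict.
conn.execute("CREATE UNIQUE INDEX uq_chat ON telegram_chat (bot_id, chat_id)")


def save_chat(bot_id: int, chat_id: str, title: str) -> None:
    """Insert the chat, or refresh its title if the pair already exists."""
    conn.execute(
        "INSERT INTO telegram_chat (bot_id, chat_id, title) VALUES (?, ?, ?) "
        "ON CONFLICT (bot_id, chat_id) DO UPDATE SET title = excluded.title",
        (bot_id, chat_id, title),
    )


save_chat(1, "-100123", "Family")
save_chat(1, "-100123", "Family (renamed)")  # second webhook for the same chat
rows = conn.execute("SELECT bot_id, chat_id, title FROM telegram_chat").fetchall()
```

Two concurrent webhooks for the same chat then resolve to a single row rather than a duplicate or an integrity error.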

## Performance
- Jinja2 template compilation LRU cached (lru_cache maxsize=512)
- Per-locale render cache in NotificationDispatcher (skips re-rendering
  identical content for receivers sharing a locale)
- Tracker list cached per provider_id with 5s TTL + explicit invalidation
  on tracker CRUD (relieves HA chat-bus rate query pressure)
- Nav-counts collapsed from 16 round-trips to single UNION ALL
- HA event_log: skip persisting empty assets_added/removed events
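
The tracker-list cache combines a short TTL with explicit invalidation on CRUD. A minimal sketch of that combination, assuming a synchronous loader for brevity; `TrackerListCache`, `loader`, and `clock` are hypothetical names, and the actual cache in this commit may be structured differently.

```python
import time
from typing import Callable


class TrackerListCache:
    """Per-provider cache with a TTL and explicit invalidation (illustrative)."""

    def __init__(self, loader: Callable[[int], list], ttl: float = 5.0, clock=time.monotonic) -> None:
        self._loader = loader
        self._ttl = ttl
        self._clock = clock
        self._entries = {}  # provider_id -> (stored_at, value)

    def get(self, provider_id: int) -> list:
        now = self._clock()
        entry = self._entries.get(provider_id)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # still fresh: skip the query
        value = self._loader(provider_id)
        self._entries[provider_id] = (now, value)
        return value

    def invalidate(self, provider_id: int) -> None:
        # Called from tracker create/update/delete so edits show up
        # immediately instead of after the TTL expires.
        self._entries.pop(provider_id, None)
```

The TTL bounds staleness for writes that bypass the invalidation hook, while `invalidate` keeps ordinary CRUD edits visible at once.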

## Security hardening
- Mass-assignment guard on Action create/update; cron sub-minute reject
- Backup JSON depth/node-count cap (depth ≤ 10, nodes ≤ 100k)
- _sanitize_config extended to all JSON-typed fields on backup import
- Telegram _safe_get walks redirects manually with SSRF revalidation
- Bcrypt 72-byte password length cap with clear 422
- Webhook payload body redaction; sensitive substring set extended with
  oauth/client_secret/webhook_secret/csrf in both header filter and
  template extras filter
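
The backup depth/node-count cap can be checked with an iterative walk over the parsed document. This is a sketch under assumptions: `check_json_limits` is a hypothetical name, the top-level container is counted as depth 1, and every container and scalar counts as one node. The commit may count differently.

```python
import json
from typing import Any

MAX_DEPTH = 10
MAX_NODES = 100_000


def check_json_limits(value: Any, max_depth: int = MAX_DEPTH, max_nodes: int = MAX_NODES) -> int:
    """Return the node count, or raise ValueError when a cap is exceeded."""
    nodes = 0
    stack = [(value, 1)]  # explicit stack, so the walk itself cannot recurse too deep
    while stack:
        current, depth = stack.pop()
        nodes += 1
        if nodes > max_nodes:
            raise ValueError(f"backup has more than {max_nodes} nodes")
        if isinstance(current, (dict, list)):
            if depth > max_depth:
                raise ValueError(f"backup nesting exceeds depth {max_depth}")
            children = current.values() if isinstance(current, dict) else current
            stack.extend((child, depth + 1) for child in children)
    return nodes


sample = json.loads('{"a": [1, 2, {"b": null}]}')
node_count = check_json_limits(sample)
```

Note that this runs after `json.loads`, which can itself raise `RecursionError` on extremely deep input, so an importer would likely need to handle that case as well.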

## Frontend
- 76 catch (err: any) sites converted to errMsg(err) helper
- globalProviderFilter: pure getter; reconciliation moved to one-time
  $effect in +layout
- Provider-filter binding: removed paired $effects + _syncingFilter flag,
  now one-way derived
- entity-cache: separate _refreshing flag for background re-fetches
- api.ts 401 handling: AuthRedirectError class + dedup _redirecting flag,
  goto() instead of window.location.href
- a11y: aria-expanded on mobile More, role=switch + aria-checked on
  Telegram bot toggles

## Tests & operations
- CI pytest gate added to .gitea/workflows/build.yml + release.yml
  (wheel-built install to dodge editable-install slowness)
- /api/ready upgraded to deep healthcheck (db SELECT 1, scheduler.running,
  HA supervisor presence) returning {ready, checks, errors, version}
- /api/metrics endpoint with prometheus_client (deferred_pending,
  event_log_total, dispatch_duration, poll_failures, send_failures)
- New OPERATIONS.md covering deploy, healthchecks, metrics, backup/restore
  procedures, log handling, common scenarios, upgrade flow
- New tests: test_bridge_self (11), test_gitea_parser (9),
  test_planka_parser (6), test_immich_change_detector (6),
  test_backup_roundtrip (1)

## New feature: bridge self-monitoring
- New bridge_self provider type — internal sink for bridge health events
- Three event types: bridge_self_poll_failures (consecutive tracker poll
  failures), bridge_self_deferred_backlog (pending count crosses
  threshold), bridge_self_target_failures (consecutive 5xx/network
  failures per target)
- Per-user thresholds (defaults: 3 / 100 / 5) configurable via the
  provider config form
- Auto-seeded on user create + /setup + boot backfill for existing users
- Anti-spam: counters reset after emission; backlog uses transition latch
- Self-loop guard: bridge_self failures don't count toward target-failure
  thresholds (logged only) — wire to your own Telegram/Email/Matrix to
  get notified when polls/dispatches/sends fail
- 6 default templates (3 events × 2 locales), tracking config columns
  with backfill migration, frontend descriptor (excluded from "create
  provider" wizard since auto-managed)
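
The two anti-spam rules above (counters reset after emission, backlog uses a transition latch) can be sketched as follows. `FailureStreak` and `BacklogLatch` are hypothetical names, and treating "crosses threshold" as `>=` is an assumption; the real logic lives in the `bridge_self` module, which this page does not show.

```python
class FailureStreak:
    """Consecutive-failure counter that resets once it has fired."""

    def __init__(self, threshold: int) -> None:
        self.threshold = threshold
        self._counts = {}  # key (tracker or target id) -> current streak

    def record_failure(self, key: int) -> bool:
        """Return True when an alert should be emitted for this key."""
        count = self._counts.get(key, 0) + 1
        if count >= self.threshold:
            self._counts[key] = 0  # reset so the next alert needs a full new streak
            return True
        self._counts[key] = count
        return False

    def record_success(self, key: int) -> None:
        self._counts.pop(key, None)


class BacklogLatch:
    """Fire once on the transition into the over-threshold state."""

    def __init__(self, threshold: int) -> None:
        self.threshold = threshold
        self._latched = False

    def observe(self, pending: int) -> bool:
        over = pending >= self.threshold
        fire = over and not self._latched
        self._latched = over  # re-arms only after the backlog drops below threshold
        return fire
```

With the default thresholds (3 / 100 / 5), a tracker that keeps failing would alert on every third consecutive failure, and a backlog that stays above 100 would alert once until it drains.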

Operator-visible behavior changes (call out in release notes):
- NOTIFY_BRIDGE_TELEGRAM_WEBHOOK_SECRET now REQUIRED for webhook mode
- Existing webhook providers with auth_mode="none" need explicit opt-in
- Generic webhook endpoint rate-limited 60/min per source IP
- HA disconnect/reconnect writes ha_status_* EventLog rows
- Every user gets a bridge_self provider — wire it to a target to
  receive failure alerts

Pre-existing test failures (test_ssrf, test_release_provider) on
Python 3.13 are unrelated; CI runs on 3.12.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 02:16:49 +03:00
parent 22127e2a59
commit 10d30fc956
97 changed files with 5423 additions and 821 deletions
@@ -4,7 +4,7 @@ from __future__ import annotations
import asyncio
import logging
from typing import Any
from typing import Any, Awaitable, Callable
from sqlmodel import select
from sqlmodel.ext.asyncio.session import AsyncSession
@@ -12,6 +12,7 @@ from sqlmodel.ext.asyncio.session import AsyncSession
from notify_bridge_core.models.events import ServiceEvent
from notify_bridge_core.notifications.dispatcher import NotificationDispatcher, TargetConfig
from notify_bridge_core.notifications.telegram.cache import TelegramFileCache
from notify_bridge_core.providers.capabilities import get_capabilities
from notify_bridge_core.storage import JsonFileBackend
from ..database.engine import get_engine
@@ -27,6 +28,7 @@ from .dispatch_helpers import (
evaluate_event_gate,
get_app_timezone,
load_link_data,
resolve_provider_credential,
)
_LOGGER = logging.getLogger(__name__)
@@ -34,7 +36,18 @@ _LOGGER = logging.getLogger(__name__)
# Module-level Telegram file caches — shared across dispatches for reuse
_url_cache: TelegramFileCache | None = None
_asset_cache: TelegramFileCache | None = None
_cache_lock = asyncio.Lock()
# Lazy init: creating ``asyncio.Lock()`` at module import time binds the
# lock to whichever event loop is current at import (often none / the wrong
# one when tests fire up dedicated loops). Defer until first use.
_cache_lock: asyncio.Lock | None = None
def _get_cache_lock() -> asyncio.Lock:
"""Return the module cache lock, creating it on first call."""
global _cache_lock
if _cache_lock is None:
_cache_lock = asyncio.Lock()
return _cache_lock
async def _load_cache_settings() -> tuple[int, int]:
@@ -68,7 +81,7 @@ async def _get_telegram_caches() -> tuple[TelegramFileCache | None, TelegramFile
global _url_cache, _asset_cache
if _url_cache is not None:
return _url_cache, _asset_cache
async with _cache_lock:
async with _get_cache_lock():
# Double-check after acquiring lock
if _url_cache is not None:
return _url_cache, _asset_cache
@@ -108,7 +121,7 @@ async def reset_telegram_caches_in_memory() -> None:
deletes cached file_ids.
"""
global _url_cache, _asset_cache
async with _cache_lock:
async with _get_cache_lock():
_url_cache = None
_asset_cache = None
_LOGGER.info("Reset Telegram cache refs in memory (files preserved)")
@@ -135,7 +148,7 @@ async def clear_telegram_caches() -> dict[str, Any]:
Returns a summary with the paths that were removed.
"""
global _url_cache, _asset_cache
async with _cache_lock:
async with _get_cache_lock():
removed: list[str] = []
for cache, label in ((_url_cache, "url"), (_asset_cache, "asset")):
if cache is not None:
@@ -163,6 +176,90 @@ async def clear_telegram_caches() -> dict[str, Any]:
return {"cleared": True, "removed": removed}
# ---------------------------------------------------------------------------
# Provider polling registry
# ---------------------------------------------------------------------------
#
# Each registered factory returns (events, new_state). Replaces the long
# ``if provider_type == ...`` chain in ``check_tracker``. New pollable
# providers register here; webhook-only providers are short-circuited above
# via ``capabilities.webhook_based``.
class _PollerConnectError(Exception):
"""Raised by a poller factory when initial provider connection fails."""
def __init__(self, reason: str) -> None:
super().__init__(reason)
self.reason = reason
PollResult = tuple[list[ServiceEvent], dict[str, Any]]
PollerFactory = Callable[..., Awaitable[PollResult]]
async def _poll_immich(*, provider_config, provider_name, collection_ids, state_dict, **_kw) -> PollResult:
from notify_bridge_core.providers.immich import ImmichServiceProvider
from .http_session import get_http_session
http_session = await get_http_session()
immich = ImmichServiceProvider(
http_session,
provider_config.get("url", ""),
provider_config.get("api_key", ""),
provider_config.get("external_domain"),
provider_name,
)
if not await immich.connect():
raise _PollerConnectError("failed to connect to provider")
return await immich.poll(collection_ids, state_dict)
async def _poll_scheduler(*, provider_name, tracker_name, tracker_filters, collection_ids, state_dict, app_tz, **_kw) -> PollResult:
from notify_bridge_core.providers.scheduler import SchedulerServiceProvider
sched = SchedulerServiceProvider(
name=provider_name,
tracker_name=tracker_name,
custom_variables=tracker_filters.get("custom_variables", {}),
timezone_name=app_tz,
)
return await sched.poll(collection_ids, state_dict)
async def _poll_nut(*, provider_config, provider_name, collection_ids, state_dict, **_kw) -> PollResult:
from notify_bridge_core.providers.nut import NutServiceProvider
nut = NutServiceProvider(
host=provider_config.get("host", "localhost"),
port=provider_config.get("port", 3493),
username=provider_config.get("username"),
password=provider_config.get("password"),
name=provider_name,
)
return await nut.poll(collection_ids, state_dict)
async def _poll_google_photos(*, provider_config, provider_name, collection_ids, state_dict, **_kw) -> PollResult:
from notify_bridge_core.providers.google_photos import GooglePhotosServiceProvider
from .http_session import get_http_session
http_session = await get_http_session()
gp = GooglePhotosServiceProvider(
http_session,
provider_config.get("client_id", ""),
provider_config.get("client_secret", ""),
provider_config.get("refresh_token", ""),
provider_name,
)
if not await gp.connect():
raise _PollerConnectError("failed to connect to Google Photos")
return await gp.poll(collection_ids, state_dict)
_POLL_FACTORIES: dict[str, PollerFactory] = {
"immich": _poll_immich,
"scheduler": _poll_scheduler,
"nut": _poll_nut,
"google_photos": _poll_google_photos,
}
async def check_tracker(tracker_id: int) -> dict[str, Any]:
"""Poll a tracker's provider for changes and dispatch notifications."""
engine = get_engine()
@@ -223,70 +320,61 @@ async def check_tracker(tracker_id: int) -> dict[str, Any]:
events: list[ServiceEvent] = []
new_state: dict[str, Any] = {}
if provider_type == "immich":
from notify_bridge_core.providers.immich import ImmichServiceProvider
from .http_session import get_http_session
http_session = await get_http_session()
immich = ImmichServiceProvider(
http_session,
provider_config.get("url", ""),
provider_config.get("api_key", ""),
provider_config.get("external_domain"),
provider_name,
)
connected = await immich.connect()
if not connected:
return {"status": "error", "reason": "failed to connect to provider"}
# Webhook-only providers: capabilities.webhook_based short-circuits the
# poll path. Inbound events arrive via the /api/webhooks/* endpoints.
caps = get_capabilities(provider_type)
if caps is not None and caps.webhook_based:
return {"status": "ok", "events_detected": 0, "collections_checked": 0}
events, new_state = await immich.poll(collection_ids, state_dict)
elif provider_type == "gitea":
# Gitea is webhook-based — events arrive via /api/webhooks/gitea endpoint.
# The scheduler still calls check_tracker but there's nothing to poll.
return {"status": "ok", "events_detected": 0, "collections_checked": 0}
elif provider_type == "planka":
# Planka is webhook-based — events arrive via /api/webhooks/planka endpoint.
return {"status": "ok", "events_detected": 0, "collections_checked": 0}
elif provider_type == "scheduler":
from notify_bridge_core.providers.scheduler import SchedulerServiceProvider
custom_vars = tracker_filters.get("custom_variables", {})
sched = SchedulerServiceProvider(
name=provider_name,
tracker_name=tracker_name,
custom_variables=custom_vars,
timezone_name=app_tz,
)
events, new_state = await sched.poll(collection_ids, state_dict)
elif provider_type == "nut":
from notify_bridge_core.providers.nut import NutServiceProvider
nut = NutServiceProvider(
host=provider_config.get("host", "localhost"),
port=provider_config.get("port", 3493),
username=provider_config.get("username"),
password=provider_config.get("password"),
name=provider_name,
)
events, new_state = await nut.poll(collection_ids, state_dict)
elif provider_type == "google_photos":
from notify_bridge_core.providers.google_photos import GooglePhotosServiceProvider
from .http_session import get_http_session
http_session = await get_http_session()
gp = GooglePhotosServiceProvider(
http_session,
provider_config.get("client_id", ""),
provider_config.get("client_secret", ""),
provider_config.get("refresh_token", ""),
provider_name,
)
connected = await gp.connect()
if not connected:
return {"status": "error", "reason": "failed to connect to Google Photos"}
events, new_state = await gp.poll(collection_ids, state_dict)
elif provider_type == "webhook":
# Webhook providers receive events via inbound HTTP; no polling needed.
return {"status": "ok", "events_detected": 0, "collections_checked": 0}
else:
poller = _POLL_FACTORIES.get(provider_type)
if poller is None:
return {"status": "error", "reason": f"unsupported provider type: {provider_type}"}
try:
events, new_state = await poller(
provider_config=provider_config,
provider_name=provider_name,
tracker_name=tracker_name,
tracker_filters=tracker_filters,
collection_ids=collection_ids,
state_dict=state_dict,
app_tz=app_tz,
)
except _PollerConnectError as exc:
# Track consecutive poll failures so the bridge_self provider can
# alert when a tracker stops responding. The emission is async
# but cheap; we await it inline so its DB writes happen before
# check_tracker returns to the scheduler.
from .bridge_self import maybe_emit_poll_failure
try:
await maybe_emit_poll_failure(
tracker_id=tracker_id,
tracker_name=tracker_name,
error=exc.reason,
)
except Exception: # noqa: BLE001
_LOGGER.exception("bridge_self poll-failure emission failed")
return {"status": "error", "reason": exc.reason}
except Exception as exc: # noqa: BLE001
# Catch broader poll exceptions (e.g. a provider-side bug, transient
# network error inside the poller after connect) so the same
# streak-tracking logic applies. Re-raised after the bookkeeping so
# the existing error path keeps logging at the caller.
from .bridge_self import maybe_emit_poll_failure
try:
await maybe_emit_poll_failure(
tracker_id=tracker_id,
tracker_name=tracker_name,
error=str(exc),
)
except Exception: # noqa: BLE001
_LOGGER.exception("bridge_self poll-failure emission failed")
raise
# Successful poll — clear the consecutive-failure counter for this tracker.
from .bridge_self import record_poll_success
record_poll_success(tracker_id)
# Save updated state and log events
async with AsyncSession(engine) as session:
for cid, cstate in new_state.items():
@@ -328,6 +416,16 @@ async def check_tracker(tracker_id: int) -> dict[str, Any]:
# row if quiet hours suppresses it.
event_log_id_by_event: dict[int, int] = {}
for event in events:
# Skip persistence for events the dispatch loop will filter
# anyway (assets_added with 0 added, assets_removed with 0
# removed). Without this we wrote a "noise" row for every
# tracker tick that detected nothing. The dispatch-time filter
# below still runs as a safety net.
etype = event.event_type.value
if etype == "assets_added" and event.added_count == 0:
continue
if etype == "assets_removed" and event.removed_count == 0:
continue
assets_count = event.added_count or event.removed_count or 0
details: dict[str, Any] = {
"added_count": event.added_count,
@@ -445,7 +543,7 @@ async def check_tracker(tracker_id: int) -> dict[str, Any]:
template_slots=ld["template_slots"],
date_format=tmpl.date_format if tmpl else "%d.%m.%Y, %H:%M UTC",
date_only_format=tmpl.date_only_format if tmpl and tmpl.date_only_format else "%d.%m.%Y",
provider_api_key=provider_config.get("api_key"),
provider_api_key=resolve_provider_credential(provider_config),
provider_internal_url=provider_config.get("url", ""),
provider_external_url=provider_config.get("external_domain", ""),
receivers=ld["receivers"],
@@ -453,7 +551,9 @@ async def check_tracker(tracker_id: int) -> dict[str, Any]:
key = id(tc) if tc is not None else 0
if key not in groups:
groups[key] = (tc, [])
groups[key][1].append(target_cfg)
# Threaded with target_id/target_name so per-target failure
# counters can attribute the dispatch result correctly.
groups[key][1].append((target_cfg, ld.get("target_id"), ld.get("target_name", "")))
# Persist defers + stamp the event_log row + schedule drains in a
# single transaction. This keeps the "deferred" pill on the
@@ -496,8 +596,17 @@ async def check_tracker(tracker_id: int) -> dict[str, Any]:
"Failed to schedule deferred drain for %s", fire_at,
)
for tc, target_configs in groups.values():
if not target_configs:
from .bridge_self import (
maybe_emit_target_failure,
record_target_success,
)
track_target_failures = (
event.provider_type.value != "bridge_self"
)
for tc, target_entries in groups.values():
if not target_entries:
continue
shaped_event = apply_tracking_display_filters(event, tc)
if shaped_event is None:
@@ -505,12 +614,28 @@ async def check_tracker(tracker_id: int) -> dict[str, Any]:
" Event suppressed by display filters (favorites_only)",
)
continue
target_configs = [entry[0] for entry in target_entries]
results = await dispatcher.dispatch(shaped_event, target_configs)
for r in results:
for entry, r in zip(target_entries, results):
_, target_id, target_name = entry
if r.get("success"):
_LOGGER.info(" Notification sent successfully")
if track_target_failures and target_id is not None:
record_target_success(int(target_id))
else:
_LOGGER.error(" Notification failed: %s", r.get("error", "unknown"))
if track_target_failures and target_id is not None:
try:
await maybe_emit_target_failure(
target_id=int(target_id),
target_name=target_name or "",
target_type=entry[0].type,
error=str(r.get("error") or ""),
)
except Exception: # noqa: BLE001
_LOGGER.exception(
"bridge_self target-failure emission failed",
)
return {
"status": "ok",