feat: production readiness — security, perf, bug fixes, bridge self-monitoring
Comprehensive multi-area pass driven by a parallel 8-agent production
review, covering frontend, backend, database, security, performance, and
operations, plus a new self-monitoring feature.
## Critical fixes
- Planka webhook: reads bounded raw body (was NameError on every call)
- HA quiet hours: ha_state_changed/automation_triggered/service_called/
event_fired added to deferrable set (were silently dropped)
- DNS-rebinding SSRF: PinnedResolver wired into shared aiohttp session
- Telegram inbound webhook: secret now mandatory (401 without)
- Generic webhook: auth_mode="none" requires explicit
acknowledge_unauthenticated=true; per-IP rate limit 60/min
- svelte-check: 5 null-narrowing errors in EventDetailModal fixed
- Provider hardcoding: Immich-only block extracted to descriptor
featureDiscoveryHint
- command_sync: snapshot+expunge bot before exiting AsyncSession
## Bug fixes
- notifier asyncio.gather(return_exceptions=True) — one bad chat no longer
cancels peer sends
- NotificationDispatcher hoisted out of per-tracker loop
- Provider credential resolution unified across all 5 dispatch sites
- HA asyncio.shield now drains inner task on cancellation
- Provider construction switched from if/elif ladder to factory registry
- NUT first poll seeds silently (no spurious ups_on_battery)
- Quiet-hours gate: event-type-disabled now wins over deferral
- APScheduler drain job ID resolution upgraded to seconds
- HA on_status_change wired through to EventLog
- Webhook payload rollback failures now logged (not swallowed)
- Batched receivers/chats/bots in load_link_data (was per-target N+1)
- flag_modified on JSON column reassignments in deferred_dispatch
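
To illustrate the `asyncio.gather(return_exceptions=True)` fix listed above, here is a minimal sketch. `send` and `notify_all` are hypothetical stand-ins, not the notifier's actual functions. It shows each failure coming back as a result value, so the outcome for every chat can still be reported.

```python
import asyncio


async def send(chat_id: int) -> str:
    # Hypothetical sender: chat 2 stands in for a chat that rejects the message.
    if chat_id == 2:
        raise RuntimeError(f"chat {chat_id} rejected the message")
    await asyncio.sleep(0)
    return f"sent to {chat_id}"


async def notify_all(chat_ids: list[int]) -> list[dict]:
    # return_exceptions=True delivers each exception as a result value, so one
    # failing send does not hide the results of the other sends.
    outcomes = await asyncio.gather(
        *(send(c) for c in chat_ids), return_exceptions=True
    )
    results = []
    for chat_id, outcome in zip(chat_ids, outcomes):
        if isinstance(outcome, BaseException):
            results.append({"chat_id": chat_id, "success": False, "error": str(outcome)})
        else:
            results.append({"chat_id": chat_id, "success": True})
    return results


results = asyncio.run(notify_all([1, 2, 3]))
```

`gather` returns outcomes in the order the awaitables were passed, which is what allows the `zip` with `chat_ids`.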
## Database
- UNIQUE indexes on service_provider.webhook_token,
telegram_bot.webhook_path_id, partial UNIQUE on telegram_bot.bot_id,
telegram_chat(bot_id, chat_id), notification_tracker_target unique link,
partial UNIQUE on bridge_self provider per user
- Composite ix_event_log_user_event_type_created index
- save_chat_from_webhook switched to ON CONFLICT DO UPDATE
- ondelete=CASCADE on user-id FKs (model annotation; app-side cascade
delete added for existing data)
- delete_notification_tracker converted from N+1 to bulk DELETE/UPDATE
- Module-level asyncio.Lock replaced with lazy _get_lock() pattern
- VACUUM INTO snapshot now PRAGMA integrity_check verified
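
A minimal sketch of the `ON CONFLICT DO UPDATE` pattern that a unique index on `telegram_chat(bot_id, chat_id)` makes possible, using the standard-library `sqlite3` module. The `title` column, the index name, and the `save_chat` helper are assumptions for illustration; the real `save_chat_from_webhook` may look different.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE telegram_chat (
        id INTEGER PRIMARY KEY,
        bot_id INTEGER NOT NULL,
        chat_id TEXT NOT NULL,
        title TEXT NOT NULL  -- assumed column, for illustration only
    )
    """
)
# The upsert below needs a UNIQUE index that matches its conflict target.
conn.execute(
    "CREATE UNIQUE INDEX ux_telegram_chat_bot_chat ON telegram_chat (bot_id, chat_id)"
)


def save_chat(bot_id: int, chat_id: str, title: str) -> None:
    # Insert a new chat, or update the title if (bot_id, chat_id) already exists.
    conn.execute(
        """
        INSERT INTO telegram_chat (bot_id, chat_id, title)
        VALUES (?, ?, ?)
        ON CONFLICT (bot_id, chat_id) DO UPDATE SET title = excluded.title
        """,
        (bot_id, chat_id, title),
    )


save_chat(1, "-100123", "Family")
save_chat(1, "-100123", "Family (renamed)")
rows = conn.execute("SELECT bot_id, chat_id, title FROM telegram_chat").fetchall()
```

Compared with a select-then-insert sequence, the single statement cannot race with a concurrent webhook for the same chat.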
## Performance
- Jinja2 template compilation LRU cached (lru_cache maxsize=512)
- Per-locale render cache in NotificationDispatcher (skips re-rendering
identical content for receivers sharing a locale)
- Tracker list cached per provider_id with 5s TTL + explicit invalidation
on tracker CRUD (relieves HA chat-bus rate query pressure)
- Nav-counts collapsed from 16 round-trips to single UNION ALL
- HA event_log: skip persisting empty assets_added/removed events
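
The tracker-list cache described above (short TTL plus explicit invalidation on CRUD) could be sketched as follows. `TTLCache`, `get_or_load`, and the injectable `clock` are hypothetical names chosen for this example, not the project's actual implementation.

```python
import time
from typing import Any, Callable, Hashable


class TTLCache:
    """Tiny per-key cache with a time-to-live and explicit invalidation."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so tests do not have to sleep
        self._entries: dict[Hashable, tuple[float, Any]] = {}

    def get_or_load(self, key: Hashable, loader: Callable[[], Any]) -> Any:
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]  # still fresh, skip the query
        value = loader()
        self._entries[key] = (now, value)
        return value

    def invalidate(self, key: Hashable) -> None:
        # Call this from tracker create/update/delete paths.
        self._entries.pop(key, None)


fake_now = [0.0]
calls: list[int] = []


def load_trackers() -> list[str]:
    calls.append(1)  # stands in for a database query
    return ["tracker-a", "tracker-b"]


cache = TTLCache(ttl_seconds=5.0, clock=lambda: fake_now[0])
cache.get_or_load(7, load_trackers)  # miss: runs the loader
cache.get_or_load(7, load_trackers)  # hit: served from cache
fake_now[0] = 6.0
cache.get_or_load(7, load_trackers)  # expired: runs the loader again
cache.invalidate(7)
cache.get_or_load(7, load_trackers)  # invalidated: runs the loader again
```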
## Security hardening
- Mass-assignment guard on Action create/update; cron sub-minute reject
- Backup JSON depth/node-count cap (depth ≤ 10, nodes ≤ 100k)
- _sanitize_config extended to all JSON-typed fields on backup import
- Telegram _safe_get walks redirects manually with SSRF revalidation
- Bcrypt 72-byte password length cap with clear 422
- Webhook payload body redaction; sensitive substring set extended with
oauth/client_secret/webhook_secret/csrf in both header filter and
template extras filter
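
A minimal sketch of a depth and node-count cap like the one listed for backup JSON. The limits come from the bullet above; `check_json_limits`, `BackupTooComplexError`, and the choice to count the top-level container as depth 1 are assumptions made for this example.

```python
import json
from typing import Any

MAX_DEPTH = 10       # from the commit message: depth <= 10
MAX_NODES = 100_000  # from the commit message: nodes <= 100k


class BackupTooComplexError(ValueError):
    """Hypothetical error type for a backup that exceeds the caps."""


def check_json_limits(value: Any, max_depth: int = MAX_DEPTH, max_nodes: int = MAX_NODES) -> int:
    """Walk parsed JSON iteratively; return the node count or raise."""
    nodes = 0
    stack: list[tuple[Any, int]] = [(value, 1)]
    while stack:
        current, depth = stack.pop()
        nodes += 1
        if nodes > max_nodes:
            raise BackupTooComplexError(f"more than {max_nodes} nodes")
        if isinstance(current, dict):
            children = list(current.values())
        elif isinstance(current, list):
            children = current
        else:
            continue  # scalars count as nodes but add no depth
        if depth > max_depth:
            raise BackupTooComplexError(f"nesting deeper than {max_depth}")
        stack.extend((child, depth + 1) for child in children)
    return nodes


node_count = check_json_limits(json.loads('{"providers": [{"name": "immich"}]}'))
```

The walk uses an explicit stack rather than recursion, so the check itself cannot overflow the Python call stack on hostile input.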
## Frontend
- 76 catch (err: any) sites converted to errMsg(err) helper
- globalProviderFilter: pure getter; reconciliation moved to one-time
$effect in +layout
- Provider-filter binding: removed paired $effects + _syncingFilter flag,
now one-way derived
- entity-cache: separate _refreshing flag for background re-fetches
- api.ts 401 handling: AuthRedirectError class + dedup _redirecting flag,
goto() instead of window.location.href
- a11y: aria-expanded on mobile More, role=switch + aria-checked on
Telegram bot toggles
## Tests & operations
- CI pytest gate added to .gitea/workflows/build.yml + release.yml
(wheel-built install to dodge editable-install slowness)
- /api/ready upgraded to deep healthcheck (db SELECT 1, scheduler.running,
HA supervisor presence) returning {ready, checks, errors, version}
- /api/metrics endpoint with prometheus_client (deferred_pending,
event_log_total, dispatch_duration, poll_failures, send_failures)
- New OPERATIONS.md covering deploy, healthchecks, metrics, backup/restore
procedures, log handling, common scenarios, upgrade flow
- New tests: test_bridge_self (11), test_gitea_parser (9),
test_planka_parser (6), test_immich_change_detector (6),
test_backup_roundtrip (1)
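
The deep healthcheck response shape `{ready, checks, errors, version}` can be sketched independently of any web framework. Only the four top-level keys come from the bullet above; `run_readiness`, the per-check dictionaries, and the two fake checks are assumptions made for illustration.

```python
import asyncio
from typing import Awaitable, Callable

Check = Callable[[], Awaitable[None]]


async def run_readiness(checks: dict[str, Check], version: str) -> dict:
    """Run every named check and aggregate the outcome into one report."""
    results: dict[str, bool] = {}
    errors: dict[str, str] = {}
    for name, check in checks.items():
        try:
            await check()
            results[name] = True
        except Exception as exc:  # a failing check must not crash the probe
            results[name] = False
            errors[name] = f"{type(exc).__name__}: {exc}"
    return {
        "ready": all(results.values()),
        "checks": results,
        "errors": errors,
        "version": version,
    }


async def db_ok() -> None:
    # Stand-in for a real "SELECT 1" round-trip.
    return None


async def scheduler_down() -> None:
    # Stand-in for a scheduler.running check that fails.
    raise RuntimeError("scheduler not running")


report = asyncio.run(
    run_readiness({"db": db_ok, "scheduler": scheduler_down}, version="0.0.0")
)
```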
## New feature: bridge self-monitoring
- New bridge_self provider type — internal sink for bridge health events
- Three event types: bridge_self_poll_failures (consecutive tracker poll
failures), bridge_self_deferred_backlog (pending count crosses
threshold), bridge_self_target_failures (consecutive 5xx/network
failures per target)
- Per-user thresholds (defaults: 3 / 100 / 5) configurable via the
provider config form
- Auto-seeded on user create + /setup + boot backfill for existing users
- Anti-spam: counters reset after emission; backlog uses transition latch
- Self-loop guard: bridge_self failures don't count toward target-failure
thresholds (logged only) — wire to your own Telegram/Email/Matrix to
get notified when polls/dispatches/sends fail
- 6 default templates (3 events × 2 locales), tracking config columns
with backfill migration, frontend descriptor (excluded from "create
provider" wizard since auto-managed)
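
The two anti-spam mechanisms named above (counters that reset after emission, and a transition latch for the backlog) can be sketched like this. `FailureStreak` and `BacklogLatch` are hypothetical classes, and treating "crosses threshold" as `>=` is an assumption; the thresholds 3 and 100 are the defaults quoted in the list.

```python
class FailureStreak:
    """Counts consecutive failures per key; signals once per full streak."""

    def __init__(self, threshold: int) -> None:
        self.threshold = threshold
        self._counts: dict[int, int] = {}

    def record_failure(self, key: int) -> bool:
        count = self._counts.get(key, 0) + 1
        if count >= self.threshold:
            self._counts[key] = 0  # reset after emission to avoid repeat alerts
            return True
        self._counts[key] = count
        return False

    def record_success(self, key: int) -> None:
        self._counts.pop(key, None)  # any success ends the streak


class BacklogLatch:
    """Fires only on the transition from below to at-or-above the threshold."""

    def __init__(self, threshold: int) -> None:
        self.threshold = threshold
        self._latched = False

    def observe(self, pending: int) -> bool:
        above = pending >= self.threshold  # assumption: ">=" counts as crossing
        fire = above and not self._latched
        self._latched = above
        return fire


polls = FailureStreak(threshold=3)
emitted = [polls.record_failure(42) for _ in range(6)]

backlog = BacklogLatch(threshold=100)
fired = [backlog.observe(n) for n in (50, 120, 150, 80, 100)]
```

With these inputs, an alert is emitted on every third consecutive failure, and the backlog alert fires once per excursion above the threshold rather than on every observation.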
Operator-visible behavior changes (call out in release notes):
- NOTIFY_BRIDGE_TELEGRAM_WEBHOOK_SECRET now REQUIRED for webhook mode
- Existing webhook providers with auth_mode="none" need explicit opt-in
- Generic webhook endpoint rate-limited 60/min per source IP
- HA disconnect/reconnect writes ha_status_* EventLog rows
- Every user gets a bridge_self provider — wire it to a target to
receive failure alerts
Pre-existing test failures (test_ssrf, test_release_provider) on
Python 3.13 are unrelated; CI runs on 3.12.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -4,7 +4,7 @@ from __future__ import annotations

 import asyncio
 import logging
-from typing import Any
+from typing import Any, Awaitable, Callable

 from sqlmodel import select
 from sqlmodel.ext.asyncio.session import AsyncSession
@@ -12,6 +12,7 @@ from sqlmodel.ext.asyncio.session import AsyncSession
 from notify_bridge_core.models.events import ServiceEvent
 from notify_bridge_core.notifications.dispatcher import NotificationDispatcher, TargetConfig
 from notify_bridge_core.notifications.telegram.cache import TelegramFileCache
+from notify_bridge_core.providers.capabilities import get_capabilities
 from notify_bridge_core.storage import JsonFileBackend

 from ..database.engine import get_engine
@@ -27,6 +28,7 @@ from .dispatch_helpers import (
     evaluate_event_gate,
     get_app_timezone,
     load_link_data,
+    resolve_provider_credential,
 )

 _LOGGER = logging.getLogger(__name__)
@@ -34,7 +36,18 @@ _LOGGER = logging.getLogger(__name__)
 # Module-level Telegram file caches — shared across dispatches for reuse
 _url_cache: TelegramFileCache | None = None
 _asset_cache: TelegramFileCache | None = None
-_cache_lock = asyncio.Lock()
+# Lazy init: creating ``asyncio.Lock()`` at module import time binds the
+# lock to whichever event loop is current at import (often none / the wrong
+# one when tests fire up dedicated loops). Defer until first use.
+_cache_lock: asyncio.Lock | None = None
+
+
+def _get_cache_lock() -> asyncio.Lock:
+    """Return the module cache lock, creating it on first call."""
+    global _cache_lock
+    if _cache_lock is None:
+        _cache_lock = asyncio.Lock()
+    return _cache_lock


 async def _load_cache_settings() -> tuple[int, int]:
@@ -68,7 +81,7 @@ async def _get_telegram_caches() -> tuple[TelegramFileCache | None, TelegramFile
     global _url_cache, _asset_cache
     if _url_cache is not None:
         return _url_cache, _asset_cache
-    async with _cache_lock:
+    async with _get_cache_lock():
         # Double-check after acquiring lock
         if _url_cache is not None:
             return _url_cache, _asset_cache
@@ -108,7 +121,7 @@ async def reset_telegram_caches_in_memory() -> None:
     deletes cached file_ids.
     """
     global _url_cache, _asset_cache
-    async with _cache_lock:
+    async with _get_cache_lock():
         _url_cache = None
         _asset_cache = None
     _LOGGER.info("Reset Telegram cache refs in memory (files preserved)")
@@ -135,7 +148,7 @@ async def clear_telegram_caches() -> dict[str, Any]:
     Returns a summary with the paths that were removed.
     """
     global _url_cache, _asset_cache
-    async with _cache_lock:
+    async with _get_cache_lock():
         removed: list[str] = []
         for cache, label in ((_url_cache, "url"), (_asset_cache, "asset")):
             if cache is not None:
@@ -163,6 +176,90 @@ async def clear_telegram_caches() -> dict[str, Any]:
         return {"cleared": True, "removed": removed}


+# ---------------------------------------------------------------------------
+# Provider polling registry
+# ---------------------------------------------------------------------------
+#
+# Each registered factory returns (events, new_state). Replaces the long
+# ``if provider_type == ...`` chain in ``check_tracker``. New pollable
+# providers register here; webhook-only providers are short-circuited above
+# via ``capabilities.webhook_based``.
+
+class _PollerConnectError(Exception):
+    """Raised by a poller factory when initial provider connection fails."""
+
+    def __init__(self, reason: str) -> None:
+        super().__init__(reason)
+        self.reason = reason
+
+
+PollResult = tuple[list[ServiceEvent], dict[str, Any]]
+PollerFactory = Callable[..., Awaitable[PollResult]]
+
+
+async def _poll_immich(*, provider_config, provider_name, collection_ids, state_dict, **_kw) -> PollResult:
+    from notify_bridge_core.providers.immich import ImmichServiceProvider
+    from .http_session import get_http_session
+    http_session = await get_http_session()
+    immich = ImmichServiceProvider(
+        http_session,
+        provider_config.get("url", ""),
+        provider_config.get("api_key", ""),
+        provider_config.get("external_domain"),
+        provider_name,
+    )
+    if not await immich.connect():
+        raise _PollerConnectError("failed to connect to provider")
+    return await immich.poll(collection_ids, state_dict)
+
+
+async def _poll_scheduler(*, provider_name, tracker_name, tracker_filters, collection_ids, state_dict, app_tz, **_kw) -> PollResult:
+    from notify_bridge_core.providers.scheduler import SchedulerServiceProvider
+    sched = SchedulerServiceProvider(
+        name=provider_name,
+        tracker_name=tracker_name,
+        custom_variables=tracker_filters.get("custom_variables", {}),
+        timezone_name=app_tz,
+    )
+    return await sched.poll(collection_ids, state_dict)
+
+
+async def _poll_nut(*, provider_config, provider_name, collection_ids, state_dict, **_kw) -> PollResult:
+    from notify_bridge_core.providers.nut import NutServiceProvider
+    nut = NutServiceProvider(
+        host=provider_config.get("host", "localhost"),
+        port=provider_config.get("port", 3493),
+        username=provider_config.get("username"),
+        password=provider_config.get("password"),
+        name=provider_name,
+    )
+    return await nut.poll(collection_ids, state_dict)
+
+
+async def _poll_google_photos(*, provider_config, provider_name, collection_ids, state_dict, **_kw) -> PollResult:
+    from notify_bridge_core.providers.google_photos import GooglePhotosServiceProvider
+    from .http_session import get_http_session
+    http_session = await get_http_session()
+    gp = GooglePhotosServiceProvider(
+        http_session,
+        provider_config.get("client_id", ""),
+        provider_config.get("client_secret", ""),
+        provider_config.get("refresh_token", ""),
+        provider_name,
+    )
+    if not await gp.connect():
+        raise _PollerConnectError("failed to connect to Google Photos")
+    return await gp.poll(collection_ids, state_dict)
+
+
+_POLL_FACTORIES: dict[str, PollerFactory] = {
+    "immich": _poll_immich,
+    "scheduler": _poll_scheduler,
+    "nut": _poll_nut,
+    "google_photos": _poll_google_photos,
+}
+
+
 async def check_tracker(tracker_id: int) -> dict[str, Any]:
     """Poll a tracker's provider for changes and dispatch notifications."""
     engine = get_engine()
@@ -223,70 +320,61 @@ async def check_tracker(tracker_id: int) -> dict[str, Any]:
     events: list[ServiceEvent] = []
     new_state: dict[str, Any] = {}

-    if provider_type == "immich":
-        from notify_bridge_core.providers.immich import ImmichServiceProvider
-        from .http_session import get_http_session
-        http_session = await get_http_session()
-        immich = ImmichServiceProvider(
-            http_session,
-            provider_config.get("url", ""),
-            provider_config.get("api_key", ""),
-            provider_config.get("external_domain"),
-            provider_name,
-        )
-        connected = await immich.connect()
-        if not connected:
-            return {"status": "error", "reason": "failed to connect to provider"}
+    # Webhook-only providers: capabilities.webhook_based short-circuits the
+    # poll path. Inbound events arrive via the /api/webhooks/* endpoints.
+    caps = get_capabilities(provider_type)
+    if caps is not None and caps.webhook_based:
+        return {"status": "ok", "events_detected": 0, "collections_checked": 0}

-        events, new_state = await immich.poll(collection_ids, state_dict)
-    elif provider_type == "gitea":
-        # Gitea is webhook-based — events arrive via /api/webhooks/gitea endpoint.
-        # The scheduler still calls check_tracker but there's nothing to poll.
-        return {"status": "ok", "events_detected": 0, "collections_checked": 0}
-    elif provider_type == "planka":
-        # Planka is webhook-based — events arrive via /api/webhooks/planka endpoint.
-        return {"status": "ok", "events_detected": 0, "collections_checked": 0}
-    elif provider_type == "scheduler":
-        from notify_bridge_core.providers.scheduler import SchedulerServiceProvider
-        custom_vars = tracker_filters.get("custom_variables", {})
-        sched = SchedulerServiceProvider(
-            name=provider_name,
-            tracker_name=tracker_name,
-            custom_variables=custom_vars,
-            timezone_name=app_tz,
-        )
-        events, new_state = await sched.poll(collection_ids, state_dict)
-    elif provider_type == "nut":
-        from notify_bridge_core.providers.nut import NutServiceProvider
-        nut = NutServiceProvider(
-            host=provider_config.get("host", "localhost"),
-            port=provider_config.get("port", 3493),
-            username=provider_config.get("username"),
-            password=provider_config.get("password"),
-            name=provider_name,
-        )
-        events, new_state = await nut.poll(collection_ids, state_dict)
-    elif provider_type == "google_photos":
-        from notify_bridge_core.providers.google_photos import GooglePhotosServiceProvider
-        from .http_session import get_http_session
-        http_session = await get_http_session()
-        gp = GooglePhotosServiceProvider(
-            http_session,
-            provider_config.get("client_id", ""),
-            provider_config.get("client_secret", ""),
-            provider_config.get("refresh_token", ""),
-            provider_name,
-        )
-        connected = await gp.connect()
-        if not connected:
-            return {"status": "error", "reason": "failed to connect to Google Photos"}
-        events, new_state = await gp.poll(collection_ids, state_dict)
-    elif provider_type == "webhook":
-        # Webhook providers receive events via inbound HTTP; no polling needed.
-        return {"status": "ok", "events_detected": 0, "collections_checked": 0}
-    else:
+    poller = _POLL_FACTORIES.get(provider_type)
+    if poller is None:
         return {"status": "error", "reason": f"unsupported provider type: {provider_type}"}

+    try:
+        events, new_state = await poller(
+            provider_config=provider_config,
+            provider_name=provider_name,
+            tracker_name=tracker_name,
+            tracker_filters=tracker_filters,
+            collection_ids=collection_ids,
+            state_dict=state_dict,
+            app_tz=app_tz,
+        )
+    except _PollerConnectError as exc:
+        # Track consecutive poll failures so the bridge_self provider can
+        # alert when a tracker stops responding. The emission is async
+        # but cheap; we await it inline so its DB writes happen before
+        # check_tracker returns to the scheduler.
+        from .bridge_self import maybe_emit_poll_failure
+        try:
+            await maybe_emit_poll_failure(
+                tracker_id=tracker_id,
+                tracker_name=tracker_name,
+                error=exc.reason,
+            )
+        except Exception:  # noqa: BLE001
+            _LOGGER.exception("bridge_self poll-failure emission failed")
+        return {"status": "error", "reason": exc.reason}
+    except Exception as exc:  # noqa: BLE001
+        # Catch broader poll exceptions (e.g. a provider-side bug, transient
+        # network error inside the poller after connect) so the same
+        # streak-tracking logic applies. Re-raised after the bookkeeping so
+        # the existing error path keeps logging at the caller.
+        from .bridge_self import maybe_emit_poll_failure
+        try:
+            await maybe_emit_poll_failure(
+                tracker_id=tracker_id,
+                tracker_name=tracker_name,
+                error=str(exc),
+            )
+        except Exception:  # noqa: BLE001
+            _LOGGER.exception("bridge_self poll-failure emission failed")
+        raise
+
+    # Successful poll — clear the consecutive-failure counter for this tracker.
+    from .bridge_self import record_poll_success
+    record_poll_success(tracker_id)
+
     # Save updated state and log events
     async with AsyncSession(engine) as session:
         for cid, cstate in new_state.items():
@@ -328,6 +416,16 @@ async def check_tracker(tracker_id: int) -> dict[str, Any]:
         # row if quiet hours suppresses it.
         event_log_id_by_event: dict[int, int] = {}
         for event in events:
+            # Skip persistence for events the dispatch loop will filter
+            # anyway (assets_added with 0 added, assets_removed with 0
+            # removed). Without this we wrote a "noise" row for every
+            # tracker tick that detected nothing. The dispatch-time filter
+            # below still runs as a safety net.
+            etype = event.event_type.value
+            if etype == "assets_added" and event.added_count == 0:
+                continue
+            if etype == "assets_removed" and event.removed_count == 0:
+                continue
             assets_count = event.added_count or event.removed_count or 0
             details: dict[str, Any] = {
                 "added_count": event.added_count,
@@ -445,7 +543,7 @@ async def check_tracker(tracker_id: int) -> dict[str, Any]:
                 template_slots=ld["template_slots"],
                 date_format=tmpl.date_format if tmpl else "%d.%m.%Y, %H:%M UTC",
                 date_only_format=tmpl.date_only_format if tmpl and tmpl.date_only_format else "%d.%m.%Y",
-                provider_api_key=provider_config.get("api_key"),
+                provider_api_key=resolve_provider_credential(provider_config),
                 provider_internal_url=provider_config.get("url", ""),
                 provider_external_url=provider_config.get("external_domain", ""),
                 receivers=ld["receivers"],
@@ -453,7 +551,9 @@ async def check_tracker(tracker_id: int) -> dict[str, Any]:
             key = id(tc) if tc is not None else 0
             if key not in groups:
                 groups[key] = (tc, [])
-            groups[key][1].append(target_cfg)
+            # Threaded with target_id/target_name so per-target failure
+            # counters can attribute the dispatch result correctly.
+            groups[key][1].append((target_cfg, ld.get("target_id"), ld.get("target_name", "")))

         # Persist defers + stamp the event_log row + schedule drains in a
         # single transaction. This keeps the "deferred" pill on the
@@ -496,8 +596,17 @@ async def check_tracker(tracker_id: int) -> dict[str, Any]:
                         "Failed to schedule deferred drain for %s", fire_at,
                     )

-        for tc, target_configs in groups.values():
-            if not target_configs:
+        from .bridge_self import (
+            maybe_emit_target_failure,
+            record_target_success,
+        )
+
+        track_target_failures = (
+            event.provider_type.value != "bridge_self"
+        )
+
+        for tc, target_entries in groups.values():
+            if not target_entries:
                 continue
             shaped_event = apply_tracking_display_filters(event, tc)
             if shaped_event is None:
@@ -505,12 +614,28 @@ async def check_tracker(tracker_id: int) -> dict[str, Any]:
                     " Event suppressed by display filters (favorites_only)",
                 )
                 continue
+            target_configs = [entry[0] for entry in target_entries]
             results = await dispatcher.dispatch(shaped_event, target_configs)
-            for r in results:
+            for entry, r in zip(target_entries, results):
+                _, target_id, target_name = entry
                 if r.get("success"):
                     _LOGGER.info(" Notification sent successfully")
+                    if track_target_failures and target_id is not None:
+                        record_target_success(int(target_id))
                 else:
                     _LOGGER.error(" Notification failed: %s", r.get("error", "unknown"))
+                    if track_target_failures and target_id is not None:
+                        try:
+                            await maybe_emit_target_failure(
+                                target_id=int(target_id),
+                                target_name=target_name or "",
+                                target_type=entry[0].type,
+                                error=str(r.get("error") or ""),
+                            )
+                        except Exception:  # noqa: BLE001
+                            _LOGGER.exception(
+                                "bridge_self target-failure emission failed",
+                            )

     return {
         "status": "ok",