feat: production readiness — security, perf, bug fixes, bridge self-monitoring

Comprehensive multi-area pass driven by a parallel 8-agent production
review. Frontend, backend, database, security, performance, operational,
plus a new self-monitoring feature.

## Critical fixes
- Planka webhook: reads bounded raw body (was NameError on every call)
- HA quiet hours: ha_state_changed/automation_triggered/service_called/
  event_fired added to deferrable set (were silently dropped)
- DNS-rebinding SSRF: PinnedResolver wired into shared aiohttp session
- Telegram inbound webhook: secret now mandatory (401 without)
- Generic webhook: auth_mode="none" requires explicit
  acknowledge_unauthenticated=true; per-IP rate limit 60/min
- svelte-check: 5 null-narrowing errors in EventDetailModal fixed
- Provider hardcoding: Immich-only block extracted to descriptor
  featureDiscoveryHint
- command_sync: snapshot+expunge bot before exiting AsyncSession

## Bug fixes
- notifier asyncio.gather(return_exceptions=True): one bad chat no
  longer cancels peer sends
- NotificationDispatcher hoisted out of per-tracker loop
- Provider credential resolution unified across all 5 dispatch sites
- HA asyncio.shield now drains inner task on cancellation
- Provider construction switched from if/elif ladder to factory registry
- NUT first poll seeds silently (no spurious ups_on_battery)
- Quiet-hours gate: event-type-disabled now wins over deferral
- APScheduler drain job ID resolution upgraded to seconds
- HA on_status_change wired through to EventLog
- Webhook payload rollback failures now logged (not swallowed)
- Batched receivers/chats/bots in load_link_data (was per-target N+1)
- flag_modified on JSON column reassignments in deferred_dispatch

## Database
- UNIQUE indexes on service_provider.webhook_token,
  telegram_bot.webhook_path_id, partial UNIQUE on telegram_bot.bot_id,
  telegram_chat(bot_id, chat_id), notification_tracker_target unique
  link, partial UNIQUE on bridge_self provider per user
- Composite ix_event_log_user_event_type_created index
- save_chat_from_webhook switched to ON CONFLICT DO UPDATE
- ondelete=CASCADE on user-id FKs (model annotation; app-side cascade
  delete added for existing data)
- delete_notification_tracker converted from N+1 to bulk DELETE/UPDATE
- Module-level asyncio.Lock replaced with lazy _get_lock() pattern
- VACUUM INTO snapshot now PRAGMA integrity_check verified

## Performance
- Jinja2 template compilation LRU cached (lru_cache maxsize=512)
- Per-locale render cache in NotificationDispatcher (skips re-rendering
  identical content for receivers sharing a locale)
- Tracker list cached per provider_id with 5s TTL + explicit
  invalidation on tracker CRUD (relieves HA chat-bus rate query pressure)
- Nav-counts collapsed from 16 round-trips to single UNION ALL
- HA event_log: skip persisting empty assets_added/removed events

## Security hardening
- Mass-assignment guard on Action create/update; cron sub-minute reject
- Backup JSON depth/node-count cap (depth ≤ 10, nodes ≤ 100k)
- _sanitize_config extended to all JSON-typed fields on backup import
- Telegram _safe_get walks redirects manually with SSRF revalidation
- Bcrypt 72-byte password length cap with clear 422
- Webhook payload body redaction; sensitive substring set extended with
  oauth/client_secret/webhook_secret/csrf in both header filter and
  template extras filter

## Frontend
- 76 catch (err: any) sites converted to errMsg(err) helper
- globalProviderFilter: pure getter; reconciliation moved to one-time
  $effect in +layout
- Provider-filter binding: removed paired $effects + _syncingFilter
  flag, now one-way derived
- entity-cache: separate _refreshing flag for background re-fetches
- api.ts 401 handling: AuthRedirectError class + dedup _redirecting
  flag, goto() instead of window.location.href
- a11y: aria-expanded on mobile More, role=switch + aria-checked on
  Telegram bot toggles

## Tests & operations
- CI pytest gate added to .gitea/workflows/build.yml + release.yml
  (wheel-built install to dodge editable-install slowness)
- /api/ready upgraded to deep healthcheck (db SELECT 1,
  scheduler.running, HA supervisor presence) returning
  {ready, checks, errors, version}
- /api/metrics endpoint with prometheus_client (deferred_pending,
  event_log_total, dispatch_duration, poll_failures, send_failures)
- New OPERATIONS.md covering deploy, healthchecks, metrics,
  backup/restore procedures, log handling, common scenarios, upgrade flow
- New tests: test_bridge_self (11), test_gitea_parser (9),
  test_planka_parser (6), test_immich_change_detector (6),
  test_backup_roundtrip (1)

## New feature: bridge self-monitoring
- New bridge_self provider type: internal sink for bridge health events
- Three event types: bridge_self_poll_failures (consecutive tracker poll
  failures), bridge_self_deferred_backlog (pending count crosses
  threshold), bridge_self_target_failures (consecutive 5xx/network
  failures per target)
- Per-user thresholds (defaults: 3 / 100 / 5) configurable via the
  provider config form
- Auto-seeded on user create + /setup + boot backfill for existing users
- Anti-spam: counters reset after emission; backlog uses transition latch
- Self-loop guard: bridge_self failures don't count toward
  target-failure thresholds (logged only); wire to your own
  Telegram/Email/Matrix to get notified when polls/dispatches/sends fail
- 6 default templates (3 events × 2 locales), tracking config columns
  with backfill migration, frontend descriptor (excluded from
  "create provider" wizard since auto-managed)

Operator-visible behavior changes (call out in release notes):
- NOTIFY_BRIDGE_TELEGRAM_WEBHOOK_SECRET now REQUIRED for webhook mode
- Existing webhook providers with auth_mode="none" need explicit opt-in
- Generic webhook endpoint rate-limited 60/min per source IP
- HA disconnect/reconnect writes ha_status_* EventLog rows
- Every user gets a bridge_self provider; wire it to a target to receive
  failure alerts

Pre-existing test failures (test_ssrf, test_release_provider) on Python
3.13 are unrelated; CI runs on 3.12.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 02:16:49 +03:00
parent 22127e2a59
commit 10d30fc956
97 changed files with 5423 additions and 821 deletions
@@ -1,6 +1,7 @@
 """Action management API routes — CRUD, execute, dry-run, executions."""

 import logging
+import re

 from fastapi import APIRouter, Depends, HTTPException, Query, status
 from pydantic import BaseModel
@@ -54,6 +55,58 @@ class ActionUpdate(BaseModel):
 # ---------------------------------------------------------------------------


+# Allowlist of fields a CRUD client may set on Action. Mirrors ActionCreate /
+# ActionUpdate but enforced server-side so a tampered request body cannot
+# overwrite ``user_id``, ``last_run_at``, ``created_at``, etc. via ``**dump``.
+_ALLOWED_ACTION_CREATE_FIELDS = frozenset({
+    "provider_id", "name", "icon", "action_type", "config",
+    "schedule_type", "schedule_interval", "schedule_cron", "enabled",
+})
+_ALLOWED_ACTION_UPDATE_FIELDS = frozenset({
+    "name", "icon", "config",
+    "schedule_type", "schedule_interval", "schedule_cron", "enabled",
+})
+
+# 5 fields = standard cron, 6+ fields = leading seconds column
+# (Quartz-style). _validate_cron rejects any 6+ field expression whose
+# seconds column is not a literal ``0``, which also covers a bare ``*``
+# second. The patterns below additionally reject a leading ``*/0`` step.
+_DISALLOWED_CRON_PATTERNS = (
+    re.compile(r"^\s*\*/0\s+"),  # */0 in the first field only
+)
+
+
+def _validate_cron(expr: str | None) -> None:
+    """Reject schedule_cron strings that fire more often than once per minute.
+
+    Without croniter as a hard dep we apply a conservative check: a
+    standard 5-field cron starts with the minute column, so a sub-minute
+    cadence requires a 6+ field expression with a leading seconds column.
+    Reject those unless the seconds column is ``0``; empty values pass.
+    """
+    if not expr or not expr.strip():
+        return
+    parts = expr.split()
+    if len(parts) >= 6:
+        # Seconds field present (Quartz-style or 6-field). Forbid
+        # second-level fires entirely; minute-cadence is the floor.
+        seconds_field = parts[0]
+        if seconds_field != "0":
+            raise HTTPException(
+                status_code=400,
+                detail=(
+                    "schedule_cron with a sub-minute cadence is not allowed; "
+                    "set the seconds field to 0 or use a standard 5-field cron"
+                ),
+            )
+    for pattern in _DISALLOWED_CRON_PATTERNS:
+        if pattern.search(expr):
+            raise HTTPException(
+                status_code=400,
+                detail="schedule_cron contains a disallowed pattern",
+            )
+
+
 async def _action_response(session: AsyncSession, action: Action) -> dict:
    """Build response dict with rules inlined."""
    result = await session.exec(
@@ -127,7 +180,15 @@ async def create_action(
            detail=f"Invalid action type '{body.action_type}' for provider type '{provider.type}'",
        )

-    action = Action(user_id=user.id, **body.model_dump())
+    _validate_cron(body.schedule_cron)
+
+    # Project only allowlisted fields so a tampered body can't write
+    # ``user_id``, ``id``, ``last_run_at``, etc. via ``**dump``.
+    payload = {
+        k: v for k, v in body.model_dump().items()
+        if k in _ALLOWED_ACTION_CREATE_FIELDS
+    }
+    action = Action(user_id=user.id, **payload)
    session.add(action)
    await session.commit()
    await session.refresh(action)
@@ -168,7 +229,13 @@ async def update_action(
        raise HTTPException(status_code=404, detail="Action not found")

    updates = body.model_dump(exclude_unset=True)
+    if "schedule_cron" in updates:
+        _validate_cron(updates["schedule_cron"] or "")
+    # Drop any field outside the update allowlist so a tampered request
+    # can't mutate ``user_id`` / ``provider_id`` / ``action_type`` etc.
    for key, value in updates.items():
+        if key not in _ALLOWED_ACTION_UPDATE_FIELDS:
+            continue
        setattr(action, key, value)
    session.add(action)
    await session.commit()
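
The seconds-column rule in `_validate_cron` can be exercised without FastAPI. The following is a minimal sketch, assuming the same field-count heuristic as the hunk above; `is_sub_minute_cron` is a hypothetical helper name that does not appear in the commit, and it returns a boolean instead of raising `HTTPException`.

```python
from __future__ import annotations


def is_sub_minute_cron(expr: str | None) -> bool:
    """Return True when the expression could fire more than once per minute.

    Hypothetical stand-in for the seconds-column check in _validate_cron.
    """
    if not expr or not expr.strip():
        return False  # empty schedule: nothing to reject
    parts = expr.split()
    # Six or more fields means a leading seconds column is present; only a
    # literal "0" there pins the cadence to at most one fire per minute.
    return len(parts) >= 6 and parts[0] != "0"


for candidate in ("*/5 * * * *", "0 */5 * * * *", "*/10 * * * * *"):
    print(candidate, "->", is_sub_minute_cron(candidate))
```

A 5-field expression never trips the check because it has no seconds column, which is why the heuristic needs no cron parser.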
@@ -48,6 +48,40 @@ _LOGGER = logging.getLogger(__name__)

 router = APIRouter(prefix="/api/backup", tags=["backup"])

+
+# Hard caps on uploaded backup file shape — defend against parser DoS
+# (deeply nested or pathologically wide JSON) before we hand the
+# structure to the import pipeline.
+_MAX_BACKUP_DEPTH = 10
+_MAX_BACKUP_NODES = 100_000
+
+
+def _validate_backup_shape(value: object, depth: int = 0, count: list[int] | None = None) -> None:
+    """Walk ``value`` and reject anything beyond the depth/node caps.
+
+    Raises HTTPException(400) on overflow. Cheap O(n) walk; runs once
+    per upload.
+    """
+    if count is None:
+        count = [0]
+    if depth > _MAX_BACKUP_DEPTH:
+        raise HTTPException(
+            status_code=400,
+            detail=f"Backup file too deeply nested (max depth {_MAX_BACKUP_DEPTH})",
+        )
+    count[0] += 1
+    if count[0] > _MAX_BACKUP_NODES:
+        raise HTTPException(
+            status_code=400,
+            detail=f"Backup file has too many nodes (max {_MAX_BACKUP_NODES})",
+        )
+    if isinstance(value, dict):
+        for v in value.values():
+            _validate_backup_shape(v, depth + 1, count)
+    elif isinstance(value, list):
+        for v in value:
+            _validate_backup_shape(v, depth + 1, count)
+
 MAX_UPLOAD_SIZE = 10 * 1024 * 1024  # 10 MB


@@ -181,6 +215,8 @@ async def validate_config(
    except json.JSONDecodeError as e:
        raise HTTPException(status_code=400, detail=f"Invalid JSON: {e}")

+    _validate_backup_shape(raw)
+
    result = validate_backup(raw)
    return result.model_dump()

@@ -204,6 +240,8 @@ async def import_config(
    except json.JSONDecodeError as e:
        raise HTTPException(status_code=400, detail=f"Invalid JSON: {e}")

+    _validate_backup_shape(raw)
+
    # Validate first
    validation = validate_backup(raw)
    if not validation.valid:
@@ -259,6 +297,8 @@ async def prepare_restore(
    except json.JSONDecodeError as e:
        raise HTTPException(status_code=400, detail=f"Invalid JSON: {e}")

+    _validate_backup_shape(raw)
+
    validation = validate_backup(raw)
    if not validation.valid:
        raise HTTPException(
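
The depth and node caps applied by `_validate_backup_shape` can be tried in isolation. This is a sketch under stated assumptions: `check_shape` is a hypothetical name, it raises `ValueError` where the route code raises `HTTPException(400)`, and it returns the visited node count purely for illustration.

```python
from __future__ import annotations

MAX_DEPTH = 10       # mirrors _MAX_BACKUP_DEPTH in the hunk above
MAX_NODES = 100_000  # mirrors _MAX_BACKUP_NODES in the hunk above


def check_shape(value: object, depth: int = 0, count: list[int] | None = None) -> int:
    """Walk parsed JSON; raise ValueError past either cap, else return node count."""
    if count is None:
        count = [0]  # one-element list: a counter shared across the recursion
    if depth > MAX_DEPTH:
        raise ValueError(f"too deeply nested (max depth {MAX_DEPTH})")
    count[0] += 1
    if count[0] > MAX_NODES:
        raise ValueError(f"too many nodes (max {MAX_NODES})")
    if isinstance(value, dict):
        children = list(value.values())
    elif isinstance(value, list):
        children = value
    else:
        children = []  # scalars are leaves
    for child in children:
        check_shape(child, depth + 1, count)
    return count[0]


print(check_shape({"a": [1, 2, {"b": 3}]}))
```

Because the depth check runs before any descent, the recursion never goes more than one level past the cap, so the walk itself cannot exhaust the Python stack.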
@@ -504,11 +504,14 @@ async def delete_config(
    if config.user_id == 0 and user.role != "admin":
        raise HTTPException(status_code=403, detail="Cannot delete system default configs")
    raise_if_used(await check_command_template_config(session, config.id), config.name)
-    slot_result = await session.exec(
-        select(CommandTemplateSlot).where(CommandTemplateSlot.config_id == config.id)
+    # Bulk delete slot rows so the round-trip count stays O(1) regardless
+    # of how many locale/slot combinations the config carries.
+    from sqlalchemy import delete as sa_delete
+    await session.execute(
+        sa_delete(CommandTemplateSlot).where(
+            CommandTemplateSlot.config_id == config.id
+        )
    )
-    for slot in slot_result.all():
-        await session.delete(slot)
    await session.delete(config)
    await session.commit()

@@ -162,17 +162,26 @@ async def delete_command_tracker(
    from ..services.command_sync import mark_dirty_for_tracker
    await mark_dirty_for_tracker(tracker.id)

-    # Delete associated listeners, collecting bot IDs for polling cleanup
+    # First read the listeners we're about to delete so we can collect the
+    # set of telegram_bot IDs whose polling state may need to be re-checked.
+    # Then issue a single bulk DELETE instead of N per-row deletes.
+    from sqlalchemy import delete as sa_delete
+
    result = await session.exec(
        select(CommandTrackerListener).where(
            CommandTrackerListener.command_tracker_id == tracker_id
        )
    )
-    bot_ids_to_check: set[int] = set()
-    for listener in result.all():
-        if listener.listener_type == "telegram_bot":
-            bot_ids_to_check.add(listener.listener_id)
-        await session.delete(listener)
+    bot_ids_to_check: set[int] = {
+        listener.listener_id
+        for listener in result.all()
+        if listener.listener_type == "telegram_bot"
+    }
+    await session.execute(
+        sa_delete(CommandTrackerListener).where(
+            CommandTrackerListener.command_tracker_id == tracker_id
+        )
+    )

    await session.delete(tracker)
    await session.commit()
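
The read-then-bulk-delete order used in `delete_command_tracker` can be sketched with the standard-library `sqlite3` module. The table layout and the `make_db` / `delete_listeners` names below are assumptions for illustration only; the real code goes through SQLAlchemy's `delete()` construct on an `AsyncSession`.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Build an in-memory table loosely shaped like the listener table (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE listener ("
        "id INTEGER PRIMARY KEY, tracker_id INTEGER, "
        "listener_type TEXT, listener_id INTEGER)"
    )
    conn.executemany(
        "INSERT INTO listener (tracker_id, listener_type, listener_id) VALUES (?, ?, ?)",
        [(1, "telegram_bot", 10), (1, "telegram_bot", 11), (1, "matrix", 5), (2, "telegram_bot", 12)],
    )
    return conn


def delete_listeners(conn: sqlite3.Connection, tracker_id: int) -> tuple[set[int], int]:
    # Step 1: collect the bot ids first; they are unreadable once the rows are gone.
    bot_ids = {
        row[0]
        for row in conn.execute(
            "SELECT listener_id FROM listener "
            "WHERE tracker_id = ? AND listener_type = 'telegram_bot'",
            (tracker_id,),
        )
    }
    # Step 2: one DELETE statement, however many rows match.
    cursor = conn.execute("DELETE FROM listener WHERE tracker_id = ?", (tracker_id,))
    conn.commit()
    return bot_ids, cursor.rowcount


print(delete_listeners(make_db(), 1))
```

The statement count stays at two (one SELECT, one DELETE) regardless of how many listeners the tracker has.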
@@ -0,0 +1,161 @@
+"""Prometheus metrics endpoint and central registry.
+
+Exposes operational metrics via ``GET /api/metrics`` in the standard
+Prometheus text format. Unauthenticated by design — Prometheus scrapers do
+not authenticate. If the API port crosses a trust boundary, disable via
+``NOTIFY_BRIDGE_METRICS_ENABLED=false``.
+
+Metrics are defined as module-level singletons so the rest of the codebase
+can ``from notify_bridge_server.api.metrics import metrics`` and call
+``metrics.dispatch_duration.labels(channel="telegram").observe(0.42)``
+without re-creating the underlying objects.
+
+Other modules MUST NOT ``import prometheus_client`` directly. Route every
+metric through :data:`metrics` (a :class:`MetricsRegistry`) so we have one
+place to swap implementations or add labels.
+"""
+
+from __future__ import annotations
+
+import logging
+from typing import Final
+
+from fastapi import APIRouter, HTTPException
+from starlette.responses import Response
+
+from prometheus_client import (
+    CONTENT_TYPE_LATEST,
+    CollectorRegistry,
+    Counter,
+    Gauge,
+    Histogram,
+    generate_latest,
+)
+
+from ..config import settings as _settings
+
+_LOGGER = logging.getLogger(__name__)
+
+
+# ---------------------------------------------------------------------------
+# Metric definitions
+# ---------------------------------------------------------------------------
+# Use a dedicated CollectorRegistry instead of the global default registry so
+# tests can construct the module repeatedly without ``Duplicated timeseries``
+# errors and so we never accidentally export Python GC / process metrics that
+# aren't part of the documented surface in OPERATIONS.md.
+
+_REGISTRY: Final[CollectorRegistry] = CollectorRegistry()
+
+
+class MetricsRegistry:
+    """Singleton holder for module-level Prometheus collectors.
+
+    Instantiated once at import time as :data:`metrics`. Keep collectors as
+    instance attributes so call sites get IDE autocomplete and so swapping
+    the collector type (e.g. Counter -> Summary) is a one-line change here.
+    """
+
+    def __init__(self, registry: CollectorRegistry) -> None:
+        self.registry = registry
+
+        # Gauge: populated on every scrape via the collector hook below.
+        self.deferred_pending = Gauge(
+            "notify_bridge_deferred_pending",
+            "Count of deferred_dispatch rows awaiting drain.",
+            registry=registry,
+        )
+
+        # Counter: incremented after each event_log row is persisted.
+        self.event_log_total = Counter(
+            "notify_bridge_event_log_total",
+            "Total events written to event_log, partitioned by status and event_type.",
+            ["status", "event_type"],
+            registry=registry,
+        )
+
+        # Histogram: observed wall-clock seconds per outbound dispatch attempt.
+        self.dispatch_duration = Histogram(
+            "notify_bridge_dispatch_duration_seconds",
+            "Wall-clock duration of one dispatch attempt to a notification channel.",
+            ["channel"],
+            registry=registry,
+            buckets=(0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0, 30.0),
+        )
+
+        # Counter: each polling provider that fails a tick increments by 1.
+        self.provider_poll_failures = Counter(
+            "notify_bridge_provider_poll_failures_total",
+            "Polling provider failures partitioned by provider type.",
+            ["provider_type"],
+            registry=registry,
+        )
+
+        # Counter: each rejected delivery to a target increments by 1.
+        self.target_send_failures = Counter(
+            "notify_bridge_target_send_failures_total",
+            "Failed sends to a target partitioned by target type and HTTP status.",
+            ["target_type", "status_code"],
+            registry=registry,
+        )
+
+
+metrics: Final[MetricsRegistry] = MetricsRegistry(_REGISTRY)
+
+
+# ---------------------------------------------------------------------------
+# Scrape hook: refresh dynamic gauges on demand
+# ---------------------------------------------------------------------------
+
+async def _refresh_deferred_pending_gauge() -> None:
+    """Populate ``deferred_pending`` by counting pending rows in the DB.
+
+    Called from the request handler before serializing — we don't poll the
+    DB on a fixed cadence to avoid a steady-state cost when nothing is
+    scraping. Kept tolerant: a DB error logs and leaves the previous value.
+    """
+    try:
+        from sqlalchemy import text
+
+        from ..database.engine import get_engine
+
+        engine = get_engine()
+        async with engine.connect() as conn:
+            result = await conn.execute(
+                text("SELECT count(*) FROM deferred_dispatch WHERE status='pending'")
+            )
+            row = result.first()
+            count = int(row[0]) if row else 0
+        metrics.deferred_pending.set(count)
+    except Exception as exc:  # noqa: BLE001 — never fail the scrape over this
+        _LOGGER.debug("deferred_pending refresh skipped: %s", exc)
+
+
+# ---------------------------------------------------------------------------
+# Router
+# ---------------------------------------------------------------------------
+
+router = APIRouter(tags=["metrics"])
+
+
+@router.get("/api/metrics")
+async def metrics_endpoint() -> Response:
+    """Expose collected metrics in Prometheus text format.
+
+    No auth by design — Prometheus scrapers don't authenticate. Gate the
+    endpoint via ``NOTIFY_BRIDGE_METRICS_ENABLED=false`` when the API port
+    is reachable from outside the trust boundary.
+    """
+    if not _settings.metrics_enabled:
+        raise HTTPException(status_code=404, detail="Metrics disabled")
+
+    await _refresh_deferred_pending_gauge()
+
+    # Stub: ``inc(0)`` creates the labelled series at zero so the endpoint
+    # reports non-empty data before callers wire instrumentation. Remove
+    # once the code paths are instrumented. The sentinel label value lets
+    # dashboards filter the noise out: ``status="bootstrap"``.
+    metrics.event_log_total.labels(status="bootstrap", event_type="metrics_scrape").inc(0)
+
+    payload = generate_latest(_REGISTRY)
+    return Response(content=payload, media_type=CONTENT_TYPE_LATEST)
@@ -152,6 +152,10 @@ async def create_notification_tracker(
    session.add(tracker)
    await session.commit()
    await session.refresh(tracker)
+    # Drop the cached enabled-trackers list so the next inbound event
+    # (HA / webhook) sees the new tracker without waiting out the TTL.
+    from ..services.event_dispatch import invalidate_tracker_cache
+    invalidate_tracker_cache(tracker.provider_id)
    if tracker.enabled:
        await schedule_tracker(
            tracker.id, tracker.scan_interval,
@@ -184,6 +188,8 @@ async def update_notification_tracker(
    session.add(tracker)
    await session.commit()
    await session.refresh(tracker)
+    from ..services.event_dispatch import invalidate_tracker_cache
+    invalidate_tracker_cache(tracker.provider_id)
    if tracker.enabled:
        await schedule_tracker(
            tracker.id, tracker.scan_interval,
@@ -201,28 +207,39 @@ async def delete_notification_tracker(
    user: User = Depends(get_current_user),
    session: AsyncSession = Depends(get_session),
 ):
+    """Delete a tracker and its child rows in three bulk statements.
+
+    The previous implementation issued one DELETE per child row plus one
+    UPDATE per event_log row, which scaled linearly with the tracker's
+    history (an old, busy tracker could hit thousands of round-trips).
+    Bulk DELETE/UPDATE collapses that to three SQL statements regardless
+    of size.
+    """
+    from sqlalchemy import delete as sa_delete, update as sa_update
+
    tracker = await _get_user_tracker(session, tracker_id, user.id)
-    # Delete associated tracker-target links
-    result = await session.exec(
-        select(NotificationTrackerTarget).where(NotificationTrackerTarget.tracker_id == tracker_id)
+    # Junction rows — direct dependents of the tracker.
+    await session.execute(
+        sa_delete(NotificationTrackerTarget).where(
+            NotificationTrackerTarget.tracker_id == tracker_id
+        )
    )
-    for tt in result.all():
-        await session.delete(tt)
-    # Delete associated tracker state
-    state_result = await session.exec(
-        select(NotificationTrackerState).where(NotificationTrackerState.tracker_id == tracker_id)
+    # Persisted scan state for this tracker.
+    await session.execute(
+        sa_delete(NotificationTrackerState).where(
+            NotificationTrackerState.tracker_id == tracker_id
+        )
    )
-    for ts in state_result.all():
-        await session.delete(ts)
-    # Nullify event log references
-    event_result = await session.exec(
-        select(EventLog).where(EventLog.tracker_id == tracker_id)
+    # Preserve the audit trail in event_log; just null the back-reference
+    # so the tracker row can be removed without an FK violation.
+    await session.execute(
+        sa_update(EventLog).where(EventLog.tracker_id == tracker_id).values(tracker_id=None)
    )
-    for el in event_result.all():
-        el.tracker_id = None
-        session.add(el)
+    provider_id_for_cache = tracker.provider_id
    await session.delete(tracker)
    await session.commit()
+    from ..services.event_dispatch import invalidate_tracker_cache
+    invalidate_tracker_cache(provider_id_for_cache)
    await unschedule_tracker(tracker_id)
    await reschedule_immich_dispatch_jobs()
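
The hunks above call `invalidate_tracker_cache(provider_id)` after every tracker create, update, and delete. The cache itself is not shown in this part of the diff, so the following is only a sketch of a per-key TTL cache with explicit invalidation; the `TrackerListCache` class, its method names, and the injectable clock are all assumptions.

```python
from __future__ import annotations

import time
from typing import Callable


class TrackerListCache:
    """Hypothetical per-provider cache: entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float = 5.0, clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so tests need not sleep
        self._entries: dict[int, tuple[float, list[str]]] = {}

    def get(self, provider_id: int, loader: Callable[[int], list[str]]) -> list[str]:
        now = self._clock()
        hit = self._entries.get(provider_id)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]  # still fresh: skip the loader
        value = loader(provider_id)
        self._entries[provider_id] = (now, value)
        return value

    def invalidate(self, provider_id: int) -> None:
        # Called after tracker CRUD so the next read reloads immediately.
        self._entries.pop(provider_id, None)


cache = TrackerListCache()
print(cache.get(1, lambda pid: [f"tracker-for-{pid}"]))
```

Explicit invalidation on writes keeps the TTL as a safety net only, so a newly created tracker is visible to the next inbound event instead of up to five seconds later.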

@@ -1,9 +1,10 @@
 """Service provider management API routes."""

 import logging
+import secrets

 from fastapi import APIRouter, Depends, HTTPException, status
-from pydantic import AnyHttpUrl, BaseModel, ValidationError, field_validator
+from pydantic import AnyHttpUrl, BaseModel, ValidationError, field_validator, model_validator
 from sqlmodel import select
 from sqlmodel.ext.asyncio.session import AsyncSession
 from typing import Any
@@ -94,14 +95,36 @@ class PayloadMapping(BaseModel):


 class WebhookProviderConfig(BaseModel):
-    auth_mode: str = "none"
+    # Default to bearer to avoid silently creating an open relay. Operators
+    # who genuinely want an unauthenticated endpoint must set
+    # ``acknowledge_unauthenticated=True`` to opt in explicitly.
+    auth_mode: str = "bearer_token"
    webhook_secret: str | None = None
+    # Explicit opt-in required for ``auth_mode="none"``. Without this flag
+    # an unauthenticated webhook is rejected at validation time so a
+    # mis-clicked dropdown can't expose the bridge to arbitrary internet
+    # traffic.
+    acknowledge_unauthenticated: bool = False
    payload_mappings: list[PayloadMapping] = []
    event_type_path: str | None = None
    collection_path: str | None = None
    store_payloads: bool = True
    max_stored_payloads: int = 20  # 1-100

+    @model_validator(mode="after")
+    def _check_auth(self) -> "WebhookProviderConfig":
+        if self.auth_mode == "none" and not self.acknowledge_unauthenticated:
+            raise ValueError(
+                "auth_mode='none' creates an open webhook endpoint; set "
+                "acknowledge_unauthenticated=true to confirm this is intentional"
+            )
+        if self.auth_mode in ("bearer_token", "hmac_sha256") and not self.webhook_secret:
+            # Auto-generate a strong secret if the operator forgot to supply
+            # one — better than rejecting an otherwise-valid config and far
+            # better than silently leaving the endpoint open.
+            self.webhook_secret = secrets.token_urlsafe(32)
+        return self
+

 class HomeAssistantProviderConfig(BaseModel):
    url: str
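
The opt-in rule in `_check_auth` can be observed without Pydantic. The sketch below swaps the `model_validator` for a standard-library dataclass with `__post_init__`; `WebhookAuthSketch` is a hypothetical name and carries only the three fields the rule touches.

```python
import secrets
from dataclasses import dataclass
from typing import Optional


@dataclass
class WebhookAuthSketch:
    """Dataclass stand-in for the auth fields of WebhookProviderConfig."""

    auth_mode: str = "bearer_token"
    webhook_secret: Optional[str] = None
    acknowledge_unauthenticated: bool = False

    def __post_init__(self) -> None:
        # An open endpoint must be confirmed explicitly.
        if self.auth_mode == "none" and not self.acknowledge_unauthenticated:
            raise ValueError(
                "auth_mode='none' requires acknowledge_unauthenticated=true"
            )
        # Authenticated modes never end up without a secret.
        if self.auth_mode in ("bearer_token", "hmac_sha256") and not self.webhook_secret:
            self.webhook_secret = secrets.token_urlsafe(32)


print(WebhookAuthSketch().auth_mode)
print(WebhookAuthSketch(auth_mode="none", acknowledge_unauthenticated=True).webhook_secret)
```

One consequence worth noting for operators: a config saved without a secret comes back with a generated one, so the caller has to read the stored value to configure the sending side.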
@@ -291,15 +291,19 @@ async def get_nav_counts(
 ):
    """Return entity counts for sidebar navigation badges.

-    Note: queries run sequentially because SQLAlchemy AsyncSession is NOT safe
-    for concurrent use within a single session (no asyncio.gather).  We
-    minimise round-trips by combining user + system counts and per-type
-    target counts into single aggregate queries where possible.
+    Combines user-owned counts, system-owned shared counts, and per-type
+    target counts into a single round-trip via a UNION ALL of label + count
+    rows. An AsyncSession is not safe for concurrent use, so asyncio.gather
+    over one session is not an option; collapsing 16 SELECTs into one is.
    """
+    from sqlalchemy import literal, union_all
+
    counts: dict[str, int] = {}

-    # --- 1) User-owned entity counts (one query per model) ---
-    for model, key in [
+    user_id = user.id
+
+    # User-owned counts: one (label, count) per model.
+    user_models = [
        (ServiceProvider, "providers"),
        (NotificationTracker, "notification_trackers"),
        (TrackingConfig, "tracking_configs"),
@@ -311,40 +315,52 @@ async def get_nav_counts(
        (CommandTracker, "command_trackers"),
        (CommandConfig, "command_configs"),
        (CommandTemplateConfig, "command_template_configs"),
-    ]:
-        count = (await session.exec(
-            select(func.count()).select_from(model).where(model.user_id == user.id)
-        )).one()
-        counts[key] = count
-
-    # --- 2) Add system-owned counts (user_id=0) for shared entities ---
-    for model, key in [
+    ]
+    # System-owned shared counts (user_id=0) folded back into the same key.
+    system_models = [
        (TemplateConfig, "template_configs"),
        (CommandTemplateConfig, "command_template_configs"),
        (TrackingConfig, "tracking_configs"),
        (CommandConfig, "command_configs"),
-    ]:
-        system_count = (await session.exec(
-            select(func.count()).select_from(model).where(model.user_id == 0)
-        )).one()
-        counts[key] += system_count
-
-    # --- 3) Per-type target counts in a single query using conditional aggregation ---
+    ]
    target_types = ("telegram", "webhook", "email", "discord", "slack", "ntfy", "matrix")
-    type_counts_result = (await session.exec(
-        select(
-            NotificationTarget.type,
-            func.count(),
+
+    # Initialise counts to 0 so missing UNION rows surface as zeroes
+    # instead of KeyErrors when a category has no rows.
+    for _model, key in user_models:
+        counts[key] = 0
+    for ttype in target_types:
+        counts[f"targets_{ttype}"] = 0
+
+    queries = []
+    for model, key in user_models:
+        queries.append(
+            select(literal(key).label("k"), func.count().label("c"))
+            .select_from(model).where(model.user_id == user_id)
        )
-        .where(
-            NotificationTarget.user_id == user.id,
-            NotificationTarget.type.in_(target_types),
+    for model, key in system_models:
+        queries.append(
+            select(literal(f"__sys__:{key}").label("k"), func.count().label("c"))
+            .select_from(model).where(model.user_id == 0)
        )
-        .group_by(NotificationTarget.type)
-    )).all()
-    type_counts_map = dict(type_counts_result)
-    for target_type in target_types:
-        counts[f"targets_{target_type}"] = type_counts_map.get(target_type, 0)
+    for ttype in target_types:
+        queries.append(
+            select(literal(f"target:{ttype}").label("k"), func.count().label("c"))
+            .select_from(NotificationTarget).where(
+                NotificationTarget.user_id == user_id,
+                NotificationTarget.type == ttype,
+            )
+        )
+
+    union_q = union_all(*queries)
+    rows = (await session.execute(union_q)).all()
+    for label, value in rows:
+        if label.startswith("__sys__:"):
+            counts[label.removeprefix("__sys__:")] += int(value or 0)
+        elif label.startswith("target:"):
+            counts[f"targets_{label.removeprefix('target:')}"] = int(value or 0)
+        else:
+            counts[label] += int(value or 0)  # += so UNION row order cannot clobber a system count

    return counts
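
The label-plus-count UNION ALL shape can be reproduced with `sqlite3` alone. This is a sketch with an assumed two-table schema and a hypothetical `nav_counts` helper; unlike the hunk above it reuses one label for user-owned and system-owned rows and accumulates with `+=`, so the result does not depend on the order in which the database returns the rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE provider (id INTEGER PRIMARY KEY, user_id INTEGER);
    CREATE TABLE template_config (id INTEGER PRIMARY KEY, user_id INTEGER);
    INSERT INTO provider (user_id) VALUES (7), (7), (8);
    INSERT INTO template_config (user_id) VALUES (7), (0), (0);
    """
)


def nav_counts(conn: sqlite3.Connection, user_id: int) -> dict[str, int]:
    sql = """
        SELECT 'providers' AS k, count(*) AS c FROM provider WHERE user_id = :uid
        UNION ALL
        SELECT 'template_configs', count(*) FROM template_config WHERE user_id = :uid
        UNION ALL
        SELECT 'template_configs', count(*) FROM template_config WHERE user_id = 0
    """
    # Pre-seed every key so categories without rows still report zero.
    counts = {"providers": 0, "template_configs": 0}
    for key, value in conn.execute(sql, {"uid": user_id}):
        counts[key] += value  # accumulate: user rows and system rows share a key
    return counts


print(nav_counts(conn, 7))
```

Each branch of the union is an aggregate without GROUP BY, so it always yields exactly one row and the result set has a fixed, small size.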

@@ -287,6 +287,8 @@ async def get_template_variables(
        **_nut_variables(),
        # --- Home Assistant slots ---
        **_home_assistant_variables(),
+        # --- Bridge self-monitoring slots ---
+        **_bridge_self_variables(),
        # --- Scheduler slots ---
        "message_scheduled_message": {
            "description": "Notification for scheduled message events",
@@ -487,6 +489,32 @@ def _home_assistant_variables() -> dict:
    }


+def _bridge_self_variables() -> dict:
+    common = {
+        "failure_type": "Which condition fired (poll_failures, deferred_backlog, target_failures)",
+        "subject_id": "Affected entity ID (tracker_id, target_id, or 0 for backlog)",
+        "subject_name": "Human-readable name of the affected entity",
+        "count": "Consecutive failure count or current backlog size",
+        "threshold": "Configured threshold that was crossed",
+        "last_error": "Last underlying error message (truncated)",
+        "details": "Extra structured context dict (use {{ details | tojson }})",
+    }
+    return {
+        "message_bridge_self_poll_failures": {
+            "description": "Tracker poll failures crossed threshold",
+            "variables": common,
+        },
+        "message_bridge_self_deferred_backlog": {
+            "description": "Deferred dispatch backlog crossed threshold",
+            "variables": common,
+        },
+        "message_bridge_self_target_failures": {
+            "description": "Target send failures crossed threshold",
+            "variables": common,
+        },
+    }
+
+
@router.post("", status_code=status.HTTP_201_CREATED)
 async def create_config(
    body: TemplateConfigCreate,
@@ -64,9 +64,19 @@ async def create_user(
    admin: User = Depends(require_admin),
    session: AsyncSession = Depends(get_session),
 ):
-    """Create a new user (admin only)."""
+    """Create a new user (admin only).
+
+    Username is normalised to ``strip().lower()`` so "Admin" and "admin"
+    cannot coexist. We do not add a CHECK constraint at the DB level — that
+    would require rebuilding the table on SQLite — so the application is
+    the single source of truth for normalisation.
+    """
+    # Normalise so case-only variants collide with existing accounts.
+    username = (body.username or "").strip().lower()
+    if not username:
+        raise HTTPException(status_code=400, detail="Username cannot be empty")
    # Check for duplicate username
-    result = await session.exec(select(User).where(User.username == body.username))
+    result = await session.exec(select(User).where(User.username == username))
    if result.first():
        raise HTTPException(status_code=409, detail="Username already exists")
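
The normalise-then-check rule in this hunk can be sketched in isolation. This is a minimal illustration, not the route code: `normalise_username`, `validate_new_username`, and the in-memory `existing` set are hypothetical stand-ins for the handler and the `User` table lookup.

```python
from typing import Optional, Set


def normalise_username(raw: Optional[str]) -> str:
    """Canonical form used for both storage and duplicate checks."""
    return (raw or "").strip().lower()


def validate_new_username(raw: Optional[str], existing: Set[str]) -> str:
    """Return the normalised name, or raise ValueError (hypothetical helper).

    `existing` stands in for the database lookup and is assumed to hold
    already-normalised usernames.
    """
    username = normalise_username(raw)
    if not username:
        raise ValueError("Username cannot be empty")
    if username in existing:
        raise ValueError("Username already exists")
    return username


existing = {"admin"}
print(validate_new_username("  Alice ", existing))  # prints: alice
for bad in ("Admin", "   ", None):
    try:
        validate_new_username(bad, existing)
    except ValueError as exc:
        print(repr(bad), "->", exc)
```

One caveat, assuming the column uses a case-sensitive collation: an exact-match lookup on the normalised value only collides with rows that were themselves stored normalised, so a legacy mixed-case row would not be caught by this check alone.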

@@ -74,13 +84,25 @@ async def create_user(
        raise HTTPException(status_code=400, detail="Password must be at least 8 characters")

    user = User(
-        username=body.username,
+        username=username,
        hashed_password=await _hash_password(body.password),
        role=body.role if body.role in ("admin", "user") else "user",
    )
    session.add(user)
    await session.commit()
    await session.refresh(user)
+
+    # Auto-create the bridge_self provider so the new user immediately gets
+    # internal-failure notifications without manual setup. Best-effort —
+    # a seeding hiccup must not fail the user creation itself.
+    try:
+        from ..database.seeds import ensure_bridge_self_provider_for_user
+        await ensure_bridge_self_provider_for_user(session, user.id)
+        await session.commit()
+    except Exception:  # noqa: BLE001
+        _LOGGER.exception("Failed to auto-seed bridge_self provider for user %s", user.id)
+        await session.rollback()
+
    return {"id": user.id, "username": user.username, "role": user.role}


@@ -103,14 +125,19 @@ async def update_user(
    identity_changed = False

    if body.username is not None and body.username != user.username:
-        new_username = body.username.strip()
+        # Normalise to match the case-insensitive uniqueness rule applied
+        # at user creation. Comparing the normalised form with the stored
+        # username skips input that differs only by case or whitespace,
+        # yet still lets a legacy mixed-case account be renamed to lowercase.
+        new_username = (body.username or "").strip().lower()
        if not new_username:
            raise HTTPException(status_code=400, detail="Username cannot be empty")
-        dup = await session.exec(select(User).where(User.username == new_username))
-        if dup.first():
-            raise HTTPException(status_code=409, detail="Username already exists")
-        user.username = new_username
-        identity_changed = True
+        if new_username != user.username:
+            dup = await session.exec(select(User).where(User.username == new_username))
+            if dup.first():
+                raise HTTPException(status_code=409, detail="Username already exists")
+            user.username = new_username
+            identity_changed = True

    if body.role is not None and body.role != user.role:
        if body.role not in ("admin", "user"):
@@ -191,11 +218,139 @@ async def delete_user(
    admin: User = Depends(require_admin),
    session: AsyncSession = Depends(get_session),
 ):
-    """Delete a user (admin only, cannot delete self)."""
+    """Delete a user (admin only, cannot delete self).
+
+    Cascades through every user-owned table by hand. The model declares
+    ``ondelete=CASCADE`` on each FK, but SQLite only enforces FK actions
+    on tables created *after* the ondelete clause was added — existing
+    installs upgraded from older schemas need this Python-side cascade
+    instead of a multi-step table rebuild.
+
+    TODO: drop this manual cascade once we ship a real
+    rebuild-with-FK-actions migration for legacy SQLite installs (or
+    once Postgres becomes the default deployment target).
+    """
+    from sqlalchemy import delete as sa_delete, update as sa_update
+
    if user_id == admin.id:
        raise HTTPException(status_code=400, detail="Cannot delete yourself")
    user = await session.get(User, user_id)
    if not user:
        raise HTTPException(status_code=404, detail="User not found")
-    await session.delete(user)
-    await session.commit()
+
+    # Lazy import to avoid circulars.
+    from ..database.models import (
+        Action,
+        ActionExecution,
+        ActionRule,
+        CommandConfig,
+        CommandTracker,
+        CommandTrackerListener,
+        DeferredDispatch,
+        EventLog,
+        NotificationTarget,
+        NotificationTracker,
+        NotificationTrackerState,
+        NotificationTrackerTarget,
+        ServiceProvider,
+        TelegramBot,
+        TelegramChat,
+        TrackingConfig,
+        EmailBot,
+        MatrixBot,
+    )
+
+    # Wrap the entire cascade in one transaction so a failure mid-way
+    # cannot leave dangling child rows pointing at a missing user.
+    try:
+        # Order: leaves first, then their parents, finally the user. With
+        # FK enforcement off the order is harmless; on engines that do
+        # enforce FKs (Postgres) it follows the dependency graph and so
+        # avoids constraint violations mid-cascade.
+
+        # Resolve tracker ids first (needed for state + link cleanup
+        # before the parent rows themselves are deleted further down).
+        from sqlmodel import select as _select
+        tracker_ids = list((await session.exec(
+            _select(NotificationTracker.id).where(NotificationTracker.user_id == user_id)
+        )).all())
+        if tracker_ids:
+            await session.execute(
+                sa_delete(NotificationTrackerState).where(
+                    NotificationTrackerState.tracker_id.in_(tracker_ids)
+                )
+            )
+            await session.execute(
+                sa_delete(NotificationTrackerTarget).where(
+                    NotificationTrackerTarget.tracker_id.in_(tracker_ids)
+                )
+            )
+            await session.execute(
+                sa_delete(DeferredDispatch).where(
+                    DeferredDispatch.tracker_id.in_(tracker_ids)
+                )
+            )
+
+        # Action children: rules and execution log.
+        action_ids = list((await session.exec(
+            _select(Action.id).where(Action.user_id == user_id)
+        )).all())
+        if action_ids:
+            await session.execute(
+                sa_delete(ActionRule).where(ActionRule.action_id.in_(action_ids))
+            )
+            await session.execute(
+                sa_delete(ActionExecution).where(
+                    ActionExecution.action_id.in_(action_ids)
+                )
+            )
+
+        # Command tracker children: listeners.
+        cmd_tracker_ids = list((await session.exec(
+            _select(CommandTracker.id).where(CommandTracker.user_id == user_id)
+        )).all())
+        if cmd_tracker_ids:
+            await session.execute(
+                sa_delete(CommandTrackerListener).where(
+                    CommandTrackerListener.command_tracker_id.in_(cmd_tracker_ids)
+                )
+            )
+
+        # Telegram bot children: chats.
+        bot_ids = list((await session.exec(
+            _select(TelegramBot.id).where(TelegramBot.user_id == user_id)
+        )).all())
+        if bot_ids:
+            await session.execute(
+                sa_delete(TelegramChat).where(TelegramChat.bot_id.in_(bot_ids))
+            )
+
+        # Owned top-level entities (user is a direct owner).
+        for model in (
+            NotificationTracker,
+            NotificationTarget,
+            CommandTracker,
+            CommandConfig,
+            TrackingConfig,
+            Action,
+            TelegramBot,
+            EmailBot,
+            MatrixBot,
+            ServiceProvider,
+        ):
+            await session.execute(
+                sa_delete(model).where(model.user_id == user_id)
+            )
+
+        # EventLog: keep the audit trail but null the owner reference so
+        # the rows survive the user delete (matches the SET NULL semantic
+        # declared on the model).
+        await session.execute(
+            sa_update(EventLog).where(EventLog.user_id == user_id).values(user_id=None)
+        )
+
+        await session.delete(user)
+        await session.commit()
+    except Exception:
+        await session.rollback()
+        raise
@@ -12,6 +12,8 @@ from fastapi import APIRouter, HTTPException, Request
 from sqlmodel import select
 from sqlmodel.ext.asyncio.session import AsyncSession

+from ..auth.routes import limiter
+
 from notify_bridge_core.models.events import ServiceEvent
 from notify_bridge_core.providers.gitea.event_parser import parse_webhook as parse_gitea_webhook
 from notify_bridge_core.providers.planka.event_parser import parse_webhook as parse_planka_webhook
@@ -240,6 +242,10 @@ async def planka_webhook(token: str, request: Request):
    if not _verify_planka_token(webhook_secret, request):
        raise HTTPException(status_code=403, detail="Invalid token")

+    # Read the body only AFTER the auth check so a caller without the bearer
+    # token costs us no body read at all. The token is in the header, not the body.
+    raw_body = await _read_bounded_body(request)
+
    # Parse payload from the bounded raw_body we already read.
    try:
        payload = json.loads(raw_body.decode("utf-8"))
@@ -320,6 +326,8 @@ def _verify_generic_webhook_auth(
 _SENSITIVE_HEADER_SUBSTR = (
    "token", "auth", "key", "secret", "signature", "password", "credential",
    "cookie", "x-api", "x-hub-signature",
+    # Extended for per-key body redaction. Only "csrf" adds coverage; the rest already match "auth"/"secret".
+    "oauth", "client_secret", "webhook_secret", "csrf",
 )


@@ -328,6 +336,28 @@ def _is_sensitive_header(name: str) -> bool:
    return any(s in n for s in _SENSITIVE_HEADER_SUBSTR)


+_REDACTED_PLACEHOLDER = "[REDACTED]"
+
+
+def _redact_sensitive_body(value: object) -> object:
+    """Walk a parsed JSON body and redact values for sensitive-named keys.
+
+    Returns a defensively-copied structure so the caller's object is
+    never mutated (callers downstream still consume the original).
+    """
+    if isinstance(value, dict):
+        cleaned: dict[str, object] = {}
+        for k, v in value.items():
+            if isinstance(k, str) and _is_sensitive_header(k):
+                cleaned[k] = _REDACTED_PLACEHOLDER
+            else:
+                cleaned[k] = _redact_sensitive_body(v)
+        return cleaned
+    if isinstance(value, list):
+        return [_redact_sensitive_body(v) for v in value]
+    return value
+
+
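
The recursive redaction added above can be tried standalone. The sketch below is an approximation under stated assumptions: `SENSITIVE_SUBSTR` is an abbreviated, hypothetical subset of the real tuple, and `is_sensitive` assumes the header check lowercases the name before substring matching.

```python
from typing import Any

# Abbreviated, hypothetical subset of the substrings used in the diff.
SENSITIVE_SUBSTR = ("token", "auth", "key", "secret", "password", "csrf")
REDACTED = "[REDACTED]"


def is_sensitive(name: str) -> bool:
    """Case-insensitive substring match (assumed to mirror the header check)."""
    lowered = name.lower()
    return any(s in lowered for s in SENSITIVE_SUBSTR)


def redact(value: Any) -> Any:
    """Return a redacted copy of a parsed JSON value; the input is not mutated."""
    if isinstance(value, dict):
        return {
            k: REDACTED if isinstance(k, str) and is_sensitive(k) else redact(v)
            for k, v in value.items()
        }
    if isinstance(value, list):
        return [redact(v) for v in value]
    return value


payload = {
    "author": "alice",  # redacted too, because "author" contains "auth"
    "access_token": "abc123",
    "items": [{"api_key": "k-1", "title": "Build passed"}],
}
safe = redact(payload)
print(safe)
print(payload["access_token"])  # the original dict still holds the raw value
```

Substring matching errs on the side of over-redaction: harmless keys such as `author` are masked along with real secrets, which seems an acceptable trade-off for a persisted debug log.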
 def _filter_headers(raw_headers: dict[str, str]) -> dict[str, str]:
    """Keep only safe headers for logging (strip Authorization, signatures, tokens).

@@ -358,11 +388,15 @@ async def _save_webhook_log(
    """Insert a webhook payload log entry and prune old ones."""
    try:
        body_json = body if isinstance(body, dict) else {}
+        # Strip sensitive values before persistence. Webhook payloads
+        # routinely carry OAuth tokens or secrets in the body, and admins
+        # who can read this log have no need to see them.
+        safe_body = _redact_sensitive_body(body_json) if body_json else {}
        session.add(WebhookPayloadLog(
            provider_id=provider_id,
            method=method,
            headers=headers,
-            body=body_json,
+            body=safe_body,
            status=status,
            extracted_fields=extracted_fields or {},
            error_message=error_message,
@@ -386,13 +420,19 @@ async def _save_webhook_log(
        _LOGGER.warning("Failed to save webhook payload log for provider %d", provider_id, exc_info=True)
        try:
            await session.rollback()
-        except Exception:
-            pass
+        except Exception:  # noqa: BLE001
+            _LOGGER.exception("Rollback after payload-log save failed")


@router.post("/webhook/{token}")
+@limiter.limit("60/minute")
 async def generic_webhook(token: str, request: Request):
-    """Receive a generic webhook, extract variables via JSONPath, and dispatch notifications."""
+    """Receive a generic webhook, extract variables via JSONPath, and dispatch notifications.
+
+    Per-IP rate limit (60/min) caps blast radius from a single source —
+    legitimate providers send well below this; anything higher is either
+    a misconfigured retry loop or abuse.
+    """
    engine = get_engine()

    # --- Load provider and validate auth ---