feat: production readiness — security, perf, bug fixes, bridge self-monitoring

Comprehensive multi-area pass driven by a parallel 8-agent production
review. Frontend, backend, database, security, performance, operational,
plus a new self-monitoring feature.

## Critical fixes

- Planka webhook: reads bounded raw body (was NameError on every call)
- HA quiet hours: ha_state_changed/automation_triggered/
  service_called/event_fired added to deferrable set (were silently
  dropped)
- DNS-rebinding SSRF: PinnedResolver wired into shared aiohttp session
- Telegram inbound webhook: secret now mandatory (401 without)
- Generic webhook: auth_mode="none" requires explicit
  acknowledge_unauthenticated=true; per-IP rate limit 60/min
- svelte-check: 5 null-narrowing errors in EventDetailModal fixed
- Provider hardcoding: Immich-only block extracted to descriptor
  featureDiscoveryHint
- command_sync: snapshot+expunge bot before exiting AsyncSession

## Bug fixes

- notifier asyncio.gather(return_exceptions=True): one bad chat no
  longer cancels peer sends
- NotificationDispatcher hoisted out of per-tracker loop
- Provider credential resolution unified across all 5 dispatch sites
- HA asyncio.shield now drains inner task on cancellation
- Provider construction switched from if/elif ladder to factory registry
- NUT first poll seeds silently (no spurious ups_on_battery)
- Quiet-hours gate: event-type-disabled now wins over deferral
- APScheduler drain job ID resolution upgraded to seconds
- HA on_status_change wired through to EventLog
- Webhook payload rollback failures now logged (not swallowed)
- Batched receivers/chats/bots in load_link_data (was per-target N+1)
- flag_modified on JSON column reassignments in deferred_dispatch

## Database

- UNIQUE indexes on service_provider.webhook_token,
  telegram_bot.webhook_path_id, partial UNIQUE on telegram_bot.bot_id,
  telegram_chat(bot_id, chat_id), notification_tracker_target unique
  link, partial UNIQUE on bridge_self provider per user
- Composite ix_event_log_user_event_type_created index
- save_chat_from_webhook switched to ON CONFLICT DO UPDATE
- ondelete=CASCADE on user-id FKs (model annotation; app-side cascade
  delete added for existing data)
- delete_notification_tracker converted from N+1 to bulk DELETE/UPDATE
- Module-level asyncio.Lock replaced with lazy _get_lock() pattern
- VACUUM INTO snapshot now PRAGMA integrity_check verified

## Performance

- Jinja2 template compilation LRU cached (lru_cache maxsize=512)
- Per-locale render cache in NotificationDispatcher (skips re-rendering
  identical content for receivers sharing a locale)
- Tracker list cached per provider_id with 5s TTL + explicit
  invalidation on tracker CRUD (relieves HA chat-bus rate query
  pressure)
- Nav-counts collapsed from 16 round-trips to single UNION ALL
- HA event_log: skip persisting empty assets_added/removed events

## Security hardening

- Mass-assignment guard on Action create/update; cron sub-minute reject
- Backup JSON depth/node-count cap (depth ≤ 10, nodes ≤ 100k)
- _sanitize_config extended to all JSON-typed fields on backup import
- Telegram _safe_get walks redirects manually with SSRF revalidation
- Bcrypt 72-byte password length cap with clear 422
- Webhook payload body redaction; sensitive substring set extended with
  oauth/client_secret/webhook_secret/csrf in both header filter and
  template extras filter

## Frontend

- 76 catch (err: any) sites converted to errMsg(err) helper
- globalProviderFilter: pure getter; reconciliation moved to one-time
  $effect in +layout
- Provider-filter binding: removed paired $effects + _syncingFilter
  flag, now one-way derived
- entity-cache: separate _refreshing flag for background re-fetches
- api.ts 401 handling: AuthRedirectError class + dedup _redirecting
  flag, goto() instead of window.location.href
- a11y: aria-expanded on mobile More, role=switch + aria-checked on
  Telegram bot toggles

## Tests & operations

- CI pytest gate added to .gitea/workflows/build.yml + release.yml
  (wheel-built install to dodge editable-install slowness)
- /api/ready upgraded to deep healthcheck (db SELECT 1,
  scheduler.running, HA supervisor presence) returning
  {ready, checks, errors, version}
- /api/metrics endpoint with prometheus_client (deferred_pending,
  event_log_total, dispatch_duration, poll_failures, send_failures)
- New OPERATIONS.md covering deploy, healthchecks, metrics,
  backup/restore procedures, log handling, common scenarios, upgrade
  flow
- New tests: test_bridge_self (11), test_gitea_parser (9),
  test_planka_parser (6), test_immich_change_detector (6),
  test_backup_roundtrip (1)

## New feature: bridge self-monitoring

- New bridge_self provider type: internal sink for bridge health events
- Three event types: bridge_self_poll_failures (consecutive tracker
  poll failures), bridge_self_deferred_backlog (pending count crosses
  threshold), bridge_self_target_failures (consecutive 5xx/network
  failures per target)
- Per-user thresholds (defaults: 3 / 100 / 5) configurable via the
  provider config form
- Auto-seeded on user create + /setup + boot backfill for existing users
- Anti-spam: counters reset after emission; backlog uses transition
  latch
- Self-loop guard: bridge_self failures don't count toward
  target-failure thresholds (logged only); wire to your own
  Telegram/Email/Matrix to get notified when polls/dispatches/sends fail
- 6 default templates (3 events × 2 locales), tracking config columns
  with backfill migration, frontend descriptor (excluded from "create
  provider" wizard since auto-managed)

Operator-visible behavior changes (call out in release notes):

- NOTIFY_BRIDGE_TELEGRAM_WEBHOOK_SECRET now REQUIRED for webhook mode
- Existing webhook providers with auth_mode="none" need explicit opt-in
- Generic webhook endpoint rate-limited 60/min per source IP
- HA disconnect/reconnect writes ha_status_* EventLog rows
- Every user gets a bridge_self provider; wire it to a target to
  receive failure alerts

Pre-existing test failures (test_ssrf, test_release_provider) on Python
3.13 are unrelated; CI runs on 3.12.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 02:16:49 +03:00
parent 22127e2a59
commit 10d30fc956
97 changed files with 5423 additions and 821 deletions
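
Note on the notifier fix listed under "Bug fixes": the notifier code itself is not part of the hunks shown here, so the following is only a minimal sketch of the `asyncio.gather(return_exceptions=True)` pattern. `send_to_chat`, `send_all`, and the chat IDs are hypothetical names invented for illustration. With `return_exceptions=True`, a failing send comes back as a value in the result list instead of being raised at the `await`, so every outcome can be inspected.

```python
import asyncio


async def send_to_chat(chat_id: int) -> str:
    """Hypothetical sender: chat 2 stands in for a chat that rejects the message."""
    await asyncio.sleep(0)
    if chat_id == 2:
        raise RuntimeError(f"chat {chat_id} rejected the message")
    return f"sent:{chat_id}"


async def send_all(chat_ids: list[int]) -> tuple[list[str], list[BaseException]]:
    """Send to every chat and collect failures instead of raising on the first one."""
    results = await asyncio.gather(
        *(send_to_chat(c) for c in chat_ids),
        return_exceptions=True,
    )
    # Results keep the input order; exceptions appear in place of return values.
    sent = [r for r in results if not isinstance(r, BaseException)]
    failed = [r for r in results if isinstance(r, BaseException)]
    return sent, failed


sent, failed = asyncio.run(send_all([1, 2, 3]))
```

Here `sent` holds the two successful results and `failed` holds the single `RuntimeError`, which a caller could log per chat.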
@@ -12,6 +12,8 @@ from fastapi import APIRouter, HTTPException, Request
 from sqlmodel import select
 from sqlmodel.ext.asyncio.session import AsyncSession

+from ..auth.routes import limiter
+
 from notify_bridge_core.models.events import ServiceEvent
 from notify_bridge_core.providers.gitea.event_parser import parse_webhook as parse_gitea_webhook
 from notify_bridge_core.providers.planka.event_parser import parse_webhook as parse_planka_webhook
@@ -240,6 +242,10 @@ async def planka_webhook(token: str, request: Request):
    if not _verify_planka_token(webhook_secret, request):
        raise HTTPException(status_code=403, detail="Invalid token")

+    # Read body AFTER auth check so an attacker without the bearer token
+    # can't force an unbounded read. Token is in the header, not the body.
+    raw_body = await _read_bounded_body(request)
+
    # Parse payload from the bounded raw_body we already read.
    try:
        payload = json.loads(raw_body.decode("utf-8"))
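
Note on the hunk above: `_read_bounded_body` is called but its definition is not shown in this diff. The sketch below illustrates one way a bounded read can work, under stated assumptions: `read_bounded`, `BodyTooLarge`, `fake_stream`, and the 1 MiB cap are hypothetical, and the function takes any async iterator of byte chunks (Starlette's `Request.stream()` yields chunks of that shape) so it runs with the standard library alone.

```python
import asyncio
from collections.abc import AsyncIterator

MAX_BODY_BYTES = 1024 * 1024  # hypothetical cap; the real limit is not shown in the diff


class BodyTooLarge(Exception):
    """Raised when the streamed body would exceed the cap (hypothetical name)."""


async def read_bounded(chunks: AsyncIterator[bytes], limit: int = MAX_BODY_BYTES) -> bytes:
    """Accumulate streamed chunks, refusing any chunk that would pass `limit`."""
    buf = bytearray()
    async for chunk in chunks:
        if len(buf) + len(chunk) > limit:
            raise BodyTooLarge(f"body exceeds {limit} bytes")
        buf.extend(chunk)
    return bytes(buf)


async def fake_stream(parts: list[bytes]) -> AsyncIterator[bytes]:
    """Stand-in for a request body stream."""
    for part in parts:
        yield part


body = asyncio.run(read_bounded(fake_stream([b'{"a":', b' 1}']), limit=64))
```

Checking the size before extending the buffer means the handler never holds more than `limit` bytes, regardless of what the client sends.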
@@ -320,6 +326,8 @@ def _verify_generic_webhook_auth(
 _SENSITIVE_HEADER_SUBSTR = (
    "token", "auth", "key", "secret", "signature", "password", "credential",
    "cookie", "x-api", "x-hub-signature",
+    # Extended for per-key body redaction; harmless extras for header check.
+    "oauth", "client_secret", "webhook_secret", "csrf",
 )


@@ -328,6 +336,28 @@ def _is_sensitive_header(name: str) -> bool:
    return any(s in n for s in _SENSITIVE_HEADER_SUBSTR)


+_REDACTED_PLACEHOLDER = "[REDACTED]"
+
+
+def _redact_sensitive_body(value: object) -> object:
+    """Walk a parsed JSON body and redact values for sensitive-named keys.
+
+    Returns a defensively-copied structure so the caller's object is
+    never mutated (callers downstream still consume the original).
+    """
+    if isinstance(value, dict):
+        cleaned: dict[str, object] = {}
+        for k, v in value.items():
+            if isinstance(k, str) and _is_sensitive_header(k):
+                cleaned[k] = _REDACTED_PLACEHOLDER
+            else:
+                cleaned[k] = _redact_sensitive_body(v)
+        return cleaned
+    if isinstance(value, list):
+        return [_redact_sensitive_body(v) for v in value]
+    return value
+
+
 def _filter_headers(raw_headers: dict[str, str]) -> dict[str, str]:
    """Keep only safe headers for logging (strip Authorization, signatures, tokens).

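
Note on the hunk above: a standalone sketch of how the new body redaction behaves on a nested payload. The walk mirrors `_redact_sensitive_body` from the diff; the names `is_sensitive`, `redact`, and the sample payload are illustrative, and the lower-casing inside `is_sensitive` is an assumption, since only the last line of `_is_sensitive_header` is visible here.

```python
_SENSITIVE_SUBSTR = (
    "token", "auth", "key", "secret", "signature", "password", "credential",
    "cookie", "x-api", "x-hub-signature",
    "oauth", "client_secret", "webhook_secret", "csrf",
)
_REDACTED = "[REDACTED]"


def is_sensitive(name: str) -> bool:
    # Assumption: the real helper lower-cases the name before matching.
    n = name.lower()
    return any(s in n for s in _SENSITIVE_SUBSTR)


def redact(value: object) -> object:
    """Same walk as _redact_sensitive_body in the diff: build a copy, never mutate."""
    if isinstance(value, dict):
        return {
            k: _REDACTED if isinstance(k, str) and is_sensitive(k) else redact(v)
            for k, v in value.items()
        }
    if isinstance(value, list):
        return [redact(v) for v in value]
    return value


payload = {
    "repo": "demo",
    "client_secret": "s3cr3t",
    "sender": {"login": "alice", "Access_Token": "abc"},
    "items": [{"csrf": "x", "id": 1}],
    "author": "bob",  # contains "auth", so it is redacted as well
}
cleaned = redact(payload)
```

Two properties follow from substring matching. Keys such as `author` are redacted because they contain `auth`, so the filter errs toward over-redaction. Of the four added entries, `oauth`, `client_secret`, and `webhook_secret` were already covered by `auth` and `secret`; only `csrf` widens the match. A sensitive key whose value is a dict or list is replaced wholesale by the placeholder.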
@@ -358,11 +388,15 @@ async def _save_webhook_log(
    """Insert a webhook payload log entry and prune old ones."""
    try:
        body_json = body if isinstance(body, dict) else {}
+        # Strip sensitive values before persistence — webhook payloads
+        # routinely include OAuth tokens / secrets in the body, and the
+        # log is admin-readable but not need-to-know for the operator.
+        safe_body = _redact_sensitive_body(body_json) if body_json else {}
        session.add(WebhookPayloadLog(
            provider_id=provider_id,
            method=method,
            headers=headers,
-            body=body_json,
+            body=safe_body,
            status=status,
            extracted_fields=extracted_fields or {},
            error_message=error_message,
@@ -386,13 +420,19 @@ async def _save_webhook_log(
        _LOGGER.warning("Failed to save webhook payload log for provider %d", provider_id, exc_info=True)
        try:
            await session.rollback()
-        except Exception:
-            pass
+        except Exception:  # noqa: BLE001
+            _LOGGER.exception("Rollback after payload-log save failed")


@router.post("/webhook/{token}")
+@limiter.limit("60/minute")
 async def generic_webhook(token: str, request: Request):
-    """Receive a generic webhook, extract variables via JSONPath, and dispatch notifications."""
+    """Receive a generic webhook, extract variables via JSONPath, and dispatch notifications.
+
+    Per-IP rate limit (60/min) caps blast radius from a single source —
+    legitimate providers send well below this; anything higher is either
+    a misconfigured retry loop or abuse.
+    """
    engine = get_engine()

    # --- Load provider and validate auth ---