feat: production readiness — security, perf, bug fixes, bridge self-monitoring

Comprehensive multi-area pass driven by a parallel 8-agent production
review. Frontend, backend, database, security, performance, operational,
plus a new self-monitoring feature.

## Critical fixes
- Planka webhook: reads bounded raw body (was NameError on every call)
- HA quiet hours: ha_state_changed/automation_triggered/service_called/event_fired
  added to deferrable set (were silently dropped)
- DNS-rebinding SSRF: PinnedResolver wired into shared aiohttp session
- Telegram inbound webhook: secret now mandatory (401 without)
- Generic webhook: auth_mode="none" requires explicit
  acknowledge_unauthenticated=true; per-IP rate limit 60/min
- svelte-check: 5 null-narrowing errors in EventDetailModal fixed
- Provider hardcoding: Immich-only block extracted to descriptor
  featureDiscoveryHint
- command_sync: snapshot+expunge bot before exiting AsyncSession

## Bug fixes
- notifier asyncio.gather(return_exceptions=True): one bad chat no longer
  cancels peer sends
- NotificationDispatcher hoisted out of per-tracker loop
- Provider credential resolution unified across all 5 dispatch sites
- HA asyncio.shield now drains inner task on cancellation
- Provider construction switched from if/elif ladder to factory registry
- NUT first poll seeds silently (no spurious ups_on_battery)
- Quiet-hours gate: event-type-disabled now wins over deferral
- APScheduler drain job ID resolution upgraded to seconds
- HA on_status_change wired through to EventLog
- Webhook payload rollback failures now logged (not swallowed)
- Batched receivers/chats/bots in load_link_data (was per-target N+1)
- flag_modified on JSON column reassignments in deferred_dispatch

## Database
- UNIQUE indexes on service_provider.webhook_token,
  telegram_bot.webhook_path_id, partial UNIQUE on telegram_bot.bot_id,
  telegram_chat(bot_id, chat_id), notification_tracker_target unique link,
  partial UNIQUE on bridge_self provider per user
- Composite ix_event_log_user_event_type_created index
- save_chat_from_webhook switched to ON CONFLICT DO UPDATE
- ondelete=CASCADE on user-id FKs (model annotation; app-side cascade
  delete added for existing data)
- delete_notification_tracker converted from N+1 to bulk DELETE/UPDATE
- Module-level asyncio.Lock replaced with lazy _get_lock() pattern
- VACUUM INTO snapshot now PRAGMA integrity_check verified

## Performance
- Jinja2 template compilation LRU cached (lru_cache maxsize=512)
- Per-locale render cache in NotificationDispatcher (skips re-rendering
  identical content for receivers sharing a locale)
- Tracker list cached per provider_id with 5s TTL + explicit invalidation
  on tracker CRUD (relieves HA chat-bus rate query pressure)
- Nav-counts collapsed from 16 round-trips to single UNION ALL
- HA event_log: skip persisting empty assets_added/removed events

## Security hardening
- Mass-assignment guard on Action create/update; cron sub-minute reject
- Backup JSON depth/node-count cap (depth ≤ 10, nodes ≤ 100k)
- _sanitize_config extended to all JSON-typed fields on backup import
- Telegram _safe_get walks redirects manually with SSRF revalidation
- Bcrypt 72-byte password length cap with clear 422
- Webhook payload body redaction; sensitive substring set extended with
  oauth/client_secret/webhook_secret/csrf in both header filter and
  template extras filter

## Frontend
- 76 catch (err: any) sites converted to errMsg(err) helper
- globalProviderFilter: pure getter; reconciliation moved to one-time
  $effect in +layout
- Provider-filter binding: removed paired $effects + _syncingFilter flag,
  now one-way derived
- entity-cache: separate _refreshing flag for background re-fetches
- api.ts 401 handling: AuthRedirectError class + dedup _redirecting flag,
  goto() instead of window.location.href
- a11y: aria-expanded on mobile More, role=switch + aria-checked on
  Telegram bot toggles

## Tests & operations
- CI pytest gate added to .gitea/workflows/build.yml + release.yml
  (wheel-built install to dodge editable-install slowness)
- /api/ready upgraded to deep healthcheck (db SELECT 1, scheduler.running,
  HA supervisor presence) returning {ready, checks, errors, version}
- /api/metrics endpoint with prometheus_client (deferred_pending,
  event_log_total, dispatch_duration, poll_failures, send_failures)
- New OPERATIONS.md covering deploy, healthchecks, metrics, backup/restore
  procedures, log handling, common scenarios, upgrade flow
- New tests: test_bridge_self (11), test_gitea_parser (9),
  test_planka_parser (6), test_immich_change_detector (6),
  test_backup_roundtrip (1)

## New feature: bridge self-monitoring
- New bridge_self provider type: internal sink for bridge health events
- Three event types: bridge_self_poll_failures (consecutive tracker poll
  failures), bridge_self_deferred_backlog (pending count crosses
  threshold), bridge_self_target_failures (consecutive 5xx/network
  failures per target)
- Per-user thresholds (defaults: 3 / 100 / 5) configurable via the
  provider config form
- Auto-seeded on user create + /setup + boot backfill for existing users
- Anti-spam: counters reset after emission; backlog uses transition latch
- Self-loop guard: bridge_self failures don't count toward target-failure
  thresholds (logged only). Wire to your own Telegram/Email/Matrix to get
  notified when polls/dispatches/sends fail
- 6 default templates (3 events × 2 locales), tracking config columns with
  backfill migration, frontend descriptor (excluded from "create provider"
  wizard since auto-managed)

Operator-visible behavior changes (call out in release notes):
- NOTIFY_BRIDGE_TELEGRAM_WEBHOOK_SECRET now REQUIRED for webhook mode
- Existing webhook providers with auth_mode="none" need explicit opt-in
- Generic webhook endpoint rate-limited 60/min per source IP
- HA disconnect/reconnect writes ha_status_* EventLog rows
- Every user gets a bridge_self provider; wire it to a target to receive
  failure alerts

Pre-existing test failures (test_ssrf, test_release_provider) on Python
3.13 are unrelated; CI runs on 3.12.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 02:16:49 +03:00
parent 22127e2a59
commit 10d30fc956
97 changed files with 5423 additions and 821 deletions
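
Note on the anti-spam rules listed under "New feature: bridge self-monitoring": the two behaviors named there (counters that reset after emission, and a transition latch for the backlog) might look roughly like the sketch below. This is an illustration only, not the code from this commit. The class names `FailureCounter` and `BacklogLatch`, their method names, and the choice of `>` versus `>=` at the threshold are all assumptions; only the default values 3 and 100 come from the commit message.

```python
from dataclasses import dataclass


@dataclass
class FailureCounter:
    """Hypothetical helper: fire once per `threshold` consecutive failures."""

    threshold: int = 3  # default taken from the commit message
    count: int = 0

    def record(self, ok: bool) -> bool:
        """Record one poll/send outcome; return True when an event should fire."""
        if ok:
            # Any success breaks the streak.
            self.count = 0
            return False
        self.count += 1
        if self.count >= self.threshold:
            # Reset after emission so the next alert needs a fresh streak.
            self.count = 0
            return True
        return False


@dataclass
class BacklogLatch:
    """Hypothetical helper: fire only on the below-to-above transition."""

    threshold: int = 100  # default taken from the commit message
    tripped: bool = False

    def observe(self, pending: int) -> bool:
        """Return True only when `pending` newly exceeds the threshold."""
        over = pending > self.threshold  # assumption: strict comparison
        fire = over and not self.tripped
        # Remember the current side so a sustained backlog stays silent.
        self.tripped = over
        return fire


if __name__ == "__main__":
    counter = FailureCounter()
    print([counter.record(ok=False) for _ in range(4)])
    latch = BacklogLatch()
    print([latch.observe(n) for n in (50, 150, 200, 80, 101)])
```

With a latch of this shape, a backlog that stays above the threshold produces one event, and a new event fires only after the count has dropped back below the threshold and crossed it again.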
@@ -50,6 +50,7 @@ from .commands.webhook import router as webhook_router, set_webhook_secret
 from .api.webhooks import router as webhooks_router
 from .api.webhook_logs import router as webhook_logs_router
 from .api.backup import router as backup_router
+from .api.metrics import router as metrics_router


 # Readiness flag — flipped to True once the scheduler has started and the
@@ -78,6 +79,8 @@ async def lifespan(app: FastAPI):
        migrate_chat_action_to_column,
        migrate_deferred_dispatch_event_log_fk,
        migrate_deferred_dispatch_unique_pending,
+        migrate_uniqueness_constraints,
+        migrate_eventlog_provider_fk,
        migrate_schema_version,
    )
    from .database.snapshot import snapshot_and_prune
@@ -107,6 +110,13 @@ async def lifespan(app: FastAPI):
    # the partial unique index.
    await migrate_deferred_dispatch_event_log_fk(engine)
    await migrate_deferred_dispatch_unique_pending(engine)
+    # Backfill missing UNIQUE indexes on webhook hot paths (deduping any
+    # existing duplicates). Runs after performance_indexes so non-unique
+    # support indexes are already in place.
+    await migrate_uniqueness_constraints(engine)
+    # Document EventLog.provider_id FK strategy on existing tables (no-op
+    # on SQLite besides the log line; new tables get the FK from create_all).
+    await migrate_eventlog_provider_fk(engine)
    await migrate_schema_version(engine)
    from .database.seeds import seed_all
    await seed_all()
@@ -254,6 +264,7 @@ app.include_router(webhook_router)
 app.include_router(webhooks_router)
 app.include_router(webhook_logs_router)
 app.include_router(backup_router)
+app.include_router(metrics_router)


@app.get("/api/health")
@@ -265,15 +276,107 @@ async def health():

@app.get("/api/ready")
 async def ready():
-    """Readiness: migrations and scheduler have started, app can serve traffic.
+    """Readiness: deep dependency check.

-    Returns 503 until the lifespan startup sequence has completed. Use this
-    for orchestrator readiness probes (Docker, Kubernetes).
+    Verifies each critical dependency is actually reachable, not just that
+    the app finished its lifespan startup. Returns 503 if any *required*
+    check fails (db, scheduler). Home Assistant supervisor presence is
+    informational — a degraded HA does not flip readiness off.
+
+    Response shape:
+        {
+          "ready": bool,
+          "checks": {"db": "ok|fail", "scheduler": "ok|fail", "ha": "ok|degraded|na"},
+          "errors": [str, ...], "version": str
+        }
    """
+    from starlette.responses import JSONResponse
+    import asyncio as _asyncio
+    from sqlalchemy import text as _text
+
+    checks: dict[str, str] = {}
+    errors: list[str] = []
+
    if not _READY:
-        from starlette.responses import JSONResponse
-        return JSONResponse({"status": "starting"}, status_code=503)
-    return {"status": "ready", "version": _APP_VERSION}
+        # Lifespan still running — short-circuit so we don't poke a half-built engine.
+        return JSONResponse(
+            {
+                "ready": False,
+                "checks": {"db": "fail", "scheduler": "fail", "ha": "na"},
+                "errors": ["startup not complete"],
+                "version": _APP_VERSION,
+            },
+            status_code=503,
+        )
+
+    # --- DB: SELECT 1 with a 2s timeout ---
+    try:
+        from .database.engine import get_engine
+        engine = get_engine()
+
+        async def _ping_db() -> None:
+            async with engine.connect() as conn:
+                await conn.execute(_text("SELECT 1"))
+
+        await _asyncio.wait_for(_ping_db(), timeout=2.0)
+        checks["db"] = "ok"
+    except Exception as exc:  # noqa: BLE001
+        checks["db"] = "fail"
+        errors.append(f"db: {exc!s}")
+
+    # --- Scheduler: APScheduler must be running ---
+    try:
+        from .services.scheduler import get_scheduler
+        scheduler = get_scheduler()
+        if scheduler.running:
+            checks["scheduler"] = "ok"
+        else:
+            checks["scheduler"] = "fail"
+            errors.append("scheduler: not running")
+    except Exception as exc:  # noqa: BLE001
+        checks["scheduler"] = "fail"
+        errors.append(f"scheduler: {exc!s}")
+
+    # --- HA supervisor: informational only ---
+    # If no HA providers are configured, report "na" (not applicable). If any
+    # HA providers exist, ensure at least one supervisor task is alive — a
+    # task being not-yet-connected is fine, we just want it to exist.
+    try:
+        from sqlmodel import select as _select
+        from sqlmodel.ext.asyncio.session import AsyncSession as _AS
+        from .database.models import ServiceProvider
+        from .services.ha_subscription import _running_tasks as _ha_tasks
+
+        from .database.engine import get_engine as _get_engine_ha
+        async with _AS(_get_engine_ha()) as _session:
+            _result = await _session.exec(
+                _select(ServiceProvider).where(
+                    ServiceProvider.type == "home_assistant",
+                )
+            )
+            ha_providers = _result.all()
+        if not ha_providers:
+            checks["ha"] = "na"
+        else:
+            alive = [
+                t for t in _ha_tasks.values() if t is not None and not t.done()
+            ]
+            checks["ha"] = "ok" if alive else "degraded"
+    except Exception as exc:  # noqa: BLE001
+        # Never let the HA probe fail readiness — it's informational.
+        checks["ha"] = "degraded"
+        errors.append(f"ha: {exc!s}")
+
+    required_ok = checks["db"] == "ok" and checks["scheduler"] == "ok"
+    body = {
+        "ready": required_ok,
+        "checks": checks,
+        "errors": errors,
+        "version": _APP_VERSION,
+    }
+    if not required_ok:
+        return JSONResponse(body, status_code=503)
+    return body


 # --- Serve frontend static files (production) ---
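
Review note on the hunk above: the rule that decides the final status code (db and scheduler are required, HA is informational) can be restated as a small pure function, which may be convenient for unit-testing the decision without a running app. The sketch below is not part of this commit; `readiness_response` and `REQUIRED_CHECKS` are hypothetical names, and the body shape simply mirrors the dict built in `ready()`.

```python
from typing import Any

# Assumption: only these checks gate readiness, as in the hunk above.
REQUIRED_CHECKS = ("db", "scheduler")


def readiness_response(
    checks: dict[str, str],
    errors: list[str],
    version: str,
) -> tuple[int, dict[str, Any]]:
    """Hypothetical helper: map check results to (status_code, body)."""
    # A missing required check counts as a failure, not as a pass.
    required_ok = all(checks.get(name) == "ok" for name in REQUIRED_CHECKS)
    body = {
        "ready": required_ok,
        "checks": checks,
        "errors": errors,
        "version": version,
    }
    # The "ha" entry is never consulted here, so a degraded HA still yields 200.
    return (200 if required_ok else 503), body


if __name__ == "__main__":
    status, body = readiness_response(
        {"db": "ok", "scheduler": "ok", "ha": "degraded"},
        ["ha: supervisor task missing"],
        "0.0.0",
    )
    print(status, body["ready"])
```

One consequence of this rule that is worth knowing when wiring up probes: `errors` can be non-empty while the endpoint still returns 200, because HA problems are reported there without affecting `ready`.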