feat: production readiness — security, perf, bug fixes, bridge self-monitoring

Comprehensive multi-area pass driven by a parallel 8-agent production
review. Covers frontend, backend, database, security, performance, and
operations, plus a new self-monitoring feature.

## Critical fixes
- Planka webhook: reads bounded raw body (was NameError on every call)
- HA quiet hours: ha_state_changed/automation_triggered/service_called/
  event_fired added to deferrable set (were silently dropped)
- DNS-rebinding SSRF: PinnedResolver wired into shared aiohttp session
- Telegram inbound webhook: secret now mandatory (401 without)
- Generic webhook: auth_mode="none" requires explicit
  acknowledge_unauthenticated=true; per-IP rate limit 60/min
- svelte-check: 5 null-narrowing errors in EventDetailModal fixed
- Provider hardcoding: Immich-only block extracted to descriptor
  featureDiscoveryHint
- command_sync: snapshot+expunge bot before exiting AsyncSession
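
The per-IP limit on the generic webhook can be pictured as a sliding window. This is a minimal sketch, not the code in this commit: the class name `PerIpRateLimiter`, its parameters, and the injectable clock are hypothetical, and the real endpoint may use a different algorithm or storage.

```python
import time
from collections import defaultdict, deque


class PerIpRateLimiter:
    """Sliding-window limiter: at most `limit` hits per `window` seconds per IP.

    Hypothetical helper for illustration only.
    """

    def __init__(self, limit: int = 60, window: float = 60.0, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self._clock = clock  # injectable so the behaviour can be tested
        self._hits = defaultdict(deque)  # ip -> timestamps of accepted hits

    def allow(self, ip: str) -> bool:
        now = self._clock()
        hits = self._hits[ip]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False  # caller would typically answer 429
        hits.append(now)
        return True
```

A request handler would call `allow(client_ip)` before doing any work and reject the request when it returns `False`.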

## Bug fixes
- notifier asyncio.gather(return_exceptions=True) — one bad chat no longer
  cancels peer sends
- NotificationDispatcher hoisted out of per-tracker loop
- Provider credential resolution unified across all 5 dispatch sites
- HA asyncio.shield now drains inner task on cancellation
- Provider construction switched from if/elif ladder to factory registry
- NUT first poll seeds silently (no spurious ups_on_battery)
- Quiet-hours gate: event-type-disabled now wins over deferral
- APScheduler drain job ID resolution upgraded to seconds
- HA on_status_change wired through to EventLog
- Webhook payload rollback failures now logged (not swallowed)
- Batched receivers/chats/bots in load_link_data (was per-target N+1)
- flag_modified on JSON column reassignments in deferred_dispatch
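
A minimal sketch of the notifier change, assuming a hypothetical `send()` coroutine per chat (the real notifier's function names are not shown in this commit message). With `return_exceptions=True`, `asyncio.gather` returns each failure as a result value, so every send's outcome can be inspected individually.

```python
import asyncio


async def send(chat_id: int) -> str:
    """Hypothetical per-chat send; chat 2 simulates a failing target."""
    await asyncio.sleep(0)
    if chat_id == 2:
        raise RuntimeError(f"chat {chat_id} unreachable")
    return f"sent:{chat_id}"


async def send_all(chat_ids: list[int]) -> tuple[list[str], list[BaseException]]:
    # return_exceptions=True collects exceptions as results instead of
    # raising the first one out of gather().
    results = await asyncio.gather(
        *(send(c) for c in chat_ids), return_exceptions=True
    )
    ok = [r for r in results if not isinstance(r, BaseException)]
    failed = [r for r in results if isinstance(r, BaseException)]
    return ok, failed
```

Calling `asyncio.run(send_all([1, 2, 3]))` yields the two successful results plus one captured `RuntimeError`, which the caller can log per chat.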

## Database
- UNIQUE indexes on service_provider.webhook_token,
  telegram_bot.webhook_path_id, partial UNIQUE on telegram_bot.bot_id,
  telegram_chat(bot_id, chat_id), notification_tracker_target unique link,
  partial UNIQUE on bridge_self provider per user
- Composite ix_event_log_user_event_type_created index
- save_chat_from_webhook switched to ON CONFLICT DO UPDATE
- ondelete=CASCADE on user-id FKs (model annotation; app-side cascade
  delete added for existing data)
- delete_notification_tracker converted from N+1 to bulk DELETE/UPDATE
- Module-level asyncio.Lock replaced with lazy _get_lock() pattern
- VACUUM INTO snapshot now PRAGMA integrity_check verified
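
The snapshot verification step can be sketched with the standard `sqlite3` module. The function name `snapshot_verified` and its signature are assumptions for illustration; the project's `snapshot_and_prune` may be structured differently.

```python
import sqlite3
from pathlib import Path


def snapshot_verified(src: Path, dest: Path) -> bool:
    """Write a compacted copy of `src` to `dest`, then check the copy's integrity.

    Hypothetical helper; requires SQLite 3.27+ for VACUUM INTO.
    """
    conn = sqlite3.connect(src)
    try:
        # VACUUM INTO fails if `dest` already exists as a non-empty file.
        conn.execute("VACUUM INTO ?", (str(dest),))
    finally:
        conn.close()

    copy = sqlite3.connect(dest)
    try:
        rows = copy.execute("PRAGMA integrity_check").fetchall()
    finally:
        copy.close()
    # A healthy database reports exactly one row containing "ok".
    return rows == [("ok",)]
```

A caller would discard the snapshot file when this returns `False` rather than keep a backup that cannot be restored.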

## Performance
- Jinja2 template compilation LRU cached (lru_cache maxsize=512)
- Per-locale render cache in NotificationDispatcher (skips re-rendering
  identical content for receivers sharing a locale)
- Tracker list cached per provider_id with 5s TTL + explicit invalidation
  on tracker CRUD (relieves HA chat-bus rate query pressure)
- Nav-counts collapsed from 16 round-trips to single UNION ALL
- HA event_log: skip persisting empty assets_added/removed events
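
The tracker-list cache (5s TTL plus explicit invalidation) might look roughly like the sketch below. `TtlCache`, its `get`/`invalidate` methods, and the loader callback are hypothetical names; only the TTL value and the invalidate-on-CRUD behaviour come from the notes above.

```python
import time
from typing import Callable, Generic, TypeVar

T = TypeVar("T")


class TtlCache(Generic[T]):
    """Per-key cache whose entries expire after `ttl` seconds (illustrative only)."""

    def __init__(self, ttl: float = 5.0, clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl
        self._clock = clock
        self._entries = {}  # key -> (stored_at, value)

    def get(self, key: int, load: Callable[[], T]) -> T:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self.ttl:
            return entry[1]  # still fresh: skip the query
        value = load()
        self._entries[key] = (now, value)
        return value

    def invalidate(self, key: int) -> None:
        # Called from tracker create/update/delete so edits show up immediately.
        self._entries.pop(key, None)
```

Keying by `provider_id` keeps invalidation cheap: a tracker change only evicts the list for its own provider.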

## Security hardening
- Mass-assignment guard on Action create/update; cron sub-minute reject
- Backup JSON depth/node-count cap (depth ≤ 10, nodes ≤ 100k)
- _sanitize_config extended to all JSON-typed fields on backup import
- Telegram _safe_get walks redirects manually with SSRF revalidation
- Bcrypt 72-byte password length cap with clear 422
- Webhook payload body redaction; sensitive substring set extended with
  oauth/client_secret/webhook_secret/csrf in both header filter and
  template extras filter
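
A sketch of the backup depth/node-count cap, under two assumptions not stated above: the top-level value counts as depth 1, and every dict, list, and scalar counts as one node. `check_json_limits` is a hypothetical name, and the check runs on an already-parsed value.

```python
import json
from typing import Any

MAX_DEPTH = 10
MAX_NODES = 100_000


def check_json_limits(
    value: Any, max_depth: int = MAX_DEPTH, max_nodes: int = MAX_NODES
) -> None:
    """Raise ValueError if `value` nests too deeply or contains too many nodes."""
    nodes = 0
    stack = [(value, 1)]  # iterative walk, so the check itself never recurses
    while stack:
        item, depth = stack.pop()
        nodes += 1
        if nodes > max_nodes:
            raise ValueError(f"backup has more than {max_nodes} nodes")
        if depth > max_depth:
            raise ValueError(f"backup nests deeper than {max_depth} levels")
        if isinstance(item, dict):
            stack.extend((v, depth + 1) for v in item.values())
        elif isinstance(item, list):
            stack.extend((v, depth + 1) for v in item)


# Usage: validate a parsed payload before importing it.
check_json_limits(json.loads('{"providers": [{"name": "demo", "config": {}}]}'))
```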

## Frontend
- 76 catch (err: any) sites converted to errMsg(err) helper
- globalProviderFilter: pure getter; reconciliation moved to one-time
  $effect in +layout
- Provider-filter binding: removed paired $effects + _syncingFilter flag,
  now one-way derived
- entity-cache: separate _refreshing flag for background re-fetches
- api.ts 401 handling: AuthRedirectError class + dedup _redirecting flag,
  goto() instead of window.location.href
- a11y: aria-expanded on mobile More, role=switch + aria-checked on
  Telegram bot toggles

## Tests & operations
- CI pytest gate added to .gitea/workflows/build.yml + release.yml
  (wheel-built install to dodge editable-install slowness)
- /api/ready upgraded to deep healthcheck (db SELECT 1, scheduler.running,
  HA supervisor presence) returning {ready, checks, errors, version}
- /api/metrics endpoint with prometheus_client (deferred_pending,
  event_log_total, dispatch_duration, poll_failures, send_failures)
- New OPERATIONS.md covering deploy, healthchecks, metrics, backup/restore
  procedures, log handling, common scenarios, upgrade flow
- New tests: test_bridge_self (11), test_gitea_parser (9),
  test_planka_parser (6), test_immich_change_detector (6),
  test_backup_roundtrip (1)

## New feature: bridge self-monitoring
- New bridge_self provider type — internal sink for bridge health events
- Three event types: bridge_self_poll_failures (consecutive tracker poll
  failures), bridge_self_deferred_backlog (pending count crosses
  threshold), bridge_self_target_failures (consecutive 5xx/network
  failures per target)
- Per-user thresholds (defaults: 3 / 100 / 5) configurable via the
  provider config form
- Auto-seeded on user create + /setup + boot backfill for existing users
- Anti-spam: counters reset after emission; backlog uses transition latch
- Self-loop guard: bridge_self failures don't count toward target-failure
  thresholds (logged only) — wire to your own Telegram/Email/Matrix to
  get notified when polls/dispatches/sends fail
- 6 default templates (3 events × 2 locales), tracking config columns
  with backfill migration, frontend descriptor (excluded from "create
  provider" wizard since auto-managed)
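
The two anti-spam mechanisms can be sketched as small state holders. The class names, the `>=` comparison, and the reset-on-success rule are assumptions made for illustration; the commit message only states that counters reset after emission and that the backlog alert uses a transition latch.

```python
from dataclasses import dataclass


@dataclass
class FailureCounter:
    """Emit once per `threshold` consecutive failures (hypothetical helper)."""

    threshold: int = 3
    count: int = 0

    def record(self, ok: bool) -> bool:
        """Return True when an alert should be emitted."""
        if ok:
            self.count = 0  # a success breaks the "consecutive" run
            return False
        self.count += 1
        if self.count >= self.threshold:
            self.count = 0  # reset after emission to avoid repeat alerts
            return True
        return False


@dataclass
class BacklogLatch:
    """Fire only on the below-to-above transition (hypothetical helper)."""

    threshold: int = 100
    latched: bool = False

    def observe(self, pending: int) -> bool:
        above = pending >= self.threshold
        fire = above and not self.latched
        self.latched = above  # re-arms once the backlog drops back below
        return fire
```

With the default thresholds, a tracker failing six polls in a row would produce two alerts, while a backlog that stays above 100 produces one alert until it drains and crosses the threshold again.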

Operator-visible behavior changes (call out in release notes):
- NOTIFY_BRIDGE_TELEGRAM_WEBHOOK_SECRET now REQUIRED for webhook mode
- Existing webhook providers with auth_mode="none" need explicit opt-in
- Generic webhook endpoint rate-limited 60/min per source IP
- HA disconnect/reconnect writes ha_status_* EventLog rows
- Every user gets a bridge_self provider — wire it to a target to
  receive failure alerts

Pre-existing test failures (test_ssrf, test_release_provider) on
Python 3.13 are unrelated; CI runs on 3.12.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 02:16:49 +03:00
parent 22127e2a59
commit 10d30fc956
97 changed files with 5423 additions and 821 deletions
@@ -50,6 +50,7 @@ from .commands.webhook import router as webhook_router, set_webhook_secret
 from .api.webhooks import router as webhooks_router
 from .api.webhook_logs import router as webhook_logs_router
 from .api.backup import router as backup_router
+from .api.metrics import router as metrics_router
 # Readiness flag — flipped to True once the scheduler has started and the
@@ -78,6 +79,8 @@ async def lifespan(app: FastAPI):
         migrate_chat_action_to_column,
         migrate_deferred_dispatch_event_log_fk,
         migrate_deferred_dispatch_unique_pending,
+        migrate_uniqueness_constraints,
+        migrate_eventlog_provider_fk,
         migrate_schema_version,
     )
     from .database.snapshot import snapshot_and_prune
@@ -107,6 +110,13 @@ async def lifespan(app: FastAPI):
     # the partial unique index.
     await migrate_deferred_dispatch_event_log_fk(engine)
     await migrate_deferred_dispatch_unique_pending(engine)
+    # Backfill missing UNIQUE indexes on webhook hot paths (deduping any
+    # existing duplicates). Runs after performance_indexes so non-unique
+    # support indexes are already in place.
+    await migrate_uniqueness_constraints(engine)
+    # Document EventLog.provider_id FK strategy on existing tables (no-op
+    # on SQLite besides the log line; new tables get the FK from create_all).
+    await migrate_eventlog_provider_fk(engine)
     await migrate_schema_version(engine)
     from .database.seeds import seed_all
     await seed_all()
@@ -254,6 +264,7 @@ app.include_router(webhook_router)
 app.include_router(webhooks_router)
 app.include_router(webhook_logs_router)
 app.include_router(backup_router)
+app.include_router(metrics_router)
 @app.get("/api/health")
@@ -265,15 +276,107 @@ async def health():
 @app.get("/api/ready")
 async def ready():
-    """Readiness: migrations and scheduler have started, app can serve traffic.
+    """Readiness: deep dependency check.
-    Returns 503 until the lifespan startup sequence has completed. Use this
-    for orchestrator readiness probes (Docker, Kubernetes).
+    Verifies each critical dependency is actually reachable, not just that
+    the app finished its lifespan startup. Returns 503 if any *required*
+    check fails (db, scheduler). Home Assistant supervisor presence is
+    informational — a degraded HA does not flip readiness off.
+    Response shape:
+        {
+            "ready": bool,
+            "checks": {"db": "ok|fail", "scheduler": "ok|fail", "ha": "ok|degraded|na"},
+            "errors": [str, ...]
+        }
     """
+    from starlette.responses import JSONResponse
+    import asyncio as _asyncio
+    from sqlalchemy import text as _text
+    checks: dict[str, str] = {}
+    errors: list[str] = []
     if not _READY:
-        from starlette.responses import JSONResponse
-        return JSONResponse({"status": "starting"}, status_code=503)
-    return {"status": "ready", "version": _APP_VERSION}
+        # Lifespan still running — short-circuit so we don't poke a half-built engine.
+        return JSONResponse(
+            {
+                "ready": False,
+                "checks": {"db": "fail", "scheduler": "fail", "ha": "na"},
+                "errors": ["startup not complete"],
+                "version": _APP_VERSION,
+            },
+            status_code=503,
+        )
+    # --- DB: SELECT 1 with a 2s timeout ---
+    try:
+        from .database.engine import get_engine
+        engine = get_engine()
+        async def _ping_db() -> None:
+            async with engine.connect() as conn:
+                await conn.execute(_text("SELECT 1"))
+        await _asyncio.wait_for(_ping_db(), timeout=2.0)
+        checks["db"] = "ok"
+    except Exception as exc:  # noqa: BLE001
+        checks["db"] = "fail"
+        errors.append(f"db: {exc!s}")
+    # --- Scheduler: APScheduler must be running ---
+    try:
+        from .services.scheduler import get_scheduler
+        scheduler = get_scheduler()
+        if scheduler.running:
+            checks["scheduler"] = "ok"
+        else:
+            checks["scheduler"] = "fail"
+            errors.append("scheduler: not running")
+    except Exception as exc:  # noqa: BLE001
+        checks["scheduler"] = "fail"
+        errors.append(f"scheduler: {exc!s}")
+    # --- HA supervisor: informational only ---
+    # If no HA providers are configured, report "na" (not applicable). If any
+    # HA providers exist, ensure at least one supervisor task is alive — a
+    # task being not-yet-connected is fine, we just want it to exist.
+    try:
+        from sqlmodel import select as _select
+        from sqlmodel.ext.asyncio.session import AsyncSession as _AS
+        from .database.models import ServiceProvider
+        from .services.ha_subscription import _running_tasks as _ha_tasks
+        from .database.engine import get_engine as _get_engine_ha
+        async with _AS(_get_engine_ha()) as _session:
+            _result = await _session.exec(
+                _select(ServiceProvider).where(
+                    ServiceProvider.type == "home_assistant",
+                )
+            )
+            ha_providers = _result.all()
+        if not ha_providers:
+            checks["ha"] = "na"
+        else:
+            alive = [
+                t for t in _ha_tasks.values() if t is not None and not t.done()
+            ]
+            checks["ha"] = "ok" if alive else "degraded"
+    except Exception as exc:  # noqa: BLE001
+        # Never let the HA probe fail readiness — it's informational.
+        checks["ha"] = "degraded"
+        errors.append(f"ha: {exc!s}")
+    required_ok = checks["db"] == "ok" and checks["scheduler"] == "ok"
+    body = {
+        "ready": required_ok,
+        "checks": checks,
+        "errors": errors,
+        "version": _APP_VERSION,
+    }
+    if not required_ok:
+        return JSONResponse(body, status_code=503)
+    return body
 # --- Serve frontend static files (production) ---