feat: production readiness — security, perf, bug fixes, bridge self-monitoring
Comprehensive multi-area pass driven by a parallel 8-agent production
review, covering frontend, backend, database, security, performance, and
operations, plus a new self-monitoring feature.
## Critical fixes
- Planka webhook: reads bounded raw body (was NameError on every call)
- HA quiet hours: ha_state_changed/automation_triggered/service_called/
event_fired added to deferrable set (were silently dropped)
- DNS-rebinding SSRF: PinnedResolver wired into shared aiohttp session
- Telegram inbound webhook: secret now mandatory (401 without)
- Generic webhook: auth_mode="none" requires explicit
acknowledge_unauthenticated=true; per-IP rate limit 60/min
- svelte-check: 5 null-narrowing errors in EventDetailModal fixed
- Provider hardcoding: Immich-only block extracted to descriptor
featureDiscoveryHint
- command_sync: snapshot+expunge bot before exiting AsyncSession
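The per-IP limit on the generic webhook (60/min) could be enforced with a sliding window along the lines of the sketch below. The commit message does not say which algorithm the bridge uses, so the class name `SlidingWindowLimiter`, its parameters, and the injectable clock are assumptions made for illustration only.

```python
import time
from collections import defaultdict, deque
from typing import Callable


class SlidingWindowLimiter:
    """Hypothetical per-key limiter: at most `limit` hits per `window` seconds."""

    def __init__(
        self,
        limit: int = 60,
        window: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.limit = limit
        self.window = window
        self._clock = clock  # injectable so the limiter can be tested without sleeping
        self._hits: dict[str, deque[float]] = defaultdict(deque)

    def allow(self, key: str) -> bool:
        """Record a hit for `key`; return False if the key is over its limit."""
        now = self._clock()
        hits = self._hits[key]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True
```

A handler would call `allow(source_ip)` before processing the payload and answer with 429 when it returns `False`. Rejected requests are not recorded here, so a client that keeps hammering is admitted again as soon as its oldest accepted hit ages out.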
## Bug fixes
- notifier asyncio.gather(return_exceptions=True) — one bad chat no longer
cancels peer sends
- NotificationDispatcher hoisted out of per-tracker loop
- Provider credential resolution unified across all 5 dispatch sites
- HA asyncio.shield now drains inner task on cancellation
- Provider construction switched from if/elif ladder to factory registry
- NUT first poll seeds silently (no spurious ups_on_battery)
- Quiet-hours gate: event-type-disabled now wins over deferral
- APScheduler drain job ID resolution upgraded to seconds
- HA on_status_change wired through to EventLog
- Webhook payload rollback failures now logged (not swallowed)
- Batched receivers/chats/bots in load_link_data (was per-target N+1)
- flag_modified on JSON column reassignments in deferred_dispatch
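The notifier change relies on `asyncio.gather(..., return_exceptions=True)`, which hands every outcome back to the caller instead of raising the first exception. A minimal sketch follows; the `send` coroutine, the failing chat ID, and `notify_all` are hypothetical stand-ins, not the bridge's notifier.

```python
import asyncio


async def send(chat_id: int) -> str:
    """Hypothetical send: chat 2 always fails, the others succeed."""
    await asyncio.sleep(0)
    if chat_id == 2:
        raise RuntimeError(f"chat {chat_id} rejected the message")
    return f"sent to {chat_id}"


async def notify_all(chat_ids: list[int]) -> tuple[list[str], list[Exception]]:
    # With return_exceptions=True, failures come back as result values in
    # input order, so the caller sees every outcome, not just the first error.
    results = await asyncio.gather(
        *(send(chat_id) for chat_id in chat_ids), return_exceptions=True
    )
    sent = [r for r in results if not isinstance(r, BaseException)]
    failed = [r for r in results if isinstance(r, Exception)]
    return sent, failed
```

Running `asyncio.run(notify_all([1, 2, 3]))` yields two successful sends and one captured `RuntimeError`, which the caller can log per chat.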
## Database
- UNIQUE indexes on service_provider.webhook_token,
telegram_bot.webhook_path_id, partial UNIQUE on telegram_bot.bot_id,
telegram_chat(bot_id, chat_id), notification_tracker_target unique link,
partial UNIQUE on bridge_self provider per user
- Composite ix_event_log_user_event_type_created index
- save_chat_from_webhook switched to ON CONFLICT DO UPDATE
- ondelete=CASCADE on user-id FKs (model annotation; app-side cascade
delete added for existing data)
- delete_notification_tracker converted from N+1 to bulk DELETE/UPDATE
- Module-level asyncio.Lock replaced with lazy _get_lock() pattern
- VACUUM INTO snapshot now PRAGMA integrity_check verified
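The `ON CONFLICT DO UPDATE` switch depends on the new UNIQUE index on `telegram_chat(bot_id, chat_id)`. The sketch below shows the pattern with the standard-library `sqlite3` module; the `title` column and the `save_chat` helper are assumptions, and the real `save_chat_from_webhook` goes through the project's own async session.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE telegram_chat ("
    " bot_id INTEGER NOT NULL,"
    " chat_id INTEGER NOT NULL,"
    " title TEXT,"  # hypothetical payload column
    " UNIQUE (bot_id, chat_id))"
)


def save_chat(bot_id: int, chat_id: int, title: str) -> None:
    """Insert the chat, or update its title if (bot_id, chat_id) already exists."""
    conn.execute(
        "INSERT INTO telegram_chat (bot_id, chat_id, title) VALUES (?, ?, ?) "
        "ON CONFLICT (bot_id, chat_id) DO UPDATE SET title = excluded.title",
        (bot_id, chat_id, title),
    )
```

Because the database resolves the conflict, two webhook deliveries for the same chat cannot race between a SELECT and an INSERT and leave a duplicate row.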
## Performance
- Jinja2 template compilation LRU cached (lru_cache maxsize=512)
- Per-locale render cache in NotificationDispatcher (skips re-rendering
identical content for receivers sharing a locale)
- Tracker list cached per provider_id with 5s TTL + explicit invalidation
on tracker CRUD (relieves HA chat-bus rate query pressure)
- Nav-counts collapsed from 16 round-trips to single UNION ALL
- HA event_log: skip persisting empty assets_added/removed events
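A tracker-list cache with a short TTL and explicit invalidation might look like the sketch below. `TTLCache`, `get_or_load`, and the injectable clock are hypothetical names chosen for illustration; only the 5s TTL and the invalidate-on-CRUD idea come from the list above.

```python
import time
from typing import Callable, Generic, Hashable, TypeVar

V = TypeVar("V")


class TTLCache(Generic[V]):
    """Hypothetical per-key cache whose entries expire `ttl` seconds after load."""

    def __init__(self, ttl: float = 5.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl
        self._clock = clock
        self._entries: dict[Hashable, tuple[float, V]] = {}

    def get_or_load(self, key: Hashable, loader: Callable[[], V]) -> V:
        """Return the cached value for `key`, calling `loader` on miss or expiry."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]
        value = loader()
        self._entries[key] = (now, value)
        return value

    def invalidate(self, key: Hashable) -> None:
        """Drop `key` so the next read reloads; call this from tracker CRUD."""
        self._entries.pop(key, None)
```

The TTL bounds staleness for writers that bypass the cache, while `invalidate(provider_id)` keeps ordinary create/update/delete operations immediately visible.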
## Security hardening
- Mass-assignment guard on Action create/update; cron sub-minute reject
- Backup JSON depth/node-count cap (depth ≤ 10, nodes ≤ 100k)
- _sanitize_config extended to all JSON-typed fields on backup import
- Telegram _safe_get walks redirects manually with SSRF revalidation
- Bcrypt 72-byte password length cap with clear 422
- Webhook payload body redaction; sensitive substring set extended with
oauth/client_secret/webhook_secret/csrf in both header filter and
template extras filter
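The backup depth/node-count cap can be checked with a single iterative walk over the parsed document. This is a sketch under stated assumptions: the root counts as depth 1, every value (container or scalar) counts as one node, and `check_json_limits` and `load_backup` are hypothetical helper names.

```python
import json
from typing import Any

MAX_DEPTH = 10
MAX_NODES = 100_000


def check_json_limits(value: Any, max_depth: int = MAX_DEPTH,
                      max_nodes: int = MAX_NODES) -> int:
    """Return the node count, or raise ValueError if a limit is exceeded."""
    nodes = 0
    stack = [(value, 1)]  # explicit stack: no recursion while walking
    while stack:
        item, depth = stack.pop()
        nodes += 1
        if depth > max_depth:
            raise ValueError(f"backup JSON nested deeper than {max_depth}")
        if nodes > max_nodes:
            raise ValueError(f"backup JSON has more than {max_nodes} nodes")
        if isinstance(item, dict):
            stack.extend((v, depth + 1) for v in item.values())
        elif isinstance(item, list):
            stack.extend((v, depth + 1) for v in item)
    return nodes


def load_backup(raw: str) -> Any:
    """Parse a backup document and reject it if it exceeds the limits."""
    data = json.loads(raw)
    check_json_limits(data)
    return data
```

The walk only runs after `json.loads` has built the whole structure, so a request body size limit is still needed to bound parsing cost itself.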
## Frontend
- 76 catch (err: any) sites converted to errMsg(err) helper
- globalProviderFilter: pure getter; reconciliation moved to one-time
$effect in +layout
- Provider-filter binding: removed paired $effects + _syncingFilter flag,
now one-way derived
- entity-cache: separate _refreshing flag for background re-fetches
- api.ts 401 handling: AuthRedirectError class + dedup _redirecting flag,
goto() instead of window.location.href
- a11y: aria-expanded on mobile More, role=switch + aria-checked on
Telegram bot toggles
## Tests & operations
- CI pytest gate added to .gitea/workflows/build.yml + release.yml
(wheel-built install to dodge editable-install slowness)
- /api/ready upgraded to deep healthcheck (db SELECT 1, scheduler.running,
HA supervisor presence) returning {ready, checks, errors, version}
- /api/metrics endpoint with prometheus_client (deferred_pending,
event_log_total, dispatch_duration, poll_failures, send_failures)
- New OPERATIONS.md covering deploy, healthchecks, metrics, backup/restore
procedures, log handling, common scenarios, upgrade flow
- New tests: test_bridge_self (11), test_gitea_parser (9),
test_planka_parser (6), test_immich_change_detector (6),
test_backup_roundtrip (1)
## New feature: bridge self-monitoring
- New bridge_self provider type — internal sink for bridge health events
- Three event types: bridge_self_poll_failures (consecutive tracker poll
failures), bridge_self_deferred_backlog (pending count crosses
threshold), bridge_self_target_failures (consecutive 5xx/network
failures per target)
- Per-user thresholds (defaults: 3 / 100 / 5) configurable via the
provider config form
- Auto-seeded on user create + /setup + boot backfill for existing users
- Anti-spam: counters reset after emission; backlog uses transition latch
- Self-loop guard: bridge_self failures don't count toward target-failure
thresholds (logged only)
- Wire bridge_self to your own Telegram/Email/Matrix target to get notified
when polls/dispatches/sends fail
- 6 default templates (3 events × 2 locales), tracking config columns
with backfill migration, frontend descriptor (excluded from "create
provider" wizard since auto-managed)
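The two anti-spam mechanisms named above, a counter that resets after emission and a transition latch for the backlog, can be sketched as follows. The class names are hypothetical, and treating "crosses threshold" as `pending >= threshold` is an assumption; the defaults mirror the 3 / 100 values listed above.

```python
from dataclasses import dataclass


@dataclass
class FailureCounter:
    """Emit once per run of `threshold` consecutive failures, then reset."""

    threshold: int = 3
    count: int = 0

    def record(self, ok: bool) -> bool:
        """Record one poll outcome; return True when an event should be emitted."""
        if ok:
            self.count = 0
            return False
        self.count += 1
        if self.count >= self.threshold:
            self.count = 0  # reset after emission so the alert does not repeat every poll
            return True
        return False


@dataclass
class BacklogLatch:
    """Emit only on the transition from below the threshold to at or above it."""

    threshold: int = 100
    tripped: bool = False

    def observe(self, pending: int) -> bool:
        """Observe the pending count; return True only on an upward crossing."""
        above = pending >= self.threshold
        fire = above and not self.tripped
        self.tripped = above  # re-arms once the backlog drops below the threshold
        return fire
```

With these defaults a tracker that fails continuously emits on every third failure, while a backlog that stays above 100 emits once until it drains and crosses again.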
Operator-visible behavior changes (call out in release notes):
- NOTIFY_BRIDGE_TELEGRAM_WEBHOOK_SECRET now REQUIRED for webhook mode
- Existing webhook providers with auth_mode="none" need explicit opt-in
- Generic webhook endpoint rate-limited 60/min per source IP
- HA disconnect/reconnect writes ha_status_* EventLog rows
- Every user gets a bridge_self provider — wire it to a target to
receive failure alerts
Pre-existing test failures (test_ssrf, test_release_provider) on
Python 3.13 are unrelated; CI runs on 3.12.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -50,6 +50,7 @@ from .commands.webhook import router as webhook_router, set_webhook_secret
 from .api.webhooks import router as webhooks_router
 from .api.webhook_logs import router as webhook_logs_router
 from .api.backup import router as backup_router
+from .api.metrics import router as metrics_router
 
 
 # Readiness flag — flipped to True once the scheduler has started and the
@@ -78,6 +79,8 @@ async def lifespan(app: FastAPI):
         migrate_chat_action_to_column,
         migrate_deferred_dispatch_event_log_fk,
         migrate_deferred_dispatch_unique_pending,
+        migrate_uniqueness_constraints,
+        migrate_eventlog_provider_fk,
         migrate_schema_version,
     )
     from .database.snapshot import snapshot_and_prune
@@ -107,6 +110,13 @@ async def lifespan(app: FastAPI):
     # the partial unique index.
     await migrate_deferred_dispatch_event_log_fk(engine)
     await migrate_deferred_dispatch_unique_pending(engine)
+    # Backfill missing UNIQUE indexes on webhook hot paths (deduping any
+    # existing duplicates). Runs after performance_indexes so non-unique
+    # support indexes are already in place.
+    await migrate_uniqueness_constraints(engine)
+    # Document EventLog.provider_id FK strategy on existing tables (no-op
+    # on SQLite besides the log line; new tables get the FK from create_all).
+    await migrate_eventlog_provider_fk(engine)
     await migrate_schema_version(engine)
     from .database.seeds import seed_all
     await seed_all()
@@ -254,6 +264,7 @@ app.include_router(webhook_router)
 app.include_router(webhooks_router)
 app.include_router(webhook_logs_router)
 app.include_router(backup_router)
+app.include_router(metrics_router)
 
 
 @app.get("/api/health")
@@ -265,15 +276,107 @@ async def health():
 
 @app.get("/api/ready")
 async def ready():
-    """Readiness: migrations and scheduler have started, app can serve traffic.
+    """Readiness: deep dependency check.
 
-    Returns 503 until the lifespan startup sequence has completed. Use this
-    for orchestrator readiness probes (Docker, Kubernetes).
+    Verifies each critical dependency is actually reachable, not just that
+    the app finished its lifespan startup. Returns 503 if any *required*
+    check fails (db, scheduler). Home Assistant supervisor presence is
+    informational — a degraded HA does not flip readiness off.
+
+    Response shape:
+        {
+          "ready": bool,
+          "checks": {"db": "ok|fail", "scheduler": "ok|fail", "ha": "ok|degraded|na"},
+          "errors": [str, ...]
+        }
     """
+    from starlette.responses import JSONResponse
+    import asyncio as _asyncio
+    from sqlalchemy import text as _text
+
+    checks: dict[str, str] = {}
+    errors: list[str] = []
+
     if not _READY:
-        from starlette.responses import JSONResponse
-        return JSONResponse({"status": "starting"}, status_code=503)
-    return {"status": "ready", "version": _APP_VERSION}
+        # Lifespan still running — short-circuit so we don't poke a half-built engine.
+        return JSONResponse(
+            {
+                "ready": False,
+                "checks": {"db": "fail", "scheduler": "fail", "ha": "na"},
+                "errors": ["startup not complete"],
+                "version": _APP_VERSION,
+            },
+            status_code=503,
+        )
+
+    # --- DB: SELECT 1 with a 2s timeout ---
+    try:
+        from .database.engine import get_engine
+        engine = get_engine()
+
+        async def _ping_db() -> None:
+            async with engine.connect() as conn:
+                await conn.execute(_text("SELECT 1"))
+
+        await _asyncio.wait_for(_ping_db(), timeout=2.0)
+        checks["db"] = "ok"
+    except Exception as exc:  # noqa: BLE001
+        checks["db"] = "fail"
+        errors.append(f"db: {exc!s}")
+
+    # --- Scheduler: APScheduler must be running ---
+    try:
+        from .services.scheduler import get_scheduler
+        scheduler = get_scheduler()
+        if scheduler.running:
+            checks["scheduler"] = "ok"
+        else:
+            checks["scheduler"] = "fail"
+            errors.append("scheduler: not running")
+    except Exception as exc:  # noqa: BLE001
+        checks["scheduler"] = "fail"
+        errors.append(f"scheduler: {exc!s}")
+
+    # --- HA supervisor: informational only ---
+    # If no HA providers are configured, report "na" (not applicable). If any
+    # HA providers exist, ensure at least one supervisor task is alive — a
+    # task being not-yet-connected is fine, we just want it to exist.
+    try:
+        from sqlmodel import select as _select
+        from sqlmodel.ext.asyncio.session import AsyncSession as _AS
+        from .database.models import ServiceProvider
+        from .services.ha_subscription import _running_tasks as _ha_tasks
+
+        from .database.engine import get_engine as _get_engine_ha
+        async with _AS(_get_engine_ha()) as _session:
+            _result = await _session.exec(
+                _select(ServiceProvider).where(
+                    ServiceProvider.type == "home_assistant",
+                )
+            )
+            ha_providers = _result.all()
+        if not ha_providers:
+            checks["ha"] = "na"
+        else:
+            alive = [
+                t for t in _ha_tasks.values() if t is not None and not t.done()
+            ]
+            checks["ha"] = "ok" if alive else "degraded"
+    except Exception as exc:  # noqa: BLE001
+        # Never let the HA probe fail readiness — it's informational.
+        checks["ha"] = "degraded"
+        errors.append(f"ha: {exc!s}")
+
+    required_ok = checks["db"] == "ok" and checks["scheduler"] == "ok"
+    body = {
+        "ready": required_ok,
+        "checks": checks,
+        "errors": errors,
+        "version": _APP_VERSION,
+    }
+    if not required_ok:
+        return JSONResponse(body, status_code=503)
+    return body
 
 
 # --- Serve frontend static files (production) ---
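
The readiness handler in the hunk above runs each probe under a timeout and folds the outcomes into a `checks` / `errors` pair. A standalone, simplified sketch of that pattern follows; `run_check`, `readiness`, and the two example probes are hypothetical, and unlike the handler in the diff every probe here is treated as required.

```python
import asyncio
from typing import Awaitable, Callable

Probe = Callable[[], Awaitable[None]]


async def run_check(name: str, probe: Probe, timeout: float,
                    checks: dict[str, str], errors: list[str]) -> None:
    """Run one probe under a timeout and record "ok" or "fail" for it."""
    try:
        await asyncio.wait_for(probe(), timeout=timeout)
        checks[name] = "ok"
    except Exception as exc:  # a timeout and a probe error both count as "fail"
        checks[name] = "fail"
        errors.append(f"{name}: {type(exc).__name__}: {exc}")


async def readiness(probes: dict[str, Probe], timeout: float = 2.0) -> dict:
    """Run all probes in order and summarise them in one response body."""
    checks: dict[str, str] = {}
    errors: list[str] = []
    for name, probe in probes.items():
        await run_check(name, probe, timeout, checks, errors)
    ready = all(state == "ok" for state in checks.values())
    return {"ready": ready, "checks": checks, "errors": errors}


async def ok_probe() -> None:
    """Example probe that succeeds immediately."""
    await asyncio.sleep(0)


async def slow_probe() -> None:
    """Example probe that outlives any short timeout."""
    await asyncio.sleep(1.0)
```

Including the exception type in the error string helps because a timeout raised by `asyncio.wait_for` usually carries an empty message.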