feat: production readiness — security, perf, bug fixes, bridge self-monitoring

Comprehensive multi-area pass driven by a parallel 8-agent production
review, covering frontend, backend, database, security, performance,
and operations, plus a new self-monitoring feature.

## Critical fixes
- Planka webhook: reads bounded raw body (was NameError on every call)
- HA quiet hours: ha_state_changed/automation_triggered/service_called/
  event_fired added to deferrable set (were silently dropped)
- DNS-rebinding SSRF: PinnedResolver wired into shared aiohttp session
- Telegram inbound webhook: secret now mandatory (401 without)
- Generic webhook: auth_mode="none" requires explicit
  acknowledge_unauthenticated=true; per-IP rate limit 60/min
- svelte-check: 5 null-narrowing errors in EventDetailModal fixed
- Provider hardcoding: Immich-only block extracted to descriptor
  featureDiscoveryHint
- command_sync: snapshot+expunge bot before exiting AsyncSession
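
The per-IP limit on the generic webhook (60/min) could be enforced with a
sliding window along these lines. This is a minimal sketch, not the code in
this commit: the class name, the injectable clock, and the sliding-window
algorithm itself are assumptions made for illustration.

```python
import time
from collections import defaultdict, deque
from typing import Callable


class PerIpRateLimiter:
    """Hypothetical limiter: at most `limit` hits per `window` seconds per IP."""

    def __init__(self, limit: int = 60, window: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._limit = limit
        self._window = window
        self._clock = clock
        self._hits: dict[str, deque[float]] = defaultdict(deque)

    def allow(self, ip: str) -> bool:
        now = self._clock()
        hits = self._hits[ip]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self._window:
            hits.popleft()
        if len(hits) >= self._limit:
            return False  # the caller would reject the request here
        hits.append(now)
        return True


# Fake clock so the demo does not depend on wall time.
now = [0.0]
limiter = PerIpRateLimiter(clock=lambda: now[0])
first_minute = [limiter.allow("203.0.113.7") for _ in range(61)]
other_ip = limiter.allow("198.51.100.2")   # separate budget per source IP
now[0] = 60.0
after_window = limiter.allow("203.0.113.7")
```

Keying the window on the source IP keeps one noisy sender from consuming the
budget of the others.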

## Bug fixes
- notifier asyncio.gather(return_exceptions=True) — one bad chat no longer
  cancels peer sends
- NotificationDispatcher hoisted out of per-tracker loop
- Provider credential resolution unified across all 5 dispatch sites
- HA asyncio.shield now drains inner task on cancellation
- Provider construction switched from if/elif ladder to factory registry
- NUT first poll seeds silently (no spurious ups_on_battery)
- Quiet-hours gate: event-type-disabled now wins over deferral
- APScheduler drain job ID resolution upgraded to seconds
- HA on_status_change wired through to EventLog
- Webhook payload rollback failures now logged (not swallowed)
- Batched receivers/chats/bots in load_link_data (was per-target N+1)
- flag_modified on JSON column reassignments in deferred_dispatch
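
The notifier change relies on `asyncio.gather(..., return_exceptions=True)`.
A minimal sketch of that behaviour follows; `send` and `send_all` are
hypothetical stand-ins, not the bridge's real notifier functions.

```python
import asyncio


async def send(chat_id: int) -> str:
    # Hypothetical sender: chat 2 simulates a chat that rejects the message.
    if chat_id == 2:
        raise RuntimeError(f"chat {chat_id} rejected the message")
    await asyncio.sleep(0)
    return f"sent:{chat_id}"


async def send_all(chat_ids: list[int]) -> tuple[list[str], list[BaseException]]:
    # return_exceptions=True hands failures back as values, so every send
    # is awaited and its outcome is available to the caller.
    results = await asyncio.gather(
        *(send(c) for c in chat_ids), return_exceptions=True
    )
    ok = [r for r in results if not isinstance(r, BaseException)]
    failed = [r for r in results if isinstance(r, BaseException)]
    return ok, failed


ok, failed = asyncio.run(send_all([1, 2, 3]))
```

Without the flag, the first exception propagates out of `gather()` and the
results of the other sends are lost to the caller.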

## Database
- UNIQUE indexes on service_provider.webhook_token,
  telegram_bot.webhook_path_id, partial UNIQUE on telegram_bot.bot_id,
  telegram_chat(bot_id, chat_id), notification_tracker_target unique link,
  partial UNIQUE on bridge_self provider per user
- Composite ix_event_log_user_event_type_created index
- save_chat_from_webhook switched to ON CONFLICT DO UPDATE
- ondelete=CASCADE on user-id FKs (model annotation; app-side cascade
  delete added for existing data)
- delete_notification_tracker converted from N+1 to bulk DELETE/UPDATE
- Module-level asyncio.Lock replaced with lazy _get_lock() pattern
- VACUUM INTO snapshot now PRAGMA integrity_check verified
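
The `ON CONFLICT DO UPDATE` switch depends on the new UNIQUE index on
`telegram_chat(bot_id, chat_id)`. The sketch below shows the mechanism with
the stdlib `sqlite3` module; the `title` column and the `save_chat` helper
are assumptions for illustration, and the real `save_chat_from_webhook` is
not shown in this diff.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE telegram_chat ("
    " id INTEGER PRIMARY KEY, bot_id INTEGER, chat_id INTEGER, title TEXT)"
)
# The upsert's conflict target must match a UNIQUE index or constraint.
conn.execute(
    "CREATE UNIQUE INDEX uq_telegram_chat_bot_chat "
    "ON telegram_chat(bot_id, chat_id)"
)


def save_chat(bot_id: int, chat_id: int, title: str) -> None:
    # Insert a new row, or refresh the existing one for the same pair.
    conn.execute(
        "INSERT INTO telegram_chat (bot_id, chat_id, title) VALUES (?, ?, ?) "
        "ON CONFLICT(bot_id, chat_id) DO UPDATE SET title = excluded.title",
        (bot_id, chat_id, title),
    )


save_chat(1, 100, "old title")
save_chat(1, 100, "new title")  # same (bot_id, chat_id): updates in place
rows = conn.execute(
    "SELECT bot_id, chat_id, title FROM telegram_chat"
).fetchall()
```

A single statement replaces the select-then-insert sequence, so two webhook
deliveries racing on the same chat cannot produce duplicate rows.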

## Performance
- Jinja2 template compilation LRU cached (lru_cache maxsize=512)
- Per-locale render cache in NotificationDispatcher (skips re-rendering
  identical content for receivers sharing a locale)
- Tracker list cached per provider_id with 5s TTL + explicit invalidation
  on tracker CRUD (relieves HA chat-bus rate query pressure)
- Nav-counts collapsed from 16 round-trips to single UNION ALL
- HA event_log: skip persisting empty assets_added/removed events
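
A short-TTL cache with explicit invalidation, as described for the tracker
list, might look like the following. This is a sketch under assumptions:
`TrackerListCache`, its loader callback, and the injectable clock are
hypothetical names, not the bridge's actual implementation.

```python
import time
from typing import Callable


class TrackerListCache:
    """Hypothetical per-provider cache with a short TTL."""

    def __init__(self, loader: Callable[[int], list[str]], ttl: float = 5.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._loader = loader
        self._ttl = ttl
        self._clock = clock
        self._entries: dict[int, tuple[float, list[str]]] = {}

    def get(self, provider_id: int) -> list[str]:
        now = self._clock()
        hit = self._entries.get(provider_id)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]  # still fresh: no query issued
        value = self._loader(provider_id)
        self._entries[provider_id] = (now, value)
        return value

    def invalidate(self, provider_id: int) -> None:
        # Called from tracker create/update/delete so edits show up at once.
        self._entries.pop(provider_id, None)


calls: list[int] = []
now = [0.0]


def load(provider_id: int) -> list[str]:
    calls.append(provider_id)  # stands in for the database query
    return [f"tracker-for-{provider_id}"]


cache = TrackerListCache(load, ttl=5.0, clock=lambda: now[0])
cache.get(7)         # miss: loads
cache.get(7)         # hit: served from cache
now[0] = 6.0
cache.get(7)         # TTL expired: loads again
cache.invalidate(7)
cache.get(7)         # invalidated: loads again
```

The TTL bounds staleness for changes made outside the CRUD path, while the
explicit invalidation keeps user edits immediately visible.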

## Security hardening
- Mass-assignment guard on Action create/update; cron sub-minute reject
- Backup JSON depth/node-count cap (depth ≤ 10, nodes ≤ 100k)
- _sanitize_config extended to all JSON-typed fields on backup import
- Telegram _safe_get walks redirects manually with SSRF revalidation
- Bcrypt 72-byte password length cap with clear 422
- Webhook payload body redaction; sensitive substring set extended with
  oauth/client_secret/webhook_secret/csrf in both header filter and
  template extras filter
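
The backup depth/node-count cap can be checked with an iterative walk over
the parsed JSON. The sketch below assumes that depth counts nested containers
(the top-level container is depth 1) and that every value counts as one node;
the commit does not spell out either convention, and `check_json_limits` is a
hypothetical name.

```python
import json

MAX_DEPTH = 10
MAX_NODES = 100_000


def check_json_limits(value, max_depth: int = MAX_DEPTH,
                      max_nodes: int = MAX_NODES) -> int:
    """Return the node count, or raise ValueError when a cap is exceeded."""
    nodes = 0
    stack = [(value, 1)]  # explicit stack: no recursion on hostile input
    while stack:
        item, depth = stack.pop()
        nodes += 1
        if nodes > max_nodes:
            raise ValueError(f"backup has more than {max_nodes} nodes")
        if isinstance(item, (dict, list)):
            if depth > max_depth:
                raise ValueError(f"backup is nested deeper than {max_depth}")
            children = item.values() if isinstance(item, dict) else item
            stack.extend((child, depth + 1) for child in children)
    return nodes


def nested(levels: int):
    # Build a list nested `levels` deep around a scalar, e.g. [[0]].
    value = 0
    for _ in range(levels):
        value = [value]
    return value


small = check_json_limits(json.loads('{"a": [1, 2, {"b": null}]}'))
at_limit = check_json_limits(nested(10))
try:
    check_json_limits(nested(11))
    too_deep_rejected = False
except ValueError:
    too_deep_rejected = True
```

Running the check before any import logic touches the payload keeps an
oversized or pathologically nested backup from reaching the database layer.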

## Frontend
- 76 catch (err: any) sites converted to errMsg(err) helper
- globalProviderFilter: pure getter; reconciliation moved to one-time
  $effect in +layout
- Provider-filter binding: removed paired $effects + _syncingFilter flag,
  now one-way derived
- entity-cache: separate _refreshing flag for background re-fetches
- api.ts 401 handling: AuthRedirectError class + dedup _redirecting flag,
  goto() instead of window.location.href
- a11y: aria-expanded on mobile More, role=switch + aria-checked on
  Telegram bot toggles

## Tests & operations
- CI pytest gate added to .gitea/workflows/build.yml + release.yml
  (wheel-built install to dodge editable-install slowness)
- /api/ready upgraded to deep healthcheck (db SELECT 1, scheduler.running,
  HA supervisor presence) returning {ready, checks, errors, version}
- /api/metrics endpoint with prometheus_client (deferred_pending,
  event_log_total, dispatch_duration, poll_failures, send_failures)
- New OPERATIONS.md covering deploy, healthchecks, metrics, backup/restore
  procedures, log handling, common scenarios, upgrade flow
- New tests: test_bridge_self (11), test_gitea_parser (9),
  test_planka_parser (6), test_immich_change_detector (6),
  test_backup_roundtrip (1)
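
The `{ready, checks, errors, version}` response of `/api/ready` could be
assembled as sketched here. The helper name, the callable-per-check design,
and the error string format are assumptions; only the four top-level keys
come from this commit message.

```python
from typing import Callable


def build_ready_payload(checks: dict[str, Callable[[], bool]],
                        version: str) -> dict:
    results: dict[str, bool] = {}
    errors: list[str] = []
    for name, check in checks.items():
        try:
            ok = bool(check())
            if not ok:
                errors.append(f"{name}: check returned false")
        except Exception as exc:  # a failing probe must not crash the endpoint
            ok = False
            errors.append(f"{name}: {exc}")
        results[name] = ok
    return {
        "ready": all(results.values()),
        "checks": results,
        "errors": errors,
        "version": version,
    }


def failing_db() -> bool:
    # Hypothetical probe standing in for a failed "SELECT 1".
    raise ConnectionError("database unreachable")


healthy = build_ready_payload(
    {"db": lambda: True, "scheduler": lambda: True}, "0.0.0"
)
degraded = build_ready_payload(
    {"db": failing_db, "scheduler": lambda: True}, "0.0.0"
)
```

Reporting each check separately lets an operator see which dependency is
down without reading logs.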

## New feature: bridge self-monitoring
- New bridge_self provider type — internal sink for bridge health events
- Three event types: bridge_self_poll_failures (consecutive tracker poll
  failures), bridge_self_deferred_backlog (pending count crosses
  threshold), bridge_self_target_failures (consecutive 5xx/network
  failures per target)
- Per-user thresholds (defaults: 3 / 100 / 5) configurable via the
  provider config form
- Auto-seeded on user create + /setup + boot backfill for existing users
- Anti-spam: counters reset after emission; backlog uses transition latch
- Self-loop guard: bridge_self failures don't count toward target-failure
  thresholds (logged only) — wire to your own Telegram/Email/Matrix to
  get notified when polls/dispatches/sends fail
- 6 default templates (3 events × 2 locales), tracking config columns
  with backfill migration, frontend descriptor (excluded from "create
  provider" wizard since auto-managed)
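
The two anti-spam mechanisms (counter reset after emission, transition latch
for the backlog) can be sketched as below. Class names are hypothetical, and
two details are assumptions: a successful poll resets the consecutive-failure
counter, and the backlog threshold is treated as inclusive.

```python
from dataclasses import dataclass


@dataclass
class FailureCounter:
    """Emits once per `threshold` consecutive failures, then starts over."""
    threshold: int = 3
    count: int = 0

    def record(self, ok: bool) -> bool:
        if ok:
            self.count = 0  # a success breaks the consecutive run
            return False
        self.count += 1
        if self.count >= self.threshold:
            self.count = 0  # reset after emission to avoid repeat alerts
            return True
        return False


@dataclass
class BacklogLatch:
    """Emits only on the transition from below to at-or-above threshold."""
    threshold: int = 100
    tripped: bool = False

    def observe(self, pending: int) -> bool:
        above = pending >= self.threshold
        fire = above and not self.tripped
        self.tripped = above  # re-arms once the backlog drains
        return fire


counter = FailureCounter(threshold=3)
poll_events = [counter.record(ok)
               for ok in (False, False, False, False, True, False)]

latch = BacklogLatch(threshold=100)
backlog_events = [latch.observe(n) for n in (50, 120, 150, 80, 100)]
```

The counter alerts again only after another full run of failures, while the
latch stays silent for as long as the backlog remains above the threshold.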

Operator-visible behavior changes (call out in release notes):
- NOTIFY_BRIDGE_TELEGRAM_WEBHOOK_SECRET now REQUIRED for webhook mode
- Existing webhook providers with auth_mode="none" need explicit opt-in
- Generic webhook endpoint rate-limited 60/min per source IP
- HA disconnect/reconnect writes ha_status_* EventLog rows
- Every user gets a bridge_self provider — wire it to a target to
  receive failure alerts

Pre-existing test failures (test_ssrf, test_release_provider) on
Python 3.13 are unrelated; CI runs on 3.12.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 02:16:49 +03:00
parent 22127e2a59
commit 10d30fc956
97 changed files with 5423 additions and 821 deletions
@@ -309,6 +309,22 @@ async def migrate_schema(engine: AsyncEngine) -> None:
                    )
                    logger.info("Added %s column to tracking_config table", col_name)

        # Add Bridge self-monitoring tracking flags to tracking_config if missing.
        # All three default ON: the bridge_self provider exists specifically
        # to surface these conditions, so silencing one would defeat the point.
        if await _has_table(conn, "tracking_config"):
            bridge_self_flags = [
                ("track_bridge_self_poll_failures", "INTEGER DEFAULT 1"),
                ("track_bridge_self_deferred_backlog", "INTEGER DEFAULT 1"),
                ("track_bridge_self_target_failures", "INTEGER DEFAULT 1"),
            ]
            for col_name, col_type in bridge_self_flags:
                if not await _has_column(conn, "tracking_config", col_name):
                    await conn.execute(
                        text(f"ALTER TABLE tracking_config ADD COLUMN {col_name} {col_type}")
                    )
                    logger.info("Added %s column to tracking_config table", col_name)

        # Add quiet hours to tracking_config if missing.
        # Start/end are nullable HH:MM strings; quiet_hours_enabled gates them.
        if await _has_table(conn, "tracking_config"):
@@ -1361,6 +1377,12 @@ _INDEXES: list[tuple[str, str, str]] = [
    ("ix_action_provider_id", "action", "provider_id"),
    # Dashboard: SELECT event_log WHERE user_id = ? ORDER BY created_at DESC
    ("ix_event_log_user_created", "event_log", "user_id, created_at DESC"),
    # Dashboard "events of type X for me, recent first" filter.
    (
        "ix_event_log_user_event_type_created",
        "event_log",
        "user_id, event_type, created_at DESC",
    ),
    ("ix_event_log_provider_id", "event_log", "provider_id"),
    ("ix_event_log_notification_tracker_id", "event_log", "notification_tracker_id"),
    ("ix_event_log_action_id", "event_log", "action_id"),
@@ -1543,6 +1565,269 @@ async def migrate_chat_action_to_column(engine: AsyncEngine) -> None:
logger.info("Migrated chat_action from config JSON to column where present")
# ---------------------------------------------------------------------------
# Uniqueness + dedupe migrations for webhook hot paths.
#
# These backfill missing UNIQUE indexes on webhook tokens, webhook path IDs,
# bot_id (with sentinel guard), (bot_id, chat_id), and tracker-target links.
# Every CREATE UNIQUE INDEX is preceded by a dedupe pass that keeps the
# canonical row (lowest id, or oldest created_at where specified) and removes
# the rest, logging a WARNING with the dropped count so operators can audit.
# ---------------------------------------------------------------------------


async def _dedupe_by_columns(
    conn,
    table: str,
    cols: list[str],
    *,
    keep: str = "min_id",
    label: str = "",
) -> int:
    """Delete duplicate rows leaving one survivor per ``cols`` group.

    ``keep`` chooses the survivor:

    - ``"min_id"`` keeps the row with the lowest ``id`` (default; used
      when there is no semantic "first" row to preserve).
    - ``"min_created_at"`` keeps the row with the oldest ``created_at``,
      falling back to the lowest id on ties; preferred for tracker-target
      links so the original link wins.

    Returns the number of rows deleted. All identifiers flow through
    ``_assert_ident`` to neutralise SQL injection from any caller mistake.
    """
    _assert_ident(table, "table")
    for c in cols:
        _assert_ident(c, "column")
    group_by = ", ".join(cols)
    where_cols = " AND ".join(f"{c} = g.{c}" for c in cols)
    if keep == "min_created_at":
        # Tie-break on id so the survivor is deterministic even if two rows
        # share the same created_at (insert-batches commonly do).
        survivor_sql = (
            f"SELECT id FROM {table} "
            f"WHERE {where_cols} "
            f"ORDER BY created_at ASC, id ASC LIMIT 1"
        )
    elif keep == "min_id":
        survivor_sql = f"SELECT MIN(id) FROM {table} WHERE {where_cols}"
    else:
        raise ValueError(f"Unknown keep strategy: {keep!r}")
    delete_sql = (
        f"DELETE FROM {table} WHERE id IN ("
        f"  SELECT t.id FROM {table} t "
        f"  JOIN ("
        f"    SELECT {group_by} FROM {table} "
        f"    GROUP BY {group_by} HAVING COUNT(*) > 1"
        f"  ) g ON {' AND '.join(f't.{c} = g.{c}' for c in cols)} "
        f"  WHERE t.id NOT IN ({survivor_sql})"
        f")"
    )
    result = await conn.execute(text(delete_sql))
    deleted = int(getattr(result, "rowcount", 0) or 0)
    if deleted:
        logger.warning(
            "Removed %d duplicate row(s) from %s on (%s)%s",
            deleted, table, ", ".join(cols),
            # Leading space + brackets keep the label from fusing with the
            # closing parenthesis of the column list.
            f" [{label}]" if label else "",
        )
    return deleted


async def migrate_uniqueness_constraints(engine: AsyncEngine) -> None:
    """Backfill missing UNIQUE indexes on webhook hot paths.

    SQLite cannot ALTER an existing column to add a UNIQUE constraint, but
    a UNIQUE INDEX is functionally equivalent and can be created with
    ``IF NOT EXISTS`` on every boot. Each index is preceded by a dedupe
    pass so the index creation does not fail on existing duplicates.

    Indexes added:

    - service_provider.webhook_token (full unique)
    - telegram_bot.webhook_path_id (full unique)
    - telegram_bot.bot_id (partial unique WHERE bot_id != 0; 0 is a
      sentinel meaning "not yet validated")
    - telegram_chat (bot_id, chat_id) (full unique composite)
    - notification_tracker_target (notification_tracker_id, target_id)
      (full unique composite)
    - service_provider (user_id) (partial unique WHERE type='bridge_self')
    """
    # Skip on non-SQLite engines: they enforce UNIQUE via the model
    # metadata (create_all) and don't have sqlite_master introspection.
    if not str(engine.url).startswith("sqlite"):
        return

    async with engine.begin() as conn:
        # service_provider.webhook_token
        if await _has_table(conn, "service_provider") and await _has_column(
            conn, "service_provider", "webhook_token",
        ):
            await _dedupe_by_columns(
                conn, "service_provider", ["webhook_token"],
                keep="min_id", label="webhook_token uniqueness",
            )
            await conn.execute(text(
                "CREATE UNIQUE INDEX IF NOT EXISTS "
                "uq_service_provider_webhook_token "
                "ON service_provider(webhook_token)"
            ))

        # telegram_bot.webhook_path_id (full unique)
        # telegram_bot.bot_id (partial unique excluding sentinel 0)
        if await _has_table(conn, "telegram_bot"):
            if await _has_column(conn, "telegram_bot", "webhook_path_id"):
                await _dedupe_by_columns(
                    conn, "telegram_bot", ["webhook_path_id"],
                    keep="min_id", label="webhook_path_id uniqueness",
                )
                await conn.execute(text(
                    "CREATE UNIQUE INDEX IF NOT EXISTS "
                    "uq_telegram_bot_webhook_path_id "
                    "ON telegram_bot(webhook_path_id)"
                ))
            if await _has_column(conn, "telegram_bot", "bot_id"):
                # Dedupe only non-sentinel rows. Two unverified bots both
                # carrying bot_id=0 is legitimate; only collisions among
                # validated bot_ids signal a real corruption to clean up.
                deleted = await conn.execute(text(
                    "DELETE FROM telegram_bot WHERE id IN ("
                    "  SELECT t.id FROM telegram_bot t "
                    "  JOIN ("
                    "    SELECT bot_id FROM telegram_bot "
                    "    WHERE bot_id != 0 GROUP BY bot_id HAVING COUNT(*) > 1"
                    "  ) g ON t.bot_id = g.bot_id "
                    "  WHERE t.id NOT IN ("
                    "    SELECT MIN(id) FROM telegram_bot WHERE bot_id = g.bot_id"
                    "  )"
                    ")"
                ))
                rc = int(getattr(deleted, "rowcount", 0) or 0)
                if rc:
                    logger.warning(
                        "Removed %d duplicate telegram_bot row(s) on bot_id "
                        "(non-sentinel collisions)", rc,
                    )
                # Plain INDEX for the lookup-by-bot_id path.
                await conn.execute(text(
                    "CREATE INDEX IF NOT EXISTS ix_telegram_bot_bot_id "
                    "ON telegram_bot(bot_id)"
                ))
                # Partial UNIQUE excluding the sentinel.
                await conn.execute(text(
                    "CREATE UNIQUE INDEX IF NOT EXISTS "
                    "uq_telegram_bot_bot_id_nonzero "
                    "ON telegram_bot(bot_id) WHERE bot_id != 0"
                ))

        # telegram_chat (bot_id, chat_id): keep the survivor with the oldest
        # discovered_at so the original discovery row wins. _dedupe_by_columns
        # only handles created_at; do this one inline.
        if await _has_table(conn, "telegram_chat"):
            res = await conn.execute(text(
                "DELETE FROM telegram_chat WHERE id IN ("
                "  SELECT t.id FROM telegram_chat t "
                "  JOIN ("
                "    SELECT bot_id, chat_id FROM telegram_chat "
                "    GROUP BY bot_id, chat_id HAVING COUNT(*) > 1"
                "  ) g ON t.bot_id = g.bot_id AND t.chat_id = g.chat_id "
                "  WHERE t.id NOT IN ("
                "    SELECT id FROM telegram_chat "
                "    WHERE bot_id = g.bot_id AND chat_id = g.chat_id "
                "    ORDER BY discovered_at ASC, id ASC LIMIT 1"
                "  )"
                ")"
            ))
            rc = int(getattr(res, "rowcount", 0) or 0)
            if rc:
                logger.warning(
                    "Removed %d duplicate telegram_chat row(s) on (bot_id, chat_id)",
                    rc,
                )
            await conn.execute(text(
                "CREATE UNIQUE INDEX IF NOT EXISTS uq_telegram_chat_bot_chat "
                "ON telegram_chat(bot_id, chat_id)"
            ))
            await conn.execute(text(
                "CREATE INDEX IF NOT EXISTS ix_telegram_chat_bot_chat "
                "ON telegram_chat(bot_id, chat_id)"
            ))

        # notification_tracker_target (notification_tracker_id, target_id):
        # keep the oldest created_at link so the original wins.
        if await _has_table(conn, "notification_tracker_target") and await _has_column(
            conn, "notification_tracker_target", "notification_tracker_id",
        ):
            await _dedupe_by_columns(
                conn,
                "notification_tracker_target",
                ["notification_tracker_id", "target_id"],
                keep="min_created_at",
                label="tracker-target link uniqueness",
            )
            await conn.execute(text(
                "CREATE UNIQUE INDEX IF NOT EXISTS uq_ntt_tracker_target "
                "ON notification_tracker_target(notification_tracker_id, target_id)"
            ))

        # service_provider partial unique on (user_id) WHERE type='bridge_self'.
        # Bridge-self is special: exactly one row per user, auto-seeded at boot,
        # at user-create, and on /setup. Without this guard, a concurrent boot
        # backfill + POST /api/users could double-insert. Dedupe keeps the
        # oldest row so any user-customised thresholds on it survive.
        if await _has_table(conn, "service_provider"):
            res = await conn.execute(text(
                "DELETE FROM service_provider WHERE id IN ("
                "  SELECT t.id FROM service_provider t "
                "  JOIN ("
                "    SELECT user_id FROM service_provider "
                "    WHERE type='bridge_self' GROUP BY user_id HAVING COUNT(*) > 1"
                "  ) g ON t.user_id = g.user_id "
                "  WHERE t.type='bridge_self' AND t.id NOT IN ("
                "    SELECT MIN(id) FROM service_provider "
                "    WHERE type='bridge_self' AND user_id = g.user_id"
                "  )"
                ")"
            ))
            rc = int(getattr(res, "rowcount", 0) or 0)
            if rc:
                logger.warning(
                    "Removed %d duplicate bridge_self service_provider row(s) "
                    "on user_id", rc,
                )
            await conn.execute(text(
                "CREATE UNIQUE INDEX IF NOT EXISTS "
                "uq_service_provider_bridge_self_per_user "
                "ON service_provider(user_id) WHERE type='bridge_self'"
            ))


async def migrate_eventlog_provider_fk(engine: AsyncEngine) -> None:
    """Document the EventLog.provider_id FK situation.

    SQLite cannot ALTER a column to add a foreign-key constraint without
    rebuilding the table. The model annotation now declares
    ``ondelete=SET NULL`` which only takes effect on freshly created
    tables (i.e. brand-new installs). For existing installs we rely on
    application-side cleanup in ``api/providers.delete_provider`` to NULL
    out ``event_log.provider_id`` rows before deleting the provider row.

    This migration is intentionally a no-op aside from the log line. It
    exists so the migration order is explicit and operators see in the
    logs that the FK strategy was reviewed on this boot.
    """
    if not str(engine.url).startswith("sqlite"):
        return
    async with engine.begin() as conn:
        if not await _has_table(conn, "event_log"):
            return
        # No DDL change. Application code in api/providers.delete_provider
        # is the source of truth for the SET NULL semantic on existing tables.
        logger.debug(
            "event_log.provider_id FK enforcement deferred to application "
            "code on existing SQLite tables (model declares ondelete=SET NULL "
            "which applies to fresh schemas only)."
        )


# ---------------------------------------------------------------------------
# Schema version tracking — lightweight alternative to Alembic while the
# hand-rolled idempotent migrations remain the source of truth. Gives