feat: production readiness — security, perf, bug fixes, bridge self-monitoring

Comprehensive multi-area pass driven by a parallel 8-agent production
review. Covers frontend, backend, database, security, performance, and
operations, plus a new bridge self-monitoring feature.

## Critical fixes
- Planka webhook: reads bounded raw body (was NameError on every call)
- HA quiet hours: ha_state_changed/automation_triggered/service_called/
  event_fired added to deferrable set (were silently dropped)
- DNS-rebinding SSRF: PinnedResolver wired into shared aiohttp session
- Telegram inbound webhook: secret now mandatory (401 without)
- Generic webhook: auth_mode="none" requires explicit
  acknowledge_unauthenticated=true; per-IP rate limit 60/min
- svelte-check: 5 null-narrowing errors in EventDetailModal fixed
- Provider hardcoding: Immich-only block extracted to descriptor
  featureDiscoveryHint
- command_sync: snapshot+expunge bot before exiting AsyncSession
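
The mandatory-secret rule for the Telegram inbound webhook can be sketched as below. This is a minimal illustration, not the bridge's handler: `webhook_status` and `SECRET_HEADER` are hypothetical names, and a plain mapping stands in for the framework's case-insensitive headers. The header itself is the one Telegram attaches when `setWebhook` is called with `secret_token`.

```python
import hmac
from typing import Mapping

# Header Telegram attaches when setWebhook was called with secret_token.
SECRET_HEADER = "X-Telegram-Bot-Api-Secret-Token"


def webhook_status(headers: Mapping[str, str], expected_secret: str) -> int:
    """Return the HTTP status an inbound update should get: 200 or 401."""
    supplied = headers.get(SECRET_HEADER, "")
    # An unset secret is a rejection, never a silent "auth disabled" mode.
    if not expected_secret:
        return 401
    # Constant-time comparison avoids leaking the secret through timing.
    if not hmac.compare_digest(supplied.encode(), expected_secret.encode()):
        return 401
    return 200
```

The early return on an empty `expected_secret` is the point of the sketch: a missing configuration value fails closed instead of matching an equally empty header.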

## Bug fixes
- notifier asyncio.gather(return_exceptions=True) — one bad chat no longer
  cancels peer sends
- NotificationDispatcher hoisted out of per-tracker loop
- Provider credential resolution unified across all 5 dispatch sites
- HA asyncio.shield now drains inner task on cancellation
- Provider construction switched from if/elif ladder to factory registry
- NUT first poll seeds silently (no spurious ups_on_battery)
- Quiet-hours gate: event-type-disabled now wins over deferral
- APScheduler drain job ID resolution upgraded to seconds
- HA on_status_change wired through to EventLog
- Webhook payload rollback failures now logged (not swallowed)
- Batched receivers/chats/bots in load_link_data (was per-target N+1)
- flag_modified on JSON column reassignments in deferred_dispatch
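
A small sketch of the `asyncio.gather(return_exceptions=True)` behaviour the notifier fix relies on. The `send` coroutine and chat IDs are made up for illustration; only the `gather` call reflects the change described above.

```python
import asyncio


async def send(chat_id: int) -> str:
    """Hypothetical per-chat send; chat 2 always fails."""
    if chat_id == 2:
        raise RuntimeError("chat 2 unreachable")
    await asyncio.sleep(0)
    return f"sent:{chat_id}"


async def send_all(chat_ids: list[int]) -> tuple[list[str], list[BaseException]]:
    # With return_exceptions=True each failure is returned in place of a
    # result, so every send runs to completion and can be inspected.
    results = await asyncio.gather(
        *(send(c) for c in chat_ids), return_exceptions=True
    )
    delivered = [r for r in results if not isinstance(r, BaseException)]
    failed = [r for r in results if isinstance(r, BaseException)]
    return delivered, failed


delivered, failed = asyncio.run(send_all([1, 2, 3]))
```

Results keep the order of the input awaitables, which makes it straightforward to pair each failure with its chat when logging.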

## Database
- UNIQUE indexes on service_provider.webhook_token,
  telegram_bot.webhook_path_id, partial UNIQUE on telegram_bot.bot_id,
  telegram_chat(bot_id, chat_id), notification_tracker_target unique link,
  partial UNIQUE on bridge_self provider per user
- Composite ix_event_log_user_event_type_created index
- save_chat_from_webhook switched to ON CONFLICT DO UPDATE
- ondelete=CASCADE on user-id FKs (model annotation; app-side cascade
  delete added for existing data)
- delete_notification_tracker converted from N+1 to bulk DELETE/UPDATE
- Module-level asyncio.Lock replaced with lazy _get_lock() pattern
- VACUUM INTO snapshot now PRAGMA integrity_check verified
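
The `ON CONFLICT DO UPDATE` switch depends on the new UNIQUE index on `telegram_chat(bot_id, chat_id)`. The following is a rough standalone sketch using the standard-library `sqlite3` module; the `title` column and the `save_chat` helper are assumptions for illustration and do not mirror the real model or the SQLAlchemy statement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE telegram_chat (
        id INTEGER PRIMARY KEY,
        bot_id INTEGER NOT NULL,
        chat_id INTEGER NOT NULL,
        title TEXT,
        UNIQUE (bot_id, chat_id)
    )
    """
)


def save_chat(db: sqlite3.Connection, bot_id: int, chat_id: int, title: str) -> None:
    # One statement: insert a new row, or refresh the existing one when the
    # (bot_id, chat_id) pair is already present.
    db.execute(
        """
        INSERT INTO telegram_chat (bot_id, chat_id, title)
        VALUES (?, ?, ?)
        ON CONFLICT (bot_id, chat_id) DO UPDATE SET title = excluded.title
        """,
        (bot_id, chat_id, title),
    )


save_chat(conn, 1, 100, "old title")
save_chat(conn, 1, 100, "new title")
rows = conn.execute("SELECT bot_id, chat_id, title FROM telegram_chat").fetchall()
```

Compared with a SELECT followed by an INSERT or UPDATE, the upsert cannot race with a concurrent webhook for the same chat, because the uniqueness check and the write happen in one statement.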

## Performance
- Jinja2 template compilation LRU cached (lru_cache maxsize=512)
- Per-locale render cache in NotificationDispatcher (skips re-rendering
  identical content for receivers sharing a locale)
- Tracker list cached per provider_id with 5s TTL + explicit invalidation
  on tracker CRUD (relieves HA chat-bus rate query pressure)
- Nav-counts collapsed from 16 round-trips to single UNION ALL
- HA event_log: skip persisting empty assets_added/removed events
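
The per-provider tracker cache with a 5s TTL and explicit invalidation could look roughly like this. `TrackerListCache`, its injected clock, and the `loader` callback are hypothetical; the real cache lives in `event_dispatch` and may be structured differently.

```python
import time
from typing import Callable


class TrackerListCache:
    """TTL cache keyed by provider_id, with explicit invalidation."""

    def __init__(
        self,
        ttl_seconds: float = 5.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: dict[int, tuple[float, list[str]]] = {}

    def get(self, provider_id: int, loader: Callable[[int], list[str]]) -> list[str]:
        now = self._clock()
        hit = self._entries.get(provider_id)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]  # still fresh: skip the database query
        value = loader(provider_id)
        self._entries[provider_id] = (now, value)
        return value

    def invalidate(self, provider_id: int) -> None:
        # Called after tracker create/update/delete so the next event sees
        # the change without waiting out the TTL.
        self._entries.pop(provider_id, None)
```

Injecting the clock keeps the expiry logic testable without sleeping; in production the default `time.monotonic` is unaffected by wall-clock adjustments.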

## Security hardening
- Mass-assignment guard on Action create/update; cron sub-minute reject
- Backup JSON depth/node-count cap (depth ≤ 10, nodes ≤ 100k)
- _sanitize_config extended to all JSON-typed fields on backup import
- Telegram _safe_get walks redirects manually with SSRF revalidation
- Bcrypt 72-byte password length cap with clear 422
- Webhook payload body redaction; sensitive substring set extended with
  oauth/client_secret/webhook_secret/csrf in both header filter and
  template extras filter
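
The bcrypt cap is a byte limit, not a character limit, which matters for non-ASCII passwords. A minimal sketch of such a check, assuming UTF-8 encoding; `check_password_length` is a hypothetical helper and raises `ValueError` where the API would answer 422.

```python
BCRYPT_MAX_BYTES = 72  # bcrypt only considers the first 72 bytes of input


def check_password_length(password: str) -> None:
    """Raise ValueError when the encoded password exceeds the bcrypt limit."""
    encoded_length = len(password.encode("utf-8"))
    if encoded_length > BCRYPT_MAX_BYTES:
        raise ValueError(
            f"Password is {encoded_length} bytes; the maximum is {BCRYPT_MAX_BYTES}"
        )
```

Rejecting over-long input up front avoids the situation where two different passwords sharing the same first 72 bytes would hash identically.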

## Frontend
- 76 catch (err: any) sites converted to errMsg(err) helper
- globalProviderFilter: pure getter; reconciliation moved to one-time
  $effect in +layout
- Provider-filter binding: removed paired $effects + _syncingFilter flag,
  now one-way derived
- entity-cache: separate _refreshing flag for background re-fetches
- api.ts 401 handling: AuthRedirectError class + dedup _redirecting flag,
  goto() instead of window.location.href
- a11y: aria-expanded on mobile More, role=switch + aria-checked on
  Telegram bot toggles

## Tests & operations
- CI pytest gate added to .gitea/workflows/build.yml + release.yml
  (wheel-built install to dodge editable-install slowness)
- /api/ready upgraded to deep healthcheck (db SELECT 1, scheduler.running,
  HA supervisor presence) returning {ready, checks, errors, version}
- /api/metrics endpoint with prometheus_client (deferred_pending,
  event_log_total, dispatch_duration, poll_failures, send_failures)
- New OPERATIONS.md covering deploy, healthchecks, metrics, backup/restore
  procedures, log handling, common scenarios, upgrade flow
- New tests: test_bridge_self (11), test_gitea_parser (9),
  test_planka_parser (6), test_immich_change_detector (6),
  test_backup_roundtrip (1)
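
The `{ready, checks, errors, version}` shape returned by `/api/ready` can be assembled along these lines. This is only a sketch: `build_ready_report` and the probe callables are hypothetical stand-ins for the real database, scheduler, and supervisor checks.

```python
from typing import Callable


def build_ready_report(checks: dict[str, Callable[[], bool]], version: str) -> dict:
    """Run each probe and summarise the outcome in one payload."""
    results: dict[str, bool] = {}
    errors: list[str] = []
    for name, probe in checks.items():
        try:
            ok = bool(probe())
        except Exception as exc:  # a crashing probe counts as a failed check
            ok = False
            errors.append(f"{name}: {exc}")
        else:
            if not ok:
                errors.append(f"{name}: check failed")
        results[name] = ok
    return {
        "ready": all(results.values()),
        "checks": results,
        "errors": errors,
        "version": version,
    }
```

Catching exceptions per probe keeps one broken dependency from hiding the state of the others, which is what makes the endpoint useful for diagnosis as well as for load-balancer gating.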

## New feature: bridge self-monitoring
- New bridge_self provider type — internal sink for bridge health events
- Three event types: bridge_self_poll_failures (consecutive tracker poll
  failures), bridge_self_deferred_backlog (pending count crosses
  threshold), bridge_self_target_failures (consecutive 5xx/network
  failures per target)
- Per-user thresholds (defaults: 3 / 100 / 5) configurable via the
  provider config form
- Auto-seeded on user create + /setup + boot backfill for existing users
- Anti-spam: counters reset after emission; backlog uses transition latch
- Self-loop guard: bridge_self failures don't count toward target-failure
  thresholds (logged only) — wire to your own Telegram/Email/Matrix to
  get notified when polls/dispatches/sends fail
- 6 default templates (3 events × 2 locales), tracking config columns
  with backfill migration, frontend descriptor (excluded from "create
  provider" wizard since auto-managed)
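
The two anti-spam mechanisms (counter reset after emission, and a transition latch for the backlog) can be illustrated as follows. Both classes are hypothetical sketches; in particular, treating "crosses threshold" as `>=` is an assumption.

```python
class FailureCounter:
    """Emit once per `threshold` consecutive failures, then start over."""

    def __init__(self, threshold: int) -> None:
        self.threshold = threshold
        self.count = 0

    def record_failure(self) -> bool:
        self.count += 1
        if self.count >= self.threshold:
            self.count = 0  # reset after emission so the alert is not repeated
            return True
        return False

    def record_success(self) -> None:
        self.count = 0  # a success breaks the consecutive run


class BacklogLatch:
    """Emit only on the transition from below to at-or-above the threshold."""

    def __init__(self, threshold: int) -> None:
        self.threshold = threshold
        self._tripped = False

    def observe(self, pending: int) -> bool:
        over = pending >= self.threshold
        fire = over and not self._tripped
        self._tripped = over  # re-arms once the backlog drops below threshold
        return fire
```

The counter suits discrete failure events, while the latch suits a level that is sampled repeatedly: a backlog sitting above the threshold for an hour produces one alert, not one per sample.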

Operator-visible behavior changes (call out in release notes):
- NOTIFY_BRIDGE_TELEGRAM_WEBHOOK_SECRET now REQUIRED for webhook mode
- Existing webhook providers with auth_mode="none" need explicit opt-in
- Generic webhook endpoint rate-limited 60/min per source IP
- HA disconnect/reconnect writes ha_status_* EventLog rows
- Every user gets a bridge_self provider — wire it to a target to
  receive failure alerts

Pre-existing test failures (test_ssrf, test_release_provider) on
Python 3.13 are unrelated; CI runs on 3.12.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 02:16:49 +03:00
parent 22127e2a59
commit 10d30fc956
97 changed files with 5423 additions and 821 deletions
@@ -1,6 +1,7 @@
"""Action management API routes — CRUD, execute, dry-run, executions."""
import logging
import re
from fastapi import APIRouter, Depends, HTTPException, Query, status
from pydantic import BaseModel
@@ -54,6 +55,58 @@ class ActionUpdate(BaseModel):
# ---------------------------------------------------------------------------
# Allowlist of fields a CRUD client may set on Action. Mirrors ActionCreate /
# ActionUpdate but enforced server-side so a tampered request body cannot
# overwrite ``user_id``, ``last_run_at``, ``created_at``, etc. via ``**dump``.
_ALLOWED_ACTION_CREATE_FIELDS = frozenset({
"provider_id", "name", "icon", "action_type", "config",
"schedule_type", "schedule_interval", "schedule_cron", "enabled",
})
_ALLOWED_ACTION_UPDATE_FIELDS = frozenset({
"name", "icon", "config",
"schedule_type", "schedule_interval", "schedule_cron", "enabled",
})
# 5 fields = standard cron; 6 fields = leading seconds column; 7 fields =
# Quartz-style with seconds and year. Forms with a seconds column can fire
# more often than once per minute and are handled in ``_validate_cron``. The
# patterns below additionally reject a leading ``*/0`` step, which is never
# a valid cadence.
_DISALLOWED_CRON_PATTERNS = (
re.compile(r"^\s*\*/0\s+"),  # */0 in the first field only
)
def _validate_cron(expr: str | None) -> None:
"""Reject schedule_cron strings that fire more often than once per minute.
Without croniter as a hard dependency we apply a conservative structural
check: a standard 5-field cron starts with the minute column and cannot
fire more than once per minute, so a sub-minute cadence needs a 6+ field
expression with a leading seconds column. Such expressions are accepted
only when the seconds field is the literal ``0``. Anything matching
``_DISALLOWED_CRON_PATTERNS`` is rejected as well; empty values pass.
"""
if not expr or not expr.strip():
return
parts = expr.split()
if len(parts) >= 6:
# Seconds field present (Quartz-style or 6-field). Forbid
# second-level fires entirely; minute-cadence is the floor.
seconds_field = parts[0]
if seconds_field != "0":
raise HTTPException(
status_code=400,
detail=(
"schedule_cron with a sub-minute cadence is not allowed; "
"set the seconds field to 0 or use a standard 5-field cron"
),
)
for pattern in _DISALLOWED_CRON_PATTERNS:
if pattern.search(expr):
raise HTTPException(
status_code=400,
detail="schedule_cron contains a disallowed pattern",
)
async def _action_response(session: AsyncSession, action: Action) -> dict:
"""Build response dict with rules inlined."""
result = await session.exec(
@@ -127,7 +180,15 @@ async def create_action(
detail=f"Invalid action type '{body.action_type}' for provider type '{provider.type}'",
)
action = Action(user_id=user.id, **body.model_dump())
_validate_cron(body.schedule_cron)
# Project only allowlisted fields so a tampered body can't write
# ``user_id``, ``id``, ``last_run_at``, etc. via ``**dump``.
payload = {
k: v for k, v in body.model_dump().items()
if k in _ALLOWED_ACTION_CREATE_FIELDS
}
action = Action(user_id=user.id, **payload)
session.add(action)
await session.commit()
await session.refresh(action)
@@ -168,7 +229,13 @@ async def update_action(
raise HTTPException(status_code=404, detail="Action not found")
updates = body.model_dump(exclude_unset=True)
if "schedule_cron" in updates:
_validate_cron(updates["schedule_cron"] or "")
# Drop any field outside the update allowlist so a tampered request
# can't mutate ``user_id`` / ``provider_id`` / ``action_type`` etc.
for key, value in updates.items():
if key not in _ALLOWED_ACTION_UPDATE_FIELDS:
continue
setattr(action, key, value)
session.add(action)
await session.commit()
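
A quick illustration of which expressions the seconds-field rule in `_validate_cron` accepts. This is a simplified standalone sketch: `is_sub_minute` is a hypothetical helper that returns a boolean instead of raising `HTTPException`, and it omits the `_DISALLOWED_CRON_PATTERNS` pass.

```python
def is_sub_minute(expr: str) -> bool:
    """Return True when the expression would be rejected as sub-minute."""
    parts = expr.split()
    # With six or more fields the first column is seconds; only a literal
    # "0" keeps the schedule at minute granularity.
    return len(parts) >= 6 and parts[0] != "0"


examples = {
    "*/5 * * * *": is_sub_minute("*/5 * * * *"),        # 5-field, minute cadence
    "0 */5 * * * *": is_sub_minute("0 */5 * * * *"),    # seconds pinned to 0
    "*/10 * * * * *": is_sub_minute("*/10 * * * * *"),  # fires every 10 seconds
    "0,30 * * * * *": is_sub_minute("0,30 * * * * *"),  # fires twice a minute
}
```

The check is deliberately stricter than necessary: a seconds field such as `30` fires only once per minute but is still rejected, which keeps the rule simple enough to apply without a cron parser.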
@@ -48,6 +48,40 @@ _LOGGER = logging.getLogger(__name__)
router = APIRouter(prefix="/api/backup", tags=["backup"])
# Hard caps on uploaded backup file shape — defend against parser DoS
# (deeply nested or pathologically wide JSON) before we hand the
# structure to the import pipeline.
_MAX_BACKUP_DEPTH = 10
_MAX_BACKUP_NODES = 100_000
def _validate_backup_shape(value: object, depth: int = 0, count: list[int] | None = None) -> None:
"""Walk ``value`` and reject anything beyond the depth/node caps.
Raises HTTPException(400) on overflow. Cheap O(n) walk; runs once
per upload.
"""
if count is None:
count = [0]
if depth > _MAX_BACKUP_DEPTH:
raise HTTPException(
status_code=400,
detail=f"Backup file too deeply nested (max depth {_MAX_BACKUP_DEPTH})",
)
count[0] += 1
if count[0] > _MAX_BACKUP_NODES:
raise HTTPException(
status_code=400,
detail=f"Backup file has too many nodes (max {_MAX_BACKUP_NODES})",
)
if isinstance(value, dict):
for v in value.values():
_validate_backup_shape(v, depth + 1, count)
elif isinstance(value, list):
for v in value:
_validate_backup_shape(v, depth + 1, count)
MAX_UPLOAD_SIZE = 10 * 1024 * 1024 # 10 MB
@@ -181,6 +215,8 @@ async def validate_config(
except json.JSONDecodeError as e:
raise HTTPException(status_code=400, detail=f"Invalid JSON: {e}")
_validate_backup_shape(raw)
result = validate_backup(raw)
return result.model_dump()
@@ -204,6 +240,8 @@ async def import_config(
except json.JSONDecodeError as e:
raise HTTPException(status_code=400, detail=f"Invalid JSON: {e}")
_validate_backup_shape(raw)
# Validate first
validation = validate_backup(raw)
if not validation.valid:
@@ -259,6 +297,8 @@ async def prepare_restore(
except json.JSONDecodeError as e:
raise HTTPException(status_code=400, detail=f"Invalid JSON: {e}")
_validate_backup_shape(raw)
validation = validate_backup(raw)
if not validation.valid:
raise HTTPException(
@@ -504,11 +504,14 @@ async def delete_config(
if config.user_id == 0 and user.role != "admin":
raise HTTPException(status_code=403, detail="Cannot delete system default configs")
raise_if_used(await check_command_template_config(session, config.id), config.name)
slot_result = await session.exec(
select(CommandTemplateSlot).where(CommandTemplateSlot.config_id == config.id)
# Bulk delete slot rows so the round-trip count stays O(1) regardless
# of how many locale/slot combinations the config carries.
from sqlalchemy import delete as sa_delete
await session.execute(
sa_delete(CommandTemplateSlot).where(
CommandTemplateSlot.config_id == config.id
)
)
for slot in slot_result.all():
await session.delete(slot)
await session.delete(config)
await session.commit()
@@ -162,17 +162,26 @@ async def delete_command_tracker(
from ..services.command_sync import mark_dirty_for_tracker
await mark_dirty_for_tracker(tracker.id)
# Delete associated listeners, collecting bot IDs for polling cleanup
# First read the listeners we're about to delete so we can collect the
# set of telegram_bot IDs whose polling state may need to be re-checked.
# Then issue a single bulk DELETE instead of N per-row deletes.
from sqlalchemy import delete as sa_delete
result = await session.exec(
select(CommandTrackerListener).where(
CommandTrackerListener.command_tracker_id == tracker_id
)
)
bot_ids_to_check: set[int] = set()
for listener in result.all():
if listener.listener_type == "telegram_bot":
bot_ids_to_check.add(listener.listener_id)
await session.delete(listener)
bot_ids_to_check: set[int] = {
listener.listener_id
for listener in result.all()
if listener.listener_type == "telegram_bot"
}
await session.execute(
sa_delete(CommandTrackerListener).where(
CommandTrackerListener.command_tracker_id == tracker_id
)
)
await session.delete(tracker)
await session.commit()
@@ -0,0 +1,161 @@
"""Prometheus metrics endpoint and central registry.
Exposes operational metrics via ``GET /api/metrics`` in the standard
Prometheus text format. Unauthenticated by design: scrapers are assumed to
run without credentials inside the trust boundary. If the API port crosses
a trust boundary, disable via ``NOTIFY_BRIDGE_METRICS_ENABLED=false``.
Metrics are defined as module-level singletons so the rest of the codebase
can ``from notify_bridge_server.api.metrics import metrics`` and call
``metrics.dispatch_duration.labels(channel="telegram").observe(0.42)``
without re-creating the underlying objects.
Other modules MUST NOT ``import prometheus_client`` directly. Route every
metric through :data:`metrics` (a :class:`MetricsRegistry`) so we have one
place to swap implementations or add labels.
"""
from __future__ import annotations
import logging
from typing import Final
from fastapi import APIRouter, HTTPException
from starlette.responses import Response
from prometheus_client import (
CONTENT_TYPE_LATEST,
CollectorRegistry,
Counter,
Gauge,
Histogram,
generate_latest,
)
from ..config import settings as _settings
_LOGGER = logging.getLogger(__name__)
# ---------------------------------------------------------------------------
# Metric definitions
# ---------------------------------------------------------------------------
# Use a dedicated CollectorRegistry instead of the global default registry so
# tests can construct the module repeatedly without ``Duplicated timeseries``
# errors and so we never accidentally export Python GC / process metrics that
# aren't part of the documented surface in OPERATIONS.md.
_REGISTRY: Final[CollectorRegistry] = CollectorRegistry()
class MetricsRegistry:
"""Singleton holder for module-level Prometheus collectors.
Instantiated once at import time as :data:`metrics`. Keep collectors as
instance attributes so call sites get IDE autocomplete and so swapping
the collector type (e.g. Counter -> Summary) is a one-line change here.
"""
def __init__(self, registry: CollectorRegistry) -> None:
self.registry = registry
# Gauge: populated on every scrape via the collector hook below.
self.deferred_pending = Gauge(
"notify_bridge_deferred_pending",
"Count of deferred_dispatch rows awaiting drain.",
registry=registry,
)
# Counter: incremented after each event_log row is persisted.
self.event_log_total = Counter(
"notify_bridge_event_log_total",
"Total events written to event_log, partitioned by status and event_type.",
["status", "event_type"],
registry=registry,
)
# Histogram: observed wall-clock seconds per outbound dispatch attempt.
self.dispatch_duration = Histogram(
"notify_bridge_dispatch_duration_seconds",
"Wall-clock duration of one dispatch attempt to a notification channel.",
["channel"],
registry=registry,
buckets=(0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0, 30.0),
)
# Counter: incremented by 1 each time a polling provider fails a tick.
self.provider_poll_failures = Counter(
"notify_bridge_provider_poll_failures_total",
"Polling provider failures partitioned by provider type.",
["provider_type"],
registry=registry,
)
# Counter: incremented by 1 for each rejected delivery to a target.
self.target_send_failures = Counter(
"notify_bridge_target_send_failures_total",
"Failed sends to a target partitioned by target type and HTTP status.",
["target_type", "status_code"],
registry=registry,
)
metrics: Final[MetricsRegistry] = MetricsRegistry(_REGISTRY)
# ---------------------------------------------------------------------------
# Scrape hook: refresh dynamic gauges on demand
# ---------------------------------------------------------------------------
async def _refresh_deferred_pending_gauge() -> None:
"""Populate ``deferred_pending`` by counting pending rows in the DB.
Called from the request handler before serializing — we don't poll the
DB on a fixed cadence to avoid a steady-state cost when nothing is
scraping. Kept tolerant: a DB error logs and leaves the previous value.
"""
try:
from sqlalchemy import text
from ..database.engine import get_engine
engine = get_engine()
async with engine.connect() as conn:
result = await conn.execute(
text("SELECT count(*) FROM deferred_dispatch WHERE status='pending'")
)
row = result.first()
count = int(row[0]) if row else 0
metrics.deferred_pending.set(count)
except Exception as exc: # noqa: BLE001 — never fail the scrape over this
_LOGGER.debug("deferred_pending refresh skipped: %s", exc)
# ---------------------------------------------------------------------------
# Router
# ---------------------------------------------------------------------------
router = APIRouter(tags=["metrics"])
@router.get("/api/metrics")
async def metrics_endpoint() -> Response:
"""Expose collected metrics in Prometheus text format.
No auth by design: scrapers are assumed to run without credentials.
Disable the endpoint via ``NOTIFY_BRIDGE_METRICS_ENABLED=false`` when the
API port is reachable from outside the trust boundary.
"""
if not _settings.metrics_enabled:
raise HTTPException(status_code=404, detail="Metrics disabled")
await _refresh_deferred_pending_gauge()
# Stub increment so the endpoint reports non-empty data even before
# callers wire instrumentation; ``inc(0)`` only creates the labelled series
# without changing its value. Remove once code paths are instrumented.
# The labels intentionally use a sentinel value so dashboards can filter
# the noise out: ``status="bootstrap"``.
metrics.event_log_total.labels(status="bootstrap", event_type="metrics_scrape").inc(0)
payload = generate_latest(_REGISTRY)
return Response(content=payload, media_type=CONTENT_TYPE_LATEST)
@@ -152,6 +152,10 @@ async def create_notification_tracker(
session.add(tracker)
await session.commit()
await session.refresh(tracker)
# Drop the cached enabled-trackers list so the next inbound event
# (HA / webhook) sees the new tracker without waiting out the TTL.
from ..services.event_dispatch import invalidate_tracker_cache
invalidate_tracker_cache(tracker.provider_id)
if tracker.enabled:
await schedule_tracker(
tracker.id, tracker.scan_interval,
@@ -184,6 +188,8 @@ async def update_notification_tracker(
session.add(tracker)
await session.commit()
await session.refresh(tracker)
from ..services.event_dispatch import invalidate_tracker_cache
invalidate_tracker_cache(tracker.provider_id)
if tracker.enabled:
await schedule_tracker(
tracker.id, tracker.scan_interval,
@@ -201,28 +207,39 @@ async def delete_notification_tracker(
user: User = Depends(get_current_user),
session: AsyncSession = Depends(get_session),
):
"""Delete a tracker and its child rows in three bulk statements.
The previous implementation issued one DELETE per child row plus one
UPDATE per event_log row, which scaled linearly with the tracker's
history (an old, busy tracker could hit thousands of round-trips).
Bulk DELETE/UPDATE collapses that to three SQL statements regardless
of size.
"""
from sqlalchemy import delete as sa_delete, update as sa_update
tracker = await _get_user_tracker(session, tracker_id, user.id)
# Delete associated tracker-target links
result = await session.exec(
select(NotificationTrackerTarget).where(NotificationTrackerTarget.tracker_id == tracker_id)
# Junction rows — direct dependents of the tracker.
await session.execute(
sa_delete(NotificationTrackerTarget).where(
NotificationTrackerTarget.tracker_id == tracker_id
)
)
for tt in result.all():
await session.delete(tt)
# Delete associated tracker state
state_result = await session.exec(
select(NotificationTrackerState).where(NotificationTrackerState.tracker_id == tracker_id)
# Persisted scan state for this tracker.
await session.execute(
sa_delete(NotificationTrackerState).where(
NotificationTrackerState.tracker_id == tracker_id
)
)
for ts in state_result.all():
await session.delete(ts)
# Nullify event log references
event_result = await session.exec(
select(EventLog).where(EventLog.tracker_id == tracker_id)
# Preserve the audit trail in event_log; just null the back-reference
# so the tracker row can be removed without an FK violation.
await session.execute(
sa_update(EventLog).where(EventLog.tracker_id == tracker_id).values(tracker_id=None)
)
for el in event_result.all():
el.tracker_id = None
session.add(el)
provider_id_for_cache = tracker.provider_id
await session.delete(tracker)
await session.commit()
from ..services.event_dispatch import invalidate_tracker_cache
invalidate_tracker_cache(provider_id_for_cache)
await unschedule_tracker(tracker_id)
await reschedule_immich_dispatch_jobs()
@@ -1,9 +1,10 @@
"""Service provider management API routes."""
import logging
import secrets
from fastapi import APIRouter, Depends, HTTPException, status
from pydantic import AnyHttpUrl, BaseModel, ValidationError, field_validator
from pydantic import AnyHttpUrl, BaseModel, ValidationError, field_validator, model_validator
from sqlmodel import select
from sqlmodel.ext.asyncio.session import AsyncSession
from typing import Any
@@ -94,14 +95,36 @@ class PayloadMapping(BaseModel):
class WebhookProviderConfig(BaseModel):
auth_mode: str = "none"
# Default to bearer to avoid silently creating an open relay. Operators
# who genuinely want an unauthenticated endpoint must set
# ``acknowledge_unauthenticated=True`` to opt in explicitly.
auth_mode: str = "bearer_token"
webhook_secret: str | None = None
# Explicit opt-in required for ``auth_mode="none"``. Without this flag
# an unauthenticated webhook is rejected at validation time so a
# mis-clicked dropdown can't expose the bridge to arbitrary internet
# traffic.
acknowledge_unauthenticated: bool = False
payload_mappings: list[PayloadMapping] = []
event_type_path: str | None = None
collection_path: str | None = None
store_payloads: bool = True
max_stored_payloads: int = 20 # 1-100
@model_validator(mode="after")
def _check_auth(self) -> "WebhookProviderConfig":
if self.auth_mode == "none" and not self.acknowledge_unauthenticated:
raise ValueError(
"auth_mode='none' creates an open webhook endpoint; set "
"acknowledge_unauthenticated=true to confirm this is intentional"
)
if self.auth_mode in ("bearer_token", "hmac_sha256") and not self.webhook_secret:
# Auto-generate a strong secret if the operator forgot to supply
# one — better than rejecting an otherwise-valid config and far
# better than silently leaving the endpoint open.
self.webhook_secret = secrets.token_urlsafe(32)
return self
class HomeAssistantProviderConfig(BaseModel):
url: str
@@ -291,15 +291,19 @@ async def get_nav_counts(
):
"""Return entity counts for sidebar navigation badges.
Note: queries run sequentially because SQLAlchemy AsyncSession is NOT safe
for concurrent use within a single session (no asyncio.gather). We
minimise round-trips by combining user + system counts and per-type
target counts into single aggregate queries where possible.
Combines user-owned counts, system-owned shared counts, and per-type
target counts into a single round-trip via a UNION ALL of label + count
rows. An AsyncSession is not safe for concurrent use, so the queries
cannot be fanned out with asyncio.gather; collapsing 16 SELECTs into one
is the optimisation instead.
"""
from sqlalchemy import literal, union_all
counts: dict[str, int] = {}
# --- 1) User-owned entity counts (one query per model) ---
for model, key in [
user_id = user.id
# User-owned counts: one (label, count) per model.
user_models = [
(ServiceProvider, "providers"),
(NotificationTracker, "notification_trackers"),
(TrackingConfig, "tracking_configs"),
@@ -311,40 +315,52 @@ async def get_nav_counts(
(CommandTracker, "command_trackers"),
(CommandConfig, "command_configs"),
(CommandTemplateConfig, "command_template_configs"),
]:
count = (await session.exec(
select(func.count()).select_from(model).where(model.user_id == user.id)
)).one()
counts[key] = count
# --- 2) Add system-owned counts (user_id=0) for shared entities ---
for model, key in [
]
# System-owned shared counts (user_id=0) folded back into the same key.
system_models = [
(TemplateConfig, "template_configs"),
(CommandTemplateConfig, "command_template_configs"),
(TrackingConfig, "tracking_configs"),
(CommandConfig, "command_configs"),
]:
system_count = (await session.exec(
select(func.count()).select_from(model).where(model.user_id == 0)
)).one()
counts[key] += system_count
# --- 3) Per-type target counts in a single query using conditional aggregation ---
]
target_types = ("telegram", "webhook", "email", "discord", "slack", "ntfy", "matrix")
type_counts_result = (await session.exec(
select(
NotificationTarget.type,
func.count(),
# Initialise counts to 0 so missing UNION rows surface as zeroes
# instead of KeyErrors when a category has no rows.
for _model, key in user_models:
counts[key] = 0
for ttype in target_types:
counts[f"targets_{ttype}"] = 0
queries = []
for model, key in user_models:
queries.append(
select(literal(key).label("k"), func.count().label("c"))
.select_from(model).where(model.user_id == user_id)
)
.where(
NotificationTarget.user_id == user.id,
NotificationTarget.type.in_(target_types),
for model, key in system_models:
queries.append(
select(literal(f"__sys__:{key}").label("k"), func.count().label("c"))
.select_from(model).where(model.user_id == 0)
)
.group_by(NotificationTarget.type)
)).all()
type_counts_map = dict(type_counts_result)
for target_type in target_types:
counts[f"targets_{target_type}"] = type_counts_map.get(target_type, 0)
for ttype in target_types:
queries.append(
select(literal(f"target:{ttype}").label("k"), func.count().label("c"))
.select_from(NotificationTarget).where(
NotificationTarget.user_id == user_id,
NotificationTarget.type == ttype,
)
)
union_q = union_all(*queries)
rows = (await session.execute(union_q)).all()
for label, value in rows:
if label.startswith("__sys__:"):
counts[label.removeprefix("__sys__:")] += int(value or 0)
elif label.startswith("target:"):
counts[f"targets_{label.removeprefix('target:')}"] = int(value or 0)
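else:
# Accumulate rather than assign: UNION ALL row order is not
# guaranteed, so a ``__sys__`` row may already have been added.
counts[label] += int(value or 0)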
return counts
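
The label-plus-count UNION ALL used for the nav counts can be tried out in isolation with the standard-library `sqlite3` module. The two tables and their contents below are invented for the sketch, and raw SQL stands in for the SQLAlchemy `union_all` construct.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE provider (id INTEGER PRIMARY KEY, user_id INTEGER);
    CREATE TABLE target (id INTEGER PRIMARY KEY, user_id INTEGER, type TEXT);
    INSERT INTO provider (user_id) VALUES (1), (1), (2);
    INSERT INTO target (user_id, type) VALUES
        (1, 'telegram'), (1, 'telegram'), (1, 'email'), (2, 'email');
    """
)


def nav_counts(db: sqlite3.Connection, user_id: int) -> dict[str, int]:
    # Each branch yields exactly one (label, count) row, so the whole set of
    # badges costs a single round-trip.
    sql = """
        SELECT 'providers' AS k, count(*) AS c
          FROM provider WHERE user_id = :uid
        UNION ALL
        SELECT 'target:telegram', count(*)
          FROM target WHERE user_id = :uid AND type = 'telegram'
        UNION ALL
        SELECT 'target:email', count(*)
          FROM target WHERE user_id = :uid AND type = 'email'
    """
    return dict(db.execute(sql, {"uid": user_id}).fetchall())


counts = nav_counts(conn, 1)
```

Because every branch is an aggregate without GROUP BY, a category with no rows still returns a row with count 0; keying the result by label also means the code does not depend on the order in which the branches come back.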
@@ -287,6 +287,8 @@ async def get_template_variables(
**_nut_variables(),
# --- Home Assistant slots ---
**_home_assistant_variables(),
# --- Bridge self-monitoring slots ---
**_bridge_self_variables(),
# --- Scheduler slots ---
"message_scheduled_message": {
"description": "Notification for scheduled message events",
@@ -487,6 +489,32 @@ def _home_assistant_variables() -> dict:
}
def _bridge_self_variables() -> dict:
common = {
"failure_type": "Which condition fired (poll_failures, deferred_backlog, target_failures)",
"subject_id": "Affected entity ID (tracker_id, target_id, or 0 for backlog)",
"subject_name": "Human-readable name of the affected entity",
"count": "Consecutive failure count or current backlog size",
"threshold": "Configured threshold that was crossed",
"last_error": "Last underlying error message (truncated)",
"details": "Extra structured context dict (use {{ details | tojson }})",
}
return {
"message_bridge_self_poll_failures": {
"description": "Tracker poll failures crossed threshold",
"variables": common,
},
"message_bridge_self_deferred_backlog": {
"description": "Deferred dispatch backlog crossed threshold",
"variables": common,
},
"message_bridge_self_target_failures": {
"description": "Target send failures crossed threshold",
"variables": common,
},
}
@router.post("", status_code=status.HTTP_201_CREATED)
async def create_config(
body: TemplateConfigCreate,
@@ -64,9 +64,19 @@ async def create_user(
admin: User = Depends(require_admin),
session: AsyncSession = Depends(get_session),
):
"""Create a new user (admin only)."""
"""Create a new user (admin only).
Username is normalised to ``strip().lower()`` so "Admin" and "admin"
cannot coexist. We do not add a CHECK constraint at the DB level — that
would require rebuilding the table on SQLite — so the application is
the single source of truth for normalisation.
"""
# Normalise so case-only variants collide with existing accounts.
username = (body.username or "").strip().lower()
if not username:
raise HTTPException(status_code=400, detail="Username cannot be empty")
# Check for duplicate username
result = await session.exec(select(User).where(User.username == body.username))
result = await session.exec(select(User).where(User.username == username))
if result.first():
raise HTTPException(status_code=409, detail="Username already exists")
@@ -74,13 +84,25 @@ async def create_user(
raise HTTPException(status_code=400, detail="Password must be at least 8 characters")
user = User(
username=body.username,
username=username,
hashed_password=await _hash_password(body.password),
role=body.role if body.role in ("admin", "user") else "user",
)
session.add(user)
await session.commit()
await session.refresh(user)
# Auto-create the bridge_self provider so the new user immediately gets
# internal-failure notifications without manual setup. Best-effort —
# a seeding hiccup must not fail the user creation itself.
try:
from ..database.seeds import ensure_bridge_self_provider_for_user
await ensure_bridge_self_provider_for_user(session, user.id)
await session.commit()
except Exception: # noqa: BLE001
_LOGGER.exception("Failed to auto-seed bridge_self provider for user %s", user.id)
await session.rollback()
return {"id": user.id, "username": user.username, "role": user.role}
@@ -103,14 +125,19 @@ async def update_user(
identity_changed = False
if body.username is not None and body.username != user.username:
new_username = body.username.strip()
# Normalise to match the case-insensitive uniqueness rule applied
# at user creation. Comparing the normalised form against the
# stored username also avoids false-positive "no change" when a
# legacy mixed-case account is being renamed to its lower form.
new_username = (body.username or "").strip().lower()
if not new_username:
raise HTTPException(status_code=400, detail="Username cannot be empty")
dup = await session.exec(select(User).where(User.username == new_username))
if dup.first():
raise HTTPException(status_code=409, detail="Username already exists")
user.username = new_username
identity_changed = True
if new_username != user.username:
dup = await session.exec(select(User).where(User.username == new_username))
if dup.first():
raise HTTPException(status_code=409, detail="Username already exists")
user.username = new_username
identity_changed = True
if body.role is not None and body.role != user.role:
if body.role not in ("admin", "user"):
@@ -191,11 +218,139 @@ async def delete_user(
admin: User = Depends(require_admin),
session: AsyncSession = Depends(get_session),
):
"""Delete a user (admin only, cannot delete self)."""
"""Delete a user (admin only, cannot delete self).
Cascades through every user-owned table by hand. The model declares
``ondelete=CASCADE`` on each FK, but SQLite only enforces FK actions
on tables created *after* the ondelete clause was added — existing
installs upgraded from older schemas need this Python-side cascade
instead of a multi-step table rebuild.
TODO: drop this manual cascade once we ship a real
rebuild-with-FK-actions migration for legacy SQLite installs (or
once Postgres becomes the default deployment target).
"""
from sqlalchemy import delete as sa_delete, update as sa_update
if user_id == admin.id:
raise HTTPException(status_code=400, detail="Cannot delete yourself")
user = await session.get(User, user_id)
if not user:
raise HTTPException(status_code=404, detail="User not found")
await session.delete(user)
await session.commit()
# Lazy import to avoid circulars.
from ..database.models import (
Action,
ActionExecution,
ActionRule,
CommandConfig,
CommandTracker,
CommandTrackerListener,
DeferredDispatch,
EventLog,
NotificationTarget,
NotificationTracker,
NotificationTrackerState,
NotificationTrackerTarget,
ServiceProvider,
TelegramBot,
TelegramChat,
TrackingConfig,
EmailBot,
MatrixBot,
)
# Wrap the entire cascade in one transaction so a failure mid-way
# cannot leave dangling child rows pointing at a missing user.
try:
# Order: leaves first, then their parents, finally the user. This
# matters even with FKs disabled — it's the natural dependency
# graph and avoids accidental constraint trips on engines that do
# enforce FKs (Postgres).
# Resolve tracker ids first (needed for state + link cleanup
# before the parent rows themselves are deleted further down).
from sqlmodel import select as _select
tracker_ids = list((await session.exec(
_select(NotificationTracker.id).where(NotificationTracker.user_id == user_id)
)).all())
if tracker_ids:
await session.execute(
sa_delete(NotificationTrackerState).where(
NotificationTrackerState.tracker_id.in_(tracker_ids)
)
)
await session.execute(
sa_delete(NotificationTrackerTarget).where(
NotificationTrackerTarget.tracker_id.in_(tracker_ids)
)
)
await session.execute(
sa_delete(DeferredDispatch).where(
DeferredDispatch.tracker_id.in_(tracker_ids)
)
)
# Action children: rules and execution log.
action_ids = list((await session.exec(
_select(Action.id).where(Action.user_id == user_id)
)).all())
if action_ids:
await session.execute(
sa_delete(ActionRule).where(ActionRule.action_id.in_(action_ids))
)
await session.execute(
sa_delete(ActionExecution).where(
ActionExecution.action_id.in_(action_ids)
)
)
# Command tracker children: listeners.
cmd_tracker_ids = list((await session.exec(
_select(CommandTracker.id).where(CommandTracker.user_id == user_id)
)).all())
if cmd_tracker_ids:
await session.execute(
sa_delete(CommandTrackerListener).where(
CommandTrackerListener.command_tracker_id.in_(cmd_tracker_ids)
)
)
# Telegram bot children: chats.
bot_ids = list((await session.exec(
_select(TelegramBot.id).where(TelegramBot.user_id == user_id)
)).all())
if bot_ids:
await session.execute(
sa_delete(TelegramChat).where(TelegramChat.bot_id.in_(bot_ids))
)
# Owned top-level entities (user is a direct owner).
for model in (
NotificationTracker,
NotificationTarget,
CommandTracker,
CommandConfig,
TrackingConfig,
Action,
TelegramBot,
EmailBot,
MatrixBot,
ServiceProvider,
):
await session.execute(
sa_delete(model).where(model.user_id == user_id)
)
# EventLog: keep the audit trail but null the owner reference so
# the rows survive the user delete (matches the SET NULL semantic
# declared on the model).
await session.execute(
sa_update(EventLog).where(EventLog.user_id == user_id).values(user_id=None)
)
await session.delete(user)
await session.commit()
except Exception:
await session.rollback()
raise
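The leaves-first ordering and single-transaction wrapping in this hunk can be shown with a small standalone sketch. It uses the standard-library `sqlite3` module rather than SQLModel, and the schema (`app_user`, `tracker`, `tracker_state`, `event_log`) is a hypothetical reduction of the real models, chosen only to mirror the three kinds of rows handled above: grandchildren, directly owned rows, and audit rows that are kept with a nulled owner.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Build an in-memory DB with two users; the schema is illustrative only."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE app_user (id INTEGER PRIMARY KEY);
        CREATE TABLE tracker (id INTEGER PRIMARY KEY, user_id INTEGER);
        CREATE TABLE tracker_state (id INTEGER PRIMARY KEY, tracker_id INTEGER);
        CREATE TABLE event_log (id INTEGER PRIMARY KEY, user_id INTEGER);
        INSERT INTO app_user (id) VALUES (1), (2);
        INSERT INTO tracker (id, user_id) VALUES (10, 1), (20, 2);
        INSERT INTO tracker_state (id, tracker_id) VALUES (100, 10), (200, 20);
        INSERT INTO event_log (id, user_id) VALUES (1, 1);
        """
    )
    conn.commit()
    return conn


def delete_user_cascade(conn: sqlite3.Connection, user_id: int) -> None:
    """Delete a user and everything they own, leaves first, in one transaction."""
    try:
        # Resolve child ids before the parent rows disappear.
        tracker_ids = [
            row[0]
            for row in conn.execute(
                "SELECT id FROM tracker WHERE user_id = ?", (user_id,)
            )
        ]
        if tracker_ids:
            marks = ",".join("?" * len(tracker_ids))
            conn.execute(
                f"DELETE FROM tracker_state WHERE tracker_id IN ({marks})",
                tracker_ids,
            )
        # Directly owned rows next.
        conn.execute("DELETE FROM tracker WHERE user_id = ?", (user_id,))
        # Audit rows survive; only the owner reference is cleared.
        conn.execute(
            "UPDATE event_log SET user_id = NULL WHERE user_id = ?", (user_id,)
        )
        # The user row goes last.
        conn.execute("DELETE FROM app_user WHERE id = ?", (user_id,))
        conn.commit()
    except Exception:
        conn.rollback()
        raise
```

Because nothing is committed until the final step, a failure part-way through rolls back every earlier delete, which is the property the comment in the hunk relies on.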
@@ -12,6 +12,8 @@ from fastapi import APIRouter, HTTPException, Request
from sqlmodel import select
from sqlmodel.ext.asyncio.session import AsyncSession
from ..auth.routes import limiter
from notify_bridge_core.models.events import ServiceEvent
from notify_bridge_core.providers.gitea.event_parser import parse_webhook as parse_gitea_webhook
from notify_bridge_core.providers.planka.event_parser import parse_webhook as parse_planka_webhook
@@ -240,6 +242,10 @@ async def planka_webhook(token: str, request: Request):
if not _verify_planka_token(webhook_secret, request):
raise HTTPException(status_code=403, detail="Invalid token")
# Read body AFTER auth check so an attacker without the bearer token
# can't force an unbounded read. Token is in the header, not the body.
raw_body = await _read_bounded_body(request)
# Parse payload from the bounded raw_body we already read.
try:
payload = json.loads(raw_body.decode("utf-8"))
@@ -320,6 +326,8 @@ def _verify_generic_webhook_auth(
_SENSITIVE_HEADER_SUBSTR = (
"token", "auth", "key", "secret", "signature", "password", "credential",
"cookie", "x-api", "x-hub-signature",
# Extended for per-key body redaction; harmless extras for header check.
"oauth", "client_secret", "webhook_secret", "csrf",
)
@@ -328,6 +336,28 @@ def _is_sensitive_header(name: str) -> bool:
return any(s in n for s in _SENSITIVE_HEADER_SUBSTR)
_REDACTED_PLACEHOLDER = "[REDACTED]"
def _redact_sensitive_body(value: object) -> object:
"""Walk a parsed JSON body and redact values for sensitive-named keys.
Returns a defensively-copied structure so the caller's object is
never mutated (callers downstream still consume the original).
"""
if isinstance(value, dict):
cleaned: dict[str, object] = {}
for k, v in value.items():
if isinstance(k, str) and _is_sensitive_header(k):
cleaned[k] = _REDACTED_PLACEHOLDER
else:
cleaned[k] = _redact_sensitive_body(v)
return cleaned
if isinstance(value, list):
return [_redact_sensitive_body(v) for v in value]
return value
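A short usage sketch of the recursive redaction above. The tuple is a shortened copy of `_SENSITIVE_HEADER_SUBSTR`, and `is_sensitive` is a stand-in that assumes `_is_sensitive_header` lowercases the name before the substring check (the diff shows only its final line); `redact` and the sample payload are hypothetical.

```python
SENSITIVE_SUBSTR = ("token", "auth", "key", "secret", "password")
REDACTED = "[REDACTED]"


def is_sensitive(name: str) -> bool:
    # Assumption: case-insensitive substring match, as the diff suggests.
    n = name.lower()
    return any(s in n for s in SENSITIVE_SUBSTR)


def redact(value: object) -> object:
    """Return a copy of a parsed JSON value with sensitive-named keys masked."""
    if isinstance(value, dict):
        return {
            k: REDACTED if isinstance(k, str) and is_sensitive(k) else redact(v)
            for k, v in value.items()
        }
    if isinstance(value, list):
        return [redact(v) for v in value]
    return value


payload = {
    "event": "push",
    "access_token": "abc",
    "repo": {"name": "x", "deploy_key": {"id": 1}},
    "items": [{"password": "p"}, 3],
}
cleaned = redact(payload)
```

Two behaviours are worth noting. A sensitive key masks its whole value, so the nested object under `deploy_key` is replaced rather than walked. And because matching is by substring, a key such as `author` also matches `auth` and would be masked in the stored log; whether that is acceptable depends on how the payload log is used.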
def _filter_headers(raw_headers: dict[str, str]) -> dict[str, str]:
"""Keep only safe headers for logging (strip Authorization, signatures, tokens).
@@ -358,11 +388,15 @@ async def _save_webhook_log(
"""Insert a webhook payload log entry and prune old ones."""
try:
body_json = body if isinstance(body, dict) else {}
# Strip sensitive values before persistence: webhook payloads
# routinely include OAuth tokens / secrets in the body, and although
# the log is admin-readable, the operator has no need to see them.
safe_body = _redact_sensitive_body(body_json) if body_json else {}
session.add(WebhookPayloadLog(
provider_id=provider_id,
method=method,
headers=headers,
body=body_json,
body=safe_body,
status=status,
extracted_fields=extracted_fields or {},
error_message=error_message,
@@ -386,13 +420,19 @@ async def _save_webhook_log(
_LOGGER.warning("Failed to save webhook payload log for provider %d", provider_id, exc_info=True)
try:
await session.rollback()
except Exception:
pass
except Exception: # noqa: BLE001
_LOGGER.exception("Rollback after payload-log save failed")
@router.post("/webhook/{token}")
@limiter.limit("60/minute")
async def generic_webhook(token: str, request: Request):
"""Receive a generic webhook, extract variables via JSONPath, and dispatch notifications."""
"""Receive a generic webhook, extract variables via JSONPath, and dispatch notifications.
Per-IP rate limit (60/min) caps blast radius from a single source —
legitimate providers send well below this; anything higher is either
a misconfigured retry loop or abuse.
"""
engine = get_engine()
# --- Load provider and validate auth ---