Files
notify-bridge/packages/server/src/notify_bridge_server/api/notification_trackers.py
T
alexei.dolgolyov 10d30fc956 feat: production readiness — security, perf, bug fixes, bridge self-monitoring
Comprehensive multi-area pass driven by a parallel 8-agent production
review. Frontend, backend, database, security, performance, operational,
plus a new self-monitoring feature.

## Critical fixes
- Planka webhook: reads bounded raw body (was NameError on every call)
- HA quiet hours: ha_state_changed/automation_triggered/service_called/
  event_fired added to deferrable set (were silently dropped)
- DNS-rebinding SSRF: PinnedResolver wired into shared aiohttp session
- Telegram inbound webhook: secret now mandatory (401 without)
- Generic webhook: auth_mode="none" requires explicit
  acknowledge_unauthenticated=true; per-IP rate limit 60/min
- svelte-check: 5 null-narrowing errors in EventDetailModal fixed
- Provider hardcoding: Immich-only block extracted to descriptor
  featureDiscoveryHint
- command_sync: snapshot+expunge bot before exiting AsyncSession

## Bug fixes
- notifier asyncio.gather(return_exceptions=True) — one bad chat no longer
  cancels peer sends
- NotificationDispatcher hoisted out of per-tracker loop
- Provider credential resolution unified across all 5 dispatch sites
- HA asyncio.shield now drains inner task on cancellation
- Provider construction switched from if/elif ladder to factory registry
- NUT first poll seeds silently (no spurious ups_on_battery)
- Quiet-hours gate: event-type-disabled now wins over deferral
- APScheduler drain job ID resolution upgraded to seconds
- HA on_status_change wired through to EventLog
- Webhook payload rollback failures now logged (not swallowed)
- Batched receivers/chats/bots in load_link_data (was per-target N+1)
- flag_modified on JSON column reassignments in deferred_dispatch

## Database
- UNIQUE indexes on service_provider.webhook_token,
  telegram_bot.webhook_path_id, partial UNIQUE on telegram_bot.bot_id,
  telegram_chat(bot_id, chat_id), notification_tracker_target unique link,
  partial UNIQUE on bridge_self provider per user
- Composite ix_event_log_user_event_type_created index
- save_chat_from_webhook switched to ON CONFLICT DO UPDATE
- ondelete=CASCADE on user-id FKs (model annotation; app-side cascade
  delete added for existing data)
- delete_notification_tracker converted from N+1 to bulk DELETE/UPDATE
- Module-level asyncio.Lock replaced with lazy _get_lock() pattern
- VACUUM INTO snapshot now PRAGMA integrity_check verified

## Performance
- Jinja2 template compilation LRU cached (lru_cache maxsize=512)
- Per-locale render cache in NotificationDispatcher (skips re-rendering
  identical content for receivers sharing a locale)
- Tracker list cached per provider_id with 5s TTL + explicit invalidation
  on tracker CRUD (relieves HA chat-bus rate query pressure)
- Nav-counts collapsed from 16 round-trips to single UNION ALL
- HA event_log: skip persisting empty assets_added/removed events

## Security hardening
- Mass-assignment guard on Action create/update; cron sub-minute reject
- Backup JSON depth/node-count cap (depth ≤ 10, nodes ≤ 100k)
- _sanitize_config extended to all JSON-typed fields on backup import
- Telegram _safe_get walks redirects manually with SSRF revalidation
- Bcrypt 72-byte password length cap with clear 422
- Webhook payload body redaction; sensitive substring set extended with
  oauth/client_secret/webhook_secret/csrf in both header filter and
  template extras filter

## Frontend
- 76 catch (err: any) sites converted to errMsg(err) helper
- globalProviderFilter: pure getter; reconciliation moved to one-time
  $effect in +layout
- Provider-filter binding: removed paired $effects + _syncingFilter flag,
  now one-way derived
- entity-cache: separate _refreshing flag for background re-fetches
- api.ts 401 handling: AuthRedirectError class + dedup _redirecting flag,
  goto() instead of window.location.href
- a11y: aria-expanded on mobile More, role=switch + aria-checked on
  Telegram bot toggles

## Tests & operations
- CI pytest gate added to .gitea/workflows/build.yml + release.yml
  (wheel-built install to dodge editable-install slowness)
- /api/ready upgraded to deep healthcheck (db SELECT 1, scheduler.running,
  HA supervisor presence) returning {ready, checks, errors, version}
- /api/metrics endpoint with prometheus_client (deferred_pending,
  event_log_total, dispatch_duration, poll_failures, send_failures)
- New OPERATIONS.md covering deploy, healthchecks, metrics, backup/restore
  procedures, log handling, common scenarios, upgrade flow
- New tests: test_bridge_self (11), test_gitea_parser (9),
  test_planka_parser (6), test_immich_change_detector (6),
  test_backup_roundtrip (1)

## New feature: bridge self-monitoring
- New bridge_self provider type — internal sink for bridge health events
- Three event types: bridge_self_poll_failures (consecutive tracker poll
  failures), bridge_self_deferred_backlog (pending count crosses
  threshold), bridge_self_target_failures (consecutive 5xx/network
  failures per target)
- Per-user thresholds (defaults: 3 / 100 / 5) configurable via the
  provider config form
- Auto-seeded on user create + /setup + boot backfill for existing users
- Anti-spam: counters reset after emission; backlog uses transition latch
- Self-loop guard: bridge_self failures don't count toward target-failure
  thresholds (logged only) — wire to your own Telegram/Email/Matrix to
  get notified when polls/dispatches/sends fail
- 6 default templates (3 events × 2 locales), tracking config columns
  with backfill migration, frontend descriptor (excluded from "create
  provider" wizard since auto-managed)

Operator-visible behavior changes (call out in release notes):
- NOTIFY_BRIDGE_TELEGRAM_WEBHOOK_SECRET now REQUIRED for webhook mode
- Existing webhook providers with auth_mode="none" need explicit opt-in
- Generic webhook endpoint rate-limited 60/min per source IP
- HA disconnect/reconnect writes ha_status_* EventLog rows
- Every user gets a bridge_self provider — wire it to a target to
  receive failure alerts

Pre-existing test failures (test_ssrf, test_release_provider) on
Python 3.13 are unrelated; CI runs on 3.12.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 02:16:49 +03:00

316 lines
11 KiB
Python

"""Notification tracker management API routes."""
import logging
from fastapi import APIRouter, Depends, HTTPException, Query, status
from pydantic import BaseModel
from sqlmodel import select
from sqlmodel.ext.asyncio.session import AsyncSession
from ..auth.dependencies import get_current_user
from ..database.engine import get_session
from ..database.models import (
EventLog,
NotificationTarget,
NotificationTracker,
NotificationTrackerState,
NotificationTrackerTarget,
ServiceProvider,
User,
)
from ..services.scheduler import (
reschedule_immich_dispatch_jobs,
schedule_tracker,
unschedule_tracker,
)
from .helpers import get_owned_entity
from .notification_tracker_targets import _tt_response
_LOGGER = logging.getLogger(__name__)
router = APIRouter(prefix="/api/notification-trackers", tags=["notification-trackers"])
class NotificationTrackerCreate(BaseModel):
provider_id: int
name: str
icon: str = ""
collection_ids: list[str] = []
scan_interval: int = 60
adaptive_max_skip: int | None = None
default_tracking_config_id: int | None = None
default_template_config_id: int | None = None
enabled: bool = True
class NotificationTrackerUpdate(BaseModel):
name: str | None = None
icon: str | None = None
collection_ids: list[str] | None = None
scan_interval: int | None = None
# int | None is ambiguous for partial updates — we can't distinguish
# "clear the field" from "don't touch". Callers send this via
# model_dump(exclude_unset=True), so an omitted key leaves the value
# alone and an explicit null clears it back to the adaptive-off default.
adaptive_max_skip: int | None = None
default_tracking_config_id: int | None = None
default_template_config_id: int | None = None
enabled: bool | None = None
@router.get("")
async def list_notification_trackers(
user: User = Depends(get_current_user),
session: AsyncSession = Depends(get_session),
):
# Batched loader: pull trackers, then all their tracker-target links in
# a single query, then the referenced targets in a single query. Avoids
# the old 1 + N + N*M pattern that ran ~60 round-trips for 10 trackers.
result = await session.exec(
select(NotificationTracker).where(NotificationTracker.user_id == user.id)
)
trackers = list(result.all())
if not trackers:
return []
tracker_ids = [t.id for t in trackers]
tt_result = await session.exec(
select(NotificationTrackerTarget).where(
NotificationTrackerTarget.tracker_id.in_(tracker_ids)
)
)
tt_rows = list(tt_result.all())
target_ids = {tt.target_id for tt in tt_rows}
targets_by_id: dict[int, NotificationTarget] = {}
if target_ids:
tgt_result = await session.exec(
select(NotificationTarget).where(NotificationTarget.id.in_(target_ids))
)
targets_by_id = {t.id: t for t in tgt_result.all()}
tts_by_tracker: dict[int, list[NotificationTrackerTarget]] = {}
for tt in tt_rows:
tts_by_tracker.setdefault(tt.tracker_id, []).append(tt)
return [
_build_tracker_response(t, tts_by_tracker.get(t.id, []), targets_by_id)
for t in trackers
]
def _build_tracker_response(
t: NotificationTracker,
tts: list[NotificationTrackerTarget],
targets_by_id: dict[int, NotificationTarget],
) -> dict:
"""In-memory assembler for a tracker + its pre-loaded links/targets."""
tracker_targets = []
for tt in tts:
target = targets_by_id.get(tt.target_id)
tracker_targets.append({
"id": tt.id,
"tracker_id": tt.tracker_id,
"target_id": tt.target_id,
"target_name": target.name if target else None,
"target_type": target.type if target else None,
"target_icon": target.icon if target else None,
"tracking_config_id": tt.tracking_config_id,
"template_config_id": tt.template_config_id,
"enabled": tt.enabled,
"quiet_hours_start": tt.quiet_hours_start,
"quiet_hours_end": tt.quiet_hours_end,
"created_at": tt.created_at.isoformat(),
})
return {
"id": t.id,
"name": t.name,
"icon": t.icon,
"provider_id": t.provider_id,
"collection_ids": t.collection_ids,
"scan_interval": t.scan_interval,
"adaptive_max_skip": t.adaptive_max_skip,
"default_tracking_config_id": t.default_tracking_config_id,
"default_template_config_id": t.default_template_config_id,
"enabled": t.enabled,
"tracker_targets": tracker_targets,
"created_at": t.created_at.isoformat(),
}
@router.post("", status_code=status.HTTP_201_CREATED)
async def create_notification_tracker(
body: NotificationTrackerCreate,
user: User = Depends(get_current_user),
session: AsyncSession = Depends(get_session),
):
provider = await session.get(ServiceProvider, body.provider_id)
if not provider or provider.user_id != user.id:
raise HTTPException(status_code=404, detail="Provider not found")
tracker = NotificationTracker(user_id=user.id, **body.model_dump())
session.add(tracker)
await session.commit()
await session.refresh(tracker)
# Drop the cached enabled-trackers list so the next inbound event
# (HA / webhook) sees the new tracker without waiting out the TTL.
from ..services.event_dispatch import invalidate_tracker_cache
invalidate_tracker_cache(tracker.provider_id)
if tracker.enabled:
await schedule_tracker(
tracker.id, tracker.scan_interval,
adaptive_max_skip=tracker.adaptive_max_skip,
)
await reschedule_immich_dispatch_jobs()
return await _tracker_response(session, tracker)
@router.get("/{tracker_id}")
async def get_notification_tracker(
tracker_id: int,
user: User = Depends(get_current_user),
session: AsyncSession = Depends(get_session),
):
tracker = await _get_user_tracker(session, tracker_id, user.id)
return await _tracker_response(session, tracker)
@router.put("/{tracker_id}")
async def update_notification_tracker(
tracker_id: int,
body: NotificationTrackerUpdate,
user: User = Depends(get_current_user),
session: AsyncSession = Depends(get_session),
):
tracker = await _get_user_tracker(session, tracker_id, user.id)
for field, value in body.model_dump(exclude_unset=True).items():
setattr(tracker, field, value)
session.add(tracker)
await session.commit()
await session.refresh(tracker)
from ..services.event_dispatch import invalidate_tracker_cache
invalidate_tracker_cache(tracker.provider_id)
if tracker.enabled:
await schedule_tracker(
tracker.id, tracker.scan_interval,
adaptive_max_skip=tracker.adaptive_max_skip,
)
else:
await unschedule_tracker(tracker.id)
await reschedule_immich_dispatch_jobs()
return await _tracker_response(session, tracker)
@router.delete("/{tracker_id}", status_code=status.HTTP_204_NO_CONTENT)
async def delete_notification_tracker(
tracker_id: int,
user: User = Depends(get_current_user),
session: AsyncSession = Depends(get_session),
):
"""Delete a tracker and its child rows in three bulk statements.
The previous implementation issued one DELETE per child row plus one
UPDATE per event_log row, which scaled linearly with the tracker's
history (an old, busy tracker could hit thousands of round-trips).
Bulk DELETE/UPDATE collapses that to three SQL statements regardless
of size.
"""
from sqlalchemy import delete as sa_delete, update as sa_update
tracker = await _get_user_tracker(session, tracker_id, user.id)
# Junction rows — direct dependents of the tracker.
await session.execute(
sa_delete(NotificationTrackerTarget).where(
NotificationTrackerTarget.tracker_id == tracker_id
)
)
# Persisted scan state for this tracker.
await session.execute(
sa_delete(NotificationTrackerState).where(
NotificationTrackerState.tracker_id == tracker_id
)
)
# Preserve the audit trail in event_log; just null the back-reference
# so the tracker row can be removed without an FK violation.
await session.execute(
sa_update(EventLog).where(EventLog.tracker_id == tracker_id).values(tracker_id=None)
)
provider_id_for_cache = tracker.provider_id
await session.delete(tracker)
await session.commit()
from ..services.event_dispatch import invalidate_tracker_cache
invalidate_tracker_cache(provider_id_for_cache)
await unschedule_tracker(tracker_id)
await reschedule_immich_dispatch_jobs()
@router.post("/{tracker_id}/trigger")
async def trigger_notification_tracker(
tracker_id: int,
user: User = Depends(get_current_user),
session: AsyncSession = Depends(get_session),
):
tracker = await _get_user_tracker(session, tracker_id, user.id)
from ..services.watcher import check_tracker
result = await check_tracker(tracker.id)
return {"triggered": True, "result": result}
@router.get("/{tracker_id}/history")
async def notification_tracker_history(
tracker_id: int,
limit: int = Query(default=20, ge=1, le=500),
user: User = Depends(get_current_user),
session: AsyncSession = Depends(get_session),
):
await _get_user_tracker(session, tracker_id, user.id)
result = await session.exec(
select(EventLog)
.where(EventLog.tracker_id == tracker_id)
.order_by(EventLog.created_at.desc())
.limit(limit)
)
return [
{
"id": e.id,
"event_type": e.event_type,
"collection_id": e.collection_id,
"collection_name": e.collection_name,
"details": e.details,
"created_at": e.created_at.isoformat() + ("Z" if not e.created_at.tzinfo else ""),
}
for e in result.all()
]
async def _tracker_response(session: AsyncSession, t: NotificationTracker) -> dict:
"""Build tracker response with nested tracker_targets."""
result = await session.exec(
select(NotificationTrackerTarget).where(NotificationTrackerTarget.tracker_id == t.id)
)
tracker_targets = [await _tt_response(session, tt) for tt in result.all()]
return {
"id": t.id,
"name": t.name,
"icon": t.icon,
"provider_id": t.provider_id,
"collection_ids": t.collection_ids,
"scan_interval": t.scan_interval,
"adaptive_max_skip": t.adaptive_max_skip,
"default_tracking_config_id": t.default_tracking_config_id,
"default_template_config_id": t.default_template_config_id,
"enabled": t.enabled,
"tracker_targets": tracker_targets,
"created_at": t.created_at.isoformat(),
}
async def _get_user_tracker(
session: AsyncSession, tracker_id: int, user_id: int
) -> NotificationTracker:
return await get_owned_entity(
session, NotificationTracker, tracker_id, user_id,
not_found_msg="Tracker not found",
)