alexei.dolgolyov 10d30fc956 feat: production readiness — security, perf, bug fixes, bridge self-monitoring
Comprehensive multi-area pass driven by a parallel 8-agent production
review. Frontend, backend, database, security, performance, operational,
plus a new self-monitoring feature.

## Critical fixes
- Planka webhook: reads bounded raw body (was NameError on every call)
- HA quiet hours: ha_state_changed/automation_triggered/service_called/
  event_fired added to deferrable set (were silently dropped)
- DNS-rebinding SSRF: PinnedResolver wired into shared aiohttp session
- Telegram inbound webhook: secret now mandatory (401 without)
- Generic webhook: auth_mode="none" requires explicit
  acknowledge_unauthenticated=true; per-IP rate limit 60/min
- svelte-check: 5 null-narrowing errors in EventDetailModal fixed
- Provider hardcoding: Immich-only block extracted to descriptor
  featureDiscoveryHint
- command_sync: snapshot+expunge bot before exiting AsyncSession

## Bug fixes
- notifier asyncio.gather(return_exceptions=True) — one bad chat no longer
  cancels peer sends
- NotificationDispatcher hoisted out of per-tracker loop
- Provider credential resolution unified across all 5 dispatch sites
- HA asyncio.shield now drains inner task on cancellation
- Provider construction switched from if/elif ladder to factory registry
- NUT first poll seeds silently (no spurious ups_on_battery)
- Quiet-hours gate: event-type-disabled now wins over deferral
- APScheduler drain job ID resolution upgraded to seconds
- HA on_status_change wired through to EventLog
- Webhook payload rollback failures now logged (not swallowed)
- Batched receivers/chats/bots in load_link_data (was per-target N+1)
- flag_modified on JSON column reassignments in deferred_dispatch

## Database
- UNIQUE indexes on service_provider.webhook_token,
  telegram_bot.webhook_path_id, partial UNIQUE on telegram_bot.bot_id,
  telegram_chat(bot_id, chat_id), notification_tracker_target unique link,
  partial UNIQUE on bridge_self provider per user
- Composite ix_event_log_user_event_type_created index
- save_chat_from_webhook switched to ON CONFLICT DO UPDATE
- ondelete=CASCADE on user-id FKs (model annotation; app-side cascade
  delete added for existing data)
- delete_notification_tracker converted from N+1 to bulk DELETE/UPDATE
- Module-level asyncio.Lock replaced with lazy _get_lock() pattern
- VACUUM INTO snapshot now PRAGMA integrity_check verified

## Performance
- Jinja2 template compilation LRU cached (lru_cache maxsize=512)
- Per-locale render cache in NotificationDispatcher (skips re-rendering
  identical content for receivers sharing a locale)
- Tracker list cached per provider_id with 5s TTL + explicit invalidation
  on tracker CRUD (relieves HA chat-bus rate query pressure)
- Nav-counts collapsed from 16 round-trips to single UNION ALL
- HA event_log: skip persisting empty assets_added/removed events

## Security hardening
- Mass-assignment guard on Action create/update; cron sub-minute reject
- Backup JSON depth/node-count cap (depth ≤ 10, nodes ≤ 100k)
- _sanitize_config extended to all JSON-typed fields on backup import
- Telegram _safe_get walks redirects manually with SSRF revalidation
- Bcrypt 72-byte password length cap with clear 422
- Webhook payload body redaction; sensitive substring set extended with
  oauth/client_secret/webhook_secret/csrf in both header filter and
  template extras filter

## Frontend
- 76 catch (err: any) sites converted to errMsg(err) helper
- globalProviderFilter: pure getter; reconciliation moved to one-time
  $effect in +layout
- Provider-filter binding: removed paired $effects + _syncingFilter flag,
  now one-way derived
- entity-cache: separate _refreshing flag for background re-fetches
- api.ts 401 handling: AuthRedirectError class + dedup _redirecting flag,
  goto() instead of window.location.href
- a11y: aria-expanded on mobile More, role=switch + aria-checked on
  Telegram bot toggles

## Tests & operations
- CI pytest gate added to .gitea/workflows/build.yml + release.yml
  (wheel-built install to dodge editable-install slowness)
- /api/ready upgraded to deep healthcheck (db SELECT 1, scheduler.running,
  HA supervisor presence) returning {ready, checks, errors, version}
- /api/metrics endpoint with prometheus_client (deferred_pending,
  event_log_total, dispatch_duration, poll_failures, send_failures)
- New OPERATIONS.md covering deploy, healthchecks, metrics, backup/restore
  procedures, log handling, common scenarios, upgrade flow
- New tests: test_bridge_self (11), test_gitea_parser (9),
  test_planka_parser (6), test_immich_change_detector (6),
  test_backup_roundtrip (1)

## New feature: bridge self-monitoring
- New bridge_self provider type — internal sink for bridge health events
- Three event types: bridge_self_poll_failures (consecutive tracker poll
  failures), bridge_self_deferred_backlog (pending count crosses
  threshold), bridge_self_target_failures (consecutive 5xx/network
  failures per target)
- Per-user thresholds (defaults: 3 / 100 / 5) configurable via the
  provider config form
- Auto-seeded on user create + /setup + boot backfill for existing users
- Anti-spam: counters reset after emission; backlog uses transition latch
- Self-loop guard: bridge_self failures don't count toward target-failure
  thresholds (logged only) — wire to your own Telegram/Email/Matrix to
  get notified when polls/dispatches/sends fail
- 6 default templates (3 events × 2 locales), tracking config columns
  with backfill migration, frontend descriptor (excluded from "create
  provider" wizard since auto-managed)

Operator-visible behavior changes (call out in release notes):
- NOTIFY_BRIDGE_TELEGRAM_WEBHOOK_SECRET now REQUIRED for webhook mode
- Existing webhook providers with auth_mode="none" need explicit opt-in
- Generic webhook endpoint rate-limited 60/min per source IP
- HA disconnect/reconnect writes ha_status_* EventLog rows
- Every user gets a bridge_self provider — wire it to a target to
  receive failure alerts

Pre-existing test failures (test_ssrf, test_release_provider) on
Python 3.13 are unrelated; CI runs on 3.12.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 02:16:49 +03:00


Operations Guide

This document covers running, monitoring, and recovering Notify Bridge in production. The intended audience is the operator on call when the notifications stop firing or when a release upgrade goes sideways.

For developer-focused docs (architecture, conventions, project layout) see CLAUDE.md and the .claude/docs/ directory.

Deployment overview

Notify Bridge ships as a single Docker image. All state lives in a single data directory mounted at /data.

Required environment variables

| Variable | Default | Notes |
| --- | --- | --- |
| NOTIFY_BRIDGE_SECRET_KEY | (none) | Required. 32+ random bytes. The server refuses to boot with the default placeholder or any of the known dev literals. |
| NOTIFY_BRIDGE_CORS_ALLOWED_ORIGINS | http://localhost:5175 | Comma-separated list. * is rejected because credentials are enabled. |
| NOTIFY_BRIDGE_FORWARDED_ALLOW_IPS | 127.0.0.1 | Trusted proxy IPs whose X-Forwarded-For / X-Forwarded-Proto headers are honored. Set to your reverse-proxy IP. |
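
The expected encoding of NOTIFY_BRIDGE_SECRET_KEY is not specified here, so the following is only a minimal sketch. It assumes the server accepts any opaque string carrying at least 32 bytes of randomness. generate_secret_key is a hypothetical helper, not part of Notify Bridge.

```python
import secrets


def generate_secret_key(num_bytes: int = 32) -> str:
    """Return a URL-safe string built from `num_bytes` of OS randomness."""
    if num_bytes < 32:
        raise ValueError("use at least 32 random bytes")
    return secrets.token_urlsafe(num_bytes)


# Print a line that can be pasted into an env file.
print(f"NOTIFY_BRIDGE_SECRET_KEY={generate_secret_key()}")
```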

Useful environment variables

| Variable | Default | Notes |
| --- | --- | --- |
| NOTIFY_BRIDGE_DATA_DIR | /data | Where the SQLite DB, snapshots, and backups live. |
| NOTIFY_BRIDGE_DATABASE_URL | (derived from data_dir) | Override only if you want a non-default DB path. |
| NOTIFY_BRIDGE_DEBUG | false | Verbose logging + SQL echo. Do not enable in production. |
| NOTIFY_BRIDGE_LOG_FORMAT | text | Set to json for one JSON object per line, suitable for piping to a log aggregator. |
| NOTIFY_BRIDGE_LOG_LEVEL | INFO | Root logger level. |
| NOTIFY_BRIDGE_LOG_LEVELS | (empty) | Per-module overrides, e.g. sqlalchemy.engine=WARNING,notify_bridge_core.notifications.telegram.client=DEBUG. |
| NOTIFY_BRIDGE_EVENT_LOG_RETENTION_DAYS | 30 | Days of event_log history kept by the daily cleanup job. 0 disables retention. |
| NOTIFY_BRIDGE_PRE_MIGRATE_SNAPSHOT_KEEP | 5 | Number of pre-migration DB snapshots retained. 0 disables snapshotting. |
| NOTIFY_BRIDGE_METRICS_ENABLED | true | Expose /api/metrics for Prometheus. Set to false if the API port crosses a trust boundary. |
| NOTIFY_BRIDGE_GRACEFUL_SHUTDOWN_SECONDS | 60 | SIGTERM grace period before in-flight requests are killed. |
| NOTIFY_BRIDGE_SUPERVISED | (auto) | Force the supervised flag for apply-restart. Use true when running under systemd/PM2 outside Docker. |
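
To make the NOTIFY_BRIDGE_LOG_LEVELS format concrete, here is a rough sketch of how a comma-separated module=LEVEL string could be parsed. parse_log_levels is a hypothetical name, and the server's real parser may treat whitespace and errors differently.

```python
import logging


def parse_log_levels(spec: str) -> dict[str, int]:
    """Parse 'module=LEVEL,module=LEVEL' into {module: numeric level}."""
    levels: dict[str, int] = {}
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate an empty value and trailing commas
        name, sep, level = item.partition("=")
        if not sep or not name.strip():
            raise ValueError(f"expected module=LEVEL, got {item!r}")
        # getLevelName returns an int for known names, a string otherwise.
        number = logging.getLevelName(level.strip().upper())
        if not isinstance(number, int):
            raise ValueError(f"unknown log level in {item!r}")
        levels[name.strip()] = number
    return levels
```

Each resulting pair could then be applied with logging.getLogger(name).setLevel(level).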

Data directory layout

/data/
  notify_bridge.db          # main SQLite DB (WAL mode)
  notify_bridge.db-wal      # SQLite write-ahead log
  notify_bridge.db-shm      # SQLite shared memory file
  backups/
    pre-migrate-*.db        # automatic pre-upgrade snapshots
    backup-*.json           # scheduled / manual config backups
  snapshots/                # legacy alias retained for older deployments
  pending_restore.json      # staged restore (consumed at next boot)
  applied_restores/         # archive of applied restore payloads

Always mount /data on a persistent volume. The WAL files MUST live on the same filesystem as the main DB — never split them across mounts.

Docker example

See docker-compose.yml at the repo root for the canonical reference. The container runs read-only with tmpfs for /tmp, drops all capabilities, and limits memory/CPU. The healthcheck targets /api/ready (deep) — see the next section.

Healthchecks

Two endpoints, used for different probe types.

GET /api/health — liveness, shallow

Returns 200 OK once the ASGI app has started. Does not touch the DB or the scheduler. Use this for liveness probes that should only restart the process if it stops responding entirely.

{"status": "ok", "version": "0.8.0"}

GET /api/ready — readiness, deep

Verifies that each critical dependency is reachable:

  • db: SELECT 1 against the SQLAlchemy engine, 2-second timeout.
  • scheduler: APScheduler running flag.
  • ha: Home Assistant subscription supervisor task. Reported as na when no HA providers are configured, ok when at least one supervisor is alive, degraded otherwise. Informational only; HA degradation does not flip readiness off.

Returns 503 when any required check (db, scheduler) fails.

{
  "ready": true,
  "checks": {"db": "ok", "scheduler": "ok", "ha": "na"},
  "errors": [],
  "version": "0.8.0"
}
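
For scripted checks outside Kubernetes, the response above can be interpreted roughly as follows. This sketch assumes a 503 response carries the same JSON shape as the 200 example, which is not confirmed here. summarize_ready and fetch_ready are hypothetical helper names.

```python
import json
import urllib.error
import urllib.request


def summarize_ready(status_code: int, body: str) -> tuple[bool, list[str]]:
    """Return (ready, failing_required_checks) from an /api/ready response."""
    payload = json.loads(body)
    checks = payload.get("checks", {})
    # Only db and scheduler gate readiness; ha is informational.
    failing = [name for name in ("db", "scheduler") if checks.get(name) != "ok"]
    ready = status_code == 200 and bool(payload.get("ready")) and not failing
    return ready, failing


def fetch_ready(base_url: str, timeout: float = 5.0) -> tuple[bool, list[str]]:
    """Call GET /api/ready; urllib raises on 503, so read the error body too."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/ready", timeout=timeout) as resp:
            return summarize_ready(resp.status, resp.read().decode("utf-8"))
    except urllib.error.HTTPError as exc:
        return summarize_ready(exc.code, exc.read().decode("utf-8"))
```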

Kubernetes probe example

livenessProbe:
  httpGet:
    path: /api/health
    port: 8420
  initialDelaySeconds: 10
  periodSeconds: 30
  timeoutSeconds: 5
  failureThreshold: 3

readinessProbe:
  httpGet:
    path: /api/ready
    port: 8420
  initialDelaySeconds: 15
  periodSeconds: 15
  timeoutSeconds: 5
  failureThreshold: 2

The Docker compose file uses /api/ready as its healthcheck so the container is only reported healthy after migrations finish.

Metrics

Notify Bridge exposes Prometheus metrics at GET /api/metrics in the standard text exposition format. The endpoint is unauthenticated, so disable it via NOTIFY_BRIDGE_METRICS_ENABLED=false when the API port is reachable beyond the trust boundary.

Prometheus scrape example

scrape_configs:
  - job_name: notify-bridge
    metrics_path: /api/metrics
    static_configs:
      - targets: ['notify-bridge.internal:8420']
    scrape_interval: 30s

Available metrics

| Metric | Type | Labels | Meaning |
| --- | --- | --- | --- |
| notify_bridge_deferred_pending | Gauge | (none) | Pending rows in deferred_dispatch. Refreshed on each scrape. A persistent non-zero value usually means a tracker target is in extended quiet hours. |
| notify_bridge_event_log_total | Counter | status, event_type | Events written to event_log. status is the dispatch outcome (dispatched, dropped, deferred, etc.). |
| notify_bridge_dispatch_duration_seconds | Histogram | channel | Wall-clock duration of one outbound dispatch (Telegram, Discord, email, …). Useful for latency alerts. |
| notify_bridge_provider_poll_failures_total | Counter | provider_type | Polling provider tick failures (Immich poll error, Gitea API down, …). Compare against expected scan interval to compute failure rate. |
| notify_bridge_target_send_failures_total | Counter | target_type, status_code | Failed sends to a notification channel. status_code is the HTTP status (or 0 when no HTTP response was received). |
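
For ad-hoc checks without a Prometheus server, the metrics listed above can be read straight from the text exposition output. The sketch below is a deliberately simple parser. It assumes samples carry no trailing timestamp, and read_samples is a hypothetical helper.

```python
def read_samples(exposition: str, metric: str) -> dict[str, float]:
    """Map each sample's label string ('' when unlabeled) to its value."""
    samples: dict[str, float] = {}
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip HELP/TYPE comments and blank lines
        # The value is the last space-separated field (no timestamps assumed).
        series, _, value = line.rpartition(" ")
        name, brace, labels = series.partition("{")
        if name == metric:
            samples[labels.rstrip("}") if brace else ""] = float(value)
    return samples
```

Calling read_samples(text, "notify_bridge_deferred_pending") on a scrape body returns a single unlabeled entry, which can be compared across scrapes to spot a backlog that only grows.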

prometheus_client is imported only in api/metrics.py. Other modules record events through the metrics singleton; read that module's docstring before adding new collectors.

Backups

Notify Bridge produces three different kinds of backup files. Know which one you are looking at before restoring.

| Kind | Location | Format | Trigger |
| --- | --- | --- | --- |
| Config backup | data/backups/backup-*.json | JSON (BackupFile schema) | Manual via POST /api/backup/files or scheduled job |
| Pre-migration snapshot | data/backups/pre-migrate-*.db | SQLite DB file | Automatic on every boot before migrations |
| Pending restore | data/pending_restore.json | JSON | Staged via /api/backup/prepare-restore, consumed at next restart |

Config backups capture user configuration (providers, trackers, targets, templates, …). They do not include event_log, deferred_dispatch, or any other operational table. Pre-migration snapshots are full DB copies and contain everything.

Manual backup

The admin UI has a one-click button under Settings → Backup. Equivalent HTTP call:

curl -fsS -X POST \
  -H "Authorization: Bearer $ADMIN_JWT" \
  "https://notify-bridge.example.com/api/backup/files?secrets_mode=exclude"

The export endpoint returns the backup as a downloadable JSON envelope. Secrets are omitted unless secrets_mode=include is passed:

curl -fsS -X GET \
  -H "Authorization: Bearer $ADMIN_JWT" \
  -OJ "https://notify-bridge.example.com/api/backup/export?secrets_mode=exclude"

Scheduled backup

Configure under Settings → Backup or via PUT /api/backup/scheduled with:

{
  "backup_scheduled_enabled": "true",
  "backup_scheduled_interval_hours": "24",
  "backup_secrets_mode": "exclude",
  "backup_retention_count": "5"
}

Saved files land in data/backups/; retention prunes the oldest files beyond backup_retention_count. Backups can be downloaded individually:

curl -fsS -X GET \
  -H "Authorization: Bearer $ADMIN_JWT" \
  "https://notify-bridge.example.com/api/backup/files/backup-2026-05-16T12-00-00.json" \
  -o backup-latest.json
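
The retention rule for scheduled backups (keep the newest backup_retention_count files, delete the rest) could look like the sketch below. It assumes files are ordered by their timestamped names; the server may order by modification time instead. prune_backups is a hypothetical helper.

```python
from pathlib import Path


def prune_backups(backup_dir: Path, retention_count: int) -> list[Path]:
    """Delete all but the newest `retention_count` backup-*.json files."""
    if retention_count < 1:
        raise ValueError("retention_count must be at least 1")
    # Names like backup-2026-05-16T12-00-00.json sort chronologically.
    files = sorted(backup_dir.glob("backup-*.json"), key=lambda p: p.name)
    doomed = files[:-retention_count]
    for path in doomed:
        path.unlink()
    return doomed
```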

Cron snippet for off-host backup

# /etc/cron.d/notify-bridge-backup
# cron does not support backslash line continuations, so the job stays on one line.
0 3 * * * www-data curl -fsS -X POST -H "Authorization: Bearer $(cat /etc/notify-bridge/admin.token)" "https://notify-bridge.example.com/api/backup/files?secrets_mode=exclude" -o /var/backups/notify-bridge/backup-$(date +\%F).json

Restore procedure

Restoring REPLACES configuration. Always export the current state first.

# 1. Stage the backup file (validates and writes to data/pending_restore.json)
curl -fsS -X POST \
  -H "Authorization: Bearer $ADMIN_JWT" \
  -F "file=@backup-2026-05-16T12-00-00.json" \
  "https://notify-bridge.example.com/api/backup/prepare-restore?conflict_mode=overwrite"

# 2. Trigger graceful restart so startup applies the staged restore.
#    Same-origin Origin/Referer is enforced — call from the admin UI when
#    possible, or from the same host. Requires the supervisor to respawn
#    the process (Docker restart policy, systemd, PM2, etc.).
curl -fsS -X POST \
  -H "Origin: https://notify-bridge.example.com" \
  -H "Referer: https://notify-bridge.example.com/settings/backup" \
  -H "Authorization: Bearer $ADMIN_JWT" \
  "https://notify-bridge.example.com/api/backup/apply-restart"

If the process is not supervised, /api/backup/apply-restart returns 409. Restart the backend manually after staging — startup applies the pending restore on the next boot.
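
When the restart step is scripted, the 409 case deserves explicit handling. The sketch below mirrors the curl call above using only the standard library. build_apply_restart_request and apply_restart are hypothetical names, and the returned strings are illustrative rather than part of the API.

```python
import urllib.error
import urllib.request


def build_apply_restart_request(base_url: str, admin_jwt: str) -> urllib.request.Request:
    """Build the POST for /api/backup/apply-restart with same-origin headers."""
    return urllib.request.Request(
        f"{base_url}/api/backup/apply-restart",
        data=b"",
        method="POST",
        headers={
            "Origin": base_url,
            "Referer": f"{base_url}/settings/backup",
            "Authorization": f"Bearer {admin_jwt}",
        },
    )


def apply_restart(base_url: str, admin_jwt: str, timeout: float = 10.0) -> str:
    """Return 'restarting', or 'manual-restart-needed' when the server answers 409."""
    request = build_apply_restart_request(base_url, admin_jwt)
    try:
        with urllib.request.urlopen(request, timeout=timeout):
            return "restarting"
    except urllib.error.HTTPError as exc:
        if exc.code == 409:
            return "manual-restart-needed"  # unsupervised process
        raise
```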

To cancel a staged restore before applying:

curl -fsS -X DELETE \
  -H "Authorization: Bearer $ADMIN_JWT" \
  "https://notify-bridge.example.com/api/backup/pending-restore"

Recovery from a corrupted DB

If migrations crash on boot or the DB file is unreadable, roll back to a pre-migration snapshot:

# Stop the backend, then
cd /var/lib/docker/volumes/notify-bridge-data/_data
ls -1t backups/pre-migrate-*.db | head -5      # pick the snapshot

cp notify_bridge.db notify_bridge.db.broken    # keep the broken DB for forensics
cp backups/pre-migrate-2026-05-16T11-58-30.db notify_bridge.db
rm -f notify_bridge.db-wal notify_bridge.db-shm   # WAL belongs to the broken file

Restart the container. The startup snapshot will run again and capture the rolled-back state, so you have a clean recovery point if the next boot needs another rollback.

Logs

  • Output goes to stderr only. The Docker log driver captures it.
  • Set NOTIFY_BRIDGE_LOG_FORMAT=json for line-delimited JSON suitable for Loki, ELK, or CloudWatch.
  • Secret values (bot tokens, API keys, passwords) are masked at the log formatter level — see notify_bridge_server.logging_setup.
  • No file rotation is built in. Use the Docker JSON log driver's max-size/max-file options or send logs to your aggregator.

# docker-compose.yml snippet
logging:
  driver: json-file
  options:
    max-size: "10m"
    max-file: "5"

Common operational scenarios

"Notifications stopped firing"

  1. Hit /api/ready. If scheduler is fail, restart the backend; the scheduler died in a way it cannot recover from.

  2. Check notify_bridge_deferred_pending. A non-zero value during quiet hours is normal; a value that grows monotonically across days is a bug — inspect the deferred_dispatch table.

  3. Inspect the most recent event_log rows in the admin Events page or:

    SELECT created_at, event_type, dispatch_status, details
    FROM event_log
    ORDER BY created_at DESC LIMIT 50;
    

    Look for a dispatch_status other than dispatched.

  4. If a single tracker is silent, verify the provider's last poll status in the admin UI (Providers page) — notify_bridge_provider_poll_failures_total tells you which provider type is failing.

  5. If you've configured a bridge_self tracker but never received a self-monitoring alert when something failed, see the next section — bridge_self failures are deliberately log-only to prevent recursion.
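
The event_log inspection in step 3 can also be summarized programmatically. This sketch counts dispatch_status values over the newest rows, using only the column names from the query above. recent_status_counts is a hypothetical helper.

```python
import sqlite3
from collections import Counter


def recent_status_counts(conn: sqlite3.Connection, limit: int = 50) -> Counter:
    """Count dispatch_status values across the newest `limit` event_log rows."""
    rows = conn.execute(
        "SELECT dispatch_status FROM event_log ORDER BY created_at DESC LIMIT ?",
        (limit,),
    ).fetchall()
    return Counter(status for (status,) in rows)
```

To avoid interfering with the running backend, open the database read-only, for example sqlite3.connect("file:/data/notify_bridge.db?mode=ro", uri=True). A result dominated by anything other than dispatched points at the dispatch path, not the providers.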

Bridge self-monitoring is log-only on its own failures

The built-in bridge_self provider emits notifications when polls, dispatches, or target sends fail. To prevent infinite recursion (a bridge_self notification failing → triggering another bridge_self notification → ...), failures of bridge_self events themselves are not counted toward target-failure thresholds and are logged only.

If your bridge_self notifications stop arriving, it means the notification target you wired them to is itself failing. Grep stderr for:

bridge_self target-failure emission failed
emit_bridge_self_event failed

The fix is always at the target layer (Telegram bot blocked, Matrix homeserver down, SMTP credentials rotated). The bridge cannot tell you about its own outbound failure — that's what the operator's external monitoring (Prometheus alert on notify_bridge_target_send_failures_total) is for.
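
The self-loop guard described in this section can be pictured with a small counter. This is an illustrative sketch, not the bridge's actual code: TargetFailureTracker and its methods are hypothetical, and the default threshold of 5 is an assumed value.

```python
from dataclasses import dataclass, field


@dataclass
class TargetFailureTracker:
    """Counts consecutive send failures per target and signals at a threshold."""

    threshold: int = 5
    _counts: dict[str, int] = field(default_factory=dict)

    def record_failure(self, target_id: str, event_type: str) -> bool:
        """Return True when a bridge_self_target_failures event should be emitted."""
        if event_type.startswith("bridge_self_"):
            return False  # self-loop guard: log only, never count
        count = self._counts.get(target_id, 0) + 1
        if count >= self.threshold:
            self._counts[target_id] = 0  # reset so the alert does not repeat every failure
            return True
        self._counts[target_id] = count
        return False

    def record_success(self, target_id: str) -> None:
        """A successful send breaks the consecutive-failure streak."""
        self._counts.pop(target_id, None)
```

Because bridge_self failures never reach the counter, a dead alert target stays silent inside the bridge, which is why the external Prometheus alert remains necessary.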

"Webhook returns 500"

Inspect the webhook_payload_log table for the matching request:

SELECT received_at, status_code, error_message, payload_excerpt
FROM webhook_payload_log
ORDER BY received_at DESC LIMIT 20;

Common causes: payload schema change in the source service, a tracker referencing a deleted provider, a Jinja template that errors out (look for template render failed in logs).

"Telegram bot rate-limited (429)"

The Telegram client implements exponential backoff with jitter on Retry-After. No operator action is required for transient throttling. If the rate limit persists, check whether either of these applies:

  • The bot is being driven by multiple Notify Bridge instances pointing at the same chat (split-brain — only one instance should own a bot).
  • A template is producing very large messages (Telegram limits message size to 4096 chars). Look for MessageTooLong in the logs.

"DB lock contention"

SQLite WAL mode and busy_timeout=10000 make this rare. If you see SQLITE_BUSY in logs:

  • Check for long-running transactions (most often a stuck migration).
  • Confirm the WAL files are on the same filesystem as the main DB — splitting them across mounts is a known cause.
  • Run sqlite3 notify_bridge.db "PRAGMA wal_checkpoint(TRUNCATE);" to flush the WAL. Safe to run while the backend is up.
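
The checkpoint from the last bullet can also be run from Python, which makes the result row easier to inspect. truncate_wal is a hypothetical helper; the 10000 ms default mirrors the busy_timeout quoted above.

```python
import sqlite3


def truncate_wal(db_path: str, busy_timeout_ms: int = 10000) -> tuple[int, int, int]:
    """Run a TRUNCATE checkpoint; returns (busy, wal_frames, checkpointed_frames)."""
    conn = sqlite3.connect(db_path, timeout=busy_timeout_ms / 1000)
    try:
        conn.execute(f"PRAGMA busy_timeout={int(busy_timeout_ms)}")
        row = conn.execute("PRAGMA wal_checkpoint(TRUNCATE)").fetchone()
        return tuple(row)
    finally:
        conn.close()
```

A first value of 1 means the checkpoint was blocked by another connection, which is itself a hint that a long-running transaction is holding the database.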

Upgrades

  1. Pre-migration snapshot is taken automatically before any migration runs. The latest five snapshots are retained by default.
  2. Migrations are idempotent — re-running an upgrade is safe.
  3. If a migration fails, the snapshot from step 1 is the recovery point. See "Recovery from a corrupted DB" above.
  4. Always test major version upgrades in staging first. The upgrade flow is the same in staging: pull the new image, restart the container.

The release tag stream lives at the project Gitea / GitHub releases page. Release notes are written to RELEASE_NOTES.md for the upcoming version and copied into the Gitea release body by the release.yml workflow.