# Operations Guide

This document covers running, monitoring, and recovering Notify Bridge in production. The intended audience is the operator on call when the notifications stop firing or when a release upgrade goes sideways.

For developer-focused docs (architecture, conventions, project layout) see `CLAUDE.md` and the `.claude/docs/` directory.

## Deployment overview

Notify Bridge ships as a single Docker image. All state lives in a single data directory mounted at `/data`.

### Required environment variables

| Variable | Default | Notes |
| --- | --- | --- |
| `NOTIFY_BRIDGE_SECRET_KEY` | _(none)_ | **Required.** 32+ random bytes. The server refuses to boot with the default placeholder or any of the known dev literals. |
| `NOTIFY_BRIDGE_CORS_ALLOWED_ORIGINS` | `http://localhost:5175` | Comma-separated list. `*` is rejected because credentials are enabled. |
| `NOTIFY_BRIDGE_FORWARDED_ALLOW_IPS` | `127.0.0.1` | Trusted proxy IPs whose `X-Forwarded-For` / `X-Forwarded-Proto` headers are honored. Set to your reverse-proxy IP. |

### Useful environment variables

| Variable | Default | Notes |
| --- | --- | --- |
| `NOTIFY_BRIDGE_DATA_DIR` | `/data` | Where the SQLite DB, snapshots, and backups live. |
| `NOTIFY_BRIDGE_DATABASE_URL` | _(derived from data_dir)_ | Override only if you want a non-default DB path. |
| `NOTIFY_BRIDGE_DEBUG` | `false` | Verbose logging + SQL echo. Do not enable in production. |
| `NOTIFY_BRIDGE_LOG_FORMAT` | `text` | Set to `json` for one JSON object per line — pipe to a log aggregator. |
| `NOTIFY_BRIDGE_LOG_LEVEL` | `INFO` | Root logger level. |
| `NOTIFY_BRIDGE_LOG_LEVELS` | _(empty)_ | Per-module overrides, e.g. `sqlalchemy.engine=WARNING,notify_bridge_core.notifications.telegram.client=DEBUG`. |
| `NOTIFY_BRIDGE_EVENT_LOG_RETENTION_DAYS` | `30` | Days of `event_log` history kept by the daily cleanup job. `0` disables retention. |
| `NOTIFY_BRIDGE_PRE_MIGRATE_SNAPSHOT_KEEP` | `5` | Number of pre-migration DB snapshots retained. `0` disables snapshotting. |
| `NOTIFY_BRIDGE_METRICS_ENABLED` | `true` | Expose `/api/metrics` for Prometheus. Set to `false` if the API port crosses a trust boundary. |
| `NOTIFY_BRIDGE_GRACEFUL_SHUTDOWN_SECONDS` | `60` | SIGTERM grace period before in-flight requests are killed. |
| `NOTIFY_BRIDGE_SUPERVISED` | _(auto)_ | Force the supervised flag for `apply-restart`. Use `true` when running under systemd/PM2 outside Docker. |

### Data directory layout

```
/data/
  notify_bridge.db          # main SQLite DB (WAL mode)
  notify_bridge.db-wal      # SQLite write-ahead log
  notify_bridge.db-shm      # SQLite shared memory file
  backups/
    pre-migrate-*.db        # automatic pre-upgrade snapshots
    backup-*.json           # scheduled / manual config backups
  snapshots/                # legacy alias retained for older deployments
  pending_restore.json      # staged restore (consumed at next boot)
  applied_restores/         # archive of applied restore payloads
```

Always mount `/data` on a persistent volume. The WAL files MUST live on the same filesystem as the main DB — never split them across mounts.

### Docker example

See `docker-compose.yml` at the repo root for the canonical reference. The container runs read-only with `tmpfs` for `/tmp`, drops all capabilities, and limits memory/CPU. The healthcheck targets `/api/ready` (deep) — see the next section.

## Healthchecks

Two endpoints, used for different probe types.

### `GET /api/health` — liveness, shallow

Returns `200 OK` once the ASGI app has started. Does not touch the DB or the scheduler. Use this for liveness probes that should only restart the process if it stops responding entirely.

```json
{"status": "ok", "version": "0.8.0"}
```

### `GET /api/ready` — readiness, deep

Verifies that each critical dependency is reachable:

* **db** — `SELECT 1` against the SQLAlchemy engine, 2-second timeout.
* **scheduler** — APScheduler `running` flag.
* **ha** — Home Assistant subscription supervisor task.
  Reported as `na` when no HA providers are configured, `ok` when at least one supervisor is alive, `degraded` otherwise.
  **Informational only** — HA degradation does not flip readiness off.

Returns `503` when any required check (db, scheduler) fails.

```json
{
  "ready": true,
  "checks": {"db": "ok", "scheduler": "ok", "ha": "na"},
  "errors": [],
  "version": "0.8.0"
}
```

### Kubernetes probe example

```yaml
livenessProbe:
  httpGet:
    path: /api/health
    port: 8420
  initialDelaySeconds: 10
  periodSeconds: 30
  timeoutSeconds: 5
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /api/ready
    port: 8420
  initialDelaySeconds: 15
  periodSeconds: 15
  timeoutSeconds: 5
  failureThreshold: 2
```

The Docker compose file uses `/api/ready` as its healthcheck so the container is only reported healthy after migrations finish.

## Metrics

Notify Bridge exposes Prometheus metrics at `GET /api/metrics` in the standard text exposition format. **No authentication** — Prometheus scrapers do not authenticate. Disable via `NOTIFY_BRIDGE_METRICS_ENABLED=false` when the API port is reachable beyond the trust boundary.

### Prometheus scrape example

```yaml
scrape_configs:
  - job_name: notify-bridge
    metrics_path: /api/metrics
    static_configs:
      - targets: ['notify-bridge.internal:8420']
    scrape_interval: 30s
```

### Available metrics

| Metric | Type | Labels | Meaning |
| --- | --- | --- | --- |
| `notify_bridge_deferred_pending` | Gauge | _(none)_ | Pending rows in `deferred_dispatch`. Refreshed on each scrape. A persistent non-zero value usually means a tracker target is in extended quiet hours. |
| `notify_bridge_event_log_total` | Counter | `status`, `event_type` | Events written to `event_log`. `status` is the dispatch outcome (`dispatched`, `dropped`, `deferred`, etc.). |
| `notify_bridge_dispatch_duration_seconds` | Histogram | `channel` | Wall-clock duration of one outbound dispatch (Telegram, Discord, email, …). Useful for latency alerts. |
| `notify_bridge_provider_poll_failures_total` | Counter | `provider_type` | Polling provider tick failures (Immich poll error, Gitea API down, …). Compare against expected scan interval to compute failure rate. |
| `notify_bridge_target_send_failures_total` | Counter | `target_type`, `status_code` | Failed sends to a notification channel. `status_code` is the HTTP status (or `0` when no HTTP response was received). |

`prometheus_client` is imported only in `api/metrics.py`. Other modules record events through the `metrics` singleton — see that module's docstring before adding new collectors.

## Backups

Notify Bridge produces three different kinds of backup files. Know which one you are looking at before restoring.

| Kind | Location | Format | Trigger |
| --- | --- | --- | --- |
| Config backup | `data/backups/backup-*.json` | JSON (BackupFile schema) | Manual via `/api/backup/files` POST or scheduled job |
| Pre-migration snapshot | `data/backups/pre-migrate-*.db` | SQLite DB file | Automatic on every boot before migrations |
| Pending restore | `data/pending_restore.json` | JSON | Staged via `/api/backup/prepare-restore`, consumed at next restart |

Config backups capture user configuration (providers, trackers, targets, templates, …). They do **not** include `event_log`, `deferred_dispatch`, or any other operational table. Pre-migration snapshots are full DB copies and contain everything.

### Manual backup

The admin UI has a one-click button under Settings → Backup.
Equivalent HTTP call:

```bash
curl -fsS -X POST \
  -H "Authorization: Bearer $ADMIN_JWT" \
  "https://notify-bridge.example.com/api/backup/files?secrets_mode=exclude"
```

The export endpoint returns a downloadable JSON envelope, with no secrets unless `secrets_mode=include` is passed:

```bash
curl -fsS -X GET \
  -H "Authorization: Bearer $ADMIN_JWT" \
  -OJ "https://notify-bridge.example.com/api/backup/export?secrets_mode=exclude"
```

### Scheduled backup

Configure under Settings → Backup or via `PUT /api/backup/scheduled` with:

```json
{
  "backup_scheduled_enabled": "true",
  "backup_scheduled_interval_hours": "24",
  "backup_secrets_mode": "exclude",
  "backup_retention_count": "5"
}
```

Saved files land in `data/backups/`; retention prunes the oldest files beyond `backup_retention_count`. Backups can be downloaded individually:

```bash
curl -fsS -X GET \
  -H "Authorization: Bearer $ADMIN_JWT" \
  "https://notify-bridge.example.com/api/backup/files/backup-2026-05-16T12-00-00.json" \
  -o backup-latest.json
```

### Cron snippet for off-host backup

```bash
# /etc/cron.d/notify-bridge-backup
# cron does not support backslash line continuations, so the whole command
# must stay on one line. Literal % signs must be escaped as \%.
0 3 * * * www-data curl -fsS -X POST -H "Authorization: Bearer $(cat /etc/notify-bridge/admin.token)" "https://notify-bridge.example.com/api/backup/files?secrets_mode=exclude" -o /var/backups/notify-bridge/backup-$(date +\%F).json
```

### Restore procedure

Restoring REPLACES configuration. Always export the current state first.

```bash
# 1. Stage the backup file (validates and writes to data/pending_restore.json)
curl -fsS -X POST \
  -H "Authorization: Bearer $ADMIN_JWT" \
  -F "file=@backup-2026-05-16T12-00-00.json" \
  "https://notify-bridge.example.com/api/backup/prepare-restore?conflict_mode=overwrite"

# 2. Trigger graceful restart so startup applies the staged restore.
#    Same-origin Origin/Referer is enforced — call from the admin UI when
#    possible, or from the same host. Requires the supervisor to respawn
#    the process (Docker restart policy, systemd, PM2, etc.).
curl -fsS -X POST \
  -H "Origin: https://notify-bridge.example.com" \
  -H "Referer: https://notify-bridge.example.com/settings/backup" \
  -H "Authorization: Bearer $ADMIN_JWT" \
  "https://notify-bridge.example.com/api/backup/apply-restart"
```

If the process is **not** supervised, `/api/backup/apply-restart` returns `409`. Restart the backend manually after staging — startup applies the pending restore on the next boot.

To cancel a staged restore before applying:

```bash
curl -fsS -X DELETE \
  -H "Authorization: Bearer $ADMIN_JWT" \
  "https://notify-bridge.example.com/api/backup/pending-restore"
```

### Recovery from a corrupted DB

If migrations crash on boot or the DB file is unreadable, roll back to a pre-migration snapshot:

```bash
# Stop the backend, then
cd /var/lib/docker/volumes/notify-bridge-data/_data
ls -1t backups/pre-migrate-*.db | head -5          # pick the snapshot
cp notify_bridge.db notify_bridge.db.broken        # keep the broken DB for forensics
cp backups/pre-migrate-2026-05-16T11-58-30.db notify_bridge.db
rm -f notify_bridge.db-wal notify_bridge.db-shm    # WAL belongs to the broken file
```

Restart the container. The startup snapshot will run again and capture the rolled-back state, so you have a clean recovery point if the next boot needs another rollback.

## Logs

* Output goes to **stderr only**. The Docker log driver captures it.
* Set `NOTIFY_BRIDGE_LOG_FORMAT=json` for line-delimited JSON suitable for Loki, ELK, or CloudWatch.
* Secret values (bot tokens, API keys, passwords) are masked at the log formatter level — see `notify_bridge_server.logging_setup`.
* No file rotation is built in. Use the Docker JSON log driver's `max-size`/`max-file` options or send logs to your aggregator.

```yaml
# docker-compose.yml snippet
logging:
  driver: json-file
  options:
    max-size: "10m"
    max-file: "5"
```

## Common operational scenarios

### "Notifications stopped firing"

1. Hit `/api/ready`.
   If `scheduler` is `fail`, restart the backend; the scheduler died in a way it cannot recover from.
2. Check `notify_bridge_deferred_pending`. A non-zero value during quiet hours is normal; a value that grows monotonically across days is a bug — inspect the `deferred_dispatch` table.
3. Inspect the most recent `event_log` rows in the admin Events page or:

   ```sql
   SELECT created_at, event_type, dispatch_status, details
   FROM event_log
   ORDER BY created_at DESC
   LIMIT 50;
   ```

   Look for a `dispatch_status` other than `dispatched`.
4. If a single tracker is silent, verify the provider's last poll status in the admin UI (Providers page) — `notify_bridge_provider_poll_failures_total` tells you which provider type is failing.
5. If you've configured a `bridge_self` tracker but never received a self-monitoring alert when something failed, see the next section — `bridge_self` failures are deliberately log-only to prevent recursion.

### Bridge self-monitoring is log-only on its own failures

The built-in `bridge_self` provider emits notifications when polls, dispatches, or target sends fail. To prevent infinite recursion (a `bridge_self` notification failing → triggering another `bridge_self` notification → ...), failures of `bridge_self` events themselves are **not** counted toward target-failure thresholds and are logged only.

If your `bridge_self` notifications stop arriving, it means the notification target you wired them to is itself failing. Grep stderr for:

```text
bridge_self target-failure emission failed
emit_bridge_self_event failed
```

The fix is always at the target layer (Telegram bot blocked, Matrix homeserver down, SMTP credentials rotated). The bridge cannot tell you about its own outbound failure — that's what the operator's external monitoring (Prometheus alert on `notify_bridge_target_send_failures_total`) is for.
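
Where a full Prometheus alert is not yet in place, the external check described above can be approximated by reading the counter straight from `/api/metrics`. The following is a minimal sketch, not part of Notify Bridge: the function name `sum_send_failures` is hypothetical, and it assumes the endpoint emits plain text exposition lines without trailing timestamps.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with Notify Bridge).
# Reads Prometheus text exposition format on stdin and prints the sum of all
# notify_bridge_target_send_failures_total samples across every label set.
# Assumption: sample lines carry no trailing timestamp, so the last
# whitespace-separated field is the sample value.
sum_send_failures() {
  awk '
    # Match only the counter itself; "# HELP"/"# TYPE" comment lines and
    # sibling series such as *_created do not match this pattern.
    /^notify_bridge_target_send_failures_total[{ ]/ { total += $NF }
    END { printf "%d\n", total }
  '
}

# Example call (host and port taken from the scrape example; adjust to your setup):
#   curl -fsS http://notify-bridge.internal:8420/api/metrics | sum_send_failures
```

Because the metric is a cumulative counter, a single reading says little on its own. Compare two readings taken some minutes apart and alert when the value has increased.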
### "Webhook returns 500"

Inspect the `webhook_payload_log` table for the matching request:

```sql
SELECT received_at, status_code, error_message, payload_excerpt
FROM webhook_payload_log
ORDER BY received_at DESC
LIMIT 20;
```

Common causes: payload schema change in the source service, a tracker referencing a deleted provider, a Jinja template that errors out (look for `template render failed` in logs).

### "Telegram bot rate-limited (429)"

The Telegram client implements exponential backoff with jitter on `Retry-After`. No operator action is required for transient throttling. If the rate-limit persists, check:

* The bot is being driven by multiple Notify Bridge instances pointing at the same chat (split-brain — only one instance should own a bot).
* A template is producing very large messages (Telegram limits message size to 4096 chars). Look for `MessageTooLong` in the logs.

### "DB lock contention"

SQLite WAL mode and `busy_timeout=10000` make this rare. If you see `SQLITE_BUSY` in logs:

* Check for long-running transactions (most often a stuck migration).
* Confirm the WAL files are on the same filesystem as the main DB — splitting them across mounts is a known cause.
* Run `sqlite3 notify_bridge.db "PRAGMA wal_checkpoint(TRUNCATE);"` to flush the WAL. Safe to run while the backend is up.

## Upgrades

1. Pre-migration snapshot is taken automatically before any migration runs. The latest five snapshots are retained by default.
2. Migrations are idempotent — re-running an upgrade is safe.
3. If a migration fails, the snapshot from step 1 is the recovery point. See "Recovery from a corrupted DB" above.
4. Always test major version upgrades in staging first. The upgrade flow is the same in staging: pull the new image, restart the container.

The release tag stream lives at the project Gitea / GitHub releases page. Release notes are written to `RELEASE_NOTES.md` for the upcoming version and copied into the Gitea release body by the `release.yml` workflow.
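
The upgrade flow from step 4 above (pull the new image, restart the container) is easier to trust when it ends with a readiness wait rather than a fixed sleep. The sketch below is an illustration under assumptions, not a shipped script: `wait_until` is a hypothetical helper, and the `docker compose` invocation and `localhost:8420` address assume a Compose deployment publishing the port used in the probe examples.

```shell
#!/bin/sh
# Hypothetical helper: retry a command until it succeeds or the attempt
# budget runs out.
# usage: wait_until <max_attempts> <delay_seconds> <command> [args...]
# Returns 0 as soon as the command succeeds, 1 if every attempt failed.
wait_until() {
  max_attempts=$1
  delay=$2
  shift 2
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    if "$@"; then
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
  return 1
}

# Example upgrade flow (left commented out so sourcing this file has no side effects):
#   docker compose pull && docker compose up -d
#   wait_until 30 2 curl -fsS -o /dev/null http://localhost:8420/api/ready \
#     || echo "not ready after upgrade; see 'Recovery from a corrupted DB'" >&2
```

`curl -f` exits non-zero on the `503` that `/api/ready` returns while a required check is failing, so the loop only ends successfully once the db and scheduler checks pass.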