notify-bridge/OPERATIONS.md

# Operations Guide

This document covers running, monitoring, and recovering Notify Bridge in
production. The intended audience is the operator on call when
notifications stop firing or a release upgrade goes sideways.

For developer-focused docs (architecture, conventions, project layout) see
`CLAUDE.md` and the `.claude/docs/` directory.

## Deployment overview

Notify Bridge ships as a single Docker image. All state lives in a single
data directory mounted at `/data`.

### Required environment variables

| Variable | Default | Notes |
| --- | --- | --- |
| `NOTIFY_BRIDGE_SECRET_KEY` | _(none)_ | **Required.** 32+ random bytes. The server refuses to boot with the default placeholder or any of the known dev literals. |
| `NOTIFY_BRIDGE_CORS_ALLOWED_ORIGINS` | `http://localhost:5175` | Comma-separated list. `*` is rejected because credentials are enabled. |
| `NOTIFY_BRIDGE_FORWARDED_ALLOW_IPS` | `127.0.0.1` | Trusted proxy IPs whose `X-Forwarded-For` / `X-Forwarded-Proto` headers are honored. Set to your reverse-proxy IP. |
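
One way to produce a value for `NOTIFY_BRIDGE_SECRET_KEY` is Python's standard `secrets` module. This is a minimal sketch: the table above only asks for 32+ random bytes, so the URL-safe text encoding and the helper name `generate_secret_key` are assumptions, not something the server is known to mandate.

```python
import secrets


def generate_secret_key(num_bytes: int = 32) -> str:
    """Return a URL-safe string built from `num_bytes` of OS randomness.

    Hypothetical helper: the text encoding the server expects is an assumption.
    """
    if num_bytes < 32:
        raise ValueError("use at least 32 random bytes")
    return secrets.token_urlsafe(num_bytes)


if __name__ == "__main__":
    # Print once and store the value in your secret manager or env file.
    print(generate_secret_key())
```

Generate the key once per deployment and keep it stable across restarts.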

### Useful environment variables

| Variable | Default | Notes |
| --- | --- | --- |
| `NOTIFY_BRIDGE_DATA_DIR` | `/data` | Where the SQLite DB, snapshots, and backups live. |
| `NOTIFY_BRIDGE_DATABASE_URL` | _(derived from data_dir)_ | Override only if you want a non-default DB path. |
| `NOTIFY_BRIDGE_DEBUG` | `false` | Verbose logging + SQL echo. Do not enable in production. |
| `NOTIFY_BRIDGE_LOG_FORMAT` | `text` | Set to `json` for one JSON object per line — pipe to a log aggregator. |
| `NOTIFY_BRIDGE_LOG_LEVEL` | `INFO` | Root logger level. |
| `NOTIFY_BRIDGE_LOG_LEVELS` | _(empty)_ | Per-module overrides, e.g. `sqlalchemy.engine=WARNING,notify_bridge_core.notifications.telegram.client=DEBUG`. |
| `NOTIFY_BRIDGE_EVENT_LOG_RETENTION_DAYS` | `30` | Days of `event_log` history kept by the daily cleanup job. `0` disables retention. |
| `NOTIFY_BRIDGE_PRE_MIGRATE_SNAPSHOT_KEEP` | `5` | Number of pre-migration DB snapshots retained. `0` disables snapshotting. |
| `NOTIFY_BRIDGE_METRICS_ENABLED` | `true` | Expose `/api/metrics` for Prometheus. Set to `false` if the API port crosses a trust boundary. |
| `NOTIFY_BRIDGE_GRACEFUL_SHUTDOWN_SECONDS` | `60` | SIGTERM grace period before in-flight requests are killed. |
| `NOTIFY_BRIDGE_SUPERVISED` | _(auto)_ | Force the supervised flag for `apply-restart`. Use `true` when running under systemd/PM2 outside Docker. |
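
To sanity-check a `NOTIFY_BRIDGE_LOG_LEVELS` value before deploying it, the `module=LEVEL,module=LEVEL` format can be parsed roughly as below. This is a sketch of the format shown in the table, not the server's actual parser; `parse_log_levels` is a hypothetical name, and the handling of whitespace and trailing commas is an assumption.

```python
import logging


def parse_log_levels(spec: str) -> dict[str, int]:
    """Parse 'module=LEVEL,module=LEVEL' into {logger_name: numeric_level}."""
    levels: dict[str, int] = {}
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate an empty value and trailing commas
        name, sep, level_name = item.partition("=")
        if not sep or not name.strip():
            raise ValueError(f"expected module=LEVEL, got {item!r}")
        # getLevelName returns an int for known level names, a string otherwise.
        level = logging.getLevelName(level_name.strip().upper())
        if not isinstance(level, int):
            raise ValueError(f"unknown log level in {item!r}")
        levels[name.strip()] = level
    return levels
```

A value that raises here is likely worth fixing before it reaches the container environment.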

### Data directory layout

```
/data/
  notify_bridge.db          # main SQLite DB (WAL mode)
  notify_bridge.db-wal      # SQLite write-ahead log
  notify_bridge.db-shm      # SQLite shared memory file
  backups/
    pre-migrate-*.db        # automatic pre-upgrade snapshots
    backup-*.json           # scheduled / manual config backups
  snapshots/                # legacy alias retained for older deployments
  pending_restore.json      # staged restore (consumed at next boot)
  applied_restores/         # archive of applied restore payloads
```

Always mount `/data` on a persistent volume. The WAL files MUST live on the
same filesystem as the main DB — never split them across mounts.

### Docker example

See `docker-compose.yml` at the repo root for the canonical reference. The
container runs read-only with `tmpfs` for `/tmp`, drops all capabilities,
and limits memory/CPU. The healthcheck targets `/api/ready` (deep) — see
the next section.

## Healthchecks

Two endpoints, used for different probe types.

### `GET /api/health` — liveness, shallow

Returns `200 OK` once the ASGI app has started. Does not touch the DB or
the scheduler. Use this for liveness probes that should only restart the
process if it stops responding entirely.

```json
{"status": "ok", "version": "0.8.0"}
```

### `GET /api/ready` — readiness, deep

Verifies that each critical dependency is reachable:

* **db** — `SELECT 1` against the SQLAlchemy engine, 2-second timeout.
* **scheduler** — APScheduler `running` flag.
* **ha** — Home Assistant subscription supervisor task. Reported as
  `na` when no HA providers are configured, `ok` when at least one
  supervisor is alive, `degraded` otherwise. **Informational only** —
  HA degradation does not flip readiness off.

Returns `503` when any required check (db, scheduler) fails.

```json
{
  "ready": true,
  "checks": {"db": "ok", "scheduler": "ok", "ha": "na"},
  "errors": [],
  "version": "0.8.0"
}
```
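
The readiness rule above (db and scheduler are required, ha is informational) can be sketched as follows. This illustrates the documented behavior and is not the server's code: `build_ready_response` and the `"name: state"` error strings are assumptions.

```python
REQUIRED_CHECKS = ("db", "scheduler")  # "ha" is informational only


def build_ready_response(checks: dict[str, str], version: str) -> tuple[int, dict]:
    """Return (http_status, body) for a readiness probe."""
    # Only required checks can produce errors; a degraded "ha" is ignored here.
    errors = [
        f"{name}: {checks.get(name, 'missing')}"
        for name in REQUIRED_CHECKS
        if checks.get(name) != "ok"
    ]
    ready = not errors
    body = {"ready": ready, "checks": checks, "errors": errors, "version": version}
    return (200 if ready else 503), body
```

With `{"db": "ok", "scheduler": "ok", "ha": "degraded"}` this still yields `200`, which matches the note that HA degradation does not flip readiness off.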

### Kubernetes probe example

```yaml
livenessProbe:
  httpGet:
    path: /api/health
    port: 8420
  initialDelaySeconds: 10
  periodSeconds: 30
  timeoutSeconds: 5
  failureThreshold: 3

readinessProbe:
  httpGet:
    path: /api/ready
    port: 8420
  initialDelaySeconds: 15
  periodSeconds: 15
  timeoutSeconds: 5
  failureThreshold: 2
```

The Docker compose file uses `/api/ready` as its healthcheck so the
container is only reported healthy after migrations finish.

## Metrics

Notify Bridge exposes Prometheus metrics at `GET /api/metrics` in the
standard text exposition format. The endpoint requires **no authentication**,
so any client that can reach the API port can read it. Disable it with
`NOTIFY_BRIDGE_METRICS_ENABLED=false` when the API port is reachable from
outside the trust boundary.

### Prometheus scrape example

```yaml
scrape_configs:
  - job_name: notify-bridge
    metrics_path: /api/metrics
    static_configs:
      - targets: ['notify-bridge.internal:8420']
    scrape_interval: 30s
```

### Available metrics

| Metric | Type | Labels | Meaning |
| --- | --- | --- | --- |
| `notify_bridge_deferred_pending` | Gauge | _(none)_ | Pending rows in `deferred_dispatch`. Refreshed on each scrape. A persistent non-zero value usually means a tracker target is in extended quiet hours. |
| `notify_bridge_event_log_total` | Counter | `status`, `event_type` | Events written to `event_log`. `status` is the dispatch outcome (`dispatched`, `dropped`, `deferred`, etc.). |
| `notify_bridge_dispatch_duration_seconds` | Histogram | `channel` | Wall-clock duration of one outbound dispatch (Telegram, Discord, email, …). Useful for latency alerts. |
| `notify_bridge_provider_poll_failures_total` | Counter | `provider_type` | Polling provider tick failures (Immich poll error, Gitea API down, …). Compare against expected scan interval to compute failure rate. |
| `notify_bridge_target_send_failures_total` | Counter | `target_type`, `status_code` | Failed sends to a notification channel. `status_code` is the HTTP status (or `0` when no HTTP response was received). |
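
For `notify_bridge_provider_poll_failures_total`, the failure rate mentioned in the table can be estimated from two counter samples and the expected scan interval. A minimal sketch, assuming at most one failure per poll tick; `poll_failure_rate` is a hypothetical helper and the counter-reset handling is a simplification.

```python
def poll_failure_rate(
    failures_start: float,
    failures_end: float,
    window_seconds: float,
    scan_interval_seconds: float,
) -> float:
    """Fraction of the expected polls in the window that failed (0.0 to 1.0)."""
    if window_seconds <= 0 or scan_interval_seconds <= 0:
        raise ValueError("window and scan interval must be positive")
    expected_polls = window_seconds / scan_interval_seconds
    delta = failures_end - failures_start
    if delta < 0:
        # Counter reset (process restart): fall back to the post-reset value.
        delta = failures_end
    return min(delta / expected_polls, 1.0)
```

For example, 6 new failures over one hour with a 300-second scan interval is 6 of 12 expected polls, a rate of 0.5.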

`prometheus_client` is imported only in `api/metrics.py`. Other modules
record events through the `metrics` singleton; see that module's docstring
before adding new collectors.

## Backups

Notify Bridge produces three different kinds of backup files. Know which
one you are looking at before restoring.

| Kind | Location | Format | Trigger |
| --- | --- | --- | --- |
| Config backup | `data/backups/backup-*.json` | JSON (BackupFile schema) | Manual via `/api/backup/files` POST or scheduled job |
| Pre-migration snapshot | `data/backups/pre-migrate-*.db` | SQLite DB file | Automatic on every boot before migrations |
| Pending restore | `data/pending_restore.json` | JSON | Staged via `/api/backup/prepare-restore`, consumed at next restart |

Config backups capture user configuration (providers, trackers, targets,
templates, …). They do **not** include `event_log`, `deferred_dispatch`,
or any other operational table. Pre-migration snapshots are full DB
copies and contain everything.

### Manual backup

The admin UI has a one-click button under Settings → Backup. Equivalent
HTTP call:

```bash
curl -fsS -X POST \
  -H "Authorization: Bearer $ADMIN_JWT" \
  "https://notify-bridge.example.com/api/backup/files?secrets_mode=exclude"
```

The export endpoint returns the backup as a downloadable JSON envelope.
Secrets are omitted unless `secrets_mode=include` is passed:

```bash
curl -fsS -X GET \
  -H "Authorization: Bearer $ADMIN_JWT" \
  -OJ "https://notify-bridge.example.com/api/backup/export?secrets_mode=exclude"
```

### Scheduled backup

Configure under Settings → Backup or via `PUT /api/backup/scheduled` with:

```json
{
  "backup_scheduled_enabled": "true",
  "backup_scheduled_interval_hours": "24",
  "backup_secrets_mode": "exclude",
  "backup_retention_count": "5"
}
```

Saved files land in `data/backups/`; retention prunes the oldest files
beyond `backup_retention_count`. Backups can be downloaded individually:

```bash
curl -fsS -X GET \
  -H "Authorization: Bearer $ADMIN_JWT" \
  "https://notify-bridge.example.com/api/backup/files/backup-2026-05-16T12-00-00.json" \
  -o backup-latest.json
```
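
The retention rule (keep the newest `backup_retention_count` files, prune the rest) can be sketched as below, for example to mirror it on an off-host copy. This is not the server's implementation: `select_backups_to_prune` is a hypothetical name, and it assumes the timestamped `backup-*.json` names sort chronologically.

```python
import fnmatch


def select_backups_to_prune(filenames: list[str], retention_count: int) -> list[str]:
    """Return the oldest backup-*.json names beyond `retention_count`."""
    if retention_count < 1:
        raise ValueError("retention_count must be at least 1 in this sketch")
    # Names like backup-2026-05-16T12-00-00.json sort oldest-first as plain text.
    backups = sorted(n for n in filenames if fnmatch.fnmatchcase(n, "backup-*.json"))
    excess = len(backups) - retention_count
    return backups[:excess] if excess > 0 else []
```

The function only selects names; deleting the files is left to the caller so the result can be reviewed first.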

### Cron snippet for off-host backup

```bash
# /etc/cron.d/notify-bridge-backup
# cron has no backslash line continuation, so the whole job stays on one line.
0 3 * * * www-data curl -fsS -X POST -H "Authorization: Bearer $(cat /etc/notify-bridge/admin.token)" "https://notify-bridge.example.com/api/backup/files?secrets_mode=exclude" -o /var/backups/notify-bridge/backup-$(date +\%F).json
```

### Restore procedure

Restoring REPLACES configuration. Always export the current state first.

```bash
# 1. Stage the backup file (validates and writes to data/pending_restore.json)
curl -fsS -X POST \
  -H "Authorization: Bearer $ADMIN_JWT" \
  -F "file=@backup-2026-05-16T12-00-00.json" \
  "https://notify-bridge.example.com/api/backup/prepare-restore?conflict_mode=overwrite"

# 2. Trigger graceful restart so startup applies the staged restore.
#    Same-origin Origin/Referer is enforced — call from the admin UI when
#    possible, or from the same host. Requires the supervisor to respawn
#    the process (Docker restart policy, systemd, PM2, etc.).
curl -fsS -X POST \
  -H "Origin: https://notify-bridge.example.com" \
  -H "Referer: https://notify-bridge.example.com/settings/backup" \
  -H "Authorization: Bearer $ADMIN_JWT" \
  "https://notify-bridge.example.com/api/backup/apply-restart"
```

If the process is **not** supervised, `/api/backup/apply-restart` returns
`409`. Restart the backend manually after staging — startup applies the
pending restore on the next boot.

To cancel a staged restore before applying:

```bash
curl -fsS -X DELETE \
  -H "Authorization: Bearer $ADMIN_JWT" \
  "https://notify-bridge.example.com/api/backup/pending-restore"
```

### Recovery from a corrupted DB

If migrations crash on boot or the DB file is unreadable, roll back to a
pre-migration snapshot:

```bash
# Stop the backend, then
cd /var/lib/docker/volumes/notify-bridge-data/_data
ls -1t backups/pre-migrate-*.db | head -5      # pick the snapshot

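cp notify_bridge.db notify_bridge.db.broken    # keep the broken DB for forensics
# The WAL/SHM files belong to the broken DB. Park them beside its copy so the
# restored snapshot does not pick them up (delete them instead if not needed).
[ ! -e notify_bridge.db-wal ] || mv notify_bridge.db-wal notify_bridge.db.broken-wal
[ ! -e notify_bridge.db-shm ] || mv notify_bridge.db-shm notify_bridge.db.broken-shm
cp backups/pre-migrate-2026-05-16T11-58-30.db notify_bridge.db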
```

Restart the container. The startup snapshot will run again and capture
the rolled-back state, so you have a clean recovery point if the next
boot needs another rollback.
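
Before copying a snapshot over the main DB, it may be worth confirming that the snapshot itself is readable. A minimal sketch using Python's standard `sqlite3` module and SQLite's `PRAGMA integrity_check`; `snapshot_is_healthy` is a hypothetical helper, and the `immutable=1` flag assumes nothing else is writing to the snapshot file.

```python
import sqlite3
from pathlib import Path


def snapshot_is_healthy(snapshot_path: Path) -> bool:
    """Return True if SQLite's integrity_check reports 'ok' for the file."""
    # Read-only + immutable: the check never modifies the snapshot and does
    # not need to create -wal/-shm files next to it.
    uri = f"{snapshot_path.resolve().as_uri()}?mode=ro&immutable=1"
    try:
        conn = sqlite3.connect(uri, uri=True)
    except sqlite3.Error:
        return False  # missing or unopenable file
    try:
        rows = conn.execute("PRAGMA integrity_check").fetchall()
    except sqlite3.DatabaseError:
        return False  # not a SQLite file, or too damaged to read
    finally:
        conn.close()
    return rows == [("ok",)]
```

If the newest snapshot fails this check, try the next one from the `ls -1t` listing.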

## Logs

* Output goes to **stderr only**. The Docker log driver captures it.
* Set `NOTIFY_BRIDGE_LOG_FORMAT=json` for line-delimited JSON suitable
  for Loki, ELK, or CloudWatch.
* Secret values (bot tokens, API keys, passwords) are masked at the log
  formatter level — see `notify_bridge_server.logging_setup`.
* No file rotation is built in. Use the Docker JSON log driver's
  `max-size`/`max-file` options or send logs to your aggregator.

```yaml
# docker-compose.yml snippet
logging:
  driver: json-file
  options:
    max-size: "10m"
    max-file: "5"
```
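
Masking at the formatter level, as mentioned in the list above, can be pictured with the sketch below. It is not the code in `notify_bridge_server.logging_setup`: the class name `MaskingFormatter` and the `key=value` patterns are assumptions chosen for illustration.

```python
import logging
import re

# Hypothetical pattern: key=value pairs whose key ends in a secret-like word.
_SECRET_PATTERN = re.compile(r"(?i)(token|api_key|password)=([^\s&]+)")


class MaskingFormatter(logging.Formatter):
    """Formatter that replaces secret-looking values after normal rendering."""

    def format(self, record: logging.LogRecord) -> str:
        rendered = super().format(record)
        # Masking the final string also covers values passed as log arguments.
        return _SECRET_PATTERN.sub(r"\1=***", rendered)
```

Because the substitution runs on the rendered line, it applies to every handler that uses the formatter, whatever the log call site looks like.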

## Common operational scenarios

### "Notifications stopped firing"

1. Hit `/api/ready`. If `scheduler` is `fail`, restart the backend; the
   scheduler died in a way it cannot recover from.
2. Check `notify_bridge_deferred_pending`. A non-zero value during quiet
   hours is normal; a value that grows monotonically across days is a
   bug — inspect the `deferred_dispatch` table.
3. Inspect the most recent `event_log` rows in the admin Events page or:

   ```sql
   SELECT created_at, event_type, dispatch_status, details
   FROM event_log
   ORDER BY created_at DESC LIMIT 50;
   ```

   Look for a `dispatch_status` other than `dispatched`.
4. If a single tracker is silent, verify the provider's last poll status
   in the admin UI (Providers page) — `notify_bridge_provider_poll_failures_total`
   tells you which provider type is failing.
5. If you've configured a `bridge_self` tracker but never received a
   self-monitoring alert when something failed, see the next section —
   `bridge_self` failures are deliberately log-only to prevent recursion.

### Bridge self-monitoring is log-only on its own failures

The built-in `bridge_self` provider emits notifications when polls,
dispatches, or target sends fail. To prevent infinite recursion (a failing
`bridge_self` notification triggering another `bridge_self` notification,
and so on), failures of `bridge_self` events themselves are logged only and
are **not** counted toward target-failure thresholds.
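
The guard can be pictured with the sketch below. It only illustrates the rule just described; `SendFailure`, `record_send_failure`, and the in-memory counter are hypothetical and do not reflect the real data model.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger("bridge_self_sketch")


@dataclass
class SendFailure:
    source_provider: str  # provider type that produced the event being sent
    target_id: str        # notification target that rejected the send


def record_send_failure(failure: SendFailure, failure_counts: dict[str, int]) -> bool:
    """Count a failed send; return False when it was only logged."""
    if failure.source_provider == "bridge_self":
        # Log-only path: counting this could trigger another bridge_self alert.
        logger.warning(
            "bridge_self target-failure emission failed (target=%s)",
            failure.target_id,
        )
        return False
    failure_counts[failure.target_id] = failure_counts.get(failure.target_id, 0) + 1
    return True
```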

If your `bridge_self` notifications stop arriving, it means the
notification target you wired them to is itself failing. Grep stderr for:

```text
bridge_self target-failure emission failed
emit_bridge_self_event failed
```

The fix is always at the target layer (Telegram bot blocked, Matrix
homeserver down, SMTP credentials rotated). The bridge cannot tell you
about its own outbound failure — that's what the operator's external
monitoring (Prometheus alert on `notify_bridge_target_send_failures_total`)
is for.

### "Webhook returns 500"

Inspect the `webhook_payload_log` table for the matching request:

```sql
SELECT received_at, status_code, error_message, payload_excerpt
FROM webhook_payload_log
ORDER BY received_at DESC LIMIT 20;
```

Common causes: payload schema change in the source service, a tracker
referencing a deleted provider, a Jinja template that errors out (look
for `template render failed` in logs).

### "Telegram bot rate-limited (429)"

On a `429`, the Telegram client honors the `Retry-After` value and retries
with exponential backoff and jitter. No operator action is required for
transient throttling. If the rate limit persists, check whether:

* The bot is being driven by multiple Notify Bridge instances pointing
  at the same chat (split-brain; only one instance should own a bot).
* A template is producing very large messages (Telegram limits message
  size to 4096 chars). Look for `MessageTooLong` in the logs.
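
The retry timing can be sketched as below. This is a generic "full jitter" backoff that never retries sooner than `Retry-After`; the function name, base, and cap are assumptions, and the Telegram client's actual constants may differ.

```python
import random
from typing import Optional


def backoff_delay(
    attempt: int,
    retry_after: Optional[float] = None,
    base: float = 1.0,
    cap: float = 60.0,
    rng: Optional[random.Random] = None,
) -> float:
    """Seconds to wait before retry number `attempt` (0-based)."""
    if rng is None:
        rng = random.Random()
    # Exponential ceiling, capped so late attempts do not wait unboundedly.
    ceiling = min(cap, base * (2 ** attempt))
    jitter = rng.uniform(0.0, ceiling)
    # Retry-After acts as a floor; jitter spreads retries above it.
    floor = retry_after if retry_after is not None else 0.0
    return floor + jitter
```

Jitter matters most when several senders hit the limit at once, since it keeps them from retrying in lockstep.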

### "DB lock contention"

SQLite WAL mode and `busy_timeout=10000` make this rare. If you see
`SQLITE_BUSY` in logs:

* Check for long-running transactions (most often a stuck migration).
* Confirm the WAL files are on the same filesystem as the main DB —
  splitting them across mounts is a known cause.
* Run `sqlite3 notify_bridge.db "PRAGMA wal_checkpoint(TRUNCATE);"` to
  flush the WAL. Safe to run while the backend is up.
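
The same settings can be reproduced from a script with Python's standard `sqlite3` module, for example to test a checkpoint against a copy of the DB. A minimal sketch: `open_with_wal` and `checkpoint_truncate` are hypothetical helpers, not part of Notify Bridge.

```python
import sqlite3


def open_with_wal(path: str, busy_timeout_ms: int = 10_000) -> sqlite3.Connection:
    """Open a SQLite file with WAL journaling and a busy timeout."""
    conn = sqlite3.connect(path)
    # Wait up to busy_timeout_ms for locks instead of failing with SQLITE_BUSY.
    conn.execute(f"PRAGMA busy_timeout = {int(busy_timeout_ms)}")
    conn.execute("PRAGMA journal_mode = WAL")
    return conn


def checkpoint_truncate(conn: sqlite3.Connection) -> tuple:
    """Run wal_checkpoint(TRUNCATE); returns (busy, wal_frames, checkpointed)."""
    # busy == 1 means a reader or writer blocked the checkpoint from finishing.
    return tuple(conn.execute("PRAGMA wal_checkpoint(TRUNCATE)").fetchone())
```

A `busy` value of `1` in the result points back at the first bullet: something is holding a transaction open.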

## Upgrades

1. A pre-migration snapshot is taken automatically before any migration
   runs. The latest five snapshots are retained by default.
2. Migrations are idempotent — re-running an upgrade is safe.
3. If a migration fails, the snapshot from step 1 is the recovery point.
   See "Recovery from a corrupted DB" above.
4. Always test major version upgrades in staging first. The upgrade flow
   is the same in staging: pull the new image, restart the container.

Release tags are published on the project's Gitea / GitHub releases page.
Release notes are written to `RELEASE_NOTES.md` for the upcoming version
and copied into the Gitea release body by the `release.yml` workflow.