alexei.dolgolyov 10d30fc956 feat: production readiness — security, perf, bug fixes, bridge self-monitoring
Comprehensive multi-area pass driven by a parallel 8-agent production
review. Frontend, backend, database, security, performance, operational,
plus a new self-monitoring feature.

## Critical fixes
- Planka webhook: reads bounded raw body (was NameError on every call)
- HA quiet hours: ha_state_changed/automation_triggered/service_called/
  event_fired added to deferrable set (were silently dropped)
- DNS-rebinding SSRF: PinnedResolver wired into shared aiohttp session
- Telegram inbound webhook: secret now mandatory (401 without)
- Generic webhook: auth_mode="none" requires explicit
  acknowledge_unauthenticated=true; per-IP rate limit 60/min
- svelte-check: 5 null-narrowing errors in EventDetailModal fixed
- Provider hardcoding: Immich-only block extracted to descriptor
  featureDiscoveryHint
- command_sync: snapshot+expunge bot before exiting AsyncSession

## Bug fixes
- notifier asyncio.gather(return_exceptions=True) — one bad chat no longer
  cancels peer sends
- NotificationDispatcher hoisted out of per-tracker loop
- Provider credential resolution unified across all 5 dispatch sites
- HA asyncio.shield now drains inner task on cancellation
- Provider construction switched from if/elif ladder to factory registry
- NUT first poll seeds silently (no spurious ups_on_battery)
- Quiet-hours gate: event-type-disabled now wins over deferral
- APScheduler drain job ID resolution upgraded to seconds
- HA on_status_change wired through to EventLog
- Webhook payload rollback failures now logged (not swallowed)
- Batched receivers/chats/bots in load_link_data (was per-target N+1)
- flag_modified on JSON column reassignments in deferred_dispatch

## Database
- UNIQUE indexes on service_provider.webhook_token,
  telegram_bot.webhook_path_id, partial UNIQUE on telegram_bot.bot_id,
  telegram_chat(bot_id, chat_id), notification_tracker_target unique link,
  partial UNIQUE on bridge_self provider per user
- Composite ix_event_log_user_event_type_created index
- save_chat_from_webhook switched to ON CONFLICT DO UPDATE
- ondelete=CASCADE on user-id FKs (model annotation; app-side cascade
  delete added for existing data)
- delete_notification_tracker converted from N+1 to bulk DELETE/UPDATE
- Module-level asyncio.Lock replaced with lazy _get_lock() pattern
- VACUUM INTO snapshot now PRAGMA integrity_check verified

## Performance
- Jinja2 template compilation LRU cached (lru_cache maxsize=512)
- Per-locale render cache in NotificationDispatcher (skips re-rendering
  identical content for receivers sharing a locale)
- Tracker list cached per provider_id with 5s TTL + explicit invalidation
  on tracker CRUD (relieves HA chat-bus rate query pressure)
- Nav-counts collapsed from 16 round-trips to single UNION ALL
- HA event_log: skip persisting empty assets_added/removed events

## Security hardening
- Mass-assignment guard on Action create/update; cron sub-minute reject
- Backup JSON depth/node-count cap (depth ≤ 10, nodes ≤ 100k)
- _sanitize_config extended to all JSON-typed fields on backup import
- Telegram _safe_get walks redirects manually with SSRF revalidation
- Bcrypt 72-byte password length cap with clear 422
- Webhook payload body redaction; sensitive substring set extended with
  oauth/client_secret/webhook_secret/csrf in both header filter and
  template extras filter

## Frontend
- 76 catch (err: any) sites converted to errMsg(err) helper
- globalProviderFilter: pure getter; reconciliation moved to one-time
  $effect in +layout
- Provider-filter binding: removed paired $effects + _syncingFilter flag,
  now one-way derived
- entity-cache: separate _refreshing flag for background re-fetches
- api.ts 401 handling: AuthRedirectError class + dedup _redirecting flag,
  goto() instead of window.location.href
- a11y: aria-expanded on mobile More, role=switch + aria-checked on
  Telegram bot toggles

## Tests & operations
- CI pytest gate added to .gitea/workflows/build.yml + release.yml
  (wheel-built install to dodge editable-install slowness)
- /api/ready upgraded to deep healthcheck (db SELECT 1, scheduler.running,
  HA supervisor presence) returning {ready, checks, errors, version}
- /api/metrics endpoint with prometheus_client (deferred_pending,
  event_log_total, dispatch_duration, poll_failures, send_failures)
- New OPERATIONS.md covering deploy, healthchecks, metrics, backup/restore
  procedures, log handling, common scenarios, upgrade flow
- New tests: test_bridge_self (11), test_gitea_parser (9),
  test_planka_parser (6), test_immich_change_detector (6),
  test_backup_roundtrip (1)

## New feature: bridge self-monitoring
- New bridge_self provider type — internal sink for bridge health events
- Three event types: bridge_self_poll_failures (consecutive tracker poll
  failures), bridge_self_deferred_backlog (pending count crosses
  threshold), bridge_self_target_failures (consecutive 5xx/network
  failures per target)
- Per-user thresholds (defaults: 3 / 100 / 5) configurable via the
  provider config form
- Auto-seeded on user create + /setup + boot backfill for existing users
- Anti-spam: counters reset after emission; backlog uses transition latch
- Self-loop guard: bridge_self failures don't count toward target-failure
  thresholds (logged only) — wire to your own Telegram/Email/Matrix to
  get notified when polls/dispatches/sends fail
- 6 default templates (3 events × 2 locales), tracking config columns
  with backfill migration, frontend descriptor (excluded from "create
  provider" wizard since auto-managed)

Operator-visible behavior changes (call out in release notes):
- NOTIFY_BRIDGE_TELEGRAM_WEBHOOK_SECRET now REQUIRED for webhook mode
- Existing webhook providers with auth_mode="none" need explicit opt-in
- Generic webhook endpoint rate-limited 60/min per source IP
- HA disconnect/reconnect writes ha_status_* EventLog rows
- Every user gets a bridge_self provider — wire it to a target to
  receive failure alerts

Pre-existing test failures (test_ssrf, test_release_provider) on
Python 3.13 are unrelated; CI runs on 3.12.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 02:16:49 +03:00


# Operations Guide
This document covers running, monitoring, and recovering Notify Bridge in
production. The intended audience is the operator on call when the
notifications stop firing or when a release upgrade goes sideways.
For developer-focused docs (architecture, conventions, project layout) see
`CLAUDE.md` and the `.claude/docs/` directory.
## Deployment overview
Notify Bridge ships as a single Docker image. All state lives in a single
data directory mounted at `/data`.
### Required environment variables
| Variable | Default | Notes |
| --- | --- | --- |
| `NOTIFY_BRIDGE_SECRET_KEY` | _(none)_ | **Required.** 32+ random bytes. The server refuses to boot with the default placeholder or any of the known dev literals. |
| `NOTIFY_BRIDGE_CORS_ALLOWED_ORIGINS` | `http://localhost:5175` | Comma-separated list. `*` is rejected because credentials are enabled. |
| `NOTIFY_BRIDGE_FORWARDED_ALLOW_IPS` | `127.0.0.1` | Trusted proxy IPs whose `X-Forwarded-For` / `X-Forwarded-Proto` headers are honored. Set to your reverse-proxy IP. |
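
One way to produce a value for `NOTIFY_BRIDGE_SECRET_KEY` is sketched below. It assumes the server accepts any sufficiently long random string; the table above does not state a required encoding, and `generate_secret_key` is a hypothetical helper, not part of Notify Bridge.

```python
import secrets


def generate_secret_key(num_bytes: int = 32) -> str:
    """Return a URL-safe string built from `num_bytes` of OS-level randomness."""
    if num_bytes < 32:
        # The table above asks for 32+ random bytes.
        raise ValueError("use at least 32 random bytes")
    return secrets.token_urlsafe(num_bytes)


if __name__ == "__main__":
    # Paste the printed line into your environment file or secret store.
    print(f"NOTIFY_BRIDGE_SECRET_KEY={generate_secret_key()}")
```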
### Useful environment variables
| Variable | Default | Notes |
| --- | --- | --- |
| `NOTIFY_BRIDGE_DATA_DIR` | `/data` | Where the SQLite DB, snapshots, and backups live. |
| `NOTIFY_BRIDGE_DATABASE_URL` | _(derived from data_dir)_ | Override only if you want a non-default DB path. |
| `NOTIFY_BRIDGE_DEBUG` | `false` | Verbose logging + SQL echo. Do not enable in production. |
| `NOTIFY_BRIDGE_LOG_FORMAT` | `text` | Set to `json` for one JSON object per line — pipe to a log aggregator. |
| `NOTIFY_BRIDGE_LOG_LEVEL` | `INFO` | Root logger level. |
| `NOTIFY_BRIDGE_LOG_LEVELS` | _(empty)_ | Per-module overrides, e.g. `sqlalchemy.engine=WARNING,notify_bridge_core.notifications.telegram.client=DEBUG`. |
| `NOTIFY_BRIDGE_EVENT_LOG_RETENTION_DAYS` | `30` | Days of `event_log` history kept by the daily cleanup job. `0` disables retention. |
| `NOTIFY_BRIDGE_PRE_MIGRATE_SNAPSHOT_KEEP` | `5` | Number of pre-migration DB snapshots retained. `0` disables snapshotting. |
| `NOTIFY_BRIDGE_METRICS_ENABLED` | `true` | Expose `/api/metrics` for Prometheus. Set to `false` if the API port crosses a trust boundary. |
| `NOTIFY_BRIDGE_GRACEFUL_SHUTDOWN_SECONDS` | `60` | SIGTERM grace period before in-flight requests are killed. |
| `NOTIFY_BRIDGE_SUPERVISED` | _(auto)_ | Force the supervised flag for `apply-restart`. Use `true` when running under systemd/PM2 outside Docker. |
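
The `NOTIFY_BRIDGE_LOG_LEVELS` format (`module=LEVEL` pairs separated by commas) can be sanity-checked before deploying. The sketch below is an illustration of that format only; `parse_log_levels` and `apply_log_levels` are hypothetical names, and the server's own parser may treat edge cases differently.

```python
import logging


def parse_log_levels(raw: str) -> dict[str, int]:
    """Parse 'module=LEVEL,module=LEVEL' into {logger_name: numeric_level}."""
    overrides: dict[str, int] = {}
    for item in raw.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate an empty value and trailing commas
        name, sep, level_name = item.partition("=")
        # Standard level names (DEBUG, INFO, WARNING, ...) are int attributes.
        level = getattr(logging, level_name.strip().upper(), None)
        if not sep or not name.strip() or not isinstance(level, int):
            raise ValueError(f"bad log level override: {item!r}")
        overrides[name.strip()] = level
    return overrides


def apply_log_levels(overrides: dict[str, int]) -> None:
    """Set each named logger to its override level."""
    for name, level in overrides.items():
        logging.getLogger(name).setLevel(level)


if __name__ == "__main__":
    print(parse_log_levels("sqlalchemy.engine=WARNING"))
```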
### Data directory layout
```
/data/
  notify_bridge.db          # main SQLite DB (WAL mode)
  notify_bridge.db-wal      # SQLite write-ahead log
  notify_bridge.db-shm      # SQLite shared memory file
  backups/
    pre-migrate-*.db        # automatic pre-upgrade snapshots
    backup-*.json           # scheduled / manual config backups
  snapshots/                # legacy alias retained for older deployments
  pending_restore.json      # staged restore (consumed at next boot)
  applied_restores/         # archive of applied restore payloads
```
Always mount `/data` on a persistent volume. The WAL files MUST live on the
same filesystem as the main DB — never split them across mounts.
### Docker example
See `docker-compose.yml` at the repo root for the canonical reference. The
container runs read-only with `tmpfs` for `/tmp`, drops all capabilities,
and limits memory/CPU. The healthcheck targets `/api/ready` (deep) — see
the next section.
## Healthchecks
Two endpoints, used for different probe types.
### `GET /api/health` — liveness, shallow
Returns `200 OK` once the ASGI app has started. Does not touch the DB or
the scheduler. Use this for liveness probes that should only restart the
process if it stops responding entirely.
```json
{"status": "ok", "version": "0.8.0"}
```
### `GET /api/ready` — readiness, deep
Verifies that each critical dependency is reachable:
* **db**: `SELECT 1` against the SQLAlchemy engine, 2-second timeout.
* **scheduler**: APScheduler `running` flag.
* **ha**: Home Assistant subscription supervisor task. Reported as
  `na` when no HA providers are configured, `ok` when at least one
  supervisor is alive, `degraded` otherwise. **Informational only:**
  HA degradation does not flip readiness off.
Returns `503` when any required check (db, scheduler) fails.
```json
{
  "ready": true,
  "checks": {"db": "ok", "scheduler": "ok", "ha": "na"},
  "errors": [],
  "version": "0.8.0"
}
```
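
A script that consumes this payload (for example a deploy gate) could separate the required checks from the informational one roughly as follows. This is a minimal sketch based only on the response shape shown above; the helper names are hypothetical.

```python
# Per the text above, only db and scheduler gate readiness; "ha" is informational.
REQUIRED_CHECKS = ("db", "scheduler")


def failed_required_checks(payload: dict) -> list[str]:
    """Return the names of required checks that are not reported as 'ok'."""
    checks = payload.get("checks", {})
    return [name for name in REQUIRED_CHECKS if checks.get(name) != "ok"]


def is_ready(payload: dict) -> bool:
    """True when the server says it is ready and no required check failed."""
    return bool(payload.get("ready")) and not failed_required_checks(payload)


if __name__ == "__main__":
    sample = {
        "ready": True,
        "checks": {"db": "ok", "scheduler": "ok", "ha": "na"},
        "errors": [],
        "version": "0.8.0",
    }
    print(is_ready(sample), failed_required_checks(sample))
```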
### Kubernetes probe example
```yaml
livenessProbe:
  httpGet:
    path: /api/health
    port: 8420
  initialDelaySeconds: 10
  periodSeconds: 30
  timeoutSeconds: 5
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /api/ready
    port: 8420
  initialDelaySeconds: 15
  periodSeconds: 15
  timeoutSeconds: 5
  failureThreshold: 2
```
The Docker compose file uses `/api/ready` as its healthcheck so the
container is only reported healthy after migrations finish.
## Metrics
Notify Bridge exposes Prometheus metrics at `GET /api/metrics` in the
standard text exposition format. The endpoint requires **no authentication**,
so anyone who can reach the API port can scrape it. Disable it via
`NOTIFY_BRIDGE_METRICS_ENABLED=false` when the API port is reachable beyond
the trust boundary.
### Prometheus scrape example
```yaml
scrape_configs:
  - job_name: notify-bridge
    metrics_path: /api/metrics
    scrape_interval: 30s
    static_configs:
      - targets: ['notify-bridge.internal:8420']
```
### Available metrics
| Metric | Type | Labels | Meaning |
| --- | --- | --- | --- |
| `notify_bridge_deferred_pending` | Gauge | _(none)_ | Pending rows in `deferred_dispatch`. Refreshed on each scrape. A persistent non-zero value usually means a tracker target is in extended quiet hours. |
| `notify_bridge_event_log_total` | Counter | `status`, `event_type` | Events written to `event_log`. `status` is the dispatch outcome (`dispatched`, `dropped`, `deferred`, etc.). |
| `notify_bridge_dispatch_duration_seconds` | Histogram | `channel` | Wall-clock duration of one outbound dispatch (Telegram, Discord, email, …). Useful for latency alerts. |
| `notify_bridge_provider_poll_failures_total` | Counter | `provider_type` | Polling provider tick failures (Immich poll error, Gitea API down, …). Compare against expected scan interval to compute failure rate. |
| `notify_bridge_target_send_failures_total` | Counter | `target_type`, `status_code` | Failed sends to a notification channel. `status_code` is the HTTP status (or `0` when no HTTP response was received). |
`prometheus_client` is imported only in `api/metrics.py`. Other modules
record events through the `metrics` singleton; see that module's docstring
before adding new collectors.
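
For ad-hoc checks without a Prometheus server, the text exposition format is simple enough to read by hand. The sketch below pulls the samples of one metric out of a scrape body; `read_samples` is a hypothetical helper, the sample body and its values are made up for illustration, and the parser assumes sample lines carry no trailing timestamp.

```python
def read_samples(exposition: str, metric: str) -> dict[str, float]:
    """Map each sample line of `metric` (name plus label set) to its value."""
    samples: dict[str, float] = {}
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The value is the last space-separated field (no timestamp assumed).
        series, _, value = line.rpartition(" ")
        name = series.split("{", 1)[0]
        if name == metric:
            samples[series] = float(value)
    return samples


if __name__ == "__main__":
    # Made-up scrape body; real output comes from GET /api/metrics.
    body = (
        "# TYPE notify_bridge_deferred_pending gauge\n"
        "notify_bridge_deferred_pending 3.0\n"
        'notify_bridge_target_send_failures_total{status_code="0",target_type="telegram"} 2.0\n'
    )
    print(read_samples(body, "notify_bridge_deferred_pending"))
```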
## Backups
Notify Bridge produces three different kinds of backup files. Know which
one you are looking at before restoring.
| Kind | Location | Format | Trigger |
| --- | --- | --- | --- |
| Config backup | `data/backups/backup-*.json` | JSON (BackupFile schema) | Manual via `/api/backup/files` POST or scheduled job |
| Pre-migration snapshot | `data/backups/pre-migrate-*.db` | SQLite DB file | Automatic on every boot before migrations |
| Pending restore | `data/pending_restore.json` | JSON | Staged via `/api/backup/prepare-restore`, consumed at next restart |
Config backups capture user configuration (providers, trackers, targets,
templates, …). They do **not** include `event_log`, `deferred_dispatch`,
or any other operational table. Pre-migration snapshots are full DB
copies and contain everything.
### Manual backup
The admin UI has a one-click button under Settings → Backup. Equivalent
HTTP call:
```bash
curl -fsS -X POST \
-H "Authorization: Bearer $ADMIN_JWT" \
"https://notify-bridge.example.com/api/backup/files?secrets_mode=exclude"
```
The export endpoint returns the backup as a downloadable JSON envelope.
Secrets are omitted unless `secrets_mode=include` is passed:
```bash
curl -fsS -X GET \
-H "Authorization: Bearer $ADMIN_JWT" \
-OJ "https://notify-bridge.example.com/api/backup/export?secrets_mode=exclude"
```
### Scheduled backup
Configure under Settings → Backup or via `PUT /api/backup/scheduled` with:
```json
{
"backup_scheduled_enabled": "true",
"backup_scheduled_interval_hours": "24",
"backup_secrets_mode": "exclude",
"backup_retention_count": "5"
}
```
Saved files land in `data/backups/`; retention prunes the oldest files
beyond `backup_retention_count`. Backups can be downloaded individually:
```bash
curl -fsS -X GET \
-H "Authorization: Bearer $ADMIN_JWT" \
"https://notify-bridge.example.com/api/backup/files/backup-2026-05-16T12-00-00.json" \
-o backup-latest.json
```
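
The retention rule described above (keep the newest `backup_retention_count` files, prune the rest) can be pictured with the sketch below. It is an illustration and not the server's implementation: `files_to_prune` is a hypothetical helper, and it assumes the timestamp in the file name sorts chronologically, as in the `backup-2026-05-16T12-00-00.json` example.

```python
def files_to_prune(filenames: list[str], retention_count: int) -> list[str]:
    """Return the config backups that fall outside the newest `retention_count`."""
    if retention_count < 1:
        # How the server treats 0 or negative values is not covered here.
        raise ValueError("retention_count must be at least 1 in this sketch")
    backups = sorted(
        name for name in filenames
        if name.startswith("backup-") and name.endswith(".json")
    )
    # Sorted oldest to newest, so everything before the last N is pruned.
    return backups[:-retention_count]


if __name__ == "__main__":
    names = [
        "backup-2026-05-14T12-00-00.json",
        "backup-2026-05-15T12-00-00.json",
        "backup-2026-05-16T12-00-00.json",
        "pre-migrate-2026-05-16T11-58-30.db",  # ignored: not a config backup
    ]
    print(files_to_prune(names, 2))
```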
### Cron snippet for off-host backup
```bash
# /etc/cron.d/notify-bridge-backup
# cron does not support backslash line continuation, so the job stays on one line.
0 3 * * * www-data curl -fsS -X POST -H "Authorization: Bearer $(cat /etc/notify-bridge/admin.token)" "https://notify-bridge.example.com/api/backup/files?secrets_mode=exclude" -o /var/backups/notify-bridge/backup-$(date +\%F).json
```
### Restore procedure
Restoring REPLACES configuration. Always export the current state first.
```bash
# 1. Stage the backup file (validates and writes to data/pending_restore.json)
curl -fsS -X POST \
-H "Authorization: Bearer $ADMIN_JWT" \
-F "file=@backup-2026-05-16T12-00-00.json" \
"https://notify-bridge.example.com/api/backup/prepare-restore?conflict_mode=overwrite"
# 2. Trigger graceful restart so startup applies the staged restore.
# Same-origin Origin/Referer is enforced — call from the admin UI when
# possible, or from the same host. Requires the supervisor to respawn
# the process (Docker restart policy, systemd, PM2, etc.).
curl -fsS -X POST \
-H "Origin: https://notify-bridge.example.com" \
-H "Referer: https://notify-bridge.example.com/settings/backup" \
-H "Authorization: Bearer $ADMIN_JWT" \
"https://notify-bridge.example.com/api/backup/apply-restart"
```
If the process is **not** supervised, `/api/backup/apply-restart` returns
`409`. Restart the backend manually after staging — startup applies the
pending restore on the next boot.
To cancel a staged restore before applying:
```bash
curl -fsS -X DELETE \
-H "Authorization: Bearer $ADMIN_JWT" \
"https://notify-bridge.example.com/api/backup/pending-restore"
```
### Recovery from a corrupted DB
If migrations crash on boot or the DB file is unreadable, roll back to a
pre-migration snapshot:
```bash
# Stop the backend, then
cd /var/lib/docker/volumes/notify-bridge-data/_data
ls -1t backups/pre-migrate-*.db | head -5 # pick the snapshot
cp notify_bridge.db notify_bridge.db.broken # keep the broken DB for forensics
cp backups/pre-migrate-2026-05-16T11-58-30.db notify_bridge.db
rm -f notify_bridge.db-wal notify_bridge.db-shm # WAL belongs to the broken file
```
Restart the container. The startup snapshot will run again and capture
the rolled-back state, so you have a clean recovery point if the next
boot needs another rollback.
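
Before copying a snapshot over the live DB, it may be worth confirming that the snapshot itself is readable. The sketch below runs SQLite's `PRAGMA integrity_check` on a read-only connection; `snapshot_is_intact` is a hypothetical helper, not a Notify Bridge command. Opening a SQLite file can create sidecar files next to it in some journal modes, so consider checking a copy.

```python
import sqlite3
from pathlib import Path


def snapshot_is_intact(path: Path) -> bool:
    """Return True when PRAGMA integrity_check reports a single 'ok' row."""
    # mode=ro opens the file read-only and fails if it does not exist.
    uri = f"{path.resolve().as_uri()}?mode=ro"
    try:
        conn = sqlite3.connect(uri, uri=True)
        try:
            rows = conn.execute("PRAGMA integrity_check").fetchall()
        finally:
            conn.close()
    except sqlite3.DatabaseError:
        # Covers missing files, non-database files, and unreadable pages.
        return False
    return rows == [("ok",)]


if __name__ == "__main__":
    import sys

    for arg in sys.argv[1:]:
        print(arg, "ok" if snapshot_is_intact(Path(arg)) else "FAILED")
```

Run it with one or more snapshot paths as arguments and pick the newest snapshot that reports `ok`.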
## Logs
* Output goes to **stderr only**. The Docker log driver captures it.
* Set `NOTIFY_BRIDGE_LOG_FORMAT=json` for line-delimited JSON suitable
for Loki, ELK, or CloudWatch.
* Secret values (bot tokens, API keys, passwords) are masked at the log
formatter level — see `notify_bridge_server.logging_setup`.
* No file rotation is built in. Use the Docker JSON log driver's
`max-size`/`max-file` options or send logs to your aggregator.
```yaml
# docker-compose.yml snippet
logging:
  driver: json-file
  options:
    max-size: "10m"
    max-file: "5"
```
## Common operational scenarios
### "Notifications stopped firing"
1. Hit `/api/ready`. If `scheduler` is `fail`, restart the backend; the
   scheduler died in a way it cannot recover from.
2. Check `notify_bridge_deferred_pending`. A non-zero value during quiet
   hours is normal; a value that grows monotonically across days is a
   bug, so inspect the `deferred_dispatch` table.
3. Inspect the most recent `event_log` rows in the admin Events page or with:
   ```sql
   SELECT created_at, event_type, dispatch_status, details
   FROM event_log
   ORDER BY created_at DESC LIMIT 50;
   ```
   Look for a `dispatch_status` other than `dispatched`.
4. If a single tracker is silent, verify the provider's last poll status
   in the admin UI (Providers page). `notify_bridge_provider_poll_failures_total`
   tells you which provider type is failing.
5. If you've configured a `bridge_self` tracker but never received a
   self-monitoring alert when something failed, see the next section:
   `bridge_self` failures are deliberately log-only to prevent recursion.
### Bridge self-monitoring is log-only on its own failures
The built-in `bridge_self` provider emits notifications when polls,
dispatches, or target sends fail. To prevent infinite recursion (a failed
`bridge_self` notification triggering another `bridge_self` notification,
and so on), failures of `bridge_self` events themselves are **not** counted
toward target-failure thresholds and are logged only.
If your `bridge_self` notifications stop arriving, it means the
notification target you wired them to is itself failing. Grep stderr for:
```text
bridge_self target-failure emission failed
emit_bridge_self_event failed
```
The fix is always at the target layer (Telegram bot blocked, Matrix
homeserver down, SMTP credentials rotated). The bridge cannot tell you
about its own outbound failure — that's what the operator's external
monitoring (Prometheus alert on `notify_bridge_target_send_failures_total`)
is for.
### "Webhook returns 500"
Inspect the `webhook_payload_log` table for the matching request:
```sql
SELECT received_at, status_code, error_message, payload_excerpt
FROM webhook_payload_log
ORDER BY received_at DESC LIMIT 20;
```
Common causes: payload schema change in the source service, a tracker
referencing a deleted provider, a Jinja template that errors out (look
for `template render failed` in logs).
### "Telegram bot rate-limited (429)"
The Telegram client implements exponential backoff with jitter on
`Retry-After`. No operator action is required for transient throttling.
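
To make the behavior concrete, the sketch below shows one common way to combine a server-supplied `Retry-After` with exponential backoff and jitter. It is not the Telegram client's actual algorithm: `retry_delay` and its `base`, `cap`, and `jitter` parameters are assumptions chosen for illustration.

```python
import random
from typing import Optional


def retry_delay(
    attempt: int,
    retry_after: Optional[float] = None,
    base: float = 1.0,
    cap: float = 60.0,
    jitter: float = 0.25,
    rng: Optional[random.Random] = None,
) -> float:
    """Seconds to wait before retry number `attempt` (0-based)."""
    rng = rng or random.Random()
    # Exponential growth, capped so late attempts do not wait forever.
    backoff = min(cap, base * (2 ** attempt))
    # Never retry sooner than the server asked for.
    floor = max(backoff, retry_after or 0.0)
    # Add up to `jitter` (25% by default) so parallel senders spread out.
    return floor + rng.uniform(0.0, floor * jitter)


if __name__ == "__main__":
    for attempt in range(5):
        print(attempt, round(retry_delay(attempt, retry_after=3), 2))
```

Treating `Retry-After` as a lower bound means the jitter can only lengthen the wait, never shorten it below what the server requested.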
If the rate limit persists, check whether:
* Multiple Notify Bridge instances are driving the bot and pointing at
  the same chat (split-brain; only one instance should own a bot).
* A template is producing very large messages (Telegram limits message
  size to 4096 characters). Look for `MessageTooLong` in the logs.
### "DB lock contention"
SQLite WAL mode and `busy_timeout=10000` make this rare. If you see
`SQLITE_BUSY` in logs:
* Check for long-running transactions (most often a stuck migration).
* Confirm the WAL files are on the same filesystem as the main DB —
splitting them across mounts is a known cause.
* Run `sqlite3 notify_bridge.db "PRAGMA wal_checkpoint(TRUNCATE);"` to
flush the WAL. Safe to run while the backend is up.
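
The same checkpoint can be issued from Python's standard `sqlite3` module, which also makes the result row easy to inspect. This is a minimal sketch; `truncate_wal` is a hypothetical helper, and the 10000 ms timeout mirrors the `busy_timeout` value quoted above.

```python
import sqlite3


def truncate_wal(db_path: str, busy_timeout_ms: int = 10000) -> tuple[int, int, int]:
    """Run PRAGMA wal_checkpoint(TRUNCATE) and return its result row.

    The row is (busy, log_frames, checkpointed_frames); busy == 0 means the
    checkpoint was not blocked by another connection.
    """
    conn = sqlite3.connect(db_path)
    try:
        # Wait for competing locks instead of failing immediately.
        conn.execute(f"PRAGMA busy_timeout={int(busy_timeout_ms)}")
        row = conn.execute("PRAGMA wal_checkpoint(TRUNCATE)").fetchone()
    finally:
        conn.close()
    return row[0], row[1], row[2]


if __name__ == "__main__":
    import sys

    busy, log_frames, checkpointed = truncate_wal(sys.argv[1])
    print("blocked" if busy else "checkpoint completed", log_frames, checkpointed)
```

A non-zero `busy` value suggests another connection is holding the WAL open, which points back at the long-running transaction case in the list above.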
## Upgrades
1. Pre-migration snapshot is taken automatically before any migration
runs. The latest five snapshots are retained by default.
2. Migrations are idempotent — re-running an upgrade is safe.
3. If a migration fails, the snapshot from step 1 is the recovery point.
See "Recovery from a corrupted DB" above.
4. Always test major version upgrades in staging first. The upgrade flow
is the same in staging: pull the new image, restart the container.
The release tag stream lives at the project Gitea / GitHub releases page.
Release notes are written to `RELEASE_NOTES.md` for the upcoming version
and copied into the Gitea release body by the `release.yml` workflow.