Comprehensive multi-area pass driven by a parallel 8-agent production
review. Covers frontend, backend, database, security, performance, and
operations, plus a new self-monitoring feature.
## Critical fixes
- Planka webhook: reads bounded raw body (was NameError on every call)
- HA quiet hours: ha_state_changed/automation_triggered/service_called/
event_fired added to deferrable set (were silently dropped)
- DNS-rebinding SSRF: PinnedResolver wired into shared aiohttp session
- Telegram inbound webhook: secret now mandatory (401 without)
- Generic webhook: auth_mode="none" requires explicit
acknowledge_unauthenticated=true; per-IP rate limit 60/min
- svelte-check: 5 null-narrowing errors in EventDetailModal fixed
- Provider hardcoding: Immich-only block extracted to descriptor
featureDiscoveryHint
- command_sync: snapshot+expunge bot before exiting AsyncSession
## Bug fixes
- notifier asyncio.gather(return_exceptions=True) — one bad chat no longer
cancels peer sends
- NotificationDispatcher hoisted out of per-tracker loop
- Provider credential resolution unified across all 5 dispatch sites
- HA asyncio.shield now drains inner task on cancellation
- Provider construction switched from if/elif ladder to factory registry
- NUT first poll seeds silently (no spurious ups_on_battery)
- Quiet-hours gate: event-type-disabled now wins over deferral
- APScheduler drain job ID resolution upgraded to seconds
- HA on_status_change wired through to EventLog
- Webhook payload rollback failures now logged (not swallowed)
- Batched receivers/chats/bots in load_link_data (was per-target N+1)
- flag_modified on JSON column reassignments in deferred_dispatch
## Database
- UNIQUE indexes on service_provider.webhook_token,
telegram_bot.webhook_path_id, partial UNIQUE on telegram_bot.bot_id,
telegram_chat(bot_id, chat_id), notification_tracker_target unique link,
partial UNIQUE on bridge_self provider per user
- Composite ix_event_log_user_event_type_created index
- save_chat_from_webhook switched to ON CONFLICT DO UPDATE
- ondelete=CASCADE on user-id FKs (model annotation; app-side cascade
delete added for existing data)
- delete_notification_tracker converted from N+1 to bulk DELETE/UPDATE
- Module-level asyncio.Lock replaced with lazy _get_lock() pattern
- VACUUM INTO snapshot now PRAGMA integrity_check verified
## Performance
- Jinja2 template compilation LRU cached (lru_cache maxsize=512)
- Per-locale render cache in NotificationDispatcher (skips re-rendering
identical content for receivers sharing a locale)
- Tracker list cached per provider_id with 5s TTL + explicit invalidation
on tracker CRUD (relieves HA chat-bus rate query pressure)
- Nav-counts collapsed from 16 round-trips to single UNION ALL
- HA event_log: skip persisting empty assets_added/removed events
## Security hardening
- Mass-assignment guard on Action create/update; cron sub-minute reject
- Backup JSON depth/node-count cap (depth ≤ 10, nodes ≤ 100k)
- _sanitize_config extended to all JSON-typed fields on backup import
- Telegram _safe_get walks redirects manually with SSRF revalidation
- Bcrypt 72-byte password length cap with clear 422
- Webhook payload body redaction; sensitive substring set extended with
oauth/client_secret/webhook_secret/csrf in both header filter and
template extras filter
## Frontend
- 76 catch (err: any) sites converted to errMsg(err) helper
- globalProviderFilter: pure getter; reconciliation moved to one-time
$effect in +layout
- Provider-filter binding: removed paired $effects + _syncingFilter flag,
now one-way derived
- entity-cache: separate _refreshing flag for background re-fetches
- api.ts 401 handling: AuthRedirectError class + dedup _redirecting flag,
goto() instead of window.location.href
- a11y: aria-expanded on mobile More, role=switch + aria-checked on
Telegram bot toggles
## Tests & operations
- CI pytest gate added to .gitea/workflows/build.yml + release.yml
(wheel-built install to dodge editable-install slowness)
- /api/ready upgraded to deep healthcheck (db SELECT 1, scheduler.running,
HA supervisor presence) returning {ready, checks, errors, version}
- /api/metrics endpoint with prometheus_client (deferred_pending,
event_log_total, dispatch_duration, poll_failures, send_failures)
- New OPERATIONS.md covering deploy, healthchecks, metrics, backup/restore
procedures, log handling, common scenarios, upgrade flow
- New tests: test_bridge_self (11), test_gitea_parser (9),
test_planka_parser (6), test_immich_change_detector (6),
test_backup_roundtrip (1)
## New feature: bridge self-monitoring
- New bridge_self provider type — internal sink for bridge health events
- Three event types: bridge_self_poll_failures (consecutive tracker poll
failures), bridge_self_deferred_backlog (pending count crosses
threshold), bridge_self_target_failures (consecutive 5xx/network
failures per target)
- Per-user thresholds (defaults: 3 / 100 / 5) configurable via the
provider config form
- Auto-seeded on user create + /setup + boot backfill for existing users
- Anti-spam: counters reset after emission; backlog uses transition latch
- Self-loop guard: bridge_self failures don't count toward target-failure
thresholds (logged only) — wire to your own Telegram/Email/Matrix to
get notified when polls/dispatches/sends fail
- 6 default templates (3 events × 2 locales), tracking config columns
with backfill migration, frontend descriptor (excluded from "create
provider" wizard since auto-managed)
Operator-visible behavior changes (call out in release notes):
- NOTIFY_BRIDGE_TELEGRAM_WEBHOOK_SECRET now REQUIRED for webhook mode
- Existing webhook providers with auth_mode="none" need explicit opt-in
- Generic webhook endpoint rate-limited 60/min per source IP
- HA disconnect/reconnect writes ha_status_* EventLog rows
- Every user gets a bridge_self provider — wire it to a target to
receive failure alerts
Pre-existing test failures (test_ssrf, test_release_provider) on
Python 3.13 are unrelated; CI runs on 3.12.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Operations Guide
This document covers running, monitoring, and recovering Notify Bridge in production. The intended audience is the operator on call when the notifications stop firing or when a release upgrade goes sideways.
For developer-focused docs (architecture, conventions, project layout) see
CLAUDE.md and the .claude/docs/ directory.
Deployment overview
Notify Bridge ships as a single Docker image. All state lives in a single
data directory mounted at /data.
Required environment variables
| Variable | Default | Notes |
|---|---|---|
| NOTIFY_BRIDGE_SECRET_KEY | (none) | Required. 32+ random bytes. The server refuses to boot with the default placeholder or any of the known dev literals. |
| NOTIFY_BRIDGE_CORS_ALLOWED_ORIGINS | http://localhost:5175 | Comma-separated list. * is rejected because credentials are enabled. |
| NOTIFY_BRIDGE_FORWARDED_ALLOW_IPS | 127.0.0.1 | Trusted proxy IPs whose X-Forwarded-For / X-Forwarded-Proto headers are honored. Set to your reverse-proxy IP. |
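One way to produce a value for NOTIFY_BRIDGE_SECRET_KEY is Python's standard secrets module. This is a minimal sketch: the helper name generate_secret_key is hypothetical, and it assumes the server accepts a URL-safe text encoding of the random bytes, which the text above does not state.

```python
import secrets


def generate_secret_key(num_bytes: int = 48) -> str:
    """Return a URL-safe string backed by `num_bytes` of OS randomness.

    Assumption: the server accepts URL-safe base64 text as the key.
    """
    if num_bytes < 32:
        # The guide asks for 32+ random bytes.
        raise ValueError("use at least 32 random bytes")
    return secrets.token_urlsafe(num_bytes)


if __name__ == "__main__":
    print(generate_secret_key())
```

Run it once, store the output in your secret manager, and pass it to the container as NOTIFY_BRIDGE_SECRET_KEY.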
Useful environment variables
| Variable | Default | Notes |
|---|---|---|
| NOTIFY_BRIDGE_DATA_DIR | /data | Where the SQLite DB, snapshots, and backups live. |
| NOTIFY_BRIDGE_DATABASE_URL | (derived from data_dir) | Override only if you want a non-default DB path. |
| NOTIFY_BRIDGE_DEBUG | false | Verbose logging + SQL echo. Do not enable in production. |
| NOTIFY_BRIDGE_LOG_FORMAT | text | Set to json for one JSON object per line; pipe to a log aggregator. |
| NOTIFY_BRIDGE_LOG_LEVEL | INFO | Root logger level. |
| NOTIFY_BRIDGE_LOG_LEVELS | (empty) | Per-module overrides, e.g. sqlalchemy.engine=WARNING,notify_bridge_core.notifications.telegram.client=DEBUG. |
| NOTIFY_BRIDGE_EVENT_LOG_RETENTION_DAYS | 30 | Days of event_log history kept by the daily cleanup job. 0 disables retention. |
| NOTIFY_BRIDGE_PRE_MIGRATE_SNAPSHOT_KEEP | 5 | Number of pre-migration DB snapshots retained. 0 disables snapshotting. |
| NOTIFY_BRIDGE_METRICS_ENABLED | true | Expose /api/metrics for Prometheus. Set to false if the API port crosses a trust boundary. |
| NOTIFY_BRIDGE_GRACEFUL_SHUTDOWN_SECONDS | 60 | SIGTERM grace period before in-flight requests are killed. |
| NOTIFY_BRIDGE_SUPERVISED | (auto) | Force the supervised flag for apply-restart. Use true when running under systemd/PM2 outside Docker. |
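To make the NOTIFY_BRIDGE_LOG_LEVELS format concrete, here is a rough sketch of how a module=LEVEL,module=LEVEL string could be parsed and checked before deploying it. The function parse_log_levels is hypothetical and is not the server's own parser; it only relies on the standard logging module.

```python
import logging


def parse_log_levels(spec: str) -> dict[str, int]:
    """Parse 'module=LEVEL,module=LEVEL' into {logger_name: numeric_level}."""
    overrides: dict[str, int] = {}
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate an empty value and trailing commas
        name, sep, level_name = item.partition("=")
        # getLevelName maps a known level name to its number and returns
        # a "Level X" string for anything it does not recognise.
        level = logging.getLevelName(level_name.strip().upper())
        if not sep or not name.strip() or not isinstance(level, int):
            raise ValueError(f"bad log level override: {item!r}")
        overrides[name.strip()] = level
    return overrides
```

A caller could then apply each entry with logging.getLogger(name).setLevel(level).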
Data directory layout
/data/
  notify_bridge.db          # main SQLite DB (WAL mode)
  notify_bridge.db-wal      # SQLite write-ahead log
  notify_bridge.db-shm      # SQLite shared memory file
  backups/
    pre-migrate-*.db        # automatic pre-upgrade snapshots
    backup-*.json           # scheduled / manual config backups
  snapshots/                # legacy alias retained for older deployments
  pending_restore.json      # staged restore (consumed at next boot)
  applied_restores/         # archive of applied restore payloads
Always mount /data on a persistent volume. The WAL files MUST live on the
same filesystem as the main DB — never split them across mounts.
Docker example
See docker-compose.yml at the repo root for the canonical reference. The
container runs read-only with tmpfs for /tmp, drops all capabilities,
and limits memory/CPU. The healthcheck targets /api/ready (deep) — see
the next section.
Healthchecks
Two endpoints, used for different probe types.
GET /api/health — liveness, shallow
Returns 200 OK once the ASGI app has started. Does not touch the DB or
the scheduler. Use this for liveness probes that should only restart the
process if it stops responding entirely.
{"status": "ok", "version": "0.8.0"}
GET /api/ready — readiness, deep
Verifies that each critical dependency is reachable:
- db: SELECT 1 against the SQLAlchemy engine, 2-second timeout.
- scheduler: APScheduler running flag.
- ha: Home Assistant subscription supervisor task. Reported as na when no HA providers are configured, ok when at least one supervisor is alive, degraded otherwise. Informational only; HA degradation does not flip readiness off.
Returns 503 when any required check (db, scheduler) fails.
{
"ready": true,
"checks": {"db": "ok", "scheduler": "ok", "ha": "na"},
"errors": [],
"version": "0.8.0"
}
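For scripts that consume this payload (a deploy gate, for example), the rule above can be sketched as follows. The helper summarize_readiness is hypothetical; it treats only db and scheduler as required, as described above, and ignores the ha value.

```python
REQUIRED_CHECKS = ("db", "scheduler")  # ha is informational only


def summarize_readiness(payload: dict) -> tuple[bool, list[str]]:
    """Return (ready, failed_required_checks) for an /api/ready body."""
    checks = payload.get("checks", {})
    # A required check counts as failed if it is missing or not "ok".
    failed = [name for name in REQUIRED_CHECKS if checks.get(name) != "ok"]
    ready = bool(payload.get("ready")) and not failed
    return ready, failed
```

A degraded ha check therefore leaves the result ready, while a failing scheduler does not.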
Kubernetes probe example
livenessProbe:
  httpGet:
    path: /api/health
    port: 8420
  initialDelaySeconds: 10
  periodSeconds: 30
  timeoutSeconds: 5
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /api/ready
    port: 8420
  initialDelaySeconds: 15
  periodSeconds: 15
  timeoutSeconds: 5
  failureThreshold: 2
The Docker compose file uses /api/ready as its healthcheck so the
container is only reported healthy after migrations finish.
Metrics
Notify Bridge exposes Prometheus metrics at GET /api/metrics in the
standard text exposition format. The endpoint is served without
authentication, so a plain Prometheus scrape job works unchanged and
anyone who can reach the API port can read it. Disable it via
NOTIFY_BRIDGE_METRICS_ENABLED=false when the API port is reachable from
outside the trust boundary.
Prometheus scrape example
scrape_configs:
  - job_name: notify-bridge
    metrics_path: /api/metrics
    static_configs:
      - targets: ['notify-bridge.internal:8420']
    scrape_interval: 30s
Available metrics
| Metric | Type | Labels | Meaning |
|---|---|---|---|
| notify_bridge_deferred_pending | Gauge | (none) | Pending rows in deferred_dispatch. Refreshed on each scrape. A persistent non-zero value usually means a tracker target is in extended quiet hours. |
| notify_bridge_event_log_total | Counter | status, event_type | Events written to event_log. status is the dispatch outcome (dispatched, dropped, deferred, etc.). |
| notify_bridge_dispatch_duration_seconds | Histogram | channel | Wall-clock duration of one outbound dispatch (Telegram, Discord, email, …). Useful for latency alerts. |
| notify_bridge_provider_poll_failures_total | Counter | provider_type | Polling provider tick failures (Immich poll error, Gitea API down, …). Compare against expected scan interval to compute failure rate. |
| notify_bridge_target_send_failures_total | Counter | target_type, status_code | Failed sends to a notification channel. status_code is the HTTP status (or 0 when no HTTP response was received). |
prometheus_client is imported only in api/metrics.py. Other modules
record events through the metrics singleton; see that module's
docstring before adding new collectors.
Backups
Notify Bridge produces three different kinds of backup files. Know which one you are looking at before restoring.
| Kind | Location | Format | Trigger |
|---|---|---|---|
| Config backup | data/backups/backup-*.json | JSON (BackupFile schema) | Manual via POST /api/backup/files or scheduled job |
| Pre-migration snapshot | data/backups/pre-migrate-*.db | SQLite DB file | Automatic on every boot before migrations |
| Pending restore | data/pending_restore.json | JSON | Staged via /api/backup/prepare-restore, consumed at next restart |
Config backups capture user configuration (providers, trackers, targets,
templates, …). They do not include event_log, deferred_dispatch,
or any other operational table. Pre-migration snapshots are full DB
copies and contain everything.
Manual backup
The admin UI has a one-click button under Settings → Backup. Equivalent HTTP call:
curl -fsS -X POST \
-H "Authorization: Bearer $ADMIN_JWT" \
"https://notify-bridge.example.com/api/backup/files?secrets_mode=exclude"
The download endpoint produces a downloadable JSON envelope with no
secrets unless secrets_mode=include is passed:
curl -fsS -X GET \
-H "Authorization: Bearer $ADMIN_JWT" \
-OJ "https://notify-bridge.example.com/api/backup/export?secrets_mode=exclude"
Scheduled backup
Configure under Settings → Backup or via PUT /api/backup/scheduled with:
{
"backup_scheduled_enabled": "true",
"backup_scheduled_interval_hours": "24",
"backup_secrets_mode": "exclude",
"backup_retention_count": "5"
}
Saved files land in data/backups/; retention prunes the oldest files
beyond backup_retention_count. Backups can be downloaded individually:
curl -fsS -X GET \
-H "Authorization: Bearer $ADMIN_JWT" \
"https://notify-bridge.example.com/api/backup/files/backup-2026-05-16T12-00-00.json" \
-o backup-latest.json
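The retention rule mentioned above (prune the oldest files beyond backup_retention_count) could look roughly like the sketch below. This is not the server's implementation: prune_backups is a hypothetical helper, it assumes the timestamp in the file name sorts chronologically, and it treats a non-positive keep value as "prune nothing".

```python
from pathlib import Path


def prune_backups(backup_dir: Path, keep: int) -> list[Path]:
    """Delete all but the newest `keep` backup-*.json files.

    Returns the paths that were removed, oldest first.
    """
    if keep <= 0:
        return []  # assumption: never delete everything by accident
    # File names embed an ISO-like timestamp, so name order is age order.
    files = sorted(backup_dir.glob("backup-*.json"), key=lambda p: p.name)
    doomed = files[:-keep]
    for path in doomed:
        path.unlink()
    return doomed
```

Only backup-*.json files are matched, so pre-migrate-*.db snapshots in the same directory are left alone.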
Cron snippet for off-host backup
# /etc/cron.d/notify-bridge-backup
# cron has no backslash line continuation, so the whole entry stays on one line.
0 3 * * * www-data curl -fsS -X POST -H "Authorization: Bearer $(cat /etc/notify-bridge/admin.token)" "https://notify-bridge.example.com/api/backup/files?secrets_mode=exclude" -o /var/backups/notify-bridge/backup-$(date +\%F).json
Restore procedure
Restoring REPLACES configuration. Always export the current state first.
# 1. Stage the backup file (validates and writes to data/pending_restore.json)
curl -fsS -X POST \
-H "Authorization: Bearer $ADMIN_JWT" \
-F "file=@backup-2026-05-16T12-00-00.json" \
"https://notify-bridge.example.com/api/backup/prepare-restore?conflict_mode=overwrite"
# 2. Trigger graceful restart so startup applies the staged restore.
# Same-origin Origin/Referer is enforced — call from the admin UI when
# possible, or from the same host. Requires the supervisor to respawn
# the process (Docker restart policy, systemd, PM2, etc.).
curl -fsS -X POST \
-H "Origin: https://notify-bridge.example.com" \
-H "Referer: https://notify-bridge.example.com/settings/backup" \
-H "Authorization: Bearer $ADMIN_JWT" \
"https://notify-bridge.example.com/api/backup/apply-restart"
If the process is not supervised, /api/backup/apply-restart returns
409. Restart the backend manually after staging — startup applies the
pending restore on the next boot.
To cancel a staged restore before applying:
curl -fsS -X DELETE \
-H "Authorization: Bearer $ADMIN_JWT" \
"https://notify-bridge.example.com/api/backup/pending-restore"
Recovery from a corrupted DB
If migrations crash on boot or the DB file is unreadable, roll back to a pre-migration snapshot:
# Stop the backend, then
cd /var/lib/docker/volumes/notify-bridge-data/_data
ls -1t backups/pre-migrate-*.db | head -5 # pick the snapshot
cp notify_bridge.db notify_bridge.db.broken # keep the broken DB for forensics
cp backups/pre-migrate-2026-05-16T11-58-30.db notify_bridge.db
rm -f notify_bridge.db-wal notify_bridge.db-shm # WAL belongs to the broken file
Restart the container. The startup snapshot will run again and capture the rolled-back state, so you have a clean recovery point if the next boot needs another rollback.
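Before copying a snapshot over the live DB, it can be worth confirming that the snapshot itself is readable. The sketch below uses Python's standard sqlite3 module and PRAGMA integrity_check; snapshot_is_intact is a hypothetical helper, and it opens the file read-only so the snapshot is not modified.

```python
import sqlite3
from pathlib import Path


def snapshot_is_intact(path: Path) -> bool:
    """Return True if SQLite reports the file as a healthy database."""
    uri = path.resolve().as_uri() + "?mode=ro"  # read-only open
    try:
        conn = sqlite3.connect(uri, uri=True)
    except sqlite3.Error:
        return False  # missing or unopenable file
    try:
        rows = conn.execute("PRAGMA integrity_check").fetchall()
    except sqlite3.DatabaseError:
        return False  # not a database, or too damaged to scan
    finally:
        conn.close()
    # A healthy database yields a single row containing "ok".
    return rows == [("ok",)]
```

Run it against the chosen backups/pre-migrate-*.db file while the backend is stopped, and fall back to an older snapshot if it returns False.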
Logs
- Output goes to stderr only. The Docker log driver captures it.
- Set NOTIFY_BRIDGE_LOG_FORMAT=json for line-delimited JSON suitable for Loki, ELK, or CloudWatch.
- Secret values (bot tokens, API keys, passwords) are masked at the log formatter level; see notify_bridge_server.logging_setup.
- No file rotation is built in. Use the Docker JSON log driver's max-size/max-file options or send logs to your aggregator.
# docker-compose.yml snippet
logging:
  driver: json-file
  options:
    max-size: "10m"
    max-file: "5"
Common operational scenarios
"Notifications stopped firing"
- Hit /api/ready. If scheduler is fail, restart the backend; the scheduler died in a way it cannot recover from.
- Check notify_bridge_deferred_pending. A non-zero value during quiet hours is normal; a value that grows monotonically across days is a bug. Inspect the deferred_dispatch table.
- Inspect the most recent event_log rows in the admin Events page or with:
  SELECT created_at, event_type, dispatch_status, details FROM event_log ORDER BY created_at DESC LIMIT 50;
  Look for a dispatch_status other than dispatched.
- If a single tracker is silent, verify the provider's last poll status in the admin UI (Providers page). notify_bridge_provider_poll_failures_total tells you which provider type is failing.
- If you've configured a bridge_self tracker but never received a self-monitoring alert when something failed, see the next section: bridge_self failures are deliberately log-only to prevent recursion.
Bridge self-monitoring is log-only on its own failures
The built-in bridge_self provider emits notifications when polls,
dispatches, or target sends fail. To prevent infinite recursion (a
bridge_self notification failing, which triggers another bridge_self
notification, and so on), failures of bridge_self events themselves are
not counted toward target-failure thresholds and are logged only.
If your bridge_self notifications stop arriving, it means the
notification target you wired them to is itself failing. Grep stderr for:
bridge_self target-failure emission failed
emit_bridge_self_event failed
The fix is always at the target layer (Telegram bot blocked, Matrix
homeserver down, SMTP credentials rotated). The bridge cannot tell you
about its own outbound failure — that's what the operator's external
monitoring (Prometheus alert on notify_bridge_target_send_failures_total)
is for.
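The guard described in this section can be pictured with a small counter. This is an illustrative sketch, not the project's code: the class TargetFailureTracker and its method names are hypothetical, the threshold of 5 mirrors the documented default for target failures, and the reset after emission follows the anti-spam behavior described for bridge_self.

```python
from collections import defaultdict


class TargetFailureTracker:
    """Counts consecutive send failures per target."""

    def __init__(self, threshold: int = 5) -> None:
        self.threshold = threshold
        self._consecutive: dict[str, int] = defaultdict(int)

    def record(self, target_id: str, ok: bool, provider_type: str) -> bool:
        """Return True when a target-failure alert should be emitted."""
        if provider_type == "bridge_self":
            # Self-loop guard: a failed self-monitoring send is only
            # logged, never counted, so it cannot trigger itself.
            return False
        if ok:
            self._consecutive[target_id] = 0  # streak broken
            return False
        self._consecutive[target_id] += 1
        if self._consecutive[target_id] >= self.threshold:
            self._consecutive[target_id] = 0  # reset after emission
            return True
        return False
```

Because bridge_self sends never increment the counter, a broken alert target produces log lines only, which is why the external Prometheus alert remains necessary.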
"Webhook returns 500"
Inspect the webhook_payload_log table for the matching request:
SELECT received_at, status_code, error_message, payload_excerpt
FROM webhook_payload_log
ORDER BY received_at DESC LIMIT 20;
Common causes: payload schema change in the source service, a tracker
referencing a deleted provider, a Jinja template that errors out (look
for template render failed in logs).
"Telegram bot rate-limited (429)"
The Telegram client implements exponential backoff with jitter on
Retry-After. No operator action is required for transient throttling.
If the rate-limit persists, check:
- The bot is being driven by multiple Notify Bridge instances pointing at the same chat (split-brain — only one instance should own a bot).
- A template is producing very large messages (Telegram limits message size to 4096 chars). Look for MessageTooLong in the logs.
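As a rough illustration of exponential backoff with jitter that also honors Retry-After, consider the sketch below. It is not the Telegram client's actual code: retry_delay, its parameters, and the choice to add the jitter on top of the server-provided wait are all assumptions made for the example.

```python
import random
from typing import Callable, Optional


def retry_delay(
    attempt: int,
    retry_after: Optional[float] = None,
    base: float = 1.0,
    cap: float = 60.0,
    rng: Callable[[], float] = random.random,
) -> float:
    """Seconds to wait before retry number `attempt` (0-based)."""
    # Exponential growth, bounded so late attempts do not wait forever.
    backoff = min(cap, base * (2 ** attempt))
    # Jitter: pick a random fraction of the backoff window.
    jitter = backoff * rng()
    # Never retry sooner than the server asked for.
    floor = max(retry_after or 0.0, 0.0)
    return floor + jitter
```

Adding the jitter after the Retry-After floor spreads out clients that were all told to wait the same number of seconds.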
"DB lock contention"
SQLite WAL mode and busy_timeout=10000 make this rare. If you see
SQLITE_BUSY in logs:
- Check for long-running transactions (most often a stuck migration).
- Confirm the WAL files are on the same filesystem as the main DB — splitting them across mounts is a known cause.
- Run sqlite3 notify_bridge.db "PRAGMA wal_checkpoint(TRUNCATE);" to flush the WAL. Safe to run while the backend is up.
Upgrades
- Pre-migration snapshot is taken automatically before any migration runs. The latest five snapshots are retained by default.
- Migrations are idempotent — re-running an upgrade is safe.
- If a migration fails, the automatic pre-migration snapshot is the recovery point. See "Recovery from a corrupted DB" above.
- Always test major version upgrades in staging first. The upgrade flow is the same in staging: pull the new image, restart the container.
The release tag stream lives at the project Gitea / GitHub releases page.
Release notes are written to RELEASE_NOTES.md for the upcoming version
and copied into the Gitea release body by the release.yml workflow.