The persistent Gitea runner caches the setup-python toolcache between
runs. A previous run that produced wheels with broken metadata (no
Version field in METADATA) left a notify-bridge-server install with
no RECORD file in site-packages. The next run hits:
Found existing installation: notify-bridge-server None
error: uninstall-no-record-file
pip refuses to uninstall (no RECORD) and refuses to overlay (it tries
to uninstall first). Switching from a system-pip install into the
toolcache to an isolated /tmp/venv per run sidesteps the leak — each
CI run starts with empty site-packages.
Same change to build.yml and release.yml so the pre-merge gate and
the release-gate both run the same setup.
Comprehensive multi-area pass driven by a parallel 8-agent production
review. Frontend, backend, database, security, performance, operational,
plus a new self-monitoring feature.
## Critical fixes
- Planka webhook: reads bounded raw body (was NameError on every call)
- HA quiet hours: ha_state_changed/automation_triggered/service_called/
event_fired added to deferrable set (were silently dropped)
- DNS-rebinding SSRF: PinnedResolver wired into shared aiohttp session
- Telegram inbound webhook: secret now mandatory (401 without)
- Generic webhook: auth_mode="none" requires explicit
acknowledge_unauthenticated=true; per-IP rate limit 60/min
- svelte-check: 5 null-narrowing errors in EventDetailModal fixed
- Provider hardcoding: Immich-only block extracted to descriptor
featureDiscoveryHint
- command_sync: snapshot+expunge bot before exiting AsyncSession
## Bug fixes
- notifier asyncio.gather(return_exceptions=True) — one bad chat no longer
cancels peer sends
- NotificationDispatcher hoisted out of per-tracker loop
- Provider credential resolution unified across all 5 dispatch sites
- HA asyncio.shield now drains inner task on cancellation
- Provider construction switched from if/elif ladder to factory registry
- NUT first poll seeds silently (no spurious ups_on_battery)
- Quiet-hours gate: event-type-disabled now wins over deferral
- APScheduler drain job ID resolution upgraded to seconds
- HA on_status_change wired through to EventLog
- Webhook payload rollback failures now logged (not swallowed)
- Batched receivers/chats/bots in load_link_data (was per-target N+1)
- flag_modified on JSON column reassignments in deferred_dispatch
## Database
- UNIQUE indexes on service_provider.webhook_token,
telegram_bot.webhook_path_id, partial UNIQUE on telegram_bot.bot_id,
telegram_chat(bot_id, chat_id), notification_tracker_target unique link,
partial UNIQUE on bridge_self provider per user
- Composite ix_event_log_user_event_type_created index
- save_chat_from_webhook switched to ON CONFLICT DO UPDATE
- ondelete=CASCADE on user-id FKs (model annotation; app-side cascade
delete added for existing data)
- delete_notification_tracker converted from N+1 to bulk DELETE/UPDATE
- Module-level asyncio.Lock replaced with lazy _get_lock() pattern
- VACUUM INTO snapshot now PRAGMA integrity_check verified
## Performance
- Jinja2 template compilation LRU cached (lru_cache maxsize=512)
- Per-locale render cache in NotificationDispatcher (skips re-rendering
identical content for receivers sharing a locale)
- Tracker list cached per provider_id with 5s TTL + explicit invalidation
on tracker CRUD (relieves HA chat-bus rate query pressure)
- Nav-counts collapsed from 16 round-trips to single UNION ALL
- HA event_log: skip persisting empty assets_added/removed events
## Security hardening
- Mass-assignment guard on Action create/update; cron sub-minute reject
- Backup JSON depth/node-count cap (depth ≤ 10, nodes ≤ 100k)
- _sanitize_config extended to all JSON-typed fields on backup import
- Telegram _safe_get walks redirects manually with SSRF revalidation
- Bcrypt 72-byte password length cap with clear 422
- Webhook payload body redaction; sensitive substring set extended with
oauth/client_secret/webhook_secret/csrf in both header filter and
template extras filter
## Frontend
- 76 catch (err: any) sites converted to errMsg(err) helper
- globalProviderFilter: pure getter; reconciliation moved to one-time
$effect in +layout
- Provider-filter binding: removed paired $effects + _syncingFilter flag,
now one-way derived
- entity-cache: separate _refreshing flag for background re-fetches
- api.ts 401 handling: AuthRedirectError class + dedup _redirecting flag,
goto() instead of window.location.href
- a11y: aria-expanded on mobile More, role=switch + aria-checked on
Telegram bot toggles
## Tests & operations
- CI pytest gate added to .gitea/workflows/build.yml + release.yml
(wheel-built install to dodge editable-install slowness)
- /api/ready upgraded to deep healthcheck (db SELECT 1, scheduler.running,
HA supervisor presence) returning {ready, checks, errors, version}
- /api/metrics endpoint with prometheus_client (deferred_pending,
event_log_total, dispatch_duration, poll_failures, send_failures)
- New OPERATIONS.md covering deploy, healthchecks, metrics, backup/restore
procedures, log handling, common scenarios, upgrade flow
- New tests: test_bridge_self (11), test_gitea_parser (9),
test_planka_parser (6), test_immich_change_detector (6),
test_backup_roundtrip (1)
## New feature: bridge self-monitoring
- New bridge_self provider type — internal sink for bridge health events
- Three event types: bridge_self_poll_failures (consecutive tracker poll
failures), bridge_self_deferred_backlog (pending count crosses
threshold), bridge_self_target_failures (consecutive 5xx/network
failures per target)
- Per-user thresholds (defaults: 3 / 100 / 5) configurable via the
provider config form
- Auto-seeded on user create + /setup + boot backfill for existing users
- Anti-spam: counters reset after emission; backlog uses transition latch
- Self-loop guard: bridge_self failures don't count toward target-failure
thresholds (logged only) — wire to your own Telegram/Email/Matrix to
get notified when polls/dispatches/sends fail
- 6 default templates (3 events × 2 locales), tracking config columns
with backfill migration, frontend descriptor (excluded from "create
provider" wizard since auto-managed)
Operator-visible behavior changes (call out in release notes):
- NOTIFY_BRIDGE_TELEGRAM_WEBHOOK_SECRET now REQUIRED for webhook mode
- Existing webhook providers with auth_mode="none" need explicit opt-in
- Generic webhook endpoint rate-limited 60/min per source IP
- HA disconnect/reconnect writes ha_status_* EventLog rows
- Every user gets a bridge_self provider — wire it to a target to
receive failure alerts
Pre-existing test failures (test_ssrf, test_release_provider) on
Python 3.13 are unrelated; CI runs on 3.12.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The editable install of core+server+dev pulls the full scientific Python
stack (SQLAlchemy, aiohttp, pytest, httpx, slowapi, uvicorn[standard],
apscheduler and their transitives) on every CI run. Even with pip cache
the restore + install takes minutes per job — not worth it for a suite
that still runs locally via ``pytest packages/server/tests``.
Kept the frontend svelte-check + build and the non-push Docker image
build. Release workflow no longer has a test gate either (same reason).
Bring the test stage back once we have a prebuilt CI image with deps.
Two wins:
* actions/setup-python's built-in pip cache, keyed on the two
pyproject.toml files, turns the 20+ transitive dep downloads into
a single tarball restore on cache hit.
* One ``pip install -e ./core -e ./server[dev]`` call instead of
two — lets pip's resolver run once over the combined graph and
skips the second invocation's overhead.
Also dropped ``pip install --upgrade pip``: the runner image already
ships a recent pip, and the upgrade ran once per CI job for no gain.
The hosted Gitea runner image pre-installs older versions of both packages
in its system Python site-packages and retains stale ~otify_bridge_core /
~otify_bridge_server dist-info directories from prior interrupted runs.
``pip install -e`` against the system interpreter tries to uninstall those,
the rollback fires mid-transaction, and the runner's
``/opt/hostedtoolcache/.../bin/notify-bridge`` console script disappears
before the new install can be placed:
ERROR: Could not install packages due to an OSError:
[Errno 2] No such file or directory:
'/opt/hostedtoolcache/Python/3.12.12/x64/bin/notify-bridge'
Installing into a fresh venv sidesteps the pre-cached state entirely (and
is the recommendation pip itself prints on every run).
Security
- SSRF: async DNS resolver; allow_redirects=False on all outbound clients;
matrix homeserver_url validated on create/update/test; update_provider
and email_bot merge incoming config and reject ***-masked secrets.
- Auth: bcrypt offloaded to asyncio.to_thread; JWT now carries iss/aud +
leeway and rejects missing claims; setup TOCTOU closed inside a
transaction; rate limits extended (default 600/min, 10/min on password
change, 30/min on needs-setup); constant-time login to prevent username
enumeration.
- Config: rejects known dev secret keys; validates CORS origin schemes,
port range, token lifetimes.
- Webhook handlers stream-read body with a 1 MiB cap; Discord 429 retries
bounded (3 attempts, Retry-After capped at 60 s).
- CSP + HSTS added to SecurityHeadersMiddleware.
Async / runtime
- SQLite engine: WAL, synchronous=NORMAL, foreign_keys=ON, busy_timeout,
pool_pre_ping, dispose on shutdown.
- Lifespan shutdown now stops scheduler before closing HTTP session and
disposing the engine.
- Shared aiohttp session locked against concurrent first-caller races;
core NotificationDispatcher accepts and reuses it.
- Storage and scheduled backup writes wrapped in asyncio.to_thread.
- NUT client writes bounded by asyncio.wait_for.
- Telegram poller switched from 3 s short-poll to 30 s interval + 25 s
long-poll (~10x fewer API calls).
Database
- New performance-indexes migration covers every FK/owner column and
hot-path composite (notification_tracker(provider_id, enabled);
event_log(user_id, created_at DESC); webhook_payload_log(provider_id,
created_at DESC); action_execution(action_id, started_at DESC)).
- New schema_version table for future upgrade gating.
- __system__ placeholder user (id=0) seeded so user_id=0 system defaults
satisfy the newly enforced FK; filtered out of /auth/needs-setup,
/api/users, and setup.
- list_notification_trackers rewritten to batched loads (was 1+N+N*M).
- Retention job extended to event_log, webhook_payload_log, and
action_execution; retention days exposed as a setting.
Scheduler
- AsyncIOScheduler job_defaults: coalesce, misfire_grace_time=300,
max_instances=1.
Ops
- uvicorn runs with proxy_headers, forwarded_allow_ips,
timeout_graceful_shutdown; access log suppressed in non-debug.
- FastAPI version string now reads from importlib.metadata.
- New /api/ready endpoint separate from /api/health.
- docker-compose drops the ALLOW_PRIVATE_URLS=1 default, adds mem/cpu/pid
limits, read_only + tmpfs, cap_drop:ALL, no-new-privileges; healthcheck
targets /api/ready.
- CI now runs on push/PR with backend pytest, frontend svelte-check +
build, and a non-push image build; release workflow gated on tests,
publishes immutable sha-<commit> image tag, adds Trivy scan.
Tests
- New packages/server/tests/ with 29 passing tests: config validation,
JWT round-trip + aud/alg=none rejection, SSRF scheme and private-range
enforcement (sync + async), Discord bounded retry, and a lifespan-level
/api/health + /api/ready smoke check.
- Renamed the misnamed services/test_dispatch.py to manual_dispatch.py so
pytest never auto-collects production code.
Frontend
- /login now redirects already-authenticated users to /, shows a distinct
'backend unreachable' banner (en/ru) when /auth/needs-setup fails.
Previous attempt used `python3 -c "..." KEY=VALUE` which passes
KEY=VALUE as positional args, not environment variables — the python
block then crashed with KeyError: 'BODY' because nothing actually set
it in the environment.
Consolidate into a single heredoc-fed python3 block that reads
RELEASE_NOTES from the already-exported env var and reads TAG/VERSION/
IS_PRE after an explicit `export`. Uses <<'PY' so shell metachars in
the Python source (backticks, $, quotes) are not interpreted.
Also drops the redundant intermediate BODY variable — body is built
directly inside the single python invocation.
Previous implementation silently assumed any missing 'id' in POST
response meant "release already exists", then called an unguarded
python3 on the fallback response — which crashes (exit 1) if the
fallback also fails (e.g. release really doesn't exist).
New logic:
- Build JSON payload in Python (avoids shell escaping + CLI length limits)
- Capture HTTP status explicitly
- 201 → success
- 409 or "already exists" message → reuse existing (with HTTP check on fetch)
- Anything else → fail loudly with the response body printed
This also unblocks diagnosis of the current v0.1.0 failure by surfacing
the actual error the Gitea API is returning.
- Set fetch-depth: 0 so previous tag lookups work across full history.
- Use `-n 20` instead of HEAD~20..HEAD, which fails when the repo has
fewer than 20 commits (e.g. on the first release).
The two-step pattern (sparse-checkout RELEASE_NOTES.md, then full
checkout) left sparse-checkout config active on the workspace, so the
second checkout still only restored RELEASE_NOTES.md. Docker build
then failed with "open Dockerfile: no such file or directory".
Since both RELEASE_NOTES.md and the full source are needed in the same
job, one full checkout is simpler and correct.
- Use one DEPLOY_TOKEN for both registry login and Gitea release API,
matching the claude-code-facts convention.
- Rename "Trigger Portainer redeploy" to "Trigger redeploy webhook" —
the step calls a generic DOCKER_REDEPLOY_WEBHOOK_URL, not a
Portainer-specific endpoint.
- Add .facts-sync.json to pin this project to the facts repo commit.
- Fix github.* → gitea.* context consistency
- Add pre-release detection (skip :latest for alpha/beta/rc)
- Add release fallback (reuse existing if creation fails)
- Add prerelease field to release API call
- Use sparse-checkout for RELEASE_NOTES.md
- Skip Portainer redeploy for pre-releases
- Add version tag without v prefix
- Add manual build.yml for Docker image verification