17 Commits

Author SHA1 Message Date
alexei.dolgolyov 11593eaa7c fix(ci): pin aiohttp<3.14 in backend test deps
aioresponses 0.7.8 (latest) does not pass the stream_writer keyword
argument that aiohttp 3.14 made required on ClientResponse, so every
aioresponses-mocked HTTP test fails to construct a response. core
declares aiohttp>=3.9 with no upper bound, so CI floated to 3.14.0 and
the backend job would break on the next run. Pin <3.14 in the test
install in both build.yml and release.yml until aioresponses ships an
aiohttp-3.14-compatible release.
2026-06-05 21:01:44 +03:00
alexei.dolgolyov d7c48b06ee ci: isolate test backend install in venv
Release / test-backend (push) Successful in 2m1s
Release / release (push) Successful in 2m3s
The persistent Gitea runner caches the setup-python toolcache between
runs. A previous run that produced wheels with broken metadata (no
Version field in METADATA) left a notify-bridge-server install with
no RECORD file in site-packages. The next run hits:

  Found existing installation: notify-bridge-server None
  error: uninstall-no-record-file

pip refuses to uninstall (no RECORD) and refuses to overlay (it tries
to uninstall first). Switching from a system-pip install into the
toolcache to an isolated /tmp/venv per run sidesteps the leak — each
CI run starts with empty site-packages.

Same change to build.yml and release.yml so the pre-merge gate and
the release-gate both run the same setup.
2026-05-16 18:33:41 +03:00
alexei.dolgolyov 10d30fc956 feat: production readiness — security, perf, bug fixes, bridge self-monitoring
Comprehensive multi-area pass driven by a parallel 8-agent production
review. Frontend, backend, database, security, performance, operational,
plus a new self-monitoring feature.

## Critical fixes
- Planka webhook: reads bounded raw body (was NameError on every call)
- HA quiet hours: ha_state_changed/automation_triggered/service_called/
  event_fired added to deferrable set (were silently dropped)
- DNS-rebinding SSRF: PinnedResolver wired into shared aiohttp session
- Telegram inbound webhook: secret now mandatory (401 without)
- Generic webhook: auth_mode="none" requires explicit
  acknowledge_unauthenticated=true; per-IP rate limit 60/min
- svelte-check: 5 null-narrowing errors in EventDetailModal fixed
- Provider hardcoding: Immich-only block extracted to descriptor
  featureDiscoveryHint
- command_sync: snapshot+expunge bot before exiting AsyncSession

## Bug fixes
- notifier asyncio.gather(return_exceptions=True) — one bad chat no longer
  cancels peer sends
- NotificationDispatcher hoisted out of per-tracker loop
- Provider credential resolution unified across all 5 dispatch sites
- HA asyncio.shield now drains inner task on cancellation
- Provider construction switched from if/elif ladder to factory registry
- NUT first poll seeds silently (no spurious ups_on_battery)
- Quiet-hours gate: event-type-disabled now wins over deferral
- APScheduler drain job ID resolution upgraded to seconds
- HA on_status_change wired through to EventLog
- Webhook payload rollback failures now logged (not swallowed)
- Batched receivers/chats/bots in load_link_data (was per-target N+1)
- flag_modified on JSON column reassignments in deferred_dispatch

## Database
- UNIQUE indexes on service_provider.webhook_token,
  telegram_bot.webhook_path_id, partial UNIQUE on telegram_bot.bot_id,
  telegram_chat(bot_id, chat_id), notification_tracker_target unique link,
  partial UNIQUE on bridge_self provider per user
- Composite ix_event_log_user_event_type_created index
- save_chat_from_webhook switched to ON CONFLICT DO UPDATE
- ondelete=CASCADE on user-id FKs (model annotation; app-side cascade
  delete added for existing data)
- delete_notification_tracker converted from N+1 to bulk DELETE/UPDATE
- Module-level asyncio.Lock replaced with lazy _get_lock() pattern
- VACUUM INTO snapshot now PRAGMA integrity_check verified

## Performance
- Jinja2 template compilation LRU cached (lru_cache maxsize=512)
- Per-locale render cache in NotificationDispatcher (skips re-rendering
  identical content for receivers sharing a locale)
- Tracker list cached per provider_id with 5s TTL + explicit invalidation
  on tracker CRUD (relieves HA chat-bus rate query pressure)
- Nav-counts collapsed from 16 round-trips to single UNION ALL
- HA event_log: skip persisting empty assets_added/removed events

## Security hardening
- Mass-assignment guard on Action create/update; cron sub-minute reject
- Backup JSON depth/node-count cap (depth ≤ 10, nodes ≤ 100k)
- _sanitize_config extended to all JSON-typed fields on backup import
- Telegram _safe_get walks redirects manually with SSRF revalidation
- Bcrypt 72-byte password length cap with clear 422
- Webhook payload body redaction; sensitive substring set extended with
  oauth/client_secret/webhook_secret/csrf in both header filter and
  template extras filter

## Frontend
- 76 catch (err: any) sites converted to errMsg(err) helper
- globalProviderFilter: pure getter; reconciliation moved to one-time
  $effect in +layout
- Provider-filter binding: removed paired $effects + _syncingFilter flag,
  now one-way derived
- entity-cache: separate _refreshing flag for background re-fetches
- api.ts 401 handling: AuthRedirectError class + dedup _redirecting flag,
  goto() instead of window.location.href
- a11y: aria-expanded on mobile More, role=switch + aria-checked on
  Telegram bot toggles

## Tests & operations
- CI pytest gate added to .gitea/workflows/build.yml + release.yml
  (wheel-built install to dodge editable-install slowness)
- /api/ready upgraded to deep healthcheck (db SELECT 1, scheduler.running,
  HA supervisor presence) returning {ready, checks, errors, version}
- /api/metrics endpoint with prometheus_client (deferred_pending,
  event_log_total, dispatch_duration, poll_failures, send_failures)
- New OPERATIONS.md covering deploy, healthchecks, metrics, backup/restore
  procedures, log handling, common scenarios, upgrade flow
- New tests: test_bridge_self (11), test_gitea_parser (9),
  test_planka_parser (6), test_immich_change_detector (6),
  test_backup_roundtrip (1)

## New feature: bridge self-monitoring
- New bridge_self provider type — internal sink for bridge health events
- Three event types: bridge_self_poll_failures (consecutive tracker poll
  failures), bridge_self_deferred_backlog (pending count crosses
  threshold), bridge_self_target_failures (consecutive 5xx/network
  failures per target)
- Per-user thresholds (defaults: 3 / 100 / 5) configurable via the
  provider config form
- Auto-seeded on user create + /setup + boot backfill for existing users
- Anti-spam: counters reset after emission; backlog uses transition latch
- Self-loop guard: bridge_self failures don't count toward target-failure
  thresholds (logged only) — wire to your own Telegram/Email/Matrix to
  get notified when polls/dispatches/sends fail
- 6 default templates (3 events × 2 locales), tracking config columns
  with backfill migration, frontend descriptor (excluded from "create
  provider" wizard since auto-managed)

Operator-visible behavior changes (call out in release notes):
- NOTIFY_BRIDGE_TELEGRAM_WEBHOOK_SECRET now REQUIRED for webhook mode
- Existing webhook providers with auth_mode="none" need explicit opt-in
- Generic webhook endpoint rate-limited 60/min per source IP
- HA disconnect/reconnect writes ha_status_* EventLog rows
- Every user gets a bridge_self provider — wire it to a target to
  receive failure alerts

Pre-existing test failures (test_ssrf, test_release_provider) on
Python 3.13 are unrelated; CI runs on 3.12.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 02:16:49 +03:00
alexei.dolgolyov 19036a90bb ci: drop trivy scan from release (never failed, output discarded)
Release / release (push) Successful in 31s
2026-04-23 20:55:30 +03:00
alexei.dolgolyov bbcdf1c5d1 ci: skip build.yml on release commits 2026-04-23 20:46:09 +03:00
alexei.dolgolyov f9040370bc ci: drop backend pytest stage (too slow on hosted runner)
Build and Test / build-image (push) Has been cancelled
Build and Test / test-frontend (push) Has been cancelled
Release / release (push) Has been cancelled
The editable install of core+server+dev pulls the full scientific Python
stack (SQLAlchemy, aiohttp, pytest, httpx, slowapi, uvicorn[standard],
apscheduler and their transitives) on every CI run. Even with pip cache
the restore + install takes minutes per job — not worth it for a suite
that still runs locally via ``pytest packages/server/tests``.

Kept the frontend svelte-check + build and the non-push Docker image
build. Release workflow no longer has a test gate either (same reason).
Bring the test stage back once we have a prebuilt CI image with deps.
2026-04-23 20:40:53 +03:00
alexei.dolgolyov 3b683ce82c ci: cache pip downloads and collapse install into one pip call
Build and Test / build-image (push) Has been cancelled
Build and Test / test-backend (push) Has been cancelled
Build and Test / test-frontend (push) Has been cancelled
Two wins:
  * actions/setup-python's built-in pip cache, keyed on the two
    pyproject.toml files, turns the 20+ transitive dep downloads into
    a single tarball restore on cache hit.
  * One ``pip install -e ./core -e ./server[dev]`` call instead of
    two — lets pip's resolver run once over the combined graph and
    skips the second invocation's overhead.

Also dropped ``pip install --upgrade pip``: the runner image already
ships a recent pip, and the upgrade ran once per CI job for no gain.
2026-04-23 20:40:17 +03:00
alexei.dolgolyov 2bec25353b ci: install editable packages inside a venv
Build and Test / test-frontend (push) Successful in 9m46s
Release / release (push) Has been cancelled
Release / test (push) Has been cancelled
Build and Test / build-image (push) Has been cancelled
Build and Test / test-backend (push) Has been cancelled
The hosted Gitea runner image pre-installs older versions of both packages
in its system Python site-packages and retains stale ~otify_bridge_core /
~otify_bridge_server dist-info directories from prior interrupted runs.
``pip install -e`` against the system interpreter tries to uninstall those,
the rollback fires mid-transaction, and the runner's
``/opt/hostedtoolcache/.../bin/notify-bridge`` console script disappears
before the new install can be placed:

  ERROR: Could not install packages due to an OSError:
  [Errno 2] No such file or directory:
  '/opt/hostedtoolcache/Python/3.12.12/x64/bin/notify-bridge'

Installing into a fresh venv sidesteps the pre-cached state entirely (and
is the recommendation pip itself prints on every run).
2026-04-23 20:23:42 +03:00
alexei.dolgolyov 920920bc67 feat: production-readiness hardening across security, async, DB, ops
Build and Test / test-frontend (push) Successful in 9m37s
Build and Test / test-backend (push) Successful in 10m53s
Build and Test / build-image (push) Failing after 14m52s
Security
- SSRF: async DNS resolver; allow_redirects=False on all outbound clients;
  matrix homeserver_url validated on create/update/test; update_provider
  and email_bot merge incoming config and reject ***-masked secrets.
- Auth: bcrypt offloaded to asyncio.to_thread; JWT now carries iss/aud +
  leeway and rejects missing claims; setup TOCTOU closed inside a
  transaction; rate limits extended (default 600/min, 10/min on password
  change, 30/min on needs-setup); constant-time login to prevent username
  enumeration.
- Config: rejects known dev secret keys; validates CORS origin schemes,
  port range, token lifetimes.
- Webhook handlers stream-read body with a 1 MiB cap; Discord 429 retries
  bounded (3 attempts, Retry-After capped at 60 s).
- CSP + HSTS added to SecurityHeadersMiddleware.

Async / runtime
- SQLite engine: WAL, synchronous=NORMAL, foreign_keys=ON, busy_timeout,
  pool_pre_ping, dispose on shutdown.
- Lifespan shutdown now stops scheduler before closing HTTP session and
  disposing the engine.
- Shared aiohttp session locked against concurrent first-caller races;
  core NotificationDispatcher accepts and reuses it.
- Storage and scheduled backup writes wrapped in asyncio.to_thread.
- NUT client writes bounded by asyncio.wait_for.
- Telegram poller switched from 3 s short-poll to 30 s interval + 25 s
  long-poll (~10x fewer API calls).

Database
- New performance-indexes migration covers every FK/owner column and
  hot-path composite (notification_tracker(provider_id, enabled);
  event_log(user_id, created_at DESC); webhook_payload_log(provider_id,
  created_at DESC); action_execution(action_id, started_at DESC)).
- New schema_version table for future upgrade gating.
- __system__ placeholder user (id=0) seeded so user_id=0 system defaults
  satisfy the newly enforced FK; filtered out of /auth/needs-setup,
  /api/users, and setup.
- list_notification_trackers rewritten to batched loads (was 1+N+N*M).
- Retention job extended to event_log, webhook_payload_log, and
  action_execution; retention days exposed as a setting.

Scheduler
- AsyncIOScheduler job_defaults: coalesce, misfire_grace_time=300,
  max_instances=1.

Ops
- uvicorn runs with proxy_headers, forwarded_allow_ips,
  timeout_graceful_shutdown; access log suppressed in non-debug.
- FastAPI version string now reads from importlib.metadata.
- New /api/ready endpoint separate from /api/health.
- docker-compose drops the ALLOW_PRIVATE_URLS=1 default, adds mem/cpu/pid
  limits, read_only + tmpfs, cap_drop:ALL, no-new-privileges; healthcheck
  targets /api/ready.
- CI now runs on push/PR with backend pytest, frontend svelte-check +
  build, and a non-push image build; release workflow gated on tests,
  publishes immutable sha-<commit> image tag, adds Trivy scan.

Tests
- New packages/server/tests/ with 29 passing tests: config validation,
  JWT round-trip + aud/alg=none rejection, SSRF scheme and private-range
  enforcement (sync + async), Discord bounded retry, and a lifespan-level
  /api/health + /api/ready smoke check.
- Renamed the misnamed services/test_dispatch.py to manual_dispatch.py so
  pytest never auto-collects production code.

Frontend
- /login now redirects already-authenticated users to /, shows a distinct
  'backend unreachable' banner (en/ru) when /auth/needs-setup fails.
2026-04-23 19:44:56 +03:00
alexei.dolgolyov f27fa42b87 fix(ci): build release payload via heredoc, drop broken env-var passing
Release / release (push) Successful in 24s
Previous attempt used `python3 -c "..." KEY=VALUE` which passes
KEY=VALUE as positional args, not environment variables — the python
block then crashed with KeyError: 'BODY' because nothing actually set
it in the environment.

Consolidate into a single heredoc-fed python3 block that reads
RELEASE_NOTES from the already-exported env var and reads TAG/VERSION/
IS_PRE after an explicit `export`. Uses <<'PY' so shell metachars in
the Python source (backticks, $, quotes) are not interpreted.

Also drops the redundant intermediate BODY variable — body is built
directly inside the single python invocation.
2026-04-21 20:16:27 +03:00
alexei.dolgolyov e12820f150 ci: robust Gitea release creation with HTTP status + diagnostics
Release / release (push) Failing after 21s
Previous implementation silently assumed any missing 'id' in POST
response meant "release already exists", then called an unguarded
python3 on the fallback response — which crashes (exit 1) if the
fallback also fails (e.g. release really doesn't exist).

New logic:
- Build JSON payload in Python (avoids shell escaping + CLI length limits)
- Capture HTTP status explicitly
- 201 → success
- 409 or "already exists" message → reuse existing (with HTTP check on fetch)
- Anything else → fail loudly with the response body printed

This also unblocks diagnosis of the current v0.1.0 failure by surfacing
the actual error the Gitea API is returning.
2026-04-21 20:09:55 +03:00
alexei.dolgolyov 866a8df310 ci: fix changelog step on shallow checkout and small repos
Release / release (push) Failing after 54s
- Set fetch-depth: 0 so previous tag lookups work across full history.
- Use `-n 20` instead of HEAD~20..HEAD, which fails when the repo has
  fewer than 20 commits (e.g. on the first release).
2026-04-21 19:59:40 +03:00
alexei.dolgolyov 56b345188e ci: consolidate release.yml into single checkout step
Release / release (push) Failing after 1m53s
The two-step pattern (sparse-checkout RELEASE_NOTES.md, then full
checkout) left sparse-checkout config active on the workspace, so the
second checkout still only restored RELEASE_NOTES.md. Docker build
then failed with "open Dockerfile: no such file or directory".

Since both RELEASE_NOTES.md and the full source are needed in the same
job, one full checkout is simpler and correct.
2026-04-21 19:50:49 +03:00
alexei.dolgolyov eecc9e295c ci: consolidate release tokens to single DEPLOY_TOKEN, rename redeploy step
- Use one DEPLOY_TOKEN for both registry login and Gitea release API,
  matching the claude-code-facts convention.
- Rename "Trigger Portainer redeploy" to "Trigger redeploy webhook" —
  the step calls a generic DOCKER_REDEPLOY_WEBHOOK_URL, not a
  Portainer-specific endpoint.
- Add .facts-sync.json to pin this project to the facts repo commit.
2026-04-21 19:35:50 +03:00
alexei.dolgolyov c41182ffd0 ci: sync release workflow with CI/CD docs, add manual build
- Fix github.* → gitea.* context consistency
- Add pre-release detection (skip :latest for alpha/beta/rc)
- Add release fallback (reuse existing if creation fails)
- Add prerelease field to release API call
- Use sparse-checkout for RELEASE_NOTES.md
- Skip Portainer redeploy for pre-releases
- Add version tag without v prefix
- Add manual build.yml for Docker image verification
2026-03-28 13:27:28 +03:00
alexei.dolgolyov b803d004e1 refactor: comprehensive codebase review — security, performance, quality, UX
Security:
- Fix NUT protocol command injection (validate names against safe regex)
- Enable Jinja2 autoescape=True to prevent HTML injection via external data
- Add WebhookProviderConfig validation model

Performance:
- Shared aiohttp.ClientSession singleton (replaces 40+ per-request sessions)
- Fix 4 N+1 queries with batch IN loads (poller, scheduler, memory, broadcast)
- asyncio.gather for Gitea commands and notification dispatcher
- Add DB indexes on NotificationTrackerState.tracker_id, CommandTrackerListener
- LRU cache for compiled Jinja2 templates
- Daily EventLog cleanup job (90-day retention)
- 30s HTTP timeout on all external calls
- GROUP BY for target type counts (replaces 7 sequential queries)

Code quality:
- Extract get_owned_entity() helper (replaces 11 duplicate functions)
- Extract slot_helpers.py (load_slots, save_slots, render_template_preview)
- Extract command_utils.py (tracker lookup, last event, collection IDs)
- Extract http_session.py (shared session lifecycle)
- Provider connection validation dedup (3x → 1 helper)
- Command dispatch tables replacing if/elif chains
- Album+links fetch helper (fetch_albums_with_links)
- Provider dispatch polymorphism (list_provider_collections)
- Immutable _enrich_assets (no longer mutates in-place)
- Fix _format_assets return type + handler unpacking

Frontend:
- Fix 18+ hardcoded English strings → t() with new i18n keys (en + ru)
- Mobile "More" nav panel with provider filter and search
- Shared Button.svelte component (4 variants, 2 sizes)
- Shared ErrorBanner.svelte component (8 pages updated)
- SvelteKit goto() replacing window.location.href
- Dashboard grid fixed for 4 cards, paginator opacity consistency

Functionality:
- max_instances=1 on scheduler jobs (prevents duplicate events)
- Webhook provider in watcher (prevents error spam)
- Fix stale SQLModel reference in poller
- Gitea get_repo() direct API call
2026-03-28 13:22:26 +03:00
alexei.dolgolyov 1ac6a17f6f feat: Docker deployment + Gitea CI/CD workflow
- Multi-stage Dockerfile: Node frontend build → Python wheel build → slim runtime
- Backend serves SvelteKit static output via FastAPI StaticFiles mount
- docker-compose.yml with named volume for /data persistence
- Gitea Actions workflow: build/push Docker image + create release on v* tags
- Add NOTIFY_BRIDGE_STATIC_DIR config for frontend path
- Fix run() to use configurable host/port
2026-03-23 02:14:14 +03:00