13 Commits

Author SHA1 Message Date
alexei.dolgolyov 8065e6effa feat(immich): multi-time-point scheduling for scheduled/periodic/memory
Replace the single comma-separated text box with an add/remove list of
native HH:MM pickers for the Immich scheduled-assets, periodic-summary,
and memory slots. The backend already stored comma-separated *_times and
scheduled one cron job per time; this makes entering several times per
day discoverable and hardens the read/write path.

Backend:
- services/time_list.py: normalize_time_list (validate / dedup / sort /
  cap at 24, raising TimeListError) + lenient parse_hhmm_list; the
  scheduler now uses the shared parser (drops its private copy).
- tracking_configs API normalizes *_times on every write (422 on bad
  input) and rejects enabling a slot whose times list is empty.
- scheduler warns when an enabled slot has zero or dropped fire times,
  restoring the observability lost with the old per-call warning.

Frontend:
- TimeListEditor.svelte: add/remove native time rows, dedupe + sort on
  emit, per-day cap, collapses on-screen duplicates, aria-labelled rows;
  syncs from the value prop only on external changes (untrack guard) so
  keyboard entry isn't clobbered mid-edit.
- Descriptor-driven save guard: an enabled feature section must have at
  least one time.
- i18n (en/ru) keys; refreshed help text; removed dead invalidTimeList.

Tests: time_list normalization/parsing (incl. non-ASCII/odd shapes) and
the enabled-implies-times validation.
2026-05-29 14:57:41 +03:00
alexei.dolgolyov 9aada75381 fix(tests): clear diagnostic_mode _bg_tasks between cases
test_fallback_task_retained_until_fire asserts len(_bg_tasks) == 1, but
the set carries pending tasks from earlier tests' fallback schedules, so
the assertion saw the accumulated count instead. Drop the references (no
.cancel() — the tasks belong to closed loops, cross-loop cancel raises
RuntimeError on the next test's setup).
2026-05-28 15:26:11 +03:00
alexei.dolgolyov 6a8f374678 feat: observability, per-receiver Telegram options, oversized-video fallback
Operability:
- Correlation IDs end-to-end: shared dispatch_id between log lines and
  EventLog rows (event/watcher/scheduled/deferred/action/HA/command paths)
  and a new X-Request-Id middleware that normalizes inbound ids and binds
  request_id into log context.
- dispatch_summary block merged into EventLog.details: per-target
  success/failure counts plus Telegram media delivered/skipped/failed and
  truncated error lists, so partial outcomes surface in the UI.
- Diagnostic mode: admin can flip one module to DEBUG for a bounded
  window with auto-revert (in-memory only; setup_logging() resets on
  boot, lifespan reverts on shutdown). New /diagnostic-mode endpoints
  plus DiagnosticsCassette UI on the settings page.

Telegram:
- Per-receiver options: disable_notification (silent send) and
  message_thread_id (forum-topic routing), wired through the dispatcher
  via a ContextVar so all four send sites (sendMessage / sendPhoto-Video-
  Document / sendMediaGroup / cache-hit POST) pick them up.
- send_large_videos_as_documents target setting: bypass the 50 MB
  sendVideo cap by falling back to sendDocument for oversized videos.
- sendMediaGroup byte-budget enforcement (TELEGRAM_MAX_GROUP_TOTAL_BYTES,
  45 MB) with per-item fallback on chunk failure so a stale file_id no
  longer silently drops a cached asset.

Tests:
- New: diagnostic_mode, dispatch_summary, request_correlation,
  telegram_media_group_partial, telegram_per_send_options.

Docs:
- .claude/reviews/: six-axis production-readiness review of v0.8.1.
- .claude/docs/functional-review-2026-05-28.md: focused review of
  Telegram/Immich/logging subsystems.
2026-05-28 15:19:31 +03:00
alexei.dolgolyov 66f152ef2c fix(tests): green pytest gate for v0.8.1
Release / test-backend (push) Failing after 8s
Release / release (push) Has been skipped
Four root causes blocked the CI test gate; all fixed minimally:

1. test_release_provider._allow_private_urls used setenv +
   importlib.reload(ssrf_mod). The reload permanently rebound
   _ALLOW_PRIVATE=True in the module; monkeypatch.setenv undid the
   env var on teardown but the module attribute stayed True for the
   rest of the session, masking every test_ssrf*/test_ssrf_hardening
   case (16 failures). Switched to monkeypatch.setattr on the module
   attribute directly — restored cleanly on teardown.

2. _FakeResponse in test_release_provider lacked the content_length
   attribute and a top-level read() method that the new size-cap
   guards in gitea.py consult before parsing (5 failures).

3. test_gate_quiet_hours_wins_over_event_type_flag was asserting the
   pre-refactor gate order. evaluate_event_gate now intentionally
   reports EVENT_TYPE_DISABLED before QUIET_HOURS so deferrable
   events with the event-type flag off get dropped immediately
   instead of being deferred and then silently discarded at drain
   time. Renamed the test and inverted the expectation.

4. resolve_version() returned 0.0.0+unknown in CI because
   pip-wheel-built hatchling distributions ended up with METADATA
   missing the Version field — importlib.metadata returned None.
   Added __version__ = "0.8.1" to notify_bridge_server/__init__.py
   as a third (always-available) candidate; resolve_version() now
   picks the max of (installed, package, source).
2026-05-16 18:25:51 +03:00
alexei.dolgolyov 8651767112 feat: bridge_self bot commands — status, thresholds, reset, health
Adds bot commands for the bridge_self provider so operators can inspect
and manage bridge health from chat: /status, /thresholds, /reset, /health.
Includes Jinja2 templates for both locales, seed data, capability slots,
and a handler that exposes pending deferred backlog plus per-counter
reset. Also adds .claude/skills/ for project-scoped graph-aware skills.
2026-05-16 03:43:48 +03:00
alexei.dolgolyov 10d30fc956 feat: production readiness — security, perf, bug fixes, bridge self-monitoring
Comprehensive multi-area pass driven by a parallel 8-agent production
review. Frontend, backend, database, security, performance, operational,
plus a new self-monitoring feature.

## Critical fixes
- Planka webhook: reads bounded raw body (was NameError on every call)
- HA quiet hours: ha_state_changed/automation_triggered/service_called/
  event_fired added to deferrable set (were silently dropped)
- DNS-rebinding SSRF: PinnedResolver wired into shared aiohttp session
- Telegram inbound webhook: secret now mandatory (401 without)
- Generic webhook: auth_mode="none" requires explicit
  acknowledge_unauthenticated=true; per-IP rate limit 60/min
- svelte-check: 5 null-narrowing errors in EventDetailModal fixed
- Provider hardcoding: Immich-only block extracted to descriptor
  featureDiscoveryHint
- command_sync: snapshot+expunge bot before exiting AsyncSession

## Bug fixes
- notifier asyncio.gather(return_exceptions=True) — one bad chat no longer
  cancels peer sends
- NotificationDispatcher hoisted out of per-tracker loop
- Provider credential resolution unified across all 5 dispatch sites
- HA asyncio.shield now drains inner task on cancellation
- Provider construction switched from if/elif ladder to factory registry
- NUT first poll seeds silently (no spurious ups_on_battery)
- Quiet-hours gate: event-type-disabled now wins over deferral
- APScheduler drain job ID resolution upgraded to seconds
- HA on_status_change wired through to EventLog
- Webhook payload rollback failures now logged (not swallowed)
- Batched receivers/chats/bots in load_link_data (was per-target N+1)
- flag_modified on JSON column reassignments in deferred_dispatch

## Database
- UNIQUE indexes on service_provider.webhook_token,
  telegram_bot.webhook_path_id, partial UNIQUE on telegram_bot.bot_id,
  telegram_chat(bot_id, chat_id), notification_tracker_target unique link,
  partial UNIQUE on bridge_self provider per user
- Composite ix_event_log_user_event_type_created index
- save_chat_from_webhook switched to ON CONFLICT DO UPDATE
- ondelete=CASCADE on user-id FKs (model annotation; app-side cascade
  delete added for existing data)
- delete_notification_tracker converted from N+1 to bulk DELETE/UPDATE
- Module-level asyncio.Lock replaced with lazy _get_lock() pattern
- VACUUM INTO snapshot now PRAGMA integrity_check verified

## Performance
- Jinja2 template compilation LRU cached (lru_cache maxsize=512)
- Per-locale render cache in NotificationDispatcher (skips re-rendering
  identical content for receivers sharing a locale)
- Tracker list cached per provider_id with 5s TTL + explicit invalidation
  on tracker CRUD (relieves HA chat-bus rate query pressure)
- Nav-counts collapsed from 16 round-trips to single UNION ALL
- HA event_log: skip persisting empty assets_added/removed events

## Security hardening
- Mass-assignment guard on Action create/update; cron sub-minute reject
- Backup JSON depth/node-count cap (depth ≤ 10, nodes ≤ 100k)
- _sanitize_config extended to all JSON-typed fields on backup import
- Telegram _safe_get walks redirects manually with SSRF revalidation
- Bcrypt 72-byte password length cap with clear 422
- Webhook payload body redaction; sensitive substring set extended with
  oauth/client_secret/webhook_secret/csrf in both header filter and
  template extras filter

## Frontend
- 76 catch (err: any) sites converted to errMsg(err) helper
- globalProviderFilter: pure getter; reconciliation moved to one-time
  $effect in +layout
- Provider-filter binding: removed paired $effects + _syncingFilter flag,
  now one-way derived
- entity-cache: separate _refreshing flag for background re-fetches
- api.ts 401 handling: AuthRedirectError class + dedup _redirecting flag,
  goto() instead of window.location.href
- a11y: aria-expanded on mobile More, role=switch + aria-checked on
  Telegram bot toggles

## Tests & operations
- CI pytest gate added to .gitea/workflows/build.yml + release.yml
  (wheel-built install to dodge editable-install slowness)
- /api/ready upgraded to deep healthcheck (db SELECT 1, scheduler.running,
  HA supervisor presence) returning {ready, checks, errors, version}
- /api/metrics endpoint with prometheus_client (deferred_pending,
  event_log_total, dispatch_duration, poll_failures, send_failures)
- New OPERATIONS.md covering deploy, healthchecks, metrics, backup/restore
  procedures, log handling, common scenarios, upgrade flow
- New tests: test_bridge_self (11), test_gitea_parser (9),
  test_planka_parser (6), test_immich_change_detector (6),
  test_backup_roundtrip (1)

## New feature: bridge self-monitoring
- New bridge_self provider type — internal sink for bridge health events
- Three event types: bridge_self_poll_failures (consecutive tracker poll
  failures), bridge_self_deferred_backlog (pending count crosses
  threshold), bridge_self_target_failures (consecutive 5xx/network
  failures per target)
- Per-user thresholds (defaults: 3 / 100 / 5) configurable via the
  provider config form
- Auto-seeded on user create + /setup + boot backfill for existing users
- Anti-spam: counters reset after emission; backlog uses transition latch
- Self-loop guard: bridge_self failures don't count toward target-failure
  thresholds (logged only) — wire to your own Telegram/Email/Matrix to
  get notified when polls/dispatches/sends fail
- 6 default templates (3 events × 2 locales), tracking config columns
  with backfill migration, frontend descriptor (excluded from "create
  provider" wizard since auto-managed)

Operator-visible behavior changes (call out in release notes):
- NOTIFY_BRIDGE_TELEGRAM_WEBHOOK_SECRET now REQUIRED for webhook mode
- Existing webhook providers with auth_mode="none" need explicit opt-in
- Generic webhook endpoint rate-limited 60/min per source IP
- HA disconnect/reconnect writes ha_status_* EventLog rows
- Every user gets a bridge_self provider — wire it to a target to
  receive failure alerts

Pre-existing test failures (test_ssrf, test_release_provider) on
Python 3.13 are unrelated; CI runs on 3.12.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 02:16:49 +03:00
alexei.dolgolyov 22127e2a59 feat: Home Assistant provider — WebSocket subscription + bot commands
Adds Home Assistant as a service provider with two coordinated surfaces:

Notifications (subscription):
- Long-lived WebSocket client (aiohttp ws_connect) with auth handshake,
  exponential-backoff reconnect, bounded event queue, and area-registry
  enrichment cached per (re)connect
- ServiceProvider ABC gains an optional `subscribe()` method for push-style
  providers; HomeAssistantServiceProvider uses it via a per-provider
  supervisor task started in the FastAPI lifespan
- 4 event types (state_changed, automation_triggered, call_service,
  event_fired), 4 default Jinja templates (en + ru), HA-specific
  tracker filters (entity_glob, domain_allowlist, exact entity ids)
- Extracted shared dispatch pipeline (api/webhooks.py → services/
  event_dispatch.py) so subscription and webhook ingest share the same
  event_log + deferred-dispatch + quiet-hours code path

Bot commands:
- /status, /entities [glob], /state <entity_id>, /areas
- Multi-command WS session so /status and /areas cost one handshake
- Sensitive-attribute blocklist (camera access_token, entity_picture, etc.)
  and 30-attribute cap to keep /state output safe and within Telegram's
  message size
- Error-message redaction strips URL userinfo before surfacing to chat

Frontend:
- HA descriptor with toggle ConfigField type (new) and tag-input filter
  mode for free-text glob/domain lists (new TagInput component)
- 15 command slots + 4 notification slots wired into the existing
  template-config UI
2026-05-13 14:31:56 +03:00
alexei.dolgolyov ba199f24bd feat: deferred dispatch, release-check provider, settings polish
- Defer quiet-hours dispatches into new deferred_dispatch table; drain
  job + periodic catch-up scan re-fire at window end with coalescing on
  (link, event_type, collection_id).
- Add ON DELETE SET NULL migration on event_log_id and partial unique
  index on (link_id, collection_id, event_type) WHERE status='pending'.
- Add release-check provider abstraction (Gitea/GitHub) with SSRF-safe
  URL validation, settings UI cassette, and scheduled polling.
- Replace importlib-only version lookup with version.py helper that
  prefers the higher of installed metadata vs source pyproject so stale
  editable dev installs stop misreporting.
- Aurora frontend polish: MetaStrip component, ReleaseCassette,
  EventDetailModal expansion, and i18n additions.
2026-05-12 02:58:07 +03:00
alexei.dolgolyov bede928a3f feat(server): add /status command handler for webhook providers
The generic-webhook provider has no upstream API, so /status reports
DB-derived stats: active/total trackers, provider name, and last event
timestamp (formatted via the shared get_last_event_str helper).

Includes pytest coverage for handler registration, populated stats with
a recent event, the empty-state dash sentinel, and unknown-command
fall-through. Template variable docs in command_template_configs.py
extended with the new trackers_active/trackers_total keys.
2026-05-10 23:51:25 +03:00
alexei.dolgolyov 35a3008896 feat: log bot command invocations to the event stream
Bot commands were the only user-initiated path that didn't surface in
the dashboard. They now produce ``command_handled`` /
``command_rate_limited`` / ``command_failed`` rows in ``EventLog``
alongside tracker and action events.

Backend
- ``EventLog`` gains nullable ``command_tracker_id`` / ``telegram_bot_id``
  FKs plus deletion-snapshot name columns (idempotent migration).
- New ``_log_command_event`` helper emits one row per invocation at the
  three branches in ``handle_command``. Logging failures are swallowed
  so they cannot block the user-visible reply.
- Telegram ``from`` is captured in poller and webhook, whitelisted to
  identity fields by ``_normalize_issuer`` (drops ``language_code`` and
  any future PII), persisted under ``details.issuer``.
- ``/api/status`` resolves live ``CommandTracker`` / ``TelegramBot``
  names (mirroring the action pattern) and exposes ``tracker_id``,
  ``command_tracker_id``, ``telegram_bot_id`` so the frontend can
  deep-link.

Frontend
- Event rows are now clickable and open a detail modal with full
  provenance (bot → chat → issuer → provider), raw ``details`` JSON,
  and per-entity action buttons.
- Buttons use the existing ``requestHighlight`` + ``goto`` crosslink
  pattern, so clicking lands on the entity's list page with that
  specific card scrolled into view and pulsing.
- Auto-refresh dropdown (Off / 10s / 30s / 1m / 5m) persisted in
  ``localStorage``; ticker pauses while the tab is hidden.
- Event-type filter, dashboard verb labels, and gradients extended for
  the three new ``command_*`` types.
- Filled in pre-existing missing i18n keys (``common.hide`` /
  ``common.show`` / ``commandConfig.noCommandsForProvider``).

Tests
- New ``test_command_event_logging.py`` covers subject formatting,
  issuer normalization, the three event branches, and graceful failure
  when the DB is unreachable. ``pytest packages/server/tests/`` → 96/96.
2026-05-07 22:22:41 +03:00
alexei.dolgolyov 0eb899afb9 feat: harden notification stack and switch logging selectors to icon grid
Notifications:
- Add shared http_base, redact, and SSRF hardening modules
- Refactor dispatcher, queue, receiver and per-provider clients
  (telegram, discord, email, matrix, ntfy, slack, webhook) to use
  the shared base, with bounded queue and redacted error logs
- Tests for ssrf, redact, http_base, queue bounds, dispatcher
  aggregation, telegram media partition, email and matrix clients

Frontend:
- Settings: log level / log format selectors now use IconGridSelect
  with per-option icons and i18n descriptions
- Minor providers page and entity-cache store updates

Tooling:
- Document code-review-graph MCP usage in CLAUDE.md
- Ignore .code-review-graph/, register .mcp.json
2026-05-07 13:53:26 +03:00
alexei.dolgolyov 7cbb02b1ef feat(db): pre-migration SQLite snapshots via VACUUM INTO
Build and Test / test-backend (push) Successful in 2m38s
Build and Test / test-frontend (push) Successful in 9m44s
Build and Test / build-image (push) Failing after 17m9s
Take a consistent, atomic copy of the DB at lifespan startup BEFORE
migrations run, so a botched future upgrade is recoverable by restoring
a single file instead of a data-loss incident.

Uses SQLite's VACUUM INTO — safe under WAL, cannot tear against
concurrent writes. Best-effort: failures are logged, never raised —
the main DB remains the source of truth.

Configurable via NOTIFY_BRIDGE_PRE_MIGRATE_SNAPSHOT_KEEP (default 5;
0 disables). Snapshots land in ``data_dir/backups/pre-migrate-<ts>.db``
and the N oldest are pruned each boot.
2026-04-23 19:53:15 +03:00
alexei.dolgolyov 920920bc67 feat: production-readiness hardening across security, async, DB, ops
Build and Test / test-frontend (push) Successful in 9m37s
Build and Test / test-backend (push) Successful in 10m53s
Build and Test / build-image (push) Failing after 14m52s
Security
- SSRF: async DNS resolver; allow_redirects=False on all outbound clients;
  matrix homeserver_url validated on create/update/test; update_provider
  and email_bot merge incoming config and reject ***-masked secrets.
- Auth: bcrypt offloaded to asyncio.to_thread; JWT now carries iss/aud +
  leeway and rejects missing claims; setup TOCTOU closed inside a
  transaction; rate limits extended (default 600/min, 10/min on password
  change, 30/min on needs-setup); constant-time login to prevent username
  enumeration.
- Config: rejects known dev secret keys; validates CORS origin schemes,
  port range, token lifetimes.
- Webhook handlers stream-read body with a 1 MiB cap; Discord 429 retries
  bounded (3 attempts, Retry-After capped at 60 s).
- CSP + HSTS added to SecurityHeadersMiddleware.

Async / runtime
- SQLite engine: WAL, synchronous=NORMAL, foreign_keys=ON, busy_timeout,
  pool_pre_ping, dispose on shutdown.
- Lifespan shutdown now stops scheduler before closing HTTP session and
  disposing the engine.
- Shared aiohttp session locked against concurrent first-caller races;
  core NotificationDispatcher accepts and reuses it.
- Storage and scheduled backup writes wrapped in asyncio.to_thread.
- NUT client writes bounded by asyncio.wait_for.
- Telegram poller switched from 3 s short-poll to 30 s interval + 25 s
  long-poll (~10x fewer API calls).

Database
- New performance-indexes migration covers every FK/owner column and
  hot-path composite (notification_tracker(provider_id, enabled);
  event_log(user_id, created_at DESC); webhook_payload_log(provider_id,
  created_at DESC); action_execution(action_id, started_at DESC)).
- New schema_version table for future upgrade gating.
- __system__ placeholder user (id=0) seeded so user_id=0 system defaults
  satisfy the newly enforced FK; filtered out of /auth/needs-setup,
  /api/users, and setup.
- list_notification_trackers rewritten to batched loads (was 1+N+N*M).
- Retention job extended to event_log, webhook_payload_log, and
  action_execution; retention days exposed as a setting.

Scheduler
- AsyncIOScheduler job_defaults: coalesce, misfire_grace_time=300,
  max_instances=1.

Ops
- uvicorn runs with proxy_headers, forwarded_allow_ips,
  timeout_graceful_shutdown; access log suppressed in non-debug.
- FastAPI version string now reads from importlib.metadata.
- New /api/ready endpoint separate from /api/health.
- docker-compose drops the ALLOW_PRIVATE_URLS=1 default, adds mem/cpu/pid
  limits, read_only + tmpfs, cap_drop:ALL, no-new-privileges; healthcheck
  targets /api/ready.
- CI now runs on push/PR with backend pytest, frontend svelte-check +
  build, and a non-push image build; release workflow gated on tests,
  publishes immutable sha-<commit> image tag, adds Trivy scan.

Tests
- New packages/server/tests/ with 29 passing tests: config validation,
  JWT round-trip + aud/alg=none rejection, SSRF scheme and private-range
  enforcement (sync + async), Discord bounded retry, and a lifespan-level
  /api/health + /api/ready smoke check.
- Renamed the misnamed services/test_dispatch.py to manual_dispatch.py so
  pytest never auto-collects production code.

Frontend
- /login now redirects already-authenticated users to /, shows a distinct
  'backend unreachable' banner (en/ru) when /auth/needs-setup fails.
2026-04-23 19:44:56 +03:00