feat: production readiness — security, perf, bug fixes, bridge self-monitoring
Comprehensive multi-area pass driven by a parallel 8-agent production
review covering frontend, backend, database, security, performance, and
operations, plus a new self-monitoring feature.
## Critical fixes
- Planka webhook: reads bounded raw body (was NameError on every call)
- HA quiet hours: ha_state_changed/automation_triggered/service_called/
event_fired added to deferrable set (were silently dropped)
- DNS-rebinding SSRF: PinnedResolver wired into shared aiohttp session
- Telegram inbound webhook: secret now mandatory (401 without)
- Generic webhook: auth_mode="none" requires explicit
acknowledge_unauthenticated=true; per-IP rate limit 60/min
- svelte-check: 5 null-narrowing errors in EventDetailModal fixed
- Provider hardcoding: Immich-only block extracted to descriptor
featureDiscoveryHint
- command_sync: snapshot+expunge bot before exiting AsyncSession
## Bug fixes
- notifier asyncio.gather(return_exceptions=True) — one bad chat no longer
cancels peer sends
- NotificationDispatcher hoisted out of per-tracker loop
- Provider credential resolution unified across all 5 dispatch sites
- HA asyncio.shield now drains inner task on cancellation
- Provider construction switched from if/elif ladder to factory registry
- NUT first poll seeds silently (no spurious ups_on_battery)
- Quiet-hours gate: event-type-disabled now wins over deferral
- APScheduler drain job ID resolution upgraded to seconds
- HA on_status_change wired through to EventLog
- Webhook payload rollback failures now logged (not swallowed)
- Batched receivers/chats/bots in load_link_data (was per-target N+1)
- flag_modified on JSON column reassignments in deferred_dispatch
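
As an illustration of the notifier fix above, here is a minimal sketch of fanning out sends with `asyncio.gather(return_exceptions=True)`. The names `send_to_chat` and `send_all` are hypothetical and are not taken from the bridge code; the point is only that failures come back as values next to the successful results instead of propagating out of `gather`.

```python
import asyncio


async def send_to_chat(chat_id: int) -> str:
    """Hypothetical sender: chat 2 always fails, the others succeed."""
    await asyncio.sleep(0)
    if chat_id == 2:
        raise RuntimeError(f"chat {chat_id} rejected the message")
    return f"sent:{chat_id}"


async def send_all(chat_ids: list[int]) -> tuple[list[str], list[BaseException]]:
    # return_exceptions=True makes gather return exceptions as results,
    # so every send runs to completion and can be inspected afterwards.
    results = await asyncio.gather(
        *(send_to_chat(c) for c in chat_ids), return_exceptions=True
    )
    sent = [r for r in results if not isinstance(r, BaseException)]
    failed = [r for r in results if isinstance(r, BaseException)]
    return sent, failed
```

A caller would typically log each entry of `failed` and carry on with the successful sends.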
## Database
- UNIQUE indexes on service_provider.webhook_token,
telegram_bot.webhook_path_id, partial UNIQUE on telegram_bot.bot_id,
telegram_chat(bot_id, chat_id), notification_tracker_target unique link,
partial UNIQUE on bridge_self provider per user
- Composite ix_event_log_user_event_type_created index
- save_chat_from_webhook switched to ON CONFLICT DO UPDATE
- ondelete=CASCADE on user-id FKs (model annotation; app-side cascade
delete added for existing data)
- delete_notification_tracker converted from N+1 to bulk DELETE/UPDATE
- Module-level asyncio.Lock replaced with lazy _get_lock() pattern
- VACUUM INTO snapshot now PRAGMA integrity_check verified
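
The `ON CONFLICT DO UPDATE` switch above can be sketched with the standard-library `sqlite3` module. This is an assumption-laden example, not the project's `save_chat_from_webhook`: the `title` column, the index name, and the helper names are hypothetical; only the `telegram_chat(bot_id, chat_id)` unique key comes from the list above.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory table with a UNIQUE index on (bot_id, chat_id)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE telegram_chat ("
        "id INTEGER PRIMARY KEY, bot_id INTEGER NOT NULL, "
        "chat_id INTEGER NOT NULL, title TEXT)"
    )
    conn.execute(
        "CREATE UNIQUE INDEX ux_telegram_chat_bot_chat "
        "ON telegram_chat (bot_id, chat_id)"
    )
    return conn


def save_chat(conn: sqlite3.Connection, bot_id: int, chat_id: int, title: str) -> None:
    # A single statement inserts a new row or updates the existing one,
    # so two concurrent webhooks cannot race between SELECT and INSERT.
    conn.execute(
        """
        INSERT INTO telegram_chat (bot_id, chat_id, title)
        VALUES (?, ?, ?)
        ON CONFLICT (bot_id, chat_id) DO UPDATE SET title = excluded.title
        """,
        (bot_id, chat_id, title),
    )
```

The upsert relies on the unique index; without it SQLite rejects the conflict target.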
## Performance
- Jinja2 template compilation LRU cached (lru_cache maxsize=512)
- Per-locale render cache in NotificationDispatcher (skips re-rendering
identical content for receivers sharing a locale)
- Tracker list cached per provider_id with 5s TTL + explicit invalidation
on tracker CRUD (relieves HA chat-bus rate query pressure)
- Nav-counts collapsed from 16 round-trips to single UNION ALL
- HA event_log: skip persisting empty assets_added/removed events
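
For the tracker-list cache above, a rough sketch of a per-key TTL cache with explicit invalidation might look like the following. The class and method names are hypothetical, and the injectable clock is an addition for testability; only the 5-second TTL and the invalidate-on-CRUD idea come from the list above.

```python
import time
from typing import Callable, Generic, TypeVar

T = TypeVar("T")


class TTLCache(Generic[T]):
    """Cache one value per provider_id for a short, fixed time window."""

    def __init__(
        self,
        ttl_seconds: float = 5.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: dict[int, tuple[float, T]] = {}

    def get_or_load(self, provider_id: int, loader: Callable[[], T]) -> T:
        now = self._clock()
        hit = self._entries.get(provider_id)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]  # still fresh: skip the database query
        value = loader()
        self._entries[provider_id] = (now, value)
        return value

    def invalidate(self, provider_id: int) -> None:
        # Called from tracker create/update/delete so that edits are
        # visible immediately instead of after the TTL expires.
        self._entries.pop(provider_id, None)
```

The short TTL bounds staleness for writes that bypass `invalidate`, while the explicit call keeps the normal CRUD path consistent.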
## Security hardening
- Mass-assignment guard on Action create/update; cron sub-minute reject
- Backup JSON depth/node-count cap (depth ≤ 10, nodes ≤ 100k)
- _sanitize_config extended to all JSON-typed fields on backup import
- Telegram _safe_get walks redirects manually with SSRF revalidation
- Bcrypt 72-byte password length cap with clear 422
- Webhook payload body redaction; sensitive substring set extended with
oauth/client_secret/webhook_secret/csrf in both header filter and
template extras filter
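
The backup depth/node-count cap above could be enforced with a walk over the parsed JSON, roughly as sketched below. The function and exception names are hypothetical, and counting the root as depth 1 is an assumption; the actual guard may count differently.

```python
from typing import Any

MAX_DEPTH = 10
MAX_NODES = 100_000


class BackupTooComplexError(ValueError):
    """Raised when a parsed backup exceeds the depth or node-count cap."""


def check_json_limits(
    value: Any, max_depth: int = MAX_DEPTH, max_nodes: int = MAX_NODES
) -> int:
    """Return the node count, or raise if a cap is exceeded.

    Every dict, list, and scalar counts as one node; the root is depth 1.
    An explicit stack is used so hostile input cannot exhaust recursion.
    """
    nodes = 0
    stack: list[tuple[Any, int]] = [(value, 1)]
    while stack:
        current, depth = stack.pop()
        nodes += 1
        if nodes > max_nodes:
            raise BackupTooComplexError(f"more than {max_nodes} nodes")
        if depth > max_depth:
            raise BackupTooComplexError(f"nesting deeper than {max_depth}")
        if isinstance(current, dict):
            stack.extend((v, depth + 1) for v in current.values())
        elif isinstance(current, list):
            stack.extend((v, depth + 1) for v in current)
    return nodes
```

Running the check before any import logic touches the payload keeps oversized backups from reaching the database layer.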
## Frontend
- 76 catch (err: any) sites converted to errMsg(err) helper
- globalProviderFilter: pure getter; reconciliation moved to one-time
$effect in +layout
- Provider-filter binding: removed paired $effects + _syncingFilter flag,
now one-way derived
- entity-cache: separate _refreshing flag for background re-fetches
- api.ts 401 handling: AuthRedirectError class + dedup _redirecting flag,
goto() instead of window.location.href
- a11y: aria-expanded on mobile More, role=switch + aria-checked on
Telegram bot toggles
## Tests & operations
- CI pytest gate added to .gitea/workflows/build.yml + release.yml
(wheel-built install to dodge editable-install slowness)
- /api/ready upgraded to deep healthcheck (db SELECT 1, scheduler.running,
HA supervisor presence) returning {ready, checks, errors, version}
- /api/metrics endpoint with prometheus_client (deferred_pending,
event_log_total, dispatch_duration, poll_failures, send_failures)
- New OPERATIONS.md covering deploy, healthchecks, metrics, backup/restore
procedures, log handling, common scenarios, upgrade flow
- New tests: test_bridge_self (11), test_gitea_parser (9),
test_planka_parser (6), test_immich_change_detector (6),
test_backup_roundtrip (1)
## New feature: bridge self-monitoring
- New bridge_self provider type — internal sink for bridge health events
- Three event types: bridge_self_poll_failures (consecutive tracker poll
failures), bridge_self_deferred_backlog (pending count crosses
threshold), bridge_self_target_failures (consecutive 5xx/network
failures per target)
- Per-user thresholds (defaults: 3 / 100 / 5) configurable via the
provider config form
- Auto-seeded on user create + /setup + boot backfill for existing users
- Anti-spam: counters reset after emission; backlog uses transition latch
- Self-loop guard: bridge_self failures don't count toward target-failure
thresholds (logged only) — wire to your own Telegram/Email/Matrix to
get notified when polls/dispatches/sends fail
- 6 default templates (3 events × 2 locales), tracking config columns
with backfill migration, frontend descriptor (excluded from "create
provider" wizard since auto-managed)
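
The two anti-spam mechanisms described above can be sketched as follows. This is an illustrative sketch under assumptions, not the code in `services.bridge_self`: `should_alert_on_failure` is a hypothetical helper, and `record_backlog_state` is written only to match the behavior the tests in this commit describe (alert on the transition to "above", stay silent while it persists, re-arm after recovery).

```python
_backlog_above_threshold: dict[int, bool] = {}
_failure_counts: dict[int, int] = {}


def record_backlog_state(user_id: int, above: bool) -> bool:
    """Transition latch: True only when the state flips to 'above'."""
    was_above = _backlog_above_threshold.get(user_id, False)
    _backlog_above_threshold[user_id] = above
    return above and not was_above


def should_alert_on_failure(subject_id: int, threshold: int = 3) -> bool:
    """Hypothetical counter: alert at the threshold, then reset.

    Resetting after emission means a permanently failing subject alerts
    once per `threshold` failures rather than on every failure.
    """
    count = _failure_counts.get(subject_id, 0) + 1
    if count >= threshold:
        _failure_counts[subject_id] = 0
        return True
    _failure_counts[subject_id] = count
    return False
```

The latch suits a level-style signal such as backlog size, while the resetting counter suits discrete failure events.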
Operator-visible behavior changes (call out in release notes):
- NOTIFY_BRIDGE_TELEGRAM_WEBHOOK_SECRET now REQUIRED for webhook mode
- Existing webhook providers with auth_mode="none" need explicit opt-in
- Generic webhook endpoint rate-limited 60/min per source IP
- HA disconnect/reconnect writes ha_status_* EventLog rows
- Every user gets a bridge_self provider — wire it to a target to
receive failure alerts
Pre-existing test failures (test_ssrf, test_release_provider) on
Python 3.13 are unrelated; CI runs on 3.12.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -0,0 +1,265 @@
"""Tests for the bridge self-monitoring provider.

Covers:
1. ``build_event`` parses a well-formed payload and rejects malformed ones.
2. The threshold-crossing helpers in ``services.bridge_self`` only emit on
   the actual crossing, not on every increment afterwards (anti-spam).
3. ``ensure_bridge_self_provider_for_user`` creates exactly one provider
   per user and is idempotent on re-run.
4. The capability registry exposes the new event/slot definitions.
"""

from __future__ import annotations

from datetime import datetime, timezone

import pytest
from sqlmodel import SQLModel, select
from sqlmodel.ext.asyncio.session import AsyncSession
from sqlalchemy.ext.asyncio import create_async_engine


# ---------------------------------------------------------------------------
# Event parser
# ---------------------------------------------------------------------------


def test_build_event_well_formed_payload() -> None:
    from notify_bridge_core.providers.bridge_self.event_parser import build_event
    from notify_bridge_core.models.events import EventType
    from notify_bridge_core.providers.base import ServiceProviderType

    payload = {
        "failure_type": "poll_failures",
        "subject_id": 7,
        "subject_name": "My Tracker",
        "count": 3,
        "threshold": 3,
        "last_error": "Timeout",
        "details": {"tracker_id": 7},
    }
    when = datetime(2026, 5, 16, 10, 0, tzinfo=timezone.utc)
    event = build_event(payload, timestamp=when)

    assert event is not None
    assert event.event_type == EventType.BRIDGE_SELF_POLL_FAILURES
    assert event.provider_type == ServiceProviderType.BRIDGE_SELF
    assert event.collection_id == "7"
    assert event.collection_name == "My Tracker"
    assert event.timestamp == when
    assert event.extra["count"] == 3
    assert event.extra["threshold"] == 3
    assert event.extra["last_error"] == "Timeout"
    assert event.extra["failure_type"] == "poll_failures"
    assert event.extra["details"] == {"tracker_id": 7}


def test_build_event_unknown_failure_type_returns_none() -> None:
    from notify_bridge_core.providers.bridge_self.event_parser import build_event

    assert build_event({"failure_type": "rocket_launch"}) is None


def test_build_event_non_dict_payload_returns_none() -> None:
    from notify_bridge_core.providers.bridge_self.event_parser import build_event

    assert build_event("not a dict") is None  # type: ignore[arg-type]
    assert build_event(None) is None  # type: ignore[arg-type]


def test_build_event_clamps_long_error_messages() -> None:
    from notify_bridge_core.providers.bridge_self.event_parser import (
        build_event, _MAX_ERROR_LEN,
    )

    huge = "X" * (_MAX_ERROR_LEN * 5)
    event = build_event({
        "failure_type": "target_failures",
        "subject_id": 1,
        "subject_name": "t",
        "count": 5,
        "threshold": 5,
        "last_error": huge,
    })
    assert event is not None
    assert len(event.extra["last_error"]) <= _MAX_ERROR_LEN


# ---------------------------------------------------------------------------
# Threshold-crossing counters
# ---------------------------------------------------------------------------


def test_record_poll_failure_increments_then_success_resets() -> None:
    from notify_bridge_server.services import bridge_self as bs

    # Use a tracker_id we know is unique to this test to avoid pollution
    # across tests sharing the module-level dicts.
    tid = 9_001
    bs.reset_poll_counter(tid)

    assert bs.record_poll_failure(tid, "boom") == 1
    assert bs.record_poll_failure(tid, "boom") == 2
    assert bs.record_poll_failure(tid, "boom") == 3
    assert bs.get_poll_failure_count(tid) == 3
    assert bs.get_poll_last_error(tid) == "boom"

    bs.record_poll_success(tid)
    assert bs.get_poll_failure_count(tid) == 0
    assert bs.get_poll_last_error(tid) == ""


def test_record_target_failure_increments_then_success_resets() -> None:
    from notify_bridge_server.services import bridge_self as bs

    tid = 9_101
    bs.reset_target_counter(tid)

    assert bs.record_target_failure(tid, "503") == 1
    assert bs.record_target_failure(tid, "503") == 2
    assert bs.get_target_failure_count(tid) == 2

    bs.record_target_success(tid)
    assert bs.get_target_failure_count(tid) == 0


def test_backlog_state_only_emits_on_crossing() -> None:
    """Only the False -> True transition should report a crossing.

    A sustained backlog must not re-fire on every scan, and a recovered
    backlog re-arms the latch so the next crossing is reported again.
    """
    from notify_bridge_server.services import bridge_self as bs

    user_id = 9_201
    # Reset latch by going through a False reading first.
    bs._backlog_above_threshold.pop(user_id, None)

    # Initial above-threshold reading IS a crossing (None -> True latch).
    assert bs.record_backlog_state(user_id, True) is True
    # Sustained above — no second alert.
    assert bs.record_backlog_state(user_id, True) is False
    assert bs.record_backlog_state(user_id, True) is False
    # Drop below — no alert (we don't notify on recovery).
    assert bs.record_backlog_state(user_id, False) is False
    # Cross again — alert.
    assert bs.record_backlog_state(user_id, True) is True


# ---------------------------------------------------------------------------
# ensure_bridge_self_provider_for_user — DB roundtrip
# ---------------------------------------------------------------------------


@pytest.fixture
async def session() -> AsyncSession:
    """Fresh in-memory DB with the SQLModel schema applied."""
    engine = create_async_engine("sqlite+aiosqlite:///:memory:")
    async with engine.begin() as conn:
        await conn.run_sync(SQLModel.metadata.create_all)
    async with AsyncSession(engine) as session:
        yield session
    await engine.dispose()


@pytest.mark.asyncio
async def test_ensure_bridge_self_provider_creates_once(session: AsyncSession) -> None:
    from notify_bridge_server.database.models import ServiceProvider, User
    from notify_bridge_server.database.seeds import (
        ensure_bridge_self_provider_for_user,
    )

    # Create a real user.
    user = User(username="alice", hashed_password="x", role="user")
    session.add(user)
    await session.commit()
    await session.refresh(user)
    user_id = user.id

    p1 = await ensure_bridge_self_provider_for_user(session, user_id)
    assert p1 is not None
    p1_id = p1.id
    assert p1.type == "bridge_self"
    assert p1.user_id == user_id
    assert p1.config["poll_failure_threshold"] == 3
    assert p1.config["deferred_backlog_threshold"] == 100
    assert p1.config["target_failure_threshold"] == 5
    await session.commit()

    # Idempotent: second call returns the same row, no duplicates.
    p2 = await ensure_bridge_self_provider_for_user(session, user_id)
    assert p2 is not None
    assert p2.id == p1_id
    await session.commit()

    rows = (
        await session.exec(
            select(ServiceProvider).where(
                ServiceProvider.user_id == user_id,
                ServiceProvider.type == "bridge_self",
            )
        )
    ).all()
    assert len(rows) == 1


@pytest.mark.asyncio
async def test_ensure_bridge_self_provider_skips_system_user(session: AsyncSession) -> None:
    """user_id <= 0 is the __system__ placeholder — never gets a provider."""
    from notify_bridge_server.database.seeds import (
        ensure_bridge_self_provider_for_user,
    )

    result = await ensure_bridge_self_provider_for_user(session, 0)
    assert result is None


# ---------------------------------------------------------------------------
# Capability registry
# ---------------------------------------------------------------------------


def test_capability_registry_lists_bridge_self() -> None:
    from notify_bridge_core.providers.capabilities import (
        get_capabilities, get_all_capabilities,
    )

    caps = get_capabilities("bridge_self")
    assert caps is not None
    assert caps.provider_type == "bridge_self"
    assert caps.webhook_based is False

    event_names = {e["name"] for e in caps.events}
    assert event_names == {
        "bridge_self_poll_failures",
        "bridge_self_deferred_backlog",
        "bridge_self_target_failures",
    }

    slot_names = {s["name"] for s in caps.notification_slots}
    assert slot_names == {
        "message_bridge_self_poll_failures",
        "message_bridge_self_deferred_backlog",
        "message_bridge_self_target_failures",
    }

    # And it shows up in the global registry.
    assert "bridge_self" in get_all_capabilities()


def test_default_template_loader_returns_bridge_self_slots() -> None:
    """All three bridge_self slots have shipped Jinja2 default templates."""
    from notify_bridge_core.templates.defaults.loader import load_default_templates

    en = load_default_templates("en", "bridge_self")
    ru = load_default_templates("ru", "bridge_self")
    expected = {
        "message_bridge_self_poll_failures",
        "message_bridge_self_deferred_backlog",
        "message_bridge_self_target_failures",
    }
    assert set(en.keys()) == expected
    assert set(ru.keys()) == expected
    # Sanity: each template references at least one of the bridge_self vars.
    for tpl in list(en.values()) + list(ru.values()):
        assert "{{" in tpl