feat: production-readiness hardening across security, async, DB, ops
Build and Test / test-frontend (push) Successful in 9m37s
Build and Test / test-backend (push) Successful in 10m53s
Build and Test / build-image (push) Failing after 14m52s

Security
- SSRF: async DNS resolver; allow_redirects=False on all outbound clients;
  matrix homeserver_url validated on create/update/test; update_provider
  and email_bot merge incoming config and reject ***-masked secrets.
- Auth: bcrypt offloaded to asyncio.to_thread; JWT now carries iss/aud +
  leeway and rejects missing claims; setup TOCTOU closed inside a
  transaction; rate limits extended (default 600/min, 10/min on password
  change, 30/min on needs-setup); constant-time login to prevent username
  enumeration.
- Config: rejects known dev secret keys; validates CORS origin schemes,
  port range, token lifetimes.
- Webhook handlers stream-read body with a 1 MiB cap; Discord 429 retries
  bounded (3 attempts, Retry-After capped at 60 s).
- CSP + HSTS added to SecurityHeadersMiddleware.

Async / runtime
- SQLite engine: WAL, synchronous=NORMAL, foreign_keys=ON, busy_timeout,
  pool_pre_ping, dispose on shutdown.
- Lifespan shutdown now stops scheduler before closing HTTP session and
  disposing the engine.
- Shared aiohttp session locked against concurrent first-caller races;
  core NotificationDispatcher accepts and reuses it.
- Storage and scheduled backup writes wrapped in asyncio.to_thread.
- NUT client writes bounded by asyncio.wait_for.
- Telegram poller switched from 3 s short-poll to 30 s interval + 25 s
  long-poll (~10x fewer API calls).

Database
- New performance-indexes migration covers every FK/owner column and
  hot-path composite (notification_tracker(provider_id, enabled);
  event_log(user_id, created_at DESC); webhook_payload_log(provider_id,
  created_at DESC); action_execution(action_id, started_at DESC)).
- New schema_version table for future upgrade gating.
- __system__ placeholder user (id=0) seeded so user_id=0 system defaults
  satisfy the newly enforced FK; filtered out of /auth/needs-setup,
  /api/users, and setup.
- list_notification_trackers rewritten to batched loads (was 1+N+N*M).
- Retention job extended to event_log, webhook_payload_log, and
  action_execution; retention days exposed as a setting.

Scheduler
- AsyncIOScheduler job_defaults: coalesce, misfire_grace_time=300,
  max_instances=1.

Ops
- uvicorn runs with proxy_headers, forwarded_allow_ips,
  timeout_graceful_shutdown; access log suppressed in non-debug.
- FastAPI version string now reads from importlib.metadata.
- New /api/ready endpoint separate from /api/health.
- docker-compose drops the ALLOW_PRIVATE_URLS=1 default, adds mem/cpu/pid
  limits, read_only + tmpfs, cap_drop:ALL, no-new-privileges; healthcheck
  targets /api/ready.
- CI now runs on push/PR with backend pytest, frontend svelte-check +
  build, and a non-push image build; release workflow gated on tests,
  publishes immutable sha-<commit> image tag, adds Trivy scan.

Tests
- New packages/server/tests/ with 29 passing tests: config validation,
  JWT round-trip + aud/alg=none rejection, SSRF scheme and private-range
  enforcement (sync + async), Discord bounded retry, and a lifespan-level
  /api/health + /api/ready smoke check.
- Renamed the misnamed services/test_dispatch.py to manual_dispatch.py so
  pytest never auto-collects production code.

Frontend
- /login now redirects already-authenticated users to /, shows a distinct
  'backend unreachable' banner (en/ru) when /auth/needs-setup fails.
This commit is contained in:
2026-04-23 19:44:56 +03:00
parent f50d465c0e
commit 920920bc67
44 changed files with 1426 additions and 186 deletions
@@ -387,10 +387,9 @@ async def export_backup_to_file(
ts = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H-%M-%S")
filename = f"backup-{ts}.json"
filepath = backup_dir / filename
filepath.write_text(
json.dumps(backup.model_dump(), indent=2, ensure_ascii=False),
encoding="utf-8",
)
import asyncio as _asyncio
payload = json.dumps(backup.model_dump(), indent=2, ensure_ascii=False)
await _asyncio.to_thread(filepath.write_text, payload, encoding="utf-8")
_LOGGER.info("Scheduled backup saved: %s", filepath)
return filepath
@@ -399,7 +398,13 @@ def cleanup_old_backups(backup_dir: Path, keep: int = 5) -> list[str]:
"""Delete oldest backup files exceeding `keep` count. Returns deleted filenames."""
if not backup_dir.is_dir():
return []
files = sorted(backup_dir.glob("backup-*.json"), key=lambda f: f.name, reverse=True)
# Sort by mtime (newest first) so behavior doesn't depend on the filename
# timestamp format, which could change later without updating this code.
files = sorted(
backup_dir.glob("backup-*.json"),
key=lambda f: f.stat().st_mtime,
reverse=True,
)
deleted = []
for old in files[keep:]:
old.unlink()
@@ -413,7 +418,13 @@ def list_backup_files(backup_dir: Path) -> list[dict[str, Any]]:
"""List backup files in the directory with metadata."""
if not backup_dir.is_dir():
return []
files = sorted(backup_dir.glob("backup-*.json"), key=lambda f: f.name, reverse=True)
# Sort by mtime (newest first) so behavior doesn't depend on the filename
# timestamp format, which could change later without updating this code.
files = sorted(
backup_dir.glob("backup-*.json"),
key=lambda f: f.stat().st_mtime,
reverse=True,
)
result = []
for f in files:
stat = f.stat()
@@ -11,23 +11,36 @@ Call ``close_http_session()`` once during application shutdown.
from __future__ import annotations
import asyncio
import aiohttp
_DEFAULT_TIMEOUT = aiohttp.ClientTimeout(total=30)
_DEFAULT_TIMEOUT = aiohttp.ClientTimeout(total=30, connect=10)
_session: aiohttp.ClientSession | None = None
_lock = asyncio.Lock()
async def get_http_session() -> aiohttp.ClientSession:
"""Get or create the shared HTTP session."""
"""Get or create the shared HTTP session.
Concurrent first-callers are serialized through ``_lock`` so we never
leak a second ClientSession / connector pair. Once established, hot
callers skip the lock via the fast-path check.
"""
global _session
if _session is None or _session.closed:
_session = aiohttp.ClientSession(timeout=_DEFAULT_TIMEOUT)
if _session is not None and not _session.closed:
return _session
async with _lock:
if _session is None or _session.closed:
_session = aiohttp.ClientSession(timeout=_DEFAULT_TIMEOUT)
return _session
async def close_http_session() -> None:
"""Close the shared HTTP session (call on app shutdown)."""
global _session
if _session is not None and not _session.closed:
await _session.close()
async with _lock:
if _session is not None and not _session.closed:
await _session.close()
_session = None
@@ -85,7 +85,21 @@ def _compute_jitter(interval_seconds: int) -> int:
def get_scheduler() -> AsyncIOScheduler:
global _scheduler
if _scheduler is None:
_scheduler = AsyncIOScheduler()
# Sensible production defaults applied to every job unless overridden:
# * coalesce — collapse a queue of missed runs into one firing after
# a restart / pause, instead of bursting to catch up.
# * misfire_grace_time — accept firings up to 5 min late without
# dropping them silently.
# * max_instances=1 — never run two copies of the same tracker tick
# concurrently; the scheduler already enforces this on add_job,
# but we also set it as the default for safety.
_scheduler = AsyncIOScheduler(
job_defaults={
"coalesce": True,
"misfire_grace_time": 300,
"max_instances": 1,
},
)
return _scheduler
@@ -279,21 +293,38 @@ async def _refresh_telegram_chat_titles() -> None:
async def _cleanup_old_events() -> None:
"""Delete EventLog entries older than 90 days."""
"""Delete EventLog / WebhookPayloadLog / ActionExecution rows older than the
configured retention window. A retention of 0 disables the job.
"""
from datetime import datetime, timedelta, timezone
from sqlmodel import delete
from sqlmodel.ext.asyncio.session import AsyncSession
from ..config import settings
from ..database.engine import get_engine
from ..database.models import EventLog
from ..database.models import ActionExecution, EventLog, WebhookPayloadLog
cutoff = datetime.now(timezone.utc) - timedelta(days=90)
days = settings.event_log_retention_days
if days <= 0:
_LOGGER.debug("Event log retention disabled (days=0); skipping cleanup")
return
cutoff = datetime.now(timezone.utc) - timedelta(days=days)
engine = get_engine()
async with AsyncSession(engine) as session:
await session.exec(delete(EventLog).where(EventLog.created_at < cutoff))
await session.exec(
delete(WebhookPayloadLog).where(WebhookPayloadLog.created_at < cutoff)
)
await session.exec(
delete(ActionExecution).where(ActionExecution.started_at < cutoff)
)
await session.commit()
_LOGGER.info("Cleaned up event log entries older than %s", cutoff.date())
_LOGGER.info(
"Cleaned event_log / webhook_payload_log / action_execution older than %s",
cutoff.date(),
)
async def _load_tracker_jobs() -> None:
@@ -127,7 +127,14 @@ async def stop_bot_if_unused(bot_id: int) -> None:
def schedule_bot_polling(bot_id: int) -> None:
"""Add a polling job for a bot (idempotent)."""
"""Add a polling job for a bot (idempotent).
We schedule at a 30 s interval, but each tick calls ``getUpdates`` with
``timeout=25`` — Telegram holds the connection open until either an
update arrives or the timeout elapses, so in practice the bot streams
updates with sub-second latency while consuming ~2 API calls / minute
per bot (down from 20 under the old 3 s short-poll).
"""
scheduler = get_scheduler()
job_id = f"telegram_poll_{bot_id}"
if scheduler.get_job(job_id):
@@ -135,13 +142,13 @@ def schedule_bot_polling(bot_id: int) -> None:
scheduler.add_job(
_poll_bot,
"interval",
seconds=3,
seconds=30,
id=job_id,
args=[bot_id],
replace_existing=True,
max_instances=1,
)
_LOGGER.info("Started polling for bot %d", bot_id)
_LOGGER.info("Started polling for bot %d (long-poll, 25s timeout)", bot_id)
def unschedule_bot_polling(bot_id: int) -> None:
@@ -233,8 +240,10 @@ async def _poll_bot(bot_id: int) -> None:
from .http_session import get_http_session
http = await get_http_session()
client = TelegramClient(http, bot_token)
# Long-poll: hold connection open until an update arrives or 25 s
# elapse. Drastically cuts API calls vs. 3 s short-poll.
result = await client.get_updates(
offset=offset + 1 if offset else None, limit=50,
offset=offset + 1 if offset else None, limit=50, timeout=25,
)
if not result.get("success"):
err_text = str(result.get("error") or "")
@@ -369,7 +369,13 @@ async def check_tracker(tracker_id: int) -> dict[str, Any]:
if events and link_data:
url_cache, asset_cache = await _get_telegram_caches()
dispatcher = NotificationDispatcher(url_cache=url_cache, asset_cache=asset_cache)
from .http_session import get_http_session
shared_session = await get_http_session()
dispatcher = NotificationDispatcher(
url_cache=url_cache,
asset_cache=asset_cache,
session=shared_session,
)
for event in events:
_LOGGER.info(
"Dispatching event %s for %s (added=%d removed=%d)",