feat(alerts): metric-threshold alerting (backend + API)

Operators can define metric-threshold alert rules (cpu_percent,
memory_percent, memory_bytes; gt/lt) per-workload or global via
/api/metric-alert-rules. A periodic evaluator (internal/metricalert,
30s tick) checks each container's freshest stats sample against the
enabled rules. On breach, subject to a per-rule, per-workload cooldown,
it emits into the existing event_log + bus pipeline (source
"metric_alert", workload_id set). Alerts therefore surface on the global
events page, the per-app activity timeline, and any configured
event-trigger webhook, with no new notification plumbing.
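
A minimal sketch of the breach check with a per-rule, per-workload
cooldown, to make the evaluation semantics concrete. The Rule fields,
the Evaluator type, and ShouldFire are hypothetical names chosen for
illustration; the actual internal/metricalert types are not shown in
this excerpt and may differ.

```go
package main

import (
	"fmt"
	"time"
)

// Rule is a hypothetical rule shape; the real schema is not shown here.
type Rule struct {
	ID        int64
	Metric    string // "cpu_percent", "memory_percent", or "memory_bytes"
	Op        string // "gt" or "lt"
	Threshold float64
	Cooldown  time.Duration
}

// breached reports whether value crosses the rule's threshold.
func breached(r Rule, value float64) bool {
	switch r.Op {
	case "gt":
		return value > r.Threshold
	case "lt":
		return value < r.Threshold
	default:
		return false // unknown operators never fire
	}
}

// cooldownKey scopes the cooldown to one rule on one workload.
type cooldownKey struct {
	ruleID     int64
	workloadID string
}

// Evaluator remembers when each (rule, workload) pair last fired.
type Evaluator struct {
	lastFired map[cooldownKey]time.Time
}

func NewEvaluator() *Evaluator {
	return &Evaluator{lastFired: make(map[cooldownKey]time.Time)}
}

// ShouldFire reports whether an alert should be emitted for this sample.
// It returns true only on a breach outside the pair's cooldown window,
// and records the firing time when it does.
func (e *Evaluator) ShouldFire(r Rule, workloadID string, value float64, now time.Time) bool {
	if !breached(r, value) {
		return false
	}
	k := cooldownKey{ruleID: r.ID, workloadID: workloadID}
	if last, ok := e.lastFired[k]; ok && now.Sub(last) < r.Cooldown {
		return false
	}
	e.lastFired[k] = now
	return true
}

// runScenario replays four samples against one rule and returns each decision.
func runScenario() []bool {
	e := NewEvaluator()
	r := Rule{ID: 1, Metric: "cpu_percent", Op: "gt", Threshold: 90, Cooldown: 5 * time.Minute}
	t0 := time.Date(2026, 1, 1, 0, 0, 0, 0, time.UTC)
	return []bool{
		e.ShouldFire(r, "app-a", 95, t0),                     // breach: fires
		e.ShouldFire(r, "app-a", 97, t0.Add(30*time.Second)), // same pair, in cooldown
		e.ShouldFire(r, "app-b", 93, t0.Add(30*time.Second)), // other workload: fires
		e.ShouldFire(r, "app-a", 96, t0.Add(6*time.Minute)),  // cooldown elapsed: fires
	}
}

func main() {
	fmt.Println(runScenario()) // prints [true false true true]
}
```

Keying the cooldown on the (rule, workload) pair means a global rule
that trips on one noisy workload does not silence the same rule for the
others.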

Mirrors the log_scan_rules store/API/route patterns and the
stats.Collector lifecycle. Rule reads require authentication; mutations
are AdminOnly. The frontend rule-config UI is a follow-up phase.
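
The diff calls metricAlertMgr.Stop() both via defer and in the explicit
shutdown path, so Stop presumably tolerates a second call. The sketch
below shows one way a ticker-driven Start/Stop lifecycle can be made
idempotent. Manager, NewManager, and the evaluate callback are
hypothetical stand-ins, not the actual metricalert or stats.Collector
code.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// Manager is a hypothetical stand-in for a periodic evaluator.
type Manager struct {
	interval time.Duration
	evaluate func()
	stopOnce sync.Once
	stopCh   chan struct{}
	done     chan struct{}
}

func NewManager(interval time.Duration, evaluate func()) *Manager {
	return &Manager{
		interval: interval,
		evaluate: evaluate,
		stopCh:   make(chan struct{}),
		done:     make(chan struct{}),
	}
}

// Start launches the evaluation loop in a goroutine. Call it once.
func (m *Manager) Start() {
	go func() {
		defer close(m.done)
		t := time.NewTicker(m.interval)
		defer t.Stop()
		for {
			select {
			case <-t.C:
				m.evaluate()
			case <-m.stopCh:
				return
			}
		}
	}()
}

// Stop signals the loop to exit and waits for it to finish.
// It must follow Start, and is safe to call more than once:
// sync.Once guards the close, and a closed done channel never blocks.
func (m *Manager) Stop() {
	m.stopOnce.Do(func() { close(m.stopCh) })
	<-m.done
}

// demoLifecycle runs a fast-ticking manager, stops it twice,
// and returns how many evaluations ran.
func demoLifecycle() int64 {
	var ticks int64
	m := NewManager(time.Millisecond, func() { atomic.AddInt64(&ticks, 1) })
	m.Start()
	time.Sleep(50 * time.Millisecond)
	m.Stop()
	m.Stop() // second call is a no-op
	return atomic.LoadInt64(&ticks)
}

func main() {
	fmt.Println("evaluations before stop:", demoLifecycle())
}
```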

Reviewed: go APPROVE (0 CRITICAL/HIGH).
2026-05-29 14:06:23 +03:00
parent 5c17885197
commit cdb9fd57d1
11 changed files with 1299 additions and 0 deletions
@@ -28,6 +28,7 @@ import (
 	"github.com/alexei/tinyforge/internal/health"
 	"github.com/alexei/tinyforge/internal/logging"
 	"github.com/alexei/tinyforge/internal/logscanner"
+	"github.com/alexei/tinyforge/internal/metricalert"
 	"github.com/alexei/tinyforge/internal/notify"
 	"github.com/alexei/tinyforge/internal/npm"
 	"github.com/alexei/tinyforge/internal/proxy"
@@ -390,6 +391,14 @@ func main() {
 	}
 	defer logScanMgr.Stop()
 
+	// Metric-alert manager: evaluates threshold rules against recent
+	// container stats samples and emits event_log entries on breach.
+	// The store satisfies RuleSource/SampleSource/EventSink; the event
+	// bus is the Publisher.
+	metricAlertMgr := metricalert.New(db, db, db, eventBus)
+	metricAlertMgr.Start()
+	defer metricAlertMgr.Stop()
+
 	// Build API server.
 	apiServer := api.NewServer(db, dockerClient, npmClient, proxyProvider, dep, notifier, webhookHandler, eventBus, encKey)
 	apiServer.SetStaleScanner(staleScanner)
@@ -451,6 +460,7 @@ func main() {
 	eventBus.Unsubscribe(notifySub)
 	staleScanner.Stop()
 	statsCollector.Stop()
+	metricAlertMgr.Stop()
 
 	// Drain in-progress deploys and notifications.
 	dep.Drain()