feat(alerts): metric-threshold alerting (backend + API)

Operators can define metric-threshold alert rules (cpu_percent,
memory_percent, memory_bytes; gt/lt) per-workload or global via
/api/metric-alert-rules. A periodic evaluator (internal/metricalert,
30s tick) checks the freshest container stats sample per container
against enabled rules and, on breach (per-rule-per-workload cooldown),
emits into the existing event_log + bus pipeline (source "metric_alert",
workload_id set). Alerts therefore surface on the global events page,
the per-app activity timeline, and any configured event-trigger webhook
-- no new notification plumbing.

Mirrors the log_scan_rules store/API/route patterns and the
stats.Collector lifecycle. Rule CRUD reads are authed, mutations
AdminOnly. Frontend rule-config UI is a follow-up phase.

Reviewed: go APPROVE (0 CRITICAL/HIGH).
This commit is contained in:
2026-05-29 14:06:23 +03:00
parent 5c17885197
commit cdb9fd57d1
11 changed files with 1299 additions and 0 deletions
+33
View File
@@ -277,6 +277,39 @@ const (
LogScanSeverityError = "error"
)
// MetricAlertRule fires an event when a container metric breaches a
// threshold. Mirrors LogScanRule but evaluated against stats_samples
// instead of log lines.
type MetricAlertRule struct {
ID int64 `json:"id"`
WorkloadID string `json:"workload_id"` // "" = applies to all workloads
Name string `json:"name"`
Metric string `json:"metric"` // cpu_percent | memory_percent | memory_bytes
Comparator string `json:"comparator"` // gt | lt
Threshold float64 `json:"threshold"`
Severity string `json:"severity"` // info | warn | error
CooldownSeconds int `json:"cooldown_seconds"` // min seconds between fires per (rule,workload)
Enabled bool `json:"enabled"`
CreatedAt string `json:"created_at"`
UpdatedAt string `json:"updated_at"`
}
// Metric-alert metric identifiers. cpu_percent + memory_percent are
// 0100 ratios; memory_bytes is an absolute usage figure. Validated in
// the store on create/update.
const (
MetricCPUPercent = "cpu_percent"
MetricMemoryPercent = "memory_percent"
MetricMemoryBytes = "memory_bytes"
)
// Metric-alert comparators. gt fires when the value exceeds the
// threshold; lt when it falls below.
const (
MetricComparatorGT = "gt"
MetricComparatorLT = "lt"
)
// WorkloadKind enumerates the legacy discriminator values written into
// containers.workload_kind and workloads.kind. After the hard cutover the
// backing project / stack / static_site tables are gone — these constants