web-app-launcher/src/lib/server/services/metricsService.ts
alexei.dolgolyov f1cfb61d13
feat: production hardening + password reset, metrics, signed webhooks
Security hardening (CRITICAL/HIGH from production-readiness audit):
- Require strong JWT_SECRET + separate INTEGRATION_ENCRYPTION_KEY at boot;
  refuse placeholder defaults. Integration key now derived via HKDF.
- SSRF guard (src/lib/server/utils/safeFetch.ts): DNS-resolves and rejects
  RFC1918/loopback/link-local/IPv4-mapped IPv6/decimal-IP/cloud-metadata.
  Manual redirect handling re-validates each 3xx Location hop. Applied to
  healthcheck, RSS, calendar, metric, system-stats, camera, notifications,
  discovery, apps/preview, and all integration clients.
- API tokens, session refresh tokens, invite tokens, password-reset tokens
  switched from bcrypt to sha256 with @unique indexed lookup (O(1) instead
  of O(N) bcrypt-compares; eliminates a trivial DoS).
- Refresh-token reuse detection via Session.previousTokenHash.
- Permission checks on App PATCH/DELETE and Widget/Section endpoints.
- /api/integrations/alerts now requires auth.
- SVG uploads sanitized through DOMPurify (svg profile, scheme allow-list).
- Custom CSS sanitizer + selector scoping (decodes CSS unicode escapes
  before pattern match, drops forbidden at-rules incl. @import without
  whitespace, strips dangerous url() args). Scoped to .custom-css-scope.
- Backup restore validates SQLite magic header, takes a safety snapshot,
  uses atomic rename, re-applies pragmas.
- SQLite WAL + busy_timeout + foreign_keys + synchronous=NORMAL at startup.
- Healthcheck scheduler was dead code; wired in hooks.server.ts with
  HMR-safe singleton, concurrency cap, overlap prevention, retention jobs
  for AppClick/Notification/AuditLog. Composite indexes added on hot paths.
- Security headers (CSP, HSTS-on-https, X-Frame-Options, Permissions-Policy)
  emitted on every response.
- Account-enumeration mitigation on login (dummy bcrypt on no-user/oauth
  branches) + rate limiting on login/register/onboarding/refresh/invite/
  password-reset.
- OAuth callback sanitizes IdP error_description before echoing.
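
The sha256 token lookup described above can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: the helper names (generateToken, hashToken, storeToken, findToken) are hypothetical, and a Map stands in for a database table with a unique index on the hash column. A fast hash is reasonable here because the tokens are long random values rather than user-chosen passwords.

```typescript
import { createHash, randomBytes } from 'node:crypto';

// Hypothetical helper: 32 random bytes, hex-encoded (64 characters).
function generateToken(): string {
	return randomBytes(32).toString('hex');
}

// Only the hash is stored; the raw token is shown to the user once.
function hashToken(token: string): string {
	return createHash('sha256').update(token, 'utf8').digest('hex');
}

// Assumption: stand-in for a table with a unique index on tokenHash.
const tokensByHash = new Map<string, { userId: string }>();

function storeToken(token: string, userId: string): void {
	tokensByHash.set(hashToken(token), { userId });
}

// One hash plus one indexed lookup, instead of comparing the
// presented token against every stored bcrypt hash in turn.
function findToken(token: string): { userId: string } | undefined {
	return tokensByHash.get(hashToken(token));
}
```

Because sha256 is deterministic, the hash itself serves as the lookup key, which a salted bcrypt hash cannot do.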

New features:
- Custom +error.svelte pages (root + boards + admin) via shared
  ErrorState component. Inverted hierarchy (status as label, title as hero).
- /forgot-password + /reset-password + admin-mediated /admin/password-resets
  page. SHA256 tokens, 24h TTL, all sessions revoked on apply.
- /invite page for manual invite-token redemption.
- /api/metrics Prometheus exposition with optional METRICS_TOKEN bearer
  auth. Counters for login/healthcheck/notification/integration; gauges
  for users/boards/apps + per-status app counts.
- Webhook HMAC-SHA256 signing for HTTP notification channels (optional
  shared secret + configurable signature header, default X-Signature-256).
- PATCH /api/users/me/password for self-service password change.
- Persistent uploads at /app/data/uploads with served-from-volume handler
  at /uploads/[...path]. SVGs served with CSP: sandbox.
- /api/health does a DB ping; returns 503 on disconnect.
- Public /status filtered to guest-accessible-board apps when unauthenticated.
- Audit log coverage: LOGIN_SUCCESS/FAILED, LOGOUT, OAUTH_LOGIN,
  OAUTH_USER_PROVISIONED, SESSION_REVOKED, API_TOKEN_*, INVITE_*,
  APP_UPDATED, PASSWORD_CHANGED, PASSWORD_RESET_*.
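
The webhook signing mentioned above might look something like the sketch below. It assumes a "sha256=<hex>" header value, which the commit message does not specify, and the function names signPayload and verifySignature are illustrative only. The receiver side is included to show why the signature must be computed over the exact request body.

```typescript
import { createHmac, timingSafeEqual } from 'node:crypto';

// Assumption: the header value has the form "sha256=<hex digest>".
function signPayload(secret: string, body: string): string {
	const digest = createHmac('sha256', secret).update(body, 'utf8').digest('hex');
	return `sha256=${digest}`;
}

// Receiver side: recompute the signature and compare in constant time.
function verifySignature(secret: string, body: string, header: string): boolean {
	const expected = Buffer.from(signPayload(secret, body), 'utf8');
	const received = Buffer.from(header, 'utf8');
	// timingSafeEqual throws on length mismatch, so check lengths first.
	return expected.length === received.length && timingSafeEqual(expected, received);
}

// Example: headers for an outgoing notification request.
const exampleBody = JSON.stringify({ event: 'app_offline', app: 'example' });
const exampleHeaders = {
	'Content-Type': 'application/json',
	'X-Signature-256': signPayload('shared-secret', exampleBody)
};
console.log(exampleHeaders['X-Signature-256'].startsWith('sha256='));
```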

Performance:
- Board page: removed double findAll() over-fetch; include links + appTags
  in board query; widgets lazy-loaded via dynamic imports (marked,
  DOMPurify, hls.js, integration renderers).
- uptimeService.getAllAppsUptime: single batched query instead of N+1.
- 30s in-memory user-locals cache; invalidated on user mutation.
- pruneOldStatuses: single window-function DELETE instead of N+1.

Code quality:
- Typed error classes (NotFoundError, PermissionError, RateLimitError,
  IntegrationError) with toHttpError mapper.
- Locals.user shape exposes avatarUrl and narrows role via guard.
- App input types derived from Zod schemas via z.infer.
- 274 tests passing (up from 212); 62 new tests covering SSRF guard,
  CSS sanitizer, SVG sanitizer, rate limiter.

CI / Docker / config:
- Test workflow adds build, docker-build, audit jobs. Release workflow
  uses buildx multi-arch (amd64+arm64) with provenance + SBOM.
- Dockerfile uses tini, multi-stage prune, persistent uploads dir, single
  prisma migrate deploy (no destructive db push fallback).
- docker-compose: JWT_SECRET + INTEGRATION_ENCRYPTION_KEY required at
  startup, log rotation, resource limits.
- README documents breaking-change upgrade path.

Bug fixes from UI/UX review:
- ~55 missing i18n keys added to en/ru (auth flows, error pages, admin
  nav, register invite banner, settings.card_style).
- Hardcoded English on login replaced with $t('auth.remember_me').
- Admin nav uses i18n keys; mobile horizontal-scroll layout.
- Page <title> tags standardized.
- Password-resets: separated error/info/success surfaces, ConfirmDialog
  replaces window.confirm.
- Auth pages have matching lucide icon badges.
- Webhook secret has eye toggle and monospace input.
- text-green-500 → text-emerald-500 to match codebase convention.

Pre-existing CI lint failures cleaned up (31 errors → 0): each-key
attributes added, unused-svelte-ignore comments removed, two any casts
typed, dead skeleton components removed, /boards/[id]/edit now redirects
to inline edit mode.

Tests: 274 / 274 passing
Type check: 0 errors / 0 warnings
Build: green
2026-05-26 19:51:21 +03:00


import { prisma } from '../prisma.js';
import { AppStatusValue } from '$lib/utils/constants.js';

/**
 * Tiny Prometheus-text metrics gatherer. Avoids the prom-client dependency
 * (~150KB + extra runtime memory) by emitting the exposition format directly.
 * If we later want histograms or counters with labels at high cardinality,
 * swap this out for prom-client.
 */
interface CounterSnapshot {
	readonly name: string;
	readonly help: string;
	readonly value: number;
	readonly labels?: Record<string, string>;
}

function escapeLabel(value: string): string {
	return value.replace(/\\/g, '\\\\').replace(/"/g, '\\"').replace(/\n/g, '\\n');
}

function renderLabels(labels?: Record<string, string>): string {
	if (!labels) return '';
	const parts = Object.entries(labels).map(([k, v]) => `${k}="${escapeLabel(v)}"`);
	return parts.length ? `{${parts.join(',')}}` : '';
}

/**
 * In-memory counter / gauge state. Process-local: Prometheus is expected to
 * scrape a single launcher instance (the app is SQLite-bound to one process
 * anyway). Reset on restart, like most lightweight setups.
 */
class MetricRegistry {
	private counters = new Map<string, number>();
	private gauges = new Map<string, number>();

	incCounter(name: string, by = 1): void {
		this.counters.set(name, (this.counters.get(name) ?? 0) + by);
	}

	setGauge(name: string, value: number): void {
		this.gauges.set(name, value);
	}

	getCounter(name: string): number {
		return this.counters.get(name) ?? 0;
	}

	snapshot(): { counters: Map<string, number>; gauges: Map<string, number> } {
		return { counters: new Map(this.counters), gauges: new Map(this.gauges) };
	}
}

export const metricRegistry = new MetricRegistry();

// Counter names: keep them ASCII identifiers (Prometheus naming rules).
export const Counters = {
	HEALTHCHECK_TOTAL: 'wal_healthcheck_total',
	HEALTHCHECK_FAILED: 'wal_healthcheck_failed_total',
	LOGIN_SUCCESS: 'wal_login_success_total',
	LOGIN_FAILED: 'wal_login_failed_total',
	NOTIFICATION_SENT: 'wal_notification_sent_total',
	NOTIFICATION_FAILED: 'wal_notification_failed_total',
	INTEGRATION_FETCH_TOTAL: 'wal_integration_fetch_total',
	INTEGRATION_FETCH_FAILED: 'wal_integration_fetch_failed_total'
} as const;

/**
 * Build the full exposition. Combines:
 * - process-local counters (login attempts, healthcheck ticks, etc.)
 * - DB-backed gauges (current online/offline app count, user count, etc.)
 */
export async function renderMetrics(): Promise<string> {
	const lines: string[] = [];

	// --- Static help/type lines + counter snapshots ---
	const COUNTER_HELP: Record<string, string> = {
		[Counters.HEALTHCHECK_TOTAL]: 'Total healthcheck ticks executed since process start',
		[Counters.HEALTHCHECK_FAILED]: 'Healthcheck ticks where any app returned offline',
		[Counters.LOGIN_SUCCESS]: 'Successful local logins since process start',
		[Counters.LOGIN_FAILED]: 'Failed local logins since process start',
		[Counters.NOTIFICATION_SENT]: 'Notification dispatch attempts',
		[Counters.NOTIFICATION_FAILED]: 'Notification dispatch failures',
		[Counters.INTEGRATION_FETCH_TOTAL]: 'Integration fetch attempts',
		[Counters.INTEGRATION_FETCH_FAILED]: 'Integration fetch failures'
	};
	const { counters } = metricRegistry.snapshot();
	for (const name of Object.values(Counters)) {
		const value = counters.get(name) ?? 0;
		lines.push(`# HELP ${name} ${COUNTER_HELP[name]}`);
		lines.push(`# TYPE ${name} counter`);
		lines.push(`${name} ${value}`);
	}

	// --- DB-backed gauges ---
	const gauges: CounterSnapshot[] = [];
	try {
		const [totalApps, healthchecked, totalUsers, totalBoards] = await Promise.all([
			prisma.app.count(),
			prisma.app.count({ where: { healthcheckEnabled: true } }),
			prisma.user.count(),
			prisma.board.count()
		]);
		gauges.push(
			{ name: 'wal_apps_total', help: 'Total apps registered', value: totalApps },
			{
				name: 'wal_apps_healthchecked_total',
				help: 'Apps with healthcheck enabled',
				value: healthchecked
			},
			{ name: 'wal_users_total', help: 'Total user accounts', value: totalUsers },
			{ name: 'wal_boards_total', help: 'Total boards', value: totalBoards }
		);

		// Latest status per app, broken down by status value.
		// Subquery: for each app, take the most recent AppStatus row.
		// COUNT(*) may come back from a raw query as a bigint rather than a
		// number, so the row type allows both and Number() normalizes it below.
		const latest = await prisma.$queryRaw<{ status: string; count: number | bigint }[]>`
			SELECT status, COUNT(*) AS count
			FROM (
				SELECT appId, status, ROW_NUMBER() OVER (PARTITION BY appId ORDER BY checkedAt DESC) AS rn
				FROM AppStatus
			)
			WHERE rn = 1
			GROUP BY status
		`;
		for (const status of Object.values(AppStatusValue)) {
			const row = latest.find((r) => r.status === status);
			gauges.push({
				name: 'wal_app_status',
				help: 'Current count of apps by latest status',
				value: Number(row?.count ?? 0),
				labels: { status }
			});
		}
	} catch (err) {
		// DB issue: emit an "up" gauge of 0 so scrapers can alert on it.
		// eslint-disable-next-line no-console
		console.warn('[metrics] failed to gather DB gauges:', err);
		lines.push(`# HELP wal_db_up 1 if the metrics endpoint could read from the DB`);
		lines.push(`# TYPE wal_db_up gauge`);
		lines.push(`wal_db_up 0`);
		lines.push('');
		return lines.join('\n');
	}

	// Group same-name gauges so we emit HELP/TYPE once.
	const grouped = new Map<string, CounterSnapshot[]>();
	for (const g of gauges) {
		const arr = grouped.get(g.name);
		if (arr) arr.push(g);
		else grouped.set(g.name, [g]);
	}
	for (const [name, samples] of grouped) {
		lines.push(`# HELP ${name} ${samples[0].help}`);
		lines.push(`# TYPE ${name} gauge`);
		for (const s of samples) {
			lines.push(`${name}${renderLabels(s.labels)} ${s.value}`);
		}
	}

	lines.push(`# HELP wal_db_up 1 if the metrics endpoint could read from the DB`);
	lines.push(`# TYPE wal_db_up gauge`);
	lines.push(`wal_db_up 1`);
	lines.push('');
	return lines.join('\n');
}
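
To illustrate the text the function above produces, here is a small standalone sketch of how a labelled gauge ends up in the exposition format. It restates escapeLabel so it can run on its own. renderSample is a hypothetical helper, not part of the file, and the status values 'online' and 'offline' are assumed, since the contents of AppStatusValue are not shown here.

```typescript
// Same escaping rules as escapeLabel in the file: backslash, double quote, newline.
function escapeLabel(value: string): string {
	return value.replace(/\\/g, '\\\\').replace(/"/g, '\\"').replace(/\n/g, '\\n');
}

// Hypothetical helper: renders one sample line, with optional labels.
function renderSample(name: string, value: number, labels: Record<string, string> = {}): string {
	const parts = Object.entries(labels).map(([k, v]) => `${k}="${escapeLabel(v)}"`);
	const labelText = parts.length ? `{${parts.join(',')}}` : '';
	return `${name}${labelText} ${value}`;
}

// HELP and TYPE appear once per metric name, followed by one line per label set.
const sampleLines = [
	'# HELP wal_app_status Current count of apps by latest status',
	'# TYPE wal_app_status gauge',
	renderSample('wal_app_status', 3, { status: 'online' }),
	renderSample('wal_app_status', 1, { status: 'offline' })
];
console.log(sampleLines.join('\n'));
```

Emitting HELP and TYPE once per metric name is the reason the file groups same-name gauges before rendering them.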