tiny-forge/plans/observability-proxy-mgmt/phase-2-stale-detection.md
alexei.dolgolyov c38b7d4c78 feat(observability): phase 1 - schema, models & event log backend
Add database foundation for observability features:
- event_log table with severity/source filtering and pagination
- standalone_proxies table for user-created reverse proxies
- stale_threshold_days setting (default 7 days)
- Auto-persist warn/error events from event bus to database
- SSE broadcast of persistent events for real-time UI updates
- Frontend types and API functions for downstream UI phases
2026-03-30 10:59:13 +03:00


Phase 2: Stale Container Detection

Status: Not Started
Parent plan: PLAN.md
Domain: backend

Objective

Implement a periodic scanner that detects containers managed by docker-watcher that have not been running for longer than a configurable number of days (the stale_threshold_days setting), and expose them via the API.

Tasks

  • Task 1: Create internal/stale/scanner.go — Scanner struct with dependencies (store, docker client, event bus)
  • Task 2: Implement scan logic: query all instances from store, check Docker container state via Docker SDK, compare against stale_threshold_days from settings
  • Task 3: Add last_alive_at column to instances table (migration) — updated when instance is seen running
  • Task 4: Update deployer/instance lifecycle to set last_alive_at when container starts/is seen running
  • Task 5: Implement stale detection: instance is stale if status != 'running' AND (now - last_alive_at) > threshold days
  • Task 6: Emit event_log warnings when containers become newly stale (avoid re-emitting for already-known stale containers)
  • Task 7: Register scanner as cron job (reuse existing robfig/cron infrastructure from registry poller)
  • Task 8: Add API endpoints: GET /api/containers/stale (list stale with project/stage info), POST /api/containers/stale/{id}/cleanup (remove single), POST /api/containers/stale/cleanup (bulk remove)
  • Task 9: Cleanup handler: stop container via Docker SDK, remove instance from store, emit event
  • Task 10: Wire scanner into main.go startup (after store, docker client, event bus init)

Files to Modify/Create

  • internal/stale/scanner.go — NEW: Stale container scanner
  • internal/store/store.go — Migration for last_alive_at column
  • internal/store/models.go — Update Instance struct with LastAliveAt field
  • internal/store/instances.go — Update queries to include last_alive_at; add UpdateLastAliveAt method
  • internal/api/router.go — Mount stale container routes
  • internal/api/stale.go — NEW: Stale container HTTP handlers
  • cmd/server/main.go — Wire scanner with cron

Acceptance Criteria

  • Scanner runs on a configurable interval (e.g., every hour)
  • Stale containers correctly identified based on threshold
  • GET /api/containers/stale returns list with project name, stage name, image tag, last alive timestamp, days stale
  • Cleanup endpoints properly stop Docker containers and remove from store
  • Events emitted when containers become stale
  • Existing deploy flow unaffected — last_alive_at updated on successful deploy
  • Build passes, existing tests pass

Notes

  • The scanner should gracefully handle containers that no longer exist in Docker (already removed externally)
  • Bulk cleanup should be admin-only
  • Consider: scan interval could be derived from stale_threshold_days (e.g., scan every threshold/7 days, min 1h)
  • Don't remove containers that are in 'removing' status (already being cleaned up)

Review Checklist

  • All tasks completed
  • Code follows project conventions
  • No unintended side effects
  • Build passes
  • Tests pass (new + existing)

Handoff to Next Phase