# Phase 2: Stale Container Detection
**Status:** ⬜ Not Started

**Parent plan:** [PLAN.md](./PLAN.md)

**Domain:** backend
## Objective
Implement a periodic scanner that detects docker-watcher-managed containers that have not been running for longer than a configurable number of days (`stale_threshold_days`), and expose those containers through the API.
## Tasks
- [ ] Task 1: Create `internal/stale/scanner.go` with a `Scanner` struct holding its dependencies (store, Docker client, event bus)
- [ ] Task 2: Implement scan logic: query all instances from the store, check each container's state via the Docker SDK, and compare against `stale_threshold_days` from settings
- [ ] Task 3: Add a `last_alive_at` column to the instances table (migration), updated whenever an instance is seen running
- [ ] Task 4: Update the deployer/instance lifecycle to set `last_alive_at` when a container starts or is seen running
- [ ] Task 5: Implement stale detection: an instance is stale if `status != 'running'` AND `(now - last_alive_at) > stale_threshold_days` days
- [ ] Task 6: Emit `event_log` warnings when containers become newly stale (do not re-emit for containers already known to be stale)
- [ ] Task 7: Register the scanner as a cron job (reuse the existing robfig/cron infrastructure from the registry poller)
- [ ] Task 8: Add API endpoints: `GET /api/containers/stale` (list stale containers with project/stage info), `POST /api/containers/stale/{id}/cleanup` (remove one), `POST /api/containers/stale/cleanup` (bulk remove)
- [ ] Task 9: Cleanup handler: stop the container if it is still running and remove it via the Docker SDK, remove the instance from the store, and emit an event
- [ ] Task 10: Wire scanner into main.go startup (after store, docker client, event bus init)
## Files to Modify/Create
- `internal/stale/scanner.go` — NEW: Stale container scanner
- `internal/store/store.go` — Migration for last_alive_at column
- `internal/store/models.go` — Update Instance struct with LastAliveAt field
- `internal/store/instances.go` — Update queries to include last_alive_at; add UpdateLastAliveAt method
- `internal/api/router.go` — Mount stale container routes
- `internal/api/stale.go` — NEW: Stale container HTTP handlers
- `cmd/server/main.go` — Wire scanner with cron
## Acceptance Criteria
- Scanner runs on a configurable interval (e.g., every hour)
- Stale containers are correctly identified based on `stale_threshold_days`
- `GET /api/containers/stale` returns a list with project name, stage name, image tag, last-alive timestamp, and days stale
- Cleanup endpoints stop and remove the Docker containers and remove the instances from the store
- Events are emitted when containers become stale
- Existing deploy flow is unaffected; `last_alive_at` is updated on successful deploy
- Build passes, existing tests pass
## Notes
- The scanner should gracefully handle containers that no longer exist in Docker (already removed externally)
- Bulk cleanup should be admin-only
- Consider deriving the scan interval from `stale_threshold_days` (e.g., scan every `stale_threshold_days / 7` days, with a minimum of 1h)
- Don't remove containers that are in 'removing' status (already being cleaned up)
## Review Checklist
- [ ] All tasks completed
- [ ] Code follows project conventions
- [ ] No unintended side effects
- [ ] Build passes
- [ ] Tests pass (new + existing)
## Handoff to Next Phase
<!-- Filled in by the implementation agent after completing this phase. -->