Files
ledgrab/REVIEW_TODO.md
T
alexei.dolgolyov 06273ba2bc chore(tooling): vex semantic-search config + REVIEW_TODO backlog
Add .vex.toml so `vex` is the project's primary code-search backend with
auto-update + semantic embeddings enabled. Ignore the .fastembed_cache/
directory that vex creates on first --semantic run. REVIEW_TODO.md
captures items flagged by the multi-agent production review that were
deliberately deferred (multi-day refactors, profile-first perf, and
design-sensitive security work).
2026-05-23 00:46:44 +03:00

163 lines
9.3 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Production Review — Remaining Items
Output of the multi-agent production review (security / Python / TypeScript /
performance / architecture / code-quality). Each entry below is something
the original audit flagged and the autonomous hardening pass deliberately
did **not** address — either because it needs design input, profiling
validation, or a multi-day refactor that should land in its own session.
The hardening pass landed everything else: see git log between `master` and
the head of the review branch for the applied changes (URL-scheme +
malicious-input rejection, IconSelect XSS escape, MiniSelect for forbidden
plain `<select>`s, WebSocket Origin allow-list, /docs auth-gate, security
headers middleware, streaming upload size caps, fire-and-forget task
tracking + drain resilience in MQTT runtime, discovery_watcher task
tracking, asyncio.gather return_exceptions, secret_box encryption for MQTT
/ Hue / Govee credentials with auto-migration, SSRF-validated update
redirects, single source of truth for IP classification in
`utils/net_classify.py`, allowlist + parity test for inbound WS events,
typed `Window` globals, and more).
---
## Architecture refactors (multi-day — own session)
- [ ] **Split `core/processing/value_stream.py`** (1856 LOC, 14 stream classes)
into a `value_streams/` package. Each value-stream type gets its own
file ≤300 LOC; `manager.py` holds `ValueStreamManager`.
- [ ] **Split `storage/color_strip_source.py`** (1841 LOC, 18 source kinds)
into a `color_strip_sources/` package mirroring `value_streams/`.
- [ ] **Frontend file splits**`graph-editor.ts` (2707), `streams.ts`
(2335), `value-sources.ts` (1889), `types.ts` (1062). Highest-churn
modules; mixed UI / state / network responsibilities.
- [ ] **Layering reversal**: introduce a neutral `domain/` package and move
shared DTOs (`FilterInstance`, `CalibrationConfig`, etc.) into it so
`storage/` no longer imports `core/`. Eliminates 7+ layering
violations and the lazy-import hacks used to break the resulting
circulars.
- [ ] **`main.py` boot refactor** — extract import-time side effects into
`bootstrap.py` + `create_app()` factory. `lifespan()` becomes the
single place that wires stores and managers.
- [ ] **DI consolidation** — replace `api/dependencies.py` getter sprawl
(30+ `get_*()` functions reading a process-global `_deps` dict) with
a single typed `get_container()` dependency. Makes test-overrides
trivial; ban direct getter calls in handler bodies.
- [ ] **Exception hierarchy** — define `ledgrab/errors.py` (`LedGrabError`,
`NotFoundError`, `ValidationError`, `RemoteUnavailableError`,
`SSRFBlockedError`). Move HTTP translation into a FastAPI exception
handler. Stop raising `HTTPException` from `utils/safe_source.py`.
- [ ] **Lazy-import audit** — 289 in-function `from ledgrab.*` imports.
Specifically `core/processing/daylight_settings.py` imports
`api.dependencies` (core → api inversion). Pass the database in via
the constructor instead of service-locator lookup.
## Performance (profile before applying)
- [ ] **`composite_stream.py` blend modes** — pre-allocate scratch buffers
in `_blend_override / overlay / hard_light / soft_light / difference
/ exclusion`. Each currently allocates per frame (`mul`, `scr`,
`blended`, `np.where(...)`). At 100 LEDs × 30 fps × N layers this
adds up.
- [ ] **`mapped_stream` / `composite_stream` zone resize** — replace the
per-channel `np.interp` calls with a cached `floor/ceil/frac` LUT
(same trick as `wled_target_processor._fit_to_device`) or a single
`cv2.resize` call on the (N,3) array. `np.interp` allocates a new
`float64` array per channel per frame even on cache-hit.
- [ ] **`processed_stream._processing_loop`** — add ping-pong output
buffers and pass them as `out=` to filter `process_strip()` calls.
Today every filter that returns a fresh allocation costs us a copy
per frame. Also: the loop uses `time.sleep` instead of an
event-driven wait on the input stream — input updates faster than
30 fps see up to `frame_time` of latency.
- [ ] **`mqtt_client.py` `send_pixels`** — add a binary publish path (or
at minimum cache the outer dict skeleton). Today every frame
`pixels.tolist()` + `json.dumps` for ~300 LEDs × 30 fps × N devices.
- [ ] **Frontend `static/js/features/color-strips/test.ts`** — cache
`ImageData` per canvas (`canvas._imageData`); only re-create on
dimension change; use a `Uint32Array` view to copy pixels in one
loop instead of the per-pixel JS loop. Border-overlay rebuild on
every frame should also be debounced to dimension changes only.
- [ ] **`ws_stream.py` composite branch** — pre-allocate a `bytearray`
sized to the largest frame and write into slices instead of
`b"".join(tobytes()) per layer` every iteration. Same anti-pattern
in `wled_target_processor._broadcast_led_preview`.
- [ ] **Preview broadcast slow-client guard**`asyncio.gather` over
preview clients waits for the slowest. Move to `asyncio.wait` with a
timeout and drop slow clients, or fire-and-forget with a
`ws.application_state` filter.
## Security (deferred — non-trivial or design-sensitive)
- [ ] **Content-Security-Policy header** — would need careful tuning
because the UI uses inline event handlers / Jinja templates.
Mis-set CSP would break the app silently. Defer until templates can
move to event-delegated handlers, then add a strict policy.
- [ ] **`api/auth.py` exception specificity** — 9 `except Exception:`
sites. Most are intentional best-effort `websocket.send_json`
swallows (the WS is already closed or about to be), but the auth
decision path itself could be tightened to specific types
(`jwt.InvalidTokenError`, `OSError`) + `logger.exception` for
observability.
- [ ] **Hue bridge cert pinning**`httpx.AsyncClient(verify=False)` for
Hue bridge (self-signed cert by design). Should record the
certificate fingerprint at pairing time and pin it on subsequent
requests; otherwise an on-path attacker can MITM the bridge.
## Mechanical / code-quality (low risk, high line-count)
- [ ] **i18n parity****328** keys missing in `ru.json`, **325** missing
in `zh.json`. Examples: `section.hide`, `filters.hsl_shift`,
`filters.contrast`, `filters.temporal_blur`,
`filters.audio_filter_template.desc`. Russian and Chinese users
currently see raw keys for these. This is translation work, not
code work.
- [ ] **`Optional[T]``T | None`** (PEP 604) — large mechanical refactor
across the codebase. Can be auto-fixed via `ruff check --fix
--select UP007`. Worth doing once the file splits land.
- [ ] **Hot-path `logger.error(f"...")` → `logger.error("... %s", e)`**
lazy-eval — mostly cosmetic; ~200 sites. The f-string still builds
the message even when DEBUG is off.
- [ ] **Remaining `(window as any)` sites** — typed `global-types.d.ts`
is in place and new code uses `window.foo` directly, but ~80
existing sites still have the cast. Per-site mechanical cleanup.
Add `eslint`-equivalent guard (TS rule) to prevent new ones.
- [ ] **Magic numbers → named constants** in processing hot paths —
`_FILTER_RECHECK_EVERY_N_FRAMES = 30` in
`core/processing/processed_stream.py:159`; `5 ms` / `5 s` /
`30 iterations` literals in `wled_target_processor.py:890,893,915`.
- [ ] **Standardise `from __future__ import annotations`** across the
codebase. Some modules use the future-annotation form, others stick
with `Optional[...]`. Enforce one via ruff `FA` rules.
## Test gaps
- [ ] **Route-level integration test** for the WLED scheme inference —
POST `/api/v1/devices` with `{"url": "192.168.1.42",
"device_type": "wled"}` and assert the stored device has
`url == "http://192.168.1.42"`. The helper is exhaustively
unit-tested but no integration test exercises the create/update
flow end-to-end.
- [ ] **IPv6 public address regression** — extend `test_url_scheme.py`
with explicit assertions for `2001:db8::1` and similar public IPv6
literals (the bare-label fallback used to misclassify these). The
helper does the right thing today via the IPv6 probe added during
the hardening pass, but no test pins it.
## Pre-existing issues surfaced during the audit (not in our diff)
These were flagged by the auditors but predate the review session — kept
here as a future-work backlog:
- [ ] **`icon-select.ts:_buildGrid` `item.icon` is interpolated raw** —
documented as "trusted SVG by design". If callers ever feed
user-supplied icon strings, that's an XSS sink. Audit every caller
that builds `IconSelectItem.icon` from non-constant data and
reject HTML there.
- [ ] **`devices.py:461` `manager.update_device_info(device_url=update_data.url)`**
receives `None` when a PATCH omits `url` (rename / icon-only edit).
The processor never re-syncs in that case. Should pass
`existing.url` (after normalization) or skip the call.
- [ ] **`asyncio.gather` over uncapped client lists** in preview broadcasts
— slow clients block the loop. Already noted under Performance
above; pre-existing.