fix(devices): SP110E vendor handshake + Windows/bleak robustness
Build Android APK / build-android (push) Failing after 1m38s
Lint & Test / test (push) Successful in 4m32s

SP110E peripherals silently tear down the GATT link ~1s after connect
unless a two-write vendor handshake (01 00 → FFE2, 01 B7 E3 D5 → FFE1)
arrives immediately. Without it the first real write hangs 30s then
reconnect-loops forever. Adds optional BLEProtocol.init_writes executed
on connect, plumbs a per-write char_uuid through both transports, and
fixes the SP110E color/power frames from an incorrect 5 bytes to the
documented 4 bytes.

Windows/WinRT robustness:
- asyncio.wait_for hangs on bleak because WinRT IAsyncOperations refuse
  to cancel. _bounded_await() uses asyncio.wait() instead so timeouts
  actually return control even when the inner task is uncancellable.
- BleakClient connect by raw MAC string times out when WinRT guesses
  address type wrong; switched to pre-scanning with BleakScanner and
  passing the resolved BLEDevice, which carries the address type.
- Target-start fetch timeout bumped to 30s with retry disabled so the
  UI doesn't abort during the BLE pre-scan + connect + handshake path.

UI:
- Settings modal exposes Protocol Family (IconSelect grid, shared with
  add-device via parameterized ensureBleFamilyIconSelect) so users can
  fix a wrong family pick without recreating the device. Govee AES key
  row toggles on/off with family selection.

Also turns LAN auth back on in default_config.yaml, logs start_processing
requests on entry for easier diagnosis, and captures the full debug trail
in docs/BLE_LED_CONTROLLERS.md for future BLE work.

Refs the mbullington SP110E protocol gist for the handshake bytes.
This commit is contained in:
2026-04-21 17:45:21 +03:00
parent 2b5dac2c42
commit 45f93fd30e
14 changed files with 448 additions and 55 deletions
+109
View File
@@ -0,0 +1,109 @@
# BLE LED Controllers — Investigation & Implementation Notes
Reference for anyone touching the BLE device provider (`server/src/ledgrab/core/devices/ble_*`). Captures the protocol quirks, Windows/bleak traps, and hardware lockdown we hit while bringing up SP110E / Triones / Zengge / Govee support.
## Architecture
```
BLEDeviceProvider → BLEClient → BLETransport (desktop: bleak, Android: Kotlin BleBridge via Chaquopy)
└─ BLEProtocol (family-specific wire bytes: sp110e.py, triones.py, zengge.py, govee.py)
```
- One `BLEProtocol` dataclass per controller family. Each supplies GATT UUIDs, write type (with/without response), `encode_color` / `encode_power` functions, name prefixes for discovery, and an optional `init_writes` handshake sequence.
- `BLEClient` is whole-strip only. `send_pixels()` averages incoming pixel arrays and emits one solid color per frame — none of these protocols support per-pixel streaming.
- Discovery auto-detects the family via advertised name prefix first, falls back to service UUID matching. The detected family is returned on `DiscoveredDevice.ble_family` and preselected in the UI.
- The settings modal lets users change the family after creation — wrong family → writes go to a characteristic the device ignores → strip stays dark.
## Protocol Quirks
### SP110E / SP108E (critical handshake)
The controller **silently tears the GATT link down within ~1 second of connect** unless a two-write handshake arrives immediately:
```
Write 01 00 → characteristic FFE2
Write 01 B7 E3 D5 → characteristic FFE1
```
Without this, the first real write later hangs for 30 s because bleak thinks the link is up but the peripheral has already dropped it. We carry these writes in `PROTOCOL.init_writes` and execute them from `BLEClient.connect()` right after GATT open.
Color frame is **4 bytes** (`RR GG BB 1E`), not 5 — the earlier implementation had a stray `0x00` padding byte that the device tolerated but isn't documented.
Source: [mbullington's reverse-engineering gist](https://gist.github.com/mbullington/37957501a07ad065b67d4e8d39bfe012).
### Triones / Zengge / Govee
No init handshake required. Color frames and command bytes documented inline in each protocol module. Notable: Zengge and SP110E share service UUID `FFE0/FFE1`, so name-based identification is the only reliable way to tell them apart. In `_register_builtins()`, SP110E is registered first so it wins the `identify_family_by_service_uuids` tie by default — change this if the user base flips.
## bleak + Windows WinRT Traps
These bit us hard. All are now worked around, but future BLE work should keep them in mind.
### 1. `asyncio.wait_for` hangs forever on WinRT
`BleakClient.connect()` / `write_gatt_char()` wrap WinRT `IAsyncOperation`s. When asyncio tries to cancel them (as `wait_for` does on timeout), the WinRT task **never finishes cancelling**, so `wait_for` itself blocks forever while awaiting the cancellation. Symptom: log stops with no timeout error, process is alive but wedged.
**Fix**: `_bounded_await()` in [ble_transport.py](../server/src/ledgrab/core/devices/ble_transport.py) uses `asyncio.wait()` instead, which returns on timeout without awaiting pending tasks. Orphans the hanging WinRT task but frees the caller.
### 2. Connect by raw MAC string fails on Windows
Passing `BleakClient("AA:BB:CC:DD:EE:FF")` makes WinRT guess the address type (public vs random static vs random resolvable). Guesses wrong → connect silently times out. Symptom: `TimeoutError: BLE connect to ... exceeded 10.0s` with no other signal.
**Fix**: Always pre-scan with `BleakScanner.find_device_by_address()` and pass the returned `BLEDevice` object to `BleakClient`. Costs ~400 ms but makes connect reliable.
### 3. Client-side fetch timeout too short for BLE target start
The target-start endpoint does a ~5 s pre-scan + up to 10 s GATT connect + init handshake. Default `fetchWithAuth` has a 10 s timeout and 3× retry, so the UI was aborting and retrying concurrent `/start` requests into the server.
**Fix**: `startTargetProcessing` overrides `timeout: 30000, retry: false`.
### 4. `Start-Process -WindowStyle Hidden` from bash/WSL strips handles
When `restart.ps1` is invoked from Git-Bash / WSL, `Start-Process` inherited handles cause the child uvicorn to exit immediately. Stream redirection fixes it.
**Fix**: `restart.ps1` always uses `-RedirectStandardOutput`/`-RedirectStandardError` to a temp log. Failed startups dump the stderr tail to the caller so root cause is visible.
## Vendor Lockdown (the dead end)
Some controllers — notably the one we tested, advertising as `AlexTable` at `16:61:05:70:68:44`**only accept connections from the vendor phone app**. Diagnostic sequence:
| Test | Result | Meaning |
| --- | --- | --- |
| LedGrab `BleakClient.connect()` | 10 s timeout | Windows can't connect |
| Windows "Bluetooth LE Explorer" | Hangs on connect | Same Windows stack as bleak — not our bug |
| Phone **OS** Bluetooth Settings | Can't connect | Phone OS uses generic BLE stack — also fails |
| Phone **LED Hue** app | Connects fine | Vendor app is the *only* working client |
At this point, further Windows/bleak tweaks have no effect. The peripheral firmware rejects generic GATT connects and only stays connected when the LED Hue app emits its vendor-specific handshake. To unlock such a controller from LedGrab you'd need to:
1. Enable **Developer Options → Bluetooth HCI snoop log** on Android.
2. Reproduce the LED Hue flow (connect → color change → disconnect).
3. `adb bugreport bugreport.zip`; extract `btsnoop_hci.log`.
4. Open in Wireshark; identify the vendor handshake bytes written during connect.
5. Add them to the protocol's `init_writes`.
Alternatively, replace the BLE controller hardware with **WLED on ESP32** — $3, fully supported, vastly more capable.
## Frontend
- BLE family picker uses the project's shared `IconSelect` grid (project rule — see [CLAUDE.md](../CLAUDE.md): "NEVER use plain HTML `<select>`").
- Registry in `device-discovery.ts` is keyed by element ID so both the add-device and settings modals get their own IconSelect instance. Helpers: `ensureBleFamilyIconSelect(selectId, onChange?)` / `destroyBleFamilyIconSelect(selectId)`.
- Govee AES key row is conditionally visible: only shows when the selected family is `govee`.
## HAOS Integration Pair
The sister repo `ledgrab-haos-integration` had its own WebSocket auth bug that surfaced during this session — the integration still used the deprecated `?token=<key>` query param instead of the new first-message handshake. Fixed in [v0.2.1](https://git.dolgolyov-family.by/alexei.dolgolyov/ledgrab-haos-integration). Unrelated to BLE but shared debugging time.
## Tests
`server/tests/test_ble_protocols.py` and `server/tests/test_ble_client.py` use a `FakeTransport` that logs every write with its `char_uuid`, so protocol wire formats and the init handshake are unit-testable without hardware or bleak installed. New protocol additions should extend these.
## Files
- [ble_client.py](../server/src/ledgrab/core/devices/ble_client.py) — provider-facing class; runs init handshake on connect; reconnect backoff.
- [ble_transport.py](../server/src/ledgrab/core/devices/ble_transport.py) — bleak desktop transport; `_bounded_await` helper; per-write char override.
- [android_ble_transport.py](../server/src/ledgrab/core/devices/android_ble_transport.py) — Chaquopy/Kotlin transport; currently ignores `char_uuid` override (bridge binds a single write characteristic).
- [ble_provider.py](../server/src/ledgrab/core/devices/ble_provider.py) — discovery, family detection, `set_color` / `set_power` short-lived sessions.
- [ble_protocols/](../server/src/ledgrab/core/devices/ble_protocols/) — one file per family (pure byte-encoding functions, no BLE deps).
- [BleBridge.kt](../android/app/src/main/java/com/ledgrab/android/BleBridge.kt) — Android-side BLE GATT wrapper exposed to Python via Chaquopy.