Files
ledgrab/docs/BLE_LED_CONTROLLERS.md
T
alexei.dolgolyov 45f93fd30e
Build Android APK / build-android (push) Failing after 1m38s
Lint & Test / test (push) Successful in 4m32s
fix(devices): SP110E vendor handshake + Windows/bleak robustness
SP110E peripherals silently tear down the GATT link ~1s after connect
unless a two-write vendor handshake (01 00 → FFE2, 01 B7 E3 D5 → FFE1)
arrives immediately. Without it the first real write hangs 30s then
reconnect-loops forever. Adds optional BLEProtocol.init_writes executed
on connect, plumbs a per-write char_uuid through both transports, and
fixes the SP110E color/power frames from an incorrect 5 bytes to the
documented 4 bytes.

Windows/WinRT robustness:
- asyncio.wait_for hangs on bleak because WinRT IAsyncOperations refuse
  to cancel. _bounded_await() uses asyncio.wait() instead so timeouts
  actually return control even when the inner task is uncancellable.
- BleakClient connect by raw MAC string times out when WinRT guesses
  address type wrong; switched to pre-scanning with BleakScanner and
  passing the resolved BLEDevice, which carries the address type.
- Target-start fetch timeout bumped to 30s with retry disabled so the
  UI doesn't abort during the BLE pre-scan + connect + handshake path.

UI:
- Settings modal exposes Protocol Family (IconSelect grid, shared with
  add-device via parameterized ensureBleFamilyIconSelect) so users can
  fix a wrong family pick without recreating the device. Govee AES key
  row toggles on/off with family selection.

Also turns LAN auth back on in default_config.yaml, logs start_processing
requests on entry for easier diagnosis, and captures the full debug trail
in docs/BLE_LED_CONTROLLERS.md for future BLE work.

Refs the mbullington SP110E protocol gist for the handshake bytes.
2026-04-21 17:45:21 +03:00

110 lines
7.9 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# BLE LED Controllers — Investigation & Implementation Notes
Reference for anyone touching the BLE device provider (`server/src/ledgrab/core/devices/ble_*`). Captures the protocol quirks, Windows/bleak traps, and hardware lockdown we hit while bringing up SP110E / Triones / Zengge / Govee support.
## Architecture
```
BLEDeviceProvider → BLEClient → BLETransport (desktop: bleak, Android: Kotlin BleBridge via Chaquopy)
└─ BLEProtocol (family-specific wire bytes: sp110e.py, triones.py, zengge.py, govee.py)
```
- One `BLEProtocol` dataclass per controller family. Each supplies GATT UUIDs, write type (with/without response), `encode_color` / `encode_power` functions, name prefixes for discovery, and an optional `init_writes` handshake sequence.
- `BLEClient` is whole-strip only. `send_pixels()` averages incoming pixel arrays and emits one solid color per frame — none of these protocols support per-pixel streaming.
- Discovery auto-detects the family via advertised name prefix first, falls back to service UUID matching. The detected family is returned on `DiscoveredDevice.ble_family` and preselected in the UI.
- The settings modal lets users change the family after creation — wrong family → writes go to a characteristic the device ignores → strip stays dark.
## Protocol Quirks
### SP110E / SP108E (critical handshake)
The controller **silently tears the GATT link down within ~1 second of connect** unless a two-write handshake arrives immediately:
```
Write 01 00 → characteristic FFE2
Write 01 B7 E3 D5 → characteristic FFE1
```
Without this, the first real write later hangs for 30 s because bleak thinks the link is up but the peripheral has already dropped it. We carry these writes in `PROTOCOL.init_writes` and execute them from `BLEClient.connect()` right after GATT open.
Color frame is **4 bytes** (`RR GG BB 1E`), not 5 — the earlier implementation had a stray `0x00` padding byte that the device tolerated but isn't documented.
Source: [mbullington's reverse-engineering gist](https://gist.github.com/mbullington/37957501a07ad065b67d4e8d39bfe012).
### Triones / Zengge / Govee
No init handshake required. Color frames and command bytes documented inline in each protocol module. Notable: Zengge and SP110E share service UUID `FFE0/FFE1`, so name-based identification is the only reliable way to tell them apart. In `_register_builtins()`, SP110E is registered first so it wins the `identify_family_by_service_uuids` tie by default — change this if the user base flips.
## bleak + Windows WinRT Traps
These bit us hard. All are now worked around, but future BLE work should keep them in mind.
### 1. `asyncio.wait_for` hangs forever on WinRT
`BleakClient.connect()` / `write_gatt_char()` wrap WinRT `IAsyncOperation`s. When asyncio tries to cancel them (as `wait_for` does on timeout), the WinRT task **never finishes cancelling**, so `wait_for` itself blocks forever while awaiting the cancellation. Symptom: log stops with no timeout error, process is alive but wedged.
**Fix**: `_bounded_await()` in [ble_transport.py](../server/src/ledgrab/core/devices/ble_transport.py) uses `asyncio.wait()` instead, which returns on timeout without awaiting pending tasks. Orphans the hanging WinRT task but frees the caller.
### 2. Connect by raw MAC string fails on Windows
Passing `BleakClient("AA:BB:CC:DD:EE:FF")` makes WinRT guess the address type (public vs random static vs random resolvable). Guesses wrong → connect silently times out. Symptom: `TimeoutError: BLE connect to ... exceeded 10.0s` with no other signal.
**Fix**: Always pre-scan with `BleakScanner.find_device_by_address()` and pass the returned `BLEDevice` object to `BleakClient`. Costs ~400 ms but makes connect reliable.
### 3. Client-side fetch timeout too short for BLE target start
The target-start endpoint does a ~5 s pre-scan + up to 10 s GATT connect + init handshake. Default `fetchWithAuth` has a 10 s timeout and 3× retry, so the UI was aborting and retrying concurrent `/start` requests into the server.
**Fix**: `startTargetProcessing` overrides `timeout: 30000, retry: false`.
### 4. `Start-Process -WindowStyle Hidden` from bash/WSL strips handles
When `restart.ps1` is invoked from Git-Bash / WSL, `Start-Process` inherited handles cause the child uvicorn to exit immediately. Stream redirection fixes it.
**Fix**: `restart.ps1` always uses `-RedirectStandardOutput`/`-RedirectStandardError` to a temp log. Failed startups dump the stderr tail to the caller so root cause is visible.
## Vendor Lockdown (the dead end)
Some controllers — notably the one we tested, advertising as `AlexTable` at `16:61:05:70:68:44`**only accept connections from the vendor phone app**. Diagnostic sequence:
| Test | Result | Meaning |
| --- | --- | --- |
| LedGrab `BleakClient.connect()` | 10 s timeout | Windows can't connect |
| Windows "Bluetooth LE Explorer" | Hangs on connect | Same Windows stack as bleak — not our bug |
| Phone **OS** Bluetooth Settings | Can't connect | Phone OS uses generic BLE stack — also fails |
| Phone **LED Hue** app | Connects fine | Vendor app is the *only* working client |
At this point, further Windows/bleak tweaks have no effect. The peripheral firmware rejects generic GATT connects and only stays connected when the LED Hue app emits its vendor-specific handshake. To unlock such a controller from LedGrab you'd need to:
1. Enable **Developer Options → Bluetooth HCI snoop log** on Android.
2. Reproduce the LED Hue flow (connect → color change → disconnect).
3. `adb bugreport bugreport.zip`; extract `btsnoop_hci.log`.
4. Open in Wireshark; identify the vendor handshake bytes written during connect.
5. Add them to the protocol's `init_writes`.
Alternatively, replace the BLE controller hardware with **WLED on ESP32** — $3, fully supported, vastly more capable.
## Frontend
- BLE family picker uses the project's shared `IconSelect` grid (project rule — see [CLAUDE.md](../CLAUDE.md): "NEVER use plain HTML `<select>`").
- Registry in `device-discovery.ts` is keyed by element ID so both the add-device and settings modals get their own IconSelect instance. Helpers: `ensureBleFamilyIconSelect(selectId, onChange?)` / `destroyBleFamilyIconSelect(selectId)`.
- Govee AES key row is conditionally visible: only shows when the selected family is `govee`.
## HAOS Integration Pair
The sister repo `ledgrab-haos-integration` had its own WebSocket auth bug that surfaced during this session — the integration still used the deprecated `?token=<key>` query param instead of the new first-message handshake. Fixed in [v0.2.1](https://git.dolgolyov-family.by/alexei.dolgolyov/ledgrab-haos-integration). Unrelated to BLE but shared debugging time.
## Tests
`server/tests/test_ble_protocols.py` and `server/tests/test_ble_client.py` use a `FakeTransport` that logs every write with its `char_uuid`, so protocol wire formats and the init handshake are unit-testable without hardware or bleak installed. New protocol additions should extend these.
## Files
- [ble_client.py](../server/src/ledgrab/core/devices/ble_client.py) — provider-facing class; runs init handshake on connect; reconnect backoff.
- [ble_transport.py](../server/src/ledgrab/core/devices/ble_transport.py) — bleak desktop transport; `_bounded_await` helper; per-write char override.
- [android_ble_transport.py](../server/src/ledgrab/core/devices/android_ble_transport.py) — Chaquopy/Kotlin transport; currently ignores `char_uuid` override (bridge binds a single write characteristic).
- [ble_provider.py](../server/src/ledgrab/core/devices/ble_provider.py) — discovery, family detection, `set_color` / `set_power` short-lived sessions.
- [ble_protocols/](../server/src/ledgrab/core/devices/ble_protocols/) — one file per family (pure byte-encoding functions, no BLE deps).
- [BleBridge.kt](../android/app/src/main/java/com/ledgrab/android/BleBridge.kt) — Android-side BLE GATT wrapper exposed to Python via Chaquopy.