ledgrab/docs/BLE_LED_CONTROLLERS.md
alexei.dolgolyov 45f93fd30e
fix(devices): SP110E vendor handshake + Windows/bleak robustness
SP110E peripherals silently tear down the GATT link ~1s after connect
unless a two-write vendor handshake (01 00 → FFE2, 01 B7 E3 D5 → FFE1)
arrives immediately. Without it the first real write hangs 30s then
reconnect-loops forever. Adds optional BLEProtocol.init_writes executed
on connect, plumbs a per-write char_uuid through both transports, and
fixes the SP110E color/power frames from an incorrect 5 bytes to the
documented 4 bytes.

Windows/WinRT robustness:
- asyncio.wait_for hangs on bleak because WinRT IAsyncOperations refuse
  to cancel. _bounded_await() uses asyncio.wait() instead so timeouts
  actually return control even when the inner task is uncancellable.
- BleakClient connect by raw MAC string times out when WinRT guesses
  address type wrong; switched to pre-scanning with BleakScanner and
  passing the resolved BLEDevice, which carries the address type.
- Target-start fetch timeout bumped to 30s with retry disabled so the
  UI doesn't abort during the BLE pre-scan + connect + handshake path.

UI:
- Settings modal exposes Protocol Family (IconSelect grid, shared with
  add-device via parameterized ensureBleFamilyIconSelect) so users can
  fix a wrong family pick without recreating the device. Govee AES key
  row toggles on/off with family selection.

Also turns LAN auth back on in default_config.yaml, logs start_processing
requests on entry for easier diagnosis, and captures the full debug trail
in docs/BLE_LED_CONTROLLERS.md for future BLE work.

Refs the mbullington SP110E protocol gist for the handshake bytes.
2026-04-21 17:45:21 +03:00


BLE LED Controllers — Investigation & Implementation Notes

Reference for anyone touching the BLE device provider (server/src/ledgrab/core/devices/ble_*). Captures the protocol quirks, Windows/bleak traps, and hardware lockdown we hit while bringing up SP110E / Triones / Zengge / Govee support.

Architecture

BLEDeviceProvider  →  BLEClient  →  BLETransport (desktop: bleak, Android: Kotlin BleBridge via Chaquopy)
                           │
                           └─ BLEProtocol (family-specific wire bytes: sp110e.py, triones.py, zengge.py, govee.py)
  • One BLEProtocol dataclass per controller family. Each supplies GATT UUIDs, write type (with/without response), encode_color / encode_power functions, name prefixes for discovery, and an optional init_writes handshake sequence.
  • BLEClient is whole-strip only. send_pixels() averages incoming pixel arrays and emits one solid color per frame — none of these protocols support per-pixel streaming.
  • Discovery auto-detects the family via the advertised name prefix first, then falls back to service UUID matching. The detected family is returned on DiscoveredDevice.ble_family and preselected in the UI.
  • The settings modal lets users change the family after creation — wrong family → writes go to a characteristic the device ignores → strip stays dark.
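
To make the shape described above concrete, here is a minimal sketch of a per-family protocol record and of the whole-strip averaging. It is not the project's actual code: init_writes, encode_color, and encode_power are named in these notes, but every other field name and the average_color helper are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

RGB = Tuple[int, int, int]


@dataclass(frozen=True)
class BLEProtocol:
    """Hypothetical shape of a per-family protocol record (field names assumed)."""

    family: str
    service_uuid: str
    write_char_uuid: str
    write_with_response: bool
    encode_color: Callable[[int, int, int], bytes]
    encode_power: Callable[[bool], bytes]
    # Advertised-name prefixes used by discovery.
    name_prefixes: Tuple[str, ...] = ()
    # (char_uuid, payload) pairs written right after GATT open; empty means no handshake.
    init_writes: Tuple[Tuple[str, bytes], ...] = ()


def average_color(pixels: Sequence[RGB]) -> RGB:
    """Collapse a frame of pixels into one solid color (integer mean per channel)."""
    if not pixels:
        return (0, 0, 0)
    n = len(pixels)
    r = sum(p[0] for p in pixels) // n
    g = sum(p[1] for p in pixels) // n
    b = sum(p[2] for p in pixels) // n
    return (r, g, b)
```

A send_pixels()-style caller would then do something like protocol.encode_color(*average_color(frame)) and hand the resulting bytes to the transport.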

Protocol Quirks

SP110E / SP108E (critical handshake)

The controller silently tears the GATT link down within ~1 second of connect unless a two-write handshake arrives immediately:

Write  01 00                      → characteristic FFE2
Write  01 B7 E3 D5                → characteristic FFE1

Without this, the first real write later hangs for 30 s because bleak thinks the link is up but the peripheral has already dropped it. We carry these writes in PROTOCOL.init_writes and execute them from BLEClient.connect() right after GATT open.

Color frame is 4 bytes (RR GG BB 1E), not 5 — the earlier implementation had a stray 0x00 padding byte that the device tolerated but isn't documented.

Source: mbullington's reverse-engineering gist.
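
The bytes above can be written down as data plus a tiny runner. This is a sketch, not the contents of sp110e.py: the constant and function names are assumptions, and the write callable stands in for whatever the transport exposes. The 128-bit UUIDs are the standard Bluetooth base-UUID expansion of FFE1 and FFE2.

```python
from typing import Awaitable, Callable, Iterable, Tuple

# 16-bit UUIDs expanded with the Bluetooth base UUID.
CHAR_FFE1 = "0000ffe1-0000-1000-8000-00805f9b34fb"
CHAR_FFE2 = "0000ffe2-0000-1000-8000-00805f9b34fb"

# Vendor handshake, in the order listed in these notes (FFE2 first, then FFE1).
SP110E_INIT_WRITES: Tuple[Tuple[str, bytes], ...] = (
    (CHAR_FFE2, bytes([0x01, 0x00])),
    (CHAR_FFE1, bytes([0x01, 0xB7, 0xE3, 0xD5])),
)


def encode_color(r: int, g: int, b: int) -> bytes:
    """Build the 4-byte color frame RR GG BB 1E (no trailing padding byte)."""
    for value in (r, g, b):
        if not 0 <= value <= 255:
            raise ValueError(f"channel out of range: {value}")
    return bytes([r, g, b, 0x1E])


async def run_init_writes(
    write: Callable[[str, bytes], Awaitable[None]],
    init_writes: Iterable[Tuple[str, bytes]],
) -> None:
    """Send each handshake payload to its own characteristic, in order."""
    for char_uuid, payload in init_writes:
        await write(char_uuid, payload)
```

Keeping the handshake as plain (char_uuid, payload) pairs means a family without a handshake simply carries an empty sequence and the runner becomes a no-op.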

Triones / Zengge / Govee

No init handshake required. Color frames and command bytes documented inline in each protocol module. Notable: Zengge and SP110E share service UUID FFE0/FFE1, so name-based identification is the only reliable way to tell them apart. In _register_builtins(), SP110E is registered first so it wins the identify_family_by_service_uuids tie by default — change this if the user base flips.
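
The tie-break described above falls out naturally from an ordered registry. The sketch below assumes a simple list-based registry; the register and identify_family helpers and the name prefixes are hypothetical placeholders, not the real discovery code or the real advertised names.

```python
from typing import Iterable, List, Optional, Tuple

# (family, name_prefixes, service_uuids); list order doubles as tie-break order.
_REGISTRY: List[Tuple[str, Tuple[str, ...], Tuple[str, ...]]] = []


def register(family: str, name_prefixes: Iterable[str], service_uuids: Iterable[str]) -> None:
    _REGISTRY.append(
        (
            family,
            tuple(p.lower() for p in name_prefixes),
            tuple(u.lower() for u in service_uuids),
        )
    )


def identify_family(name: Optional[str], service_uuids: Iterable[str]) -> Optional[str]:
    """Name prefix first; otherwise the first registered family sharing a service UUID."""
    if name:
        lowered = name.lower()
        for family, prefixes, _ in _REGISTRY:
            if any(lowered.startswith(p) for p in prefixes):
                return family
    advertised = {u.lower() for u in service_uuids}
    for family, _, uuids in _REGISTRY:
        if advertised.intersection(uuids):
            return family
    return None


# SP110E registered first, so it wins the shared-UUID tie.
# Prefixes here are placeholders for illustration only.
register("sp110e", ["SP110E"], ["ffe0"])
register("zengge", ["Zengge"], ["ffe0"])
```

With this ordering, a nameless advertisement carrying only the shared UUID resolves to sp110e, while any device whose name matches a Zengge prefix is still identified correctly.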

bleak + Windows WinRT Traps

These bit us hard. All are now worked around, but future BLE work should keep them in mind.

1. asyncio.wait_for hangs forever on WinRT

BleakClient.connect() / write_gatt_char() wrap WinRT IAsyncOperations. When asyncio tries to cancel them (as wait_for does on timeout), the WinRT task never finishes cancelling, so wait_for itself blocks forever while awaiting the cancellation. Symptom: log stops with no timeout error, process is alive but wedged.

Fix: _bounded_await() in ble_transport.py uses asyncio.wait() instead, which returns on timeout without awaiting pending tasks. Orphans the hanging WinRT task but frees the caller.
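
A minimal sketch of the technique, assuming nothing about the real helper beyond its use of asyncio.wait(): the function name, signature, error message, and the best-effort cancel() are illustrative choices, not a copy of ble_transport.py.

```python
import asyncio
from typing import Any, Awaitable


def _discard_outcome(task: "asyncio.Future[Any]") -> None:
    """Retrieve an orphaned task's outcome so asyncio does not warn about it later."""
    if not task.cancelled():
        task.exception()


async def bounded_await(awaitable: Awaitable[Any], timeout: float, what: str = "operation") -> Any:
    """Await with a timeout that returns even if the inner task ignores cancellation."""
    task = asyncio.ensure_future(awaitable)
    # Unlike wait_for, wait() does not cancel or await the task on timeout;
    # it just reports it as pending and returns.
    done, _pending = await asyncio.wait({task}, timeout=timeout)
    if task in done:
        return task.result()  # re-raises if the task failed
    # Best-effort cancel; crucially, we do not await the cancellation.
    task.cancel()
    task.add_done_callback(_discard_outcome)
    raise asyncio.TimeoutError(f"{what} exceeded {timeout}s")
```

The trade-off is the one noted above: the stuck task is left behind on the event loop, so this pattern suits operations where leaking one task is cheaper than wedging the caller.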

2. Connect by raw MAC string fails on Windows

Passing BleakClient("AA:BB:CC:DD:EE:FF") makes WinRT guess the address type (public vs random static vs random resolvable). Guesses wrong → connect silently times out. Symptom: TimeoutError: BLE connect to ... exceeded 10.0s with no other signal.

Fix: Always pre-scan with BleakScanner.find_device_by_address() and pass the returned BLEDevice object to BleakClient. Costs ~400 ms but makes connect reliable.
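
The pre-scan pattern looks roughly like the sketch below. BleakScanner.find_device_by_address and BleakClient are real bleak APIs; the wrapper name connect_with_prescan, its parameters, and the injectable find_device / client_factory hooks are assumptions added so the logic can be exercised without bleak or hardware.

```python
from typing import Any, Awaitable, Callable, Optional


async def connect_with_prescan(
    address: str,
    scan_timeout: float = 5.0,
    find_device: Optional[Callable[..., Awaitable[Any]]] = None,
    client_factory: Optional[Callable[[Any], Any]] = None,
) -> Any:
    """Resolve the address to a BLEDevice first, then connect using that object."""
    if find_device is None or client_factory is None:
        # Imported lazily so the module still loads where bleak is not installed.
        from bleak import BleakClient, BleakScanner

        find_device = find_device or BleakScanner.find_device_by_address
        client_factory = client_factory or BleakClient

    # The scan result carries the address type, so WinRT no longer has to guess.
    device = await find_device(address, timeout=scan_timeout)
    if device is None:
        raise LookupError(f"BLE device {address} not seen within {scan_timeout}s")

    client = client_factory(device)
    await client.connect()
    return client
```

In production the connect() call would itself sit behind a bounded timeout; it is left bare here to keep the sketch focused on the scan-then-connect order.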

3. Client-side fetch timeout too short for BLE target start

The target-start endpoint does a ~5 s pre-scan + up to 10 s GATT connect + init handshake, so a start can take 15 s or more. Default fetchWithAuth has a 10 s timeout and 3× retry, so the UI aborted the request and then fired overlapping /start retries at the server.

Fix: startTargetProcessing overrides timeout: 30000, retry: false.

4. Start-Process -WindowStyle Hidden from bash/WSL strips handles

When restart.ps1 is invoked from Git-Bash / WSL, the handles inherited through Start-Process cause the child uvicorn to exit immediately. Redirecting the output streams fixes it.

Fix: restart.ps1 always uses -RedirectStandardOutput/-RedirectStandardError to a temp log. Failed startups dump the stderr tail to the caller so root cause is visible.

Vendor Lockdown (the dead end)

Some controllers (notably the one we tested, advertising as AlexTable at 16:61:05:70:68:44) only accept connections from the vendor phone app. Diagnostic sequence:

| Test | Result | Meaning |
| --- | --- | --- |
| LedGrab BleakClient.connect() | 10 s timeout | Windows can't connect |
| Windows "Bluetooth LE Explorer" | Hangs on connect | Same Windows stack as bleak, so not our bug |
| Phone OS Bluetooth Settings | Can't connect | Phone OS uses generic BLE stack, also fails |
| Phone LED Hue app | Connects fine | Vendor app is the only working client |

At this point, further Windows/bleak tweaks have no effect. The peripheral firmware rejects generic GATT connects and only stays connected when the LED Hue app emits its vendor-specific handshake. To unlock such a controller from LedGrab you'd need to:

  1. Enable Developer Options → Bluetooth HCI snoop log on Android.
  2. Reproduce the LED Hue flow (connect → color change → disconnect).
  3. adb bugreport bugreport.zip; extract btsnoop_hci.log.
  4. Open in Wireshark; identify the vendor handshake bytes written during connect.
  5. Add them to the protocol's init_writes.
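
For step 5, one low-friction route is to jot down the writes seen in Wireshark as "characteristic: hex bytes" lines and convert them mechanically. The helper below is a hypothetical convenience, not part of the codebase; the note format and function names are assumptions, and the usage example reuses the SP110E bytes from earlier rather than any captured LED Hue traffic.

```python
from typing import List, Tuple

BASE_UUID_SUFFIX = "-0000-1000-8000-00805f9b34fb"


def expand_uuid16(short: str) -> str:
    """Expand a 16-bit UUID such as 'FFE1' to its full 128-bit form."""
    short = short.strip().lower()
    if len(short) != 4:
        raise ValueError(f"expected 4 hex digits, got {short!r}")
    int(short, 16)  # raises ValueError on non-hex input
    return f"0000{short}{BASE_UUID_SUFFIX}"


def parse_init_writes(notes: str) -> List[Tuple[str, bytes]]:
    """Turn 'FFE2: 01 00' style lines into (char_uuid, payload) pairs, keeping order."""
    writes: List[Tuple[str, bytes]] = []
    for raw in notes.splitlines():
        line = raw.split("#", 1)[0].strip()  # allow trailing comments
        if not line:
            continue
        char, sep, payload_hex = line.partition(":")
        payload = bytes.fromhex(payload_hex)
        if not sep or not payload:
            raise ValueError(f"malformed line: {raw!r}")
        writes.append((expand_uuid16(char), payload))
    return writes
```

For example, parse_init_writes("FFE2: 01 00\nFFE1: 01 B7 E3 D5") yields the two SP110E handshake writes in order, ready to paste into a protocol's init_writes.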

Alternatively, replace the BLE controller hardware with WLED on ESP32 — $3, fully supported, vastly more capable.

Frontend

  • BLE family picker uses the project's shared IconSelect grid (project rule — see CLAUDE.md: "NEVER use plain HTML <select>").
  • Registry in device-discovery.ts is keyed by element ID so both the add-device and settings modals get their own IconSelect instance. Helpers: ensureBleFamilyIconSelect(selectId, onChange?) / destroyBleFamilyIconSelect(selectId).
  • Govee AES key row is conditionally visible: only shows when the selected family is govee.

HAOS Integration Pair

The sister repo ledgrab-haos-integration had its own WebSocket auth bug that surfaced during this session — the integration still used the deprecated ?token=<key> query param instead of the new first-message handshake. Fixed in v0.2.1. Unrelated to BLE but shared debugging time.

Tests

server/tests/test_ble_protocols.py and server/tests/test_ble_client.py use a FakeTransport that logs every write with its char_uuid, so protocol wire formats and the init handshake are unit-testable without hardware or bleak installed. New protocol additions should extend these.
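
A sketch of what such a test can look like. This is not the project's FakeTransport: the class body, the DEFAULT_CHAR constant, and the connect_and_handshake stand-in for BLEClient.connect() are all assumptions written to show the recording pattern.

```python
import asyncio
from typing import Iterable, List, Optional, Tuple

DEFAULT_CHAR = "ffe1"  # assumed default write characteristic for this sketch


class FakeTransport:
    """Records every write with the characteristic it targeted; no BLE involved."""

    def __init__(self) -> None:
        self.connected = False
        self.writes: List[Tuple[str, bytes]] = []

    async def connect(self) -> None:
        self.connected = True

    async def write(self, data: bytes, char_uuid: Optional[str] = None) -> None:
        self.writes.append((char_uuid or DEFAULT_CHAR, bytes(data)))


async def connect_and_handshake(
    transport: FakeTransport, init_writes: Iterable[Tuple[str, bytes]]
) -> None:
    """Stand-in for the client's connect path: open the link, then replay init_writes."""
    await transport.connect()
    for char_uuid, payload in init_writes:
        await transport.write(payload, char_uuid=char_uuid)


def test_sp110e_handshake_order() -> None:
    transport = FakeTransport()
    init_writes = [("ffe2", b"\x01\x00"), ("ffe1", b"\x01\xb7\xe3\xd5")]
    asyncio.run(connect_and_handshake(transport, init_writes))
    assert transport.connected
    # Both the payloads and the per-write characteristic must match, in order.
    assert transport.writes == init_writes
```

Because the fake logs the characteristic alongside the payload, a regression that sends the FFE2 write to the default characteristic fails the assertion instead of passing silently.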

Files

  • ble_client.py — provider-facing class; runs init handshake on connect; reconnect backoff.
  • ble_transport.py — bleak desktop transport; _bounded_await helper; per-write char override.
  • android_ble_transport.py — Chaquopy/Kotlin transport; currently ignores char_uuid override (bridge binds a single write characteristic).
  • ble_provider.py — discovery, family detection, set_color / set_power short-lived sessions.
  • ble_protocols/ — one file per family (pure byte-encoding functions, no BLE deps).
  • BleBridge.kt — Android-side BLE GATT wrapper exposed to Python via Chaquopy.