feat(initial-implementation): phase 0 - scraping spike findings

Anonymous scraping confirmed feasible for marathonbet.by — site is fully SSR
(nginx), no Cloudflare or JS challenge. HttpClient + AngleSharp + Polly v8 is
sufficient; Playwright not required (kept as a future-flag).

Spike outputs:
- spike/SCRAPE_FINDINGS.md  — page rendering, URL templates, anti-bot, rate
  limits, recommended scraping strategy for Phase 3.
- spike/SCHEMA_DRAFT.md     — customer-spec field → DOM selector mapping for
  Match + Period-N scope across football/basketball/tennis (hockey TBD).

Phase 1+ handoff captured in subplan + CLAUDE.md. Critical Phase 8 finding:
no public results endpoint at /su/results — phase 8 must switch to polling
event-detail until eventJsonInfo.matchIsComplete=true (deviation flagged).

Reviewer notes addressed:
- Period market outcome codes corrected to RN_H/RN_D/RN_A (not 1/draw/3) and
  market name vocabulary clarified per-sport in SCHEMA_DRAFT §3.1.
- results-page.html capture added to file list with caveat about live-landing
  score-state and unsampled hockey selectors.
This commit is contained in:
2026-05-05 01:04:03 +03:00
parent 8802ddb25b
commit 070e34b911
6 changed files with 864 additions and 25 deletions
+39
View File
@@ -103,3 +103,42 @@ Marathon_<YYYY-MM-DD>_to_<YYYY-MM-DD>.xlsx
## Recurring Issues & Patterns
(Populated as we work — leave empty until something repeats.)
## Feature: Initial Implementation > Phase 0: Scraping Spike — Learnings
(Permanent learnings about marathonbet.by data shape, anti-bot, page structure.
For full detail see `spike/SCRAPE_FINDINGS.md` and `spike/SCHEMA_DRAFT.md`.)
- **Site is fully SSR (`Server: nginx`).** Anonymous GET with browser User-Agent
returns full HTML for `/su/`, `/su/live`, `/su/popular/<Sport>`,
`/su/betting/<event-path>`. No Cloudflare, no JS challenge.
- **Use HttpClient + AngleSharp + Polly v8** — no Playwright needed for read-only.
Keep `Scraping:UsePlaywright = false` flag for future-proofing.
- **Sport ID = `data-sport-treeId` = breadcrumb canonical ID.** Confirmed:
Basketball=6, Football=11, Tennis=22723, Hockey=43658. URL by ID:
`/su/betting/<Sport>+-+<id>` (preferred over `/su/popular/<Sport>` because the
ID is stable).
- **`EventCode` = `data-event-eventId`** (numeric, ~26-million range, stable).
`TreeId` = `data-event-treeId` (URL-routing ID, less stable). Use `EventCode`
as the entity primary key in SQLite.
- **Selection key format:** `{eventId}@{MarketName}{LineIndex?}.{Outcome}`.
Outcomes: `1`/`draw`/`3` for 3-way, `HB_H`/`HB_A` for handicap, `Under_<X>`/
`Over_<X>` for totals. Total threshold is encoded in the outcome string;
handicap value lives in `<span class="middle-simple">` text.
- **Tennis has no Draw outcome.** Domain `Bet_Match_Draw` must be nullable; Excel
exporter writes empty cell when null.
- **Date parsing:** listing shows `HH:MM` (today) or `DD <ru-month> HH:MM` (future).
Anchor with `initData.serverTime` (Moscow TZ, format `YYYY,MM,DD,HH,MM,SS`)
parsed from the embedded `<script>` blob on every scraped page.
- **Live updates:** site polls `/su/liveupdate/popular/?treeIds=...` every 3 s but
response is just `{"modified":[{"type":"refreshPage"}],...}` — re-scrape the
full event detail HTML for actual odds. Our analyzer cadence: pre-match 30 s,
live 510 s.
- **No public results / archive page** (`/su/results` → 404). Final scores must
be harvested by polling the event detail page until
`eventJsonInfo.matchIsComplete=true`, then storing `resultDescription`. Phase 8
cannot back-fill from a public archive.
- **Period scope vocabulary varies by sport:** football=`1st_Half`, basketball=
`1st_Half`/`1st_Quarter`, tennis=`1st_Set`, hockey=`1st_Period`. Domain stores
`PeriodNumber:int` and a sport-aware `PeriodScopeMapper` resolves the correct
market token at parse time.