feat(initial-implementation): phase 0 - scraping spike findings
Anonymous scraping confirmed feasible for marathonbet.by: the site is fully SSR (nginx), with no Cloudflare or JS challenge. HttpClient + AngleSharp + Polly v8 is sufficient; Playwright is not required (kept as a future flag).

Spike outputs:
- spike/SCRAPE_FINDINGS.md: page rendering, URL templates, anti-bot, rate limits, recommended scraping strategy for Phase 3.
- spike/SCHEMA_DRAFT.md: customer-spec field → DOM selector mapping for Match + Period-N scope across football/basketball/tennis (hockey TBD).

Phase 1+ handoff captured in subplan + CLAUDE.md.

Critical Phase 8 finding: no public results endpoint at /su/results. Phase 8 must switch to polling event-detail until eventJsonInfo.matchIsComplete=true (deviation flagged).

Reviewer notes addressed:
- Period market outcome codes corrected to RN_H/RN_D/RN_A (not 1/draw/3) and market name vocabulary clarified per-sport in SCHEMA_DRAFT §3.1.
- results-page.html capture added to file list with caveat about live-landing score-state and unsampled hockey selectors.
@@ -103,3 +103,42 @@ Marathon_<YYYY-MM-DD>_to_<YYYY-MM-DD>.xlsx
## Recurring Issues & Patterns

(Populated as we work; leave empty until something repeats.)

## Feature: Initial Implementation > Phase 0: Scraping Spike - Learnings
(Permanent learnings about marathonbet.by data shape, anti-bot, page structure.
For full detail see `spike/SCRAPE_FINDINGS.md` and `spike/SCHEMA_DRAFT.md`.)

- **Site is fully SSR (`Server: nginx`).** Anonymous GET with a browser User-Agent
  returns full HTML for `/su/`, `/su/live`, `/su/popular/<Sport>`, and
  `/su/betting/<event-path>`. No Cloudflare, no JS challenge.
- **Use HttpClient + AngleSharp + Polly v8**: no Playwright needed for read-only scraping.
  Keep the `Scraping:UsePlaywright = false` flag for future-proofing.
- **Sport ID = `data-sport-treeId` = breadcrumb canonical ID.** Confirmed:
  Basketball=6, Football=11, Tennis=22723, Hockey=43658. URL by ID:
  `/su/betting/<Sport>+-+<id>` (preferred over `/su/popular/<Sport>` because the
  ID is stable).
- **`EventCode` = `data-event-eventId`** (numeric, ~26-million range, stable).
  `TreeId` = `data-event-treeId` (URL-routing ID, less stable). Use `EventCode`
  as the entity primary key in SQLite.
- **Selection key format:** `{eventId}@{MarketName}{LineIndex?}.{Outcome}`.
  Outcomes: `1`/`draw`/`3` for 3-way, `HB_H`/`HB_A` for handicap,
  `Under_<X>`/`Over_<X>` for totals. Total threshold is encoded in the outcome string;
  handicap value lives in `<span class="middle-simple">` text.
- **Tennis has no Draw outcome.** Domain `Bet_Match_Draw` must be nullable; Excel
  exporter writes empty cell when null.
- **Date parsing:** listing shows `HH:MM` (today) or `DD <ru-month> HH:MM` (future).
  Anchor with `initData.serverTime` (Moscow TZ, format `YYYY,MM,DD,HH,MM,SS`, i.e.
  year, month, day, hour, minute, second), parsed from the embedded `<script>` blob
  on every scraped page.
- **Live updates:** site polls `/su/liveupdate/popular/?treeIds=...` every 3 s, but the
  response is just `{"modified":[{"type":"refreshPage"}],...}`, so re-scrape the
  full event detail HTML for actual odds. Our analyzer cadence: pre-match 30 s,
  live 5–10 s.
- **No public results / archive page** (`/su/results` → 404). Final scores must
  be harvested by polling the event detail page until
  `eventJsonInfo.matchIsComplete=true`, then storing `resultDescription`. Phase 8
  cannot back-fill from a public archive.
- **Period scope vocabulary varies by sport:** football=`1st_Half`,
  basketball=`1st_Half`/`1st_Quarter`, tennis=`1st_Set`, hockey=`1st_Period`. Domain stores
  `PeriodNumber:int`, and a sport-aware `PeriodScopeMapper` resolves the correct
  market token at parse time.
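
The selection key format in the list above can be split mechanically. A minimal sketch in Python (the project itself targets .NET; this only illustrates the parsing logic), assuming that `LineIndex` appears as trailing digits on the market token and that the outcome is everything after the first `.` following `@`. `SelectionKey`, `parse_selection_key`, and the market names used in the example are hypothetical, not confirmed site vocabulary.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumed layout: digits, "@", market token, optional trailing digits as the
# line index, ".", then the outcome. The outcome may itself contain dots
# (for example "Under_2.5"), so only the first "." acts as the separator.
_KEY_RE = re.compile(
    r"^(?P<event>\d+)@(?P<market>[^.]+?)(?P<line>\d+)?\.(?P<outcome>.+)$"
)

# Totals carry their threshold inside the outcome string.
_TOTAL_RE = re.compile(r"(?:Under|Over)_(\d+(?:\.\d+)?)")


@dataclass(frozen=True)
class SelectionKey:
    event_id: int
    market: str
    line_index: Optional[int]
    outcome: str

    @property
    def total_threshold(self) -> Optional[float]:
        """Threshold for Under_/Over_ outcomes, otherwise None."""
        m = _TOTAL_RE.fullmatch(self.outcome)
        return float(m.group(1)) if m else None


def parse_selection_key(raw: str) -> SelectionKey:
    """Split '{eventId}@{MarketName}{LineIndex?}.{Outcome}' into its parts."""
    m = _KEY_RE.match(raw)
    if m is None:
        raise ValueError(f"unrecognised selection key: {raw!r}")
    line = m.group("line")
    return SelectionKey(
        event_id=int(m.group("event")),
        market=m.group("market"),
        line_index=int(line) if line is not None else None,
        outcome=m.group("outcome"),
    )
```

For example, `parse_selection_key("26000001@Total_Goals3.Under_2.5")` yields market `Total_Goals`, line index `3`, and a total threshold of `2.5`. Under the trailing-digits assumption, a market token that legitimately ends in a digit would be misread as carrying a line index, so this should be checked against captured pages before being relied on.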
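
The date-anchoring rule from the list above could look roughly like this in Python (illustrative only; the real parser would live in the .NET codebase). Assumptions not confirmed by the spike: `serverTime` months are 1-based, the listing uses Russian month names identifiable by their first three letters, and a listed month earlier than the server month means the event falls in the following year. The function names are hypothetical.

```python
import re
from datetime import datetime, timedelta, timezone

# Moscow has been fixed at UTC+3 with no DST since 2014, so a fixed offset suffices.
MOSCOW = timezone(timedelta(hours=3))

# Assumption: month names can be recognised by their first three letters
# ("май" and "мая" both map to May).
RU_MONTH_PREFIXES = {
    "янв": 1, "фев": 2, "мар": 3, "апр": 4, "май": 5, "мая": 5, "июн": 6,
    "июл": 7, "авг": 8, "сен": 9, "окт": 10, "ноя": 11, "дек": 12,
}

_TODAY_RE = re.compile(r"(\d{1,2}):(\d{2})")
_FUTURE_RE = re.compile(r"(\d{1,2})\s+(\S+)\s+(\d{1,2}):(\d{2})")


def parse_server_time(raw: str) -> datetime:
    """Parse initData.serverTime given as 'YYYY,MM,DD,HH,MM,SS' (Moscow time).

    Assumption: the month field is 1-based.
    """
    year, month, day, hour, minute, second = (int(p) for p in raw.split(","))
    return datetime(year, month, day, hour, minute, second, tzinfo=MOSCOW)


def resolve_listing_time(text: str, server_now: datetime) -> datetime:
    """Turn a listing time ('HH:MM' or 'DD <ru-month> HH:MM') into a full datetime."""
    text = text.strip()

    m = _TODAY_RE.fullmatch(text)
    if m:
        # Time only: the event is on the server's current date.
        return server_now.replace(
            hour=int(m.group(1)), minute=int(m.group(2)), second=0, microsecond=0
        )

    m = _FUTURE_RE.fullmatch(text)
    if m is None:
        raise ValueError(f"unrecognised listing time: {text!r}")
    month = RU_MONTH_PREFIXES.get(m.group(2).lower()[:3])
    if month is None:
        raise ValueError(f"unknown month name: {m.group(2)!r}")

    # Assumption: listings only show upcoming events, so an earlier month
    # than the server's current month is taken to mean next year.
    year = server_now.year + (1 if month < server_now.month else 0)
    return datetime(
        year, month, int(m.group(1)), int(m.group(3)), int(m.group(4)), tzinfo=MOSCOW
    )
```

Anchoring on the server clock rather than the scraper's local clock keeps the "today" case correct when the scraper runs in another time zone or close to midnight Moscow time.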