From 070e34b91181743f04ba527e8999e1c56cab836e Mon Sep 17 00:00:00 2001
From: "alexei.dolgolyov"
Date: Tue, 5 May 2026 01:04:03 +0300
Subject: [PATCH] feat(initial-implementation): phase 0 - scraping spike findings
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Anonymous scraping confirmed feasible for marathonbet.by — site is fully SSR
(nginx), no Cloudflare or JS challenge. HttpClient + AngleSharp + Polly v8
is sufficient; Playwright not required (kept as a future-flag).

Spike outputs:
- spike/SCRAPE_FINDINGS.md — page rendering, URL templates, anti-bot,
  rate limits, recommended scraping strategy for Phase 3.
- spike/SCHEMA_DRAFT.md — customer-spec field → DOM selector mapping for
  Match + Period-N scope across football/basketball/tennis (hockey TBD).

Phase 1+ handoff captured in subplan + CLAUDE.md. Critical Phase 8 finding:
no public results endpoint at /su/results — phase 8 must switch to polling
event-detail until eventJsonInfo.matchIsComplete=true (deviation flagged).

Reviewer notes addressed:
- Period market outcome codes corrected to RN_H/RN_D/RN_A (not 1/draw/3)
  and market name vocabulary clarified per-sport in SCHEMA_DRAFT §3.1.
- results-page.html capture added to file list with caveat about
  live-landing score-state and unsampled hockey selectors.
---
 CLAUDE.md                                     |  39 ++
 plans/initial-implementation/CONTEXT.md       |  55 ++-
 plans/initial-implementation/PLAN.md          |   4 +-
 .../phase-0-scraping-spike.md                 | 126 ++++++-
 spike/SCHEMA_DRAFT.md                         | 318 ++++++++++++++++
 spike/SCRAPE_FINDINGS.md                      | 347 ++++++++++++++++++
 6 files changed, 864 insertions(+), 25 deletions(-)
 create mode 100644 spike/SCHEMA_DRAFT.md
 create mode 100644 spike/SCRAPE_FINDINGS.md

diff --git a/CLAUDE.md b/CLAUDE.md
index 99e5d4e..16808d1 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -103,3 +103,42 @@ Marathon__to_.xlsx
 ## Recurring Issues & Patterns
 
 (Populated as we work — leave empty until something repeats.)
+
+## Feature: Initial Implementation > Phase 0: Scraping Spike — Learnings
+
+(Permanent learnings about marathonbet.by data shape, anti-bot, page structure.
+For full detail see `spike/SCRAPE_FINDINGS.md` and `spike/SCHEMA_DRAFT.md`.)
+
+- **Site is fully SSR (`Server: nginx`).** Anonymous GET with browser User-Agent
+  returns full HTML for `/su/`, `/su/live`, `/su/popular/`,
+  `/su/betting/`. No Cloudflare, no JS challenge.
+- **Use HttpClient + AngleSharp + Polly v8** — no Playwright needed for read-only.
+  Keep `Scraping:UsePlaywright = false` flag for future-proofing.
+- **Sport ID = `data-sport-treeId` = breadcrumb canonical ID.** Confirmed:
+  Basketball=6, Football=11, Tennis=22723, Hockey=43658. URL by ID:
+  `/su/betting/+-+` (preferred over `/su/popular/` because the
+  ID is stable).
+- **`EventCode` = `data-event-eventId`** (numeric, ~26-million range, stable).
+  `TreeId` = `data-event-treeId` (URL-routing ID, less stable). Use `EventCode`
+  as the entity primary key in SQLite.
+- **Selection key format:** `{eventId}@{MarketName}{LineIndex?}.{Outcome}`.
+  Outcomes: `1`/`draw`/`3` for 3-way, `HB_H`/`HB_A` for handicap, `Under_`/
+  `Over_` for totals. Total threshold is encoded in the outcome string;
+  handicap value lives in `` text.
+- **Tennis has no Draw outcome.** Domain `Bet_Match_Draw` must be nullable; Excel
+  exporter writes empty cell when null.
+- **Date parsing:** listing shows `HH:MM` (today) or `DD HH:MM` (future).
+  Anchor with `initData.serverTime` (Moscow TZ, format `YYYY,MM,DD,HH,MM,SS`)
+  parsed from the embedded `