feat(initial-implementation): phase 0 - scraping spike findings
Anonymous scraping confirmed feasible for marathonbet.by: the site is fully SSR (nginx), with no Cloudflare or JS challenge. HttpClient + AngleSharp + Polly v8 is sufficient; Playwright is not required (kept as a future flag).

Spike outputs:
- spike/SCRAPE_FINDINGS.md: page rendering, URL templates, anti-bot, rate limits, recommended scraping strategy for Phase 3.
- spike/SCHEMA_DRAFT.md: customer-spec field → DOM selector mapping for Match + Period-N scope across football/basketball/tennis (hockey TBD).

Phase 1+ handoff captured in subplan + CLAUDE.md.

Critical Phase 8 finding: no public results endpoint at /su/results. Phase 8 must switch to polling event-detail until eventJsonInfo.matchIsComplete=true (deviation flagged).

Reviewer notes addressed:
- Period market outcome codes corrected to RN_H/RN_D/RN_A (not 1/draw/3) and market name vocabulary clarified per-sport in SCHEMA_DRAFT §3.1.
- results-page.html capture added to file list with caveat about live-landing score-state and unsampled hockey selectors.
@@ -55,7 +55,21 @@ with scraping research, no implementation.
 
 ## Failed Approaches
 
-(none yet; phases not started)
+- **Public results / archive endpoint**: does NOT exist. Tested
+  `https://www.marathonbet.by/su/results`, `/su/results/`, `/su/results.htm`;
+  all return HTTP 404. No `/archive`, `/history` links anywhere in the public
+  HTML either. **Phase 8 deviation:** the Results loader cannot back-fill from
+  an archive; it must poll each event detail page until
+  `eventJsonInfo.matchIsComplete=true` and snapshot `resultDescription` at that
+  moment. Phase 8 implementer must revise the subplan accordingly.
+- **JSONP `/su/liveupdate/popular/` endpoint**: exposes only refresh signals
+  (`{"modified":[{"type":"refreshPage"}],"updated":<ts>}`), not actual odds. Cannot
+  be used as a JSON odds source. Use it only as a "something changed" hint to
+  trigger a full event-detail re-scrape.
+- **Anonymous WebSocket (STOMP)** at `/su/websocket/endpoint` is documented in
+  `initData.stomp` but appears to require an authenticated session
+  (`PUNTER-SESSION-HASH` cookie); we did not test it but the customer's anonymous
+  scraping constraint makes it unsuitable anyway.
 
 ## Review Findings Log
 
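Review note on the hunk above: the refresh-hint and poll-until-complete behaviour described under Failed Approaches could be sketched roughly as follows. This is a logic sketch in Python, not the planned HttpClient + AngleSharp implementation. The function names, the `fetch_event_info` callable, the default interval, and the tolerance for an arbitrary JSONP callback name are all assumptions for illustration; only the payload fields (`modified`, `type`, `refreshPage`, `matchIsComplete`, `resultDescription`) come from the findings.

```python
import json
import re
import time
from typing import Callable, Optional

# Optional JSONP wrapper such as `cb({...});`. The real callback name is not
# documented in the findings, so any identifier is accepted (assumption).
_JSONP = re.compile(r"^\s*[\w$.]+\s*\((.*)\)\s*;?\s*$", re.DOTALL)


def has_refresh_hint(body: str) -> bool:
    """Return True if a liveupdate payload contains a refreshPage signal."""
    match = _JSONP.match(body)
    payload = json.loads(match.group(1) if match else body)
    return any(item.get("type") == "refreshPage"
               for item in payload.get("modified", []))


def poll_until_complete(
    fetch_event_info: Callable[[], dict],
    interval_s: float = 30.0,
    max_attempts: int = 10,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Poll one event until matchIsComplete is true.

    fetch_event_info is a hypothetical callable that downloads the event
    detail page and returns the parsed eventJsonInfo object as a dict.
    Returns resultDescription, or None if still incomplete after max_attempts.
    """
    for attempt in range(max_attempts):
        info = fetch_event_info()
        if info.get("matchIsComplete") is True:
            # Snapshot the result the first time completion is observed.
            return info.get("resultDescription")
        if attempt < max_attempts - 1:
            sleep(interval_s)
    return None
```

Injecting `sleep` keeps the loop testable without real delays; a Phase 8 loader would presumably also need a stop condition for abandoned or cancelled events, which the findings do not cover.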
@@ -65,7 +79,7 @@ with scraping research, no implementation.
 
 | Phase | Agent | Model | Test Writer | Parallel | Notes |
 |---|---|---|---|---|---|
-| Phase 0 | phase-implementer | Sonnet 4.6 | ⏭️ Skipped (research only) | - | Throwaway probe; outputs SCRAPE_FINDINGS.md only |
+| Phase 0 | phase-implementer | Opus | ⏭️ Skipped (research only) | - | ✅ Done 2026-05-05. Outputs: spike/SCRAPE_FINDINGS.md + spike/SCHEMA_DRAFT.md + 7 local fixtures. Anonymous scraping confirmed feasible; HttpClient+AngleSharp recommended; no Playwright needed; no public results page found (Phase 8 deviation noted). |
 | Phase 1 | phase-implementer | Sonnet 4.6 | ⏭️ Skipped (Big Bang) | - | - |
 | Phase 2 | phase-implementer | Sonnet 4.6 | ⏭️ Skipped (Big Bang) | ✅ With 3 + 5 | - |
 | Phase 3 | phase-implementer | Sonnet 4.6 | ⏭️ Skipped (Big Bang) | ✅ With 2 + 5 | - |
@@ -87,4 +101,39 @@ with scraping research, no implementation.
 
 ## Implementation Notes
 
-(populated as we work)
+### Phase 0 (Scraping spike, 2026-05-05)
+
+- **Anonymous scraping is feasible** from a non-Belarus IP. No Cloudflare, no JS
+  challenge, no UA filtering observed. `Server: nginx`. Standard cookies only.
+- **Site is fully SSR.** All needed data (event grid, full odds, breadcrumbs,
+  period markets) is in the raw HTML. No SPA hydration required.
+- **Recommended scraper stack: HttpClient + AngleSharp + Polly v8.** Playwright is
+  not required for read-only scraping; keep it as an optional fallback flag
+  (`Scraping:UsePlaywright`) for future-proofing only.
+- **Polling cadence:** site itself polls live updates every 3 s; for our analyzer,
+  pre-match 30 s and live 5–10 s is sufficient.
+- **Rate-limit:** 5 sequential requests at 1 req/s pacing all returned 200 in <1 s,
+  no throttling. Recommend default `RequestsPerSecond=1`, `MaxConcurrent=4`.
+- **Sport ID semantics:** customer's "Sport_Code = 6" (Basketball) maps to
+  `data-sport-treeId="6"` in the breadcrumb-canonical sport listing
+  (`/su/betting/Basketball+-+6`). Some sports also have a separate "category tree
+  ID" used inside the live grouping (e.g., 45356 for Basketball-live); ignore
+  those, use only the canonical breadcrumb ID.
+- **Selection key format:** `<eventId>@<MarketName>{LineIndex?}.<Outcome>`. The
+  market name is sport-specific (`Match_Result`, `1st_Half_Result`, `Total_Goals`,
+  `Total_Points`, `Total_Games`, `To_Win_Match_With_Handicap`, etc.). Total
+  thresholds are encoded in the outcome (`Under_3.5`, `Over_213.5`). Handicap
+  values are NOT in the key; they're in `<span class="middle-simple">` text.
+- **Tennis has no Draw outcome**, so domain `Bet_Match_Draw` must be nullable.
+- **Date display ambiguity:** listing shows `HH:MM` (today) or `DD <ru-month> HH:MM`
+  (future). Anchor the parser on `initData.serverTime` (Moscow TZ, format
+  `YYYY,MM,DD,HH,MM,SS`).
+- **No public results page** (`/su/results` → 404). Final scores are exposed only
+  on the event detail page itself via `eventJsonInfo` JSON
+  (`matchIsComplete`, `resultDescription`). Phase 8 must poll until completion;
+  cannot back-fill from an archive endpoint.
+- **Probe environment:** Windows 10 + curl, geo-routed as Poland (`countryCode: PL`).
+  Customer in Belarus may see slightly different KYC overlays, so the parser must be
+  defensive (treat missing markets as null, never throw).
+- **Captures saved locally** at `spike/captures/*.html` (gitignored): 7 fixtures
+  for offline parser development in Phase 3.
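Review note on the selection key format: a parser for `<eventId>@<MarketName>{LineIndex?}.<Outcome>` might look like the sketch below. It is written in Python purely to illustrate the splitting logic. Several details are assumptions not confirmed by the findings: that the event ID is numeric, that the optional line index is a run of trailing digits on the market name, and that market names never contain a dot. The type and function names are hypothetical.

```python
import re
from typing import NamedTuple, Optional, Tuple


class SelectionKey(NamedTuple):
    event_id: str
    market: str
    line_index: Optional[int]
    outcome: str


# Assumptions: numeric event ID; market name ends in a letter or underscore;
# an optional line index is a run of trailing digits; the first dot after the
# market name starts the outcome (outcomes themselves may contain dots).
_KEY = re.compile(
    r"^(?P<event_id>\d+)@"
    r"(?P<market>[A-Za-z0-9_]*?[A-Za-z_])"
    r"(?P<line_index>\d+)?"
    r"\.(?P<outcome>.+)$"
)
_TOTAL = re.compile(r"^(Under|Over)_(\d+(?:\.\d+)?)$")


def parse_selection_key(key: str) -> Optional[SelectionKey]:
    """Split a selection key; return None instead of raising on bad input."""
    m = _KEY.match(key)
    if m is None:
        return None
    idx = m.group("line_index")
    return SelectionKey(m.group("event_id"), m.group("market"),
                        int(idx) if idx is not None else None,
                        m.group("outcome"))


def total_threshold(outcome: str) -> Optional[Tuple[str, float]]:
    """Extract (side, threshold) from outcomes such as Under_3.5."""
    m = _TOTAL.match(outcome)
    return (m.group(1), float(m.group(2))) if m else None
```

Returning `None` rather than raising follows the "treat missing markets as null, never throw" guidance. Handicap values would still have to come from the `<span class="middle-simple">` text, since they are not part of the key.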
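Review note on the date display ambiguity: anchoring listing times on `initData.serverTime` could be sketched as below. This Python sketch rests on several unverified assumptions that should be checked against the captured fixtures: that the month field in `serverTime` is 1-based, that Russian month names can be recognised by their first three letters, and that a listed month earlier than the server month means next year. The function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Moscow time is a fixed UTC+3 offset (no daylight saving).
MSK = timezone(timedelta(hours=3))

# Assumption: months appear either abbreviated or in genitive form, so the
# first three letters are enough to identify them.
RU_MONTH_PREFIXES = {
    "янв": 1, "фев": 2, "мар": 3, "апр": 4, "май": 5, "мая": 5, "июн": 6,
    "июл": 7, "авг": 8, "сен": 9, "окт": 10, "ноя": 11, "дек": 12,
}


def parse_server_time(raw: str) -> datetime:
    """Parse 'YYYY,MM,DD,HH,MM,SS' (month assumed to be 1-based)."""
    year, month, day, hour, minute, second = (int(p) for p in raw.split(","))
    return datetime(year, month, day, hour, minute, second, tzinfo=MSK)


def parse_listing_time(text: str, server_now: datetime) -> datetime:
    """Resolve 'HH:MM' (today) or 'DD <ru-month> HH:MM' (future)."""
    parts = text.split()
    hour, minute = (int(p) for p in parts[-1].split(":"))
    if len(parts) == 1:
        # Time only: the listing means today, per the server clock.
        return server_now.replace(hour=hour, minute=minute,
                                  second=0, microsecond=0)
    day = int(parts[0])
    month = RU_MONTH_PREFIXES[parts[1].lower()[:3]]
    # Assumption: a month earlier than the server's month is next year.
    year = server_now.year + (1 if month < server_now.month else 0)
    return datetime(year, month, day, hour, minute, tzinfo=MSK)
```

Unlike the defensive parsing recommended elsewhere in these notes, this sketch raises on an unknown month string; a production parser would likely map that case to null instead.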
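Review note on the rate-limit recommendation: the `RequestsPerSecond=1` default amounts to enforcing a minimum gap between request starts. The Python sketch below shows only that pacing arithmetic. It is not the Polly v8 configuration, and the `Pacer` class and its parameters are hypothetical. The `MaxConcurrent=4` cap is not modelled here.

```python
import time
from typing import Callable, Optional


class Pacer:
    """Enforce a minimum interval between successive request starts."""

    def __init__(
        self,
        requests_per_second: float = 1.0,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._min_interval = 1.0 / requests_per_second
        self._clock = clock
        self._sleep = sleep
        self._next_allowed: Optional[float] = None

    def wait(self) -> None:
        """Block until the next request is allowed, then reserve a slot."""
        now = self._clock()
        if self._next_allowed is not None and now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            # Treat the sleep as exact; the next slot starts where it ended.
            now = self._next_allowed
        self._next_allowed = now + self._min_interval
```

A caller would invoke `pacer.wait()` immediately before each HTTP request. The clock and sleep functions are injectable so the pacing can be verified offline against the captured fixtures without real delays.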