Anonymous scraping confirmed feasible for marathonbet.by: the site is fully SSR (nginx), with no Cloudflare or JS challenge. HttpClient + AngleSharp + Polly v8 is sufficient; Playwright is not required (kept as a future flag).
Spike outputs:
- spike/SCRAPE_FINDINGS.md: page rendering, URL templates, anti-bot, rate limits, recommended scraping strategy for Phase 3.
- spike/SCHEMA_DRAFT.md: customer-spec field → DOM selector mapping for Match + Period-N scope across football/basketball/tennis (hockey TBD).
Phase 1+ handoff captured in subplan + CLAUDE.md.
Critical Phase 8 finding: no public results endpoint at /su/results. Phase 8 must switch to polling event-detail until eventJsonInfo.matchIsComplete=true (deviation flagged).
Reviewer notes addressed:
- Period market outcome codes corrected to RN_H/RN_D/RN_A (not 1/draw/3) and market name vocabulary clarified per-sport in SCHEMA_DRAFT §3.1.
- results-page.html capture added to file list with caveat about live-landing score-state and unsampled hockey selectors.
Phase 0: Scraping Spike (Research, Throwaway)
Status: ✅ Done
Parent plan: PLAN.md
Domain: backend
Type: Research / spike. Produces documentation only, NO production code.
Objective
Determine whether marathonbet.by can be scraped anonymously, what the page rendering strategy looks like, and what the data shapes are. The output is a documented foundation that Phases 1–9 build on. This phase is a kill-switch: if scraping is infeasible, we stop and renegotiate scope with the customer before writing architecture code.
Tasks
- Probe https://www.marathonbet.by/su (pre-match) anonymously. Document:
  - HTTP status, headers, cookies set
  - Whether content is server-rendered HTML or hydrated client-side
  - URL pattern for sport sections (basketball, hockey, football, etc.)
  - Sport group codes (e.g., basketball = 6 per spec)
- Probe https://www.marathonbet.by/su/live (live events). Document:
  - Same as above
  - Whether odds update via XHR/fetch/WebSocket (capture network calls)
- Identify event-detail URL pattern and inspect a sample event's full odds page.
- For 3 events across 3 sports (basketball, football, tennis; hockey deferred to Phase 3 verify), capture:
  - Event metadata (sport, country, league, category, scheduled time, event ID)
  - Match-level bets: Win-1 / Draw / Win-2, Win-Fora-1/2 (with handicap value), Total Less/More (with threshold)
  - Period-N bets where the sport has periods
- Identify any anti-bot measures: Cloudflare challenges, JS challenges, rate limiting, header requirements, fingerprinting hints.
- Test rate behavior: ~10 sequential requests, observe latency / blocks. Do NOT hammer — be respectful.
- Document API endpoints if marathonbet.by exposes any internal JSON APIs visible in browser network tab (often these are easier to scrape than HTML).
- Decide: is HttpClient + AngleSharp sufficient, or is Playwright required (or both)?
- Save 2–3 representative HTML/JSON samples under spike/captures/ (gitignored; for local reference only). Saved 7 fixtures.
- Write spike/SCRAPE_FINDINGS.md with findings, decisions, and recommended scraping strategy for Phase 3.
- Write spike/SCHEMA_DRAFT.md with concrete proposed domain field mappings: marathonbet.by terms → spec field names (Bet_Match_Win_1, etc.).
Files to Modify/Create
- spike/SCRAPE_FINDINGS.md: research output (committed to repo)
- spike/SCHEMA_DRAFT.md: proposed domain mapping (committed to repo)
- spike/captures/*.html/.json: local samples (gitignored, NOT committed)
Acceptance Criteria
- SCRAPE_FINDINGS.md exists and answers:
  - Is anonymous scraping feasible? (yes/no/conditional)
  - What scraping technology is required? (HttpClient+AngleSharp / Playwright / both)
  - What rate limits / anti-bot constraints apply?
  - What URL patterns and endpoints will Phase 3 target?
- SCHEMA_DRAFT.md maps real marathonbet.by data to the customer-spec field names.
- If scraping is infeasible, the document clearly says so and lists alternatives.
- No production C# code is written in this phase.
Notes
- Use WebFetch tool for initial probing; supplement with curl/Bash if Playwright-style behavior needs investigation.
- Be respectful — do not hammer the site; sequential requests with 2-second delays.
- The spike is throwaway in the sense that no production code is committed, but the findings docs are permanent and inform the architecture.
- If marathonbet.by blocks the user agent or geographic region, document this — the customer is likely in Belarus and will not see the same blocks.
Review Checklist
- SCRAPE_FINDINGS.md answers all required questions above
- SCHEMA_DRAFT.md covers all bet types in the customer spec (Win/Draw/Win_Fora/Total at Match + Period-N scope)
- No production code committed
- Recommended Phase 3 strategy is concrete and actionable
- Risk register updated if anti-bot or rate-limit issues found
Handoff to Next Phase
Anonymous scraping is feasible, and the recommended technology is HttpClient + AngleSharp. No Cloudflare, no JS challenge. The site is fully SSR: all the data we need is in the raw HTML.
What Phase 1 (Domain) needs to know
- SportCode is the data-sport-treeId attribute / first integer after the sport name in /su/betting/<Sport>+-+<id>. Customer's "basketball=6" matches exactly. Confirmed IDs: Basketball=6, Football=11, Tennis=22723, Hockey=43658. Note: there are duplicate "category" tree IDs (e.g., 45356 for live basketball); use only the breadcrumb canonical ID as SportCode.
- EventCode is data-event-eventId (numeric, ~26-million range). This is the bookmaker's stable event ID; use it as the primary key for the event in our SQLite. TreeId is a separate URL-routing ID; keep it for URL building but do not use it as the entity primary key.
- No "Draw" outcome for tennis (and for some basketball variants). The Domain model should make the Draw rate nullable. Customer's spec field Bet_Match_Draw should serialize to an empty cell when null.
- Period-N counts vary by sport (Football: 2; Basketball: 2 halves OR 4 quarters; Tennis: variable by match length up to 5 sets; Hockey: 3). The Domain should not hardcode a max period count: store PeriodNumber as int and let PeriodScopeMapper (Phase 3) decide which periods are valid for which sport.
- Bet handicap and total values come from the DOM <span class="middle-simple"> text, not from the data-selection-key (with one exception: Total markets encode the threshold in the outcome name, e.g., Under_213.5). Domain Bet.Value is decimal?, populated for handicap and total, null for Win/Draw.
- ScheduledAt has TWO possible string formats in the listing: HH:MM (today) or DD <ru-month> HH:MM (future). Domain should store it as DateTimeOffset in Moscow time (Europe/Moscow, UTC+3). The "today" anchor comes from the initData.serverTime blob (YYYY,MM,DD,HH,MM,SS format). Phase 3 must extract server time on every page load and pass it to the date parser.
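The two ScheduledAt formats and the server-time anchor can be sketched as follows. This is a minimal Python illustration of the parsing logic only, not the production C# parser. It rests on assumptions that are not confirmed above: that the month field of serverTime is 1-based, that the Russian month can be identified by its first three letters (whether the listing abbreviates it or spells it out), and that a date earlier than the server date belongs to the next year. The function and constant names are hypothetical.

```python
import re
from datetime import datetime, timedelta, timezone

# Moscow time as a fixed UTC+3 offset.
MSK = timezone(timedelta(hours=3))

# ASSUMPTION: the Russian month is recognizable by its first three letters
# ("янв", "января", ...). Both "май" and genitive "мая" map to May.
RU_MONTH_PREFIXES = {
    "янв": 1, "фев": 2, "мар": 3, "апр": 4, "май": 5, "мая": 5, "июн": 6,
    "июл": 7, "авг": 8, "сен": 9, "окт": 10, "ноя": 11, "дек": 12,
}


def parse_server_time(blob: str) -> datetime:
    """Parse the 'YYYY,MM,DD,HH,MM,SS' blob (ASSUMPTION: month is 1-based)."""
    year, month, day, hour, minute, second = (int(p) for p in blob.split(","))
    return datetime(year, month, day, hour, minute, second, tzinfo=MSK)


def parse_scheduled_at(text: str, server_now: datetime) -> datetime:
    """Parse 'HH:MM' (today) or 'DD <ru-month> HH:MM' (future) listing text."""
    text = text.strip()

    # Format 1: time only, anchored to the server's current date.
    m = re.fullmatch(r"(\d{1,2}):(\d{2})", text)
    if m:
        return server_now.replace(
            hour=int(m[1]), minute=int(m[2]), second=0, microsecond=0
        )

    # Format 2: day, Russian month, time.
    m = re.fullmatch(r"(\d{1,2})\s+(\S+)\s+(\d{1,2}):(\d{2})", text)
    if not m:
        raise ValueError(f"unrecognized ScheduledAt text: {text!r}")
    month = RU_MONTH_PREFIXES.get(m[2].lower()[:3])
    if month is None:
        raise ValueError(f"unrecognized month: {m[2]!r}")

    candidate = datetime(
        server_now.year, month, int(m[1]), int(m[3]), int(m[4]), tzinfo=MSK
    )
    # ASSUMPTION: listings show upcoming events only, so a date before the
    # server date means the event falls in the next calendar year.
    if candidate.date() < server_now.date():
        candidate = candidate.replace(year=candidate.year + 1)
    return candidate
```

Passing the server time in explicitly, rather than reading the local clock, keeps the parser deterministic and matches the requirement that Phase 3 extract server time on every page load.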
What Phase 3 (Scraping) needs to know
Read spike/SCRAPE_FINDINGS.md end-to-end before designing the scraper.
Highlights:
- Selector inventory: in SCHEMA_DRAFT.md §1–§3 and in SCRAPE_FINDINGS.md §5.
- URL templates in SCRAPE_FINDINGS.md §3.
- Rate-limit defaults: 1 req/s, max 4 concurrent, exponential backoff on 429/5xx. Use Microsoft.Extensions.Http.Resilience (Polly v8).
- User-Agent rotation: the only mitigation we found worth adding. The site does not challenge the UA, but rotating guards against future fingerprint-based throttling.
- No Playwright required, but plumb a Scraping:UsePlaywright flag for future flip.
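The rate-limit defaults above can be illustrated with a small Python sketch. In production this behavior would come from Microsoft.Extensions.Http.Resilience rather than hand-written code; the sketch only shows the intended policy. The names Throttle, backoff_delay, and fetch_with_retry, as well as the base delay, cap, and retry count, are assumptions for illustration. The 4-connection concurrency cap and jitter are left out for brevity.

```python
import time
from typing import Callable, Optional, Tuple

# Retry only on 429 and 5xx, per the defaults above.
RETRYABLE_STATUSES = {429} | set(range(500, 600))


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Exponential delay for a 0-based retry attempt (base/cap are assumed values)."""
    return min(cap, base * (2 ** attempt))


class Throttle:
    """Spaces calls at least min_interval seconds apart (1 req/s by default).

    Not thread-safe; a real scraper would also cap concurrency at 4.
    """

    def __init__(self, min_interval: float = 1.0,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self._min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = float("-inf")

    def wait(self) -> None:
        now = self._clock()
        if now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = self._next_allowed
        self._next_allowed = now + self._min_interval


def fetch_with_retry(fetch: Callable[[str], Tuple[int, str]], url: str,
                     throttle: Optional[Throttle] = None,
                     max_retries: int = 3,
                     sleep: Callable[[float], None] = time.sleep) -> Tuple[int, str]:
    """Call fetch(url) -> (status, body), retrying retryable statuses with backoff."""
    status, body = 0, ""
    for attempt in range(max_retries + 1):
        if throttle is not None:
            throttle.wait()
        status, body = fetch(url)
        if status not in RETRYABLE_STATUSES:
            break
        if attempt < max_retries:
            sleep(backoff_delay(attempt))
    return status, body
```

The clock and sleep functions are injected so the policy can be exercised in tests without real waiting or network access.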
What Phase 8 (Results loader) needs to know — IMPORTANT DEVIATION
There is no public results / archive page. https://www.marathonbet.by/su/results
returns 404. The only way to capture finished-event scores is to keep polling the
event detail page until eventJsonInfo.matchIsComplete === true, then snapshot
resultDescription (e.g., "2:1 (1:1)").
This means Phase 8 must:
- Maintain a "watch list" of events whose ScheduledAt + EstimatedDuration is in the past but whose status in our DB is not yet Completed.
- Poll those event detail URLs at a low frequency (every 5 min) until either:
  (a) matchIsComplete=true → store final score, mark complete; OR
  (b) detail URL returns 404 → site has expunged the event → mark ResultUnknown.
- Optionally fall back to a third-party score aggregator (flashscore / sofascore); this is a separate Phase 8 design decision.
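The watch-list rules above amount to a small state machine, sketched here in Python as an illustration rather than the Phase 8 design. The WatchedEvent type, the is_due and apply_poll helpers, and the Pending status name are hypothetical; only matchIsComplete, resultDescription, Completed, and ResultUnknown come from the findings. The sketch also treats ResultUnknown as terminal so that expunged events are not polled again, which is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional


class ResultStatus(Enum):
    PENDING = "Pending"              # hypothetical name for "not yet Completed"
    COMPLETED = "Completed"
    RESULT_UNKNOWN = "ResultUnknown"


@dataclass
class WatchedEvent:
    event_code: int
    scheduled_at: datetime
    estimated_duration: timedelta
    status: ResultStatus = ResultStatus.PENDING
    final_score: Optional[str] = None


def is_due(event: WatchedEvent, now: datetime) -> bool:
    """True if the event should have finished and still needs polling."""
    # ASSUMPTION: ResultUnknown is terminal too, so only Pending events are polled.
    if event.status is not ResultStatus.PENDING:
        return False
    return event.scheduled_at + event.estimated_duration < now


def apply_poll(event: WatchedEvent, http_status: int,
               event_json_info: Optional[dict]) -> None:
    """Update the event from one poll of its detail page."""
    if http_status == 404:
        # (b) The site has expunged the event.
        event.status = ResultStatus.RESULT_UNKNOWN
    elif (http_status == 200 and event_json_info
          and event_json_info.get("matchIsComplete") is True):
        # (a) Snapshot the final score, e.g. "2:1 (1:1)".
        event.final_score = event_json_info.get("resultDescription")
        event.status = ResultStatus.COMPLETED
    # Any other outcome leaves the event pending for the next poll cycle.
```

A scheduler would call is_due for every watched event every 5 minutes, fetch the detail page for those that are due, and pass the response to apply_poll.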
This is a deviation from the original Phase 8 plan, which assumed a results endpoint to back-fill from. Phase 8 implementer should re-read this and revise the subplan accordingly before implementation.
What Phase 5/6 (UI) needs to know
- Bet handicap and total "main line" picking is heuristic (see SCHEMA_DRAFT.md §2.2 and §2.3) and should be exposed as a configurable policy. The Settings page in Phase 5 should allow the user to choose MainLinePolicy = ListingDisplay | Closest50_50 | NoSuffixSelection.
- Russian-only labels in the source HTML. Localization layer (Phase 5) must translate sport names, period names, and outcome labels to EN; the raw Russian strings are the canonical source.
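As an illustration of what one policy might do, the Python sketch below shows one plausible reading of Closest50_50: among the offered handicap or total lines, pick the one whose two outcome prices are closest to each other. This interpretation is an assumption; the authoritative definitions are in SCHEMA_DRAFT.md §2.2 and §2.3, and the Line type and function name are hypothetical. The other two policies are not sketched because they depend on DOM details documented there.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Line:
    value: float   # handicap value or total threshold
    odds_a: float  # decimal odds of the first outcome (e.g. Under)
    odds_b: float  # decimal odds of the second outcome (e.g. Over)


def pick_closest_50_50(lines: Sequence[Line]) -> Line:
    """Return the line with the smallest gap between its two prices.

    ASSUMPTION: this is what Closest50_50 means. Raises ValueError on an
    empty sequence.
    """
    return min(lines, key=lambda line: abs(line.odds_a - line.odds_b))
```

For example, given total lines at 211.5 (1.60 / 2.30), 213.5 (1.90 / 1.92), and 215.5 (2.25 / 1.62), this picks 213.5.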
Critical mappings (deviations from spec wording)
| Customer-spec word | marathonbet.by reality |
|---|---|
| Win_Fora | Handicap market in DOM (To_Win_Match_With_Handicap). Same concept, different word. |
| Total_Less / Total_More | DOM uses Under / Over. |
| Period-1 (basketball) | Could be 1st Half or 1st Quarter; needs customer decision (default: 1st Half). |
| Sport_Code = 6 | data-sport-treeId="6" confirmed for Basketball. |
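The Total_Less / Total_More row combines with the earlier note that Total markets encode the threshold in the outcome name. A minimal Python sketch of that translation follows. Only the Under_213.5 form is confirmed in the findings; the matching Over_<threshold> form is assumed by symmetry, and the function and dictionary names are hypothetical.

```python
from decimal import Decimal, InvalidOperation
from typing import Tuple

# DOM wording -> customer-spec wording, per the table above.
DOM_TO_SPEC = {"Under": "Total_Less", "Over": "Total_More"}


def parse_total_outcome(outcome_name: str) -> Tuple[str, Decimal]:
    """Split e.g. 'Under_213.5' into ('Total_Less', Decimal('213.5')).

    ASSUMPTION: 'Over_<threshold>' follows the same shape as the confirmed
    'Under_<threshold>' form.
    """
    side, sep, raw_threshold = outcome_name.partition("_")
    if not sep or side not in DOM_TO_SPEC:
        raise ValueError(f"not a Total outcome name: {outcome_name!r}")
    try:
        threshold = Decimal(raw_threshold)
    except InvalidOperation:
        raise ValueError(f"bad threshold in outcome name: {outcome_name!r}")
    return DOM_TO_SPEC[side], threshold
```

Decimal is used instead of float so the threshold round-trips exactly, mirroring the decimal? type chosen for Bet.Value.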