# Phase 0: Scraping Spike (Research, Throwaway)

**Status:** ✅ Done
**Parent plan:** [PLAN.md](./PLAN.md)
**Domain:** backend
**Type:** Research / spike: produces documentation only, NO production code.

## Objective

Determine whether marathonbet.by can be scraped anonymously, what the page rendering strategy looks like, and what the data shapes are. The output is a documented foundation that Phases 1–9 build on.

**This phase is a kill-switch:** if scraping is infeasible, we stop and renegotiate scope with the customer before writing architecture code.

## Tasks

- [x] Probe `https://www.marathonbet.by/su` (pre-match) anonymously. Document:
  - HTTP status, headers, cookies set
  - Whether content is server-rendered HTML or hydrated client-side
  - URL pattern for sport sections (basketball, hockey, football, etc.)
  - Sport group codes (e.g., basketball = 6 per spec)
- [x] Probe `https://www.marathonbet.by/su/live` (live events). Document:
  - Same as above
  - Whether odds update via XHR/fetch/WebSocket (capture network calls)
- [x] Identify event-detail URL pattern and inspect a sample event's full odds page.
- [x] For 3 events across 3 sports (basketball, football, tennis; hockey deferred to Phase 3 verification), capture:
  - Event metadata (sport, country, league, category, scheduled time, event ID)
  - Match-level bets: Win-1 / Draw / Win-2, Win-Fora-1/2 (with handicap value), Total Less/More (with threshold)
  - Period-N bets where the sport has periods
- [x] Identify any anti-bot measures: Cloudflare challenges, JS challenges, rate limiting, header requirements, fingerprinting hints.
- [x] Test rate behavior: ~10 sequential requests, observe latency / blocks. Do NOT hammer; be respectful.
- [x] Document API endpoints if marathonbet.by exposes any internal JSON APIs visible in the browser network tab (these are often easier to scrape than HTML).
- [x] Decide: HttpClient + AngleSharp sufficient, or Playwright required (or both)?
- [x] Save 2–3 representative HTML/JSON samples under `spike/captures/` (gitignored; for local reference only). Saved 7 fixtures.
- [x] Write `spike/SCRAPE_FINDINGS.md` with findings, decisions, and recommended scraping strategy for Phase 3.
- [x] Write `spike/SCHEMA_DRAFT.md` with concrete proposed domain field mappings: marathonbet.by terms → spec field names (`Bet_Match_Win_1`, etc.).

## Files to Modify/Create

- `spike/SCRAPE_FINDINGS.md`: research output (committed to repo)
- `spike/SCHEMA_DRAFT.md`: proposed domain mapping (committed to repo)
- `spike/captures/*.html` / `.json`: local samples (gitignored, NOT committed)

## Acceptance Criteria

- `SCRAPE_FINDINGS.md` exists and answers:
  - Is anonymous scraping feasible? (yes/no/conditional)
  - What scraping technology is required? (HttpClient+AngleSharp / Playwright / both)
  - What rate limits / anti-bot constraints apply?
  - What URL patterns and endpoints will Phase 3 target?
- `SCHEMA_DRAFT.md` maps real marathonbet.by data to the customer-spec field names.
- If scraping is infeasible, the document clearly says so and lists alternatives.
- **No production C# code is written in this phase.**

## Notes

- Use WebFetch tool for initial probing; supplement with curl/Bash if Playwright-style behavior needs investigation.
- Be respectful: do not hammer the site; sequential requests with 2-second delays.
- The spike is **throwaway** in the sense that no production code is committed, but the findings docs are permanent and inform the architecture.
- If marathonbet.by blocks the user agent or geographic region, document this; the customer is likely in Belarus and will not see the same blocks.
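
The "sequential requests with 2-second delays" rule can be sketched as a small throwaway script. This is only an illustration of one way the ~10-request rate test could be scripted, not a record of what the spike actually ran; the names `http_status` and `probe_sequentially` and the `spike-probe/0.1` User-Agent string are hypothetical.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, Iterable, List, Tuple


def http_status(url: str, timeout: float = 10.0) -> int:
    """Fetch `url` once and return its HTTP status code (standard library only)."""
    # The User-Agent value is a placeholder chosen for this sketch.
    req = urllib.request.Request(url, headers={"User-Agent": "spike-probe/0.1"})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # 4xx/5xx responses arrive as exceptions; the code is what we want to log.
        return exc.code


def probe_sequentially(
    urls: Iterable[str],
    fetch: Callable[[str], int] = http_status,
    delay_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> List[Tuple[str, int, float]]:
    """Request each URL one at a time, pausing `delay_s` between requests.

    Returns (url, status, latency_seconds) per request so that blocks
    (e.g. 403/429) and latency drift can be inspected afterwards.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_s)  # never fire back-to-back requests
        started = clock()
        status = fetch(url)
        results.append((url, status, clock() - started))
    return results
```

Passing `fetch` and `sleep` as parameters keeps the loop testable offline, which suits a spike where the real site should be contacted as rarely as possible.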

## Review Checklist

- [x] `SCRAPE_FINDINGS.md` answers all required questions above
- [x] `SCHEMA_DRAFT.md` covers all bet types in the customer spec (Win/Draw/Win_Fora/Total at Match + Period-N scope)
- [x] No production code committed
- [x] Recommended Phase 3 strategy is concrete and actionable
- [x] Risk register updated if anti-bot or rate-limit issues found

## Handoff to Next Phase

**Anonymous scraping is feasible, and the recommended technology is HttpClient + AngleSharp.** No Cloudflare, no JS challenge. The site is fully SSR: all the data we need is in the raw HTML.

### What Phase 1 (Domain) needs to know

1. **`SportCode`** is the `data-sport-treeId` attribute / first integer after the sport name in `/su/betting/+-+`. Customer's "basketball=6" matches exactly. Confirmed IDs: Basketball=6, Football=11, Tennis=22723, Hockey=43658. Note: there are duplicate "category" tree IDs (e.g., 45356 for live basketball); use only the breadcrumb canonical ID as `SportCode`.
2. **`EventCode`** is `data-event-eventId` (numeric, ~26-million range). This is the bookmaker's stable event ID; use it as the primary key for the event in our SQLite database. `TreeId` is a separate URL-routing ID; keep it for URL building but do not use it as the entity primary key.
3. **No "Draw" outcome for tennis (and for some basketball variants).** The Domain model should make the Draw rate nullable. Customer's spec field `Bet_Match_Draw` should serialize to an empty cell when null.
4. **Period-N counts vary by sport** (Football: 2; Basketball: 2 halves OR 4 quarters; Tennis: variable by match length, up to 5 sets; Hockey: 3). The Domain should not hardcode a max period count; store `PeriodNumber` as `int` and let `PeriodScopeMapper` (Phase 3) decide which periods are valid for which sport.
5. **Bet handicap and total values come from the DOM ``** text, not from the `data-selection-key` (with one exception: Total markets encode the threshold in the outcome name, e.g., `Under_213.5`).
   Domain `Bet.Value` is `decimal?`: populated for handicap and total, null for Win/Draw.
6. **`ScheduledAt`** has TWO possible string formats in the listing: `HH:MM` (today) or `DD HH:MM` (future). Domain should store it as `DateTimeOffset` in Moscow time (`Europe/Moscow`, UTC+3). The "today" anchor comes from the `initData.serverTime` blob (`YYYY,MM,DD,HH,MM,SS` format). Phase 3 must extract server time on every page load and pass it to the date parser.

### What Phase 3 (Scraping) needs to know

Read `spike/SCRAPE_FINDINGS.md` end-to-end before designing the scraper. Highlights:

- **Selector inventory:** in `SCHEMA_DRAFT.md` §1–§3 and in `SCRAPE_FINDINGS.md` §5.
- **URL templates:** in `SCRAPE_FINDINGS.md` §3.
- **Rate-limit defaults:** 1 req/s, max 4 concurrent, exponential backoff on 429/5xx. Use `Microsoft.Extensions.Http.Resilience` (Polly v8).
- **User-Agent rotation:** the only mitigation we saw a need for. The site does not challenge the UA, but rotating it guards against future fingerprint-based throttling.
- **No Playwright required**, but plumb a `Scraping:UsePlaywright` flag so it can be switched on later.

### What Phase 8 (Results loader) needs to know: IMPORTANT DEVIATION

**There is no public results / archive page.** `https://www.marathonbet.by/su/results` returns 404. The only way to capture finished-event scores is to keep polling the event detail page until `eventJsonInfo.matchIsComplete === true`, then snapshot `resultDescription` (e.g., `"2:1 (1:1)"`).

This means Phase 8 must:

1. Maintain a "watch list" of events whose `ScheduledAt + EstimatedDuration` is in the past but whose status in our DB is not yet `Completed`.
2. Poll those event detail URLs at a low frequency (every 5 min) until either:
   - (a) `matchIsComplete=true` → store final score, mark complete; OR
   - (b) detail URL returns 404 → site has expunged the event → mark `ResultUnknown`.
3. Optionally fall back to a third-party score aggregator (flashscore / sofascore); this is a separate Phase 8 design decision.
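
The watch-list rules above can be sketched as two small functions. This is a minimal illustration under stated assumptions, not the Phase 8 design: the `WatchedEvent` type, the `Scheduled` status string, and the function names are hypothetical, and treating `ResultUnknown` as terminal (no further polling) is an assumption. Only `matchIsComplete`, `resultDescription`, `Completed`, and `ResultUnknown` come from the findings.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class WatchedEvent:
    """Hypothetical watch-list row; field names are illustrative only."""
    event_code: int
    scheduled_at: datetime
    estimated_duration: timedelta
    status: str = "Scheduled"
    final_score: Optional[str] = None


def needs_polling(ev: WatchedEvent, now: datetime) -> bool:
    """True when the event should have finished but has no terminal status yet."""
    if ev.status in ("Completed", "ResultUnknown"):
        return False
    return ev.scheduled_at + ev.estimated_duration < now


def apply_poll_result(
    ev: WatchedEvent, http_status: int, event_json: Optional[dict]
) -> WatchedEvent:
    """Update `ev` from one poll of its detail page.

    `event_json` stands for the parsed `eventJsonInfo` blob, or None
    when the page could not be parsed.
    """
    if http_status == 404:
        # Case (b): the site has expunged the event.
        ev.status = "ResultUnknown"
    elif http_status == 200 and event_json and event_json.get("matchIsComplete") is True:
        # Case (a): snapshot the final score string as-is, e.g. "2:1 (1:1)".
        ev.status = "Completed"
        ev.final_score = event_json.get("resultDescription")
    # Anything else (still running, 5xx, parse failure): leave for the next poll.
    return ev
```

A scheduler would call `needs_polling` every 5 minutes to select events and feed each response into `apply_poll_result`; transient errors simply leave the event on the list.
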

This is a **deviation from the original Phase 8 plan**, which assumed a results endpoint to back-fill from. The Phase 8 implementer should re-read this and revise the subplan accordingly before implementation.

### What Phase 5/6 (UI) needs to know

- **Bet handicap and total "main line" picking** is heuristic (see `SCHEMA_DRAFT.md` §2.2 and §2.3) and should be exposed as a configurable policy. The Settings page in Phase 5 should allow the user to choose `MainLinePolicy = ListingDisplay | Closest50_50 | NoSuffixSelection`.
- **Russian-only labels** in the source HTML. The localization layer (Phase 5) must translate sport names, period names, and outcome labels to EN; the raw Russian strings are the canonical source.

### Critical mappings (deviations from spec wording)

| Customer-spec word | marathonbet.by reality |
| --- | --- |
| `Win_Fora` | `Handicap` market in DOM (`To_Win_Match_With_Handicap`). Same concept, different word. |
| `Total_Less` / `Total_More` | DOM uses `Under` / `Over`. |
| `Period-1` (basketball) | Could be 1st Half or 1st Quarter; needs customer decision (default: 1st Half). |
| `Sport_Code = 6` | `data-sport-treeId="6"` confirmed for Basketball. |
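
The `Under` / `Over` row combines with the Total-market exception noted for Phase 1 (the threshold is encoded in the outcome name, e.g., `Under_213.5`). A minimal sketch of that translation follows; the function name and lookup table are hypothetical, and it assumes `Over` outcomes use the same `<side>_<threshold>` shape as the one `Under_213.5` example in the findings.

```python
from decimal import Decimal, InvalidOperation
from typing import Tuple

# DOM outcome word -> customer-spec word (from the mapping table above).
TOTAL_SIDE_TO_SPEC = {"Under": "Total_Less", "Over": "Total_More"}


def parse_total_outcome(outcome_name: str) -> Tuple[str, Decimal]:
    """Split a Total outcome name into (spec word, threshold).

    Raises ValueError for names that do not look like `<side>_<number>`.
    """
    side, sep, raw_threshold = outcome_name.partition("_")
    if not sep or side not in TOTAL_SIDE_TO_SPEC:
        raise ValueError(f"not a Total outcome name: {outcome_name!r}")
    try:
        # Decimal keeps 213.5 exact, mirroring the `decimal?` Domain type.
        threshold = Decimal(raw_threshold)
    except InvalidOperation:
        raise ValueError(f"bad threshold in {outcome_name!r}") from None
    return TOTAL_SIDE_TO_SPEC[side], threshold


print(parse_total_outcome("Under_213.5"))  # -> ('Total_Less', Decimal('213.5'))
```

Keeping the DOM-to-spec vocabulary in one lookup table means a later wording change on either side touches a single place.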