# Phase 0: Scraping Spike (Research, Throwaway)

**Status:** ⬜ Not Started

**Parent plan:** [PLAN.md](./PLAN.md)

**Domain:** backend

**Type:** Research / spike; produces documentation only, NO production code.

## Objective

Determine whether marathonbet.by can be scraped anonymously, how its pages are rendered, and what the data shapes are. The output is a documented foundation that Phases 1–9 build on.

**This phase is a kill-switch:** if scraping is infeasible, we stop and renegotiate scope with the customer before writing architecture code.

## Tasks

- [ ] Probe `https://www.marathonbet.by/su` (pre-match) anonymously. Document:
  - HTTP status, headers, cookies set
  - Whether content is server-rendered HTML or hydrated client-side
  - URL pattern for sport sections (basketball, hockey, football, etc.)
  - Sport group codes (e.g., basketball = 6 per spec)
- [ ] Probe `https://www.marathonbet.by/su/live` (live events). Document:
  - Same as above
  - Whether odds update via XHR/fetch/WebSocket (capture the network calls)
- [ ] Identify the event-detail URL pattern and inspect a sample event's full odds page.
- [ ] For 3 events across 3 sports (e.g., basketball, hockey, tennis), capture:
  - Event metadata (sport, country, league, category, scheduled time, event ID)
  - Match-level bets: Win-1 / Draw / Win-2, Win-Fora-1/2 (with handicap value), Total Less/More (with threshold)
  - Period-N bets where the sport has periods
- [ ] Identify any anti-bot measures: Cloudflare challenges, JS challenges, rate limiting, header requirements, fingerprinting hints.
- [ ] Test rate behavior: ~10 sequential requests, observe latency / blocks. Do NOT hammer the site; be respectful.
- [ ] Document API endpoints if marathonbet.by exposes any internal JSON APIs visible in the browser network tab (these are often easier to scrape than HTML).
- [ ] Decide: HttpClient + AngleSharp sufficient, or Playwright required (or both)?
- [ ] Save 2–3 representative HTML/JSON samples under `spike/captures/` (gitignored; for local reference only).
- [ ] Write `spike/SCRAPE_FINDINGS.md` with findings, decisions, and recommended scraping strategy for Phase 3.
- [ ] Write `spike/SCHEMA_DRAFT.md` with concrete proposed domain field mappings: marathonbet.by terms → spec field names (`Bet_Match_Win_1`, etc.).

## Files to Modify/Create

- `spike/SCRAPE_FINDINGS.md`: research output (committed to repo)
- `spike/SCHEMA_DRAFT.md`: proposed domain mapping (committed to repo)
- `spike/captures/*.html` / `.json`: local samples (gitignored, NOT committed)

## Acceptance Criteria

- `SCRAPE_FINDINGS.md` exists and answers:
  - Is anonymous scraping feasible? (yes/no/conditional)
  - What scraping technology is required? (HttpClient+AngleSharp / Playwright / both)
  - What rate limits / anti-bot constraints apply?
  - What URL patterns and endpoints will Phase 3 target?
- `SCHEMA_DRAFT.md` maps real marathonbet.by data to the customer-spec field names.
- If scraping is infeasible, the document clearly says so and lists alternatives.
- **No production C# code is written in this phase.**

## Notes

- Use WebFetch tool for initial probing; supplement with curl/Bash if Playwright-style behavior needs investigation.
- Be respectful: do not hammer the site; use sequential requests with 2-second delays.
- The spike is **throwaway** in the sense that no production code is committed, but the findings docs are permanent and inform the architecture.
- If marathonbet.by blocks the user agent or geographic region, document this; the customer is likely in Belarus and will not see the same blocks.

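The note about sequential requests with 2-second delays could be followed with a small throwaway script along these lines. This is only a sketch under stated assumptions: it is written in Python because no production C# is allowed in this phase, and the names `ProbeResult`, `fetch_once`, `probe_sequentially`, and `looks_blocked`, the placeholder User-Agent string, and the choice of 403/429/503 as "possibly blocked" statuses are all hypothetical and not taken from the plan or from anything known about marathonbet.by.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass, field


@dataclass
class ProbeResult:
    """What the spike wants to record for each request."""
    url: str
    status: int
    latency_s: float
    headers: dict = field(default_factory=dict)
    cookie_names: list = field(default_factory=list)
    body_bytes: int = 0


def fetch_once(url, timeout_s=15.0):
    """Fetch one URL and return (status, headers, cookie_names, body)."""
    # The User-Agent value is a placeholder assumption, not a known requirement.
    req = urllib.request.Request(url, headers={"User-Agent": "spike-probe/0.1"})
    try:
        with urllib.request.urlopen(req, timeout=timeout_s) as resp:
            status, msg, body = resp.status, resp.headers, resp.read()
    except urllib.error.HTTPError as err:
        # Error responses are findings too (e.g. a 403 block), so keep them.
        status, msg, body = err.code, err.headers, err.read()
    # Record cookie names only; values are not needed for the findings doc.
    cookies = [c.split("=", 1)[0].strip() for c in msg.get_all("Set-Cookie") or []]
    return status, dict(msg.items()), cookies, body


def probe_sequentially(urls, fetch=fetch_once, delay_s=2.0,
                       sleep=time.sleep, clock=time.monotonic):
    """Request each URL one at a time, pausing delay_s between requests."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay_s)  # pause between requests, never before the first
        started = clock()
        status, headers, cookies, body = fetch(url)
        results.append(
            ProbeResult(url, status, clock() - started, headers, cookies, len(body))
        )
    return results


def looks_blocked(result):
    """Rough heuristic (an assumption): these statuses often signal blocking."""
    return result.status in (403, 429, 503)
```

Calling `probe_sequentially(["https://www.marathonbet.by/su", "https://www.marathonbet.by/su/live"])` would issue the two requests two seconds apart; the returned `ProbeResult` values can then be copied into `spike/SCRAPE_FINDINGS.md`. The `fetch` and `sleep` parameters are injectable so the pacing logic can be checked without touching the site.
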
## Review Checklist

- [ ] `SCRAPE_FINDINGS.md` answers all required questions above
- [ ] `SCHEMA_DRAFT.md` covers all bet types in the customer spec (Win/Draw/Win_Fora/Total at Match + Period-N scope)
- [ ] No production code committed
- [ ] Recommended Phase 3 strategy is concrete and actionable
- [ ] Risk register updated if anti-bot or rate-limit issues found

## Handoff to Next Phase