Phase 0: Scraping Spike (Research, Throwaway)

Status: Not Started
Parent plan: PLAN.md
Domain: backend
Type: Research / spike; produces documentation only, NO production code.

Objective

Determine whether marathonbet.by can be scraped anonymously, how its pages are rendered, and what shape the data takes. The output is a documented foundation that Phases 1-9 build on. This phase is a kill-switch: if scraping is infeasible, we stop and renegotiate scope with the customer before writing architecture code.

Tasks

  • Probe https://www.marathonbet.by/su (pre-match) anonymously. Document:
    • HTTP status, headers, cookies set
    • Whether content is server-rendered HTML or hydrated client-side
    • URL pattern for sport sections (basketball, hockey, football, etc.)
    • Sport group codes (e.g., basketball = 6 per spec)
  • Probe https://www.marathonbet.by/su/live (live events). Document:
    • Same as above
    • Whether odds update via XHR/fetch/WebSocket — capture network calls
  • Identify event-detail URL pattern and inspect a sample event's full odds page.
  • For 3 events across 3 sports (e.g., basketball, hockey, tennis), capture:
    • Event metadata (sport, country, league, category, scheduled time, event ID)
    • Match-level bets: Win-1 / Draw / Win-2, Win-Fora-1/2 (with handicap value), Total Less/More (with threshold)
    • Period-N bets where the sport has periods
  • Identify any anti-bot measures: Cloudflare challenges, JS challenges, rate limiting, header requirements, fingerprinting hints.
  • Test rate behavior: ~10 sequential requests, observe latency / blocks. Do NOT hammer — be respectful.
  • Document API endpoints if marathonbet.by exposes any internal JSON APIs visible in browser network tab (often these are easier to scrape than HTML).
  • Decide: is HttpClient + AngleSharp sufficient, or is Playwright required (or both)?
  • Save 2-3 representative HTML/JSON samples under spike/captures/ (gitignored; for local reference only).
  • Write spike/SCRAPE_FINDINGS.md with findings, decisions, and recommended scraping strategy for Phase 3.
  • Write spike/SCHEMA_DRAFT.md with concrete proposed domain field mappings — marathonbet.by terms → spec field names (Bet_Match_Win_1, etc.).
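
For the rendering and anti-bot tasks above, a small triage helper can make the notes in SCRAPE_FINDINGS.md more consistent across probes. The following is a minimal sketch, not a recommended implementation: the function and class names are hypothetical, the 500-character threshold is an arbitrary assumption to tune against real captures, and the header checks (`Server: cloudflare`, `cf-mitigated`, `Retry-After`) are common signals rather than anything confirmed for marathonbet.by.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser


class _VisibleTextCounter(HTMLParser):
    """Counts text characters that sit outside <script> and <style> blocks."""

    def __init__(self):
        super().__init__()
        self.visible_chars = 0
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.visible_chars += len(data.strip())


@dataclass
class ProbeReport:
    status: int
    rendering: str  # "server-rendered" or "likely client-hydrated"
    anti_bot_hints: list = field(default_factory=list)


def classify_response(status, headers, body, min_visible_chars=500):
    """Heuristic triage of one response; thresholds are assumptions."""
    lowered = {key.lower(): value for key, value in headers.items()}
    hints = []
    if status in (403, 429, 503):
        hints.append(f"blocking-style status {status}")
    if "cloudflare" in lowered.get("server", "").lower():
        hints.append("served via Cloudflare")
    if "cf-mitigated" in lowered:
        hints.append("Cloudflare challenge header present")
    if "retry-after" in lowered:
        hints.append(f"Retry-After: {lowered['retry-after']}")

    counter = _VisibleTextCounter()
    counter.feed(body)
    # Very little visible text in the raw HTML suggests the page is an
    # empty shell that JavaScript fills in later.
    if counter.visible_chars >= min_visible_chars:
        rendering = "server-rendered"
    else:
        rendering = "likely client-hydrated"
    return ProbeReport(status, rendering, hints)
```

A "likely client-hydrated" result is only a prompt to open the browser network tab and look for XHR/fetch/WebSocket traffic; it does not by itself settle the HttpClient versus Playwright question.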

Files to Modify/Create

  • spike/SCRAPE_FINDINGS.md — research output (committed to repo)
  • spike/SCHEMA_DRAFT.md — proposed domain mapping (committed to repo)
  • spike/captures/*.html / .json — local samples (gitignored, NOT committed)

Acceptance Criteria

  • SCRAPE_FINDINGS.md exists and answers:
    • Is anonymous scraping feasible? (yes/no/conditional)
    • What scraping technology is required? (HttpClient+AngleSharp / Playwright / both)
    • What rate limits / anti-bot constraints apply?
    • What URL patterns and endpoints will Phase 3 target?
  • SCHEMA_DRAFT.md maps real marathonbet.by data to the customer-spec field names.
  • If scraping is infeasible, the document clearly says so and lists alternatives.
  • No production C# code is written in this phase.

Notes

  • Use the WebFetch tool for initial probing; supplement with curl/Bash if Playwright-style behavior needs investigation.
  • Be respectful: do not hammer the site; send requests sequentially with 2-second delays between them.
  • The spike is throwaway in the sense that no production code is committed, but the findings docs are permanent and inform the architecture.
  • If marathonbet.by blocks the user agent or the geographic region, document this: the customer is likely in Belarus and may not see the same blocks.
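
The pacing rule in these notes (sequential requests, 2-second delays, roughly 10 requests for the rate test) could be scripted along the following lines. This is a sketch under assumptions: `polite_probe` and `fetch_status` are hypothetical names, the `scraping-spike/0.1` user agent is a placeholder, and stopping at the first 403 or 429 is one cautious choice rather than a requirement of the plan. The fetch, sleep, and clock functions are injectable so the loop can be exercised without touching the network.

```python
import time
import urllib.error
import urllib.request


def fetch_status(url, timeout=10.0):
    """Return the HTTP status of a GET request, including 4xx/5xx codes."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "scraping-spike/0.1"}  # placeholder UA
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code


def polite_probe(urls, fetch=fetch_status, delay_seconds=2.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Fetch URLs one at a time, pausing between requests.

    Records status and latency per request and stops at the first
    response that looks like a block (403 or 429).
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # pause before every request after the first
        started = clock()
        status = fetch(url)
        latency = clock() - started
        results.append({"url": url, "status": status, "latency_s": latency})
        if status in (403, 429):
            break  # do not keep pushing once the site signals a block
    return results
```

For the rate-behavior task, something like `polite_probe(["https://www.marathonbet.by/su"] * 10)` would cover the ~10 sequential requests; the per-request latencies and any early stop can then be copied into SCRAPE_FINDINGS.md.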

Review Checklist

  • SCRAPE_FINDINGS.md answers all required questions above
  • SCHEMA_DRAFT.md covers all bet types in the customer spec (Win/Draw/Win_Fora/Total at Match + Period-N scope)
  • No production code committed
  • Recommended Phase 3 strategy is concrete and actionable
  • Risk register updated if anti-bot or rate-limit issues found
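
The schema coverage item in this checklist can be checked mechanically once SCHEMA_DRAFT.md exists. The sketch below is illustrative only: `Bet_Match_Win_1` is the single field name given in this plan, and every other name is generated from a hypothetical `Bet_<scope>_<outcome>` pattern that must be replaced with the real customer-spec names. The helper names are also assumptions.

```python
import re
from itertools import product

# Outcomes taken from the bet types listed in the Tasks section.
# The exact spelling of each name is an assumption, not the customer spec.
OUTCOMES = (
    "Win_1", "Draw", "Win_2",
    "Win_Fora_1", "Win_Fora_2",
    "Total_Less", "Total_More",
)


def expected_fields(period_count):
    """Build the field names for Match scope plus Period 1..period_count."""
    scopes = ["Match"] + [f"Period_{n}" for n in range(1, period_count + 1)]
    return [f"Bet_{scope}_{outcome}" for scope, outcome in product(scopes, OUTCOMES)]


def missing_fields(draft_text, period_count):
    """Return expected field names that never appear in the draft text."""
    present = set(re.findall(r"Bet_\w+", draft_text))
    return [name for name in expected_fields(period_count) if name not in present]
```

Running `missing_fields(open("spike/SCHEMA_DRAFT.md", encoding="utf-8").read(), period_count=4)` against a basketball mapping, for example, would list any Match or Period-N fields the draft has not covered yet; the period count per sport is something the spike itself has to establish.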

Handoff to Next Phase