# Phase 0: Scraping Spike (Research, Throwaway)
**Status:** ⬜ Not Started
**Parent plan:** [PLAN.md](./PLAN.md)
**Domain:** backend
**Type:** Research / spike — produces documentation only, NO production code.
## Objective
Determine whether marathonbet.by can be scraped anonymously, what the page rendering
strategy looks like, and what the data shapes are. The output is a documented foundation
that Phases 1–9 build on. **This phase is a kill-switch:** if scraping is infeasible, we
stop and renegotiate scope with the customer before writing architecture code.
## Tasks
- [ ] Probe `https://www.marathonbet.by/su` (pre-match) anonymously. Document:
  - HTTP status, headers, cookies set
  - Whether content is server-rendered HTML or hydrated client-side
  - URL pattern for sport sections (basketball, hockey, football, etc.)
  - Sport group codes (e.g., basketball = 6 per spec)
- [ ] Probe `https://www.marathonbet.by/su/live` (live events). Document:
  - The same items as for the pre-match page
  - Whether odds update via XHR/fetch/WebSocket (capture the network calls)
- [ ] Identify event-detail URL pattern and inspect a sample event's full odds page.
- [ ] For 3 events across 3 sports (e.g., basketball, hockey, tennis), capture:
  - Event metadata (sport, country, league, category, scheduled time, event ID)
  - Match-level bets: Win-1 / Draw / Win-2, Win-Fora-1/2 (with handicap value),
    Total Less/More (with threshold)
  - Period-N bets where the sport has periods
- [ ] Identify any anti-bot measures: Cloudflare challenges, JS challenges, rate
limiting, header requirements, fingerprinting hints.
- [ ] Test rate behavior: ~10 sequential requests, observe latency / blocks. Do NOT
hammer — be respectful.
- [ ] Document API endpoints if marathonbet.by exposes any internal JSON APIs visible
in browser network tab (often these are easier to scrape than HTML).
- [ ] Decide: is HttpClient + AngleSharp sufficient, or is Playwright required (or both)?
- [ ] Save 2–3 representative HTML/JSON samples under `spike/captures/` (gitignored;
for local reference only).
- [ ] Write `spike/SCRAPE_FINDINGS.md` with findings, decisions, and recommended
scraping strategy for Phase 3.
- [ ] Write `spike/SCHEMA_DRAFT.md` with concrete proposed domain field mappings —
marathonbet.by terms → spec field names (`Bet_Match_Win_1`, etc.).
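
Several of the probing tasks above (status and headers, server-rendered versus client-hydrated content, anti-bot hints) can be recorded per captured response with a small throwaway helper. The following is a minimal sketch, not a prescribed tool: the function name `classify_response`, the result keys, and the `min_text_chars` threshold of 500 are all assumptions chosen for illustration, and the rendering check is only a rough heuristic that compares visible text against an arbitrary cutoff.

```python
from html.parser import HTMLParser


class _VisibleTextCounter(HTMLParser):
    """Counts characters of text that sit outside <script> and <style> tags."""

    HIDDEN_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._hidden_depth = 0
        self.text_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.HIDDEN_TAGS:
            self._hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in self.HIDDEN_TAGS and self._hidden_depth > 0:
            self._hidden_depth -= 1

    def handle_data(self, data):
        if self._hidden_depth == 0:
            self.text_chars += len(data.strip())


def classify_response(status, headers, body, min_text_chars=500):
    """Summarise one captured response (hypothetical helper for the spike notes).

    `headers` is a plain dict; `min_text_chars` is an assumed, tunable cutoff.
    """
    lower = {name.lower(): value for name, value in headers.items()}
    counter = _VisibleTextCounter()
    counter.feed(body)
    counter.close()
    return {
        # 403/429/503 are common block or challenge statuses, not proof of one.
        "suspicious_status": status in (403, 429, 503),
        # Cloudflare-fronted responses typically carry these headers.
        "cloudflare": "cf-ray" in lower
        or lower.get("server", "").lower() == "cloudflare",
        "retry_after": lower.get("retry-after"),
        "sets_cookies": "set-cookie" in lower,
        # Little visible text in the raw HTML suggests client-side hydration.
        "likely_server_rendered": counter.text_chars >= min_text_chars,
    }
```

A result with `likely_server_rendered` set to `False` is only a prompt to inspect the browser network tab, not a decision in favour of Playwright by itself.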
## Files to Modify/Create
- `spike/SCRAPE_FINDINGS.md` — research output (committed to repo)
- `spike/SCHEMA_DRAFT.md` — proposed domain mapping (committed to repo)
- `spike/captures/*.html` / `.json` — local samples (gitignored, NOT committed)
## Acceptance Criteria
- `SCRAPE_FINDINGS.md` exists and answers:
  - Is anonymous scraping feasible? (yes/no/conditional)
  - What scraping technology is required? (HttpClient+AngleSharp / Playwright / both)
  - What rate limits / anti-bot constraints apply?
  - What URL patterns and endpoints will Phase 3 target?
- `SCHEMA_DRAFT.md` maps real marathonbet.by data to the customer-spec field names.
- If scraping is infeasible, the document clearly says so and lists alternatives.
- **No production C# code is written in this phase.**
## Notes
- Use the WebFetch tool for initial probing; supplement with curl/Bash if browser-like
  (Playwright-style) behavior needs investigation.
- Be respectful: do not hammer the site; send requests sequentially with 2-second delays between them.
- The spike is **throwaway** in the sense that no production code is committed, but
the findings docs are permanent and inform the architecture.
- If marathonbet.by blocks the user agent or geographic region, document this; the
  customer is likely in Belarus and may not encounter the same blocks.
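
The pacing described in these notes (sequential requests, a 2-second delay, stop rather than hammer) could be scripted roughly as follows. This is a sketch under stated assumptions: `fetch_status`, `probe_sequentially`, and the `spike-probe/0.1` user agent string are hypothetical names, and stopping at the first 403 or 429 is one possible politeness policy rather than a requirement from the plan. The fetcher, sleep, and clock are injectable so the loop can be exercised without touching the network.

```python
import time
import urllib.error
import urllib.request


def fetch_status(url, timeout=15):
    """Fetch one URL with the standard library; returns (status, headers dict)."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "spike-probe/0.1"}  # assumed UA string
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, dict(response.headers.items())
    except urllib.error.HTTPError as err:
        # Non-2xx responses still carry a status and headers worth recording.
        return err.code, dict(err.headers.items())


def probe_sequentially(urls, fetch=fetch_status, delay_s=2.0,
                       sleep=time.sleep, clock=time.monotonic):
    """Request each URL in order, pausing `delay_s` seconds between requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_s)  # never issue back-to-back requests
        started = clock()
        status, _headers = fetch(url)
        results.append(
            {"url": url, "status": status, "latency_s": clock() - started}
        )
        if status in (403, 429):
            break  # back off at the first sign of blocking or rate limiting
    return results
```

Passing a list of about ten URLs matches the rate-behavior task; the recorded latencies and statuses can be pasted into `spike/SCRAPE_FINDINGS.md`.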
## Review Checklist
- [ ] `SCRAPE_FINDINGS.md` answers all required questions above
- [ ] `SCHEMA_DRAFT.md` covers all bet types in the customer spec
(Win/Draw/Win_Fora/Total at Match + Period-N scope)
- [ ] No production code committed
- [ ] Recommended Phase 3 strategy is concrete and actionable
- [ ] Risk register updated if anti-bot or rate-limit issues found
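
The coverage item in this checklist (every bet type at Match and Period-N scope) can be checked mechanically against a draft mapping. The sketch below is illustrative only: apart from `Bet_Match_Win_1`, which the plan names, every field name it generates is an assumed extrapolation of that pattern and must be verified against the customer spec. The helper names `expected_fields` and `missing_fields` are hypothetical as well.

```python
from itertools import product

# Bet kinds listed in the plan. The suffix spellings are ASSUMPTIONS
# extrapolated from the single confirmed example `Bet_Match_Win_1`.
BET_KINDS = [
    "Win_1", "Draw", "Win_2",
    "Win_Fora_1", "Win_Fora_2",
    "Total_Less", "Total_More",
]


def expected_fields(period_count):
    """List the spec field names expected for a sport with `period_count` periods."""
    scopes = ["Match"] + [f"Period_{n}" for n in range(1, period_count + 1)]
    return [f"Bet_{scope}_{kind}" for scope, kind in product(scopes, BET_KINDS)]


def missing_fields(draft_mapping, period_count):
    """Return expected field names that no site term is mapped to yet.

    `draft_mapping` maps a bookmaker-side label (as observed during the
    spike) to a spec field name.
    """
    mapped = set(draft_mapping.values())
    return [name for name in expected_fields(period_count) if name not in mapped]
```

An empty result from `missing_fields` for each sampled sport is one way to tick the `SCHEMA_DRAFT.md` coverage box; sports without a draw outcome would need their expected list trimmed accordingly.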
## Handoff to Next Phase
<!-- Filled by Phase 0 implementer. Critical: list anything Phase 1+ implementers must know,
especially deviations from the customer spec field names due to real bookmaker data. -->