docs(initial-implementation): add feature plan and 10 phase subplans
@@ -0,0 +1,84 @@
# Phase 0: Scraping Spike (Research, Throwaway)

**Status:** ⬜ Not Started
**Parent plan:** [PLAN.md](./PLAN.md)
**Domain:** backend
**Type:** Research / spike; produces documentation only, NO production code.
## Objective

Determine whether marathonbet.by can be scraped anonymously, how its pages are rendered,
and what shape the data takes. The output is a documented foundation
that Phases 1–9 build on. **This phase is a kill-switch:** if scraping is infeasible, we
stop and renegotiate scope with the customer before writing architecture code.
## Tasks

- [ ] Probe `https://www.marathonbet.by/su` (pre-match) anonymously. Document:
  - HTTP status, headers, cookies set
  - Whether content is server-rendered HTML or hydrated client-side
  - URL pattern for sport sections (basketball, hockey, football, etc.)
  - Sport group codes (e.g., basketball = 6 per spec)
- [ ] Probe `https://www.marathonbet.by/su/live` (live events). Document:
  - The same items as for the pre-match page
  - Whether odds update via XHR/fetch/WebSocket (capture the network calls)
- [ ] Identify the event-detail URL pattern and inspect a sample event's full odds page.
- [ ] For 3 events across 3 sports (e.g., basketball, hockey, tennis), capture:
  - Event metadata (sport, country, league, category, scheduled time, event ID)
  - Match-level bets: Win-1 / Draw / Win-2, Win-Fora-1/2 (with handicap value),
    Total Less/More (with threshold)
  - Period-N bets where the sport has periods
- [ ] Identify any anti-bot measures: Cloudflare challenges, JS challenges, rate
  limiting, header requirements, fingerprinting hints.
- [ ] Test rate behavior: send ~10 sequential requests and observe latency and blocks.
  Do NOT hammer the site; be respectful.
- [ ] Document any internal JSON API endpoints that marathonbet.by exposes, as seen in the
  browser network tab (these are often easier to scrape than HTML).
- [ ] Decide: is HttpClient + AngleSharp sufficient, or is Playwright required (or both)?
- [ ] Save 2–3 representative HTML/JSON samples under `spike/captures/` (gitignored;
  for local reference only).
- [ ] Write `spike/SCRAPE_FINDINGS.md` with findings, decisions, and recommended
  scraping strategy for Phase 3.
- [ ] Write `spike/SCHEMA_DRAFT.md` with concrete proposed domain field mappings:
  marathonbet.by terms → spec field names (`Bet_Match_Win_1`, etc.).
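
The rate-behavior task above could be run with a small throwaway script. The following is a minimal sketch, not a prescribed tool: it is written in Python only because spike tooling is disposable, and the names `probe_sequentially`, `http_get_status`, the `spike-probe/0.1` user agent, and the choice of 403/429 as "blocked" statuses are all assumptions made for illustration. It sends requests one at a time, pauses between them, records latency, and stops as soon as the site pushes back.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Callable, Iterable, List

# Assumption: these statuses are treated as "blocked or rate limited".
BLOCK_STATUSES = {403, 429}


@dataclass
class ProbeResult:
    url: str
    status: int
    latency_s: float


def http_get_status(url: str, timeout_s: float = 15.0) -> int:
    """Fetch a URL with urllib and return only the HTTP status code."""
    # Hypothetical user agent; record whichever one is actually used in the findings.
    request = urllib.request.Request(url, headers={"User-Agent": "spike-probe/0.1"})
    try:
        with urllib.request.urlopen(request, timeout=timeout_s) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # 4xx/5xx responses still carry a status worth recording


def probe_sequentially(
    urls: Iterable[str],
    fetch_status: Callable[[str], int] = http_get_status,
    delay_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> List[ProbeResult]:
    """Request each URL in turn, pausing between requests and stopping on a block."""
    results: List[ProbeResult] = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_s)  # pause before every request except the first
        started = clock()
        status = fetch_status(url)
        results.append(ProbeResult(url, status, clock() - started))
        if status in BLOCK_STATUSES:
            break  # do not keep hitting a site that is already pushing back
    return results
```

Passing `fetch_status`, `sleep`, and `clock` as parameters keeps the loop testable offline; a real run would call `probe_sequentially([...])` with roughly ten URLs and copy the recorded statuses and latencies into `spike/SCRAPE_FINDINGS.md`.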
## Files to Modify/Create

- `spike/SCRAPE_FINDINGS.md`: research output (committed to repo)
- `spike/SCHEMA_DRAFT.md`: proposed domain mapping (committed to repo)
- `spike/captures/*.html`, `spike/captures/*.json`: local samples (gitignored, NOT committed)
## Acceptance Criteria

- `SCRAPE_FINDINGS.md` exists and answers:
  - Is anonymous scraping feasible? (yes/no/conditional)
  - What scraping technology is required? (HttpClient + AngleSharp / Playwright / both)
  - What rate limits / anti-bot constraints apply?
  - What URL patterns and endpoints will Phase 3 target?
- `SCHEMA_DRAFT.md` maps real marathonbet.by data to the customer-spec field names.
- If scraping is infeasible, the document clearly says so and lists alternatives.
- **No production C# code is written in this phase.**
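
One rough way to inform the scraping-technology answer is to take a few strings that are visible in a real browser (a team name, an odds value) and check whether they also appear in the raw, JavaScript-free HTML response. The sketch below is a heuristic under stated assumptions, not a definitive test: the names `VisibleTextExtractor` and `missing_markers` are hypothetical, and the markers must be picked by hand from the page being examined. If markers are missing from the raw HTML, the content is probably hydrated client-side, which points towards Playwright or an internal JSON endpoint rather than plain HTML parsing.

```python
from html.parser import HTMLParser
from typing import Iterable, List


class VisibleTextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def missing_markers(raw_html: str, expected_markers: Iterable[str]) -> List[str]:
    """Return the markers that do NOT appear in the visible text of the raw HTML."""
    extractor = VisibleTextExtractor()
    extractor.feed(raw_html)
    extractor.close()
    visible_text = " ".join(extractor.chunks).lower()
    return [m for m in expected_markers if m.lower() not in visible_text]
```

An empty result means every marker was found in the server response; a non-empty result lists what only appears after client-side rendering. Data embedded in a `<script>` tag is deliberately not counted as visible, although it may itself be a usable source and is worth noting in the findings.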
## Notes

- Use the WebFetch tool for initial probing; supplement with curl/Bash if Playwright-style
  behavior needs investigation.
- Be respectful: do not hammer the site. Send requests sequentially, with 2-second delays
  between them.
- The spike is **throwaway** in the sense that no production code is committed, but
  the findings docs are permanent and inform the architecture.
- If marathonbet.by blocks the user agent or geographic region, document this; the
  customer is likely in Belarus and so should not hit the same geographic blocks.
## Review Checklist

- [ ] `SCRAPE_FINDINGS.md` answers all required questions above
- [ ] `SCHEMA_DRAFT.md` covers all bet types in the customer spec
  (Win/Draw/Win_Fora/Total at Match + Period-N scope)
- [ ] No production code committed
- [ ] Recommended Phase 3 strategy is concrete and actionable
- [ ] Risk register updated if anti-bot or rate-limit issues found
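
The schema-coverage item in the checklist can be verified mechanically once the mapping is written down as data. This is a small illustrative sketch, assuming the draft mapping is expressed as a dictionary from bookmaker term to spec field name; the function name `uncovered_fields` is hypothetical. Only `Bet_Match_Win_1` is a field name taken from this plan; the second field and the site label are placeholders to be replaced with the real customer-spec names and marathonbet.by terms.

```python
from typing import Dict, Iterable, List


def uncovered_fields(required_fields: Iterable[str], draft_mapping: Dict[str, str]) -> List[str]:
    """Return required spec fields that no bookmaker term maps to, in input order."""
    mapped = set(draft_mapping.values())
    return [field for field in required_fields if field not in mapped]


# Placeholder data: only Bet_Match_Win_1 comes from the plan. The other field
# name and the site label are stand-ins, not real spec or site terms.
required = ["Bet_Match_Win_1", "PLACEHOLDER_SPEC_FIELD"]
draft = {"placeholder site label for home win": "Bet_Match_Win_1"}

print(uncovered_fields(required, draft))
```

Any name the function returns is a spec field that `SCHEMA_DRAFT.md` does not yet account for, and either needs a mapping or an explicit note in the handoff section explaining why the bookmaker data cannot supply it.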
## Handoff to Next Phase

<!-- Filled by Phase 0 implementer. Critical: list anything Phase 1+ implementers must know,
     especially deviations from the customer spec field names due to real bookmaker data. -->