alexei.dolgolyov 070e34b911 feat(initial-implementation): phase 0 - scraping spike findings
Anonymous scraping confirmed feasible for marathonbet.by — site is fully SSR
(nginx), no Cloudflare or JS challenge. HttpClient + AngleSharp + Polly v8 is
sufficient; Playwright not required (kept as a future-flag).

Spike outputs:
- spike/SCRAPE_FINDINGS.md  — page rendering, URL templates, anti-bot, rate
  limits, recommended scraping strategy for Phase 3.
- spike/SCHEMA_DRAFT.md     — customer-spec field → DOM selector mapping for
  Match + Period-N scope across football/basketball/tennis (hockey TBD).

Phase 1+ handoff captured in subplan + CLAUDE.md. Critical Phase 8 finding:
no public results endpoint at /su/results — phase 8 must switch to polling
event-detail until eventJsonInfo.matchIsComplete=true (deviation flagged).

Reviewer notes addressed:
- Period market outcome codes corrected to RN_H/RN_D/RN_A (not 1/draw/3) and
  market name vocabulary clarified per-sport in SCHEMA_DRAFT §3.1.
- results-page.html capture added to file list with caveat about live-landing
  score-state and unsampled hockey selectors.
2026-05-05 01:04:03 +03:00

# Phase 0: Scraping Spike (Research, Throwaway)
**Status:** ✅ Done
**Parent plan:** [PLAN.md](./PLAN.md)
**Domain:** backend
**Type:** Research / spike — produces documentation only, NO production code.
## Objective
Determine whether marathonbet.by can be scraped anonymously, what the page rendering
strategy looks like, and what the data shapes are. The output is a documented foundation
that Phases 1–9 build on. **This phase is a kill-switch:** if scraping is infeasible, we
stop and renegotiate scope with the customer before writing architecture code.
## Tasks
- [x] Probe `https://www.marathonbet.by/su` (pre-match) anonymously. Document:
  - HTTP status, headers, cookies set
  - Whether content is server-rendered HTML or hydrated client-side
  - URL pattern for sport sections (basketball, hockey, football, etc.)
  - Sport group codes (e.g., basketball = 6 per spec)
- [x] Probe `https://www.marathonbet.by/su/live` (live events). Document:
  - Same as above
  - Whether odds update via XHR/fetch/WebSocket (capture the network calls)
- [x] Identify event-detail URL pattern and inspect a sample event's full odds page.
- [x] For 3 events across 3 sports (basketball, football, tennis; hockey deferred to Phase 3 verification), capture:
  - Event metadata (sport, country, league, category, scheduled time, event ID)
  - Match-level bets: Win-1 / Draw / Win-2, Win-Fora-1/2 (with handicap value),
    Total Less/More (with threshold)
  - Period-N bets where the sport has periods
- [x] Identify any anti-bot measures: Cloudflare challenges, JS challenges, rate
limiting, header requirements, fingerprinting hints.
- [x] Test rate behavior: ~10 sequential requests, observe latency / blocks. Do NOT
hammer — be respectful.
- [x] Document API endpoints if marathonbet.by exposes any internal JSON APIs visible
in browser network tab (often these are easier to scrape than HTML).
- [x] Decide: HttpClient + AngleSharp sufficient, or Playwright required (or both)?
- [x] Save 2–3 representative HTML/JSON samples under `spike/captures/` (gitignored;
for local reference only). Saved 7 fixtures.
- [x] Write `spike/SCRAPE_FINDINGS.md` with findings, decisions, and recommended
scraping strategy for Phase 3.
- [x] Write `spike/SCHEMA_DRAFT.md` with concrete proposed domain field mappings —
marathonbet.by terms → spec field names (`Bet_Match_Win_1`, etc.).
## Files to Modify/Create
- `spike/SCRAPE_FINDINGS.md` — research output (committed to repo)
- `spike/SCHEMA_DRAFT.md` — proposed domain mapping (committed to repo)
- `spike/captures/*.html` / `.json` — local samples (gitignored, NOT committed)
## Acceptance Criteria
- `SCRAPE_FINDINGS.md` exists and answers:
  - Is anonymous scraping feasible? (yes/no/conditional)
  - What scraping technology is required? (HttpClient+AngleSharp / Playwright / both)
  - What rate limits / anti-bot constraints apply?
  - What URL patterns and endpoints will Phase 3 target?
- `SCHEMA_DRAFT.md` maps real marathonbet.by data to the customer-spec field names.
- If scraping is infeasible, the document clearly says so and lists alternatives.
- **No production C# code is written in this phase.**
## Notes
- Use the WebFetch tool for initial probing; supplement with curl/Bash if Playwright-style
behavior needs investigation.
- Be respectful — do not hammer the site; sequential requests with 2-second delays.
- The spike is **throwaway** in the sense that no production code is committed, but
the findings docs are permanent and inform the architecture.
- If marathonbet.by blocks the user agent or geographic region, document this: the
customer is likely in Belarus and may not see the same blocks.
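
The "sequential requests with 2-second delays" rule can be sketched as a small probe loop. This is a minimal illustration in Python rather than spike output: `fetch_status`, `probe_sequentially`, the `spike-probe/0.1` User-Agent, and the choice to stop on the first 403/429 are all assumptions made for the example.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, Iterable, List, Tuple

def fetch_status(url: str, timeout: float = 10.0) -> int:
    """Hypothetical helper: GET the URL and return only the HTTP status code."""
    req = urllib.request.Request(url, headers={"User-Agent": "spike-probe/0.1"})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # 4xx/5xx responses still carry a status worth recording

def probe_sequentially(
    urls: Iterable[str],
    fetch: Callable[[str], int] = fetch_status,
    delay_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> List[Tuple[str, int, float]]:
    """Fetch URLs one at a time, pausing delay_s between requests.

    Returns (url, status, latency_seconds) per request. Stops early on
    403 or 429 (an assumed signal of blocking) so the probe never hammers.
    """
    results: List[Tuple[str, int, float]] = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_s)  # pause between requests, never before the first
        started = clock()
        status = fetch(url)
        results.append((url, status, clock() - started))
        if status in (403, 429):
            break
    return results
```

Passing `fetch` and `sleep` as parameters keeps the loop testable without touching the network.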
## Review Checklist
- [x] `SCRAPE_FINDINGS.md` answers all required questions above
- [x] `SCHEMA_DRAFT.md` covers all bet types in the customer spec
(Win/Draw/Win_Fora/Total at Match + Period-N scope)
- [x] No production code committed
- [x] Recommended Phase 3 strategy is concrete and actionable
- [x] Risk register updated if anti-bot or rate-limit issues found
## Handoff to Next Phase
**Anonymous scraping is feasible, and the recommended technology is HttpClient + AngleSharp.**
No Cloudflare, no JS challenge. Site is fully SSR — all data we need is in the raw HTML.
### What Phase 1 (Domain) needs to know
1. **`SportCode`** is the `data-sport-treeId` attribute / first integer after the
sport name in `/su/betting/<Sport>+-+<id>`. Customer's "basketball=6" matches
exactly. Confirmed IDs: Basketball=6, Football=11, Tennis=22723, Hockey=43658.
Note: there are duplicate "category" tree IDs (e.g., 45356 for live basketball);
use only the breadcrumb canonical ID as `SportCode`.
2. **`EventCode`** is `data-event-eventId` (numeric, ~26-million range). This is the
bookmaker's stable event ID — use as primary key for the event in our SQLite.
`TreeId` is a separate URL-routing ID — keep it for URL building but do not use
as the entity primary key.
3. **No "Draw" outcome for tennis (and for some basketball variants).** The Domain
model should make the Draw rate nullable. Customer's spec field `Bet_Match_Draw`
should serialize to an empty cell when null.
4. **Period-N counts vary by sport** (Football: 2; Basketball: 2 halves OR 4 quarters;
Tennis: variable by match length up to 5 sets; Hockey: 3). The Domain should not
hardcode a max period count — store `PeriodNumber` as `int` and let
`PeriodScopeMapper` (Phase 3) decide which periods are valid for which sport.
5. **Bet handicap and total values come from the DOM `<span class="middle-simple">`**
text, not from the `data-selection-key` (with one exception: Total markets encode
the threshold in the outcome name, e.g., `Under_213.5`). Domain `Bet.Value` is
`decimal?` — populated for handicap and total, null for Win/Draw.
6. **`ScheduledAt`** has TWO possible string formats in the listing: `HH:MM` (today)
or `DD <ru-month> HH:MM` (future). Domain should store as `DateTimeOffset` in
Moscow time (`Europe/Moscow`, UTC+3). The "today" anchor comes from the
`initData.serverTime` blob (`YYYY,MM,DD,HH,MM,SS` format). Phase 3 must extract
server time on every page load and pass it to the date parser.
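
The `ScheduledAt` rule in item 6 can be sketched as follows. This is an illustrative Python sketch, not the Phase 3 parser. Several details are assumptions: that the month in `initData.serverTime` is 1-based, that Russian month names can be matched on their first three letters, and that a date earlier than the server date belongs to the next year. The names `parse_server_time`, `parse_scheduled_at`, and `MSK` are hypothetical.

```python
import re
from datetime import datetime, timedelta, timezone

# Fixed UTC+3 offset standing in for Europe/Moscow (no DST in current rules).
MSK = timezone(timedelta(hours=3), "MSK")

# Assumption: listing month names match on their first three letters.
RU_MONTH_PREFIXES = {
    "янв": 1, "фев": 2, "мар": 3, "апр": 4, "мая": 5, "май": 5, "июн": 6,
    "июл": 7, "авг": 8, "сен": 9, "окт": 10, "ноя": 11, "дек": 12,
}

_TODAY = re.compile(r"^(\d{1,2}):(\d{2})$")
_FUTURE = re.compile(r"^(\d{1,2})\s+(\S+)\s+(\d{1,2}):(\d{2})$")

def parse_server_time(blob: str) -> datetime:
    """Parse 'YYYY,MM,DD,HH,MM,SS' (month assumed 1-based) into Moscow time."""
    parts = [int(p) for p in blob.split(",")]
    if len(parts) != 6:
        raise ValueError(f"unexpected serverTime blob: {blob!r}")
    return datetime(*parts, tzinfo=MSK)

def parse_scheduled_at(text: str, server_now: datetime) -> datetime:
    """Parse 'HH:MM' (today) or 'DD <ru-month> HH:MM' (future) listing strings."""
    text = text.strip()
    match = _TODAY.match(text)
    if match:
        return server_now.replace(
            hour=int(match[1]), minute=int(match[2]), second=0, microsecond=0
        )
    match = _FUTURE.match(text)
    if not match:
        raise ValueError(f"unrecognised ScheduledAt format: {text!r}")
    month = RU_MONTH_PREFIXES.get(match[2].lower()[:3])
    if month is None:
        raise ValueError(f"unknown month name: {match[2]!r}")
    result = datetime(
        server_now.year, month, int(match[1]), int(match[3]), int(match[4]), tzinfo=MSK
    )
    # Assumption: a date before the server's "today" refers to next year.
    if result.date() < server_now.date():
        result = result.replace(year=result.year + 1)
    return result
```

The server-time anchor is passed in explicitly, which mirrors the requirement that Phase 3 extract it on every page load and hand it to the date parser.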
### What Phase 3 (Scraping) needs to know
Read `spike/SCRAPE_FINDINGS.md` end-to-end before designing the scraper.
Highlights:
- **Selector inventory:** in `SCHEMA_DRAFT.md` §1–§3 and in `SCRAPE_FINDINGS.md` §5.
- **URL templates** in `SCRAPE_FINDINGS.md` §3.
- **Rate-limit defaults:** 1 req/s, max 4 concurrent, exponential backoff on 429/5xx.
Use `Microsoft.Extensions.Http.Resilience` (Polly v8).
- **User-Agent rotation:** the only mitigation we found worth adding. The site does not
challenge the UA, but rotating it may reduce the risk of future fingerprint-based throttling.
- **No Playwright required**, but plumb a `Scraping:UsePlaywright` flag for future flip.
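
The rate-limit defaults above (1 req/s, max 4 concurrent, exponential backoff on 429/5xx) reduce to a little arithmetic, sketched here in Python for illustration only. Production code would rely on `Microsoft.Extensions.Http.Resilience` as stated. The 1-second base, 30-second cap, and the names `is_retryable`, `backoff_delay`, and `RequestPacer` are assumptions for the example.

```python
import random
import threading

def is_retryable(status: int) -> bool:
    """Retry only on 429 and 5xx responses."""
    return status == 429 or 500 <= status <= 599

def backoff_delay(attempt: int, base_s: float = 1.0, cap_s: float = 30.0,
                  jitter: float = 0.0, rng=random.random) -> float:
    """Seconds to wait before retry number `attempt` (0-based).

    Doubles from base_s up to cap_s; optional jitter adds up to
    jitter * delay on top. Base and cap values are assumed, not measured.
    """
    delay = min(cap_s, base_s * (2 ** attempt))
    return delay + delay * jitter * rng()

class RequestPacer:
    """Spaces request starts by min_interval_s and caps concurrency."""

    def __init__(self, min_interval_s: float = 1.0, max_concurrent: int = 4):
        self._min_interval = min_interval_s
        self._lock = threading.Lock()
        self._next_start = 0.0
        # Callers hold this semaphore for the duration of each request.
        self.slots = threading.BoundedSemaphore(max_concurrent)

    def reserve(self, now: float) -> float:
        """Claim the next start slot; return how long the caller should wait."""
        with self._lock:
            start = max(now, self._next_start)
            self._next_start = start + self._min_interval
            return start - now
```

A caller would sleep for `pacer.reserve(time.monotonic())` seconds, then perform the request inside `with pacer.slots:` so that at most four requests are in flight.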
### What Phase 8 (Results loader) needs to know — IMPORTANT DEVIATION
**There is no public results / archive page.** `https://www.marathonbet.by/su/results`
returns 404. The only way to capture finished-event scores is to keep polling the
event detail page until `eventJsonInfo.matchIsComplete === true`, then snapshot
`resultDescription` (e.g., `"2:1 (1:1)"`).
This means Phase 8 must:
1. Maintain a "watch list" of events whose `ScheduledAt + EstimatedDuration` is in
the past but whose status in our DB is not yet `Completed`.
2. Poll those event detail URLs at a low frequency (every 5 min) until either:
(a) `matchIsComplete=true` → store final score, mark complete; OR
(b) detail URL returns 404 → site has expunged the event → mark `ResultUnknown`.
3. Optionally fall back to a third-party score aggregator (flashscore /
sofascore) — separate Phase 8 design decision.
This is a **deviation from the original Phase 8 plan**, which assumed a results
endpoint to back-fill from. Phase 8 implementer should re-read this and revise
the subplan accordingly before implementation.
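
The watch-list rules above can be sketched as a small state transition. This is an illustrative Python sketch under stated assumptions: `WatchedEvent`, `is_due`, `apply_poll_result`, and the `SCHEDULED` status are hypothetical names; only `matchIsComplete`, `resultDescription`, `Completed`, `ResultUnknown`, and the 5-minute interval come from the findings.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional

POLL_INTERVAL = timedelta(minutes=5)  # low-frequency polling, per the plan

class EventStatus(Enum):
    SCHEDULED = "Scheduled"          # hypothetical name for "not yet completed"
    COMPLETED = "Completed"
    RESULT_UNKNOWN = "ResultUnknown"

@dataclass
class WatchedEvent:
    event_code: int
    scheduled_at: datetime
    estimated_duration: timedelta
    status: EventStatus = EventStatus.SCHEDULED
    final_score: Optional[str] = None

def is_due(event: WatchedEvent, now: datetime) -> bool:
    """True when the event should have finished but is not yet resolved."""
    return (event.status is EventStatus.SCHEDULED
            and event.scheduled_at + event.estimated_duration <= now)

def apply_poll_result(event: WatchedEvent, http_status: int,
                      event_json: Optional[dict]) -> WatchedEvent:
    """Update the event from one poll of its detail page.

    event_json stands for the parsed eventJsonInfo blob, or None if absent.
    Any other outcome leaves the event on the watch list for the next poll.
    """
    if http_status == 404:
        event.status = EventStatus.RESULT_UNKNOWN  # site has expunged the event
    elif http_status == 200 and event_json and event_json.get("matchIsComplete") is True:
        event.status = EventStatus.COMPLETED
        event.final_score = event_json.get("resultDescription")
    return event
```

Keeping the decision separate from the HTTP call means the same function can be reused if Phase 8 later adds a third-party fallback.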
### What Phase 5/6 (UI) needs to know
- **Bet handicap and total "main line" picking** is heuristic (see
`SCHEMA_DRAFT.md` §2.2 and §2.3) and should be exposed as a configurable
policy. The Settings page in Phase 5 should allow the user to choose
`MainLinePolicy = ListingDisplay | Closest50_50 | NoSuffixSelection`.
- **Russian-only labels** in the source HTML. The localization layer (Phase 5)
must translate sport names, period names, and outcome labels to EN; the raw
Russian strings are the canonical source.
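
To make the configurable policy concrete, here is one plausible reading of `Closest50_50`, sketched in Python: pick the line whose two prices are nearest to each other. The actual heuristics are defined in `SCHEMA_DRAFT.md` §2.2 and §2.3, which this sketch does not reproduce. The `Line` type and `pick_closest_50_50` are hypothetical names.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional, Sequence

@dataclass(frozen=True)
class Line:
    """One handicap or total line with its two opposing prices (hypothetical shape)."""
    value: Decimal    # handicap or total threshold
    price_a: Decimal  # e.g. Under, or side 1
    price_b: Decimal  # e.g. Over, or side 2

def pick_closest_50_50(lines: Sequence[Line]) -> Optional[Line]:
    """Return the line with the smallest gap between its two prices.

    Assumed interpretation of Closest50_50; ties resolve to the first line seen.
    """
    if not lines:
        return None
    return min(lines, key=lambda line: abs(line.price_a - line.price_b))
```

For example, among totals priced 1.60/2.30, 1.90/1.92, and 2.40/1.55, this reading selects the 1.90/1.92 line.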
### Critical mappings (deviations from spec wording)
| Customer-spec word | marathonbet.by reality |
| --- | --- |
| `Win_Fora` | `Handicap` market in DOM (`To_Win_Match_With_Handicap`). Same concept, different word. |
| `Total_Less` / `Total_More` | DOM uses `Under` / `Over`. |
| `Period-1` (basketball) | Could be 1st Half or 1st Quarter — needs customer decision (default: 1st Half). |
| `Sport_Code = 6` | `data-sport-treeId="6"` confirmed for Basketball. |
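
The vocabulary mapping in the table can be sketched in Python as below. Only `Under_213.5`, `Bet_Match_Win_1`, and `Bet_Match_Draw` are attested in the findings; the symmetric `Over_<threshold>` form and the `Bet_<Scope>_<Bet>` naming pattern used for other fields are assumptions, and `parse_total_outcome` and `spec_field` are hypothetical helpers.

```python
import re
from decimal import Decimal
from typing import Tuple

# DOM wording -> customer-spec wording, per the table above.
TOTAL_SIDE = {"Under": "Total_Less", "Over": "Total_More"}

# Assumption: Over outcomes mirror the attested "Under_213.5" shape.
_TOTAL_OUTCOME = re.compile(r"^(Under|Over)_(\d+(?:\.\d+)?)$")

def parse_total_outcome(outcome_name: str) -> Tuple[str, Decimal]:
    """Split a Total outcome name into (spec bet name, threshold)."""
    match = _TOTAL_OUTCOME.match(outcome_name)
    if not match:
        raise ValueError(f"not a Total outcome name: {outcome_name!r}")
    return TOTAL_SIDE[match[1]], Decimal(match[2])

def spec_field(scope: str, bet: str) -> str:
    """Build a spec column name such as Bet_Match_Draw (pattern assumed)."""
    return f"Bet_{scope}_{bet}"
```

Using `Decimal` for the threshold matches the `decimal?` type proposed for `Bet.Value` and avoids float rounding on values like 213.5.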