Anonymous scraping confirmed feasible for marathonbet.by — site is fully SSR (nginx), no Cloudflare or JS challenge. HttpClient + AngleSharp + Polly v8 is sufficient; Playwright not required (kept as a future-flag).

Spike outputs:
- spike/SCRAPE_FINDINGS.md — page rendering, URL templates, anti-bot, rate limits, recommended scraping strategy for Phase 3.
- spike/SCHEMA_DRAFT.md — customer-spec field → DOM selector mapping for Match + Period-N scope across football/basketball/tennis (hockey TBD).

Phase 1+ handoff captured in subplan + CLAUDE.md.

Critical Phase 8 finding: no public results endpoint at /su/results — phase 8 must switch to polling event-detail until eventJsonInfo.matchIsComplete=true (deviation flagged).

Reviewer notes addressed:
- Period market outcome codes corrected to RN_H/RN_D/RN_A (not 1/draw/3) and market name vocabulary clarified per-sport in SCHEMA_DRAFT §3.1.
- results-page.html capture added to file list with caveat about live-landing score-state and unsampled hockey selectors.

# Phase 0: Scraping Spike (Research, Throwaway)

**Status:** ✅ Done
**Parent plan:** [PLAN.md](./PLAN.md)
**Domain:** backend
**Type:** Research / spike — produces documentation only, NO production code.

## Objective

Determine whether marathonbet.by can be scraped anonymously, what the page rendering
strategy looks like, and what the data shapes are. The output is a documented foundation
that Phases 1–9 build on. **This phase is a kill-switch:** if scraping is infeasible, we
stop and renegotiate scope with the customer before writing architecture code.

## Tasks

- [x] Probe `https://www.marathonbet.by/su` (pre-match) anonymously. Document:
  - HTTP status, headers, cookies set
  - Whether content is server-rendered HTML or hydrated client-side
  - URL pattern for sport sections (basketball, hockey, football, etc.)
  - Sport group codes (e.g., basketball = 6 per spec)
- [x] Probe `https://www.marathonbet.by/su/live` (live events). Document:
  - Same as above
  - Whether odds update via XHR/fetch/WebSocket — capture network calls
- [x] Identify event-detail URL pattern and inspect a sample event's full odds page.
- [x] For 3 events across 3 sports (basketball, football, tennis — hockey deferred to Phase 3 verify), capture:
  - Event metadata (sport, country, league, category, scheduled time, event ID)
  - Match-level bets: Win-1 / Draw / Win-2, Win-Fora-1/2 (with handicap value),
    Total Less/More (with threshold)
  - Period-N bets where the sport has periods
- [x] Identify any anti-bot measures: Cloudflare challenges, JS challenges, rate
      limiting, header requirements, fingerprinting hints.
- [x] Test rate behavior: ~10 sequential requests, observe latency / blocks. Do NOT
      hammer — be respectful.
- [x] Document API endpoints if marathonbet.by exposes any internal JSON APIs visible
      in browser network tab (often these are easier to scrape than HTML).
- [x] Decide: HttpClient + AngleSharp sufficient, or Playwright required (or both)?
- [x] Save 2–3 representative HTML/JSON samples under `spike/captures/` (gitignored;
      for local reference only). Saved 7 fixtures.
- [x] Write `spike/SCRAPE_FINDINGS.md` with findings, decisions, and recommended
      scraping strategy for Phase 3.
- [x] Write `spike/SCHEMA_DRAFT.md` with concrete proposed domain field mappings —
      marathonbet.by terms → spec field names (`Bet_Match_Win_1`, etc.).

## Files to Modify/Create

- `spike/SCRAPE_FINDINGS.md` — research output (committed to repo)
- `spike/SCHEMA_DRAFT.md` — proposed domain mapping (committed to repo)
- `spike/captures/*.html` / `.json` — local samples (gitignored, NOT committed)

## Acceptance Criteria

- `SCRAPE_FINDINGS.md` exists and answers:
  - Is anonymous scraping feasible? (yes/no/conditional)
  - What scraping technology is required? (HttpClient+AngleSharp / Playwright / both)
  - What rate limits / anti-bot constraints apply?
  - What URL patterns and endpoints will Phase 3 target?
- `SCHEMA_DRAFT.md` maps real marathonbet.by data to the customer-spec field names.
- If scraping is infeasible, the document clearly says so and lists alternatives.
- **No production C# code is written in this phase.**

## Notes

- Use WebFetch tool for initial probing; supplement with curl/Bash if Playwright-style
  behavior needs investigation.
- Be respectful: do not hammer the site; send sequential requests with 2-second delays.
- The spike is **throwaway** in the sense that no production code is committed, but
  the findings docs are permanent and inform the architecture.
- If marathonbet.by blocks the user agent or geographic region, document this — the
  customer is likely in Belarus and may not see the same blocks.

## Review Checklist

- [x] `SCRAPE_FINDINGS.md` answers all required questions above
- [x] `SCHEMA_DRAFT.md` covers all bet types in the customer spec
      (Win/Draw/Win_Fora/Total at Match + Period-N scope)
- [x] No production code committed
- [x] Recommended Phase 3 strategy is concrete and actionable
- [x] Risk register updated if anti-bot or rate-limit issues found

## Handoff to Next Phase

**Anonymous scraping is feasible, and the recommended technology is HttpClient + AngleSharp.**
No Cloudflare, no JS challenge. Site is fully SSR — all data we need is in the raw HTML.

### What Phase 1 (Domain) needs to know

1. **`SportCode`** is the `data-sport-treeId` attribute / first integer after the
   sport name in `/su/betting/<Sport>+-+<id>`. Customer's "basketball=6" matches
   exactly. Confirmed IDs: Basketball=6, Football=11, Tennis=22723, Hockey=43658.
   Note: there are duplicate "category" tree IDs (e.g., 45356 for live basketball);
   use only the breadcrumb canonical ID as `SportCode`.

2. **`EventCode`** is `data-event-eventId` (numeric, ~26-million range). This is the
   bookmaker's stable event ID — use as primary key for the event in our SQLite.
   `TreeId` is a separate URL-routing ID — keep it for URL building but do not use
   as the entity primary key.

3. **No "Draw" outcome for tennis (and for some basketball variants).** The Domain
   model should make the Draw rate nullable. Customer's spec field `Bet_Match_Draw`
   should serialize to an empty cell when null.

4. **Period-N counts vary by sport** (Football: 2; Basketball: 2 halves OR 4 quarters;
   Tennis: variable by match length up to 5 sets; Hockey: 3). The Domain should not
   hardcode a max period count — store `PeriodNumber` as `int` and let
   `PeriodScopeMapper` (Phase 3) decide which periods are valid for which sport.

5. **Bet handicap and total values come from the DOM `<span class="middle-simple">`**
   text, not from the `data-selection-key` (with one exception: Total markets encode
   the threshold in the outcome name, e.g., `Under_213.5`). Domain `Bet.Value` is
   `decimal?` — populated for handicap and total, null for Win/Draw.

6. **`ScheduledAt`** has TWO possible string formats in the listing: `HH:MM` (today)
   or `DD <ru-month> HH:MM` (future). Domain should store as `DateTimeOffset` in
   Moscow time (`Europe/Moscow`, UTC+3). The "today" anchor comes from the
   `initData.serverTime` blob (`YYYY,MM,DD,HH,MM,SS` format). Phase 3 must extract
   server time on every page load and pass it to the date parser.

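
The two `ScheduledAt` formats and the server-time anchor from item 6 can be illustrated with a short, non-production sketch. It is written in Python to keep it visibly separate from the C# that later phases produce. The function names, the month-prefix lookup, the fixed UTC+3 offset, and the year-rollover rule are all assumptions for illustration; none of them is confirmed by the captures.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a fixed UTC+3 offset stands in for Europe/Moscow.
MSK = timezone(timedelta(hours=3))

# Assumption: months are matched on their first three letters so that both
# full genitive names ("февраля") and abbreviations ("фев") resolve.
RU_MONTH_PREFIXES = {
    "янв": 1, "фев": 2, "мар": 3, "апр": 4, "мая": 5, "май": 5, "июн": 6,
    "июл": 7, "авг": 8, "сен": 9, "окт": 10, "ноя": 11, "дек": 12,
}


def parse_server_time(blob: str) -> datetime:
    """Turn a `YYYY,MM,DD,HH,MM,SS` string into a timezone-aware datetime."""
    year, month, day, hour, minute, second = (int(p) for p in blob.split(","))
    return datetime(year, month, day, hour, minute, second, tzinfo=MSK)


def parse_scheduled_at(text: str, server_now: datetime) -> datetime:
    """Parse `HH:MM` (today) or `DD <ru-month> HH:MM` (future)."""
    parts = text.split()
    if len(parts) not in (1, 3):
        raise ValueError(f"unrecognised ScheduledAt format: {text!r}")
    hour, minute = (int(p) for p in parts[-1].split(":"))
    if len(parts) == 1:
        # "Today" is taken from the server clock, never from the local clock.
        return server_now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    day = int(parts[0])
    month = RU_MONTH_PREFIXES[parts[1].lower()[:3]]
    # Assumption: a month earlier than the server's month means next year.
    year = server_now.year + (1 if month < server_now.month else 0)
    return datetime(year, month, day, hour, minute, tzinfo=MSK)
```

Passing `server_now` explicitly keeps the parser pure, so the "today" anchor can be replayed from a saved capture in tests.
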
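
Item 1 describes `SportCode` as the first integer after the sport name in `/su/betting/<Sport>+-+<id>`. A minimal Python sketch of that extraction follows. The regular expression and both function names are hypothetical, and the set of canonical IDs contains only the four values confirmed above.

```python
import re
from typing import Optional

# Sport IDs confirmed by the spike; anything else is treated as non-canonical.
KNOWN_SPORT_CODES = {6: "Basketball", 11: "Football", 22723: "Tennis", 43658: "Hockey"}

# Matches `/su/betting/<Sport>+-+<id>` and captures the integer after the name.
_SPORT_PATH = re.compile(r"/su/betting/[^/]+?\+-\+(\d+)")


def sport_code_from_path(path: str) -> Optional[int]:
    """Return the integer after the sport name, or None if the path does not match."""
    match = _SPORT_PATH.search(path)
    return int(match.group(1)) if match else None


def is_canonical_sport_code(code: int) -> bool:
    """Reject duplicate category tree IDs such as 45356 (live basketball)."""
    return code in KNOWN_SPORT_CODES
```

For example, `sport_code_from_path("/su/betting/Basketball+-+6")` yields `6`, while a path such as `/su/live` yields `None`.
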
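
Items 3 and 5 both come down to optional values: a Total outcome name carries its threshold, while Win/Draw bets carry none and a missing Draw must serialize to an empty cell. The Python sketch below shows one way to express this. The pattern assumes outcome names look exactly like the `Under_213.5` example, `Over` is taken from the vocabulary in the mappings table, and both helper names are hypothetical.

```python
import re
from decimal import Decimal
from typing import Optional, Tuple

# Assumption: Total outcome names follow the `Under_213.5` shape exactly.
_TOTAL_OUTCOME = re.compile(r"^(Under|Over)_(\d+(?:\.\d+)?)$")


def parse_total_outcome(name: str) -> Optional[Tuple[str, Decimal]]:
    """Split a Total outcome name into (side, threshold); None for other outcomes."""
    match = _TOTAL_OUTCOME.match(name)
    if match is None:
        return None
    return match.group(1), Decimal(match.group(2))


def to_cell(value: Optional[Decimal]) -> str:
    """Serialize a nullable value; None becomes an empty cell."""
    return "" if value is None else str(value)
```

`Decimal` is used instead of `float` so that a threshold such as `213.5` round-trips to the export unchanged.
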
### What Phase 3 (Scraping) needs to know

Read `spike/SCRAPE_FINDINGS.md` end-to-end before designing the scraper.
Highlights:

- **Selector inventory:** in `SCHEMA_DRAFT.md` §1–§3 and in `SCRAPE_FINDINGS.md` §5.
- **URL templates** in `SCRAPE_FINDINGS.md` §3.
- **Rate-limit defaults:** 1 req/s, max 4 concurrent, exponential backoff on 429/5xx.
  Use `Microsoft.Extensions.Http.Resilience` (Polly v8).
- **User-Agent rotation:** the only mitigation we found worth adding. The site does not
  challenge the UA, but rotating guards against future fingerprint-based throttling.
- **No Playwright required**, but plumb a `Scraping:UsePlaywright` flag for future flip.

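
The retry half of the rate-limit defaults (exponential backoff on 429/5xx) can be sketched independently of any HTTP library. The Python below is illustrative only: the base delay, growth factor, cap, and retry count are assumed values, and `send` / `sleep` are injected callables so that the logic can be exercised without touching the site. In production this behavior is expected to come from the resilience pipeline rather than hand-written loops.

```python
from typing import Callable, List

# Defaults taken from the findings: 1 request per second, at most 4 in flight.
MIN_INTERVAL_SECONDS = 1.0
MAX_CONCURRENCY = 4


def should_retry(status: int) -> bool:
    """Retry only on 429 (rate limited) and 5xx (server error)."""
    return status == 429 or 500 <= status <= 599


def backoff_delays(base: float = 1.0, factor: float = 2.0,
                   attempts: int = 3, cap: float = 30.0) -> List[float]:
    """Exponential delays: base, base*factor, base*factor^2, ... capped at `cap`."""
    return [min(cap, base * factor ** i) for i in range(attempts)]


def fetch_with_retry(send: Callable[[], int], sleep: Callable[[float], None],
                     max_retries: int = 3) -> int:
    """Call `send` until it returns a non-retryable status or retries run out."""
    status = send()
    for delay in backoff_delays(attempts=max_retries):
        if not should_retry(status):
            break
        sleep(delay)
        status = send()
    return status
```

A call such as `fetch_with_retry(do_request, time.sleep)` would wait 1 s, 2 s, then 4 s between attempts, where `do_request` is a hypothetical function returning the HTTP status code.
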
### What Phase 8 (Results loader) needs to know — IMPORTANT DEVIATION

**There is no public results / archive page.** `https://www.marathonbet.by/su/results`
returns 404. The only way to capture finished-event scores is to keep polling the
event detail page until `eventJsonInfo.matchIsComplete === true`, then snapshot
`resultDescription` (e.g., `"2:1 (1:1)"`).

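
Only one `resultDescription` sample, `"2:1 (1:1)"`, is quoted here, so any parser is necessarily speculative. The Python sketch below assumes the first `a:b` pair is the final score and every later pair is a period score. That reading, and the function name, are assumptions to be checked against real completed events for each sport.

```python
import re
from typing import List, Tuple

Score = Tuple[int, int]

_SCORE_PAIR = re.compile(r"(\d+):(\d+)")


def parse_result_description(text: str) -> Tuple[Score, List[Score]]:
    """Return (final_score, period_scores) from a string like "2:1 (1:1)"."""
    pairs = [(int(a), int(b)) for a, b in _SCORE_PAIR.findall(text)]
    if not pairs:
        raise ValueError(f"no score found in {text!r}")
    # Assumption: the first pair is the final score, the rest are periods.
    return pairs[0], pairs[1:]
```

Storing the raw string alongside the parsed values would let the interpretation be corrected later without re-polling the site.
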
This means Phase 8 must:

1. Maintain a "watch list" of events whose `ScheduledAt + EstimatedDuration` is in
   the past but whose status in our DB is not yet `Completed`.
2. Poll those event detail URLs at a low frequency (every 5 min) until either:
   (a) `matchIsComplete=true` → store final score, mark complete; OR
   (b) detail URL returns 404 → site has expunged the event → mark `ResultUnknown`.
3. Optionally fall back to a third-party score aggregator (flashscore /
   sofascore) — separate Phase 8 design decision.

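
The watch-list rule and the two exit conditions above amount to a small decision table, sketched here in Python. The type and function names are hypothetical, statuses are plain strings for brevity, and any response that is neither a completed match nor a 404 is assumed to mean "try again on the next 5-minute tick".

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum
from typing import List, Optional


class PollAction(Enum):
    STORE_RESULT = "store_result"                # (a) matchIsComplete=true
    MARK_RESULT_UNKNOWN = "mark_result_unknown"  # (b) detail URL returned 404
    KEEP_POLLING = "keep_polling"                # anything else


@dataclass
class WatchedEvent:
    event_code: int
    scheduled_at: datetime
    estimated_duration: timedelta
    status: str  # e.g. "Scheduled", "Completed", "ResultUnknown"


def due_for_polling(events: List[WatchedEvent], now: datetime) -> List[WatchedEvent]:
    """Events that should have finished but have no terminal status yet."""
    return [
        e for e in events
        if e.status not in ("Completed", "ResultUnknown")
        and e.scheduled_at + e.estimated_duration < now
    ]


def decide(http_status: int, match_is_complete: Optional[bool]) -> PollAction:
    """Map one poll response onto the next step for that event."""
    if http_status == 404:
        return PollAction.MARK_RESULT_UNKNOWN
    if http_status == 200 and match_is_complete:
        return PollAction.STORE_RESULT
    return PollAction.KEEP_POLLING
```

Treating `ResultUnknown` as terminal keeps expunged events from being polled forever.
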
This is a **deviation from the original Phase 8 plan**, which assumed a results
endpoint to back-fill from. Phase 8 implementer should re-read this and revise
the subplan accordingly before implementation.

### What Phase 5/6 (UI) needs to know

- **Bet handicap and total "main line" picking** is heuristic (see
  `SCHEMA_DRAFT.md` §2.2 and §2.3) and should be exposed as a configurable
  policy. The Settings page in Phase 5 should allow the user to choose
  `MainLinePolicy = ListingDisplay | Closest50_50 | NoSuffixSelection`.
- **Russian-only labels** in the source HTML. Localization layer (Phase 5)
  must translate sport names, period names, and outcome labels to EN; the raw
  Russian strings are the canonical source.

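
As an illustration of the configurable policy, the Python sketch below models the three `MainLinePolicy` values and implements only one of them. The reading of `Closest50_50` as "the line whose two prices are nearest to each other" is an assumption; the authoritative definitions are in `SCHEMA_DRAFT.md` §2.2 and §2.3, and the other two policies are deliberately left out because they depend on DOM details described there.

```python
from enum import Enum
from typing import List, Tuple

# One candidate line: (handicap or total value, price of side A, price of side B).
Line = Tuple[float, float, float]


class MainLinePolicy(Enum):
    LISTING_DISPLAY = "ListingDisplay"
    CLOSEST_50_50 = "Closest50_50"
    NO_SUFFIX_SELECTION = "NoSuffixSelection"


def parse_policy(raw: str, default: MainLinePolicy) -> MainLinePolicy:
    """Read the setting; fall back to the caller's default on unknown values."""
    try:
        return MainLinePolicy(raw)
    except ValueError:
        return default


def closest_to_even(lines: List[Line]) -> Line:
    """Assumed meaning of Closest50_50: the smallest gap between the two prices."""
    return min(lines, key=lambda line: abs(line[1] - line[2]))
```

Keeping the setting as an enum with string values lets the Settings page persist exactly the names listed above.
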
### Critical mappings (deviations from spec wording)

| Customer-spec word | marathonbet.by reality |
| --- | --- |
| `Win_Fora` | `Handicap` market in DOM (`To_Win_Match_With_Handicap`). Same concept, different word. |
| `Total_Less` / `Total_More` | DOM uses `Under` / `Over`. |
| `Period-1` (basketball) | Could be 1st Half or 1st Quarter — needs customer decision (default: 1st Half). |
| `Sport_Code = 6` | `data-sport-treeId="6"` confirmed for Basketball. |

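
The vocabulary rows of this table translate directly into a lookup, sketched below in Python. Only the word pairs listed in the table are included. The function name is hypothetical, and raising on an unknown term is a design assumption intended to surface new DOM vocabulary early rather than exporting it silently.

```python
# DOM vocabulary -> customer-spec vocabulary, taken from the table above.
DOM_TO_SPEC = {
    "To_Win_Match_With_Handicap": "Win_Fora",
    "Handicap": "Win_Fora",
    "Under": "Total_Less",
    "Over": "Total_More",
}


def to_spec_term(dom_term: str) -> str:
    """Translate a DOM term to the spec word; fail loudly on unmapped terms."""
    try:
        return DOM_TO_SPEC[dom_term]
    except KeyError:
        raise KeyError(f"no spec mapping for DOM term {dom_term!r}") from None
```

The basketball `Period-1` row is intentionally absent from the lookup because it still depends on a customer decision.
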