feat(initial-implementation): phase 0 - scraping spike findings

Anonymous scraping confirmed feasible for marathonbet.by — site is fully SSR
(nginx), no Cloudflare or JS challenge. HttpClient + AngleSharp + Polly v8 is
sufficient; Playwright not required (kept as a future-flag).

Spike outputs:
- spike/SCRAPE_FINDINGS.md  — page rendering, URL templates, anti-bot, rate
  limits, recommended scraping strategy for Phase 3.
- spike/SCHEMA_DRAFT.md     — customer-spec field → DOM selector mapping for
  Match + Period-N scope across football/basketball/tennis (hockey TBD).

Phase 1+ handoff captured in subplan + CLAUDE.md. Critical Phase 8 finding:
no public results endpoint at /su/results — phase 8 must switch to polling
event-detail until eventJsonInfo.matchIsComplete=true (deviation flagged).

Reviewer notes addressed:
- Period market outcome codes corrected to RN_H/RN_D/RN_A (not 1/draw/3) and
  market name vocabulary clarified per-sport in SCHEMA_DRAFT §3.1.
- results-page.html capture added to file list with caveat about live-landing
  score-state and unsampled hockey selectors.
This commit is contained in:
2026-05-05 01:04:03 +03:00
parent 8802ddb25b
commit 070e34b911
6 changed files with 864 additions and 25 deletions
+52 -3
View File
@@ -55,7 +55,21 @@ with scraping research, no implementation.
## Failed Approaches
(none yet — phases not started)
- **Public results / archive endpoint** — does NOT exist. Tested
`https://www.marathonbet.by/su/results`, `/su/results/`, `/su/results.htm`
all return HTTP 404. No `/archive`, `/history` links anywhere in the public
HTML either. **Phase 8 deviation:** the Results loader cannot back-fill from
an archive — it must poll each event detail page until
`eventJsonInfo.matchIsComplete=true` and snapshot `resultDescription` at that
moment. Phase 8 implementer must revise the subplan accordingly.
- **JSONP `/su/liveupdate/popular/` endpoint** — exposes only refresh signals
(`{"modified":[{"type":"refreshPage"}],"updated":<ts>}`), not actual odds. Cannot
be used as a JSON odds source. Use it only as a "something changed" hint to
trigger a full event-detail re-scrape.
- **Anonymous WebSocket (STOMP)** at `/su/websocket/endpoint` is documented in
`initData.stomp` but appears to require an authenticated session
(`PUNTER-SESSION-HASH` cookie); we did not test it but the customer's anonymous
scraping constraint makes it unsuitable anyway.
## Review Findings Log
@@ -65,7 +79,7 @@ with scraping research, no implementation.
| Phase | Agent | Model | Test Writer | Parallel | Notes |
|---|---|---|---|---|---|
| Phase 0 | phase-implementer | Sonnet 4.6 | ⏭️ Skipped (research only) | — | Throwaway probe; outputs SCRAPE_FINDINGS.md only |
| Phase 0 | phase-implementer | Opus | ⏭️ Skipped (research only) | — | ✅ Done 2026-05-05. Outputs: spike/SCRAPE_FINDINGS.md + spike/SCHEMA_DRAFT.md + 7 local fixtures. Anonymous scraping confirmed feasible; HttpClient+AngleSharp recommended; no Playwright needed; no public results page found (Phase 8 deviation noted). |
| Phase 1 | phase-implementer | Sonnet 4.6 | ⏭️ Skipped (Big Bang) | — | — |
| Phase 2 | phase-implementer | Sonnet 4.6 | ⏭️ Skipped (Big Bang) | ✅ With 3 + 5 | — |
| Phase 3 | phase-implementer | Sonnet 4.6 | ⏭️ Skipped (Big Bang) | ✅ With 2 + 5 | — |
@@ -87,4 +101,39 @@ with scraping research, no implementation.
## Implementation Notes
(populated as we work)
### Phase 0 (Scraping spike, 2026-05-05)
- **Anonymous scraping is feasible** from a non-Belarus IP. No Cloudflare, no JS
challenge, no UA filtering observed. `Server: nginx`. Standard cookies only.
- **Site is fully SSR.** All needed data (event grid, full odds, breadcrumbs,
period markets) is in the raw HTML. No SPA hydration required.
- **Recommended scraper stack: HttpClient + AngleSharp + Polly v8.** Playwright is
not required for read-only scraping — keep it as an optional fallback flag
(`Scraping:UsePlaywright`) for future-proofing only.
- **Polling cadence:** site itself polls live updates every 3 s; for our analyzer,
pre-match 30 s and live 510 s is sufficient.
- **Rate-limit:** 5 sequential requests at 1 req/s pacing all returned 200 in <1 s,
no throttling. Recommend default `RequestsPerSecond=1`, `MaxConcurrent=4`.
- **Sport ID semantics:** customer's "Sport_Code = 6" (Basketball) maps to
`data-sport-treeId="6"` in the breadcrumb-canonical sport listing
(`/su/betting/Basketball+-+6`). Some sports also have a separate "category tree
ID" used inside the live grouping (e.g., 45356 for Basketball-live) — ignore
those, use only the canonical breadcrumb ID.
- **Selection key format:** `<eventId>@<MarketName>{LineIndex?}.<Outcome>`. The
market name is sport-specific (`Match_Result`, `1st_Half_Result`, `Total_Goals`,
`Total_Points`, `Total_Games`, `To_Win_Match_With_Handicap`, etc.). Total
thresholds are encoded in the outcome (`Under_3.5`, `Over_213.5`). Handicap
values are NOT in the key — they're in `<span class="middle-simple">` text.
- **Tennis has no Draw outcome** — domain `Bet_Match_Draw` must be nullable.
- **Date display ambiguity:** listing shows `HH:MM` (today) or `DD <ru-month> HH:MM`
(future). Anchor the parser on `initData.serverTime` (Moscow TZ, format
`YYYY,MM,DD,HH,MM,SS`).
- **No public results page** (`/su/results` → 404). Final scores are exposed only
on the event detail page itself via `eventJsonInfo` JSON
(`matchIsComplete`, `resultDescription`). Phase 8 must poll until completion;
cannot back-fill from an archive endpoint.
- **Probe environment:** Windows 10 + curl, geo-routed as Poland (`countryCode: PL`).
Customer in Belarus may see slightly different KYC overlays — parser must be
defensive (treat missing markets as null, never throw).
- **Captures saved locally** at `spike/captures/*.html` (gitignored): 7 fixtures
for offline parser development in Phase 3.
+2 -2
View File
@@ -34,7 +34,7 @@ parameter configurable.
## Phases
- [ ] Phase 0: Scraping spike (research, throwaway) [domain: backend] → [subplan](./phase-0-scraping-spike.md)
- [x] Phase 0: Scraping spike (research, throwaway) [domain: backend] → [subplan](./phase-0-scraping-spike.md)
- [ ] Phase 1: Solution skeleton + Domain model [domain: backend] → [subplan](./phase-1-solution-and-domain.md)
- [ ] Phase 2: Infrastructure — Storage [domain: backend] → [subplan](./phase-2-storage.md)
- [ ] Phase 3: Infrastructure — Scraping [domain: backend] → [subplan](./phase-3-scraping.md)
@@ -62,7 +62,7 @@ parameter configurable.
| Phase | Domain | Status | Review | Build | Committed |
|---|---|---|---|---|---|
| Phase 0: Scraping spike | backend | ⬜ Not Started | ⬜ | ⬜ | ⬜ |
| Phase 0: Scraping spike | backend | ✅ Done | ⬜ Pending review | ⏭️ N/A (research) | ⬜ |
| Phase 1: Solution + Domain | backend | ⬜ Not Started | ⬜ | ⬜ | ⬜ |
| Phase 2: Storage | backend | ⬜ Not Started | ⬜ | ⏭️ Big Bang | ⬜ |
| Phase 3: Scraping | backend | ⬜ Not Started | ⬜ | ⏭️ Big Bang | ⬜ |
@@ -1,6 +1,6 @@
# Phase 0: Scraping Spike (Research, Throwaway)
**Status:** ⬜ Not Started
**Status:** ✅ Done
**Parent plan:** [PLAN.md](./PLAN.md)
**Domain:** backend
**Type:** Research / spike — produces documentation only, NO production code.
@@ -14,32 +14,32 @@ stop and renegotiate scope with the customer before writing architecture code.
## Tasks
- [ ] Probe `https://www.marathonbet.by/su` (pre-match) anonymously. Document:
- [x] Probe `https://www.marathonbet.by/su` (pre-match) anonymously. Document:
- HTTP status, headers, cookies set
- Whether content is server-rendered HTML or hydrated client-side
- URL pattern for sport sections (basketball, hockey, football, etc.)
- Sport group codes (e.g., basketball = 6 per spec)
- [ ] Probe `https://www.marathonbet.by/su/live` (live events). Document:
- [x] Probe `https://www.marathonbet.by/su/live` (live events). Document:
- Same as above
- Whether odds update via XHR/fetch/WebSocket — capture network calls
- [ ] Identify event-detail URL pattern and inspect a sample event's full odds page.
- [ ] For 3 events across 3 sports (e.g., basketball, hockey, tennis), capture:
- [x] Identify event-detail URL pattern and inspect a sample event's full odds page.
- [x] For 3 events across 3 sports (basketball, football, tennis — hockey deferred to Phase 3 verify), capture:
- Event metadata (sport, country, league, category, scheduled time, event ID)
- Match-level bets: Win-1 / Draw / Win-2, Win-Fora-1/2 (with handicap value),
Total Less/More (with threshold)
- Period-N bets where the sport has periods
- [ ] Identify any anti-bot measures: Cloudflare challenges, JS challenges, rate
- [x] Identify any anti-bot measures: Cloudflare challenges, JS challenges, rate
limiting, header requirements, fingerprinting hints.
- [ ] Test rate behavior: ~10 sequential requests, observe latency / blocks. Do NOT
- [x] Test rate behavior: ~10 sequential requests, observe latency / blocks. Do NOT
hammer — be respectful.
- [ ] Document API endpoints if marathonbet.by exposes any internal JSON APIs visible
- [x] Document API endpoints if marathonbet.by exposes any internal JSON APIs visible
in browser network tab (often these are easier to scrape than HTML).
- [ ] Decide: HtmlClient + AngleSharp sufficient, or Playwright required (or both)?
- [ ] Save 23 representative HTML/JSON samples under `spike/captures/` (gitignored;
for local reference only).
- [ ] Write `spike/SCRAPE_FINDINGS.md` with findings, decisions, and recommended
- [x] Decide: HtmlClient + AngleSharp sufficient, or Playwright required (or both)?
- [x] Save 23 representative HTML/JSON samples under `spike/captures/` (gitignored;
for local reference only). Saved 7 fixtures.
- [x] Write `spike/SCRAPE_FINDINGS.md` with findings, decisions, and recommended
scraping strategy for Phase 3.
- [ ] Write `spike/SCHEMA_DRAFT.md` with concrete proposed domain field mappings —
- [x] Write `spike/SCHEMA_DRAFT.md` with concrete proposed domain field mappings —
marathonbet.by terms → spec field names (`Bet_Match_Win_1`, etc.).
## Files to Modify/Create
@@ -71,14 +71,100 @@ stop and renegotiate scope with the customer before writing architecture code.
## Review Checklist
- [ ] `SCRAPE_FINDINGS.md` answers all required questions above
- [ ] `SCHEMA_DRAFT.md` covers all bet types in the customer spec
- [x] `SCRAPE_FINDINGS.md` answers all required questions above
- [x] `SCHEMA_DRAFT.md` covers all bet types in the customer spec
(Win/Draw/Win_Fora/Total at Match + Period-N scope)
- [ ] No production code committed
- [ ] Recommended Phase 3 strategy is concrete and actionable
- [ ] Risk register updated if anti-bot or rate-limit issues found
- [x] No production code committed
- [x] Recommended Phase 3 strategy is concrete and actionable
- [x] Risk register updated if anti-bot or rate-limit issues found
## Handoff to Next Phase
<!-- Filled by Phase 0 implementer. Critical: list anything Phase 1+ implementers must know,
especially deviations from the customer spec field names due to real bookmaker data. -->
**Anonymous scraping is feasible and recommended technology is HttpClient + AngleSharp.**
No Cloudflare, no JS challenge. Site is fully SSR — all data we need is in the raw HTML.
### What Phase 1 (Domain) needs to know
1. **`SportCode`** is the `data-sport-treeId` attribute / first integer after the
sport name in `/su/betting/<Sport>+-+<id>`. Customer's "basketball=6" matches
exactly. Confirmed IDs: Basketball=6, Football=11, Tennis=22723, Hockey=43658.
Note: there are duplicate "category" tree IDs (e.g., 45356 for live basketball);
use only the breadcrumb canonical ID as `SportCode`.
2. **`EventCode`** is `data-event-eventId` (numeric, ~26-million range). This is the
bookmaker's stable event ID — use as primary key for the event in our SQLite.
`TreeId` is a separate URL-routing ID — keep it for URL building but do not use
as the entity primary key.
3. **No "Draw" outcome for tennis (and for some basketball variants).** The Domain
model should make the Draw rate nullable. Customer's spec field `Bet_Match_Draw`
should serialize to empty cell when null.
4. **Period-N counts vary by sport** (Football: 2; Basketball: 2 halves OR 4 quarters;
Tennis: variable by match length up to 5 sets; Hockey: 3). The Domain should not
hardcode a max period count — store `PeriodNumber` as `int` and let
`PeriodScopeMapper` (Phase 3) decide which periods are valid for which sport.
5. **Bet handicap and total values come from the DOM `<span class="middle-simple">`**
text, not from the `data-selection-key` (with one exception: Total markets encode
the threshold in the outcome name, e.g., `Under_213.5`). Domain `Bet.Value` is
`decimal?` — populated for handicap and total, null for Win/Draw.
6. **`ScheduledAt`** has TWO possible string formats in the listing: `HH:MM` (today)
or `DD <ru-month> HH:MM` (future). Domain should store as `DateTimeOffset` in
Moscow time (`Europe/Moscow`, UTC+3). The "today" anchor comes from the
`initData.serverTime` blob (`YYYY,MM,DD,HH,MM,SS` format). Phase 3 must extract
server time on every page load and pass it to the date parser.
### What Phase 3 (Scraping) needs to know
Read `spike/SCRAPE_FINDINGS.md` end-to-end before designing the scraper.
Highlights:
- **Selector inventory:** in `SCHEMA_DRAFT.md` §1–§3 and in `SCRAPE_FINDINGS.md` §5.
- **URL templates** in `SCRAPE_FINDINGS.md` §3.
- **Rate-limit defaults:** 1 req/s, max 4 concurrent, exponential backoff on 429/5xx.
Use `Microsoft.Extensions.Http.Resilience` (Polly v8).
- **User-Agent rotation:** the only mitigation we observed needing — site does not
challenge the UA but rotating prevents future fingerprint-based throttling.
- **No Playwright required**, but plumb a `Scraping:UsePlaywright` flag for future flip.
### What Phase 8 (Results loader) needs to know — IMPORTANT DEVIATION
**There is no public results / archive page.** `https://www.marathonbet.by/su/results`
returns 404. The only way to capture finished-event scores is to keep polling the
event detail page until `eventJsonInfo.matchIsComplete === true`, then snapshot
`resultDescription` (e.g., `"2:1 (1:1)"`).
This means Phase 8 must:
1. Maintain a "watch list" of events whose `ScheduledAt + EstimatedDuration` is in
the past but whose status in our DB is not yet `Completed`.
2. Poll those event detail URLs at a low frequency (every 5 min) until either:
(a) `matchIsComplete=true` → store final score, mark complete; OR
(b) detail URL returns 404 → site has expunged the event → mark `ResultUnknown`.
3. Optionally fall back to a third-party score aggregator (flashscore /
sofascore) — separate Phase 8 design decision.
This is a **deviation from the original Phase 8 plan**, which assumed a results
endpoint to back-fill from. Phase 8 implementer should re-read this and revise
the subplan accordingly before implementation.
### What Phase 5/6 (UI) needs to know
- **Bet handicap and total "main line" picking** is heuristic (see
`SCHEMA_DRAFT.md` §2.2 and §2.3) and should be exposed as a configurable
policy. The Settings page in Phase 5 should allow the user to choose
`MainLinePolicy = ListingDisplay | Closest50_50 | NoSuffixSelection`.
- **Russian-only labels** in the source HTML. Localization layer (Phase 5)
must translate sport names, period names, and outcome labels to EN; the raw
Russian strings are the canonical source.
### Critical mappings (deviations from spec wording)
| Customer-spec word | marathonbet.by reality |
| --- | --- |
| `Win_Fora` | `Handicap` market in DOM (`To_Win_Match_With_Handicap`). Same concept, different word. |
| `Total_Less` / `Total_More` | DOM uses `Under` / `Over`. |
| `Period-1` (basketball) | Could be 1st Half or 1st Quarter — needs customer decision (default: 1st Half). |
| `Sport_Code = 6` | `data-sport-treeId="6"` confirmed for Basketball. |