feat(initial-implementation): phase 0 - scraping spike findings

Anonymous scraping confirmed feasible for marathonbet.by — site is fully SSR
(nginx), no Cloudflare or JS challenge. HttpClient + AngleSharp + Polly v8 is
sufficient; Playwright not required (kept as a future-flag).

Spike outputs:
- spike/SCRAPE_FINDINGS.md  — page rendering, URL templates, anti-bot, rate
  limits, recommended scraping strategy for Phase 3.
- spike/SCHEMA_DRAFT.md     — customer-spec field → DOM selector mapping for
  Match + Period-N scope across football/basketball/tennis (hockey TBD).

Phase 1+ handoff captured in subplan + CLAUDE.md. Critical Phase 8 finding:
no public results endpoint at /su/results — phase 8 must switch to polling
event-detail until eventJsonInfo.matchIsComplete=true (deviation flagged).

Reviewer notes addressed:
- Period market outcome codes corrected to RN_H/RN_D/RN_A (not 1/draw/3) and
  market name vocabulary clarified per-sport in SCHEMA_DRAFT §3.1.
- results-page.html capture added to file list with caveat about live-landing
  score-state and unsampled hockey selectors.
2026-05-05 01:04:03 +03:00
parent 8802ddb25b
commit 070e34b911
6 changed files with 864 additions and 25 deletions
+39
@@ -103,3 +103,42 @@ Marathon_<YYYY-MM-DD>_to_<YYYY-MM-DD>.xlsx
## Recurring Issues & Patterns ## Recurring Issues & Patterns
(Populated as we work — leave empty until something repeats.) (Populated as we work — leave empty until something repeats.)
## Feature: Initial Implementation > Phase 0: Scraping Spike — Learnings
(Permanent learnings about marathonbet.by data shape, anti-bot, page structure.
For full detail see `spike/SCRAPE_FINDINGS.md` and `spike/SCHEMA_DRAFT.md`.)
- **Site is fully SSR (`Server: nginx`).** Anonymous GET with browser User-Agent
returns full HTML for `/su/`, `/su/live`, `/su/popular/<Sport>`,
`/su/betting/<event-path>`. No Cloudflare, no JS challenge.
- **Use HttpClient + AngleSharp + Polly v8** — no Playwright needed for read-only.
Keep `Scraping:UsePlaywright = false` flag for future-proofing.
- **Sport ID = `data-sport-treeId` = breadcrumb canonical ID.** Confirmed:
Basketball=6, Football=11, Tennis=22723, Hockey=43658. URL by ID:
`/su/betting/<Sport>+-+<id>` (preferred over `/su/popular/<Sport>` because the
ID is stable).
- **`EventCode` = `data-event-eventId`** (numeric, ~26-million range, stable).
`TreeId` = `data-event-treeId` (URL-routing ID, less stable). Use `EventCode`
as the entity primary key in SQLite.
- **Selection key format:** `{eventId}@{MarketName}{LineIndex?}.{Outcome}`.
Outcomes: `1`/`draw`/`3` for 3-way, `HB_H`/`HB_A` for handicap, `Under_<X>`/
`Over_<X>` for totals. Total threshold is encoded in the outcome string;
handicap value lives in `<span class="middle-simple">` text.
- **Tennis has no Draw outcome.** Domain `Bet_Match_Draw` must be nullable; Excel
exporter writes empty cell when null.
- **Date parsing:** listing shows `HH:MM` (today) or `DD <ru-month> HH:MM` (future).
Anchor with `initData.serverTime` (Moscow TZ, format `YYYY,MM,DD,HH,MM,SS`)
parsed from the embedded `<script>` blob on every scraped page.
- **Live updates:** site polls `/su/liveupdate/popular/?treeIds=...` every 3 s but
response is just `{"modified":[{"type":"refreshPage"}],...}` — re-scrape the
full event detail HTML for actual odds. Our analyzer cadence: pre-match 30 s,
live 5-10 s.
- **No public results / archive page** (`/su/results` → 404). Final scores must
be harvested by polling the event detail page until
`eventJsonInfo.matchIsComplete=true`, then storing `resultDescription`. Phase 8
cannot back-fill from a public archive.
- **Period scope vocabulary varies by sport:** football=`1st_Half`, basketball=
`1st_Half`/`1st_Quarter`, tennis=`1st_Set`, hockey=`1st_Period`. Domain stores
`PeriodNumber:int` and a sport-aware `PeriodScopeMapper` resolves the correct
market token at parse time.
+52 -3
@@ -55,7 +55,21 @@ with scraping research, no implementation.
## Failed Approaches ## Failed Approaches
(none yet — phases not started) - **Public results / archive endpoint** — does NOT exist. Tested
`https://www.marathonbet.by/su/results`, `/su/results/`, `/su/results.htm`
all return HTTP 404. No `/archive`, `/history` links anywhere in the public
HTML either. **Phase 8 deviation:** the Results loader cannot back-fill from
an archive — it must poll each event detail page until
`eventJsonInfo.matchIsComplete=true` and snapshot `resultDescription` at that
moment. Phase 8 implementer must revise the subplan accordingly.
- **JSONP `/su/liveupdate/popular/` endpoint** — exposes only refresh signals
(`{"modified":[{"type":"refreshPage"}],"updated":<ts>}`), not actual odds. Cannot
be used as a JSON odds source. Use it only as a "something changed" hint to
trigger a full event-detail re-scrape.
- **Anonymous WebSocket (STOMP)** at `/su/websocket/endpoint` is documented in
`initData.stomp` but appears to require an authenticated session
(`PUNTER-SESSION-HASH` cookie); we did not test it but the customer's anonymous
scraping constraint makes it unsuitable anyway.
## Review Findings Log ## Review Findings Log
@@ -65,7 +79,7 @@ with scraping research, no implementation.
| Phase | Agent | Model | Test Writer | Parallel | Notes | | Phase | Agent | Model | Test Writer | Parallel | Notes |
|---|---|---|---|---|---| |---|---|---|---|---|---|
| Phase 0 | phase-implementer | Sonnet 4.6 | ⏭️ Skipped (research only) | — | Throwaway probe; outputs SCRAPE_FINDINGS.md only | | Phase 0 | phase-implementer | Opus | ⏭️ Skipped (research only) | — | ✅ Done 2026-05-05. Outputs: spike/SCRAPE_FINDINGS.md + spike/SCHEMA_DRAFT.md + 7 local fixtures. Anonymous scraping confirmed feasible; HttpClient+AngleSharp recommended; no Playwright needed; no public results page found (Phase 8 deviation noted). |
| Phase 1 | phase-implementer | Sonnet 4.6 | ⏭️ Skipped (Big Bang) | — | — | | Phase 1 | phase-implementer | Sonnet 4.6 | ⏭️ Skipped (Big Bang) | — | — |
| Phase 2 | phase-implementer | Sonnet 4.6 | ⏭️ Skipped (Big Bang) | ✅ With 3 + 5 | — | | Phase 2 | phase-implementer | Sonnet 4.6 | ⏭️ Skipped (Big Bang) | ✅ With 3 + 5 | — |
| Phase 3 | phase-implementer | Sonnet 4.6 | ⏭️ Skipped (Big Bang) | ✅ With 2 + 5 | — | | Phase 3 | phase-implementer | Sonnet 4.6 | ⏭️ Skipped (Big Bang) | ✅ With 2 + 5 | — |
@@ -87,4 +101,39 @@ with scraping research, no implementation.
## Implementation Notes ## Implementation Notes
(populated as we work) ### Phase 0 (Scraping spike, 2026-05-05)
- **Anonymous scraping is feasible** from a non-Belarus IP. No Cloudflare, no JS
challenge, no UA filtering observed. `Server: nginx`. Standard cookies only.
- **Site is fully SSR.** All needed data (event grid, full odds, breadcrumbs,
period markets) is in the raw HTML. No SPA hydration required.
- **Recommended scraper stack: HttpClient + AngleSharp + Polly v8.** Playwright is
not required for read-only scraping — keep it as an optional fallback flag
(`Scraping:UsePlaywright`) for future-proofing only.
- **Polling cadence:** site itself polls live updates every 3 s; for our analyzer,
pre-match 30 s and live 5-10 s are sufficient.
- **Rate-limit:** 5 sequential requests at 1 req/s pacing all returned 200 in <1 s,
no throttling. Recommend default `RequestsPerSecond=1`, `MaxConcurrent=4`.
- **Sport ID semantics:** customer's "Sport_Code = 6" (Basketball) maps to
`data-sport-treeId="6"` in the breadcrumb-canonical sport listing
(`/su/betting/Basketball+-+6`). Some sports also have a separate "category tree
ID" used inside the live grouping (e.g., 45356 for Basketball-live) — ignore
those, use only the canonical breadcrumb ID.
- **Selection key format:** `<eventId>@<MarketName>{LineIndex?}.<Outcome>`. The
market name is sport-specific (`Match_Result`, `1st_Half_Result`, `Total_Goals`,
`Total_Points`, `Total_Games`, `To_Win_Match_With_Handicap`, etc.). Total
thresholds are encoded in the outcome (`Under_3.5`, `Over_213.5`). Handicap
values are NOT in the key — they're in `<span class="middle-simple">` text.
- **Tennis has no Draw outcome** — domain `Bet_Match_Draw` must be nullable.
- **Date display ambiguity:** listing shows `HH:MM` (today) or `DD <ru-month> HH:MM`
(future). Anchor the parser on `initData.serverTime` (Moscow TZ, format
`YYYY,MM,DD,HH,MM,SS`).
- **No public results page** (`/su/results` → 404). Final scores are exposed only
on the event detail page itself via `eventJsonInfo` JSON
(`matchIsComplete`, `resultDescription`). Phase 8 must poll until completion;
cannot back-fill from an archive endpoint.
- **Probe environment:** Windows 10 + curl, geo-routed as Poland (`countryCode: PL`).
Customer in Belarus may see slightly different KYC overlays — parser must be
defensive (treat missing markets as null, never throw).
- **Captures saved locally** at `spike/captures/*.html` (gitignored): 7 fixtures
for offline parser development in Phase 3.
+2 -2
@@ -34,7 +34,7 @@ parameter configurable.
## Phases ## Phases
- [ ] Phase 0: Scraping spike (research, throwaway) [domain: backend] → [subplan](./phase-0-scraping-spike.md) - [x] Phase 0: Scraping spike (research, throwaway) [domain: backend] → [subplan](./phase-0-scraping-spike.md)
- [ ] Phase 1: Solution skeleton + Domain model [domain: backend] → [subplan](./phase-1-solution-and-domain.md) - [ ] Phase 1: Solution skeleton + Domain model [domain: backend] → [subplan](./phase-1-solution-and-domain.md)
- [ ] Phase 2: Infrastructure — Storage [domain: backend] → [subplan](./phase-2-storage.md) - [ ] Phase 2: Infrastructure — Storage [domain: backend] → [subplan](./phase-2-storage.md)
- [ ] Phase 3: Infrastructure — Scraping [domain: backend] → [subplan](./phase-3-scraping.md) - [ ] Phase 3: Infrastructure — Scraping [domain: backend] → [subplan](./phase-3-scraping.md)
@@ -62,7 +62,7 @@ parameter configurable.
| Phase | Domain | Status | Review | Build | Committed | | Phase | Domain | Status | Review | Build | Committed |
|---|---|---|---|---|---| |---|---|---|---|---|---|
| Phase 0: Scraping spike | backend | ⬜ Not Started | ⬜ | ⬜ | ⬜ | | Phase 0: Scraping spike | backend | ✅ Done | ⬜ Pending review | ⏭️ N/A (research) | ⬜ |
| Phase 1: Solution + Domain | backend | ⬜ Not Started | ⬜ | ⬜ | ⬜ | | Phase 1: Solution + Domain | backend | ⬜ Not Started | ⬜ | ⬜ | ⬜ |
| Phase 2: Storage | backend | ⬜ Not Started | ⬜ | ⏭️ Big Bang | ⬜ | | Phase 2: Storage | backend | ⬜ Not Started | ⬜ | ⏭️ Big Bang | ⬜ |
| Phase 3: Scraping | backend | ⬜ Not Started | ⬜ | ⏭️ Big Bang | ⬜ | | Phase 3: Scraping | backend | ⬜ Not Started | ⬜ | ⏭️ Big Bang | ⬜ |
@@ -1,6 +1,6 @@
# Phase 0: Scraping Spike (Research, Throwaway) # Phase 0: Scraping Spike (Research, Throwaway)
**Status:** ⬜ Not Started **Status:** ✅ Done
**Parent plan:** [PLAN.md](./PLAN.md) **Parent plan:** [PLAN.md](./PLAN.md)
**Domain:** backend **Domain:** backend
**Type:** Research / spike — produces documentation only, NO production code. **Type:** Research / spike — produces documentation only, NO production code.
@@ -14,32 +14,32 @@ stop and renegotiate scope with the customer before writing architecture code.
## Tasks ## Tasks
- [ ] Probe `https://www.marathonbet.by/su` (pre-match) anonymously. Document: - [x] Probe `https://www.marathonbet.by/su` (pre-match) anonymously. Document:
- HTTP status, headers, cookies set - HTTP status, headers, cookies set
- Whether content is server-rendered HTML or hydrated client-side - Whether content is server-rendered HTML or hydrated client-side
- URL pattern for sport sections (basketball, hockey, football, etc.) - URL pattern for sport sections (basketball, hockey, football, etc.)
- Sport group codes (e.g., basketball = 6 per spec) - Sport group codes (e.g., basketball = 6 per spec)
- [ ] Probe `https://www.marathonbet.by/su/live` (live events). Document: - [x] Probe `https://www.marathonbet.by/su/live` (live events). Document:
- Same as above - Same as above
- Whether odds update via XHR/fetch/WebSocket — capture network calls - Whether odds update via XHR/fetch/WebSocket — capture network calls
- [ ] Identify event-detail URL pattern and inspect a sample event's full odds page. - [x] Identify event-detail URL pattern and inspect a sample event's full odds page.
- [ ] For 3 events across 3 sports (e.g., basketball, hockey, tennis), capture: - [x] For 3 events across 3 sports (basketball, football, tennis — hockey deferred to Phase 3 verify), capture:
- Event metadata (sport, country, league, category, scheduled time, event ID) - Event metadata (sport, country, league, category, scheduled time, event ID)
- Match-level bets: Win-1 / Draw / Win-2, Win-Fora-1/2 (with handicap value), - Match-level bets: Win-1 / Draw / Win-2, Win-Fora-1/2 (with handicap value),
Total Less/More (with threshold) Total Less/More (with threshold)
- Period-N bets where the sport has periods - Period-N bets where the sport has periods
- [ ] Identify any anti-bot measures: Cloudflare challenges, JS challenges, rate - [x] Identify any anti-bot measures: Cloudflare challenges, JS challenges, rate
limiting, header requirements, fingerprinting hints. limiting, header requirements, fingerprinting hints.
- [ ] Test rate behavior: ~10 sequential requests, observe latency / blocks. Do NOT - [x] Test rate behavior: ~10 sequential requests, observe latency / blocks. Do NOT
hammer — be respectful. hammer — be respectful.
- [ ] Document API endpoints if marathonbet.by exposes any internal JSON APIs visible - [x] Document API endpoints if marathonbet.by exposes any internal JSON APIs visible
in browser network tab (often these are easier to scrape than HTML). in browser network tab (often these are easier to scrape than HTML).
- [ ] Decide: HttpClient + AngleSharp sufficient, or Playwright required (or both)? - [x] Decide: HttpClient + AngleSharp sufficient, or Playwright required (or both)?
- [ ] Save 2-3 representative HTML/JSON samples under `spike/captures/` (gitignored; - [x] Save 2-3 representative HTML/JSON samples under `spike/captures/` (gitignored;
for local reference only). for local reference only). Saved 7 fixtures.
- [ ] Write `spike/SCRAPE_FINDINGS.md` with findings, decisions, and recommended - [x] Write `spike/SCRAPE_FINDINGS.md` with findings, decisions, and recommended
scraping strategy for Phase 3. scraping strategy for Phase 3.
- [ ] Write `spike/SCHEMA_DRAFT.md` with concrete proposed domain field mappings — - [x] Write `spike/SCHEMA_DRAFT.md` with concrete proposed domain field mappings —
marathonbet.by terms → spec field names (`Bet_Match_Win_1`, etc.). marathonbet.by terms → spec field names (`Bet_Match_Win_1`, etc.).
## Files to Modify/Create ## Files to Modify/Create
@@ -71,14 +71,100 @@ stop and renegotiate scope with the customer before writing architecture code.
## Review Checklist ## Review Checklist
- [ ] `SCRAPE_FINDINGS.md` answers all required questions above - [x] `SCRAPE_FINDINGS.md` answers all required questions above
- [ ] `SCHEMA_DRAFT.md` covers all bet types in the customer spec - [x] `SCHEMA_DRAFT.md` covers all bet types in the customer spec
(Win/Draw/Win_Fora/Total at Match + Period-N scope) (Win/Draw/Win_Fora/Total at Match + Period-N scope)
- [ ] No production code committed - [x] No production code committed
- [ ] Recommended Phase 3 strategy is concrete and actionable - [x] Recommended Phase 3 strategy is concrete and actionable
- [ ] Risk register updated if anti-bot or rate-limit issues found - [x] Risk register updated if anti-bot or rate-limit issues found
## Handoff to Next Phase ## Handoff to Next Phase
<!-- Filled by Phase 0 implementer. Critical: list anything Phase 1+ implementers must know, **Anonymous scraping is feasible and recommended technology is HttpClient + AngleSharp.**
especially deviations from the customer spec field names due to real bookmaker data. --> No Cloudflare, no JS challenge. Site is fully SSR — all data we need is in the raw HTML.
### What Phase 1 (Domain) needs to know
1. **`SportCode`** is the `data-sport-treeId` attribute / first integer after the
sport name in `/su/betting/<Sport>+-+<id>`. Customer's "basketball=6" matches
exactly. Confirmed IDs: Basketball=6, Football=11, Tennis=22723, Hockey=43658.
Note: there are duplicate "category" tree IDs (e.g., 45356 for live basketball);
use only the breadcrumb canonical ID as `SportCode`.
2. **`EventCode`** is `data-event-eventId` (numeric, ~26-million range). This is the
bookmaker's stable event ID — use as primary key for the event in our SQLite.
`TreeId` is a separate URL-routing ID — keep it for URL building but do not use
as the entity primary key.
3. **No "Draw" outcome for tennis (and for some basketball variants).** The Domain
model should make the Draw rate nullable. Customer's spec field `Bet_Match_Draw`
should serialize to empty cell when null.
4. **Period-N counts vary by sport** (Football: 2; Basketball: 2 halves OR 4 quarters;
Tennis: variable by match length up to 5 sets; Hockey: 3). The Domain should not
hardcode a max period count — store `PeriodNumber` as `int` and let
`PeriodScopeMapper` (Phase 3) decide which periods are valid for which sport.
5. **Bet handicap and total values come from the DOM `<span class="middle-simple">`**
text, not from the `data-selection-key` (with one exception: Total markets encode
the threshold in the outcome name, e.g., `Under_213.5`). Domain `Bet.Value` is
`decimal?` — populated for handicap and total, null for Win/Draw.
6. **`ScheduledAt`** has TWO possible string formats in the listing: `HH:MM` (today)
or `DD <ru-month> HH:MM` (future). Domain should store as `DateTimeOffset` in
Moscow time (`Europe/Moscow`, UTC+3). The "today" anchor comes from the
`initData.serverTime` blob (`YYYY,MM,DD,HH,MM,SS` format). Phase 3 must extract
server time on every page load and pass it to the date parser.
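The date-anchoring rule in item 6 can be sketched as follows. This is an illustration in Python, not the project's parser. It assumes the month in `initData.serverTime` is 1-based (worth checking against a capture, since comma-separated date parts sometimes follow the 0-based JavaScript `Date` convention), that Russian month names appear in the genitive case as in `06 мая 22:00`, and that a listed month earlier than the server month means next year. The names `parse_server_time` and `parse_scheduled_at` are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Moscow has been fixed at UTC+3 since 2014, so a fixed offset is enough here.
MSK = timezone(timedelta(hours=3), "MSK")

# Genitive month names as they would appear in "DD <ru-month> HH:MM".
RU_MONTHS = {
    "января": 1, "февраля": 2, "марта": 3, "апреля": 4,
    "мая": 5, "июня": 6, "июля": 7, "августа": 8,
    "сентября": 9, "октября": 10, "ноября": 11, "декабря": 12,
}


def parse_server_time(raw: str) -> datetime:
    """Parse 'YYYY,MM,DD,HH,MM,SS' (ASSUMPTION: month is 1-based)."""
    year, month, day, hour, minute, second = (int(p) for p in raw.split(","))
    return datetime(year, month, day, hour, minute, second, tzinfo=MSK)


def parse_scheduled_at(text: str, server_now: datetime) -> datetime:
    """Resolve 'HH:MM' (today) or 'DD <ru-month> HH:MM' (future) to Moscow time."""
    parts = text.split()
    hour, minute = (int(p) for p in parts[-1].split(":"))
    if len(parts) == 1:
        # Time only: the event is on the server's current date.
        year, month, day = server_now.year, server_now.month, server_now.day
    else:
        day = int(parts[0])
        month = RU_MONTHS[parts[1].lower()]
        # ASSUMPTION: a month earlier than the server month belongs to next year.
        year = server_now.year + (1 if month < server_now.month else 0)
    return datetime(year, month, day, hour, minute, tzinfo=MSK)


server_now = parse_server_time("2026,05,05,01,04,03")
print(parse_scheduled_at("06 мая 22:00", server_now).isoformat())
```

Passing the server time in as a parameter, rather than reading the system clock, keeps the parser deterministic against the saved fixtures.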
### What Phase 3 (Scraping) needs to know
Read `spike/SCRAPE_FINDINGS.md` end-to-end before designing the scraper.
Highlights:
- **Selector inventory:** in `SCHEMA_DRAFT.md` §1–§3 and in `SCRAPE_FINDINGS.md` §5.
- **URL templates** in `SCRAPE_FINDINGS.md` §3.
- **Rate-limit defaults:** 1 req/s, max 4 concurrent, exponential backoff on 429/5xx.
Use `Microsoft.Extensions.Http.Resilience` (Polly v8).
- **User-Agent rotation:** the only mitigation we found worth adding. The site does
not challenge the UA, but rotating it guards against future fingerprint-based throttling.
- **No Playwright required**, but plumb a `Scraping:UsePlaywright` flag for future flip.
### What Phase 8 (Results loader) needs to know — IMPORTANT DEVIATION
**There is no public results / archive page.** `https://www.marathonbet.by/su/results`
returns 404. The only way to capture finished-event scores is to keep polling the
event detail page until `eventJsonInfo.matchIsComplete === true`, then snapshot
`resultDescription` (e.g., `"2:1 (1:1)"`).
This means Phase 8 must:
1. Maintain a "watch list" of events whose `ScheduledAt + EstimatedDuration` is in
the past but whose status in our DB is not yet `Completed`.
2. Poll those event detail URLs at a low frequency (every 5 min) until either:
(a) `matchIsComplete=true` → store final score, mark complete; OR
(b) detail URL returns 404 → site has expunged the event → mark `ResultUnknown`.
3. Optionally fall back to a third-party score aggregator (flashscore /
sofascore) — separate Phase 8 design decision.
This is a **deviation from the original Phase 8 plan**, which assumed a results
endpoint to back-fill from. Phase 8 implementer should re-read this and revise
the subplan accordingly before implementation.
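The watch-list rules above could be expressed as two small pure functions. This is a Python sketch, not the Phase 8 design. `Completed` and `ResultUnknown` are the statuses named above; `Pending`, `is_due` and `classify_poll` are hypothetical names, and `event_info` stands for the already-decoded `eventJsonInfo` object.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple


def is_due(scheduled_at: datetime, estimated_duration: timedelta, now: datetime) -> bool:
    """An event joins the watch list once its expected end time has passed."""
    return scheduled_at + estimated_duration <= now


def classify_poll(http_status: int, event_info: Optional[dict]) -> Tuple[str, Optional[str]]:
    """Map one poll of the event detail page to (new status, final score)."""
    if http_status == 404:
        # The site has expunged the event; no score can be recovered from it.
        return ("ResultUnknown", None)
    if http_status == 200 and event_info and event_info.get("matchIsComplete") is True:
        return ("Completed", event_info.get("resultDescription"))
    # Anything else (still running, transient 5xx, missing JSON): keep polling.
    return ("Pending", None)


print(classify_poll(200, {"matchIsComplete": True, "resultDescription": "2:1 (1:1)"}))
```

Treating every unexpected response as `Pending` means a transient error never marks an event as lost; only an explicit 404 does.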
### What Phase 5/6 (UI) needs to know
- **Bet handicap and total "main line" picking** is heuristic (see
`SCHEMA_DRAFT.md` §2.2 and §2.3) and should be exposed as a configurable
policy. The Settings page in Phase 5 should allow the user to choose
`MainLinePolicy = ListingDisplay | Closest50_50 | NoSuffixSelection`.
- **Russian-only labels** in the source HTML. Localization layer (Phase 5)
must translate sport names, period names, and outcome labels to EN; the raw
Russian strings are the canonical source.
### Critical mappings (deviations from spec wording)
| Customer-spec word | marathonbet.by reality |
| --- | --- |
| `Win_Fora` | `Handicap` market in DOM (`To_Win_Match_With_Handicap`). Same concept, different word. |
| `Total_Less` / `Total_More` | DOM uses `Under` / `Over`. |
| `Period-1` (basketball) | Could be 1st Half or 1st Quarter — needs customer decision (default: 1st Half). |
| `Sport_Code = 6` | `data-sport-treeId="6"` confirmed for Basketball. |
+318
@@ -0,0 +1,318 @@
# Phase 0 Spike — Domain Schema Draft
**Purpose:** Map every customer-spec Excel column to a concrete DOM/JSON path in
marathonbet.by. Phase 1 (Domain) and Phase 3 (Scraping/parsing) consume this.
**Convention:** "selector" entries use AngleSharp/CSS notation. `evt` = the
event detail page DOM; `list` = the listing page DOM (top-level grid view).
---
## 1. Event Metadata
| Spec field | Source | Selector / extraction |
|---|---|---|
| `EventCode` | event detail page | `[data-event-eventId]` attribute on the outer `div.coupon-row`. Numeric, e.g., `26456117`. **Stable; use as primary key for the event in our SQLite.** |
| `TreeId` (internal) | event detail page | `[data-event-treeId]` on the same `div.coupon-row`. Used for URL building, less stable than `EventCode`. |
| `SportCode` | breadcrumb of event detail | `breadcrumbs-list .breadcrumbs-item:nth-child(2) a@href` matches `/su/betting/{Sport}+-+{N}`. Parse `N` as integer. Confirmed: Basketball = 6, Football = 11. |
| `Sport` | breadcrumb (RU label) | `breadcrumbs-list .breadcrumbs-item:nth-child(2) .breadcrumb-text` → strip leading `Ставки на ` prefix. e.g., `Ставки на Баскетбол` → `Баскетбол`. |
| `Country` | breadcrumb | `.breadcrumbs-item:nth-child(3) .breadcrumb-text`. May represent group ("Клубы. Международные") rather than literal country for international leagues — accept as-is. |
| `League` | breadcrumb | `.breadcrumbs-item:nth-child(4) .breadcrumb-text`. e.g., `Лига чемпионов УЕФА`, `NBA`. |
| `Category` | breadcrumb (deeper) | If breadcrumb has 5+ items beyond the event itself, join items 5..N-1 with ` / `. e.g., `Play-Offs / Semi Final / 2nd Leg`. The event detail's `category-label-link` `<h2>` text also exposes this concatenated. |
| `EventName` | event detail | `[data-event-name]` attribute on `div.coupon-row`. e.g., `Арсенал - Атлетико Мадрид`. |
| `Team1` | event detail | `[data-event-name]`, split on ` - `, take index 0. Or: `.player-row.player1 .member-name [data-member-link]` text. |
| `Team2` | event detail | Split index 1, or `.player-row.player2 .member-name [data-member-link]`. |
| `ScheduledAt` (date+time) | event detail + listing | **Time:** `.date-wrapper` text. Two formats: `HH:MM` (today) or `DD <ru-month> HH:MM` (future, e.g., `06 мая 22:00`). **Anchor:** `initData.serverTime` (Moscow TZ, format `YYYY,MM,DD,HH,MM,SS`) parsed and combined with the time. **Title fallback:** `<title>` and `<meta name="description">` contain a Russian-formatted full date (`05 мая 2026`) — use as authoritative when ambiguous. |
| `IsLive` | event detail / listing | `[data-live="true"]` attribute. Live events also carry `.score-state` and `.time` elements with `2:1` and `83:30` style content. |
| `LiveScore` | event detail (live only) | `.score-state` text (`2:1 (1:1)` style). Inning breakdown: parse the `eventJsonInfo` `[data-json]` attribute on the hidden `<td>` — JSON includes `mainScore`, `inningScore[]`, `matchTime.seconds`, `matchIsComplete`. |
| `MatchIsComplete` | event detail | Decoded JSON of `[data-mutable-id="eventJsonInfo"][data-json]` → `.matchIsComplete` boolean. Critical for Phase 8 (Results loader). |
| `FinalScore` | event detail (post-match) | Same `eventJsonInfo` JSON → `.resultDescription` (e.g., `"2:1 (1:1)"`) when `matchIsComplete=true`. |
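The `MatchIsComplete` and `FinalScore` rows can be illustrated with a short Python sketch. It assumes the `data-json` attribute value has already been read through a DOM parser (so HTML entities are decoded) and that it holds a JSON object with the `matchIsComplete` and `resultDescription` fields listed above. The function name `final_score` is hypothetical.

```python
import json
from typing import Optional


def final_score(data_json: Optional[str]) -> Optional[str]:
    """Return resultDescription only when the match is flagged complete."""
    try:
        info = json.loads(data_json)
    except (TypeError, ValueError):
        # Missing attribute or malformed JSON: stay defensive and report nothing.
        return None
    if not isinstance(info, dict) or info.get("matchIsComplete") is not True:
        return None
    score = info.get("resultDescription")
    return score if isinstance(score, str) and score else None


print(final_score('{"matchIsComplete": true, "resultDescription": "2:1 (1:1)"}'))
```

Returning `None` instead of raising follows the "treat missing data as null, never throw" rule used elsewhere in this draft.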
---
## 2. Match-Scope Bets (1×2, Handicap, Total)
The event-detail "main row" presents three primary markets in a `coefficients-table`:
**Result** (1×2), **Handicap** (Win-Fora), **Total** (Goals/Points/Games depending
on sport). These map to spec fields `Bet_Match_*`.
### 2.1 Match Win 1 / Draw / Win 2
| Spec field | data-selection-key suffix | DOM path |
|---|---|---|
| `Bet_Match_Win_1` | `@Match_Result.1` (football, tennis, hockey) **OR** `@Result.1` (basketball pre-match) **OR** `@Normal_Time_Result.1` (basketball detail) | `evt span[data-selection-key$='@Match_Result.1']@data-selection-price` (decimal odds, e.g., `1.65`) |
| `Bet_Match_Draw` | `.draw` outcome of same market | `evt span[data-selection-key$='@Match_Result.draw']@data-selection-price`. **NULL for tennis** (2-way market, no draw). |
| `Bet_Match_Win_2` | `.3` outcome | `evt span[data-selection-key$='@Match_Result.3']@data-selection-price` |
**Sport variance:**
- Football, Tennis, Table-tennis: `Match_Result`.
- Basketball: in pre-match landing, label is `Match_Winner_Including_All_OT.HB_H/HB_A`
(2-way, OT included). On the detail page, both `Normal_Time_Result.{1,draw,3}` (3-way,
reg time) and `Match_Winner_Including_All_OT.{HB_H,HB_A}` (2-way, OT included) appear.
**Recommendation:** treat `Match_Winner_Including_All_OT` as the canonical Win-1 / Win-2
(no Draw) when a 3-way `Result` market is absent; fall back to draw-included
`Normal_Time_Result` when present.
- Hockey: TBD — verify in Phase 3 with an actual hockey event capture.
**Recommendation for Phase 1 domain:** define `BetType.WinDraw` allowing nullable
`Draw`. The Excel exporter writes empty cell when `Draw` is null.
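The selection keys used in the table above share one shape: event ID, `@`, market token, optional numeric line index, `.`, outcome. A minimal Python sketch of splitting such a key follows. It assumes market tokens never contain a dot and that any trailing digits on the market token are the line index. `SelectionKey` and `parse_selection_key` are hypothetical names.

```python
import re
from typing import NamedTuple, Optional


class SelectionKey(NamedTuple):
    event_id: int
    market: str
    line_index: Optional[int]
    outcome: str


# The lazy market group stops at the first dot, so outcomes such as
# "Under_3.5" keep their own decimal point intact.
_KEY_RE = re.compile(
    r"^(?P<event>\d+)@(?P<market>.+?)(?P<line>\d+)?\.(?P<outcome>.+)$"
)


def parse_selection_key(key: str) -> Optional[SelectionKey]:
    """Split a data-selection-key into its parts; None if the shape is unknown."""
    match = _KEY_RE.match(key)
    if match is None:
        return None
    line = match.group("line")
    return SelectionKey(
        event_id=int(match.group("event")),
        market=match.group("market"),
        line_index=int(line) if line is not None else None,
        outcome=match.group("outcome"),
    )


print(parse_selection_key("26456117@Total_Goals3.Under_3.5"))
```

Parsing the key once and matching on `market` and `outcome` afterwards may be easier to maintain than a separate suffix selector per sport, although the suffix selectors in the table remain valid.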
### 2.2 Match Win Fora (handicap)
| Spec field | data-selection-key suffix | DOM path | Value source |
|---|---|---|---|
| `Bet_Match_Win_Fora_1_Value` | — | (no selection key for value alone) | `<td>` of HB_H selection: `.middle-simple` text inside the `<div class="nowrap simple-price">` (e.g., `(-1.0)`). Strip parens, parse as `decimal`. |
| `Bet_Match_Win_Fora_1_Rate` | `@To_Win_Match_With_Handicap{N}.HB_H` (or `@Match_Handicap.HB_H` variant) | `[data-selection-key$='@To_Win_Match_With_Handicap.HB_H']@data-selection-price` | — |
| `Bet_Match_Win_Fora_2_Value` | — | `.middle-simple` next to HB_A selection (e.g., `(+1.0)`). | — |
| `Bet_Match_Win_Fora_2_Rate` | `@To_Win_Match_With_Handicap{N}.HB_A` | `[data-selection-key$='@To_Win_Match_With_Handicap.HB_A']@data-selection-price` | — |
**Tennis variant:** uses `@To_Win_Match_With_Handicap_By_Games{N}.HB_H/HB_A`.
The handicap is in **games** not points — emit `Value` as-is, the unit is implicit
in the sport.
**Multi-line handicap:** the site offers many lines (`To_Win_Match_With_Handicap0`,
`...1`, `...2`, ...), each a different handicap value. The customer spec wants only
the **main line** (the one displayed in the listing's main row). Phase 3 should:
1. On listing pages, take the handicap displayed in the `coefficients-table`
`data-market-type="HANDICAP"` cell.
2. On event detail, identify the "main" line as the one without a numeric suffix
(`@To_Win_Match_With_Handicap.HB_H`) or with suffix `0` if both exist — sample
shows both `To_Win_Match_With_Handicap.HB_H` and `...0.HB_H`. Heuristic: pick
the line whose handicap value is closest to ±1.0 from the favorite, OR explicitly
prefer the no-suffix variant; fall back to suffix `0`.
3. Optional: capture the full handicap ladder into a separate normalized table
so anomaly detection can use the spread, even if Excel only exports the main line.
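The value extraction and the "prefer no suffix, fall back to suffix `0`" rule from step 2 can be sketched in Python as below. Treating the Unicode minus sign as a possible variant in the `.middle-simple` text is an assumption, and the function names are hypothetical.

```python
from decimal import Decimal, InvalidOperation
from typing import Iterable, Optional

BASE_MARKET = "To_Win_Match_With_Handicap"


def parse_handicap(text: str) -> Optional[Decimal]:
    """Turn '(-1.0)' or '(+1.0)' into a Decimal; None if it does not parse."""
    # ASSUMPTION: the site might render a Unicode minus; normalize it to ASCII.
    cleaned = text.strip().strip("()").replace("\u2212", "-")
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None


def pick_main_handicap_market(markets: Iterable[str]) -> Optional[str]:
    """Prefer the market token without a numeric suffix, then suffix 0."""
    available = set(markets)
    for candidate in (BASE_MARKET, BASE_MARKET + "0"):
        if candidate in available:
            return candidate
    return None


print(parse_handicap("(-1.0)"), pick_main_handicap_market([BASE_MARKET + "0", BASE_MARKET + "1"]))
```

Using `Decimal` rather than a float keeps values such as `-1.5` exact when they are later written to SQLite or Excel.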
### 2.3 Match Total Less / More
| Spec field | data-selection-key suffix | DOM path |
|---|---|---|
| `Bet_Match_Total_Less_Value` | — | `.middle-simple` next to the `Меньше` selection (e.g., `3.5`, `213.5`). |
| `Bet_Match_Total_Less_Rate` | `@Total_{Goals\|Points\|Games}{N}.Under_<X>` | `[data-selection-key^='<eventId>@Total_'][data-selection-key$='.Under_<X>']@data-selection-price`. Use the row whose Value equals the chosen total threshold. |
| `Bet_Match_Total_More_Value` | — | Same value as Less (paired). |
| `Bet_Match_Total_More_Rate` | `@Total_{Goals\|Points\|Games}{N}.Over_<X>` | `[data-selection-key$='.Over_<X>']@data-selection-price` |
**Sport vocabulary:**
- Football: `Total_Goals`
- Basketball: `Total_Points`
- Tennis: `Total_Games`
- Hockey: `Total_Goals` (TBD)
- Volleyball / handball: TBD
**Choosing the "main" total line:** customer spec wants ONE Total Value + Less/More
rates per event. The site offers ~20 different total thresholds per event. The
listing page main row exposes the "headline" total (the one the bookmaker chose
to show). **Heuristic:**
1. On listing: read the `data-market-type="TOTAL"` cell directly.
2. On event detail: find the row labeled in `coefficients-row` (visible main view),
not in `coefficients-hidden-row`. The `data-mutable-id="S_3_1_european"` /
`S_3_3_european` pair is the main line.
3. Fall back to picking the line whose Under/Over rates are closest to **2.00**
each (the "balanced" line — most representative of bookmaker's expectation).
4. As with handicap, capture the full ladder for analysis even if exports only one row.
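The fallback in step 3 (pick the line whose Under and Over rates are closest to 2.00) could look like this in Python. The distance measure, the sum of absolute deviations from 2.00, is one reasonable reading of "closest" and is an assumption, as are the names `TotalLine` and `pick_balanced_total`.

```python
from typing import List, NamedTuple, Optional


class TotalLine(NamedTuple):
    threshold: float
    under_rate: float
    over_rate: float


def pick_balanced_total(lines: List[TotalLine]) -> Optional[TotalLine]:
    """Return the line whose two rates sit nearest to 2.00, or None if empty."""
    if not lines:
        return None
    # ASSUMPTION: "closest to 2.00 each" is scored as the summed absolute gap.
    return min(lines, key=lambda l: abs(l.under_rate - 2.0) + abs(l.over_rate - 2.0))


ladder = [
    TotalLine(2.5, 2.30, 1.60),
    TotalLine(3.5, 1.95, 1.87),
    TotalLine(4.5, 1.45, 2.60),
]
print(pick_balanced_total(ladder))
```

Because the function takes the whole ladder, the same captured data serves both the single exported row and the spread-based analysis mentioned in step 4.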
---
## 3. Period-N Scope Bets
Period markets follow the same pattern as match markets but with a period prefix
in the market token. Examples for `Period-1` (1st half of football, 1st quarter
of basketball, 1st set of tennis):
### 3.1 Period-N Win 1 / Draw / Win 2
> **CORRECTED FROM CAPTURE EVIDENCE (2026-05-05):** Period result markets use
> `RN_H` / `RN_D` / `RN_A` outcome codes (Reduced Numerals: Home / Draw / Away),
> NOT the `1` / `draw` / `3` codes used by `@Match_Result`. Market names also
> vary: football uses `Result_-_1st_Half` (with separator dashes); basketball and
> tennis use `1st_Half_Result0` / `1st_Quarter_Result0` / `1st_Set_Result0`
> (note the literal `0` suffix on the market name — line index for the period
> result market). Phase 3 parser must use these exact tokens.
| Customer field | Football (1st Half) | Basketball (1st Half *or* Quarter) | Tennis (1st Set) | Hockey (1st Period) |
|---|---|---|---|---|
| `Bet_Period-1_Win_1` | `@Result_-_1st_Half.RN_H` | `@1st_Half_Result0.RN_H` (halves) **or** `@1st_Quarter_Result0.RN_H` (quarters) | `@1st_Set_Result0.RN_H` | `@1st_Period_Result0.RN_H` (TBD verify on hockey event) |
| `Bet_Period-1_Draw` | `@Result_-_1st_Half.RN_D` | `@1st_Half_Result0.RN_D` / `@1st_Quarter_Result0.RN_D` | (NULL — no draw) | `@1st_Period_Result0.RN_D` (TBD) |
| `Bet_Period-1_Win_2` | `@Result_-_1st_Half.RN_A` | `@1st_Half_Result0.RN_A` / `@1st_Quarter_Result0.RN_A` | `@1st_Set_Result0.RN_A` | `@1st_Period_Result0.RN_A` (TBD) |
The market token vocabulary differs by sport:
- **Football:** `Result_-_<ordinal>_<unit>` (e.g., `Result_-_1st_Half`, `Result_-_2nd_Half`).
- **Basketball / Tennis / Hockey:** `<ordinal>_<unit>_Result0` (e.g.,
`1st_Half_Result0`, `1st_Quarter_Result0`, `1st_Set_Result0`,
`1st_Period_Result0`). The `0` suffix is required.
- **Note:** non-period markets like `@Match_Result.1` and `@Match_Result.draw`
still use the `1`/`draw`/`3` outcome codes — the `RN_*` codes are specific to
period/half/quarter/set markets.
**Period count by sport** (default mapping for `Period-N`):
- Football: N ∈ {1, 2}
- Basketball: configurable — halves (N ∈ {1,2}) or quarters (N ∈ {1,2,3,4}). **Default to halves.**
- Tennis: N ∈ {1, 2, ...} until `<i>th_Set_Result` selection is absent. Cap at 5 for Grand Slams.
- Hockey: N ∈ {1, 2, 3}.
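The token vocabulary and period counts above suggest a small lookup, sketched here in Python as a rough picture of what a `PeriodScopeMapper` might do. Only `Result_-_1st_Half`, `Result_-_2nd_Half` and the `1st_*_Result0` tokens are confirmed in this document; tokens for later periods and for hockey are extrapolated and should be verified against captures. The function name is hypothetical.

```python
_ORDINALS = {1: "1st", 2: "2nd", 3: "3rd", 4: "4th", 5: "5th"}

# (unit, max period count) per sport, following the default mapping above.
_UNITS = {"football": ("Half", 2), "tennis": ("Set", 5), "hockey": ("Period", 3)}
_BASKETBALL_UNITS = {"Half": 2, "Quarter": 4}


def period_result_market(sport: str, n: int, basketball_unit: str = "Half") -> str:
    """Build the period result market token for Period-N of a given sport."""
    if sport == "basketball":
        unit, max_n = basketball_unit, _BASKETBALL_UNITS[basketball_unit]
    else:
        unit, max_n = _UNITS[sport]
    if not 1 <= n <= max_n:
        raise ValueError(f"{sport} has no period {n}")
    ordinal = _ORDINALS[n]
    if sport == "football":
        # Football uses the "Result_-_<ordinal>_<unit>" form with no 0 suffix.
        return f"Result_-_{ordinal}_{unit}"
    # ASSUMPTION: other sports keep the "<ordinal>_<unit>_Result0" form for N > 1.
    return f"{ordinal}_{unit}_Result0"


print(period_result_market("basketball", 1, basketball_unit="Quarter"))
```

Keeping the basketball unit as a parameter mirrors the open customer decision between halves and quarters, with halves as the default.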
### 3.2 Period-N Win Fora
Same as match handicap, with period prefix:
| Sport | Selection key |
|---|---|
| Football | `@To_Win_1st_Half_With_Handicap{N}.HB_H` / `.HB_A` |
| Basketball | `@To_Win_1st_Half_With_Handicap{N}.HB_*` (or `_1st_Quarter_`) |
| Tennis | `@To_Win_1st_Set_With_Handicap{N}.HB_*` |
| Hockey | `@To_Win_1st_Period_With_Handicap{N}.HB_*` (TBD verify) |
Value extraction: same `.middle-simple` text as match handicap.
### 3.3 Period-N Total Less / More
This is the **least uniform** market. Observed:
| Sport | Period-1 Total selection key |
|---|---|
| Football | (search HTML directly — Phase 3 should parse the "Тотал тайма" tab) Likely `@1st_Half_Total_Goals{N}.Under_<X>` / `.Over_<X>`. |
| Basketball | Per-quarter total exposed as separate market in the "Тоталы" tab; sample event did not show clean `1st_Half_Total_Points` keys — see SCRAPE_FINDINGS.md §6 risk #4. **May need to fall back to NULL** for basketball Period-N Total in some leagues. |
| Tennis | `@1st_Set_Total_Games{N}.Under_<X>` / `.Over_<X>` — confirmed in sample. |
| Hockey | `@1st_Period_Total_Goals...` (TBD verify). |
**Phase 3 robustness rule:** if a period-N market is absent in the parsed HTML,
emit `null` for the corresponding rate/value. Never throw. The Excel exporter
writes an empty cell.
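One way to enforce the rule is to funnel every lookup through a helper that can only return a value or `null`. The sketch below assumes the parser has already collected price text per selection-key suffix into a dictionary; `SafeOdds` and `RateOrNull` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class SafeOdds
{
    // Returns the decimal rate for a selection-key suffix, or null when the
    // market is absent or its text is not a number. Never throws for those cases.
    public static decimal? RateOrNull(
        IReadOnlyDictionary<string, string> priceTextByKeySuffix,
        string keySuffix)
    {
        if (!priceTextByKeySuffix.TryGetValue(keySuffix, out var text))
            return null; // market not present for this event

        return decimal.TryParse(text, NumberStyles.Number, CultureInfo.InvariantCulture, out var rate)
            ? rate
            : (decimal?)null; // present but unparseable: treat as absent
    }
}
```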
---
## 4. Live Counterparts
When the same scope is captured from the **live** site (`/su/live` or live-flagged
events on `/su/`), the spec wants column prefix `Live_*` instead of `Bet_*`.
**Important:** live events use the SAME `data-selection-key` naming conventions.
The distinguishing signal is `data-live="true"` on the outer `div.coupon-row` and
the URL the snapshot was scraped from (`/su/live`).
Examples:
- `Live_Match_Win_1` ← `[data-selection-key$='@Match_Result.1']` from live page
- `Live_Match_Win_Fora_1_Value`, `Live_Match_Win_Fora_1_Rate` ← same DOM, same logic
- `Live_Period-1_Win_1` ← same as `Bet_Period-1_Win_1` but captured from live event
**Implementation:** the parser does not change. The application service simply
records `Source = Live | PreMatch` on each `OddsSnapshot` and the Excel exporter
denormalizes pre-match snapshots to `Bet_*` columns and live snapshots to `Live_*`
columns at write time.
---
## 5. Field Coverage Matrix (spec → confidence)
| Field family | Football | Basketball | Tennis | Hockey | Notes |
|---|---|---|---|---|---|
| `Match_Win_1/2`, `Match_Draw` | ✅ confirmed | ⚠️ Win-1/2 confirmed; Draw conditional on `Normal_Time_Result` presence | ✅ Win-1/2 confirmed; **Draw is null** | ❓ verify Phase 3 | — |
| `Match_Win_Fora_*` | ✅ | ✅ | ✅ (in games) | ❓ | "Main line" heuristic needed (§2.2) |
| `Match_Total_*` | ✅ Goals | ✅ Points | ✅ Games | ❓ | "Main line" heuristic needed (§2.3) |
| `Period-1_Win_*` | ✅ Half | ✅ Half / Quarter | ✅ Set | ❓ Period | basketball mode is configurable |
| `Period-1_Win_Fora_*` | ✅ | ✅ | ✅ | ❓ | — |
| `Period-1_Total_*` | ⚠️ structure verified, exact key TBD | ⚠️ may be absent for some games | ✅ Set | ❓ | risk: emit null where absent |
| `Period-2/3/4_*` | (Period-2 only) | ✅ all | up to actual played sets | ❓ | — |
| `Live_*` (any of above) | same parser | same | same | same | distinguished only by `data-live` flag + scrape URL |
Legend: ✅ confirmed in spike sample, ⚠️ partial / heuristic needed, ❓ Phase 3 must verify.
---
## 6. Suggested Domain Types (Phase 1 input)
```csharp
// Marathon.Domain
public enum BetScope { Match, Period }
public enum BetType { Win, Draw, WinFora, Total }
public enum BetSide { Side1, Side2, Less, More } // Side1=home/W1, Side2=away/W2
public sealed record Sport(int Code, string NameRu, string NameEn);
public sealed record League(int TreeId, string NameRu, int SportCode);
public sealed record Event(
long EventCode, // marathonbet's data-event-eventId
int TreeId, // for URL building
int SportCode,
int LeagueTreeId,
string Country, // breadcrumb position 3
string? Category, // joined breadcrumb 5..N-1
string Team1,
string Team2,
DateTimeOffset ScheduledAt, // anchored on initData.serverTime
string DetailUrl);
public sealed record Bet(
BetScope Scope,
int? PeriodNumber, // null when Scope=Match
BetType Type,
BetSide? Side, // null for Type=Draw
decimal? Value, // handicap/total threshold; null for Win/Draw
decimal Rate); // decimal odds (e.g., 1.65)
public sealed record OddsSnapshot(
long EventCode,
DateTimeOffset CapturedAt,
SnapshotSource Source, // PreMatch | Live
IReadOnlyList<Bet> Bets);
public enum SnapshotSource { PreMatch, Live }
```
Phase 1 will refine names, but this captures the data shape Phase 3 produces.
---
## 7. Excel Column Generation (Phase 4 / 9 reference)
The Excel exporter generates wide rows by joining all `Bet`s of an `OddsSnapshot`
into named columns. Pseudocode:
```
foreach snapshot:
row.EventCode = snapshot.EventCode
row.SportCode = event.SportCode
row.Sport = event.Sport.NameRu
row.Country = event.Country
row.League = event.League.NameRu
row.Category = event.Category
row.ScheduledAt = event.ScheduledAt
prefix = snapshot.Source == PreMatch ? "Bet_" : "Live_"
// Match scope
row[prefix+"Match_Win_1"] = bet.Where(scope=Match, type=Win, side=Side1).Rate
row[prefix+"Match_Draw"] = bet.Where(scope=Match, type=Draw).Rate
row[prefix+"Match_Win_2"] = bet.Where(scope=Match, type=Win, side=Side2).Rate
row[prefix+"Match_Win_Fora_1_Value"] = bet.Where(scope=Match, type=WinFora, side=Side1).Value
row[prefix+"Match_Win_Fora_1_Rate"] = bet.Where(scope=Match, type=WinFora, side=Side1).Rate
row[prefix+"Match_Win_Fora_2_Value"] = bet.Where(scope=Match, type=WinFora, side=Side2).Value
row[prefix+"Match_Win_Fora_2_Rate"] = bet.Where(scope=Match, type=WinFora, side=Side2).Rate
row[prefix+"Match_Total_Less_Value"] = bet.Where(scope=Match, type=Total, side=Less).Value
row[prefix+"Match_Total_Less_Rate"] = bet.Where(scope=Match, type=Total, side=Less).Rate
row[prefix+"Match_Total_More_Value"] = bet.Where(scope=Match, type=Total, side=More).Value
row[prefix+"Match_Total_More_Rate"] = bet.Where(scope=Match, type=Total, side=More).Rate
// Period scope (foreach period N exposed for that sport)
for N in 1..MaxPeriodForSport(sportCode):
same fields with key {prefix}Period-{N}_*
null when bet absent
```
Spec column order is left to Phase 4 (`ExcelExporter`). Recommend:
`Date, Time, Sport, Country, League, Category, Event, EventCode,
Bet_Match_*..., Bet_Period-1_*..., Bet_Period-2_*..., Live_Match_*..., Live_Period-N_*...`
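The pseudocode above might translate to C# roughly as follows. This is a sketch under stated assumptions: the types are trimmed copies of the §6 draft, `ColumnBuilder` is a hypothetical name, and the event metadata columns (sport, league, date) are left out to keep it short.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

public enum BetScope { Match, Period }
public enum BetType { Win, Draw, WinFora, Total }
public enum BetSide { Side1, Side2, Less, More }
public enum SnapshotSource { PreMatch, Live }

public sealed record Bet(BetScope Scope, int? PeriodNumber, BetType Type,
                         BetSide? Side, decimal? Value, decimal Rate);

public static class ColumnBuilder
{
    // Maps one snapshot's bets to spec column names; absent bets become null cells.
    public static Dictionary<string, decimal?> Build(
        SnapshotSource source, IReadOnlyList<Bet> bets, int maxPeriod)
    {
        string prefix = source == SnapshotSource.PreMatch ? "Bet_" : "Live_";
        var row = new Dictionary<string, decimal?>();

        AddScope(row, prefix + "Match_",
            bets.Where(b => b.Scope == BetScope.Match).ToList());

        for (int n = 1; n <= maxPeriod; n++)
        {
            AddScope(row, $"{prefix}Period-{n}_",
                bets.Where(b => b.Scope == BetScope.Period && b.PeriodNumber == n).ToList());
        }
        return row;
    }

    private static void AddScope(Dictionary<string, decimal?> row, string p, List<Bet> bets)
    {
        // First matching bet or null; Draw bets carry a null Side.
        Bet? Find(BetType type, BetSide? side) =>
            bets.FirstOrDefault(b => b.Type == type && b.Side == side);

        row[p + "Win_1"] = Find(BetType.Win, BetSide.Side1)?.Rate;
        row[p + "Draw"] = Find(BetType.Draw, null)?.Rate;
        row[p + "Win_2"] = Find(BetType.Win, BetSide.Side2)?.Rate;

        foreach (var (side, label) in new[] { (BetSide.Side1, "1"), (BetSide.Side2, "2") })
        {
            var fora = Find(BetType.WinFora, side);
            row[$"{p}Win_Fora_{label}_Value"] = fora?.Value;
            row[$"{p}Win_Fora_{label}_Rate"] = fora?.Rate;
        }

        foreach (var (side, label) in new[] { (BetSide.Less, "Less"), (BetSide.More, "More") })
        {
            var total = Find(BetType.Total, side);
            row[$"{p}Total_{label}_Value"] = total?.Value;
            row[$"{p}Total_{label}_Rate"] = total?.Rate;
        }
    }
}
```

Each scope contributes 11 columns, so a football snapshot with `maxPeriod = 2` yields 33 odds columns, every one present in the dictionary even when its value is `null`.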
---
## 8. Decisions Pending Customer Confirmation
1. **Basketball Period mapping** — halves (default) or quarters? Spec says
"Period-N" but is silent on which N applies. Recommend halves (`N ∈ {1,2}`)
with a quarter mode opt-in via `appsettings.Sports.Basketball.PeriodMode`.
2. **Tennis Draw column** — emit empty / 0 / "—"? Recommend empty cell.
3. **Handicap "main line" rule** — pick the listing's main row, OR the no-suffix
selection, OR the spread closest to bookmaker-implied probability 50/50?
4. **Total "main line" rule** — same as above.
5. **Field name capitalization** — spec uses `Bet_Match_Win_Fora_1_Value` exactly.
Recommend matching exactly (case-sensitive) for compatibility with downstream
pivot tables / scripts.
@@ -0,0 +1,347 @@
# Phase 0 Spike — Scraping Findings for marathonbet.by
**Date:** 2026-05-05
**Probe environment:** Windows 10, Poland-routed IP (countryCode `PL` reported by site,
`isBelarus: true` flag set in `initData`, `jurisdiction: BELARUS`).
**Tooling used:** `curl` with browser User-Agent, ~10 sequential requests with
≥1-second pacing.
---
## TL;DR — Decision Matrix
| Question | Answer |
|---|---|
| Is anonymous scraping feasible? | **YES — confirmed.** Site returns full server-rendered HTML for `/su/`, `/su/live`, sport listings, and event detail pages with HTTP 200 to a plain GET with browser User-Agent. |
| Cloudflare / JS challenge? | **No.** `Server: nginx`, no `cf-ray`, no challenge cookies. Only standard JSESSIONID + analytics cookies. No reCAPTCHA on listing pages. |
| Geo-block from probe environment? | **No.** Probe was made from a non-Belarus IP; site served full HTML. The site treats us as `region:"PL"` but still serves Russian-language `/su` content. |
| Recommended scraping technology | **HttpClient + AngleSharp.** All the data needed (event list, full odds, breadcrumb taxonomy, period markets) is present in the raw SSR HTML. Playwright is not required for read-only scraping. |
| Recommended polling cadence | Pre-match: **30 seconds** (default in `appsettings`). Live: the site's native 3-second cadence is too aggressive; recommend **5–10 seconds** for our analyzer (anomaly detection doesn't need sub-second resolution). |
| WebSocket / API alternative? | STOMP-over-WebSocket exists at `/su/websocket/endpoint` for authenticated clients. Anonymous clients should stick to plain HTML scraping. The JSONP endpoint at `/su/liveupdate/popular/` only returns refresh-page signals, not full odds. |
---
## 1. Probe Outcomes
### 1.1 Pre-match landing — `https://www.marathonbet.by/su`
```
HTTP/1.1 200 OK
Server: nginx
Content-Type: text/html;charset=UTF-8
Set-Cookie: visitedNavBarItems=HOME; HttpOnly; SameSite=None; Secure
Set-Cookie: lastSitePart=SPORT; ...
Set-Cookie: puid=rBWP3Wn5...; expires=2037; domain=.marathonbet.by
Strict-Transport-Security: max-age=31536000
Cache-Status: MISS
Cache-Control: no-store, no-cache, must-revalidate
```
- **Render type:** Server-Side Rendered (SSR). Body is ~590 KB of HTML containing
the full event grid for live + popular pre-match events. There IS a `<div id="app">`
wrapper but the content inside is fully populated server-side; the JS layer enhances
rather than hydrates from empty.
- **Rich data attributes embedded:**
- `data-event-eventId="<bookmakerEventCode>"` — bookmaker's stable numeric event ID
- `data-event-treeId="<treeId>"` — tree position ID (used in URLs)
- `data-event-name="..."` — event display name
- `data-event-path="<sport>/<league-path>/<teams> - <treeId>"` — URL fragment to
construct event detail link
- `data-live="true|false"` — live vs pre-match flag
- `data-sport-treeId="<sportId>"` — sport identifier (matches customer's "Sport_Code")
- `data-coeff-uuid` + `data-sel='{...}'` JSON — selection metadata (ewc, cid, prt, epr)
- `data-selection-key="<eventId>@<MarketType>[N].<Outcome>"` — canonical bet identifier
- **Embedded `initData` JSON blob** (line 6 of every page) exposes runtime config:
- `serverTime: "2026,05,05,00,43,28"` (Moscow TZ)
- `liveUpdatePath: "/su/liveupdate/popular/"`
- `liveUpdateTransport: "JSONP"`
- `update_interval: 3000` (ms — live update polling cadence used by the site itself)
- `stomp.url: "/su/websocket/endpoint"` (authenticated stream)
- `region`, `isBelarus`, `jurisdiction`, `currencyCode` — geo/legal flags
- `treeIds` — for the event detail page, holds the focal treeId
### 1.2 Live landing — `https://www.marathonbet.by/su/live`
- HTTP 200, ~250 KB body — same `nginx` server, same SSR pattern.
- Same `data-event-*` attributes as pre-match. Live events show `data-live="true"`,
with extra `score-state` and `time` markers (e.g., `2:1 (1:1)`, `83:30`).
- The site polls `/su/liveupdate/popular/?treeIds=...` every 3 s but the response
is just a refresh signal (`{"modified":[{"type":"refreshPage"}],"updated":...}`):
**the site relies on full HTML re-fetch for live updates**, which is good for us
(no separate JSON contract to track).
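If the live-update endpoint is used as a change hint, detecting the refresh signal takes only a few lines of `System.Text.Json`. The sketch assumes the body is plain JSON shaped like the sample above (a JSONP wrapper, if the site sends one, would have to be stripped first); `LiveUpdateSignal` is a hypothetical name.

```csharp
using System;
using System.Text.Json;

public static class LiveUpdateSignal
{
    // True when the payload's "modified" array contains {"type":"refreshPage"}.
    // Malformed or unexpected payloads return false instead of throwing.
    public static bool RequestsPageRefresh(string json)
    {
        try
        {
            using var doc = JsonDocument.Parse(json);
            var root = doc.RootElement;

            if (root.ValueKind != JsonValueKind.Object ||
                !root.TryGetProperty("modified", out var modified) ||
                modified.ValueKind != JsonValueKind.Array)
            {
                return false;
            }

            foreach (var item in modified.EnumerateArray())
            {
                if (item.ValueKind == JsonValueKind.Object &&
                    item.TryGetProperty("type", out var type) &&
                    type.ValueKind == JsonValueKind.String &&
                    type.GetString() == "refreshPage")
                {
                    return true;
                }
            }
            return false;
        }
        catch (JsonException)
        {
            return false; // not JSON at all
        }
    }
}
```

Returning `false` on unparseable input is a design choice for this sketch; a scraper that prefers to err on the side of freshness could treat unknown payloads as a refresh instead.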
### 1.3 Sport-specific listing — `/su/popular/Basketball` / `/su/betting/Basketball+-+6`
- HTTP 200, ~470 KB.
- Lists all current basketball categories (NBA Playoffs etc.) with full odds.
- URL by name (`Basketball`) and URL by sport tree ID (`Basketball+-+6`) both work.
- Date display: events on the same day show **time only** (`03:00`); events on
later days show **`DD <month-ru> HH:MM`** (e.g., `06 мая 02:00`). The "today"
anchor is implicit — must be derived from `initData.serverTime`.
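The date rule in the last bullet could be implemented along these lines. It is a sketch with several assumptions that Phase 3 should verify: the site clock is treated as a fixed UTC+3 offset, month names are taken to be Russian genitive forms as in `06 мая`, and a date that falls more than a day before server time is assumed to belong to the next year. `EventTimeParser` is a hypothetical name, and malformed input throws `FormatException`, so callers should catch it.

```csharp
using System;
using System.Globalization;

public static class EventTimeParser
{
    // Assumption: site times are UTC+3 (the spike reports serverTime as Moscow TZ).
    private static readonly TimeSpan SiteOffset = TimeSpan.FromHours(3);

    // Russian month names in genitive case, as they appear after a day number.
    private static readonly string[] MonthsRu =
    {
        "января", "февраля", "марта", "апреля", "мая", "июня",
        "июля", "августа", "сентября", "октября", "ноября", "декабря"
    };

    // Parses initData.serverTime, e.g. "2026,05,05,00,43,28".
    public static DateTimeOffset ParseServerTime(string serverTime)
    {
        int[] p = Array.ConvertAll(serverTime.Split(','),
            s => int.Parse(s, CultureInfo.InvariantCulture));
        return new DateTimeOffset(p[0], p[1], p[2], p[3], p[4], p[5], SiteOffset);
    }

    // Resolves "03:00" (same day) or "06 мая 02:00" (later day) against server time.
    public static DateTimeOffset ResolveEventTime(string display, DateTimeOffset serverNow)
    {
        string[] parts = display.Trim().Split(' ', StringSplitOptions.RemoveEmptyEntries);
        TimeSpan time = TimeSpan.ParseExact(parts[^1], @"hh\:mm", CultureInfo.InvariantCulture);

        if (parts.Length == 1)
        {
            // Time only: the event is on the server's current date.
            return new DateTimeOffset(serverNow.Year, serverNow.Month, serverNow.Day,
                                      time.Hours, time.Minutes, 0, SiteOffset);
        }

        int day = int.Parse(parts[0], CultureInfo.InvariantCulture);
        int month = Array.IndexOf(MonthsRu, parts[1].ToLowerInvariant()) + 1;
        if (month == 0)
            throw new FormatException($"Unknown month name: {parts[1]}");

        var candidate = new DateTimeOffset(serverNow.Year, month, day,
                                           time.Hours, time.Minutes, 0, SiteOffset);

        // Assumed year rollover: a date well in the past means next year.
        return candidate < serverNow.AddDays(-1) ? candidate.AddYears(1) : candidate;
    }
}
```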
### 1.4 Event detail — `/su/betting/<event-path>`
- HTTP 200, ~500 KB to ~1.6 MB depending on market count.
- URL pattern: `/su/betting/<Sport>/<League+Path>/<Sub+Stage>/<Team1+vs+Team2+-+<treeId>>`.
- Exposes ~140–250 unique market types per event. Each market is a `<div>` containing
a labeled `<table>` of selections with `data-selection-key`, prices, and handicap/total
values in `<span class="middle-simple">`.
- **Schema.org breadcrumb** at the bottom of the page provides clean taxonomy:
Sport → Country/Group → League → Stage → Event. Each level has its own treeId visible
in `href="/su/betting/<path>+-+<treeId>"`.
- Sample (Football, Arsenal vs Atletico Madrid, treeId 28089645, eventId 26456117):
- Sport = `Football+-+11`, Country group = `Clubs.+International+-+4409575`,
League = `UEFA+Champions+League+-+21255`, Stage = `Play-Offs / Semi+Final / 2nd+Leg`.
- Match-level markets: `Match_Result.{1,draw,3}`, `To_Win_Match_With_Handicap{N}.{HB_H,HB_A}`,
`Total_Goals{N}.{Under_X,Over_X}`.
### 1.5 Results / archive — **NOT publicly available**
- `https://www.marathonbet.by/su/results` → **HTTP 404**.
- `https://www.marathonbet.by/su/results/` → **HTTP 404**.
- `https://www.marathonbet.by/su/results.htm` → **HTTP 404**.
- No `/results`, `/archive`, or `/history` link anywhere in the public landing-page HTML.
- The `eventJsonInfo` `<td>` on each event has a `matchIsComplete` boolean and a
`resultDescription` (e.g., `"2:1 (1:1)"`), so **final scores can be captured by
re-scraping the event detail page after match end** — but only while the event is
still hosted (likely a few hours / days post-match). After cleanup, results are gone.
- **Implication for Phase 8 (Results loader):** results must be harvested by
continuing to poll the event detail page until `matchIsComplete=true`, then storing
the final score. There is no historical archive endpoint to back-fill from. We
should also evaluate scraping a third-party results aggregator
(flashscore, livescore, sofascore) as a fallback — that's a Phase 8 design decision.
---
## 2. Anti-bot Posture
| Signal | Observation |
|---|---|
| Cloudflare | Absent. `Server: nginx`, no `cf-*` headers. |
| reCAPTCHA / hCAPTCHA | Not on public listing or event pages (only on `/captchaData.htm` for login). |
| User-Agent filtering | A browser UA returns 200. We did not test with `curl/8.x` or empty UA — recommend always sending a real UA. |
| Cookie requirement | None for read-only access. The site sets `puid`, `JSESSIONID`, `lastSitePart`, etc., but we observed full HTML on the very first request without prior cookies. |
| IP rate-limit | 5 sequential requests at ~1s pacing all returned 200 in <1 s. No throttling observed within our budget (10 total requests). The customer should test heavier loads from their environment. |
| Geo-block | Probe environment is geo-routed as Poland; site still serves `/su` Russian content. Customer (Belarus) should see same or better access. |
| Fingerprinting | Standard analytics (GTM, dataLayer); no JS-fingerprint cookies or canvas hashing detected in the entry-page payload. |
**Mitigations to bake into the scraper anyway** (defense-in-depth):
- **Rotate User-Agents** from a small pool of recent Chrome/Firefox/Edge versions
(configurable via `Scraping:UserAgents[]`).
- **Polite pacing:** default `Scraping:RateLimit:RequestsPerSecond = 1`,
`MaxConcurrentRequests = 4`. Per-host token-bucket rate limiter using Polly v8 +
`Microsoft.Extensions.Http.Resilience`.
- **Honor `Cache-Control: no-store`** — do NOT cache responses; that's the site's intent.
- **Handle 403 / 429 / 503** with exponential backoff and circuit breaker; alert the user
when circuit opens for >5 minutes.
- **Cookie jar per scraper instance** — accept set-cookies and replay them. This avoids
a session-creation latency on every request.
- **Belarus-specific:** if customer's environment ever sees a `/forbidden` redirect,
we fall back to the `afterForbiddenRedirectUrl` documented in `initData`.
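The backoff item in the list above can be illustrated with BCL types only. In the real scraper the Polly v8 pipeline would own retries and circuit breaking; this sketch just shows the intended arithmetic (exponential growth, capped, with full jitter) and which status codes count as retryable. `RetryPolicySketch` and its members are hypothetical names.

```csharp
using System;
using System.Net;

public static class RetryPolicySketch
{
    // Status codes the mitigation list treats as "back off and retry".
    public static bool IsRetryable(HttpStatusCode status) =>
        status == HttpStatusCode.Forbidden             // 403
        || status == HttpStatusCode.TooManyRequests    // 429
        || status == HttpStatusCode.ServiceUnavailable; // 503

    // Full-jitter exponential backoff: a random delay in
    // [0, min(maxDelay, baseDelay * 2^(attempt-1))]. attempt starts at 1.
    public static TimeSpan DelayFor(int attempt, TimeSpan baseDelay, TimeSpan maxDelay, Random rng)
    {
        if (attempt < 1)
            throw new ArgumentOutOfRangeException(nameof(attempt), "attempt starts at 1");

        double ceilingMs = Math.Min(
            maxDelay.TotalMilliseconds,
            baseDelay.TotalMilliseconds * Math.Pow(2, attempt - 1));

        return TimeSpan.FromMilliseconds(rng.NextDouble() * ceilingMs);
    }
}
```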
---
## 3. URL Templates Phase 3 Will Use
| Purpose | Template | Notes |
|---|---|---|
| Pre-match top page | `https://www.marathonbet.by/su/` | Mixed live + popular pre-match. Use only for landing/health-check. |
| Live top page | `https://www.marathonbet.by/su/live` | Mixed sports. Use for live-event discovery. |
| Live popular | `https://www.marathonbet.by/su/live/popular` | Same data as `/su/live`. |
| All-events index | `https://www.marathonbet.by/su/all-events/` | Long full list; use for discovery seed. |
| Sport listing (by ID) | `https://www.marathonbet.by/su/betting/{Sport}+-+{sportId}` | e.g., `/su/betting/Basketball+-+6`. **Preferred** because sport-id stable. |
| Sport listing (by name) | `https://www.marathonbet.by/su/popular/{Sport}` | e.g., `/su/popular/Basketball`. Convenient for humans. |
| Category / league listing | `https://www.marathonbet.by/su/betting/{Sport}/{League+Path}+-+{categoryTreeId}` | From breadcrumbs / `category-label-link`. |
| Event detail | `https://www.marathonbet.by/su/betting/{event-path}` | `event-path` from `data-event-path`, ends in `-+{treeId}`. |
| Live update signal | `https://www.marathonbet.by/su/liveupdate/popular/?treeIds={csv}` | Returns `{"modified":[...],"updated":<ts>}`. Use only as "hey something changed" hint; full odds still come from event-detail re-fetch. |
| Server time sync | `https://www.marathonbet.by/su/stateless/synctime` | Use to anchor "today" date interpretation. |
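URL paths use `+` for spaces, `%2C` for `,`, etc. This is form-style encoding (what `WebUtility.UrlEncode` produces when applied per path segment); note that `Uri.EscapeDataString` would emit `%20` for spaces instead.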
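A sketch of how the templates might be filled in. `Uri.EscapeDataString` encodes a space as `%20`, so the example uses `WebUtility.UrlEncode` on each path segment to get the `+` form seen in the site's links. It assumes `data-event-path` arrives unencoded (with literal spaces); if the attribute is already encoded, it should be used as-is. `MarathonUrls` is a hypothetical name.

```csharp
using System;
using System.Linq;
using System.Net;

public static class MarathonUrls
{
    private const string BettingRoot = "https://www.marathonbet.by/su/betting/";

    // Encodes each path segment separately so the '/' separators survive.
    public static string EventDetail(string eventPath)
    {
        var segments = eventPath
            .Split('/', StringSplitOptions.RemoveEmptyEntries)
            .Select(segment => WebUtility.UrlEncode(segment));
        return BettingRoot + string.Join("/", segments);
    }

    // Sport listing by ID, e.g. ("Basketball", 6) gives ".../Basketball+-+6".
    public static string SportListing(string sportName, int sportId) =>
        BettingRoot + WebUtility.UrlEncode($"{sportName} - {sportId}");
}
```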
---
## 4. Sport ID Inventory (observed)
From the pre-match landing page (`data-sport-treeId` attributes + `category-label`
breadcrumb hrefs):
| Sport ID | Russian name | English path |
|---|---|---|
| **6** | Баскетбол | `Basketball` |
| **11** | Футбол | `Football` |
| **537** | (TBD — verify on populated day) | — |
| **2398** | (TBD) | — |
| **22723** | Теннис | `Tennis` |
| **26418** | Футбол (alt? duplicate live) | `Football` |
| **43658** | Хоккей | `Hockey` |
| **45356** | Баскетбол (live tree) | `Basketball` |
| **139722** | Гандбол | `Handball` |
| **414329** | Настольный теннис | `Table+Tennis` |
| **1372932** | Киберспорт | `Esports` |
| **3083982** | Лотереи | `Lotteries` |
| **11308234** | Шорт хоккей | `Short+Hockey` |
| **23054364** | Кибербаскетбол | `eBasketball` |
| **23054392** | Киберфутбол | `eFootball` |
**Important observation:** the site has **two parallel tree IDs per sport** — one
"canonical" (e.g., `6` for Basketball) used on event-detail breadcrumb, and a
"category" tree ID (e.g., `45356`) used inside the live grouping. Phase 1 domain
needs to recognize the canonical ID as `SportCode` and ignore the category tree ID.
The customer-spec field `Sport_Code = 6` for Basketball matches the canonical ID
in `data-sport-treeId="6"` and in the breadcrumb URL `/su/betting/Basketball+-+6`.
---
## 5. Bet Selection Naming Convention
Format: `{eventId}@{MarketName}{LineIndex?}.{Outcome}`
Where:
- `eventId` = bookmaker's `data-event-eventId` (numeric, ~26-million range, stable).
- `MarketName` = `Match_Result`, `To_Win_Match_With_Handicap`, `Total_Points`,
`1st_Half_Result`, `To_Win_1st_Half_With_Handicap`, `1st_Set_Total_Games`, etc.
- `LineIndex?` = optional integer suffix when a market has multiple lines/spreads
(e.g., `Total_Points10`, `Total_Points11` are different total thresholds for the
same event). Empty / `0` is the "main" line.
- `Outcome` codes:
- `1`, `draw`, `3`: 3-way match result markets
- `RN_H`, `RN_D`, `RN_A`: home/draw/away on period (half/quarter/set) result markets (see `SCHEMA_DRAFT.md` §3.1)
- `HB_H`, `HB_A`: handicap home/away
- `Under_<X>`, `Over_<X>`: total under/over (X is the threshold, embedded in name)
- `HD`, `AD`: half-time/full-time draw combinations
- `yes` / `no`: yes/no markets
The handicap value (`+1.0`, `-2.5`) and total threshold (`213.5`) are not exposed as a
separate numeric field of the selection key. They live in the `<span class="middle-simple">`
display element; the total threshold is additionally embedded in the outcome name
(e.g., `Under_213.5`).
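A regex-based sketch of splitting a selection key into its parts. It treats trailing digits on the market token as the line index, which matches the examples in this section but is an assumption for markets whose names might legitimately end in a digit. `SelectionKey` and `SelectionKeyParser` are hypothetical names.

```csharp
#nullable enable
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public sealed record SelectionKey(long EventId, string Market, int? LineIndex, string Outcome)
{
    // Threshold embedded in Under_<X> / Over_<X> outcomes; null for other outcomes.
    public decimal? TotalThreshold()
    {
        int idx = Outcome.IndexOf('_');
        if (idx < 0) return null;

        string head = Outcome.Substring(0, idx);
        if (head != "Under" && head != "Over") return null;

        return decimal.TryParse(Outcome.Substring(idx + 1), NumberStyles.Float,
                                CultureInfo.InvariantCulture, out var value)
            ? value
            : (decimal?)null;
    }
}

public static class SelectionKeyParser
{
    // {eventId}@{MarketName}{LineIndex?}.{Outcome}; the outcome may itself contain a dot.
    private static readonly Regex Pattern = new Regex(
        @"^(?<event>\d+)@(?<market>[^.]+?)(?<line>\d*)\.(?<outcome>.+)$",
        RegexOptions.CultureInvariant);

    // Returns null instead of throwing when the key does not match the format.
    public static SelectionKey? TryParse(string key)
    {
        var m = Pattern.Match(key);
        if (!m.Success) return null;

        if (!long.TryParse(m.Groups["event"].Value, NumberStyles.None,
                           CultureInfo.InvariantCulture, out long eventId))
            return null;

        int? line = null;
        if (m.Groups["line"].Length > 0)
        {
            if (!int.TryParse(m.Groups["line"].Value, NumberStyles.None,
                              CultureInfo.InvariantCulture, out int parsed))
                return null;
            line = parsed;
        }

        return new SelectionKey(eventId, m.Groups["market"].Value, line, m.Groups["outcome"].Value);
    }
}
```

Handicap values are not recoverable this way, since `HB_H` / `HB_A` carry no number; those still have to be read from the `.middle-simple` element.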
---
## 6. Period Scope per Sport (observed)
| Sport | Period scopes available | Spec field prefix |
|---|---|---|
| Football (11) | 1st Half, 2nd Half | `Bet_Period-1_*`, `Bet_Period-2_*` |
| Basketball (6) | 1st/2nd Half, 1st/2nd/3rd/4th Quarter | Customer must clarify whether Period-N maps to halves or quarters. **Recommend halves** as default (Period-1, Period-2) with an `appsettings` toggle for quarter-mode. |
| Tennis (22723) | 1st Set, 2nd Set, ... (variable count) | `Bet_Period-1_*` = 1st Set, etc. **No Draw outcome.** |
| Hockey (43658) | 1st/2nd/3rd Period | `Bet_Period-1_*`, `Bet_Period-2_*`, `Bet_Period-3_*` (not yet sampled — revalidate in Phase 3). |
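The internal market-name token is sport-dependent (base tokens shown; `SCHEMA_DRAFT.md` §3.1 records the exact observed result-market forms, e.g. football's `Result_-_1st_Half` and the `0` suffix on other sports):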
- `1st_Half_Result`, `To_Win_1st_Half_With_Handicap`
- `1st_Quarter_Result`, `To_Win_1st_Quarter_With_Handicap`
- `1st_Set_Result`, `To_Win_1st_Set_With_Handicap`
**Phase 3 should encapsulate this** in a sport-aware mapping table
(`PeriodScopeMapper`) keyed on `SportCode`, returning the set of expected period
markets and their token names.
---
## 7. Open Questions / Risks
1. **Results storage cleanup:** how long does marathonbet keep finished events on
the event detail URL? Must be empirically tested over Phase 8. Recommend retaining
our own snapshot with `matchIsComplete=true` permanently in SQLite as soon as
we observe it, so we never depend on the site for historical data.
2. **Sport ID duplication** (e.g., `26418` and `11` both = Football):
verify with customer that we should use the canonical breadcrumb ID. The
"category" trees may exist for live grouping or alphabetization purposes.
3. **Localization:** site labels are Russian on `/su/`. There appears to be `/en/`
path support (untested). Customer wants RU + EN — Phase 5 must verify EN locale
page parses identically.
4. **Period total markets in basketball:** sampled NBA event did NOT explicitly
expose "Total points 1st quarter" as a clean market in the public HTML — only
`AllInningsGoalsOver` (combined). Customer's spec implies `Bet_Period-N_Total_*`
is universal — Phase 3 must gracefully degrade and emit `null` rates for fields
the site doesn't surface for that sport+league.
5. **Belarus geo-restriction risk:** we tested from non-BY. If customer's BY IP
gets a different page (KYC overlay, deposit prompt, etc.), the parser must be
robust to unexpected wrapping. Defensive parsing only — never assume strict
structure.
6. **`isLogged: false` overlay risk:** initData reports we are anonymous. Some
markets may be hidden behind login (we did not detect any in samples, but the
parser should treat missing markets as `null`, not throw).
---
## 8. Recommended Phase 3 Architecture
```
IOddsScraper (Application)
└── MarathonBetScraper : IOddsScraper (Infrastructure)
├── HttpClient (resilient via Polly v8)
│ ├── User-Agent rotator
│ ├── Token-bucket rate limiter (config: RequestsPerSecond)
│ ├── Retry policy (3x exponential backoff, jitter)
│ └── Circuit breaker (open after N consecutive 5xx)
├── EventDiscoveryParser ← parses /su/, /su/live, /su/popular/{sport}
│ produces List<EventListItem>
├── EventDetailParser ← parses /su/betting/<path>
│ produces FullOddsSnapshot with all markets
├── BreadcrumbParser ← extracts Sport / Country / League / Stage taxonomy
└── BetMarketMapper ← AngleSharp QuerySelector → spec field name
(sport-aware; uses PeriodScopeMapper)
```
**Use AngleSharp for parsing** — it handles malformed HTML well, has a CSS-selector
API, and is the established `.NET` choice. JSON islands inside attributes (`data-sel`,
`data-json`) decode cleanly with `System.Text.Json`.
**No Playwright required** for the scraper. Keep Playwright as a documented
fallback in `appsettings` (`Scraping:UsePlaywright = false`) so we can flip it on
later if the site adds JS challenges. This adds <100 LOC of optional code, costs
nothing if unused.
---
## 9. Customer Validation Plan
If our environment ever stops working (geo-block, IP ban, etc.) the customer in
Belarus can:
1. Open https://www.marathonbet.by/su in a browser, verify it renders.
2. View page source (Ctrl+U), search for `data-event-eventId` — confirm same
structure as our captured `spike/captures/pre-match-landing.html`.
3. Save the HTML and email it to dev — the parser is environment-agnostic and
should handle their captured HTML byte-for-byte.
This decouples scraper development from probe environment and makes Phase 3
testable offline.
---
## 10. Captured Samples (gitignored, local only)
| File | Purpose |
|---|---|
| `spike/captures/pre-match-landing.html` | `/su/` snapshot, 587 KB, full grid |
| `spike/captures/live-landing.html` | `/su/live` snapshot, 250 KB |
| `spike/captures/basketball-listing.html` | `/su/popular/Basketball`, 471 KB |
| `spike/captures/event-basketball-28405506.html` | NBA Knicks vs 76ers full event, 505 KB |
| `spike/captures/event-football-28089645.html` | UCL Arsenal vs Atletico full event, 1.58 MB |
| `spike/captures/event-tennis-28430484.html` | ATP Rome qualif full event, 244 KB |
| `spike/captures/liveupdate-popular.json` | Live-update API sample response |
| `spike/captures/results-page.html` | `/su/results` response (~20 KB) — captured to evidence the missing public archive endpoint (Phase 8 deviation). |
These artifacts are **not committed** but should be kept locally to back parser unit
tests in Phase 3.
> **Caveats on captures:**
>
> - `live-landing.html` was captured at a moment when no live events were
> in-progress for popular sports. As a result, the `.score-state` element
> referenced in `SCHEMA_DRAFT.md` §1 is NOT present in this particular capture.
> Phase 3 should re-verify the score selector against a live event during
> parser implementation (the selector itself is well-known across bookmaker
> sites and not in doubt).
> - Hockey events were not sampled directly. Period-result selection key tokens
> for hockey (`1st_Period_Result0.RN_H` etc.) are extrapolated from the
> football/basketball/tennis pattern and marked TBD in `SCHEMA_DRAFT.md`. Phase 3
> must verify against a real hockey event before relying on those tokens.