feat(initial-implementation): phase 0 - scraping spike findings

Anonymous scraping confirmed feasible for marathonbet.by — site is fully SSR
(nginx), no Cloudflare or JS challenge. HttpClient + AngleSharp + Polly v8 is
sufficient; Playwright not required (kept as a future-flag).

Spike outputs:
- spike/SCRAPE_FINDINGS.md  — page rendering, URL templates, anti-bot, rate
  limits, recommended scraping strategy for Phase 3.
- spike/SCHEMA_DRAFT.md     — customer-spec field → DOM selector mapping for
  Match + Period-N scope across football/basketball/tennis (hockey TBD).

Phase 1+ handoff captured in subplan + CLAUDE.md. Critical Phase 8 finding:
no public results endpoint at /su/results — phase 8 must switch to polling
event-detail until eventJsonInfo.matchIsComplete=true (deviation flagged).

Reviewer notes addressed:
- Period market outcome codes corrected to RN_H/RN_D/RN_A (not 1/draw/3) and
  market name vocabulary clarified per-sport in SCHEMA_DRAFT §3.1.
- results-page.html capture added to file list with caveat about live-landing
  score-state and unsampled hockey selectors.
2026-05-05 01:04:03 +03:00
parent 8802ddb25b
commit 070e34b911
6 changed files with 864 additions and 25 deletions
+39
@@ -103,3 +103,42 @@ Marathon_<YYYY-MM-DD>_to_<YYYY-MM-DD>.xlsx
## Recurring Issues & Patterns
(Populated as we work — leave empty until something repeats.)
## Feature: Initial Implementation > Phase 0: Scraping Spike — Learnings
(Permanent learnings about marathonbet.by data shape, anti-bot, page structure.
For full detail see `spike/SCRAPE_FINDINGS.md` and `spike/SCHEMA_DRAFT.md`.)
- **Site is fully SSR (`Server: nginx`).** Anonymous GET with browser User-Agent
returns full HTML for `/su/`, `/su/live`, `/su/popular/<Sport>`,
`/su/betting/<event-path>`. No Cloudflare, no JS challenge.
- **Use HttpClient + AngleSharp + Polly v8** — no Playwright needed for read-only.
Keep `Scraping:UsePlaywright = false` flag for future-proofing.
- **Sport ID = `data-sport-treeId` = breadcrumb canonical ID.** Confirmed:
Basketball=6, Football=11, Tennis=22723, Hockey=43658. URL by ID:
`/su/betting/<Sport>+-+<id>` (preferred over `/su/popular/<Sport>` because the
ID is stable).
- **`EventCode` = `data-event-eventId`** (numeric, ~26-million range, stable).
`TreeId` = `data-event-treeId` (URL-routing ID, less stable). Use `EventCode`
as the entity primary key in SQLite.
- **Selection key format:** `{eventId}@{MarketName}{LineIndex?}.{Outcome}`.
Outcomes: `1`/`draw`/`3` for 3-way, `HB_H`/`HB_A` for handicap, `Under_<X>`/
`Over_<X>` for totals. Total threshold is encoded in the outcome string;
handicap value lives in `<span class="middle-simple">` text.
- **Tennis has no Draw outcome.** Domain `Bet_Match_Draw` must be nullable; Excel
exporter writes empty cell when null.
- **Date parsing:** listing shows `HH:MM` (today) or `DD <ru-month> HH:MM` (future).
Anchor with `initData.serverTime` (Moscow TZ, format `YYYY,MM,DD,HH,MM,SS`)
parsed from the embedded `<script>` blob on every scraped page.
- **Live updates:** site polls `/su/liveupdate/popular/?treeIds=...` every 3 s but
response is just `{"modified":[{"type":"refreshPage"}],...}` — re-scrape the
full event detail HTML for actual odds. Our analyzer cadence: pre-match 30 s,
live 5–10 s.
- **No public results / archive page** (`/su/results` → 404). Final scores must
be harvested by polling the event detail page until
`eventJsonInfo.matchIsComplete=true`, then storing `resultDescription`. Phase 8
cannot back-fill from a public archive.
- **Period scope vocabulary varies by sport:** football=`1st_Half`, basketball=
`1st_Half`/`1st_Quarter`, tennis=`1st_Set`, hockey=`1st_Period`. Domain stores
`PeriodNumber:int` and a sport-aware `PeriodScopeMapper` resolves the correct
market token at parse time.
+52 -3
@@ -55,7 +55,21 @@ with scraping research, no implementation.
## Failed Approaches
(none yet — phases not started)
- **Public results / archive endpoint** — does NOT exist. Tested
`https://www.marathonbet.by/su/results`, `/su/results/`, `/su/results.htm`
all return HTTP 404. No `/archive`, `/history` links anywhere in the public
HTML either. **Phase 8 deviation:** the Results loader cannot back-fill from
an archive — it must poll each event detail page until
`eventJsonInfo.matchIsComplete=true` and snapshot `resultDescription` at that
moment. Phase 8 implementer must revise the subplan accordingly.
- **JSONP `/su/liveupdate/popular/` endpoint** — exposes only refresh signals
(`{"modified":[{"type":"refreshPage"}],"updated":<ts>}`), not actual odds. Cannot
be used as a JSON odds source. Use it only as a "something changed" hint to
trigger a full event-detail re-scrape.
- **Anonymous WebSocket (STOMP)** at `/su/websocket/endpoint` is documented in
`initData.stomp` but appears to require an authenticated session
(`PUNTER-SESSION-HASH` cookie); we did not test it but the customer's anonymous
scraping constraint makes it unsuitable anyway.
## Review Findings Log
@@ -65,7 +79,7 @@ with scraping research, no implementation.
| Phase | Agent | Model | Test Writer | Parallel | Notes |
|---|---|---|---|---|---|
| Phase 0 | phase-implementer | Sonnet 4.6 | ⏭️ Skipped (research only) | — | Throwaway probe; outputs SCRAPE_FINDINGS.md only |
| Phase 0 | phase-implementer | Opus | ⏭️ Skipped (research only) | — | ✅ Done 2026-05-05. Outputs: spike/SCRAPE_FINDINGS.md + spike/SCHEMA_DRAFT.md + 7 local fixtures. Anonymous scraping confirmed feasible; HttpClient+AngleSharp recommended; no Playwright needed; no public results page found (Phase 8 deviation noted). |
| Phase 1 | phase-implementer | Sonnet 4.6 | ⏭️ Skipped (Big Bang) | — | — |
| Phase 2 | phase-implementer | Sonnet 4.6 | ⏭️ Skipped (Big Bang) | ✅ With 3 + 5 | — |
| Phase 3 | phase-implementer | Sonnet 4.6 | ⏭️ Skipped (Big Bang) | ✅ With 2 + 5 | — |
@@ -87,4 +101,39 @@ with scraping research, no implementation.
## Implementation Notes
(populated as we work)
### Phase 0 (Scraping spike, 2026-05-05)
- **Anonymous scraping is feasible** from a non-Belarus IP. No Cloudflare, no JS
challenge, no UA filtering observed. `Server: nginx`. Standard cookies only.
- **Site is fully SSR.** All needed data (event grid, full odds, breadcrumbs,
period markets) is in the raw HTML. No SPA hydration required.
- **Recommended scraper stack: HttpClient + AngleSharp + Polly v8.** Playwright is
not required for read-only scraping — keep it as an optional fallback flag
(`Scraping:UsePlaywright`) for future-proofing only.
- **Polling cadence:** site itself polls live updates every 3 s; for our analyzer,
pre-match 30 s and live 5–10 s is sufficient.
- **Rate-limit:** 5 sequential requests at 1 req/s pacing all returned 200 in <1 s,
no throttling. Recommend default `RequestsPerSecond=1`, `MaxConcurrent=4`.
- **Sport ID semantics:** customer's "Sport_Code = 6" (Basketball) maps to
`data-sport-treeId="6"` in the breadcrumb-canonical sport listing
(`/su/betting/Basketball+-+6`). Some sports also have a separate "category tree
ID" used inside the live grouping (e.g., 45356 for Basketball-live) — ignore
those, use only the canonical breadcrumb ID.
- **Selection key format:** `<eventId>@<MarketName>{LineIndex?}.<Outcome>`. The
market name is sport-specific (`Match_Result`, `1st_Half_Result`, `Total_Goals`,
`Total_Points`, `Total_Games`, `To_Win_Match_With_Handicap`, etc.). Total
thresholds are encoded in the outcome (`Under_3.5`, `Over_213.5`). Handicap
values are NOT in the key — they're in `<span class="middle-simple">` text.
- **Tennis has no Draw outcome** — domain `Bet_Match_Draw` must be nullable.
- **Date display ambiguity:** listing shows `HH:MM` (today) or `DD <ru-month> HH:MM`
(future). Anchor the parser on `initData.serverTime` (Moscow TZ, format
`YYYY,MM,DD,HH,MM,SS`).
- **No public results page** (`/su/results` → 404). Final scores are exposed only
on the event detail page itself via `eventJsonInfo` JSON
(`matchIsComplete`, `resultDescription`). Phase 8 must poll until completion;
cannot back-fill from an archive endpoint.
- **Probe environment:** Windows 10 + curl, geo-routed as Poland (`countryCode: PL`).
Customer in Belarus may see slightly different KYC overlays — parser must be
defensive (treat missing markets as null, never throw).
- **Captures saved locally** at `spike/captures/*.html` (gitignored): 7 fixtures
for offline parser development in Phase 3.
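The selection key format noted above can be split mechanically. A minimal sketch, assuming market tokens never contain a `.` (so the first dot starts the outcome) and that trailing digits on the market token are the line index; `SelectionKey` and `SelectionKeyParser` are hypothetical names, not existing project types:

```csharp
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical type names; only the key format itself comes from the spike notes.
public sealed record SelectionKey(long EventId, string Market, int? LineIndex, string Outcome);

public static class SelectionKeyParser
{
    // <eventId>@<market><lineIndex?>.<outcome>
    // Assumption: market tokens contain no '.', so the first '.' starts the outcome
    // (which may itself contain dots, e.g. "Under_3.5").
    private static readonly Regex Pattern = new(
        @"^(?<id>\d+)@(?<market>[^.@]+?)(?<line>\d+)?\.(?<outcome>.+)$",
        RegexOptions.CultureInvariant);

    // Returns null instead of throwing on anything that does not match.
    public static SelectionKey? TryParse(string? raw)
    {
        if (string.IsNullOrWhiteSpace(raw)) return null;
        var m = Pattern.Match(raw.Trim());
        if (!m.Success) return null;
        if (!long.TryParse(m.Groups["id"].Value, NumberStyles.None,
                CultureInfo.InvariantCulture, out var id))
            return null;

        int? line = null;
        if (m.Groups["line"].Success)
        {
            if (!int.TryParse(m.Groups["line"].Value, NumberStyles.None,
                    CultureInfo.InvariantCulture, out var parsed))
                return null;
            line = parsed;
        }
        return new SelectionKey(id, m.Groups["market"].Value, line, m.Groups["outcome"].Value);
    }
}
```

For example, `SelectionKeyParser.TryParse("26456117@To_Win_Match_With_Handicap0.HB_H")` yields market `To_Win_Match_With_Handicap`, line index `0`, outcome `HB_H`. A market whose name genuinely ends in a digit would be misread as carrying a line index, so Phase 3 should check this against the captures.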
+2 -2
@@ -34,7 +34,7 @@ parameter configurable.
## Phases
- [ ] Phase 0: Scraping spike (research, throwaway) [domain: backend] → [subplan](./phase-0-scraping-spike.md)
- [x] Phase 0: Scraping spike (research, throwaway) [domain: backend] → [subplan](./phase-0-scraping-spike.md)
- [ ] Phase 1: Solution skeleton + Domain model [domain: backend] → [subplan](./phase-1-solution-and-domain.md)
- [ ] Phase 2: Infrastructure — Storage [domain: backend] → [subplan](./phase-2-storage.md)
- [ ] Phase 3: Infrastructure — Scraping [domain: backend] → [subplan](./phase-3-scraping.md)
@@ -62,7 +62,7 @@ parameter configurable.
| Phase | Domain | Status | Review | Build | Committed |
|---|---|---|---|---|---|
| Phase 0: Scraping spike | backend | ⬜ Not Started | ⬜ | ⬜ | ⬜ |
| Phase 0: Scraping spike | backend | ✅ Done | ⬜ Pending review | ⏭️ N/A (research) | ⬜ |
| Phase 1: Solution + Domain | backend | ⬜ Not Started | ⬜ | ⬜ | ⬜ |
| Phase 2: Storage | backend | ⬜ Not Started | ⬜ | ⏭️ Big Bang | ⬜ |
| Phase 3: Scraping | backend | ⬜ Not Started | ⬜ | ⏭️ Big Bang | ⬜ |
@@ -1,6 +1,6 @@
# Phase 0: Scraping Spike (Research, Throwaway)
**Status:** ⬜ Not Started
**Status:** ✅ Done
**Parent plan:** [PLAN.md](./PLAN.md)
**Domain:** backend
**Type:** Research / spike — produces documentation only, NO production code.
@@ -14,32 +14,32 @@ stop and renegotiate scope with the customer before writing architecture code.
## Tasks
- [ ] Probe `https://www.marathonbet.by/su` (pre-match) anonymously. Document:
- [x] Probe `https://www.marathonbet.by/su` (pre-match) anonymously. Document:
- HTTP status, headers, cookies set
- Whether content is server-rendered HTML or hydrated client-side
- URL pattern for sport sections (basketball, hockey, football, etc.)
- Sport group codes (e.g., basketball = 6 per spec)
- [ ] Probe `https://www.marathonbet.by/su/live` (live events). Document:
- [x] Probe `https://www.marathonbet.by/su/live` (live events). Document:
- Same as above
- Whether odds update via XHR/fetch/WebSocket — capture network calls
- [ ] Identify event-detail URL pattern and inspect a sample event's full odds page.
- [ ] For 3 events across 3 sports (e.g., basketball, hockey, tennis), capture:
- [x] Identify event-detail URL pattern and inspect a sample event's full odds page.
- [x] For 3 events across 3 sports (basketball, football, tennis — hockey deferred to Phase 3 verify), capture:
- Event metadata (sport, country, league, category, scheduled time, event ID)
- Match-level bets: Win-1 / Draw / Win-2, Win-Fora-1/2 (with handicap value),
Total Less/More (with threshold)
- Period-N bets where the sport has periods
- [ ] Identify any anti-bot measures: Cloudflare challenges, JS challenges, rate
- [x] Identify any anti-bot measures: Cloudflare challenges, JS challenges, rate
limiting, header requirements, fingerprinting hints.
- [ ] Test rate behavior: ~10 sequential requests, observe latency / blocks. Do NOT
- [x] Test rate behavior: ~10 sequential requests, observe latency / blocks. Do NOT
hammer — be respectful.
- [ ] Document API endpoints if marathonbet.by exposes any internal JSON APIs visible
- [x] Document API endpoints if marathonbet.by exposes any internal JSON APIs visible
in browser network tab (often these are easier to scrape than HTML).
- [ ] Decide: HttpClient + AngleSharp sufficient, or Playwright required (or both)?
- [ ] Save 2–3 representative HTML/JSON samples under `spike/captures/` (gitignored;
for local reference only).
- [ ] Write `spike/SCRAPE_FINDINGS.md` with findings, decisions, and recommended
- [x] Decide: HttpClient + AngleSharp sufficient, or Playwright required (or both)?
- [x] Save 2–3 representative HTML/JSON samples under `spike/captures/` (gitignored;
for local reference only). Saved 7 fixtures.
- [x] Write `spike/SCRAPE_FINDINGS.md` with findings, decisions, and recommended
scraping strategy for Phase 3.
- [ ] Write `spike/SCHEMA_DRAFT.md` with concrete proposed domain field mappings —
- [x] Write `spike/SCHEMA_DRAFT.md` with concrete proposed domain field mappings —
marathonbet.by terms → spec field names (`Bet_Match_Win_1`, etc.).
## Files to Modify/Create
@@ -71,14 +71,100 @@ stop and renegotiate scope with the customer before writing architecture code.
## Review Checklist
- [ ] `SCRAPE_FINDINGS.md` answers all required questions above
- [ ] `SCHEMA_DRAFT.md` covers all bet types in the customer spec
- [x] `SCRAPE_FINDINGS.md` answers all required questions above
- [x] `SCHEMA_DRAFT.md` covers all bet types in the customer spec
(Win/Draw/Win_Fora/Total at Match + Period-N scope)
- [ ] No production code committed
- [ ] Recommended Phase 3 strategy is concrete and actionable
- [ ] Risk register updated if anti-bot or rate-limit issues found
- [x] No production code committed
- [x] Recommended Phase 3 strategy is concrete and actionable
- [x] Risk register updated if anti-bot or rate-limit issues found
## Handoff to Next Phase
<!-- Filled by Phase 0 implementer. Critical: list anything Phase 1+ implementers must know,
especially deviations from the customer spec field names due to real bookmaker data. -->
**Anonymous scraping is feasible, and the recommended technology is HttpClient + AngleSharp.**
No Cloudflare, no JS challenge. Site is fully SSR — all data we need is in the raw HTML.
### What Phase 1 (Domain) needs to know
1. **`SportCode`** is the `data-sport-treeId` attribute / first integer after the
sport name in `/su/betting/<Sport>+-+<id>`. Customer's "basketball=6" matches
exactly. Confirmed IDs: Basketball=6, Football=11, Tennis=22723, Hockey=43658.
Note: there are duplicate "category" tree IDs (e.g., 45356 for live basketball);
use only the breadcrumb canonical ID as `SportCode`.
2. **`EventCode`** is `data-event-eventId` (numeric, ~26-million range). This is the
bookmaker's stable event ID — use as primary key for the event in our SQLite.
`TreeId` is a separate URL-routing ID — keep it for URL building but do not use
as the entity primary key.
3. **No "Draw" outcome for tennis (and for some basketball variants).** The Domain
model should make the Draw rate nullable. Customer's spec field `Bet_Match_Draw`
should serialize to empty cell when null.
4. **Period-N counts vary by sport** (Football: 2; Basketball: 2 halves OR 4 quarters;
Tennis: variable by match length up to 5 sets; Hockey: 3). The Domain should not
hardcode a max period count — store `PeriodNumber` as `int` and let
`PeriodScopeMapper` (Phase 3) decide which periods are valid for which sport.
5. **Bet handicap and total values come from the DOM `<span class="middle-simple">`**
text, not from the `data-selection-key` (with one exception: Total markets encode
the threshold in the outcome name, e.g., `Under_213.5`). Domain `Bet.Value` is
`decimal?` — populated for handicap and total, null for Win/Draw.
6. **`ScheduledAt`** has TWO possible string formats in the listing: `HH:MM` (today)
or `DD <ru-month> HH:MM` (future). Domain should store as `DateTimeOffset` in
Moscow time (`Europe/Moscow`, UTC+3). The "today" anchor comes from the
`initData.serverTime` blob (`YYYY,MM,DD,HH,MM,SS` format). Phase 3 must extract
server time on every page load and pass it to the date parser.
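Item 6 can be sketched as follows. This is an illustration under stated assumptions, not the Phase 3 parser: `ListingDateParser` is a hypothetical name; the month in `initData.serverTime` is assumed to be 1-based (verify against a capture, since JavaScript-style blobs are often 0-based); month names are assumed to appear in full Russian genitive form as in `06 мая 22:00`; and a listed month earlier than the server month is assumed to mean next year. Moscow is modelled as a fixed UTC+3 offset.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class ListingDateParser
{
    private static readonly TimeSpan Moscow = TimeSpan.FromHours(3);

    // Assumption: the listing uses full genitive month names.
    private static readonly Dictionary<string, int> Months = new(StringComparer.OrdinalIgnoreCase)
    {
        ["января"] = 1, ["февраля"] = 2, ["марта"] = 3, ["апреля"] = 4,
        ["мая"] = 5, ["июня"] = 6, ["июля"] = 7, ["августа"] = 8,
        ["сентября"] = 9, ["октября"] = 10, ["ноября"] = 11, ["декабря"] = 12,
    };

    // Parses "YYYY,MM,DD,HH,MM,SS" (month assumed 1-based). Null on any malformed input.
    public static DateTimeOffset? ParseServerTime(string? raw)
    {
        var parts = raw?.Split(',');
        if (parts is not { Length: 6 }) return null;
        var n = new int[6];
        for (var i = 0; i < 6; i++)
            if (!int.TryParse(parts[i].Trim(), NumberStyles.None, CultureInfo.InvariantCulture, out n[i]))
                return null;
        try { return new DateTimeOffset(n[0], n[1], n[2], n[3], n[4], n[5], Moscow); }
        catch (ArgumentOutOfRangeException) { return null; }
    }

    // Parses "HH:MM" (today) or "DD <ru-month> HH:MM" (future), anchored on server time.
    public static DateTimeOffset? ParseScheduledAt(string? text, DateTimeOffset serverNow)
    {
        if (string.IsNullOrWhiteSpace(text)) return null;
        var tokens = text.Split(' ', StringSplitOptions.RemoveEmptyEntries | StringSplitOptions.TrimEntries);
        if (tokens.Length != 1 && tokens.Length != 3) return null;

        var hm = tokens[^1].Split(':');
        if (hm.Length != 2
            || !int.TryParse(hm[0], NumberStyles.None, CultureInfo.InvariantCulture, out var hour)
            || !int.TryParse(hm[1], NumberStyles.None, CultureInfo.InvariantCulture, out var minute)
            || hour > 23 || minute > 59)
            return null;

        var local = serverNow.ToOffset(Moscow);
        int year = local.Year, month = local.Month, day = local.Day;
        if (tokens.Length == 3)
        {
            if (!int.TryParse(tokens[0], NumberStyles.None, CultureInfo.InvariantCulture, out day)
                || !Months.TryGetValue(tokens[1], out month))
                return null;
            if (month < local.Month) year++; // assumed year rollover for upcoming events
        }

        try { return new DateTimeOffset(year, month, day, hour, minute, 0, Moscow); }
        catch (ArgumentOutOfRangeException) { return null; }
    }
}
```

Both methods return `null` rather than throwing, in line with the defensive-parser rule from the probe notes.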
### What Phase 3 (Scraping) needs to know
Read `spike/SCRAPE_FINDINGS.md` end-to-end before designing the scraper.
Highlights:
- **Selector inventory:** in `SCHEMA_DRAFT.md` §1–§3 and in `SCRAPE_FINDINGS.md` §5.
- **URL templates** in `SCRAPE_FINDINGS.md` §3.
- **Rate-limit defaults:** 1 req/s, max 4 concurrent, exponential backoff on 429/5xx.
Use `Microsoft.Extensions.Http.Resilience` (Polly v8).
- **User-Agent rotation:** the only mitigation that looks worth adding. The site does not
challenge the UA, but rotating guards against future fingerprint-based throttling.
- **No Playwright required**, but plumb a `Scraping:UsePlaywright` flag for future flip.
### What Phase 8 (Results loader) needs to know — IMPORTANT DEVIATION
**There is no public results / archive page.** `https://www.marathonbet.by/su/results`
returns 404. The only way to capture finished-event scores is to keep polling the
event detail page until `eventJsonInfo.matchIsComplete === true`, then snapshot
`resultDescription` (e.g., `"2:1 (1:1)"`).
This means Phase 8 must:
1. Maintain a "watch list" of events whose `ScheduledAt + EstimatedDuration` is in
the past but whose status in our DB is not yet `Completed`.
2. Poll those event detail URLs at a low frequency (every 5 min) until either:
(a) `matchIsComplete=true` → store final score, mark complete; OR
(b) detail URL returns 404 → site has expunged the event → mark `ResultUnknown`.
3. Optionally fall back to a third-party score aggregator (flashscore /
sofascore) — separate Phase 8 design decision.
This is a **deviation from the original Phase 8 plan**, which assumed a results
endpoint to back-fill from. Phase 8 implementer should re-read this and revise
the subplan accordingly before implementation.
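The per-poll decision described above could look roughly like this. A sketch only: `WatchOutcome`, `WatchDecision` and `ResultWatcher` are hypothetical names, and it assumes the decoded `eventJsonInfo` payload is a JSON object with `matchIsComplete` and `resultDescription` as top-level properties.

```csharp
using System;
using System.Net;
using System.Text.Json;

public enum WatchOutcome { KeepPolling, Completed, ResultUnknown }

public sealed record WatchDecision(WatchOutcome Outcome, string? FinalScore);

public static class ResultWatcher
{
    // An event joins the watch list once ScheduledAt + EstimatedDuration has passed.
    public static bool IsDue(DateTimeOffset scheduledAt, TimeSpan estimatedDuration, DateTimeOffset now)
        => scheduledAt + estimatedDuration <= now;

    // Maps one poll of the event detail page to the next watch-list state.
    public static WatchDecision Decide(HttpStatusCode status, string? eventJsonInfo)
    {
        // (b) detail URL gone: the site has expunged the event.
        if (status == HttpStatusCode.NotFound)
            return new WatchDecision(WatchOutcome.ResultUnknown, null);

        // Transient errors or a missing blob: try again on the next cycle.
        if (status != HttpStatusCode.OK || string.IsNullOrWhiteSpace(eventJsonInfo))
            return new WatchDecision(WatchOutcome.KeepPolling, null);

        try
        {
            using var doc = JsonDocument.Parse(eventJsonInfo);
            var root = doc.RootElement;
            // (a) matchIsComplete=true: snapshot resultDescription.
            if (root.ValueKind == JsonValueKind.Object
                && root.TryGetProperty("matchIsComplete", out var done)
                && done.ValueKind == JsonValueKind.True)
            {
                string? score = null;
                if (root.TryGetProperty("resultDescription", out var r)
                    && r.ValueKind == JsonValueKind.String)
                    score = r.GetString();
                return new WatchDecision(WatchOutcome.Completed, score);
            }
        }
        catch (JsonException)
        {
            // Malformed JSON is treated like "not complete yet"; never throw.
        }
        return new WatchDecision(WatchOutcome.KeepPolling, null);
    }
}
```

Keeping the decision as a pure function of status code and payload lets Phase 8 test it against the saved captures without any network access.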
### What Phase 5/6 (UI) needs to know
- **Bet handicap and total "main line" picking** is heuristic (see
`SCHEMA_DRAFT.md` §2.2 and §2.3) and should be exposed as a configurable
policy. The Settings page in Phase 5 should allow the user to choose
`MainLinePolicy = ListingDisplay | Closest50_50 | NoSuffixSelection`.
- **Russian-only labels** in the source HTML. Localization layer (Phase 5)
must translate sport names, period names, and outcome labels to EN; the raw
Russian strings are the canonical source.
### Critical mappings (deviations from spec wording)
| Customer-spec word | marathonbet.by reality |
| --- | --- |
| `Win_Fora` | `Handicap` market in DOM (`To_Win_Match_With_Handicap`). Same concept, different word. |
| `Total_Less` / `Total_More` | DOM uses `Under` / `Over`. |
| `Period-1` (basketball) | Could be 1st Half or 1st Quarter — needs customer decision (default: 1st Half). |
| `Sport_Code = 6` | `data-sport-treeId="6"` confirmed for Basketball. |
+318
@@ -0,0 +1,318 @@
# Phase 0 Spike — Domain Schema Draft
**Purpose:** Map every customer-spec Excel column to a concrete DOM/JSON path in
marathonbet.by. Phase 1 (Domain) and Phase 3 (Scraping/parsing) consume this.
**Convention:** "selector" entries use AngleSharp/CSS notation. `evt` = the
event detail page DOM; `list` = the listing page DOM (top-level grid view).
---
## 1. Event Metadata
| Spec field | Source | Selector / extraction |
|---|---|---|
| `EventCode` | event detail page | `[data-event-eventId]` attribute on the outer `div.coupon-row`. Numeric, e.g., `26456117`. **Stable; use as primary key for the event in our SQLite.** |
| `TreeId` (internal) | event detail page | `[data-event-treeId]` on the same `div.coupon-row`. Used for URL building, less stable than `EventCode`. |
| `SportCode` | breadcrumb of event detail | `breadcrumbs-list .breadcrumbs-item:nth-child(2) a@href` matches `/su/betting/{Sport}+-+{N}`. Parse `N` as integer. Confirmed: Basketball = 6, Football = 11. |
| `Sport` | breadcrumb (RU label) | `breadcrumbs-list .breadcrumbs-item:nth-child(2) .breadcrumb-text` → strip leading `Ставки на ` prefix. e.g., `Ставки на Баскетбол` → `Баскетбол`. |
| `Country` | breadcrumb | `.breadcrumbs-item:nth-child(3) .breadcrumb-text`. May represent group ("Клубы. Международные") rather than literal country for international leagues — accept as-is. |
| `League` | breadcrumb | `.breadcrumbs-item:nth-child(4) .breadcrumb-text`. e.g., `Лига чемпионов УЕФА`, `NBA`. |
| `Category` | breadcrumb (deeper) | If breadcrumb has 5+ items beyond the event itself, join items 5..N-1 with ` / `. e.g., `Play-Offs / Semi Final / 2nd Leg`. The event detail's `category-label-link` `<h2>` text also exposes this concatenated. |
| `EventName` | event detail | `[data-event-name]` attribute on `div.coupon-row`. e.g., `Арсенал - Атлетико Мадрид`. |
| `Team1` | event detail | `[data-event-name]`, split on ` - `, take index 0. Or: `.player-row.player1 .member-name [data-member-link]` text. |
| `Team2` | event detail | Split index 1, or `.player-row.player2 .member-name [data-member-link]`. |
| `ScheduledAt` (date+time) | event detail + listing | **Time:** `.date-wrapper` text. Two formats: `HH:MM` (today) or `DD <ru-month> HH:MM` (future, e.g., `06 мая 22:00`). **Anchor:** `initData.serverTime` (Moscow TZ, format `YYYY,MM,DD,HH,MM,SS`) parsed and combined with the time. **Title fallback:** `<title>` and `<meta name="description">` contain a Russian-formatted full date (`05 мая 2026`) — use as authoritative when ambiguous. |
| `IsLive` | event detail / listing | `[data-live="true"]` attribute. Live events also carry `.score-state` and `.time` elements with `2:1` and `83:30` style content. |
| `LiveScore` | event detail (live only) | `.score-state` text (`2:1 (1:1)` style). Inning breakdown: parse the `eventJsonInfo` `[data-json]` attribute on the hidden `<td>` — JSON includes `mainScore`, `inningScore[]`, `matchTime.seconds`, `matchIsComplete`. |
| `MatchIsComplete` | event detail | Decoded JSON of `[data-mutable-id="eventJsonInfo"][data-json]` → `.matchIsComplete` boolean. Critical for Phase 8 (Results loader). |
| `FinalScore` | event detail (post-match) | Same `eventJsonInfo` JSON → `.resultDescription` (e.g., `"2:1 (1:1)"`) when `matchIsComplete=true`. |
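
The string-level extractions in the table (`SportCode` from the breadcrumb href, the `Ставки на ` prefix strip, and the `Team1` / `Team2` split) are sketched below. It works on already-extracted attribute and text values, so no AngleSharp calls are shown. `EventMetadataParser` is a hypothetical name, and the split assumes neither team name itself contains ` - `.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class EventMetadataParser
{
    // Matches /su/betting/{Sport}+-+{N} and captures N.
    private static readonly Regex SportHref = new(
        @"/su/betting/(?<sport>[^/]+?)\+-\+(?<id>\d+)(?:[/?#]|$)",
        RegexOptions.CultureInvariant);

    public static int? TryParseSportCode(string? href)
    {
        if (string.IsNullOrEmpty(href)) return null;
        var m = SportHref.Match(href);
        if (!m.Success) return null;
        return int.TryParse(m.Groups["id"].Value, NumberStyles.None,
            CultureInfo.InvariantCulture, out var code) ? code : (int?)null;
    }

    // "Ставки на Баскетбол" -> "Баскетбол"; labels without the prefix pass through.
    public static string StripSportPrefix(string label)
    {
        const string prefix = "Ставки на ";
        var t = label.Trim();
        return t.StartsWith(prefix, StringComparison.Ordinal) ? t[prefix.Length..] : t;
    }

    // Splits data-event-name on the first " - ". Null when the shape is unexpected.
    public static (string Team1, string Team2)? TrySplitTeams(string? eventName)
    {
        var parts = eventName?.Split(" - ", 2, StringSplitOptions.TrimEntries);
        if (parts is not { Length: 2 } || parts[0].Length == 0 || parts[1].Length == 0)
            return null;
        return (parts[0], parts[1]);
    }
}
```

If the split returns `null`, the `.player-row` selectors listed in the table remain the fallback.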
---
## 2. Match-Scope Bets (1×2, Handicap, Total)
The event-detail "main row" presents three primary markets in a `coefficients-table`:
**Result** (1×2), **Handicap** (Win-Fora), **Total** (Goals/Points/Games depending
on sport). These map to spec fields `Bet_Match_*`.
### 2.1 Match Win 1 / Draw / Win 2
| Spec field | data-selection-key suffix | DOM path |
|---|---|---|
| `Bet_Match_Win_1` | `@Match_Result.1` (football, tennis, hockey) **OR** `@Result.1` (basketball pre-match) **OR** `@Normal_Time_Result.1` (basketball detail) | `evt span[data-selection-key$='@Match_Result.1']@data-selection-price` (decimal odds, e.g., `1.65`) |
| `Bet_Match_Draw` | `.draw` outcome of same market | `evt span[data-selection-key$='@Match_Result.draw']@data-selection-price`. **NULL for tennis** (2-way market, no draw). |
| `Bet_Match_Win_2` | `.3` outcome | `evt span[data-selection-key$='@Match_Result.3']@data-selection-price` |
**Sport variance:**
- Football, Tennis, Table-tennis: `Match_Result`.
- Basketball: in pre-match landing, label is `Match_Winner_Including_All_OT.HB_H/HB_A`
(2-way, OT included). On the detail page, both `Normal_Time_Result.{1,draw,3}` (3-way,
reg time) and `Match_Winner_Including_All_OT.{HB_H,HB_A}` (2-way, OT included) appear.
**Recommendation:** treat `Match_Winner_Including_All_OT` as the canonical Win-1 / Win-2
(no Draw) when a 3-way `Result` market is absent; fall back to draw-included
`Normal_Time_Result` when present.
- Hockey: TBD — verify in Phase 3 with an actual hockey event capture.
**Recommendation for Phase 1 domain:** define `BetType.WinDraw` allowing nullable
`Draw`. The Excel exporter writes empty cell when `Draw` is null.
### 2.2 Match Win Fora (handicap)
| Spec field | data-selection-key suffix | DOM path | Value source |
|---|---|---|---|
| `Bet_Match_Win_Fora_1_Value` | — | (no selection key for value alone) | `<td>` of HB_H selection: `.middle-simple` text inside the `<div class="nowrap simple-price">` (e.g., `(-1.0)`). Strip parens, parse as `decimal`. |
| `Bet_Match_Win_Fora_1_Rate` | `@To_Win_Match_With_Handicap{N}.HB_H` (or `@Match_Handicap.HB_H` variant) | `[data-selection-key$='@To_Win_Match_With_Handicap.HB_H']@data-selection-price` | — |
| `Bet_Match_Win_Fora_2_Value` | — | `.middle-simple` next to HB_A selection (e.g., `(+1.0)`). | — |
| `Bet_Match_Win_Fora_2_Rate` | `@To_Win_Match_With_Handicap{N}.HB_A` | `[data-selection-key$='@To_Win_Match_With_Handicap.HB_A']@data-selection-price` | — |
**Tennis variant:** uses `@To_Win_Match_With_Handicap_By_Games{N}.HB_H/HB_A`.
The handicap is in **games** not points — emit `Value` as-is, the unit is implicit
in the sport.
**Multi-line handicap:** the site offers many lines (`To_Win_Match_With_Handicap0`,
`...1`, `...2`, ...), each a different handicap value. The customer spec wants only
the **main line** (the one displayed in the listing's main row). Phase 3 should:
1. On listing pages, take the handicap displayed in the `coefficients-table`
`data-market-type="HANDICAP"` cell.
2. On event detail, identify the "main" line as the one without a numeric suffix
(`@To_Win_Match_With_Handicap.HB_H`) or with suffix `0` if both exist — sample
shows both `To_Win_Match_With_Handicap.HB_H` and `...0.HB_H`. Heuristic: pick
the line whose handicap value is closest to ±1.0 from the favorite, OR explicitly
prefer the no-suffix variant; fall back to suffix `0`.
3. Optional: capture the full handicap ladder into a separate normalized table
so anomaly detection can use the spread, even if Excel only exports the main line.
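
A minimal sketch of the two mechanical pieces above: parsing the `.middle-simple` text into a `decimal`, and the "prefer no-suffix, fall back to suffix `0`" rule from step 2. `HandicapLine` and `HandicapLines` are hypothetical names. The "closest to ±1.0" alternative is not implemented here, and the Unicode-minus normalization is a defensive assumption rather than something seen in the captures.

```csharp
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

// LineIndex is null for the no-suffix market (To_Win_Match_With_Handicap).
public sealed record HandicapLine(int? LineIndex, decimal Value, decimal Rate);

public static class HandicapLines
{
    // "(-1.0)" -> -1.0, "(+1.0)" -> 1.0. Returns null when the text is not a number.
    public static decimal? ParseValue(string? text)
    {
        if (string.IsNullOrWhiteSpace(text)) return null;
        var cleaned = text.Trim().Trim('(', ')').Trim().Replace('\u2212', '-');
        return decimal.TryParse(cleaned,
            NumberStyles.AllowLeadingSign | NumberStyles.AllowDecimalPoint,
            CultureInfo.InvariantCulture, out var value) ? value : (decimal?)null;
    }

    // Prefer the no-suffix line; fall back to suffix 0; otherwise null.
    public static HandicapLine? PickMain(IEnumerable<HandicapLine> lines)
    {
        var list = lines.ToList();
        return list.FirstOrDefault(l => l.LineIndex is null)
            ?? list.FirstOrDefault(l => l.LineIndex == 0);
    }
}
```

Returning `null` when neither variant exists leaves the exported cell empty instead of guessing a line from the ladder.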
### 2.3 Match Total Less / More
| Spec field | data-selection-key suffix | DOM path |
|---|---|---|
| `Bet_Match_Total_Less_Value` | — | `.middle-simple` next to the `Меньше` selection (e.g., `3.5`, `213.5`). |
| `Bet_Match_Total_Less_Rate` | `@Total_{Goals\|Points\|Games}{N}.Under_<X>` | `[data-selection-key^='<eventId>@Total_'][data-selection-key$='.Under_<X>']@data-selection-price`. Use the row whose Value equals the chosen total threshold. |
| `Bet_Match_Total_More_Value` | — | Same value as Less (paired). |
| `Bet_Match_Total_More_Rate` | `@Total_{Goals\|Points\|Games}{N}.Over_<X>` | `[data-selection-key$='.Over_<X>']@data-selection-price` |
**Sport vocabulary:**
- Football: `Total_Goals`
- Basketball: `Total_Points`
- Tennis: `Total_Games`
- Hockey: `Total_Goals` (TBD)
- Volleyball / handball: TBD
**Choosing the "main" total line:** customer spec wants ONE Total Value + Less/More
rates per event. The site offers ~20 different total thresholds per event. The
listing page main row exposes the "headline" total (the one the bookmaker chose
to show). **Heuristic:**
1. On listing: read the `data-market-type="TOTAL"` cell directly.
2. On event detail: find the row labeled in `coefficients-row` (visible main view),
not in `coefficients-hidden-row`. The `data-mutable-id="S_3_1_european"` /
`S_3_3_european` pair is the main line.
3. Fall back to picking the line whose Under/Over rates are closest to **2.00**
each (the "balanced" line — most representative of bookmaker's expectation).
4. As with handicap, capture the full ladder for analysis even if exports only one row.
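
The step 3 fallback (the "balanced" line) can be sketched as a small ranking over the captured ladder. `TotalLine` and `TotalLines` are hypothetical names. The distance metric, the sum of absolute deviations of both rates from 2.00, is one reasonable reading of "closest to 2.00 each"; it is an assumption, not a rule taken from the site.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed record TotalLine(decimal Threshold, decimal UnderRate, decimal OverRate);

public static class TotalLines
{
    // Picks the line whose Under/Over rates are jointly closest to the target (2.00).
    // Ties go to the lower threshold so the choice is deterministic.
    public static TotalLine? PickBalanced(IEnumerable<TotalLine> ladder, decimal target = 2.00m)
        => ladder
            .OrderBy(l => Math.Abs(l.UnderRate - target) + Math.Abs(l.OverRate - target))
            .ThenBy(l => l.Threshold)
            .FirstOrDefault();
}
```

With a made-up ladder of `(2.5, 2.60, 1.48)`, `(3.5, 1.95, 1.87)` and `(4.5, 1.40, 2.90)`, the distances are 1.12, 0.18 and 1.50, so the 3.5 line is chosen.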
---
## 3. Period-N Scope Bets
Period markets follow the same pattern as match markets but with a period prefix
in the market token. Examples for `Period-1` (1st half of football, 1st quarter
of basketball, 1st set of tennis):
### 3.1 Period-N Win 1 / Draw / Win 2
> **CORRECTED FROM CAPTURE EVIDENCE (2026-05-05):** Period result markets use
> `RN_H` / `RN_D` / `RN_A` outcome codes (Reduced Numerals: Home / Draw / Away),
> NOT the `1` / `draw` / `3` codes used by `@Match_Result`. Market names also
> vary: football uses `Result_-_1st_Half` (with separator dashes); basketball and
> tennis use `1st_Half_Result0` / `1st_Quarter_Result0` / `1st_Set_Result0`
> (note the literal `0` suffix on the market name — line index for the period
> result market). Phase 3 parser must use these exact tokens.
| Customer field | Football (1st Half) | Basketball (1st Half *or* Quarter) | Tennis (1st Set) | Hockey (1st Period) |
|---|---|---|---|---|
| `Bet_Period-1_Win_1` | `@Result_-_1st_Half.RN_H` | `@1st_Half_Result0.RN_H` (halves) **or** `@1st_Quarter_Result0.RN_H` (quarters) | `@1st_Set_Result0.RN_H` | `@1st_Period_Result0.RN_H` (TBD verify on hockey event) |
| `Bet_Period-1_Draw` | `@Result_-_1st_Half.RN_D` | `@1st_Half_Result0.RN_D` / `@1st_Quarter_Result0.RN_D` | (NULL — no draw) | `@1st_Period_Result0.RN_D` (TBD) |
| `Bet_Period-1_Win_2` | `@Result_-_1st_Half.RN_A` | `@1st_Half_Result0.RN_A` / `@1st_Quarter_Result0.RN_A` | `@1st_Set_Result0.RN_A` | `@1st_Period_Result0.RN_A` (TBD) |
The market token vocabulary differs by sport:
- **Football:** `Result_-_<ordinal>_<unit>` (e.g., `Result_-_1st_Half`, `Result_-_2nd_Half`).
- **Basketball / Tennis / Hockey:** `<ordinal>_<unit>_Result0` (e.g.,
`1st_Half_Result0`, `1st_Quarter_Result0`, `1st_Set_Result0`,
`1st_Period_Result0`). The `0` suffix is required.
- **Note:** non-period markets like `@Match_Result.1` and `@Match_Result.draw`
still use the `1`/`draw`/`3` outcome codes — the `RN_*` codes are specific to
period/half/quarter/set markets.
**Period count by sport** (default mapping for `Period-N`):
- Football: N ∈ {1, 2}
- Basketball: configurable — halves (N ∈ {1,2}) or quarters (N ∈ {1,2,3,4}). **Default to halves.**
- Tennis: N ∈ {1, 2, ...} until `<i>th_Set_Result` selection is absent. Cap at 5 for Grand Slams.
- Hockey: N ∈ {1, 2, 3}.
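
The token vocabulary and period counts above suggest a mapper along these lines. A sketch, not the Phase 3 `PeriodScopeMapper` itself: the sport codes are the confirmed canonical IDs, the hockey token is still unverified, the ordinals beyond `2nd` are assumed to follow English spelling (`3rd`, `4th`, `5th`), and `BasketballPeriodMode` is a hypothetical setting name.

```csharp
public enum BasketballPeriodMode { Halves, Quarters }

public static class PeriodScopeMapper
{
    // Returns the period result market token, or null when the sport has no such period.
    public static string? ResultMarket(
        int sportCode, int period, BasketballPeriodMode mode = BasketballPeriodMode.Halves)
    {
        var ord = Ordinal(period);
        if (ord is null) return null;

        return sportCode switch
        {
            // Football: Result_-_<ordinal>_Half, no line-index suffix.
            11 when period <= 2 => $"Result_-_{ord}_Half",
            // Basketball: halves by default, quarters when configured.
            6 when mode == BasketballPeriodMode.Halves && period <= 2 => $"{ord}_Half_Result0",
            6 when mode == BasketballPeriodMode.Quarters && period <= 4 => $"{ord}_Quarter_Result0",
            // Tennis: up to 5 sets.
            22723 => $"{ord}_Set_Result0",
            // Hockey: token unverified until a hockey capture exists.
            43658 when period <= 3 => $"{ord}_Period_Result0",
            _ => null,
        };
    }

    private static string? Ordinal(int n) => n switch
    {
        1 => "1st", 2 => "2nd", 3 => "3rd", 4 => "4th", 5 => "5th",
        _ => null,
    };
}
```

The outcome code (`RN_H`, `RN_D`, `RN_A`) is appended by the caller. A `null` token means the exporter leaves the corresponding cells empty.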
### 3.2 Period-N Win Fora
Same as match handicap, with period prefix:
| Sport | Selection key |
|---|---|
| Football | `@To_Win_1st_Half_With_Handicap{N}.HB_H` / `.HB_A` |
| Basketball | `@To_Win_1st_Half_With_Handicap{N}.HB_*` (or `_1st_Quarter_`) |
| Tennis | `@To_Win_1st_Set_With_Handicap{N}.HB_*` |
| Hockey | `@To_Win_1st_Period_With_Handicap{N}.HB_*` (TBD verify) |
Value extraction: same `.middle-simple` text as match handicap.
### 3.3 Period-N Total Less / More
This is the **least uniform** market. Observed:
| Sport | Period-1 Total selection key |
|---|---|
| Football | (search HTML directly — Phase 3 should parse the "Тотал тайма" tab) Likely `@1st_Half_Total_Goals{N}.Under_<X>` / `.Over_<X>`. |
| Basketball | Per-quarter total exposed as separate market in the "Тоталы" tab; sample event did not show clean `1st_Half_Total_Points` keys — see SCRAPE_FINDINGS.md §6 risk #4. **May need to fall back to NULL** for basketball Period-N Total in some leagues. |
| Tennis | `@1st_Set_Total_Games{N}.Under_<X>` / `.Over_<X>` — confirmed in sample. |
| Hockey | `@1st_Period_Total_Goals...` (TBD verify). |
**Phase 3 robustness rule:** if a period-N market is absent in the parsed HTML,
emit `null` for the corresponding rate/value. Never throw. The Excel exporter
writes empty cell.
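
The robustness rule amounts to a lookup that degrades to `null`. A minimal sketch, assuming the parser has already collected `data-selection-key` → `data-selection-price` pairs for one page into a dictionary; `OddsLookup` and that dictionary shape are illustrative, not part of the spike output.

```csharp
using System.Collections.Generic;
using System.Globalization;

public static class OddsLookup
{
    // Missing market, unparsable price, or non-positive price all yield null. Never throws.
    public static decimal? RateOrNull(
        IReadOnlyDictionary<string, string> priceByKey, string selectionKey)
    {
        if (!priceByKey.TryGetValue(selectionKey, out var raw)) return null;
        return decimal.TryParse(raw, NumberStyles.AllowDecimalPoint,
                   CultureInfo.InvariantCulture, out var rate) && rate > 0m
            ? rate
            : (decimal?)null;
    }
}
```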
---
## 4. Live Counterparts
When the same scope is captured from the **live** site (`/su/live` or live-flagged
events on `/su/`), the spec wants column prefix `Live_*` instead of `Bet_*`.
**Important:** live events use the SAME `data-selection-key` naming conventions.
The distinguishing signal is `data-live="true"` on the outer `div.coupon-row` and
the URL the snapshot was scraped from (`/su/live`).
Examples:
- `Live_Match_Win_1` ← `[data-selection-key$='@Match_Result.1']` from live page
- `Live_Match_Win_Fora_1_Value`, `Live_Match_Win_Fora_1_Rate` ← same DOM, same logic
- `Live_Period-1_Win_1` ← same as `Bet_Period-1_Win_1` but captured from live event
**Implementation:** the parser does not change. The application service simply
records `Source = Live | PreMatch` on each `OddsSnapshot` and the Excel exporter
denormalizes pre-match snapshots to `Bet_*` columns and live snapshots to `Live_*`
columns at write time.
---
## 5. Field Coverage Matrix (spec → confidence)
| Field family | Football | Basketball | Tennis | Hockey | Notes |
|---|---|---|---|---|---|
| `Match_Win_1/2`, `Match_Draw` | ✅ confirmed | ⚠️ Win-1/2 confirmed; Draw conditional on `Normal_Time_Result` presence | ✅ Win-1/2 confirmed; **Draw is null** | ❓ verify Phase 3 | — |
| `Match_Win_Fora_*` | ✅ | ✅ | ✅ (in games) | ❓ | "Main line" heuristic needed (§2.2) |
| `Match_Total_*` | ✅ Goals | ✅ Points | ✅ Games | ❓ | "Main line" heuristic needed (§2.3) |
| `Period-1_Win_*` | ✅ Half | ✅ Half / Quarter | ✅ Set | ❓ Period | basketball mode is configurable |
| `Period-1_Win_Fora_*` | ✅ | ✅ | ✅ | ❓ | — |
| `Period-1_Total_*` | ⚠️ structure verified, exact key TBD | ⚠️ may be absent for some games | ✅ Set | ❓ | risk: emit null where absent |
| `Period-2/3/4_*` | (Period-2 only) | ✅ all | up to actual played sets | ❓ | — |
| `Live_*` (any of above) | same parser | same | same | same | distinguished only by `data-live` flag + scrape URL |
Legend: ✅ confirmed in spike sample, ⚠️ partial / heuristic needed, ❓ Phase 3 must verify.
---
## 6. Suggested Domain Types (Phase 1 input)
```csharp
// Marathon.Domain
public enum BetScope { Match, Period }
public enum BetType { Win, Draw, WinFora, Total }
public enum BetSide { Side1, Side2, Less, More } // Side1=home/W1, Side2=away/W2
public sealed record Sport(int Code, string NameRu, string NameEn);
public sealed record League(int TreeId, string NameRu, int SportCode);
public sealed record Event(
long EventCode, // marathonbet's data-event-eventId
int TreeId, // for URL building
int SportCode,
int LeagueTreeId,
string Country, // breadcrumb position 3
string? Category, // joined breadcrumb 5..N-1
string Team1,
string Team2,
DateTimeOffset ScheduledAt, // anchored on initData.serverTime
string DetailUrl);
public sealed record Bet(
BetScope Scope,
int? PeriodNumber, // null when Scope=Match
BetType Type,
BetSide? Side, // null for Type=Draw
decimal? Value, // handicap/total threshold; null for Win/Draw
decimal Rate); // decimal odds (e.g., 1.65)
public sealed record OddsSnapshot(
long EventCode,
DateTimeOffset CapturedAt,
SnapshotSource Source, // PreMatch | Live
IReadOnlyList<Bet> Bets);
public enum SnapshotSource { PreMatch, Live }
```
Phase 1 will refine names, but this captures the data shape Phase 3 produces.
---
## 7. Excel Column Generation (Phase 4 / 9 reference)
The Excel exporter generates wide rows by joining all `Bet`s of an `OddsSnapshot`
into named columns. Pseudocode:
```
foreach snapshot:
row.EventCode = snapshot.EventCode
row.SportCode = event.SportCode
row.Sport = event.Sport.NameRu
row.Country = event.Country
row.League = event.League.NameRu
row.Category = event.Category
row.ScheduledAt = event.ScheduledAt
prefix = snapshot.Source == PreMatch ? "Bet_" : "Live_"
// Match scope
row[prefix+"Match_Win_1"] = bet.Where(scope=Match, type=Win, side=Side1).Rate
row[prefix+"Match_Draw"] = bet.Where(scope=Match, type=Draw).Rate
row[prefix+"Match_Win_2"] = bet.Where(scope=Match, type=Win, side=Side2).Rate
row[prefix+"Match_Win_Fora_1_Value"] = bet.Where(scope=Match, type=WinFora, side=Side1).Value
row[prefix+"Match_Win_Fora_1_Rate"] = bet.Where(scope=Match, type=WinFora, side=Side1).Rate
row[prefix+"Match_Win_Fora_2_Value"] = bet.Where(scope=Match, type=WinFora, side=Side2).Value
row[prefix+"Match_Win_Fora_2_Rate"] = bet.Where(scope=Match, type=WinFora, side=Side2).Rate
row[prefix+"Match_Total_Less_Value"] = bet.Where(scope=Match, type=Total, side=Less).Value
row[prefix+"Match_Total_Less_Rate"] = bet.Where(scope=Match, type=Total, side=Less).Rate
row[prefix+"Match_Total_More_Value"] = bet.Where(scope=Match, type=Total, side=More).Value
row[prefix+"Match_Total_More_Rate"] = bet.Where(scope=Match, type=Total, side=More).Rate
// Period scope (foreach period N exposed for that sport)
for N in 1..MaxPeriodForSport(sportCode):
same fields with key {prefix}Period-{N}_*
null when bet absent
```
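The pseudocode above can be sketched in C# roughly as follows. This is a minimal illustration rather than the Phase 4 exporter: `ColumnBuilder` and `BuildMatchColumns` are hypothetical names, the domain types are trimmed copies of the §6 draft, and only Match-scope columns are shown (period columns would follow the same pattern).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative input: one live snapshot that carries only two bets.
var bets = new List<Bet>
{
    new Bet(BetScope.Match, null, BetType.Win, BetSide.Side1, null, 1.65m),
    new Bet(BetScope.Match, null, BetType.Total, BetSide.Less, 2.5m, 1.90m),
};

foreach (var column in ColumnBuilder.BuildMatchColumns(SnapshotSource.Live, bets))
{
    string cell = column.Value?.ToString() ?? "<empty>";
    Console.WriteLine($"{column.Key} = {cell}");
}

// Trimmed copies of the section 6 draft types.
public enum BetScope { Match, Period }
public enum BetType { Win, Draw, WinFora, Total }
public enum BetSide { Side1, Side2, Less, More }
public enum SnapshotSource { PreMatch, Live }
public sealed record Bet(BetScope Scope, int? PeriodNumber, BetType Type,
                         BetSide? Side, decimal? Value, decimal Rate);

// Hypothetical helper, not part of the spike.
public static class ColumnBuilder
{
    public static Dictionary<string, decimal?> BuildMatchColumns(
        SnapshotSource source, IReadOnlyList<Bet> bets)
    {
        string prefix = source == SnapshotSource.PreMatch ? "Bet_" : "Live_";

        // Returns null when the market is absent, so the cell stays empty.
        Bet Find(BetType type, BetSide? side) => bets.FirstOrDefault(
            b => b.Scope == BetScope.Match && b.Type == type && b.Side == side);

        var row = new Dictionary<string, decimal?>
        {
            [prefix + "Match_Win_1"] = Find(BetType.Win, BetSide.Side1)?.Rate,
            [prefix + "Match_Draw"] = Find(BetType.Draw, null)?.Rate,
            [prefix + "Match_Win_2"] = Find(BetType.Win, BetSide.Side2)?.Rate,
        };

        // Handicap and total columns come in Value/Rate pairs.
        void AddPair(string name, BetType type, BetSide side)
        {
            Bet bet = Find(type, side);
            row[prefix + name + "_Value"] = bet?.Value;
            row[prefix + name + "_Rate"] = bet?.Rate;
        }

        AddPair("Match_Win_Fora_1", BetType.WinFora, BetSide.Side1);
        AddPair("Match_Win_Fora_2", BetType.WinFora, BetSide.Side2);
        AddPair("Match_Total_Less", BetType.Total, BetSide.Less);
        AddPair("Match_Total_More", BetType.Total, BetSide.More);
        return row;
    }
}
```

Every column key is always present in the dictionary, with `null` for absent markets, which keeps the sheet layout stable from row to row.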
Spec column order is left to Phase 4 (`ExcelExporter`). Recommend:
`Date, Time, Sport, Country, League, Category, Event, EventCode,
Bet_Match_*..., Bet_Period-1_*..., Bet_Period-2_*..., Live_Match_*..., Live_Period-N_*...`
---
## 8. Decisions Pending Customer Confirmation
1. **Basketball Period mapping** — halves (default) or quarters? Spec says
"Period-N" but is silent on which N applies. Recommend halves (`N ∈ {1,2}`)
with a quarter mode opt-in via `appsettings.Sports.Basketball.PeriodMode`.
2. **Tennis Draw column** — emit empty / 0 / "—"? Recommend empty cell.
3. **Handicap "main line" rule** — pick the listing's main row, OR the no-suffix
selection, OR the spread closest to bookmaker-implied probability 50/50?
4. **Total "main line" rule** — same as above.
5. **Field name capitalization** — spec uses `Bet_Match_Win_Fora_1_Value` exactly.
Recommend matching exactly (case-sensitive) for compatibility with downstream
pivot tables / scripts.
+347
@@ -0,0 +1,347 @@
# Phase 0 Spike — Scraping Findings for marathonbet.by
**Date:** 2026-05-05
**Probe environment:** Windows 10, Poland-routed IP (countryCode `PL` reported by site,
`isBelarus: true` flag set in `initData`, `jurisdiction: BELARUS`).
**Tooling used:** `curl` with browser User-Agent, ~10 sequential requests with
≥1-second pacing.
---
## TL;DR — Decision Matrix
| Question | Answer |
|---|---|
| Is anonymous scraping feasible? | **YES — confirmed.** Site returns full server-rendered HTML for `/su/`, `/su/live`, sport listings, and event detail pages with HTTP 200 to a plain GET with browser User-Agent. |
| Cloudflare / JS challenge? | **No.** `Server: nginx`, no `cf-ray`, no challenge cookies. Only standard JSESSIONID + analytics cookies. No reCAPTCHA on listing pages. |
| Geo-block from probe environment? | **No.** Probe was made from a non-Belarus IP; site served full HTML. The site treats us as `region:"PL"` but still serves Russian-language `/su` content. |
| Recommended scraping technology | **HttpClient + AngleSharp.** All the data needed (event list, full odds, breadcrumb taxonomy, period markets) is present in the raw SSR HTML. Playwright is not required for read-only scraping. |
| Recommended polling cadence | Pre-match: **30 seconds** (default in `appsettings`). Live: the site's native 3-second cadence is too aggressive; recommend **5-10 seconds** for our analyzer (anomaly detection doesn't need sub-second resolution). |
| WebSocket / API alternative? | STOMP-over-WebSocket exists at `/su/websocket/endpoint` for authenticated clients. Anonymous clients should stick to plain HTML scraping. The JSONP endpoint at `/su/liveupdate/popular/` only returns refresh-page signals, not full odds. |
---
## 1. Probe Outcomes
### 1.1 Pre-match landing — `https://www.marathonbet.by/su`
```
HTTP/1.1 200 OK
Server: nginx
Content-Type: text/html;charset=UTF-8
Set-Cookie: visitedNavBarItems=HOME; HttpOnly; SameSite=None; Secure
Set-Cookie: lastSitePart=SPORT; ...
Set-Cookie: puid=rBWP3Wn5...; expires=2037; domain=.marathonbet.by
Strict-Transport-Security: max-age=31536000
Cache-Status: MISS
Cache-Control: no-store, no-cache, must-revalidate
```
- **Render type:** Server-Side Rendered (SSR). Body is ~590 KB of HTML containing
the full event grid for live + popular pre-match events. There IS a `<div id="app">`
wrapper but the content inside is fully populated server-side; the JS layer enhances
rather than hydrates from empty.
- **Rich data attributes embedded:**
- `data-event-eventId="<bookmakerEventCode>"` — bookmaker's stable numeric event ID
- `data-event-treeId="<treeId>"` — tree position ID (used in URLs)
- `data-event-name="..."` — event display name
- `data-event-path="<sport>/<league-path>/<teams> - <treeId>"` — URL fragment to
construct event detail link
- `data-live="true|false"` — live vs pre-match flag
- `data-sport-treeId="<sportId>"` — sport identifier (matches customer's "Sport_Code")
- `data-coeff-uuid` + `data-sel='{...}'` JSON — selection metadata (ewc, cid, prt, epr)
- `data-selection-key="<eventId>@<MarketType>[N].<Outcome>"` — canonical bet identifier
- **Embedded `initData` JSON blob** (line 6 of every page) exposes runtime config:
- `serverTime: "2026,05,05,00,43,28"` (Moscow TZ)
- `liveUpdatePath: "/su/liveupdate/popular/"`
- `liveUpdateTransport: "JSONP"`
- `update_interval: 3000` (ms — live update polling cadence used by the site itself)
- `stomp.url: "/su/websocket/endpoint"` (authenticated stream)
- `region`, `isBelarus`, `jurisdiction`, `currencyCode` — geo/legal flags
- `treeIds` — for the event detail page, holds the focal treeId
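A minimal sketch of turning the `serverTime` string into a timestamp. It assumes the comma-separated layout shown above and a fixed UTC+03:00 offset for Moscow time; `ParseServerTime` is a hypothetical helper name.

```csharp
using System;
using System.Globalization;

// Assumption: serverTime is "yyyy,MM,dd,HH,mm,ss" in Moscow time (UTC+03:00).
DateTimeOffset ParseServerTime(string raw)
{
    DateTime local = DateTime.ParseExact(
        raw, "yyyy,MM,dd,HH,mm,ss", CultureInfo.InvariantCulture);
    return new DateTimeOffset(local, TimeSpan.FromHours(3));
}

DateTimeOffset serverNow = ParseServerTime("2026,05,05,00,43,28");
Console.WriteLine(serverNow.ToString("o")); // 2026-05-05T00:43:28.0000000+03:00
```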
### 1.2 Live landing — `https://www.marathonbet.by/su/live`
- HTTP 200, ~250 KB body — same `nginx` server, same SSR pattern.
- Same `data-event-*` attributes as pre-match. Live events show `data-live="true"`,
with extra `score-state` and `time` markers (e.g., `2:1 (1:1)`, `83:30`).
- The site polls `/su/liveupdate/popular/?treeIds=...` every 3 s, but the response
is just a refresh signal (`{"modified":[{"type":"refreshPage"}],"updated":...}`).
In practice **the site relies on full HTML re-fetch for live updates**, which is good for us
(no separate JSON contract to track).
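As a sketch of how the refresh signal could be consumed, the snippet below checks a live-update payload for a `refreshPage` entry using `System.Text.Json`. The `updated` value in the sample is a made-up placeholder (the observed value is elided above), `RequestsPageRefresh` is a hypothetical name, and the code assumes the body is plain JSON; if the endpoint wraps it in a JSONP callback, that wrapper would have to be stripped first.

```csharp
using System;
using System.Text.Json;

// Returns true when the payload contains {"type":"refreshPage"} under "modified".
bool RequestsPageRefresh(string json)
{
    using JsonDocument doc = JsonDocument.Parse(json);
    JsonElement root = doc.RootElement;
    if (root.ValueKind != JsonValueKind.Object ||
        !root.TryGetProperty("modified", out JsonElement modified) ||
        modified.ValueKind != JsonValueKind.Array)
    {
        return false;
    }

    foreach (JsonElement item in modified.EnumerateArray())
    {
        if (item.ValueKind == JsonValueKind.Object &&
            item.TryGetProperty("type", out JsonElement type) &&
            type.ValueKind == JsonValueKind.String &&
            type.GetString() == "refreshPage")
        {
            return true;
        }
    }
    return false;
}

// Placeholder "updated" value; only the "modified" array is inspected.
string sample = "{\"modified\":[{\"type\":\"refreshPage\"}],\"updated\":0}";
Console.WriteLine(RequestsPageRefresh(sample));            // True
Console.WriteLine(RequestsPageRefresh("{\"modified\":[]}")); // False
```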
### 1.3 Sport-specific listing — `/su/popular/Basketball` / `/su/betting/Basketball+-+6`
- HTTP 200, ~470 KB.
- Lists all current basketball categories (NBA Playoffs etc.) with full odds.
- URL by name (`Basketball`) and URL by sport tree ID (`Basketball+-+6`) both work.
- Date display: events on the same day show **time only** (`03:00`); events on
later days show **`DD <month-ru> HH:MM`** (e.g., `06 мая 02:00`). The "today"
anchor is implicit — must be derived from `initData.serverTime`.
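The date rule above could be implemented along these lines. This is a sketch under assumptions: month names are taken to be Russian genitive forms (only `мая` was actually observed), the event year is inferred from the server clock, and `ResolveStart` is a hypothetical helper.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Assumption: genitive month names; only "мая" was seen in the spike.
var months = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase)
{
    ["января"] = 1, ["февраля"] = 2, ["марта"] = 3, ["апреля"] = 4,
    ["мая"] = 5, ["июня"] = 6, ["июля"] = 7, ["августа"] = 8,
    ["сентября"] = 9, ["октября"] = 10, ["ноября"] = 11, ["декабря"] = 12,
};

// display is either "HH:mm" (today) or "dd <month> HH:mm" (a later day).
DateTimeOffset ResolveStart(string display, DateTimeOffset now)
{
    string[] parts = display.Split(' ', StringSplitOptions.RemoveEmptyEntries);
    TimeSpan time = TimeSpan.ParseExact(
        parts[^1], @"hh\:mm", CultureInfo.InvariantCulture);

    if (parts.Length == 1)
        return new DateTimeOffset(now.Date + time, now.Offset);

    int day = int.Parse(parts[0], CultureInfo.InvariantCulture);
    int month = months[parts[1]];
    // A month earlier than the server's month is read as next year.
    int year = month < now.Month ? now.Year + 1 : now.Year;
    return new DateTimeOffset(new DateTime(year, month, day) + time, now.Offset);
}

var serverNow = new DateTimeOffset(2026, 5, 5, 0, 43, 28, TimeSpan.FromHours(3));
var inv = CultureInfo.InvariantCulture;
Console.WriteLine(ResolveStart("03:00", serverNow).ToString("yyyy-MM-dd HH:mm zzz", inv));
// 2026-05-05 03:00 +03:00
Console.WriteLine(ResolveStart("06 мая 02:00", serverNow).ToString("yyyy-MM-dd HH:mm zzz", inv));
// 2026-05-06 02:00 +03:00
```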
### 1.4 Event detail — `/su/betting/<event-path>`
- HTTP 200, ~500 KB to ~1.6 MB depending on market count.
- URL pattern: `/su/betting/<Sport>/<League+Path>/<Sub+Stage>/<Team1+vs+Team2+-+<treeId>>`.
- Exposes ~140-250 unique market types per event. Each market is a `<div>` containing
a labeled `<table>` of selections with `data-selection-key`, prices, and handicap/total
values in `<span class="middle-simple">`.
- **Schema.org breadcrumb** at the bottom of the page provides clean taxonomy:
Sport → Country/Group → League → Stage → Event. Each level has its own treeId visible
in `href="/su/betting/<path>+-+<treeId>"`.
- Sample (Football, Arsenal vs Atletico Madrid, treeId 28089645, eventId 26456117):
- Sport = `Football+-+11`, Country group = `Clubs.+International+-+4409575`,
League = `UEFA+Champions+League+-+21255`, Stage = `Play-Offs / Semi+Final / 2nd+Leg`.
- Match-level markets: `Match_Result.{1,draw,3}`, `To_Win_Match_With_Handicap{N}.{HB_H,HB_A}`,
`Total_Goals{N}.{Under_X,Over_X}`.
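Since every breadcrumb level and event path ends in a tree ID, a small helper can recover it. The sketch below assumes the ID is always the trailing number after `+-+` (URL form) or ` - ` (raw `data-event-path` form); `ExtractTreeId` is a hypothetical name.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

// Returns the trailing tree ID, or null when the path does not end in one.
long? ExtractTreeId(string path)
{
    Match m = Regex.Match(path, @"(?:\+-\+| - )(\d+)/?$");
    return m.Success
        ? long.Parse(m.Groups[1].Value, CultureInfo.InvariantCulture)
        : (long?)null;
}

Console.WriteLine(ExtractTreeId("/su/betting/Basketball+-+6"));      // 6
Console.WriteLine(ExtractTreeId("UEFA+Champions+League+-+21255"));   // 21255
Console.WriteLine(ExtractTreeId("/su/live") is null);                // True
```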
### 1.5 Results / archive — **NOT publicly available**
- `https://www.marathonbet.by/su/results` → **HTTP 404**.
- `https://www.marathonbet.by/su/results/` → **HTTP 404**.
- `https://www.marathonbet.by/su/results.htm` → **HTTP 404**.
- No `/results`, `/archive`, or `/history` link anywhere in the public landing-page HTML.
- The `eventJsonInfo` `<td>` on each event has a `matchIsComplete` boolean and a
`resultDescription` (e.g., `"2:1 (1:1)"`), so **final scores can be captured by
re-scraping the event detail page after match end** — but only while the event is
still hosted (likely a few hours / days post-match). After cleanup, results are gone.
- **Implication for Phase 8 (Results loader):** results must be harvested by
continuing to poll the event detail page until `matchIsComplete=true`, then storing
the final score. There is no historical archive endpoint to back-fill from. We
should also evaluate scraping a third-party results aggregator
(flashscore, livescore, sofascore) as a fallback — that's a Phase 8 design decision.
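The polling approach could look roughly like the sketch below. All names are hypothetical: `EventState` stands in for whatever Phase 8 parses out of `eventJsonInfo`, and `fetchState` abstracts "download the event detail page and read `matchIsComplete` / `resultDescription`". A `null` state models the event no longer being hosted.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Fake fetcher for demonstration: the match completes on the third poll.
int calls = 0;
EventState final = await ResultPoller.WaitForFinalAsync(
    _ => Task.FromResult(++calls < 3
        ? new EventState(false, null)
        : new EventState(true, "2:1 (1:1)")),
    TimeSpan.FromMilliseconds(10),
    5,
    CancellationToken.None);

Console.WriteLine(final?.ResultDescription ?? "not finished"); // 2:1 (1:1)

// Hypothetical projection of the eventJsonInfo fields of interest.
public sealed record EventState(bool MatchIsComplete, string ResultDescription);

public static class ResultPoller
{
    // Polls until the match is complete, the event disappears, or attempts run out.
    public static async Task<EventState> WaitForFinalAsync(
        Func<CancellationToken, Task<EventState>> fetchState,
        TimeSpan interval,
        int maxAttempts,
        CancellationToken ct)
    {
        for (int attempt = 0; attempt < maxAttempts; attempt++)
        {
            EventState state = await fetchState(ct);
            if (state is null) return null;            // event no longer hosted
            if (state.MatchIsComplete) return state;   // final score available
            await Task.Delay(interval, ct);
        }
        return null; // gave up; caller may try a fallback source
    }
}
```

Persisting the returned state immediately means the final score survives the site's later cleanup of the event page.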
---
## 2. Anti-bot Posture
| Signal | Observation |
|---|---|
| Cloudflare | Absent. `Server: nginx`, no `cf-*` headers. |
| reCAPTCHA / hCAPTCHA | Not on public listing or event pages (only on `/captchaData.htm` for login). |
| User-Agent filtering | A browser UA returns 200. We did not test with `curl/8.x` or empty UA — recommend always sending a real UA. |
| Cookie requirement | None for read-only access. The site sets `puid`, `JSESSIONID`, `lastSitePart`, etc., but we observed full HTML on the very first request without prior cookies. |
| IP rate-limit | 5 sequential requests at ~1s pacing all returned 200 in <1 s. No throttling observed within our budget (10 total requests). The customer should test heavier loads from their environment. |
| Geo-block | Probe environment is geo-routed as Poland; site still serves `/su` Russian content. Customer (Belarus) should see same or better access. |
| Fingerprinting | Standard analytics (GTM, dataLayer); no JS-fingerprint cookies or canvas hashing detected in the entry-page payload. |
**Mitigations to bake into the scraper anyway** (defense-in-depth):
- **Rotate User-Agents** from a small pool of recent Chrome/Firefox/Edge versions
(configurable via `Scraping:UserAgents[]`).
- **Polite pacing:** default `Scraping:RateLimit:RequestsPerSecond = 1`,
`MaxConcurrentRequests = 4`. Per-host token-bucket rate limiter using Polly v8 +
`Microsoft.Extensions.Http.Resilience`.
- **Honor `Cache-Control: no-store`** — do NOT cache responses; that's the site's intent.
- **Handle 403 / 429 / 503** with exponential backoff and circuit breaker; alert the user
when circuit opens for >5 minutes.
- **Cookie jar per scraper instance** — accept set-cookies and replay them. This avoids
a session-creation latency on every request.
- **Belarus-specific:** if customer's environment ever sees a `/forbidden` redirect,
we fall back to the `afterForbiddenRedirectUrl` documented in `initData`.
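To make the backoff rule concrete, here is a hand-rolled sketch of exponential backoff with full jitter for the 403 / 429 / 503 cases. It is only an illustration of the arithmetic, using BCL types; in the real scraper a Polly v8 retry strategy would presumably take its place. The base delay, cap, and helper names are assumptions.

```csharp
using System;
using System.Net;

// Status codes the mitigation list treats as "back off and retry".
bool IsRetryable(HttpStatusCode status) =>
    status == HttpStatusCode.Forbidden ||
    status == HttpStatusCode.TooManyRequests ||
    status == HttpStatusCode.ServiceUnavailable;

// Upper bound for the wait before retry number `attempt` (0-based).
TimeSpan MaxBackoff(int attempt, TimeSpan baseDelay, TimeSpan cap) =>
    TimeSpan.FromMilliseconds(Math.Min(
        cap.TotalMilliseconds,
        baseDelay.TotalMilliseconds * Math.Pow(2, attempt)));

// Full jitter: a uniformly random wait between zero and the upper bound.
TimeSpan NextDelay(int attempt, TimeSpan baseDelay, TimeSpan cap, Random rng) =>
    TimeSpan.FromMilliseconds(
        rng.NextDouble() * MaxBackoff(attempt, baseDelay, cap).TotalMilliseconds);

var rng = new Random();
TimeSpan baseDelay = TimeSpan.FromSeconds(1);   // assumed default
TimeSpan cap = TimeSpan.FromSeconds(60);        // assumed ceiling

if (IsRetryable(HttpStatusCode.TooManyRequests))
{
    for (int attempt = 0; attempt < 4; attempt++)
    {
        TimeSpan wait = NextDelay(attempt, baseDelay, cap, rng);
        TimeSpan bound = MaxBackoff(attempt, baseDelay, cap);
        Console.WriteLine(
            $"attempt {attempt}: wait {wait.TotalMilliseconds:F0} ms " +
            $"(bound {bound.TotalMilliseconds:F0} ms)");
    }
}
```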
---
## 3. URL Templates Phase 3 Will Use
| Purpose | Template | Notes |
|---|---|---|
| Pre-match top page | `https://www.marathonbet.by/su/` | Mixed live + popular pre-match. Use only for landing/health-check. |
| Live top page | `https://www.marathonbet.by/su/live` | Mixed sports. Use for live-event discovery. |
| Live popular | `https://www.marathonbet.by/su/live/popular` | Same data as `/su/live`. |
| All-events index | `https://www.marathonbet.by/su/all-events/` | Long full list; use for discovery seed. |
| Sport listing (by ID) | `https://www.marathonbet.by/su/betting/{Sport}+-+{sportId}` | e.g., `/su/betting/Basketball+-+6`. **Preferred** because sport-id stable. |
| Sport listing (by name) | `https://www.marathonbet.by/su/popular/{Sport}` | e.g., `/su/popular/Basketball`. Convenient for humans. |
| Category / league listing | `https://www.marathonbet.by/su/betting/{Sport}/{League+Path}+-+{categoryTreeId}` | From breadcrumbs / `category-label-link`. |
| Event detail | `https://www.marathonbet.by/su/betting/{event-path}` | `event-path` from `data-event-path`, ends in `-+{treeId}`. |
| Live update signal | `https://www.marathonbet.by/su/liveupdate/popular/?treeIds={csv}` | Returns `{"modified":[...],"updated":<ts>}`. Use only as "hey something changed" hint; full odds still come from event-detail re-fetch. |
| Server time sync | `https://www.marathonbet.by/su/stateless/synctime` | Use to anchor "today" date interpretation. |
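URL paths use `+` for spaces, `%2C` for `,`, etc. This is form-style encoding, as produced by `System.Net.WebUtility.UrlEncode`; `Uri.EscapeDataString` would emit `%20` for spaces instead.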
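A quick comparison of the two BCL encoders on one path segment shows which of them reproduces the `+` style seen in the site's URLs. The decoded segment text is an assumption reconstructed from the `Clubs.+International+-+4409575` sample.

```csharp
using System;
using System.Net;

// Assumed decoded form of the breadcrumb segment seen in the spike.
string segment = "Clubs. International - 4409575";

// Form-style encoding: spaces become '+', matching the site's paths.
Console.WriteLine(WebUtility.UrlEncode(segment));
// Clubs.+International+-+4409575

// RFC 3986-style encoding: spaces become "%20", which does not match.
Console.WriteLine(Uri.EscapeDataString(segment));
// Clubs.%20International%20-%204409575
```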
---
## 4. Sport ID Inventory (observed)
From the pre-match landing page (`data-sport-treeId` attributes + `category-label`
breadcrumb hrefs):
| Sport ID | Russian name | English path |
|---|---|---|
| **6** | Баскетбол | `Basketball` |
| **11** | Футбол | `Football` |
| **537** | (TBD — verify on populated day) | — |
| **2398** | (TBD) | — |
| **22723** | Теннис | `Tennis` |
| **26418** | Футбол (alt? duplicate live) | `Football` |
| **43658** | Хоккей | `Hockey` |
| **45356** | Баскетбол (live tree) | `Basketball` |
| **139722** | Гандбол | `Handball` |
| **414329** | Настольный теннис | `Table+Tennis` |
| **1372932** | Киберспорт | `Esports` |
| **3083982** | Лотереи | `Lotteries` |
| **11308234** | Шорт хоккей | `Short+Hockey` |
| **23054364** | Кибербаскетбол | `eBasketball` |
| **23054392** | Киберфутбол | `eFootball` |
**Important observation:** the site has **two parallel tree IDs per sport** — one
"canonical" (e.g., `6` for Basketball) used on event-detail breadcrumb, and a
"category" tree ID (e.g., `45356`) used inside the live grouping. Phase 1 domain
needs to recognize the canonical ID as `SportCode` and ignore the category tree ID.
The customer-spec field `Sport_Code = 6` for Basketball matches the canonical ID
in `data-sport-treeId="6"` and in the breadcrumb URL `/su/betting/Basketball+-+6`.
---
## 5. Bet Selection Naming Convention
Format: `{eventId}@{MarketName}{LineIndex?}.{Outcome}`
Where:
- `eventId` = bookmaker's `data-event-eventId` (numeric, ~26-million range, stable).
- `MarketName` = `Match_Result`, `To_Win_Match_With_Handicap`, `Total_Points`,
`1st_Half_Result`, `To_Win_1st_Half_With_Handicap`, `1st_Set_Total_Games`, etc.
- `LineIndex?` = optional integer suffix when a market has multiple lines/spreads
(e.g., `Total_Points10`, `Total_Points11` are different total thresholds for the
same event). Empty / `0` is the "main" line.
- `Outcome` codes:
- `1`, `draw`, `3`: match-level 3-way result markets
- `RN_H`, `RN_D`, `RN_A`: period-level result markets (see `SCHEMA_DRAFT.md` §3.1)
- `HB_H`, `HB_A` — handicap home/away
- `Under_<X>`, `Over_<X>` — total under/over (X is the threshold, embedded in name)
- `HD`, `AD` — half-time/full-time draw combinations
- `yes` / `no` — for yes/no markets
The handicap value (`+1.0`, `-2.5`) and total threshold (`213.5`) are not exposed as
separate, parseable fields of the selection key. They appear either in the
`<span class="middle-simple">` display element or embedded in the outcome name (e.g., `Under_213.5`).
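The naming convention above can be parsed with plain string operations. The sketch assumes the first `.` after `@` separates market from outcome (so an outcome such as `Under_2.5` keeps its own dot) and that trailing digits on the market part are the line index. `SelectionKey`, `SelectionKeyParser`, and the sample key are illustrative, not taken from a capture.

```csharp
using System;
using System.Globalization;

SelectionKey key = SelectionKeyParser.TryParse("26456117@Total_Goals3.Under_2.5");
Console.WriteLine(key);
Console.WriteLine(SelectionKeyParser.TryGetThreshold(key.Outcome)); // 2.5
Console.WriteLine(SelectionKeyParser.TryParse("garbage") is null);  // True

public sealed record SelectionKey(long EventId, string Market, int? LineIndex, string Outcome);

// Hypothetical parser; returns null instead of throwing on malformed input.
public static class SelectionKeyParser
{
    public static SelectionKey TryParse(string raw)
    {
        int at = raw.IndexOf('@');
        int dot = at < 0 ? -1 : raw.IndexOf('.', at + 1);
        if (at <= 0 || dot < 0 || dot == raw.Length - 1) return null;
        if (!long.TryParse(raw.Substring(0, at), NumberStyles.None,
                CultureInfo.InvariantCulture, out long eventId)) return null;

        // Split "Total_Goals3" into market name and optional line index.
        string marketAndLine = raw.Substring(at + 1, dot - at - 1);
        int end = marketAndLine.Length;
        while (end > 0 && char.IsDigit(marketAndLine[end - 1])) end--;
        if (end == 0) return null;

        int? line = end < marketAndLine.Length
            ? int.Parse(marketAndLine.Substring(end), CultureInfo.InvariantCulture)
            : (int?)null;

        return new SelectionKey(
            eventId, marketAndLine.Substring(0, end), line, raw.Substring(dot + 1));
    }

    // Extracts X from "Under_<X>" / "Over_<X>"; null for other outcomes.
    public static decimal? TryGetThreshold(string outcome)
    {
        foreach (string prefix in new[] { "Under_", "Over_" })
        {
            if (outcome.StartsWith(prefix, StringComparison.Ordinal) &&
                decimal.TryParse(outcome.Substring(prefix.Length), NumberStyles.Number,
                    CultureInfo.InvariantCulture, out decimal value))
            {
                return value;
            }
        }
        return null;
    }
}
```

One caveat of this approach: a market name that itself ends in a digit would be misread as carrying a line index, so Phase 3 should check the real vocabulary before relying on it.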
---
## 6. Period Scope per Sport (observed)
| Sport | Period scopes available | Spec field prefix |
|---|---|---|
| Football (11) | 1st Half, 2nd Half | `Bet_Period-1_*`, `Bet_Period-2_*` |
| Basketball (6) | 1st/2nd Half, 1st/2nd/3rd/4th Quarter | Customer must clarify whether Period-N maps to halves or quarters. **Recommend halves** as default (Period-1, Period-2) with an `appsettings` toggle for quarter-mode. |
| Tennis (22723) | 1st Set, 2nd Set, ... (variable count) | `Bet_Period-1_*` = 1st Set, etc. **No Draw outcome.** |
| Hockey (43658) | 1st/2nd/3rd Period | `Bet_Period-1_*`, `Bet_Period-2_*`, `Bet_Period-3_*` (not yet sampled — revalidate in Phase 3). |
The internal market-name token is sport-dependent:
- Halves (football, basketball): `1st_Half_Result`, `To_Win_1st_Half_With_Handicap`
- Quarters (basketball): `1st_Quarter_Result`, `To_Win_1st_Quarter_With_Handicap`
- Sets (tennis): `1st_Set_Result`, `To_Win_1st_Set_With_Handicap`
**Phase 3 should encapsulate this** in a sport-aware mapping table
(`PeriodScopeMapper`) keyed on `SportCode`, returning the set of expected period
markets and their token names.
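One possible shape for such a mapper is sketched below. The sport codes and token patterns come from the tables above; the method names, the five-set ceiling for tennis, and the hockey `Period` token (not yet sampled) are assumptions to be verified in Phase 3.

```csharp
using System;

Console.WriteLine(PeriodScopeMapper.ResultMarket(11, 1));          // 1st_Half_Result
Console.WriteLine(PeriodScopeMapper.HandicapMarket(6, 4, true));   // To_Win_4th_Quarter_With_Handicap
Console.WriteLine(PeriodScopeMapper.ResultMarket(11, 3) is null);  // True (football has two halves)

// Hypothetical sketch of the sport-aware mapping table.
public static class PeriodScopeMapper
{
    private static readonly string[] Ordinals = { "1st", "2nd", "3rd", "4th", "5th" };

    // Resolves the period unit token and the maximum period count per sport.
    private static bool TryGetScope(int sportCode, bool basketballQuarters,
                                    out string unit, out int maxPeriods)
    {
        switch (sportCode)
        {
            case 11:    unit = "Half"; maxPeriods = 2; return true;    // Football
            case 6:                                                     // Basketball
                unit = basketballQuarters ? "Quarter" : "Half";
                maxPeriods = basketballQuarters ? 4 : 2;
                return true;
            case 22723: unit = "Set"; maxPeriods = 5; return true;     // Tennis (assumed max)
            case 43658: unit = "Period"; maxPeriods = 3; return true;  // Hockey (token unverified)
            default:    unit = ""; maxPeriods = 0; return false;
        }
    }

    // e.g. "1st_Half_Result"; null when the sport or period is not mapped.
    public static string ResultMarket(int sportCode, int period, bool basketballQuarters = false) =>
        TryGetScope(sportCode, basketballQuarters, out string unit, out int max)
            && period >= 1 && period <= max
            ? $"{Ordinals[period - 1]}_{unit}_Result"
            : null;

    // e.g. "To_Win_1st_Half_With_Handicap"; null when not mapped.
    public static string HandicapMarket(int sportCode, int period, bool basketballQuarters = false) =>
        TryGetScope(sportCode, basketballQuarters, out string unit, out int max)
            && period >= 1 && period <= max
            ? $"To_Win_{Ordinals[period - 1]}_{unit}_With_Handicap"
            : null;
}
```

Returning `null` for unmapped sports or out-of-range periods keeps the caller on the "emit null, never throw" path.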
---
## 7. Open Questions / Risks
1. **Results storage cleanup:** how long does marathonbet keep finished events on
the event detail URL? Must be empirically tested over Phase 8. Recommend retaining
our own snapshot with `matchIsComplete=true` permanently in SQLite as soon as
we observe it, so we never depend on the site for historical data.
2. **Sport ID duplication** (e.g., `26418` and `11` both = Football):
verify with customer that we should use the canonical breadcrumb ID. The
"category" trees may exist for live grouping or alphabetization purposes.
3. **Localization:** site labels are Russian on `/su/`. There appears to be `/en/`
path support (untested). Customer wants RU + EN — Phase 5 must verify EN locale
page parses identically.
4. **Period total markets in basketball:** sampled NBA event did NOT explicitly
expose "Total points 1st quarter" as a clean market in the public HTML — only
`AllInningsGoalsOver` (combined). Customer's spec implies `Bet_Period-N_Total_*`
is universal — Phase 3 must gracefully degrade and emit `null` rates for fields
the site doesn't surface for that sport+league.
5. **Belarus geo-restriction risk:** we tested from non-BY. If customer's BY IP
gets a different page (KYC overlay, deposit prompt, etc.), the parser must be
robust to unexpected wrapping. Defensive parsing only — never assume strict
structure.
6. **`isLogged: false` overlay risk:** initData reports we are anonymous. Some
markets may be hidden behind login (we did not detect any in samples, but the
parser should treat missing markets as `null`, not throw).
---
## 8. Recommended Phase 3 Architecture
```
IOddsScraper (Application)
└── MarathonBetScraper : IOddsScraper (Infrastructure)
├── HttpClient (resilient via Polly v8)
│ ├── User-Agent rotator
│ ├── Token-bucket rate limiter (config: RequestsPerSecond)
│ ├── Retry policy (3x exponential backoff, jitter)
│ └── Circuit breaker (open after N consecutive 5xx)
├── EventDiscoveryParser ← parses /su/, /su/live, /su/popular/{sport}
│ produces List<EventListItem>
├── EventDetailParser ← parses /su/betting/<path>
│ produces FullOddsSnapshot with all markets
├── BreadcrumbParser ← extracts Sport / Country / League / Stage taxonomy
└── BetMarketMapper ← AngleSharp QuerySelector → spec field name
(sport-aware; uses PeriodScopeMapper)
```
**Use AngleSharp for parsing** — it handles malformed HTML well, has a CSS-selector
API, and is the established `.NET` choice. JSON islands inside attributes (`data-sel`,
`data-json`) decode cleanly with `System.Text.Json`.
**No Playwright required** for the scraper. Keep Playwright as a documented
fallback in `appsettings` (`Scraping:UsePlaywright = false`) so we can flip it on
later if the site adds JS challenges. This adds <100 LOC of optional code and costs
nothing if unused.
---
## 9. Customer Validation Plan
If our environment ever stops working (geo-block, IP ban, etc.) the customer in
Belarus can:
1. Open https://www.marathonbet.by/su in a browser, verify it renders.
2. View page source (Ctrl+U), search for `data-event-eventId` — confirm same
structure as our captured `spike/captures/pre-match-landing.html`.
3. Save the HTML and email it to dev — the parser is environment-agnostic and
should handle their captured HTML byte-for-byte.
This decouples scraper development from probe environment and makes Phase 3
testable offline.
---
## 10. Captured Samples (gitignored, local only)
| File | Purpose |
|---|---|
| `spike/captures/pre-match-landing.html` | `/su/` snapshot, 587 KB, full grid |
| `spike/captures/live-landing.html` | `/su/live` snapshot, 250 KB |
| `spike/captures/basketball-listing.html` | `/su/popular/Basketball`, 471 KB |
| `spike/captures/event-basketball-28405506.html` | NBA Knicks vs 76ers full event, 505 KB |
| `spike/captures/event-football-28089645.html` | UCL Arsenal vs Atletico full event, 1.58 MB |
| `spike/captures/event-tennis-28430484.html` | ATP Rome qualifying full event, 244 KB |
| `spike/captures/liveupdate-popular.json` | Live-update API sample response |
| `spike/captures/results-page.html` | `/su/results` response (~20 KB) — captured to evidence the missing public archive endpoint (Phase 8 deviation). |
These artifacts are **not committed** but should be kept locally to back parser unit
tests in Phase 3.
> **Caveats on captures:**
>
> - `live-landing.html` was captured at a moment when no live events were
> in-progress for popular sports. As a result, the `.score-state` element
> referenced in `SCHEMA_DRAFT.md` §1 is NOT present in this particular capture.
> Phase 3 should re-verify the score selector against a live event during
> parser implementation (the selector itself is well-known across bookmaker
> sites and not in doubt).
> - Hockey events were not sampled directly. Period-result selection key tokens
> for hockey (`1st_Period_Result0.RN_H` etc.) are extrapolated from the
> football/basketball/tennis pattern and marked TBD in `SCHEMA_DRAFT.md`. Phase 3
> must verify against a real hockey event before relying on those tokens.