alexei.dolgolyov 070e34b911 feat(initial-implementation): phase 0 - scraping spike findings
Anonymous scraping confirmed feasible for marathonbet.by — site is fully SSR
(nginx), no Cloudflare or JS challenge. HttpClient + AngleSharp + Polly v8 is
sufficient; Playwright not required (kept as a future-flag).

Spike outputs:
- spike/SCRAPE_FINDINGS.md  — page rendering, URL templates, anti-bot, rate
  limits, recommended scraping strategy for Phase 3.
- spike/SCHEMA_DRAFT.md     — customer-spec field → DOM selector mapping for
  Match + Period-N scope across football/basketball/tennis (hockey TBD).

Phase 1+ handoff captured in subplan + CLAUDE.md. Critical Phase 8 finding:
no public results endpoint at /su/results — phase 8 must switch to polling
event-detail until eventJsonInfo.matchIsComplete=true (deviation flagged).

Reviewer notes addressed:
- Period market outcome codes corrected to RN_H/RN_D/RN_A (not 1/draw/3) and
  market name vocabulary clarified per-sport in SCHEMA_DRAFT §3.1.
- results-page.html capture added to file list with caveat about live-landing
  score-state and unsampled hockey selectors.
2026-05-05 01:04:03 +03:00

# Phase 0 Spike — Scraping Findings for marathonbet.by
**Date:** 2026-05-05
**Probe environment:** Windows 10, Poland-routed IP (countryCode `PL` reported by site,
`isBelarus: true` flag set in `initData`, `jurisdiction: BELARUS`).
**Tooling used:** `curl` with browser User-Agent, ~10 sequential requests with
≥1-second pacing.
---
## TL;DR — Decision Matrix
| Question | Answer |
|---|---|
| Is anonymous scraping feasible? | **YES — confirmed.** Site returns full server-rendered HTML for `/su/`, `/su/live`, sport listings, and event detail pages with HTTP 200 to a plain GET with browser User-Agent. |
| Cloudflare / JS challenge? | **No.** `Server: nginx`, no `cf-ray`, no challenge cookies. Only standard JSESSIONID + analytics cookies. No reCAPTCHA on listing pages. |
| Geo-block from probe environment? | **No.** Probe was made from a non-Belarus IP; site served full HTML. The site treats us as `region:"PL"` but still serves Russian-language `/su` content. |
| Recommended scraping technology | **HttpClient + AngleSharp.** All the data needed (event list, full odds, breadcrumb taxonomy, period markets) is present in the raw SSR HTML. Playwright is not required for read-only scraping. |
| Recommended polling cadence | Pre-match: **30 seconds** (default in `appsettings`). Live: the site's native 3-second cadence is too aggressive; we recommend **5–10 seconds** for our analyzer (anomaly detection does not need finer resolution). |
| WebSocket / API alternative? | STOMP-over-WebSocket exists at `/su/websocket/endpoint` for authenticated clients. Anonymous clients should stick to plain HTML scraping. The JSONP endpoint at `/su/liveupdate/popular/` only returns refresh-page signals, not full odds. |
---
## 1. Probe Outcomes
### 1.1 Pre-match landing — `https://www.marathonbet.by/su`
```
HTTP/1.1 200 OK
Server: nginx
Content-Type: text/html;charset=UTF-8
Set-Cookie: visitedNavBarItems=HOME; HttpOnly; SameSite=None; Secure
Set-Cookie: lastSitePart=SPORT; ...
Set-Cookie: puid=rBWP3Wn5...; expires=2037; domain=.marathonbet.by
Strict-Transport-Security: max-age=31536000
Cache-Status: MISS
Cache-Control: no-store, no-cache, must-revalidate
```
- **Render type:** Server-Side Rendered (SSR). Body is ~590 KB of HTML containing
the full event grid for live + popular pre-match events. There IS a `<div id="app">`
wrapper but the content inside is fully populated server-side; the JS layer enhances
rather than hydrates from empty.
- **Rich data attributes embedded:**
- `data-event-eventId="<bookmakerEventCode>"` — bookmaker's stable numeric event ID
- `data-event-treeId="<treeId>"` — tree position ID (used in URLs)
- `data-event-name="..."` — event display name
- `data-event-path="<sport>/<league-path>/<teams> - <treeId>"` — URL fragment to
construct event detail link
- `data-live="true|false"` — live vs pre-match flag
- `data-sport-treeId="<sportId>"` — sport identifier (matches customer's "Sport_Code")
- `data-coeff-uuid` + `data-sel='{...}'` JSON — selection metadata (ewc, cid, prt, epr)
- `data-selection-key="<eventId>@<MarketType>[N].<Outcome>"` — canonical bet identifier
- **Embedded `initData` JSON blob** (line 6 of every page) exposes runtime config:
- `serverTime: "2026,05,05,00,43,28"` (Moscow TZ)
- `liveUpdatePath: "/su/liveupdate/popular/"`
- `liveUpdateTransport: "JSONP"`
- `update_interval: 3000` (ms — live update polling cadence used by the site itself)
- `stomp.url: "/su/websocket/endpoint"` (authenticated stream)
- `region`, `isBelarus`, `jurisdiction`, `currencyCode` — geo/legal flags
- `treeIds` — for the event detail page, holds the focal treeId
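The `serverTime` value is a comma-separated timestamp rather than ISO 8601. A minimal Python sketch of decoding it follows (the Phase 3 code itself would be .NET). It assumes the field order is year, month, day, hour, minute, second and that the zone is a fixed UTC+3 offset; the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: serverTime is expressed in Moscow time (UTC+3, no DST).
SERVER_TZ = timezone(timedelta(hours=3))


def parse_server_time(raw: str) -> datetime:
    """Decode initData.serverTime, e.g. "2026,05,05,00,43,28"."""
    parts = [int(p) for p in raw.split(",")]
    if len(parts) != 6:
        raise ValueError(f"unexpected serverTime format: {raw!r}")
    year, month, day, hour, minute, second = parts
    return datetime(year, month, day, hour, minute, second, tzinfo=SERVER_TZ)
```

Failing loudly on an unexpected field count is deliberate: a silent misparse here would shift every "today" date derived from it.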
### 1.2 Live landing — `https://www.marathonbet.by/su/live`
- HTTP 200, ~250 KB body — same `nginx` server, same SSR pattern.
- Same `data-event-*` attributes as pre-match. Live events show `data-live="true"`,
with extra `score-state` and `time` markers (e.g., `2:1 (1:1)`, `83:30`).
- The site polls `/su/liveupdate/popular/?treeIds=...` every 3 s but the response
is just a refresh signal (`{"modified":[{"type":"refreshPage"}],"updated":...}`).
**The site relies on full HTML re-fetch for live updates**, which is good for us
(no separate JSON contract to track).
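A hedged Python sketch of how a scraper might interpret that refresh signal. It assumes any JSONP wrapper has already been stripped so the payload is plain JSON, and it treats an unreadable payload as "re-fetch" to stay on the safe side; `needs_refetch` is a hypothetical helper name.

```python
import json


def needs_refetch(payload: str) -> bool:
    """Return True when the live-update response asks for a page refresh."""
    try:
        data = json.loads(payload)
    except json.JSONDecodeError:
        # Conservative choice: an unreadable signal triggers a re-fetch.
        return True
    if not isinstance(data, dict):
        return True
    modified = data.get("modified") or []
    return any(
        isinstance(item, dict) and item.get("type") == "refreshPage"
        for item in modified
    )
```

Only the `refreshPage` type was observed in the spike; other `type` values, if they exist, are ignored by this sketch.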
### 1.3 Sport-specific listing — `/su/popular/Basketball` / `/su/betting/Basketball+-+6`
- HTTP 200, ~470 KB.
- Lists all current basketball categories (NBA Playoffs etc.) with full odds.
- URL by name (`Basketball`) and URL by sport tree ID (`Basketball+-+6`) both work.
- Date display: events on the same day show **time only** (`03:00`); events on
later days show **`DD <month-ru> HH:MM`** (e.g., `06 мая 02:00`). The "today"
anchor is implicit — must be derived from `initData.serverTime`.
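The date rule above can be sketched in Python as follows. Only `мая` was observed in the spike; the other month names are the standard Russian genitive forms and are an assumption about what the site prints, as is the year-rollover rule. `resolve_event_time` is a hypothetical name, and `server_now` is expected to come from `initData.serverTime`.

```python
import re
from datetime import datetime, timedelta

# Russian genitive month names, as used after a day number.
MONTHS_RU = {
    "января": 1, "февраля": 2, "марта": 3, "апреля": 4,
    "мая": 5, "июня": 6, "июля": 7, "августа": 8,
    "сентября": 9, "октября": 10, "ноября": 11, "декабря": 12,
}


def resolve_event_time(label: str, server_now: datetime) -> datetime:
    """Turn "03:00" or "06 мая 02:00" into a full datetime."""
    label = label.strip()
    m = re.fullmatch(r"(\d{1,2}):(\d{2})", label)
    if m:
        # Time only: the event is on the server's current day.
        return server_now.replace(
            hour=int(m[1]), minute=int(m[2]), second=0, microsecond=0
        )
    m = re.fullmatch(r"(\d{1,2})\s+(\S+)\s+(\d{1,2}):(\d{2})", label)
    if not m or m[2].lower() not in MONTHS_RU:
        raise ValueError(f"unrecognized date label: {label!r}")
    day, month = int(m[1]), MONTHS_RU[m[2].lower()]
    candidate = server_now
    # Assumption: listings show upcoming events only, so a date more than
    # a day in the past is taken to belong to the next calendar year.
    for year in (server_now.year, server_now.year + 1):
        candidate = datetime(
            year, month, day, int(m[3]), int(m[4]), tzinfo=server_now.tzinfo
        )
        if candidate >= server_now - timedelta(days=1):
            break
    return candidate
```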
### 1.4 Event detail — `/su/betting/<event-path>`
- HTTP 200, ~500 KB to ~1.6 MB depending on market count.
- URL pattern: `/su/betting/<Sport>/<League+Path>/<Sub+Stage>/<Team1+vs+Team2+-+<treeId>>`.
- Exposes ~140–250 unique market types per event. Each market is a `<div>` containing
a labeled `<table>` of selections with `data-selection-key`, prices, and handicap/total
values in `<span class="middle-simple">`.
- **Schema.org breadcrumb** at the bottom of the page provides clean taxonomy:
Sport → Country/Group → League → Stage → Event. Each level has its own treeId visible
in `href="/su/betting/<path>+-+<treeId>"`.
- Sample (Football, Arsenal vs Atletico Madrid, treeId 28089645, eventId 26456117):
- Sport = `Football+-+11`, Country group = `Clubs.+International+-+4409575`,
League = `UEFA+Champions+League+-+21255`, Stage = `Play-Offs / Semi+Final / 2nd+Leg`.
- Match-level markets: `Match_Result.{1,draw,3}`, `To_Win_Match_With_Handicap{N}.{HB_H,HB_A}`,
`Total_Goals{N}.{Under_X,Over_X}`.
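Because every breadcrumb level ends in `+-+<treeId>`, the ID and the human-readable label can be recovered from the href alone. A small Python sketch, assuming the suffix is always the last thing in the path (both helper names are hypothetical):

```python
import re
from typing import Optional
from urllib.parse import unquote_plus

# Matches the trailing "+-+<digits>" suffix of a breadcrumb or event href.
_TREE_ID_SUFFIX = re.compile(r"\+-\+(\d+)/?$")


def tree_id_from_href(href: str) -> Optional[int]:
    """Return the numeric treeId at the end of the href, or None."""
    m = _TREE_ID_SUFFIX.search(href)
    return int(m.group(1)) if m else None


def label_from_href(href: str) -> str:
    """Return the decoded last path segment without its treeId suffix."""
    last_segment = href.rstrip("/").rsplit("/", 1)[-1]
    return unquote_plus(_TREE_ID_SUFFIX.sub("", last_segment))
```

For example, `/su/betting/Football+-+11` yields tree ID `11` and label `Football`.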
### 1.5 Results / archive — **NOT publicly available**
- `https://www.marathonbet.by/su/results` → **HTTP 404**.
- `https://www.marathonbet.by/su/results/` → **HTTP 404**.
- `https://www.marathonbet.by/su/results.htm` → **HTTP 404**.
- No `/results`, `/archive`, or `/history` link anywhere in the public landing-page HTML.
- The `eventJsonInfo` `<td>` on each event has a `matchIsComplete` boolean and a
`resultDescription` (e.g., `"2:1 (1:1)"`), so **final scores can be captured by
re-scraping the event detail page after match end** — but only while the event is
still hosted (likely a few hours / days post-match). After cleanup, results are gone.
- **Implication for Phase 8 (Results loader):** results must be harvested by
continuing to poll the event detail page until `matchIsComplete=true`, then storing
the final score. There is no historical archive endpoint to back-fill from. We
should also evaluate scraping a third-party results aggregator
(flashscore, livescore, sofascore) as a fallback — that's a Phase 8 design decision.
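A minimal Python sketch of the poll-until-complete idea, not a Phase 8 design. It assumes a caller-supplied `fetch_event_info` that returns the decoded `eventJsonInfo` as a dict, or `None` once the event page is no longer hosted; the interval and poll limit are placeholder values.

```python
import time
from typing import Callable, Optional


def wait_for_final_score(
    fetch_event_info: Callable[[], Optional[dict]],
    poll_seconds: float = 60.0,   # placeholder cadence, not a measured value
    max_polls: int = 240,         # placeholder upper bound
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Poll until matchIsComplete is true; return resultDescription or None."""
    for attempt in range(max_polls):
        info = fetch_event_info()
        if info is None:
            # Page already cleaned up: the result cannot be recovered here.
            return None
        if info.get("matchIsComplete") is True:
            return info.get("resultDescription")
        if attempt < max_polls - 1:
            sleep(poll_seconds)
    return None
```

Returning `None` for a vanished page gives the caller a clear point at which to try a fallback source.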
---
## 2. Anti-bot Posture
| Signal | Observation |
|---|---|
| Cloudflare | Absent. `Server: nginx`, no `cf-*` headers. |
| reCAPTCHA / hCAPTCHA | Not on public listing or event pages (only on `/captchaData.htm` for login). |
| User-Agent filtering | A browser UA returns 200. We did not test with `curl/8.x` or empty UA — recommend always sending a real UA. |
| Cookie requirement | None for read-only access. The site sets `puid`, `JSESSIONID`, `lastSitePart`, etc., but we observed full HTML on the very first request without prior cookies. |
| IP rate-limit | 5 sequential requests at ~1s pacing all returned 200 in <1 s. No throttling observed within our budget (10 total requests). The customer should test heavier loads from their environment. |
| Geo-block | Probe environment is geo-routed as Poland; site still serves `/su` Russian content. Customer (Belarus) should see same or better access. |
| Fingerprinting | Standard analytics (GTM, dataLayer); no JS-fingerprint cookies or canvas hashing detected in the entry-page payload. |
**Mitigations to bake into the scraper anyway** (defense-in-depth):
- **Rotate User-Agents** from a small pool of recent Chrome/Firefox/Edge versions
(configurable via `Scraping:UserAgents[]`).
- **Polite pacing:** default `Scraping:RateLimit:RequestsPerSecond = 1`,
`MaxConcurrentRequests = 4`. Per-host token-bucket rate limiter using Polly v8 +
`Microsoft.Extensions.Http.Resilience`.
- **Honor `Cache-Control: no-store`** — do NOT cache responses; that's the site's intent.
- **Handle 403 / 429 / 503** with exponential backoff and circuit breaker; alert the user
when circuit opens for >5 minutes.
- **Cookie jar per scraper instance** — accept set-cookies and replay them. This avoids
a session-creation latency on every request.
- **Belarus-specific:** if customer's environment ever sees a `/forbidden` redirect,
we fall back to the `afterForbiddenRedirectUrl` documented in `initData`.
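To make the pacing and backoff behavior concrete, here is a language-neutral sketch in Python. In the real scraper this logic would come from Polly v8 rather than hand-written code; the class and function names, the "full jitter" variant, and the default base and cap values are illustrative assumptions.

```python
import random
import time
from typing import Callable


class TokenBucket:
    """Single-threaded token bucket: `rate` tokens per second, up to `capacity`."""

    def __init__(self, rate: float, capacity: int = 1,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.clock = clock
        self.last = clock()

    def try_acquire(self) -> bool:
        """Refill based on elapsed time, then take one token if available."""
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                  rng: Callable[[], float] = random.random) -> float:
    """Exponential backoff with full jitter: uniform in [0, min(cap, base * 2**attempt))."""
    return rng() * min(cap, base * (2 ** attempt))
```

With `rate=1` and `capacity=1` the bucket mirrors the `RequestsPerSecond = 1` default: a second request inside the same second is refused until the clock advances.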
---
## 3. URL Templates Phase 3 Will Use
| Purpose | Template | Notes |
|---|---|---|
| Pre-match top page | `https://www.marathonbet.by/su/` | Mixed live + popular pre-match. Use only for landing/health-check. |
| Live top page | `https://www.marathonbet.by/su/live` | Mixed sports. Use for live-event discovery. |
| Live popular | `https://www.marathonbet.by/su/live/popular` | Same data as `/su/live`. |
| All-events index | `https://www.marathonbet.by/su/all-events/` | Long full list; use for discovery seed. |
| Sport listing (by ID) | `https://www.marathonbet.by/su/betting/{Sport}+-+{sportId}` | e.g., `/su/betting/Basketball+-+6`. **Preferred** because sport-id stable. |
| Sport listing (by name) | `https://www.marathonbet.by/su/popular/{Sport}` | e.g., `/su/popular/Basketball`. Convenient for humans. |
| Category / league listing | `https://www.marathonbet.by/su/betting/{Sport}/{League+Path}+-+{categoryTreeId}` | From breadcrumbs / `category-label-link`. |
| Event detail | `https://www.marathonbet.by/su/betting/{event-path}` | `event-path` from `data-event-path`, ends in `-+{treeId}`. |
| Live update signal | `https://www.marathonbet.by/su/liveupdate/popular/?treeIds={csv}` | Returns `{"modified":[...],"updated":<ts>}`. Use only as "hey something changed" hint; full odds still come from event-detail re-fetch. |
| Server time sync | `https://www.marathonbet.by/su/stateless/synctime` | Use to anchor "today" date interpretation. |
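URL paths use `+` for spaces, `%2C` for `,`, etc. Note that `Uri.EscapeDataString` emits `%20` for spaces, so Phase 3 should either use `WebUtility.UrlEncode` (which emits `+`) or replace `%20` with `+` after escaping each path segment.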
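A sketch of the `+`-for-space convention, using Python's `urllib.parse.quote_plus` on a single path segment. The helper name is hypothetical, and the sketch assumes the sport name and ID are joined as `<name> - <id>` before encoding, which is what the observed `Basketball+-+6` decodes to.

```python
from urllib.parse import quote_plus

BASE_URL = "https://www.marathonbet.by/su"


def sport_listing_url(sport: str, sport_id: int) -> str:
    """Build the by-ID sport listing URL, e.g. .../betting/Basketball+-+6."""
    # quote_plus turns spaces into "+" and "," into "%2C"; "-" is left as-is.
    segment = quote_plus(f"{sport} - {sport_id}")
    return f"{BASE_URL}/betting/{segment}"
```

Encoding is applied per segment so that `/` separators between segments are never escaped.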
---
## 4. Sport ID Inventory (observed)
From the pre-match landing page (`data-sport-treeId` attributes + `category-label`
breadcrumb hrefs):
| Sport ID | Russian name | English path |
|---|---|---|
| **6** | Баскетбол | `Basketball` |
| **11** | Футбол | `Football` |
| **537** | (TBD — verify on populated day) | — |
| **2398** | (TBD) | — |
| **22723** | Теннис | `Tennis` |
| **26418** | Футбол (alt? duplicate live) | `Football` |
| **43658** | Хоккей | `Hockey` |
| **45356** | Баскетбол (live tree) | `Basketball` |
| **139722** | Гандбол | `Handball` |
| **414329** | Настольный теннис | `Table+Tennis` |
| **1372932** | Киберспорт | `Esports` |
| **3083982** | Лотереи | `Lotteries` |
| **11308234** | Шорт хоккей | `Short+Hockey` |
| **23054364** | Кибербаскетбол | `eBasketball` |
| **23054392** | Киберфутбол | `eFootball` |
**Important observation:** the site has **two parallel tree IDs per sport** — one
"canonical" (e.g., `6` for Basketball) used on event-detail breadcrumb, and a
"category" tree ID (e.g., `45356`) used inside the live grouping. Phase 1 domain
needs to recognize the canonical ID as `SportCode` and ignore the category tree ID.
The customer-spec field `Sport_Code = 6` for Basketball matches the canonical ID
in `data-sport-treeId="6"` and in the breadcrumb URL `/su/betting/Basketball+-+6`.
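One way to express the "recognize canonical, ignore category" rule is a small alias table, sketched here in Python. The `45356 → 6` pair comes from the inventory above; `26418 → 11` is tentative (the inventory marks it with a question mark) and should be confirmed before use. The names are hypothetical.

```python
# Category/live tree ID -> canonical SportCode.
# 26418 -> 11 is unconfirmed; see the inventory table's "alt?" note.
CATEGORY_TO_CANONICAL = {
    45356: 6,    # Basketball (live tree)
    26418: 11,   # Football (tentative)
}


def to_sport_code(tree_id: int) -> int:
    """Map a category tree ID to its canonical SportCode; pass others through."""
    return CATEGORY_TO_CANONICAL.get(tree_id, tree_id)
```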
---
## 5. Bet Selection Naming Convention
Format: `{eventId}@{MarketName}{LineIndex?}.{Outcome}`
Where:
- `eventId` = bookmaker's `data-event-eventId` (numeric, ~26-million range, stable).
- `MarketName` = `Match_Result`, `To_Win_Match_With_Handicap`, `Total_Points`,
`1st_Half_Result`, `To_Win_1st_Half_With_Handicap`, `1st_Set_Total_Games`, etc.
- `LineIndex?` = optional integer suffix when a market has multiple lines/spreads
(e.g., `Total_Points10`, `Total_Points11` are different total thresholds for the
same event). Empty / `0` is the "main" line.
- `Outcome` codes:
- `1`, `draw`, `3`: match-level 3-way result markets (e.g., `Match_Result`)
- `RN_H`, `RN_D`, `RN_A`: period-level result markets (see `SCHEMA_DRAFT.md` §3.1)
- `HB_H`, `HB_A`: handicap home/away
- `Under_<X>`, `Over_<X>`: total under/over (X is the threshold, embedded in name)
- `HD`, `AD`: half-time/full-time draw combinations
- `yes` / `no`: yes/no markets
The handicap value (`+1.0`, `-2.5`) and total threshold (`213.5`) are NOT in the
selection key as parseable numbers — they live in the `<span class="middle-simple">`
display element OR they are embedded in the outcome name (e.g., `Under_213.5`).
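A Python sketch of splitting a selection key into its parts, under two assumptions: market names never contain `.`, and any trailing digits on the market token are the line index (so a market whose real name ends in a digit would be mis-split). The class and function names are hypothetical and the sample keys below are illustrative.

```python
import re
from dataclasses import dataclass
from typing import Optional

# {eventId}@{MarketName}{LineIndex?}.{Outcome}; the outcome may itself contain ".".
_KEY = re.compile(
    r"^(?P<event>\d+)@(?P<market>[^.]+?)(?P<line>\d*)\.(?P<outcome>.+)$"
)
_THRESHOLD = re.compile(r"^(?:Under|Over)_(\d+(?:\.\d+)?)$")


@dataclass(frozen=True)
class SelectionKey:
    event_id: int
    market: str
    line_index: Optional[int]   # None when the key carries no suffix
    outcome: str
    threshold: Optional[float]  # only for Under_/Over_ outcomes


def parse_selection_key(key: str) -> Optional[SelectionKey]:
    """Parse a data-selection-key value; return None if it does not match."""
    m = _KEY.match(key)
    if not m:
        return None
    line = m.group("line")
    outcome = m.group("outcome")
    t = _THRESHOLD.match(outcome)
    return SelectionKey(
        event_id=int(m.group("event")),
        market=m.group("market"),
        line_index=int(line) if line else None,
        outcome=outcome,
        threshold=float(t.group(1)) if t else None,
    )
```

Usage: `parse_selection_key("26456117@Match_Result.1")` gives market `Match_Result` with no line index, while a key such as `...@Total_Points10.Under_213.5` gives line index `10` and threshold `213.5`. Handicap values still have to be read from the `middle-simple` span.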
---
## 6. Period Scope per Sport (observed)
| Sport | Period scopes available | Spec field prefix |
|---|---|---|
| Football (11) | 1st Half, 2nd Half | `Bet_Period-1_*`, `Bet_Period-2_*` |
| Basketball (6) | 1st/2nd Half, 1st/2nd/3rd/4th Quarter | Customer must clarify whether Period-N maps to halves or quarters. **Recommend halves** as default (Period-1, Period-2) with an `appsettings` toggle for quarter-mode. |
| Tennis (22723) | 1st Set, 2nd Set, ... (variable count) | `Bet_Period-1_*` = 1st Set, etc. **No Draw outcome.** |
| Hockey (43658) | 1st/2nd/3rd Period | `Bet_Period-1_*`, `Bet_Period-2_*`, `Bet_Period-3_*` (not yet sampled — revalidate in Phase 3). |
The internal market-name token is sport-dependent:
- `1st_Half_Result`, `To_Win_1st_Half_With_Handicap`
- `1st_Quarter_Result`, `To_Win_1st_Quarter_With_Handicap`
- `1st_Set_Result`, `To_Win_1st_Set_With_Handicap`
**Phase 3 should encapsulate this** in a sport-aware mapping table
(`PeriodScopeMapper`) keyed on `SportCode`, returning the set of expected period
markets and their token names.
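A rough Python sketch of what such a mapping table could look like (the production `PeriodScopeMapper` would be a .NET class). Only the `1st_*` tokens were observed; `2nd_*` and later tokens, the hockey `Period` unit, and the five-set cap for tennis are extrapolations that Phase 3 must verify.

```python
from typing import Dict, Optional, Tuple

ORDINALS = ["1st", "2nd", "3rd", "4th", "5th"]

# SportCode -> (period unit token, number of periods).
PERIOD_UNITS: Dict[int, Tuple[str, int]] = {
    11: ("Half", 2),       # Football
    6: ("Half", 2),        # Basketball, halves by default
    22723: ("Set", 5),     # Tennis; the cap of 5 is an assumption
    43658: ("Period", 3),  # Hockey; not sampled, extrapolated
}


def period_markets(sport_code: int, period: int,
                   quarter_mode: bool = False) -> Optional[Dict[str, str]]:
    """Return expected market tokens for Period-N, or None if out of scope."""
    unit = PERIOD_UNITS.get(sport_code)
    if sport_code == 6 and quarter_mode:
        unit = ("Quarter", 4)
    if unit is None or not 1 <= period <= unit[1]:
        return None
    token = f"{ORDINALS[period - 1]}_{unit[0]}"
    return {
        "result": f"{token}_Result",
        "handicap": f"To_Win_{token}_With_Handicap",
    }
```

Returning `None` for an unsupported sport or period lets the caller emit `null` rates instead of throwing.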
---
## 7. Open Questions / Risks
1. **Results storage cleanup:** how long does marathonbet keep finished events on
the event detail URL? Must be empirically tested over Phase 8. Recommend retaining
our own snapshot with `matchIsComplete=true` permanently in SQLite as soon as
we observe it, so we never depend on the site for historical data.
2. **Sport ID duplication** (e.g., `26418` and `11` both = Football):
verify with customer that we should use the canonical breadcrumb ID. The
"category" trees may exist for live grouping or alphabetization purposes.
3. **Localization:** site labels are Russian on `/su/`. There appears to be `/en/`
path support (untested). Customer wants RU + EN — Phase 5 must verify EN locale
page parses identically.
4. **Period total markets in basketball:** sampled NBA event did NOT explicitly
expose "Total points 1st quarter" as a clean market in the public HTML — only
`AllInningsGoalsOver` (combined). Customer's spec implies `Bet_Period-N_Total_*`
is universal — Phase 3 must gracefully degrade and emit `null` rates for fields
the site doesn't surface for that sport+league.
5. **Belarus geo-restriction risk:** we tested from non-BY. If customer's BY IP
gets a different page (KYC overlay, deposit prompt, etc.), the parser must be
robust to unexpected wrapping. Defensive parsing only — never assume strict
structure.
6. **`isLogged: false` overlay risk:** initData reports we are anonymous. Some
markets may be hidden behind login (we did not detect any in samples, but the
parser should treat missing markets as `null`, not throw).
---
## 8. Recommended Phase 3 Architecture
```
IOddsScraper (Application)
└── MarathonBetScraper : IOddsScraper (Infrastructure)
├── HttpClient (resilient via Polly v8)
│ ├── User-Agent rotator
│ ├── Token-bucket rate limiter (config: RequestsPerSecond)
│ ├── Retry policy (3x exponential backoff, jitter)
│ └── Circuit breaker (open after N consecutive 5xx)
├── EventDiscoveryParser ← parses /su/, /su/live, /su/popular/{sport}
│ produces List<EventListItem>
├── EventDetailParser ← parses /su/betting/<path>
│ produces FullOddsSnapshot with all markets
├── BreadcrumbParser ← extracts Sport / Country / League / Stage taxonomy
└── BetMarketMapper ← AngleSharp QuerySelector → spec field name
(sport-aware; uses PeriodScopeMapper)
```
**Use AngleSharp for parsing** — it handles malformed HTML well, has a CSS-selector
API, and is the established `.NET` choice. JSON islands inside attributes (`data-sel`,
`data-json`) decode cleanly with `System.Text.Json`.
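As an illustration of the attribute-driven extraction, here is a Python sketch that uses the standard-library `html.parser` as a stand-in for AngleSharp. Note that this parser lower-cases attribute names, so `data-event-eventId` is looked up as `data-event-eventid`. It assumes `data-sel` sits on the same element as `data-selection-key`, which the spike did not confirm; the class name is hypothetical.

```python
import json
from html.parser import HTMLParser


class EventAttrCollector(HTMLParser):
    """Collect data-event-* attributes and selection metadata from raw HTML."""

    def __init__(self):
        super().__init__()
        self.events = []
        self.selections = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)  # attribute names arrive lower-cased
        if "data-event-eventid" in a:
            self.events.append({
                "event_id": a["data-event-eventid"],
                "tree_id": a.get("data-event-treeid"),
                "name": a.get("data-event-name"),
                "live": a.get("data-live") == "true",
            })
        if "data-selection-key" in a:
            sel = None
            raw = a.get("data-sel")
            if raw:
                try:
                    sel = json.loads(raw)  # JSON island inside the attribute
                except json.JSONDecodeError:
                    sel = None  # defensive: malformed metadata is not fatal
            self.selections.append({"key": a["data-selection-key"], "sel": sel})
```

Feeding a captured page through `EventAttrCollector().feed(html)` fills `events` and `selections` without requiring any particular element nesting, which fits the defensive-parsing goal.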
**No Playwright required** for the scraper. Keep Playwright as a documented
fallback in `appsettings` (`Scraping:UsePlaywright = false`) so we can flip it on
later if the site adds JS challenges. This adds <100 LOC of optional code, costs
nothing if unused.
---
## 9. Customer Validation Plan
If our environment ever stops working (geo-block, IP ban, etc.) the customer in
Belarus can:
1. Open https://www.marathonbet.by/su in a browser, verify it renders.
2. View page source (Ctrl+U), search for `data-event-eventId` — confirm same
structure as our captured `spike/captures/pre-match-landing.html`.
3. Save the HTML and email it to dev — the parser is environment-agnostic and
should handle their captured HTML byte-for-byte.
This decouples scraper development from probe environment and makes Phase 3
testable offline.
---
## 10. Captured Samples (gitignored, local only)
| File | Purpose |
|---|---|
| `spike/captures/pre-match-landing.html` | `/su/` snapshot, 587 KB, full grid |
| `spike/captures/live-landing.html` | `/su/live` snapshot, 250 KB |
| `spike/captures/basketball-listing.html` | `/su/popular/Basketball`, 471 KB |
| `spike/captures/event-basketball-28405506.html` | NBA Knicks vs 76ers full event, 505 KB |
| `spike/captures/event-football-28089645.html` | UCL Arsenal vs Atletico full event, 1.58 MB |
| `spike/captures/event-tennis-28430484.html` | ATP Rome qualif full event, 244 KB |
| `spike/captures/liveupdate-popular.json` | Live-update API sample response |
| `spike/captures/results-page.html` | `/su/results` response (~20 KB) — captured to evidence the missing public archive endpoint (Phase 8 deviation). |
These artifacts are **not committed** but should be kept locally to back parser unit
tests in Phase 3.
> **Caveats on captures:**
>
> - `live-landing.html` was captured at a moment when no live events were
> in-progress for popular sports. As a result, the `.score-state` element
> referenced in `SCHEMA_DRAFT.md` §1 is NOT present in this particular capture.
> Phase 3 should re-verify the score selector against a live event during
> parser implementation (the selector itself is well-known across bookmaker
> sites and not in doubt).
> - Hockey events were not sampled directly. Period-result selection key tokens
> for hockey (`1st_Period_Result0.RN_H` etc.) are extrapolated from the
> football/basketball/tennis pattern and marked TBD in `SCHEMA_DRAFT.md`. Phase 3
> must verify against a real hockey event before relying on those tokens.