Anonymous scraping confirmed feasible for marathonbet.by: the site is fully SSR (nginx), with no Cloudflare or JS challenge. HttpClient + AngleSharp + Polly v8 is sufficient; Playwright is not required (kept as a future flag).

Spike outputs:

- `spike/SCRAPE_FINDINGS.md`: page rendering, URL templates, anti-bot, rate limits, recommended scraping strategy for Phase 3.
- `spike/SCHEMA_DRAFT.md`: customer-spec field → DOM selector mapping for Match + Period-N scope across football/basketball/tennis (hockey TBD).

Phase 1+ handoff captured in subplan + CLAUDE.md.

Critical Phase 8 finding: no public results endpoint at `/su/results`; Phase 8 must switch to polling event-detail until `eventJsonInfo.matchIsComplete=true` (deviation flagged).

Reviewer notes addressed:

- Period market outcome codes corrected to `RN_H`/`RN_D`/`RN_A` (not `1`/`draw`/`3`) and market name vocabulary clarified per-sport in SCHEMA_DRAFT §3.1.
- `results-page.html` capture added to file list with caveat about live-landing score-state and unsampled hockey selectors.

# Phase 0 Spike: Scraping Findings for marathonbet.by

**Date:** 2026-05-05
**Probe environment:** Windows 10, Poland-routed IP (countryCode `PL` reported by site, `isBelarus: true` flag set in `initData`, `jurisdiction: BELARUS`).
**Tooling used:** `curl` with browser User-Agent, ~10 sequential requests with ≥1-second pacing.

---

## TL;DR: Decision Matrix

| Question | Answer |
|---|---|
| Is anonymous scraping feasible? | **YES, confirmed.** Site returns full server-rendered HTML for `/su/`, `/su/live`, sport listings, and event detail pages with HTTP 200 to a plain GET with browser User-Agent. |
| Cloudflare / JS challenge? | **No.** `Server: nginx`, no `cf-ray`, no challenge cookies. Only standard JSESSIONID + analytics cookies. No reCAPTCHA on listing pages. |
| Geo-block from probe environment? | **No.** Probe was made from a non-Belarus IP; site served full HTML. The site treats us as `region:"PL"` but still serves Russian-language `/su` content. |
| Recommended scraping technology | **HttpClient + AngleSharp.** All the data needed (event list, full odds, breadcrumb taxonomy, period markets) is present in the raw SSR HTML. Playwright is not required for read-only scraping. |
| Recommended polling cadence | Pre-match: **30 seconds** (default in `appsettings`). Live: the site's native 3-second cadence is too aggressive; recommend **5–10 seconds** for our analyzer (anomaly detection does not need finer resolution than that). |
| WebSocket / API alternative? | STOMP-over-WebSocket exists at `/su/websocket/endpoint` for authenticated clients. Anonymous clients should stick to plain HTML scraping. The JSONP endpoint at `/su/liveupdate/popular/` only returns refresh-page signals, not full odds. |

---

## 1. Probe Outcomes

### 1.1 Pre-match landing: `https://www.marathonbet.by/su`

```
HTTP/1.1 200 OK
Server: nginx
Content-Type: text/html;charset=UTF-8
Set-Cookie: visitedNavBarItems=HOME; HttpOnly; SameSite=None; Secure
Set-Cookie: lastSitePart=SPORT; ...
Set-Cookie: puid=rBWP3Wn5...; expires=2037; domain=.marathonbet.by
Strict-Transport-Security: max-age=31536000
Cache-Status: MISS
Cache-Control: no-store, no-cache, must-revalidate
```

- **Render type:** Server-Side Rendered (SSR). Body is ~590 KB of HTML containing
  the full event grid for live + popular pre-match events. There IS a `<div id="app">`
  wrapper but the content inside is fully populated server-side; the JS layer enhances
  rather than hydrates from empty.
- **Rich data attributes embedded:**
  - `data-event-eventId="<bookmakerEventCode>"`: bookmaker's stable numeric event ID
  - `data-event-treeId="<treeId>"`: tree position ID (used in URLs)
  - `data-event-name="..."`: event display name
  - `data-event-path="<sport>/<league-path>/<teams> - <treeId>"`: URL fragment to
    construct event detail link
  - `data-live="true|false"`: live vs pre-match flag
  - `data-sport-treeId="<sportId>"`: sport identifier (matches customer's "Sport_Code")
  - `data-coeff-uuid` + `data-sel='{...}'` JSON: selection metadata (ewc, cid, prt, epr)
  - `data-selection-key="<eventId>@<MarketType>[N].<Outcome>"`: canonical bet identifier
- **Embedded `initData` JSON blob** (line 6 of every page) exposes runtime config:
  - `serverTime: "2026,05,05,00,43,28"` (Moscow TZ)
  - `liveUpdatePath: "/su/liveupdate/popular/"`
  - `liveUpdateTransport: "JSONP"`
  - `update_interval: 3000` (ms; live update polling cadence used by the site itself)
  - `stomp.url: "/su/websocket/endpoint"` (authenticated stream)
  - `region`, `isBelarus`, `jurisdiction`, `currencyCode`: geo/legal flags
  - `treeIds`: for the event detail page, holds the focal treeId

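To illustrate how the `data-event-*` attributes listed above could be harvested, here is a minimal sketch in Python using only the standard library (the Phase 3 scraper is planned around AngleSharp, so this is a stand-in for the idea rather than the implementation). The `EventGridParser` class, the `parse_events` helper, and the HTML fragment are hypothetical; which element actually carries the attributes is an assumption.

```python
from html.parser import HTMLParser


class EventGridParser(HTMLParser):
    """Collects one dict per element that carries a data-event-eventId attribute."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases attribute names, so data-event-eventId
        # arrives here as data-event-eventid.
        attributes = dict(attrs)
        if "data-event-eventid" not in attributes:
            return
        tree_id = attributes.get("data-event-treeid")
        self.events.append({
            "event_id": int(attributes["data-event-eventid"]),
            "tree_id": int(tree_id) if tree_id else None,
            "name": attributes.get("data-event-name"),
            "path": attributes.get("data-event-path"),
            "is_live": attributes.get("data-live") == "true",
        })


def parse_events(html):
    """Return the list of event dicts found in an HTML string."""
    parser = EventGridParser()
    parser.feed(html)
    return parser.events


# Hypothetical fragment shaped like the attributes above; the IDs reuse the
# football sample quoted in section 1.4.
SAMPLE = (
    '<div data-event-eventId="26456117" data-event-treeId="28089645" '
    'data-event-name="Arsenal vs Atletico Madrid" data-live="false"></div>'
)
events = parse_events(SAMPLE)
print(events)
```

Missing attributes come back as `None` instead of raising, which matches the defensive-parsing stance taken elsewhere in these findings.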

### 1.2 Live landing: `https://www.marathonbet.by/su/live`

- HTTP 200, ~250 KB body; same `nginx` server, same SSR pattern.
- Same `data-event-*` attributes as pre-match. Live events show `data-live="true"`,
  with extra `score-state` and `time` markers (e.g., `2:1 (1:1)`, `83:30`).
- The site polls `/su/liveupdate/popular/?treeIds=...` every 3 s but the response
  is just a refresh signal (`{"modified":[{"type":"refreshPage"}],"updated":...}`).
  In other words, **the site relies on full HTML re-fetch for live updates**, which is
  good for us (no separate JSON contract to track).

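A small sketch of how the refresh signal could be interpreted, in Python for brevity. It assumes the response body is bare JSON as in the captured sample; since `initData` advertises a JSONP transport, a callback wrapper may need to be stripped first. The function name `needs_full_refetch` and the `"updated": 0` placeholder value are hypothetical.

```python
import json


def needs_full_refetch(payload_text):
    """Return True when the live-update response asks the client to reload the page."""
    payload = json.loads(payload_text)
    modified = payload.get("modified", [])
    return any(item.get("type") == "refreshPage" for item in modified)


# Shape taken from the sample above; the "updated" value is a placeholder.
refresh_signal = '{"modified":[{"type":"refreshPage"}],"updated":0}'
quiet_signal = '{"modified":[],"updated":0}'
print(needs_full_refetch(refresh_signal), needs_full_refetch(quiet_signal))
```

A `True` result would only trigger a re-fetch of the HTML page; the odds themselves are never read from this endpoint.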

### 1.3 Sport-specific listing: `/su/popular/Basketball` / `/su/betting/Basketball+-+6`

- HTTP 200, ~470 KB.
- Lists all current basketball categories (NBA Playoffs etc.) with full odds.
- URL by name (`Basketball`) and URL by sport tree ID (`Basketball+-+6`) both work.
- Date display: events on the same day show **time only** (`03:00`); events on
  later days show **`DD <month-ru> HH:MM`** (e.g., `06 мая 02:00`). The "today"
  anchor is implicit and must be derived from `initData.serverTime`.

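The date rule above can be sketched as follows (Python, standard library only). Several things here are assumptions: that month labels can be matched on their first three letters, that a month earlier than the server's current month means next year, and that all values are naive datetimes in the server's own time zone. The function names are hypothetical.

```python
from datetime import datetime

# Assumption: matching on the first three letters covers both full genitive
# month names ("мая", "июня") and possible abbreviations ("июн").
RU_MONTH_PREFIXES = {
    "янв": 1, "фев": 2, "мар": 3, "апр": 4, "мая": 5, "июн": 6,
    "июл": 7, "авг": 8, "сен": 9, "окт": 10, "ноя": 11, "дек": 12,
}


def parse_server_time(raw):
    """Parse initData.serverTime, e.g. "2026,05,05,00,43,28"."""
    return datetime(*(int(part) for part in raw.split(",")))


def parse_event_start(label, server_now):
    """Resolve a listing label ("03:00" or "06 мая 02:00") against server time."""
    parts = label.split()
    hour, minute = (int(x) for x in parts[-1].split(":"))
    if len(parts) == 1:
        # Time only: the event falls on the server's current day.
        return server_now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    day = int(parts[0])
    month = RU_MONTH_PREFIXES[parts[1].lower()[:3]]
    # Assumption: listings only show upcoming events, so an earlier month
    # than the server's current month means the event is next year.
    year = server_now.year + (1 if month < server_now.month else 0)
    return datetime(year, month, day, hour, minute)


now = parse_server_time("2026,05,05,00,43,28")
print(parse_event_start("03:00", now), parse_event_start("06 мая 02:00", now))
```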

### 1.4 Event detail: `/su/betting/<event-path>`

- HTTP 200, ~500 KB to ~1.6 MB depending on market count.
- URL pattern: `/su/betting/<Sport>/<League+Path>/<Sub+Stage>/<Team1+vs+Team2+-+<treeId>>`.
- Exposes ~140–250 unique market types per event. Each market is a `<div>` containing
  a labeled `<table>` of selections with `data-selection-key`, prices, and handicap/total
  values in `<span class="middle-simple">`.
- **Schema.org breadcrumb** at the bottom of the page provides clean taxonomy:
  Sport → Country/Group → League → Stage → Event. Each level has its own treeId visible
  in `href="/su/betting/<path>+-+<treeId>"`.
- Sample (Football, Arsenal vs Atletico Madrid, treeId 28089645, eventId 26456117):
  - Sport = `Football+-+11`, Country group = `Clubs.+International+-+4409575`,
    League = `UEFA+Champions+League+-+21255`, Stage = `Play-Offs / Semi+Final / 2nd+Leg`.
  - Match-level markets: `Match_Result.{1,draw,3}`, `To_Win_Match_With_Handicap{N}.{HB_H,HB_A}`,
    `Total_Goals{N}.{Under_X,Over_X}`.

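As a sketch of how the breadcrumb taxonomy above might be decoded, the following Python helper splits a path segment such as `UEFA+Champions+League+-+21255` into a display name and a tree ID. The helper names are hypothetical, and the assumption is that every ID-bearing segment ends in `+-+<digits>`.

```python
import re
from urllib.parse import unquote_plus

# After decoding "+" to spaces, an ID-bearing segment looks like "<name> - <digits>".
_SEGMENT = re.compile(r"^(?P<name>.+) - (?P<tree_id>\d+)$")


def split_tree_segment(segment):
    """Split "UEFA+Champions+League+-+21255" into ("UEFA Champions League", 21255)."""
    decoded = unquote_plus(segment)
    match = _SEGMENT.match(decoded)
    if match is None:
        # Segment without a tree ID, e.g. a stage name.
        return decoded, None
    return match.group("name"), int(match.group("tree_id"))


def tree_id_from_href(href):
    """Return the tree ID that terminates a breadcrumb href, or None."""
    last_segment = href.rstrip("/").rsplit("/", 1)[-1]
    return split_tree_segment(last_segment)[1]


print(split_tree_segment("UEFA+Champions+League+-+21255"))
print(tree_id_from_href("/su/betting/Football+-+11"))
```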

### 1.5 Results / archive: **NOT publicly available**

- `https://www.marathonbet.by/su/results` → **HTTP 404**.
- `https://www.marathonbet.by/su/results/` → **HTTP 404**.
- `https://www.marathonbet.by/su/results.htm` → **HTTP 404**.
- No `/results`, `/archive`, or `/history` link anywhere in the public landing-page HTML.
- The `eventJsonInfo` `<td>` on each event has a `matchIsComplete` boolean and a
  `resultDescription` (e.g., `"2:1 (1:1)"`), so **final scores can be captured by
  re-scraping the event detail page after match end**, but only while the event is
  still hosted (likely a few hours / days post-match). After cleanup, results are gone.
- **Implication for Phase 8 (Results loader):** results must be harvested by
  continuing to poll the event detail page until `matchIsComplete=true`, then storing
  the final score. There is no historical archive endpoint to back-fill from. We
  should also evaluate scraping a third-party results aggregator
  (flashscore, livescore, sofascore) as a fallback; that's a Phase 8 design decision.

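The Phase 8 polling approach could look roughly like the sketch below (Python for brevity). `fetch_event_info` is a hypothetical caller-supplied callable that re-scrapes the event detail page and returns the decoded `eventJsonInfo` as a dict, or `None` once the page is gone; the poll interval and poll budget are placeholder values, not measured ones.

```python
import time


def wait_for_final_score(fetch_event_info, poll_seconds=60.0, max_polls=240,
                         sleep=time.sleep):
    """Poll an event until matchIsComplete is true.

    Returns resultDescription once the match is complete, or None when the
    event page disappears first or the poll budget runs out.
    """
    for attempt in range(max_polls):
        info = fetch_event_info()
        if info is None:
            # Event page already cleaned up: the result can no longer be read.
            return None
        if info.get("matchIsComplete"):
            return info.get("resultDescription")
        if attempt < max_polls - 1:
            sleep(poll_seconds)
    return None


# Usage with canned responses standing in for two successive scrapes.
_responses = iter([
    {"matchIsComplete": False},
    {"matchIsComplete": True, "resultDescription": "2:1 (1:1)"},
])
final_score = wait_for_final_score(lambda: next(_responses), sleep=lambda s: None)
print(final_score)
```

Injecting `sleep` keeps the loop testable offline; a real loader would persist the result as soon as it is seen.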

---

## 2. Anti-bot Posture

| Signal | Observation |
|---|---|
| Cloudflare | Absent. `Server: nginx`, no `cf-*` headers. |
| reCAPTCHA / hCAPTCHA | Not on public listing or event pages (only on `/captchaData.htm` for login). |
| User-Agent filtering | A browser UA returns 200. We did not test with `curl/8.x` or empty UA; recommend always sending a real UA. |
| Cookie requirement | None for read-only access. The site sets `puid`, `JSESSIONID`, `lastSitePart`, etc., but we observed full HTML on the very first request without prior cookies. |
| IP rate-limit | 5 sequential requests at ~1s pacing all returned 200 in <1 s. No throttling observed within our budget (10 total requests). The customer should test heavier loads from their environment. |
| Geo-block | Probe environment is geo-routed as Poland; site still serves `/su` Russian content. Customer (Belarus) should see same or better access. |
| Fingerprinting | Standard analytics (GTM, dataLayer); no JS-fingerprint cookies or canvas hashing detected in the entry-page payload. |


**Mitigations to bake into the scraper anyway** (defense-in-depth):

- **Rotate User-Agents** from a small pool of recent Chrome/Firefox/Edge versions
  (configurable via `Scraping:UserAgents[]`).
- **Polite pacing:** default `Scraping:RateLimit:RequestsPerSecond = 1`,
  `MaxConcurrentRequests = 4`. Per-host token-bucket rate limiter using Polly v8 +
  `Microsoft.Extensions.Http.Resilience`.
- **Honor `Cache-Control: no-store`:** do NOT cache responses; that's the site's intent.
- **Handle 403 / 429 / 503** with exponential backoff and a circuit breaker; alert the
  user when the circuit stays open for >5 minutes.
- **Cookie jar per scraper instance:** accept set-cookies and replay them. This avoids
  session-creation latency on every request.
- **Belarus-specific:** if the customer's environment ever sees a `/forbidden` redirect,
  we fall back to the `afterForbiddenRedirectUrl` documented in `initData`.

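To make the pacing and backoff mitigations concrete, here is a minimal sketch of a token bucket and a jittered exponential backoff in Python. The real scraper is expected to get both from Polly v8, so the class and function below are hypothetical illustrations of the technique, and the `base` and `cap` defaults are placeholder values.

```python
import random
import time


class TokenBucket:
    """Allows `rate` requests per second on average, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = float(rate)
        self.capacity = float(capacity)
        self.tokens = float(capacity)
        self.clock = clock
        self.last = clock()

    def try_acquire(self):
        """Take one token if available; return whether the request may proceed."""
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the bucket size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def backoff_delay(attempt, base=1.0, cap=60.0, rng=random.random):
    """Exponential backoff with full jitter: a random share of min(cap, base * 2**attempt)."""
    return rng() * min(cap, base * (2 ** attempt))


# One request per second, no burst allowance; a fake clock drives the demo.
_now = [0.0]
bucket = TokenBucket(rate=1, capacity=1, clock=lambda: _now[0])
first, second = bucket.try_acquire(), bucket.try_acquire()
_now[0] = 1.0
third = bucket.try_acquire()
print(first, second, third)
```

The injectable `clock` and `rng` exist only so the behaviour can be checked deterministically.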

---

## 3. URL Templates Phase 3 Will Use

| Purpose | Template | Notes |
|---|---|---|
| Pre-match top page | `https://www.marathonbet.by/su/` | Mixed live + popular pre-match. Use only for landing/health-check. |
| Live top page | `https://www.marathonbet.by/su/live` | Mixed sports. Use for live-event discovery. |
| Live popular | `https://www.marathonbet.by/su/live/popular` | Same data as `/su/live`. |
| All-events index | `https://www.marathonbet.by/su/all-events/` | Long full list; use for discovery seed. |
| Sport listing (by ID) | `https://www.marathonbet.by/su/betting/{Sport}+-+{sportId}` | e.g., `/su/betting/Basketball+-+6`. **Preferred** because the sport ID is stable. |
| Sport listing (by name) | `https://www.marathonbet.by/su/popular/{Sport}` | e.g., `/su/popular/Basketball`. Convenient for humans. |
| Category / league listing | `https://www.marathonbet.by/su/betting/{Sport}/{League+Path}+-+{categoryTreeId}` | From breadcrumbs / `category-label-link`. |
| Event detail | `https://www.marathonbet.by/su/betting/{event-path}` | `event-path` from `data-event-path`, ends in `-+{treeId}`. |
| Live update signal | `https://www.marathonbet.by/su/liveupdate/popular/?treeIds={csv}` | Returns `{"modified":[...],"updated":<ts>}`. Use only as a "something changed" hint; full odds still come from event-detail re-fetch. |
| Server time sync | `https://www.marathonbet.by/su/stateless/synctime` | Use to anchor "today" date interpretation. |

URL paths use `+` for spaces, `%2C` for `,`, etc. This is form-style encoding, which in .NET corresponds to `WebUtility.UrlEncode` applied per path segment. Note that `Uri.EscapeDataString` does not produce it: that method encodes a space as `%20`, not `+`.

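The encoding convention can be checked with a short sketch. In Python, `urllib.parse.quote_plus` produces the same `+`-for-space, `%2C`-for-comma output described above; the helper names are hypothetical, and the assumption is that `data-event-path` holds raw spaces that must be encoded segment by segment.

```python
from urllib.parse import quote_plus

BASE = "https://www.marathonbet.by/su"


def encode_path(path):
    """Form-encode each segment of a display path, keeping "/" as the separator."""
    return "/".join(quote_plus(segment) for segment in path.split("/"))


def sport_listing_url(sport, sport_id):
    """Build the preferred by-ID sport listing URL."""
    segment = f"{sport} - {sport_id}"
    return f"{BASE}/betting/{encode_path(segment)}"


def event_detail_url(event_path):
    """Build an event detail URL from a raw data-event-path value."""
    return f"{BASE}/betting/{encode_path(event_path)}"


print(sport_listing_url("Basketball", 6))
print(encode_path("Football/Clubs. International"))
```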

---

## 4. Sport ID Inventory (observed)

From the pre-match landing page (`data-sport-treeId` attributes + `category-label`
breadcrumb hrefs):

| Sport ID | Russian name | English path |
|---|---|---|
| **6** | Баскетбол | `Basketball` |
| **11** | Футбол | `Football` |
| **537** | (TBD, verify on populated day) | N/A |
| **2398** | (TBD) | N/A |
| **22723** | Теннис | `Tennis` |
| **26418** | Футбол (alt? duplicate live) | `Football` |
| **43658** | Хоккей | `Hockey` |
| **45356** | Баскетбол (live tree) | `Basketball` |
| **139722** | Гандбол | `Handball` |
| **414329** | Настольный теннис | `Table+Tennis` |
| **1372932** | Киберспорт | `Esports` |
| **3083982** | Лотереи | `Lotteries` |
| **11308234** | Шорт хоккей | `Short+Hockey` |
| **23054364** | Кибербаскетбол | `eBasketball` |
| **23054392** | Киберфутбол | `eFootball` |

**Important observation:** the site has **two parallel tree IDs per sport**: one
"canonical" (e.g., `6` for Basketball) used on the event-detail breadcrumb, and a
"category" tree ID (e.g., `45356`) used inside the live grouping. The Phase 1 domain
needs to recognize the canonical ID as `SportCode` and ignore the category tree ID.

The customer-spec field `Sport_Code = 6` for Basketball matches the canonical ID
in `data-sport-treeId="6"` and in the breadcrumb URL `/su/betting/Basketball+-+6`.

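One way to apply the canonical-ID rule is a small alias table, sketched here in Python. Only the Basketball pair is stated outright in the inventory; the Football pair is the tentative duplicate flagged in the table and is included as an unconfirmed assumption. The names `ALIAS_TO_CANONICAL` and `canonical_sport_code` are hypothetical.

```python
# Alias (category / live) tree ID -> canonical SportCode.
ALIAS_TO_CANONICAL = {
    45356: 6,   # Basketball live tree -> Basketball
    26418: 11,  # Football suspected duplicate -> Football (unconfirmed)
}


def canonical_sport_code(tree_id):
    """Map a tree ID to the canonical SportCode; unknown IDs pass through unchanged."""
    return ALIAS_TO_CANONICAL.get(tree_id, tree_id)


print(canonical_sport_code(45356), canonical_sport_code(6))
```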

---

## 5. Bet Selection Naming Convention

Format: `{eventId}@{MarketName}{LineIndex?}.{Outcome}`

Where:

- `eventId` = bookmaker's `data-event-eventId` (numeric, ~26-million range, stable).
- `MarketName` = `Match_Result`, `To_Win_Match_With_Handicap`, `Total_Points`,
  `1st_Half_Result`, `To_Win_1st_Half_With_Handicap`, `1st_Set_Total_Games`, etc.
- `LineIndex?` = optional integer suffix when a market has multiple lines/spreads
  (e.g., `Total_Points10`, `Total_Points11` are different total thresholds for the
  same event). Empty / `0` is the "main" line.
- `Outcome` codes:
  - `1`, `draw`, `3`: the match-level 3-way result market (`Match_Result`)
  - `RN_H`, `RN_D`, `RN_A`: period result markets (e.g., `1st_Period_Result0.RN_H`; see `SCHEMA_DRAFT.md` §3.1)
  - `HB_H`, `HB_A`: handicap home/away
  - `Under_<X>`, `Over_<X>`: total under/over (X is the threshold, embedded in name)
  - `HD`, `AD`: half-time/full-time draw combinations
  - `yes` / `no`: for yes/no markets

The handicap value (`+1.0`, `-2.5`) and total threshold (`213.5`) are NOT exposed as
separate numeric fields of the selection key. They appear either in the
`<span class="middle-simple">` display element or embedded in the outcome name
(e.g., `Under_213.5`), from which they must be parsed.

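A minimal sketch of a selection-key parser for the format above, in Python. It assumes market names contain no `.` and that any trailing digits on the market token are the line index (a market whose real name ends in a digit would be misread). The `SelectionKey` type and the sample keys are hypothetical illustrations, not captured values.

```python
import re
from typing import NamedTuple, Optional


class SelectionKey(NamedTuple):
    event_id: int
    market: str
    line_index: Optional[int]
    outcome: str
    threshold: Optional[float]


# eventId "@" market name, optional trailing line index, "." then the outcome.
_KEY = re.compile(r"(\d+)@(.+?)(\d*)\.(.+)")
_THRESHOLD = re.compile(r"(?:Under|Over)_(\d+(?:\.\d+)?)")


def parse_selection_key(key):
    """Split a data-selection-key into its parts, or return None if malformed."""
    match = _KEY.fullmatch(key)
    if match is None:
        return None
    event_id, market, line, outcome = match.groups()
    total = _THRESHOLD.fullmatch(outcome)
    return SelectionKey(
        event_id=int(event_id),
        market=market,
        line_index=int(line) if line else None,
        outcome=outcome,
        threshold=float(total.group(1)) if total else None,
    )


print(parse_selection_key("26456117@Match_Result.1"))
print(parse_selection_key("26456117@Total_Goals3.Under_2.5"))
```

Returning `None` for a malformed key keeps the caller in control of how strict to be.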

---

## 6. Period Scope per Sport (observed)

| Sport | Period scopes available | Spec field prefix |
|---|---|---|
| Football (11) | 1st Half, 2nd Half | `Bet_Period-1_*`, `Bet_Period-2_*` |
| Basketball (6) | 1st/2nd Half, 1st/2nd/3rd/4th Quarter | Customer must clarify whether Period-N maps to halves or quarters. **Recommend halves** as default (Period-1, Period-2) with an `appsettings` toggle for quarter-mode. |
| Tennis (22723) | 1st Set, 2nd Set, ... (variable count) | `Bet_Period-1_*` = 1st Set, etc. **No Draw outcome.** |
| Hockey (43658) | 1st/2nd/3rd Period | `Bet_Period-1_*`, `Bet_Period-2_*`, `Bet_Period-3_*` (not yet sampled; revalidate in Phase 3). |

The internal market-name token is sport-dependent:

- `1st_Half_Result`, `To_Win_1st_Half_With_Handicap`
- `1st_Quarter_Result`, `To_Win_1st_Quarter_With_Handicap`
- `1st_Set_Result`, `To_Win_1st_Set_With_Handicap`

**Phase 3 should encapsulate this** in a sport-aware mapping table
(`PeriodScopeMapper`) keyed on `SportCode`, returning the set of expected period
markets and their token names.

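A possible shape for the `PeriodScopeMapper` idea, sketched in Python. Only the `1st_*` tokens are confirmed by the samples; tokens for later periods are generated by pattern and are therefore assumptions, as are the hockey tokens and the five-set upper bound for tennis. The function name and dict layout are hypothetical.

```python
ORDINALS = ("1st", "2nd", "3rd", "4th", "5th")

# SportCode -> (period noun used in market tokens, number of periods expected)
PERIOD_SCOPES = {
    11: ("Half", 2),       # Football
    6: ("Half", 2),        # Basketball default; quarter mode is a toggle
    22723: ("Set", 5),     # Tennis: assumed upper bound, actual count varies
    43658: ("Period", 3),  # Hockey: extrapolated, not yet sampled
}


def period_market_tokens(sport_code, basketball_quarters=False):
    """Return the expected result/handicap market tokens for each period of a sport."""
    if sport_code == 6 and basketball_quarters:
        noun, count = "Quarter", 4
    else:
        noun, count = PERIOD_SCOPES[sport_code]
    return [
        {
            "period": number,
            "result": f"{ORDINALS[number - 1]}_{noun}_Result",
            "handicap": f"To_Win_{ORDINALS[number - 1]}_{noun}_With_Handicap",
        }
        for number in range(1, count + 1)
    ]


print(period_market_tokens(11)[0])
```

Period `N` in the returned list lines up with the `Bet_Period-N_*` spec prefix from the table above.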

---

## 7. Open Questions / Risks

1. **Results storage cleanup:** how long does marathonbet keep finished events on
   the event detail URL? Must be empirically tested over Phase 8. Recommend retaining
   our own snapshot with `matchIsComplete=true` permanently in SQLite as soon as
   we observe it, so we never depend on the site for historical data.
2. **Sport ID duplication** (e.g., `26418` and `11` both = Football):
   verify with customer that we should use the canonical breadcrumb ID. The
   "category" trees may exist for live grouping or alphabetization purposes.
3. **Localization:** site labels are Russian on `/su/`. There appears to be `/en/`
   path support (untested). Customer wants RU + EN, so Phase 5 must verify that the
   EN locale page parses identically.
4. **Period total markets in basketball:** the sampled NBA event did NOT explicitly
   expose "Total points 1st quarter" as a clean market in the public HTML, only
   `AllInningsGoalsOver` (combined). Customer's spec implies `Bet_Period-N_Total_*`
   is universal, so Phase 3 must gracefully degrade and emit `null` rates for fields
   the site doesn't surface for that sport+league.
5. **Belarus geo-restriction risk:** we tested from non-BY. If customer's BY IP
   gets a different page (KYC overlay, deposit prompt, etc.), the parser must be
   robust to unexpected wrapping. Defensive parsing only: never assume strict
   structure.
6. **`isLogged: false` overlay risk:** initData reports we are anonymous. Some
   markets may be hidden behind login (we did not detect any in samples, but the
   parser should treat missing markets as `null`, not throw).

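The "missing market becomes `null`" rule from items 4 and 6 can be sketched in a few lines of Python. The spec field names, selection keys, and price below are hypothetical placeholders chosen to resemble the naming in this document.

```python
def extract_rates(prices_by_selection, wanted):
    """Map spec field -> price, using None for selections the page did not expose."""
    return {
        field: prices_by_selection.get(selection_key)
        for field, selection_key in wanted.items()
    }


# Hypothetical scrape result: only the match-result selection was present.
prices = {"26456117@Match_Result.1": 2.1}
wanted = {
    "Bet_Match_Result_1": "26456117@Match_Result.1",
    "Bet_Period-1_Total_Under": "26456117@1st_Half_Total_Goals.Under_1.5",
}
rates = extract_rates(prices, wanted)
print(rates)
```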

---

## 8. Recommended Phase 3 Architecture

```
IOddsScraper (Application)
│
└── MarathonBetScraper : IOddsScraper (Infrastructure)
    ├── HttpClient (resilient via Polly v8)
    │   ├── User-Agent rotator
    │   ├── Token-bucket rate limiter (config: RequestsPerSecond)
    │   ├── Retry policy (3x exponential backoff, jitter)
    │   └── Circuit breaker (open after N consecutive 5xx)
    │
    ├── EventDiscoveryParser   ← parses /su/, /su/live, /su/popular/{sport}
    │                            produces List<EventListItem>
    │
    ├── EventDetailParser      ← parses /su/betting/<path>
    │                            produces FullOddsSnapshot with all markets
    │
    ├── BreadcrumbParser       ← extracts Sport / Country / League / Stage taxonomy
    │
    └── BetMarketMapper        ← AngleSharp QuerySelector → spec field name
                                 (sport-aware; uses PeriodScopeMapper)
```

**Use AngleSharp for parsing:** it handles malformed HTML well, has a CSS-selector
API, and is the established `.NET` choice. JSON islands inside attributes (`data-sel`,
`data-json`) decode cleanly with `System.Text.Json`.

**No Playwright required** for the scraper. Keep Playwright as a documented
fallback in `appsettings` (`Scraping:UsePlaywright = false`) so we can flip it on
later if the site adds JS challenges. This adds <100 LOC of optional code and costs
nothing if unused.


---

## 9. Customer Validation Plan

If our environment ever stops working (geo-block, IP ban, etc.) the customer in
Belarus can:

1. Open https://www.marathonbet.by/su in a browser, verify it renders.
2. View page source (Ctrl+U), search for `data-event-eventId`, and confirm the same
   structure as our captured `spike/captures/pre-match-landing.html`.
3. Save the HTML and email it to dev; the parser is environment-agnostic and
   should handle their captured HTML byte-for-byte.

This decouples scraper development from probe environment and makes Phase 3
testable offline.


---

## 10. Captured Samples (gitignored, local only)

| File | Purpose |
|---|---|
| `spike/captures/pre-match-landing.html` | `/su/` snapshot, 587 KB, full grid |
| `spike/captures/live-landing.html` | `/su/live` snapshot, 250 KB |
| `spike/captures/basketball-listing.html` | `/su/popular/Basketball`, 471 KB |
| `spike/captures/event-basketball-28405506.html` | NBA Knicks vs 76ers full event, 505 KB |
| `spike/captures/event-football-28089645.html` | UCL Arsenal vs Atletico full event, 1.58 MB |
| `spike/captures/event-tennis-28430484.html` | ATP Rome qualifying full event, 244 KB |
| `spike/captures/liveupdate-popular.json` | Live-update API sample response |
| `spike/captures/results-page.html` | `/su/results` response (~20 KB), captured to evidence the missing public archive endpoint (Phase 8 deviation). |

These artifacts are **not committed** but should be kept locally to back parser unit
tests in Phase 3.

> **Caveats on captures:**
>
> - `live-landing.html` was captured at a moment when no live events were
>   in-progress for popular sports. As a result, the `.score-state` element
>   referenced in `SCHEMA_DRAFT.md` §1 is NOT present in this particular capture.
>   Phase 3 should re-verify the score selector against a live event during
>   parser implementation (the selector itself is well-known across bookmaker
>   sites and not in doubt).
> - Hockey events were not sampled directly. Period-result selection key tokens
>   for hockey (`1st_Period_Result0.RN_H` etc.) are extrapolated from the
>   football/basketball/tennis pattern and marked TBD in `SCHEMA_DRAFT.md`. Phase 3
>   must verify against a real hockey event before relying on those tokens.