# Phase 0 Spike: Scraping Findings for marathonbet.by

**Date:** 2026-05-05

**Probe environment:** Windows 10, Poland-routed IP (countryCode `PL` reported by site, `isBelarus: true` flag set in `initData`, `jurisdiction: BELARUS`).

**Tooling used:** `curl` with browser User-Agent, ~10 sequential requests with ≥1-second pacing.

---

## TL;DR: Decision Matrix

| Question | Answer |
|---|---|
| Is anonymous scraping feasible? | **YES, confirmed.** Site returns full server-rendered HTML for `/su/`, `/su/live`, sport listings, and event detail pages with HTTP 200 to a plain GET with browser User-Agent. |
| Cloudflare / JS challenge? | **No.** `Server: nginx`, no `cf-ray`, no challenge cookies. Only standard JSESSIONID + analytics cookies. No reCAPTCHA on listing pages. |
| Geo-block from probe environment? | **No.** Probe was made from a non-Belarus IP; site served full HTML. The site treats us as `region:"PL"` but still serves Russian-language `/su` content. |
| Recommended scraping technology | **HttpClient + AngleSharp.** All the data needed (event list, full odds, breadcrumb taxonomy, period markets) is present in the raw SSR HTML. Playwright is not required for read-only scraping. |
| Recommended polling cadence | Pre-match: **30 seconds** (default in `appsettings`). Live: the site's native 3-second cadence is too aggressive; recommend **5–10 seconds** for our analyzer (anomaly detection doesn't need sub-second resolution). |
| WebSocket / API alternative? | STOMP-over-WebSocket exists at `/su/websocket/endpoint` for authenticated clients. Anonymous clients should stick to plain HTML scraping. The JSONP endpoint at `/su/liveupdate/popular/` only returns refresh-page signals, not full odds. |

---

## 1. Probe Outcomes

### 1.1 Pre-match landing: `https://www.marathonbet.by/su`

```
HTTP/1.1 200 OK
Server: nginx
Content-Type: text/html;charset=UTF-8
Set-Cookie: visitedNavBarItems=HOME; HttpOnly; SameSite=None; Secure
Set-Cookie: lastSitePart=SPORT; ...
Set-Cookie: puid=rBWP3Wn5...; expires=2037; domain=.marathonbet.by
Strict-Transport-Security: max-age=31536000
Cache-Status: MISS
Cache-Control: no-store, no-cache, must-revalidate
```

- **Render type:** Server-Side Rendered (SSR). Body is ~590 KB of HTML containing the full event grid for live + popular pre-match events. There is a wrapper element around the grid, but the content inside is fully populated server-side; the JS layer enhances rather than hydrates from empty.
- **Rich data attributes embedded:**
  - `data-event-eventId="..."`: bookmaker's stable numeric event ID
  - `data-event-treeId="..."`: tree position ID (used in URLs)
  - `data-event-name="..."`: event display name
  - `data-event-path="..."`: URL fragment to construct event detail link
  - `data-live="true|false"`: live vs pre-match flag
  - `data-sport-treeId="..."`: sport identifier (matches customer's "Sport_Code")
  - `data-coeff-uuid` + `data-sel='{...}'` JSON: selection metadata (ewc, cid, prt, epr)
  - `data-selection-key="{eventId}@{MarketName}{LineIndex?}.{Outcome}"`: canonical bet identifier (format detailed in §5)
- **Embedded `initData` JSON blob** (line 6 of every page) exposes runtime config:
  - `serverTime: "2026,05,05,00,43,28"` (Moscow TZ)
  - `liveUpdatePath: "/su/liveupdate/popular/"`
  - `liveUpdateTransport: "JSONP"`
  - `update_interval: 3000` (ms; the live-update polling cadence used by the site itself)
  - `stomp.url: "/su/websocket/endpoint"` (authenticated stream)
  - `region`, `isBelarus`, `jurisdiction`, `currencyCode`: geo/legal flags
  - `treeIds`: for the event detail page, holds the focal treeId

### 1.2 Live landing: `https://www.marathonbet.by/su/live`

- HTTP 200, ~250 KB body; same `nginx` server, same SSR pattern.
- Same `data-event-*` attributes as pre-match. Live events show `data-live="true"`, with extra `score-state` and `time` markers (e.g., `2:1 (1:1)`, `83:30`).
- The site polls `/su/liveupdate/popular/?treeIds=...` every 3 s but the response is just a refresh signal (`{"modified":[{"type":"refreshPage"}],"updated":...}`), so **the site relies on full HTML re-fetch for live updates**, which is good for us (no separate JSON contract to track).

### 1.3 Sport-specific listing: `/su/popular/Basketball` / `/su/betting/Basketball+-+6`

- HTTP 200, ~470 KB.
- Lists all current basketball categories (NBA Playoffs etc.) with full odds.
- URL by name (`Basketball`) and URL by sport tree ID (`Basketball+-+6`) both work.
- Date display: events on the same day show **time only** (`03:00`); events on later days show **day, month, and time** (e.g., `06 мая 02:00`, i.e. 6 May 02:00). The "today" anchor is implicit and must be derived from `initData.serverTime`.

### 1.4 Event detail: `/su/betting/{event-path}`

- HTTP 200, ~500 KB to ~1.6 MB depending on market count.
- URL pattern: `/su/betting/{event-path}`, a slash-separated path that ends in `-+{treeId}` (see §3).
- Exposes ~140–250 unique market types per event. Each market is a container element holding a labeled group of selections with `data-selection-key`, prices, and handicap/total values in a separate display element.
- **Schema.org breadcrumb** at the bottom of the page provides clean taxonomy: Sport → Country/Group → League → Stage → Event. Each level has its own treeId visible in `href="/su/betting/...+-+{treeId}"`.
- Sample (Football, Arsenal vs Atletico Madrid, treeId 28089645, eventId 26456117):
  - Sport = `Football+-+11`, Country group = `Clubs.+International+-+4409575`, League = `UEFA+Champions+League+-+21255`, Stage = `Play-Offs / Semi+Final / 2nd+Leg`.
  - Match-level markets: `Match_Result.{1,draw,3}`, `To_Win_Match_With_Handicap{N}.{HB_H,HB_A}`, `Total_Goals{N}.{Under_X,Over_X}`.

### 1.5 Results / archive: **NOT publicly available**

- `https://www.marathonbet.by/su/results` → **HTTP 404**.
- `https://www.marathonbet.by/su/results/` → **HTTP 404**.
- `https://www.marathonbet.by/su/results.htm` → **HTTP 404**.
- No `/results`, `/archive`, or `/history` link anywhere in the public landing-page HTML.
- The `eventJsonInfo` element on each event page has a `matchIsComplete` boolean and a `resultDescription` (e.g., `"2:1 (1:1)"`), so **final scores can be captured by re-scraping the event detail page after match end**, but only while the event is still hosted (likely a few hours / days post-match). After cleanup, results are gone.
- **Implication for Phase 8 (Results loader):** results must be harvested by continuing to poll the event detail page until `matchIsComplete=true`, then storing the final score. There is no historical archive endpoint to back-fill from. We should also evaluate scraping a third-party results aggregator (flashscore, livescore, sofascore) as a fallback; that is a Phase 8 design decision.

---

## 2. Anti-bot Posture

| Signal | Observation |
|---|---|
| Cloudflare | Absent. `Server: nginx`, no `cf-*` headers. |
| reCAPTCHA / hCaptcha | Not on public listing or event pages (only on `/captchaData.htm` for login). |
| User-Agent filtering | A browser UA returns 200. We did not test with `curl/8.x` or empty UA; recommend always sending a real UA. |
| Cookie requirement | None for read-only access. The site sets `puid`, `JSESSIONID`, `lastSitePart`, etc., but we observed full HTML on the very first request without prior cookies. |
| IP rate-limit | 5 sequential requests at ~1 s pacing all returned 200 in <1 s. No throttling observed within our budget (10 total requests). The customer should test heavier loads from their environment. |
| Geo-block | Probe environment is geo-routed as Poland; site still serves `/su` Russian content. Customer (Belarus) should see same or better access. |
| Fingerprinting | Standard analytics (GTM, dataLayer); no JS-fingerprint cookies or canvas hashing detected in the entry-page payload. |

**Mitigations to bake into the scraper anyway** (defense-in-depth):

- **Rotate User-Agents** from a small pool of recent Chrome/Firefox/Edge versions (configurable via `Scraping:UserAgents[]`).
- **Polite pacing:** default `Scraping:RateLimit:RequestsPerSecond = 1`, `MaxConcurrentRequests = 4`. Per-host token-bucket rate limiter using Polly v8 + `Microsoft.Extensions.Http.Resilience`.
- **Honor `Cache-Control: no-store`:** do NOT cache responses; that's the site's intent.
- **Handle 403 / 429 / 503** with exponential backoff and circuit breaker; alert the user when circuit opens for >5 minutes.
- **Cookie jar per scraper instance:** accept set-cookies and replay them. This avoids a session-creation latency on every request.
- **Belarus-specific:** if customer's environment ever sees a `/forbidden` redirect, we fall back to the `afterForbiddenRedirectUrl` documented in `initData`.

---

## 3. URL Templates Phase 3 Will Use

| Purpose | Template | Notes |
|---|---|---|
| Pre-match top page | `https://www.marathonbet.by/su/` | Mixed live + popular pre-match. Use only for landing/health-check. |
| Live top page | `https://www.marathonbet.by/su/live` | Mixed sports. Use for live-event discovery. |
| Live popular | `https://www.marathonbet.by/su/live/popular` | Same data as `/su/live`. |
| All-events index | `https://www.marathonbet.by/su/all-events/` | Long full list; use for discovery seed. |
| Sport listing (by ID) | `https://www.marathonbet.by/su/betting/{Sport}+-+{sportId}` | e.g., `/su/betting/Basketball+-+6`. **Preferred** because the sport ID is stable. |
| Sport listing (by name) | `https://www.marathonbet.by/su/popular/{Sport}` | e.g., `/su/popular/Basketball`. Convenient for humans. |
| Category / league listing | `https://www.marathonbet.by/su/betting/{Sport}/{League+Path}+-+{categoryTreeId}` | From breadcrumbs / `category-label-link`. |
| Event detail | `https://www.marathonbet.by/su/betting/{event-path}` | `event-path` from `data-event-path`, ends in `-+{treeId}`. |
| Live update signal | `https://www.marathonbet.by/su/liveupdate/popular/?treeIds={csv}` | Returns `{"modified":[...],"updated":...}`. Use only as a "something changed" hint; full odds still come from event-detail re-fetch. |
| Server time sync | `https://www.marathonbet.by/su/stateless/synctime` | Use to anchor "today" date interpretation. |

URL paths use `+` for spaces, `%2C` for `,`, etc. This is form-style encoding, as produced by `WebUtility.UrlEncode`; note that `Uri.EscapeDataString` encodes a space as `%20`, not `+`, so it does not reproduce these paths on its own.

---

## 4. Sport ID Inventory (observed)

From the pre-match landing page (`data-sport-treeId` attributes + `category-label` breadcrumb hrefs):

| Sport ID | Russian name | English path |
|---|---|---|
| **6** | Баскетбол | `Basketball` |
| **11** | Футбол | `Football` |
| **537** | (TBD, verify on populated day) | - |
| **2398** | (TBD) | - |
| **22723** | Теннис | `Tennis` |
| **26418** | Футбол (alt? duplicate live) | `Football` |
| **43658** | Хоккей | `Hockey` |
| **45356** | Баскетбол (live tree) | `Basketball` |
| **139722** | Гандбол | `Handball` |
| **414329** | Настольный теннис | `Table+Tennis` |
| **1372932** | Киберспорт | `Esports` |
| **3083982** | Лотереи | `Lotteries` |
| **11308234** | Шорт хоккей | `Short+Hockey` |
| **23054364** | Кибербаскетбол | `eBasketball` |
| **23054392** | Киберфутбол | `eFootball` |

**Important observation:** the site has **two parallel tree IDs per sport**: one "canonical" (e.g., `6` for Basketball) used on the event-detail breadcrumb, and a "category" tree ID (e.g., `45356`) used inside the live grouping. Phase 1 domain needs to recognize the canonical ID as `SportCode` and ignore the category tree ID.

The customer-spec field `Sport_Code = 6` for Basketball matches the canonical ID in `data-sport-treeId="6"` and in the breadcrumb URL `/su/betting/Basketball+-+6`.

---

## 5. Bet Selection Naming Convention

Format: `{eventId}@{MarketName}{LineIndex?}.{Outcome}`

Where:

- `eventId` = bookmaker's `data-event-eventId` (numeric, ~26-million range, stable).
- `MarketName` = `Match_Result`, `To_Win_Match_With_Handicap`, `Total_Points`, `1st_Half_Result`, `To_Win_1st_Half_With_Handicap`, `1st_Set_Total_Games`, etc.
- `LineIndex?` = optional integer suffix when a market has multiple lines/spreads (e.g., `Total_Points10`, `Total_Points11` are different total thresholds for the same event). Empty / `0` is the "main" line.
- `Outcome` codes:
  - `1`, `draw`, `3`: for 3-way result markets
  - `HB_H`, `HB_A`: handicap home/away
  - `Under_X`, `Over_X`: total under/over (X is the threshold, embedded in the name)
  - `HD`, `AD`: half-time/full-time draw combinations
  - `yes` / `no`: for yes/no markets

The handicap value (`+1.0`, `-2.5`) and total threshold (`213.5`) are NOT in the selection key as parseable numbers. They live in a separate display element OR they are embedded in the outcome name (e.g., `Under_213.5`).

---

## 6. Period Scope per Sport (observed)

| Sport | Period scopes available | Spec field prefix |
|---|---|---|
| Football (11) | 1st Half, 2nd Half | `Bet_Period-1_*`, `Bet_Period-2_*` |
| Basketball (6) | 1st/2nd Half, 1st/2nd/3rd/4th Quarter | Customer must clarify whether Period-N maps to halves or quarters. **Recommend halves** as default (Period-1, Period-2) with an `appsettings` toggle for quarter-mode. |
| Tennis (22723) | 1st Set, 2nd Set, ... (variable count) | `Bet_Period-1_*` = 1st Set, etc. **No Draw outcome.** |
| Hockey (43658) | 1st/2nd/3rd Period | `Bet_Period-1_*`, `Bet_Period-2_*`, `Bet_Period-3_*` (not yet sampled; revalidate in Phase 3). |

The internal market-name token is sport-dependent:

- `1st_Half_Result`, `To_Win_1st_Half_With_Handicap`
- `1st_Quarter_Result`, `To_Win_1st_Quarter_With_Handicap`
- `1st_Set_Result`, `To_Win_1st_Set_With_Handicap`

**Phase 3 should encapsulate this** in a sport-aware mapping table (`PeriodScopeMapper`) keyed on `SportCode`, returning the set of expected period markets and their token names.

---

## 7. Open Questions / Risks

1. **Results storage cleanup:** how long does marathonbet keep finished events on the event detail URL? Must be empirically tested over Phase 8. Recommend retaining our own snapshot with `matchIsComplete=true` permanently in SQLite as soon as we observe it, so we never depend on the site for historical data.
2. **Sport ID duplication** (e.g., `26418` and `11` both = Football): verify with customer that we should use the canonical breadcrumb ID. The "category" trees may exist for live grouping or alphabetization purposes.
3. **Localization:** site labels are Russian on `/su/`. There appears to be `/en/` path support (untested). Customer wants RU + EN, so Phase 5 must verify that the EN locale page parses identically.
4. **Period total markets in basketball:** the sampled NBA event did NOT explicitly expose "Total points 1st quarter" as a clean market in the public HTML, only `AllInningsGoalsOver` (combined). Customer's spec implies `Bet_Period-N_Total_*` is universal; Phase 3 must gracefully degrade and emit `null` rates for fields the site doesn't surface for that sport+league.
5. **Belarus geo-restriction risk:** we tested from non-BY. If customer's BY IP gets a different page (KYC overlay, deposit prompt, etc.), the parser must be robust to unexpected wrapping. Defensive parsing only: never assume strict structure.
6. **`isLogged: false` overlay risk:** `initData` reports we are anonymous. Some markets may be hidden behind login (we did not detect any in samples, but the parser should treat missing markets as `null`, not throw).

---

## 8. Recommended Phase 3 Architecture

```
IOddsScraper (Application)
  │
  └── MarathonBetScraper : IOddsScraper (Infrastructure)
        ├── HttpClient (resilient via Polly v8)
        │     ├── User-Agent rotator
        │     ├── Token-bucket rate limiter (config: RequestsPerSecond)
        │     ├── Retry policy (3x exponential backoff, jitter)
        │     └── Circuit breaker (open after N consecutive 5xx)
        │
        ├── EventDiscoveryParser   ← parses /su/, /su/live, /su/popular/{sport}
        │                            produces List
        │
        ├── EventDetailParser      ← parses /su/betting/
        │                            produces FullOddsSnapshot with all markets
        │
        ├── BreadcrumbParser       ← extracts Sport / Country / League / Stage taxonomy
        │
        └── BetMarketMapper        ← AngleSharp QuerySelector → spec field name
                                     (sport-aware; uses PeriodScopeMapper)
```

**Use AngleSharp for parsing:** it handles malformed HTML well, has a CSS-selector API, and is the established `.NET` choice. JSON islands inside attributes (`data-sel`, `data-json`) decode cleanly with `System.Text.Json`.

**No Playwright required** for the scraper. Keep Playwright as a documented fallback in `appsettings` (`Scraping:UsePlaywright = false`) so we can flip it on later if the site adds JS challenges. This adds <100 LOC of optional code and costs nothing if unused.

---

## 9. Customer Validation Plan

If our environment ever stops working (geo-block, IP ban, etc.) the customer in Belarus can:

1. Open https://www.marathonbet.by/su in a browser, verify it renders.
2. View page source (Ctrl+U), search for `data-event-eventId`, and confirm the same structure as our captured `spike/captures/pre-match-landing.html`.
3. Save the HTML and email it to dev. The parser is environment-agnostic and should handle their captured HTML byte-for-byte.

This decouples scraper development from probe environment and makes Phase 3 testable offline.

---

## 10. Captured Samples (gitignored, local only)

| File | Purpose |
|---|---|
| `spike/captures/pre-match-landing.html` | `/su/` snapshot, 587 KB, full grid |
| `spike/captures/live-landing.html` | `/su/live` snapshot, 250 KB |
| `spike/captures/basketball-listing.html` | `/su/popular/Basketball`, 471 KB |
| `spike/captures/event-basketball-28405506.html` | NBA Knicks vs 76ers full event, 505 KB |
| `spike/captures/event-football-28089645.html` | UCL Arsenal vs Atletico full event, 1.58 MB |
| `spike/captures/event-tennis-28430484.html` | ATP Rome qualifying full event, 244 KB |
| `spike/captures/liveupdate-popular.json` | Live-update API sample response |
| `spike/captures/results-page.html` | `/su/results` response (~20 KB), captured as evidence of the missing public archive endpoint (Phase 8 deviation). |

These artifacts are **not committed** but should be kept locally to back parser unit tests in Phase 3.

> **Caveats on captures:**
>
> - `live-landing.html` was captured at a moment when no live events were in progress for popular sports. As a result, the `.score-state` element referenced in `SCHEMA_DRAFT.md` §1 is NOT present in this particular capture. Phase 3 should re-verify the score selector against a live event during parser implementation (the selector itself is well-known across bookmaker sites and not in doubt).
> - Hockey events were not sampled directly. Period-result selection key tokens for hockey (`1st_Period_Result0.RN_H` etc.) are extrapolated from the football/basketball/tennis pattern and marked TBD in `SCHEMA_DRAFT.md`. Phase 3 must verify against a real hockey event before relying on those tokens.
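
To make the selection-key convention from §5 concrete, here is a minimal sketch of a parser for keys of the form `{eventId}@{MarketName}{LineIndex?}.{Outcome}`. It is written in Python only for brevity; the scraper itself targets .NET. The names `SelectionKey` and `parse_selection_key` are hypothetical, and the sample keys are assembled from the tokens documented above rather than copied from a capture. The sketch assumes market names never end in a digit, so trailing digits are always read as the line index; that assumption should be checked against real captures before it is relied on.

```python
import re
from dataclasses import dataclass
from typing import Optional

# {eventId}@{MarketName}{LineIndex?}.{Outcome}
# The market part is matched lazily and may not contain a dot, so the first
# dot after '@' separates market from outcome. This keeps dotted thresholds
# such as "Under_213.5" inside the outcome.
_KEY_RE = re.compile(
    r"^(?P<event_id>\d+)@(?P<market>[^.]+?)(?P<line>\d*)\.(?P<outcome>.+)$"
)
# Total thresholds are embedded in the outcome name (see §5).
_THRESHOLD_RE = re.compile(r"^(?:Under|Over)_(?P<value>\d+(?:\.\d+)?)$")


@dataclass(frozen=True)
class SelectionKey:
    event_id: int
    market: str
    line_index: Optional[int]  # None when the key carries no numeric suffix
    outcome: str

    @property
    def threshold(self) -> Optional[float]:
        """Numeric threshold for Under_/Over_ outcomes, else None."""
        m = _THRESHOLD_RE.match(self.outcome)
        return float(m.group("value")) if m else None


def parse_selection_key(raw: str) -> Optional[SelectionKey]:
    """Parse a data-selection-key value; return None for unrecognized input."""
    m = _KEY_RE.match(raw.strip())
    if m is None:
        return None  # defensive: unknown shapes are skipped, never raised
    line = m.group("line")
    return SelectionKey(
        event_id=int(m.group("event_id")),
        market=m.group("market"),
        line_index=int(line) if line else None,
        outcome=m.group("outcome"),
    )


if __name__ == "__main__":
    # Illustrative keys only (hypothetical, built from tokens in this report).
    for sample in (
        "26456117@Match_Result.draw",
        "26456117@Total_Goals2.Over_2.5",
        "26456117@1st_Half_Result0.RN_H",
    ):
        print(parse_selection_key(sample))
```

Returning `None` instead of raising follows the defensive-parsing stance in §7: an unexpected key shape becomes a missing value, not a failed scrape.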