"`.
- Sample (Football, Arsenal vs Atletico Madrid, treeId 28089645, eventId 26456117):
- Sport = `Football+-+11`, Country group = `Clubs.+International+-+4409575`,
League = `UEFA+Champions+League+-+21255`, Stage = `Play-Offs / Semi+Final / 2nd+Leg`.
- Match-level markets: `Match_Result.{1,draw,3}`, `To_Win_Match_With_Handicap{N}.{HB_H,HB_A}`,
`Total_Goals{N}.{Under_X,Over_X}`.
### 1.5 Results / archive — **NOT publicly available**
- `https://www.marathonbet.by/su/results` → **HTTP 404**.
- `https://www.marathonbet.by/su/results/` → **HTTP 404**.
- `https://www.marathonbet.by/su/results.htm` → **HTTP 404**.
- No `/results`, `/archive`, or `/history` link anywhere in the public landing-page HTML.
- The `eventJsonInfo` payload on each event has a `matchIsComplete` boolean and a
`resultDescription` (e.g., `"2:1 (1:1)"`), so **final scores can be captured by
re-scraping the event detail page after match end** — but only while the event is
still hosted (likely a few hours / days post-match). After cleanup, results are gone.
- **Implication for Phase 8 (Results loader):** results must be harvested by
continuing to poll the event detail page until `matchIsComplete=true`, then storing
the final score. There is no historical archive endpoint to back-fill from. We
should also evaluate scraping a third-party results aggregator
(flashscore, livescore, sofascore) as a fallback — that's a Phase 8 design decision.
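
The polling approach described above could look roughly like the following. This is a minimal sketch in Python for illustration (the production loader would be a .NET port of the same logic). `poll_final_score` and the injected `fetch_event_info` callable are hypothetical names; only the `matchIsComplete` and `resultDescription` fields come from the observations above. Returning `None` from the fetcher is an assumed way to model "event page no longer hosted".

```python
import time
from typing import Callable, Optional


def poll_final_score(
    fetch_event_info: Callable[[], Optional[dict]],
    interval_s: float = 60.0,
    max_polls: int = 240,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Poll an event until matchIsComplete is true, then return its score.

    fetch_event_info is a hypothetical callable that re-fetches the event
    detail page and returns the decoded eventJsonInfo dict, or None when
    the event is no longer hosted (assumption).
    """
    for attempt in range(max_polls):
        info = fetch_event_info()
        if info is None:
            # Event was cleaned up before we saw a final score.
            return None
        if info.get("matchIsComplete") is True:
            return info.get("resultDescription")
        if attempt < max_polls - 1:
            sleep(interval_s)
    # Gave up: the match never reported completion within the budget.
    return None
```

Injecting `sleep` keeps the loop testable offline; the caller is expected to persist the returned score immediately, since the page may disappear afterwards.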
---
## 2. Anti-bot Posture
| Signal | Observation |
|---|---|
| Cloudflare | Absent. `Server: nginx`, no `cf-*` headers. |
| reCAPTCHA / hCAPTCHA | Not on public listing or event pages (only on `/captchaData.htm` for login). |
| User-Agent filtering | A browser UA returns 200. We did not test with `curl/8.x` or empty UA — recommend always sending a real UA. |
| Cookie requirement | None for read-only access. The site sets `puid`, `JSESSIONID`, `lastSitePart`, etc., but we observed full HTML on the very first request without prior cookies. |
| IP rate-limit | 5 sequential requests at ~1s pacing all returned 200 in <1 s. No throttling observed within our budget (10 total requests). The customer should test heavier loads from their environment. |
| Geo-block | Probe environment is geo-routed as Poland; site still serves `/su` Russian content. Customer (Belarus) should see same or better access. |
| Fingerprinting | Standard analytics (GTM, dataLayer); no JS-fingerprint cookies or canvas hashing detected in the entry-page payload. |
**Mitigations to bake into the scraper anyway** (defense-in-depth):
- **Rotate User-Agents** from a small pool of recent Chrome/Firefox/Edge versions
(configurable via `Scraping:UserAgents[]`).
- **Polite pacing:** default `Scraping:RateLimit:RequestsPerSecond = 1`,
`MaxConcurrentRequests = 4`. Per-host token-bucket rate limiter using Polly v8 +
`Microsoft.Extensions.Http.Resilience`.
- **Honor `Cache-Control: no-store`** — do NOT cache responses; that's the site's intent.
- **Handle 403 / 429 / 503** with exponential backoff and circuit breaker; alert the user
when circuit opens for >5 minutes.
- **Cookie jar per scraper instance** — accept set-cookies and replay them. This avoids
a session-creation latency on every request.
- **Belarus-specific:** if customer's environment ever sees a `/forbidden` redirect,
we fall back to the `afterForbiddenRedirectUrl` documented in `initData`.
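
To make the polite-pacing bullet concrete, here is a minimal token-bucket sketch. It is written in Python purely as an illustration; the plan above is to use Polly v8 in .NET, whose rate limiter would take its place. `TokenBucket`, `try_acquire`, and `wait_time` are hypothetical names, and the injectable clock exists only to make the behaviour testable.

```python
import time
from typing import Callable


class TokenBucket:
    """Token bucket: refills at rate_per_s, holds at most `capacity` tokens."""

    def __init__(
        self,
        rate_per_s: float,
        capacity: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate_per_s
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity  # start full so the first request is not delayed
        self.last = clock()

    def _refill(self) -> None:
        now = self.clock()
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now

    def try_acquire(self) -> bool:
        """Take one token if available; never blocks."""
        self._refill()
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

    def wait_time(self) -> float:
        """Seconds until the next token becomes available (0 if one is ready)."""
        self._refill()
        return max(0.0, (1.0 - self.tokens) / self.rate)
```

A per-host limiter would keep one bucket per hostname, for example in a dictionary keyed by host, with `rate_per_s` taken from `Scraping:RateLimit:RequestsPerSecond`.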
---
## 3. URL Templates Phase 3 Will Use
| Purpose | Template | Notes |
|---|---|---|
| Pre-match top page | `https://www.marathonbet.by/su/` | Mixed live + popular pre-match. Use only for landing/health-check. |
| Live top page | `https://www.marathonbet.by/su/live` | Mixed sports. Use for live-event discovery. |
| Live popular | `https://www.marathonbet.by/su/live/popular` | Same data as `/su/live`. |
| All-events index | `https://www.marathonbet.by/su/all-events/` | Long full list; use for discovery seed. |
| Sport listing (by ID) | `https://www.marathonbet.by/su/betting/{Sport}+-+{sportId}` | e.g., `/su/betting/Basketball+-+6`. **Preferred** because sport-id stable. |
| Sport listing (by name) | `https://www.marathonbet.by/su/popular/{Sport}` | e.g., `/su/popular/Basketball`. Convenient for humans. |
| Category / league listing | `https://www.marathonbet.by/su/betting/{Sport}/{League+Path}+-+{categoryTreeId}` | From breadcrumbs / `category-label-link`. |
| Event detail | `https://www.marathonbet.by/su/betting/{event-path}` | `event-path` from `data-event-path`, ends in `-+{treeId}`. |
| Live update signal | `https://www.marathonbet.by/su/liveupdate/popular/?treeIds={csv}` | Returns `{"modified":[...],"updated":}`. Use only as "hey something changed" hint; full odds still come from event-detail re-fetch. |
| Server time sync | `https://www.marathonbet.by/su/stateless/synctime` | Use to anchor "today" date interpretation. |
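URL paths use `+` for spaces and `%2C` for `,`, i.e. form-style encoding as produced by `WebUtility.UrlEncode`. Note that `Uri.EscapeDataString` emits `%20` for spaces, so it will not reproduce these paths.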
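
As an illustration of the templates and the `+` encoding, the sketch below builds a sport-listing URL and pulls the trailing tree ID out of an event path. It is Python for illustration only (a .NET version would use the equivalent form-style encoder). The helper names are hypothetical, and the event path in the usage note is a made-up example rather than a captured one.

```python
import re
from typing import Optional
from urllib.parse import quote_plus

BASE = "https://www.marathonbet.by/su"

# Event paths are documented as ending in "-+{treeId}".
_TREE_ID_RE = re.compile(r"\+-\+(\d+)$")


def encode_segment(text: str) -> str:
    """Form-style encoding: spaces become '+', commas become '%2C'."""
    return quote_plus(text)


def sport_listing_url(sport: str, sport_id: int) -> str:
    """Build the preferred by-ID sport listing URL."""
    return f"{BASE}/betting/" + encode_segment(f"{sport} - {sport_id}")


def tree_id_from_event_path(event_path: str) -> Optional[int]:
    """Return the trailing tree ID of an event path, or None if absent."""
    match = _TREE_ID_RE.search(event_path)
    return int(match.group(1)) if match else None
```

For example, `sport_listing_url("Basketball", 6)` yields `https://www.marathonbet.by/su/betting/Basketball+-+6`, and `tree_id_from_event_path("Some+Event+-+28089645")` yields `28089645`.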
---
## 4. Sport ID Inventory (observed)
From the pre-match landing page (`data-sport-treeId` attributes + `category-label`
breadcrumb hrefs):
| Sport ID | Russian name | English path |
|---|---|---|
| **6** | Баскетбол | `Basketball` |
| **11** | Футбол | `Football` |
| **537** | (TBD — verify on populated day) | — |
| **2398** | (TBD) | — |
| **22723** | Теннис | `Tennis` |
| **26418** | Футбол (alt? duplicate live) | `Football` |
| **43658** | Хоккей | `Hockey` |
| **45356** | Баскетбол (live tree) | `Basketball` |
| **139722** | Гандбол | `Handball` |
| **414329** | Настольный теннис | `Table+Tennis` |
| **1372932** | Киберспорт | `Esports` |
| **3083982** | Лотереи | `Lotteries` |
| **11308234** | Шорт хоккей | `Short+Hockey` |
| **23054364** | Кибербаскетбол | `eBasketball` |
| **23054392** | Киберфутбол | `eFootball` |
**Important observation:** the site appears to expose **two parallel tree IDs** for at
least some sports (observed for Basketball and Football): a "canonical" ID
(e.g., `6` for Basketball) used on the event-detail breadcrumb, and a "category" tree ID
(e.g., `45356`) used inside the live grouping. The Phase 1 domain model
needs to recognize the canonical ID as `SportCode` and ignore the category tree ID.
The customer-spec field `Sport_Code = 6` for Basketball matches the canonical ID
in `data-sport-treeId="6"` and in the breadcrumb URL `/su/betting/Basketball+-+6`.
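
One way to handle the duplicate IDs is a small alias table that folds category tree IDs into canonical ones. The sketch below is Python for illustration; the table name and function are hypothetical, and the `26418` entry rests on the unconfirmed reading of the inventory above (it is marked "alt? duplicate live"), so it should be verified before use.

```python
from typing import Dict

# Category/live tree ID -> canonical sport ID, taken from the inventory table.
CATEGORY_TO_CANONICAL: Dict[int, int] = {
    45356: 6,   # Basketball (live tree) -> Basketball
    26418: 11,  # Football (suspected duplicate) -> Football; unconfirmed
}


def to_sport_code(tree_id: int) -> int:
    """Return the canonical SportCode; unknown IDs pass through unchanged."""
    return CATEGORY_TO_CANONICAL.get(tree_id, tree_id)
```

Passing unknown IDs through unchanged is a deliberate choice here, so that sports not yet catalogued are still stored rather than dropped.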
---
## 5. Bet Selection Naming Convention
Format: `{eventId}@{MarketName}{LineIndex?}.{Outcome}`
Where:
- `eventId` = bookmaker's `data-event-eventId` (numeric, ~26-million range, stable).
- `MarketName` = `Match_Result`, `To_Win_Match_With_Handicap`, `Total_Points`,
`1st_Half_Result`, `To_Win_1st_Half_With_Handicap`, `1st_Set_Total_Games`, etc.
- `LineIndex?` = optional integer suffix when a market has multiple lines/spreads
(e.g., `Total_Points10`, `Total_Points11` are different total thresholds for the
same event). Empty / `0` is the "main" line.
- `Outcome` codes:
- `1`, `draw`, `3` — for 3-way result markets
- `HB_H`, `HB_A` — handicap home/away
- `Under_X`, `Over_X` — total under/over (X is the threshold, embedded in name)
- `HD`, `AD` — half-time/full-time draw combinations
- `yes` / `no` — for yes/no markets
The handicap value (`+1.0`, `-2.5`) and total threshold (`213.5`) are NOT in the
selection key as parseable numbers; they live in a separate
display element OR they are embedded in the outcome name (e.g., `Under_213.5`).
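
A parser for the selection-key format might look like the sketch below. It is Python for illustration only, and it assumes that market names never contain a dot and that any trailing digits on the market token are the line index; both assumptions should be checked against real captures. `SelectionKey` and `parse_selection_key` are hypothetical names, and the keys in the usage note are made-up combinations of tokens listed above.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Trailing digits on the market token are treated as the line index.
_MARKET_RE = re.compile(r"^(?P<name>.*?)(?P<index>\d*)$")
# Thresholds embedded in outcome names such as "Under_213.5".
_THRESHOLD_RE = re.compile(r"^(?:Under|Over)_(?P<value>\d+(?:\.\d+)?)$")


@dataclass(frozen=True)
class SelectionKey:
    event_id: int
    market: str
    line_index: int            # 0 means the main line
    outcome: str
    threshold: Optional[float]  # only set for Under_/Over_ outcomes


def parse_selection_key(key: str) -> SelectionKey:
    event_part, sep, rest = key.partition("@")
    if not sep or not event_part.isdigit():
        raise ValueError(f"missing or non-numeric eventId: {key!r}")
    # Split on the FIRST dot only: outcomes may contain dots (Under_213.5).
    market_part, sep, outcome = rest.partition(".")
    if not sep or not market_part or not outcome:
        raise ValueError(f"missing market or outcome: {key!r}")
    market_match = _MARKET_RE.match(market_part)
    name = market_match.group("name")
    index = market_match.group("index")
    threshold_match = _THRESHOLD_RE.match(outcome)
    return SelectionKey(
        event_id=int(event_part),
        market=name,
        line_index=int(index) if index else 0,
        outcome=outcome,
        threshold=float(threshold_match.group("value")) if threshold_match else None,
    )
```

For instance, `parse_selection_key("26456117@Total_Points10.Under_213.5")` gives market `Total_Points`, line index `10`, and threshold `213.5`, while a leading digit as in `1st_Half_Result` is left in the market name.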
---
## 6. Period Scope per Sport (observed)
| Sport | Period scopes available | Spec field prefix |
|---|---|---|
| Football (11) | 1st Half, 2nd Half | `Bet_Period-1_*`, `Bet_Period-2_*` |
| Basketball (6) | 1st/2nd Half, 1st/2nd/3rd/4th Quarter | Customer must clarify whether Period-N maps to halves or quarters. **Recommend halves** as default (Period-1, Period-2) with an `appsettings` toggle for quarter-mode. |
| Tennis (22723) | 1st Set, 2nd Set, ... (variable count) | `Bet_Period-1_*` = 1st Set, etc. **No Draw outcome.** |
| Hockey (43658) | 1st/2nd/3rd Period | `Bet_Period-1_*`, `Bet_Period-2_*`, `Bet_Period-3_*` (not yet sampled — revalidate in Phase 3). |
The internal market-name token is sport-dependent:
- `1st_Half_Result`, `To_Win_1st_Half_With_Handicap`
- `1st_Quarter_Result`, `To_Win_1st_Quarter_With_Handicap`
- `1st_Set_Result`, `To_Win_1st_Set_With_Handicap`
**Phase 3 should encapsulate this** in a sport-aware mapping table
(`PeriodScopeMapper`) keyed on `SportCode`, returning the set of expected period
markets and their token names.
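
A possible shape for the mapping table is sketched below in Python for illustration (the real `PeriodScopeMapper` would live in the .NET codebase). Only the `1st_*` tokens were observed; the `2nd_`, `3rd_`, and `4th_` tokens and all hockey tokens are extrapolated, and the cap of five tennis sets is an assumption for the variable set count. The function name and the quarter-mode flag are hypothetical.

```python
from typing import Dict, List, NamedTuple, Tuple


class PeriodMarkets(NamedTuple):
    period: int
    result_token: str
    handicap_token: str


_ORDINALS = {1: "1st", 2: "2nd", 3: "3rd", 4: "4th", 5: "5th"}

# SportCode -> (scope word, number of periods).
_SCOPES: Dict[int, Tuple[str, int]] = {
    11: ("Half", 2),        # Football
    6: ("Half", 2),         # Basketball, default halves mode
    22723: ("Set", 5),      # Tennis: variable count, 5 assumed as upper bound
    43658: ("Period", 3),   # Hockey: extrapolated, not yet sampled
}


def period_markets(sport_code: int, basketball_quarters: bool = False) -> List[PeriodMarkets]:
    """Return expected period markets for a sport; empty list if unknown."""
    if sport_code == 6 and basketball_quarters:
        scope, count = "Quarter", 4
    elif sport_code in _SCOPES:
        scope, count = _SCOPES[sport_code]
    else:
        return []
    return [
        PeriodMarkets(
            period=n,
            result_token=f"{_ORDINALS[n]}_{scope}_Result",
            handicap_token=f"To_Win_{_ORDINALS[n]}_{scope}_With_Handicap",
        )
        for n in range(1, count + 1)
    ]
```

Returning an empty list for unknown sports lets the caller emit `null` period fields instead of failing.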
---
## 7. Open Questions / Risks
1. **Results storage cleanup:** how long does marathonbet keep finished events on
the event detail URL? Must be empirically tested over Phase 8. Recommend retaining
our own snapshot with `matchIsComplete=true` permanently in SQLite as soon as
we observe it, so we never depend on the site for historical data.
2. **Sport ID duplication** (e.g., `26418` and `11` both = Football):
verify with customer that we should use the canonical breadcrumb ID. The
"category" trees may exist for live grouping or alphabetization purposes.
3. **Localization:** site labels are Russian on `/su/`. There appears to be `/en/`
path support (untested). Customer wants RU + EN — Phase 5 must verify EN locale
page parses identically.
4. **Period total markets in basketball:** sampled NBA event did NOT explicitly
expose "Total points 1st quarter" as a clean market in the public HTML — only
`AllInningsGoalsOver` (combined). Customer's spec implies `Bet_Period-N_Total_*`
is universal — Phase 3 must gracefully degrade and emit `null` rates for fields
the site doesn't surface for that sport+league.
5. **Belarus geo-restriction risk:** we tested from non-BY. If customer's BY IP
gets a different page (KYC overlay, deposit prompt, etc.), the parser must be
robust to unexpected wrapping. Defensive parsing only — never assume strict
structure.
6. **`isLogged: false` overlay risk:** initData reports we are anonymous. Some
markets may be hidden behind login (we did not detect any in samples, but the
parser should treat missing markets as `null`, not throw).
---
## 8. Recommended Phase 3 Architecture
```
IOddsScraper (Application)
│
└── MarathonBetScraper : IOddsScraper (Infrastructure)
├── HttpClient (resilient via Polly v8)
│ ├── User-Agent rotator
│ ├── Token-bucket rate limiter (config: RequestsPerSecond)
│ ├── Retry policy (3x exponential backoff, jitter)
│ └── Circuit breaker (open after N consecutive 5xx)
│
├── EventDiscoveryParser ← parses /su/, /su/live, /su/popular/{sport}
│ produces List
│
├── EventDetailParser ← parses /su/betting/
│ produces FullOddsSnapshot with all markets
│
├── BreadcrumbParser ← extracts Sport / Country / League / Stage taxonomy
│
└── BetMarketMapper ← AngleSharp QuerySelector → spec field name
(sport-aware; uses PeriodScopeMapper)
```
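
The retry policy in the diagram (3x exponential backoff with jitter) can be illustrated as follows. This Python sketch uses the "full jitter" variant, which is one common choice and not something the diagram specifies; in the .NET implementation the Polly retry strategy would compute the delays instead. `backoff_delays` and its parameters are hypothetical.

```python
import random
from typing import Callable, List


def backoff_delays(
    retries: int = 3,
    base_s: float = 1.0,
    cap_s: float = 30.0,
    rng: Callable[[], float] = random.random,
) -> List[float]:
    """Return one delay per retry using exponential backoff with full jitter.

    The ceiling doubles on each attempt (base, 2*base, 4*base, ...) up to
    cap_s, and the actual delay is a random fraction of that ceiling.
    """
    delays = []
    for attempt in range(retries):
        ceiling = min(cap_s, base_s * (2 ** attempt))
        delays.append(rng() * ceiling)
    return delays
```

Jitter spreads retries from concurrent requests so they do not hit the site in lockstep after a 429 or 503.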
**Use AngleSharp for parsing** — it handles malformed HTML well, has a CSS-selector
API, and is the established `.NET` choice. JSON islands inside attributes (`data-sel`,
`data-json`) decode cleanly with `System.Text.Json`.
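
To show what decoding a JSON island involves, the sketch below collects one attribute from every tag and parses its value as JSON. It uses only the Python standard library for illustration; the planned implementation would do the same with AngleSharp and `System.Text.Json`. The class name is hypothetical, and the `outcome` and `price` fields in the usage note are invented, since the real `data-sel` schema is not shown here.

```python
import json
from html.parser import HTMLParser
from typing import Any, List


class DataAttrCollector(HTMLParser):
    """Collect and JSON-decode the values of one attribute across all tags."""

    def __init__(self, attr_name: str) -> None:
        super().__init__()
        # HTMLParser lowercases attribute names, so compare in lowercase.
        self.attr_name = attr_name.lower()
        self.payloads: List[Any] = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name != self.attr_name or not value:
                continue
            try:
                # Entity references (&quot; etc.) are already unescaped here.
                self.payloads.append(json.loads(value))
            except json.JSONDecodeError:
                # Defensive parsing: skip malformed islands instead of failing.
                pass
```

Feeding it `<span data-sel="{&quot;outcome&quot;:&quot;Over_213.5&quot;,&quot;price&quot;:&quot;1.91&quot;}"></span>` leaves one decoded dictionary in `payloads`; a non-JSON attribute value is skipped silently.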
**No Playwright required** for the scraper. Keep Playwright as a documented
fallback in `appsettings` (`Scraping:UsePlaywright = false`) so we can flip it on
later if the site adds JS challenges. This adds <100 LOC of optional code and costs
nothing if unused.
---
## 9. Customer Validation Plan
If our environment ever stops working (geo-block, IP ban, etc.) the customer in
Belarus can:
1. Open https://www.marathonbet.by/su in a browser, verify it renders.
2. View page source (Ctrl+U), search for `data-event-eventId` — confirm same
structure as our captured `spike/captures/pre-match-landing.html`.
3. Save the HTML and email it to dev — the parser is environment-agnostic and
should handle their captured HTML byte-for-byte.
This decouples scraper development from probe environment and makes Phase 3
testable offline.
---
## 10. Captured Samples (gitignored, local only)
| File | Purpose |
|---|---|
| `spike/captures/pre-match-landing.html` | `/su/` snapshot, 587 KB, full grid |
| `spike/captures/live-landing.html` | `/su/live` snapshot, 250 KB |
| `spike/captures/basketball-listing.html` | `/su/popular/Basketball`, 471 KB |
| `spike/captures/event-basketball-28405506.html` | NBA Knicks vs 76ers full event, 505 KB |
| `spike/captures/event-football-28089645.html` | UCL Arsenal vs Atletico full event, 1.58 MB |
| `spike/captures/event-tennis-28430484.html` | ATP Rome qualif full event, 244 KB |
| `spike/captures/liveupdate-popular.json` | Live-update API sample response |
| `spike/captures/results-page.html` | `/su/results` response (~20 KB) — captured to evidence the missing public archive endpoint (Phase 8 deviation). |
These artifacts are **not committed** but should be kept locally to back parser unit
tests in Phase 3.
> **Caveats on captures:**
>
> - `live-landing.html` was captured at a moment when no live events were
> in-progress for popular sports. As a result, the `.score-state` element
> referenced in `SCHEMA_DRAFT.md` §1 is NOT present in this particular capture.
> Phase 3 should re-verify the score selector against a live event during
> parser implementation (the selector itself is well-known across bookmaker
> sites and not in doubt).
> - Hockey events were not sampled directly. Period-result selection key tokens
> for hockey (`1st_Period_Result0.RN_H` etc.) are extrapolated from the
> football/basketball/tennis pattern and marked TBD in `SCHEMA_DRAFT.md`. Phase 3
> must verify against a real hockey event before relying on those tokens.