Files
maraphon-app/spike/SCHEMA_DRAFT.md
T
alexei.dolgolyov 070e34b911 feat(initial-implementation): phase 0 - scraping spike findings
Anonymous scraping confirmed feasible for marathonbet.by — site is fully SSR
(nginx), no Cloudflare or JS challenge. HttpClient + AngleSharp + Polly v8 is
sufficient; Playwright not required (kept as a future-flag).

Spike outputs:
- spike/SCRAPE_FINDINGS.md  — page rendering, URL templates, anti-bot, rate
  limits, recommended scraping strategy for Phase 3.
- spike/SCHEMA_DRAFT.md     — customer-spec field → DOM selector mapping for
  Match + Period-N scope across football/basketball/tennis (hockey TBD).

Phase 1+ handoff captured in subplan + CLAUDE.md. Critical Phase 8 finding:
no public results endpoint at /su/results — phase 8 must switch to polling
event-detail until eventJsonInfo.matchIsComplete=true (deviation flagged).

Reviewer notes addressed:
- Period market outcome codes corrected to RN_H/RN_D/RN_A (not 1/draw/3) and
  market name vocabulary clarified per-sport in SCHEMA_DRAFT §3.1.
- results-page.html capture added to file list with caveat about live-landing
  score-state and unsampled hockey selectors.
2026-05-05 01:04:03 +03:00

319 lines
18 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Phase 0 Spike — Domain Schema Draft
**Purpose:** Map every customer-spec Excel column to a concrete DOM/JSON path in
marathonbet.by. Phase 1 (Domain) and Phase 3 (Scraping/parsing) consume this.
**Convention:** "selector" entries use AngleSharp/CSS notation. `evt` = the
event detail page DOM; `list` = the listing page DOM (top-level grid view).
---
## 1. Event Metadata
| Spec field | Source | Selector / extraction |
|---|---|---|
| `EventCode` | event detail page | `[data-event-eventId]` attribute on the outer `div.coupon-row`. Numeric, e.g., `26456117`. **Stable; use as primary key for the event in our SQLite.** |
| `TreeId` (internal) | event detail page | `[data-event-treeId]` on the same `div.coupon-row`. Used for URL building, less stable than `EventCode`. |
| `SportCode` | breadcrumb of event detail | `breadcrumbs-list .breadcrumbs-item:nth-child(2) a@href` matches `/su/betting/{Sport}+-+{N}`. Parse `N` as integer. Confirmed: Basketball = 6, Football = 11. |
| `Sport` | breadcrumb (RU label) | `breadcrumbs-list .breadcrumbs-item:nth-child(2) .breadcrumb-text` → strip leading `Ставки на ` prefix. e.g., `Ставки на Баскетбол``Баскетбол`. |
| `Country` | breadcrumb | `.breadcrumbs-item:nth-child(3) .breadcrumb-text`. May represent group ("Клубы. Международные") rather than literal country for international leagues — accept as-is. |
| `League` | breadcrumb | `.breadcrumbs-item:nth-child(4) .breadcrumb-text`. e.g., `Лига чемпионов УЕФА`, `NBA`. |
| `Category` | breadcrumb (deeper) | If breadcrumb has 5+ items beyond the event itself, join items 5..N-1 with ` / `. e.g., `Play-Offs / Semi Final / 2nd Leg`. The event detail's `category-label-link` `<h2>` text also exposes this concatenated. |
| `EventName` | event detail | `[data-event-name]` attribute on `div.coupon-row`. e.g., `Арсенал - Атлетико Мадрид`. |
| `Team1` | event detail | `[data-event-name]`, split on ` - `, take index 0. Or: `.player-row.player1 .member-name [data-member-link]` text. |
| `Team2` | event detail | Split index 1, or `.player-row.player2 .member-name [data-member-link]`. |
| `ScheduledAt` (date+time) | event detail + listing | **Time:** `.date-wrapper` text. Two formats: `HH:MM` (today) or `DD <ru-month> HH:MM` (future, e.g., `06 мая 22:00`). **Anchor:** `initData.serverTime` (Moscow TZ, format `YYYY,MM,DD,HH,MM,SS`) parsed and combined with the time. **Title fallback:** `<title>` and `<meta name="description">` contain a Russian-formatted full date (`05 мая 2026`) — use as authoritative when ambiguous. |
| `IsLive` | event detail / listing | `[data-live="true"]` attribute. Live events also carry `.score-state` and `.time` elements with `2:1` and `83:30` style content. |
| `LiveScore` | event detail (live only) | `.score-state` text (`2:1 (1:1)` style). Inning breakdown: parse the `eventJsonInfo` `[data-json]` attribute on the hidden `<td>` — JSON includes `mainScore`, `inningScore[]`, `matchTime.seconds`, `matchIsComplete`. |
| `MatchIsComplete` | event detail | Decoded JSON of `[data-mutable-id="eventJsonInfo"][data-json]``.matchIsComplete` boolean. Critical for Phase 8 (Results loader). |
| `FinalScore` | event detail (post-match) | Same `eventJsonInfo` JSON → `.resultDescription` (e.g., `"2:1 (1:1)"`) when `matchIsComplete=true`. |
---
## 2. Match-Scope Bets (1×2, Handicap, Total)
The event-detail "main row" presents three primary markets in a `coefficients-table`:
**Result** (1×2), **Handicap** (Win-Fora), **Total** (Goals/Points/Games depending
on sport). These map to spec fields `Bet_Match_*`.
### 2.1 Match Win 1 / Draw / Win 2
| Spec field | data-selection-key suffix | DOM path |
|---|---|---|
| `Bet_Match_Win_1` | `@Match_Result.1` (football, tennis, hockey) **OR** `@Result.1` (basketball pre-match) **OR** `@Normal_Time_Result.1` (basketball detail) | `evt span[data-selection-key$='@Match_Result.1']@data-selection-price` (decimal odds, e.g., `1.65`) |
| `Bet_Match_Draw` | `.draw` outcome of same market | `evt span[data-selection-key$='@Match_Result.draw']@data-selection-price`. **NULL for tennis** (2-way market, no draw). |
| `Bet_Match_Win_2` | `.3` outcome | `evt span[data-selection-key$='@Match_Result.3']@data-selection-price` |
**Sport variance:**
- Football, Tennis, Table-tennis: `Match_Result`.
- Basketball: in pre-match landing, label is `Match_Winner_Including_All_OT.HB_H/HB_A`
(2-way, OT included). On the detail page, both `Normal_Time_Result.{1,draw,3}` (3-way,
reg time) and `Match_Winner_Including_All_OT.{HB_H,HB_A}` (2-way, OT included) appear.
**Recommendation:** treat `Match_Winner_Including_All_OT` as the canonical Win-1 / Win-2
(no Draw) when a 3-way `Result` market is absent; fall back to draw-included
`Normal_Time_Result` when present.
- Hockey: TBD — verify in Phase 3 with an actual hockey event capture.
**Recommendation for Phase 1 domain:** define `BetType.WinDraw` allowing nullable
`Draw`. The Excel exporter writes empty cell when `Draw` is null.
### 2.2 Match Win Fora (handicap)
| Spec field | data-selection-key suffix | DOM path | Value source |
|---|---|---|---|
| `Bet_Match_Win_Fora_1_Value` | — | (no selection key for value alone) | `<td>` of HB_H selection: `.middle-simple` text inside the `<div class="nowrap simple-price">` (e.g., `(-1.0)`). Strip parens, parse as `decimal`. |
| `Bet_Match_Win_Fora_1_Rate` | `@To_Win_Match_With_Handicap{N}.HB_H` (or `@Match_Handicap.HB_H` variant) | `[data-selection-key$='@To_Win_Match_With_Handicap.HB_H']@data-selection-price` | — |
| `Bet_Match_Win_Fora_2_Value` | — | `.middle-simple` next to HB_A selection (e.g., `(+1.0)`). | — |
| `Bet_Match_Win_Fora_2_Rate` | `@To_Win_Match_With_Handicap{N}.HB_A` | `[data-selection-key$='@To_Win_Match_With_Handicap.HB_A']@data-selection-price` | — |
**Tennis variant:** uses `@To_Win_Match_With_Handicap_By_Games{N}.HB_H/HB_A`.
The handicap is in **games** not points — emit `Value` as-is, the unit is implicit
in the sport.
**Multi-line handicap:** the site offers many lines (`To_Win_Match_With_Handicap0`,
`...1`, `...2`, ...), each a different handicap value. The customer spec wants only
the **main line** (the one displayed in the listing's main row). Phase 3 should:
1. On listing pages, take the handicap displayed in the `coefficients-table`
`data-market-type="HANDICAP"` cell.
2. On event detail, identify the "main" line as the one without a numeric suffix
(`@To_Win_Match_With_Handicap.HB_H`) or with suffix `0` if both exist — sample
shows both `To_Win_Match_With_Handicap.HB_H` and `...0.HB_H`. Heuristic: pick
the line whose handicap value is closest to ±1.0 from the favorite, OR explicitly
prefer the no-suffix variant; fall back to suffix `0`.
3. Optional: capture the full handicap ladder into a separate normalized table
so anomaly detection can use the spread, even if Excel only exports the main line.
### 2.3 Match Total Less / More
| Spec field | data-selection-key suffix | DOM path |
|---|---|---|
| `Bet_Match_Total_Less_Value` | — | `.middle-simple` next to the `Меньше` selection (e.g., `3.5`, `213.5`). |
| `Bet_Match_Total_Less_Rate` | `@Total_{Goals\|Points\|Games}{N}.Under_<X>` | `[data-selection-key^='<eventId>@Total_'][data-selection-key$='.Under_<X>']@data-selection-price`. Use the row whose Value equals the chosen total threshold. |
| `Bet_Match_Total_More_Value` | — | Same value as Less (paired). |
| `Bet_Match_Total_More_Rate` | `@Total_{Goals\|Points\|Games}{N}.Over_<X>` | `[data-selection-key$='.Over_<X>']@data-selection-price` |
**Sport vocabulary:**
- Football: `Total_Goals`
- Basketball: `Total_Points`
- Tennis: `Total_Games`
- Hockey: `Total_Goals` (TBD)
- Volleyball / handball: TBD
**Choosing the "main" total line:** customer spec wants ONE Total Value + Less/More
rates per event. The site offers ~20 different total thresholds per event. The
listing page main row exposes the "headline" total (the one the bookmaker chose
to show). **Heuristic:**
1. On listing: read the `data-market-type="TOTAL"` cell directly.
2. On event detail: find the row labeled in `coefficients-row` (visible main view),
not in `coefficients-hidden-row`. The `data-mutable-id="S_3_1_european"` /
`S_3_3_european` pair is the main line.
3. Fall back to picking the line whose Under/Over rates are closest to **2.00**
each (the "balanced" line — most representative of bookmaker's expectation).
4. As with handicap, capture the full ladder for analysis even if exports only one row.
---
## 3. Period-N Scope Bets
Period markets follow the same pattern as match markets but with a period prefix
in the market token. Examples for `Period-1` (1st half of football, 1st quarter
of basketball, 1st set of tennis):
### 3.1 Period-N Win 1 / Draw / Win 2
> **CORRECTED FROM CAPTURE EVIDENCE (2026-05-05):** Period result markets use
> `RN_H` / `RN_D` / `RN_A` outcome codes (Reduced Numerals: Home / Draw / Away),
> NOT the `1` / `draw` / `3` codes used by `@Match_Result`. Market names also
> vary: football uses `Result_-_1st_Half` (with separator dashes); basketball and
> tennis use `1st_Half_Result0` / `1st_Quarter_Result0` / `1st_Set_Result0`
> (note the literal `0` suffix on the market name — line index for the period
> result market). Phase 3 parser must use these exact tokens.
| Customer field | Football (1st Half) | Basketball (1st Half *or* Quarter) | Tennis (1st Set) | Hockey (1st Period) |
|---|---|---|---|---|
| `Bet_Period-1_Win_1` | `@Result_-_1st_Half.RN_H` | `@1st_Half_Result0.RN_H` (halves) **or** `@1st_Quarter_Result0.RN_H` (quarters) | `@1st_Set_Result0.RN_H` | `@1st_Period_Result0.RN_H` (TBD verify on hockey event) |
| `Bet_Period-1_Draw` | `@Result_-_1st_Half.RN_D` | `@1st_Half_Result0.RN_D` / `@1st_Quarter_Result0.RN_D` | (NULL — no draw) | `@1st_Period_Result0.RN_D` (TBD) |
| `Bet_Period-1_Win_2` | `@Result_-_1st_Half.RN_A` | `@1st_Half_Result0.RN_A` / `@1st_Quarter_Result0.RN_A` | `@1st_Set_Result0.RN_A` | `@1st_Period_Result0.RN_A` (TBD) |
The market token vocabulary differs by sport:
- **Football:** `Result_-_<ordinal>_<unit>` (e.g., `Result_-_1st_Half`, `Result_-_2nd_Half`).
- **Basketball / Tennis / Hockey:** `<ordinal>_<unit>_Result0` (e.g.,
`1st_Half_Result0`, `1st_Quarter_Result0`, `1st_Set_Result0`,
`1st_Period_Result0`). The `0` suffix is required.
- **Note:** non-period markets like `@Match_Result.1` and `@Match_Result.draw`
still use the `1`/`draw`/`3` outcome codes — the `RN_*` codes are specific to
period/half/quarter/set markets.
**Period count by sport** (default mapping for `Period-N`):
- Football: N ∈ {1, 2}
- Basketball: configurable — halves (N ∈ {1,2}) or quarters (N ∈ {1,2,3,4}). **Default to halves.**
- Tennis: N ∈ {1, 2, ...} until `<i>th_Set_Result` selection is absent. Cap at 5 for Grand Slams.
- Hockey: N ∈ {1, 2, 3}.
### 3.2 Period-N Win Fora
Same as match handicap, with period prefix:
| Sport | Selection key |
|---|---|
| Football | `@To_Win_1st_Half_With_Handicap{N}.HB_H` / `.HB_A` |
| Basketball | `@To_Win_1st_Half_With_Handicap{N}.HB_*` (or `_1st_Quarter_`) |
| Tennis | `@To_Win_1st_Set_With_Handicap{N}.HB_*` |
| Hockey | `@To_Win_1st_Period_With_Handicap{N}.HB_*` (TBD verify) |
Value extraction: same `.middle-simple` text as match handicap.
### 3.3 Period-N Total Less / More
This is the **least uniform** market. Observed:
| Sport | Period-1 Total selection key |
|---|---|
| Football | (search HTML directly — Phase 3 should parse the "Тотал тайма" tab) Likely `@1st_Half_Total_Goals{N}.Under_<X>` / `.Over_<X>`. |
| Basketball | Per-quarter total exposed as separate market in the "Тоталы" tab; sample event did not show clean `1st_Half_Total_Points` keys — see SCRAPE_FINDINGS.md §6 risk #4. **May need to fall back to NULL** for basketball Period-N Total in some leagues. |
| Tennis | `@1st_Set_Total_Games{N}.Under_<X>` / `.Over_<X>` — confirmed in sample. |
| Hockey | `@1st_Period_Total_Goals...` (TBD verify). |
**Phase 3 robustness rule:** if a period-N market is absent in the parsed HTML,
emit `null` for the corresponding rate/value. Never throw. The Excel exporter
writes empty cell.
---
## 4. Live Counterparts
When the same scope is captured from the **live** site (`/su/live` or live-flagged
events on `/su/`), the spec wants column prefix `Live_*` instead of `Bet_*`.
**Important:** live events use the SAME `data-selection-key` naming conventions.
The distinguishing signal is `data-live="true"` on the outer `div.coupon-row` and
the URL the snapshot was scraped from (`/su/live`).
Examples:
- `Live_Match_Win_1``[data-selection-key$='@Match_Result.1']` from live page
- `Live_Match_Win_Fora_1_Value`, `Live_Match_Win_Fora_1_Rate` ← same DOM, same logic
- `Live_Period-1_Win_1` ← same as `Bet_Period-1_Win_1` but captured from live event
**Implementation:** the parser does not change. The application service simply
records `Source = Live | PreMatch` on each `OddsSnapshot` and the Excel exporter
denormalizes pre-match snapshots to `Bet_*` columns and live snapshots to `Live_*`
columns at write time.
---
## 5. Field Coverage Matrix (spec → confidence)
| Field family | Football | Basketball | Tennis | Hockey | Notes |
|---|---|---|---|---|---|
| `Match_Win_1/2`, `Match_Draw` | ✅ confirmed | ⚠️ Win-1/2 confirmed; Draw conditional on `Normal_Time_Result` presence | ✅ Win-1/2 confirmed; **Draw is null** | ❓ verify Phase 3 | — |
| `Match_Win_Fora_*` | ✅ | ✅ | ✅ (in games) | ❓ | "Main line" heuristic needed (§2.2) |
| `Match_Total_*` | ✅ Goals | ✅ Points | ✅ Games | ❓ | "Main line" heuristic needed (§2.3) |
| `Period-1_Win_*` | ✅ Half | ✅ Half / Quarter | ✅ Set | ❓ Period | basketball mode is configurable |
| `Period-1_Win_Fora_*` | ✅ | ✅ | ✅ | ❓ | — |
| `Period-1_Total_*` | ⚠️ structure verified, exact key TBD | ⚠️ may be absent for some games | ✅ Set | ❓ | risk: emit null where absent |
| `Period-2/3/4_*` | (Period-2 only) | ✅ all | up to actual played sets | ❓ | — |
| `Live_*` (any of above) | same parser | same | same | same | distinguished only by `data-live` flag + scrape URL |
Legend: ✅ confirmed in spike sample, ⚠️ partial / heuristic needed, ❓ Phase 3 must verify.
---
## 6. Suggested Domain Types (Phase 1 input)
```csharp
// Marathon.Domain
public enum BetScope { Match, Period }
public enum BetType { Win, Draw, WinFora, Total }
public enum BetSide { Side1, Side2, Less, More } // Side1=home/W1, Side2=away/W2
public sealed record Sport(int Code, string NameRu, string NameEn);
public sealed record League(int TreeId, string NameRu, int SportCode);
public sealed record Event(
long EventCode, // marathonbet's data-event-eventId
int TreeId, // for URL building
int SportCode,
int LeagueTreeId,
string Country, // breadcrumb position 3
string? Category, // joined breadcrumb 5..N-1
string Team1,
string Team2,
DateTimeOffset ScheduledAt, // anchored on initData.serverTime
string DetailUrl);
public sealed record Bet(
BetScope Scope,
int? PeriodNumber, // null when Scope=Match
BetType Type,
BetSide? Side, // null for Type=Draw
decimal? Value, // handicap/total threshold; null for Win/Draw
decimal Rate); // decimal odds (e.g., 1.65)
public sealed record OddsSnapshot(
long EventCode,
DateTimeOffset CapturedAt,
SnapshotSource Source, // Pre | Live
IReadOnlyList<Bet> Bets);
public enum SnapshotSource { PreMatch, Live }
```
Phase 1 will refine names, but this captures the data shape Phase 3 produces.
---
## 7. Excel Column Generation (Phase 4 / 9 reference)
The Excel exporter generates wide rows by joining all `Bet`s of an `OddsSnapshot`
into named columns. Pseudocode:
```
foreach snapshot:
row.EventCode = snapshot.EventCode
row.SportCode = event.SportCode
row.Sport = event.Sport.NameRu
row.Country = event.Country
row.League = event.League.NameRu
row.Category = event.Category
row.ScheduledAt = event.ScheduledAt
prefix = snapshot.Source == PreMatch ? "Bet_" : "Live_"
// Match scope
row[prefix+"Match_Win_1"] = bet.Where(scope=Match, type=Win, side=Side1).Rate
row[prefix+"Match_Draw"] = bet.Where(scope=Match, type=Draw).Rate
row[prefix+"Match_Win_2"] = bet.Where(scope=Match, type=Win, side=Side2).Rate
row[prefix+"Match_Win_Fora_1_Value"] = bet.Where(scope=Match, type=WinFora, side=Side1).Value
row[prefix+"Match_Win_Fora_1_Rate"] = bet.Where(scope=Match, type=WinFora, side=Side1).Rate
row[prefix+"Match_Win_Fora_2_Value"] = bet.Where(scope=Match, type=WinFora, side=Side2).Value
row[prefix+"Match_Win_Fora_2_Rate"] = bet.Where(scope=Match, type=WinFora, side=Side2).Rate
row[prefix+"Match_Total_Less_Value"] = bet.Where(scope=Match, type=Total, side=Less).Value
row[prefix+"Match_Total_Less_Rate"] = bet.Where(scope=Match, type=Total, side=Less).Rate
row[prefix+"Match_Total_More_Value"] = bet.Where(scope=Match, type=Total, side=More).Value
row[prefix+"Match_Total_More_Rate"] = bet.Where(scope=Match, type=Total, side=More).Rate
// Period scope (foreach period N exposed for that sport)
for N in 1..MaxPeriodForSport(sportCode):
same fields with key {prefix}Period-{N}_*
null when bet absent
```
Spec column order is left to Phase 4 (`ExcelExporter`). Recommend:
`Date, Time, Sport, Country, League, Category, Event, EventCode,
Bet_Match_*..., Bet_Period-1_*..., Bet_Period-2_*..., Live_Match_*..., Live_Period-N_*...`
---
## 8. Decisions Pending Customer Confirmation
1. **Basketball Period mapping** — halves (default) or quarters? Spec says
"Period-N" but is silent on which N applies. Recommend halves (`N ∈ {1,2}`)
with a quarter mode opt-in via `appsettings.Sports.Basketball.PeriodMode`.
2. **Tennis Draw column** — emit empty / 0 / "—"? Recommend empty cell.
3. **Handicap "main line" rule** — pick the listing's main row, OR the no-suffix
selection, OR the spread closest to bookmaker-implied probability 50/50?
4. **Total "main line" rule** — same as above.
5. **Field name capitalization** — spec uses `Bet_Match_Win_Fora_1_Value` exactly.
Recommend matching exactly (case-sensitive) for compatibility with downstream
pivot tables / scripts.