maraphon-app/spike/SCHEMA_DRAFT.md
alexei.dolgolyov 070e34b911 feat(initial-implementation): phase 0 - scraping spike findings
Anonymous scraping confirmed feasible for marathonbet.by — site is fully SSR
(nginx), no Cloudflare or JS challenge. HttpClient + AngleSharp + Polly v8 is
sufficient; Playwright not required (kept as a future-flag).

Spike outputs:
- spike/SCRAPE_FINDINGS.md  — page rendering, URL templates, anti-bot, rate
  limits, recommended scraping strategy for Phase 3.
- spike/SCHEMA_DRAFT.md     — customer-spec field → DOM selector mapping for
  Match + Period-N scope across football/basketball/tennis (hockey TBD).

Phase 1+ handoff captured in subplan + CLAUDE.md. Critical Phase 8 finding:
no public results endpoint at /su/results — phase 8 must switch to polling
event-detail until eventJsonInfo.matchIsComplete=true (deviation flagged).

Reviewer notes addressed:
- Period market outcome codes corrected to RN_H/RN_D/RN_A (not 1/draw/3) and
  market name vocabulary clarified per-sport in SCHEMA_DRAFT §3.1.
- results-page.html capture added to file list with caveat about live-landing
  score-state and unsampled hockey selectors.
2026-05-05 01:04:03 +03:00


Phase 0 Spike — Domain Schema Draft

Purpose: Map every customer-spec Excel column to a concrete DOM/JSON path in marathonbet.by. Phase 1 (Domain) and Phase 3 (Scraping/parsing) consume this.

Convention: "selector" entries use AngleSharp/CSS notation. evt = the event detail page DOM; list = the listing page DOM (top-level grid view).


1. Event Metadata

| Spec field | Source | Selector / extraction |
| --- | --- | --- |
| EventCode | event detail page | [data-event-eventId] attribute on the outer div.coupon-row. Numeric, e.g., 26456117. Stable; use as primary key for the event in our SQLite. |
| TreeId (internal) | event detail page | [data-event-treeId] on the same div.coupon-row. Used for URL building; less stable than EventCode. |
| SportCode | breadcrumb of event detail | breadcrumbs-list .breadcrumbs-item:nth-child(2) a@href matches /su/betting/{Sport}+-+{N}. Parse N as integer. Confirmed: Basketball = 6, Football = 11. |
| Sport | breadcrumb (RU label) | breadcrumbs-list .breadcrumbs-item:nth-child(2) .breadcrumb-text → strip leading "Ставки на" prefix. e.g., Ставки на Баскетбол → Баскетбол. |
| Country | breadcrumb | .breadcrumbs-item:nth-child(3) .breadcrumb-text. May represent a group ("Клубы. Международные") rather than a literal country for international leagues; accept as-is. |
| League | breadcrumb | .breadcrumbs-item:nth-child(4) .breadcrumb-text. e.g., Лига чемпионов УЕФА, NBA. |
| Category | breadcrumb (deeper) | If the breadcrumb has 5+ items beyond the event itself, join items 5..N-1 with " / ". e.g., Play-Offs / Semi Final / 2nd Leg. The event detail's category-label-link <h2> text also exposes this concatenated. |
| EventName | event detail | [data-event-name] attribute on div.coupon-row. e.g., Арсенал - Атлетико Мадрид. |
| Team1 | event detail | [data-event-name], split on " - ", take index 0. Or: .player-row.player1 .member-name [data-member-link] text. |
| Team2 | event detail | Split index 1, or .player-row.player2 .member-name [data-member-link]. |
| ScheduledAt (date+time) | event detail + listing | Time: .date-wrapper text. Two formats: HH:MM (today) or DD <ru-month> HH:MM (future, e.g., 06 мая 22:00). Anchor: initData.serverTime (Moscow TZ, format YYYY,MM,DD,HH,MM,SS) parsed and combined with the time. Title fallback: <title> and <meta name="description"> contain a Russian-formatted full date (05 мая 2026); use as authoritative when ambiguous. |
| IsLive | event detail / listing | [data-live="true"] attribute. Live events also carry .score-state and .time elements with 2:1 and 83:30 style content. |
| LiveScore | event detail (live only) | .score-state text (2:1 (1:1) style). Inning breakdown: parse the eventJsonInfo [data-json] attribute on the hidden <td>. The JSON includes mainScore, inningScore[], matchTime.seconds, matchIsComplete. |
| MatchIsComplete | event detail | Decoded JSON of [data-mutable-id="eventJsonInfo"][data-json].matchIsComplete boolean. Critical for Phase 8 (Results loader). |
| FinalScore | event detail (post-match) | Same eventJsonInfo JSON → .resultDescription (e.g., "2:1 (1:1)") when matchIsComplete=true. |
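
The ScheduledAt rule above (two text formats, anchored on initData.serverTime) could be sketched roughly as follows. This is a minimal sketch, not the Phase 3 parser: the ScheduledAtParser name, the fixed UTC+3 offset for Moscow, and the year-rollover rule for dates that appear to lie in the past are all assumptions that the spike did not verify.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class ScheduledAtParser
{
    // Assumption: Moscow time is a fixed UTC+3 offset (no DST).
    private static readonly TimeSpan MoscowOffset = TimeSpan.FromHours(3);

    // Genitive Russian month names, as in "06 мая 22:00".
    private static readonly Dictionary<string, int> RuMonths =
        new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase)
        {
            ["января"] = 1, ["февраля"] = 2, ["марта"] = 3, ["апреля"] = 4,
            ["мая"] = 5, ["июня"] = 6, ["июля"] = 7, ["августа"] = 8,
            ["сентября"] = 9, ["октября"] = 10, ["ноября"] = 11, ["декабря"] = 12,
        };

    // Parses initData.serverTime in the "YYYY,MM,DD,HH,MM,SS" format.
    public static DateTimeOffset ParseServerTime(string raw)
    {
        int[] p = Array.ConvertAll(raw.Split(','),
            s => int.Parse(s.Trim(), CultureInfo.InvariantCulture));
        return new DateTimeOffset(p[0], p[1], p[2], p[3], p[4], p[5], MoscowOffset);
    }

    // Combines the .date-wrapper text with the server-time anchor.
    // Returns null for any text that matches neither documented format.
    public static DateTimeOffset? Parse(string dateWrapperText, DateTimeOffset serverNow)
    {
        string[] parts = dateWrapperText.Split(
            new[] { ' ', '\t', '\r', '\n', '\u00A0' }, StringSplitOptions.RemoveEmptyEntries);
        if (parts.Length != 1 && parts.Length != 3) return null;

        string[] hm = parts[parts.Length - 1].Split(':');
        if (hm.Length != 2
            || !int.TryParse(hm[0], NumberStyles.None, CultureInfo.InvariantCulture, out int hour)
            || !int.TryParse(hm[1], NumberStyles.None, CultureInfo.InvariantCulture, out int minute)
            || hour > 23 || minute > 59)
            return null;

        DateTimeOffset today = serverNow.ToOffset(MoscowOffset);
        if (parts.Length == 1) // "HH:MM": the event is today in server time
            return new DateTimeOffset(today.Year, today.Month, today.Day, hour, minute, 0, MoscowOffset);

        // "DD <ru-month> HH:MM": a dated event
        if (!int.TryParse(parts[0], NumberStyles.None, CultureInfo.InvariantCulture, out int day)
            || !RuMonths.TryGetValue(parts[1], out int month))
            return null;
        if (day < 1 || day > DateTime.DaysInMonth(today.Year, month)) return null;

        var candidate = new DateTimeOffset(today.Year, month, day, hour, minute, 0, MoscowOffset);
        // Assumption: a date more than one day in the past belongs to next year.
        return candidate < today.AddDays(-1) ? candidate.AddYears(1) : candidate;
    }
}
```

When the result is ambiguous, the full date in <title> remains the authoritative source; this sketch does not cover that fallback.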

2. Match-Scope Bets (1×2, Handicap, Total)

The event-detail "main row" presents three primary markets in a coefficients-table: Result (1×2), Handicap (Win-Fora), Total (Goals/Points/Games depending on sport). These map to spec fields Bet_Match_*.

2.1 Match Win 1 / Draw / Win 2

| Spec field | data-selection-key suffix | DOM path |
| --- | --- | --- |
| Bet_Match_Win_1 | @Match_Result.1 (football, tennis, hockey) OR @Result.1 (basketball pre-match) OR @Normal_Time_Result.1 (basketball detail) | evt span[data-selection-key$='@Match_Result.1']@data-selection-price (decimal odds, e.g., 1.65) |
| Bet_Match_Draw | .draw outcome of same market | evt span[data-selection-key$='@Match_Result.draw']@data-selection-price. NULL for tennis (2-way market, no draw). |
| Bet_Match_Win_2 | .3 outcome | evt span[data-selection-key$='@Match_Result.3']@data-selection-price |

Sport variance:

  • Football, Tennis, Table-tennis: Match_Result.
  • Basketball: in the pre-match landing, the label is Match_Winner_Including_All_OT.HB_H/HB_A (2-way, OT included). On the detail page, both Normal_Time_Result.{1,draw,3} (3-way, regular time) and Match_Winner_Including_All_OT.{HB_H,HB_A} (2-way, OT included) appear. Recommendation: when the 3-way Normal_Time_Result market is present, use it for Win-1 / Draw / Win-2; when no 3-way Result market is present, treat Match_Winner_Including_All_OT as the canonical Win-1 / Win-2 and leave Draw null.
  • Hockey: TBD — verify in Phase 3 with an actual hockey event capture.

Recommendation for Phase 1 domain: define BetType.WinDraw allowing nullable Draw. The Excel exporter writes empty cell when Draw is null.
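
One way to read the sport-variance rules above is as an ordered fallback over market tokens. The sketch below assumes the parser has already collected prices into a dictionary keyed by the text after "@" in data-selection-key; MatchResult and MatchResultResolver are hypothetical names, and the priority order (3-way markets first, then the 2-way OT market) is an interpretation of the recommendation, not a confirmed rule.

```csharp
using System.Collections.Generic;

// Hypothetical shape: Draw is null for 2-way markets (tennis, basketball OT winner).
public sealed record MatchResult(decimal Win1, decimal? Draw, decimal Win2);

public static class MatchResultResolver
{
    // 3-way market tokens, tried in order; outcomes are .1 / .draw / .3.
    private static readonly string[] ThreeWayMarkets =
        { "Match_Result", "Result", "Normal_Time_Result" };

    private const string OvertimeMarket = "Match_Winner_Including_All_OT";

    // prices: "<Market>.<outcome>" (the text after '@') -> decimal odds.
    public static MatchResult? Resolve(IReadOnlyDictionary<string, decimal> prices)
    {
        foreach (string market in ThreeWayMarkets)
        {
            if (prices.TryGetValue(market + ".1", out decimal win1)
                && prices.TryGetValue(market + ".3", out decimal win2))
            {
                // The draw outcome is optional: tennis exposes only .1 and .3.
                decimal? draw = prices.TryGetValue(market + ".draw", out decimal d)
                    ? d
                    : (decimal?)null;
                return new MatchResult(win1, draw, win2);
            }
        }

        // No 3-way market: use the 2-way OT-inclusive winner, Draw stays null.
        if (prices.TryGetValue(OvertimeMarket + ".HB_H", out decimal home)
            && prices.TryGetValue(OvertimeMarket + ".HB_A", out decimal away))
            return new MatchResult(home, null, away);

        return null; // market absent: caller emits empty cells
    }
}
```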

2.2 Match Win Fora (handicap)

| Spec field | data-selection-key suffix | DOM path / value source |
| --- | --- | --- |
| Bet_Match_Win_Fora_1_Value | (no selection key for value alone) | <td> of HB_H selection: .middle-simple text inside the <div class="nowrap simple-price"> (e.g., (-1.0)). Strip parens, parse as decimal. |
| Bet_Match_Win_Fora_1_Rate | @To_Win_Match_With_Handicap{N}.HB_H (or @Match_Handicap.HB_H variant) | [data-selection-key$='@To_Win_Match_With_Handicap.HB_H']@data-selection-price |
| Bet_Match_Win_Fora_2_Value | | .middle-simple next to HB_A selection (e.g., (+1.0)). |
| Bet_Match_Win_Fora_2_Rate | @To_Win_Match_With_Handicap{N}.HB_A | [data-selection-key$='@To_Win_Match_With_Handicap.HB_A']@data-selection-price |

Tennis variant: uses @To_Win_Match_With_Handicap_By_Games{N}.HB_H/HB_A. The handicap is in games not points — emit Value as-is, the unit is implicit in the sport.

Multi-line handicap: the site offers many lines (To_Win_Match_With_Handicap0, ...1, ...2, ...), each a different handicap value. The customer spec wants only the main line (the one displayed in the listing's main row). Phase 3 should:

  1. On listing pages, take the handicap displayed in the coefficients-table data-market-type="HANDICAP" cell.
  2. On event detail, identify the "main" line as the one without a numeric suffix (@To_Win_Match_With_Handicap.HB_H) or with suffix 0 if both exist — sample shows both To_Win_Match_With_Handicap.HB_H and ...0.HB_H. Heuristic: pick the line whose handicap value is closest to ±1.0 from the favorite, OR explicitly prefer the no-suffix variant; fall back to suffix 0.
  3. Optional: capture the full handicap ladder into a separate normalized table so anomaly detection can use the spread, even if Excel only exports the main line.
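
The "prefer the no-suffix variant, fall back to suffix 0" branch of step 2, together with the parenthesised value parsing from the table, might look like the sketch below. HandicapSelection and HandicapMainLine are hypothetical names, and the input is assumed to be selections already extracted from the DOM; the "closest to ±1.0" alternative is deliberately not implemented here.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

// Hypothetical carrier for one handicap selection scraped from the page.
// Market is the token between '@' and '.', e.g. "To_Win_Match_With_Handicap0".
public sealed record HandicapSelection(string Market, string Outcome, string ValueText, decimal Rate);

public static class HandicapMainLine
{
    private const string BaseMarket = "To_Win_Match_With_Handicap";

    // "(-1.0)" -> -1.0, "(+1.0)" -> 1.0; null when the text is not numeric.
    public static decimal? ParseValue(string text)
    {
        string trimmed = text.Trim().TrimStart('(').TrimEnd(')');
        return decimal.TryParse(trimmed, NumberStyles.Float, CultureInfo.InvariantCulture, out decimal v)
            ? v
            : (decimal?)null;
    }

    // Returns the selections of the main line: the no-suffix market if present,
    // otherwise suffix 0, otherwise an empty list (exporter writes empty cells).
    public static IReadOnlyList<HandicapSelection> Pick(IEnumerable<HandicapSelection> all)
    {
        ILookup<string, HandicapSelection> byMarket = all.ToLookup(s => s.Market);
        foreach (string market in new[] { BaseMarket, BaseMarket + "0" })
        {
            if (byMarket.Contains(market))
                return byMarket[market].ToList();
        }
        return Array.Empty<HandicapSelection>();
    }
}
```

The tennis variant would need To_Win_Match_With_Handicap_By_Games as the base token; passing the base in as a parameter is one option.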

2.3 Match Total Less / More

| Spec field | data-selection-key suffix | DOM path |
| --- | --- | --- |
| Bet_Match_Total_Less_Value | | .middle-simple next to the Меньше (Under) selection (e.g., 3.5, 213.5). |
| Bet_Match_Total_Less_Rate | @Total_{Goals\|Points\|Games}{N}.Under_<X> | [data-selection-key^='<eventId>@Total_'][data-selection-key$='.Under_<X>']@data-selection-price. Use the row whose Value equals the chosen total threshold. |
| Bet_Match_Total_More_Value | | Same value as Less (paired). |
| Bet_Match_Total_More_Rate | @Total_{Goals\|Points\|Games}{N}.Over_<X> | [data-selection-key$='.Over_<X>']@data-selection-price |

Sport vocabulary:

  • Football: Total_Goals
  • Basketball: Total_Points
  • Tennis: Total_Games
  • Hockey: Total_Goals (TBD)
  • Volleyball / handball: TBD

Choosing the "main" total line: customer spec wants ONE Total Value + Less/More rates per event. The site offers ~20 different total thresholds per event. The listing page main row exposes the "headline" total (the one the bookmaker chose to show). Heuristic:

  1. On listing: read the data-market-type="TOTAL" cell directly.
  2. On event detail: take the row inside coefficients-row (the visible main view), not the rows inside coefficients-hidden-row. The data-mutable-id="S_3_1_european" / S_3_3_european pair is the main line.
  3. Fall back to picking the line whose Under/Over rates are closest to 2.00 each (the "balanced" line — most representative of bookmaker's expectation).
  4. As with handicap, capture the full ladder for analysis even if exports only one row.
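
The "balanced line" fallback in step 3 can be expressed as a minimum-distance search over the ladder. A minimal sketch, assuming the ladder has already been parsed into threshold and rate triples; TotalLine and TotalMainLine are hypothetical names, and the distance metric (sum of absolute deviations from 2.00) and the lowest-threshold tie-break are two reasonable choices among several.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical carrier for one total threshold with its paired rates.
public sealed record TotalLine(decimal Threshold, decimal UnderRate, decimal OverRate);

public static class TotalMainLine
{
    private const decimal Even = 2.00m;

    // Picks the line whose Under/Over rates are jointly closest to 2.00.
    // Ties are broken by the lower threshold; an empty ladder yields null.
    public static TotalLine? PickBalanced(IEnumerable<TotalLine> ladder) =>
        ladder
            .OrderBy(l => Math.Abs(l.UnderRate - Even) + Math.Abs(l.OverRate - Even))
            .ThenBy(l => l.Threshold)
            .FirstOrDefault();
}
```

For a ladder of 2.5 (2.40 / 1.55), 3.5 (1.90 / 1.92) and 4.5 (1.35 / 3.10), the distances are 0.85, 0.18 and 1.75, so the 3.5 line is selected.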

3. Period-N Scope Bets

Period markets follow the same pattern as match markets but with a period prefix in the market token. Examples for Period-1 (1st half of football, 1st quarter of basketball, 1st set of tennis):

3.1 Period-N Win 1 / Draw / Win 2

CORRECTED FROM CAPTURE EVIDENCE (2026-05-05): Period result markets use RN_H / RN_D / RN_A outcome codes (Reduced Numerals: Home / Draw / Away), NOT the 1 / draw / 3 codes used by @Match_Result. Market names also vary: football uses Result_-_1st_Half (with separator dashes); basketball and tennis use 1st_Half_Result0 / 1st_Quarter_Result0 / 1st_Set_Result0 (note the literal 0 suffix on the market name — line index for the period result market). Phase 3 parser must use these exact tokens.

| Customer field | Football (1st Half) | Basketball (1st Half or Quarter) | Tennis (1st Set) | Hockey (1st Period) |
| --- | --- | --- | --- | --- |
| Bet_Period-1_Win_1 | @Result_-_1st_Half.RN_H | @1st_Half_Result0.RN_H (halves) or @1st_Quarter_Result0.RN_H (quarters) | @1st_Set_Result0.RN_H | @1st_Period_Result0.RN_H (TBD verify on hockey event) |
| Bet_Period-1_Draw | @Result_-_1st_Half.RN_D | @1st_Half_Result0.RN_D / @1st_Quarter_Result0.RN_D | NULL (no draw) | @1st_Period_Result0.RN_D (TBD) |
| Bet_Period-1_Win_2 | @Result_-_1st_Half.RN_A | @1st_Half_Result0.RN_A / @1st_Quarter_Result0.RN_A | @1st_Set_Result0.RN_A | @1st_Period_Result0.RN_A (TBD) |

The market token vocabulary differs by sport:

  • Football: Result_-_<ordinal>_<unit> (e.g., Result_-_1st_Half, Result_-_2nd_Half).
  • Basketball / Tennis / Hockey: <ordinal>_<unit>_Result0 (e.g., 1st_Half_Result0, 1st_Quarter_Result0, 1st_Set_Result0, 1st_Period_Result0). The 0 suffix is required.
  • Note: non-period markets like @Match_Result.1 and @Match_Result.draw still use the 1/draw/3 outcome codes — the RN_* codes are specific to period/half/quarter/set markets.
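
The per-sport token vocabulary above lends itself to a small key builder. The sketch below is illustrative only: SportKind and PeriodMarketKeys are hypothetical names, the hockey branch follows the unverified (TBD) pattern from the table, and ordinals beyond "2nd" are assumed to follow English spelling since the spike only sampled "1st" and "2nd".

```csharp
using System;

public enum SportKind { Football, Basketball, Tennis, Hockey }

public static class PeriodMarketKeys
{
    // Assumption: the site spells ordinals the English way beyond "2nd".
    public static string Ordinal(int n) => n switch
    {
        1 => "1st",
        2 => "2nd",
        3 => "3rd",
        _ => n + "th",
    };

    // Builds the data-selection-key suffix for a period result outcome.
    // unit: "Half", "Quarter", "Set" or "Period"; outcome: "RN_H", "RN_D" or "RN_A".
    public static string ResultSuffix(SportKind sport, int period, string unit, string outcome)
    {
        if (period < 1) throw new ArgumentOutOfRangeException(nameof(period));

        string periodToken = $"{Ordinal(period)}_{unit}";
        string market = sport == SportKind.Football
            ? $"Result_-_{periodToken}"   // football: Result_-_1st_Half
            : $"{periodToken}_Result0";   // others: 1st_Set_Result0 (0 suffix required)
        return $"@{market}.{outcome}";
    }
}
```

The returned suffix is meant to be dropped into an attribute selector of the form [data-selection-key$='...'], in the same way as the match-scope selectors.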

Period count by sport (default mapping for Period-N):

  • Football: N ∈ {1, 2}
  • Basketball: configurable — halves (N ∈ {1,2}) or quarters (N ∈ {1,2,3,4}). Default to halves.
  • Tennis: N ∈ {1, 2, ...} until the <N>th_Set_Result0 selection is absent. Cap at 5 for Grand Slams.
  • Hockey: N ∈ {1, 2, 3}.

3.2 Period-N Win Fora

Same as match handicap, with period prefix:

| Sport | Selection key |
| --- | --- |
| Football | @To_Win_1st_Half_With_Handicap{N}.HB_H / .HB_A |
| Basketball | @To_Win_1st_Half_With_Handicap{N}.HB_* (or _1st_Quarter_) |
| Tennis | @To_Win_1st_Set_With_Handicap{N}.HB_* |
| Hockey | @To_Win_1st_Period_With_Handicap{N}.HB_* (TBD verify) |

Value extraction: same .middle-simple text as match handicap.

3.3 Period-N Total Less / More

This is the least uniform market. Observed:

| Sport | Period-1 Total selection key |
| --- | --- |
| Football | (search HTML directly; Phase 3 should parse the "Тотал тайма" (half total) tab) Likely @1st_Half_Total_Goals{N}.Under_<X> / .Over_<X>. |
| Basketball | Per-quarter total exposed as a separate market in the "Тоталы" (Totals) tab; the sample event did not show clean 1st_Half_Total_Points keys; see SCRAPE_FINDINGS.md §6 risk #4. May need to fall back to NULL for basketball Period-N Total in some leagues. |
| Tennis | @1st_Set_Total_Games{N}.Under_<X> / .Over_<X> (confirmed in sample). |
| Hockey | @1st_Period_Total_Goals... (TBD verify). |

Phase 3 robustness rule: if a period-N market is absent in the parsed HTML, emit null for the corresponding rate/value. Never throw. The Excel exporter writes empty cell.
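
In code, the "never throw" rule mostly means using Try-style parsing at the boundary where attribute text becomes a number. A minimal sketch, assuming the raw data-selection-price attribute value (or null when the element was not found) is passed in; SafeOdds is a hypothetical helper name.

```csharp
using System.Globalization;

public static class SafeOdds
{
    // Converts a raw data-selection-price value to decimal odds.
    // Returns null for a missing attribute or unparsable text instead of throwing.
    public static decimal? TryParseRate(string? raw) =>
        decimal.TryParse(raw, NumberStyles.Number, CultureInfo.InvariantCulture, out decimal rate)
            ? rate
            : (decimal?)null;
}
```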


4. Live Counterparts

When the same scope is captured from the live site (/su/live or live-flagged events on /su/), the spec wants column prefix Live_* instead of Bet_*.

Important: live events use the SAME data-selection-key naming conventions. The distinguishing signal is data-live="true" on the outer div.coupon-row and the URL the snapshot was scraped from (/su/live).

Examples:

  • Live_Match_Win_1 ← [data-selection-key$='@Match_Result.1'] from live page
  • Live_Match_Win_Fora_1_Value, Live_Match_Win_Fora_1_Rate ← same DOM, same logic
  • Live_Period-1_Win_1 ← same as Bet_Period-1_Win_1 but captured from live event

Implementation: the parser does not change. The application service simply records Source = Live | PreMatch on each OddsSnapshot and the Excel exporter denormalizes pre-match snapshots to Bet_* columns and live snapshots to Live_* columns at write time.


5. Field Coverage Matrix (spec → confidence)

| Field family | Football | Basketball | Tennis | Hockey | Notes |
| --- | --- | --- | --- | --- | --- |
| Match_Win_1/2, Match_Draw | confirmed | ⚠️ Win-1/2 confirmed; Draw conditional on Normal_Time_Result presence | Win-1/2 confirmed; Draw is null | verify Phase 3 | |
| Match_Win_Fora_* | | | (in games) | | "Main line" heuristic needed (§2.2) |
| Match_Total_* | Goals | Points | Games | | "Main line" heuristic needed (§2.3) |
| Period-1_Win_* | Half | Half / Quarter | Set | Period | basketball mode is configurable |
| Period-1_Win_Fora_* | | | | | |
| Period-1_Total_* | ⚠️ structure verified, exact key TBD | ⚠️ may be absent for some games | Set | | risk: emit null where absent |
| Period-2/3/4_* | (Period-2 only) | all | up to actual played sets | | |
| Live_* (any of above) | same parser | same | same | same | distinguished only by data-live flag + scrape URL |

Legend: confirmed in spike sample, ⚠️ partial / heuristic needed, Phase 3 must verify.


6. Suggested Domain Types (Phase 1 input)

// Marathon.Domain
public enum BetScope { Match, Period }
public enum BetType  { Win, Draw, WinFora, Total }
public enum BetSide  { Side1, Side2, Less, More } // Side1=home/W1, Side2=away/W2

public sealed record Sport(int Code, string NameRu, string NameEn);
public sealed record League(int TreeId, string NameRu, int SportCode);
public sealed record Event(
    long EventCode,             // marathonbet's data-event-eventId
    int  TreeId,                // for URL building
    int  SportCode,
    int  LeagueTreeId,
    string Country,             // breadcrumb position 3
    string? Category,            // joined breadcrumb 5..N-1
    string Team1,
    string Team2,
    DateTimeOffset ScheduledAt, // anchored on initData.serverTime
    string DetailUrl);

public sealed record Bet(
    BetScope Scope,
    int?     PeriodNumber,      // null when Scope=Match
    BetType  Type,
    BetSide? Side,              // null for Type=Draw
    decimal? Value,             // handicap/total threshold; null for Win/Draw
    decimal  Rate);             // decimal odds (e.g., 1.65)

public sealed record OddsSnapshot(
    long           EventCode,
    DateTimeOffset CapturedAt,
    SnapshotSource Source,       // PreMatch | Live
    IReadOnlyList<Bet> Bets);

public enum SnapshotSource { PreMatch, Live }

Phase 1 will refine names, but this captures the data shape Phase 3 produces.


7. Excel Column Generation (Phase 4 / 9 reference)

The Excel exporter generates wide rows by joining all Bets of an OddsSnapshot into named columns. Pseudocode:

foreach snapshot:
  row.EventCode         = snapshot.EventCode
  row.SportCode         = event.SportCode
  row.Sport             = sport.NameRu          // Sport looked up by event.SportCode
  row.Country           = event.Country
  row.League            = league.NameRu         // League looked up by event.LeagueTreeId
  row.Category          = event.Category
  row.ScheduledAt       = event.ScheduledAt
  prefix = snapshot.Source == PreMatch ? "Bet_" : "Live_"

  // Match scope
  row[prefix+"Match_Win_1"]                = bet.Where(scope=Match, type=Win, side=Side1).Rate
  row[prefix+"Match_Draw"]                 = bet.Where(scope=Match, type=Draw).Rate
  row[prefix+"Match_Win_2"]                = bet.Where(scope=Match, type=Win, side=Side2).Rate
  row[prefix+"Match_Win_Fora_1_Value"]     = bet.Where(scope=Match, type=WinFora, side=Side1).Value
  row[prefix+"Match_Win_Fora_1_Rate"]      = bet.Where(scope=Match, type=WinFora, side=Side1).Rate
  row[prefix+"Match_Win_Fora_2_Value"]     = bet.Where(scope=Match, type=WinFora, side=Side2).Value
  row[prefix+"Match_Win_Fora_2_Rate"]      = bet.Where(scope=Match, type=WinFora, side=Side2).Rate
  row[prefix+"Match_Total_Less_Value"]     = bet.Where(scope=Match, type=Total, side=Less).Value
  row[prefix+"Match_Total_Less_Rate"]      = bet.Where(scope=Match, type=Total, side=Less).Rate
  row[prefix+"Match_Total_More_Value"]     = bet.Where(scope=Match, type=Total, side=More).Value
  row[prefix+"Match_Total_More_Rate"]      = bet.Where(scope=Match, type=Total, side=More).Rate

  // Period scope (foreach period N exposed for that sport)
  for N in 1..MaxPeriodForSport(sportCode):
    same fields with key {prefix}Period-{N}_*
    null when bet absent
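
The pseudocode above could be made concrete by deriving each column name from the Bet itself rather than listing every column by hand. The sketch below redeclares trimmed copies of the §6 types so that it compiles on its own; ExcelColumns.BuildRow is a hypothetical helper, and representing a row as a dictionary (with absent keys meaning empty cells) is an assumption about how Phase 4 might consume it.

```csharp
using System.Collections.Generic;

public enum BetScope { Match, Period }
public enum BetType { Win, Draw, WinFora, Total }
public enum BetSide { Side1, Side2, Less, More }
public enum SnapshotSource { PreMatch, Live }

public sealed record Bet(
    BetScope Scope, int? PeriodNumber, BetType Type,
    BetSide? Side, decimal? Value, decimal Rate);

public static class ExcelColumns
{
    // Maps every bet of a snapshot to its spec column(s).
    // Columns with no matching bet are simply absent from the result.
    public static Dictionary<string, decimal?> BuildRow(SnapshotSource source, IEnumerable<Bet> bets)
    {
        string prefix = source == SnapshotSource.PreMatch ? "Bet_" : "Live_";
        var row = new Dictionary<string, decimal?>();

        foreach (Bet bet in bets)
        {
            string scope = bet.Scope == BetScope.Match ? "Match" : $"Period-{bet.PeriodNumber}";
            string stem = bet.Type switch
            {
                BetType.Win => $"Win_{(bet.Side == BetSide.Side1 ? 1 : 2)}",
                BetType.Draw => "Draw",
                BetType.WinFora => $"Win_Fora_{(bet.Side == BetSide.Side1 ? 1 : 2)}",
                _ => $"Total_{bet.Side}", // BetSide.Less or BetSide.More
            };
            string name = $"{prefix}{scope}_{stem}";

            if (bet.Type is BetType.WinFora or BetType.Total)
            {
                // Handicap and total bets fill a Value column and a Rate column.
                row[name + "_Value"] = bet.Value;
                row[name + "_Rate"] = bet.Rate;
            }
            else
            {
                // Win and Draw bets fill a single rate column.
                row[name] = bet.Rate;
            }
        }
        return row;
    }
}
```

Because names are derived from the enum values, the resulting keys match the spec's capitalization (for example Bet_Match_Win_Fora_1_Value) as long as the enum member names stay unchanged.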

Spec column order is left to Phase 4 (ExcelExporter). Recommend: Date, Time, Sport, Country, League, Category, Event, EventCode, Bet_Match_*..., Bet_Period-1_*..., Bet_Period-2_*..., Live_Match_*..., Live_Period-N_*...


8. Decisions Pending Customer Confirmation

  1. Basketball Period mapping — halves (default) or quarters? Spec says "Period-N" but is silent on which N applies. Recommend halves (N ∈ {1,2}) with a quarter mode opt-in via appsettings.Sports.Basketball.PeriodMode.
  2. Tennis Draw column — emit empty / 0 / "—"? Recommend empty cell.
  3. Handicap "main line" rule — pick the listing's main row, OR the no-suffix selection, OR the spread closest to bookmaker-implied probability 50/50?
  4. Total "main line" rule — same as above.
  5. Field name capitalization — spec uses Bet_Match_Win_Fora_1_Value exactly. Recommend matching exactly (case-sensitive) for compatibility with downstream pivot tables / scripts.