maraphon-app/plans/initial-implementation/CONTEXT.md
alexei.dolgolyov 070e34b911 feat(initial-implementation): phase 0 - scraping spike findings
Anonymous scraping confirmed feasible for marathonbet.by — site is fully SSR
(nginx), no Cloudflare or JS challenge. HttpClient + AngleSharp + Polly v8 is
sufficient; Playwright not required (kept as a future-flag).

Spike outputs:
- spike/SCRAPE_FINDINGS.md  — page rendering, URL templates, anti-bot, rate
  limits, recommended scraping strategy for Phase 3.
- spike/SCHEMA_DRAFT.md     — customer-spec field → DOM selector mapping for
  Match + Period-N scope across football/basketball/tennis (hockey TBD).

Phase 1+ handoff captured in subplan + CLAUDE.md. Critical Phase 8 finding:
no public results endpoint at /su/results — phase 8 must switch to polling
event-detail until eventJsonInfo.matchIsComplete=true (deviation flagged).

Reviewer notes addressed:
- Period market outcome codes corrected to RN_H/RN_D/RN_A (not 1/draw/3) and
  market name vocabulary clarified per-sport in SCHEMA_DRAFT §3.1.
- results-page.html capture added to file list with caveat about live-landing
  score-state and unsampled hockey selectors.
2026-05-05 01:04:03 +03:00

# Feature Context: Initial Implementation
## Configuration
- **Development mode:** Automated
- **Execution mode:** Orchestrator
- **Strategy:** Big Bang
- **Build:** `dotnet build Marathon.sln`
- **Test:** `dotnet test Marathon.sln`
- **Lint:** `dotnet format Marathon.sln --verify-no-changes`
- **Run:** `dotnet run --project src/Marathon.Hosts.WpfBlazor`
- **Implementer models:** Sonnet 4.6 (backend), Opus (frontend)
- **Reviewer model:** Sonnet 4.6
## Customer Constraints
- Source: marathonbet.by — anonymous scraping (no login). ToS risk acknowledged by customer.
- Output: Excel files matching customer's wide-column spec (`Bet_Match_Win_1`,
`Bet_Period-1_Win_Fora_2_Value`, etc.) with date-range filenames.
- Storage: customer accepted SQLite-with-Excel-export instead of Excel-as-database
(decided 2026-05-05).
- UI tech: Blazor Hybrid (changed from initial WPF assumption — better for web migration).
- Locale: RU + EN.
- Scope: analyze-only initially; design `IBetPlacer` extension point for future betting.
- Configurability: every variable parameter (polling, concurrency, retry, UA, retention,
thresholds, locale) goes in `appsettings.json` + Settings UI page.
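
As an illustration of the configurability constraint, the scraping parameters could be grouped into one options type that both `appsettings.json` and the Settings page read and write. This is a minimal sketch, not the project's actual type: the class name, the `Scraping` section name for all keys, and the poll-interval and user-agent properties are assumptions. Only `RequestsPerSecond`, `MaxConcurrent`, and `UsePlaywright` appear elsewhere in this file.

```csharp
using System;

// Hypothetical options type; it would be bound from a "Scraping" section of
// appsettings.json. Names not mentioned in this file are illustrative only.
public sealed class ScrapingOptions
{
    public const string SectionName = "Scraping";

    // Defaults taken from the Phase 0 recommendation.
    public double RequestsPerSecond { get; set; } = 1;
    public int MaxConcurrent { get; set; } = 4;
    public bool UsePlaywright { get; set; } = false;

    // ASSUMPTION: property names and the 10 s live default are placeholders.
    public TimeSpan PreMatchPollInterval { get; set; } = TimeSpan.FromSeconds(30);
    public TimeSpan LivePollInterval { get; set; } = TimeSpan.FromSeconds(10);
    public string UserAgent { get; set; } = "";

    // Minimum delay between request starts implied by the rate cap.
    public TimeSpan MinRequestSpacing => TimeSpan.FromSeconds(1.0 / RequestsPerSecond);

    // Rejects values that would stall or flood the scraper.
    public void Validate()
    {
        if (RequestsPerSecond <= 0)
            throw new ArgumentOutOfRangeException(nameof(RequestsPerSecond), "Must be positive.");
        if (MaxConcurrent < 1)
            throw new ArgumentOutOfRangeException(nameof(MaxConcurrent), "Must be at least 1.");
        if (PreMatchPollInterval <= TimeSpan.Zero || LivePollInterval <= TimeSpan.Zero)
            throw new ArgumentOutOfRangeException(nameof(LivePollInterval), "Poll intervals must be positive.");
    }
}
```

Calling `Validate()` both at startup and when the Settings page saves would keep the two entry points consistent.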
## Current State
Repo just initialized. Single `main` commit with `.gitignore` + `README.md` + `CLAUDE.md`.
Working on `feature/initial-implementation` branch. No source code yet — Phase 0 starts
with scraping research, no implementation.
## Temporary Workarounds
(none yet)
## Cross-Phase Dependencies
- **Phase 1 (Domain)** is the foundation; all later phases reference domain types.
- **Phase 2 (Storage)** & **Phase 3 (Scraping)** depend only on Phase 1 — can run in parallel.
- **Phase 4 (Application + Workers)** depends on Phase 2 + Phase 3.
- **Phase 5 (UI Shell)** depends on Phase 1 only — can run in parallel with 2/3.
- **Phase 6 (Event Browsing UI)** depends on Phase 4 + Phase 5.
- **Phase 7 (Anomaly)** depends on Phase 4 (snapshot storage) + Phase 6 (UI patterns).
- **Phase 8 (Results)** depends on Phase 6.
- **Phase 9 (Packaging)** is final — runs full build + test suite.
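
The dependency list above implies an execution order in "waves" of phases that can run in parallel. The sketch below derives those waves from the listed edges. The type and method names are hypothetical, and Phase 9's edges (7 and 8) are an assumption, since the list only says that Phase 9 is final.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PhasePlan
{
    // Phase -> phases it depends on, transcribed from the list above.
    // ASSUMPTION: Phase 9 is modelled as depending on Phases 7 and 8.
    public static readonly IReadOnlyDictionary<int, int[]> Dependencies =
        new Dictionary<int, int[]>
        {
            [1] = Array.Empty<int>(),
            [2] = new[] { 1 },
            [3] = new[] { 1 },
            [4] = new[] { 2, 3 },
            [5] = new[] { 1 },
            [6] = new[] { 4, 5 },
            [7] = new[] { 4, 6 },
            [8] = new[] { 6 },
            [9] = new[] { 7, 8 },
        };

    // Groups phases into waves: every phase in a wave has all of its
    // dependencies satisfied by earlier waves, so a wave can run in parallel.
    public static List<int[]> Waves(IReadOnlyDictionary<int, int[]> deps)
    {
        var done = new HashSet<int>();
        var waves = new List<int[]>();
        while (done.Count < deps.Count)
        {
            var ready = deps
                .Where(kv => !done.Contains(kv.Key) && kv.Value.All(d => done.Contains(d)))
                .Select(kv => kv.Key)
                .OrderBy(p => p)
                .ToArray();
            if (ready.Length == 0)
                throw new InvalidOperationException("Dependency cycle detected.");
            waves.Add(ready);
            done.UnionWith(ready);
        }
        return waves;
    }
}
```

With these edges, `PhasePlan.Waves(PhasePlan.Dependencies)` yields `[1]`, `[2, 3, 5]`, `[4]`, `[6]`, `[7, 8]`, `[9]`, which matches the parallel grouping of Phases 2, 3, and 5.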
## Deferred Work
- Bet placing (explicit out-of-scope; design extension point only).
- Authenticated scraping (anonymous now; `IOddsScraper` impl is swappable).
- Multi-bookmaker support (only marathonbet.by; abstraction allows future expansion).
- PostgreSQL backend (SQLite for now; `IRepository<T>` abstraction allows swap).
## Failed Approaches
- **Public results / archive endpoint** — does NOT exist. Tested
`https://www.marathonbet.by/su/results`, `/su/results/`, and `/su/results.htm`;
all three return HTTP 404. The public HTML contains no `/archive` or `/history`
links either. **Phase 8 deviation:** the Results loader cannot back-fill from
an archive — it must poll each event detail page until
`eventJsonInfo.matchIsComplete=true` and snapshot `resultDescription` at that
moment. Phase 8 implementer must revise the subplan accordingly.
- **JSONP `/su/liveupdate/popular/` endpoint** — exposes only refresh signals
(`{"modified":[{"type":"refreshPage"}],"updated":<ts>}`), not actual odds. Cannot
be used as a JSON odds source. Use it only as a "something changed" hint to
trigger a full event-detail re-scrape.
- **Anonymous WebSocket (STOMP)** at `/su/websocket/endpoint` is documented in
`initData.stomp` but appears to require an authenticated session
(`PUNTER-SESSION-HASH` cookie); we did not test it but the customer's anonymous
scraping constraint makes it unsuitable anyway.
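
The two JSON signals described above (the live-update refresh hint and the completion flag in `eventJsonInfo`) could be read roughly as follows. This is a sketch under assumptions: the class and method names are hypothetical, the JSONP wrapper is assumed to be `callback({...})`, and `eventJsonInfo` is assumed to be a flat JSON object with a boolean `matchIsComplete` and a string `resultDescription`.

```csharp
#nullable enable
using System;
using System.Text.Json;

public static class EventSignals
{
    // True when the liveupdate payload contains a {"type":"refreshPage"} entry.
    // Treat it only as a hint to re-scrape the event detail page.
    public static bool IsRefreshHint(string payload)
    {
        // ASSUMPTION: any JSONP wrapper surrounds a single JSON object,
        // so cutting to the outermost braces recovers it.
        int start = payload.IndexOf('{');
        int end = payload.LastIndexOf('}');
        if (start < 0 || end <= start) return false;
        try
        {
            using var doc = JsonDocument.Parse(payload.Substring(start, end - start + 1));
            if (!doc.RootElement.TryGetProperty("modified", out var modified)
                || modified.ValueKind != JsonValueKind.Array) return false;
            foreach (var item in modified.EnumerateArray())
            {
                if (item.ValueKind == JsonValueKind.Object
                    && item.TryGetProperty("type", out var type)
                    && type.ValueKind == JsonValueKind.String
                    && type.GetString() == "refreshPage") return true;
            }
            return false;
        }
        catch (JsonException) { return false; } // defensive: never throw on bad input
    }

    // Returns resultDescription once matchIsComplete is true; null while the
    // match is still running, or when the JSON is missing or malformed.
    public static string? TryGetFinalResult(string eventJsonInfo)
    {
        try
        {
            using var doc = JsonDocument.Parse(eventJsonInfo);
            var root = doc.RootElement;
            if (root.ValueKind != JsonValueKind.Object) return null;
            if (!root.TryGetProperty("matchIsComplete", out var complete)
                || complete.ValueKind != JsonValueKind.True) return null;
            return root.TryGetProperty("resultDescription", out var desc)
                   && desc.ValueKind == JsonValueKind.String
                ? desc.GetString()
                : null;
        }
        catch (JsonException) { return null; }
    }
}
```

A Phase 8 poller would call `TryGetFinalResult` on each event-detail scrape and stop polling that event once it returns a non-null value.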
## Review Findings Log
(populated by reviewers)
## Phase Execution Log
| Phase | Agent | Model | Test Writer | Parallel | Notes |
|---|---|---|---|---|---|
| Phase 0 | phase-implementer | Opus | ⏭️ Skipped (research only) | — | ✅ Done 2026-05-05. Outputs: spike/SCRAPE_FINDINGS.md + spike/SCHEMA_DRAFT.md + 7 local fixtures. Anonymous scraping confirmed feasible; HttpClient+AngleSharp recommended; no Playwright needed; no public results page found (Phase 8 deviation noted). |
| Phase 1 | phase-implementer | Sonnet 4.6 | ⏭️ Skipped (Big Bang) | — | — |
| Phase 2 | phase-implementer | Sonnet 4.6 | ⏭️ Skipped (Big Bang) | ✅ With 3 + 5 | — |
| Phase 3 | phase-implementer | Sonnet 4.6 | ⏭️ Skipped (Big Bang) | ✅ With 2 + 5 | — |
| Phase 4 | phase-implementer | Sonnet 4.6 | ⏭️ Skipped (Big Bang) | — | — |
| Phase 5 | phase-implementer-frontend | Opus | ⏭️ Skipped (Big Bang) | ✅ With 2 + 3 | Uses frontend-design skill |
| Phase 6 | phase-implementer-frontend | Opus | ⏭️ Skipped (Big Bang) | — | Uses frontend-design skill |
| Phase 7 | phase-implementer (split if needed) | Sonnet/Opus | ⏭️ Skipped (Big Bang) | — | UI portion uses Opus |
| Phase 8 | phase-implementer (split if needed) | Sonnet/Opus | ⏭️ Skipped (Big Bang) | — | UI portion uses Opus |
| Phase 9 | phase-implementer | Sonnet 4.6 | ✅ Final phase tests | — | Full build + test enforced |
## Environment & Runtime Notes
- Windows 10, PowerShell 5.1 default shell, Bash also available.
- `git` configured globally; remote `origin` = `https://git.dolgolyov-family.by/alexei.dolgolyov/maraphon-app.git`.
- Note: home directory (`C:\Users\Alexei`) is itself a git repo (likely accidental).
The maraphon-app local `.git` overrides it for this directory tree.
- .NET SDK assumed installed; if Phase 1 fails on `dotnet --version`, install the SDK
or document the blocker in CONTEXT.md.
## Implementation Notes
### Phase 0 (Scraping spike, 2026-05-05)
- **Anonymous scraping is feasible** from a non-Belarus IP. No Cloudflare, no JS
challenge, no UA filtering observed. `Server: nginx`. Standard cookies only.
- **Site is fully SSR.** All needed data (event grid, full odds, breadcrumbs,
period markets) is in the raw HTML. No SPA hydration required.
- **Recommended scraper stack: HttpClient + AngleSharp + Polly v8.** Playwright is
not required for read-only scraping — keep it as an optional fallback flag
(`Scraping:UsePlaywright`) for future-proofing only.
- **Polling cadence:** site itself polls live updates every 3 s; for our analyzer,
polling pre-match events every 30 s and live events every 5-10 s is sufficient.
- **Rate-limit:** 5 sequential requests at 1 req/s pacing all returned 200 in <1 s,
no throttling. Recommend default `RequestsPerSecond=1`, `MaxConcurrent=4`.
- **Sport ID semantics:** customer's "Sport_Code = 6" (Basketball) maps to
`data-sport-treeId="6"` in the breadcrumb-canonical sport listing
(`/su/betting/Basketball+-+6`). Some sports also have a separate "category tree
ID" used inside the live grouping (e.g., 45356 for Basketball-live) — ignore
those, use only the canonical breadcrumb ID.
- **Selection key format:** `<eventId>@<MarketName>{LineIndex?}.<Outcome>`. The
market name is sport-specific (`Match_Result`, `1st_Half_Result`, `Total_Goals`,
`Total_Points`, `Total_Games`, `To_Win_Match_With_Handicap`, etc.). Total
thresholds are encoded in the outcome (`Under_3.5`, `Over_213.5`). Handicap
values are NOT in the key — they're in `<span class="middle-simple">` text.
- **Tennis has no Draw outcome** — domain `Bet_Match_Draw` must be nullable.
- **Date display ambiguity:** listing shows `HH:MM` (today) or `DD <ru-month> HH:MM`
(future). Anchor the parser on `initData.serverTime` (Moscow TZ, format
`YYYY,MM,DD,HH,MM,SS`).
- **No public results page** (`/su/results` → 404). Final scores are exposed only
on the event detail page itself via `eventJsonInfo` JSON
(`matchIsComplete`, `resultDescription`). Phase 8 must poll until completion;
cannot back-fill from an archive endpoint.
- **Probe environment:** Windows 10 + curl, geo-routed as Poland (`countryCode: PL`).
Customer in Belarus may see slightly different KYC overlays — parser must be
defensive (treat missing markets as null, never throw).
- **Captures saved locally** at `spike/captures/*.html` (gitignored): 7 fixtures
for offline parser development in Phase 3.
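
The selection key format noted above can be split with a small defensive parser. This is a minimal sketch: the type and method names are hypothetical, and the optional line index is left attached to the market token because its exact encoding is not specified in these notes. Handicap values are not handled, since they are not part of the key.

```csharp
#nullable enable
using System;
using System.Globalization;
using System.Text.RegularExpressions;

// Parsed form of "<eventId>@<MarketName>{LineIndex?}.<Outcome>".
public sealed record SelectionKey(
    string EventId, string Market, string Outcome, decimal? TotalThreshold);

public static class SelectionKeyParser
{
    // The market token ends at the FIRST dot, because outcomes such as
    // "Under_3.5" contain dots of their own.
    private static readonly Regex KeyPattern =
        new(@"^(?<event>[^@]+)@(?<market>[^.]+)\.(?<outcome>.+)$", RegexOptions.CultureInvariant);

    // Total thresholds are encoded in the outcome, e.g. "Over_213.5".
    private static readonly Regex TotalPattern =
        new(@"^(?:Under|Over)_(?<value>\d+(?:\.\d+)?)$", RegexOptions.CultureInvariant);

    // Returns null instead of throwing, in line with the "never throw" note.
    public static SelectionKey? TryParse(string? raw)
    {
        if (string.IsNullOrWhiteSpace(raw)) return null;
        var m = KeyPattern.Match(raw.Trim());
        if (!m.Success) return null;

        var outcome = m.Groups["outcome"].Value;
        decimal? threshold = null;
        var t = TotalPattern.Match(outcome);
        if (t.Success && decimal.TryParse(t.Groups["value"].Value,
                NumberStyles.Number, CultureInfo.InvariantCulture, out var value))
        {
            threshold = value;
        }
        return new SelectionKey(m.Groups["event"].Value, m.Groups["market"].Value, outcome, threshold);
    }
}
```

For example, `SelectionKeyParser.TryParse("123@Total_Goals.Under_3.5")` gives market `Total_Goals`, outcome `Under_3.5`, and threshold `3.5`; the event ID here is made up for illustration.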
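
The date-display note above can be made concrete by resolving listing text against `initData.serverTime`. The sketch below is hypothetical throughout (names included). It assumes Russian month tokens can be matched by prefix, so that both abbreviated and full genitive forms work. It also exposes a flag for the month base of `serverTime`, because these notes do not say whether `MM` is 1-based or, as in a JavaScript `Date` constructor, 0-based.

```csharp
#nullable enable
using System;
using System.Globalization;

public static class ListingDateParser
{
    // ASSUMPTION: month tokens start with these prefixes ("ма" covers май/мая
    // and is checked after "мар", so March is never mistaken for May).
    private static readonly string[] MonthPrefixes =
        { "янв", "фев", "мар", "апр", "ма", "июн", "июл", "авг", "сен", "окт", "ноя", "дек" };

    // Parses "YYYY,MM,DD,HH,MM,SS" into a server-local (Moscow TZ) DateTime.
    public static DateTime ParseServerTime(string raw, bool monthIsZeroBased = false)
    {
        var parts = raw.Split(',');
        if (parts.Length != 6) throw new FormatException("Expected six comma-separated fields.");
        var n = new int[6];
        for (int i = 0; i < 6; i++) n[i] = int.Parse(parts[i].Trim(), CultureInfo.InvariantCulture);
        return new DateTime(n[0], n[1] + (monthIsZeroBased ? 1 : 0), n[2], n[3], n[4], n[5]);
    }

    // Resolves "HH:MM" (today) or "DD <ru-month> HH:MM" (future) to a full
    // server-local DateTime; returns null for anything it cannot read.
    public static DateTime? TryResolve(string? listingText, DateTime serverNow)
    {
        if (string.IsNullOrWhiteSpace(listingText)) return null;
        var inv = CultureInfo.InvariantCulture;
        var tokens = listingText.Split(new[] { ' ', '\u00A0' }, StringSplitOptions.RemoveEmptyEntries);

        var hm = tokens[^1].Split(':');
        if (hm.Length != 2
            || !int.TryParse(hm[0], NumberStyles.None, inv, out var hour)
            || !int.TryParse(hm[1], NumberStyles.None, inv, out var minute)
            || hour > 23 || minute > 59) return null;

        if (tokens.Length == 1) return serverNow.Date.AddHours(hour).AddMinutes(minute);
        if (tokens.Length != 3 || !int.TryParse(tokens[0], NumberStyles.None, inv, out var day)) return null;

        int month = MonthFromToken(tokens[1]);
        if (month == 0) return null;

        // The listing omits the year: take the first date that is not in the past.
        for (int year = serverNow.Year; year <= serverNow.Year + 1; year++)
        {
            if (day < 1 || day > DateTime.DaysInMonth(year, month)) continue;
            var candidate = new DateTime(year, month, day, hour, minute, 0);
            if (candidate.Date >= serverNow.Date) return candidate;
        }
        return null;
    }

    private static int MonthFromToken(string token)
    {
        var lower = token.ToLowerInvariant();
        for (int i = 0; i < MonthPrefixes.Length; i++)
            if (lower.StartsWith(MonthPrefixes[i], StringComparison.Ordinal)) return i + 1;
        return 0; // unknown month token
    }
}
```

Rolling a past date into the next year covers listings viewed in late December for January events; whether the site ever lists events that far ahead was not checked in the spike.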