maraphon-app/plans/initial-implementation/phase-3-scraping.md
alexei.dolgolyov e4d8476782 WIP(initial-implementation): parallel batch P2/P3/P5 — code complete, unreviewed
Snapshot of the parallel batch (Phases 2 + 3 + 5) at session pause. Solution does
NOT build cleanly yet — known cross-phase compile issues remain to be resolved
before review. See plans/initial-implementation/PLAN.md "Resume Notes" section
for the exact tomorrow-morning action list.

Phase 2 (Storage):
- Repository interfaces in Marathon.Application/Abstractions
- DateRange, ExportKind, StorageOptions in Marathon.Application/Storage
- EF Core 8 + SQLite (WAL) persistence: 7 entities + configurations + 4 repos
- Hand-written InitialCreate migration (dotnet ef blocked by parallel work)
- ClosedXML ExcelExporter with exact customer-spec wide columns
- PersistenceModule.AddMarathonPersistence DI extension
- Round-trip + export tests (cannot run yet — see cross-phase issues)

Phase 3 (Scraping):
- IOddsScraper, IBetPlacer in Marathon.Application/Abstractions
- ScrapingOptions in Marathon.Infrastructure/Configuration
- MarathonbetScraper with 4 parsers (Upcoming, Live, EventOdds, Results)
- Helpers: ServerTimeProvider, PeriodScopeMapper, OutcomeCodeMapper, MoscowDateParser
- UserAgentRotatorHandler + Polly v8 resilience pipeline
- ScrapingModule.AddMarathonScraping DI extension
- GlobalUsings.cs aliases for EventId / Configuration disambiguation
- Parser tests with trimmed HTML fixtures
- ScrapeResultsAsync interim no-op (Phase 8 will replace via watch-list polling)

Phase 5 (UI shell — killed mid-final-verify, assumed ~95%):
- Marathon.UI populated: MainLayout, App.razor, Pages (Home, Settings),
  Components, Theme (MarathonTheme.cs + Tokens.cs + app.css), Resources
  (SharedResource.{cs,ru.resx,en.resx}), Services (ISettingsWriter), wwwroot
- WPF host: App.xaml(.cs), MainWindow.xaml(.cs), Marathon.Hosts.WpfBlazor.csproj
  with Microsoft.AspNetCore.Components.WebView.Wpf + MudBlazor + Serilog
- appsettings.json + appsettings.Development.json with all sections wired
- bUnit tests: MainLayoutTests, LocaleSwitcherTests, ThemeToggleTests,
  JsonSettingsWriterTests + Support helpers

Cross-phase issues to resolve at next session:
1. Phase 2 repository classes are 'internal' — Phase 3's tests can't reference
   them. Fix: add InternalsVisibleTo to Marathon.Infrastructure.csproj.
2. Phase 5: LocalizationOptions namespace ambiguity (AspNetCore vs Extensions).
3. Phase 5: WpfBlazor Serilog API mismatch.

Reviewer has NOT run on this batch. Move to Phase 4 only after build is green
and a combined parallel-batch reviewer passes.
2026-05-05 01:56:53 +03:00


Phase 3: Infrastructure — Scraping

Status: Done
Parent plan: PLAN.md
Domain: backend

Objective

Implement the scraping pipeline: HttpClient + AngleSharp for HTML pages with a Playwright fallback for JS-rendered content, all wrapped in resilient policies (retry, circuit breaker, rate limiter). All parsing logic is informed by Phase 0's SCRAPE_FINDINGS.md and SCHEMA_DRAFT.md.

Tasks

  • Read spike/SCRAPE_FINDINGS.md and spike/SCHEMA_DRAFT.md from Phase 0 to determine which strategy applies (HTML / Playwright / hybrid).
  • Add packages:
    • AngleSharp
    • Microsoft.Extensions.Http
    • Microsoft.Extensions.Http.Resilience (Polly v8 wrapper)
    • Microsoft.Playwright (only if Phase 0 decided Playwright is needed)
  • Define abstractions in Marathon.Application/Abstractions/:
    • IOddsScraper:
      • Task<IReadOnlyList<Event>> ScrapeUpcomingAsync(SportCode? filter, CancellationToken ct)
      • Task<OddsSnapshot> ScrapeEventOddsAsync(EventId id, OddsSource source, CancellationToken ct)
      • Task<IReadOnlyList<EventResult>> ScrapeResultsAsync(DateRange range, CancellationToken ct)
    • IBetPlacer — empty marker interface for future betting feature (extension point)
  • Implement Marathon.Infrastructure/Scraping/MarathonbetScraper.cs:
    • Composes parsers + HttpClient + (optional) Playwright per Phase 0 strategy
    • Constructor takes IHttpClientFactory, IOptions<ScrapingOptions>, ILogger
    • Methods correspond to IOddsScraper interface
  • Implement parsers in Marathon.Infrastructure/Scraping/Parsers/:
    • UpcomingEventsParser — parses listing page → IReadOnlyList<Event>
    • LiveEventsParser — parses live listing → IReadOnlyList<Event>
    • EventOddsParser — parses event detail page → OddsSnapshot (handles all bet types in spec: Win/Draw/WinFora/Total at Match + Period-N scope)
    • ResultsParser — parses completed events → IReadOnlyList<EventResult>
    • Each parser is unit-testable: takes string html (or IDocument), returns domain types
  • ScrapingOptions POCO bound to appsettings.json Scraping:* section:
    public sealed class ScrapingOptions {
      public int PollingIntervalSeconds { get; init; } = 30;
      public int MaxConcurrentRequests { get; init; } = 4;
      public string[] UserAgents { get; init; } = Array.Empty<string>();
      public RetryPolicyOptions RetryPolicy { get; init; } = new();
      public RateLimitOptions RateLimit { get; init; } = new();
      public bool EnablePlaywrightFallback { get; init; } = false;
      public string BaseUrl { get; init; } = "https://www.marathonbet.by";
    }
    
  • Configure named HttpClient "marathonbet" in DI with:
    • BaseAddress = Scraping:BaseUrl
    • User-Agent rotation via DelegatingHandler (UserAgentRotatorHandler)
    • Polly resilience (AddResilienceHandler from Microsoft.Extensions.Http.Resilience):
      • Retry: exponential backoff, max attempts from config
      • Circuit breaker: 5 failures → 30s open
      • Rate limiter: token bucket (configurable RPS)
      • Timeout: per-request from config
  • (Optional, if Phase 0 needs it) Implement PlaywrightScraper for SPA-rendered pages — used as fallback if HTML scraping detects empty/dynamic content.
  • Add DI registration in Marathon.Infrastructure/DependencyInjection.cs:
    • services.AddOptions<ScrapingOptions>().Bind(config.GetSection("Scraping"))
    • services.AddHttpClient("marathonbet").AddResilienceHandler(...)
    • services.AddSingleton<IOddsScraper, MarathonbetScraper>()
    • services.AddTransient<UserAgentRotatorHandler>() (a DelegatingHandler used with IHttpClientFactory must be registered as transient, not singleton)
  • Add appsettings.json template under src/Marathon.Hosts.WpfBlazor/appsettings.json (will move when host phase runs):
    {
      "Scraping": {
        "PollingIntervalSeconds": 30,
        "MaxConcurrentRequests": 4,
        "UserAgents": [
          "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ..."
        ],
        "RetryPolicy": { "MaxAttempts": 3, "BaseDelayMs": 500 },
        "RateLimit": { "RequestsPerSecond": 1 },
        "EnablePlaywrightFallback": false,
        "BaseUrl": "https://www.marathonbet.by"
      }
    }
    
  • Tests in Marathon.Infrastructure.Tests/Scraping/:
    • Use recorded HTML fixtures committed under tests/Marathon.Infrastructure.Tests/Fixtures/marathonbet/*.html (small samples only); copy them from spike/captures/ where appropriate
    • Test each parser produces expected domain output for the fixtures
    • Test MarathonbetScraper handles network errors gracefully (Polly mock)
    • DO NOT make real network calls in tests
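
The ScrapingOptions class in the task list references RetryPolicyOptions and RateLimitOptions without defining them. A minimal sketch of those two types follows. It assumes the property names match the keys in the appsettings.json template (MaxAttempts, BaseDelayMs, RequestsPerSecond) and that the defaults mirror the template values; the actual classes in the repository may differ.

```csharp
// Sketch only: names are taken from the appsettings.json template,
// defaults are an assumption that mirrors the template values.
public sealed class RetryPolicyOptions
{
    // Binds to Scraping:RetryPolicy:MaxAttempts
    public int MaxAttempts { get; init; } = 3;

    // Binds to Scraping:RetryPolicy:BaseDelayMs (base delay for exponential backoff)
    public int BaseDelayMs { get; init; } = 500;
}

public sealed class RateLimitOptions
{
    // Binds to Scraping:RateLimit:RequestsPerSecond
    public int RequestsPerSecond { get; init; } = 1;
}
```

Because ScrapingOptions initializes both properties with new(), a missing RetryPolicy or RateLimit section in the configuration would fall back to these defaults rather than leave a null reference.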

Files to Modify/Create

  • src/Marathon.Application/Abstractions/IOddsScraper.cs
  • src/Marathon.Application/Abstractions/IBetPlacer.cs (marker interface)
  • src/Marathon.Infrastructure/Scraping/MarathonbetScraper.cs
  • src/Marathon.Infrastructure/Scraping/Parsers/*.cs — 4 parsers
  • src/Marathon.Infrastructure/Scraping/UserAgentRotatorHandler.cs
  • src/Marathon.Infrastructure/Scraping/Playwright/PlaywrightScraper.cs (conditional)
  • src/Marathon.Infrastructure/Configuration/ScrapingOptions.cs
  • tests/Marathon.Infrastructure.Tests/Scraping/Parsers/*Tests.cs
  • tests/Marathon.Infrastructure.Tests/Fixtures/marathonbet/*.html

Acceptance Criteria

  • Compiles (Big Bang).
  • All parser logic is unit-testable without network.
  • IOddsScraper is the only public surface used by Application layer.
  • appsettings.json template covers every variable parameter.
  • IBetPlacer exists as a future-proof extension point.

Notes

  • This phase is parallelizable with Phase 2 — disjoint files.
  • DO NOT hammer marathonbet.by — tests use local fixtures.
  • If Phase 0 found that scraping requires headless browser only, skip the AngleSharp parsers and implement Playwright-only.
  • Big Bang: compile-only smoke check after this phase; tests deferred to Phase 9.

Review Checklist

  • Compiles
  • Parser interface is clean (string html → domain types)
  • All Scraping:* config keys are wired through ScrapingOptions
  • No real network calls in tests

Review Checklist (filled)

  • Compiles (dotnet build src/Marathon.Infrastructure — 0 errors)
  • Parser interface is clean (string html → domain types)
  • All Scraping:* config keys are wired through ScrapingOptions
  • No real network calls in tests (all tests use local HTML fixtures)

Handoff to Next Phase

For Phase 4 (Application + Workers)

DependencyInjection.cs must call ScrapingModule.AddMarathonScraping(services, config) to wire all scraping services. Do not place that call inside ScrapingModule itself; that would create circular coupling.

IOddsScraper.ScrapeResultsAsync is a no-op (returns empty list + logs a warning). Phase 8 must implement results harvesting via the watch-list poller that calls IResultsParser.ParseAsync on individual event-detail pages.

IOddsScraper.ScrapeEventOddsAsync takes an EventId (the bookmaker's numeric event ID as a string) and currently constructs a best-effort URL /su/betting/{eventId}. Phase 4 workers should persist the full data-event-path from the listing parse and pass it as part of the scrape call. A TODO comment marks this location in MarathonbetScraper.cs.
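
The fallback described above can be sketched as a small helper that prefers a stored listing path and falls back to the best-effort /su/betting/{eventId} form. EventUrlSketch and BuildEventPath are hypothetical names, and the sketch assumes the stored data-event-path value is relative to /su/betting/, which the source does not confirm.

```csharp
using System;

// Hypothetical helper, not the code in MarathonbetScraper.cs.
public static class EventUrlSketch
{
    private const string BettingRoot = "/su/betting/";

    // eventPath: the data-event-path captured by the listing parser, if one was persisted.
    // ASSUMPTION: that value is relative to /su/betting/.
    public static string BuildEventPath(string eventId, string? eventPath = null)
    {
        if (!string.IsNullOrWhiteSpace(eventPath))
        {
            // Preferred: use the full path captured from the listing page.
            return BettingRoot + eventPath.Trim('/');
        }

        if (string.IsNullOrWhiteSpace(eventId))
        {
            throw new ArgumentException(
                "Event ID is required when no listing path is stored.", nameof(eventId));
        }

        // Best-effort fallback built from the numeric ID alone.
        return BettingRoot + Uri.EscapeDataString(eventId.Trim());
    }
}
```

With only an ID, BuildEventPath("12345") yields /su/betting/12345; once Phase 4 persists the listing path, passing it as the second argument takes precedence over the ID-based form.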

Basketball period mode defaults to halves (Period-1, Period-2). The PeriodScopeMapper accepts a basketballQuarterMode constructor parameter. Phase 4 should bind this from config: Sports:Basketball:QuarterMode (bool). A TODO comment is present in ScrapingModule.cs.

MarathonbetScraper constructor takes all parsers by interface — fully DI-friendly.

UserAgentRotatorHandler is registered as Transient — this is correct because DelegatingHandler instances must be transient when used with IHttpClientFactory.
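
For orientation, a rotating handler of this kind could look like the sketch below. It is not the repository's UserAgentRotatorHandler: the class names, the round-robin strategy, and the constructor taking a plain list (rather than IOptions<ScrapingOptions>) are assumptions made to keep the example self-contained. CapturingHandler is a stand-in for the network so the sketch runs offline.

```csharp
using System.Collections.Generic;
using System.Net;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Sketch of a round-robin User-Agent rotator (hypothetical, not the project's class).
public sealed class UserAgentRotatorSketch : DelegatingHandler
{
    private readonly IReadOnlyList<string> _userAgents;
    private int _counter = -1; // first Increment yields index 0

    public UserAgentRotatorSketch(IReadOnlyList<string> userAgents) => _userAgents = userAgents;

    protected override Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        if (_userAgents.Count > 0)
        {
            // Thread-safe round-robin; the uint cast keeps the index valid after int overflow.
            int next = Interlocked.Increment(ref _counter);
            int index = (int)((uint)next % (uint)_userAgents.Count);

            request.Headers.Remove("User-Agent");
            request.Headers.TryAddWithoutValidation("User-Agent", _userAgents[index]);
        }

        return base.SendAsync(request, cancellationToken);
    }
}

// Offline stand-in for the real transport: records the header and returns 200.
public sealed class CapturingHandler : HttpMessageHandler
{
    public List<string> Seen { get; } = new();

    protected override Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        Seen.Add(string.Join(" ", request.Headers.GetValues("User-Agent")));
        return Task.FromResult(new HttpResponseMessage(HttpStatusCode.OK));
    }
}
```

The handler holds a reference to its InnerHandler, and IHttpClientFactory builds a fresh handler chain each time it rotates handlers. A shared singleton instance would therefore be re-used across chains, which is why the transient lifetime is required.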

Named HttpClient "marathonbet" is registered. Resilience pipeline:

  1. Timeout (per-attempt)
  2. Retry (exp backoff + jitter, configurable MaxAttempts + BaseDelayMs)
  3. Circuit Breaker (5 failures / 30s window → 30s break)
  4. Rate Limiter (token bucket, configurable RequestsPerSecond)
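
A hedged sketch of how such a pipeline can be wired with AddResilienceHandler is shown below; it is not the code in ScrapingModule.cs. The helper name, the pipeline name, the 10-second timeout, the failure ratio, and the queue limit are assumptions, as is the registration order: Polly runs strategies in the order they are added (first added is outermost), so the sketch adds the timeout last to make it apply per attempt. It requires the Microsoft.Extensions.Http.Resilience package.

```csharp
using System;
using System.Net.Http;
using System.Threading.RateLimiting;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Http.Resilience;
using Polly;

public static class ResilienceWiringSketch
{
    // Hypothetical helper; parameter values would come from ScrapingOptions.
    public static void AddMarathonbetClientSketch(
        this IServiceCollection services,
        string baseUrl, int maxAttempts, int baseDelayMs, int requestsPerSecond)
    {
        services
            .AddHttpClient("marathonbet", client => client.BaseAddress = new Uri(baseUrl))
            .AddResilienceHandler("marathonbet-resilience", pipeline =>
            {
                // Retry with exponential backoff and jitter.
                // ASSUMPTION: MaxAttempts maps directly onto MaxRetryAttempts.
                pipeline.AddRetry(new HttpRetryStrategyOptions
                {
                    MaxRetryAttempts = maxAttempts,
                    Delay = TimeSpan.FromMilliseconds(baseDelayMs),
                    BackoffType = DelayBackoffType.Exponential,
                    UseJitter = true
                });

                // Polly v8 breakers are ratio-based: at least 5 calls in a 30 s
                // window, then a 30 s break. FailureRatio = 1.0 is an assumption.
                pipeline.AddCircuitBreaker(new HttpCircuitBreakerStrategyOptions
                {
                    MinimumThroughput = 5,
                    SamplingDuration = TimeSpan.FromSeconds(30),
                    BreakDuration = TimeSpan.FromSeconds(30),
                    FailureRatio = 1.0
                });

                // Token bucket: every attempt, including retries, consumes a token.
                pipeline.AddRateLimiter(new TokenBucketRateLimiter(new TokenBucketRateLimiterOptions
                {
                    TokenLimit = requestsPerSecond,
                    TokensPerPeriod = requestsPerSecond,
                    ReplenishmentPeriod = TimeSpan.FromSeconds(1),
                    QueueLimit = 64,          // ASSUMPTION: queue instead of rejecting
                    AutoReplenishment = true
                }));

                // Innermost strategy, so the timeout applies to each attempt.
                pipeline.AddTimeout(TimeSpan.FromSeconds(10)); // ASSUMPTION: value
            });
    }
}
```

Placing the rate limiter inside the retry means retried attempts are throttled too, which matches the "do not hammer marathonbet.by" note; placing it outside would only throttle the initial calls.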

appsettings.scraping.sample.json in src/Marathon.Infrastructure/Scraping/ is a documentation-only sample. Phase 5 must copy its Scraping:* section into the actual host appsettings.json.

EventId disambiguation (IMPORTANT)

Marathon.Domain.ValueObjects.EventId conflicts with Microsoft.Extensions.Logging.EventId. The Infrastructure project resolves this via:

  • GlobalUsings.cs: global using LogEventId = Microsoft.Extensions.Logging.EventId;
  • Local file aliases: using DomainEventId = Marathon.Domain.ValueObjects.EventId; in parser files that use both namespaces.
  • MarathonbetScraper.ScrapeEventOddsAsync uses the fully qualified name Marathon.Domain.ValueObjects.EventId for the parameter type.

Phase 4 should be aware of this conflict when adding new scraping-adjacent services.

Test status

Phase 3 scraping tests (tests/Marathon.Infrastructure.Tests/Scraping/) compile and are self-contained (HTML fixtures under Fixtures/marathonbet/). They cannot currently run because the shared test project does not build: Phase 2's test files (Persistence/RoundTripTests.cs, Export/ExcelExporterTests.cs) reference internal sealed classes in the Infrastructure project. Phase 2 should either (a) make the repositories public, or (b) expose the internals to the test assembly with [assembly: InternalsVisibleTo("Marathon.Infrastructure.Tests")].

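Option (b) is preferred. Either add the following item to Marathon.Infrastructure.csproj, or put the equivalent [assembly: InternalsVisibleTo("Marathon.Infrastructure.Tests")] attribute in a source file such as GlobalUsings.cs: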

<ItemGroup>
  <InternalsVisibleTo Include="Marathon.Infrastructure.Tests" />
</ItemGroup>

Files created (Phase 3 scope)

src/Marathon.Application/Abstractions/IOddsScraper.cs
src/Marathon.Application/Abstractions/IBetPlacer.cs
src/Marathon.Infrastructure/Configuration/ScrapingOptions.cs
src/Marathon.Infrastructure/GlobalUsings.cs                         (EventId disambiguation)
src/Marathon.Infrastructure/Scraping/MarathonbetScraper.cs
src/Marathon.Infrastructure/Scraping/ScrapingModule.cs
src/Marathon.Infrastructure/Scraping/UserAgentRotatorHandler.cs
src/Marathon.Infrastructure/Scraping/appsettings.scraping.sample.json
src/Marathon.Infrastructure/Scraping/Parsers/IServerTimeProvider.cs
src/Marathon.Infrastructure/Scraping/Parsers/ServerTimeProvider.cs
src/Marathon.Infrastructure/Scraping/Parsers/MoscowDateParser.cs
src/Marathon.Infrastructure/Scraping/Parsers/OutcomeCodeMapper.cs
src/Marathon.Infrastructure/Scraping/Parsers/PeriodScopeMapper.cs
src/Marathon.Infrastructure/Scraping/Parsers/EventListingParserBase.cs
src/Marathon.Infrastructure/Scraping/Parsers/IUpcomingEventsParser.cs
src/Marathon.Infrastructure/Scraping/Parsers/UpcomingEventsParser.cs
src/Marathon.Infrastructure/Scraping/Parsers/ILiveEventsParser.cs
src/Marathon.Infrastructure/Scraping/Parsers/LiveEventsParser.cs
src/Marathon.Infrastructure/Scraping/Parsers/IEventOddsParser.cs
src/Marathon.Infrastructure/Scraping/Parsers/EventOddsParser.cs
src/Marathon.Infrastructure/Scraping/Parsers/IResultsParser.cs
src/Marathon.Infrastructure/Scraping/Parsers/ResultsParser.cs
tests/Marathon.Infrastructure.Tests/Scraping/OutcomeCodeMapperTests.cs
tests/Marathon.Infrastructure.Tests/Scraping/MoscowDateParserTests.cs
tests/Marathon.Infrastructure.Tests/Scraping/ServerTimeProviderTests.cs
tests/Marathon.Infrastructure.Tests/Scraping/UpcomingEventsParserTests.cs
tests/Marathon.Infrastructure.Tests/Scraping/EventOddsParserTests.cs
tests/Marathon.Infrastructure.Tests/Scraping/ResultsParserTests.cs
tests/Marathon.Infrastructure.Tests/Fixtures/marathonbet/listing-sample.html
tests/Marathon.Infrastructure.Tests/Fixtures/marathonbet/event-football-sample.html
tests/Marathon.Infrastructure.Tests/Fixtures/marathonbet/event-basketball-sample.html
tests/Marathon.Infrastructure.Tests/Fixtures/marathonbet/event-completed-sample.html