Phase 3: Infrastructure — Scraping

Status: Not Started
Parent plan: PLAN.md
Domain: backend

Objective

Implement the scraping pipeline: HttpClient + AngleSharp for server-rendered HTML pages, with a Playwright fallback for JS-rendered content. All HTTP traffic goes through resilience policies (retry, circuit breaker, rate limiter, timeout). Parsing logic follows Phase 0's SCRAPE_FINDINGS.md and SCHEMA_DRAFT.md.

Tasks

  • Read spike/SCRAPE_FINDINGS.md and spike/SCHEMA_DRAFT.md from Phase 0 to determine which strategy applies (HTML / Playwright / hybrid).
  • Add packages:
    • AngleSharp
    • Microsoft.Extensions.Http
    • Microsoft.Extensions.Http.Resilience (Polly v8 wrapper)
    • Microsoft.Playwright (only if Phase 0 decided Playwright is needed)
  • Define abstractions in Marathon.Application/Abstractions/:
    • IOddsScraper:
      • Task<IReadOnlyList<Event>> ScrapeUpcomingAsync(SportCode? filter, CancellationToken ct)
      • Task<OddsSnapshot> ScrapeEventOddsAsync(EventId id, OddsSource source, CancellationToken ct)
      • Task<IReadOnlyList<EventResult>> ScrapeResultsAsync(DateRange range, CancellationToken ct)
    • IBetPlacer — empty marker interface for future betting feature (extension point)
  • Implement Marathon.Infrastructure/Scraping/MarathonbetScraper.cs:
    • Composes parsers + HttpClient + (optional) Playwright per Phase 0 strategy
    • Constructor takes IHttpClientFactory, IOptions<ScrapingOptions>, ILogger<MarathonbetScraper>
    • Methods correspond to IOddsScraper interface
  • Implement parsers in Marathon.Infrastructure/Scraping/Parsers/:
    • UpcomingEventsParser — parses listing page → IReadOnlyList<Event>
    • LiveEventsParser — parses live listing → IReadOnlyList<Event>
    • EventOddsParser — parses event detail page → OddsSnapshot (handles all bet types in spec: Win/Draw/WinFora/Total at Match + Period-N scope)
    • ResultsParser — parses completed events → IReadOnlyList<EventResult>
    • Each parser is unit-testable: takes string html (or IDocument), returns domain types
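    A minimal sketch of the "string html in, domain types out" parser shape, assuming AngleSharp's HtmlParser. The CSS selectors (div.event-row, .event-name), the data-event-id attribute, and the ParsedEvent record are hypothetical stand-ins: the real selectors come from spike/SCRAPE_FINDINGS.md, and the real return type is the domain Event.

    ```csharp
    using System.Collections.Generic;
    using System.Linq;
    using AngleSharp.Html.Parser;

    // Hypothetical stand-in for the domain Event type.
    public sealed record ParsedEvent(string Id, string Name);

    public static class UpcomingEventsParserSketch
    {
        // Pure function: no network, no DI, so it can be tested against fixtures.
        public static IReadOnlyList<ParsedEvent> Parse(string html)
        {
            var document = new HtmlParser().ParseDocument(html);

            return document
                // Assumed markup: one row per event, carrying its id as an attribute.
                .QuerySelectorAll("div.event-row[data-event-id]")
                .Select(row => new ParsedEvent(
                    row.GetAttribute("data-event-id") ?? string.Empty,
                    row.QuerySelector(".event-name")?.TextContent.Trim() ?? string.Empty))
                .Where(e => e.Id.Length > 0)
                .ToList();
        }
    }
    ```

    Keeping the parser static and free of I/O means a fixture file can be read with File.ReadAllText and passed straight in.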
  • ScrapingOptions POCO bound to appsettings.json Scraping:* section:
    public sealed class ScrapingOptions {
      public int PollingIntervalSeconds { get; init; } = 30;
      public int MaxConcurrentRequests { get; init; } = 4;
      public string[] UserAgents { get; init; } = Array.Empty<string>();
      public RetryPolicyOptions RetryPolicy { get; init; } = new();
      public RateLimitOptions RateLimit { get; init; } = new();
      public bool EnablePlaywrightFallback { get; init; } = false;
      public string BaseUrl { get; init; } = "https://www.marathonbet.by";
    }
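    ScrapingOptions references RetryPolicyOptions and RateLimitOptions without defining them. One possible shape is sketched here, assuming the property names mirror the RetryPolicy and RateLimit keys in the appsettings.json template. The DelayFor helper is a hypothetical addition that only illustrates the exponential backoff arithmetic.

    ```csharp
    using System;

    // Assumed shape: names and defaults mirror the "RetryPolicy" keys in the template.
    public sealed class RetryPolicyOptions
    {
        public int MaxAttempts { get; init; } = 3;
        public int BaseDelayMs { get; init; } = 500;

        // Hypothetical helper: delay = BaseDelayMs * 2^(attempt - 1), attempt starting at 1.
        public TimeSpan DelayFor(int attempt) =>
            TimeSpan.FromMilliseconds(BaseDelayMs * Math.Pow(2, Math.Max(0, attempt - 1)));
    }

    // Assumed shape: mirrors the "RateLimit" key in the template.
    public sealed class RateLimitOptions
    {
        public int RequestsPerSecond { get; init; } = 1;
    }
    ```

    With the template values, the retry delays would be 500 ms, 1000 ms, and 2000 ms before jitter.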
    
  • Configure named HttpClient "marathonbet" in DI with:
    • BaseAddress = Scraping:BaseUrl
    • User-Agent rotation via DelegatingHandler (UserAgentRotatorHandler)
    • Polly resilience (AddResilienceHandler from Microsoft.Extensions.Http.Resilience):
      • Retry: exponential backoff, max attempts from config
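      • Circuit breaker: 5 failures → 30s open (Polly v8 breaks on a failure ratio over a sampling window, not a consecutive-failure count; approximate this with FailureRatio = 1.0, MinimumThroughput = 5, BreakDuration = 30s)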
      • Rate limiter: token bucket (configurable RPS)
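      • Timeout: per-request, from config (ScrapingOptions and the appsettings.json template do not define a timeout key yet; add one so the value is configurable)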
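    The named client and its resilience pipeline could be wired roughly as follows. This is a sketch under assumptions: the extension method name, the pipeline name "marathonbet-pipeline", the strategy order, the queue limit, and the 10-second timeout are all illustrative, and the literal values would normally be read from ScrapingOptions rather than hard-coded.

    ```csharp
    using System;
    using System.Net.Http;
    using System.Threading.RateLimiting;
    using Microsoft.Extensions.DependencyInjection;
    using Microsoft.Extensions.Http.Resilience;
    using Polly;

    public static class MarathonbetHttpClientSketch
    {
        public static IServiceCollection AddMarathonbetHttpClient(this IServiceCollection services)
        {
            services
                .AddHttpClient("marathonbet", client =>
                    client.BaseAddress = new Uri("https://www.marathonbet.by"))
                // Strategies execute outermost-first, in the order they are added.
                .AddResilienceHandler("marathonbet-pipeline", builder =>
                {
                    // Retry: exponential backoff; 3 retries, 500 ms base delay.
                    builder.AddRetry(new HttpRetryStrategyOptions
                    {
                        MaxRetryAttempts = 3,
                        Delay = TimeSpan.FromMilliseconds(500),
                        BackoffType = DelayBackoffType.Exponential,
                        UseJitter = true
                    });

                    // Token bucket: 1 request per second; extra requests wait in a queue.
                    builder.AddRateLimiter(new TokenBucketRateLimiter(new TokenBucketRateLimiterOptions
                    {
                        TokenLimit = 1,
                        TokensPerPeriod = 1,
                        ReplenishmentPeriod = TimeSpan.FromSeconds(1),
                        QueueLimit = 16, // assumed value
                        QueueProcessingOrder = QueueProcessingOrder.OldestFirst,
                        AutoReplenishment = true
                    }));

                    // Circuit breaker: opens for 30 s once 5 sampled calls have all failed.
                    builder.AddCircuitBreaker(new HttpCircuitBreakerStrategyOptions
                    {
                        FailureRatio = 1.0,
                        MinimumThroughput = 5,
                        SamplingDuration = TimeSpan.FromSeconds(30),
                        BreakDuration = TimeSpan.FromSeconds(30)
                    });

                    // Per-attempt timeout (assumed value; no timeout key is defined in the plan).
                    builder.AddTimeout(TimeSpan.FromSeconds(10));
                });

            return services;
        }
    }
    ```

    Placing the rate limiter inside the retry strategy means every retry attempt also consumes a token, which keeps retries from exceeding the configured request rate.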
  • (Optional, if Phase 0 needs it) Implement PlaywrightScraper for SPA-rendered pages — used as fallback if HTML scraping detects empty/dynamic content.
  • Add DI registration in Marathon.Infrastructure/DependencyInjection.cs:
    • services.AddOptions<ScrapingOptions>().Bind(config.GetSection("Scraping"))
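    • services.AddHttpClient("marathonbet").AddHttpMessageHandler<UserAgentRotatorHandler>().AddResilienceHandler(...)
    • services.AddSingleton<IOddsScraper, MarathonbetScraper>()
    • services.AddTransient<UserAgentRotatorHandler>() (a DelegatingHandler used with IHttpClientFactory must be registered as transient; a singleton instance would be reused across handler chains and fail at runtime)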
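    A minimal sketch of what UserAgentRotatorHandler might look like, assuming simple round-robin rotation. To stay self-contained, the constructor takes the list directly; in the real project it would more likely take IOptions<ScrapingOptions> and read UserAgents from it.

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Net.Http;
    using System.Threading;
    using System.Threading.Tasks;

    public sealed class UserAgentRotatorHandler : DelegatingHandler
    {
        private readonly IReadOnlyList<string> _userAgents;
        private int _next = -1;

        public UserAgentRotatorHandler(IReadOnlyList<string> userAgents) =>
            _userAgents = userAgents ?? throw new ArgumentNullException(nameof(userAgents));

        protected override Task<HttpResponseMessage> SendAsync(
            HttpRequestMessage request, CancellationToken cancellationToken)
        {
            if (_userAgents.Count > 0)
            {
                // Thread-safe round-robin; the uint cast keeps the index valid after int overflow.
                var index = (int)((uint)Interlocked.Increment(ref _next) % (uint)_userAgents.Count);

                // Replace any existing header; skip validation because real
                // browser User-Agent strings do not always parse strictly.
                request.Headers.Remove("User-Agent");
                request.Headers.TryAddWithoutValidation("User-Agent", _userAgents[index]);
            }

            return base.SendAsync(request, cancellationToken);
        }
    }
    ```

    With an empty UserAgents array (the ScrapingOptions default) the handler passes requests through unchanged.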
  • Add appsettings.json template under src/Marathon.Hosts.WpfBlazor/appsettings.json (will move when host phase runs):
    {
      "Scraping": {
        "PollingIntervalSeconds": 30,
        "MaxConcurrentRequests": 4,
        "UserAgents": [
          "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ..."
        ],
        "RetryPolicy": { "MaxAttempts": 3, "BaseDelayMs": 500 },
        "RateLimit": { "RequestsPerSecond": 1 },
        "EnablePlaywrightFallback": false,
        "BaseUrl": "https://www.marathonbet.by"
      }
    }
    
  • Tests in Marathon.Infrastructure.Tests/Scraping/:
    • Use recorded HTML fixtures committed under tests/Marathon.Infrastructure.Tests/Fixtures/marathonbet/*.html (small samples only); copy from spike/captures/ where appropriate
    • Test each parser produces expected domain output for the fixtures
    • Test that MarathonbetScraper handles network errors gracefully (stub the HttpMessageHandler to simulate failures; no Polly internals need mocking)
    • DO NOT make real network calls in tests
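    One way to exercise network-error handling without real network calls is a stub HttpMessageHandler placed under the HttpClient. The FlakyStubHandler below is a hypothetical test double, not part of the plan: it fails a configurable number of calls and then returns fixture HTML.

    ```csharp
    using System.Net;
    using System.Net.Http;
    using System.Threading;
    using System.Threading.Tasks;

    // Hypothetical test double: the first `failures` calls fail, later calls return 200 + HTML.
    public sealed class FlakyStubHandler : HttpMessageHandler
    {
        private readonly int _failures;
        private readonly string _html;
        private int _calls;

        public FlakyStubHandler(int failures, string html)
        {
            _failures = failures;
            _html = html;
        }

        // Number of requests that reached the stub, including failed ones.
        public int Calls => _calls;

        protected override Task<HttpResponseMessage> SendAsync(
            HttpRequestMessage request, CancellationToken cancellationToken)
        {
            var call = Interlocked.Increment(ref _calls);
            if (call <= _failures)
            {
                // Simulate a transport-level failure without touching the network.
                return Task.FromException<HttpResponseMessage>(
                    new HttpRequestException("Simulated network failure"));
            }

            return Task.FromResult(new HttpResponseMessage(HttpStatusCode.OK)
            {
                Content = new StringContent(_html)
            });
        }
    }
    ```

    A test could build new HttpClient(new FlakyStubHandler(2, fixtureHtml)), or wire the stub as the primary handler of the "marathonbet" client, and then assert on Calls to check how many attempts the retry policy made.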

Files to Modify/Create

  • src/Marathon.Application/Abstractions/IOddsScraper.cs
  • src/Marathon.Application/Abstractions/IBetPlacer.cs (marker interface)
  • src/Marathon.Infrastructure/Scraping/MarathonbetScraper.cs
  • src/Marathon.Infrastructure/Scraping/Parsers/*.cs — 4 parsers
  • src/Marathon.Infrastructure/Scraping/UserAgentRotatorHandler.cs
  • src/Marathon.Infrastructure/Scraping/Playwright/PlaywrightScraper.cs (conditional)
  • src/Marathon.Infrastructure/Configuration/ScrapingOptions.cs
  • tests/Marathon.Infrastructure.Tests/Scraping/Parsers/*Tests.cs
  • tests/Marathon.Infrastructure.Tests/Fixtures/marathonbet/*.html

Acceptance Criteria

  • Solution compiles (Big Bang: compile-only check for this phase).
  • All parser logic is unit-testable without network.
  • IOddsScraper is the only public surface used by Application layer.
  • appsettings.json template covers every variable parameter.
  • IBetPlacer exists as a future-proof extension point.

Notes

  • This phase is parallelizable with Phase 2 — disjoint files.
  • DO NOT hammer marathonbet.by — tests use local fixtures.
  • If Phase 0 found that scraping works only through a headless browser, skip the AngleSharp parsers and implement a Playwright-only scraper.
  • Big Bang: compile-only smoke check after this phase; tests deferred to Phase 9.

Review Checklist

  • Compiles
  • Parser interface is clean (string html → domain types)
  • All Scraping:* config keys are wired through ScrapingOptions
  • No real network calls in tests

Handoff to Next Phase