Phase 3: Infrastructure — Scraping
Status: ⬜ Not Started
Parent plan: PLAN.md
Domain: backend
Objective
Implement the scraping pipeline: HttpClient + AngleSharp for HTML pages with a Playwright
fallback for JS-rendered content, all wrapped in resilient policies (retry, circuit
breaker, rate limiter). All parsing logic is informed by Phase 0's SCRAPE_FINDINGS.md
and SCHEMA_DRAFT.md.
Tasks
- Read spike/SCRAPE_FINDINGS.md and spike/SCHEMA_DRAFT.md from Phase 0 to determine which strategy applies (HTML / Playwright / hybrid).
- Add packages:
  - AngleSharp
  - Microsoft.Extensions.Http
  - Microsoft.Extensions.Http.Resilience (Polly v8 wrapper)
  - Microsoft.Playwright (only if Phase 0 decided Playwright is needed)
- Define abstractions in Marathon.Application/Abstractions/:
  - IOddsScraper:
    - Task<IReadOnlyList<Event>> ScrapeUpcomingAsync(SportCode? filter, CancellationToken ct)
    - Task<OddsSnapshot> ScrapeEventOddsAsync(EventId id, OddsSource source, CancellationToken ct)
    - Task<IReadOnlyList<EventResult>> ScrapeResultsAsync(DateRange range, CancellationToken ct)
  - IBetPlacer: empty marker interface for the future betting feature (extension point)
- Implement Marathon.Infrastructure/Scraping/MarathonbetScraper.cs:
  - Composes parsers + HttpClient + (optional) Playwright per Phase 0 strategy
  - Constructor takes IHttpClientFactory, IOptions<ScrapingOptions>, ILogger
  - Methods correspond to the IOddsScraper interface
- Implement parsers in Marathon.Infrastructure/Scraping/Parsers/:
  - UpcomingEventsParser: parses listing page → IReadOnlyList<Event>
  - LiveEventsParser: parses live listing → IReadOnlyList<Event>
  - EventOddsParser: parses event detail page → OddsSnapshot (handles all bet types in spec: Win/Draw/WinFora/Total at Match + Period-N scope)
  - ResultsParser: parses completed events → IReadOnlyList<EventResult>
  - Each parser is unit-testable: takes string html (or IDocument), returns domain types
- ScrapingOptions POCO bound to the appsettings.json Scraping:* section:

      public sealed class ScrapingOptions
      {
          public int PollingIntervalSeconds { get; init; } = 30;
          public int MaxConcurrentRequests { get; init; } = 4;
          public string[] UserAgents { get; init; } = Array.Empty<string>();
          public RetryPolicyOptions RetryPolicy { get; init; } = new();
          public RateLimitOptions RateLimit { get; init; } = new();
          public bool EnablePlaywrightFallback { get; init; } = false;
          public string BaseUrl { get; init; } = "https://www.marathonbet.by";
      }

- Configure named HttpClient "marathonbet" in DI with:
  - BaseAddress = Scraping:BaseUrl
  - User-Agent rotation via DelegatingHandler (UserAgentRotatorHandler)
  - Polly resilience (AddResilienceHandler from Microsoft.Extensions.Http.Resilience):
    - Retry: exponential backoff, max attempts from config
    - Circuit breaker: 5 failures → 30s open
    - Rate limiter: token bucket (configurable RPS)
    - Timeout: per-request from config
- (Optional, if Phase 0 needs it) Implement PlaywrightScraper for SPA-rendered pages, used as a fallback if HTML scraping detects empty/dynamic content.
- Add DI registration in Marathon.Infrastructure/DependencyInjection.cs:
  - services.AddOptions<ScrapingOptions>().Bind(config.GetSection("Scraping"))
  - services.AddHttpClient("marathonbet").AddResilienceHandler(...)
  - services.AddSingleton<IOddsScraper, MarathonbetScraper>()
  - services.AddTransient<UserAgentRotatorHandler>() (handlers attached via AddHttpMessageHandler<T>() must be registered as transient, not singleton)
- Add appsettings.json template under src/Marathon.Hosts.WpfBlazor/appsettings.json (will move when host phase runs):

      {
        "Scraping": {
          "PollingIntervalSeconds": 30,
          "MaxConcurrentRequests": 4,
          "UserAgents": [
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ..."
          ],
          "RetryPolicy": { "MaxAttempts": 3, "BaseDelayMs": 500 },
          "RateLimit": { "RequestsPerSecond": 1 },
          "EnablePlaywrightFallback": false,
          "BaseUrl": "https://www.marathonbet.by"
        }
      }

- Tests in Marathon.Infrastructure.Tests/Scraping/:
  - Use recorded HTML fixtures, small samples only, committed under tests/Marathon.Infrastructure.Tests/Fixtures/marathonbet/*.html; copy from spike/captures/ if appropriate
  - Test each parser produces expected domain output for the fixtures
  - Test MarathonbetScraper handles network errors gracefully (Polly mock)
  - DO NOT make real network calls in tests
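The User-Agent rotation task could look roughly like the sketch below. It is a minimal round-robin DelegatingHandler, not the planned implementation: the constructor taking a plain string[] is an assumption made to keep the example dependency-free (the real class would presumably read ScrapingOptions.UserAgents), and RecordingHandler is a hypothetical stub that exists only to show the handler working without network access.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Sketch only: sets the User-Agent header in round-robin order.
public sealed class UserAgentRotatorHandler : DelegatingHandler
{
    private readonly string[] _userAgents;
    private int _next = -1;

    // Assumption: agents are passed directly instead of via IOptions<ScrapingOptions>.
    public UserAgentRotatorHandler(string[] userAgents)
    {
        if (userAgents is null || userAgents.Length == 0)
            throw new ArgumentException("At least one User-Agent is required.", nameof(userAgents));
        _userAgents = userAgents;
    }

    protected override Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        // Interlocked keeps the index safe under concurrent requests;
        // the uint cast keeps it non-negative if the counter ever wraps.
        var index = (int)((uint)Interlocked.Increment(ref _next) % (uint)_userAgents.Length);
        request.Headers.Remove("User-Agent");
        request.Headers.TryAddWithoutValidation("User-Agent", _userAgents[index]);
        return base.SendAsync(request, cancellationToken);
    }
}

// Hypothetical test stub: records the User-Agent it receives, never touches the network.
public sealed class RecordingHandler : HttpMessageHandler
{
    public List<string> Seen { get; } = new();

    protected override Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        request.Headers.TryGetValues("User-Agent", out var values);
        Seen.Add(string.Join(" ", values ?? Array.Empty<string>()));
        return Task.FromResult(new HttpResponseMessage(HttpStatusCode.OK));
    }
}
```

Chaining the rotator in front of the stub through an HttpMessageInvoker is enough to exercise it in a unit test, which fits the rule that tests make no real network calls.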
Files to Modify/Create
- src/Marathon.Application/Abstractions/IOddsScraper.cs
- src/Marathon.Application/Abstractions/IBetPlacer.cs (marker interface)
- src/Marathon.Infrastructure/Scraping/MarathonbetScraper.cs
- src/Marathon.Infrastructure/Scraping/Parsers/*.cs (4 parsers)
- src/Marathon.Infrastructure/Scraping/UserAgentRotatorHandler.cs
- src/Marathon.Infrastructure/Scraping/Playwright/PlaywrightScraper.cs (conditional)
- src/Marathon.Infrastructure/Configuration/ScrapingOptions.cs
- tests/Marathon.Infrastructure.Tests/Scraping/Parsers/*Tests.cs
- tests/Marathon.Infrastructure.Tests/Fixtures/marathonbet/*.html
Acceptance Criteria
- Compiles (Big Bang).
- All parser logic is unit-testable without network.
- IOddsScraper is the only public surface used by the Application layer.
- appsettings.json template covers every variable parameter.
- IBetPlacer exists as a future-proof extension point.
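To illustrate the "unit-testable without network" criterion, a parser can be a pure function from an HTML string to domain types. The sketch below uses AngleSharp's HtmlParser. The selectors ([data-event-id], .event-name, time[datetime]) are invented placeholders, since the real markup is documented in SCRAPE_FINDINGS.md, and ParsedEvent is a hypothetical stand-in for the domain Event type.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using AngleSharp.Html.Parser;

// Hypothetical stand-in for the domain Event type.
public sealed record ParsedEvent(string Id, string Name, DateTimeOffset StartsAt);

public static class UpcomingEventsParserSketch
{
    // Pure function: HTML in, domain objects out. No HttpClient, no I/O.
    public static IReadOnlyList<ParsedEvent> Parse(string html)
    {
        var document = new HtmlParser().ParseDocument(html);
        var events = new List<ParsedEvent>();

        // Assumed markup: one element per event carrying a data-event-id attribute.
        foreach (var row in document.QuerySelectorAll("[data-event-id]"))
        {
            var id = row.GetAttribute("data-event-id");
            var name = row.QuerySelector(".event-name")?.TextContent.Trim();
            var start = row.QuerySelector("time")?.GetAttribute("datetime");

            // Skip rows that are missing required fields instead of throwing.
            if (string.IsNullOrEmpty(id) || string.IsNullOrEmpty(name))
                continue;
            if (!DateTimeOffset.TryParse(start, CultureInfo.InvariantCulture,
                    DateTimeStyles.AssumeUniversal, out var startsAt))
                continue;

            events.Add(new ParsedEvent(id, name, startsAt));
        }

        return events;
    }
}
```

A fixture-based test then reads a committed .html file into a string, calls Parse, and asserts on the returned list. Rows with missing fields are skipped here; whether the real parsers should skip or fail loudly is a design choice this plan leaves open.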
Notes
- This phase is parallelizable with Phase 2 — disjoint files.
- DO NOT hammer marathonbet.by — tests use local fixtures.
- If Phase 0 found that scraping requires headless browser only, skip the AngleSharp parsers and implement Playwright-only.
- Big Bang: compile-only smoke check after this phase; tests deferred to Phase 9.
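Outside of tests, the resilience settings from the Tasks section are what keep live traffic polite toward marathonbet.by. The sketch below shows one way that pipeline might be wired with AddResilienceHandler, under several assumptions: the extension method name and its parameters are hypothetical; MaxRetryAttempts counts retries rather than total attempts, so mapping it from the config's MaxAttempts needs a decision; and Polly v8's circuit breaker is ratio-based, so "5 failures → 30s open" is only approximated here (all of at least 5 sampled calls failing).

```csharp
using System;
using System.Threading.RateLimiting;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Http.Resilience;
using Polly;

public static class ScrapingHttpClientRegistration
{
    // Hypothetical helper; values would normally come from ScrapingOptions.
    public static IServiceCollection AddMarathonbetHttpClient(
        this IServiceCollection services,
        Uri baseAddress,
        int maxRetryAttempts,
        TimeSpan baseDelay,
        int requestsPerSecond,
        TimeSpan requestTimeout)
    {
        services
            .AddHttpClient("marathonbet", client => client.BaseAddress = baseAddress)
            .AddResilienceHandler("marathonbet-pipeline", builder =>
            {
                // Outermost: token bucket caps the overall request rate.
                builder.AddRateLimiter(new TokenBucketRateLimiter(new TokenBucketRateLimiterOptions
                {
                    TokenLimit = requestsPerSecond,
                    TokensPerPeriod = requestsPerSecond,
                    ReplenishmentPeriod = TimeSpan.FromSeconds(1),
                    QueueLimit = 100, // assumption: queue excess requests instead of rejecting
                    QueueProcessingOrder = QueueProcessingOrder.OldestFirst
                }));

                // Retry with exponential backoff and jitter.
                builder.AddRetry(new HttpRetryStrategyOptions
                {
                    MaxRetryAttempts = maxRetryAttempts,
                    Delay = baseDelay,
                    BackoffType = DelayBackoffType.Exponential,
                    UseJitter = true
                });

                // Approximation of "5 failures -> 30s open" using the ratio-based breaker.
                builder.AddCircuitBreaker(new HttpCircuitBreakerStrategyOptions
                {
                    MinimumThroughput = 5,
                    FailureRatio = 1.0,
                    SamplingDuration = TimeSpan.FromSeconds(30),
                    BreakDuration = TimeSpan.FromSeconds(30)
                });

                // Innermost: timeout applied to each individual attempt.
                builder.AddTimeout(requestTimeout);
            });

        return services;
    }
}
```

Strategies added first sit outermost, so the rate limiter throttles retries as well as first attempts. Whether retries should count against the request budget is a judgment call, but for a scraper that must not overload the target site it seems the safer default.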
Review Checklist
- Compiles
- Parser interface is clean (string html → domain types)
- All Scraping:* config keys are wired through ScrapingOptions
- No real network calls in tests