# Phase 3: Infrastructure - Scraping

**Status:** ✅ Done

**Parent plan:** [PLAN.md](./PLAN.md)

**Domain:** backend

## Objective

Implement the scraping pipeline: HttpClient + AngleSharp for HTML pages with a Playwright fallback for JS-rendered content, all wrapped in resilient policies (retry, circuit breaker, rate limiter). All parsing logic is informed by Phase 0's `SCRAPE_FINDINGS.md` and `SCHEMA_DRAFT.md`.

## Tasks

- [ ] Read `spike/SCRAPE_FINDINGS.md` and `spike/SCHEMA_DRAFT.md` from Phase 0 to determine which strategy applies (HTML / Playwright / hybrid).
- [ ] Add packages:
  - `AngleSharp`
  - `Microsoft.Extensions.Http`
  - `Microsoft.Extensions.Http.Resilience` (Polly v8 wrapper)
  - `Microsoft.Playwright` (only if Phase 0 decided Playwright is needed)
- [ ] Define abstractions in `Marathon.Application/Abstractions/`:
  - `IOddsScraper`:
    - `Task> ScrapeUpcomingAsync(SportCode? filter, CancellationToken ct)`
    - `Task ScrapeEventOddsAsync(EventId id, OddsSource source, CancellationToken ct)`
    - `Task> ScrapeResultsAsync(DateRange range, CancellationToken ct)`
  - `IBetPlacer`: empty marker interface for future betting feature (extension point)
- [ ] Implement `Marathon.Infrastructure/Scraping/MarathonbetScraper.cs`:
  - Composes parsers + HttpClient + (optional) Playwright per Phase 0 strategy
  - Constructor takes `IHttpClientFactory`, `IOptions<ScrapingOptions>`, `ILogger<MarathonbetScraper>`
  - Methods correspond to `IOddsScraper` interface
- [ ] Implement parsers in `Marathon.Infrastructure/Scraping/Parsers/`:
  - `UpcomingEventsParser`: parses listing page → `IReadOnlyList`
  - `LiveEventsParser`: parses live listing → `IReadOnlyList`
  - `EventOddsParser`: parses event detail page → `OddsSnapshot` (handles all bet types in spec: Win/Draw/WinFora/Total at Match + Period-N scope)
  - `ResultsParser`: parses completed events → `IReadOnlyList`
  - Each parser is unit-testable: takes `string html` (or `IDocument`), returns domain types
- [ ] `ScrapingOptions` POCO bound to `appsettings.json` `Scraping:*` section:
  ```csharp
  using System; // Array.Empty<T>()

  public sealed class ScrapingOptions
  {
      public int PollingIntervalSeconds { get; init; } = 30;
      public int MaxConcurrentRequests { get; init; } = 4;
      public string[] UserAgents { get; init; } = Array.Empty<string>();
      public RetryPolicyOptions RetryPolicy { get; init; } = new();
      public RateLimitOptions RateLimit { get; init; } = new();
      public bool EnablePlaywrightFallback { get; init; } = false;
      public string BaseUrl { get; init; } = "https://www.marathonbet.by";
  }
  ```

- [ ] Configure named `HttpClient` "marathonbet" in DI with:
  - `BaseAddress` = `Scraping:BaseUrl`
  - `User-Agent` rotation via `DelegatingHandler` (`UserAgentRotatorHandler`)
  - Polly resilience (`AddResilienceHandler` from `Microsoft.Extensions.Http.Resilience`):
    - Retry: exponential backoff, max attempts from config
    - Circuit breaker: 5 failures → 30s open
    - Rate limiter: token bucket (configurable RPS)
    - Timeout: per-request from config
- [ ] (Optional, if Phase 0 needs it) Implement `PlaywrightScraper` for SPA-rendered pages, used as a fallback if HTML scraping detects empty/dynamic content.
- [ ] Add DI registration in `Marathon.Infrastructure/DependencyInjection.cs`:
  - `services.AddOptions<ScrapingOptions>().Bind(config.GetSection("Scraping"))`
  - `services.AddHttpClient("marathonbet").AddResilienceHandler(...)`
  - `services.AddSingleton()`
  - `services.AddSingleton()`
- [ ] Add `appsettings.json` template under `src/Marathon.Hosts.WpfBlazor/appsettings.json` (will move when host phase runs):

  ```json
  {
    "Scraping": {
      "PollingIntervalSeconds": 30,
      "MaxConcurrentRequests": 4,
      "UserAgents": [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ..."
      ],
      "RetryPolicy": { "MaxAttempts": 3, "BaseDelayMs": 500 },
      "RateLimit": { "RequestsPerSecond": 1 },
      "EnablePlaywrightFallback": false,
      "BaseUrl": "https://www.marathonbet.by"
    }
  }
  ```

- [ ] Tests in `Marathon.Infrastructure.Tests/Scraping/`:
  - Use recorded HTML fixtures committed under `tests/Marathon.Infrastructure.Tests/Fixtures/marathonbet/*.html` (small samples only); copy from `spike/captures/` if appropriate
  - Test each parser produces expected domain output for the fixtures
  - Test `MarathonbetScraper` handles network errors gracefully (Polly mock)
  - DO NOT make real network calls in tests

## Files to Modify/Create

- `src/Marathon.Application/Abstractions/IOddsScraper.cs`
- `src/Marathon.Application/Abstractions/IBetPlacer.cs` (marker interface)
- `src/Marathon.Infrastructure/Scraping/MarathonbetScraper.cs`
- `src/Marathon.Infrastructure/Scraping/Parsers/*.cs` (4 parsers)
- `src/Marathon.Infrastructure/Scraping/UserAgentRotatorHandler.cs`
- `src/Marathon.Infrastructure/Scraping/Playwright/PlaywrightScraper.cs` (conditional)
- `src/Marathon.Infrastructure/Configuration/ScrapingOptions.cs`
- `tests/Marathon.Infrastructure.Tests/Scraping/Parsers/*Tests.cs`
- `tests/Marathon.Infrastructure.Tests/Fixtures/marathonbet/*.html`

## Acceptance Criteria

- Compiles (Big Bang).
- All parser logic is unit-testable without network.
- `IOddsScraper` is the only public surface used by Application layer.
- `appsettings.json` template covers every variable parameter.
- `IBetPlacer` exists as a future-proof extension point.

## Notes

- This phase is parallelizable with Phase 2 (disjoint files).
- DO NOT hammer marathonbet.by; tests use local fixtures.
- If Phase 0 found that scraping requires a headless browser only, skip the AngleSharp parsers and implement Playwright-only.
- Big Bang: compile-only smoke check after this phase; tests deferred to Phase 9.
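
The `User-Agent` rotation task above does not say how `UserAgentRotatorHandler` picks a value. The following is a minimal sketch, assuming plain round-robin over the `Scraping:UserAgents` list. The constructor shape and the `EchoUserAgentHandler` stub are hypothetical and exist only for illustration; the real handler may be wired through `IOptions` and behave differently.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical sketch: rotates the User-Agent header round-robin per request.
public sealed class UserAgentRotatorHandler : DelegatingHandler
{
    private readonly IReadOnlyList<string> _userAgents;
    private int _counter = -1;

    public UserAgentRotatorHandler(IReadOnlyList<string> userAgents)
    {
        _userAgents = userAgents ?? throw new ArgumentNullException(nameof(userAgents));
    }

    protected override Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        if (_userAgents.Count > 0)
        {
            // Interlocked keeps the rotation safe under concurrent requests;
            // the uint cast keeps the index non-negative after int overflow.
            var next = (uint)Interlocked.Increment(ref _counter);
            var userAgent = _userAgents[(int)(next % (uint)_userAgents.Count)];

            request.Headers.Remove("User-Agent");
            request.Headers.TryAddWithoutValidation("User-Agent", userAgent);
        }

        return base.SendAsync(request, cancellationToken);
    }
}

// Hypothetical test stub: echoes the received User-Agent in the response body,
// so the rotation can be checked without any network traffic.
public sealed class EchoUserAgentHandler : HttpMessageHandler
{
    protected override Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        var userAgent = request.Headers.TryGetValues("User-Agent", out var values)
            ? string.Join(" ", values)
            : string.Empty;

        var response = new HttpResponseMessage(HttpStatusCode.OK)
        {
            Content = new StringContent(userAgent)
        };
        return Task.FromResult(response);
    }
}
```

To exercise the sketch offline, set `InnerHandler` to an `EchoUserAgentHandler`, wrap the rotator in an `HttpMessageInvoker`, and send a few requests. Note that the counter lives on the handler instance, so the rotation restarts whenever `IHttpClientFactory` builds a fresh handler chain.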

## Review Checklist

- [ ] Compiles
- [ ] Parser interface is clean (`string html → domain types`)
- [ ] All `Scraping:*` config keys are wired through `ScrapingOptions`
- [ ] No real network calls in tests

## Review Checklist (filled)

- [x] Compiles (`dotnet build src/Marathon.Infrastructure`: 0 errors)
- [x] Parser interface is clean (`string html → domain types`)
- [x] All `Scraping:*` config keys are wired through `ScrapingOptions`
- [x] No real network calls in tests (all tests use local HTML fixtures)

## Handoff to Next Phase

### For Phase 4 (Application + Workers)

**Calling `ScrapingModule.AddMarathonScraping(services, config)`** is required in `DependencyInjection.cs` to wire all scraping services. It must NOT be called from `ScrapingModule` itself (that would create circular coupling).

**`IOddsScraper.ScrapeResultsAsync` is a no-op** (returns empty list + logs a warning). Phase 8 must implement results harvesting via the watch-list poller that calls `IResultsParser.ParseAsync` on individual event-detail pages.

**`IOddsScraper.ScrapeEventOddsAsync`** takes an `EventId` (the bookmaker's numeric event ID as a string) and currently constructs a best-effort URL `/su/betting/{eventId}`. Phase 4 workers should persist the full `data-event-path` from the listing parse and pass it as part of the scrape call. A TODO comment marks this location in `MarathonbetScraper.cs`.

**Basketball period mode** defaults to halves (Period-1, Period-2). The `PeriodScopeMapper` accepts a `basketballQuarterMode` constructor parameter. Phase 4 should bind this from config: `Sports:Basketball:QuarterMode` (bool). A TODO comment is present in `ScrapingModule.cs`.

**`MarathonbetScraper` constructor** takes all parsers by interface, so it is fully DI-friendly.

**`UserAgentRotatorHandler` is registered as `Transient`**. This is correct because `DelegatingHandler` instances must be transient when used with IHttpClientFactory.

**Named HttpClient `"marathonbet"`** is registered.
Resilience pipeline:

1. Timeout (per-attempt)
2. Retry (exp backoff + jitter, configurable MaxAttempts + BaseDelayMs)
3. Circuit Breaker (5 failures / 30s window → 30s break)
4. Rate Limiter (token bucket, configurable RequestsPerSecond)

**`appsettings.scraping.sample.json`** in `src/Marathon.Infrastructure/Scraping/` is a documentation-only sample. Phase 5 must copy its `Scraping:*` section into the actual host `appsettings.json`.

### EventId disambiguation (IMPORTANT)

`Marathon.Domain.ValueObjects.EventId` conflicts with `Microsoft.Extensions.Logging.EventId`. The Infrastructure project resolves this via:

- `GlobalUsings.cs`: `global using LogEventId = Microsoft.Extensions.Logging.EventId;`
- Local file aliases: `using DomainEventId = Marathon.Domain.ValueObjects.EventId;` in parser files that use both namespaces.
- `MarathonbetScraper.ScrapeEventOddsAsync` uses the fully qualified name `Marathon.Domain.ValueObjects.EventId` for the parameter type.

Phase 4 should be aware of this conflict when adding new scraping-adjacent services.

### Test status

Phase 3 scraping tests (`tests/Marathon.Infrastructure.Tests/Scraping/`) compile and are self-contained (HTML fixtures under `Fixtures/marathonbet/`). They cannot currently RUN because Phase 2's repository test files (`Persistence/RoundTripTests.cs`, `Export/ExcelExporterTests.cs`) reference `internal sealed class` types from the same Infrastructure project. Phase 2 should either:

- (a) make repositories `public`, or
- (b) add `[assembly: InternalsVisibleTo("Marathon.Infrastructure.Tests")]` to the Infrastructure project.

Option (b) is preferred: either place the assembly attribute in a source file such as `GlobalUsings.cs`, or declare it in `Marathon.Infrastructure.csproj`, for example:

```xml
<ItemGroup>
  <InternalsVisibleTo Include="Marathon.Infrastructure.Tests" />
</ItemGroup>
```

### Files created (Phase 3 scope)

```
src/Marathon.Application/Abstractions/IOddsScraper.cs
src/Marathon.Application/Abstractions/IBetPlacer.cs
src/Marathon.Infrastructure/Configuration/ScrapingOptions.cs
src/Marathon.Infrastructure/GlobalUsings.cs  (EventId disambiguation)
src/Marathon.Infrastructure/Scraping/MarathonbetScraper.cs
src/Marathon.Infrastructure/Scraping/ScrapingModule.cs
src/Marathon.Infrastructure/Scraping/UserAgentRotatorHandler.cs
src/Marathon.Infrastructure/Scraping/appsettings.scraping.sample.json
src/Marathon.Infrastructure/Scraping/Parsers/IServerTimeProvider.cs
src/Marathon.Infrastructure/Scraping/Parsers/ServerTimeProvider.cs
src/Marathon.Infrastructure/Scraping/Parsers/MoscowDateParser.cs
src/Marathon.Infrastructure/Scraping/Parsers/OutcomeCodeMapper.cs
src/Marathon.Infrastructure/Scraping/Parsers/PeriodScopeMapper.cs
src/Marathon.Infrastructure/Scraping/Parsers/EventListingParserBase.cs
src/Marathon.Infrastructure/Scraping/Parsers/IUpcomingEventsParser.cs
src/Marathon.Infrastructure/Scraping/Parsers/UpcomingEventsParser.cs
src/Marathon.Infrastructure/Scraping/Parsers/ILiveEventsParser.cs
src/Marathon.Infrastructure/Scraping/Parsers/LiveEventsParser.cs
src/Marathon.Infrastructure/Scraping/Parsers/IEventOddsParser.cs
src/Marathon.Infrastructure/Scraping/Parsers/EventOddsParser.cs
src/Marathon.Infrastructure/Scraping/Parsers/IResultsParser.cs
src/Marathon.Infrastructure/Scraping/Parsers/ResultsParser.cs
tests/Marathon.Infrastructure.Tests/Scraping/OutcomeCodeMapperTests.cs
tests/Marathon.Infrastructure.Tests/Scraping/MoscowDateParserTests.cs
tests/Marathon.Infrastructure.Tests/Scraping/ServerTimeProviderTests.cs
tests/Marathon.Infrastructure.Tests/Scraping/UpcomingEventsParserTests.cs
tests/Marathon.Infrastructure.Tests/Scraping/EventOddsParserTests.cs
tests/Marathon.Infrastructure.Tests/Scraping/ResultsParserTests.cs
tests/Marathon.Infrastructure.Tests/Fixtures/marathonbet/listing-sample.html
tests/Marathon.Infrastructure.Tests/Fixtures/marathonbet/event-football-sample.html
tests/Marathon.Infrastructure.Tests/Fixtures/marathonbet/event-basketball-sample.html
tests/Marathon.Infrastructure.Tests/Fixtures/marathonbet/event-completed-sample.html
```
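
For anyone adding scraping-adjacent services in Phase 4, the alias approach from the EventId disambiguation section can be sketched in a single file as below. This is an illustration only: the `Sketch.*` namespaces are stand-ins for `Marathon.Domain.ValueObjects` and `Microsoft.Extensions.Logging` so the file compiles on its own, and `EventUrlBuilder`, `UrlBuilt`, and the ID `3001` are hypothetical names that do not come from the codebase. Only the `/su/betting/{eventId}` URL shape is taken from the handoff notes.

```csharp
using System;
// Importing both namespaces makes the bare name `EventId` ambiguous...
using Sketch.Domain.ValueObjects;
using Sketch.Logging;
// ...so each type gets an unambiguous alias (aliases must be fully qualified).
using DomainEventId = Sketch.Domain.ValueObjects.EventId;
using LogEventId = Sketch.Logging.EventId;

namespace Sketch.Domain.ValueObjects
{
    // Stand-in for the domain value object (bookmaker's event ID as a string).
    public sealed record EventId(string Value);
}

namespace Sketch.Logging
{
    // Stand-in for the logging event identifier.
    public sealed record EventId(int Id, string Name);
}

namespace Sketch.Scraping
{
    public static class EventUrlBuilder
    {
        // Logging identifier, referenced through its alias.
        public static readonly LogEventId UrlBuilt = new(3001, nameof(UrlBuilt));

        // Domain identifier, referenced through its alias; writing `EventId`
        // here would fail to compile with an ambiguous-reference error.
        public static string BuildEventUrl(DomainEventId id)
        {
            if (id is null) throw new ArgumentNullException(nameof(id));
            return $"/su/betting/{id.Value}";
        }
    }
}
```

The aliases keep method signatures short while leaving no doubt about which `EventId` is meant. Using the fully qualified name, as `MarathonbetScraper.ScrapeEventOddsAsync` does, is an equivalent alternative when only one reference is needed.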