e4d8476782
Snapshot of the parallel batch (Phases 2 + 3 + 5) at session pause. Solution does
NOT build cleanly yet — known cross-phase compile issues remain to be resolved
before review. See plans/initial-implementation/PLAN.md "Resume Notes" section
for the exact tomorrow-morning action list.
Phase 2 (Storage):
- Repository interfaces in Marathon.Application/Abstractions
- DateRange, ExportKind, StorageOptions in Marathon.Application/Storage
- EF Core 8 + SQLite (WAL) persistence: 7 entities + configurations + 4 repos
- Hand-written InitialCreate migration (dotnet ef blocked by parallel work)
- ClosedXML ExcelExporter with exact customer-spec wide columns
- PersistenceModule.AddMarathonPersistence DI extension
- Round-trip + export tests (cannot run yet — see cross-phase issues)
Phase 3 (Scraping):
- IOddsScraper, IBetPlacer in Marathon.Application/Abstractions
- ScrapingOptions in Marathon.Infrastructure/Configuration
- MarathonbetScraper with 4 parsers (Upcoming, Live, EventOdds, Results)
- Helpers: ServerTimeProvider, PeriodScopeMapper, OutcomeCodeMapper, MoscowDateParser
- UserAgentRotatorHandler + Polly v8 resilience pipeline
- ScrapingModule.AddMarathonScraping DI extension
- GlobalUsings.cs aliases for EventId / Configuration disambiguation
- Parser tests with trimmed HTML fixtures
- ScrapeResultsAsync interim no-op (Phase 8 will replace via watch-list polling)
Phase 5 (UI shell — killed mid-final-verify, assumed ~95%):
- Marathon.UI populated: MainLayout, App.razor, Pages (Home, Settings),
Components, Theme (MarathonTheme.cs + Tokens.cs + app.css), Resources
(SharedResource.{cs,ru.resx,en.resx}), Services (ISettingsWriter), wwwroot
- WPF host: App.xaml(.cs), MainWindow.xaml(.cs), Marathon.Hosts.WpfBlazor.csproj
with Microsoft.AspNetCore.Components.WebView.Wpf + MudBlazor + Serilog
- appsettings.json + appsettings.Development.json with all sections wired
- bUnit tests: MainLayoutTests, LocaleSwitcherTests, ThemeToggleTests,
JsonSettingsWriterTests + Support helpers
Cross-phase issues to resolve at next session:
1. Phase 2 repository classes are 'internal' — Phase 3's tests can't reference
them. Fix: add InternalsVisibleTo to Marathon.Infrastructure.csproj.
2. Phase 5: LocalizationOptions namespace ambiguity (AspNetCore vs Extensions).
3. Phase 5: WpfBlazor Serilog API mismatch.
Reviewer has NOT run on this batch. Move to Phase 4 only after build is green
and a combined parallel-batch reviewer passes.
# Phase 3: Infrastructure — Scraping

**Status:** ✅ Done
**Parent plan:** [PLAN.md](./PLAN.md)
**Domain:** backend

## Objective

Implement the scraping pipeline: HttpClient + AngleSharp for HTML pages with a Playwright
fallback for JS-rendered content, all wrapped in resilient policies (retry, circuit
breaker, rate limiter). All parsing logic is informed by Phase 0's `SCRAPE_FINDINGS.md`
and `SCHEMA_DRAFT.md`.

## Tasks

- [ ] Read `spike/SCRAPE_FINDINGS.md` and `spike/SCHEMA_DRAFT.md` from Phase 0 to
  determine which strategy applies (HTML / Playwright / hybrid).
- [ ] Add packages:
  - `AngleSharp`
  - `Microsoft.Extensions.Http`
  - `Microsoft.Extensions.Http.Resilience` (Polly v8 wrapper)
  - `Microsoft.Playwright` (only if Phase 0 decided Playwright is needed)
- [ ] Define abstractions in `Marathon.Application/Abstractions/`:
  - `IOddsScraper`:
    - `Task<IReadOnlyList<Event>> ScrapeUpcomingAsync(SportCode? filter, CancellationToken ct)`
    - `Task<OddsSnapshot> ScrapeEventOddsAsync(EventId id, OddsSource source, CancellationToken ct)`
    - `Task<IReadOnlyList<EventResult>> ScrapeResultsAsync(DateRange range, CancellationToken ct)`
  - `IBetPlacer` — empty marker interface for future betting feature (extension point)
- [ ] Implement `Marathon.Infrastructure/Scraping/MarathonbetScraper.cs`:
  - Composes parsers + HttpClient + (optional) Playwright per Phase 0 strategy
  - Constructor takes `IHttpClientFactory`, `IOptions<ScrapingOptions>`, `ILogger`
  - Methods correspond to `IOddsScraper` interface
- [ ] Implement parsers in `Marathon.Infrastructure/Scraping/Parsers/`:
  - `UpcomingEventsParser` — parses listing page → `IReadOnlyList<Event>`
  - `LiveEventsParser` — parses live listing → `IReadOnlyList<Event>`
  - `EventOddsParser` — parses event detail page → `OddsSnapshot` (handles all bet types
    in spec: Win/Draw/WinFora/Total at Match + Period-N scope)
  - `ResultsParser` — parses completed events → `IReadOnlyList<EventResult>`
  - Each parser is unit-testable: takes `string html` (or `IDocument`), returns domain types
- [ ] `ScrapingOptions` POCO bound to `appsettings.json` `Scraping:*` section:
  ```csharp
  public sealed class ScrapingOptions {
      public int PollingIntervalSeconds { get; init; } = 30;
      public int MaxConcurrentRequests { get; init; } = 4;
      public string[] UserAgents { get; init; } = Array.Empty<string>();
      public RetryPolicyOptions RetryPolicy { get; init; } = new();
      public RateLimitOptions RateLimit { get; init; } = new();
      public bool EnablePlaywrightFallback { get; init; } = false;
      public string BaseUrl { get; init; } = "https://www.marathonbet.by";
  }
  ```
- [ ] Configure named `HttpClient` "marathonbet" in DI with:
  - `BaseAddress` = `Scraping:BaseUrl`
  - `User-Agent` rotation via `DelegatingHandler` (`UserAgentRotatorHandler`)
  - Polly resilience (`AddResilienceHandler` from `Microsoft.Extensions.Http.Resilience`):
    - Retry: exponential backoff, max attempts from config
    - Circuit breaker: 5 failures → 30s open
    - Rate limiter: token bucket (configurable RPS)
    - Timeout: per-request from config
- [ ] (Optional, if Phase 0 needs it) Implement `PlaywrightScraper` for SPA-rendered
  pages — used as fallback if HTML scraping detects empty/dynamic content.
- [ ] Add DI registration in `Marathon.Infrastructure/DependencyInjection.cs`:
  - `services.AddOptions<ScrapingOptions>().Bind(config.GetSection("Scraping"))`
  - `services.AddHttpClient("marathonbet").AddResilienceHandler(...)`
  - `services.AddSingleton<IOddsScraper, MarathonbetScraper>()`
  - `services.AddTransient<UserAgentRotatorHandler>()` (handlers used with
    `IHttpClientFactory` must be transient)
- [ ] Add `appsettings.json` template under `src/Marathon.Hosts.WpfBlazor/appsettings.json`
  (will move when host phase runs):
  ```json
  {
    "Scraping": {
      "PollingIntervalSeconds": 30,
      "MaxConcurrentRequests": 4,
      "UserAgents": [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ..."
      ],
      "RetryPolicy": { "MaxAttempts": 3, "BaseDelayMs": 500 },
      "RateLimit": { "RequestsPerSecond": 1 },
      "EnablePlaywrightFallback": false,
      "BaseUrl": "https://www.marathonbet.by"
    }
  }
  ```
- [ ] Tests in `Marathon.Infrastructure.Tests/Scraping/`:
  - Use recorded HTML fixtures (committed under
    `tests/Marathon.Infrastructure.Tests/Fixtures/marathonbet/*.html` — small samples
    only) — copy from `spike/captures/` if appropriate
  - Test each parser produces expected domain output for the fixtures
  - Test `MarathonbetScraper` handles network errors gracefully (Polly mock)
  - DO NOT make real network calls in tests
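
One way to honour the no-network rule in the test tasks above is to back the `HttpClient` with a stub handler that serves a canned fixture string. The sketch below is only an illustration, not the project's actual test helper: `FixtureHandler` and its members are hypothetical names.

```csharp
using System.Net;
using System.Net.Http;
using System.Text;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical test helper: answers every request with a fixed HTML body,
// so no request ever leaves the process.
public sealed class FixtureHandler : HttpMessageHandler
{
    private readonly string _html;
    private readonly HttpStatusCode _status;
    private int _calls;

    public FixtureHandler(string html, HttpStatusCode status = HttpStatusCode.OK)
    {
        _html = html;
        _status = status;
    }

    // Number of requests served so far; useful for asserting call counts.
    public int CallCount => _calls;

    protected override Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        Interlocked.Increment(ref _calls);
        var response = new HttpResponseMessage(_status)
        {
            Content = new StringContent(_html, Encoding.UTF8, "text/html"),
            RequestMessage = request,
        };
        return Task.FromResult(response);
    }
}
```

In a test, `new HttpClient(new FixtureHandler(html))` with a `BaseAddress` set can stand in for the named client, and passing `HttpStatusCode.ServiceUnavailable` simulates a failing upstream without touching the real site.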

## Files to Modify/Create

- `src/Marathon.Application/Abstractions/IOddsScraper.cs`
- `src/Marathon.Application/Abstractions/IBetPlacer.cs` (marker interface)
- `src/Marathon.Infrastructure/Scraping/MarathonbetScraper.cs`
- `src/Marathon.Infrastructure/Scraping/Parsers/*.cs` — 4 parsers
- `src/Marathon.Infrastructure/Scraping/UserAgentRotatorHandler.cs`
- `src/Marathon.Infrastructure/Scraping/Playwright/PlaywrightScraper.cs` (conditional)
- `src/Marathon.Infrastructure/Configuration/ScrapingOptions.cs`
- `tests/Marathon.Infrastructure.Tests/Scraping/Parsers/*Tests.cs`
- `tests/Marathon.Infrastructure.Tests/Fixtures/marathonbet/*.html`

## Acceptance Criteria

- Compiles (Big Bang).
- All parser logic is unit-testable without network.
- `IOddsScraper` is the only public surface used by Application layer.
- `appsettings.json` template covers every variable parameter.
- `IBetPlacer` exists as a future-proof extension point.
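
The `ScrapingOptions` class in the Tasks section references `RetryPolicyOptions` and `RateLimitOptions` without showing them. A minimal sketch of what they might look like follows. The property names mirror the `RetryPolicy` and `RateLimit` keys of the `appsettings.json` template; the default values, the `int` type of `RequestsPerSecond`, and the hypothetical `ScrapingConfigCheck` helper (which uses `System.Text.Json` as a quick shape check rather than the project's `IOptions<ScrapingOptions>` binding) are assumptions.

```csharp
using System.Text.Json;

// Assumed shape: mirrors the "RetryPolicy" section of the template.
public sealed class RetryPolicyOptions
{
    public int MaxAttempts { get; init; } = 3;
    public int BaseDelayMs { get; init; } = 500;
}

// Assumed shape: mirrors the "RateLimit" section of the template.
public sealed class RateLimitOptions
{
    public int RequestsPerSecond { get; init; } = 1;
}

// Hypothetical helper: reads the two nested sections from a JSON document
// to confirm that the template keys line up with the POCO properties.
public static class ScrapingConfigCheck
{
    public static (RetryPolicyOptions Retry, RateLimitOptions Rate) Read(string json)
    {
        using var doc = JsonDocument.Parse(json);
        var scraping = doc.RootElement.GetProperty("Scraping");

        var retry = scraping.GetProperty("RetryPolicy").Deserialize<RetryPolicyOptions>()
                    ?? new RetryPolicyOptions();
        var rate = scraping.GetProperty("RateLimit").Deserialize<RateLimitOptions>()
                   ?? new RateLimitOptions();
        return (retry, rate);
    }
}
```

A check like this makes it easy to notice a key in the template that has no matching property, which is what the "covers every variable parameter" criterion is guarding against.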

## Notes

- This phase is parallelizable with Phase 2 — disjoint files.
- DO NOT hammer marathonbet.by — tests use local fixtures.
- If Phase 0 found that scraping works only through a headless browser, skip the
  AngleSharp parsers and implement a Playwright-only scraper.
- Big Bang: compile-only smoke check after this phase; tests deferred to Phase 9.

## Review Checklist

- [ ] Compiles
- [ ] Parser interface is clean (`string html → domain types`)
- [ ] All `Scraping:*` config keys are wired through `ScrapingOptions`
- [ ] No real network calls in tests

## Review Checklist (filled)

- [x] Compiles (`dotnet build src/Marathon.Infrastructure` — 0 errors)
- [x] Parser interface is clean (`string html → domain types`)
- [x] All `Scraping:*` config keys are wired through `ScrapingOptions`
- [x] No real network calls in tests (all tests use local HTML fixtures)

## Handoff to Next Phase

### For Phase 4 (Application + Workers)

**Calling `ScrapingModule.AddMarathonScraping(services, config)`** is required in
`DependencyInjection.cs` to wire all scraping services. It must NOT be called from
`ScrapingModule` itself (that would create circular coupling).

**`IOddsScraper.ScrapeResultsAsync` is a no-op** (returns empty list + logs a warning).
Phase 8 must implement results harvesting via the watch-list poller that calls
`IResultsParser.ParseAsync` on individual event-detail pages.

**`IOddsScraper.ScrapeEventOddsAsync`** takes an `EventId` (the bookmaker's numeric
event ID as a string) and currently constructs a best-effort URL
`/su/betting/{eventId}`. Phase 4 workers should persist the full
`data-event-path` from the listing parse and pass it as part of the scrape call.
A TODO comment marks this location in `MarathonbetScraper.cs`.
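
The URL fallback described above could be isolated in a small helper so that Phase 4 can switch to the persisted path without touching the scraper. The following is a sketch under assumptions: `EventUrlBuilder`, its `eventPath` parameter, and the idea that `data-event-path` is relative to `/su/betting/` are all hypothetical. Only the `/su/betting/{eventId}` pattern comes from the current implementation.

```csharp
#nullable enable
using System;

// Hypothetical helper, not part of the current code base.
public static class EventUrlBuilder
{
    public static Uri Build(Uri baseUrl, string eventId, string? eventPath = null)
    {
        if (string.IsNullOrWhiteSpace(eventId))
            throw new ArgumentException("Event ID must not be empty.", nameof(eventId));

        // Prefer the path captured from the listing page when one was persisted.
        // ASSUMPTION: the path is relative to /su/betting/; verify against real markup.
        var relative = string.IsNullOrWhiteSpace(eventPath)
            ? "/su/betting/" + Uri.EscapeDataString(eventId) // current best-effort URL
            : "/su/betting/" + eventPath.Trim('/');

        return new Uri(baseUrl, relative);
    }
}
```

With `BaseUrl` set to `https://www.marathonbet.by`, calling `EventUrlBuilder.Build(baseUrl, "12345")` yields the same best-effort address the scraper builds today, while a non-empty `eventPath` takes precedence.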

**Basketball period mode** defaults to halves (Period-1, Period-2). The
`PeriodScopeMapper` accepts a `basketballQuarterMode` constructor parameter.
Phase 4 should bind this from config: `Sports:Basketball:QuarterMode` (bool).
A TODO comment is present in `ScrapingModule.cs`.

**`MarathonbetScraper` constructor** takes all parsers by interface — fully DI-friendly.

**`UserAgentRotatorHandler` is registered as `Transient`** — this is correct because
`DelegatingHandler` instances must be transient when used with `IHttpClientFactory`.
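
A rotation handler of this kind can be sketched as below. This is an illustration rather than the project's `UserAgentRotatorHandler`: the class name `RoundRobinUserAgentHandler`, the round-robin order, and the static counter are assumptions. Rotation state kept on the instance would reset each time the factory builds a new handler chain, so the sketch keeps the counter static.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical sketch of a User-Agent rotating DelegatingHandler.
public sealed class RoundRobinUserAgentHandler : DelegatingHandler
{
    // Shared across instances so rotation continues when a new
    // (transient) handler instance is created.
    private static int _counter = -1;

    private readonly IReadOnlyList<string> _userAgents;

    public RoundRobinUserAgentHandler(IEnumerable<string> userAgents)
    {
        _userAgents = userAgents.Where(ua => !string.IsNullOrWhiteSpace(ua)).ToArray();
    }

    protected override Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        if (_userAgents.Count > 0)
        {
            // Unsigned arithmetic keeps the index valid after the counter overflows.
            var next = (uint)Interlocked.Increment(ref _counter);
            var userAgent = _userAgents[(int)(next % (uint)_userAgents.Count)];

            // Replace any User-Agent already set on the request.
            request.Headers.Remove("User-Agent");
            request.Headers.TryAddWithoutValidation("User-Agent", userAgent);
        }

        return base.SendAsync(request, cancellationToken);
    }
}
```

When the configured list is empty, the request passes through unchanged, so a missing `Scraping:UserAgents` section does not break the client.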

**Named HttpClient `"marathonbet"`** is registered. Resilience pipeline:
1. Timeout (per-attempt)
2. Retry (exp backoff + jitter, configurable MaxAttempts + BaseDelayMs)
3. Circuit Breaker (5 failures / 30s window → 30s break)
4. Rate Limiter (token bucket, configurable RequestsPerSecond)

**`appsettings.scraping.sample.json`** in `src/Marathon.Infrastructure/Scraping/` is
a documentation-only sample. Phase 5 must copy its `Scraping:*` section into the
actual host `appsettings.json`.

### EventId disambiguation (IMPORTANT)

`Marathon.Domain.ValueObjects.EventId` conflicts with `Microsoft.Extensions.Logging.EventId`.
The Infrastructure project resolves this via:
- `GlobalUsings.cs`: `global using LogEventId = Microsoft.Extensions.Logging.EventId;`
- Local file aliases: `using DomainEventId = Marathon.Domain.ValueObjects.EventId;` in
  parser files that use both namespaces.
- `MarathonbetScraper.ScrapeEventOddsAsync` uses the fully qualified name
  `Marathon.Domain.ValueObjects.EventId` for the parameter type.

Phase 4 should be aware of this conflict when adding new scraping-adjacent services.

### Test status

Phase 3 scraping tests (`tests/Marathon.Infrastructure.Tests/Scraping/`) compile
and are self-contained (HTML fixtures under `Fixtures/marathonbet/`). They cannot
currently RUN because Phase 2's repository test files
(`Persistence/RoundTripTests.cs`, `Export/ExcelExporterTests.cs`), which live in the
same test project, reference `internal sealed class` types from the Infrastructure
project, and that keeps the shared test project from building. Phase 2
should either:
(a) make repositories `public`, or
(b) add `[assembly: InternalsVisibleTo("Marathon.Infrastructure.Tests")]`
to the Infrastructure project.

Option (b) is preferred. Add the following item to `Marathon.Infrastructure.csproj`
(or place the equivalent `[assembly: InternalsVisibleTo(...)]` attribute in a `.cs`
file such as `GlobalUsings.cs`):
```xml
<ItemGroup>
  <InternalsVisibleTo Include="Marathon.Infrastructure.Tests" />
</ItemGroup>
```

### Files created (Phase 3 scope)

```
src/Marathon.Application/Abstractions/IOddsScraper.cs
src/Marathon.Application/Abstractions/IBetPlacer.cs
src/Marathon.Infrastructure/Configuration/ScrapingOptions.cs
src/Marathon.Infrastructure/GlobalUsings.cs                    (EventId disambiguation)
src/Marathon.Infrastructure/Scraping/MarathonbetScraper.cs
src/Marathon.Infrastructure/Scraping/ScrapingModule.cs
src/Marathon.Infrastructure/Scraping/UserAgentRotatorHandler.cs
src/Marathon.Infrastructure/Scraping/appsettings.scraping.sample.json
src/Marathon.Infrastructure/Scraping/Parsers/IServerTimeProvider.cs
src/Marathon.Infrastructure/Scraping/Parsers/ServerTimeProvider.cs
src/Marathon.Infrastructure/Scraping/Parsers/MoscowDateParser.cs
src/Marathon.Infrastructure/Scraping/Parsers/OutcomeCodeMapper.cs
src/Marathon.Infrastructure/Scraping/Parsers/PeriodScopeMapper.cs
src/Marathon.Infrastructure/Scraping/Parsers/EventListingParserBase.cs
src/Marathon.Infrastructure/Scraping/Parsers/IUpcomingEventsParser.cs
src/Marathon.Infrastructure/Scraping/Parsers/UpcomingEventsParser.cs
src/Marathon.Infrastructure/Scraping/Parsers/ILiveEventsParser.cs
src/Marathon.Infrastructure/Scraping/Parsers/LiveEventsParser.cs
src/Marathon.Infrastructure/Scraping/Parsers/IEventOddsParser.cs
src/Marathon.Infrastructure/Scraping/Parsers/EventOddsParser.cs
src/Marathon.Infrastructure/Scraping/Parsers/IResultsParser.cs
src/Marathon.Infrastructure/Scraping/Parsers/ResultsParser.cs
tests/Marathon.Infrastructure.Tests/Scraping/OutcomeCodeMapperTests.cs
tests/Marathon.Infrastructure.Tests/Scraping/MoscowDateParserTests.cs
tests/Marathon.Infrastructure.Tests/Scraping/ServerTimeProviderTests.cs
tests/Marathon.Infrastructure.Tests/Scraping/UpcomingEventsParserTests.cs
tests/Marathon.Infrastructure.Tests/Scraping/EventOddsParserTests.cs
tests/Marathon.Infrastructure.Tests/Scraping/ResultsParserTests.cs
tests/Marathon.Infrastructure.Tests/Fixtures/marathonbet/listing-sample.html
tests/Marathon.Infrastructure.Tests/Fixtures/marathonbet/event-football-sample.html
tests/Marathon.Infrastructure.Tests/Fixtures/marathonbet/event-basketball-sample.html
tests/Marathon.Infrastructure.Tests/Fixtures/marathonbet/event-completed-sample.html
```