docs(initial-implementation): add feature plan and 10 phase subplans
@@ -0,0 +1,129 @@
# Phase 3: Infrastructure — Scraping

**Status:** ⬜ Not Started
**Parent plan:** [PLAN.md](./PLAN.md)
**Domain:** backend

## Objective

Implement the scraping pipeline: `HttpClient` + AngleSharp for HTML pages, with a
Playwright fallback for JS-rendered content, all wrapped in resilience policies (retry,
circuit breaker, rate limiter). All parsing logic is informed by Phase 0's
`SCRAPE_FINDINGS.md` and `SCHEMA_DRAFT.md`.

## Tasks

- [ ] Read `spike/SCRAPE_FINDINGS.md` and `spike/SCHEMA_DRAFT.md` from Phase 0 to
  determine which strategy applies (HTML / Playwright / hybrid).
- [ ] Add packages:
  - `AngleSharp`
  - `Microsoft.Extensions.Http`
  - `Microsoft.Extensions.Http.Resilience` (Polly v8 wrapper)
  - `Microsoft.Playwright` (only if Phase 0 decided Playwright is needed)
- [ ] Define abstractions in `Marathon.Application/Abstractions/`:
  - `IOddsScraper`:
    - `Task<IReadOnlyList<Event>> ScrapeUpcomingAsync(SportCode? filter, CancellationToken ct)`
    - `Task<OddsSnapshot> ScrapeEventOddsAsync(EventId id, OddsSource source, CancellationToken ct)`
    - `Task<IReadOnlyList<EventResult>> ScrapeResultsAsync(DateRange range, CancellationToken ct)`
  - `IBetPlacer`: empty marker interface for the future betting feature (extension point)
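
The two abstractions above could be declared roughly as follows. This is only a sketch: the domain types (`Event`, `OddsSnapshot`, `EventResult`, `EventId`, `SportCode`, `DateRange`, `OddsSource`) belong to other phases, so the placeholder declarations and their members here are assumptions that exist solely to make the snippet compile on its own.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Placeholder domain types (ASSUMPTIONS): the real shapes come from the Domain layer.
public readonly record struct EventId(string Value);
public readonly record struct SportCode(string Value);
public readonly record struct DateRange(DateOnly From, DateOnly To);
public enum OddsSource { PreMatch, Live } // hypothetical members
public sealed record Event(EventId Id, SportCode Sport, string Name);
public sealed record OddsSnapshot(EventId EventId, OddsSource Source, DateTimeOffset CapturedAt);
public sealed record EventResult(EventId EventId, string Score);

// The only scraping surface the Application layer depends on.
public interface IOddsScraper
{
    // A null filter means "all sports".
    Task<IReadOnlyList<Event>> ScrapeUpcomingAsync(SportCode? filter, CancellationToken ct);

    Task<OddsSnapshot> ScrapeEventOddsAsync(EventId id, OddsSource source, CancellationToken ct);

    Task<IReadOnlyList<EventResult>> ScrapeResultsAsync(DateRange range, CancellationToken ct);
}

// Deliberately empty: a marker for the future betting feature.
public interface IBetPlacer { }
```

Whether `SportCode?` is a nullable value type or a nullable reference depends on how the Domain layer defines `SportCode`; the signature compiles either way.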
- [ ] Implement `Marathon.Infrastructure/Scraping/MarathonbetScraper.cs`:
  - Composes parsers + HttpClient + (optional) Playwright per Phase 0 strategy
  - Constructor takes `IHttpClientFactory`, `IOptions<ScrapingOptions>`, `ILogger`
  - Methods correspond to `IOddsScraper` interface
- [ ] Implement parsers in `Marathon.Infrastructure/Scraping/Parsers/`:
  - `UpcomingEventsParser`: parses listing page → `IReadOnlyList<Event>`
  - `LiveEventsParser`: parses live listing → `IReadOnlyList<Event>`
  - `EventOddsParser`: parses event detail page → `OddsSnapshot` (handles all bet types
    in spec: Win/Draw/WinFora/Total at Match + Period-N scope)
  - `ResultsParser`: parses completed events → `IReadOnlyList<EventResult>`
  - Each parser is unit-testable: takes `string html` (or `IDocument`), returns domain types
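
A minimal sketch of the "`string html` in, domain types out" parser shape, using AngleSharp's `HtmlParser`. The selectors `[data-event-id]` and `.event-name` and the `ParsedEvent` record are invented for illustration. The real selectors must come from `SCRAPE_FINDINGS.md`, and the real parser would return the domain `Event`.

```csharp
using System.Collections.Generic;
using AngleSharp.Html.Parser;

// Hypothetical stand-in for the domain Event type.
public sealed record ParsedEvent(string Id, string Name);

public sealed class UpcomingEventsParser
{
    private readonly HtmlParser _parser = new();

    // Depends only on the HTML string: no HttpClient, no clock, no I/O.
    public IReadOnlyList<ParsedEvent> Parse(string html)
    {
        using var document = _parser.ParseDocument(html);
        var events = new List<ParsedEvent>();

        // ASSUMPTION: both selectors are placeholders, not the site's real markup.
        foreach (var row in document.QuerySelectorAll("[data-event-id]"))
        {
            var id = row.GetAttribute("data-event-id");
            var name = row.QuerySelector(".event-name")?.TextContent.Trim();

            // Skip rows that do not match the expected shape instead of throwing.
            if (string.IsNullOrEmpty(id) || string.IsNullOrEmpty(name))
                continue;

            events.Add(new ParsedEvent(id, name));
        }

        return events;
    }
}
```

Because the parser never touches the network, a test can feed it a fixture file's contents directly.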
- [ ] `ScrapingOptions` POCO bound to `appsettings.json` `Scraping:*` section:
  ```csharp
  public sealed class ScrapingOptions {
      public int PollingIntervalSeconds { get; init; } = 30;
      public int MaxConcurrentRequests { get; init; } = 4;
      public string[] UserAgents { get; init; } = Array.Empty<string>();
      public RetryPolicyOptions RetryPolicy { get; init; } = new();
      public RateLimitOptions RateLimit { get; init; } = new();
      public bool EnablePlaywrightFallback { get; init; } = false;
      public string BaseUrl { get; init; } = "https://www.marathonbet.by";
  }
  ```
- [ ] Configure named `HttpClient` "marathonbet" in DI with:
  - `BaseAddress` = `Scraping:BaseUrl`
  - `User-Agent` rotation via `DelegatingHandler` (`UserAgentRotatorHandler`)
  - Polly resilience (`AddResilienceHandler` from `Microsoft.Extensions.Http.Resilience`):
    - Retry: exponential backoff, max attempts from config
    - Circuit breaker: 5 failures → 30s open
    - Rate limiter: token bucket (configurable RPS)
    - Timeout: per-request from config
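
One possible shape for the `User-Agent` rotation handler, sketched with the base class library only. Round-robin selection is an assumption, since the plan does not say how rotation should pick a value. The constructor takes a plain `string[]` to keep the snippet self-contained; the real handler would more likely receive `IOptions<ScrapingOptions>`.

```csharp
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

public sealed class UserAgentRotatorHandler : DelegatingHandler
{
    private readonly string[] _userAgents;
    private int _counter = -1;

    // Simplification (assumption): values are passed in directly rather than via IOptions.
    public UserAgentRotatorHandler(string[] userAgents) =>
        _userAgents = userAgents ?? Array.Empty<string>();

    // Returns the next User-Agent in round-robin order, or null when none are configured.
    public string? NextUserAgent()
    {
        if (_userAgents.Length == 0) return null;

        // The uint cast keeps the index non-negative after the counter overflows.
        var ticket = (uint)Interlocked.Increment(ref _counter);
        return _userAgents[(int)(ticket % (uint)_userAgents.Length)];
    }

    protected override Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        var userAgent = NextUserAgent();
        if (userAgent is not null)
        {
            request.Headers.Remove("User-Agent");
            // Added without strict validation, because some browser strings can fail header parsing.
            request.Headers.TryAddWithoutValidation("User-Agent", userAgent);
        }

        return base.SendAsync(request, cancellationToken);
    }
}
```

With an empty `UserAgents` array the handler leaves the request untouched, so the default in `ScrapingOptions` stays safe.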
- [ ] (Optional, if Phase 0 needs it) Implement `PlaywrightScraper` for SPA-rendered
  pages, used as a fallback when HTML scraping detects empty or dynamic content.
- [ ] Add DI registration in `Marathon.Infrastructure/DependencyInjection.cs`:
  - `services.AddOptions<ScrapingOptions>().Bind(config.GetSection("Scraping"))`
  - `services.AddHttpClient("marathonbet").AddHttpMessageHandler<UserAgentRotatorHandler>().AddResilienceHandler(...)`
  - `services.AddSingleton<IOddsScraper, MarathonbetScraper>()`
  - `services.AddTransient<UserAgentRotatorHandler>()` (handlers attached via `AddHttpMessageHandler<T>()` must be registered as transient, not singleton)
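
The `AddResilienceHandler(...)` call could be filled in along these lines. This is a sketch under stated assumptions, not the final wiring:

- The helper name `AddMarathonbetHttpClient`, the pipeline name `"scraping"`, the queue limit of 64 and all parameters are hypothetical stand-ins for values read from `ScrapingOptions`.
- `requestTimeout` is passed as a parameter because the options class as drafted has no timeout key.
- Polly v8 circuit breakers are ratio-based over a sampling window, so "5 failures → 30s open" is only approximated here.

```csharp
using System;
using System.Threading.RateLimiting;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Http.Resilience;
using Polly;

public static class ScrapingHttpClientRegistration
{
    // Hypothetical helper; parameters stand in for values bound from ScrapingOptions.
    public static IServiceCollection AddMarathonbetHttpClient(
        this IServiceCollection services,
        Uri baseAddress, int maxAttempts, TimeSpan baseDelay,
        int requestsPerSecond, TimeSpan requestTimeout)
    {
        // Created once and shared by every request made through the named client.
        var limiter = new TokenBucketRateLimiter(new TokenBucketRateLimiterOptions
        {
            TokenLimit = requestsPerSecond,
            TokensPerPeriod = requestsPerSecond,
            ReplenishmentPeriod = TimeSpan.FromSeconds(1),
            QueueLimit = 64, // ASSUMPTION: callers wait for a token instead of being rejected
            QueueProcessingOrder = QueueProcessingOrder.OldestFirst,
            AutoReplenishment = true,
        });

        services.AddHttpClient("marathonbet", client => client.BaseAddress = baseAddress)
            .AddResilienceHandler("scraping", pipeline =>
            {
                // Strategies wrap each other in the order added: the first is outermost.
                pipeline.AddRetry(new HttpRetryStrategyOptions
                {
                    // "MaxAttempts" counts the first try; Polly counts retries only.
                    MaxRetryAttempts = Math.Max(1, maxAttempts - 1),
                    Delay = baseDelay,
                    BackoffType = DelayBackoffType.Exponential,
                    UseJitter = true,
                });
                pipeline.AddCircuitBreaker(new HttpCircuitBreakerStrategyOptions
                {
                    // Approximation: opens when at least 5 calls in the window all failed.
                    MinimumThroughput = 5,
                    FailureRatio = 1.0,
                    BreakDuration = TimeSpan.FromSeconds(30),
                });
                // Inside the retry, so every attempt (including retries) spends a token.
                pipeline.AddRateLimiter(limiter);
                // Innermost: applies to each individual attempt.
                pipeline.AddTimeout(requestTimeout);
            });

        return services;
    }
}
```

Placing the rate limiter inside the retry is a design choice aimed at keeping the total request rate against the site bounded even while retrying.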
- [ ] Add `appsettings.json` template under `src/Marathon.Hosts.WpfBlazor/appsettings.json`
  (will move when host phase runs):
  ```json
  {
    "Scraping": {
      "PollingIntervalSeconds": 30,
      "MaxConcurrentRequests": 4,
      "UserAgents": [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ..."
      ],
      "RetryPolicy": { "MaxAttempts": 3, "BaseDelayMs": 500 },
      "RateLimit": { "RequestsPerSecond": 1 },
      "EnablePlaywrightFallback": false,
      "BaseUrl": "https://www.marathonbet.by"
    }
  }
  ```
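
The nested `RetryPolicy` and `RateLimit` sections imply two small option types that `ScrapingOptions` references but the plan does not spell out. The sketch below derives their property names from the JSON keys in the template. The default values and the `ScrapingTemplateCheck` helper are assumptions, and `System.Text.Json` is used only as a self-contained stand-in for the configuration binder.

```csharp
using System.Text.Json;

public sealed class RetryPolicyOptions
{
    // Defaults mirror the template values (assumption).
    public int MaxAttempts { get; init; } = 3;
    public int BaseDelayMs { get; init; } = 500;
}

public sealed class RateLimitOptions
{
    public int RequestsPerSecond { get; init; } = 1;
}

// Hypothetical helper: checks that the template's keys line up with the option type.
public static class ScrapingTemplateCheck
{
    public static RetryPolicyOptions ReadRetryPolicy(string appSettingsJson)
    {
        using var document = JsonDocument.Parse(appSettingsJson);
        var section = document.RootElement
            .GetProperty("Scraping")
            .GetProperty("RetryPolicy");

        // Property names are matched case-sensitively by default.
        return section.Deserialize<RetryPolicyOptions>() ?? new RetryPolicyOptions();
    }
}
```

A check like this catches a renamed key in the template before it silently falls back to a default at runtime.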
- [ ] Tests in `Marathon.Infrastructure.Tests/Scraping/`:
  - Use recorded HTML fixtures committed under
    `tests/Marathon.Infrastructure.Tests/Fixtures/marathonbet/*.html` (small samples
    only); copy from `spike/captures/` if appropriate
  - Test that each parser produces the expected domain output for the fixtures
  - Test that `MarathonbetScraper` handles network errors gracefully (Polly mock)
  - DO NOT make real network calls in tests
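
One way to keep the tests off the network is a hand-written stub `HttpMessageHandler` that answers locally. It is sketched here as an alternative to a mocking library, and the class and factory names are hypothetical. An `HttpClient` built on it can serve fixture HTML or simulate a network failure without opening a socket.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Text;
using System.Threading;
using System.Threading.Tasks;

// Test double: every request is answered in-process.
public sealed class StubHttpMessageHandler : HttpMessageHandler
{
    private readonly Func<HttpRequestMessage, HttpResponseMessage> _respond;

    // Lets a test assert how many attempts were made (for example, after retries).
    public int CallCount { get; private set; }

    public StubHttpMessageHandler(Func<HttpRequestMessage, HttpResponseMessage> respond) =>
        _respond = respond;

    // Serves the given HTML (typically a fixture file's contents) with 200 OK.
    public static StubHttpMessageHandler ReturningHtml(string html) =>
        new(_ => new HttpResponseMessage(HttpStatusCode.OK)
        {
            Content = new StringContent(html, Encoding.UTF8, "text/html"),
        });

    // Simulates a transport-level failure on every call.
    public static StubHttpMessageHandler Failing() =>
        new(_ => throw new HttpRequestException("simulated network failure"));

    protected override Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        CallCount++;
        try
        {
            return Task.FromResult(_respond(request));
        }
        catch (Exception ex)
        {
            // Surface failures as a faulted task, like a real handler would.
            return Task.FromException<HttpResponseMessage>(ex);
        }
    }
}
```

In a test, `new HttpClient(StubHttpMessageHandler.Failing())` can back the client handed to `MarathonbetScraper`, and `CallCount` shows whether the retry policy made the expected number of attempts.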

## Files to Modify/Create

- `src/Marathon.Application/Abstractions/IOddsScraper.cs`
- `src/Marathon.Application/Abstractions/IBetPlacer.cs` (marker interface)
- `src/Marathon.Infrastructure/Scraping/MarathonbetScraper.cs`
- `src/Marathon.Infrastructure/Scraping/Parsers/*.cs` (4 parsers)
- `src/Marathon.Infrastructure/Scraping/UserAgentRotatorHandler.cs`
- `src/Marathon.Infrastructure/Scraping/Playwright/PlaywrightScraper.cs` (conditional)
- `src/Marathon.Infrastructure/Configuration/ScrapingOptions.cs`
- `tests/Marathon.Infrastructure.Tests/Scraping/Parsers/*Tests.cs`
- `tests/Marathon.Infrastructure.Tests/Fixtures/marathonbet/*.html`

## Acceptance Criteria

- The solution compiles (Big Bang: compile-only smoke check).
- All parser logic is unit-testable without network access.
- `IOddsScraper` is the only public surface used by the Application layer.
- The `appsettings.json` template covers every variable parameter.
- `IBetPlacer` exists as a future-proof extension point.

## Notes

- This phase can run in parallel with Phase 2: the two phases touch disjoint files.
- DO NOT hammer marathonbet.by; tests use local fixtures.
- If Phase 0 found that scraping requires a headless browser only, skip the AngleSharp
  parsers and implement Playwright-only.
- Big Bang: compile-only smoke check after this phase; tests deferred to Phase 9.

## Review Checklist

- [ ] Compiles
- [ ] Parser interface is clean (`string html → domain types`)
- [ ] All `Scraping:*` config keys are wired through `ScrapingOptions`
- [ ] No real network calls in tests

## Handoff to Next Phase

<!-- Filled by Phase 3 implementer. -->