Learn_System/.claude/memory/image-extraction.md
Maxim Dolgolyov 8a7091ddec chore(memory): snapshot Claude memory files into the repository for transfer
Copy of the user's auto-memory (29 facts + the MEMORY.md index) in
.claude/memory/, so it can be moved between machines via git.
README.md explains how to restore it into the user folder on another machine.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 08:32:16 +03:00

# Image Extraction from PDF — ЦТ/ЦЭ Questions
## Tools available
- **pdftoppm** (poppler, via scoop): renders PDF pages to PNG
- **sharp** (npm, installed in `backend/`): crops images in Node.js
- Script: `backend/src/db/crop_images.js`
## Workflow for extracting figures from exam PDF
### Step 1 — Render pages at 200 DPI
```bash
pdftoppm -png -r 200 -f <first_page> -l <last_page> "path/to/file.pdf" "/tmp/prefix"
# Output: /tmp/prefix-06.png, /tmp/prefix-07.png ...
# Copy to: frontend/img/questions/pageN.png
```
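The copy step noted in the comments above can be scripted. The following is a minimal sketch, assuming the `/tmp/prefix` output prefix from the command above and the zero-padded page numbers shown in its output; `targetName` and `copyRenders` are hypothetical helpers, not part of the repository.

```javascript
const fs = require('fs');
const path = require('path');

// Map a pdftoppm output name such as "prefix-06.png" to "page6.png".
// The zero padding seen in the pdftoppm output is stripped here.
function targetName(fileName, prefix = 'prefix', outPrefix = 'page') {
  const match = fileName.match(/^(.+)-(\d+)\.png$/);
  if (!match || match[1] !== prefix) return null; // not one of our renders
  return `${outPrefix}${Number(match[2])}.png`;
}

// Copy every matching render from srcDir into destDir under its new name.
function copyRenders(srcDir, destDir, prefix = 'prefix', outPrefix = 'page') {
  const copied = [];
  for (const fileName of fs.readdirSync(srcDir)) {
    const name = targetName(fileName, prefix, outPrefix);
    if (name === null) continue;
    fs.copyFileSync(path.join(srcDir, fileName), path.join(destDir, name));
    copied.push(name);
  }
  return copied;
}

// Example, run from the repository root:
// copyRenders('/tmp', 'frontend/img/questions');
```

The same helper would cover the 72 DPI reference renders by passing `'pt'` as both `prefix` and `outPrefix`.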
### Step 2 — Calibrate coordinates using 72 DPI reference
```bash
pdftoppm -png -r 72 -f <first_page> -l <last_page> "path/to/file.pdf" "/tmp/pt"
# Output: /tmp/pt-06.png etc. (614×844 px per page)
# Copy to: frontend/img/questions/ptN.png
```
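At 72 DPI these scanned pages render at 614×844 px, so 1 px = 1 pt (nominal A4 would be 595×842 px). Scale to 200 DPI: **×2.778** (200 / 72).
Measure coordinates visually on the 72 DPI images, then multiply by 2.778 and round to get 200 DPI coordinates.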
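The 72 to 200 DPI conversion described above can be written as a small helper. This is a sketch only: `toHiRes` is a hypothetical name, and rounding to the nearest pixel is an assumption about how the fractional results should be handled.

```javascript
// Scale factor between the 72 DPI reference render and the 200 DPI render.
const SCALE = 200 / 72; // about 2.778

// Convert a rectangle measured on the 72 DPI image to 200 DPI pixels.
// sharp's extract() expects integers, so each value is rounded.
function toHiRes(rect, scale = SCALE) {
  return {
    left: Math.round(rect.left * scale),
    top: Math.round(rect.top * scale),
    width: Math.round(rect.width * scale),
    height: Math.round(rect.height * scale),
  };
}

console.log(toHiRes({ left: 220, top: 80, width: 200, height: 100 }));
// { left: 611, top: 222, width: 556, height: 278 }
```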
### Step 3 — Test crops (Node.js)
```javascript
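// Run from the backend/ folder, where sharp is installed.
const sharp = require('sharp');

// Trial crop on the 72 DPI reference image to check the measured rectangle.
sharp('../frontend/img/questions/pt6.png')
  .extract({ left: 220, top: 80, width: 200, height: 100 })
  .toFile('../frontend/img/questions/test.png')
  .then((info) => console.log(`wrote test.png (${info.width}x${info.height})`))
  .catch((err) => console.error('crop failed:', err.message));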
```
### Step 4 — Run crop_images.js
```bash
cd backend && node src/db/crop_images.js
```
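The script itself is not shown in these notes. Purely as an illustration of the batch step, a crop script along the following lines could drive sharp from a list of rectangles. `IMG_DIR`, `CROPS`, `buildJobs`, and `runCrops` are hypothetical names, and the layout is an assumption rather than the actual contents of `crop_images.js`.

```javascript
const path = require('path');

// Directory holding page renders and crops, relative to backend/.
const IMG_DIR = path.join('..', 'frontend', 'img', 'questions');

// One entry per figure: output file, source page render, 200 DPI rectangle.
const CROPS = [
  { file: 'ct2021v1_a1.png', page: 'page6.png', left: 611, top: 222, width: 556, height: 292 },
];

// Turn the crop list into input/output paths plus the region for sharp.
function buildJobs(crops, imgDir = IMG_DIR) {
  return crops.map(({ file, page, ...region }) => ({
    input: path.join(imgDir, page),
    output: path.join(imgDir, file),
    region,
  }));
}

// Run all crops one after another; sharp is loaded lazily so that
// buildJobs can be inspected on a machine without sharp installed.
async function runCrops(crops, imgDir = IMG_DIR) {
  const sharp = require('sharp');
  for (const job of buildJobs(crops, imgDir)) {
    await sharp(job.input).extract(job.region).toFile(job.output);
    console.log(`wrote ${job.output}`);
  }
}

// To execute from backend/:
// runCrops(CROPS).catch((err) => console.error(err));
```

Keeping the rectangles in a plain array separates the measured data from the cropping logic, so a new variant would only need new entries.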
---
## ЦТ 2021 — Variant 1 crop coordinates (200 DPI, 1705×2344)
| File | Question | Source page | left | top | width | height |
|------|----------|-------------|------|-----|-------|--------|
| ct2021v1_a1.png | A1 triangle | page6.png | 611 | 222 | 556 | 292 |
| ct2021v1_a7.png | A7 graph f(x) | page6.png | 278 | 1222| 750 | 403 |
| ct2021v1_a15.png | A15 parabola | page7.png | 556 | 917 | 695 | 278 |
| ct2021v1_a17.png | A17 grid A,B | page7.png | 861 | 1439| 639 | 194 |
| ct2021v1_a18.png | A18 pyramid | page7.png | 389 | 1656| 945 | 472 |
| ct2021v1_b1.png | B1 bar chart | page8.png | 28 | 83 | 1167| 556 |
| ct2021v1_b3.png | B3 3D planes | page8.png | 1015| 1717| 319 | 417 |
| ct2021v1_b4.png | B4 enclosure | page9.png | 945 | 14 | 542 | 208 |
PDF source: `ЦТ-ЦЭ/ЦТ 2021.pdf`
- Page 6 = Variant 1, Part A (A1–A11)
- Page 7 = Variant 1, Part A (A12–A18)
- Page 8 = Variant 1, Part B (B1–B3)
- Page 9 = Variant 1, Part B (B4–B14)
Images stored: `frontend/img/questions/ct2021v1_*.png`
Served at: `/img/questions/ct2021v1_*.png` (via express.static on frontendDir)
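A crop rectangle that extends past the page edge will make the extract step fail, so it can help to check the table against the 1705×2344 page size before cropping. The sketch below copies the rectangles from the table above; `PAGE` and `outOfBounds` are hypothetical names.

```javascript
// Size of a 200 DPI page render, taken from the section heading.
const PAGE = { width: 1705, height: 2344 };

// Rectangles copied from the crop table (200 DPI coordinates).
const CROPS = [
  { file: 'ct2021v1_a1.png', left: 611, top: 222, width: 556, height: 292 },
  { file: 'ct2021v1_a7.png', left: 278, top: 1222, width: 750, height: 403 },
  { file: 'ct2021v1_a15.png', left: 556, top: 917, width: 695, height: 278 },
  { file: 'ct2021v1_a17.png', left: 861, top: 1439, width: 639, height: 194 },
  { file: 'ct2021v1_a18.png', left: 389, top: 1656, width: 945, height: 472 },
  { file: 'ct2021v1_b1.png', left: 28, top: 83, width: 1167, height: 556 },
  { file: 'ct2021v1_b3.png', left: 1015, top: 1717, width: 319, height: 417 },
  { file: 'ct2021v1_b4.png', left: 945, top: 14, width: 542, height: 208 },
];

// Return the files whose rectangle is empty or does not fit on the page.
function outOfBounds(crops, page = PAGE) {
  return crops
    .filter((c) =>
      c.left < 0 || c.top < 0 || c.width <= 0 || c.height <= 0 ||
      c.left + c.width > page.width || c.top + c.height > page.height)
    .map((c) => c.file);
}

console.log(outOfBounds(CROPS)); // []
```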
---
## DB: updating image field after seeding
```javascript
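// `db` is an already-open database handle exposing prepare()/run() (assumed: better-sqlite3).
const upd = db.prepare('UPDATE questions SET image = ? WHERE text LIKE ? AND year = ?');
// Match the question by its text prefix and year, then store the served image path.
upd.run('/img/questions/ct2021v1_a1.png', '[ЦТ 2021 · A1]%', 2021);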
```
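The single update above could be repeated for every cropped file. The sketch below derives the three parameters from the file name; it assumes the `ct<year>v<variant>_<question>.png` naming seen in the crop table and the `[ЦТ <year> · <QUESTION>]` text prefix from the statement above. `updateParams` and `updateImages` are hypothetical helpers.

```javascript
// Derive the UPDATE parameters for one crop file, e.g. "ct2021v1_a1.png".
function updateParams(file) {
  const match = file.match(/^ct(\d{4})v\d+_([ab]\d+)\.png$/);
  if (!match) throw new Error(`unexpected file name: ${file}`);
  const year = Number(match[1]);
  const question = match[2].toUpperCase();
  // The closing bracket in the LIKE pattern keeps A1 from matching A15.
  return [`/img/questions/${file}`, `[ЦТ ${year} · ${question}]%`, year];
}

// Apply the update for each file; `db` must expose prepare() returning
// a statement with run(), as in the snippet above.
function updateImages(db, files) {
  const upd = db.prepare('UPDATE questions SET image = ? WHERE text LIKE ? AND year = ?');
  for (const file of files) upd.run(...updateParams(file));
}

// Example:
// updateImages(db, ['ct2021v1_a1.png', 'ct2021v1_a7.png']);
```

Note that the pattern does not include the variant number, so this only works while question text prefixes are unique per year.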
---
## Notes
- PDF contains scanned raster images — no extractable vector graphics
- Each PDF page = one large bitmap scan
- `pdfimages` extracts only full-page bitmaps (not individual diagram crops)
- sharp must be required from the `backend/` directory (it is installed there)
- Temporary page renders are not committed to git; regenerate them with pdftoppm when needed