Learn_System/.claude/memory/image-extraction.md
Maxim Dolgolyov 8a7091ddec chore(memory): snapshot of Claude memory files into the repository for portability
A copy of the user auto-memory (29 facts + the MEMORY.md index) in
.claude/memory/, so it can be moved between machines via git.
README.md: how to restore it into the user folder on another machine.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 08:32:16 +03:00


Image Extraction from PDF — ЦТ/ЦЭ Questions

Tools available

  • pdftoppm (poppler, via scoop): renders PDF pages to PNG
  • sharp (npm, installed in backend/): crops images in Node.js
  • Script: backend/src/db/crop_images.js

Workflow for extracting figures from exam PDF

Step 1 — Render pages at 200 DPI

pdftoppm -png -r 200 -f <first_page> -l <last_page> "path/to/file.pdf" "/tmp/prefix"
# Output: /tmp/prefix-06.png, /tmp/prefix-07.png ...
# Copy to: frontend/img/questions/pageN.png
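
The copy step above renames zero-padded pdftoppm outputs (prefix-06.png) to unpadded names (page6.png). A minimal sketch of that name mapping, assuming the output always ends in `-<digits>.png`; the helper name `targetName` is hypothetical and not part of the repository:

```javascript
// Hypothetical helper: map a pdftoppm output name to the name used in
// frontend/img/questions/ (prefix-06.png -> page6.png).
function targetName(renderedName, targetPrefix = 'page') {
  // pdftoppm appends "-<zero-padded page number>.png" to the output prefix.
  const match = /-(\d+)\.png$/.exec(renderedName);
  if (!match) {
    throw new Error(`unexpected pdftoppm output name: ${renderedName}`);
  }
  const pageNumber = parseInt(match[1], 10); // "06" -> 6
  return `${targetPrefix}${pageNumber}.png`;
}

console.log(targetName('/tmp/prefix-06.png'));  // 200 DPI render
console.log(targetName('/tmp/pt-06.png', 'pt')); // 72 DPI reference
```

The same function covers both the 200 DPI renders (pageN.png) and the 72 DPI references (ptN.png) by switching the target prefix.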

Step 2 — Calibrate coordinates using 72 DPI reference

pdftoppm -png -r 72 -f <first_page> -l <last_page> "path/to/file.pdf" "/tmp/pt"
# Output: /tmp/pt-06.png etc. (614×844 px per page for this scan)
# Copy to: frontend/img/questions/ptN.png

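At 72 DPI each page of this scan is 614×844 px (slightly larger than nominal A4, which would be 595×842 px). The scale factor to 200 DPI is 200/72 ≈ 2.778. Measure coordinates visually on the 72 DPI images, then multiply by 2.778 and round to get 200 DPI coordinates.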
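
The conversion from 72 DPI measurements to 200 DPI crop coordinates can be sketched as a small function. This is only an illustration: `scaleRect` is a hypothetical name, and rounding each value to the nearest pixel is an assumption about how the stored coordinates were produced.

```javascript
// Hypothetical helper: scale a crop rectangle measured at one DPI to another.
// Uses the exact ratio toDpi / fromDpi (200 / 72 = 2.777...) and rounds
// every value to the nearest whole pixel.
function scaleRect(rect, fromDpi = 72, toDpi = 200) {
  const factor = toDpi / fromDpi;
  const scale = (value) => Math.round(value * factor);
  return {
    left: scale(rect.left),
    top: scale(rect.top),
    width: scale(rect.width),
    height: scale(rect.height),
  };
}

// The rectangle from the Step 3 test crop, measured on the 72 DPI image.
const measured = { left: 220, top: 80, width: 200, height: 100 };
console.log(scaleRect(measured));
// -> { left: 611, top: 222, width: 556, height: 278 }
```

The result can be passed directly to sharp's `.extract()` when cropping the 200 DPI page render.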

Step 3 — Test crops (Node.js)

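// Run from the backend/ folder (sharp is installed there)
const sharp = require('sharp');

// pt6.png is the 72 DPI reference render, so these coordinates are 72 DPI pixels
sharp('../frontend/img/questions/pt6.png')
  .extract({ left: 220, top: 80, width: 200, height: 100 })
  .toFile('../frontend/img/questions/test.png')
  .then(() => console.log('wrote test.png'))
  .catch((err) => console.error('crop failed:', err.message));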

Step 4 — Run crop_images.js

cd backend && node src/db/crop_images.js

ЦТ 2021 — Variant 1 crop coordinates (200 DPI, 1705×2344)

| File | Question | Source page | left | top | width | height |
|---|---|---|---|---|---|---|
| ct2021v1_a1.png | A1 triangle | page6.png | 611 | 222 | 556 | 292 |
| ct2021v1_a7.png | A7 graph f(x) | page6.png | 278 | 1222 | 750 | 403 |
| ct2021v1_a15.png | A15 parabola | page7.png | 556 | 917 | 695 | 278 |
| ct2021v1_a17.png | A17 grid A,B | page7.png | 861 | 1439 | 639 | 194 |
| ct2021v1_a18.png | A18 pyramid | page7.png | 389 | 1656 | 945 | 472 |
| ct2021v1_b1.png | B1 bar chart | page8.png | 28 | 83 | 1167 | 556 |
| ct2021v1_b3.png | B3 3D planes | page8.png | 1015 | 1717 | 319 | 417 |
| ct2021v1_b4.png | B4 enclosure | page9.png | 945 | 14 | 542 | 208 |
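
sharp rejects an extract region that reaches outside the image, so it can help to check the coordinates against the 1705×2344 page size before cropping. A minimal sketch using the values from the table above; `fitsPage` and the `crops` array layout are hypothetical and do not describe how crop_images.js is actually written:

```javascript
// Page size of the 200 DPI renders, taken from the section heading.
const PAGE = { width: 1705, height: 2344 };

// Crop rectangles copied from the table (200 DPI pixels).
const crops = [
  { file: 'ct2021v1_a1.png', page: 'page6.png', left: 611, top: 222, width: 556, height: 292 },
  { file: 'ct2021v1_a7.png', page: 'page6.png', left: 278, top: 1222, width: 750, height: 403 },
  { file: 'ct2021v1_a15.png', page: 'page7.png', left: 556, top: 917, width: 695, height: 278 },
  { file: 'ct2021v1_a17.png', page: 'page7.png', left: 861, top: 1439, width: 639, height: 194 },
  { file: 'ct2021v1_a18.png', page: 'page7.png', left: 389, top: 1656, width: 945, height: 472 },
  { file: 'ct2021v1_b1.png', page: 'page8.png', left: 28, top: 83, width: 1167, height: 556 },
  { file: 'ct2021v1_b3.png', page: 'page8.png', left: 1015, top: 1717, width: 319, height: 417 },
  { file: 'ct2021v1_b4.png', page: 'page9.png', left: 945, top: 14, width: 542, height: 208 },
];

// Hypothetical check: true when the rectangle lies fully inside the page.
function fitsPage(crop, page = PAGE) {
  return (
    crop.left >= 0 &&
    crop.top >= 0 &&
    crop.width > 0 &&
    crop.height > 0 &&
    crop.left + crop.width <= page.width &&
    crop.top + crop.height <= page.height
  );
}

const outOfBounds = crops.filter((crop) => !fitsPage(crop)).map((crop) => crop.file);
console.log(outOfBounds.length === 0 ? 'all crops fit the page' : outOfBounds);
```

Running a check like this before the crop step points at a mistyped coordinate by file name rather than through a sharp error in the middle of the batch.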

PDF source: ЦТ-ЦЭ/ЦТ 2021.pdf

  • Page 6 = Variant 1, Part A (A1-A11)
  • Page 7 = Variant 1, Part A (A12-A18)
  • Page 8 = Variant 1, Part B (B1-B3)
  • Page 9 = Variant 1, Part B (B4-B14)

Images stored: frontend/img/questions/ct2021v1_*.png
Served at: /img/questions/ct2021v1_*.png (via express.static on frontendDir)


DB: updating image field after seeding

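// db: an already-open database handle exposing prepare()/run() (assumed better-sqlite3 style)
// The closing "]" in the LIKE pattern keeps A1 from also matching A10, A11, ...
const upd = db.prepare('UPDATE questions SET image = ? WHERE text LIKE ? AND year = ?');
upd.run('/img/questions/ct2021v1_a1.png', '[ЦТ 2021 · A1]%', 2021);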
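
To repeat the update for every cropped image, the arguments for `upd.run` could be derived from the file names. This is a sketch under two assumptions that the source confirms only for A1: that every question text starts with a `[ЦТ 2021 · <label>]` prefix, and that the label is the upper-cased suffix of the file name. `updateArgs` is a hypothetical helper.

```javascript
// Hypothetical helper: build the [imagePath, likePattern, year] arguments
// for the UPDATE statement from a crop file name such as ct2021v1_a1.png.
function updateArgs(file, year = 2021) {
  // The question label is the part after the last underscore: a1, b4, ...
  const match = /_([ab]\d+)\.png$/.exec(file);
  if (!match) {
    throw new Error(`cannot derive question label from: ${file}`);
  }
  const label = match[1].toUpperCase(); // "a1" -> "A1"
  // Assumption: question text begins with "[ЦТ <year> · <label>]".
  return [`/img/questions/${file}`, `[ЦТ ${year} · ${label}]%`, year];
}

const files = ['ct2021v1_a1.png', 'ct2021v1_a7.png', 'ct2021v1_b4.png'];
for (const file of files) {
  console.log(updateArgs(file));
  // With the prepared statement from above: upd.run(...updateArgs(file));
}
```

If the label letters in the database text are Cyrillic rather than Latin, the pattern would need to be adjusted accordingly.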

Notes

  • PDF contains scanned raster images — no extractable vector graphics
  • Each PDF page = one large bitmap scan
  • pdfimages extracts only full-page bitmaps (not individual diagram crops)
  • sharp must be required from backend/ directory (installed there)
  • Temp page renders NOT committed to git — regenerate with pdftoppm when needed