k1nq 546c26157d Add initial project structure with OCR functionality and dependencies
- Create .gitignore to exclude Python-generated files and virtual environments
- Add .python-version for Python version management
- Implement main OCR script (main.py) to process PDF files and extract text
- Add PDF processing functions in pdf_ocr.py
- Update README.md with project description, requirements, and usage instructions
- Include pyproject.toml for project metadata and dependencies
- Add uv.lock for dependency resolution
2026-03-16 20:18:38 +05:00

PDF Range OCR Script

This project provides a command-line script that runs OCR on a selected range of PDF pages and extracts the recognized text.

Requirements

  1. Linux with Tesseract OCR installed:

    sudo apt-get update
    sudo apt-get install -y tesseract-ocr tesseract-ocr-rus tesseract-ocr-eng

  2. Python dependencies:

    uv sync
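The language packs installed above determine which values `--lang` can accept. As a minimal sketch (not part of the project; the helper names `installed_languages` and `missing_languages` are hypothetical), the following checks a `rus+eng`-style string against the languages reported by `tesseract --list-langs`:

```python
import shutil
import subprocess


def missing_languages(lang: str, installed: set[str]) -> list[str]:
    """Return the codes from a 'rus+eng'-style string that are not installed."""
    return [code for code in lang.split("+") if code not in installed]


def installed_languages() -> set[str]:
    """Ask the local Tesseract binary which language packs it can see."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract was not found on PATH")
    result = subprocess.run(
        ["tesseract", "--list-langs"], capture_output=True, text=True, check=True
    )
    # Some Tesseract versions print the list to stderr instead of stdout.
    lines = (result.stdout or result.stderr).splitlines()
    # The first line is a header; the remaining lines are language codes.
    return {line.strip() for line in lines[1:] if line.strip()}


# Example with a hard-coded set, so it runs without Tesseract installed.
print(missing_languages("rus+eng", {"eng", "osd"}))
```

On a machine with Tesseract installed, `missing_languages("rus+eng", installed_languages())` should return an empty list once both packages from step 1 are present.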

Usage

Run OCR for an inclusive 1-based page range and write to a text file:

uv run python main.py --input "input.pdf" --start 5 --end 12 --output "result.txt"
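Because the range is 1-based and inclusive, `--start 5 --end 12` covers eight pages. Most PDF libraries index pages from 0, so a conversion step is usually needed. The sketch below illustrates one way to do it; the function `page_indices` and its validation rules are assumptions and may differ from what `main.py` actually does:

```python
def page_indices(start: int, end: int, page_count: int) -> range:
    """Convert an inclusive 1-based page range to 0-based page indices."""
    if start < 1:
        raise ValueError("start must be at least 1")
    if end < start:
        raise ValueError("end must not be less than start")
    if end > page_count:
        raise ValueError(f"end ({end}) exceeds the page count ({page_count})")
    # Inclusive 1-based [start, end] maps to half-open 0-based [start - 1, end).
    return range(start - 1, end)


# Pages 5..12 of a hypothetical 20-page document.
print(list(page_indices(5, 12, page_count=20)))
```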

Optional flags:

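  • --lang (default: rus+eng): Tesseract language codes, joined with +
  • --dpi (default: 300): resolution used when rendering pages for OCR
  • --rotate (default: 0): rotation in degrees applied to each page before OCR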
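For orientation, the flags and defaults listed above could be declared with `argparse` roughly as follows. This is a sketch only: the README does not show how `main.py` parses its arguments, so the function name `build_parser`, the help strings, and the choice to mark the first four flags as required are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the documented flags; the defaults mirror the README."""
    parser = argparse.ArgumentParser(description="OCR a page range of a PDF file.")
    parser.add_argument("--input", required=True, help="path to the source PDF")
    parser.add_argument("--start", type=int, required=True, help="first page (1-based, inclusive)")
    parser.add_argument("--end", type=int, required=True, help="last page (1-based, inclusive)")
    parser.add_argument("--output", required=True, help="path of the text file to write")
    parser.add_argument("--lang", default="rus+eng", help="Tesseract language codes joined with '+'")
    parser.add_argument("--dpi", type=int, default=300, help="rendering resolution")
    parser.add_argument("--rotate", type=int, default=0, help="rotation in degrees before OCR")
    return parser


# Parse the basic usage command; the optional flags fall back to their defaults.
args = build_parser().parse_args(
    ["--input", "input.pdf", "--start", "5", "--end", "12", "--output", "result.txt"]
)
print(args.lang, args.dpi, args.rotate)
```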

Example:

uv run python main.py \
  --input "Красавчикова. Личные права. 1994.pdf" \
  --start 1 \
  --end 3 \
  --output "ocr_output.txt" \
  --lang "rus+eng" \
  --dpi 300 \
  --rotate 90

The output file is UTF-8 text with page separators:

=== Page 1 ===

=== Page 2 ===
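The separators make the output easy to split back into pages. A minimal sketch, assuming each separator sits on its own line exactly as shown above (the helper `split_pages` is hypothetical and not part of the project):

```python
import re

# Matches a separator line such as "=== Page 12 ===" and captures the number.
SEPARATOR = re.compile(r"^=== Page (\d+) ===$", re.MULTILINE)


def split_pages(text: str) -> dict[int, str]:
    """Map each page number to the text that follows its separator."""
    # With one capture group, split() yields [preamble, num, body, num, body, ...].
    parts = SEPARATOR.split(text)
    pages = {}
    for number, body in zip(parts[1::2], parts[2::2]):
        pages[int(number)] = body.strip()
    return pages


sample = "=== Page 1 ===\nfirst page text\n\n=== Page 2 ===\nsecond page text\n"
print(split_pages(sample))
```

For a real run, the text can be loaded with `Path("ocr_output.txt").read_text(encoding="utf-8")` from `pathlib` before calling `split_pages`.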