PDF Range OCR Script
This project provides a command-line script that runs OCR on a selected page range of a PDF file.
Requirements
- Linux with Tesseract OCR installed:
    sudo apt-get update
    sudo apt-get install -y tesseract-ocr tesseract-ocr-rus tesseract-ocr-eng
- Python dependencies:
    uv sync
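Before running the script, it can help to confirm that the language packs installed above are visible to Tesseract. The following is a minimal sketch, not part of the project: the helper names `parse_list_langs`, `missing_languages`, and `installed_languages` are hypothetical, and it relies only on the `tesseract --list-langs` command.

```python
import shutil
import subprocess


def parse_list_langs(output: str) -> set[str]:
    """Extract language codes from `tesseract --list-langs` output."""
    langs = set()
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the header line, which contains spaces.
        if line and " " not in line:
            langs.add(line)
    return langs


def missing_languages(required: str, available: set[str]) -> list[str]:
    """Return codes from a 'rus+eng'-style string that are not installed."""
    return [code for code in required.split("+") if code not in available]


def installed_languages() -> set[str]:
    """Ask the local tesseract binary which language packs it can see."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract was not found on PATH")
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Some Tesseract versions print the list to stderr, so read both streams.
    return parse_list_langs(result.stdout + "\n" + result.stderr)
```

Calling `missing_languages("rus+eng", installed_languages())` should return an empty list when both packs are installed.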
Usage
Run OCR for an inclusive 1-based page range and write to a text file:
uv run python main.py --input "input.pdf" --start 5 --end 12 --output "result.txt"
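Because the range is 1-based and inclusive, `--start 5 --end 12` covers eight pages. As an illustration only, the conversion to zero-based page indices might look like the sketch below. The function `zero_based_pages` and its validation rules are assumptions, not code taken from `main.py` or `pdf_ocr.py`.

```python
def zero_based_pages(start: int, end: int, page_count: int) -> range:
    """Convert an inclusive 1-based page range into zero-based indices."""
    if start < 1:
        raise ValueError("start must be 1 or greater")
    if end < start:
        raise ValueError("end must not be smaller than start")
    if end > page_count:
        raise ValueError(
            f"end is {end} but the document has only {page_count} pages"
        )
    # range() excludes its upper bound, so the inclusive 1-based `end`
    # is already the correct exclusive zero-based bound.
    return range(start - 1, end)
```

For a 20-page document, `zero_based_pages(5, 12, 20)` yields the indices 4 through 11.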
Optional flags:
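- --lang: Tesseract language codes joined with "+" (default: rus+eng)
- --dpi: rendering resolution in dots per inch (default: 300)
- --rotate: degrees to rotate each page before OCR (default: 0)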
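The flags and defaults listed here could be declared with `argparse` roughly as follows. This is a sketch under assumptions: the real `main.py` may parse its arguments differently, `build_parser` is a hypothetical name, and treating `--input`, `--start`, `--end`, and `--output` as required, and `--rotate` as an integer, is a guess based on the usage example.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the documented flags and defaults."""
    parser = argparse.ArgumentParser(
        description="OCR a page range of a PDF file."
    )
    # Assumed to be required, since every documented example passes them.
    parser.add_argument("--input", required=True, help="path to the source PDF")
    parser.add_argument("--start", type=int, required=True, help="first page, 1-based")
    parser.add_argument("--end", type=int, required=True, help="last page, inclusive")
    parser.add_argument("--output", required=True, help="path of the text file to write")
    # Optional flags with the defaults stated in this README.
    parser.add_argument("--lang", default="rus+eng", help="Tesseract language codes")
    parser.add_argument("--dpi", type=int, default=300, help="rendering resolution")
    parser.add_argument("--rotate", type=int, default=0, help="degrees to rotate before OCR")
    return parser
```

Parsing only the four main arguments leaves `lang`, `dpi`, and `rotate` at `rus+eng`, `300`, and `0`.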
Example:
uv run python main.py \
  --input "Красавчикова. Личные права. 1994.pdf" \
  --start 1 \
  --end 3 \
  --output "ocr_output.txt" \
  --lang "rus+eng" \
  --dpi 300 \
  --rotate 90
The output file is UTF-8 text with page separators:
=== Page 1 ===
=== Page 2 ===
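One way to produce a file in this layout is sketched below. The helpers `join_pages` and `write_result` are hypothetical, and two details are assumptions rather than documented behavior: that the separators carry the original PDF page numbers, and that a blank line separates one page from the next.

```python
from pathlib import Path


def join_pages(page_texts: list[str], first_page: int) -> str:
    """Join per-page OCR text, putting a '=== Page N ===' line before each page."""
    sections = []
    for offset, text in enumerate(page_texts):
        # Assumption: N is the page number in the PDF, counted from --start.
        sections.append(f"=== Page {first_page + offset} ===\n{text.strip()}\n")
    return "\n".join(sections)


def write_result(path: str, page_texts: list[str], first_page: int) -> None:
    """Write the joined text as UTF-8 so Cyrillic output is preserved."""
    Path(path).write_text(join_pages(page_texts, first_page), encoding="utf-8")
```

Passing `encoding="utf-8"` explicitly keeps the output independent of the system locale, which matters for mixed Russian and English text.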