k1nq/pdf-reader


k1nq 5af97638c9 Update README and main.py to clarify optional page range arguments

2026-03-16 20:28:28 +05:00

.gitignore

Add initial project structure with OCR functionality and dependencies

2026-03-16 20:18:38 +05:00

.python-version

Add initial project structure with OCR functionality and dependencies

2026-03-16 20:18:38 +05:00

main.py

Update README and main.py to clarify optional page range arguments

2026-03-16 20:28:28 +05:00

pdf_ocr.py

Add initial project structure with OCR functionality and dependencies

2026-03-16 20:18:38 +05:00

pyproject.toml

Add initial project structure with OCR functionality and dependencies

2026-03-16 20:18:38 +05:00

README.md

Update README and main.py to clarify optional page range arguments

2026-03-16 20:28:28 +05:00

uv.lock

Add initial project structure with OCR functionality and dependencies

2026-03-16 20:18:38 +05:00

README.md

PDF Range OCR Script

This project provides a command line script that recognizes text from a selected PDF page range.

Requirements

Linux with Tesseract OCR installed:

sudo apt-get update
sudo apt-get install -y tesseract-ocr tesseract-ocr-rus tesseract-ocr-eng
Python dependencies:

uv sync

Usage

Run OCR for an inclusive 1-based page range and write to a text file:

uv run python main.py --input "input.pdf" --start 5 --end 12 --output "result.txt"

If --start and --end are both omitted, OCR runs from the first page to the last page.
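The mapping from the inclusive 1-based range to page indices can be sketched as follows. This is a minimal illustration, not the code in main.py: the helper name `resolve_page_range` is hypothetical, and the fallback for a single omitted bound (first page for a missing --start, last page for a missing --end) is an assumption, since the text above only covers the case where both are omitted.

```python
from typing import Optional


def resolve_page_range(
    total_pages: int,
    start: Optional[int] = None,
    end: Optional[int] = None,
) -> range:
    """Return zero-based page indices for an inclusive 1-based page range."""
    # Assumption: a missing bound falls back to the first or the last page.
    first = 1 if start is None else start
    last = total_pages if end is None else end
    if not 1 <= first <= last <= total_pages:
        raise ValueError(
            f"invalid page range {first}-{last} for a {total_pages}-page document"
        )
    # Inclusive 1-based [first, last] maps to zero-based [first - 1, last).
    return range(first - 1, last)
```

With `--start 5 --end 12` on a 20-page document this yields eight indices, 4 through 11, which is the form most PDF libraries expect.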

Optional flags:

--lang (default: rus+eng)
--dpi (default: 300)
--rotate (default: 0; rotation in degrees applied before OCR)
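For reference, the flags and defaults listed above could be declared with `argparse` roughly like this. It is a sketch only: the function name `build_parser`, the help strings, the integer types, and the choice to make --input and --output required are assumptions, not taken from main.py.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the flags documented in this README."""
    parser = argparse.ArgumentParser(description="OCR a page range of a PDF file.")
    # Assumption: input and output paths are mandatory.
    parser.add_argument("--input", required=True, help="path to the source PDF")
    parser.add_argument("--output", required=True, help="path of the text file to write")
    # Both bounds are optional; None means "not given".
    parser.add_argument("--start", type=int, default=None, help="first page, 1-based")
    parser.add_argument("--end", type=int, default=None, help="last page, inclusive")
    parser.add_argument("--lang", default="rus+eng", help="Tesseract language codes")
    parser.add_argument("--dpi", type=int, default=300, help="rendering resolution")
    parser.add_argument("--rotate", type=int, default=0, help="rotation in degrees before OCR")
    return parser


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(["--input", "input.pdf", "--output", "result.txt"])
print(args.lang, args.dpi, args.rotate)  # rus+eng 300 0
```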

Example:

uv run python main.py \
  --input "Красавчикова. Личные права. 1994.pdf" \
  --start 1 \
  --end 3 \
  --output "ocr_output.txt" \
  --lang "rus+eng" \
  --dpi 300 \
  --rotate 90

The output file is UTF-8 text with page separators:

=== Page 1 ===

=== Page 2 ===
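
The separator layout shown above could be produced along these lines. This is a hedged sketch: `join_pages` is a hypothetical helper, the blank line between pages is a guess at the exact spacing, and numbering pages by their position in the original PDF (rather than restarting at 1 for each run) is an assumption.

```python
from typing import Iterable


def join_pages(texts: Iterable[str], first_page: int = 1) -> str:
    """Join per-page OCR text under '=== Page N ===' separators."""
    chunks = []
    # Assumption: N is the page number in the source PDF, starting at first_page.
    for number, text in enumerate(texts, start=first_page):
        chunks.append(f"=== Page {number} ===\n{text.strip()}\n")
    # A blank line between pages keeps the separators easy to spot.
    return "\n".join(chunks)
```

The result can then be written with `pathlib.Path("result.txt").write_text(content, encoding="utf-8")` so that Cyrillic text is stored as UTF-8 regardless of the system locale.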
