PDF Range OCR Script

This project provides a command-line script that runs OCR on a selected page range of a PDF.

Requirements

  1. Linux with Tesseract OCR installed:

    sudo apt-get update
    sudo apt-get install -y tesseract-ocr tesseract-ocr-rus tesseract-ocr-eng

  2. Python dependencies:

    uv sync
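
To confirm that the Tesseract language packs are in place before a long run, a check along the following lines may help. This is a minimal sketch and not part of the script: `installed_languages` and `missing_languages` are hypothetical helper names, and the parsing assumes that `tesseract --list-langs` prints one header line followed by one language code per line.

```python
import shutil
import subprocess
from typing import List, Set


def installed_languages() -> Set[str]:
    """Return the language codes reported by `tesseract --list-langs`."""
    if shutil.which("tesseract") is None:
        return set()  # the tesseract binary is not on PATH
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True,
        text=True,
        check=False,
    )
    # Assumption: the first line is a header; some releases print to stderr.
    lines = (result.stdout or result.stderr).splitlines()
    return {line.strip() for line in lines[1:] if line.strip()}


def missing_languages(lang_spec: str, installed: Set[str]) -> List[str]:
    """List the codes from a spec such as 'rus+eng' that are not installed."""
    return [code for code in lang_spec.split("+") if code not in installed]


if __name__ == "__main__":
    print(missing_languages("rus+eng", installed_languages()))
```

An empty list means that every language in the `--lang` value is available.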

Usage

Run OCR for an inclusive 1-based page range and write to a text file:

uv run python main.py --input "input.pdf" --start 5 --end 12 --output "result.txt"

If --start and --end are both omitted, OCR runs from the first page to the last page.
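
The range rules above can be sketched as a small helper. This is an illustration, not the code in main.py: `resolve_page_range` is a hypothetical name, and the handling of a single omitted bound (defaulting to the first or last page) and of out-of-range values is an assumption, since this README only describes the case where both are omitted.

```python
from typing import Optional, Tuple


def resolve_page_range(
    start: Optional[int], end: Optional[int], total_pages: int
) -> Tuple[int, int]:
    """Return an inclusive, 1-based (first, last) page range."""
    # Assumption: an omitted bound falls back to the document's first/last page.
    first = 1 if start is None else start
    last = total_pages if end is None else end
    if not 1 <= first <= last <= total_pages:
        raise ValueError(
            f"invalid page range {first}-{last} for a {total_pages}-page document"
        )
    return first, last


print(resolve_page_range(5, 12, 40))        # (5, 12)
print(resolve_page_range(None, None, 40))   # (1, 40)
```

With `--start 5 --end 12`, eight pages are processed (5, 6, ..., 12), because both bounds are included.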

Optional flags:

  • --lang (default: rus+eng)
  • --dpi (default: 300)
  • --rotate (default: 0; degrees to rotate each page before OCR)
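
For orientation, the flags and defaults listed above could be declared with `argparse` roughly as follows. This is a sketch under assumptions, not the parser in main.py: `build_parser` is a hypothetical name, and treating `--input` and `--output` as required and `--dpi` and `--rotate` as integers is a guess.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the flags documented in this README."""
    parser = argparse.ArgumentParser(description="OCR a page range of a PDF")
    # Assumption: both paths are mandatory.
    parser.add_argument("--input", required=True, help="path to the source PDF")
    parser.add_argument("--output", required=True, help="path to the text file to write")
    # Omitted bounds stay None so the caller can fall back to first/last page.
    parser.add_argument("--start", type=int, default=None, help="first page (1-based, inclusive)")
    parser.add_argument("--end", type=int, default=None, help="last page (1-based, inclusive)")
    parser.add_argument("--lang", default="rus+eng", help="Tesseract language codes joined by '+'")
    parser.add_argument("--dpi", type=int, default=300)
    parser.add_argument("--rotate", type=int, default=0, help="degrees to rotate before OCR")
    return parser


args = build_parser().parse_args(["--input", "input.pdf", "--output", "result.txt"])
print(args.lang, args.dpi, args.rotate)  # rus+eng 300 0
```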

Example:

uv run python main.py \
  --input "Красавчикова. Личные права. 1994.pdf" \
  --start 1 \
  --end 3 \
  --output "ocr_output.txt" \
  --lang "rus+eng" \
  --dpi 300 \
  --rotate 90

The output file is UTF-8 text with page separators:

=== Page 1 ===

=== Page 2 ===
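
A layout like this can be produced with a few lines of Python. The sketch below is illustrative only: `format_pages` and `write_output` are hypothetical names, and the exact blank-line spacing around each separator is an assumption rather than the script's guaranteed output.

```python
from pathlib import Path
from typing import Dict


def format_pages(pages: Dict[int, str]) -> str:
    """Join per-page text under '=== Page N ===' separators, in page order."""
    blocks = []
    for number in sorted(pages):
        # Each block is: separator line, recognized text, trailing newline.
        blocks.append(f"=== Page {number} ===\n{pages[number].strip()}\n")
    # Joining with a newline leaves one blank line between pages.
    return "\n".join(blocks)


def write_output(path: str, pages: Dict[int, str]) -> None:
    """Write the combined text as UTF-8 so Cyrillic characters survive."""
    Path(path).write_text(format_pages(pages), encoding="utf-8")


print(format_pages({1: "Первая страница", 2: "Second page"}))
```

Passing `encoding="utf-8"` explicitly avoids depending on the platform's default encoding when the recognized text contains Russian.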
