pdf-reader/README.md

# PDF Range OCR Script

This project provides a command-line script that runs OCR on a selected page range of a PDF and writes the recognized text to a file.

## Requirements

1. Linux with Tesseract OCR and its Russian and English language data installed (Debian/Ubuntu commands shown):

	sudo apt-get update
	sudo apt-get install -y tesseract-ocr tesseract-ocr-rus tesseract-ocr-eng

2. Python dependencies:

	uv sync
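
To confirm that the language data was installed, the output of `tesseract --list-langs` can be compared against the languages you plan to pass to the script. The helper below is a minimal sketch and is not part of this project: `parse_list_langs`, `missing_languages`, and `check_tesseract` are hypothetical names, and the parsing assumes the usual output of one language code per line under a `List of available languages` header.

```python
import shutil
import subprocess


def parse_list_langs(output: str) -> set[str]:
    """Parse the text printed by `tesseract --list-langs` into a set of codes."""
    lines = (line.strip() for line in output.splitlines())
    # Assumption: every non-empty line except the "List of ..." header is a code.
    return {line for line in lines if line and not line.startswith("List of")}


def missing_languages(requested: str, installed: set[str]) -> list[str]:
    """Return the codes from a 'rus+eng' style string that are not installed."""
    return [code for code in requested.split("+") if code not in installed]


def check_tesseract(requested: str = "rus+eng") -> list[str]:
    """Run the real `tesseract` binary and report missing language codes."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract was not found on PATH")
    result = subprocess.run(
        ["tesseract", "--list-langs"], capture_output=True, text=True, check=True
    )
    # Some older versions print the list to stderr, so fall back to it.
    installed = parse_list_langs(result.stdout or result.stderr)
    return missing_languages(requested, installed)
```

Calling `check_tesseract()` should return an empty list when both `rus` and `eng` are available.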

## Usage

Run OCR on an inclusive, 1-based page range and write the result to a text file:

	uv run python main.py --input "input.pdf" --start 5 --end 12 --output "result.txt"

If `--start` and `--end` are both omitted, OCR runs from the first page to the last page.
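
As an illustration of the inclusive, 1-based convention, the sketch below maps `--start` and `--end` to zero-based page indices. It is not taken from `main.py`: `resolve_page_range` is a hypothetical helper, and the handling of a single omitted bound (falling back to the first or last page) is an assumption, since only the case where both are omitted is documented here.

```python
from typing import Optional


def resolve_page_range(
    start: Optional[int], end: Optional[int], page_count: int
) -> range:
    """Turn an inclusive 1-based page range into zero-based page indices."""
    # Assumption: a missing bound defaults to the first or last page.
    first = 1 if start is None else start
    last = page_count if end is None else end
    if not 1 <= first <= last <= page_count:
        raise ValueError(
            f"invalid page range {first}-{last} for a {page_count}-page document"
        )
    # Page 5 is index 4; the inclusive end maps to range's exclusive stop.
    return range(first - 1, last)
```

With this convention, `--start 5 --end 12` covers eight pages, at zero-based indices 4 through 11.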

Optional flags:

- `--lang` (default: `rus+eng`): Tesseract language codes joined with `+`
- `--dpi` (default: `300`)
- `--rotate` (default: `0`): rotation in degrees applied before OCR
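
The flags and defaults listed above could be declared with `argparse` roughly as follows. This is a sketch, not the parser in `main.py`: `build_parser` is a hypothetical name, and the argument types and the choice to make `--input` and `--output` required are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented flags and their defaults."""
    parser = argparse.ArgumentParser(description="OCR a page range of a PDF.")
    # Assumption: input and output paths are mandatory.
    parser.add_argument("--input", required=True, help="path to the source PDF")
    parser.add_argument("--output", required=True, help="path of the text file")
    # Omitting both bounds means "first page to last page".
    parser.add_argument("--start", type=int, default=None, help="first page, 1-based")
    parser.add_argument("--end", type=int, default=None, help="last page, inclusive")
    parser.add_argument("--lang", default="rus+eng", help="Tesseract languages")
    parser.add_argument("--dpi", type=int, default=300)
    parser.add_argument("--rotate", type=int, default=0, help="degrees before OCR")
    return parser
```

Parsing `["--input", "input.pdf", "--output", "result.txt"]` with this parser leaves `lang`, `dpi`, and `rotate` at their documented defaults.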

Example:

	uv run python main.py \
	  --input "Красавчикова. Личные права. 1994.pdf" \
	  --start 1 \
	  --end 3 \
	  --output "ocr_output.txt" \
	  --lang "rus+eng" \
	  --dpi 300 \
	  --rotate 90

The output file is UTF-8 text with page separators:

	=== Page 1 ===
	<recognized text>

	=== Page 2 ===
	<recognized text>
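
For downstream processing, the separator format can be written and read back with a few lines of Python. The sketch below is not part of the script: `format_pages` and `split_pages` are hypothetical helpers, and the exact whitespace around each page (one blank line between pages, text stripped) is an assumption based on the sample above.

```python
import re

# Matches a separator line such as "=== Page 12 ===" and captures the number.
PAGE_HEADER = re.compile(r"^=== Page (\d+) ===$", re.MULTILINE)


def format_pages(pages: dict[int, str]) -> str:
    """Join recognized text per page into the separator format."""
    blocks = [
        f"=== Page {number} ===\n{text.strip()}\n"
        for number, text in sorted(pages.items())
    ]
    # A single newline between blocks leaves one blank line between pages.
    return "\n".join(blocks)


def split_pages(text: str) -> dict[int, str]:
    """Split an output file's contents back into {page number: text}."""
    # With one capture group, split() yields [preamble, number, body, ...].
    parts = PAGE_HEADER.split(text)
    return {int(parts[i]): parts[i + 1].strip() for i in range(1, len(parts), 2)}
```

When reading the file, open it with `encoding="utf-8"`. Note that `split_pages` would misread recognized text that itself contains a line identical to a page separator.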