# PDF Range OCR Script
This project provides a command-line script that runs OCR on a selected page range of a PDF and writes the recognized text to a file.
## Requirements
1. Linux with Tesseract OCR installed:

       sudo apt-get update
       sudo apt-get install -y tesseract-ocr tesseract-ocr-rus tesseract-ocr-eng

2. Python dependencies, installed with uv:

       uv sync

## Usage
Run OCR for an inclusive, 1-based page range and write the result to a text file:

    uv run python main.py --input "input.pdf" --start 5 --end 12 --output "result.txt"

If `--start` and `--end` are both omitted, OCR runs from the first page to the last page.
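
The range rules above can be sketched as a small helper. This is only an illustration, not the code in `main.py`: the function name `resolve_page_range` is hypothetical, and the handling of a single omitted bound (a missing start falls back to page 1, a missing end to the last page) is an assumption, since the README only describes the case where both are omitted.

```python
from typing import Optional


def resolve_page_range(start: Optional[int], end: Optional[int], page_count: int) -> range:
    """Turn an inclusive, 1-based page range into zero-based page indices.

    Assumption: an omitted bound falls back to the first or last page.
    """
    first = 1 if start is None else start
    last = page_count if end is None else end
    if not 1 <= first <= last <= page_count:
        raise ValueError(
            f"invalid page range {first}-{last} for a document with {page_count} pages"
        )
    # Page N (1-based) lives at index N - 1; range() excludes its stop value,
    # so stopping at `last` keeps the end page inclusive.
    return range(first - 1, last)
```

With `--start 5 --end 12`, this yields indices 4 through 11, which are pages 5 through 12.
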
Optional flags:
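- `--lang`: Tesseract language codes, joined with `+` (default: `rus+eng`)
- `--dpi`: rendering resolution (default: `300`)
- `--rotate`: rotation in degrees applied to each page before OCR (default: `0`)
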
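
For reference, the documented flags and defaults could be declared with `argparse` roughly as follows. This is a sketch under assumptions, not the parser in `main.py`: the function name `build_parser`, the choice to make `--input` and `--output` required, and the integer types for `--dpi` and `--rotate` are all guesses.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the flags listed in this README (hypothetical)."""
    parser = argparse.ArgumentParser(description="OCR a page range of a PDF.")
    # Assumption: both paths are mandatory.
    parser.add_argument("--input", required=True, help="path to the source PDF")
    parser.add_argument("--output", required=True, help="path of the UTF-8 text file to write")
    # Omitted bounds stay None so the caller can fall back to the first/last page.
    parser.add_argument("--start", type=int, default=None, help="first page (1-based, inclusive)")
    parser.add_argument("--end", type=int, default=None, help="last page (1-based, inclusive)")
    parser.add_argument("--lang", default="rus+eng", help="Tesseract language codes joined with '+'")
    parser.add_argument("--dpi", type=int, default=300, help="rendering resolution")
    parser.add_argument("--rotate", type=int, default=0, help="degrees to rotate each page before OCR")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```
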
Example:

    uv run python main.py \
      --input "Красавчикова. Личные права. 1994.pdf" \
      --start 1 \
      --end 3 \
      --output "ocr_output.txt" \
      --lang "rus+eng" \
      --dpi 300 \
      --rotate 90

The output file is UTF-8 text, with each page preceded by a separator line:

    === Page 1 ===
    <recognized text>
    === Page 2 ===
    <recognized text>
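
If the output needs further processing, the separator format makes it easy to split the file back into pages. The helper below is a minimal sketch and not part of this project; the name `split_pages` is hypothetical, and it assumes each separator sits alone on its own line exactly as shown above.

```python
import re
from typing import Dict, List, Optional

# Matches a separator line such as "=== Page 12 ===".
PAGE_HEADER = re.compile(r"^=== Page (\d+) ===$")


def split_pages(text: str) -> Dict[int, str]:
    """Map each page number to the recognized text that follows its separator."""
    pages: Dict[int, str] = {}
    current: Optional[int] = None
    lines: List[str] = []
    for line in text.splitlines():
        match = PAGE_HEADER.match(line)
        if match:
            # A new separator closes the previous page, if any.
            if current is not None:
                pages[current] = "\n".join(lines).strip()
            current = int(match.group(1))
            lines = []
        elif current is not None:
            lines.append(line)
    # Flush the last page once the input is exhausted.
    if current is not None:
        pages[current] = "\n".join(lines).strip()
    return pages
```

Usage would look like `split_pages(open("ocr_output.txt", encoding="utf-8").read())`. One caveat: a line of OCR text that happens to match the separator pattern would be treated as a page boundary.
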