# PDF Range OCR Script

This project provides a command-line script that runs OCR on a selected page range of a PDF and writes the recognized text to a file.

## Requirements

1. Linux with Tesseract OCR installed:

   ```bash
   sudo apt-get update
   sudo apt-get install -y tesseract-ocr tesseract-ocr-rus tesseract-ocr-eng
   ```

2. Python dependencies:

   ```bash
   uv sync
   ```

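To confirm that the language packs are visible to Tesseract before running the script, a small check like the one below may help. It is only a sketch and not part of this project: the helper names (`parse_list_langs`, `missing_languages`, `check_tesseract`) are hypothetical, and it assumes that `tesseract --list-langs` prints a header line ending in a colon followed by one language code per line.

```python
from __future__ import annotations

import shutil
import subprocess


def parse_list_langs(output: str) -> set[str]:
    """Extract language codes from `tesseract --list-langs` output."""
    langs = set()
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the header line, which ends with a colon.
        if not line or line.endswith(":"):
            continue
        langs.add(line)
    return langs


def missing_languages(required: str, installed: set[str]) -> list[str]:
    """Return the codes from a 'rus+eng'-style string that are not installed."""
    return [code for code in required.split("+") if code not in installed]


def check_tesseract(required: str = "rus+eng") -> list[str]:
    """Run Tesseract and report which required languages are missing."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract not found on PATH")
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Some Tesseract versions print the list to stderr, so both streams are scanned.
    return missing_languages(required, parse_list_langs(result.stdout + result.stderr))
```

Calling `check_tesseract("rus+eng")` returns an empty list when both packs are installed, and otherwise lists the missing codes.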
## Usage

Run OCR on an inclusive, 1-based page range and write the result to a text file:

```bash
uv run python main.py --input "input.pdf" --start 5 --end 12 --output "result.txt"
```

If `--start` and `--end` are both omitted, OCR runs from the first page to the last page.

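The range semantics can be illustrated with a minimal sketch. This is not the code in `main.py`: the function name `resolve_page_range` is hypothetical, and the handling of a single omitted bound (falling back to the first or last page) is an assumption, since only the case where both are omitted is documented.

```python
from __future__ import annotations


def resolve_page_range(start: int | None, end: int | None, page_count: int) -> range:
    """Return 0-based page indices for an inclusive, 1-based [start, end] range."""
    # Assumption: a single omitted bound falls back to the first or last page.
    first = 1 if start is None else start
    last = page_count if end is None else end
    if not 1 <= first <= last <= page_count:
        raise ValueError(
            f"invalid page range {first}-{last} for a {page_count}-page PDF"
        )
    # Inclusive 1-based [first, last] maps to the half-open 0-based [first - 1, last).
    return range(first - 1, last)
```

With `--start 5 --end 12`, this yields indices 4 through 11, that is, eight pages.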
Optional flags:

- `--lang` (default: `rus+eng`)
- `--dpi` (default: `300`)
- `--rotate` (default: `0`; rotation in degrees, applied before OCR)

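For readers who want to see how such an interface might be declared, here is a hedged `argparse` sketch using the flag names and defaults listed above. It is an illustration, not the contents of `main.py`: the `build_parser` name, the help strings, and the choice to make `--input` and `--output` required are all assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the documented flags with their documented defaults."""
    parser = argparse.ArgumentParser(description="OCR a page range of a PDF.")
    # Assumption: input and output paths are mandatory.
    parser.add_argument("--input", required=True, help="path to the source PDF")
    parser.add_argument("--output", required=True, help="path to the text file to write")
    # None means the bound was omitted on the command line.
    parser.add_argument("--start", type=int, default=None, help="first page, 1-based")
    parser.add_argument("--end", type=int, default=None, help="last page, inclusive")
    parser.add_argument("--lang", default="rus+eng", help="Tesseract language codes joined with '+'")
    parser.add_argument("--dpi", type=int, default=300, help="resolution setting")
    parser.add_argument("--rotate", type=int, default=0, help="degrees to rotate before OCR")
    return parser
```

Parsing `["--input", "input.pdf", "--output", "result.txt"]` with this parser leaves `lang`, `dpi`, and `rotate` at `rus+eng`, `300`, and `0`.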
Example:

```bash
uv run python main.py \
  --input "Красавчикова. Личные права. 1994.pdf" \
  --start 1 \
  --end 3 \
  --output "ocr_output.txt" \
  --lang "rus+eng" \
  --dpi 300 \
  --rotate 90
```

The output file is UTF-8 text with page separators:

```text
=== Page 1 ===
<recognized text>

=== Page 2 ===
<recognized text>
```

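If the output needs further processing, the separator format is simple to split back into pages. The sketch below is one possible approach and not part of this project: `split_pages` and `PAGE_HEADER` are hypothetical names, and it assumes that every separator sits on its own line exactly as shown above.

```python
from __future__ import annotations

import re

# Matches a separator line such as "=== Page 12 ===".
PAGE_HEADER = re.compile(r"^=== Page (\d+) ===$", re.MULTILINE)


def split_pages(text: str) -> dict[int, str]:
    """Map each page number to its recognized text."""
    pages = {}
    matches = list(PAGE_HEADER.finditer(text))
    for i, match in enumerate(matches):
        # A page's text runs from its header to the next header, or to the end.
        body_end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        pages[int(match.group(1))] = text[match.end():body_end].strip()
    return pages
```

A typical call would read the file with `open("ocr_output.txt", encoding="utf-8")` and pass its contents to `split_pages`. Recognized text that happens to contain a line in the separator format would be treated as a new page, so this is only suitable as a quick post-processing aid.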