Add initial project structure with OCR functionality and dependencies

- Create .gitignore to exclude Python-generated files and virtual environments
- Add .python-version for Python version management
- Implement main OCR script (main.py) to process PDF files and extract text
- Add PDF processing functions in pdf_ocr.py
- Update README.md with project description, requirements, and usage instructions
- Include pyproject.toml for project metadata and dependencies
- Add uv.lock for dependency resolution
@@ -0,0 +1,45 @@
# PDF Range OCR Script

This project provides a command-line script that recognizes text in a selected page range of a PDF file.

## Requirements

1. Linux with Tesseract OCR installed:

       sudo apt-get update
       sudo apt-get install -y tesseract-ocr tesseract-ocr-rus tesseract-ocr-eng

2. Python dependencies:

       uv sync

## Usage

Run OCR on an inclusive, 1-based page range and write the result to a text file:

    uv run python main.py --input "input.pdf" --start 5 --end 12 --output "result.txt"
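The inclusive, 1-based range used by `--start` and `--end` maps onto zero-based page indices roughly as sketched below. This is an illustration only: `page_indices` and its error messages are hypothetical and may not match how `pdf_ocr.py` actually validates the range.

```python
def page_indices(start: int, end: int, page_count: int) -> range:
    """Convert an inclusive 1-based page range to zero-based indices.

    Hypothetical helper for illustration; not taken from pdf_ocr.py.
    """
    if start < 1:
        raise ValueError("start must be at least 1")
    if end < start:
        raise ValueError("end must not be smaller than start")
    if end > page_count:
        raise ValueError(f"end exceeds the document's {page_count} pages")
    # 1-based inclusive [start, end] -> 0-based half-open [start - 1, end)
    return range(start - 1, end)


# --start 5 --end 12 covers 8 pages: zero-based indices 4 through 11
indices = page_indices(5, 12, page_count=20)
print(list(indices))  # [4, 5, 6, 7, 8, 9, 10, 11]
```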

Optional flags:

- --lang: Tesseract language codes (default: rus+eng)
- --dpi: resolution in DPI (default: 300)
- --rotate: rotation in degrees applied before OCR (default: 0)

Example:

    uv run python main.py \
      --input "Красавчикова. Личные права. 1994.pdf" \
      --start 1 \
      --end 3 \
      --output "ocr_output.txt" \
      --lang "rus+eng" \
      --dpi 300 \
      --rotate 90
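The flags and defaults listed above could be declared with `argparse` along the following lines. This is a sketch under assumptions: the README does not say which parser `main.py` uses, whether `--input`, `--start`, `--end`, and `--output` are mandatory, or that `--dpi` and `--rotate` are integers. The help strings and the `build_parser` name are likewise invented for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags documented in this README."""
    parser = argparse.ArgumentParser(description="OCR a page range of a PDF")
    # Assumed to be required; the README only shows them in every example.
    parser.add_argument("--input", required=True, help="path to the source PDF")
    parser.add_argument("--start", type=int, required=True, help="first page (1-based)")
    parser.add_argument("--end", type=int, required=True, help="last page (inclusive)")
    parser.add_argument("--output", required=True, help="text file to write")
    # Optional flags with the defaults stated in the README.
    parser.add_argument("--lang", default="rus+eng", help="Tesseract language codes")
    parser.add_argument("--dpi", type=int, default=300, help="resolution in DPI")
    parser.add_argument("--rotate", type=int, default=0, help="rotation in degrees before OCR")
    return parser


# Parse the short command from the Usage section; optional flags fall back to defaults.
args = build_parser().parse_args(
    ["--input", "input.pdf", "--start", "5", "--end", "12", "--output", "result.txt"]
)
print(args.lang, args.dpi, args.rotate)  # rus+eng 300 0
```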

The output file is UTF-8 text with page separators:

    === Page 1 ===
    <recognized text>

    === Page 2 ===
    <recognized text>
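A writer producing this layout might look like the sketch below. `write_pages` is a hypothetical helper, not the function in `pdf_ocr.py`, and the exact handling of trailing newlines is an assumption. When writing to disk, opening the file with `open(path, "w", encoding="utf-8")` would give the UTF-8 output described above.

```python
import io
from typing import Iterable, TextIO, Tuple


def write_pages(pages: Iterable[Tuple[int, str]], out: TextIO) -> None:
    """Write (page_number, text) pairs with a separator line before each page.

    Hypothetical helper for illustration only.
    """
    for index, (number, text) in enumerate(pages):
        if index > 0:
            out.write("\n")  # blank line between consecutive pages
        out.write(f"=== Page {number} ===\n")
        out.write(text.rstrip("\n") + "\n")  # exactly one trailing newline per page


# In-memory demonstration; a real run would pass an open UTF-8 text file instead.
buffer = io.StringIO()
write_pages([(1, "first page text"), (2, "second page text")], buffer)
print(buffer.getvalue(), end="")
```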