Add initial project structure with OCR functionality and dependencies

- Create .gitignore to exclude Python-generated files and virtual environments
- Add .python-version for Python version management
- Implement main OCR script (main.py) to process PDF files and extract text
- Add PDF processing functions in pdf_ocr.py
- Update README.md with project description, requirements, and usage instructions
- Include pyproject.toml for project metadata and dependencies
- Add uv.lock for dependency resolution
k1nq
2026-03-16 20:18:38 +05:00
commit 546c26157d
7 changed files with 342 additions and 0 deletions
@@ -0,0 +1,45 @@
# PDF Range OCR Script
This project provides a command-line script that runs OCR on a selected page range of a PDF file and extracts the recognized text.
## Requirements
1. Linux with Tesseract OCR installed:
sudo apt-get update
sudo apt-get install -y tesseract-ocr tesseract-ocr-rus tesseract-ocr-eng
2. Python dependencies:
uv sync
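Before the first run, it can help to confirm that the Tesseract binary and the needed language packs are present. The following is a minimal sketch and not part of this repository: the helper names `parse_list_langs`, `missing_languages`, and `installed_languages` are hypothetical, and the parser assumes that `tesseract --list-langs` prints a "List of available languages" header followed by one language code per line.

```python
import shutil
import subprocess


def parse_list_langs(output: str) -> set[str]:
    """Extract language codes from `tesseract --list-langs` output."""
    codes = set()
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the "List of available languages ..." header.
        if not line or line.lower().startswith("list of"):
            continue
        codes.add(line)
    return codes


def missing_languages(lang_spec: str, available: set[str]) -> list[str]:
    """Return the codes from a spec such as "rus+eng" that are not installed."""
    requested = [code for code in lang_spec.split("+") if code]
    return sorted(code for code in requested if code not in available)


def installed_languages() -> set[str]:
    """Ask the local tesseract binary which languages it has (hypothetical helper)."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract was not found on PATH")
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Some versions print the list to stderr, so inspect both streams.
    return parse_list_langs(result.stdout + "\n" + result.stderr)
```

Calling `missing_languages("rus+eng", installed_languages())` would then return an empty list when both language packs from the `apt-get` command above are installed.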
## Usage
Run OCR on an inclusive, 1-based page range and write the result to a text file:
uv run python main.py --input "input.pdf" --start 5 --end 12 --output "result.txt"
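Because the range is inclusive and 1-based, it has to be shifted before it can index a 0-based page list. A minimal sketch of that conversion, assuming a hypothetical helper named `page_indices` and a known total page count (neither is taken from `main.py` or `pdf_ocr.py`):

```python
def page_indices(start: int, end: int, page_count: int) -> range:
    """Map an inclusive 1-based page range to 0-based indices."""
    if start < 1:
        raise ValueError("start must be at least 1")
    if end < start:
        raise ValueError("end must not be less than start")
    if end > page_count:
        raise ValueError(f"end {end} exceeds page count {page_count}")
    # Subtract 1 from start only: range() already excludes its upper bound,
    # so `end` stays as-is to keep the last page included.
    return range(start - 1, end)
```

With `--start 5 --end 12`, this yields indices 4 through 11, which is eight pages.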
Optional flags:
- --lang (default: rus+eng; Tesseract language codes joined with +)
- --dpi (default: 300; resolution used to render each page)
- --rotate (default: 0; degrees to rotate each page before OCR)
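The flags and defaults above could be declared with `argparse` roughly as follows. This is a sketch under assumptions, not the parser from `main.py`: the function name `build_parser` is hypothetical, and treating `--input`, `--start`, `--end`, and `--output` as required is a guess based on the usage line.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the documented flags (illustrative only)."""
    parser = argparse.ArgumentParser(description="OCR a page range of a PDF.")
    # Assumed to be required because every documented invocation passes them.
    parser.add_argument("--input", required=True, help="path to the source PDF")
    parser.add_argument("--start", type=int, required=True, help="first page, 1-based")
    parser.add_argument("--end", type=int, required=True, help="last page, inclusive")
    parser.add_argument("--output", required=True, help="path of the UTF-8 text file")
    # Optional flags with the defaults listed in the README.
    parser.add_argument("--lang", default="rus+eng", help="Tesseract language codes")
    parser.add_argument("--dpi", type=int, default=300, help="rendering resolution")
    parser.add_argument("--rotate", type=int, default=0, help="degrees to rotate before OCR")
    return parser
```

Parsing `["--input", "input.pdf", "--start", "5", "--end", "12", "--output", "result.txt"]` with this parser leaves `lang`, `dpi`, and `rotate` at their documented defaults.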
Example:
uv run python main.py \
  --input "Красавчикова. Личные права. 1994.pdf" \
  --start 1 \
  --end 3 \
  --output "ocr_output.txt" \
  --lang "rus+eng" \
  --dpi 300 \
  --rotate 90
The output file is UTF-8 text with page separators:
=== Page 1 ===
<recognized text>
=== Page 2 ===
<recognized text>
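A file in this layout is easy to produce and to split back into pages. The sketch below is illustrative only: `format_pages`, `write_pages`, and `split_pages` are hypothetical names, the numbers in the separators are assumed to be PDF page numbers, and the exact blank-line handling of the real script is not known.

```python
import re
from pathlib import Path

# Matches a separator line such as "=== Page 12 ===" and captures the number.
SEPARATOR = re.compile(r"^=== Page (\d+) ===$", re.MULTILINE)


def format_pages(pages: dict[int, str]) -> str:
    """Join recognized text into one string, pages in ascending order."""
    blocks = []
    for number in sorted(pages):
        blocks.append(f"=== Page {number} ===\n{pages[number].strip()}")
    return "\n".join(blocks) + "\n"


def write_pages(path: str, pages: dict[int, str]) -> None:
    """Write the formatted pages as UTF-8 so Cyrillic text survives."""
    Path(path).write_text(format_pages(pages), encoding="utf-8")


def split_pages(text: str) -> dict[int, str]:
    """Recover a page-number-to-text mapping from a formatted string."""
    parts = SEPARATOR.split(text)
    # With one capture group, split() returns
    # [preamble, number, body, number, body, ...].
    return {int(parts[i]): parts[i + 1].strip() for i in range(1, len(parts), 2)}
```

Setting the encoding explicitly matters here because the platform default is not guaranteed to be UTF-8, and the default `rus+eng` language setting produces Cyrillic output.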