Add initial project structure with OCR functionality and dependencies

- Create .gitignore to exclude Python-generated files and virtual environments
- Add .python-version for Python version management
- Implement main OCR script (main.py) to process PDF files and extract text
- Add PDF processing functions in pdf_ocr.py
- Update README.md with project description, requirements, and usage instructions
- Include pyproject.toml for project metadata and dependencies
- Add uv.lock for dependency resolution
2026-03-16 20:18:38 +05:00
commit 546c26157d
7 changed files with 342 additions and 0 deletions
@@ -0,0 +1,45 @@
+# PDF Range OCR Script
+
+This project provides a command-line script that runs OCR on a selected range of PDF pages and extracts the recognized text.
+
+## Requirements
+
+1. Linux with Tesseract OCR installed:
+
+	sudo apt-get update
+	sudo apt-get install -y tesseract-ocr tesseract-ocr-rus tesseract-ocr-eng
+
+2. Python dependencies:
+
+	uv sync
+
+## Usage
+
+Run OCR on an inclusive, 1-based page range and write the result to a text file:
+
+	uv run python main.py --input "input.pdf" --start 5 --end 12 --output "result.txt"
+
+Optional flags:
+
+- --lang (default: rus+eng)
+- --dpi (default: 300)
+- --rotate (default: 0; rotation in degrees applied to each page before OCR)
+
+Example:
+
+	uv run python main.py \
+	  --input "Красавчикова. Личные права. 1994.pdf" \
+	  --start 1 \
+	  --end 3 \
+	  --output "ocr_output.txt" \
+	  --lang "rus+eng" \
+	  --dpi 300 \
+	  --rotate 90
+
+The output file is UTF-8 text with page separators:
+
+	=== Page 1 ===
+	<recognized text>
+
+	=== Page 2 ===
+	<recognized text>
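
The README describes the CLI contract but this hunk does not show main.py or pdf_ocr.py, so the following is only a minimal sketch of how that contract could be satisfied, not the code in this commit. The function names `build_parser`, `page_indices`, and `format_pages` are hypothetical. Which flags are required, the integer type of `--rotate`, and the use of the original PDF page number in each separator are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the flags and defaults listed in the README.

    Assumption: --input, --start, --end and --output are required.
    """
    parser = argparse.ArgumentParser(description="OCR a range of PDF pages")
    parser.add_argument("--input", required=True, help="path to the source PDF")
    parser.add_argument("--start", type=int, required=True, help="first page, 1-based")
    parser.add_argument("--end", type=int, required=True, help="last page, inclusive")
    parser.add_argument("--output", required=True, help="path of the UTF-8 text file")
    parser.add_argument("--lang", default="rus+eng", help="Tesseract language codes")
    parser.add_argument("--dpi", type=int, default=300, help="rendering resolution")
    parser.add_argument("--rotate", type=int, default=0, help="degrees, applied before OCR")
    return parser


def page_indices(start: int, end: int) -> list[int]:
    """Convert an inclusive 1-based page range to zero-based page indices."""
    if start < 1:
        raise ValueError("start must be at least 1")
    if end < start:
        raise ValueError("end must not be smaller than start")
    return list(range(start - 1, end))


def format_pages(texts: dict[int, str]) -> str:
    """Join per-page text with '=== Page N ===' separators.

    Assumption: N is the 1-based page number in the source PDF.
    """
    blocks = [f"=== Page {number} ===\n{text}" for number, text in sorted(texts.items())]
    return "\n\n".join(blocks)


if __name__ == "__main__":
    args = build_parser().parse_args()
    # Placeholder text stands in for the OCR result of each page.
    placeholder = {index + 1: "<recognized text>" for index in page_indices(args.start, args.end)}
    with open(args.output, "w", encoding="utf-8") as handle:
        handle.write(format_pages(placeholder))
```

Keeping the range conversion and the output formatting in small pure functions makes the 1-based, inclusive convention easy to check without a PDF or a Tesseract installation.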