Initial commit: PDF OCR Hotfolder v0.1.0
Komplettes Rewrite des alten Bash-Tools `pdf-tool` in Python. - ocrmypdf als Library, watchdog für Hotfolder, ThreadPool für Parallelität - Upload-Targets: folder, Nextcloud (WebDAV), SFTP - E-Mail-Notify, optional veraPDF - Interaktiver Installer mit Service-User-Support (lokal + AD via SSSD) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This commit is contained in:
+15
@@ -0,0 +1,15 @@
|
|||||||
|
__pycache__/
|
||||||
|
*.pyc
|
||||||
|
*.pyo
|
||||||
|
venv/
|
||||||
|
env/
|
||||||
|
.venv/
|
||||||
|
*.egg-info/
|
||||||
|
build/
|
||||||
|
dist/
|
||||||
|
config.toml
|
||||||
|
.repo_path
|
||||||
|
*.log
|
||||||
|
.DS_Store
|
||||||
|
.idea/
|
||||||
|
.vscode/
|
||||||
@@ -0,0 +1,117 @@
|
|||||||
|
# AI Agent Briefing — PDF OCR Hotfolder
|
||||||
|
|
||||||
|
**Zuletzt aktualisiert:** 2026-04-08
|
||||||
|
**Version:** 0.1.0
|
||||||
|
**Status:** Initiale Implementation, nicht produktiv getestet
|
||||||
|
|
||||||
|
## 🎯 Projektziel
|
||||||
|
|
||||||
|
Eingehende gescannte PDFs werden automatisch durch OCR (ocrmypdf + Tesseract) in durchsuchbare PDFs (optional PDF/A) umgewandelt und nach Wahl in einen Ordner / Nextcloud / per SFTP weitergegeben. Ersetzt das alte Bash-Tool `pdf-tool` (im Workspace).
|
||||||
|
|
||||||
|
## 📁 Projekt-Struktur
|
||||||
|
|
||||||
|
```
|
||||||
|
pdf-ocr-hotfolder/
|
||||||
|
├── pdf_ocr_hotfolder/
|
||||||
|
│ ├── __init__.py # Versionsstring
|
||||||
|
│ ├── __main__.py # CLI-Entrypoint (argparse, --once, --config)
|
||||||
|
│ ├── config.py # TOML-Loader, Dataclasses
|
||||||
|
│ ├── service.py # Hauptservice (watchdog + ThreadPool)
|
||||||
|
│ ├── processor.py # ocrmypdf + veraPDF
|
||||||
|
│ └── uploaders.py # folder, nextcloud (WebDAV), sftp, email
|
||||||
|
├── systemd/
|
||||||
|
│ └── pdf-ocr-hotfolder.service # Template (Platzhalter __SERVICE_USER__/__SERVICE_GROUP__)
|
||||||
|
├── config.example.toml
|
||||||
|
├── install.sh # Interaktiver Installer
|
||||||
|
├── update.sh # Update aus Repo
|
||||||
|
├── requirements.txt
|
||||||
|
├── VERSION
|
||||||
|
├── CHANGELOG.md
|
||||||
|
└── README.md
|
||||||
|
```
|
||||||
|
|
||||||
|
## 🔧 Stack
|
||||||
|
|
||||||
|
| Komponente | Technologie |
|
||||||
|
|------------|-------------|
|
||||||
|
| Sprache | Python 3.11+ (für `tomllib` aus stdlib) |
|
||||||
|
| OCR | `ocrmypdf` (als Library, nicht via Subprozess) |
|
||||||
|
| Engine | Tesseract |
|
||||||
|
| Watcher | `watchdog` |
|
||||||
|
| HTTP | `requests` (Nextcloud WebDAV) |
|
||||||
|
| SFTP | `paramiko` |
|
||||||
|
| Email | `smtplib` (stdlib) |
|
||||||
|
| Service | systemd |
|
||||||
|
|
||||||
|
## 🖥️ Installations-Layout
|
||||||
|
|
||||||
|
| Pfad | Inhalt |
|
||||||
|
|------|--------|
|
||||||
|
| `/opt/pdf-ocr-hotfolder/` | Code + venv (`venv/bin/python`) |
|
||||||
|
| `/etc/pdf-ocr-hotfolder/config.toml` | Konfiguration (mode 640, root:<service-group>) |
|
||||||
|
| `/var/lib/pdf-ocr-hotfolder/{incoming,working,outgoing,error}/` | Datenverzeichnisse |
|
||||||
|
| `/var/log/pdf-ocr-hotfolder/` | Logs (zusätzlich zu journald) |
|
||||||
|
| `/etc/systemd/system/pdf-ocr-hotfolder.service` | systemd-Unit |
|
||||||
|
| `/var/backups/pdf-ocr-hotfolder/` | Update-Backups |
|
||||||
|
|
||||||
|
## 👤 Service-User
|
||||||
|
|
||||||
|
Der Installer fragt interaktiv:
|
||||||
|
1. Username (default `pdfocr`)
|
||||||
|
2. Falls User existiert (lokal oder AD via SSSD/Winbind): wird übernommen, primäre Gruppe automatisch erkannt
|
||||||
|
3. Falls nicht: Frage nach lokaler Anlage als System-User
|
||||||
|
|
||||||
|
**Wichtig:** Bei AD-Usern mit lokaler UID werden Datei-Berechtigungen über die UID gesetzt — funktioniert transparent.
|
||||||
|
|
||||||
|
## 🔄 Verarbeitungs-Flow
|
||||||
|
|
||||||
|
1. `watchdog` triggert auf Datei-Event in `incoming/`
|
||||||
|
2. `_wait_until_stable()` wartet, bis Datei nicht mehr wächst (Scanner schreibt mehrmals)
|
||||||
|
3. Move nach `working/`
|
||||||
|
4. `ocrmypdf.ocr()` als **Library-Call** (kein Subprozess-Start pro PDF — schneller)
|
||||||
|
5. Optional: veraPDF-Validierung (CLI-Subprozess)
|
||||||
|
6. Move nach `outgoing/` als `OCR_<originalname>.pdf`
|
||||||
|
7. Aktive Upload-Targets ausführen (folder/nextcloud/sftp)
|
||||||
|
8. Optional E-Mail-Notify
|
||||||
|
|
||||||
|
Fehler → Move nach `error/`, Service läuft weiter (kein `exit 1` wie im alten Bash-Tool).
|
||||||
|
|
||||||
|
## 🧠 Performance-Entscheidungen
|
||||||
|
|
||||||
|
- **ocrmypdf als Library** statt `subprocess`: spart Python-Interpreter-Start pro PDF
|
||||||
|
- **ThreadPool** mit `max_workers` (default 2) — selbst wenn selten >1 PDF gleichzeitig kommt, blockiert ein langsamer Scan keinen schnellen
|
||||||
|
- **`--jobs` an ocrmypdf**: Tesseract parallelisiert Seiten innerhalb eines PDFs
|
||||||
|
- **`skip_text=True`**: bereits OCR-haltige Seiten werden nicht neu verarbeitet
|
||||||
|
- **Stabilitäts-Check** statt magic-file `new` (alte Bash-Krücke)
|
||||||
|
- veraPDF nur wenn `enabled=true` (JVM-Start ist teuer)
|
||||||
|
|
||||||
|
## 🛠️ Entwicklung
|
||||||
|
|
||||||
|
Lokaler Test ohne Installation:
|
||||||
|
```bash
|
||||||
|
cd ~/dev/gitea.sonith.de/pdf-ocr-hotfolder
|
||||||
|
python3 -m venv venv && source venv/bin/activate
|
||||||
|
pip install -r requirements.txt
|
||||||
|
cp config.example.toml /tmp/config.toml
|
||||||
|
# Pfade in /tmp/config.toml auf Test-Verzeichnisse anpassen
|
||||||
|
python -m pdf_ocr_hotfolder --config /tmp/config.toml
|
||||||
|
```
|
||||||
|
|
||||||
|
## 📋 Roadmap / TODO
|
||||||
|
|
||||||
|
- [ ] Tests (`pytest`) für `processor` und `uploaders`
|
||||||
|
- [ ] Prometheus-Metriken (verarbeitete PDFs, Fehlerquote, Laufzeit)
|
||||||
|
- [ ] CLI-Subkommandos: `pdf-ocr-hotfolder reprocess <error-file>`
|
||||||
|
- [ ] Optional: S3/MinIO Upload-Target
|
||||||
|
- [ ] Docker-Image für Setups ohne systemd
|
||||||
|
|
||||||
|
## 🔑 Repo
|
||||||
|
|
||||||
|
- **Repo:** https://gitea.sonith.de/sonith_ug/pdf-ocr-hotfolder
|
||||||
|
- **Owner:** sonith_ug
|
||||||
|
- **Versionierung:** Semver (PATCH bei jedem Build, MINOR bei Features, MAJOR manuell)
|
||||||
|
- **Tags:** `v{VERSION}`, automatischer Push nach Commit
|
||||||
|
|
||||||
|
## 📞 Kontakt
|
||||||
|
|
||||||
|
**Maintainer:** Dominik Höfling (Sonith GmbH)
|
||||||
@@ -0,0 +1,19 @@
|
|||||||
|
# Changelog
|
||||||
|
|
||||||
|
## [0.1.0] - 2026-04-08
|
||||||
|
|
||||||
|
### Added
|
||||||
|
- Initiale Version (Komplettes Rewrite des alten Bash-Tools `pdf-tool`)
|
||||||
|
- Python-Implementation auf Basis von `ocrmypdf` (Library, kein Subprozess)
|
||||||
|
- Hotfolder-Watcher mit `watchdog` (created/moved/closed Events)
|
||||||
|
- File-Stability-Check (wartet bis Scanner fertig geschrieben hat)
|
||||||
|
- ThreadPool für parallele PDF-Verarbeitung (`max_workers`)
|
||||||
|
- Upload-Targets: lokaler Ordner, Nextcloud (WebDAV via `requests`), SFTP (`paramiko`)
|
||||||
|
- E-Mail-Notify (`smtplib`, immer / nur Fehler / nie)
|
||||||
|
- Optional veraPDF-Validierung
|
||||||
|
- TOML-Konfiguration (`tomllib` aus stdlib, Python ≥3.11)
|
||||||
|
- systemd-Unit mit Hardening-Optionen
|
||||||
|
- `install.sh` mit interaktivem Service-User-Prompt
|
||||||
|
(lokal anlegen oder bestehenden lokalen/AD-User übernehmen)
|
||||||
|
- `update.sh` mit Backup, Code-Sync und Service-Reload
|
||||||
|
- README.md, AI_AGENT_BRIEFING.md
|
||||||
@@ -0,0 +1,176 @@
|
|||||||
|
# PDF OCR Hotfolder
|
||||||
|
|
||||||
|
Verwandelt eingehende gescannte PDFs automatisch in **durchsuchbare PDFs** (PDF/A optional) per OCR. Hauptanwendung: Kunden-Scanner schiebt PDF in einen Ordner — Sekunden später liegt die OCR-Version im Ausgang oder wird in Nextcloud / per SFTP weitergeleitet.
|
||||||
|
|
||||||
|
## Features
|
||||||
|
|
||||||
|
- 🔍 **OCR via ocrmypdf + Tesseract** (Library-Call, kein Subprozess-Overhead)
|
||||||
|
- 📂 **Hotfolder via watchdog** — reagiert auf `created`, `moved`, `closed` Events
|
||||||
|
- 🧠 **Stabilitäts-Erkennung**: wartet bis Scanner fertig geschrieben hat
|
||||||
|
- 🔁 **Parallelverarbeitung** mehrerer PDFs (ThreadPool, konfigurierbar)
|
||||||
|
- ✅ **PDF/A-Output** (1, 2 oder 3) optional
|
||||||
|
- 🛡️ **veraPDF-Validierung** optional
|
||||||
|
- ☁️ **Upload-Ziele**: lokaler Ordner, Nextcloud (WebDAV via Python), SFTP
|
||||||
|
- 📧 **E-Mail-Notify** (immer / nur Fehler / nie)
|
||||||
|
- 🔐 **Service-User-Support** für lokale **und AD-User mit lokaler UID** (SSSD/Winbind)
|
||||||
|
- ⚙️ Saubere systemd-Integration mit auto-Restart
|
||||||
|
|
||||||
|
## Schnellstart
|
||||||
|
|
||||||
|
```bash
|
||||||
|
git clone https://gitea.sonith.de/sonith_ug/pdf-ocr-hotfolder.git
|
||||||
|
cd pdf-ocr-hotfolder
|
||||||
|
sudo ./install.sh
|
||||||
|
```
|
||||||
|
|
||||||
|
Der Installer fragt nach dem Service-User. Standardmäßig wird ein lokaler System-User `pdfocr` angelegt. Wenn der User bereits existiert (z.B. AD via SSSD), wird er einfach übernommen.
|
||||||
|
|
||||||
|
Danach Konfiguration anpassen:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
sudo nano /etc/pdf-ocr-hotfolder/config.toml
|
||||||
|
sudo systemctl restart pdf-ocr-hotfolder
|
||||||
|
```
|
||||||
|
|
||||||
|
Test:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
cp irgendein-scan.pdf /var/lib/pdf-ocr-hotfolder/incoming/
|
||||||
|
journalctl -u pdf-ocr-hotfolder -f
|
||||||
|
```
|
||||||
|
|
||||||
|
Nach wenigen Sekunden liegt das OCR-PDF unter `/var/lib/pdf-ocr-hotfolder/outgoing/OCR_irgendein-scan.pdf`.
|
||||||
|
|
||||||
|
## Verzeichnisse
|
||||||
|
|
||||||
|
| Pfad | Zweck |
|
||||||
|
|------|-------|
|
||||||
|
| `/etc/pdf-ocr-hotfolder/config.toml` | Konfiguration |
|
||||||
|
| `/var/lib/pdf-ocr-hotfolder/incoming` | Eingang (Scanner schreibt hier rein) |
|
||||||
|
| `/var/lib/pdf-ocr-hotfolder/working` | Arbeitsverzeichnis während OCR |
|
||||||
|
| `/var/lib/pdf-ocr-hotfolder/outgoing` | Ausgang (fertige PDFs) |
|
||||||
|
| `/var/lib/pdf-ocr-hotfolder/error` | PDFs, die nicht verarbeitet werden konnten |
|
||||||
|
| `/opt/pdf-ocr-hotfolder/` | Code + venv |
|
||||||
|
| `/var/log/pdf-ocr-hotfolder/` | Logs (zusätzlich zu journald) |
|
||||||
|
|
||||||
|
## Konfiguration
|
||||||
|
|
||||||
|
Vollständiges Beispiel: [`config.example.toml`](config.example.toml). Wichtigste Sektionen:
|
||||||
|
|
||||||
|
### `[ocr]`
|
||||||
|
```toml
|
||||||
|
languages = "deu+eng" # Tesseract-Sprachen
|
||||||
|
jobs = 4 # Threads pro PDF
|
||||||
|
skip_text = true # bereits OCR-haltige Seiten überspringen
|
||||||
|
pdfa_level = "2" # "1", "2", "3" oder "" für reines PDF
|
||||||
|
deskew = true
|
||||||
|
max_workers = 2 # parallele PDFs
|
||||||
|
timeout = 1800
|
||||||
|
```
|
||||||
|
|
||||||
|
### `[upload.nextcloud]`
|
||||||
|
```toml
|
||||||
|
enabled = true
|
||||||
|
url = "https://cloud.example.com"
|
||||||
|
username = "scanuser"
|
||||||
|
password = "app-password"
|
||||||
|
remote_path = "Scans/Inbox"
|
||||||
|
```
|
||||||
|
|
||||||
|
### `[upload.sftp]`
|
||||||
|
```toml
|
||||||
|
enabled = true
|
||||||
|
host = "sftp.example.com"
|
||||||
|
username = "scanuser"
|
||||||
|
key_file = "/etc/pdf-ocr-hotfolder/sftp_key"
|
||||||
|
remote_path = "/uploads"
|
||||||
|
```
|
||||||
|
|
||||||
|
### `[notify.email]`
|
||||||
|
```toml
|
||||||
|
enabled = true
|
||||||
|
smtp_host = "smtp.example.com"
|
||||||
|
smtp_port = 587
|
||||||
|
smtp_user = "alerts@example.com"
|
||||||
|
smtp_password = "secret"
|
||||||
|
from_addr = "PDF OCR <alerts@example.com>"
|
||||||
|
to_addrs = ["admin@example.com"]
|
||||||
|
on = "errors" # always | errors | never
|
||||||
|
```
|
||||||
|
|
||||||
|
## Service-Verwaltung
|
||||||
|
|
||||||
|
```bash
|
||||||
|
sudo systemctl status pdf-ocr-hotfolder
|
||||||
|
sudo systemctl restart pdf-ocr-hotfolder
|
||||||
|
journalctl -u pdf-ocr-hotfolder -f
|
||||||
|
```
|
||||||
|
|
||||||
|
## Update
|
||||||
|
|
||||||
|
```bash
|
||||||
|
cd /pfad/zum/repo
|
||||||
|
git pull
|
||||||
|
sudo ./update.sh
|
||||||
|
```
|
||||||
|
|
||||||
|
Das Repo muss bestehen bleiben — `update.sh` kopiert daraus.
|
||||||
|
|
||||||
|
## Manueller Lauf (One-Shot)
|
||||||
|
|
||||||
|
Bestehende PDFs im Eingang einmalig verarbeiten und beenden:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
sudo -u pdfocr /opt/pdf-ocr-hotfolder/venv/bin/python -m pdf_ocr_hotfolder \
|
||||||
|
--config /etc/pdf-ocr-hotfolder/config.toml --once
|
||||||
|
```
|
||||||
|
|
||||||
|
## Troubleshooting
|
||||||
|
|
||||||
|
### Tesseract findet die Sprache nicht
|
||||||
|
```bash
|
||||||
|
sudo apt install tesseract-ocr-deu tesseract-ocr-eng
|
||||||
|
```
|
||||||
|
|
||||||
|
### "PriorOcrFoundError"
|
||||||
|
ocrmypdf erkennt bereits vorhandenen OCR-Text. `skip_text = true` in der Config setzen.
|
||||||
|
|
||||||
|
### Berechtigungsprobleme bei AD-User
|
||||||
|
Service-User braucht **rw** auf alle vier Verzeichnisse unter `/var/lib/pdf-ocr-hotfolder/`. Bei AD-User mit lokaler UID:
|
||||||
|
```bash
|
||||||
|
sudo chown -R DOMAIN\\scanuser:DOMAIN\\scangroup /var/lib/pdf-ocr-hotfolder
|
||||||
|
```
|
||||||
|
|
||||||
|
### veraPDF-Validierung schlägt immer fehl
|
||||||
|
veraPDF binary prüfen (`[verapdf].binary`). Wenn nicht zwingend gebraucht: `enabled = false`.
|
||||||
|
|
||||||
|
## Architektur
|
||||||
|
|
||||||
|
```
|
||||||
|
┌──────────┐ watchdog ┌──────────────┐ ocrmypdf ┌──────────┐
|
||||||
|
│ Scanner │ ──────────────▶ │ incoming/ │ ─────────────▶ │ working/ │
|
||||||
|
└──────────┘ PDF-Datei └──────────────┘ (Library) └────┬─────┘
|
||||||
|
│
|
||||||
|
optional veraPDF
|
||||||
|
│
|
||||||
|
▼
|
||||||
|
┌──────────────┐
|
||||||
|
│ outgoing/ │
|
||||||
|
└──────┬───────┘
|
||||||
|
│
|
||||||
|
┌──────────────────────┼──────────────────────┐
|
||||||
|
▼ ▼ ▼
|
||||||
|
┌────────────┐ ┌────────────┐ ┌────────────┐
|
||||||
|
│ Nextcloud │ │ SFTP │ │ E-Mail │
|
||||||
|
│ (WebDAV) │ │ (paramiko) │ │ Notify │
|
||||||
|
└────────────┘ └────────────┘ └────────────┘
|
||||||
|
```
|
||||||
|
|
||||||
|
## Lizenz
|
||||||
|
|
||||||
|
MIT — © Sonith UG
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
**Version:** 0.1.0
|
||||||
|
**Repo:** https://gitea.sonith.de/sonith_ug/pdf-ocr-hotfolder
|
||||||
@@ -0,0 +1,81 @@
|
|||||||
|
# PDF OCR Hotfolder — Konfiguration
|
||||||
|
# Speichern als /etc/pdf-ocr-hotfolder/config.toml
|
||||||
|
|
||||||
|
[paths]
|
||||||
|
# Eingangsverzeichnis: hier landen gescannte PDFs
|
||||||
|
incoming = "/var/lib/pdf-ocr-hotfolder/incoming"
|
||||||
|
# Ausgangsverzeichnis: fertige durchsuchbare PDFs
|
||||||
|
outgoing = "/var/lib/pdf-ocr-hotfolder/outgoing"
|
||||||
|
# Arbeitsverzeichnis (während Verarbeitung)
|
||||||
|
working = "/var/lib/pdf-ocr-hotfolder/working"
|
||||||
|
# Fehlerverzeichnis: PDFs, die nicht verarbeitet werden konnten
|
||||||
|
error = "/var/lib/pdf-ocr-hotfolder/error"
|
||||||
|
|
||||||
|
[ocr]
|
||||||
|
# Tesseract-Sprachen (z.B. "deu", "deu+eng")
|
||||||
|
languages = "deu+eng"
|
||||||
|
# Anzahl Threads pro PDF (ocrmypdf --jobs)
|
||||||
|
jobs = 4
|
||||||
|
# Bereits OCR-haltige Seiten überspringen statt neu zu OCRen
|
||||||
|
skip_text = true
|
||||||
|
# Auflösung für gerasterte Seiten
|
||||||
|
oversample = 300
|
||||||
|
# PDF/A-Konformitätsstufe ("1", "2", "3" oder leer für keinen PDF/A-Output)
|
||||||
|
pdfa_level = "2"
|
||||||
|
# Schiefe Scans automatisch begradigen
|
||||||
|
deskew = true
|
||||||
|
# Hintergrund säubern
|
||||||
|
clean = false
|
||||||
|
# Maximale parallele PDFs (Hauptsystem hat selten mehr als 1-2 gleichzeitig)
|
||||||
|
max_workers = 2
|
||||||
|
# Timeout pro PDF in Sekunden
|
||||||
|
timeout = 1800
|
||||||
|
|
||||||
|
[verapdf]
|
||||||
|
# PDF/A-Validierung (optional)
|
||||||
|
enabled = false
|
||||||
|
binary = "/opt/verapdf/verapdf"
|
||||||
|
flavour = "1b"
|
||||||
|
|
||||||
|
# Upload-Ziele — beliebig viele aktivierbar.
|
||||||
|
# Wenn alle deaktiviert sind, bleibt das fertige PDF einfach im outgoing-Ordner.
|
||||||
|
|
||||||
|
[upload.folder]
|
||||||
|
enabled = true
|
||||||
|
# Wenn leer, wird [paths].outgoing verwendet
|
||||||
|
target = ""
|
||||||
|
|
||||||
|
[upload.nextcloud]
|
||||||
|
enabled = false
|
||||||
|
url = "https://cloud.example.com"
|
||||||
|
username = "scanuser"
|
||||||
|
password = "app-password"
|
||||||
|
# Zielpfad relativ zum User-Root, z.B. "Scans/Inbox"
|
||||||
|
remote_path = "Scans/Inbox"
|
||||||
|
verify_ssl = true
|
||||||
|
|
||||||
|
[upload.sftp]
|
||||||
|
enabled = false
|
||||||
|
host = "sftp.example.com"
|
||||||
|
port = 22
|
||||||
|
username = "scanuser"
|
||||||
|
# Entweder Key-Datei oder Passwort
|
||||||
|
key_file = "/etc/pdf-ocr-hotfolder/sftp_key"
|
||||||
|
password = ""
|
||||||
|
remote_path = "/uploads"
|
||||||
|
|
||||||
|
[notify.email]
|
||||||
|
enabled = false
|
||||||
|
smtp_host = "smtp.example.com"
|
||||||
|
smtp_port = 587
|
||||||
|
smtp_user = "alerts@example.com"
|
||||||
|
smtp_password = "secret"
|
||||||
|
use_starttls = true
|
||||||
|
from_addr = "PDF OCR Hotfolder <alerts@example.com>"
|
||||||
|
to_addrs = ["admin@example.com"]
|
||||||
|
# Wann benachrichtigen: "always" | "errors" | "never"
|
||||||
|
on = "errors"
|
||||||
|
|
||||||
|
[logging]
|
||||||
|
# DEBUG | INFO | WARNING | ERROR
|
||||||
|
level = "INFO"
|
||||||
Executable
+157
@@ -0,0 +1,157 @@
|
|||||||
|
#!/usr/bin/env bash
|
||||||
|
#
|
||||||
|
# PDF OCR Hotfolder — Installer für Debian 12/13
|
||||||
|
#
|
||||||
|
# Fragt interaktiv nach dem Service-User. Unterstützt:
|
||||||
|
# - Lokal anlegen (neuer System-User)
|
||||||
|
# - Bereits existierender lokaler User
|
||||||
|
# - AD-User mit lokaler UID (z.B. via SSSD/Winbind)
|
||||||
|
#
|
||||||
|
|
||||||
|
set -euo pipefail
|
||||||
|
|
||||||
|
RED='\033[0;31m'; GREEN='\033[0;32m'; YELLOW='\033[1;33m'; BLUE='\033[0;34m'; NC='\033[0m'
|
||||||
|
log_info() { echo -e "${GREEN}[INFO]${NC} $*"; }
|
||||||
|
log_warn() { echo -e "${YELLOW}[WARN]${NC} $*"; }
|
||||||
|
log_error() { echo -e "${RED}[ERROR]${NC} $*"; }
|
||||||
|
log_step() { echo -e "${BLUE}==>${NC} $*"; }
|
||||||
|
|
||||||
|
if [ "${EUID}" -ne 0 ]; then
|
||||||
|
log_error "Bitte als root ausführen: sudo ./install.sh"
|
||||||
|
exit 1
|
||||||
|
fi
|
||||||
|
|
||||||
|
INSTALL_DIR="/opt/pdf-ocr-hotfolder"
|
||||||
|
CONFIG_DIR="/etc/pdf-ocr-hotfolder"
|
||||||
|
DATA_DIR="/var/lib/pdf-ocr-hotfolder"
|
||||||
|
LOG_DIR="/var/log/pdf-ocr-hotfolder"
|
||||||
|
SERVICE_NAME="pdf-ocr-hotfolder"
|
||||||
|
|
||||||
|
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
||||||
|
REPO_DIR="$SCRIPT_DIR"
|
||||||
|
|
||||||
|
if [ ! -f "$REPO_DIR/pdf_ocr_hotfolder/__init__.py" ]; then
|
||||||
|
log_error "Repo-Layout nicht erkannt. install.sh aus dem Repo ausführen."
|
||||||
|
exit 1
|
||||||
|
fi
|
||||||
|
|
||||||
|
echo
|
||||||
|
echo "=========================================="
|
||||||
|
echo " PDF OCR Hotfolder — Installation"
|
||||||
|
echo "=========================================="
|
||||||
|
echo
|
||||||
|
|
||||||
|
# ============ 1. System-Dependencies ============
|
||||||
|
log_step "Installiere System-Pakete"
|
||||||
|
|
||||||
|
apt-get update -qq
|
||||||
|
apt-get install -y --no-install-recommends \
|
||||||
|
python3 python3-venv python3-pip \
|
||||||
|
tesseract-ocr tesseract-ocr-deu tesseract-ocr-eng \
|
||||||
|
ghostscript qpdf unpaper pngquant \
|
||||||
|
icc-profiles-free \
|
||||||
|
ca-certificates curl
|
||||||
|
|
||||||
|
log_info "System-Pakete installiert ✓"
|
||||||
|
|
||||||
|
# ============ 2. Service-User ============
|
||||||
|
log_step "Service-User konfigurieren"
|
||||||
|
|
||||||
|
read -r -p "Service-User-Name [pdfocr]: " SERVICE_USER
|
||||||
|
SERVICE_USER="${SERVICE_USER:-pdfocr}"
|
||||||
|
|
||||||
|
if id "$SERVICE_USER" &>/dev/null; then
|
||||||
|
log_info "User '$SERVICE_USER' existiert bereits (lokal oder via AD)."
|
||||||
|
SERVICE_GROUP="$(id -gn "$SERVICE_USER")"
|
||||||
|
log_info "Verwende bestehende primäre Gruppe: $SERVICE_GROUP"
|
||||||
|
else
|
||||||
|
log_warn "User '$SERVICE_USER' existiert nicht."
|
||||||
|
read -r -p "Lokal als System-User anlegen? [J/n]: " CREATE_USER
|
||||||
|
CREATE_USER="${CREATE_USER:-J}"
|
||||||
|
if [[ "$CREATE_USER" =~ ^[JjYy]$ ]]; then
|
||||||
|
adduser --system --group --home "$DATA_DIR" --shell /usr/sbin/nologin "$SERVICE_USER"
|
||||||
|
SERVICE_GROUP="$SERVICE_USER"
|
||||||
|
log_info "Lokaler System-User '$SERVICE_USER' angelegt ✓"
|
||||||
|
else
|
||||||
|
log_error "User '$SERVICE_USER' muss vor der Installation existieren (z.B. via AD/SSSD)."
|
||||||
|
log_error "Lege ihn an oder wähle einen existierenden Namen."
|
||||||
|
exit 1
|
||||||
|
fi
|
||||||
|
fi
|
||||||
|
|
||||||
|
# ============ 3. Verzeichnisse ============
|
||||||
|
log_step "Verzeichnisse erstellen"
|
||||||
|
|
||||||
|
mkdir -p "$INSTALL_DIR" "$CONFIG_DIR" "$LOG_DIR"
|
||||||
|
mkdir -p "$DATA_DIR"/{incoming,outgoing,working,error}
|
||||||
|
|
||||||
|
cp -r "$REPO_DIR/pdf_ocr_hotfolder" "$INSTALL_DIR/"
|
||||||
|
cp "$REPO_DIR/requirements.txt" "$INSTALL_DIR/"
|
||||||
|
cp "$REPO_DIR/VERSION" "$INSTALL_DIR/"
|
||||||
|
echo "$REPO_DIR" > "$INSTALL_DIR/.repo_path"
|
||||||
|
|
||||||
|
if [ ! -f "$CONFIG_DIR/config.toml" ]; then
|
||||||
|
cp "$REPO_DIR/config.example.toml" "$CONFIG_DIR/config.toml"
|
||||||
|
log_info "Beispiel-Konfig nach $CONFIG_DIR/config.toml kopiert"
|
||||||
|
else
|
||||||
|
log_info "Bestehende Konfig $CONFIG_DIR/config.toml bleibt unverändert"
|
||||||
|
fi
|
||||||
|
|
||||||
|
log_info "Verzeichnisse erstellt ✓"
|
||||||
|
|
||||||
|
# ============ 4. Python venv ============
|
||||||
|
log_step "Python venv anlegen"
|
||||||
|
|
||||||
|
if [ ! -d "$INSTALL_DIR/venv" ]; then
|
||||||
|
python3 -m venv "$INSTALL_DIR/venv"
|
||||||
|
fi
|
||||||
|
"$INSTALL_DIR/venv/bin/pip" install --upgrade pip -q
|
||||||
|
"$INSTALL_DIR/venv/bin/pip" install -r "$INSTALL_DIR/requirements.txt" -q
|
||||||
|
|
||||||
|
log_info "venv bereit ✓"
|
||||||
|
|
||||||
|
# ============ 5. Berechtigungen ============
|
||||||
|
log_step "Berechtigungen setzen"
|
||||||
|
|
||||||
|
chown -R "$SERVICE_USER:$SERVICE_GROUP" "$INSTALL_DIR" "$DATA_DIR" "$LOG_DIR"
|
||||||
|
chown root:"$SERVICE_GROUP" "$CONFIG_DIR"
|
||||||
|
chmod 750 "$CONFIG_DIR"
|
||||||
|
if [ -f "$CONFIG_DIR/config.toml" ]; then
|
||||||
|
chown root:"$SERVICE_GROUP" "$CONFIG_DIR/config.toml"
|
||||||
|
chmod 640 "$CONFIG_DIR/config.toml"
|
||||||
|
fi
|
||||||
|
|
||||||
|
log_info "Berechtigungen gesetzt ✓"
|
||||||
|
|
||||||
|
# ============ 6. systemd-Unit ============
|
||||||
|
log_step "systemd-Unit installieren"
|
||||||
|
|
||||||
|
sed -e "s|__SERVICE_USER__|$SERVICE_USER|g" \
|
||||||
|
-e "s|__SERVICE_GROUP__|$SERVICE_GROUP|g" \
|
||||||
|
"$REPO_DIR/systemd/pdf-ocr-hotfolder.service" \
|
||||||
|
> "/etc/systemd/system/${SERVICE_NAME}.service"
|
||||||
|
|
||||||
|
systemctl daemon-reload
|
||||||
|
systemctl enable "${SERVICE_NAME}.service"
|
||||||
|
|
||||||
|
log_info "systemd-Unit installiert & enabled ✓"
|
||||||
|
|
||||||
|
# ============ 7. Start ============
|
||||||
|
log_step "Service starten"
|
||||||
|
systemctl restart "${SERVICE_NAME}.service"
|
||||||
|
sleep 2
|
||||||
|
systemctl --no-pager --lines=10 status "${SERVICE_NAME}.service" || true
|
||||||
|
|
||||||
|
echo
|
||||||
|
echo "=========================================="
|
||||||
|
echo " Installation abgeschlossen"
|
||||||
|
echo "=========================================="
|
||||||
|
echo
|
||||||
|
echo " Konfiguration: $CONFIG_DIR/config.toml"
|
||||||
|
echo " Eingang: $DATA_DIR/incoming"
|
||||||
|
echo " Ausgang: $DATA_DIR/outgoing"
|
||||||
|
echo " Service-User: $SERVICE_USER ($SERVICE_GROUP)"
|
||||||
|
echo
|
||||||
|
echo " Logs: journalctl -u $SERVICE_NAME -f"
|
||||||
|
echo " Update: sudo ./update.sh"
|
||||||
|
echo
|
||||||
@@ -0,0 +1,3 @@
|
|||||||
|
"""PDF OCR Hotfolder — Scanner-PDFs automatisch durchsuchbar machen."""
|
||||||
|
|
||||||
|
__version__ = "0.1.0"
|
||||||
@@ -0,0 +1,57 @@
|
|||||||
|
"""CLI-Entrypoint."""
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
import argparse
|
||||||
|
import logging
|
||||||
|
import sys
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
from . import __version__
|
||||||
|
from .config import load_config
|
||||||
|
from .service import HotfolderService
|
||||||
|
|
||||||
|
|
||||||
|
def _setup_logging(level: str) -> None:
|
||||||
|
logging.basicConfig(
|
||||||
|
level=getattr(logging, level.upper(), logging.INFO),
|
||||||
|
format="%(asctime)s %(levelname)-7s %(name)s: %(message)s",
|
||||||
|
datefmt="%Y-%m-%d %H:%M:%S",
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
def main() -> int:
|
||||||
|
parser = argparse.ArgumentParser(
|
||||||
|
prog="pdf-ocr-hotfolder",
|
||||||
|
description="Wandelt eingehende PDFs per OCR in durchsuchbare PDFs um.",
|
||||||
|
)
|
||||||
|
parser.add_argument("--config", "-c", default="/etc/pdf-ocr-hotfolder/config.toml",
|
||||||
|
help="Pfad zur Konfigurationsdatei (TOML)")
|
||||||
|
parser.add_argument("--version", action="version", version=f"%(prog)s {__version__}")
|
||||||
|
parser.add_argument("--once", action="store_true",
|
||||||
|
help="Nur bestehende Dateien verarbeiten und beenden")
|
||||||
|
args = parser.parse_args()
|
||||||
|
|
||||||
|
cfg_path = Path(args.config)
|
||||||
|
if not cfg_path.exists():
|
||||||
|
print(f"Config nicht gefunden: {cfg_path}", file=sys.stderr)
|
||||||
|
return 2
|
||||||
|
|
||||||
|
cfg = load_config(cfg_path)
|
||||||
|
_setup_logging(cfg.log_level)
|
||||||
|
|
||||||
|
service = HotfolderService(cfg)
|
||||||
|
if args.once:
|
||||||
|
service._ensure_dirs() # noqa: SLF001
|
||||||
|
service._scan_existing() # noqa: SLF001
|
||||||
|
service._executor.shutdown(wait=True) # noqa: SLF001
|
||||||
|
return 0
|
||||||
|
|
||||||
|
try:
|
||||||
|
service.run()
|
||||||
|
except KeyboardInterrupt:
|
||||||
|
pass
|
||||||
|
return 0
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
sys.exit(main())
|
||||||
@@ -0,0 +1,129 @@
|
|||||||
|
"""Konfigurations-Loader (TOML)."""
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
import tomllib
|
||||||
|
from dataclasses import dataclass, field
|
||||||
|
from pathlib import Path
|
||||||
|
from typing import Any
|
||||||
|
|
||||||
|
|
||||||
|
@dataclass
|
||||||
|
class Paths:
|
||||||
|
incoming: Path
|
||||||
|
outgoing: Path
|
||||||
|
working: Path
|
||||||
|
error: Path
|
||||||
|
|
||||||
|
|
||||||
|
@dataclass
|
||||||
|
class OcrConfig:
|
||||||
|
languages: str = "deu+eng"
|
||||||
|
jobs: int = 4
|
||||||
|
skip_text: bool = True
|
||||||
|
oversample: int = 300
|
||||||
|
pdfa_level: str = "2"
|
||||||
|
deskew: bool = True
|
||||||
|
clean: bool = False
|
||||||
|
max_workers: int = 2
|
||||||
|
timeout: int = 1800
|
||||||
|
|
||||||
|
|
||||||
|
@dataclass
|
||||||
|
class VeraPdfConfig:
|
||||||
|
enabled: bool = False
|
||||||
|
binary: str = "/opt/verapdf/verapdf"
|
||||||
|
flavour: str = "1b"
|
||||||
|
|
||||||
|
|
||||||
|
@dataclass
|
||||||
|
class FolderUpload:
|
||||||
|
enabled: bool = True
|
||||||
|
target: str = ""
|
||||||
|
|
||||||
|
|
||||||
|
@dataclass
|
||||||
|
class NextcloudUpload:
|
||||||
|
enabled: bool = False
|
||||||
|
url: str = ""
|
||||||
|
username: str = ""
|
||||||
|
password: str = ""
|
||||||
|
remote_path: str = ""
|
||||||
|
verify_ssl: bool = True
|
||||||
|
|
||||||
|
|
||||||
|
@dataclass
|
||||||
|
class SftpUpload:
|
||||||
|
enabled: bool = False
|
||||||
|
host: str = ""
|
||||||
|
port: int = 22
|
||||||
|
username: str = ""
|
||||||
|
key_file: str = ""
|
||||||
|
password: str = ""
|
||||||
|
remote_path: str = ""
|
||||||
|
|
||||||
|
|
||||||
|
@dataclass
|
||||||
|
class EmailNotify:
|
||||||
|
enabled: bool = False
|
||||||
|
smtp_host: str = ""
|
||||||
|
smtp_port: int = 587
|
||||||
|
smtp_user: str = ""
|
||||||
|
smtp_password: str = ""
|
||||||
|
use_starttls: bool = True
|
||||||
|
from_addr: str = ""
|
||||||
|
to_addrs: list[str] = field(default_factory=list)
|
||||||
|
on: str = "errors" # always | errors | never
|
||||||
|
|
||||||
|
|
||||||
|
@dataclass
|
||||||
|
class Config:
|
||||||
|
paths: Paths
|
||||||
|
ocr: OcrConfig
|
||||||
|
verapdf: VeraPdfConfig
|
||||||
|
folder: FolderUpload
|
||||||
|
nextcloud: NextcloudUpload
|
||||||
|
sftp: SftpUpload
|
||||||
|
email: EmailNotify
|
||||||
|
log_level: str = "INFO"
|
||||||
|
|
||||||
|
|
||||||
|
def _section(data: dict[str, Any], *keys: str) -> dict[str, Any]:
|
||||||
|
cur: Any = data
|
||||||
|
for k in keys:
|
||||||
|
cur = cur.get(k, {}) if isinstance(cur, dict) else {}
|
||||||
|
return cur if isinstance(cur, dict) else {}
|
||||||
|
|
||||||
|
|
||||||
|
def load_config(path: str | Path) -> Config:
|
||||||
|
path = Path(path)
|
||||||
|
with path.open("rb") as f:
|
||||||
|
data = tomllib.load(f)
|
||||||
|
|
||||||
|
p = _section(data, "paths")
|
||||||
|
paths = Paths(
|
||||||
|
incoming=Path(p["incoming"]),
|
||||||
|
outgoing=Path(p["outgoing"]),
|
||||||
|
working=Path(p["working"]),
|
||||||
|
error=Path(p["error"]),
|
||||||
|
)
|
||||||
|
|
||||||
|
ocr = OcrConfig(**{k: v for k, v in _section(data, "ocr").items()
|
||||||
|
if k in OcrConfig.__annotations__})
|
||||||
|
verapdf = VeraPdfConfig(**{k: v for k, v in _section(data, "verapdf").items()
|
||||||
|
if k in VeraPdfConfig.__annotations__})
|
||||||
|
folder = FolderUpload(**{k: v for k, v in _section(data, "upload", "folder").items()
|
||||||
|
if k in FolderUpload.__annotations__})
|
||||||
|
nextcloud = NextcloudUpload(**{k: v for k, v in _section(data, "upload", "nextcloud").items()
|
||||||
|
if k in NextcloudUpload.__annotations__})
|
||||||
|
sftp = SftpUpload(**{k: v for k, v in _section(data, "upload", "sftp").items()
|
||||||
|
if k in SftpUpload.__annotations__})
|
||||||
|
email = EmailNotify(**{k: v for k, v in _section(data, "notify", "email").items()
|
||||||
|
if k in EmailNotify.__annotations__})
|
||||||
|
|
||||||
|
log_level = _section(data, "logging").get("level", "INFO")
|
||||||
|
|
||||||
|
return Config(
|
||||||
|
paths=paths, ocr=ocr, verapdf=verapdf,
|
||||||
|
folder=folder, nextcloud=nextcloud, sftp=sftp, email=email,
|
||||||
|
log_level=log_level,
|
||||||
|
)
|
||||||
@@ -0,0 +1,112 @@
|
|||||||
|
"""OCR-Verarbeitung einer einzelnen PDF mit ocrmypdf + optional veraPDF."""
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
import logging
|
||||||
|
import shutil
|
||||||
|
import subprocess
|
||||||
|
from dataclasses import dataclass
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
import ocrmypdf
|
||||||
|
|
||||||
|
from .config import OcrConfig, VeraPdfConfig
|
||||||
|
|
||||||
|
log = logging.getLogger(__name__)
|
||||||
|
|
||||||
|
|
||||||
|
@dataclass
|
||||||
|
class ProcessResult:
|
||||||
|
source: Path
|
||||||
|
output: Path
|
||||||
|
success: bool
|
||||||
|
error: str = ""
|
||||||
|
verapdf_passed: bool | None = None
|
||||||
|
|
||||||
|
|
||||||
|
def run_ocr(src: Path, dst: Path, cfg: OcrConfig) -> None:
|
||||||
|
"""Führt ocrmypdf als Library-Call aus (kein Subprozess-Overhead)."""
|
||||||
|
kwargs: dict = {
|
||||||
|
"language": cfg.languages,
|
||||||
|
"jobs": cfg.jobs,
|
||||||
|
"deskew": cfg.deskew,
|
||||||
|
"clean": cfg.clean,
|
||||||
|
"oversample": cfg.oversample,
|
||||||
|
"progress_bar": False,
|
||||||
|
"skip_text": cfg.skip_text,
|
||||||
|
}
|
||||||
|
if cfg.pdfa_level:
|
||||||
|
kwargs["output_type"] = f"pdfa-{cfg.pdfa_level}"
|
||||||
|
else:
|
||||||
|
kwargs["output_type"] = "pdf"
|
||||||
|
|
||||||
|
log.info("OCR start: %s", src.name)
|
||||||
|
ocrmypdf.ocr(str(src), str(dst), **kwargs)
|
||||||
|
log.info("OCR done: %s", dst.name)
|
||||||
|
|
||||||
|
|
||||||
|
def run_verapdf(pdf: Path, cfg: VeraPdfConfig) -> bool:
|
||||||
|
"""Validiert PDF/A mit veraPDF (CLI). Gibt True zurück, wenn konform."""
|
||||||
|
if not cfg.enabled:
|
||||||
|
return True
|
||||||
|
if not Path(cfg.binary).exists():
|
||||||
|
log.warning("veraPDF binary nicht gefunden: %s", cfg.binary)
|
||||||
|
return False
|
||||||
|
try:
|
||||||
|
result = subprocess.run(
|
||||||
|
[cfg.binary, "--flavour", cfg.flavour, "--format", "text", str(pdf)],
|
||||||
|
capture_output=True, text=True, timeout=300,
|
||||||
|
)
|
||||||
|
ok = result.returncode == 0 and "PASS" in result.stdout
|
||||||
|
log.info("veraPDF %s: %s", "PASS" if ok else "FAIL", pdf.name)
|
||||||
|
return ok
|
||||||
|
except subprocess.TimeoutExpired:
|
||||||
|
log.error("veraPDF Timeout: %s", pdf.name)
|
||||||
|
return False
|
||||||
|
|
||||||
|
|
||||||
|
def process_pdf(
|
||||||
|
src: Path,
|
||||||
|
working_dir: Path,
|
||||||
|
outgoing_dir: Path,
|
||||||
|
error_dir: Path,
|
||||||
|
ocr_cfg: OcrConfig,
|
||||||
|
vera_cfg: VeraPdfConfig,
|
||||||
|
) -> ProcessResult:
|
||||||
|
"""Verarbeitet eine einzelne PDF: move→OCR→validate→outgoing/error."""
|
||||||
|
work_src = working_dir / src.name
|
||||||
|
work_out = working_dir / f"OCR_{src.name}"
|
||||||
|
final_out = outgoing_dir / f"OCR_{src.name}"
|
||||||
|
|
||||||
|
try:
|
||||||
|
shutil.move(str(src), str(work_src))
|
||||||
|
except OSError as e:
|
||||||
|
return ProcessResult(src, final_out, False, f"move to working failed: {e}")
|
||||||
|
|
||||||
|
try:
|
||||||
|
run_ocr(work_src, work_out, ocr_cfg)
|
||||||
|
except Exception as e: # noqa: BLE001 - ocrmypdf wirft viele Typen
|
||||||
|
log.exception("OCR fehlgeschlagen für %s", src.name)
|
||||||
|
_move_to_error(work_src, error_dir)
|
||||||
|
return ProcessResult(src, final_out, False, f"ocr failed: {e}")
|
||||||
|
|
||||||
|
vera_ok: bool | None = None
|
||||||
|
if vera_cfg.enabled:
|
||||||
|
vera_ok = run_verapdf(work_out, vera_cfg)
|
||||||
|
if not vera_ok:
|
||||||
|
_move_to_error(work_out, error_dir)
|
||||||
|
work_src.unlink(missing_ok=True)
|
||||||
|
return ProcessResult(src, final_out, False,
|
||||||
|
"verapdf validation failed", verapdf_passed=False)
|
||||||
|
|
||||||
|
outgoing_dir.mkdir(parents=True, exist_ok=True)
|
||||||
|
shutil.move(str(work_out), str(final_out))
|
||||||
|
work_src.unlink(missing_ok=True)
|
||||||
|
return ProcessResult(src, final_out, True, verapdf_passed=vera_ok)
|
||||||
|
|
||||||
|
|
||||||
|
def _move_to_error(p: Path, error_dir: Path) -> None:
|
||||||
|
error_dir.mkdir(parents=True, exist_ok=True)
|
||||||
|
try:
|
||||||
|
shutil.move(str(p), str(error_dir / p.name))
|
||||||
|
except OSError:
|
||||||
|
log.exception("Konnte %s nicht in error-Verzeichnis verschieben", p)
|
||||||
@@ -0,0 +1,173 @@
|
|||||||
|
"""Hauptservice: Hotfolder via watchdog, ThreadPool für PDF-Verarbeitung."""
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
import logging
|
||||||
|
import signal
|
||||||
|
import threading
|
||||||
|
import time
|
||||||
|
from concurrent.futures import Future, ThreadPoolExecutor
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
from watchdog.events import FileSystemEvent, FileSystemEventHandler
|
||||||
|
from watchdog.observers import Observer
|
||||||
|
|
||||||
|
from .config import Config
|
||||||
|
from .processor import ProcessResult, process_pdf
|
||||||
|
from .uploaders import notify_email, upload_folder, upload_nextcloud, upload_sftp
|
||||||
|
|
||||||
|
log = logging.getLogger(__name__)
|
||||||
|
|
||||||
|
|
||||||
|
def _is_pdf(path: Path) -> bool:
|
||||||
|
return path.suffix.lower() == ".pdf" and path.is_file()
|
||||||
|
|
||||||
|
|
||||||
|
def _wait_until_stable(path: Path, checks: int = 3, interval: float = 1.0) -> bool:
|
||||||
|
"""Wartet bis Datei nicht mehr wächst (Scanner schreibt mehrmals)."""
|
||||||
|
last = -1
|
||||||
|
stable_count = 0
|
||||||
|
for _ in range(60): # max ~60s
|
||||||
|
try:
|
||||||
|
size = path.stat().st_size
|
||||||
|
except FileNotFoundError:
|
||||||
|
return False
|
||||||
|
if size == last and size > 0:
|
||||||
|
stable_count += 1
|
||||||
|
if stable_count >= checks:
|
||||||
|
return True
|
||||||
|
else:
|
||||||
|
stable_count = 0
|
||||||
|
last = size
|
||||||
|
time.sleep(interval)
|
||||||
|
return False
|
||||||
|
|
||||||
|
|
||||||
|
class _Handler(FileSystemEventHandler):
|
||||||
|
def __init__(self, service: "HotfolderService") -> None:
|
||||||
|
self.service = service
|
||||||
|
|
||||||
|
def on_created(self, event: FileSystemEvent) -> None:
|
||||||
|
if not event.is_directory:
|
||||||
|
self.service.enqueue(Path(event.src_path))
|
||||||
|
|
||||||
|
def on_moved(self, event: FileSystemEvent) -> None:
|
||||||
|
if not event.is_directory:
|
||||||
|
self.service.enqueue(Path(event.dest_path))
|
||||||
|
|
||||||
|
def on_closed(self, event: FileSystemEvent) -> None:
|
||||||
|
if not event.is_directory:
|
||||||
|
self.service.enqueue(Path(event.src_path))
|
||||||
|
|
||||||
|
|
||||||
|
class HotfolderService:
|
||||||
|
def __init__(self, cfg: Config) -> None:
|
||||||
|
self.cfg = cfg
|
||||||
|
self._executor = ThreadPoolExecutor(
|
||||||
|
max_workers=cfg.ocr.max_workers,
|
||||||
|
thread_name_prefix="ocr",
|
||||||
|
)
|
||||||
|
self._observer: Observer | None = None
|
||||||
|
self._stop = threading.Event()
|
||||||
|
self._inflight: set[str] = set()
|
||||||
|
self._lock = threading.Lock()
|
||||||
|
|
||||||
|
# ---- Setup ----
|
||||||
|
|
||||||
|
def _ensure_dirs(self) -> None:
|
||||||
|
for p in (self.cfg.paths.incoming, self.cfg.paths.outgoing,
|
||||||
|
self.cfg.paths.working, self.cfg.paths.error):
|
||||||
|
p.mkdir(parents=True, exist_ok=True)
|
||||||
|
|
||||||
|
# ---- Lifecycle ----
|
||||||
|
|
||||||
|
def run(self) -> None:
|
||||||
|
self._ensure_dirs()
|
||||||
|
self._scan_existing()
|
||||||
|
|
||||||
|
self._observer = Observer()
|
||||||
|
self._observer.schedule(_Handler(self), str(self.cfg.paths.incoming), recursive=False)
|
||||||
|
self._observer.start()
|
||||||
|
log.info("Hotfolder läuft. Watching: %s", self.cfg.paths.incoming)
|
||||||
|
|
||||||
|
signal.signal(signal.SIGTERM, lambda *_: self._stop.set())
|
||||||
|
signal.signal(signal.SIGINT, lambda *_: self._stop.set())
|
||||||
|
|
||||||
|
try:
|
||||||
|
while not self._stop.is_set():
|
||||||
|
self._stop.wait(1.0)
|
||||||
|
finally:
|
||||||
|
self.shutdown()
|
||||||
|
|
||||||
|
def shutdown(self) -> None:
|
||||||
|
log.info("Shutdown läuft...")
|
||||||
|
if self._observer:
|
||||||
|
self._observer.stop()
|
||||||
|
self._observer.join(timeout=5)
|
||||||
|
self._executor.shutdown(wait=True, cancel_futures=False)
|
||||||
|
log.info("Shutdown ok.")
|
||||||
|
|
||||||
|
# ---- Queue ----
|
||||||
|
|
||||||
|
def _scan_existing(self) -> None:
|
||||||
|
"""Beim Start: bereits liegende PDFs aufgreifen."""
|
||||||
|
for p in self.cfg.paths.incoming.iterdir():
|
||||||
|
if _is_pdf(p):
|
||||||
|
self.enqueue(p)
|
||||||
|
|
||||||
|
def enqueue(self, path: Path) -> None:
|
||||||
|
if not _is_pdf(path):
|
||||||
|
return
|
||||||
|
key = str(path.resolve())
|
||||||
|
with self._lock:
|
||||||
|
if key in self._inflight:
|
||||||
|
return
|
||||||
|
self._inflight.add(key)
|
||||||
|
fut = self._executor.submit(self._process, path)
|
||||||
|
fut.add_done_callback(lambda f, k=key: self._done(k, f))
|
||||||
|
|
||||||
|
def _done(self, key: str, fut: Future) -> None:
|
||||||
|
with self._lock:
|
||||||
|
self._inflight.discard(key)
|
||||||
|
exc = fut.exception()
|
||||||
|
if exc:
|
||||||
|
log.exception("Worker-Exception", exc_info=exc)
|
||||||
|
|
||||||
|
# ---- Processing ----
|
||||||
|
|
||||||
|
def _process(self, path: Path) -> None:
|
||||||
|
if not _wait_until_stable(path):
|
||||||
|
log.warning("Datei nicht stabilisiert, überspringe: %s", path)
|
||||||
|
return
|
||||||
|
if not path.exists():
|
||||||
|
return
|
||||||
|
|
||||||
|
result: ProcessResult = process_pdf(
|
||||||
|
src=path,
|
||||||
|
working_dir=self.cfg.paths.working,
|
||||||
|
outgoing_dir=self.cfg.paths.outgoing,
|
||||||
|
error_dir=self.cfg.paths.error,
|
||||||
|
ocr_cfg=self.cfg.ocr,
|
||||||
|
vera_cfg=self.cfg.verapdf,
|
||||||
|
)
|
||||||
|
|
||||||
|
if result.success:
|
||||||
|
self._dispatch_uploads(result.output)
|
||||||
|
self._notify(result)
|
||||||
|
|
||||||
|
def _dispatch_uploads(self, pdf: Path) -> None:
|
||||||
|
upload_folder(pdf, self.cfg.folder, self.cfg.paths.outgoing)
|
||||||
|
if self.cfg.nextcloud.enabled:
|
||||||
|
upload_nextcloud(pdf, self.cfg.nextcloud)
|
||||||
|
if self.cfg.sftp.enabled:
|
||||||
|
upload_sftp(pdf, self.cfg.sftp)
|
||||||
|
|
||||||
|
def _notify(self, result: ProcessResult) -> None:
|
||||||
|
if result.success:
|
||||||
|
subject = f"[pdf-ocr] OK: {result.source.name}"
|
||||||
|
body = f"Datei verarbeitet: {result.output}\n"
|
||||||
|
if result.verapdf_passed is not None:
|
||||||
|
body += f"veraPDF: {'PASS' if result.verapdf_passed else 'FAIL'}\n"
|
||||||
|
else:
|
||||||
|
subject = f"[pdf-ocr] FEHLER: {result.source.name}"
|
||||||
|
body = f"Fehler beim Verarbeiten von {result.source}\n\n{result.error}\n"
|
||||||
|
notify_email(self.cfg.email, subject, body, result.success)
|
||||||
@@ -0,0 +1,104 @@
|
|||||||
|
"""Upload-Ziele: lokaler Ordner, Nextcloud (WebDAV), SFTP. Plus E-Mail-Notify."""
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
import logging
|
||||||
|
import smtplib
|
||||||
|
import ssl
|
||||||
|
from email.message import EmailMessage
|
||||||
|
from pathlib import Path
|
||||||
|
from urllib.parse import quote
|
||||||
|
|
||||||
|
import paramiko
|
||||||
|
import requests
|
||||||
|
|
||||||
|
from .config import EmailNotify, FolderUpload, NextcloudUpload, SftpUpload
|
||||||
|
|
||||||
|
log = logging.getLogger(__name__)
|
||||||
|
|
||||||
|
|
||||||
|
def upload_folder(pdf: Path, cfg: FolderUpload, default_target: Path) -> bool:
|
||||||
|
if not cfg.enabled:
|
||||||
|
return True
|
||||||
|
target = Path(cfg.target) if cfg.target else default_target
|
||||||
|
target.mkdir(parents=True, exist_ok=True)
|
||||||
|
dest = target / pdf.name
|
||||||
|
try:
|
||||||
|
if pdf.resolve() == dest.resolve():
|
||||||
|
return True
|
||||||
|
dest.write_bytes(pdf.read_bytes())
|
||||||
|
log.info("Folder upload OK: %s", dest)
|
||||||
|
return True
|
||||||
|
except OSError as e:
|
||||||
|
log.error("Folder upload failed: %s", e)
|
||||||
|
return False
|
||||||
|
|
||||||
|
|
||||||
|
def upload_nextcloud(pdf: Path, cfg: NextcloudUpload) -> bool:
|
||||||
|
if not cfg.enabled:
|
||||||
|
return True
|
||||||
|
base = cfg.url.rstrip("/")
|
||||||
|
remote = "/".join(quote(part) for part in cfg.remote_path.strip("/").split("/") if part)
|
||||||
|
url = f"{base}/remote.php/dav/files/{quote(cfg.username)}/{remote}/{quote(pdf.name)}"
|
||||||
|
try:
|
||||||
|
with pdf.open("rb") as f:
|
||||||
|
r = requests.put(url, data=f, auth=(cfg.username, cfg.password),
|
||||||
|
verify=cfg.verify_ssl, timeout=300)
|
||||||
|
if r.status_code in (200, 201, 204):
|
||||||
|
log.info("Nextcloud upload OK: %s", pdf.name)
|
||||||
|
return True
|
||||||
|
log.error("Nextcloud upload HTTP %s: %s", r.status_code, r.text[:200])
|
||||||
|
return False
|
||||||
|
except requests.RequestException as e:
|
||||||
|
log.error("Nextcloud upload failed: %s", e)
|
||||||
|
return False
|
||||||
|
|
||||||
|
|
||||||
|
def upload_sftp(pdf: Path, cfg: SftpUpload) -> bool:
|
||||||
|
if not cfg.enabled:
|
||||||
|
return True
|
||||||
|
try:
|
||||||
|
client = paramiko.SSHClient()
|
||||||
|
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
|
||||||
|
connect_kwargs: dict = {
|
||||||
|
"hostname": cfg.host, "port": cfg.port, "username": cfg.username,
|
||||||
|
"timeout": 30,
|
||||||
|
}
|
||||||
|
if cfg.key_file:
|
||||||
|
connect_kwargs["key_filename"] = cfg.key_file
|
||||||
|
if cfg.password:
|
||||||
|
connect_kwargs["password"] = cfg.password
|
||||||
|
client.connect(**connect_kwargs)
|
||||||
|
sftp = client.open_sftp()
|
||||||
|
try:
|
||||||
|
remote = f"{cfg.remote_path.rstrip('/')}/{pdf.name}"
|
||||||
|
sftp.put(str(pdf), remote)
|
||||||
|
log.info("SFTP upload OK: %s", remote)
|
||||||
|
return True
|
||||||
|
finally:
|
||||||
|
sftp.close()
|
||||||
|
client.close()
|
||||||
|
except (paramiko.SSHException, OSError) as e:
|
||||||
|
log.error("SFTP upload failed: %s", e)
|
||||||
|
return False
|
||||||
|
|
||||||
|
|
||||||
|
def notify_email(cfg: EmailNotify, subject: str, body: str, success: bool) -> None:
|
||||||
|
if not cfg.enabled or cfg.on == "never":
|
||||||
|
return
|
||||||
|
if cfg.on == "errors" and success:
|
||||||
|
return
|
||||||
|
msg = EmailMessage()
|
||||||
|
msg["Subject"] = subject
|
||||||
|
msg["From"] = cfg.from_addr
|
||||||
|
msg["To"] = ", ".join(cfg.to_addrs)
|
||||||
|
msg.set_content(body)
|
||||||
|
try:
|
||||||
|
with smtplib.SMTP(cfg.smtp_host, cfg.smtp_port, timeout=30) as s:
|
||||||
|
if cfg.use_starttls:
|
||||||
|
s.starttls(context=ssl.create_default_context())
|
||||||
|
if cfg.smtp_user:
|
||||||
|
s.login(cfg.smtp_user, cfg.smtp_password)
|
||||||
|
s.send_message(msg)
|
||||||
|
log.info("E-Mail-Notify gesendet: %s", subject)
|
||||||
|
except (smtplib.SMTPException, OSError) as e:
|
||||||
|
log.error("E-Mail-Notify fehlgeschlagen: %s", e)
|
||||||
@@ -0,0 +1,4 @@
|
|||||||
|
ocrmypdf>=16.0
|
||||||
|
watchdog>=4.0
|
||||||
|
requests>=2.31
|
||||||
|
paramiko>=3.4
|
||||||
@@ -0,0 +1,25 @@
|
|||||||
|
[Unit]
|
||||||
|
Description=PDF OCR Hotfolder
|
||||||
|
After=network-online.target
|
||||||
|
Wants=network-online.target
|
||||||
|
|
||||||
|
[Service]
|
||||||
|
Type=simple
|
||||||
|
User=__SERVICE_USER__
|
||||||
|
Group=__SERVICE_GROUP__
|
||||||
|
ExecStart=/opt/pdf-ocr-hotfolder/venv/bin/python -m pdf_ocr_hotfolder --config /etc/pdf-ocr-hotfolder/config.toml
|
||||||
|
Restart=on-failure
|
||||||
|
RestartSec=5
|
||||||
|
KillMode=mixed
|
||||||
|
TimeoutStopSec=30
|
||||||
|
|
||||||
|
# Hardening (lockerer wegen AD-User & Datei-ACLs)
|
||||||
|
NoNewPrivileges=true
|
||||||
|
PrivateTmp=true
|
||||||
|
ProtectSystem=full
|
||||||
|
ProtectKernelTunables=true
|
||||||
|
ProtectKernelModules=true
|
||||||
|
ProtectControlGroups=true
|
||||||
|
|
||||||
|
[Install]
|
||||||
|
WantedBy=multi-user.target
|
||||||
@@ -0,0 +1,88 @@
|
|||||||
|
#!/usr/bin/env bash
|
||||||
|
#
|
||||||
|
# PDF OCR Hotfolder — Update-Script
|
||||||
|
#
|
||||||
|
set -euo pipefail
|
||||||
|
|
||||||
|
RED='\033[0;31m'; GREEN='\033[0;32m'; YELLOW='\033[1;33m'; NC='\033[0m'
|
||||||
|
log_info() { echo -e "${GREEN}[INFO]${NC} $*"; }
|
||||||
|
log_warn() { echo -e "${YELLOW}[WARN]${NC} $*"; }
|
||||||
|
log_error() { echo -e "${RED}[ERROR]${NC} $*"; }
|
||||||
|
|
||||||
|
if [ "${EUID}" -ne 0 ]; then
|
||||||
|
log_error "Bitte als root ausführen: sudo ./update.sh"
|
||||||
|
exit 1
|
||||||
|
fi
|
||||||
|
|
||||||
|
INSTALL_DIR="/opt/pdf-ocr-hotfolder"
|
||||||
|
SERVICE_NAME="pdf-ocr-hotfolder"
|
||||||
|
|
||||||
|
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
||||||
|
if [ -f "$SCRIPT_DIR/pdf_ocr_hotfolder/__init__.py" ]; then
|
||||||
|
REPO_DIR="$SCRIPT_DIR"
|
||||||
|
elif [ -f "$INSTALL_DIR/.repo_path" ]; then
|
||||||
|
REPO_DIR="$(cat "$INSTALL_DIR/.repo_path")"
|
||||||
|
[ -d "$REPO_DIR" ] || { log_error "Gespeicherter Repo-Pfad existiert nicht: $REPO_DIR"; exit 1; }
|
||||||
|
else
|
||||||
|
log_error "Repo nicht gefunden. update.sh aus dem Repo ausführen."
|
||||||
|
exit 1
|
||||||
|
fi
|
||||||
|
|
||||||
|
[ -d "$INSTALL_DIR" ] || { log_error "Installation nicht gefunden. Erst install.sh ausführen."; exit 1; }
|
||||||
|
|
||||||
|
OLD_VERSION="$(cat "$INSTALL_DIR/VERSION" 2>/dev/null || echo unknown)"
|
||||||
|
NEW_VERSION="$(cat "$REPO_DIR/VERSION" 2>/dev/null || echo unknown)"
|
||||||
|
|
||||||
|
echo
|
||||||
|
echo "=========================================="
|
||||||
|
echo " PDF OCR Hotfolder — Update"
|
||||||
|
echo "=========================================="
|
||||||
|
log_info "Repo: $REPO_DIR"
|
||||||
|
log_info "Install: $INSTALL_DIR"
|
||||||
|
log_info "Version: $OLD_VERSION → $NEW_VERSION"
|
||||||
|
echo
|
||||||
|
|
||||||
|
# Service-User aus systemd-Unit lesen
|
||||||
|
SERVICE_USER="$(awk -F= '/^User=/{print $2}' /etc/systemd/system/${SERVICE_NAME}.service 2>/dev/null || echo pdfocr)"
|
||||||
|
SERVICE_GROUP="$(awk -F= '/^Group=/{print $2}' /etc/systemd/system/${SERVICE_NAME}.service 2>/dev/null || echo pdfocr)"
|
||||||
|
|
||||||
|
log_info "Stoppe Service..."
|
||||||
|
systemctl stop "${SERVICE_NAME}.service" 2>/dev/null || true
|
||||||
|
|
||||||
|
log_info "Backup erstellen..."
|
||||||
|
BACKUP_DIR="/var/backups/pdf-ocr-hotfolder"
|
||||||
|
mkdir -p "$BACKUP_DIR"
|
||||||
|
tar -czf "$BACKUP_DIR/backup-$(date +%Y%m%d-%H%M%S).tar.gz" \
|
||||||
|
-C "$INSTALL_DIR" --exclude=venv --exclude=__pycache__ . 2>/dev/null || true
|
||||||
|
|
||||||
|
log_info "Code aktualisieren..."
|
||||||
|
rm -rf "$INSTALL_DIR/pdf_ocr_hotfolder"
|
||||||
|
cp -r "$REPO_DIR/pdf_ocr_hotfolder" "$INSTALL_DIR/"
|
||||||
|
cp "$REPO_DIR/requirements.txt" "$INSTALL_DIR/"
|
||||||
|
cp "$REPO_DIR/VERSION" "$INSTALL_DIR/"
|
||||||
|
echo "$REPO_DIR" > "$INSTALL_DIR/.repo_path"
|
||||||
|
|
||||||
|
log_info "Dependencies aktualisieren..."
|
||||||
|
"$INSTALL_DIR/venv/bin/pip" install --upgrade pip -q
|
||||||
|
"$INSTALL_DIR/venv/bin/pip" install --upgrade -r "$INSTALL_DIR/requirements.txt" -q
|
||||||
|
|
||||||
|
log_info "systemd-Unit aktualisieren..."
|
||||||
|
sed -e "s|__SERVICE_USER__|$SERVICE_USER|g" \
|
||||||
|
-e "s|__SERVICE_GROUP__|$SERVICE_GROUP|g" \
|
||||||
|
"$REPO_DIR/systemd/pdf-ocr-hotfolder.service" \
|
||||||
|
> "/etc/systemd/system/${SERVICE_NAME}.service"
|
||||||
|
systemctl daemon-reload
|
||||||
|
|
||||||
|
log_info "Berechtigungen setzen..."
|
||||||
|
chown -R "$SERVICE_USER:$SERVICE_GROUP" "$INSTALL_DIR"
|
||||||
|
|
||||||
|
log_info "Service starten..."
|
||||||
|
systemctl start "${SERVICE_NAME}.service"
|
||||||
|
sleep 2
|
||||||
|
|
||||||
|
if systemctl is-active --quiet "${SERVICE_NAME}.service"; then
|
||||||
|
log_info "✅ Service läuft (Version $NEW_VERSION)"
|
||||||
|
else
|
||||||
|
log_error "Service läuft nicht. journalctl -u $SERVICE_NAME -n 30"
|
||||||
|
exit 1
|
||||||
|
fi
|
||||||
Reference in New Issue
Block a user