Compare commits
4 Commits
| Author | SHA1 | Date |
|---|---|---|
| | cbdc9d6664 | |
| | a23a3968ef | |
| | 9cdc9ae443 | |
| | 6f7cadfc63 | |
@@ -4,6 +4,7 @@ __pycache__/
|
|||||||
venv/
|
venv/
|
||||||
env/
|
env/
|
||||||
.venv/
|
.venv/
|
||||||
|
.pytest_cache/
|
||||||
*.egg-info/
|
*.egg-info/
|
||||||
build/
|
build/
|
||||||
dist/
|
dist/
|
||||||
|
|||||||
@@ -1,5 +1,55 @@
|
|||||||
# Changelog
|
# Changelog
|
||||||
|
|
||||||
|
## [0.3.1] - 2026-04-10

### Fixed
- **Issue #4**: LXC/container compatibility. systemd hardening (`PrivateTmp`, `ProtectSystem`, etc.)
  causes error 226/NAMESPACE in LXC containers. The installer detects a container environment automatically
  and offers a drop-in. `systemd/lxc-compat.conf` is also included in the repo as a template.
- **Issue #5**: Added `WorkingDirectory=/opt/pdf-ocr-hotfolder` to the systemd template unit;
  without this entry the Python module could not be found.
- **Issue #6**: On Debian 12 with an affected Ghostscript version (10.0.0–10.02.0), the installer
  now offers to enable bookworm-backports and upgrade GS automatically (instead of only warning).

## [0.3.0] - 2026-04-09

### Added
- New config section `[output]` with:
  - `name_mode`: placement of the tag in the file name: `"prefix"`, `"suffix"` (before the extension), `"none"`
  - `name_tag`: string inserted verbatim, e.g. `"OCR_"` or `"_OCR"`
  - `original_on_success`: `"delete"` (previous default) or `"archive"`
  - `archive_dir`: target directory for `"archive"`, with collision protection (timestamp suffix)
- Runtime validation of the output config in `check_output_config()`
- 20 new tests for `build_output_name()`, `check_output_config()` and `process_pdf()`
  covering all combinations of naming mode and original handling

### Changed
- `process_pdf()` now takes `output_cfg: OutputConfig` as a required argument

## [0.2.2] - 2026-04-09

### Fixed
- **Issue #3**: Ghostscript 10.0.0–10.02.0 (Debian 12 default) breaks OCR with PDF/A + `skip_text=true`.
  - `config.example.toml`: `pdfa_level = ""` as the safe default
  - Runtime preflight: checks `gs --version` when `pdfa_level` is set and aborts with a clear error message
  - `install.sh`: warns on affected GS versions with an upgrade hint pointing to bookworm-backports

### Added
- `is_ghostscript_broken()` / `detect_ghostscript_version()` in `pdf_ocr_hotfolder.service`
- 19 additional pytest tests for GS version detection (parametrized) and preflight combinations

## [0.2.1] - 2026-04-09

### Fixed
- **Issue #1**: The preflight check at startup now verifies `tesseract` and `gs` (Ghostscript). If a dependency is missing, the service exits immediately with exit code 2 and a clear error message instead of failing on the first file.
- **Issue #2**: `--once` mode now returns exit code `1` as soon as **at least one** PDF has failed. Exit code `0` only on complete success (including "no files present"). Exit code `2` on preflight failure.

### Added
- Public API: `HotfolderService.run_once()`, `.success_count`, `.error_count`, `.ensure_dirs()`
- `check_preflight()` / `PreflightError` in `pdf_ocr_hotfolder.service`
- pytest test suite (`tests/`) with 11 tests covering all scenarios from Issue #1 and #2
- The `ocrmypdf` import in `processor.py` is now lazy (tests can run without ocrmypdf installed)
|
||||||
|
|
||||||
## [0.2.0] - 2026-04-08
|
## [0.2.0] - 2026-04-08
|
||||||
|
|
||||||
### Added
|
### Added
|
||||||
|
|||||||
@@ -89,6 +89,22 @@ max_workers = 2 # parallele PDFs
|
|||||||
timeout = 1800
|
timeout = 1800
|
||||||
```
|
```
|
||||||
|
|
||||||
|
### `[output]`
```toml
# File name in outgoing/:
#   "prefix" → OCR_scan.pdf
#   "suffix" → scan_OCR.pdf  (before the extension)
#   "none"   → scan.pdf      (unchanged)
name_mode = "prefix"
name_tag  = "OCR_"

# What happens to the original after successful OCR:
#   "delete"  → delete it
#   "archive" → move it to archive_dir
original_on_success = "delete"
archive_dir = ""   # absolute path, required for "archive"
```
|
||||||
|
|
||||||
### `[upload.nextcloud]`
|
### `[upload.nextcloud]`
|
||||||
```toml
|
```toml
|
||||||
enabled = true
|
enabled = true
|
||||||
@@ -174,6 +190,24 @@ Service-User braucht **rw** auf alle vier Verzeichnisse unter `/var/lib/pdf-ocr-
|
|||||||
sudo chown -R DOMAIN\\scanuser:DOMAIN\\scangroup /var/lib/pdf-ocr-hotfolder
|
sudo chown -R DOMAIN\\scanuser:DOMAIN\\scangroup /var/lib/pdf-ocr-hotfolder
|
||||||
```
|
```
|
||||||
|
|
||||||
|
### LXC/Container: Error 226/NAMESPACE
In LXC containers, systemd hardening options fail. The installer detects containers automatically and offers a drop-in. To apply it manually:
```bash
sudo mkdir -p /etc/systemd/system/pdf-ocr-hotfolder@.service.d/
sudo cp /opt/pdf-ocr-hotfolder/systemd/lxc-compat.conf \
  /etc/systemd/system/pdf-ocr-hotfolder@.service.d/
sudo systemctl daemon-reload
sudo systemctl restart 'pdf-ocr-hotfolder@*'
```

### Ghostscript PDF/A bug on Debian 12
GS 10.00.0–10.02.0 (Debian 12 default) breaks OCR when `pdfa_level` is combined with `skip_text=true`. The installer offers bookworm-backports automatically. To upgrade manually:
```bash
echo 'deb http://deb.debian.org/debian bookworm-backports main' | \
  sudo tee /etc/apt/sources.list.d/bookworm-backports.list
sudo apt update && sudo apt install -t bookworm-backports ghostscript
```
|
||||||
|
|
||||||
### veraPDF validation always fails
Check the veraPDF binary (`[verapdf].binary`). If validation is not strictly required, set `enabled = false`.
|
||||||
|
|
||||||
@@ -205,5 +239,5 @@ MIT — © Sonith UG
|
|||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
**Version:** 0.2.0
|
**Version:** 0.3.1
|
||||||
**Repo:** https://gitea.sonith.de/sonith_ug/pdf-ocr-hotfolder
|
**Repo:** https://gitea.sonith.de/sonith_ug/pdf-ocr-hotfolder
|
||||||
|
|||||||
+20
-1
@@ -21,7 +21,10 @@ skip_text = true
|
|||||||
# Auflösung für gerasterte Seiten
|
# Auflösung für gerasterte Seiten
|
||||||
oversample = 300
|
oversample = 300
|
||||||
# PDF/A-Konformitätsstufe ("1", "2", "3" oder leer für keinen PDF/A-Output)
|
# PDF/A-Konformitätsstufe ("1", "2", "3" oder leer für keinen PDF/A-Output)
|
||||||
pdfa_level = "2"
|
# WARNING: Ghostscript 10.0.0 through 10.02.0 (the Debian 12 default!) has a bug
# that completely blocks ocrmypdf when pdfa_level is combined with skip_text=true.
# The safe default is ""; only set "1"/"2"/"3" if gs >= 10.02.1 is installed.
pdfa_level = ""
|
||||||
# Schiefe Scans automatisch begradigen
|
# Schiefe Scans automatisch begradigen
|
||||||
deskew = true
|
deskew = true
|
||||||
# Hintergrund säubern
|
# Hintergrund säubern
|
||||||
@@ -31,6 +34,22 @@ max_workers = 2
|
|||||||
# Timeout pro PDF in Sekunden
|
# Timeout pro PDF in Sekunden
|
||||||
timeout = 1800
|
timeout = 1800
|
||||||
|
|
||||||
|
[output]
# How should the target file in the outgoing/ folder be named?
#   "prefix" : name_tag is placed before the file name (OCR_scan.pdf)
#   "suffix" : name_tag is placed before the extension (scan_OCR.pdf)
#   "none"   : file name stays the same as the original
name_mode = "prefix"
# String inserted verbatim. Empty string = no tag (same as mode="none").
# Examples: "OCR_", "[OCR]_", "_OCR", "_searchable"
name_tag = "OCR_"
# What happens to the original when OCR succeeded?
#   "delete"  : original is deleted (previous default)
#   "archive" : original is moved to archive_dir
original_on_success = "delete"
# Absolute path; only relevant when original_on_success = "archive"
archive_dir = ""
|
||||||
|
|
||||||
[verapdf]
|
[verapdf]
|
||||||
# PDF/A-Validierung (optional)
|
# PDF/A-Validierung (optional)
|
||||||
enabled = false
|
enabled = false
|
||||||
|
|||||||
+54
@@ -52,6 +52,60 @@ install_base() {
|
|||||||
icc-profiles-free ca-certificates curl
|
icc-profiles-free ca-certificates curl
|
||||||
log_info "System-Pakete ok ✓"
|
log_info "System-Pakete ok ✓"
|
||||||
|
|
||||||
|
    # Ghostscript version check (Issue #3 + Issue #6)
    if command -v gs >/dev/null 2>&1; then
        GS_VER="$(gs --version 2>/dev/null || echo 0.0)"
        log_info "Ghostscript: $GS_VER"
        case "$GS_VER" in
            10.0.0|10.00.0|10.01.*|10.02.0)
                echo
                log_warn "═══════════════════════════════════════════════════════════════"
                log_warn "Ghostscript $GS_VER is affected by the PDF/A bug (10.0.0–10.02.0)."
                log_warn "With pdfa_level + skip_text=true, ocrmypdf cannot process ANY PDFs."
                log_warn "═══════════════════════════════════════════════════════════════"
                echo
                # Check for Debian bookworm (12) and offer backports
                if grep -q 'bookworm' /etc/os-release 2>/dev/null; then
                    read -r -p "Upgrade Ghostscript via bookworm-backports? [Y/n]: " UPGRADE_GS
                    UPGRADE_GS="${UPGRADE_GS:-Y}"
                    if [[ "$UPGRADE_GS" =~ ^[JjYy]$ ]]; then
                        log_info "Enabling bookworm-backports..."
                        if ! grep -q 'bookworm-backports' /etc/apt/sources.list /etc/apt/sources.list.d/*.list 2>/dev/null; then
                            echo 'deb http://deb.debian.org/debian bookworm-backports main' \
                                > /etc/apt/sources.list.d/bookworm-backports.list
                            apt-get update -qq
                        fi
                        apt-get install -y -t bookworm-backports ghostscript
                        GS_VER_NEW="$(gs --version 2>/dev/null || echo '?')"
                        log_info "Ghostscript upgraded: $GS_VER → $GS_VER_NEW ✓"
                    else
                        log_warn "Workaround: set [ocr].pdfa_level = \"\" in the config (default since v0.2.2)"
                    fi
                else
                    log_warn "Debian bookworm not detected; manual upgrade required."
                    log_warn "Workaround: set [ocr].pdfa_level = \"\" in the config (default since v0.2.2)"
                fi
                echo
                ;;
        esac
    fi
|
||||||
|
|
||||||
|
    # LXC/container detection (Issue #4)
    if systemd-detect-virt --container -q 2>/dev/null; then
        VIRT_TYPE="$(systemd-detect-virt --container 2>/dev/null || echo 'container')"
        log_warn "Container environment detected ($VIRT_TYPE)."
        log_warn "systemd hardening can fail in containers (error 226/NAMESPACE)."
        read -r -p "Install LXC compatibility drop-in? [Y/n]: " LXC_FIX
        LXC_FIX="${LXC_FIX:-Y}"
        if [[ "$LXC_FIX" =~ ^[JjYy]$ ]]; then
            local LXC_DROPIN_DIR="/etc/systemd/system/pdf-ocr-hotfolder@.service.d"
            mkdir -p "$LXC_DROPIN_DIR"
            cp "$REPO_DIR/systemd/lxc-compat.conf" "$LXC_DROPIN_DIR/lxc-compat.conf"
            systemctl daemon-reload
            log_info "LXC compatibility drop-in installed ✓"
        fi
    fi
|
||||||
|
|
||||||
log_step "Default-User '$DEFAULT_USER' prüfen"
|
log_step "Default-User '$DEFAULT_USER' prüfen"
|
||||||
if id "$DEFAULT_USER" &>/dev/null; then
|
if id "$DEFAULT_USER" &>/dev/null; then
|
||||||
log_info "'$DEFAULT_USER' existiert bereits"
|
log_info "'$DEFAULT_USER' existiert bereits"
|
||||||
|
|||||||
@@ -1,3 +1,3 @@
|
|||||||
"""PDF OCR Hotfolder — Scanner-PDFs automatisch durchsuchbar machen."""
|
"""PDF OCR Hotfolder — Scanner-PDFs automatisch durchsuchbar machen."""
|
||||||
|
|
||||||
__version__ = "0.1.0"
|
__version__ = "0.3.1"
|
||||||
|
|||||||
@@ -8,7 +8,7 @@ from pathlib import Path
|
|||||||
|
|
||||||
from . import __version__
|
from . import __version__
|
||||||
from .config import load_config
|
from .config import load_config
|
||||||
from .service import HotfolderService
|
from .service import HotfolderService, PreflightError
|
||||||
|
|
||||||
|
|
||||||
def _setup_logging(level: str) -> None:
|
def _setup_logging(level: str) -> None:
|
||||||
@@ -40,14 +40,20 @@ def main() -> int:
|
|||||||
_setup_logging(cfg.log_level)
|
_setup_logging(cfg.log_level)
|
||||||
|
|
||||||
service = HotfolderService(cfg)
|
service = HotfolderService(cfg)
|
||||||
|
|
||||||
if args.once:
|
if args.once:
|
||||||
service._ensure_dirs() # noqa: SLF001
|
try:
|
||||||
service._scan_existing() # noqa: SLF001
|
errors = service.run_once()
|
||||||
service._executor.shutdown(wait=True) # noqa: SLF001
|
except PreflightError as e:
|
||||||
return 0
|
print(f"ERROR: {e}", file=sys.stderr)
|
||||||
|
return 2
|
||||||
|
return 1 if errors > 0 else 0
|
||||||
|
|
||||||
try:
|
try:
|
||||||
service.run()
|
service.run()
|
||||||
|
except PreflightError as e:
|
||||||
|
print(f"ERROR: {e}", file=sys.stderr)
|
||||||
|
return 2
|
||||||
except KeyboardInterrupt:
|
except KeyboardInterrupt:
|
||||||
pass
|
pass
|
||||||
return 0
|
return 0
|
||||||
|
|||||||
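The exit-code contract of the `--once` path in the hunk above can be summarised in a small sketch. `exit_code_for` and the stand-in `PreflightError` class are hypothetical names used only for illustration; the actual logic lives inline in `main()`.

```python
class PreflightError(RuntimeError):
    """Stand-in for pdf_ocr_hotfolder.service.PreflightError (illustration only)."""


def exit_code_for(run_once) -> int:
    """Map a one-shot run to the exit codes described in the changelog.

    `run_once` is any callable that returns the number of failed PDFs
    or raises PreflightError when a dependency or config check fails.
    """
    try:
        errors = run_once()
    except PreflightError:
        return 2  # preflight failure
    return 1 if errors > 0 else 0  # 1 = at least one PDF failed, 0 = full success
```

Keeping the mapping in one place like this would make it easy to test without touching argument parsing, although the diff itself keeps it inside `main()`.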
@@ -28,6 +28,18 @@ class OcrConfig:
|
|||||||
timeout: int = 1800
|
timeout: int = 1800
|
||||||
|
|
||||||
|
|
||||||
|
@dataclass
class OutputConfig:
    # "prefix" | "suffix" | "none"
    name_mode: str = "prefix"
    # Tag string, inserted verbatim (empty string = no tag)
    name_tag: str = "OCR_"
    # "delete" | "archive"
    original_on_success: str = "delete"
    # Absolute path; required when original_on_success == "archive"
    archive_dir: str = ""
|
||||||
|
|
||||||
|
|
||||||
@dataclass
|
@dataclass
|
||||||
class VeraPdfConfig:
|
class VeraPdfConfig:
|
||||||
enabled: bool = False
|
enabled: bool = False
|
||||||
@@ -79,6 +91,7 @@ class EmailNotify:
|
|||||||
class Config:
|
class Config:
|
||||||
paths: Paths
|
paths: Paths
|
||||||
ocr: OcrConfig
|
ocr: OcrConfig
|
||||||
|
output: OutputConfig
|
||||||
verapdf: VeraPdfConfig
|
verapdf: VeraPdfConfig
|
||||||
folder: FolderUpload
|
folder: FolderUpload
|
||||||
nextcloud: NextcloudUpload
|
nextcloud: NextcloudUpload
|
||||||
@@ -109,6 +122,8 @@ def load_config(path: str | Path) -> Config:
|
|||||||
|
|
||||||
ocr = OcrConfig(**{k: v for k, v in _section(data, "ocr").items()
|
ocr = OcrConfig(**{k: v for k, v in _section(data, "ocr").items()
|
||||||
if k in OcrConfig.__annotations__})
|
if k in OcrConfig.__annotations__})
|
||||||
|
output = OutputConfig(**{k: v for k, v in _section(data, "output").items()
|
||||||
|
if k in OutputConfig.__annotations__})
|
||||||
verapdf = VeraPdfConfig(**{k: v for k, v in _section(data, "verapdf").items()
|
verapdf = VeraPdfConfig(**{k: v for k, v in _section(data, "verapdf").items()
|
||||||
if k in VeraPdfConfig.__annotations__})
|
if k in VeraPdfConfig.__annotations__})
|
||||||
folder = FolderUpload(**{k: v for k, v in _section(data, "upload", "folder").items()
|
folder = FolderUpload(**{k: v for k, v in _section(data, "upload", "folder").items()
|
||||||
@@ -123,7 +138,7 @@ def load_config(path: str | Path) -> Config:
|
|||||||
log_level = _section(data, "logging").get("level", "INFO")
|
log_level = _section(data, "logging").get("level", "INFO")
|
||||||
|
|
||||||
return Config(
|
return Config(
|
||||||
paths=paths, ocr=ocr, verapdf=verapdf,
|
paths=paths, ocr=ocr, output=output, verapdf=verapdf,
|
||||||
folder=folder, nextcloud=nextcloud, sftp=sftp, email=email,
|
folder=folder, nextcloud=nextcloud, sftp=sftp, email=email,
|
||||||
log_level=log_level,
|
log_level=log_level,
|
||||||
)
|
)
|
||||||
|
|||||||
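`load_config()` builds each section by keeping only the keys that are declared on the dataclass. The sketch below reproduces that pattern in isolation; `build_output_config` is a hypothetical helper name, and the input dict stands in for a parsed `[output]` TOML table.

```python
from dataclasses import dataclass


@dataclass
class OutputConfig:
    # Same fields and defaults as the dataclass added in this diff.
    name_mode: str = "prefix"
    name_tag: str = "OCR_"
    original_on_success: str = "delete"
    archive_dir: str = ""


def build_output_config(section: dict) -> OutputConfig:
    """Keep only keys that are declared fields; unknown keys are dropped."""
    known = {k: v for k, v in section.items() if k in OutputConfig.__annotations__}
    return OutputConfig(**known)


# "typo_key" is not a field, so it is ignored instead of raising TypeError.
cfg = build_output_config({"name_mode": "suffix", "name_tag": "_OCR", "typo_key": 1})
print(cfg)
```

One consequence of this design is that a misspelled key in the config file is silently ignored and the default is used instead, which is worth keeping in mind when debugging an `[output]` section that seems to have no effect.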
@@ -7,13 +7,37 @@ import subprocess
|
|||||||
from dataclasses import dataclass
|
from dataclasses import dataclass
|
||||||
from pathlib import Path
|
from pathlib import Path
|
||||||
|
|
||||||
import ocrmypdf
|
from .config import OcrConfig, OutputConfig, VeraPdfConfig
|
||||||
|
|
||||||
from .config import OcrConfig, VeraPdfConfig
|
|
||||||
|
|
||||||
log = logging.getLogger(__name__)
|
log = logging.getLogger(__name__)
|
||||||
|
|
||||||
|
|
||||||
|
def build_output_name(src_name: str, mode: str, tag: str) -> str:
    """Build the target file name for an OCR'd PDF.

    Args:
        src_name: original file name (e.g. "scan.pdf")
        mode: "prefix" | "suffix" | "none"
        tag: string to insert (verbatim, empty = no tag)

    Examples:
        prefix "OCR_": "scan.pdf" -> "OCR_scan.pdf"
        suffix "_OCR": "scan.pdf" -> "scan_OCR.pdf"
        suffix "_OCR": "scan.tar.gz.pdf" -> "scan.tar.gz_OCR.pdf"
        none: "scan.pdf" -> "scan.pdf"
    """
    if mode == "none" or not tag:
        return src_name
    if mode == "prefix":
        return f"{tag}{src_name}"
    if mode == "suffix":
        # Split off only the last extension; otherwise "foo.bar.pdf" would be mangled
        p = Path(src_name)
        stem, ext = p.stem, p.suffix
        return f"{stem}{tag}{ext}"
    raise ValueError(f"Unknown name_mode: {mode!r}")
|
||||||
|
|
||||||
|
|
||||||
@dataclass
|
@dataclass
|
||||||
class ProcessResult:
|
class ProcessResult:
|
||||||
source: Path
|
source: Path
|
||||||
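The `"suffix"` branch of `build_output_name()` relies on `pathlib` splitting off only the last extension. A minimal sketch of that behaviour, using a hypothetical helper `tag_before_extension` that isolates just this step:

```python
from pathlib import Path


def tag_before_extension(name: str, tag: str) -> str:
    """Insert `tag` before the final extension of `name`."""
    p = Path(name)
    # Path.stem keeps earlier dots; Path.suffix is only the last ".ext" (or "").
    return f"{p.stem}{tag}{p.suffix}"


print(tag_before_extension("scan.pdf", "_OCR"))         # scan_OCR.pdf
print(tag_before_extension("scan.tar.gz.pdf", "_OCR"))  # scan.tar.gz_OCR.pdf
print(tag_before_extension("README", "_OCR"))           # README_OCR
```

A name without any extension simply gets the tag appended, since `Path.suffix` is then the empty string.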
@@ -25,6 +49,8 @@ class ProcessResult:
|
|||||||
|
|
||||||
def run_ocr(src: Path, dst: Path, cfg: OcrConfig) -> None:
|
def run_ocr(src: Path, dst: Path, cfg: OcrConfig) -> None:
|
||||||
"""Führt ocrmypdf als Library-Call aus (kein Subprozess-Overhead)."""
|
"""Führt ocrmypdf als Library-Call aus (kein Subprozess-Overhead)."""
|
||||||
|
import ocrmypdf  # lazy, so tests can run without ocrmypdf installed
|
||||||
|
|
||||||
kwargs: dict = {
|
kwargs: dict = {
|
||||||
"language": cfg.languages,
|
"language": cfg.languages,
|
||||||
"jobs": cfg.jobs,
|
"jobs": cfg.jobs,
|
||||||
@@ -71,11 +97,13 @@ def process_pdf(
|
|||||||
error_dir: Path,
|
error_dir: Path,
|
||||||
ocr_cfg: OcrConfig,
|
ocr_cfg: OcrConfig,
|
||||||
vera_cfg: VeraPdfConfig,
|
vera_cfg: VeraPdfConfig,
|
||||||
|
output_cfg: OutputConfig,
|
||||||
) -> ProcessResult:
|
) -> ProcessResult:
|
||||||
"""Verarbeitet eine einzelne PDF: move→OCR→validate→outgoing/error."""
|
"""Verarbeitet eine einzelne PDF: move→OCR→validate→outgoing/error."""
|
||||||
|
out_name = build_output_name(src.name, output_cfg.name_mode, output_cfg.name_tag)
|
||||||
work_src = working_dir / src.name
|
work_src = working_dir / src.name
|
||||||
work_out = working_dir / f"OCR_{src.name}"
|
work_out = working_dir / f"__ocr_{out_name}"  # temporary name, so that it differs from src.name
|
||||||
final_out = outgoing_dir / f"OCR_{src.name}"
|
final_out = outgoing_dir / out_name
|
||||||
|
|
||||||
try:
|
try:
|
||||||
shutil.move(str(src), str(work_src))
|
shutil.move(str(src), str(work_src))
|
||||||
@@ -100,10 +128,38 @@ def process_pdf(
|
|||||||
|
|
||||||
outgoing_dir.mkdir(parents=True, exist_ok=True)
|
outgoing_dir.mkdir(parents=True, exist_ok=True)
|
||||||
shutil.move(str(work_out), str(final_out))
|
shutil.move(str(work_out), str(final_out))
|
||||||
work_src.unlink(missing_ok=True)
|
_dispose_original(work_src, src.name, output_cfg)
|
||||||
return ProcessResult(src, final_out, True, verapdf_passed=vera_ok)
|
return ProcessResult(src, final_out, True, verapdf_passed=vera_ok)
|
||||||
|
|
||||||
|
|
||||||
|
def _dispose_original(work_src: Path, original_name: str, cfg: OutputConfig) -> None:
    """Dispose of the original after successful OCR: delete or archive it."""
    if not work_src.exists():
        return
    mode = cfg.original_on_success
    if mode == "delete":
        work_src.unlink(missing_ok=True)
        return
    if mode == "archive":
        if not cfg.archive_dir:
            log.error("original_on_success=archive but archive_dir is empty; deleting instead")
            work_src.unlink(missing_ok=True)
            return
        archive = Path(cfg.archive_dir)
        archive.mkdir(parents=True, exist_ok=True)
        dest = archive / original_name
        # On a name collision, rename with a timestamp
        if dest.exists():
            from datetime import datetime
            ts = datetime.now().strftime("%Y%m%d-%H%M%S")
            dest = archive / f"{dest.stem}_{ts}{dest.suffix}"
        shutil.move(str(work_src), str(dest))
        log.info("Original archived: %s", dest)
        return
    log.warning("Unknown original_on_success=%r; deleting instead", mode)
    work_src.unlink(missing_ok=True)
|
||||||
|
|
||||||
|
|
||||||
def _move_to_error(p: Path, error_dir: Path) -> None:
|
def _move_to_error(p: Path, error_dir: Path) -> None:
|
||||||
error_dir.mkdir(parents=True, exist_ok=True)
|
error_dir.mkdir(parents=True, exist_ok=True)
|
||||||
try:
|
try:
|
||||||
|
|||||||
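The collision handling in `_dispose_original()` can be tried out in isolation. The sketch below extracts only the destination-name logic into a hypothetical helper, `archive_destination`, and runs it against a temporary directory.

```python
import tempfile
from datetime import datetime
from pathlib import Path


def archive_destination(archive: Path, original_name: str) -> Path:
    """Return the archive path, adding a timestamp if the name is already taken."""
    dest = archive / original_name
    if dest.exists():
        ts = datetime.now().strftime("%Y%m%d-%H%M%S")
        dest = archive / f"{dest.stem}_{ts}{dest.suffix}"
    return dest


with tempfile.TemporaryDirectory() as tmp:
    archive = Path(tmp)
    first = archive_destination(archive, "scan.pdf")   # name is free
    first.write_bytes(b"first original")
    second = archive_destination(archive, "scan.pdf")  # name is taken now
    print(first.name, second.name)
```

Because the timestamp has one-second resolution, two identically named originals archived within the same second would still map to the same destination; a counter or a sub-second component would be one way to close that gap if it matters in practice.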
@@ -2,7 +2,10 @@
|
|||||||
from __future__ import annotations
|
from __future__ import annotations
|
||||||
|
|
||||||
import logging
|
import logging
|
||||||
|
import re
|
||||||
|
import shutil
|
||||||
import signal
|
import signal
|
||||||
|
import subprocess
|
||||||
import threading
|
import threading
|
||||||
import time
|
import time
|
||||||
from concurrent.futures import Future, ThreadPoolExecutor
|
from concurrent.futures import Future, ThreadPoolExecutor
|
||||||
@@ -18,6 +21,98 @@ from .uploaders import notify_email, upload_folder, upload_nextcloud, upload_sft
|
|||||||
log = logging.getLogger(__name__)
|
log = logging.getLogger(__name__)
|
||||||
|
|
||||||
|
|
||||||
|
class PreflightError(RuntimeError):
    """Required external binaries are missing."""


# Required binaries for ocrmypdf
_REQUIRED_BINARIES = ("tesseract", "gs")

# Ghostscript versions with the known PDF/A + skip_text bug (Issue #3):
# 10.0.0 .. 10.02.0 (inclusive). Usable again from 10.02.1.
_GS_BROKEN_MIN = (10, 0, 0)
_GS_BROKEN_MAX = (10, 2, 0)


def _parse_version(text: str) -> tuple[int, ...] | None:
    """Extract the first X.Y[.Z] version from a string."""
    m = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if not m:
        return None
    return tuple(int(x) if x is not None else 0 for x in m.groups())


def is_ghostscript_broken(version: str | None) -> bool:
    """Check whether a Ghostscript version is affected by the PDF/A + skip_text bug.

    Affects 10.0.0 up to and including 10.02.0. Safe again from 10.02.1.
    """
    if not version:
        return False
    parsed = _parse_version(version)
    if parsed is None:
        return False
    # Normalize to a 3-tuple
    while len(parsed) < 3:
        parsed = parsed + (0,)
    parsed = parsed[:3]
    return _GS_BROKEN_MIN <= parsed <= _GS_BROKEN_MAX


def detect_ghostscript_version() -> str | None:
    """Call `gs --version` and return the version string (or None)."""
    gs = shutil.which("gs")
    if gs is None:
        return None
    try:
        result = subprocess.run([gs, "--version"], capture_output=True,
                                text=True, timeout=5)
    except (OSError, subprocess.TimeoutExpired):
        return None
    return result.stdout.strip() or None


def check_output_config(mode: str, archive_dir: str) -> None:
    """Validate the [output] section. Raises PreflightError on problems."""
    valid_modes = {"delete", "archive"}
    if mode not in valid_modes:
        raise PreflightError(
            f"[output].original_on_success={mode!r} is invalid. "
            f"Allowed: {sorted(valid_modes)}"
        )
    if mode == "archive" and not archive_dir:
        raise PreflightError(
            "[output].original_on_success='archive' requires [output].archive_dir"
        )


def check_preflight(pdfa_level: str = "") -> None:
    """Check external dependencies.

    - Tesseract and Ghostscript must be on the PATH
    - If pdfa_level is set, the Ghostscript version is checked against the
      known 10.0.0–10.02.0 bug

    Raises PreflightError on missing binaries or an unsafe Ghostscript.
    """
    missing = [b for b in _REQUIRED_BINARIES if shutil.which(b) is None]
    if missing:
        raise PreflightError(
            "Missing dependencies: " + ", ".join(missing)
            + ". Please install: sudo apt install tesseract-ocr ghostscript"
        )

    if pdfa_level:
        gs_version = detect_ghostscript_version()
        if is_ghostscript_broken(gs_version):
            raise PreflightError(
                f"Ghostscript {gs_version} is not compatible with pdfa_level='{pdfa_level}' "
                "(known bug in 10.0.0–10.02.0). "
                "Either upgrade ghostscript to >=10.02.1 (e.g. via bookworm-backports) "
                "or set [ocr].pdfa_level = \"\" in the config."
            )
|
||||||
|
|
||||||
|
|
||||||
def _is_pdf(path: Path) -> bool:
|
def _is_pdf(path: Path) -> bool:
|
||||||
return path.suffix.lower() == ".pdf" and path.is_file()
|
return path.suffix.lower() == ".pdf" and path.is_file()
|
||||||
|
|
||||||
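The version check works on integer tuples rather than strings, which is what makes `10.00.0`, `10.0.0` and `10.0` compare as the same release. A minimal sketch of that idea, with `parse_version`, `BROKEN_MIN` and `BROKEN_MAX` as illustrative names that mirror the private ones in the diff:

```python
import re
from typing import Optional, Tuple

BROKEN_MIN = (10, 0, 0)
BROKEN_MAX = (10, 2, 0)


def parse_version(text: str) -> Optional[Tuple[int, int, int]]:
    """Extract the first X.Y[.Z] version; a missing patch level counts as 0."""
    m = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if not m:
        return None
    major, minor, patch = (int(g) if g is not None else 0 for g in m.groups())
    return (major, minor, patch)


for v in ("10.00.0", "10.02.0", "10.02.1", "9.56.1"):
    parsed = parse_version(v)
    # Tuples compare element by element, so zero-padded fields are harmless.
    print(v, parsed, BROKEN_MIN <= parsed <= BROKEN_MAX)
```

Comparing the raw strings instead would treat `"10.02.0"` and `"10.2.0"` as different values, so the conversion to integers is the essential step here.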
@@ -70,10 +165,20 @@ class HotfolderService:
|
|||||||
self._stop = threading.Event()
|
self._stop = threading.Event()
|
||||||
self._inflight: set[str] = set()
|
self._inflight: set[str] = set()
|
||||||
self._lock = threading.Lock()
|
self._lock = threading.Lock()
|
||||||
|
self._success_count = 0
|
||||||
|
self._error_count = 0
|
||||||
|
|
||||||
|
@property
|
||||||
|
def success_count(self) -> int:
|
||||||
|
return self._success_count
|
||||||
|
|
||||||
|
@property
|
||||||
|
def error_count(self) -> int:
|
||||||
|
return self._error_count
|
||||||
|
|
||||||
# ---- Setup ----
|
# ---- Setup ----
|
||||||
|
|
||||||
def _ensure_dirs(self) -> None:
|
def ensure_dirs(self) -> None:
|
||||||
for p in (self.cfg.paths.incoming, self.cfg.paths.outgoing,
|
for p in (self.cfg.paths.incoming, self.cfg.paths.outgoing,
|
||||||
self.cfg.paths.working, self.cfg.paths.error):
|
self.cfg.paths.working, self.cfg.paths.error):
|
||||||
p.mkdir(parents=True, exist_ok=True)
|
p.mkdir(parents=True, exist_ok=True)
|
||||||
@@ -81,7 +186,10 @@ class HotfolderService:
|
|||||||
# ---- Lifecycle ----
|
# ---- Lifecycle ----
|
||||||
|
|
||||||
def run(self) -> None:
|
def run(self) -> None:
|
||||||
self._ensure_dirs()
|
check_preflight(self.cfg.ocr.pdfa_level)
|
||||||
|
check_output_config(self.cfg.output.original_on_success,
|
||||||
|
self.cfg.output.archive_dir)
|
||||||
|
self.ensure_dirs()
|
||||||
self._scan_existing()
|
self._scan_existing()
|
||||||
|
|
||||||
self._observer = Observer()
|
self._observer = Observer()
|
||||||
@@ -98,6 +206,22 @@ class HotfolderService:
|
|||||||
finally:
|
finally:
|
||||||
self.shutdown()
|
self.shutdown()
|
||||||
|
|
||||||
|
    def run_once(self) -> int:
        """Process all PDFs already present in the incoming folder, then exit.

        Returns:
            Number of failed PDFs (0 = everything ok).
        """
        check_preflight(self.cfg.ocr.pdfa_level)
        check_output_config(self.cfg.output.original_on_success,
                            self.cfg.output.archive_dir)
        self.ensure_dirs()
        self._scan_existing()
        self._executor.shutdown(wait=True)
        log.info("One-shot done: %d ok, %d errors",
                 self._success_count, self._error_count)
        return self._error_count
|
||||||
|
|
||||||
def shutdown(self) -> None:
|
def shutdown(self) -> None:
|
||||||
log.info("Shutdown läuft...")
|
log.info("Shutdown läuft...")
|
||||||
if self._observer:
|
if self._observer:
|
||||||
@@ -148,8 +272,15 @@ class HotfolderService:
|
|||||||
error_dir=self.cfg.paths.error,
|
error_dir=self.cfg.paths.error,
|
||||||
ocr_cfg=self.cfg.ocr,
|
ocr_cfg=self.cfg.ocr,
|
||||||
vera_cfg=self.cfg.verapdf,
|
vera_cfg=self.cfg.verapdf,
|
||||||
|
output_cfg=self.cfg.output,
|
||||||
)
|
)
|
||||||
|
|
||||||
|
with self._lock:
|
||||||
|
if result.success:
|
||||||
|
self._success_count += 1
|
||||||
|
else:
|
||||||
|
self._error_count += 1
|
||||||
|
|
||||||
if result.success:
|
if result.success:
|
||||||
self._dispatch_uploads(result.output)
|
self._dispatch_uploads(result.output)
|
||||||
self._notify(result)
|
self._notify(result)
|
||||||
|
|||||||
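The success and error counters are updated from worker threads, which is why the diff takes `self._lock` around the increments. The sketch below shows the same pattern on its own; the `Counters` class is a hypothetical stand-in, not part of the project.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class Counters:
    """Hypothetical helper mirroring the success/error bookkeeping."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.success = 0
        self.error = 0

    def record(self, ok: bool) -> None:
        # The lock keeps the read-modify-write of each counter atomic.
        with self._lock:
            if ok:
                self.success += 1
            else:
                self.error += 1


counters = Counters()
results = [True, False, True, True]
with ThreadPoolExecutor(max_workers=2) as pool:
    list(pool.map(counters.record, results))  # wait for all workers
print(counters.success, counters.error)  # 3 1
```

Reading the totals only after the executor has shut down, as `run_once()` does, means the final values are stable by the time they are turned into an exit code.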
@@ -0,0 +1,10 @@
|
|||||||
|
# Drop-in for LXC/container operation
# Copy to: /etc/systemd/system/pdf-ocr-hotfolder@.service.d/lxc-compat.conf
# Then: systemctl daemon-reload && systemctl restart 'pdf-ocr-hotfolder@*'

[Service]
PrivateTmp=false
ProtectSystem=false
ProtectKernelTunables=false
ProtectKernelModules=false
ProtectControlGroups=false
|
||||||
@@ -7,6 +7,7 @@ Wants=network-online.target
|
|||||||
Type=simple
|
Type=simple
|
||||||
User=pdfocr
|
User=pdfocr
|
||||||
Group=pdfocr
|
Group=pdfocr
|
||||||
|
WorkingDirectory=/opt/pdf-ocr-hotfolder
|
||||||
ExecStart=/opt/pdf-ocr-hotfolder/venv/bin/python -m pdf_ocr_hotfolder --config /etc/pdf-ocr-hotfolder/%i.toml
|
ExecStart=/opt/pdf-ocr-hotfolder/venv/bin/python -m pdf_ocr_hotfolder --config /etc/pdf-ocr-hotfolder/%i.toml
|
||||||
Restart=on-failure
|
Restart=on-failure
|
||||||
RestartSec=5
|
RestartSec=5
|
||||||
|
|||||||
@@ -0,0 +1,54 @@
|
|||||||
|
"""Gemeinsame pytest-Fixtures."""
|
||||||
|
from __future__ import annotations

from pathlib import Path

import pytest

from pdf_ocr_hotfolder.config import (
    Config,
    EmailNotify,
    FolderUpload,
    NextcloudUpload,
    OcrConfig,
    OutputConfig,
    Paths,
    SftpUpload,
    VeraPdfConfig,
)


@pytest.fixture
def tmp_config(tmp_path: Path) -> Config:
    """Minimal config using tmp_path directories, with all uploads disabled."""
    paths = Paths(
        incoming=tmp_path / "incoming",
        outgoing=tmp_path / "outgoing",
        working=tmp_path / "working",
        error=tmp_path / "error",
    )
    for p in (paths.incoming, paths.outgoing, paths.working, paths.error):
        p.mkdir(parents=True, exist_ok=True)

    return Config(
        paths=paths,
        ocr=OcrConfig(max_workers=1),
        output=OutputConfig(),
        verapdf=VeraPdfConfig(enabled=False),
        folder=FolderUpload(enabled=False),
        nextcloud=NextcloudUpload(enabled=False),
        sftp=SftpUpload(enabled=False),
        email=EmailNotify(enabled=False),
        log_level="DEBUG",
    )


@pytest.fixture
def dummy_pdf(tmp_config: Config) -> Path:
    """Place a file with a .pdf extension in the incoming folder.

    Note: this is not a real PDF. `process_pdf` is mocked in the tests.
    """
    pdf = tmp_config.paths.incoming / "test.pdf"
    pdf.write_bytes(b"%PDF-1.4 fake\n")
    return pdf
|
||||||
@@ -0,0 +1,72 @@
"""Tests for Issue #3: detection of the Ghostscript 10.0.0–10.02.0 PDF/A bug."""
from __future__ import annotations

from unittest.mock import patch

import pytest

from pdf_ocr_hotfolder.service import (
    PreflightError,
    check_preflight,
    is_ghostscript_broken,
)


@pytest.mark.parametrize("version,expected", [
    # Affected versions
    ("10.0.0", True),
    ("10.00.0", True),
    ("10.01.0", True),
    ("10.01.1", True),
    ("10.01.2", True),
    ("10.02.0", True),
    # Safe versions
    ("10.02.1", False),
    ("10.03.0", False),
    ("10.04.0", False),
    ("11.0.0", False),
    ("9.56.1", False),  # Debian 11 / Ubuntu 22.04
    ("9.55.0", False),
    # Edge cases
    ("", False),
    (None, False),
    ("garbage", False),
])
def test_is_ghostscript_broken(version, expected) -> None:
    assert is_ghostscript_broken(version) is expected


def test_check_preflight_without_pdfa_passes_with_broken_gs() -> None:
    """Without pdfa_level, the affected Ghostscript may be used."""
    with patch("pdf_ocr_hotfolder.service.shutil.which", return_value="/usr/bin/fake"), \
         patch("pdf_ocr_hotfolder.service.detect_ghostscript_version",
               return_value="10.0.0"):
        check_preflight(pdfa_level="")  # must not raise


def test_check_preflight_with_pdfa_fails_on_broken_gs() -> None:
    """With pdfa_level and a broken Ghostscript → PreflightError with a helpful message."""
    with patch("pdf_ocr_hotfolder.service.shutil.which", return_value="/usr/bin/fake"), \
         patch("pdf_ocr_hotfolder.service.detect_ghostscript_version",
               return_value="10.0.0"):
        with pytest.raises(PreflightError, match="Ghostscript 10.0.0"):
            check_preflight(pdfa_level="2")


def test_check_preflight_with_pdfa_passes_on_fixed_gs() -> None:
    """With pdfa_level and a fixed Ghostscript → OK."""
    with patch("pdf_ocr_hotfolder.service.shutil.which", return_value="/usr/bin/fake"), \
         patch("pdf_ocr_hotfolder.service.detect_ghostscript_version",
               return_value="10.02.1"):
        check_preflight(pdfa_level="2")  # must not raise


def test_default_config_pdfa_level_is_empty() -> None:
    """The example config file should ship with pdfa_level='' (Issue #3)."""
    from pathlib import Path
    import tomllib
    cfg_path = Path(__file__).parent.parent / "config.example.toml"
    with cfg_path.open("rb") as f:
        data = tomllib.load(f)
    assert data["ocr"]["pdfa_level"] == "", \
        "config.example.toml must have pdfa_level='' as the safe default"
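The parametrized table in `test_is_ghostscript_broken` pins down the expected behaviour fairly tightly, so a version check along the following lines would satisfy it. This is only a sketch: the real `is_ghostscript_broken` in `pdf_ocr_hotfolder.service` is not shown in this diff, and the function name, the regular expression, and the two range constants below are assumptions made for illustration.

```python
import re
from typing import Optional

# Assumption: the affected range is inclusive on both ends, as the test table suggests.
_FIRST_BROKEN = (10, 0, 0)
_LAST_BROKEN = (10, 2, 0)

# Three dot-separated numeric fields; "10.00.0" and "10.0.0" parse to the same tuple.
_VERSION_RE = re.compile(r"(\d+)\.(\d+)\.(\d+)")


def is_ghostscript_broken_sketch(version: Optional[str]) -> bool:
    """Return True if `version` lies in the affected range (hypothetical helper)."""
    if not version:
        # None and "" are treated as "unknown", which is not flagged as broken.
        return False
    match = _VERSION_RE.fullmatch(version.strip())
    if match is None:
        # Unparseable strings such as "garbage" are not flagged either.
        return False
    parsed = tuple(int(part) for part in match.groups())
    return _FIRST_BROKEN <= parsed <= _LAST_BROKEN
```

Comparing integer tuples rather than raw strings is what lets `"10.00.0"` and `"10.0.0"` behave identically and keeps `"9.56.1"` below `"10.0.0"`.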
@@ -0,0 +1,96 @@
"""Tests for Issue #2: --once mode must return a nonzero exit code on errors."""
from __future__ import annotations

from pathlib import Path
from unittest.mock import patch

from pdf_ocr_hotfolder.processor import ProcessResult
from pdf_ocr_hotfolder.service import HotfolderService


def _fake_success(src: Path, working_dir, outgoing_dir, error_dir, **kwargs):
    out = outgoing_dir / f"OCR_{src.name}"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_bytes(b"%PDF-1.4 ocr\n")
    src.unlink(missing_ok=True)
    return ProcessResult(src, out, True)


def _fake_failure(src: Path, working_dir, outgoing_dir, error_dir, **kwargs):
    error_dir.mkdir(parents=True, exist_ok=True)
    dest = error_dir / src.name
    src.rename(dest)
    return ProcessResult(src, outgoing_dir / f"OCR_{src.name}", False,
                         error="fake ocr failure")


def _run(tmp_config, fake_process):
    """Helper: runs run_once() with process_pdf and the preflight check mocked."""
    with patch("pdf_ocr_hotfolder.service.check_preflight", return_value=None), \
         patch("pdf_ocr_hotfolder.service.process_pdf", side_effect=fake_process), \
         patch("pdf_ocr_hotfolder.service._wait_until_stable", return_value=True):
        service = HotfolderService(tmp_config)
        try:
            return service.run_once()
        finally:
            service._executor.shutdown(wait=False)


def test_once_exit_0_when_no_files(tmp_config) -> None:
    """Scenario: no PDFs present → exit 0."""
    errors = _run(tmp_config, _fake_success)
    assert errors == 0


def test_once_exit_0_when_all_success(tmp_config) -> None:
    """Scenario: all PDFs succeed → exit 0."""
    (tmp_config.paths.incoming / "a.pdf").write_bytes(b"%PDF-1.4\n")
    (tmp_config.paths.incoming / "b.pdf").write_bytes(b"%PDF-1.4\n")

    errors = _run(tmp_config, _fake_success)
    assert errors == 0


def test_once_exit_nonzero_when_all_fail(tmp_config) -> None:
    """Scenario: all PDFs fail → exit != 0 (Issue #2)."""
    (tmp_config.paths.incoming / "a.pdf").write_bytes(b"%PDF-1.4\n")
    (tmp_config.paths.incoming / "b.pdf").write_bytes(b"%PDF-1.4\n")

    errors = _run(tmp_config, _fake_failure)
    assert errors == 2


def test_once_exit_nonzero_when_some_fail(tmp_config) -> None:
    """Scenario: some PDFs fail → exit != 0."""
    (tmp_config.paths.incoming / "ok.pdf").write_bytes(b"%PDF-1.4\n")
    (tmp_config.paths.incoming / "bad.pdf").write_bytes(b"%PDF-1.4\n")

    def mixed(src, *args, **kwargs):
        if "bad" in src.name:
            return _fake_failure(src, *args, **kwargs)
        return _fake_success(src, *args, **kwargs)

    errors = _run(tmp_config, mixed)
    assert errors == 1


def test_counters_track_success_and_failure(tmp_config) -> None:
    """success_count and error_count should keep an accurate tally."""
    (tmp_config.paths.incoming / "ok.pdf").write_bytes(b"%PDF-1.4\n")
    (tmp_config.paths.incoming / "bad.pdf").write_bytes(b"%PDF-1.4\n")

    def mixed(src, *args, **kwargs):
        if "bad" in src.name:
            return _fake_failure(src, *args, **kwargs)
        return _fake_success(src, *args, **kwargs)

    with patch("pdf_ocr_hotfolder.service.check_preflight", return_value=None), \
         patch("pdf_ocr_hotfolder.service.process_pdf", side_effect=mixed), \
         patch("pdf_ocr_hotfolder.service._wait_until_stable", return_value=True):
        service = HotfolderService(tmp_config)
        try:
            service.run_once()
            assert service.success_count == 1
            assert service.error_count == 1
        finally:
            service._executor.shutdown(wait=False)
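The tests above treat the return value of `run_once()` as a count of failed files and expect the process exit code to be nonzero whenever that count is positive. A minimal sketch of that mapping might look as follows. The `Result` class is a hypothetical stand-in for `ProcessResult`, both function names are invented here, and the concrete value `1` is an assumption: the tests only require "not zero".

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Result:
    """Hypothetical stand-in for ProcessResult; only `success` matters here."""
    success: bool


def count_errors(results: Iterable[Result]) -> int:
    """Count the results that did not succeed."""
    return sum(1 for r in results if not r.success)


def exit_code_for(error_count: int) -> int:
    """Map an error count to a process exit code.

    Assumption: 1 signals "at least one file failed"; 0 means all good
    (including the case where there was nothing to process).
    """
    return 1 if error_count > 0 else 0
```

Keeping the count and the exit code separate means callers such as a summary log line can still report how many files failed, while the shell only sees success or failure.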
@@ -0,0 +1,190 @@
"""Tests for the configurable file names and original-file handling feature."""
from __future__ import annotations

from pathlib import Path
from unittest.mock import patch

import pytest

from pdf_ocr_hotfolder.config import OcrConfig, OutputConfig, VeraPdfConfig
from pdf_ocr_hotfolder.processor import build_output_name, process_pdf
from pdf_ocr_hotfolder.service import PreflightError, check_output_config


# ---------------- build_output_name ----------------

@pytest.mark.parametrize("src,mode,tag,expected", [
    # prefix
    ("scan.pdf", "prefix", "OCR_", "OCR_scan.pdf"),
    ("scan.pdf", "prefix", "[OCR] ", "[OCR] scan.pdf"),
    # suffix (tag goes before the extension)
    ("scan.pdf", "suffix", "_OCR", "scan_OCR.pdf"),
    ("scan.pdf", "suffix", "-ocr", "scan-ocr.pdf"),
    # none
    ("scan.pdf", "none", "OCR_", "scan.pdf"),
    # empty tag behaves like none
    ("scan.pdf", "prefix", "", "scan.pdf"),
    ("scan.pdf", "suffix", "", "scan.pdf"),
    # Multiple dots in the name: only the last extension counts
    ("rechnung.2026.pdf", "suffix", "_OCR", "rechnung.2026_OCR.pdf"),
    ("rechnung.2026.pdf", "prefix", "OCR_", "OCR_rechnung.2026.pdf"),
    # Name without an extension
    ("NO_EXT", "suffix", "_OCR", "NO_EXT_OCR"),
])
def test_build_output_name(src, mode, tag, expected) -> None:
    assert build_output_name(src, mode, tag) == expected


def test_build_output_name_invalid_mode() -> None:
    with pytest.raises(ValueError, match="name_mode"):
        build_output_name("x.pdf", "bogus", "OCR_")


# ---------------- check_output_config ----------------

def test_check_output_config_delete_ok() -> None:
    check_output_config("delete", "")  # ok


def test_check_output_config_archive_requires_dir() -> None:
    with pytest.raises(PreflightError, match="archive_dir"):
        check_output_config("archive", "")


def test_check_output_config_archive_with_dir_ok() -> None:
    check_output_config("archive", "/var/archive")  # ok


def test_check_output_config_invalid_mode() -> None:
    # The match string stays German because it targets the actual error message.
    with pytest.raises(PreflightError, match="ungültig"):
        check_output_config("trash", "")


# ---------------- process_pdf with original-file handling ----------------

def _fake_ocr(src: Path, dst: Path, cfg: OcrConfig) -> None:
    """Simulates ocrmypdf: copies the content and creates the target file."""
    dst.write_bytes(b"%PDF-1.4 OCRed\n" + src.read_bytes())


def _prepare(tmp_path: Path) -> dict:
    dirs = {
        "working": tmp_path / "working",
        "outgoing": tmp_path / "outgoing",
        "error": tmp_path / "error",
        "archive": tmp_path / "archive",
        "incoming": tmp_path / "incoming",
    }
    for d in dirs.values():
        d.mkdir(parents=True, exist_ok=True)
    src = dirs["incoming"] / "scan.pdf"
    src.write_bytes(b"%PDF-1.4 original\n")
    return {"src": src, **dirs}


def test_process_pdf_prefix_delete(tmp_path: Path) -> None:
    env = _prepare(tmp_path)
    out_cfg = OutputConfig(name_mode="prefix", name_tag="OCR_",
                           original_on_success="delete")
    with patch("pdf_ocr_hotfolder.processor.run_ocr", side_effect=_fake_ocr):
        result = process_pdf(
            src=env["src"],
            working_dir=env["working"],
            outgoing_dir=env["outgoing"],
            error_dir=env["error"],
            ocr_cfg=OcrConfig(),
            vera_cfg=VeraPdfConfig(enabled=False),
            output_cfg=out_cfg,
        )
    assert result.success
    assert (env["outgoing"] / "OCR_scan.pdf").exists()
    # The original is gone, from both incoming and working
    assert not env["src"].exists()
    assert not (env["working"] / "scan.pdf").exists()


def test_process_pdf_suffix_delete(tmp_path: Path) -> None:
    env = _prepare(tmp_path)
    out_cfg = OutputConfig(name_mode="suffix", name_tag="_OCR",
                           original_on_success="delete")
    with patch("pdf_ocr_hotfolder.processor.run_ocr", side_effect=_fake_ocr):
        result = process_pdf(
            src=env["src"],
            working_dir=env["working"],
            outgoing_dir=env["outgoing"],
            error_dir=env["error"],
            ocr_cfg=OcrConfig(),
            vera_cfg=VeraPdfConfig(enabled=False),
            output_cfg=out_cfg,
        )
    assert result.success
    assert (env["outgoing"] / "scan_OCR.pdf").exists()


def test_process_pdf_none_mode(tmp_path: Path) -> None:
    env = _prepare(tmp_path)
    out_cfg = OutputConfig(name_mode="none", name_tag="OCR_",
                           original_on_success="delete")
    with patch("pdf_ocr_hotfolder.processor.run_ocr", side_effect=_fake_ocr):
        result = process_pdf(
            src=env["src"],
            working_dir=env["working"],
            outgoing_dir=env["outgoing"],
            error_dir=env["error"],
            ocr_cfg=OcrConfig(),
            vera_cfg=VeraPdfConfig(enabled=False),
            output_cfg=out_cfg,
        )
    assert result.success
    # The output has the SAME name as the original
    assert (env["outgoing"] / "scan.pdf").exists()


def test_process_pdf_archive_original(tmp_path: Path) -> None:
    env = _prepare(tmp_path)
    out_cfg = OutputConfig(name_mode="prefix", name_tag="OCR_",
                           original_on_success="archive",
                           archive_dir=str(env["archive"]))
    with patch("pdf_ocr_hotfolder.processor.run_ocr", side_effect=_fake_ocr):
        result = process_pdf(
            src=env["src"],
            working_dir=env["working"],
            outgoing_dir=env["outgoing"],
            error_dir=env["error"],
            ocr_cfg=OcrConfig(),
            vera_cfg=VeraPdfConfig(enabled=False),
            output_cfg=out_cfg,
        )
    assert result.success
    assert (env["outgoing"] / "OCR_scan.pdf").exists()
    # The original now lives in the archive
    archived = env["archive"] / "scan.pdf"
    assert archived.exists()
    assert archived.read_bytes() == b"%PDF-1.4 original\n"


def test_process_pdf_archive_name_collision(tmp_path: Path) -> None:
    """On a name collision in the archive, a timestamp is appended."""
    env = _prepare(tmp_path)
    # Pre-existing colliding file
    (env["archive"] / "scan.pdf").write_bytes(b"old")

    out_cfg = OutputConfig(name_mode="prefix", name_tag="OCR_",
                           original_on_success="archive",
                           archive_dir=str(env["archive"]))
    with patch("pdf_ocr_hotfolder.processor.run_ocr", side_effect=_fake_ocr):
        process_pdf(
            src=env["src"],
            working_dir=env["working"],
            outgoing_dir=env["outgoing"],
            error_dir=env["error"],
            ocr_cfg=OcrConfig(),
            vera_cfg=VeraPdfConfig(enabled=False),
            output_cfg=out_cfg,
        )
    # Old file is unchanged
    assert (env["archive"] / "scan.pdf").read_bytes() == b"old"
    # New file carries a timestamp suffix
    archived = list(env["archive"].glob("scan_*.pdf"))
    assert len(archived) == 1
    assert archived[0].read_bytes() == b"%PDF-1.4 original\n"
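The naming rules exercised by `test_build_output_name` and the collision behaviour checked by `test_process_pdf_archive_name_collision` can be sketched as below. Neither function is taken from the project: the names `build_output_name_sketch` and `archive_target_sketch`, the wording of the error message, and the timestamp format are all assumptions. The tests only require that an invalid mode raises a `ValueError` mentioning `name_mode` and that the archived copy matches the glob `scan_*.pdf`.

```python
from datetime import datetime
from pathlib import Path

_VALID_MODES = ("prefix", "suffix", "none")


def build_output_name_sketch(src_name: str, mode: str, tag: str) -> str:
    """Place `tag` before the name, before the last extension, or nowhere."""
    if mode not in _VALID_MODES:
        raise ValueError(f"invalid name_mode: {mode!r}")
    if mode == "none" or not tag:
        # An empty tag behaves like "none".
        return src_name
    if mode == "prefix":
        return f"{tag}{src_name}"
    # suffix: only the last extension is split off ("a.2026.pdf" -> "a.2026" + ".pdf").
    path = Path(src_name)
    return f"{path.stem}{tag}{path.suffix}"


def archive_target_sketch(archive_dir: Path, name: str, now: datetime) -> Path:
    """Pick an archive path that does not overwrite an existing file."""
    target = archive_dir / name
    if not target.exists():
        return target
    # Assumed timestamp format; the tests only require the "<stem>_*<ext>" shape.
    stamp = now.strftime("%Y%m%d-%H%M%S")
    path = Path(name)
    return archive_dir / f"{path.stem}_{stamp}{path.suffix}"
```

Passing `now` in explicitly, rather than calling `datetime.now()` inside the helper, keeps the collision path deterministic under test.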
@@ -0,0 +1,75 @@
"""Tests for Issue #1: preflight check when Tesseract is missing."""
from __future__ import annotations

import sys
from unittest.mock import patch

import pytest

from pdf_ocr_hotfolder.service import (
    HotfolderService,
    PreflightError,
    check_preflight,
)


def test_preflight_passes_when_all_binaries_present() -> None:
    """If tesseract and gs are on PATH, no error may be raised."""
    with patch("pdf_ocr_hotfolder.service.shutil.which", return_value="/usr/bin/fake"):
        check_preflight()  # must not raise


def test_preflight_fails_when_tesseract_missing() -> None:
    """Missing tesseract → PreflightError with a matching message."""
    def fake_which(name: str) -> str | None:
        return None if name == "tesseract" else "/usr/bin/fake"

    with patch("pdf_ocr_hotfolder.service.shutil.which", side_effect=fake_which):
        with pytest.raises(PreflightError, match="tesseract"):
            check_preflight()


def test_preflight_fails_when_ghostscript_missing() -> None:
    def fake_which(name: str) -> str | None:
        return None if name == "gs" else "/usr/bin/fake"

    with patch("pdf_ocr_hotfolder.service.shutil.which", side_effect=fake_which):
        with pytest.raises(PreflightError, match="gs"):
            check_preflight()


def test_preflight_lists_all_missing_binaries() -> None:
    """When several binaries are missing, all of them are named."""
    with patch("pdf_ocr_hotfolder.service.shutil.which", return_value=None):
        with pytest.raises(PreflightError) as exc_info:
            check_preflight()
    msg = str(exc_info.value)
    assert "tesseract" in msg
    assert "gs" in msg


def test_run_once_raises_preflight_error(tmp_config) -> None:
    """HotfolderService.run_once() raises PreflightError when tesseract is missing."""
    service = HotfolderService(tmp_config)
    try:
        with patch("pdf_ocr_hotfolder.service.shutil.which", return_value=None):
            with pytest.raises(PreflightError):
                service.run_once()
    finally:
        service._executor.shutdown(wait=False)


def test_main_returns_2_on_preflight_error(tmp_config, tmp_path, monkeypatch) -> None:
    """The CLI returns exit code 2 on a preflight error (Issue #1 scenario)."""
    cfg_file = tmp_path / "cfg.toml"
    cfg_file.write_text(f"""
[paths]
incoming = "{tmp_config.paths.incoming}"
outgoing = "{tmp_config.paths.outgoing}"
working = "{tmp_config.paths.working}"
error = "{tmp_config.paths.error}"
""")
    monkeypatch.setattr(sys, "argv", ["pdf-ocr-hotfolder", "--config", str(cfg_file), "--once"])
    with patch("pdf_ocr_hotfolder.service.shutil.which", return_value=None):
        from pdf_ocr_hotfolder.__main__ import main
        assert main() == 2
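The preflight tests above patch `shutil.which` and expect a single error that names every missing binary. A check with that shape could be sketched as follows. The real `check_preflight` is not part of this diff, so `check_binaries`, `PreflightErrorSketch`, and the message wording are hypothetical; only the use of `shutil.which` and the binary names `tesseract` and `gs` come from the tests.

```python
import shutil
from typing import Iterable


class PreflightErrorSketch(RuntimeError):
    """Hypothetical stand-in for the project's PreflightError."""


def check_binaries(required: Iterable[str] = ("tesseract", "gs")) -> None:
    """Raise one error listing all required binaries that are not on PATH."""
    # Collect every missing name first, so the user sees the full list at once
    # instead of fixing one binary per run.
    missing = [name for name in required if shutil.which(name) is None]
    if missing:
        raise PreflightErrorSketch(
            "missing required binaries: " + ", ".join(missing)
        )
```

A CLI wrapper could catch this exception and return a dedicated exit code; `test_main_returns_2_on_preflight_error` indicates that the project uses 2 for that case.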