3 Commits

Author SHA1 Message Date
techadmin a23a3968ef feat: configurable file name + archive mode for the original (v0.3.0)
New [output] section:
- name_mode: prefix | suffix | none (suffix is inserted before the extension)
- name_tag: string inserted verbatim
- original_on_success: delete | archive
- archive_dir with collision protection (timestamp suffix)

20 new tests (50 total, all green).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-09 22:32:41 +02:00
techadmin 9cdc9ae443 fix: catch Ghostscript 10.0.0-10.02.0 PDF/A bug (v0.2.2)
- config.example.toml: pdfa_level="" as the safe default
- check_preflight(pdfa_level) detects affected GS versions and aborts
- install.sh warns on affected GS versions
- 19 new tests (parametrized over a version matrix)

Closes #3

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-09 07:29:18 +02:00
techadmin 6f7cadfc63 fix: preflight check and exit code in --once mode (v0.2.1)
- #1: check_preflight() checks for tesseract + gs at startup and raises
  PreflightError. The CLI exits with code 2 instead of reporting success.
- #2: run_once() returns the number of failed PDFs; the CLI
  exits with code 1 if at least one file failed.
- pytest suite with 11 tests for both scenarios
- ocrmypdf import is lazy in processor.py (tests can run without ocrmypdf)

Closes #1, #2

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-09 07:24:00 +02:00
16 changed files with 806 additions and 16 deletions
+1
@@ -4,6 +4,7 @@ __pycache__/
 venv/
 env/
 .venv/
+.pytest_cache/
 *.egg-info/
 build/
 dist/
+39
@@ -1,5 +1,44 @@
 # Changelog
+## [0.3.0] - 2026-04-09
+### Added
+- New config section `[output]` with:
+  - `name_mode`: placement of the tag in the file name: `"prefix"`, `"suffix"` (before the extension), `"none"`
+  - `name_tag`: string inserted verbatim, e.g. `"OCR_"` or `"_OCR"`
+  - `original_on_success`: `"delete"` (previous default) or `"archive"`
+  - `archive_dir`: target directory for `"archive"`, with collision protection (timestamp suffix)
+- Runtime validation of the output config in `check_output_config()`
+- 20 new tests for `build_output_name()`, `check_output_config()` and `process_pdf()`
+  covering all combinations of naming mode and original handling
+### Changed
+- `process_pdf()` now takes `output_cfg: OutputConfig` as a required argument
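The collision protection mentioned for `archive_dir` can be pictured with a minimal sketch. It assumes a hypothetical helper `archive_destination` (not part of the project) and an injectable clock so the result is reproducible; the timestamp format follows the one visible in `_dispose_original`.

```python
from __future__ import annotations

from datetime import datetime
from pathlib import Path


def archive_destination(archive_dir: Path, name: str,
                        now: datetime | None = None) -> Path:
    """Pick a target path in archive_dir without overwriting an existing file.

    Hypothetical helper, for illustration only.
    """
    dest = archive_dir / name
    if dest.exists():
        # Name already taken: append a timestamp before the extension.
        ts = (now or datetime.now()).strftime("%Y%m%d-%H%M%S")
        dest = archive_dir / f"{dest.stem}_{ts}{dest.suffix}"
    return dest
```

With a one-second timestamp resolution, two collisions within the same second would still map to the same name, so this reads as a mitigation rather than a guarantee.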
+## [0.2.2] - 2026-04-09
+### Fixed
+- **Issue #3**: Ghostscript 10.0.0–10.02.0 (Debian 12 default) breaks OCR when PDF/A output is combined with `skip_text=true`.
+- `config.example.toml`: `pdfa_level = ""` as the safe default
+- Runtime preflight: checks `gs --version` when `pdfa_level` is set and aborts with a clear error message
+- `install.sh`: warns on affected GS versions with an upgrade hint pointing to bookworm-backports
+### Added
+- `is_ghostscript_broken()` / `detect_ghostscript_version()` in `pdf_ocr_hotfolder.service`
+- 19 additional pytest tests for GS version detection (parametrized) and preflight combinations
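The version range from Issue #3 boils down to a tuple comparison. The following is a minimal sketch under the assumption that the version string contains an `X.Y[.Z]` pattern; `parse_gs_version` and `in_broken_range` are illustrative names, not the project's API.

```python
from __future__ import annotations

import re

# Inclusive bounds of the affected range, as stated in the changelog entry.
BROKEN_MIN = (10, 0, 0)
BROKEN_MAX = (10, 2, 0)


def parse_gs_version(text: str) -> tuple[int, int, int] | None:
    """Return the first X.Y[.Z] version in text as a 3-tuple, or None."""
    m = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if m is None:
        return None
    major, minor, patch = (int(g) if g is not None else 0 for g in m.groups())
    return (major, minor, patch)


def in_broken_range(text: str) -> bool:
    """True if the version lies within BROKEN_MIN..BROKEN_MAX (inclusive)."""
    version = parse_gs_version(text)
    return version is not None and BROKEN_MIN <= version <= BROKEN_MAX
```

Converting each component to `int` makes `10.02.0` and `10.2.0` compare equal, which a plain string comparison would not.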
+## [0.2.1] - 2026-04-09
+### Fixed
+- **Issue #1**: The preflight check at startup now verifies `tesseract` and `gs` (Ghostscript). If a dependency is missing, the service exits immediately with exit code 2 and a clear error message instead of failing only on the first file.
+- **Issue #2**: `--once` mode now returns exit code `1` as soon as **at least one** PDF has failed. Exit code `0` only on complete success (including "no files present"). Exit code `2` on a preflight error.
+### Added
+- Public API: `HotfolderService.run_once()`, `.success_count`, `.error_count`, `.ensure_dirs()`
+- `check_preflight()` / `PreflightError` in `pdf_ocr_hotfolder.service`
+- pytest test suite (`tests/`) with 11 tests, covering all scenarios from Issue #1 and #2
+- The `ocrmypdf` import in `processor.py` is now lazy (tests can run without ocrmypdf installed)
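The exit-code contract for `--once` described in this entry can be summarized in a few lines. This is a sketch only; `once_exit_code` is a hypothetical helper and not a function of the package.

```python
def once_exit_code(preflight_ok: bool, failed_count: int) -> int:
    """Map a --once run to the documented exit codes (hypothetical helper)."""
    if not preflight_ok:
        return 2  # a required dependency is missing or unusable
    return 1 if failed_count > 0 else 0  # any failed PDF makes the run fail
```

An empty incoming folder yields `failed_count == 0` and therefore exit code 0, matching the "no files present" case above.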
 ## [0.2.0] - 2026-04-08
 ### Added
+16
@@ -89,6 +89,22 @@ max_workers = 2 # parallel PDFs
 timeout = 1800
 ```
+### `[output]`
+```toml
+# File name in outgoing/:
+# "prefix" → OCR_scan.pdf
+# "suffix" → scan_OCR.pdf (before the extension)
+# "none" → scan.pdf (unchanged)
+name_mode = "prefix"
+name_tag = "OCR_"
+# What happens to the original after successful OCR:
+# "delete" → delete it
+# "archive" → move it to archive_dir
+original_on_success = "delete"
+archive_dir = "" # absolute path, required for "archive"
+```
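The three `name_mode` values can be illustrated with a short sketch. `sketch_output_name` is an illustrative stand-in written for this example, under the assumption that only the last extension is split off in suffix mode.

```python
from pathlib import Path


def sketch_output_name(src_name: str, mode: str, tag: str) -> str:
    """Apply the naming rules from the [output] section (illustrative only)."""
    if mode == "none" or not tag:
        return src_name  # no tag requested, keep the original name
    if mode == "prefix":
        return f"{tag}{src_name}"
    if mode == "suffix":
        p = Path(src_name)
        # Only the last extension is separated, so "a.b.pdf" keeps "a.b".
        return f"{p.stem}{tag}{p.suffix}"
    raise ValueError(f"unknown name_mode: {mode!r}")


for mode, tag in (("prefix", "OCR_"), ("suffix", "_OCR"), ("none", "OCR_")):
    print(mode, "->", sketch_output_name("scan.pdf", mode, tag))
```

An empty `name_tag` behaves like `"none"` in this sketch, regardless of the mode.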
 ### `[upload.nextcloud]`
 ```toml
 enabled = true
+1 -1
@@ -1 +1 @@
-0.2.0
+0.3.0
+20 -1
@@ -21,7 +21,10 @@ skip_text = true
 # Resolution for rasterized pages
 oversample = 300
 # PDF/A conformance level ("1", "2", "3", or empty for no PDF/A output)
-pdfa_level = "2"
+# WARNING: Ghostscript 10.0.0 through 10.02.0 (the Debian 12 default!) has a bug
+# that completely blocks ocrmypdf when pdfa_level is combined with skip_text=true.
+# The safe default is ""; only set "1"/"2"/"3" if gs >= 10.02.1 is installed.
+pdfa_level = ""
 # Automatically straighten skewed scans
 deskew = true
 # Clean up the background
@@ -31,6 +34,22 @@ max_workers = 2
 # Timeout per PDF in seconds
 timeout = 1800
+[output]
+# How should the target file in the outgoing/ folder be named?
+# "prefix" : name_tag is placed before the file name (OCR_scan.pdf)
+# "suffix" : name_tag is placed before the extension (scan_OCR.pdf)
+# "none" : file name stays the same as the original
+name_mode = "prefix"
+# String inserted verbatim. Empty string = no tag (same as mode="none").
+# Examples: "OCR_", "[OCR]_", "_OCR", "_searchable"
+name_tag = "OCR_"
+# What happens to the original when OCR succeeded?
+# "delete" : the original is deleted (previous default)
+# "archive" : the original is moved to archive_dir
+original_on_success = "delete"
+# Absolute path; only relevant if original_on_success = "archive"
+archive_dir = ""
 [verapdf]
 # PDF/A validation (optional)
 enabled = false
+20
@@ -52,6 +52,26 @@ install_base() {
         icc-profiles-free ca-certificates curl
     log_info "System-Pakete ok ✓"
+    # Ghostscript version check (Issue #3)
+    if command -v gs >/dev/null 2>&1; then
+        GS_VER="$(gs --version 2>/dev/null || echo 0.0)"
+        log_info "Ghostscript: $GS_VER"
+        case "$GS_VER" in
+            10.0.0|10.00.0|10.01.*|10.02.0)
+                echo
+                log_warn "═══════════════════════════════════════════════════════════════"
+                log_warn "Ghostscript $GS_VER ist vom PDF/A-Bug betroffen (10.0.0–10.02.0)."
+                log_warn "Mit pdfa_level + skip_text=true kann ocrmypdf KEINE PDFs verarbeiten."
+                log_warn ""
+                log_warn "Workarounds:"
+                log_warn " 1. ghostscript aus bookworm-backports installieren (>=10.02.1)"
+                log_warn " 2. In der Config [ocr].pdfa_level = \"\" setzen (Default ab v0.2.2)"
+                log_warn "═══════════════════════════════════════════════════════════════"
+                echo
+                ;;
+        esac
+    fi
     log_step "Default-User '$DEFAULT_USER' prüfen"
     if id "$DEFAULT_USER" &>/dev/null; then
         log_info "'$DEFAULT_USER' existiert bereits"
+11 -5
@@ -8,7 +8,7 @@ from pathlib import Path
 from . import __version__
 from .config import load_config
-from .service import HotfolderService
+from .service import HotfolderService, PreflightError
 def _setup_logging(level: str) -> None:
@@ -40,14 +40,20 @@ def main() -> int:
     _setup_logging(cfg.log_level)
     service = HotfolderService(cfg)
     if args.once:
-        service._ensure_dirs()  # noqa: SLF001
-        service._scan_existing()  # noqa: SLF001
-        service._executor.shutdown(wait=True)  # noqa: SLF001
-        return 0
+        try:
+            errors = service.run_once()
+        except PreflightError as e:
+            print(f"FEHLER: {e}", file=sys.stderr)
+            return 2
+        return 1 if errors > 0 else 0
     try:
         service.run()
+    except PreflightError as e:
+        print(f"FEHLER: {e}", file=sys.stderr)
+        return 2
     except KeyboardInterrupt:
         pass
     return 0
+16 -1
@@ -28,6 +28,18 @@ class OcrConfig:
     timeout: int = 1800
+@dataclass
+class OutputConfig:
+    # "prefix" | "suffix" | "none"
+    name_mode: str = "prefix"
+    # Tag string, inserted verbatim (empty string = no tag)
+    name_tag: str = "OCR_"
+    # "delete" | "archive"
+    original_on_success: str = "delete"
+    # Absolute path; required if original_on_success == "archive"
+    archive_dir: str = ""
 @dataclass
 class VeraPdfConfig:
     enabled: bool = False
@@ -79,6 +91,7 @@ class EmailNotify:
 class Config:
     paths: Paths
     ocr: OcrConfig
+    output: OutputConfig
     verapdf: VeraPdfConfig
     folder: FolderUpload
     nextcloud: NextcloudUpload
@@ -109,6 +122,8 @@ def load_config(path: str | Path) -> Config:
     ocr = OcrConfig(**{k: v for k, v in _section(data, "ocr").items()
                        if k in OcrConfig.__annotations__})
+    output = OutputConfig(**{k: v for k, v in _section(data, "output").items()
+                             if k in OutputConfig.__annotations__})
     verapdf = VeraPdfConfig(**{k: v for k, v in _section(data, "verapdf").items()
                                if k in VeraPdfConfig.__annotations__})
     folder = FolderUpload(**{k: v for k, v in _section(data, "upload", "folder").items()
@@ -123,7 +138,7 @@ def load_config(path: str | Path) -> Config:
     log_level = _section(data, "logging").get("level", "INFO")
     return Config(
-        paths=paths, ocr=ocr, verapdf=verapdf,
+        paths=paths, ocr=ocr, output=output, verapdf=verapdf,
         folder=folder, nextcloud=nextcloud, sftp=sftp, email=email,
         log_level=log_level,
     )
+62 -6
@@ -7,13 +7,37 @@ import subprocess
 from dataclasses import dataclass
 from pathlib import Path
-import ocrmypdf
-from .config import OcrConfig, VeraPdfConfig
+from .config import OcrConfig, OutputConfig, VeraPdfConfig
 log = logging.getLogger(__name__)
+def build_output_name(src_name: str, mode: str, tag: str) -> str:
+    """Build the target file name for an OCR PDF.
+
+    Args:
+        src_name: original file name (e.g. "scan.pdf")
+        mode: "prefix" | "suffix" | "none"
+        tag: string to insert (verbatim, empty = no tag)
+
+    Examples:
+        prefix "OCR_": "scan.pdf" -> "OCR_scan.pdf"
+        suffix "_OCR": "scan.pdf" -> "scan_OCR.pdf"
+        suffix "_OCR": "scan.tar.gz.pdf" -> "scan.tar.gz_OCR.pdf"
+        none: "scan.pdf" -> "scan.pdf"
+    """
+    if mode == "none" or not tag:
+        return src_name
+    if mode == "prefix":
+        return f"{tag}{src_name}"
+    if mode == "suffix":
+        # Split off only the last extension, otherwise "foo.bar.pdf" would be mangled
+        p = Path(src_name)
+        stem, ext = p.stem, p.suffix
+        return f"{stem}{tag}{ext}"
+    raise ValueError(f"Unbekannter name_mode: {mode!r}")
+
 @dataclass
 class ProcessResult:
     source: Path
@@ -25,6 +49,8 @@ class ProcessResult:
 def run_ocr(src: Path, dst: Path, cfg: OcrConfig) -> None:
     """Run ocrmypdf as a library call (no subprocess overhead)."""
+    import ocrmypdf  # lazy, so that tests run without ocrmypdf
     kwargs: dict = {
         "language": cfg.languages,
         "jobs": cfg.jobs,
@@ -71,11 +97,13 @@ def process_pdf(
     error_dir: Path,
     ocr_cfg: OcrConfig,
     vera_cfg: VeraPdfConfig,
+    output_cfg: OutputConfig,
 ) -> ProcessResult:
     """Process a single PDF: move→OCR→validate→outgoing/error."""
+    out_name = build_output_name(src.name, output_cfg.name_mode, output_cfg.name_tag)
     work_src = working_dir / src.name
-    work_out = working_dir / f"OCR_{src.name}"
-    final_out = outgoing_dir / f"OCR_{src.name}"
+    work_out = working_dir / f"__ocr_{out_name}"  # temp name, so that it differs from src.name
+    final_out = outgoing_dir / out_name
     try:
         shutil.move(str(src), str(work_src))
@@ -100,10 +128,38 @@ def process_pdf(
     outgoing_dir.mkdir(parents=True, exist_ok=True)
     shutil.move(str(work_out), str(final_out))
-    work_src.unlink(missing_ok=True)
+    _dispose_original(work_src, src.name, output_cfg)
     return ProcessResult(src, final_out, True, verapdf_passed=vera_ok)
+def _dispose_original(work_src: Path, original_name: str, cfg: OutputConfig) -> None:
+    """Dispose of the original after successful OCR: delete or archive it."""
+    if not work_src.exists():
+        return
+    mode = cfg.original_on_success
+    if mode == "delete":
+        work_src.unlink(missing_ok=True)
+        return
+    if mode == "archive":
+        if not cfg.archive_dir:
+            log.error("original_on_success=archive aber archive_dir ist leer — lösche stattdessen")
+            work_src.unlink(missing_ok=True)
+            return
+        archive = Path(cfg.archive_dir)
+        archive.mkdir(parents=True, exist_ok=True)
+        dest = archive / original_name
+        # On a name collision, rename with a timestamp
+        if dest.exists():
+            from datetime import datetime
+            ts = datetime.now().strftime("%Y%m%d-%H%M%S")
+            dest = archive / f"{dest.stem}_{ts}{dest.suffix}"
+        shutil.move(str(work_src), str(dest))
+        log.info("Original archiviert: %s", dest)
+        return
+    log.warning("Unbekannter original_on_success=%r — lösche stattdessen", mode)
+    work_src.unlink(missing_ok=True)
+
 def _move_to_error(p: Path, error_dir: Path) -> None:
     error_dir.mkdir(parents=True, exist_ok=True)
     try:
+133 -2
@@ -2,7 +2,10 @@
 from __future__ import annotations
 import logging
+import re
+import shutil
 import signal
+import subprocess
 import threading
 import time
 from concurrent.futures import Future, ThreadPoolExecutor
@@ -18,6 +21,98 @@ from .uploaders import notify_email, upload_folder, upload_nextcloud, upload_sft
 log = logging.getLogger(__name__)
+class PreflightError(RuntimeError):
+    """Required external binaries are missing."""
+
+# Required binaries for ocrmypdf
+_REQUIRED_BINARIES = ("tesseract", "gs")
+# Ghostscript versions with the known PDF/A+skip_text bug (Issue #3):
+# 10.0.0 .. 10.02.0 (inclusive). Usable again from 10.02.1.
+_GS_BROKEN_MIN = (10, 0, 0)
+_GS_BROKEN_MAX = (10, 2, 0)
+
+def _parse_version(text: str) -> tuple[int, ...] | None:
+    """Extract the first X.Y[.Z] version from a string."""
+    m = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
+    if not m:
+        return None
+    return tuple(int(x) if x is not None else 0 for x in m.groups())
+
+def is_ghostscript_broken(version: str | None) -> bool:
+    """Check whether a Ghostscript version is affected by the PDF/A+skip_text bug.
+
+    Affects 10.0.0 up to and including 10.02.0. Safe again from 10.02.1.
+    """
+    if not version:
+        return False
+    parsed = _parse_version(version)
+    if parsed is None:
+        return False
+    # Normalize to a 3-tuple
+    while len(parsed) < 3:
+        parsed = parsed + (0,)
+    parsed = parsed[:3]
+    return _GS_BROKEN_MIN <= parsed <= _GS_BROKEN_MAX
+
+def detect_ghostscript_version() -> str | None:
+    """Run `gs --version` and return the version string (or None)."""
+    gs = shutil.which("gs")
+    if gs is None:
+        return None
+    try:
+        result = subprocess.run([gs, "--version"], capture_output=True,
+                                text=True, timeout=5)
+    except (OSError, subprocess.TimeoutExpired):
+        return None
+    return result.stdout.strip() or None
+
+def check_output_config(mode: str, archive_dir: str) -> None:
+    """Validate the [output] section. Raises PreflightError on problems."""
+    valid_modes = {"delete", "archive"}
+    if mode not in valid_modes:
+        raise PreflightError(
+            f"[output].original_on_success={mode!r} ungültig. "
+            f"Erlaubt: {sorted(valid_modes)}"
+        )
+    if mode == "archive" and not archive_dir:
+        raise PreflightError(
+            "[output].original_on_success='archive' erfordert [output].archive_dir"
+        )
+
+def check_preflight(pdfa_level: str = "") -> None:
+    """Check external dependencies.
+
+    - Tesseract and Ghostscript must be on the PATH
+    - If pdfa_level is set, the Ghostscript version is checked against the
+      known 10.0.0–10.02.0 bug
+
+    Raises PreflightError on missing binaries or an unsafe Ghostscript.
+    """
+    missing = [b for b in _REQUIRED_BINARIES if shutil.which(b) is None]
+    if missing:
+        raise PreflightError(
+            "Fehlende Abhängigkeiten: " + ", ".join(missing)
+            + ". Bitte installieren: sudo apt install tesseract-ocr ghostscript"
+        )
+    if pdfa_level:
+        gs_version = detect_ghostscript_version()
+        if is_ghostscript_broken(gs_version):
+            raise PreflightError(
+                f"Ghostscript {gs_version} ist mit pdfa_level='{pdfa_level}' nicht "
+                "kompatibel (bekannter Bug in 10.0.0–10.02.0). "
+                "Entweder ghostscript auf >=10.02.1 upgraden (z.B. via bookworm-backports) "
+                "oder in der Config [ocr].pdfa_level = \"\" setzen."
+            )
+
 def _is_pdf(path: Path) -> bool:
     return path.suffix.lower() == ".pdf" and path.is_file()
@@ -70,10 +165,20 @@ class HotfolderService:
         self._stop = threading.Event()
         self._inflight: set[str] = set()
         self._lock = threading.Lock()
+        self._success_count = 0
+        self._error_count = 0
+
+    @property
+    def success_count(self) -> int:
+        return self._success_count
+
+    @property
+    def error_count(self) -> int:
+        return self._error_count
     # ---- Setup ----
-    def _ensure_dirs(self) -> None:
+    def ensure_dirs(self) -> None:
         for p in (self.cfg.paths.incoming, self.cfg.paths.outgoing,
                   self.cfg.paths.working, self.cfg.paths.error):
             p.mkdir(parents=True, exist_ok=True)
@@ -81,7 +186,10 @@ class HotfolderService:
     # ---- Lifecycle ----
     def run(self) -> None:
-        self._ensure_dirs()
+        check_preflight(self.cfg.ocr.pdfa_level)
+        check_output_config(self.cfg.output.original_on_success,
+                            self.cfg.output.archive_dir)
+        self.ensure_dirs()
         self._scan_existing()
         self._observer = Observer()
@@ -98,6 +206,22 @@ class HotfolderService:
         finally:
             self.shutdown()
+    def run_once(self) -> int:
+        """Process all PDFs already present in the incoming folder, then exit.
+
+        Returns:
+            Number of failed PDFs (0 = everything ok).
+        """
+        check_preflight(self.cfg.ocr.pdfa_level)
+        check_output_config(self.cfg.output.original_on_success,
+                            self.cfg.output.archive_dir)
+        self.ensure_dirs()
+        self._scan_existing()
+        self._executor.shutdown(wait=True)
+        log.info("One-shot fertig: %d ok, %d Fehler",
+                 self._success_count, self._error_count)
+        return self._error_count
+
     def shutdown(self) -> None:
         log.info("Shutdown läuft...")
         if self._observer:
@@ -148,8 +272,15 @@ class HotfolderService:
             error_dir=self.cfg.paths.error,
             ocr_cfg=self.cfg.ocr,
             vera_cfg=self.cfg.verapdf,
+            output_cfg=self.cfg.output,
         )
+        with self._lock:
+            if result.success:
+                self._success_count += 1
+            else:
+                self._error_count += 1
         if result.success:
             self._dispatch_uploads(result.output)
         self._notify(result)
+54
@@ -0,0 +1,54 @@
"""Shared pytest fixtures."""
from __future__ import annotations
from pathlib import Path
import pytest
from pdf_ocr_hotfolder.config import (
Config,
EmailNotify,
FolderUpload,
NextcloudUpload,
OcrConfig,
OutputConfig,
Paths,
SftpUpload,
VeraPdfConfig,
)
@pytest.fixture
def tmp_config(tmp_path: Path) -> Config:
"""Minimal-Config mit tmp_path-Verzeichnissen, alle Uploads deaktiviert."""
paths = Paths(
incoming=tmp_path / "incoming",
outgoing=tmp_path / "outgoing",
working=tmp_path / "working",
error=tmp_path / "error",
)
for p in (paths.incoming, paths.outgoing, paths.working, paths.error):
p.mkdir(parents=True, exist_ok=True)
return Config(
paths=paths,
ocr=OcrConfig(max_workers=1),
output=OutputConfig(),
verapdf=VeraPdfConfig(enabled=False),
folder=FolderUpload(enabled=False),
nextcloud=NextcloudUpload(enabled=False),
sftp=SftpUpload(enabled=False),
email=EmailNotify(enabled=False),
log_level="DEBUG",
)
@pytest.fixture
def dummy_pdf(tmp_config: Config) -> Path:
"""Legt eine Datei mit .pdf-Extension im incoming-Ordner ab.
Achtung: kein echtes PDF. Für Tests wird `process_pdf` gemockt.
"""
pdf = tmp_config.paths.incoming / "test.pdf"
pdf.write_bytes(b"%PDF-1.4 fake\n")
return pdf
+72
@@ -0,0 +1,72 @@
"""Tests for Issue #3: detection of the Ghostscript 10.0.0–10.02.0 PDF/A bug."""
from __future__ import annotations
from unittest.mock import patch
import pytest
from pdf_ocr_hotfolder.service import (
PreflightError,
check_preflight,
is_ghostscript_broken,
)
@pytest.mark.parametrize("version,expected", [
# Betroffene Versionen
("10.0.0", True),
("10.00.0", True),
("10.01.0", True),
("10.01.1", True),
("10.01.2", True),
("10.02.0", True),
# Sichere Versionen
("10.02.1", False),
("10.03.0", False),
("10.04.0", False),
("11.0.0", False),
("9.56.1", False), # Debian 11 / Ubuntu 22.04
("9.55.0", False),
# Edge cases
("", False),
(None, False),
("garbage", False),
])
def test_is_ghostscript_broken(version, expected) -> None:
assert is_ghostscript_broken(version) is expected
def test_check_preflight_without_pdfa_passes_with_broken_gs() -> None:
"""Ohne pdfa_level darf der betroffene GS verwendet werden."""
with patch("pdf_ocr_hotfolder.service.shutil.which", return_value="/usr/bin/fake"), \
patch("pdf_ocr_hotfolder.service.detect_ghostscript_version",
return_value="10.0.0"):
check_preflight(pdfa_level="") # darf nicht werfen
def test_check_preflight_with_pdfa_fails_on_broken_gs() -> None:
"""Mit pdfa_level + kaputtem GS → PreflightError mit hilfreicher Meldung."""
with patch("pdf_ocr_hotfolder.service.shutil.which", return_value="/usr/bin/fake"), \
patch("pdf_ocr_hotfolder.service.detect_ghostscript_version",
return_value="10.0.0"):
with pytest.raises(PreflightError, match="Ghostscript 10.0.0"):
check_preflight(pdfa_level="2")
def test_check_preflight_with_pdfa_passes_on_fixed_gs() -> None:
"""Mit pdfa_level + gefixtem GS → ok."""
with patch("pdf_ocr_hotfolder.service.shutil.which", return_value="/usr/bin/fake"), \
patch("pdf_ocr_hotfolder.service.detect_ghostscript_version",
return_value="10.02.1"):
check_preflight(pdfa_level="2") # darf nicht werfen
def test_default_config_pdfa_level_is_empty() -> None:
"""Default-Config der Beispiel-Datei soll pdfa_level='' enthalten (Issue #3)."""
from pathlib import Path
import tomllib
cfg_path = Path(__file__).parent.parent / "config.example.toml"
with cfg_path.open("rb") as f:
data = tomllib.load(f)
assert data["ocr"]["pdfa_level"] == "", \
"config.example.toml muss pdfa_level='' als sicheren Default haben"
+96
@@ -0,0 +1,96 @@
"""Tests for Issue #2: --once mode must return an exit code != 0 on errors."""
from __future__ import annotations
from pathlib import Path
from unittest.mock import patch
from pdf_ocr_hotfolder.processor import ProcessResult
from pdf_ocr_hotfolder.service import HotfolderService
def _fake_success(src: Path, working_dir, outgoing_dir, error_dir, **kwargs):
out = outgoing_dir / f"OCR_{src.name}"
out.parent.mkdir(parents=True, exist_ok=True)
out.write_bytes(b"%PDF-1.4 ocr\n")
src.unlink(missing_ok=True)
return ProcessResult(src, out, True)
def _fake_failure(src: Path, working_dir, outgoing_dir, error_dir, **kwargs):
error_dir.mkdir(parents=True, exist_ok=True)
dest = error_dir / src.name
src.rename(dest)
return ProcessResult(src, outgoing_dir / f"OCR_{src.name}", False,
error="fake ocr failure")
def _run(tmp_config, fake_process):
"""Helper: führt run_once() mit gemocktem process_pdf und preflight aus."""
with patch("pdf_ocr_hotfolder.service.check_preflight", return_value=None), \
patch("pdf_ocr_hotfolder.service.process_pdf", side_effect=fake_process), \
patch("pdf_ocr_hotfolder.service._wait_until_stable", return_value=True):
service = HotfolderService(tmp_config)
try:
return service.run_once()
finally:
service._executor.shutdown(wait=False)
def test_once_exit_0_when_no_files(tmp_config) -> None:
"""Szenario: Keine PDFs vorhanden → Exit 0."""
errors = _run(tmp_config, _fake_success)
assert errors == 0
def test_once_exit_0_when_all_success(tmp_config) -> None:
"""Szenario: Alle PDFs erfolgreich → Exit 0."""
(tmp_config.paths.incoming / "a.pdf").write_bytes(b"%PDF-1.4\n")
(tmp_config.paths.incoming / "b.pdf").write_bytes(b"%PDF-1.4\n")
errors = _run(tmp_config, _fake_success)
assert errors == 0
def test_once_exit_nonzero_when_all_fail(tmp_config) -> None:
"""Szenario: Alle PDFs fehlgeschlagen → Exit != 0 (Issue #2)."""
(tmp_config.paths.incoming / "a.pdf").write_bytes(b"%PDF-1.4\n")
(tmp_config.paths.incoming / "b.pdf").write_bytes(b"%PDF-1.4\n")
errors = _run(tmp_config, _fake_failure)
assert errors == 2
def test_once_exit_nonzero_when_some_fail(tmp_config) -> None:
"""Szenario: Teilweise fehlgeschlagen → Exit != 0."""
(tmp_config.paths.incoming / "ok.pdf").write_bytes(b"%PDF-1.4\n")
(tmp_config.paths.incoming / "bad.pdf").write_bytes(b"%PDF-1.4\n")
def mixed(src, *args, **kwargs):
if "bad" in src.name:
return _fake_failure(src, *args, **kwargs)
return _fake_success(src, *args, **kwargs)
errors = _run(tmp_config, mixed)
assert errors == 1
def test_counters_track_success_and_failure(tmp_config) -> None:
"""success_count und error_count sollen korrekt mitzählen."""
(tmp_config.paths.incoming / "ok.pdf").write_bytes(b"%PDF-1.4\n")
(tmp_config.paths.incoming / "bad.pdf").write_bytes(b"%PDF-1.4\n")
def mixed(src, *args, **kwargs):
if "bad" in src.name:
return _fake_failure(src, *args, **kwargs)
return _fake_success(src, *args, **kwargs)
with patch("pdf_ocr_hotfolder.service.check_preflight", return_value=None), \
patch("pdf_ocr_hotfolder.service.process_pdf", side_effect=mixed), \
patch("pdf_ocr_hotfolder.service._wait_until_stable", return_value=True):
service = HotfolderService(tmp_config)
try:
service.run_once()
assert service.success_count == 1
assert service.error_count == 1
finally:
service._executor.shutdown(wait=False)
+190
@@ -0,0 +1,190 @@
"""Tests for the feature: configurable file names and handling of the original."""
from __future__ import annotations
from pathlib import Path
from unittest.mock import patch
import pytest
from pdf_ocr_hotfolder.config import OcrConfig, OutputConfig, VeraPdfConfig
from pdf_ocr_hotfolder.processor import build_output_name, process_pdf
from pdf_ocr_hotfolder.service import PreflightError, check_output_config
# ---------------- build_output_name ----------------
@pytest.mark.parametrize("src,mode,tag,expected", [
# prefix
("scan.pdf", "prefix", "OCR_", "OCR_scan.pdf"),
("scan.pdf", "prefix", "[OCR] ", "[OCR] scan.pdf"),
# suffix (Tag vor Extension)
("scan.pdf", "suffix", "_OCR", "scan_OCR.pdf"),
("scan.pdf", "suffix", "-ocr", "scan-ocr.pdf"),
# none
("scan.pdf", "none", "OCR_", "scan.pdf"),
# leerer Tag = none
("scan.pdf", "prefix", "", "scan.pdf"),
("scan.pdf", "suffix", "", "scan.pdf"),
# Mehrfach-Punkte im Namen: nur letzte Extension zählt
("rechnung.2026.pdf", "suffix", "_OCR", "rechnung.2026_OCR.pdf"),
("rechnung.2026.pdf", "prefix", "OCR_", "OCR_rechnung.2026.pdf"),
# Name ohne Extension
("NO_EXT", "suffix", "_OCR", "NO_EXT_OCR"),
])
def test_build_output_name(src, mode, tag, expected) -> None:
assert build_output_name(src, mode, tag) == expected
def test_build_output_name_invalid_mode() -> None:
with pytest.raises(ValueError, match="name_mode"):
build_output_name("x.pdf", "bogus", "OCR_")
# ---------------- check_output_config ----------------
def test_check_output_config_delete_ok() -> None:
check_output_config("delete", "") # ok
def test_check_output_config_archive_requires_dir() -> None:
with pytest.raises(PreflightError, match="archive_dir"):
check_output_config("archive", "")
def test_check_output_config_archive_with_dir_ok() -> None:
check_output_config("archive", "/var/archive") # ok
def test_check_output_config_invalid_mode() -> None:
with pytest.raises(PreflightError, match="ungültig"):
check_output_config("trash", "")
# ---------------- process_pdf mit Original-Behandlung ----------------
def _fake_ocr(src: Path, dst: Path, cfg: OcrConfig) -> None:
"""Simuliert ocrmypdf: kopiert Inhalt, erzeugt Zieldatei."""
dst.write_bytes(b"%PDF-1.4 OCRed\n" + src.read_bytes())
def _prepare(tmp_path: Path) -> dict:
dirs = {
"working": tmp_path / "working",
"outgoing": tmp_path / "outgoing",
"error": tmp_path / "error",
"archive": tmp_path / "archive",
"incoming": tmp_path / "incoming",
}
for d in dirs.values():
d.mkdir(parents=True, exist_ok=True)
src = dirs["incoming"] / "scan.pdf"
src.write_bytes(b"%PDF-1.4 original\n")
return {"src": src, **dirs}
def test_process_pdf_prefix_delete(tmp_path: Path) -> None:
env = _prepare(tmp_path)
out_cfg = OutputConfig(name_mode="prefix", name_tag="OCR_",
original_on_success="delete")
with patch("pdf_ocr_hotfolder.processor.run_ocr", side_effect=_fake_ocr):
result = process_pdf(
src=env["src"],
working_dir=env["working"],
outgoing_dir=env["outgoing"],
error_dir=env["error"],
ocr_cfg=OcrConfig(),
vera_cfg=VeraPdfConfig(enabled=False),
output_cfg=out_cfg,
)
assert result.success
assert (env["outgoing"] / "OCR_scan.pdf").exists()
# Original ist weg, weder in incoming noch in working
assert not env["src"].exists()
assert not (env["working"] / "scan.pdf").exists()
def test_process_pdf_suffix_delete(tmp_path: Path) -> None:
env = _prepare(tmp_path)
out_cfg = OutputConfig(name_mode="suffix", name_tag="_OCR",
original_on_success="delete")
with patch("pdf_ocr_hotfolder.processor.run_ocr", side_effect=_fake_ocr):
result = process_pdf(
src=env["src"],
working_dir=env["working"],
outgoing_dir=env["outgoing"],
error_dir=env["error"],
ocr_cfg=OcrConfig(),
vera_cfg=VeraPdfConfig(enabled=False),
output_cfg=out_cfg,
)
assert result.success
assert (env["outgoing"] / "scan_OCR.pdf").exists()
def test_process_pdf_none_mode(tmp_path: Path) -> None:
env = _prepare(tmp_path)
out_cfg = OutputConfig(name_mode="none", name_tag="OCR_",
original_on_success="delete")
with patch("pdf_ocr_hotfolder.processor.run_ocr", side_effect=_fake_ocr):
result = process_pdf(
src=env["src"],
working_dir=env["working"],
outgoing_dir=env["outgoing"],
error_dir=env["error"],
ocr_cfg=OcrConfig(),
vera_cfg=VeraPdfConfig(enabled=False),
output_cfg=out_cfg,
)
assert result.success
# Ausgang hat GLEICHEN Namen wie Original
assert (env["outgoing"] / "scan.pdf").exists()
def test_process_pdf_archive_original(tmp_path: Path) -> None:
env = _prepare(tmp_path)
out_cfg = OutputConfig(name_mode="prefix", name_tag="OCR_",
original_on_success="archive",
archive_dir=str(env["archive"]))
with patch("pdf_ocr_hotfolder.processor.run_ocr", side_effect=_fake_ocr):
result = process_pdf(
src=env["src"],
working_dir=env["working"],
outgoing_dir=env["outgoing"],
error_dir=env["error"],
ocr_cfg=OcrConfig(),
vera_cfg=VeraPdfConfig(enabled=False),
output_cfg=out_cfg,
)
assert result.success
assert (env["outgoing"] / "OCR_scan.pdf").exists()
# Original liegt jetzt im Archiv
archived = env["archive"] / "scan.pdf"
assert archived.exists()
assert archived.read_bytes() == b"%PDF-1.4 original\n"
def test_process_pdf_archive_name_collision(tmp_path: Path) -> None:
"""Bei Namens-Kollision im Archiv wird Timestamp angehängt."""
env = _prepare(tmp_path)
# Vorhandene Kollisions-Datei
(env["archive"] / "scan.pdf").write_bytes(b"old")
out_cfg = OutputConfig(name_mode="prefix", name_tag="OCR_",
original_on_success="archive",
archive_dir=str(env["archive"]))
with patch("pdf_ocr_hotfolder.processor.run_ocr", side_effect=_fake_ocr):
process_pdf(
src=env["src"],
working_dir=env["working"],
outgoing_dir=env["outgoing"],
error_dir=env["error"],
ocr_cfg=OcrConfig(),
vera_cfg=VeraPdfConfig(enabled=False),
output_cfg=out_cfg,
)
# Alte Datei unverändert
assert (env["archive"] / "scan.pdf").read_bytes() == b"old"
# Neue Datei mit Timestamp-Suffix
archived = list(env["archive"].glob("scan_*.pdf"))
assert len(archived) == 1
assert archived[0].read_bytes() == b"%PDF-1.4 original\n"
+75
@@ -0,0 +1,75 @@
"""Tests for Issue #1: preflight check when Tesseract is missing."""
from __future__ import annotations
import sys
from unittest.mock import patch
import pytest
from pdf_ocr_hotfolder.service import (
HotfolderService,
PreflightError,
check_preflight,
)
def test_preflight_passes_when_all_binaries_present() -> None:
"""Wenn tesseract + gs im PATH sind, darf kein Fehler fliegen."""
with patch("pdf_ocr_hotfolder.service.shutil.which", return_value="/usr/bin/fake"):
check_preflight() # darf nicht werfen
def test_preflight_fails_when_tesseract_missing() -> None:
"""Fehlendes tesseract → PreflightError mit passender Meldung."""
def fake_which(name: str) -> str | None:
return None if name == "tesseract" else "/usr/bin/fake"
with patch("pdf_ocr_hotfolder.service.shutil.which", side_effect=fake_which):
with pytest.raises(PreflightError, match="tesseract"):
check_preflight()
def test_preflight_fails_when_ghostscript_missing() -> None:
def fake_which(name: str) -> str | None:
return None if name == "gs" else "/usr/bin/fake"
with patch("pdf_ocr_hotfolder.service.shutil.which", side_effect=fake_which):
with pytest.raises(PreflightError, match="gs"):
check_preflight()
def test_preflight_lists_all_missing_binaries() -> None:
"""Bei mehreren fehlenden Binaries werden alle genannt."""
with patch("pdf_ocr_hotfolder.service.shutil.which", return_value=None):
with pytest.raises(PreflightError) as exc_info:
check_preflight()
msg = str(exc_info.value)
assert "tesseract" in msg
assert "gs" in msg
def test_run_once_raises_preflight_error(tmp_config) -> None:
"""HotfolderService.run_once() wirft PreflightError, wenn tesseract fehlt."""
service = HotfolderService(tmp_config)
try:
with patch("pdf_ocr_hotfolder.service.shutil.which", return_value=None):
with pytest.raises(PreflightError):
service.run_once()
finally:
service._executor.shutdown(wait=False)
def test_main_returns_2_on_preflight_error(tmp_config, tmp_path, monkeypatch) -> None:
"""CLI liefert Exit-Code 2 bei Preflight-Fehler (Issue #1 Szenario)."""
cfg_file = tmp_path / "cfg.toml"
cfg_file.write_text(f"""
[paths]
incoming = "{tmp_config.paths.incoming}"
outgoing = "{tmp_config.paths.outgoing}"
working = "{tmp_config.paths.working}"
error = "{tmp_config.paths.error}"
""")
monkeypatch.setattr(sys, "argv", ["pdf-ocr-hotfolder", "--config", str(cfg_file), "--once"])
with patch("pdf_ocr_hotfolder.service.shutil.which", return_value=None):
from pdf_ocr_hotfolder.__main__ import main
assert main() == 2