2 Commits

Author SHA1 Message Date
techadmin cbdc9d6664 Fix issues #4, #5, #6: LXC compatibility, WorkingDirectory, GS backports
- #4: LXC/container drop-in (lxc-compat.conf) disables systemd hardening;
  the installer detects containers automatically and offers the drop-in
- #5: Added WorkingDirectory=/opt/pdf-ocr-hotfolder to the template unit
- #6: On Debian 12 with affected GS versions, the installer now offers a
  bookworm-backports upgrade automatically (instead of only warning)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-11 01:41:54 +02:00
techadmin a23a3968ef feat: configurable file name + archive mode for the original (v0.3.0)
New [output] section:
- name_mode: prefix | suffix | none (suffix is inserted before the extension)
- name_tag: string inserted verbatim
- original_on_success: delete | archive
- archive_dir with collision protection (timestamp suffix)

20 new tests (50 in total, all green).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-09 22:32:41 +02:00
14 changed files with 418 additions and 15 deletions
+26
@@ -1,5 +1,31 @@
# Changelog
## [0.3.1] - 2026-04-10
### Fixed
- **Issue #4**: LXC/container compatibility: systemd hardening (`PrivateTmp`, `ProtectSystem`, etc.)
  causes error 226/NAMESPACE in LXC containers. The installer detects a container environment automatically
  and offers a drop-in. In addition, `systemd/lxc-compat.conf` is included in the repo as a template.
- **Issue #5**: Added `WorkingDirectory=/opt/pdf-ocr-hotfolder` to the systemd template unit;
  without this entry the Python module could not be found.
- **Issue #6**: On Debian 12, for affected Ghostscript versions (10.0.0–10.02.0), the installer now
  offers to enable bookworm-backports and upgrade GS automatically (instead of only warning).
## [0.3.0] - 2026-04-09
### Added
- New config section `[output]` with:
  - `name_mode`: placement of the tag in the file name: `"prefix"`, `"suffix"` (before the extension), `"none"`
  - `name_tag`: string inserted verbatim, e.g. `"OCR_"` or `"_OCR"`
  - `original_on_success`: `"delete"` (previous default) or `"archive"`
  - `archive_dir`: target directory for `"archive"`, with collision protection (timestamp suffix)
- Runtime validation of the output config in `check_output_config()`
- 20 new tests for `build_output_name()`, `check_output_config()` and `process_pdf()`
  covering all combinations of naming mode and handling of the original
### Changed
- `process_pdf()` now takes `output_cfg: OutputConfig` as a required argument
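The collision protection listed for `archive_dir` can be pictured with a small sketch. It assumes the rule is "keep the file that is already archived and give the newcomer a timestamp suffix before its extension". The helper name `archive_destination` is hypothetical and not taken from the repo, and the timestamp is passed in as a parameter only to keep the result predictable.

```python
from datetime import datetime
from pathlib import Path


def archive_destination(archive_dir: Path, name: str, now: datetime) -> Path:
    """Pick a target path for `name` inside `archive_dir`.

    Hypothetical helper for illustration only.
    """
    dest = archive_dir / name
    if dest.exists():
        # Name is taken: insert a timestamp between stem and extension.
        ts = now.strftime("%Y%m%d-%H%M%S")
        dest = archive_dir / f"{dest.stem}_{ts}{dest.suffix}"
    return dest
```

With a one-second timestamp resolution, two collisions for the same name within the same second would still map to the same path, so a stricter variant would probably need a counter or a finer timestamp.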
## [0.2.2] - 2026-04-09
### Fixed
+35 -1
@@ -89,6 +89,22 @@ max_workers = 2 # parallel PDFs
timeout = 1800
```
### `[output]`
```toml
# File name in outgoing/:
# "prefix" → OCR_scan.pdf
# "suffix" → scan_OCR.pdf (before the extension)
# "none" → scan.pdf (unchanged)
name_mode = "prefix"
name_tag = "OCR_"
# What happens to the original after successful OCR:
# "delete" → delete it
# "archive" → move it to archive_dir
original_on_success = "delete"
archive_dir = "" # absolute path, required for "archive"
```
### `[upload.nextcloud]`
```toml
enabled = true
@@ -174,6 +190,24 @@ Service user needs **rw** on all four directories under `/var/lib/pdf-ocr-
sudo chown -R DOMAIN\\scanuser:DOMAIN\\scangroup /var/lib/pdf-ocr-hotfolder
```
### LXC/Container: Error 226/NAMESPACE
In LXC containers, systemd hardening options fail. The installer detects containers automatically and offers a drop-in. To apply it manually:
```bash
sudo mkdir -p /etc/systemd/system/pdf-ocr-hotfolder@.service.d/
sudo cp /opt/pdf-ocr-hotfolder/systemd/lxc-compat.conf \
/etc/systemd/system/pdf-ocr-hotfolder@.service.d/
sudo systemctl daemon-reload
sudo systemctl restart 'pdf-ocr-hotfolder@*'
```
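The container check the installer relies on (`systemd-detect-virt --container`) can also be run from Python, for example in a diagnostic script. The following is a minimal sketch under that assumption. The function name `running_in_container` is hypothetical, and the `run` parameter exists only so the check can be exercised without systemd being present.

```python
import subprocess


def running_in_container(run=subprocess.run) -> bool:
    """Return True if systemd-detect-virt reports a container environment."""
    try:
        # Exit status 0 means a container was detected; --quiet suppresses output.
        result = run(["systemd-detect-virt", "--container", "--quiet"], check=False)
    except FileNotFoundError:
        # Tool not installed: treat this as "not a container".
        return False
    return result.returncode == 0
```

If this returns `True` and the service fails with error 226/NAMESPACE, the drop-in described above is the likely fix.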
### Ghostscript PDF/A bug on Debian 12
GS 10.00.0–10.02.0 (the Debian 12 default) breaks OCR when `pdfa_level` is combined with `skip_text=true`. The installer offers bookworm-backports automatically. To upgrade manually:
```bash
echo 'deb http://deb.debian.org/debian bookworm-backports main' | \
sudo tee /etc/apt/sources.list.d/bookworm-backports.list
sudo apt update && sudo apt install -t bookworm-backports ghostscript
```
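To check whether an installed Ghostscript falls into the affected range, a numeric comparison is one option. This sketch is not how the installer does it (the installer uses a shell `case` on `gs --version`). It assumes the version string consists of dot-separated integers, and the function names are hypothetical.

```python
def parse_version(text: str) -> tuple[int, ...]:
    """Turn '10.02.0' into (10, 2, 0); leading zeros are ignored."""
    return tuple(int(part) for part in text.strip().split("."))


def gs_is_affected(version: str) -> bool:
    """True if the version lies in the range named above (10.00.0 to 10.02.0)."""
    return (10, 0, 0) <= parse_version(version) <= (10, 2, 0)


for v in ("9.56.1", "10.00.0", "10.02.0", "10.02.1"):
    print(v, gs_is_affected(v))
```

Comparing integer tuples avoids the pitfalls of string comparison, where `"10.0.0"` and `"10.00.0"` would be treated as different versions.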
### veraPDF validation always fails
Check the veraPDF binary (`[verapdf].binary`). If it is not strictly required, set `enabled = false`.
@@ -205,5 +239,5 @@ MIT — © Sonith UG
---
-**Version:** 0.2.0
+**Version:** 0.3.1
**Repo:** https://gitea.sonith.de/sonith_ug/pdf-ocr-hotfolder
+1 -1
@@ -1 +1 @@
-0.2.2
+0.3.1
+16
@@ -34,6 +34,22 @@ max_workers = 2
# Timeout per PDF in seconds
timeout = 1800
[output]
# How should the target file in the outgoing/ folder be named?
# "prefix" : name_tag is placed before the file name (OCR_scan.pdf)
# "suffix" : name_tag is placed before the extension (scan_OCR.pdf)
# "none" : file name stays the same as the original
name_mode = "prefix"
# String inserted verbatim. Empty string = no tag (same as mode="none").
# Examples: "OCR_", "[OCR]_", "_OCR", "_searchable"
name_tag = "OCR_"
# What happens to the original when OCR succeeded?
# "delete" : original is deleted (previous default)
# "archive" : original is moved to archive_dir
original_on_success = "delete"
# Absolute path; only relevant if original_on_success = "archive"
archive_dir = ""
[verapdf]
# PDF/A validation (optional)
enabled = false
+39 -5
@@ -52,7 +52,7 @@ install_base() {
        icc-profiles-free ca-certificates curl
    log_info "System-Pakete ok ✓"

-    # Ghostscript version check (Issue #3)
+    # Ghostscript version check (Issue #3 + Issue #6)
    if command -v gs >/dev/null 2>&1; then
        GS_VER="$(gs --version 2>/dev/null || echo 0.0)"
        log_info "Ghostscript: $GS_VER"
@@ -62,16 +62,50 @@ install_base() {
                log_warn "═══════════════════════════════════════════════════════════════"
                log_warn "Ghostscript $GS_VER ist vom PDF/A-Bug betroffen (10.0.0–10.02.0)."
                log_warn "Mit pdfa_level + skip_text=true kann ocrmypdf KEINE PDFs verarbeiten."
-                log_warn ""
-                log_warn "Workarounds:"
-                log_warn " 1. ghostscript aus bookworm-backports installieren (>=10.02.1)"
-                log_warn " 2. In der Config [ocr].pdfa_level = \"\" setzen (Default ab v0.2.2)"
                log_warn "═══════════════════════════════════════════════════════════════"
                echo
                # Check for Debian bookworm (12) and offer backports
                if grep -q 'bookworm' /etc/os-release 2>/dev/null; then
                    read -r -p "Ghostscript via bookworm-backports upgraden? [J/n]: " UPGRADE_GS
                    UPGRADE_GS="${UPGRADE_GS:-J}"
                    if [[ "$UPGRADE_GS" =~ ^[JjYy]$ ]]; then
                        log_info "Aktiviere bookworm-backports..."
                        if ! grep -q 'bookworm-backports' /etc/apt/sources.list /etc/apt/sources.list.d/*.list 2>/dev/null; then
                            echo 'deb http://deb.debian.org/debian bookworm-backports main' \
                                > /etc/apt/sources.list.d/bookworm-backports.list
                            apt-get update -qq
                        fi
                        apt-get install -y -t bookworm-backports ghostscript
                        GS_VER_NEW="$(gs --version 2>/dev/null || echo '?')"
                        log_info "Ghostscript aktualisiert: $GS_VER → $GS_VER_NEW"
                    else
                        log_warn "Workaround: In der Config [ocr].pdfa_level = \"\" setzen (Default ab v0.2.2)"
                    fi
                else
                    log_warn "Kein Debian bookworm erkannt — manuelles Upgrade nötig."
                    log_warn "Workaround: In der Config [ocr].pdfa_level = \"\" setzen (Default ab v0.2.2)"
                fi
                echo
                ;;
        esac
    fi

    # LXC/container detection (Issue #4)
    if systemd-detect-virt --container -q 2>/dev/null; then
        VIRT_TYPE="$(systemd-detect-virt --container 2>/dev/null || echo 'container')"
        log_warn "Container-Umgebung erkannt ($VIRT_TYPE)."
        log_warn "systemd-Hardening kann in Containern fehlschlagen (Error 226/NAMESPACE)."
        read -r -p "LXC-Kompatibilitäts-Drop-in installieren? [J/n]: " LXC_FIX
        LXC_FIX="${LXC_FIX:-J}"
        if [[ "$LXC_FIX" =~ ^[JjYy]$ ]]; then
            local LXC_DROPIN_DIR="/etc/systemd/system/pdf-ocr-hotfolder@.service.d"
            mkdir -p "$LXC_DROPIN_DIR"
            cp "$REPO_DIR/systemd/lxc-compat.conf" "$LXC_DROPIN_DIR/lxc-compat.conf"
            systemctl daemon-reload
            log_info "LXC-Kompatibilitäts-Drop-in installiert ✓"
        fi
    fi

    log_step "Default-User '$DEFAULT_USER' prüfen"
    if id "$DEFAULT_USER" &>/dev/null; then
        log_info "'$DEFAULT_USER' existiert bereits"
+1 -1
@@ -1,3 +1,3 @@
"""PDF OCR Hotfolder: make scanner PDFs searchable automatically."""

-__version__ = "0.1.0"
+__version__ = "0.3.1"
+16 -1
@@ -28,6 +28,18 @@ class OcrConfig:
    timeout: int = 1800


@dataclass
class OutputConfig:
    # "prefix" | "suffix" | "none"
    name_mode: str = "prefix"
    # Tag string, inserted verbatim (empty string = no tag)
    name_tag: str = "OCR_"
    # "delete" | "archive"
    original_on_success: str = "delete"
    # Absolute path; required if original_on_success == "archive"
    archive_dir: str = ""


@dataclass
class VeraPdfConfig:
    enabled: bool = False
@@ -79,6 +91,7 @@ class EmailNotify:
class Config:
    paths: Paths
    ocr: OcrConfig
    output: OutputConfig
    verapdf: VeraPdfConfig
    folder: FolderUpload
    nextcloud: NextcloudUpload
@@ -109,6 +122,8 @@ def load_config(path: str | Path) -> Config:
    ocr = OcrConfig(**{k: v for k, v in _section(data, "ocr").items()
                       if k in OcrConfig.__annotations__})
    output = OutputConfig(**{k: v for k, v in _section(data, "output").items()
                             if k in OutputConfig.__annotations__})
    verapdf = VeraPdfConfig(**{k: v for k, v in _section(data, "verapdf").items()
                               if k in VeraPdfConfig.__annotations__})
    folder = FolderUpload(**{k: v for k, v in _section(data, "upload", "folder").items()
@@ -123,7 +138,7 @@ def load_config(path: str | Path) -> Config:
    log_level = _section(data, "logging").get("level", "INFO")

    return Config(
-        paths=paths, ocr=ocr, verapdf=verapdf,
+        paths=paths, ocr=ocr, output=output, verapdf=verapdf,
        folder=folder, nextcloud=nextcloud, sftp=sftp, email=email,
        log_level=log_level,
    )
+60 -4
@@ -7,11 +7,37 @@ import subprocess
from dataclasses import dataclass
from pathlib import Path

-from .config import OcrConfig, VeraPdfConfig
+from .config import OcrConfig, OutputConfig, VeraPdfConfig

log = logging.getLogger(__name__)


def build_output_name(src_name: str, mode: str, tag: str) -> str:
    """Build the target file name for an OCR PDF.

    Args:
        src_name: Original file name (e.g. "scan.pdf")
        mode: "prefix" | "suffix" | "none"
        tag: String to insert (verbatim, empty = no tag)

    Examples:
        prefix "OCR_": "scan.pdf" -> "OCR_scan.pdf"
        suffix "_OCR": "scan.pdf" -> "scan_OCR.pdf"
        suffix "_OCR": "scan.tar.gz.pdf" -> "scan.tar.gz_OCR.pdf"
        none: "scan.pdf" -> "scan.pdf"
    """
    if mode == "none" or not tag:
        return src_name
    if mode == "prefix":
        return f"{tag}{src_name}"
    if mode == "suffix":
        # Split off only the last extension, otherwise "foo.bar.pdf" would break
        p = Path(src_name)
        stem, ext = p.stem, p.suffix
        return f"{stem}{tag}{ext}"
    raise ValueError(f"Unbekannter name_mode: {mode!r}")


@dataclass
class ProcessResult:
    source: Path
@@ -71,11 +97,13 @@ def process_pdf(
    error_dir: Path,
    ocr_cfg: OcrConfig,
    vera_cfg: VeraPdfConfig,
    output_cfg: OutputConfig,
) -> ProcessResult:
    """Process a single PDF: move→OCR→validate→outgoing/error."""
    out_name = build_output_name(src.name, output_cfg.name_mode, output_cfg.name_tag)
    work_src = working_dir / src.name
-    work_out = working_dir / f"OCR_{src.name}"
-    final_out = outgoing_dir / f"OCR_{src.name}"
+    work_out = working_dir / f"__ocr_{out_name}"  # temp name, so that it differs from src.name
+    final_out = outgoing_dir / out_name

    try:
        shutil.move(str(src), str(work_src))
@@ -100,10 +128,38 @@ def process_pdf(
    outgoing_dir.mkdir(parents=True, exist_ok=True)
    shutil.move(str(work_out), str(final_out))
-    work_src.unlink(missing_ok=True)
+    _dispose_original(work_src, src.name, output_cfg)
    return ProcessResult(src, final_out, True, verapdf_passed=vera_ok)


def _dispose_original(work_src: Path, original_name: str, cfg: OutputConfig) -> None:
    """Dispose of the original after successful OCR: delete or archive."""
    if not work_src.exists():
        return
    mode = cfg.original_on_success
    if mode == "delete":
        work_src.unlink(missing_ok=True)
        return
    if mode == "archive":
        if not cfg.archive_dir:
            log.error("original_on_success=archive aber archive_dir ist leer — lösche stattdessen")
            work_src.unlink(missing_ok=True)
            return
        archive = Path(cfg.archive_dir)
        archive.mkdir(parents=True, exist_ok=True)
        dest = archive / original_name
        # On a name collision, rename with a timestamp
        if dest.exists():
            from datetime import datetime
            ts = datetime.now().strftime("%Y%m%d-%H%M%S")
            dest = archive / f"{dest.stem}_{ts}{dest.suffix}"
        shutil.move(str(work_src), str(dest))
        log.info("Original archiviert: %s", dest)
        return
    log.warning("Unbekannter original_on_success=%r — lösche stattdessen", mode)
    work_src.unlink(missing_ok=True)


def _move_to_error(p: Path, error_dir: Path) -> None:
    error_dir.mkdir(parents=True, exist_ok=True)
    try:
+19
@@ -72,6 +72,20 @@ def detect_ghostscript_version() -> str | None:
    return result.stdout.strip() or None


def check_output_config(mode: str, archive_dir: str) -> None:
    """Validate the [output] section. Raises PreflightError on problems."""
    valid_modes = {"delete", "archive"}
    if mode not in valid_modes:
        raise PreflightError(
            f"[output].original_on_success={mode!r} ungültig. "
            f"Erlaubt: {sorted(valid_modes)}"
        )
    if mode == "archive" and not archive_dir:
        raise PreflightError(
            "[output].original_on_success='archive' erfordert [output].archive_dir"
        )


def check_preflight(pdfa_level: str = "") -> None:
    """Check external dependencies.
@@ -173,6 +187,8 @@ class HotfolderService:
    def run(self) -> None:
        check_preflight(self.cfg.ocr.pdfa_level)
        check_output_config(self.cfg.output.original_on_success,
                            self.cfg.output.archive_dir)
        self.ensure_dirs()
        self._scan_existing()
@@ -197,6 +213,8 @@ class HotfolderService:
            Number of failed PDFs (0 = all ok).
        """
        check_preflight(self.cfg.ocr.pdfa_level)
        check_output_config(self.cfg.output.original_on_success,
                            self.cfg.output.archive_dir)
        self.ensure_dirs()
        self._scan_existing()
        self._executor.shutdown(wait=True)
@@ -254,6 +272,7 @@ class HotfolderService:
            error_dir=self.cfg.paths.error,
            ocr_cfg=self.cfg.ocr,
            vera_cfg=self.cfg.verapdf,
            output_cfg=self.cfg.output,
        )

        with self._lock:
+10
@@ -0,0 +1,10 @@
# Drop-in for LXC/container operation
# Copy to: /etc/systemd/system/pdf-ocr-hotfolder@.service.d/lxc-compat.conf
# Then: systemctl daemon-reload && systemctl restart 'pdf-ocr-hotfolder@*'
[Service]
PrivateTmp=false
ProtectSystem=false
ProtectKernelTunables=false
ProtectKernelModules=false
ProtectControlGroups=false
+1
@@ -7,6 +7,7 @@ Wants=network-online.target
Type=simple
User=pdfocr
Group=pdfocr
WorkingDirectory=/opt/pdf-ocr-hotfolder
ExecStart=/opt/pdf-ocr-hotfolder/venv/bin/python -m pdf_ocr_hotfolder --config /etc/pdf-ocr-hotfolder/%i.toml
Restart=on-failure
RestartSec=5
+2
@@ -11,6 +11,7 @@ from pdf_ocr_hotfolder.config import (
    FolderUpload,
    NextcloudUpload,
    OcrConfig,
    OutputConfig,
    Paths,
    SftpUpload,
    VeraPdfConfig,
@@ -32,6 +33,7 @@ def tmp_config(tmp_path: Path) -> Config:
    return Config(
        paths=paths,
        ocr=OcrConfig(max_workers=1),
        output=OutputConfig(),
        verapdf=VeraPdfConfig(enabled=False),
        folder=FolderUpload(enabled=False),
        nextcloud=NextcloudUpload(enabled=False),
+2 -2
@@ -8,7 +8,7 @@ from pdf_ocr_hotfolder.processor import ProcessResult
from pdf_ocr_hotfolder.service import HotfolderService


-def _fake_success(src: Path, working_dir, outgoing_dir, error_dir, ocr_cfg, vera_cfg):
+def _fake_success(src: Path, working_dir, outgoing_dir, error_dir, **kwargs):
    out = outgoing_dir / f"OCR_{src.name}"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_bytes(b"%PDF-1.4 ocr\n")
@@ -16,7 +16,7 @@ def _fake_success(src: Path, working_dir, outgoing_dir, error_dir, ocr_cfg, vera
    return ProcessResult(src, out, True)


-def _fake_failure(src: Path, working_dir, outgoing_dir, error_dir, ocr_cfg, vera_cfg):
+def _fake_failure(src: Path, working_dir, outgoing_dir, error_dir, **kwargs):
    error_dir.mkdir(parents=True, exist_ok=True)
    dest = error_dir / src.name
    src.rename(dest)
+190
@@ -0,0 +1,190 @@
"""Tests for the feature: configurable file names and handling of the original."""
from __future__ import annotations

from pathlib import Path
from unittest.mock import patch

import pytest

from pdf_ocr_hotfolder.config import OcrConfig, OutputConfig, VeraPdfConfig
from pdf_ocr_hotfolder.processor import build_output_name, process_pdf
from pdf_ocr_hotfolder.service import PreflightError, check_output_config


# ---------------- build_output_name ----------------

@pytest.mark.parametrize("src,mode,tag,expected", [
    # prefix
    ("scan.pdf", "prefix", "OCR_", "OCR_scan.pdf"),
    ("scan.pdf", "prefix", "[OCR] ", "[OCR] scan.pdf"),
    # suffix (tag before the extension)
    ("scan.pdf", "suffix", "_OCR", "scan_OCR.pdf"),
    ("scan.pdf", "suffix", "-ocr", "scan-ocr.pdf"),
    # none
    ("scan.pdf", "none", "OCR_", "scan.pdf"),
    # empty tag = none
    ("scan.pdf", "prefix", "", "scan.pdf"),
    ("scan.pdf", "suffix", "", "scan.pdf"),
    # Multiple dots in the name: only the last extension counts
    ("rechnung.2026.pdf", "suffix", "_OCR", "rechnung.2026_OCR.pdf"),
    ("rechnung.2026.pdf", "prefix", "OCR_", "OCR_rechnung.2026.pdf"),
    # Name without extension
    ("NO_EXT", "suffix", "_OCR", "NO_EXT_OCR"),
])
def test_build_output_name(src, mode, tag, expected) -> None:
    assert build_output_name(src, mode, tag) == expected


def test_build_output_name_invalid_mode() -> None:
    with pytest.raises(ValueError, match="name_mode"):
        build_output_name("x.pdf", "bogus", "OCR_")


# ---------------- check_output_config ----------------

def test_check_output_config_delete_ok() -> None:
    check_output_config("delete", "")  # ok


def test_check_output_config_archive_requires_dir() -> None:
    with pytest.raises(PreflightError, match="archive_dir"):
        check_output_config("archive", "")


def test_check_output_config_archive_with_dir_ok() -> None:
    check_output_config("archive", "/var/archive")  # ok


def test_check_output_config_invalid_mode() -> None:
    with pytest.raises(PreflightError, match="ungültig"):
        check_output_config("trash", "")


# ---------------- process_pdf with handling of the original ----------------

def _fake_ocr(src: Path, dst: Path, cfg: OcrConfig) -> None:
    """Simulates ocrmypdf: copies the content and creates the target file."""
    dst.write_bytes(b"%PDF-1.4 OCRed\n" + src.read_bytes())


def _prepare(tmp_path: Path) -> dict:
    dirs = {
        "working": tmp_path / "working",
        "outgoing": tmp_path / "outgoing",
        "error": tmp_path / "error",
        "archive": tmp_path / "archive",
        "incoming": tmp_path / "incoming",
    }
    for d in dirs.values():
        d.mkdir(parents=True, exist_ok=True)
    src = dirs["incoming"] / "scan.pdf"
    src.write_bytes(b"%PDF-1.4 original\n")
    return {"src": src, **dirs}


def test_process_pdf_prefix_delete(tmp_path: Path) -> None:
    env = _prepare(tmp_path)
    out_cfg = OutputConfig(name_mode="prefix", name_tag="OCR_",
                           original_on_success="delete")
    with patch("pdf_ocr_hotfolder.processor.run_ocr", side_effect=_fake_ocr):
        result = process_pdf(
            src=env["src"],
            working_dir=env["working"],
            outgoing_dir=env["outgoing"],
            error_dir=env["error"],
            ocr_cfg=OcrConfig(),
            vera_cfg=VeraPdfConfig(enabled=False),
            output_cfg=out_cfg,
        )
    assert result.success
    assert (env["outgoing"] / "OCR_scan.pdf").exists()
    # Original is gone, neither in incoming nor in working
    assert not env["src"].exists()
    assert not (env["working"] / "scan.pdf").exists()


def test_process_pdf_suffix_delete(tmp_path: Path) -> None:
    env = _prepare(tmp_path)
    out_cfg = OutputConfig(name_mode="suffix", name_tag="_OCR",
                           original_on_success="delete")
    with patch("pdf_ocr_hotfolder.processor.run_ocr", side_effect=_fake_ocr):
        result = process_pdf(
            src=env["src"],
            working_dir=env["working"],
            outgoing_dir=env["outgoing"],
            error_dir=env["error"],
            ocr_cfg=OcrConfig(),
            vera_cfg=VeraPdfConfig(enabled=False),
            output_cfg=out_cfg,
        )
    assert result.success
    assert (env["outgoing"] / "scan_OCR.pdf").exists()


def test_process_pdf_none_mode(tmp_path: Path) -> None:
    env = _prepare(tmp_path)
    out_cfg = OutputConfig(name_mode="none", name_tag="OCR_",
                           original_on_success="delete")
    with patch("pdf_ocr_hotfolder.processor.run_ocr", side_effect=_fake_ocr):
        result = process_pdf(
            src=env["src"],
            working_dir=env["working"],
            outgoing_dir=env["outgoing"],
            error_dir=env["error"],
            ocr_cfg=OcrConfig(),
            vera_cfg=VeraPdfConfig(enabled=False),
            output_cfg=out_cfg,
        )
    assert result.success
    # Output has the SAME name as the original
    assert (env["outgoing"] / "scan.pdf").exists()


def test_process_pdf_archive_original(tmp_path: Path) -> None:
    env = _prepare(tmp_path)
    out_cfg = OutputConfig(name_mode="prefix", name_tag="OCR_",
                           original_on_success="archive",
                           archive_dir=str(env["archive"]))
    with patch("pdf_ocr_hotfolder.processor.run_ocr", side_effect=_fake_ocr):
        result = process_pdf(
            src=env["src"],
            working_dir=env["working"],
            outgoing_dir=env["outgoing"],
            error_dir=env["error"],
            ocr_cfg=OcrConfig(),
            vera_cfg=VeraPdfConfig(enabled=False),
            output_cfg=out_cfg,
        )
    assert result.success
    assert (env["outgoing"] / "OCR_scan.pdf").exists()
    # Original is now in the archive
    archived = env["archive"] / "scan.pdf"
    assert archived.exists()
    assert archived.read_bytes() == b"%PDF-1.4 original\n"


def test_process_pdf_archive_name_collision(tmp_path: Path) -> None:
    """On a name collision in the archive, a timestamp is appended."""
    env = _prepare(tmp_path)
    # Pre-existing colliding file
    (env["archive"] / "scan.pdf").write_bytes(b"old")
    out_cfg = OutputConfig(name_mode="prefix", name_tag="OCR_",
                           original_on_success="archive",
                           archive_dir=str(env["archive"]))
    with patch("pdf_ocr_hotfolder.processor.run_ocr", side_effect=_fake_ocr):
        process_pdf(
            src=env["src"],
            working_dir=env["working"],
            outgoing_dir=env["outgoing"],
            error_dir=env["error"],
            ocr_cfg=OcrConfig(),
            vera_cfg=VeraPdfConfig(enabled=False),
            output_cfg=out_cfg,
        )
    # Old file unchanged
    assert (env["archive"] / "scan.pdf").read_bytes() == b"old"
    # New file with timestamp suffix
    archived = list(env["archive"].glob("scan_*.pdf"))
    assert len(archived) == 1
    assert archived[0].read_bytes() == b"%PDF-1.4 original\n"