feat(#161): SL parser in production (Saarland HTML voting results)

Saarland does not publish verbatim transcripts; instead it publishes its own HTML pages with structured voting results per session:

  <p>Drucksache 17/2076 ... in Erster Lesung mit Stimmenmehrheit angenommen ... [SPD: dafür; CDU und AfD: dagegen]</p>

The input is therefore HTML, not PDF.

The parser iterates over LI blocks and extracts from each block:
- Drucksache from "Drucksache N/M"
- Status from "(einstimmig|mit Stimmenmehrheit)? (angenommen|abgelehnt)"
- Vote block from "[SPD: dafür; CDU: dagegen; AfD: Enthaltung]"
- einstimmig=True if the status contains "einstimmig"

Vote-bracket parser (standalone, unlike the speech-style parsers of the other federal states):
- Split on ";" into "phrase: status" segments
- Match the phrase against {SPD,CDU,AfD} with a word-boundary regex
- Status map: dafür→ja, dagegen→nein, Enthaltung→enthaltung

URL pattern (not directly predictable because of the date slug):
https://www.landtag-saar.de/aktuelles/mitteilungen/abstimmungsergebnisse-der-{n}-landtagssitzung-vom-{datum}/

Auto-ingest via index scrape (analogous to HH/HE/SH):
- Scrape /aktuelles/mitteilungen/
- Skip WP16 URLs (containing "wahlperiode-vom")
- For each new session: download the HTML and call the ingest_pdf API on the .html file

Tests: 18 SL tests (verification against session 46 → 18 votes with correct JA/NEIN/ENTH lists).

Status: 9 production parsers (NRW, BUND, BE, HH, TH, HE, SH, HB, SL).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 01:53:51 +02:00 · 2026-04-29 01:53:51 +02:00 · d0f7b9217c
commit d0f7b9217c
parent a9376c2478
6 changed files with 328 additions and 39 deletions
--- a/app/protokoll_parsers/__init__.py
+++ b/app/protokoll_parsers/__init__.py
@@ -36,6 +36,7 @@ from .th import parse_protocol as _parse_th
 from .he import parse_protocol as _parse_he
 from .sh import parse_protocol as _parse_sh
 from .hb import parse_protocol as _parse_hb
+from .sl import parse_protocol as _parse_sl

 # Type alias for readability; the parser signature is deliberately minimal.
 ProtokollParser = Callable[[str], list[dict]]
@@ -49,6 +50,7 @@ PROTOKOLL_PARSERS: dict[str, ProtokollParser] = {
    "HE": _parse_he,
    "SH": _parse_sh,
    "HB": _parse_hb,
+    "SL": _parse_sl,
 }
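Review note: the diff only shows the registry itself, not how callers consult it. A minimal sketch of a lookup by Bundesland code might look like the following. `REGISTRY`, `parse_for` and `_parse_demo` are hypothetical names used for illustration only; the real registry is `PROTOKOLL_PARSERS`, and how `ingest_pdf` actually dispatches is not visible in this commit.

```python
from typing import Callable

# Same shape as the alias in the diff: a parser takes a file path
# and returns a list of vote dicts.
ProtokollParser = Callable[[str], list[dict]]


def _parse_demo(path: str) -> list[dict]:
    # Hypothetical stand-in for a real parser such as sl.parse_protocol.
    return [{"drucksache": "17/2076", "ergebnis": "angenommen", "source": path}]


# Hypothetical registry mirroring the structure of PROTOKOLL_PARSERS.
REGISTRY: dict[str, ProtokollParser] = {"SL": _parse_demo}


def parse_for(bundesland: str, path: str) -> list[dict]:
    """Look up the parser for a Bundesland code and run it on the file."""
    parser = REGISTRY.get(bundesland)
    if parser is None:
        raise KeyError(f"no parser registered for {bundesland!r}")
    return parser(path)


votes = parse_for("SL", "sitzung-46.html")
print(votes[0]["drucksache"])
```

Keeping the signature at `Callable[[str], list[dict]]` is what lets an HTML parser slot in next to the PDF parsers without changes to the caller.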


--- a/app/protokoll_parsers/sl.py
+++ b/app/protokoll_parsers/sl.py
@@ -1,47 +1,152 @@
-"""Saarland (SL): plenary protocol parser STUB (follow-up to #106, ADR 0009).
+"""Saarland (SL): voting results parser (#106 / #161, ADR 0009).

-**Status: not yet implemented.** This module stub contains the
-research findings from 2026-04-28, so that the implementer session
-can get productive right away. The stub is **not** registered in
-``app.protokoll_parsers.PROTOKOLL_PARSERS``; the
-auto-ingest cron skips SL until then.
+**Special case:** Saarland does not publish verbatim transcripts, but its own
+voting results HTML pages per session, each with a structured vote block:

-## Research
+```
+<p>...Drucksache 17/2076...
+in Erster Lesung mit Stimmenmehrheit angenommen und an den Ausschuss [...]
+[SPD: dafür; CDU und AfD: dagegen]</p>
+```

-| Field | Value |
-|---|---|
-| **Documentation system** | In-house system |
-| **Base-URL** | https://www.landtag-saar.de |
-| **Family** | standalone |
-| **Format** | PDF expected via Umbraco-Filterm |
+The input is therefore HTML, not PDF. ``parse_protocol(html_path)`` reads the
+HTML page and extracts one vote per <li>.

-## URL-Discovery
+URL pattern (not directly predictable, hence the index scrape):
+``https://www.landtag-saar.de/aktuelles/mitteilungen/abstimmungsergebnisse-der-{n}-landtagssitzung-vom-{datum}/``

-Umbraco backend; see SaarlandAdapter. Plenary protocols possibly analogous to Drucksachen via the aawSearchSurfaceController pattern
+Index page: https://www.landtag-saar.de (front listing of the Mitteilungen).

-## References
+## Vote block format

- Architecture: ADR 0009 (Plenarprotokoll-Parser-Registry)
- Roadmap: ``docs/protokoll-parser-roadmap.md``
- Reference implementation: ``app/protokoll_parsers/nrw.py``
-  (38 tests, 19/19 fixture guarantee)
- Follow-up issue: https://repo.toppyr.de/tobias/gwoe-antragspruefer/issues/161 (title: "protokoll-parser: SL (Saarland)")
+Structured bracket notation per Drucksache:
+- ``[SPD: dafür; CDU und AfD: dagegen]`` → JA=[SPD], NEIN=[CDU,AfD]
+- ``[SPD: dafür; CDU: dagegen; AfD: Enthaltung]`` → JA=[SPD], NEIN=[CDU], ENTH=[AfD]
+- ``[SPD und CDU: dafür; AfD: Enthaltung]`` → JA=[SPD,CDU], NEIN=[], ENTH=[AfD]

-## Effort
+## Result mapping

-Estimated 1-3 days of focused work:
- 2-4h URL discovery + format inspection (read through a sample protocol)
- 4-8h reverse engineering of anchor phrases + parser implementation
- 4h tests with fixture pinning
- 1h entry in PROTOKOLL_PARSERS + auto-ingest-protocols.sh
+- ``angenommen`` (with or without ``mit Stimmenmehrheit|einstimmig``) → angenommen
+- ``abgelehnt`` → abgelehnt
+- ``zur Kenntnis genommen`` → skipped (no vote)
+
+## Parliamentary group mapping WP17 (since 2022)
+
+WP17 constellation: SPD single-party government (29 of 51 seats), CDU + AfD in opposition.
+- ``SPD``, ``CDU``, ``AfD``
 """
 from __future__ import annotations

+import re
+from typing import Optional

-def parse_protocol(path: str) -> list[dict]:
-    """STUB: see module docstring."""
-    raise NotImplementedError(
-        "SL plenary protocol parser is not yet implemented. "
-        "See the app/protokoll_parsers/sl.py docstring for research findings "
-        "and docs/protokoll-parser-roadmap.md."
-    )
+
+ALLE_FRAKTIONEN_SL = ["SPD", "CDU", "AfD"]
+
+
+# <li>...</li> blocks of a session page; each block typically contains
+# 1x Drucksache + 1x status + 1x vote bracket.
+LI_BLOCK_RE = re.compile(
+    r"<li[^>]*>(.*?)</li>",
+    re.DOTALL,
+)
+
+DS_RE_SL = re.compile(r"Drucksache\s+(\d{1,2}/\d{2,5})")
+
+STATUS_RE = re.compile(
+    r"(?:in\s+\w+\s+Lesung\s+)?"
+    r"(?:mit\s+Stimmenmehrheit|einstimmig|mit\s+Mehrheit)?\s*"
+    r"(?P<ergebnis>angenommen|abgelehnt|abgesetzt|zur\s+Kenntnis\s+genommen)",
+    re.IGNORECASE,
+)
+
+# Vote bracket: [SPD: dafür; CDU und AfD: dagegen]
+VOTE_BRACKET_RE = re.compile(r"\[(?P<inner>[^\[\]]+)\]")
+
+
+def _normalize_fraktionen_sl(phrase: str) -> list[str]:
+    """SPD und CDU → ['CDU', 'SPD']; CDU → ['CDU']."""
+    found = set()
+    for fr in ALLE_FRAKTIONEN_SL:
+        if re.search(rf"\b{re.escape(fr)}\b", phrase, re.IGNORECASE):
+            found.add(fr)
+    return sorted(found)
+
+
+def _parse_vote_bracket(bracket_inner: str) -> dict:
+    """Parses '[SPD: dafür; CDU und AfD: dagegen]' (inner text, without the brackets)."""
+    votes = {"ja": [], "nein": [], "enthaltung": []}
+    for segment in bracket_inner.split(";"):
+        if ":" not in segment:
+            continue
+        fraktionen_phrase, _, status = segment.rpartition(":")
+        status = status.strip().lower()
+        fraktionen = _normalize_fraktionen_sl(fraktionen_phrase)
+        if "dafür" in status or "ja" in status or "zustimm" in status:
+            votes["ja"].extend(fraktionen)
+        elif "dagegen" in status or "nein" in status or "ablehn" in status:
+            votes["nein"].extend(fraktionen)
+        elif "enthalt" in status:
+            votes["enthaltung"].extend(fraktionen)
+    for key in votes:
+        votes[key] = sorted(set(votes[key]))
+    return votes
+
+
+def _strip_html(text: str) -> str:
+    text = re.sub(r"<[^>]+>", " ", text)
+    text = text.replace("&amp;", "&").replace("&nbsp;", " ")
+    return re.sub(r"\s+", " ", text).strip()
+
+
+def parse_protocol(html_path: str) -> list[dict]:
+    """Parses an SL voting results HTML page; returns status + votes."""
+    with open(html_path, "r", encoding="utf-8", errors="replace") as f:
+        html = f.read()
+
+    results = []
+    for m in LI_BLOCK_RE.finditer(html):
+        block_html = m.group(1)
+        block_text = _strip_html(block_html)
+
+        ds_m = DS_RE_SL.search(block_text)
+        if not ds_m:
+            continue
+        ds = ds_m.group(1)
+
+        status_m = STATUS_RE.search(block_text)
+        if not status_m:
+            continue
+        ergebnis = status_m.group("ergebnis").lower()
+        if "kenntnis" in ergebnis:
+            continue
+
+        modus_match = re.search(r"einstimmig", block_text, re.IGNORECASE)
+        einstimmig = bool(modus_match)
+
+        vote_m = VOTE_BRACKET_RE.search(block_text)
+        votes = {"ja": [], "nein": [], "enthaltung": []}
+        if vote_m:
+            votes = _parse_vote_bracket(vote_m.group("inner"))
+
+        if einstimmig and not votes["ja"]:
+            votes["ja"] = list(ALLE_FRAKTIONEN_SL)
+
+        results.append({
+            "drucksache": ds,
+            "ergebnis": ergebnis,
+            "einstimmig": einstimmig,
+            "kind": "direct",
+            "votes": votes,
+            "anchor_pos": m.start(),
+        })
+
+    seen = set()
+    deduped = []
+    for r in results:
+        key = (r["drucksache"], r["anchor_pos"])
+        if key in seen:
+            continue
+        seen.add(key)
+        deduped.append(r)
+    return deduped
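Review note: to make the per-block extraction above easier to follow, here is a condensed, self-contained sketch of the same idea run on a small sample. It re-declares simplified regexes instead of importing `app.protokoll_parsers.sl`, uses a strict status map rather than the substring checks in the diff, and the sample HTML is hand-written for illustration (an assumption, not a real session page).

```python
import re

FRAKTIONEN = ["SPD", "CDU", "AfD"]
LI_RE = re.compile(r"<li[^>]*>(.*?)</li>", re.DOTALL)
DS_RE = re.compile(r"Drucksache\s+(\d{1,2}/\d{2,5})")
STATUS_RE = re.compile(r"(angenommen|abgelehnt)", re.IGNORECASE)
BRACKET_RE = re.compile(r"\[([^\[\]]+)\]")
STATUS_MAP = {"dafür": "ja", "dagegen": "nein", "enthaltung": "enthaltung"}


def parse_bracket(inner):
    """Turn 'SPD: dafür; CDU und AfD: dagegen' into ja/nein/enthaltung lists."""
    votes = {"ja": [], "nein": [], "enthaltung": []}
    for segment in inner.split(";"):
        phrase, sep, status = segment.rpartition(":")
        if not sep:
            continue  # segment without "phrase: status" shape
        key = STATUS_MAP.get(status.strip().lower())
        if key is None:
            continue  # unknown status word
        votes[key] += [f for f in FRAKTIONEN if re.search(rf"\b{f}\b", phrase)]
    return votes


def parse_html(html):
    """Yield one result dict per <li> that has a Drucksache and a status."""
    out = []
    for m in LI_RE.finditer(html):
        # Strip tags, then collapse whitespace.
        text = re.sub(r"\s+", " ", re.sub(r"<[^>]+>", " ", m.group(1))).strip()
        ds = DS_RE.search(text)
        status = STATUS_RE.search(text)
        if not (ds and status):
            continue
        bracket = BRACKET_RE.search(text)
        votes = (parse_bracket(bracket.group(1)) if bracket
                 else {"ja": [], "nein": [], "enthaltung": []})
        out.append({
            "drucksache": ds.group(1),
            "ergebnis": status.group(1).lower(),
            "votes": votes,
        })
    return out


# Hand-written sample, not copied from landtag-saar.de.
SAMPLE = """
<ul>
  <li><p>Gesetzentwurf (Drucksache 17/2076) in Erster Lesung mit
  Stimmenmehrheit angenommen [SPD: dafür; CDU und AfD: dagegen]</p></li>
  <li><p>Bericht (Drucksache 17/2080) zur Kenntnis genommen</p></li>
</ul>
"""
result = parse_html(SAMPLE)
print(result)
```

In this sketch the second item is dropped because "zur Kenntnis genommen" matches neither `angenommen` nor `abgelehnt`; the parser in the diff reaches the same outcome through its explicit `kenntnis` check.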
--- a/docs/protokoll-parser-roadmap.md
+++ b/docs/protokoll-parser-roadmap.md
@@ -29,7 +29,7 @@ body, and the entry is added to `PROTOKOLL_PARSERS`.
 | NI | `ni.py` | [#158](https://repo.toppyr.de/tobias/gwoe-antragspruefer/issues/158) | 📋 Stub (NILAS login) |
 | RP | `rp.py` | [#159](https://repo.toppyr.de/tobias/gwoe-antragspruefer/issues/159) | 📋 Stub (OPAL external, no direct URL pattern) |
 | **SH** | `sh.py` | [#160](https://repo.toppyr.de/tobias/gwoe-antragspruefer/issues/160) | ✅ production (575 votes, 110 protocols, WP20, index scrape) |
-| SL | `sl.py` | [#161](https://repo.toppyr.de/tobias/gwoe-antragspruefer/issues/161) | 📋 Stub |
+| **SL** | `sl.py` | [#161](https://repo.toppyr.de/tobias/gwoe-antragspruefer/issues/161) | ✅ production (HTML voting results, WP17, index scrape) |
 | SN | `sn.py` | [#162](https://repo.toppyr.de/tobias/gwoe-antragspruefer/issues/162) | 📋 Stub (manual XML export) |
 | **TH** | `th.py` | [#163](https://repo.toppyr.de/tobias/gwoe-antragspruefer/issues/163) | ✅ production (459 votes, 96 protocols, WP8, URL pattern) |

--- a/scripts/auto-ingest-protocols.sh
+++ b/scripts/auto-ingest-protocols.sh
@@ -221,6 +221,75 @@ for m in matches:
 print(f"  SH: {new_count} new protocols ingested")
 EOF

+# ─── SL: HTML voting results index ─────────────────────────────
+# SL does not publish verbatim transcripts, only HTML voting results pages.
+# Scrape the /aktuelles/mitteilungen/ index, then load and parse each
+# abstimmungsergebnisse link individually.
+echo "--- SL WP17 (HTML index scrape) ---"
+docker exec -i "$CONTAINER" python <<'EOF'
+import re, sys
+import urllib.request
+import sqlite3
+import asyncio
+
+BASE = "https://www.landtag-saar.de"
+req = urllib.request.Request(
+    f"{BASE}/aktuelles/mitteilungen/",
+    headers={"User-Agent": "Mozilla/5.0 GWOeAntragspruefer"},
+)
+try:
+    html = urllib.request.urlopen(req, timeout=20).read().decode("utf-8", errors="replace")
+except Exception as e:
+    print(f"  Index scrape failed: {e}")
+    sys.exit(0)
+
+# WP17 slugs carry no "wahlperiode-vom" marker, so hrefs containing it are skipped below.
+url_re = re.compile(
+    r'href="(/aktuelles/mitteilungen/abstimmungsergebnisse-der-(\d+)-landtagssitzung-vom-[^"]+?/)"'
+)
+matches = []
+seen_pids = set()
+for m in url_re.finditer(html):
+    href, sitzung = m.groups()
+    if "wahlperiode-vom" in href:
+        continue  # skip WP16 URLs
+    pid = f"SL17-{sitzung}"
+    if pid in seen_pids:
+        continue
+    seen_pids.add(pid)
+    matches.append((pid, BASE + href))
+print(f"  {len(matches)} SL sessions (WP17) found in index")
+
+db = sqlite3.connect("/app/data/gwoe-antraege.db")
+existing = {row[0] for row in db.execute(
+    "SELECT quelle_protokoll FROM plenum_vote_results WHERE bundesland='SL'"
+)}
+
+from app.ingest_votes import ingest_pdf
+from pathlib import Path
+import tempfile
+
+new_count = 0
+for pid, url in matches:
+    if pid in existing:
+        continue
+    print(f"  → new: {pid} ({url[:80]})")
+    with tempfile.NamedTemporaryFile(suffix=".html", delete=False) as tmp:
+        tmp_path = Path(tmp.name)
+    try:
+        urllib.request.urlretrieve(url, tmp_path)
+        stats = asyncio.run(ingest_pdf(
+            tmp_path, bundesland="SL", protokoll_id=pid, quelle_url=url,
+        ))
+        print(f"    parsed: {stats['parsed']}, written: {stats['written']}")
+        new_count += 1
+    except Exception as e:
+        print(f"    error: {e}")
+    finally:
+        tmp_path.unlink(missing_ok=True)
+print(f"  SL: {new_count} new sessions ingested")
+EOF
+
 for entry in "${PROTO_TARGETS[@]}"; do
  IFS='|' read -r bl wp prefix pattern <<< "$entry"
  echo "--- ${bl} WP${wp} (prefix=${prefix}) ---"
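Review note: the index filter in the SL heredoc can be exercised offline, without network or database. The sketch below applies the same URL regex and de-duplication to a hand-written HTML fragment. The function name `find_sessions` is hypothetical, and the hrefs and date slugs are invented for illustration; the real index markup may differ.

```python
import re

# Same pattern as in the heredoc: capture the href and the session number.
URL_RE = re.compile(
    r'href="(/aktuelles/mitteilungen/abstimmungsergebnisse-der-(\d+)-landtagssitzung-vom-[^"]+?/)"'
)


def find_sessions(index_html, base="https://www.landtag-saar.de"):
    """Return (protokoll_id, absolute_url) pairs, one per session number."""
    seen = set()
    sessions = []
    for m in URL_RE.finditer(index_html):
        href, sitzung = m.groups()
        if "wahlperiode-vom" in href:
            continue  # WP16 marker, as in the source script
        pid = f"SL17-{sitzung}"
        if pid in seen:
            continue  # the index may link the same session more than once
        seen.add(pid)
        sessions.append((pid, base + href))
    return sessions


# Invented sample links; the date slugs are placeholders.
SAMPLE_INDEX = """
<a href="/aktuelles/mitteilungen/abstimmungsergebnisse-der-46-landtagssitzung-vom-01-01-2026/">46</a>
<a href="/aktuelles/mitteilungen/abstimmungsergebnisse-der-46-landtagssitzung-vom-01-01-2026/">46 again</a>
<a href="/aktuelles/mitteilungen/abstimmungsergebnisse-der-45-landtagssitzung-vom-02-12-2025/">45</a>
<a href="/aktuelles/mitteilungen/some-other-news/">other</a>
"""
sessions = find_sessions(SAMPLE_INDEX)
for pid, url in sessions:
    print(pid, url)
```

Because the protocol ID is derived from the session number alone, two different slugs for the same session would collapse into one entry, which matches the `seen_pids` behaviour in the script.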
--- a/tests/test_protokoll_parsers_sl.py
+++ b/tests/test_protokoll_parsers_sl.py
@@ -0,0 +1,113 @@
+"""Tests for app/protokoll_parsers/sl.py: SL voting results parser (#161).
+
+SL is HTML-based (not PDF): a dedicated voting results page per
+session with the structured [SPD: dafür; CDU und AfD: dagegen] format.
+"""
+from __future__ import annotations
+
+import pytest
+
+from app.protokoll_parsers.sl import (
+    _normalize_fraktionen_sl,
+    _parse_vote_bracket,
+    _strip_html,
+    DS_RE_SL,
+    STATUS_RE,
+    VOTE_BRACKET_RE,
+    LI_BLOCK_RE,
+    ALLE_FRAKTIONEN_SL,
+)
+
+
+class TestNormalizeFraktionenSl:
+    def test_simple_spd(self):
+        assert _normalize_fraktionen_sl("SPD") == ["SPD"]
+
+    def test_spd_und_cdu(self):
+        assert _normalize_fraktionen_sl("SPD und CDU") == ["CDU", "SPD"]
+
+    def test_alle_drei(self):
+        assert _normalize_fraktionen_sl("SPD, CDU und AfD") == ["AfD", "CDU", "SPD"]
+
+    def test_empty(self):
+        assert _normalize_fraktionen_sl("") == []
+
+    def test_word_boundary(self):
+        # No match on 'SP' or other substrings
+        assert _normalize_fraktionen_sl("SP-Partei") == []
+
+
+class TestParseVoteBracket:
+    def test_klassisches_pattern(self):
+        votes = _parse_vote_bracket("SPD: dafür; CDU und AfD: dagegen")
+        assert set(votes["ja"]) == {"SPD"}
+        assert set(votes["nein"]) == {"CDU", "AfD"}
+        assert votes["enthaltung"] == []
+
+    def test_mit_enthaltung(self):
+        votes = _parse_vote_bracket("SPD: dafür; CDU: dagegen; AfD: Enthaltung")
+        assert set(votes["ja"]) == {"SPD"}
+        assert set(votes["nein"]) == {"CDU"}
+        assert set(votes["enthaltung"]) == {"AfD"}
+
+    def test_alle_dafuer(self):
+        votes = _parse_vote_bracket("SPD und CDU: dafür; AfD: Enthaltung")
+        assert set(votes["ja"]) == {"SPD", "CDU"}
+        assert votes["nein"] == []
+        assert set(votes["enthaltung"]) == {"AfD"}
+
+    def test_alle_dagegen(self):
+        votes = _parse_vote_bracket("AfD: dafür; SPD und CDU: dagegen")
+        assert set(votes["ja"]) == {"AfD"}
+        assert set(votes["nein"]) == {"SPD", "CDU"}
+
+
+class TestStripHtml:
+    def test_removes_tags(self):
+        assert _strip_html("<p>Hello <em>world</em></p>") == "Hello world"
+
+    def test_decodes_entities(self):
+        assert _strip_html("a &amp; b") == "a & b"
+
+
+class TestDrucksacheRegex:
+    def test_matches_drucksache(self):
+        m = DS_RE_SL.search("Drucksache 17/2076")
+        assert m and m.group(1) == "17/2076"
+
+    def test_matches_with_spaces(self):
+        m = DS_RE_SL.search("(Drucksache  17/2074)")
+        assert m and m.group(1) == "17/2074"
+
+
+class TestStatusRegex:
+    def test_matches_einstimmig_angenommen(self):
+        m = STATUS_RE.search("in Erster Lesung einstimmig angenommen")
+        assert m and m.group("ergebnis").lower() == "angenommen"
+
+    def test_matches_mit_stimmenmehrheit(self):
+        m = STATUS_RE.search("mit Stimmenmehrheit angenommen")
+        assert m and m.group("ergebnis").lower() == "angenommen"
+
+    def test_matches_abgelehnt(self):
+        m = STATUS_RE.search("in Erster Lesung abgelehnt")
+        assert m and m.group("ergebnis").lower() == "abgelehnt"
+
+
+class TestVoteBracketRegex:
+    def test_matches_full_bracket(self):
+        m = VOTE_BRACKET_RE.search("[SPD: dafür; CDU und AfD: dagegen]")
+        assert m and "SPD" in m.group("inner")
+
+
+class TestLiBlockRegex:
+    def test_extracts_li_content(self):
+        html = "<ul><li>foo</li><li>bar</li></ul>"
+        matches = list(LI_BLOCK_RE.finditer(html))
+        assert len(matches) == 2
+        assert matches[0].group(1) == "foo"
+
+
+class TestConstants:
+    def test_all_fraktionen_set(self):
+        assert ALLE_FRAKTIONEN_SL == ["SPD", "CDU", "AfD"]
--- a/tests/test_protokoll_parsers_stubs.py
+++ b/tests/test_protokoll_parsers_stubs.py
@@ -20,9 +20,9 @@ import pytest
 from app.protokoll_parsers import PROTOKOLL_PARSERS, supported_bundeslaender

 STUB_BL_CODES = [
-    # BUND/BE/HH/TH/HE/SH/HB removed: production parsers since 2026-04-28/29
+    # BUND/BE/HH/TH/HE/SH/HB/SL removed: production parsers since 2026-04-28/29
    "BB", "BW", "BY",
-    "LSA", "MV", "NI", "RP", "SL", "SN",
+    "LSA", "MV", "NI", "RP", "SN",
 ]


@@ -76,8 +76,8 @@ class TestRegistryDiscipline:

    def test_stubs_not_in_registry(self):
        registered = set(supported_bundeslaender())
-        # Currently in production: NRW + BUND + BE + HH + TH + HE + SH + HB
-        assert registered == {"NRW", "BUND", "BE", "HH", "TH", "HE", "SH", "HB"}, (
+        # Currently in production: NRW + BUND + BE + HH + TH + HE + SH + HB + SL
+        assert registered == {"NRW", "BUND", "BE", "HH", "TH", "HE", "SH", "HB", "SL"}, (
            "Unexpected registry entries. When new BL are implemented, "
            "update this test AND replace the stub with a real parser."
        )