Activate Baden-Württemberg via PARLISAdapter (#29, Phase 1)

PARLIS auf parlis.landtag-bw.de läuft technisch auf demselben eUI-Backend wie LSA-PADOKA und BE-PARDOK, hat aber drei wichtige Unterschiede, die eine eigene Klasse statt einer PortalaAdapter- Subklasse rechtfertigen: 1. Body-Schema: minimales lines mit l1/l2/l3/l4 (statt LSA/BE 2/3/4/10/11/20.x/90.x), serverrecordname=vorgang, format=suchergebnis-vorgang-full, sort=SORT01/D SORT02/D SORT03, keine parsed/json-Felder. Quelle: dokukratie/scrapers/portala.query.bw.json plus HAR-Verifikation gegen die Live-Instanz. 2. Async polling: die initiale SearchAndDisplay-Antwort liefert nur search_id mit status=running, KEINE report_id. Erst eine zweite SearchAndDisplay-Anfrage mit id=<search_id> (ohne search-Component) bekommt nach 1-3 Sekunden die report_id zurück. Reverse-engineered aus esearch-ui.main.js requestReportOK() Z. ~1268. 3. Hit-Format: report.tt.html liefert Records als JSON-in-HTML-Comments . Komplett anderes Format als LSA Perl-Dump oder BE HTML-Cards. Felder: - EWBV22: "Drucksache 17/10323" - EWBD05: direkter PDF-URL - WMV33: Schlagworte (joined by ;) - WMV30: Urheber-Kurzform - EWBV23: "Antrag <Urheber> <DD.MM.YYYY>" Smoke-Test (lokal): BW q='': 8 hits in 17s, jüngste WP17-Anträge mit Datum + Fraktion BW q='Schule': 8 hits, alle wirklich Schul-bezogen (Hochschule, Grundschule, Schwimmunterricht, Lehrerbedarf etc.) BW q='Klima': 8 hits, Klimaschutz/CO2/Energieberatung get_document(17/10323): roundtrip funktioniert bundeslaender.py: aktiv=True für BW; Anmerkung erweitert mit PARLISAdapter-Verweis und drei-Unterschiede-Hinweis für künftige Wartung. Test test_four_active_bundeslaender umbenannt zu test_active_bundeslaender_include_phase_1_set, prüft jetzt nur Subset-Bedingung statt exakter Count, damit Phase-1/2-Erweiterungen keine Test-Updates brauchen. Phase 1 (1/3) aus Roadmap-Issue #49. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 23:38:04 +02:00 · 2026-04-08 23:38:04 +02:00 · db5a875d7c
commit db5a875d7c
parent f98e64c734
3 changed files with 425 additions and 20 deletions
--- a/app/bundeslaender.py
+++ b/app/bundeslaender.py
@ -80,10 +80,18 @@ BUNDESLAENDER: dict[str, Bundesland] = {
        doku_base_url="https://parlis.landtag-bw.de",
        drucksache_format="17/12345",
        dokukratie_scraper="bw",
        aktiv=True,
        anmerkung=(
-            "Wahl zum 18. Landtag fand am 08.03.2026 statt; Koalitionsverhandlungen "
+            "PARLIS auf parlis.landtag-bw.de läuft auf demselben "
-            "GRÜNE+CDU laufen, Kabinett Kretschmann III geschäftsführend. Nach "
+            "eUI-Backend wie LSA-PADOKA und BE-PARDOK, aber mit drei "
-            "Konstituierung des 18. LT ca. Mai 2026 müssen WP und Wahltermin aktualisiert werden."
+            "Unterschieden: minimales lines-Schema (l1/l2/l3/l4), "
            "asynchrones Polling (initial → search_id → poll → "
            "report_id) und Hit-Records als JSON-in-HTML-Comments. "
            "Eigene Adapter-Klasse PARLISAdapter (#29). Wahl zum 18. "
            "Landtag fand am 08.03.2026 statt; Koalitionsverhandlungen "
            "GRÜNE+CDU laufen, Kabinett Kretschmann III geschäftsführend. "
            "Nach Konstituierung des 18. LT ca. Mai 2026 müssen WP und "
            "Wahltermin aktualisiert werden."
        ),
    ),
    "BY": Bundesland(
--- a/app/parlamente.py
+++ b/app/parlamente.py
@ -1256,21 +1256,407 @@ class BayernAdapter(ParlamentAdapter):
        return None
-class BWAdapter(ParlamentAdapter):
+class PARLISAdapter(ParlamentAdapter):
-    """Adapter for Baden-Württemberg Landtag."""
+    """Adapter for Baden-Württemberg's PARLIS — eUI/portala-Variante mit
    polling und JSON-in-HTML-Comment-Records.
-    bundesland = "BW"
+    PARLIS auf ``parlis.landtag-bw.de`` läuft technisch auf demselben
-    name = "Landtag Baden-Württemberg"
+    eUI-Backend wie LSA-PADOKA und BE-PARDOK, aber mit drei wichtigen
-    base_url = "https://www.landtag-bw.de"
+    Unterschieden, die eine eigene Klasse statt einer PortalaAdapter-
    Subklasse rechtfertigen:
    1. **Body-Schema:** Statt der portala/LSA-typischen ``search.lines``
       mit ``2/3/4/10/11/20.x/90.x``-Slots nutzt PARLIS ein viel kürzeres
       ``l1/l2/l3/l4`` Schema (siehe ``dokukratie/scrapers/portala.query.bw.json``).
       ``serverrecordname`` ist ``"vorgang"`` statt ``"sr_generic1"``,
       ``format`` ist ``"suchergebnis-vorgang-full"``, ``sort`` ist
       ``"SORT01/D SORT02/D SORT03"``. Es gibt kein ``parsed`` und kein
       ``json``-Tree — der Server akzeptiert das minimale Schema direkt.
    2. **Async polling:** Im Gegensatz zu LSA/BE liefert die initiale
       ``Fulltext/Search``-Antwort nur eine ``search_id`` mit
       ``status: "running"``, KEINE ``report_id``. Erst eine zweite
       ``SearchAndDisplay``-Anfrage mit ``id: <search_id>`` (und ohne
       ``search``-Component) bekommt die fertige ``report_id`` zurück.
       In meinen Live-Tests reichte ein einziger 2-Sekunden-Sleep
       zwischen den Calls.
    3. **Hit-Format:** Die ``report.tt.html``-Antwort liefert keine
       Perl-Dump-Blöcke (LSA) und keine Bootstrap-Card-Divs (BE),
       sondern **JSON-Records in HTML-Kommentaren**::
           <!--{"WMV33":[{"main":"Schlagworte"}],
                "EWBV22":[{"main":"Drucksache 17/10323"}],
                "EWBD05":[{"main":"https://.../17_10323.pdf"}],
                "EWBV23":[{"main":"Antrag Felix Herkens (GRÜNE) u. a. 16.03.2026"}],
                ...}-->
       Der Parser zieht die Comments raw raus und mappt die WMV/EWBV-
       Felder auf das ``Drucksache``-Dataclass.
    Reverse-Engineering-Quelle: ``dokukratie/scrapers/portala.query.bw.json``
    + Live-HAR gegen ``parlis.landtag-bw.de`` (Issue #29).
    """
    # Reverse-engineered field map for the JSON records that come embedded
    # in HTML comments inside report.tt.html responses.
    #
    # Records look like ``<!--{"WMV33":[...],...}-->`` and may contain
    # nested ``<i>...</i>`` highlight tags inside the JSON values.
    # Non-greedy match against the literal closing ``}-->`` because that
    # delimiter does not appear inside the JSON payload itself.
    _RE_RECORD = re.compile(r"<!--(\{.*?\})-->", re.DOTALL)
    _RE_DRUCKSACHE = re.compile(r"Drucksache\s+(\d+/\d+)")
    _RE_DATUM = re.compile(r"(\d{1,2}\.\d{1,2}\.\d{4})")
    def __init__(
        self,
        *,
        bundesland: str,
        name: str,
        base_url: str,
        wahlperiode: int,
        prefix: str = "/parlis",
        document_typ: str = "Antrag",
        date_window_days: int = 730,
        poll_attempts: int = 15,
        poll_interval_seconds: float = 2.0,
    ) -> None:
        """Configure a PARLIS adapter for one specific parliament instance.
        Args:
            bundesland: state code, e.g. ``"BW"``.
            name: human-readable label.
            base_url: ``https://parlis.landtag-bw.de`` (no trailing slash).
            wahlperiode: legislative period — feeds into ``lines.l1``.
            prefix: app prefix where PARLIS lives. ``/parlis`` for BW.
            document_typ: feeds into ``lines.l4``. The server interprets
                this as a German document type label like ``"Antrag"``.
            date_window_days: look-back window for the search range,
                quick-win against title-only filtering — same approach
                as the PortalaAdapter for LSA/BE.
            poll_attempts: how many times to poll for ``report_id`` before
                giving up. ~15 × 2s = 30s upper bound.
            poll_interval_seconds: sleep between poll attempts.
        """
        self.bundesland = bundesland
        self.name = name
        self.base_url = base_url.rstrip("/")
        self.prefix = "/" + prefix.strip("/")
        self.wahlperiode = wahlperiode
        self.document_typ = document_typ
        self.date_window_days = date_window_days
        self.poll_attempts = poll_attempts
        self.poll_interval_seconds = poll_interval_seconds
    @staticmethod
    def _datum_de_to_iso(datum_de: str) -> str:
        """DD.MM.YYYY → YYYY-MM-DD; '' for empty input."""
        if not datum_de:
            return ""
        try:
            d, m, y = datum_de.split(".")
            return f"{y}-{m.zfill(2)}-{d.zfill(2)}"
        except ValueError:
            return ""
    @staticmethod
    def _normalize_fraktion(text: str) -> list[str]:
        """Map a free-text Urheber line to canonical fraction codes.
        PARLIS packs the originator into ``EWBV23`` like
        ``"Antrag Felix Herkens (GRÜNE), Saskia Frank (GRÜNE), ... 16.03.2026"``
        — multiple MdLs with their party in parentheses, comma-separated.
        Same logic as ``ParLDokAdapter._normalize_fraktion`` (#46 fixed
        the MINISTER/MINISTERIUM regex there too).
        """
        if not text:
            return []
        u = text.upper()
        out: list[str] = []
        if re.search(r"\bBÜNDNIS\s*90\b", u) or re.search(r"\bGR(?:Ü|UE)NE\b", u):
            out.append("GRÜNE")
        if re.search(r"\bCDU\b", u):
            out.append("CDU")
        if re.search(r"\bSPD\b", u):
            out.append("SPD")
        if re.search(r"\bF\.?\s*D\.?\s*P\.?\b", u):
            out.append("FDP")
        if re.search(r"\bAFD\b", u):
            out.append("AfD")
        if re.search(r"\bLINKE\b", u):
            out.append("LINKE")
        if re.search(r"\bBSW\b", u):
            out.append("BSW")
        if re.search(r"LANDESREGIERUNG|\bMINISTER|STAATSKANZLEI|MINISTERPRÄSIDENT", u):
            out.append("Landesregierung")
        return out
    def _build_initial_body(self, start_date: str, end_date: str) -> dict:
        """Build the first ``SearchAndDisplay`` body with the search component.
        The schema follows ``dokukratie/scrapers/portala.query.bw.json``
        verbatim — only the placeholder values are substituted.
        """
        return {
            "action": "SearchAndDisplay",
            "report": {
                "rhl": "main",
                "rhlmode": "add",
                "format": "suchergebnis-vorgang-full",
                "mime": "html",
                "sort": "SORT01/D SORT02/D SORT03",
            },
            "search": {
                "lines": {
                    "l1": str(self.wahlperiode),
                    "l2": start_date,
                    "l3": end_date,
                    "l4": self.document_typ,
                },
                "serverrecordname": "vorgang",
            },
            "sources": ["Star"],
        }
    def _build_poll_body(self, search_id: str) -> dict:
        """Build the polling body — same action, but with the search_id
        instead of a fresh search component."""
        return {
            "action": "SearchAndDisplay",
            "report": {
                "rhl": "main",
                "rhlmode": "add",
                "format": "suchergebnis-vorgang-full",
                "mime": "html",
                "sort": "SORT01/D SORT02/D SORT03",
            },
            "id": search_id,
            "sources": ["Star"],
        }
    def _hit_record_to_drucksache(self, record: dict) -> Optional[Drucksache]:
        """Map a single JSON-in-comment record to a ``Drucksache``.
        PARLIS-record schema (reverse-engineered, all values are arrays
        of ``{"main": ...}`` dicts):
        - ``EWBV22``: "Drucksache 17/10323"
        - ``EWBD05``: direct PDF URL
        - ``EWBV23``: "Antrag <Urheber> <DD.MM.YYYY>" — single combined line
        - ``WMV30``: short Urheber summary ("Felix Herkens (GRÜNE) u. a.")
        - ``WMV33``: subject keywords (Schlagworte)
        - ``EWBD01``: "Drucksache <X/Y> <DD.MM.YYYY>"
        """
        def first(field: str) -> str:
            block = record.get(field)
            if isinstance(block, list) and block:
                return (block[0].get("main") or "").strip()
            return ""
        ds_text = first("EWBV22") or first("EWBD01")
        m_ds = self._RE_DRUCKSACHE.search(ds_text)
        if not m_ds:
            return None
        drucksache = m_ds.group(1)
        # The "title" we want is the Schlagworte/topic, not the
        # Drucksachen-Header. PARLIS keeps the human-readable subject
        # in WMV33 (Schlagworte joined by semicolons) — that's the
        # closest equivalent to "title" the LSA/BE adapters expose.
        # Fallback to the EWBV23 line if WMV33 is empty.
        schlagworte = first("WMV33")
        # Strip embedded <i>...</i> highlight tags
        schlagworte_clean = re.sub(r"</?i>", "", schlagworte).strip()
        title = schlagworte_clean or first("EWBV23") or f"Drucksache {drucksache}"
        # Date + Urheber out of EWBV23 ("Antrag <Urheber> <DD.MM.YYYY>")
        ewbv23 = first("EWBV23")
        m_dat = self._RE_DATUM.search(ewbv23)
        datum_iso = self._datum_de_to_iso(m_dat.group(1) if m_dat else "")
        urheber_short = first("WMV30")
        fraktionen = self._normalize_fraktion(urheber_short or ewbv23)
        pdf_url = first("EWBD05")
        return Drucksache(
            drucksache=drucksache,
            title=title,
            fraktionen=fraktionen,
            datum=datum_iso,
            link=pdf_url,
            bundesland=self.bundesland,
            typ=self.document_typ,
        )
    async def _initial_search_and_poll(
        self, client: httpx.AsyncClient, start_date: str, end_date: str,
    ) -> Optional[str]:
        """Run the initial search + poll until ``report_id`` arrives."""
        import asyncio
        browse_html = f"{self.base_url}{self.prefix}/browse.tt.html"
        browse_json = f"{self.base_url}{self.prefix}/browse.tt.json"
        # Step 1: warm cookies
        await client.get(browse_html)
        # Step 2: initial search
        try:
            resp = await client.post(
                browse_json,
                json=self._build_initial_body(start_date, end_date),
                headers={"Referer": browse_html},
            )
        except Exception:
            logger.exception("%s initial search request error", self.bundesland)
            return None
        if resp.status_code != 200:
            logger.error("%s initial search HTTP %s", self.bundesland, resp.status_code)
            return None
        data = resp.json()
        if data.get("report_id"):
            return data["report_id"]
        search_id = data.get("search_id")
        if not search_id:
            logger.error("%s no search_id in initial response: %s", self.bundesland, data)
            return None
        # Step 3: poll until report_id appears or we run out of attempts
        for _ in range(self.poll_attempts):
            await asyncio.sleep(self.poll_interval_seconds)
            try:
                resp = await client.post(
                    browse_json,
                    json=self._build_poll_body(search_id),
                    headers={"Referer": browse_html},
                )
            except Exception:
                logger.exception("%s poll request error", self.bundesland)
                return None
            if resp.status_code != 200:
                logger.error("%s poll HTTP %s", self.bundesland, resp.status_code)
                return None
            data = resp.json()
            if data.get("report_id"):
                return data["report_id"]
            star = data.get("sources", {}).get("Star", {})
            if star.get("status") == "stopped" and not data.get("report_id"):
                # Search finished but no report — empty result
                return None
        logger.warning("%s gave up polling after %d attempts", self.bundesland, self.poll_attempts)
        return None
    def _parse_report_html(self, html: str) -> list[Drucksache]:
        """Extract Drucksachen from a report.tt.html response.
        Records are JSON objects embedded in HTML comments. We pull each
        comment block via regex, parse it as JSON, and map the WMV/EWBV
        fields to a Drucksache.
        """
        results: list[Drucksache] = []
        for m in self._RE_RECORD.finditer(html):
            json_text = m.group(1)
            try:
                record = json.loads(json_text)
            except json.JSONDecodeError:
                continue
            doc = self._hit_record_to_drucksache(record)
            if doc:
                results.append(doc)
        return results
    async def search(self, query: str, limit: int = 20) -> list[Drucksache]:
-        # TODO: Implement BW search
+        """Search recent BW Anträge with optional client-side title filter.
        Server-side full-text is not used (#18 — einheitliches
        Verhalten ohne Volltext bis alle Adapter es können). The
        client filter looks at title (Schlagworte) + Urheber.
        """
        from datetime import date, timedelta
        end = date.today()
        start = end - timedelta(days=self.date_window_days)
        async with httpx.AsyncClient(
            timeout=60,
            follow_redirects=True,
            headers={"User-Agent": "Mozilla/5.0 GWOE-Antragspruefer"},
        ) as client:
            try:
                report_id = await self._initial_search_and_poll(
                    client, start.isoformat(), end.isoformat(),
                )
                if not report_id:
                    return []
                # Pull a generous chunk so the client-side filter has
                # enough material to work with.
                chunksize = max(limit * 10, 200) if query else max(limit * 2, 50)
                report_url = (
                    f"{self.base_url}{self.prefix}/report.tt.html"
                    f"?report_id={report_id}&start=0&chunksize={chunksize}"
                )
                resp = await client.get(
                    report_url,
                    headers={"Referer": f"{self.base_url}{self.prefix}/browse.tt.html"},
                )
                if resp.status_code != 200:
                    logger.error("%s report HTTP %s", self.bundesland, resp.status_code)
                    return []
                results = self._parse_report_html(resp.text)
            except Exception:
                logger.exception("%s search error", self.bundesland)
                return []
        # Client-side filter
        if query:
            terms = [t.lower() for t in query.split() if t]
            results = [
                d for d in results
                if all(t in f"{d.title} {' '.join(d.fraktionen)}".lower() for t in terms)
            ]
        return results[:limit]
    async def get_document(self, drucksache: str) -> Optional[Drucksache]:
        """Look up a single Drucksache by ID via a broad browse."""
        results = await self.search(query="", limit=200)
        for doc in results:
            if doc.drucksache == drucksache:
                return doc
        return None
    async def download_text(self, drucksache: str) -> Optional[str]:
        """Download the PDF for a Drucksache and extract its text."""
        import fitz  # PyMuPDF
        doc = await self.get_document(drucksache)
        if not doc or not doc.link:
            return None
        async with httpx.AsyncClient(
            timeout=60,
            follow_redirects=True,
            headers={"User-Agent": "Mozilla/5.0 GWOE-Antragspruefer"},
        ) as client:
            try:
                resp = await client.get(doc.link)
                if resp.status_code != 200:
                    logger.error(
                        "%s PDF HTTP %s for %s (%s)",
                        self.bundesland, resp.status_code, drucksache, doc.link,
                    )
                    return None
                pdf = fitz.open(stream=resp.content, filetype="pdf")
                text = ""
                for page in pdf:
                    text += page.get_text()
                pdf.close()
                return text
            except Exception:
                logger.exception("%s PDF download error for %s", self.bundesland, drucksache)
                return None
@ -1315,7 +1701,14 @@ ADAPTERS = {
        document_typ="Antrag",
    ),
    "BY": BayernAdapter(),
-    "BW": BWAdapter(),
+    "BW": PARLISAdapter(
        bundesland="BW",
        name="Landtag von Baden-Württemberg (PARLIS)",
        base_url="https://parlis.landtag-bw.de",
        wahlperiode=17,
        prefix="/parlis",
        document_typ="Antrag",
    ),
 }
--- a/tests/test_bundeslaender.py
+++ b/tests/test_bundeslaender.py
@ -24,10 +24,14 @@ class TestRegistryStructure:
 class TestActiveBundeslaender:
-    def test_four_active_bundeslaender(self):
+    def test_active_bundeslaender_include_phase_1_set(self):
-        active = aktive_bundeslaender()
+        """At least the original four (NRW, LSA, MV, BE) plus any
-        codes = {bl.code for bl in active}
+        Phase-1 additions (BW after #29) must be active. The test
-        assert codes == {"NRW", "LSA", "MV", "BE"}
+        avoids hardcoding the exact count so adding a new active
        Bundesland in a follow-up doesn't break this case."""
        active_codes = {bl.code for bl in aktive_bundeslaender()}
        original = {"NRW", "LSA", "MV", "BE"}
        assert original <= active_codes
    def test_alle_bundeslaender_returns_all_sixteen(self):
        assert len(alle_bundeslaender()) == 16