#19 SaarlandAdapter: Umbraco JSON API with iframe unwrap

Reverse-engineered from a HAR capture (user browser, /suche?searchValue=Schule):

- Endpoint: POST /umbraco/aawSearchSurfaceController/SearchSurface/GetSearchResults/
- Content-Type: application/x-www-form-urlencoded; charset=UTF-8 with raw JSON in the body (Kendo convention of $.ajax without an explicit contentType)
- The body MUST carry Sections={} and Sort={} as empty dicts. As soon as Sections.Print/etc. are set, the server responds with HTTP 500 (hours lost in that dead end until the HAR revealed the minimal body)
- Body schema: {Filter:{Periods:[17]}, Pageination:{Skip,Take}, Sections:{}, Sort:{}, OnlyTitle:false, Value:<query>, CurrentSearchTab:0}

Response mapping (FilteredResult[*]):

- DocumentNumber → drucksache (e.g. "17/11")
- Title → title
- DocumentType → typ; filtered client-side to "Antrag" (the Print section mixes Anfragen, Anträge and Gesetzentwürfe; ~30-50% are Anträge)
- Publisher (collective Anträge: "CDU"/"SPD") + DocumentAuthor (individual MdL: "Name, Vorname (CDU);…") → fraktionen via parteien.extract_fraktionen
- PublicDate (ISO with T suffix) → datum (truncated to 10 characters)
- FilePath: ``/file.ashx?FileId=…&FileName=…`` is an HTML iframe wrapper (455 bytes), not the PDF! The real binary endpoint is ``/Downloadfile.ashx`` (capital D!) with the same query parameters. On the first smoke test the wrapper failed with "no objects found"; the iframe hint in the HTML gave the trick away.

Drucksachen lookup uses ``Value=<drucksache>``: the server matches the number in the full text and reliably returns it as the first hit. There is no dedicated GetById endpoint.
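
The two quirks above (the minimal request body and the iframe unwrap) can be sketched as standalone helpers. This is only an illustration under assumptions: the function names ``build_search_body`` and ``unwrap_file_path`` are hypothetical and do not exist in the adapter, which implements the same logic in ``_build_body`` and ``_doc_to_drucksache``.

```python
import json


def build_search_body(query: str, wahlperiode: int = 17,
                      skip: int = 0, take: int = 50) -> str:
    """Serialize the minimal body (hypothetical helper name)."""
    return json.dumps({
        "Filter": {"Periods": [wahlperiode]},
        # "Pageination" is spelled as captured in the HAR.
        "Pageination": {"Skip": skip, "Take": take},
        # Both must be present and empty; populated Sections yield HTTP 500.
        "Sections": {},
        "Sort": {},
        "OnlyTitle": False,
        "Value": query,
        "CurrentSearchTab": 0,
    })


def unwrap_file_path(file_path: str) -> str:
    """Swap the iframe wrapper path for the binary endpoint (hypothetical helper name)."""
    if file_path.startswith("/file.ashx"):
        # Only the path prefix changes; the query string is kept as-is.
        return file_path.replace("/file.ashx", "/Downloadfile.ashx", 1)
    return file_path
```

The body is returned as a string so it can be sent unchanged as the raw request content under the form-urlencoded Content-Type.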

Smoke test against prod (in the container):

- search("Schule", limit=5) → 2 Anträge in WP17 (140 Print hits in total, the Antrag filter keeps 2/140; the rest are Anfragen/Gesetzentwürfe):
    17/11 [CDU] "Schule als Lern- und Bildungsort weiter stärken …"
    17/419 [AfD] "Eine gute Bildungspolitik als wesentlicher Bestandteil …"
- get_document("17/11") → match
- download_text("17/11") → 3520 chars of real Antrag full text (header, Fraktion, resolution text)

Tests: 185/185 green (no regression).

UI activation follows separately in #31 (blocked on this commit).

Refs: #19, #49 (Roadmap Phase 3)
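
The client-side Antrag filter exercised by the smoke test can be sketched roughly as follows. The function name ``filter_antraege`` is hypothetical, and the sample items are made up for illustration: only the numbers 17/11 and 17/419 appear in the smoke test, while 17/900 is invented.

```python
def filter_antraege(items: list[dict], limit: int) -> list[dict]:
    """Keep the first `limit` unique items whose DocumentType is "Antrag"."""
    results: list[dict] = []
    seen: set[str] = set()
    for item in items:
        # Case-insensitive match; a missing DocumentType counts as no match.
        if (item.get("DocumentType") or "").lower() != "antrag":
            continue
        nummer = item.get("DocumentNumber")
        # Skip items without a number and duplicates of earlier hits.
        if not nummer or nummer in seen:
            continue
        seen.add(nummer)
        results.append(item)
        if len(results) >= limit:
            break
    return results


# Hypothetical sample data, not a real server response.
sample = [
    {"DocumentNumber": "17/11", "DocumentType": "Antrag"},
    {"DocumentNumber": "17/900", "DocumentType": "Anfrage"},
    {"DocumentNumber": "17/11", "DocumentType": "Antrag"},
    {"DocumentNumber": "17/419", "DocumentType": "Antrag"},
]
hits = filter_antraege(sample, limit=5)
print([hit["DocumentNumber"] for hit in hits])
```

Because the filter runs after the fetch, fewer than ``limit`` results can come back even when the server reports many hits, as in the 2/140 case above.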
2026-04-10 00:46:02 +02:00 · 6dfcd69979
commit 6dfcd69979
parent 6ced7ae018
1 changed file with 222 additions and 0 deletions
--- a/app/parlamente.py
+++ b/app/parlamente.py
@@ -2701,6 +2701,227 @@ class BundestagAdapter(ParlamentAdapter):
                return None


+class SaarlandAdapter(ParlamentAdapter):
+    """Adapter for the Landtag des Saarlandes via the Umbraco JSON API (#19).
+
+    The backend is an Umbraco/.NET SurfaceController layer behind
+    ``www.landtag-saar.de``. The search page ``/suche?searchValue=…`` loads
+    its results via an XHR POST against
+    ``/umbraco/aawSearchSurfaceController/SearchSurface/GetSearchResults/``.
+
+    Schema reverse-engineered from a HAR capture (user browser, for a
+    ``Schule`` search). Important:
+
+    - Content-Type is ``application/x-www-form-urlencoded; charset=UTF-8``,
+      but the body is nevertheless **raw JSON** (Kendo convention of
+      ``$.ajax`` without an explicit ``contentType``). An
+      ``application/json`` header works too, but only with the
+      minimal body form below: as soon as ``Sections.{Print,Operations,…}``
+      are set, the server responds with HTTP 500. With ``Sections:{}``
+      everything works and the server returns hits across all sections.
+    - Body schema:
+
+      ```json
+      {
+        "Filter": {"Periods": [17]},
+        "Pageination": {"Skip": 0, "Take": 10},
+        "Sections": {},
+        "Sort": {},
+        "OnlyTitle": false,
+        "Value": "Schule",
+        "CurrentSearchTab": 0
+      }
+      ```
+
+    - Response: ``FilteredResult[]`` where each item carries ``DocumentNumber``
+      (``"17/11"``), ``Legislative`` (Wahlperiode as int), ``DocumentType``
+      (``"Antrag"``/``"Anfrage"``/``"Gesetzentwurf"``/…), ``Title``,
+      ``PublicDate``, ``DocumentAuthor`` (list of ``Name (Partei);…``),
+      ``Publisher`` (Fraktion for collective Anträge), ``FilePath``
+      (relative, ``/file.ashx?FileId=…&FileName=…``).
+
+    The filter on ``DocumentType=="Antrag"`` runs client-side because the
+    server's Sections structure lacks that filter granularity (Print
+    mixes Anfragen, Anträge and Gesetzentwürfe).
+
+    Drucksachen lookup: ``Value="17/11"`` matches the Drucksachen number
+    directly in first position; a dedicated ``GetById`` endpoint
+    does not exist.
+    """
+
+    bundesland = "SL"
+    name = "Landtag des Saarlandes"
+    base_url = "https://www.landtag-saar.de"
+
+    def __init__(self, *, wahlperiode: int = 17):
+        self.wahlperiode = wahlperiode
+
+    def _make_client(self) -> httpx.AsyncClient:
+        return httpx.AsyncClient(
+            timeout=30,
+            follow_redirects=True,
+            headers={
+                "User-Agent": "Mozilla/5.0 GWOE-Antragspruefer",
+                "Accept": "application/json, text/javascript, */*; q=0.01",
+                "X-Requested-With": "XMLHttpRequest",
+                "Origin": self.base_url,
+                "Referer": f"{self.base_url}/suche?searchValue=&ActiveTab=0",
+            },
+        )
+
+    def _build_body(self, query: str, *, skip: int = 0, take: int = 50) -> str:
+        """Build the minimal body that the server accepts.
+
+        Note: ``Sections={}`` and ``Sort={}`` are REQUIRED as empty
+        objects (do not omit them, do not fill them in: populated Sections
+        trigger HTTP 500).
+        """
+        return json.dumps({
+            "Filter": {"Periods": [self.wahlperiode]},
+            "Pageination": {"Skip": skip, "Take": take},
+            "Sections": {},
+            "Sort": {},
+            "OnlyTitle": False,
+            "Value": query or "",
+            "CurrentSearchTab": 0,
+        })
+
+    @staticmethod
+    def _doc_to_drucksache(item: dict) -> Optional[Drucksache]:
+        from .parteien import extract_fraktionen
+
+        nummer = item.get("DocumentNumber")
+        if not nummer:
+            return None
+
+        # Fraktionen from Publisher (collective Anträge: "CDU", "SPD") or
+        # DocumentAuthor (individual MdL: "Schmitt-Lang, Jutta (CDU)").
+        # Both are normalized via extract_fraktionen.
+        publisher = item.get("Publisher") or ""
+        author = item.get("DocumentAuthor") or ""
+        fraktionen = extract_fraktionen(
+            f"{publisher} {author}".strip(), bundesland="SL",
+        )
+
+        # PublicDate has the format ``2022-05-12T00:00:00``; keep only the ISO date.
+        public_date = (item.get("PublicDate") or "")[:10]
+
+        # ``FilePath`` is ``/file.ashx?FileId=…&FileName=…``, but that endpoint
+        # returns HTML with an iframe wrapper, not the PDF itself.
+        # The real binary endpoint is ``/Downloadfile.ashx`` (capital D!)
+        # with the same query parameters. There the server returns
+        # ``Content-Type: application/pdf``.
+        file_path = item.get("FilePath") or ""
+        if file_path.startswith("/file.ashx"):
+            file_path = file_path.replace("/file.ashx", "/Downloadfile.ashx", 1)
+        link = (
+            f"https://www.landtag-saar.de{file_path}"
+            if file_path.startswith("/") else file_path
+        )
+
+        return Drucksache(
+            drucksache=nummer,
+            title=item.get("Title", ""),
+            fraktionen=fraktionen,
+            datum=public_date,
+            link=link,
+            bundesland="SL",
+            typ=item.get("DocumentType", ""),
+        )
+
+    async def _post_search(
+        self, client: httpx.AsyncClient, query: str, *, skip: int = 0, take: int = 50,
+    ) -> list[dict]:
+        url = (
+            f"{self.base_url}/umbraco/aawSearchSurfaceController/"
+            "SearchSurface/GetSearchResults/"
+        )
+        body = self._build_body(query, skip=skip, take=take)
+        try:
+            resp = await client.post(
+                url,
+                content=body,
+                headers={
+                    "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
+                },
+            )
+            if resp.status_code != 200:
+                logger.error("SL HTTP %s: %s", resp.status_code, resp.text[:200])
+                return []
+            data = resp.json()
+            return data.get("FilteredResult", []) or []
+        except Exception:
+            logger.exception("SL search request error")
+            return []
+
+    async def search(self, query: str, limit: int = 20) -> list[Drucksache]:
+        """Full-text search over the configured Wahlperiode, filtered to Anträge.
+
+        Fetches max(5*limit, 30) hits in one page, filters client-side on
+        ``DocumentType=="Antrag"`` (the Print section also contains Anfragen
+        and Gesetzentwürfe), and truncates to ``limit``. The server returns
+        results sorted by relevance; for the UI, relevance to a
+        query is usually more valuable than date descending.
+        """
+        async with self._make_client() as client:
+            # Generous Take, because only an estimated ~30-50% of the hits are Anträge
+            take = max(limit * 5, 30)
+            items = await self._post_search(client, query, skip=0, take=take)
+
+        results: list[Drucksache] = []
+        seen: set[str] = set()
+        for item in items:
+            if (item.get("DocumentType") or "").lower() != "antrag":
+                continue
+            doc = self._doc_to_drucksache(item)
+            if doc is None or doc.drucksache in seen:
+                continue
+            seen.add(doc.drucksache)
+            results.append(doc)
+            if len(results) >= limit:
+                break
+        return results
+
+    async def get_document(self, drucksache: str) -> Optional[Drucksache]:
+        """Direct lookup via ``Value=<drucksache>``: the server search
+        matches the Drucksachen number in the document itself and returns it
+        reliably as the first hit."""
+        async with self._make_client() as client:
+            items = await self._post_search(client, drucksache, take=20)
+
+        for item in items:
+            if item.get("DocumentNumber") == drucksache:
+                return self._doc_to_drucksache(item)
+        return None
+
+    async def download_text(self, drucksache: str) -> Optional[str]:
+        """Fetch the Antrag PDF via ``/Downloadfile.ashx`` and extract the full text."""
+        import fitz
+
+        doc = await self.get_document(drucksache)
+        if doc is None or not doc.link:
+            return None
+
+        async with httpx.AsyncClient(
+            timeout=60, follow_redirects=True,
+            headers={"User-Agent": "Mozilla/5.0 GWOE-Antragspruefer"},
+        ) as client:
+            try:
+                resp = await client.get(doc.link)
+                if resp.status_code != 200:
+                    logger.error("SL PDF HTTP %s for %s", resp.status_code, drucksache)
+                    return None
+                pdf = fitz.open(stream=resp.content, filetype="pdf")
+                text = ""
+                for page in pdf:
+                    text += page.get_text()
+                pdf.close()
+                return text
+            except Exception:
+                logger.exception("SL download error for %s", drucksache)
+                return None
+
+
 # Registry of adapters
 ADAPTERS = {
    "BUND": BundestagAdapter(),
@@ -2797,6 +3018,7 @@ ADAPTERS = {
        document_type="Antrag",
    ),
    "BY": BayernAdapter(),
+    "SL": SaarlandAdapter(),
    "BW": PARLISAdapter(
        bundesland="BW",
        name="Landtag von Baden-Württemberg (PARLIS)",