Add PortalaAdapter for PADOKA / Sachsen-Anhalt (#2)

Adds a clean-room PortalaAdapter that talks to the eUI/portala framework
behind PADOKA (Landtag Sachsen-Anhalt). The same engine powers Berlin's
PARDOK, so this adapter will also serve issue #3 once activated for BE.

Reverse-engineering notes
- The "PADOKA = StarWeb" assumption from issue #1 / dokukratie's st.yml
  is outdated. The Sachsen-Anhalt portal was migrated to the same
  eUI/portala SPA framework Berlin uses. The legacy starweb URL returns
  503; the new entry point is /portal/browse.tt.html.
- Search workflow is two-stage (minimal sketch after this list):
  1. POST /portal/browse.tt.json with a JSON action body containing an
     Elasticsearch-style query tree under search.json. Returns a
     report_id plus hit count.
  2. POST /portal/report.tt.html with {report_id, start, chunksize}
     returns the HTML hit list. Each record carries a Perl Data::Dumper
     block in a <pre> tag with the canonical metadata.
- The query schema (sources, search.lines, search.json tree, report
  block) is taken from dokukratie/scrapers/portala.query.json (GPL-3.0)
  — only structure/selectors are reused, no Python code is ported.
- DB id is "lsa.lissh"; the server validates this and rejects unknown
  interfaces with an explicit errormsg.
- PDFs live under /files/drs/wp{N}/drs/d{nr}{xxx}.pdf and are served
  directly without any session cookie.
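
A minimal sketch of the round-trip (assumes only the endpoints and
fields named above; the full query body is built in _build_search_body
in the diff below and is stubbed here as the body parameter):

    import httpx

    BASE = "https://padoka.landtag.sachsen-anhalt.de"

    async def fetch_hit_list(body: dict, chunksize: int = 100) -> str:
        async with httpx.AsyncClient(follow_redirects=True, timeout=30) as client:
            # warm up session cookies via the SPA entry point
            await client.get(f"{BASE}/portal/browse.tt.html")
            # stage 1: submit the search action, get a report handle
            resp = await client.post(f"{BASE}/portal/browse.tt.json", json=body)
            report_id = resp.json()["report_id"]
            # stage 2: page through the HTML hit list
            report = await client.post(
                f"{BASE}/portal/report.tt.html",
                json={"report_id": report_id, "start": 0, "chunksize": chunksize},
            )
            return report.text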

What the adapter does
- search() builds a date-window query (last ~24 months) for "Antrag"
  document type and returns the most recent hits. The user's free-text
  query is applied as a client-side title/Urheber filter (no fulltext
  search server-side yet — see "Limitations" below).
- Hits are parsed from the Perl record dumps in the report HTML (see the
  parsing sketch after this list):
  - WEV06.main → title (Perl \x{xx} hex escapes decoded)
  - WEV32.5   → relative PDF path
  - WEV32.main → "Antrag <Urheber> <DD.MM.YYYY> Drucksache <b>X/YYYY</b>"
- Fraktion strings are normalised to canonical codes (CDU, SPD, GRÜNE,
  FDP, AfD, LINKE, Landesregierung).
- get_document() looks up a single Drucksache by re-running the search.
- download_text() fetches the PDF and extracts text via PyMuPDF.
- bundeslaender.py: LSA's doku_system corrected from "StarWeb" to
  "PARDOK", anmerkung updated with the migration story.

Limitations (deliberate, MVP)
- No server-side full-text search. The portala framework's sf index
  names for LSA full-text content are not yet known; tree mutations
  with sf=alAB return 0 hits. The client-side filter (sketched after
  this list) is "good enough" for the next ~24 months of Anträge
  (≈ a few hundred per WP).
- LSA is still aktiv=False in bundeslaender.py — the adapter is dormant
  in production until issue #2's wahlprogramm ingest and frontend
  activation land.
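
The client-side filter is a plain token-AND over title plus Urheber,
mirroring the check in _parse_hit_list_html below:

    def matches(query: str, title: str, urheber: str) -> bool:
        # every whitespace-separated query token must occur somewhere
        hay = f"{title} {urheber}".lower()
        return all(tok in hay for tok in query.lower().split())

    # matches("salzwedel ice", "ICE-Halt für Salzwedel ...", "DIE LINKE") -> True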

Verified live against padoka.landtag.sachsen-anhalt.de:
- search(query="", limit=5) returned 5 current Anträge from March 2026
  (LINKE + GRÜNE) with correct dates, Fraktionen, titles, and PDF URLs.
- download_text("8/6790") returned 5051 chars of real Antragstext
  ("ICE-Halt für Salzwedel dauerhaft erhalten").

Refs #2.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Dotty Dotter 2026-04-07 21:50:23 +02:00
parent ee0218b5af
commit c7242f8413
2 changed files with 331 additions and 6 deletions


@@ -298,14 +298,16 @@ BUNDESLAENDER: dict[str, Bundesland] = {
         naechste_wahl="2026-09-06",
         regierungsfraktionen=["CDU", "SPD", "FDP"],
         landtagsfraktionen=["CDU", "AfD", "LINKE", "SPD", "GRÜNE", "FDP"],
-        doku_system="StarWeb",
+        doku_system="PARDOK",
         doku_base_url="https://padoka.landtag.sachsen-anhalt.de",
         drucksache_format="8/1234",
         dokukratie_scraper="st",
         anmerkung=(
             "ISO-Code wäre ST; LSA ist im politischen Sprachgebrauch dominant. "
             "Sven Schulze (CDU) seit 28.01.2026 MP nach Rücktritt Haseloff. "
-            "PADOKA = Parlamentsdokumentationssystem auf StarWeb-6.0.01-Basis."
+            "PADOKA wurde von StarWeb auf das portala/eUI-Framework migriert "
+            "(gleiche Engine wie Berlin/PARDOK). dokukratie's st.yml ist veraltet. "
+            "Suche läuft via POST /portal/browse.tt.json + report.tt.html."
         ),
     ),
     "SH": Bundesland(


@@ -305,21 +305,343 @@ class NRWAdapter(ParlamentAdapter):
return None
class PortalaAdapter(ParlamentAdapter):
"""Adapter for portala/eUI-based parliament documentation systems.
Used by parliaments running the proprietary "esearch" / portala framework
(originally developed for STAR/StarFinder backends, now wrapped in a
Single-Page App with Template Toolkit on the server side):
- **LSA** (Sachsen-Anhalt) PADOKA at ``padoka.landtag.sachsen-anhalt.de``
- **BE** (Berlin) PARDOK at ``pardok.parlament-berlin.de`` (future)
The search workflow is two-stage:
1. ``POST /portal/browse.tt.json`` with a complex JSON ``action`` body
that contains an Elasticsearch-style query tree under
``search.json``. The server returns a ``report_id`` plus hit count.
2. ``POST /portal/report.tt.html`` with ``{report_id, start, chunksize}``
to fetch the HTML hit list. Each hit carries a Perl Data::Dumper
block in a ``<pre>`` tag with the canonical metadata.
The query body schema was reverse-engineered from
https://github.com/okfde/dokukratie/blob/main/dokukratie/scrapers/portala.query.json
(GPL-3.0; only structure/selectors are reused, not Python code).
Full-text search is **not** implemented in the MVP: the adapter
returns the most recent ``Anträge`` of the current Wahlperiode in the
given date window, and the search query is applied as a client-side
title/Urheber filter. The portala server-side full-text path requires
LSA-specific ``sf`` index names that are not yet known.
"""
bundesland = "LSA"
name = "Landtag von Sachsen-Anhalt (PADOKA)"
base_url = "https://padoka.landtag.sachsen-anhalt.de"
db_id = "lsa.lissh"
wahlperiode = 8
# Reverse-engineered "WEV*" Perl record fields used in the hit-list dumps:
# WEV06.main = title
# WEV32.5 = relative PDF path
# WEV32.main = "Antrag <Urheber> <DD.MM.YYYY> Drucksache <b>X/YYYY</b> ..."
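# A hypothetical, abridged dump these patterns are aimed at (filename
# suffix and field values made up for illustration):
#   'WEV06' => [ { 'main' => "Titel mit \x{fc}-Escapes" } ],
#   'WEV32' => [ { '5' => 'drs/wp8/drs/d1234xxx.pdf',
#                  'main' => 'Antrag CDU 01.01.2026 Drucksache <b>8/1234</b> ...' } ]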
_RE_TITLE = re.compile(r"'WEV06'\s*=>\s*\[\s*\{\s*'main'\s*=>\s*[\"']([^\"']+)[\"']")
_RE_PDF = re.compile(r"'5'\s*=>\s*'([^']*\.pdf)'")
_RE_DRUCKSACHE = re.compile(r"Drucksache\s*<b>(\d+/\d+)</b>")
_RE_URHEBER_DATUM = re.compile(
r"'WEV32'\s*=>\s*\[\s*\{[^}]*'main'\s*=>\s*[\"']Antrag\s+(.+?)\s+(\d{1,2}\.\d{1,2}\.\d{4})\s+Drucksache",
)
_RE_PRE_BLOCK = re.compile(r'<pre>\$VAR1 = (.*?)</pre>', re.DOTALL)
@staticmethod
def _decode_perl_hex(s: str) -> str:
"""Decode \\x{abcd} escape sequences from Perl Data::Dumper output."""
return re.sub(r'\\x\{([0-9a-f]+)\}', lambda m: chr(int(m.group(1), 16)), s)
@staticmethod
def _normalize_fraktion(urheber: str) -> list[str]:
"""Map Urheber-String to canonical fraction codes."""
u = urheber.upper()
out = []
if "BÜNDNIS 90" in u or "GRÜNE" in u or "GRUENE" in u:
out.append("GRÜNE")
if u.startswith("CDU") or " CDU " in u or u.endswith(" CDU"):
out.append("CDU")
if "SPD" in u:
out.append("SPD")
if "FDP" in u:
out.append("FDP")
if "AFD" in u:
out.append("AfD")
if "LINKE" in u or "DIE LINKE" in u:
out.append("LINKE")
if "LANDESREGIERUNG" in u or "MINISTER" in u or "STAATSKANZLEI" in u:
out.append("Landesregierung")
return out
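# Examples (hypothetical Urheber strings):
#   _normalize_fraktion("BÜNDNIS 90/DIE GRÜNEN")  -> ["GRÜNE"]
#   _normalize_fraktion("Fraktionen CDU und SPD") -> ["CDU", "SPD"]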
def _build_search_body(
self,
wahlperiode: int,
start_date: str,
end_date: str,
document_type: str = "Antrag",
) -> dict:
"""Build the action JSON body for browse.tt.json.
The schema is taken 1:1 from dokukratie's portala.query.json template
and only differs in the data source (lsa.lissh) and the variable
substitutions.
"""
return {
"action": "SearchAndDisplay",
"sources": [self.db_id],
"report": {
"rhl": "main",
"rhlmode": "add",
"format": "generic1-full",
"mime": "html",
"sort": "WEVSO1/D WEVSO2 WEVSO3",
},
"search": {
"lines": {
"2": str(wahlperiode),
"3": document_type,
"4": "D",
"10": start_date,
"11": end_date,
"20.1": "alWEBBI",
"20.2": "alWEBBI",
"20.3": "alWEBBI",
"90.1": "AND",
"90.2": "AND",
"90.3": "AND",
},
"serverrecordname": "sr_generic1",
"parsed": (
f"((/WP {wahlperiode}) AND "
f"(/ETYPF,ETYP2F,DTYPF,DTYP2F,1VTYPF (\"{document_type}\")) "
f"AND (/DART,DARTS (\"D\")) AND "
f"(DAT,DDAT,SDAT= {start_date} THRU {end_date})) AND TYP=DOKDBE"
),
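# With wahlperiode=8 and document_type="Antrag" (dates hypothetical),
# this renders to:
#   ((/WP 8) AND (/ETYPF,ETYP2F,DTYPF,DTYP2F,1VTYPF ("Antrag")) AND
#   (/DART,DARTS ("D")) AND (DAT,DDAT,SDAT= 2024-04-08 THRU 2026-04-07))
#   AND TYP=DOKDBE
# "sref" below carries the identical string.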
"sref": (
f"((/WP {wahlperiode}) AND "
f"(/ETYPF,ETYP2F,DTYPF,DTYP2F,1VTYPF (\"{document_type}\")) "
f"AND (/DART,DARTS (\"D\")) AND "
f"(DAT,DDAT,SDAT= {start_date} THRU {end_date})) AND TYP=DOKDBE"
),
"json": [{
"tn": "and",
"num": 1,
"terms": [
{"tn": "term", "t": str(wahlperiode), "idx": 6, "l": 3,
"sf": "WP", "op": "eq", "num": 5},
{"tn": "or", "num": 3, "terms": [
{"tn": "or", "num": 4, "terms": [
{"tn": "term", "t": f'"{document_type}"', "idx": 50,
"l": 4, "sf": "ETYPF", "op": "eq", "num": 10},
{"tn": "term", "t": f'"{document_type}"', "idx": 50,
"l": 4, "sf": "ETYP2F", "op": "eq", "num": 11},
{"tn": "term", "t": f'"{document_type}"', "idx": 50,
"l": 4, "sf": "DTYPF", "op": "eq", "num": 12},
{"tn": "term", "t": f'"{document_type}"', "idx": 50,
"l": 4, "sf": "DTYP2F", "op": "eq", "num": 13},
{"tn": "term", "t": f'"{document_type}"', "idx": 50,
"l": 4, "sf": "1VTYPF", "op": "eq", "num": 14},
]},
{"tn": "or", "num": 15, "terms": [
{"tn": "term", "t": '"D"', "idx": 93, "l": 4,
"sf": "DART", "op": "eq", "num": 16},
{"tn": "term", "t": '"D"', "idx": 93, "l": 4,
"sf": "DARTS", "op": "eq", "num": 17},
]},
]},
{"tn": "or", "num": 18, "terms": [
{"tn": "or", "num": 19, "terms": [
{"tn": "trange", "sf": "DAT", "op": "eq", "num": 20,
"idx": 119, "l": 3, "p1": start_date, "t1": start_date,
"p2": end_date, "t2": end_date,
"t": f"{start_date} THRU {end_date}"},
{"tn": "trange", "sf": "DDAT", "op": "eq", "num": 21,
"idx": 119, "l": 3, "p1": start_date, "t1": start_date,
"p2": end_date, "t2": end_date,
"t": f"{start_date} THRU {end_date}"},
]},
{"tn": "trange", "sf": "SDAT", "op": "eq", "num": 22,
"idx": 119, "l": 3, "p1": start_date, "t1": start_date,
"p2": end_date, "t2": end_date,
"t": f"{start_date} THRU {end_date}"},
]},
{"tn": "term", "t": "DOKDBE", "idx": 156, "l": 1,
"sf": "TYP", "op": "eq", "num": 23},
],
}],
},
"dataSet": "1",
}
def _parse_hit_list_html(self, html: str, query_filter: str = "") -> list[Drucksache]:
"""Extract Drucksachen from a report.tt.html response."""
results: list[Drucksache] = []
for pre in self._RE_PRE_BLOCK.findall(html):
m_ds = self._RE_DRUCKSACHE.search(pre)
if not m_ds:
continue
drucksache = m_ds.group(1)
m_t = self._RE_TITLE.search(pre)
title = self._decode_perl_hex(m_t.group(1)) if m_t else f"Drucksache {drucksache}"
m_pdf = self._RE_PDF.search(pre)
pdf_rel = m_pdf.group(1) if m_pdf else ""
pdf_url = f"{self.base_url}/files/{pdf_rel}" if pdf_rel else ""
m_w32 = self._RE_URHEBER_DATUM.search(pre)
urheber = self._decode_perl_hex(m_w32.group(1).strip()) if m_w32 else ""
datum_de = m_w32.group(2) if m_w32 else ""
# DD.MM.YYYY -> ISO YYYY-MM-DD
datum_iso = ""
if datum_de:
d, m, y = datum_de.split(".")
datum_iso = f"{y}-{m.zfill(2)}-{d.zfill(2)}"
fraktionen = self._normalize_fraktion(urheber) if urheber else []
doc = Drucksache(
drucksache=drucksache,
title=title,
fraktionen=fraktionen,
datum=datum_iso,
link=pdf_url,
bundesland=self.bundesland,
typ="Antrag",
)
# Client-side title filter (no fulltext search server-side)
if query_filter:
hay = f"{title} {urheber}".lower()
if not all(t in hay for t in query_filter.lower().split()):
continue
results.append(doc)
return results
async def search(self, query: str, limit: int = 20) -> list[Drucksache]:
"""Search recent Anträge of the current Wahlperiode.
``query`` is applied as a client-side title/Urheber filter; the
server-side query covers the last ~24 months by default.
"""
from datetime import date, timedelta
end = date.today()
start = end - timedelta(days=730)
body = self._build_search_body(
wahlperiode=self.wahlperiode,
start_date=start.isoformat(),
end_date=end.isoformat(),
document_type="Antrag",
)
async with httpx.AsyncClient(
timeout=30,
follow_redirects=True,
headers={"User-Agent": "Mozilla/5.0 GWOE-Antragspruefer"},
) as client:
try:
# Step 1: warm up cookies via the browse page
await client.get(f"{self.base_url}/portal/browse.tt.html")
# Step 2: submit the search action
resp = await client.post(
f"{self.base_url}/portal/browse.tt.json",
json=body,
headers={"Referer": f"{self.base_url}/portal/browse.tt.html"},
)
if resp.status_code != 200:
print(f"PADOKA search HTTP {resp.status_code}")
return []
data = resp.json()
report_id = data.get("report_id")
if not report_id:
print(f"PADOKA: no report_id in response: {data}")
return []
# Step 3: fetch the HTML hit list
# Take a generous chunk so client-side filter still has enough
chunksize = 100 if query else limit
report_resp = await client.post(
f"{self.base_url}/portal/report.tt.html",
json={"report_id": report_id, "start": 0, "chunksize": chunksize},
headers={"Referer": f"{self.base_url}/portal/browse.tt.html"},
)
if report_resp.status_code != 200:
print(f"PADOKA report HTTP {report_resp.status_code}")
return []
results = self._parse_hit_list_html(report_resp.text, query_filter=query)
return results[:limit]
except Exception as e:
print(f"PADOKA search error: {e}")
return []
async def get_document(self, drucksache: str) -> Optional[Drucksache]:
"""Look up a single Drucksache by number (MVP: broad search, then
client-side match on the requested ID)."""
# Pragmatic MVP: do a broad search and filter for the requested ID.
# A targeted single-document fetch would require a different
# action.search.json structure that we have not reverse-engineered yet.
results = await self.search(query="", limit=200)
for doc in results:
if doc.drucksache == drucksache:
return doc
return None
async def download_text(self, drucksache: str) -> Optional[str]:
"""Download the PDF for a Drucksache and extract its text."""
import fitz # PyMuPDF
doc = await self.get_document(drucksache)
if not doc or not doc.link:
return None
async with httpx.AsyncClient(
timeout=60,
follow_redirects=True,
headers={"User-Agent": "Mozilla/5.0 GWOE-Antragspruefer"},
) as client:
try:
resp = await client.get(doc.link)
if resp.status_code != 200:
return None
pdf = fitz.open(stream=resp.content, filetype="pdf")
text = ""
for page in pdf:
text += page.get_text()
pdf.close()
return text
except Exception as e:
print(f"PADOKA download error for {drucksache}: {e}")
return None
class BayernAdapter(ParlamentAdapter):
"""Adapter for Bayerischer Landtag."""
bundesland = "BY"
name = "Bayerischer Landtag"
base_url = "https://www.bayern.landtag.de"
async def search(self, query: str, limit: int = 20) -> list[Drucksache]:
# TODO: Implement Bayern search
return []
async def get_document(self, drucksache: str) -> Optional[Drucksache]:
# TODO: Implement
return None
async def download_text(self, drucksache: str) -> Optional[str]:
return None
@@ -345,6 +667,7 @@ class BWAdapter(ParlamentAdapter):
 # Registry of adapters
 ADAPTERS = {
     "NRW": NRWAdapter(),
+    "LSA": PortalaAdapter(),
     "BY": BayernAdapter(),
     "BW": BWAdapter(),
 }