Activate Berlin (PARDOK) — search-only MVP (#3)

PortalaAdapter is now parameterizable and serves both LSA and Berlin
from a single class. Berlin is activated as the third live Bundesland
(after NRW + LSA), with the deliberate caveat that the LTW 2023
Wahlprogramme are not yet indexed.

PortalaAdapter refactor
- Class attributes (bundesland, name, base_url, db_id, wahlperiode)
  moved into the constructor. New optional parameters:
    - portala_path: "/portal" for LSA, "/portala" for Berlin
    - document_type: "Antrag" for LSA, None for Berlin (BE's ETYPF
      index uses different value strings; the document_type subtree
      is dropped from the action.search.json tree)
    - pdf_url_prefix: "/files/" by default; absolute URLs in the hit
      list are passed through unchanged (Berlin embeds full
      starweb/adis/citat/... links)
    - date_window_days: 730 for LSA, 180 for BE (BE has ~10x more
      documents per WP; a narrower window keeps payloads bounded)
- _build_search_body builds the JSON tree dynamically: when
  document_type is None, the entire ETYPF/DTYPF/DART subtree is
  omitted, and the omission is mirrored in the parsed/sref display
  strings.
- _parse_hit_list_html now auto-detects between two formats:
    1. LSA-style: <pre>$VAR1 = …</pre> Perl Data::Dumper records
       (existing parser, untouched).
    2. Berlin-style: production HTML cards with efxRecordRepeater
       divs, h3 titles, h6 metadata lines containing the document
       type, Drucksachen-ID, and date, plus a direct <a href="…pdf">
       to the PDF on the same host.
- Berlin extracts originator parties from the h6 line ("Antrag CDU,
  SPD" → ["CDU","SPD"], typ "Antrag") via the new word-boundary
  _normalize_fraktion regex.
- _normalize_fraktion rewritten with regex word boundaries, fixing a
  long-standing bug where comma-separated fraction lists like
  "CDU, SPD" failed to match CDU (repro sketch after this list). Also
  picks up BSW for the Brombeer/SPD-BSW Landtage and "Senat von
  Berlin" as Landesregierung.
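
A minimal repro of the matching fix (illustrative only; the input
string mimics a live BE h6 line, the old check is the pre-refactor
code from _normalize_fraktion):

    import re

    u = "Antrag CDU, SPD".upper()
    # Old substring check: no match, because the text contains "CDU,"
    # rather than " CDU ", and neither starts nor ends with "CDU".
    old = u.startswith("CDU") or " CDU " in u or u.endswith(" CDU")
    # New word-boundary check, as used by the rewritten helper:
    new = re.search(r"\bCDU\b", u) is not None
    assert (old, new) == (False, True)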

bundeslaender.py
- BE flipped to aktiv=True. anmerkung documents the Wahlprogramm gap
  and the auto-detected hit-list format.

Live verified against pardok.parlament-berlin.de:
- WP 19 with a 180-day date window returns 2962 hits; page 1 contains
  5 records, all with title, Drucksache, date, and PDF URL.
- 19/3107 ("Kleingewässerprogramm") is correctly extracted as an
  Antrag of CDU+SPD; 19/3104-3106 as Vorlagen zur Beschlussfassung;
  19/3108 as Vorlage zur Kenntnisnahme.
- LSA still returns the same 5 current Anträge from March 2026, i.e.
  no regression from the refactor (re-run sketch below).
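
A quick way to re-run the check locally (a sketch: the module name
"adapters" is assumed; the registry and the async search() signature
are from the diff below):

    import asyncio
    from adapters import ADAPTERS  # hypothetical module path

    docs = asyncio.run(ADAPTERS["BE"].search("", limit=5))
    for d in docs:
        print(d.drucksache, d.typ, d.fraktionen, d.datum, d.link)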

Known limitation (will be tracked as a follow-up issue)
- Berlin Wahlprogramme zur LTW 2023 are not yet indexed in the
  embeddings DB. The 2023 PDFs are no longer linked from the live
  party websites (which currently feature 2026 draft programmes), and
  Wayback has no snapshots. The analyzer therefore falls back to the
  parties' cross-state Grundsatzprogramme for BE Anträge until the
  2023 PDFs are sourced manually.

Refs #3.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Author: Dotty Dotter
Date:   2026-04-07 23:33:16 +02:00
Commit: 9e0f11f7c9 (parent 80e16df288)
2 changed files with 340 additions and 123 deletions

bundeslaender.py:

@@ -114,9 +114,18 @@ BUNDESLAENDER: dict[str, Bundesland] = {
         doku_base_url="https://pardok.parlament-berlin.de",
         drucksache_format="19/1234",
         dokukratie_scraper="be",
+        aktiv=True,
         anmerkung=(
-            "PARDOK basiert auf StarWeb-Software (portala-Frontend). Berlin bietet "
-            "zusätzlich Open-Data-XML unter parlament-berlin.de/dokumente/open-data."
+            "PARDOK = portala/eUI-Framework (gleiche Engine wie LSA-PADOKA, "
+            "unter /portala/ statt /portal/). Hit list arrives as production "
+            "HTML cards instead of LSA-style Perl Data::Dumper blocks — "
+            "PortalaAdapter auto-detects both formats. document_type=None "
+            "for BE because Berlin's ETYPF index uses different value strings "
+            "than LSA. Wahlprogramme zur LTW 2023 sind noch nicht indexiert "
+            "(Folge-Issue) — Analyse läuft daher mit Grundsatzprogramm-"
+            "Zitaten als Fallback. Open-Data-XML unter "
+            "parlament-berlin.de/dokumente/open-data ist eine alternative "
+            "Datenquelle, derzeit nicht verwendet."
         ),
     ),
     "BB": Bundesland(

PortalaAdapter module:

@@ -313,35 +313,82 @@ class PortalaAdapter(ParlamentAdapter):
     Single-Page App with Template Toolkit on the server side):
     - **LSA** (Sachsen-Anhalt) PADOKA at ``padoka.landtag.sachsen-anhalt.de``
-    - **BE** (Berlin) PARDOK at ``pardok.parlament-berlin.de`` (future)
+      under ``/portal/`` (singular)
+    - **BE** (Berlin) PARDOK at ``pardok.parlament-berlin.de`` under
+      ``/portala/`` (with the trailing 'a')
+
+    Both instances share the same JSON action schema; only the base URL,
+    the data source ID, the application path prefix and a few minor
+    quirks differ. Those are constructor parameters, so that the same
+    class can serve both states (and any future portala-based parliament).
 
     The search workflow is two-stage:
-    1. ``POST /portal/browse.tt.json`` with a complex JSON ``action`` body
-       that contains an Elasticsearch-style query tree under
-       ``search.json``. The server returns a ``report_id`` plus hit count.
-    2. ``POST /portal/report.tt.html`` with ``{report_id, start, chunksize}``
-       to fetch the HTML hit list. Each hit carries a Perl Data::Dumper
-       block in a ``<pre>`` tag with the canonical metadata.
+    1. ``POST {base}{path}/browse.tt.json`` with a complex JSON ``action``
+       body that contains an Elasticsearch-style query tree under
+       ``search.json``. The server returns a ``report_id`` plus hit count.
+    2. ``POST {base}{path}/report.tt.html`` with ``{report_id, start,
+       chunksize}`` to fetch the HTML hit list. Each hit carries a Perl
+       Data::Dumper block in a ``<pre>`` tag with the canonical metadata.
 
     The query body schema was reverse-engineered from
     https://github.com/okfde/dokukratie/blob/main/dokukratie/scrapers/portala.query.json
     (GPL-3.0; only structure/selectors are reused, not Python code).
 
     Full-text search is **not** implemented in the MVP: the adapter
-    returns the most recent ``Anträge`` of the current Wahlperiode in the
-    given date window, and the search query is applied as a client-side
-    title/Urheber filter. The portala server-side full-text path requires
-    LSA-specific ``sf`` index names that are not yet known.
+    returns documents of the current Wahlperiode in the given date
+    window, and the search query is applied as a client-side
+    title/Urheber filter. The server-side full-text path requires
+    state-specific ``sf`` index names that are not yet known.
     """
 
-    bundesland = "LSA"
-    name = "Landtag von Sachsen-Anhalt (PADOKA)"
-    base_url = "https://padoka.landtag.sachsen-anhalt.de"
-    db_id = "lsa.lissh"
-    wahlperiode = 8
+    def __init__(
+        self,
+        *,
+        bundesland: str,
+        name: str,
+        base_url: str,
+        db_id: str,
+        wahlperiode: int,
+        portala_path: str = "/portal",
+        document_type: Optional[str] = "Antrag",
+        pdf_url_prefix: str = "/files/",
+        date_window_days: int = 730,
+    ) -> None:
+        """Configure a portala/eUI adapter for one specific parliament.
 
-    # Reverse-engineered "WEV*" Perl record fields used in the hit-list dumps:
+        Args:
+            bundesland: state code (e.g. ``"LSA"``, ``"BE"``).
+            name: human-readable adapter label (used in logs/UI).
+            base_url: ``https://...`` of the portal host without trailing slash.
+            db_id: data source identifier the eUI server expects in
+                ``action.sources``, e.g. ``"lsa.lissh"`` or ``"lah.lissh"``.
+            wahlperiode: current legislative period fed into the WP
+                term of the search tree.
+            portala_path: path prefix where the portala app lives. ``/portal``
+                for LSA, ``/portala`` for Berlin.
+            document_type: optional filter applied via ETYPF/DTYPF/DART
+                terms. ``"Antrag"`` works for LSA; for instances where
+                the index uses different document_type values (e.g. Berlin),
+                pass ``None`` to drop the document_type subtree entirely;
+                the user can still filter client-side by title.
+            pdf_url_prefix: URL fragment between ``base_url`` and the
+                relative PDF path returned by the server.
+            date_window_days: how many days back ``search()`` looks by
+                default.
+        """
+        self.bundesland = bundesland
+        self.name = name
+        self.base_url = base_url.rstrip("/")
+        self.db_id = db_id
+        self.wahlperiode = wahlperiode
+        self.portala_path = "/" + portala_path.strip("/")
+        self.document_type = document_type
+        self.pdf_url_prefix = "/" + pdf_url_prefix.strip("/") + "/"
+        self.date_window_days = date_window_days
+
+    # ── LSA-style hit list (Perl Data::Dumper inside <pre> blocks) ──
+    # Reverse-engineered "WEV*" record fields:
     # WEV06.main = title
     # WEV32.5 = relative PDF path
     # WEV32.main = "Antrag <Urheber> <DD.MM.YYYY> Drucksache <b>X/YYYY</b> ..."
@@ -353,6 +400,20 @@ class PortalaAdapter(ParlamentAdapter):
     )
     _RE_PRE_BLOCK = re.compile(r'<pre>\$VAR1 = (.*?)</pre>', re.DOTALL)
 
+    # ── Berlin-style hit list (production HTML cards, no Perl dump) ──
+    # The whole div for one record:
+    _RE_BE_RECORD = re.compile(
+        r'<div[^>]*class="[^"]*efxRecordRepeater[^"]*"[^>]*data-efx-rec="[^"]*"[^>]*>(.*?)(?=<div[^>]*efxRecordRepeater|<div[^>]*id="efxResultsEnd"|</main>|$)',
+        re.DOTALL,
+    )
+    _RE_BE_TITLE = re.compile(r'<h3[^>]*class="h5[^"]*"[^>]*>\s*<span>([^<]+)</span>')
+    _RE_BE_LINK = re.compile(r'<a[^>]*href="([^"]+\.pdf)"[^>]*>')
+    # The metadata h6 looks like:
+    #   <span class="h6">Antrag (Eilantrag) &nbsp;<a ...>Drucksache 19/3104</a> S. 1 bis 24 vom 31.03.2026</span>
+    _RE_BE_DRUCKSACHE = re.compile(r'Drucksache\s+(\d+/\d+)')
+    _RE_BE_DATUM = re.compile(r'vom\s+(\d{1,2}\.\d{1,2}\.\d{4})')
+    _RE_BE_DOCTYPE = re.compile(r'<span class="h6">\s*([^<&]+?)(?:&nbsp;|<)')
+
     @staticmethod
     def _decode_perl_hex(s: str) -> str:
         """Decode \\x{abcd} escape sequences from Perl Data::Dumper output."""
@@ -360,22 +421,33 @@ class PortalaAdapter(ParlamentAdapter):
 
     @staticmethod
     def _normalize_fraktion(urheber: str) -> list[str]:
-        """Map Urheber-String to canonical fraction codes."""
+        """Map Urheber-String to canonical fraction codes.
+
+        Uses regex word boundaries instead of plain substring matching so
+        that comma-separated lists ("CDU, SPD") and the embedded "DIE
+        LINKE" are matched reliably.
+        """
         u = urheber.upper()
-        out = []
-        if "BÜNDNIS 90" in u or "GRÜNE" in u or "GRUENE" in u:
+        out: list[str] = []
+
+        def has(pattern: str) -> bool:
+            return re.search(pattern, u) is not None
+
+        if has(r"\bBÜNDNIS\s*90\b") or has(r"\bGR(?:Ü|UE)NE\b"):
             out.append("GRÜNE")
-        if u.startswith("CDU") or " CDU " in u or u.endswith(" CDU"):
+        if has(r"\bCDU\b"):
             out.append("CDU")
-        if "SPD" in u:
+        if has(r"\bSPD\b"):
             out.append("SPD")
-        if "FDP" in u:
+        if has(r"\bFDP\b"):
             out.append("FDP")
-        if "AFD" in u:
+        if has(r"\bAFD\b"):
             out.append("AfD")
-        if "LINKE" in u or "DIE LINKE" in u:
+        if has(r"\bLINKE\b"):
             out.append("LINKE")
-        if "LANDESREGIERUNG" in u or "MINISTER" in u or "STAATSKANZLEI" in u:
+        if has(r"\bBSW\b"):
+            out.append("BSW")
+        if has(r"LANDESREGIERUNG|SENAT VON BERLIN|REGIERENDE[RN]?\s+BÜRGERMEISTER|MINISTER\b|STAATSKANZLEI"):
             out.append("Landesregierung")
         return out
@@ -384,29 +456,28 @@ class PortalaAdapter(ParlamentAdapter):
         wahlperiode: int,
         start_date: str,
         end_date: str,
-        document_type: str = "Antrag",
     ) -> dict:
         """Build the action JSON body for browse.tt.json.
 
-        The schema is taken 1:1 from dokukratie's portala.query.json template
-        and only differs in the data source (lsa.lissh) and the variable
-        substitutions.
+        The schema is taken from dokukratie's portala.query.json template
+        and only differs in the data source and the variable substitutions.
+        When ``self.document_type`` is None, the ETYPF/DTYPF/DART subtree
+        is dropped, which is useful for parliaments whose ETYPF index uses
+        different value strings than ``"Antrag"``.
         """
-        return {
-            "action": "SearchAndDisplay",
-            "sources": [self.db_id],
-            "report": {
-                "rhl": "main",
-                "rhlmode": "add",
-                "format": "generic1-full",
-                "mime": "html",
-                "sort": "WEVSO1/D WEVSO2 WEVSO3",
-            },
-            "search": {
-                "lines": {
-                    "2": str(wahlperiode),
-                    "3": document_type,
-                    "4": "D",
-                    "10": start_date,
-                    "11": end_date,
-                    "20.1": "alWEBBI",
+        document_type = self.document_type
+        date_range_text = f"{start_date} THRU {end_date}"
+        date_term = lambda sf, num: {  # noqa: E731 — local helper
+            "tn": "trange", "sf": sf, "op": "eq", "num": num,
+            "idx": 119, "l": 3,
+            "p1": start_date, "t1": start_date,
+            "p2": end_date, "t2": end_date,
+            "t": date_range_text,
+        }
+
+        # Build the search.lines (form-state mirror) and the json tree
+        lines: dict = {
+            "2": str(wahlperiode),
+            "10": start_date,
+            "11": end_date,
+            "20.1": "alWEBBI",
@@ -415,27 +486,19 @@ class PortalaAdapter(ParlamentAdapter):
             "90.1": "AND",
             "90.2": "AND",
             "90.3": "AND",
-                },
-                "serverrecordname": "sr_generic1",
-                "parsed": (
-                    f"((/WP {wahlperiode}) AND "
-                    f"(/ETYPF,ETYP2F,DTYPF,DTYP2F,1VTYPF (\"{document_type}\")) "
-                    f"AND (/DART,DARTS (\"D\")) AND "
-                    f"(DAT,DDAT,SDAT= {start_date} THRU {end_date})) AND TYP=DOKDBE"
-                ),
-                "sref": (
-                    f"((/WP {wahlperiode}) AND "
-                    f"(/ETYPF,ETYP2F,DTYPF,DTYP2F,1VTYPF (\"{document_type}\")) "
-                    f"AND (/DART,DARTS (\"D\")) AND "
-                    f"(DAT,DDAT,SDAT= {start_date} THRU {end_date})) AND TYP=DOKDBE"
-                ),
-                "json": [{
-                    "tn": "and",
-                    "num": 1,
-                    "terms": [
+        }
+        if document_type is not None:
+            lines["3"] = document_type
+            lines["4"] = "D"
+
+        # Top-level AND tree
+        top_terms: list = [
+            {"tn": "term", "t": str(wahlperiode), "idx": 6, "l": 3,
+             "sf": "WP", "op": "eq", "num": 5},
+        ]
+        if document_type is not None:
+            top_terms.append({"tn": "or", "num": 3, "terms": [
                 {"tn": "or", "num": 4, "terms": [
                     {"tn": "term", "t": f'"{document_type}"', "idx": 50,
                      "l": 4, "sf": "ETYPF", "op": "eq", "num": 10},
@@ -454,33 +517,84 @@ class PortalaAdapter(ParlamentAdapter):
                     {"tn": "term", "t": '"D"', "idx": 93, "l": 4,
                      "sf": "DARTS", "op": "eq", "num": 17},
                 ]},
-            ]},
-            {"tn": "or", "num": 18, "terms": [
-                {"tn": "or", "num": 19, "terms": [
-                    {"tn": "trange", "sf": "DAT", "op": "eq", "num": 20,
-                     "idx": 119, "l": 3, "p1": start_date, "t1": start_date,
-                     "p2": end_date, "t2": end_date,
-                     "t": f"{start_date} THRU {end_date}"},
-                    {"tn": "trange", "sf": "DDAT", "op": "eq", "num": 21,
-                     "idx": 119, "l": 3, "p1": start_date, "t1": start_date,
-                     "p2": end_date, "t2": end_date,
-                     "t": f"{start_date} THRU {end_date}"},
-                ]},
-                {"tn": "trange", "sf": "SDAT", "op": "eq", "num": 22,
-                 "idx": 119, "l": 3, "p1": start_date, "t1": start_date,
-                 "p2": end_date, "t2": end_date,
-                 "t": f"{start_date} THRU {end_date}"},
-            ]},
-            {"tn": "term", "t": "DOKDBE", "idx": 156, "l": 1,
-             "sf": "TYP", "op": "eq", "num": 23},
-        ],
+            ]})
+
+        top_terms.append({"tn": "or", "num": 18, "terms": [
+            {"tn": "or", "num": 19, "terms": [
+                date_term("DAT", 20),
+                date_term("DDAT", 21),
+            ]},
+            date_term("SDAT", 22),
+        ]})
+        top_terms.append({"tn": "term", "t": "DOKDBE", "idx": 156, "l": 1,
+                          "sf": "TYP", "op": "eq", "num": 23})
+
+        # Mirror the same shape into the parsed/sref display strings
+        if document_type is not None:
+            parsed = (
+                f"((/WP {wahlperiode}) AND "
+                f"(/ETYPF,ETYP2F,DTYPF,DTYP2F,1VTYPF (\"{document_type}\")) "
+                f"AND (/DART,DARTS (\"D\")) AND "
+                f"(DAT,DDAT,SDAT= {date_range_text})) AND TYP=DOKDBE"
+            )
+        else:
+            parsed = (
+                f"((/WP {wahlperiode}) AND "
+                f"(DAT,DDAT,SDAT= {date_range_text})) AND TYP=DOKDBE"
+            )
+
+        return {
+            "action": "SearchAndDisplay",
+            "sources": [self.db_id],
+            "report": {
+                "rhl": "main",
+                "rhlmode": "add",
+                "format": "generic1-full",
+                "mime": "html",
+                "sort": "WEVSO1/D WEVSO2 WEVSO3",
+            },
+            "search": {
+                "lines": lines,
+                "serverrecordname": "sr_generic1",
+                "parsed": parsed,
+                "sref": parsed,
+                "json": [{
+                    "tn": "and",
+                    "num": 1,
+                    "terms": top_terms,
                 }],
             },
             "dataSet": "1",
         }
+
+    @staticmethod
+    def _datum_de_to_iso(datum_de: str) -> str:
+        """Convert DD.MM.YYYY → YYYY-MM-DD; return '' for empty input."""
+        if not datum_de:
+            return ""
+        d, m, y = datum_de.split(".")
+        return f"{y}-{m.zfill(2)}-{d.zfill(2)}"
 
     def _parse_hit_list_html(self, html: str, query_filter: str = "") -> list[Drucksache]:
-        """Extract Drucksachen from a report.tt.html response."""
+        """Extract Drucksachen from a report.tt.html response.
+
+        Two formats are supported and auto-detected:
+
+        - **LSA-style:** the records are embedded as Perl Data::Dumper
+          dumps inside ``<pre>$VAR1 = …</pre>`` blocks. WEV06 title,
+          WEV32 metadata + PDF path. Used by Sachsen-Anhalt's PADOKA
+          template.
+        - **Berlin-style:** standard production HTML cards with
+          ``efxRecordRepeater`` divs. Title in an ``<h3 class="h5">``,
+          metadata + PDF link in a ``<span class="h6">``. Used by
+          Berlin's PARDOK template.
+        """
+        if self._RE_PRE_BLOCK.search(html):
+            return self._parse_hit_list_dump(html, query_filter)
+        return self._parse_hit_list_cards(html, query_filter)
+
+    def _parse_hit_list_dump(self, html: str, query_filter: str) -> list[Drucksache]:
+        """Parse LSA-style ``<pre>$VAR1 = …</pre>`` Perl-dump records."""
         results: list[Drucksache] = []
         for pre in self._RE_PRE_BLOCK.findall(html):
             m_ds = self._RE_DRUCKSACHE.search(pre)
@@ -493,17 +607,11 @@ class PortalaAdapter(ParlamentAdapter):
             m_pdf = self._RE_PDF.search(pre)
             pdf_rel = m_pdf.group(1) if m_pdf else ""
-            pdf_url = f"{self.base_url}/files/{pdf_rel}" if pdf_rel else ""
+            pdf_url = f"{self.base_url}{self.pdf_url_prefix}{pdf_rel}" if pdf_rel else ""
 
             m_w32 = self._RE_URHEBER_DATUM.search(pre)
             urheber = self._decode_perl_hex(m_w32.group(1).strip()) if m_w32 else ""
-            datum_de = m_w32.group(2) if m_w32 else ""
-
-            # DD.MM.YYYY -> ISO YYYY-MM-DD
-            datum_iso = ""
-            if datum_de:
-                d, m, y = datum_de.split(".")
-                datum_iso = f"{y}-{m.zfill(2)}-{d.zfill(2)}"
+            datum_iso = self._datum_de_to_iso(m_w32.group(2) if m_w32 else "")
 
             fraktionen = self._normalize_fraktion(urheber) if urheber else []
 
             doc = Drucksache(
@@ -516,7 +624,6 @@ class PortalaAdapter(ParlamentAdapter):
                 typ="Antrag",
             )
 
-            # Client-side title filter (no fulltext search server-side)
             if query_filter:
                 hay = f"{title} {urheber}".lower()
                 if not all(t in hay for t in query_filter.lower().split()):
@@ -526,23 +633,100 @@ class PortalaAdapter(ParlamentAdapter):
         return results
 
+    def _parse_hit_list_cards(self, html: str, query_filter: str) -> list[Drucksache]:
+        """Parse Berlin-style ``efxRecordRepeater`` HTML-card records.
+
+        Each card contains an ``<h3>`` title, a metadata ``<span class="h6">``
+        with the document type, the Drucksachen-Nummer, and the date,
+        plus a direct ``<a href="…pdf">`` link to the PDF on the same host.
+        """
+        results: list[Drucksache] = []
+        # Split the HTML on every record-div opener — easier than balancing
+        # divs with regex.
+        chunks = html.split('class="record')
+        # First chunk is the prelude, skip it
+        for chunk in chunks[1:]:
+            # Each chunk now starts at the record class attribute
+            m_t = self._RE_BE_TITLE.search(chunk)
+            title = m_t.group(1).strip() if m_t else "Ohne Titel"
+
+            m_ds = self._RE_BE_DRUCKSACHE.search(chunk)
+            if not m_ds:
+                continue
+            drucksache = m_ds.group(1)
+
+            m_pdf = self._RE_BE_LINK.search(chunk)
+            pdf_url = ""
+            if m_pdf:
+                href = m_pdf.group(1)
+                if href.startswith("http://") or href.startswith("https://"):
+                    pdf_url = href
+                elif href.startswith("/"):
+                    pdf_url = f"{self.base_url}{href}"
+                else:
+                    pdf_url = f"{self.base_url}{self.pdf_url_prefix}{href}"
+
+            m_dat = self._RE_BE_DATUM.search(chunk)
+            datum_iso = self._datum_de_to_iso(m_dat.group(1) if m_dat else "")
+
+            m_doc = self._RE_BE_DOCTYPE.search(chunk)
+            doctype_full = m_doc.group(1).strip() if m_doc else "Drucksache"
+
+            # Berlin often packs the originator(s) into the same h6 line:
+            #   "Antrag CDU, SPD" → fraktionen = ["CDU","SPD"], typ = "Antrag"
+            # Senat-Vorlagen carry no fraction, only "Vorlage zur …".
+            fraktionen = self._normalize_fraktion(doctype_full)
+
+            # Strip the fraction names back out of the typ string so the UI
+            # shows a clean "Antrag" / "Vorlage …" label.
+            typ = doctype_full
+            if fraktionen:
+                # Cut at the first occurrence of any party name
+                cuts = [typ.upper().find(f.upper()) for f in fraktionen]
+                cuts = [c for c in cuts if c >= 0]
+                if cuts:
+                    typ = typ[: min(cuts)].rstrip(" ,")
+
+            doc = Drucksache(
+                drucksache=drucksache,
+                title=title,
+                fraktionen=fraktionen,
+                datum=datum_iso,
+                link=pdf_url,
+                bundesland=self.bundesland,
+                typ=typ,
+            )
+
+            if query_filter:
+                hay = f"{title} {doctype_full}".lower()
+                if not all(t in hay for t in query_filter.lower().split()):
+                    continue
+            results.append(doc)
+        return results
+
     async def search(self, query: str, limit: int = 20) -> list[Drucksache]:
-        """Search recent Anträge of the current Wahlperiode.
+        """Search recent documents of the current Wahlperiode.
 
         ``query`` is applied as a client-side title/Urheber filter; the
-        server-side query covers the last ~24 months by default.
+        server-side query covers the configured ``date_window_days``
+        (default 24 months).
         """
         from datetime import date, timedelta
 
         end = date.today()
-        start = end - timedelta(days=730)
+        start = end - timedelta(days=self.date_window_days)
 
         body = self._build_search_body(
             wahlperiode=self.wahlperiode,
             start_date=start.isoformat(),
             end_date=end.isoformat(),
-            document_type="Antrag",
         )
 
+        browse_html = f"{self.base_url}{self.portala_path}/browse.tt.html"
+        browse_json = f"{self.base_url}{self.portala_path}/browse.tt.json"
+        report_html = f"{self.base_url}{self.portala_path}/report.tt.html"
+
         async with httpx.AsyncClient(
             timeout=30,
             follow_redirects=True,
@@ -550,41 +734,41 @@ class PortalaAdapter(ParlamentAdapter):
         ) as client:
             try:
                 # Step 1: warm up cookies via the browse page
-                await client.get(f"{self.base_url}/portal/browse.tt.html")
+                await client.get(browse_html)
 
                 # Step 2: submit the search action
                 resp = await client.post(
-                    f"{self.base_url}/portal/browse.tt.json",
+                    browse_json,
                     json=body,
-                    headers={"Referer": f"{self.base_url}/portal/browse.tt.html"},
+                    headers={"Referer": browse_html},
                 )
                 if resp.status_code != 200:
-                    print(f"PADOKA search HTTP {resp.status_code}")
+                    print(f"{self.bundesland} search HTTP {resp.status_code}")
                     return []
                 data = resp.json()
                 report_id = data.get("report_id")
                 if not report_id:
-                    print(f"PADOKA: no report_id in response: {data}")
+                    print(f"{self.bundesland}: no report_id in response: {data}")
                     return []
 
                 # Step 3: fetch the HTML hit list
                 # Take a generous chunk so client-side filter still has enough
                 chunksize = 100 if query else limit
                 report_resp = await client.post(
-                    f"{self.base_url}/portal/report.tt.html",
+                    report_html,
                     json={"report_id": report_id, "start": 0, "chunksize": chunksize},
-                    headers={"Referer": f"{self.base_url}/portal/browse.tt.html"},
+                    headers={"Referer": browse_html},
                 )
                 if report_resp.status_code != 200:
-                    print(f"PADOKA report HTTP {report_resp.status_code}")
+                    print(f"{self.bundesland} report HTTP {report_resp.status_code}")
                     return []
 
                 results = self._parse_hit_list_html(report_resp.text, query_filter=query)
                 return results[:limit]
             except Exception as e:
-                print(f"PADOKA search error: {e}")
+                print(f"{self.bundesland} search error: {e}")
                 return []
 
     async def get_document(self, drucksache: str) -> Optional[Drucksache]:
@@ -623,7 +807,7 @@ class PortalaAdapter(ParlamentAdapter):
             pdf.close()
             return text
         except Exception as e:
-            print(f"PADOKA download error for {drucksache}: {e}")
+            print(f"{self.bundesland} download error for {drucksache}: {e}")
             return None
@@ -667,7 +851,31 @@ class BWAdapter(ParlamentAdapter):
 
 # Registry of adapters
 ADAPTERS = {
     "NRW": NRWAdapter(),
-    "LSA": PortalaAdapter(),
+    "LSA": PortalaAdapter(
+        bundesland="LSA",
+        name="Landtag von Sachsen-Anhalt (PADOKA)",
+        base_url="https://padoka.landtag.sachsen-anhalt.de",
+        db_id="lsa.lissh",
+        wahlperiode=8,
+        portala_path="/portal",
+        document_type="Antrag",
+        pdf_url_prefix="/files/",
+    ),
+    "BE": PortalaAdapter(
+        bundesland="BE",
+        name="Abgeordnetenhaus von Berlin (PARDOK)",
+        base_url="https://pardok.parlament-berlin.de",
+        db_id="lah.lissh",
+        wahlperiode=19,
+        portala_path="/portala",
+        # Berlin's ETYPF index uses different value strings — drop the
+        # document_type subtree, fall back to client-side title filter.
+        document_type=None,
+        # Tighter date window: BE has ~10x more documents than LSA, so a
+        # narrower window keeps the per-request payload bounded.
+        date_window_days=180,
+        pdf_url_prefix="/files/",
+    ),
     "BY": BayernAdapter(),
     "BW": BWAdapter(),
 }