Replaces the client-side title/Urheber substring filter with a
real server-side full-text search through ParlDok's facet_fulltext
tag (type=0). The tag schema is reverse-engineered from
pd.addInput in the live bundle.js:
{"type": 0,
"id": <getFulltextId(term)>, # non-alphanum → "-"
"fulltext": <raw term>,
"label": <raw term>,
"field": "Alle"} # search all indexed fields
The Resultpage queryid inherits the fulltext filter, so
pagination works without re-sending the tag.
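The tag construction can be sketched as follows (the schema fields are from the commit; the Python port of `getFulltextId` is an illustrative assumption about the bundle.js helper):

```python
import re

def get_fulltext_id(term: str) -> str:
    # Mirrors bundle.js getFulltextId: every non-alphanumeric
    # character maps to "-" (assumption: per character, runs not collapsed).
    return re.sub(r"[^a-zA-Z0-9]", "-", term)

def build_fulltext_tag(term: str) -> dict:
    """Build the facet_fulltext tag (type=0) sent to Fulltext/Search."""
    return {
        "type": 0,
        "id": get_fulltext_id(term),
        "fulltext": term,
        "label": term,
        "field": "Alle",  # search all indexed fields
    }
```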
Smoke test (local):
Schule → 10 hits (was 3)
Klima → 10 hits across multiple parties + dates
Wohnen → 10 hits including older 2025 Anträge
The 10-page (1000-doc) safety bound still applies on top of the
fulltext-filtered result set, but since the server now narrows
to ~2k Schule-related docs WP-wide instead of the 8k+ raw WP
total, the bound is no longer the limiting factor for typical
queries.
Closes #12. The BE/LSA equivalent (#13) is independent; the eUI
sf-index names still need DevTools tracing.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds a new ParLDokAdapter for ParlDok 8.x parliament documentation
systems by J3S GmbH. MV becomes the fourth supported state alongside
NRW, LSA and BE.
Notable details:
- ParlDok 8.x is a single-page app whose backend is a JSON API rooted
at {base}/parldok/Fulltext/{Search,Resultpage}. The legacy ParLDok
5.x HTML POST form (parldok/formalkriterien) used by dokukratie's
mv.yml has been deprecated by the LandtagMV upgrade to 8.3.5 and
is no longer reachable via the old form fields — hence a new
adapter rather than reusing the dokukratie scraper.
- Two-stage pagination: Fulltext/Search returns the first 100 hits
+ a queryid; further pages come from Fulltext/Resultpage with
{queryid, limit:{Start,Length}}. The Search endpoint silently
ignores any non-zero Start, so single-stage offset pagination is
not an option.
- Server-side filter via facet_lp (type=10) on the configured WP;
type=Antrag is filtered client-side because the facet_type value
IDs are instance-specific and would require an extra
Fulltext/Filter discovery call. ParlDok also returns the same
Drucksache multiple times when it appears in several
Vorgänge/Beratungen, so search() dedupes by lp/number.
- Wahlprogramme for the LTW of 26.09.2021 are not yet indexed (follow-up
  in #4); analyses run with the federal Grundsatzprogramm fallback,
  same as Berlin until #10 lands.
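The pagination and dedupe halves of this flow can be sketched as pure helpers (the {queryid, limit:{Start,Length}} body is from the commit; hit-field names like `lp`/`number` are assumptions):

```python
def resultpage_body(queryid: str, page: int, page_size: int = 100) -> dict:
    """Stage-2 body POSTed to Fulltext/Resultpage for pages beyond the
    first 100 hits that Fulltext/Search already returned."""
    return {"queryid": queryid,
            "limit": {"Start": page * page_size, "Length": page_size}}

def dedupe_hits(hits: list) -> list:
    """Keep the first occurrence per (lp, number): ParlDok repeats a
    Drucksache once for each Vorgang/Beratung it appears in."""
    seen, out = set(), []
    for hit in hits:
        key = (hit["lp"], hit["number"])
        if key not in seen:
            seen.add(key)
            out.append(hit)
    return out
```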
Drive-by cleanup of PortalaAdapter print() statements: switched to
the module-level logger so adapter parser bugs no longer disappear
into stdout.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
_parse_hit_list_cards referenced an undefined `doctype` instead of
`doctype_full` on the query-filter path. The surrounding try/except in
search() swallowed the exception, so Berlin queries silently returned
0 hits whenever a search term was given.
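A minimal reproduction of the failure mode (the surrounding shape is illustrative; only the `doctype`/`doctype_full` confusion and the swallowing try/except are from the commit):

```python
def parse_cards_buggy(cards, query=None):
    # Reduced stand-in for _parse_hit_list_cards; card fields are made up.
    for card in cards:
        doctype_full = card["typ"]
        if query is not None:
            yield doctype        # the bug: undefined name -> NameError
        else:
            yield doctype_full   # unfiltered path worked, masking the bug

def search(cards, query=None):
    try:
        return list(parse_cards_buggy(cards, query))
    except Exception:
        return []  # broad except turned the NameError into "0 hits"
```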
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
PortalaAdapter is now parameterizable and serves both LSA and Berlin
from a single class. Berlin is activated as the third live bundesland
(after NRW + LSA), with the deliberate caveat that the LTW 2023
Wahlprogramme are not yet indexed.
PortalaAdapter refactor
- Class attributes (bundesland, name, base_url, db_id, wahlperiode)
moved into the constructor. New optional parameters:
- portala_path: "/portal" for LSA, "/portala" for Berlin
- document_type: "Antrag" for LSA, None for Berlin (BE's ETYPF
index uses different value strings; the document_type subtree
is dropped from the action.search.json tree)
- pdf_url_prefix: "/files/" by default; absolute URLs in the hit
list are passed through unchanged (Berlin embeds full
starweb/adis/citat/... links)
  - date_window_days: 730 for LSA, 180 for BE (BE has ~10x more
    documents per WP; the narrower window keeps payloads bounded)
- _build_search_body builds the JSON tree dynamically: when
  document_type is None, the entire ETYPF/DTYPF/DART subtree is
  omitted, and the parsed/sref display strings are adjusted to match.
- _parse_hit_list_html now auto-detects between two formats:
1. LSA-style: <pre>$VAR1 = …</pre> Perl Data::Dumper records
(existing parser, untouched).
2. Berlin-style: production HTML cards with efxRecordRepeater
divs, h3 titles, h6 metadata lines containing the document
type, drucksachen-id and date, plus a direct <a href="…pdf">
to the PDF on the same host.
- Berlin extracts originator parties from the h6 line ("Antrag CDU,
SPD" → ["CDU","SPD"], typ "Antrag") via the new word-boundary
_normalize_fraktion regex.
- _normalize_fraktion rewritten with regex word boundaries, fixing a
  long-standing bug where comma-separated Fraktion lists like
  "CDU, SPD" failed to match CDU. Also picks up BSW for the
  Brombeer/SPD-BSW Landtage and "Senat von Berlin" as Landesregierung.
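The word-boundary matching could look roughly like this (the code list and the "Senat von Berlin" mapping are from the commit; the function shape is a sketch):

```python
import re

_CODES = ["CDU", "SPD", "GRÜNE", "FDP", "AfD", "LINKE", "BSW"]

def normalize_fraktionen(raw: str) -> list:
    """Match canonical party codes with \\b word boundaries so that
    comma-separated lists like 'Antrag CDU, SPD' hit every party
    (a plain substring/equality check missed CDU in such strings)."""
    found = [c for c in _CODES if re.search(rf"\b{re.escape(c)}\b", raw)]
    if "Senat von Berlin" in raw or "Landesregierung" in raw:
        found.append("Landesregierung")
    return found
```

Python's `re` is Unicode-aware on str input, so `\b` behaves correctly around "GRÜNE".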
bundeslaender.py
- BE flipped to aktiv=True. anmerkung documents the Wahlprogramm gap
  and the auto-detected hit-list format.
Live verified against pardok.parlament-berlin.de:
- WP 19 with a 180-day date window returns 2962 hits; page 1 contains
  5 records, all with title, Drucksache number, date and PDF URL.
- 19/3107 ("Kleingewässerprogramm") correctly extracted as an Antrag by
  CDU+SPD; 19/3104-3106 as Vorlagen zur Beschlussfassung; 19/3108 as a
  Vorlage zur Kenntnisnahme.
- LSA still returns the same 5 current Anträge from March 2026; no
  regression from the refactor.
Known limitation (will be tracked as a follow-up issue)
- Berlin Wahlprogramme for the LTW 2023 are not yet indexed in the
  embeddings DB. The 2023 PDFs are no longer linked from the live
  party websites (which currently feature 2026 draft programmes), and
  the Wayback Machine has no snapshots. The analyzer therefore falls
  back to the federal Grundsatzprogramme for BE Anträge until the
  2023 PDFs are sourced manually.
Refs #3.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds a clean-room PortalaAdapter that talks to the eUI/portala framework
behind PADOKA (Landtag Sachsen-Anhalt). The same engine powers Berlin's
PARDOK, so this adapter will also serve issue #3 once activated for BE.
Reverse-engineering notes
- The "PADOKA = StarWeb" assumption from issue #1 / dokukratie's st.yml
is outdated. The Sachsen-Anhalt portal was migrated to the same
eUI/portala SPA framework Berlin uses. The legacy starweb URL returns
503; the new entry point is /portal/browse.tt.html.
- Search workflow is two-stage:
1. POST /portal/browse.tt.json with a JSON action body containing an
Elasticsearch-style query tree under search.json. Returns a
report_id plus hit count.
2. POST /portal/report.tt.html with {report_id, start, chunksize}
returns the HTML hit list. Each record carries a Perl Data::Dumper
block in a <pre> tag with the canonical metadata.
- The query schema (sources, search.lines, search.json tree, report
block) is taken from dokukratie/scrapers/portala.query.json (GPL-3.0)
— only structure/selectors are reused, no Python code is ported.
- DB id is "lsa.lissh"; the server validates this and rejects unknown
interfaces with an explicit errormsg.
- PDFs live under /files/drs/wp{N}/drs/d{nr}{xxx}.pdf and are served
directly without any session cookie.
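The paging half of stage 2 can be sketched as request-body builders (the {report_id, start, chunksize} fields are from the commit; the chunking helper is illustrative):

```python
def report_body(report_id: str, start: int = 0, chunksize: int = 50) -> dict:
    """Body POSTed to /portal/report.tt.html for one chunk of hit-list HTML."""
    return {"report_id": report_id, "start": start, "chunksize": chunksize}

def report_pages(report_id: str, hit_count: int, chunksize: int = 50):
    """Yield one stage-2 body per chunk until hit_count is covered;
    hit_count comes back from the browse.tt.json stage-1 response."""
    for start in range(0, hit_count, chunksize):
        yield report_body(report_id, start, chunksize)
```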
What the adapter does
- search() builds a date-window query (last ~24 months) for the "Antrag"
  document type and returns the most recent hits. The user's free-text
  query is applied as a client-side title/Urheber filter (no server-side
  fulltext search yet; see "Limitations" below).
- Hits are parsed from the Perl record dumps in the report HTML:
- WEV06.main → title (Perl \x{xx} hex escapes decoded)
- WEV32.5 → relative PDF path
- WEV32.main → "Antrag <Urheber> <DD.MM.YYYY> Drucksache <b>X/YYYY</b>"
- Fraktion strings are normalised to canonical codes (CDU, SPD, GRÜNE,
FDP, AfD, LINKE, Landesregierung).
- get_document() looks up a single Drucksache by re-running the search.
- download_text() fetches the PDF and extracts text via PyMuPDF.
- bundeslaender.py: LSA's doku_system corrected from "StarWeb" to
"PARDOK", anmerkung updated with the migration story.
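The \x{xx} decoding mentioned for WEV06.main can be done with a single regex substitution (a sketch; assumes Data::Dumper emits Unicode code points in the braces):

```python
import re

def decode_perl_hex(s: str) -> str:
    """Decode Perl Data::Dumper \\x{hh} escapes, e.g. '\\x{fc}' -> 'ü'."""
    return re.sub(r"\\x\{([0-9a-fA-F]+)\}",
                  lambda m: chr(int(m.group(1), 16)), s)
```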
Limitations (deliberate, MVP)
- No server-side full-text search. The portala framework's sf index
  names for LSA full-text content are not yet known; tree mutations
  with sf=alAB return 0 hits. The client-side filter is "good enough"
  for the next ~24 months of Anträge (≈ a few hundred per WP).
- LSA is still aktiv=False in bundeslaender.py — the adapter is dormant
in production until issue #2's wahlprogramm ingest and frontend
activation land.
Verified live against padoka.landtag.sachsen-anhalt.de:
- search(query="", limit=5) returned 5 current Anträge from March 2026
  (LINKE + GRÜNE) with correct dates, Fraktionen, titles and PDF URLs.
- download_text("8/6790") returned 5051 chars of real Antragstext
("ICE-Halt für Salzwedel dauerhaft erhalten").
Refs #2.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>