Commit Graph

6 Commits

Author SHA1 Message Date
Dotty Dotter
5a30ce8bab bundeslaender.py: doku_system-Klassifikation für TH, HB, SN korrigiert (#48)
Beim Auswerten der dokukratie/*.yml-Configs (Annotationen in
Issues #19-#30) sind drei Falsch-Einträge aufgefallen, die zu
unnötigem Reverse-Engineering bei den Adapter-Implementierungen
geführt hätten:

- TH "StarWeb" → "ParlDok"
  dokukratie/th.yml zeigt parldok.thueringen.de/ParlDok/formalkriterien/
  mit ParlDok-typischen Form-Feldern (DokumententypId, LegislaturpNum).
  Live-System ggf. ParlDok 8.x SPA wie MV — dann ParLDokAdapter direkt
  wiederverwendbar (Issue #25 Annotation).

- HB "PARiS" → "StarWeb"
  PARiS ist nur eine StarWeb-Skin auf paris.bremische-buergerschaft.de
  /starweb/paris/servlet.starweb?path=paris/LISSH.web. Wiederverwendbar
  mit dem generischen StarWebAdapter aus Issue #27 (Template).

- SN "ParlDok" → "Eigensystem"
  EDAS auf edas.landtag.sachsen.de basiert auf ASP.NET-Webforms mit
  __VIEWSTATE/__CALLBACKID-Postbacks (siehe dokukratie/sn.yml). Nicht
  ParlDok-kompatibel mit MV. Eigener Adapter notwendig.

Anmerkungs-Texte erweitert mit Adapter-Wiederverwendungs-Hinweis und
Verifikations-Schritt für Live-System-Versionen.

Phase 0 aus Roadmap-Issue #49.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 23:19:41 +02:00
Dotty Dotter
2b9c0b2908 Activate Mecklenburg-Vorpommern (ParlDok) — search-only MVP (#4)
Adds a new ParLDokAdapter for ParlDok 8.x parliament documentation
systems by J3S GmbH. MV becomes the fourth supported state alongside
NRW, LSA and BE.

Notable details:

- ParlDok 8.x is a single-page app whose backend is a JSON API rooted
  at {base}/parldok/Fulltext/{Search,Resultpage}. The legacy ParLDok
  5.x HTML POST form (parldok/formalkriterien) used by dokukratie's
  mv.yml has been deprecated by the LandtagMV upgrade to 8.3.5 and
  is no longer reachable via the old form fields — hence a new
  adapter rather than reusing the dokukratie scraper.

- Two-stage pagination: Fulltext/Search returns the first 100 hits
  + a queryid; further pages come from Fulltext/Resultpage with
  {queryid, limit:{Start,Length}}. The Search endpoint silently
  ignores any non-zero Start, so single-stage offset pagination is
  not an option.

- Server-side filter via facet_lp (type=10) on the configured WP;
  type=Antrag is filtered client-side because the facet_type value
  IDs are instance-specific and would require an extra
  Fulltext/Filter discovery call. ParlDok also returns the same
  Drucksache multiple times when it appears in several
  Vorgänge/Beratungen, so search() dedupes by lp/number.

- Wahlprogramme zur LTW 26.09.2021 are not yet indexed (follow-up
  in #4) — analyses run with the federal Grundsatzprogramm fallback,
  same as Berlin until #10 lands.

Drive-by cleanup of PortalaAdapter print() statements: switched to
the module-level logger so adapter parser bugs no longer disappear
into stdout.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 08:19:48 +02:00
Dotty Dotter
9e0f11f7c9 Activate Berlin (PARDOK) — search-only MVP (#3)
PortalaAdapter is now parameterizable and serves both LSA and Berlin
from a single class. Berlin is activated as the third live bundesland
(after NRW + LSA), with the deliberate caveat that the LTW 2023
Wahlprogramme are not yet indexed.

PortalaAdapter refactor
- Class attributes (bundesland, name, base_url, db_id, wahlperiode)
  moved into the constructor. New optional parameters:
    - portala_path: "/portal" for LSA, "/portala" for Berlin
    - document_type: "Antrag" for LSA, None for Berlin (BE's ETYPF
      index uses different value strings; the document_type subtree
      is dropped from the action.search.json tree)
    - pdf_url_prefix: "/files/" by default; absolute URLs in the hit
      list are passed through unchanged (Berlin embeds full
      starweb/adis/citat/... links)
    - date_window_days: 730 for LSA, 180 for BE (BE has ~10x more
      documents per WP, narrower window keeps payloads bounded)
- _build_search_body builds the JSON tree dynamically: when
  document_type is None, the entire ETYPF/DTYPF/DART subtree is
  omitted, mirrored in the parsed/sref display strings as well.
- _parse_hit_list_html now auto-detects between two formats:
    1. LSA-style: <pre>$VAR1 = …</pre> Perl Data::Dumper records
       (existing parser, untouched).
    2. Berlin-style: production HTML cards with efxRecordRepeater
       divs, h3 titles, h6 metadata lines containing the document
       type, drucksachen-id and date, plus a direct <a href="…pdf">
       to the PDF on the same host.
- Berlin extracts originator parties from the h6 line ("Antrag CDU,
  SPD" → ["CDU","SPD"], typ "Antrag") via the new word-boundary
  _normalize_fraktion regex.
- _normalize_fraktion rewritten with regex word boundaries, fixing a
  long-standing bug where comma-separated fraction lists like
  "CDU, SPD" failed to match CDU. Also picks up BSW for the
  Brombeer/SPD-BSW landtage and "Senat von Berlin" as Landesregierung.

bundeslaender.py
- BE flipped to aktiv=True. anmerkung documents the Wahlprogramm-
  Lücke and the auto-detected hit-list format.

Live verified against pardok.parlament-berlin.de:
- WP 19 with 180-day date window returns 2962 hits, page 1 contains
  5 records all with title, drucksache, date, PDF URL.
- 19/3107 ("Kleingewässerprogramm") correctly extracted as Antrag of
  CDU+SPD; 19/3104-3106 as Vorlagen zur Beschlussfassung; 19/3108 as
  Vorlage zur Kenntnisnahme.
- LSA still returns the same 5 current Anträge of März 2026 — no
  regression from the refactor.

Known limitation (will be tracked as a follow-up issue)
- Berlin Wahlprogramme zur LTW 2023 are not yet indexed in the
  embeddings DB. The 2023 PDFs are no longer linked from the live
  party websites (which currently feature 2026 draft programmes), and
  Wayback has no snapshots. The analyzer therefore falls back to
  bundesländer-übergreifende Grundsatzprogramme for BE Anträge until
  the 2023 PDFs are sourced manually.

Refs #3.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-07 23:33:16 +02:00
Dotty Dotter
87874a7a14 Activate LSA: Wahlprogramme + ingest + frontend (#2)
Brings Sachsen-Anhalt online as the second supported Bundesland after
NRW. Closes the gap that issue #2 left open: with the PortalaAdapter
already in place from c7242f8, this commit adds the reference data and
flips the activation switch.

Wahlprogramme (LTW Sachsen-Anhalt 06.06.2021)
- Six PDFs added under app/static/referenzen/{cdu,spd,gruene,fdp,afd,
  linke}-lsa-2021.pdf, plus paged plain-text extractions under
  app/kontext/*.txt for the keyword fallback search.
- Sources verified by hand:
  - CDU "Unsere Heimat. Unsere Verantwortung." (cdulsa.de, 82 pages)
  - SPD "Zusammenhalt und neue Chancen" (FES library, 77 pages)
  - GRÜNE "Verlässlich für Sachsen-Anhalt" (gruene-lsa.de, 164 pages)
  - FDP "Wahlprogramm zur Landtagswahl 2021" (Naumann-Stiftung, 76 pages)
  - AfD "Alles für unsere Heimat!" (klimawahlen.de mirror, 64 pages)
  - LINKE "Wahlprogramm zur Landtagswahl 2021" (dielinke-sachsen-anhalt.de,
    88 pages)
- The CDU PDF was the trickiest: KAS blocks bot downloads via
  Cloudflare; the cdulsa.de copy was located by an autonomous web
  search and verified to be byte-identical with the official document.

Embeddings indexed (in production container, OpenAI-compatible
DashScope embeddings via the existing index_programm pipeline):
- CDU 134, SPD 145, GRÜNE 183, FDP 100, AfD 64, LINKE 143 chunks
- Total LSA: 769 new chunks alongside the existing 775 NRW chunks
  and 335 federal Grundsatzprogramm chunks.

wahlprogramme.py
- WAHLPROGRAMME["LSA"] populated with all six parties (canonical fraction
  codes, original titles, page counts).

embeddings.py
- PROGRAMME extended with the six new "<partei>-lsa-2021" entries that
  the indexer pipeline expects.

bundeslaender.py
- LSA flipped to aktiv=True. The frontend dropdown will now offer
  Sachsen-Anhalt as a selectable bundesland and analyzer.get_bundesland_
  context() will produce a real LSA prompt block (CDU/SPD/FDP as
  governing fractions, all six landtagsfraktionen).

End-to-end smoke test (live in production container before commit)
- Adapter: PortalaAdapter.search() returned current Anträge of März 2026
  (LINKE + GRÜNE) with correct titles and PDF URLs.
- Semantic search for an LSA "ÖPNV in der Altmark" sample antrag
  matched LINKE S.53, SPD S.68, FDP S.52 — all three with similarity
  > 0.6 and topical hits (Regionalisierungsmittel, ÖPNV-Förderprogramm,
  Wasserstoffnetz).

Resolves issue #2.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-07 22:12:32 +02:00
Dotty Dotter
c7242f8413 Add PortalaAdapter for PADOKA / Sachsen-Anhalt (#2)
Adds a clean-room PortalaAdapter that talks to the eUI/portala framework
behind PADOKA (Landtag Sachsen-Anhalt). Same engine powers Berlin's
PARDOK; the same adapter will serve issue #3 once activated for BE.

Reverse-engineering notes
- The "PADOKA = StarWeb" assumption from issue #1 / dokukratie's st.yml
  is outdated. The Sachsen-Anhalt portal was migrated to the same
  eUI/portala SPA framework Berlin uses. The legacy starweb URL returns
  503; the new entry point is /portal/browse.tt.html.
- Search workflow is two-stage:
  1. POST /portal/browse.tt.json with a JSON action body containing an
     Elasticsearch-style query tree under search.json. Returns a
     report_id plus hit count.
  2. POST /portal/report.tt.html with {report_id, start, chunksize}
     returns the HTML hit list. Each record carries a Perl Data::Dumper
     block in a <pre> tag with the canonical metadata.
- The query schema (sources, search.lines, search.json tree, report
  block) is taken from dokukratie/scrapers/portala.query.json (GPL-3.0)
  — only structure/selectors are reused, no Python code is ported.
- DB id is "lsa.lissh"; the server validates this and rejects unknown
  interfaces with an explicit errormsg.
- PDFs live under /files/drs/wp{N}/drs/d{nr}{xxx}.pdf and are served
  directly without any session cookie.

What the adapter does
- search() builds a date-window query (last ~24 months) for "Antrag"
  document type and returns the most recent hits. The user's free-text
  query is applied as a client-side title/Urheber filter (no fulltext
  search server-side yet — see "Limitations" below).
- Hits are parsed from the Perl record dumps in the report HTML:
  - WEV06.main → title (Perl \x{xx} hex escapes decoded)
  - WEV32.5   → relative PDF path
  - WEV32.main → "Antrag <Urheber> <DD.MM.YYYY> Drucksache <b>X/YYYY</b>"
- Fraktion strings are normalised to canonical codes (CDU, SPD, GRÜNE,
  FDP, AfD, LINKE, Landesregierung).
- get_document() looks up a single Drucksache by re-running the search.
- download_text() fetches the PDF and extracts text via PyMuPDF.
- bundeslaender.py: LSA's doku_system corrected from "StarWeb" to
  "PARDOK", anmerkung updated with the migration story.

Limitations (deliberate, MVP)
- No server-side full-text search. The portala framework's sf index
  names for LSA full-text content are not yet known; tree mutations
  with sf=alAB return 0 hits. Client-side filter is "good enough" for
  the next ~24 months of Anträge (≈few hundred per WP).
- LSA is still aktiv=False in bundeslaender.py — the adapter is dormant
  in production until issue #2's wahlprogramm ingest and frontend
  activation land.

Verified live against padoka.landtag.sachsen-anhalt.de:
- search(query="", limit=5) returned 5 current Anträge from März 2026
  (LINKE + GRÜNE) with correct dates, fractions, titles and PDF URLs.
- download_text("8/6790") returned 5051 chars of real Antragstext
  ("ICE-Halt für Salzwedel dauerhaft erhalten").

Refs #2.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-07 21:50:23 +02:00
Dotty Dotter
ac18743ff2 Add central bundeslaender.py module with all 16 states (#7)
Introduces app/bundeslaender.py as the single source of truth for all
bundesland-specific data (parliament name, current legislative period,
upcoming elections, governing coalition, doku system, base URLs,
drucksache format, dokukratie scraper code, active flag, optional
remarks). Data reflects April 2026 state.

main.py::index() and /api/bundeslaender now derive their lists from
this module instead of hardcoding. Frontend dropdown now shows all 16
bundesländer (15 disabled with "(bald)" suffix); previously the
landing template showed only 4. NRW remains the only "aktiv" entry.

API behaviour change worth noting: the /api/bundeslaender endpoint
previously emitted code "ST" for Sachsen-Anhalt; it now emits "LSA"
to match the politically dominant abbreviation. No functional impact
because non-NRW bundesländer were inactive in both versions.

Foundation for #5 and #2; deliberately a no-op for NRW so it can ship
and rollback independently.

Resolves issue #7.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-07 14:17:54 +02:00