Commit Graph

19 Commits

Dotty Dotter
f98e64c734 Add pytest suite + fix two regex bugs uncovered by it (#46)
First tests for the codebase: 77 tests, 0.08s runtime. They cover the
three bug classes from the April 2026 adapter session and have already
uncovered two further bugs in production code.

## Setup

- requirements-dev.txt with pytest + pytest-asyncio
- pytest.ini with asyncio_mode=auto
- tests/conftest.py stubs fitz/bs4/openai/pydantic_settings so the
  suite runs without the full prod requirements set (pure unit
  tests, no PDF parsing, no HTTP)

## Tests

- tests/test_parlamente.py (33 tests)
  * PortalaAdapter._parse_hit_list_cards: doctype/doctype_full
    NameError regression from 1cb030a, plus title/Drucksache/
    Fraktion/date/PDF extraction against a BE card fixture
  * PortalaAdapter._parse_hit_list_dump: against an LSA Perl dump
    fixture, including hex-escape decoding (\x{fc} → ü)
  * PortalaAdapter._parse_hit_list_html: auto-detection between
    card and dump format
  * PortalaAdapter._normalize_fraktion: canonical Fraktion codes,
    including F.D.P. with dots, BÜNDNIS 90, DIE LINKE, BSW
  * ParLDokAdapter._hit_to_drucksache: JSON hit → Drucksache
    mapping, including /navpanes stripping, MdL with party in
    parentheses, Landesregierung detection
  * ParLDokAdapter._fulltext_id: bundle.js mirroring (deferred,
    but documented)
  * ADAPTERS registry sanity check

- tests/test_embeddings.py (11 tests)
  * _chunk_source_label: programme name + page (hallucination
    bug regression from 1b5fd96)
  * format_quotes_for_prompt: every chunk must contain the
    programme name, the strict-citation hint must appear in the
    output, no NRW hallucinations for MV/BE chunk sets

- tests/test_wahlprogramme.py (14 tests)
  * registry structure (jahr int, seiten int, .pdf extension)
  * file existence: every registered PDF must be present in
    static/referenzen/, which would immediately catch typos in
    the 22 indexed programmes
  * embeddings.PROGRAMME consistency cross-check

- tests/test_bundeslaender.py (15 tests)
  * sanity checks over the 16-state registry
  * #48 classification regression: TH=ParlDok, HB=StarWeb,
    SN=Eigensystem
  * election dates plausible (between 2026 and 2035)

- tests/test_analyzer.py (4 tests)
  * markdown code-block stripping from the JSON retry loop
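The markdown code-block stripping covered by test_analyzer.py can be sketched like this (a minimal sketch; the actual helper in analyzer.py may differ in name and details):

```python
import json
import re

def strip_markdown_fences(raw: str) -> str:
    """Remove a surrounding ```json ... ``` (or bare ```) fence from an LLM reply."""
    text = raw.strip()
    # Match an opening fence with an optional language tag and a closing fence.
    match = re.match(r"^```[a-zA-Z]*\s*\n(.*)\n```$", text, re.DOTALL)
    return match.group(1) if match else text

reply = '```json\n{"score": 7, "tags": ["ÖPNV"]}\n```'
print(json.loads(strip_markdown_fences(reply))["score"])  # → 7
```

A retry loop would call this before each json.loads attempt, so a fenced reply no longer forces an extra LLM round-trip.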

## Bugs found while writing the tests

Two production bugs in the _normalize_fraktion helpers were
uncovered immediately by the new tests and fixed in the same commit:

1. PortalaAdapter._normalize_fraktion did not match "F.D.P." (with
   dots, as in historical SH/HB Drucksachen): the regex \bFDP\b is
   too strict. Fix: \bF\.?\s*D\.?\s*P\.?\b, analogous to ParLDokAdapter.

2. ParLDokAdapter._normalize_fraktion (PortalaAdapter too) did not
   match "Ministerium der Finanzen" as Landesregierung, because
   \bMINISTER\b also demands a word boundary after MINISTER; in
   MINISTERIUM, however, IUM follows, so there is no boundary there.
   Fix: \bMINISTER without the trailing \b.

Both bugs would have incorrectly left the Fraktion fields empty for
Drucksachen of the Bremische Bürgerschaft (FDP lists) and for
Landesregierung Drucksachen in MV/LSA: exactly the "fraktionen=[]"
finding from the MV smoke test in #4.
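Both fixes can be demonstrated directly with the patterns quoted above (a minimal reproduction, not the actual _normalize_fraktion bodies):

```python
import re

# Old patterns: too strict.
OLD_FDP = re.compile(r"\bFDP\b")
OLD_REGIERUNG = re.compile(r"\bMINISTER\b")

# Fixed patterns from this commit.
NEW_FDP = re.compile(r"\bF\.?\s*D\.?\s*P\.?\b")
NEW_REGIERUNG = re.compile(r"\bMINISTER")  # no trailing \b: also hits MINISTERIUM

assert not OLD_FDP.search("F.D.P.")            # historical SH/HB spelling missed
assert NEW_FDP.search("F.D.P.") and NEW_FDP.search("FDP")

text = "Ministerium der Finanzen".upper()
assert not OLD_REGIERUNG.search(text)          # \b fails before "IUM"
assert NEW_REGIERUNG.search(text)              # classified as Landesregierung
```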

Phase 0 of roadmap issue #49.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 23:26:06 +02:00
Dotty Dotter
5a30ce8bab bundeslaender.py: correct doku_system classification for TH, HB, SN (#48)
While evaluating the dokukratie/*.yml configs (annotations in
issues #19-#30), three incorrect entries turned up that would have
caused unnecessary reverse engineering during the adapter
implementations:

- TH "StarWeb" → "ParlDok"
  dokukratie/th.yml points at parldok.thueringen.de/ParlDok/formalkriterien/
  with ParlDok-typical form fields (DokumententypId, LegislaturpNum).
  The live system may be a ParlDok 8.x SPA like MV, in which case the
  ParLDokAdapter is directly reusable (issue #25 annotation).

- HB "PARiS" → "StarWeb"
  PARiS is just a StarWeb skin on paris.bremische-buergerschaft.de
  /starweb/paris/servlet.starweb?path=paris/LISSH.web. Reusable with
  the generic StarWebAdapter from issue #27 (template).

- SN "ParlDok" → "Eigensystem"
  EDAS on edas.landtag.sachsen.de is built on ASP.NET WebForms with
  __VIEWSTATE/__CALLBACKID postbacks (see dokukratie/sn.yml). Not
  ParlDok-compatible with MV; a dedicated adapter is required.

The annotation texts were extended with an adapter-reuse hint and a
verification step for the live system versions.

Phase 0 of roadmap issue #49.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 23:19:41 +02:00
Dotty Dotter
b5ae8894d4 ParLDokAdapter: disable fulltext (#12), use a uniform title filter (#18)
The server-side facet_fulltext search from #12 was asymmetric to
BE/LSA (both use only a title filter over the 730-day window from
#13). User decision of 2026-04-08: uniform behaviour matters more
than the best behaviour in 2 of 4 adapters.

Concrete changes:

- _build_search_body() no longer sends the query server-side. The
  query parameter stays in the signature as unused-with-del, so
  that re-enabling it later is a drop-in once the PortalaAdapter
  variant has been reverse-engineered.

- _initial_search() and _paginated_hits() lose the query parameter.

- search() applies the same client-side title+Urheber filter as the
  PortalaAdapter: same code path, uniform behaviour.

- get_document() uses the unchanged pagination.

- The FACET_FULLTEXT constant and the _fulltext_id helper stay in
  the code as documentation for a later re-activation. The tag
  shape is recorded in the docstring.

Consequences:

- MV "Schule" drops from 20 hits (with fulltext) back to 3 (title
  filter over the most recent 1000 Drucksachen). Same level as
  BE/LSA pre-#13.

- Browse-Mode (no query) ist unverändert: ~10 hits in ~25s, MAX_PAGES=10.

- If that later proves insufficient: #16 (UI split DB vs. Landtag)
  and an optional "real fulltext" toggle (#17, closed as deferred)
  remain as follow-up options.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 19:01:00 +02:00
Dotty Dotter
9eda6f9f36 PortalaAdapter: quick-win bigger window + chunksize for BE/LSA (#13)
Real server-side fulltext search through the eUI sf-Index requires
reverse-engineering the LSA/BE-specific search field (the obvious
candidates VOLL, VOLL.main, WEV62 and bare-term-without-sf all
return zero hits when probed). Without browser DevTools to capture
a real fulltext request that's a multi-hour project — split out
to remain in #13 as a follow-up.

This commit ships the pragmatic interim fix from #11:

- BE date_window_days: 180 → 730
  Berlin had a tight default window because PARDOK has ~10x more
  documents than PADOKA. With the bigger window the client-side
  title/Urheber filter reaches back across most of WP19 instead
  of just the last six months.

- chunksize logic in PortalaAdapter.search() inverted from
  "small when query, big when no query" to the opposite. The
  query-filtered path now pulls up to max(limit*10, 500) records
  per page so the title-filter has enough material; the unfiltered
  browse path stays at max(limit*2, 100).

- httpx timeout 30s → 60s. LSA's report.tt.html occasionally
  takes 30+s on cold start; warm requests are <10s.
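The inverted chunksize rule above boils down to a two-branch decision; a minimal sketch (function and parameter names are illustrative, not the actual PortalaAdapter code):

```python
def pick_chunksize(limit: int, has_query: bool) -> int:
    if has_query:
        # Query path: over-fetch so the client-side title filter has material.
        return max(limit * 10, 500)
    # Unfiltered browse path: keep payloads bounded.
    return max(limit * 2, 100)

print(pick_chunksize(10, True), pick_chunksize(10, False))  # → 500 100
```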

Smoke test (local):
  BE  Schule: 15 hits (was 0)
  LSA Schule: 14 hits (was N/A; same path)

Live verification follows after deploy.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 13:58:34 +02:00
Dotty Dotter
6184bf8a88 ParLDokAdapter: server-side fulltext search via facet_fulltext (#12)
Replaces the client-side title/Urheber substring filter with a
real server-side full-text search through ParlDok's facet_fulltext
tag (type=0). The tag schema is reverse-engineered from
pd.addInput in the live bundle.js:

  {"type": 0,
   "id": <getFulltextId(term)>,    # non-alphanum → "-"
   "fulltext": <raw term>,
   "label": <raw term>,
   "field": "Alle"}                # search all indexed fields

The Resultpage queryid inherits the fulltext filter, so
pagination works without re-sending the tag.
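Following the pd.addInput description, building the tag can be sketched as follows (the id rule "non-alphanum → '-'" is taken from this message; the exact bundle.js behaviour for non-ASCII characters is an assumption):

```python
import re

def fulltext_id(term: str) -> str:
    # Mirror of getFulltextId: every non-alphanumeric character becomes "-".
    return re.sub(r"[^0-9A-Za-z]", "-", term)

def fulltext_tag(term: str) -> dict:
    return {
        "type": 0,
        "id": fulltext_id(term),
        "fulltext": term,
        "label": term,
        "field": "Alle",  # search all indexed fields
    }

print(fulltext_tag("Schule und Klima")["id"])  # → Schule-und-Klima
```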

Smoke test (local):
  Schule → 10 hits (was 3)
  Klima  → 10 hits across multiple parties + dates
  Wohnen → 10 hits including older 2025 Anträge

The 10-page (1000-doc) safety bound still applies on top of the
fulltext-filtered result set, but since the server now narrows
to ~2k Schule-related docs WP-wide instead of the 8k+ raw WP
total, the bound is no longer the limiting factor for typical
queries.

Closes #12. BE/LSA equivalent (#13) is independent — eUI
sf-index names still need DevTools tracing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 12:57:34 +02:00
Dotty Dotter
bc7f4a67cb Analyzer prompt: strict citation rule against LLM hallucination
Even after format_quotes_for_prompt was fixed to expose the
fully-qualified programme name + page in each chunk, Qwen
continued to hallucinate familiar source labels (typically "CDU
Wahlprogramm NRW 2022") for parties whose actual MV/BE chunks
were thematically off-topic for the Antrag at hand. The model
preferred its training prior over the prompt context.

Smoke test: MV Drucksache 8/6390 (CDU "Krisenmechanismus
Kraftstoffpreise"). The CDU MV chunks the embedder retrieved
were about Senioren and Aussenwirtschaft — not about energy
prices, so Qwen pulled what it knew about CDU NRW transport
policy and wrote that as the source, even though the prompt
listed only "CDU Mecklenburg-Vorpommern Wahlprogramm 2021"
chunks. The new explicit ZITATEREGEL block in the user prompt
forbids cross-Bundesland citations and instructs the model to
leave zitate empty rather than fabricate when no listed chunk
fits.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 11:31:21 +02:00
Dotty Dotter
1b5fd96e16 Embeddings prompt: include programme name in chunk citations
format_quotes_for_prompt previously rendered each retrieved chunk
as just "S. X: text", giving the LLM no way to know which
Bundesland or Wahlprogramm the passage came from. Result: even
when the embedding search correctly returned MV-only chunks, the
LLM hallucinated familiar source labels from its training set
(typically "FDP NRW Wahlprogramm 2022, S. 75") because that was
its strongest prior for budget/transparency policy citations.

Fix: prepend the fully-qualified PROGRAMME[programm_id]["name"]
to each quote and explicitly instruct the model to use these
labels verbatim. Discovered while smoke-testing MV after
indexing the new MV+BE programmes — embedding retrieval was
clean (sim ~0.6 chunks all from fdp-mv-2021), only the prompt
serialisation was lossy.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 11:24:31 +02:00
Dotty Dotter
8992cffc64 Add MV+BE Wahlprogramme for the current legislative periods (#4, #10)
11 PDFs in app/static/referenzen/ plus entries in WAHLPROGRAMME
and embeddings.PROGRAMME for the two Landtage that were so far
covered only by the federal Grundsatzprogramm fallback:

- **MV** (WP 8, since 26.10.2021): CDU, SPD, GRÜNE, FDP, AfD, LINKE
  Wahlprogramme for the LTW of 26.09.2021. Issue #4.

- **BE** (WP 19, constituted after the repeat election of
  12.02.2023): CDU, SPD, GRÜNE, LINKE, AfD programmes for the AGH
  election of 26.09.2021. The 2023 repeat election used the same
  programmes as the original election, hence the "be-2023.pdf"
  naming with programme year 2021. Issue #10.

Sources: abgeordnetenwatch.de mirror for 9 PDFs, library.fes.de
for SPD MV, cdu-mv.de directly for CDU MV, fdp-mv.de directly for
FDP MV. All PDFs verified via pdftotext against the election date
named in the programme, to prevent more recent campaign drafts
(e.g. the CDU "Berlin-Plan 2026") from being misread as the
legislature's programme.

Indexing into the embeddings DB is NOT part of this commit; it
must be run separately in the prod container:

  docker exec gwoe-antragspruefer python -c "
  from app.embeddings import index_programm
  from pathlib import Path
  d = Path('/app/static/referenzen')
  for pid in ['cdu-mv-2021','spd-mv-2021','gruene-mv-2021',
              'fdp-mv-2021','afd-mv-2021','linke-mv-2021',
              'cdu-be-2023','spd-be-2023','gruene-be-2023',
              'linke-be-2023','afd-be-2023']:
      index_programm(pid, d)
  "

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 08:24:33 +02:00
Dotty Dotter
2b9c0b2908 Activate Mecklenburg-Vorpommern (ParlDok) — search-only MVP (#4)
Adds a new ParLDokAdapter for ParlDok 8.x parliament documentation
systems by J3S GmbH. MV becomes the fourth supported state alongside
NRW, LSA and BE.

Notable details:

- ParlDok 8.x is a single-page app whose backend is a JSON API rooted
  at {base}/parldok/Fulltext/{Search,Resultpage}. The legacy ParLDok
  5.x HTML POST form (parldok/formalkriterien) used by dokukratie's
  mv.yml has been deprecated by the LandtagMV upgrade to 8.3.5 and
  is no longer reachable via the old form fields — hence a new
  adapter rather than reusing the dokukratie scraper.

- Two-stage pagination: Fulltext/Search returns the first 100 hits
  + a queryid; further pages come from Fulltext/Resultpage with
  {queryid, limit:{Start,Length}}. The Search endpoint silently
  ignores any non-zero Start, so single-stage offset pagination is
  not an option.

- Server-side filter via facet_lp (type=10) on the configured WP;
  type=Antrag is filtered client-side because the facet_type value
  IDs are instance-specific and would require an extra
  Fulltext/Filter discovery call. ParlDok also returns the same
  Drucksache multiple times when it appears in several
  Vorgänge/Beratungen, so search() dedupes by lp/number.

- Wahlprogramme zur LTW 26.09.2021 are not yet indexed (follow-up
  in #4) — analyses run with the federal Grundsatzprogramm fallback,
  same as Berlin until #10 lands.
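The two-stage pagination can be sketched with the transport abstracted away (a sketch; only queryid and limit:{Start,Length} are taken from this message, the endpoint callables and hit shape are illustrative):

```python
from typing import Callable, Iterator

def paginate(search: Callable[[], dict],
             resultpage: Callable[[dict], dict],
             page_len: int = 100) -> Iterator[dict]:
    """Stage 1: Fulltext/Search returns the first page plus a queryid.
    Stage 2: Fulltext/Resultpage serves every further page via
    {queryid, limit: {Start, Length}} — Search itself ignores non-zero Start."""
    first = search()
    yield from first["hits"]
    start = page_len
    while True:
        page = resultpage({"queryid": first["queryid"],
                           "limit": {"Start": start, "Length": page_len}})
        if not page["hits"]:
            return
        yield from page["hits"]
        start += page_len

# Stubbed endpoints standing in for Fulltext/Search and Fulltext/Resultpage:
docs = [{"nr": i} for i in range(250)]
search = lambda: {"queryid": "q1", "hits": docs[:100]}
resultpage = lambda body: {"hits": docs[body["limit"]["Start"]:
                                        body["limit"]["Start"] + body["limit"]["Length"]]}
print(len(list(paginate(search, resultpage))))  # → 250
```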

Drive-by cleanup of PortalaAdapter print() statements: switched to
the module-level logger so adapter parser bugs no longer disappear
into stdout.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 08:19:48 +02:00
Dotty Dotter
1cb030aab7 Fix NameError in PortalaAdapter card parser
_parse_hit_list_cards referenced an undefined `doctype` instead of
`doctype_full` on the query-filter path. The surrounding try/except in
search() swallowed the exception, so Berlin queries silently returned
0 hits whenever a search term was given.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 07:50:44 +02:00
Dotty Dotter
9e0f11f7c9 Activate Berlin (PARDOK) — search-only MVP (#3)
PortalaAdapter is now parameterizable and serves both LSA and Berlin
from a single class. Berlin is activated as the third live bundesland
(after NRW + LSA), with the deliberate caveat that the LTW 2023
Wahlprogramme are not yet indexed.

PortalaAdapter refactor
- Class attributes (bundesland, name, base_url, db_id, wahlperiode)
  moved into the constructor. New optional parameters:
    - portala_path: "/portal" for LSA, "/portala" for Berlin
    - document_type: "Antrag" for LSA, None for Berlin (BE's ETYPF
      index uses different value strings; the document_type subtree
      is dropped from the action.search.json tree)
    - pdf_url_prefix: "/files/" by default; absolute URLs in the hit
      list are passed through unchanged (Berlin embeds full
      starweb/adis/citat/... links)
    - date_window_days: 730 for LSA, 180 for BE (BE has ~10x more
      documents per WP, narrower window keeps payloads bounded)
- _build_search_body builds the JSON tree dynamically: when
  document_type is None, the entire ETYPF/DTYPF/DART subtree is
  omitted, mirrored in the parsed/sref display strings as well.
- _parse_hit_list_html now auto-detects between two formats:
    1. LSA-style: <pre>$VAR1 = …</pre> Perl Data::Dumper records
       (existing parser, untouched).
    2. Berlin-style: production HTML cards with efxRecordRepeater
       divs, h3 titles, h6 metadata lines containing the document
       type, drucksachen-id and date, plus a direct <a href="…pdf">
       to the PDF on the same host.
- Berlin extracts originator parties from the h6 line ("Antrag CDU,
  SPD" → ["CDU","SPD"], typ "Antrag") via the new word-boundary
  _normalize_fraktion regex.
- _normalize_fraktion rewritten with regex word boundaries, fixing a
  long-standing bug where comma-separated fraction lists like
  "CDU, SPD" failed to match CDU. Also picks up BSW for the
  Brombeer/SPD-BSW landtage and "Senat von Berlin" as Landesregierung.

bundeslaender.py
- BE flipped to aktiv=True. anmerkung documents the Wahlprogramm
  gap and the auto-detected hit-list format.

Live verified against pardok.parlament-berlin.de:
- WP 19 with 180-day date window returns 2962 hits, page 1 contains
  5 records all with title, drucksache, date, PDF URL.
- 19/3107 ("Kleingewässerprogramm") correctly extracted as Antrag of
  CDU+SPD; 19/3104-3106 as Vorlagen zur Beschlussfassung; 19/3108 as
  Vorlage zur Kenntnisnahme.
- LSA still returns the same 5 current Anträge from March 2026; no
  regression from the refactor.

Known limitation (will be tracked as a follow-up issue)
- Berlin Wahlprogramme zur LTW 2023 are not yet indexed in the
  embeddings DB. The 2023 PDFs are no longer linked from the live
  party websites (which currently feature 2026 draft programmes), and
  Wayback has no snapshots. The analyzer therefore falls back to
  bundesländer-übergreifende Grundsatzprogramme for BE Anträge until
  the 2023 PDFs are sourced manually.

Refs #3.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-07 23:33:16 +02:00
Dotty Dotter
80e16df288 Append original Antrag-PDF to GWÖ-Report (#9)
Extends generate_pdf_report() with a best-effort second stage that
appends the original Antrag PDF to the freshly rendered GWÖ-Report so
the analysis and its source document live in the same file.

Pipeline
1. WeasyPrint renders the report PDF as before.
2. _append_original_antrag() then:
   - Skips silently if assessment.link is empty or non-HTTP (manual
     uploads / pasted text leave nothing to fetch).
   - Downloads the original PDF via httpx (30s timeout, follow redirects,
     custom user agent).
   - Validates the response is actually a PDF (Content-Length not relied
     on; the magic bytes %PDF- are checked).
   - Adds a single A4 separator page that says "Original-Antrag",
     repeats the Drucksachen-ID and title, and either confirms the
     append or shows the failure reason (HTTP code, network error,
     parse error) plus the source URL.
   - Appends the downloaded PDF via PyMuPDF doc.insert_pdf().
   - Saves to a sibling .tmp file and atomically replaces the original
     (PyMuPDF refuses non-incremental save into the same file).
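Two of the safeguards above, the magic-byte check and the atomic replace, can be sketched without PyMuPDF (helper names are illustrative):

```python
import os

def looks_like_pdf(payload: bytes) -> bool:
    # Content-Length is not trusted; only the %PDF- magic bytes decide.
    return payload.startswith(b"%PDF-")

def atomic_replace(path: str, payload: bytes) -> None:
    # PyMuPDF refuses a non-incremental save into the file it has open,
    # so write a sibling .tmp and atomically replace the original.
    tmp = path + ".tmp"
    with open(tmp, "wb") as fh:
        fh.write(payload)
    os.replace(tmp, path)  # atomic rename on POSIX

assert looks_like_pdf(b"%PDF-1.7\n...")
assert not looks_like_pdf(b"<html>404 Not Found</html>")
```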

Edge cases handled
- No link / pasted-text upload → no append, no divider, original report
  unchanged.
- Download error / 404 / non-PDF response → divider page with explicit
  error message and source URL, report still ships.
- PDF parse error → divider page without appended content, error logged.
- Hard failure during save → fall back to the original WeasyPrint PDF.

Verified live in production container against drucksache 8/6645
(Untrending Frauenhass, BÜNDNIS 90/DIE GRÜNEN LSA):
- Report 4 pages + 1 divider + 3 pages original = 8 pages total
- Divider correctly placed at index 4
- Page 5 starts with "(Ausgegeben am 24.02.2026) … Drucksache 8/6645 …
  Antrag — Fraktion BÜNDNIS 90/DIE GRÜNEN — Untrending Frauenhass …"
- Negative test with a synthetic 404 link: 5 pages total, divider at
  index 4 with "Original-PDF konnte nicht angehängt werden. Grund: HTTP
  404".

Resolves #9.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-07 23:15:05 +02:00
Dotty Dotter
f1867d463c Bundesland filter & transparency: stringent split + visible source (#8)
Promotes the Bundesland dropdown from a cosmetic header widget to a real
filter that propagates through every layer (listing, internal search,
statistics, party/tag filters, upload mode), and at the same time makes
the source parliament visible in every place where assessments from
multiple bundesländer can be mixed.

Backend
- database.get_all_assessments(bundesland=None) — new optional filter,
  "ALL" treated as None.
- database.search_assessments — bug fix: previous `if bundesland:`
  branch incorrectly added a `WHERE bundesland='ALL'` clause; now
  guarded with `bundesland and bundesland != "ALL"`.
- main.list_assessments — accepts ?bundesland= query param, includes the
  bundesland field in the response so the frontend can render badges.
- main.get_single_assessment — also includes bundesland in the response
  so the detail header can show the source parlament.
- main.search_landtag — early HTTP 400 when bundesland is missing or
  "ALL"; the live Landtag adapter cannot serve a synthetic Bundesweit
  request.
- main.index() and main.list_bundeslaender — synthetic "🌍 Bundesweit"
  entry prepended to the bundesländer list (kept out of bundeslaender.py
  on purpose — ALL is not a real state). Both endpoints additionally
  expose a parlament_names map so the frontend can render the source
  parliament without an extra round-trip.
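The corrected guard in search_assessments can be sketched as follows (illustrative; the actual SQL in database.py is not shown in this log):

```python
from typing import Optional

def build_query(bundesland: Optional[str]) -> tuple:
    sql = "SELECT * FROM assessments"
    params: list = []
    # Old bug: a bare `if bundesland:` also fired for the synthetic "ALL"
    # value and produced WHERE bundesland='ALL'. "ALL" must mean "no filter".
    if bundesland and bundesland != "ALL":
        sql += " WHERE bundesland = ?"
        params.append(bundesland)
    return sql, params

print(build_query("ALL"))  # → ('SELECT * FROM assessments', [])
```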

Report (PDF + HTML)
- generate_html_report / generate_pdf_report — new optional bundesland
  parameter. When set, the report header carries the parliament name
  ("Landtag von Sachsen-Anhalt", "Landtag Nordrhein-Westfalen", …)
  beside the title. Three call sites updated: run_analysis,
  run_drucksache_analysis, download_assessment_pdf.

Frontend (templates/index.html)
- Header dropdown gets the synthetic ALL entry as first option;
  initial currentBundesland is now 'ALL' (was 'NRW').
- localStorage persistence: changeBundesland writes, DOMContentLoaded
  reads and validates against the visible options.
- changeBundesland resets the score / party / tag filter state, syncs
  the upload-mode bundesland select, disables the Landtag-Suche button
  + tooltip when ALL, and toggles a data-mode attribute on
  .list-content (used by CSS to show/hide the per-item bundesland
  badge).
- loadAssessments now sends ?bundesland=… so the API does the actual
  filtering. updateStats renders an additional per-bundesland average
  block (Ø NRW: x · Ø LSA: y) when in ALL mode and the loaded list
  spans more than one bundesland.
- renderList prepends a small "bl-badge" beside the Drucksachen-Nummer.
  Hidden in single-bundesland mode via CSS selector to avoid clutter.
- showDetail header now shows the parliament name as its own line
  (.detail-parlament).
- searchLandtag has an early-out alert if currentBundesland === 'ALL',
  saving a network round-trip.
- Upload-Mode bundesland select now starts with a "— Bundesland wählen
  —" placeholder (no auto-default), and startAnalysis validates that a
  concrete bundesland was chosen.

CSS
- .bl-badge plus the .list-content[data-mode="single"] hide rule.
- .detail-parlament for the detail header line.
- .header-parlament for the PDF report header line.

Resolves #8.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-07 23:00:39 +02:00
Dotty Dotter
87874a7a14 Activate LSA: Wahlprogramme + ingest + frontend (#2)
Brings Sachsen-Anhalt online as the second supported Bundesland after
NRW. Closes the gap that issue #2 left open: with the PortalaAdapter
already in place from c7242f8, this commit adds the reference data and
flips the activation switch.

Wahlprogramme (LTW Sachsen-Anhalt 06.06.2021)
- Six PDFs added under app/static/referenzen/{cdu,spd,gruene,fdp,afd,
  linke}-lsa-2021.pdf, plus paged plain-text extractions under
  app/kontext/*.txt for the keyword fallback search.
- Sources verified by hand:
  - CDU "Unsere Heimat. Unsere Verantwortung." (cdulsa.de, 82 pages)
  - SPD "Zusammenhalt und neue Chancen" (FES library, 77 pages)
  - GRÜNE "Verlässlich für Sachsen-Anhalt" (gruene-lsa.de, 164 pages)
  - FDP "Wahlprogramm zur Landtagswahl 2021" (Naumann-Stiftung, 76 pages)
  - AfD "Alles für unsere Heimat!" (klimawahlen.de mirror, 64 pages)
  - LINKE "Wahlprogramm zur Landtagswahl 2021" (dielinke-sachsen-anhalt.de,
    88 pages)
- The CDU PDF was the trickiest: KAS blocks bot downloads via
  Cloudflare; the cdulsa.de copy was located by an autonomous web
  search and verified to be byte-identical with the official document.

Embeddings indexed (in production container, OpenAI-compatible
DashScope embeddings via the existing index_programm pipeline):
- CDU 134, SPD 145, GRÜNE 183, FDP 100, AfD 64, LINKE 143 chunks
- Total LSA: 769 new chunks alongside the existing 775 NRW chunks
  and 335 federal Grundsatzprogramm chunks.

wahlprogramme.py
- WAHLPROGRAMME["LSA"] populated with all six parties (canonical fraction
  codes, original titles, page counts).

embeddings.py
- PROGRAMME extended with the six new "<partei>-lsa-2021" entries that
  the indexer pipeline expects.

bundeslaender.py
- LSA flipped to aktiv=True. The frontend dropdown will now offer
  Sachsen-Anhalt as a selectable bundesland, and
  analyzer.get_bundesland_context() will produce a real LSA prompt
  block (CDU/SPD/FDP as governing fractions, all six
  landtagsfraktionen).

End-to-end smoke test (live in production container before commit)
- Adapter: PortalaAdapter.search() returned current Anträge from
  March 2026 (LINKE + GRÜNE) with correct titles and PDF URLs.
- Semantic search for an LSA "ÖPNV in der Altmark" sample antrag
  matched LINKE S.53, SPD S.68, FDP S.52 — all three with similarity
  > 0.6 and topical hits (Regionalisierungsmittel, ÖPNV-Förderprogramm,
  Wasserstoffnetz).

Resolves issue #2.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-07 22:12:32 +02:00
Dotty Dotter
c7242f8413 Add PortalaAdapter for PADOKA / Sachsen-Anhalt (#2)
Adds a clean-room PortalaAdapter that talks to the eUI/portala framework
behind PADOKA (Landtag Sachsen-Anhalt). Same engine powers Berlin's
PARDOK; the same adapter will serve issue #3 once activated for BE.

Reverse-engineering notes
- The "PADOKA = StarWeb" assumption from issue #1 / dokukratie's st.yml
  is outdated. The Sachsen-Anhalt portal was migrated to the same
  eUI/portala SPA framework Berlin uses. The legacy starweb URL returns
  503; the new entry point is /portal/browse.tt.html.
- Search workflow is two-stage:
  1. POST /portal/browse.tt.json with a JSON action body containing an
     Elasticsearch-style query tree under search.json. Returns a
     report_id plus hit count.
  2. POST /portal/report.tt.html with {report_id, start, chunksize}
     returns the HTML hit list. Each record carries a Perl Data::Dumper
     block in a <pre> tag with the canonical metadata.
- The query schema (sources, search.lines, search.json tree, report
  block) is taken from dokukratie/scrapers/portala.query.json (GPL-3.0)
  — only structure/selectors are reused, no Python code is ported.
- DB id is "lsa.lissh"; the server validates this and rejects unknown
  interfaces with an explicit errormsg.
- PDFs live under /files/drs/wp{N}/drs/d{nr}{xxx}.pdf and are served
  directly without any session cookie.
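The Perl Data::Dumper records mentioned above escape non-ASCII as \x{..}; decoding them back can be sketched with a small substitution (assumes each escape is a single Unicode codepoint):

```python
import re

def decode_perl_hex(text: str) -> str:
    # Data::Dumper writes non-ASCII as \x{fc}; map each escape to its codepoint.
    return re.sub(r"\\x\{([0-9a-fA-F]+)\}",
                  lambda m: chr(int(m.group(1), 16)),
                  text)

print(decode_perl_hex(r"ICE-Halt f\x{fc}r Salzwedel"))  # → ICE-Halt für Salzwedel
```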

What the adapter does
- search() builds a date-window query (last ~24 months) for "Antrag"
  document type and returns the most recent hits. The user's free-text
  query is applied as a client-side title/Urheber filter (no fulltext
  search server-side yet — see "Limitations" below).
- Hits are parsed from the Perl record dumps in the report HTML:
  - WEV06.main → title (Perl \x{xx} hex escapes decoded)
  - WEV32.5   → relative PDF path
  - WEV32.main → "Antrag <Urheber> <DD.MM.YYYY> Drucksache <b>X/YYYY</b>"
- Fraktion strings are normalised to canonical codes (CDU, SPD, GRÜNE,
  FDP, AfD, LINKE, Landesregierung).
- get_document() looks up a single Drucksache by re-running the search.
- download_text() fetches the PDF and extracts text via PyMuPDF.
- bundeslaender.py: LSA's doku_system corrected from "StarWeb" to
  "PARDOK", anmerkung updated with the migration story.

Limitations (deliberate, MVP)
- No server-side full-text search. The portala framework's sf index
  names for LSA full-text content are not yet known; tree mutations
  with sf=alAB return 0 hits. Client-side filter is "good enough" for
  the next ~24 months of Anträge (≈few hundred per WP).
- LSA is still aktiv=False in bundeslaender.py — the adapter is dormant
  in production until issue #2's wahlprogramm ingest and frontend
  activation land.

Verified live against padoka.landtag.sachsen-anhalt.de:
- search(query="", limit=5) returned 5 current Anträge from March
  2026 (LINKE + GRÜNE) with correct dates, fractions, titles and PDF URLs.
- download_text("8/6790") returned 5051 chars of real Antragstext
  ("ICE-Halt für Salzwedel dauerhaft erhalten").

Refs #2.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-07 21:50:23 +02:00
Dotty Dotter
ee0218b5af Refactor wahlprogramme/embeddings/analyzer for multi-state (#5)
Atomic refactor of the three modules that previously hardcoded NRW
behaviour. After this commit, every analysis path consults the central
BUNDESLAENDER registry for governing fractions, parliament name, and
state metadata.

wahlprogramme.py
- WAHLPROGRAMME is now nested {bundesland: {partei: meta}}; NRW data
  hoisted unchanged under the "NRW" key.
- New WAHLPROGRAMM_KONTEXT_FILES dict maps a state to its overview
  markdown file (currently only NRW).
- find_relevant_quotes(text, fraktionen, bundesland) — bundesland is
  now a required positional. Governing fractions for the requested
  state are merged with the submitting fractions before lookup.
- Helpers get_wahlprogramm() and parteien_mit_wahlprogramm() expose
  the new shape to other modules.
- ValueError on unknown bundesland (no silent fallback).

embeddings.py
- Schema migration in init_embeddings_db: adds a `bundesland` column
  to the chunks table when missing, plus an index, and backfills
  existing rows from the PROGRAMME registry. Grundsatzprogramme
  (federal level) keep bundesland NULL by design.
- find_relevant_chunks accepts a bundesland filter that matches state
  rows OR NULL — so federal Grundsatzprogramme remain visible to every
  analysis.
- get_relevant_quotes_for_antrag(text, fraktionen, bundesland, …) —
  bundesland required, governing fractions read from BUNDESLAENDER
  instead of hardcoded ["CDU","GRÜNE"]. Order-preserving dedup
  replaces the previous set-based merge.
- index_programm now writes the bundesland column on insert.
- Dropped the hardcoded "Wahlprogramm NRW 2022" label in
  format_quotes_for_prompt — bundesland context is implicit in the
  surrounding prompt block.
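The state-or-NULL visibility rule can be sketched against an in-memory SQLite DB (illustrative schema; the real chunks table in embeddings.py has more columns):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE chunks (programm_id TEXT, bundesland TEXT, text TEXT)")
conn.executemany("INSERT INTO chunks VALUES (?, ?, ?)", [
    ("fdp-lsa-2021", "LSA", "ÖPNV-Förderprogramm ..."),
    ("cdu-mv-2021", "MV", "Außenwirtschaft ..."),
    ("spd-grundsatz", None, "Grundsatzprogramm ..."),  # federal: bundesland NULL
])

# State rows OR NULL: federal Grundsatzprogramme stay visible to every analysis.
rows = conn.execute(
    "SELECT programm_id FROM chunks"
    " WHERE bundesland = ? OR bundesland IS NULL ORDER BY rowid",
    ("LSA",),
).fetchall()
print([r[0] for r in rows])  # → ['fdp-lsa-2021', 'spd-grundsatz']
```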

analyzer.py
- get_bundesland_context reads parlament_name, regierungsfraktionen,
  landtagsfraktionen and the optional WAHLPROGRAMM_KONTEXT_FILES entry
  from the central registry. Throws ValueError on unknown OR inactive
  bundesland — kills the silent NRW fallback that previously masked
  configuration gaps.
- The Antragsteller-detection heuristic now iterates
  BUNDESLAENDER[bundesland].landtagsfraktionen instead of
  WAHLPROGRAMME.keys(), so we recognise parties for which we don't
  yet have a Wahlprogramm PDF.
- Both quote lookups (semantic + keyword fallback) now receive the
  bundesland.

Resolves issue #5. Foundation for #2 (LSA), #3 (Berlin), #4 (MV).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-07 18:48:11 +02:00
Dotty Dotter
ac18743ff2 Add central bundeslaender.py module with all 16 states (#7)
Introduces app/bundeslaender.py as the single source of truth for all
bundesland-specific data (parliament name, current legislative period,
upcoming elections, governing coalition, doku system, base URLs,
drucksache format, dokukratie scraper code, active flag, optional
remarks). Data reflects April 2026 state.

main.py::index() and /api/bundeslaender now derive their lists from
this module instead of hardcoding. Frontend dropdown now shows all 16
bundesländer (15 disabled with "(bald)" suffix); previously the
landing template showed only 4. NRW remains the only "aktiv" entry.

API behaviour change worth noting: the /api/bundeslaender endpoint
previously emitted code "ST" for Sachsen-Anhalt; it now emits "LSA"
to match the abbreviation common in political usage. No functional impact
because non-NRW bundesländer were inactive in both versions.

Foundation for #5 and #2; deliberately a no-op for NRW so it can ship
and rollback independently.

Resolves issue #7.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-07 14:17:54 +02:00
Dotty Dotter
bcd532be89 Fix responsive layout for mobile viewports (#6)
Body becomes a flex column so the header takes its natural height and the
main container fills the rest via flex:1 — replaces the brittle
calc(100vh - 70px) that assumed a 70px header and broke as soon as the
header wrapped on mobile. Adds 100dvh fallback for iOS Safari address
bar quirks.

Mobile breakpoint (≤900px) reworked: list scrolls internally via
list-content max-height:50vh, detail-panel uses overflow:visible so the
whole document scrolls naturally instead of nesting scrollers. Tapping
an item auto-scrolls to the detail panel and a new "← Zur Liste" button
(mobile-only) jumps back. Adds a tighter ≤600px breakpoint that hides
the subtitle, collapses the matrix grid to one column and shrinks the
matrix table for phone screens.

Resolves issue #6.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-07 13:48:55 +02:00
Dotty Dotter
63de3ca20d Initial commit: GWÖ-Antragsprüfer v1.0
Features:
- GWÖ Matrix 2.0 analysis for NRW Landtag Anträge
- Improvement suggestions in redline format (original / suggestion /
  rationale)
- Wahlprogramm and party-programme fidelity scoring
- Landtag search via the OPAL API
- Tag cloud with multi-select filter
- Party filter with per-party averages
- PDF report generation
- Security headers (CSP, X-Frame-Options, etc.)
- Persistent SQLite DB via Docker volumes

Tech stack:
- FastAPI + Jinja2
- Qwen LLM via DashScope API
- SQLite + aiosqlite
- WeasyPrint for PDF
- Docker Compose with Traefik
2026-03-28 22:30:24 +01:00