gwoe-antragspruefer

Author	SHA1	Message	Date
Dotty Dotter	9c70b463ac	Phase A: Audit-Restbefunde #57.3/4/7 (Roadmap #59 ) Drei verbleibende Audit-Befunde aus #57 in einem Patch: - #57.3 MEDIUM Drucksache-Regex-Validation: neue app/validators.py mit validate_drucksache() als gemeinsamer Validation-Funnel. Pattern ^\d{1,3}/\d{1,7}([-(].{1,20})?$ deckt alle 10 aktiven Bundesländer (8/6390, 18/12345, 8/6390(neu), 23/3700-A) ab und blockt Path-Traversal (../, /etc/passwd) plus Standard-Injection (;, <, &). Drei Endpoints durchgeschleust: /api/assessment, /api/assessment/pdf, /api/analyze-drucksache. - #57.4 MEDIUM print() → logging.getLogger(__name__): main.py und analyzer.py auf strukturiertes Logging umgestellt. LLM-Inhalte werden NICHT mehr als Volltext geloggt — neue Helper _content_fingerprint() liefert nur "len=N sha1=XXXX", reicht zur Forensik ohne Antrag-Inhalte ins Container-Log zu leaken. basicConfig() mit ISO-Format setzt strukturiertes Logging früh, damit logger.exception() auch beim Boot greift. - #57.7 LOW-MED Search-Query-Limit: validate_search_query() mit MAX_SEARCH_QUERY_LEN=200 schützt /api/search und /api/search-landtag vor 10-MB-Query-DoS. database._parse_search_query() loggt jetzt shlex.ValueError-Fallback statt ihn zu verschlucken (deckt Memory- Regel "stille excepts in Adaptern" ab). Tests: neue tests/test_main_validators.py mit 22 Cases — Drucksache- Whitelist-Roundtrip + Path-Traversal-Reject, Search-Query Längen- Edge-Cases. 107 Unit-Tests grün (85 alt + 22 neu). Validators in eigenem Modul (app/validators.py), damit Tests sie ohne slowapi-Dependency direkt importieren können. Refs: #57, #59 (Phase A) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 11:15:16 +02:00
Dotty Dotter	bc7f4a67cb	Analyzer prompt: strict citation rule against LLM hallucination Even after format_quotes_for_prompt was fixed to expose the fully-qualified programme name + page in each chunk, Qwen continued to hallucinate familiar source labels (typically "CDU Wahlprogramm NRW 2022") for parties whose actual MV/BE chunks were thematically off-topic for the Antrag at hand. The model preferred its training prior over the prompt context. Smoke test: MV Drucksache 8/6390 (CDU "Krisenmechanismus Kraftstoffpreise"). The CDU MV chunks the embedder retrieved were about Senioren and Aussenwirtschaft — not about energy prices — so qwen pulled what it knew about CDU NRW transport policy and wrote that as the source, even though the prompt listed only "CDU Mecklenburg-Vorpommern Wahlprogramm 2021" chunks. The new explicit ZITATEREGEL block in the user prompt forbids cross-Bundesland citations and instructs the model to leave zitate empty rather than fabricate when no listed chunk fits. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-08 11:31:21 +02:00
Dotty Dotter	ee0218b5af	Refactor wahlprogramme/embeddings/analyzer for multi-state (#5 ) Atomic refactor of the three modules that previously hardcoded NRW behaviour. After this commit, every analysis path consults the central BUNDESLAENDER registry for governing fractions, parliament name, and state metadata. wahlprogramme.py - WAHLPROGRAMME is now nested {bundesland: {partei: meta}}; NRW data hoisted unchanged under the "NRW" key. - New WAHLPROGRAMM_KONTEXT_FILES dict maps a state to its overview markdown file (currently only NRW). - find_relevant_quotes(text, fraktionen, bundesland) — bundesland is now a required positional. Governing fractions for the requested state are merged with the submitting fractions before lookup. - Helpers get_wahlprogramm() and parteien_mit_wahlprogramm() expose the new shape to other modules. - ValueError on unknown bundesland (no silent fallback). embeddings.py - Schema migration in init_embeddings_db: adds a `bundesland` column to the chunks table when missing, plus an index, and backfills existing rows from the PROGRAMME registry. Grundsatzprogramme (federal level) keep bundesland NULL by design. - find_relevant_chunks accepts a bundesland filter that matches state rows OR NULL — so federal Grundsatzprogramme remain visible to every analysis. - get_relevant_quotes_for_antrag(text, fraktionen, bundesland, …) — bundesland required, governing fractions read from BUNDESLAENDER instead of hardcoded ["CDU","GRÜNE"]. Order-preserving dedup replaces the previous set-based merge. - index_programm now writes the bundesland column on insert. - Dropped the hardcoded "Wahlprogramm NRW 2022" label in format_quotes_for_prompt — bundesland context is implicit in the surrounding prompt block. analyzer.py - get_bundesland_context reads parlament_name, regierungsfraktionen, landtagsfraktionen and the optional WAHLPROGRAMM_KONTEXT_FILES entry from the central registry. Throws ValueError on unknown OR inactive bundesland — kills the silent NRW fallback that previously masked configuration gaps. - The Antragsteller-detection heuristic now iterates BUNDESLAENDER[bundesland].landtagsfraktionen instead of WAHLPROGRAMME.keys(), so we recognise parties for which we don't yet have a Wahlprogramm PDF. - Both quote lookups (semantic + keyword fallback) now receive the bundesland. Resolves issue #5. Foundation for #2 (LSA), #3 (Berlin), #4 (MV). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-07 18:48:11 +02:00
Dotty Dotter	63de3ca20d	Initial commit: GWÖ-Antragsprüfer v1.0 Features: - GWÖ-Matrix 2.0 Analyse für NRW-Landtagsanträge - Verbesserungsvorschläge im Redline-Format (Original/Vorschlag/Begründung) - Wahlprogramm- und Parteiprogrammtreue-Bewertung - Landtag-Suche via OPAL-API - Tag-Wolke mit Multi-Select Filter - Partei-Filter mit Durchschnittswerten - PDF-Report-Generierung - Security Headers (CSP, X-Frame-Options, etc.) - Persistente SQLite-DB via Docker Volumes Tech Stack: - FastAPI + Jinja2 - Qwen LLM via DashScope API - SQLite + aiosqlite - WeasyPrint für PDF - Docker Compose mit Traefik	2026-03-28 22:30:24 +01:00

4 Commits