PortalaAdapter: quick-win bigger window + chunksize for BE/LSA (#13)

Real server-side fulltext search through the eUI sf-Index requires
reverse-engineering the LSA/BE-specific search field (the obvious
candidates VOLL, VOLL.main, WEV62 and bare-term-without-sf all
return zero hits when probed). Without browser DevTools to capture
a real fulltext request that's a multi-hour project — split out
to remain in #13 as a follow-up.

This commit ships the pragmatic interim fix from #11:

- BE date_window_days: 180 → 730
  Berlin had a tight default window because PARDOK has ~10x more
  documents than PADOKA. With the bigger window the client-side
  title/Urheber filter reaches back across most of WP19 instead
  of just the last six months.

- chunksize logic in PortalaAdapter.search() inverted from
  "small when query, big when no query" to the opposite. The
  query-filtered path now pulls up to max(limit*10, 500) records
  per page so the title-filter has enough material; the unfiltered
  browse path stays at max(limit*2, 100).

- httpx timeout 30s → 60s. LSA's report.tt.html occasionally
  takes 30+s on cold start; warm requests are <10s.

Smoke test (local):
  BE  Schule: 15 hits (was 0)
  LSA Schule: 14 hits (was N/A; same path)

Live verification follows after deploy.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
Dotty Dotter 2026-04-08 13:58:34 +02:00
parent 6184bf8a88
commit 9eda6f9f36

View File

@ -732,7 +732,9 @@ class PortalaAdapter(ParlamentAdapter):
report_html = f"{self.base_url}{self.portala_path}/report.tt.html" report_html = f"{self.base_url}{self.portala_path}/report.tt.html"
async with httpx.AsyncClient( async with httpx.AsyncClient(
timeout=30, # Bumped from 30s for #13 quick-win: chunksize=500 against the
# LSA report.tt.html endpoint occasionally takes 30+ seconds.
timeout=60,
follow_redirects=True, follow_redirects=True,
headers={"User-Agent": "Mozilla/5.0 GWOE-Antragspruefer"}, headers={"User-Agent": "Mozilla/5.0 GWOE-Antragspruefer"},
) as client: ) as client:
@ -757,8 +759,12 @@ class PortalaAdapter(ParlamentAdapter):
return [] return []
# Step 3: fetch the HTML hit list # Step 3: fetch the HTML hit list
# Take a generous chunk so client-side filter still has enough # Take a generous chunk so the client-side title filter
chunksize = 100 if query else limit # still has enough material to work with. Quick-win for #13
# until the eUI sf-Index for real server-side fulltext is
# reverse-engineered: bump the unfiltered chunk floor and
# the query-filtered chunk ceiling.
chunksize = max(limit * 10, 500) if query else max(limit * 2, 100)
report_resp = await client.post( report_resp = await client.post(
report_html, report_html,
json={"report_id": report_id, "start": 0, "chunksize": chunksize}, json={"report_id": report_id, "start": 0, "chunksize": chunksize},
@ -1298,9 +1304,13 @@ ADAPTERS = {
# Berlin's ETYPF index uses different value strings — drop the # Berlin's ETYPF index uses different value strings — drop the
# document_type subtree, fall back to client-side title filter. # document_type subtree, fall back to client-side title filter.
document_type=None, document_type=None,
# Tighter date window: BE has ~10x more documents than LSA, so a # Quick-win for #13: pulled the date window from the original
# narrower window keeps the per-request payload bounded. # 180-day MVP up to 730 days so client-side title-filter searches
date_window_days=180, # ("Schule" etc.) reach back across more of the WP19 corpus until
# the eUI fulltext-sf is reverse-engineered. The chunksize bump
# in PortalaAdapter.search() means the per-request payload stays
# bounded.
date_window_days=730,
pdf_url_prefix="/files/", pdf_url_prefix="/files/",
), ),
"MV": ParLDokAdapter( "MV": ParLDokAdapter(