booru-viewer/booru_viewer
pax e00d88e1ec api: CategoryFetcher module with HTML scrape + batch tag API + cache
New module core/api/category_fetcher.py — the unified tag-category
fetcher for boorus that don't return categories inline.

Public surface:
  try_compose_from_cache(post) — instant, no HTTP. Builds
    post.tag_categories from cached (site_id, name) -> label
    entries. Returns True if every tag in the post is cached.
  fetch_via_tag_api(posts) — batch fast path. Collects uncached
    tags across posts, chunks into 500-name batches, GETs the
    tag DAPI. Only available when the client declares _tag_api_url
    AND has credentials (Gelbooru proper). Includes JSON/XML
    sniffing parser ported from the reverted code.
  fetch_post(post) — universal fallback. HTTP GETs the post-view
    HTML page, regex-extracts class="tag-type-X">name</a>
    markup. Works on every Gelbooru fork and every Moebooru
    deployment. Does NOT require auth.
  ensure_categories(post) — idempotent dispatch: cache compose ->
    batch API (if available) -> HTML scrape. Coalesces concurrent
    calls for the same post.id via an in-flight task dict.
  prefetch_batch(posts) — fire-and-forget background prefetch.
    ONE fetch path per invocation (no mixing batch + HTML).

Probe-and-cache for the batch tag API:
  _batch_api_works = None -> not yet probed OR transient error
                              (retry next call)
  _batch_api_works = True -> batch works (Gelbooru proper)
  _batch_api_works = False -> clean 200 + zero matching names
                               (Rule34's broken names= filter)
  Transition to True/False is permanent per instance. Transient
  errors (HTTP error, timeout, parse exception) leave None so the
  next search retries the probe.

HTML regex handles both standard tag-type-artist and combined-
class forms like tag-link tag-type-artist (Konachan). Tag names
normalized to underscore-separated lowercase.

Canonical category order: Artist > Character > Copyright >
Species > General > Meta > Lore (matches danbooru/e621 inline).

Dead code at this commit — no integration yet.
2026-04-09 19:12:43 -05:00
..