security: fix #14 — cap category_fetcher HTML body before regex walk

CategoryFetcher.fetch_post pulls a post-view HTML page and runs _TAG_ELEMENT_RE.finditer over the full body. The regex itself is linear (no catastrophic backtracking shape), but a hostile server returning hundreds of MB of HTML still pegs CPU walking the buffer. Caps the body the regex sees at 2MB — well above any legit Gelbooru/Moebooru post page (~30-150KB). Truncation rather than streaming because httpx already buffers the body before _request returns; the cost we're cutting is the regex walk, not the memory hit. A full streaming refactor of fetch_post is a follow-up that the audit explicitly flagged as out of scope ("not catastrophic — defense in depth"). Audit-Ref: SECURITY_AUDIT.md finding #14 Severity: Informational
2026-04-11 16:26:00 -05:00 · 2026-04-11 16:26:00 -05:00 · 5a511338c8
commit 5a511338c8
parent b65f8da837
1 changed files with 13 additions and 1 deletions
--- a/booru_viewer/core/api/category_fetcher.py
+++ b/booru_viewer/core/api/category_fetcher.py
@ -76,6 +76,13 @@ _LABEL_MAP: dict[str, str] = {
    "style":     "Style",
 }

+# Sentinel cap on the HTML body the regex walks over. A real
+# Gelbooru/Moebooru post page is ~30-150KB; capping at 2MB gives
+# any legit page comfortable headroom while preventing a hostile
+# server from feeding the regex hundreds of MB and pegging CPU.
+# Audit finding #14.
+_FETCH_POST_HTML_CAP = 2 * 1024 * 1024
+
 # Gelbooru tag DAPI integer code -> Capitalized label (for fetch_via_tag_api)
 _GELBOORU_TYPE_MAP: dict[int, str] = {
    0: "General",
@ -290,7 +297,12 @@ class CategoryFetcher:
                log.warning("Category HTML fetch for #%d failed: %s: %s",
                            post.id, type(e).__name__, e)
                return False
-        cats, labels = _parse_post_html(resp.text)
+        # Cap the HTML the regex walks over (audit #14). Truncation
+        # vs. full read: the body is already buffered by httpx, so
+        # this doesn't prevent a memory hit — but it does cap the
+        # CPU spent in _TAG_ELEMENT_RE.finditer for a hostile server
+        # returning hundreds of MB of HTML.
+        cats, labels = _parse_post_html(resp.text[:_FETCH_POST_HTML_CAP])
        if not cats:
            return False
        post.tag_categories = _canonical_order(cats)