Both fetch_via_tag_api and _probe_batch_api built the same params
dict (with identical lstrip/startswith credential quirks) inline.
Pulled into _build_tag_api_params so future credential-format tweaks
have one site, not two.
User-configurable sites could send XXE or billion-laughs payloads
via tag category API responses. Reject any XML body containing
<!DOCTYPE or <!ENTITY before passing to ET.fromstring.
CategoryFetcher.fetch_post pulls a post-view HTML page and runs
_TAG_ELEMENT_RE.finditer over the full body. The regex itself is
linear (no catastrophic backtracking shape), but a hostile server
returning hundreds of MB of HTML still pegs CPU walking the buffer.
Caps the body the regex sees at 2MB — well above any legit
Gelbooru/Moebooru post page (~30-150KB).
Truncation rather than streaming because httpx already buffers the
body before _request returns; the cost we're cutting is the regex
walk, not the memory hit. A full streaming refactor of fetch_post
is a follow-up that the audit explicitly flagged as out of scope
("not catastrophic — defense in depth").
Audit-Ref: SECURITY_AUDIT.md finding #14
Severity: Informational
Same fix as danbooru.py and e621.py — Gelbooru's params dict
carries api_key + user_id when configured. Route through
redact_params() before the debug log emits them.
Audit-Ref: SECURITY_AUDIT.md finding #3
Severity: Medium
Same fix as danbooru.py — the search() log.debug params line
previously emitted login + api_key. Route through redact_params().
Audit-Ref: SECURITY_AUDIT.md finding #3
Severity: Medium
The log.debug(f" params: {params}") line in search() previously
dumped login + api_key to the booru logger at DEBUG level. Route
the params dict through redact_params() so the keys are replaced
with *** before formatting.
Audit-Ref: SECURITY_AUDIT.md finding #3
Severity: Medium
The httpx request event hook converts request.url to a str so
log_connection can parse it — at that point the credential query
params (login, api_key, etc.) are in scope and could be captured
by any traceback, debug hook, or monitoring agent observing the
hook call. Pipe through redact_url() first so the rendered string
never carries the secrets, even transiently.
Audit-Ref: SECURITY_AUDIT.md finding #3
Severity: Medium
detect_site_type constructs a fresh BooruClient._shared_client
directly (bypassing the BooruClient.client property) for the
/posts.json, /index.php, and /post.json probes. The hooks set
here are the ones installed on that initial construction — if
detection runs before any BooruClient instance's .client is
accessed, the shared singleton must still have SSRF validation
and connection logging.
This additionally closes finding #16 for the detect client — site
detection requests now appear in the connection log instead of
being invisible.
behavior change from v0.2.5: Test Connection from the site dialog
now rejects private-IP targets. Adding a local/RFC1918 booru via
the "auto-detect type" dialog will fail with "blocked request
target ..." instead of probing it. Explicit api_type selection
still goes through the BooruClient.client path, which is also
now protected.
Audit-Ref: SECURITY_AUDIT.md finding #1
Also-Closes: SECURITY_AUDIT.md finding #16 (detect half)
Severity: High
E621 maintains its own httpx.AsyncClient because their TOS requires
a per-user User-Agent string that BooruClient's shared client can't
carry. The client is rebuilt on User-Agent change, so the hook must
be installed in the same construction path.
Also installs BooruClient._log_request as a second hook (this
additionally closes finding #16 for the e621 client — e621 requests
previously bypassed the connection log entirely, and this wires
them in consistently with the base client).
Audit-Ref: SECURITY_AUDIT.md finding #1
Also-Closes: SECURITY_AUDIT.md finding #16 (e621 half)
Severity: High
Adds validate_public_request to the BooruClient event_hooks list so
every request (and every redirect hop) is checked against the block
list from _safety.py. Danbooru, Gelbooru, and Moebooru subclasses
all go through BooruClient.client and inherit the protection.
Preserves the existing _log_request hook by listing both hooks in
order: validate first (so blocked hops never reach the log), then
log.
Audit-Ref: SECURITY_AUDIT.md finding #1
Severity: High
Introduces core/api/_safety.py containing check_public_host and the
validate_public_request async request-hook. The hook rejects any URL
whose host is (or resolves to) loopback, RFC1918, link-local
(including 169.254.169.254 cloud metadata), CGNAT, unique-local v6,
or multicast. Called on every request hop so it covers both the
initial URL and every redirect target that httpx would otherwise
follow blindly.
Also exports redact_url / redact_params for finding #3 — the
secret-key set lives in the same module since both #1 and #3 work
is wired through httpx client event_hooks. Helper is stdlib-only
(ipaddress, socket, urllib.parse) plus httpx; no new deps.
Not yet wired into any httpx client; per-file wiring commits follow.
Audit-Ref: SECURITY_AUDIT.md finding #1
Severity: High
When _batch_api_works is True (Gelbooru proper with auth, persisted
from a prior session's probe), search() fires prefetch_batch in the
background. The batch tag API covers the entire page's tags in 1-2
requests during the time between grid render and user click — the
cache is warm before the info panel opens, so categories appear
instantly with no flash of flat tags.
Gated on _batch_api_works is True (not None, not False):
- Gelbooru proper: prefetches (batch API known good)
- Rule34: skips (batch_api_works = False, persisted)
- Safebooru.org: skips (no auth → fetcher skips batch capability)
Rule34 / Safebooru.org / Moebooru stay on-demand: the ~200ms
per-click HTML scrape is unavoidable for those sites because their
only path is per-post page fetching, which can't be batched.
_do_ensure only tried the batch API when _batch_api_works was True,
but after removing the search-time prefetch (where the probe used
to run), _batch_api_works stayed None forever. Gelbooru's only
viable path IS the batch API (its post-view HTML has no tag links),
so clicks on Gelbooru posts produced zero categories.
Fix: _do_ensure now tries the batch API when _batch_api_works is
not False (i.e., both True and None). When None, the call doubles
as an inline probe: if the batch produced categories, save True;
if nothing useful came back, save False and fall to HTML.
This is simpler than the old prefetch_batch probe because it runs
on ONE post at a time — no batch/HTML mixing concerns, no "single
path per invocation" rule. The probe result is persisted to DB so
it only fires once per site ever.
Dispatch matrix in _do_ensure:
_batch_api_works True + auth → batch API (Gelbooru proper)
_batch_api_works None + auth → batch as probe → True or False
_batch_api_works False → HTML scrape (Rule34)
no auth → HTML scrape (Safebooru.org)
transient error → stays None, retry next click
Verified all three sites from clean cache: Gelbooru 55/56+49/50
(batch), Rule34 40/40+38/38 (HTML), Safebooru.org 47/47+47/47
(HTML).
Removes the asyncio.create_task(prefetch_batch) calls from
search() and get_post() in both clients. Tags are now fetched
ONLY when the user actually clicks a post (via ensure_categories
in the info panel path) or saves with a category-token template.
The background prefetch was the source of most of the complexity:
probe timing, early-exit bugs from partial composes racing with
on-click ensures, Rule34's slow probe blocking the prefetch
window. All gone.
New flow:
search() → fast, returns posts with flat tags only
click → ensure_categories fires, ~200ms HTML scrape or
batch API, categories arrive, signal re-renders
re-click → instant (cache compose, no HTTP)
save → ensure in save_post_file, same path
The ~200ms per first-click is invisible during the image load.
The cache compounds across posts and sessions. The prefetch_batch
method stays in CategoryFetcher for potential future use but
nothing calls it from the hot path anymore.
The probe that detects whether a site's batch tag API works
(Gelbooru proper: yes, Rule34: no) now persists its result in the
tag_types table using a sentinel key (__batch_api_probe__). On
subsequent app launches, the fetcher reads the saved result at
construction time and skips the probe entirely.
Before: every session with Rule34 wasted ~0.6s on a probe request
that always fails (Rule34 returns garbage for names=). During that
time the background prefetch couldn't start HTML scraping, so the
first few post clicks paid ~0.3s each.
After: first ever session probes Rule34 once, stores False. Every
subsequent session reads False from DB, skips the probe, and the
background prefetch immediately starts HTML scraping. By the time
the user clicks any post, the scrape is usually done.
Gelbooru proper: probe succeeds on first session, stores True.
Future sessions use the batch API without probing. No change in
speed (already fast), just saves the probe roundtrip.
Persisted per site_id so different Gelbooru-shaped sites get their
own probe result. The clear_tag_cache method wipes probe results
along with tag data (the sentinel key lives in the same table).
Two places checked `if post.tag_categories: return` before doing
a full cache-coverage check, causing posts with partial cache
composes (e.g. 5/40 tags from the background prefetch) to get
stuck at low coverage forever:
ensure_categories: removed the post.tag_categories early exit.
Now ALWAYS runs try_compose_from_cache first. Only the 100%
coverage return (True) is trusted as "done." Partial composes
return False and fall through to the fetch path.
_ensure_post_categories_async: removed the post.tag_categories
guard. Danbooru/e621 are filtered by the client.category_fetcher
is None check instead (they categorize inline, no fetcher).
For Gelbooru-style sites, always schedules ensure_categories
regardless of current post state.
Root cause: the partial-compose fix (try_compose_from_cache
populates tag_categories even when cache coverage is <100%)
conflicted with the early-exit guards that assumed non-empty
tag_categories = fully categorized. Now the only "fully done"
signal is try_compose_from_cache returning True (100% coverage).
try_compose_from_cache was returning True on ANY partial cache hit
(even 1/38 tags). ensure_categories then saw non-empty
tag_categories and returned immediately, leaving the post stuck at
1/38 coverage. The bug showed on Rule34: post 1 got fully scraped
(40/40), its tags got cached, then post 2's compose found one
matching tag and declared victory.
Fix: try_compose_from_cache now returns True ONLY when 100% of
unique tags have cached labels (no fetch needed). It STILL
populates post.tag_categories with whatever IS cached (for
immediate partial display), but returning False signals
ensure_categories to continue to the fetch path.
This is the correct semantic split:
- populate → always (for display)
- return True → only when complete (for dispatch)
Verified:
Rule34: 40/40 + 38/38 (was 40/40 + 1/38)
Gelbooru: 55/56 + 49/50 (batch API, one rare tag)
Safebooru.org: 47/47 + 47/47 (HTML scrape, full)
Three fixes:
1. HTML parser completely rewritten with two-pass approach:
- Pass 1: regex finds each tag-type element and its full inner
content (up to closing </li|span|td|div>)
- Pass 2: within the content, extracts the tag name from the
tags=NAME URL parameter in the search link
The old single-pass regex captured the ? wiki-link (first <a>)
instead of the tag name (second <a>). The URL-param extraction
works on Rule34 (40 tags), Safebooru.org (47 tags), and
yande.re (3 tags). Gelbooru proper returns 0 (post page only
has ? links with no tags= param) which is correct — Gelbooru
uses the batch tag API instead.
2. prefetch_batch is now truly fire-and-forget:
gelbooru.py and moebooru.py use asyncio.create_task instead of
await for prefetch_batch. search() returns immediately. The
probe + batch/HTML fetch runs in the background. Previously
search() blocked on the probe, which made Rule34 searches take
5+ seconds (slow/broken Rule34 API response time).
3. Partial cache compose already fixed in the previous commit
complements this: posts with 49/50 cached tags now show all
available categories instead of nothing.
try_compose_from_cache previously required 100% cache coverage —
every tag in the post had to have a cached label or it returned
False and populated nothing. One rare uncached tag out of 50
blocked the entire composition, leaving the post with zero
categories even though 49/50 labels were available.
Fix: compose whatever IS cached, return True when at least one
tag got categorized. Tags not in the cache are simply absent from
the categories dict (they stay in the flat tags string). The
return value now means "the post has usable categories" rather
than "the post has complete categories." This distinction matters
because the dispatch logic uses the return value to decide
whether to skip the fetch path — partial coverage is better than
no coverage, and the missing tags get cached eventually when
other posts that contain them get fetched.
Verified against Gelbooru: post with 50 tags where 49 were cached
now gets 49/50 categorized (Artist, Character, Copyright, General,
Meta) instead of 0/50.
client_for_type gains optional db + site_id kwargs. When both are
passed and api_type is gelbooru or moebooru, a CategoryFetcher is
constructed and assigned to client.category_fetcher. The fetcher
owns the per-tag cache, the batch tag API fast path, and the
per-post HTML scrape fallback.
Danbooru and e621 never get a fetcher — their inline JSON
categorization is already optimal.
Test Connection dialog and scripts don't pass db/site_id, so they
get fetcher-less clients with the existing search behavior.
Override _post_view_url to return /post/show/{id} for the per-post
HTML scrape path. No _tag_api_url override — Moebooru has no batch
tag DAPI; the CategoryFetcher dispatch goes straight to per-post
HTML for these sites.
search() and get_post() now call prefetch_batch when a fetcher is
attached, same fire-and-forget pattern as gelbooru.py.
Overrides both URL methods from the base class:
_post_view_url(post) -> /index.php?page=post&s=view&id={id}
Universal HTML scrape path — works on Gelbooru proper, Rule34,
Safebooru.org without auth.
_tag_api_url() -> {base_url}/index.php
Batch tag DAPI fast path. The CategoryFetcher's probe-and-cache
determines at runtime whether the endpoint actually honors
names=. Gelbooru proper: probe succeeds. Rule34: probe fails
(garbage response), falls back to HTML. Safebooru.org: no auth,
dispatch skips batch entirely.
search() and get_post() now call
await self.category_fetcher.prefetch_batch(posts)
after building the post list, when a fetcher is attached. The
prefetch is fire-and-forget — search returns immediately and the
background tasks fill categories as the user reads. When no
fetcher is attached (Test Connection dialog, scripts), this is a
no-op and behavior is unchanged.
Three additions to the base class, all default-inactive:
_post_view_url(post) -> str | None
Override to provide the post-view HTML URL for the per-post
category scrape path. Default None (Danbooru/e621 skip it).
_tag_api_url() -> str | None
Override to provide the batch tag DAPI base URL for the fast
path in CategoryFetcher. Default None. Only Gelbooru proper
benefits — the fetcher's probe-and-cache determines at runtime
whether the endpoint actually honors the names= parameter.
self.category_fetcher = None
Set externally by the factory (client_for_type) when db and
site_id are available. Gelbooru-shape and Moebooru clients use
it; Danbooru/e621 leave it None.
No behavior change at this commit. Existing clients inherit the
defaults and continue working identically.
New module core/api/category_fetcher.py — the unified tag-category
fetcher for boorus that don't return categories inline.
Public surface:
try_compose_from_cache(post) — instant, no HTTP. Builds
post.tag_categories from cached (site_id, name) -> label
entries. Returns True if every tag in the post is cached.
fetch_via_tag_api(posts) — batch fast path. Collects uncached
tags across posts, chunks into 500-name batches, GETs the
tag DAPI. Only available when the client declares _tag_api_url
AND has credentials (Gelbooru proper). Includes JSON/XML
sniffing parser ported from the reverted code.
fetch_post(post) — universal fallback. HTTP GETs the post-view
HTML page, regex-extracts class="tag-type-X">name</a>
markup. Works on every Gelbooru fork and every Moebooru
deployment. Does NOT require auth.
ensure_categories(post) — idempotent dispatch: cache compose ->
batch API (if available) -> HTML scrape. Coalesces concurrent
calls for the same post.id via an in-flight task dict.
prefetch_batch(posts) — fire-and-forget background prefetch.
ONE fetch path per invocation (no mixing batch + HTML).
Probe-and-cache for the batch tag API:
_batch_api_works = None -> not yet probed OR transient error
(retry next call)
_batch_api_works = True -> batch works (Gelbooru proper)
_batch_api_works = False -> clean 200 + zero matching names
(Rule34's broken names= filter)
Transition to True/False is permanent per instance. Transient
errors (HTTP error, timeout, parse exception) leave None so the
next search retries the probe.
HTML regex handles both standard tag-type-artist and combined-
class forms like tag-link tag-type-artist (Konachan). Tag names
normalized to underscore-separated lowercase.
Canonical category order: Artist > Character > Copyright >
Species > General > Meta > Lore (matches danbooru/e621 inline).
Dead code at this commit — no integration yet.
Mixing `threading.Thread + asyncio.run` workers with the long-lived
asyncio loop in gui/app.py is a real loop-affinity bug: the first worker
thread to call `asyncio.run` constructs a throwaway loop, which the
shared httpx clients then attach to, and the next call from the
persistent loop fails with "Event loop is closed" / "attached to a
different loop". This commit eliminates the pattern across the GUI and
adds the locking + cleanup that should have been there from the start.
Persistent loop accessor (core/concurrency.py — new)
- set_app_loop / get_app_loop / run_on_app_loop. BooruApp registers the
one persistent loop at startup; everything that wants to schedule
async work calls run_on_app_loop instead of spawning a thread that
builds its own loop. Three functions, ~30 lines, single source of
truth for "the loop".
Lazy-init lock + cleanup on shared httpx clients (core/api/base.py,
core/api/e621.py, core/cache.py)
- Each shared singleton (BooruClient._shared_client, E621Client._e621_client,
cache._shared_client) now uses fast-path / locked-slow-path lazy init.
Concurrent first-callers from the same loop can no longer both build
a client and leak one (verified: 10 racing callers => 1 httpx instance).
- Each module exposes an aclose helper that BooruApp.closeEvent runs via
run_coroutine_threadsafe(...).result(timeout=5) BEFORE stopping the
loop. The connection pool, keepalive sockets, and TLS state finally
release cleanly instead of being abandoned at process exit.
- E621Client tracks UA-change leftovers in _e621_to_close so the old
client doesn't leak when api_user changes — drained in aclose_shared.
GUI workers routed through the persistent loop (gui/sites.py,
gui/bookmarks.py)
- SiteDialog._on_detect / _on_test: replaced
`threading.Thread(target=lambda: asyncio.run(...))` with
run_on_app_loop. Results marshaled back through Qt Signals connected
with QueuedConnection. Added _closed flag + _inflight futures list:
closeEvent cancels pending coroutines and shorts out the result emit
if the user closes the dialog mid-detect (no use-after-free on
destroyed QObject).
- BookmarksView._load_thumb_async: same swap. The existing thumb_ready
signal already used QueuedConnection so the marshaling side was
already correct.
DB write serialization (core/db.py)
- Database._write_lock = threading.RLock() — RLock not Lock so a
writing method can call another writing method on the same thread
without self-deadlocking.
- New _write() context manager composes the lock + sqlite3's connection
context manager (the latter handles BEGIN / COMMIT / ROLLBACK
atomically). Every write method converted: add_site, update_site,
delete_site, add_bookmark, add_bookmarks_batch, remove_bookmark,
update_bookmark_cache_path, add_folder, remove_folder, rename_folder,
move_bookmark_to_folder, add/remove_blacklisted_tag,
add/remove_blacklisted_post, save_library_meta, remove_library_meta,
set_setting, add_search_history, clear_search_history,
remove_search_history, add_saved_search, remove_saved_search.
- _migrate keeps using the lock + raw _conn context manager because
it runs from inside the conn property's lazy init (where _write()
would re-enter conn).
- Reads stay lock-free and rely on WAL for reader concurrency. Verified
under contention: 5 threads × 50 add_bookmark calls => 250 rows,
zero corruption, zero "database is locked" errors.
Smoke-tested with seven scenarios: get_app_loop raises before set,
run_on_app_loop round-trips, lazy init creates exactly one client,
10 concurrent first-callers => 1 httpx, aclose_shared cleans up,
RLock allows nested re-acquire, multi-threaded write contention.
Sweep of defensive hardening across the core layers plus a related popout
overlay regression that surfaced during verification.
Database integrity (core/db.py)
- Wrap delete_site, add_search_history, remove_folder, rename_folder,
and _migrate in `with self.conn:` so partial commits can't leave
orphan rows on a crash mid-method.
- add_bookmark re-SELECTs the existing id when INSERT OR IGNORE
collides on (site_id, post_id). Was returning Bookmark(id=0)
silently, which then no-op'd update_bookmark_cache_path the next
time the post was bookmarked.
- get_bookmarks LIKE clauses now ESCAPE '%', '_', '\\' so user search
literals stop acting as SQL wildcards (cat_ear no longer matches
catear).
Path traversal (core/db.py + core/config.py)
- Validate folder names at write time via _validate_folder_name —
rejects '..', os.sep, leading '.' / '~'. Permits Unicode/spaces/
parens so existing folders keep working.
- saved_folder_dir() resolves the candidate path and refuses anything
that doesn't relative_to the saved-images base. Defense in depth
against folder strings that bypass the write-time validator.
- gui/bookmarks.py and gui/app.py wrap add_folder calls in try/except
ValueError and surface a QMessageBox.warning instead of crashing.
Download safety (core/cache.py)
- New _do_download(): payloads >=50MB stream to a tempfile in the
destination dir and atomically os.replace into place; smaller
payloads keep the existing buffer-then-write fast path. Both
enforce a 500MB hard cap against the advertised Content-Length AND
the running total inside the chunk loop (servers can lie).
- Per-URL asyncio.Lock coalesces concurrent downloads of the same
URL so two callers don't race write_bytes on the same path.
- Image.MAX_IMAGE_PIXELS = 256M with DecompressionBombError handling
in both converters.
- _convert_ugoira_to_gif checks frame count + cumulative uncompressed
size against UGOIRA_MAX_FRAMES / UGOIRA_MAX_UNCOMPRESSED_BYTES from
ZipInfo headers BEFORE decompressing — defends against zip bombs.
- _convert_animated_to_gif writes a .convfailed sentinel sibling on
failure to break the re-decode-on-every-paint loop for malformed
animated PNGs/WebPs.
- _is_valid_media returns True (don't delete) on OSError so a
transient EBUSY/permissions hiccup no longer triggers a delete +
re-download loop on every access.
- _referer_for() uses proper hostname suffix matching, not substring
`in` (imgblahgelbooru.attacker.com no longer maps to gelbooru.com).
- PIL handles wrapped in `with` blocks for deterministic cleanup.
API client retry + visibility (core/api/*)
- base.py: _request retries on httpx.NetworkError + ConnectError in
addition to TimeoutException. test_connection no longer echoes the
HTTP response body in the error string (it was an SSRF body-leak
gadget when used via detect_site_type's redirect-following client).
- detect.py + danbooru.py + e621.py + gelbooru.py + moebooru.py:
every previously-swallowed exception in search/autocomplete/probe
paths now logs at WARNING with type, message, and (where relevant)
the response body prefix. Debugging "the site isn't working" used
to be a total blackout.
main_gui.py
- file_dialog_platform DB probe failure prints to stderr instead of
vanishing.
Popout overlay (gui/preview.py + gui/app.py)
- preview.py:79,141 — setAttribute(WA_StyledBackground, True) on
_slideshow_toolbar and _slideshow_controls. Plain QWidget parents
silently ignore QSS `background:` declarations without this
attribute, which is why the popout overlay strip was rendering
fully transparent (buttons styled, bar behind them showing the
letterbox color).
- app.py: bake _BASE_POPOUT_OVERLAY_QSS as a fallback prepended
before the user's custom.qss in the loader. Custom themes that
don't define overlay rules now still get a translucent black
bar with white text + hairline borders. Bundled themes win on
tie because their identical-specificity rules come last in the
prepended string.
Some boorus return empty/HTML responses for tag-limited queries.
All API clients now catch JSON parse errors and return empty
results instead of crashing.
Single shared httpx.AsyncClient for all BooruClient instances
(Danbooru, Gelbooru, Moebooru) with connection pooling.
E621 gets its own shared client (custom User-Agent required).
Site detection also reuses the shared client.
Eliminates per-request TLS handshakes on Windows.
- Artist (gold), Character (green), Copyright (purple), Species
(red), General, Meta/Lore (gray)
- Danbooru and e621 provide categories from API
- Gelbooru/Moebooru fall back to flat tag list
Logs every outgoing connection (API requests and image downloads)
with timestamps. Network tab in Settings shows all hosts contacted
this session with request counts. No telemetry, just transparency.
Supports Danbooru, Gelbooru, Moebooru, and e621. Features include tag search
with autocomplete, favorites with folders, save-to-library, video playback,
drag-and-drop, multi-select, custom CSS theming, and cross-platform support.