92 Commits

Author SHA1 Message Date
pax
0f26475f52 detect: remove leftover if-True indent marker
Dead syntax left over from a prior refactor. No behavior change.
2026-04-15 17:34:27 -05:00
pax
cf8bc0ad89 library_save: require category_fetcher to prevent silent category drop
behavior change: save_post_file's category_fetcher argument is now
keyword-only with no default, so every call site has to pass something
explicit (fetcher instance or None). Previously the =None default let
bookmark→library save and bookmark Save As slip through without a
fetcher at all, silently rendering %artist%/%character% tokens as
empty strings and producing filenames like '_12345.jpg' instead of
'greatartist_12345.jpg'.

BookmarksView now takes a category_fetcher_factory callable in its
constructor (wired to BooruApp._get_category_fetcher), called at save
time so it picks up the fetcher for whatever site is currently active.

tests/core/test_library_save.py pins the signature shape and the
three relevant paths: fetcher populates empty categories, None
accepted when categories are pre-populated (Danbooru/e621 inline),
fetcher skipped when template has no category tokens.
2026-04-15 17:32:25 -05:00
pax
bbf0d3107b category_fetcher: stop flipping _batch_api_works=False on transient errors in single-post path
behavior change: a single mid-call network drop could previously
poison _batch_api_works=False for the whole site, forcing every
future ensure_categories onto the slower HTML scrape path. _do_ensure
now routes the unprobed case through _probe_batch_api, which only
flips the flag on a clean HTTP 200 with zero matching names; timeout
and non-200 responses leave the flag None so the next call retries
the probe.

The bug surfaced because fetch_via_tag_api swallows per-chunk
failures with 'except Exception: continue', so the previous code
path couldn't distinguish 'API returned zero matches' from 'the
network dropped halfway through.' _probe_batch_api already made
that distinction for prefetch_batch; _do_ensure now reuses it.

Tests in tests/core/api/test_category_fetcher.py pin the three
routes (transient raise, clean-200-zero-matches, non-200).
2026-04-15 17:29:01 -05:00
pax
ec9e44efbe category_fetcher: extract shared tag-API params builder
Both fetch_via_tag_api and _probe_batch_api built the same params
dict (with identical lstrip/startswith credential quirks) inline.
Pulled into _build_tag_api_params so future credential-format tweaks
have one site, not two.
2026-04-15 17:27:10 -05:00
pax
ad6f876f40 category_fetcher: reject XML responses with DOCTYPE/ENTITY declarations
User-configurable sites could send XXE or billion-laughs payloads
via tag category API responses. Reject any XML body containing
<!DOCTYPE or <!ENTITY before passing to ET.fromstring.
2026-04-12 14:55:30 -05:00
pax
b964a77688 cache: single-pass directory walk in eviction functions
evict_oldest and evict_oldest_thumbnails now collect paths, stats,
and sizes in one iterdir() pass instead of separate passes for
sorting, sizing, and deleting. evict_oldest also accepts a
current_bytes arg to skip a redundant cache_size_bytes() call.
2026-04-11 23:01:44 -05:00
pax
b28cc0d104 db: escape LIKE wildcards in search_library_meta
Same fix as audit #5 applied to get_bookmarks (lines 490-499) but
missed here. Without ESCAPE, searching 'cat_ear' also matches
'catxear' because _ is a SQL LIKE wildcard that matches any single
character.
2026-04-11 19:28:59 -05:00
pax
2186f50065 remove dead code: core/images.py
make_thumbnail and image_dimensions were both unreferenced. The
library's actual thumbnailing happens inline in gui/library.py
(PIL for stills, ffmpeg subprocess for videos), and the live
image_dimensions used by main_window.py is the static method on
gui/media_controller.py — not the standalone function this file
exposed. Audit finding #15 follow-up.
2026-04-11 17:29:04 -05:00
pax
07665942db core/__init__.py: drop stale core.images reference from docstring
The audit #8 explanation no longer needs to name core.images as the
example case — the invariant holds for any submodule, and core.images
is about to be removed entirely as dead code.
2026-04-11 17:28:57 -05:00
pax
5a511338c8 security: fix #14 — cap category_fetcher HTML body before regex walk
CategoryFetcher.fetch_post pulls a post-view HTML page and runs
_TAG_ELEMENT_RE.finditer over the full body. The regex itself is
linear (no catastrophic backtracking shape), but a hostile server
returning hundreds of MB of HTML still pegs CPU walking the buffer.
Caps the body the regex sees at 2MB — well above any legit
Gelbooru/Moebooru post page (~30-150KB).

Truncation rather than streaming because httpx already buffers the
body before _request returns; the cost we're cutting is the regex
walk, not the memory hit. A full streaming refactor of fetch_post
is a follow-up that the audit explicitly flagged as out of scope
("not catastrophic — defense in depth").

Audit-Ref: SECURITY_AUDIT.md finding #14
Severity: Informational
2026-04-11 16:26:00 -05:00
pax
b65f8da837 security: fix #10 — validate media magic in first 16 bytes of stream
The previous flow streamed the full body to disk and called
_is_valid_media after completion. A hostile server that omits
Content-Type (so the early text/html guard doesn't fire) could
burn up to MAX_DOWNLOAD_BYTES (500MB) of bandwidth and cache-dir
write/delete churn before the post-download check rejected.

Refactors _do_download to accumulate chunks into a small header
buffer until at least 16 bytes have arrived, then runs
_looks_like_media against the buffer before committing to writing
the full payload. The 16-byte minimum handles servers that send
tiny chunks (chunked encoding with 1-byte chunks, slow trickle,
TCP MSS fragmentation) without false-failing on the first chunk.

Extracts _looks_like_media(bytes) as a sibling to _is_valid_media
(path) sharing the same magic-byte recognition. _looks_like_media
fails closed on empty input — when called from the streaming
validator, an empty header means the server returned nothing
useful. _is_valid_media keeps its OSError-fallback open behavior
for the on-disk path so transient EBUSY doesn't trigger a delete
+ re-download loop.

Audit-Ref: SECURITY_AUDIT.md finding #10
Severity: Low
2026-04-11 16:24:59 -05:00
pax
8f9e4f7e65 security: fix #8 — drop duplicate MAX_IMAGE_PIXELS set from cache.py
The cap is now installed by core/__init__.py (previous commit), so
the line in cache.py is redundant. Removing it leaves a single
authoritative location for the security-critical PIL setting.

Audit-Ref: SECURITY_AUDIT.md finding #8
Severity: Low
2026-04-11 16:21:37 -05:00
pax
2bb6352141 security: fix #8 — install MAX_IMAGE_PIXELS cap in core/__init__.py
PIL's decompression-bomb cap previously lived as a side effect of
importing core/cache.py. Any future code path that touched core/images
(or any other core submodule) without first importing cache would
silently revert to PIL's default 89M-pixel *warning* (not an error),
re-opening the bomb surface.

Moves the cap into core/__init__.py so any import of any
booru_viewer.core.* submodule installs it first. The duplicate set
in cache.py is left in place by this commit and removed in the next
one — both writes are idempotent so this commit is bisect-safe.

Audit-Ref: SECURITY_AUDIT.md finding #8
Severity: Low
2026-04-11 16:21:32 -05:00
pax
6ff1f726d4 security: fix #7 — reject Windows reserved device names in template
render_filename_template's sanitization stripped reserved chars,
control codes, whitespace, and `..` prefixes — but did not catch
Windows reserved device names (CON, PRN, AUX, NUL, COM1-9, LPT1-9).
On Windows, opening `con.jpg` for writing redirects to the CON
device, so a tag value of `con` from a hostile booru would silently
break Save to Library.

Adds a frozenset of reserved stems and prefixes the rendered name
with `_` if its lowercased stem matches. The check runs
unconditionally (not Windows-gated) so a library saved on Linux
can be copied to a Windows machine without breaking on these
filenames.

Audit-Ref: SECURITY_AUDIT.md finding #7
Severity: Low
2026-04-11 16:20:27 -05:00
pax
5d348fa8be security: fix #5 — LRU cap on _url_locks to prevent memory leak
Replaces the unbounded defaultdict(asyncio.Lock) with an OrderedDict
guarded by _get_url_lock() and _evict_url_locks(). The cap is 4096
entries; LRU semantics keep the hot working set alive and oldest-
unlocked-first eviction trims back toward the cap on each new
insertion.

Eviction skips locks that are currently held — popping a lock that
a coroutine is mid-`async with` on would break its __aexit__. The
inner loop's evicted-flag handles the edge case where every
remaining entry is either the freshly inserted hash or held; in
that state the cap is briefly exceeded and the next insertion
retries, instead of looping forever.

Audit-Ref: SECURITY_AUDIT.md finding #5
Severity: Medium
2026-04-11 16:16:52 -05:00
pax
a6a73fed61 security: fix #4 — chmod SQLite DB + WAL/SHM sidecars to 0o600
The sites table stores api_key + api_user in plaintext. Previous
behavior left the DB file at the inherited umask (0o644 on most
Linux systems) so any other local user could sqlite3 it open and
exfiltrate every booru API key.

Adds Database._restrict_perms(), called from the lazy conn init
right after _migrate(). Tightens the main file plus the -wal and
-shm sidecars to 0o600. The sidecars only exist after the first
write, so the FileNotFoundError path is expected and silenced.
Filesystem chmod failures are also swallowed for FUSE-mount
compatibility.

behavior change from v0.2.5: ~/.local/share/booru-viewer/booru.db
is now 0o600 even if a previous version created it 0o644.

Audit-Ref: SECURITY_AUDIT.md finding #4
Severity: Medium
2026-04-11 16:15:41 -05:00
pax
6801a0b45e security: fix #4 — chmod data_dir to 0o700 on POSIX
The data directory holds the SQLite database whose `sites` table
stores api_key and api_user in plaintext. Previous behavior used
the inherited umask (typically 0o755), which leaves the dir
world-traversable on shared workstations and on networked home
dirs whose home is 0o755. Tighten to 0o700 unconditionally on
every data_dir() call so the fix is applied even when an older
version (or external tooling) left the directory loose.

Failures from filesystems that don't support chmod (some FUSE
mounts) are swallowed — better to keep working than refuse to
start. Windows: no-op, NTFS ACLs handle this separately.

behavior change from v0.2.5: ~/.local/share/booru-viewer is now
0o700 even if it was previously 0o755.

Audit-Ref: SECURITY_AUDIT.md finding #4
Severity: Medium
2026-04-11 16:14:30 -05:00
pax
19a22be59c security: fix #3 — redact params in GelbooruClient debug log
Same fix as danbooru.py and e621.py — Gelbooru's params dict
carries api_key + user_id when configured. Route through
redact_params() before the debug log emits them.

Audit-Ref: SECURITY_AUDIT.md finding #3
Severity: Medium
2026-04-11 16:13:25 -05:00
pax
49fa2c5b7a security: fix #3 — redact params in E621Client debug log
Same fix as danbooru.py — the search() log.debug params line
previously emitted login + api_key. Route through redact_params().

Audit-Ref: SECURITY_AUDIT.md finding #3
Severity: Medium
2026-04-11 16:13:06 -05:00
pax
9a3bb697ec security: fix #3 — redact params in DanbooruClient debug log
The log.debug(f"  params: {params}") line in search() previously
dumped login + api_key to the booru logger at DEBUG level. Route
the params dict through redact_params() so the keys are replaced
with *** before formatting.

Audit-Ref: SECURITY_AUDIT.md finding #3
Severity: Medium
2026-04-11 16:12:47 -05:00
pax
d6909bf4d7 security: fix #3 — redact URL in BooruClient._log_request
The httpx request event hook converts request.url to a str so
log_connection can parse it — at that point the credential query
params (login, api_key, etc.) are in scope and could be captured
by any traceback, debug hook, or monitoring agent observing the
hook call. Pipe through redact_url() first so the rendered string
never carries the secrets, even transiently.

Audit-Ref: SECURITY_AUDIT.md finding #3
Severity: Medium
2026-04-11 16:12:28 -05:00
pax
c735db0c68 security: fix #1 — wire SSRF hook into detect_site_type client
detect_site_type constructs a fresh BooruClient._shared_client
directly (bypassing the BooruClient.client property) for the
/posts.json, /index.php, and /post.json probes. The hooks set
here are the ones installed on that initial construction — if
detection runs before any BooruClient instance's .client is
accessed, the shared singleton must still have SSRF validation
and connection logging.

This additionally closes finding #16 for the detect client — site
detection requests now appear in the connection log instead of
being invisible.

behavior change from v0.2.5: Test Connection from the site dialog
now rejects private-IP targets. Adding a local/RFC1918 booru via
the "auto-detect type" dialog will fail with "blocked request
target ..." instead of probing it. Explicit api_type selection
still goes through the BooruClient.client path, which is also
now protected.

Audit-Ref: SECURITY_AUDIT.md finding #1
Also-Closes: SECURITY_AUDIT.md finding #16 (detect half)
Severity: High
2026-04-11 16:11:37 -05:00
pax
ef95509551 security: fix #1 — wire SSRF hook into E621Client custom client
E621 maintains its own httpx.AsyncClient because their TOS requires
a per-user User-Agent string that BooruClient's shared client can't
carry. The client is rebuilt on User-Agent change, so the hook must
be installed in the same construction path.

Also installs BooruClient._log_request as a second hook (this
additionally closes finding #16 for the e621 client — e621 requests
previously bypassed the connection log entirely, and this wires
them in consistently with the base client).

Audit-Ref: SECURITY_AUDIT.md finding #1
Also-Closes: SECURITY_AUDIT.md finding #16 (e621 half)
Severity: High
2026-04-11 16:11:12 -05:00
pax
ec79be9c83 security: fix #1 — wire SSRF hook into cache download client
Adds validate_public_request to the cache module's shared httpx
client event_hooks. Covers image/video/thumbnail downloads, which
are the most likely exfil path — file_url comes straight from the
booru JSON response and previously followed any 3xx that landed,
so a hostile booru could point downloads at a private IP. Every
redirect hop is now rejected if the target is non-public.

The import is lazy inside _get_shared_client because
core.api.base imports log_connection from this module; a top-level
`from .api._safety import ...` would circular-import through
api/__init__.py during cache.py load. By the time
_get_shared_client is called the api package is fully loaded.

Audit-Ref: SECURITY_AUDIT.md finding #1
Severity: High
2026-04-11 16:10:50 -05:00
pax
6eebb77ae5 security: fix #1 — wire SSRF hook into BooruClient shared client
Adds validate_public_request to the BooruClient event_hooks list so
every request (and every redirect hop) is checked against the block
list from _safety.py. Danbooru, Gelbooru, and Moebooru subclasses
all go through BooruClient.client and inherit the protection.

Preserves the existing _log_request hook by listing both hooks in
order: validate first (so blocked hops never reach the log), then
log.

Audit-Ref: SECURITY_AUDIT.md finding #1
Severity: High
2026-04-11 16:10:12 -05:00
pax
013fe43f95 security: fix #1 — add public-host validator helper
Introduces core/api/_safety.py containing check_public_host and the
validate_public_request async request-hook. The hook rejects any URL
whose host is (or resolves to) loopback, RFC1918, link-local
(including 169.254.169.254 cloud metadata), CGNAT, unique-local v6,
or multicast. Called on every request hop so it covers both the
initial URL and every redirect target that httpx would otherwise
follow blindly.

Also exports redact_url / redact_params for finding #3 — the
secret-key set lives in the same module since both #1 and #3 work
is wired through httpx client event_hooks. Helper is stdlib-only
(ipaddress, socket, urllib.parse) plus httpx; no new deps.

Not yet wired into any httpx client; per-file wiring commits follow.

Audit-Ref: SECURITY_AUDIT.md finding #1
Severity: High
2026-04-11 16:09:53 -05:00
pax
5261fa176d add search history setting
New setting "Record recent searches" (on by default). When disabled,
searches are not recorded and the Recent section is hidden from the
history dropdown. Saved searches are unaffected.

behavior change: opt-in setting, on by default (preserves existing behavior)
2026-04-10 16:28:43 -05:00
pax
94588e324c add unbookmark-on-save setting
New setting "Remove bookmark when saved to library" (off by default).
When enabled, _maybe_unbookmark runs directly in each save callback
after save_post_file succeeds -- handles DB removal, grid dot, preview
state, popout sync, and bookmarks tab refresh. Wired into all 4 save
paths: save_to_library, bulk_save, save_as, batch_download_to.

behavior change: opt-in setting, off by default
2026-04-10 16:23:54 -05:00
pax
9cc294a16a Revert "add unbookmark-on-save setting"
This reverts commit 08f99a61011532202b22d05750416aa1e754f9c9.
2026-04-10 16:20:26 -05:00
pax
08f99a6101 add unbookmark-on-save setting
New setting "Remove bookmark when saved to library" (off by default).
When enabled, saving a post to the library automatically removes its
bookmark. Handles both single saves (on_bookmark_done) and bulk saves
(on_batch_done). UI toggle in Settings > General.

behavior change: opt-in setting, off by default
2026-04-10 16:19:00 -05:00
pax
d66dc14454 db: fix orphan rows — cascade delete_site, wire up reconcile on startup
delete_site() leaked rows in tag_types, search_history, and
saved_searches; reconcile_library_meta() was implemented but never
called. Add tests for both fixes plus tag cache pruning.
2026-04-10 14:10:57 -05:00
pax
264c421dff cache: skip .part files in evict_oldest
Prevents cache eviction from deleting a .part temp file that mpv's
stream-record is actively writing to. Prerequisite for the stream-record
plumbing in video_player.py.
2026-04-09 20:52:36 -05:00
pax
57a19f87ba gelbooru: re-add background prefetch for batch API fast path only
When _batch_api_works is True (Gelbooru proper with auth, persisted
from a prior session's probe), search() fires prefetch_batch in the
background. The batch tag API covers the entire page's tags in 1-2
requests during the time between grid render and user click — the
cache is warm before the info panel opens, so categories appear
instantly with no flash of flat tags.

Gated on _batch_api_works is True (not None, not False):
  - Gelbooru proper: prefetches (batch API known good)
  - Rule34: skips (batch_api_works = False, persisted)
  - Safebooru.org: skips (no auth → fetcher skips batch capability)

Rule34 / Safebooru.org / Moebooru stay on-demand: the ~200ms
per-click HTML scrape is unavoidable for those sites because their
only path is per-post page fetching, which can't be batched.
2026-04-09 20:01:34 -05:00
pax
f168bece00 category_fetcher: fix _do_ensure to try batch API when not yet probed
_do_ensure only tried the batch API when _batch_api_works was True,
but after removing the search-time prefetch (where the probe used
to run), _batch_api_works stayed None forever. Gelbooru's only
viable path IS the batch API (its post-view HTML has no tag links),
so clicks on Gelbooru posts produced zero categories.

Fix: _do_ensure now tries the batch API when _batch_api_works is
not False (i.e., both True and None). When None, the call doubles
as an inline probe: if the batch produced categories, save True;
if nothing useful came back, save False and fall to HTML.

This is simpler than the old prefetch_batch probe because it runs
on ONE post at a time — no batch/HTML mixing concerns, no "single
path per invocation" rule. The probe result is persisted to DB so
it only fires once per site ever.

Dispatch matrix in _do_ensure:
  _batch_api_works True  + auth → batch API (Gelbooru proper)
  _batch_api_works None  + auth → batch as probe → True or False
  _batch_api_works False        → HTML scrape (Rule34)
  no auth                       → HTML scrape (Safebooru.org)
  transient error               → stays None, retry next click

Verified all three sites from clean cache: Gelbooru 55/56+49/50
(batch), Rule34 40/40+38/38 (HTML), Safebooru.org 47/47+47/47
(HTML).
2026-04-09 19:53:20 -05:00
pax
35424ff89d gelbooru+moebooru: drop background prefetch from search, fetch on demand
Removes the asyncio.create_task(prefetch_batch) calls from
search() and get_post() in both clients. Tags are now fetched
ONLY when the user actually clicks a post (via ensure_categories
in the info panel path) or saves with a category-token template.

The background prefetch was the source of most of the complexity:
probe timing, early-exit bugs from partial composes racing with
on-click ensures, Rule34's slow probe blocking the prefetch
window. All gone.

New flow:
  search() → fast, returns posts with flat tags only
  click    → ensure_categories fires, ~200ms HTML scrape or
             batch API, categories arrive, signal re-renders
  re-click → instant (cache compose, no HTTP)
  save     → ensure in save_post_file, same path

The ~200ms per first-click is invisible during the image load.
The cache compounds across posts and sessions. The prefetch_batch
method stays in CategoryFetcher for potential future use but
nothing calls it from the hot path anymore.
2026-04-09 19:48:04 -05:00
pax
7d11aeab06 category_fetcher: persist batch API probe result across sessions
The probe that detects whether a site's batch tag API works
(Gelbooru proper: yes, Rule34: no) now persists its result in the
tag_types table using a sentinel key (__batch_api_probe__). On
subsequent app launches, the fetcher reads the saved result at
construction time and skips the probe entirely.

Before: every session with Rule34 wasted ~0.6s on a probe request
that always fails (Rule34 returns garbage for names=). During that
time the background prefetch couldn't start HTML scraping, so the
first few post clicks paid ~0.3s each.

After: first ever session probes Rule34 once, stores False. Every
subsequent session reads False from DB, skips the probe, and the
background prefetch immediately starts HTML scraping. By the time
the user clicks any post, the scrape is usually done.

Gelbooru proper: probe succeeds on first session, stores True.
Future sessions use the batch API without probing. No change in
speed (already fast), just saves the probe roundtrip.

Persisted per site_id so different Gelbooru-shaped sites get their
own probe result. The clear_tag_cache method wipes probe results
along with tag data (the sentinel key lives in the same table).
2026-04-09 19:46:20 -05:00
pax
1547cbe55a fix: remove early-exit on non-empty tag_categories in ensure path
Two places checked `if post.tag_categories: return` before doing
a full cache-coverage check, causing posts with partial cache
composes (e.g. 5/40 tags from the background prefetch) to get
stuck at low coverage forever:

  ensure_categories: removed the post.tag_categories early exit.
    Now ALWAYS runs try_compose_from_cache first. Only the 100%
    coverage return (True) is trusted as "done." Partial composes
    return False and fall through to the fetch path.

  _ensure_post_categories_async: removed the post.tag_categories
    guard. Danbooru/e621 are filtered by the client.category_fetcher
    is None check instead (they categorize inline, no fetcher).
    For Gelbooru-style sites, always schedules ensure_categories
    regardless of current post state.

Root cause: the partial-compose fix (try_compose_from_cache
populates tag_categories even when cache coverage is <100%)
conflicted with the early-exit guards that assumed non-empty
tag_categories = fully categorized. Now the only "fully done"
signal is try_compose_from_cache returning True (100% coverage).
2026-04-09 19:40:09 -05:00
pax
762d73dc4f category_fetcher: fix partial-compose vs ensure_categories interaction
try_compose_from_cache was returning True on ANY partial cache hit
(even 1/38 tags). ensure_categories then saw non-empty
tag_categories and returned immediately, leaving the post stuck at
1/38 coverage. The bug showed on Rule34: post 1 got fully scraped
(40/40), its tags got cached, then post 2's compose found one
matching tag and declared victory.

Fix: try_compose_from_cache now returns True ONLY when 100% of
unique tags have cached labels (no fetch needed). It STILL
populates post.tag_categories with whatever IS cached (for
immediate partial display), but returning False signals
ensure_categories to continue to the fetch path.

This is the correct semantic split:
  - populate → always (for display)
  - return True → only when complete (for dispatch)

Verified:
  Rule34:       40/40 + 38/38 (was 40/40 + 1/38)
  Gelbooru:     55/56 + 49/50 (batch API, one rare tag)
  Safebooru.org: 47/47 + 47/47 (HTML scrape, full)
2026-04-09 19:36:58 -05:00
pax
f0fe52c886 fix: HTML parser two-pass rewrite + fire-and-forget prefetch
Three fixes:

1. HTML parser completely rewritten with two-pass approach:
   - Pass 1: regex finds each tag-type element and its full inner
     content (up to closing </li|span|td|div>)
   - Pass 2: within the content, extracts the tag name from the
     tags=NAME URL parameter in the search link
   The old single-pass regex captured the ? wiki-link (first <a>)
   instead of the tag name (second <a>). The URL-param extraction
   works on Rule34 (40 tags), Safebooru.org (47 tags), and
   yande.re (3 tags). Gelbooru proper returns 0 (post page only
   has ? links with no tags= param) which is correct — Gelbooru
   uses the batch tag API instead.

2. prefetch_batch is now truly fire-and-forget:
   gelbooru.py and moebooru.py use asyncio.create_task instead of
   await for prefetch_batch. search() returns immediately. The
   probe + batch/HTML fetch runs in the background. Previously
   search() blocked on the probe, which made Rule34 searches take
   5+ seconds (slow/broken Rule34 API response time).

3. Partial cache compose already fixed in the previous commit
   complements this: posts with 49/50 cached tags now show all
   available categories instead of nothing.
2026-04-09 19:31:43 -05:00
pax
165733c6e0 category_fetcher: compose from partial cache coverage
try_compose_from_cache previously required 100% cache coverage —
every tag in the post had to have a cached label or it returned
False and populated nothing. One rare uncached tag out of 50
blocked the entire composition, leaving the post with zero
categories even though 49/50 labels were available.

Fix: compose whatever IS cached, return True when at least one
tag got categorized. Tags not in the cache are simply absent from
the categories dict (they stay in the flat tags string). The
return value now means "the post has usable categories" rather
than "the post has complete categories." This distinction matters
because the dispatch logic uses the return value to decide
whether to skip the fetch path — partial coverage is better than
no coverage, and the missing tags get cached eventually when
other posts that contain them get fetched.

Verified against Gelbooru: post with 50 tags where 49 were cached
now gets 49/50 categorized (Artist, Character, Copyright, General,
Meta) instead of 0/50.
2026-04-09 19:23:57 -05:00
pax
8f8db62a5a library_save: ensure categories before template render
save_post_file is now async and gains an optional
category_fetcher parameter. When the template uses any category
token (%artist%, %character%, %copyright%, %general%, %meta%,
%species%) AND the post's tag_categories is empty AND a fetcher
is available, it awaits ensure_categories(post) before calling
render_filename_template. This guarantees the filename is
correct even when saving a post the user hasn't clicked
(bypassing the info panel's on-display trigger).

When the template uses only non-category tokens (%id%, %md5%,
%score%, %rating%, %ext%) or is empty, the ensure check is
skipped entirely — no HTTP overhead for the common case.

Every existing caller already runs from _run_async closures,
so the sync→async signature change is mechanical. The callers
are updated in the next two commits to pass category_fetcher.
2026-04-09 19:18:13 -05:00
pax
f5954d1387 api: factory constructs CategoryFetcher for Gelbooru + Moebooru sites
client_for_type gains optional db + site_id kwargs. When both are
passed and api_type is gelbooru or moebooru, a CategoryFetcher is
constructed and assigned to client.category_fetcher. The fetcher
owns the per-tag cache, the batch tag API fast path, and the
per-post HTML scrape fallback.

Danbooru and e621 never get a fetcher — their inline JSON
categorization is already optimal.

Test Connection dialog and scripts don't pass db/site_id, so they
get fetcher-less clients with the existing search behavior.
2026-04-09 19:15:57 -05:00
pax
834deecf57 moebooru: implement _post_view_url + prefetch wiring
Override _post_view_url to return /post/show/{id} for the per-post
HTML scrape path. No _tag_api_url override — Moebooru has no batch
tag DAPI; the CategoryFetcher dispatch goes straight to per-post
HTML for these sites.

search() and get_post() now call prefetch_batch when a fetcher is
attached, same fire-and-forget pattern as gelbooru.py.
2026-04-09 19:15:34 -05:00
pax
7f897df4b2 gelbooru: implement _post_view_url + _tag_api_url + prefetch wiring
Overrides both URL methods from the base class:
  _post_view_url(post) -> /index.php?page=post&s=view&id={id}
    Universal HTML scrape path — works on Gelbooru proper, Rule34,
    Safebooru.org without auth.
  _tag_api_url() -> {base_url}/index.php
    Batch tag DAPI fast path. The CategoryFetcher's probe-and-cache
    determines at runtime whether the endpoint actually honors
    names=. Gelbooru proper: probe succeeds. Rule34: probe fails
    (garbage response), falls back to HTML. Safebooru.org: no auth,
    dispatch skips batch entirely.

search() and get_post() now call
    await self.category_fetcher.prefetch_batch(posts)
after building the post list, when a fetcher is attached. The
prefetch is fire-and-forget — search returns immediately and the
background tasks fill categories as the user reads. When no
fetcher is attached (Test Connection dialog, scripts), this is a
no-op and behavior is unchanged.
2026-04-09 19:15:02 -05:00
pax
5ba0441be7 e621: populate categories in get_post (latent bug fix) 2026-04-09 19:14:19 -05:00
pax
9001808951 danbooru: populate categories in get_post (latent bug fix) 2026-04-09 19:13:52 -05:00
pax
8f298e51fc api: BooruClient virtual _post_view_url + _tag_api_url + category_fetcher attr
Three additions to the base class, all default-inactive:

  _post_view_url(post) -> str | None
    Override to provide the post-view HTML URL for the per-post
    category scrape path. Default None (Danbooru/e621 skip it).

  _tag_api_url() -> str | None
    Override to provide the batch tag DAPI base URL for the fast
    path in CategoryFetcher. Default None. Only Gelbooru proper
    benefits — the fetcher's probe-and-cache determines at runtime
    whether the endpoint actually honors the names= parameter.

  self.category_fetcher = None
    Set externally by the factory (client_for_type) when db and
    site_id are available. Gelbooru-shape and Moebooru clients use
    it; Danbooru/e621 leave it None.

No behavior change at this commit. Existing clients inherit the
defaults and continue working identically.
2026-04-09 19:13:21 -05:00
pax
e00d88e1ec api: CategoryFetcher module with HTML scrape + batch tag API + cache
New module core/api/category_fetcher.py — the unified tag-category
fetcher for boorus that don't return categories inline.

Public surface:
  try_compose_from_cache(post) — instant, no HTTP. Builds
    post.tag_categories from cached (site_id, name) -> label
    entries. Returns True if every tag in the post is cached.
  fetch_via_tag_api(posts) — batch fast path. Collects uncached
    tags across posts, chunks into 500-name batches, GETs the
    tag DAPI. Only available when the client declares _tag_api_url
    AND has credentials (Gelbooru proper). Includes JSON/XML
    sniffing parser ported from the reverted code.
  fetch_post(post) — universal fallback. HTTP GETs the post-view
    HTML page, regex-extracts class="tag-type-X">name</a>
    markup. Works on every Gelbooru fork and every Moebooru
    deployment. Does NOT require auth.
  ensure_categories(post) — idempotent dispatch: cache compose ->
    batch API (if available) -> HTML scrape. Coalesces concurrent
    calls for the same post.id via an in-flight task dict.
  prefetch_batch(posts) — fire-and-forget background prefetch.
    ONE fetch path per invocation (no mixing batch + HTML).

Probe-and-cache for the batch tag API:
  _batch_api_works = None -> not yet probed OR transient error
                              (retry next call)
  _batch_api_works = True -> batch works (Gelbooru proper)
  _batch_api_works = False -> clean 200 + zero matching names
                               (Rule34's broken names= filter)
  Transition to True/False is permanent per instance. Transient
  errors (HTTP error, timeout, parse exception) leave None so the
  next search retries the probe.

HTML regex handles both standard tag-type-artist and combined-
class forms like tag-link tag-type-artist (Konachan). Tag names
normalized to underscore-separated lowercase.

Canonical category order: Artist > Character > Copyright >
Species > General > Meta > Lore (matches danbooru/e621 inline).

Dead code at this commit — no integration yet.
2026-04-09 19:12:43 -05:00
pax
5395569213 db: re-add tag_types cache table with string labels + auto-prune
Per-site tag-type cache for boorus that don't return categories
inline. Uses string labels ("Artist", "Character", "Copyright",
"General", "Meta") instead of the integer codes the reverted
version used — the labels come directly from HTML class names,
no mapping step needed.

Schema: tag_types(site_id, name, label TEXT, fetched_at)
        PRIMARY KEY (site_id, name)

Methods:
  get_tag_labels(site_id, names) — chunked 500-name SELECT
  set_tag_labels(site_id, mapping) — bulk INSERT OR REPLACE,
    auto-prunes oldest entries when the table exceeds 50k rows
  clear_tag_cache(site_id=None) — manual wipe, for future
    Settings UI "Clear tag cache" button

The 50k row cap prevents unbounded growth over months of
browsing multiple boorus. Normal usage (a few thousand unique
tags per site) never reaches it. When exceeded, the oldest
entries by fetched_at are pruned first — these are the tags the
user hasn't encountered recently and would be re-fetched cheaply
if needed.

Migration: CREATE TABLE IF NOT EXISTS in _migrate(), non-breaking
for existing databases.
2026-04-09 19:10:37 -05:00
pax
150970b56f cache: delete_from_library cleans up library_meta + matches templated names
Two related fixes that the old delete flow was missing:

1. delete_from_library now accepts an optional `db` parameter which
   it forwards to find_library_files. Without `db`, only digit-stem
   files match (the old behavior — preserved as a fallback). With
   `db`, templated filenames stored in library_meta also match,
   so post-refactor saves like 12345_hatsune_miku.jpg get unlinked
   too. Without this fix, "Unsave from Library" on a templated
   save was a silent no-op.

2. Always cleans up the library_meta row when called with `db`, not
   just when files were unlinked. Two cases this matters for:
     a. Files were on disk and unlinked → meta is now stale.
     b. Files were already gone but the meta lingered (orphan from
        a previous broken delete) → user asked to "unsave," meta
        should reflect that.
   This is the missing half of the cleanup that left some libraries
   with pathologically more meta rows than actual files.
2026-04-09 18:25:21 -05:00