Bug: Garbage collector always overwrites cached file even if content unchanged #684

porg commented May 8, 2023

Negative Consequences

  • This totally breaks the cache policy “cache with validation”.
  • Why? → Overwriting changes the file’s Last-Modified, and in consequence the ETag changes too, because W3TC’s setup calculates the ETag from MTime and Size (see: Bug in detail). As a result, revisiting users unnecessarily re-download the full page (HTTP 200) instead of using their local cache (HTTP 304 Not Modified).

Proposed Fix

  1. The garbage collector visits URL-X, loads the output into memory and hashes it.
  2. The garbage collector compares that hash to the hash noted down for the cached copy of URL-X.
  3. If the hashes are identical, the content is considered unchanged and the garbage collector skips to the next URL.
  4. If the hashes differ, it overwrites the cached file for URL-X and notes down the new hash.
    • Where and how exactly the hash is stored is up to your developer expertise, of course! Ideas from me as a layman:
    • Either in the file metadata (as part of the filename or in an xattr). This adds almost no extra load, as the file is accessed anyway.
    • Or separate from the cache files (possibly utilizing a RAM-based caching pool). I don’t know which is more efficient. Am tech-savvy but no dev.

Note: W3TC “Cache Preload” does exactly that already. Maybe you do not need to develop anything new, but just need to ensure that the Garbage Collector uses the same functions. (A rough sketch of the compare-before-overwrite logic follows below.)
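To make the proposed flow concrete, here is a minimal sketch in Python of the compare-before-overwrite loop. W3TC itself is PHP, so this is purely illustrative: the `fetch` callable, the function names and the sidecar `.hash` file are my own hypothetical choices for steps 1–4 and for the “where to note down the hash” question.

```python
import hashlib
import os

def content_hash(html: str) -> str:
    """Hash only the page body, ignoring anything appended after </html>
    (W3TC's debug/stats comment lives there, see "Notes on the Hash")."""
    end = html.rfind("</html>")
    body = html[: end + len("</html>")] if end != -1 else html
    return hashlib.sha256(body.encode("utf-8")).hexdigest()

def refresh_cached_page(url: str, cache_file: str, fetch) -> bool:
    """Re-generate a cached page only if its content actually changed.

    `fetch` is a hypothetical callable returning the pure HTML the CMS
    outputs for `url`. Returns True if the cache file was rewritten.
    """
    fresh_html = fetch(url)                      # step 1: load output, then hash it
    fresh_hash = content_hash(fresh_html)

    # Hypothetical sidecar file holding the hash of the last stored content;
    # an xattr on the cache file would serve the same purpose.
    hash_file = cache_file + ".hash"
    old_hash = None
    if os.path.exists(hash_file):
        with open(hash_file, "r", encoding="utf-8") as fh:
            old_hash = fh.read().strip()

    if old_hash == fresh_hash:                   # steps 2 + 3: identical -> skip
        # Do NOT touch the cache file, so its MTime (and therefore
        # Last-Modified and the ETag) stay stable and clients keep getting 304s.
        return False

    # Step 4: content changed -> overwrite and note down the new hash.
    with open(cache_file, "w", encoding="utf-8") as fh:
        fh.write(fresh_html)
    with open(hash_file, "w", encoding="utf-8") as fh:
        fh.write(fresh_hash)
    return True
```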

Bug in detail

/wp-content/cache/page_enhanced/.htaccess has the directive:

FileETag MTime Size

Let’s think about what this means for totally unchanged content fetched by the Garbage Collector:

  • Size between cached and current remains unchanged.
  • But because the Garbage Collector stubbornly always overwrites, the MTime of the file changes.
  • So the Last-Modified HTTP header changes.
  • And in consequence the ETag HTTP header changes too, because the FileETag directive in W3TC’s setup means that the ETag gets calculated from a combination of MTime and Size (the sketch after this list mimics that calculation).
  • ❗️ So although NOTHING changed, the cached content gets overwritten.
  • ❗️ All revisiting users who have the page in their local cache unnecessarily re-download the very same page again.
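To see why the overwrite alone is enough to break validation, the sketch below imitates an MTime+Size based ETag. The exact format Apache uses for `FileETag MTime Size` is an internal detail, so the hex formatting here is only an assumption for illustration; the point is that rewriting byte-for-byte identical content still yields a new ETag because the MTime moved.

```python
import os
import time

def mtime_size_etag(path: str) -> str:
    """Imitate an 'FileETag MTime Size'-style validator (format is illustrative only)."""
    st = os.stat(path)
    return f'"{int(st.st_mtime):x}-{st.st_size:x}"'

html = "<html><body>unchanged content</body></html>"

with open("page.html", "w") as fh:      # initial cache write
    fh.write(html)
etag_before = mtime_size_etag("page.html")

time.sleep(1)                           # let the clock advance
with open("page.html", "w") as fh:      # garbage collector "refresh" with identical bytes
    fh.write(html)
etag_after = mtime_size_etag("page.html")

# Size is identical, but the new MTime changes the ETag (and Last-Modified),
# so every conditional request now gets a full 200 instead of a 304.
print(etag_before, etag_after, etag_before == etag_after)
```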

Notes on the Hash

  • As W3TC puts <!-- Debug/stats --> comments in the last lines of HTML files, these must never be included when calculating the hash, as this ever-changing noise would of course always result in a different hash. But my proposal never hashes full files anyway, only the pure HTML as output by the live CMS.
  • In step 1 the pure output from the CMS is in memory and gets hashed.
  • In step 4 the hash of that pure HTML content (debug lines not included!) is noted down.
  • In step 2 the hash that was noted down for the cached content (i.e. the hash of the unaltered CMS output) is used in the comparison.
  • So this should be conflict-free. (The sketch after this list demonstrates the stability of such a hash.)
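A tiny demo of why stripping the trailer makes the hash stable. The exact wording of the debug comment below is made up; only its position after the closing `</html>` tag matters here.

```python
import hashlib

def strip_debug_trailer(html: str) -> str:
    """Drop everything after </html>, where the debug/stats comment gets appended."""
    end = html.rfind("</html>")
    return html[: end + len("</html>")] if end != -1 else html

def sha256(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

page = "<html><body>Hello</body></html>"
# Two renderings of the identical page, differing only in the appended debug noise
# (the comment text is invented for this example).
render_1 = page + "\n<!-- Served in 0.012 seconds -->"
render_2 = page + "\n<!-- Served in 0.034 seconds -->"

print(sha256(render_1) == sha256(render_2))                                            # False: noise breaks the comparison
print(sha256(strip_debug_trailer(render_1)) == sha256(strip_debug_trailer(render_2)))  # True: stable hash
```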

My Website Environment

Small personal portfolio website

  • ~ 100 pages, predicting max +5 per year
  • ~ 10 blogposts, predicting +30 per year
  • Main domain caching only HTML + all plugin/theme assets (CSS/JS/fonts/…)
  • Media Library on subdomain, could get CDN one day

Shared Hosting which includes

  • OPcode — on/off ; timeout: 30 secs, 5 min, 1 hour, 4 hours
  • Caching pool 16 MB: Memcached OR Redis (both as UNIX socket)
  • Varnish — on/off ; timeout: 1, 3, 5, 15, 30, 60 mins
  • Hardware: 100% SSD hosting, high-end HP ProLiant servers

W3TC fundamentals

  • Page Cache: Disk: Enhanced
  • Opcode Cache: Zend Opcache — speedup in the backend is on average 2x, sometimes peaking at 3x, a great efficiency improvement!
  • Object Cache: Memcached
  • Database Cache: OFF — because I guess the 16 MB are already utilized enough by the Object Cache
  • Varnish: OFF during debugging, to not complicate things. But when I tested it, it again was a great extra speedup (reducing latency from ca. 80–100 ms to 20 ms).

My Website Usability and Performance Goals

  1. Load is not a concern for now. Not a motivation for caching.

  2. Want my users to browse fast. Main motivation for caching.

  3. Don’t want the first visitor of a new or freshly purged page to have to wait for page cache generation.

  • Hence: Page Cache → Cache Preload:

    • a) ☑︎ Automatically prime the page cache

    • b) ☑︎ Preload the post cache upon publish events

  4. Want my users to always have a chance to get the most recent version of a page/post.
  • a) Settings from 3 are enough to provide this for first-time visitors.

  • b) But by themselves NOT enough for re-visiting users!

    • Hence: Browser Cache → HTML/XML → Cache-Policy:

    • b1) If “cache with max-age” is used and the content changes before the Max-Age expires, then revisiting users will still have the page marked as “fresh” in their local cache, hence will not re-validate, and will see the outdated local version instead of the renewed content in the CMS.

    • b2) If “cache with validation” is used, then revisiting users get:

      • CON: It takes a tiny bit longer than trusting only Max-Age. The browser sends a conditional request, the server sees that Last-Modified and ETag still match, responds with HTTP 304 Not Modified with minimal latency, and the browser shows the locally cached page. Only if the content really changed does the server re-send it.

      • PRO: They reliably always get the most recent content with only minimal overhead: one round trip of HTTP metadata exchange; no content needs to be transferred if it is unchanged.

      • BUG: That benefit is destroyed by the bug described above. (The sketch after this list shows the revalidation round trip that the bug breaks.)
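For completeness, this is roughly the revalidation round trip described in b2), sketched with the Python requests library. The URL is a placeholder; on a correctly validating cache the second request should come back as 304 with an empty body, while the garbage-collector bug turns it into a full 200 even though the content is identical.

```python
import requests

url = "https://example.com/some-cached-page/"   # placeholder URL

first = requests.get(url)
etag = first.headers.get("ETag")
last_modified = first.headers.get("Last-Modified")
print(first.status_code, len(first.content))    # 200 + full page on the first visit

# What a returning browser does under "cache with validation":
headers = {}
if etag:
    headers["If-None-Match"] = etag
if last_modified:
    headers["If-Modified-Since"] = last_modified
second = requests.get(url, headers=headers)

# Expected with an unchanged page: 304 Not Modified, empty body, one metadata round trip.
# After the garbage collector's needless overwrite the validators no longer match,
# so this comes back as a full 200 re-download instead.
print(second.status_code, len(second.content))
```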
