feat: page level caching #3158

Open · wants to merge 24 commits into main

Conversation

@elevatebart (Contributor) commented Feb 19, 2025

πŸ”— Linked issue

closes #3151

❓ Type of change

  • πŸ“– Documentation (updates to the documentation or readme)
  • 🐞 Bug fix (a non-breaking change that fixes an issue)
  • πŸ‘Œ Enhancement (improving an existing functionality like performance)
  • ✨ New feature (a non-breaking change that adds functionality)
  • ⚠️ Breaking change (fix or feature that would cause existing functionality to change)

πŸ“š Description

This PR moves the caching level to each page/record instead of each collection.

This allows for faster iteration, since unchanged records no longer need to be re-inserted on every build.

  • Build time
    • Only take the table definition into account in the table hash
    • Add a __hash__ field on each content table to store the hash of the current record
      • It should have a UNIQUE constraint
    • At build time, calculate the hash of the record values, then append that hash at the end of the values before generating the query
    • Store the list of hashes as the first line of the database dump
      • Store it in a SQL comment to keep the dump valid SQL
      • Store it as a JSON array to make parsing easier and avoid problematic characters
  • At runtime (a rough sketch follows this list)
    • Start by getting the list of existing hashes from the database
    • From the database dump, get the list of hashes that should be present after init
    • Compute the list of hashes that need to be removed and delete them (splitting into multiple queries if necessary)
    • Run the needed inserts from the dump
      • Since the queries and the hashes are in the same order, use the array to know which inserts to run
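
A rough sketch of this runtime flow in TypeScript (not the actual module code: the Database interface, the applyDump name and the parsing details are assumptions for illustration, a single _content_content table is assumed, and structure changes are ignored):

interface Database {
  all<T>(sql: string): Promise<T[]>
  exec(sql: string): Promise<void>
}

async function applyDump(db: Database, dump: string) {
  const lines = dump.split('\n').filter(Boolean)
  // 1. The first line is a SQL comment holding the JSON array of row hashes.
  const dumpHashes: string[] = JSON.parse(lines[0].replace(/^--\s*/, ''))
  // 2. Hashes already present in the database.
  const rows = await db.all<{ __hash__: string }>('SELECT __hash__ FROM _content_content')
  const existing = new Set(rows.map(r => r.__hash__))
  // 3. Delete records whose hash no longer appears in the dump (deletes run first).
  const obsolete = [...existing].filter(h => !dumpHashes.includes(h))
  if (obsolete.length > 0) {
    await db.exec(`DELETE FROM _content_content WHERE __hash__ IN (${obsolete.map(h => `'${h}'`).join(', ')})`)
  }
  // 4. Run only the inserts whose hash is not already in the database;
  //    inserts and hashes are kept in the same order at build time.
  const inserts = lines.filter(l => l.startsWith('INSERT INTO _content_content'))
  for (const [i, statement] of inserts.entries()) {
    if (!existing.has(dumpHashes[i])) {
      await db.exec(statement)
    }
  }
}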

πŸ“ Checklist

  • I have linked an issue or discussion.
  • I have updated the documentation accordingly.

@elevatebart marked this pull request as draft February 19, 2025 12:38
@elevatebart (Contributor, Author)

This PR still needs some extra tests for updating an existing db.

pkg-pr-new bot commented Feb 19, 2025

npm i https://pkg.pr.new/@nuxt/content@3158

commit: 53198e2

@elevatebart changed the title from "feat: page level caching (WIP)" to "feat: page level caching" Feb 19, 2025
@elevatebart marked this pull request as ready for review February 19, 2025 19:44
@elevatebart (Contributor, Author)

Live testing has been done on this PR.

I would love some help with E2E testing of updates.
I can't find a way to set up a project, build it, and serve it... especially since I am trying to build/serve multiple times with file changes.

If you have the time to show me the ropes, I am all ears.
If not, I'd rather have you write the E2E tests here.

Test scenarios (roughly outlined below):

  • initialization works on an empty database
  • when the collection structure changes (e.g. a new field is added), the table is dropped
  • when adding a md doc, the new doc is added to the DB
  • when removing a doc, it is removed from the DB
  • when updating a doc, it is updated in the DB
  • All 3 together (add some, remove some, update some)

To be complete, docs should live in multiple directories so that re-ordering has an effect.
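
A rough outline of these scenarios in Vitest (assuming the project's Vitest setup; the plumbing to build and serve the fixture multiple times is exactly the part that still needs figuring out):

import { describe, it } from 'vitest'

describe('page level caching', () => {
  it.todo('initializes an empty database from the dump')
  it.todo('drops and recreates the table when the collection structure changes')
  it.todo('inserts only the new record when a md doc is added')
  it.todo('deletes only the matching record when a doc is removed')
  it.todo('replaces only the matching record when a doc is updated')
  it.todo('handles add + remove + update across multiple directories in one rebuild')
})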

@farnabaz (Member) left a comment

Generally, I like the idea.
I have concerns about the solution. Removing drop queries is a breaking change for the full dump and breaks NuxtHub deployment. (NuxtHub uses this full dump to refill the database)

What if we keep the drop query and prefix it with the structure checksum (/* checksum */ DROP ...), then ignore these lines if the checksum has not changed?

We can add similar prefixes for insert and update queries. This would simplify the hash detection here:

if (unchangedStructure && (sql.startsWith('INSERT ') || sql.startsWith('UPDATE ')) && !indexesToInsert.includes(index)) {

The dump could look like:

/* structure:{HASH} */ DROP TABLE ......
/* structure:{HASH} */ CREATE TABLE ...
/* hash:{ROW_HASH} */ INSERT ...
/* hash:{ROW_HASH} */ UPDATE ...

With these comments, the dump stays executable outside of our logic, and we still optimize our import.

WDYT?
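
A rough sketch of the line filtering this prefix scheme would enable (hypothetical shouldRun helper; assumes every structure and record statement carries the corresponding prefix):

function shouldRun(line: string, structureChanged: boolean, existingHashes: Set<string>): boolean {
  const structure = line.match(/^\/\* structure:(\w+) \*\//)
  if (structure) {
    // DROP/CREATE lines only run when the structure checksum changed
    return structureChanged
  }
  const row = line.match(/^\/\* hash:(\w+) \*\//)
  if (row) {
    // INSERT/UPDATE lines are skipped when the row hash is already in the database,
    // unless the table was just dropped because the structure changed
    return structureChanged || !existingHashes.has(row[1])
  }
  // Lines without a prefix always run
  return true
}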

// If the structure has not changed,
// skip any insert/update line whose hash is already in the database.
// If not, since we dropped the table, no record is skipped, insert them all again.
if (unchangedStructure && (sql.startsWith('INSERT ') || sql.startsWith('UPDATE ')) && !indexesToInsert.includes(index)) {
Member

The values of indexesToInsert do not match the indexes of the queries in the dump, therefore indexesToInsert.includes(index) is not a valid condition to check.

Contributor Author

The hashes indexed here come directly from the dump:

  1. So they are ordered the same way as the insert queries; I make sure of that at build time.
  2. The array of hashes is padded at the beginning with the exact number of lines in the init: if the init has 3 lines, we insert 3 empty strings at the beginning.
  3. That is why I would have loved a few e2e tests.
  4. I like your idea of having the hash at the beginning of the line, so the code is simpler to understand.

@elevatebart (Contributor, Author) commented Feb 25, 2025

I have tweaked the system a little:

  • restored the DROP TABLE statements in the dump
  • added the hash as a comment in front of updates and inserts
  • ignored all structure queries if the structure has not changed
  • made sure that info version inserts and updates are always committed

Example of the resulting dump:

-- ["VwBBpsSOHZ","EO7IcoosJP"]
CREATE TABLE IF NOT EXISTS _content_info (id TEXT PRIMARY KEY, "ready" BOOLEAN, "structureVersion" VARCHAR, "version" VARCHAR, "__hash__" TEXT UNIQUE);
/* starting_init */ INSERT INTO _content_info VALUES ('checksum_content', false, 'fr2JfCq2gl', 'v3.2.3--dE9pgeuPBt', '3hvYqGvzZ8');
DROP TABLE IF EXISTS _content_content;
CREATE TABLE IF NOT EXISTS _content_content (id TEXT PRIMARY KEY, "title" VARCHAR, "body" TEXT, "description" VARCHAR, "extension" VARCHAR, "meta" TEXT, "navigation" TEXT DEFAULT true, "path" VARCHAR, "seo" TEXT DEFAULT '{}', "stem" VARCHAR, "__hash__" TEXT UNIQUE);
/* VwBBpsSOHZ */ INSERT INTO _content_content VALUES ('content/about.md', 'About', '{"type":"minimal","value":[["h1",{"id":"about"},"About"]],"toc":{"title":"","searchDepth":2,"depth":2,"links":[]}}', '', 'md', '{"booleanField":false,"numberField":123,"arrayField":["item3","item4"]}', 'true', '/about', '{"title":"About","description":""}', 'about', 'VwBBpsSOHZ');
/* EO7IcoosJP */ INSERT INTO _content_content VALUES ('content/index.md', 'Home page', '{"type":"minimal","value":[["h1",{"id":"home-page"},"Home page"]],"toc":{"title":"","searchDepth":2,"depth":2,"links":[]}}', '', 'md', '{"booleanField":true,"numberField":1,"arrayField":["item1","item2"]}', 'true', '/', '{"title":"Home page","description":""}', 'index', 'EO7IcoosJP');
/* successful_init */ UPDATE _content_info SET ready = true WHERE id = 'checksum_content';

NOTA: I kept the first line of hashes as an index that I do not need to rebuild as I run through the dump.
The deletes have to be done first to avoid primary key conflicts.
To delete the records that need cleaning, I need the list of hashes.
To get that list of hashes, I would otherwise need to roll through the whole dump, which could be costly.

Still, I will optimize for readability and remove that first line.

@elevatebart (Contributor, Author) commented Feb 25, 2025

In that last commit I changed the format of the dump once more:

CREATE TABLE IF NOT EXISTS _content_info (id TEXT PRIMARY KEY, "ready" BOOLEAN, "structureVersion" VARCHAR, "version" VARCHAR, "__hash__" TEXT UNIQUE); /* structure */
INSERT INTO _content_info VALUES ('checksum_content', false, 'RYgZDcy6sk', 'v3.2.3--DZOppTJAd2', 'synekhxovA');
DROP TABLE IF EXISTS _content_content; /* structure */
CREATE TABLE IF NOT EXISTS _content_content (id TEXT PRIMARY KEY, "title" VARCHAR, "arrayField" TEXT, "body" TEXT, "booleanField" BOOLEAN, "description" VARCHAR, "extension" VARCHAR, "meta" TEXT, "navigation" TEXT DEFAULT true, "numberField" INT, "path" VARCHAR, "seo" TEXT DEFAULT '{}', "stem" VARCHAR, "__hash__" TEXT UNIQUE); /* structure */
INSERT INTO _content_content VALUES ('content/index.md', 'Home page', '["item1","item2"]', '{"type":"minimal","value":[["h1",{"id":"home-page"},"Home page"]],"toc":{"title":"","searchDepth":2,"depth":2,"links":[]}}', true, '', 'md', '{}', 'true', 1, '/', '{"title":"Home page","description":""}', 'index', 'JApZF0Dncz'); /* checksum: JApZF0Dncz */
UPDATE _content_info SET ready = true WHERE id = 'checksum_content'; 

This feels more readable to both human beings and SQL scripts.

All comments are at the end of the statement to avoid changing the tests.
Structure-related statements are flagged with a /* structure */ suffix.
Version-related statements must never be ignored, so they have no comment.
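
A sketch of how this final format could be consumed (the classify helper is hypothetical; names are illustrative only):

function classify(statement: string): { kind: 'structure' | 'record' | 'always', hash?: string } {
  // "... /* structure */" -> run only when the structure hash changed
  if (statement.trimEnd().endsWith('/* structure */')) {
    return { kind: 'structure' }
  }
  // "... /* checksum: X */" -> run only when hash X is not already in the database
  const match = statement.match(/\/\* checksum: (\w+) \*\/\s*$/)
  if (match) {
    return { kind: 'record', hash: match[1] }
  }
  // No trailing comment (info/version statements) -> always run
  return { kind: 'always' }
}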

Successfully merging this pull request may close these issues.

Caching on page level