From 7be80d450864dce0c173e975177e94e61df3adf9 Mon Sep 17 00:00:00 2001 From: Pepijn Van Eeckhoudt Date: Fri, 17 Feb 2023 01:18:45 +0100 Subject: [PATCH 01/28] Complete SQLite database parsing - Make all database page types parseable - Add cell content overflow handling - Add UTF-16 text encoding support - Make free page list and overflow page lists accessible --- database/sqlite3.ksy | 560 +++++++++++++++++++++++++++++++------------ 1 file changed, 409 insertions(+), 151 deletions(-) diff --git a/database/sqlite3.ksy b/database/sqlite3.ksy index 8c7da44dc..b39549850 100644 --- a/database/sqlite3.ksy +++ b/database/sqlite3.ksy @@ -27,218 +27,476 @@ doc: | versions, size of page, etc). After the header, normal contents of the first page follow. - Each page would be of some type, and generally, they would be - reached via the links starting from the first page. First page type - (`root_page`) is always "btree_page". + Each page would be of some type (btree, ptrmap, lock_byte, or free), + and generally, they would be reached via the links starting from the + first page. The first page is always a btree page for the implicitly + defined `sqlite_schema` table. doc-ref: https://www.sqlite.org/fileformat.html seq: - - id: magic - contents: ["SQLite format 3", 0] - - id: len_page_mod - type: u2 - doc: | - The database page size in bytes. Must be a power of two between - 512 and 32768 inclusive, or the value 1 representing a page size - of 65536. - - id: write_version - type: u1 - enum: versions - - id: read_version - type: u1 - enum: versions - - id: reserved_space - type: u1 - doc: Bytes of unused "reserved" space at the end of each page. Usually 0. - - id: max_payload_frac - type: u1 - doc: Maximum embedded payload fraction. Must be 64. - - id: min_payload_frac - type: u1 - doc: Minimum embedded payload fraction. Must be 32. - - id: leaf_payload_frac - type: u1 - doc: Leaf payload fraction. Must be 32. - - id: file_change_counter - type: u4 - - id: num_pages - type: u4 - doc: Size of the database file in pages. The "in-header database size". - - id: first_freelist_trunk_page - type: u4 - doc: Page number of the first freelist trunk page. - - id: num_freelist_pages - type: u4 - doc: Total number of freelist pages. - - id: schema_cookie - type: u4 - - id: schema_format - type: u4 - doc: The schema format number. Supported schema formats are 1, 2, 3, and 4. - - id: def_page_cache_size - type: u4 - doc: Default page cache size. - - id: largest_root_page - type: u4 - doc: The page number of the largest root b-tree page when in auto-vacuum or incremental-vacuum modes, or zero otherwise. - - id: text_encoding - type: u4 - enum: encodings - doc: The database text encoding. A value of 1 means UTF-8. A value of 2 means UTF-16le. A value of 3 means UTF-16be. - - id: user_version - type: u4 - doc: The "user version" as read and set by the user_version pragma. - - id: is_incremental_vacuum - type: u4 - doc: True (non-zero) for incremental-vacuum mode. False (zero) otherwise. - - id: application_id - type: u4 - doc: The "Application ID" set by PRAGMA application_id. - - id: reserved - size: 20 - - id: version_valid_for - type: u4 - - id: sqlite_version_number - type: u4 - - id: root_page - type: btree_page + - id: header + type: database_header instances: - len_page: - value: 'len_page_mod == 1 ? 0x10000 : len_page_mod' + pages: + type: + switch-on: '(_index == header.lock_byte_page_index ? 0 : _index >= header.first_ptrmap_page_index and _index <= header.last_ptrmap_page_index ? 1 : 2)' + cases: + 0: lock_byte_page(_index + 1) + 1: ptrmap_page(_index + 1) + # TODO: Free pages and cell overflow pages are incorrectly interpreted as btree pages + # This is unfortunate, but unavoidable since there's no way to recognize these types at + # this point in the parser. + 2: btree_page(_index + 1) + pos: 0 + size: header.page_size + repeat: expr + repeat-expr: header.num_pages types: + database_header: + seq: + - id: magic + contents: ["SQLite format 3", 0] + - id: page_size_raw + type: u2 + doc: | + The database page size in bytes. Must be a power of two between + 512 and 32768 inclusive, or the value 1 representing a page size + of 65536. The interpreted value is available as `page_size`. + - id: write_version + type: u1 + enum: format_version + doc: File format write version + - id: read_version + type: u1 + enum: format_version + doc: File format read version + - id: page_reserved_space_size + type: u1 + doc: Bytes of unused "reserved" space at the end of each page. Usually 0. + - id: max_payload_fraction + type: u1 + doc: Maximum embedded payload fraction. Must be 64. + - id: min_payload_fraction + type: u1 + doc: Minimum embedded payload fraction. Must be 32. + - id: leaf_payload_fraction + type: u1 + doc: Leaf payload fraction. Must be 32. + - id: file_change_counter + type: u4 + - id: num_pages + type: u4 + doc: Size of the database file in pages. The "in-header database size". + - id: first_freelist_trunk_page + type: freelist_trunk_page_pointer + doc: Page number of the first freelist trunk page. + - id: num_freelist_pages + type: u4 + doc: Total number of freelist pages. + - id: schema_cookie + type: u4 + - id: schema_format + type: u4 + doc: The schema format number. Supported schema formats are 1, 2, 3, and 4. + - id: def_page_cache_size + type: u4 + doc: Default page cache size. + - id: largest_root_page + type: u4 + doc: The page number of the largest root b-tree page when in auto-vacuum or incremental-vacuum modes, or zero otherwise. + - id: text_encoding + type: u4 + doc: The database text encoding. A value of 1 means UTF-8. A value of 2 means UTF-16le. A value of 3 means UTF-16be. + - id: user_version + type: u4 + doc: The "user version" as read and set by the user_version pragma. + - id: is_incremental_vacuum + type: u4 + doc: True (non-zero) for incremental-vacuum mode. False (zero) otherwise. + - id: application_id + type: u4 + doc: The "Application ID" set by PRAGMA application_id. + - id: reserved_header_bytes + size: 20 + - id: version_valid_for + type: u4 + - id: sqlite_version_number + type: u4 + instances: + page_size: + value: 'page_size_raw == 1 ? 0x10000 : page_size_raw' + doc: The database page size in bytes + usable_size: + value: 'page_size - page_reserved_space_size' + doc: The "usable size" of a database page + overflow_min_payload_size: + value: ((usable_size-12)*32/255)-23 + doc: The minimum amount of inline b-tree cell payload + table_max_overflow_payload_size: + value: usable_size - 35 + doc: The maximum amount of inline table b-tree cell payload + index_max_overflow_payload_size: + value: ((usable_size-12)*64/255)-23 + doc: The maximum amount of inline index b-tree cell payload + lock_byte_page_index: + value: '1073741824 / page_size' + ptrmap_max_num_entries: + value: usable_size/5 + doc: The number of ptrmap entries per ptrmap page + first_ptrmap_page_index: + value: 'largest_root_page > 0 ? 1 : 0' + doc: The index (0-based) of the first ptrmap page + num_ptrmap_pages: + value: 'first_ptrmap_page_index > 0 ? (num_pages / ptrmap_max_num_entries) + 1 : 0' + doc: The number of ptrmap pages in the database + last_ptrmap_page_index: + value: 'first_ptrmap_page_index + num_ptrmap_pages - (first_ptrmap_page_index + num_ptrmap_pages >= lock_byte_page_index ? 0 : 1)' + doc: The index (0-based) of the last ptrmap page (inclusive) + lock_byte_page: + params: + - id: page_number + type: u4 + seq: [] + doc: | + The lock-byte page is the single page of the database file that contains the bytes at offsets between + 1073741824 and 1073742335, inclusive. A database file that is less than or equal to 1073741824 bytes + in size contains no lock-byte page. A database file larger than 1073741824 contains exactly one + lock-byte page. + The lock-byte page is set aside for use by the operating-system specific VFS implementation in implementing + the database file locking primitives. SQLite does not use the lock-byte page. + ptrmap_page: + params: + - id: page_number + type: u4 + seq: + - id: entries + type: ptrmap_entry + repeat: expr + repeat-expr: num_entries + instances: + first_page: + value: '3 + (_root.header.ptrmap_max_num_entries * (page_number - 2))' + last_page: + value: 'first_page + _root.header.ptrmap_max_num_entries - 1' + num_entries: + value: '(last_page > _root.header.num_pages ? _root.header.num_pages : last_page) - first_page + 1' + ptrmap_entry: + seq: + - id: type + type: u1 + enum: ptrmap_page_type + - id: page_number + type: u4 + btree_page_pointer: + seq: + - id: page_number + type: u4 + instances: + page: + io: _root._io + pos: (page_number - 1) * _root.header.page_size + size: _root.header.page_size + type: btree_page(page_number) + if: page_number != 0 btree_page: + params: + - id: page_number + type: u4 seq: + - id: database_header + type: database_header + if: page_number == 1 - id: page_type type: u1 + enum: btree_page_type - id: first_freeblock type: u2 + doc: The start of the first freeblock on the page, or is zero if there are no freeblocks. - id: num_cells type: u2 - - id: ofs_cells + doc: The number of cells on the page + - id: ofs_cell_content_area_raw type: u2 + doc: | + The start of the cell content area. A zero value for this integer is interpreted as 65536. + The interpreted value is available as `cell_content_area`. - id: num_frag_free_bytes type: u1 + doc: The number of fragmented free bytes within the cell content area. - id: right_ptr - type: u4 - if: page_type == 2 or page_type == 5 + type: btree_page_pointer + if: page_type == btree_page_type::index_interior or page_type == btree_page_type::table_interior + doc: | + The right-most pointer. This value appears in the header of interior + b-tree pages only and is omitted from all other pages. - id: cells - type: ref_cell + type: cell_pointer repeat: expr repeat-expr: num_cells - ref_cell: + instances: + ofs_cell_content_area: + value: 'ofs_cell_content_area_raw == 0 ? 65536 : ofs_cell_content_area_raw' + cell_content_area: + pos: ofs_cell_content_area + size: _root.header.usable_size - ofs_cell_content_area + reserved_space: + pos: _root.header.page_size - _root.header.page_reserved_space_size + size-eos: true + if: _root.header.page_reserved_space_size != 0 + cell_pointer: seq: - - id: ofs_body + - id: ofs_content type: u2 instances: - body: - pos: ofs_body + content: + pos: ofs_content type: switch-on: _parent.page_type cases: - 0x0d: cell_table_leaf - 0x05: cell_table_interior - 0x0a: cell_index_leaf - 0x02: cell_index_interior - cell_table_leaf: + btree_page_type::table_leaf: table_leaf_cell + btree_page_type::table_interior: table_interior_cell + btree_page_type::index_leaf: index_leaf_cell + btree_page_type::index_interior: index_interior_cell + table_leaf_cell: doc-ref: 'https://www.sqlite.org/fileformat.html#b_tree_pages' seq: - - id: len_payload + - id: payload_size type: vlq_base128_be - id: row_id type: vlq_base128_be - id: payload - size: len_payload.value - type: cell_payload - # TODO: overflow - cell_table_interior: + type: + switch-on: '(payload_size.value > _root.header.table_max_overflow_payload_size ? 1 : 0)' + cases: + 0: record + 1: overflow_record(payload_size.value, _root.header.table_max_overflow_payload_size) + table_interior_cell: doc-ref: 'https://www.sqlite.org/fileformat.html#b_tree_pages' seq: - id: left_child_page - type: u4 + type: btree_page_pointer - id: row_id type: vlq_base128_be - cell_index_leaf: + index_leaf_cell: doc-ref: 'https://www.sqlite.org/fileformat.html#b_tree_pages' seq: - - id: len_payload + - id: payload_size type: vlq_base128_be - id: payload - size: len_payload.value - type: cell_payload - # TODO: overflow - cell_index_interior: + type: + switch-on: '(payload_size.value > _root.header.index_max_overflow_payload_size ? 1 : 0)' + cases: + 0: record + 1: overflow_record(payload_size.value, _root.header.index_max_overflow_payload_size) + index_interior_cell: doc-ref: 'https://www.sqlite.org/fileformat.html#b_tree_pages' seq: - id: left_child_page - type: u4 - - id: len_payload + type: btree_page_pointer + - id: payload_size type: vlq_base128_be - id: payload - size: len_payload.value - type: cell_payload - cell_payload: + type: + switch-on: '(payload_size.value > _root.header.index_max_overflow_payload_size ? 1 : 0)' + cases: + 0: record + 1: overflow_record(payload_size.value, _root.header.index_max_overflow_payload_size) + record: doc-ref: 'https://sqlite.org/fileformat2.html#record_format' seq: - - id: len_header_and_len + - id: header_size type: vlq_base128_be - - id: column_serials - size: len_header_and_len.value - 1 - type: serials - - id: column_contents + - id: header + type: record_header + size: header_size.value - 1 + - id: values + type: value(header.value_types[_index]) repeat: expr - repeat-expr: column_serials.entries.size - type: column_content(column_serials.entries[_index]) - serials: + repeat-expr: header.value_types.size + record_header: seq: - - id: entries - type: vlq_base128_be + - id: value_types + type: serial_type repeat: eos - serial: + serial_type: + -webide-representation: "{type:dec}" seq: - - id: code + - id: raw_value type: vlq_base128_be instances: - is_blob: - value: 'code.value >= 12 and (code.value % 2 == 0)' - is_string: - value: 'code.value >= 13 and (code.value % 2 == 1)' - len_content: - value: (code.value - 12) / 2 - if: code.value >= 12 - column_content: + type: + value: 'raw_value.value >= 12 ? ((raw_value.value % 2 == 0) ? 12 : 13 + _root.header.text_encoding - 1) : raw_value.value' + enum: serial + variable_size: + value: (raw_value.value - 12) / 2 + if: raw_value.value >= 12 + value: params: - - id: ser - type: struct + - id: serial_type + type: serial_type seq: - - id: as_int + - id: value type: - switch-on: serial_type.code.value + switch-on: serial_type.type cases: - 1: u1 - 2: u2 - 3: b24 - 4: u4 - 5: b48 - 6: u8 - if: serial_type.code.value >= 1 and serial_type.code.value <= 6 - - id: as_float - type: f8 - if: serial_type.code.value == 7 - - id: as_blob - size: serial_type.len_content - if: serial_type.is_blob - - id: as_str + serial::nil: null_value + serial::two_comp_8: s1 + serial::two_comp_16: s2 + serial::two_comp_24: b24 + serial::two_comp_32: s4 + serial::two_comp_48: b48 + serial::two_comp_64: s8 + serial::ieee754_64: f8 + serial::integer_0: int_0 + serial::integer_1: int_1 + serial::blob: blob(serial_type.variable_size) + serial::string_utf8: string_utf8(serial_type.variable_size) + serial::string_utf16_le: string_utf16_le(serial_type.variable_size) + serial::string_utf16_be: string_utf16_be(serial_type.variable_size) + null_value: + -webide-representation: "NULL" + seq: [] + int_0: + -webide-representation: "0" + seq: [] + int_1: + -webide-representation: "1" + seq: [] + string_utf8: + params: + - id: len_value + type: u4 + seq: + - id: value + size: len_value type: str - size: serial_type.len_content encoding: UTF-8 -# if: _root.text_encoding == encodings::utf_8 and serial_type.is_string + string_utf16_be: + params: + - id: len_value + type: u4 + seq: + - id: value + size: len_value + type: str + encoding: UTF-16BE + string_utf16_le: + params: + - id: len_value + type: u4 + seq: + - id: value + size: len_value + type: str + encoding: UTF-16LE + blob: + params: + - id: len_value + type: u4 + seq: + - id: value + size: len_value + overflow_record: + params: + - id: payload_size + type: u8 + - id: overflow_payload_size_max + type: u8 + seq: + - id: inline_payload + size: '(inline_payload_size <= overflow_payload_size_max ? inline_payload_size : _root.header.overflow_min_payload_size)' + - id: overflow_page_number + type: overflow_page_pointer + instances: + inline_payload_size: + value: _root.header.overflow_min_payload_size+((payload_size-_root.header.overflow_min_payload_size)%(_root.header.usable_size-4)) + overflow_page_pointer: + seq: + - id: page_number + type: u4 + instances: + page: + io: _root._io + pos: (page_number - 1) * _root.header.page_size + size: _root.header.page_size + type: overflow_page + if: page_number != 0 + overflow_page: + seq: + - id: next_page_number + type: overflow_page_pointer + - id: content + size: _root.header.page_size - 4 + freelist_trunk_page_pointer: + seq: + - id: page_number + type: u4 instances: - serial_type: - value: ser.as + page: + io: _root._io + pos: (page_number - 1) * _root.header.page_size + size: _root.header.page_size + type: freelist_trunk_page + if: page_number != 0 + freelist_trunk_page: + seq: + - id: next_page + type: freelist_trunk_page_pointer + - id: num_free_pages + type: u4 + - id: free_pages + type: u4 + repeat: expr + repeat-expr: num_free_pages enums: - versions: + format_version: 1: legacy 2: wal - encodings: - 1: utf_8 - 2: utf_16le - 3: utf_16be + btree_page_type: + 0x02: index_interior + 0x05: table_interior + 0x0a: index_leaf + 0x0d: table_leaf + ptrmap_page_type: + 1: root_page + 2: free_page + 3: overflow1 + 4: overflow2 + 5: btree + serial: + # Value is a NULL. + 0: nil + # Value is an 8-bit twos-complement integer. + 1: two_comp_8 + # Value is a big-endian 16-bit twos-complement integer. + 2: two_comp_16 + # Value is a big-endian 24-bit twos-complement integer. + 3: two_comp_24 + # Value is a big-endian 32-bit twos-complement integer. + 4: two_comp_32 + # Value is a big-endian 48-bit twos-complement integer. + 5: two_comp_48 + # Value is a big-endian 64-bit twos-complement integer. + 6: two_comp_64 + # Value is a big-endian IEEE 754-2008 64-bit floating point number. + 7: ieee754_64 + # Value is the integer 0. (Only available for schema format 4 and higher.) + 8: integer_0 + # Value is the integer 1. (Only available for schema format 4 and higher.) + 9: integer_1 + # Reserved for internal use. These serial type codes will never appear in a + # well-formed database file, but they might be used in transient and temporary + # database files that SQLite sometimes generates for its own use. The meanings + # of these codes can shift from one release of SQLite to the next. + 10: internal_1 + 11: internal_2 + # The serial types for blob and string are 'N >= 12 and even' and 'N >=13 and odd' respectively + # The enum here differs slightly to have a single value for blob and a value per text encoding + # for string. + # + # Value is a BLOB that is (N-12)/2 bytes in length. + 12: blob + # Value is a string in the text encoding and (N-13)/2 bytes in length. The nul terminator is + # not stored. + 13: string_utf8 + 14: string_utf16_le + 15: string_utf16_be From 59110cec94aa6c0e187dcf330cca64539f343c65 Mon Sep 17 00:00:00 2001 From: Milan Hauth Date: Sun, 2 Apr 2023 15:44:17 +0200 Subject: [PATCH 02/28] fixup: lazy db.pages --- database/sqlite3.ksy | 33 ++++++++++++++++++++++----------- 1 file changed, 22 insertions(+), 11 deletions(-) diff --git a/database/sqlite3.ksy b/database/sqlite3.ksy index b39549850..f06f73d3e 100644 --- a/database/sqlite3.ksy +++ b/database/sqlite3.ksy @@ -37,20 +37,31 @@ seq: type: database_header instances: pages: - type: - switch-on: '(_index == header.lock_byte_page_index ? 0 : _index >= header.first_ptrmap_page_index and _index <= header.last_ptrmap_page_index ? 1 : 2)' - cases: - 0: lock_byte_page(_index + 1) - 1: ptrmap_page(_index + 1) - # TODO: Free pages and cell overflow pages are incorrectly interpreted as btree pages - # This is unfortunate, but unavoidable since there's no way to recognize these types at - # this point in the parser. - 2: btree_page(_index + 1) - pos: 0 - size: header.page_size + type: page(_index + 1, header.page_size * _index) repeat: expr repeat-expr: header.num_pages types: + page: + params: + - id: page_number + type: s4 + - id: ofs_body + type: s4 + instances: + page_index: + value: 'page_number - 1' + body: + pos: ofs_body + size: _root.header.page_size + type: + switch-on: '(page_index == _root.header.lock_byte_page_index ? 0 : page_index >= _root.header.first_ptrmap_page_index and page_index <= _root.header.last_ptrmap_page_index ? 1 : 2)' + cases: + 0: lock_byte_page(page_number) + 1: ptrmap_page(page_number) + # TODO: Free pages and cell overflow pages are incorrectly interpreted as btree pages + # This is unfortunate, but unavoidable since there's no way to recognize these types at + # this point in the parser. + 2: btree_page(page_number) database_header: seq: - id: magic From 4149848923cea6b3505ee140041905e2f279cfae Mon Sep 17 00:00:00 2001 From: Milan Hauth Date: Sun, 2 Apr 2023 15:51:52 +0200 Subject: [PATCH 03/28] num_ptrmap_entries_max --- database/sqlite3.ksy | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/database/sqlite3.ksy b/database/sqlite3.ksy index f06f73d3e..8d06547b9 100644 --- a/database/sqlite3.ksy +++ b/database/sqlite3.ksy @@ -150,14 +150,14 @@ types: doc: The maximum amount of inline index b-tree cell payload lock_byte_page_index: value: '1073741824 / page_size' - ptrmap_max_num_entries: + num_ptrmap_entries_max: value: usable_size/5 doc: The number of ptrmap entries per ptrmap page first_ptrmap_page_index: value: 'largest_root_page > 0 ? 1 : 0' doc: The index (0-based) of the first ptrmap page num_ptrmap_pages: - value: 'first_ptrmap_page_index > 0 ? (num_pages / ptrmap_max_num_entries) + 1 : 0' + value: 'first_ptrmap_page_index > 0 ? (num_pages / num_ptrmap_entries_max) + 1 : 0' doc: The number of ptrmap pages in the database last_ptrmap_page_index: value: 'first_ptrmap_page_index + num_ptrmap_pages - (first_ptrmap_page_index + num_ptrmap_pages >= lock_byte_page_index ? 0 : 1)' @@ -185,9 +185,9 @@ types: repeat-expr: num_entries instances: first_page: - value: '3 + (_root.header.ptrmap_max_num_entries * (page_number - 2))' + value: '3 + (_root.header.num_ptrmap_entries_max * (page_number - 2))' last_page: - value: 'first_page + _root.header.ptrmap_max_num_entries - 1' + value: 'first_page + _root.header.num_ptrmap_entries_max - 1' num_entries: value: '(last_page > _root.header.num_pages ? _root.header.num_pages : last_page) - first_page + 1' ptrmap_entry: From 6dcc498a3b9535b0db8f4c5937fd99f40de87e13 Mon Sep 17 00:00:00 2001 From: Milan Hauth Date: Sun, 2 Apr 2023 15:52:46 +0200 Subject: [PATCH 04/28] idx_first_ptrmap_page --- database/sqlite3.ksy | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/database/sqlite3.ksy b/database/sqlite3.ksy index 8d06547b9..2ba32d64c 100644 --- a/database/sqlite3.ksy +++ b/database/sqlite3.ksy @@ -54,7 +54,7 @@ types: pos: ofs_body size: _root.header.page_size type: - switch-on: '(page_index == _root.header.lock_byte_page_index ? 0 : page_index >= _root.header.first_ptrmap_page_index and page_index <= _root.header.last_ptrmap_page_index ? 1 : 2)' + switch-on: '(page_index == _root.header.lock_byte_page_index ? 0 : page_index >= _root.header.idx_first_ptrmap_page and page_index <= _root.header.last_ptrmap_page_index ? 1 : 2)' cases: 0: lock_byte_page(page_number) 1: ptrmap_page(page_number) @@ -153,14 +153,14 @@ types: num_ptrmap_entries_max: value: usable_size/5 doc: The number of ptrmap entries per ptrmap page - first_ptrmap_page_index: + idx_first_ptrmap_page: value: 'largest_root_page > 0 ? 1 : 0' doc: The index (0-based) of the first ptrmap page num_ptrmap_pages: - value: 'first_ptrmap_page_index > 0 ? (num_pages / num_ptrmap_entries_max) + 1 : 0' + value: 'idx_first_ptrmap_page > 0 ? (num_pages / num_ptrmap_entries_max) + 1 : 0' doc: The number of ptrmap pages in the database last_ptrmap_page_index: - value: 'first_ptrmap_page_index + num_ptrmap_pages - (first_ptrmap_page_index + num_ptrmap_pages >= lock_byte_page_index ? 0 : 1)' + value: 'idx_first_ptrmap_page + num_ptrmap_pages - (idx_first_ptrmap_page + num_ptrmap_pages >= lock_byte_page_index ? 0 : 1)' doc: The index (0-based) of the last ptrmap page (inclusive) lock_byte_page: params: From 11e27ef419404345cc6ffa0b965fac02cf52558f Mon Sep 17 00:00:00 2001 From: Milan Hauth Date: Sun, 2 Apr 2023 15:53:34 +0200 Subject: [PATCH 05/28] idx_lock_byte_page --- database/sqlite3.ksy | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/database/sqlite3.ksy b/database/sqlite3.ksy index 2ba32d64c..a1f7693fc 100644 --- a/database/sqlite3.ksy +++ b/database/sqlite3.ksy @@ -54,7 +54,7 @@ types: pos: ofs_body size: _root.header.page_size type: - switch-on: '(page_index == _root.header.lock_byte_page_index ? 0 : page_index >= _root.header.idx_first_ptrmap_page and page_index <= _root.header.last_ptrmap_page_index ? 1 : 2)' + switch-on: '(page_index == _root.header.idx_lock_byte_page ? 0 : page_index >= _root.header.idx_first_ptrmap_page and page_index <= _root.header.last_ptrmap_page_index ? 1 : 2)' cases: 0: lock_byte_page(page_number) 1: ptrmap_page(page_number) @@ -148,7 +148,7 @@ types: index_max_overflow_payload_size: value: ((usable_size-12)*64/255)-23 doc: The maximum amount of inline index b-tree cell payload - lock_byte_page_index: + idx_lock_byte_page: value: '1073741824 / page_size' num_ptrmap_entries_max: value: usable_size/5 @@ -160,7 +160,7 @@ types: value: 'idx_first_ptrmap_page > 0 ? (num_pages / num_ptrmap_entries_max) + 1 : 0' doc: The number of ptrmap pages in the database last_ptrmap_page_index: - value: 'idx_first_ptrmap_page + num_ptrmap_pages - (idx_first_ptrmap_page + num_ptrmap_pages >= lock_byte_page_index ? 0 : 1)' + value: 'idx_first_ptrmap_page + num_ptrmap_pages - (idx_first_ptrmap_page + num_ptrmap_pages >= idx_lock_byte_page ? 0 : 1)' doc: The index (0-based) of the last ptrmap page (inclusive) lock_byte_page: params: From 8a61bd98596ba651926ab2d774129efd1100540d Mon Sep 17 00:00:00 2001 From: Milan Hauth Date: Sun, 2 Apr 2023 15:55:10 +0200 Subject: [PATCH 06/28] maximum number of ptrmap entries --- database/sqlite3.ksy | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/database/sqlite3.ksy b/database/sqlite3.ksy index a1f7693fc..373383b5f 100644 --- a/database/sqlite3.ksy +++ b/database/sqlite3.ksy @@ -152,7 +152,7 @@ types: value: '1073741824 / page_size' num_ptrmap_entries_max: value: usable_size/5 - doc: The number of ptrmap entries per ptrmap page + doc: The maximum number of ptrmap entries per ptrmap page idx_first_ptrmap_page: value: 'largest_root_page > 0 ? 1 : 0' doc: The index (0-based) of the first ptrmap page From 09202259884c076108d7acf2a7c42ac7f7d40cdc Mon Sep 17 00:00:00 2001 From: Milan Hauth Date: Sun, 2 Apr 2023 15:59:38 +0200 Subject: [PATCH 07/28] original docstrings for overflow sizes --- database/sqlite3.ksy | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/database/sqlite3.ksy b/database/sqlite3.ksy index 373383b5f..fd601acb7 100644 --- a/database/sqlite3.ksy +++ b/database/sqlite3.ksy @@ -141,13 +141,13 @@ types: doc: The "usable size" of a database page overflow_min_payload_size: value: ((usable_size-12)*32/255)-23 - doc: The minimum amount of inline b-tree cell payload + doc: The minimum amount of payload that must be stored on the btree page before spilling is allowed table_max_overflow_payload_size: value: usable_size - 35 - doc: The maximum amount of inline table b-tree cell payload + doc: The maximum amount of payload that can be stored directly on the b-tree page without spilling onto an overflow page. Value for table page index_max_overflow_payload_size: value: ((usable_size-12)*64/255)-23 - doc: The maximum amount of inline index b-tree cell payload + doc: The maximum amount of payload that can be stored directly on the b-tree page without spilling onto an overflow page. Value for index page idx_lock_byte_page: value: '1073741824 / page_size' num_ptrmap_entries_max: From d1721102d9a6825d1e3bac3c15126af25b4edb69 Mon Sep 17 00:00:00 2001 From: Milan Hauth Date: Sun, 2 Apr 2023 16:11:32 +0200 Subject: [PATCH 08/28] idx_last_ptrmap_page --- database/sqlite3.ksy | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/database/sqlite3.ksy b/database/sqlite3.ksy index fd601acb7..cd55d3065 100644 --- a/database/sqlite3.ksy +++ b/database/sqlite3.ksy @@ -54,7 +54,7 @@ types: pos: ofs_body size: _root.header.page_size type: - switch-on: '(page_index == _root.header.idx_lock_byte_page ? 0 : page_index >= _root.header.idx_first_ptrmap_page and page_index <= _root.header.last_ptrmap_page_index ? 1 : 2)' + switch-on: '(page_index == _root.header.idx_lock_byte_page ? 0 : page_index >= _root.header.idx_first_ptrmap_page and page_index <= _root.header.idx_last_ptrmap_page ? 1 : 2)' cases: 0: lock_byte_page(page_number) 1: ptrmap_page(page_number) @@ -159,7 +159,7 @@ types: num_ptrmap_pages: value: 'idx_first_ptrmap_page > 0 ? (num_pages / num_ptrmap_entries_max) + 1 : 0' doc: The number of ptrmap pages in the database - last_ptrmap_page_index: + idx_last_ptrmap_page: value: 'idx_first_ptrmap_page + num_ptrmap_pages - (idx_first_ptrmap_page + num_ptrmap_pages >= idx_lock_byte_page ? 0 : 1)' doc: The index (0-based) of the last ptrmap page (inclusive) lock_byte_page: From 7b697c67a2ac889c4152ce6021065993be5d509b Mon Sep 17 00:00:00 2001 From: Milan Hauth Date: Sun, 2 Apr 2023 16:55:25 +0200 Subject: [PATCH 09/28] Revert "fixup: lazy db.pages" This reverts commit 59110cec94aa6c0e187dcf330cca64539f343c65. kaitai-struct-compiler still generates an eager array --- database/sqlite3.ksy | 33 +++++++++++---------------------- 1 file changed, 11 insertions(+), 22 deletions(-) diff --git a/database/sqlite3.ksy b/database/sqlite3.ksy index cd55d3065..c24f46a51 100644 --- a/database/sqlite3.ksy +++ b/database/sqlite3.ksy @@ -37,31 +37,20 @@ seq: type: database_header instances: pages: - type: page(_index + 1, header.page_size * _index) + type: + switch-on: '(_index == header.lock_byte_page_index ? 0 : _index >= header.first_ptrmap_page_index and _index <= header.last_ptrmap_page_index ? 1 : 2)' + cases: + 0: lock_byte_page(_index + 1) + 1: ptrmap_page(_index + 1) + # TODO: Free pages and cell overflow pages are incorrectly interpreted as btree pages + # This is unfortunate, but unavoidable since there's no way to recognize these types at + # this point in the parser. + 2: btree_page(_index + 1) + pos: 0 + size: header.page_size repeat: expr repeat-expr: header.num_pages types: - page: - params: - - id: page_number - type: s4 - - id: ofs_body - type: s4 - instances: - page_index: - value: 'page_number - 1' - body: - pos: ofs_body - size: _root.header.page_size - type: - switch-on: '(page_index == _root.header.idx_lock_byte_page ? 0 : page_index >= _root.header.idx_first_ptrmap_page and page_index <= _root.header.idx_last_ptrmap_page ? 1 : 2)' - cases: - 0: lock_byte_page(page_number) - 1: ptrmap_page(page_number) - # TODO: Free pages and cell overflow pages are incorrectly interpreted as btree pages - # This is unfortunate, but unavoidable since there's no way to recognize these types at - # this point in the parser. - 2: btree_page(page_number) database_header: seq: - id: magic From 9ee96608c176de09133b65fcc039387a08ffb86a Mon Sep 17 00:00:00 2001 From: Milan Hauth Date: Sun, 2 Apr 2023 17:27:51 +0200 Subject: [PATCH 10/28] fix variable_size for string --- database/sqlite3.ksy | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/database/sqlite3.ksy b/database/sqlite3.ksy index c24f46a51..26c2258ac 100644 --- a/database/sqlite3.ksy +++ b/database/sqlite3.ksy @@ -327,7 +327,7 @@ types: value: 'raw_value.value >= 12 ? ((raw_value.value % 2 == 0) ? 12 : 13 + _root.header.text_encoding - 1) : raw_value.value' enum: serial variable_size: - value: (raw_value.value - 12) / 2 + value: '(raw_value.value % 2 == 0) ? (raw_value.value - 12) / 2 : (raw_value.value - 13) / 2' if: raw_value.value >= 12 value: params: From 7751c10dfb52a0b7046c1aad45d8637295487430 Mon Sep 17 00:00:00 2001 From: Milan Hauth Date: Sun, 2 Apr 2023 17:31:11 +0200 Subject: [PATCH 11/28] add comments: Workaround for string encoding --- database/sqlite3.ksy | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/database/sqlite3.ksy b/database/sqlite3.ksy index 26c2258ac..6a30fa515 100644 --- a/database/sqlite3.ksy +++ b/database/sqlite3.ksy @@ -324,6 +324,13 @@ types: type: vlq_base128_be instances: type: + # Workaround for string encoding: + # 13 + _root.header.text_encoding - 1 + # See type serial: + # 12: blob + # 13: string_utf8 + # 14: string_utf16_le + # 15: string_utf16_be value: 'raw_value.value >= 12 ? ((raw_value.value % 2 == 0) ? 12 : 13 + _root.header.text_encoding - 1) : raw_value.value' enum: serial variable_size: @@ -349,6 +356,7 @@ types: serial::integer_0: int_0 serial::integer_1: int_1 serial::blob: blob(serial_type.variable_size) + # Workaround for string encoding: serial::string_utf8: string_utf8(serial_type.variable_size) serial::string_utf16_le: string_utf16_le(serial_type.variable_size) serial::string_utf16_be: string_utf16_be(serial_type.variable_size) @@ -497,6 +505,9 @@ enums: 12: blob # Value is a string in the text encoding and (N-13)/2 bytes in length. The nul terminator is # not stored. + # Workaround for string encoding: + # Originally, sqlite3 has only one string type, + # and the string encoding is stored in _root.header.text_encoding. 13: string_utf8 14: string_utf16_le 15: string_utf16_be From 8a55c0bd75c830527a58be1678864f2db604df93 Mon Sep 17 00:00:00 2001 From: Milan Hauth Date: Sun, 2 Apr 2023 17:32:15 +0200 Subject: [PATCH 12/28] rename variable_size to len_blob_string --- database/sqlite3.ksy | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/database/sqlite3.ksy b/database/sqlite3.ksy index 6a30fa515..181b3bb0a 100644 --- a/database/sqlite3.ksy +++ b/database/sqlite3.ksy @@ -333,7 +333,7 @@ types: # 15: string_utf16_be value: 'raw_value.value >= 12 ? ((raw_value.value % 2 == 0) ? 12 : 13 + _root.header.text_encoding - 1) : raw_value.value' enum: serial - variable_size: + len_blob_string: value: '(raw_value.value % 2 == 0) ? (raw_value.value - 12) / 2 : (raw_value.value - 13) / 2' if: raw_value.value >= 12 value: @@ -355,11 +355,11 @@ types: serial::ieee754_64: f8 serial::integer_0: int_0 serial::integer_1: int_1 - serial::blob: blob(serial_type.variable_size) + serial::blob: blob(serial_type.len_blob_string) # Workaround for string encoding: - serial::string_utf8: string_utf8(serial_type.variable_size) - serial::string_utf16_le: string_utf16_le(serial_type.variable_size) - serial::string_utf16_be: string_utf16_be(serial_type.variable_size) + serial::string_utf8: string_utf8(serial_type.len_blob_string) + serial::string_utf16_le: string_utf16_le(serial_type.len_blob_string) + serial::string_utf16_be: string_utf16_be(serial_type.len_blob_string) null_value: -webide-representation: "NULL" seq: [] From 7a6b5731d6d93086d4f5f082c1bcd2ac8fe9cf4d Mon Sep 17 00:00:00 2001 From: Milan Hauth Date: Sun, 2 Apr 2023 17:37:37 +0200 Subject: [PATCH 13/28] add enum serial_type_size --- database/sqlite3.ksy | 17 +++++++++++++++++ 1 file changed, 17 insertions(+) diff --git a/database/sqlite3.ksy b/database/sqlite3.ksy index 181b3bb0a..3b42a3a74 100644 --- a/database/sqlite3.ksy +++ b/database/sqlite3.ksy @@ -511,3 +511,20 @@ enums: 13: string_utf8 14: string_utf16_le 15: string_utf16_be + serial_type_size: + 0: 0 + 1: 1 + 2: 2 + 3: 3 + 4: 4 + 5: 6 + 6: 8 + 7: 8 + 8: 0 + 9: 0 + # -1 means variable size + 10: -1 # internal + 11: -1 # internal + # blob and string: size is stored in serial_type.len_blob_string + 12: -1 # blob + 13: -1 # string From aab41dc62aef9ab98284b2e523cdbc6f1e899cbc Mon Sep 17 00:00:00 2001 From: Milan Hauth Date: Sun, 2 Apr 2023 18:04:15 +0200 Subject: [PATCH 14/28] disable enum enum serial_type_size --- database/sqlite3.ksy | 35 ++++++++++++++++++----------------- 1 file changed, 18 insertions(+), 17 deletions(-) diff --git a/database/sqlite3.ksy b/database/sqlite3.ksy index 3b42a3a74..b81ad7c84 100644 --- a/database/sqlite3.ksy +++ b/database/sqlite3.ksy @@ -511,20 +511,21 @@ enums: 13: string_utf8 14: string_utf16_le 15: string_utf16_be - serial_type_size: - 0: 0 - 1: 1 - 2: 2 - 3: 3 - 4: 4 - 5: 6 - 6: 8 - 7: 8 - 8: 0 - 9: 0 - # -1 means variable size - 10: -1 # internal - 11: -1 # internal - # blob and string: size is stored in serial_type.len_blob_string - 12: -1 # blob - 13: -1 # string + # FIXME error: expected string or map, got 0 + #serial_type_size: + # 0: 0 + # 1: 1 + # 2: 2 + # 3: 3 + # 4: 4 + # 5: 6 + # 6: 8 + # 7: 8 + # 8: 0 + # 9: 0 + # # -1 means variable size + # 10: -1 # internal + # 11: -1 # internal + # # blob and string: size is stored in serial_type.len_blob_string + # 12: -1 # blob + # 13: -1 # string From 78d5f669087cb3c9df1c8da2e0bc1dff2cff164a Mon Sep 17 00:00:00 2001 From: Milan Hauth Date: Sun, 2 Apr 2023 18:06:12 +0200 Subject: [PATCH 15/28] idx_lock_byte_page --- database/sqlite3.ksy | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/database/sqlite3.ksy b/database/sqlite3.ksy index b81ad7c84..bc592533a 100644 --- a/database/sqlite3.ksy +++ b/database/sqlite3.ksy @@ -38,7 +38,7 @@ seq: instances: pages: type: - switch-on: '(_index == header.lock_byte_page_index ? 0 : _index >= header.first_ptrmap_page_index and _index <= header.last_ptrmap_page_index ? 1 : 2)' + switch-on: '(_index == header.idx_lock_byte_page ? 0 : _index >= header.first_ptrmap_page_index and _index <= header.last_ptrmap_page_index ? 1 : 2)' cases: 0: lock_byte_page(_index + 1) 1: ptrmap_page(_index + 1) From f47b57b9f687eb1a92606e9090a62cae396273c9 Mon Sep 17 00:00:00 2001 From: Milan Hauth Date: Sun, 2 Apr 2023 18:06:22 +0200 Subject: [PATCH 16/28] idx_first_ptrmap_page --- database/sqlite3.ksy | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/database/sqlite3.ksy b/database/sqlite3.ksy index bc592533a..ed828e541 100644 --- a/database/sqlite3.ksy +++ b/database/sqlite3.ksy @@ -38,7 +38,7 @@ seq: instances: pages: type: - switch-on: '(_index == header.idx_lock_byte_page ? 0 : _index >= header.first_ptrmap_page_index and _index <= header.last_ptrmap_page_index ? 1 : 2)' + switch-on: '(_index == header.idx_lock_byte_page ? 0 : _index >= header.idx_first_ptrmap_page and _index <= header.last_ptrmap_page_index ? 1 : 2)' cases: 0: lock_byte_page(_index + 1) 1: ptrmap_page(_index + 1) From aa12bc53b01c38e8c00ddeff2af876e893972ec7 Mon Sep 17 00:00:00 2001 From: Milan Hauth Date: Sun, 2 Apr 2023 18:06:34 +0200 Subject: [PATCH 17/28] idx_last_ptrmap_page --- database/sqlite3.ksy | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/database/sqlite3.ksy b/database/sqlite3.ksy index ed828e541..f67987776 100644 --- a/database/sqlite3.ksy +++ b/database/sqlite3.ksy @@ -38,7 +38,7 @@ seq: instances: pages: type: - switch-on: '(_index == header.idx_lock_byte_page ? 0 : _index >= header.idx_first_ptrmap_page and _index <= header.last_ptrmap_page_index ? 1 : 2)' + switch-on: '(_index == header.idx_lock_byte_page ? 0 : _index >= header.idx_first_ptrmap_page and _index <= header.idx_last_ptrmap_page ? 1 : 2)' cases: 0: lock_byte_page(_index + 1) 1: ptrmap_page(_index + 1) From cb15b1e6104e2d50ac5708c5032b240cab6a6794 Mon Sep 17 00:00:00 2001 From: Milan Hauth Date: Sun, 2 Apr 2023 19:35:27 +0200 Subject: [PATCH 18/28] if false --- database/sqlite3.ksy | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/database/sqlite3.ksy b/database/sqlite3.ksy index f67987776..bce8cc642 100644 --- a/database/sqlite3.ksy +++ b/database/sqlite3.ksy @@ -50,6 +50,12 @@ instances: size: header.page_size repeat: expr repeat-expr: header.num_pages + doc: | + "if false" is a workaround for lazy parsing of db.pages. + the main parser will parse only the first page as db.header + and the user is responsible to parse any further pages. + TODO how exactly? add example code + if: false types: database_header: seq: From 4e4bb1fd0045571ca710e05127fec3842cd574c3 Mon Sep 17 00:00:00 2001 From: Milan Hauth Date: Sun, 2 Apr 2023 19:36:19 +0200 Subject: [PATCH 19/28] Revert "Revert "fixup: lazy db.pages"" This reverts commit 7b697c67a2ac889c4152ce6021065993be5d509b. --- database/sqlite3.ksy | 33 ++++++++++++++++++++++----------- 1 file changed, 22 insertions(+), 11 deletions(-) diff --git a/database/sqlite3.ksy b/database/sqlite3.ksy index bce8cc642..b0916eada 100644 --- a/database/sqlite3.ksy +++ b/database/sqlite3.ksy @@ -37,17 +37,7 @@ seq: type: database_header instances: pages: - type: - switch-on: '(_index == header.idx_lock_byte_page ? 0 : _index >= header.idx_first_ptrmap_page and _index <= header.idx_last_ptrmap_page ? 1 : 2)' - cases: - 0: lock_byte_page(_index + 1) - 1: ptrmap_page(_index + 1) - # TODO: Free pages and cell overflow pages are incorrectly interpreted as btree pages - # This is unfortunate, but unavoidable since there's no way to recognize these types at - # this point in the parser. - 2: btree_page(_index + 1) - pos: 0 - size: header.page_size + type: page(_index + 1, header.page_size * _index) repeat: expr repeat-expr: header.num_pages doc: | @@ -57,6 +47,27 @@ instances: TODO how exactly? add example code if: false types: + page: + params: + - id: page_number + type: s4 + - id: ofs_body + type: s4 + instances: + page_index: + value: 'page_number - 1' + body: + pos: ofs_body + size: _root.header.page_size + type: + switch-on: '(page_index == _root.header.idx_lock_byte_page ? 0 : page_index >= _root.header.idx_first_ptrmap_page and page_index <= _root.header.idx_last_ptrmap_page ? 1 : 2)' + cases: + 0: lock_byte_page(page_number) + 1: ptrmap_page(page_number) + # TODO: Free pages and cell overflow pages are incorrectly interpreted as btree pages + # This is unfortunate, but unavoidable since there's no way to recognize these types at + # this point in the parser. + 2: btree_page(page_number) database_header: seq: - id: magic From f580fab3ca9bf7a5bfd81e39ffdd669141f4e689 Mon Sep 17 00:00:00 2001 From: Milan Hauth Date: Sun, 2 Apr 2023 19:55:17 +0200 Subject: [PATCH 20/28] default_page_cache_size --- database/sqlite3.ksy | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/database/sqlite3.ksy b/database/sqlite3.ksy index b0916eada..410aa940d 100644 --- a/database/sqlite3.ksy +++ b/database/sqlite3.ksy @@ -114,7 +114,7 @@ types: - id: schema_format type: u4 doc: The schema format number. Supported schema formats are 1, 2, 3, and 4. - - id: def_page_cache_size + - id: default_page_cache_size type: u4 doc: Default page cache size. - id: largest_root_page From cd187d3b28eff57fc0b3e8800fe4d30a5bf1f52f Mon Sep 17 00:00:00 2001 From: Milan Hauth Date: Sun, 2 Apr 2023 22:30:57 +0200 Subject: [PATCH 21/28] if false --- database/sqlite3.ksy | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/database/sqlite3.ksy b/database/sqlite3.ksy index 410aa940d..38a03fe68 100644 --- a/database/sqlite3.ksy +++ b/database/sqlite3.ksy @@ -43,7 +43,7 @@ instances: doc: | "if false" is a workaround for lazy parsing of db.pages. the main parser will parse only the first page as db.header - and the user is responsible to parse any further pages. + and the user is responsible for parsing further pages. TODO how exactly? add example code if: false types: From 879ec3674a2ef8977a81da943dfcc841ad7a37ce Mon Sep 17 00:00:00 2001 From: Milan Hauth Date: Sun, 2 Apr 2023 22:31:13 +0200 Subject: [PATCH 22/28] Revert "Revert "Revert "fixup: lazy db.pages""" This reverts commit 4e4bb1fd0045571ca710e05127fec3842cd574c3. --- database/sqlite3.ksy | 33 +++++++++++---------------------- 1 file changed, 11 insertions(+), 22 deletions(-) diff --git a/database/sqlite3.ksy b/database/sqlite3.ksy index 38a03fe68..b4c9a230e 100644 --- a/database/sqlite3.ksy +++ b/database/sqlite3.ksy @@ -37,7 +37,17 @@ seq: type: database_header instances: pages: - type: page(_index + 1, header.page_size * _index) + type: + switch-on: '(_index == header.idx_lock_byte_page ? 0 : _index >= header.idx_first_ptrmap_page and _index <= header.idx_last_ptrmap_page ? 1 : 2)' + cases: + 0: lock_byte_page(_index + 1) + 1: ptrmap_page(_index + 1) + # TODO: Free pages and cell overflow pages are incorrectly interpreted as btree pages + # This is unfortunate, but unavoidable since there's no way to recognize these types at + # this point in the parser. + 2: btree_page(_index + 1) + pos: 0 + size: header.page_size repeat: expr repeat-expr: header.num_pages doc: | @@ -47,27 +57,6 @@ instances: TODO how exactly? add example code if: false types: - page: - params: - - id: page_number - type: s4 - - id: ofs_body - type: s4 - instances: - page_index: - value: 'page_number - 1' - body: - pos: ofs_body - size: _root.header.page_size - type: - switch-on: '(page_index == _root.header.idx_lock_byte_page ? 0 : page_index >= _root.header.idx_first_ptrmap_page and page_index <= _root.header.idx_last_ptrmap_page ? 1 : 2)' - cases: - 0: lock_byte_page(page_number) - 1: ptrmap_page(page_number) - # TODO: Free pages and cell overflow pages are incorrectly interpreted as btree pages - # This is unfortunate, but unavoidable since there's no way to recognize these types at - # this point in the parser. - 2: btree_page(page_number) database_header: seq: - id: magic From 7b130de0ccd6d5eb4b945f483be726dd98f7b108 Mon Sep 17 00:00:00 2001 From: Milan Hauth Date: Mon, 3 Apr 2023 09:55:27 +0200 Subject: [PATCH 23/28] fix doc url --- database/sqlite3.ksy | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/database/sqlite3.ksy b/database/sqlite3.ksy index b4c9a230e..9c491d053 100644 --- a/database/sqlite3.ksy +++ b/database/sqlite3.ksy @@ -31,7 +31,7 @@ doc: | and generally, they would be reached via the links starting from the first page. The first page is always a btree page for the implicitly defined `sqlite_schema` table. -doc-ref: https://www.sqlite.org/fileformat.html +doc-ref: https://www.sqlite.org/fileformat2.html seq: - id: header type: database_header @@ -263,7 +263,7 @@ types: btree_page_type::index_leaf: index_leaf_cell btree_page_type::index_interior: index_interior_cell table_leaf_cell: - doc-ref: 'https://www.sqlite.org/fileformat.html#b_tree_pages' + doc-ref: 'https://www.sqlite.org/fileformat2.html#b_tree_pages' seq: - id: payload_size type: vlq_base128_be @@ -276,14 +276,14 @@ types: 0: record 1: overflow_record(payload_size.value, _root.header.table_max_overflow_payload_size) table_interior_cell: - doc-ref: 'https://www.sqlite.org/fileformat.html#b_tree_pages' + doc-ref: 'https://www.sqlite.org/fileformat2.html#b_tree_pages' seq: - id: left_child_page type: btree_page_pointer - id: row_id type: vlq_base128_be index_leaf_cell: - doc-ref: 'https://www.sqlite.org/fileformat.html#b_tree_pages' + doc-ref: 'https://www.sqlite.org/fileformat2.html#b_tree_pages' seq: - id: payload_size type: vlq_base128_be @@ -294,7 +294,7 @@ types: 0: record 1: overflow_record(payload_size.value, _root.header.index_max_overflow_payload_size) index_interior_cell: - doc-ref: 'https://www.sqlite.org/fileformat.html#b_tree_pages' + doc-ref: 'https://www.sqlite.org/fileformat2.html#b_tree_pages' seq: - id: left_child_page type: btree_page_pointer From 14bb01373c3146e87943918ad563a7fb06ff6759 Mon Sep 17 00:00:00 2001 From: Milan Hauth Date: Mon, 3 Apr 2023 10:16:45 +0200 Subject: [PATCH 24/28] add docs --- database/sqlite3.ksy | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/database/sqlite3.ksy b/database/sqlite3.ksy index 9c491d053..395e78c36 100644 --- a/database/sqlite3.ksy +++ b/database/sqlite3.ksy @@ -267,14 +267,22 @@ types: seq: - id: payload_size type: vlq_base128_be + doc: | + total number of bytes of payload, + including any overflow - id: row_id type: vlq_base128_be + doc: | + integer key, a.k.a. "rowid" - id: payload type: switch-on: '(payload_size.value > _root.header.table_max_overflow_payload_size ? 1 : 0)' cases: 0: record 1: overflow_record(payload_size.value, _root.header.table_max_overflow_payload_size) + doc: | + The initial portion of the payload + that does not spill to overflow pages. table_interior_cell: doc-ref: 'https://www.sqlite.org/fileformat2.html#b_tree_pages' seq: @@ -420,6 +428,9 @@ types: size: '(inline_payload_size <= overflow_payload_size_max ? inline_payload_size : _root.header.overflow_min_payload_size)' - id: overflow_page_number type: overflow_page_pointer + doc: | + page number for the first page + of the overflow page list instances: inline_payload_size: value: _root.header.overflow_min_payload_size+((payload_size-_root.header.overflow_min_payload_size)%(_root.header.usable_size-4)) From c49c45702b0d26ab5bbb4ab1086e8af2ec1a6cba Mon Sep 17 00:00:00 2001 From: Milan Hauth Date: Mon, 3 Apr 2023 10:17:07 +0200 Subject: [PATCH 25/28] add docs + example python code for large db --- database/sqlite3.ksy | 71 ++++++++++++++++++++++++++++++++++++++++---- 1 file changed, 66 insertions(+), 5 deletions(-) diff --git a/database/sqlite3.ksy b/database/sqlite3.ksy index 395e78c36..29a6fd60e 100644 --- a/database/sqlite3.ksy +++ b/database/sqlite3.ksy @@ -31,6 +31,20 @@ doc: | and generally, they would be reached via the links starting from the first page. The first page is always a btree page for the implicitly defined `sqlite_schema` table. + + This works well when parsing small database files. To parse large + database files, see the documentation for /instances/pages. + + Further documentation: + + - https://www.sqlite.org/arch.html + - https://medium.com/the-polyglot-programmer/what-would-sqlite-look-like-if-written-in-rust-part-3-edd2eefda473 + - https://cstack.github.io/db_tutorial/parts/part7.html + + Original sources: + + - https://github.com/sqlite/sqlite/blob/master/src/btree.h + - https://github.com/sqlite/sqlite/blob/master/src/btree.c doc-ref: https://www.sqlite.org/fileformat2.html seq: - id: header @@ -51,11 +65,58 @@ instances: repeat: expr repeat-expr: header.num_pages doc: | - "if false" is a workaround for lazy parsing of db.pages. - the main parser will parse only the first page as db.header - and the user is responsible for parsing further pages. - TODO how exactly? add example code - if: false + This works well when parsing small database files. + + problem: + the first access to db.pages + for example `db.pages[0]` + will loop and parse **all** pages. + + To parse large database files, + the user should set + the internal cache attribute `db._m_pages` + so that any access to `db.pages` + will use the cached value in `db._m_pages`. + + # import sqlite3.py generated from sqlite3.ksy + import parser.sqlite3 as parser_sqlite3 + # create a lazy list class + # accessing db.pages[i] will call pages_list.__getitem__(i) + class PagesList: + def __init__(self, db): + self.db = db + def __len__(self): + return self.db.header.num_pages + def __getitem__(self, i): # i is 0-based + db = self.db + header = db.header + if i < 0: # -1 means last page, etc + i = header.num_pages + i + assert ( + 0 <= i and i < header.num_pages + ), f"page index is out of range: {i} is not in (0, {header.num_pages - 1})" + # todo: maybe cache page + # equality test: page_a.page_number == page_b.page_number + _pos = db._io.pos() + db._io.seek(i * header.page_size) + if i == header.idx_lock_byte_page: + page = parser_sqlite3.Sqlite3.LockBytePage((i + 1), db._io, db, db._root) + elif ( + i >= header.idx_first_ptrmap_page and + i <= header.idx_last_ptrmap_page + ): + page = parser_sqlite3.Sqlite3.PtrmapPage((i + 1), db._io, db, db._root) + else: + page = parser_sqlite3.Sqlite3.BtreePage((i + 1), db._io, db, db._root) + db._io.seek(_pos) + return page + # create a database parser + database = "test.db" + db = parser_sqlite3.Sqlite3.from_file(database) + # patch the internal cache attribute of db.pages + db._m_pages = PagesList(db) + # now, this will parse **only** the first page + page = db.pages[0] types: database_header: seq: From a055f991b52c7596cc272ee60820d2128a3f1605 Mon Sep 17 00:00:00 2001 From: Milan Hauth Date: Mon, 3 Apr 2023 10:18:08 +0200 Subject: [PATCH 26/28] rename page types to *_page --- database/sqlite3.ksy | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/database/sqlite3.ksy b/database/sqlite3.ksy index 29a6fd60e..1eb00b386 100644 --- a/database/sqlite3.ksy +++ b/database/sqlite3.ksy @@ -291,7 +291,7 @@ types: doc: The number of fragmented free bytes within the cell content area. - id: right_ptr type: btree_page_pointer - if: page_type == btree_page_type::index_interior or page_type == btree_page_type::table_interior + if: page_type == btree_page_type::index_interior_page or page_type == btree_page_type::table_interior_page doc: | The right-most pointer. This value appears in the header of interior b-tree pages only and is omitted from all other pages. @@ -319,10 +319,10 @@ types: type: switch-on: _parent.page_type cases: - btree_page_type::table_leaf: table_leaf_cell - btree_page_type::table_interior: table_interior_cell - btree_page_type::index_leaf: index_leaf_cell - btree_page_type::index_interior: index_interior_cell + btree_page_type::table_leaf_page: table_leaf_cell + btree_page_type::table_interior_page: table_interior_cell + btree_page_type::index_leaf_page: index_leaf_cell + btree_page_type::index_interior_page: index_interior_cell table_leaf_cell: doc-ref: 'https://www.sqlite.org/fileformat2.html#b_tree_pages' seq: @@ -538,10 +538,10 @@ enums: 1: legacy 2: wal btree_page_type: - 0x02: index_interior - 0x05: table_interior - 0x0a: index_leaf - 0x0d: table_leaf + 0x02: index_interior_page + 0x05: table_interior_page + 0x0a: index_leaf_page + 0x0d: table_leaf_page ptrmap_page_type: 1: root_page 2: free_page From 2aeb2039ea5f3b1864a7c887af4e476ab58f3583 Mon Sep 17 00:00:00 2001 From: Milan Hauth Date: Mon, 10 Apr 2023 12:45:14 +0200 Subject: [PATCH 27/28] fix: ofs_content is relative to page --- database/sqlite3.ksy | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/database/sqlite3.ksy b/database/sqlite3.ksy index 1eb00b386..74c5b2f8a 100644 --- a/database/sqlite3.ksy +++ b/database/sqlite3.ksy @@ -315,7 +315,8 @@ types: type: u2 instances: content: - pos: ofs_content + # ofs_content is relative to page + pos: ((_parent.page_number - 1) * _root.header.page_size) + ofs_content type: switch-on: _parent.page_type cases: From 22c4801b52979bf42fd5d81b00b9af6396a8b92d Mon Sep 17 00:00:00 2001 From: Milan Hauth Date: Wed, 12 Apr 2023 19:37:34 +0200 Subject: [PATCH 28/28] fix pointer_map_page --- database/sqlite3.ksy | 92 +++++++++++++++++++++++++++++++++++++++----- 1 file changed, 82 insertions(+), 10 deletions(-) diff --git a/database/sqlite3.ksy b/database/sqlite3.ksy index 74c5b2f8a..c4840417a 100644 --- a/database/sqlite3.ksy +++ b/database/sqlite3.ksy @@ -55,7 +55,7 @@ instances: switch-on: '(_index == header.idx_lock_byte_page ? 0 : _index >= header.idx_first_ptrmap_page and _index <= header.idx_last_ptrmap_page ? 1 : 2)' cases: 0: lock_byte_page(_index + 1) - 1: ptrmap_page(_index + 1) + 1: pointer_map_page(_index + 1) # TODO: Free pages and cell overflow pages are incorrectly interpreted as btree pages # This is unfortunate, but unavoidable since there's no way to recognize these types at # this point in the parser. @@ -230,23 +230,95 @@ types: lock-byte page. The lock-byte page is set aside for use by the operating-system specific VFS implementation in implementing the database file locking primitives. SQLite does not use the lock-byte page. - ptrmap_page: + pointer_map_page: params: - - id: page_number + - id: pointer_map_page_number type: u4 seq: - id: entries - type: ptrmap_entry + type: pointer_map_entry repeat: expr repeat-expr: num_entries instances: - first_page: - value: '3 + (_root.header.num_ptrmap_entries_max * (page_number - 2))' - last_page: - value: 'first_page + _root.header.num_ptrmap_entries_max - 1' + first_linked_page_number: + value: pointer_map_page_number + 1 + last_linked_page_number_max: + value: pointer_map_page_number + _root.header.pointer_map_page_entries_max + last_linked_page_number: + value: | + last_linked_page_number_max <= _root.header.num_pages + ? last_linked_page_number_max + : _root.header.num_pages num_entries: - value: '(last_page > _root.header.num_pages ? _root.header.num_pages : last_page) - first_page + 1' - ptrmap_entry: + value: last_linked_page_number - first_linked_page_number + 1 + doc: | + A ptrmap page contains back-links from child to parent. + See also: /types/pointer_map_entry. + + Pointer map pages (or "ptrmap pages") + are extra pages inserted into the database + to make the operation of auto_vacuum and + incremental_vacuum modes more efficient. + + Ptrmap pages must exist in any database file + which has a non-zero largest root b-tree page value + in db.header.largest_root_page. + + If db.header.largest_root_page is zero, + then the database must not contain ptrmap pages. + + The first ptrmap page (on page 2) + will contain back pointer information + for pages 3 through J+2, inclusive. + + The second pointer map page will be on page J+3 + and that ptrmap page will provide back pointer information + for pages J+4 through 2*J+3 inclusive. + + And so forth for the entire database file. + + ```py + page_size = 512 + page_reserved_space_size = 0 + U = usable_size = page_size - page_reserved_space_size # 512 + J = pointer_map_page_entries_max = usable_size // 5 # 102 + + # pointer map 1 + X = 1 + N = pointer_map_page_number_raw = ((X - 1) * J) + 1 + X # 2 + A = first_linked_page_number = N + 1 # 3 + Z = last_linked_page_number = N + J # 104 = J + 2 + + # pointer map 2 + X = 2 + N = pointer_map_page_number = ((X - 1) * J) + 1 + X # 105 = J + 3 + A = first_linked_page_number = N + 1 # 106 = J + 4 + Z = last_linked_page_number = N + J # 207 = (2 * J) + 3 + + # pointer map 3 + X = 3 + N = pointer_map_page_number = ((X - 1) * J) + 1 + X # 208 + A = first_linked_page_number = N + 1 # 209 + Z = last_linked_page_number = N + J # 310 + + # pointer map 4 + X = 4 + N = pointer_map_page_number = ((X - 1) * J) + 1 + X # 311 + A = first_linked_page_number = N + 1 # 312 + Z = last_linked_page_number = N + J # 413 + ``` + + actual pointer_map_page_number: + + ```py + NR = pointer_map_page_number_raw = ((X - 1) * J) + 1 + X # 2 + N = pointer_map_page_number = ( + pointer_map_page_number_raw + if (pointer_map_page_number_raw != lock_byte_page_number) + else (pointer_map_page_number_raw + 1) + ) + ``` + doc-ref: https://www.sqlite.org/fileformat2.html#pointer_map_or_ptrmap_pages seq: - id: type type: u1