Lazy parsing of sequence attributes #133
Comments
For Python there is a nice module, https://github.com/ionelmc/python-lazy-object-proxy, which may be useful, but KSC must be altered in order to use it. Lazy parsing also has an obvious drawback: for now we can parse and close the file, whereas lazy parsing means that we cannot close the file until all the needed parts of a struct have been accessed. So there should be some parameter controlling it. I guess we may need a hook function in the runtime to which we pass a lambda that accesses the property and returns the KS object. The non-lazy variant would execute the lambda immediately; the lazy one would call lazy_object_proxy.Proxy.
What if my file is 100 GB, but I have only 1 GB of RAM? Currently the compiler generates:

    @property
    def pages(self):
        if hasattr(self, "_m_pages"):
            return self._m_pages
        self._m_pages = []
        for i in range(self.header.num_pages):
            self._m_pages.append(
                Sqlite3.Page(
                    (i + 1), (self.header.page_size * i), self._io, self, self._root
                )
            )
        return self._m_pages

when it should generate:

    class Sqlite3(KaitaiStruct):
        class PagesList:
            def __init__(self, root):
                self.root = root

            def __len__(self):
                return self.root.header.num_pages

            def __getitem__(self, i):  # i is 0-based
                if i < 0:  # -1 means last page, etc
                    i = self.root.header.num_pages + i
                assert (
                    0 <= i and i < self.root.header.num_pages
                ), f"page index is out of range: {i} is not in (0, {self.root.header.num_pages - 1})"
                # TODO LRU cache with sparse array?
                # note: LRU cache does not give pointer equality
                # but equality check is trivial: page_a.page_number == page_b.page_number
                _pos = self.root._io.pos()
                self.root._io.seek(i * self.root.header.page_size)
                page = Sqlite3.Page(
                    (i + 1), (self.root.header.page_size * i), self.root._io, self.root, self.root._root
                )
                self.root._io.seek(_pos)
                return page

        def __init__(self, _io, _parent=None, _root=None):
            self._io = _io
            self._parent = _parent
            self._root = _root if _root else self
            # add this line:
            self.pages = Sqlite3.PagesList(self)
            self._read()

For now I'm patching the generated code.

See also: Can Kaitai Struct be used to describe TLV data without creating new types for each field? (via kaitai-io/kaitai_struct_formats#661 (comment))

Keywords: random access, array, list, nested parsing, deferred parsing
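The LRU cache from the TODO above could look roughly like this. It is a self-contained sketch under stated assumptions: `LazyPageList` is a hypothetical name, the stream is a plain `io.BytesIO` instead of a KaitaiStream, and a page is represented by its raw bytes where the real code would construct `Sqlite3.Page(...)`.

```python
import io
from collections import OrderedDict


class LazyPageList:
    """Sequence-like view over fixed-size pages; a page is read only when indexed."""

    def __init__(self, stream, page_size, num_pages, cache_size=2):
        self._stream = stream
        self._page_size = page_size
        self._num_pages = num_pages
        self._cache = OrderedDict()  # page index -> page, least recently used first
        self._cache_size = cache_size
        self.parse_count = 0  # demonstration only: counts real reads

    def __len__(self):
        return self._num_pages

    def __getitem__(self, i):
        if i < 0:  # -1 means last page, etc
            i += self._num_pages
        if not 0 <= i < self._num_pages:
            raise IndexError(f"page index out of range: {i}")
        if i in self._cache:
            self._cache.move_to_end(i)  # mark as most recently used
            return self._cache[i]
        saved = self._stream.tell()
        try:
            self._stream.seek(i * self._page_size)
            page = self._stream.read(self._page_size)  # stand-in for Sqlite3.Page(...)
        finally:
            self._stream.seek(saved)  # leave the stream where the caller had it
        self.parse_count += 1
        self._cache[i] = page
        if len(self._cache) > self._cache_size:
            self._cache.popitem(last=False)  # evict the least recently used page
        return page


# Five 4-byte pages: page n is filled with the byte value n.
data = b"".join(bytes([n]) * 4 for n in range(5))
pages = LazyPageList(io.BytesIO(data), page_size=4, num_pages=5)
```

Raising `IndexError` instead of using `assert` keeps the check active under `python -O`, and it also lets a plain `for page in pages` loop terminate, since Python falls back to `__getitem__` for iteration. Memory use is bounded by `cache_size` pages regardless of file size.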
#65 was created with exactly that use case in mind: not to store all the data in memory, but to map the file, store offsets, and let the OS read the data only when needed and re-parse it again (and in the case of fixed structs, without actually parsing, at least for C++ and Rust, when no serialization is needed). This is not very suitable for systems without an MMU, such as Arduino boards. For them it may be possible to generate instances that get the ranges. But again, it is not very suitable to read the whole array into memory, so we need to generate an index first and then reuse it. We also need the runtime to forget the data it no longer needs. None of these are straightforward decisions, and I doubt they can be made without consulting a general intelligence. The knowledge of the right decisions can be incorporated into ksy files in the form of hints (#225). Probably there should be a compiler mode that adds hint stubs (all the hints suitable for the case) to the ksy file in the places where they are needed. Then a programmer can use a diff tool to review the hints and choose the right ones.
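A rough Python illustration of the "map the file and store offsets" idea, using only the standard library. The record layout (`u4` id followed by `u2` flags) and the helper names `build_index` and `read_record` are assumptions made up for this sketch; nothing here is generated by KSC.

```python
import mmap
import struct
import tempfile

# Hypothetical fixed-size record: little-endian u4 id, u2 flags (6 bytes).
RECORD = struct.Struct("<IH")


def build_index(count, base=0):
    """Offsets of every record; for fixed-size records this is pure arithmetic."""
    return [base + i * RECORD.size for i in range(count)]


def read_record(buf, offset):
    """Decode one record directly from the mapped buffer at the given offset."""
    return RECORD.unpack_from(buf, offset)


with tempfile.TemporaryFile() as f:
    # Write 1000 records so there is something to map.
    for i in range(1000):
        f.write(RECORD.pack(i, i % 7))
    f.flush()

    # Length 0 maps the whole file; pages are faulted in by the OS on access.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
        index = build_index(1000)
        first = read_record(mapped, index[0])
        last = read_record(mapped, index[-1])
```

Only the index lives in process memory. For variable-size elements the index would have to be built by one sequential pass over the file, which is the "generate an index first, then reuse it" step mentioned above.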
Sometimes we have some fairly large or complex subtypes in a seq with a known size, which are not really used on every file read, so it's beneficial to read them lazily, i.e. on demand. A simple case: see #65 (comment) for proposed generation results.

Another, more complex option is to have lazy seq arrays with fixed-size elements and repeat-expr or repeat-eos, so that the number of elements is known a priori. Probably it's worth discussing after we've handled this simple case.
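To make the simple case concrete, here is a sketch of what lazily reading a known-size seq attribute could look like in Python. This is not the generation result proposed in #65; the class `Container`, its three-field layout, and the use of `io.BytesIO` with `struct` in place of KaitaiStream are all assumptions chosen to keep the example self-contained.

```python
import io
import struct


class Container:
    """Hypothetical layout: magic (u4), body (8 bytes, read lazily), trailer (u2)."""

    BODY_SIZE = 8

    def __init__(self, stream):
        self._io = stream
        self._read()

    def _read(self):
        (self.magic,) = struct.unpack("<I", self._io.read(4))
        # Known size: remember where the body starts and skip it without parsing.
        self._body_pos = self._io.tell()
        self._io.seek(self.BODY_SIZE, io.SEEK_CUR)
        (self.trailer,) = struct.unpack("<H", self._io.read(2))

    @property
    def body(self):
        # Parse on first access only, then reuse the cached result.
        if not hasattr(self, "_m_body"):
            saved = self._io.tell()
            self._io.seek(self._body_pos)
            raw = self._io.read(self.BODY_SIZE)
            self._io.seek(saved)
            self._m_body = struct.unpack("<2I", raw)
        return self._m_body


blob = struct.pack("<I", 0xCAFE) + struct.pack("<2I", 10, 20) + struct.pack("<H", 7)
obj = Container(io.BytesIO(blob))
```

The attributes after the lazy one can still be read eagerly because the size is known up front, which is exactly what makes this case simpler than lazy arrays.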