Skip to content

Notes on Runtime Library

Robin Sommer edited this page May 3, 2020 · 11 revisions

Some pieces of the HILTI/Spicy runtime libraries come with some specific constraints and/or design goals. We collect some notes on that here.

Types

Containers (bytes/list/map/set/stream/vector)

  • Our iterator interfaces aren't standard compliant at the moment, we should make them so.

  • Different from iterators in the C++ standard library, we need iterators to be safe against undefined behavior. We want the following semantics:

    • An existing iterator must generally remain valid as long as the underlying container sticks around.

    • An existing iterator becomes invalid in two cases:

      1. The underlying container gets destroyed.
      2. Due to container changes, the iterator's current position semantically does not "make sense" anymore. For example, a vector iterator pointing to an index i will be invalid once the vector shrinks to less than i+1 elements. (Note the "semantically" here: For example, resizing of a vector's internal array should not invalidate any iterators.)
    • Once invalid, any attempt to deference an iterator must throw an exception.

    • Even when invalid, iterator operations other than dereferencing should remain supported. For example, if one keep incrementing a vector iterator beyond the vector's size, it may eventually become valid again if the vector grows sufficiently. (This property is particularly important for streams; we'll discuss that further down below.)

    • As the previous bullet suggests, it is possible for an invalid iterator to become valid once more if the container changes accordingly.

    • Iterators must not prevent their underlying containers from being destroyed (e.g., they can't store a strong reference to the container, as that would force it to stay around until the iterator dies.)

  • These safe properties came with some overhead.

    • Generally we accept that overhead.

    • Internally, however, we should avoid the overhead where we safely can. For example, if another part of libhilti-rt executes an iteration over a bytes instance where iterators can't become invalid while the operation is in progress, we should just skip the safety checks (this is the reason for the current "Safe" vs "Non-Safe" iterators; but not sure that's the best way to approach it).

    • Longer-term, the code generator could make use of such "unsafe" paths as well in cases where it can guarantee that the safety properties are still maintained by the particular code being generated.

Integers

The main requirements for integers are:

  1. No unsafe/undefined conversions (e.g., between signed/unsigned; or from more bits to less bits); unless explicitly override through cast.
  2. No overflows/underflows must go undetected (i.e., throw exceptions).

String

Strings are meant to capture character sequences meant for display to the user. They use UTF8 internally for storage, and character sets conversions must be well defined when creating instances, or extracting data.

Strings are immutable.

Bytes

Bytes capture raw data without associating meaning (or any encodings) with them.

Bytes are "mostly immutable". Mostly because currently we do support an "append" operation as the one method changing the content of an existing bytes instance. The motivation here was that "append" is common enough that it might be more efficient to support it in-place. However, we could get rid of this exception if it helps otherwise.

Stream

stream is one of the most complex types internally. It is optimized for the use case of moving a sliding window over a large data stream. That use case comes with a few requirements:

  • Needs to efficiently support appending large blocks of data.
  • Needs to efficiently support trimming off data at the front of the stream, and release that memory.
  • Needs to provide safe views & iterators into the stream (more on that below).
  • Does not need to support mutation of data already appended.

The current implementation uses a linked list of data chunks to support the "append" and "trim" operations. That's probably the right approach to keep, but it makes the iterators complex to implement safely as they need traverse across these chunks.

Streams can be frozen, which signals that no more data will be added. That's how Spicy parsers detect the end of the input stream. Currently, that's just a flag on the stream that can be toggled anytime. It would be ok to disallow going back to unfrozen once a stream has been frozen once.

Iterators

The stream's iterators need to satisfy the standard safety requirements of catching accesses to invalidated data, and not preventing the underlying stream object from releasing memory that's no longer needed; that's all the same as for any containers (see above). In additional, stream iterators must work with the "append" and "trim" operations in the following way:

  1. If an iterator refers to the current end of the stream (i.e., it has been created through std::end(stream)), it must keep referring to the stream's end even when data gets added. The iterator needs to move along with the stream to its new end (i.e., after the append, it must still equal std::end(stream)).

  2. If an iterator refers to a specific position inside the stream, it must not change that position when new data gets added.

  3. If an iterator refers to a specific position inside the stream and the stream gets trimmed beyond that position, the iterator must become invalid (i.e., throw on dereference).

  4. If an iterator is advanced to an offset beyond the current end of the stream, it must become invalid. However, once sufficient data gets added for the iterator's new position to become available, the iterator must become valid and now support deref.

  5. Comparison of iterators must compare the position they are pointing to, even if any of them is invalid and/or before or beyond the currently available data. Likewise, the difference of two iterators must return the number of bytes between them in the stream's character space, even if any of them is invalid.

View

A stream view is really just a pair of iterators marking beginning and end positions of window into the stream. Note that the iterator properties from above transfer over to a view. For example, if the view's end position is std::end(stream), the view must grow with the stream.

A stream view is one of the main data types used by Spicy parsers, and provides a number of methods to inspect the data it refers to; more actually than the stream itself.

We should consider having "unsafe" ways to iterate over a view's data for efficiency, as that's a very common operation for other types of the runtime library.

Global State & Thread Safety

  • Generally, the runtime library should avoid global state wherever possible to remain thread-safe. While we currently are not spawning multiple threads, we will do so eventually.

  • To that end, most global state that the library needs is encapsulated in the hilti::rt::Context struct. Each thread gets its own context instance, and receives access to it through hilti::rt::context::get() (which internally uses a TLS variable). That means that all accesses to anything in Context are automatically thread-safe.

  • There's a small set of truely global state in {hilti/spicy}/rt/global-state.h.

    • Some of this is read-only configuration information, which is thread-safe.
    • Some is state that needs to be managed acrosss threads; we need to review this, as currently some of that is not used in a thread-safe manner.
    • Generally, we should avoid adding any truely global state wherever we can, unless there's a strong reason for doing so.

Fibers

HILTI uses co-routines to suspend execution at arbitrary times and return control back to host applications. That's how Spicy implements incremental parsing: whenever it runs out of input, it returns back to the host application (e.g., Zeek) until more becomes availablee. It the resumes just where it left off.

This is implemented through a Fiber abstraction that internally uses the libtask co-routine implementation, plus recycling of old fibers through a cache so that we don't need to reallocated stacks constantly.

This code is unfortunately quite hard to follow because the control flow just jumps around so awkwardly. It generally seems to be working, but would benefit from some review:

  1. Are there other co-routine implementation out there are either either (1) more efficient, or (2) more portable (but not less efficient)? (Just waiting for C++20 co-routines is an option, too).

  2. Fibers are on Spicy's hot path, so performance here is critical: Is there anything we can speed up here? For example, we are currently wrapping callbacks into std::function, which I understand comes with some overhead.

  3. We have some unit tests, but should extend. A little benchmark tool to measure context switch overhead would be valuable, too.

Use of tinyformat (fmt.h)

I'm thinking we should switch to {fmt} as the more standard implementation. However, there's a catch: I believe we will eventually need our own implementation of format string rendering because otherwise we're getting semantic mismatches between HILTI's types and how the C++ side happens to interpret them. (This has already come up in a couple tickets, and I imagine we'll keep running into that.)

Result

We should generally use Result and ``result::Error` more consistently across the API, instead of where currently use booleans or null values.

Utilities (util.h)

There's probably some cleanup & optimization we can apply to a number of these utility functions, potentially also split them out into some subgroups.