Add a memory-mappable binary serialization format (#7222)

The files `slang-fossil.{h,cpp}` define a new serialization format that is designed to support data being memory-mapped in and then traversed as-is. The `docs/design/serialization.md` document was updated with details on this new format.

The `slang-serialize-fossil.{h,cpp}` files define implementations of the recently introduced `ISerializerImpl` interface for reading/writing this new binary format. The overall structure of these implementations is heavily based on the existing RIFF implementation from `slang-serialize-riff.{h,cpp}`.

Switching the AST serialization over to use this format required almost no changes to `slang-serialize-ast.cpp`.

The new format is more space-efficient than the RIFF-based format in memory (by a factor of over 2x), but is actually *worse* than the RIFF-based format in terms of how it affects the size of `slang.dll`, because the new format is seemingly less amenable to LZ4 compression.

A few pieces of utility code were added or moved as part of this work:

* The `core/slang-internally-linked-list.*` implementation is just a type that was used as part of `core/slang-riff.*`, but that wasn't really RIFF-specific.

* The `core/slang-blob-builder.*` files implement a low-level utility for building a binary format in memory out of "chunks". The overall structure of this type is based on the RIFF-specific builder implementation, but has been generalized so that it should apply to other kinds of binary serialization.

* The `core/slang-relative-ptr.h` file implements a simple relative pointer type, which is currently only used by the `slang-fossil.h` format.

If there are concerns about adopting the new format immediately for the AST, this change could be modified to introduce all the new code, but leave the AST serialization using the previous RIFF-based format.
author: Theresa Foley <10618364+tangent-vector@users.noreply.github.com> 2025-05-30 10:00:38 -0700
committer: GitHub <noreply@github.com> 2025-05-30 17:00:38 +0000
commit: ec7ab914f79978b8980c7797e20d3399604b1f86 (patch)
tree: 2e6b01dc99fc0998e4f17a9aeaf22ef3d48817e0 /docs/design
parent: 14409bf1015af47691f09d2be6afb18cfb999aea (diff)
1 files changed, 154 insertions, 1 deletions
diff --git a/docs/design/serialization.md b/docs/design/serialization.md
index c3e32dfca..10dae90fd 100644
--- a/docs/design/serialization.md
+++ b/docs/design/serialization.md
@@ -5,6 +5,147 @@ Slang's infrastructure for serialization is currently in flux, so there exist a
 
 This document is currently minimal, and primarily serves to provide a replacement for an older draft that no longer reflects the state of the codebase.
 
+The Fossil Format
+=================
+
+The "fossil" format is a memory-mappable binary format for general-purpose serialization.
+
+Goals
+-----
+
+The main goals of the fossil format are:
+
+* Data can be read from memory as-is.
+
+  * Basic types are stored at offsets that are naturally aligned (e.g., a 4-byte integer is 4-byte aligned)
+
+  * Pointers are encoded as relative offsets, and can be traversed without any "relocation" step after data is loaded.
+
+* Supports general-purpose data, including complicated object graphs.
+
+* Data can include embedded layout information, allowing code to traverse it without statically knowing the structure.
+
+  * Embedded layout information should support versioning; new code should be able to load old data by noticing what has/hasn't been encoded.
+
+* Layout information is *optional*, and data can be traversed with minimal overhead by code that knows/assumes the layout
+
+Top-Level Structure
+-------------------
+
+A serialized blob in fossil format starts with a header (see `Slang::Fossil::Header`), which in turn points to the *root value*.
+All other data in the blob should be reachable from the root value, and an application can choose to make the root value whatever type they want (an array, structure, etc.).
+
+Encoding
+--------
+
+### Endian
+
+All data is read/written in the endianness of the host machine.
+There is currently no automatic support for encoding endianness as part of the format; a byte-order mark should be added if we ever need to support big-endian platforms.
+
+### Fixed-Size Types
+
+#### Basic Types
+
+Basic types like fixed-width integers and floating-point numbers are encoded as-is.
+That is, an N-byte value is stored directly as N bytes of data with N-byte alignment.
+
+A Boolean value is encoded as an 8-bit unsigned integer holding either zero or one.
+
+#### Pointers
+
+A pointer is encoded as a 4-byte signed integer, representing a relative offset.
+
+If the relative offset value is zero, then the pointer is null.
+Otherwise, the relative offset value should be added to the offset of the pointer itself, to get the offset of the target.
+
+#### Optionals
+
+An optional value of some type `T` (e.g., the equivalent of a `std::optional<T>`) is encoded as a pointer to a `T`.
+If the pointer is null, the optional has no value; otherwise the value is stored at the offset being pointed to.
+
+Note that when encoding a pointer to an optional (`std::optional<T> *`) or an optional pointer (`std::optional<T*>`), there will be two indirections.
+
+#### Records
+
+Things that are conceptually like a `struct` or tuple are encoded as *records*, which are simply a sequence of *fields*.
+
+The alignment of a record is the maximum alignment of its fields.
+
+Fields in a record are laid out sequentially, where each field gets the next suitably-aligned offset after the preceding field.
+No effort is made to fill in "gaps" left by preceding fields.
+
+Note: currently the size of a record is *not* rounded up to be a multiple of its alignment, so it is possible for one field to be laid out in the "tail padding" of the field before it.
+This behavior should probably be changed, so that the fossilized layout better matches what C/C++ compilers tend to do.
+
+### Variable-Size Types
+
+Types where different instances may consume a different number of bytes may be encoded either *inline* or *indirectly*.
+
+If a variable-size type `V` is being referred to by a pointer or optional (e.g., `V*` or `std::optional<V>`), then it will be encoded inline as the target address of that pointer/optional.
+
+In all other contexts, including when a `V` is used as a field of a record, it will be encoded indirectly (conceptually, as if the field was actually a `V*`).
+When a variable-size type is encoded indirectly, a null pointer should be interpreted as an empty instance of the type `V`.
+
+#### Arrays
+
+An array of `T` is encoded as a sequence of `T` values, separated by the *stride* of `T` (the size of `T` rounded up to the alignment of `T`).
+The offset of the array is the offset of its first element.
+
+The number of elements in the array is encoded as a 4-byte unsigned integer stored immediately *before* the offset of the array itself.
+
+#### Strings
+
+A string is encoded in the same way that an array of 8-bit bytes would be (including the count stored before the first element).
+The only additional detail is that the serialized data *must* include an additional nul byte after the last element of the string.
+
+The data of a string is assumed to be in UTF-8 encoding, but there is nothing about the format that validates or enforces this.
+
+#### Dictionaries
+
+A dictionary with keys of type `K` and values of type `V` is encoded in the same way as an array of `P`, where `P` is a two-element tuple of a `K` and a `V`.
+
+There is currently no provision made for efficient lookup of elements of a fossilized dictionary.
+
+#### Variants
+
+A *variant* is a fossilized value that can describe its own layout.
+
+The content of a variant holding a value of type `T` is encoded exactly as a record with one field of type `T` would be, starting at the offset of the variant itself.
+
+The four bytes immediately preceding a variant store a relative pointer to the fossilized layout for the type `T` of the content.
+
+### Layouts
+
+Every layout starts with a 4-byte unsigned integer that holds a tag representing the kind of layout (see `Slang::FossilizedValKind`).
+The value of the tag determines what, if any, information appears after the tag.
+
+In any place where a relative pointer to a layout is expected, a null pointer may be used to indicate that the relevant layout information is either unknown, or was elided from the fossilized data.
+
+#### Pointer-Like Types
+
+For pointers (`T*`) and optionals (`Optional<T>`), the tag is followed by a relative pointer to a layout for `T`.
+
+#### Container Types
+
+For arrays and dictionaries, the tag is followed by:
+
+* A relative pointer to a layout for the element type
+
+* A 4-byte unsigned integer holding the stride between elements
+
+#### Record Types
+
+For records, the tag is followed by:
+
+* A 4-byte unsigned integer holding the number of fields, `N`
+
+* `N` 8-byte values representing the fields, each comprising:
+
+    * A relative pointer to the type of the field
+
+    * A 4-byte unsigned integer holding the offset of that field within the record
+
 The RIFF Support Code
 =====================
 
@@ -56,7 +197,19 @@ Of course there's a lot more to it in once you get into the details and the diff
 For now, looking at `source/slang/slang-serialize.h` is probably the best way to learn more about the approach.
 
 One key goal of this serialization system is that it allows the serialized format to be swapped in and out without affecting the per-type `serialize` functions.
-The current implementation only includes a RIFF-based output format matching what had previously been in use for the AST, but the infrastructure should also be able to support a JSON implementation, or binary formats.
+Currently there are only a small number of implementations.
+
+RIFF Serialization
+------------------
+
+The files `slang-serialize-riff.{h,cpp}` provide an implementation of the general-purpose serialization framework that reads/writes RIFF files with a particular kind of structure, based on what had previously been hard-coded for use in serializing the AST to RIFF.
+
+In practice this representation is kind of like an encoding of JSON as RIFF chunks, with leaf/data chunks for what would be leaf values in JSON, and container chunks for arrays and dictionaries (plus other aggregates that would translate into arrays or dictionaries in JSON).
+
+Fossil Serialization
+--------------------
+
+The files `slang-serialize-fossil.{h,cpp}` provide an implementation of the general-purpose serialization framework that reads/writes the "fossil" format, which is described earlier in this document.
 
 AST Serialization
 =================
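
As a reading aid for the "Pointers" and "Arrays" encodings described in the diff above, here is a minimal sketch of how a reader might resolve a relative pointer and fetch an array's element count from a loaded blob. The helper names `resolveRelativePtr` and `readArrayCount` are hypothetical and are not taken from `core/slang-relative-ptr.h` or `slang-fossil.h`; the sketch also assumes the blob is addressed by byte offset and uses `-1` to signal a null pointer.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not the Slang API): resolve the 4-byte signed
// relative pointer stored at `ptrOffset` inside `blob`.
// Returns the offset of the target, or -1 if the pointer is null.
inline int64_t resolveRelativePtr(const uint8_t* blob, int64_t ptrOffset)
{
    int32_t relative;
    std::memcpy(&relative, blob + ptrOffset, sizeof(relative)); // host byte order
    if (relative == 0)
        return -1; // a zero relative offset encodes null
    // The relative offset is added to the offset of the pointer itself.
    return ptrOffset + relative;
}

// Hypothetical helper (not the Slang API): read the element count of an
// array (or string) whose first element is at `arrayOffset`.
// The count is the 4-byte unsigned integer immediately before that offset.
inline uint32_t readArrayCount(const uint8_t* blob, int64_t arrayOffset)
{
    assert(arrayOffset >= 4); // there must be room for the count
    uint32_t count;
    std::memcpy(&count, blob + arrayOffset - 4, sizeof(count));
    return count;
}
```

Using `memcpy` rather than a pointer cast keeps the sketch free of alignment and aliasing assumptions; code that trusts the format's natural-alignment guarantee could read the values directly instead.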
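
The record layout rules from the "Records" section of the diff above (sequential fields, each at the next suitably aligned offset, with the record size *not* rounded up) can be sketched as follows. All names here (`FieldShape`, `RecordLayout`, `alignUp`, `layoutRecord`) are hypothetical illustrations rather than code from the Slang sources, and the sketch assumes power-of-two alignments and an alignment of 1 for an empty record.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical description of one field: its size and alignment in bytes.
struct FieldShape
{
    uint32_t size;
    uint32_t alignment;
};

// Hypothetical result of laying out a record.
struct RecordLayout
{
    std::vector<uint32_t> fieldOffsets;
    uint32_t size = 0;      // NOT rounded up to a multiple of `alignment`
    uint32_t alignment = 1; // maximum alignment of the fields (assumed 1 if empty)
};

// Round `offset` up to the next multiple of `alignment` (a power of two).
inline uint32_t alignUp(uint32_t offset, uint32_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (offset + alignment - 1) & ~(alignment - 1);
}

// Lay out fields sequentially; gaps left by earlier fields are not reused.
inline RecordLayout layoutRecord(const std::vector<FieldShape>& fields)
{
    RecordLayout layout;
    for (const FieldShape& field : fields)
    {
        uint32_t offset = alignUp(layout.size, field.alignment);
        layout.fieldOffsets.push_back(offset);
        layout.size = offset + field.size;
        if (field.alignment > layout.alignment)
            layout.alignment = field.alignment;
    }
    return layout;
}
```

For example, a record with a 1-byte, a 4-byte, and a 1-byte field gets offsets 0, 4, and 8 and a size of 9 rather than 12. If that record is then used as a field followed by a 1-byte field, the second field lands at offset 9, inside what a C/C++ compiler would treat as tail padding. The array stride of the same record would be `alignUp(9, 4)`, which is 12.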