Improve performance of AST deserialization (#7935)

2025-08-07T02:30:57+00:00

* Improve performance of AST deserialization

The primary goal of these changes is to reduce the total time spent in the global session's `loadBuiltinModule()`, which gets called as part of global session creation to load the core module, and thus impacts every invocation of `slangc` and every user of the Slang compiler API.

The majority of that time is spent simply deserializing the core module's AST and IR and, of those two, the AST takes significantly longer to load than the IR (in the ballpark of 5x the time).

This change is focused on the serialization infrastructure but, given the performance situation described above, the focus is first and foremost on *deserialization* performance for the Slang *AST*, when using the *fossil* format. That focus shows through in the changes that have been implemented.

Change serialization framework to use `template` instead of `virtual`
=====================================================================

The recently-introduced serialization framework in `slang-serialize.h` was centered around a dynamically-dispatched `ISerializerImpl` interface. As a result, every single invocation of a `serialize(...)` call ultimately went through `virtual` function dispatch. While the overhead of the `virtual` calls themselves does not have a major impact on the total deserialization performance, those calls end up serving as a barrier to further optimization.

This change rewrites operations that used to take a `Serializer const&` (which wraps an `ISerializerImpl*`) so that they instead declare a template parameter `S` and take an `S const&`.

The main consequence of the change is that `serialize()` functions for user-defined types will need to be template functions, and thus either be defined in headers (alongside the type that they serialize) or else in the specific source file that handles serialization (as is currently being done for the AST-related types in `slang-serialize-ast.cpp`).
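To illustrate the shape of the templated approach, here is a minimal sketch, not the actual code in `slang-serialize.h`. The `VectorWriter`, `VectorReader`, and `Point` types are hypothetical stand-ins, and the back-end is passed by plain reference rather than through the `S const&` wrapper described above. Only the `handleInt32` name is borrowed from the commit message.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical writing back-end: appends 32-bit integers to a vector.
struct VectorWriter
{
    std::vector<std::int32_t> out;

    void handleInt32(std::int32_t& value) { out.push_back(value); }
};

// Hypothetical reading back-end: reads the integers back in order.
struct VectorReader
{
    const std::vector<std::int32_t>& in;
    std::size_t cursor = 0;

    explicit VectorReader(const std::vector<std::int32_t>& data)
        : in(data)
    {
    }

    void handleInt32(std::int32_t& value)
    {
        assert(cursor < in.size());
        value = in[cursor++];
    }
};

// Hypothetical user-defined type to be serialized.
struct Point
{
    std::int32_t x = 0;
    std::int32_t y = 0;
};

// The back-end type S is a template parameter, so the call to
// handleInt32 is resolved at compile time and can be inlined.
template<typename S>
void serialize(S& serializer, std::int32_t& value)
{
    serializer.handleInt32(value);
}

// One function body serves both reading and writing.
template<typename S>
void serialize(S& serializer, Point& value)
{
    serialize(serializer, value.x);
    serialize(serializer, value.y);
}
```

Calling `serialize(writer, point)` and then `serialize(reader, otherPoint)` round-trips the two fields without any `virtual` dispatch.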
Note that if we later decide that we want the ability to perform serialization through a dynamically-dispatched interface (e.g., to easily toggle between different serialization back-ends), it will be easier to layer a dynamically-dispatched implementation on top of the statically-dispatched `template` version than the other way around.

Generous use of `SLANG_FORCE_INLINE`
====================================

In order to unlock further optimizations, a number of operations were marked with `SLANG_FORCE_INLINE`. It is important to note that forcing inlining like this is a big hammer, and needs to be approached with at least a little caution. The simplest cases are:

* trivial wrapper functions that just delegate to another function
* functions that only have a single call site (but exist to keep abstractions clean)

Externalize Scope for `begin`/`end` Operations
==============================================

The old `ISerializerImpl` interface had a number of paired begin/end operations that define the hierarchical structure of the data being read. Most serializer implementations (whether for reading or writing) use these operations to help maintain some kind of internal stack for tracking state in the hierarchy.

The overhead of maintaining such a stack with something like a `List` amortizes out over many operations, but even that overhead is unnecessary when the begin/end pairs are *already* mirroring the call stack of the code invoking serialization.

This change modifies the `ScopedSerializerFoo` types so that they each provide a piece of stack-allocated storage to the serializer back-end's `beginFoo()` and `endFoo()` operations. Currently only the `Fossil::SerialReader` is making use of that facility, but the other implementations of readers and writers in the codebase could be adapted if we ever wanted to.
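The externalized-scope idea can be sketched roughly as follows. This is an illustration under assumptions, not the real `Fossil::SerialReader`: the `SketchReader`, `ScopedStruct`, `State`, and `readValue` names are hypothetical, and the state being saved is deliberately trivial.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical reader that tracks only the *current* nesting state.
// It owns no stack of its own.
struct SketchReader
{
    struct State
    {
        int depth = 0;
        std::size_t itemsRead = 0;
    };

    // Storage for the saved parent state. Each scoped helper embeds one
    // of these, so it lives in the caller's stack frame.
    struct Scope
    {
        State saved;
    };

    State current;

    void readValue() { current.itemsRead++; }

    void beginStruct(Scope& scope)
    {
        scope.saved = current; // stash parent state in caller-provided storage
        current.depth += 1;
        current.itemsRead = 0;
    }

    void endStruct(Scope& scope)
    {
        current = scope.saved; // restore the parent state
        current.itemsRead++;   // the nested struct counts as one parent item
    }
};

// RAII helper in the spirit of the `ScopedSerializerFoo` types: it pairs
// begin/end calls and supplies the per-scope storage.
template<typename S>
struct ScopedStruct
{
    S& serializer;
    typename S::Scope scope;

    explicit ScopedStruct(S& s)
        : serializer(s)
    {
        serializer.beginStruct(scope);
    }

    ~ScopedStruct() { serializer.endStruct(scope); }

    ScopedStruct(const ScopedStruct&) = delete;
    ScopedStruct& operator=(const ScopedStruct&) = delete;
};
```

Because each `ScopedStruct` is a local variable, nested scopes reuse the C++ call stack for their saved state, and the reader never has to grow a heap-allocated stack.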
Streamline `Fossil::SerialReader`
=================================

The most significant performance gains came from changes to the `Fossil::SerialReader` type, aimed at minimizing the cycles spent in the core `_readValPtr()` routine.

That function used to have a large-ish `switch` statement that implemented superficially very different reading logic depending on the outer container/object being read from. The new logic pushes more work back on the `begin` and `end` operations (which get invoked far less frequently than simple scalar/pointer values get read), so that they always set up the state of the reader with direct pointers to the data and layout for the next fossilized value to be read.

The remaining work in `_readValPtr()` has been factored into a different subroutine - `_advanceCursor()` - that takes responsibility for advancing the data pointer and updating the various other fields. The `_advanceCursor()` routine is still messier than is ideal, because it has to deal with the various different kinds of logic required for navigating to the next value.

Various other conditionals inside the `SerialReader` implementation were streamlined, mostly by collapsing the `State::Type` enumeration down to only represent the cases that are truly semantically distinct.

Evaluated: Streamline Layout Rules for Fossil
=============================================

One potential approach that I implemented but then reverted (after finding it had little to no performance impact) was changing the fossil format to always write things with 4-byte alignment/granularity. That would mean values smaller than 4 bytes would get inflated to a full 4 bytes, and scalar values larger than 4 bytes would get written with only 4-byte alignment (requiring unaligned loads to read them).
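For readers unfamiliar with the trade-off, the following sketch shows what a 4-byte granularity rule implies. The `roundUpTo4` and `loadUnaligned64` helpers are hypothetical and are not taken from the reverted experiment; they only illustrate size inflation and a portable unaligned load.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Round a size up to the next multiple of 4, as a format with 4-byte
// granularity would do for values smaller than 4 bytes.
inline std::size_t roundUpTo4(std::size_t size)
{
    return (size + 3) & ~std::size_t(3);
}

// Read a 64-bit value from an address that may be only 4-byte aligned.
// Copying through memcpy avoids undefined behavior from a misaligned
// pointer dereference; compilers typically lower it to a single load
// on targets that allow unaligned access.
inline std::uint64_t loadUnaligned64(const unsigned char* src)
{
    assert(src != nullptr);
    std::uint64_t value;
    std::memcpy(&value, src, sizeof(value));
    return value;
}
```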
I found that the only way to take advantage of the simplified layout rules to improve read performance would be to more-or-less eliminate the use of the layout information embedded in the fossil data, which would make it very difficult to validate that the data is correctly structured.

Possible Future Work: Further Type Specialization
=================================================

As it stands, the biggest overhead remaining on the critical path of `_readValPtr()` is the way the `_advanceCursor()` logic needs to take different approaches depending on the type of the surrounding context (advancing through elements of a container is very different from advancing through fields of a `struct`, for example).

The interesting thing to note is that at the use site within a `serialize()` function, it is usually manifestly obvious which case something is in. If the code uses `SLANG_SCOPED_SERIALIZER_ARRAY` it is in a container, while if it uses `SLANG_SCOPED_SERIALIZER_STRUCT` it is in a struct. This means that the contextual information is statically available, but just isn't exposed in a way that lets the core reading logic take advantage of it.

A logical extension of the work here would be to expand on the `Scope` idea added in this change such that most of the serialization operations (`handleInt32`, `handleString`, etc.) are actually dispatched through the scope, and then have each of the `SLANG_SCOPED_SERIALIZER_...` macros instantiate a *different* scope type (still dependent on the serializer).

* fixup
* format code
* typo

---------

Co-authored-by: slangbot <186143334+slangbot@users.noreply.github.com>

Add a memory-mappable binary serialization format (#7222)

2025-05-30T17:00:38+00:00

The files `slang-fossil.{h,cpp}` define a new serialization format that is designed to support data being memory-mapped in and then traversed as-is. The `docs/design/serialization.md` document was updated with details on this new format.

The `slang-serialize-fossil.{h,cpp}` files define implementations of the recently introduced `ISerializerImpl` interface for reading/writing this new binary format. The overall structure of these implementations is heavily based on the existing RIFF implementation from `slang-serialize-riff.{h,cpp}`.

Switching the AST serialization over to use this format required almost no changes to `slang-serialize-ast.cpp`.

The new format is more space-efficient than the RIFF-based format in memory (by a factor of over 2x), but is actually *worse* than the RIFF-based format in terms of how it affects the size of `slang.dll`, because the new format is seemingly less amenable to LZ4 compression.

A few pieces of utility code were added or moved as part of this work:

* The `core/slang-internally-linked-list.*` implementation is just a type that was used as part of `core/slang-riff.*`, but that wasn't really RIFF-specific.
* The `core/slang-blob-builder.*` files implement a low-level utility for building a binary format in memory out of "chunks". The overall structure of this type is based on the RIFF-specific builder implementation, but has been generalized so that it should apply to other kinds of binary serialization.
* The `core/slang-relative-ptr.h` file implements a simple relative pointer type, which is currently only used by the `slang-fossil.h` format.

If there are concerns about adopting the new format immediately for the AST, this change could be modified to introduce all the new code, but leave the AST serialization using the previous RIFF-based format.
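As background on why a relative pointer suits memory-mapped data, here is a minimal sketch of the general technique. It is not the contents of `core/slang-relative-ptr.h`: the `RelPtr` and `Blob` names, the 32-bit offset, and the choice to measure the offset from the pointer's own address are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical relative pointer: stores the signed byte distance from
// its own address to the target, rather than an absolute address.
template<typename T>
struct RelPtr
{
    std::int32_t offset = 0; // 0 is reserved to mean "null"

    void set(T* target)
    {
        if (!target)
        {
            offset = 0;
            return;
        }
        const std::intptr_t delta = reinterpret_cast<std::intptr_t>(target) -
                                    reinterpret_cast<std::intptr_t>(this);
        // The target must be within reach of a 32-bit offset.
        assert(delta >= INT32_MIN && delta <= INT32_MAX);
        offset = static_cast<std::int32_t>(delta);
    }

    T* get() const
    {
        if (offset == 0)
            return nullptr;
        return reinterpret_cast<T*>(reinterpret_cast<std::intptr_t>(this) + offset);
    }
};

// Hypothetical blob layout: the pointer and its target live in the same
// contiguous region, so the offset stays valid wherever the region lands.
struct Blob
{
    RelPtr<std::int32_t> link;
    std::int32_t value = 0;
};
```

Copying a `Blob` byte-for-byte to a new address (as happens, in effect, when a file is memory-mapped at an arbitrary base) leaves `link` pointing at the copy's own `value`, with no fix-up pass required.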

slang.git/source/core/slang-relative-ptr.h, branch master