Serialization design doc first pass (#1587)

* #include an absolute path didn't work - because paths were taken to always be relative.
* WIP on serialization design doc.
* More docs on serialization design.
* Improve serialization documentation. Remove unused function from IRSerialReader.
* Small fixes around naming. Remove long comment from slang-serialize.h - as covered in serialization.md
* Remove long comment in slang-serialize.h as covered in serialization.md
* More information about doing replacements on read for AST and problems surrounding.
* Typo fix.
* Spelling fixes.
author: jsmall-nvidia <jsmall@nvidia.com> 2020-10-23 16:39:18 -0400
committer: GitHub <noreply@github.com> 2020-10-23 16:39:18 -0400
commit: e702b704e15a3f0dcbcac6ae82b7cb3c10a4ced2 (patch)
tree: 61d16e88b301ae021b301338851d3b6ccd274efb /source/slang
parent: 051b20c218124e9ffc72ae31b95529b35aa9a43c (diff)
4 files changed, 7 insertions, 239 deletions
diff --git a/source/slang/slang-serialize-ir.h b/source/slang/slang-serialize-ir.h
index c3c3bcf19..038335a60 100644
--- a/source/slang/slang-serialize-ir.h
+++ b/source/slang/slang-serialize-ir.h
@@ -68,9 +68,6 @@ struct IRSerialReader
 {
     typedef IRSerialData Ser;
     
-        /// Read potentially multiple modules from a stream
-    static Result readStreamModules(Stream* stream, Session* session, SourceManager* manager, List<RefPtr<IRModule>>& outModules, List<FrontEndCompileRequest::ExtraEntryPointInfo>& outEntryPoints);
-
         /// Read a stream to fill in dataOut IRSerialData
     static Result readContainer(RiffContainer::ListChunk* module, SerialCompressionType containerCompressionType, IRSerialData* outData);
 
diff --git a/source/slang/slang-serialize-source-loc.h b/source/slang/slang-serialize-source-loc.h
index c8f06d6eb..5ebd264cc 100644
--- a/source/slang/slang-serialize-source-loc.h
+++ b/source/slang/slang-serialize-source-loc.h
@@ -141,7 +141,7 @@ public:
 class SerialSourceLocReader : public RefObject
 {
 public:
-    static const SerialExtraType kExtraType = SerialExtraType::DebugReader;
+    static const SerialExtraType kExtraType = SerialExtraType::SourceLocReader;
 
     Index findViewIndex(SerialSourceLocData::SourceLoc loc);
 
@@ -170,7 +170,7 @@ protected:
 class SerialSourceLocWriter : public RefObject
 {
 public:
-    static const SerialExtraType kExtraType = SerialExtraType::DebugWriter;
+    static const SerialExtraType kExtraType = SerialExtraType::SourceLocWriter;
 
     class Source : public RefObject
     {
diff --git a/source/slang/slang-serialize-types.h b/source/slang/slang-serialize-types.h
index 8df2f362f..9bb84e290 100644
--- a/source/slang/slang-serialize-types.h
+++ b/source/slang/slang-serialize-types.h
@@ -14,8 +14,8 @@ namespace Slang {
 // An enumeration of types that can be set
 enum class SerialExtraType
 {
-    DebugReader,
-    DebugWriter,
+    SourceLocReader,
+    SourceLocWriter,
     CountOf,
 };
 
diff --git a/source/slang/slang-serialize.h b/source/slang/slang-serialize.h
index 4c6a57b34..0e7fdd68a 100644
--- a/source/slang/slang-serialize.h
+++ b/source/slang/slang-serialize.h
@@ -19,238 +19,9 @@ namespace Slang
 class Linkage;
 
 /*
-General Serialization Overview
-==============================
+A discussion of the serialization system design can be found in
 
-The AST node types are generally types derived from the NodeBase. The C++ extractor is used to associate an ASTNodeType with
-every NodeBase type, such that casting is fast and simple and we have a simple integer to uniquely identify those types. The
-extractor also performs another task of associating with the type name all of the fields held in just that type. The definition
-of the fields is stored in an 'x macro' which is in the slang-ast-generated-macro.h file, for example
-
-```
-#define SLANG_FIELDS_ASTNode_DeclRefExpr(_x_, _param_)\
-    _x_(scope, (RefPtr<Scope>), _param_)\
-    _x_(declRef, (DeclRef<Decl>), _param_)\
-    _x_(name, (Name*), _param_)
-``
-
-For the type DeclRefExpr, this holds all of the fields held in just DeclRefExpr in this case `scope`, `declRef` and `name`.
-DeclRefExpr derives from Expr and this might hold other fields and so forth.
-
-The implementation makes a distinction between the 'native' types, the regular C++ in memory types and 'serial' types.
-Each serializable C++ type has an associated 'serial' type - with the distinction that it can be written out and (with perhaps some other data)
-read back in to recreate the C++ type. The serial type can be a C++ type, but is such it can be written and read from disk and still
-represent the same data. 
-
-We need a mechanism to be able to do do a conversion between native and serial types. To make the association we use the template
-
-```
-template <typename T>
-struct SerialTypeInfo;
-```
-
-and specialize it for each native type. The specialization holds
-
-SerialType - The type that will be used to represent the native type
-NativeType - The native type
-SerialAlignment - A value that holds what kind of alignment the SerialType needs to be serializable (it may be different from SLANG_ALIGN_OF(SerialType)!)
-toSerial - A function that with the help of ASTSerialWriter convert the NativeType into the SerialType
-toNative - A function that with the help of ASTSerialReader convert the SerialType into the NativeType
-
-It is useful to have a structure that holds the type information, so it can be stored. That is achieved with
-
-```
-template <typename T>
-struct SerialGetType;
-```
-
-This template can be specialized for a specific native types - but all it holds is just a function getType, which returns a SerialType*,
-which just holds the information held in the SerialTypeInfo template, but additionally including the size of the SerialType.
-
-So we need to define a specialized SerialTypeInfo for each type that can be a field in a NodeBase/RefObject derived type. We don't need to define
-anything explicitly for the NodeBase derived types, as we will just generate the layout from the fields. How do we know the fields? We just
-used the macros generated from the C++ extractor.
-
-So first a few things to observe...
-
-1) Some types don't need any conversion to be serializable - int8_t, or float the bits can just be written out and read in (1)
-2) Some types need a conversion but it's very simple - for example an enum without explicit size, being written as an explicit size
-3) Some types can be written out but would not be directly readable or usable with different targets/processors, so need converting
-4) Some types require complex conversions that require programmer code - like Dictionary/List
-
-For types that need no conversion (1), we can just use the template SerialIdentityTypeInfo
-
-```
-template <>
-struct SerialTypeInfo<SomeType> : public SerialIdentityTypeInfo<SomeType> {};
-```
-
-This specialization means that SomeType can be written out and read in across targets/compilers without problems.
-
-For (2) we have another template that will do the conversion for us
-
-```
-template <typename NATIVE_T, typename SERIAL_T>
-struct SerialConvertTypeInfo;
-```
-
-That we can use as above, and specify the native and serial types.
-
-For (3) there are a few scenarios. For any field in a serial type we must store in the serialized type such that the representation
-will work across all processors/compilers. So one problematic type is `bool`. It's not specified how it's laid out in memory - and
-some compiles have stored it as a word. Most recently it's been stored as a byte. To make sure bool is ok for serialization therefore
-we store as a uint8_t.
-
-Another example would be double. It's 64 bits, but on some arches/compilers it's SLANG_ALIGN_OF is 4 and on others it's 8. On some
-arches a non aligned read will lead to a fault. To work around this problem therefore we have to ensure double has the alignment that
-will work across all targets - and that alignment is 8. In that specific case that issue is handled via SerialBasicTypeInfo, which
-makes the SerialAlignment the sizeof the type.
-
-For (4) there are a few things to say. First a type can always implement a custom version of how to do a conversion by specializing
-`SerialTypeInfo`. But there remains another nagging issue - types which allocate/use other memory that changes at runtime. Clearly
-we cannot define 'any size of memory' in a fixed SerialType defined in a specialization of SerialTypeInfo. The mechanism to work around
-this is to allow arbitrary arrays to be stored, that can be accessed via an SerialIndex. This will be discussed more once we discuss
-a little more about the file system, and SerialIndex. 
-
-Serialization Format
-====================
-
-The serialization format used is 'stream-like' with each 'object' stored in order. Each object is given an index starting from 1.
-0 is used to be in effect nullptr. The stream looks like
-
-```
-SerialInfo::Entry (for index 1)
-Payload for type in entry
-
-SerialInfo::Entry (for index 2)
-Payload for type in entry
-
-... 
-... 
-
-That when writing we have an array that maps each index to a pointer to the associated header. We also have a map that maps native pointers
-to their indices. The Payload *is* the SerialType for thing saved. The payload directly follows the Entry data.
-
-Each object in this list can only be a few types of things
-
-* NodeBase derived type
-* RefObject derived type
-* String
-* Array
-
-The actual Entry followed by the payloads are allocated and stored when writing in a MemoryArena. When we want to write into a stream, we
-can just iterate over each entry in order and write it out.
-
-You may have spotted a problem here - that some Entry types can be stored without alignment (for example a string - which stores the length
-VarInt encoded followed by the characters). Others require an alignment - for example an NodeBase derived type that contains a int64_t will
-*require* 8 byte alignment. That as a feature of the serialization format we want to be able to just map the data into memory, and be able
-to access all the SerialType as is on the CPU. For that to work we *require* that the payload for each entry has the right alignment for
-the associated SerialType.
-
-To achieve this we store in the Entry it's alignment requirement *AND* the next entries alignment. With this when we read, as we as stepping
-through the entries we can find where the next Entry starts. Because the payload comes directly after the Entry - the Entrys size must be
-a modulo of the largest alignment the payload can have.
-
-For the code that does the conversion between native and serial types it uses either the SerialWriter or SerialReader. This provides
-the mechanism to turn a pointer into a serializable ASTSerialIndex and vice versa. There are some special functions for turning string like
-types to and forth.
-
-The final mechanism is that of 'Arrays'. An array allows reading or writing a chunk of data associated with a ASTSerialIndex. The chunk of
-data *must* hold data that is serializable. If the array holds pointers - then the serialized array must hold SerialIndices that
-represent those pointers. When reading back in they are converted back.
-
-Arrays are the escape hatch that allows for more complex types to serialize. Dictionaries for example are saved as a serial type that is
-two SerialIndices one to a keys array and one to a values array.
-
-Note that writing has two phases, serializing out into an SerialWriter, and then secondly writing out to a stream. 
-
-Object/Reference Types
-======================
-
-When talking about Object/Reference types this means types that can be referenced natively as pointers. Currently that means NodeBase and
-some RefObject derived types. 
-
-The SerialTypeInfo mechanism is generally for *fields* of object types. That for derived types we use the C++ extractors
-field list to work out the native fields offsets and types. With this we can then calculate the layout for NodeBase types such that they
-follow the requirements for serialization - such as alignment and so forth.
-
-This information is held in the SerialClasses, which for a given TypeKind/SubType gives a SerialClassInfo, that specifies fields for
-just that type. 
-
-Reading
-=======
-
-Due to the care in writing reading is relatively simple. We can just take the contents of the file and put in memory, as long as in memory
-it has an alignment of at least MAX_ALIGNMENT. Then we can build up an entries table by stepping through the data and writing the pointer.
-
-The toNative functions take an SerialReader - this allows the implementation to ask for pointers and arrays from other parts of the serialized
-data. It also allows for types to be lazily reconstructed if necessary.
-
-Lazy reconstruction may be useful in the future to partially reconstruct a sub part of the serialized data. In the current implementation, lazy
-evaluation is used on Strings. The m_objects array holds all of the recreated native 'objects'. Since the objects can be derived from different
-base classes the associated Entry will describe what it really is.
-
-For the String type, we initially store the object pointer as null. If a string is requested from that index, we see if the object pointer is null,
-if it is we have to construct the StringRepresentation that will be used.
-
-An extra wrinkle is that we allow accessing of a serialized String as a Name or a string or a UnownedSubString. Fortunately a Name just holds a string,
-and a Name remains in scope as long as it's NamePool does which is passed in.
-
-Other Reading issues
-====================
-
-## SourceLoc
-
-SourceLoc present a problem. If we follow the simple mechanism described above, then we require two things
-
-1) That the SourceLoc information is blossomed before anything that defines a SourceLoc
-2) That the structure for accessing SourceLoc information is conveniently available.
-
-This was sidestepped previously because the SourceLoc information was held in a different structure, and a separate Riff section. It was deserialized
-before anything else took place.
-
-That *is* a strategy we could use here. That we could make the SourceLoc information generally serialized. On loading locate it in a Riff section
-deserialize it (perhaps with general serialization), then deserialize the rest using this structure.
-
-## IRModule
-
-In this case we may want to have IRModule serialized in someway unlike the generalized serialization (for example supporting compression). In other
-frameworks this aspect might be handing by 'read/writeReplacing'. Doing so would significantly complicate the simple reading mechanism - because instead
-of just constructing and referencing we would have to care about construction order. That this could perhaps be achieved by having any reference access
-be handled lazily. Note that SourceLoc would still require being handled specially because it requires construction before any SourceLoc is referenced,
-and SourceLocs *aren't* pointers.
-
-## Modified reading
-
-We could modify reading as follows.
-
-1) Don't construct anything at the start
-2) Find 'root's they must be created and deserialized first
-  . Any read/writeReplace is a root
-  . Any marked (like SourceLocData) is a root. (When deconstructed it also needs to add information to the Reader)
-  . The root of the objects (note we could just deserialize first to last if not already constructed)
-3) During deserialization pointer references and constructed on demand
-4) Extra code is needed to make sure there aren't cycles. Any object is either Pre/Created/Deserialized.
-
-For now we might want to just do this with Riff sections for simplicity
-
-Other Issues
-============
-
-A final issue is around the special extra types needed for serializing or deserializing. SourceLoc information (on reading and writing),
-but it could be other types in the future.
-
-We probably don't want to have them as specific types on the SerialReader/SerialWriter, as doing so requires exposing the types to this interface.
-What we really want is a mechanism for the Reader/Writer where it's possible to get a pointer based on some type. We want this to be fairly fast
-because every SourceLoc reference will have to do this lookup.
-
-We could use an enum, and just have an array of pointers on the reader and writer. How that pointer is interpreted is dependent on the Reader/Writer.
-This would be very fast, extendable without making types specific. On debug builds we could do a dynamic cast to make sure it is the expected type. 
-
-Rich Information
-================
-
-Nothing is done here about versioning, patching, backward or forward compatibility.
+docs/design/serialization.md
 */
 
 // Predeclare
@@ -636,7 +407,7 @@ struct SerialField
     static SerialField make(const char* name, T* in);
 
     const char* name;                   ///< The name of the field
-    const SerialFieldType* type;             ///< The type of the field
+    const SerialFieldType* type;        ///< The type of the field
     uint32_t nativeOffset;              ///< Offset to field from base of type
     uint32_t serialOffset;              ///< Offset in serial type
 };
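
Note on the `SerialExtraType` rename: the comment removed from slang-serialize.h proposes using an enum and an array of pointers on the reader and writer, so that helper objects such as the source-location reader can be looked up quickly without exposing their types to the serializer interface. The following is a minimal sketch of that idea, not the actual Slang implementation. `DemoSourceLocReader`, `DemoSourceLocWriter` and `DemoExtraObjects` are hypothetical stand-ins invented for illustration; only the enum values and the `kExtraType` convention come from the diff above.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Mirrors the enum as it stands after this commit's rename.
enum class SerialExtraType
{
    SourceLocReader,
    SourceLocWriter,
    CountOf,
};

// Hypothetical stand-ins (not the real Slang classes). Each one names the
// slot it occupies through a static kExtraType member, as in the diff.
struct DemoSourceLocReader
{
    static const SerialExtraType kExtraType = SerialExtraType::SourceLocReader;
    int viewCount = 0;
};

struct DemoSourceLocWriter
{
    static const SerialExtraType kExtraType = SerialExtraType::SourceLocWriter;
    int sourceCount = 0;
};

// Hypothetical holder: one untyped pointer per SerialExtraType value.
// The holder never needs to know the concrete helper types.
class DemoExtraObjects
{
public:
    // Store a helper object in the slot selected by its kExtraType.
    template <typename T>
    void set(T* obj)
    {
        m_objects[std::size_t(T::kExtraType)] = obj;
    }

    // Fetch the helper for T, or nullptr if none was registered.
    template <typename T>
    T* get() const
    {
        return static_cast<T*>(m_objects[std::size_t(T::kExtraType)]);
    }

private:
    // Value-initialized, so every slot starts out as nullptr.
    std::array<void*, std::size_t(SerialExtraType::CountOf)> m_objects{};
};
```

A lookup is a single array index, which matters because every SourceLoc reference would go through it. The `static_cast` is unchecked here; the removed comment suggests that debug builds could verify the type with a dynamic cast instead.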