slang.git - Making it easier to work with shaders

Age	Commit message (Collapse)	Author
2025-08-07	Initial copy elision pass (#8042)	ArielG-NV
	Fixes #7574 Changes: * Add an initial (fairly simple) optimization pass which is able to eliminate redundant copies. * Our current existing optimizer passes remove redundant load/store very robustly, this pass will focus on other cases of copy elimination * Primary approach is to make all functions which are `in T` and `T` is trivial to copy into a `__constref T`. We then (depending on scenario) manually insert a variable+load if a pass-by-reference is not possible; otherwise we pass by `constref`. * Added optimizations to eliminate redundant code which causes `constref` to fail to compile --------- Co-authored-by: Harsh Aggarwal <haaggarwal@nvidia.com> Co-authored-by: Claude <noreply@anthropic.com> Co-authored-by: slangbot <ellieh+slangbot@nvidia.com> Co-authored-by: slangbot <186143334+slangbot@users.noreply.github.com>
2025-08-07	export fewer things from module targets (#8089)	Ellie Hermaszewska
	Closes https://github.com/shader-slang/slang/issues/7722 without adding SONAME ``` e@light-hope in ~/work/slang on HEAD (68b0125) [nix-shell] [direnv] $ nm -DC --defined-only build/Debug/lib/libslang-llvm.so 0000000001a4b637 T createLLVMDownstreamCompiler_V4 0000000001a427ed T createLLVMFileCheck_V1 0000000001a4b91e W std::type_info::operator==(std::type_info const&) const 0000000001a4bb07 W std::_Sp_make_shared_tag::_S_ti() 0000000001059165 u std::ranges::_Cpo::iter_move 0000000005f92ac0 V typeinfo for std::_Sp_make_shared_tag 0000000005f92938 V vtable for std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2> e@light-hope in ~/work/slang on HEAD (68b0125) [nix-shell] [direnv] $ nm -DC --defined-only build/Debug/lib/libslang-glslang.so 00000000004701a2 T glslang_compile 000000000047012b T glslang_compile_1_1 000000000046ffd9 T glslang_compile_1_2 000000000046db75 T glslang_disassembleSPIRV 000000000046da42 T glslang_disassembleSPIRVWithResult 000000000047028b T glslang_linkSPIRV 000000000046d8fa T glslang_validateSPIRV e@light-hope in ~/work/slang on HEAD (68b0125) [nix-shell] [direnv] $ nm -DC --defined-only build/Debug/lib/libslang-glsl-module.so 0000000000135bf9 T slang_getEmbeddedModule ``` I think that the exports from libslang-llvm are unavoidable I believe, however these are weak exports so should exhibit the same problem. @NBickford-NV does this look good for you?
2025-08-07	Fix atomic fp16 vector SPIRV emit (#8104)	jarcherNV
	Update the SPIRV emit of atomic fp16 vector extension from its previous incorrect name to SPV_NV_shader_atomic_fp16_vector.
2025-08-07	Improve performance of AST deserialization (#7935)	Theresa Foley
	* Improve performance of AST deserialization The primary goal of these changes is to reduce the total time spent in the global session's `loadBuiltinModule()`, which gets called as part of global session creation to load the core module, and thus impacts every invocation of `slangc` and every user of the Slang compiler API. The majority of the time is spent simply deserializing the core module's AST and IR and, of those two, the AST takes significantly longer to load than the IR (in the ballpark of 5x the time). This change is focused on the serialization infrastructure but, given the performance situation described above, the focus is first and foremost on deserialization performance for the Slang AST, when using the fossil format. That focus shows through in the changes that have been implemented. Change serialization framework to use `template` instead of `virtual` ===================================================================== The recently-introduced serialization framework in `slang-serialize.h` was centered around a dynamically-dispatched `ISerializerImpl` interface. As a result, every single invocation of a `serialize(...)` call ultimately went through `virtual` function dispatch. While the overhead of the `virtual` calls themselves does not have a major impact on the total deserialization performance, those calls end up serving as a barrier to further optimization. This change changes operations that used to take a `Serializer const&` (which wraps an `ISerializerImpl`), to instead declare a template parameter `<typename S>` and take an `S const&`. The main consequence of the change is that `serialize()` functions for user-defined types will need to be template functions, and thus either be defined in headers (alongside the type that they serialize) or else in the specific source file that handles serialization (as is currently being done for the AST-related types in `slang-serialize-ast.cpp`). Note that if we later decide that we want the ability to perform serialization through a dynamically-dispatched interface (e.g., to easily toggle between different serialization back-ends), it will be easier to layer a dynamically-dispatched implementation on top of the statically-dispatched `template` version than the other way around. Generous use of `SLANG_FORCE_INLINE` ==================================== In order to unlock further optimizations, a bunch of operations were marked with `SLANG_FORCE_INLINE`. It is important to note that forcing inlining like this is a big hammer, and needs to be approached with at least a little caution. The simplest cases are: trivial wrapper function that just delegate to another function * functions that only have a single call site (but exist to keep abstractions clean) Externalize Scope for `begin`/`end` Operations ============================================== The old `ISerializerImpl` interface had a bunch of paired begin/end operations that define the hierarchical structure of data being read. Most serializer implementations (whether for reading or writing) use these operations to help maintain some kind of internal stack for tracking state in the hierarchy. The overhead of maintaining such a stack with something like a `List<T>` amortizes out over many operations, but even that overhead is unnecessary when the begin/end pairs are already mirroring the call stack of the code invoking serialization. This change modifies the `ScopedSerializerFoo` types so that they each provide a piece of stack-allocated storage to the serializer back-end's `beginFoo()` and `endFoo()` operations. Currently only the `Fossil::SerialReader` is making use of that facility, but the other implementations of readers and writers in the codebase could be adapted if we ever wanted to. Streamline `Fossil::SerialReader` ================================= The most significant performance gains came from changes to the `Fossil::SerialReader` type, aimed at minimizing the cycles spent in the core `_readValPtr()` routine. That function used to have a large-ish `switch` statement that implemented superficially very different reading logic depending on the outer container/object being read from. The new logic pushes more work back on the `begin` and `end` operations (which get invoked far less frequently than simple scalar/pointer values get read), so that they always set up the state of the reader with direct pointers to the data and layout for the next fossilized value to be read. The remaining work in `_readValPtr()` has been factored into a differnt subroutine - `_advanceCursor()` - that takes responsibility for advancing the data pointer, and updating the various other fields. The `_advanceCursor()` routine is still messier than is ideal, because it has to deal with the various different kinds of logic required for navigating to the next value. Various other conditionals inside the `SerialReader` implementation were streamlined, mostly by collapsing the `State::Type` enumeration down to only represent the cases that are truly semantically distinct. Evaluated: Streamline Layout Rules for Fossil ============================================= One potential approach that I implemented but then reverted (after finding it had little to no performance impact) was changing the fossil format to always write things with 4-byte alignment/granularity. That would mean values smaller than 4 bytes would get inflated to a full 4 bytes, and scalar values larger than 4 bytes get written with only 4-byte alignment (requiring unaligned loads to read them). I found that the only way to take advantage of the simplified layout rules to improve read performance would be to more-or-less eliminate the use of the layout information embedded in the fossil data, which would make it very difficult to validate that the data is correctly structured. Possible Future Work: Further Type Specialization ================================================= As it stands, the biggest overhead remaining on the critical path of `_readValPtr()` is the way the `_advanceCursor()` logic needs to take different approaches depending on the type of the surrounding context (advancing through elements of a container is very different than advancing through fields of a `struct`, for example). The interesting thing to note is that at the use site within a `serialize()` function, it is usually manifestly obvious which case something is in. If the code uses `SLANG_SCOPED_SERIALIZER_ARRAY` it is in a container, while if it uses `SLANG_SCOPED_SERIALIZER_STRUCT` it is in a struct. This means that the contextual information is staticaly available, but just isn't exposed in a way that lets the core reading logic take advantage of it. A logical extension of the work here would be to expand on the `Scope` idea added in this change such that most of the serialization operations (`handleInt32`, `handleString`, etc.) are actually dispatched through the scope, and then have each of the `SLANG_SCOPED_SERIALIZER_...` macros instantiate a different scope type (still dependent on the serializer). * fixup * format code * typo --------- Co-authored-by: slangbot <186143334+slangbot@users.noreply.github.com>
2025-08-06	Add Discord server link to Slang internal error messages (#8026)	Copilot
	* Initial plan * Add Discord server link to Slang internal error messages Co-authored-by: aidanfnv <198290069+aidanfnv@users.noreply.github.com> * Format code according to Slang coding standards Co-authored-by: aidanfnv <198290069+aidanfnv@users.noreply.github.com> * Update Discord URL and message format for error diagnostics - Changed Discord URL from discord.gg/slang to khr.io/slangdiscord - Updated message format to "For assistance, file an issue on GitHub (...) or join the Slang Discord (...)" - Applied changes to all internal error diagnostics: unimplemented, unexpected, internalCompilerError, compilationAborted, compilationAbortedDueToException Co-authored-by: aidanfnv <198290069+aidanfnv@users.noreply.github.com> --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: aidanfnv <198290069+aidanfnv@users.noreply.github.com> Co-authored-by: aidanfnv <aidanf@nvidia.com>
2025-08-06	Fix unused space discovery for bindless heap. (#8075)	Yong He

2025-08-06	Enable on-demand deserialization of AST decls (#8095)	Theresa Foley
	Overview -------- This change basically just flips a `#define` switch to enable the changes that were already checked in with PR #7482. That earlier change added the infrastructure required to do on-demand deserialization, but it couldn't be enabled at the time due to problematic interactions with the approach to AST node deduplication that was in place. PR #8072 introduced a new approach to AST node deduplication that eliminates the problematic interaction, and thus unblocks this feature. Impact ------ Let's look at some anecdotal performance numbers, collected on my dev box using a `hello-world.exe` from a Release x64 Windows build. The key performance stats from a build before this change are: ``` [] loadBuiltinModule 1 254.29ms [] checkAllTranslationUnits 1 6.14ms ``` After this change, we see: ``` [] loadBuiltinModule 1 91.75ms [] checkAllTranslationUnits 1 11.40ms ``` This change reduces the time spent in `loadBuiltinModule()` by just over 162ms, and increases the time spent in `checkAllTranslationUnits()` by about 5.25ms (the time spent in other compilation steps seems to be unaffected). Because `loadBuiltinModule()` is the most expensive step for trivial one-and-done compiles like this, reducing its execution time by over 60% is a big gain. For this example, the time spent in `checkAllTranslationUnits()` has almost doubled, due to operations that force AST declarations from the core module to be deserialized. Note, however, that in cases where multiple modules are compiled using the same global session, that extra work should eventually amortize out, because each declaration from the core module can only be demand-loaded once (after which the in-memory version will be used). Because of some unrelated design choices in the compiler, loading of the core module causes approximately 17% of its top-level declarations to be demand-loaded. After compiling the code for the `hello-world` example, approximately 20% of the top-level declarations have been demand-loaded. Further work could be done to reduce the number of core-module declarations that must always be deserialized, potentially reducing the time spent in `loadBuiltinModule()` further. The data above also implies that `loadBuiltinModule()` may include large fixed overheads, which should also be scrutinized further. Relationship to PR #7935 ------------------------ PR #7935, which at this time hasn't yet been merged, implements several optimizations to overall deserialization performance. On a branch with those optimizations in place (but not this change), the corresponding timings are: ``` [] loadBuiltinModule 1 176.62ms [] checkAllTranslationUnits 1 6.04ms ``` It remains to be seen how performance fares when this change and the optimizations in PR #7935 are combined. In principle, the two approaches are orthogonal, each attacking a different aspect of the performance problem. We thus expect the combination of the two to be better than either alone but, of course, testing will be required.
2025-08-06	Fix GetDimensions to use mipLevel for SPIRV (#8065)	Jay Kwak
	* Fix GetDimensions to use mipLevel for SPIRV * format code (#84) --------- Co-authored-by: slangbot <ellieh+slangbot@nvidia.com>
2025-08-06	A new approach to AST node deduplication (#8072)	Theresa Foley
	* A new approach to AST node deduplication Background ---------- The Slang compiler currently relies on the idea that AST nodes derived from `Val` are always deduplicated based on their opcode and operands. Deduplication requires caching, and we thus have to determine the right scope at which to allocate and cache different `Val`s. We need to ensure that `Val`s that refer to declarations/nodes in a specific compilation session/linkage are not cached in a scope that will outlive that session/linkage (or else the cache will contain garbage pointers). Conversely, we also need to ensure that any `Val`s that are referred to by something like a builtin module's AST will remain alive at least as long as that module/AST, and that every compilation session/linkage that refers to that module will find that `Val` cached if they try to create an equivalent one. The existing approach to deduplication has some subtleties: * A builtin module like the core module needs to be fully loaded (using the `ASTBuilder` for the global session) before any other compilation session might construct `Val`s referring to nodes in the AST of that module. * The various declarations/types/values cached on the `SharedASTBuilder` need to be created before any user-defined compilation occurs, to ensure that all of the relevant `Val`s are created on the global session's AST builder (see `Session::finalizeSharedASTBuilder`). * Related to all of the above: the deduplication cache on a non-top-level `ASTBuilder` is initialized by copying the cache from the top-level (global session) `ASTBuilder` on creation. Any subsequent allocations on the global-session `ASTBuilder` will not be visible to the linkage-level `ASTBuilder`. This led to weird things like the options parsing logic for `-load-core-module` reaching in and performing a manual copy of the cache from the global session to a linkage to try to restore the invariants. * Notably, the deduplication logic doesn't actually care if two different linkages create distinct `Val`s that represent the same thing, so long as nothing in the AST of a builtin module will refer to that thing. E.g., if the builtin modules never mention `Ptr<String>`, then two different linkages that both refer to that type will likely construct distinct `Val`s to represent it. Thus the deduplication is not as complete as some might assume. The subtle ordering considerations when creating `Val`s related to builtin modules has proved to be a sticking point for implementing on-demand deserialization of the builtin modules (in order to improve startup times). Overview -------- This change implements a new approach that hopefully makes it easier to ensure correctness, even in the case where AST nodes for builtin modules and `Val`s that refer to those nodes sometimes get created after other compilation has been performed. The key points of the new approach are: * `ASTBuilder`s are now explicitly (rather than just implicitly) linked into a hierarchy. Currently the compiler codebase will only create a two-level hierarchy: the `ASTBuilder` for a global session is the parent to all of the `ASTBuilder`s created for `Linkage`s. The expectation is that the code can and will generalize to more levels of nesting. * When a request is made to create a `Val` (or find it in a cache), there is a single `ASTBuilder` that is determined to be responsible for owning that `Val` for its lifetime. The logic involved is comparable to the way that the `IRBuilder` decides where to insert a "hoistable" (deduplicated) instruction, with the simplification that we are dealing with a tree instead of a CFG. Details ------- * `Session::finalizeASTBuilder()` is gone, because it is no longer needed * The logic in the `ASTBuilder` constructor and in `slang-options.cpp` that was manually copying the `m_cachedNodes` from the global-session `ASTBuilder` over to a linkage's `ASTBuilder` is removed, since it is no longer needed. * Made every AST node (`NodeBase`) carry a pointer to the `ASTBuilder` that created it. This is wasteful, but makes it easier to be sure the implementation will work. * Introduced a class `RootASTBuilder`, derived from `ASTBuilder` to represent the root of a given hierarchy of AST builders. * Every non-root `ASTBuilder` is now constructed with a pointer to its parent builder. * Changed it so that instead of allocating a `SharedASTBuilder` and then passing it in to create one or more `ASTBuilder`s, the shared AST builder state is more of an implementation detail of the `ASTBuilder` type, and is automatically allocated behind the scenes as part of creating an `ASTBuilder`. * The inline (defined in header) `ASTBuilder::_getOrCreateImpl()` now just does a first-pass check for an existing cached `Val` and, if it doesn't find one, delegates to `ASTBuilder::_getOrCreateImplSlowPath()`, which encapsulates the logic for the cache-miss case (even if that logic is just two lines). * The `ASTBuilder::_findAppropriateASTBuilderForVal()` method inspects a `ValNodeDesc` to determine the "deepest" of the AST builders among its operands (falling back to the root AST builder if there are no relevant operands). * The `ASTBuilder::_getOrCreateValDirectly()` is intended for use when the correct AST builder to use for caching/allocation has already been identified. * Moved the caching of generic arguments for "default substitutions" out of `ASTBuilder` and onto `GenericDecl` itself. Note that the naming of the old field (`m_cachedGenericDefaultArgs`) was unclear about the fact that this is related to "default substitutions" (where each generic parameter is fed an argument that refers to the parameter itself), and has nothing to do with any default argument values that might be set on the generic parameters. * Changed the global session (`Session`) so that instead of storing pointers to both the root/builtin `ASTBuilder` and the corresponding `SharedASTBuilder`, it just retains a pointer to the root `ASTBuilder`. This is consistent with the move toward making the `SharedASTBuilder` just an implementation detail. Possible Future Work -------------------- * In theory this change should unblock on-demand AST deserialization, so it would be good to get back to those changes once this new approach lands. * Many AST node subclasses end up storing their own `ASTBuilder` pointers, which are likely now redundant with the one in `NodeBase`. These should be eliminated where possible. * A lot of code for AST-related manipulations has been changed over time to require an `ASTBuilder` to be passed in, but at this point such a builder should always be derive-able so long as the operation has at least one non-null AST node available to it. * Most of the accessors that are currently on `SharedASTBuilder` can/should be migrated to just be on `ASTBuilder`, where they can use the data from the `SharedASTBuilder` as part of their implementation. Ideally most code should only nee to interface with `ASTBuilder`s directly, and will never need to talk to the `SharedASTBuilder`. * This change cleaned up some of the ownership hierarchy, and made it so that the global session only retains a pointer to the root AST builder (and not the shared state as well). A logical follow-on for that would be to make the `Linkage` more properly own (and thus allocate) its own AST builder (and other related objects), and have the global session only store a pointer to the root linkage (and not point to the various sub-objects directly). * The `Linkage` and `SourceManager` types have their own form of parent/child hierarchy (restricted to two levels), and could be generalized in a way that is similar to what this change does for `ASTBuilder`. Support for a hierarchy of `Linkage`s could be a powerful tool for end users, if we expose it in the right way. * format code --------- Co-authored-by: slangbot <186143334+slangbot@users.noreply.github.com>
2025-08-06	Fix noperspective modifier for SV_Barycentrics in SPIRV and GLSL (#8067)	davli-nv
	* Fix noperspective modifier for SV_Barycentrics in SPIRV and GLSL - Added test case with both regular and noperspective SV_Barycentrics inputs 🤖 Generated with [Claude Code](https://claude.ai/code) Co-authored-by: davli-nv <davli-nv@users.noreply.github.com> * fixup format * address review https://github.com/shader-slang/slang/pull/8067#pullrequestreview-3090037501 * address review https://github.com/shader-slang/slang/pull/8067#discussion_r2255818595 * add test case from review --------- Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: davli-nv <davli-nv@users.noreply.github.com>
2025-08-06	Add reflection api for overload candidate filtering. (#8066)	Yong He
	* Add reflection api for overload candidate filtering. * Fix API. * Fix. * Update build. * Update test. * Update formatting.
2025-08-05	Implement SPV_EXT_fragment_invocation_density (SPV_NV_shading_rate) (#8037)	davli-nv
	* Implement SPV_EXT_fragment_invocation_density -Adds semantics SV_FragSize and SV_FragInvocationCount and implements them for SPIRV and GLSL using the appropriate target builtins from extensions. -Adds test case checking for expected target builtins from these semantics. -For future work, could implement SV_FragSize using pixel shader input SV_ShadingRate for HLSL, and SV_FragInvocationCount needs research. Fixes #7974 Generated with Claude Code * address review feedback https://github.com/shader-slang/slang/pull/8037#pullrequestreview-3084645845 * fixup format * review feedback https://github.com/shader-slang/slang/pull/8037#pullrequestreview-3086442819
2025-08-05	Add missing `ASTBuilder` methods for int types (#8056)	Sam Estep

2025-08-05	Fix #pragma warning not working with multifile modules (#7942)	Copilot
	* Initial plan * Fix pragma warning not working with multifile modules - Check if DiagnosticSink already has a WarningStateTracker before creating new one - This preserves pragma warning state across __include'd files - Add regression tests for multifile pragma warnings Co-authored-by: csyonghe <2652293+csyonghe@users.noreply.github.com> * Add additional test cases for nested pragma warnings - Test nested __include scenarios with pragma warning directives - Verify pragma warnings work correctly with multiple levels of includes Co-authored-by: csyonghe <2652293+csyonghe@users.noreply.github.com> --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: csyonghe <2652293+csyonghe@users.noreply.github.com> Co-authored-by: Yong He <yonghe@outlook.com>
2025-08-04	Omit "Repro" category from default help text output (#8032)	aidanfnv
	* Omit "Repro" category from default help text output * format code (#24) Co-authored-by: slangbot <186143334+slangbot@users.noreply.github.com> --------- Co-authored-by: slangbot <ellieh+slangbot@nvidia.com> Co-authored-by: slangbot <186143334+slangbot@users.noreply.github.com>
2025-08-04	Add support for pointer literals in metal (#8040)	jarcherNV
	Add support for kIROp_PtrLit types in metal and add a test for null pointer values, which is the only valid value.
2025-08-03	fix overload in extension issue (#7999)	kaizhangNV
	close #7931. For a generic callable, we have two passes overload resolution, in first pass, we will resolve the generic by only checking the generic parameters, while in the second pass, we will resolve the function signature to resolve the overload. But in our candidate comparison logic, we pick a preferred generic even two generics are equally good. However, we should not make this decision in the first pass, because we don't know about the function arguments in this pass yet. So we just return OverloadEpxr2 in this case, and let the function overload resolution to break the tie.
2025-08-01	Drain sink when single-argument constructor call fail (#7883)	ArielG-NV
	* fix bug * fix test * push test changs for clarity * fix bug * fix test * push test changs for clarity * test what fails * remove redundant code
2025-08-01	Fix 7441: CUDA boolean vector layout to use 1-byte elements (#7862)	Harsh Aggarwal (NVIDIA)
	* Fix 7441: CUDA boolean vector layout to use 1-byte elements Boolean vectors (bool1, bool2, bool3, bool4) were incorrectly implemented as integer-based types using 4 bytes per element instead of actual 1-byte boolean elements on CUDA targets. Changes: - Update CUDA prelude to define boolean vectors as structs with bool fields instead of typedef aliases to integer vectors - Implement CUDALayoutRulesImpl::GetVectorLayout to use 1-byte alignment for boolean vectors, matching actual CUDA memory layout behavior - Update make_bool functions to populate struct fields correctly This ensures boolean vectors have the same memory layout as bool[4] arrays: - bool1: 1 byte (was 4 bytes) - bool2: 2 bytes (was 8 bytes) - bool3: 3 bytes (was 12 bytes) - bool4: 4 bytes (was 16 bytes) Fixes memory layout mismatch between Slang reflection API and actual CUDA compilation, achieving 75% memory savings for boolean vector usage. * Fix CI issues - Add and update associated functions and operators * Make boolX same as uchar * Use align construct on struct for boolX * Improve Test case for robust alignment checks * Formatting * Disable selected slangpy tests * add metal check which is slightly different than cuda * Test-1 * Test-2 * Test-3 * Test-4 * ReflectionChange * cleanup and update * _slang_select with plain bool is needed for reverse-loop-checkpoint-test
2025-08-01	Omit "Invalid" capability from slangc -h output (#8020)	aidanfnv
	* Omit "Invalid" capability from slangc -h output * format code (#23) Co-authored-by: slangbot <186143334+slangbot@users.noreply.github.com> --------- Co-authored-by: slangbot <ellieh+slangbot@nvidia.com> Co-authored-by: slangbot <186143334+slangbot@users.noreply.github.com>
2025-08-01	Omit listing values in slangc -h cmdline output, show how to list them ↵	aidanfnv
	seperately (#8012) * Omit listing values in slangc -h cmdline output, show how to list them seperately * format code (#22) Co-authored-by: slangbot <186143334+slangbot@users.noreply.github.com> --------- Co-authored-by: slangbot <ellieh+slangbot@nvidia.com> Co-authored-by: slangbot <186143334+slangbot@users.noreply.github.com>
2025-08-01	Omit "Internal" category from default help text output (#8013)	aidanfnv

2025-07-31	Fix segmentation fault in ray tracing parameter consolidation. (#7997)	Copilot
	* Initial plan * Fix segfault in ray tracing parameter consolidation Co-authored-by: csyonghe <2652293+csyonghe@users.noreply.github.com> * Fix. * Fix. * Keep entrypoint param layout consistent during `MoveEntryPointUniformParametersToGlobalScope`. * Fix. * fix. * Fix. * Fix pending layout handling. --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: csyonghe <2652293+csyonghe@users.noreply.github.com> Co-authored-by: Yong He <yonghe@outlook.com>
2025-07-31	Add matrix select intrinsic (#7566)	venkataram-nv
	* Add matrix select intrinsic * Fix hlsl test * Restrict matrix select to HLSL * Better test for HLSL side * Select route for GLSL/SPIRV * Exclude matrices from select legalization * Exclude CUDA from select test * Inline and move * format code --------- Co-authored-by: slangbot <186143334+slangbot@users.noreply.github.com>
2025-07-31	Handle debug-layer messages in a separate channel (#7988)	Jay Kwak
	* Handle debug-layer messages in a separate channel The Problem (Issue #7343) The issue was that Vulkan Validation Layer error messages were being mixed into regular test output, causing potential false positives or negatives. When using -enable-debug-layers true, validation messages would appear in the same output stream as test results, potentially matching //CHECK: patterns incorrectly. Example Problem: - Test expects: //CHECK: 1 - Validation layer prints: VALIDATION ERROR: 1 invalid buffer binding - Test incorrectly matches the "1" in the error message instead of the actual output Slang Test Communication Architecture Execution Modes Slang has 3 different execution modes controlled by SpawnType: enum class SpawnType { UseSharedLibrary, // In-process execution UseTestServer, // Out-of-process via persistent server UseFullyIsolatedTestServer, // Out-of-process via isolated server UseExe // Direct executable spawn } 1. In-Process Mode (UseSharedLibrary) ┌─────────────────────────────────────────┐ │ slang-test │ │ ┌─────────────┐ ┌─────────────────┐ │ │ │render-test │ │gfx-unit-test │ │ │ │unit-test │ │slangc library │ │ │ └─────────────┘ └─────────────────┘ │ │ │ │ │ │ └───► StdWriters ◄──┘ │ │ (shared) │ └─────────────────────────────────────────┘ - Communication: Direct function calls, shared memory - Debug callbacks: Single callback instance in StdWriters - Used when: Default mode for most tests 2. Out-of-Process Mode (UseTestServer) ┌──────────────────┐ JSON-RPC ┌──────────────────┐ │ slang-test │◄──over pipes────┤ test-server.exe │ │ │ │ │ │ ┌─────────────┐ │ │ ┌─────────────┐ │ │ │StdWriters │ │ │ │render-test │ │ │ │+debug │ │ │ │gfx-unit-test│ │ │ │callback │ │ │ │+debug │ │ │ └─────────────┘ │ │ │callback │ │ └──────────────────┘ │ └─────────────┘ │ └──────────────────┘ - Communication: JSON-RPC over stdin/stdout pipes - Debug callbacks: Separate instances in each process - Used when: CI/CD, multi-threaded testing, crash isolation 3. Direct Executable Mode (UseExe) ┌──────────────────┐ pipes ┌──────────────────┐ │ slang-test │◄───────────────┤ slangc.exe │ │ │ │ other tools │ └──────────────────┘ └──────────────────┘ - Communication: Standard process pipes (stdout/stderr) - Debug callbacks: None (external executables) - Used when: Testing external tools Communication Mechanisms Deep Dive JSON-RPC Protocol Over Pipes The test-server.exe communicates with slang-test using JSON-RPC over stdin/stdout pipes: // Parent process (slang-test) creates child with pipes Process* testServerProcess = /* spawn test-server.exe /; // JSONRPCConnection wraps the pipe communication JSONRPCConnection connection; connection.initWithStdStreams(); // Uses stdin/stdout pipes // Send RPC call TestServerProtocol::ExecutionResult result; connection.sendCall("executeTool", &args, &result); Key Point: The pipes carry structured JSON messages, not raw stdout/stderr. This is what enables clean separation of different data channels. Protocol Structure Your changes extend the ExecutionResult protocol: struct ExecutionResult { String stdOut; // Regular program output String stdError; // Error messages String debugLayer; // NEW: Debug/validation messages int32_t result; int32_t returnCode; }; Your Debug Layer Solution The Challenge Memory pointers cannot cross process boundaries. A debugCallback pointer in the parent process is meaningless in the child process. The Solution: String-Based Serialization You solved this by using string capture and serialization: 1. Debug Callback Interface (slang-std-writers.h) class IDebugCallback { virtual void handleMessage( DebugMessageType type, DebugMessageSource source, const char message) = 0; }; 2. String-Capturing Implementation (slang-support.h) class CoreDebugCallback : public Slang::IDebugCallback { StringBuilder m_buf; // Captures messages as strings void handleMessage(DebugMessageType type, DebugMessageSource source, const char* message) { if (type == DebugMessageType::Error) { m_buf << message << '\n'; // Serialize to string } } String getString() { return m_buf.toString(); } // Extract accumulated messages }; 3. Bridge Between RHI and Core (slang-support.h) class CoreToRHIDebugBridge : public rhi::IDebugCallback { Slang::IDebugCallback* m_coreCallback; void handleMessage(rhi::DebugMessageType type, rhi::DebugMessageSource source, const char* message) { // Convert RHI types to core types and forward m_coreCallback->handleMessage(convertType(type), convertSource(source), message); } }; Data Flow: Debug Messages End-to-End In-Process Mode Flow GPU Driver → RHI Debug Callback → Core Debug Callback → String Buffer → Test Output Out-of-Process Mode Flow Child Process: GPU Driver → RHI Debug Callback → Core Debug Callback → String Buffer ↓ Parent Process: JSON-RPC Serialization Test Output ← String Processing ← ExecutionResult.debugLayer ←┘ Step-by-Step Example 1. Test Execution Starts // In test-server process CoreDebugCallback debugCallback; CoreToRHIDebugBridge bridge; bridge.setCoreCallback(&debugCallback); // Set up graphics device with debug layers deviceDesc.debugCallback = &bridge; 2. Graphics API Call Triggers Validation Error // Inside Vulkan driver (external code) // Validation layer detects error and calls our callback bridge.handleMessage(RHI_ERROR, RHI_LAYER, "Invalid buffer binding"); 3. Message Capture // In CoreDebugCallback::handleMessage m_buf << "Invalid buffer binding\n"; // Stored in string buffer 4. Test Completion & Serialization // Back in test-server TestServerProtocol::ExecutionResult result; result.debugLayer = debugCallback.getString(); // "Invalid buffer binding\n" result.stdOut = "1"; // Regular test output // Send via JSON-RPC connection.sendResult(&result); 5. Parent Process Receives & Separates Output // In slang-test process String output = buildTestOutput(result); // Results in clean separation: // standard output = { // 1 // } // debug layer = { // Invalid buffer binding // } Why This Solution Works 1. Process Isolation: Each process has its own callback objects, no shared pointers 2. String Serialization: Debug messages converted to strings that can cross process boundaries 3. Protocol Extension: Uses existing JSON-RPC infrastructure, just adds new field 4. Clean Separation: Debug messages never mix with stdout/stderr 5. Backward Compatibility: Existing tests unaffected, debug layer field optional Key Benefits - Eliminates False Positives: Debug messages can't interfere with //CHECK: patterns - Better Debugging: Debug messages clearly separated and labeled - Robust Architecture: Works across all execution modes - Minimal Changes: Leverages existing communication infrastructure This elegant solution transforms a fundamental cross-process communication challenge into a simple string serialization problem, using the existing test server architecture to cleanly separate validation layer messages from test results. * Cover the missing slang-test execution path * format code (#82) Co-authored-by: slangbot <186143334+slangbot@users.noreply.github.com> --------- Co-authored-by: slangbot <ellieh+slangbot@nvidia.com> Co-authored-by: slangbot <186143334+slangbot@users.noreply.github.com>
2025-07-31	Fix bug in ci test (#8005)	Jay Kwak
	This commit fixes two problems. 1. uninitialized file handle for lock-file test 2. uninitialized static variable for lock-file test The first bug is more of speculartive rather than actual bug. The second bug was causing heap corruption when it was retried, because the counter was not reset to zero on "retry" and it wrote data to an invalida range in an array.
2025-07-31	msvc style bitfield packing (#7963)	Ellie Hermaszewska
	Closes https://github.com/shader-slang/slang/issues/3646 New tests rather than just adding another TEST line to existing tests so that we get the msvc- prefix in the output of slang-test
2025-07-30	disallow `static const` variables without default-value (#7993)	ArielG-NV
	* Fix static const variables without initializers causing internal errors Add validation in SemanticsDeclHeaderVisitor::checkVarDeclCommon to detect static const variables without initializers and emit proper error diagnostics instead of allowing internal errors to escape during SPIR-V generation. - Add new diagnostic (ID 31225) for static const variables without initializers - Skip validation for extern static const variables - Skip validation for interface member variables - Add comprehensive test case covering various scenarios Fixes #7989 Co-authored-by: ArielG-NV <ArielG-NV@users.noreply.github.com> * clean up test and implementation * format code (#7994) Co-authored-by: slangbot <186143334+slangbot@users.noreply.github.com> --------- Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: ArielG-NV <ArielG-NV@users.noreply.github.com> Co-authored-by: slangbot <ellieh+slangbot@nvidia.com> Co-authored-by: slangbot <186143334+slangbot@users.noreply.github.com>
2025-07-30	[Language Server]: Don't eagerly check file upon open doc. (#7995)	Yong He

2025-07-30	Lowering unsupported matrix types for GLSL/WGSL/Metal targets (#7936)	venkataram-nv
	* Add emit cases for WGSL and GLSL * Fix compilation warnings Modify short cutting test to reflect change in emit logic Lower matrix for metal as well Add emit matrix logic for metal Fix compiler warning Brace initializer for lowered matrices Fix compiler warnings * Tests for metal * Fix mult, any, and determinant * Fix matrix-matrix multiplication * Fix mat mul to be element-wise * Fix compiler warning * Move makeMatrix to legalization * Move unary and binary arithmetic operator lowering to legalization * Remove emit logic and move final comparison operators to legalization * Handle vector/matrix negation for WGSL * Restore older SPIR-V emit logic * Address PR comments * Revert to zero minus for negation * format code --------- Co-authored-by: slangbot <186143334+slangbot@users.noreply.github.com>
2025-07-29	Fix ICE when immutable value is passed to a bwd_diff function. (#7973)	Yong He

2025-07-29	Fix Metal invalid as_type cast for 64-bit RWByteAddressBuffer.Store values ↵	Gangzheng Tong
	(#7843) * Fix 64-bit val lowering for metal * Add ByteAddressBuffer load/store 64-bit tests * Handle Store/Load ptr types * Use bitcast for non-pointer typers * format code (#7966) Co-authored-by: slangbot <186143334+slangbot@users.noreply.github.com> --------- Co-authored-by: slangbot <ellieh+slangbot@nvidia.com> Co-authored-by: slangbot <186143334+slangbot@users.noreply.github.com>
2025-07-29	Fix slang-no-embedded-core-module-source embedding core module (#7885)	Julius Ikkala

2025-07-29	Invoke `clang++` on C++ code instead of `clang` (#7958)	Sam Estep

2025-07-29	[Language Server]: Show signature help on generic parameters. (#7913)	Yong He
	* Show signature help on generic parameters. * Fix. * Update tests. * slang-test: make vvl error go through stderr. * update slang-rhi * Update slang-rhi
2025-07-29	Detect uses of uninitialized resource fields (#7962)	Ellie Hermaszewska
	Closes https://github.com/shader-slang/slang/issues/3386
2025-07-29	Improve diagnostics over ambiguous references. (#7930)	Yong He
	* Improve diagnostics over ambiguous references. * Fix. * Remove files. * Fix some optix hitobject intrinsics. * Fix some hitobject intrinsics for optix. * Fix. * update rhi * revert slang-rhi * Update slang-rhi
2025-07-29	Fix CUDA backend missing U32_firstbitlow implementation (#7921)	Copilot
	* Initial plan * Add U32_firstbitlow implementation for CUDA and CPP backends Co-authored-by: bmillsNV <163073245+bmillsNV@users.noreply.github.com> * Add I32_firstbitlow and comprehensive testing for signed/unsigned firstbitlow Co-authored-by: bmillsNV <163073245+bmillsNV@users.noreply.github.com> * Convert firstbitlow test to use inline filecheck syntax Co-authored-by: ArielG-NV <159081215+ArielG-NV@users.noreply.github.com> * Add U32_firstbithigh and I32_firstbithigh implementations for CUDA and CPP backends Co-authored-by: csyonghe <2652293+csyonghe@users.noreply.github.com> * Update prelude/slang-cpp-scalar-intrinsics.h * Update prelude/slang-cpp-scalar-intrinsics.h * Update prelude/slang-cpp-scalar-intrinsics.h * Refactor Metal bit intrinsics to handle zero case correctly Co-authored-by: ArielG-NV <159081215+ArielG-NV@users.noreply.github.com> * Update slang-cuda-prelude.h remove fake links * Update hlsl.meta.slang * if -1, return -1 due to implicit hlsl rule * -1 or 0 is ~0u as per hlsl implictly * 0 or -1 as per hlsl * fix the math to map to hlsl * fix compile error * forgot `31 - clz` * format code (#7943) Co-authored-by: slangbot <186143334+slangbot@users.noreply.github.com> * Update source/slang/hlsl.meta.slang * Update source/slang/hlsl.meta.slang * Update source/slang/hlsl.meta.slang * Update source/slang/hlsl.meta.slang --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: bmillsNV <163073245+bmillsNV@users.noreply.github.com> Co-authored-by: ArielG-NV <159081215+ArielG-NV@users.noreply.github.com> Co-authored-by: csyonghe <2652293+csyonghe@users.noreply.github.com> Co-authored-by: ArielG-NV <aglasroth@nvidia.com> Co-authored-by: slangbot <ellieh+slangbot@nvidia.com> Co-authored-by: slangbot <186143334+slangbot@users.noreply.github.com>
2025-07-28	Fix issue in multi-level break elimination by handling multi-level continue ↵	Sai Praveen Bangaru
	statements (#7953)
2025-07-28	Update slang-rhi to fix CI error on debug. (#7946)	Yong He
	* Update slang-rhi * Fix profile lookup on `nullptr`. * update rhi
2025-07-25	Fix mesh shader reflection JSON output (#7868)	pdeayton-nv
	* Fix mesh shader reflection JSON output Fixes issue #7736 where mesh shaders were showing stage "UNKNOWN" and missing type information in reflection JSON output. Changes: - Add missing SLANG_STAGE_MESH and SLANG_STAGE_AMPLIFICATION cases to shader stage switch statements in emitReflectionVarBindingInfoJSON and emitReflectionEntryPointJSON - Add missing MeshOutput to TypeReflection::Kind enum in slang.h - Add missing type kind cases (OutputStream, MeshOutput, Specialized, None) to emitReflectionTypeInfoJSON Testing: - Simple mesh shader now correctly shows "stage": "mesh" instead of "UNKNOWN" - Complex mesh shader with parameters shows proper stage and input type information - Output parameters show "kind": "None" instead of crashing with assertion 🤖 Generated with [Claude Code](https://claude.ai/code) Co-authored-by: pdeayton-nv <pdeayton-nv@users.noreply.github.com> * Remove test files from mesh shader reflection fix As requested, removed test files since there is no testing infrastructure for reflection JSON output. Focus is now only on the core mesh and amplification shader reflection fixes. Co-authored-by: pdeayton-nv <pdeayton-nv@users.noreply.github.com> * Address review comments: move MeshOutput enum to end and restore assert - Move MeshOutput enum member to end of TypeReflection::Kind enum for backward compatibility - Replace fprintf with SLANG_ASSERT for unhandled type kinds Co-authored-by: pdeayton-nv <pdeayton-nv@users.noreply.github.com> --------- Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: pdeayton-nv <pdeayton-nv@users.noreply.github.com> Co-authored-by: Yong He <yonghe@outlook.com>
2025-07-25	Fix nv_viewport_mask semantic to emit SpvBuiltInViewportMaskNV (#7904)	davli-nv
	Fixes #7903 SPV_NV_viewport_array2 says ViewportMaskNV corresponds to gl_ViewportMask
2025-07-25	Fix for Generic Function Redefinition Error (#7891)	Gangzheng Tong
	* emit literal values in getTypeNameHint for bool, str etc. * add test for specializing generics with bool literals * fix build error * add specializing with Enum type test
2025-07-25	Add combined texture-sampler flag to reflection API to differentiate ↵	Copilot
	Texture2D from Sampler2D (#7901) * Initial plan * Add SLANG_TEXTURE_COMBINED_FLAG to differentiate combined texture-samplers Co-authored-by: csyonghe <2652293+csyonghe@users.noreply.github.com> * Fix regression in hlsl-to-vulkan-combined test by updating expected output Co-authored-by: csyonghe <2652293+csyonghe@users.noreply.github.com> --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: csyonghe <2652293+csyonghe@users.noreply.github.com>
2025-07-25	Fix compiler crash when enum is used as vertex output data (#7915)	Copilot
	* Initial plan * Initial investigation and plan for enum vertex output fix Co-authored-by: csyonghe <2652293+csyonghe@users.noreply.github.com> * Fix compiler crash when enum is used as vertex output data Co-authored-by: csyonghe <2652293+csyonghe@users.noreply.github.com> * Fix compiler crash when enum is used as vertex output data Co-authored-by: csyonghe <2652293+csyonghe@users.noreply.github.com> * Address reviewer feedback: use SLANG_ASSERT and improve CHECK directives Co-authored-by: csyonghe <2652293+csyonghe@users.noreply.github.com> --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: csyonghe <2652293+csyonghe@users.noreply.github.com>
2025-07-25	Fix SPIRV OpMemberName member indices (#7912)	davli-nv
	OpMemberName instructions were all using member index 0 instead of incrementing indices (0, 1, 2, ...) as required by the SPIR-V spec. The bug was that the member index increment (id++) was only happening for physical struct types, but OpMemberName emission occurs for all struct types when they have name decorations. This fix ensures member indices are properly incremented for both physical and non-physical struct types. Fixes #7909 Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: davli-nv <davli-nv@users.noreply.github.com>
2025-07-25	Fix metallib-asm target parsing by adding missing comma (#7902)	Jay Kwak
	* Fix metallib-asm target parsing by adding missing comma The metallib-asm target was not being recognized from command line because of a syntax error in the target string definition in slang-type-text-util.cpp. A missing comma after the first "metallib-asm" string caused C++ string literal concatenation to create a malformed string. Resolves #7774 Co-authored-by: Jay Kwak <jkwak-work@users.noreply.github.com> * format code (#7910) Co-authored-by: slangbot <186143334+slangbot@users.noreply.github.com> * regenerate command line reference (#7911) Co-authored-by: slangbot <186143334+slangbot@users.noreply.github.com> --------- Co-authored-by: Claude Code <claude@anthropic.com> Co-authored-by: Jay Kwak <jkwak-work@users.noreply.github.com> Co-authored-by: slangbot <ellieh+slangbot@nvidia.com> Co-authored-by: slangbot <186143334+slangbot@users.noreply.github.com>
2025-07-24	Avoid early specialization for witness tables in SimplifyIR (#7636)	Jay Kwak
	* Avoid early specialization for witness tables in SimplifyIR Prevents SimplifyIR from prematurely specializing witness tables before the main specialization pass. Witness tables are hoistable immutable objects that must maintain consistent signatures to avoid incorrect deduplication. SimplifyIR was incorrectly transforming expressions like "witness_table_t(%IFoo)(specialize(%7, %GenericValue4))" into "witness_table_t(%IFoo)(%Foo)" even when %GenericValue4 was unused. Fixes #7233 * Add a missing test file
2025-07-24	Organize code better by splitting some big files (#7890)	Theresa Foley
	* Organize code better by splitting some big files The basic change here is that the majority of the declarations in `slang-compiler.h` have been split out into a set of smaller and more focused files. As a result, the implement of those declarations have been moved from `slang-compiler.cpp` and `slang.cpp` over to those new files when the proper home for code is obvious. I have tried as much as possible to not make any edits to the code along the way, and just copy-paste declarations from one place to another as-is. The exceptions I am aware of are: * In some cases a function that used to be file-scope `static` was used by code that landed in two or more different `.cpp` files. In these cases, I changed the function to be non-`static` (removing the `_` prefix from its name, if it had one, per our naming conventions), and put a declaration for the function into the most appropriate header I could identify. * I added a few comments in places where I saw ugly or unfortunate things in the code I was moving, and wanted to tag them with `TODO`s so we can hopefully get to them in the fullness of time. * I added top-level comments to each of the new `.h` files that was introduced to try to explain the logic for what goes into that file. * In cases where one of the new header files mostly existed to declare a single type, I sometimes added more detail to the doc comment on that type, to better explain the type and its role in the compiler (this is text that otherwise might have gone into the comment at the top leve lof the file, but I figured that the doc comment would have higher discoverability). I expect that the most contentious choice here is that the `Session` class lands in `slang-global-session.h` while `slang-session.h` holds the `Linkage` class. The names used in this change are consistent with how the relevant concepts in the public Slang API are named, and are consistent with how we intend to rename the classes themselves in time. * format code * fixup --------- Co-authored-by: slangbot <186143334+slangbot@users.noreply.github.com>
2025-07-24	Fix confusing error messages for interface return type mismatches (#7854)	Copilot
	* Initial plan * Add improved diagnostic for interface return type mismatches Co-authored-by: csyonghe <2652293+csyonghe@users.noreply.github.com> * Complete fix for interface return type mismatch error reporting Co-authored-by: csyonghe <2652293+csyonghe@users.noreply.github.com> * Move diagnostic to synthesis phase for better interface return type mismatch errors Co-authored-by: csyonghe <2652293+csyonghe@users.noreply.github.com> * Remove extraneous test file and update .gitignore Co-authored-by: csyonghe <2652293+csyonghe@users.noreply.github.com> * Add diagnostic test for interface return type mismatch and apply formatting Co-authored-by: csyonghe <2652293+csyonghe@users.noreply.github.com> * Address feedback: restore whitespace and use filecheck for diagnostic test Co-authored-by: csyonghe <2652293+csyonghe@users.noreply.github.com> * Fix logic error in return type mismatch detection Co-authored-by: csyonghe <2652293+csyonghe@users.noreply.github.com> * Remove unnecessary flag by using out parameter for diagnostic tracking Co-authored-by: csyonghe <2652293+csyonghe@users.noreply.github.com> * Refactor witness synthesis failure reporting to use structured approach Replace ad-hoc `outSpecificDiagnosticEmitted` parameter with `WitnessSynthesisFailureReason` enum and `MethodWitnessSynthesisFailureDetails` struct as requested in code review. This provides: - Clear taxonomy of failure reasons (General, MethodResultTypeMismatch, MethodParameterMismatch) - Centralized diagnostic emission in findWitnessForInterfaceRequirement - Better extensibility for future failure types - Improved maintainability by removing state tracking flags The return type mismatch diagnostic continues to work correctly, showing error 38106 with precise location information. Co-authored-by: csyonghe <2652293+csyonghe@users.noreply.github.com> * Remove unused MethodParameterMismatch enum and duplicate code Co-authored-by: csyonghe <2652293+csyonghe@users.noreply.github.com> * Remove redundant requiredMethod field from MethodWitnessSynthesisFailureDetails Co-authored-by: csyonghe <2652293+csyonghe@users.noreply.github.com> * Address feedback: add outFailureDetails guard and remove unnecessary hasReturnTypeError variable Co-authored-by: csyonghe <2652293+csyonghe@users.noreply.github.com> * Fix regression: restore original diagnostic message for mutating method mismatch Co-authored-by: csyonghe <2652293+csyonghe@users.noreply.github.com> * Fix. * Fix. * Remove `innerSink`. * Print candidates considered for interface match upon error. * Fix tests. --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: csyonghe <2652293+csyonghe@users.noreply.github.com> Co-authored-by: Yong He <yonghe@outlook.com>