| Age | Commit message | Author |
|
* Explicit swapchain interface in `gfx`.
* Correctly return nullptr when `IRenderer` creation failed.
* Fix crashes on CUDA tests.
* Cleanups.
|
|
* Make gfx library visible to external user.
* Fixup
|
|
This change kind of rolls together two different simplifications:
1. The `createShaderObject()` shouldn't really need to take an `IShaderObjectLayout` because it could just take the `slang::TypeLayoutReflection` instead and create the shader-object layout behind the scenes.
2. For that matter, it needn't take a `slang::TypeLayoutReflection` either, because it could just take a `slang::TypeReflection` and query the layout of that type behind the scenes.
The combination of these two changes means:
* `IShaderObjectLayout` is gone from the public API, as is `createShaderObjectLayout()`
* `createShaderObject()` directly takes a `slang::TypeReflection` and allocates a shader object of that type
The result is simpler and more streamlined application code.
Note that under the hood the implementation still has shader-object layouts, using the `ShaderObjectLayoutBase` class. A few locations had to change to use `RefPtr`s instead of `ComPtr`s now that the class is no longer a public COM-lite API type.
The hope is that this change makes it easier to allocate/cache layouts for things like specialized types "under the hood," as is needed to implement parameter setting for static specialization.
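As a rough illustration of the "layout behind the scenes" idea, here is a minimal sketch in which the public entry point takes only a type and the layout is created and cached internally. All names here (`TypeReflection`, `ShaderObjectLayout`, `Renderer`, `cachedLayoutCount`) are hypothetical stand-ins, not the real Slang or `gfx` types, and `std::shared_ptr` stands in for `RefPtr`.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>

// Hypothetical stand-ins; none of these are the real Slang/gfx types.
struct TypeReflection { std::string name; };                // plays the role of slang::TypeReflection
struct ShaderObjectLayout { const TypeReflection* type; };  // internal-only layout object
struct ShaderObject { std::shared_ptr<ShaderObjectLayout> layout; };

class Renderer {
public:
    // Public entry point: the caller supplies only the type.
    ShaderObject createShaderObject(const TypeReflection* type) {
        assert(type != nullptr);
        return ShaderObject{ getOrCreateLayout(type) };
    }

    std::size_t cachedLayoutCount() const { return m_layouts.size(); }

private:
    // The layout is created on first use and reused for later objects of the same type.
    std::shared_ptr<ShaderObjectLayout> getOrCreateLayout(const TypeReflection* type) {
        auto& slot = m_layouts[type];
        if (!slot) slot = std::make_shared<ShaderObjectLayout>(ShaderObjectLayout{type});
        return slot;
    }

    std::map<const TypeReflection*, std::shared_ptr<ShaderObjectLayout>> m_layouts;
};
```

Creating two objects of the same type through this sketch yields two objects that share one cached layout, which is the property that should make caching layouts for specialized types easier.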
|
|
* Add `SampleGrad` overload for lod clamp.
* Fix gfx to run the test on vulkan.
* Whitespace change to trigger CI build
* remove presentFrame call in render-test
Co-authored-by: Yong He <yhe@nvidia.com>
Co-authored-by: Tim Foley <tfoleyNV@users.noreply.github.com>
|
|
* #include an absolute path didn't work - because paths were taken to always be relative.
* Fix bugs with m_features on Dx12 and gl.
Fix issue about GFX_NVAPI availability.
* Fix handling of SLANG_E_NOT_AVAILABLE on renderer startup.
* Clarify comment.
* Improve comment.
|
|
* Make `gfx` compile to a DLL.
* Fix cuda
* Fix cuda build
* Fix gl screen capture bug.
|
|
This change converts a large number of our existing tests to use the `ShaderObject` support that was added to the `gfx` layer.
In many cases, tests were just updated to pass `-shaderobj` and the result Just Worked.
In other cases, a `name` attribute had to be added to one or more `TEST_INPUT` lines.
For tests that did not work with shader objects "out of the box," I spent a little bit of time trying to get them to work, but fell back to letting those tests run in the older mode.
Future changes to the infrastructure will be needed to get those additional tests working in the new path.
Along with the changes to test files, the following implementation changes were made to get additional tests working:
* Because the shader object mode uses explicit register bindings (from reflection), the hacky logic that was offsetting `u` registers for D3D12 based on the number of render targets gets disabled (by another hack).
* The "flat" reflection information coming from Slang was not correctly reporting "binding ranges" for things that consumed only uniform data (which would be everything on CUDA/CPU), so it was refactored to properly include binding ranges for anything where the type of the field/variable implied a binding range should be created (even if the `LayoutResourceKind` was `::Uniform`).
* A few fixes were made to the CUDA implementation of `Renderer`, in order to get additional tests up and running. Most of these changes had to do with texture bindings, which hadn't really been tested previously.
In addition, a few changes were made that were attempts at getting more tests working, but didn't actually help. These could be dropped if requested:
* As a quality-of-life feature (not being used) the `object` style of `TEST_INPUT` line is upgraded to support inferring the type to use from the type of the input being set.
* Any `object` shader input lines get ignored in non-shader-object mode.
|
|
* COM-ify all slang-gfx interfaces.
|
|
* Make `gfx::Renderer` a COM interface.
This is a first step towards making the `gfx` library expose a COM compatible DLL interface. Remaining classes will come as separate PRs.
* Fixup project files
* Fix calling conventions
* Make gfx::create*Renderer() functions increase ref count by 1
* Make renderer createFunc return via out parameter
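The create-function convention described in the last two bullets can be sketched as follows. This is a minimal, single-threaded illustration, not the actual `gfx` code: `RefCounted`, `DemoRenderer`, `createDemoRenderer`, `kOk`, and `kFail` are hypothetical names, and a real COM-style interface would also involve interface queries and the proper calling convention.

```cpp
#include <cassert>
#include <cstdint>

using Result = int32_t;
constexpr Result kOk = 0;
constexpr Result kFail = -1;

// Hypothetical COM-lite base class with intrusive reference counting (not thread-safe).
class RefCounted {
public:
    uint32_t addRef() { return ++m_refCount; }
    uint32_t release() {
        uint32_t remaining = --m_refCount;
        if (remaining == 0) delete this;
        return remaining;
    }
protected:
    virtual ~RefCounted() = default;
private:
    uint32_t m_refCount = 0;
};

class DemoRenderer : public RefCounted {};

// The result code reports success or failure, the object comes back through
// an out parameter, and on success the caller owns exactly one reference.
Result createDemoRenderer(bool deviceAvailable, DemoRenderer** outRenderer) {
    assert(outRenderer != nullptr);
    *outRenderer = nullptr;              // failure leaves a null pointer behind
    if (!deviceAvailable) return kFail;
    DemoRenderer* renderer = new DemoRenderer();
    renderer->addRef();                  // the reference handed to the caller
    *outRenderer = renderer;
    return kOk;
}
```

With this convention the caller pairs each successful create call with exactly one `release()`, and never has to guess whether the returned pointer already carries a reference.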
|
|
* Implements CUDA renderer in gfx.
* Revert unnecessary change.
* Revert unnecessary changes.
Co-authored-by: Tim Foley <tfoleyNV@users.noreply.github.com>
|
|
* Move ShaderObject to be under renderer interface.
* Make `create*PipelineState` take `const PipelineStateDesc&`.
* Move ShaderCursor implementation to a cpp file
|
|
* #include an absolute path didn't work - because paths were taken to always be relative.
* Move reflection to reflection-api.
* Slight reorg to pull out potentially Slang internal functions from the reflection API impls.
* Remove visual studio projects
* Fix for slang-binaries copy.
* Add the visual studio projects in build/visual-studio
* Remove miniz project.
* Differentiate the linePath from the filePath.
* Improve comment in premake5.lua + to kick off CI.
* Kick CI.
|
|
* Add shader object parameter binding to renderer_test.
* remove multiple-definitions.hlsl
* Fix cuda implementation.
Co-authored-by: Tim Foley <tfoleyNV@users.noreply.github.com>
|
|
* #include an absolute path didn't work - because paths were taken to always be relative.
* Mangling/module name extraction for GenericDecl
* Add comment on SerialFilter to explain re-enabling Stmt.
* Support setting up SyntaxDecl when reconstructed after deserialization.
* Improvements to setup SyntaxDecl.
* Fix typo so can read compressed SourceLocs.
* Fix issue with SourceManager.
* Simple test for serializing out stdlib and reading back in.
* Fix calling convention.
* Add override to StdLib impls.
* Fix typo.
* Apply testing to an actual compute test when using load-stdlib
Make -load/compile-stdlib processable by Slang
Move out testing into util into TestToolUtil so can be shared.
* Slightly more concise setup of session.
* Fix some errors introduced with session handling.
* Made setup for compile same across slangc and slangc-tool.
|
|
* Use integer RTTI/witness handles in existential tuples.
* Fix clang error.
* Fix IR serialization to use 16bits for opcode.
* Undo accidental comment change.
* Use variable length encoding for opcode.
* Fix compile error.
* Fixing issues
* Fix code review issues.
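The commit list does not say which variable-length scheme the opcode encoding uses. As an assumption for illustration only, the sketch below uses the common 7-bits-per-byte (LEB128-style) scheme, where small values such as most opcodes take a single byte; `writeVarUInt` and `readVarUInt` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Appends `value` seven bits at a time, low bits first. The high bit of each
// byte is set when more bytes follow. (Assumed scheme, not taken from the source.)
void writeVarUInt(std::vector<uint8_t>& out, uint32_t value) {
    while (value >= 0x80) {
        out.push_back(static_cast<uint8_t>(value | 0x80));
        value >>= 7;
    }
    out.push_back(static_cast<uint8_t>(value));
}

// Reads one value starting at `pos` and advances `pos` past it.
uint32_t readVarUInt(const std::vector<uint8_t>& in, std::size_t& pos) {
    uint32_t value = 0;
    int shift = 0;
    for (;;) {
        assert(pos < in.size() && shift < 35);  // at most five bytes for 32 bits
        uint8_t byte = in[pos++];
        value |= static_cast<uint32_t>(byte & 0x7F) << shift;
        if ((byte & 0x80) == 0) return value;
        shift += 7;
    }
}
```

Under this scheme any value below 128 costs one byte, so a fixed 16-bit opcode field is only a win for values that would otherwise need three bytes.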
|
|
* #include an absolute path didn't work - because paths were taken to always be relative.
* Fix handling of access modifiers inside type definition.
* Fix access problem for AST node.
Make dumping produce a single function with switch, to potentially make available without Dump specific access.
* WIP on serialization design doc.
* Remove project references to previously generated files.
* More docs on serialization design.
* Improve serialization documentation.
Remove unused function from IRSerialReader.
* Small fixes around naming. Remove long comment from slang-serialize.h - as covered in serialization.md
* Remove long comment in slang-serialize.h as covered in serialization.md
* More information about doing replacements on read for AST and problems surrounding.
* Typo fix.
* Spelling fixes.
* Value serialize.
* Value types with inheritance.
* Use value reflection serial conversion for more AST types
* Use automatic serialization on more of AST.
* Get the types via decltype, simplifies what the extractor has to do.
* Update the serialization.md for the value serialization.
* Small doc improvements.
* Update project.
* Remove ImportExternalDecl type
Added addImportSymbol and ImportSymbol type
Fixed bug in container which meant it wouldn't read back AST module
* Because of the change in how imports are handled, store objects as SerialPointers.
* First pass symbol lookup from mangled names.
* Cache current module looked up from mangled name.
* Fix SourceLoc bug.
Improve comments.
* Added diagnostic on mangled symbol not being found
* Fix typo.
* WIP serializing stdlib.
* WIP serializing stdlib in.
* Fix problem serializing arrays that hold data that is already serialized.
* Remove clash of names in MagicTypeModifier.
* Make conversion from char to String explicit.
Fix reference count issue with SerialReader.
* Add code to save/load stdlib.
* Use return code to avoid warning - SerialContainerUtil::write(module, options, &stream))
* Make all String numeric ctors explicit.
Added isChar to UnownedStringSlice.
Added operator== for UnownedStringSlice to String to avoid need to convert to String and allocate.
* Add error check to readAllText.
* tabs -> spaces on String.h
* tab -> spaces String.cpp
* Remove msg for StringBuilder, just build inplace for exceptions.
* Check SerialClasses - for name clashes.
Renamed Modifier::name as Modifier::keywordName
* Handling of extensions when deserializing AST - updating the moduleDecl->mapTypeToCandidateExtensions
Co-authored-by: Tim Foley <tim.foley.is@gmail.com>
|
|
Co-authored-by: Tim Foley <tfoleyNV@users.noreply.github.com>
|
|
* Support shader parameters that are an array of existential type.
* Rename to getFirstNonExistentialValueCategory
Co-authored-by: Yong He <yhe@nvidia.com>
|
|
* Allow unspecialized existential shader parameters (dynamic dispatch).
* Fixes.
* Fixes
* disable cuda test
|
|
* Support dynamic existential shader parameters in render-test
* Fix linux build error.
* Fixes.
* Fix code review issues.
* Fix gcc error.
* More fixes.
* More fixes.
|
|
* First pass at incorporating nvapi into test harness.
* D3d12 Atomic Float Add via NVAPI working
* Dx12 atomic float appears to work.
* Atomic float add on Dx12.
* Added atomic64 feature addition to vk.
Fix correct output for atomic-float-byte-address.slang
* Disable atomic float failing tests.
* Upgraded VK headers.
* Detect atomic float availability on VK.
* Try to get test working for int64 atomic.
* Made HLSL prelude controlled via the render-test requirements.
* Added -enable-nvapi to premake.
* Fix D3D12Renderer when NVAPI is not available.
* Small improvements to VKRenderer.
* Improve atomic documentation in target-compatibility.md.
* Fixed NVAPI working on D3D12.
* Test for specific NVAPI features.
* Remove requiredFeatures from Renderer::Desc as was ignored. Tried to document more around nvapiExtnSlot.
* Readded requiredFeatures to Renderer::Desc
* Improve comments in the tests.
|
|
* First pass at incorporating nvapi into test harness.
* D3d12 Atomic Float Add via NVAPI working
* Dx12 atomic float appears to work.
* Atomic float add on Dx12.
* Added atomic64 feature addition to vk.
Fix correct output for atomic-float-byte-address.slang
* Disable atomic float failing tests.
* Upgraded VK headers.
* Detect atomic float availability on VK.
* Try to get test working for int64 atomic.
* Made HLSL prelude controlled via the render-test requirements.
* Added -enable-nvapi to premake.
* Fix D3D12Renderer when NVAPI is not available.
* Small improvements to VKRenderer.
* Improve atomic documentation in target-compatibility.md.
|
|
Entry point `uniform` parameters were a feature of the original Cg and HLSL, but have not been used much in production shader code. One of our goals on Slang is to reduce the (ab)use of the global scope, so bringing entry point `uniform` parameters up to a greater level of usability is an important goal.
Some policy choices about how global vs. entry-point `uniform` parameters behave have already been made, that shape decisions looking forward:
* For DXBC/DXIL, it makes the most sense to follow the lead of fxc/dxc, by treating entry point `uniform` parameters as a kind of syntax sugar for global shader parameters. Any parameters of "ordinary" types are bundled up into an implicit constant buffer, and all the resources (including the implicit constant buffer) are assigned `register`s just as for globals. It is up to the application to decide how to bind those parameters via a root signature (using root descriptors, root constants, descriptor tables, local vs. global root signature, etc.)
* For CPU, it makes sense to pass global vs. entry-point parameters as two different pointers, although the details of what we do for CPU are the least constrained across all current targets.
* For CUDA compute, it makes the most sense to map global shader parameters to `__constant__` global data, and entry-point `uniform` parameters to kernel parameters. This choice ensures that the signature of a kernel when translated from Slang->CUDA follows the Principle of Least Surprise, at the cost of making entry-point vs. global parameters be passed via different mechanisms.
* For OptiX ray tracing, it makes sense to expand on the precedent from CUDA compute: pass global parameters via global `__constant__` data (as is already expected by OptiX for whole-launch parameters), and pass entry-point `uniform` parameters via the "shader record." This establishes a precedent that for ray-tracing shaders, global-scope parameters map to the "global root signature" concept from DXR, while entry-point `uniform` parameters map to a "local root signature" or "shader record."
* For Vulkan ray tracing, the precedent from OptiX then argues that entry-point `uniform` parameters should map to the Vulkan "shader record" concept (and thus cannot support things like resource types).
* The remaining interesting case is what to do for non-ray-tracing shaders on Vulkan.
The dev team agrees that the most reasonable choice to make for non-ray-tracing Vulkan shaders is to map entry-point `uniform` parameters to "push constants." In particular, this makes it easy to express the case of a compute kernel with direct parameters of ordinary/value types in the way that will be implemented most efficiently.
The big picture is then that a kernel like:
```hlsl
void computeMain(uniform float someValue) { ... }
```
will map to output GLSL like:
```glsl
layout(push_constant)
uniform
{
float someValue;
} U;
void main() { ... }
```
If the user really wanted a constant-buffer binding to be created instead, they can easily change their input to make the buffer explicit:
```hlsl
struct Params { float someValue; };
void computeMain(uniform ConstantBuffer<Params> params) { ... }
```
(Forcing the user to be explicit about the desire for a buffer here creates a nice symmetry between Vulkan and CUDA; in the first case the user sets up the data in host memory and passes it to the GPU by copy, while in the second case the user must allocate and set up a device-memory buffer for the data. This symmetry extends to D3D if the application chooses to map entry-point `uniform` parameters to root constants.)
This change implements logic in the "parameter binding" part of the Slang compiler to make sure that entry-point `uniform` parameters are wrapped up in a push-constant buffer rather than an ordinary constant buffer for non-ray-tracing shaders on Vulkan (and in a shader record "buffer" for the ray-tracing case).
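The per-target policy described in this message can be condensed into a small lookup. The sketch below is only a summary of the prose, not compiler code: the enums `Target` and `ParamPath` and the function `entryPointUniformPath` are hypothetical names introduced for illustration.

```cpp
#include <cassert>

// Hypothetical names summarizing the policy described in this message.
enum class Target { D3D, CPU, CUDA, OptiX, Vulkan };

enum class ParamPath {
    ImplicitConstantBuffer,  // DXBC/DXIL: sugar for globals, registers assigned as usual
    SeparatePointer,         // CPU: entry-point parameters get their own pointer
    KernelParameters,        // CUDA compute: ordinary kernel parameters
    ShaderRecord,            // ray tracing on OptiX and Vulkan
    PushConstants            // non-ray-tracing Vulkan
};

// How entry-point `uniform` parameters are passed for a given target and stage.
ParamPath entryPointUniformPath(Target target, bool isRayTracingStage) {
    switch (target) {
    case Target::D3D:    return ParamPath::ImplicitConstantBuffer;
    case Target::CPU:    return ParamPath::SeparatePointer;
    case Target::CUDA:   return ParamPath::KernelParameters;
    case Target::OptiX:  return ParamPath::ShaderRecord;
    case Target::Vulkan: return isRayTracingStage ? ParamPath::ShaderRecord
                                                  : ParamPath::PushConstants;
    }
    assert(false && "unknown target");
    return ParamPath::ImplicitConstantBuffer;
}
```

The only target where the stage changes the answer is Vulkan, which is the case this change addresses.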
The majority of the actual work was in adding support for root/push constants to the test framework and the graphics API abstraction it uses. To be clear about that support:
* Root constant ranges are (perhaps confusingly) treated as a new kind of "slot" that can appear on a descriptor set. This choice ensures that the implicit numbering of registers/spaces used by the back-ends can account for these ranges correctly.
* The `TEST_INPUT` lines are extended to allow a `root_constants` case that behaves more or less like `cbuffer`
* The CPU and CUDA paths can treat a `root_constants` input identically to a `cbuffer`. They already allocate the actual buffers based on reflection, and just use `cbuffer` as a directive that causes bytes to be copied in.
* On D3D12 and Vulkan, a descriptor set allocates a `List<char>` to hold the bytes of root constant data assigned into it, and these bytes are flushed to the command list when the table is actually bound (usually right before rendering).
* On D3D11, a descriptor set treats a root constant range more or less like a constant buffer range (with a single buffer), except that it also automatically allocates a buffer to hold the data. Assigning "root constant" data automatically copies it into that buffer.
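The D3D12/Vulkan behavior above (stage the bytes in the descriptor set, flush them when the set is bound) can be sketched as follows. This is an assumption-laden illustration: `DescriptorSetSketch` is a hypothetical class, `std::vector<char>` stands in for `List<char>`, and the callback stands in for whatever API call actually records the constants into the command list.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <functional>
#include <vector>

// Hypothetical descriptor set that stages root-constant bytes on the CPU side.
class DescriptorSetSketch {
public:
    explicit DescriptorSetSketch(std::size_t rootConstantBytes)
        : m_rootConstantData(rootConstantBytes, 0) {}

    // Assigning root constants only records the bytes; nothing reaches the GPU yet.
    void setRootConstants(std::size_t offset, const void* data, std::size_t size) {
        assert(offset + size <= m_rootConstantData.size());
        std::memcpy(m_rootConstantData.data() + offset, data, size);
    }

    // When the set is bound, the staged bytes are handed to the command list in one go.
    using FlushFn = std::function<void(const char* bytes, std::size_t size)>;
    void bind(const FlushFn& flushToCommandList) const {
        if (!m_rootConstantData.empty())
            flushToCommandList(m_rootConstantData.data(), m_rootConstantData.size());
    }

private:
    std::vector<char> m_rootConstantData;
};
```

Deferring the flush to bind time means repeated assignments before a draw or dispatch cost only a `memcpy` each.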
The small number of tests that used entry-point `uniform` parameters of ordinary types were updated to use the new `root_constants` input type, and the bugs that surfaced were fixed.
A new test to confirm that entry-point `uniform` parameters map to the shader record for VK ray tracing was added.
An important but technically unrelated change is the removal of the `DescriptorSetImpl::Binding` type and related function from the Vulkan implementation of `Renderer`. That type was created to ensure that objects that are bound into a descriptor set don't get released while the descriptor set is still alive, but the implementation relied on a complicated linear search to check for existing bindings, which could create a performance issue for descriptor sets that include large arrays of descriptors. The new implementation makes use of the approach already present in the various `Renderer` implementations (including the Vulkan one) for assigning ranges in a descriptor set a flat/linear index for where their pertinent data is to be bound. As a result, the Vulkan `DescriptorSetImpl` now uses a single flat array of `RefPtr`s to track bound objects, and has no need for linear search when binding.
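A minimal sketch of the flat-array idea, under the assumption that each range is given a base index at layout time: `FlatBindingTable` and `Resource` are hypothetical names, and `std::shared_ptr` stands in for `RefPtr`.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

struct Resource { int id; };  // hypothetical bound object

// Each range gets a base index into one flat array, so binding is a direct
// store with no search over existing bindings.
class FlatBindingTable {
public:
    // rangeCounts[i] is the number of descriptors in range i.
    explicit FlatBindingTable(const std::vector<std::size_t>& rangeCounts) {
        std::size_t next = 0;
        for (std::size_t count : rangeCounts) {
            m_rangeBase.push_back(next);
            m_rangeCount.push_back(count);
            next += count;
        }
        m_bound.resize(next);
    }

    // Holding the pointer keeps the object alive for as long as the table is.
    void bind(std::size_t range, std::size_t index, std::shared_ptr<Resource> resource) {
        assert(range < m_rangeBase.size() && index < m_rangeCount[range]);
        m_bound[m_rangeBase[range] + index] = std::move(resource);
    }

    std::size_t slotCount() const { return m_bound.size(); }

private:
    std::vector<std::size_t> m_rangeBase;
    std::vector<std::size_t> m_rangeCount;
    std::vector<std::shared_ptr<Resource>> m_bound;
};
```

Rebinding a slot simply overwrites the previous pointer, which releases the old object, so large descriptor arrays cost constant time per binding.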
Co-authored-by: Yong He <yonghe@outlook.com>
|
|
The Big Picture
===============
Given input Slang code like:
```hlsl
Texture2D gA;
[shader("compute")]
void kernelFunc(uniform Texture2D b, uint3 tid : SV_DispatchThreadID)
{ ... }
```
the existing CUDA code generation strategy would always generate a kernel with a signature like:
```c++
struct GlobalParams { Texture2D gA; };
struct EntryPointParams { Texture2D b; };
extern "C" __global__
void kernelFunc(EntryPointParams* entryPointParams, GlobalParams* globalParams)
{ ... }
```
This choice was consistent with the conventions of the CPU kernel target, and shares the advantage that it is easy for the user to data-drive the logic for filling in parameters and then invoking a kernel.
However, the approach outlined above has two serious problems when used for CUDA kernels:
* First, it defies the programmer's expectation about what an "equivalent" CUDA kernel signature would be, which makes it awkward for a developer to invoke this kernel from CUDA C++ host code (especially in the context of an app that might also run hand-written CUDA kernels).
* Second, the performance of this approach suffers because every access to a global or entry point parameter turns into a load from global memory. In contrast, a typical hand-written CUDA kernel passes its parameters via an implementation-specific path that (for current CUDA platforms) seems to be equivalent to `__constant__` memory in performance.
This change alters the convention so that the Slang compiler takes the code from the top of this message and translates it into something like:
```c++
struct GlobalParams { Texture2D gA; };
__constant__ GlobalParams SLANG_globalParams;
extern "C" __global__
void kernelFunc( Texture2D b )
{ ... }
```
This translation alleviates both problems with the current translation:
* The signature of the generated CUDA kernel function is as close to that of the original as is possible (we had to eliminate the `SV_*`-semantic varying inputs), and should directly match what the programmer would expect in common cases.
* Entry-point parameters are passed via CUDA kernel parameters, and should thus match in performance. Global parameters are passed via a variable in `__constant__` memory, and thus should also perform as well as possible/expected.
Detailed Changes
================
* Disable the `collectEntryPointUniformParams` pass for CUDA, so that entry-point `uniform` parameters are *not* bundled into a single `struct` and/or `ConstantBuffer`.
* When targeting CUDA, disable the logic for generating an entry-point parameter for passing in the global shader parameter(s)
* Allow `CLikeSourceEmitter` subclasses to override the name generated for entry-point symbols, and use this to add the required prefix for each OptiX kernel type when translating a ray-tracing kernel.
* Add logic to emit "parameter groups" in a specialized way for CUDA (this is the same approach that allows us to generate `cbuffer { ... }` declarations for fxc). A global-scope parameter group will turn into a global `__constant__` variable called `SLANG_globalParams` (that name becomes part of the ABI for Slang-compiled shaders).
* Update the logic in `render-test` for loading and invoking CUDA kernels to handle the new policy.
The last bullet there merits expansion, since it is indicative of the work a client using Slang would have to go through to use our generated kernels with the new policy:
* When loading a CUDA module with one or more kernels, we also use `cuModuleGetGlobal` to query the address of the `SLANG_globalParams` symbol in that CUDA module. That pointer needs to be used when setting global parameter values to be used by kernels in that CUDA module.
* Because our existing `BindPoint` logic for CUDA always sets up parameter data in GPU memory, we end up having to copy the entry-point parameter data from GPU memory to host memory. This step would ideally be skipped in a codebase that understands the correct policy, but it is a bit unfortunate that it is no longer trivially correct for an application to store all parameter data in GPU memory.
* Before invoking the kernel, we need to use a `cudaMemcpyAsync` to copy from the prepared GPU memory for global parameters over to the `SLANG_globalParams` symbol associated with the kernel to be invoked. Because this operation is issued on the same CUDA stream as the kernel call, it is guaranteed to not overlap with GPU kernel execution.
* When invoking the kernel, we take advantage of the seldom-used `CU_LAUNCH_PARAM_BUFFER_POINTER` facility to specify a contiguous memory region with all the entry-point parameters in it instead of passing each entry-point parameter separately. Given Slang reflection it is also possible to query the offset of each entry-point parameter in the buffer, so we could invoke the kernel in the traditional fashion as well. The choice here is up to the application.
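The host-side packing step from the last bullet can be sketched without any CUDA dependency. This is an illustration under assumptions: `ParamSlot` and `EntryPointParamBuffer` are hypothetical names, and the offsets are taken as given (in practice they would come from Slang reflection). The trailing comment shows roughly how such a buffer is passed through the `extra` argument of `cuLaunchKernel`; the launch dimensions and handles in it are placeholders.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical description of one entry-point parameter: where it lives in the
// contiguous buffer and how many bytes it occupies.
struct ParamSlot { std::size_t offset; std::size_t size; };

// Host-memory buffer holding all entry-point parameters back to back.
class EntryPointParamBuffer {
public:
    explicit EntryPointParamBuffer(std::size_t totalSize) : m_bytes(totalSize, 0) {}

    void set(const ParamSlot& slot, const void* value) {
        assert(slot.offset + slot.size <= m_bytes.size());
        std::memcpy(m_bytes.data() + slot.offset, value, slot.size);
    }

    void* data() { return m_bytes.data(); }
    std::size_t size() const { return m_bytes.size(); }

private:
    std::vector<unsigned char> m_bytes;
};

// With the CUDA driver API, the packed buffer would be handed over roughly as:
//   std::size_t size = params.size();
//   void* extra[] = { CU_LAUNCH_PARAM_BUFFER_POINTER, params.data(),
//                     CU_LAUNCH_PARAM_BUFFER_SIZE,    &size,
//                     CU_LAUNCH_PARAM_END };
//   cuLaunchKernel(fn, gx, gy, gz, bx, by, bz, sharedBytes, stream, nullptr, extra);
```

An application that prefers the traditional per-parameter launch could instead build an array of pointers into the same buffer, one per `ParamSlot`.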
Caveats
=======
* This is a breaking change, and any subsequent release will need to reflect that fact. Any customers who rely on Slang's current CUDA codegen strategy are likely to be surprised by this change, and I don't see an easy way to give them a more gentle transition.
* This change does *not* remove the logic that introduces a `KernelContext` type for code that requires it. That means that things like `static` global variables can continue to work on CUDA for now, but we know that those are not going to be something we can support in the long-term with separate compilation.
* While the policy implemented in this change is a reasonable default, it is still not going to perfectly match expectations for some developers. In particular, some developers who are familiar with both D3D and CUDA will likely wonder why a global `cbuffer` in Slang translates to a global-memory pointer in the output CUDA instead of one global `__constant__` variable per `cbuffer`. A more detailed alternate translation would generate a distinct global `__constant__` variable for each top-level constant buffer or parameter block. We may need to refine the translation even more based on feedback from users who care about how we handle global-scope parameters.
* Recent changes in Slang have broken the logic that handles the OptiX "shader record" as an alternative mechanism for passing entry-point parameters. In order to get any level of OptiX support up and running we will have to change the IR passes that run on CUDA kernels to actually run the "collection" of `uniform` parameters for ray tracing stages, and then to replace references to the resulting parameter with a call to the function to access the shader record.
* The use of `SLANG_globalParams` here works well enough in the case of whole-program compilation; every `CUmodule` ends up with (zero or) one parameter with this name, and an application can just hard-code it. As a mechanism it wouldn't work in the presence of separately-compiled modules that might introduce their own global parameters (including cases like constant lookup tables that really want to be at the global scope). An alternative approach would have Slang generate output PTX for each module, where a module has an optional global symbol for its own global-scope parameters (with a mangled name that is based on the module name), and then a linked CUDA binary has all of those distinct symbols. Such an approach would be compatible with module-at-a-time reflection and parameter binding, but would lead to another breaking change down the line for code that switches to `SLANG_globalParams`.
|
|
* Introduced heterogeneous example. Example includes C++ source and
header files, and does not currently make use of the associated slang
file when building. The intent of this commit is to introduce the
example as a baseline for later updates as the heterogeneous model is
expanded.
* Changing namespace
* Renamed and rewrote README
* Updated example to account for compiler updates
* Updated path
Co-authored-by: Tim Foley <tfoleyNV@users.noreply.github.com>
|
|
* Associate a downstream compiler for prelude lookup even if output is source.
* Remove LanguageStyle and just use SourceLanguage instead.
* Added set/getPrelude.
Made prelude work on source language.
* Fix typo in method name replacement.
get/SetPrelude get/setLanguagePrelude
* Fix issue because of method name change.
* Remove getPreludeDownstreamCompilerForTarget
|
|
* Remove UniformState and UniformEntryPointParams types
* Put all output C++ source in an anonymous namespace
* If SLANG_PRELUDE_NAMESPACE is set, make what it defines available in generated file.
* Fix signature issue in performance-profile.slang
* Context -> KernelContext to avoid ambiguity.
* Fix issues around dynamic dispatch and anonymous namespace.
* Fix typo.
|
|
* Synthesize "active mask" for CUDA
The Big Picture
===============
The most important change here is to `hlsl.meta.slang`, where the declaration of `WaveGetActiveMask()` is changed so that instead of mapping to `__activemask()` on CUDA (which is semantically incorrect) it maps to a dedicated IR instruction.
The other `WaveActive*()` intrinsics that make use of the implicit "active mask" concept had already been changed in #1336 so that they explicitly translate to call the equivalent `WaveMask*()` intrinsic with the result of `WaveGetActiveMask()`. As a result, all of the `WaveActive*()` functions are now no different from a user-defined function that uses `WaveGetActiveMask()`.
The bulk of the work in this change goes into an IR pass that replaces the new instruction for getting the active mask with appropriately computed values before we generate output CUDA code. That work is in `slang-ir-synthesize-active-mask.{h,cpp}`.
Utilities
=========
There are a few pieces of code that were helpful in writing the main pass but that can be explained separately:
* IR instructions were added corresponding to the Slang `WaveMaskBallot()` and `WaveMaskMatch()` functions, which map to the CUDA `__ballot_sync()` and `__match_any_sync()` operations, respectively. These are only implemented for the CUDA target because they are only being generated as part of our CUDA-only pass.
* The `IRDominatorTree` type was updated to make it a bit more robust in the presence of unreachable blocks in the CFG. It is possible that the same ends could be achieved more efficiently by folding the corner cases into the main logic, but I went ahead and made things very explicit for now.
* I added an `IREdge` utility type to better encapsulate the way that certain code operating on the predecessors/successors of an `IRBlock` were using an `IRUse*` to represent a control-flow edge. The `IREdge` type makes the logic of those operations more explicit. A future change should probably change it so that `IRBlock::getPredecessors()` and `getSuccessors()` are instead `getIncomingEdges()` and `getOutgoingEdges()` and work as iterators over `IREdge` values, given the way that the predecessor and successor lists today can contain duplicates.
* Using the above `IREdge` type, the logic for detecting and breaking critical edges was broken down into something that is a bit more clear (I hope), and that also factors out the breaking of an edge (by inserting a block along it) into a reusable subroutine.
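For readers unfamiliar with the term, a critical edge runs from a block with more than one successor to a block with more than one predecessor. The sketch below shows detection and breaking on a toy CFG. It is not the Slang IR code: `Cfg`, `Edge`, `isCritical`, `breakEdge`, and `breakCriticalEdges` are hypothetical names, and an edge is identified by its slot in the source block's successor list (loosely mirroring how an `IRUse*` pins down one specific edge even when targets repeat).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy control-flow graph: successors[b] lists the branch targets of block b.
struct Cfg {
    std::vector<std::vector<int>> successors;

    int addBlock() {
        successors.emplace_back();
        return static_cast<int>(successors.size()) - 1;
    }

    int predecessorCount(int block) const {
        int count = 0;
        for (const auto& targets : successors)
            for (int t : targets)
                if (t == block) ++count;
        return count;
    }
};

// One control-flow edge: the `slot`-th successor entry of block `from`.
struct Edge { int from; std::size_t slot; };

bool isCritical(const Cfg& cfg, const Edge& e) {
    int to = cfg.successors[e.from][e.slot];
    return cfg.successors[e.from].size() > 1 && cfg.predecessorCount(to) > 1;
}

// Reusable step: insert a fresh block along one edge and return it.
int breakEdge(Cfg& cfg, const Edge& e) {
    int to = cfg.successors[e.from][e.slot];
    int middle = cfg.addBlock();
    cfg.successors[middle].push_back(to);
    cfg.successors[e.from][e.slot] = middle;
    return middle;
}

// Breaks every critical edge leaving the original blocks; returns how many were broken.
int breakCriticalEdges(Cfg& cfg) {
    int broken = 0;
    std::size_t originalBlocks = cfg.successors.size();
    for (std::size_t b = 0; b < originalBlocks; ++b) {
        for (std::size_t s = 0; s < cfg.successors[b].size(); ++s) {
            Edge e{static_cast<int>(b), s};
            if (isCritical(cfg, e)) { breakEdge(cfg, e); ++broken; }
        }
    }
    return broken;
}
```

In an `if` without an `else`, the edge from the condition block straight to the merge block is the critical one, and the inserted block gives later passes a place to put code that belongs to that edge alone.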
The Main Pass
=============
The implementation of the new pass is in `slang-ir-synthesize-active-mask.cpp`, and that file attempts to include enough comments to make the logic clear. A brief summary for the benefit of the commit history:
* The first order of business is to identify functions that need to have the active mask value piped into them, and to add an additional parameter to them so that the active mask is passed down explicitly. Call sites are adjusted to pass down the active mask, which can then result in new functions being identified as needing the active mask.
* The next challenge, for a function that uses the active mask, is to compute the active mask value to use in each basic block. The entry block can easily use the active mask value that was passed in, while other blocks need more work.
* When doing a conditional branch, we can compute the new mask for the block we branch to as a function of the existing mask and the branch condition. E.g., the value `WaveMaskBallot(existingMask, condition)` can be used as the mask for the "then" block of an `if` statement.
* When control flow paths need to "reconverge" at a point after a structured control-flow statement, we need to insert logic to synchronize and re-build the mask that will execute after the statement, while also excluding any lanes/threads that exited the statement in other ways (e.g., an early `return` from the function).
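The mask arithmetic in the last two bullets can be simulated on the host. This is a sketch under assumptions, not the pass itself: `ballot` is a CPU stand-in for `WaveMaskBallot(mask, condition)`, the "else" mask formula is my assumption (the text only gives the "then" case), and `splitAtBranch` and `mergeMask` are hypothetical names. Real reconvergence also needs a synchronization step that this simulation omits.

```cpp
#include <cassert>
#include <cstdint>

constexpr int kLaneCount = 32;

// Host-side stand-in for WaveMaskBallot(mask, condition): bit i is set when
// lane i is in `mask` and its condition is true.
template <typename Predicate>
uint32_t ballot(uint32_t mask, Predicate conditionForLane) {
    uint32_t result = 0;
    for (int lane = 0; lane < kLaneCount; ++lane) {
        uint32_t bit = uint32_t(1) << lane;
        if ((mask & bit) && conditionForLane(lane)) result |= bit;
    }
    return result;
}

struct BranchMasks { uint32_t thenMask; uint32_t elseMask; };

// Masks for the two sides of a conditional branch, derived from the mask on entry.
// The else-side formula is an assumption: the lanes that were active but not taken.
template <typename Predicate>
BranchMasks splitAtBranch(uint32_t existingMask, Predicate conditionForLane) {
    uint32_t thenMask = ballot(existingMask, conditionForLane);
    return BranchMasks{ thenMask, existingMask & ~thenMask };
}

// Mask after the statement: lanes that reached the end of either side. Lanes that
// left early (for example via `return`) must already be cleared from the inputs.
uint32_t mergeMask(uint32_t thenExitMask, uint32_t elseExitMask) {
    return thenExitMask | elseExitMask;
}
```

For example, with four active lanes and a condition true on even lanes, the "then" side runs with lanes 0 and 2, the "else" side with lanes 1 and 3, and a lane that returns early inside the "then" side is simply absent from the merged mask.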
The explanation here is fairly hand-wavy, but the actual pass uses much more crisp definitions, so the code itself should be inspected if you care about the details.
Tests
=====
The tests for the new feature are all under `tests/hlsl-intrinsic/active-mask/`. Most of them stress a single control-flow construct (`if`, `switch`, or loop) and write out the value of `WaveGetActiveMask()` at various points in the code.
Our definition of the active mask doesn't always agree with what D3D/Vulkan implementations seem to produce in practice, and as a result a certain amount of effort has gone into adding tweaks to the tests that force them to produce the expected output on existing graphics APIs. These tweaks usually amount to introducing conditional branches that aren't actually conditional (the branch condition is always `true` or always `false` at runtime), in order to trick some simplistic analysis approaches that downstream compilers seem to employ.
One test case currently fails on our CUDA target (`switch-trivial-fallthrough.slang`) and has been disabled. This is an expected failure, because making it produce the expected value requires a bit of detailed/careful coding that would add a lot of additional complexity to this change. It seemed better to leave that as future work.
Future Work
===========
* As discussed under "Tests" above, the handling of simple `switch` statements in the current pass is incomplete.
* There's an entire can of worms to be dealt with around the handling of fall-through for `switch`.
* The current work also doesn't handle `discard` statements, which is unimportant right now (CUDA doesn't have fragment shaders), but might matter if we decide to synthesize masks for other targets. Similar work would probably be needed if we ever have `throw` or other non-local control flow that crosses function boundaries.
* An important optimization opportunity is being left on the floor in this change. When the block that comes "after" a structured control-flow region (which is encoded explicitly in Slang IR and SPIR-V) post-dominates the entry block of the region, then we know that the active mask when exiting the region must be the same as the mask when entering the region, and there is no need to insert explicit code to cause "re-convergence." This should be addressed in a follow-on change once we add code to Slang for computing a post-dominator tree from a function CFG.
* Related to the above, the decision-making around whether a basic block "needs" the active mask is perhaps too conservative, since it decides that any block that precedes one needing the active mask also needs it. This isn't true in cases where the active mask for a merge block can be inferred by post-dominance (as described above), so that the blocks that branch to it don't need to compute an active mask at all.
* If/when we extend the CPU target to support these operations (along with SIMD code generation, I assume), we will also need to synthesize an active mask on that platform, but the approach taken here (which pretty much relies on support for CUDA "cooperative groups") wouldn't seem to apply in the SIMD case.
* Similarly, the approach taken to computing the active mask here requires a new enough CUDA SM architecture version to support explicit cooperative groups. If we want to run on older CUDA-supporting architectures, we will need a new and potentially very different strategy.
* Because the new pass here changes the signature of functions that require the active mask (and not those that don't), it creates possible problems for generating code that uses dynamic dispatch (via function pointers). In principle, we need to know at a call site whether or not the callee uses the active mask. There are multiple possible solutions to this problem, and they'd need to be worked through before we can make the implicit active mask and dynamic dispatch be mutually compatible.
* Related to changing function signatures: no effort is made in this pass to clean up the IR type of the functions it modifies, so there could technically be mismatches between the IR type of a function and its actual signature. If/when this causes problems for downstream passes we probably need to do some cleanup.
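To make the post-dominance condition mentioned above concrete, here is a minimal sketch of the classic iterative data-flow computation of post-dominator sets on a tiny CFG. It is not Slang code: the CFG representation (a vector of successor lists), the 32-block limit, the single unified exit block, and the names `computePostDominators` and `postDominates` are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical CFG model: successors[b] lists the successor blocks of b.
// Limited to 32 blocks so each post-dominator set fits in one bit mask.
using BlockSet = std::uint32_t;

// For every block b, compute the set of blocks that post-dominate b:
//   pdom(exit) = {exit}
//   pdom(b)    = {b} union (intersection of pdom(s) over successors s of b)
// Assumes a single exit block that every `return` branches to.
inline std::vector<BlockSet> computePostDominators(
    const std::vector<std::vector<int>>& successors, int exitBlock)
{
    const int n = static_cast<int>(successors.size());
    assert(n > 0 && n <= 32 && exitBlock >= 0 && exitBlock < n);
    const BlockSet all = (n == 32) ? ~BlockSet(0) : ((BlockSet(1) << n) - 1);

    // Start from "everything post-dominates everything" and shrink.
    std::vector<BlockSet> pdom(n, all);
    pdom[exitBlock] = BlockSet(1) << exitBlock;

    bool changed = true;
    while (changed)
    {
        changed = false;
        for (int b = 0; b < n; ++b)
        {
            if (b == exitBlock) continue;
            // A non-exit block without successors is only post-dominated by itself.
            BlockSet meet = successors[b].empty() ? 0 : all;
            for (int s : successors[b]) meet &= pdom[s];
            const BlockSet next = meet | (BlockSet(1) << b);
            if (next != pdom[b]) { pdom[b] = next; changed = true; }
        }
    }
    return pdom;
}

// True if block `a` post-dominates block `b`.
inline bool postDominates(const std::vector<BlockSet>& pdom, int a, int b)
{
    assert(a >= 0 && b >= 0 && b < static_cast<int>(pdom.size()) && a < 32);
    return ((pdom[b] >> a) & 1u) != 0;
}
```

For a plain `if` (entry 0 branches to 1 and 2, both branch to merge block 3, which branches to exit 4), block 3 post-dominates block 0, so no re-convergence code would be needed. If block 1 instead branches straight to the exit (an early `return`), block 3 no longer post-dominates block 0 and the explicit synchronization is still required.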
* fixup: backslash-escaped lines
I did some "ASCII art" sorts of diagrams to explain cases in the CFG, and some of those diagrams used backslash (`\`) characters as the last character on the line, causing them to count as escaped newlines for C/C++.
The gcc compiler apparently balked at those lines, since they made some of the single-line comments into multi-line comments.
I solved the problem by adding a terminating column of `|` characters at the end of each line that was part of an ASCII art diagram.
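A minimal sketch of the fixed form follows; the diagram and function names are made up for illustration. Without the trailing `|` column, the second diagram line would end in a backslash, the next source line would be spliced into the comment, and the statement after the diagram would silently disappear.

```cpp
#include <cassert>

// Each diagram line ends in `|`, so no comment line ends in a backslash
// and the compiler never splices the following line into the comment.
inline int countBranchTargets()
{
    int targets = 0;
    //      entry        |
    //      /   \        |
    //   then   else     |
    targets += 2;   // still compiled: the diagram above did not swallow it
    return targets;
}

// Quick check that the statement after the diagram really executes.
inline void checkDiagramDidNotSwallowCode()
{
    assert(countBranchTargets() == 2);
}
```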
* fixup: typos
Co-authored-by: jsmall-nvidia <jsmall@nvidia.com>
|
|
* Fields from upper to lower case in slang-ast-decl.h
* Lower camel field names in slang-ast-stmt.h
* Fix fields in slang-ast-expr.h
* slang-ast-type.h make fields lowerCamel.
* slang-ast-base.h member functions lowerCamel.
* Method names in slang-ast-type.h to lowerCamel.
* GetCanonicalType -> getCanonicalType
* Substitute -> substitute
* Equals -> equals
ToString -> toString
* ParentDecl -> parentDecl
Members -> members
* * Make hash code types explicit
* Use HashCode as return type of GetHashCode
* Added conversion from double to int64_t
* Split Stable from other hash functions
* toHash32/64 to convert a HashCode to the other styles.
GetHashCode32/64 -> getHashCode32/64
GetStableHashCode32/64 -> getStableHashCode32/64
* Other Get/Stable/HashCode32/64 fixes
* GetHashCode -> getHashCode
* Equals -> equals
* CreateCanonicalType -> createCanonicalType
* Catches of polymorphic types should be through references otherwise slicing can occur.
* Fixes for newer version of gcc.
Fix hashing problem on gcc for Dictionary.
* Another fix for GetHashPos
* Fix signed issue around GetHashPos
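The point above about catching polymorphic types by reference can be shown with a small standalone example. The exception types here are hypothetical and are not taken from the Slang code base.

```cpp
#include <cassert>
#include <string>

struct BaseError
{
    virtual ~BaseError() = default;
    virtual std::string describe() const { return "base"; }
};

struct ParseError : BaseError
{
    std::string describe() const override { return "parse"; }
};

// Catching by value copies only the BaseError sub-object (slicing),
// so the override in ParseError is lost.
inline std::string catchByValue()
{
    std::string result;
    try { throw ParseError(); }
    catch (BaseError e) { result = e.describe(); }
    return result;
}

// Catching by const reference keeps the dynamic type intact.
inline std::string catchByReference()
{
    std::string result;
    try { throw ParseError(); }
    catch (const BaseError& e) { result = e.describe(); }
    return result;
}

// Usage: the two handlers observe different behavior for the same throw.
inline void demonstrateSlicing()
{
    assert(catchByValue() == "base");
    assert(catchByReference() == "parse");
}
```

Recent versions of gcc warn about the by-value handler (`-Wcatch-value`).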
|
|
* Fix issues in wave-mask/wave.slang tests.
WaveGetActiveMask -> WaveGetConvergedMask.
Update target-compatibility.md
* First pass at wave-intrinsics.md documentation.
Write up around WaveMaskSharedSync.
* Added more of the Wave intrinsics as WaveMask intrinsics.
Improvements to documentation around wave-intrinsics.
* Add the Wave intrinsics for SM6.5 for WaveMask
Expand WaveMask intrinsics
Improve WaveMask documentation
* Added WaveMaskIsFirstLane.
* Added WaveGetConvergedMask for glsl and hlsl.
Added wave-get-converged-mask.slang test.
* WaveGetActiveMask/Multi and WaveGetConvergedMask/Multi
* Improve Wave intrinsics docs.
Added WaveGetActiveMulti, WaveGetConvergedMulti, WaveGetActiveMask (for vk/hlsl).
* Enable GLSL WaveMultiPrefixBitAnd.
* Re-add definitions of f16tof32 and f32tof16 from #1326
* Remove multiple definition of f32tof16
Disable optix call to Ray trace test, if OPTIX not available.
* Improve wave intrinsics documentation - remove the __generic as part of definitions, small improvements.
* Change comment to try and trigger build.
|
|
There are two main pieces here.
First, we specialize the code generation for CUDA kernels to account for the way that shader parameters are passed differently for ordinary compute kernels vs. ray-tracing kernels. Both global and entry-point shader parameters in Slang are translated to kernel function parameters for CUDA compute kernels, while for OptiX ray tracing kernels we need to use a global `__constant__` variable for the global parameters, and the SBT data (accessed via an OptiX API function) for entry-point shader parameters.
This choice bakes in a few pieces of policy when it comes to how Slang ray-tracing shaders translate to OptiX:
* It fixes the name used for the global `__constant__` variable for global shader parameters to be `SLANG_globalParams`. Since that name has to be specified when creating a pipeline with the OptiX API, the choice of name effectively becomes an ABI contract for Slang's code generation.
* It fixes the choice that global parameters in Slang map to per-launch parameters in OptiX, and entry-point parameters in Slang map to SBT-backed parameters in OptiX. This is a reasonable policy, and it is also one that we are likely to need to codify for Vulkan as well, but it is always a bit unfortunate to bake policy choices like this into the compiler (especially when shaders compiled for D3D can often decouple the form of their HLSL/Slang code from how things are bound in the API).
The second piece is a lot of refactoring of the logic in `render-test/cuda/cuda-compute-util.cpp`, so that the logic for setting up (and reading back) the buffers of parameter data can be shared between the compute and ray-tracing paths. The result may not be a true global optimum for how the code is organized, but it at least serves the goal of not duplicating the parameter-binding logic between compute and ray-tracing.
|
|
The CUDA build of the render-test tool had been broken in a fixup change to #1307 (which was ostensibly adding features for the CUDA path). The fix is a simple one-liner.
|
|
* Initial work to support OptiX output for ray tracing shaders
This change represents in-progress work toward allowing Slang/HLSL ray-tracing shaders to be cross-compiled for execution on top of OptiX. The work as it exists here is incomplete, but the changes are incremental and should not disturb existing supported use cases.
One major unresolved issue in this work is that the OptiX SDK installer does not appear to set an environment variable, so the path to the SDK is currently hard-coded.
Changes include:
* Modified the premake script to support new options for adding OptiX to the build. Right now the default path to the OptiX SDK is hard-coded because the installer doesn't seem to set an environment variable. We will want to update that to have a reasonable default path for both Windows and Unix-y platforms in a later change.
* I ran the premake generator on the project since I added new options, which resulted in a bunch of diffs to the Visual Studio project files that are unrelated to this change. Many of the diffs come from previous edits that added files using only the Visual Studio IDE rather than by re-running premake, so it is arguably better to have the checked-in project files more accurately reflect the generated files used for CI builds.
* The "downstream compiler" abstraction was extended to have an explicit notion of the kind of pipeline that shaders are being compiled for (e.g., compute vs. rasterization vs. ray tracing). This option is used to tell the NVRTC case when it needs to include the OptiX SDK headers in the search path for shader compilation (and also when it should add a `#define` to make the prelude pull in OptiX). This code again uses a hard-coded default path for the OptiX SDK; we will need to modify that to have a better discovery approach and also to support an API or command-line override.
* One note for the future is that instead of passing down a "pipeline type" we could instead pass down the list/set of stages for the kernels being compiled, and the OptiX support could be enabled whenever there is *any* ray tracing entry point present in a module. That approach would allow mixing RT and compute kernels during downstream compilation. We will need to revisit these choices when we start supporting code generation for multiple entry points at a time.
* The CUDA emit logic is currently mostly unchanged. The biggest difference is that when emitting a ray-tracing entry point we prefix the name of the generated `__global__` function with a marker for its stage type, as required by the OptiX runtime (e.g., a `__raygen__` prefix is required on all ray-generation entry points).
* The `Renderer` abstraction had a bare minimum of changes made to be able to understand that ray-tracing pipelines exist, and also that some APIs will require the name of each entry point along with its binary data in order to create a program.
* The `ShaderCompileRequest` type was updated so that only a single "source" is supported (rather than distinct source for each entry point), and also the entry points have been turned into a single list where each entry identifies its stage instead of a fixed list of fields for the supported entry-point types.
* The CUDA compute path had a lot of code added to support execution for the new ray-tracing pipeline type. The logic is mostly derived from the `optixHello` example in the OptiX SDK, and at present only supports running a single ray-generation shader with no parameters. The code here is not intended to be ready for use, but represents a significant amount of learning-by-doing.
* The `slang-support.cpp` file in `render-test` was updated so that instead of having separate compilation logic for compute vs. rasterization shaders (which would mean adding a third path for ray tracing), there is now a single flow to the code that works for all pipeline types and any kind of entry points.
* Implicit in the new code is dropping support for the way GLSL was being compiled for pass-through render tests, which means pass-through GLSL render tests will no longer work. It seems like we didn't have any of those to begin with, though, so it is no great loss.
* Also implicit are some new invariants about how shaders without known/default entry points need to be handled. For example, the ray tracing case intentionally does not fill in entry points on the `ShaderCompileRequest` and instead fully relies on the Slang compiler's support for discovering and enumerating entry points via reflection. As a consequence of those edits the `-no-default-entry-point` flag on `render-test` is probably not working, but it seems like we don't have any test cases that use that flag anyway.
Given the seemingly breaking changes in those last two bullets, I was surprised to find that all our current tests seem to pass with this change. If there are things that I'm missing, I hope they will come up in review.
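The stage-prefix rule for OptiX entry points mentioned in the list above can be sketched as follows. This is an illustrative helper, not the code in Slang's CUDA emitter: the `RayTracingStage` enum and both function names are hypothetical, and only a subset of OptiX program kinds is covered.

```cpp
#include <cassert>
#include <string>

// Hypothetical stage enum for illustration; Slang's own stage type differs.
enum class RayTracingStage { RayGeneration, Miss, ClosestHit, AnyHit, Intersection };

// Returns the name prefix that OptiX expects on a program of the given kind.
inline const char* getOptiXEntryPointPrefix(RayTracingStage stage)
{
    switch (stage)
    {
    case RayTracingStage::RayGeneration: return "__raygen__";
    case RayTracingStage::Miss:          return "__miss__";
    case RayTracingStage::ClosestHit:    return "__closesthit__";
    case RayTracingStage::AnyHit:        return "__anyhit__";
    case RayTracingStage::Intersection:  return "__intersection__";
    }
    return "";
}

// Builds the name of the emitted function from the user's entry-point name.
inline std::string getOptiXEntryPointName(RayTracingStage stage, const std::string& name)
{
    assert(!name.empty());
    return std::string(getOptiXEntryPointPrefix(stage)) + name;
}
```

For example, a ray-generation entry point called `main` would be emitted as `__raygen__main`, which is also the name the application has to pass when creating the OptiX program group.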
* fixup: issues from review and CI
* Some issues noted during the review process (e.g., a missing `break`)
* Fix logic for render tests with `-no-default-entry-point`. I had somehow missed that we had tests reliant on that flag. This required a bit of refactoring to pass down the relevant flag (luckily the function in question was already being passed most of what was in `Options`, so just passing that in directly actually simplifies the call sites a bit).
* There was a missing line of code to actually add the default compute entry points to the compile request. I think this was a problem that slipped in as part of some pre-PR refactoring/cleanup changes that I failed to re-test.
|
|
* render feature for CUDA compute model.
* Use SemanticVersion type.
* Enable CUDA wave tests that require CUDA SM 7.0.
Provide mechanism for DownstreamCompiler to specify version numbers.
* Enabled wave-equality.slang
* Make the CUDA SM major version not limited to a single digit.
* Fix assert.
* DownstreamCompiler::Version -> CapabilityVersion
|
|
* Better diagnostics on failure on CUDA.
* Catch exceptions in render-test
* * Added ability to disable reporting on CUDA failures
* Stopped using exception for reporting (just write to StdWriter::out())
* Removed CUDAResult type
* Don't set arch type on nvrtc to see if fixes CI issues.
* Try compute_30 on CUDA.
* Added ability to IGNORE_ a test
Disabled rw-texture-simple and texture-get-dimensions
* Disable tests that require CUDA SM7.0
Use DISABLE_ prefix to disable tests.
* Disable signalUnexpectedError doing printf.
|
|
* Added CPU support for GetDimensions on C++/CPU target.
Added texture-get-dimension.slang test
* Fix some typos.
* Update CUDA docs.
* Fix output of GetDimensions on glsl when has an array.
Disabled VK - because VK renderer doesn't support createTextureView
* Fix typo.
* Fix typo.
* Fix bad-operator-call diagnostics output.
|
|
|
|
* Added FloatTextureData as a mechanism to enable CPU based Texture writes.
* Add [] RWTexture access for CPU.
* Fixed rw-texture-simple.slang.expected.txt
* WIP: CUDA stdlib has support for [] surface access.
* Made IRWTexture class able to take different locations.
Doing a Texture2d access on CUDA works.
* Fix bug in outputting UniformState - was missing out padding.
Support RWTexture with array. Support RWTexture3D.
* Use * for locations for read only textures, so only need an ITexture interface.
* Fix problem around application of set/get for CUDA on subscript Texture types.
|
|
* CUDA support for array of resources.
* * Add support for Texture2DArray on CPU
* Expand texture-simple.slang to test Texture2DArray
* Reorganise CUDAComputeUtil to split out createTextureResource.
* Add TextureCubeArray support for CPU/CUDA targets.
* Pulled out CUDAResource
Renamed derived classes to reflect that change.
* Creation of SurfObject type.
* Functions to return read/write access for simplifying future additions.
* WIP for RWTexture access on CPU/CUDA.
* CUsurfObject cannot have mips.
* Ability to set number of mips on test data.
Preliminary support for CUsurfObj and RWTexture1D on CUDA.
CUDA docs improvements.
* Fix typo.
|
|
* CUDA support for array of resources.
* * Add support for Texture2DArray on CPU
* Expand texture-simple.slang to test Texture2DArray
* Reorganise CUDAComputeUtil to split out createTextureResource.
* Add TextureCubeArray support for CPU/CUDA targets.
|
|
* Added support for Targets to TypeTextUtil.
* Made Function names 'get' and 'find' instead of 'as' in TypeTextUtil.
|
|
* Add cubemap support.
* Add CUDA fence intrinsics.
* Added Gather for CUDA.
* Use the CUDA driver API as much as possible.
* * Support 1D texture on CPU
* WIP on 1D texture on CUDA
* Added simplified texture test
* Fix test.
* Improve texture-simple tests.
* * Add CPU support for 3d textures
* Add support for mip maps to CUDA
* Disable warnings in nvrtc
* Update CUDA docs
* WIP on 3d texture support.
* Add support for 3d textures for CPU and CUDA.
* CPU and CUDA support for cube maps.
* Add CPU support for Texture1DArray.
* Support CUDA Layered/Array type in meta library.
|
|
* Add cubemap support.
* Add CUDA fence intrinsics.
* Added Gather for CUDA.
* Use the CUDA driver API as much as possible.
* * Support 1D texture on CPU
* WIP on 1D texture on CUDA
* Added simplified texture test
* Fix test.
* Improve texture-simple tests.
* * Add CPU support for 3d textures
* Add support for mip maps to CUDA
* Disable warnings in nvrtc
* Update CUDA docs
* WIP on 3d texture support.
* Add support for 3d textures for CPU and CUDA.
|
|
* Add cubemap support.
* Add CUDA fence intrinsics.
* Added Gather for CUDA.
* Use the CUDA driver API as much as possible.
* * Support 1D texture on CPU
* WIP on 1D texture on CUDA
* Added simplified texture test
* Fix test.
* Improve texture-simple tests.
Co-authored-by: Tim Foley <tfoleyNV@users.noreply.github.com>
|
|
* Launch CUDA test taking into account dispatch size.
* Enable isCPUOnly hack to work on CUDA.
* Rename 'isCPUOnly' hack to 'onlyCPULikeBinding'.
* Add $T special type.
Support SampleLevel on CUDA.
* Fix typo.
|