| Commit message (Collapse) | Author | Age |
| ... | |
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The Big Picture
===============
Given input Slang code like:
```hlsl
Texture2D gA;
[shader("compute")]
void kernelFunc(uniform Texture2D b, uint3 tid : SV_DispatchThreadID)
{ ... }
```
the existing CUDA code generation strategy would always generate a kernel with a signature like:
```c++
struct GlobalParams { Texture2D gA; }
struct EntryPointParams { Texture2D b; }
extern "C" __global__
void kernelFunc(EntryPointParams* entryPointParams, GlobalParams* globalParams)
{ ... }
```
This choice was consistent with the conventions of the CPU kernel target, and shares the advantage that it is easy for the user to data-drive the logic for filling in parameters and then invoking a kernel.
However, the approach outlined above has two serious problems when used for CUDA kernels:
* First, it defies the programmer's expectation about what an "equivalent" CUDA kernel signature would be, which makes it awkward for a developer to invoke this kernel from CUDA C++ host code (especially in the context of an app that might also run hand-written CUDA kernels).
* Second, the performance of this approach suffers because every access to a global or entry point parameter turns into a load from global memory. In contrast, a typical hand-written CUDA kernel passes its parameters via an implementation-specific path that (for current CUDA platforms) seems to be equivalent to `__constant__` memory in performance.
This change alters the convention so that the Slang compiler takes the code from the top of this message and translates it into something like:
```c++
struct GlobalParams { Texture2D gA; }
__constant__ GlobalParams SLANG_globalParams;
extern "C" __global__
void kernelFunc( Texture2D b )
{ ... }
```
This translation alleviates both problems with the current translation:
* The signature of the generated CUDA kernel function is as close to that of the original as is possible (we had to eliminate the `SV_*`-semantic varying inputs), and should directly match what the programmer would expect in common cases.
* Entry-point parameters are passed via CUDA kernel parameters, and should thus match in performance. Global parameters are passed via a variable in `__constant__` memory, and thus should also perform as well as possible/expected.
Detailed Changes
================
* Disable the `collectEntryPointUniformParams` pass for CUDA, so that entry-point `uniform` parameters are *not* bundles into a single `struct` and/or `ConstantBuffer`.
* When targeting CUDA, disable the logic for generating an entry-point parameter for passing in the global shader parameter(s)
* Allow `CLikeSourceEmitter` subclasses to override the name generated for entry-point symbols, and use this to add the required prefix for each OptiX kernel type when translating a ray-tracing kernel.
* Add logic to emit "parameter groups" in a specialized way for CUDA (this is the same approach that allows us to generate `cbufffer { ... }` declarations for fxc). A global-scope parameter group will turn into a global `__constant__` variable called `SLANG_globalParams` (that name becomes part of the ABI for Slang-compiled shaders).
* Update the logic in `render-test` for loading and invoking CUDA kernels to handle the new policy.
The last bullet there merits expansion, since it is indicative of the work a client using Slang would have to go through to use our generated kernels with the new policy:
* When loading a CUDA module with one or more kernels, we also use `cuModuleGetGlobal` to query the address of the `SLANG_globalParams` symbol in that CUDA module. That pointer needs to be used when setting global parameter values to be used by kernels in that CUDA odule.
* Because our existing `BindPoint` logic for CUDA always sets up parameter data in GPU memory, we end up having to copy the entry-point parameter data from GPU memory to host memory. This step would ideally be skipped in a codebase that understands the correct policy, but it is a bit unfortunate that it is no longer trivially correct for an application to store all parameter data in GPU memory.
* Before invoking the kernel, we need to use a `cudaMemcpyAsync` to copy from the prepared GPU memory for global parameters over to the `SLANG_globalParams` symbol associated with the kernel to be invoked. Because this operations is issued on the same CUDA stream as the kernel call, it is guaranteed to not overlap with GPU kernel execution.
* When invoking the kernel, we take advantage of the seldom-used `CU_LAUNCH_PARAM_BUFFER_POINTER` facility to specify a contiguous memory region with all the entry-point parameters in it instead of passing each entry-point parameter separately. Given Slang reflection it is also possible to query the offset of each entry-point parameter in the buffer, so we could invoke the kernel in the traditional fashion as well. The choice here is up to the application.
Caveats
=======
* This is a breaking change, and any subsequent release will need to reflect that fact. Any customers who rely on Slang's current CUDA codegen strategy are likely to be surprised by this change, and I don't see an easy way to give them a more gentle transition.
* This change does *not* remove the logic that introduces a `KernelContext` type for code that requires it. That means that things like `static` global variables can continue to work on CUDA for now, but we know that those are not going to be something we can support in the long-term with separate compilation.
* While the policy implemented in this change is a reasonable default, it is still not going to perfectly match expecations for some developers. In particular, some developers who are familiar with both D3D and CUDA will likely wonder why a global `cbuffer` in Slang translates to a global-memory pointer in the output CUDA instead of one global `__constant__` variable per `cbuffer`. A more detailed alternate translation would generate a distinct global `__constant__` variable for each top-level constant buffer or parameter block. We may need to refine the translation even more based on feedback from users who care about how we handle global-scope parameters.
* Recent changes in Slang have broken the logic that handles the OptiX "shader record" as an alternative mechanism for passing entry-point parameters. In order to get any level of OptiX support up and running we will have to change the IR passes that run on CUDA kernels to actually run the "collection" of `uniform` parameters for ray tracing stages, and then to replace references to the resulting parameter with a call to the function to access the shader record.
* The use of `SLANG_globalParams` here works well enough in the case of whole-program compilation; every `CUmodule` ends up with (zero or) one parameter with this name, and an application can just hard-code it. As a mechanism it wouldn't work in the presence of separately-compiled modules that might introduce their own global parameters (including cases like constant lookup tables that really want to be at the global scope). An alternative approach would have Slang generate output PTX for each module, where a module has an optional global symbol for its own global-scope parameters (with a mangled name that is based on the module name), and then a linked CUDA binary has all of those distinct symbols. Such an approach would be compatible with module-at-a-time reflection and parameter binding, but would lead to another breaking change down the line for code that switches to `SLANG_globalParams`.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
* Remove KernelContext wrapper from CPU/CUDA emit
Currently, the CPU and CUDA C++ targets rely on a `KernelContext` type that is generated during emit, as a way to provide implicit access to things that were global in the input Slang code, but that can't actually be emitted as globals in the target language (because the semantics of global declarations differ).
For example, input like:
```hlsl
ConstantBuffer<Stuff> gStuff; // shader parameter
groupshared int gData[1024]; // thread-group shared variable
static int gCounter = 0; // "thread-local" global-scope variable
void subroutine() { ... }
[shader("compute")] void computeMain() { ... }
```
would translate to output C++ for CPU a bit like:
```c++
struct KernelContext
{
ConstantBuffer<Stuff> gStuff;
int gData[1024];
int gCounter = 0;
void subroutine() { ... }
void computeMain() { ... }
};
```
Note that both `computeMain()` and `subroutine()` are non-`static` members functions on `KernelContext`, so they have an implicit `this` parameter of type `KernelContext`, which allows the bodies of those functions to implicitly reference `gStuff`, etc. by name in their bodies.
Because `KernelContext::computeMain()` is a member function, we end up emitting an additional global-scope function to expose the entry point to the outside world, and that function is responsible for declaring a local `KernelContext` and invoking the generated entry point on it.
This approach has several important drawbacks:
* It complicates the emit logic for CPU and CUDA, with many special cases around when/how things get emitted
* It complicates the implementation of dynamic dispatch, because what seems like a function pointer in Slang IR needs to be a pointer-to-member-function in C++.
* It makes it difficult to have a non-kernel-oriented mode of compilation for CPU where a Slang function with a given signature gets output as a C++ CPU function with the "same" signature (not wrapped up as a member function of `KernelContext`.
This change makes a step toward addressing these issues by making the introducing of the `KernelContext` type be something that is done in an explicit IR pass instead of being handled as part of the last-mile emit logic.
The most important change is the removal of code related to `KernelContext` from the `slang-emit-{cpp,cuda}.{h,cpp}` files, with the equivalent logic instead being handled in a new pass in `slang-ir-explicit-global-context.{h,cpp}`. It should be noted that further cleanups to the emit logic should now be possible; in particular, both the CPU and CUDA emit paths are manually sequencing the `EmitAction`s instead of relying on the default logic, but at this point they should be able to just use the default. The additional cleanups are left for future work.
The explicit IR pass does more or less what one would expect: it identifies global-scope entities (global variables and parameters) that need to be wrapped and turns them into fields of a `KernelContext` type. It then modifies all entry points to initialize a `KernelContext` as part of their startup. Finally, any code that used to refer to the global entities is changed to refer to a field of the context, with the context passed via new function parameters (the new parameter is only added to functions that need it for now).
Transforming global variables into fields of a `KernelContext` type in the IR pass ends up dropping their initial-value expressions (since those were attached as basic blocks on the `IRGlobalVar`). To avoid breaking code that relies on global-scope (but thread-local) variables, this change also adds an explicit pass that takes the initialization logic on all global variables and moves it to explicit logic that runs at the start of every entry point in a linked module (`slang-ir-explicit-global-init.{h,cpp}`). This pass would also be useful when we get back to direct SPIR-V emit, since SPIR-V also requires initialization logic for globals to be emitted into entry points.
One complication that arises when the IR is introducing the types for entry-point parameters, global-scope parameters, and the `KernelContext` type is that it becomes harder for the emit logic to utter the names of those types (they might not even have names, since `IRNameHint`s might get stripped). This created a problem since the wrapper operations that were being generated for CPU were taking `void*` parameters and casting them to the appropriate type. To work around this issue, we have added an explicit IR pass (`slang-ir-entry-point-raw-ptr-params.{h,cpp}`) that transforms the signature of entry points so that any pointer parameters instead become raw pointer (`void*`) parameters, with the casting being handled inside the entry point itself.
One consequence of all the above changes is that for the CUDA target we no longer need a wrapper function to invoke the generated entry point any more, because the IR function for the entry point ends up having the correct/expected signature already. This is also the case for CPU when it comes to the `*_Thread` wrapper function, but this change doesn't try to eliminate the wrapper because of a belief that the `*_Thread`-level interface is going away anyway.
Because the IR is now responsible for ensuring the signature of the IR entry point for CUDA and CPU is what is expected, I needed to modify the `slang-ir-entry-point-uniforms` pass to always create an explicit parameter for the entry point uniforms when compiling for CUDA/CPU, even if there were no `uniform` parameters on the entry point as written. This also ended up requiring some tweaks to the parameter layout logic to ensure that CPU/CUDA targets always treat `ConstantBuffer<T>` as a `T*` even in the case where `T` is an empty `struct` type (which happens when we construct a `struct` type to represent the uniform parameters of an entry point with no uniform parameters...).
There are several future changes that can/should build on this work:
* We should change the generated signatures for CUDA kernels, so that they don't rely on `KernelContext` for global-scope parameters. At that point we can avoid generating a `KernelContext` at all for CUDA, except when a program uses global-scope thread-local variables.
* We should figure out how to make the "ABI" for dynamic-dispatch calls ensure that the kernel context is either always passed, or always *not* passed. Making a hard-and-fast rule as part of the calling convention for dynamic calls would ensure that they access through the context continues to work with dynamic calls (this change might break it in some cases).
* We should figure out how to handle the layout for the `KernelContext` in cases where a program is composed of multiple separately-compiled modules. Right now the layout of the `KernelContext` requires global knowledge (as does the pass that introduces explicit initialization for global-scope thread-locals).
* We should try to further clean up the CPU/CUDA C++ emit logic to fall back on the default emit behavior more, now that the various special-case approaches that were taken are no longer needed
* fixup: restore build files to default configuration
|
| |
|
|
|
|
|
|
|
|
|
|
|
| |
The main change here is that the CPU and CUDA C++ emit paths now rely on an earlier IR pass to legalize the varying parameter list of a kernel and translate references to varying parameters with semantics like `SV_DispatchThreadID`. Doing so removes a lot of special-case logic from the emit passes.
This work moves us even closer to being able to eliminate `KernelContext` from the CPU/CUDA emit logic, because it removes the issue of state related to varying inputs being stored in `KernelContext`.
The new pass that handles the legalization is in `slang-ir-legalize-varying-params.cpp`, and it borrows heavily from the existing `slang-ir-glsl-legalize.cpp` pass. The new pass factors out the target-independent and target-dependent logic, so that both CPU and CUDA can share much of the same code despite having very different rules for how the system-value parameters are being provided.
An eventual goal is to have the new pass also handle the GLSL case, but doing so requires copying even more logic out of the GLSL-specific pass, and doing so seemed like a step to far for what was meant to be a stepping-stone change as part of other work. As a result of the incomplete nature of the pass, certain cases don't work for compute shader inputs for CPU/CUDA (e.g., wrapping your varying inputs in a `struct` type parameter), but those were cases that also didn't work in the existing `emit`-based logic.
One major consequence of this change is that the logic for emitting the various different functions that represent an entry point for our CPU back-end has been streamlined and simplified. The original logic had a fair bit of cleverness built in to try and avoid unnecessary math ops when computing the various IDs/indices, while the new logic is much more simplistic (the main dispatch function loops over threadgroups with a triply-nested `for` and then delegates to the group-level function loops over threads with its own nested `for`s).
Longer term, it will be important to simplify the CPU functions we emit further, by eliminating things like the `_Thread` function that should never really be exposed to users (the minimum granularity of invoking a CPU compute kernel should be a single threadgroup). We may eventually decide to synthesize all of the extra code that is being generated in the `emit` pass as IR instead.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
* Adding support for global uniform shader parameters
This change adds support for Slang programmers to declare shader parameters of "ordinary" types at global scope:
```hlsl
uniform float gScaleFactor;
void main() { ... *= gScaleFactor; ... }
```
The generated HLSL/GLSL/DXIL/SPIR-V output will be something along the lines of:
```hlsl
struct GlobalParams
{
float gScaleFactor;
}
cbuffer globalParams
{
GlobalParams globalParams;
}
void main() { ... *= globalParams.gScaleFactor; ... }
```
The binding information used for the implicit `globalParams` constant buffer will be determined by the existing implicit parameter binding logic (which already had support for this kind of transformation).
The reason this change is being pursued right now is because it is one step toward removing the implicit `KernelContext` type that is used to wrap the generated code for our CPU and CUDA C++ targets. Handling global-scope parameters of ordinary type requires an IR pass that synthesizes the `GlobalParams` structure type above, and that step ends up removing the need for the similar `UniformState` structure that was being used in the CPU/CUDA emit logic.
A more detailed guide to the changes included follows:
* The diagnostic for a global-scope variable that is implicitly a shader parameter was kept, but changed to a warning. Users can opt out of the warning by decorating their parameter as a `uniform` (since that keyword is already being used to mark entry-point parameters that should be treated as uniform shader parameters).
* To simplify the task of finding the global shader parameters, the `CLikeSourceEmitter` type has been given an `m_irModule` member. The previous emit logic for `UniformState` was having to do a roundabout solution involving the `EmitAction`s to deal with not having direct access to the module.
* Removed a few dead declarations in the emit logic (related to a much earlier point where emit was based on the AST instead of the IR).
* Made the computation of type names in C++ emit take into account `ConstantBuffer<T>` and `ParameterBlock<T>`. As far as I can tell, these were being handled with some special-case hacks in the emit logic instead of being supported more fundamentally. It might actually be good to pass these through as `ConstantBuffer<T>` and `ParameterBlock<T>` in the C++ output, and allow the prelude to customize their translation (defaulting to defining them as `T*`).
* Removed the special-case C++ emit logic for references to global shader parameters. There are now at most two global shader parameters to deal with, and the default emit logic (referring to them by name) does the Right Thing.
* Changed the handling of entry points for C++ (both CPU and CUDA) so that it handles the bundled-up shader paameters for the global and entry-point scopes the same way. The main complication here is OptiX, where parameter data is passed very differently than it is for CUDA compute kernels.
* Reverted changes to `ir-entry-point-uniforms` that had made its logic depend on the compilation target. The parameter binding logic was already responsible for deciding if a given target needed to wrap up its entry-point parameters in a constant buffer, and the IR pass was respecting that layout information. The current workaround had been removing the `ConstantBuffer<T>` indirection from this IR pass for CPU/CUDA, but then reintroducing the same indirection later on in the emit step.
* Added an explicit IR pass with the task of collecting global-scope parameters of uniform/ordinary type and packaging them up into a `struct`, and then optionally packaging that `struct` up in a constant buffer. This pass bases its decisions on the IR layout information that was already computed, so it should match whatever policy choices were made at the layout level.
* Changed the "key" operand on IR `struct` layout information to not assume an `IRStructKey`. The problem here is that the global scope gets a `StructTypeLayout` to represent its members, and this is convenient (rather than having to always special-case logic that handles the global scope), but the "fields" of that struct are global variables which do not have `IRStructKey`s associated with them. The simplest solution is to use the variables themselves as the keys, which required removing the assumption in the IR encoding.
* Updated the IR layout process to compute a layout for the global scope of an entire program, and to attach that to the `IRModule` via a decoration. Updated the IR linking process to carry through that decoration to the linked output. This is necessary so that the IR pass that transforms global parameters can access the global-scope layout information.
An important concern with this approach is that the contents and layout of the monolithic `GlobalParams` structure depends on the exact set of modules that were linked (and the order in which they were specified, in some cases). This isn't really a new thing with this change, but it becomes more important as we start to think of how to generalize things to better support separate compilation and linking.
There are changes that can (and should) be made to the way that IR layouts are computed for programs (e.g., so that we compute layout per-module and then combine them rather than as a whole-program step). In this case, the problem of forming the combined/linked global layout can be moved down the IR level and not be reliant on AST-level information.
Just changing the way layout and linking interact would not change the fundamental problem that global shader parameters as they currently exist in Slang/HLSL/GLSL are not readily compatible with true separate compilation. We either need to find a solution strategy that we can apply to allow existing shaders to work with separate compilation *or* we need to incrementally work toward removing support for global-scope shader parameters in favor of explicit entry-point parameters in all cases.
* fixup: missing files
* fixup: comment the new code
|
| | |
|
| |
|
|
|
|
| |
* * Fix output in slang repro command line
* Profile uses lowerCamel method names (had mix of upper and lower)
* Rename slang-serialize-state/SerializeStateUtil to slang-repro and ReproUtil.
|
| |
|
|
|
|
|
|
|
|
|
|
|
| |
* * Remove UniformState and UniformEntryPointParams types
* Put all output C++ source in an anonymous namespace
* If SLANG_PRELUDE_NAMESPACE is set, make what it defines available in generated file.
* Fix signature issue in performance-profile.slang
* Context -> KernelContext to avoid ambiguity.
* Fix issues around dynamic dispatch and anonymous namespace.
* Fix typo.
|
| |
|
|
|
|
|
|
|
|
|
| |
* Fixes for active mask synthesis + tests
There are two fixes here:
* The code generation that follows active mask synthesis was requiring CUDA SM architecture version 7.0 for one of the introduced instructions, but not all of them. This change centralizes the handling of upgrading the required CUDA SM architecture version, and makes sure that the instructions introduced by active mask synthesis request version 7.0.
* The tests for active mask synthesis were not flagged as requiring the `cuda_sm_7_0` feature when invoking `render-test-tool`, which meant they run but produce unexpected results when invoked on a GPU without the required semantics for functions like `__ballot_sync()`. This change adds the missing `-render-feature cuda_sm_7_0` to those tests.
* fixup: mark more tests that rely on implicit active mask
|
| |
|
|
|
|
|
|
|
| |
* Small improvements to documentation and code around DiagnosticSink
* Made methods/functions in slang-syntax.h be lowerCamel
Removed some commented out source (was placed elsewhere in code)
* Making AST related methods and function lowerCamel.
Made IsLeftValue -> isLeftValue.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
* Synthesize "active mask" for CUDA
The Big Picture
===============
The most important change here is to `hlsl.meta.slang`, where the declaration of `WaveGetActiveMask()` is changed so that instead of mapping to `__activemask()` on CUDA (which is semantically incorrect) it maps to a dedicated IR instruction.
The other `WaveActive*()` intrinsics that make use of the implicit "active mask" concept had already been changed in #1336 so that they explicitly translate to call the equivalent `WaveMask*()` intrinsic with the result of `WaveGetActiveMask()`. As a result, all of the `WaveActive*()` functions are now no different from a user-defined function that uses `WaveGetActiveMask()`.
The bulk of the work in this change goes into an IR pass to replace the new instruction for getting the active mask gets replaced with appropriately computed values before we generate output CUDA code. That work is in `slang-ir-synthesize-active-mask.{h,cpp}`.
Utilities
=========
There are a few pieces of code that were helpful in writing the main pass but that can be explained separately:
* IR instructions were added corresponding to the Slang `WaveMaskBallot()` and `WaveMaskMatch()` functions, which map to the CUDA `__ballot_sync()` and `__match_any_sync()` operations, respectively. These are only implemented for the CUDA target because they are only being generated as part of our CUDA-only pass.
* The `IRDominatorTree` type was updated to make it a bit more robust in the presence of unreachable blocks in the CFG. It is possible that the same ends could be achieved more efficiently by folding the corner cases into the main logic, but I went ahead and made things very explicit for now.
* I added an `IREdge` utility type to better encapsulate the way that certain code operating on the predecessors/successors of an `IRBlock` were using an `IRUse*` to represent a control-flow edge. The `IREdge` type makes the logic of those operations more explicit. A future change should proably change it so that `IRBlock::getPredecessors()` and `getSuccessors()` are instead `getIncomingEdges()` and `getOutgoingEdges()` and work as iterators over `IREdge` values, given the way that the predecessor and successor lists today can contain duplicates.
* Using the above `IREdge` type, the logic for detecting and break critical edges was broken down into something that is a bit more clear (I hope), and that also factors out the breaking of an edge (by inserting a block along it) into a reusable subroutine.
The Main Pass
=============
The implementation of the new pass is in `slang-ir-synthesize-active-mask.cpp`, and that file attempts to include enough comments to make the logic clear. A brief summary for the benefit of the commit history:
* The first order of business is to identify functions that need to have the active mask value piped into them, and to add an additional parameter to them so that the active mask is passed down explicitly. Call sites are adjusted to pass down the active mask which can then result in new functions being identified as needing the active mask.
* The next challenge is for a function that uses the active mask, to compute the active mask value to use in each basic block. The entry block can easily use the active mask value that was passed in, while other blocks need more work.
* When doing a conditional branch, we can compute the new mask for the block we branch to as a function of the existing mask and the branch condition. E.g., the value `WaveMaskBallot(existingMask, condition)` can be used as the mask for the "then" block of an `if` statement.
* When control flow paths need to "reconverge" at a point after a structured control-flow statement, we need to insert logic to synchronize and re-build the mask that will execute after the statement, while also excluding any lanes/threads that exited the statement in other ways (e.g., an early `return` from the function).
The explanation here is fairly hand-wavy, but the actual pass uses much more crisp definitions, so the code itself should be inspected if you care about the details.
Tests
=====
The tests for the new feature are all under `tests/hlsl-intrinsic/active-mask/`. Most of them stress a single control-flow construct (`if`, `switch`, or loop) and write out the value of `WaveGetActiveMask()` at various points in the code.
In practice, our definition of the active mask doesn't always agree with what D3D/Vulkan implementations seem to produce in practice, and as a result a certain amount of effort has gone into adding tweaks to the tests that force them to produce the expected output on existing graphics APIs. These tweaks usually amount to introducing conditional branches that aren't actually conditional in practice (the branch condition is always `true` or always `false` at runtime), in order to trick some simplistic analysis approaches that downstream compilers seem to employ.
One test case currently fails on our CUDA target (`switch-trivial-fallthrough.slang`) and has been disabled. This is an expected failure, because making it produce the expected value requires a bit of detailed/careful coding that would add a lot of additional complexity to this change. It seemed better to leave that as future work.
Future Work
===========
* As discussed under "Tests" above, the handling of simple `switch` statements in the current pass is incomplete.
* There's an entire can of worms to be dealt with around the handling of fall-through for `switch`.
* The current work also doesn't handle `discard` statements, which is unimportant right now (CUDA doesn't have fragment shaders), but might matter if we decide to synthesize masks for other targets. Similar work would probably be needed if we ever have `throw` or other non-local control flow that crosses function boundaries.
* An important optimization opportunity is being left on the floor in this change. When block that comes "after" a structured control-flow region (which is encoded explicitly in Slang IR and SPIR-V) post-dominates the entry block of the region, then we know that the active mask when exiting the region must be the same as the mask when entering the region, and there is no need to insert explicit code to cause "re-convergence." This should be addressed in a follow-on change once we add code to Slang for computing a post-dominator tree from a function CFG.
* Related to the above, the decision-making around whether a basic block "needs" the active mask is perhaps too conservative, since it decides that any block that precedes one needing the active mask also needs it. This isn't true in cases where the active mask for a merge block can be inferred by post-dominance (as described above), so that the blocks that branch to it don't need to compute an active mask at all.
* If/when we extend the CPU target to support these operations (along with SIMD code generation, I assume), we will also need to synthesize an active mask on that platform, but the approach taken here (which pretty much relies on support for CUDA "cooperative groups") wouldn't seem to apply in the SIMD case.
* Similarly, the approach taken to computing the active mask here requires a new enough CUDA SM architecture version to support explicit cooperative groups. If we want to run on older CUDA-supporting architectures, we will need a new and potentially very different strategy.
* Because the new pass here changes the signature of functions that require the active mask (and not those that don't), it creates possible problems for generating code that uses dynamic dispatch (via function pointers). In principle, we need to know at a call site whether or not the callee uses the active mask. There are multiple possible solutions to this problem, and they'd need to be worked through before we can make the implicit active mask and dynamic dispatch be mutually compatible.
* Related to changing function signatures: no effort is made in this pass to clean up the IR type of the functions it modifies, so there could technically be mismatches between the IR type of a function and its actual signature. If/when this causes problems for downstream passes we probably need to do some cleanup.
* fixup: backslash-escaped lines
I did some "ASCII art" sorts of diagrams to explain cases in the CFG, and some of those diagrams used backslash (`\`) characters as the last character on the line, causing them to count as escaped newlines for C/C++.
The gcc compiler apparently balked at those lines, since they made some of the single-line comments into multi-line comments.
I solved the problem by adding a terminating column of `|` characters at the end of each line that was part of an ASCII art diagram.
* fixup: typos
Co-authored-by: jsmall-nvidia <jsmall@nvidia.com>
|
| |
|
|
|
|
|
|
|
|
|
|
|
| |
There are two main pieces here.
First, we specialize the code generaiton for CUDA kernels to account for the way that shader parameters are passed differently for ordinary compute kernels vs. ray-tracing kernels. Both global and entry-point shader parameters in Slang are translated to kernel function parameters for CUDA compute kernels, while for OptiX ray tracing kernels we need to use a global `__constant__` variable for the global parameters, and the SBT data (accessed via an OptiX API function) for entry-point shader parameters.
This choice bakes in a few pieces of policy when it comes to how Slang ray-tracing shaders translate to OptiX:
* It fixes the name used for the global `__constant__` variable for global shader parameters to be `SLANG_globalParams`. Since that name has to be specified when creating a pipeline with the OptiX API, the choice of name effectively becomes an ABI contract for Slang's code generation.
* It fixes the choice that global parameters in Slang map to per-launch parameters in OptiX, and entry-point parameters in Slang map to SBT-backed parameters in OptiX. This is a reasonable policy, and it is also one that we are likely to need to codify for Vulkan as well, but it is always a bit unfortunate to bake policy choices like this into the compiler (especially when shaders compiled for D3D can often decouple the form of their HLSL/Slang code from how things are bound in the API).
The second piece is a lot of refactoring of the logic in `render-test/cuda/cuda-compute-util.cpp`, so that the logic for setting up (and reading back) the buffers of parameter data can be shared between the compute and ray-tracing paths. The result may not be a true global optimum for how the code is organized, but it at least serves the goal of not duplicating the parameter-binding logic between compute and ray-tracing.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
* Fix CUDA output of a static const array if values are all literals.
* Fix bug in Convert definition.
* Output makeArray such that is deconstructed on CUDA to fill in based on what the target type is. Tries to expand such that there are no function calls so that static const global scope definitions work.
* Fix unbounded-array-of-array-syntax.slang to work correctly on CUDA.
* Remove tabs.
* Check works with static const vector/matrix.
* Fix typo in type comparison.
* Shorten _areEquivalent test.
* Rename _emitInitializerList. Some small comment fixes.
Co-authored-by: Tim Foley <tfoleyNV@users.noreply.github.com>
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
* Initial work to support OptiX output for ray tracing shaders
This change represents in-progress work toward allowing Slang/HLSL ray-tracing shaders to be cross-compiled for execution on top of OptiX. The work as it exists here is incomplete, but the changes are incremental and should not disturb existing supported use cases.
One major unresolved issue in this work is that the OptiX SDK does not appear to set an environment variable
Changes include:
* Modified the premake script to support new options for adding OptiX to the build. Right now the default path to the OptiX SDK is hard-coded because the installer doesn't seem to set an environment variable. We will want to update that to have a reasonable default path for both Windows and Unix-y platforms in a later chance.
* I ran the premake generator on the project since I added new options, which resulted in a bunch of diffs to the Visual Studio project files that are unrelated to this change. Many of the diffs come from previous edits that added files using only the Visual Studio IDE rather than by re-running premake, so it is arguably better to have the checked-in project files more accurately reflect the generated files used for CI builds.
* The "downstream compiler" abstraction was extended to have an explicit notion of the kind of pipeline that shaders are being compiled for (e.g., compute vs. rasterization vs. ray tracing). This option is used to tell the NVRTC case when it needs to include the OptiX SDK headers in the search path for shader compilation (and also when it should add a `#define` to make the prelude pull in OptiX). This code again uses a hard-coded default path for the OptiX SDK; we will need to modify that to have a better discovery approach and also to support an API or command-line override.
* One note for the future is that instead of passing down a "pipeline type" we could instead pass down the list/set of stages for the kernels being compiled, and the OptiX support could be enabled whenever there is *any* ray tracing entry point present in a module. That approach would allow mixing RT and compute kernels during downstream compilation. We will need to revisit these choices when we start supporting code generation for multiple entry points at a time.
* The CUDA emit logic is currently mostly unchanged. The biggest difference is that when emitting a ray-tracing entry point we prefix the name of the generated `__global__` function with a marker for its stage type, as required by the OptiX runtime (e.g., a `__raygen__` prefix is required on all ray-generation entry points).
* The `Renderer` abstraction had a bare minimum of changes made to be able to understand that ray-tracing pipelines exist, and also that some APIs will require the name of each entry point along with its binary data in order to create a program.
* The `ShaderCompileRequest` type was updated so that only a single "source" is supported (rather than distinct source for each entry point), and also the entry points have been turned into a single list where each entry identifies its stage instead of a fixed list of fields for the supported entry-point types.
* The CUDA compute path had a lot of code added to support execution for the new ray-tracing pipeline type. The logic is mostly derived from the `optixHello` example in the OptiX SDK, and at present only supports running a single ray-generation shader with no parameters. The code here is not intended to be ready for use, but represents a signficiant amount of learning-by-doing.
* The `slang-support.cpp` file in `render-test` was updated so that instead of having separate compilation logic for compute vs. rasterization shaders (which would mean adding a third path for ray tracing), there is now a single flow to the code that works for all pipeline types and any kind of entry points.
* Implicit in the new code is dropping support for the way GLSL was being compiled for pass-through render tests, which means pass-through GLSL render tests will no longer work. It seems like we didn't have any of those to begin with, though, so it is no great loss.
* Also implicit are some new invariants about how shaders without known/default entry points need to be handled. For example, the ray tracing case intentionally does not fill in entry points on the `ShaderCompileRequest` and instead fully relies on the Slang compiler's support for discovering and enumerating entry points via reflection. As a consequence of those edits the `-no-default-entry-point` flag on `render-test` is probably not working, but it seems like we don't have any test cases that use that flag anyway.
Given the seemingly breaking changes in those last two bullets, I was surprised to find that all our current tests seem to pass with this change. If there are things that I'm missing, I hope they will come up in review.
* fixup: issues from review and CI
* Some issues noted during the review process (e.g., a missing `break`)
* Fix logic for render tests with `-no-default-entry-point`. I had somehow missed that we had tests reliant on that flag. This required a bit of refactoring to pass down the relevant flag (luckily the function in question was already being passed most of what was in `Options`, so that just passing that in directly actually simplifies the call sites a bit.
* There was a missing line of code to actually add the default compute entry points to the compile request. I think this was a problem that slipped in as part of some pre-PR refactoring/cleanup changes that I failed to re-test.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
* render feature for CUDA compute model.
* Use SemanticVersion type.
* Enable CUDA wave tests that require CUDA SM 7.0.
Provide mechanism for DownstreamCompiler to specify version numbers.
* Enabled wave-equality.slang
* Make CUDA SM version major version not just a single digit.
* Fix assert.
* DownstreamCompiler::Version -> CapabilityVersion
|
| |
|
|
|
|
|
|
|
|
|
| |
* Add unroll support for CUDA, and preliminary for C++.
Document [unroll] support.
* Fix loop-unroll to run on CPU, and test on CPU and elsewhere.
Fix bug in emitting loop unroll condition.
* Improved comment.
* Added support for vk/glsl loop unrolling.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This change continues the work already started in moving the definitions of many built-in functions to the standard library.
The main focus in this change was reducing the number of operations that had to be special-cased on the CPU and CUDA targets by making sure that the scalar cases of built-in functions map to the proper names in the prelude (e.g., `F32_sin()`) via the ordinary `__target_intrinsic` mechanism. In some cases this cleanup meant that special-case logic that was constructing definitions for those functions using C++ code could be scrapped.
Additional changes made along the way:
* A few scalar functions that were missing in the CPU/CUDA preludes got added: `round`, hyperbolic trigonometric functions, `frexp`, `modf`, and `fma`
* The floating-point `min()` and `max()` definitions in the preludes were changed to use intrinsic operations on the target (which are likely to follow IEEE semantics, while our definitions did not)
* For the CUDA target, many of the functions had their names translated during code emit from, e.g., `sin` to `sinf`. This change makes the CUDA target more closely match the C++/CPU target in using names like `F32_sin` consistently.
* For the CUDA target, a few additional functions have intrinsics that don't exist (portably) on CPU: `sincos()` and `rsqrt()`.
* For the Slang stdlib definitions to work, a new `$P` replacement was defined for `__targert_intrinsic` that expands to a type based on the first operand of the function (e.g., `F32` for `float`).
* I removed the dedicated opcodes for matrix-matrix, matrix-vector, and vector-matrix multiplication, and instead turned them into ordinary functions with definitions and `__target_intrinsic` modifiers to map them appropriately for HLSL and GLSL. This is realistically how we would have implemented these if we'd had `__target_intrinsic` from the start.
Notes about possible follow-on work:
* The `ldexp` function is still left in the Slang stdlib because it has to account for a floating-point exponent and the `math.h` version only handles integers for the exponent. It is possible that we can/should define another overload for `ldexp` (and `frexp`) that uses an integer for exponent, and then have that one be a built-in on CPU/CUDA, with the HLSL `frexp` being defined in the stdlib to delegate to the correct `frexp` for those targets.
* The `firstbithigh` and related functions are missing for our CPU and CUDA targets, and will need to be added. It is worth nothing that `firstbithigh` apparently has some very odd functionality around signed integer arguments (which are supported, despite MSDN being unclear on that point). General cleanup will be required for those functions.
* Maxing the various matrix and vector products no longer be intrinsic ops might affect how we emit code for them as sub-expressions (both whether we fold them into use sites and how we parenthize them). This doesn't seem to affect any of our existing tests, but we could consider marking these functions with `[__readNone]` to ensure they can be folded, and then also adding whatever modifier(s) we might invent to control precdence and parentheses insertion during emit.
|
| | |
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
* Add cubemap support.
* Add CUDA fence instrinsics.
* Added Gather for CUDA.
* Use the CUDA driver API as much as possible.
* * Support 1D texture on CPU
* WIP on 1D texture on CUDA
* Added simplified texture test
* Fix test.
* Improve texture-simple tests.
* * Add CPU support for 3d textures
* Add support for mip maps to CUDA
* Disable warnings in nvrtc
* Update CUDA docs
* WIP on 3d texture support.
* Add support for 3d textures for CPU and CUDA.
* CPU and CUDA support for cube maps.
* Add CPU support for Texture1DArray.
* Support CUDA Layered/Array type in meta library.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
* * Improved fastRemoveAt
* Fixed off by one bug
* Fixed const safeness with List<>
* Made List begin and end const safe.
* Revert to previous RefPtr usage.
* Fix bug with casting.
* Tabs -> spaces.
Small fixes/improvements to List.
* Improve comment on List.
* Group shared/atomic test works on CUDA.
* * Enabled CUDA tests for atomics tests
* Enabled DX12 test for atomics-buffer.slang
Not clear just yet how to implement that for CUDA - it will work with StructuredBuffer.
* hasContent -> isNonEmpty
* Remove unneeded comment.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
* Fix CPP construct when matrix type.
* Test intrinsics on float matrices.
* Fix typo in _areNearlyEqual test. Increased default sensitivity.
Added matrix-float test.
* Matrix double test.
Fixed some issues with CUDA.
* Added reduced intrinsic version of matrix-double test.
* Improve matrix double coverage.
Test reflect/length etc on vector float.
* * Added literal-float test.
* Added vector double test
* Improved coverage of vector/matrix tests
* Disable Dx11 double-vector test because fails on CI.
* Disable literal-float, because on CI fails.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
* WIP with vector float test.
* vector-float test working.
* Fixed remaing tests broken with init changes.
* Improve 64bit-type-support.md
* Disable tests broken on CI system for Dx.
* WIP: Make type available for comparison.
* Moved type conversion into TypeTextUtil.
* Add text/type conversions from DownstreamCompiler to TypeTextUtil.
* Allow compaison taking into account type.
* Removed quantize in vector-float.slang test.
|
| |
|
|
|
|
| |
* Fix tests/compute/global-init.slang by handling some other cases where functions are emitted.
* Fix comment.
|
| |
|
|
|
|
| |
CPU. (#1182)
Allow bounds check to zero index.
Update docs.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
| |
* WIP: Trying to figure out how texturing will work with CUDA.
* WIP: Fixes for CUDA layout. Initial CUDA texture test.
* WIP: Outputs something compilable by CUDA for TextureND.Sample
* 2d texture working with CUDA.
* Fix how binding for SamplerState occurs in CUDA.
* Small tidy up of comments.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
* Added hlsl-intrinsic test folder.
Enabled ceil as works across targets.
* log10 support.
* Fix float % on CPU/CUDA to match HLSL which is fmod (not fremainder).
* Added log10 tests back to scalar-float.slang
* Don't add the ( for $Sx - it's clearer what's going on without it.
* Works on CUDA/CPU. Problem with asint/asuint do not seem to be found.
* Only asuint exists for double.
* Support countbits on CUDA and C++.
* Fix typo in C++ population count.
* First pass at int vector intrinsic tests.
* Swizzle for int.
* Bit cast tests on CUDA.
* Fix warning on gcc.
* Fix bit-cast-double execution on CUDA.
* scalar-int test working on gcc release.
* GetAt working on CUDA/C++
* Split out runtime index into it's own test.
* Removed SetAt, as can use assignment with GetAt.
* Allowing getAt to be used on matrices.
* Don't need [] on matrix type any longer because use getAt.
* Enable clamp on matrix-int.
* Fix matrix-int.slang test - because clamp behavior varied if min and max were say inverted.
Added runtime indexing version of matrix-int.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
* Add test result for compile-to-cuda
* Add RAII for some CUDA types to simplify usage.
* First pass handling of some instrinsics on CUDA (for example transcendentals)
* CUDA working with built in intrinsics.
* Add missing CUDA prelude intrinsics.
* CUDA matches CPU output on simple-cross-compile.slang
* First pass at hlsl-scalar-float-intrinsic.slang test.
* Fix smoothstep impl on CUDA and CPU.
* Fixed step intrinsic on CUDA/CPU.
* Added operator[] to Matrix for C++, to allow row access.
Needs a fix for CUDA.
* Fixed warning on clang build.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
* First pass at BindLocation.
* Added BindSet::init - for initializing with two input constant buffers. Needs better name, and perhaps should be another class.
* Fix handling of constant buffer stripping.
Improved initialization.
* Trying to generalize BindLocation a little more.
Split out CPULikeBindRoot.
* More work to make BindLocation et al work with non uniform bindings.
* Added parsing to a location.
* WIP: Trying to get CPU working with BindLocation.
* Describe problem of knowing the type of the reference point in the binding table.
* More ideas on getBindings fix.
* Remove BindSet as member of BindLocation.
* Added BindLocation::Invalid
* Made BindLocation able to be key in hash
* Use BindLocation for bindings on BindingSet.
* Added cuda and nvrtc categories to test infrastructure.
Disabled CUDA synthetic tests by default.
Fixed such that all tests now produce something in BindLocation style.
* Use m_userIndex instead of m_userData on Resource.
Move the binding setup out of cpu-compute-util (as no longer CPU specific)
* Removed CPUBinding - used BindLocation/BindSet instead.
Fixed some bugs around indexOf around uniform indirection.
* Renamed BindSet::Resource -> BindSet::Value.
* Document BindLocation.
* Fixes for Clang/GCC
Improve invariant requirement handling when constructing from BindPoints.
* WIP: First attempt to run CUDA kernel.
* Fix some issues around doing CUDA kernel launch.
* Fix issues around use of cudaMemCpy .
* Better cuda runtime error checking mechanism.
* Fixed bug in passing parameters to cuda kernel launch.
Simplified initialisation of context.
* WIP: Fix CUDA runtime issues.
* Add explicit CUDA synchronize so failures don't appear on implicit ones.
* Fix problem emitting non shared variable on CUDA.
* Fix some typos in CUDA layout.
Use just a pointer for now for CUDA StucturedBuffer.
* Arg order for CUDA launch was wrong.
* First compute kernel runs on CUDA.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
* CUDA generated first test compiles.
* WIP on enabling CUDA in render-test.
* Detect CUDA_PATH environmental variable to build build cuda support into render-test.
Added WIP cuda-compute-util.cpp/h
Added CUDA as a renderer type.
* Fix libraries needed for cuda in premake.
* Added -enable-cuda premake option. Defaults to false.
* Creates CUDA device, loads PTX and finds entry point.
* Fix some erroneous cruft from slang-cuda-prelude.h
* Made CUDA use C++ like ABI for generated code.
Fix small bug in C++ output semantics.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
* CUDA generated first test compiles.
* WIP on enabling CUDA in render-test.
* Detect CUDA_PATH environmental variable to build build cuda support into render-test.
Added WIP cuda-compute-util.cpp/h
Added CUDA as a renderer type.
* Fix libraries needed for cuda in premake.
* Added -enable-cuda premake option. Defaults to false.
* Creates CUDA device, loads PTX and finds entry point.
* Fix some erroneous cruft from slang-cuda-prelude.h
|
| | |
|
| |
|
|
|
|
|
|
|
|
| |
* WIP use IRTypeSet in CPPSourceEmitter - doesn't work because of a cloning issue, causing a crash on exit.
* Fix destruction of module issue for IRTypeSet usage in CPPEmitter.
* Fix out definition emitting ordering that was removed.
* Disable cuda output test.
|
|
|
* CPPCompiler -> DownstreamCompiler
* Added DownstreamCompileResult to start abstraction such that we don't need files.
* * Split out slang-blob.cpp
* Made CompileResult hold a DownstreamCompileResult - for access to binary or ISlangSharedLibrary
* Keep temporary files in scope.
* Add a hash to the hex dump stream.
* Move all file tracking into DownstreamCompiler.
* WIP support for nvrtc.
* WIP: Adding support for nvrtc compiler.
Adding enum types, wiring up the nvrtc into slang.
* Fix remaining CPPCompiler references.
* Fix order issue on target string matching.
* Use ISlangSharedLibrary for nvrtc.
* Use DownstreamCompiler for nvrtc.
* WIP first pass at compilation win nvrtc.
* Added testing if file is on file system into CommandLineDownstreamCompiler.
Added sourceContentsPath.
* Make test cuda-compile.cu work by just compiling not comparing output.
* Genearlize DownstreamCompiler usage.
* Fix warning on clang.
* Remove CompilerType from DownstreamCompiler.
* Use DownstreamCompiler interface for all compilers.
NOTE for FXC, DXC and GLSLANG this doesn't mean using 'compile' - it's still extracting functions from shared library.
* Replace DownstreamCompiler::SourceType -> SlangSourceLanguage
* Replace _canCompile with something data driven.
* Fix compiling on gcc/clang for DownstreamCompiler.
* Moved some text conversions into DownstreamCompiler.
* Fix problem on non-vc builds with not having return on locateCompilers for VS.
* Change so no warning for code not reachable on locateCompilers for vs.
* WIP: CUDA code generation - currently just using CPU layout and HLSL.
* emitXXXForEntryPoint -> emitEntryPointSource
emitSourceForEntryPoint -> emitEntryPointSourceFromIR
Fix up generating cuda to get PTX.
* WIP emitting cuda for IR.
* Small improvements to CUDA ouput.
* Disable the CUDA emit test, as output not currently compilable.
|