| Age | Commit message | Author |
|
* format
* Minor test fixes
* enable checking cpp format in ci
|
|
* Respect matrix layout in uniform and in/out parameters for HLSL target.
* Update test.
* Fix test.
* fix test.
* Fix metal layout calculation.
* Fix compile error.
* Fix compiler error.
---------
Co-authored-by: Yong He <yhe@nvidia.com>
|
|
* Overhaul IR lowering of pointer types.
* Propagate address space in IRBuilder.
* Fixup.
* Fix.
* Fix.
* Change how Ptr type is printed to text.
* Fix.
|
|
`SubpassInput<T>` (#4462)
* Add case to `emitVectorReshape` for `vector<>` type, `scalar` value
1. Add new case
2. Add test
* fix warning
* fix warning
* Implement HLSL resource bindings and default type `float4` to `SubpassInput<T>`
fixes: #4440
1. Removed GLSLInputAttachmentIndexLayout modifier and the somewhat 'hacky' binding model 'Input Attachment' previously relied upon. This was changed to work with the slang-type-layout rules system. This change allows Slang automatic bindings, HLSL bindings, GLSL bindings, and translation of GLSL to and from HLSL bindings to work.
2. Added default argument `float4` to SubpassInput<T>.
3. Merged glsl.meta and hlsl.meta SubpassInput logic.
* fix InputAttachment attribute checks
fix InputAttachment attribute checks for HLSL and GLSL syntax
* remove unused var
* validate attribute correctly
Attributes do not have type information. We must check the type expression to validate attribute usage.
* remove hacky validation
type based validation before types are fully resolved is quite hacky and unstable to changes and wrapped types
* fix warning
* remove redundant `!= nullptr`
* remove extra `!= nullptr`
* fix some warnings/errors
* subpass capability to limit to dxc & remove default values in some functions
* revert logic to previous logic
revert logic to return if we have a binding regardless of if a VarDecl is given the binding
|
|
* Remove use of `G0` and `__target_intrinsic` in stdlib.
* Fix.
* Fix calling intrinsic in global scope.
|
|
* Add `requirePrelude()` intrinsic function.
* Fix.
---------
Co-authored-by: Yong He <yhe@nvidia.com>
|
|
* Bump vulkan headers
Also just use vulkan-headers as a submodule
* Add drawMeshTasks to gfx graphics pipelines
* Add DispatchMesh overload with no payload, with GLSL intrinsic
* Require spirv 1.4 for mesh shaders
* Add vulkan mesh shader feature discovery
* Add mesh shader stage bits to vk-util
* Add mesh and task shader support to render-test
* Add mesh and task tests
* Preserve "payload" specifier in task shaders
* Add mesh shader pipeline support to gfx
* Add TODO
* Add numThreads attribute for amplification stage
* Add payload to task shader test
* Drop dependency on d3dx12
* Allow passing payloads from task to mesh shaders
* regenerate vs projects
* check DispatchMesh name correctly
* Add mesh shader tests to failing tests
* Detect wave-ops feature on vulkan
* Add fuse-product to expected failures
This fails because the global variable `count` is not initialized
* Add required extension to WaveMaskMatch SPIR-V impl
* Remove meshShader member from pipeline desc
* Identify mesh shader support on d3d12
|
|
* Add `target_switch` and `__intrinsic_asm` statement.
* Cleanup.
* WaveGetActiveMask, WaveGetActiveMask, WaveCountBits.
* WaveIsFirstLane.
* More wave intrinsics.
* wave intrinsics.
* merge fix.
* Fix.
* Fix.
* Update test.
* update test.
* Fix.
---------
Co-authored-by: Yong He <yhe@nvidia.com>
|
|
Co-authored-by: Yong He <yhe@nvidia.com>
|
|
* Cast integer literals.
* Fix expected output.
* For CUDA, search global instructions to see what types are used.
Improve lookup for fp16 header in CUDA.
* Fix issue with f16tof32
* Small improvement around finding used base types.
|
|
* Various dxc/fxc compatibility fixes.
* Cleanup.
* Fix test cases.
* Fix comments.
---------
Co-authored-by: Yong He <yhe@nvidia.com>
|
|
* Add support for emitting cuda kernel and host functions.
* Update test.
* Fix cuda preamble emit.
---------
Co-authored-by: Yong He <yhe@nvidia.com>
|
|
* Overhaul global inst deduplication and cpp/cuda backend.
* Update IR documentation.
---------
Co-authored-by: Yong He <yhe@nvidia.com>
|
|
* #include an absolute path didn't work - because paths were taken to always be relative.
* Refactor how prelude output works in emit.
* Small improvement to emit output.
* Move around comment on target specific language directives based on review.
Co-authored-by: Theresa Foley <10618364+tangent-vector@users.noreply.github.com>
|
|
An earlier refactoring pass over the compiler codebase split the
type that had been called `CompileRequest` into three distinct
pieces:
* `FrontEndCompileRequest` which was supposed to own state and
options related to running the compiler front end and producing
IR + reflection (e.g., what translation units and source
files/strings are included).
* `BackEndCompileRequest` which was supposed to own state and options
related to running the compiler back end to translate the IR
for a `ComponentType` (program) into output code. (Note that the
`BackEndCompileRequest` was conceived of as orthogonal to the
`TargetRequest`s, which store per-target and target-specific
options.)
* `EndToEndCompileRequest` which was an umbrella object that owns
separate front-end and back-end requests, plus any state that is
only relevant when doing a true end-to-end compile (such as the
kinds of compiles initiated with `slangc`). As originally conceived,
the only state that this type was supposed to own was stuff related
to "pass-through" compilation, as well as state related to writing
of generated code to output files.
That refactoring work was very useful at the time, because it allowed
us to "scrub" the back end compilation steps to remove all
dependencies on front-end and AST state (this was important for our
goals of enabling linking and codegen from serialized Slang IR).
At this point, however, it is clear that the hierarchy that was built
up serves very little purpose:
* The `BackEndCompileRequest` type is only used in two places:
* As part of an `EndToEndCompileRequest`, where the settings on
the `BackEndCompileRequest` can be configured, but only through
the `EndToEndCompileRequest`
* As part of on-demand code generation through the `IComponentType`
APIs. In this case, the settings stored on the
`BackEndCompileRequest` are not accessible to the application
at all, and will always use their default values, so that
instantiating a "request" object doesn't really make any sense.
* The `FrontEndCompileRequest` type has a similar situation:
* Front-end compilation as part of an `EndToEndCompileRequest`
supports user configuration of `FrontEndCompileRequest` settings,
but only through the `EndToEndCompileRequest`
* Front-end compilation triggered by an `import` or a `loadModule()`
call does not support user configuration of settings at all. It
will always derive all relevant settings from those on the
session ("linkage").
In addition, subsequent changes have been made to the compiler that
show a bit of a "code smell" and/or forward-looking worries for this
decomposition:
* In some cases we've had to add the same setting to multiple types
in the breakdown (front-end, back-end, end-to-end, linkage, target,
etc.) which makes it harder for us to validate that all the possible
mixtures of state work correctly.
* Related to the above, in some cases we have manual logic that copies
state from one of the objects in the breakdown to another, in order
to ensure that the user's intention is actually followed.
* As a forward-looking concern, it seems that developers have sometimes
added new configuration options and state to places that don't really
make sense according to the rationale of the original decomposition
(e.g., we probably don't want to have a lot of state that is only
available via end-to-end requests, given that the API structure is
meant to push users *away* from end-to-end compiles).
As a result of all of the above, I've been planning a large refactor
with the following big-picture goals:
* Eliminate `BackEndCompileRequest`
* Move all relevant state/options from the back-end request to
the end-to-end request, since that is the only place they could
be set anyway.
* Introduce a transient "context" type to be used for the duration
of code generation that serves the main functions that back-end
requests really served in the codebase
* Make `EndToEndCompileRequest` be a subclass of
`FrontEndCompileRequest`
* Consider adding a transient "context" type for front-end
compiles that can be used in `import`-like cases rather than
needing a full front-end request object. If this works, then
eliminate `FrontEndCompileRequest` and be back to a world with
just a single `CompileRequest` type
* Move *all* compiler configuration options to a distinct type (named
something like `CompilerConfig` or `CompilerOptions` or whatever)
which stores settings as key-value pairs, and has a notion of
"inheritance" such that one configuration can extend or build on top
of another. Make all the relevant types use this catch-all structure
instead of redundantly storing flags in many places.
This change deals with the first of those bullets: removal of
`BackEndCompileRequest`. The addition of the `CodeGenContext` type is
perhaps an unnecessary additional step, but making that change helps
clean up a bunch of the code related to per-target code generation,
so I think it is the right choice.
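The last goal above (a single options type with key-value storage and "inheritance") could look roughly like the following. This is only a sketch under assumptions: the class name `CompilerOptions` is one of the candidate names mentioned above, while the `set`/`find` methods, the string-typed values, and the parent pointer are hypothetical and are not part of this change.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical sketch: a key-value option store where one configuration can
// build on top of another. None of these members exist in the codebase.
class CompilerOptions
{
public:
    // `parent` is the configuration this one extends (may be null).
    explicit CompilerOptions(const CompilerOptions* parent = nullptr)
        : m_parent(parent)
    {
    }

    // Set (or override) a value at this level only; the parent is untouched.
    void set(const std::string& key, const std::string& value)
    {
        assert(!key.empty());
        m_values[key] = value;
    }

    // Look up a key here first, then walk up the chain of parents.
    // Returns null if no level defines the key.
    const std::string* find(const std::string& key) const
    {
        for (const CompilerOptions* options = this; options; options = options->m_parent)
        {
            std::map<std::string, std::string>::const_iterator it = options->m_values.find(key);
            if (it != options->m_values.end())
                return &it->second;
        }
        return nullptr;
    }

private:
    const CompilerOptions* m_parent;
    std::map<std::string, std::string> m_values;
};
```

With a structure like this, a per-target configuration could be created with the session-level configuration as its parent, so a flag would only need to be stored once and overridden where it actually differs.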
Co-authored-by: Yong He <yonghe@outlook.com>
|
|
`ImageSubscript` for GLSL (#2146)
|
|
* #include an absolute path didn't work - because paths were taken to always be relative.
* Fix issue with SLANG_ENABLE_GLSLANG_SUPPORT
* Update expected output from glslang-error.glsl
* Fix bug in glsl disassembly.
* Make ExtensionTracker available even if source is not emitted.
* Only explicitly set extension tracker based on capability bits, if we are in pass through.
* Small simplification of invoke sourceEmit.
|
|
* #include an absolute path didn't work - because paths were taken to always be relative.
* Split out StringEscapeUtil.
* Added StringEscapeUtil.
* Fix typo in unix quoting type.
* Small comment improvements.
* Try to fix linux linking issue.
* Fix typo.
* Attempt to fix linux link issue.
* Update VS proj even though nothing really changed.
* Fix another typo issue.
* Fix for windows issue.
Fixed bug.
* Make separate Utils for escaping.
* Fix typo.
* Split out into StringEscapeHandler.
* Windows shell does handle removing quotes (so remove code to remove them).
* Handle unescaping if not initiating using the shell.
* Slight improvement around shell like decoding.
* Simplify command extraction.
* Add shared-library category type.
* Fix bug in command extraction.
* Typo in transcendental category.
* Enable unit-test on in smoke test category.
* Make parsing failing output as a failing test.
* Fixes for transcendental tests. Disable tests that do not work.
* Changed category parsing.
* Removed the TestResult parameter from _gatherTestsForFile.
Made testsList only output.
* Remove testing if all tests were disabled.
* Make args of CommandLine always unescaped.
* Add category.
* Don't need escaping on unix/linux.
* Remove some no longer used functions.
* Add requireSMVersion to CUDAExtensionTracker.
* half-calc.slang now works for CUDA.
* bit-cast-16-bit works on CUDA.
* WIP handling of CUDA vector<half> types.
* Half swizzle CUDA.
* Half vector test.
* Fix swizzle half bug.
* Fix compilation issue with narrowing to Index.
* Add unary ops.
* Add some vector scalar maths ops.
* Add half vector conversions for CUDA.
* Fix erroneous comment.
* Support for half comparisons.
* First pass test for half compare.
* Fix bug in CUDA specialized emit control.
Updated tests to have pre and post inc/dec.
* Removed unneeded parts of the cuda prelude.
* Half structured buffer works on CUDA.
Co-authored-by: Tim Foley <tfoleyNV@users.noreply.github.com>
|
|
* #include an absolute path didn't work - because paths were taken to always be relative.
* Split out StringEscapeUtil.
* Added StringEscapeUtil.
* Fix typo in unix quoting type.
* Small comment improvements.
* Try to fix linux linking issue.
* Fix typo.
* Attempt to fix linux link issue.
* Update VS proj even though nothing really changed.
* Fix another typo issue.
* Fix for windows issue.
Fixed bug.
* Make separate Utils for escaping.
* Fix typo.
* Split out into StringEscapeHandler.
* Windows shell does handle removing quotes (so remove code to remove them).
* Handle unescaping if not initiating using the shell.
* Slight improvement around shell like decoding.
* Simplify command extraction.
* Add shared-library category type.
* Fix bug in command extraction.
* Typo in transcendental category.
* Enable unit-test on in smoke test category.
* Make parsing failing output as a failing test.
* Fixes for transcendental tests. Disable tests that do not work.
* Changed category parsing.
* Removed the TestResult parameter from _gatherTestsForFile.
Made testsList only output.
* Remove testing if all tests were disabled.
* Make args of CommandLine always unescaped.
* Add category.
* Don't need escaping on unix/linux.
* Remove some no longer used functions.
* Add requireSMVersion to CUDAExtensionTracker.
* half-calc.slang now works for CUDA.
* bit-cast-16-bit works on CUDA.
* WIP handling of CUDA vector<half> types.
* Half swizzle CUDA.
* Half vector test.
* Fix swizzle half bug.
* Fix compilation issue with narrowing to Index.
Co-authored-by: Tim Foley <tfoleyNV@users.noreply.github.com>
|
|
* #include an absolute path didn't work - because paths were taken to always be relative.
* WIP CUDA half support.
* Working support for half on CUDA - requires that cuda_fp16.h and associated files can be found.
* Fix for win32 for unused funcs.
* Fix for Clang.
* Hack to disable unused local function warning.
|
|
* PR to fix issue #1638. This change introduces a diagnostic sink to the
emitModule function, and updates all associated calls to that function.
Additionally, this commit updates the heterogeneous hello world example
to not need the entry and stage flags for simplicity.
* Updated emit-cpp per suggested changes
Co-authored-by: Tim Foley <tfoleyNV@users.noreply.github.com>
|
|
In some cases, functionality is available as either a GLSL extension for Vulkan/SPIR-V, or through the NVAPI system for D3D. This situation creates complications because while GLSL extensions are generally all supported by the open-source glslang compiler (which we can bundle and ship), NVAPI operations are exposed through a specific header (`nvHLSLExtns.h`) that ships as part of the NVAPI SDK.
When a user wants to explicitly use NVAPI-provided operations in their shader code, there are no major complications for Slang; the user sets up their include paths, `#include`s the relevant header, calls functions in it, and lets Slang deal with the details of compilation.
The challenge for Slang arises when we want to provide a cross-platform interface in our standard library (e.g., the `RWByteAddressBuffer.InterlockedAddF32` method that was recently added) that uses either a GLSL extension (when compiling for Vulkan/SPIR-V) or an NVAPI (when compiling to DXBC or DXIL). In that case, the code *generated* by Slang now has a dependency on NVAPI, and we need to somehow emit a `#include` directive that pulls it in when invoking fxc or dxc. Because we do not (and seemingly cannot) bundle the NVAPI header with the compiler, we have to rely on the user to have it available and to somehow communicate to Slang where it is.
Exposing portable routines that sometimes use NVAPI currently creates two main challenges:
1. The user is forced to interact with the "prelude" mechanism in the compiler, which allows the programmer to define code in a given target language that gets prepended to the Slang-generated code. While the prelude mechanism is powerful, it is also hard for users to integrate into their workflow, and our experience so far is that users want something that Just Works.
2. If the user writes code that uses some of our abstract operations that layer on NVAPI *and* they also want to use NVAPI explicitly, they end up with two copies of the NVAPI header (one included by the Slang front-end, and another included by the downstream fxc/dxc compiler). This puts the user in the situation of (a) having to ensure that they set the defines like `NV_SHADER_EXTN_SLOT` consistently both when invoking Slang and when adding their prelude, and (b) even if they do make the definitions consistent, they run into the problem that fxc/dxc complain about overlapping register bindings on the two copies of the `g_NvidiaExt` global shader parameter that the NVAPI header declares.
This change attempts to resolve both issues by adding a lot of "do what I mean" logic to the compiler to try to ease things in the common case. In particular:
1. The user no longer needs to use the "prelude" mechanism when using NVAPI. The compiler now embeds a default prelude for HLSL output, which will `#include` the NVAPI header if and only if the generated code needs NVAPI access because of portable standard library routines that were used.
2. The user can mix-and-match explicit NVAPI use and stdlib functions that compile to use NVAPI. The register/space to be used by NVAPI when included via prelude is now set based on whatever the user set via the preprocessor so that it should automatically be consistent between both cases. Furthermore, the code we emit for the declaration of `g_NvidiaExt` when compiling explicit NVAPI use is set up to be conditional, so that it is skipped in the case where the prelude will pull in its own declaration of that parameter.
The way all this is achieved involves a lot of moving pieces:
* We now have an HLSL prelude, which mostly just serves to `#include "nvHLSLExtns.h"` in the case where NVAPI support is needed downstream.
* Standard library operations that require NVAPI for their implementation on HLSL include a new `[__requiresNVAPI]` attribute.
* The preprocessor has been extended so that after tokenizing an input file it looks up the NVAPI-relevant macros in the resulting environment, and if they are set it attaches a modifier (`NVAPISlotModifier`) to the AST `ModuleDecl` that is based on their values. Logic is added to detect if multiple input files specify values for the macros in ways that conflict.
* The semantic checking step is extended so that it detects the "magic" NVAPI declarations (the `g_NvidiaExt` parameter and the `NvShaderExtnStruct` type that it uses) and attaches a modifier to them so that they can be identified as such in later steps.
* Parameter binding is extended to collect a list of the AST modifiers that reflect NVAPI binding, and to reserve the relevant register(s) so that ordinary user-defined parameters cannot conflict with them.
* IR lowering translates the three new AST modifiers related to NVAPI over to IR equivalents.
* IR linking is extended to make sure that it clones any `IRNVAPISlotDecoration`s attached to the input modules. The pass intentionally does not care where the modifiers came from; it just collects them all and leaves it to downstream code to sort out what they mean.
* Emit logic is extended to have a notion of "prelude directives" which are preprocessor directives that should come *before* the prelude in the generated code, because they can impact the way that the prelude compiles. This is done so that we don't have to introduce ad hoc logic for each downstream compiler to set any relevant `-D` flags (e.g., both fxc and dxc would need to duplicate such logic for NVAPI support).
* The HLSL source emitter is extended to track whether it emits any operations that require NVAPI support.
* The HLSL source emitter is extended to emit prelude directives based on whether NVAPI is needed and, if it is, to also set the register and space that NVAPI should use based on what was stored in the decoration(s) on the IR module.
* The HLSL source emitter is extended so that it detects global instructions that represent "magic" NVAPI constructs, and emits them as conditional definitions so that they are skipped when NVAPI is included via the prelude.
* The handling of required capabilities during emit logic was cleaned up a bit so that more logic is shared across targets, and also so that the same logic is used both when emitting a function declaration/definition and when emitting a call to an intrinsic function (which won't get declared/defined).
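To make the ordering of "prelude directives", prelude, and generated code concrete, here is a minimal sketch of how the final HLSL text might be assembled. It is not the actual emitter code: the `NVAPIUsage` struct, the `assembleHLSL` function, and the `SLANG_HLSL_ENABLE_NVAPI` and `NV_SHADER_EXTN_REGISTER_SPACE` macro names are assumptions made for illustration. Only `NV_SHADER_EXTN_SLOT` and the idea that the prelude conditionally includes `nvHLSLExtns.h` come from the description above.

```cpp
#include <cassert>
#include <string>

// Hypothetical summary of what the emitter learned while generating code.
struct NVAPIUsage
{
    bool        required = false; // did any emitted operation need NVAPI?
    std::string registerName;     // e.g. "u0", taken from the user's macro value
    std::string spaceName;        // e.g. "space0"; empty if the user set none
};

// Assemble the final source text: directives first, then prelude, then body.
std::string assembleHLSL(
    const NVAPIUsage&  nvapi,
    const std::string& prelude,
    const std::string& body)
{
    std::string out;

    // Prelude directives must come *before* the prelude, because they
    // control how the prelude itself is preprocessed downstream.
    if (nvapi.required)
    {
        assert(!nvapi.registerName.empty());
        out += "#define SLANG_HLSL_ENABLE_NVAPI 1\n"; // assumed macro name
        out += "#define NV_SHADER_EXTN_SLOT " + nvapi.registerName + "\n";
        if (!nvapi.spaceName.empty())
            out += "#define NV_SHADER_EXTN_REGISTER_SPACE " + nvapi.spaceName + "\n";
    }

    // The prelude is assumed to contain something like:
    //   #ifdef SLANG_HLSL_ENABLE_NVAPI
    //   #include "nvHLSLExtns.h"
    //   #endif
    out += prelude;
    out += body;
    return out;
}
```

Because the directives are plain `#define` lines in the emitted source, the same text should work for both fxc and dxc without any target-specific `-D` handling, which is the motivation given above.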
|
|
* Enable all dynamic dispatch tests on CUDA.
* Fix expected cross-compile test results.
|
|
* Front-load cuda module loading to fill in RTTI pointers.
* Enable dynamic dispatch codegen for CUDA.
|
|
The Big Picture
===============
Given input Slang code like:
```hlsl
Texture2D gA;
[shader("compute")]
void kernelFunc(uniform Texture2D b, uint3 tid : SV_DispatchThreadID)
{ ... }
```
the existing CUDA code generation strategy would always generate a kernel with a signature like:
```c++
struct GlobalParams { Texture2D gA; };
struct EntryPointParams { Texture2D b; };
extern "C" __global__
void kernelFunc(EntryPointParams* entryPointParams, GlobalParams* globalParams)
{ ... }
```
This choice was consistent with the conventions of the CPU kernel target, and shares the advantage that it is easy for the user to data-drive the logic for filling in parameters and then invoking a kernel.
However, the approach outlined above has two serious problems when used for CUDA kernels:
* First, it defies the programmer's expectation about what an "equivalent" CUDA kernel signature would be, which makes it awkward for a developer to invoke this kernel from CUDA C++ host code (especially in the context of an app that might also run hand-written CUDA kernels).
* Second, the performance of this approach suffers because every access to a global or entry point parameter turns into a load from global memory. In contrast, a typical hand-written CUDA kernel passes its parameters via an implementation-specific path that (for current CUDA platforms) seems to be equivalent to `__constant__` memory in performance.
This change alters the convention so that the Slang compiler takes the code from the top of this message and translates it into something like:
```c++
struct GlobalParams { Texture2D gA; };
__constant__ GlobalParams SLANG_globalParams;
extern "C" __global__
void kernelFunc( Texture2D b )
{ ... }
```
This translation alleviates both problems with the current translation:
* The signature of the generated CUDA kernel function is as close to that of the original as is possible (we had to eliminate the `SV_*`-semantic varying inputs), and should directly match what the programmer would expect in common cases.
* Entry-point parameters are passed via CUDA kernel parameters, and should thus match in performance. Global parameters are passed via a variable in `__constant__` memory, and thus should also perform as well as possible/expected.
Detailed Changes
================
* Disable the `collectEntryPointUniformParams` pass for CUDA, so that entry-point `uniform` parameters are *not* bundled into a single `struct` and/or `ConstantBuffer`.
* When targeting CUDA, disable the logic for generating an entry-point parameter for passing in the global shader parameter(s)
* Allow `CLikeSourceEmitter` subclasses to override the name generated for entry-point symbols, and use this to add the required prefix for each OptiX kernel type when translating a ray-tracing kernel.
* Add logic to emit "parameter groups" in a specialized way for CUDA (this is the same approach that allows us to generate `cbuffer { ... }` declarations for fxc). A global-scope parameter group will turn into a global `__constant__` variable called `SLANG_globalParams` (that name becomes part of the ABI for Slang-compiled shaders).
* Update the logic in `render-test` for loading and invoking CUDA kernels to handle the new policy.
The last bullet there merits expansion, since it is indicative of the work a client using Slang would have to go through to use our generated kernels with the new policy:
* When loading a CUDA module with one or more kernels, we also use `cuModuleGetGlobal` to query the address of the `SLANG_globalParams` symbol in that CUDA module. That pointer needs to be used when setting global parameter values to be used by kernels in that CUDA module.
* Because our existing `BindPoint` logic for CUDA always sets up parameter data in GPU memory, we end up having to copy the entry-point parameter data from GPU memory to host memory. This step would ideally be skipped in a codebase that understands the correct policy, but it is a bit unfortunate that it is no longer trivially correct for an application to store all parameter data in GPU memory.
* Before invoking the kernel, we need to use a `cudaMemcpyAsync` to copy from the prepared GPU memory for global parameters over to the `SLANG_globalParams` symbol associated with the kernel to be invoked. Because this operation is issued on the same CUDA stream as the kernel call, it is guaranteed to not overlap with GPU kernel execution.
* When invoking the kernel, we take advantage of the seldom-used `CU_LAUNCH_PARAM_BUFFER_POINTER` facility to specify a contiguous memory region with all the entry-point parameters in it instead of passing each entry-point parameter separately. Given Slang reflection it is also possible to query the offset of each entry-point parameter in the buffer, so we could invoke the kernel in the traditional fashion as well. The choice here is up to the application.
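As an illustration of the last bullet, the host-side step of gathering entry-point parameters into one contiguous region can be sketched in plain C++ as follows. The `ParamDesc` struct and `packEntryPointParams` function are hypothetical names, and the offsets are assumed to come from reflection; this is not the code in `render-test`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical description of one entry-point parameter, as an application
// might derive it from reflection data.
struct ParamDesc
{
    std::size_t offset;   // byte offset of the parameter inside the buffer
    std::size_t size;     // size of the parameter in bytes
    const void* hostData; // where the value currently lives in host memory
};

// Build the contiguous host buffer that holds all entry-point parameters.
std::vector<unsigned char> packEntryPointParams(const std::vector<ParamDesc>& params)
{
    // The buffer must be large enough to hold the furthest-reaching parameter.
    std::size_t totalSize = 0;
    for (std::size_t i = 0; i < params.size(); ++i)
    {
        const std::size_t end = params[i].offset + params[i].size;
        if (end > totalSize)
            totalSize = end;
    }

    // Zero-fill so that padding between parameters has a defined value.
    std::vector<unsigned char> buffer(totalSize, 0);
    for (std::size_t i = 0; i < params.size(); ++i)
    {
        const ParamDesc& p = params[i];
        if (p.size == 0)
            continue;
        assert(p.hostData != nullptr);
        std::memcpy(&buffer[p.offset], p.hostData, p.size);
    }
    return buffer;
}
```

The resulting pointer and size would then be handed to the launch call through the `extra` argument of `cuLaunchKernel`, using `CU_LAUNCH_PARAM_BUFFER_POINTER`, `CU_LAUNCH_PARAM_BUFFER_SIZE`, and `CU_LAUNCH_PARAM_END`. An application that prefers the traditional per-parameter launch could instead use the same offsets to build an array of pointers into the buffer.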
Caveats
=======
* This is a breaking change, and any subsequent release will need to reflect that fact. Any customers who rely on Slang's current CUDA codegen strategy are likely to be surprised by this change, and I don't see an easy way to give them a more gentle transition.
* This change does *not* remove the logic that introduces a `KernelContext` type for code that requires it. That means that things like `static` global variables can continue to work on CUDA for now, but we know that those are not going to be something we can support in the long-term with separate compilation.
* While the policy implemented in this change is a reasonable default, it is still not going to perfectly match expectations for some developers. In particular, some developers who are familiar with both D3D and CUDA will likely wonder why a global `cbuffer` in Slang translates to a global-memory pointer in the output CUDA instead of one global `__constant__` variable per `cbuffer`. A more detailed alternate translation would generate a distinct global `__constant__` variable for each top-level constant buffer or parameter block. We may need to refine the translation even more based on feedback from users who care about how we handle global-scope parameters.
* Recent changes in Slang have broken the logic that handles the OptiX "shader record" as an alternative mechanism for passing entry-point parameters. In order to get any level of OptiX support up and running we will have to change the IR passes that run on CUDA kernels to actually run the "collection" of `uniform` parameters for ray tracing stages, and then to replace references to the resulting parameter with a call to the function to access the shader record.
* The use of `SLANG_globalParams` here works well enough in the case of whole-program compilation; every `CUmodule` ends up with (zero or) one parameter with this name, and an application can just hard-code it. As a mechanism it wouldn't work in the presence of separately-compiled modules that might introduce their own global parameters (including cases like constant lookup tables that really want to be at the global scope). An alternative approach would have Slang generate output PTX for each module, where a module has an optional global symbol for its own global-scope parameters (with a mangled name that is based on the module name), and then a linked CUDA binary has all of those distinct symbols. Such an approach would be compatible with module-at-a-time reflection and parameter binding, but would lead to another breaking change down the line for code that switches to `SLANG_globalParams`.
|
|
* Remove KernelContext wrapper from CPU/CUDA emit
Currently, the CPU and CUDA C++ targets rely on a `KernelContext` type that is generated during emit, as a way to provide implicit access to things that were global in the input Slang code, but that can't actually be emitted as globals in the target language (because the semantics of global declarations differ).
For example, input like:
```hlsl
ConstantBuffer<Stuff> gStuff; // shader parameter
groupshared int gData[1024]; // thread-group shared variable
static int gCounter = 0; // "thread-local" global-scope variable
void subroutine() { ... }
[shader("compute")] void computeMain() { ... }
```
would translate to output C++ for CPU a bit like:
```c++
struct KernelContext
{
ConstantBuffer<Stuff> gStuff;
int gData[1024];
int gCounter = 0;
void subroutine() { ... }
void computeMain() { ... }
};
```
Note that both `computeMain()` and `subroutine()` are non-`static` member functions of `KernelContext`, so they have an implicit `this` parameter of type `KernelContext*`, which allows their bodies to reference `gStuff`, etc. implicitly by name.
Because `KernelContext::computeMain()` is a member function, we end up emitting an additional global-scope function to expose the entry point to the outside world, and that function is responsible for declaring a local `KernelContext` and invoking the generated entry point on it.
This approach has several important drawbacks:
* It complicates the emit logic for CPU and CUDA, with many special cases around when/how things get emitted
* It complicates the implementation of dynamic dispatch, because what seems like a function pointer in Slang IR needs to be a pointer-to-member-function in C++.
* It makes it difficult to have a non-kernel-oriented mode of compilation for CPU where a Slang function with a given signature gets output as a C++ CPU function with the "same" signature (not wrapped up as a member function of `KernelContext`).
This change takes a step toward addressing these issues by making the introduction of the `KernelContext` type something that is done in an explicit IR pass instead of being handled as part of the last-mile emit logic.
The most important change is the removal of code related to `KernelContext` from the `slang-emit-{cpp,cuda}.{h,cpp}` files, with the equivalent logic instead being handled in a new pass in `slang-ir-explicit-global-context.{h,cpp}`. It should be noted that further cleanups to the emit logic should now be possible; in particular, both the CPU and CUDA emit paths are manually sequencing the `EmitAction`s instead of relying on the default logic, but at this point they should be able to just use the default. The additional cleanups are left for future work.
The explicit IR pass does more or less what one would expect: it identifies global-scope entities (global variables and parameters) that need to be wrapped and turns them into fields of a `KernelContext` type. It then modifies all entry points to initialize a `KernelContext` as part of their startup. Finally, any code that used to refer to the global entities is changed to refer to a field of the context, with the context passed via new function parameters (the new parameter is only added to functions that need it for now).
Transforming global variables into fields of a `KernelContext` type in the IR pass ends up dropping their initial-value expressions (since those were attached as basic blocks on the `IRGlobalVar`). To avoid breaking code that relies on global-scope (but thread-local) variables, this change also adds an explicit pass that takes the initialization logic on all global variables and moves it to explicit logic that runs at the start of every entry point in a linked module (`slang-ir-explicit-global-init.{h,cpp}`). This pass would also be useful when we get back to direct SPIR-V emit, since SPIR-V also requires initialization logic for globals to be emitted into entry points.
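Written out as plain C++, the combined effect of the two passes described above (explicit global context plus explicit global initialization) might look like the sketch below. This is hand-written illustration rather than actual compiler output: the field is reduced to `gCounter`, the parameter name `context` is made up, and the entry point returns a value only so the effect is observable.

```cpp
#include <cassert>

// Globals that needed wrapping become fields of an ordinary struct.
struct KernelContext
{
    int gCounter; // was: static int gCounter = 0;
};

// Was: void subroutine() { gCounter++; }
// The context parameter is added only because the body touches a wrapped global.
static void subroutine(KernelContext* context)
{
    assert(context != nullptr);
    context->gCounter++;
}

// Was: [shader("compute")] void computeMain() { subroutine(); subroutine(); }
// Now an ordinary global-scope function rather than a member function.
int computeMain()
{
    // Each invocation of the entry point sets up its own context.
    KernelContext context;

    // Explicit global initialization: the initial-value expression that used
    // to be attached to the global variable now runs at entry-point start.
    context.gCounter = 0;

    subroutine(&context);
    subroutine(&context);
    return context.gCounter;
}
```

Since `subroutine` is now a free function taking an explicit pointer, taking its address yields an ordinary function pointer, which is what makes the dynamic dispatch case simpler than with pointer-to-member functions.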
One complication that arises when the IR is introducing the types for entry-point parameters, global-scope parameters, and the `KernelContext` type is that it becomes harder for the emit logic to utter the names of those types (they might not even have names, since `IRNameHint`s might get stripped). This created a problem since the wrapper operations that were being generated for CPU were taking `void*` parameters and casting them to the appropriate type. To work around this issue, we have added an explicit IR pass (`slang-ir-entry-point-raw-ptr-params.{h,cpp}`) that transforms the signature of entry points so that any pointer parameters instead become raw pointer (`void*`) parameters, with the casting being handled inside the entry point itself.
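The signature change can be pictured with the C++ sketch below. The struct names, the `threadIndex` parameter, and the body of `computeMain` are assumptions made for the example; only the idea of taking `void*` parameters and casting inside the entry point comes from the description above.

```cpp
// Hypothetical parameter types; in the real compiler these are
// introduced by IR passes and may not have stable names.
struct EntryPointParams
{
    int scale;
};

struct GlobalParams
{
    const int* input;
    int* output;
};

// Conceptual "before" (not compiled here):
//     void computeMain(int threadIndex, EntryPointParams* ep, GlobalParams* globals);

// "After": pointer parameters become raw pointers, and the casts move
// into the entry point, so callers never have to name the types.
void computeMain(int threadIndex, void* rawEntryPointParams, void* rawGlobalParams)
{
    auto* ep = static_cast<EntryPointParams*>(rawEntryPointParams);
    auto* globals = static_cast<GlobalParams*>(rawGlobalParams);
    globals->output[threadIndex] = globals->input[threadIndex] * ep->scale;
}
```

With this shape, a wrapper generated by the emit logic can forward its `void*` arguments unchanged.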
One consequence of all the above changes is that for the CUDA target we no longer need a wrapper function to invoke the generated entry point, because the IR function for the entry point ends up having the correct/expected signature already. This is also the case for CPU when it comes to the `*_Thread` wrapper function, but this change doesn't try to eliminate the wrapper because of a belief that the `*_Thread`-level interface is going away anyway.
Because the IR is now responsible for ensuring the signature of the IR entry point for CUDA and CPU is what is expected, I needed to modify the `slang-ir-entry-point-uniforms` pass to always create an explicit parameter for the entry point uniforms when compiling for CUDA/CPU, even if there were no `uniform` parameters on the entry point as written. This also ended up requiring some tweaks to the parameter layout logic to ensure that CPU/CUDA targets always treat `ConstantBuffer<T>` as a `T*` even in the case where `T` is an empty `struct` type (which happens when we construct a `struct` type to represent the uniform parameters of an entry point with no uniform parameters...).
There are several future changes that can/should build on this work:
* We should change the generated signatures for CUDA kernels, so that they don't rely on `KernelContext` for global-scope parameters. At that point we can avoid generating a `KernelContext` at all for CUDA, except when a program uses global-scope thread-local variables.
* We should figure out how to make the "ABI" for dynamic-dispatch calls ensure that the kernel context is either always passed, or always *not* passed. Making a hard-and-fast rule as part of the calling convention for dynamic calls would ensure that access through the context continues to work with dynamic calls (this change might break it in some cases).
* We should figure out how to handle the layout for the `KernelContext` in cases where a program is composed of multiple separately-compiled modules. Right now the layout of the `KernelContext` requires global knowledge (as does the pass that introduces explicit initialization for global-scope thread-locals).
* We should try to further clean up the CPU/CUDA C++ emit logic to fall back on the default emit behavior more, now that the various special-case approaches that were taken are no longer needed.
* fixup: restore build files to default configuration
|
|
The main change here is that the CPU and CUDA C++ emit paths now rely on an earlier IR pass to legalize the varying parameter list of a kernel and translate references to varying parameters with semantics like `SV_DispatchThreadID`. Doing so removes a lot of special-case logic from the emit passes.
This work moves us even closer to being able to eliminate `KernelContext` from the CPU/CUDA emit logic, because it removes the issue of state related to varying inputs being stored in `KernelContext`.
The new pass that handles the legalization is in `slang-ir-legalize-varying-params.cpp`, and it borrows heavily from the existing `slang-ir-glsl-legalize.cpp` pass. The new pass factors out the target-independent and target-dependent logic, so that both CPU and CUDA can share much of the same code despite having very different rules for how the system-value parameters are being provided.
An eventual goal is to have the new pass also handle the GLSL case, but doing so requires copying even more logic out of the GLSL-specific pass, and doing so seemed like a step too far for what was meant to be a stepping-stone change as part of other work. As a result of the incomplete nature of the pass, certain cases don't work for compute shader inputs for CPU/CUDA (e.g., wrapping your varying inputs in a `struct` type parameter), but those were cases that also didn't work in the existing `emit`-based logic.
One major consequence of this change is that the logic for emitting the various different functions that represent an entry point for our CPU back-end has been streamlined and simplified. The original logic had a fair bit of cleverness built in to try and avoid unnecessary math ops when computing the various IDs/indices, while the new logic is much more simplistic (the main dispatch function loops over threadgroups with a triply-nested `for` and then delegates to the group-level function, which loops over threads with its own nested `for`s).
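The simplified loop structure described above might look roughly like the following C++ sketch. All names here (`UInt3`, `VaryingInput`, `kGroupSize`, `runGroup`, `runDispatch`, `countThread`) are hypothetical stand-ins and do not reflect the identifiers the CPU back-end actually emits; the fixed group size stands in for a `[numthreads(4, 2, 1)]` attribute.

```cpp
#include <cassert>
#include <cstdint>

struct UInt3 { uint32_t x, y, z; };

// Hypothetical bundle of the system values a thread receives.
struct VaryingInput
{
    UInt3 groupID;
    UInt3 groupThreadID;
    UInt3 dispatchThreadID;
};

// Assumed threadgroup size for this example.
constexpr UInt3 kGroupSize = {4, 2, 1};

using ThreadFunc = void (*)(const VaryingInput& input, void* userData);

// Group-level function: loops over the threads of one threadgroup and
// computes each ID directly, with no attempt to share arithmetic.
void runGroup(UInt3 groupID, ThreadFunc threadFunc, void* userData)
{
    assert(threadFunc != nullptr);
    for (uint32_t z = 0; z < kGroupSize.z; ++z)
    for (uint32_t y = 0; y < kGroupSize.y; ++y)
    for (uint32_t x = 0; x < kGroupSize.x; ++x)
    {
        VaryingInput input;
        input.groupID = groupID;
        input.groupThreadID = {x, y, z};
        input.dispatchThreadID = {
            groupID.x * kGroupSize.x + x,
            groupID.y * kGroupSize.y + y,
            groupID.z * kGroupSize.z + z};
        threadFunc(input, userData);
    }
}

// Dispatch-level function: a triply-nested loop over threadgroups that
// delegates to the group-level function.
void runDispatch(UInt3 groupCount, ThreadFunc threadFunc, void* userData)
{
    for (uint32_t z = 0; z < groupCount.z; ++z)
    for (uint32_t y = 0; y < groupCount.y; ++y)
    for (uint32_t x = 0; x < groupCount.x; ++x)
        runGroup({x, y, z}, threadFunc, userData);
}

// Example per-thread function: counts how many threads were run.
void countThread(const VaryingInput&, void* userData)
{
    ++*static_cast<int*>(userData);
}
```

Calling `runDispatch({2, 3, 1}, countThread, &count)` with `count` starting at zero invokes the per-thread function once for each of the 6 groups times 8 threads per group.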
Longer term, it will be important to simplify the CPU functions we emit further, by eliminating things like the `_Thread` function that should never really be exposed to users (the minimum granularity of invoking a CPU compute kernel should be a single threadgroup). We may eventually decide to synthesize all of the extra code that is being generated in the `emit` pass as IR instead.
|
|
* Fixes for active mask synthesis + tests
There are two fixes here:
* The code generation that follows active mask synthesis was requiring CUDA SM architecture version 7.0 for one of the introduced instructions, but not all of them. This change centralizes the handling of upgrading the required CUDA SM architecture version, and makes sure that the instructions introduced by active mask synthesis request version 7.0.
* The tests for active mask synthesis were not flagged as requiring the `cuda_sm_7_0` feature when invoking `render-test-tool`, which meant they ran but produced unexpected results when invoked on a GPU without the required semantics for functions like `__ballot_sync()`. This change adds the missing `-render-feature cuda_sm_7_0` to those tests.
* fixup: mark more tests that rely on implicit active mask
|
|
* Fix CUDA output of a static const array if values are all literals.
* Fix bug in Convert definition.
* Output makeArray such that it is deconstructed on CUDA to fill in based on what the target type is. Tries to expand such that there are no function calls so that static const global scope definitions work.
* Fix unbounded-array-of-array-syntax.slang to work correctly on CUDA.
* Remove tabs.
* Check works with static const vector/matrix.
* Fix typo in type comparison.
* Shorten _areEquivalent test.
* Rename _emitInitializerList. Some small comment fixes.
Co-authored-by: Tim Foley <tfoleyNV@users.noreply.github.com>
|
|
* render feature for CUDA compute model.
* Use SemanticVersion type.
* Enable CUDA wave tests that require CUDA SM 7.0.
Provide mechanism for DownstreamCompiler to specify version numbers.
* Enabled wave-equality.slang
* Make CUDA SM version major version not just a single digit.
* Fix assert.
* DownstreamCompiler::Version -> CapabilityVersion
|
|
* Add unroll support for CUDA, and preliminary for C++.
Document [unroll] support.
* Fix loop-unroll to run on CPU, and test on CPU and elsewhere.
Fix bug in emitting loop unroll condition.
* Improved comment.
* Added support for vk/glsl loop unrolling.
|
|
* Fix tests/compute/global-init.slang by handling some other cases where functions are emitted.
* Fix comment.
|
|
* Add test result for compile-to-cuda
* Add RAII for some CUDA types to simplify usage.
* First pass handling of some intrinsics on CUDA (for example transcendentals)
* CUDA working with built in intrinsics.
* Add missing CUDA prelude intrinsics.
* CUDA matches CPU output on simple-cross-compile.slang
* First pass at hlsl-scalar-float-intrinsic.slang test.
* Fix smoothstep impl on CUDA and CPU.
* Fixed step intrinsic on CUDA/CPU.
* Added operator[] to Matrix for C++, to allow row access.
Needs a fix for CUDA.
* Fixed warning on clang build.
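For reference on the `step` and `smoothstep` items above, here is a small C++ sketch of the documented HLSL semantics for scalar `float` arguments. The commit does not say what was wrong with the earlier implementations, and the names `stepRef` and `smoothstepRef` are made up for this example; they are not the prelude's function names.

```cpp
#include <algorithm>

// HLSL step(y, x): 1 if x >= y, otherwise 0.
float stepRef(float edge, float x)
{
    return x >= edge ? 1.0f : 0.0f;
}

// HLSL smoothstep(min, max, x): clamp the normalized position to [0, 1],
// then apply the Hermite polynomial 3t^2 - 2t^3.
float smoothstepRef(float lo, float hi, float x)
{
    float t = std::min(std::max((x - lo) / (hi - lo), 0.0f), 1.0f);
    return t * t * (3.0f - 2.0f * t);
}
```

The argument order of `step` (edge first, value second) is an easy thing to get backwards, so it is worth checking against the edge case `x == edge`, which should yield 1.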
|
|
* CUDA generated first test compiles.
* WIP on enabling CUDA in render-test.
* Detect CUDA_PATH environment variable to build CUDA support into render-test.
Added WIP cuda-compute-util.cpp/h
Added CUDA as a renderer type.
* Fix libraries needed for cuda in premake.
* Added -enable-cuda premake option. Defaults to false.
* Creates CUDA device, loads PTX and finds entry point.
* Fix some erroneous cruft from slang-cuda-prelude.h
* Made CUDA use C++ like ABI for generated code.
Fix small bug in C++ output semantics.
|
|
|
|
* WIP use IRTypeSet in CPPSourceEmitter - doesn't work because of a cloning issue, causing a crash on exit.
* Fix destruction of module issue for IRTypeSet usage in CPPEmitter.
* Fix out definition emitting ordering that was removed.
* Disable cuda output test.
|
|
* CPPCompiler -> DownstreamCompiler
* Added DownstreamCompileResult to start abstraction such that we don't need files.
* Split out slang-blob.cpp
* Made CompileResult hold a DownstreamCompileResult - for access to binary or ISlangSharedLibrary
* Keep temporary files in scope.
* Add a hash to the hex dump stream.
* Move all file tracking into DownstreamCompiler.
* WIP support for nvrtc.
* WIP: Adding support for nvrtc compiler.
Adding enum types, wiring up the nvrtc into slang.
* Fix remaining CPPCompiler references.
* Fix order issue on target string matching.
* Use ISlangSharedLibrary for nvrtc.
* Use DownstreamCompiler for nvrtc.
* WIP first pass at compilation with nvrtc.
* Added testing if file is on file system into CommandLineDownstreamCompiler.
Added sourceContentsPath.
* Make test cuda-compile.cu work by just compiling not comparing output.
* Generalize DownstreamCompiler usage.
* Fix warning on clang.
* Remove CompilerType from DownstreamCompiler.
* Use DownstreamCompiler interface for all compilers.
NOTE for FXC, DXC and GLSLANG this doesn't mean using 'compile' - it's still extracting functions from shared library.
* Replace DownstreamCompiler::SourceType -> SlangSourceLanguage
* Replace _canCompile with something data driven.
* Fix compiling on gcc/clang for DownstreamCompiler.
* Moved some text conversions into DownstreamCompiler.
* Fix problem on non-vc builds with not having return on locateCompilers for VS.
* Change so no warning for code not reachable on locateCompilers for vs.
* WIP: CUDA code generation - currently just using CPU layout and HLSL.
* emitXXXForEntryPoint -> emitEntryPointSource
emitSourceForEntryPoint -> emitEntryPointSourceFromIR
Fix up generating cuda to get PTX.
* WIP emitting cuda for IR.
* Small improvements to CUDA output.
* Disable the CUDA emit test, as output not currently compilable.
|