| Age | Commit message (Collapse) | Author |
|
Co-authored-by: Yong He <yhe@nvidia.com>
|
|
* Overhaul global inst deduplication and cpp/cuda backend.
* Update IR documentation.
---------
Co-authored-by: Yong He <yhe@nvidia.com>
|
|
Co-authored-by: Yong He <yhe@nvidia.com>
|
|
Co-authored-by: Yong He <yhe@nvidia.com>
|
|
Co-authored-by: Yong He <yhe@nvidia.com>
|
|
* Clean up `IRReturnVoid`.
* Update gitignore.
Co-authored-by: Yong He <yhe@nvidia.com>
|
|
* Add an accessor for IRInst opcode
This main changing is renaming `IRInst::op` over to `IRInst::m_op` and then adds an accessor `IRInst::getOp()` to read it. The rest of the changes are just changing use sites to `getOp` (or to `m_op` in the limited cases where we write to it).
This work is in anticipation of a future change that might need to store an extra bit in the same field as the opcode. It seemed better to do this massive refactoring as a separate PR.
* fixup
|
|
* Add first steps toward a "capability" system
We already have cases in the stdlib where we mark declarations as being specific to certain targets, e.g.:
```
// My ordinary function to add two numbers.
// Works everywhere.
//
void myFunc(int a, int b) { return a + b; }
// On the "coolgpu" target, we can use a secret intrinsic
// that adds numbers even faster!
//
__specialized_for_target(coolgpu)
void myFunc(int a, int b) { return __secretIntrinsic(a, b); }
```
The existing logic for dealing with these modifiers (`__specialized_for_target` and `__target_intrinsic`) was almost entirely string-based. We would turn the chosen compilation target into a string, and then use that to try and search for the "best" definition of a function at a few steps:
* During IR linking, we always pick one definition of an `[import]`ed function, and that definition will be the one with the "best" target-specialization modifier (if any)
* During final code generation, we always look up the "best" target-intrinsic modifier, and use it as the template for the code we output.
This change preserves the basic flow there, but replaces the ad hoc string-based logic with something a bit more principled, in terms of a new `CapabilitySet` type.
A `CapabilitySet` represents a set of zero or more atomic features (here represented as `CapabilityAtom`s). What a `CapabilitySet` means depends on how and where it is used:
* A compilation target implies a `CapabilitySet` where the contents of the set are the features the target *supports*.
* A `CapabilitySet` attached to a declaration (or a modifier on that declaration) describes a set of feature that declaration *requires*.
The current implementation of `CapabilitySet` is wasteful and inefficient, but that is something we can iterate on over time.
In practice, most of the current code only ever uses capability sets that are either empty (because they represent a function with no specific requirements) or singleton (because they represent asingle atomic capability like "is a GLSL target," "is an HLSL target," etc.).
The main goal here was to put in the skeleton of a new system, including some of the features it might need down the line, and then to leave changes that eventually use the greater flexibility for later. Eventually, the capability system should encompass:
* Differences between shader model versions, GLSL versions, SPIR-V versions, etc. (currently tracked with other modifiers)
* Optional extensions, and functions that are made available only with certain extensions (currently tracked with other modifiers)
* Front-end checking that the call graph of a program doesn't violate any capability-requirements (e.g., having a GLSL+HLSL portable function call a GLSL-only subroutine)
* Hypothetically we can also try to fold stage-specific (vertex-only, fragment-only, etc.) functions into this system, but doing so would require more linker cleverness if we allow overloading on stages (since we might have to clone a caller if it calls through to a callee with multiple stage-specific versions)
One important complication that the system has to deal with just because of the "do what I mean" nature of the current compiler is that somethings a current Slang user might compile for target X and specify version N, but then use a function that actually requires version N+1 of that target. Currently the Slang compiler silently "upgrades" the version(s) used by user code in these cases, because it is often what users want in cross-compilation scenarios.
Dealing with the "silent upgrade" situation requires us to be a little careful and sometimes pick a "best" capability set that doesn't appear to be supported on our target. Refining that system and potentially getting rid of the "do what I mean" behavior over time could be a goal for future changes.
* fixup: handle case where value is incompatible during linking
|
|
Most people agree that it is a Good Thing when compilers are deterministic: the exact same input bits produce the exact same output bits every time the compiler is run. Bonus points are awarded if the results are independent of the platform the compiler was compiled for and run on.
One of the easiest kinds of nondeterminism to have sneak into a compiler is for it to produce the "same" code inside functions, but sometimes emits functions or other global symbols in a different order from run to run. Right now, the Slang compiler has some of this kind of nondeterminism.
The main way (but not necessarily the only way) that a compiler ends up producing output with a different ordering across runs is by iterating over the contents of a hash-based container (in our codebase, a `Dictionary` or `HashSet`), where the keys make use of pointers. Most operating systems intentionally try to randomize the address space of processes across runs (as a security feature), so that exact pointer values are not stable across runs, and thus hash value are not stable across runs, and thus the ordering of entries is not stable across runs.
This change identifies a few cases of iterating over dictionaries or sets that could have produced output non-determinism:
* The `HLSLIntrinsicSet` was using a `Dictionary` to store intrinsics that had been referenced, and would later produce a linear list of those intrinsics based on their order in the dictionary.
* The `WitnessTable`s produced by the front-end stored a `Dictionary` or requirements, and lowering from AST->IR was iterating over that dictionary to ensure that everythign got emitted.
* The `SharedSemanticsContext` was tracking a `HashSet` of modules that were imported into scope (so that their `extension`s should be visible), and an iteration over that list was used when producing candidate extensions during lookup. This case is unlikely to cause any nondeterminism in final output, but could lead to nondeterministic ordering in diagnostic messages for ambiguous reference/overload cases.
* The IR linker maintains a `Dictionary` of symbols based on their mangled names, and iterates over it in code that clones all witness tables into the linked IR whether or not they are referenced.
For most of these cases the fix is simple:
* Keep both a `Dictionary`/`HashSet` and a `List` of the appropriate type
* Whenever adding to the hash-based container also add to the list
* Whenever iterating, iterate over the list
In the final case of the IR linker, the relevant code was marked with a `TODO` comment noting that it shouldn't actually be needed, so I simply dropped it and the change doesn't seem to break any of our tests. I've been fairly confident that code wasn't needed for a while.
This change isn't exactly elegant, and a better long term solution might be to introduce two new types, `OrderedDictionary` and `OrderedSet`, which are similar to `Dictionary` and `HashSet` except that they guarantee a deterministic order of enumeration of their contents, based on insertion order.
(Note that a `SortedDictionary` and/or `SortedSet` that use something like a binary tree to produce a "determinsitc" sorted order wouldn't actually help here, because sorting entries by pointer values wouldn't solve the underlying problem that the pointer values aren't stable across runs)
I've chosen to avoid adding new types to `core` in the interest of making the change as small as possible. If we all agree that new types are warranted, it should be easy to clean up these use cases.
Testing this change is difficult, because we can't produce a reliable test to rule out nondeterminism. I have done best-effort testing by hand by crafting shaders that show output nondeterminism, and then compiling them both with and without these changes.
|
|
This change continues the work already started in moving the definitions of many built-in functions to the standard library.
The main focus in this change was reducing the number of operations that had to be special-cased on the CPU and CUDA targets by making sure that the scalar cases of built-in functions map to the proper names in the prelude (e.g., `F32_sin()`) via the ordinary `__target_intrinsic` mechanism. In some cases this cleanup meant that special-case logic that was constructing definitions for those functions using C++ code could be scrapped.
Additional changes made along the way:
* A few scalar functions that were missing in the CPU/CUDA preludes got added: `round`, hyperbolic trigonometric functions, `frexp`, `modf`, and `fma`
* The floating-point `min()` and `max()` definitions in the preludes were changed to use intrinsic operations on the target (which are likely to follow IEEE semantics, while our definitions did not)
* For the CUDA target, many of the functions had their names translated during code emit from, e.g., `sin` to `sinf`. This change makes the CUDA target more closely match the C++/CPU target in using names like `F32_sin` consistently.
* For the CUDA target, a few additional functions have intrinsics that don't exist (portably) on CPU: `sincos()` and `rsqrt()`.
* For the Slang stdlib definitions to work, a new `$P` replacement was defined for `__targert_intrinsic` that expands to a type based on the first operand of the function (e.g., `F32` for `float`).
* I removed the dedicated opcodes for matrix-matrix, matrix-vector, and vector-matrix multiplication, and instead turned them into ordinary functions with definitions and `__target_intrinsic` modifiers to map them appropriately for HLSL and GLSL. This is realistically how we would have implemented these if we'd had `__target_intrinsic` from the start.
Notes about possible follow-on work:
* The `ldexp` function is still left in the Slang stdlib because it has to account for a floating-point exponent and the `math.h` version only handles integers for the exponent. It is possible that we can/should define another overload for `ldexp` (and `frexp`) that uses an integer for exponent, and then have that one be a built-in on CPU/CUDA, with the HLSL `frexp` being defined in the stdlib to delegate to the correct `frexp` for those targets.
* The `firstbithigh` and related functions are missing for our CPU and CUDA targets, and will need to be added. It is worth nothing that `firstbithigh` apparently has some very odd functionality around signed integer arguments (which are supported, despite MSDN being unclear on that point). General cleanup will be required for those functions.
* Maxing the various matrix and vector products no longer be intrinsic ops might affect how we emit code for them as sub-expressions (both whether we fold them into use sites and how we parenthize them). This doesn't seem to affect any of our existing tests, but we could consider marking these functions with `[__readNone]` to ensure they can be folded, and then also adding whatever modifier(s) we might invent to control precdence and parentheses insertion during emit.
|
|
Some of the functions declared in the Slang standard library are built in on some targets (almost always the case for HLSL) but aren't available on other targets (often the case for GLSL, CUDA, and CPU). To date, the CUDA and CPU targets have worked around this issue by synthesizing definitions of the missing functions on the fly as part of output code generation, at the cost of some amount of code complexity in the emit pass.
This change adds definitions inside the stdlib itself for a large number of built-in HLSL functions that act element-wise over both vectors and matrices (e.g., `sin()`, `sqrt()`, etc.), and changes the CPU/CUDA codegen path to *not* synthesize C++ code for those functions (instead relying on code generated from the Slang definitions).
The element-wise vector/matrix function bodies are being defined using macros in the stdlib, so that we can more easily swap out the definitions en masse if we find an implementation strategy we like better. This could involve defining special-case syntax just for vector/matrix "map" operations that can lower directly to the IR and theoretically generate cleaner code after specialization is complete.
As a byproduct of this change, the matrix versions of these functions should in principle now be available to GLSL (GLSL only defines vector versions of functions like `sin()`, and leaves out matrix ones). No testing has been done to confirm this fix.
In some cases builtins were being declared with multiple declarations to split out the HLSL and GLSL cases, and this change tries to unify these as much as possible into single declarations to keep the stdlib as small as possible.
Two functions -- `sincos()` and `saturate()` -- were simple enough that their full definitions could be given in the stdlib so that even the scalar cases wouldn't need to be synthesized, so the corresponding enumerants were removed in `slang-hlsl-intrinsic-set.h`. In the case of `saturate()` the pre-existing definition used for GLSL codegen could have been used for CPU/CUDA all along.
In some cases functions that can and should be defined in the future have had commented-out bodies added as an outline for what should be inserted in the future. Most of these functions cannot be implemented directly in the stdlib today because basic operations like `operator+` are currently not defined for `T : __BuiltinArithmeticType`, etc. Adding such declarations should be straightforward, but brings risks of creating unexpected breakage, so it seemed best to leave for a future change.
This change does not try to address making vector or matrix versions of builtin functions that map to single `IROp`s, because the existing mechanisms for target-based specialization, etc., do not apply for such cases. In the future we will either have to make those operations into ordinary functions (eliminating many `IROp`s) so that stdlib definitions can apply, or add an explicit IR pass to deal with legalizing vector/matrix ops for targets that don't support them natively. The right path for this is not yet clear, so this change doesn't wade into it.
This change does not touch the `Wave*` functions added in Shader Model 6, despite many of these having vector/matrix versions that could benefit from the same default mapping. It is expected that these functions will have GLSL/Vulkan translation added soon, and it probably makes sense to know what cases are directly supported on Vulkan before adding the hand-written definitions.
Because of the limitations on what could be ported into the stdlib, it is not yet possible to remove any of the infrastructure for synthesizing builtin function definitions in the CPU and CUDA back-ends.
|
|
|
|
* WIP with vector float test.
* vector-float test working.
* Fixed remaing tests broken with init changes.
* Improve 64bit-type-support.md
* Disable tests broken on CI system for Dx.
* WIP: Make type available for comparison.
* Moved type conversion into TypeTextUtil.
* Add text/type conversions from DownstreamCompiler to TypeTextUtil.
* Allow compaison taking into account type.
* Removed quantize in vector-float.slang test.
|
|
* Added hlsl-intrinsic test folder.
Enabled ceil as works across targets.
* log10 support.
* Fix float % on CPU/CUDA to match HLSL which is fmod (not fremainder).
* Added log10 tests back to scalar-float.slang
* Don't add the ( for $Sx - it's clearer what's going on without it.
* Works on CUDA/CPU. Problem with asint/asuint do not seem to be found.
* Only asuint exists for double.
* Support countbits on CUDA and C++.
* Fix typo in C++ population count.
* First pass at int vector intrinsic tests.
* Swizzle for int.
* Bit cast tests on CUDA.
* Fix warning on gcc.
* Fix bit-cast-double execution on CUDA.
* scalar-int test working on gcc release.
* GetAt working on CUDA/C++
* Split out runtime index into it's own test.
* Removed SetAt, as can use assignment with GetAt.
* Allowing getAt to be used on matrices.
* Don't need [] on matrix type any longer because use getAt.
* Enable clamp on matrix-int.
* Fix matrix-int.slang test - because clamp behavior varied if min and max were say inverted.
Added runtime indexing version of matrix-int.
|
|
* CPPCompiler -> DownstreamCompiler
* Added DownstreamCompileResult to start abstraction such that we don't need files.
* * Split out slang-blob.cpp
* Made CompileResult hold a DownstreamCompileResult - for access to binary or ISlangSharedLibrary
* Keep temporary files in scope.
* Add a hash to the hex dump stream.
* Move all file tracking into DownstreamCompiler.
* WIP support for nvrtc.
* WIP: Adding support for nvrtc compiler.
Adding enum types, wiring up the nvrtc into slang.
* Fix remaining CPPCompiler references.
* Fix order issue on target string matching.
* Use ISlangSharedLibrary for nvrtc.
* Use DownstreamCompiler for nvrtc.
* WIP first pass at compilation win nvrtc.
* Added testing if file is on file system into CommandLineDownstreamCompiler.
Added sourceContentsPath.
* Make test cuda-compile.cu work by just compiling not comparing output.
* Genearlize DownstreamCompiler usage.
* Fix warning on clang.
* Remove CompilerType from DownstreamCompiler.
* Use DownstreamCompiler interface for all compilers.
NOTE for FXC, DXC and GLSLANG this doesn't mean using 'compile' - it's still extracting functions from shared library.
* Replace DownstreamCompiler::SourceType -> SlangSourceLanguage
* Replace _canCompile with something data driven.
* Fix compiling on gcc/clang for DownstreamCompiler.
* Moved some text conversions into DownstreamCompiler.
* Fix problem on non-vc builds with not having return on locateCompilers for VS.
* Change so no warning for code not reachable on locateCompilers for vs.
* WIP: CUDA code generation - currently just using CPU layout and HLSL.
* emitXXXForEntryPoint -> emitEntryPointSource
emitSourceForEntryPoint -> emitEntryPointSourceFromIR
Fix up generating cuda to get PTX.
* WIP emitting cuda for IR.
* Small improvements to CUDA ouput.
* Disable the CUDA emit test, as output not currently compilable.
* Split out IRTypeSet to simplify CPPSourceEmitter and other Emitters that rely on determining unique use of type and/or need to generate types in order to output code.
* First pass at HLSLIntrinsicSet.
* Small improvements to HLSLIntrinsicSet.
* Use HLSLIntrinsicSet in CPPSourceEmitter.
* Small improvements to checking of HLSLIntrinsic construction.
* Deallocate intrinsic copy if a match was found.
|