| Commit message (Collapse) | Author | Age |
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
* Add fkYAML submodule
* Generate slang-ir-inst-defs.h from slang-ir-inst-defs.yaml
* generate ir-inst-defs.h
* neaten things
* neaten inst def parser
* add rapidyaml submodule
* remove fkyaml
* remove fkyaml submodule
* remove use of ir-inst-defs.h
* format and warnings
* fix wasm build
* tidy
* remove rapidyaml
* Extend fiddle to allow custom splices in more places
* Use lua to describe ir insts
* fix
* neaten
* neaten
* neaten
* spelling
* neaten
* comment comment out assert
* merge
|
| |
|
|
|
|
|
|
|
| |
* Move switch statement bodies to their own lines
* format
---------
Co-authored-by: Yong He <yonghe@outlook.com>
|
| |
|
|
|
|
|
| |
* format
* Minor test fixes
* enable checking cpp format in ci
|
| |
|
|
|
| |
(#5415)
This commit changes the word "stdlib" or "standard library" to "core module" in the source code.
|
| | |
|
| |
|
|
|
|
|
|
|
|
|
| |
* Various fixes
* Remove unused parameter
* Update slang-ir-loop-unroll.cpp
---------
Co-authored-by: Yong He <yonghe@outlook.com>
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
* #include an absolute path didn't work - because paths were taken to always be relative.
* WIP lowerCamel Dictionary.
* WIP more lowerCamel fixes for Dictionary.
* Add/Remove/Clear
* GetValue/Contains
* Fix tabs in dictionary.
Count -> getCount
* Fix fields with caps.
* Key -> key
Value -> value
Use m_ for members where appropriate.
Use lowerCamel in linked list.
* Some small fixes/improvements to Dictionary.
* Kick CI.
|
| |
|
|
|
|
|
|
|
|
|
|
|
| |
* Fix linking issue in slangpy + no mask param for kernels.
* add cuda header changes
* fix
* More correct change of active mask insertion.
---------
Co-authored-by: Yong He <yhe@nvidia.com>
|
| |
|
| |
Co-authored-by: Yong He <yhe@nvidia.com>
|
| |
|
|
|
|
|
| |
* Clean up `IRReturnVoid`.
* Update gitignore.
Co-authored-by: Yong He <yhe@nvidia.com>
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
* Cleanup refactoring work around the IR builder
We have some long-term goals for the IR that require a more centralized and disciplined set of rules for how IR instructions get created/emitted. I had been working on trying to set things up so that all IR instruction creation goes through a single bottleneck point, but the non-trivial work in that branch was getting drowned out by the sheer volume of cleanup and refactoring changes. This change tries to pull together several of the more important cleanups.
The big pieces are:
* `IRBuilder` and `SharedIRBuilder` now protect their data members and rely on users to initialize them more directly via constructor of an `init()` method. This change affects a *bunch* of sites where `IRBuilder`s were created. I changed use sites to use the constructors whenever possible, and to use `init()` in cases where we had longer-lived builders that needed to be initialized multiple times.
* The insertion location for the `IRBuilder` now uses an encapsulated type called `IRInsertLoc`. This new type can replace what used to be just two `IRInst*` fields in the builder, and also covers some new functionality (if we ever want to take advantage of it). Very little client code cares about this change, but it is still a nice cleanup in terms of making things more explicit.
* The creation of an `IRModule` has been moded *out* of `IRBuilder`, because in practice we `IRBuilder` always wants to be associated with a pre-existing `IRModule` at creation time (via its `SharedIRBuilder`). There is now an `IRModule::create()` operation instead. This required changing the sequencing at many `IRModule` creation sites, since most had been contriving to make an `IRBuilder` first. There were also several cleanups because code had been carelessly using non-reference-counted pointers for `IRModule`s in ways that broke now that `IRModule::create()` always returns a `RefPtr`.
* The core operations to actually allocate memory for IR instructions were moved into `IRModule` (since they interact with the memory pool that the module owns). These *were* called `createEmptyInst()` but have been renamed into `_allocateInst()`. In principle these seem like they should only be needed to be called by the `IRBuilder`, but in practice they are also needed by the IR deserialization logic.
* A few core operations for emitting IR instructions that were associted with `IRBuilder` were moved to actually be methods on `IRBuilder`. First is `_findOrEmitConstant` which is the primary bottleneck for creating simple scalar constant values. Another is `_createInst` (formerly part of the templated `createInstImpl` along with `createInstWithSizeImpl`) which is the main bottleneck for allocation and initialization of any instruction other than a constant (well, the `IRModuleInst` is the other exception...). Finally, there is also `_maybeSetSourceLoc()`, which is obvious to scope inside the `IRBuilder` once it is protecting the source-location info.
Notes:
* The `minSizeInBytes` parameter to `_createInst()` might not actually be needed at all. At this point any `IRInst` subtypes that need data allocated for things other than their operands already get created manually via `_allocateInst` or `_findOrEmitConstant`, so I *think* we could remove that part. I will handle that in a subsequent cleanup if it turns out to be the case.
* There is one IR pass (`slang-ir-string-hash.cpp`) that is using manual `_allocateInst()` instead of going through an `IRBuilder`. It could be easily cleaned up to not do so (and I will probably make that change down the line), but for now I wanted to avoid doing anything that wasn't close to pure refactoring if I could.
* At this point in our design an `IRBuilder` is a very lightweight thing - it basically just owns the insertion location plus a source location to write into instructions. A lot of our code currently treats `IRBuilder`s like they are expensive and/or need to be re-used (which leads to them being used in more mutable/stateful ways). It is quite likely that as we clean up other aspects of the implementation of IR creation/emission we can make `IRBuilder` use feel more lightweight in ways that can streamline and simplify code.
* The next step for this work is to identify the different paths that eventually lead to `_createInst()` being called, and unify them at a single bottleneck operation that can own the decisions around when to create an instruction vs. when to re-use an existing one (rather than those decisions being baked into the various `IRBuilder` subroutines that create instructions of the various subtypes).
* fixup: gcc/clang C++ spec details
|
| |
|
|
|
|
|
|
|
| |
* Add an accessor for IRInst opcode
This main changing is renaming `IRInst::op` over to `IRInst::m_op` and then adds an accessor `IRInst::getOp()` to read it. The rest of the changes are just changing use sites to `getOp` (or to `m_op` in the limited cases where we write to it).
This work is in anticipation of a future change that might need to store an extra bit in the same field as the opcode. It seemed better to do this massive refactoring as a separate PR.
* fixup
|
|
|
* Synthesize "active mask" for CUDA
The Big Picture
===============
The most important change here is to `hlsl.meta.slang`, where the declaration of `WaveGetActiveMask()` is changed so that instead of mapping to `__activemask()` on CUDA (which is semantically incorrect) it maps to a dedicated IR instruction.
The other `WaveActive*()` intrinsics that make use of the implicit "active mask" concept had already been changed in #1336 so that they explicitly translate to call the equivalent `WaveMask*()` intrinsic with the result of `WaveGetActiveMask()`. As a result, all of the `WaveActive*()` functions are now no different from a user-defined function that uses `WaveGetActiveMask()`.
The bulk of the work in this change goes into an IR pass to replace the new instruction for getting the active mask gets replaced with appropriately computed values before we generate output CUDA code. That work is in `slang-ir-synthesize-active-mask.{h,cpp}`.
Utilities
=========
There are a few pieces of code that were helpful in writing the main pass but that can be explained separately:
* IR instructions were added corresponding to the Slang `WaveMaskBallot()` and `WaveMaskMatch()` functions, which map to the CUDA `__ballot_sync()` and `__match_any_sync()` operations, respectively. These are only implemented for the CUDA target because they are only being generated as part of our CUDA-only pass.
* The `IRDominatorTree` type was updated to make it a bit more robust in the presence of unreachable blocks in the CFG. It is possible that the same ends could be achieved more efficiently by folding the corner cases into the main logic, but I went ahead and made things very explicit for now.
* I added an `IREdge` utility type to better encapsulate the way that certain code operating on the predecessors/successors of an `IRBlock` were using an `IRUse*` to represent a control-flow edge. The `IREdge` type makes the logic of those operations more explicit. A future change should proably change it so that `IRBlock::getPredecessors()` and `getSuccessors()` are instead `getIncomingEdges()` and `getOutgoingEdges()` and work as iterators over `IREdge` values, given the way that the predecessor and successor lists today can contain duplicates.
* Using the above `IREdge` type, the logic for detecting and break critical edges was broken down into something that is a bit more clear (I hope), and that also factors out the breaking of an edge (by inserting a block along it) into a reusable subroutine.
The Main Pass
=============
The implementation of the new pass is in `slang-ir-synthesize-active-mask.cpp`, and that file attempts to include enough comments to make the logic clear. A brief summary for the benefit of the commit history:
* The first order of business is to identify functions that need to have the active mask value piped into them, and to add an additional parameter to them so that the active mask is passed down explicitly. Call sites are adjusted to pass down the active mask which can then result in new functions being identified as needing the active mask.
* The next challenge is for a function that uses the active mask, to compute the active mask value to use in each basic block. The entry block can easily use the active mask value that was passed in, while other blocks need more work.
* When doing a conditional branch, we can compute the new mask for the block we branch to as a function of the existing mask and the branch condition. E.g., the value `WaveMaskBallot(existingMask, condition)` can be used as the mask for the "then" block of an `if` statement.
* When control flow paths need to "reconverge" at a point after a structured control-flow statement, we need to insert logic to synchronize and re-build the mask that will execute after the statement, while also excluding any lanes/threads that exited the statement in other ways (e.g., an early `return` from the function).
The explanation here is fairly hand-wavy, but the actual pass uses much more crisp definitions, so the code itself should be inspected if you care about the details.
Tests
=====
The tests for the new feature are all under `tests/hlsl-intrinsic/active-mask/`. Most of them stress a single control-flow construct (`if`, `switch`, or loop) and write out the value of `WaveGetActiveMask()` at various points in the code.
In practice, our definition of the active mask doesn't always agree with what D3D/Vulkan implementations seem to produce in practice, and as a result a certain amount of effort has gone into adding tweaks to the tests that force them to produce the expected output on existing graphics APIs. These tweaks usually amount to introducing conditional branches that aren't actually conditional in practice (the branch condition is always `true` or always `false` at runtime), in order to trick some simplistic analysis approaches that downstream compilers seem to employ.
One test case currently fails on our CUDA target (`switch-trivial-fallthrough.slang`) and has been disabled. This is an expected failure, because making it produce the expected value requires a bit of detailed/careful coding that would add a lot of additional complexity to this change. It seemed better to leave that as future work.
Future Work
===========
* As discussed under "Tests" above, the handling of simple `switch` statements in the current pass is incomplete.
* There's an entire can of worms to be dealt with around the handling of fall-through for `switch`.
* The current work also doesn't handle `discard` statements, which is unimportant right now (CUDA doesn't have fragment shaders), but might matter if we decide to synthesize masks for other targets. Similar work would probably be needed if we ever have `throw` or other non-local control flow that crosses function boundaries.
* An important optimization opportunity is being left on the floor in this change. When block that comes "after" a structured control-flow region (which is encoded explicitly in Slang IR and SPIR-V) post-dominates the entry block of the region, then we know that the active mask when exiting the region must be the same as the mask when entering the region, and there is no need to insert explicit code to cause "re-convergence." This should be addressed in a follow-on change once we add code to Slang for computing a post-dominator tree from a function CFG.
* Related to the above, the decision-making around whether a basic block "needs" the active mask is perhaps too conservative, since it decides that any block that precedes one needing the active mask also needs it. This isn't true in cases where the active mask for a merge block can be inferred by post-dominance (as described above), so that the blocks that branch to it don't need to compute an active mask at all.
* If/when we extend the CPU target to support these operations (along with SIMD code generation, I assume), we will also need to synthesize an active mask on that platform, but the approach taken here (which pretty much relies on support for CUDA "cooperative groups") wouldn't seem to apply in the SIMD case.
* Similarly, the approach taken to computing the active mask here requires a new enough CUDA SM architecture version to support explicit cooperative groups. If we want to run on older CUDA-supporting architectures, we will need a new and potentially very different strategy.
* Because the new pass here changes the signature of functions that require the active mask (and not those that don't), it creates possible problems for generating code that uses dynamic dispatch (via function pointers). In principle, we need to know at a call site whether or not the callee uses the active mask. There are multiple possible solutions to this problem, and they'd need to be worked through before we can make the implicit active mask and dynamic dispatch be mutually compatible.
* Related to changing function signatures: no effort is made in this pass to clean up the IR type of the functions it modifies, so there could technically be mismatches between the IR type of a function and its actual signature. If/when this causes problems for downstream passes we probably need to do some cleanup.
* fixup: backslash-escaped lines
I did some "ASCII art" sorts of diagrams to explain cases in the CFG, and some of those diagrams used backslash (`\`) characters as the last character on the line, causing them to count as escaped newlines for C/C++.
The gcc compiler apparently balked at those lines, since they made some of the single-line comments into multi-line comments.
I solved the problem by adding a terminating column of `|` characters at the end of each line that was part of an ASCII art diagram.
* fixup: typos
Co-authored-by: jsmall-nvidia <jsmall@nvidia.com>
|