| Commit message (Collapse) | Author | Age |
| |
|
|
|
|
|
|
|
| |
legalization pass. (#8567)
This is crash that be triggered by providing custom
`getDescriptorFromHandle` and use it to return access a
ByteAddressBuffer from a bindless handle.
Closes #8355.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
(#7843)
* Fix 64-bit val lowering for metal
* Add ByteAddressBuffer load/store 64-bit tests
* Handle Store/Load ptr types
* Use bitcast for non-pointer typers
* format code (#7966)
Co-authored-by: slangbot <186143334+slangbot@users.noreply.github.com>
---------
Co-authored-by: slangbot <ellieh+slangbot@nvidia.com>
Co-authored-by: slangbot <186143334+slangbot@users.noreply.github.com>
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
* update hlsl meta
* update test
* use slang syntax in meta file
* improve meta file
* fix pack clamp u8
* remove builtin packed types, use typealias instead
* fix wgsl pack clamp
* fix formatting
---------
Co-authored-by: Yong He <yonghe@outlook.com>
|
| |
|
|
|
| |
* Add packed bytes builtin type
* fix test
|
| |
|
|
|
|
|
|
|
| |
* Move switch statement bodies to their own lines
* format
---------
Co-authored-by: Yong He <yonghe@outlook.com>
|
| |
|
|
|
|
|
| |
* format
* Minor test fixes
* enable checking cpp format in ci
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
* Metal: `Interlocked` (atomic) member function support for buffers
fixes: #4654
fixes: #4481
1. Add `Interlocked` (atomic) member function support for buffers to Metal
2. Fix `__getEquivalentStructuredBuffer` so it works with CPP/Metal targets
* add `CompareStore` support
* legalize RWByteAddressBuffer to fully replace StructuredBuffer
* destroy replaced byte-addr buffer
* cleanup as per review and add comment to explain why certain code exists
* fix flow of byte-address-buffer replacement
* toggle on option to translate byteAddrBuffer to StructuredBuffer
* cleanup unused buffers
* add treatGetEquivalentStructuredBufferAsGetThis flag to treat getEquivStructuredBuffer as a byteAddressBuffer
* comment to explain `treatGetEquivalentStructuredBufferAsGetThis`
---------
Co-authored-by: Yong He <yonghe@outlook.com>
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
* Fix and enable tests for metal.
* Fix.
* Fix.
* Fix tests.
* Fix warnings.
* Fix.
---------
Co-authored-by: Yong He <yonghe@Yongs-Mac-mini.local>
|
| |
|
|
|
|
|
|
|
|
|
|
|
| |
* RasterizerOrder resource for spirv and metal.
Also fixes the byte address buffer logic for metal.
* Fix.
* Delete commented lines.
---------
Co-authored-by: Jay Kwak <82421531+jkwak-work@users.noreply.github.com>
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Fixes #4062
This change enables wide load/stores for byte-address-buffer backed
resources, when the data is accessed at an offset that is aligned.
**Goals**
- Improve performance by issuing wider instructions instead of sequence
of scalar instructions, for load and stores of byte-address buffers.
- Reduce code-size and readability of the generated shaders.
- Help naive users as well as ninja programmers, generate optimal code.
**Non Goals**
- Help with Structured buffers, or other resources.
- Target compilation time improvements.
**Key changes**
Adds 2 new overloads for Load and Store operations on ByteAddress Buffers.
1. Load / Store with an extra alignment parameter
```
resource.Load<T>(offset, alignment);
resource.Store<T>(offset, value, alignment);
```
2. LoadAligned / StoreAligned with no extra parameter,
with the same signature as orignial Load / Store.
```
resource.LoadAligned<T>(offset);
resource.StoreAligned<T>(offset, value);
```
- This overload will implicitly identify the alignment value,
from the base type T of the elementary unit of the resource.
**Supported resources**
1. Vectors
This can be upto 4 elements, i.e. float -- float4.
2. Arrays
This does not have a limit on number of elements, but on a
conservative estimate, we can limit to few hundreds.
3. Structures
This is used to group a resource of a single type.
```
struct {
float4 x;
}
```
**Code updates**
- Modified byte-address-ir legalize to handle struct, array and vector
kinds of load or store access
- Added custom hlsl stdlib functions to implement all the overloads for Load,
Store etc.
- Added C-like emitter, SPIR-V emitter for handling ByteAddressBuffers.
- Added a new core stdlib function intrinsic to wrap around alignOf<T>().
- Added a new peephole optimization entry to identify the equivalent
IntLiteral value from the alignOf<T>() inst.
- Added tests to check explicit, and implicit aligned Load and Store
operations.
|
| |
|
|
|
|
|
|
|
|
|
|
| |
Fixes #3533
- Add logic to perform aligned memory operations for loading from and storing to composite resources, like vectors within the ByteAddress legalize pass.
- Checks
Added a new test for byte address with/without alignment.
---------
Co-authored-by: Yong He <yonghe@outlook.com>
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
(#3881)
* Refactor memory qualifier decorators to be a bit-flag set.
replace GloballyCoherent, ReadOnly, WriteOnly, Volatile, and Restrict memory modifiers and decorations with a bit flag set to more efficiently manage memory qualifiers.
added `restrict` modifier to test to ensure the code works when dropping a `restrict` memory qualifier
* Refine tests & add SSBO memory qualifer support
add CHECK's to tests to ensure memory qualifiers emit as intended
added tests and changed code to ensure memory qualifiers work on SSBO objects (SPIR-V & GLSL)
* add memory qualifiers & fixes.
Add to StructuredBuffer & ByteAddressBuffer `ReadOnly`/NonWritable qualifier.
* Memory qualifiers must be decorated on a variable inst. Due to this the qualifier is added after `lowerStructuredBufferType`
Fixed an error where ReadOnly->NonReadable & WriteOnly->NonWritable
* Adjusted tests accordingly
Added back the removed `globallycoherent` memory qualifier emit'ing code in hlsl-emit (was incorrectly removed).
undo hlsl.meta changes
cleanup
|
| |
|
|
|
|
|
|
|
| |
* Preserve ByteAddressBuffer user type name.
* Make user type lowercase.
* Make typenames conform to spec.
* Use `SpvOpDecorateString`.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
* Refactor compiler option representation.
* Fix binary compatibility.
* Add a test for specifying compiler options at link time.
* Fix binary compatibility.
* Fix binary compatibility.
* Fix backward compatibility on matrix layout.
* Fix.
* Fix.
* Fix.
* Fix gfx.
* Fix gfx.
* Fix dynamic dispatch.
* Polish.
|
| |
|
|
|
|
|
|
|
|
|
|
|
| |
* Add per-buffer data layout control.
Fixes #3534.
* Fixes.
* Robustness.
* Update test.
* Fix.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
* [SPIRV] Support `globallycoherent` modifier.
* Fix.
* Disable executable cooperative vector tests.
* Update expected failure.
* [SPIRV] Emit varying output index decoration.
* Add test.
* Update tests.
* Fix test.
* Emit `SpvExecutionModeEarlyFragmentTests`.
* Lower `StructuredBuffer<bool>`.
* Support globallycoherent on ByteAddressBuffer.
---------
Co-authored-by: Yong He <yhe@nvidia.com>
|
| |
|
| |
Co-authored-by: Yong He <yhe@nvidia.com>
|
| |
|
|
|
|
|
|
|
|
|
| |
* Compile append and consume structured buffers to glsl.
* Fix.
* Update CI config.
---------
Co-authored-by: Yong He <yhe@nvidia.com>
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
* Create storage types of different layouts for SPIRV emit.
* Fix.
* Fix.
* Fix.
* Update expected failure list.
---------
Co-authored-by: Yong He <yhe@nvidia.com>
|
| |
|
| |
Co-authored-by: Yong He <yhe@nvidia.com>
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
* Support per field matrix layout
* Fix warnings.
* Fix.
* Fix tests.
* Fix spiv gen.
* Fix.
* More test fixes.
* Fix.
* Run only GPU tests on self-hosted servers.
* Remove -use-glsl-matrix-layout-modifier.
* Fix.
---------
Co-authored-by: Yong He <yhe@nvidia.com>
|
| |
|
|
|
|
|
|
|
|
|
|
|
| |
* Support `-fvk-use-gl-layout` for ByteAddressBuffer load/store.
* Fix.
* Fix.
* Add test for unaligned load.
---------
Co-authored-by: Yong He <yhe@nvidia.com>
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
* #include an absolute path didn't work - because paths were taken to always be relative.
* WIP lowerCamel Dictionary.
* WIP more lowerCamel fixes for Dictionary.
* Add/Remove/Clear
* GetValue/Contains
* Fix tabs in dictionary.
Count -> getCount
* Fix fields with caps.
* Key -> key
Value -> value
Use m_ for members where appropriate.
Use lowerCamel in linked list.
* Some small fixes/improvements to Dictionary.
* Kick CI.
|
| |
|
| |
Co-authored-by: Yong He <yhe@nvidia.com>
|
| |
|
|
|
|
|
|
|
| |
* Overhaul global inst deduplication and cpp/cuda backend.
* Update IR documentation.
---------
Co-authored-by: Yong He <yhe@nvidia.com>
|
| |
|
| |
Co-authored-by: Yong He <yhe@nvidia.com>
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
* Cleanup refactoring work around the IR builder
We have some long-term goals for the IR that require a more centralized and disciplined set of rules for how IR instructions get created/emitted. I had been working on trying to set things up so that all IR instruction creation goes through a single bottleneck point, but the non-trivial work in that branch was getting drowned out by the sheer volume of cleanup and refactoring changes. This change tries to pull together several of the more important cleanups.
The big pieces are:
* `IRBuilder` and `SharedIRBuilder` now protect their data members and rely on users to initialize them more directly via constructor of an `init()` method. This change affects a *bunch* of sites where `IRBuilder`s were created. I changed use sites to use the constructors whenever possible, and to use `init()` in cases where we had longer-lived builders that needed to be initialized multiple times.
* The insertion location for the `IRBuilder` now uses an encapsulated type called `IRInsertLoc`. This new type can replace what used to be just two `IRInst*` fields in the builder, and also covers some new functionality (if we ever want to take advantage of it). Very little client code cares about this change, but it is still a nice cleanup in terms of making things more explicit.
* The creation of an `IRModule` has been moded *out* of `IRBuilder`, because in practice we `IRBuilder` always wants to be associated with a pre-existing `IRModule` at creation time (via its `SharedIRBuilder`). There is now an `IRModule::create()` operation instead. This required changing the sequencing at many `IRModule` creation sites, since most had been contriving to make an `IRBuilder` first. There were also several cleanups because code had been carelessly using non-reference-counted pointers for `IRModule`s in ways that broke now that `IRModule::create()` always returns a `RefPtr`.
* The core operations to actually allocate memory for IR instructions were moved into `IRModule` (since they interact with the memory pool that the module owns). These *were* called `createEmptyInst()` but have been renamed into `_allocateInst()`. In principle these seem like they should only be needed to be called by the `IRBuilder`, but in practice they are also needed by the IR deserialization logic.
* A few core operations for emitting IR instructions that were associted with `IRBuilder` were moved to actually be methods on `IRBuilder`. First is `_findOrEmitConstant` which is the primary bottleneck for creating simple scalar constant values. Another is `_createInst` (formerly part of the templated `createInstImpl` along with `createInstWithSizeImpl`) which is the main bottleneck for allocation and initialization of any instruction other than a constant (well, the `IRModuleInst` is the other exception...). Finally, there is also `_maybeSetSourceLoc()`, which is obvious to scope inside the `IRBuilder` once it is protecting the source-location info.
Notes:
* The `minSizeInBytes` parameter to `_createInst()` might not actually be needed at all. At this point any `IRInst` subtypes that need data allocated for things other than their operands already get created manually via `_allocateInst` or `_findOrEmitConstant`, so I *think* we could remove that part. I will handle that in a subsequent cleanup if it turns out to be the case.
* There is one IR pass (`slang-ir-string-hash.cpp`) that is using manual `_allocateInst()` instead of going through an `IRBuilder`. It could be easily cleaned up to not do so (and I will probably make that change down the line), but for now I wanted to avoid doing anything that wasn't close to pure refactoring if I could.
* At this point in our design an `IRBuilder` is a very lightweight thing - it basically just owns the insertion location plus a source location to write into instructions. A lot of our code currently treats `IRBuilder`s like they are expensive and/or need to be re-used (which leads to them being used in more mutable/stateful ways). It is quite likely that as we clean up other aspects of the implementation of IR creation/emission we can make `IRBuilder` use feel more lightweight in ways that can streamline and simplify code.
* The next step for this work is to identify the different paths that eventually lead to `_createInst()` being called, and unify them at a single bottleneck operation that can own the decisions around when to create an instruction vs. when to re-use an existing one (rather than those decisions being baked into the various `IRBuilder` subroutines that create instructions of the various subtypes).
* fixup: gcc/clang C++ spec details
|
| |
|
|
|
|
|
|
|
| |
* Add an accessor for IRInst opcode
This main changing is renaming `IRInst::op` over to `IRInst::m_op` and then adds an accessor `IRInst::getOp()` to read it. The rest of the changes are just changing use sites to `getOp` (or to `m_op` in the limited cases where we write to it).
This work is in anticipation of a future change that might need to store an extra bit in the same field as the opcode. It seemed better to do this massive refactoring as a separate PR.
* fixup
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Overview
========
Prior to this change, we had two different code generation strategies for interface/existential types in Slang, that didn't always play nicely together:
* The "legacy" static specialization approach could handle plugging in an arbitrary concrete type for an existential type parameter (including types with resources, etc.), but wouldn't work well with things like a `StructuredBuffer<>` of an interface type, and requires somewhat counter-intuitive layout rules to make work.
* The new dynamic dispatch approach produces simpler, more easily understood layouts by assuming that values of interface type can fit into a fixed number of bytes. The tradeoff there is that it cannot handle types that include resources (only POD types).
The goal of this change is to make it so that the two strategies can co-exist. In particular, in cases where a shader is amenable to both static specialization and dynamic dispatch, the type layouts should agree.
In order to make the type layouts agree, we:
* Declare that *all* values of existential type reserve storage according to the dynamic-dispatch rules (so 16 bytes for the RTTI and witness-table information, plus whatever bytes are needed to story "any value" of a conforming type).
* Then we modify the "legacy" layout rules so that if a value of concrete type can fit in the reserved "any value" space for a given interface, then it is laid out there exactly like the dynamic dispatch rules would do. Otherwise, we fall back to the previous legacy rules (since we don't need to agree with the dynamic-dispatch layout on types that can't be used with dynamic dispatch).
Details
=======
* Renamed `ExistentialBox` to `BoundInterfaceType` to better clarify how it relates to `BindExistentialsType`
* Unconditionally apply the `lowerGenerics` pass during emit, since it is now responsible for aspects of the lowering of existential types when specialization is used.
* Made IR type layout take the target into account, so that the layout of resource types can vary by target (e.g., being POD on some targets, and invalid on others)
* Cleaned up some issues around using global shader parameters as the "key" for their layout information in the global-scope layout (only comes up when there are global-scope `uniform` parameters)
* Made there be a default any-value size (16) instead of making it be an error to leave out. This was the simplest option; we could try to go back to having an error, but we'd need to only issue it if we are sure a type/interface is being used with dynamic dispatch, since static dispatch doesn't have to obey the restrictions.
* Changed lowering of existential types to tuples so that bound interfaces where the concrete type won't fit use a "pseudo-pointer" instead of an "any-value" to hold the payload
* Changed IR type legalization to handle the "pseudo-pointer" case and apply layout information from an interface type over to the payload part when static specialization was used.
* Changed some details of how witness tables were being lowered, so that we didn't have to create "proxy" witness tables for the constraints on associated types (just use the actual requirement entries we generate)
* Changed witness tables so that they know the subtype doing the conforming
* Added logic so that we don't generate pack/unpack logic and witness table wrapper functions for types that are incompatible with any-value/dynamic dispatch for a given interface.
* Changed the core AST-level type layout logic to use the dynamic-dispatch layout in case things fit, and the legacy static specialization case when things don't (while also reserving space for the dynamic-dispatch fields)
* Changed a bunch of test cases for static specialization to properly use the new layout (which introduces new buffers in some cases, and moves data around in others).
Future Work
===========
The experience of trying to reconcile our older way of handling interface-type specialization with our newer model (that supports dynamic dispatch) makes it clear that we really need to make similar changes to our handling of generic type parameters on entry points and at the global scope.
A future change should make it so that a global type parameter is lowered with a type layout similar to a value parameter of interface type, including the RTTI and witness-table pieces, and just leaving out the "any value" piece. A similar translation strategy should apply to entry-point generic parameters (mirroring how we lower generic functions for dynamic dispatch already), and value specialization parameters.
Co-authored-by: Yong He <yonghe@outlook.com>
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
* Fix premake5.lua so it uses the new path needed for OpenCLDebugInfo100.h
* Keep including the includes directory.
* Added the spirv-tools-generated files.
* We don't need to include the spirv/unified1 path because the files needed are actually in the spirv-tools-generated folder.
* Put the build_info.h glslang generated files in external/glslang-generated. Alter premake5.lua to pick up that header.
* First pass at documenting how to build glslang and spirv-tools.
* Improved glsl/spir-v tools README.md
* Added revision.h
* Change how gResources is calculated.
Update about revision.h
* Update docs a little.
* Split out spirv-tools into a separate project for building glslang. This was not necessary on linux, but *is* necessary on windows, because there is a file disassemble.cpp in spirv-tools and in glslang, and this leads to VS choosing only one. With the separate library, the problem is resolved.
* Fix direct-spirv-emit output.
* Update to latest version of spirv headers and spirv-tools.
* Upgrade submodule version of glslang in external.
* Add fPIC to build options of slang-spirv-tools
* WIP adding support for InterlockedAddFp32
* Upgrade slang-binaries to have new glslang.
* Fix issues with Windows slang-glslang binaries, via update of slang-binaries used.
* WIP - atomicAdd. This solution can't work as we can't do (float*) in glsl.
* WIP on atomic float ops.
* Added checking for multiple decls that takes into account __target_intrinsic and __specialized_for_target.
First pass impl of atomic add on float for glsl.
* Split __atomicAdd so extensions are applied appropriately.
* Made Dxc/Fxc support includes.
Use HLSL prelude to pass the path to nvapi
Added -nv-api-path
* Refactor around IncludeHandler and impl of IncludeSystem
* slang-include-handler -> slang-include-system
Have IncludeHandler/Impl defined in slang-preprocessor
* Small comment improvements.
* Document atomic float add addition in target-compatibility.md.
* CUDA float atomic support on RWByteAddressBuffer.
* Add atomic-float-byte-address-buffer-cross.slang
* Removed inappropriate-once.slang - the test is no longer valid when a file is loaded and has a unique identity by default. A test could be made, but would require an API call to create the file (so no unique id).
Improved handling of loadFile - uses uniqueId if has one.
* Work around for testing target overlaps - to avoid exceptions on adding targets.
Simplify PathInfo setup.
Modify single-target-intrinsic.slang - it no longer failed because there were no longer multiple definitions for the same target.
Co-authored-by: Tim Foley <tfoleyNV@users.noreply.github.com>
|
|
|
* Add support for generic load/store on byte-addressed buffers
Introduction
============
The HLSL `*ByteAddressBuffer` types originaly only supported loading/storing `uint` values or vectors of the same, using `Load`/`Load2`/`Load3`/`Load4` or `Store`/`Store2`/`Store3`/`Store4`. More recent versions of dxc have added support for generic `Load<T>` and `Store<T>`, which adds a two main pieces of functionality for users.
The first and more fundamental feature is that `T` can be a type that isn't 32 bits in size (or a vector with elements of such a type), thus exposing a capability that is difficult or impossible to emulate on top of 32-bit load/store (depending on what guarantees `*StructuredBuffer` makes about the atomicity of loads/stores).
The secondary benefit of having a generic `Load<T>` and `Store<T>` is that it becomes possible to load/store types like `float` without manual bit-casting, and also becomes possible to load/store `struct` types so long as all the fields are loadable/storable.
This change adds generic `Load<T>` and `Store<T>` to the Slang standard library definition of byte-address buffers, and tries to bring those same benefits to as many targets as possible. In particular, the secondary benefits become available on all targets, including DXBC: byte-address buffers can be used to directly load/store types other than `uint`, including user-defined `struct` types, so long as all of the fields of those types can be loaded/stored.
The ability to load/store non-32-bit types depends on target capabilities, and so is only available where direct support for those types is available. For 16-bit types like `half` this includes both Vulkan and D3D12 DXIL with appropriate extensions or shader models.
The implementation is somewhat involved, so I will try to explain the pieces here.
Standard Library
================
The changes to the Slang standard library in `hlsl.meta.slang` are pretty simple. We add new `Load<T>` and `Store<T>` generic methods to `*ByteAddressBuffer`, and route them through to a new IR opcode.
Right now the generic `Load<T>` and `Store<T>` do *not* place any constraints on the type `T`, although in practice they should only work when `T` is a fixed-size type that only contains "first class"
uniform/ordinary data (so no resources, unless the target makes resource types first class). Our front-end checking cannot currently represent first-class-ness and validate it (nor can it represent fixed-size-ness), so these gaps will have to do for now.
Rather than directly translate `Load<T>` or `Store<T>` calls into a single instruction, we instead bottleneck them through internal-use-only subroutines. The design choice here is intended to ensure that for some large user-defined type like `MassiveMaterialStruct` we only emit code for loading all of its fields *once* in the output HLSL/GLSL rather than once per load site. While downstream compilers are likely to inline all of this logic anyway, we are doing what we can to avoid generating bloated code.
Emit and C++/CUDA
=================
Over in `slang-emit-c-like.cpp` we translate the new ops into output code in a straightforward way. A call like `obj.Load<Foo>(offset)` will eventually output as a call like `obj.Load<Foo>(offset)` in the generated code, by default.
For the CPU C++ and CUDA C++ codegen paths, this is enough to make a workable implementation, and we add suitable templated `Load<T>` and `Store<T>` declarations to the prelude for those targets.
Legalization
============
For targets like DXBC and GLSL there is no way to emit a load operation for an aggregate type like a `struct`, so we introduce a legalization pass on the IR that will translate our byte-address-buffer load/store ops into multiple ops that are legal for the target.
Scalarization
-------------
The big picture here is easy enough to understand: when we see a load of a `struct` type from a byte-address buffer, we translate that into loads for each of the fields, and then assemble a new `struct` value from the results. We do similar things for arrays, matrices, and optionally for vectors (depending on the target).
Bit Casting
-----------
After scalarization alone, we might have a load of a `float` or a `float3` that isn't legal for D3D11/DXBC, but that *would* be legal if we just loaded a `uint` or `uint3` and then bit-casted it. The legalization pass thus includes an option to allow for loads/stores to be translated to operate on a same-size unsigned integer type and then to bit-cast.
To make this work actually usable, I had to add some more details to the implementation of the bit-cast op during HLSL emit and, more importantly, I had to customize the way that the byte-address buffer load/store ops get emitted to HLSL so that it prefers to use the existing operations like `Load`/`Load2`/`Load3`/`Load4` instead of the generic one, whenever operating on `uint`s or vectors of `uint`.
Translation to Structured Buffers
---------------------------------
Even after scalarizing all byte-address-buffer loads/stores, we still have a problem for GLSL targets, because a single global `buffer` declaration used to back a byte-address buffer can only have a single element type (currently always `uint`), so the granularity of loads/stores it can express is fixed at declaration time. If we want to load a `half` from a byte-address buffer, we need a dedicated `buffer` declaration in the output GLSL with an element type of `half`.
The solution we employ here is to translate all byte-address buffer loads into "equivalent" structured-buffer ops when targetting GLSL. We add logic to find the underlying global shader parameter that was used for a load/store and introduce a new structured-buffer parameter with the desired element type (e.g., `half`) and then rewrite the load/store op to use that buffer instead. We copy layout information from the original buffer to the new one, so that in the output GLSL all the various `buffer`s will use a single `binding` and thus alias "for free."
We don't want to create a new global buffer for every load/store, so we try to cache these "equivalent" structured buffers as best as we can. For the caching I ended up needing a pair to use as a key, so I tweaked the `KeyValuePair<K,V>` type in `core` so that it could actually work for that purpose.
Because we are working at the level of IR instructions instead of stdlib functions at this work I had to add new IR opcodes to represent structured-buffer load/store that only (currently) apply to GLSL.
Layout
======
In order to translate a load/store of a `struct` type into per-field load/store we need a way to access layout information for the types of the fields. Previously layout information has been an AST-level concern that then gets passed down to the IR only when needed and only on global parameters, so layout information isn't always available in cases like this, at the actual load/store point.
As an expedient move for now I've introduced a dedicated module that does IR-level layout and caches its results on the IR types themselves. This approach *only* supports the "natural" layout of a type, and thus is usable for structured buffers and byte-address buffers (or general pointer load/store on targets that support it), but which is *not* usable for things like constant buffer layout.
We've known for a while that the Right Way to do layout going forward is to have an IR-based layout system, and this could either be seen as a first step toward it, or else as a gross short-term hack. YMMV.
Details
=======
The GLSL "extension tracker" stuff around type support needed to be tweaked to recognize that types like `int16_t` aren't actually available by default. I switched it from using a "black list" of unavailable types at initialization time over to using a "white list" of types that are known to always be available without any extensions.
Tests
=====
There are two tests checked in here: one for the basic case of a `struct` type that has fields that should all be natively loadable, and one that stresses 16-bit types. Each test uses both load and store operations.
Future Directions
=================
Right now we translate vector load/store to GLSL as load/store of individual scalars, which means the assumed alignment is just that of the scalars (consistent with HLSL byte-address buffer rules). We could conceivably introduce some controls to allow outputting the vector load/store ops more directly to GLSL (e.g., declaring a `buffer` of `float4`s), which might enable more efficient load/store based on the alignment rules for `buffer`s.
The IR layout work has a number of rough edges, but the most worrying is probably the assumption that all matrices are laid out in row-major order. Slang really needs an overhaul of its handling of matrices and matrix layout, so I don't know if we can do much better in the near term.
At some point the IR-based layout system needs to be reconciled with our current AST-base layout, and we need to figure out how "natural" layout and the currently computed layouts co-exist (in particular, we need to make sure that the IR-based layout and the existing layout logic for structured buffers will agree). This probably needs to come along once we have moved the core layout logic to operate on IR types instead of AST types (a change we keep talking about).
As part of this work I had to touch the implementation of bit-casting for HLSL, and it seems like that logic has some serious gaps. We really ought to consider a separate legalization pass that can turn IR bitcast instructions into the separate ops that a target actually supports so that we can implement `uint64_t`<->`double` and other conersions that are technically achievable, but which are hard to express in HLSL today.
* fixup: missing files
|