Feature/wave mask review (#1325)

* Fix issues in wave-mask/wave.slang tests. WaveGetActiveMask -> WaveGetConvergedMask. Update target-compatibility.md
* First pass at wave-intrinsics.md documentation. Write up around WaveMaskSharedSync.
* Added more of the Wave intrinsics as WaveMask intrinsics. Improvements to documentation around wave-intrinsics.
author: jsmall-nvidia <jsmall@nvidia.com> 2020-04-20 13:03:18 -0400
committer: GitHub <noreply@github.com> 2020-04-20 13:03:18 -0400
commit: c4441d804aaa97bad7ff01bef505491d30bbc046 (patch)
tree: ac251ab76ccb8fd3a07a7dd61f22dd4fc7c2bd41 /docs
parent: acb1c39b4e29358cf496c07dc325e52f39be71f4 (diff)
2 files changed, 90 insertions, 2 deletions
diff --git a/docs/target-compatibility.md b/docs/target-compatibility.md
index ee5341733..ff63a65a2 100644
--- a/docs/target-compatibility.md
+++ b/docs/target-compatibility.md
@@ -17,9 +17,10 @@ Items with ^ means there is some discussion about support later in the document
 | u/int64_t Intrinsics        |     No       |   No         |   Yes      |     Yes       |    Yes
 | int matrix                  |     Yes      |   Yes        |   No +     |     Yes       |    Yes
 | tex.GetDimension            |     Yes      |   Yes        |   Yes      |     No        |    Yes
-| SM6.0 Wave Intrinsics       |     No       |   Yes        |  Partial   |     Yes       |    No
+| SM6.0 Wave Intrinsics       |     No       |   Yes        |  Partial   |     No +      |    No
 | SM6.0 Quad Intrinsics       |     No       |   Yes        |   No +     |     No        |    No
-| SM6.5 Wave Intrinsics       |     No       |   Yes ^      |   No +     |     Yes       |    No
+| SM6.5 Wave Intrinsics       |     No       |   Yes ^      |   No +     |     No +      |    No
+| WaveMask Intrinsics         |     Yes ^    |   Yes ^      |   Yes +    |     Yes       |    No
 | WaveShuffle                 |     No       |   Limited ^  |   Yes      |     Yes       |    No
 | Tesselation                 |     Yes ^    |   Yes ^      |   No +     |     No        |    No
 | Graphics Pipeline           |     Yes      |   Yes        |   Yes      |     No        |    No
@@ -37,6 +38,7 @@ Items with ^ means there is some discussion about support later in the document
 | Mesh Shader                 |     No       |   No +       |   No +     |     No        |    No
 | `[unroll]`                  |     Yes      |   Yes        |   Yes ^    |     Yes       |    Limited + 
 
+
 ## Half Type
 
 There appears to be a problem writing to a StructuredBuffer containing half on D3D12. D3D12 also appears to have problems doing calculations with half.
@@ -53,10 +55,22 @@ Means can use matrix types containing integer types.
 
 tex.GetDimensions is the GetDimensions method on 'texture' objects. This is not supported on CUDA as CUDA has no equivalent functionality to get these values. GetDimensions work on Buffer resource types on CUDA.
 
+## SM6.0 Wave Intrinsics
+
+CUDA does not currently support the HLSL Wave intrinsics. It does support 'WaveMask' intrinsics that follow the CUDA sync mechanism, where the programmer has to explicitly specify the lanes involved when calling the intrinsic.
+
+The current intention is to look into making Slang generate suitable masks automatically, so that the regular Wave intrinsics work.
+
 ## SM6.5 Wave Intrinsics
 
 SM6.5 Wave Intrinsics are supported, but requires a downstream DXC compiler that supports SM6.5. As it stands the DXC shipping with windows does not. 
 
+## WaveMask Intrinsics
+
+In order to map better to the CUDA sync/mask model, Slang supports 'WaveMask' intrinsics. They operate in broadly the same way as the Wave intrinsics, but require the programmer to specify the lanes that are involved. To write code that uses wave intrinsics across targets including CUDA, the WaveMask intrinsics must currently be used. For this to work, the masks passed to the WaveMask functions should exactly match the 'Active lanes' concept that HLSL uses, otherwise the result is undefined.
+
+The WaveMask intrinsics are not part of HLSL and are only available on Slang.
+
 ## WaveShuffle
 
 `WaveShuffle` and `WaveBroadcastLaneAt` are Slang specific intrinsic additions to expand the options available around `WaveReadLaneAt`. 
diff --git a/docs/wave-intrinsics.md b/docs/wave-intrinsics.md
new file mode 100644
index 000000000..6a63d628c
--- /dev/null
+++ b/docs/wave-intrinsics.md
@@ -0,0 +1,74 @@
+Wave Intrinsics
+===============
+
+Slang has support for the Wave intrinsics introduced to HLSL in SM6.0 and SM6.5. All intrinsics are available on D3D12, and a subset on Vulkan. On CUDA, 'WaveMask' intrinsics are introduced which map more directly to the CUDA model of requiring a `mask` of participating lanes. On D3D12 and Vulkan the WaveMask intrinsics can be used, but the mask is effectively ignored. For this to work across targets including CUDA, the mask must be calculated such that it exactly matches the HLSL-defined 'active' lanes, else the behavior is undefined.
+
+Another wrinkle in compatibility is that on GLSL targets such as Vulkan, there is no built-in language support for Matrix versions of Wave intrinsics. Currently this means that Matrix is not a supported type for Wave intrinsics on Vulkan, but it may be in the future.
+
+Additional Wave Intrinsics
+==========================
+
+T can be scalar, vector or matrix, except on Vulkan which doesn't support Matrix.
+
+```
+T WaveBroadcastLaneAt(T value, constexpr int lane);
+```
+
+All lanes receive the value specified in lane. Lane must be an active lane, otherwise the result is undefined. 
+This is a more restrictive version of `WaveReadLaneAt`, which can take a non-constexpr lane, *but* the lane must have the same value for all lanes in the warp ('dynamically uniform', as described in the HLSL documentation).
+
+```
+T WaveShuffle(T value, int lane);
+```
+
+Shuffle is a less restrictive version of `WaveReadLaneAt` in that it has no restriction on the lane value: it does *not* require the value to be the same on all lanes.
+
+There isn't explicit support for WaveShuffle in HLSL, and for now it will emit `WaveReadLaneAt`. As it turns out, for a sizable set of hardware WaveReadLaneAt does work correctly when the lane is not 'dynamically uniform'. This is not necessarily the case for hardware in general though, so if targeting HLSL it is important to make sure that this does work correctly on your target hardware.
+
+Our intention is that Slang will support the appropriate HLSL mechanism that makes this work on all hardware when it's available.  
+
+```
+void AllMemoryBarrierWithWaveSync();
+```
+
+Synchronizes all lanes to the same AllMemoryBarrierWithWaveSync in program flow. Orders all memory accesses such that writes before the barrier can be seen by accesses after it.
+
+```
+void GroupMemoryBarrierWithWaveSync();
+```
+
+Synchronizes all lanes to the same GroupMemoryBarrierWithWaveSync in program flow. Orders group shared memory accesses such that writes before the barrier can be seen by accesses after it.
+
+Wave Mask Intrinsics
+====================
+
+CUDA has a different programming model for inter warp/wave communication based around masks of active lanes. This is because the CUDA programming model allows for divergence that is more granular than just on program flow, and because there isn't implied reconvergence at the end of a conditional.
+
+In the future Slang may have the capability to work out the masks required such that the regular HLSL Wave intrinsics work. As it stands there does not appear to be any way to implement the regular Wave intrinsics directly. To work around this problem we introduce 'WaveMask' intrinsics, which are essentially the same as the regular HLSL Wave intrinsics with the first parameter as the WaveMask which identifies the participating lanes. 
+
+The WaveMask intrinsics will work across targets, but *only* if on CUDA targets the mask captures exactly the same lanes as the 'Active' lanes concept in HLSL. If the masks deviate then the behavior is undefined. On non CUDA based targets currently the mask is ignored. This behavior may change on GLSL which has an extension to support a more CUDA like behavior.  
+
+Most of the `WaveMask` functions are identical to the regular Wave intrinsics, but they take a WaveMask as the first parameter, and the intrinsic name starts with `WaveMask`. 
+
+```
+WaveMask GetConvergedMask();
+```
+
+Gets the mask of lanes which are converged within the Wave. Note that this is *not* the same as Active threads, and may be some subset of that. It is equivalent to the `__activemask()` in CUDA.
+
+On non CUDA targets the function will return all lanes as active, even though this is not the case. This is 'ok' insofar as the mask is ignored on non CUDA targets. It is *not* okay if the code uses the value as anything other than a superset of the 'really converged' lanes. For example, testing the bits and changing behavior would likely not work correctly on non CUDA targets.
+
+```
+void AllMemoryBarrierWithWaveMaskSync(WaveMask mask);
+```
+
+Same as AllMemoryBarrierWithWaveSync but takes a mask of active lanes to sync with. 
+
+```
+void GroupMemoryBarrierWithWaveMaskSync(WaveMask mask);
+```
+
+Same as GroupMemoryBarrierWithWaveSync but takes a mask of active lanes to sync with. 
+ 
+ 
+ 
+\ No newline at end of file
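
Reviewer note (not part of the commit): the difference between the three lane-read intrinsics described in the new `wave-intrinsics.md` can be sketched roughly as below. This is a minimal illustration, not code from the repository. The names `Params`, `sourceLane`, `gInput`, `gOutput` and `computeMain` are hypothetical, and the sketch assumes a fully populated wave with no divergent control flow before the calls, so that every lane read from is active.

```hlsl
// Illustrative sketch only. All resource and entry point names are hypothetical.
cbuffer Params
{
    // Comes from a constant buffer, so it has the same value on every lane
    // ('dynamically uniform'). Assumed to name an active lane.
    int sourceLane;
};

RWStructuredBuffer<int>  gInput;
RWStructuredBuffer<int3> gOutput;

[numthreads(32, 1, 1)]
void computeMain(uint3 tid : SV_DispatchThreadID)
{
    int value = gInput[tid.x];

    // Lane index is a compile-time constant: the most restrictive form.
    int fromLane0 = WaveBroadcastLaneAt(value, 0);

    // Lane index is only known at runtime, but is identical on all lanes.
    int fromUniform = WaveReadLaneAt(value, sourceLane);

    // Lane index differs per lane (each lane reads its neighbouring partner),
    // so the unrestricted WaveShuffle is needed. Assumes the partner lane is active.
    int partnerLane = int(WaveGetLaneIndex() ^ 1);
    int fromPartner = WaveShuffle(value, partnerLane);

    gOutput[tid.x] = int3(fromLane0, fromUniform, fromPartner);
}
```

As the documentation in the diff notes, `WaveShuffle` is currently emitted as `WaveReadLaneAt` when targeting HLSL, so the last call should be checked on the hardware actually being targeted.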