4 files changed, 351 insertions, 12 deletions
diff --git a/docs/target-compatibility.md b/docs/target-compatibility.md
index ee5341733..ff63a65a2 100644
--- a/docs/target-compatibility.md
+++ b/docs/target-compatibility.md
@@ -17,9 +17,10 @@ Items with ^ means there is some discussion about support later in the document
 | u/int64_t Intrinsics        |     No       |   No         |   Yes      |     Yes       |    Yes
 | int matrix                  |     Yes      |   Yes        |   No +     |     Yes       |    Yes
 | tex.GetDimension            |     Yes      |   Yes        |   Yes      |     No        |    Yes
-| SM6.0 Wave Intrinsics       |     No       |   Yes        |  Partial   |     Yes       |    No
+| SM6.0 Wave Intrinsics       |     No       |   Yes        |  Partial   |     No +      |    No
 | SM6.0 Quad Intrinsics       |     No       |   Yes        |   No +     |     No        |    No
-| SM6.5 Wave Intrinsics       |     No       |   Yes ^      |   No +     |     Yes       |    No
+| SM6.5 Wave Intrinsics       |     No       |   Yes ^      |   No +     |     No +      |    No
+| WaveMask Intrinsics         |     Yes ^    |   Yes ^      |   Yes +    |     Yes       |    No
 | WaveShuffle                 |     No       |   Limited ^  |   Yes      |     Yes       |    No
 | Tesselation                 |     Yes ^    |   Yes ^      |   No +     |     No        |    No
 | Graphics Pipeline           |     Yes      |   Yes        |   Yes      |     No        |    No
@@ -37,6 +38,7 @@ Items with ^ means there is some discussion about support later in the document
 | Mesh Shader                 |     No       |   No +       |   No +     |     No        |    No
 | `[unroll]`                  |     Yes      |   Yes        |   Yes ^    |     Yes       |    Limited + 
 
+
 ## Half Type
 
 There appears to be a problem writing to a StructuredBuffer containing half on D3D12. D3D12 also appears to have problems doing calculations with half.
@@ -53,10 +55,22 @@ Means can use matrix types containing integer types.
 
 tex.GetDimensions is the GetDimensions method on 'texture' objects. This is not supported on CUDA as CUDA has no equivalent functionality to get these values. GetDimensions work on Buffer resource types on CUDA.
 
+## SM6.0 Wave Intrinsics
+
+CUDA does not currently support the HLSL Wave intrinsics. It does support 'WaveMask' intrinsics that follow the CUDA sync mechanism, where the programmer has to explicitly specify the lanes involved when calling the intrinsic.
+
+Currently there is the intention to look into making Slang generate suitable masks automatically such that regular Wave intrinsics work.
+
 ## SM6.5 Wave Intrinsics
 
 SM6.5 Wave Intrinsics are supported, but requires a downstream DXC compiler that supports SM6.5. As it stands the DXC shipping with windows does not. 
 
+## WaveMask Intrinsics
+
+In order to map better to the CUDA sync/mask model, Slang supports 'WaveMask' intrinsics. They operate in broadly the same way as the Wave intrinsics, but require the programmer to specify the lanes that are involved. To write code that uses wave intrinsics across targets including CUDA, the WaveMask intrinsics must currently be used. For this to work, the masks passed to the WaveMask functions should exactly match the 'Active lanes' concept that HLSL uses, otherwise the result is undefined.
+
+The WaveMask intrinsics are not part of HLSL and are only available in Slang.
+
 ## WaveShuffle
 
 `WaveShuffle` and `WaveBroadcastLaneAt` are Slang specific intrinsic additions to expand the options available around `WaveReadLaneAt`. 
diff --git a/docs/wave-intrinsics.md b/docs/wave-intrinsics.md
new file mode 100644
index 000000000..6a63d628c
--- /dev/null
+++ b/docs/wave-intrinsics.md
@@ -0,0 +1,74 @@
+Wave Intrinsics
+===============
+
+Slang has support for Wave intrinsics introduced to HLSL in SM6.0 and SM6.5. All intrinsics are available on D3D12, and a subset on Vulkan. On CUDA 'WaveMask' intrinsics are introduced which map more directly to the CUDA model of requiring a `mask` of participating lanes. On D3D12 and Vulkan the WaveMask intrinsics can be used, but the mask is effectively ignored. For this to work across targets including CUDA, the mask must be calculated such that it exactly matches the HLSL-defined 'active' lanes, else the behavior is undefined.
+
+Another wrinkle in compatibility is that on GLSL targets such as Vulkan, there is no built-in language support for Matrix versions of Wave intrinsics. Currently this means that Matrix is not a supported type for Wave intrinsics on Vulkan, but may be in the future.
+
+Additional Wave Intrinsics
+==========================
+
+T can be scalar, vector or matrix, except on Vulkan which doesn't support Matrix.
+
+```
+T WaveBroadcastLaneAt(T value, constexpr int lane);
+```
+
+All lanes receive the value held by the lane specified in `lane`. `lane` must be an active lane, otherwise the result is undefined.
+This is a more restrictive version of `WaveReadLaneAt`, which can take a non-constexpr lane, *but* that lane must be the same value for all lanes in the warp, or 'dynamically uniform' as described in the HLSL documentation.
+
+```
+T WaveShuffle(T value, int lane);
+```
+
+Shuffle is a less restrictive version of `WaveReadLaneAt` in that it has no restriction on the lane value - it does *not* require the value to be the same on all lanes.
+
+There isn't explicit support for WaveShuffle in HLSL, and for now it will emit `WaveReadLaneAt`. As it turns out, for a sizable set of hardware WaveReadLaneAt does work correctly when the lane is not 'dynamically uniform'. This is not necessarily the case for hardware in general though, so if targeting HLSL it is important to make sure that this does work correctly on your target hardware.
+
+Our intention is that Slang will support the appropriate HLSL mechanism that makes this work on all hardware when it's available.  
+
+```
+void AllMemoryBarrierWithWaveSync();
+```
+
+Synchronizes all lanes to the same AllMemoryBarrierWithWaveSync in program flow. Orders all memory accesses such that writes before the barrier are visible to accesses after it.
+
+```
+void GroupMemoryBarrierWithWaveSync();
+```
+
+Synchronizes all lanes to the same GroupMemoryBarrierWithWaveSync in program flow. Orders group shared memory accesses such that writes before the barrier are visible to accesses after it.
+
+Wave Mask Intrinsics
+====================
+
+CUDA has a different programming model for inter warp/wave communication based around masks of active lanes. This is because the CUDA programming model allows for divergence that is more granular than just on program flow, and because there isn't implied reconvergence at the end of a conditional.
+
+In the future Slang may have the capability to work out the masks required such that the regular HLSL Wave intrinsics work. As it stands there does not appear to be any way to implement the regular Wave intrinsics directly. To work around this problem we introduce 'WaveMask' intrinsics, which are essentially the same as the regular HLSL Wave intrinsics with the first parameter as the WaveMask which identifies the participating lanes. 
+
+The WaveMask intrinsics will work across targets, but *only* if on CUDA targets the mask captures exactly the same lanes as the 'Active' lanes concept in HLSL. If the masks deviate then the behavior is undefined. On non CUDA based targets currently the mask is ignored. This behavior may change on GLSL which has an extension to support a more CUDA like behavior.  
+
+Most of the `WaveMask` functions are identical to the regular Wave intrinsics, but they take a WaveMask as the first parameter, and the intrinsic name starts with `WaveMask`. 
+
+```
+WaveMask WaveGetConvergedMask();
+```
+
+Gets the mask of lanes which are converged within the Wave. Note that this is *not* the same as Active threads, and may be some subset of that. It is equivalent to the `__activemask()` in CUDA.
+
+On non CUDA targets the function will return all lanes as active - even though this is not the case. This is 'ok' in so far as on non CUDA targets the mask is ignored. It is *not* okay if the code uses the value other than as a superset of the 'really converged' lanes. For example, testing the bits and changing behavior would likely not work correctly on non CUDA targets.
+
+```
+void AllMemoryBarrierWithWaveMaskSync(WaveMask mask);
+```
+
+Same as AllMemoryBarrierWithWaveSync but takes a mask of active lanes to sync with. 
+
+```
+void GroupMemoryBarrierWithWaveMaskSync(WaveMask mask);
+```
+
+Same as GroupMemoryBarrierWithWaveSync but takes a mask of active lanes to sync with. 
+ 
+ 
+ 
+\ No newline at end of file
diff --git a/source/slang/hlsl.meta.slang b/source/slang/hlsl.meta.slang
index f096a125e..73cda7761 100644
--- a/source/slang/hlsl.meta.slang
+++ b/source/slang/hlsl.meta.slang
@@ -2484,7 +2484,7 @@ matrix<T, N, M> trunc(matrix<T, N, M> x)
 typedef uint WaveMask;
 
 __target_intrinsic(cuda, "__activemask()")
-WaveMask WaveGetActiveMask() { return 0xffffffff; }
+WaveMask WaveGetConvergedMask() { return 0xffffffff; }
 
 __glsl_extension(GL_KHR_shader_subgroup_vote)
 __spirv_version(1.3)
@@ -2534,19 +2534,42 @@ __glsl_extension(GL_KHR_shader_subgroup_basic)
 __spirv_version(1.3)
 __target_intrinsic(glsl, "subgroupBarrier()")
 __target_intrinsic(hlsl, "AllMemoryBarrier()")
-void WaveMaskSync(WaveMask mask);
+void AllMemoryBarrierWithWaveMaskSync(WaveMask mask);
 
 // On GLSL, it appears we can't use subgroupMemoryBarrierShared, because it only implies a memory ordering, it does not
 // imply convergence. For subgroupBarrier we have from the docs..
 // "The function subgroupBarrier() enforces that all active invocations within a subgroup must execute this function before any
 // are allowed to continue their execution"
-
+// TODO(JS):
+// It's not entirely clear what to do here on HLSL. 
+// Reading the dxc wiki (https://github.com/Microsoft/DirectXShaderCompiler/wiki/Wave-Intrinsics), we have statements like:
+//    ... these intrinsics enable the elimination of barrier constructs when the scope of synchronization is within the width of the SIMD processor.
+//    Wave: A set of lanes executed simultaneously in the processor. No explicit barriers are required to guarantee that they execute in parallel.
+// Which seems to imply at least some memory barriers like Shared might not be needed.
+//
+// The barrier is left here though, because not only does the barrier make writes from before the barrier visible across the wave afterwards, it's
+// also there to inform the compiler about the order in which reads and writes can take place. This might seem silly because the 'Active' lanes
+// aspect of HLSL seems to make everything run in lock step - but that's not quite so, it only has to appear that way as far as the programmer's
+// model is concerned - divergence could potentially still happen.
 __target_intrinsic(cuda, "__syncwarp($0)")
 __glsl_extension(GL_KHR_shader_subgroup_basic)
 __spirv_version(1.3)
 __target_intrinsic(glsl, "subgroupBarrier()")
 __target_intrinsic(hlsl, "GroupMemoryBarrier()")
-void WaveMaskSharedSync(WaveMask mask);
+void GroupMemoryBarrierWithWaveMaskSync(WaveMask mask);
+
+
+__glsl_extension(GL_KHR_shader_subgroup_basic)
+__spirv_version(1.3)
+__target_intrinsic(glsl, "subgroupBarrier()")
+__target_intrinsic(hlsl, "AllMemoryBarrier()")
+void AllMemoryBarrierWithWaveSync();
+
+__glsl_extension(GL_KHR_shader_subgroup_basic)
+__spirv_version(1.3)
+__target_intrinsic(glsl, "subgroupBarrier()")
+__target_intrinsic(hlsl, "GroupMemoryBarrier()")
+void GroupMemoryBarrierWithWaveSync();
 
 // NOTE! WaveMaskBroadcastLaneAt is *NOT* standard HLSL
 // It is provided as access to subgroupBroadcast which can only take a
@@ -2624,6 +2647,222 @@ __target_intrinsic(cuda, "__popc(__ballot_sync($0, $1)  & _getLaneLtMask())")
 __target_intrinsic(hlsl, "WavePrefixCountBits($1)")
 uint WaveMaskPrefixCountBits(WaveMask mask, bool value);
 
+// Across lane ops
+
+__generic<T : __BuiltinIntegerType>
+__glsl_extension(GL_KHR_shader_subgroup_arithmetic)
+__spirv_version(1.3)
+__target_intrinsic(glsl, "subgroupAnd($1)")
+__target_intrinsic(cuda, "_waveAnd($0, $1)")
+__target_intrinsic(hlsl, "WaveActiveBitAnd($1)")
+T WaveMaskBitAnd(WaveMask mask, T expr);
+__generic<T : __BuiltinIntegerType, let N : int>
+__glsl_extension(GL_KHR_shader_subgroup_arithmetic)
+__spirv_version(1.3)
+__target_intrinsic(glsl, "subgroupAnd($1)")
+__target_intrinsic(cuda, "_waveAndMultiple($0, $1)")
+__target_intrinsic(hlsl, "WaveActiveBitAnd($1)")
+vector<T,N> WaveMaskBitAnd(WaveMask mask, vector<T,N> expr);
+__generic<T : __BuiltinIntegerType, let N : int, let M : int>
+__target_intrinsic(cuda, "_waveAndMultiple($0, $1)")
+__target_intrinsic(hlsl, "WaveActiveBitAnd($1)")
+matrix<T,N,M> WaveMaskBitAnd(WaveMask mask, matrix<T,N,M> expr);
+
+__generic<T : __BuiltinIntegerType>
+__glsl_extension(GL_KHR_shader_subgroup_arithmetic)
+__spirv_version(1.3)
+__target_intrinsic(glsl, "subgroupOr($1)")
+__target_intrinsic(cuda, "_waveOr($0, $1)")
+__target_intrinsic(hlsl, "WaveActiveBitOr($1)")
+T WaveMaskBitOr(WaveMask mask, T expr);
+__generic<T : __BuiltinIntegerType, let N : int>
+__glsl_extension(GL_KHR_shader_subgroup_arithmetic)
+__spirv_version(1.3)
+__target_intrinsic(glsl, "subgroupOr($1)")
+__target_intrinsic(cuda, "_waveOrMultiple($0, $1)")
+__target_intrinsic(hlsl, "WaveActiveBitOr($1)")
+vector<T,N> WaveMaskBitOr(WaveMask mask, vector<T,N> expr);
+__generic<T : __BuiltinIntegerType, let N : int, let M : int>
+__target_intrinsic(cuda, "_waveOrMultiple($0, $1)")
+__target_intrinsic(hlsl, "WaveActiveBitOr($1)")
+matrix<T,N,M> WaveMaskBitOr(WaveMask mask, matrix<T,N,M> expr);
+
+__generic<T : __BuiltinIntegerType>
+__glsl_extension(GL_KHR_shader_subgroup_arithmetic)
+__spirv_version(1.3)
+__target_intrinsic(glsl, "subgroupXor($1)")
+__target_intrinsic(cuda, "_waveXor($0, $1)")
+__target_intrinsic(hlsl, "WaveActiveBitXor($1)")
+T WaveMaskBitXor(WaveMask mask, T expr);
+__generic<T : __BuiltinIntegerType, let N : int> 
+__glsl_extension(GL_KHR_shader_subgroup_arithmetic)
+__spirv_version(1.3)
+__target_intrinsic(glsl, "subgroupXor($1)")
+__target_intrinsic(cuda, "_waveXorMultiple($0, $1)")
+__target_intrinsic(hlsl, "WaveActiveBitXor($1)")
+vector<T,N> WaveMaskBitXor(WaveMask mask, vector<T,N> expr);
+__generic<T : __BuiltinIntegerType, let N : int, let M : int>
+__target_intrinsic(cuda, "_waveXorMultiple($0, $1)")
+__target_intrinsic(hlsl, "WaveActiveBitXor($1)")
+matrix<T,N,M> WaveMaskBitXor(WaveMask mask, matrix<T,N,M> expr);
+
+__generic<T : __BuiltinArithmeticType>
+__glsl_extension(GL_KHR_shader_subgroup_arithmetic)
+__spirv_version(1.3)
+__target_intrinsic(glsl, "subgroupMax($1)")
+__target_intrinsic(cuda, "_waveMax($0, $1)")
+__target_intrinsic(hlsl, "WaveActiveMax($1)")
+T WaveMaskMax(WaveMask mask, T expr);
+__generic<T : __BuiltinArithmeticType, let N : int>
+__glsl_extension(GL_KHR_shader_subgroup_arithmetic)
+__spirv_version(1.3)
+__target_intrinsic(glsl, "subgroupMax($1)")
+__target_intrinsic(cuda, "_waveMaxMultiple($0, $1)")
+__target_intrinsic(hlsl, "WaveActiveMax($1)")
+vector<T,N> WaveMaskMax(WaveMask mask, vector<T,N> expr);
+__generic<T : __BuiltinArithmeticType, let N : int, let M : int>
+__target_intrinsic(cuda, "_waveMaxMultiple($0, $1)")
+__target_intrinsic(hlsl, "WaveActiveMax($1)")
+matrix<T,N,M> WaveMaskMax(WaveMask mask, matrix<T,N,M> expr);
+
+__generic<T : __BuiltinArithmeticType>
+__glsl_extension(GL_KHR_shader_subgroup_arithmetic)
+__spirv_version(1.3)
+__target_intrinsic(glsl, "subgroupMin($1)")
+__target_intrinsic(cuda, "_waveMin($0, $1)")
+__target_intrinsic(hlsl, "WaveActiveMin($1)")
+T WaveMaskMin(WaveMask mask, T expr);
+__generic<T : __BuiltinArithmeticType, let N : int>
+__glsl_extension(GL_KHR_shader_subgroup_arithmetic)
+__spirv_version(1.3)
+__target_intrinsic(glsl, "subgroupMin($1)")
+__target_intrinsic(cuda, "_waveMinMultiple($0, $1)")
+__target_intrinsic(hlsl, "WaveActiveMin($1)")
+vector<T,N> WaveMaskMin(WaveMask mask, vector<T,N> expr);
+__generic<T : __BuiltinArithmeticType, let N : int, let M : int>
+__target_intrinsic(cuda, "_waveMinMultiple($0, $1)")
+__target_intrinsic(hlsl, "WaveActiveMin($1)")
+matrix<T,N,M> WaveMaskMin(WaveMask mask, matrix<T,N,M> expr);
+
+__generic<T : __BuiltinArithmeticType>
+__glsl_extension(GL_KHR_shader_subgroup_arithmetic)
+__spirv_version(1.3)
+__target_intrinsic(glsl, "subgroupMul($1)")
+__target_intrinsic(cuda, "_waveProduct($0, $1)")
+__target_intrinsic(hlsl, "WaveActiveProduct($1)")
+T WaveMaskProduct(WaveMask mask, T expr);
+__generic<T : __BuiltinArithmeticType, let N : int>
+__glsl_extension(GL_KHR_shader_subgroup_arithmetic)
+__spirv_version(1.3)
+__target_intrinsic(glsl, "subgroupMul($1)")
+__target_intrinsic(cuda, "_waveProductMultiple($0, $1)")
+__target_intrinsic(hlsl, "WaveActiveProduct($1)")
+vector<T,N> WaveMaskProduct(WaveMask mask, vector<T,N> expr);
+__generic<T : __BuiltinArithmeticType, let N : int, let M : int>
+__target_intrinsic(cuda, "_waveProductMultiple($0, $1)")
+__target_intrinsic(hlsl, "WaveActiveProduct($1)")
+matrix<T,N,M> WaveMaskProduct(WaveMask mask, matrix<T,N,M> expr);
+
+__generic<T : __BuiltinArithmeticType>
+__glsl_extension(GL_KHR_shader_subgroup_arithmetic)
+__spirv_version(1.3)
+__target_intrinsic(glsl, "subgroupAdd($1)")
+__target_intrinsic(cuda, "_waveSum($0, $1)")
+__target_intrinsic(hlsl, "WaveActiveSum($1)")
+T WaveMaskSum(WaveMask mask, T expr);
+__generic<T : __BuiltinArithmeticType, let N : int>
+__glsl_extension(GL_KHR_shader_subgroup_arithmetic)
+__spirv_version(1.3)
+__target_intrinsic(glsl, "subgroupAdd($1)")
+__target_intrinsic(cuda, "_waveSumMultiple($0, $1)")
+__target_intrinsic(hlsl, "WaveActiveSum($1)")
+vector<T,N> WaveMaskSum(WaveMask mask, vector<T,N> expr);
+__generic<T : __BuiltinArithmeticType, let N : int, let M : int>
+__target_intrinsic(cuda, "_waveSumMultiple($0, $1)")
+__target_intrinsic(hlsl, "WaveActiveSum($1)")
+matrix<T,N,M> WaveMaskSum(WaveMask mask, matrix<T,N,M> expr);
+
+__generic<T : __BuiltinType>
+__glsl_extension(GL_KHR_shader_subgroup_vote)
+__spirv_version(1.3)
+__target_intrinsic(glsl, "subgroupAllEqual($1)")
+__cuda_sm_version(7.0)
+__target_intrinsic(cuda, "_waveAllEqual($0, $1)")
+__target_intrinsic(hlsl, "WaveActiveAllEqual($1)")
+bool WaveMaskAllEqual(WaveMask mask, T value);
+__generic<T : __BuiltinType, let N : int> 
+__glsl_extension(GL_KHR_shader_subgroup_vote)
+__spirv_version(1.3)
+__target_intrinsic(glsl, "subgroupAllEqual($1)")
+__cuda_sm_version(7.0)
+__target_intrinsic(cuda, "_waveAllEqualMultiple($0, $1)")
+__target_intrinsic(hlsl, "WaveActiveAllEqual($1)")
+bool WaveMaskAllEqual(WaveMask mask, vector<T,N> value);
+__generic<T : __BuiltinType, let N : int, let M : int>
+__cuda_sm_version(7.0)
+__target_intrinsic(cuda, "_waveAllEqualMultiple($0, $1)")
+__target_intrinsic(hlsl, "WaveActiveAllEqual($1)")
+bool WaveMaskAllEqual(WaveMask mask, matrix<T,N,M> value);
+
+// Prefix
+
+__generic<T : __BuiltinArithmeticType>
+__glsl_extension(GL_KHR_shader_subgroup_arithmetic)
+__spirv_version(1.3)
+__target_intrinsic(glsl, "subgroupExclusiveMul($1)")
+__target_intrinsic(cuda, "_wavePrefixProduct($0, $1)")
+__target_intrinsic(hlsl, "WavePrefixProduct($1)")
+T WaveMaskPrefixProduct(WaveMask mask, T expr);
+__generic<T : __BuiltinArithmeticType, let N : int>
+__glsl_extension(GL_KHR_shader_subgroup_arithmetic)
+__spirv_version(1.3)
+__target_intrinsic(glsl, "subgroupExclusiveMul($1)")
+__target_intrinsic(cuda, "_wavePrefixProductMultiple($0, $1)")
+__target_intrinsic(hlsl, "WavePrefixProduct($1)")
+vector<T,N> WaveMaskPrefixProduct(WaveMask mask, vector<T,N> expr);
+__generic<T : __BuiltinArithmeticType, let N : int, let M : int>
+__target_intrinsic(cuda, "_wavePrefixProductMultiple($0, $1)")
+__target_intrinsic(hlsl, "WavePrefixProduct($1)")
+matrix<T,N,M> WaveMaskPrefixProduct(WaveMask mask, matrix<T,N,M> expr);
+
+__generic<T : __BuiltinArithmeticType>
+__glsl_extension(GL_KHR_shader_subgroup_arithmetic)
+__spirv_version(1.3)
+__target_intrinsic(glsl, "subgroupExclusiveAdd($1)")
+__target_intrinsic(cuda, "_wavePrefixSum($0, $1)")
+__target_intrinsic(hlsl, "WavePrefixSum($1)")
+T WaveMaskPrefixSum(WaveMask mask, T expr);
+__generic<T : __BuiltinArithmeticType, let N : int>
+__glsl_extension(GL_KHR_shader_subgroup_arithmetic)
+__spirv_version(1.3)
+__target_intrinsic(glsl, "subgroupExclusiveAdd($1)")
+__target_intrinsic(cuda, "_wavePrefixSumMultiple($0, $1)")
+__target_intrinsic(hlsl, "WavePrefixSum($1)")
+vector<T,N> WaveMaskPrefixSum(WaveMask mask, vector<T,N> expr);
+__generic<T : __BuiltinArithmeticType, let N : int, let M : int>
+__target_intrinsic(cuda, "_wavePrefixSumMultiple($0, $1)")
+__target_intrinsic(hlsl, "WavePrefixSum($1)")
+matrix<T,N,M> WaveMaskPrefixSum(WaveMask mask, matrix<T,N,M> expr);
+
+__generic<T : __BuiltinType>
+__glsl_extension(GL_KHR_shader_subgroup_ballot)
+__spirv_version(1.3)
+__target_intrinsic(glsl, "subgroupBroadcastFirst($1)")
+__target_intrinsic(cuda, "_waveReadFirst($0, $1)")
+T WaveMaskReadLaneFirst(WaveMask mask, T expr);
+__generic<T : __BuiltinType, let N : int>
+__glsl_extension(GL_KHR_shader_subgroup_ballot)
+__spirv_version(1.3)
+__target_intrinsic(glsl, "subgroupBroadcastFirst($1)")
+__target_intrinsic(cuda, "_waveReadFirstMultiple($0, $1)")
+vector<T,N> WaveMaskReadLaneFirst(WaveMask mask, vector<T,N> expr);
+__generic<T : __BuiltinType, let N : int, let M : int>
+__target_intrinsic(cuda, "_waveReadFirstMultiple($0, $1)")
+matrix<T,N,M> WaveMaskReadLaneFirst(WaveMask mask, matrix<T,N,M> expr);
+
+
+
+
 // Shader model 6.0 stuff
 
 // Information for GLSL wave/subgroup support
@@ -2645,6 +2884,7 @@ __generic<T : __BuiltinType> T QuadReadAcrossDiagonal(T localValue);
 __generic<T : __BuiltinType, let N : int> vector<T,N> QuadReadAcrossDiagonal(vector<T,N> localValue);
 __generic<T : __BuiltinType, let N : int, let M : int> matrix<T,N,M> QuadReadAcrossDiagonal(matrix<T,N,M> localValue);
 
+
 __generic<T : __BuiltinIntegerType>
 __glsl_extension(GL_KHR_shader_subgroup_arithmetic)
 __spirv_version(1.3)
@@ -2820,6 +3060,8 @@ __target_intrinsic(glsl, "subgroupElect()")
 __target_intrinsic(cuda, "_waveIsFirstLane()")
 bool WaveIsFirstLane();
 
+// Prefix
+
 __generic<T : __BuiltinArithmeticType>
 __glsl_extension(GL_KHR_shader_subgroup_arithmetic)
 __spirv_version(1.3)
diff --git a/tests/hlsl-intrinsic/wave-mask/wave.slang b/tests/hlsl-intrinsic/wave-mask/wave.slang
index 6b641906d..346940cb2 100644
--- a/tests/hlsl-intrinsic/wave-mask/wave.slang
+++ b/tests/hlsl-intrinsic/wave-mask/wave.slang
@@ -14,7 +14,7 @@ groupshared int sharedMem[32];
 
 int exclusivePrefixSum(WaveMask mask, int index, int waveLaneId, int originalValue, int elementCount)
 {
-    WaveMask localMask = WaveMaskBallot(mask, waveLaneId < elementCount);
+    WaveMask localMask = WaveMaskBallot(mask, index < elementCount);
     
     sharedMem[index] = 0;
     
@@ -23,7 +23,7 @@ int exclusivePrefixSum(WaveMask mask, int index, int waveLaneId, int originalVal
         int temp = 0;
         int val = originalValue;        
         
-        for(int  i = 1; i < elementCount; i += i)
+        for(int i = 1; i < elementCount; i += i)
         {
             int temp = WaveMaskShuffle(localMask, val, waveLaneId - i);            
             if(waveLaneId >= i)
@@ -37,25 +37,34 @@ int exclusivePrefixSum(WaveMask mask, int index, int waveLaneId, int originalVal
         
         // Write to shared memory
         sharedMem[index] = val;
-        
-        // Syncronizes on the mask, and ensures memory fence for shared data write
-        WaveMaskSharedSync(localMask); 
         return val;
     }    
     
     return 0;    
 }
 
+// It matters how kernels with WaveMask intrinsics are launched(!). 
+// TODO(JS):
+// If I launch with a numthreads amount that is not the size of the Wave on the device, then some 
+// lanes will not be executing at startup, and the kernel will have to know that is the case. 
+// This works currently though because the mask is only used
+// on CUDA, and its Wave size is 32.
 [numthreads(32, 1, 1)]
 void computeMain(uint3 dispatchThreadID : SV_DispatchThreadID)
 {
+    // Assumes all threads in the Wave are active at start.
+    WaveMask waveMask = ~WaveMask(0);
+    
     int index = int(dispatchThreadID.x);
     const int waveLaneId = WaveGetLaneIndex();
     
     const int value = inputBuffer[index];
     const int elementCount = 9;
     
-    exclusivePrefixSum(WaveGetActiveMask(), index, waveLaneId, value, elementCount);
+    exclusivePrefixSum(waveMask, index, waveLaneId, value, elementCount);
+    
+    // We don't read from any other lane, so we don't actually need any sync
+    //GroupMemoryBarrierWithWaveMaskSync(waveMask); 
     
     // It returns the result, but we are going to read from shared memory, to check that aspect worked
     int prefixValue = sharedMem[index];