Initial partial support for WaveXXX intrinsics on CUDA (#1228)

* Start work on wave intrinsics for CUDA.
* Add preliminary CUDA support for some Wave intrinsics. Document the issue around WaveGetLaneIndex.
author: jsmall-nvidia <jsmall@nvidia.com> 2020-02-19 14:16:38 -0500
committer: GitHub <noreply@github.com> 2020-02-19 11:16:38 -0800
commit: 46a1b5f58a528bf1cd2fa2907234a9090cd7ba10 (patch)
tree: f6581768499187bf1572079d65b99c15a95e7743 /docs/cuda-target.md
parent: 1d9152bd2d0b1234680ce6a9f7ef940d7f179e9a (diff)
1 files changed, 12 insertions, 0 deletions
diff --git a/docs/cuda-target.md b/docs/cuda-target.md
index 743e9830c..a9b35d735 100644
--- a/docs/cuda-target.md
+++ b/docs/cuda-target.md
@@ -18,6 +18,7 @@ These limitations apply to Slang transpiling to CUDA.
 * Samplers are not separate objects in CUDA - they are combined into a single 'TextureObject'. So samplers are effectively ignored on CUDA targets. 
 * Whilst there is tex1Dfetch there are no equivalents for higher dimensions - so such accesses are not currently supported
 * When using a TextureArray (layered texture in CUDA) - the index will be treated as an int, as this is all CUDA allows
+* Care must be taken when using the `WaveGetLaneIndex` wave intrinsic - it will only give the right results for appropriate launches
 
 The following are a work in progress or not implemented but are planned to be so in the future
 
@@ -143,6 +144,17 @@ void setDownstreamCompilerPrelude(SlangPassThrough passThrough, const char* prel
 
 The code that sets up the prelude for the test infrastructure and command line usage can be found in ```TestToolUtil::setSessionDefaultPrelude```. Essentially this determines what the absolute path to `slang-cpp-prelude.h` is and then just makes the prelude `#include "the absolute path"`.
 
+Limitations
+===========
+
+## WaveGetLaneIndex
+
+This defaults to `threadIdx.x & SLANG_CUDA_WARP_MASK`. Depending on how the kernel is launched this could be incorrect, for example with a multi-dimensional block whose x dimension is not a multiple of the warp size.
+
+There are other ways to get the lane index, for example using inline assembly. That mechanism, though, is apparently slower than the simple method used here.
+
+There is potential to calculate the lane id using the `[numthreads]` markup in Slang/HLSL, but that also requires some assumptions about how that maps to a lane index.
+
 Language aspects
 ================