Initial CUDA target doc. (#1174)

author: jsmall-nvidia <jsmall@nvidia.com> 2020-01-23 11:44:59 -0500
committer: GitHub <noreply@github.com> 2020-01-23 11:44:59 -0500
commit: b9c0662af02bcacb93f0dddb970a2ba13288ed79 (patch)
tree: 6766702341b870062302fd0d83076191e1422954 /docs/cuda-target.md
parent: ce7b8319d0532a96ef66ba06d1d184a6c61b65cc (diff)
1 files changed, 123 insertions, 0 deletions
diff --git a/docs/cuda-target.md b/docs/cuda-target.md
new file mode 100644
index 000000000..db8c98f14
--- /dev/null
+++ b/docs/cuda-target.md
@@ -0,0 +1,123 @@
+Slang CUDA Target Support
+=========================
+
+Slang has preliminary support for producing CUDA source, and PTX binaries using nvrtc. 
+
+# Features
+
+* Can compile Slang source into CUDA source code
+* Supports compute style shaders 
+* Supports a 'bindless' CPU like model
+* Can compile CUDA source to PTX through 'pass through' mechansism
+
+# Limitations
+
+These limitations apply to Slang transpiling to CUDA. 
+
+* Only supports the 'texture object' style binding 
+* Samplers are not separate objects in CUDA - they are combined into a single 'TextureObject'. So samplers are effectively ignored on CUDA targets. 
+* Whilst there is tex1Dfetch there are no equivalents for higher dimensions - so such accesses are not currently supported
+
+The following are a work in progress or not implmented but are planned to be so in the future
+
+* Barriers/Atomics/Complex resource types
+* Preliminary version does maps StructuredBuffers to a pointer - and without boudn checking
+
+# How it works
+
+For producing PTX binaries Slang uses nvrtc. Nvrtc dll/shared library has to be available to Slang (in the appropriate PATH for example) for it to be able to produce PTX. The nvrtc compiler can be accessed directly through
+
+```
+SLANG_PASS_THROUGH_NVRTC,  
+```
+
+Much like other targets that use downstream compilers Slang can be used to compile CUDA source directly to PTX via the pass through mechansism. That the Slang command line options will broadly be mapped down to the appropriate options for the nvrtc compilation. In the API the `SlangCompileTarget` for CUDA is `SLANG_CUDA_SOURCE` and for PTX is `SLANG_PTX`. These can also be specified on the Slang command line as `-target cuda` and `-target ptx`. 
+
+Binding 
+=======
+
+Say we have some Slang source like the following:
+
+```
+struct Thing { int a; int b; }
+
+Texture2D<float> tex;
+SamplerState sampler;
+RWStructuredBuffer<int> outputBuffer;        
+ConstantBuffer<Thing> thing3;        
+        
+[numthreads(4, 1, 1)]
+void computeMain(
+    uint3 dispatchThreadID : SV_DispatchThreadID, 
+    uniform Thing thing, 
+    uniform Thing thing2)
+{
+   // ...
+}
+```
+
+This will be turned into a CUDA entry point with 
+
+```
+struct UniformEntryPointParams
+{
+    Thing thing;
+    Thing thing2;
+};
+
+struct UniformState
+{
+    CUtexObject tex;                // This is the combination of a texture and a sampler(!)
+    //SamplerState sampler;         // CUDA doesn't have separate sampler objects - so this is just ignored.
+    int32_t* outputBuffer;          // Currently Structured buffers are converted to pointers - this will likely change in the future (for bounds checking and other reasons)
+    Thing* thing3;                  // Constant buffers map to pointers
+};   
+
+// [numthreads(4, 1, 1)]
+extern "C" __global__  void computeMain(UniformEntryPointParams* params, UniformState* uniformState)
+```
+
+With CUDA - the caller specifies how threading is broken up, so `[numthreads]` is available through reflection, and in a comment in output source code but does not produce varying code. 
+
+The UniformState and UniformEntryPointParams struct typically vary by shader. UniformState holds 'normal' bindings, whereas UniformEntryPointParams hold the uniform entry point parameters. Where specific bindings or parameters are located can be determined by reflection. The structures for the example above would be something like the following... 
+
+## Unsized arrays
+
+WIP: Not implemented yet.
+
+## Prelude
+
+For CUDA the code to support the code generated by Slang is partly defined within the 'prelude'. The prelude is inserted text placed before the generated CUDA source code. For the Slang command line tools as well as the test infrastructure, the prelude functionality is achieved through a `#include` in the prelude text of the `prelude/slang-cuda-prelude.h` specified with an absolute path. Doing so means other files the `slang-cuda-prelude.h` might need can be specified relatively, and include paths for the backend compiler do not need to be modified. 
+
+The prelude needs to define 
+
+* 'Built in' types (vector, matrix, 'object'-like Texture, SamplerState etc) 
+* Scalar intrinsic function implementations
+* Compiler based definations/tweaks 
+
+For a client application - as long as the requirements of the generated code are met, the prelude can be implemented by whatever mechanism is appropriate for the client. For example the implementation could be replaced with another implementation, or the prelude could contain all of the required text for compilation. Setting the prelude text can be achieved with the method on the global session...
+
+```
+/** Set the 'prelude' for generated code for a 'downstream compiler'.
+@param passThrough The downstream compiler for generated code that will have the prelude applied to it. 
+@param preludeText The text added pre-pended verbatim before the generated source
+
+That for pass-through usage, prelude is not pre-pended, preludes are for code generation only. 
+*/
+virtual SLANG_NO_THROW void SLANG_MCALL setDownstreamCompilerPrelude(
+SlangPassThrough passThrough,
+const char* preludeText) = 0;
+```
+
+The code that sets up the prelude for the test infrastucture and command line usage can be found in ```TestToolUtil::setSessionDefaultPrelude```. Essentially this determines what the absolute path is to `slang-cpp-prelude.h` is and then just makes the prelude `#include "the absolute path"`.
+
+Language aspects
+================
+
+# Arrays passed by Value
+
+Slang follows the HLSL convention that arrays are passed by value. This is in contrast with CUDA where arrays follow C++ conventions and are passed by reference. To make generated CUDA follow this convention an array is turned into a 'FixedArray' struct type. 
+
+To get something more similar to CUDA/C++ operation the array can be marked in out or inout to make it passed by reference. 
+
+
author	jsmall-nvidia <jsmall@nvidia.com>	2020-01-23 11:44:59 -0500
committer	GitHub <noreply@github.com>	2020-01-23 11:44:59 -0500
commit	b9c0662af02bcacb93f0dddb970a2ba13288ed79 (patch)
tree	6766702341b870062302fd0d83076191e1422954 /docs/cuda-target.md
parent	ce7b8319d0532a96ef66ba06d1d184a6c61b65cc (diff)