Improvements to testing and ABI for CPU (#1057)

* WIP: Improving CPU performance/ABI * Optionally output code on CPU for groupThreadID and groupID. * Added ability to set compute dispatch size on command line for render-test. Dispatch compute tests taking into account dispatch size. Added test for semantics are working. * Test using GroupRange. * Fix problem with adding \n for externa diagnostic - to do it if there isn't a \n at the end. Change the ouput order (put result before) so last value is diagnostic string. * Made GroupRange the default exposed CPU ABI entry point style. Removed CPU_EXECUTE test style -as tested via the now cross platform render-test * Split out execution from setup for execution to improve perf. * For better code coverage/testing test all styles of CPU compute entry point. * Improve documentation for ABI changes for CPU code. Add 'expecting' to error message from review. * Fix small typos.
author: jsmall-nvidia <jsmall@nvidia.com> 2019-09-18 11:40:59 -0400
committer: GitHub <noreply@github.com> 2019-09-18 11:40:59 -0400
commit: 31c7abcc27a33d63ac8d335387a0ce7b3ad74954 (patch)
tree: 3b4254df7bdbf8b497aa8a3e5f08f8927c1afbc6 /docs
parent: 3af404da7f7f125464b78159940cb3fc06e69cc5 (diff)
1 files changed, 24 insertions, 8 deletions
diff --git a/docs/cpu-target.md b/docs/cpu-target.md
index ac1499218..d6bc34880 100644
--- a/docs/cpu-target.md
+++ b/docs/cpu-target.md
@@ -1,4 +1,4 @@
-Slang CPU target Support
+Slang CPU Target Support
 ========================
 
 Slang has preliminary support for producing CPU source and binaries. 
@@ -86,7 +86,7 @@ For pass through compilation of C/C++ this mechanism allows any functions marked
 ABI
 ===
 
-Say we have some Slang source like the following. 
+Say we have some Slang source like the following:
 
 ```
 struct Thing { int a; int b; }
@@ -106,28 +106,43 @@ void computeMain(
 }
 ```
 
-When compiled into a shared library/dll - how is it invoked? The entry point is exported with a signiture 
+When compiled into a shared library/dll - how is it invoked? An entry point in the slang source code produces several exported functions. The 'default' exported function has the same name as the entry point in the original source. It has the signature  
 
 ```
 void computeMain(ComputeVaryingInput* varyingInput, UniformEntryPointParams* uniformParams, UniformState* uniformState);
 ```
 
-
-If compiled with `SLANG_HOST_CALLABLE` the `ISlangSharedLibrary` will export a function named `computeMain` the same name as the entry point in the original source.  
-
 ComputeVaryingInput is defined in the prelude as 
 
 ```
 struct ComputeVaryingInput
 {
+    uint3 startGroupID;
+    uint3 endGroupID;
+};
+```
+
+`ComputeVaryingInput` allows specifying a range of groupIDs to execute - all the ids in a grid from startGroup to endGroup, but not including the endGroupIDs. Most compute APIs allow specifying an x,y,z extent on 'dispatch'. This would be equivalent as having startGroupID = { 0, 0, 0} and endGroupID = { x, y, z }. The exported function allows setting a range of groupIDs such that client code could dispatch different parts of the work to different cores. This group range mechanism was chosen as the 'default' mechanism as it is most likely to achieve the best performance.
+
+There are two other functions that consist of the entry point name postfixed with `_Thread` and `_Group`. For the entry point 'computeMain' these functions would be accessable from the shared library interface as `computeMain_Group` and `computeMain_Thread`. `_Group` has the same signature as the listed for computeMain, but it doesn't execute a range, only the single group specified by startGroupID (endGroupID is ignored). That is all of the threads within the group (as specified by `[numthreads]`) will be executed in a single call. 
+
+It may be desirable to have even finer control of how execution takes place down to the level of individual 'thread's and this can be achieved with the `_Thread` style. The signiture looks as follows
+
+```
+struct ComputeThreadVaryingInput
+{
     uint3 groupID;
     uint3 groupThreadID;
 };
+
+void computeMain_Thread(ComputeThreadVaryingInput* varyingInput, UniformEntryPointParams* uniformParams, UniformState* uniformState);
 ```
 
-Typically when invoking the kernel it is a question of updating the groupID/groupThreadID, to specify which 'thread' of the computation to execute. For the example above we have `[numthreads(4, 1, 1)]`. This means groupThreadID.x can vary from 0-3 and .y and .z must be 0. That groupID.x indicates which 'group of 4' to execute. So groupID.x = 1, with groupThreadID.x=0,1,2,3 runs the 4th, 5th, 6th and 7th 'thread'. Being able to invoke each thread in this way is flexible - in that any specific thread can specified and executed. It is not necessarily very efficient because there is the call overhead and a small amount of extra work that is performed inside the kernel. 
+When invoking the kernel at the `thread` level it is a question of updating the groupID/groupThreadID, to specify which thread of the computation to execute. For the example above we have `[numthreads(4, 1, 1)]`. This means groupThreadID.x can vary from 0-3 and .y and .z must be 0. That groupID.x indicates which 'group of 4' to execute. So groupID.x = 1, with groupThreadID.x=0,1,2,3 runs the 4th, 5th, 6th and 7th 'thread'. Being able to invoke each thread in this way is flexible - in that any specific thread can specified and executed. It is not necessarily very efficient because there is the call overhead and a small amount of extra work that is performed inside the kernel. 
+
+Note that the `_Thread` style signature is likely to change to support 'groupshared' variables in the near future.
 
-For improved performance there is a mechanism to execute a 'thread group' all in a single invocation. A function with the same signature will be exposed with the entry point name postfixed with `_Group` - in the example above the function would be called 'computeMain_Group'. When calling this function only the groupID need be specified, the groupThreadID is ignored. All of the threads within the group (as specified by `[numthreads]`) will be executed in a single call. 
+In terms of performance the 'default' function is probably the most efficient for most common usages. The `_Group` style allows for slightly less loop overhead, but with many invocations this will likely be drowned out by the extra call/setup overhead. The `_Thread` style in most situations will be the slowest, with even more call overhead, and less options for the C/C++ compiler to use faster paths. 
 
 The UniformState and UniformEntryPointParams struct typically vary by shader. UniformState holds 'normal' bindings, whereas UniformEntryPointParams hold the uniform entry point parameters. Where specific bindings or parameters are located can be determined by reflection. The structures for the example above would be something like the following... 
 
@@ -264,6 +279,7 @@ TODO
 
 # Main
 
+* groupshared is not yet supported
 * Complete support (in terms of interfaces) for 'complex' resource types - such as Texture
 * Output of header files 
 * Output multiple entry points
author	jsmall-nvidia <jsmall@nvidia.com>	2019-09-18 11:40:59 -0400
committer	GitHub <noreply@github.com>	2019-09-18 11:40:59 -0400
commit	31c7abcc27a33d63ac8d335387a0ce7b3ad74954 (patch)
tree	3b4254df7bdbf8b497aa8a3e5f08f8927c1afbc6 /docs
parent	3af404da7f7f125464b78159940cb3fc06e69cc5 (diff)