CPU Performance/Testing improvements (#1055)

* First pass of render-test refactor.
* Make window construction a function that can choose an implementation.
* Remove OpenGL as it currently has a Windows dependency.
* Disable Vulkan as the Renderer impl has a dependency on Windows.
* Pass Window in as a parameter of 'update'.
* Add win-window.cpp as it was missing.
* Fix warning on Windows about signs during comparison.
* Added mechanism to add random arrays as buffer inputs and select type.
* Improved RenderGenerator to generate more types, and to be more careful around int32 ranges.
* Added support for security checks (for Visual Studio C++).
* Disable exception handling being on by default when compiling kernels.
* Added a 'Group' version of the entry point that will evaluate all threads in a group in a single call. In test code use this method if available.
* Added -compile-arg to be able to pass arguments to the compile within render-test.
* Add documentation for the _Group execution feature.
* Fix some typos in cpu-target.md
author: jsmall-nvidia <jsmall@nvidia.com> 2019-09-16 09:38:21 -0400
committer: GitHub <noreply@github.com> 2019-09-16 09:38:21 -0400
commit: 40d8f3aeedf018c7c6766e98ec64733abd90671e (patch)
tree: 0c9cae7bc88d4344dd53596a88c3ce9918f2df13 /docs/cpu-target.md
parent: c2e5d2468ad6a38cdb8a067da0678302f6cc6066 (diff)
1 files changed, 17 insertions, 0 deletions
diff --git a/docs/cpu-target.md b/docs/cpu-target.md
index d4ef6ddd8..ac1499218 100644
--- a/docs/cpu-target.md
+++ b/docs/cpu-target.md
@@ -112,6 +112,23 @@ When compiled into a shared library/dll - how is it invoked? The entry point is
 void computeMain(ComputeVaryingInput* varyingInput, UniformEntryPointParams* uniformParams, UniformState* uniformState);
 ```
 
+
+If compiled with `SLANG_HOST_CALLABLE`, the `ISlangSharedLibrary` will export a function named `computeMain`, the same name as the entry point in the original source.
+
+`ComputeVaryingInput` is defined in the prelude as:
+
+```
+struct ComputeVaryingInput
+{
+    uint3 groupID;
+    uint3 groupThreadID;
+};
+```
+
+Typically, invoking the kernel is a matter of updating `groupID` and `groupThreadID` to specify which 'thread' of the computation to execute. The example above has `[numthreads(4, 1, 1)]`, which means `groupThreadID.x` can vary from 0 to 3, and `.y` and `.z` must be 0. `groupID.x` indicates which 'group of 4' to execute. So `groupID.x = 1` with `groupThreadID.x = 0, 1, 2, 3` runs threads 4, 5, 6 and 7 (counting from 0). Being able to invoke each thread in this way is flexible, in that any specific thread can be specified and executed. It is not necessarily very efficient, because each call has overhead and the kernel performs a small amount of extra work per invocation.
+
+For improved performance there is a mechanism to execute a whole 'thread group' in a single invocation. A function with the same signature is exposed under the entry point name suffixed with `_Group`; in the example above the function would be called `computeMain_Group`. When calling this function only the `groupID` needs to be specified; the `groupThreadID` is ignored. All of the threads within the group (as specified by `[numthreads]`) are executed in a single call.
+
 The UniformState and UniformEntryPointParams structs typically vary by shader. UniformState holds 'normal' bindings, whereas UniformEntryPointParams holds the uniform entry point parameters. Where specific bindings or parameters are located can be determined by reflection. The structures for the example above would be something like the following... 
 
 ```