Improvements to testing and ABI for CPU (#1057)

* WIP: Improving CPU performance/ABI
* Optionally output code on CPU for groupThreadID and groupID.
* Added ability to set compute dispatch size on command line for render-test. Dispatch compute tests taking into account dispatch size. Added test that semantics are working.
* Test using GroupRange.
* Fix problem with adding \n for external diagnostic - only doing it if there isn't a \n at the end. Change the output order (put result before) so last value is diagnostic string.
* Made GroupRange the default exposed CPU ABI entry point style. Removed CPU_EXECUTE test style - as tested via the now cross platform render-test.
* Split out execution from setup for execution to improve perf.
* For better code coverage/testing test all styles of CPU compute entry point.
* Improve documentation for ABI changes for CPU code. Add 'expecting' to error message from review.
* Fix small typos.
author: jsmall-nvidia <jsmall@nvidia.com> 2019-09-18 11:40:59 -0400
committer: GitHub <noreply@github.com> 2019-09-18 11:40:59 -0400
commit: 31c7abcc27a33d63ac8d335387a0ce7b3ad74954 (patch)
tree: 3b4254df7bdbf8b497aa8a3e5f08f8927c1afbc6
parent: 3af404da7f7f125464b78159940cb3fc06e69cc5 (diff)
12 files changed, 251 insertions, 347 deletions
diff --git a/docs/cpu-target.md b/docs/cpu-target.md
index ac1499218..d6bc34880 100644
--- a/docs/cpu-target.md
+++ b/docs/cpu-target.md
@@ -1,4 +1,4 @@
-Slang CPU target Support
+Slang CPU Target Support
 ========================
 
 Slang has preliminary support for producing CPU source and binaries. 
@@ -86,7 +86,7 @@ For pass through compilation of C/C++ this mechanism allows any functions marked
 ABI
 ===
 
-Say we have some Slang source like the following. 
+Say we have some Slang source like the following:
 
 ```
 struct Thing { int a; int b; }
@@ -106,28 +106,43 @@ void computeMain(
 }
 ```
 
-When compiled into a shared library/dll - how is it invoked? The entry point is exported with a signiture 
+When compiled into a shared library/dll - how is it invoked? An entry point in the slang source code produces several exported functions. The 'default' exported function has the same name as the entry point in the original source. It has the signature  
 
 ```
 void computeMain(ComputeVaryingInput* varyingInput, UniformEntryPointParams* uniformParams, UniformState* uniformState);
 ```
 
-
-If compiled with `SLANG_HOST_CALLABLE` the `ISlangSharedLibrary` will export a function named `computeMain` the same name as the entry point in the original source.  
-
 ComputeVaryingInput is defined in the prelude as 
 
 ```
 struct ComputeVaryingInput
 {
+    uint3 startGroupID;
+    uint3 endGroupID;
+};
+```
+
+`ComputeVaryingInput` allows specifying a range of groupIDs to execute - all the ids in a grid from startGroupID to endGroupID, but not including endGroupID itself. Most compute APIs allow specifying an x,y,z extent on 'dispatch'. This would be equivalent to having startGroupID = { 0, 0, 0 } and endGroupID = { x, y, z }. The exported function allows setting a range of groupIDs such that client code could dispatch different parts of the work to different cores. This group range mechanism was chosen as the 'default' mechanism as it is most likely to achieve the best performance.
+
+There are two other functions that consist of the entry point name postfixed with `_Thread` and `_Group`. For the entry point 'computeMain' these functions would be accessible from the shared library interface as `computeMain_Group` and `computeMain_Thread`. `_Group` has the same signature as that listed for computeMain, but it doesn't execute a range, only the single group specified by startGroupID (endGroupID is ignored). That is, all of the threads within the group (as specified by `[numthreads]`) will be executed in a single call. 
+
+It may be desirable to have even finer control of how execution takes place, down to the level of individual 'thread's, and this can be achieved with the `_Thread` style. The signature looks as follows:
+
+```
+struct ComputeThreadVaryingInput
+{
     uint3 groupID;
     uint3 groupThreadID;
 };
+
+void computeMain_Thread(ComputeThreadVaryingInput* varyingInput, UniformEntryPointParams* uniformParams, UniformState* uniformState);
 ```
 
-Typically when invoking the kernel it is a question of updating the groupID/groupThreadID, to specify which 'thread' of the computation to execute. For the example above we have `[numthreads(4, 1, 1)]`. This means groupThreadID.x can vary from 0-3 and .y and .z must be 0. That groupID.x indicates which 'group of 4' to execute. So groupID.x = 1, with groupThreadID.x=0,1,2,3 runs the 4th, 5th, 6th and 7th 'thread'. Being able to invoke each thread in this way is flexible - in that any specific thread can specified and executed. It is not necessarily very efficient because there is the call overhead and a small amount of extra work that is performed inside the kernel. 
+When invoking the kernel at the `thread` level it is a question of updating the groupID/groupThreadID, to specify which thread of the computation to execute. For the example above we have `[numthreads(4, 1, 1)]`. This means groupThreadID.x can vary from 0-3 and .y and .z must be 0. groupID.x indicates which 'group of 4' to execute. So groupID.x = 1, with groupThreadID.x=0,1,2,3 runs threads 4, 5, 6 and 7 (counting from zero). Being able to invoke each thread in this way is flexible - in that any specific thread can be specified and executed. It is not necessarily very efficient because there is the call overhead and a small amount of extra work that is performed inside the kernel. 
+
+Note that the `_Thread` style signature is likely to change to support 'groupshared' variables in the near future.
 
-For improved performance there is a mechanism to execute a 'thread group' all in a single invocation. A function with the same signature will be exposed with the entry point name postfixed with `_Group` - in the example above the function would be called 'computeMain_Group'. When calling this function only the groupID need be specified, the groupThreadID is ignored. All of the threads within the group (as specified by `[numthreads]`) will be executed in a single call. 
+In terms of performance the 'default' function is probably the most efficient for most common usages. The `_Group` style allows for slightly less loop overhead, but with many invocations this will likely be drowned out by the extra call/setup overhead. The `_Thread` style in most situations will be the slowest, with even more call overhead, and fewer options for the C/C++ compiler to use faster paths. 
 
 The UniformState and UniformEntryPointParams struct typically vary by shader. UniformState holds 'normal' bindings, whereas UniformEntryPointParams hold the uniform entry point parameters. Where specific bindings or parameters are located can be determined by reflection. The structures for the example above would be something like the following... 
 
@@ -264,6 +279,7 @@ TODO
 
 # Main
 
+* groupshared is not yet supported
 * Complete support (in terms of interfaces) for 'complex' resource types - such as Texture
 * Output of header files 
 * Output multiple entry points
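
The documentation change above says the default entry point takes a group range so that client code can hand different parts of a dispatch to different cores. The following is a minimal sketch of that idea, not code from this commit. The `uint3` and `ComputeVaryingInput` structs mirror the prelude, but `UniformState` here has a hypothetical layout, `computeMain` is a hand-written stand-in for what Slang would emit for a `[numthreads(4, 1, 1)]` kernel, and `runInSlices` is an illustrative helper. The slices run sequentially; each could equally be given to its own worker thread.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Stand-ins for the prelude types; the real ones live in slang-cpp-types.h.
struct uint3 { uint32_t x, y, z; };

struct ComputeVaryingInput
{
    uint3 startGroupID;  // first group to run
    uint3 endGroupID;    // non-inclusive end
};

struct UniformEntryPointParams {};           // empty in this sketch
struct UniformState { int* outputBuffer; };  // hypothetical layout (normally found via reflection)

static const uint32_t kThreadsX = 4;  // matches [numthreads(4, 1, 1)]

// Hand-written stand-in for the exported group-range entry point.
// Each 'thread' writes tid * tid, like the removed cpp-execute-simple test.
void computeMain(ComputeVaryingInput* varyingInput, UniformEntryPointParams*, UniformState* uniformState)
{
    const uint3 start = varyingInput->startGroupID;
    const uint3 end = varyingInput->endGroupID;
    for (uint32_t gz = start.z; gz < end.z; ++gz)
        for (uint32_t gy = start.y; gy < end.y; ++gy)
            for (uint32_t gx = start.x; gx < end.x; ++gx)
                for (uint32_t t = 0; t < kThreadsX; ++t)
                {
                    const uint32_t tid = gx * kThreadsX + t;
                    uniformState->outputBuffer[tid] = int(tid * tid);
                }
}

// Hypothetical helper: splits a (groupCountX, 1, 1) dispatch into sliceCount
// contiguous ranges along x and runs each range with one call.
std::vector<int> runInSlices(uint32_t groupCountX, uint32_t sliceCount)
{
    assert(sliceCount > 0);
    std::vector<int> output(groupCountX * kThreadsX, -1);
    UniformEntryPointParams params;
    UniformState state{ output.data() };

    for (uint32_t slice = 0; slice < sliceCount; ++slice)
    {
        ComputeVaryingInput varying;
        varying.startGroupID = { groupCountX * slice / sliceCount, 0, 0 };
        varying.endGroupID = { groupCountX * (slice + 1) / sliceCount, 1, 1 };
        computeMain(&varying, &params, &state);
    }
    return output;
}
```

Because the end of each range is non-inclusive, consecutive slices can share a boundary value without running any group twice, so `runInSlices(4, 1)` and `runInSlices(4, 3)` should fill the buffer identically.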
diff --git a/prelude/slang-cpp-types.h b/prelude/slang-cpp-types.h
index c79465032..ab87c6208 100644
--- a/prelude/slang-cpp-types.h
+++ b/prelude/slang-cpp-types.h
@@ -226,21 +226,28 @@ struct Texture2D
 };
 
 /* Varying input for Compute */
-struct ComputeVaryingInput
+
+/* Used when running a single thread */
+struct ComputeThreadVaryingInput
 {
     uint3 groupID;
     uint3 groupThreadID;
 };
 
-struct GroupComputeVaryingInput
+struct ComputeVaryingInput
 {
     uint3 startGroupID;     ///< start groupID
     uint3 endGroupID;       ///< Non inclusive end groupID
 };
 
 /* Type that defines the uniform entry point params. The actual content of this type is dependent on the entry point parameters, and can be
-found via reflection or defined such that it matches the shader appropriately. */
+found via reflection or defined such that it matches the shader appropriately.
+*/
 struct UniformEntryPointParams;
+struct UniformState;
+
+typedef void(*ComputeThreadFunc)(ComputeThreadVaryingInput* varyingInput, UniformEntryPointParams* uniformEntryPointParams, UniformState* uniformState);
+typedef void(*ComputeFunc)(ComputeVaryingInput* varyingInput, UniformEntryPointParams* uniformEntryPointParams, UniformState* uniformState);
 
 #ifdef SLANG_PRELUDE_NAMESPACE
 }
diff --git a/source/slang/slang-emit-cpp.cpp b/source/slang/slang-emit-cpp.cpp
index 1a6a46fc5..8f3e2f2e5 100644
--- a/source/slang/slang-emit-cpp.cpp
+++ b/source/slang/slang-emit-cpp.cpp
@@ -2830,8 +2830,13 @@ void CPPSourceEmitter::emitModuleImpl(IRModule* module)
 
                 String funcName = getFuncName(func);
 
-                {    
-                    _emitEntryPointDefinitionStart(func, entryPointGlobalParams, funcName, UnownedStringSlice::fromLiteral("ComputeVaryingInput"));
+                {
+                    StringBuilder builder;
+                    builder << funcName << "_Thread";
+
+                    String threadFuncName = builder;
+
+                    _emitEntryPointDefinitionStart(func, entryPointGlobalParams, threadFuncName, UnownedStringSlice::fromLiteral("ComputeThreadVaryingInput"));
 
                     if (m_semanticUsedFlags & SemanticUsedFlag::GroupThreadID)
                     {
@@ -2854,7 +2859,7 @@ void CPPSourceEmitter::emitModuleImpl(IRModule* module)
                     _emitEntryPointDefinitionEnd(func);
                 }
 
-                // Emit the group version which runs for all elements in a thread group
+                // Emit the group version which runs for all elements in *single* thread group
                 {
                     StringBuilder builder;
                     builder << getFuncName(func);
@@ -2865,7 +2870,7 @@ void CPPSourceEmitter::emitModuleImpl(IRModule* module)
                     _emitEntryPointDefinitionStart(func, entryPointGlobalParams, groupFuncName, UnownedStringSlice::fromLiteral("ComputeVaryingInput"));
 
                     m_writer->emit("const uint3 start = ");
-                    _emitInitAxisValues(sizeAlongAxis, UnownedStringSlice::fromLiteral("varyingInput->groupID"), UnownedStringSlice());
+                    _emitInitAxisValues(sizeAlongAxis, UnownedStringSlice::fromLiteral("varyingInput->startGroupID"), UnownedStringSlice());
 
                     if (m_semanticUsedFlags & SemanticUsedFlag::GroupThreadID)
                     {
@@ -2874,7 +2879,7 @@ void CPPSourceEmitter::emitModuleImpl(IRModule* module)
 
                     if (m_semanticUsedFlags & SemanticUsedFlag::GroupID)
                     {
-                        m_writer->emit("context.groupID = varyingInput->groupID;\n");
+                        m_writer->emit("context.groupID = varyingInput->startGroupID;\n");
                     }
                     m_writer->emit("context.dispatchThreadID = start;\n");
 
@@ -2882,34 +2887,15 @@ void CPPSourceEmitter::emitModuleImpl(IRModule* module)
                     _emitEntryPointDefinitionEnd(func);
                 }
 
-                // Emit the group version which runs for all elements in a thread group
+                // Emit the main version - which takes a dispatch size
                 {
-                    StringBuilder builder;
-                    builder << getFuncName(func);
-                    builder << "_GroupRange";
-
-                    String groupRangeFuncName = builder;
-
-                    _emitEntryPointDefinitionStart(func, entryPointGlobalParams, groupRangeFuncName, UnownedStringSlice::fromLiteral("GroupComputeVaryingInput"));
+                    _emitEntryPointDefinitionStart(func, entryPointGlobalParams, funcName, UnownedStringSlice::fromLiteral("ComputeVaryingInput"));
 
                     m_writer->emit("const uint3 start = ");
                     _emitInitAxisValues(sizeAlongAxis, UnownedStringSlice::fromLiteral("varyingInput->startGroupID"), UnownedStringSlice());
                     m_writer->emit("const uint3 end = ");
                     _emitInitAxisValues(sizeAlongAxis, UnownedStringSlice::fromLiteral("varyingInput->endGroupID"), UnownedStringSlice());
 
-#if 0
-                    // Not needed as will be emitted as part of the loop
-                    m_writer->emit("context.dispatchThreadID = start;\n");
-                    if (m_semanticUsedFlags & SemanticUsedFlag::GroupThreadID)
-                    {
-                        m_writer->emit("context.groupDispatchThreadID = start;");
-                    }
-                    if (m_semanticUsedFlags & SemanticUsedFlag::GroupID)
-                    {
-                        m_writer->emit("context.groupID = varyingInput->startGroupID;\n");
-                    }
-#endif
-
                     _emitEntryPointGroupRange(sizeAlongAxis, funcName);
                     _emitEntryPointDefinitionEnd(func);
                 }
diff --git a/tests/cross-compile/cpp-execute-simple.slang b/tests/cross-compile/cpp-execute-simple.slang
deleted file mode 100644
index 72c77b653..000000000
--- a/tests/cross-compile/cpp-execute-simple.slang
+++ /dev/null
@@ -1,14 +0,0 @@
-//TEST:CPU_EXECUTE: -profile cs_5_0 -entry computeMain -target sharedlib
-
-//TEST_INPUT:ubuffer(data=[0 0 0 0], stride=4):dxbinding(0),glbinding(0),out
-RWStructuredBuffer<int> outputBuffer;
-
-
-[numthreads(4, 1, 1)]
-void computeMain(
-    uint3 dispatchThreadID : SV_DispatchThreadID)
-{
-    uint tid = dispatchThreadID.x;
-
-    outputBuffer[tid] = int(tid * tid);
-}
-\ No newline at end of file
diff --git a/tests/cross-compile/cpp-execute-simple.slang.expected b/tests/cross-compile/cpp-execute-simple.slang.expected
deleted file mode 100644
index b84777e95..000000000
--- a/tests/cross-compile/cpp-execute-simple.slang.expected
+++ /dev/null
@@ -1 +0,0 @@
-0, 1, 4, 9
diff --git a/tests/cross-compile/cpp-execute.slang b/tests/cross-compile/cpp-execute.slang
deleted file mode 100644
index 2700aa49f..000000000
--- a/tests/cross-compile/cpp-execute.slang
+++ /dev/null
@@ -1,107 +0,0 @@
-//TEST:CPU_EXECUTE: -profile cs_5_0 -entry computeMain -target sharedlib
-
-enum Color
-{
-    Red,
-    Green = 2,
-    Blue,
-}
-
-int test(int val)
-{
-    Color c = Color.Red;
-
-    if(val > 1)
-    {
-        c = Color.Green;
-    }
-
-    if(c == Color.Red)
-    {
-        if(val & 1)
-        {
-            c = Color.Blue;
-        }
-    }
-
-    switch(c)
-    {
-    case Color.Red:
-        val = 1;
-        break;
-
-    case Color.Green:
-        val = 2;
-        break;
-
-    case Color.Blue:
-        val = 3;
-        break;
-
-    default:
-        val = -1;
-        break;
-    }
-
-    return (val << 4) + int(c);
-}
-
-float sum(float a[3])
-{
-    float total = a[0];
-    for (int i = 1; i < 3; ++i)
-    {
-        total += a[i];
-    }
-    return total;
-}
-
-struct Thing
-{
-    int a;
-    float b;
-};
-
-//TEST_INPUT:ubuffer(data=[0 0 0 0], stride=4):dxbinding(0),glbinding(0),out
-RWStructuredBuffer<int> outputBuffer;
-    
-[numthreads(4, 1, 1)]
-void computeMain(
-    uint3 dispatchThreadID : SV_DispatchThreadID)
-{
-    uint tid = dispatchThreadID.x;
-
-    Thing thing = { 10, -1.0 };
-
-    float array[3] = { thing.a, 2, 3};
-
-    float anotherArray[] = { 1, 2, 5 };
-
-    array[0] += anotherArray[1];
-
-    matrix<float, 2, 3> mat = { { sum(array), 1, 2 }, { 3, 4, 5} };
-    vector<float, 2> vec = { float(tid + 1), float(tid + 2) };
-
-    vector<float, 3> vec2 = max(sin(mul(vec, mat)), float3(1, 2, -1));
-    vector<float, 3> vec3 = mul(vec, mat);
-    
-    float3 vec4 = lerp(vec2, vec3, float3(tid * (1.0f / 4), 1, 1));
-    
-    float3 crossVec = normalize(cross(vec4, vec4));
-    
-    vec2.x = fmod(crossVec.y, crossVec.x);
-    
-    vec2 = fmod(vec2, crossVec);
-    
-    vec2 += (-vec2.zyx) * 2 + crossVec * length(crossVec) + reflect(vec4, normalize(crossVec));
-    
-    vector<bool, 3> z = vec2 > 0;
-
-    int val = (int(tid) + (any(z) ? 1 : 0) + (all(z) ? 2 : 0)) % 100;
-    
-    val = asint(asfloat(asuint(asfloat(val))));
-    
-    val = test(val);
-
-    outputBuffer[tid] = val + int(dot(vec2, vec4));
-}
-\ No newline at end of file
diff --git a/tests/cross-compile/cpp-execute.slang.expected b/tests/cross-compile/cpp-execute.slang.expected
deleted file mode 100644
index 65e3ed534..000000000
--- a/tests/cross-compile/cpp-execute.slang.expected
+++ /dev/null
@@ -1 +0,0 @@
--2147483632, -2147483597, -2147483614, -2147483614
diff --git a/tools/render-test/cpu-compute-util.cpp b/tools/render-test/cpu-compute-util.cpp
index 1b1adef82..81325ce80 100644
--- a/tools/render-test/cpu-compute-util.cpp
+++ b/tools/render-test/cpu-compute-util.cpp
@@ -301,127 +301,220 @@ static CPUComputeUtil::Resource* _newOneTexture2D(int elemCount)
     return SLANG_OK;
 }
 
-/* static */SlangResult CPUComputeUtil::execute(const uint32_t dispatchSize[3], const ShaderCompilerUtil::OutputAndLayout& compilationAndLayout, Context& context)
+/* static */SlangResult CPUComputeUtil::calcExecuteInfo(ExecuteStyle style, const uint32_t dispatchSize[3], const ShaderCompilerUtil::OutputAndLayout& compilationAndLayout, Context& context, ExecuteInfo& out)
 {
     auto request = compilationAndLayout.output.request;
     auto reflection = (slang::ShaderReflection*) spGetReflection(request);
 
+    slang::EntryPointReflection* entryPoint = nullptr;
+    auto entryPointCount = reflection->getEntryPointCount();
+    SLANG_ASSERT(entryPointCount == 1);
+
+    entryPoint = reflection->getEntryPointByIndex(0);
+
+    const char* entryPointName = entryPoint->getName();
+
     ComPtr<ISlangSharedLibrary> sharedLibrary;
     SLANG_RETURN_ON_FAIL(spGetEntryPointHostCallable(request, 0, 0, sharedLibrary.writeRef()));
 
-    // Use reflection to find the entry point name
-    
-    struct UniformState;
-    typedef void(*Func)(CPPPrelude::ComputeVaryingInput* varyingInput, CPPPrelude::UniformEntryPointParams* uniformEntryPointParams, UniformState* uniformState);
-    typedef void(*GroupRangeFunc)(CPPPrelude::GroupComputeVaryingInput* varyingInput, CPPPrelude::UniformEntryPointParams* uniformEntryPointParams, UniformState* uniformState);
-
-    slang::EntryPointReflection* entryPoint = nullptr;
-    Func func = nullptr;
-    Func groupFunc = nullptr;
-    GroupRangeFunc groupRangeFunc = nullptr;
+    // Copy dispatch size
+    for (int i = 0; i < 3; ++i)
     {
-        auto entryPointCount = reflection->getEntryPointCount();
-        SLANG_ASSERT(entryPointCount == 1);
-
-        entryPoint = reflection->getEntryPointByIndex(0);
+        out.m_dispatchSize[i] = dispatchSize[i];
+    }
 
-        const char* entryPointName = entryPoint->getName();
-        func = (Func)sharedLibrary->findFuncByName(entryPointName);
+    out.m_style = style;
+    out.m_uniformState = (void*)context.binding.m_rootBuffer.m_data;
+    out.m_uniformEntryPointParams = (void*)context.binding.m_entryPointBuffer.m_data;
 
+    switch (style)
+    {
+        case ExecuteStyle::Group:
         {
             StringBuilder groupEntryPointName;
             groupEntryPointName << entryPointName << "_Group";
 
-            groupFunc = (Func)sharedLibrary->findFuncByName(groupEntryPointName.getBuffer());
-        }
+            CPPPrelude::ComputeFunc groupFunc = (CPPPrelude::ComputeFunc)sharedLibrary->findFuncByName(groupEntryPointName.getBuffer());
+            if (!groupFunc)
+            {
+                return SLANG_FAIL;
+            }
 
+            out.m_func = (ExecuteInfo::Func)groupFunc;
+            break;
+        }
+        case ExecuteStyle::GroupRange:
         {
-            StringBuilder groupRangeEntryPointName;
-            groupRangeEntryPointName << entryPointName << "_GroupRange";
-
-            groupRangeFunc = (GroupRangeFunc)sharedLibrary->findFuncByName(groupRangeEntryPointName.getBuffer());
+            CPPPrelude::ComputeFunc groupRangeFunc = nullptr;
+            groupRangeFunc = (CPPPrelude::ComputeFunc)sharedLibrary->findFuncByName(entryPointName);
+            if (!groupRangeFunc)
+            {
+                return SLANG_FAIL;
+            }
+            out.m_func = (ExecuteInfo::Func)groupRangeFunc;
+            break;
         }
+        case ExecuteStyle::Thread:
+        {
+            StringBuilder threadEntryPointName;
+            threadEntryPointName << entryPointName << "_Thread";
 
-        if (func == nullptr && groupFunc == nullptr && groupRangeFunc == nullptr)
+            CPPPrelude::ComputeThreadFunc threadFunc = (CPPPrelude::ComputeThreadFunc)sharedLibrary->findFuncByName(threadEntryPointName.getBuffer());
+            if (!threadFunc)
+            {
+                return SLANG_FAIL;
+            }
+
+            SlangUInt numThreadsPerAxis[3];
+            entryPoint->getComputeThreadGroupSize(3, numThreadsPerAxis);
+            for (int i = 0; i < 3; ++i)
+            {
+                out.m_numThreadsPerAxis[i] = uint32_t(numThreadsPerAxis[i]);
+            }
+            out.m_func = (ExecuteInfo::Func)threadFunc;
+            break;
+        }
+        default:
         {
             return SLANG_FAIL;
         }
     }
 
-    // If we have the group function, that's the faster way to execute all threads in group...
-    if (groupRangeFunc)
-    {
-        UniformState* uniformState = (UniformState*)context.binding.m_rootBuffer.m_data;
-        CPPPrelude::UniformEntryPointParams* uniformEntryPointParams = (CPPPrelude::UniformEntryPointParams*)context.binding.m_entryPointBuffer.m_data;
-        CPPPrelude::GroupComputeVaryingInput varying;
-
-        varying.startGroupID = {};
-        varying.endGroupID = { dispatchSize[0], dispatchSize[1], dispatchSize[2] };
-        
-        groupRangeFunc(&varying, uniformEntryPointParams, uniformState);
-    }
-    else if (groupFunc)
-    {
-        CPPPrelude::ComputeVaryingInput varying;
+    return SLANG_OK;
+}
+
+/* static */SlangResult CPUComputeUtil::execute(const ExecuteInfo& info)
+{
+    CPPPrelude::UniformState* uniformState = (CPPPrelude::UniformState*)info.m_uniformState;
+    CPPPrelude::UniformEntryPointParams* uniformEntryPointParams = (CPPPrelude::UniformEntryPointParams*)info.m_uniformEntryPointParams;
 
-        for (uint32_t groupZ = 0; groupZ < dispatchSize[2]; ++groupZ)
+    switch (info.m_style)
+    {
+        case ExecuteStyle::Group:
         {
-            for (uint32_t groupY = 0; groupY < dispatchSize[1]; ++groupY)
-            {
-                for (uint32_t groupX = 0; groupX < dispatchSize[0]; ++groupX)
-                {
-                    UniformState* uniformState = (UniformState*)context.binding.m_rootBuffer.m_data;
-                    CPPPrelude::UniformEntryPointParams* uniformEntryPointParams = (CPPPrelude::UniformEntryPointParams*)context.binding.m_entryPointBuffer.m_data;
+            CPPPrelude::ComputeFunc groupFunc = (CPPPrelude::ComputeFunc)info.m_func;
+            CPPPrelude::ComputeVaryingInput varying;
 
-                    varying.groupID = {groupX, groupY, groupZ};
+            const uint32_t groupXCount = info.m_dispatchSize[0];
+            const uint32_t groupYCount = info.m_dispatchSize[1];
+            const uint32_t groupZCount = info.m_dispatchSize[2];
 
-                    groupFunc(&varying, uniformEntryPointParams, uniformState);
+            for (uint32_t groupZ = 0; groupZ < groupZCount; ++groupZ)
+            {
+                for (uint32_t groupY = 0; groupY < groupYCount; ++groupY)
+                {
+                    for (uint32_t groupX = 0; groupX < groupXCount; ++groupX)
+                    {
+                        varying.startGroupID = { groupX, groupY, groupZ };
+                        groupFunc(&varying, uniformEntryPointParams, uniformState);
+                    }
                 }
             }
+            break;
         }
+        case ExecuteStyle::GroupRange:
+        {
+            CPPPrelude::ComputeFunc groupRangeFunc = (CPPPrelude::ComputeFunc)info.m_func;
+            CPPPrelude::ComputeVaryingInput varying;
 
-    }
-    else
-    {
-        // We can also fire off each thread individually
-        SlangUInt numThreadsPerAxis[3];
-        entryPoint->getComputeThreadGroupSize(3, numThreadsPerAxis);
+            varying.startGroupID = {};
+            varying.endGroupID = { info.m_dispatchSize[0], info.m_dispatchSize[1], info.m_dispatchSize[2] };
 
+            groupRangeFunc(&varying, uniformEntryPointParams, uniformState);
+            break;
+        }
+        case ExecuteStyle::Thread:
         {
-            UniformState* uniformState = (UniformState*)context.binding.m_rootBuffer.m_data;
-            CPPPrelude::UniformEntryPointParams* uniformEntryPointParams = (CPPPrelude::UniformEntryPointParams*)context.binding.m_entryPointBuffer.m_data;
+            CPPPrelude::ComputeThreadFunc threadFunc = (CPPPrelude::ComputeThreadFunc)info.m_func;
+            CPPPrelude::ComputeThreadVaryingInput varying;
 
-            CPPPrelude::ComputeVaryingInput varying;
+            const uint32_t groupXCount = info.m_dispatchSize[0];
+            const uint32_t groupYCount = info.m_dispatchSize[1];
+            const uint32_t groupZCount = info.m_dispatchSize[2];
+
+            const uint32_t threadXCount = uint32_t(info.m_numThreadsPerAxis[0]);
+            const uint32_t threadYCount = uint32_t(info.m_numThreadsPerAxis[1]);
+            const uint32_t threadZCount = uint32_t(info.m_numThreadsPerAxis[2]);
 
-            for (uint32_t groupZ = 0; groupZ < dispatchSize[2]; ++groupZ)
+            for (uint32_t groupZ = 0; groupZ < groupZCount; ++groupZ)
             {
-                for (uint32_t groupY = 0; groupY < dispatchSize[1]; ++groupY)
+                for (uint32_t groupY = 0; groupY < groupYCount; ++groupY)
                 {
-                    for (uint32_t groupX = 0; groupX < dispatchSize[0]; ++groupX)
+                    for (uint32_t groupX = 0; groupX < groupXCount; ++groupX)
                     {
-                        varying.groupID = {groupX, groupY, groupZ};
+                        varying.groupID = { groupX, groupY, groupZ };
 
-                        for (int z = 0; z < int(numThreadsPerAxis[2]); ++z)
+                        for (uint32_t z = 0; z < threadZCount; ++z)
                         {
                             varying.groupThreadID.z = z;
-                            for (int y = 0; y < int(numThreadsPerAxis[1]); ++y)
+                            for (uint32_t y = 0; y < threadYCount; ++y)
                             {
                                 varying.groupThreadID.y = y;
-                                for (int x = 0; x < int(numThreadsPerAxis[0]); ++x)
+                                for (uint32_t x = 0; x < threadXCount; ++x)
                                 {
                                     varying.groupThreadID.x = x;
 
-                                    func(&varying, uniformEntryPointParams, uniformState);
+                                    threadFunc(&varying, uniformEntryPointParams, uniformState);
                                 }
                             }
                         }
                     }
                 }
             }
+            break;
+        }
+        default: return SLANG_FAIL;
+    }
+
+    return SLANG_OK;
+}
+
+
+/* static */ SlangResult CPUComputeUtil::checkStyleConsistency(const uint32_t dispatchSize[3], const ShaderCompilerUtil::OutputAndLayout& compilationAndLayout)
+{
+    Context context;
+    SLANG_RETURN_ON_FAIL(CPUComputeUtil::calcBindings(compilationAndLayout, context));
+
+    // Run the thread style to test against
+    {
+        ExecuteInfo info;
+        SLANG_RETURN_ON_FAIL(calcExecuteInfo(ExecuteStyle::Thread, dispatchSize, compilationAndLayout, context, info));
+        SLANG_RETURN_ON_FAIL(execute(info));
+    }
+
+    ExecuteStyle styles[] = { ExecuteStyle::Group, ExecuteStyle::GroupRange };
+    for (auto style: styles)
+    {
+        Context checkContext;
+        SLANG_RETURN_ON_FAIL(CPUComputeUtil::calcBindings(compilationAndLayout, checkContext));
+
+        ExecuteInfo info;
+        SLANG_RETURN_ON_FAIL(calcExecuteInfo(style, dispatchSize, compilationAndLayout, checkContext, info));
+        SLANG_RETURN_ON_FAIL(execute(info));
+
+        // Make sure the out buffers are all the same
+
+        const auto& entries = compilationAndLayout.layout.entries;
+
+        for (int i = 0; i < entries.getCount(); ++i)
+        {
+            const auto& entry = entries[i];
+            if (entry.isOutput)
+            {
+                const auto& buffer = context.buffers[i];
+                const auto& checkBuffer = checkContext.buffers[i];
+
+                if (buffer.m_sizeInBytes != checkBuffer.m_sizeInBytes ||
+                    memcmp(buffer.m_data, checkBuffer.m_data, buffer.m_sizeInBytes) != 0)
+                {
+                    return SLANG_FAIL;
+                }
+            }
         }
     }
 
     return SLANG_OK;
 }
 
+
 } // renderer_test
diff --git a/tools/render-test/cpu-compute-util.h b/tools/render-test/cpu-compute-util.h
index b30ef146b..1284735c0 100644
--- a/tools/render-test/cpu-compute-util.h
+++ b/tools/render-test/cpu-compute-util.h
@@ -11,6 +11,14 @@ namespace renderer_test {
 
 struct CPUComputeUtil
 {
+    enum class ExecuteStyle
+    {
+        Unknown,
+        Thread,
+        Group,
+        GroupRange,
+    };
+
     struct Resource : public RefObject
     {
         void* getInterface() const { return m_interface; }
@@ -27,9 +35,28 @@ struct CPUComputeUtil
         List<RefPtr<Resource> > m_resources;
     };
 
+    struct ExecuteInfo
+    {
+        typedef void (*Func)();
+
+        ExecuteStyle m_style;
+        Func m_func;
+        uint32_t m_dispatchSize[3];
+        uint32_t m_numThreadsPerAxis[3];
+
+        void* m_uniformState;
+        void* m_uniformEntryPointParams;
+    };
+
+    
+        /// Runs code across run styles and makes sure output buffers match
+    static SlangResult checkStyleConsistency(const uint32_t dispatchSize[3], const ShaderCompilerUtil::OutputAndLayout& compilationAndLayout);
+
     static SlangResult calcBindings(const ShaderCompilerUtil::OutputAndLayout& compilationAndLayout, Context& outContext);
 
-    static SlangResult execute(const uint32_t dispatchSize[3], const ShaderCompilerUtil::OutputAndLayout& compilationAndLayout, Context& outContext);
+    static SlangResult calcExecuteInfo(ExecuteStyle style, const uint32_t dispatchSize[3], const ShaderCompilerUtil::OutputAndLayout& compilationAndLayout, Context& context, ExecuteInfo& out);
+
+    static SlangResult execute(const ExecuteInfo& info);
 
     static SlangResult writeBindings(const ShaderInputLayout& layout, const List<CPUMemoryBinding::Buffer>& buffers, const Slang::String& fileName);
 };
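
`ExecuteInfo` above keeps the looked-up entry point as a generic `void (*)()` and relies on `m_style` to say which signature to cast it back to, which is what lets setup (`calcExecuteInfo`) be separated from the run itself (`execute`). A reduced sketch of that pattern follows, under stated assumptions: the input structs, the two kernels and `makeInfo` are simplified illustrations (one dimension, a fixed group size of 4, a plain `int*` output), not the render-test code.

```cpp
#include <cassert>
#include <cstdint>

enum class ExecuteStyle { Thread, Group };

// Simplified one-dimensional inputs (illustrative, not the prelude types).
struct ThreadInput { uint32_t groupID; uint32_t groupThreadID; };
struct GroupInput { uint32_t groupID; };

using ThreadFunc = void (*)(ThreadInput*, int*);
using GroupFunc = void (*)(GroupInput*, int*);

static const uint32_t kThreadsPerGroup = 4;

// Two stand-in kernels that should produce the same output.
static void kernelThread(ThreadInput* in, int* out)
{
    const uint32_t tid = in->groupID * kThreadsPerGroup + in->groupThreadID;
    out[tid] = int(tid * tid);
}

static void kernelGroup(GroupInput* in, int* out)
{
    for (uint32_t t = 0; t < kThreadsPerGroup; ++t)
    {
        const uint32_t tid = in->groupID * kThreadsPerGroup + t;
        out[tid] = int(tid * tid);
    }
}

struct ExecuteInfo
{
    using Func = void (*)();  // type-erased function pointer
    ExecuteStyle style;
    Func func;
    uint32_t groupCount;
};

// Setup step: pick the function once and store it type-erased.
ExecuteInfo makeInfo(ExecuteStyle style, uint32_t groupCount)
{
    ExecuteInfo info;
    info.style = style;
    info.groupCount = groupCount;
    info.func = (style == ExecuteStyle::Thread)
        ? reinterpret_cast<ExecuteInfo::Func>(&kernelThread)
        : reinterpret_cast<ExecuteInfo::Func>(&kernelGroup);
    return info;
}

// Run step: cast back to the signature implied by the style, then dispatch.
void execute(const ExecuteInfo& info, int* out)
{
    assert(info.func != nullptr);
    for (uint32_t g = 0; g < info.groupCount; ++g)
    {
        if (info.style == ExecuteStyle::Thread)
        {
            ThreadFunc func = reinterpret_cast<ThreadFunc>(info.func);
            for (uint32_t t = 0; t < kThreadsPerGroup; ++t)
            {
                ThreadInput in{ g, t };
                func(&in, out);
            }
        }
        else
        {
            GroupFunc func = reinterpret_cast<GroupFunc>(info.func);
            GroupInput in{ g };
            func(&in, out);
        }
    }
}
```

Running both styles into separate buffers and comparing them is the same kind of check `checkStyleConsistency` performs on the output buffers. The function pointer is only ever called after being cast back to its original type, which is the case C++ defines for such round trips.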
diff --git a/tools/render-test/options.cpp b/tools/render-test/options.cpp
index d2f21a5d9..e13a2b88f 100644
--- a/tools/render-test/options.cpp
+++ b/tools/render-test/options.cpp
@@ -183,7 +183,7 @@ SlangResult parseOptions(int argc, const char*const* argv, Slang::WriterHelper s
         {
             if (argCursor == argEnd)
             {
-                stdError.print("error: comma separated compute dispatch size for '%s'\n", arg);
+                stdError.print("error: expecting a comma separated compute dispatch size for '%s'\n", arg);
                 return SLANG_FAIL;
             }
             List<UnownedStringSlice> slices;
diff --git a/tools/render-test/render-test-main.cpp b/tools/render-test/render-test-main.cpp
index 2a0b9a6c9..3a8871618 100644
--- a/tools/render-test/render-test-main.cpp
+++ b/tools/render-test/render-test-main.cpp
@@ -459,12 +459,25 @@ SLANG_TEST_TOOL_API SlangResult innerMain(Slang::StdWriters* stdWriters, SlangSe
         ShaderCompilerUtil::OutputAndLayout compilationAndLayout;
         SLANG_RETURN_ON_FAIL(ShaderCompilerUtil::compileWithLayout(session, gOptions.sourcePath, gOptions.compileArgs, gOptions.shaderType, input, compilationAndLayout));
 
-        CPUComputeUtil::Context context;
-        SLANG_RETURN_ON_FAIL(CPUComputeUtil::calcBindings(compilationAndLayout, context));
-        SLANG_RETURN_ON_FAIL(CPUComputeUtil::execute(gOptions.computeDispatchSize, compilationAndLayout, context));
+       
+        {
+            CPUComputeUtil::Context context;
+            SLANG_RETURN_ON_FAIL(CPUComputeUtil::calcBindings(compilationAndLayout, context));
+
+            CPUComputeUtil::ExecuteInfo info;
+            SLANG_RETURN_ON_FAIL(CPUComputeUtil::calcExecuteInfo(CPUComputeUtil::ExecuteStyle::GroupRange, gOptions.computeDispatchSize, compilationAndLayout, context, info));
+            SLANG_RETURN_ON_FAIL(CPUComputeUtil::execute(info));
+        
+            // Dump everything out that was written
+            SLANG_RETURN_ON_FAIL(CPUComputeUtil::writeBindings(compilationAndLayout.layout, context.buffers, gOptions.outputPath));
+        }
+
+        {
+            // Check all execution styles produce the same result
+            SLANG_RETURN_ON_FAIL(CPUComputeUtil::checkStyleConsistency(gOptions.computeDispatchSize, compilationAndLayout));
+        }
 
-        // Dump everything out that was written
-        return CPUComputeUtil::writeBindings(compilationAndLayout.layout, context.buffers, gOptions.outputPath);
+        return SLANG_OK;
     }
 
     Slang::RefPtr<Renderer> renderer;
diff --git a/tools/slang-test/slang-test-main.cpp b/tools/slang-test/slang-test-main.cpp
index 26d611181..a2d24a54f 100644
--- a/tools/slang-test/slang-test-main.cpp
+++ b/tools/slang-test/slang-test-main.cpp
@@ -1109,120 +1109,6 @@ static SlangResult _loadAsSharedLibrary(const UnownedStringSlice& hexDump, Tempo
     return SharedLibrary::loadWithPlatformPath(sharedLibraryName.getBuffer(), outSharedLibrary);
 }
 
-static void _writeBuffer(const CPPPrelude::RWStructuredBuffer<int32_t>& in, StringBuilder& out)
-{
-    for (size_t i = 0; i < in.count; ++i)
-    {
-        if (i > 0)
-        {
-            out << ", ";
-        }
-        out << in[i];
-    }
-    out << "\n";
-}
-
-TestResult runCPUExecuteTest(TestContext* context, TestInput& input)
-{
-    auto outputStem = input.outputStem;
-
-    CommandLine cmdLine;
-    _initSlangCompiler(context, cmdLine);
-
-    cmdLine.addArg(input.filePath);
-
-    for (auto arg : input.testOptions->args)
-    {
-        cmdLine.addArg(arg);
-    }
-
-    ExecuteResult exeRes;
-    TEST_RETURN_ON_DONE(spawnAndWait(context, outputStem, input.spawnType, cmdLine, exeRes));
-
-    if (context->isCollectingRequirements())
-    {
-        return TestResult::Pass;
-    }
-
-    TemporaryFileSet temporaryFileSet;
-    SharedLibrary::Handle sharedLibrary = SharedLibrary::Handle(0);
-    if (SLANG_FAILED(_loadAsSharedLibrary(exeRes.standardOutput.getUnownedSlice(), temporaryFileSet, sharedLibrary)))
-    {
-        return TestResult::Fail;
-    }
-
-    StringBuilder actualOutput;
-
-    // TODO(JS): For moment just assume function name/data/parameters
-    {
-        SharedLibrary::FuncPtr func = SharedLibrary::findFuncByName(sharedLibrary, "computeMain");
-        if (!func)
-        {
-            SharedLibrary::unload(sharedLibrary);
-            return TestResult::Fail;
-        }
-
-        
-        struct UniformState
-        {
-            CPPPrelude::RWStructuredBuffer<int> buffer;
-        };
-        
-        typedef void (*Func)(CPPPrelude::ComputeVaryingInput* varyingInput, CPPPrelude::UniformEntryPointParams* params, UniformState* uniformState);
-
-        Func runFunc = Func(func);
-        int32_t data[4] = { 0, 0, 0, 0};
-
-        UniformState state;
-
-        state.buffer = CPPPrelude::RWStructuredBuffer<int32_t>{data, 4};
-
-        CPPPrelude::ComputeVaryingInput varyingInput = {};
-        for (Int i = 0; i < 4; ++i)
-        {
-            varyingInput.groupThreadID.x = uint32_t(i);
-            runFunc(&varyingInput, nullptr, &state);
-        }
-
-        SharedLibrary::unload(sharedLibrary);
-
-        // Write the data
-        _writeBuffer(state.buffer, actualOutput);
-    }
-
-    String expectedOutputPath = outputStem + ".expected";
-    String expectedOutput;
-    try
-    {
-        expectedOutput = Slang::File::readAllText(expectedOutputPath);
-    }
-    catch (Slang::IOException)
-    {
-    }
-
-    TestResult result = TestResult::Pass;
-
-    // Otherwise we compare to the expected output
-    if (actualOutput != expectedOutput)
-    {
-        context->reporter->dumpOutputDifference(expectedOutput, actualOutput);
-        result = TestResult::Fail;
-    }
-
-    // If the test failed, then we write the actual output to a file
-    // so that we can easily diff it from the command line and
-    // diagnose the problem.
-    if (result == TestResult::Fail)
-    {
-        String actualOutputPath = outputStem + ".actual";
-        Slang::File::writeAllText(actualOutputPath, actualOutput);
-
-        context->reporter->dumpOutputDifference(expectedOutput, actualOutput);
-    }
-
-    return result;
-}
-
 TestResult runSimpleCompareCommandLineTest(TestContext* context, TestInput& input)
 {
     TestInput workInput(input);
@@ -2456,7 +2342,6 @@ static const TestCommandInfo s_testCommandInfos[] =
     { "CPP_COMPILER_EXECUTE",                   &runCPPCompilerExecute},
     { "CPP_COMPILER_SHARED_LIBRARY",            &runCPPCompilerSharedLibrary},
     { "CPP_COMPILER_COMPILE",                   &runCPPCompilerCompile},
-    { "CPU_EXECUTE",                            &runCPUExecuteTest},
 };
 
 TestResult runTest(
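
The render-test change above splits the old single `execute` call into a setup step (`calcExecuteInfo`) and a run step (`execute`), then re-runs the kernel under every entry point style and checks the results agree. The following is only a minimal, self-contained sketch of that pattern, not the actual `CPUComputeUtil` implementation: the types, the toy kernel, the `numThreadsPerGroup` parameter, and the style names `Thread` and `Group` are assumptions made for illustration (only `GroupRange` appears in the diff), and only the X dimension of the dispatch size is used to keep it short.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for illustration; these are NOT the real render-test types.
enum class ExecuteStyle { Thread, Group, GroupRange };

struct ExecuteInfo
{
    ExecuteStyle style;
    uint32_t dispatchSize[3];       // number of groups; only [0] is used in this sketch
    uint32_t numThreadsPerGroup;    // assumed 1D group size
    std::vector<int32_t>* buffer;   // output written by the toy kernel
};

// Toy kernel body: each thread writes twice its global index.
static void runThread(uint32_t groupX, uint32_t threadX, const ExecuteInfo& info)
{
    const uint32_t index = groupX * info.numThreadsPerGroup + threadX;
    (*info.buffer)[index] = int32_t(index) * 2;
}

// Runs every thread of one group.
static void runGroup(uint32_t groupX, const ExecuteInfo& info)
{
    for (uint32_t t = 0; t < info.numThreadsPerGroup; ++t)
        runThread(groupX, t, info);
}

// Runs every group in the half-open range [startGroup, endGroup).
static void runGroupRange(uint32_t startGroup, uint32_t endGroup, const ExecuteInfo& info)
{
    for (uint32_t g = startGroup; g < endGroup; ++g)
        runGroup(g, info);
}

// Setup step: validates inputs and allocates the output once, outside the timed run.
static bool calcExecuteInfo(ExecuteStyle style, const uint32_t dispatchSize[3],
                            uint32_t numThreadsPerGroup, std::vector<int32_t>& buffer,
                            ExecuteInfo& out)
{
    if (numThreadsPerGroup == 0)
        return false;
    out.style = style;
    for (int i = 0; i < 3; ++i)
        out.dispatchSize[i] = dispatchSize[i];
    out.numThreadsPerGroup = numThreadsPerGroup;
    buffer.assign(std::size_t(dispatchSize[0]) * numThreadsPerGroup, 0);
    out.buffer = &buffer;
    return true;
}

// Run step: only dispatches, using whichever entry point style was selected.
static void execute(const ExecuteInfo& info)
{
    assert(info.buffer != nullptr);
    const uint32_t groupCount = info.dispatchSize[0];
    switch (info.style)
    {
        case ExecuteStyle::Thread:
            for (uint32_t g = 0; g < groupCount; ++g)
                for (uint32_t t = 0; t < info.numThreadsPerGroup; ++t)
                    runThread(g, t, info);
            break;
        case ExecuteStyle::Group:
            for (uint32_t g = 0; g < groupCount; ++g)
                runGroup(g, info);
            break;
        case ExecuteStyle::GroupRange:
            runGroupRange(0, groupCount, info);
            break;
    }
}

// Runs the kernel once per style and checks that every style produces the same output.
static bool checkStyleConsistency(const uint32_t dispatchSize[3], uint32_t numThreadsPerGroup)
{
    const ExecuteStyle styles[] = { ExecuteStyle::Thread, ExecuteStyle::Group, ExecuteStyle::GroupRange };
    std::vector<int32_t> reference;
    bool hasReference = false;
    for (ExecuteStyle style : styles)
    {
        std::vector<int32_t> buffer;
        ExecuteInfo info = {};
        if (!calcExecuteInfo(style, dispatchSize, numThreadsPerGroup, buffer, info))
            return false;
        execute(info);
        if (!hasReference)
        {
            reference = buffer;
            hasReference = true;
        }
        else if (buffer != reference)
        {
            return false;
        }
    }
    return true;
}
```

Keeping validation and allocation in the setup step means the run step can be called, or timed, repeatedly without repeating that work, which appears to be the motivation for the split described in the commit message.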