Change parameter passing convention for CUDA (#1463)

The Big Picture
===============

Given input Slang code like:

```hlsl
Texture2D gA;

[shader("compute")]
void kernelFunc(uniform Texture2D b, uint3 tid : SV_DispatchThreadID)
{ ... }
```

the existing CUDA code generation strategy would always generate a kernel with a signature like:

```c++
struct GlobalParams { Texture2D gA; };
struct EntryPointParams { Texture2D b; };

extern "C" __global__ void kernelFunc(EntryPointParams* entryPointParams, GlobalParams* globalParams)
{ ... }
```

This choice was consistent with the conventions of the CPU kernel target, and shares the advantage that it is easy for the user to data-drive the logic for filling in parameters and then invoking a kernel.

However, the approach outlined above has two serious problems when used for CUDA kernels:

* First, it defies the programmer's expectation about what an "equivalent" CUDA kernel signature would be, which makes it awkward for a developer to invoke this kernel from CUDA C++ host code (especially in the context of an app that might also run hand-written CUDA kernels).
* Second, the performance of this approach suffers because every access to a global or entry-point parameter turns into a load from global memory. In contrast, a typical hand-written CUDA kernel passes its parameters via an implementation-specific path that (for current CUDA platforms) seems to be equivalent to `__constant__` memory in performance.

This change alters the convention so that the Slang compiler takes the code from the top of this message and translates it into something like:

```c++
struct GlobalParams { Texture2D gA; };
__constant__ GlobalParams SLANG_globalParams;

extern "C" __global__ void kernelFunc(
    Texture2D b
)
{
    ...
}
```

This translation alleviates both problems with the current translation:

* The signature of the generated CUDA kernel function is as close to that of the original as is possible (we had to eliminate the `SV_*`-semantic varying inputs), and should directly match what the programmer would expect in common cases.
* Entry-point parameters are passed via CUDA kernel parameters, and should thus match the performance of a hand-written kernel. Global parameters are passed via a variable in `__constant__` memory, and thus should also perform as well as possible/expected.

Detailed Changes
================

* Disable the `collectEntryPointUniformParams` pass for CUDA, so that entry-point `uniform` parameters are *not* bundled into a single `struct` and/or `ConstantBuffer`.
* When targeting CUDA, disable the logic for generating an entry-point parameter for passing in the global shader parameter(s).
* Allow `CLikeSourceEmitter` subclasses to override the name generated for entry-point symbols, and use this to add the required prefix for each OptiX kernel type when translating a ray-tracing kernel.
* Add logic to emit "parameter groups" in a specialized way for CUDA (this is the same approach that allows us to generate `cbuffer { ... }` declarations for fxc). A global-scope parameter group will turn into a global `__constant__` variable called `SLANG_globalParams` (that name becomes part of the ABI for Slang-compiled shaders).
* Update the logic in `render-test` for loading and invoking CUDA kernels to handle the new policy.

The last bullet there merits expansion, since it is indicative of the work a client using Slang would have to go through to use our generated kernels with the new policy:

* When loading a CUDA module with one or more kernels, we also use `cuModuleGetGlobal` to query the address of the `SLANG_globalParams` symbol in that CUDA module. That pointer needs to be used when setting global parameter values to be used by kernels in that CUDA module.
* Because our existing `BindPoint` logic for CUDA always sets up parameter data in GPU memory, we end up having to copy the entry-point parameter data from GPU memory to host memory. This step would ideally be skipped in a codebase that understands the correct policy, but it is a bit unfortunate that it is no longer trivially correct for an application to store all parameter data in GPU memory.
* Before invoking the kernel, we use `cudaMemcpyAsync` to copy from the prepared GPU memory for global parameters over to the `SLANG_globalParams` symbol associated with the kernel to be invoked. Because this operation is issued on the same CUDA stream as the kernel call, it is guaranteed not to overlap with GPU kernel execution.
* When invoking the kernel, we take advantage of the seldom-used `CU_LAUNCH_PARAM_BUFFER_POINTER` facility to specify a contiguous memory region with all the entry-point parameters in it instead of passing each entry-point parameter separately. Given Slang reflection it is also possible to query the offset of each entry-point parameter in the buffer, so we could invoke the kernel in the traditional fashion as well. The choice here is up to the application.

Caveats
=======

* This is a breaking change, and any subsequent release will need to reflect that fact. Any customers who rely on Slang's current CUDA codegen strategy are likely to be surprised by this change, and I don't see an easy way to give them a more gentle transition.
* This change does *not* remove the logic that introduces a `KernelContext` type for code that requires it. That means that things like `static` global variables can continue to work on CUDA for now, but we know that those are not going to be something we can support in the long term with separate compilation.
* While the policy implemented in this change is a reasonable default, it is still not going to perfectly match expectations for some developers.
  In particular, some developers who are familiar with both D3D and CUDA will likely wonder why a global `cbuffer` in Slang translates to a global-memory pointer in the output CUDA instead of one global `__constant__` variable per `cbuffer`. A more detailed alternate translation would generate a distinct global `__constant__` variable for each top-level constant buffer or parameter block. We may need to refine the translation even more based on feedback from users who care about how we handle global-scope parameters.
* Recent changes in Slang have broken the logic that handles the OptiX "shader record" as an alternative mechanism for passing entry-point parameters. In order to get any level of OptiX support up and running we will have to change the IR passes that run on CUDA kernels to actually run the "collection" of `uniform` parameters for ray tracing stages, and then to replace references to the resulting parameter with a call to the function to access the shader record.
* The use of `SLANG_globalParams` here works well enough in the case of whole-program compilation; every `CUmodule` ends up with zero or one parameter with this name, and an application can just hard-code it. As a mechanism it wouldn't work in the presence of separately-compiled modules that might introduce their own global parameters (including cases like constant lookup tables that really want to be at the global scope). An alternative approach would have Slang generate output PTX for each module, where a module has an optional global symbol for its own global-scope parameters (with a mangled name that is based on the module name), and then a linked CUDA binary has all of those distinct symbols. Such an approach would be compatible with module-at-a-time reflection and parameter binding, but would lead to another breaking change down the line for code that switches to `SLANG_globalParams`.
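
The message notes that copying entry-point parameter data from GPU memory back to the host "would ideally be skipped". One way an application might do that is to stage entry-point parameters in host memory from the start. The following is a minimal sketch under stated assumptions, not code from Slang or `render-test`: `HostParamBuffer` is a hypothetical helper, and the byte offsets and total size passed to it are assumed to come from Slang reflection data.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>
#include <vector>

// Hypothetical helper: a host-memory staging buffer for the
// entry-point parameters of one kernel launch.
class HostParamBuffer
{
public:
    // `sizeInBytes` is assumed to be the total size of the entry-point
    // parameter data, as reported by reflection.
    explicit HostParamBuffer(std::size_t sizeInBytes)
        : m_data(sizeInBytes, 0)
    {}

    // Copy `value` into the buffer at the given byte offset.
    template<typename T>
    void write(std::size_t offset, const T& value)
    {
        static_assert(std::is_trivially_copyable<T>::value,
            "parameter data must be trivially copyable");
        assert(offset + sizeof(T) <= m_data.size());
        std::memcpy(m_data.data() + offset, &value, sizeof(T));
    }

    // Read a value back from the given byte offset (useful for checks).
    template<typename T>
    T read(std::size_t offset) const
    {
        static_assert(std::is_trivially_copyable<T>::value,
            "parameter data must be trivially copyable");
        assert(offset + sizeof(T) <= m_data.size());
        T value;
        std::memcpy(&value, m_data.data() + offset, sizeof(T));
        return value;
    }

    void* data() { return m_data.data(); }
    std::size_t size() const { return m_data.size(); }

private:
    std::vector<unsigned char> m_data;
};
```

With such a buffer, `data()` and the address of a variable holding `size()` could be supplied through `CU_LAUNCH_PARAM_BUFFER_POINTER` and `CU_LAUNCH_PARAM_BUFFER_SIZE`, in the same way the updated `render-test` code uses its temporary host copy.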
author: Tim Foley <tfoleyNV@users.noreply.github.com> 2020-07-28 15:14:31 -0700
committer: GitHub <noreply@github.com> 2020-07-28 15:14:31 -0700
commit: cd106730ea52511a672c9c2c5c8697eaca3b57c8 (patch)
tree: d1311cab1a92522023dbe66b3e5ef981f922c578 /tools
parent: dce1d353bf8994220618d53d32455791631096c3 (diff)
1 files changed, 84 insertions, 11 deletions
diff --git a/tools/render-test/cuda/cuda-compute-util.cpp b/tools/render-test/cuda/cuda-compute-util.cpp
index 304784518..5acddf94f 100644
--- a/tools/render-test/cuda/cuda-compute-util.cpp
+++ b/tools/render-test/cuda/cuda-compute-util.cpp
@@ -979,6 +979,22 @@ static SlangResult _loadAndInvokeComputeProgram(
     ScopeCUDAModule cudaModule;
     SLANG_RETURN_ON_FAIL(cudaModule.load(kernelDesc.codeBegin));
 
+    // The global-scope shader parameters in the input Slang program
+    // will be collected into a single `__constant__` global variable
+    // in the output CUDA module.
+    //
+    // We need to query the address of the `__constant__` variable
+    // so that we can copy parameter data into it when invoking
+    // a kernel.
+    //
+    // The Slang compiler always names this symbol `SLANG_globalParams`
+    // so that it is easy to look up independent of the module or
+    // entry point in question.
+    //
+    CUdeviceptr globalParamsSymbol = 0;
+    size_t globalParamsSymbolSize = 0;
+    cuModuleGetGlobal(&globalParamsSymbol, &globalParamsSymbolSize, cudaModule, "SLANG_globalParams");
+
     slang::EntryPointReflection* entryPoint = nullptr;
     auto entryPointCount = reflection->getEntryPointCount();
     SLANG_ASSERT(entryPointCount == 1);
@@ -999,25 +1015,82 @@ static SlangResult _loadAndInvokeComputeProgram(
     int sharedSizeInBytes;
     SLANG_CUDA_RETURN_ON_FAIL(cuFuncGetAttribute(&sharedSizeInBytes, CU_FUNC_ATTRIBUTE_SHARED_SIZE_BYTES, cudaEntryPoint));
 
-    // Work out the args
-    CUdeviceptr uniformCUDAData = MemoryCUDAResource::getCUDAData(bindRoot.getRootValue());
-    CUdeviceptr entryPointCUDAData = MemoryCUDAResource::getCUDAData(bindRoot.getEntryPointValue());
-
-    // NOTE! These are pointers to the cuda memory pointers
-    void* args[] = { &entryPointCUDAData , &uniformCUDAData };
-
+    // A single CUDA kernel can be invoked with thread groups
+    // of different shapes/sizes, but an HLSL/Slang compute
+    // kernel always has a fixed thread group shape baked in.
+    // We use reflection to query the thread-group size that
+    // the kernel expects, so that we can use the right size
+    // when invoking the kernel.
+    //
     SlangUInt numThreadsPerAxis[3];
     entryPoint->getComputeThreadGroupSize(3, numThreadsPerAxis);
 
-    // Launch
+    // The argument data for the kernel has been set up in `bindRoot`,
+    // which encapsulates global buffers for both the global and
+    // entry-point parameter data.
+    //
+    // In the case of global parameters, we just need to extract the
+    // device address of the parameter data, so we can copy it into
+    // the `SLANG_globalParams` symbol.
+    //
+    {
+        CUdeviceptr globalParamsCUDAData = MemoryCUDAResource::getCUDAData(bindRoot.getRootValue());
+        cudaMemcpyAsync(
+            (void*) globalParamsSymbol,
+            (void*) globalParamsCUDAData,
+            globalParamsSymbolSize,
+            cudaMemcpyDeviceToDevice,
+            cudaStream);
+    }
+    //
+    // In the case of the entry-point parameters, we have to deal with
+    // two different wrinkles.
+    //
+    // First, the `bindRoot` will have the entry-point argument data
+    // stored in a GPU-memory buffer, but we actually need it to be
+    // in host CPU memory. We handle that for now by allocating a
+    // temporary host memory buffer (if needed) and copying the data
+    // from device to host.
+    //
+    auto entryPointBindValue = bindRoot.getEntryPointValue();
+    CUdeviceptr entryPointCUDAData = MemoryCUDAResource::getCUDAData(entryPointBindValue);
+    size_t entryPointDataSize = entryPointBindValue ? entryPointBindValue->m_sizeInBytes : 0;
+    void* entryPointHostData = nullptr;
+    if(entryPointDataSize)
+    {
+        entryPointHostData = alloca(entryPointDataSize);
+        cudaMemcpy(entryPointHostData, (void*)entryPointCUDAData, entryPointDataSize, cudaMemcpyDeviceToHost);
+    }
+    //
+    // Second, the argument data for the entry-point parameters has
+    // been allocated and filled in as a single buffer, but `cuLaunchKernel`
+    // defaults to taking pointers to each of the kernel arguments.
+    //
+    // We could loop over the entry-point parameters using the reflection
+    // information, and set up a pointer to each using the offset stored
+    // for it in the reflection data. Such an approach would require
+    // us to create and fill in a dynamically-sized array here.
+    //
+    // Instead, we take advantage of a documented but seldom-used feature
+    // of `cuLaunchKernel` that allows the argument data for all of the
+    // kernel "launch parameters" to be specified as a single buffer.
+    //
+    void* extraOptions[] = {
+        CU_LAUNCH_PARAM_BUFFER_POINTER, (void*) entryPointHostData,
+        CU_LAUNCH_PARAM_BUFFER_SIZE, &entryPointDataSize,
+        CU_LAUNCH_PARAM_END,
+    };
+
+    // Once we have all the necessary data extracted and/or
+    // set up, we can launch the kernel and see what happens.
+    //
     auto cudaLaunchResult = cuLaunchKernel(cudaEntryPoint,
         dispatchSize[0], dispatchSize[1], dispatchSize[2], 
         int(numThreadsPerAxis[0]), int(numThreadsPerAxis[1]), int(numThreadsPerAxis[2]),        // Threads per block
         0,              // Shared memory size
         cudaStream,     // Stream. 0 is no stream.
-        args,           // Args
-        nullptr);       // extra
-
+        nullptr,        // Not using traditional argument passing
+        extraOptions);  // Instead passing kernel arguments via "extra" options
     SLANG_CUDA_RETURN_ON_FAIL(cudaLaunchResult);
 
     // Do a sync here. Makes sure any issues are detected early and not on some implicit sync
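
The comments in the hunk above mention an alternative to `CU_LAUNCH_PARAM_BUFFER_POINTER`: setting up one pointer per entry-point parameter, using the offset stored for each parameter in the reflection data. A minimal sketch of that approach is shown here. `makeKernelParamPointers` is a hypothetical helper that does not exist in `render-test`, and the offsets are assumed to have been queried from reflection beforehand.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: given one contiguous host buffer holding all
// entry-point parameters, and the byte offset of each parameter in it,
// build the dynamically-sized array of per-parameter pointers.
inline std::vector<void*> makeKernelParamPointers(
    void* hostBuffer,
    std::size_t hostBufferSize,
    const std::vector<std::size_t>& offsets)
{
    std::vector<void*> pointers;
    pointers.reserve(offsets.size());

    auto* base = static_cast<unsigned char*>(hostBuffer);
    for (std::size_t offset : offsets)
    {
        // Each parameter must start inside the buffer.
        assert(offset < hostBufferSize);
        pointers.push_back(base + offset);
    }
    return pointers;
}
```

The resulting `pointers.data()` could then be passed as the `kernelParams` argument of `cuLaunchKernel`, with `extra` left as `nullptr`. Both the vector and the underlying buffer need to stay alive until the launch call returns.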