Change parameter passing convention for CUDA (#1463)

The Big Picture =============== Given input Slang code like: ```hlsl Texture2D gA; [shader("compute")] void kernelFunc(uniform Texture2D b, uint3 tid : SV_DispatchThreadID) { ... } ``` the existing CUDA code generation strategy would always generate a kernel with a signature like: ```c++ struct GlobalParams { Texture2D gA; } struct EntryPointParams { Texture2D b; } extern "C" __global__ void kernelFunc(EntryPointParams* entryPointParams, GlobalParams* globalParams) { ... } ``` This choice was consistent with the conventions of the CPU kernel target, and shares the advantage that it is easy for the user to data-drive the logic for filling in parameters and then invoking a kernel. However, the approach outlined above has two serious problems when used for CUDA kernels: * First, it defies the programmer's expectation about what an "equivalent" CUDA kernel signature would be, which makes it awkward for a developer to invoke this kernel from CUDA C++ host code (especially in the context of an app that might also run hand-written CUDA kernels). * Second, the performance of this approach suffers because every access to a global or entry point parameter turns into a load from global memory. In contrast, a typical hand-written CUDA kernel passes its parameters via an implementation-specific path that (for current CUDA platforms) seems to be equivalent to `__constant__` memory in performance. This change alters the convention so that the Slang compiler takes the code from the top of this message and translates it into something like: ```c++ struct GlobalParams { Texture2D gA; } __constant__ GlobalParams SLANG_globalParams; extern "C" __global__ void kernelFunc( Texture2D b ) { ... } ``` This translation alleviates both problems with the current translation: * The signature of the generated CUDA kernel function is as close to that of the original as is possible (we had to eliminate the `SV_*`-semantic varying inputs), and should directly match what the programmer would expect in common cases. * Entry-point parameters are passed via CUDA kernel parameters, and should thus match in performance. Global parameters are passed via a variable in `__constant__` memory, and thus should also perform as well as possible/expected. Detailed Changes ================ * Disable the `collectEntryPointUniformParams` pass for CUDA, so that entry-point `uniform` parameters are *not* bundles into a single `struct` and/or `ConstantBuffer`. * When targeting CUDA, disable the logic for generating an entry-point parameter for passing in the global shader parameter(s) * Allow `CLikeSourceEmitter` subclasses to override the name generated for entry-point symbols, and use this to add the required prefix for each OptiX kernel type when translating a ray-tracing kernel. * Add logic to emit "parameter groups" in a specialized way for CUDA (this is the same approach that allows us to generate `cbufffer { ... }` declarations for fxc). A global-scope parameter group will turn into a global `__constant__` variable called `SLANG_globalParams` (that name becomes part of the ABI for Slang-compiled shaders). * Update the logic in `render-test` for loading and invoking CUDA kernels to handle the new policy. The last bullet there merits expansion, since it is indicative of the work a client using Slang would have to go through to use our generated kernels with the new policy: * When loading a CUDA module with one or more kernels, we also use `cuModuleGetGlobal` to query the address of the `SLANG_globalParams` symbol in that CUDA module. That pointer needs to be used when setting global parameter values to be used by kernels in that CUDA odule. * Because our existing `BindPoint` logic for CUDA always sets up parameter data in GPU memory, we end up having to copy the entry-point parameter data from GPU memory to host memory. This step would ideally be skipped in a codebase that understands the correct policy, but it is a bit unfortunate that it is no longer trivially correct for an application to store all parameter data in GPU memory. * Before invoking the kernel, we need to use a `cudaMemcpyAsync` to copy from the prepared GPU memory for global parameters over to the `SLANG_globalParams` symbol associated with the kernel to be invoked. Because this operations is issued on the same CUDA stream as the kernel call, it is guaranteed to not overlap with GPU kernel execution. * When invoking the kernel, we take advantage of the seldom-used `CU_LAUNCH_PARAM_BUFFER_POINTER` facility to specify a contiguous memory region with all the entry-point parameters in it instead of passing each entry-point parameter separately. Given Slang reflection it is also possible to query the offset of each entry-point parameter in the buffer, so we could invoke the kernel in the traditional fashion as well. The choice here is up to the application. Caveats ======= * This is a breaking change, and any subsequent release will need to reflect that fact. Any customers who rely on Slang's current CUDA codegen strategy are likely to be surprised by this change, and I don't see an easy way to give them a more gentle transition. * This change does *not* remove the logic that introduces a `KernelContext` type for code that requires it. That means that things like `static` global variables can continue to work on CUDA for now, but we know that those are not going to be something we can support in the long-term with separate compilation. * While the policy implemented in this change is a reasonable default, it is still not going to perfectly match expecations for some developers. In particular, some developers who are familiar with both D3D and CUDA will likely wonder why a global `cbuffer` in Slang translates to a global-memory pointer in the output CUDA instead of one global `__constant__` variable per `cbuffer`. A more detailed alternate translation would generate a distinct global `__constant__` variable for each top-level constant buffer or parameter block. We may need to refine the translation even more based on feedback from users who care about how we handle global-scope parameters. * Recent changes in Slang have broken the logic that handles the OptiX "shader record" as an alternative mechanism for passing entry-point parameters. In order to get any level of OptiX support up and running we will have to change the IR passes that run on CUDA kernels to actually run the "collection" of `uniform` parameters for ray tracing stages, and then to replace references to the resulting parameter with a call to the function to access the shader record. * The use of `SLANG_globalParams` here works well enough in the case of whole-program compilation; every `CUmodule` ends up with (zero or) one parameter with this name, and an application can just hard-code it. As a mechanism it wouldn't work in the presence of separately-compiled modules that might introduce their own global parameters (including cases like constant lookup tables that really want to be at the global scope). An alternative approach would have Slang generate output PTX for each module, where a module has an optional global symbol for its own global-scope parameters (with a mangled name that is based on the module name), and then a linked CUDA binary has all of those distinct symbols. Such an approach would be compatible with module-at-a-time reflection and parameter binding, but would lead to another breaking change down the line for code that switches to `SLANG_globalParams`.
author: Tim Foley <tfoleyNV@users.noreply.github.com> 2020-07-28 15:14:31 -0700
committer: GitHub <noreply@github.com> 2020-07-28 15:14:31 -0700
commit: cd106730ea52511a672c9c2c5c8697eaca3b57c8 (patch)
tree: d1311cab1a92522023dbe66b3e5ef981f922c578 /source
parent: dce1d353bf8994220618d53d32455791631096c3 (diff)
7 files changed, 139 insertions, 72 deletions
diff --git a/source/slang/slang-emit-c-like.cpp b/source/slang/slang-emit-c-like.cpp
index ce3dc8957..4c6c89ef5 100644
--- a/source/slang/slang-emit-c-like.cpp
+++ b/source/slang/slang-emit-c-like.cpp
@@ -660,6 +660,11 @@ String CLikeSourceEmitter::scrubName(const String& name)
     return sb.ProduceString();
 }
 
+String CLikeSourceEmitter::generateEntryPointNameImpl(IREntryPointDecoration* entryPointDecor)
+{
+    return entryPointDecor->getName()->getStringSlice();
+}
+
 String CLikeSourceEmitter::generateName(IRInst* inst)
 {
     // If the instruction names something
@@ -686,7 +691,7 @@ String CLikeSourceEmitter::generateName(IRInst* inst)
             return "main";
         }
 
-        return entryPointDecor->getName()->getStringSlice();
+        return generateEntryPointNameImpl(entryPointDecor);
     }
 
     // If we have a name hint on the instruction, then we will try to use that
diff --git a/source/slang/slang-emit-c-like.h b/source/slang/slang-emit-c-like.h
index d813a819e..b89d5d1c4 100644
--- a/source/slang/slang-emit-c-like.h
+++ b/source/slang/slang-emit-c-like.h
@@ -156,6 +156,8 @@ public:
     String scrubName(const String& name);
 
     String generateName(IRInst* inst);
+    virtual String generateEntryPointNameImpl(IREntryPointDecoration* entryPointDecor);
+
     String getName(IRInst* inst);
 
     void emitDeclarator(IRDeclaratorInfo* declarator);    
diff --git a/source/slang/slang-emit-cuda.cpp b/source/slang/slang-emit-cuda.cpp
index 6f24d5b74..acd913865 100644
--- a/source/slang/slang-emit-cuda.cpp
+++ b/source/slang/slang-emit-cuda.cpp
@@ -239,7 +239,15 @@ void CUDASourceEmitter::emitLayoutSemanticsImpl(IRInst* inst, char const* unifor
 
 void CUDASourceEmitter::emitParameterGroupImpl(IRGlobalParam* varDecl, IRUniformParameterGroupType* type)
 {
-    Super::emitParameterGroupImpl(varDecl, type);
+    auto elementType = type->getElementType();
+
+    m_writer->emit("extern \"C\" __constant__ ");
+    emitType(elementType, "SLANG_globalParams");
+    m_writer->emit(";\n");
+
+    m_writer->emit("#define ");
+    m_writer->emit(getName(varDecl));
+    m_writer->emit(" (&SLANG_globalParams)\n");
 }
 
 void CUDASourceEmitter::emitEntryPointAttributesImpl(IRFunc* irFunc, IREntryPointDecoration* entryPointDecor)
@@ -260,6 +268,59 @@ void CUDASourceEmitter::emitFunctionPreambleImpl(IRInst* inst)
     }
 }
 
+String CUDASourceEmitter::generateEntryPointNameImpl(IREntryPointDecoration* entryPointDecor)
+{
+    // We have an entry-point function in the IR module, which we
+    // will want to emit as a `__global__` function in the generated
+    // CUDA C++.
+    //
+    // The most common case will be a compute kernel, in which case
+    // we will emit the function more or less as-is, including
+    // usingits original name as the name of the global symbol.
+    //
+    String funcName = Super::generateEntryPointNameImpl(entryPointDecor);
+    String globalSymbolName = funcName;
+
+    // We also suport emitting ray tracing kernels for use with
+    // OptiX, and in that case the name of the global symbol
+    // must be prefixed to indicate to the OptiX runtime what
+    // stage it is to be compiled for.
+    //
+    auto stage = entryPointDecor->getProfile().getStage();
+    switch( stage )
+    {
+    default:
+        break;
+
+#define CASE(STAGE, PREFIX) \
+    case Stage::STAGE: globalSymbolName = #PREFIX + funcName; break
+
+    // Optix 7 Guide, Section 6.1 (Program input)
+    //
+    // > The input PTX should include one or more NVIDIA OptiX programs.
+    // > The type of program affects how the program can be used during
+    // > the execution of the pipeline. These program types are specified
+    // by prefixing the program�s name with the following:
+    //
+    // >    Program type        Function name prefix
+    CASE(   RayGeneration,      __raygen__);
+    CASE(   Intersection,       __intersection__);
+    CASE(   AnyHit,             __anyhit__);
+    CASE(   ClosestHit,         __closesthit__);
+    CASE(   Miss,               __miss__);
+    CASE(   Callable,           __direct_callable__);
+    //
+    // There are two stages (or "program types") supported by OptiX
+    // that Slang currently cannot target:
+    //
+    // CASE(ContinuationCallable,   __continuation_callable__);
+    // CASE(Exception,              __exception__);
+    //
+#undef CASE
+    }
+
+    return globalSymbolName;
+}
 
 void CUDASourceEmitter::emitCall(const HLSLIntrinsic* specOp, IRInst* inst, const IRUse* operands, int numOperands, const EmitOpInfo& inOuterPrec)
 {
@@ -642,6 +703,24 @@ void CUDASourceEmitter::emitPreprocessorDirectivesImpl()
     }
 }
 
+bool CUDASourceEmitter::tryEmitGlobalParamImpl(IRGlobalParam* varDecl, IRType* varType)
+{
+    // A global shader parameter in the IR for CUDA output will
+    // either be the unique constant buffer that wraps all the
+    // global-scope parameters in the original code (which is
+    // handled as a special-case before this routine would be
+    // called), or it is one of the system-defined varying inputs
+    // like `threadIdx`. We won't need to emit anything in the
+    // output code for the latter case, so we need to emit
+    // nothing here and return `true` so that the base class
+    // uses our logic instead of the default.
+    //
+    SLANG_UNUSED(varDecl);
+    SLANG_UNUSED(varType);
+    return true;
+}
+
+
 void CUDASourceEmitter::emitModuleImpl(IRModule* module)
 {
     // Setup all built in types used in the module
@@ -660,51 +739,7 @@ void CUDASourceEmitter::emitModuleImpl(IRModule* module)
 
     // TODO(JS): We may need to generate types (for example for matrices)
 
-    // TODO(JS): We need to determine which functions we need to inline
-
-    // The IR will usually come in an order that respects
-    // dependencies between global declarations, but this
-    // isn't guaranteed, so we need to be careful about
-    // the order in which we emit things.
-
-    List<EmitAction> actions;
-
-    computeEmitActions(module, actions);
-
-
-    _emitForwardDeclarations(actions);
-
-    // Output group shared variables
-
-    {
-        for (auto action : actions)
-        {
-            if (action.level == EmitAction::Level::Definition && action.inst->op == kIROp_GlobalVar && as<IRGroupSharedRate>(action.inst->getRate()))
-            {
-                emitGlobalInst(action.inst);   
-            }
-        }
-    }
-
-    {
-        // Output all the thread locals 
-        for (auto action : actions)
-        {
-            if (action.level == EmitAction::Level::Definition && action.inst->op == kIROp_GlobalVar && !as<IRGroupSharedRate>(action.inst->getRate()))
-            {
-                emitGlobalInst(action.inst);
-            }
-        }
-
-        // Finally output the functions as methods on the context
-        for (auto action : actions)
-        {
-            if (action.level == EmitAction::Level::Definition && as<IRFunc>(action.inst))
-            {
-                emitGlobalInst(action.inst);
-            }
-        }
-    }
+    CLikeSourceEmitter::emitModuleImpl(module);
 }
 
 
diff --git a/source/slang/slang-emit-cuda.h b/source/slang/slang-emit-cuda.h
index ae78a5e57..156d5fab1 100644
--- a/source/slang/slang-emit-cuda.h
+++ b/source/slang/slang-emit-cuda.h
@@ -57,12 +57,13 @@ protected:
     virtual void emitMatrixLayoutModifiersImpl(IRVarLayout* layout) SLANG_OVERRIDE;
     virtual void emitCall(const HLSLIntrinsic* specOp, IRInst* inst, const IRUse* operands, int numOperands, const EmitOpInfo& inOuterPrec) SLANG_OVERRIDE;
     virtual void emitFunctionPreambleImpl(IRInst* inst) SLANG_OVERRIDE;
+    virtual String generateEntryPointNameImpl(IREntryPointDecoration* entryPointDecor) SLANG_OVERRIDE;
 
     virtual void emitLoopControlDecorationImpl(IRLoopControlDecoration* decl) SLANG_OVERRIDE;
 
     virtual void handleCallExprDecorationsImpl(IRInst* funcValue) SLANG_OVERRIDE;
 
-    //virtual bool tryEmitGlobalParamImpl(IRGlobalParam* varDecl, IRType* varType) SLANG_OVERRIDE;
+    virtual bool tryEmitGlobalParamImpl(IRGlobalParam* varDecl, IRType* varType) SLANG_OVERRIDE;
     virtual bool tryEmitInstExprImpl(IRInst* inst, const EmitOpInfo& inOuterPrec) SLANG_OVERRIDE;
 
     virtual void emitPreprocessorDirectivesImpl() SLANG_OVERRIDE;
diff --git a/source/slang/slang-emit.cpp b/source/slang/slang-emit.cpp
index 7d8a4074e..af01e4b0d 100644
--- a/source/slang/slang-emit.cpp
+++ b/source/slang/slang-emit.cpp
@@ -242,20 +242,19 @@ Result linkAndOptimizeIR(
         CollectEntryPointUniformParamsOptions passOptions;
         switch( target )
         {
-        default:
+        case CodeGenTarget::CUDASource:
             break;
 
         case CodeGenTarget::CPPSource:
-        case CodeGenTarget::CUDASource:
             passOptions.alwaysCreateCollectedParam = true;
+        default:
+            collectEntryPointUniformParams(irModule, passOptions);
+        #if 0
+            dumpIRIfEnabled(compileRequest, irModule, "ENTRY POINT UNIFORMS COLLECTED");
+        #endif
+            validateIRModuleIfEnabled(compileRequest, irModule);
             break;
         }
-
-        collectEntryPointUniformParams(irModule, passOptions);
-    #if 0
-        dumpIRIfEnabled(compileRequest, irModule, "ENTRY POINT UNIFORMS COLLECTED");
-    #endif
-        validateIRModuleIfEnabled(compileRequest, irModule);
     }
 
     switch( target )
@@ -637,7 +636,10 @@ Result linkAndOptimizeIR(
     case CodeGenTarget::CUDASource:
         moveGlobalVarInitializationToEntryPoints(irModule);
         introduceExplicitGlobalContext(irModule, target);
-        convertEntryPointPtrParamsToRawPtrs(irModule);
+        if(target == CodeGenTarget::CPPSource)
+        {
+            convertEntryPointPtrParamsToRawPtrs(irModule);
+        }
     #if 0
         dumpIRIfEnabled(compileRequest, irModule, "EXPLICIT GLOBAL CONTEXT INTRODUCED");
     #endif
diff --git a/source/slang/slang-ir-explicit-global-context.cpp b/source/slang/slang-ir-explicit-global-context.cpp
index 8f11bce2c..32efd51e8 100644
--- a/source/slang/slang-ir-explicit-global-context.cpp
+++ b/source/slang/slang-ir-explicit-global-context.cpp
@@ -31,17 +31,6 @@ struct IntroduceExplicitGlobalContextPass
 
         IRBuilder builder(&sharedBuilder);
 
-        // The global context will be represneted by a `struct`
-        // type with a name hint of `KernelContext`.
-        //
-        m_contextStructType = builder.createStructType();
-        builder.addNameHintDecoration(m_contextStructType, UnownedTerminatedStringSlice("KernelContext"));
-
-        // The context will usually be passed around by pointer,
-        // so we get and cache that pointer type up front.
-        //
-        m_contextStructPtrType = builder.getPtrType(m_contextStructType);
-
         // The transformation we will perform will need to affect
         // global variables, global shader parameters, and entry-point
         // function (at the very least), and we start with an explicit
@@ -107,7 +96,13 @@ struct IntroduceExplicitGlobalContextPass
                     // Note: If we ever changed out mind about the representation
                     // and wanted to support multiple global parameters, we could
                     // easily generalize this code to work with a list.
-                    //
+
+                    // For CUDA output, we want to leave the global uniform
+                    // parameter where it is, because it will translate to
+                    // a global `__constant__` variable.
+                    if(m_target == CodeGenTarget::CUDASource)
+                        continue;
+
                     SLANG_ASSERT(!m_globalUniformsParam);
                     m_globalUniformsParam = globalParam;
                 }
@@ -132,9 +127,36 @@ struct IntroduceExplicitGlobalContextPass
             }
         }
 
+        // If there are no global-scope entities that require processing,
+        // then we can completely skip the work of this pass for CUDA.
+        //
+        // Note: We cannot skip the rest of the pass for CPU, because
+        // it is responsible for introducing the explicit entry-point
+        // parameter that is used for passing in the global param(s).
+        //
+        if( m_target == CodeGenTarget::CUDASource )
+        {
+            if( !m_globalUniformsParam && (m_globalVars.getCount() == 0) )
+            {
+                return;
+            }
+        }
+
         // Now that we've capture all the relevant global entities from the IR,
         // we can being to transform them in an appropriate order.
         //
+        // The global context will be represneted by a `struct`
+        // type with a name hint of `KernelContext`.
+        //
+        m_contextStructType = builder.createStructType();
+        builder.addNameHintDecoration(m_contextStructType, UnownedTerminatedStringSlice("KernelContext"));
+
+        // The context will usually be passed around by pointer,
+        // so we get and cache that pointer type up front.
+        //
+        m_contextStructPtrType = builder.getPtrType(m_contextStructType);
+
+
         // The first step will be to create fields in the `KernelContext`
         // type to represent any global parameters or global variables.
         //
@@ -270,9 +292,9 @@ struct IntroduceExplicitGlobalContextPass
             //
             globalUniformsParam->insertBefore(firstOrdinary);
         }
-        else
+        else if(m_target == CodeGenTarget::CPPSource)
         {
-            // The nature of our current ABI for entry points on CPU/CUDA
+            // The nature of our current ABI for entry points on CPU
             // means that we need an explicit parameter to be *declared*
             // for the global uniforms, even if it is never used.
             //
diff --git a/source/slang/slang.vcxproj b/source/slang/slang.vcxproj
index f20a4a322..aaece7095 100644
--- a/source/slang/slang.vcxproj
+++ b/source/slang/slang.vcxproj
@@ -404,4 +404,4 @@
   <Import Project="$(VCTargetsPath)\Microsoft.Cpp.targets" />
   <ImportGroup Label="ExtensionTargets">
   </ImportGroup>
-</Project> 
-\ No newline at end of file
+</Project>
+\ No newline at end of file
author	Tim Foley <tfoleyNV@users.noreply.github.com>	2020-07-28 15:14:31 -0700
committer	GitHub <noreply@github.com>	2020-07-28 15:14:31 -0700
commit	cd106730ea52511a672c9c2c5c8697eaca3b57c8 (patch)
tree	d1311cab1a92522023dbe66b3e5ef981f922c578 /source
parent	dce1d353bf8994220618d53d32455791631096c3 (diff)