Remove KernelContext wrapper from CPU/CUDA emit (#1440)

* Remove KernelContext wrapper from CPU/CUDA emit

Currently, the CPU and CUDA C++ targets rely on a `KernelContext` type that is generated during emit, as a way to provide implicit access to things that were global in the input Slang code, but that can't actually be emitted as globals in the target language (because the semantics of global declarations differ).

For example, input like:

```hlsl
ConstantBuffer<Stuff> gStuff;    // shader parameter
groupshared int gData[1024];     // thread-group shared variable
static int gCounter = 0;         // "thread-local" global-scope variable

void subroutine() { ... }

[shader("compute")]
void computeMain() { ... }
```

would translate to output C++ for CPU a bit like:

```c++
struct KernelContext
{
    ConstantBuffer<Stuff> gStuff;
    int gData[1024];
    int gCounter = 0;

    void subroutine() { ... }

    void computeMain() { ... }
};
```

Note that both `computeMain()` and `subroutine()` are non-`static` member functions on `KernelContext`, so they have an implicit `this` parameter of type `KernelContext`, which allows the bodies of those functions to implicitly reference `gStuff`, etc. by name.

Because `KernelContext::computeMain()` is a member function, we end up emitting an additional global-scope function to expose the entry point to the outside world, and that function is responsible for declaring a local `KernelContext` and invoking the generated entry point on it.

This approach has several important drawbacks:

* It complicates the emit logic for CPU and CUDA, with many special cases around when/how things get emitted.

* It complicates the implementation of dynamic dispatch, because what seems like a function pointer in Slang IR needs to be a pointer-to-member-function in C++.

* It makes it difficult to have a non-kernel-oriented mode of compilation for CPU where a Slang function with a given signature gets output as a C++ CPU function with the "same" signature (not wrapped up as a member function of `KernelContext`).

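To make the old shape and the dynamic-dispatch drawback concrete, here is a minimal C++ sketch of the member-function approach. It is an illustration only, not actual emitter output: the `Stuff` layout, the function bodies, the `computeMain_wrapper` name, the wrapper's return value, and `callIndirect` are all hypothetical.

```cpp
// Hypothetical stand-in for the Slang `Stuff` parameter block.
struct Stuff { int scale; };

// Old strategy (sketch): former globals become data members, and every
// function becomes a non-static member function with an implicit `this`.
struct KernelContext
{
    Stuff* gStuff = nullptr;  // shader parameter, modeled here as a pointer
    int gData[1024] = {};     // thread-group shared variable
    int gCounter = 0;         // "thread-local" global-scope variable

    // Hypothetical body: touches former globals through `this`.
    int subroutine(int x) { gCounter++; return x * gStuff->scale; }

    void computeMain() { gData[0] = subroutine(21); }
};

// Global-scope wrapper (hypothetical name): declares a local context and
// invokes the member-function entry point on it. The return value exists
// only so the sketch has something observable.
extern "C" int computeMain_wrapper(void* globalParams)
{
    KernelContext context = {};
    context.gStuff = static_cast<Stuff*>(globalParams);
    context.computeMain();
    return context.gData[0];
}

// The dynamic-dispatch drawback: an indirect call to `subroutine` needs a
// pointer-to-member-function plus an object, not a plain function pointer.
using SubroutinePtr = int (KernelContext::*)(int);

int callIndirect(KernelContext& context, SubroutinePtr fn, int x)
{
    return (context.*fn)(x);
}
```

The `(context.*fn)(x)` call syntax is what makes a Slang IR function pointer awkward to lower under this scheme.
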
This change takes a step toward addressing these issues by making the introduction of the `KernelContext` type something that is done in an explicit IR pass instead of being handled as part of the last-mile emit logic.

The most important change is the removal of code related to `KernelContext` from the `slang-emit-{cpp,cuda}.{h,cpp}` files, with the equivalent logic instead being handled in a new pass in `slang-ir-explicit-global-context.{h,cpp}`.

It should be noted that further cleanups to the emit logic should now be possible; in particular, both the CPU and CUDA emit paths are manually sequencing the `EmitAction`s instead of relying on the default logic, but at this point they should be able to just use the default. The additional cleanups are left for future work.

The explicit IR pass does more or less what one would expect: it identifies global-scope entities (global variables and parameters) that need to be wrapped and turns them into fields of a `KernelContext` type. It then modifies all entry points to initialize a `KernelContext` as part of their startup. Finally, any code that used to refer to the global entities is changed to refer to a field of the context, with the context passed via new function parameters (for now, the new parameter is only added to functions that need it).

Transforming global variables into fields of a `KernelContext` type in the IR pass ends up dropping their initial-value expressions (since those were attached as basic blocks on the `IRGlobalVar`). To avoid breaking code that relies on global-scope (but thread-local) variables, this change also adds an explicit pass that takes the initialization logic on all global variables and moves it to explicit logic that runs at the start of every entry point in a linked module (`slang-ir-explicit-global-init.{h,cpp}`). This pass would also be useful when we get back to direct SPIR-V emit, since SPIR-V also requires initialization logic for globals to be emitted into entry points.

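As a rough illustration of what the two passes described above produce, the following C++ sketch shows code in the shape one might expect after lowering. It is written by hand under assumptions, not taken from compiler output: `Stuff`, `Result`, `addOne`, and all function bodies are hypothetical, and the `groupshared` array is omitted for brevity.

```cpp
// Hypothetical stand-in for the Slang `Stuff` parameter block.
struct Stuff { int scale; };

// After the explicit-global-context pass (sketch): former globals are plain
// fields, and the context type has no member functions.
struct KernelContext
{
    Stuff* gStuff;    // was a global-scope shader parameter
    int    gCounter;  // was `static int gCounter = 0;`
};

// A function that used former globals gains an explicit context parameter.
int subroutine(KernelContext* context, int x)
{
    context->gCounter++;
    return x * context->gStuff->scale;
}

// A function that never touched a global keeps its original signature.
int addOne(int x) { return x + 1; }

// Hypothetical result type, present only to make the sketch observable.
struct Result { int value; int counter; };

// The entry point sets up the context as part of its startup.
Result computeMain(Stuff* globalParams)
{
    KernelContext context;
    context.gStuff = globalParams;

    // Explicit-global-init pass (sketch): the initializer that used to live
    // on the global variable now runs at the start of the entry point.
    context.gCounter = 0;

    Result result;
    result.value = subroutine(&context, addOne(20));
    result.counter = context.gCounter;
    return result;
}

// Indirect calls can now use an ordinary function pointer.
using SubroutinePtr = int (*)(KernelContext*, int);
```

Because `gCounter` is re-initialized on every call to `computeMain`, each invocation sees a fresh value, which matches the "thread-local" behavior the message describes.
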
One complication that arises when the IR is introducing the types for entry-point parameters, global-scope parameters, and the `KernelContext` type is that it becomes harder for the emit logic to utter the names of those types (they might not even have names, since `IRNameHint`s might get stripped). This created a problem since the wrapper operations that were being generated for CPU were taking `void*` parameters and casting them to the appropriate type. To work around this issue, we have added an explicit IR pass (`slang-ir-entry-point-raw-ptr-params.{h,cpp}`) that transforms the signature of entry points so that any pointer parameters instead become raw pointer (`void*`) parameters, with the casting being handled inside the entry point itself.

One consequence of all the above changes is that for the CUDA target we no longer need a wrapper function to invoke the generated entry point, because the IR function for the entry point ends up having the correct/expected signature already. This is also the case for CPU when it comes to the `*_Thread` wrapper function, but this change doesn't try to eliminate the wrapper because of a belief that the `*_Thread`-level interface is going away anyway.

Because the IR is now responsible for ensuring the signature of the IR entry point for CUDA and CPU is what is expected, I needed to modify the `slang-ir-entry-point-uniforms` pass to always create an explicit parameter for the entry point uniforms when compiling for CUDA/CPU, even if there were no `uniform` parameters on the entry point as written. This also ended up requiring some tweaks to the parameter layout logic to ensure that CPU/CUDA targets always treat `ConstantBuffer<T>` as a `T*` even in the case where `T` is an empty `struct` type (which happens when we construct a `struct` type to represent the uniform parameters of an entry point with no uniform parameters...).

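The raw-pointer transformation and the always-present uniforms parameter can be sketched in plain C++ as follows. The two-pointer `(void* entryPointParams, void* globalParams)` shape mirrors the compute-kernel signature visible in the removed CUDA emit code; everything else (the struct layouts, `noUniformsMain`, and the function bodies) is hypothetical.

```cpp
// Hypothetical parameter structs; in the real pipeline these types are
// introduced by IR passes and may not have stable names.
struct EntryPointParams { int offset; };
struct EmptyEntryPointParams {};  // entry point written with no `uniform`s
struct GlobalParams { int scale; };

// After the raw-pointer pass (sketch): the entry point itself takes `void*`
// parameters and performs the casts in its body, so no outer wrapper has to
// name the parameter types.
int computeMain(void* rawEntryPointParams, void* rawGlobalParams)
{
    auto* entryPointParams = static_cast<EntryPointParams*>(rawEntryPointParams);
    auto* globalParams = static_cast<GlobalParams*>(rawGlobalParams);
    return globalParams->scale * 20 + entryPointParams->offset;
}

// An entry point with no uniform parameters still receives the parameter,
// typed as a pointer to an empty struct, so every entry point has the same
// signature shape.
int noUniformsMain(void* rawEntryPointParams, void* rawGlobalParams)
{
    auto* entryPointParams = static_cast<EmptyEntryPointParams*>(rawEntryPointParams);
    (void)entryPointParams;  // nothing to read from an empty struct
    auto* globalParams = static_cast<GlobalParams*>(rawGlobalParams);
    return globalParams->scale;
}
```

Keeping the signature uniform means a caller can launch any entry point the same way without knowing whether it declared `uniform` parameters.
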
There are several future changes that can/should build on this work:

* We should change the generated signatures for CUDA kernels, so that they don't rely on `KernelContext` for global-scope parameters. At that point we can avoid generating a `KernelContext` at all for CUDA, except when a program uses global-scope thread-local variables.

* We should figure out how to make the "ABI" for dynamic-dispatch calls ensure that the kernel context is either always passed, or always *not* passed. Making a hard-and-fast rule as part of the calling convention for dynamic calls would ensure that access through the context continues to work with dynamic calls (this change might break it in some cases).

* We should figure out how to handle the layout for the `KernelContext` in cases where a program is composed of multiple separately-compiled modules. Right now the layout of the `KernelContext` requires global knowledge (as does the pass that introduces explicit initialization for global-scope thread-locals).

* We should try to further clean up the CPU/CUDA C++ emit logic to fall back on the default emit behavior more, now that the various special-case approaches that were taken are no longer needed.

* fixup: restore build files to default configuration
author: Tim Foley <tfoleyNV@users.noreply.github.com> 2020-07-15 09:31:27 -0700
committer: GitHub <noreply@github.com> 2020-07-15 09:31:27 -0700
commit: 723c9b1b3607ba910abbeb72f4f13bdff3cbd502 (patch)
tree: 387ecf8c0a3324ebeb8361bb1abda08f8589721d /source/slang/slang-emit-cuda.cpp
parent: 48f26ef082fa3b0c2a02dc57585f7e43210bbb63 (diff)
1 files changed, 13 insertions, 221 deletions
diff --git a/source/slang/slang-emit-cuda.cpp b/source/slang/slang-emit-cuda.cpp
index c7dee9f9d..6f24d5b74 100644
--- a/source/slang/slang-emit-cuda.cpp
+++ b/source/slang/slang-emit-cuda.cpp
@@ -248,6 +248,19 @@ void CUDASourceEmitter::emitEntryPointAttributesImpl(IRFunc* irFunc, IREntryPoin
     SLANG_UNUSED(entryPointDecor);
 }
 
+void CUDASourceEmitter::emitFunctionPreambleImpl(IRInst* inst)
+{
+    if(inst && inst->findDecoration<IREntryPointDecoration>())
+    {
+        m_writer->emit("extern \"C\" __global__ ");
+    }
+    else
+    {
+        m_writer->emit("__device__ ");
+    }
+}
+
+
 void CUDASourceEmitter::emitCall(const HLSLIntrinsic* specOp, IRInst* inst, const IRUse* operands, int numOperands, const EmitOpInfo& inOuterPrec)
 {
     switch (specOp->op)
@@ -661,10 +674,6 @@ void CUDASourceEmitter::emitModuleImpl(IRModule* module)
 
     _emitForwardDeclarations(actions);
 
-    IRGlobalParam* entryPointParams = nullptr;
-    IRGlobalParam* globalParams = nullptr;
-    _findShaderParams(&entryPointParams, &globalParams);
-
     // Output group shared variables
 
     {
@@ -677,20 +686,7 @@ void CUDASourceEmitter::emitModuleImpl(IRModule* module)
         }
     }
 
-    // Output the 'Context' which will be used for execution
     {
-        m_writer->emit("struct KernelContext\n{\n");
-        m_writer->indent();
-
-        if (globalParams)
-        {
-            emitGlobalInst(globalParams);
-        }
-        if (entryPointParams)
-        {
-            emitGlobalInst(entryPointParams);
-        }
-
         // Output all the thread locals 
         for (auto action : actions)
         {
@@ -708,211 +704,7 @@ void CUDASourceEmitter::emitModuleImpl(IRModule* module)
                 emitGlobalInst(action.inst);
             }
         }
-
-        m_writer->dedent();
-        m_writer->emit("};\n\n");
     }
-
-    // Finally we need to output dll entry points
-
-    for (auto action : actions)
-    {
-        if (action.level == EmitAction::Level::Definition && as<IRFunc>(action.inst))
-        {
-            IRFunc* func = as<IRFunc>(action.inst);
-
-            IREntryPointDecoration* entryPointDecor = func->findDecoration<IREntryPointDecoration>();
-
-            if (entryPointDecor)
-            {
-                // We have an entry-point function in the IR module, which we
-                // will want to emit as a `__global__` function in the generated
-                // CUDA C++.
-                //
-                // The most common case will be a compute kernel, in which case
-                // we will emit the function more or less as-is, including
-                // usingits original name as the name of the global symbol.
-                //
-                String funcName = getName(func);
-                String globalSymbolName = funcName;
-
-                // We also suport emitting ray tracing kernels for use with
-                // OptiX, and in that case the name of the global symbol
-                // must be prefixed to indicate to the OptiX runtime what
-                // stage it is to be compiled for.
-                //
-                auto stage = entryPointDecor->getProfile().getStage();
-                switch( stage )
-                {
-                default:
-                    break;
-
-            #define CASE(STAGE, PREFIX) \
-                case Stage::STAGE: globalSymbolName = #PREFIX + funcName; break
-
-                CASE(RayGeneration, __raygen__);
-                // TODO: Add the other ray tracing shader stages here.
-            #undef CASE
-                }
-
-                if(globalParams && stage != Stage::Compute )
-                {
-                    // Non-compute shaders (currently just OptiX ray tracing kernels)
-                    // require parameter data that is shared across multiple kernels
-                    // (which in our case is the global-scope shader parameters)
-                    // to be passed using a global `__constant__` variable.
-                    //
-                    // The use of `"C"` linkage here is required because the name
-                    // of this symbol must be passed to the OptiX API when creating
-                    // a pipeline that uses this compiled module. The exact name
-                    // used here (`SLANG_globalParams`) is thus a part of the
-                    // binary interface for Slang->OptiX translation.
-                    //
-                    // TODO: We need to make a decision about how indirected
-                    // the parameter passing for global-scope data is going to
-                    // be for CUDA and OptiX (ideally with an answer that is
-                    // consistent across the two). For now we are deciding to
-                    // make this global `__constant__` variable represent the
-                    // global parameter data directly, rather than indirectly.
-                    //
-                    auto globalParamsPtrType = as<IRPointerLikeType>(globalParams->getDataType());
-                    SLANG_ASSERT(globalParamsPtrType);
-                    auto gloablParamsElementType = globalParamsPtrType->getElementType();
-                    //
-                    m_writer->emit("extern \"C\" { __constant__ ");
-                    emitType(gloablParamsElementType, "SLANG_globalParams");
-                    m_writer->emit("; }\n");
-                }
-
-                // As a convenience for anybody reading the generated
-                // CUDA C++ code, we will prefix a compute kernel
-                // with the information from the `[numthreads(...)]`
-                // attribute in the source.
-                //
-                if(stage == Stage::Compute)
-                {
-                    Int sizeAlongAxis[kThreadGroupAxisCount];
-                    getComputeThreadGroupSize(func, sizeAlongAxis);
-
-                    // 
-                    m_writer->emit("// [numthreads(");
-                    for (int ii = 0; ii < kThreadGroupAxisCount; ++ii)
-                    {
-                        if (ii != 0) m_writer->emit(", ");
-                        m_writer->emit(sizeAlongAxis[ii]);
-                    }
-                    m_writer->emit(")]\n");
-                }
-
-                m_writer->emit("extern \"C\" __global__  ");
-               
-                auto resultType = func->getResultType();
-
-                // Emit the actual function
-                emitEntryPointAttributes(func, entryPointDecor);
-                emitType(resultType, globalSymbolName);
-
-                if( stage == Stage::Compute )
-                {
-                    // CUDA compute shaders take all of their parameters explicitly as
-                    // part of the entry-point parameter list. This means that the
-                    // data representing Slang shader parameters at both the global
-                    // and entry-point scopes needs to be passed as parameters.
-                    //
-                    // At the binary level, our generated CUDA compute kernels will take
-                    // two pointer parameters: the first points to the per-entry-point
-                    // `uniform` parameter data, and the second points to the global-scope
-                    // parameter data (if any).
-                    //
-                    m_writer->emit("(void* entryPointParams, void* globalParams)");
-                }
-                else
-                {
-                    // Non-compute shaders (currently just OptiX ray tracing kernels)
-                    // rely on other mechanisms for parameter passing, and thus use
-                    // an empty parameter list on the kernel declaration.
-                    //
-                    m_writer->emit("()");
-                }
-
-                emitSemantics(func);
-                m_writer->emit("\n{\n");
-                m_writer->indent();
-
-                // Initialize when constructing so that globals are zeroed
-                m_writer->emit("KernelContext context = {};\n");
-
-                // The global-scope parameter data got passed in differently depending on whether we have
-                // a compute shader or a ray-tracing shader, so we need to alter how we initialize
-                // the pointer in our `context` based on the stage.
-                //
-                if( globalParams )
-                {
-                    if( stage == Stage::Compute )
-                    {
-                        m_writer->emit("context.");
-                        m_writer->emit(getName(globalParams));
-                        m_writer->emit(" = (");
-                        emitType(globalParams->getDataType());
-                        m_writer->emit(")globalParams;\n");
-                    }
-                    else
-                    {
-                        m_writer->emit("context.");
-                        m_writer->emit(getName(globalParams));
-                        m_writer->emit(" = &SLANG_globalParams;\n");
-                    }
-                }
-
-                if (entryPointParams)
-                {
-                    auto varDecl = entryPointParams;
-                    auto rawType = varDecl->getDataType();
-                    auto varType = rawType;
-
-                    m_writer->emit("context.");
-                    m_writer->emit(getName(varDecl));
-                    m_writer->emit(" =  (");
-                    emitType(varType);
-                    m_writer->emit(")");
-
-                    // Similar to the case for global parameter data above, the entry-point
-                    // uniform parameter data gets passed in differently for compute kernels
-                    // vs. ray-tracing kernels, and we need to handle the two cases here.
-                    //
-                    if( stage == Stage::Compute )
-                    {
-                        // In the compute case, the entry-point uniform parameters came
-                        // in as an explicit parameter on the CUDA kernel, and we simply
-                        // cast it to the expected type here.
-                        //
-                        m_writer->emit("entryPointParams");
-                    }
-                    else
-                    {
-                        // In the ray-tracing case, the entry-point uniform parameters
-                        // implicitly map to the contents of the Shader Binding Table
-                        // (SBT) entry for the entry point instance being invoked.
-                        //
-                        // The OptiX API provides an accessor function to get a pointer
-                        // to the SBT data for the current entry, and we cast the result
-                        // of that to the expected type.
-                        //
-                        m_writer->emit("optixGetSbtDataPointer()");
-                    }
-                    m_writer->emit(";\n");
-                }
-
-                m_writer->emit("context.");
-                m_writer->emit(funcName);
-                m_writer->emit("();\n");
-
-                m_writer->dedent();
-                m_writer->emit("}\n");
-            }
-        }
-    }
-    
 }