slang.git - Making it easier to work with shaders

	Commit message (Collapse)	Author	Age
*	Update function type after inserting KernelContext parameter (#8551)	Julius Ikkala	2025-09-29
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Without this, there are functions with missing parameters in their type in the IR after running the `introduceExplicitGlobalContext` pass: ``` [layout(%15)] [export("_SV4test12outputBuffer")] [nameHint("outputBuffer")] let %outputBuffer : _ = key [noSideEffect] [export("_S4test7dostuffp1pi_ff")] [nameHint("dostuff")] func %dostuff : Func(Float, Float) { block %34( [nameHint("f")] param %f : Float, [nameHint("kernelContext")] param %kernelContext : Ptr(%KernelContext, 0 : UInt64, 1 : UInt64)): let %35 : Float = mul(%f, %f) let %36 : Ptr(ConstantBuffer(%GlobalParams, DefaultLayout), 0 : UInt64, 1 : UInt64) = get_field_addr(%kernelContext, %globalParams) let %37 : ConstantBuffer(%GlobalParams, DefaultLayout) = load(%36) let %38 : Ptr(RWStructuredBuffer(Float, DefaultLayout, %20)) = get_field_addr(%37, %outputBuffer) let %39 : RWStructuredBuffer(Float, DefaultLayout, %20) = load(%38) let %40 : Ptr(Float) = rwstructuredBufferGetElementPtr(%39, 1 : Int) let %41 : Float = load(%40) let %42 : Float = mul(%35, %41) return_val(%42) } ``` Not sure why this doesn't seem to negatively affect existing targets, but it sure is an issue for the LLVM target I'm working on. I could've left this fix for that PR, but I want to check now if this causes any issues with the existing targets using the CI. This also happens with the entry point functions, where the function type is not updated after adding `ComputeThreadVaryingInput`. This had no effect in the C++ target because `convertEntryPointPtrParamsToRawPtrs(irModule);` is called right after and fixes it.
*	[CBP] Pointer frontend changes + groupshared pointer support (#7848)	ArielG-NV	2025-08-29
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Resolves #7628 Resolves: #8197 Primary Goals: 1. Add `Access` to pointer 2. AddressSpace::GroupShared support for pointers (SPIR-V) 3. Add `__getAddress()` to replace `&` * `&` is not updated to `require(cpu)` since slangpy uses `&`. This means we must: (1) merge PR; (2) replace `&` with `__getAddress()`; (3) add `require(cpu)` to `&` Changes: * Added to `Ptr` the `Access` generic argument & logic (for `Access::Read`). * Moved the generic argument `AddressSpace` from `Ptr` to the end of the type. * Added pointer casting support between any `Ptr` as long as the `AddressSpace` is the same * Disallow globallycoherent T* and coherent T* * Disallow const T, T const, and const T* * Fixed .natvis display of `ConstantValue` `ValOperandNode` * Support generic resolution of type-casted integers * Added `VariablePointer` emitting for spirv + other minor logic needed for groupshared pointers Breaking Changes: * Anyone using the `AddressSpace` of `Ptr` will now have to account for the `Access` argument * we disallow various syntax paired with `Ptr` and `T*` --------- Co-authored-by: slangbot <186143334+slangbot@users.noreply.github.com>
*	Address issues with GLSL style global in/out vars (#6669) (#6998)	sricker-nvidia	2025-06-06
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	* Address issues with GLSL style global in/out vars (#6669) Asserts and segfaults were observed trying to compile a simple vertex shader like: ```` in int2 inPos; [shader("vertex")] main(uniform int2 test1, int2 test2, out float4 pos: SV_Position) void main() { // Bogus use of all input vars to prevent optimizing out. pos = float4(inPos.x, test1.x, test2.y, 0); } ```` Further investigation found that while replacing "uniform int2 test1" with "int2 test1" allowed for successful compilation, the resulting output shader would have overlapping location qualifiers. For example, compiling the above with "int2 test1" to glsl might give: ```` ... layout(location = 0) in ivec2 test1_0; layout(location = 1) in ivec2 test2_0; layout(location = 0) in ivec2 translatedGlobalParams_inPos_0; ... ```` This was because Slang does not actually support mixing GLSL style global in/out vars and entry point params. However, this is never checked for or noted in documentation. Slang source also assumes input shaders do not mix these and these assumptions ultimately led to the observed asserts and seg faults when using uniform entry point params. This change makes updates to throw an error when the compiler detects that it is trying to translate global in/out variables into entry point params when an entry point already contains parameters, allowing for compilation to fail gracefully. Certain tests have been updated to avoid mixing GLSL style global in/out vars and entry point params. This was mostly for tests that were using functions like WaveGetLaneIndex which use global in vars for certain platforms (see __builtinWaveLaneIndex). * Address issues with GLSL style global in/out vars - updates 1 (#6669) Update addresses review feedback to support mixing GLSL-flavored global in/out vars and entrypoint parameters when either all global in/out vars or all entry point params have a system value binding semantic. * Address issues with GLSL style global in/out vars - updates 2 (#6669) This update attempts to actually allow mixing GLSL style global in vars and entry point vars. Change attempts to recalculate offsets when adding the global input vars into the recreated entry point params layout. Additional updates were made to: -resolve further issues uncovered with entry point uniform params. -Address improper use of SV_DispatchThreadID in wave-get-lane-index.slang for metal. "thread_position_in_grid" is not supported for signed integer scalars or vectors. -Fix a spirv casting conflict due to the implementation of gl_PrimitiveID.get conflicting with PrimitiveIndex(). -Add a call to remove a global var in replaceUsesOfGlobalVar(). The global var is already replaced in this function and keeping it around can prevent it from being cleaned up by DCE if it still has decorations. * format code --------- Co-authored-by: slangbot <186143334+slangbot@users.noreply.github.com>
*	Remove unnecessary parameters from Metal entry point signature (#6131)	Darren Wihandi	2025-01-22
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	* fix metal entry point global params * address review comments, cleanup and test * remove dead code * undo accidental change * address review comments and cleanup * minor fix and cleanup --------- Co-authored-by: Yong He <yonghe@outlook.com>
*	Support specialization constant on WGSL and Metal. (#5780)	Yong He	2024-12-06
\|
*	Move switch statement bodies to their own lines (#5493)	Ellie Hermaszewska	2024-11-05
\| \| \| \| \| \| \| \| \|	* Move switch statement bodies to their own lines * format --------- Co-authored-by: Yong He <yonghe@outlook.com>
*	format	Ellie Hermaszewska	2024-10-29
\| \| \| \| \| \| \|	* format * Minor test fixes * enable checking cpp format in ci
*	Remove using SpvStorageClass values casted into AddressSpace values (#4861)	Ellie Hermaszewska	2024-08-19
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	* Remove using SpvStorageClass values casted into AddressSpace values Also removes support for specific storage classes in __target_intrinsic snippets * remove SLANG_RETURN_NEVER macro * squash warnings * Make nonexhaustive switch statement error on gcc * Add SLANG_EXHAUSTIVE_SWITCH_BEGIN/END macros --------- Co-authored-by: Yong He <yonghe@outlook.com>
*	Move SPIRV global variables into a context variable (#4741)	ArielG-NV	2024-07-30
\|
*	Overhaul IR lowering of pointer types. (#4710)	Yong He	2024-07-25
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	* Overhaul IR lowering of pointer types. * Propagate address space in IRBuilder. * Fixup. * Fix. * Fix. * Change how Ptr type is printed to text. * Fix.
*	Fix cuda/cpp/metal crash for when using GLSL style shader inputs (#4378)	ArielG-NV	2024-06-13
\| \| \|	Decorations were not expected as an input, this causes a crash.
*	Support groupshared variables for Metal. (#4116)	Yong He	2024-05-06
\|
*	Delete `wrap-global-context` pass. (#4114)	Yong He	2024-05-06
\| \| \| \| \| \| \|	* Delete `wrap-global-context` pass. The pass was added for the metal backend without realizing that the existing `explicit-global-context` does 99% of the job. Instead of duplicating the logic in a different pass for metal, we extend same explicit-global-context pass to work for metal. * Fix build.
*	Dictionary using lowerCamel (#2835)	jsmall-nvidia	2023-04-25
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	* #include an absolute path didn't work - because paths were taken to always be relative. * WIP lowerCamel Dictionary. * WIP more lowerCamel fixes for Dictionary. * Add/Remove/Clear * GetValue/Contains * Fix tabs in dictionary. Count -> getCount * Fix fields with caps. * Key -> key Value -> value Use m_ for members where appropriate. Use lowerCamel in linked list. * Some small fixes/improvements to Dictionary. * Kick CI.
*	Remove `SharedIRBuilder`. (#2657)	Yong He	2023-02-16
\| \| \|	Co-authored-by: Yong He <yhe@nvidia.com>
*	Actual global support (#2262)	jsmall-nvidia	2022-06-08
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	* #include an absolute path didn't work - because paths were taken to always be relative. * Use TerminatedUnownedStringSlice for literals in output C++. * Remove Escape/Unescape functions used in slang-token-reader.cpp Add target type of 'host-cpp' etc to map to the target types. * Fix some corner cases around string encoding. * Added unit test for string escaping. Fixed some assorted escaping bugs. * Updated test output. * Added decode test. * Stop using hex output, to get around 'greedy' aspect. Use octal instead. * Added HostHostCallable Small changes to use ArtifactDesc/Info instead of large switches. * Fix C++ emit to handle arbitrary function export. * Add options handling for callable without an output being specified. * Can compile with COM interface. Added example using com interface. * Use the IR Ptr type instead of hack in C++ emit for interfaces. * Fix issue with outputting the COM call when ptr is used. * Fix crash issue on compilation failure. * Add support for __global. * Added `ActualGlobalRate` Added special handling around globals and COM interfaces. Tested out in cpu-com-example. * Fix typo in NodeBase. * Support for accessing globals by name working. * Check that actual global initialization is working. * Refactor the com replacement such that it doesn't need a cache or do anything special with GlobalVar. * Remove context. Only create replacement if needed. * Split out COM host-callable into a unit-test. * host-callable com testing on C++and llvm. * Comment around the COM ptr replacement. * Disable com test on vs 32 bit. Fix C++ prelude * Disable 32 bit targets testing com host-callable. * Use JSON parsing to locate VS version. * Need platform detection in C++prelude. * Fix com host callable test for LLVM. * Work around for not being able to include "targetConditionals.h"
*	Fix for KernelContext threading issue for C++ targets (#1843)	jsmall-nvidia	2021-05-14
\| \| \| \| \| \| \|	* #include an absolute path didn't work - because paths were taken to always be relative. * Fix for issue where threading KernelContext was not working on C++ test when there were multiple invocations. * Improve test for context threading.
*	Add an accessor for IRInst opcode (#1707)	Tim Foley	2021-02-16
\| \| \| \| \| \| \| \| \|	* Add an accessor for IRInst opcode This main changing is renaming `IRInst::op` over to `IRInst::m_op` and then adds an accessor `IRInst::getOp()` to read it. The rest of the changes are just changing use sites to `getOp` (or to `m_op` in the limited cases where we write to it). This work is in anticipation of a future change that might need to store an extra bit in the same field as the opcode. It seemed better to do this massive refactoring as a separate PR. * fixup
*	Change parameter passing convention for CUDA (#1463)	Tim Foley	2020-07-28
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The Big Picture =============== Given input Slang code like: ```hlsl Texture2D gA; [shader("compute")] void kernelFunc(uniform Texture2D b, uint3 tid : SV_DispatchThreadID) { ... } ``` the existing CUDA code generation strategy would always generate a kernel with a signature like: ```c++ struct GlobalParams { Texture2D gA; } struct EntryPointParams { Texture2D b; } extern "C" __global__ void kernelFunc(EntryPointParams* entryPointParams, GlobalParams* globalParams) { ... } ``` This choice was consistent with the conventions of the CPU kernel target, and shares the advantage that it is easy for the user to data-drive the logic for filling in parameters and then invoking a kernel. However, the approach outlined above has two serious problems when used for CUDA kernels: * First, it defies the programmer's expectation about what an "equivalent" CUDA kernel signature would be, which makes it awkward for a developer to invoke this kernel from CUDA C++ host code (especially in the context of an app that might also run hand-written CUDA kernels). * Second, the performance of this approach suffers because every access to a global or entry point parameter turns into a load from global memory. In contrast, a typical hand-written CUDA kernel passes its parameters via an implementation-specific path that (for current CUDA platforms) seems to be equivalent to `__constant__` memory in performance. This change alters the convention so that the Slang compiler takes the code from the top of this message and translates it into something like: ```c++ struct GlobalParams { Texture2D gA; } __constant__ GlobalParams SLANG_globalParams; extern "C" __global__ void kernelFunc( Texture2D b ) { ... } ``` This translation alleviates both problems with the current translation: * The signature of the generated CUDA kernel function is as close to that of the original as is possible (we had to eliminate the `SV_`-semantic varying inputs), and should directly match what the programmer would expect in common cases. Entry-point parameters are passed via CUDA kernel parameters, and should thus match in performance. Global parameters are passed via a variable in `__constant__` memory, and thus should also perform as well as possible/expected. Detailed Changes ================ * Disable the `collectEntryPointUniformParams` pass for CUDA, so that entry-point `uniform` parameters are not bundles into a single `struct` and/or `ConstantBuffer`. * When targeting CUDA, disable the logic for generating an entry-point parameter for passing in the global shader parameter(s) * Allow `CLikeSourceEmitter` subclasses to override the name generated for entry-point symbols, and use this to add the required prefix for each OptiX kernel type when translating a ray-tracing kernel. * Add logic to emit "parameter groups" in a specialized way for CUDA (this is the same approach that allows us to generate `cbufffer { ... }` declarations for fxc). A global-scope parameter group will turn into a global `__constant__` variable called `SLANG_globalParams` (that name becomes part of the ABI for Slang-compiled shaders). * Update the logic in `render-test` for loading and invoking CUDA kernels to handle the new policy. The last bullet there merits expansion, since it is indicative of the work a client using Slang would have to go through to use our generated kernels with the new policy: * When loading a CUDA module with one or more kernels, we also use `cuModuleGetGlobal` to query the address of the `SLANG_globalParams` symbol in that CUDA module. That pointer needs to be used when setting global parameter values to be used by kernels in that CUDA odule. * Because our existing `BindPoint` logic for CUDA always sets up parameter data in GPU memory, we end up having to copy the entry-point parameter data from GPU memory to host memory. This step would ideally be skipped in a codebase that understands the correct policy, but it is a bit unfortunate that it is no longer trivially correct for an application to store all parameter data in GPU memory. * Before invoking the kernel, we need to use a `cudaMemcpyAsync` to copy from the prepared GPU memory for global parameters over to the `SLANG_globalParams` symbol associated with the kernel to be invoked. Because this operations is issued on the same CUDA stream as the kernel call, it is guaranteed to not overlap with GPU kernel execution. * When invoking the kernel, we take advantage of the seldom-used `CU_LAUNCH_PARAM_BUFFER_POINTER` facility to specify a contiguous memory region with all the entry-point parameters in it instead of passing each entry-point parameter separately. Given Slang reflection it is also possible to query the offset of each entry-point parameter in the buffer, so we could invoke the kernel in the traditional fashion as well. The choice here is up to the application. Caveats ======= * This is a breaking change, and any subsequent release will need to reflect that fact. Any customers who rely on Slang's current CUDA codegen strategy are likely to be surprised by this change, and I don't see an easy way to give them a more gentle transition. * This change does not remove the logic that introduces a `KernelContext` type for code that requires it. That means that things like `static` global variables can continue to work on CUDA for now, but we know that those are not going to be something we can support in the long-term with separate compilation. * While the policy implemented in this change is a reasonable default, it is still not going to perfectly match expecations for some developers. In particular, some developers who are familiar with both D3D and CUDA will likely wonder why a global `cbuffer` in Slang translates to a global-memory pointer in the output CUDA instead of one global `__constant__` variable per `cbuffer`. A more detailed alternate translation would generate a distinct global `__constant__` variable for each top-level constant buffer or parameter block. We may need to refine the translation even more based on feedback from users who care about how we handle global-scope parameters. * Recent changes in Slang have broken the logic that handles the OptiX "shader record" as an alternative mechanism for passing entry-point parameters. In order to get any level of OptiX support up and running we will have to change the IR passes that run on CUDA kernels to actually run the "collection" of `uniform` parameters for ray tracing stages, and then to replace references to the resulting parameter with a call to the function to access the shader record. * The use of `SLANG_globalParams` here works well enough in the case of whole-program compilation; every `CUmodule` ends up with (zero or) one parameter with this name, and an application can just hard-code it. As a mechanism it wouldn't work in the presence of separately-compiled modules that might introduce their own global parameters (including cases like constant lookup tables that really want to be at the global scope). An alternative approach would have Slang generate output PTX for each module, where a module has an optional global symbol for its own global-scope parameters (with a mangled name that is based on the module name), and then a linked CUDA binary has all of those distinct symbols. Such an approach would be compatible with module-at-a-time reflection and parameter binding, but would lead to another breaking change down the line for code that switches to `SLANG_globalParams`.
*	Handle case of no global parameters for CPU/CUDA (#1457)	Tim Foley	2020-07-24
\| \| \| \| \| \| \|	The IR pass that introduces an explicit `KernelContext` for the CPU/CUDA back-ends was also responsible for adding an explicit parameter to the kernel entry point to receive the constant buffer (pointer) with all the global uniform parameters. However, if there were no global uniform parameters, this parameter wasn't getting introduced, which changed the signature/ABI of the generated entry point function. This change makes it so that the pass unconditionally adds a parameter. In the case where there are no global uniforms it just adds a `void*` parameter that never gets used. In order to avoid future regressions, this change also adds a test case to confirm that things work correctly when a kernel has only entry-point parameters and no global parameters.
*	Remove KernelContext wrapper from CPU/CUDA emit (#1440)	Tim Foley	2020-07-15
	* Remove KernelContext wrapper from CPU/CUDA emit Currently, the CPU and CUDA C++ targets rely on a `KernelContext` type that is generated during emit, as a way to provide implicit access to things that were global in the input Slang code, but that can't actually be emitted as globals in the target language (because the semantics of global declarations differ). For example, input like: ```hlsl ConstantBuffer<Stuff> gStuff; // shader parameter groupshared int gData[1024]; // thread-group shared variable static int gCounter = 0; // "thread-local" global-scope variable void subroutine() { ... } [shader("compute")] void computeMain() { ... } ``` would translate to output C++ for CPU a bit like: ```c++ struct KernelContext { ConstantBuffer<Stuff> gStuff; int gData[1024]; int gCounter = 0; void subroutine() { ... } void computeMain() { ... } }; ``` Note that both `computeMain()` and `subroutine()` are non-`static` members functions on `KernelContext`, so they have an implicit `this` parameter of type `KernelContext`, which allows the bodies of those functions to implicitly reference `gStuff`, etc. by name in their bodies. Because `KernelContext::computeMain()` is a member function, we end up emitting an additional global-scope function to expose the entry point to the outside world, and that function is responsible for declaring a local `KernelContext` and invoking the generated entry point on it. This approach has several important drawbacks: * It complicates the emit logic for CPU and CUDA, with many special cases around when/how things get emitted * It complicates the implementation of dynamic dispatch, because what seems like a function pointer in Slang IR needs to be a pointer-to-member-function in C++. * It makes it difficult to have a non-kernel-oriented mode of compilation for CPU where a Slang function with a given signature gets output as a C++ CPU function with the "same" signature (not wrapped up as a member function of `KernelContext`. This change makes a step toward addressing these issues by making the introducing of the `KernelContext` type be something that is done in an explicit IR pass instead of being handled as part of the last-mile emit logic. The most important change is the removal of code related to `KernelContext` from the `slang-emit-{cpp,cuda}.{h,cpp}` files, with the equivalent logic instead being handled in a new pass in `slang-ir-explicit-global-context.{h,cpp}`. It should be noted that further cleanups to the emit logic should now be possible; in particular, both the CPU and CUDA emit paths are manually sequencing the `EmitAction`s instead of relying on the default logic, but at this point they should be able to just use the default. The additional cleanups are left for future work. The explicit IR pass does more or less what one would expect: it identifies global-scope entities (global variables and parameters) that need to be wrapped and turns them into fields of a `KernelContext` type. It then modifies all entry points to initialize a `KernelContext` as part of their startup. Finally, any code that used to refer to the global entities is changed to refer to a field of the context, with the context passed via new function parameters (the new parameter is only added to functions that need it for now). Transforming global variables into fields of a `KernelContext` type in the IR pass ends up dropping their initial-value expressions (since those were attached as basic blocks on the `IRGlobalVar`). To avoid breaking code that relies on global-scope (but thread-local) variables, this change also adds an explicit pass that takes the initialization logic on all global variables and moves it to explicit logic that runs at the start of every entry point in a linked module (`slang-ir-explicit-global-init.{h,cpp}`). This pass would also be useful when we get back to direct SPIR-V emit, since SPIR-V also requires initialization logic for globals to be emitted into entry points. One complication that arises when the IR is introducing the types for entry-point parameters, global-scope parameters, and the `KernelContext` type is that it becomes harder for the emit logic to utter the names of those types (they might not even have names, since `IRNameHint`s might get stripped). This created a problem since the wrapper operations that were being generated for CPU were taking `void` parameters and casting them to the appropriate type. To work around this issue, we have added an explicit IR pass (`slang-ir-entry-point-raw-ptr-params.{h,cpp}`) that transforms the signature of entry points so that any pointer parameters instead become raw pointer (`void`) parameters, with the casting being handled inside the entry point itself. One consequence of all the above changes is that for the CUDA target we no longer need a wrapper function to invoke the generated entry point any more, because the IR function for the entry point ends up having the correct/expected signature already. This is also the case for CPU when it comes to the `_Thread` wrapper function, but this change doesn't try to eliminate the wrapper because of a belief that the `_Thread`-level interface is going away anyway. Because the IR is now responsible for ensuring the signature of the IR entry point for CUDA and CPU is what is expected, I needed to modify the `slang-ir-entry-point-uniforms` pass to always create an explicit parameter for the entry point uniforms when compiling for CUDA/CPU, even if there were no `uniform` parameters on the entry point as written. This also ended up requiring some tweaks to the parameter layout logic to ensure that CPU/CUDA targets always treat `ConstantBuffer<T>` as a `T` even in the case where `T` is an empty `struct` type (which happens when we construct a `struct` type to represent the uniform parameters of an entry point with no uniform parameters...). There are several future changes that can/should build on this work: We should change the generated signatures for CUDA kernels, so that they don't rely on `KernelContext` for global-scope parameters. At that point we can avoid generating a `KernelContext` at all for CUDA, except when a program uses global-scope thread-local variables. * We should figure out how to make the "ABI" for dynamic-dispatch calls ensure that the kernel context is either always passed, or always not passed. Making a hard-and-fast rule as part of the calling convention for dynamic calls would ensure that they access through the context continues to work with dynamic calls (this change might break it in some cases). * We should figure out how to handle the layout for the `KernelContext` in cases where a program is composed of multiple separately-compiled modules. Right now the layout of the `KernelContext` requires global knowledge (as does the pass that introduces explicit initialization for global-scope thread-locals). * We should try to further clean up the CPU/CUDA C++ emit logic to fall back on the default emit behavior more, now that the various special-case approaches that were taken are no longer needed * fixup: restore build files to default configuration