slang.git/tests/cuda, branch master

Immutable access qualifier for pointers and use `__ldg` on cuda. (#8710)

2025-10-16T03:59:47+00:00

This PR implements `Access.Immutable` to allow pointers to immutable data. The new type `ImmutablePtr` is defined as an alias of `Ptr`. By forming an immutable pointer, the programmer tells the compiler that the data at the pointer address will never change during the execution of the current program. The compiler can therefore deduplicate loads from immutable pointers, and those loads are translated to `__ldg` when generating CUDA code.

The SPIRV backend is not changed in this PR. The current SPIRV spec makes it very difficult to express loads from immutable addresses without generating many wrappers and boilerplate type declarations. We would like to see the spec evolve around its support for `NonWritable` physical storage pointers or immutable loads before we attempt to express such immutability in SPIRV. For now, we simply emit ordinary pointers and loads when generating SPIRV.

---------

Co-authored-by: slangbot <186143334+slangbot@users.noreply.github.com>
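To make the CUDA lowering concrete, here is a minimal C++ sketch of what an immutable load could look like in generated code. The helper names `loadImmutable` and `sumTwice` and the `SKETCH_HD` macro are hypothetical and are not Slang's actual prelude. When compiled for the device, the helper uses CUDA's `__ldg`, which reads through the read-only data cache. Anywhere else, it falls back to a plain dereference. `T` is assumed to be a type for which CUDA provides an `__ldg` overload, such as `int`, `float`, or `float4`.

```cpp
#include <cassert>

// Lets the same source build with nvcc (host + device) and with a plain C++ compiler.
#if defined(__CUDACC__)
#define SKETCH_HD __host__ __device__
#else
#define SKETCH_HD
#endif

// Hypothetical lowering of a load through an immutable pointer.
template <typename T>
SKETCH_HD inline T loadImmutable(const T* p)
{
#if defined(__CUDA_ARCH__)
    return __ldg(p);   // device pass: load via the read-only data cache
#else
    return *p;         // host pass / plain C++: ordinary load
#endif
}

// The pointee never changes during the program, so the compiler may merge
// repeated loads from the same address. Here a single load serves both uses.
SKETCH_HD inline float sumTwice(const float* data, int i)
{
    float v = loadImmutable(data + i);
    return v + v;
}
```

The guarantee that the data never changes is what makes both `__ldg` and load deduplication legal. A mutable pointer cannot offer that guarantee, because another thread or an earlier store could modify the value between two loads.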

Allow 1D SV_DispatchThreadID in CPU targets (#8612)

2025-10-08T23:13:27+00:00

The varying param legalization pass didn't handle this 1D form of SV_DispatchThreadID for CPU targets:

```slang
void computeMain(int i : SV_DispatchThreadID)
```

Instead, it overrode the type of `i` with `uint3`, which broke any code that used `i` as a scalar, such as in a `switch` statement. I ran across this while going through the `language-feature` tests for the LLVM target, which will also use this legalization pass. I'm submitting this separately now because it also fixes the existing CPU target: the test enabled in this PR previously generated broken code on CPU. (Somewhat related issue: #7468)
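The fix can be pictured with a small C++ sketch. It assumes that the CPU entry-point wrapper always receives the full 3D thread ID as a `uint3`. All names here (`uint3`, `computeMain_legalized`) are hypothetical and do not represent the code Slang emits. The wrapper keeps the user's declared scalar type and passes only the `x` component, so scalar uses such as `switch (i)` keep working.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the 3D dispatch thread ID seen by the CPU wrapper.
struct uint3 { uint32_t x, y, z; };

// User entry point that declares a 1D (scalar) dispatch thread ID.
int computeMain(int i)
{
    switch (i)            // scalar use that broke when `i` was retyped to uint3
    {
    case 0:  return 10;
    case 1:  return 20;
    default: return -1;
    }
}

// Legalized wrapper: keep the declared scalar type and forward only the
// x component of the 3D thread ID.
int computeMain_legalized(uint3 dispatchThreadID)
{
    return computeMain(static_cast<int>(dispatchThreadID.x));
}
```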

Enhance buffer load specialization pass to specialize past field extracts. (#8547)

2025-10-01T02:08:23+00:00

This allows us to specialize functions whose arguments are sub-elements of a constant buffer, rather than only entire buffer elements. Closes #8421.

This change also implements a proper heuristic to decide when to specialize calls and defer buffer loads. It addresses a pathological case exposed in `slangpy\slangpy\benchmarks\test_benchmark_tensor.py`, which used to take 27ms to finish and now takes 1.25ms.

For example, given:

```
struct Bottom
{
    float bigArray[1024];

    [mutating]
    void setVal(int index, float value) { bigArray[index] = value; }
}

struct Root
{
    Bottom top[2];

    [mutating]
    void setTopVal(int x, int y, float value) { top[x].setVal(y, value); }
}

RWStructuredBuffer<Root> sb;

[shader("compute")]
[numthreads(1, 1, 1)]
void compute_main(uint3 tid: SV_DispatchThreadID)
{
    sb[0].setTopVal(1, 2, 100.0f);
}
```

we are now able to specialize the call to `setTopVal` into:

```
void compute_main(uint3 tid: SV_DispatchThreadID)
{
    setTopVal_specialized(0, 1, 2, 100.0f);
}

void setTopVal_specialized(int sbIdx, int x, int y, float value)
{
    Bottom_setVal_specialized(sbIdx, x, y, value);
}

void Bottom_setVal_specialized(int sbIdx, int x, int y, float value)
{
    sb[sbIdx].top[x].bigArray[y] = value;
}
```

and get rid of all unnecessary loads. This requires combining function call specialization with the buffer-load-defer pass. The buffer-load-defer pass has been completely rewritten to be more correct and to avoid introducing redundant loads. This PR also adds tests to make sure that pointers, bindless handles, and loads from structured buffers or constant buffers work as expected.
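Why this matters for performance can be shown with a plain C++ sketch of the two lowerings. In the real pass, `sb` is a global shader resource. Here it is modeled as a pointer parameter, and the function names are hypothetical. The naive form copies the whole `Root` element (2 × 1024 floats, i.e. 8 KiB) out of the buffer and writes it back just to change one float. The specialized form passes the buffer index along and applies the access chain directly, so it touches only the target element.

```cpp
#include <cassert>

struct Bottom { float bigArray[1024]; };
struct Root   { Bottom top[2]; };   // 8 KiB per element

// Naive lowering of a [mutating] call on a buffer element:
// load the whole element, mutate the copy, store it back.
void setTopVal_naive(Root* sb, int sbIdx, int x, int y, float value)
{
    Root tmp = sb[sbIdx];            // full 8 KiB load
    tmp.top[x].bigArray[y] = value;  // the only real work
    sb[sbIdx] = tmp;                 // full 8 KiB store
}

// Specialized lowering: the buffer index travels with the call and the
// field/array access chain is applied directly to the buffer.
void setTopVal_specialized(Root* sb, int sbIdx, int x, int y, float value)
{
    sb[sbIdx].top[x].bigArray[y] = value;
}
```

Both versions produce the same buffer contents. Only the amount of memory traffic differs, which is the source of the speedup reported above.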

Add Optix Intrinsics Coverage (#8159) (#8310)

2025-09-03T05:16:49+00:00

Adds 29 intrinsics to the coverage list, exercised by a new test.

[CUDA] Fix incorrect `kIROp_RaytracingAccelerationStructureType` emitting logic (#8168)

2025-08-15T15:24:06+00:00

Fixes: #8167

The current emitting logic did not work; this PR corrects it. The provided test checks that the generated CUDA code is valid by compiling it to PTX.

`m_writer->emit("OptixTraversableHandle");` should be `out <<`, because `out` adds to the type-name cache. Otherwise, using the type twice produces bad type names, since the cache was filled with "" instead of the type name.
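The failure mode is easy to reproduce in a small C++ model. The `TypeNameEmitter` class and its members are hypothetical and are not Slang's emitter classes. The model assumes only what the description says: whatever text ends up in `out` is what gets cached for the type. If the name bypasses `out`, the first use still looks correct in the output, but the cache stores `""`, so every later use of the same type emits an empty name.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>

// Hypothetical model of a type-name emitter with a cache.
struct TypeNameEmitter
{
    std::map<int, std::string> cache;  // typeId -> cached emitted name
    std::ostringstream writer;         // stands in for writing directly via m_writer

    std::string getTypeName(int typeId, bool bypassOut)
    {
        auto it = cache.find(typeId);
        if (it != cache.end())
            return it->second;         // later uses reuse the cached text

        std::ostringstream out;
        if (bypassOut)
            writer << "OptixTraversableHandle";  // bug: text never reaches `out`
        else
            out << "OptixTraversableHandle";     // fix: text is captured in `out`

        cache[typeId] = out.str();     // "" in the buggy path
        return out.str();
    }
};
```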

Fix intrinsic LoadLocalRootTableConstant for optix (#7949)

2025-08-07T21:43:25+00:00

Because an older version of the spec was referenced, there was an inconsistency (spec change log: v1.29 2/20/2025 - [HitObject LoadLocalRootArgumentsConstant]).

Latest spec: https://microsoft.github.io/DirectX-Specs/d3d/Raytracing.html#hitobject-loadlocalroottableconstant

Refer to: OptiX backend support for Shader Execution Reordering (SER) features, as outlined in issue #6647.
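As we understand the DXR spec, `HitObject::LoadLocalRootTableConstant(uint RootConstantOffsetInBytes)` returns the 32-bit value stored at the given byte offset in the hit object's local root table. The C++ sketch below models only that behavior. It assumes the table is available as a raw byte pointer. How the OptiX backend actually obtains that pointer (for example, from shader binding table record data) is not shown and is an assumption, and the function name is hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical model: read a 32-bit root constant at a byte offset in the
// local root table. memcpy avoids alignment and strict-aliasing issues.
uint32_t loadLocalRootTableConstant(const unsigned char* localRootTable,
                                    uint32_t rootConstantOffsetInBytes)
{
    uint32_t value;
    std::memcpy(&value, localRootTable + rootConstantOffsetInBytes, sizeof(value));
    return value;
}
```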

Initial copy elision pass (#8042)

2025-08-07T07:22:22+00:00

Fixes #7574

Changes:
* Add an initial (fairly simple) optimization pass that eliminates redundant copies.
* Our existing optimizer passes already remove redundant loads and stores very robustly, so this pass focuses on other cases of copy elimination.
* The primary approach is to turn every `in T` parameter, where `T` is trivial to copy, into a `__constref T`. Then, depending on the scenario, we either manually insert a variable+load when pass-by-reference is not possible, or pass by `constref`.
* Add optimizations to eliminate redundant code that causes `constref` to fail to compile.

---------

Co-authored-by: Harsh Aggarwal
Co-authored-by: Claude
Co-authored-by: slangbot
Co-authored-by: slangbot <186143334+slangbot@users.noreply.github.com>
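A rough C++ analogy of the rewrite: `in T` behaves like pass-by-value, and `__constref T` behaves like pass-by-`const&`. The sketch below is an analogy only, and all function names are hypothetical. By-value parameters receive a distinct copy, while const-reference parameters alias the caller's object. When the argument is not addressable (here, the result of a function call), a local variable is materialized first and passed by reference. C++ could bind the temporary directly, but the explicit variable mirrors the "variable+load" step described above.

```cpp
#include <cassert>

struct Big { float data[64]; };  // trivially copyable payload

// `in Big` lowered as a value parameter: the callee gets its own copy.
bool byValueAliases(Big b, const Big* caller) { return &b == caller; }

// `__constref Big` analogue: no copy, the callee sees the caller's object.
bool byConstRefAliases(const Big& b, const Big* caller) { return &b == caller; }

Big makeBig(float v) { Big b{}; b.data[0] = v; return b; }
float readFirst(const Big& b) { return b.data[0]; }

// Argument is not addressable: materialize a variable, then pass by reference.
float caller(float v)
{
    Big tmp = makeBig(v);
    return readFirst(tmp);
}
```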

Fix 7441: CUDA boolean vector layout to use 1-byte elements (#7862)

2025-08-01T09:18:53+00:00

* Fix 7441: CUDA boolean vector layout to use 1-byte elements

  Boolean vectors (bool1, bool2, bool3, bool4) were incorrectly implemented as integer-based types using 4 bytes per element, instead of actual 1-byte boolean elements, on CUDA targets.

  Changes:
  - Update the CUDA prelude to define boolean vectors as structs with bool fields instead of typedef aliases to integer vectors
  - Implement CUDALayoutRulesImpl::GetVectorLayout to use 1-byte alignment for boolean vectors, matching actual CUDA memory layout behavior
  - Update make_bool functions to populate struct fields correctly

  This ensures boolean vectors have the same memory layout as the corresponding `bool[N]` arrays:
  - bool1: 1 byte (was 4 bytes)
  - bool2: 2 bytes (was 8 bytes)
  - bool3: 3 bytes (was 12 bytes)
  - bool4: 4 bytes (was 16 bytes)

  This fixes the memory layout mismatch between the Slang reflection API and actual CUDA compilation, and reduces the memory used by boolean vectors by 75%.
* Fix CI issues: add and update associated functions and operators
* Make boolX same as uchar
* Use align construct on struct for boolX
* Improve test case for robust alignment checks
* Formatting
* Disable selected slangpy tests
* Add Metal check, which is slightly different from CUDA
* Test-1
* Test-2
* Test-3
* Test-4
* ReflectionChange
* Cleanup and update
* `_slang_select` with plain bool is needed for reverse-loop-checkpoint-test
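The resulting sizes can be checked with a short C++ sketch. It assumes the "same as uchar" change mirrors CUDA's built-in `uchar1`..`uchar4` layout, where `uchar2` is 2-byte aligned, `uchar4` is 4-byte aligned, and `uchar1`/`uchar3` are byte-aligned. The `_sketch` type names and `make_bool4_sketch` are hypothetical and are not the actual prelude definitions. The sketch also assumes a 1-byte `bool`, which holds on mainstream compilers but is implementation-defined, so it is checked with a `static_assert`.

```cpp
#include <cassert>
#include <cstddef>

static_assert(sizeof(bool) == 1, "sketch assumes a 1-byte bool");

// Struct-of-bools vectors with uchar-like alignment (hypothetical names).
struct bool1_sketch { bool x; };
struct alignas(2) bool2_sketch { bool x, y; };
struct bool3_sketch { bool x, y, z; };
struct alignas(4) bool4_sketch { bool x, y, z, w; };

// make_bool-style constructor that fills the struct fields directly.
inline bool4_sketch make_bool4_sketch(bool x, bool y, bool z, bool w)
{
    return bool4_sketch{x, y, z, w};
}
```

Each element occupies one byte, so every boolN is a quarter of the size of the old 4-byte-per-element layout.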

Add new capdef for lss intrinsics (#7427)

2025-06-13T06:20:17+00:00

* Add new capdef for LSS intrinsics

  Fixes #7426. Raygen shader support is needed only for the HitObject APIs, so these APIs need a dedicated capability instead of a common one.
* Regenerate command line reference

---------

Co-authored-by: slangbot <186143334+slangbot@users.noreply.github.com>

Add optix support for coopvec (#7286)

2025-06-10T04:48:24+00:00

* WiP: Add coopvec support for OptiX
* Format code
* Fix minor issues
* Fix review comments

---------

Co-authored-by: slangbot <186143334+slangbot@users.noreply.github.com>