slang.git/prelude, branch master

Immutable access qualifier for pointers and use `__ldg` on cuda. (#8710)

2025-10-16T03:59:47+00:00

This PR implements `Access.Immutable` to allow pointers to immutable data. The new type `ImmutablePtr` is defined as an alias of `Ptr`. By forming an immutable pointer, the programmer tells the compiler that the data at the pointer address will never change during the execution of the current program. Loads from immutable pointers can therefore be deduplicated by the compiler, and translate to `__ldg` when generating code for CUDA.

The SPIRV backend is not changed in this PR, since the current SPIRV spec makes it very difficult to specify loads from an immutable address without generating many wrappers and boilerplate type declarations. We would like to see the spec evolve around its support of `NonWritable` physical storage pointers or immutable loads before we attempt to express such immutability in SPIRV. For now we simply emit ordinary pointers and loads when generating SPIRV.

---------

Co-authored-by: slangbot <186143334+slangbot@users.noreply.github.com>
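
As a rough illustration of what a read-only load looks like on the CUDA side, the sketch below wraps `__ldg` in a helper that also compiles as plain C++. The names `loadImmutable`, `sumTwice`, and `SKETCH_HOST_DEVICE` are hypothetical and are not taken from the Slang prelude; the code Slang actually emits is not shown in this commit message.

```cpp
#include <cassert>

// Hypothetical helper macro so the sketch builds with nvcc and with a host compiler.
#if defined(__CUDACC__)
#define SKETCH_HOST_DEVICE __host__ __device__
#else
#define SKETCH_HOST_DEVICE
#endif

// Hypothetical helper (not a prelude function): loads a float that is
// promised never to change while the program runs.
SKETCH_HOST_DEVICE inline float loadImmutable(const float* ptr)
{
#if defined(__CUDA_ARCH__)
    // Device code: load through the read-only data cache.
    return __ldg(ptr);
#else
    // Host code or a non-CUDA compiler: an ordinary load.
    assert(ptr != nullptr);
    return *ptr;
#endif
}

// Two loads from the same immutable address. Because the data cannot change,
// a compiler is free to fold them into a single load.
SKETCH_HOST_DEVICE inline float sumTwice(const float* ptr)
{
    return loadImmutable(ptr) + loadImmutable(ptr);
}
```

On the host, `sumTwice(&v)` with `v = 1.5f` simply reads `v` twice; on the device the same source goes through `__ldg`.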

Add support targeting older OptiX versions (#8700)

2025-10-14T15:21:03+00:00

Currently, the emitted CUDA code only compiles with the latest OptiX 9.0. This change allows code to be compiled with OptiX 8.0 and later by not emitting OptiX calls that are unavailable in the targeted version. In a later step we should add proper capabilities for the various OptiX versions.
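
One way such version gating can be expressed is with the `OPTIX_VERSION` macro from the OptiX headers, which encodes major * 10000 + minor * 100 + micro. The sketch below is an assumption about the general technique, not the prelude's actual code: `optixVersionNumber`, `canUseOptix9Calls`, and `selectedCodePath` are hypothetical names, and the macro is given a stand-in value so the example builds without the OptiX SDK.

```cpp
#include <cassert>

// Stand-in for the macro that <optix.h> normally provides, so this sketch
// compiles without the SDK. 80000 corresponds to OptiX 8.0.0.
#ifndef OPTIX_VERSION
#define OPTIX_VERSION 80000
#endif

// Builds a version number in the same encoding as OPTIX_VERSION.
constexpr int optixVersionNumber(int major, int minor, int micro = 0)
{
    return major * 10000 + minor * 100 + micro;
}

// Hypothetical feature switch: true when calls introduced in OptiX 9.0 may be emitted.
constexpr bool canUseOptix9Calls()
{
    return OPTIX_VERSION >= optixVersionNumber(9, 0);
}

// Hypothetical wrapper showing the preprocessor form of the same check.
inline int selectedCodePath()
{
#if OPTIX_VERSION >= 90000
    return 9; // a 9.0-only call would go here
#else
    return 8; // older SDK: the 9.0-only call is left out
#endif
}
```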

Improve texture loads and stores on CUDA (#8644)

2025-10-08T18:18:23+00:00

- fix handling of layer and mip level
- add support for 1D layered textures
- reduce code by using macros
- assert when trying to emit unsupported intrinsics

There is a new set of unit tests in slang-rhi for exhaustive testing of shader loads/stores on textures. These fixes allow most of these tests to be enabled.

Formatted loads/stores on surfaces are not supported in the PTX ISA, so they would require codegen for the conversion, which should be possible in theory but not as part of the CUDA prelude.

Enhance buffer load specialization pass to specialize past field extracts. (#8547)

2025-10-01T02:08:23+00:00

This allows us to specialize functions whose argument is a sub-element of a constant buffer, instead of being applicable only to an entire buffer element. Closes #8421.

This change also implements a proper heuristic to determine when to specialize the calls and defer the buffer loads. This PR addresses a pathological case exposed in `slangpy\slangpy\benchmarks\test_benchmark_tensor.py`, which used to take 27ms to finish, and now takes 1.25ms.

For example, given:

```
struct Bottom
{
    float bigArray[1024];
    [mutating] void setVal(int index, float value) { bigArray[index] = value; }
}
struct Root
{
    Bottom top[2];
    [mutating] void setTopVal(int x, int y, float value) { top[x].setVal(y, value); }
}
RWStructuredBuffer<Root> sb;
[shader("compute")]
[numthreads(1, 1, 1)]
void compute_main(uint3 tid: SV_DispatchThreadID)
{
    sb[0].setTopVal(1, 2, 100.0f);
}
```

We are now able to specialize the call to `setTopVal` into:

```
void compute_main(uint3 tid: SV_DispatchThreadID)
{
    setTopVal_specialized(0, 1, 2, 100.0f);
}
void setTopVal_specialized(int sbIdx, int x, int y, float value)
{
    Bottom_setVal_specialized(sbIdx, x, y, value);
}
void Bottom_setVal_specialized(int sbIdx, int x, int y, float value)
{
    sb[sbIdx].top[x].bigArray[y] = value;
}
```

This gets rid of all unnecessary loads. Achieving this requires a combination of function call specialization and the buffer-load-defer pass. The buffer-load-defer pass has been completely rewritten to be more correct and to avoid introducing redundant loads.

This PR also adds tests to make sure pointers, bindless handles, and loads from structured buffers or constant buffers work as expected.

Always define OptixTraversableHandle (#8411)

2025-09-09T15:55:27+00:00

This fixes an issue where non-raytracing kernels couldn't contain any RaytracingAccelerationStructure resources even when not used.
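
A minimal sketch of the idea, assuming the handle type is declared unconditionally when the OptiX headers are not included. `HYPOTHETICAL_ENABLE_OPTIX` and `KernelParams` are illustrative names and may not match what the prelude uses; the `typedef` mirrors the underlying type the OptiX headers give the handle.

```cpp
#include <cassert>

// HYPOTHETICAL_ENABLE_OPTIX is an illustrative switch, not necessarily the
// macro the prelude checks.
#ifdef HYPOTHETICAL_ENABLE_OPTIX
#include <optix.h>
#else
// Fallback declaration so the type name exists even without OptiX.
typedef unsigned long long OptixTraversableHandle;
#endif

// Hypothetical parameter block of a non-raytracing kernel. It carries an
// acceleration structure handle that the kernel never reads.
struct KernelParams
{
    OptixTraversableHandle scene; // present in the layout, unused by the kernel
    int elementCount;
};
```

With the fallback in place, a struct like `KernelParams` compiles whether or not raytracing support is enabled.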

Enable CUDA support for additional HLSL intrinsic tests (#8293)

2025-09-04T05:28:02+00:00

Enable CUDA support for additional HLSL intrinsic tests by implementing missing functionality and fixing compiler bugs affecting CUDA targets.

- Fix critical bug in InterlockedCompareStore64 where division used /4 instead of /8 for 64-bit types, causing incorrect memory addressing for all signed int64_t atomics
- Add signed int64_t atomic wrappers (atomicExch, atomicCAS) to the CUDA prelude that properly cast to/from unsigned types as required by CUDA's atomic API
- Enable tests: atomic-intrinsics-64bit.slang
- Implement CUDA support for QuadAny and QuadAll operations using warp shuffle primitives (__shfl_sync with quad-level lane masking)
- Add CUDA to quad_control capability definition in slang-capabilities.capdef
- Add _slang_quadAny/_slang_quadAll helper functions to the CUDA prelude
- Enable tests: quad-control-comp-functionality.slang, subgroup-quad.slang

---------

Co-authored-by: szihs <675653+szihs@users.noreply.github.com>
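
Two of the items above can be modelled on the host. The sketch below is not the prelude code: `elementIndex64` and `quadAnyModel` are hypothetical names, and the butterfly exchange is only one plausible way to combine values within a quad, simulated here with array indexing instead of `__shfl_sync`.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Converts a byte offset into an element index for a buffer of 64-bit values.
// Dividing by 4 (the size of a 32-bit element) would give an index twice too large.
inline std::size_t elementIndex64(std::size_t byteOffset)
{
    assert(byteOffset % sizeof(std::uint64_t) == 0);
    return byteOffset / sizeof(std::uint64_t);
}

constexpr int kWarpSize = 32;
using WarpBools = std::array<bool, kWarpSize>;

// Host model of a quad-wide "any": each lane ends up with the OR of the
// predicates of the four lanes in its quad (lanes 4k .. 4k+3).
inline WarpBools quadAnyModel(const WarpBools& predicate)
{
    WarpBools current = predicate;
    // Exchange with the lane whose index differs in bit 0, then in bit 1.
    for (int delta = 1; delta <= 2; delta *= 2)
    {
        WarpBools next = current;
        for (int lane = 0; lane < kWarpSize; ++lane)
            next[lane] = current[lane] || current[lane ^ delta];
        current = next;
    }
    return current;
}
```

A quad-wide "all" follows the same pattern with `&&` in place of `||`. For example, setting only lane 5 makes `quadAnyModel` return true for lanes 4 through 7 and false elsewhere.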

Updated support to enable batch3 (#8219)

2025-08-20T09:11:06+00:00

Enable CUDA support for batch 3 tests

- Enhanced wave operations with exclusive support
- Added proper identity values for min/max operations
- Fixed intrinsic name mapping issues
- Updated test configurations

Co-authored-by: Ellie Hermaszewska
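
To see why an exclusive operation needs an identity value, consider a host-side model of an exclusive prefix minimum across lanes. The function name `exclusivePrefixMin` is hypothetical and the loop only models the result; it is not how the prelude computes it on the GPU.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Host model of an exclusive prefix min: lane i receives the minimum of the
// values held by lanes 0 .. i-1. Lane 0 has no predecessors, so it receives
// the identity of min, which for int is the largest representable value.
inline std::vector<int> exclusivePrefixMin(const std::vector<int>& laneValues)
{
    std::vector<int> result(laneValues.size());
    int running = std::numeric_limits<int>::max(); // identity of min
    for (std::size_t lane = 0; lane < laneValues.size(); ++lane)
    {
        result[lane] = running; // excludes this lane's own value
        if (laneValues[lane] < running)
            running = laneValues[lane];
    }
    return result;
}
```

For max, the identity would be the smallest representable value instead. Using 0 as the identity would give wrong results for inputs that are all positive (min) or all negative (max).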

Enable CUDA testing for batch 2 (#8147)

2025-08-12T18:19:15+00:00

Enable CUDA for the tests listed in issue #8078. This requires a minor CUDA prelude change, adding some math functions.

Fix intrinsic LoadLocalRootTableConstant for optix (#7949)

2025-08-07T21:43:25+00:00

An older version of the spec was referenced, which led to an inconsistency.

v1.29 2/20/2025 - [HitObject LoadLocalRootArgumentsConstant]

Latest spec: https://microsoft.github.io/DirectX-Specs/d3d/Raytracing.html#hitobject-loadlocalroottableconstant

Refer: OptiX backend support for Shader Execution Reordering (SER) features as outlined in issue #6647.

Initial copy elision pass (#8042)

2025-08-07T07:22:22+00:00

Fixes #7574

Changes:
* Add an initial (fairly simple) optimization pass which is able to eliminate redundant copies.
* Our existing optimizer passes already remove redundant loads/stores very robustly, so this pass focuses on other cases of copy elimination.
* The primary approach is to turn every function parameter that is `in T`, where `T` is trivial to copy, into a `__constref T`. We then (depending on the scenario) manually insert a variable+load if pass-by-reference is not possible; otherwise we pass by `constref`.
* Added optimizations to eliminate redundant code which causes `constref` to fail to compile.

---------

Co-authored-by: Harsh Aggarwal
Co-authored-by: Claude
Co-authored-by: slangbot
Co-authored-by: slangbot <186143334+slangbot@users.noreply.github.com>
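
The effect of turning an `in T` parameter into a `__constref T` can be illustrated with a C++ analogy: passing by value copies the argument, while passing by const reference does not. This is only an analogy in C++, not the Slang IR pass itself, and `Payload`, `firstByValue`, and `firstByConstRef` are hypothetical names.

```cpp
#include <cassert>

// A large value type that counts how often it is copied.
struct Payload
{
    float data[256];
    static int copyCount;

    Payload() : data{} {}
    Payload(const Payload& other)
    {
        for (int i = 0; i < 256; ++i)
            data[i] = other.data[i];
        ++copyCount;
    }
};
int Payload::copyCount = 0;

// By-value parameter: the caller's argument is copied into `p`.
inline float firstByValue(Payload p)
{
    return p.data[0];
}

// Const-reference parameter: the function reads the caller's object directly.
inline float firstByConstRef(const Payload& p)
{
    return p.data[0];
}
```

Calling `firstByValue` on an existing `Payload` increments `copyCount`, while `firstByConstRef` leaves it unchanged and returns the same value.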