Swap the term StdLib with Core-Module or Standard-Module in documents (#5414)

This PR is limited to documents. All use of "Standard library" or "StdLib" are replaced with either "core module" or "standard modules", depending on the context.
author: Jay Kwak <82421531+jkwak-work@users.noreply.github.com> 2024-10-25 21:12:37 -0700
committer: GitHub <noreply@github.com> 2024-10-25 21:12:37 -0700
commit: a508b264eda4bc3c99ba1f44eab1dec6e5ce06c0 (patch)
tree: 717722aefcae6b2a5adbccfbcd8aece4ed81f0b7 /docs/cuda-target.md
parent: 49c691e86862d092cd389a02beb4003ee59a4417 (diff)
1 files changed, 1 insertions, 1 deletions
diff --git a/docs/cuda-target.md b/docs/cuda-target.md
index adb56d922..c59703259 100644
--- a/docs/cuda-target.md
+++ b/docs/cuda-target.md
@@ -267,7 +267,7 @@ There is broad support for [HLSL Wave intrinsics](https://docs.microsoft.com/en-
 
 Most Wave intrinsics will work with vector, matrix or scalar types of typical built in types - uint, int, float, double, uint64_t, int64_t.
 
-The support is provided via both the Slang stdlib as well as the Slang CUDA prelude found in 'prelude/slang-cuda-prelude.h'. Many Wave intrinsics are not directly applicable within CUDA which supplies a more low level mechanisms. The implementation of most Wave functions work most optimally if a 'Wave' where all lanes are used. If all lanes from index 0 to pow2(n) -1  are used (which is also true if all lanes are used) a binary reduction is typically applied. If this is not the case the implementation fallsback on a slow path which is linear in the number of active lanes, and so is typically significantly less performant.
+The support is provided via both the Slang core module as well as the Slang CUDA prelude found in 'prelude/slang-cuda-prelude.h'. Many Wave intrinsics are not directly applicable within CUDA which supplies a more low level mechanisms. The implementation of most Wave functions work most optimally if a 'Wave' where all lanes are used. If all lanes from index 0 to pow2(n) -1  are used (which is also true if all lanes are used) a binary reduction is typically applied. If this is not the case the implementation fallsback on a slow path which is linear in the number of active lanes, and so is typically significantly less performant.
 
 For more a more concrete example take
author	Jay Kwak <82421531+jkwak-work@users.noreply.github.com>	2024-10-25 21:12:37 -0700
committer	GitHub <noreply@github.com>	2024-10-25 21:12:37 -0700
commit	a508b264eda4bc3c99ba1f44eab1dec6e5ce06c0 (patch)
tree	717722aefcae6b2a5adbccfbcd8aece4ed81f0b7 /docs/cuda-target.md
parent	49c691e86862d092cd389a02beb4003ee59a4417 (diff)