From 1241006b6d89cae09766ca9795187ef9c0dd2085 Mon Sep 17 00:00:00 2001
From: ArielG-NV <159081215+ArielG-NV@users.noreply.github.com>
Date: Mon, 26 Feb 2024 19:09:09 -0500
Subject: Partially implement shader_subgroup extension(s); Partially resolves
 #3548 (#3580)

* Partially implement, with tests, the functions and built-in variables that are part of GL_KHR_shader_subgroup; partially resolves #3548

GL_KHR_shader_subgroup implemented based on https://github.com/KhronosGroup/GLSL/blob/main/extensions/khr/GL_KHR_shader_subgroup.txt

The implementation is broken down into separate GLSL extensions because of the ***large differences*** between sections in implementation, functionality, and testing.

GL_KHR_shader_subgroup_basic{
**Partially implemented**

Implementation:
    * All 9 built-in variables are stubbed and do not yet return proper values; these system variables still need a real implementation; related to #411.

    * Functions were reimplemented, even though HLSL has nearly identical functions, because:
        * the hlsl.meta implementations target workgroups rather than a warp/wave/subgroup:
            * `__syncwarp` vs `__syncthreads`
            * `SubgroupMemory` vs `WorkgroupMemory`
            * etc.
        * the hlsl.meta implementations block on broader SPIR-V memory targets:
            * `ImageMemory|UniformMemory` combined, versus SPIR-V barriers specified for ImageMemory and, separately, for UniformMemory
        * `subgroupElect` for CUDA is implemented differently from `WaveIsFirstLane`: the spec states that `subgroupElect()` returns true only for the active invocation with the lowest gl_SubgroupInvocationID, so the current active mask must be fetched to account for invocations turned off by branches

Testing:
tests for the variables -- `tests/glsl/shader-subgroup-built-in-variables.slang`
    * these tests do not check functionality, since the variables are not implemented yet

tests for the functions -- `tests/glsl/shader-subgroup-basic.slang`
    * concurrency is tested for SubgroupMemory and UniformMemory by attempting to create a GPU-side race condition between memory writes and reads
        * the available testing tools do not allow tests for ImageMemory
    * subgroupElect is tested to elect invocation #0, the lowest invocation and one that always runs; with a wave size of 32, #0 is always active and is therefore always the elected invocation.

}
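
The election rule described above can be modelled on the host. This is a minimal sketch, not the code added to glsl.meta: `lowestActiveLane` and `electModel` are hypothetical names, and a 32-bit active mask (one bit per lane) is assumed.

```cpp
#include <cassert>
#include <cstdint>

// Index of the lowest set bit in a non-zero 32-bit active mask.
inline uint32_t lowestActiveLane(uint32_t activeMask) {
    assert(activeMask != 0u);  // at least one lane must be active
    uint32_t lane = 0u;
    while (((activeMask >> lane) & 1u) == 0u) {
        ++lane;
    }
    return lane;
}

// Host-side model of subgroupElect(): true only for the active lane
// with the lowest invocation ID. Inactive lanes are never elected.
inline bool electModel(uint32_t activeMask, uint32_t laneId) {
    assert(laneId < 32u);
    const bool isActive = ((activeMask >> laneId) & 1u) != 0u;
    return isActive && laneId == lowestActiveLane(activeMask);
}
```

With all 32 lanes active, lane 0 is elected, which is the case the test relies on. If a branch leaves only lanes 3 and 4 active (mask `0x18`), lane 3 is elected instead.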

GL_KHR_shader_subgroup_vote{
**Fully implemented**

Implementation:
    * all 3 functions use the hlsl.meta implementation

Testing:
`tests/glsl/shader-subgroup-vote.slang`
    * each function is tested with a positive (returns true) and a negative (returns false) case to ensure vote results are correct

}

GL_KHR_shader_subgroup_ballot{
**Partially implemented**

Implementation:
    All 10 functions are implemented:
    * 3 use the hlsl.meta implementation
    * 7 use new implementations -- these only support GLSL, SPIR-V, HLSL, CUDA
        * These implementations did not exist in hlsl.meta, so they were added
        * `subgroupInverseBallot` has no analogous function to call, so it is emulated:
            * in CUDA, the ballot mask is 32 bits wide and lanes are 0-indexed, which implies that `(ballotResult >> YOUR_INVOCATION) & 1` checks whether your invocation's bit is set. For example, a ballot of `0b11001` has bits 0, 3, and 4 set, meaning only the first, fourth, and fifth invocations are active; `YOUR_INVOCATION == 3` is the fourth invocation in the subgroup. `(0b11001 >> 3) & 1` evaluates to `0b11 & 0b1`, which is 1, so it returns true because your bit is set
            * in HLSL, if the wave lane count is 32 or less, use the same logic as CUDA; otherwise find which component of the ballot vector `YOUR_INVOCATION` falls in (each component holds 32 bits, i.e. 32 lanes), avoiding division in the process, then run the same algorithm CUDA employs on that component.
            * `subgroupBallotBitExtract` is logically the same as `subgroupInverseBallot`
        * 5 functions do not have CUDA, HLSL, or CPP implementations yet (subgroupBallotFindMSB, subgroupBallotFindLSB, subgroupBallotExclusiveBitCount, subgroupBallotInclusiveBitCount, subgroupBallotBitCount) because they are out of scope for this commit

Testing:
`tests/glsl/shader-subgroup-ballot.slang`
    * the test checks each ballot function against an expected value; it passes inputs with more than 32 bits set to ensure the implementation correctly handles bits only up to the subgroup invocation count, as the extension specification requires (otherwise the functionality is fairly trivial to test)

}
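
The `subgroupInverseBallot` emulation described above comes down to a shift and a mask. The following host-side sketch shows both the 32-lane case and the wider case; the function names are hypothetical, and the ballot is assumed to be stored as four 32-bit words (matching a `uvec4`), with `laneId >> 5` and `laneId & 31` standing in for division and modulo by 32.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// 32-lane case: test this lane's bit in a 32-bit ballot.
inline bool inverseBallot32(uint32_t ballot, uint32_t laneId) {
    assert(laneId < 32u);
    return ((ballot >> laneId) & 1u) != 0u;
}

// Up to 128 lanes: pick the 32-bit word holding the lane, then test the bit.
inline bool inverseBallot128(const std::array<uint32_t, 4>& ballot,
                             uint32_t laneId) {
    assert(laneId < 128u);
    const uint32_t word = laneId >> 5;   // laneId / 32 without a division
    const uint32_t bit = laneId & 31u;   // laneId % 32 without a modulo
    return ((ballot[word] >> bit) & 1u) != 0u;
}
```

For the ballot `0b11001` (`0x19`), lanes 0, 3, and 4 see true and lanes 1 and 2 see false. In the wide form, lane 33 maps to bit 1 of word 1.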

GL_KHR_shader_subgroup_arithmetic{
**Partially implemented**

Implementation:
    * There are 21 functions to implement:
        * 14 functions are using the hlsl.meta implementation
        * 7 functions are new implementations -- only implemented for GLSL and SPIR-V
            * GLSL & SPIR-V both use their related functions, no emulation required
            * CUDA, CPP, HLSL are out of scope for the commit

Testing:
`tests/glsl/shader-subgroup-arithmetic.slang`
    * all tests silently kill the shader; the emitted GLSL was checked and no issue could be found
    * these tests only check basic functionality and correctness of the implemented functions; they are not exhaustive [continued in the notes at the end of this commit]

}
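
For readers unfamiliar with the scan variants among the arithmetic functions, here is a host-side sketch of how an inclusive and an exclusive add differ. It is an illustration only: the function names are hypothetical, every lane is assumed active, and lanes are modelled as vector indices.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Model of an inclusive add scan: lane i receives values[0] + ... + values[i].
inline std::vector<int> inclusiveAddModel(const std::vector<int>& values) {
    std::vector<int> out(values.size());
    int running = 0;
    for (std::size_t lane = 0; lane < values.size(); ++lane) {
        running += values[lane];
        out[lane] = running;
    }
    return out;
}

// Model of an exclusive add scan: lane i receives values[0] + ... + values[i-1];
// lane 0 receives 0, the identity for addition.
inline std::vector<int> exclusiveAddModel(const std::vector<int>& values) {
    std::vector<int> out(values.size());
    int running = 0;
    for (std::size_t lane = 0; lane < values.size(); ++lane) {
        out[lane] = running;
        running += values[lane];
    }
    return out;
}
```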

GL_KHR_shader_subgroup_shuffle{
**Partially implemented**

Implementation:
    * There are 2 functions to implement:
        * 1 function uses the existing hlsl.meta implementation
        * 1 function uses a new implementation (subgroupShuffleXor) -- only implemented for GLSL & SPIR-V
            * GLSL & SPIR-V both use their related functions, no emulation required

Testing:
`tests/glsl/shader-subgroup-shuffle.slang`
    * these tests only check basic functionality and correctness of the implemented functions; they are not exhaustive [continued in the notes at the end of this commit]
    * tests fail with cpp due to `kIROp_WaveGetActiveMask` failing to be called

}
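
As a reference for what `subgroupShuffleXor` is expected to return, here is a host-side sketch. `shuffleXorModel` is a hypothetical name, all lanes are assumed active, and the model rejects out-of-range source lanes with an assert (the extension leaves that case undefined).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Each lane reads the value held by lane (laneId ^ mask).
inline std::vector<int> shuffleXorModel(const std::vector<int>& values,
                                        uint32_t mask) {
    std::vector<int> out(values.size());
    for (std::size_t lane = 0; lane < values.size(); ++lane) {
        const std::size_t src = lane ^ static_cast<std::size_t>(mask);
        assert(src < values.size());  // model choice: reject undefined reads
        out[lane] = values[src];
    }
    return out;
}
```

A mask of 1 swaps neighbouring lanes; a mask of 2 swaps pairs of lanes.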

GL_KHR_shader_subgroup_shuffle_relative{
**Partially implemented**

Implementation:
    * There are 2 functions to implement:
        * both functions use a new implementation -- only implemented for GLSL & SPIR-V
            * GLSL & SPIR-V both use their related functions, no emulation required

Testing:
`tests/glsl/shader-subgroup-shuffle-relative.slang`
    * these tests only check basic functionality and correctness of the implemented functions; they are not exhaustive [continued in the notes at the end of this commit]

}

GL_KHR_shader_subgroup_clustered{
**Partially implemented**

Implementation:
    * There are 7 functions to implement:
        * all 7 functions use a new implementation -- only implemented for GLSL & SPIR-V
            * GLSL & SPIR-V both use their related functions, no emulation required

Testing:
`tests/glsl/shader-subgroup-shuffle-clustered.slang`
    * these tests only check basic functionality and correctness of the implemented functions; they are not exhaustive [continued in the notes at the end of this commit]

}
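
A clustered operation reduces within consecutive groups of lanes rather than across the whole subgroup. The sketch below models a clustered add on the host; `clusteredAddModel` is a hypothetical name, all lanes are assumed active, and the cluster size is assumed to be a power of two that divides the lane count.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Every lane receives the sum over its own cluster of `clusterSize` lanes.
inline std::vector<int> clusteredAddModel(const std::vector<int>& values,
                                          std::size_t clusterSize) {
    // The cluster size must be a non-zero power of two.
    assert(clusterSize != 0 && (clusterSize & (clusterSize - 1)) == 0);
    assert(values.size() % clusterSize == 0);
    std::vector<int> out(values.size());
    for (std::size_t base = 0; base < values.size(); base += clusterSize) {
        int sum = 0;
        for (std::size_t i = 0; i < clusterSize; ++i) {
            sum += values[base + i];
        }
        for (std::size_t i = 0; i < clusterSize; ++i) {
            out[base + i] = sum;
        }
    }
    return out;
}
```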

GL_KHR_shader_subgroup_quad{
**Partially implemented**

Implementation:
    * There are 4 functions to implement:
        * all 4 functions use hlsl.meta implementations -- only implemented for GLSL, SPIR-V, and HLSL

Testing:
`tests/glsl/shader-subgroup-shuffle-quad.slang`
    * these tests only check basic functionality and correctness of the implemented functions; they are not exhaustive [continued in the notes at the end of this commit]

}
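
The four quad functions can be summarised with a small host-side model. This is a sketch under assumptions: the names are hypothetical, quads are taken to be groups of four consecutive lanes, and all lanes are active.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// The XOR distance that each swap applies within a quad.
enum class QuadSwap : std::size_t { Horizontal = 1, Vertical = 2, Diagonal = 3 };

// Each lane reads the value of the lane at (laneId ^ distance) in its quad.
inline std::vector<int> quadSwapModel(const std::vector<int>& values,
                                      QuadSwap op) {
    assert(values.size() % 4 == 0);
    std::vector<int> out(values.size());
    for (std::size_t lane = 0; lane < values.size(); ++lane) {
        out[lane] = values[lane ^ static_cast<std::size_t>(op)];
    }
    return out;
}

// Each lane reads the value of lane `id` (0..3) of its own quad.
inline std::vector<int> quadBroadcastModel(const std::vector<int>& values,
                                           std::size_t id) {
    assert(values.size() % 4 == 0);
    assert(id < 4);
    std::vector<int> out(values.size());
    for (std::size_t lane = 0; lane < values.size(); ++lane) {
        const std::size_t quadBase = lane & ~static_cast<std::size_t>(3);
        out[lane] = values[quadBase | id];
    }
    return out;
}
```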

---------
Failing tests and why:

Note: because system variables are largely unimplemented for CUDA and CPP, these tests will fail (#3 and #4){
    tests/glsl/shader-subgroup-arithmetic.slang.3
    tests/glsl/shader-subgroup-arithmetic.slang.4
    tests/glsl/shader-subgroup-ballot.slang.4
    tests/glsl/shader-subgroup-basic.slang.3
    tests/glsl/shader-subgroup-basic.slang.4
    tests/glsl/shader-subgroup-quad.slang.3
    tests/glsl/shader-subgroup-quad.slang.4
    tests/glsl/shader-subgroup-vote.slang.3
    tests/glsl/shader-subgroup-vote.slang.4
}

Note: due to kIROp_WaveGetActiveMask not being loaded for cpp the following test will fail{
    tests/glsl/shader-subgroup-shuffle.slang.4
}

Note: due to an unknown silent error the following will fail [could not spot an error in the generated GLSL and SPIR-V]{
    tests/glsl/shader-subgroup-arithmetic.slang.5 (vk)
    tests/glsl/shader-subgroup-arithmetic.slang.6 (vk)
}

Other notes worth mentioning:{

    * only a few types are currently checked in the tests because the equality templates do not allow free casting to int/uint, so testing types en masse is not trivial; these tests will most likely be completely replaced once templates can cast and check equality more freely.

    * vector types are not implemented for any function that may use them (mostly in reference to SPIR-V, since many of its instructions accept scalar or vector inputs); this applies to subgroup-shuffle, subgroup-shuffle-relative, subgroup-arithmetic, subgroup_clustered, subgroup_quad

    * checks for half floats are not implemented

    * CUDA, CPP, and HLSL implementations were largely out of scope; where one is missing, it is because the implementation is not trivial

}

Random fixes encountered:{
    * hlsl.meta incorrectly sets `OpCapability` as `GroupNonUniformBallot` when the `OpCapability` should be `GroupNonUniformVote`; this is as per SPIR-V spec for all SPIR-V calls used in `GL_KHR_shader_subgroup_vote`: https://registry.khronos.org/SPIR-V/specs/unified1/SPIRV.html#OpGroupNonUniformAll
}

* added vector types and tests;

GL_KHR_shader_subgroup_* & GLSL ref:
    * https://github.com/KhronosGroup/GLSL/blob/main/extensions/khr/GL_KHR_shader_subgroup.txt
    * https://www.khronos.org/blog/vulkan-subgroup-tutorial
    * https://www.khronos.org/assets/uploads/developers/library/2018-vulkan-devday/06-subgroups.pdf

HLSL ref:
    * https://learn.microsoft.com/en-us/windows/win32/direct3dhlsl/dx-graphics-hlsl-intrinsic-functions
    * https://github.com/Microsoft/DirectXShaderCompiler/wiki/Wave-Intrinsics

CUDA ref:
    * https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html

SPIR-V ref:
    * https://registry.khronos.org/SPIR-V/specs/unified1/SPIRV.html#_memory_semantics_id

---------
Failing tests and why:

Note: test numbers are assuming none of the existing tests are toggled off

Note: because system variables are largely unimplemented for CUDA and CPP, these tests will fail (#3 and #4){
    tests/glsl/shader-subgroup-arithmetic.slang.3
    tests/glsl/shader-subgroup-arithmetic.slang.4
    tests/glsl/shader-subgroup-ballot.slang.4
    tests/glsl/shader-subgroup-basic.slang.3
    tests/glsl/shader-subgroup-basic.slang.4
    tests/glsl/shader-subgroup-quad.slang.3
    tests/glsl/shader-subgroup-quad.slang.4
    tests/glsl/shader-subgroup-vote.slang.3
    tests/glsl/shader-subgroup-vote.slang.4
}

Note: due to kIROp_WaveGetActiveMask not being loaded for cpp the following test will fail{
    tests/glsl/shader-subgroup-shuffle.slang.4
    tests/glsl/shader-subgroup-shuffle-relative.slang.4
    tests/glsl/shader-subgroup-basic.slang.4
}

Note: due to an unknown silent error the following will fail [could not spot an error in the generated GLSL and SPIR-V]{
    tests/glsl/shader-subgroup-arithmetic.slang.5 (vk)
    tests/glsl/shader-subgroup-arithmetic.slang.6 (vk)
}

Other notes worth mentioning:{

    * only a few types are currently checked in the arithmetic test, because the test fails silently, so nothing implemented can actually be tested

    * did not implement checks for half floats

    * CUDA, CPP, HLSL implementations were largely out of scope and not implemented, because the implementation is non-trivial for many functions

}

Random fixes encountered:{
    * hlsl.meta incorrectly sets `OpCapability` as `GroupNonUniformBallot` when the `OpCapability` should be `GroupNonUniformVote`; this is as per SPIR-V spec for all SPIR-V calls used in `GL_KHR_shader_subgroup_vote`: https://registry.khronos.org/SPIR-V/specs/unified1/SPIRV.html#OpGroupNonUniformAll
}

* Partially implement, with tests, the functions and built-in variables that are part of GL_KHR_shader_subgroup; partially resolves #3548

---------
Failing tests and why:

Note: test numbers are assuming none of the existing tests are toggled off

Note: because system variables are largely unimplemented for CUDA and CPP, these tests will fail (#3 and #4){
    tests/glsl/shader-subgroup-arithmetic.slang.3
    tests/glsl/shader-subgroup-arithmetic.slang.4
    tests/glsl/shader-subgroup-ballot.slang.4
    tests/glsl/shader-subgroup-basic.slang.3
    tests/glsl/shader-subgroup-basic.slang.4
    tests/glsl/shader-subgroup-quad.slang.3
    tests/glsl/shader-subgroup-quad.slang.4
    tests/glsl/shader-subgroup-vote.slang.3
    tests/glsl/shader-subgroup-vote.slang.4
}

Note: due to kIROp_WaveGetActiveMask not being loaded for cpp the following test will fail{
    tests/glsl/shader-subgroup-shuffle.slang.4
    tests/glsl/shader-subgroup-shuffle-relative.slang.4
    tests/glsl/shader-subgroup-basic.slang.4
}

Other notes worth mentioning:{

    * added preamble function and macros for implementing subgroup functionality (and tests) to make it possible to iterate on the functionality with reasonable effort in the future

    * CUDA, CPP, HLSL implementations were largely out of scope and not implemented, because the implementation is non-trivial for many functions

    * doubles cause a silent crash on most subgroup functions tested (silent shader hang)

    * __requireGLSLExtension does not work as intended inside glsl.meta; as a result half, int16, int64, and int8 are all omitted from testing

}

Random fixes encountered:{
    * hlsl.meta incorrectly sets `OpCapability` as `GroupNonUniformBallot` when the `OpCapability` should be `GroupNonUniformVote`; this is as per SPIR-V spec for all SPIR-V calls used in `GL_KHR_shader_subgroup_vote`: https://registry.khronos.org/SPIR-V/specs/unified1/SPIRV.html#OpGroupNonUniformAll
    * hlsl.meta incorrectly uses OpGroupNonUniformBitwiseAnd instead of OpGroupNonUniformBitwiseOr for WaveMaskPrefixBitOr (SPIR-V); this was fixed
}

* redesign tests, following suggestions that they should be smaller, more maintainable, and test as much data as reasonably possible (balanced against fast iterations);

optional double testing

varying parameter testing

most tests chain results now

* fix missing impl and merge conflict resolutions

* redundant test code cleanup and organization

move tests to proper location (glsl-intrinsic)

clean up redundant code (input buffers)

* add missing logical operands support (and remove hlsl/cuda code reuse due to the functional differences) under all And, Or, Xor ops

redesign tests to conform to a better testing paradigm

* testing code style change to not use white space as a toggle for tests

* provided crash reason for doubles (Intel Iris GPUs crash in GLSL with doubles due to missing support in device caps [as per the Vulkan validation layer])

uncommented the `__requireGLSLExtension` code so that, once it is fixed, int16/8/64/half will work with subgroup functions without requiring future intervention

* fixing some vk validation layer errors (OpMemoryBarrier, Shuffle operations)

modified style of tests; removed redundancy (extra code that does nothing); fixed some incorrect run targets; added error reasons for all encountered problems (and if needed, a #define/#if toggle)

* stop commenting out important tests; gate them with a #define for the broken feature of extended shader_subgroup types instead

* removed macros inside glsl.meta

removed erroneous __target_switch to directly call hlsl.meta function

added elaboration on the problem with __requireGLSLExtension

changed WaveMaskPrefixBit[or|and|xor] to support the expected type of <int> only as per `HLSL Shader Model 6.5` specs

removed "precision highp" since it does not affect tests

* changes some hlsl.meta functions used to be more appropriate (as per suggested)
WaveMask -> WaveActive.*
WaveMaskPrefix.* -> WavePrefix.*

remove __target_switch cases for unimplemented cases of intrinsics

fix _getLaneId() being removed from some regex used earlier

* fix usage of __target_intrinsic instead of __intrinsic_asm; it would silently cause only the arguments to be emitted as the return value

changed usage of `__requireGLSLExtension` because now it causes a crash from the missing intrinsic (instead of a silent error)

* fix shader subgroup extended types support for GLSL and SPIR-V:
1. separate the intrinsic/__requireGLSL generating functionality of shader_subgroup_preamble into child function calls, because otherwise `__requireGLSLExtension` is ignored if the function calling shader_subgroup_preamble calls an `__intrinsic_asm`
2. fixed HLSL.meta logic for wave operations (Add, Mul, exclusiveAdd, exclusiveMul) to no longer cast the input type T into a uint, due to cost-of-op & crash.
    * An int8_t bit-cast into a uint32_t crashed the compiler. As per the SPIR-V spec, OpGroupNonUniformI.* work on uint and int types, meaning the function has no need to cast to a uint.
3. removed erroneous __target_switch for subgroupShuffle

* 1. ignore tests gracefully
2. remove un-needed SPIRV capability specifying (with OpCapability)
3. clean up structure of typeRequireChecks_shader_subgroup_GLSL
4. explain why HLSL/CUDA are not targeted for shader-subgroup-arithmetic.slang

* syntax changes + `property` declaration fix + builtin var glsl implementation + changed incorrect HLSL.meta assumptions

(#1)`property` declaration as *non member* implementation change/fix (all of the changes to `slang-lower-to-ir.cpp`)

using (#1), implemented subgroup builtins for GLSL/SPIR-V; did not implement builtins completely for HLSL/CUDA due to non-trivial implementations. CPP has no implementation due to missing support of system values

changed some incorrect HLSL.meta subgroup implementation assumptions of type usage (bit casting 8bit->32bit, wrong capabilities causing errors)

fixed a crash when dumping the AST with SPIR-V while using builtins by adding the `builtin` spirv case (all of the changes to `slang-ast-dump.cpp`)

[ForceInline] addition to functions missing it

return instead of spirv_asm when empty blocks are used

* syntax & organization of tests adjustment (specifically how #ifdefs are managed)

* figuring out where ci fails

* figuring out where ci fails -- testing with enclusive & regular

* testing CI with exclusive, regular, inclusive

* remove unneeded white space

test CI inconsistency issues further with arithmetic.slang

* testing if the ci run fails due to some timeout/recovery issue

* split up arithmetic tests and push to test with CI
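
For reference, the expected values in the three split arithmetic tests can be reproduced on paper. The following is a minimal Python sketch, not part of this change, assuming all 4 invocations of the workgroup land in a single subgroup, every invocation contributes 1, and invocation 3 is the one that records the result. The helper names (`inclusive_scan`, `exclusive_scan`, `LANES`, `LAST`, `expected`) are hypothetical, and lane 0 of an exclusive scan is modelled here as the operator's identity.

```python
import operator
from functools import reduce


def inclusive_scan(values, op):
    """Lane i receives op applied over lanes 0..i."""
    out = []
    for v in values:
        out.append(v if not out else op(out[-1], v))
    return out


def exclusive_scan(values, op, identity):
    """Lane i receives op applied over lanes 0..i-1 (identity for lane 0)."""
    out, acc = [], identity
    for v in values:
        out.append(acc)
        acc = op(acc, v)
    return out


LANES = [1, 1, 1, 1]  # local_size_x = 4, every invocation passes T(1)
LAST = 3              # gl_LocalInvocationID.x == 3 writes the results

expected = {
    "subgroupAdd": reduce(operator.add, LANES),                          # 4
    "subgroupInclusiveAdd": inclusive_scan(LANES, operator.add)[LAST],   # 4
    "subgroupExclusiveAdd": exclusive_scan(LANES, operator.add, 0)[LAST],  # 3
    "subgroupXor": reduce(operator.xor, LANES),                          # 0
    "subgroupInclusiveXor": inclusive_scan(LANES, operator.xor)[LAST],   # 0
    "subgroupExclusiveXor": exclusive_scan(LANES, operator.xor, 0)[LAST],  # 1
}

for name, value in expected.items():
    print(name, value)
```

Under these assumptions the Add results (4, 4, 3) and the Xor results (0, 0, 1) match the constants compared against in the `_None`, `_Inclusive` and `_Exclusive` test files respectively.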

---------

Co-authored-by: Yong He <yonghe@outlook.com>
---
 tests/glsl-intrinsic/intrinsic-texture.slang       |   4 +-
 .../shader-subgroup-arithmetic_Exclusive.slang     | 191 +++++++++++++++++++++
 .../shader-subgroup-arithmetic_Inclusive.slang     | 191 +++++++++++++++++++++
 .../shader-subgroup-arithmetic_None.slang          | 191 +++++++++++++++++++++
 .../shader-subgroup/shader-subgroup-ballot.slang   | 142 +++++++++++++++
 .../shader-subgroup/shader-subgroup-basic.slang    |  66 +++++++
 .../shader-subgroup-builtin-variables.slang        |  44 +++++
 .../shader-subgroup-clustered.slang                | 171 ++++++++++++++++++
 .../shader-subgroup/shader-subgroup-quad.slang     | 129 ++++++++++++++
 .../shader-subgroup-shuffle-relative.slang         | 121 +++++++++++++
 .../shader-subgroup/shader-subgroup-shuffle.slang  | 139 +++++++++++++++
 .../shader-subgroup/shader-subgroup-vote.slang     | 167 ++++++++++++++++++
 12 files changed, 1554 insertions(+), 2 deletions(-)
 create mode 100644 tests/glsl-intrinsic/shader-subgroup/shader-subgroup-arithmetic_Exclusive.slang
 create mode 100644 tests/glsl-intrinsic/shader-subgroup/shader-subgroup-arithmetic_Inclusive.slang
 create mode 100644 tests/glsl-intrinsic/shader-subgroup/shader-subgroup-arithmetic_None.slang
 create mode 100644 tests/glsl-intrinsic/shader-subgroup/shader-subgroup-ballot.slang
 create mode 100644 tests/glsl-intrinsic/shader-subgroup/shader-subgroup-basic.slang
 create mode 100644 tests/glsl-intrinsic/shader-subgroup/shader-subgroup-builtin-variables.slang
 create mode 100644 tests/glsl-intrinsic/shader-subgroup/shader-subgroup-clustered.slang
 create mode 100644 tests/glsl-intrinsic/shader-subgroup/shader-subgroup-quad.slang
 create mode 100644 tests/glsl-intrinsic/shader-subgroup/shader-subgroup-shuffle-relative.slang
 create mode 100644 tests/glsl-intrinsic/shader-subgroup/shader-subgroup-shuffle.slang
 create mode 100644 tests/glsl-intrinsic/shader-subgroup/shader-subgroup-vote.slang

(limited to 'tests')

diff --git a/tests/glsl-intrinsic/intrinsic-texture.slang b/tests/glsl-intrinsic/intrinsic-texture.slang
index 3b42be715..591ced099 100644
--- a/tests/glsl-intrinsic/intrinsic-texture.slang
+++ b/tests/glsl-intrinsic/intrinsic-texture.slang
@@ -6,8 +6,8 @@
 //TEST:SIMPLE(filecheck=CHECK_CUDA): -allow-glsl -stage fragment -entry computeMain -target cuda
 
 // Disabling following targets because they are currently causing compile errors.
-//T-EST:SIMPLE(filecheck=CHECK_HLSL): -allow-glsl -stage fragment -entry computeMain -target hlsl
-//T-EST:SIMPLE(filecheck=CHECK_CPP):  -allow-glsl -stage fragment -entry computeMain -target cpp
+//DISABLE_TEST:SIMPLE(filecheck=CHECK_HLSL): -allow-glsl -stage fragment -entry computeMain -target hlsl
+//DISABLE_TEST:SIMPLE(filecheck=CHECK_CPP):  -allow-glsl -stage fragment -entry computeMain -target cpp
 
 // "Offset" family of texture functions in GLSL requires offset parameter to be a constant value.
 // It appears that slangc removes the constant-ness of constant values.
diff --git a/tests/glsl-intrinsic/shader-subgroup/shader-subgroup-arithmetic_Exclusive.slang b/tests/glsl-intrinsic/shader-subgroup/shader-subgroup-arithmetic_Exclusive.slang
new file mode 100644
index 000000000..7bfc4d886
--- /dev/null
+++ b/tests/glsl-intrinsic/shader-subgroup/shader-subgroup-arithmetic_Exclusive.slang
@@ -0,0 +1,191 @@
+//TEST:SIMPLE(filecheck=CHECK_GLSL):  -allow-glsl -stage compute -entry computeMain -target glsl -DTARGET_GLSL
+//TEST:SIMPLE(filecheck=CHECK_SPV):  -allow-glsl -stage compute -entry computeMain -target spirv -emit-spirv-directly -DTARGET_SPIRV
+//TEST:SIMPLE(filecheck=CHECK_HLSL): -allow-glsl -stage compute -entry computeMain -target hlsl -DTARGET_HLSL
+//TEST:SIMPLE(filecheck=CHECK_CUDA): -allow-glsl -stage compute -entry computeMain -target cuda -DTARGET_CUDA
+
+// not testing cpp due to missing impl
+//DISABLE_TEST:SIMPLE(filecheck=CHECK_CPP):  -allow-glsl -stage compute -entry computeMain -target cpp -DTARGET_CPP
+
+//TEST(compute, vulkan):COMPARE_COMPUTE(filecheck-buffer=BUF):-vk -compute -entry computeMain -allow-glsl
+//TEST(compute, vulkan):COMPARE_COMPUTE(filecheck-buffer=BUF):-vk -compute -entry computeMain -allow-glsl -emit-spirv-directly
+#version 430
+
+#if 1                        \
+    && !defined(TARGET_HLSL) \
+    && !defined(TARGET_CUDA)
+// hlsl does not treat boolean types with subgroup.* as a logical operator
+// cuda is missing an implementation
+#define TEST_when_logical_operators_are_implemented
+#endif
+
+//TEST_INPUT:ubuffer(data=[0 0], stride=4):out,name=outputBuffer
+buffer MyBlockName2
+{
+    uint data[];
+} outputBuffer;
+
+#define local_size_x_v 4
+layout(local_size_x = local_size_x_v) in;
+
+__generic<T : __BuiltinLogicalType>
+bool test1Logical() {
+    return true
+#if defined(TEST_when_logical_operators_are_implemented)
+        && subgroupExclusiveAnd(T(1)) == T(1)
+        && subgroupExclusiveOr(T(1)) == T(1)
+        && subgroupExclusiveXor(T(1)) == T(1)
+#endif // #if defined(TEST_when_logical_operators_are_implemented)
+        ;
+}
+
+__generic<T : __BuiltinLogicalType, let N : int>
+bool testVLogical() {
+    typealias gvec = vector<T, N>;
+
+    return true
+#if defined(TEST_when_logical_operators_are_implemented)
+        && subgroupExclusiveAnd(gvec(T(1))) == gvec(T(1))
+        && subgroupExclusiveOr(gvec(T(1))) == gvec(T(1))
+        && subgroupExclusiveXor(gvec(T(1))) == gvec(T(1))
+#endif // #if defined(TEST_when_logical_operators_are_implemented)
+        ;
+}
+
+bool testLogical() {
+    return true
+        && test1Logical<int>()
+        && testVLogical<int, 2>()
+        && testVLogical<int, 3>()
+        && testVLogical<int, 4>()
+        && test1Logical<int8_t>()
+        && testVLogical<int8_t, 2>()
+        && testVLogical<int8_t, 3>()
+        && testVLogical<int8_t, 4>()
+        && test1Logical<int16_t>()
+        && testVLogical<int16_t, 2>()
+        && testVLogical<int16_t, 3>()
+        && testVLogical<int16_t, 4>()
+        && test1Logical<int64_t>()
+        && testVLogical<int64_t, 2>()
+        && testVLogical<int64_t, 3>()
+        && testVLogical<int64_t, 4>()
+        && test1Logical<uint>()
+        && testVLogical<uint, 2>()
+        && testVLogical<uint, 3>()
+        && testVLogical<uint, 4>()
+        && test1Logical<uint8_t>()
+        && testVLogical<uint8_t, 2>()
+        && testVLogical<uint8_t, 3>()
+        && testVLogical<uint8_t, 4>()
+        && test1Logical<uint16_t>()
+        && testVLogical<uint16_t, 2>()
+        && testVLogical<uint16_t, 3>()
+        && testVLogical<uint16_t, 4>()
+        && test1Logical<uint64_t>()
+        && testVLogical<uint64_t, 2>()
+        && testVLogical<uint64_t, 3>()
+        && testVLogical<uint64_t, 4>()
+        && test1Logical<bool>()
+        && testVLogical<bool, 2>()
+        && testVLogical<bool, 3>()
+        && testVLogical<bool, 4>()
+        ;
+}
+
+__generic<T : __BuiltinArithmeticType>
+bool test1Arithmetic() {
+    return true
+        && subgroupExclusiveAdd(T(1)) == T(3)
+        && subgroupExclusiveMul(T(1)) == T(1)
+        && subgroupExclusiveMin(T(1)) == T(1)
+        && subgroupExclusiveMax(T(1)) == T(1)
+        ;
+}
+__generic<T : __BuiltinArithmeticType, let N : int>
+bool testVArithmetic() {
+    typealias gvec = vector<T, N>;
+
+    return true
+        && subgroupExclusiveAdd(gvec(T(1))) == gvec(T(3))
+        && subgroupExclusiveMul(gvec(T(1))) == gvec(T(1))
+        && subgroupExclusiveMin(gvec(T(1))) == gvec(T(1))
+        && subgroupExclusiveMax(gvec(T(1))) == gvec(T(1))
+        ;
+}
+
+bool testArithmetic() {
+    return true
+        && test1Arithmetic<float>()
+        && testVArithmetic<float, 2>()
+        && testVArithmetic<float, 3>()
+        && testVArithmetic<float, 4>()
+        && test1Arithmetic<double>() // WARNING: Intel GPUs lack FP64 support
+        && testVArithmetic<double, 2>()
+        && testVArithmetic<double, 3>()
+        && testVArithmetic<double, 4>()
+        && test1Arithmetic<half>()
+        && testVArithmetic<half, 2>()
+        && testVArithmetic<half, 3>()
+        && testVArithmetic<half, 4>()
+        && test1Arithmetic<int>()
+        && testVArithmetic<int, 2>()
+        && testVArithmetic<int, 3>()
+        && testVArithmetic<int, 4>()
+        && test1Arithmetic<int8_t>() 
+        && testVArithmetic<int8_t, 2>()
+        && testVArithmetic<int8_t, 3>()
+        && testVArithmetic<int8_t, 4>()
+        && test1Arithmetic<int16_t>() 
+        && testVArithmetic<int16_t, 2>()
+        && testVArithmetic<int16_t, 3>()
+        && testVArithmetic<int16_t, 4>()
+        && test1Arithmetic<int64_t>() 
+        && testVArithmetic<int64_t, 2>()
+        && testVArithmetic<int64_t, 3>()
+        && testVArithmetic<int64_t, 4>()
+        && test1Arithmetic<uint>()
+        && testVArithmetic<uint, 2>()
+        && testVArithmetic<uint, 3>()
+        && testVArithmetic<uint, 4>()
+        && test1Arithmetic<uint8_t>() 
+        && testVArithmetic<uint8_t, 2>()
+        && testVArithmetic<uint8_t, 3>()
+        && testVArithmetic<uint8_t, 4>()
+        && test1Arithmetic<uint16_t>() 
+        && testVArithmetic<uint16_t, 2>()
+        && testVArithmetic<uint16_t, 3>()
+        && testVArithmetic<uint16_t, 4>()
+        && test1Arithmetic<uint64_t>() 
+        && testVArithmetic<uint64_t, 2>()
+        && testVArithmetic<uint64_t, 3>()
+        && testVArithmetic<uint64_t, 4>()
+        ;
+}
+
+void computeMain()
+{
+
+    bool res0 = true
+            && testLogical()
+            ;
+    
+    bool res1 = true
+            && testArithmetic()
+            ;
+
+    if (gl_LocalInvocationID.x == 3) {
+        // separate so that, if there is an error, the "major"
+        // tests are isolated into 2 results without polluting the
+        // file with a bunch of individual test values
+        outputBuffer.data[0] = res0;
+        outputBuffer.data[1] = res1;
+    }
+
+    // CHECK_GLSL: void main(
+    // CHECK_SPV: OpEntryPoint
+    // CHECK_HLSL: void computeMain(
+    // CHECK_CUDA: void computeMain(
+    // CHECK_CPP: void _computeMain(
+    // BUF: 1
+    // BUF-NEXT: 1
+}
diff --git a/tests/glsl-intrinsic/shader-subgroup/shader-subgroup-arithmetic_Inclusive.slang b/tests/glsl-intrinsic/shader-subgroup/shader-subgroup-arithmetic_Inclusive.slang
new file mode 100644
index 000000000..09c6bdbdf
--- /dev/null
+++ b/tests/glsl-intrinsic/shader-subgroup/shader-subgroup-arithmetic_Inclusive.slang
@@ -0,0 +1,191 @@
+//TEST:SIMPLE(filecheck=CHECK_GLSL):  -allow-glsl -stage compute -entry computeMain -target glsl -DTARGET_GLSL
+//TEST:SIMPLE(filecheck=CHECK_SPV):  -allow-glsl -stage compute -entry computeMain -target spirv -emit-spirv-directly -DTARGET_SPIRV
+//TEST:SIMPLE(filecheck=CHECK_HLSL): -allow-glsl -stage compute -entry computeMain -target hlsl -DTARGET_HLSL
+//TEST:SIMPLE(filecheck=CHECK_CUDA): -allow-glsl -stage compute -entry computeMain -target cuda -DTARGET_CUDA
+
+// not testing cpp due to missing impl
+//DISABLE_TEST:SIMPLE(filecheck=CHECK_CPP):  -allow-glsl -stage compute -entry computeMain -target cpp -DTARGET_CPP
+
+//TEST(compute, vulkan):COMPARE_COMPUTE(filecheck-buffer=BUF):-vk -compute -entry computeMain -allow-glsl
+//TEST(compute, vulkan):COMPARE_COMPUTE(filecheck-buffer=BUF):-vk -compute -entry computeMain -allow-glsl -emit-spirv-directly
+#version 430
+
+#if 1                        \
+    && !defined(TARGET_HLSL) \
+    && !defined(TARGET_CUDA)
+// hlsl does not treat boolean types with subgroup.* as a logical operator
+// cuda is missing an implementation
+#define TEST_when_logical_operators_are_implemented
+#endif
+
+//TEST_INPUT:ubuffer(data=[0 0], stride=4):out,name=outputBuffer
+buffer MyBlockName2
+{
+    uint data[];
+} outputBuffer;
+
+#define local_size_x_v 4
+layout(local_size_x = local_size_x_v) in;
+
+__generic<T : __BuiltinLogicalType>
+bool test1Logical() {
+    return true
+#if defined(TEST_when_logical_operators_are_implemented)
+        && subgroupInclusiveAnd(T(1)) == T(1)
+        && subgroupInclusiveOr(T(1)) == T(1)
+        && subgroupInclusiveXor(T(1)) == T(0)
+#endif // #if defined(TEST_when_logical_operators_are_implemented)
+        ;
+}
+
+__generic<T : __BuiltinLogicalType, let N : int>
+bool testVLogical() {
+    typealias gvec = vector<T, N>;
+
+    return true
+#if defined(TEST_when_logical_operators_are_implemented)
+        && subgroupInclusiveAnd(gvec(T(1))) == gvec(T(1))
+        && subgroupInclusiveOr(gvec(T(1))) == gvec(T(1))
+        && subgroupInclusiveXor(gvec(T(1))) == gvec(T(0))
+#endif // #if defined(TEST_when_logical_operators_are_implemented)
+        ;
+}
+
+bool testLogical() {
+    return true
+        && test1Logical<int>()
+        && testVLogical<int, 2>()
+        && testVLogical<int, 3>()
+        && testVLogical<int, 4>()
+        && test1Logical<int8_t>()
+        && testVLogical<int8_t, 2>()
+        && testVLogical<int8_t, 3>()
+        && testVLogical<int8_t, 4>()
+        && test1Logical<int16_t>()
+        && testVLogical<int16_t, 2>()
+        && testVLogical<int16_t, 3>()
+        && testVLogical<int16_t, 4>()
+        && test1Logical<int64_t>()
+        && testVLogical<int64_t, 2>()
+        && testVLogical<int64_t, 3>()
+        && testVLogical<int64_t, 4>()
+        && test1Logical<uint>()
+        && testVLogical<uint, 2>()
+        && testVLogical<uint, 3>()
+        && testVLogical<uint, 4>()
+        && test1Logical<uint8_t>()
+        && testVLogical<uint8_t, 2>()
+        && testVLogical<uint8_t, 3>()
+        && testVLogical<uint8_t, 4>()
+        && test1Logical<uint16_t>()
+        && testVLogical<uint16_t, 2>()
+        && testVLogical<uint16_t, 3>()
+        && testVLogical<uint16_t, 4>()
+        && test1Logical<uint64_t>()
+        && testVLogical<uint64_t, 2>()
+        && testVLogical<uint64_t, 3>()
+        && testVLogical<uint64_t, 4>()
+        && test1Logical<bool>()
+        && testVLogical<bool, 2>()
+        && testVLogical<bool, 3>()
+        && testVLogical<bool, 4>()
+        ;
+}
+
+__generic<T : __BuiltinArithmeticType>
+bool test1Arithmetic() {
+    return true
+        && subgroupInclusiveAdd(T(1)) == T(4)
+        && subgroupInclusiveMul(T(1)) == T(1)
+        && subgroupInclusiveMin(T(1)) == T(1)
+        && subgroupInclusiveMax(T(1)) == T(1)
+        ;
+}
+__generic<T : __BuiltinArithmeticType, let N : int>
+bool testVArithmetic() {
+    typealias gvec = vector<T, N>;
+
+    return true
+        && subgroupInclusiveAdd(gvec(T(1))) == gvec(T(4)) 
+        && subgroupInclusiveMul(gvec(T(1))) == gvec(T(1))
+        && subgroupInclusiveMin(gvec(T(1))) == gvec(T(1))
+        && subgroupInclusiveMax(gvec(T(1))) == gvec(T(1))
+        ;
+}
+
+bool testArithmetic() {
+    return true
+        && test1Arithmetic<float>()
+        && testVArithmetic<float, 2>()
+        && testVArithmetic<float, 3>()
+        && testVArithmetic<float, 4>()
+        && test1Arithmetic<double>() // WARNING: Intel GPUs lack FP64 support
+        && testVArithmetic<double, 2>()
+        && testVArithmetic<double, 3>()
+        && testVArithmetic<double, 4>()
+        && test1Arithmetic<half>()
+        && testVArithmetic<half, 2>()
+        && testVArithmetic<half, 3>()
+        && testVArithmetic<half, 4>()
+        && test1Arithmetic<int>()
+        && testVArithmetic<int, 2>()
+        && testVArithmetic<int, 3>()
+        && testVArithmetic<int, 4>()
+        && test1Arithmetic<int8_t>() 
+        && testVArithmetic<int8_t, 2>()
+        && testVArithmetic<int8_t, 3>()
+        && testVArithmetic<int8_t, 4>()
+        && test1Arithmetic<int16_t>() 
+        && testVArithmetic<int16_t, 2>()
+        && testVArithmetic<int16_t, 3>()
+        && testVArithmetic<int16_t, 4>()
+        && test1Arithmetic<int64_t>() 
+        && testVArithmetic<int64_t, 2>()
+        && testVArithmetic<int64_t, 3>()
+        && testVArithmetic<int64_t, 4>()
+        && test1Arithmetic<uint>()
+        && testVArithmetic<uint, 2>()
+        && testVArithmetic<uint, 3>()
+        && testVArithmetic<uint, 4>()
+        && test1Arithmetic<uint8_t>() 
+        && testVArithmetic<uint8_t, 2>()
+        && testVArithmetic<uint8_t, 3>()
+        && testVArithmetic<uint8_t, 4>()
+        && test1Arithmetic<uint16_t>() 
+        && testVArithmetic<uint16_t, 2>()
+        && testVArithmetic<uint16_t, 3>()
+        && testVArithmetic<uint16_t, 4>()
+        && test1Arithmetic<uint64_t>() 
+        && testVArithmetic<uint64_t, 2>()
+        && testVArithmetic<uint64_t, 3>()
+        && testVArithmetic<uint64_t, 4>()
+        ;
+}
+
+void computeMain()
+{
+
+    bool res0 = true
+            && testLogical()
+            ;
+    
+    bool res1 = true
+            && testArithmetic()
+            ;
+
+    if (gl_LocalInvocationID.x == 3) {
+        // separate so that, if there is an error, the "major"
+        // tests are isolated into 2 results without polluting the
+        // file with a bunch of individual test values
+        outputBuffer.data[0] = res0;
+        outputBuffer.data[1] = res1;
+    }
+
+    // CHECK_GLSL: void main(
+    // CHECK_SPV: OpEntryPoint
+    // CHECK_HLSL: void computeMain(
+    // CHECK_CUDA: void computeMain(
+    // CHECK_CPP: void _computeMain(
+    // BUF: 1
+    // BUF-NEXT: 1
+}
diff --git a/tests/glsl-intrinsic/shader-subgroup/shader-subgroup-arithmetic_None.slang b/tests/glsl-intrinsic/shader-subgroup/shader-subgroup-arithmetic_None.slang
new file mode 100644
index 000000000..5300e6796
--- /dev/null
+++ b/tests/glsl-intrinsic/shader-subgroup/shader-subgroup-arithmetic_None.slang
@@ -0,0 +1,191 @@
+//TEST:SIMPLE(filecheck=CHECK_GLSL):  -allow-glsl -stage compute -entry computeMain -target glsl -DTARGET_GLSL
+//TEST:SIMPLE(filecheck=CHECK_SPV):  -allow-glsl -stage compute -entry computeMain -target spirv -emit-spirv-directly -DTARGET_SPIRV
+//TEST:SIMPLE(filecheck=CHECK_HLSL): -allow-glsl -stage compute -entry computeMain -target hlsl -DTARGET_HLSL
+//TEST:SIMPLE(filecheck=CHECK_CUDA): -allow-glsl -stage compute -entry computeMain -target cuda -DTARGET_CUDA
+
+// not testing cpp due to missing impl
+//DISABLE_TEST:SIMPLE(filecheck=CHECK_CPP):  -allow-glsl -stage compute -entry computeMain -target cpp -DTARGET_CPP
+
+//TEST(compute, vulkan):COMPARE_COMPUTE(filecheck-buffer=BUF):-vk -compute -entry computeMain -allow-glsl
+//TEST(compute, vulkan):COMPARE_COMPUTE(filecheck-buffer=BUF):-vk -compute -entry computeMain -allow-glsl -emit-spirv-directly
+#version 430
+
+#if 1                        \
+    && !defined(TARGET_HLSL) \
+    && !defined(TARGET_CUDA)
+// hlsl does not treat boolean types with subgroup.* as a logical operator
+// cuda is missing an implementation
+#define TEST_when_logical_operators_are_implemented
+#endif
+
+//TEST_INPUT:ubuffer(data=[0 0], stride=4):out,name=outputBuffer
+buffer MyBlockName2
+{
+    uint data[];
+} outputBuffer;
+
+#define local_size_x_v 4
+layout(local_size_x = local_size_x_v) in;
+
+__generic<T : __BuiltinLogicalType>
+bool test1Logical() {
+    return true
+#if defined(TEST_when_logical_operators_are_implemented)
+        && subgroupAnd(T(1)) == T(1)
+        && subgroupOr(T(1)) == T(1)
+        && subgroupXor(T(1)) == T(0) 
+#endif // #if defined(TEST_when_logical_operators_are_implemented)
+        ;
+}
+
+__generic<T : __BuiltinLogicalType, let N : int>
+bool testVLogical() {
+    typealias gvec = vector<T, N>;
+
+    return true
+#if defined(TEST_when_logical_operators_are_implemented)
+        && subgroupAnd(gvec(T(1))) == gvec(T(1))
+        && subgroupOr(gvec(T(1))) == gvec(T(1))
+        && subgroupXor(gvec(T(1))) == gvec(T(0))
+#endif // #if defined(TEST_when_logical_operators_are_implemented)
+        ;
+}
+
+bool testLogical() {
+    return true
+        && test1Logical<int>()
+        && testVLogical<int, 2>()
+        && testVLogical<int, 3>()
+        && testVLogical<int, 4>()
+        && test1Logical<int8_t>()
+        && testVLogical<int8_t, 2>()
+        && testVLogical<int8_t, 3>()
+        && testVLogical<int8_t, 4>()
+        && test1Logical<int16_t>()
+        && testVLogical<int16_t, 2>()
+        && testVLogical<int16_t, 3>()
+        && testVLogical<int16_t, 4>()
+        && test1Logical<int64_t>()
+        && testVLogical<int64_t, 2>()
+        && testVLogical<int64_t, 3>()
+        && testVLogical<int64_t, 4>()
+        && test1Logical<uint>()
+        && testVLogical<uint, 2>()
+        && testVLogical<uint, 3>()
+        && testVLogical<uint, 4>()
+        && test1Logical<uint8_t>()
+        && testVLogical<uint8_t, 2>()
+        && testVLogical<uint8_t, 3>()
+        && testVLogical<uint8_t, 4>()
+        && test1Logical<uint16_t>()
+        && testVLogical<uint16_t, 2>()
+        && testVLogical<uint16_t, 3>()
+        && testVLogical<uint16_t, 4>()
+        && test1Logical<uint64_t>()
+        && testVLogical<uint64_t, 2>()
+        && testVLogical<uint64_t, 3>()
+        && testVLogical<uint64_t, 4>()
+        && test1Logical<bool>()
+        && testVLogical<bool, 2>()
+        && testVLogical<bool, 3>()
+        && testVLogical<bool, 4>()
+        ;
+}
+
+__generic<T : __BuiltinArithmeticType>
+bool test1Arithmetic() {
+    return true
+        && subgroupAdd(T(1)) == T(local_size_x_v) // 4
+        && subgroupMul(T(1)) == T(1)
+        && subgroupMin(T(1)) == T(1)
+        && subgroupMax(T(1)) == T(1)
+        ;
+}
+__generic<T : __BuiltinArithmeticType, let N : int>
+bool testVArithmetic() {
+    typealias gvec = vector<T, N>;
+
+    return true
+        && subgroupAdd(gvec(T(1))) == gvec(T(local_size_x_v)) // 4
+        && subgroupMul(gvec(T(1))) == gvec(T(1))
+        && subgroupMin(gvec(T(1))) == gvec(T(1))
+        && subgroupMax(gvec(T(1))) == gvec(T(1))
+        ;
+}
+
+bool testArithmetic() {
+    return true
+        && test1Arithmetic<float>()
+        && testVArithmetic<float, 2>()
+        && testVArithmetic<float, 3>()
+        && testVArithmetic<float, 4>()
+        && test1Arithmetic<double>() // WARNING: Intel GPUs lack FP64 support
+        && testVArithmetic<double, 2>()
+        && testVArithmetic<double, 3>()
+        && testVArithmetic<double, 4>()
+        && test1Arithmetic<half>()
+        && testVArithmetic<half, 2>()
+        && testVArithmetic<half, 3>()
+        && testVArithmetic<half, 4>()
+        && test1Arithmetic<int>()
+        && testVArithmetic<int, 2>()
+        && testVArithmetic<int, 3>()
+        && testVArithmetic<int, 4>()
+        && test1Arithmetic<int8_t>() 
+        && testVArithmetic<int8_t, 2>()
+        && testVArithmetic<int8_t, 3>()
+        && testVArithmetic<int8_t, 4>()
+        && test1Arithmetic<int16_t>() 
+        && testVArithmetic<int16_t, 2>()
+        && testVArithmetic<int16_t, 3>()
+        && testVArithmetic<int16_t, 4>()
+        && test1Arithmetic<int64_t>() 
+        && testVArithmetic<int64_t, 2>()
+        && testVArithmetic<int64_t, 3>()
+        && testVArithmetic<int64_t, 4>()
+        && test1Arithmetic<uint>()
+        && testVArithmetic<uint, 2>()
+        && testVArithmetic<uint, 3>()
+        && testVArithmetic<uint, 4>()
+        && test1Arithmetic<uint8_t>() 
+        && testVArithmetic<uint8_t, 2>()
+        && testVArithmetic<uint8_t, 3>()
+        && testVArithmetic<uint8_t, 4>()
+        && test1Arithmetic<uint16_t>() 
+        && testVArithmetic<uint16_t, 2>()
+        && testVArithmetic<uint16_t, 3>()
+        && testVArithmetic<uint16_t, 4>()
+        && test1Arithmetic<uint64_t>() 
+        && testVArithmetic<uint64_t, 2>()
+        && testVArithmetic<uint64_t, 3>()
+        && testVArithmetic<uint64_t, 4>()
+        ;
+}
+
+void computeMain()
+{
+
+    bool res0 = true
+            && testLogical()
+            ;
+    
+    bool res1 = true
+            && testArithmetic()
+            ;
+
+    if (gl_LocalInvocationID.x == 3) {
+        // separate so that, if there is an error, the "major"
+        // tests are isolated into 2 results without polluting the
+        // file with a bunch of individual test values
+        outputBuffer.data[0] = res0;
+        outputBuffer.data[1] = res1;
+    }
+
+    // CHECK_GLSL: void main(
+    // CHECK_SPV: OpEntryPoint
+    // CHECK_HLSL: void computeMain(
+    // CHECK_CUDA: void computeMain(
+    // CHECK_CPP: void _computeMain(
+    // BUF: 1
+    // BUF-NEXT: 1
+}
diff --git a/tests/glsl-intrinsic/shader-subgroup/shader-subgroup-ballot.slang b/tests/glsl-intrinsic/shader-subgroup/shader-subgroup-ballot.slang
new file mode 100644
index 000000000..8bbd60689
--- /dev/null
+++ b/tests/glsl-intrinsic/shader-subgroup/shader-subgroup-ballot.slang
@@ -0,0 +1,142 @@
+//TEST:SIMPLE(filecheck=CHECK_GLSL):  -allow-glsl -stage compute -entry computeMain -target glsl
+//TEST:SIMPLE(filecheck=CHECK_SPV):  -allow-glsl -stage compute -entry computeMain -target spirv -emit-spirv-directly
+//TEST:SIMPLE(filecheck=CHECK_HLSL): -allow-glsl -stage compute -entry computeMain -target hlsl -DTARGET_HLSL
+
+// not testing cuda due to missing impl
+//DISABLE_TEST:SIMPLE(filecheck=CHECK_CUDA): -allow-glsl -stage compute -entry computeMain -target cuda -DTARGET_CUDA 
+// not testing cpp due to missing impl
+//DISABLE_TEST:SIMPLE(filecheck=CHECK_CPP):  -allow-glsl -stage compute -entry computeMain -target cpp -DTARGET_CPP
+
+//TEST(compute, vulkan):COMPARE_COMPUTE(filecheck-buffer=BUF):-vk -compute -entry computeMain -allow-glsl
+//TEST(compute, vulkan):COMPARE_COMPUTE(filecheck-buffer=BUF):-vk -compute -entry computeMain -allow-glsl -emit-spirv-directly
+#version 430
+
+// breaks on Nvidia GPU by returning 0 which is trivially wrong (works on Intel Iris Xe)
+//#define TEST_when_glsl_subgroupBallotExclusiveBitCount_is_not_bugged
+
+//TEST_INPUT:ubuffer(data=[0 0], stride=4):out,name=outputBuffer
+buffer MyBlockName2 
+{
+    uint data[];
+} outputBuffer;
+
+layout(local_size_x = 32) in;
+
+__generic<T : __BuiltinLogicalType>
+bool test1BroadcastX() {
+    return true
+        && subgroupBroadcast(T(1), 0) == T(1)
+        && subgroupBroadcastFirst(T(1)) == T(1)
+        ;
+}
+__generic<T : __BuiltinLogicalType, let N : int>
+bool testVBroadcastX() {
+    typealias gvec = vector<T, N>;
+
+    return true
+        && subgroupBroadcast(gvec(T(1)), 0) == gvec(T(1))
+        && subgroupBroadcastFirst(gvec(T(1))) == gvec(T(1))
+        ;
+}
+
+__generic<T : __BuiltinFloatingPointType>
+bool test1BroadcastX() {
+    return true
+        && subgroupBroadcast(T(1), 0) == T(1)
+        && subgroupBroadcastFirst(T(1)) == T(1)
+        ;
+}
+__generic<T : __BuiltinFloatingPointType, let N : int>
+bool testVBroadcastX() {
+    typealias gvec = vector<T, N>;
+
+    return true
+        && subgroupBroadcast(gvec(T(1)), 0) == gvec(T(1))
+        && subgroupBroadcastFirst(gvec(T(1))) == gvec(T(1))
+        ;
+}
+bool testBroadcastX() {
+    return true
+        && test1BroadcastX<float>()
+        && testVBroadcastX<float, 2>()
+        && testVBroadcastX<float, 3>()
+        && testVBroadcastX<float, 4>()
+        && test1BroadcastX<double>() // WARNING: Intel GPUs lack FP64 support
+        && testVBroadcastX<double, 2>()
+        && testVBroadcastX<double, 3>()
+        && testVBroadcastX<double, 4>()
+        && test1BroadcastX<half>() 
+        && testVBroadcastX<half, 2>()
+        && testVBroadcastX<half, 3>()
+        && testVBroadcastX<half, 4>()
+        && test1BroadcastX<int>()
+        && testVBroadcastX<int, 2>()
+        && testVBroadcastX<int, 3>()
+        && testVBroadcastX<int, 4>()
+        && test1BroadcastX<int8_t>() 
+        && testVBroadcastX<int8_t, 2>()
+        && testVBroadcastX<int8_t, 3>()
+        && testVBroadcastX<int8_t, 4>()
+        && test1BroadcastX<int16_t>() 
+        && testVBroadcastX<int16_t, 2>()
+        && testVBroadcastX<int16_t, 3>()
+        && testVBroadcastX<int16_t, 4>()
+        && test1BroadcastX<int64_t>() 
+        && testVBroadcastX<int64_t, 2>()
+        && testVBroadcastX<int64_t, 3>()
+        && testVBroadcastX<int64_t, 4>()
+        && test1BroadcastX<uint>()
+        && testVBroadcastX<uint, 2>()
+        && testVBroadcastX<uint, 3>()
+        && testVBroadcastX<uint, 4>()
+        && test1BroadcastX<uint8_t>() 
+        && testVBroadcastX<uint8_t, 2>()
+        && testVBroadcastX<uint8_t, 3>()
+        && testVBroadcastX<uint8_t, 4>()
+        && test1BroadcastX<uint16_t>() 
+        && testVBroadcastX<uint16_t, 2>()
+        && testVBroadcastX<uint16_t, 3>()
+        && testVBroadcastX<uint16_t, 4>()
+        && test1BroadcastX<uint64_t>() 
+        && testVBroadcastX<uint64_t, 2>()
+        && testVBroadcastX<uint64_t, 3>()
+        && testVBroadcastX<uint64_t, 4>()
+        && test1BroadcastX<bool>()
+        && testVBroadcastX<bool, 2>()
+        && testVBroadcastX<bool, 3>()
+        && testVBroadcastX<bool, 4>()
+        ;
+}
+
+bool testBallot() {
+    return true 
+        && (subgroupBallot(true).x == 0xFFFFFFFF)
+        && (subgroupInverseBallot(uvec4(0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF)) == true)
+        && (subgroupBallotBitExtract(uvec4(0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF), 0) == true)
+        && (subgroupBallotBitCount(uvec4(0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF)) == 32)
+        && (subgroupBallotInclusiveBitCount(uvec4(0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF)) != 0)
+#ifdef TEST_when_glsl_subgroupBallotExclusiveBitCount_is_not_bugged
+        && (subgroupBallotExclusiveBitCount(uvec4(0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF)) != 0)
+#endif
+        && (subgroupBallotFindLSB(uvec4(0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF)) == 0)
+        && (subgroupBallotFindMSB(uvec4(0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF)) == 31)
+        ;
+}
+
+void computeMain()
+{
+    outputBuffer.data[0] = true
+        && testBroadcastX()
+        ;
+    outputBuffer.data[1] = true
+        && testBallot()
+        ;
+
+    // CHECK_GLSL: void main(
+    // CHECK_SPV: OpEntryPoint
+    // CHECK_HLSL: void computeMain(
+    // CHECK_CUDA: void computeMain(
+    // CHECK_CPP: void _computeMain(
+    // BUF: 1
+    // BUF-NEXT: 1
+}
diff --git a/tests/glsl-intrinsic/shader-subgroup/shader-subgroup-basic.slang b/tests/glsl-intrinsic/shader-subgroup/shader-subgroup-basic.slang
new file mode 100644
index 000000000..82f2dc8e2
--- /dev/null
+++ b/tests/glsl-intrinsic/shader-subgroup/shader-subgroup-basic.slang
@@ -0,0 +1,66 @@
+//TEST:SIMPLE(filecheck=CHECK_GLSL):  -allow-glsl -stage compute -entry computeMain -target glsl
+//TEST:SIMPLE(filecheck=CHECK_SPV):  -allow-glsl -stage compute -entry computeMain -target spirv -emit-spirv-directly
+//TEST:SIMPLE(filecheck=CHECK_HLSL): -allow-glsl -stage compute -entry computeMain -target hlsl -DTARGET_HLSL
+
+// not testing cuda due to missing impl
+//DISABLE_TEST:SIMPLE(filecheck=CHECK_CUDA): -allow-glsl -stage compute -entry computeMain -target cuda -DTARGET_CUDA 
+// not testing cpp due to missing impl
+//DISABLE_TEST:SIMPLE(filecheck=CHECK_CPP):  -allow-glsl -stage compute -entry computeMain -target cpp -DTARGET_CPP
+
+//TEST(compute, vulkan):COMPARE_COMPUTE(filecheck-buffer=BUF):-vk -compute -entry computeMain -allow-glsl
+//TEST(compute, vulkan):COMPARE_COMPUTE(filecheck-buffer=BUF):-vk -compute -entry computeMain -allow-glsl -emit-spirv-directly
+#version 430
+
+//TEST_INPUT:ubuffer(data=[0 0 0 0 0], stride=4):out,name=outputBuffer
+buffer MyBlockName2 
+{
+    uint data[];
+} outputBuffer;
+
+layout(local_size_x = 32) in;
+
+shared uint shareMem;
+
+void computeMain()
+{
+    // TODO: no test for image memory was done -- subgroupMemoryBarrierImage();
+    // tests are separate since they test concurrency
+
+    shareMem = 100;
+    subgroupMemoryBarrierShared();
+    outputBuffer.data[0] = 1;
+    subgroupBarrier();
+    outputBuffer.data[0] = 2;
+    subgroupBarrier();
+
+    outputBuffer.data[1] = 1;
+    subgroupMemoryBarrier();
+    outputBuffer.data[1] = 2;
+    subgroupBarrier();
+
+    outputBuffer.data[2] = 1;
+    subgroupMemoryBarrierBuffer();
+    outputBuffer.data[2] = 2;
+    subgroupBarrier();
+
+    shareMem = 2;
+    subgroupMemoryBarrierShared();
+    outputBuffer.data[3] = shareMem;
+    subgroupBarrier();
+
+    if (subgroupElect()) {
+        outputBuffer.data[4] = gl_GlobalInvocationID.x + 2;
+    }
+
+    // CHECK_GLSL: void main(
+    // CHECK_SPV: OpEntryPoint
+    // CHECK_HLSL: void computeMain(
+    // CHECK_CUDA: void computeMain(
+    // CHECK_CPP: void _computeMain(
+
+    // BUF: 2
+    // BUF-NEXT: 2
+    // BUF-NEXT: 2
+    // BUF-NEXT: 2
+    // BUF-NEXT: 2
+}
diff --git a/tests/glsl-intrinsic/shader-subgroup/shader-subgroup-builtin-variables.slang b/tests/glsl-intrinsic/shader-subgroup/shader-subgroup-builtin-variables.slang
new file mode 100644
index 000000000..21b533178
--- /dev/null
+++ b/tests/glsl-intrinsic/shader-subgroup/shader-subgroup-builtin-variables.slang
@@ -0,0 +1,44 @@
+//TEST:SIMPLE(filecheck=CHECK_GLSL):  -allow-glsl -stage compute -entry computeMain -target glsl
+//TEST:SIMPLE(filecheck=CHECK_SPV):  -allow-glsl -stage compute -entry computeMain -target spirv -emit-spirv-directly
+
+// missing implementation of most builtin values due to non-trivial translation
+//DISABLE_TEST:SIMPLE(filecheck=CHECK_HLSL): -allow-glsl -stage compute -entry computeMain -target hlsl -DTARGET_HLSL
+// missing implementation of most builtin values due to non-trivial translation
+//DISABLE_TEST:SIMPLE(filecheck=CHECK_CUDA): -allow-glsl -stage compute -entry computeMain -target cuda -DTARGET_CUDA 
+// missing implementation of system (varying?) values
+//DISABLE_TEST:SIMPLE(filecheck=CHECK_CPP):  -allow-glsl -stage compute -entry computeMain -target cpp -DTARGET_CPP
+
+//TEST(compute, vulkan):COMPARE_COMPUTE(filecheck-buffer=BUF):-vk -compute -entry computeMain -allow-glsl
+//TEST(compute, vulkan):COMPARE_COMPUTE(filecheck-buffer=BUF):-vk -compute -entry computeMain -allow-glsl -emit-spirv-directly
+#version 430
+
+//TEST_INPUT:ubuffer(data=[0], stride=4):out,name=outputBuffer
+buffer MyBlockName2
+{
+    uint data[];
+} outputBuffer;
+
+layout(local_size_x = 32) in;
+
+void computeMain()
+{
+    if (gl_GlobalInvocationID.x == 3) {
+        outputBuffer.data[0] = true
+            && gl_NumSubgroups == 1
+            && gl_SubgroupID  == 0 //1 subgroup, 0 based indexing
+            && gl_SubgroupSize == 32
+            && gl_SubgroupInvocationID == 3
+            && gl_SubgroupEqMask == uvec4(0b1000,0,0,0)
+            && gl_SubgroupGeMask == uvec4(0xFFFFFFF8,0,0,0)
+            && gl_SubgroupGtMask == uvec4(0xFFFFFFF0,0,0,0)
+            && gl_SubgroupLeMask == uvec4(0b1111,0,0,0)
+            && gl_SubgroupLtMask == uvec4(0b111,0,0,0)
+            ;
+    }
+    // CHECK_GLSL: void main(
+    // CHECK_SPV: OpEntryPoint
+    // CHECK_HLSL: void computeMain(
+    // CHECK_CUDA: void computeMain(
+    // CHECK_CPP: void _computeMain(
+    // BUF: 1
+}
diff --git a/tests/glsl-intrinsic/shader-subgroup/shader-subgroup-clustered.slang b/tests/glsl-intrinsic/shader-subgroup/shader-subgroup-clustered.slang
new file mode 100644
index 000000000..9e9b089d2
--- /dev/null
+++ b/tests/glsl-intrinsic/shader-subgroup/shader-subgroup-clustered.slang
@@ -0,0 +1,171 @@
+//TEST:SIMPLE(filecheck=CHECK_GLSL):  -allow-glsl -stage compute -entry computeMain -target glsl
+//TEST:SIMPLE(filecheck=CHECK_SPV):  -allow-glsl -stage compute -entry computeMain -target spirv -emit-spirv-directly
+
+// not testing hlsl due to missing impl
+//DISABLE_TEST:SIMPLE(filecheck=CHECK_HLSL): -allow-glsl -stage compute -entry computeMain -target hlsl -DTARGET_HLSL
+// not testing cuda due to missing impl
+//DISABLE_TEST:SIMPLE(filecheck=CHECK_CUDA): -allow-glsl -stage compute -entry computeMain -target cuda -DTARGET_CUDA 
+// not testing cpp due to missing impl
+//DISABLE_TEST:SIMPLE(filecheck=CHECK_CPP):  -allow-glsl -stage compute -entry computeMain -target cpp -DTARGET_CPP
+
+//TEST(compute, vulkan):COMPARE_COMPUTE(filecheck-buffer=BUF):-vk -compute -entry computeMain -allow-glsl
+//TEST(compute, vulkan):COMPARE_COMPUTE(filecheck-buffer=BUF):-vk -compute -entry computeMain -allow-glsl -emit-spirv-directly
+#version 430
+
+//TEST_INPUT:ubuffer(data=[0 0], stride=4):out,name=outputBuffer
+buffer MyBlockName2 
+{
+    uint data[];
+} outputBuffer;
+
+layout(local_size_x = 32) in;
+
+__generic<T : __BuiltinLogicalType>
+bool test1Logical() {
+    return true
+        && subgroupClusteredAnd(T(1), 1) == T(1)
+        && subgroupClusteredOr(T(1), 1) == T(1)
+        && subgroupClusteredXor(T(1), 1) == T(1)
+        ;
+}
+
+__generic<T : __BuiltinLogicalType, let N : int>
+bool testVLogical() {
+    typealias gvec = vector<T, N>;
+
+    return true
+        && subgroupClusteredAnd(gvec(T(1)), 1) == gvec(T(1))
+        && subgroupClusteredOr(gvec(T(1)), 1) == gvec(T(1))
+        && subgroupClusteredXor(gvec(T(1)), 1) == gvec(T(1))   
+        ;     
+}
+
+bool testLogical() {
+    return true
+        && test1Logical<int>()
+        && testVLogical<int, 2>()
+        && testVLogical<int, 3>()
+        && testVLogical<int, 4>()
+        && test1Logical<int8_t>()
+        && testVLogical<int8_t, 2>()
+        && testVLogical<int8_t, 3>()
+        && testVLogical<int8_t, 4>()
+        && test1Logical<int16_t>()
+        && testVLogical<int16_t, 2>()
+        && testVLogical<int16_t, 3>()
+        && testVLogical<int16_t, 4>()
+        && test1Logical<int64_t>()
+        && testVLogical<int64_t, 2>()
+        && testVLogical<int64_t, 3>()
+        && testVLogical<int64_t, 4>()
+        && test1Logical<uint>()
+        && testVLogical<uint, 2>()
+        && testVLogical<uint, 3>()
+        && testVLogical<uint, 4>()
+        && test1Logical<uint8_t>()
+        && testVLogical<uint8_t, 2>()
+        && testVLogical<uint8_t, 3>()
+        && testVLogical<uint8_t, 4>()
+        && test1Logical<uint16_t>()
+        && testVLogical<uint16_t, 2>()
+        && testVLogical<uint16_t, 3>()
+        && testVLogical<uint16_t, 4>()
+        && test1Logical<uint64_t>()
+        && testVLogical<uint64_t, 2>()
+        && testVLogical<uint64_t, 3>()
+        && testVLogical<uint64_t, 4>()
+        && test1Logical<bool>()
+        && testVLogical<bool, 2>()
+        && testVLogical<bool, 3>()
+        && testVLogical<bool, 4>()
+        ;
+}
+
+__generic<T : __BuiltinArithmeticType>
+bool test1Arithmetic() {
+    return true
+        && subgroupClusteredAdd(T(1), 1) == T(1)
+        && subgroupClusteredMul(T(1), 1) == T(1)
+        && subgroupClusteredMin(T(1), 1) == T(1)
+        && subgroupClusteredMax(T(1), 1) == T(1)
+        ;
+}
+
+__generic<T : __BuiltinArithmeticType, let N : int>
+bool testVArithmetic() {
+    typealias gvec = vector<T, N>;
+
+    return true
+        && subgroupClusteredAdd(gvec(T(1)), 1) == gvec(T(1))
+        && subgroupClusteredMul(gvec(T(1)), 1) == gvec(T(1))
+        && subgroupClusteredMin(gvec(T(1)), 1) == gvec(T(1))
+        && subgroupClusteredMax(gvec(T(1)), 1) == gvec(T(1))
+        ;
+}
+
+bool testArithmetic() {
+    return true
+        && test1Arithmetic<float>()
+        && testVArithmetic<float, 2>()
+        && testVArithmetic<float, 3>()
+        && testVArithmetic<float, 4>()
+        && test1Arithmetic<double>() // WARNING: some Intel GPUs lack FP64 support
+        && testVArithmetic<double, 2>() 
+        && testVArithmetic<double, 3>()
+        && testVArithmetic<double, 4>()
+        && test1Arithmetic<half>() 
+        && testVArithmetic<half, 2>()
+        && testVArithmetic<half, 3>()
+        && testVArithmetic<half, 4>()
+        && test1Arithmetic<int>()
+        && testVArithmetic<int, 2>()
+        && testVArithmetic<int, 3>()
+        && testVArithmetic<int, 4>()
+        && test1Arithmetic<int8_t>() 
+        && testVArithmetic<int8_t, 2>()
+        && testVArithmetic<int8_t, 3>()
+        && testVArithmetic<int8_t, 4>()
+        && test1Arithmetic<int16_t>() 
+        && testVArithmetic<int16_t, 2>()
+        && testVArithmetic<int16_t, 3>()
+        && testVArithmetic<int16_t, 4>()
+        && test1Arithmetic<int64_t>() 
+        && testVArithmetic<int64_t, 2>()
+        && testVArithmetic<int64_t, 3>()
+        && testVArithmetic<int64_t, 4>()
+        && test1Arithmetic<uint>()
+        && testVArithmetic<uint, 2>()
+        && testVArithmetic<uint, 3>()
+        && testVArithmetic<uint, 4>()
+        && test1Arithmetic<uint8_t>() 
+        && testVArithmetic<uint8_t, 2>()
+        && testVArithmetic<uint8_t, 3>()
+        && testVArithmetic<uint8_t, 4>()
+        && test1Arithmetic<uint16_t>() 
+        && testVArithmetic<uint16_t, 2>()
+        && testVArithmetic<uint16_t, 3>()
+        && testVArithmetic<uint16_t, 4>()
+        && test1Arithmetic<uint64_t>() 
+        && testVArithmetic<uint64_t, 2>()
+        && testVArithmetic<uint64_t, 3>()
+        && testVArithmetic<uint64_t, 4>()
+        ;
+}
+
+void computeMain()
+{
+    outputBuffer.data[0] = true
+        && testLogical()
+        ;
+    outputBuffer.data[1] = true
+        && testArithmetic()
+        ;
+
+    // CHECK_GLSL: void main(
+    // CHECK_SPV: OpEntryPoint
+    // CHECK_HLSL: void computeMain(
+    // CHECK_CUDA: void computeMain(
+    // CHECK_CPP: void _computeMain(
+    // BUF: 1
+    // BUF-NEXT: 1
+}
diff --git a/tests/glsl-intrinsic/shader-subgroup/shader-subgroup-quad.slang b/tests/glsl-intrinsic/shader-subgroup/shader-subgroup-quad.slang
new file mode 100644
index 000000000..5ed6398b2
--- /dev/null
+++ b/tests/glsl-intrinsic/shader-subgroup/shader-subgroup-quad.slang
@@ -0,0 +1,129 @@
+//TEST:SIMPLE(filecheck=CHECK_GLSL):  -allow-glsl -stage compute -entry computeMain -target glsl
+//TEST:SIMPLE(filecheck=CHECK_SPV):  -allow-glsl -stage compute -entry computeMain -target spirv -emit-spirv-directly
+//TEST:SIMPLE(filecheck=CHECK_HLSL): -allow-glsl -stage compute -entry computeMain -target hlsl -DTARGET_HLSL
+
+// not testing cuda due to missing impl
+//DISABLE_TEST:SIMPLE(filecheck=CHECK_CUDA): -allow-glsl -stage compute -entry computeMain -target cuda -DTARGET_CUDA 
+// not testing cpp due to missing impl
+//DISABLE_TEST:SIMPLE(filecheck=CHECK_CPP):  -allow-glsl -stage compute -entry computeMain -target cpp -DTARGET_CPP
+
+//TEST(compute, vulkan):COMPARE_COMPUTE(filecheck-buffer=BUF):-vk -compute -entry computeMain -allow-glsl
+//TEST(compute, vulkan):COMPARE_COMPUTE(filecheck-buffer=BUF):-vk -compute -entry computeMain -allow-glsl -emit-spirv-directly
+#version 430
+
+//TEST_INPUT:ubuffer(data=[0], stride=4):out,name=outputBuffer
+buffer MyBlockName2 
+{
+    uint data[];
+} outputBuffer;
+
+layout(local_size_x = 4) in;
+
+__generic<T : __BuiltinLogicalType>
+bool test1QuadX() {
+    return true
+        && subgroupQuadSwapHorizontal(T(2)) == T(2)
+        && subgroupQuadSwapVertical(T(2)) == T(2)
+        && subgroupQuadSwapDiagonal(T(3)) == T(3)
+        && subgroupQuadBroadcast(T(1), 1) == T(1)
+        ;
+}
+__generic<T : __BuiltinLogicalType, let N : int>
+bool testVQuadX() {
+    typealias gvec = vector<T, N>;
+
+    return true
+        && subgroupQuadSwapHorizontal(gvec(T(2))) == gvec(T(2))
+        && subgroupQuadSwapVertical(gvec(T(2))) == gvec(T(2))
+        && subgroupQuadSwapDiagonal(gvec(T(3))) == gvec(T(3))
+        && subgroupQuadBroadcast(gvec(T(1)), 1) == gvec(T(1))
+        ;
+}
+
+__generic<T : __BuiltinFloatingPointType>
+bool test1QuadX() {
+    return true
+        && subgroupQuadSwapHorizontal(T(2)) == T(2)
+        && subgroupQuadSwapVertical(T(2)) == T(2)
+        && subgroupQuadSwapDiagonal(T(3)) == T(3)
+        && subgroupQuadBroadcast(T(1), 1) == T(1)
+        ;
+}
+__generic<T : __BuiltinFloatingPointType, let N : int>
+bool testVQuadX() {
+    typealias gvec = vector<T, N>;
+
+    return true
+        && subgroupQuadSwapHorizontal(gvec(T(2))) == gvec(T(2))
+        && subgroupQuadSwapVertical(gvec(T(2))) == gvec(T(2))
+        && subgroupQuadSwapDiagonal(gvec(T(3))) == gvec(T(3))
+        && subgroupQuadBroadcast(gvec(T(1)), 1) == gvec(T(1))
+        ;
+}
+bool testQuadSwapX() {
+    return true
+        && test1QuadX<float>()
+        && testVQuadX<float, 2>()
+        && testVQuadX<float, 3>()
+        && testVQuadX<float, 4>()
+        && test1QuadX<double>() // WARNING: some Intel GPUs lack FP64 support
+        && testVQuadX<double, 2>()
+        && testVQuadX<double, 3>()
+        && testVQuadX<double, 4>()
+        && test1QuadX<half>() 
+        && testVQuadX<half, 2>()
+        && testVQuadX<half, 3>()
+        && testVQuadX<half, 4>()
+        && test1QuadX<int>()
+        && testVQuadX<int, 2>()
+        && testVQuadX<int, 3>()
+        && testVQuadX<int, 4>()
+        && test1QuadX<int8_t>() 
+        && testVQuadX<int8_t, 2>()
+        && testVQuadX<int8_t, 3>()
+        && testVQuadX<int8_t, 4>()
+        && test1QuadX<int16_t>() 
+        && testVQuadX<int16_t, 2>()
+        && testVQuadX<int16_t, 3>()
+        && testVQuadX<int16_t, 4>()
+        && test1QuadX<int64_t>() 
+        && testVQuadX<int64_t, 2>()
+        && testVQuadX<int64_t, 3>()
+        && testVQuadX<int64_t, 4>()
+        && test1QuadX<uint>()
+        && testVQuadX<uint, 2>()
+        && testVQuadX<uint, 3>()
+        && testVQuadX<uint, 4>()
+        && test1QuadX<uint8_t>() 
+        && testVQuadX<uint8_t, 2>()
+        && testVQuadX<uint8_t, 3>()
+        && testVQuadX<uint8_t, 4>()
+        && test1QuadX<uint16_t>() 
+        && testVQuadX<uint16_t, 2>()
+        && testVQuadX<uint16_t, 3>()
+        && testVQuadX<uint16_t, 4>()
+        && test1QuadX<uint64_t>() 
+        && testVQuadX<uint64_t, 2>()
+        && testVQuadX<uint64_t, 3>()
+        && testVQuadX<uint64_t, 4>()
+        && test1QuadX<bool>()
+        && testVQuadX<bool, 2>()
+        && testVQuadX<bool, 3>()
+        && testVQuadX<bool, 4>()
+        ;
+}
+
+void computeMain()
+{
+
+    outputBuffer.data[0] = true
+        && testQuadSwapX()
+        ;
+
+    // CHECK_GLSL: void main(
+    // CHECK_SPV: OpEntryPoint
+    // CHECK_HLSL: void computeMain(
+    // CHECK_CUDA: void computeMain(
+    // CHECK_CPP: void _computeMain(
+    // BUF: 1
+}
diff --git a/tests/glsl-intrinsic/shader-subgroup/shader-subgroup-shuffle-relative.slang b/tests/glsl-intrinsic/shader-subgroup/shader-subgroup-shuffle-relative.slang
new file mode 100644
index 000000000..0e187c568
--- /dev/null
+++ b/tests/glsl-intrinsic/shader-subgroup/shader-subgroup-shuffle-relative.slang
@@ -0,0 +1,121 @@
+//TEST:SIMPLE(filecheck=CHECK_GLSL):  -allow-glsl -stage compute -entry computeMain -target glsl
+//TEST:SIMPLE(filecheck=CHECK_SPV):  -allow-glsl -stage compute -entry computeMain -target spirv -emit-spirv-directly
+
+// not testing hlsl due to missing impl
+//DISABLE_TEST:SIMPLE(filecheck=CHECK_HLSL): -allow-glsl -stage compute -entry computeMain -target hlsl -DTARGET_HLSL
+// not testing cuda due to missing impl
+//DISABLE_TEST:SIMPLE(filecheck=CHECK_CUDA): -allow-glsl -stage compute -entry computeMain -target cuda -DTARGET_CUDA 
+// not testing cpp due to missing impl
+//DISABLE_TEST:SIMPLE(filecheck=CHECK_CPP):  -allow-glsl -stage compute -entry computeMain -target cpp -DTARGET_CPP
+
+//TEST(compute, vulkan):COMPARE_COMPUTE(filecheck-buffer=BUF):-vk -compute -entry computeMain -allow-glsl
+//TEST(compute, vulkan):COMPARE_COMPUTE(filecheck-buffer=BUF):-vk -compute -entry computeMain -allow-glsl -emit-spirv-directly
+#version 430
+
+//TEST_INPUT:ubuffer(data=[0], stride=4):out,name=outputBuffer
+buffer MyBlockName2 
+{
+    uint data[];
+} outputBuffer;
+
+layout(local_size_x = 32) in;
+
+__generic<T : __BuiltinLogicalType>
+bool test1ShuffleX() {
+    return true
+        && subgroupShuffleUp(T(1), 1) == T(1)
+        && subgroupShuffleDown(T(1), 1) == T(1)
+        ;
+}
+__generic<T : __BuiltinLogicalType, let N : int>
+bool testVShuffleX() {
+    typealias gvec = vector<T, N>;
+
+    return true
+        && subgroupShuffleUp(gvec(T(1)), 1) == gvec(T(1))
+        && subgroupShuffleDown(gvec(T(1)), 1) == gvec(T(1))
+        ;
+}
+
+__generic<T : __BuiltinFloatingPointType>
+bool test1ShuffleX() {
+    return true
+        && subgroupShuffleUp(T(1), 1) == T(1)
+        && subgroupShuffleDown(T(1), 1) == T(1)
+        ;
+}
+__generic<T : __BuiltinFloatingPointType, let N : int>
+bool testVShuffleX() {
+    typealias gvec = vector<T, N>;
+
+    return true
+        && subgroupShuffleUp(gvec(T(1)), 1) == gvec(T(1))
+        && subgroupShuffleDown(gvec(T(1)), 1) == gvec(T(1))
+        ;
+}
+bool testShuffleX() {
+    return true
+        && test1ShuffleX<float>()
+        && testVShuffleX<float, 2>()
+        && testVShuffleX<float, 3>()
+        && testVShuffleX<float, 4>()
+        && test1ShuffleX<double>() // WARNING: some Intel GPUs lack FP64 support
+        && testVShuffleX<double, 2>()
+        && testVShuffleX<double, 3>()
+        && testVShuffleX<double, 4>()
+        && test1ShuffleX<half>() 
+        && testVShuffleX<half, 2>()
+        && testVShuffleX<half, 3>()
+        && testVShuffleX<half, 4>()
+        && test1ShuffleX<int>()
+        && testVShuffleX<int, 2>()
+        && testVShuffleX<int, 3>()
+        && testVShuffleX<int, 4>()
+        && test1ShuffleX<int8_t>() 
+        && testVShuffleX<int8_t, 2>()
+        && testVShuffleX<int8_t, 3>()
+        && testVShuffleX<int8_t, 4>()
+        && test1ShuffleX<int16_t>() 
+        && testVShuffleX<int16_t, 2>()
+        && testVShuffleX<int16_t, 3>()
+        && testVShuffleX<int16_t, 4>()
+        && test1ShuffleX<int64_t>() 
+        && testVShuffleX<int64_t, 2>()
+        && testVShuffleX<int64_t, 3>()
+        && testVShuffleX<int64_t, 4>()
+        && test1ShuffleX<uint>()
+        && testVShuffleX<uint, 2>()
+        && testVShuffleX<uint, 3>()
+        && testVShuffleX<uint, 4>()
+        && test1ShuffleX<uint8_t>() 
+        && testVShuffleX<uint8_t, 2>()
+        && testVShuffleX<uint8_t, 3>()
+        && testVShuffleX<uint8_t, 4>()
+        && test1ShuffleX<uint16_t>() 
+        && testVShuffleX<uint16_t, 2>()
+        && testVShuffleX<uint16_t, 3>()
+        && testVShuffleX<uint16_t, 4>()
+        && test1ShuffleX<uint64_t>() 
+        && testVShuffleX<uint64_t, 2>()
+        && testVShuffleX<uint64_t, 3>()
+        && testVShuffleX<uint64_t, 4>()
+        && test1ShuffleX<bool>()
+        && testVShuffleX<bool, 2>()
+        && testVShuffleX<bool, 3>()
+        && testVShuffleX<bool, 4>()
+        ;
+}
+
+void computeMain()
+{
+    outputBuffer.data[0] = true
+        && testShuffleX()
+        ;
+
+    // CHECK_GLSL: void main(
+    // CHECK_SPV: OpEntryPoint
+    // CHECK_HLSL: void computeMain(
+    // CHECK_CUDA: void computeMain(
+    // CHECK_CPP: void _computeMain(
+    // BUF: 1
+}
diff --git a/tests/glsl-intrinsic/shader-subgroup/shader-subgroup-shuffle.slang b/tests/glsl-intrinsic/shader-subgroup/shader-subgroup-shuffle.slang
new file mode 100644
index 000000000..5dca1a588
--- /dev/null
+++ b/tests/glsl-intrinsic/shader-subgroup/shader-subgroup-shuffle.slang
@@ -0,0 +1,139 @@
+//TEST:SIMPLE(filecheck=CHECK_GLSL):  -allow-glsl -stage compute -entry computeMain -target glsl
+//TEST:SIMPLE(filecheck=CHECK_SPV):  -allow-glsl -stage compute -entry computeMain -target spirv -emit-spirv-directly
+
+// not testing hlsl due to missing impl
+//DISABLE_TEST:SIMPLE(filecheck=CHECK_HLSL): -allow-glsl -stage compute -entry computeMain -target hlsl -DTARGET_HLSL
+// not testing cuda due to missing impl
+//DISABLE_TEST:SIMPLE(filecheck=CHECK_CUDA): -allow-glsl -stage compute -entry computeMain -target cuda -DTARGET_CUDA 
+// not testing cpp due to missing impl
+//DISABLE_TEST:SIMPLE(filecheck=CHECK_CPP):  -allow-glsl -stage compute -entry computeMain -target cpp -DTARGET_CPP
+
+//TEST(compute, vulkan):COMPARE_COMPUTE(filecheck-buffer=BUF):-vk -compute -entry computeMain -allow-glsl
+//TEST(compute, vulkan):COMPARE_COMPUTE(filecheck-buffer=BUF):-vk -compute -entry computeMain -allow-glsl -emit-spirv-directly
+#version 430
+
+#if 1                       \
+    && !defined(TARGET_HLSL) \
+    && !defined(TARGET_CUDA)
+// hlsl is missing an implementation
+// cuda is missing an implementation
+#define TEST_when_subgroupShuffleXor_is_implemented
+#endif
+
+//TEST_INPUT:ubuffer(data=[0], stride=4):out,name=outputBuffer
+buffer MyBlockName2 
+{
+    uint data[];
+} outputBuffer;
+
+layout(local_size_x = 32) in;
+
+__generic<T : __BuiltinLogicalType>
+bool test1ShuffleX() {
+    return true
+        && subgroupShuffle(T(1), 1) == T(1)
+#ifdef TEST_when_subgroupShuffleXor_is_implemented
+        && subgroupShuffleXor(T(1), 1) == T(1)
+#endif // #ifdef TEST_when_subgroupShuffleXor_is_implemented
+        ;
+}
+__generic<T : __BuiltinLogicalType, let N : int>
+bool testVShuffleX() {
+    typealias gvec = vector<T, N>;
+
+    return true
+        && subgroupShuffle(gvec(T(1)), 1) == gvec(T(1))
+#ifdef TEST_when_subgroupShuffleXor_is_implemented
+        && subgroupShuffleXor(gvec(T(1)), 1) == gvec(T(1))
+#endif // #ifdef TEST_when_subgroupShuffleXor_is_implemented
+        ;
+}
+
+__generic<T : __BuiltinFloatingPointType>
+bool test1ShuffleX() {
+    return true
+        && subgroupShuffle(T(1), 1) == T(1)
+#if !defined(TARGET_CUDA) && !defined(TARGET_HLSL)
+        && subgroupShuffleXor(T(1), 1) == T(1)
+#endif // #if !defined(TARGET_CUDA) && !defined(TARGET_HLSL)
+        ;
+}
+__generic<T : __BuiltinFloatingPointType, let N : int>
+bool testVShuffleX() {
+    typealias gvec = vector<T, N>;
+
+    return true
+        && subgroupShuffle(gvec(T(1)), 1) == gvec(T(1))
+#if !defined(TARGET_CUDA) && !defined(TARGET_HLSL)
+        && subgroupShuffleXor(gvec(T(1)), 1) == gvec(T(1))
+#endif // #if !defined(TARGET_CUDA) && !defined(TARGET_HLSL)
+        ;
+}
+bool testShuffleX() {
+    return true
+        && test1ShuffleX<float>()
+        && testVShuffleX<float, 2>()
+        && testVShuffleX<float, 3>()
+        && testVShuffleX<float, 4>()
+        && test1ShuffleX<double>() // WARNING: some Intel GPUs lack FP64 support
+        && testVShuffleX<double, 2>()
+        && testVShuffleX<double, 3>()
+        && testVShuffleX<double, 4>()
+        && test1ShuffleX<half>() 
+        && testVShuffleX<half, 2>()
+        && testVShuffleX<half, 3>()
+        && testVShuffleX<half, 4>()
+        && test1ShuffleX<int>()
+        && testVShuffleX<int, 2>()
+        && testVShuffleX<int, 3>()
+        && testVShuffleX<int, 4>()
+        && test1ShuffleX<int8_t>() 
+        && testVShuffleX<int8_t, 2>()
+        && testVShuffleX<int8_t, 3>()
+        && testVShuffleX<int8_t, 4>()
+        && test1ShuffleX<int16_t>() 
+        && testVShuffleX<int16_t, 2>()
+        && testVShuffleX<int16_t, 3>()
+        && testVShuffleX<int16_t, 4>()
+        && test1ShuffleX<int64_t>() 
+        && testVShuffleX<int64_t, 2>()
+        && testVShuffleX<int64_t, 3>()
+        && testVShuffleX<int64_t, 4>()
+        && test1ShuffleX<uint>()
+        && testVShuffleX<uint, 2>()
+        && testVShuffleX<uint, 3>()
+        && testVShuffleX<uint, 4>()
+        && test1ShuffleX<uint8_t>() 
+        && testVShuffleX<uint8_t, 2>()
+        && testVShuffleX<uint8_t, 3>()
+        && testVShuffleX<uint8_t, 4>()
+        && test1ShuffleX<uint16_t>() 
+        && testVShuffleX<uint16_t, 2>()
+        && testVShuffleX<uint16_t, 3>()
+        && testVShuffleX<uint16_t, 4>()
+        && test1ShuffleX<uint64_t>() 
+        && testVShuffleX<uint64_t, 2>()
+        && testVShuffleX<uint64_t, 3>()
+        && testVShuffleX<uint64_t, 4>()
+        && test1ShuffleX<bool>()
+        && testVShuffleX<bool, 2>()
+        && testVShuffleX<bool, 3>()
+        && testVShuffleX<bool, 4>()
+        ;
+}
+
+
+void computeMain()
+{
+
+    outputBuffer.data[0] = true
+        && testShuffleX()
+        ;
+
+    // CHECK_GLSL: void main(
+    // CHECK_SPV: OpEntryPoint
+    // CHECK_HLSL: void computeMain(
+    // CHECK_CUDA: void computeMain(
+    // CHECK_CPP: void _computeMain(
+    // BUF: 1
+}
diff --git a/tests/glsl-intrinsic/shader-subgroup/shader-subgroup-vote.slang b/tests/glsl-intrinsic/shader-subgroup/shader-subgroup-vote.slang
new file mode 100644
index 000000000..bcd4aeb56
--- /dev/null
+++ b/tests/glsl-intrinsic/shader-subgroup/shader-subgroup-vote.slang
@@ -0,0 +1,167 @@
+//TEST:SIMPLE(filecheck=CHECK_GLSL):  -allow-glsl -stage compute -entry computeMain -target glsl
+//TEST:SIMPLE(filecheck=CHECK_SPV):  -allow-glsl -stage compute -entry computeMain -target spirv -emit-spirv-directly
+//TEST:SIMPLE(filecheck=CHECK_HLSL): -allow-glsl -stage compute -entry computeMain -target hlsl -DTARGET_HLSL
+
+// not testing cuda due to missing impl
+//DISABLE_TEST:SIMPLE(filecheck=CHECK_CUDA): -allow-glsl -stage compute -entry computeMain -target cuda -DTARGET_CUDA 
+// not testing cpp due to missing impl
+//DISABLE_TEST:SIMPLE(filecheck=CHECK_CPP):  -allow-glsl -stage compute -entry computeMain -target cpp -DTARGET_CPP
+
+//TEST(compute, vulkan):COMPARE_COMPUTE(filecheck-buffer=BUF):-vk -compute -entry computeMain -allow-glsl
+//TEST(compute, vulkan):COMPARE_COMPUTE(filecheck-buffer=BUF):-vk -compute -entry computeMain -allow-glsl -emit-spirv-directly
+#version 430
+
+//TEST_INPUT:ubuffer(data=[9], stride=4):name=inputBuffer
+buffer MyBlockName
+{
+    uint data[];
+} inputBuffer;
+
+//TEST_INPUT:ubuffer(data=[0 0 0 0 0], stride=4):out,name=outputBuffer
+buffer MyBlockName2 
+{
+    uint data[];
+} outputBuffer;
+
+layout(local_size_x = 32) in;
+
+__generic<T : __BuiltinLogicalType>
+bool test1AllEqual() {
+    return true
+        && subgroupAllEqual(T(1)) == true
+        && subgroupAllEqual(T(gl_GlobalInvocationID.x)) == false
+        ;
+}
+__generic<T : __BuiltinLogicalType, let N : int>
+bool testVAllEqual() {
+    typealias gvec = vector<T, N>;
+
+    return true
+        && subgroupAllEqual(gvec(T(1))) == true
+        && subgroupAllEqual(gvec(T(gl_GlobalInvocationID.x))) == false
+        ;
+}
+
+__generic<T : __BuiltinFloatingPointType>
+bool test1AllEqual() {
+    return true
+        && subgroupAllEqual(T(1)) == true
+        && subgroupAllEqual(T(gl_GlobalInvocationID.x)) == false
+        ;
+}
+__generic<T : __BuiltinFloatingPointType, let N : int>
+bool testVAllEqual() {
+    typealias gvec = vector<T, N>;
+
+    return true
+        && subgroupAllEqual(gvec(T(1))) == true
+        && subgroupAllEqual(gvec(T(gl_GlobalInvocationID.x))) == false
+        ;
+}
+bool testAllEqual() {
+    return true
+        && test1AllEqual<float>()
+        && testVAllEqual<float, 2>()
+        && testVAllEqual<float, 3>()
+        && testVAllEqual<float, 4>()
+        && test1AllEqual<double>() // WARNING: some Intel GPUs lack FP64 support
+        && testVAllEqual<double, 2>()
+        && testVAllEqual<double, 3>()
+        && testVAllEqual<double, 4>()
+        && test1AllEqual<half>() 
+        && testVAllEqual<half, 2>()
+        && testVAllEqual<half, 3>()
+        && testVAllEqual<half, 4>()
+        && test1AllEqual<int>()
+        && testVAllEqual<int, 2>()
+        && testVAllEqual<int, 3>()
+        && testVAllEqual<int, 4>()
+        && test1AllEqual<int8_t>() 
+        && testVAllEqual<int8_t, 2>()
+        && testVAllEqual<int8_t, 3>()
+        && testVAllEqual<int8_t, 4>()
+        && test1AllEqual<int16_t>() 
+        && testVAllEqual<int16_t, 2>()
+        && testVAllEqual<int16_t, 3>()
+        && testVAllEqual<int16_t, 4>()
+        && test1AllEqual<int64_t>() 
+        && testVAllEqual<int64_t, 2>()
+        && testVAllEqual<int64_t, 3>()
+        && testVAllEqual<int64_t, 4>()
+        && test1AllEqual<uint>()
+        && testVAllEqual<uint, 2>()
+        && testVAllEqual<uint, 3>()
+        && testVAllEqual<uint, 4>()
+        && test1AllEqual<uint8_t>() 
+        && testVAllEqual<uint8_t, 2>()
+        && testVAllEqual<uint8_t, 3>()
+        && testVAllEqual<uint8_t, 4>()
+        && test1AllEqual<uint16_t>() 
+        && testVAllEqual<uint16_t, 2>()
+        && testVAllEqual<uint16_t, 3>()
+        && testVAllEqual<uint16_t, 4>()
+        && test1AllEqual<uint64_t>() 
+        && testVAllEqual<uint64_t, 2>()
+        && testVAllEqual<uint64_t, 3>()
+        && testVAllEqual<uint64_t, 4>()
+        && test1AllEqual<bool>()
+        && testVAllEqual<bool, 2>()
+        && testVAllEqual<bool, 3>()
+        && testVAllEqual<bool, 4>()
+        ;
+}
+
+void computeMain()
+{
+    // separate tests since we are testing concurrency
+
+    // one is true, rest false, positive
+    outputBuffer.data[0] = 1;
+    bool t1 = inputBuffer.data[0] == gl_GlobalInvocationID.x;
+    if (subgroupAny(t1)) {
+        subgroupBarrier();
+        outputBuffer.data[0] = 2;
+    }
+
+    // all false, negative
+    outputBuffer.data[1] = 1;
+    t1 = false;
+    if (!subgroupAny(t1)) {
+        subgroupBarrier();
+        outputBuffer.data[1] = 2;
+    }
+
+    // all true, positive
+    outputBuffer.data[2] = 1;
+    t1 = true;
+    if (subgroupAll(t1)) {
+        subgroupBarrier();
+        outputBuffer.data[2] = 2;
+    }
+
+    // all false, negative
+    outputBuffer.data[3] = 1;
+    t1 = false;
+    if (!subgroupAll(t1)) {
+        subgroupBarrier();
+        outputBuffer.data[3] = 2;
+    }
+
+    outputBuffer.data[4] = 1;
+
+    if (testAllEqual()) {
+        subgroupBarrier();
+        outputBuffer.data[4] = 2;
+    }
+
+    // CHECK_GLSL: void main(
+    // CHECK_SPV: OpEntryPoint
+    // CHECK_HLSL: void computeMain(
+    // CHECK_CUDA: void computeMain(
+    // CHECK_CPP: void _computeMain(
+    // BUF: 2
+    // BUF-NEXT: 2
+    // BUF-NEXT: 2
+    // BUF-NEXT: 2
+    // BUF-NEXT: 2
+}
-- 
cgit v1.2.3