Capabilities System, CapabilitySet Logic Overhaul (#4145)

* Capabilities System, Backing Logic Overhaul Fixes #4015 Problems to address: 1. Currently the capabilities system spends anywhere from 25-50% of compile time on the CapabilityVisitor. Most of this time is spent on join logic: 1. Finding abstract atoms 2. Comparing list1<->list2. This should and can be made significantly faster. 2. Error system does not produce errors with auxiliary information. This will require a partial redesign to provide more useful semantic information for debugging. What was addressed: 1. Array backed `CapabilityConjunctionSet` was replaced in-favor for a `UIntSet` backed `CapabilityTargetSets`. The design is described below. Design: * `CapabilityTargetSets` is a `Dictionary<targetAtom, CapabilityTargetSet>`. This is not an array for 2 reasons: 1. Easy to figure out which target is missing between two `CapabilityTargetSets` 2. To statically allocate an array requires the preprocessor to manually annotate which Capability is a target and link that Capability to an index. This means a dictionary is required for lookup regardless of implementation. * `CapabilityTargetSet` is an intermediate representation of all capabilities for a singular `target` atom (`glsl`, `hlsl`, `metal`, ...). This structure contains a dictionary to all stage specific capability sets for fast lookup of stage capabilities supported by a `CapabilitySet` for a `target` atom. This reduces number of sets searched. * `CapabilityStageSet` is an intermediate representation of all capabilities for a singular `stage` atom (`vertex`, `fragment`, ...). This structure holds all disjoint capability sets for a `stage`. A disjoint set is rare, but may exist in some scenarios (as an example): `{glsl, EXT_GL_FOO}{glsl, _GLSL_130, _GLSL_150}`. This reduces the number of sets searched. * `UIntSet` is the main reason for the redesign for better performance and memory usage. All set operations only require a few operations, making all set logic trivial and with minimal cost to run. All algorithms were modified to focus around `UIntSet` operations. 2. Errors * Semantic information are now better linked to the calling function to provide a connection of function<->function_body for when saving semantic information for errors. * Missing targets now print errors much like other error code by finding code which could be a cause of incompatibility. What is missing: 1. Add non naive support for non-stage specific capabilities such as `{hlsl, _sm_5_0}`. Currently non stage specific targets emulate the behavior through assigning such capabilities to every stage: `{hlsl, _sm_5_0, vertex} {hlsl, _sm_5_0, fragment}...`. Removal of this behavior would remove redundant shader stage sets being made at construction time (~80% of new implementation runtime). This is an addition, not an overhaul. 2. Optionally: `UIntSet` should be modified to support SIMD operations for significantly faster operations. This is not required immediately since `UIntSet` is already not a performance constraint. Notes: * UIntSet had implementation bugs which were fixed in this PR. * The old capabilities system had bugs which were fixed in this PR when transforming to the new implementation. * fix .natvis debug view * Small optimizations I found while working on the addition the AST building pass looks like so now: 1% = ~capabilitySet 2% = capabilitySet() 1.5% capabilitySet::unionWith() 0.8% capabilitySet::join() 1.5% auxillary info for debugging ~0.5-1% extra visitor overhead ~5% total for the visitor ~6.5% for total runtime costs * fix caps which were wrong but worked * push minor syntax fix (still looking for why other tests fail) * perf & bug fixes 1. did not properly remake isBetterForTarget for this->empty case with that as Invalid. This is best case in this senario. 2. Remade seralizer for stdlib generation. Faster (more direct) & cleaner code. NOTE: did not address review comments * fix glsl.meta caps error * fixing findBest logic again & UIntSet wrapper findBest was not checking for 'more specialized' targets & was element counter was flawed * faster getElements algorithm + natvis for UIntSet + wrong warning * type incompatability of bitscanForward implementations * try to fix warnings again * remove ptr for clang intrinsic * add missing header * ifdef to allow clang compile * compiler hackery to fix up platform/type independent operations * bracket * fix MSVC error * missing template * change types out again * changes to fix compiling * adjustment to parameter for Clang/GCC * added iterator to delay processing all atomSets of a CapabilitySet * add a few missing consts's * ensure we never have more than 1 disjointSet Added a wrapper + assert + union functionality to all possible disjoint sets. This was done in favor of a removal of the LinkedList for 2 reasons: 1. We still need 0-1 set functionality. 2. Might as well keep the code, just disallow the problematic functionality. * address review comments non linked-list refactor review comments addressed; add doc comments + remove redundant code * comments + remove isValid for bool operator * push removal of linkedlist for capabilities * add missing break * address review comments minor adjustments of syntax * push a fix to the `CapabilitySet({shader, missing target})` code * quality + error 1. add iterator to UIntSet 2. do not specialize target_switch if profile is derived from case (GLSL_150 is not compatable with GLSL_400) * fix target_switch erroring + temporarily remove UIntSet::Interator temporarily remove UIntSet::Interator. It will be added after, testing code on CI first so I can multi-task fixing the UIntSet Iterator * fix the UIntSet iterator * Revert "fix the UIntSet iterator" temporarily to pull from master * add metal error as per texture.slang (took a while I realize this was why things were breaking, likely should adjust errors to reflect this) * Rework UIntSet to have a template for output type This is done so it is reasonable to debug the iterator output and not just dealing with messy int's Fix problems with the iterators implemented + invalid capabilities handling * removed incorrect `__target_switch` capability barycentric was being used with anticipation of `profile glsl450`, this does not expand into `GL_EXT_fragment_shader_barycentric`, this instead caused an error which is hidden during cross-compile. * remove some uses of getElements * remove undeclared_stage for now * remove redundant code associated with `undeclared_stage` * remove unused variable * address review specifically to note removed static in a thread dangerous scope. Now using a `const static` for read only (thread safe) which precompile steps generate * move GLSL_150 capdef change to sm_4_1 (more accurate) * address most review comments did not address: https://github.com/shader-slang/slang/pull/4145#discussion_r1602256776 * revert incorrect code review suggestion * push changes for all code review suggestions
author: ArielG-NV <159081215+ArielG-NV@users.noreply.github.com> 2024-05-16 00:04:12 -0400
committer: GitHub <noreply@github.com> 2024-05-16 00:04:12 -0400
commit: 1b89f78cd1762aa08402bd656e807b66833b11d0 (patch)
tree: 2be71c9d97af8d28d440981d0c5adc726d9eac56 /source/core
parent: 3b0de8b6ea484091146f61e663c63beeac5b4798 (diff)
3 files changed, 230 insertions, 43 deletions
diff --git a/source/core/slang-linked-list.h b/source/core/slang-linked-list.h
index 93b5e435c..840ef8cd6 100644
--- a/source/core/slang-linked-list.h
+++ b/source/core/slang-linked-list.h
@@ -323,7 +323,7 @@ public:
         }
         return rs;
     }
-    int getCount() { return count; }
+    int getCount() const { return count; }
 };
 } // namespace Slang
 #endif
diff --git a/source/core/slang-uint-set.cpp b/source/core/slang-uint-set.cpp
index e973cbc3a..b6871c192 100644
--- a/source/core/slang-uint-set.cpp
+++ b/source/core/slang-uint-set.cpp
@@ -3,18 +3,6 @@
 namespace Slang
 {
 
-static bool _areAllZero(const UIntSet::Element* elems, Index count)
-{
-    for (Index i = 0; count; ++i)
-    {
-        if (elems[i])
-        {
-            return false;
-        }
-    }
-    return true;
-}
-
 UIntSet& UIntSet::operator=(UIntSet&& other)
 {
     m_buffer = _Move(other.m_buffer);
@@ -49,14 +37,8 @@ void UIntSet::setAll()
 
 void UIntSet::resize(UInt size)
 {
-    const Index oldCount = m_buffer.getCount();
     const Index newCount = Index((size + kElementMask) >> kElementShift);
-    m_buffer.setCount(newCount);
-
-    if (newCount > oldCount)
-    {
-        ::memset(m_buffer.getBuffer() + oldCount, 0, (newCount - oldCount) * sizeof(Element));
-    }
+    resizeBackingBufferDirectly(newCount);
 }
 
 void UIntSet::clear()
@@ -66,17 +48,7 @@ void UIntSet::clear()
 
 bool UIntSet::isEmpty() const
 {
-    const Element*const src = m_buffer.getBuffer();
-    const Index count = m_buffer.getCount();
-
-    for (Index i = 0; i < count; ++i)
-    {
-        if (src[i])
-        {
-            return false;
-        }
-    }
-    return true;
+    return _areAllZero(m_buffer.getBuffer(), m_buffer.getCount());
 }
 
 void UIntSet::clearAndDeallocate()
@@ -106,7 +78,7 @@ bool UIntSet::operator==(const UIntSet& set) const
 
     const Index minCount = Math::Min(aCount, bCount);
     
-    return ::memcmp(aElems, bElems, minCount) == 0 &&
+    return ::memcmp(aElems, bElems, minCount*sizeof(Element)) == 0 &&
         _areAllZero(aElems + minCount, aCount - minCount) &&
         _areAllZero(bElems + minCount, bCount - minCount);
 }
@@ -123,6 +95,15 @@ void UIntSet::intersectWith(const UIntSet& set)
     }
 }
 
+void UIntSet::subtractWith(const UIntSet& set)
+{
+    const Index minCount = Math::Min(this->m_buffer.getCount(), set.m_buffer.getCount());
+    for (Index i = 0; i < minCount; i++)
+    {
+        this->m_buffer[i] = this->m_buffer[i] & (~set.m_buffer[i]);
+    }
+}
+
 /* static */void UIntSet::calcUnion(UIntSet& outRs, const UIntSet& set1, const UIntSet& set2)
 {
     outRs.m_buffer.setCount(Math::Max(set1.m_buffer.getCount(), set2.m_buffer.getCount()));
@@ -162,5 +143,24 @@ void UIntSet::intersectWith(const UIntSet& set)
     return false;
 }
 
+Index UIntSet::countElements() const
+{
+    // TODO: This can be made faster using SIMD intrinsics to count set bits.
+    uint64_t tmp;
+    constexpr Index loopSize = ((sizeof(Element) / sizeof(tmp)) != 0) ? sizeof(Element) / sizeof(tmp) : 1;
+    Index count = 0;
+    for (auto index = 0; index < this->m_buffer.getCount(); index++)
+    {
+        for (auto i = 0; i < loopSize; i++)
+        {
+            tmp = m_buffer[index] >> (sizeof(tmp) * i);
+            tmp = tmp - ((tmp >> 1) & 0x5555555555555555);
+            tmp = (tmp & 0x3333333333333333) + ((tmp >> 2) & 0x3333333333333333);
+            count += ((tmp + (tmp >> 4) & 0xF0F0F0F0F0F0F0F) * 0x101010101010101) >> 56;
+        }
+    }
+    return count;
+}
+
 }
 
diff --git a/source/core/slang-uint-set.h b/source/core/slang-uint-set.h
index 0f2165bab..22ca457b0 100644
--- a/source/core/slang-uint-set.h
+++ b/source/core/slang-uint-set.h
@@ -6,31 +6,83 @@
 #include "slang-common.h"
 #include "slang-hash.h"
 
+#if defined(_MSC_VER)
+#include <intrin.h>
+#endif
 #include <memory.h>
 
 namespace Slang
 {
 
+template<typename T>
+constexpr static Index computeElementShift()
+{
+    Index currentShift = 0;
+    Index currentShiftValue = 1;
+
+    while (currentShiftValue != sizeof(T) * 8)
+    {
+        currentShift++;
+        currentShiftValue *= 2;
+    }
+
+    return currentShift;
+}
+
+static inline Index bitscanForward(uint64_t in)
+{
+#if defined(_MSC_VER)
+
+#ifdef _WIN64
+    uint64_t out = 0;
+    _BitScanForward64((unsigned long*)&out, in);
+    return Index(out);
+#else
+    constexpr uint32_t bitsInType = sizeof(uint32_t) * 8;
+    uint32_t out;
+    // check for 0s in 0bit->31bit. If all 0's, check for 0s in 32bit->63bit
+    _BitScanForward((unsigned long*)&out, *(((uint32_t*)&in) + 1));
+    if (out != bitsInType)
+        return Index(out);
+    _BitScanForward((unsigned long*)&out, *(((uint32_t*)&in)));
+    return Index(out + bitsInType);
+#endif// #ifdef _WIN64
+
+#else 
+    return Index(__builtin_ctzll(in));
+#endif// #if defined(_MSC_VER)
+}
+
 /* Hold a set of UInt values. Implementation works by storing as a bit per value */
+/// UIntSet is essentially a Element[], where each Element is `b` bits big.
+/// Each index has `b` number of integers. If the bit is 1, we have an element there. 
+/// Value of each element is equal to the binary offset from Element[0], bit 0.
 class UIntSet
 {
 public:
     typedef UIntSet ThisType;
-    typedef uint32_t Element;                                   ///< Type that holds the bits to say if value is present
+    typedef uint64_t Element;                                   ///< Type that holds the bits to say if value is present
     
+    constexpr static Index kElementSize = sizeof(Element) * 8; ///< The number of bits in an element. This also determines how many values a element can hold.
+    constexpr static Index kElementMask = kElementSize - 1; ///< Mask to get shift from an index
+    constexpr static Index kElementShift = computeElementShift<Element>(); ///< How many bits to shift to get Element index from an index. 5 for 2^5=32 elements in a uint32_t. 6 for 2^6=64 in a uint64_t.
+
     UIntSet() {}
     UIntSet(const UIntSet& other) { m_buffer = other.m_buffer; }
     UIntSet(UIntSet && other) { *this = (_Move(other)); }
     UIntSet(UInt maxVal) { resizeAndClear(maxVal); }
+    UIntSet(List<UIntSet::Element> buffer) { m_buffer = buffer; }
 
     UIntSet& operator=(UIntSet&& other);
     UIntSet& operator=(const UIntSet& other);
 
     HashCode getHashCode() const;
 
-        /// Return the count of all bits directly represented
+    /// Return the count of all bits directly represented
     Int getCount() const { return Int(m_buffer.getCount()) * kElementSize; }
 
+    List<Element>& getBuffer() { return m_buffer; }
+
         /// Resize such that val can be stored and clear contents
     void resizeAndClear(UInt val);
         /// Set all of the values up to count, as set
@@ -38,6 +90,7 @@ public:
         /// Resize (but maintain contents) up to bit size.
         /// NOTE! That since storage is in Element blocks, it may mean some values after size are set (up to the Element boundary)
     void resize(UInt size);
+    void resizeBackingBufferDirectly(Index size);
 
         /// Clear all of the contents (by clearing the bits)
     void clear();
@@ -47,6 +100,8 @@ public:
 
         /// Add a value
     inline void add(UInt val);
+    inline void add(const UIntSet& val);
+
         /// Remove a value
     inline void remove(UInt val);
         /// Returns true if the value is present
@@ -59,10 +114,12 @@ public:
         /// !=
     bool operator!=(const UIntSet& set) const { return !(*this == set); }
 
-        /// Store the union between this and set in this
+        /// Store the union between this and set
     void unionWith(const UIntSet& set);
-        /// Store the intersection between this and set in this
+        /// Store the intersection between this and set
     void intersectWith(const UIntSet& set);
+        /// Store the subtraction between this and set
+    void subtractWith(const UIntSet& set);
 
         /// 
     bool isEmpty() const;
@@ -70,6 +127,10 @@ public:
         /// Swap this with rhs
     void swapWith(ThisType& rhs) { m_buffer.swapWith(rhs.m_buffer); }
 
+    template<typename T>
+    List<T> getElements() const;
+    Index countElements() const;
+
         /// Store the union of set1 and set2 in outRs
     static void calcUnion(UIntSet& outRs, const UIntSet& set1, const UIntSet& set2);
         /// Store the intersection of set1 and set2 in outRs
@@ -80,16 +141,98 @@ public:
         /// Returns true if set1 and set2 have a same value set (ie there is an intersection)
     static bool hasIntersection(const UIntSet& set1, const UIntSet& set2);
 
-private:
-    enum
+    struct Iterator
     {
-        kElementShift = 5,                              ///< How many bits to shift to get Element index from an index
-        kElementSize = sizeof(Element) * 8,             ///< The number of bits in an element
-        kElementMask = kElementSize - 1,                ///< Mask to get shift from an index
+        friend class UIntSet;
+    private:
+        const List<Element>* context;
+        Index block = 0;
+        Element processedElement = 0;
+        uint64_t LSB = 0;
+
+        void clearLSB()
+        {
+            LSB = bitscanForward(processedElement);
+            processedElement &= processedElement - 1;
+        }
+    public:
+        Iterator(const List<Element>* inContext)
+        {
+            context = inContext;
+        }
+
+        Element operator*()
+        {
+            return Element(LSB + (kElementSize * block));
+        }
+
+        Iterator& operator++()
+        {
+            while (processedElement == 0)
+            {
+                block++;
+                if (block >= context->getCount())
+                {
+                    return *this;
+                }
+                processedElement = (*context)[block];
+            }
+            clearLSB();
+            return *this;
+        }
+        Iterator& operator++(int)
+        {
+            return ++(*this);
+        }
+        bool operator==(const Iterator& other) const
+        {
+            return other.block == this->block
+                && other.processedElement == this->processedElement;
+        }
+        bool operator!=(const Iterator& other) const
+        {
+            return !(other == *this);
+        }
     };
+    Iterator begin() const
+    {
+        Iterator tmp(&m_buffer);
+        if (m_buffer.getCount() == 0)
+            return tmp;
+
+        tmp.processedElement = m_buffer[0];
+        if (tmp.processedElement == 0)
+            tmp++;
+
+        tmp.clearLSB();
 
-    // Make sure they are correct for the Element type
-    SLANG_COMPILE_TIME_ASSERT((1 << kElementShift) == kElementSize);
+        return tmp;
+    }
+    Iterator end() const
+    {
+        Iterator tmp(&m_buffer);
+        tmp.block = m_buffer.getCount();
+        tmp.processedElement = 0;
+        return tmp;
+    }
+
+    bool areAllZero()
+    {
+        return _areAllZero(m_buffer.getBuffer(), m_buffer.getCount());
+    }
+
+protected:
+    static bool _areAllZero(const UIntSet::Element* elems, Index count)
+    {
+        for (Index i = 0; i < count; ++i)
+        {
+            if (elems[i])
+            {
+                return false;
+            }
+        }
+        return true;
+    }
 
     List<Element> m_buffer;
 };
@@ -132,6 +275,18 @@ inline bool UIntSet::contains(const UIntSet& set) const
 }
 
 // --------------------------------------------------------------------------
+
+inline void UIntSet::resizeBackingBufferDirectly(Index newCount)
+{
+    const Index oldCount = m_buffer.getCount();
+    m_buffer.setCount(newCount);
+
+    if (newCount > oldCount)
+    {
+        ::memset(m_buffer.getBuffer() + oldCount, 0, (newCount - oldCount) * sizeof(Element));
+    }
+}
+
 inline void UIntSet::add(UInt val)
 {
     const Index idx = Index(val >> kElementShift);
@@ -142,6 +297,38 @@ inline void UIntSet::add(UInt val)
     m_buffer[idx] |= Element(1) << (val & kElementMask);
 }
 
+inline void UIntSet::add(const UIntSet& other)
+{
+    auto otherCount = other.m_buffer.getCount();
+    if (this->m_buffer.getCount() < otherCount)
+        resizeBackingBufferDirectly(otherCount);
+
+    for (auto i = 0; i < otherCount; i++)
+        m_buffer[i] |= other.m_buffer[i];
 }
 
+template<typename T>
+List<T> UIntSet::getElements() const
+{
+    auto count = m_buffer.getCount();
+    if (count == 0)
+        return {};
+
+    // Specific path for uint64_t. If using SIMD we should not use this path due to larger data types.
+
+    List<T> elements;
+    elements.reserve(count);
+    for (Index block = 0; block < count; block++)
+    {
+        Element n = m_buffer[block];
+        while (n != 0)
+        {
+            elements.add(T(bitscanForward((uint64_t)n) + (kElementSize * block)));
+            n &= n - 1;
+        }
+    }
+    return elements;
+}
+
+}
 #endif
author	ArielG-NV <159081215+ArielG-NV@users.noreply.github.com>	2024-05-16 00:04:12 -0400
committer	GitHub <noreply@github.com>	2024-05-16 00:04:12 -0400
commit	1b89f78cd1762aa08402bd656e807b66833b11d0 (patch)
tree	2be71c9d97af8d28d440981d0c5adc726d9eac56 /source/core
parent	3b0de8b6ea484091146f61e663c63beeac5b4798 (diff)