16 byte cmpxchg #524

SchrodingerZhu · 2022-05-11T10:55:18Z

I think there are two points to improve in current cmpxchg support for 16 byte data structures:

consider arm lse
consider using __sync_bool_compare_and_swap to force gcc emit inlined instructions.

#include <atomic>

#if defined(__aarch64__) && defined(__clang__)
#  pragma clang attribute push(__attribute__((target("lse"))),apply_to=function)
#  define PLATFORM_SPECIFIC_OPTIONS_ENDING _Pragma("clang attribute pop")
#elif defined(__aarch64__) && defined(__GNUC__)
#  pragma GCC push_options
#  pragma GCC target("arch=armv8-a+lse")
#  define PLATFORM_SPECIFIC_OPTIONS_ENDING _Pragma("GCC pop_options")
#elif defined(__x86_64__) && defined(__clang__)
#  pragma clang attribute push(__attribute__((target("cx16"))),apply_to=function)
#  define PLATFORM_SPECIFIC_OPTIONS_ENDING _Pragma("clang attribute pop")
#elif defined(__x86_64__) && defined(__GNUC__)
#  pragma GCC push_options
#  pragma GCC target("cx16")
#  define PLATFORM_SPECIFIC_OPTIONS_ENDING _Pragma("GCC pop_options")
#else 
#  define PLATFORM_SPECIFIC_OPTIONS_ENDING
#endif

template<class T>
__attribute__((always_inline)) inline bool cas(std::atomic<T> &src,
                T const& __restrict cmp,
                T const& __restrict with)
{
    auto inline_copy = [](__int128 * dst, const void * __restrict src) {
#if __has_builtin(__builtin_inline_memcpy)
    __builtin_inline_memcpy(dst, src, sizeof(__int128));
#elif __has_builtin(__builtin_memcpy)
    __builtin_memcpy(dst, src, sizeof(__int128));
#else
    ::memcpy(dst, src, sizeof(__int128));
#endif
    };
    __int128 cmp_value;
    __int128 with_value;
    inline_copy(&cmp_value, &cmp);
    inline_copy(&with_value, &with);
    return __sync_bool_compare_and_swap(reinterpret_cast<__int128 *>(&src), cmp_value, with_value);
}

struct A {
    int64_t a, b;
};

bool cas_test(std::atomic<__int128> &src,
                __int128 const& cmp,
                __int128 const& with)
{
    return cas(src, cmp, with);
}

bool cas_test(std::atomic<A> &src,
                A const& cmp,
                A const& with)
{
    return cas(src, cmp, with);
}

PLATFORM_SPECIFIC_OPTIONS_ENDING

bool cas_test2(std::atomic<__int128> &src,
                __int128 & cmp,
                __int128 & with)
{
    return src.compare_exchange_weak(cmp, with);
}

The text was updated successfully, but these errors were encountered:

davidchisnall · 2022-05-11T13:46:20Z

I'm not sure why we'd need anything on Arm. Arm has load-reserved, store-exclusive, so doesn't need a second word for ABA.

The only platforms where we have a problem have the following combination of things:

An architecture with a CAS instruction but not ll/sc, so no ABA-proof primitive (i.e. x86, this is not a problem on any other architecture).
A standard library that use a compiler builtin for 128-bit CAS.
A compiler that doesn't expand that builtin to a single instruction.

This means that you need the combination of x86, libstdc++, and gcc to see the problem. GCC has an open bug report about it and so, presumably, will fix it eventually.

That said, the ABA code isn't used on any hot paths in snmalloc, so the cost of a libcall is unlikely to be noticeable.

This combination of things (bug in a single compiler, doesn't affect hot paths) means that I'm very hesitant to do something ad-hoc as a work around for that compiler.

Looking at the code, we seem not to be doing LL/SC for non-x86 architectures. I thought @nwf added that a while ago?

mjp41 · 2022-05-11T13:53:11Z

I had a look at LL/SC on Arm, but couldn't work out how to do it. The ABA primitive is designed to be able to use it, I just didn't know a good way to actually achieve it without complex inline assembly across functions.

The use of the ABA protection is only for allocating allocators, and potentially in the backend. So falling back to locks for ABA protection is actually fine in most cases.

SchrodingerZhu · 2022-05-11T14:09:35Z

Hi,

I initiated this issue because I found that LLVM and GCC would behave quite differently on whether to inline atomic ops. This may deliver extra linkage with libatomic.so for rust.

For example, on x86, we are requiring the compiled target to have -mcx16 but gcc will still emit outline atomic ops.

SchrodingerZhu · 2022-05-11T14:29:14Z

For DW LL/SC, maybe https://github.com/taiki-e/portable-atomic/blob/main/src/imp/aarch64.rs can be considered as a reference?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

16 byte cmpxchg #524

16 byte cmpxchg #524

SchrodingerZhu commented May 11, 2022

davidchisnall commented May 11, 2022

mjp41 commented May 11, 2022

SchrodingerZhu commented May 11, 2022 •

edited

Loading

SchrodingerZhu commented May 11, 2022

16 byte cmpxchg #524

16 byte cmpxchg #524

Comments

SchrodingerZhu commented May 11, 2022

davidchisnall commented May 11, 2022

mjp41 commented May 11, 2022

SchrodingerZhu commented May 11, 2022 • edited Loading

SchrodingerZhu commented May 11, 2022

SchrodingerZhu commented May 11, 2022 •

edited

Loading