highway debugging example
The highway test suite is a great stress test for gcc’s vectorization and SIMD intrinsics code generators:
- at compile time highway instantiates all the vector extensions your CPU could support
- at runtime it runs the tests on all of the supported extensions against various vector sizes
It covers many corner cases of what could possibly go wrong with vectorized forms of various operations. A few past examples are:
PR110274, PR110880, PR111048, PR111051, PR115115.
While highway tests are quite small, they are somewhat tricky to extract into self-contained examples. In this post I’ll write down a few hacks I usually use to simplify this task.
An example
Today the highway test suite failed for me on today’s gcc-15 build against nixpkgs as:
1146 - HwyReverseTestGroup/HwyReverseTest.TestAllReverseLaneBytes/EMU128 # GetParam() = 2305843009213693952 (Subprocess aborted)
1151 - HwyReverseTestGroup/HwyReverseTest.TestAllReverseBits/EMU128 # GetParam() = 2305843009213693952 (Subprocess aborted)
1186 - HwyShuffle4TestGroup/HwyShuffle4Test.TestAllPer4LaneBlockShuffle/EMU128 # GetParam() = 2305843009213693952 (Subprocess aborted)
TestAllReverseLaneBytes is the test function. EMU128 is an (emulated) CPU target: the code does not use compiler intrinsics and uses loops over scalar operations to emulate SIMD.
Check latest git
highway is actively maintained. In case the failure is caused by a highway bug (and not by a faulty compiler) chances are that it’s already fixed in the latest version. Worth trying it first:
# get the build time depends into the development shell
$ nix develop -f ~/n libhwy
$$ git clone https://github.com/google/highway.git
$$ cd highway
$$ mkdir build
$$ cd build
$$ cmake ..
$$ make -j $(nproc) && make test
...
1146 - HwyReverseTestGroup/HwyReverseTest.TestAllReverseLaneBytes/EMU128 # GetParam() = 2305843009213693952 (Subprocess aborted)
1151 - HwyReverseTestGroup/HwyReverseTest.TestAllReverseBits/EMU128 # GetParam() = 2305843009213693952 (Subprocess aborted)
1186 - HwyShuffle4TestGroup/HwyShuffle4Test.TestAllPer4LaneBlockShuffle/EMU128 # GetParam() = 2305843009213693952 (Subprocess aborted)
The bug was still there!
Enable single (simplest) target
highway uses heavy C++ template code and various iterator macros to compile the library for each supported CPU extension highway knows about. This increases build times and complicates debugging via code tweaking, as the code has to compile for all active targets, not just one.
I disabled all targets except the problematic one. In our case the problematic target is EMU128. Thus the local change to leave EMU128 as the only available option is:
--- a/hwy/detect_targets.h
+++ b/hwy/detect_targets.h
@@ -29,7 +29,7 @@
 // #define HWY_BASELINE_TARGETS (HWY_SSE4 | HWY_SCALAR)

 // Uncomment to override the default blocklist:
-// #define HWY_BROKEN_TARGETS HWY_AVX3
+#define HWY_BROKEN_TARGETS (HWY_AVX2 | HWY_SSE4 | HWY_SSE2 | HWY_SSSE3 | HWY_SSE4 | HWY_AVX3_SPR | HWY_AVX3_ZEN4 | HWY_AVX3)

 // Uncomment to definitely avoid generating those target(s):
 // #define HWY_DISABLED_TARGETS HWY_SSE4
Here I disabled everything the build system reports as supported.
Before the change I had Compiled HWY_TARGETS: AVX3_SPR AVX3_ZEN4 AVX3 AVX2 SSE4 SSSE3 SSE2 in this output:
Config: emu128:0 scalar:0 static:0 all_attain:0 is_test:0
Compiled HWY_TARGETS: AVX3_SPR AVX3_ZEN4 AVX3 AVX2 SSE4 SSSE3 SSE2
HWY_ATTAINABLE_TARGETS: AVX3_SPR AVX3_ZEN4 AVX3 AVX2 SSE4 SSSE3 SSE2 EMU128
HWY_BASELINE_TARGETS: SSE2 EMU128
HWY_STATIC_TARGET: SSE2
HWY_BROKEN_TARGETS: Unknown
HWY_DISABLED_TARGETS:
Current CPU supports: AVX2 SSE4 SSSE3 SSE2 EMU128 SCALAR
After the change I get Compiled HWY_TARGETS: EMU128 in this output:
Config: emu128:0 scalar:0 static:0 all_attain:0 is_test:0
Compiled HWY_TARGETS: EMU128
HWY_ATTAINABLE_TARGETS: EMU128
HWY_BASELINE_TARGETS: SSE2 EMU128
HWY_STATIC_TARGET: EMU128
HWY_BROKEN_TARGETS: AVX3_SPR AVX3_ZEN4 AVX3 AVX2 SSE4 SSSE3 SSE2
HWY_DISABLED_TARGETS:
Current CPU supports: AVX2 SSE4 SSSE3 SSE2 EMU128 SCALAR
Leaving a single compiled target makes the builds a few times faster.
Then I picked the specific binary that implements the failing test. In this case it was tests/reverse_test:
$ make -j$(nproc) && ./tests/reverse_test
...
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from HwyReverseTestGroup/HwyReverseTest
[ RUN ] HwyReverseTestGroup/HwyReverseTest.TestAllReverseLaneBytes/EMU128
u16x4 expect [0+ ->]:
0xD49E,0x049E,0x0137,0x69D3,
u16x4 actual [0+ ->]:
0xFF9E,0xFF9E,0x0137,0xFFD3,
Abort at reverse_test.cc:162: EMU128, u16x4 lane 0 mismatch:
expected '0xD49E', got '0xFF9E'.
Aborted (core dumped)
Here we can see that instead of the expected 0xD49E,0x049E,0x0137,0x69D3 output our library produced 0xFF9E,0xFF9E,0x0137,0xFFD3.
Shrink the test
The rest of the reduction is usually test-specific, but some of the hacks can be applied to many tests. In this case I kept only one failing test:
--- a/hwy/tests/reverse_test.cc
+++ b/hwy/tests/reverse_test.cc
@@ -293,13 +295,7 @@ HWY_AFTER_NAMESPACE();
 namespace hwy {
 HWY_BEFORE_TEST(HwyReverseTest);
-HWY_EXPORT_AND_TEST_P(HwyReverseTest, TestAllReverse);
-HWY_EXPORT_AND_TEST_P(HwyReverseTest, TestAllReverse2);
-HWY_EXPORT_AND_TEST_P(HwyReverseTest, TestAllReverse4);
-HWY_EXPORT_AND_TEST_P(HwyReverseTest, TestAllReverse8);
 HWY_EXPORT_AND_TEST_P(HwyReverseTest, TestAllReverseLaneBytes);
-HWY_EXPORT_AND_TEST_P(HwyReverseTest, TestAllReverseBits);
-HWY_EXPORT_AND_TEST_P(HwyReverseTest, TestAllReverseBlocks);
 HWY_AFTER_TEST();
 }  // namespace hwy
The test was still failing. (That is not always the case: sometimes the mere presence of unrelated code changes the inlining and vectorization decisions.)
Then I shrunk the cases down to 16-bit element sizes by inlining the ForUI163264 definition:
--- a/hwy/tests/reverse_test.cc
+++ b/hwy/tests/reverse_test.cc
@@ -248,7 +249,7 @@ HWY_NOINLINE void TestAllReverse8() {
 }

 HWY_NOINLINE void TestAllReverseLaneBytes() {
-  ForUI163264(ForPartialVectors<TestReverseLaneBytes>());
+  ForPartialVectors<TestReverseLaneBytes>()(uint16_t());
 }
Then I removed all the other unrelated test asserts from the hwy/tests/reverse_test.cc file.
Then I inlined obvious template parameters right into the local test class.
Then I extracted the randomly generated data used in the failing vectors (using printf() statements) and inlined the values into the .cc file. Sometimes I had to sprinkle __attribute__((noipa)) attributes on local functions to inhibit overly eager constant folding.
gdb hints
To explore the generated code in gdb I looked at the hwy::N_<target>:: namespace:
$ gdb tests/reverse_test
(gdb) disassemble hwy::N_EMU128::TestAllReverseLaneBytes
...
<+113>: mov 0x3d190(%rip),%rax # 0x4dc68
<+120>: mov $0xffffd49e,%esi
<+125>: xor %edx,%edx
<+127>: mov $0x8,%edi
<+132>: mov %rax,0x0(%rbp)
<+136>: mov %si,(%r12)
<+141>: movzwl 0x2(%rbp),%eax
<+145>: xor %esi,%esi
<+147>: rol $0x8,%ax
<+151>: mov %ax,0x2(%r12)
<+157>: movzwl 0x4(%rbp),%eax
<+161>: rol $0x8,%ax
<+165>: mov %ax,0x4(%r12)
<+171>: movzwl 0x6(%rbp),%eax
<+175>: rol $0x8,%ax
<+179>: mov %ax,0x6(%r12)
<+185>: mov 0x0(%rbp),%rax
<+189>: movq %rax,%xmm0
<+194>: movdqa %xmm0,%xmm1
<+198>: psllw $0x8,%xmm0
<+203>: psraw $0x8,%xmm1
<+208>: por %xmm0,%xmm1
<+212>: movq %xmm1,0x8(%rsp)
Here I was lucky! I immediately spotted the bug. We see both:
- 16-bit wide move/rotate/move: movzwl / rol / mov (looks correct)
- and 128-bit wide move/rotate/move: movq / psllw / psraw / por / movq
The rotate part is broken here: it should have been a logical psrlw shift, not an arithmetic (sign-preserving) psraw shift.
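To see why the shift flavour matters, here is a minimal C sketch of a single 16-bit lane. It is an illustration only, not the code gcc generates, and the helper names swap_logical and swap_arithmetic are made up for this example. A rotate by 8 built as "shift left OR shift right" only works when the right shift is logical:

```c
#include <stdint.h>

/* Byte swap of one 16-bit lane with a logical right shift
   (the per-lane effect of psllw / psrlw / por). */
static uint16_t swap_logical(uint16_t x) {
    return (uint16_t)((x << 8) | (x >> 8));
}

/* The same swap with an arithmetic right shift
   (the per-lane effect of psllw / psraw / por).
   Sign extension is emulated by hand to stay portable C:
   when the top bit is set, the upper byte fills with ones. */
static uint16_t swap_arithmetic(uint16_t x) {
    uint16_t hi = (uint16_t)(x >> 8);
    if (x & 0x8000u)
        hi |= 0xFF00u;
    return (uint16_t)((uint16_t)(x << 8) | hi);
}
```

For the first input lane 0x9ed4, swap_logical() gives 0xd49e while swap_arithmetic() gives 0xff9e. Those are the same values as lane 0 in the "expect" and "actual" rows of the failing test output. Lanes with a clear top bit come out identical in both variants.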
At this point my test looked this way:
HWY_NOINLINE void TestAllReverseLaneBytes() {
  const CappedTag<uint16_t, 4, 0> d;
  const size_t N = Lanes(d);
  fprintf(stderr, "N = %zu\n", N);
  auto in = AllocateAligned<uint16_t>(N);
  auto expected = AllocateAligned<uint16_t>(N);
  fprintf(stderr, "iter\n");
  in[0] = 0x9ed4u;
  in[1] = 0x049eu;
  in[2] = 0x0137u;
  in[3] = 0x69d3u;
  expected[0] = ReverseBytesOfValue(in[0]);
  expected[1] = ReverseBytesOfValue(in[1]);
  expected[2] = ReverseBytesOfValue(in[2]);
  expected[3] = ReverseBytesOfValue(in[3]);

  const auto v = Load(d, in.get());
  HWY_ASSERT_VEC_EQ(d, expected.get(), ReverseLaneBytes(v));
}
I looked at the ReverseLaneBytes() implementation. It had two parts.
The first part was generic for all targets:
// from hwy/ops/generic_ops-inl.h
template <class V, HWY_IF_T_SIZE_V(V, 2)>
HWY_API V ReverseLaneBytes(V v) {
  const DFromV<V> d;
  const Repartition<uint8_t, decltype(d)> du8;
  return BitCast(d, Reverse2(du8, BitCast(du8, v)));
}
And the second part was EMU128-specific:
// from hwy/ops/emu128-inl.h
template <class D>
HWY_API VFromD<D> Reverse2(D d, VFromD<D> v) {
  VFromD<D> ret;
  for (size_t i = 0; i < MaxLanes(d); i += 2) {
    ret.raw[i + 0] = v.raw[i + 1];
    ret.raw[i + 1] = v.raw[i + 0];
  }
  return ret;
}
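Put together, the two pieces amount to viewing each 16-bit lane as two bytes and swapping neighbouring bytes. A plain C sketch of that idea follows. It assumes a fixed 4-lane vector and uses memcpy in place of BitCast; LANES and reverse_lane_bytes_u16 are names invented for this illustration and are not part of the highway API:

```c
#include <stdint.h>
#include <string.h>

enum { LANES = 4 };  /* assumed vector size: four 16-bit lanes */

static void reverse_lane_bytes_u16(const uint16_t in[LANES],
                                   uint16_t out[LANES]) {
    uint8_t src[LANES * 2];
    uint8_t dst[LANES * 2];

    /* Reinterpret the 16-bit lanes as bytes (the role of BitCast(du8, v)). */
    memcpy(src, in, sizeof src);

    /* Swap each pair of adjacent bytes (the role of Reverse2 on bytes). */
    for (size_t i = 0; i < sizeof src; i += 2) {
        dst[i + 0] = src[i + 1];
        dst[i + 1] = src[i + 0];
    }

    /* Reinterpret the bytes back as 16-bit lanes (the role of BitCast(d, ...)). */
    memcpy(out, dst, sizeof dst);
}
```

Swapping the two bytes of a 16-bit value reverses it regardless of host endianness, so an input lane of 0x9ed4 should come back as 0xd49e.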
I inlined the above definitions into the test and got this:
// $ cat hwy/tests/reverse_test.cc
#include <stddef.h>
#undef HWY_TARGET_INCLUDE
#define HWY_TARGET_INCLUDE "tests/reverse_test.cc"
#include "hwy/foreach_target.h"  // IWYU pragma: keep
#include "hwy/highway.h"
#include "hwy/tests/test_util-inl.h"

HWY_BEFORE_NAMESPACE();
namespace hwy {
namespace HWY_NAMESPACE {

template <class D>
__attribute__((noipa))
static VFromD<D> Reverse2_(D d, VFromD<D> v) {
  VFromD<D> ret;
  for (size_t i = 0; i < MaxLanes(d); i += 2) {
    ret.raw[i + 0] = v.raw[i + 1];
    ret.raw[i + 1] = v.raw[i + 0];
  }
  return ret;
}

HWY_NOINLINE void TestAllReverseLaneBytes() {
  const CappedTag<uint16_t, 4, 0> d;
  const size_t N = Lanes(d); // 4
  auto in = AllocateAligned<uint16_t>(N);
  auto expected = AllocateAligned<uint16_t>(N);
  in[0] = 0x9ed4u;
  in[1] = 0x049eu;
  in[2] = 0x0137u;
  in[3] = 0x69d3u;
  expected[0] = 0xd49eu;
  expected[1] = 0x9e04u;
  expected[2] = 0x3701u;
  expected[3] = 0xd369u;

  const auto v = Load(d, in.get());
  const Repartition<uint8_t, decltype(d)> du8;
  const auto r = BitCast(d, Reverse2_(du8, BitCast(du8, v)));
  HWY_ASSERT_VEC_EQ(d, expected.get(), r);
}

// NOLINTNEXTLINE(google-readability-namespace-comments)
}  // namespace HWY_NAMESPACE
}  // namespace hwy
HWY_AFTER_NAMESPACE();

#if HWY_ONCE
namespace hwy {
HWY_BEFORE_TEST(HwyReverseTest);
HWY_EXPORT_AND_TEST_P(HwyReverseTest, TestAllReverseLaneBytes);
HWY_AFTER_TEST();
}  // namespace hwy
#endif
Make sure it’s an optimizer
Adding #pragma GCC optimize(0) to the beginning of the file makes the bug go away. It’s a good hint that it’s a compiler bug: the test looks obviously correct (not much hidden code is left in the templates).
But the only way to make sure is to finish the reduction down to a self-contained example. We will need it anyway to report upstream.
Final result
After a few manual extra inlines and simplifications I got this self-contained example:
// $ cat bug.c
typedef unsigned char u8;

__attribute__((noipa))
static void fill_src(u8 * src) {
    src[0] = 0x00; src[1] = 0xff;
}

__attribute__((noipa))
static void assert_dst(const u8 * dst) {
    if (dst[0] != 0xff) __builtin_trap();
    if (dst[1] != 0x00) __builtin_trap();
}

int main() {
    u8 src[8] __attribute__((aligned(16))) = { 0 };
    u8 dst[8] __attribute__((aligned(16))) = { 0 };

    // place 0x00 into src[0] and 0xFF into src[1]
    fill_src(src);

    // swap bytes:
    // place 0xFF into dst[0], 0x00 into dst[1]
    for (unsigned long i = 0; i < 8; i += 2) {
        dst[i + 0] = src[i + 1];
        dst[i + 1] = src[i + 0];
    }

    // make sure bytes swapped
    assert_dst(dst);
}
Triggering:
$ gcc bug.c -o a -O1 && ./a
$ gcc bug.c -o a -O2 && ./a
Illegal instruction (core dumped)
Upstream report
I reported the bug as PR115146.
Bisecting gcc pointed me to this “vector shift” change.
This change looks very close to the culprit as the code explicitly picks the “arithmetic” flavour of shift instruction (should be “logical” instead).
By now the original change author already provided a test patch in the report! So quick!
Have fun!