trofi - All posts

Zero Hydra Failures towards 24.11 NixOS release

2024-11-04T00:00:00Z

ZHF (or Zero Hydra Failures) is the time when most build failures are squashed before final NixOS-24.11 release (see full release schedule).

To follow the tradition let’s fix one bug for ZHF.

I picked xorg.libAppleWM build failure. It’s not a very popular package.

The failure looks trivial:

make[2]: Entering directory '/build/libapplewm-be972ebc3a97292e7d2b2350eff55ae12df99a42/src'
  CC       applewm.lo
gcc: error: unrecognized command-line option '-iframeworkwithsysroot'

The build was happening for x86_64-linux target. While this package is MacOS-specific: it uses Darwin APIs and links to it’s libraries directly. No reason to try to build it on x86_64-linux.

The fix is to constrain the package to darwin targets (the default platforms for xorg packages is unix):

--- a/pkgs/servers/x11/xorg/overrides.nix
+++ b/pkgs/servers/x11/xorg/overrides.nix
@@ -171,6 +171,9 @@ self: super:
   libAppleWM = super.libAppleWM.overrideAttrs (attrs: {
     nativeBuildInputs = attrs.nativeBuildInputs ++ [ autoreconfHook ];
     buildInputs =  attrs.buildInputs ++ [ xorg.utilmacros ];
+    meta = attrs.meta // {
+      platforms = lib.platforms.darwin;
+    };
   });

   libXau = super.libXau.overrideAttrs (attrs: {

This fix is now known as PR#353618.

Parting words

I picked very lazy example of a broken package. https://github.com/NixOS/nixpkgs/issues/352882 contains more links and hints on how to find and fix known breakages.

As usual contributing towards ZHF is very easy. Give it a try!

Have fun!

xmms2 0.9.4 is out

2024-10-07T00:00:00Z

Tl;DR: xmms2-0.9.4 is out and you can get it at https://github.com/xmms2/xmms2-devel/releases/tag/0.9.4!

xmms2 is still a music player daemon with various plugins to support stream decoding and transformation. See older announcement on how to get started with xmms2.

Highlights

It’s a small maintenance release. The only notable change is support for ffmpeg-7 as a build dependency.

Have fun!

gcc-15 bugs, pile 1

2024-08-25T00:00:00Z

About 4 months have passed since gcc-14.1.0 release. Around the same time gcc-15 development has started and a few major changes were merged into the master development branch.

summary

This time I waited to collect about 20 bug reports I encountered:

c++/114933: mcfgthread-1.6.1 typecheck failure. Ended up being mcfgthread bug caused by stronger gcc checks.
tree-optimization/114872: sagemath SIGSEGVed due to broken assumptions around setjmp() / longjmp(). Not a gccbug either.
target/115115: highway-1.0.7 test suite expected too specific _mm_cvttps_epi32() semantics. A gcc-12 regression!
target/115146: highway-1.0.7 test suite exposed gcc-15 bug in vectoring bswap16()-like code.
tree-optimization/115227: libepoxy, p11-kit and doxygen can’t fit in RAM of 32-bit gcc due to memory leak in value range propagation subsystem.
target/115397: numpy ICE for -m32: gcc code generator generated a constant pool memory reference and crashed in instruction selection.
c++/115403: highway build failure due to wrong scope handling of #pragma GCC target by gcc.
tree-optimization/115602: liblapack-3.12.0 ICE in slp pass. gcc generated a self-reference cycle after applying code subexpression elimination.
bootstrap/115655: gcc bootstrap failure on -Werror=unused-function.
libstdc++/115797: gcc failed to compile extern "C" { #include } code. was fixed to survive such imports.
middle-end/115863: wrong code on zlib when handling saturated logic. A bug in truncation handling.
rtl-optimization/115916: wrong code on highway. Bad arithmetic shift ubsan-related fix in gcc’s own code.
middle-end/115961: wrong code on llvm, bad bitfield truncation handling for sub-byte bitfield sizes. Saturated truncation arithmetics handling was applied too broadly.
tree-optimization/115991: ICE on linux-6.10. Caused by too broad acceptance of sub-register use in an instruction. ENded up selecting invalid instructions.
rtl-optimization/116037: python3 hangup due to an -fext-dce bug.
rtl-optimization/116200: crash during gcc bootstrap, wrong code on libgcrypt. A bug in RTL constant pool handling.
rtl-optimization/116353: ICE on glibc-2.39. Another RTL bug where gcc instruction selector was presented with invalid value reference.
middle-end/116411: ICE on readline-8.2p13. Conditional operation was incorrectly optimized for some of builtin functions used in branches.
tree-optimization/116412: ICE on openblas-0.3.28. Similar to the above: conditional operation was incorrectly optimized for complex types.

fun bug

The zlib bug is probably the most unusual one. Due to a typo in newly introduced set of optimizations gcc managed to convert a > b ? b : a type of expressions into an equivalent of b > a ? b : a. But it only does it for b = INT_MAX type of arguments (case of saturation). As a result it only broke zlib test suite as it specifically tests for out of range access to cause SIGSEGVs. For well-behaved inputs it never caused any problems. The gcc fix was trivial:

--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -9990,7 +9990,7 @@
   rtx sat = force_reg (DImode, GEN_INT (GET_MODE_MASK (mode)));
   rtx dst;

-  emit_insn (gen_cmpdi_1 (op1, sat));
+  emit_insn (gen_cmpdi_1 (sat, op1));

   if (TARGET_CMOVE)
     {
@@ -10026,7 +10026,7 @@
   rtx sat = force_reg (SImode, GEN_INT (GET_MODE_MASK (mode)));
   rtx dst;

-  emit_insn (gen_cmpsi_1 (op1, sat));
+  emit_insn (gen_cmpsi_1 (sat, op1));

   if (TARGET_CMOVE)
     {
@@ -10062,7 +10062,7 @@
   rtx sat = force_reg (HImode, GEN_INT (GET_MODE_MASK (QImode)));
   rtx dst;

-  emit_insn (gen_cmphi_1 (op1, sat));
+  emit_insn (gen_cmphi_1 (sat, op1));

   if (TARGET_CMOVE)
     {

We swap argument order to restore original intent.

histograms

Where did most gcc bugs come from?

tree-optimization: 4
rtl-optimization: 4
middle-end: 3
target: 3
c++: 1
bootstrap: 1
libstdc++: 1

As usual tree-optimization is at the top of subsystem causing troubles. But this time rtl-optimization got close to it as well. highway managed to yield us 4 new bugs while llvm got us just one new bug.

parting words

gcc-15 got a few very nice optimizations (and bugs) related to saturated truncation, zero/sign-extension elimination, constant folding in RTL. I saw at least 5 bugs related to wrong code generation (and also slowly reducing another one in the background). middl-end ones were easy to reduce and explore, RTL ones were very elusive. The most disruptive change is probably a removal of #include from one of libstdc++ headers. That requires quite a few upstream fixes to add missing headers (cppdap, woff2, graphite, glslang, widelands, wesnoth and many others). Have fun!

gcc-15 template checking improvements

2024-07-22T00:00:00Z

Tl;DR

On 18 Jul gcc merged extended correctness checks for template functions. This will cause some incorrect unused code to fail to compile. Consider fixing or deleting the code. I saw at least two projects affected by it:

more words

c++ is a complex language with a type system that does static checking. Most of the time checking the type correctness is easy by both human and the compiler. But sometimes it’s less trivial. Namespaces and function arguments can bring various declarations into the scope. Template code splits a single definition point into two: template definition point and template instantiation. Let’s look at a simple example:

template  struct S {
    int foo(void) { return bar(); }
};

int bar() { return 42; }

int main() {
    S v;
    return v.foo();
}

This fails to build on all recent gcc as:

$ g++ -c a.cc
a.cc: In member function 'int S::foo()':
a.cc:2:28: error: there are no arguments to 'bar' that depend on a
  template parameter, so a declaration of 'bar' must be available [-fpermissive]
    2 |     int foo(void) { return bar(); }
      |                            ^~~
a.cc:2:28: note: (if you use '-fpermissive', G++ will accept your code,
  but allowing the use of an undeclared name is deprecated)

gcc really wants bar to be visible at the template instantiation time. But what is we don’t call foo at all?

template  struct S {
    int foo(void) { return bar(); }
};

int main() {}

Still fails the same:

$ g++ -c a.cc
a.cc: In member function 'int S::foo()':
a.cc:2:28: error: there are no arguments to 'bar' that depend on a
  template parameter, so a declaration of 'bar' must be available [-fpermissive]
    2 |     int foo(void) { return bar(); }
      |                            ^~~
a.cc:2:28: note: (if you use '-fpermissive', G++ will accept your code,
  but allowing the use of an undeclared name is deprecated)

That is neat: even if you never try to instantiate a function gcc still tries to do basic checks on it. But what if we call foo() via this pointer explicitly?

template  struct S {
    int foo(void) { return this->bar(); }
};

int main() {}

Is it valid c++? gcc-14 says it’s fine:

$ g++-14 -c a.cc

Is there a way to somehow make bar() available via this? Maybe, via inheritance? Apparently, no. gcc-15 now flags the code above as unconditionally invalid:

$ g++-15 -c a.cc
a.cc: In member function 'int S::foo()':
a.cc:2:34: error: 'struct S' has no member named 'bar'
    2 |     int foo(void) { return this->bar(); }
      |                                  ^~~

To get it to work you need something like a CRTP pattern:

// Assume Derived::bar() will be provided.
template  struct S {
    int foo(void) { return static_cast(this)->bar(); }
};

int main() {}

Interestingly the above problem pops up time to time in real projects in template code that was not tried after refactors. One such example is an aspell bug:

  template
  void VectorHashTable::recalc_size() {
    size_ = 0;
    for (iterator i = begin(); i != this->e; ++i, ++this->_size);
  }

gcc-14 built it just fine. gcc-15 started rejecting the build as:

In file included from modules/speller/default/readonly_ws.cpp:51:
modules/speller/default/vector_hash-t.hpp:
  In member function 'void aspeller::VectorHashTable::recalc_size()':
modules/speller/default/vector_hash-t.hpp:186:43:
  error: 'class aspeller::VectorHashTable' has no member named 'e'
  186 |     for (iterator i = begin(); i != this->e; ++i, ++this->_size);
      |                                           ^
modules/speller/default/vector_hash-t.hpp:186:59:
  error: 'class aspeller::VectorHashTable' has no member named '_size'; did you mean 'size'?
  186 |     for (iterator i = begin(); i != this->e; ++i, ++this->_size);
      |                                                           ^~~~~
      |                                                           size

VectorHashTable does not contain _size field, but it does contain size_ (used just a line before). e field is not a thing either. The change is simple:

--- a/modules/speller/default/vector_hash-t.hpp
+++ b/modules/speller/default/vector_hash-t.hpp
@@ -183,7 +183,7 @@ namespace aspeller {
   template
   void VectorHashTable::recalc_size() {
     size_ = 0;
-    for (iterator i = begin(); i != this->e; ++i, ++this->_size);
+    for (iterator i = begin(), e = end(); i != e; ++i, ++size_);
   }

 }

Or you could also delete the function if it was broken like that for a while. Another example is mjpegtools bug:

// The commented-out method prototypes are methods to be implemented by
// subclasses.  Not all methods have to be implemented, depending on
// whether it's appropriate for the subclass, but that may impact how
// widely the subclass may be used.
template 
class Region2D
{
  public:
    // ...

    template 
    void UnionDebug (Status_t &a_reStatus,
        REGION_O &a_rOther, REGION_TEMP &a_rTemp);

    // bool DoesContainPoint (INDEX a_tnY, INDEX a_tnX);

    // ...
}

template 
template 
void
Region2D::UnionDebug (Status_t &a_reStatus, INDEX a_tnY,
    INDEX a_tnXStart, INDEX a_tnXEnd, REGION_TEMP &a_rTemp)
{
    // ...
            if (!((rHere.m_tnY == a_tnY
                && (tnX >= a_tnXStart && tnX < a_tnXEnd))
            || this->DoesContainPoint (rHere.m_tnY, tnX)))
                goto error;
    // ...
}

Here mjpegtools assumes that DoesContainPoint should come from derived type. But modern c++ just does allow it to be defined like that:

In file included from SetRegion2D.hh:12,
                 from MotionSearcher.hh:15,
                 from newdenoise.cc:19:
Region2D.hh: In member function 'void Region2D::UnionDebug(Status_t&, INDEX, INDEX, INDEX, REGION_TEMP&)':
Region2D.hh:439:34: error: 'class Region2D' has no member named 'DoesContainPoint'
  439 |                         || this->DoesContainPoint (rHere.m_tnY, tnX)))
      |                                  ^~~~~~~~~~~~~~~~

The fix just deleted these unusable functions. An alternative fix would need to look closer to a CRTP tweak in our contrived example. But it’s a bit more invasive change.

parting words

gcc-15 will reject more invalid unusable c++ code in uninstantiated templates. The simplest code change might be to just delete broken code. More involved fix would require some knowledge of the code base to fix the declaration lookups (or to fix obvious typos). Have fun!

seekwatcher 0.15

2024-07-07T00:00:00Z seekwatcher-0.15 is here! seekwatcher is a tool to visualise access to the block device. It’s been 2.5 years since seekwatcher-0.14 release. The only change is the switch from mencoder to ffmpeg tool. While at it default codec is switched from MPEG2 to H264. As usual here is the programs’s result ran against btrfs scrub on my device:

$ seekwatcher -t scrub.trace -p 'echo 3 > /proc/sys/vm/drop_caches; sync; btrfs scrub start -B /' -d /dev/nvme1n1p2
$ seekwatcher -t scrub.trace -o scrub.mpeg --movie
$ seekwatcher -t scrub.trace -o scrub.png

Outputs:

image (127K)
video (926K)

H264 makes video size comparable to the image report. Have fun!

blog tweaks

2024-07-06T00:00:00Z Tl;DR: A few changes happened to this blog in the past few weeks:

RSS feed and web pages no longer embed svg images into and include them via . This fixes RSS readers like miniflux but might break others. At least now there should be an icon in place of a missing picture instead of just stripped tags. As a small bonus RSS feed should not be as large to download.
RSS feed now includes source code snippets without syntax highlighting. I never included css style into rss feed. highlighting-kate uses various tags and decorates them with links heavily. This change fixes source code rendering in liferea.
RSS feed now embeds https:// self-links instead of http:// (except for a few recent entries to avoid breaking reading history).

More words: I started this blog in 2010. In 2013 I moved it to hakyll static site generator. The initial version was just 88 lines of haskell code. I did not know much about hakyll back then and I kept it that way for about 10 years: it just worked for me. The only thing I missed were tag-based RSS feeds and article breakdown per tag. It prevented the blog from being added to thematic RSS aggregators like Planet Gentoo. But it was not a big deal. I though I would add it “soon” and never did. The only “non-trivial” tweaks I did were dot support and gunplot support. Fast forward to 2024 few weeks ago I boasted to my friend how cool my new gnuplot embeddings are. To what the response was “What pictures?”. Apparently miniflux does not like tags embedded into and strips them away leaving only bits of

</code> tags that almost
looks like original <code>graphviz</code> input :)
That meant my cool hack with <code>svg</code> embedding did not quite work for
<code>RSS</code> feed. I moved all the embeddings into separate <code>.svg</code> files with
<a href="https://github.com/trofi/trofi.github.io.gen/commit/12812bab87ce4bdff91227527d543ee3ac2161a9">this change</a>.
It’s not a big change, but it does violate some <code>hakyll</code> assumptions.
Apparently <code>hakyll</code> can output only one destination file for a source
file. For example <code>foo.md</code> can only produce <code>foo.html</code> and not <code>foo.html</code>
plus indefinite amount of pictures. There is a
<a href="https://jaspervdj.be/hakyll/tutorials/06-versions.html">version support</a>
in <code>hakyll</code>, but it assumes that we know number of outputs upfront. It’s
not really usable for cases like <code>N</code> unknown outputs from an input. To
work it around I’m writing all the auxiliary files without the <code>hakyll</code>
dependency tracker knowledge. I do it by defining <code>Writable</code> instance:
<pre class="haskell"><code>data PWI = PWI {
    pandoc :: H.Item String
  , inlines :: [(String, H.Item DBL.ByteString)]
} deriving (GG.Generic)

deriving instance DB.Binary PWI

instance H.Writable PWI where
    write path item = do
        -- emit page itself:
        let PWI pand inls = H.itemBody item
        H.write path pand
        -- emit inlines nearby:
        CM.forM_ inls $ \(fp, contents) -> do
            H.makeDirectories fp
            H.write fp contents</code></pre>
Here <code>inlines</code> is the list of pairs of filenames and their contents to
write on disk and <code>pandoc</code> is the primary content one would normally
write as <code>H.Item String</code>.
While at it I disabled syntax highlighting in <code>RSS</code> feed as <code>liferea</code>
rendered highlighted source as an unreadable mess. And <code>miniflux</code> just
stripped out all the links and styles. <a href="https://github.com/trofi/trofi.github.io.gen/commit/1dc9d5a9d6b54db928f3fdaef1c0dcb4b6d567df">The change</a>
is somewhat long, but it’s gist is a single extra <code>writerHighlightStyle</code>
option passed to <code>pandoc</code> render:
<pre class="haskell"><code>pandocRSSWriterOptions :: TPO.WriterOptions
pandocRSSWriterOptions = pandocWriterOptions{
    -- disable highlighting
    TPO.writerHighlightStyle = Nothing
}</code></pre>
The last thing I changed was to switch from <code>http://</code> links to
<code>https://</code> links by default. In theory it’s a
<a href="https://github.com/trofi/trofi.github.io.gen/commit/cfc80bb575c1b131225c43c1fed47ff639540bd9">one-character change</a>.
In practice that would break unread history for all <code>RSS</code> users. I worked
it around by restoring <code>http://</code> root link for current <code>RSS</code> entries
with <a href="https://github.com/trofi/trofi.github.io.gen/commit/6b1883a1b23f6965314bfd2b55cb3e9e6a42ec16">metadata change</a>.
That way all new posts should contain <code>https://</code> root links and all
site-local links should automatically become <code>https://</code> links.
Still no tag support. Maybe later.
Have fun!
</article>
<article>
<h1>probabilities are hard</h1>
2024-06-23T00:00:00Z
<h2 id="make---shuffle-background"><code>make --shuffle</code> background</h2>
<a href="https://trofi.github.io/posts/238-new-make-shuffle-mode.html">A while ago</a> I added <code>--shuffle</code>
mode to <code>GNU make</code> to shake out missing dependencies in build rules of
<code>make</code>-based build systems. It managed to find
<a href="https://trofi.github.io/posts/249-an-update-on-make-shuffle.html">a few bugs</a> since.
<h2 id="the-shuffling-algorithm">the shuffling algorithm</h2>
The core function of <code>--shuffle</code> is to generate one random permutation
of prerequisites for a target. I did not try to implement anything
special. I searched for “random shuffle” and got
<a href="https://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle">Fisher–Yates shuffle</a>
link from <code>wikipedia</code>, skimmed the page and came up with this algorithm:
<pre class="c"><code>/* Shuffle array elements using RAND().  */
static void
random_shuffle_array (void **a, size_t len)
{
  size_t i;
  for (i = 0; i < len; i++)
    {
      void *t;

      /* Pick random element and swap. */
      unsigned int j = rand () % len;
      if (i == j)
        continue;

      /* Swap. */
      t = a[i];
      a[i] = a[j];
      a[j] = t;
    }
}</code></pre>
The diagram of a single step looks this way:

The implementation looked so natural: we attempt to shuffle each element
with another element chosen randomly using equal probability (assuming
<code>rand () % len</code> is unbiased). At least it seemed to produce random
results.
<strong>Quiz question</strong>: do you see the bug in this implementation?
This version was shipped in <code>make-4.4.1</code>.
I ran <code>make</code> from <code>git</code> against <code>nixpkgs</code> and discovered a ton of
parallelism bugs. I could not be happier than that. I never got to
actual testing the quality of permutation probabilities.
<h2 id="bias-in-initial-implementation">bias in initial implementation</h2>
Artem Klimov had a closer look at it and discovered a bug in the
algorithm above! The algorithm has a common implementation error for
Fisher–Yates
<a href="https://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle#Implementation_errors">documented</a>
on the very page I looked at before /o\. Artem demonstrated problems of
permutation quality on the following trivial <code>Makefile</code>:
<pre class="makefile"><code>all: test1 test2 test3 test4 test5 test6 test7 test8;

test%:
	mkdir -p tests
	echo $@ > tests/$@

test8:
	# no mkdir
	echo 'override' > tests/$@</code></pre>
This test was supposed to fail <code>12.5%</code> of the time in <code>--shuffle</code> mode:
only when <code>test8</code> is scheduled as the first to execute. Alas the test
when ran over thousands runs failed with <code>10.1%</code> probability. That is
<code>2%</code> too low.
Artem also provided a fixed version of the shuffle implementation:
<pre class="c"><code>static void
random_shuffle_array (void **a, size_t len)
{
  size_t i;
  for (i = len - 1; i >= 1; i--)
    {
      void *t;

      /* Pick random element and swap. */
      unsigned int j = make_rand () % (i + 1);

      /* Swap. */
      t = a[i];
      a[i] = a[j];
      a[j] = t;
    }
}</code></pre>
The diagram of a single step looks this way:

Note how this version makes sure that shuffled indices (“gray” color)
never gets considered for future shuffle iterations.
At least for me it’s more obvious to see why this algorithm does not
introduce any biases. But then again I did not suspect problems in the
previous one either. I realised I don’t have a good intuition on why the
initial algorithm manages to produce biases. Where does bias come from
if we pick the target element with equal probability from all the
elements available?
<h1 id="a-simple-test">a simple test</h1>
To get the idea how the bias looks like I wrote a tiny program:
<pre class="c"><code>// $ cat a.c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define LEN 3
static int a[LEN];

static void random_shuffle_array (void) {
  for (size_t i = 0; i < LEN; i++) {
      unsigned int j = rand () % LEN;
      int t = a[i]; a[i] = a[j]; a[j] = t;
    }
}

static void random_shuffle_array_fixed (void) {
  for (size_t i = LEN - 1; i >= 1; i--) {
      unsigned int j = rand () % (i + 1);
      int t = a[i]; a[i] = a[j]; a[j] = t;
    }
}

static void do_test(const char * name, void(*shuffler)(void)) {
    size_t hist[LEN][LEN];
    memset(hist, 0, sizeof(hist));

    size_t niters = 10000000;

    printf("%s shuffle probability over %zu iterations:\n", name, niters);
    for (size_t iter = 0; iter < niters; ++iter) {
        // Initialize array `a` with { `0`,  ..., `LEN - 1` }.
        for (size_t i = 0; i < LEN; ++i) a[i] = i;
        shuffler ();
        for (size_t i = 0; i < LEN; ++i) hist[i][a[i]] += 1;
    }

    int prec_digits = 3; /* 0.??? */
    int cell_width = 3 + prec_digits; /* " 0.???" */

    printf("%*s  ", cell_width, "");
    for (size_t j = 0; j < LEN; ++j)
        printf("%*zu", cell_width, j);
    puts("");

    for (size_t i = 0; i < LEN; ++i) {
        printf("%*zu |", cell_width, i);
        for (size_t j = 0; j < LEN; ++j)
            printf(" %.*f", prec_digits, (double)(hist[i][j]) / (double)(niters));
        puts("");
    }
}

int main() {
    srand(time(NULL));
    do_test("broken", &random_shuffle_array);
    puts("");
    do_test("fixed", &random_shuffle_array_fixed);
}</code></pre>
Here the program implement both current (broken) and new (fixed) shuffle
implementations. The histogram is collected over 10 million runs.
Then it prints a probability of each element to be found at a location.
We shuffle an array of <code>LEN = 3</code> elements: <code>{ 0, 1, 2, }</code>.
Here is the output of the program:
<pre><code>$ gcc a.c -o a -O2 -Wall && ./a
broken shuffle probability over 10000000 iterations:
             0     1     2
     0 | 0.333 0.370 0.296
     1 | 0.333 0.297 0.370
     2 | 0.334 0.333 0.333

fixed shuffle probability over 10000000 iterations:
             0     1     2
     0 | 0.333 0.333 0.334
     1 | 0.333 0.334 0.333
     2 | 0.333 0.333 0.333</code></pre>
Here the program tells us that:
<ul>
<li>broken version of the shuffle moves element <code>0</code> to <code>1</code> position <code>37%</code> of the time</li>
<li>broken version moves element <code>0</code> to <code>2</code> position <code>29.6%</code> of the time</li>
<li>fixed version is much closed to uniform distribution and has roughly
<code>33.3%</code> <code>0->1</code> and <code>0->2</code> probabilities</li>
</ul>
The same data above in plots:

<h2 id="a-bit-of-arithmetics">a bit of arithmetics</h2>
To get a bit better understanding of the bias let’s get exact probability
value for each element move for 3-element array.
<h3 id="broken-version">broken version</h3>
To recap the implementation we are looking at here is:
<pre class="c"><code>void random_shuffle_array (void) {
  for (size_t i = 0; i < LEN; i++) {
      unsigned int j = rand () % LEN;
      int t = a[i]; a[i] = a[j]; a[j] = t;
    }
}</code></pre>
Let’s start from broken shuffle with <code>1/(N+1)</code> shuffle probability.
Our initial array state is <code>{ 0, 1, 2, }</code> with probability <code>1/1</code>
(or <code>100%</code>) for each already assigned value:
<ul>
<li>probability at index <code>0</code>:
<ul>
<li>value <code>0</code>: <code>1/1</code></li>
<li>value <code>1</code>: <code>0/1</code></li>
<li>value <code>2</code>: <code>0/1</code></li>
</ul></li>
<li>probability at index <code>1</code>:
<ul>
<li>value <code>0</code>: <code>0/1</code></li>
<li>value <code>1</code>: <code>1/1</code></li>
<li>value <code>2</code>: <code>0/1</code></li>
</ul></li>
<li>probability at index <code>2</code>:
<ul>
<li>value <code>0</code>: <code>0/1</code></li>
<li>value <code>1</code>: <code>0/1</code></li>
<li>value <code>2</code>: <code>1/1</code></li>
</ul></li>
</ul>
On each iteration <code>i</code> we perform the actions below:
<ul>
<li>at <code>i</code> position: <code>1/3</code> probability of swapping any of the possible elements</li>
<li>at non-<code>i</code> positions: <code>2/3</code> probability of keeping and old element (and <code>1/3</code>
probability of absorbing value at <code>i</code> position mentioned in the previous bullet)</li>
</ul>
Thus after first shuffle step at <code>i=0</code> our probability state will be:
<ul>
<li>probability at index <code>0</code>:
<ul>
<li>value <code>0</code>: <code>1/3</code> (was <code>1.0</code>)</li>
<li>value <code>1</code>: <code>1/3</code> (was <code>0.0</code>)</li>
<li>value <code>2</code>: <code>1/3</code> (was <code>0.0</code>)</li>
</ul></li>
<li>probability at index <code>1</code>:
<ul>
<li>value <code>0</code>: <code>1/3</code> (was <code>0.0</code>)</li>
<li>value <code>1</code>: <code>2/3</code> (was <code>1.0</code>)</li>
<li>value <code>2</code>: <code>0/3</code> (was <code>0.0</code>)</li>
</ul></li>
<li>probability at index <code>2</code>:
<ul>
<li>value <code>0</code>: <code>1/3</code> (was <code>0.0</code>)</li>
<li>value <code>1</code>: <code>0/3</code> (was <code>0.0</code>)</li>
<li>value <code>2</code>: <code>2/3</code> (was <code>1.0</code>)</li>
</ul></li>
</ul>
So far so good: element <code>0</code> has even probability among all 3 elements,
and elements <code>1</code> and <code>2</code> decreased their initial probabilities from <code>1/1</code>
down to <code>2/3</code>.
Let’s trace through next <code>i=1</code> step. After that the updated state will be:
<ul>
<li>probability at index <code>0</code>:
<ul>
<li>value <code>0</code>: <code>3/9</code> (was <code>1/3</code>)</li>
<li>value <code>1</code>: <code>4/9</code> (was <code>1/3</code>)</li>
<li>value <code>2</code>: <code>2/9</code> (was <code>1/3</code>)</li>
</ul></li>
<li>probability at index <code>1</code>:
<ul>
<li>value <code>0</code>: <code>3/9</code> (was <code>1/3</code>)</li>
<li>value <code>1</code>: <code>3/9</code> (was <code>2/3</code>)</li>
<li>value <code>2</code>: <code>3/9</code> (was <code>0/3</code>)</li>
</ul></li>
<li>probability at index <code>2</code>:
<ul>
<li>value <code>0</code>: <code>3/9</code> (was <code>1/3</code>)</li>
<li>value <code>1</code>: <code>2/9</code> (was <code>0/3</code>)</li>
<li>value <code>2</code>: <code>4/9</code> (was <code>2/3</code>)</li>
</ul></li>
</ul>
Again, magically current (<code>i=1</code>) element got perfect balance. Zero
probabilities are gone by now.
Final <code>i=2</code> step yields this:
<ul>
<li>probability at index <code>0</code>:
<ul>
<li>value <code>0</code>: <code>9/27</code> (was <code>3/9</code>)</li>
<li>value <code>1</code>: <code>10/27</code> (was <code>4/9</code>)</li>
<li>value <code>2</code>: <code>8/27</code> (was <code>2/9</code>)</li>
</ul></li>
<li>probability at index <code>1</code>:
<ul>
<li>value <code>0</code>: <code>9/27</code> (was <code>3/9</code>)</li>
<li>value <code>1</code>: <code>8/27</code> (was <code>3/9</code>)</li>
<li>value <code>2</code>: <code>10/27</code> (was <code>3/9</code>)</li>
</ul></li>
<li>probability at index <code>2</code>:
<ul>
<li>value <code>0</code>: <code>9/27</code> (was <code>3/9</code>)</li>
<li>value <code>1</code>: <code>9/27</code> (was <code>2/9</code>)</li>
<li>value <code>2</code>: <code>9/27</code> (was <code>4/9</code>)</li>
</ul></li>
</ul>
The same state sequence in diagrams:

Note that final probabilities differ slightly: <code>8/27</code>, <code>9/27</code> and <code>10/27</code>
are probabilities where all should have been <code>9/27</code> (or <code>1/3</code>). This
matches observed values above!
The bias comes from the fact that each shuffle step affects probabilities
of all cells, not just immediately picked cells for a particular shuffle.
That was very hard to grasp for me just by glancing at the algorithm!
<h3 id="fixed-version">Fixed version</h3>
To recap the implementation we are looking at here is:
<pre class="c"><code>void random_shuffle_array_fixed (void) {
  for (size_t i = LEN - 1; i >= 1; i--) {
      unsigned int j = rand () % (i + 1);
      int t = a[i]; a[i] = a[j]; a[j] = t;
    }
}</code></pre>
Now let’s have a look at a shuffle with <code>1/(i+1)</code> probability.
Our initial state is the same <code>{ 0, 1, 2, }</code> with probabilities <code>1/1</code>:
<ul>
<li>probability at index <code>0</code>:
<ul>
<li>value <code>0</code>: <code>1/1</code></li>
<li>value <code>1</code>: <code>0/1</code></li>
<li>value <code>2</code>: <code>0/1</code></li>
</ul></li>
<li>probability at index <code>1</code>:
<ul>
<li>value <code>0</code>: <code>0/1</code></li>
<li>value <code>1</code>: <code>1/1</code></li>
<li>value <code>2</code>: <code>0/1</code></li>
</ul></li>
<li>probability at index <code>2</code>:
<ul>
<li>value <code>0</code>: <code>0/1</code></li>
<li>value <code>1</code>: <code>0/1</code></li>
<li>value <code>2</code>: <code>1/1</code></li>
</ul></li>
</ul>
As the algorithm iterated over the array backwards we start from <code>i=2</code>
(<code>N=3</code>).
<ul>
<li>probability at index <code>0</code>:
<ul>
<li>value <code>0</code>: <code>2/3</code> (was <code>1/1</code>)</li>
<li>value <code>1</code>: <code>0/3</code> (was <code>0/1</code>)</li>
<li>value <code>2</code>: <code>1/3</code> (was <code>0/1</code>)</li>
</ul></li>
<li>probability at index <code>1</code>:
<ul>
<li>value <code>0</code>: <code>0/3</code> (was <code>0/1</code>)</li>
<li>value <code>1</code>: <code>2/3</code> (was <code>1/1</code>)</li>
<li>value <code>2</code>: <code>1/3</code> (was <code>0/1</code>)</li>
</ul></li>
<li>probability at index <code>2</code>:
<ul>
<li>value <code>0</code>: <code>1/3</code> (was <code>0/1</code>)</li>
<li>value <code>1</code>: <code>1/3</code> (was <code>0/1</code>)</li>
<li>value <code>2</code>: <code>1/3</code> (was <code>1/1</code>)</li>
</ul></li>
</ul>
As expected the probabilities are the mirror image of the first step of
the broken implementation.
The next step though is a bit different: <code>i=1</code> (<code>N=2</code>). It effectively
averages probabilities at index <code>0</code> and index <code>1</code>.
<ul>
<li>probability at index <code>0</code>:
<ul>
<li>value <code>0</code>: <code>1/3</code> (was <code>2/3</code>)</li>
<li>value <code>1</code>: <code>1/3</code> (was <code>0/3</code>)</li>
<li>value <code>2</code>: <code>1/3</code> (was <code>1/3</code>)</li>
</ul></li>
<li>probability at index <code>1</code>:
<ul>
<li>value <code>0</code>: <code>1/3</code> (was <code>0/3</code>)</li>
<li>value <code>1</code>: <code>1/3</code> (was <code>2/3</code>)</li>
<li>value <code>2</code>: <code>1/3</code> (was <code>1/3</code>)</li>
</ul></li>
<li>probability at index <code>2</code> (unchanged):
<ul>
<li>value <code>0</code>: <code>1/3</code></li>
<li>value <code>1</code>: <code>1/3</code></li>
<li>value <code>2</code>: <code>1/3</code></li>
</ul></li>
</ul>
Or the same in diagrams:

The series are a lot simpler than the broken version: on each step
handled element always ends up with identical expected probabilities.
Its so much simpler!
<h2 id="element-bonus">30-element bonus</h2>
Let’s look at the probability table for an array of 30-elements. The
only change I did for the program above is to change <code>LEN</code> from <code>3</code> to
<code>30</code>:

This plot shows a curious <code>i == j</code> cutoff line where probability changes
drastically:
<ul>
<li><code>15->15</code> (or any <code>i->i</code>) shuffle probability is lowest and is about <code>2.8%</code></li>
<li><code>15->16</code> (or any <code>i->i+1</code>) shuffle probability is highest and is about <code>4.0%</code></li>
</ul>
<h2 id="make---shuffle-bias-fix"><code>make --shuffle</code> bias fix</h2>
I posted Artem’s fix upstream for inclusion as
<a href="https://mail.gnu.org/archive/html/bug-make/2024-06/msg00008.html">this email</a>:
<pre class="diff"><code>--- a/src/shuffle.c
+++ b/src/shuffle.c
@@ -104,12 +104,16 @@ static void
 random_shuffle_array (void **a, size_t len)
 {
   size_t i;
-  for (i = 0; i < len; i++)
+
+  if (len <= 1)
+    return;
+
+  for (i = len - 1; i >= 1; i--)
     {
       void *t;

       /* Pick random element and swap. */
-      unsigned int j = make_rand () % len;
+      unsigned int j = make_rand () % (i + 1);
       if (i == j)
         continue;
</code></pre>
<h2 id="parting-words">parting words</h2>
Artem Klimov found, fixed and explained the bias in <code>make --shuffle</code>
implementation. Thank you, Artem!
Probabilities are hard! I managed to get wrong seemingly very simple
algorithm. The bias is not too bad: <code>make --shuffle</code> is still able to
produce all possible permutations of the targets. But some of them are
slightly less frequent than the others.
The bias has a curious structure:
<ul>
<li>least likely permutations candidate is <code>i->i</code> “identity” shuffle</li>
<li>most likely permutation candidate is <code>i->i+1</code> “right shift” shuffle</li>
</ul>
At least the initial implementation was not completely broken and still
was able to generate all permutations.
With luck <a href="https://mail.gnu.org/archive/html/bug-make/2024-06/msg00008.html">the fix</a>
will be accepted upstream and we will get more fair <code>--shuffle</code> mode.
Have fun!
</article>
<article>
<h1>inline gnuplot</h1>
2024-06-22T00:00:00Z
Time to time I find myself needing to plot histograms and approximations
in occasional posts.
Similar to <a href="https://trofi.github.io/posts/300-inline-graphviz-dot-in-hakyll.html">inline <code>graphviz</code></a>
support today I added <code>gnuplot</code> <code>svg</code> inlining support into this blog.
The trivial example looks this way:

The above is generated using the following <code>.md</code> snippet:
<pre><code>    ```{render=gnuplot}
    plot [-pi:pi] sin(x)
    ```</code></pre>
<code>hakyll</code> <a href="https://github.com/trofi/trofi.github.io.gen/commit/4fb830628c6923873c0b21b2ac444a73d4d47cee">integration</a>
is also straightforward:
<pre class="haskell"><code>inlineGnuplot :: TP.Block -> Compiler TP.Block
inlineGnuplot cb@(TP.CodeBlock (id, classes, namevals) contents)
  | ("render", "gnuplot") `elem` namevals
  = TP.RawBlock (TP.Format "html") . DT.pack <$> (
      unixFilter "gnuplot"
          [ "--default-settings"
          , "-e", "set terminal svg"
          , "-"]
          (DT.unpack contents))
inlineGnuplot x = return x</code></pre>
Here we call <code>gnuplot --default-settings -e "set terminal svg" -</code> and
pass our script over <code>stdin</code>. Easy!
For those who wonder what <code>gnuplot</code> is capable of have a look at
<a href="http://www.gnuplot.info/demo_svg_4.6/"><code>gnuplot.info</code> demo page</a>.
As a bonus here is the time chart of my commits into <code>nixpkgs</code>:

Have fun!
</article>
<article>
<h1>gcc simd intrinsics bug</h1>
2024-06-16T00:00:00Z
<code>highway</code> keeps yielding very interesting <code>gcc</code> bugs. Some of them are
so obscure that I don’t even understand <code>gcc</code> developers’ comments on
where the bug lies: in <code>highway</code> or on <code>gcc</code>. In this post I’ll explore
<a href="https://gcc.gnu.org/PR115161"><code>PR115161</code></a> report here as an example of
how <code>gcc</code> handles <code>simd</code> intrinsics.
<h2 id="simplest-xmm-intrinsics-example">simplest <code>xmm</code> intrinsics example</h2>
Let’s start from an example based on another closely related bug:
<pre class="c"><code>#include <emmintrin.h>
#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void) {
    const __m128i  iv = _mm_set1_epi32(0x4f000000); // 1
    const __m128   fv = _mm_castsi128_ps(iv);       // 2
    const __m128i riv = _mm_cvttps_epi32(fv);       // 3

    uint32_t r[4];
    memcpy(r, &riv, sizeof(r));
    printf("%#08x %#08x %#08x %#08x\n", r[0], r[1], r[2], r[3]);
}</code></pre>
The above example implements a vectored form of <code>(int)2147483648.0</code>
conversion using following steps:
<ol type="1">
<li>Place 4 identical 32-bit integer <code>0x4f000000</code> values into 128-bit
<code>iv</code> variable (likely an <code>xmm</code> register).</li>
<li>Bit cast <code>4 x 0x4f00000</code> into <code>4 x 2147483648.0</code> of 32-bit <code>float</code>s.</li>
<li>Convert <code>4 x 2147483648.0</code> 32-bit <code>float</code>s into <code>4 x int32_t</code> by
truncating the fractional part and leaving the integer one.</li>
<li>Print the conversion result in hexadecimal form.</li>
</ol>
Or the same in pictures:

Note: <code>2147483648.0</code> is exactly 2<sup>31</sup>. Maximum <code>int32_t</code> can hold is
2<sup>31</sup>-1, or <code>2147483647</code> (one less than our value at hand).
<strong>Quick quiz: What should this example return? Does it depend on the
compiler options?</strong>
In theory those <code>_mm*()</code> compiler intrinsics are tiny wrappers over
corresponding <code>x86_64</code> instructions.
<a href="https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html">Intel guide</a>
says that <code>_mm_cvttps_epi32()</code> is a <code>cvttps2dq</code> instruction.
Running the example:
<pre><code>$ gcc -Wall a.c -o a0 -O0 && ./a0
0x80000000 0x80000000 0x80000000 0x80000000

$ gcc -Wall a.c -o a1 -O1 && ./a1
0x7fffffff 0x7fffffff 0x7fffffff 0x7fffffff</code></pre>
Optimization levels do change the behaviour of the code when
overflow happens: sometimes the result is 2<sup>31</sup> and sometimes it’s
2<sup>31</sup>-1. Uh-oh. Let’s have a peek at the assembly of both cases.
<code>-O0</code> case:
<pre class="asm"><code>; $ rizin ./a0
; [0x00401050]> aaaa
; [0x00401050]> s main
; [0x00401136]> pdf
            ; DATA XREF from entry0 @ 0x401068
; int main(int argc, char **argv, char **envp);
; ...
          movl  $0x4f000000, var_8ch
          movl  var_8ch, %eax
; ...
          movl  %eax, var_80h
          movd  var_80h, %xmm1
          punpckldq %xmm1, %xmm0
; ...
          movaps %xmm0, var_48h
          cvttps2dq var_48h, %xmm0
          movaps %xmm0, var_78h
          movq  var_78h, %rax
          movq  var_70h, %rdx
          movq  %rax, var_28h
          movq  %rdx, var_20h
          movl  var_1ch, %esi
          movl  var_20h, %ecx
          movl  var_24h, %edx
          movl  var_28h, %eax
          leaq  str.08x___08x___08x___08x, %rdi      ; 0x402004 ; "%#08x %#08x %#08x %#08x\n" ; const char *format
          movl  %esi, %r8d
          movl  %eax, %esi
          movl  $0, %eax
          callq sym.imp.printf                       ; sym.imp.printf ; int printf(const char *format)
; ...</code></pre>
While it’s a lot of superfluous code we do see there <code>cvttps2dq</code>
instruction and <code>printf()</code> call against it’s result.
<code>-O1</code> case:
<pre class="asm"><code>$ rizin ./a1
; [0x00401040]> aaaa
; [0x00401040]> s main
; [0x00401126]> pdf
            ; DATA XREF from entry0 @ 0x401058
; int main(int argc, char **argv, char **envp);
          subq  $8, %rsp
          movl  $0x7fffffff, %r9d
          movl  $0x7fffffff, %r8d
          movl  $0x7fffffff, %ecx
          movl  $0x7fffffff, %edx
          leaq  str.08x___08x___08x___08x, %rsi      ; 0x402004 ; "%#08x %#08x %#08x %#08x\n"
          movl  $2, %edi
          movl  $0, %eax
          callq sym.imp.__printf_chk                 ; sym.imp.__printf_chk
          movl  $0, %eax
          addq  $8, %rsp
          retq</code></pre>
Here we don’t see <code>cvttps2dq</code> at all! <code>gcc</code> just puts <code>0x7fffffff</code>
constants into registers and calls <code>printf()</code> directly.
For completeness let’s try to find out the exact optimization pass that
performs this constant folding. Normally I would expect it to be a tree
optimization, and thus <code>-fdump-tree-all</code> would tell me where the magic
happens. Alas:
<pre class="c"><code>// $ gcc a.c -o a -O2 -fdump-tree-all && ./a
// $ cat a.c.265t.optimized

;; Function main (main, funcdef_no=574, decl_uid=6511, cgraph_uid=575, symbol_order=574) (executed once)

int main ()
{
  unsigned int _2;
  vector(4) int _3;
  unsigned int _4;
  unsigned int _5;
  unsigned int _6;

  <bb 2> [local count: 1073741824]:
  _3 = __builtin_ia32_cvttps2dq ({ 2.147483648e+9, 2.147483648e+9, 2.147483648e+9, 2.147483648e+9 });
  _2 = BIT_FIELD_REF <_3, 32, 96>;
  _6 = BIT_FIELD_REF <_3, 32, 64>;
  _4 = BIT_FIELD_REF <_3, 32, 32>;
  _5 = BIT_FIELD_REF <_3, 32, 0>;
  __printf_chk (2, "%#08x %#08x %#08x %#08x\n", _5, _4, _6, _2);
  return 0;

}</code></pre>
Here we see that <code>_mm_set1_epi32()</code> and <code>_mm_castsi128_ps()</code> were
“folded” into a <code>2.147483648e+9</code> successfully, but <code>_mm_cvttps_epi32()</code>
was not. And yet the final assembly does not contain the call. Let’s
have a loot at the <code>RTL</code> passes that usually follow <code>tree</code> ones as part
of the optimization:
<pre><code>$ gcc a.c -o a -O2 -fdump-rtl-all-slim && ./a
$ ls -1 *r.*
a.c.266r.expand
a.c.267r.vregs
a.c.268r.into_cfglayout
a.c.269r.jump
a.c.270r.subreg1
a.c.271r.dfinit
a.c.272r.cse1
a.c.273r.fwprop1
a.c.274r.cprop1
a.c.275r.pre
a.c.277r.cprop2
a.c.280r.ce1
a.c.281r.reginfo
a.c.282r.loop2
a.c.283r.loop2_init
a.c.284r.loop2_invariant
a.c.285r.loop2_unroll
a.c.287r.loop2_done
a.c.290r.cprop3
a.c.291r.stv1
a.c.292r.cse2
a.c.293r.dse1
a.c.294r.fwprop2
a.c.296r.init-regs
a.c.297r.ud_dce
a.c.298r.combine
a.c.300r.stv2
a.c.301r.ce2
a.c.302r.jump_after_combine
a.c.303r.bbpart
a.c.304r.outof_cfglayout
a.c.305r.split1
a.c.306r.subreg3
a.c.308r.mode_sw
a.c.309r.asmcons
a.c.314r.ira
a.c.315r.reload
a.c.316r.postreload
a.c.319r.split2
a.c.320r.ree
a.c.321r.cmpelim
a.c.322r.pro_and_epilogue
a.c.323r.dse2
a.c.324r.csa
a.c.325r.jump2
a.c.326r.compgotos
a.c.328r.peephole2
a.c.329r.ce3
a.c.331r.fold_mem_offsets
a.c.332r.cprop_hardreg
a.c.333r.rtl_dce
a.c.334r.bbro
a.c.335r.split3
a.c.336r.sched2
a.c.338r.stack
a.c.340r.zero_call_used_regs
a.c.341r.alignments
a.c.343r.mach
a.c.344r.barriers
a.c.349r.shorten
a.c.350r.nothrow
a.c.351r.dwarf2
a.c.352r.final
a.c.353r.dfinish</code></pre>
It’s a long list of passes! Let’s have a look at the first <code>266r.expand</code>:
<pre><code>$ cat a.c.266r.expand
;;
;; Full RTL generated for this function:
;;
    1: NOTE_INSN_DELETED
    3: NOTE_INSN_BASIC_BLOCK 2
    2: NOTE_INSN_FUNCTION_BEG
    5: r106:V4SF=vec_duplicate([`*.LC1'])
    6: r105:V4SF=r106:V4SF
      REG_EQUAL const_vector
    7: r104:V4SI=fix(r105:V4SF)

    8: r99:V4SI=r104:V4SI
    9: r108:V4SI=vec_select(r99:V4SI,parallel)
   10: r107:SI=vec_select(r108:V4SI,parallel)
   11: r110:V4SI=vec_select(vec_concat(r99:V4SI,r99:V4SI),parallel)
   12: r109:SI=vec_select(r110:V4SI,parallel)
   13: r112:V4SI=vec_select(r99:V4SI,parallel)
   14: r111:SI=vec_select(r112:V4SI,parallel)
   15: r113:SI=vec_select(r99:V4SI,parallel)
   16: r114:DI=`*.LC2'
   17: r9:SI=r107:SI
   18: r8:SI=r109:SI
   19: cx:SI=r111:SI
   20: dx:SI=r113:SI
   21: si:DI=r114:DI
   22: di:SI=0x2
   23: ax:QI=0
   24: ax:SI=call [`__printf_chk'] argc:0
      REG_CALL_DECL `__printf_chk'
   25: r103:SI=0
   29: ax:SI=r103:SI
   30: use ax:SI</code></pre>
Here <code>V4SF</code> means the vector type of 4 floats, <code>V4SI</code> is a vector type
of 4 <code>int</code>s, <code>SI</code> is an <code>int</code> type, <code>DI</code> is a <code>long</code> type. It looks like
our <code>float->int32_t</code> conversion happens in two early <code>RTL</code> instructions:
<pre><code>    5: r106:V4SF=vec_duplicate([`*.LC1'])
    6: r105:V4SF=r106:V4SF
      REG_EQUAL const_vector
    7: r104:V4SI=fix(r105:V4SF)</code></pre>
The rest of <code>RTL</code> code is extraction of that result as <code>printf()</code>
arguments. It’s a lot of superfluous data moves. Later optimizations
should clean it up and assign “hardware” registers like <code>r9</code> to virtual
registers like <code>r108</code>. For completeness final <code>353r.dfinish</code> looks this
way:
<pre><code>$ cat a.c.353r.dfinish

;; Function main (main, funcdef_no=574, decl_uid=6511, cgraph_uid=575, symbol_order=574) (executed once)

    1: NOTE_INSN_DELETED
    3: NOTE_INSN_BASIC_BLOCK 2
    2: NOTE_INSN_FUNCTION_BEG
   34: {sp:DI=sp:DI-0x8;clobber flags:CC;clobber [scratch];}
      REG_UNUSED flags:CC
      REG_CFA_ADJUST_CFA sp:DI=sp:DI-0x8
   35: NOTE_INSN_PROLOGUE_END
   19: cx:SI=0x7fffffff
   20: dx:SI=0x7fffffff
   44: {ax:DI=0;clobber flags:CC;}
      REG_UNUSED flags:CC
   17: r9:SI=0x7fffffff
   18: r8:SI=0x7fffffff
   22: di:SI=0x2
   32: si:DI=`*.LC2'
      REG_EQUIV `*.LC2'
   24: ax:SI=call [`__printf_chk'] argc:0
      REG_DEAD r9:SI
      REG_DEAD r8:SI
      REG_DEAD di:SI
      REG_DEAD si:DI
      REG_DEAD cx:SI
      REG_DEAD dx:SI
      REG_UNUSED ax:SI
      REG_CALL_DECL `__printf_chk'
   45: {ax:DI=0;clobber flags:CC;}
      REG_UNUSED flags:CC
   46: NOTE_INSN_EPILOGUE_BEG
   37: {sp:DI=sp:DI+0x8;clobber flags:CC;clobber [scratch];}
      REG_UNUSED flags:CC
      REG_CFA_ADJUST_CFA sp:DI=sp:DI+0x8
   30: use ax:SI
   38: simple_return
   41: barrier
   33: NOTE_INSN_DELETED</code></pre>
Here we don’t have <code>fix()</code> calls any more. <code>printf()</code> call already
contains immediate <code>r8:SI=0x7fffffff</code> constants. All registers are
resolved to real register names. Searching for <code>fix()</code> in all the pass
files I found that <code>272r.cse1</code> was the last pass that mentioned it.
<code>a.c.273r.fwprop1</code> already has the constants inlined. Looking at
<code>272r.cse1</code> in <code>-fdump-rtl-all-all</code> we can see that details are inferred
by <code>cse1</code> about the <code>fix()</code> <code>RTL</code> instruction:
<pre><code>(insn 7 6 8 2 (set (reg:V4SI 104)
        (fix:V4SI (reg:V4SF 106))) "...-gcc-15.0.0/lib/gcc/x86_64-unknown-linux-gnu/15.0.0/include/emmintrin.h":863:19 4254 {
fix_truncv4sfv4si2}
     (expr_list:REG_EQUAL (const_vector:V4SI [
                (const_int 2147483647 [0x7fffffff]) repeated x4
            ])
        (expr_list:REG_DEAD (reg:V4SF 105)
            (nil))))</code></pre>
<code>fix_truncv4sfv4si2()</code> is the name of function that implements conversion
from <code>fix()</code> call down to the lower level instructions. And it looks
like <code>fix()</code> expansion also derived that the finals result is a constant:
<code>(expr_list:REG_EQUAL (const_vector:V4SI [ (const_int 2147483647 [0x7fffffff]) repeated x4])</code>.
Next <code>fwprop1</code> pass will use that constant value everywhere where <code>r104</code>
is used.
<a href="https://gcc.gnu.org/onlinedocs/gccint/Standard-Names.html"><code>gcc</code> internals</a>
documentation says that <code>fix_trunc</code> is a <code>float-to-int</code> conversion. Note
that this conversion does not look specific to our intrinsic. Any
code that casts floats would use the same helper. That explains why
<code>_mm_cvttps_epi32()</code> semantics around the overflow are not honoured and
generic floating conversion code it performed by <code>gcc</code> as if it was
written as <code>(int)(2147483648.0f)</code>. Apparently both <code>0x7fffffff</code> and
<code>0x80000000</code> values are correct under that assumption.
The problem is that <code>_mm_cvttps_epi32()</code> is more specific than any valid
<code>float->int</code> conversion. <code>intel</code> manual specifically says that at
<a href="https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html"><code>CVTTPS2DQ</code> description</a>
in “Intel® 64 and IA-32 Architectures Software Developer’s Manual Combined
Volumes: 1, 2A, 2B, 2C, 2D, 3A, 3B, 3C, 3D, and 4”:
<pre><code>Description
...
When a conversion is inexact, a truncated (round toward zero) value is
returned. If a converted result is larger than the maximum signed
doubleword integer, the floating-point invalid exception is raised, and
if this exception is masked, the indefinite integer value (80000000H) is
returned.</code></pre>
Thus <code>0x80000000</code> would be a correct value here and not <code>0x7fffffff</code>.
<h2 id="avoiding-the-_mm_cvttps_epi32-non-determinism">avoiding the <code>_mm_cvttps_epi32()</code> non-determinism</h2>
OK, <code>gcc</code> decided to treat it as problematic when handling overflow
condition. That should be easy to workaround by checking first if our
value is in range first, right? Say, something like the following
pseudocode:
<pre class="c"><code>float v = 2147483648.0f;
int32_t result;
if (v >= 2147483648.0f) {
    result = 0x7fffffff;
} else {
    result = fix(v);
}</code></pre>
In a vectored code writing branching code is problematic, thus one needs
to be creative and use masking. That is what <code>highway</code> did in
<a href="https://github.com/google/highway/commit/9dc6e1ecb0748df78398b037d6a8a89e667702e7"><code>avoid GCC "UB" in truncating cases</code></a>
commit. It’s a lot of code, but it’s idea is to mask away values
calculated against overflows:
<pre class="diff"><code>@@ -10884,7 +10869,11 @@ HWY_API VFromD<D> ConvertInRangeTo(D /*di*/, VFromD<RebindToFloat<D>> v) {
 // F32 to I32 ConvertTo is generic for all vector lengths
 template <class D, HWY_IF_I32_D(D)>
 HWY_API VFromD<D> ConvertTo(D di, VFromD<RebindToFloat<D>> v) {
-  return detail::FixConversionOverflow(di, v, ConvertInRangeTo(di, v));
+  const RebindToFloat<decltype(di)> df;
+  // See comment at the first occurrence of "IfThenElse(overflow,".
+  const MFromD<D> overflow = RebindMask(di, Ge(v, Set(df, 2147483648.0f)));
+  return IfThenElse(overflow, Set(di, LimitsMax<int32_t>()),
+                    ConvertInRangeTo(di, v));
 }</code></pre>
If we amend our original example with this tweak we will get the
following equivalent code:
<pre class="c"><code>// $ cat bug.cc
#include <stdint.h>
#include <string.h>
#include <emmintrin.h>

__attribute__((noipa))
static void assert_eq_p(void * l, void * r) {
    char lb[16];
    char rb[16];

    __builtin_memcpy(lb, l, 16);
    __builtin_memcpy(rb, r, 16);

    if (__builtin_memcmp(lb, rb, 16) != 0) __builtin_trap();
}

#if 0
#include <stdio.h>
__attribute__((noipa))
static void d_i(const char * prefix, __m128i p) {
    uint64_t v[2];
    memcpy(v, &p, 16);

    fprintf(stderr, "%10s(i): %#016lx %#016lx\n", prefix, v[0], v[1]);
}
#endif

__attribute__((noipa))
static void assert_eq(__m128i l, __m128i r) { assert_eq_p(&l, &r); }

int main() {
  const __m128i su = _mm_set1_epi32(0x4f000000);
  const __m128  sf = _mm_castsi128_ps(su);

  const __m128  overflow_mask_f32 = _mm_cmpge_ps(sf, _mm_set1_ps(2147483648.0f));
  const __m128i overflow_mask = _mm_castps_si128(overflow_mask_f32);

  const __m128i conv = _mm_cvttps_epi32(sf);
  const __m128i yes = _mm_set1_epi32(INT32_MAX);

  const __m128i a = _mm_and_si128(overflow_mask, yes);
  const __m128i na = _mm_andnot_si128(overflow_mask, conv);

  const __m128i conv_masked = _mm_or_si128(a, na);

  const __m128i actual = _mm_cmpeq_epi32(conv_masked, _mm_set1_epi32(INT32_MAX));
  const __m128i expected = _mm_set1_epi32(-1);

  assert_eq(expected, actual);
}</code></pre>
Here <code>_mm_and_si128()</code> and <code>_mm_andnot_si128()</code> are used to mask away
converted values larger than <code>2147483648.0f</code>.
If we look at the diagram it looks this way (I collapsed vector values
into <code>... x4</code> form as all of the values should be identical):

Here <code>conv -> na</code> green arrow shows where we throw away all the indefinite
values. They all get substituted for <code>yes = 0x7FFFffff x4</code> value.
Thus the program should finally be deterministic, right? Let’s check:
<pre><code>$ gcc bug.cc -O0 -o a && ./a

$ gcc bug.cc -O2 -o a && ./a
Illegal instruction (core dumped)</code></pre>
It does not. Only <code>-O0</code> case works (just like before). Looking at the
assembly again, just <code>-O2</code> this time:
<pre class="asm"><code>; $ rizin ./a
; [0x004010a0]> aaaa
; [0x004010a0]> s main
; [0x00401040]> pdf
            ; DATA XREF from entry0 @ 0x4010a8
            ;-- section..text:
/ int main(int argc, char **argv, char **envp);
|           ; arg uint64_t arg7 @ xmm0
|                 subq  $8, %rsp                             ; [13] -r-x section size 483 named .text
|                 movss data.00402004, %xmm1                 ; [0x402004:4]=0x4f000000
|                 movss data.00402008, %xmm3                 ; [0x402008:4]=0x7fffffff
|                 shufps $0, %xmm1, %xmm1
|                 movaps %xmm1, %xmm2
|                 cvttps2dq %xmm1, %xmm0
|                 shufps $0, %xmm3, %xmm3
|                 cmpleps %xmm1, %xmm2
|                 movdqa %xmm2, %xmm1
|                 andps %xmm3, %xmm2
|                 pandn %xmm0, %xmm1
|                 por   %xmm2, %xmm1
|                 pcmpeqd %xmm0, %xmm1                       ; arg7
|                 pcmpeqd %xmm0, %xmm0                       ; arg7
|                 callq sym.assert_eq_int64_t___vector_2___int64_t___vector_2 ; sym.assert_eq_int64_t___vector_2___int64_t___vector_2
|                 xorl  %eax, %eax
|                 addq  $8, %rsp
\                 retq</code></pre>
At the first glance <code>cvttps2dq</code> instruction is present, thus <code>gcc</code> was
not able to completely constant fold it away. Thus it’s not immediately
obvious why it’s incorrect. Let’s have a look at the control flow
diagram reconstructed from the assembly:

In practice <code>pcmpeqd %xmm0, %xmm1</code> instruction that was supposed to
implement <code>_mm_cmpeq_epi32(conv_masked, _mm_set1_epi32(INT32_MAX))</code> gets
<code>INT32_MAX</code> not as a constant (say, from <code>%xmm3</code>), but as a <code>%xmm0</code>
register assuming it already has the expected value. Red line shows
where the assumption is introduced and brown dotted line shows what it
is removing.
The optimizer was not able to constant-fold all the arithmetic operations,
but it was able to fold just enough to introduce the discrepancy between
assumed and actual value of <code>cvttps2dq</code>.
To remove this overly specific assumption <code>gcc-15</code> updated <code>fix()</code> code
not to assume a particular value on overflows using
<a href="https://gcc.gnu.org/git/?p=gcc.git;a=commitdiff;h=b05288d1f1e4b632eddf8830b4369d4659f6c2ff">this patch</a>:
<pre class="diff"><code>--- a/gcc/fold-const.cc
+++ b/gcc/fold-const.cc
@@ -2246,7 +2246,18 @@ fold_convert_const_int_from_real (enum tree_code code, tree type, const_tree arg
   if (! overflow)
     val = real_to_integer (&r, &overflow, TYPE_PRECISION (type));

-  t = force_fit_type (type, val, -1, overflow | TREE_OVERFLOW (arg1));
+  /* According to IEEE standard, for conversions from floating point to
+     integer. When a NaN or infinite operand cannot be represented in the
+     destination format and this cannot otherwise be indicated, the invalid
+     operation exception shall be signaled. When a numeric operand would
+     convert to an integer outside the range of the destination format, the
+     invalid operation exception shall be signaled if this situation cannot
+     otherwise be indicated.  */
+  if (!flag_trapping_math || !overflow)
+    t = force_fit_type (type, val, -1, overflow | TREE_OVERFLOW (arg1));
+  else
+    t = NULL_TREE;
+
   return t;
 }

diff --git a/gcc/simplify-rtx.cc b/gcc/simplify-rtx.cc
index 5caf1dfd957f..f6b4d73b593c 100644
--- a/gcc/simplify-rtx.cc
+++ b/gcc/simplify-rtx.cc
@@ -2256,14 +2256,25 @@ simplify_const_unary_operation (enum rtx_code code, machine_mode mode,
       switch (code)
 	{
 	case FIX:
+	  /* According to IEEE standard, for conversions from floating point to
+	     integer. When a NaN or infinite operand cannot be represented in
+	     the destination format and this cannot otherwise be indicated, the
+	     invalid operation exception shall be signaled. When a numeric
+	     operand would convert to an integer outside the range of the
+	     destination format, the invalid operation exception shall be
+	     signaled if this situation cannot otherwise be indicated.  */
 	  if (REAL_VALUE_ISNAN (*x))
-	    return const0_rtx;
+	    return flag_trapping_math ? NULL_RTX : const0_rtx;
+
+	  if (REAL_VALUE_ISINF (*x) && flag_trapping_math)
+	    return NULL_RTX;

 	  /* Test against the signed upper bound.  */
 	  wmax = wi::max_value (width, SIGNED);
 	  real_from_integer (&t, VOIDmode, wmax, SIGNED);
 	  if (real_less (&t, x))
-	    return immed_wide_int_const (wmax, mode);
+	    return (flag_trapping_math
+		    ? NULL_RTX : immed_wide_int_const (wmax, mode));

 	  /* Test against the signed lower bound.  */
 	  wmin = wi::min_value (width, SIGNED);
@@ -2276,13 +2287,17 @@ simplify_const_unary_operation (enum rtx_code code, machine_mode mode,

 	case UNSIGNED_FIX:
 	  if (REAL_VALUE_ISNAN (*x) || REAL_VALUE_NEGATIVE (*x))
-	    return const0_rtx;
+	    return flag_trapping_math ? NULL_RTX : const0_rtx;
+
+	  if (REAL_VALUE_ISINF (*x) && flag_trapping_math)
+	    return NULL_RTX;

 	  /* Test against the unsigned upper bound.  */
 	  wmax = wi::max_value (width, UNSIGNED);
 	  real_from_integer (&t, VOIDmode, wmax, UNSIGNED);
 	  if (real_less (&t, x))
-	    return immed_wide_int_const (wmax, mode);
+	    return (flag_trapping_math
+		    ? NULL_RTX : immed_wide_int_const (wmax, mode));

 	  return immed_wide_int_const (real_to_integer (x, &fail, width),
 				       mode);</code></pre>
It fixes both tree optimizations if <code>RTL</code> optimizations not to assume a
specific value on known overflows.
After the fix <code>gcc</code> generates something that passes the test at hand:
<pre><code>$ g++ bug.cc -o bug -O2 && ./bug</code></pre>
And the <code>highway</code> test suite.
For completeness the generated code now looks like this:
<pre class="asm"><code>; $ rizin ./a
; [0x004010a0]> aaaa
; [0x004010a0]> s main
; [0x00401040]> pdf
            ; DATA XREF from entry0 @ 0x4010b8
            ;-- section..text:
/ int main(int argc, char **argv, char **envp);
|           ; arg uint64_t arg8 @ xmm1
|                 subq  $8, %rsp                             ; [13] -r-x section size 499 named .text
|                 movss data.00402004, %xmm0                 ; [0x402004:4]=0x4f000000
|                 shufps $0, %xmm0, %xmm0
|                 movaps %xmm0, %xmm2
|                 cmpleps %xmm0, %xmm2
|                 cvttps2dq %xmm0, %xmm0
|                 movdqa %xmm2, %xmm1
|                 pandn %xmm0, %xmm1
|                 movss data.00402008, %xmm0                 ; [0x402008:4]=0x7fffffff
|                 shufps $0, %xmm0, %xmm0
|                 andps %xmm0, %xmm2
|                 pcmpeqd %xmm0, %xmm0
|                 por   %xmm1, %xmm2
|                 pcmpeqd %xmm1, %xmm1                       ; arg8
|                 psrld $1, %xmm1
|                 pcmpeqd %xmm2, %xmm1                       ; arg8
|                 callq sym.assert_eq_int64_t___vector_2___int64_t___vector_2 ; sym.assert_eq_int64_t___vector_2___int64_t___vector_2
|                 xorl  %eax, %eax
|                 addq  $8, %rsp
\                 retq</code></pre>
This code looks slightly closed to originally written <code>C</code> code: <code>%xmm2</code>
collects masked result of <code>cvttps2dq</code> and <code>%xmm1</code> contains <code>0x7FFFffff</code>
value.
<h2 id="parting-words">Parting words</h2>
While not as powerful as tree passes <code>RTL</code> passes are capable of folding
constants, propagating assumed values and removing dead code.
<code>highway</code> uncovered an old <code>gcc</code> <a href="https://gcc.gnu.org/PR115161">bug</a> in
a set of <code>float->int</code> conversion <code>x86</code> intrinsics. This bug was not seen
as frequently until <code>gcc</code> implemented more constant folding cases for
intrinsics in <a href="https://gcc.gnu.org/git/?p=gcc.git;a=commitdiff;h=f2449b55fb2d32">this change</a>.
<code>gcc</code> still has a few places where it could constant-fold a lot more:
<ul>
<li>handle <code>_mm_cvttps_epi32(constant)</code></li>
<li>eliminate redundant <code>movaps %xmm0, %xmm2; cmpleps %xmm0, %xmm2</code> and
below</li>
</ul>
But <code>gcc</code> does not do it today.
If <code>gcc</code> thinks that some intrinsic returns a value that differs from
reality it’s very hard to reliably convince <code>gcc</code> to assume something
else. Sometimes it’s easier to use inline assembly to get the desired
result as a short term workaround.
Have fun!
</article>
<article>
<h1>Three years on NixOS</h1>
2024-06-02T00:00:00Z
This year I decided to shift yearly updates on my <code>NixOS</code> endeavours
(<a href="https://trofi.github.io/posts/290-two-years-on-nixos.html">2023 instance</a>). This time
the occasion is <a href="https://nixos.org/blog/announcements/2024/nixos-2405/"><code>NixOS 24.05 release</code></a>.
<h2 id="system-maintenance">System maintenance</h2>
Looking at the <code>git log</code> for <code>/etc/nixos</code> for the desktop system I see
the following things happening over the past year:
<ul>
<li>follow <code>fonts.fonts</code> to <code>fonts.packages</code> rename (added in <a href="https://github.com/NixOS/nixpkgs/pull/244332">PR#244332</a>)</li>
<li>follow <code>pipewire</code> migration to <code>extraConfig</code> (added in
<a href="https://github.com/NixOS/nixpkgs/pull/282377">PR#282377</a>).</li>
<li>follow <code>programs.gnupg.agent.pinentryFlavor</code> to
<code>programs.gnupg.agent.pinentryPackage</code> migration (added in
<a href="https://github.com/NixOS/nixpkgs/pull/133542">PR#133542</a>).</li>
<li>follow the rename from <code>nix.unstable</code> to <code>nix.latest</code> (added in
<a href="https://github.com/NixOS/nixpkgs/pull/305951">PR#305951</a>).</li>
</ul>
All four were trivial to tweak and did not cause much confusion.
Similar to previous year I did not have any problems related to package
build failures for any on <code>nixos-unstable</code>. Again, probably because I
tested <code>staging</code> time to time.
This time I had two non-trivial problems in upstream packages:
<ul>
<li>problematic <code>firefox</code> update broke <code>meet</code> video streaming. It was
triggered by the use of <code>--with-system-libvpx</code> option in <code>nixpkgs</code>. That
was one of the cases where I had the luxury of bisecting the whole
system to pinpoint the bad component:
<a href="https://github.com/NixOS/nixpkgs/pull/283010#issuecomment-1925703583" class="uri">https://github.com/NixOS/nixpkgs/pull/283010#issuecomment-1925703583</a>.
Local revert until the fix was shipped was trivial.</li>
<li>problematic unstable <code>6.8</code> kernel upgrade caused kernel panic in
<code>eevdf</code> scheduler. Was fixed upstream in
<a href="https://lore.kernel.org/lkml/ZicOSiEWHJJcahi%2F@yujie-X299/t/" class="uri">https://lore.kernel.org/lkml/ZicOSiEWHJJcahi%2F@yujie-X299/t/</a> around
<code>6.9</code> kernel. The crashes were nasty as scheduler crash locks up the
machine and does not print anything when it happens. Luckily
<code>systemd</code> recovered crash log from <code>EFI</code>s <code>nvram</code> and it was clear
from the backtrace that it’s a bug related to scheduling kernel
subsystem.</li>
</ul>
<h2 id="community-support">Community support</h2>
<code>NixOS</code> community remains to be a friendly place that welcomes
newcomers, experiments and day-to-day maintenance work. This year
<code>NixOS Foundation</code> received some heat for how it governs some aspects of
the community. <code>NixOS Foundation</code> proposed a
<a href="https://discourse.nixos.org/t/nixos-foundation-board-giving-power-to-the-community/44552">few major changes</a>
on how it will operate in future.
The most important (and hardest organizationally) change I did was to
document and exercise the procedure of updating
<a href="https://trofi.github.io/posts/315-nixpkgs-bootstrap-files-update.html">bootstrap binaries</a> in
<code>nixpkgs</code>.
My fanciest contribution was
<a href="https://trofi.github.io/posts/309-listing-all-nixpkgs-packages.html">my failed attempt</a> at
listing “all” the package attributes available in <code>nixpkgs</code>. While I was
not able to list all the attributes initially I managed to derive about
60 fixes to <code>nixpkgs</code> (all linked in the article) to make future listing
smoother. Tl:DR; of the fixes is: dynamic typing is hard.
The most unusual <code>nixpkgs</code> contribution was to find
<a href="https://trofi.github.io/posts/292-nix-language-nondeterminism-example.html">the non-determinism</a>
in <code>nix expression language</code> itself. It’s not something I expected to
encounter in real code. Alas.
The trickiest from technical standpoint was the fix for
<a href="https://trofi.github.io/posts/293-mysterious-patchelf-linkage-bug.html">parallel strip breakage</a>.
<a href="https://trofi.github.io/posts/302-Ofast-and-ffast-math-non-local-effects.html"><code>-Ofast</code></a> and
<a href="https://trofi.github.io/posts/310-a-libpam-bug.html"><code>libpam</code></a> bugs were also fun.
The most satisfying was to
<a href="https://trofi.github.io/posts/298-unexpected-runtime-dependencies-in-nixpkgs.html">reduce the runtime closure</a>
for many packages that use <code>__FILE__</code> just for for debug messages.
Surprisingly I managed to get about 800 commits into <code>nixpkgs</code> this year.
About ~90 of them is fixes to get compatibility with <code>gcc-13</code>. About 60
are evaluation fixes mentioned above. About 400 of them are various
package version updates.
I still read some of Matrix channels but I mostly skim through
<a href="https://discourse.nixos.org/">discourse</a> as I have even less free time
than last year.
<h2 id="home-server-experience">Home server experience</h2>
I did not have to adapt anything for the past year. I switched from
<code>apache</code> to <code>nginx</code> as an <code>httpd</code> without any issues. And that’s about
it. Things Just Work.
<h2 id="local-experiments">Local experiments</h2>
As an experiment I gave <a href="https://hyprland.org/"><code>hyprland</code></a> a short try.
I had to switch back to <code>sway</code>. I had two issues with
<code>hyprland</code>: new applications are visibly changing layout a few times
before they settle on final window size (I could not get used to it) and
configuration language quirks (I frequently missed commas where empty
arguments are required).
I also did a bit of fresh <code>gcc</code> testing. This time frame also coincided
with <code>gcc-14</code> development and release cycle. <code>nixpkgs</code> ended up being a
reasonable vehicle to play with <code>gcc-14</code>. The
<a href="https://trofi.github.io/posts/311-gcc-14-bug-pile-4.html">last bug pile report</a> tells me that
I found about 50 <code>gcc</code> bugs and even fixed at least
<a href="https://trofi.github.io/posts/301-another-gcc-profiling-bug.html">one non-trivial one</a>.
<h2 id="parting-words">Parting words</h2>
<code>NixOS</code> still works fine for me. I did not do as much as I managed to
last year. But looking back the list looks impressive.
Give it a go if you did not yet :)
</article>
<article>
<h1>nixpkgs bootstrap files update</h1>
2024-05-26T00:00:00Z
<h2 id="tldr">Tl;DR</h2>
<code>nixpkgs</code> now has up to date <code>bootstrapFiles</code> at least for <code>i686-linux</code>
and <code>x86_64-linux</code>. Moreover we now have an easy procedure to update the
binaries! Instructions are hiding in
<a href="https://github.com/NixOS/nixpkgs/blob/master/maintainers/scripts/bootstrap-files/README.md#how-to-request-the-bootstrap-seed-update"><code>maintainers/scripts/bootstrap-files/README.md</code></a>.
For example the
<a href="https://github.com/NixOS/nixpkgs/pull/288866">PR#288866</a> to update
<code>i686-unknown-linux-gnu</code> was generated as:
<pre><code>$ maintainers/scripts/bootstrap-files/refresh-tarballs.bash \
    --commit \
    --targets=x86_64-unknown-linux-gnu
$ git push my-fork staging:bootstrapFiles-x86_64-unknown-linux-gnu-update</code></pre>
This work paves the way for more frequent updates of the existing
bootstrap files and simplifies the procedure of introducing support for
new targets into <code>nixpkgs</code>.
As a nice side-effect <code>x86_64-linux</code> bootstrap does not depend on <code>i686</code>
<code>busybox</code> binary any more and uses <code>x86_64</code> binary instead:

This means that updating <code>i686-linux</code> bootstrap files alone does not
trigger the rebuild of <code>x86_64-linux</code> world any more.
<h2 id="more-words">More words</h2>
<h3 id="intro">Intro</h3>
About two years ago <a href="https://trofi.github.io/posts/240-nixpkgs-bootstrap-intro.html">I noticed</a>
that <code>nixpkgs</code> has quite old initial seed binaries used to bootstrap the
rest of the system.
Normally the version of bootstrap files does not matter as once bootstrap
finishes the original files are not referenced any more. There are a few
annoying exceptions that people stumbled from time to time. One of them
was stale <code>libgcc.so</code>.
Stale binaries also cause build breakages from time to time. One
example breakage is discussed in the
<a href="https://trofi.github.io/posts/275-nixpkgs-bootstrap-deep-dive.html">bootstrap deep dive</a> post,
another example is a <a href="https://github.com/NixOS/nixpkgs/pull/229898#issuecomment-1589179355">stale <code>gnumake</code></a>
breakage. There are a lot more. They usually get workarounds somewhere
else in <code>nixpkgs</code> but almost never in the bootstrap files as they are
harder to update.
These failures are not always easy to debug (or workaround). I
occasionally suggested updating the bootstrap binaries as I expected it
to be a trivial operation for people who did those initially.
I asked a few times on infra’s Matrix if bootstrap files could be
updated as part of a release preparation and did not get anywhere.
At some point Bernardo visited me in person and explained the details of
what it takes to upload the binaries and what the requirements for the
new binaries are nowadays. In essence there are two requirements:
<ol type="1">
<li>The binaries should be built by <a href="https://hydra.nixos.org" class="uri">https://hydra.nixos.org</a> to make
sure that binaries come from a somewhat trusted location and others
can replicate the same binaries from source.</li>
<li>The binaries must be uploaded to <a href="https://tarballs.nixos.org" class="uri">https://tarballs.nixos.org</a> by
someone someone has the permissions to do it.</li>
</ol>
Sounds trivial?
At some point I encountered complete failure of <code>i686-linux</code> bootstrap
on my file system due to
<a href="https://trofi.github.io/posts/297-32-bit-file-API-strikes-back.html">64-bit inode values</a> on my
<code>/nix/store</code>.
I filed an <a href="https://github.com/NixOS/nixpkgs/issues/253713">Issue#253713</a>
to request automated periodic bootstrap files updates. I also listed a
few examples where periodic refresh would fix the problems people
encounter hoping that somebody will consider it bad enough and upload
the binaries.
<h3 id="a-light-in-the-tunnel">A light in the tunnel</h3>
Alas just filing an issue did not magically fix things.
I noticed that recently <code>riscv64-linux</code> bootstrap files were updated in
<a href="https://github.com/NixOS/nixpkgs/pull/282517">PR#2826517</a> and I took it
as an opportunity to explore the mechanism and automate the whole PR
preparation as the first step.
The procedure looked trivial: you follow a few hydra links and put them
into the <code>.nix</code> file.
<h3 id="annoying-details">Annoying details</h3>
I hoped for a script to be 2-3 <code>curl</code> calls. But there are always those
pesky details that get in your way.
<h4 id="job-names">Job names</h4>
Some targets (like <code>risc-v</code> or <code>powerpc64</code>) don’t yet have a native
build support on <code>hydra</code> (lack of hardware). And yet we have bootstrap
files for those: they are cross-compiled.
Hydra job name and even job results format were different between native
and cross-builds:
<ul>
<li>native targets used only <code>.dist</code> style builds</li>
<li>cross-targets had <code>.bootstrapTools</code> style builds</li>
</ul>
Those were easy to fix by exposing builds unconditionally with
<a href="https://github.com/NixOS/nixpkgs/pull/284090">PR#284090</a>. And drop
unused <code>.dist</code> indirection with
<a href="https://github.com/NixOS/nixpkgs/pull/301639">PR#301639</a> suggested by
Alyssa.
<h4 id="nixpkgs-file-inconsistency-darwin"><code>nixpkgs</code> file inconsistency (<code>darwin</code>)</h4>
Once I had a glance at <code>darwin</code> bootstrap jobs I noticed it puts files
into slightly different location. I unified it with
<a href="https://github.com/NixOS/nixpkgs/pull/284628">PR#284628</a> hoping that
somebody else will finish the <code>darwin</code> part. And help did come from
<code>annalee</code> in <a href="https://github.com/NixOS/nixpkgs/pull/295557">PR#295557</a>.
<h4 id="actual-regenerator">Actual regenerator</h4>
Once enough things were in place I hacked up a shell script that fetches
needed files and generates <code>.nix</code> files with enough contents as
<a href="https://github.com/NixOS/nixpkgs/pull/284541">PR#284541</a>. The script
ended up being almost 300 lines long!
As a first test I tried it on <code>musl</code> targets as they were not using
the binaries from <code>tarballs.nixos.org</code> and that felt list an urgent issue.
The <a href="https://github.com/NixOS/nixpkgs/pull/285906">PR#285906</a> dealt with
<code>x86_64-unknown-linux-musl</code> bootstrap files.
<h4 id="cross-case">cross-case</h4>
Updating existing files is slightly easier than bringing it a completely
new set of binaries. How do you deal with those?
Having dealt with changes above I could finally answer that question
with some confidence:
<ul>
<li>add a new target to <code>lib/systems/examples.nix</code>, make sure it can build
basic things like <code>pkgsCross.$target.hello</code></li>
<li>add <code>bootstrapFiles</code> build entry to
<code>pkgs/stdenv/linux/make-bootstrap-tools-cross.nix</code>, wait for <code>hydra</code>
to build binaries for you</li>
<li>add your new target to <code>maintainers/scripts/bootstrap-files/refresh-tarballs.bash</code>
your in <code>CROSS_TARGETS=()</code> list and run the script.</li>
<li>send a resulting PR requesting binaries upload as described in
<code>maintainers/scripts/bootstrap-files/README.md</code></li>
</ul>
<a href="https://github.com/NixOS/nixpkgs/pull/314823">PR#314823</a> should
document the same procedure in <code>nixpkgs</code>.
<h2 id="parting-words">Parting words</h2>
I managed to update bootstrap files for <code>i686-linux</code> and <code>x86_64-linux</code>!
It immediately got the following benefits:
<ul>
<li>fixed <code>i686-linux</code> bootstrap on file systems with 64-bit inodes</li>
<li>untangled <code>x86_64-linux</code> bootstrap from <code>i686</code> <code>busybox</code></li>
<li>switched <code>musl</code> bootstrap files to <code>hydra</code>-built files hosted on
<code>tarballs.nixos.org</code>.</li>
<li>documented a way to introduce new target into <code>nixpkgs</code> via
cross-compiled bootstrap files.</li>
</ul>
Next steps:
<ul>
<li>update <code>aarch64-linux</code> bootstrap files (trivial)</li>
<li>update other cross-targets (trivial)</li>
<li>work with release engineering team to periodically update the binaries
on a defined cadence (moderate)</li>
</ul>
Have fun!
</article>
<article>
<h1>Zero Hydra Failures towards 24.05 NixOS release</h1>
2024-05-25T00:00:00Z
I somehow missed the beginning of <code>ZHF</code> phase
<a href="https://github.com/NixOS/nixpkgs/issues/309482">this release cycle</a>.
For those who don’t know <code>ZHF</code> (or Zero Hydra Failures) is the time when
most build failures are squashed before final <code>NixOS-24.05</code> release
(see <a href="https://github.com/NixOS/nixpkgs/issues/303285">full release schedule</a>).
To follow the tradition let’s fix one bug for <code>ZHF</code>.
I picked <a href="https://hydra.nixos.org/build/261188699"><code>miniupnpc</code></a> build
failure. Surprisingly it blocks about 60 packages!
The failure looks trivial:
<pre><code>trying https://miniupnp.tuxfamily.org/files/miniupnpc-2.2.7.tar.gz
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:02:14 --:--:--     0
curl: (28) Failed to connect to miniupnp.tuxfamily.org port 443 after 134794 ms: Couldn't connect to server
Warning: Problem : timeout. Will retry in 1 seconds. 3 retries left.
  0     0    0     0    0     0      0      0 --:--:--  0:02:14 --:--:--     0
curl: (28) Failed to connect to miniupnp.tuxfamily.org port 443 after 134159 ms: Couldn't connect to server
Warning: Problem : timeout. Will retry in 2 seconds. 2 retries left.
  0     0    0     0    0     0      0      0 --:--:--  0:02:13 --:--:--     0
curl: (28) Failed to connect to miniupnp.tuxfamily.org port 443 after 133158 ms: Couldn't connect to server
Warning: Problem : timeout. Will retry in 4 seconds. 1 retries left.
  0     0    0     0    0     0      0      0 --:--:--  0:02:15 --:--:--     0
curl: (28) Failed to connect to miniupnp.tuxfamily.org port 443 after 135253 ms: Couldn't connect to server
error: cannot download miniupnpc-2.2.7.tar.gz from any mirror</code></pre>
Upstream source is unavailable and <code>curl</code> times out fetching it.
With great suggestions from others to switch the package to <code>github</code>
source fetch I came up with <a href="https://github.com/NixOS/nixpkgs/pull/314510">PR#314510</a>:
<pre class="diff"><code>--- a/pkgs/tools/networking/miniupnpc/default.nix
+++ b/pkgs/tools/networking/miniupnpc/default.nix
@@ -1,6 +1,6 @@
 { lib
 , stdenv
-, fetchurl
+, fetchFromGitHub
 , cmake
 }:

@@ -8,14 +8,15 @@ stdenv.mkDerivation rec {
   pname = "miniupnpc";
   version = "2.2.7";

-  src = fetchurl {
-    urls = [
-      "https://miniupnp.tuxfamily.org/files/${pname}-${version}.tar.gz"
-      "http://miniupnp.free.fr/files/${pname}-${version}.tar.gz"
-    ];
-    sha256 = "sha256-sMOicFaED9DskyilqbrD3F4OxtLoczNJz1d7CqHnCsE=";
+  src = fetchFromGitHub {
+    owner = "miniupnp";
+    repo = "miniupnp";
+    rev = "miniupnpc_${lib.replaceStrings ["."] ["_"] version}";
+    hash = "sha256-cIijY1NcdF169tibfB13845UT9ZoJ/CZ+XLES9ctWTY=";
   };

+  sourceRoot = "${src.name}/miniupnpc";
+
   nativeBuildInputs = [ cmake ];

   doCheck = !stdenv.isFreeBSD;</code></pre>
The fix is slightly larger than the average one-liner as we have to
fiddle with the source-fetching helper. But otherwise it’s simple.
Testing:
<pre><code>$ nix build -f. miniupnpc</code></pre>
All good!
<h2 id="parting-words">Parting words</h2>
As <code>24.05</code> branch was already created the fix will have to be backported
to it by adding a specific label. One of the maintainers will have to do
it.
Otherwise contributing to <code>ZHF</code> is very easy. Give it a try!
Have fun!
</article>
<article>
<h1>highway debugging example</h1>
2024-05-18T00:00:00Z
<a href="https://github.com/google/highway">highway</a> test suite is a great
stress test for <code>gcc</code>’s vectorization and SIMD intrinsics code
generators:
<ul>
<li>at compile time <code>highway</code> instantiates all the vector extensions your CPU could support</li>
<li>at runtime it runs the tests on all of the supported extensions against
various vector sizes</li>
</ul>
It cover many corner cases of what could possibly go wrong with vectored
forms of various operations. A few past examples are:
<a href="https://gcc.gnu.org/PR110274"><code>PR110274</code></a>,
<a href="https://gcc.gnu.org/PR110880"><code>PR110880</code></a>,
<a href="https://gcc.gnu.org/PR111048"><code>PR111048</code></a>,
<a href="https://gcc.gnu.org/PR111051"><code>PR111051</code></a>,
<a href="https://gcc.gnu.org/PR115115"><code>PR115115</code></a>.
While <code>highway</code> tests are quite small they are somewhat tricky to
extract into self-contained examples. In this post I’ll write down a few
hacks I usually use to simplify this task.
<h2 id="an-example">An example</h2>
Today test suite failed at me on today’s <code>gcc-15</code> build against
<code>nixpkgs</code> as:
<pre><code> 1146 - HwyReverseTestGroup/HwyReverseTest.TestAllReverseLaneBytes/EMU128  # GetParam() = 2305843009213693952 (Subprocess aborted)
 1151 - HwyReverseTestGroup/HwyReverseTest.TestAllReverseBits/EMU128  # GetParam() = 2305843009213693952 (Subprocess aborted)
 1186 - HwyShuffle4TestGroup/HwyShuffle4Test.TestAllPer4LaneBlockShuffle/EMU128  # GetParam() = 2305843009213693952 (Subprocess aborted)</code></pre>
<code>TestAllReverseLaneBytes</code> is the test function. <code>EMU128</code> is an
(emulated) CPU target: the code does not use compiler intrinsics and
uses loops over scalar operations to emulate <code>SIMD</code>.
<h2 id="check-latest-git">Check latest <code>git</code></h2>
<code>highway</code> is actively maintained. In case the failure is caused by a
<code>highway</code> bug (and not by a faulty compiler) chances are that it’s
already fixed in latest version. Worth trying it first:
<pre><code># get the build time depends into the development shell
$ nix develop -f ~/n libhwy

$$ git clone https://github.com/google/highway.git
$$ cd highway

$$ mkdir build
$$ cd build
$$ cmake ..
$$ make -j $(nproc) && make test
...
1146 - HwyReverseTestGroup/HwyReverseTest.TestAllReverseLaneBytes/EMU128  # GetParam() = 2305843009213693952 (Subprocess aborted)
1151 - HwyReverseTestGroup/HwyReverseTest.TestAllReverseBits/EMU128  # GetParam() = 2305843009213693952 (Subprocess aborted)
1186 - HwyShuffle4TestGroup/HwyShuffle4Test.TestAllPer4LaneBlockShuffle/EMU128  # GetParam() = 2305843009213693952 (Subprocess aborted)
</code></pre>
The bug was still there!
<h2 id="enable-single-simplest-target">Enable single (simplest) target</h2>
<code>highway</code> uses heavy <code>C++</code> template code and various iterator macros to
compile the library for each supported CPU extension <code>highway</code> knows
about. This increases build times and complicated debugging via code
tweaking as code has to compile for all active targets, not just one.
I disabled all targets except the problematic one. In our case the
problematic target is <code>EMU128</code>. Thus the local change to leave <code>EMU128</code>
as the only available option is:
<pre class="diff"><code>--- a/hwy/detect_targets.h
+++ b/hwy/detect_targets.h
@@ -29,7 +29,7 @@
 // #define HWY_BASELINE_TARGETS (HWY_SSE4 | HWY_SCALAR)

 // Uncomment to override the default blocklist:
-// #define HWY_BROKEN_TARGETS HWY_AVX3
+#define HWY_BROKEN_TARGETS (HWY_AVX2 | HWY_SSE4 | HWY_SSE2 | HWY_SSSE3 | HWY_SSE4 | HWY_AVX3_SPR | HWY_AVX3_ZEN4 | HWY_AVX3)

 // Uncomment to definitely avoid generating those target(s):
 // #define HWY_DISABLED_TARGETS HWY_SSE4</code></pre>
Here I disabled anything that build system reports as supported.
Before the change I had <code>Compiled HWY_TARGETS:   AVX3_SPR AVX3_ZEN4 AVX3 AVX2 SSE4 SSSE3 SSE2</code>
in this output:
<pre><code>Config: emu128:0 scalar:0 static:0 all_attain:0 is_test:0
Compiled HWY_TARGETS:   AVX3_SPR AVX3_ZEN4 AVX3 AVX2 SSE4 SSSE3 SSE2
HWY_ATTAINABLE_TARGETS: AVX3_SPR AVX3_ZEN4 AVX3 AVX2 SSE4 SSSE3 SSE2 EMU128
HWY_BASELINE_TARGETS:   SSE2 EMU128
HWY_STATIC_TARGET:      SSE2
HWY_BROKEN_TARGETS:     Unknown
HWY_DISABLED_TARGETS:
Current CPU supports:   AVX2 SSE4 SSSE3 SSE2 EMU128 SCALAR</code></pre>
After the change I get <code>Compiled HWY_TARGETS:   EMU128</code> in this output:
<pre><code>Config: emu128:0 scalar:0 static:0 all_attain:0 is_test:0
Compiled HWY_TARGETS:   EMU128
HWY_ATTAINABLE_TARGETS: EMU128
HWY_BASELINE_TARGETS:   SSE2 EMU128
HWY_STATIC_TARGET:      EMU128
HWY_BROKEN_TARGETS:     AVX3_SPR AVX3_ZEN4 AVX3 AVX2 SSE4 SSSE3 SSE2
HWY_DISABLED_TARGETS:
Current CPU supports:   AVX2 SSE4 SSSE3 SSE2 EMU128 SCALAR</code></pre>
Leaving a single compiled target speeds the builds a few times up.
Then I picked the specific binary that implements failing test. In this
case it was <code>tests/reverse_test</code>:
<pre><code>$ make -j$(nproc) && ./tests/reverse_test
...
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from HwyReverseTestGroup/HwyReverseTest
[ RUN      ] HwyReverseTestGroup/HwyReverseTest.TestAllReverseLaneBytes/EMU128


u16x4 expect [0+ ->]:
  0xD49E,0x049E,0x0137,0x69D3,
u16x4 actual [0+ ->]:
  0xFF9E,0xFF9E,0x0137,0xFFD3,
Abort at reverse_test.cc:162: EMU128, u16x4 lane 0 mismatch:
  expected '0xD49E', got '0xFF9E'.

Aborted (core dumped)</code></pre>
Here we can see that instead of expected <code>0xD49E,0x049E,0x0137,0x69D3</code>
output our library did <code>0xFF9E,0xFF9E,0x0137,0xFFD3</code>.
<h2 id="shrink-the-test">Shrink the test</h2>
The rest reduction is usually test-specific, but some of the hacks can
be applied to many tests. In this case I kept only one failing test:
<pre class="diff"><code>--- a/hwy/tests/reverse_test.cc
+++ b/hwy/tests/reverse_test.cc
@@ -293,13 +295,7 @@ HWY_AFTER_NAMESPACE();

 namespace hwy {
 HWY_BEFORE_TEST(HwyReverseTest);
-HWY_EXPORT_AND_TEST_P(HwyReverseTest, TestAllReverse);
-HWY_EXPORT_AND_TEST_P(HwyReverseTest, TestAllReverse2);
-HWY_EXPORT_AND_TEST_P(HwyReverseTest, TestAllReverse4);
-HWY_EXPORT_AND_TEST_P(HwyReverseTest, TestAllReverse8);
 HWY_EXPORT_AND_TEST_P(HwyReverseTest, TestAllReverseLaneBytes);
-HWY_EXPORT_AND_TEST_P(HwyReverseTest, TestAllReverseBits);
-HWY_EXPORT_AND_TEST_P(HwyReverseTest, TestAllReverseBlocks);
 HWY_AFTER_TEST();
 }  // namespace hwy</code></pre>
The test was still failing (sometimes it’s not the case when mere
presence of unrelated code changes the inlining and vectorization
decisions).
Then I shrunk the cases down to 16-bit element sizes by inlining a
<code>ForUI163264</code> definition:
<pre class="diff"><code>--- a/hwy/tests/reverse_test.cc
+++ b/hwy/tests/reverse_test.cc
@@ -248,7 +249,7 @@ HWY_NOINLINE void TestAllReverse8() {
 }

 HWY_NOINLINE void TestAllReverseLaneBytes() {
-  ForUI163264(ForPartialVectors<TestReverseLaneBytes>());
+  ForPartialVectors<TestReverseLaneBytes>()(uint16_t());
 }</code></pre>
Then I removed all the other unrelated test asserts from the
<code>hwy/tests/reverse_test.cc</code> file.
Then I inlined obvious template parameters right into local test class.
Then I extracted random generated data used in the failing vectors (using
<code>printf()</code> statements) and inlined values into <code>.cc</code> file. Sometimes
I had to sprinkle <code>__attribute__((noipa))</code> attributes on local functions
to inhibit too eager constant folding.
<h2 id="gdb-hints"><code>gdb</code> hints</h2>
To explore generated code in <code>gdb</code> I explored <code>hwy::N_<target>::</code>
namespace as:
<pre><code>$ gdb tests/reverse_test
(gdb) disassemble hwy::N_EMU128::TestAllReverseLaneBytes
...
   <+113>:   mov    0x3d190(%rip),%rax        # 0x4dc68
   <+120>:   mov    $0xffffd49e,%esi
   <+125>:   xor    %edx,%edx
   <+127>:   mov    $0x8,%edi
   <+132>:   mov    %rax,0x0(%rbp)
   <+136>:   mov    %si,(%r12)

   <+141>:   movzwl 0x2(%rbp),%eax
   <+145>:   xor    %esi,%esi
   <+147>:   rol    $0x8,%ax
   <+151>:   mov    %ax,0x2(%r12)

   <+157>:   movzwl 0x4(%rbp),%eax
   <+161>:   rol    $0x8,%ax
   <+165>:   mov    %ax,0x4(%r12)

   <+171>:   movzwl 0x6(%rbp),%eax
   <+175>:   rol    $0x8,%ax
   <+179>:   mov    %ax,0x6(%r12)

   <+185>:   mov    0x0(%rbp),%rax
   <+189>:   movq   %rax,%xmm0
   <+194>:   movdqa %xmm0,%xmm1
   <+198>:   psllw  $0x8,%xmm0
   <+203>:   psraw  $0x8,%xmm1
   <+208>:   por    %xmm0,%xmm1
   <+212>:   movq   %xmm1,0x8(%rsp)</code></pre>
Here I was lucky! I immediately spotted the bug. We see both:
<ul>
<li>16-bit wide move/rotate/move: <code>movzwl / rol / mov</code> (looks correct)</li>
<li>and 128-bit wide move/rotate/move: <code>movq / psllw / psraw / por / movq</code></li>
</ul>
The rotate part is broken here: it should have been logical <code>psrlw</code>
shift, not arithmetic (sign-preserving) <code>psraw</code> shift.
At this point my test looked this way:
<pre class="cpp"><code>HWY_NOINLINE void TestAllReverseLaneBytes() {
    const CappedTag<uint16_t, 4, 0> d;

    const size_t N = Lanes(d);
    fprintf(stderr, "N = %zu\n", N);
    auto in = AllocateAligned<uint16_t>(N);
    auto expected = AllocateAligned<uint16_t>(N);

    fprintf(stderr, "iter\n");
        in[0] = 0x9ed4u;
        in[1] = 0x049eu;
        in[2] = 0x0137u;
        in[3] = 0x69d3u;
        expected[0] = ReverseBytesOfValue(in[0]);
        expected[1] = ReverseBytesOfValue(in[1]);
        expected[2] = ReverseBytesOfValue(in[2]);
        expected[3] = ReverseBytesOfValue(in[3]);

    const auto v = Load(d, in.get());
    HWY_ASSERT_VEC_EQ(d, expected.get(), ReverseLaneBytes(v));
}</code></pre>
I looked at the <code>ReverseLaneBytes()</code> implementation. It had two parts.
The first part was generic for all targets:
<pre class="cpp"><code>// from hwy/ops/generic_ops-inl.h
template <class V, HWY_IF_T_SIZE_V(V, 2)>
HWY_API V ReverseLaneBytes(V v) {
  const DFromV<V> d;
  const Repartition<uint8_t, decltype(d)> du8;
  return BitCast(d, Reverse2(du8, BitCast(du8, v)));
}</code></pre>
And the second part was <code>EMU128</code>-specific:
<pre class="cpp"><code>// from hwy/ops/emu128-inl.h
template <class D>
HWY_API VFromD<D> Reverse2(D d, VFromD<D> v) {
  VFromD<D> ret;
  for (size_t i = 0; i < MaxLanes(d); i += 2) {
    ret.raw[i + 0] = v.raw[i + 1];
    ret.raw[i + 1] = v.raw[i + 0];
  }
  return ret;
}</code></pre>
I inlined the above definitions into the test and got this:
<pre class="cpp"><code>// $ cat hwy/tests/reverse_test.cc
#include <stddef.h>

#undef HWY_TARGET_INCLUDE
#define HWY_TARGET_INCLUDE "tests/reverse_test.cc"
#include "hwy/foreach_target.h"  // IWYU pragma: keep
#include "hwy/highway.h"
#include "hwy/tests/test_util-inl.h"

HWY_BEFORE_NAMESPACE();
namespace hwy {
namespace HWY_NAMESPACE {

template <class D>
__attribute__((noipa))
static VFromD<D> Reverse2_(D d, VFromD<D> v) {
  VFromD<D> ret;
  for (size_t i = 0; i < MaxLanes(d); i += 2) {
    ret.raw[i + 0] = v.raw[i + 1];
    ret.raw[i + 1] = v.raw[i + 0];
  }
  return ret;
}

HWY_NOINLINE void TestAllReverseLaneBytes() {
    const CappedTag<uint16_t, 4, 0> d;

    const size_t N = Lanes(d); // 4
    auto in = AllocateAligned<uint16_t>(N);
    auto expected = AllocateAligned<uint16_t>(N);

    in[0] = 0x9ed4u;
    in[1] = 0x049eu;
    in[2] = 0x0137u;
    in[3] = 0x69d3u;
    expected[0] = 0xd49eu;
    expected[1] = 0x9e04u;
    expected[2] = 0x3701u;
    expected[3] = 0xd369u;

    const auto v = Load(d, in.get());
    const Repartition<uint8_t, decltype(d)> du8;
    const auto r = BitCast(d, Reverse2_(du8, BitCast(du8, v)));
    HWY_ASSERT_VEC_EQ(d, expected.get(), r);
}

// NOLINTNEXTLINE(google-readability-namespace-comments)
}  // namespace HWY_NAMESPACE
}  // namespace hwy
HWY_AFTER_NAMESPACE();

#if HWY_ONCE

namespace hwy {
HWY_BEFORE_TEST(HwyReverseTest);
HWY_EXPORT_AND_TEST_P(HwyReverseTest, TestAllReverseLaneBytes);
HWY_AFTER_TEST();
}  // namespace hwy

#endif</code></pre>
<h2 id="make-sure-its-an-optimizer">Make sure it’s an optimizer</h2>
Adding <code>#pragma GCC optimize(0)</code> to the beginning of the file makes the
bug to go away. It’s a good hint that it’s a compiler bug: the test looks
obviously correct (not much hidden code is left in the templates).
But the only way to make sure is to finish the reduction down to a
self-contained example. We will need it anyway to report upstream.
<h2 id="final-result">Final result</h2>
After a few manual extra inlines and simplifications I got this
self-contained example:
<pre class="c"><code>// $ cat bug.c
typedef unsigned char u8;

__attribute__((noipa))
static void fill_src(u8 * src) {
    src[0] = 0x00; src[1] = 0xff;
}

__attribute__((noipa))
static void assert_dst(const u8 * dst) {
    if (dst[0] != 0xff) __builtin_trap();
    if (dst[1] != 0x00) __builtin_trap();
}

int main() {
    u8 src[8] __attribute__((aligned(16))) = { 0 };
    u8 dst[8] __attribute__((aligned(16))) = { 0 };

    // place 0x00 into src[0] and 0xFF into src[1]
    fill_src(src);

    // swap bytes:
    // place 0xFF into dst[0], 0x00 into dst[1]
    for (unsigned long i = 0; i < 8; i += 2) {
        dst[i + 0] = src[i + 1];
        dst[i + 1] = src[i + 0];
    }

    // make sure bytes swapped
    assert_dst(dst);
}</code></pre>
Triggering:
<pre><code>$ gcc bug.c -o a -O1 && ./a
$ gcc bug.c -o a -O2 && ./a
Illegal instruction (core dumped)</code></pre>
<h2 id="upstream-report">Upstream report</h2>
I reported the bug as <a href="https://gcc.gnu.org/PR115146"><code>PR115146</code></a>.
Bisecting <code>gcc</code> pointed me to this
<a href="https://gcc.gnu.org/git/?p=gcc.git;a=commitdiff;h=a71f90c5a7ae29">“vector shift” change</a>.
This change looks very close to the culprit as the code explicitly picks
the “arithmetic” flavour of shift instruction (should be “logical”
instead).
By now the original change author already provided a test patch in the
report! So quick!
Have fun!
</article>
<article>
<h1>the sagemath saga</h1>
2024-05-12T00:00:00Z
It’s a story of me was
<a href="https://fosstodon.org/@sheerluck@misskey.io/112372289317724185">nerd sniped</a>
by <code>@sheerluck@misskey.io</code> with <a href="https://gcc.gnu.org/PR114872"><code>PR114872</code></a>.
I spent this week having fun with a rare kind of bug: <code>gcc</code> was suspected
to have a bug that causes <code>sagemath</code> to crash with <code>SIGSEGV</code> on certain
inputs.
Ideally <code>sage</code> should not <code>SIGSEGV</code> on problematic inputs and should
instead print reasonable backtraces. At least printing a nicer error
message should not be hard, should it?
<h2 id="the-symptom">The symptom</h2>
The report said that a simple session caused <code>sage</code> tool from <code>sagemath</code>
package to <code>SIGSEGV</code>:
<pre><code>$ sage
sage: libgap.AbelianGroup(0,0,0)

...
Segmentation fault (core dumped)</code></pre>
<code>sage</code> is an <code>ipython</code> <code>REPL</code> with a bunch of bindings for math
libraries like <a href="https://www.gap-system.org/"><code>GAP</code></a>.
It is said that the bug happens only when <code>sagemath</code> is built against
<code>python-3.12</code> while <code>python-3.11</code> would work without the problems.
Normally it would be a strong hint of a <code>sagemath</code> bug. But the reporter
suspected it’s a <code>gcc</code> problem as building <code>sage</code> with <code>-O1</code> made the
bug go away. Thus the bug is at least dependent on generated code and is
likely not just a logic bug.
<strong>Quick quiz: where do you think the bug lies?</strong> In <code>gcc</code>, <code>sagemath</code> or
somewhere else? And while at it: what is the cause of the bug?
Use-after-free, use of uninitialized data, logic bug or maybe something
else?
<h2 id="the-first-reproducer-attempt-nixpkgs-package">The first reproducer attempt: <code>nixpkgs</code> package</h2>
I tried to reproduce the bug by building <code>sage</code> from <code>nxipkgs</code>. As
expected <code>sage</code> built against <code>python-3.11</code> worked just fine.
<code>python-3.11</code> is a <code>nixpkgs</code> default. I tried to flip the default to
<code>python-3.12</code> with this local change:
<pre class="diff"><code>--- a/pkgs/top-level/all-packages.nix
+++ b/pkgs/top-level/all-packages.nix
@@ -17505 +17505 @@ with pkgs;
-  python3 = python311;
+  python3 = python312;
@@ -17509 +17509 @@ with pkgs;
-  python3Packages = dontRecurseIntoAttrs python311Packages;
+  python3Packages = dontRecurseIntoAttrs python312Packages;</code></pre>
Building it did not work as is:
<pre><code>$ nix build -f. sage
...


error: ipython-genutils-0.2.0 not supported for interpreter python3.12


error: nose-1.3.7 not supported for interpreter python3.12


> src/gmpy2_convert_gmp.c:464:76: error: ‘PyLongObject’ {aka ‘struct _longobject’} has no member named ‘ob_digit’
>   464 |                    sizeof(templong->ob_digit[0])*8 - PyLong_SHIFT, templong->ob_digit);
>       |                                                                            ^~
> error: command '/nix/store/8mjb3ziimfi3rki71q4s0916xkm4cm5p-gcc-wrapper-13.2.0/bin/gcc' failed with exit code 1
> /nix/store/558iw5j1bk7z6wrg8cp96q2rx03jqj1v-stdenv-linux/setup: line 1579: pop_var_context: head of shell_variables not a function context
For full logs, run 'nix log /nix/store/g7mf3p2cylf74j3ypq2ifcspx61isb36-python3.12-gmpy2-2.1.2.drv'.


> ModuleNotFoundError: No module named 'distutils'
> configure: error: Python explicitly requested and python headers were not found
For full logs, run 'nix log /nix/store/xyd63v16k1krblcfypfn5bs6jqbj9lwd-audit-3.1.2.drv'.</code></pre>
The above told me that some dependencies (at least in <code>nixpkgs</code>) are not
ready for <code>python-3.12</code>:
<ul>
<li><code>nose</code> and <code>ipython-genutils</code> are explicitly disabled for <code>python-3.12</code>
in <code>nixpkgs</code></li>
<li><code>gmpy2</code> and <code>audit</code> just fail to build for API changes in <code>python</code>
itself</li>
</ul>
<code>gmpy2</code> specifically has an open
<a href="https://github.com/aleaxit/gmpy/issues/446">upstream report</a>
to add support for <code>python-3.12</code>. This means distributions have to apply
not yet upstreamed changes to get earlier <code>python-3.12</code> support.
All the above means that porting fixes are sometimes not trivial and
might vary from a distribution to distribution if they want to get
<code>python-3.12</code> tested earlier.
I did not feel confident to patch at least 4 <code>python</code> libraries to get
<code>sagemath</code> to build. I switched the tactic to reproduce the bug on the
system reporters were using and to explore it there.
Original reporter used Arch Linux to reproduce the failure. Another user
<a href="https://github.com/sagemath/sage/pull/36407#issuecomment-2093792864">reported</a>
that <code>Gentoo</code> users also seen the similar problem.
<h2 id="the-second-reproducer-attempt-gentoo-package-from-sage-on-gentoo">The second reproducer attempt: <code>Gentoo</code> package from <code>::sage-on-gentoo</code></h2>
I had a <code>Gentoo</code> chroot lying around for <code>nix</code> packaging testing. I
tried to reproduce <code>sagemath</code> failure there by using
<a href="https://github.com/cschwan/sage-on-gentoo"><code>::sage-on-gentoo</code></a> overlay.
Unfortunately neither latest release of <code>sagemath-standard-10.3</code> nor
<code>sagemath-standard-9999</code> <code>git</code> versions did build for me as is. I filed
2 bugs:
<ul>
<li><a href="https://github.com/cschwan/sage-on-gentoo/issues/783">cschwan/sage-on-gentoo#783</a>: <code>=sci-mathematics/sagemath-standard-10.3</code> fails as: <code>AttributeError: Can't pickle local object '_prepare_extension_detection.<locals>.<lambda>'</code></li>
<li><a href="https://github.com/cschwan/sage-on-gentoo/issues/784">cschwan/sage-on-gentoo#784</a>: <code>=sci-mathematics/sage_setup-9999</code> fails as: <code>tar (child): sage-setup-*.tar.gz: Cannot open: No such file or directory</code></li>
</ul>
I hoped that <code>pickle</code> failure was fixed in latest <code>git</code> and I could avoid
the second pickle bug by using it.
At least the <code>tar</code> failure looked like a packaging issue:
<pre><code># ACCEPT_KEYWORDS='**' USE=text emerge -av1 sagemath-standard
...
tar (child): sage-setup-*.tar.gz: Cannot open: No such file or directory
tar (child): Error is not recoverable: exiting now
tar: Child returned status 2
tar: Error is not recoverable: exiting now
 * ERROR: sci-mathematics/sage_setup-9999::sage-on-gentoo failed (unpack phase):
...</code></pre>
François Bissey promptly
<a href="https://github.com/cschwan/sage-on-gentoo/commit/cc57aef4021bf673d02d20bd483b2708f9336f63">fixed</a>
<code>tar</code> failure! Unfortunately that did not fix the <code>pickle</code> bug in
<code>sagemath-standard-9999</code> for me and it started failing just like <code>sagemath-standard-10.3</code>:
<pre><code># emerge -av1 sagemath-standard
...
AttributeError: Can't pickle local object '_prepare_extension_detection.<locals>.<lambda>'</code></pre>
Surprisingly not everyone saw that problem and some people were able to
build the packages just fine. Day later I explored where
<code>_prepare_extension_detection</code> comes from and I was able to find a
<a href="https://github.com/cschwan/sage-on-gentoo/issues/783#issuecomment-2095500118">surprising workaround</a>:
I needed to uninstall completely unrelated <code>scikit-build-core</code> <code>python</code>
package that happened to be present in my system. <code>scikit-build-core</code> is
not used by <code>sagemath-standard</code> neither directly nor indirectly. But
somehow <a href="https://github.com/scikit-build/scikit-build-core/blob/f6ed5a28fc85e621b03d984011d17def888ee0db/src/scikit_build_core/setuptools/build_cmake.py#L183">it’s code</a> injected
the extra attributes to <code>cmake</code>-based package builds and failed the build.
At least I finally got <code>sage</code> tool in my <code>$PATH</code>!
I took me two evenings to get <code>sagemath</code> to build. At last I could look
at the crash now.
<h2 id="exploring-the-crash">Exploring the crash</h2>
<code>sage</code> is a python program. It has a default handler that executes <code>gdb</code>
at crash time. Unfortunately it does not work on <code>Gentoo</code>:
<pre><code>Attaching gdb to process id 1023.
Traceback (most recent call last):
  File "/usr/lib/python-exec/python3.12/cysignals-CSI", line 225, in <module>
    main(args)
  File "/usr/lib/python-exec/python3.12/cysignals-CSI", line 174, in main
    trace = run_gdb(args.pid, not args.nocolor)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python-exec/python3.12/cysignals-CSI", line 98, in run_gdb
    stdout, stderr = cmd.communicate(gdb_commands(pid, color))
                                     ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python-exec/python3.12/cysignals-CSI", line 71, in gdb_commands
    with open(script, 'rb') as f:
         ^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '/usr/lib/python-exec/python3.12/../share/cysignals/cysignals-CSI-helper.py'</code></pre>
Missing <code>cysignals-CSI-helper.py</code> file at the expected location is a
case of <a href="https://bugs.gentoo.org/927767">packaging error</a>. <code>Gentoo</code> uses
very unusual path to python launcher and breaks too simplistic
<code>argv[0]+"/../share"</code> path construction used in <code>cysignals</code>. Adding a
few more <code>"../../../"</code> should do as a workaround.
I was able to workaround <code>gdb</code> launch failure by attaching <code>gdb</code> to
already running process before pasting the problematic command into the
<code>REPL</code>:
<pre><code># gdb --quiet -p `pgrep sage-ipython`
Attaching to process 1060
[New LWP 1082]
[New LWP 1083]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00007fedba181256 in epoll_wait (epfd=5, events=events@entry=0x7fed57127590, maxevents=maxevents@entry=2,
    timeout=timeout@entry=500) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
30        return SYSCALL_CANCEL (epoll_wait, epfd, events, maxevents, timeout);</code></pre>
To get more details from the crash site I ran <code>continue</code> in <code>gdb</code> and
typed the trigger expression: <code>libgap.AbelianGroup(0,0,0)</code>.
To my surprise I got not a <code>SIGSEGV</code> but a <code>SIGABRT</code>:
<pre><code>(gdb) continue
Continuing.
[Thread 0x7fed56e006c0 (LWP 1083) exited]
[Detaching after vfork from child process 1105]

Thread 1 "sage-ipython" received signal SIGABRT, Aborted.
0x00007fedba0b97a7 in __GI_kill () at ../sysdeps/unix/syscall-template.S:120
120     T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)</code></pre>
The default signal handler for <code>SIGABRT</code> normally crashes the process
and generates a core dump. <code>sagemath</code> installs <code>SIGABRT</code> handler (via
<code>cysignals</code> library) to report and recover from some errors like argument
type errors in the interpreter session.
<code>gdb</code> always intercepts <code>SIGABRT</code> before executing the handler. Thus I
needed to explicitly continue execution in <code>gdb</code> session:
<pre><code>(gdb) continue
Continuing.

Thread 1 "sage-ipython" received signal SIGSEGV, Segmentation fault.
0x00007fed5d13956f in _Py_IsImmortal (op=0x0) at /usr/include/python3.12/object.h:242
242         return _Py_CAST(PY_INT32_T, op->ob_refcnt) < 0;</code></pre>
Yay! I got the <code>SIGSEGV</code>!
A simple <code>NULL</code> dereference. What could be easier to debug? Just check
where it was set to <code>NULL</code> and do something about it, right?
First thing I wondered about is how does <code>SIGABRT</code> handler look like? It
was an idle curiosity. I expected to see some simple global variable tweak.
Alas what I found was <a href="https://github.com/sagemath/cysignals/blob/035ed1605a8741a6f265a55cc682b26ea6e5d1c2/src/cysignals/implementation.c#L279"><code>longjmp()</code></a>:
<pre class="c"><code>static void cysigs_interrupt_handler(int sig)
{
...
    if (cysigs.sig_on_count > 0)
    {
        if (!cysigs.block_sigint && !PARI_SIGINT_block && !custom_signal_is_blocked())
        {
            /* Raise an exception so Python can see it */
            do_raise_exception(sig);

            /* Jump back to sig_on() (the first one if there is a stack) */
            siglongjmp(trampoline, sig);
        }
    }
...
}</code></pre>
One has to be
<a href="https://trofi.github.io/posts/188-grub-0.97-and-gcc-4.9.html">very</a>
<a href="https://trofi.github.io/posts/205-stack-protection-on-mips64.html">careful</a>
with <code>setjmp()</code> / <code>longjmp()</code>.
<h2 id="the-example-setjmp-longjmp-failure-mode">The example <code>setjmp()</code> / <code>longjmp()</code> failure mode</h2>
The <code>C</code> standard has a few constructs that don’t mix well with <code>C</code>
abstract machine. <code>setjmp()</code> / <code>longjmp()</code> is one of those. It’s a
frequent source of subtle and latent bugs.
In theory <code>setjmp()</code> / <code>longjmp()</code> is a portable way to save and restore
the execution at a certain point in the program. It’s a non-local
<code>goto</code>. It is very powerful: you can jump from a few nested calls with
a <code>longjmp()</code> back to where you called <code>setjmp()</code>. In practice one has
to be very careful to get something that works most of the time.
Let’s look at a contrived and <strong>intentionally buggy</strong> example:
<pre class="c"><code>// $ cat a.c
#include <assert.h>
#include <setjmp.h>
#include <stdio.h>

static jmp_buf jb;

__attribute__((noipa)) static int foo(void) {
    int a = 0;
    int r = setjmp(jb); // r = 0 initially, r = 1 after longjmp()

    a += 1; // gets executed twice, but the compiler does not know it!

    if (!r) longjmp(jb, 1); // execute it just once

    return a;
}

int main(void) {
    int r = foo();
    printf("foo() = %d (expect 2)\n", r);
    assert(r == 2);
    return 0;
}</code></pre>
Here we execute <code>foo()</code> function once and use <code>longjmp()</code> to return to
<code>setjmp()</code> (also just once). As a result we should execute <code>a += 1;</code>
statement twice:
<ul>
<li>once during natural <code>foo()</code> execution</li>
<li>once via “hidden” jump via <code>longjmp()</code> call to <code>setjmp()</code>
location.</li>
</ul>
I’m using <code>__attribute__((noipa))</code> to keep <code>foo()</code> from being inlined
into <code>main()</code> to ease <code>foo()</code>’s code exploration.
The test against the unoptimized build confirms that <code>a += 1</code> gets
executed twice:
<pre><code>$ gcc a.c -o a -O0 && ./a
foo() = 2 (expect 2)</code></pre>
Assembly output shows the expected code:
<pre class="asm"><code>; $ gdb ./a
; (gdb) disassemble foo
Dump of assembler code for function foo:
   <+0>:     push   %rbp
   <+1>:     mov    %rsp,%rbp
   <+4>:     sub    $0x10,%rsp
   <+8>:     movl   $0x0,-0x8(%rbp)
   <+15>:    lea    0x2ed4(%rip),%rax        # 0x404040 <jb>
   <+22>:    mov    %rax,%rdi
   <+25>:    call   0x401050 <_setjmp@plt> // setjmp();
   <+30>:    mov    %eax,-0x4(%rbp)
   <+33>:    addl   $0x1,-0x8(%rbp)        // a += 1;
   <+37>:    cmpl   $0x0,-0x4(%rbp)        // if (!r) ...
   <+41>:    jne    0x401195 <foo+63>
   <+43>:    mov    $0x1,%esi
   <+48>:    lea    0x2eb3(%rip),%rax        # 0x404040 <jb>
   <+55>:    mov    %rax,%rdi
   <+58>:    call   0x401060 <longjmp@plt> // longjmp();
   <+63>:    mov    -0x8(%rbp),%eax
   <+66>:    leave
   <+67>:    ret                           // return a;</code></pre>
It’s a linear code with a single explicit branch to skip <code>longjmp()</code> call.
Looks easy?
Now let’s optimize it:
<pre><code>$ gcc a.c -o a -O2 && ./a
foo() = 1 (expect 2)
a: a.c:21: main: Assertion `r == 2' failed.
Aborted (core dumped)</code></pre>
Whoops. <code>foo() = 1</code> output suggests that <code>a</code> was incremented only once.
Let’s look at the generated code to get the idea of what happened:
<pre class="asm"><code>; $ gdb ./a
; (gdb) disassemble foo
Dump of assembler code for function foo:
   <+0>:     sub    $0x8,%rsp
   <+4>:     lea    0x2e85(%rip),%rdi        # 0x404040 <jb>
   <+11>:    call   0x401040 <_setjmp@plt>
   <+16>:    test   %eax,%eax              // if (!r) ...
   <+18>:    je     0x4011ce <foo+30>
   <+20>:    mov    $0x1,%eax              // a = 1;
   <+25>:    add    $0x8,%rsp
   <+29>:    ret                           // return a
   <+30>:    mov    $0x1,%esi
   <+35>:    lea    0x2e66(%rip),%rdi        # 0x404040 <jb>
   <+42>:    call   0x401060 <__longjmp_chk@plt></code></pre>
Nothing prevented <code>gcc</code> from transforming the original <code>foo()</code> into this
simpler form:
<pre class="c"><code>// ...
__attribute__((noipa)) static int foo(void) {
    int r = setjmp(jb); // r = 0 initially, r = 1 after longjmp()
    if (!r) longjmp(jb, 1);
    int a = 1;          // moved `int a = 0;` here and merged into `a += 1;`.
    return a;
}
// ...</code></pre>
Here <code>gcc</code> did quite a bit of code motion:
<ul>
<li><code>gcc</code> moved <code>int a = 0;</code> assignment and an <code>a += 1;</code> increment after
both <code>setjmp()</code> and <code>longjmp()</code>.</li>
<li><code>gcc</code> merged <code>int a = 0;</code>, <code>a += 1;</code> and <code>return a;</code> into a single <code>return 1;</code>.</li>
</ul>
<strong>Note that <code>gcc</code> moved all <code>a</code> manipulation across <code>setjmp()</code> and even
<code>longjmp()</code> boundaries</strong>. <code>setjmp()</code> is expected to be
just a C function for <code>gcc</code>. <code>gcc</code> does not have to have any special
knowledge about control flow semantics of those functions. Thus this
optimization transformation while completely breaking the original
code’s intent is legitimate and expected.
<code>man 3 setjmp</code> covers exactly this case in <code>CAVEATS</code> section as:
<pre><code>CAVEATS
  The compiler may optimize variables into registers, and longjmp() may
  restore the values of other registers in addition to the stack pointer
  and program counter. Consequently, the values of automatic variables
  are unspecified after a call to longjmp() if they meet all the
  following criteria:
  •  they are local to the function that made the corresponding setjmp()
     call;
  •  their values are changed between the calls to setjmp() and
     longjmp(); and
  •  they are not declared as volatile.</code></pre>
It’s up to code author to adhere to these <code>CAVEATS</code>. Would be nice if
the compiler would be able to warn about the violations in simpler cases
though: <code>-flto</code> could expose large chunk of the control flow graph to
the analyzer.
Back to our broken example. One of the possible fixes here is to declare
<code>a</code> as <code>volatile</code>:
<pre class="diff"><code>--- a.c.buggy   2024-05-09 23:14:54.383811692 +0100
+++ a.c 2024-05-10 00:03:53.694219636 +0100
@@ -7,3 +7,3 @@
 __attribute__((noipa)) static int foo(void) {
-    int a = 0;
+    volatile int a = 0;
     int r = setjmp(jb); // r = 0 initially, r = 1 after longjmp()</code></pre>
That would give us the following (correct) example:
<pre class="c"><code>#include <assert.h>
#include <setjmp.h>
#include <stdio.h>

static jmp_buf jb;

__attribute__((noipa)) static int foo(void) {
    volatile int a = 0;
    int r = setjmp(jb); // r = 0 initially, r = 1 after longjmp()

    a += 1; // gets executed twice

    if (!r) longjmp(jb, 1);

    return a;
}

int main(void) {
    int r = foo();
    printf("foo() = %d (expect 2)\n", r);
    assert(r == 2);
}</code></pre>
Running:
<pre><code>$ gcc a.c -o a -O0 && ./a
foo() = 2 (expect 2)

$ gcc a.c -o a -O2 && ./a
foo() = 2 (expect 2)</code></pre>
The execution confirms that both optimizations run the increment twice
as intended. This is the generated code for completeness:
<pre class="asm"><code>; gdb ./a
; (gdb) disassemble foo
Dump of assembler code for function foo:
   <+0>:     sub    $0x18,%rsp
   <+4>:     lea    0x2e85(%rip),%rdi        # 0x404040 <jb>
   <+11>:    movl   $0x0,0xc(%rsp)           // int a = 0;
   <+19>:    call   0x401040 <_setjmp@plt>
   <+24>:    mov    %eax,%edx
   <+26>:    mov    0xc(%rsp),%eax           // load a from stack
   <+30>:    add    $0x1,%eax                // a += 1;
   <+33>:    mov    %eax,0xc(%rsp)           // store a on stack
   <+37>:    test   %edx,%edx
   <+39>:    je     0x4011e2 <foo+50>
   <+41>:    mov    0xc(%rsp),%eax           // load a from stack
   <+45>:    add    $0x18,%rsp
   <+49>:    ret                             // return
   <+50>:    mov    $0x1,%esi
   <+55>:    lea    0x2e52(%rip),%rdi        # 0x404040 <jb>
   <+62>:    call   0x401060 <__longjmp_chk@plt></code></pre>
Note that use of <code>volatile</code> in <code>setjmp()</code> / <code>longjmp()</code> effectively
inhibits completely reasonable compiler optimizations just to make loads
and stores in generated code to match the order written in <code>C</code> source
code for a given function.
I wonder if adding <code>volatile</code> is enough for more heavyweight
optimizations like <code>-flto</code> that enable even more code movement, constant
propagation and stack reuse. The time will tell.
<h2 id="sagemath-use-of-setjmp-longjmp"><code>sagemath</code> use of <code>setjmp()</code> / <code>longjmp()</code></h2>
After I noticed <code>setjmp()</code> / <code>longjmp()</code> use in <code>sagemath</code> I wondered how
many <code>volatile</code> keywords it has in the code where those were used.
To my surprise it had none. That was a good hint in a sense that it
looked like the problem we are dealing with. From that point I was
reasonably sure it’s an application bug and not a <code>gcc</code> bug.
Still, having the concrete corruption evidence would help to clear any
doubts and would help the reporter to craft a fix for <code>sagemath</code>.
The backtrace claimed that the crash happens in
<code>__pyx_pf_4sage_4libs_3gap_7element_19GapElement_Function_2__call__</code>
function:
<pre><code>(gdb) continue
Continuing.

Thread 1 "sage-ipython" received signal SIGSEGV, Segmentation fault.
0x00007feb544b756f in _Py_IsImmortal (op=0x0) at /usr/include/python3.12/object.h:242
242         return _Py_CAST(PY_INT32_T, op->ob_refcnt) < 0;
(gdb) bt
#0  0x00007feb544b756f in _Py_IsImmortal (op=0x0) at /usr/include/python3.12/object.h:242
#1  Py_DECREF (op=0x0) at /usr/include/python3.12/object.h:700
#2  Py_XDECREF (op=0x0) at /usr/include/python3.12/object.h:798
#3  __pyx_pf_4sage_4libs_3gap_7element_19GapElement_Function_2__call__ (
    __pyx_v_self=__pyx_v_self@entry=0x7feb4e274f40,
    __pyx_v_args=__pyx_v_args@entry=(<sage.rings.integer.Integer at remote 0x7feb533456e0>, <sage.rings.integer.Integer at remote 0x7feb4e5dd8c0>, <sage.rings.integer.Integer at remote 0x7feb4e5dd620>))
    at /usr/src/debug/sci-mathematics/sagemath-standard-10.3/sagemath-standard-10.3-python3_12/build/cythonized/sage/libs/gap/element.c:26535
...</code></pre>
<code>"__pyx_"</code> prefix suggests it’s a function generated by
<a href="https://cython.org/"><code>cython</code></a>, a python-style DSL for writing <code>C</code> part
of code to create bindings for <code>C</code> libraries to be used in <code>python</code>
libraries. In this case <code>sagemath</code> created bindings for <code>libgap</code>
discrete algebra library.
The <code>cython</code> code for a function we a <code>SIGSEGV</code> in looked
<a href="https://github.com/sagemath/sage/blob/744939e037a67193e730d7205e612e2d58197fca/src/sage/libs/gap/element.pyx#L2504">this way</a>:
<pre class="python"><code>cdef class GapElement_Function(GapElement):
    # ...
    def __call__(self, *args):
    # ...
        cdef Obj result = NULL
        cdef Obj arg_list
        cdef int n = len(args)

        if n > 0 and n <= 3:
            libgap = self.parent()
            a = [x if isinstance(x, GapElement) else libgap(x) for x in args]

        try:
            sig_GAP_Enter()
            sig_on()
            if n == 0:
                result = GAP_CallFunc0Args(self.value)
            elif n == 1:
                result = GAP_CallFunc1Args(self.value,
                                           (<GapElement>a[0]).value)
            elif n == 2:
                result = GAP_CallFunc2Args(self.value,
                                           (<GapElement>a[0]).value,
                                           (<GapElement>a[1]).value)
            elif n == 3:
                result = GAP_CallFunc3Args(self.value,
                                           (<GapElement>a[0]).value,
                                           (<GapElement>a[1]).value,
                                           (<GapElement>a[2]).value)
            else:
                arg_list = make_gap_list(args)
                result = GAP_CallFuncList(self.value, arg_list)
            sig_off()
        finally:
            GAP_Leave()
        if result == NULL:
            # We called a procedure that does not return anything
            return None
        return make_any_gap_element(self.parent(), result)</code></pre>
Looks very straightforward. No <code>setjmp()</code> in sight yet, or is there?.
WARNING: you are about to scroll through A Lot Of Code. You might want
to skip the bulk it and return later The build system produces
<code>element.c</code> with this contents:
<pre class="c"><code>#define sig_GAP_Enter() {int t = GAP_Enter(); if (!t) sig_error();}
#define GAP_Enter() GAP_Error_Setjmp(); GAP_EnterStack()
#define GAP_Error_Setjmp() \
    (GAP_unlikely(GAP_Error_Prejmp_(__FILE__, __LINE__)) \
     || GAP_Error_Postjmp_(_setjmp(*GAP_GetReadJmpError())))
// ...
static PyObject *__pyx_pf_4sage_4libs_3gap_7element_19GapElement_Function_2__call__(struct __pyx_obj_4sage_4libs_3gap_7element_GapElement_Function *__pyx_v_self, PyObject *__pyx_v_args) {
  Obj __pyx_v_result;
  Obj __pyx_v_arg_list;
  int __pyx_v_n;
  PyObject *__pyx_v_libgap = NULL;
  PyObject *__pyx_v_a = NULL;
  PyObject *__pyx_8genexpr3__pyx_v_x = NULL;
  PyObject *__pyx_r = NULL;
  __Pyx_RefNannyDeclarations
  Py_ssize_t __pyx_t_1;
  int __pyx_t_2;
  int __pyx_t_3;
  PyObject *__pyx_t_4 = NULL;
  PyObject *__pyx_t_5 = NULL;
  PyObject *__pyx_t_6 = NULL;
  int __pyx_t_7;
  PyObject *__pyx_t_8 = NULL;
  PyObject *__pyx_t_9 = NULL;
  PyObject *__pyx_t_10 = NULL;
  Obj __pyx_t_11;
  int __pyx_t_12;
  char const *__pyx_t_13;
  PyObject *__pyx_t_14 = NULL;
  PyObject *__pyx_t_15 = NULL;
  PyObject *__pyx_t_16 = NULL;
  PyObject *__pyx_t_17 = NULL;
  PyObject *__pyx_t_18 = NULL;
  PyObject *__pyx_t_19 = NULL;
  int __pyx_lineno = 0;
  const char *__pyx_filename = NULL;
  int __pyx_clineno = 0;
  __Pyx_RefNannySetupContext("__call__", 1);

  /* "sage/libs/gap/element.pyx":2504
 *             hello from the shell
 *         """
 *         cdef Obj result = NULL             # <<<<<<<<<<<<<<
 *         cdef Obj arg_list
 *         cdef int n = len(args)
 */
  __pyx_v_result = NULL;

  /* "sage/libs/gap/element.pyx":2506
 *         cdef Obj result = NULL
 *         cdef Obj arg_list
 *         cdef int n = len(args)             # <<<<<<<<<<<<<<
 * 
 *         if n > 0 and n <= 3:
 */
  __pyx_t_1 = __Pyx_PyTuple_GET_SIZE(__pyx_v_args); if (unlikely(__pyx_t_1 == ((Py_ssize_t)-1))) __PYX_ERR(0, 2506, __pyx_L1_error)
  __pyx_v_n = __pyx_t_1;

  /* "sage/libs/gap/element.pyx":2508
 *         cdef int n = len(args)
 * 
 *         if n > 0 and n <= 3:             # <<<<<<<<<<<<<<
 *             libgap = self.parent()
 *             a = [x if isinstance(x, GapElement) else libgap(x) for x in args]
 */
  __pyx_t_3 = (__pyx_v_n > 0);
  if (__pyx_t_3) {
  } else {
    __pyx_t_2 = __pyx_t_3;
    goto __pyx_L4_bool_binop_done;
  }
  __pyx_t_3 = (__pyx_v_n <= 3);
  __pyx_t_2 = __pyx_t_3;
  __pyx_L4_bool_binop_done:;
  if (__pyx_t_2) {

    /* "sage/libs/gap/element.pyx":2509
 * 
 *         if n > 0 and n <= 3:
 *             libgap = self.parent()             # <<<<<<<<<<<<<<
 *             a = [x if isinstance(x, GapElement) else libgap(x) for x in args]
 * 
 */
    __pyx_t_5 = __Pyx_PyObject_GetAttrStr(((PyObject *)__pyx_v_self), __pyx_n_s_parent); if (unlikely(!__pyx_t_5)) __PYX_ERR(0, 2509, __pyx_L1_error)
    __Pyx_GOTREF(__pyx_t_5);
    __pyx_t_6 = NULL;
    __pyx_t_7 = 0;
    #if CYTHON_UNPACK_METHODS
    if (likely(PyMethod_Check(__pyx_t_5))) {
      __pyx_t_6 = PyMethod_GET_SELF(__pyx_t_5);
      if (likely(__pyx_t_6)) {
        PyObject* function = PyMethod_GET_FUNCTION(__pyx_t_5);
        __Pyx_INCREF(__pyx_t_6);
        __Pyx_INCREF(function);
        __Pyx_DECREF_SET(__pyx_t_5, function);
        __pyx_t_7 = 1;
      }
    }
    #endif
    {
      PyObject *__pyx_callargs[2] = {__pyx_t_6, NULL};
      __pyx_t_4 = __Pyx_PyObject_FastCall(__pyx_t_5, __pyx_callargs+1-__pyx_t_7, 0+__pyx_t_7);
      __Pyx_XDECREF(__pyx_t_6); __pyx_t_6 = 0;
      if (unlikely(!__pyx_t_4)) __PYX_ERR(0, 2509, __pyx_L1_error)
      __Pyx_GOTREF(__pyx_t_4);
      __Pyx_DECREF(__pyx_t_5); __pyx_t_5 = 0;
    }
    __pyx_v_libgap = __pyx_t_4;
    __pyx_t_4 = 0;

    /* "sage/libs/gap/element.pyx":2510
 *         if n > 0 and n <= 3:
 *             libgap = self.parent()
 *             a = [x if isinstance(x, GapElement) else libgap(x) for x in args]             # <<<<<<<<<<<<<<
 * 
 *         try:
 */
    { /* enter inner scope */
      __pyx_t_4 = PyList_New(0); if (unlikely(!__pyx_t_4)) __PYX_ERR(0, 2510, __pyx_L8_error)
      __Pyx_GOTREF(__pyx_t_4);
      __pyx_t_5 = __pyx_v_args; __Pyx_INCREF(__pyx_t_5);
      __pyx_t_1 = 0;
      for (;;) {
        {
          Py_ssize_t __pyx_temp = __Pyx_PyTuple_GET_SIZE(__pyx_t_5);
          #if !CYTHON_ASSUME_SAFE_MACROS
          if (unlikely((__pyx_temp < 0))) __PYX_ERR(0, 2510, __pyx_L8_error)
          #endif
          if (__pyx_t_1 >= __pyx_temp) break;
        }
        #if CYTHON_ASSUME_SAFE_MACROS && !CYTHON_AVOID_BORROWED_REFS
        __pyx_t_6 = PyTuple_GET_ITEM(__pyx_t_5, __pyx_t_1); __Pyx_INCREF(__pyx_t_6); __pyx_t_1++; if (unlikely((0 < 0))) __PYX_ERR(0, 2510, __pyx_L8_error)
        #else
        __pyx_t_6 = __Pyx_PySequence_ITEM(__pyx_t_5, __pyx_t_1); __pyx_t_1++; if (unlikely(!__pyx_t_6)) __PYX_ERR(0, 2510, __pyx_L8_error)
        __Pyx_GOTREF(__pyx_t_6);
        #endif
        __Pyx_XDECREF_SET(__pyx_8genexpr3__pyx_v_x, __pyx_t_6);
        __pyx_t_6 = 0;
        __pyx_t_2 = __Pyx_TypeCheck(__pyx_8genexpr3__pyx_v_x, __pyx_ptype_4sage_4libs_3gap_7element_GapElement); 
        if (__pyx_t_2) {
          __Pyx_INCREF(__pyx_8genexpr3__pyx_v_x);
          __pyx_t_6 = __pyx_8genexpr3__pyx_v_x;
        } else {
          __Pyx_INCREF(__pyx_v_libgap);
          __pyx_t_9 = __pyx_v_libgap; __pyx_t_10 = NULL;
          __pyx_t_7 = 0;
          #if CYTHON_UNPACK_METHODS
          if (unlikely(PyMethod_Check(__pyx_t_9))) {
            __pyx_t_10 = PyMethod_GET_SELF(__pyx_t_9);
            if (likely(__pyx_t_10)) {
              PyObject* function = PyMethod_GET_FUNCTION(__pyx_t_9);
              __Pyx_INCREF(__pyx_t_10);
              __Pyx_INCREF(function);
              __Pyx_DECREF_SET(__pyx_t_9, function);
              __pyx_t_7 = 1;
            }
          }
          #endif
          {
            PyObject *__pyx_callargs[2] = {__pyx_t_10, __pyx_8genexpr3__pyx_v_x};
            __pyx_t_8 = __Pyx_PyObject_FastCall(__pyx_t_9, __pyx_callargs+1-__pyx_t_7, 1+__pyx_t_7);
            __Pyx_XDECREF(__pyx_t_10); __pyx_t_10 = 0;
            if (unlikely(!__pyx_t_8)) __PYX_ERR(0, 2510, __pyx_L8_error)
            __Pyx_GOTREF(__pyx_t_8);
            __Pyx_DECREF(__pyx_t_9); __pyx_t_9 = 0;
          }
          __pyx_t_6 = __pyx_t_8;
          __pyx_t_8 = 0;
        }
        if (unlikely(__Pyx_ListComp_Append(__pyx_t_4, (PyObject*)__pyx_t_6))) __PYX_ERR(0, 2510, __pyx_L8_error)
        __Pyx_DECREF(__pyx_t_6); __pyx_t_6 = 0;
      }
      __Pyx_DECREF(__pyx_t_5); __pyx_t_5 = 0;
      __Pyx_XDECREF(__pyx_8genexpr3__pyx_v_x); __pyx_8genexpr3__pyx_v_x = 0;
      goto __pyx_L12_exit_scope;
      __pyx_L8_error:;
      __Pyx_XDECREF(__pyx_8genexpr3__pyx_v_x); __pyx_8genexpr3__pyx_v_x = 0;
      goto __pyx_L1_error;
      __pyx_L12_exit_scope:;
    } /* exit inner scope */
    __pyx_v_a = ((PyObject*)__pyx_t_4);
    __pyx_t_4 = 0;

    /* "sage/libs/gap/element.pyx":2508
 *         cdef int n = len(args)
 * 
 *         if n > 0 and n <= 3:             # <<<<<<<<<<<<<<
 *             libgap = self.parent()
 *             a = [x if isinstance(x, GapElement) else libgap(x) for x in args]
 */
  }

  /* "sage/libs/gap/element.pyx":2512
 *             a = [x if isinstance(x, GapElement) else libgap(x) for x in args]
 * 
 *         try:             # <<<<<<<<<<<<<<
 *             sig_GAP_Enter()
 *             sig_on()
 */
  /*try:*/ {

    /* "sage/libs/gap/element.pyx":2513
 * 
 *         try:
 *             sig_GAP_Enter()             # <<<<<<<<<<<<<<
 *             sig_on()
 *             if n == 0:
 */
    sig_GAP_Enter();

    /* "sage/libs/gap/element.pyx":2514
 *         try:
 *             sig_GAP_Enter()
 *             sig_on()             # <<<<<<<<<<<<<<
 *             if n == 0:
 *                 result = GAP_CallFunc0Args(self.value)
 */
    __pyx_t_7 = sig_on(); if (unlikely(__pyx_t_7 == ((int)0))) __PYX_ERR(0, 2514, __pyx_L14_error)

    /* "sage/libs/gap/element.pyx":2515
 *             sig_GAP_Enter()
 *             sig_on()
 *             if n == 0:             # <<<<<<<<<<<<<<
 *                 result = GAP_CallFunc0Args(self.value)
 *             elif n == 1:
 */
    switch (__pyx_v_n) {
      case 0:

      /* "sage/libs/gap/element.pyx":2516
 *             sig_on()
 *             if n == 0:
 *                 result = GAP_CallFunc0Args(self.value)             # <<<<<<<<<<<<<<
 *             elif n == 1:
 *                 result = GAP_CallFunc1Args(self.value,
 */
      __pyx_v_result = GAP_CallFunc0Args(__pyx_v_self->__pyx_base.value);

      /* "sage/libs/gap/element.pyx":2515
 *             sig_GAP_Enter()
 *             sig_on()
 *             if n == 0:             # <<<<<<<<<<<<<<
 *                 result = GAP_CallFunc0Args(self.value)
 *             elif n == 1:
 */
      break;
      case 1:

      /* "sage/libs/gap/element.pyx":2519
 *             elif n == 1:
 *                 result = GAP_CallFunc1Args(self.value,
 *                                            (<GapElement>a[0]).value)             # <<<<<<<<<<<<<<
 *             elif n == 2:
 *                 result = GAP_CallFunc2Args(self.value,
 */
      if (unlikely(!__pyx_v_a)) { __Pyx_RaiseUnboundLocalError("a"); __PYX_ERR(0, 2519, __pyx_L14_error) }
      __pyx_t_4 = __Pyx_GetItemInt_List(__pyx_v_a, 0, long, 1, __Pyx_PyInt_From_long, 1, 0, 1); if (unlikely(!__pyx_t_4)) __PYX_ERR(0, 2519, __pyx_L14_error)
      __Pyx_GOTREF(__pyx_t_4);

      /* "sage/libs/gap/element.pyx":2518
 *                 result = GAP_CallFunc0Args(self.value)
 *             elif n == 1:
 *                 result = GAP_CallFunc1Args(self.value,             # <<<<<<<<<<<<<<
 *                                            (<GapElement>a[0]).value)
 *             elif n == 2:
 */
      __pyx_v_result = GAP_CallFunc1Args(__pyx_v_self->__pyx_base.value, ((struct __pyx_obj_4sage_4libs_3gap_7element_GapElement *)__pyx_t_4)->value);
      __Pyx_DECREF(__pyx_t_4); __pyx_t_4 = 0;

      /* "sage/libs/gap/element.pyx":2517
 *             if n == 0:
 *                 result = GAP_CallFunc0Args(self.value)
 *             elif n == 1:             # <<<<<<<<<<<<<<
 *                 result = GAP_CallFunc1Args(self.value,
 *                                            (<GapElement>a[0]).value)
 */
      break;
      case 2:

      /* "sage/libs/gap/element.pyx":2522
 *             elif n == 2:
 *                 result = GAP_CallFunc2Args(self.value,
 *                                            (<GapElement>a[0]).value,             # <<<<<<<<<<<<<<
 *                                            (<GapElement>a[1]).value)
 *             elif n == 3:
 */
      if (unlikely(!__pyx_v_a)) { __Pyx_RaiseUnboundLocalError("a"); __PYX_ERR(0, 2522, __pyx_L14_error) }
      __pyx_t_4 = __Pyx_GetItemInt_List(__pyx_v_a, 0, long, 1, __Pyx_PyInt_From_long, 1, 0, 1); if (unlikely(!__pyx_t_4)) __PYX_ERR(0, 2522, __pyx_L14_error)
      __Pyx_GOTREF(__pyx_t_4);

      /* "sage/libs/gap/element.pyx":2523
 *                 result = GAP_CallFunc2Args(self.value,
 *                                            (<GapElement>a[0]).value,
 *                                            (<GapElement>a[1]).value)             # <<<<<<<<<<<<<<
 *             elif n == 3:
 *                 result = GAP_CallFunc3Args(self.value,
 */
      if (unlikely(!__pyx_v_a)) { __Pyx_RaiseUnboundLocalError("a"); __PYX_ERR(0, 2523, __pyx_L14_error) }
      __pyx_t_5 = __Pyx_GetItemInt_List(__pyx_v_a, 1, long, 1, __Pyx_PyInt_From_long, 1, 0, 1); if (unlikely(!__pyx_t_5)) __PYX_ERR(0, 2523, __pyx_L14_error)
      __Pyx_GOTREF(__pyx_t_5);

      /* "sage/libs/gap/element.pyx":2521
 *                                            (<GapElement>a[0]).value)
 *             elif n == 2:
 *                 result = GAP_CallFunc2Args(self.value,             # <<<<<<<<<<<<<<
 *                                            (<GapElement>a[0]).value,
 *                                            (<GapElement>a[1]).value)
 */
      __pyx_v_result = GAP_CallFunc2Args(__pyx_v_self->__pyx_base.value, ((struct __pyx_obj_4sage_4libs_3gap_7element_GapElement *)__pyx_t_4)->value, ((struct __pyx_obj_4sage_4libs_3gap_7element_GapElement *)__pyx_t_5)->value);
      __Pyx_DECREF(__pyx_t_4); __pyx_t_4 = 0;
      __Pyx_DECREF(__pyx_t_5); __pyx_t_5 = 0;

      /* "sage/libs/gap/element.pyx":2520
 *                 result = GAP_CallFunc1Args(self.value,
 *                                            (<GapElement>a[0]).value)
 *             elif n == 2:             # <<<<<<<<<<<<<<
 *                 result = GAP_CallFunc2Args(self.value,
 *                                            (<GapElement>a[0]).value,
 */
      break;
      case 3:

      /* "sage/libs/gap/element.pyx":2526
 *             elif n == 3:
 *                 result = GAP_CallFunc3Args(self.value,
 *                                            (<GapElement>a[0]).value,             # <<<<<<<<<<<<<<
 *                                            (<GapElement>a[1]).value,
 *                                            (<GapElement>a[2]).value)
 */
      if (unlikely(!__pyx_v_a)) { __Pyx_RaiseUnboundLocalError("a"); __PYX_ERR(0, 2526, __pyx_L14_error) }
      __pyx_t_5 = __Pyx_GetItemInt_List(__pyx_v_a, 0, long, 1, __Pyx_PyInt_From_long, 1, 0, 1); if (unlikely(!__pyx_t_5)) __PYX_ERR(0, 2526, __pyx_L14_error)
      __Pyx_GOTREF(__pyx_t_5);

      /* "sage/libs/gap/element.pyx":2527
 *                 result = GAP_CallFunc3Args(self.value,
 *                                            (<GapElement>a[0]).value,
 *                                            (<GapElement>a[1]).value,             # <<<<<<<<<<<<<<
 *                                            (<GapElement>a[2]).value)
 *             else:
 */
      if (unlikely(!__pyx_v_a)) { __Pyx_RaiseUnboundLocalError("a"); __PYX_ERR(0, 2527, __pyx_L14_error) }
      __pyx_t_4 = __Pyx_GetItemInt_List(__pyx_v_a, 1, long, 1, __Pyx_PyInt_From_long, 1, 0, 1); if (unlikely(!__pyx_t_4)) __PYX_ERR(0, 2527, __pyx_L14_error)
      __Pyx_GOTREF(__pyx_t_4);

      /* "sage/libs/gap/element.pyx":2528
 *                                            (<GapElement>a[0]).value,
 *                                            (<GapElement>a[1]).value,
 *                                            (<GapElement>a[2]).value)             # <<<<<<<<<<<<<<
 *             else:
 *                 arg_list = make_gap_list(args)
 */
      if (unlikely(!__pyx_v_a)) { __Pyx_RaiseUnboundLocalError("a"); __PYX_ERR(0, 2528, __pyx_L14_error) }
      __pyx_t_6 = __Pyx_GetItemInt_List(__pyx_v_a, 2, long, 1, __Pyx_PyInt_From_long, 1, 0, 1); if (unlikely(!__pyx_t_6)) __PYX_ERR(0, 2528, __pyx_L14_error)
      __Pyx_GOTREF(__pyx_t_6);

      /* "sage/libs/gap/element.pyx":2525
 *                                            (<GapElement>a[1]).value)
 *             elif n == 3:
 *                 result = GAP_CallFunc3Args(self.value,             # <<<<<<<<<<<<<<
 *                                            (<GapElement>a[0]).value,
 *                                            (<GapElement>a[1]).value,
 */
      __pyx_v_result = GAP_CallFunc3Args(__pyx_v_self->__pyx_base.value, ((struct __pyx_obj_4sage_4libs_3gap_7element_GapElement *)__pyx_t_5)->value, ((struct __pyx_obj_4sage_4libs_3gap_7element_GapElement *)__pyx_t_4)->value, ((struct __pyx_obj_4sage_4libs_3gap_7element_GapElement *)__pyx_t_6)->value);
      __Pyx_DECREF(__pyx_t_5); __pyx_t_5 = 0;
      __Pyx_DECREF(__pyx_t_4); __pyx_t_4 = 0;
      __Pyx_DECREF(__pyx_t_6); __pyx_t_6 = 0;

      /* "sage/libs/gap/element.pyx":2524
 *                                            (<GapElement>a[0]).value,
 *                                            (<GapElement>a[1]).value)
 *             elif n == 3:             # <<<<<<<<<<<<<<
 *                 result = GAP_CallFunc3Args(self.value,
 *                                            (<GapElement>a[0]).value,
 */
      break;
      default:

      /* "sage/libs/gap/element.pyx":2530
 *                                            (<GapElement>a[2]).value)
 *             else:
 *                 arg_list = make_gap_list(args)             # <<<<<<<<<<<<<<
 *                 result = GAP_CallFuncList(self.value, arg_list)
 *             sig_off()
 */
      __pyx_t_11 = __pyx_f_4sage_4libs_3gap_7element_make_gap_list(__pyx_v_args); if (unlikely(__pyx_t_11 == ((Obj)NULL))) __PYX_ERR(0, 2530, __pyx_L14_error)
      __pyx_v_arg_list = __pyx_t_11;

      /* "sage/libs/gap/element.pyx":2531
 *             else:
 *                 arg_list = make_gap_list(args)
 *                 result = GAP_CallFuncList(self.value, arg_list)             # <<<<<<<<<<<<<<
 *             sig_off()
 *         finally:
 */
      __pyx_v_result = GAP_CallFuncList(__pyx_v_self->__pyx_base.value, __pyx_v_arg_list);
      break;
    }

    /* "sage/libs/gap/element.pyx":2532
 *                 arg_list = make_gap_list(args)
 *                 result = GAP_CallFuncList(self.value, arg_list)
 *             sig_off()             # <<<<<<<<<<<<<<
 *         finally:
 *             GAP_Leave()
 */
    sig_off();
  }

  /* "sage/libs/gap/element.pyx":2534
 *             sig_off()
 *         finally:
 *             GAP_Leave()             # <<<<<<<<<<<<<<
 *         if result == NULL:
 *             # We called a procedure that does not return anything
 */
  /*finally:*/ {
    /*normal exit:*/{
      GAP_Leave();
      goto __pyx_L15;
    }
    __pyx_L14_error:;
    /*exception exit:*/{
      __Pyx_PyThreadState_declare
      __Pyx_PyThreadState_assign
      __pyx_t_14 = 0; __pyx_t_15 = 0; __pyx_t_16 = 0; __pyx_t_17 = 0; __pyx_t_18 = 0; __pyx_t_19 = 0;
      __Pyx_XDECREF(__pyx_t_10); __pyx_t_10 = 0;
      __Pyx_XDECREF(__pyx_t_4); __pyx_t_4 = 0;
      __Pyx_XDECREF(__pyx_t_5); __pyx_t_5 = 0;
      __Pyx_XDECREF(__pyx_t_6); __pyx_t_6 = 0;
      __Pyx_XDECREF(__pyx_t_8); __pyx_t_8 = 0;
      __Pyx_XDECREF(__pyx_t_9); __pyx_t_9 = 0;
      if (PY_MAJOR_VERSION >= 3) __Pyx_ExceptionSwap(&__pyx_t_17, &__pyx_t_18, &__pyx_t_19);
      if ((PY_MAJOR_VERSION < 3) || unlikely(__Pyx_GetException(&__pyx_t_14, &__pyx_t_15, &__pyx_t_16) < 0)) __Pyx_ErrFetch(&__pyx_t_14, &__pyx_t_15, &__pyx_t_16);
      __Pyx_XGOTREF(__pyx_t_14);
      __Pyx_XGOTREF(__pyx_t_15);
      __Pyx_XGOTREF(__pyx_t_16);
      __Pyx_XGOTREF(__pyx_t_17);
      __Pyx_XGOTREF(__pyx_t_18);
      __Pyx_XGOTREF(__pyx_t_19);
      __pyx_t_7 = __pyx_lineno; __pyx_t_12 = __pyx_clineno; __pyx_t_13 = __pyx_filename;
      {
        GAP_Leave();
      }
      if (PY_MAJOR_VERSION >= 3) {
        __Pyx_XGIVEREF(__pyx_t_17);
        __Pyx_XGIVEREF(__pyx_t_18);
        __Pyx_XGIVEREF(__pyx_t_19);
        __Pyx_ExceptionReset(__pyx_t_17, __pyx_t_18, __pyx_t_19);
      }
      __Pyx_XGIVEREF(__pyx_t_14);
      __Pyx_XGIVEREF(__pyx_t_15);
      __Pyx_XGIVEREF(__pyx_t_16);
      __Pyx_ErrRestore(__pyx_t_14, __pyx_t_15, __pyx_t_16);
      __pyx_t_14 = 0; __pyx_t_15 = 0; __pyx_t_16 = 0; __pyx_t_17 = 0; __pyx_t_18 = 0; __pyx_t_19 = 0;
      __pyx_lineno = __pyx_t_7; __pyx_clineno = __pyx_t_12; __pyx_filename = __pyx_t_13;
      goto __pyx_L1_error;
    }
    __pyx_L15:;
  }

  /* "sage/libs/gap/element.pyx":2535
 *         finally:
 *             GAP_Leave()
 *         if result == NULL:             # <<<<<<<<<<<<<<
 *             # We called a procedure that does not return anything
 *             return None
 */
  __pyx_t_2 = (__pyx_v_result == NULL);
  if (__pyx_t_2) {

    /* "sage/libs/gap/element.pyx":2537
 *         if result == NULL:
 *             # We called a procedure that does not return anything
 *             return None             # <<<<<<<<<<<<<<
 *         return make_any_gap_element(self.parent(), result)
 * 
 */
    __Pyx_XDECREF(__pyx_r);
    __pyx_r = Py_None; __Pyx_INCREF(Py_None);
    goto __pyx_L0;

    /* "sage/libs/gap/element.pyx":2535
 *         finally:
 *             GAP_Leave()
 *         if result == NULL:             # <<<<<<<<<<<<<<
 *             # We called a procedure that does not return anything
 *             return None
 */
  }

  /* "sage/libs/gap/element.pyx":2538
 *             # We called a procedure that does not return anything
 *             return None
 *         return make_any_gap_element(self.parent(), result)             # <<<<<<<<<<<<<<
 * 
 * 
 */
  __Pyx_XDECREF(__pyx_r);
  __pyx_t_4 = __Pyx_PyObject_GetAttrStr(((PyObject *)__pyx_v_self), __pyx_n_s_parent); if (unlikely(!__pyx_t_4)) __PYX_ERR(0, 2538, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_4);
  __pyx_t_5 = NULL;
  __pyx_t_12 = 0;
  #if CYTHON_UNPACK_METHODS
  if (likely(PyMethod_Check(__pyx_t_4))) {
    __pyx_t_5 = PyMethod_GET_SELF(__pyx_t_4);
    if (likely(__pyx_t_5)) {
      PyObject* function = PyMethod_GET_FUNCTION(__pyx_t_4);
      __Pyx_INCREF(__pyx_t_5);
      __Pyx_INCREF(function);
      __Pyx_DECREF_SET(__pyx_t_4, function);
      __pyx_t_12 = 1;
    }
  }
  #endif
  {
    PyObject *__pyx_callargs[2] = {__pyx_t_5, NULL};
    __pyx_t_6 = __Pyx_PyObject_FastCall(__pyx_t_4, __pyx_callargs+1-__pyx_t_12, 0+__pyx_t_12);
    __Pyx_XDECREF(__pyx_t_5); __pyx_t_5 = 0;
    if (unlikely(!__pyx_t_6)) __PYX_ERR(0, 2538, __pyx_L1_error)
    __Pyx_GOTREF(__pyx_t_6);
    __Pyx_DECREF(__pyx_t_4); __pyx_t_4 = 0;
  }
  __pyx_t_4 = ((PyObject *)__pyx_f_4sage_4libs_3gap_7element_make_any_gap_element(__pyx_t_6, __pyx_v_result)); if (unlikely(!__pyx_t_4)) __PYX_ERR(0, 2538, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_4);
  __Pyx_DECREF(__pyx_t_6); __pyx_t_6 = 0;
  __pyx_r = __pyx_t_4;
  __pyx_t_4 = 0;
  goto __pyx_L0;

  /* "sage/libs/gap/element.pyx":2412
 * 
 * 
 *     def __call__(self, *args):             # <<<<<<<<<<<<<<
 *         """
 *         Call syntax for functions.
 */

  /* function exit code */
  __pyx_L1_error:;
  __Pyx_XDECREF(__pyx_t_4);
  __Pyx_XDECREF(__pyx_t_5);
  __Pyx_XDECREF(__pyx_t_6);
  __Pyx_XDECREF(__pyx_t_8);
  __Pyx_XDECREF(__pyx_t_9);
  __Pyx_XDECREF(__pyx_t_10);
  __Pyx_AddTraceback("sage.libs.gap.element.GapElement_Function.__call__", __pyx_clineno, __pyx_lineno, __pyx_filename);
  __pyx_r = NULL;
  __pyx_L0:;
  __Pyx_XDECREF(__pyx_v_libgap);
  __Pyx_XDECREF(__pyx_v_a);
  __Pyx_XDECREF(__pyx_8genexpr3__pyx_v_x);
  __Pyx_XGIVEREF(__pyx_r);
  __Pyx_RefNannyFinishContext();
  return __pyx_r;
}</code></pre>
<code>cython</code> does a great job annotating generated code with corresponding
source code. Very useful! While it’s a lot of boilerplate the code is
straightforward.
A few important facts:
<ol type="1">
<li><code>sig_GAP_Enter()</code> is our <code>setjmp()</code> in disguise of a few
macros defined at the beginning of the file. You can’t see <code>longjmp()</code>
calls here but they are lurking in various <code>GAP_CallFunc3Args()</code> calls.</li>
<li>There is a ton of local variables being updated in this file. And
none of them has <code>volatile</code> annotations.</li>
</ol>
The above strongly suggests we are hitting a <code>volatile</code> <code>setjmp()</code>
<code>CAVEAT</code>. But how to find out for sure?
<h2 id="nailing-down-the-specific-variable-corruption">Nailing down the specific variable corruption</h2>
The general “missing <code>volatile</code>” hint is a bit abstract. Is it easy to
find out what variable is being corrupted here? The <code>gdb</code> is
surprisingly useful here. Let’s look at our <code>SIGSEGV</code> again:
<pre><code>(gdb) bt
#0  0x00007feb544b756f in _Py_IsImmortal (op=0x0) at /usr/include/python3.12/object.h:242
#1  Py_DECREF (op=0x0) at /usr/include/python3.12/object.h:700
#2  Py_XDECREF (op=0x0) at /usr/include/python3.12/object.h:798
#3  __pyx_pf_4sage_4libs_3gap_7element_19GapElement_Function_2__call__ (
    __pyx_v_self=__pyx_v_self@entry=0x7feb4e274f40,
    __pyx_v_args=__pyx_v_args@entry=(<sage.rings.integer.Integer at remote 0x7feb533456e0>, <sage.rings.integer.Integer at remote 0x7feb4e5dd8c0>, <sage.rings.integer.Integer at remote 0x7feb4e5dd620>))
    at /usr/src/debug/sci-mathematics/sagemath-standard-10.3/sagemath-standard-10.3-python3_12/build/cythonized/sage/libs/gap/element.c:26535
#4  0x00007feb544b84e7 in __pyx_pw_4sage_4libs_3gap_7element_19GapElement_Function_3__call__ (
    __pyx_v_self=<sage.libs.gap.element.GapElement_Function at remote 0x7feb4e274f40>,
    __pyx_args=(<sage.rings.integer.Integer at remote 0x7feb533456e0>, <sage.rings.integer.Integer at remote 0x7feb4e5dd8c0>, <sage.rings.integer.Integer at remote 0x7feb4e5dd620>),
    __pyx_kwds=<optimized out>)
    at /usr/src/debug/sci-mathematics/sagemath-standard-10.3/sagemath-standard-10.3-python3_12/build/cythonized/sage/libs/gap/element.c:26105</code></pre>
Our function is in <code>frame 3</code> right now. Let’s look at it:
<pre><code>(gdb) fr 3
#3  __pyx_pf_4sage_4libs_3gap_7element_19GapElement_Function_2__call__ (
    __pyx_v_self=__pyx_v_self@entry=0x7feb4e274f40,
    __pyx_v_args=__pyx_v_args@entry=(<sage.rings.integer.Integer at remote 0x7feb533456e0>, <sage.rings.integer.Integer at remote 0x7feb4e5dd8c0>, <sage.rings.integer.Integer at remote 0x7feb4e5dd620>))
    at /usr/src/debug/sci-mathematics/sagemath-standard-10.3/sagemath-standard-10.3-python3_12/build/cythonized/sage/libs/gap/element.c:26535
list
26535         __Pyx_XDECREF(__pyx_t_6); __pyx_t_6 = 0;</code></pre>
<code>__Pyx_XDECREF()</code> looks promising.
<pre><code>(gdb) list
26530         __Pyx_PyThreadState_assign
26531         __pyx_t_14 = 0; __pyx_t_15 = 0; __pyx_t_16 = 0; __pyx_t_17 = 0; __pyx_t_18 = 0; __pyx_t_19 = 0;
26532         __Pyx_XDECREF(__pyx_t_10); __pyx_t_10 = 0;
26533         __Pyx_XDECREF(__pyx_t_4); __pyx_t_4 = 0;
26534         __Pyx_XDECREF(__pyx_t_5); __pyx_t_5 = 0;
26535         __Pyx_XDECREF(__pyx_t_6); __pyx_t_6 = 0;
26536         __Pyx_XDECREF(__pyx_t_8); __pyx_t_8 = 0;
26537         __Pyx_XDECREF(__pyx_t_9); __pyx_t_9 = 0;
26538         if (PY_MAJOR_VERSION >= 3) __Pyx_ExceptionSwap(&__pyx_t_17, &__pyx_t_18, &__pyx_t_19);
26539         if ((PY_MAJOR_VERSION < 3) || unlikely(__Pyx_GetException(&__pyx_t_14, &__pyx_t_15, &__pyx_t_16) < 0)) __Pyx_ErrFetch(&__pyx_t_14, &__pyx_t_15, &__pyx_t_16);</code></pre>
<code>gdb</code> says that crash happens at line
<code>26535 __Pyx_XDECREF(__pyx_t_6); __pyx_t_6 = 0;</code>. Thus <code>__pyx_t_6 = 0;</code>
is our primary suspect.
To trace the life of <code>__pyx_t_6</code> in assembly code I built <code>element.c</code>
with <code>-S -fverbose-asm</code> flags and got this tiny snippet of variable
reference:
<pre class="asm"><code># /usr/include/python3.12/object.h:242:     return _Py_CAST(PY_INT32_T, op->ob_refcnt) < 0;
    .loc 5 242 12 is_stmt 0 view .LVU66168
    movq -200(%rbp), %rdx # %sfp, r
    movq (%rdx), %rax # __pyx_t_6_10(ab)->D.11083.ob_refcnt, _1070</code></pre>
It’s the first few instructions of <code>__Pyx_XDECREF()</code> implementation.
The main take away here is that <code>__pyx_t_6</code> is stored on stack at the
address <code>%rbp-200</code>. We can trace all the updates at that location and
see what is missing.
Before doing that I navigated to the beginning of crashing
<code>__pyx_pf_4sage_4libs_3gap_7element_19GapElement_Function_2__call__</code>
function as:
<pre><code>$ gdb -p `pgrep sage-ipython`
(gdb) break __pyx_pf_4sage_4libs_3gap_7element_19GapElement_Function_2__call__
(gdb) continue

    # trigger break with with ` libgap.AbelianGroup(0,0,0)`

(gdb) disassemble
Dump of assembler code for function __pyx_pf_4sage_4libs_3gap_7element_19GapElement_Function_2__call__:
=> 0x00007f4ed9981b60 <+0>:     push   %rbp
   0x00007f4ed9981b61 <+1>:     mov    %rsp,%rbp</code></pre>
And then executed first two instructions to initialize <code>%rbp</code>
register (as our variable lives at a constant offset from <code>%rbp</code> on
stack):
<pre><code>(gdb) nexti
(gdb) nexti
(gdb) disassemble
Dump of assembler code for function __pyx_pf_4sage_4libs_3gap_7element_19GapElement_Function_2__call__:
   0x00007f4ed9981b60 <+0>:     push   %rbp
   0x00007f4ed9981b61 <+1>:     mov    %rsp,%rbp
=> 0x00007f4ed9981b64 <+4>:     push   %r15</code></pre>
Now I set the watch point at stack memory where we expect <code>__pyx_t_6</code> to
reside:
<pre><code>(gdb) print $rbp-200
$2 = (void *) 0x7ffd2824c5e8

(gdb) watch *(int*)(void *) 0x7ffd2824c5e8
Hardware watchpoint 2: *(int*)(void *) 0x7ffd2824c5e8

(gdb) continue
Continuing.

Thread 1 "sage-ipython" hit Hardware watchpoint 2: *(int*)(void *) 0x7ffd2824c5e8

Old value = 673498624
New value = 0
0x00007f98e609d2a8 in __pyx_pf_4sage_4libs_3gap_7element_19GapElement_Function_2__call__ (
    __pyx_v_self=__pyx_v_self@entry=0x7f98dfe70dc0,
    __pyx_v_args=__pyx_v_args@entry=(<sage.rings.integer.Integer at remote 0x7f98e4722c40>, <sage.rings.integer.Integer at remote 0x7f98dfe7afd0>, <sage.rings.integer.Integer at remote 0x7f98e01dd5c0>))
    at /usr/src/debug/sci-mathematics/sagemath-standard-10.3/sagemath-standard-10.3-python3_12/build/cythonized/sage/libs/gap/element.c:26192
26192       __pyx_t_6 = NULL;</code></pre>
This break is our initial <code>__pyx_t_6 = NULL;</code> store. Nothing unusual.
Moving to the next update:
<pre><code>(gdb) continue
Continuing.

Thread 1 "sage-ipython" hit Hardware watchpoint 2: *(int*)(void *) 0x7ffd2824c5e8

Old value = 0
New value = -538669696
__Pyx_GetItemInt_List_Fast (wraparound=0, boundscheck=1, i=2,
    o=[<sage.libs.gap.element.GapElement_Integer at remote 0x7f98e0ac5c00>, <sage.libs.gap.element.GapElement_Integer at remote 0x7f98dfe4b500>, <sage.libs.gap.element.GapElement_Integer at remote 0x7f98dfe48d80>])
    at /usr/src/debug/sci-mathematics/sagemath-standard-10.3/sagemath-standard-10.3-python3_12/build/cythonized/sage/libs/gap/element.c:38070
38070           Py_INCREF(r);</code></pre>
Here we create an object and increment a reference via <code>__pyx_t_6</code> address.
Moving on:
<pre><code>(gdb) continue
Continuing.

Thread 1 "sage-ipython" received signal SIGABRT, Aborted.
0x00007f99428617a7 in __GI_kill () at ../sysdeps/unix/syscall-template.S:120
120     T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)</code></pre>
Got to <code>SIGABRT</code> signal, <code>longjmp()</code> is about to be called. Moving on.
<pre><code>(gdb) continue
Continuing.

Thread 1 "sage-ipython" received signal SIGSEGV, Segmentation fault.
0x00007f98e609c56f in _Py_IsImmortal (op=0x0) at /usr/include/python3.12/object.h:242
242         return _Py_CAST(PY_INT32_T, op->ob_refcnt) < 0;</code></pre>
We arrived at a final <code>SIGSEGV</code> signal.
Curiously we see only two stores (one <code>NULL</code> store and one
non-<code>NULL</code> store) at inspected address. Both are before the <code>SIGABRT</code>
signal (and thus <code>longjmp()</code> call). I was initially afraid that stack
got corrupted by stack reuse in nested functions. But it’s not the case
and our case is a lot simpler than it could be. Phew.
This session allowed me to finally understand the control flow happening
here.
<h2 id="sagemath-breakage-mechanics"><code>sagemath</code> breakage mechanics</h2>
Armed with <code>__pyx_t_6</code> <code>runtime</code> behaviour I think I got the properties
<code>__pyx_pf_4sage_4libs_3gap_7element_19GapElement_Function_2__call__</code> had
to force <code>gcc</code> into <code>SIGSEGV</code>.
Slightly oversimplified function has the following form:
<pre class="c"><code>__pyx_pf_4sage_4libs_3gap_7element_19GapElement_Function_2__call__() {
    __pyx_t_6 = NULL;

    int mode = setjmp(jb);

    switch (mode) {
      case 1:                                  // longjmp() case
        break;
      case 0:                                  // regular case (`case 3:` in real code)
        __pyx_t_6 = something_else();          // set __pyx_t_6 to non-zero
        int done = use_post_setjmp(__pyx_t_6); // call longjmp(jb, 1) here
        __pyx_t_6 = NULL;
        break;
    }

    // get here via longjmp() or via natural execution flow
    //
    // Note: natural execution flow always gets here with `__pyx_t_6 = NULL`.
    if (__pyx_t_6 != NULL) deref(__pyx_t_6);
}</code></pre>
Similar to our contrived example we save the function state at the
beginning of a function and return back to it from <code>use_post_setjmp()</code>
helper after we changed the value of a local variable.
<code>__pyx_t_6</code> is not marked <code>volatile</code>. <code>gcc</code> noticed that <code>__pyx_t_6</code> is
always expected to be <code>NULL</code> when it reaches the final statement . And
<code>gcc</code> optimizes the function code into the following equivalent:
<pre class="c"><code>__pyx_pf_4sage_4libs_3gap_7element_19GapElement_Function_2__call__() {
    __pyx_t_6 = NULL;

    int mode = setjmp(jb);

    switch (mode) {
      case 1:                                  // longjmp() case
        break;
      case 0:                                  // regular case (`case 3:` in real code)
        __pyx_t_6 = something_else();          // set __pyx_t_6 to non-zero
        int done = use_post_setjmp(__pyx_t_6); // call longjmp(jb, 1) here
        __pyx_t_6 = NULL;
        break;
    }

    // Constant-fold `__pyx_t_6 = NULL` in `deref()` call site
    if (__pyx_t_6 != NULL) deref(NULL);
}</code></pre>
Note that <code>gcc</code> did not eliminate the <code>if (__pyx_t_6 != NULL)</code> even
though the condition is always expected to be <code>false</code> (Richard explains
it <a href="https://gcc.gnu.org/PR114872#c24">here</a>).
The presence of <code>longjmp()</code> effectively skips the
<code>__pyx_t_6 = NULL;</code> assignment and executes <code>if (__pyx_t_6 != NULL) deref(NULL);</code>.
That leads to <code>SIGSEGV</code>.
Phew. We finally have the breakage mechanics caused by <code>gcc</code> code motion.
This bug is known for a while in <code>sagemath</code> bug tracker as an
<a href="https://github.com/sagemath/sage/issues/37026">Issue#37026</a>.
Fixing it will probably require adding <code>volatile</code> marking to quite a few
temporary variables in <code>.pyx</code> files of <code>sagemath</code>. Or alternatively
avoid the <code>setjmp()</code> / <code>longjmp()</code> pattern in inner functions if
feasible. They are just too large to be auditable for <code>setjmp()</code>
constraints.
<h2 id="bonus-how-setjmp-longjmp-works-in-glibc">Bonus: how <code>setjmp()</code> / <code>longjmp()</code> works in <code>glibc</code></h2>
Let’s have a look what <code>glibc</code> does <code><S-Del></code> on <code>x86_64</code> at
<a href="https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86_64/setjmp.S;h=40807e73b51c362464730212d7c973848be612bf;hb=HEAD"><code>sysdeps/x86_64/setjmp.S</code></a>:
<pre class="asm"><code>ENTRY (__sigsetjmp)
        /* Save registers.  */
        movq %rbx, (JB_RBX*8)(%rdi)
#ifdef PTR_MANGLE
        mov %RBP_LP, %RAX_LP
        PTR_MANGLE (%RAX_LP)
        mov %RAX_LP, (JB_RBP*8)(%rdi)
#else
        movq %rbp, (JB_RBP*8)(%rdi)
#endif
        movq %r12, (JB_R12*8)(%rdi)
        movq %r13, (JB_R13*8)(%rdi)
        movq %r14, (JB_R14*8)(%rdi)
        movq %r15, (JB_R15*8)(%rdi)
        lea 8(%rsp), %RDX_LP    /* Save SP as it will be after we return.  */
#ifdef PTR_MANGLE
        PTR_MANGLE (%RDX_LP)
#endif
        movq %rdx, (JB_RSP*8)(%rdi)
        mov (%rsp), %RAX_LP     /* Save PC we are returning to now.  */
        LIBC_PROBE (setjmp, 3, LP_SIZE@%RDI_LP, -4@%esi, LP_SIZE@%RAX_LP)
#ifdef PTR_MANGLE
        PTR_MANGLE (%RAX_LP)
#endif
        movq %rax, (JB_PC*8)(%rdi)

        /* Make a tail call to __sigjmp_save; it takes the same args.  */
        jmp __sigjmp_save
END (__sigsetjmp)</code></pre>
<code>%rdi</code> is our <code>env</code> parameter in <code>int setjmp(jmp_buf env);</code> signature.
<code>__sigsetjmp</code> does a few things:
<ul>
<li>is saves <code>r12</code>, <code>r13</code>, <code>r14</code>, <code>r15</code>, <code>rbx</code>, <code>rsp</code>, <code>rbp</code> registers into
<code>jmp_buf env</code> parameter. These registers happen to be all <code>callee-save</code>
registers on <code>System V x86_64 ABI</code>. Any <code>ABI</code>-conformant <code>C</code> function
does the same if it plans to use these registers locally.</li>
<li>the only twist is that some of the stack-related registers are
obfuscated with <code>PTR_MANGLE</code> macro. The macro mixes in stack canary
into the value.</li>
<li><code>__sigsetjmp</code> also saves return address as <code>mov (%rsp), %RAX_LP</code> to be
able to resume from it later.</li>
<li><code>__sigsetjmp</code> also handles shadow call stack if that exists, I skipped
it entirely for simplicity</li>
</ul>
Importantly <code>__sigsetjmp</code> does not save any stack contents. It assumes
that <code>gcc</code> and the code author make sure it does not get corrupted on
return.
The <code>longjmp()</code> recovery is a mirror-image of <code>setjmp()</code>
in <a href="https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86_64/__longjmp.S;h=22fedc49970eba0ab86cb8dec8411197bec95d37;hb=HEAD"><code>sysdeps/x86_64/__longjmp.S</code></a>. I’ll skip
pasting <code>longjmp()</code> implementation here.
<h2 id="bonus-2--wclobbered-is-able-to-detect-some-of-the-clobber-cases">Bonus 2: <code>-Wclobbered</code> is able to detect some of the clobber cases</h2>
Alexander Monakov <a href="https://fosstodon.org/@amonakov@mastodon.gamedev.place/112429019564307874">pointed out</a>
that <code>gcc</code> actually has an option to detect some of the clobbering cases
with <code>-Wclobbered</code>.
On a contrived example it reports nothing:
<pre><code>$ gcc -O2 -c example.c -Wclobbered -Wall -Wextra -W</code></pre>
It’s unfortunate as we can clearly see the different output.
But on the real <code>element.c</code> it complains as:
<pre><code>$ gcc -O2 -Wclobbered -c element.i
...
element.c: In function '__pyx_pf_4sage_4libs_3gap_7element_19GapElement_Function_2__call__':
element.c:26122:13: warning: variable '__pyx_v_libgap' might be clobbered by 'longjmp' or 'vfork' [-Wclobbered]
element.c:26123:13: warning: variable '__pyx_v_a' might be clobbered by 'longjmp' or 'vfork' [-Wclobbered]
element.c:26125:13: warning: variable '__pyx_r' might be clobbered by 'longjmp' or 'vfork' [-Wclobbered]
element.c:26130:13: warning: variable '__pyx_t_4' might be clobbered by 'longjmp' or 'vfork' [-Wclobbered]
element.c:26131:13: warning: variable '__pyx_t_5' might be clobbered by 'longjmp' or 'vfork' [-Wclobbered]
element.c:26132:13: warning: variable '__pyx_t_6' might be clobbered by 'longjmp' or 'vfork' [-Wclobbered]
element.c:26134:13: warning: variable '__pyx_t_8' might be clobbered by 'longjmp' or 'vfork' [-Wclobbered]
element.c:26135:13: warning: variable '__pyx_t_9' might be clobbered by 'longjmp' or 'vfork' [-Wclobbered]
element.c:26136:13: warning: variable '__pyx_t_10' might be clobbered by 'longjmp' or 'vfork' [-Wclobbered]
element.c:38074:19: warning: variable 'r' might be clobbered by 'longjmp' or 'vfork' [-Wclobbered]
element.c:38074:19: warning: variable 'r' might be clobbered by 'longjmp' or 'vfork' [-Wclobbered]</code></pre>
And that includes our <code>__pyx_t_6</code> variable!
<h2 id="parting-words">Parting words</h2>
Python ecosystem did not fully update to latest <code>python</code> release:
<code>python-3.12</code> was out in October 2023 and many packages did not yet
catch up with it. That makes it tedious to fix a long tail of issues
like <code>sagemath</code> crashes. Also makes it hard to reproduce issues on some
distributions that follow upstream packages without much patching.
While it was a bit unfortunate that <code>sagemath</code> is not packaged in
<code>::gentoo</code> I was pleasantly surprised by the responsive maintainer of
<code>::sage-on-gentoo</code> overlay. I managed to build <code>sagemath-standard</code> with
their help relatively quickly.
<code>scikit-build-core</code> package is invasive and is able to break the build
of packages that don’t import it directly or indirectly, like
<a href="https://github.com/cschwan/sage-on-gentoo/issues/783"><code>sagemath-standard</code></a>.
<code>cysignals</code> in <code>::gentoo</code> has a <a href="https://bugs.gentoo.org/927767">broken path</a>
to the <code>gdb</code> helper which makes it more tedious to debug <code>sagemath</code>
crashes.
<code>cysignals</code> uses <code>SIGABRT</code> and <code>setjmp()</code> / <code>longjmp()</code> to recover from
errors. <code>sagemath</code> decided to use it in <code>C</code> bindings. Unfortunately it
does not mix well with <code>cython</code>’s use of local variables and leads to
broken code.
I was lucky to look at the <code>SIGABRT</code> handler first to notice <code>longjmp()</code>
there. If <code>gdb</code> hook would not fail for a wrong <code>share</code> path I would
take me a lot more time to discover <code>setjmp()</code> clue.
We were lucky that <code>gcc</code> did not delete <code>NULL</code>-dereference code and
generated something that <code>SIGSEGV</code>s on execution. Otherwise it would be
at best resource leak and use-after-free or data corruption at worst.
<code>setjmp()</code> / <code>longjmp()</code> is very hard to use correctly in large
functions. One has to be very careful to sprinkle enough <code>volatile</code>
keywords to inhibit enough compiler optimizations to get expected
semantics from the program.
It was not a compiler bug after all but the <code>sagemath</code> one. It was
interaction between two aspects <code>sagemath</code> used:
<ul>
<li><code>cython</code> usage that generates a ton of local variables it mutates
along the way without the <code>volatile</code> annotations</li>
<li>use of <code>cysignals</code>’ <code>sig_GAP_Enter()</code> (aka <code>setjmp()</code>) error recovery
in the same function</li>
</ul>
<code>gcc</code> can detect some of the clobber cases with <code>-Wclobbered</code> flag.
Have fun!
</article>
<article>
<h1>gcc-14 bugs, pile 4</h1>
2024-04-13T00:00:00Z
In a few weeks <code>gcc-14</code> branch should see a <code>14.1.0</code> release.
Since <a href="https://trofi.github.io/posts/306-gcc-14-bug-pile-3.html">November 2023</a> I encountered
only 8 more bugs. That makes it 2 bugs per month. 4 time lower rate than
I had last time.
<h2 id="summary">summary</h2>
Bugs I saw (in discovery order):
<ul>
<li><a href="https://gcc.gnu.org/PR112711">tree-optimization/112711</a>: wrong code
on <code>llvm-16</code>, <code>bswap</code> and <code>assume(aligned)</code>.</li>
<li><a href="https://gcc.gnu.org/PR112869">c++/112869</a>: <code>ICE</code> on <code>libmpt-0.7.3</code>
when built with <code>-std=c++20</code>.</li>
<li><a href="https://gcc.gnu.org/PR112991">tree-optimization/112991</a>: <code>ICE</code> on
<code>p7zip</code> due to value numbering bugs.</li>
<li><a href="https://gcc.gnu.org/PR113132">bootstrap/113132</a>: bootstrap build
failure due to <code>-Werror</code>.</li>
<li><a href="https://gcc.gnu.org/PR113445">bootstrap/113445</a>: bootstrap comparison
failure due to instruction scheduler changes.</li>
<li><a href="https://gcc.gnu.org/PR114249">tree-optimization/114249</a>: <code>ICE</code> on
<code>lvm2</code> (wrong type transform in <code>SLP</code>).</li>
<li><a href="https://gcc.gnu.org/PR114439">c++/114439</a>: <code>icu4c</code> build failure due
to the initialization changes in <code>c++</code>.</li>
<li><a href="https://gcc.gnu.org/PR114574">lto/114574</a>: <code>unbound</code> ICE in <code>-flto</code>
mode.</li>
</ul>
<h2 id="fun-bug">fun bug</h2>
Only one of 8 bug was a runtime failure on <code>llvm-16</code> in
<code>__builtin_assume_aligned()</code> handling code. It’s extracted from
<code>EndiaTest.cpp</code>:
<pre class="c"><code>// $ cat EndianTest.cpp
typedef          int i32;
typedef unsigned int u32;

static inline void write_i32(void *memory, i32 value) {
  // swap i32 bytes as if it was u32:
  u32 u_value = value;
  value = __builtin_bswap32(u_value);

  // llvm infers '1' alignment from destination type
  __builtin_memcpy(__builtin_assume_aligned(memory, 1), &value, sizeof(value));
}

__attribute__((noipa))
static void bug (void) {
  #define assert_eq(lhs, rhs) if (lhs != rhs) __builtin_trap()

  unsigned char data[5];
  write_i32(data, -1362446643);
  assert_eq(data[0], 0xAE);
  assert_eq(data[1], 0xCA);
  write_i32(data + 1, -1362446643);
  assert_eq(data[1], 0xAE);
}

int main() {
    bug();
}</code></pre>
The optimization breaks simple <code>store-32</code> / <code>load-8</code> / <code>compare</code>
sequence:
<pre><code>$ gcc/xg++ -Bgcc EndianTest.cpp -o bug -O0 && ./bug
$ gcc/xg++ -Bgcc EndianTest.cpp -o bug -O2 && ./bug
Illegal instruction (core dumped)</code></pre>
There <code>gcc</code> was too optimistic in assuming that
<code>__builtin_assume_aligned()</code> returns address not aliased to input
argument. As a result <code>gcc</code> erroneously removed dead-looking stores.
<a href="https://gcc.gnu.org/git/?p=gcc.git;a=commitdiff;h=302461ad9a04d82fee904bddac69811d13d5bb6a">The fix</a>
drops overly optimistic assumption.
<h2 id="histograms">histograms</h2>
<ul>
<li><code>tree-optimization</code>: 3</li>
<li><code>c++</code>: 2</li>
<li><code>bootstrap</code>: 2</li>
<li><code>lto</code>: 1</li>
</ul>
We are back to the usual <code>tree-optimization</code> at the top of subsystems
where bugs lurked.
<h2 id="parting-words">parting words</h2>
This cycle felt easier than the previous three. It feels like <code>gcc-14</code>
is ready for release.
Since the time <code>gcc-13</code> was released about a year ago I found 53 bugs
(past piles
<a href="https://trofi.github.io/posts/291-gcc-14-bugs-pile-1.html">1</a>,
<a href="https://trofi.github.io/posts/296-gcc-14-bugs-pile-2.html">2</a>,
<a href="https://trofi.github.io/posts/306-gcc-14-bug-pile-3.html">3</a>).
This is about one bug a week.
I think the most impactful <code>gcc-14</code> change will probably be
<code>-Werror=implicit-int</code>, <code>-Werror=implicit-function-declaration</code> and
friends (see <a href="https://gcc.gnu.org/gcc-14/porting_to.html" class="uri">https://gcc.gnu.org/gcc-14/porting_to.html</a> for more
details). This will break quite a few old unmaintained but used
everywhere C projects like <code>zip</code>, <code>opensp</code>, <code>jam</code>, <code>directfb</code>.
Have fun!
</article>
<article>
<h1>a libpam bug</h1>
2024-03-30T00:00:00Z
Over a past few months I updated <code>libpam</code> <code>nixpkgs</code> package a few times.
I broke it at least three times:
<ul>
<li><a href="https://github.com/NixOS/nixpkgs/pull/266828">fix clobbered patched <code>Makefile.in</code></a></li>
<li><a href="https://github.com/NixOS/nixpkgs/pull/281182">fix deleted <code>required pam_lastlog</code> module</a></li>
<li><a href="https://github.com/NixOS/nixpkgs/pull/298994">fix broken empty password handling</a></li>
</ul>
It’s not great. But at least each breakage is slightly different.
Exploring it gives me some insight into how <code>libpam</code> works.
Let’s have a look at the most recent breakage related to empty passwords.
<h2 id="actual-password-handling-bug">actual password handling bug</h2>
In <a href="https://github.com/NixOS/nixpkgs/issues/297920"><code>nixpkgs</code> issue#297920</a>
Thomas M. DuBuisson reported that users with empty passwords no longer
work on <code>NixOS</code>.
An example <code>/etc/nixos/configuration.nix</code> snippet to create such an user
is:
<pre class="nix"><code>{ ... }:
{
  users.users.test = {
    isNormalUser = true;
    password = "";
    extraGroups = [ ];
  };
}</code></pre>
After an update to <code>libpam-1.6.0</code> <code>login</code> did not work any more:
<pre><code>login: test
Password:

Login incorrect</code></pre>
Why did it break? Let’s have a look at how passwords are stored in <code>linux</code>.
<h2 id="etcshadow-structure"><code>/etc/shadow</code> structure</h2>
On <code>linux</code> desktops hashed user passwords are usually stored in
<code>/etc/shadow</code>:
<pre><code># cat /etc/shadow
...
nobody:!:1::::::
nulloktest:$6$D0XkOSlH$fWuW6/7aFD5ZD2YzBuerj0STra3LddBNoXMn5pomYRmdbmsjM6bGzIX7nQQS4bGepDBoao2U.IZRGhgAJ4qOp.:1::::::</code></pre>
<a href="https://manpages.opensuse.org/Tumbleweed/shadow/shadow.5.en.html"><code>man 5 shadow</code></a>
from <code>shadow</code> package has more detail on what each field means.
As hashed passwords are still sensitive information <code>/etc/shadow</code> is not
readable by most users:
<pre><code>$ ls -l /etc/shadow
-rw-r----- 1 root shadow 3454 Mar 29 07:03 /etc/shadow</code></pre>
Programs that check passwords are usually ran as <code>root</code> or are
<code>SUID</code>-root themselves.
<h2 id="hash-structure">hash structure</h2>
<code>$6$D0XkOSlH$fWuW6/7aFD5ZD2YzBuerj0STra3LddBNoXMn5pomYRmdbmsjM6bGzIX7nQQS4bGepDBoao2U.IZRGhgAJ4qOp.</code>
is not just a hash string of a well known hash algorithm. It used to be.
But not any more.
<a href="https://manpages.opensuse.org/Tumbleweed/libxcrypt-devel/crypt.5.en.html"><code>man 5 crypt</code></a>
from <code>libxcrypt</code> package explains the string format in detail. It’s 10
pages long!
The main takeaway from that page is that there are a few ways to encode
empty (“” password) and blank (no password prompt) passwords for a user:
<pre><code>nulloktest:$6$D0XkOSlH$fWuW6/7aFD5ZD2YzBuerj0STra3LddBNoXMn5pomYRmdbmsjM6bGzIX7nQQS4bGepDBoao2U.IZRGhgAJ4qOp.:1::::::
nulloktest::1::::::</code></pre>
Hashed version (first) accepts only empty(““) password while empty version
(second) does not prompt for a password at all.
This hashed version happens to use <code>SHA512</code> algorithm (<code>$6$</code> prefix). You
can have other encodings as well: by varying used hash salt for <code>SHA512</code>
algorithm or by switching the hashing algorithm (for example to
<code>yescrypt</code>.)
<h2 id="libpam-overview"><code>libpam</code> overview</h2>
<code>PAM</code> stands for “Pluggable Authentication Modules”. It allows programs
to embed user authentication without having to deal with the specifics of
accessing <code>/etc/shadow</code> contents directly. <code>libpam</code> allows you to
completely change authentication back end via <code>/etc/pam.d/</code>
configuration to use non-<code>shadow</code> mechanisms instead. I’ll focus on
<code>shadow</code> here and will avoid any configuration aspects.
Well known <code>libpam</code> users are <code>sudo</code> and <code>login</code> programs. Those are
expected to be ran as <code>root</code> or gain <code>root</code> via <code>SUID</code> <code>root</code> bit on the
binaries. <code>login</code> can authenticate any user in the system while <code>sudo</code>
probably only needs to authenticate current user.
Fun fact: normally you need <code>root</code> (or <code>shadow</code>) privileges to read
<code>/etc/shadow</code> for password verification purposes.
But <code>libpam</code> can be used for non-root programs as well! The typical
example is the session screen locker like <code>swaylock</code> or <code>i3lock</code>: those
lock your screen until you type current user’s password. How do they
manage to validate current user’s password without having a <code>SUID</code>
<code>root</code> bit? <code>libpam</code> does it by by calling external <code>unix_chkpwd</code>
<code>SUID</code> <code>root</code> program from <code>libpam</code> package:
<pre><code>$ ls -l `which unix_chkpwd`
-r-s--x--x 1 root root 63472 Mar 29 04:41 /run/wrappers/bin/unix_chkpwd</code></pre>
No magic unfortunately :)
<h2 id="back-to-password-handling-bug">back to password handling bug</h2>
In <a href="https://github.com/linux-pam/linux-pam/issues/758"><code>libpam</code> issue#758</a>
Chris Severance narrowed the regression down to
<a href="https://github.com/linux-pam/linux-pam/commit/b3020da7da384d769f27a8713257fbe1001878be">the <code>pam_unix/passverify</code> change</a>.
Here are two diagrams to show the effect of that change.
Before the change <code>libpam-1.5.3</code> password checking diagram looked this way:

After the change <code>libpam-1.6.0</code> password checking diagram looks this way:

To phrase it in words: <code>libpam-1.5.3</code> used <code>unix_chkpwd</code> external helper
program only if <code>program</code> was not already ran as <code>root</code>. <code>libpam-1.6.0</code>
in contrast always uses <code>unix_chkpwd</code>.
In theory both versions should work the same.
In practice <code>unix_chkpwd</code> disallows empty passwords (bug) but allows the
blank ones (feature).
Mechanically the bug happens as <code>unix_chkpwd</code> adds
<a href="https://github.com/linux-pam/linux-pam/commit/9e74e90147c530801e3ea3428d64371722c90e01">this extra bit of code</a>:
<pre class="diff"><code>--- a/modules/pam_unix/passverify.c
+++ b/modules/pam_unix/passverify.c
@@ -1096,6 +1096,12 @@ helper_verify_password(const char *name, const char *p, int nullok)
 	if (pwd == NULL || hash == NULL) {
 		helper_log_err(LOG_NOTICE, "check pass; user unknown");
 		retval = PAM_USER_UNKNOWN;
+	} else if (p[0] == '\0' && nullok) {
+		if (hash[0] == '\0') {
+			retval = PAM_SUCCESS;
+		} else {
+			retval = PAM_AUTH_ERR;
+		}
 	} else {
 		retval = verify_pwd_hash(p, hash, nullok);
 	}</code></pre>
Here <code>helper_verify_password()</code> started rejecting empty passwords with
non-empty hashes present in <code>/etc/shadow</code>. Oops. This change is what
distinguishes theses hashes:
<ul>
<li><code>nulloktest::1::::::</code>: <code>hash[0] == '\0'</code> (hash of length <code>0</code>): accepted</li>
<li><code>nulloktest:$6$D0...</code>: <code>hash[0] != '\0'</code> (hash of length <code>>0</code>): rejected</li>
</ul>
Such an early rejection is a bug. The hash needs to have a chance to be
verified instead.
<code>helper_verify_password()</code> function is only used by external helpers like
<code>unix_chkpwd</code> and was never used by <code>libpam.so</code> itself. That’s why
<code>libpam-1.5.3</code> just worked for <code>login</code>.
The fix is <a href="https://github.com/linux-pam/linux-pam/pull/784">simple</a>:
<pre class="diff"><code>--- a/modules/pam_unix/passverify.c
+++ b/modules/pam_unix/passverify.c
@@ -76,9 +76,13 @@ PAMH_ARG_DECL(int verify_pwd_hash,
 
 	strip_hpux_aging(hash);
 	hash_len = strlen(hash);
-	if (!hash_len) {
+
+	if (p && p[0] == '\0' && !nullok) {
+		/* The passed password is empty */
+		retval = PAM_AUTH_ERR;
+	} else if (!hash_len) {
 		/* the stored password is NULL */
-		if (nullok) { /* this means we've succeeded */
+		if (p && p[0] == '\0' && nullok) { /* this means we've succeeded */
 			D(("user has empty password - access granted"));
 			retval = PAM_SUCCESS;
 		} else {
@@ -1109,12 +1113,6 @@ helper_verify_password(const char *name, const char *p, int nullok)
 	if (pwd == NULL || hash == NULL) {
 		helper_log_err(LOG_NOTICE, "check pass; user unknown");
 		retval = PAM_USER_UNKNOWN;
-	} else if (p[0] == '\0' && nullok) {
-		if (hash[0] == '\0') {
-			retval = PAM_SUCCESS;
-		} else {
-			retval = PAM_AUTH_ERR;
-		}
 	} else {
 		retval = verify_pwd_hash(p, hash, nullok);
 	}</code></pre>
Here we move all the null password checking to a common <code>verify_pwd_hash</code>
code and check for password length and not for hash length to handle
<code>nullok</code> <code>libpam</code> configuration.
<h2 id="debugging-tips">debugging tips</h2>
<code>unix_chkpwd</code> is a <code>SUID</code> <code>root</code> binary. Moreover it changes it’s
behaviour based on actual user calling it. I wanted to find a way to
probe directly it’s ability to validate <code>/etc/shadow</code> contents without
involving external <code>login</code> binary.
On a system with <code>libpam</code> installed password checking session for an
arbitrary user could be done this way:
<pre><code>$ printf "correct-password\0" | sudo unix_chkpwd testuser nullok; echo $?
0
$ printf "incorrect-password\0" | sudo unix_chkpwd testuser nullok; echo $?
7</code></pre>
You need to pass null-terminated password into <code>unix_chkpwd</code>’s <code>stdin</code>
descriptor. The error code will tell you back if the check succeeded or
why it failed.
I needed an equivalent check for a locally built <code>unix_chkpwd</code> from
<code>linux-pam</code> <code>git</code> repository. I came up with this hack:
<pre><code># build unix_chkpwd
$ cd linux-pam
$ ./configure --disable-doc
$ make

# mark unix_chkpwd SUID-root
$ sudo chown root modules/pam_unix/unix_chkpwd
$ sudo chmod 4755 modules/pam_unix/unix_chkpwd

# check if it can authenticate the suer
$ printf "\0" | sudo modules/pam_unix/unix_chkpwd nulloktest nullok; echo $?</code></pre>
This setup allowed me to quickly find the problematic bit using
<code>printf()</code> debugging.
<h2 id="hashed-empty-passwords">hashed empty passwords</h2>
Is it typical to have empty hashed passwords? At least <code>passwd</code> does not
allow you to specify an empty password explicitly:
<pre><code># passwd nulloktest
New password:
Retype new password:
No password has been supplied.
...
passwd: Permission denied
passwd: password unchanged</code></pre>
But we can set a blank password with a <code>-d</code> option:
<pre><code># passwd -d nulloktest
passwd: password changed.

# grep nulloktest /etc/shadow
nulloktest::1::::::</code></pre>
This subtlety was one of the reasons why upstream <code>linux-pam</code> had a hard
time verifying a problem of empty hashed password. So how does <code>NixOS</code>
manage to create empty hashed passwords in the first place?
It uses <code>perl</code> code to call <code>crypt()</code> function from <code>libxcrypt</code> to
generate one. <a href="https://github.com/NixOS/nixpkgs/blob/8f72bd17eae0d1a7fcb63e3f1a3baa7dadebef68/nixos/modules/config/update-users-groups.pl#L37">Actual code</a>:
<pre class="perl"><code>sub hashPassword {
    my ($password) = @_;
    my $salt = "";
    my @chars = ('.', '/', 0..9, 'A'..'Z', 'a'..'z');
    $salt .= $chars[rand 64] for (1..8);
    return crypt($password, '$6$' . $salt . '$');
}</code></pre>
Here the <code>$password</code> passed is an empty string. We can generate a few of
them just for fun:
<pre><code>$ perl -e 'my $salt = ""; my @chars = (".", "/", 0..9, "A".."Z", "a".."z"); $salt .= $chars[rand 64] for (1..8); print(crypt("", "\$6\$" . $salt . "\$") . "\n");'
$6$AgpQ6azd$0YmJW0VFg0FwyPgSW1KSiF8cy5qB8NB/.IcMbjMa1OCbGH3ki9a4bkuhtMxQupeMeiagB86tpW7F/7yOl3Hhc0</code></pre>
<h2 id="parting-words">parting words</h2>
<code>libpam</code> updates are tricky. Every time I update <code>libpam</code> I break
something. This time it was a somewhat benign case of empty password
handling. We need more tests. The empty password fix fix was merged to
<code>libpam</code> as <a href="https://github.com/linux-pam/linux-pam/pull/784">PR#784</a>
and to <code>nixpkgs</code> as <a href="https://github.com/NixOS/nixpkgs/pull/298994">PR#298994</a>.
If you are affected by an empty password bug you can set the password to
a blank hash with <code>hashedPassword = ""</code>. Or set it to blank with
<code>passwd -d $user</code>. Both are equivalent.
<code>libpam-1.6.0</code> executes an external<code>unix_chkpwd</code> binary with <code>SUID</code>
<code>root</code> set to validate passwords.
Blank passwords (empty hash) are subtly different from empty passwords
(present hash of empty string): blank passwords don’t get prompted by
<code>login</code> and friends while empty passwords do. Sometimes this causes the
confusion.
<code>NixOS</code> uses direct <code>crypt()</code> call to <code>libxcrypt</code> to generate
<code>/etc/shadow</code> entries.
<code>libxcrypt</code> supports quite a few hashing algorithms and tweaks on top of
them that control hashing rounds and so on.
<code>linux-pam</code> upstream is very responsive! It’s always a pleasure to send
fixes there.
Have fun!
</article>
<article>
<h1>listing all nixpkgs packages</h1>
2023-12-30T00:00:00Z
<h2 id="intro">Intro</h2>
<code>nixpkgs</code> provides <a href="https://repology.org/repository/nix_unstable">a lot of packages</a>.
Today <code>repology.org</code> says it’s <code>106937</code> packages for <code>89083</code> projects.
As I understand it <code>repology</code>’s <code>project</code> means upstream project name.
If we pick <code>python:networkx</code> <code>repology</code> name then <code>nixpkgs</code> provides a
few versions of <code>networkx</code> for each python version:
<pre><code>$ nix-env -qa | grep networkx
python3.10-networkx-3.1
python3.11-networkx-3.1</code></pre>
But what if I tell you the above number is only a minor subset of package
definitions hiding in <code>nixpkgs</code>? You could easily access packages like
<code>python3.12-networkx-3.1</code>, <code>python3.10-networkx-3.1</code> or even
<code>python3.11-networkx-3.1-riscv64-unknown-linux-gnu</code>. None of them are
listed on <code>repology</code>.
Abundance of various package flavours like this one is a well-known fact
or a seasoned user of <code>nixpkgs</code>.
A few days ago I attempted to update <code>autoconf</code> from <code>2.71</code> to <code>2.72</code>
version. It’s supposed to be a minor maintenance release without
many breaking changes. To make sure I don’t break too much I attempted
to validate that all the packages that somehow use <code>autoconf</code> are still
building correctly.
<h2 id="on-attributes-and-package-names">On attributes and package names</h2>
<code>NixOS</code> and <code>nixpkgs</code> users almost never deal with exact package names:
resolving a package name to package definition is slow and ambiguous.
Instead <code>nixpkgs</code> encourages users to use “attribute names” using
<code>nix</code>-language level constructs.
For example <code>python3.11-networkx-3.1</code> would have a name of
<code>python3Packages.networkx</code> on my system. The same package also has quite
a few aliases:
<ul>
<li><code>python311Packages.networkx</code></li>
<li><code>python3.pkgs.networkx</code></li>
<li><code>python311.pkgs.networkx</code></li>
</ul>
Each of them evaluates to the same package definition. An example
<code>nix repl</code> session to make sure it’s still true:
<pre><code>$ nix repl -f '<nixpkgs>'

nix-repl> python3Packages.networkx
«derivation /nix/store/659allxmdwqxr4zmg03z8wqyizlsdmgh-python3.11-networkx-3.1.drv»

nix-repl> python311Packages.networkx
«derivation /nix/store/659allxmdwqxr4zmg03z8wqyizlsdmgh-python3.11-networkx-3.1.drv»

nix-repl> python311.pkgs.networkx
«derivation /nix/store/659allxmdwqxr4zmg03z8wqyizlsdmgh-python3.11-networkx-3.1.drv</code></pre>
The <code>.drv</code> files have identical hash part which means all the names are
equivalent when used as is.
Simpler examples of attributes are <code>re2c</code> and <code>gnugrep</code>. More complex
ones are <code>python3Packages.ninja</code>, <code>linuxPackages_latest.kernel.configfile</code> and
<code>pkgsCross.riscv64.re2c</code>.
Thus to answer a question of what packages should I test after
<code>autoconf</code> upgrade I would prefer to get attribute names instead of
package names.
<h2 id="poor-mans-reverse-dependency-lookup">Poor man’s reverse dependency lookup</h2>
I routinely do package updates that touch many packages indirectly.
To get a list of impacted packages <code>nixpkgs</code> provides a
<a href="https://github.com/NixOS/nixpkgs/blob/master/maintainers/scripts/rebuild-amount.sh"><code>maintainers/scripts/rebuild-amount.sh</code> script</a>.
It instantiates all the known to hydra attributes into <code>.drv</code> files and
checks changed hashes before and after the change. This diff is our
impact. Script’s typical output looks like that:
<pre><code>$ time ./maintainers/scripts/rebuild-amount.sh --print HEAD^
Estimating rebuild amount by counting changed Hydra jobs (parallel=unset).
     32 x86_64-darwin
     63 x86_64-linux

          asciidoc-full-with-plugins.x86_64-darwin  dist=/nix/store/...-asciidoc-full-with-plugins-10.2.0-dist;/nix/store/...-asciidoc-full-with-plugins-10.2.0
          asciidoc-full-with-plugins.x86_64-linux   dist=/nix/store/...-asciidoc-full-with-plugins-10.2.0-dist;/nix/store/...-asciidoc-full-with-plugins-10.2.0
          asciidoc-full.x86_64-darwin               dist=/nix/store/...-asciidoc-full-10.2.0-dist;/nix/store/...-asciidoc-full-10.2.0
          asciidoc-full.x86_64-linux                dist=/nix/store/...-asciidoc-full-10.2.0-dist;/nix/store/...-asciidoc-full-10.2.0
          auto-multiple-choice.x86_64-darwin        /nix/store/...-auto-multiple-choice-1.6.0
          auto-multiple-choice.x86_64-linux         /nix/store/...-auto-multiple-choice-1.6.0
          bicgl.x86_64-darwin                       /nix/store/...-bicgl-unstable-2018-04-06
          bicgl.x86_64-linux                        /nix/store/...-bicgl-unstable-2018-04-06
          bicpl.x86_64-darwin                       /nix/store/...-bicpl-unstable-2020-10-15
          bicpl.x86_64-linux                        /nix/store/...-bicpl-unstable-2020-10-15
          cantor.x86_64-linux                       /nix/store/...-cantor-23.08.4
          clevis.x86_64-linux                       man=/nix/store/...-clevis-19-man;/nix/store/...-clevis-19
          conglomerate.x86_64-darwin                /nix/store/...-conglomerate-unstable-2017-09-10
          ...
          netpbm.x86_64-darwin                                                                     bin=/nix/store/...-netpbm-11.4.4-bin;dev=/nix/store/...-netpbm-11.4.4-dev;/nix/store/...-netpbm-11.4.4
          netpbm.x86_64-linux                                                                      bin=/nix/store/...-netpbm-11.4.4-bin;dev=/nix/store/...-netpbm-11.4.4-dev;/nix/store/...-netpbm-11.4.4

          ...
...
real    6m38,854s
user    6m19,604s
sys     0m17,102s</code></pre>
Here I changed <code>netpbm</code> package in <code>HEAD</code> commit and that caused the
rebuild of <code>63</code> <code>x86_64-linux</code> packages (and <code>32</code> <code>x86_64-darwin</code> ones).
In this case rebuilding all <code>63</code> of them is not a big deal.
But even here some of the packages are probably not worth testing.
<code>netpbm</code> has something to do with image formats and <code>cleavis</code> is about
the encryption. I would guess <code>cleavis</code> would not be impacted by the
update at all.
It would be nice to find all <strong>direct</strong> users of <code>netpbm</code> instead and
rebuild those.
The very first topic I created on <code>NixOS discourse</code> was about
<a href="https://discourse.nixos.org/t/how-do-you-find-reverse-dependencies/15057">reverse dependencies lookup</a>.
I was a bit surprised there was no standard tool like that and wrote a
hack to do it:
<pre class="nix"><code># use as:
#    import ./arevdeps.nix linuxHeaders pkgs lib
revdepAttr: pkgs: lib:
let isDrv = v: (builtins.tryEval v).success && lib.isDerivation v;
    # skip broken and unsupported packages on this system in a very crude way:
    safeReadFile = df: let c = builtins.tryEval (builtins.readFile df); in if c.success then c.value else "";
    fastHasEntry = i: s: s != builtins.replaceStrings [i] ["<FOUND-HERE>"] s;
    sInDrv = s: d: fastHasEntry s (safeReadFile d.drvPath);
    rdepInDrv = rdep: d: builtins.any (s: sInDrv s d)
                                      (builtins.map (o: rdep.${o}.outPath) rdep.outputs);
    matchedPackages = lib.filterAttrs (n: d: isDrv d && rdepInDrv revdepAttr d)
                                      pkgs;
in builtins.attrNames matchedPackages</code></pre>
It’s a bit wordy (not much <code>nix</code> experience by then) but it’s idea is
simple: find references to a searched package via it’s <code>.drv</code> path by
looking at <code>.drv</code> files created from attributes in <code><nixpkgs></code> object.
Let’s try to look up <code>netpbm</code> against it:
<pre><code>nix-repl> import ./arevdeps.nix netpbm pkgs lib
[ "auto-multiple-choice" "bicpl" "fbcat" "foomatic-db-ppds" "fped"
  "img2pdf" "latex2html" "lilypond" "lilypond-unstable" "mup" "netpbm"
  "pcb" "pnglatex" "sng" "xplanet" "yad" ]</code></pre>
Only 16 packages!
The major caveat is that the hack does not try to descend from the top
level down to other attributes like <code>python3Packages.*</code> or
<code>haskellPackages.*</code>.
In theory you could direct the hack to specific attributes and expand
the output:
<pre><code>nix-repl> import ./arevdeps.nix netpbm pkgs.python3Packages lib
[ "img2pdf" "pnglatex" ]</code></pre>
In practice it’s not very convenient and I never did it. The command
already takes a while to run and running it multiple times is no fun.
I decided to extend initial script to handle nested attributes.
<h2 id="naive-attempt-to-extend-the-hack">Naive attempt to extend the hack</h2>
In theory it’s one small extension: add a tiny amount of code to descend
into child attributes and you are done.
Sounds good, did not work.
The result started crashing on various syntax errors in various
<code>nixpkgs</code> files. When I worked error around <code>nix</code> ate <code>100GB</code> of <code>RAM</code>
and crashed without producing the result.
I’ll spare you the implementation details of a modified script.
Unbounded <code>RAM</code> usage is very unfortunate as the script in theory could
run in constant space. It’s not very simple in practice as <code>nix</code> uses
<a href="https://en.wikipedia.org/wiki/Boehm_garbage_collector"><code>boehm-gc</code></a> to
control it’s heap usage. I’m not sure a single loaded <code><nixpkgs></code> tree
allows for any garbage collection of <code>.drv</code> files.
I filed <a href="https://github.com/NixOS/nix/issues/9671" class="uri">https://github.com/NixOS/nix/issues/9671</a> issue to see if there
are any obvious references <code>nix</code> could remove to make garbage collection
more efficient.
But in the shorter term I had to try something else.
<h2 id="a-step-back-just-list-all-the-attributes">A step back: just list all the attributes</h2>
I realized there are multiple problems with my hack and I attempted to
solve a simple problem. I wanted to just list all the available
attributes in <code><nixpkgs></code>. Ideally not just those known to <code>hydra</code> CI
builder but the ones hiding in <code>pkgsCross</code> in other places.
Quiz question: how hard is it to get a list of such attributes to
explore?
Getting an attribute list of a single set is trivial via single call of
<code>lib.attrNames</code>:
<pre><code>$ nix repl -f '<nixpkgs>'

nix-repl> lib.take 4 (lib.attrNames pkgs)
[ "AAAAAASomeThingsFailToEvaluate" "AMB-plugins" "ArchiSteamFarm" "AusweisApp2" ]

nix-repl> lib.length (lib.attrNames pkgs)
20059</code></pre>
The problem is that some of the attributes are neither derivations not
attribute sets:
<pre><code>nix-repl> pathsFromGraph
/nix/store/jp811zl7njhg1g59x95dgqs4rddgr7xz-source/pkgs/build-support/kernel/paths-from-graph.pl</code></pre>
Luckily it’s easy to introspect value type via predefined predicates:
<pre><code>nix-repl> lib.isPath pkgs.pathsFromGraph
true

nix-repl> lib.isDerivation re2c
true

nix-repl> lib.isAttrs pkgsCross
true</code></pre>
Another problem is that some of attribute values don’t evaluate
successfully. Sometimes intentionally:
<pre><code>nix-repl> pkgs.AAAAAASomeThingsFailToEvaluate
error:
       … while calling the 'throw' builtin

         at /nix/store/jp811zl7njhg1g59x95dgqs4rddgr7xz-source/pkgs/top-level/all-packages.nix:106:36:

          105|   ### Evaluating the entire Nixpkgs naively will fail, make failure fast
          106|   AAAAAASomeThingsFailToEvaluate = throw ''
             |                                    ^
          107|     Please be informed that this pseudo-package is not the only part

       error: Please be informed that this pseudo-package is not the only part
       of Nixpkgs that fails to evaluate. You should not evaluate
       entire Nixpkgs without some special measures to handle failing
       packages, like using pkgs/top-level/release-attrpaths.nix.

nix-repl> pkgs.gccWithoutTargetLibc
error:
       … while evaluating the attribute 'gccWithoutTargetLibc'
        15967|   gccWithoutTargetLibc = assert stdenv.targetPlatform != stdenv.hostPlatform; let
       error: assertion '((stdenv).targetPlatform != (stdenv).hostPlatform)' failed</code></pre>
And sometimes entirely by accident:
<pre><code>nix-repl> pkgsLLVM.clang_6
error:
       … while evaluating the attribute 'clang_6'
       error: attribute 'clangUseLLVM' missing</code></pre>
I just need to filter out all the problematic attributes and leave only
evaluatable ones. <code>nix</code> even provides a <code>builtins.tryEval</code> just for
this case:
<pre><code>nix-repl> builtins.tryEval pkgs.AAAAAASomeThingsFailToEvaluate
{ success = false; value = false; }

nix-repl> builtins.tryEval pkgs.gccWithoutTargetLibc
{ success = false; value = false; }

nix-repl> builtins.tryEval pkgs.gcc
{ success = true; value = «derivation /nix/store/y5vq20420rg2g6h03c8x7sxzjcxphg9w-gcc-wrapper-12.3.0.drv»; }</code></pre>
Sounds easy, right? As always there is a catch:
<pre><code>nix-repl> builtins.tryEval pkgsLLVM.clang_6
error:
       error: attribute 'clangUseLLVM' missing

nix-repl> builtins.tryEval pkgsMusl.adobe-reader
       error: evaluation aborted with the following error message: 'unsupported platform for the pure Linux stdenv'</code></pre>
Not all error types can be caught by <code>builtins.tryEval</code>: only <code>throw</code>
and <code>aasert</code> calls (these are explicitly present in the call) are
catchable. The rest is considered a bug in <code>nix</code> expression and can’t be
caught. I guess it’s the way to signal invalid <code>.nix</code> programs.
Lack of error recovery means that I can’t do attribute filtering like
in a single <code>nix</code> expression! I had 2 options:
<ol type="1">
<li>Write an external script that probes for problematic attributes and
somehow skips them.</li>
<li>Fix all the evaluation errors in <code>nixpkgs</code> to make the naive
filtering work.</li>
</ol>
<code>[1.]</code> would require use of <code>nix</code> as a library in one form or another.
I was lazy and tried <code>[2.]</code> first. My assumption that it was a small
list of easy to fix errors.
Here is my first version of a simple attribute lister:
<pre class="nix"><code># Usage example:
# $ nix-instantiate --eval --strict ~/.config/nixpkgs/lib/all-attrs.nix -I nixpkgs=$PWD

{ nixpkgs ? import <nixpkgs> {
    config = {
    };
  }
, rootAttr ? "pkgs"
, verbose ? 1 # warn

# How to pick, resource usage for me as of 2023-12-28:
# 1 - 10 seconds, ~2GB of RAM
# 2 - 2 minutes, ~25GB of RAM (unfiltered attrs)
# 3 - 5+ minutes, ~70GB+ or RAM Fails on attributes like `pkgsCross.iphone32.ammonite`
# anything else: at your risk
, maxDepth
}:

let
  # simple variables:
  lib = nixpkgs.lib;

  # logging:
  err   = s: e: lib.trace "ERROR: ${s}" e;
  warn  = s: e: if verbose >= 1 then lib.trace "WARN: ${s}" e else e;
  info  = s: e: if verbose >= 2 then lib.trace "INFO: ${s}" e else e;
  debug = s: e: if verbose >= 3 then lib.trace "DEBUG: ${s}" e else e;

  # root to start at
  root = lib.attrByPath (lib.splitString "." rootAttr)
                        (warn "did not find ${rootAttr}" {})
                        nixpkgs;
  # other helpers:
  isPrimitive = v: lib.isFunction v
                || lib.isString v
                || lib.isBool v
                || lib.isList v
                || lib.isInt v
                || lib.isPath v
                || v == null;

  go = depth: ap: v:
    let
      a = lib.showAttrPath ap;
      e = builtins.tryEval v;
      maybe_go_deeper =
        if depth >= maxDepth
        then info "too deep (depth=${toString depth}) nesting of a=${a}, stop" []
        else map (nv: go (depth + 1) (ap ++ [nv.name]) nv.value)
                 (lib.attrsToList v);
    in debug "inspecting ${a}" (
    if !e.success then info "${a} fails to evaluate" []
    else if lib.isDerivation v
    then [a]
    else if lib.isAttrs v then maybe_go_deeper
    else if isPrimitive v then []
    # should not get here
    else warn "unhandled type of ${a}" []);
in lib.flatten (go 0 [] root)</code></pre>
It’s more than a page of code. It explores at most <code>maxDepth</code> attributes
deep.
Usage example:
<pre><code>$ time nix-instantiate --eval --strict ~/.config/nixpkgs/lib/all-attrs.nix -I nixpkgs=$PWD --arg maxDepth 1
...
[ "AMB-plugins" "ArchiSteamFarm" ... "zulu8" "zuo" "zwave-js-server"
  "zx" "zxcvbn-c" "zxfer" "zxing" "zxing-cpp" "zxpy" "zxtune" "zydis"
  "zyn-fusion" "zynaddsubfx" "zynaddsubfx-fltk" "zynaddsubfx-ntk" "zz"
  "zziplib" "zzuf" ]

real    0m6,587s
user    0m5,963s
sys     0m0,608s</code></pre>
Yay! It works! Note: this is just one depth level of attributes, like
<code>re2c</code>. It does not contain packages like <code>python3Packages.ninja</code>. The
run takes 2GB and 6 seconds to complete.
If we want one level deeper we can specify <code>--maxDepth 2</code>:
<pre><code>$ time nix-instantiate --eval --strict ~/.config/nixpkgs/lib/all-attrs.nix -I nixpkgs=$PWD --arg maxDepth 2
...
[ "AMB-plugins" "ArchiSteamFarm" "AusweisApp2" "BeatSaberModManager"
  ...
  "CuboCore.coreaction" "CuboCore.corearchiver" "CuboCore.corefm"
  "CuboCore.coregarage"
  ...
  "__splicedPackages.AMB-plugins" "__splicedPackages.ArchiSteamFarm"
  ...
  "pkgsLLVM.aaaaxy" "pkgsLLVM.aacgain"
  ...
  "pkgsHostTarget.zig_0_11" "pkgsHostTarget.zig_0_9"
  ...
  "zyn-fusion" "zynaddsubfx" "zynaddsubfx-fltk" "zynaddsubfx-ntk" "zz"
  "zziplib" "zzuf" ]

real    1m4,845s
user    0m58,368s
sys     0m5,910s</code></pre>
Second level also works! This time it took 25GB and a bit more than 1
minute to print the result. There are a few issues with it: some
attribute trees like <code>__splicedPackages</code> and <code>pkgsHostTarget</code> are
redundant. We already get attributes like <code>python3Packages.ninja</code>.
But are not quite at <code>pkgsCross.riscv64.re2c</code> yet.
<h2 id="running-the-naive-lister-on-larger-depths">Running the naive lister on larger depths</h2>
I ran the naive script above and derived the following fixes:
<ul>
<li><a href="https://github.com/NixOS/nixpkgs/pull/277117">PR 277117</a>:
<code>netbsd.libcurses</code> constructed invalid type of <code>NIX_CFLAGS_COMPILE</code>.</li>
<li><a href="https://github.com/NixOS/nixpkgs/pull/276984">PR 276984</a>:
<code>beam.packages.erlangR23</code> referred to non-existent <code>erlang_23</code>
attribute.</li>
<li><a href="https://github.com/NixOS/nixpkgs/pull/276985">PR 276985</a>:
<code>coq-kernel.launcher</code> used an alias instead of package name.</li>
<li><a href="https://github.com/NixOS/nixpkgs/pull/276995">PR 276995</a>:
<code>haskell.packages.ghc810.mod</code> used non-existent <code>mod_0_1_2_2</code> attribute
in it’s definition.</li>
<li><a href="https://github.com/NixOS/nixpkgs/pull/276986">PR 276986</a>:
<code>dockerTools.tests.docker-tools</code> used an alias instead of actual name.</li>
<li><a href="https://github.com/NixOS/nixpkgs/pull/277211">PR 277211</a>:
<code>nixosTests.nixops</code> had an unsatisfied function argument.</li>
<li><a href="https://github.com/NixOS/nixpkgs/pull/277355">PR 277355</a>:
<code>stdenv</code> used <code>abort</code>.</li>
<li><a href="https://github.com/NixOS/nixpkgs/pull/277364">PR 277364</a>:
<code>python312Packages.array-record</code> accessed non-existent attribute.</li>
</ul>
Getting 8 bugs just like that impressed me. I optimized lister a bit to
be able to descend into larger attribute depths and found a few more
bugs.
If you did not notice my lister script never attempted to explore any
attributes <strong>within</strong> derivations. If we pick a <code>re2c</code> example the
script would never get to <code>re2c.passthru.updateScript</code>.
And <code>passthru</code> are probably least tested attributes as they rarely used
by <code>hydra</code> CI. The <code>passthry.tests</code> in particular are <strong>not</strong> used by
<code>hydra</code> but are used by <code>ofborg</code> <code>GitHub</code> actions. And the caveat of
<code>ofborg</code> is that it rarely shows test failures as a red cross. Usually
it renders a failure as inconclusive gray. The assumption is that the
reviewer look at the underlying failure and makes a decision.
Thus I added a knob to descend into attributes of derivations
<a href="https://github.com/trofi/nixpkgs-overlays/commit/50ed200dc06ee1b6ec8ad8ca879a9948cc85135e">this way</a>:
<pre class="diff"><code>--- a/lib/all-attrs.nix
+++ b/lib/all-attrs.nix
@@ -35,6 +35,11 @@
 , maxDepth

 , ignoreCross ? true
+
+# Whether to validate every attribute within derivations themselves.
+# Most intereting fields are `passthru.tests`, but sometimes there are
+# very unusual bugs lurking. Risky but very fun!
+, ignoreDrvAttrs ? true
 }:

 let
@@ -76,11 +81,9 @@ let
     in debug "inspecting ${a}" (
     if !e.success then info "${a} fails to evaluate" []
     else if lib.isDerivation v
-    # TODO: add an option to traverse into derivations as well.
-    # Mainly to test validity of `passthru.tests`, `metadata` and
-    # similar.
-    then [a] # TODO: "++ maybe_go_deeper"
+    then [a] ++ lib.optionals (!ignoreDrvAttrs) maybe_go_deeper
     # Skip "foo = self;" attributes like `pythonPackages.pythonPackages`
+    # TODO: might skip too much.
     else if lib.isAttrs v && depth > 0 && lib.hasAttr (lib.last ap) v then info "${a} is a repeated attribute, skipping" []
     else if lib.isAttrs v then maybe_go_deeper
     else if isPrimitive v then []</code></pre>
I also had to add a few ignored paths like <code>nixosTests</code> as they require
around <code>1GB</code> of extra <code>RAM</code> per test(!) and I had to skip <code>pkgsCross</code>
as it derives too many attributes for (still!) naive script to handle.
But even with such a limited lister I managed to get to these bugs:
<ul>
<li><a href="https://github.com/NixOS/nixpkgs/pull/277399">PR#277399</a>:
<code>bazel-watcher.bazel.tests</code> had a <code>optionalSttrs</code> typo instead of
<code>optionalAttrs</code>.</li>
<li><a href="https://github.com/NixOS/nixpkgs/pull/277400">PR#277400</a>:
<code>bitcoind-knots</code> referred to non-existent test.</li>
<li><a href="https://github.com/NixOS/nixpkgs/pull/277402">PR#277402</a>:
<code>cargo</code> tried to pull tests for a package that does not define it.</li>
<li><a href="https://github.com/NixOS/nixpkgs/pull/277404">PR#277404)</a>:
<code>corosync</code> did not specify a test input argument that it used.</li>
<li><a href="https://github.com/NixOS/nixpkgs/pull/277408">PR#277408</a>:
<code>lua-wrapper</code> uses non-existent attributes to define paths.</li>
<li><a href="https://github.com/NixOS/nixpkgs/pull/277420">PR#277420</a>:
<code>displaylink</code> referred to non-existent test.</li>
<li><a href="https://github.com/NixOS/nixpkgs/pull/277434">PR#277434</a>:
<code>gnupg22</code> incorrectly refers to the test suite.</li>
<li><a href="https://github.com/NixOS/nixpkgs/pull/277435">PR#277435</a>:
<code>pisocsope.rules</code> looked <code>writeTextDir</code> in <code>lib</code> instead of <code>pkgs</code>.</li>
<li><a href="https://github.com/NixOS/nixpkgs/pull/277473">PR#277473</a>:
<code>guacamole-client</code> was referring to deleted test.</li>
<li><a href="https://github.com/NixOS/nixpkgs/pull/277474">PR#277474</a>:
<code>mutmut</code> used <code>testers</code> attribute without use.
= <a href="https://github.com/NixOS/nixpkgs/pull/277494">PR#277494</a>:
<code>buildFHSEnv</code> did not fully handle <code>multiPaths = null</code>.</li>
<li><a href="https://github.com/NixOS/nixpkgs/pull/277512">PR#277512</a>:
<code>owncast</code> referred to non-existent test.</li>
<li><a href="https://github.com/NixOS/nixpkgs/pull/277517">PR#277517</a>:
<code>python3Packages.pypaBuildHook.tests</code> test referred non-existent <code>.nix</code>
file.</li>
<li><a href="https://github.com/NixOS/nixpkgs/pull/277543">PR#277543</a>:
<code>pythonInterpreters.pypy39_prebuilt</code> referred to deleted <code>pypy38</code>
attribute, not <code>pypy39</code>.</li>
<li><a href="https://github.com/NixOS/nixpkgs/pull/277580">PR#277580</a>:
<code>tigervnc.tests</code> referred to non-existent test.</li>
<li><a href="https://github.com/NixOS/nixpkgs/pull/277581">PR#277581</a>:
<code>wezterm.tests</code> referred to commented out tests.</li>
<li><a href="https://github.com/NixOS/nixpkgs/pull/277590">PR#277590</a>:
<code>devpod.tests</code> passed incorrect parameter to a test function.</li>
<li><a href="https://github.com/NixOS/nixpkgs/pull/277593">PR#277593</a>:
<code>fakeroot.tests</code> passed incorrect parameter to a test function.</li>
<li><a href="https://github.com/NixOS/nixpkgs/pull/277595">PR#277595</a>:
<code>findup.tests</code> passed incorrect parameter to a test function.</li>
<li><a href="https://github.com/NixOS/nixpkgs/pull/277600">PR#277600</a>:
<code>jellyfin-ffmpeg.tests</code> is missing <code>pkg-config</code> annotation.</li>
<li><a href="https://github.com/NixOS/nixpkgs/pull/277617">PR#277617</a>:
<code>build-support/go</code> code constructed inaccessible <code>vendorSha256</code>
attribute.</li>
<li><a href="https://github.com/NixOS/nixpkgs/pull/277715">PR#277715</a>:
<code>octoprint</code> referred to non-existent attribute in <code>tests</code>.</li>
<li><a href="https://github.com/NixOS/nixpkgs/pull/277741">PR#277741</a>:
<code>pypy2Packages.attrs</code> refers non-existent <code>.nix</code> file.</li>
<li><a href="https://github.com/NixOS/nixpkgs/pull/277751">PR#277751</a>:
<code>python3Packages.openllm</code>: fix <code>passthru</code> dependency references and
fix variable shadowing.</li>
<li><a href="https://github.com/NixOS/nixpkgs/pull/277777">PR#277777</a>:
<code>python3Packages.openllm-client</code>: fix <code>passthru</code> dependency references.</li>
<li><a href="https://github.com/NixOS/nixpkgs/pull/277788">PR#277788</a>:
<code>python3Packages.openllm-core</code>: fix <code>passthru</code> dependency references.</li>
<li><a href="https://github.com/NixOS/nixpkgs/pull/277880">PR#277880</a>:
<code>valhalla</code> was missing <code>pkgConfigModules</code> definition.</li>
<li><a href="https://github.com/NixOS/nixpkgs/pull/277899">PR#277899</a>:
<code>zammad.src.meta</code> failed to evaluate due to incorrect position
assumption: no metadata attributes were defined in the <code>.nix</code> files.</li>
<li><a href="https://github.com/NixOS/nixpkgs/pull/277973">PR#277973</a>:
<code>ruff.tests</code> referred <code>ruff-lsp</code> alias instead of direct name.</li>
<li><a href="https://github.com/NixOS/nixpkgs/pull/277982">PR#277982</a>:
<code>spark.tests</code>: referred to <code>nixosTest</code> alias.</li>
<li><a href="https://github.com/NixOS/nixpkgs/pull/278034">PR#278034</a>:
<code>nixosTests.kernel-generic</code> attempted to use <code>bool</code> value as a kernel
derivation.</li>
<li><a href="https://github.com/NixOS/nixpkgs/pull/278044">PR#278044</a>:
<code>aaxtomp3</code>: fix invalid reference to <code>glibc</code> for non-<code>glibc</code> targets.</li>
<li><a href="https://github.com/NixOS/nixpkgs/pull/278069">PR#278069</a>:
<code>haskell.packages.ghc810</code> refer to non-existent packages.</li>
<li><a href="https://github.com/NixOS/nixpkgs/pull/278074">PR#278074</a>:
<code>haskell.packages.ghc865Binary</code> refer to non-existent packages.</li>
<li><a href="https://github.com/NixOS/nixpkgs/pull/278076">PR#278076</a>:
<code>haskell.packages.ghc98</code> refer to non-existent packages.</li>
<li><a href="https://github.com/NixOS/nixpkgs/pull/278224">PR#278224</a>:
<code>haskell.packages.ghcjs</code> lacks <code>llvmPackages</code> attribute implied by
<code>ghc-8.10</code> packages.</li>
<li><a href="https://github.com/NixOS/nixpkgs/pull/278528">PR#278528</a>:
<code>python3Packages.paddlepaddle</code>: unhandled error in <code>src</code> attribute
dereference.</li>
<li><a href="https://github.com/NixOS/nixpkgs/pull/278915">PR#278915</a>:
<code>nvidia-x11</code> unconditionally refers to <code>/share/</code> even if libraries are
the only enabled bit.</li>
<li><a href="https://github.com/NixOS/nixpkgs/pull/278950">PR#278950</a>:
<code>pythonInterpreters.pypy39_prebuilt</code> failed the <code>test</code> evaluation as
it exposed unhandled <code>pythonAttr = null</code> value. The test expected a
real object.</li>
<li><a href="https://github.com/NixOS/nixpkgs/pull/279018">PR#279018</a>:
<code>systemd.tests.systemd-journal-upload</code> has invalid maintainer
specified.</li>
<li><a href="https://github.com/NixOS/nixpkgs/pull/279404">PR#279404</a>:
<code>llvmPackages.bintools.bintools</code> did not define expected
<code>targetPackages</code>.</li>
<li><a href="https://github.com/NixOS/nixpkgs/pull/279463">PR#279463</a>:
<code>stdenv.adapters</code>: fix <code>overrideLibcxx</code> definition.</li>
<li><a href="https://github.com/NixOS/nixpkgs/pull/280319">PR#280319</a>:
<code>rubyModules</code> defined <code>gemType</code> via non-existent sets.</li>
<li><a href="https://github.com/NixOS/nixpkgs/pull/280470">PR#280470</a>:
<code>pkgsLLVM.dmd</code> accessed non-existent <code>libgcc</code> attribute.</li>
<li><a href="https://github.com/NixOS/nixpkgs/pull/283737">PR#283737</a>:
<code>tests.cross.sanity</code> referred non-existent <code>qt5.qutebrowser</code>.</li>
<li><a href="https://github.com/NixOS/nixpkgs/pull/289397">PR#289397</a>:
<code>nixosTests.keepalived</code> defined <code>maintainers</code> attribute in an
incorrect scope.</li>
<li><a href="https://github.com/NixOS/nixpkgs/pull/292677">PR#292677</a>:
<code>distrobuilder.tests</code> referred renamed test attribute.</li>
<li><a href="https://github.com/NixOS/nixpkgs/pull/292762">PR#292762</a>:
<code>lxc.tests</code> referred renamed test attribute.</li>
<li><a href="https://github.com/NixOS/nixpkgs/pull/295431">PR#295431</a>:
<code>pypy27Packages.pulsar-client</code> dereferenced non-existent attribute.</li>
<li><a href="https://github.com/NixOS/nixpkgs/pull/295448">PR#295448</a>:
<code>ttyd.tests</code> referred unmentioned <code>nixosTests</code>.</li>
<li><a href="https://github.com/NixOS/nixpkgs/pull/296264">PR#296264</a>:
<code>python3.pkgs.openllm-core.optional-dependencies.full</code> referred
renamed attribute.</li>
<li><a href="https://github.com/NixOS/nixpkgs/pull/296497">PR#296497</a>:
<code>apptainer.gpuChecks.saxpy</code> refers the attribute from the wrong place.</li>
<li><a href="https://github.com/NixOS/nixpkgs/pull/305811">PR#305811</a>:
<code>lxd.ui</code> is an export of non-=existent attribute.</li>
<li><a href="https://github.com/NixOS/nixpkgs/pull/305843">PR#305843</a>:
<code>pypy27Packages.pluthon</code> used invalid form of <code>lib.optionals</code> call.</li>
<li><a href="https://github.com/NixOS/nixpkgs/pull/305925">PR#305925</a>:
<code>redlib.tests</code> referred non-existent <code>nixosTests.redlib</code>.</li>
<li><a href="https://github.com/NixOS/nixpkgs/pull/313791">PR#313791</a>:
<code>haskell.packages.ghc865Binary.exceptions</code> referred to missing package.</li>
<li><a href="https://github.com/NixOS/nixpkgs/pull/313792">PR#313792</a>:
<code>haskell.packages.ghcjs.exceptions</code> referred to missing package.</li>
<li><a href="https://github.com/NixOS/nixpkgs/pull/314092">PR#314092</a>:
<code>nextcloud-notify_push.tests</code> referred to missing package.</li>
<li><a href="https://github.com/NixOS/nixpkgs/pull/314109">PR#314109</a>:
<code>githooks.tests</code> used invalid parameter to <code>testers.testVersion</code> helper.</li>
<li><a href="https://github.com/NixOS/nixpkgs/pull/314196">PR#314196</a>:
<code>nixVersions.git.tests</code> used invalid attribute name when defined tests.</li>
<li><a href="https://github.com/NixOS/nixpkgs/pull/317956">PR#317956</a>:
<code>lix.tests</code> failed to evaluate as it tried to use non-existent
attribute.</li>
<li><a href="https://github.com/NixOS/nixpkgs/pull/319518">PR#319518</a>:
<code>ollama.tests</code> used incorrect construct to merge attributes.</li>
<li><a href="https://github.com/NixOS/nixpkgs/pull/325111">PR#325111</a>:
<code>nextcloud-notify_push.tests</code> referred already deleted attribute.</li>
<li><a href="https://github.com/NixOS/nixpkgs/pull/329253">PR#329253</a>:
<code>autoprefixer.tests</code> refers to renamed attribute.</li>
<li><a href="https://github.com/NixOS/nixpkgs/pull/329490">PR#329490</a>:
<code>pypy27Packages.corner.nativeBuildInputs</code> refers non-existent attribute.</li>
<li><a href="https://github.com/NixOS/nixpkgs/pull/329505">PR#329505</a>:
<code>pypy27Packages.ray.optional-dependencies</code> refers to non-existent
attribute.</li>
<li><a href="https://github.com/NixOS/nixpkgs/pull/329511">PR#329511</a>:
<code>python3Packages.pytorch-bin.tests</code> passes non-existent parameters.</li>
<li><a href="https://github.com/NixOS/nixpkgs/pull/329512">PR#329512</a>:
<code>pypy27Packages.pyreqwest-impersonate</code> defines non-optional attributes
as optional.</li>
<li><a href="https://github.com/NixOS/nixpkgs/pull/329515">PR#329515</a>:
<code>varnish60Packages.modules</code> used invalid <code>hash</code> format.</li>
<li><a href="https://github.com/NixOS/nixpkgs/pull/331682">PR#331682</a>:
<code>nixosTests.bittorrent</code> used an alias in package inputs.</li>
<li><a href="https://github.com/NixOS/nixpkgs/pull/332822">PR#332822</a>:
<code>octave.buildEnv</code> called the function with undefined attribute.</li>
<li><a href="https://github.com/NixOS/nixpkgs/pull/337406">PR#337406</a>:
<code>apacheKafka.tests</code> referred to a broken attribute.</li>
<li><a href="https://github.com/NixOS/nixpkgs/pull/337728">PR#337728</a>:
<code>tectonic.tests</code> referred to a non-existent attribute.</li>
<li><a href="https://github.com/NixOS/nixpkgs/pull/338559">PR#338559</a>:
<code>pypy27Packages.incremental</code> constructs incorrect derivation with an
associative array field.</li>
<li><a href="https://github.com/NixOS/nixpkgs/pull/338596">PR#338596</a>:
<code>pypy2Packages.python-engineio-v3</code> constructs incorrect derivation with an
associative array field.</li>
<li><a href="https://github.com/NixOS/nixpkgs/pull/338597">PR#338597</a>:
<code>pypy2Packages.python-socketio-v4</code> constructs incorrect derivation with an
associative array field.</li>
<li><a href="https://github.com/NixOS/nixpkgs/pull/341497">PR#341497</a>:
<code>dotnet/build-dotnet-module</code> fails to handle missing <code>pname</code>.</li>
<li><a href="https://github.com/NixOS/nixpkgs/pull/341985">PR#341985</a>:
<code>perlInterpreters.perl536</code> requires already deleted attribute.</li>
<li><a href="https://github.com/NixOS/nixpkgs/pull/346336">PR#346336</a>:
<code>libtorrent</code> and <code>rtorrent</code> <code>updaterScript</code> passed non-existent
parameters.</li>
<li><a href="https://github.com/NixOS/nixpkgs/pull/346689">PR#346689</a>:
<code>nix-plugin-pijul.tests</code> fix the invalid attribute reference.</li>
<li><a href="https://github.com/NixOS/nixpkgs/pull/346746">PR#346746</a>:
<code>python3Packages.sshfs.optional-dependencies.pkcs11</code> fix incorrect
attribute name.</li>
<li><a href="https://github.com/NixOS/nixpkgs/pull/347172">PR#347172</a>:
<code>python3Packages.sshfs.optional-dependencies.pyopenssl</code>: fix incorrect
attribute name.</li>
<li><a href="https://github.com/NixOS/nixpkgs/pull/347439">PR#347439</a>:
<code>python3Packages.ifcopenshell.tests</code> did not pass enough parameters.</li>
<li><a href="https://github.com/NixOS/nixpkgs/pull/348104">PR#348104</a>:
<code>faiss.passthru</code> attributes were dropped without removing references
to it.</li>
<li><a href="https://github.com/NixOS/nixpkgs/pull/352825">PR#352825</a>:
<code>pyamlboot.tests</code> passed the parameter if incorrect type.</li>
</ul>
Note: It’s not the full list of required fixes. For more complex cases I
filed a few bugs to get maintainers’ help:
<ul>
<li><a href="https://github.com/NixOS/nixpkgs/issues/277285">Issue#277285</a>:
<code>pkgsStatic.php</code> enters infinite loop and exhausts all available
memory.</li>
<li><a href="https://github.com/NixOS/nixpkgs/issues/277628">Issue#277628</a>:
<code>godot3-mono.nugetSource.meta</code> detects infinite recursion on
evaluation.</li>
<li><a href="https://github.com/NixOS/nixpkgs/issues/277698">Issue#277698</a>:
<code>ocamlPackages.janeStreet_0_15</code> has unsatisfied attributes.</li>
<li><a href="https://github.com/NixOS/nixpkgs/issues/280377">Issue#280377</a>:
<code>tests.cuda</code> has an unsatisfied <code>backendStdenv</code> attribute.</li>
</ul>
<h2 id="did-i-get-the-list-package-for-autoconf">Did I get the list package for <code>autoconf</code>?</h2>
Sort of: I managed to write the hack to get a list of packages using
<code>autoconf</code> in a few layers deep below top level. It’s good enough for
testing close to exhaustive.
But I did not get exhaustive at all. There are two main problems still:
<ol type="1">
<li>The attribute sets are infinite in <code>nixpkgs</code>. An example a bit silly
but still valid attribute is:
<pre><code>nix-repl> pkgs.pkgs.pkgs.pkgsCross.riscv64.pkgsMusl.pkgsCross.riscv64.pythonPackages.pythonPackages.pythonPackages.ninja
«derivation /nix/store/4vnprl12q706s3ilb1g1c2v4bf9pjpc9-ninja-1.11.1.drv»`</code></pre>
<code>nix</code> the language does not provide the mechanism to compare
references to shortcut things like <code>pythonPackages.pythonPackages</code>
And each scope has those self-referential package structures.</li>
<li>Even if the attribute set was finite in <code><nixpkgs></code> the mere act of
listing them takes 100s of GB. It looks like it’s because <code>nix</code> does
not collect already evaluated garbage expressions that still have
references from other parts of the tree. The packages loops in
<code>nixpkgs</code> from <code>[1.]</code> do not help in that at all.</li>
</ol>
I am still hopeful that I can get something decent soon. I can
workaround <code>[2.]</code> <code>RAM</code> exhaustion by declaring defeat on a single
<code>.nix</code> script and run it in incremental mode. Say, to process 100
packages at a time to avoid infinite memory growth.
Another option would be to write a separate tool using <code>nix</code> as a
library to parse and evaluate <code>.nix</code> code that does this job
specifically. But I’d prefer to try to fix <code>nix</code> <code>GC</code> behaviour first. I
think it’s tractable.
<h2 id="parting-words">Parting words</h2>
Traversing package attribute set in <code>nixpkgs</code> is surprisingly
challenging. I think it is fixable and should be fixed (at least for
non-<code>pkgsCross.*</code> part of the tree). Fetching metadata about the
packages is a frequent operation for many types of tree-wide changes.
I had a lot of fun writing debuggable <code>.nix</code> code to list available
<code>nixpkgs</code> attributes. So far my result is hiding at
<a href="https://github.com/trofi/nixpkgs-overlays/blob/main/lib/all-attrs.nix" class="uri">https://github.com/trofi/nixpkgs-overlays/blob/main/lib/all-attrs.nix</a>.
So far I managed to get to 4 levels of attribute depth using <code>60GB</code> of
<code>RAM</code>. This uncovered at least 27 bugs.
Some of the bugs are very scary:
<ul>
<li><code>cargo</code> did not have tests: <a href="https://github.com/NixOS/nixpkgs/pull/277402" class="uri">https://github.com/NixOS/nixpkgs/pull/277402</a></li>
<li><code>lua-wrapper</code> did not expose correct paths: <a href="https://github.com/NixOS/nixpkgs/pull/277408" class="uri">https://github.com/NixOS/nixpkgs/pull/277408</a></li>
</ul>
<code>builtins.tryEval</code> does not catch all the failure types in attribute
evaluation: <code>throw</code> / <code>assert</code> are fine, but reference to non-existent
attribute (or <code>assert</code>) are not.
<code>pkgs.nixosTests</code> attribute set is very slow and RAM hungry to evaluate:
<a href="https://github.com/NixOS/nix/issues/9671" class="uri">https://github.com/NixOS/nix/issues/9671</a>
You can also fix a few <code>nixpkgs</code> bugs! Just run <code>all-attrs.nix</code> as:
<pre><code>$ nix-instantiate --eval --strict ./all-attrs.nix \
    -I nixpkgs=~/path/to/nicpkgs \
    --arg maxDepth 2 --arg verbose 3 --arg ignoreDrvAttrs false</code></pre>
And see what you get.
Next steps I’d like to take at some future point:
<ul>
<li>batch package listing and package instantiation in smaller batches to
get RAM usage down to a few <code>GB</code>s.</li>
<li>explore <code>nix</code> and garbage collection mechanisms to make it friendlier
to large evaluations like <code>all-attrs.nix</code></li>
</ul>
Have fun!
</article>
<article>
<h1>a breakage example in C variadic function</h1>
2023-12-20T00:00:00Z
<h2 id="intro">Intro</h2>
This post is about <code>C</code> functions with ellipsis (<code>...</code>) in their
signatures. The most famous example of such is probably <code>printf()</code>:
<pre class="c"><code>int printf(const char *format, ...);</code></pre>
If you ever tried to use it you probably know that using wrong types
that don’t match format arguments might crash your program. A simple
faulty example could be:
<pre class="c"><code>printf("%s", 42); // this will crash</code></pre>
Luckily <code>gcc</code> and <code>clang</code> do have <code>-Wformat</code> warning that complains
about the mismatch between expected types by a format string and
actually passed types:
<pre class="c"><code>// $ cat simple.c
#include <stdio.h>

int main(void) {
    printf("%s", 42);
}</code></pre>
<pre><code>$ gcc -Wformat -c simple.c

simple.c: In function 'main':
simple.c:5:14: warning: format '%s' expects argument of type 'char *', but argument 2 has type 'int' [-Wformat=]
    5 |     printf("%s", 42);
      |             ~^   ~~
      |              |   |
      |              |   int
      |              char *
      |             %d</code></pre>
Many distributions turn these warnings into errors by default.
<h2 id="argument-count">Argument count</h2>
An ellipsis (<code>...</code>) means that the function accepts unknown count and
unknown type of parameters. There has to be a way to somehow signal
actual argument list in ellipsis. As <code><stdarg.h></code> does not provide a
standard way to do it code author has to come up with their own scheme
to solve the problem.
A few examples come to mind:
<ul>
<li><code>int printf(const char *format, ...)</code> uses <code>format</code> parameter to pass
the argument count by interpreting the format string.</li>
<li><code>int open(const char *pathname, int flags, mode_t mode)</code> uses <code>flags</code>
to distinguish 3-argument form from 2-argument
<code>int open(const char *pathname, int flags)</code>.</li>
<li><code>glib</code>’s <code>gchar* g_strconcat (const gchar* string1, ...)</code> consumes
parameters until <code>NULL</code> parameter is encountered.</li>
</ul>
Each of the examples above implements a different scheme.
<code>g_strconcat</code> is especially scary as you might pass non-strings there
and get no warning from the compiler. Or forget to pass a <code>NULL</code> value
(<a href="https://github.com/proftpd/proftpd/pull/1028/files"><code>proftpd</code> example</a>
comes to mind). But at least all the arguments are expected to have the
same <code>const gchar*</code> type.
But the above is not an exhaustive list. You can bake any assumption
into ellipsis meaning.
<h2 id="an-example-of-variadic-c-function">An example of variadic C function</h2>
I’ll use yet another (arguably the simplest) form to pass argument count
to a variadic function: I’ll pass the number explicitly. Let’s explore
the following example:
<pre class="c"><code>#include <stdarg.h>
#include <stdio.h>
#include <string.h>

void foo(int n, ...) {
    va_list va;

    va_start(va, n);
    for (int i = 0; i < n; i++) {
        size_t l = va_arg(va, size_t);
        printf("[%u]: %zu\n", i, l);
    }
    va_end(va);
}

int main(void) {
    foo(3, strlen("foo"), strlen("barr"), strlen("bazzz"));
}</code></pre>
Here we explicitly pass count of variadic arguments as the first <code>n</code>
parameter of the <code>foo()</code> function. The program should be correct.
Running it:
<pre><code>$ gcc a.c -o a &&./a
[0]: 3
[1]: 4
[2]: 5</code></pre>
All good.
Now I’ll change the above program slightly:
<pre class="c"><code>#include <stdarg.h>
#include <stdio.h>
#include <string.h>

void foo(int n, ...) {
    va_list va;

    va_start(va, n);
    for (int i = 0; i < n; i++) {
        size_t l = va_arg(va, size_t);
        printf("[%u]: %zu\n", i, l);
    }
    va_end(va);
}

int main(void) {
    foo(3, 3, 4, 5);
}</code></pre>
I inlined the results of <code>strlen()</code> calls to their literal values. Is it
still a correct program? Is it always expected to print <code>3 4 5</code>?
Let’s run it:
<pre><code>$ gcc a.c -o a &&./a
[0]: 3
[1]: 4
[2]: 5</code></pre>
Seems to work. Let’s throw more arguments just for fun:
<pre class="c"><code>// $ cat a.c
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

void foo(int n, ...) {
    va_list va;

    va_start(va, n);
    for (int i = 0; i < n; i++) {
        size_t l = va_arg(va, size_t);
        printf("[%u]: %zu\n", i, l);
    }
    va_end(va);
}

int main() {
    foo(16, 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16);
}</code></pre>
I increased argument count to <code>16</code>. Running it on <code>x86_64</code>:
<pre><code>$ gcc a.c -o a &&./a
[0]: 1
[1]: 2
[2]: 3
[3]: 4
[4]: 5
[5]: 6
[6]: 7
[7]: 8
[8]: 9
[9]: 10
[10]: 11
[11]: 12
[12]: 13
[13]: 14
[14]: 15
[15]: 16</code></pre>
Still all good!
Running on <code>aarch64-linux</code> just in case:
<pre><code>$ aarch64-unknown-linux-gnu-gcc a.c -o a &&./a
[0]: 1
[1]: 2
[2]: 3
[3]: 4
[4]: 5
[5]: 6
[6]: 7
[7]: 70368744177672
[8]: 70368744177673
[9]: 70368744177674
[10]: 70368744177675
[11]: 70368744177676
[12]: 13
[13]: 70368744177678
[14]: 15
[15]: 70368744177680</code></pre>
Uh-oh. It’s broken!
What is worse: first seven parameters look totally fine and degradation
start only <code>8th</code> one. Is it a coincidence? Some architecture-specific
property? Or maybe a compiler bug?
Or maybe you noticed a bug in the original program? How would you fix it
or work it around?
<h2 id="argument-passing-mechanics">Argument passing mechanics</h2>
Let’s have a look at the generated code and check how parameters are
passed across the call boundary. I’ll use the same <code>17</code>-argument example
above:
<pre class="c"><code>#include <stdarg.h>
#include <stdio.h>
#include <string.h>

void foo(int n, ...) {
    va_list va;

    va_start(va, n);
    for (int i = 0; i < n; i++) {
        size_t l = va_arg(va, size_t);
        printf("[%u]: %zu\n", i, l);
    }
    va_end(va);
}

int main(void) {
    foo(16, 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16);
}</code></pre>
<h2 id="x86_64"><code>x86_64</code></h2>
Full <code>x86_64</code> code on <code>gcc</code> looks this way:
<pre class="asm"><code>; gcc -O1 -S a.c -o -

main:
        subq    $16, %rsp
        ; push some of the arguments on stack:
        pushq   $16 ; va[15] = 16
        pushq   $15 ; va[14] = 15
        pushq   $14 ; va[13] = 14
        pushq   $13 ; va[12] = 13
        pushq   $12 ; va[11] = 12
        pushq   $11 ; va[10] = 11
        pushq   $10 ; va[9] = 10
        pushq   $9  ; va[8] = 9
        pushq   $8  ; va[7] = 8
        pushq   $7  ; va[6] = 7
        pushq   $6  ; va[5] = 6

        movl    $5, %r9d  ; va[4] = 5
        movl    $4, %r8d  ; va[3] = 4
        movl    $3, %ecx  ; va[2] = 3
        movl    $2, %edx  ; va[1] = 2
        movl    $1, %esi  ; va[0] = 1
        movl    $16, %edi ; n = 16

        movl    $0, %eax
        call    foo
        movl    $0, %eax
        addq    $104, %rsp
        ret

.LC0:
        .string "[%u]: %zu\n"

foo:
        pushq   %rbp
        pushq   %rbx
        subq    $88, %rsp
        movq    %rsi, 40(%rsp)
        movq    %rdx, 48(%rsp)
        movq    %rcx, 56(%rsp)
        movq    %r8, 64(%rsp)
        movq    %r9, 72(%rsp)
        movl    $8, 8(%rsp)
        leaq    112(%rsp), %rax
        movq    %rax, 16(%rsp)
        leaq    32(%rsp), %rax
        movq    %rax, 24(%rsp)
        testl   %edi, %edi
        jle     .L1
        movl    %edi, %ebp
        movl    $0, %ebx
        jmp     .L5
.L3:
        movq    16(%rsp), %rdx
        leaq    8(%rdx), %rax
        movq    %rax, 16(%rsp)
.L4:
        movq    (%rdx), %rdx  ; size_t l = va_arg(va, size_t);
        movl    %ebx, %esi    ; int i (loop variable)
        movl    $.LC0, %edi   ; format = "%[u]: %zu\n"
        movl    $0, %eax
        call    printf
        addl    $1, %ebx
        cmpl    %ebx, %ebp
        je      .L1
.L5:
        movl    8(%rsp), %eax
        cmpl    $47, %eax
        ja      .L3
        movl    %eax, %edx
        addq    24(%rsp), %rdx
        addl    $8, %eax
        movl    %eax, 8(%rsp)
        jmp     .L4
.L1:
        addq    $88, %rsp
        popq    %rbx
        popq    %rbp
        ret</code></pre>
It’s a lot of text! We can ignore most of it and focus on the following
few lines to get to the argument passing mechanics:
<pre class="asm"><code>main:
        ; ...
        pushq   $6  ; va[5] = 6
        movl    $5, %r9d  ; va[4] = 5
        ; ...

foo:
        ; ...
        movq    (%rdx), %rdx  ; size_t l = va_arg(va, size_t);
        movl    %ebx, %esi    ; int i (loop variable)
        movl    $.LC0, %edi   ; format = "%[u]: %zu\n"
        movl    $0, %eax
        call    printf
        ; ...</code></pre>
The <code>main</code> function is trivial: it shows us that first 6 arguments are
passed in registers alone (<code>%edi = 16</code>, <code>%esi = 1</code>, <code>%edx = 2</code>,
<code>%ecx = 3</code>, <code>%r8d = 4</code> <code>%r9d = 5</code>). And starting from <code>7</code>th argument
they are passed via stack (<code>pushq $6</code>). This is a standard
<code>x86_64-linux</code> calling convention.
The <code>foo</code> is more complicated. The gist of it is that our <code>va_arg</code>
always gets fetched from stack as a 64-bit value via <code>movq (%rdx), %rdx</code>
instruction. To make it work <code>foo</code> stores all register-passed arguments
on stack. The fetch result gets passed later as a third argument to
<code>printf("%[u]: %zu\n", i, l)</code> call in <code>%rdx</code> register.
A few notes before we continue:
Instructions like <code>movl $1, %esi</code> tell the CPU to store <code>$1</code> to
32-bit <code>esi</code> register (lower half of <code>rsi</code> register). <code>movl</code> (or any
other write instruction that works on 32-bit registers) also zeroes out
upper 64-bits of <code>rsi</code> register. Thus it’s a functional equivalent of
<code>movq $1, %rsi</code>. But the encoding might be more efficient as it does not
need a <code>REX</code> prefix.
Instructions like <code>pushq $6</code> write full 64-bit constant on stack as if
we pushed full <code>size_t</code> value instead of <code>int</code>.
In both register store and memory store cases <code>int</code> literals are stored
as 64-bit values. This means that on <code>x86_64</code> it’s not too bad to mix
these two types as the example does.
<h2 id="aarch64"><code>aarch64</code></h2>
Now let’s do the same exercise for <code>aarch64</code> target:
<pre class="asm"><code>; aarch64-unknown-linux-gnu-gcc -O1 -S a.c -o -
main:
        sub     sp, sp, #96
        stp     x29, x30, [sp, 80]
        add     x29, sp, 80
        mov     w0, 16 ; n = 16
        str     w0, [sp, 64] ; va[15] = 16
        mov     w1, 15
        str     w1, [sp, 56] ; va[14] = 15
        mov     w1, 14
        str     w1, [sp, 48] ; va[13] = 14
        mov     w1, 13
        str     w1, [sp, 40] ; va[12] = 13
        mov     w1, 12
        str     w1, [sp, 32] ; va[11] = 12
        mov     w1, 11
        str     w1, [sp, 24] ; va[10] = 11
        mov     w1, 10
        str     w1, [sp, 16] ; va[9]  = 10
        mov     w1, 9
        str     w1, [sp, 8]  ; va[8]  = 9
        mov     w1, 8
        str     w1, [sp]     ; va[7]  = 8
        mov     w7, 7        ; va[6]  = 7
        mov     w6, 6        ; va[5]  = 6
        mov     w5, 5        ; va[4]  = 5
        mov     w4, 4        ; va[3]  = 4
        mov     w3, 3        ; va[2]  = 3
        mov     w2, 2        ; va[1]  = 2
        mov     w1, 1        ; va[0]  = 1
        bl      foo
        mov     w0, 0
        ldp     x29, x30, [sp, 80]
        add     sp, sp, 96
        ret

.LC0:
        .string "[%u]: %zu\n"

foo:
        stp     x29, x30, [sp, -144]!
        mov     x29, sp
        stp     x19, x20, [sp, 16]
        mov     w20, w0
        str     x1, [sp, 88]
        str     x2, [sp, 96]
        str     x3, [sp, 104]
        str     x4, [sp, 112]
        str     x5, [sp, 120]
        str     x6, [sp, 128]
        str     x7, [sp, 136]
        add     x0, sp, 144
        str     x0, [sp, 48]
        str     x0, [sp, 56]
        add     x0, sp, 80
        str     x0, [sp, 64]
        mov     w0, -56
        str     w0, [sp, 72]
        str     wzr, [sp, 76]
        cmp     w20, 0
        ble     .L1
        str     x21, [sp, 32]
        mov     w19, 0
        adrp    x21, .LC0
        add     x21, x21, :lo12:.LC0
        b       .L6
.L3:
        add     w0, w2, 8
        str     w0, [sp, 72]
        cmp     w0, 0
        ble     .L5
        add     x0, x1, 15
        and     x0, x0, -8
        str     x0, [sp, 48]
.L4:
        ldr     x2, [x1] ; size_t l = va_arg(va, size_t);
        mov     w1, w19  ; int i (loop variable)
        mov     x0, x21  ; format = "%[u]: %zu\n"
        bl      printf
        add     w19, w19, 1
        cmp     w20, w19
        beq     .L9
.L6:
        ldr     w2, [sp, 72]
        ldr     x1, [sp, 48]
        tbnz    w2, #31, .L3
        add     x2, x1, 15
        and     x2, x2, -8
        str     x2, [sp, 48]
        b       .L4
.L5:
        ldr     x1, [sp, 56]
        add     x1, x1, w2, sxtw
        b       .L4
.L9:
        ldr     x21, [sp, 32]
.L1:
        ldp     x19, x20, [sp, 16]
        ldp     x29, x30, [sp], 144
        ret</code></pre>
Again, it’s a lot of repetitive text. We can ignore most of it and focus
on arguments passed over call boundary:
<pre class="asm"><code>main:
        ; ...
        mov     w1, 8
        str     w1, [sp]     ; va[7]  = 8
        mov     w7, 7        ; va[6]  = 7
        mov     w6, 6        ; va[5]  = 6

foo:
        ; ...
        ldr     x2, [x1] ; size_t l = va_arg(va, size_t);
        mov     w1, w19  ; int i (loop variable)
        mov     x0, x21  ; format = "%[u]: %zu\n"
        bl      printf</code></pre>
The <code>main</code> structure is very similar to <code>x86_64</code>: first few parameters
(<code>8</code> this time) get passed via registers: <code>w0 = 16</code>, <code>w1 = 1</code>, <code>w2 = 2</code>,
…. <code>w7 = 7</code>. The rest goes to stack: <code>mov w1, 8</code> / <code>str w1, [sp]</code>,
<code>mov w1, 9</code> / <code>str w1, [sp, 8]</code>, and so on.
Similarly to <code>x86_64</code> the instruction <code>mov w1, 1</code> sets lower 32-bit part
of 64-bit <code>x1</code> register to an immediate value. Higher 32-bit part is
zeroed out. This makes it equivalent to <code>mov x1, 1</code> instruction.
The difference starts in the way stack variables are stored: while
<code>mov w1, 8</code> initializes both <code>w1</code> and <code>x1</code> to value <code>8</code> the
<code>str w1, [sp]</code> instruction writes only 32 bits of value on stack. Upper
32 bits of stack value contain existing value (some garbage). If we
wanted to fix it then <code>str x1, [sp]</code> would place all <code>64</code> bits as. In
theory <code>gcc</code> could have used that instruction even for your unmodified
case. But it does not have to.
This is our corruption mechanic: we store 32 bits on stack for
parammeters <code>8</code> and above and then read 64-bit values from store
locations in <code>C</code> pseudo-code:
<pre class="c"><code>// main():
    // ...
    int val;
    size_t location = uninitialized();
    // ...
    *(int*)(&location) = val; // store 32 initialized bits
// foo():
    // ...
    size_t result = *(size_t*)(&location); // load 64 bits</code></pre>
<h2 id="possible-fix">Possible fix</h2>
Once the breakage is clear the fix is simple: use exact expected type
at call site. In this case <code>size_t</code> instead of <code>int</code>:
<pre class="diff"><code>--- a.c
+++ a.fixed.c
@@ -16,3 +16,3 @@
 int main() {
-    foo(16, 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16);
+    foo(16, (size_t)1,(size_t)2,(size_t)3,(size_t)4,(size_t)5,(size_t)6,(size_t)7,(size_t)8,(size_t)9,(size_t)10,(size_t)11,(size_t)12,(size_t)13,(size_t)14,(size_t)15,(size_t)16);
 }</code></pre>
Not the prettiest change, but it gets the job done:
<pre><code>$ aarch64-unknown-linux-gnu-gcc -O1 a.fixed.c -o a && ./a
[0]: 1
[1]: 2
[2]: 3
[3]: 4
[4]: 5
[5]: 6
[6]: 7
[7]: 8
[8]: 9
[9]: 10
[10]: 11
[11]: 12
[12]: 13
[13]: 14
[14]: 15
[15]: 16</code></pre>
The code change looks as expected:
<pre class="diff"><code>diff -U0 a.S a.fixed.S
--- a.S 2023-12-17 23:51:24.749893552 +0000
+++ a.fixed.S   2023-12-17 23:51:29.687979158 +0000
@@ -2 +2 @@
-       .file   "a.c"
+       .file   "a.fixed.c"
@@ -102,13 +102,13 @@
-       mov     w0, 16
-       str     w0, [sp, 64]
-       mov     w1, 15
-       str     w1, [sp, 56]
-       mov     w1, 14
-       str     w1, [sp, 48]
-       mov     w1, 13
-       str     w1, [sp, 40]
-       mov     w1, 12
-       str     w1, [sp, 32]
-       mov     w1, 11
-       str     w1, [sp, 24]
-       mov     w1, 10
-       str     w1, [sp, 16]
-       mov     w1, 9
-       str     w1, [sp, 8]
-       mov     w1, 8
-       str     w1, [sp]
+       mov     x0, 16
+       str     x0, [sp, 64]
+       mov     x1, 15
+       str     x1, [sp, 56]
+       mov     x1, 14
+       str     x1, [sp, 48]
+       mov     x1, 13
+       str     x1, [sp, 40]
+       mov     x1, 12
+       str     x1, [sp, 32]
+       mov     x1, 11
+       str     x1, [sp, 24]
+       mov     x1, 10
+       str     x1, [sp, 16]
+       mov     x1, 9
+       str     x1, [sp, 8]
+       mov     x1, 8
+       str     x1, [sp]
@@ -119,7 +119,7 @@
-       mov     w7, 7
-       mov     w6, 6
-       mov     w5, 5
-       mov     w4, 4
-       mov     w3, 3
-       mov     w2, 2
-       mov     w1, 1
+       mov     x7, 7
+       mov     x6, 6
+       mov     x5, 5
+       mov     x4, 4
+       mov     x3, 3
+       mov     x2, 2
+       mov     x1, 1</code></pre>
Only <code>main</code> saw the change. The change is shift from 32-bit to 64-bit
registers in value assignments and value stores.
<h2 id="is-this-bug-real">Is this bug real?</h2>
The main takeaway from the above is that on <code>aarch64</code> arguments <code>9</code> and
above must not mix <code>int</code> / <code>size_t</code> and pass exact type if those
arguments are present in variadic template.
Could such bug happen on real code or 9 arguments are too much to be
seen in the wild? Guess how I found this obscurity!
Here is the <a href="https://git.kernel.org/pub/scm/network/wireless/iwd.git/tree/src/dpp-util.c?h=2.11#n1379"><code>iwd-0.11</code></a> code:
<pre class="c"><code>bool prf_plus(enum l_checksum_type type, const void *key, size_t key_len,
              void *out, size_t out_len,
              size_t n_extra, ...)
{
    // ...
    struct iovec iov[n_extra + 2];
    va_list va;
    size_t i;

    va_start(va, n_extra);

    for (i = 0; i < n_extra; i++) {
        iov[i + 1].iov_base = va_arg(va, void *);
        iov[i + 1].iov_len  = va_arg(va, size_t);
    }
    // ...
}
// ...
bool dpp_derive_z(const uint8_t *mac_i, more params, void *z_out, size_t *z_len)
{
    // ...
    prf_plus(sha, prk, bytes, z_out, bytes,
             5,
             mac_i, 6,
             mac_r, 6,
             m_x, bytes,
             n_x, bytes,
             key, strlen(key));</code></pre>
Do you see where the thing breaks?
Here <code>prf_plus()</code> expects <code>n_extra</code> pairs of <code>void *</code> / <code>size_t</code> in the
variadic arguments. But <code>dpp_derive_z()</code> passes <code>void *</code> / <code>int</code> as
first two pairs.
I would never notice it if not for mysteriously failing <code>iwd</code> test on
<code>aarch64</code> platform:
<pre><code>    $ unit/test-dpp
    TEST: DPP test responder-only key derivation
    TEST: DPP test mutual key derivation
    TEST: DPP test PKEX key derivation
    test-dpp: unit/test-dpp.c:514: test_pkex_key_derivation: Assertion `!memcmp(tmp, __tmp, 32)' failed.</code></pre>
<code>strace</code> shown the smoking gun this way:
<pre><code>$ strace unit/test-dpp
...
sendmsg(4, {
    msg_name=NULL,
    msg_namelen=0,
    msg_iov=[
        {iov_base="", iov_len=0},
        {iov_base="\254d\221\364R\7", iov_len=6},
        {iov_base="n^\316n\363\335\0\0\0\0"..., iov_len=281470681743366},
        {iov_base="\274\312\216#\345\300P2"..., iov_len=32},
        {iov_base="\n\221\340r\210\t\273\2"..., iov_len=32},
        {iov_base="thisisreallysecret", iov_len=18},
        {iov_base="\1", iov_len=1}],
    msg_iovlen=7, msg_controllen=0, msg_flags=0}, MSG_MORE) = 3136</code></pre>
See anything suspicious?
Length of the third element of <code>msg_iov</code> array is <code>281470681743366</code> (or
<code>0xffff00000006</code> in hex). It should have been <code>6</code> if not for higher
<code>0xffff</code> garbage bits.
While we are at it: <code>sendmsg()</code> did not fail with an <code>-EINVAL</code> error and
consumed <code>3K</code> of data. At best it will fail at the key derivation. At
worst it might send your unrelated process memory over the network.
A nasty kind of bug.
<h2 id="bonus-section">Bonus section</h2>
If we go back to our original broken example are there any other 64-bit
architectures where <code>size_t</code> / <code>int</code> mismatch is as problematic as on
<code>aarch64</code>?
I’ll show the result for the following list of <code>8</code> <code>64</code>-bit
architectures I could remember:
<ul>
<li><code>alpha</code></li>
<li><code>mips64 -mabi=64</code></li>
<li><code>loongarch64</code></li>
<li><code>s390x</code></li>
<li><code>powerpc64</code></li>
<li><code>sparc64</code></li>
<li><code>riscv64</code></li>
<li><code>ia64</code></li>
</ul>
What is your guess? Is <code>aarch64</code> the unique one being broken here, or
maybe <code>x86_64</code> is unique in that it happens to work anyway? Will it be
endianness-specific? Or it’s closer to <code>50/50</code>?
Let’s see:
<ul>
<li><code>alpha</code>: no corruption. Stores 64-bit values on stack:
<pre class="asm"><code>main:
    ; ...
    lda t0,6
    stq t0,0(sp)</code></pre>
and loads it as 64-bit value:
<pre class="asm"><code>foo:
    ; ...
    ldq a3,0(t0)</code></pre>
The target passes first <code>6</code> arguments in registers.</li>
<li><code>mips64 -mabi=64</code>: no corruption. Stores 64-bit values on stack:
<pre class="asm"><code>main:
    ; ...
    li v0,8
    sd v0,0(sp)</code></pre>
and loads it as 64-bit value:
<pre class="asm"><code>foo:
    ; ...
    ld a3,-8(s1)</code></pre>
The target passes first <code>8</code> arguments in registers.</li>
<li><code>loongarch64</code>: no corruption. Stores 64-bit values on stack:
<pre class="asm"><code>main:
    ; ...
    addi.w $t0, $zero, 9(0x9)
    st.d $t0, $sp, 8(0x8)</code></pre>
and loads it as 64-bit value:
<pre class="asm"><code>foo:
    ; ...
    ld.d $a3, $s1, -8(0xff8)</code></pre>
The target passes first <code>8</code> arguments in registers.</li>
<li><code>s390x</code>: no corruption. Stores 64-bit values on stack:
<pre class="asm"><code>main:
    ; ...
    lghi %r1,5
    stg %r1,160(%r15)</code></pre>
and loads it as 64-bit value:
<pre class="asm"><code>foo:
    ; ...
    lg %r5,0(%r1)</code></pre>
The target passes first <code>5</code> arguments in registers.</li>
<li><code>powerpc64</code>: no corruption. Stores 64-bit values on stack:
<pre class="asm"><code>main:
    ; ...
    li r7,8
    std r7,112(r1)</code></pre>
and loads it as 64-bit value:
<pre class="asm"><code>foo:
    ; ...
    ldu r6,8(r29)</code></pre>
The target passes first <code>8</code> arguments in registers.</li>
<li><code>sparc64</code>: no corruption. Stores 64-bit values on stack:
<pre class="asm"><code>main:
    ; ...
    mov  6, %g1
    stx  %g1, [ %sp + 0x8af ]</code></pre>
and loads it as 64-bit value:
<pre class="asm"><code>foo:
    ; ...
    ldx  [ %i5 ], %o3</code></pre>
The target passes first <code>6</code> arguments in registers.</li>
<li><code>riscv64</code>: no corruption. Stores 64-bit values on stack:
<pre class="asm"><code>main:
    ; ...
    li a5
    sd a5,0(sp)</code></pre>
and loads it as 64-bit value:
<pre class="asm"><code>foo:
    ; ...
    ld a3,-8(s1)</code></pre>
The target passes first <code>8</code> arguments in registers.</li>
<li><code>ia64</code>: has corruption:
<pre><code>$ ./a
[0]: 1
[1]: 2
[2]: 3
[3]: 4
[4]: 5
[5]: 6
[6]: 7
[7]: 8
[8]: 2305843009213693961
[9]: 2305843009213693962
[10]: 11
[11]: 12
[12]: 13
[13]: 14
[14]: 15
[15]: 16</code></pre>
<code>ia64</code> stores only 32-bit value on stack (just like <code>aarch64</code>):
<pre class="asm"><code>main:
    ; ...
    mov r15=8
    ;;
    st4 [r14]=r15</code></pre>
and loads it as 64-bit value:
<pre class="asm"><code>foo:
    ; ...
    ld8 r47=[r14]</code></pre>
The target passes first <code>8</code> arguments in registers.</li>
</ul>
Final table:
<table>
<thead>
<tr class="header">
<th style="text-align: center;">target</th>
<th style="text-align: center;">Is affected</th>
<th style="text-align: center;">In registers</th>
<th style="text-align: center;">First on stack</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td style="text-align: center;"><code>alpha</code></td>
<td style="text-align: center;">no</td>
<td style="text-align: center;">6</td>
<td style="text-align: center;">7</td>
</tr>
<tr class="even">
<td style="text-align: center;"><code>mips64</code></td>
<td style="text-align: center;">no</td>
<td style="text-align: center;">8</td>
<td style="text-align: center;">9</td>
</tr>
<tr class="odd">
<td style="text-align: center;"><code>loongarch64</code></td>
<td style="text-align: center;">no</td>
<td style="text-align: center;">8</td>
<td style="text-align: center;">9</td>
</tr>
<tr class="even">
<td style="text-align: center;"><code>s390x</code></td>
<td style="text-align: center;">no</td>
<td style="text-align: center;">5</td>
<td style="text-align: center;">6</td>
</tr>
<tr class="odd">
<td style="text-align: center;"><code>powerpc64</code></td>
<td style="text-align: center;">no</td>
<td style="text-align: center;">8</td>
<td style="text-align: center;">9</td>
</tr>
<tr class="even">
<td style="text-align: center;"><code>sparc64</code></td>
<td style="text-align: center;">no</td>
<td style="text-align: center;">6</td>
<td style="text-align: center;">7</td>
</tr>
<tr class="odd">
<td style="text-align: center;"><code>riscv64</code></td>
<td style="text-align: center;">no</td>
<td style="text-align: center;">8</td>
<td style="text-align: center;">9</td>
</tr>
<tr class="even">
<td style="text-align: center;"><strong><code>ia64</code></strong></td>
<td style="text-align: center;"><strong>yes</strong></td>
<td style="text-align: center;">8</td>
<td style="text-align: center;">9</td>
</tr>
<tr class="odd">
<td style="text-align: center;"><code>x86_64</code></td>
<td style="text-align: center;">no</td>
<td style="text-align: center;">6</td>
<td style="text-align: center;">7</td>
</tr>
<tr class="even">
<td style="text-align: center;"><strong><code>aarch64</code></strong></td>
<td style="text-align: center;"><strong>yes</strong></td>
<td style="text-align: center;">8</td>
<td style="text-align: center;">9</td>
</tr>
</tbody>
</table>
Thus <code>aarch64</code> is not unique, but very close to it :)
<h2 id="parting-words">Parting words</h2>
One has to be careful at specifying exact types expected by variadic
functions. Integral type conversion rules do not apply the same way you
would expect for a non-variadic function call.
If you are writing your function with variadic parameters and it’s not
a <code>printf()</code>-style function then compiler will not be able to help you
with warnings. Make sure you have a way to validate passed types via
other means.
Sometimes breakages are very subtle: first <code>8</code> parameters would work
just fine and <code>9</code>-th one will eat all your data. And it will happen only
on small set of architectures: <code>aarch64</code> and <code>ia64</code> :)
<code>iwd</code> fix went upstream as
<a href="https://git.kernel.org/pub/scm/network/wireless/iwd.git/commit/?id=688d27700833258a139a6fbd5661334bd2c9fa98">this patch</a>.
Have fun!
<h2 id="is-amd64-actually-immune-to-this">Is AMD64 actually immune to this?</h2>
<a href="https://en.wikipedia.org/wiki/Betteridge%27s_law_of_headlines">No</a>.
GCC’s use of <code>push</code> instructions is controlled by two options,
<code>-mpush-args</code> and <code>-mno-accumulate-outgoing-args</code>:
<pre class="c"><code>static bool
ix86_push_argument (unsigned int npush)
{
  /* If SSE2 is available, use vector move to put large argument onto
     stack.  NB:  In 32-bit mode, use 8-byte vector move.  */
  return ((!TARGET_SSE2 || npush < (TARGET_64BIT ? 16 : 8))
          && TARGET_PUSH_ARGS
          && !ACCUMULATE_OUTGOING_ARGS);
}</code></pre>
The latter option is automatically adjusted depending on which CPU
family GCC should tune for:
<pre class="c"><code>/* X86_TUNE_ACCUMULATE_OUTGOING_ARGS: Allocate stack space for outgoing
   arguments in prologue/epilogue instead of separately for each call
   by push/pop instructions.
   This increase code size by about 5% in 32bit mode, less so in 64bit mode
   because parameters are passed in registers.  It is considerable
   win for targets without stack engine that prevents multple push operations
   to happen in parallel.  */

DEF_TUNE (X86_TUNE_ACCUMULATE_OUTGOING_ARGS, "accumulate_outgoing_args",
          m_PPRO | m_P4_NOCONA | m_BONNELL | m_SILVERMONT | m_KNL | m_KNM | m_INTEL
          | m_GOLDMONT | m_GOLDMONT_PLUS | m_ATHLON_K8 | m_LUJIAZUI)</code></pre>
So the use of <code>push</code> vs. <code>mov</code> is a tuning choice, which makes this bug
even more subtle: it can surface depending on which CPU is specified via
the <code>-march=</code> or <code>-mtune</code> option.
</article>
<article>
<h1>upgrading an ssd</h1>
2023-12-15T00:00:00Z
In search of bugs I build a lot of software locally. About ~20000
packages per day. I usually keep all builds around to speed up
regression debugging.
That way (and with help of filesystem compression, <code>duperemove</code> and
identical file hardlinking) I manage to fill up my <code>512G</code> SSD with build
results within 2-3 weeks.
Once the disk is full I have to trigger garbage collection that frees
all that space and start over.
I decided to switch to a larger <code>2T</code> SSD to expand the time budget to
1-2 months.
This is my boot disk on <code>btrfs</code> and I would like to preserve most of
it’s properties without too much of mountpoint juggling or machine
downtime. AFAIU <code>rsync</code> does not handle advanced filesystem features
like subvolume layouts and already deduplicated data.
I ended up plugging in a new device and did two commands to transfer all
the data live from one device to another:
<pre><code>$ btrfs device add    /dev/nvme1n1p2 /
$ btrfs device remove /dev/nvme0n1p3 /</code></pre>
15 minutes later all the data was on the new <code>SSD</code>! Magic!
<h2 id="the-actual-procedure">The actual procedure</h2>
It required a tiny bit of extra work to handle partitioning on a new
device and <code>EFI</code> <code>vfat</code> partition move.
Here is the sequence I used:
<ol type="1">
<li>Plug a new device in, it detected as <code>/dev/nvme1n1</code>.</li>
<li>Partition new device:
<pre><code># fdisk /dev/nvme1n1
g; n; 1; 2048 (default); +4G; t; 1 (EFI); n; 2; w</code></pre>
Here we create 2 partitions: 4G <code>EFI</code> and the rest on Linux.</li>
<li>Format <code>EFI</code> partition:
<pre><code># mkfs.fat -F 32 /dev/nvme1n1p1</code></pre></li>
<li>Sync <code>EFI</code> data to the new partition:
<pre><code># mkdir /new-boot
# mount /dev/nvme1n1p1 /new-boot
# rsync -av /boot/ /new-boot/
# umount /new-boot
# rmdir /new-boot</code></pre></li>
<li>Update <code>/etc/nixos/hardware-configuration.nix</code> to point <code>EFI</code>
partition to the new <code>device = "/dev/disk/by-uuid/ABCD-1234";</code></li>
<li><strong>Migrate the root filesystem</strong>:
<pre><code># btrfs device add    /dev/nvme1n1p2 /
# btrfs device remove /dev/nvme0n1p3 /</code></pre></li>
<li>Rebuild boot loader configuration and reinstall it:
<pre><code># nixops-rebuild switch --install-bootloader</code></pre></li>
<li>Reboot the machine.</li>
</ol>
Done!
It took 15 minutes to remove the device and evacuate all the data out.
A snapshot of migration state somewhere in the middle of the process:
<pre><code># btrfs fi show /
Label: none  uuid: abcdef12-...
        Total devices 2 FS bytes used 201.63GiB
        devid    1 size 0.00B used 64.03GiB path /dev/nvme0n1p3
        devid    2 size 1.86TiB used 141.00GiB path /dev/nvme1n1p2</code></pre>
Note how to-be-removed device had <code>size 0.00B</code> while it still had to
drain <code>64G</code> of data.
<h2 id="parting-words">Parting words</h2>
<code>btrfs</code> device handling is magic! It does not matter if the new device
is smaller or larger than existing one: you add bytes to the pool and
new block groups get allocated there. Deleting old devices is also
straightforward: evacuated device stops being used for new object
allocation and existing block groups are evacuated to other devices.
<code>btrfs device remove</code> wipes filesystem superblock and removes the device
from device tree of filesystem once data is fully drained. There is no
easy way to access data on the old device after the move. It is slightly
scary but has it’s charm as well: there is no chance to accidentally
mount old device and use it as new for a while.
By default <code>NixOS</code> uses <code>/dv/disk/by-uuid/...</code> device paths:
<pre><code>$ cat /etc/fstab
/dev/disk/by-uuid/abcdef12-1dbb-... / btrfs x-initrd.mount,subvol=nixos,noatime,compress=zstd 0 0
/dev/disk/by-uuid/ABCD-1234 /boot vfat umask=1022,quiet,codepage=866,iocharset=utf8,dmask=1022,fmask=1133 0 2</code></pre>
That means device rename and move on <code>btrfs</code> is transparent to the
configuration as <code>UUID</code> gets preserved on new device addition.
Have fun!
</article>
<article>
<h1>gcc-14 bugs, pile 3</h1>
2023-11-22T00:00:00Z
This week <a href="https://gcc.gnu.org/pipermail/gcc/2023-November/242898.html">GCC 14 entered stage 3</a>.
Most major features are already pushed to <code>master</code> branch and the main
focus now is to stabilize the result.
It’s a good time to look at bugs I noticed over past two months. This
time I saw <code>19</code> of those. That almost twice as much than over
<a href="https://trofi.github.io/posts/296-gcc-14-bugs-pile-2.html">previous 2 months</a>.
<h2 id="summary">summary</h2>
Bugs (or patches) in the order I observed them:
<ul>
<li><a href="https://gcc.gnu.org/PR111435">tree-optimization/111435</a>: <code>ICE</code> on
<code>gcc</code> code (<code>-m32</code>) due to infinite recursion in type conversion
rule</li>
<li><a href="https://gcc.gnu.org/PR111527">driver/111527</a>: <code>gcc</code> hits environment
size limit early due to an internal <code>COLLECT_GCC_OPTIONS</code> variable</li>
<li><a href="https://gcc.gnu.org/PR111619">rtl-optimization/111619</a>:
<code>make profiledbootstrap</code> is very slow to build in unoptimized builds</li>
<li><a href="https://gcc.gnu.org/PR111629">other/111629</a>: <code>make profiledbootstrap</code>
<code>SIGSEGV</code>s <code>gcc</code> on shutdown due to a <code>ggc</code> bug</li>
<li><a href="https://gcc.gnu.org/PR111642">bootstrap/111642</a>:
<code>make profiledbootstrap</code> fails to type check <code>gcc</code>’s own <code>poly_int64</code> constructor</li>
<li><a href="https://gcc.gnu.org/PR111647">c++/111647</a>: <code>-fchecking=0/2</code> disagree
on validity of <code>IFNDR</code> <code>c++</code> handling</li>
<li><a href="https://gcc.gnu.org/PR111653">bootstrap/111653</a>: <code>-fchecking=0/2</code>
generate different code on the same input</li>
<li><a href="https://gcc.gnu.org/pipermail/gcc-patches/2023-October/633948.html"><code>libgcc</code> trampoline build fix</a>:
<code>libgcc</code> build failure in <code>libc</code>-less mode</li>
<li><a href="https://gcc.gnu.org/PR112107">rtl-optimization/112107</a>: bootstrap
failure on <code>i686-linux</code>: enabling debug changed register allocator output</li>
<li><a href="https://gcc.gnu.org/PR112321">middle-end/112321</a>: <code>gcc</code> <code>SIGSEGV</code> in
<code>debug</code> mode as it generated invalid objects</li>
<li><a href="https://gcc.gnu.org/PR112332">target/112332</a>: <code>ICE</code> in <code>gcc</code> when it
attempted to use <code>SIMD</code> instruction for stack access</li>
<li><a href="https://gcc.gnu.org/PR112347">c/112347</a>: <code>ICE</code> on <code>jemalloc</code> in newly
added <code>-Walloc-size</code> analysis</li>
<li><a href="https://gcc.gnu.org/PR112379">bootstrap/112379</a>: bootstrap builds
failure: unused function when asserts are disabled (code under <code>#ifdef</code>)</li>
<li><a href="https://gcc.gnu.org/PR112467">libstdc++/112467</a>: <code>__assume__</code> in
<code>libstdc++</code> broke <code>clang</code> usage of that library</li>
<li><a href="https://gcc.gnu.org/PR112523">target/112523</a>: <code>mpfr</code>, <code>libsodium</code> and
<code>unbound</code> tests were failing for an invalid <code>shrd</code> instruction use</li>
<li><a href="https://gcc.gnu.org/PR112540">target/112540</a>: <code>gstreamer</code> <code>ICE</code> in
<code>RTL</code> (invalid addressing mode for <code>SIMD</code>)</li>
<li><a href="https://gcc.gnu.org/PR112567">target/112567</a>: <code>linux</code> ICE on <code>RTL</code>
due to <code>gcc</code> generating invalid objects</li>
<li><a href="https://gcc.gnu.org/PR112601">ipa/112601</a>: <code>ICE</code> on <code>llvm-17.0.5</code>
code in <code>-fchecking=2</code> mode related to function attribute inference</li>
<li><a href="https://gcc.gnu.org/PR112613">target/112613</a>: bad code in comparison
code when <code>AVX2</code> registers are present in generated code</li>
</ul>
<h2 id="fun-discovery">fun discovery</h2>
I found a few new things as part of poking at those bugs:
<a href="https://en.cppreference.com/w/cpp/language/acronyms">IFNDR</a> “Ill-Formed,
No Diagnostic Required” is the known invalid code from type checking
standpoint that is allowed to be compiled. In this case the whole
program has an undefined behaviour.
<code>-Walloc-size</code> added in <a href="https://gcc.gnu.org/PR71219">PR71219</a> detects
interesting cases of <code>T * p = malloc(sz)</code> calls when it’s clear that <code>sz</code>
is smaller than <code>sizeof(T)</code>. It also works on <code>calloc()</code> and already
found a few benign instances in
<a href="https://sourceware.org/git/?p=elfutils.git;a=commitdiff;h=fb232b56ca4dc37a70fd4e581a0fc2c56dda5e0a">elfutils</a>,
<a href="https://gitlab.freedesktop.org/mstoeckl/waypipe/-/merge_requests/19">waypipe</a>,
<a href="https://github.com/swaywm/sway/commit/020a572ed615b8fe272c7566a27ee0abe73a58d7">sway</a>
and
<a href="https://github.com/swaywm/swaybg/commit/435be14610a4b4538adc6a926160ed434ff630fa">swaybg</a>.
<h1 id="histograms">histograms</h1>
Looking at the bug categories:
<ul>
<li><code>target</code>: 5</li>
<li><code>bootstrap</code>: 3</li>
<li><code>rtl-optimization</code>: 2</li>
<li><code>tree-optimization</code>: 1</li>
<li><code>driver</code>: 1</li>
<li><code>other</code>: 1</li>
<li><code>c++</code>: 1</li>
<li><code>middle-end</code>: 1</li>
<li><code>ipa</code>: 1</li>
<li><code>c</code>: 1</li>
<li><code>libstdc++</code>: 1</li>
<li><code>libgcc</code>: 1</li>
</ul>
This cycle was very unusual: it has more bugs than I expected, it spans
over 12 categories of compiler components, most of bugs are in <code>i386</code>
target.
<h1 id="parting-words">parting words</h1>
This cycle felt very busy: about 2-3 bugs per week.
I had a lot of joy fixing <a href="https://trofi.github.io/posts/301-another-gcc-profiling-bug.html"><code>PGO</code> bootstrap</a>
and submitting a few trivial fixes upstream.
I encountered only two bugs related to wrong code generated by <code>gcc</code>.
Most of the bugs were compiler crashes which are a lot easier to detect,
report and fix. Chances are I did not get to running enough test suites
as there were so many obvious bugs that required some attention.
Let’s see what stage 3 will bring us.
Have fun!
</article>
<article>
<h1>fuzzing duperemove</h1>
2023-11-21T00:00:00Z
<a href="https://github.com/markfasheh/duperemove/releases/tag/v0.14"><code>duperemove-0.14</code></a>
was released yesterday and included a few small fixes I wrote about
<a href="https://trofi.github.io/posts/304-duperemove-speedups.html">before</a>.
On top of that the new release contains an overhauled parallel file
scanner and database handler that scale a lot better on large files.
<h2 id="new-crashes">new crashes</h2>
Unfortunately recent changes also increased complexity of handling
deduplication queue in a way that caused occasional crashes like
asserts in <code>dedupe_extent_list()</code> at
<a href="https://github.com/markfasheh/duperemove/issues/329"><code>run_dedupe.c:448</code></a>.
In that case my typical dedupe run started crashing as:
<pre><code># duperemove -q --batchsize=0 --dedupe-options=partial,same -rd --hashfile=/run/duperemove/root-dupes.db /
...
ERROR: run_dedupe.c:287
ERROR: run_dedupe.c:287
ERROR: run_dedupe.c:287
ERROR: run_dedupe.c:287
ERROR: run_dedupe.c:287
ERROR: run_dedupe.c:287
ERROR: run_dedupe.c:287
ERROR: run_dedupe.c:287
ERROR: run_dedupe.c:287
ERROR: run_dedupe.c:287
ERROR: run_dedupe.c:287
ERROR: run_dedupe.c:287
[stack trace follows]
/nix/store/528cmbj5wnz37llwkfwxjxj3j41ng0gi-duperemove-0.14/bin/duperemove(print_stack_trace+0x2e) [0x409aee]
/nix/store/528cmbj5wnz37llwkfwxjxj3j41ng0gi-duperemove-0.14/bin/duperemove() [0x40d2dc]
/nix/store/7wkspba8d5i28bw0jfxyi3c70wrw2512-glib-2.76.4/lib/libglib-2.0.so.0(+0x8b06a) [0x7f906e45006a]
/nix/store/7wkspba8d5i28bw0jfxyi3c70wrw2512-glib-2.76.4/lib/libglib-2.0.so.0(+0x8a71d) [0x7f906e44f71d]
/nix/store/qn3ggz5sf3hkjs2c797xf7nan3amdxmp-glibc-2.38-27/lib/libc.so.6(+0x8b084) [0x7f906e038084]
/nix/store/qn3ggz5sf3hkjs2c797xf7nan3amdxmp-glibc-2.38-27/lib/libc.so.6(+0x10d60c) [0x7f906e0ba60c]

/nix/store/b86jv7hh4656xf60mby91w7a93wi4h03-remove-dupes.bash: line 12: 405268 Aborted                 (core dumped) SQLITE_TMPDIR=/run/duperemove /nix/store/528cmbj5wnz37llwkfwxjxj3j41ng0gi-duperemove-0.14/bin/duperemove -q --batchsize=0 --dedupe-options=partial,same -rd --hashfile=/run/duperemove/root-dupes.db /

duperemove-root.service: Main process exited, code=exited, status=134/n/a
duperemove-root.service: Failed with result 'exit-code'.
Failed to start duperemove-root.service.
duperemove-root.service: Consumed 34min 12.755s CPU time, no IP traffic.</code></pre>
It took 30 minutes of CPU time (and about 10 minutes of real time) to
only crash later.
<h2 id="crash-location">crash location</h2>
I poked a bit around the crash in <code>gdb</code> to find that failure happens
somewhere in the middle of <a href="https://github.com/markfasheh/duperemove/blob/v0.14/run_dedupe.c#L274"><code>dedupe_extent_list()</code></a>.
It’s a seemingly simple but big function:
<pre class="c"><code>static int dedupe_extent_list(struct dupe_extents *dext, uint64_t *fiemap_bytes,
                              uint64_t *kern_bytes, unsigned long long passno)
{
    int last = 0;
    struct dedupe_ctxt *ctxt = NULL;
    // ...
    list_for_each_entry(extent, &dext->de_extents, e_list) {
        if (list_is_last(&extent->e_list, &dext->de_extents))
            last = 1;
        // ...
        if (...) {
            // ...
            if (ctxt && last)
                goto run_dedupe;
            continue;
        }
        if (ctxt == NULL) {
            ctxt = new_dedupe_ctxt(dext->de_num_dupes,
                                   tgt_extent->e_loff, len,
                                   tgt_extent->e_file);
            // ...
            if (tgt_extent == extent)
                continue;
            // ...
        }
        // ...
        if (...) {
            // ...
            if (!last)
                continue;
        // ...
run_dedupe:
        // ...
close_files:
        filerec_close_open_list(&open_files);
        free_dedupe_ctxt(ctxt);
        ctxt = NULL;
        // ...
    }
    // ...
    abort_on(ctxt != NULL); // we fail here
    // ...
}</code></pre>
<code>duperemove</code> fails at <code>abort_on(ctxt != NULL);</code> assertion. The intent
of the assert seems straightforward: <code>ctxt</code> is expected to be created
within <code>list_for_each_entry()</code> loop and is destroyed before we exit the
loop.
There is a bit of logic that tries to track if we are in the last
element of the loop to make sure we clean up properly.
From a quick glance I was not able to figure out why <code>duperemove</code> crashes
on my input. What is worse: running <code>duperemove</code> with
<code>--io-threads=1 --cpu-threads=1</code> options to decrease parallelism and to
simplify deduplication sequence started triggering an unrelated
<code>abort_on()</code> in the same function.
That means there not just one but a few different cases that manage to
break <code>duperemove</code>.
<h2 id="building-a-reproducer">building a reproducer</h2>
As the scan takes at least 10 minutes on my input data I wondered if I
could extract a smaller example to present for upstream.
At first I though of crafting the files on the file system in a
particular way to match the way <code>duperemove</code> breaks for me. But I also
felt it would be a tedious task.
Before actually trying to extract the first crash example I got an idea
of fuzzing <code>duperemove</code>. In theory a simple sequence of random
actions against a file system to create interesting enough file state
would be able to create a lot more interesting scenarios than I have.
Maybe I’ll get something that crashes <code>duperemove</code> faster?
I though if the following operations for the fuzzer:
<ol type="1">
<li>create a brand new file with a few (<code>4</code>) unique non-dedupable blocks</li>
<li>copy full existing file contents with or without reflinking into a new file</li>
<li>copy one random block from one random file to another random file</li>
<li>run <code>duperemove</code> on the current state</li>
</ol>
<strong>Quick quiz</strong>: If we execute these actions at random, how long would it
take to crash <code>duperemove</code>? A second, a minute, a day or never?
Here is the direct <code>bash</code> implementation of the fuzzer described above:
<pre class="bash"><code>#!/usr/bin/env bash

duperemove_binary=$1
target_dir=$2

shift; shift

if [[ -z $duperemove_binary ]] || [[ -z $target_dir ]]; then
    echo "Usage: $0 </abs/path/to/duperemove> <directory> [duperemove opts]"
    exit 1
fi

# fail on any error
set -e

mkdir "$target_dir"
cd "$target_dir"

shopt -s nullglob

while :; do
    sync
    files=(*)
    f_count=${#files[@]}
    dst=$f_count

    case $((RANDOM % 4)) in
        0)  # copy existing file
            [[ $f_count -eq 0 ]] && continue

            cp_arg=""
            case $((RANDOM % 2)) in
                0) cp_arg=--reflink=always;;
                1) cp_arg=--reflink=never;;
            esac
            src=$((RANDOM % f_count))
            cp -v "$cp_arg" "$src" "$dst"
            ;;
        1) # create new file of 4x4KB distinct blocks
            printf "0%*d" 4095 "$dst"  > "$dst"
            printf "1%*d" 4095 "$dst" >> "$dst"
            printf "2%*d" 4095 "$dst" >> "$dst"
            printf "3%*d" 4095 "$dst" >> "$dst"
            ;;
        2) # run duperemove
            "$duperemove_binary" "$@" -rd -b 4096 "$target_dir"
            ;;
        3) # dd 4KB of one file into another
            [[ $f_count -eq 0 ]] && continue

            src=$((RANDOM % f_count))
            dst=$((RANDOM % f_count))
            [[ $src = $dst ]] && continue

            src_block=$((RANDOM % 3))
            dst_block=$((RANDOM % 3))
            dd "if=$src" "iseek=$src_block" "of=$dst" "oseek=$dst_block" bs=4096 count=1
            ;;
    esac
done</code></pre>
Specifically we always create <code>16KB</code> files and move <code>4KB</code> blocks around
to make sure <code>duperemove</code> considers them as a whole. We also pass
<code>-b 4096</code> block size as default block size is <code>128KB</code>.
And now goes the quiz answer:
<pre><code>$ time { rm -rfv ~/tmp/dr/ && bash ./duperemove-fuzz.bash $PWD/duperemove/duperemove ~/tmp/dr -q; }
...
Simple read and compare of file data found 12 instances of files that might benefit from deduplication.
ERROR: run_dedupe.c:287
[stack trace follows]
/home/slyfox/dev/git/duperemove/duperemove(print_stack_trace+0x29) [0x409b39]
/home/slyfox/dev/git/duperemove/duperemove() [0x40d844]
/nix/store/6bpc4sc5apc2ryxhjyij43n3wi169hfn-glib-2.76.4/lib/libglib-2.0.so.0(+0x8ad72) [0x7f8da7b47d72]
/nix/store/6bpc4sc5apc2ryxhjyij43n3wi169hfn-glib-2.76.4/lib/libglib-2.0.so.0(+0x8a41d) [0x7f8da7b4741d]
/nix/store/znf2bj54q3qj4pyk0xbp7pk12xbxq07b-glibc-2.38-27/lib/libc.so.6(+0x908b1) [0x7f8da77278b1]
/nix/store/znf2bj54q3qj4pyk0xbp7pk12xbxq07b-glibc-2.38-27/lib/libc.so.6(+0x114e6c) [0x7f8da77abe6c]
./duperemove-fuzz.bash: line 27: 264572 Aborted                 (core dumped) "$duperemove_binary" "$@" -rd -b 4096 "$target_dir"

real    0m3.354s
user    0m0.530s
sys     0m1.500s</code></pre>
3 seconds!
I’m not always as lucky: sometimes it takes 2 seconds, sometimes as much
as 8 seconds. It feels like a very good result for such a dumb script.
<h2 id="parting-words">parting words</h2>
<code>duperemove</code> has a non-trivial state machine to track extent state to
avoid multiple deduplication attempts. It clearly has a few bugs like
<a href="https://github.com/markfasheh/duperemove/issues/329">issue #329</a>.
Fuzzing works great for a small set of well defined operations. I picked
a minimal subset of operations to trigger the failures.
The fuzzer does not exercise other interesting operations like hardlinks
creation, file removal or larger files with more interesting extent
sharing structure. There is still more room for improvement to get into
darker corners of state tracking in <code>duperemove</code>.
I will not have much time debugging specifics of these <code>duperemove</code>
crashes in the following days. Try to fix these crashes yourself!
Have fun!
</article>
<article>
<h1>duperemove speedups</h1>
2023-11-12T00:00:00Z
<a href="https://github.com/markfasheh/duperemove/"><code>duperemove</code></a> is a great
tool to reduce on-disk file redundancy for file systems that support
data sharing across files. Currently <code>duperemove</code> supports only <code>btrfs</code>
and <code>xfs</code>.
<h2 id="initial-user-experience">initial user experience</h2>
I started using <code>duperemove</code> around version <code>0.9</code> when I started keeping
multiple <code>chroot</code> environments for various Linux distributions. I had
about 15 instances of various systems:
<pre><code>/chroot/gentoo-amd64-multilib/{bin,lib,usr,...}
/chroot/gentoo-amd64-O3/{bin,lib,usr,...}
/chroot/gentoo-amd64-lto/{bin,lib,usr,...}
/chroot/gentoo-i686/{bin,lib,usr,...}
/chroot/debian-sid/{bin,lib,usr,...}
...</code></pre>
Some of the files had identical or mostly identical content across
systems. The typical source of duplicates would be a
<code>/usr/lib/locale/locale-archive</code> file. Today it contains <code>~200MB</code> worth
of locales on systems with complete set of locales. Another source is
huge dumps of translation files in <code>/usr/share/locale</code>.
As I used <code>btrfs</code> I ran <code>duperemove</code> time to time on a <code>/chroot</code>
directory to claw back a bit of storage eaten by these duplicates. I had
a spinning disk at the time and I was amazed by the speed that
<code>duperemove</code> took to do it’s magic.
<h2 id="more-recent-usage-attempt">more recent usage attempt</h2>
A few years later I got an <code>NVMe</code> <code>SSD</code> storage and started using <code>nix</code>
extensively to build a lot of packages. <code>nix</code> can produce quite a bit of
duplication when building package with minimal changes:
<pre><code>$ ls -1d /nix/store/????????????????????????????????-glibc-2.38-27
/nix/store/4jyz743dan9fn8b53cdxl5fyld2hkaby-glibc-2.38-27
/nix/store/as01fdk2w1605lr3lmpqpwa1xan17gbd-glibc-2.38-27
/nix/store/d9bhmzah59gzbm7bkki7i8fm0p4nyiyv-glibc-2.38-27
/nix/store/iwnxvprzvymiigxds0pw53sxg11m4azk-glibc-2.38-27
/nix/store/ldavvsk4f57lmw816ch0c4v82hf6ww8g-glibc-2.38-27
/nix/store/mk28ys1qrw7mb04psrnmr0p56bxw3g54-glibc-2.38-27
/nix/store/pnnqn1gh3jd7imjjyibgf8r0n0zjrxf5-glibc-2.38-27
/nix/store/ps99gh0kq7yar94ap6ya6d2av2rfa8dz-glibc-2.38-27
/nix/store/qn3ggz5sf3hkjs2c797xf7nan3amdxmp-glibc-2.38-27
/nix/store/qpspr9zw4vxq6fq3rc1izqsglk497m67-glibc-2.38-27
/nix/store/wrv286x4aldgbj6gjl15qn8pl233zrsx-glibc-2.38-27
/nix/store/y2yvsr2gb27ixz8mc42ry4q6lpasl0fk-glibc-2.38-27

$ ls -1d /nix/store/????????????????????????????????-firefox-119.0.1
/nix/store/c3c387alqga9b9s0r4n064d6kkan07dy-firefox-119.0.1
/nix/store/ck5k36nn46vbpc534hvncbana2rmdpxj-firefox-119.0.1
/nix/store/qx36d083630w1ksp3n38avsyk52zxf9j-firefox-119.0.1
/nix/store/r4cjmc042q18bi7xg2jmcxqs8nzl4fr9-firefox-119.0.1
/nix/store/s2qiq9xszj4k7z64ri3lrl1hwqa48v3p-firefox-119.0.1
/nix/store/zns8fsz0c7adk7aw1x11kal6235jxxya-firefox-119.0.1</code></pre>
Sometimes two versions of a package build only differ in the embedded
paths of their dependencies: it’s a 32 byte difference between <code>ELF</code>
files. For the example above <code>duperemove</code> quickly spots the similarity:
<pre><code>$ duperemove -b4096 --batchsize=0 -q --dedupe-options=partial -rd /nix/store/????????????????????????????????-glibc-2.38-27

Simple read and compare of file data found 1174 instances of files that might benefit from deduplication.
Comparison of extent info shows a net change in shared extents of: 266229126
Found 133 identical extents.
[########################################]
Search completed with no errors.
Simple read and compare of file data found 114 instances of extents that might benefit from deduplication.
Comparison of extent info shows a net change in shared extents of: 2183168
Total files scanned:  10636</code></pre>
Here <code>duperemove</code> managed to deduplicate <code>70%</code> (<code>253MB</code> out of <code>357MB</code>
considered) in the extent pass comparison. And the extracted extra <code>2MB</code>
of duplicates when considered <code>4KB</code> blocks within different extents.
I decided to try <code>duperemove</code> on the whole of my <code>/nix/store</code> directory.
I ran <code>duperemove-0.11</code> and got failures related to exhausted file
descriptors: <code>duperemove</code> ran with <code>32x</code> parallelism and was able to hit
<code>4096</code> open files. That was easy to fix by
<a href="https://github.com/markfasheh/duperemove/pull/269">raising the file limit</a>.
I think it worked as fast as before.
But a while after I ran <code>duperemove-0.13</code> against <code>/nix/store</code>. 2 hours
later I found that it did not finish and ate <code>100%</code> of the CPU. That was
unexpected.
I was not sure if it was a particular file that was causing trouble or
the sheer load on <code>duperemove</code> that made it degrade so much under the
load.
I attempted to run <code>duperemove</code> in incremental mode and found out that
it rescans all the files on the database on each run effectively making
the incremental mode quadratic. I filed <a href="https://github.com/markfasheh/duperemove/issues/303">a bug</a>
to see if it could be fixed.
Jack implemented incremental mode the same day! I tried it and saw an
improvement. But the result was still too slow to run on the whole of
<code>/nix/store</code> within a day. I could not easily pinpoint the problem of
<code>100%</code> CPU usage on my workloads.
<h2 id="duperemove-complexity-intuition"><code>duperemove</code> complexity intuition</h2>
What are the <code>duperemove</code>’s scaling limits? I had about <code>4 million</code>
files taking <code>300GB</code> of storage in <code>/nix/store</code> on <code>NVMe</code> device.
Quick quiz: How long should it take to dedupe that data you would say? A
minute, an hour, a day?
In theory all it takes to do is to read all the data out, checksum it
and attempt the deduplication on identified candidates. Should be an
IO-bound problem without too many random reads.
Given that <code>duperemove</code> has an optional <code>sqlite</code> database to persist
details about previous runs it even skips data read of the files it
already processed.
If we have a reasonable fast IO storage capable of 1GB/s of sequential
read throughput then it should ideally take about <code>300GB / 1GB/s = ~5 minutes</code>. And on top of that there should be some minor overhead to
calculate checksums and store some state in <code>sqlite</code> database. That was
my naive reasoning :)
Practice showed that <code>duperemove</code> found the CPU-heavy work to do for
hours on my machine.
<h2 id="synthetic-tests">synthetic tests</h2>
I ran <code>perf top</code> and noticed that <code>duperemove</code> showed unusual reading on
various stages of a run: at one point most of the time was spent in
<code>sqlite</code> internals, at another one some <code>rb_next()</code> function took most
of the time. I did not expect such things in an IO-mostly workload.
As it was not very convenient to experiment with <code>duperemove</code>’s
behaviour on real data I tried to throw synthetic workloads at it.
I started simple: created <code>100 thousands</code> files of <code>1KB</code> size and ran
<code>duperemove</code> at it:
<pre class="bash"><code>echo "Creating directory structure, will take a minute"
mkdir dd
for d in `seq 1 100`; do
    mkdir dd/$d
    for f in `seq 1 1000`; do
        printf "%*s" 1024 "$f" > dd/$d/$f
    done
done
sync</code></pre>
How long should it take to run? Maybe 1-2 seconds? Alas running it for
real showed the following:
<pre><code>$ time ./duperemove -q -rd dd/
...
Nothing to dedupe.
Total files scanned:  100000
real    0m39,835s
user    1m54,903s
sys     0m8,922s</code></pre>
Almost <code>40 seconds</code> of real time and almost <code>2 minutes</code> of user time
(<code>duperemove</code> runs some actions in parallel) to process <code>100MB</code> of data.
But what did <code>duperemove</code> do all that time? Let’s ask <code>perf</code>:
<pre><code>$ perf record ./duperemove -q -rd dd/
$ perf report

# Overhead  Command       Shared Object            Symbol
# ........  ............  .......................  ...........................................
#
    70.81%  pool          libc.so.6                [.] __memset_avx2_unaligned_erms
     2.14%  duperemove    libsqlite3.so.0.8.6      [.] sqlite3VdbeExec
     0.97%  pool          libsqlite3.so.0.8.6      [.] sqlite3VdbeExec
     0.58%  pool          libc.so.6                [.] __memmove_avx_unaligned_erms
...</code></pre>
Vast majority of the CPU time it spent in <code>memset()</code>!
<h2 id="memset-fix"><code>memset()</code> fix</h2>
There are various ways to find the <code>memset()</code> call. I took the lazy
approach to check where <code>memset()</code> is called with a large value and did
not find any offenders. Then I checked all <code>calloc()</code> calls and found
huge <code>calloc(8MB)</code> allocating temporary space to read files out. This
buffer was allocated at each new opened file.
<a href="https://github.com/markfasheh/duperemove/pull/318">The fix</a> was simple:
<pre class="diff"><code>--- a/file_scan.c
+++ b/file_scan.c
@@ -887,7 +887,7 @@ static void csum_whole_file(struct filerec *file,
        struct block_csum *block_hashes = NULL;

        memset(&csum_ctxt, 0, sizeof(csum_ctxt));
-       csum_ctxt.buf = calloc(1, READ_BUF_LEN);
+       csum_ctxt.buf = malloc(READ_BUF_LEN);
        assert(csum_ctxt.buf != NULL);
        csum_ctxt.file = file;</code></pre>
After the fix profile looked a bit better:
<pre><code>$ time perf record ./duperemove -q -rd dd/
real    0m13,046s
user    0m11,194s
sys     0m2,581s

$ perf report
...
# Overhead  Command       Shared Object            Symbol
# ........  ............  .......................  ..........................................
#
    18.71%  duperemove    libsqlite3.so.0.8.6      [.] sqlite3VdbeExec
     3.18%  pool          libsqlite3.so.0.8.6      [.] sqlite3VdbeExec
     1.95%  pool          libsqlite3.so.0.8.6      [.] sqlite3WhereBegin
     1.66%  pool          libsqlite3.so.0.8.6      [.] resolveExprStep
     1.64%  pool          libsqlite3.so.0.8.6      [.] whereLoopAddBtreeIndex
     1.52%  pool          libc.so.6                [.] __memmove_avx_unaligned_erms
     1.17%  pool          libsqlite3.so.0.8.6      [.] sqlite3_str_vappendf
     1.13%  duperemove    duperemove               [.] populate_tree
     1.10%  pool          libc.so.6                [.] _int_malloc</code></pre>
That was a lot better: almost <code>3x</code> speed up just for removing a single
redundant <code>memset()</code>. It’s a safe change as <code>duperemove</code> guarantees that
it initializes the area with data from the file before calculating the
hash.
Now <code>sqlite</code> is at the top of our profile. Looks like the rest of <code>80%</code>
samples goes to IO wait time.
<h2 id="needless-work-on-small-files">needless work on small files</h2>
While the test seemed to run quickly I noticed that it complains about
dedupe attempts all the time:
<pre><code>$ ./duperemove -q -rd dd/
...
Dedupe for file "dd/20/426" had status (1) "data changed".
Dedupe for file "dd/20/427" had status (1) "data changed".
Dedupe for file "dd/20/240" had status (1) "data changed".
Dedupe for file "dd/20/439" had status (1) "data changed".
Dedupe for file "dd/20/377" had status (1) "data changed".
Dedupe for file "dd/20/378" had status (1) "data changed".
Dedupe for file "dd/20/452" had status (1) "data changed".
Dedupe for file "dd/20/453" had status (1) "data changed".</code></pre>
Why does <code>duperemove</code> think data has changed? Those are static files
I just created for test.
The answer was surprising:
<ul>
<li>[good] <code>duperemove</code> skips inline extents in files as file systems can’t
deduplicate data embedded in metadata blocks</li>
<li>[bad] <code>duperemove</code> stores checksum of such files as if they were zero
bytes</li>
<li>[very bad] <code>duperemove</code> tries to deduplicate all these files as they
have identical checksum</li>
</ul>
That was <a href="https://github.com/markfasheh/duperemove/pull/322">easy to fix</a>
as well:
<pre class="diff"><code>--- a/file_scan.c
+++ b/file_scan.c
@@ -937,10 +937,19 @@ static void csum_whole_file(struct filerec *file,
                }
        }

-       ret = dbfile_store_file_digest(db, file, csum_ctxt.file_digest);
-       if (ret) {
-               g_mutex_unlock(&io_mutex);
-               goto err;
+       /* Do not store files with zero hashable extents. Those are
+        * usually small files inlined with extent type
+        * FIEMAP_EXTENT_DATA_INLINE. We avoid storing them as all these
+        * files have the same zero bytes checksum. Attempt to
+        * deduplicate those will never succeed and will produce a lot
+        * of needless work: https://github.com/markfasheh/duperemove/issues/316
+        */
+       if (nb_hash > 0) {
+               ret = dbfile_store_file_digest(db, file, csum_ctxt.file_digest);
+               if (ret) {
+                       g_mutex_unlock(&io_mutex);
+                       goto err;
+               }
        }

        ret = dbfile_commit_trans(db);</code></pre>
A few lines above <code>duperemove</code> calculates count of non-inline extents
with data in <code>nb_hash</code> variable. If it’s zero then checksum is zero.
The fix as is speeds the scan quite a bit:
<pre><code>$ time ./duperemove -q -rd dd/
...
Simple read and compare of file data found 0 instances of files that might benefit from deduplication.
Nothing to dedupe.
Found 0 identical extents.
Simple read and compare of file data found 0 instances of extents that might benefit from deduplication.
Nothing to dedupe.
Simple read and compare of file data found 0 instances of files that might benefit from deduplication.
Nothing to dedupe.
Found 0 identical extents.
Simple read and compare of file data found 0 instances of extents that might benefit from deduplication.
Nothing to dedupe.
Total files scanned:  100000

real    0m7,844s
user    0m7,116s
sys     0m1,686s</code></pre>
This is an extra <code>2x</code> speed up on tiny files. But we can squeeze a bit
more speed out of it. Notice repetitive <code>Found 0 identical extents</code>
entries. This happens because <code>duperemove</code> batches deduplication
attempts every <code>1024</code> files (controlled by <code>--batchsize=</code> flag). We can
crank up that flag as well:
<pre><code>$ time ./duperemove --batchsize=1000000 -q -rd dd/
Simple read and compare of file data found 0 instances of files that might benefit from deduplication.
Nothing to dedupe.
Found 0 identical extents.
Simple read and compare of file data found 0 instances of extents that might benefit from deduplication.
Nothing to dedupe.
Total files scanned:  100000

real    0m5,995s
user    0m4,605s
sys     0m1,677s</code></pre>
Compared to our initial <code>50s</code> runtime we got <code>~10x</code> speed up.
Question for the reader: what does <code>duperemove</code> do in those 5 seconds?
Is there any room for improvement here?
The above set of fixes sped up <code>duperemove</code> on my real <code>300GB</code> dataset
to finish in 2 minutes!
All done?
<h2 id="dedupe-optionspartial-mode"><code>--dedupe-options=partial</code> mode</h2>
When I skimmed through existing bugs on <code>duperemove</code> bug tracker I
noticed that <code>duperemove</code> only deduplicates extents of identical size
and does not try to look into individual blocks for performance reason.
That sounded a bit strange to me as break up of a file on extents is
quite arbitrary at least on <code>btrfs</code>. You can easily have one extent for
a huge file or a ton of really small ones.
I threw <code>--dedupe-options=partial</code> at <code>300GB</code> and got a CPU-bound hangup
again. An hour later I had to interrupt the process.
This time most of the time was spent in <code>sqlite</code> extent queries for each
individual file. Let’s looks at
<a href="https://github.com/markfasheh/duperemove/pull/324">the fix</a> to get idea
where the problem was hiding:
<pre class="diff"><code>--- a/dbfile.c
+++ b/dbfile.c
@@ -1424,9 +1424,8 @@ int dbfile_load_nondupe_file_extents(sqlite3 *db, struct filerec *file,
        struct file_extent *extents = NULL;

 #define NONDUPE_JOIN                                                   \
-       "FROM extents JOIN (SELECT digest FROM extents GROUP BY digest "\
-       "HAVING count(*) = 1) AS nondupe_extents on extents.digest = "  \
-       "nondupe_extents.digest where extents.ino = ?1 and extents.subvol = ?2;"
+       "FROM extents where extents.ino = ?1 and extents.subvol = ?2 and " \
+       "(1 = (SELECT COUNT(*) FROM extents as e where e.digest = extents.digest));"
 #define GET_NONDUPE_EXTENTS                                            \
        "select extents.loff, len, poff, flags " NONDUPE_JOIN</code></pre>
The change switches from this query:
<pre class="sql"><code>SELECT extents.loff, len, poff, flags
FROM extents JOIN (
    SELECT digest
    FROM extents
    GROUP BY digest
    HAVING count(*) = 1) AS nondupe_extents
ON extents.digest = nondupe_extents.digest
WHERE extents.ino = ?1 AND extents.subvol = ?2;</code></pre>
To this query:
<pre class="sql"><code>SELECT extents.loff, len, poff, flags
FROM extents
WHERE extents.ino = ?1 AND extents.subvol = ?2 AND (
    1 = (SELECT COUNT(*)
         FROM extents as e
         WHERE e.digest = extents.digest));</code></pre>
Both queries should produce identical result and both do two things:
<ol type="1">
<li>Fetch all entries from <code>extents</code> table for a given <code>inode</code> (<code>ino</code>
and <code>subvol</code>)</li>
<li>Pick only those extents that have unique digest: so that extents are
not shared and thus splitting them into smaller extents should not be
a problem.</li>
</ol>
Here are both plans for both queries told by <code>sqlite</code>. Enabling plan
dump:
<pre><code>$ sqlite3 /tmp/foo.db
sqlite> .eqp on</code></pre>
First plan:
<pre><code>sqlite> SELECT extents.loff, len, poff, flags
       FROM extents JOIN (
         SELECT digest
         FROM extents
         GROUP BY digest
         HAVING count(*) = 1) AS nondupe_extents
       on extents.digest =  nondupe_extents.digest
       where extents.ino = ?1 and extents.subvol = ?2;

QUERY PLAN
|--CO-ROUTINE nondupe_extents
|  `--SCAN extents USING COVERING INDEX idx_extent_digest
|--SEARCH extents USING INDEX idx_extents_inosub (ino=? AND subvol=?)
|--BLOOM FILTER ON nondupe_extents (digest=?)
`--SEARCH nondupe_extents USING AUTOMATIC COVERING INDEX (digest=?)</code></pre>
Second plan:
<pre><code>sqlite> SELECT extents.loff, len, poff, flags
        FROM extents
        where extents.ino = ?1 and extents.subvol = ?2 and (
          1 = (SELECT COUNT(*)
              FROM extents as e
              where e.digest = extents.digest));

QUERY PLAN
|--SEARCH extents USING INDEX idx_extents_inosub (ino=? AND subvol=?)
`--CORRELATED SCALAR SUBQUERY 1
   `--SEARCH e USING COVERING INDEX idx_extent_digest (digest=?)</code></pre>
The first plan is more complicated. One of it’s problems is the use of
<code>SCAN</code> (full table scan) in <code>CO-ROUTINE nondupe_extents</code>. As I understand
the output here full <code>extents</code> table scan is performed at least once for
this whole query.
Reading the second plan is easy: all searches use existing indexes in
the tables. We fetch all extents for the <code>inode</code> and then leave only
those that match a subquery. Subquery also uses only index lookup.
Now the whole non-incremental <code>duperemove</code> run on my <code>300GB</code> dataset
takes 9 minutes. And in incremental mode it takes about 4 minutes.
Yay!
<h2 id="parting-words">Parting words</h2>
It took me a few steps to get <code>duperemove</code> to work on my machines:
<ul>
<li><a href="https://github.com/markfasheh/duperemove/pull/269">Increase file descriptor limit</a></li>
<li><a href="https://github.com/markfasheh/duperemove/pull/318">Avoid redundant <code>memset(8MB)</code> on file read</a></li>
<li><a href="https://github.com/markfasheh/duperemove/pull/322">Do not deduplicate inline-only files</a></li>
<li><a href="https://github.com/markfasheh/duperemove/pull/324">Avoid full table scan in <code>partial</code> mode</a></li>
</ul>
This sped up <code>duperemove</code> run from multiple hours down to under 10
minutes on a few hundreds of gigabytes of small files.
<code>duperemove</code> still has quite a bit room for improvement to get even more
performance.
Fun fact: <code>duperemove</code> uses two simple <code>ioctl()</code> interfaces:
<ul>
<li><code>FS_IOC_FIEMAP</code> to get on-disk layout for a file:
<a href="https://docs.kernel.org/filesystems/fiemap.html" class="uri">https://docs.kernel.org/filesystems/fiemap.html</a>.</li>
<li><code>FIDEDUPERANGE</code> to deduplicate file range between two file
descriptors.</li>
</ul>
Have fun!
</article>
<article>
<h1>Zero Hydra Failures towards 23.11 NixOS release</h1>
2023-11-08T00:00:00Z
<h2 id="zhf"><code>ZHF</code></h2>
The next <code>NixOS-23.11</code> will be released around the end of November.
Current development phase is called <code>Zero Hydra Failures</code> (<code>ZHF</code>): at
this time the main focus is to fix as many build failures in
<code>nixpkgs/master</code> repository as possible before the final release.
<a href="https://github.com/NixOS/nixpkgs/issues/265948">Issue #265948</a> (and
<a href="https://discourse.nixos.org/t/zero-hydra-failure-23-11-edition/35103">the discourse topic</a>)
is the tracker where you can get hints on how to help fixing known
broken packages and review already proposed fixes.
It is a great time to contribute to <code>nixpkgs</code>!
To follow the tradition let’s fix one package here.
<h2 id="newlib-example"><code>newlib</code> example</h2>
<a href="https://hydra.nixos.org/jobset/nixpkgs/trunk"><code>trunk jobset</code></a> shows us
about ~3800 build failure. I picked <a href="https://hydra.nixos.org/log/nv0q296sc06achvd7ljlrsn8x3qh8fg1-newlib-4.3.0.20230120.drv"><code>newlib</code> failure</a>
and will try to fix it. The install part fails there as:
<pre><code>...
installing
install flags: SHELL=/nix/store/lf0wpjrj8yx4gsmw2s3xfl58ixmqk8qa-bash-5.2-p15/bin/bash install
make[1]: Entering directory '/build/newlib-4.3.0.20230120'
/nix/store/lf0wpjrj8yx4gsmw2s3xfl58ixmqk8qa-bash-5.2-p15/bin/bash ./mkinstalldirs /nix/store/1wxhiz8jkyff6chkwp89vy85qlgvi7ij-newlib-4.3.0.20230120 /nix/store/1wxhiz8jkyff6chkwp89vy85qlgvi7ij-newlib-4.3.0.20230120
make[2]: Entering directory '/build/newlib-4.3.0.20230120/etc'
make[3]: Entering directory '/build/newlib-4.3.0.20230120/etc'
make[3]: Nothing to be done for 'install-exec-am'.
make[3]: Nothing to be done for 'install-data-am'.
make[3]: Leaving directory '/build/newlib-4.3.0.20230120/etc'
make[2]: Leaving directory '/build/newlib-4.3.0.20230120/etc'
make[1]: Nothing to be done for 'install-target'.
make[1]: Leaving directory '/build/newlib-4.3.0.20230120'
$out is empty</code></pre>
I have no idea why the build fails. Let’s find out the hard way.
<a href="https://hydra.nixos.org/build/239066832">The build tab</a> tells us
that last successful build of <code>newlib</code> was around <code>2023-06-18</code>
on <code>d9895270b775226e0fdabd7937af2d236abe4eb2</code> <code>nixpkgs</code> input. And first
failed commit was <code>8277b539d371bf4308fc5097911aa58bfac1794f</code> around
<code>2023-07-01</code>.
Running bisect:
<pre><code>$ git bisect start 8277b539d371bf4308fc5097911aa58bfac1794f d9895270b775226e0fdabd7937af2d236abe4eb2
$ git bisect run nix build -f. newlib

commit cf1b7c4d5c027837e71d284a838fbeb05b3fcb7f
Date:   Sat Jun 24 01:13:17 2023 +0200

    newlib: fix build of nano variant on non-ARM architectures
...</code></pre>
The full diff of this
<a href="https://github.com/NixOS/nixpkgs/commit/cf1b7c4d5c027837e71d284a838fbeb05b3fcb7f">commit</a>
is small an readable:
<pre class="diff"><code>--- a/pkgs/development/misc/newlib/default.nix
+++ b/pkgs/development/misc/newlib/default.nix
@@ -73,10 +73,12 @@ stdenv.mkDerivation (finalAttrs: {
       cd $out${finalAttrs.passthru.libdir}

       for f in librdimon.a libc.a libg.a; do
-        cp "$f" "''${f%%\.a}_nano.a"
+        # Some libraries are only available for specific architectures.
+        # For example, librdimon.a is only available on ARM.
+        [ -f "$f" ] && cp "$f" "''${f%%\.a}_nano.a"
       done
     )
-  '';
+  '' + ''[ "$(find $out -type f | wc -l)" -gt 0 ] || (echo '$out is empty' 1>&2 && exit 1)'';

   passthru = {
     incdir = "/${stdenv.targetPlatform.config}/include";</code></pre>
In our case of <code>newlib</code> (not <code>newlib-nano</code>) the only change is the
addition of <code>[ "$(find $out -type f | wc -l)" -gt 0 ] || (echo '$out is empty' 1>&2 && exit 1)</code>
line. It causes build to fail if <code>$out</code> is empty. The <code>$out</code> was always
empty for <code>newlib.x86_64-linux</code>. Normally the <code>newlib</code> output contains
something only for bare-metal targets like
<code>pkgsCross.x86_64-embedded.newlib</code>.
Thus the fix is to constrain <code>newlib</code> to only those targets:
<pre class="diff"><code>--- a/pkgs/development/misc/newlib/default.nix
+++ b/pkgs/development/misc/newlib/default.nix
@@ -96,5 +96,9 @@ stdenv.mkDerivation (finalAttrs: {
     # COPYING.NEWLIB
     # COPYING3
     license = licenses.gpl2Plus;
+    # newlib frequently does ont supply any files on hosted targets like
+    # x86_64-unknown-linux-gnu: https://hydra.nixos.org/log/nv0q296sc06achvd7ljlrsn8x3qh8fg1-newlib-4.3.0.20230120.drv
+    # Let's constrain `newlib` package to bare-metal alone.
+    broken = !stdenv.hostPlatform.isNone;
   };
 })</code></pre>
This change was proposed as <a href="https://github.com/NixOS/nixpkgs/pull/266268">PR #266268</a>.
<h2 id="parting-words">Parting words</h2>
Fixing package breakages are usually easier if the package used to work
at some point before. Otherwise we can always mark packages broken and
schedule them for removal.
Have fun!
</article>
<article>
<h1>-Ofast and -ffast-math non-local effects</h1>
2023-10-25T00:00:00Z
<h2 id="tldr">Tl;DR:</h2>
<code>-ffast-math</code> / <code>-Ofast</code> options are very tricky to use correctly:
in addition to breaking your immediate floating point arithmetic code
(which you might be prepared for) it also <strong>breaks the code not compiled
with these options</strong> but happen to be present in the same address
space:

Here I assume that code built with <code>-Ofast</code> was clean and was prepared
to changes caused by <code>-Ofast</code> effect. <code>-O2</code>-compiled code was not
prepared for <code>-Ofast</code> flag effects.
All the red boxes are negatively affected by the module compiled with
<code>-Ofast</code>.
<h2 id="more-words">More words</h2>
Let’s start off with an executable example: we’ll construct a very small
<code>double</code> value and print it with some of it’s properties:
<h3 id="an-example">An example</h3>
<pre class="c"><code>#include <math.h>
#include <stdio.h>

static const char * fpc(double v) {
    switch (fpclassify(v)) {
        case FP_ZERO: return "FP_ZERO";
        case FP_SUBNORMAL: return "FP_SUBNORMAL";
        case FP_NORMAL: return "FP_NORMAL";
        case FP_INFINITE: return "FP_INFINITE";
        case FP_NAN: return "FP_NAN";
        default: return "UKNOWN (unhandled?)";
    }
}

int main() {
    double small = 0x1.0p-1040;
    volatile double also_small = 0x1.0p-1040;
    printf("     small = %a or %e (%s)\n", small, small, fpc(small));
    printf("also_small = %a or %e (%s)\n", also_small, also_small, fpc(also_small));
}</code></pre>
C standard defines a few <code>FP</code> classes (taken from <code>man fpcassify</code>):
<ul>
<li><code>FP_NAN</code>: x is “Not a Number”.</li>
<li><code>FP_INFINITE</code>: x is either positive infinity or negative infinity.</li>
<li><code>FP_ZERO</code>: x is zero.</li>
<li><code>FP_SUBNORMAL</code>: x is too small to be represented in normalized format.</li>
<li><code>FP_NORMAL</code>: if nothing of the above is correct then it must be a
normal floating-point number.</li>
</ul>
I am using <code>volatile</code> on <code>also_small</code> to prevent <code>gcc</code> from folding
constants at compile time. I also use <code>gcc</code> extension to write down
floats in hexadecimal form (instead of decimal form).
Quick quiz: what class should the above program print you think?
Let’s run it and see the answer:
<pre><code>$ gcc a.c -o a -O2 && ./a
     small = 0x0.00004p-1022 or 8.487983e-314 (FP_SUBNORMAL)
also_small = 0x0.00004p-1022 or 8.487983e-314 (FP_SUBNORMAL)

$ gcc a.c -o a -Ofast && ./a
     small = 0x0.00004p-1022 or 8.487983e-314 (FP_SUBNORMAL)
also_small = 0x0.00004p-1022 or 8.487983e-314 (FP_ZERO)

$ gcc a.c -o a -ffast-math && ./a
     small = 0x0.00004p-1022 or 8.487983e-314 (FP_SUBNORMAL)
also_small = 0x0.00004p-1022 or 8.487983e-314 (FP_ZERO)</code></pre>
Apparently it depends! On <code>-O2</code> the value is detected as <code>FP_SUBNORMAL</code>
and on <code>-Ofast</code> (or <code>-ffast-math</code>) it goes as <code>FP_ZERO</code>. Thus <code>printf()</code>
disagrees and prints something that looks more like a small number than
zero.
That is unfortunate. But maybe it’s expected by someone who uses
<code>-Ofast</code>?
<h2 id="cross-module-effects">Cross-module effects</h2>
Let’s create an empty <code>empty.c</code> file and build it with <code>-Ofast</code>. And
then build our initial program with <code>-O2</code>. I’ll use <code>gcc-12</code>
specifically:
<pre><code>$ touch empty.c
$ gcc-12 -shared -fPIC empty.c -Ofast -o libfast.so

$ gcc-12 -O2 a.c -o a && ./a
     small = 0x0.00004p-1022 or 8.487983e-314 (FP_SUBNORMAL)
also_small = 0x0.00004p-1022 or 8.487983e-314 (FP_SUBNORMAL)

$ gcc-12 -O2 a.c -o a -L. -Wl,--no-as-needed -lfast -Wl,-rpath,'$ORIGIN' && ./a
     small = 0x0.00004p-1022 or 8.487983e-314 (FP_SUBNORMAL)
also_small = 0x0.00004p-1022 or 8.487983e-314 (FP_ZERO)</code></pre>
See the difference?
Some distributions like Debian or Ubuntu use <code>-Wl,--as-needed</code> by
default and throw away the library dependencies without explicit symbol
references. <code>-Wl,--no-as-needed</code> makes sure we still retain <code>-lfast</code> in
our runtime dependencies.
Note how the mere presence of <code>libfast.so</code> in the library dependencies
changes output of the program otherwise compiled and liked with <code>-O2</code>
option:

<h2 id="breakage-mechanics">Breakage mechanics</h2>
Before looking at the implementation let’s have a look at the option
descriptions <code>gcc</code> man page provides:
<pre><code>-ffast-math

  Sets the options -fno-math-errno, -funsafe-math-optimizations,
  -ffinite-math-only, -fno-rounding-math, -fno-signaling-nans,
  -fcx-limited-range and -fexcess-precision=fast.

  This option causes the preprocessor macro "__FAST_MATH__" to be defined.

  This option is not turned on by any -O option besides -Ofast since it
  can result in incorrect output for programs that depend on an exact
  implementation of IEEE  or  ISO  rules/specifications  for math
  functions. It may, however, yield faster code for programs that do not
  require the guarantees of these specifications.</code></pre>
Note how vague the description is: it says your floating point code
might do something funny that violates C standard but does not go into
specifics.
For most people it should be a good hint not to use the option lightly.
The effect we see in <code>fpclassify()</code> in our example’s instability is the
result of <code>-funsafe-math-optimizations</code> option. That one is described as:
<pre><code>-funsafe-math-optimizations

  Allow optimizations for floating-point arithmetic that (a) assume that
  arguments and results are valid and (b) may violate IEEE or ANSI
  standards.  When used at link time, it may include libraries or
  startup files that change the default FPU control word or other
  similar optimizations.

  This option is not turned on by any -O option since it can result in
  incorrect output for programs that depend on an exact implementation
  of IEEE or ISO rules/specifications for math functions. It  may,
  however, yield faster code for programs that do not require the
  guarantees of these specifications.  Enables -fno-signed-zeros,
  -fno-trapping-math, -fassociative-math and -freciprocal-math.

  The default is -fno-unsafe-math-optimizations.</code></pre>
This option changes global setting of an <code>FP</code> unit at program start.
This change affects not only the code explicitly compiled with
<code>-ffast-math</code> but also affects everything else that resides in the same
address space.
Mechanically the <code>FPU</code> state changes when <code>gcc</code> links <code>crtfastmath.o</code>
on <code>-Ofast</code> / <code>-ffast-math</code>. Object file is implemented via <code>spec</code>
machinery:
<pre><code>$ gcc -dumpspecs | fgrep crtfastmath

... %{Ofast|ffast-math|funsafe-math-optimizations:crtfastmath.o%s} ...</code></pre>
This <code>spec</code> dump tells <code>gcc</code> to always add a <code>crtfastmath.o</code> as an input
if any of <code>-Ofast</code>, <code>-ffast-math</code> or <code>-funsafe-math-optimizations</code> is
passed to <code>gcc</code>. We can also verify it with <code>-Wl,-t</code> to trace all linker
inputs:
<pre><code>$ touch a.c

$ gcc -shared a.c -o libshared.so -Wl,-t |& fgrep crtfast
<nothing>

$ gcc -shared a.c -o libshared.so -Wl,-t -Ofast |& fgrep crtfast |& unnix
/<<NIX>>/gcc-12.3.0/lib/gcc/x86_64-unknown-linux-gnu/12.3.0/crtfastmath.o</code></pre>
To achieve this effect <code>libgcc/config/i386/crtfastmath.c</code> defines global
constructor to change <code>FP</code> state:
<pre class="c"><code>static void __attribute__((constructor))
set_fast_math (void)
{
  unsigned int mxcsr = __builtin_ia32_stmxcsr ();
  mxcsr |= MXCSR_DAZ | MXCSR_FTZ;
  __builtin_ia32_ldmxcsr (mxcsr);
}</code></pre>
The above code sets two flags normally disabled in <code>mxcsr</code> <code>SSE</code> flags
register:
<ul>
<li><code>DAZ</code>: Denormalized-Are-Zero - denormalized inputs are treated as zeros.</li>
<li><code>FTZ</code>: Flush-To-Zero - denormalized outputs are converted to zero.</li>
</ul>
This effect was seen as too problematic by many and <code>gcc-13</code> stopped
enabling <code>crtfastmath.o</code> for libraries (<code>-shared</code> option) in
<a href="https://gcc.gnu.org/PR55522" class="uri">https://gcc.gnu.org/PR55522</a>.
<code>clang-19</code> will follow <code>gcc</code> example in
<a href="https://github.com/thexujie/llvm-project/commit/971cc613e994a308f939f68247257b65e04c74fa" class="uri">https://github.com/thexujie/llvm-project/commit/971cc613e994a308f939f68247257b65e04c74fa</a>.
On top of that <code>clang-19</code> will have a <code>-mdaz-ftz</code> / <code>-mno-daz-ftz</code> flag
to override the default.
While the fix limits the impact to final programs that enable <code>-Ofast</code>
it is still able to break all the libraries linked into the program
linked with <code>-Ofast</code>.
As the library test suites are usually not ran with <code>-Ofast</code> the
breakage might not be noticed until data corruption starts happening on
real data in final applications.
<h2 id="actual-breakage-in-an-actual-program">Actual breakage in an actual program</h2>
I did not know about this <code>-ffast-math</code> problem until <code>nixpkgs</code> updated
<code>libsodium</code> library to <code>1.0.19</code> which started defaulting to
<code>-Ofast</code> in <a href="https://github.com/jedisct1/libsodium/commit/ad4584d45590654b9d863ced90d2b2561d5cfbda">“Try using -Ofast / -O3 by default” commit</a>.
After a <code>libsodium</code> upgrade test suites for various programs started
failing with obscure errors of data corruption around <code>double</code> conversion
to string and back. In some of the cases <code>libsodium</code> was not even used
directly and was an indirect dependency via <code>libzmq</code>:

It was very confusing to see <code>bitcoind</code> to fail <code>double</code> serialization /
deserialization tests after <code>libsodium</code> update (which API was not used
in the <code>program</code> at all).
I usually try to avoid crypto-currency software. But in this case
compiler was implied to generate bad code and I had to look closer.
<h2 id="other-compilers">Other compilers</h2>
<code>clang</code> also considers disabling <code>crtfastmath.o</code> for shared libraries
to follow <code>gcc</code> lead: <a href="https://github.com/llvm/llvm-project/issues/57589" class="uri">https://github.com/llvm/llvm-project/issues/57589</a>.
According to <a href="https://en.wikipedia.org/wiki/Subnormal_number#Intel_SSE">Wikipedia</a>
<code>ICC</code> enables <code>-ffast-math</code> equivalent on the optimizations above <code>-O0</code>.
That sounds quite unsafe.
<h2 id="why-do-we-even-have-that-lever">Why do we even have that lever?</h2>
But why do these transformations exist at all? Why treat small <code>double</code>s
differently compared to larger values? We don’t do it with small
integers after all (I hope!).
Apparently at least older CPUs (and maybe modern ones as well?) were
slower to handle denormalized values. Sometimes 100x slower. Thus
cutting a corner here might have a visible win in applications that
don’t care about precision or predictability of the result.
<h2 id="binary-representation-of-floating-points">Binary representation of floating points</h2>
How small those values need to be to get flushed to zero on occasion?
Wikipedia’s <a href="https://en.wikipedia.org/wiki/Double-precision_floating-point_format">“Double-precision floating-point format” article</a>
is a great detailed explanation of the <code>float64</code> format.
Tl;DR: all the numbers of 64-bits have the following bit layout:

Depending on the exponent <code>exp</code> value there are two formulas and
three(ish) cases:
<ul>
<li>Normalized values (for <code>exp</code> in <code>1</code> - <code>2046</code> range): <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msup><mrow><mo stretchy="true" form="prefix">(</mo><mi>−</mi><mn>1</mn><mo stretchy="true" form="postfix">)</mo></mrow><mrow><mi>s</mi><mi>i</mi><mi>g</mi><mi>n</mi></mrow></msup><mo>×</mo><msup><mn>2</mn><mrow><mi>e</mi><mi>x</mi><mi>p</mi><mo>−</mo><mn>1023</mn></mrow></msup><mo>×</mo><mn>1</mn><mi>.</mi><mrow><mi>f</mi><mi>r</mi><mi>a</mi><mi>c</mi></mrow></mrow><annotation encoding="application/x-tex">{(-1)}^{sign}\times2^{exp-1023}\times1.{frac}</annotation></semantics></math></li>
<li>Denormalized values: (for <code>exp == 0</code>): <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msup><mrow><mo stretchy="true" form="prefix">(</mo><mi>−</mi><mn>1</mn><mo stretchy="true" form="postfix">)</mo></mrow><mrow><mi>s</mi><mi>i</mi><mi>g</mi><mi>n</mi></mrow></msup><mo>×</mo><msup><mn>2</mn><mrow><mi>e</mi><mi>x</mi><mi>p</mi><mo>−</mo><mn>1022</mn></mrow></msup><mo>×</mo><mn>0</mn><mi>.</mi><mrow><mi>f</mi><mi>r</mi><mi>a</mi><mi>c</mi></mrow></mrow><annotation encoding="application/x-tex">{(-1)}^{sign}\times2^{exp-1022}\times0.{frac}</annotation></semantics></math></li>
<li><code>exp == 2047</code>: <code>NaN</code>s and infinities of sorts</li>
</ul>
<p>The main detail here is implied <code>1.frac</code> vs <code>0.frac</code> in the first two
cases. Otherwise the structure is the same. A few examples of value
encodings:</p>
<table>
<colgroup>
<col style="width: 25%" />
<col style="width: 25%" />
<col style="width: 25%" />
<col style="width: 25%" />
</colgroup>
<thead>
<tr class="header">
<th>value</th>
<th>encoded</th>
<th>sign/exp/frac</th>
<th>notes</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><code>0x1.5p0</code></td>
<td><code>0x3ff5000000000000</code></td>
<td><code>0/0x3ff/0x0005000000000000</code></td>
<td>A normalized value</td>
</tr>
<tr class="even">
<td><code>0x1.0p-1022</code></td>
<td><code>0x0010000000000000</code></td>
<td><code>0/1/0</code></td>
<td>Smallest normalized value</td>
</tr>
<tr class="odd">
<td><code>0x0.fffffffffffffp-1022</code></td>
<td><code>0x000fffffffffffff</code></td>
<td><code>0/0/0x000fffffffffffff</code></td>
<td>Largest denormalized value</td>
</tr>
<tr class="even">
<td><code>0x1.5p-1040</code></td>
<td><code>0x0000054000000000</code></td>
<td><code>0/0/0x0054000000000</code></td>
<td>A denormalized value</td>
</tr>
<tr class="odd">
<td><code>0x1.0p-1074</code></td>
<td><code>0x0000000000000001</code></td>
<td><code>0/0/1</code></td>
<td>Smallest denormalized value</td>
</tr>
<tr class="even">
<td><code>0.0</code></td>
<td><code>0x0000000000000000</code></td>
<td><code>0/0/0</code></td>
<td>Positive zero</td>
</tr>
<tr class="odd">
<td><code>-0.0</code></td>
<td><code>0x8000000000000000</code></td>
<td><code>1/0/0</code></td>
<td>Negative zero</td>
</tr>
<tr class="even">
<td><code>Inf</code></td>
<td><code>0x7ff0000000000000</code></td>
<td><code>0/0x7ff/0</code></td>
<td>Positive infinity</td>
</tr>
<tr class="odd">
<td><code>-Inf</code></td>
<td><code>0xfff0000000000000</code></td>
<td><code>1/0x7ff/0</code></td>
<td>Negative infinity</td>
</tr>
<tr class="even">
<td><code>NaN</code></td>
<td><code>0xfff8000000000000</code></td>
<td><code>1/0x7ff/0x0008000000000000</code></td>
<td>NaN for <code>0/0</code> value</td>
</tr>
</tbody>
</table>
<p>There are various curious facts about this encoding:</p>
<p>Normalized and denormalized value sets do not intersect: all
denormalized absolute values are smaller than smallest normalized value.</p>
<p>Having a short glance at first 3 hexadecimal digits is enough to get the
idea which <code>FP</code> class we are looking at:</p>
<ul>
<li><code>0x000 / 0x800</code>: Denormalized numbers or zero</li>
<li><code>0x7ff / 0xfff</code>: <code>NaN</code> or infinities</li>
</ul>
<p>Normalized values do not allow for <code>0.0</code> encoding: <code>frac</code> always has
an implied leading <code>1.</code> start for fraction. Thus zeros have to be
encoded using denormalized scheme.</p>
<p>(Positive) zero is encoded as all zero bits. Thus <code>memset()</code> on array of
floats creates sensible array of zeros.</p>
<p>There are two zeros: signed and unsigned.</p>
<p>While zeros require subnormal encoding of an exponent they are
considered a separate <code>FP_ZERO</code> class from <code>FP_SUBNORMAL</code></p>
<p>Normalized values use all their 52 bits of fraction for precision.
Denormalized values usually use less as they maintain a few leading
zeros to encode smaller values.</p>
<p>Looking at the binary representation one can imagine how CPU would
implement at least trivial operations (like addition and subtraction)
on floating points with the same exponent value and then slightly extend
it to operation on numbers with different exponents.</p>
<h2 id="the-effect-of-mxcsr_daz-and-mxcsr_ftz">The effect of <code>MXCSR_DAZ</code> and <code>MXCSR_FTZ</code></h2>
<p>By now it should be clear that <code>MXCSR_DAZ</code> and <code>MXCSR_FTZ</code> flags
effectively turn any denormalized value into zero for any <code>FP</code> operation.</p>
<p>The minor problem happens if denormalized value is already constructed
in memory and is used by something else. That is why we get seemingly
nonsensical result above when <code>FP_ZERO</code> has a non-zero encoding:</p>
<pre><code>also_small = 0x0.00004p-1022 or 8.487983e-314 (FP_ZERO)</code></pre>
<p>Is it hard to hit a denormalized value? It depends! If you operate on
small values like <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><semantics><msup><mn>10</mn><mrow><mi>−</mi><mn>20</mn></mrow></msup><annotation encoding="application/x-tex">10^{-20}</annotation></semantics></math> (micros) and use to raise it to higher
powers, like 16, then you get outside the normal range:
<math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msup><mrow><mo stretchy="true" form="prefix">(</mo><msup><mn>10</mn><mrow><mi>−</mi><mn>20</mn></mrow></msup><mo stretchy="true" form="postfix">)</mo></mrow><mn>16</mn></msup><mo>=</mo><mn>0</mn><mi>x</mi><mn>0.00000000007</mn><mi>e</mi><mn>8</mn><msup><mi>p</mi><mrow><mi>−</mi><mn>1022</mn></mrow></msup></mrow><annotation encoding="application/x-tex">{(10^{-20})}^{16} = 0x0.00000000007e8p^{-1022}</annotation></semantics></math></p>
<p>These are very small values.</p>
<p>With such flushing enabled it is a lot easier to hit a <code>NaN</code> by dividing
(flushed) zero by (flushed) zero or by subtracting infinities a bit after.</p>
<h2 id="parting-words">Parting words</h2>
<p><code>-Ofast</code> is not a safe option to use without a second thought. In
addition to breaking your immediate floating point arithmetic code
<code>-ffast-math</code> / <code>-Ofast</code> also breaks the code not compiled with these
options. This non-local effect is most problematic.</p>
<p><code>libsodium</code> tried it and broke a few reverse dependencies that relied on
denormalized values to work as expected.</p>
<p>For the time being <code>libsodium</code> <a href="https://github.com/jedisct1/libsodium/commit/0e0e2c16401e63777dce8c7958a3ca789055dfcf">rolled back</a>
<code>-Ofast</code> default. That should stop <code>FP</code> code breakage for users of older
<code>gcc</code>.</p>
<p>There probably is a lot more packages enabling <code>-Ofast</code> without
realizing what effects it causes on <code>FP</code> code correctness somewhere
else.</p>
<p>To notice the problem the code needs to exercise denormalized values
which might require very small actual values as operands.</p>
<p>What is worse: the truncation problems might come and go depending on
what compiler decides to do with the intermediate values: perform
an operation on <code>FP</code> unit and observe the truncation, pass it in
memory and process using bitwise arithmetic and not observe the
truncation or re-associate the operations and expose denormalized values.
This instability effect is very similar to <code>i387</code> <code>FPU</code> instability
on <code>i386</code> documented at <a href="https://gcc.gnu.org/wiki/x87note" class="uri">https://gcc.gnu.org/wiki/x87note</a>.</p>
<p>Floating point encoding is straightforward, but is full of corner cases:
normalized, denormalized, zeros, infinities and <code>NaN</code>s. Handling all the
cases requires extra work from the programmer and the CPU.</p>
<p><code>MXCSR_DAZ</code> and <code>MXCSR_FTZ</code> status bits allow CPU to treat most
denormalized values as zeros at a cost of <code>C</code> and <code>IEEE</code> standard
conformance.</p>
<p>Have fun!</p>
</article>
<article>
<h1>Another gcc profiling bug</h1>
<p>2023-10-07T00:00:00Z</p>
<h2 id="the-python-pgo-bug">The python PGO bug</h2>
<p>About <a href="https://trofi.github.io/posts/243-gcc-profiler-internals.html">a year ago</a> I had some
fun debugging <code>gcc</code> crash on <code>python</code> code base built in <code>PGO</code> mode
(optimized based on profile-feedback from test run).</p>
<p>Scrolling through recent <code>gcc</code> bugs I noticed <a href="https://gcc.gnu.org/PR111559">PR111559</a>
<code>"[14 regression] ICE when building Python with PGO"</code> bug reported by
Sam James. It looked vaguely similar to the previous instance so I had
a look.</p>
<p>There <code>python</code> build of <code>-fprofile-use</code> stage was crashing <code>gcc</code> as:</p>
<pre><code>$ gcc -c ... -fprofile-use -fprofile-correction ... -o Parser/parser.o Parser/parser.c
Parser/parser.c: In function 'simple_stmt_rule':
Parser/parser.c:1706:1: error: probability of edge 613->614 not initialized
 1706 | simple_stmt_rule(Parser *p)
      | ^~~~~~~~~~~~~~~~
Parser/parser.c:1706:1: error: probability of edge 615->621 not initialized
during IPA pass: inline
Parser/parser.c:1706:1: internal compiler error: verify_flow_info failed
0x55c9cced2153 verify_flow_info()</code></pre>
<p>This error tells us what exactly is wrong in the control flow graph when
crashes <code>gcc</code> (as opposed to vague <code>SIGSEGV</code>).</p>
<p>Normally <code>gcc</code> is very forgiving to input garbage profile data you pass
to it. Worst case you should get badly optimized binary with correct
behaviour. But in this case <code>gcc</code> complains about probabilities <code>gcc</code>
calculated itself. I did not see this error type before.</p>
<p>I wanted to have a closer look.</p>
<h2 id="reproducing">Reproducing</h2>
<p>First thing I tried to reproduce <code>gcc</code> ICE following Sam’s instructions.
I had to do two minor tweaks:</p>
<ol type="1">
<li>set <code>--enable-checking=yes</code> for <code>gcc_debug</code></li>
<li>change <code>python3</code> <code>nixpkgs</code> package to use <code>gcc_debug</code></li>
</ol>
<p>Out of laziness I patched <code>--enable-checking=yes</code> into local checkout
and changed <code>python3</code> dependency in development shell invocation. Both
tweaks are below:</p>
<p><code>gcc_debug</code> patch:</p>
<pre class="diff"><code>--- a/pkgs/top-level/all-packages.nix
+++ b/pkgs/top-level/all-packages.nix
@@ -15893,5 +15893,6 @@ with pkgs;
-  gcc_debug = lowPrio (wrapCC (gcc.cc.overrideAttrs {
+  gcc_debug = lowPrio (wrapCC (gcc.cc.overrideAttrs (oa: {
     dontStrip = true;
-  }));
+    configureFlags = oa.configureFlags ++ [ "--enable-checking=yes" ];
+  })));</code></pre>
<p>Running the shell:</p>
<pre><code>$ nix develop --impure --expr 'with import ./. {};
python3.overrideAttrs (oa: {
  nativeBuildInputs = [ gcc_debug ] ++ oa.nativeBuildInputs;
})'</code></pre>
<p>That gave me the interactive development shell with all the tools in the
<code>PATH</code>. Double-checking if the visible compiler looks like the patched
one:</p>
<pre><code>$$ LANG=C gcc -v
...
Configured with: ../source/configure ... --enable-checking=yes ...
...
gcc version 14.0.0 99999999 (experimental) (GCC)</code></pre>
<p>Looks good. Moving on to to run reproducer as is:</p>
<pre><code>$$ wget https://www.python.org/ftp/python/3.11.5/Python-3.11.5.tar.xz
$$ tar xf Python-3.11.5.tar.xz
$$ cd Python-3.11.5/
$$ ./configure --enable-optimizations
$$ make -j$(nproc)</code></pre>
<p>After a minute or so <code>make -j$(nproc)</code> command failed as:</p>
<pre><code>$$ make -j$(nproc)
...
gcc -c -Wsign-compare -DNDEBUG -g -fwrapv -O3 ...
...
Parser/parser.c: In function 'simple_stmt_rule':
Parser/parser.c:1620:1: error: probability of edge 159->160 not initialized
 1620 | simple_stmt_rule(Parser *p)
      | ^~~~~~~~~~~~~~~~
Parser/parser.c:1620:1: error: probability of edge 161->162 not initialized
Parser/parser.c:1620:1: error: probability of edge 162->171 not initialized
Parser/parser.c:1620:1: error: probability of edge 166->162 not initialized
Parser/parser.c:1620:1: error: probability of edge 169->170 not initialized
Parser/parser.c:1620:1: error: probability of edge 614->615 not initialized
Parser/parser.c:1620:1: error: probability of edge 616->622 not initialized
during IPA pass: inline
Parser/parser.c:1620:1: internal compiler error: verify_flow_info failed
0xacee3e verify_flow_info()
        ../../source/gcc/cfghooks.cc:287
0x104f73c checking_verify_flow_info()
        ../../source/gcc/cfghooks.h:214
0x104f73c cleanup_tree_cfg_noloop
        ../../source/gcc/tree-cfgcleanup.cc:1154
0x104f73c cleanup_tree_cfg(unsigned int)
        ../../source/gcc/tree-cfgcleanup.cc:1205
0xed541c execute_function_todo
        ../../source/gcc/passes.cc:2057
0xed58ce execute_todo
        ../../source/gcc/passes.cc:2142
0xed841f execute_one_ipa_transform_pass
        ../../source/gcc/passes.cc:2336
0xed841f execute_all_ipa_transforms(bool)
        ../../source/gcc/passes.cc:2396
0xb0b09d cgraph_node::expand()
        ../../source/gcc/cgraphunit.cc:1834
0xb0b09d cgraph_node::expand()
        ../../source/gcc/cgraphunit.cc:1794
0xb0bfc1 expand_all_functions
        ../../source/gcc/cgraphunit.cc:2000
0xb0bfc1 symbol_table::compile()
        ../../source/gcc/cgraphunit.cc:2398
0xb0f527 symbol_table::compile()
        ../../source/gcc/cgraphunit.cc:2311
0xb0f527 symbol_table::finalize_compilation_unit()
        ../../source/gcc/cgraphunit.cc:2583
Please submit a full bug report, with preprocessed source (by using -freport-bug).
Please include the complete backtrace with any bug report.
See <https://gcc.gnu.org/bugs/> for instructions.</code></pre>
<p>Yay! Luckily <code>gcc_debug</code> crashed for me without any extra convincing.
And <code>python</code> build system helpfully printed exact command to rerun.</p>
<h2 id="reducing-the-input">Reducing the input</h2>
<p>Once I got the reproducer I attempted to minimize it with
<a href="https://github.com/marxin/cvise">cvise</a> against preprocessed
<code>Parser/parser.c</code> file and it’s <code>parser.gcda</code> file (<code>-fprofile-use</code> flag
looks it up and loads profiling data).</p>
<p>Creating the preprocessed file to simplify <code>cvise</code> command run:</p>
<pre><code>$$ gcc -E -P ... -o Parser/parser.c.c Parser/parser.c
$$ mv Parser/parser.c.c Parser/parser.c</code></pre>
<p>Making sure the failure did not go away (this time with only flags
needed to crash <code>gcc_debug</code>):</p>
<pre><code>$$ gcc -c -O2 -fprofile-use -fprofile-correction -o Parser/parser.o Parser/parser.c
...
Parser/parser.c: In function 'simple_stmt_rule':
Parser/parser.c:10501:1: error: probability of edge 540->541 not initialized
10501 | simple_stmt_rule(Parser *p)
      | ^~~~~~~~~~~~~~~~
Parser/parser.c:10501:1: error: probability of edge 542->548 not initialized
during IPA pass: inline
Parser/parser.c:10501:1: internal compiler error: verify_flow_info failed
0xacee3e verify_flow_info()
        ../../source/gcc/cfghooks.cc:287
...</code></pre>
<p>The crash was still there. Now running <code>cvise</code>:</p>
<pre><code>$$ cd Parser/
$$ cvise --command="mkdir Parser && cp parser.c $PWD/parser.gcda Parser/;
gcc -c -O2 -fprofile-use -fprofile-correction -o Parser/parser.o Parser/parser.c |&
  grep 'verify_flow_info'" parser.c
...
Runtime: 387 seconds
Reduced test-cases:
...</code></pre>
<p>Note: in the <code>--command=...</code> script I had to maintain <code>Parser/</code>
directory nesting in the interestingness test as <code>.gcda</code> files contain
directory part of the file path.</p>
<p>This is the raw file produced by <code>cvise</code>:</p>
<pre class="c"><code>typedef struct {
  void **elements
} asdl_seq;
typedef struct {
  int mark;
  int arena;
  int error_indicator;
  int level
} Parser;
static *_loop1_104_rule(Parser *p) {
  if (p->level++ == 6000) {
    p->error_indicator = 1;
    PyErr_NoMemory();
  }
  if (p->error_indicator) {
    p->level--;
    return 0;
  }
  int _mark = p->mark;
  void **_children = PyMem_Malloc(sizeof(void *));
  if (!_children) {
    p->error_indicator = 1;
    PyErr_NoMemory();
    p->level--;
    return 0;
  }
  long _children_capacity = 1, _n = 0;
  if (p->error_indicator) {
    p->level--;
    return 0;
  }
  int *lambda_param_with_default_var;
  while (lambda_param_with_default_var = lambda_param_with_default_rule(p)) {
    if (_n == _children_capacity) {
      _children_capacity *= 2;
      void *_new_children =
          PyMem_Realloc(_children, _children_capacity * sizeof(void *));
      if (!_new_children) {
        PyMem_Free(_children);
        p->error_indicator = 1;
        PyErr_NoMemory();
        p->level--;
        return 0;
      }
      _children = _new_children;
    }
    _children[_n++] = lambda_param_with_default_var;
    _mark = p->mark;
  }
  p->mark = _mark;
  if (_n == 0 || p->error_indicator) {
    PyMem_Free(_children);
    p->level--;
    return 0;
  }
  asdl_seq *_seq = _Py_asdl_generic_seq_new(_n, p->arena);
  if (!_seq) {
    PyMem_Free(_children);
    p->error_indicator = 1;
    PyErr_NoMemory();
    p->level--;
    return 0;
  }
  for (int i = 0; i < _n; i++)
    _seq->elements[i] = _children[i];
  PyMem_Free(_children);
  p->level--;
  return _seq;
}
static *_loop1_106_rule(Parser *p) {
  if (p->level++ == 6000) {
    p->error_indicator = 1;
    PyErr_NoMemory();
  }
  if (p->error_indicator) {
    p->level--;
    return 0;
  }
  int _mark = p->mark;
  void **_children = PyMem_Malloc(sizeof(void *));
  if (!_children) {
    p->error_indicator = 1;
    PyErr_NoMemory();
    p->level--;
    return 0;
  }
  long _children_capacity = 1, _n = 0;
  if (p->error_indicator) {
    p->level--;
    return 0;
  }
  int *lambda_param_with_default_var;
  while (lambda_param_with_default_var = lambda_param_with_default_rule(p)) {
    if (_n == _children_capacity) {
      _children_capacity *= 2;
      void *_new_children =
          PyMem_Realloc(_children, _children_capacity * sizeof(void *));
      if (!_new_children) {
        PyMem_Free(_children);
        p->error_indicator = 1;
        PyErr_NoMemory();
        p->level--;
        return 0;
      }
      _children = _new_children;
    }
    _children[_n++] = lambda_param_with_default_var;
    _mark = p->mark;
  }
  p->mark = _mark;
  if (_n == 0 || p->error_indicator) {
    PyMem_Free(_children);
    p->level--;
    return 0;
  }
  asdl_seq *_seq = _Py_asdl_generic_seq_new(_n, p->arena);
  if (!_seq) {
    PyMem_Free(_children);
    p->error_indicator = 1;
    PyErr_NoMemory();
    p->level--;
    return 0;
  }
  for (int i = 0; i < _n; i++)
    _seq->elements[i] = _children[i];
  PyMem_Free(_children);
  p->level--;
  return _seq;
}
func_type_rule() { _loop1_104_rule(_loop1_106_rule); }</code></pre>
<p>It’s a big and messy file! But do not be afraid!</p>
<p>I checked first what <code>gcc_debug</code> tries to do with it when it optimizes
it to see if I could apply more optimizations manually. <code>-fopt-info</code>
flag to the rescue:</p>
<pre><code>$$ LANG=C gcc -c -O2 -fprofile-use -fprofile-correction -o Parser/parser.o Parser/parser.c -fopt-info
...
Parser/parser.c:10:9: error: source locations for function '_loop1_104_rule' have changed, the profile data may be out of date [-Werror=coverage-mismatch]
...
Parser/parser.c:10:9: optimized: Semantic equality hit:_loop1_104_rule/0->_loop1_106_rule/1
Parser/parser.c:10:9: optimized: Assembler symbol names:_loop1_104_rule/0->_loop1_106_rule/1
Parser/parser.c:10:9: error: probability of edge 3->4 not initialized
Parser/parser.c:10:9: error: probability of edge 5->6 not initialized
Parser/parser.c:10:9: error: probability of edge 8->6 not initialized
Parser/parser.c:10:9: error: probability of edge 10->6 not initialized
Parser/parser.c:10:9: error: probability of edge 13->6 not initialized
Parser/parser.c:10:9: error: probability of edge 20->6 not initialized
during GIMPLE pass: fixup_cfg
Parser/parser.c:10:9: internal compiler error: verify_flow_info failed
...</code></pre>
<p>There is literally one optimization: <code>_loop1_104_rule()</code> and
<code>_loop1_106_rule()</code> have identical implementation and are folded into a
single function.</p>
<p>I supplied <code>main()</code> function, stubbed out missing functions and managed
to get a source-only reproducer without the need for <code>*.gcda</code> files!</p>
<p>This allowed me running <code>cvise</code> on a <code>.c</code> file alone. Reducing it
further I got this beauty:</p>
<pre class="c"><code>// $ cat bug.c
__attribute__((noipa)) static void edge(void) {}

int p = 0;

__attribute__((noinline))
static void rule1(void) { if (p) edge(); }

__attribute__((noinline))
static void rule1_same(void) { if (p) edge(); }

__attribute__((noipa)) int main(void) {
    rule1();
    rule1_same();
}</code></pre>
<p>The above example still crashed as:</p>
<pre><code>$ gcc -O2 -fprofile-generate                 bug.c -o b && ./b
$ gcc -O2 -fprofile-use -fprofile-correction bug.c -o b

bug.c: In function 'rule1':
bug.c:6:13: error: probability of edge 3->4 not initialized
    6 | static void rule1(void) { if (p) edge(); }
      |             ^~~~~
during GIMPLE pass: fixup_cfg
bug.c:6:13: internal compiler error: verify_flow_info failed</code></pre>
<p>It’s a nice outcome of the reduction. I pasted it as the update to the
bug hoping that somebody fixes it.</p>
<h2 id="looking-at-the-failure-mode">Looking at the failure mode</h2>
<p>The reduced case looks like some kind of a trivial bug. Is it the only
thing that plagues <code>python</code> <code>PGO</code> build? I tried to get the idea if I
can somehow work around the failure and see if <code>gcc_debug</code> crashes
somewhere else as well.</p>
<p>Even before looking at the <code>gcc</code> code I knew quite a bit about the
failure: the identical code folding fails on a function most of which
bodies is not executed: <code>if (p)</code> is always <code>false</code>.</p>
<p>Before doing <code>gcc</code> bisection I had a look at recent <code>gcc</code> commits.
<a href="https://gcc.gnu.org/git/?p=gcc.git;a=commitdiff;h=0c78240fd7d519">commit “Check that passes do not forget to define profile”</a>
added control flow graph verification against uninitialized branch
probabilities.</p>
<p>It was clearly the change that exposed problematic transformation. But
it did not change existing transformations. Thus chances are it’s not a
new problem: it only happens to be visible now. We need to find a place
where uninitialized probability gets emitted by <code>gcc</code>.</p>
<p>I added a few debugging statements into <code>gcc</code> and found that probability
corruption happens in <code>ipa-icf</code> pass (<code>identical code folding</code> pass). As
<code>rule1()</code> and <code>rule1_same()</code> have identical implementation then in
theory probabilities of both functions should sum up together (whatever
“sum” means for a complex call graph of a function).</p>
<p>To look at the specific probability values right before the corruption I
added the following debugging patch to <code>gcc</code>:</p>
<pre class="diff"><code>--- a/gcc/ipa-utils.cc
+++ b/gcc/ipa-utils.cc
@@ -642,14 +642,17 @@ ipa_merge_profiles (struct cgraph_node *dst,
          else
            {
              for (i = 0; i < EDGE_COUNT (srcbb->succs); i++)
                {
+                 profile_count den = dstbb->count.ipa () + srccount.ipa ();
+                 gcc_assert(den.nonzero_p());
+
                  edge srce = EDGE_SUCC (srcbb, i);
                  edge dste = EDGE_SUCC (dstbb, i);
                  dste->probability =
                    dste->probability * dstbb->count.ipa ().probability_in
                                                 (dstbb->count.ipa ()
                                                  + srccount.ipa ())
                    + srce->probability * srcbb->count.ipa ().probability_in
                                                 (dstbb->count.ipa ()
                                                  + srccount.ipa ());
                }</code></pre>
<p>Here I extracted <code>dstbb->count.ipa () + srccount.ipa ()</code> denominator to
a separate <code>den</code> variable and added assert that it should not be zero
(as <code>probability_in()</code> turns those into undefined values).</p>
<p>Making sure we get assertion trigger:</p>
<pre><code>$ gcc/xgcc -Bgcc -O2 -fprofile-generate bug.c -o b && ./b
$ gcc/xgcc -Bgcc -O2 -fprofile-use bug.c -o b
during IPA pass: icf
bug.c:14:1: internal compiler error: in ipa_merge_profiles, at ipa-utils.cc:653</code></pre>
<p>In <code>gdb</code> session I poked a bit at the actual values:</p>
<pre><code>$ gdb --args gcc/cc1 -quiet -v -iprefix /tmp/gb/gcc/../lib/gcc/x86_64-pc-linux-gnu/14.0.0/ -isystem gcc/include -isystem gcc/include-fixed bug.c -quiet -dumpdir b- -dumpbase bug.c -dumpbase-ext .c -mtune=generic -march=x86-64 -O2 -version -fprofile-use -o /run/user/1000/ccnlNQ8W.s

(gdb) start
(gdb) break internal_error
(gdb) continue
Breakpoint 2, internal_error (gmsgid=gmsgid@entry=0x285290d "in %s, at %s:%d") at /home/slyfox/dev/git/gcc/gcc/diagnostic.cc:2151

(gdb) bt
#0  internal_error (gmsgid=gmsgid@entry=0x285290d "in %s, at %s:%d") at /home/slyfox/dev/git/gcc/gcc/diagnostic.cc:2151
#1  0x000000000093902c in fancy_abort (file=file@entry=0x22afa38 "/home/slyfox/dev/git/gcc/gcc/ipa-utils.cc", line=line@entry=653, function=function@entry=0x22af9c7 "ipa_merge_profiles")
    at /home/slyfox/dev/git/gcc/gcc/diagnostic.cc:2268
#2  0x00000000007b6124 in ipa_merge_profiles (dst=dst@entry=0x7fffea01a330, src=src@entry=0x7fffea01a440, preserve_body=preserve_body@entry=false) at /home/slyfox/dev/git/gcc/gcc/ipa-utils.cc:653
#3  0x0000000001db302c in ipa_icf::sem_function::merge (this=0x2e407a0, alias_item=0x2e41060) at /home/slyfox/dev/git/gcc/gcc/ipa-icf.cc:1276

(gdb) fr 2
#2  0x00000000007b6124 in ipa_merge_profiles (dst=dst@entry=0x7fffea01a330, src=src@entry=0x7fffea01a440, preserve_body=preserve_body@entry=false)
    at /home/slyfox/dev/git/gcc/gcc/ipa-utils.cc:653
653                       gcc_assert(den.nonzero_p());

(gdb) call dstbb->count.debug()
0 (precise)
(gdb) call srccount.ipa ().debug()
0 (precise)</code></pre>
<p>Here is the initial probability value we are about to overwrite:</p>
<pre><code>(gdb) call dste->probability.debug()
always</code></pre>
<p>I proposed the conservative fix by ignoring such updates that change
probability from “initialized” to “uninitialized” as an
<a href="https://gcc.gnu.org/git/?p=gcc.git;a=commitdiff;h=043a6fcbc27f8721301eb2f72a7839f54f393003">“ipa-utils: avoid uninitialized probabilities on ICF [PR111559]” commit</a>:</p>
<pre class="diff"><code>--- a/gcc/ipa-utils.cc
+++ b/gcc/ipa-utils.cc
@@ -651,13 +651,14 @@ ipa_merge_profiles (struct cgraph_node *dst,
 		{
 		  edge srce = EDGE_SUCC (srcbb, i);
 		  edge dste = EDGE_SUCC (dstbb, i);
-		  dste->probability = 
-		    dste->probability * dstbb->count.ipa ().probability_in
-						 (dstbb->count.ipa ()
-						  + srccount.ipa ())
-		    + srce->probability * srcbb->count.ipa ().probability_in
-						 (dstbb->count.ipa ()
-						  + srccount.ipa ());
+		  profile_count sum =
+		    dstbb->count.ipa () + srccount.ipa ();
+		  if (sum.nonzero_p ())
+		    dste->probability =
+		      dste->probability * dstbb->count.ipa ().probability_in
+						   (sum)
+		      + srce->probability * srcbb->count.ipa ().probability_in
+						   (sum);
 		}
 	      dstbb->count = dstbb->count.ipa () + srccount.ipa ();
 	    }</code></pre>
<p>It might not be the best fix as we discard the fact that branch was
never executed during the profile run. But at least we don’t compromise
correctness.</p>
<p>This fixed the reduced example and the actual <code>python</code> <code>PGO</code> build for
me. Yay! That was easier than I expected.</p>
<h2 id="a-minor-comment">A minor comment</h2>
<p>All done?</p>
<p>All looked very well. The patch was not reviewed yet and <code>master</code> branch
was still exposed to this kind of failure. Franz Sirl
<a href="https://gcc.gnu.org/PR111559#c3">reported</a> that the same problem is
likely happening on <code>profiledbootstrap</code> build:</p>
<pre><code>../../gcc/c-family/c-attribs.cc:1369:1: error: probability of edge 3->4 not initialized
 1369 | handle_noclone_attribute (tree *node, tree name,
      | ^~~~~~~~~~~~~~~~~~~~~~~~
during IPA pass: inline
../../gcc/c-family/c-attribs.cc:1369:1: internal compiler error: verify_flow_info failed
0xa92b3e verify_flow_info()
        ../../gcc/cfghooks.cc:287
0xfde04c checking_verify_flow_info()
        ../../gcc/cfghooks.h:214
0xfde04c cleanup_tree_cfg_noloop
        ../../gcc/tree-cfgcleanup.cc:1154
0xfde04c cleanup_tree_cfg(unsigned int)
        ../../gcc/tree-cfgcleanup.cc:1205
0xe7b25c execute_function_todo
        ../../gcc/passes.cc:2057
0xe7b70e execute_todo
        ../../gcc/passes.cc:2142
0xe7e16f execute_one_ipa_transform_pass
        ../../gcc/passes.cc:2336
0xe7e16f execute_all_ipa_transforms(bool)
        ../../gcc/passes.cc:2396
0xacde5d cgraph_node::expand()
        ../../gcc/cgraphunit.cc:1834
0xacde5d cgraph_node::expand()
        ../../gcc/cgraphunit.cc:1794
0xacecec expand_all_functions
        ../../gcc/cgraphunit.cc:2000
0xacecec symbol_table::compile()
        ../../gcc/cgraphunit.cc:2398
0xad2197 symbol_table::compile()
        ../../gcc/cgraphunit.cc:2311
0xad2197 symbol_table::finalize_compilation_unit()
        ../../gcc/cgraphunit.cc:2583
Please submit a full bug report, with preprocessed source (by using -freport-bug).</code></pre>
<p>There <code>gcc</code>’s own build is using profile feedback information. That made
sense: there is a big chance <code>STL</code> (or other <code>gcc</code> internals) produces
identical functions worth folding. And looking at the crash log in
Franz’s case <code>handle_noclone_attribute()</code> was folded with something else.</p>
<p>Looking at the code around I found this bunch of helpers:</p>
<pre class="c"><code>static tree
handle_noclone_attribute (tree *node, tree name,
                          tree ARG_UNUSED (args),
                          int ARG_UNUSED (flags), bool *no_add_attrs)
{
  if (TREE_CODE (*node) != FUNCTION_DECL)
    {
      warning (OPT_Wattributes, "%qE attribute ignored", name);
      *no_add_attrs = true;
    }

  return NULL_TREE;
}

static tree
handle_noicf_attribute (tree *node, tree name,
                        tree ARG_UNUSED (args),
                        int ARG_UNUSED (flags), bool *no_add_attrs)
{
  if (TREE_CODE (*node) != FUNCTION_DECL)
    {
      warning (OPT_Wattributes, "%qE attribute ignored", name);
      *no_add_attrs = true;
    }

  return NULL_TREE;
}</code></pre>
<p>Both functions have identical implementations up to white space and
local variable names.</p>
<p>Looking at the manifestation of the crash I was pretty sure it’s exactly
the same merging issue. But in theory there might be a lot more places
where we introduce undefined probabilities.</p>
<h2 id="building-profiled-gcc">Building profiled gcc</h2>
<p>I decided to build <code>profiledbootstrap</code> on <code>x86_64</code> Just In Case. In
theory it’s very simple. You just run two commands:</p>
<pre><code>$ ~/dev/git/gcc/configure
$ make profiledbootstrap</code></pre>
<p>The only change from a trivial vanilla build is the non-default
<code>profiledbootstrap</code> build target. I ran the above commands as is.</p>
<p>The build was very slow. Some <code>.cc</code> files took 15 minutes to compile.
It’s longer than the whole <code>--disable-bootstrap</code> build on my machines
which takes around 8 minutes. I filed <a href="https://gcc.gnu.org/PR111619">PR11619</a>
for <code>"'make profiledbootstrap' makes 10+ minutes on insn-recog.cc"</code>.</p>
<p>To get the idea why some individual compilations take up to 15 minutes
let’s look at the anatomy of a <code>make profiledbootstrap</code> build:</p>

<p>Or if we put the picture in words:</p>
<ol type="1">
<li><code>stage1-gcc</code> is built using <code>-O0</code> by host <code>gcc</code> (or other compiler).
At this point we get <code>gcc</code> of guaranteed feature set and
configuration.</li>
<li><code>stageprofile-gcc</code> is built using <code>-O2 -fprofile-generate</code> by
unoptimized <code>stage1-gcc</code></li>
<li><code>stagetrain-gcc</code> is built using <code>-O2</code> by <code>stageprofile-gcc</code> to
produce <code>.gcda</code> files and to produce next compiler stage</li>
<li><code>stagefeedback</code> is built using <code>-O2 -fprofile-use</code> by
<code>stagetrain-gcc</code> to produce final profile-optimised compiler.</li>
</ol>
<p>All of <code>[2.]-[3.]-[4.]</code> added are faster than single <code>[1.]</code> as all of
them use <code>-O2</code> option. And <code>[1.]</code> uses <code>CFLAGS=-O0</code> by default.</p>
<p>The speed-up workaround was to build <code>stage1-gcc</code> with optimizations
(<code>-O2</code> instead of default <code>-O0</code>). <code>gcc</code> build system provides
<code>STAGE1_CFLAGS</code> option for that. And while at it we will enable <code>-ggdb3</code>
instead of default <code>-g</code> option:</p>
<pre><code>$ ~/dev/git/gcc/configure
$ make profiledbootstrap STAGE1_CFLAGS='-O2 -ggdb3' BOOT_CFLAGS='-O2 -ggdb3'</code></pre>
<p>That made the build a lot faster for me.</p>
<p>The only problem is that build failed configuring <code>stagetrain-gcc</code>
(equivalent of <code>stage3-gcc</code> for non-profiled builds):</p>
<pre><code>$ make profiledbootstrap
...
checking for uintptr_t... no
configure: error: uint64_t or int64_t not found
make[2]: *** [Makefile:4862: configure-stagetrain-gcc] Error 1
make[2]: Leaving directory '/tmp/gb'
make[1]: *** [Makefile:26749: stagetrain-bubble] Error 2
make[1]: Leaving directory '/tmp/gb'
make: *** [Makefile:26902: profiledbootstrap] Error 2</code></pre>
<p>That’s not good. It’s certainly not the failure Franz saw. I switched
over to exploring that bug instead.</p>
<h2 id="debugging-stagetrain-gcc-crash">Debugging <code>stagetrain-gcc</code> crash</h2>
<p>I looked at <code>gcc/config.log</code> to check why <code>uint64_t</code> had problems:</p>
<pre><code>configure:6937:  /tmp/gb/./prev-gcc/xg++ -B/tmp/gb/./prev-gcc/ ...
internal compiler error: in diagnostic_report_diagnostic, at diagnostic.cc:1486
0x4418233 gcov_do_dump
        gcc/libgcc/libgcov-driver.c:689
0x44198c3 __gcov_dump_one
        gcc/libgcc/libgcov-driver.c:722
...
xg++: internal compiler error: Aborted signal terminated program cc1plus</code></pre>
<p>That’s a compiler crash in machinery related to profile counter dumping.
Built compiler also crashes on simplest input:</p>
<pre><code>$ touch a.c
$ prev-gcc/xg++ -Bprev-gcc -c a.c
...
internal compiler error: in diagnostic_report_diagnostic, at diagnostic.cc:1486</code></pre>
<p>That makes it a bit easier to debug.</p>
<p>To improve debugging of intermediate stages and retain <code>-ggdb3</code> flags in
all <code>gcc</code> build stages I also dropped <code>-gtoggle</code> from <code>stage2-gcc</code>.</p>
<p>Normally <code>-gtoggle</code> is used to compare <code>stage2</code> and <code>stage3</code> in vanilla
bootstrap to make sure added/removed <code>-g</code> options don’t affect generated
executable code. But in our case it build with <code>-g0</code> just the stage we
want to debug.</p>
<pre class="diff"><code>--- a/config/bootstrap-debug.mk
+++ b/config/bootstrap-debug.mk
@@ -11,2 +11,3 @@
-STAGE2_CFLAGS += -gtoggle
+#STAGE2_CFLAGS += -gtoggle
 do-compare = $(SHELL) $(srcdir)/contrib/compare-debug $$f1 $$f2</code></pre>
<p>And restarted <code>gcc</code> build after the change.</p>
<p>Back to our crash: <code>gdb</code> told me that <code>SIGSEGV</code> happened right at <code>gcc</code>
exit in global destructors:</p>
<pre><code>Program received signal SIGSEGV, Segmentation fault.
0x0000000004424506 in gcov_do_dump (list=0x5c9c020, run_counted=0, mode=0) at /home/slyfox/dev/git/gcc/libgcc/libgcov-driver.c:689
689             for (unsigned i = 0; i < cinfo->num; i++)

(gdb) bt
#0  0x0000000004424506 in gcov_do_dump (list=0x5c9c020, run_counted=0, mode=0) at /home/slyfox/dev/git/gcc/libgcc/libgcov-driver.c:689
#1  0x00000000044245e5 in __gcov_dump_one (root=0x6c15be0 <__gcov_root>) at /home/slyfox/dev/git/gcc/libgcc/libgcov-driver.c:722
#2  0x0000000004424627 in __gcov_exit () at /home/slyfox/dev/git/gcc/libgcc/libgcov-driver.c:747
#3  0x00007ffff7fcb0e2 in _dl_call_fini (closure_map=closure_map@entry=0x7ffff7ffe2c0) at dl-call_fini.c:43
#4  0x00007ffff7fcee06 in _dl_fini () at dl-fini.c:114
#5  0x00007ffff79d1255 in __run_exit_handlers (status=0, listp=0x7ffff7b6d660 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true) at exit.c:111
#6  0x00007ffff79d138e in __GI_exit (status=<optimized out>) at exit.c:141
#7  0x00007ffff79b9ad5 in __libc_start_call_main (main=main@entry=0xc63030 <main(int, char**)>, argc=argc@entry=21, argv=argv@entry=0x7fffffffad68) at ../sysdeps/nptl/libc_start_call_main.h:74
#8  0x00007ffff79b9b89 in __libc_start_main_impl (main=0xc63030 <main(int, char**)>, argc=21, argv=0x7fffffffad68, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffad58)
    at ../csu/libc-start.c:360
#9  0x0000000000c63d25 in _start ()</code></pre>
<p>According to <code>gdb</code> the crash happens at <code>cinfo->num</code> dereference:</p>
<pre><code>(gdb) list
684         for (unsigned f_ix = 0; (unsigned)f_ix != gi_ptr->n_functions; f_ix++)
685           {
686             const struct gcov_ctr_info *cinfo
687               = &gi_ptr->functions[f_ix]->ctrs[GCOV_COUNTER_ARCS];
688
689             for (unsigned i = 0; i < cinfo->num; i++)
690               if (run_max < cinfo->values[i])
691                 run_max = cinfo->values[i];
692           }
693</code></pre>
<p>Looking at the specifics some tables we dereference are <code>NULL</code>:</p>
<pre><code>(gdb) p *gi_ptr
$1 = {version = 1110716448, next = 0x50b5500, stamp = 4280923493, checksum = 2709867717, filename = 0x44558dc "/tmp/gb/gcc/cp/logic.gcda", merge = {0x4421950 <__gcov_merge_add>, 0x0, 0x0, 0x0,
    0x44219a0 <__gcov_merge_topn>, 0x0, 0x0, 0x4421be0 <__gcov_merge_time_profile>}, n_functions = 106, functions = 0x50b6480}:

(gdb) p f_ix
$6 = 0

(gdb) p gi_ptr->functions[f_ix]
$7 = (const gcov_fn_info * const) 0x0</code></pre>
<p>You would think that it’s just an unhandled case of <code>NULL</code> functions in
the table. At least that’s what I thought initially. In reality it’s not
the case.</p>
<p>The expected layout here is the following:</p>
<ul>
<li><code>gi_ptr</code> is the pointer to the table of performance counters per
function for a module (usually for one <code>.c</code> file, in our case it is
<code>gcc/cp/logic.cc</code>).</li>
<li><code>gi_ptr->n_functions</code> is the <code>gi_ptr->functions</code> array size with
pointers to counters associated with each given function (multiple
counters per function).</li>
</ul>
<p>In pictures the layout should look this way:</p>

<p><code>gcc</code> code generator never puts zeros into <code>functions</code> array. Each
<code>__gcov.f<N></code> entry is itself an array of counters associated with a
single function <code>f${N}</code>.</p>
<p>Back to our crash: according to <code>gdb</code> session above somehow
<code>functions[0]</code> entry has <code>NULL</code> value. We have <code>n_functions = 106</code>
entries there. Let’s peek at first 16 to get the idea if it has any
reasonable values:</p>
<pre><code>(gdb) x/16a &gi_ptr->functions[f_ix]
0x50b6480:      0x0     0x0
0x50b6490:      0x50b8e80 <__gcov_._Z21ggc_cleared_vec_allocIP17subsumption_entryEPT_m> 0x50b8e20 <__gcov_._ZN10hash_tableI11atom_hasherLb0E11xcallocatorE26find_empty_slot_for_expandEj>
0x50b64a0:      0x50b8dc0 <__gcov_._ZN10hash_tableI18subsumption_hasherLb0E11xcallocatorE6expandEv>     0x50b8d60 <__gcov_._Z8finalizeI10hash_tableI18subsumption_hasherLb0E11xcallocatorEEvPv>
0x50b64b0:      0x50b8d00 <__gcov_._ZN10hash_tableI18subsumption_hasherLb0E11xcallocatorE6verifyERKP17subsumption_entryj>       0x50b8ca0 <__gcov_._ZNK10hash_tableI18subsumption_hasherLb0E11xcallocatorE13alloc_entriesEm>
0x50b64c0:      0x50b8c40 <__gcov_._ZNSt15__allocated_ptrISaISt10_List_nodeI6clauseEEED2Ev>     0x50b8be0 <__gcov_._ZNSt7__cxx114listI6clauseSaIS1_EE14_M_create_nodeIJRP9tree_nodeEEEPSt10_List_nodeIS1_EDpOT_>
0x50b64d0:      0x50b8b80 <__gcov_._ZNSt15__allocated_ptrISaISt10_List_nodeIP9tree_nodeEEED2Ev> 0x50b8b20 <__gcov_._Z21ggc_cleared_vec_allocIP9tree_nodeEPT_m>
0x50b64e0:      0x50b8ac0 <__gcov_._ZN10hash_tableI11atom_hasherLb0E11xcallocatorE6verifyERKP9tree_nodej>       0x50b8a60 <__gcov_._ZN10hash_tableI11atom_hasherLb0E11xcallocatorE6expandEv>
0x50b64f0:      0x50b8a00 <__gcov_._Z27hashtab_entry_note_pointersI18subsumption_hasherEvPvS1_PFvS1_S1_S1_ES1_> 0x50b89a0 <__gcov_._ZNK10hash_tableI18subsumption_hasherLb0E11xcallocatorE24check_complete_insertionEv></code></pre>
<p>Aha, the first two entries have unexpected <code>NULL</code> values. The rest of
them look as expected and are related to <code>__gcov</code> counters. Let’s check
if first two <code>NULL</code>s were always there or it’s a later runtime
corruption.</p>
<p>We care about <code>0x50b6480</code> address specifically (the first that contains
unexpected <code>0x0</code>). Let’s look at the array values at the very <code>gcc</code>
start:</p>
<pre><code>(gdb) start
The program being debugged has been started already.
Start it from the beginning? (y or n) y
Temporary breakpoint 1 at 0xc63030: file /home/slyfox/dev/git/gcc/gcc/main.cc, line 35.

Temporary breakpoint 1, main (argc=21, argv=0x7fffffffad68) at /home/slyfox/dev/git/gcc/gcc/main.cc:35

(gdb) x/16a 0x50b6480

0x50b6480:      0x50b8fc0 <__gcov_._ZN10hash_tableI18subsumption_hasherLb0E11xcallocatorE26find_empty_slot_for_expandEj>        0x50b8ee0 <__gcov_._ZN10hash_tableI18subsumption_hasherLb0E11xcallocatorED2Ev>
0x50b6490:      0x50b8e80 <__gcov_._Z21ggc_cleared_vec_allocIP17subsumption_entryEPT_m> 0x50b8e20 <__gcov_._ZN10hash_tableI11atom_hasherLb0E11xcallocatorE26find_empty_slot_for_expandEj>
0x50b64a0:      0x50b8dc0 <__gcov_._ZN10hash_tableI18subsumption_hasherLb0E11xcallocatorE6expandEv>     0x50b8d60 <__gcov_._Z8finalizeI10hash_tableI18subsumption_hasherLb0E11xcallocatorEEvPv>
0x50b64b0:      0x50b8d00 <__gcov_._ZN10hash_tableI18subsumption_hasherLb0E11xcallocatorE6verifyERKP17subsumption_entryj>       0x50b8ca0 <__gcov_._ZNK10hash_tableI18subsumption_hasherLb0E11xcallocatorE13alloc_entriesEm>
0x50b64c0:      0x50b8c40 <__gcov_._ZNSt15__allocated_ptrISaISt10_List_nodeI6clauseEEED2Ev>     0x50b8be0 <__gcov_._ZNSt7__cxx114listI6clauseSaIS1_EE14_M_create_nodeIJRP9tree_nodeEEEPSt10_List_nodeIS1_EDpOT_>
0x50b64d0:      0x50b8b80 <__gcov_._ZNSt15__allocated_ptrISaISt10_List_nodeIP9tree_nodeEEED2Ev> 0x50b8b20 <__gcov_._Z21ggc_cleared_vec_allocIP9tree_nodeEPT_m>
0x50b64e0:      0x50b8ac0 <__gcov_._ZN10hash_tableI11atom_hasherLb0E11xcallocatorE6verifyERKP9tree_nodej>       0x50b8a60 <__gcov_._ZN10hash_tableI11atom_hasherLb0E11xcallocatorE6expandEv>
0x50b64f0:      0x50b8a00 <__gcov_._Z27hashtab_entry_note_pointersI18subsumption_hasherEvPvS1_PFvS1_S1_S1_ES1_> 0x50b89a0 <__gcov_._ZNK10hash_tableI18subsumption_hasherLb0E11xcallocatorE24check_complete_insertionEv></code></pre>
<p>Compared to the previous output here we see that first two entries are
valid non-<code>NULL</code> counters for functions named:</p>
<pre><code>$ c++filt _ZN10hash_tableI18subsumption_hasherLb0E11xcallocatorE26find_empty_slot_for_expandEj _ZN10hash_tableI18subsumption_hasherLb0E11xcallocatorED2Ev
hash_table<subsumption_hasher, false, xcallocator>::find_empty_slot_for_expand(unsigned int)
hash_table<subsumption_hasher, false, xcallocator>::~hash_table()</code></pre>
<p>That means something corrupted first two entries a while after.</p>
<p>Let’s catch the actual place where <code>0x0</code> clobber write happens using
<code>gdb</code>’s watch points:</p>
<pre><code>(gdb) watch -l *(void**)0x50b6480
Hardware watchpoint 2: -location *(void**)0x50b6480
(gdb) continue

Old value = (void *) 0x50b8fc0 <__gcov_._ZN10hash_tableI18subsumption_hasherLb0E11xcallocatorE26find_empty_slot_for_expandEj>
New value = (void *) 0x50b8f00 <__gcov_._ZN10hash_tableI18subsumption_hasherLb0E11xcallocatorED2Ev+32>
__memset_avx2_unaligned_erms () at ../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:328
328     ../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: No such file or directory.

(gdb) bt
#0  __memset_avx2_unaligned_erms () at ../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:328
#1  0x0000000001930731 in ggc_common_finalize () at /home/slyfox/dev/git/gcc/gcc/ggc-common.cc:1312
#2  0x00000000020d2421 in toplev::finalize (this=this@entry=0x7fffffffac3e) at /home/slyfox/dev/git/gcc/gcc/toplev.cc:2354
#3  0x0000000000c630e5 in main (argc=<optimized out>, argv=0x7fffffffad68) at /home/slyfox/dev/git/gcc/gcc/main.cc:42</code></pre>
<p>Got it! <code>memset()</code> call from <code>ggc_common_finalize()</code> does byte-by-byte
zeroing out of our entry!</p>
<h2 id="ggc-corruptor">GGC corruptor</h2>
<p>We can peek at the specific location of the writer:</p>
<pre><code>(gdb) frame 1
#1  0x0000000001930731 in ggc_common_finalize () at /home/slyfox/dev/git/gcc/gcc/ggc-common.cc:1312
1312          memset (rti->base, 0, rti->stride * rti->nelt);
(gdb) list
1307        for (rti = *rt; rti->base != NULL; rti++)
1308          memset (rti->base, 0, rti->stride * rti->nelt);
1309
1310      for (rt = gt_ggc_rtab; *rt; rt++)
1311        for (rti = *rt; rti->base != NULL; rti++)
1312          memset (rti->base, 0, rti->stride * rti->nelt);
1313
1314      for (rt = gt_pch_scalar_rtab; *rt; rt++)
1315        for (rti = *rt; rti->base != NULL; rti++)
1316          memset (rti->base, 0, rti->stride * rti->nelt);</code></pre>
<p><code>ggc</code> is related to memory managed by <code>gcc</code>’s
<a href="https://gcc.gnu.org/onlinedocs/gccint/Type-Information.html">garbage collector</a>
and to global counters related to precompiled headers.</p>
<p><code>ggc</code> is completely unrelated to statically allocated <code>gcov</code> counters.
Specifically <code>ggc</code> should never touch function pointer area.</p>
<p><code>gt_ggc_rtab</code> is a table to garbage collector root pointers for global
variables used by <code>gcc</code>. Those usually have fancy <code>GTY(())</code> annotations
around the structs.</p>
<p>Let’s figure out what value <code>ggc</code> tries to wipe out off our counter
metadata:</p>
<pre><code>(gdb) p *rti
$15 = {
  base = 0x50b5608 <ovl_op_info+8>,
  nelt = 116,
  stride = 32,
  cb = 0x1477b30 <gt_ggc_m_S(void const*)>,
  pchw = 0x20bd860 <gt_pch_n_S(void const*)>
}</code></pre>
<p>It’s one of the fields of <code>ovl_op_info</code> global array. <code>ovl_op_info_t</code> is
declared in <code>gcc/cp/cp-tree.h</code>:</p>
<pre class="cpp"><code>struct GTY(()) ovl_op_info_t {
  /* The IDENTIFIER_NODE for the operator.  */
  tree identifier;
  /* The name of the operator.  */
  const char *name;
  /* The mangled name of the operator.  */
  const char *mangled_name;
  /* The (regular) tree code.  */
  enum tree_code tree_code : 16;
  /* The (compressed) operator code.  */
  enum ovl_op_code ovl_op_code : 8;
  /* The ovl_op_flags of the operator */
  unsigned flags : 8;
};

/* Overloaded operator info indexed by ass_op_p & ovl_op_code.  */
extern GTY(()) ovl_op_info_t ovl_op_info[2][OVL_OP_MAX];</code></pre>
<p>Here <code>ovl_op_info_t</code> has 3 garbage collectable pointers:</p>
<ul>
<li><code>identifier</code></li>
<li><code>name</code></li>
<li><code>mangled_name</code></li>
</ul>
<p><code>ovl_op_info+8</code> we saw above looks like a <code>name</code> if the pointers are
8-bytes long.</p>
<p>Let’s find the table entry for our <code>rti</code> value:</p>
<pre><code>(gdb) p *rt
$23 = (const ggc_root_tab * const) 0x4488340 <gt_ggc_r_gt_cp_tree_h>
(gdb) p rti - *rt
$24 = 5
(gdb) p (*rt)[rti - *rt]
$25 = {
  base = 0x50b5608 <ovl_op_info+8>,
  nelt = 116,
  stride = 32,
  cb = 0x1477b30 <gt_ggc_m_S(void const*)>,
  pchw = 0x20bd860 <gt_pch_n_S(void const*)>
}</code></pre>
<p>According to <code>gdb</code> session right above the table with our pointer
description should be named <code>gt_ggc_r_gt_cp_tree_h</code> and fifth element
(counting from 0) will be our element. The table definition hides in
generated <code>prev-gcc/gt-cp-tree.h</code> file:</p>
<pre class="cpp"><code>EXPORTED_CONST struct ggc_root_tab gt_ggc_r_gt_cp_tree_h[] = {
  { /* 0: skipped for brevity */},
  { /* 1: skipped for brevity */},
  { /* 2: skipped for brevity */},
  { /* 3: skipped for brevity */},
  { // 4:
    &ovl_op_info[0][0].identifier,
    1 * (2) * (OVL_OP_MAX),
    sizeof (ovl_op_info[0][0]),
    >_ggc_mx_tree_node,
    >_pch_nx_tree_node
  },
  { // 5:
    &ovl_op_info[0][0].name,
    1 * (2) * (OVL_OP_MAX),
    sizeof (ovl_op_info[0][0]),
    (gt_pointer_walker) >_ggc_m_S,
    (gt_pointer_walker) >_pch_n_S
  },
  { // 6:
    &ovl_op_info[0][0].mangled_name,
    1 * (2) * (OVL_OP_MAX),
    sizeof (ovl_op_info[0][0]),
    (gt_pointer_walker) >_ggc_m_S,
    (gt_pointer_walker) >_pch_n_S
  },
  // ...</code></pre>
<p>The <code>// 5:</code> value confirms us that the entry points to <code>name</code> field of
the first element in the array. <code>nelts = 1 * (2) * (OVL_OP_MAX)</code> tells
us how many elements there are in the array and <code>stride = sizeof (ovl_op_info[0][0])</code> tells us how many bytes there are to the beginning
of the next pointer. All look sensible.</p>
<p>But if we look again at how <code>ggc_common_finalize()</code> tries to wipe these
pointers out we might notice the problem:</p>
<pre><code>1310      for (rt = gt_ggc_rtab; *rt; rt++)
1311        for (rti = *rt; rti->base != NULL; rti++)
1312          memset (rti->base, 0, rti->stride * rti->nelt);</code></pre>
<p>Instead of wiping out the pointers it wipes out the whole structs. And
given that <code>memset()</code> starts at an offset <code>8</code> of the array it actually
gets out of bounds of the <code>ovl_op_info</code> for 8 bytes. And when <code>// 6:</code>
entry is wiped we’ll get off-by-16 bytes <code>memset()</code>.</p>
<p>These extra 16 bytes are exactly the corruption we see in our <code>gcov</code>
counters.</p>
<p><a href="https://gcc.gnu.org/git/?p=gcc.git;a=commitdiff;h=7525707c5f3edb46958c4fdfbe30de5ddfa8923a">The fix</a>
was straightforward: zero out only pointers themselves, not the structs
around them:</p>
<pre class="diff"><code>--- a/gcc/ggc-common.cc
+++ b/gcc/ggc-common.cc
@@ -75,6 +75,18 @@ ggc_mark_root_tab (const_ggc_root_tab_t rt)
       (*rt->cb) (*(void **) ((char *)rt->base + rt->stride * i));
 }

+/* Zero out all the roots in the table RT.  */
+
+static void
+ggc_zero_rtab_roots (const_ggc_root_tab_t rt)
+{
+  size_t i;
+
+  for ( ; rt->base != NULL; rt++)
+    for (i = 0; i < rt->nelt; i++)
+      (*(void **) ((char *)rt->base + rt->stride * i)) = (void*)0;
+}
+
 /* Iterate through all registered roots and mark each element.  */

 void
@@ -1307,8 +1319,7 @@ ggc_common_finalize ()
       memset (rti->base, 0, rti->stride * rti->nelt);

   for (rt = gt_ggc_rtab; *rt; rt++)
-    for (rti = *rt; rti->base != NULL; rti++)
-      memset (rti->base, 0, rti->stride * rti->nelt);
+    ggc_zero_rtab_roots (*rt);

   for (rt = gt_pch_scalar_rtab; *rt; rt++)
     for (rti = *rt; rti->base != NULL; rti++)</code></pre>
<p>Andrew Pinkski also mentioned that <code>bootstrap-asan</code> also detects
out-of-bounds access in <a href="https://gcc.gnu.org/PR111505" class="uri">https://gcc.gnu.org/PR111505</a> and I reproduced
it as:</p>
<pre><code>$ ../gcc/configure --with-build-config=bootstrap-asan
$ make</code></pre>
<p>The <code>gcc</code> fix fixed the <code>bootstrap-asan</code> for me. Yay!</p>
<h2 id="c-ifndr-and--fchecking2">C++, IFNDR and -fchecking=2</h2>
<p>The <code>ggc</code> fix also allowed me to get past <code>stagetrain-gcc</code> build stage
for <code>profiledbootstrap</code>.</p>
<p>But <code>make profiledbootstrap</code> started failing on <code>stagefeedback-gcc</code>
stage (roughly <code>stage4</code>) as:</p>
<pre><code>In file included from /home/slyfox/dev/git/gcc/gcc/coretypes.h:480,
                 from /home/slyfox/dev/git/gcc/gcc/rtl-tests.cc:22:
/home/slyfox/dev/git/gcc/gcc/poly-int.h: In instantiation of ‘constexpr poly_int<N, T>::poly_int(poly_int_full, const Cs& ...) [with Cs = {int, int}; unsigned int N = 1; C = long int]’:
/home/slyfox/dev/git/gcc/gcc/poly-int.h:439:13:   required from here
/home/slyfox/dev/git/gcc/gcc/rtl-tests.cc:249:25:   in ‘constexpr’ expansion of ‘poly_int<1, long int>(1, 1)’
/home/slyfox/dev/git/gcc/gcc/poly-int.h:453:5: error: too many initializers for ‘long int [1]’
  453 |   : coeffs { (typename poly_coeff_traits<C>::
      |     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  454 |               template init_cast<Cs>::type (cs))... } {}
      |               ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
make[3]: *** [Makefile:1188: rtl-tests.o] Error 1</code></pre>
<p>That looked like a bug in <code>c++</code> code of <code>gcc</code>. But why did it not fail
earlier when <code>stage1-gcc</code> or <code>stageprofile</code> were being built?</p>
<p>I filed <a href="https://gcc.gnu.org/PR111647" class="uri">https://gcc.gnu.org/PR111647</a> we confirmed it’s a <code>gcc</code> source
code bug: <code>gcc</code> accepts slightly different <code>c++</code> in <code>-fhcecking=1</code> and
<code>-fchecking=2</code> modes.</p>
<p>I extracted the following example out of <code>gcc</code>’s source code:</p>
<pre class="cpp"><code>// $ cat rtl-tests.cc
template<unsigned int N> struct poly_int {
  template<typename ...Cs> constexpr poly_int (const Cs &... cs)
  : coeffs { cs... } {}

  int coeffs[N];
};

#define TARGET_DEFINED_VALUE 1
// this works:
//#define TARGET_DEFINED_VALUE 2

// Is instantiated only for N == 2.
template<unsigned int N> struct const_poly_int_tests {
  static void run () {
    poly_int<TARGET_DEFINED_VALUE> (1, 1);
  }
};</code></pre>
<p>And this code compiles (or not) depending on compiler flags:</p>
<pre><code>$ g++ -c rtl-tests.cc -fchecking=1
# did not fail, BAD!

$ g++ -c rtl-tests.cc -fchecking=2
rtl-tests.cc: In instantiation of 'constexpr poly_int<N>::poly_int(const Cs& ...) [with Cs = {int, int}; unsigned int N = 1]':
rtl-tests.cc:15:42:   required from here
rtl-tests.cc:3:5: error: too many initializers for 'int [1]'
    3 |   : coeffs { cs... } {}
      |     ^~~~~~~~~~~~~~~~
# failed, GOOD

$ clang++ -c rtl-tests.cc
rtl-tests.cc:3:14: error: excess elements in array initializer
  : coeffs { cs... } {}
             ^~
rtl-tests.cc:15:5: note: in instantiation of function template specialization 'poly_int<1>::poly_int<int, int>' requested here
    poly_int<TARGET_DEFINED_VALUE> (1, 1);
    ^
1 error generated.
# failed, GOOD</code></pre>
<p>From there I learned that <a href="https://en.cppreference.com/w/cpp/language/acronyms">IFNDR</a>
means <code>"Ill-Formed, No Diagnostic Required"</code>. Which I would characterise
as allowed Undefined Behaviour of the <code>C++</code> type checker: it might or
might not detect a bug in the C++ and that’s fine to be a conforming
application.</p>
<p>And <a href="https://gcc.gnu.org/git/?p=gcc.git;a=commitdiff;h=e465e5e4a969334f64cf0d6611de5273d73ea732">the fix</a>
was simple:</p>
<pre class="diff"><code>--- a/gcc/rtl-tests.cc
+++ b/gcc/rtl-tests.cc
@@ -246,6 +246,7 @@ template<unsigned int N>
 void
 const_poly_int_tests<N>::run ()
 {
+  using poly_int64 = poly_int<N, HOST_WIDE_INT>;
   rtx x1 = gen_int_mode (poly_int64 (1, 1), QImode);
   rtx x255 = gen_int_mode (poly_int64 (1, 255), QImode);

--- a/gcc/simplify-rtx.cc
+++ b/gcc/simplify-rtx.cc
@@ -8689,6 +8689,7 @@ template<unsigned int N>
 void
 simplify_const_poly_int_tests<N>::run ()
 {
+  using poly_int64 = poly_int<N, HOST_WIDE_INT>;
   rtx x1 = gen_int_mode (poly_int64 (1, 1), QImode);
   rtx x2 = gen_int_mode (poly_int64 (-80, 127), QImode);
   rtx x3 = gen_int_mode (poly_int64 (-79, -128), QImode);
</code></pre>
<p>Here <code>poly_int64</code> is made dependent on <code>N</code> parameter and compiler is
happy not to check dependent types as long as those are not
instantiated.</p>
<h2 id="the-use-case-of-bootstrap4">The use case of <code>bootstrap4</code></h2>
<p>But why did default build of <code>gcc</code> not fail for everyone? There are a
few reasons to that. Let’s look at the <code>gcc</code> <code>bootstrap</code> sequence once
more. But this time from standpoint of <code>-fchecking=</code> option.</p>
<p>The default value of <code>-fchecking=</code> is defined by <code>gcc/configure.ac</code>:</p>
<pre><code>AC_ARG_ENABLE(checking,
[AS_HELP_STRING([[--enable-checking[=LIST]]],
                [enable expensive run-time checks.  With LIST,
                 enable only specific categories of checks.
                 Categories are: yes,no,all,none,release.
                 Flags are: assert,df,extra,fold,gc,gcac,gimple,misc,
                 rtlflag,rtl,runtime,tree,valgrind,types])],
[ac_checking_flags="${enableval}"],[
# Determine the default checks.
if test x$is_release = x ; then
  ac_checking_flags=yes,extra
else
  ac_checking_flags=release
fi])</code></pre>
<p>The above sets <code>--enable-checking=release</code> to <code>gcc</code> releases (which
defaults to <code>-fchecking=0</code>). A development <code>gcc</code> versions sets
<code>--enable-checking=yes,extra</code> which defaults to <code>-fchecking=2</code>.</p>
<p>But that is not all. <code>gcc</code>’s build system does the following <code>CFLAGS</code>
overrides:</p>
<ul>
<li><code>stage1-gcc</code> gets built with default host’s compiler flags</li>
<li><code>stage2-gcc</code> / <code>stageprofile</code> is built with <code>-fno-checking</code></li>
<li><code>stage3-gcc</code> / <code>stagetrain</code> is built with <code>-fchecking=1</code></li>
<li><code>stage4-gcc</code> / <code>stagefeedback</code> is build with default <code>stage3-gcc</code> flags</li>
</ul>
<p>This means we have a few ways to build <code>gcc</code> with <code>-fchecking=2</code> and
get the failure:</p>
<ol type="1">
<li>In <code>stage1-gcc</code>: your host compiler must be a
<code>--enable-checking=yes,extra</code>. Not all distributions ship the
compiler with extra checks (mine does not).</li>
<li>In <code>stage4-gcc</code> (or in <code>stagefeedback</code>): your built compiler is a
<code>--enable-checking=yes,extra</code> and you are building 4-stage compiler.</li>
</ol>
<p>I was hitting the latter <code>[2.]</code> case. Once I realized that I tried
<code>bootstrap4</code>:</p>
<pre><code>$ ../gcc/configure
$ make bootstrap4</code></pre>
<p>That allowed me to verify that <code>stage4</code> was the trigger. And once I
applied Roger’s patch <code>make profiledbootstrap</code> managed to build
<code>stagetrain</code>!</p>
<p>To get <code>stagefeedback</code> built I needed one
<a href="https://gcc.gnu.org/git/?p=gcc.git;a=commitdiff;h=2551e10038a70901f30b2168e6e3af4536748f3c">extra patch</a>
to disable <code>-Werror</code> for <code>stagefeedback</code></p>
<pre class="diff"><code>--- a/Makefile.tpl
+++ b/Makefile.tpl
@@ -561,6 +561,10 @@ STAGEtrain_TFLAGS = $(filter-out -fchecking=1,$(STAGE3_TFLAGS))
 
 STAGEfeedback_CFLAGS = $(STAGE4_CFLAGS) -fprofile-use -fprofile-reproducible=parallel-runs
 STAGEfeedback_TFLAGS = $(STAGE4_TFLAGS)
+# Disable warnings as errors for a few reasons:
+# - sources for gen* binaries do not have .gcda files available
+# - inlining decisions generate extra warnings
+STAGEfeedback_CONFIGURE_FLAGS = $(filter-out --enable-werror-always,$(STAGE_CONFIGURE_FLAGS))
 
 STAGEautoprofile_CFLAGS = $(filter-out -gtoggle,$(STAGE2_CFLAGS)) -g
 STAGEautoprofile_TFLAGS = $(STAGE2_TFLAGS)</code></pre>
<p>Otherwise build from <code>master</code> fails for missing profile data for
binaries that not compiler by profile-generating compiler:</p>
<pre><code>gcc/gcc/sort.cc: In function ‘void reorder45(sort_ctx*, char*, char*, char*, char*, char*) [with sort_ctx = sort_r_ctx]’:
gcc/gcc/sort.cc:313:1: error: ‘gcc/build/sort.gcda’ profile count data file not found [-Werror=missing-profile]</code></pre>
<p>And after that <code>make profiledbootstrap</code> built without any snags. And
<code>make check</code> did not show any regressions.</p>
<p><code>--disable-werror</code> was a reasonable workaround as well for
<code>-Werror</code>-related failures.</p>
<h2 id="make-bootstrap4-strikes-again"><code>make bootstrap4</code> strikes again</h2>
<p>All done?</p>
<p>I was using <code>--enable-checking=release</code> for a while to work around
<code>IFNDR</code>-related failures in <code>profiledbootstrap</code>.</p>
<p>After it was fixed I tried <code>make bootstrap4</code> on default
<code>--enable-checking=yes,extra</code>. And it failed as:</p>
<pre><code>$ ../gcc/configure --disable-multilib --enable-languages=c,c++ CC='gcc -O2' CXX='g++ -O2'
$ make bootstrap4
...
Comparing stages 3 and 4
Bootstrap comparison failure!
x86_64-pc-linux-gnu/libstdc++-v3/src/filesystem/dir.o differs
x86_64-pc-linux-gnu/libstdc++-v3/src/filesystem/cow-dir.o differs
x86_64-pc-linux-gnu/libstdc++-v3/src/c++20/tzdb.o differs
x86_64-pc-linux-gnu/libstdc++-v3/src/c++17/cow-fs_path.o differs
x86_64-pc-linux-gnu/libstdc++-v3/src/c++17/fs_path.o differs
x86_64-pc-linux-gnu/libstdc++-v3/src/c++17/cow-fs_dir.o differs
x86_64-pc-linux-gnu/libstdc++-v3/src/c++17/fs_dir.o differs</code></pre>
<p>This was a case where <code>-fchecking=2</code> caused slightly different code
generated with <code>-fchecking=1</code> and <code>-fchecking=2</code>. I filed
<a href="https://gcc.gnu.org/PR111663">PR111663</a> to clarify if it’s an expected
outcome of <code>-fchecking=2</code> or we should fix <code>gcc</code> code generation.</p>
<p>The following seems to be enough to expose unstable code generation:</p>
<pre class="cpp"><code>// $ cat fs_dir.cc.cc
namespace std {

struct type_info {
  void operator==(const type_info &) const;
};
struct _Sp_counted_base {
  virtual void _M_get_deleter(const type_info &);
};
struct _Sp_make_shared_tag {};
template <typename> struct _Sp_counted_ptr_inplace : _Sp_counted_base {
  struct _Impl {
    _Impl(int);
  };
  _Sp_counted_ptr_inplace(int __a) : _M_impl(__a) {}
  void _M_get_deleter(const type_info &__ti) {
    __ti == typeid(_Sp_make_shared_tag);
  }
  _Impl _M_impl;
};
struct __shared_count {
  __shared_count() { _Sp_counted_ptr_inplace<int>(0); }
} _M_refcount;
} // namespace std</code></pre>
<p>Triggering:</p>
<pre><code>$ g++ -frandom-seed=fs_dir.lo -c fs_dir.cc.cc -fchecking=2 -o bug.o
$ sha1sum bug.o
92d676d60ee6e26e9b242fb64bffe9e47a92052a  bug.o

$ /g++ -frandom-seed=fs_dir.lo -c fs_dir.cc.cc -fchecking=2 -o bug.o -fchecking=1
$ sha1sum bug.o
748b578657a335c212872b012b2afaf0be3ecbc4  bug.o</code></pre>
<h2 id="parting-words">Parting words</h2>
<p>What looked like a simple <code>PGO</code> bug uncovered quite a list of adjacent
<code>gcc</code> bugs in less exercised areas on <code>gcc</code> itself:</p>
<ul>
<li><a href="https://gcc.gnu.org/PR111559">PR111559</a>: <code>"[14 regression] ICE when   building Python with PGO"</code>. <a href="https://gcc.gnu.org/git/?p=gcc.git;a=commitdiff;h=043a6fcbc27f8721301eb2f72a7839f54f393003">Fixed</a>.</li>
<li><a href="https://gcc.gnu.org/PR111619">PR111619</a>: <code>"'make profiledbootstrap'   makes 10+ minutes on insn-recog.cc"</code>. Not fixed yet.</li>
<li><a href="https://gcc.gnu.org/PR111629">PR111629</a>: <code>"[14 Regression]   ggc_common_finalize() corrupts global memory outsuide GTY(()) objects"</code>.
<a href="https://gcc.gnu.org/git/?p=gcc.git;a=commitdiff;h=7525707c5f3edb46958c4fdfbe30de5ddfa8923a">Fixed</a>.</li>
<li><a href="https://gcc.gnu.org/PR111642">PR111642</a>: <code>"[14 Regression] bootstrap4   or profiledbootstrap failure: poly-int.h:453:5: error: too many   initializers"</code>. <a href="https://gcc.gnu.org/git/?p=gcc.git;a=commitdiff;h=e465e5e4a969334f64cf0d6611de5273d73ea732">Fixed</a>.</li>
<li><a href="https://gcc.gnu.org/PR111647">PR111647</a>: <code>"g++ accepts different c++   on -fchecking= anf checking=2"</code>. Not fixed yet.</li>
<li><a href="https://gcc.gnu.org/PR111653">PR111653</a>: <code>"make bootstrap4 fails for   -fchecking=2 code generation changes"</code>
Not fixed yet.</li>
</ul>
<p>At least I managed to drag the <code>PGO</code> bug itself to completion.</p>
<p><code>python</code> keeps breaking <code>gcc</code>’s <code>PGO</code> machinery.</p>
<p><code>cvise</code> is still great at reducing source files (and even <code>.gcda</code> files!).</p>
<p>I learned a few tricks how to effectively debug <code>gcc</code> crashes with <code>gdb</code>
like <code>make STAGE1_CFLAGS='-O2 -ggdb3' BOOT_CFLAGS='-O2 -ggdb3'</code> and
dropping <code>-gtoggle</code>.</p>
<p><code>make profiledbootstrap</code> seemingly never worked when ran with default
<code>./configure</code> options against <code>master</code> branch of <code>gcc</code>. But now it should!</p>
<p><code>make bootstrap4</code> is another rarely exercised and yet very useful sanity
check of <code>gcc</code>’s options like <code>-fchecking=2</code> and code generation
stability. It does not quite works yet, but we are almost there.</p>
<p><code>gcc</code> has it’s own garbage collector subsystem able to track pointers
in structs marked with <code>GTY(())</code> attribute.</p>
<p><code>IFNDR</code> is a <code>C++</code> word for allowed undefined result of type checker:
<code>IFNDR</code> code might or might not be compiled successfully and both
outcomes will be valid.</p>
<p><code>-fchecking=2</code> not just changes <code>c++</code> <code>gcc</code> understands but also changes
the way <code>gcc</code> generates code. Both are bugs, but are scary ones.</p>
<p>Have fun!</p>
</article>
<article>
<h1>Inline graphviz dot in hakyll</h1>
<p>2023-10-06T00:00:00Z</p>
<p><a href="https://github.com/trofi/trofi.github.io.gen/commit/0872d9dc50c9d15b50ceda16d408a45e5655c913">A while ago</a>
I added <code>.dot</code> file support into this blog and used it on
<a href="https://trofi.github.io/posts/248-how-do-shared-library-collisions-break.html">“How do shared library collisions break?”</a>
post.</p>
<p>The change was very easy and helped me a bit to add more visual arrows
into posts that came after.</p>
<p>The result was fine for larger <code>.dot</code> files, but one-liners were a bit
clumsy to insert into a stand alone external file. I tried to add a
quick support for inline <code>.dot</code> syntax and failed.</p>
<p>But today I had enough time to actually finish it and behold!</p>

<p>This is generated by the following snippet:</p>
<pre class="dot"><code>    ```{render=dot}
    digraph {
      node [shape=box]
      B [shape=egg]
      A -> B [label="i am dotted" style=dotted color=red]
      A -> C [solor=blue]
      B -> C -> A [color=green]
    }
    ```</code></pre>
<p>The <a href="https://github.com/trofi/trofi.github.io.gen/commit/1329f12fb75572007cc7dcdc0f980fdd7ba176d7">full change</a>
ended up being very small thanks to <code>pandoc</code> and <code>hakyll</code>’s <code>unixFilter</code>:</p>
<pre class="haskell"><code>inlineDotWithGrapthviz :: TP.Pandoc -> Compiler TP.Pandoc
inlineDotWithGrapthviz = TPW.walkM inlineDot

inlineDot :: TP.Block -> Compiler TP.Block
inlineDot cb@(TP.CodeBlock (id, classes, namevals) contents)
  | lookup "render" namevals == Just "dot"
  = TP.RawBlock (TP.Format "html") . DT.pack <$> (unixFilter "dot" ["-Tsvg"] (DT.unpack contents))
inlineDot x = return x</code></pre>
<p>Here we traverse the <code>pandoc</code> representation of <code>markdown</code> and substitute
<code>{render=dot}</code> for the raw output of <code>dot -Tsvg</code> tool call. Easy!</p>
<p>If you never saw what <code>graphviz</code> is capable have a loot at
<a href="https://graphviz.org/gallery/">their gallery</a>.</p>
<p>Have fun!</p>
</article>
<article>
<h1>Maximum argument count on linux and in gcc</h1>
<p>2023-09-21T00:00:00Z</p>
<h2 id="tldr">Tl;DR</h2>
<p>By default on <code>linux</code> argument list (and environment) is limited by
less than <code>2MB</code> all bytes calculated across all arguments and environment
(including <code>argv/envp</code> array overheads and null terminators).</p>
<p><code>ulimit -s</code> can increase this limit to <code>6MB</code>. Individual command line
and environment <code>K=V</code> pairs are limited to <code>128KB</code>.</p>
<p>And due to internal implementation deficiencies of <code>gcc</code> argument list
for <code>gcc</code> happens to be limited by the same <code>128KB</code> limit.</p>
<h2 id="story-mode">Story mode</h2>
<p>My <a href="https://trofi.github.io/posts/298-unexpected-runtime-dependencies-in-nixpkgs.html">previous attempt</a>
to “just add” a bunch of <code>-fmacro-prefix-map=</code> flags to each <code>nixpkgs</code>
package worked for most packages. But it started failing for <code>qemu</code> as:</p>
<pre><code>Command line: `gcc -m64 -mcx16 /build/qemu-8.1.0/build/meson-private/tmpbyikv8nc/testfile.c \
  -o /build/qemu-8.1.0/build/meson-private/tmpbyikv8nc/output.exe -D_FILE_OFFSET_BITS=64 \
  -O0 -Wl,--start-group -laio -Wl,--end-group -Wl,--allow-shlib-undefined` -> 1
stderr:
gcc: fatal error: cannot execute 'cc1': execv: Argument list too long
compilation terminated.</code></pre>
<p>The failure happens because we have exhausted some command line argument
limit. In case of <code>qemu</code> we pass around a few thousands of
<code>-fmacro-prefix-map=</code> options (<code>qemu</code> is a big package and has many
build inputs).</p>
<p>Ideally I would like to be able to pass a lot more options without
hitting the arguments limit (10x? 100x?). Luckily <code>gcc</code> and other tools
like <code>ld</code> do support a way to pass many options indirectly via response
files:</p>
<pre><code>$ gcc -g -O2 -c a.c</code></pre>
<p>The above should be equivalent to the below:</p>
<pre><code>$ echo "-g"   > a.rsp
$ echo "-O2" >> a.rsp
$ gcc @a.rsp -c a.c</code></pre>
<p>But I’ll leave response files to another post as it ended up being
it’s own rabbit hole.</p>
<p>Instead let’s explore how many arguments you can pass to a single
command in <code>linux</code>.</p>
<h2 id="exploring-the-argument-count-limits">Exploring the argument count limits</h2>
<p>So what are the actual limits we are hitting against here? How many
arguments can we pass to <code>gcc</code> without any problems?</p>
<p>Let’s explore it! I’ll start by adding more and more <code>-g</code> options to
<code>gcc</code> call until it starts failing for a command line limit:</p>
<pre><code>$ nix shell nixpkgs#gcc
$$ set -x; touch a.c; args=(-g); while gcc ${args[@]} -c a.c; do args+=("${args[@]}"); done; echo ${#args[@]}

+ set -x
+ touch a.c
+ args=(-g)
+ gcc -g -c a.c
+ args+=("${args[@]}")
+ gcc -g -g -c a.c
+ args+=("${args[@]}")
+ gcc -g -g -g -g -c a.c
+ args+=("${args[@]}")
+ gcc -g -g -g -g -g -g -g -g -c a.c
...
gcc: fatal error: cannot execute 'cc1': execv: Argument list too long
compilation terminated.
+ echo 32768
32768</code></pre>
<p>Our limit is somewhere below 32K (this is a lot lower than I expected).</p>
<p>In the above snippet we double the length of argument list to speed the
search up a bit, thus it’s not an exact value and some closest
power-of-2 ceiling.</p>
<p>Let’s extend this snippet a bit and build more flexible argument count
probe that returns us precise value. I called it <code>probe-argsize.bash</code>:</p>
<pre class="bash"><code>#!/usr/bin/env bash

# $1 - payload
# $2... - prober command to test against argument list

# create an aray of enough elements to start failing the allocation:
large=("$1"); shift

while "$@" "${large[@]}" >/dev/null 2>/dev/null; do
    # double the array lenght until execution start failing
    large+=("${large[@]}")
done

# Use binary search to find largest successfully running prober
l=0
u=${#large[@]}

while (( l < u )); do
    m=$(( (l + u + 1) / 2 ))
    if "$@" "${large[@]:0:m}" >/dev/null 2>/dev/null; then
        # can survive
        (( l = m ))
    else
        (( u = m - 1 ))
    fi
done

echo $l</code></pre>
<p>The first half of the script does the same 2x argument list growth on
every step as before. And the second half does binary search for an
exact value.</p>
<p>Let’s try it out:</p>
<pre><code>$$ touch a.c; ./probe-argsize.bash -g gcc -c a.c

26118</code></pre>
<p>~26K parameters. Seems to work!</p>
<p>What if we make our argument a bit larger? Say, pass <code>-ggdb3</code> instead of
<code>-g</code>?</p>
<pre><code>$ touch a.c; ./probe-argsize.bash -ggdb3 gcc -c a.c
14510</code></pre>
<p>Just ~14K. That degrades very quickly. The available length is decreased
by half! (or something like that)</p>
<p>How about longer option? I’ll try ~100 bytes long one:</p>
<pre><code>$ touch a.c; ./probe-argsize.bash -I0123456789-90123456789-0123456789-0123456789-0123456789-0123456789-01234567889-0123456789-0123456789 \
  gcc -c a.c
1209</code></pre>
<p>1209 is extremely low. That is on par with what <code>qemu</code> exercises in
<code>nixpkgs</code>. Looks like our limit here is about <code>~120K</code> bytes if we sum up
all our argument lengths to <code>gcc</code>.</p>
<p>What if the problem is in some internal <code>gcc</code> limit and not the OS
itself? Let’s <code>strace</code> <code>gcc</code> call just to make sure:</p>
<pre><code>$$ strace -f gcc ${args[@]} -c a.c
...
[pid 1360260] execve("cc1", ["cc1", "-quiet", "-idirafter", ..., "-g", ...], \
    0x1474b80 /* 103 vars */) = -1 E2BIG (Argument list too long)</code></pre>
<p>Here we see that <code>E2BIG</code> comes right from an <code>execve()</code> system call.
Thus it’s kernel’s limitation of some sort.</p>
<h2 id="getting-the-formula">Getting the formula</h2>
<p>Can we easily increase the limit? Let’s find out how <code>linux</code> implements
limits in <a href="https://github.com/torvalds/linux/blob/2cf0f715623872823a72e451243bbf555d10d032/fs/exec.c#L1888C1-L1894C13">fs/exec.c</a>.
Maybe there is a <code>linux</code>-specific hack somewhere we could pull out.</p>
<p>There are a few places where <code>-E2BIG</code> is returned. This code looks most
relevant:</p>
<pre class="c"><code>static int do_execveat_common(int fd, struct filename *filename,
			      struct user_arg_ptr argv,
			      struct user_arg_ptr envp,
			      int flags)
{
	struct linux_binprm *bprm;
	int retval;
	// ...

	retval = count(argv, MAX_ARG_STRINGS);
	if (retval < 0)
		goto out_free;
	bprm->argc = retval;

	retval = count(envp, MAX_ARG_STRINGS);
	if (retval < 0)
		goto out_free;
	bprm->envc = retval;

	// ...

	retval = bprm_stack_limits(bprm);
	if (retval < 0)
		goto out_free;

	// ...

	retval = copy_strings(bprm->envc, envp, bprm);
	if (retval < 0)
		goto out_free;

	retval = copy_strings(bprm->argc, argv, bprm);
	if (retval < 0)
		goto out_free;

	// ...
}</code></pre>
<p>I wondered if <code>MAX_ARG_STRINGS</code> could be one of our limits once
we solve the smaller limit we are bumping into now, but nope it’s defined
in <a href="https://github.com/torvalds/linux/blob/2cf0f715623872823a72e451243bbf555d10d032/include/uapi/linux/binfmts.h#L9">include/uapi/linux/binfmts.h</a>
as:</p>
<pre class="c"><code>/*
 * These are the maximum length and maximum number of strings passed to the
 * execve() system call.  MAX_ARG_STRLEN is essentially random but serves to
 * prevent the kernel from being unduly impacted by misaddressed pointers.
 * MAX_ARG_STRINGS is chosen to fit in a signed 32-bit integer.
 */
#define MAX_ARG_STRLEN (PAGE_SIZE * 32)
#define MAX_ARG_STRINGS 0x7FFFFFFF</code></pre>
<p>Which is a ridiculously large number:</p>
<pre><code>$ printf "%d\n" 0x7FFFFFFF
2147483647</code></pre>
<p>Maybe it’s a stack limit then? <code>bprm_stack_limits()</code> looks promising:</p>
<pre class="c"><code>static int bprm_stack_limits(struct linux_binprm *bprm)
{
	unsigned long limit, ptr_size;

	/*
	 * Limit to 1/4 of the max stack size or 3/4 of _STK_LIM
	 * (whichever is smaller) for the argv+env strings.
	 * This ensures that:
	 *  - the remaining binfmt code will not run out of stack space,
	 *  - the program will have a reasonable amount of stack left
	 *    to work from.
	 */
	limit = _STK_LIM / 4 * 3;
	limit = min(limit, bprm->rlim_stack.rlim_cur / 4);
	/*
	 * We've historically supported up to 32 pages (ARG_MAX)
	 * of argument strings even with small stacks
	 */
	limit = max_t(unsigned long, limit, ARG_MAX);
	/*
	 * We must account for the size of all the argv and envp pointers to
	 * the argv and envp strings, since they will also take up space in
	 * the stack. They aren't stored until much later when we can't
	 * signal to the parent that the child has run out of stack space.
	 * Instead, calculate it here so it's possible to fail gracefully.
	 *
	 * In the case of argc = 0, make sure there is space for adding a
	 * empty string (which will bump argc to 1), to ensure confused
	 * userspace programs don't start processing from argv[1], thinking
	 * argc can never be 0, to keep them from walking envp by accident.
	 * See do_execveat_common().
	 */
	ptr_size = (max(bprm->argc, 1) + bprm->envc) * sizeof(void *);
	if (limit <= ptr_size)
		return -E2BIG;
	limit -= ptr_size;

	bprm->argmin = bprm->p - limit;
	return 0;
}</code></pre>
<p>This well-commented function tells us that formula here is. I’ll
compress it as:</p>
<pre class="c"><code>limit = _STK_LIM / 4 * 3;
limit = min(limit, bprm->rlim_stack.rlim_cur / 4);
limit = max(limit, ARG_MAX);</code></pre>
<p>Let’s inline as many constants as we can here:</p>
<pre><code>// from include/uapi/linux/resource.h
#define _STK_LIM  (8*1024*1024)
// from include/uapi/linux/limits.h
#define ARG_MAX 131072 /* # bytes of args + environ for exec() */</code></pre>
<p>We’ll get this 3-liner:</p>
<pre class="c"><code>limit = 6MB; // 8MB / 4 * 3
limit = min(limit, CURRENT_STACK_LIMIT / 4);
limit = max(limit, 128K);</code></pre>
<p>The above in plain words: argument limits (in bytes) are at least 128K
and at most are 6MB. And by default it’s <code>CURRENT_STACK_LIMIT / 4</code>.</p>
<p><code>CURRENT_STACK_LIMIT</code> default is set to <code>_STK_LIM</code> as well:</p>
<pre><code>$ ulimit -s
8192</code></pre>
<p>Thus the argument length limit by default is <code>2MB</code>. And wee can raise up
to 6MB (3x) maximum if we set <code>ulimit -s</code> up to <code>24MB</code>. Setting stack
to anything higher would not affect argument limit.</p>
<p>That is the theory. Does it match the practice?</p>
<p>What did I miss? Why do we get only ~128K or argument limit for <code>gcc</code>
and not ~2MB?</p>
<p>There is a small catch: I kept exploring limits of <code>gcc</code> executable. In
<code>nixpkgs</code> it’s a big shell wrapper. Part of wrapper’s work is to set
various environment variables. Not just pass through the arguments.
And each environment variable is treated roughly like a command line
parameter.</p>
<p>Let’s use known simple <code>printf</code> binary instead (it should not set any
environment variables internally) and see what are it’s limits:</p>
<pre><code>$ ./probe-argsize.bash -g $(which printf) "%s" --
189929</code></pre>
<p>189K arguments! This looks more like <code>400KB</code> of argument bytes. But if you
know anything about <code>char * argv[]</code> parameter to <code>main()</code> you might know
that it’s a pointer array. And pointers probably take most overhead
here.</p>
<p>Let’s use longer arguments to mitigate pointer overhead:</p>
<pre><code>$ filler=0123456789-90123456789-0123456789-0123456789-0123456789-0123456789-01234567889-0123456789-0123456789
$ ./probe-argsize.bash $filler $(which printf) "%s" --
19165</code></pre>
<p>19K arguments 100 bytes each: that looks more like <code>~2MB</code> limit. Phew!
We finally got our theoretical limit.</p>
<h2 id="ultimate-argument-count-linux-allows-today">Ultimate argument count <code>linux</code> allows today</h2>
<p>Given that argument count somehow depends on stack size via
<code>CURRENT_STACK_LIMIT</code> can we just change default stack size and set
<code>10x</code> limit for argument count?</p>
<p>What is The Largest argument count we can pass on <code>linux</code>?</p>
<p>It’s not very hard to figure it out from the <code>bprm_stack_limits()</code> code
above.</p>
<p>To recap our environment structure in memory (on stack) is an array of
pointers to the null-terminated string pool:</p>
<pre><code> | (char *)argv[0] -> "arg0\0"
 | (char *)argv[1] -> "arg1\0"
 | (char *)argv[2] -> "arg2\0"
 | (char *)argv[3] -> "arg3\0"
 | (char *)argv[4] -> "arg4\0"
 | (char *)argv[5] -> "arg5\0"
 | ...
 | (char *)argv[N] -> "argN\0"
 | (char *)NULL</code></pre>
<p>This means that our most memory-efficient input would be an array of
zero-byte arguments. Note, that in this case main overhead is on
pointer array and not on the arguments themselves.</p>
<p>That would be something like:</p>
<pre><code> | (char *)argv[0] -> "\0"
 | (char *)argv[1] -> "\0"
 | (char *)argv[2] -> "\0"
 | (char *)argv[3] -> "\0"
 | (char *)argv[4] -> "\0"
 | (char *)argv[5] -> "\0"
 | ...
 | (char *)argv[N] -> "\0"
 | (char *)NULL</code></pre>
<p>On 64-bit systems this gives us 9 bytes per argument. That gives us
<code>2MB / 9 = 233016</code> arguments. The experiment confirms that we are very
close:</p>
<pre><code>$ ./probe-argsize.bash "" env -i $(which printf) "%s" --
232134</code></pre>
<p>An exercise for the reader: why is it 822 arguments shorter that our
maximum theoretical value? A word of warning: it’s not a very simple
question.</p>
<p>Given that <code>2MB</code> is derived from <code>1/4 * CURRENT_STACK_LIMIT</code> we can
increase that as well using <code>ulimit -s</code>. Let’s add 100x:</p>
<pre><code>$ ulimit -s
8192
$ ulimit -s 819200

$ ./probe-argsize.bash "" env -i $(which printf) "%s" --
698168</code></pre>
<p>Note that it’s only a 3x improvement (and not a 100x improvement).</p>
<p>This is exactly our <code>limit = 6MB;</code> absolute limit above. Once again: i
you need to get maximum out of your argument limits on today’s <code>linux</code>
it’s enough to set <code>ulimit -s</code> from <code>8MB</code> to <code>24MB</code>.</p>
<p>Which gives us final formula of <code>6MB / 9 = 699050</code> argument count on
64-bit systems.</p>
<p>Fun fact: on 32-bit host kernel the limit should probably be slightly higher
due to shorter pointer size:</p>
<ul>
<li><code>2MB / 5 = 419430</code> arguments (compared to <code>233016</code> on <code>64-bit</code>)</li>
<li><code>6MB / 5 = 1258291</code> arguments (compared to <code>699050</code> on <code>64-bit</code>)</li>
</ul>
<p>That is 1.8x larger than 64-bit systems!</p>
<p>Once again: it’s a pretty silly benchmark as it’s not very useful to
pass a million empty strings to the program. But it’s a good model to
understand the absolute limits.</p>
<p>We can do more realistic estimates if we know average argument length
in our use case. Say, if the bulk of our parameters are paths to the
<code>/nix/store</code> we can safely say those are at least 50 bytes long.</p>
<p>For 100-bytes long use case we would get:</p>
<ul>
<li><code>2MB / 109 = 19239</code> (64-bit), <code>2MB / 105 = 19972</code> (32-bit)</li>
<li><code>6MB / 109 = 57719</code> (64-bit), <code>6MB / 105 = 59918</code> (32-bit)</li>
</ul>
<p>Thus for simple case we should be able to pull out almost <code>~20K</code> by
default and almost <code>~60K</code> with larger stack size.</p>
<h2 id="nixpkgs-gcc-wrapper-mystery"><code>nixpkgs</code> <code>gcc</code> wrapper mystery</h2>
<p>So why does <code>gcc</code> have so much overhead of <code>-g</code> options? We get about
<code>190K</code> options for for <code>printf -g</code> and only 20K for <code>gcc -g</code> above.
That is almost 10x reduction.</p>
<p>Could <code>nixpkgs</code>’s <code>gcc</code> wrapper artificially inflate it’s arguments
somehow? <code>strace</code> should help us verify that:</p>
<pre><code>$ nix shell nixpkgs#gcc
$$ touch a.c && strace -etrace=execve -s 10000 -f -v -olog gcc -Ihow-many-duplcates -c a.c

$$ grep cc1 log
1290300 execve(".../cc1", [..., , "-I", "how-many-duplcates", ...], [..., "COLLECT_GCC_OPTIONS='-I' 'how-many-duplcates' ...", ...]) = 0</code></pre>
<p>Not too bad: we see 2x explosion here:</p>
<ul>
<li>one option is in argument list</li>
<li>another is in <code>COLLECT_GCC_OPTIONS=</code> environment</li>
</ul>
<p>The <code>2x</code> explosion itself does not explain 10x reduction.</p>
<p>There is also an extra catch: the way I tried to set the variable in
<code>qemu</code> is via <code>NIX_CFLAGS_COMPILE=</code> environment variable. Those get
translated by package setup hooks not visible in <code>nix shell</code>. Let’s
check their full effect.</p>
<p>I’ll write a complete small derivation to demonstrate the explosion
closer to <code>qemu</code> failure mode.</p>
<p>Here is a <code>default.nix</code> derivation that should demonstrate the point:</p>
<pre class="nix"><code>{ pkgs ? import <nixpkgs> { } }:
pkgs.stdenv.mkDerivation {
  name = "probe-wrapper";
  dontUnpack = true;
  nativeBuildInputs = [ pkgs.strace ];

  NIX_CFLAGS_COMPILE = "-Ihow-many-duplcates";

  postInstall = ''
    touch a.c
    strace -etrace=execve -s 10000 -f -v -o$out -- $CC -c a.c -o a.o
  '';
}</code></pre>
<p>Now we can quickly build it and explore it’s contents:</p>
<pre><code>$ nix-build
$ grep cc1 result

31 execve(".../cc1", [..., "-I", "how-many-duplcates", ...], "NIX_CFLAGS_COMPILE_x86_64_unknown_linux_gnu=... -Ihow-many-duplcates ...", "NIX_CFLAGS_COMPILE=-Ihow-many-duplcates ...", "COLLECT_GCC_OPTIONS=... '-I' 'how-many-duplcates' ..."]) = 0</code></pre>
<p>Things are a bit worse here: we see 4x explosion:</p>
<ul>
<li>[as before] one option is in argument list</li>
<li>[as before] another is in <code>COLLECT_GCC_OPTIONS=</code> environment</li>
<li>[new] <code>NIX_CFLAGS_COMPILE</code> variable, the one we added and are using for injection</li>
<li>[new] <code>NIX_CFLAGS_COMPILE_x86_64_unknown_linux_gnu</code> variable, the one
<code>pkgs/build-support/cc-wrapper/setup-hook.sh</code> setup hook set for us</li>
</ul>
<p>And if we throw cross-compilation into the picture (ideally I wanted to
avoid <code>__FILE__</code> leaks in all cases) and use all three
<code>NIX_CFLAGS_COMPILE_FOR_BUILD</code>, <code>NIX_CFLAGS_COMPILE</code> and
<code>NIX_CFLAGS_COMPILE_FOR_TARGET</code>:</p>
<pre class="nix"><code>{ pkgs ? import <nixpkgs> { crossSystem = "riscv64-linux"; } }:
pkgs.stdenv.mkDerivation {
  name = "probe-wrapper";
  dontUnpack = true;
  nativeBuildInputs = [ pkgs.buildPackages.strace ];

  NIX_CFLAGS_COMPILE_FOR_BUILD =  "-Ihow-many-duplcates";
  NIX_CFLAGS_COMPILE =            "-Ihow-many-duplcates";
  NIX_CFLAGS_COMPILE_FOR_TARGET = "-Ihow-many-duplcates";

  postInstall = ''
    touch a.c
    strace -etrace=execve -s 10000 -f -v -o$out -- $CC -c a.c -o a.o
  '';
}</code></pre>
<pre><code>$ nix-build
$ grep cc1 result

35 execve(".../cc1", [..., "-I", "how-many-duplcates", ...], "NIX_CFLAGS_COMPILE_riscv64_unknown_linux_gnu=... -Ihow-many-duplcates ...", "NIX_CFLAGS_COMPILE_FOR_BUILD=-Ihow-many-duplcates ...", "NIX_CFLAGS_COMPILE=-Ihow-many-duplcates ...", NIX_CFLAGS_COMPILE_FOR_TARGET=-Ihow-many-duplcates ...", "COLLECT_GCC_OPTIONS=... '-I' 'how-many-duplcates' ..."]) = 0</code></pre>
<p>Here we see 6x explosion:</p>
<ul>
<li>[as before] one option is in argument list</li>
<li>[as before] another is in <code>COLLECT_GCC_OPTIONS=</code> environment</li>
<li>[as before] <code>NIX_CFLAGS_COMPILE</code> variable, the one we added and are using for injection</li>
<li>[as before] <code>NIX_CFLAGS_COMPILE_riscv64_unknown_linux_gnu</code> variable, the one
<code>pkgs/build-support/cc-wrapper/setup-hook.sh</code> setup hook set for us</li>
<li>[new] <code>NIX_CFLAGS_COMPILE_FOR_BUILD</code> variable, we set ourselves</li>
<li>[new] <code>NIX_CFLAGS_COMPILE_FOR_TARGET</code> variable, we set ourselves</li>
</ul>
<p>None of these variables looks redundant: they serve the purpose to
propagate flags across shell wrappers. Thus we’ll have to keep in mind
this 6x explosion.</p>
<p>And yet. 6x explosion does not explain why mere <code>-g</code> option can be
present only 20K times instead of 200K times.</p>
<h2 id="another-nasty-limit">Another nasty limit</h2>
<p>I skimmed through maximum individual variable length limit above as if
it did not exist. But it’s there! Look:</p>
<pre><code>$ $(which printf) "%s" $(printf "%0*d" 100000) $(printf "%0*d" 100000) $(printf "%0*d" 100000) $(printf "%0*d" 100000) >/dev/null; echo $?
0
$ $(which printf) "%s" $(printf "%0*d" 200000) >/dev/null; echo $?
-bash: /run/current-system/sw/bin/printf: Argument list too long
126</code></pre>
<p>More precise limit is <code>128K</code>:</p>
<pre><code>$ $(which printf) "%s" $(printf "%0*d" $((2 ** 17)) ) >/dev/null; echo $?
-bash: /run/current-system/sw/bin/printf: Argument list too long
126
$ $(which printf) "%s" $(printf "%0*d" $((2 ** 17 - 1)) ) >/dev/null; echo $?
0
$ echo $((2 ** 17))
131072</code></pre>
<p>It comes from <code>copy_string()</code> from the same
<a href="https://github.com/torvalds/linux/blob/2cf0f715623872823a72e451243bbf555d10d032/fs/exec.c#L523">fs/exec.c</a>:</p>
<pre class="c"><code>static int copy_strings(int argc, struct user_arg_ptr argv,
			struct linux_binprm *bprm)
{
	struct page *kmapped_page = NULL;
	char *kaddr = NULL;
	unsigned long kpos = 0;
	int ret;

	while (argc-- > 0) {
		const char __user *str;
		int len;
		unsigned long pos;

		ret = -EFAULT;
		str = get_user_arg_ptr(argv, argc);
		if (IS_ERR(str))
			goto out;

		len = strnlen_user(str, MAX_ARG_STRLEN);
		if (!len)
			goto out;

		if (!len)
			goto out;

		ret = -E2BIG;
		if (!valid_arg_len(bprm, len))
			goto out;
	// ...
out:
	// ...
	return ret;
}

static bool valid_arg_len(struct linux_binprm *bprm, long len)
{
	return len <= MAX_ARG_STRLEN;
}</code></pre>
<p>Here <code>MAX_ARG_STRLEN</code> (<code>128K</code>) limits individual entries in arguments
and environment variable entries:</p>
<pre><code>$ E=$(printf "%0*d" $((2 ** 17 - 3)) ) $(which printf) "1" >/dev/null; echo $?
0
$ E=$(printf "%0*d" $((2 ** 17 - 2)) ) $(which printf) "1" >/dev/null; echo $?
-bash: /run/current-system/sw/bin/printf: Argument list too long
126</code></pre>
<p>Here we run <code>printf 1</code> with a large <code>E=000000...000</code> variable and expose
the same <code>128K</code> limit.</p>
<p>This is bad news. As we saw above <code>gcc</code> uses single <code>COLLECT_GCC_OPTIONS</code>
variable to pass all options around. And on top of that it quotes each
argument.</p>
<p>As a result <code>gcc -g -g ...</code> turns into
<code>COLLECT_GCC_OPTIONS="'-g' '-g' ..."</code>. This means that we are limited by
a <code>128K</code> limit when it comes to <code>gcc</code> flags.</p>
<p>To estimate our best limit <code>-g</code> gets translated to <code>"'-g' "</code>: 2 bytes
get turned into 5. That means we can put <code>128K / 5 = 26214</code> entries.</p>
<p>That is very close to our <code>26118</code> limit we got above. Again, why the
values differ by <code>96</code> entries is another exercise to the reader.</p>
<h2 id="the-variable-budget">The variable budget</h2>
<p>Let’s see what worst-case scenario for the longest option we have for
<code>-fmacro-prefix-map=</code> on my system for directories that have <code>include</code>
subdirectory:</p>
<pre><code>$ for d in *; do [[ -d "$d/include" ]] && echo "${#d} $d"; done | sort -k 1 -n | tail -n 10
87 s2zpr18vjr274yaalgz5c4g7dx51cvb9-mcfgthreads-x86_64-w64-mingw32-unstable-2023-06-06-dev
87 s5gn9gwwwpm8ln84apjb8hbrhihq72d4-mcfgthreads-x86_64-w64-mingw32-unstable-2023-06-06-dev
87 scihnclavazxzk0vgblhp4371s3gr40c-mcfgthreads-x86_64-w64-mingw32-unstable-2023-06-06-dev
87 sdf4v7s0avfkgnnnx1hq000ydl7bz9sd-mcfgthreads-x86_64-w64-mingw32-unstable-2023-06-06-dev
87 xfl1c6qmr8b39bxgpmjp004afknm54av-mcfgthreads-x86_64-w64-mingw32-unstable-2023-06-06-dev
87 y00bh71xvsrkazdjy5w4ii8pdkdwb5hr-mcfgthreads-x86_64-w64-mingw32-unstable-2023-06-06-dev
87 z8k642hwqflwmp16ql3jj7z1pslyznzq-mcfgthreads-x86_64-w64-mingw32-unstable-2023-06-06-dev
88 6lw175bhq8d669jl1ddvwy49hcmlysi0-gmp-with-cxx-static-x86_64-unknown-linux-musl-6.3.0-dev
89 ljrwhpdcv7xzmxbank3jdi2xyga553s5-gmp-with-cxx-static-aarch64-unknown-linux-musl-6.3.0-dev
94 r5gzvnjnq024gr830nvpkcyd7n4ip33v-libnetfilter_conntrack-static-x86_64-unknown-linux-musl-1.0.9</code></pre>
<p>About 90 bytes. With added <code>/nix/store</code> prefix that would be <code>100</code>
characters long paths. Nice round number.</p>
<p>This makes <code>-fmacro-prefix-map=/nix/store/...=/nix/store/...</code> options
around 220 bytes.</p>
<p>How many of those can we realistically set for our <code>2MB</code> budget?</p>
<ul>
<li>for 4x overhead: <code>2 * 1024 * 1024 / (220 + 9) / 4 = 2289</code> variables</li>
<li>for 6x overhead: <code>2 * 1024 * 1024 / (220 + 9) / 6 = 1526</code> variables</li>
</ul>
<p>But given <code>COLLECT_GCC_OPTIONS</code> limitation for <code>128K</code> our calculation
becomes even more pessimistic:</p>
<ul>
<li>for no overhead: <code>128 * 1024 / (220 + 9) / 4 = 143</code> variables!</li>
</ul>
<p>It might sound like a lot but it’s not that much of a budget: some
packaging systems (like <code>haskell</code>’s <code>hackage</code>) do like small
fine-grained packages and occasionally do install <code>C</code> header files.
<code>nix</code> itself favours smaller packages to speed up rebuilds and shrink
runtime closure. <code>pkg-config</code> is geared towards installing packages
into individual directories.</p>
<p>And what is worse: <code>NIX_CFLAGS_COMPILE</code> is not the only option
that exhibits this behaviour. Here is the longer list following
<code>mangleVarList</code> used in <code>nixpkgs</code>:</p>
<pre><code># from pkgs/build-support/cc-wrapper/add-flags.sh
NIX_CFLAGS_COMPILE
NIX_CFLAGS_COMPILE_BEFORE
NIX_CFLAGS_LINK
NIX_CXXSTDLIB_COMPILE
NIX_CXXSTDLIB_LINK
NIX_GNATFLAGS_COMPILE

# from ./pkgs/build-support/bintools-wrapper/add-flags.sh
NIX_IGNORE_LD_THROUGH_GCC
NIX_LDFLAGS
NIX_LDFLAGS_BEFORE
NIX_DYNAMIC_LINKER
NIX_LDFLAGS_AFTER
NIX_LDFLAGS_HARDEN
NIX_HARDENING_ENABLE

# from pkgs/build-support/cc-wrapper/add-gnat-extra-flags.sh
NIX_GNATMAKE_CARGS

# from pkgs/build-support/pkg-config-wrapper/add-flags.sh
PKG_CONFIG_PATH</code></pre>
<p>Thus worst case we get to set at most ~150 entries for all of these
variables.</p>
<h2 id="parting-words">Parting words</h2>
<p>Even on <code>linux</code> command line argument limits are hard. If you can try to
use files to pass inputs of unbounded sizes.</p>
<p><code>linux</code> has unreachable <code>0-x7fffFFFF</code> argument count limit when executing
the commands. It does have an overall limit <code>2MB</code> limit that one can
increase to <code>6MB</code>. And on top of that individual arguments are limited
by <code>128K</code> limit that you can’t raise.</p>
<p>The above effectively means that argument passing overheads define
argument count limit. For arguments of length <code>0</code> by default you can
pass <code>233016</code> empty strings on 64-bit kernel and <code>419430</code> on 32-bit one.</p>
<p>If you increase the default stack size with <code>ulimit -s</code> you can get up
to <code>1258291</code> empty string arguments on 32-bits and <code>699050</code> on 64-bits.</p>
<p>For practical argument lengths actual values are way smaller: in order
of thousands to tens of thousands.</p>
<p><code>gcc</code> is a special <code>COLLECT_GCC_OPTIONS</code> case and it has a limit of
<code>128KB</code> making argument limits onto hundreds.</p>
<p>Initially I planned to workaround <code>qemu</code> failure by using <code>gcc</code>’s
response files. I thought it would save the problem completely.
Unfortunately <code>COLLECT_GCC_OPTIONS</code> contains already expanded response
file contents and thus response files will only remove multiplication
factor but will not sidestep 128KB limit.</p>
<p>On the bright side <code>COLLECT_GCC_OPTIONS</code> is an internal <code>gcc</code>
implementation detail that should be fixable without much external
impact. Even if we move it to proper argument list it should already
unlock <code>2MB</code> limit. And if we could pass response files through we cloud
sidestep the limit entirely. Filed <a href="https://gcc.gnu.org/PR111527" class="uri">https://gcc.gnu.org/PR111527</a> to
<code>gcc</code> upstream.</p>
<p>On top of that as a medium-term workaround I proposed the change to
<code>linux</code> to allow variables as large as the whole limit:
<a href="https://lkml.org/lkml/2023/9/24/381" class="uri">https://lkml.org/lkml/2023/9/24/381</a>.</p>
<p>But meanwhile I’ll try to patch <code>gcc</code> (and maybe <code>clang</code>?) just for
<code>nixpkgs</code> to apply programmatic mangling similar to:</p>
<pre><code># Pseudocode. real regexps do not work for `gcc`:
-fmacro-prefix-map=s,$NIX_STORE/[a-z0-9]{32}-,$NIX_STORE/eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee-</code></pre>
<p>Have fun!</p>
</article>
<article>
<h1>Unexpected runtime dependencies in nixpkgs</h1>
<p>2023-09-14T00:00:00Z</p>
<h2 id="intro">Intro</h2>
<p><code>nix</code> uses a bit unusual way to detect runtime dependencies: it scans
the build result for it’s hash inputs used during build and does not
require manual specification of the dependencies.</p>
<p>For example if we build a trivial package that just prints a string into
the file it automatically pulls in that strings into runtime closure:</p>
<pre><code>$ nix build --impure --expr 'with import <nixpkgs> {}; runCommand "foo" {} \
  "echo ${re2c}/bin/re2c > $out"'</code></pre>
<p>This command builds a package that consists of a single <code>$out</code> file
(symlinked to <code>./result</code>) which contains absolute path to <code>re2c</code> binary.
It’s not a complicated package:</p>
<pre><code>$ cat result
/nix/store/pj9cdgj07iz3cj88rywapx2lfxfmdqd3-re2c-3.1/bin/re2c</code></pre>
<p>And yet if we look at full closure of it’s inferred dependencies it
already has <code>re2c</code> and all it’s runtime dependencies:</p>
<pre><code>$ nix path-info -r ./result
/nix/store/gnzwqa9df994g01yw5x75qnbl1rhp9ds-libunistring-1.1
/nix/store/h3aw16j1c54jv8s39yvdhpfcx3538jwi-libidn2-2.3.4
/nix/store/kv0v4h5i911gj39m7n9q10k8r8gbn3sa-xgcc-12.3.0-libgcc
/nix/store/905gkx2q1pswixwmi1qfhfl6mik3f22l-glibc-2.37-8
/nix/store/s2pgr9iqj60mfnmabixnqacxl4bzb408-gcc-12.3.0-libgcc
/nix/store/gi26p79iq8jrw51irq5x82c2cqlgicxi-gcc-12.3.0-lib
/nix/store/pj9cdgj07iz3cj88rywapx2lfxfmdqd3-re2c-3.1
/nix/store/amiqn0hvnmrfcz2s8b47fb770v8hy9ny-foo</code></pre>
<p>Such automatic scanning method method has both false positives and false
negatives.</p>
<p>Say, if we are to compress the file reference might disappear:</p>
<pre><code>$ nix build --impure --expr 'with import <nixpkgs> {}; runCommand "foo" {} \
  "echo ${re2c}/bin/re2c | bzip2 > $out"'

$ nix path-info -r ./result
/nix/store/lwx722djnam7yjy439b9k6czb55h707q-foo</code></pre>
<p>Missing reference detection is a bug here (false negative). False
negatives can be worked around by explicitly adding plain text
references into the file just like we did in the original example.</p>
<p>But I would like to talk about false positives today.</p>
<h2 id="the-problem">The problem</h2>
<p>Let’s jump start from the motivating example: right now <code>nix</code> package
has header only <code>nlohmann/json</code> dependency in it’s runtime closure:</p>
<pre><code>$ nix path-info -r nixpkgs#nix | fgrep nlohmann_json
/nix/store/5xih6daf5g3hpa0wc5vs2cgrhakn4s0j-nlohmann_json-3.11.2</code></pre>
<p>There is nothing in <code>nlohmann/json</code> useful for <code>nix</code>’s runtime:</p>
<pre><code>$ find /nix/store/5xih6daf5g3hpa0wc5vs2cgrhakn4s0j-nlohmann_json-3.11.2 -type f | unnix
/<<NIX>>/nlohmann_json-3.11.2/include/nlohmann/detail/conversions/from_json.hpp
/<<NIX>>/nlohmann_json-3.11.2/include/nlohmann/detail/conversions/to_chars.hpp
/<<NIX>>/nlohmann_json-3.11.2/include/nlohmann/detail/conversions/to_json.hpp
...
/<<NIX>>/nlohmann_json-3.11.2/include/nlohmann/ordered_map.hpp
/<<NIX>>/nlohmann_json-3.11.2/share/cmake/nlohmann_json/nlohmann_jsonTargets.cmake
/<<NIX>>/nlohmann_json-3.11.2/share/cmake/nlohmann_json/nlohmann_jsonConfig.cmake
/<<NIX>>/nlohmann_json-3.11.2/share/cmake/nlohmann_json/nlohmann_jsonConfigVersion.cmake
/<<NIX>>/nlohmann_json-3.11.2/share/pkgconfig/nlohmann_json.pc</code></pre>
<p>These are a few headers and <code>cmake</code> and <code>pkg-config</code> plumbing.</p>
<p>So why does <code>nix</code> retain those then?</p>
<h2 id="debugging-the-details">Debugging the details</h2>
<p>To figure out where the references come from we can grep the package for
a raw string and see how it gets there:</p>
<pre><code>$ LANG=C grep -R $(nix-build --no-link '<nixpkgs>' -A nlohmann_json) $(nix-build --no-link '<nixpkgs>' -A nix.out)
grep: /nix/store/vxx4c6gc2zgfw870b40f06dmli6ljp34-nix-2.17.0/bin/nix: binary file matches
grep: /nix/store/vxx4c6gc2zgfw870b40f06dmli6ljp34-nix-2.17.0/bin/nix-build: binary file matches
grep: /nix/store/vxx4c6gc2zgfw870b40f06dmli6ljp34-nix-2.17.0/bin/nix-channel: binary file matches
grep: /nix/store/vxx4c6gc2zgfw870b40f06dmli6ljp34-nix-2.17.0/bin/nix-collect-garbage: binary file matches
grep: /nix/store/vxx4c6gc2zgfw870b40f06dmli6ljp34-nix-2.17.0/bin/nix-copy-closure: binary file matches
...</code></pre>
<p>Here we see that even <code>nix</code> binary itself retains <code>nlohmann/json</code>
reference. With <code>strings</code> tool from <code>GNU binutils</code> we can check how the
reference looks like:</p>
<pre><code>$ nix shell nixpkgs#binutils-unwrapped
$$strings $(nix-build --no-link '<nixpkgs>' -A nix.out)/bin/nix | grep $(nix-build --no-link '<nixpkgs>' -A nlohmann_json)
/nix/store/5xih6daf5g3hpa0wc5vs2cgrhakn4s0j-nlohmann_json-3.11.2/include/nlohmann/json.hpp
/nix/store/5xih6daf5g3hpa0wc5vs2cgrhakn4s0j-nlohmann_json-3.11.2/include/nlohmann/detail/output/serializer.hpp
/nix/store/5xih6daf5g3hpa0wc5vs2cgrhakn4s0j-nlohmann_json-3.11.2/include/nlohmann/detail/conversions/to_chars.hpp
/nix/store/5xih6daf5g3hpa0wc5vs2cgrhakn4s0j-nlohmann_json-3.11.2/include/nlohmann/detail/input/lexer.hpp
/nix/store/5xih6daf5g3hpa0wc5vs2cgrhakn4s0j-nlohmann_json-3.11.2/include/nlohmann/detail/iterators/iter_impl.hpp
/nix/store/5xih6daf5g3hpa0wc5vs2cgrhakn4s0j-nlohmann_json-3.11.2/include/nlohmann/detail/input/json_sax.hpp
/nix/store/5xih6daf5g3hpa0wc5vs2cgrhakn4s0j-nlohmann_json-3.11.2/include/nlohmann/detail/iterators/iteration_proxy.hpp
/nix/store/5xih6daf5g3hpa0wc5vs2cgrhakn4s0j-nlohmann_json-3.11.2/include/nlohmann/detail/input/parser.hpp</code></pre>
<p>Note how absolute header paths are embedded into <code>nix</code> binary. It
happens via <code>__FILE__</code> macro expansion in C++ code.</p>
<pre class="cpp"><code>//// in nlohmann_json/include/nlohmann.json.h
class basic_json {
    // ...
    template<typename T, typename... Args> static T* create(Args&& ... args)
    {
        // ...
        JSON_ASSERT(obj != nullptr);
        // ...
    }

//// in nlohmann_json/include/nlohmann/detail/macro_scope.hpp
// allow overriding assert
#if !defined(JSON_ASSERT)
    #include <cassert> // assert
    #define JSON_ASSERT(x) assert(x)
#endif

// in glibc/include/assert.h
# if defined __cplusplus
#  define assert(expr)                                                  \
     (static_cast <bool> (expr)                                         \
      ? void (0)                                                        \
      : __assert_fail (#expr, __FILE__, __LINE__, __ASSERT_FUNCTION))
// ...</code></pre>
<p>In this case any code that happens to instantiate <code>basic_json::create()</code>
function will embed <code>__FILE__</code> definition as part of the <code>assert()</code> call
and will embed absolute path to the store.</p>
<p>Absolute path to the store will retain <code>nlohmann/json</code> in runtime closure.
It’s a completely redundant runtime dependency.</p>
<h2 id="the-workaround">The workaround</h2>
<p>Initially I though of using something like <code>nukeReferences</code>
file-mangling tool to wipe out unexpected references. But I was not sure
where should I plug this hammer: ideally any user of <code>nlohmann_json</code>
package should run it just in case. And patching files after-the-fast is
always prone to break something: be it broken file checksums, broken
sort ordering, unrelated string sharing due to identical code folding.</p>
<p>I wanted something milder: ideally tell <code>gcc</code> not to emit problematic
paths at all. And <code>gcc</code> provides exactly that mechanism!</p>
<p><code>gcc</code> has a way to slightly mangle absolute paths used by <code>__FILE__</code> via
<code>-fmacro-prefix-map=old=new</code> set of options: <a href="https://gcc.gnu.org/onlinedocs/gcc/Preprocessor-Options.html" class="uri">https://gcc.gnu.org/onlinedocs/gcc/Preprocessor-Options.html</a>.</p>
<p>It’s main use case is to untangle final binaries from the temporary
directory sources are built against:
<code>-fmacro-prefix-map=/tmp/autogenerated/foo=/usr/src/foo</code>.</p>
<p>The problem feels vaguely similar: we want to avoid any mention of
source directories in the final output. I tried to inject
<code>-fmacro-prefix-map=</code> for every single build input used by <code>nixpkgs</code> as:</p>
<pre class="diff"><code>--- a/pkgs/build-support/cc-wrapper/setup-hook.sh
+++ b/pkgs/build-support/cc-wrapper/setup-hook.sh
@@ -65,15 +65,29 @@
 # function is guaranteed to be exactly the same.
 ccWrapper_addCVars () {
     # See ../setup-hooks/role.bash
-    local role_post
+    local role_post mangled_store map_flag var
     getHostRoleEnvHook

+    var=NIX_CFLAGS_COMPILE${role_post}
+
     if [ -d "$1/include" ]; then
-        export NIX_CFLAGS_COMPILE${role_post}+=" -isystem $1/include"
+        export $var+=" -isystem $1/include"
     fi

     if [ -d "$1/Library/Frameworks" ]; then
-        export NIX_CFLAGS_COMPILE${role_post}+=" -iframework $1/Library/Frameworks"
+        export $var+=" -iframework $1/Library/Frameworks"
+    fi
+
+    # Try hard to avoid hardcoding of -dev outputs via __FILE__.
+    # THe typical examples are: asserts in nlohmann_json leaking into
+    # nix executable closure, asserts from lttng-ust leaking into
+    # pipewire.
+    mangled_store=$(printf "%s" "$1" | sed -e "s|$NIX_STORE/[a-z0-9]\{32\}-|$NIX_STORE/eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee-|g")
+    map_flag="-fmacro-prefix-map=$1=$mangled_store"
+    # As it's a long flag try hard not to introduce duplicates as
+    # environment gets exhausted otherwise for large packages like qemu.
+    if [[ ${!var-} != *" $map_flag"* ]]; then
+        export $var+=" $map_flag"
     fi
 }</code></pre>
<p>The change adds a bunch of compiler options on form of:</p>
<pre><code>-fmacro-prefix-map=/nix/store/5xih6daf5g3hpa0wc5vs2cgrhakn4s0j-nlohmann_json-3.11.2=/nix/store/eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee-nlohmann_json-3.11.2</code></pre>
<p>That way <code>__FILE__</code> values still have a reasonable form: we can see which
package they come from. But we can’t use the path in any meaningful way
as they don’t refer real files anymore. Does not look too bad.</p>
<p>I also had to avoid attempts at inserting multiple identical values into
the final <code>NIX_CFLAGS_COMPILE</code> variable. Otherwise some particularly
large packages like <code>qemu</code> exhaust all the environment space and fail to
run executables. We might want to do the same for the rest of variables:
we add most of the options options thrice to <code>NIX_CFLAGS_COMPILE</code>.</p>
<p>Proposed the workaround as <a href="https://github.com/NixOS/nixpkgs/pull/255192" class="uri">https://github.com/NixOS/nixpkgs/pull/255192</a>.</p>
<h2 id="the-results">The results</h2>
<p>Did the workaround help? I placed the patch into <code>~/n</code> checkout of
<code>nixpkgs</code> repo.</p>
<pre><code>$ LANG=C grep -R $(nix-build --no-link ~/n -A nlohmann_json) $(nix-build --no-link ~/n -A nix.out)
<empty></code></pre>
<p>Yay! At the very least we got rid of the unneeded reference.</p>
<p><code>nix</code> was a good example of a superfluous and small harmless dependency.</p>
<p>Are there packages where this change would be more pronounced? My
ulterior motive was to fix something similar in <code>lttng-ust.dev</code>. That
one was not just a set of headers. It also contains python interpreter
as a dependency:</p>
<pre><code>$ nix path-info -r $(nix-build '<nixpkgs>' -A lttng-ust.dev) |& unnix
/<<NIX>>/libunistring-1.1
/<<NIX>>/libidn2-2.3.4
/<<NIX>>/xgcc-12.3.0-libgcc
/<<NIX>>/glibc-2.37-8
/<<NIX>>/zlib-1.2.13
/<<NIX>>/sqlite-3.42.0
/<<NIX>>/numactl-2.0.16
/<<NIX>>/lttng-ust-2.13.1
/<<NIX>>/expat-2.5.0
/<<NIX>>/ncurses-6.4
/<<NIX>>/liburcu-0.14.0
/<<NIX>>/xz-5.4.4
/<<NIX>>/openssl-3.0.10
/<<NIX>>/bzip2-1.0.8
/<<NIX>>/libffi-3.4.4
/<<NIX>>/libxcrypt-4.4.36
/<<NIX>>/liburcu-0.14.0-dev
/<<NIX>>/gcc-12.3.0-libgcc
/<<NIX>>/gcc-12.3.0-lib
/<<NIX>>/readline-8.2p1
/<<NIX>>/bash-5.2-p15
/<<NIX>>/mailcap-2.1.53
/<<NIX>>/gdbm-1.23
/<<NIX>>/tzdata-2023c
/<<NIX>>/python3-3.10.12
/<<NIX>>/lttng-ust-2.13.1-bin
/<<NIX>>/lttng-ust-2.13.1-dev</code></pre>
<p>One of the frequent users of <code>lttng-ust.dev</code> is <code>pipewire</code>:</p>
<pre><code>$ nix path-info -r $(nix-build '<nixpkgs>' -A pipewire.out) | fgrep lttng
/nix/store/23jh1m6irhvr16zjmrvy2cnpgz7yi6gj-lttng-ust-2.13.1
/nix/store/xznqsvr1la1xnfnzia2yvnicfz03yjqb-lttng-ust-2.13.1-bin
/nix/store/fckk2ncjnxdw1xsx8v8rxjnmhldbx8pr-lttng-ust-2.13.1-dev</code></pre>
<p>And I’m glad to announce that it’s also gone with the patch above:</p>
<pre><code>$ nix path-info -r $(nix-build ~/n -A pipewire.out) | fgrep lttng
/nix/store/bv3i3qjphzgfzmmdhws9nhwz76qscy61-lttng-ust-2.13.1</code></pre>
<p>Here is the closure size difference:</p>
<p>Before:</p>
<pre><code>$ nix path-info -rsSh $(nix-build '<nixpkgs>' -A pipewire.out) | nl | tail -n1
   219  /nix/store/ff0w34nr807in3b1swmqklxy9g9v5hg9-pipewire-0.3.79 1.4M  543.4M</code></pre>
<p>After:</p>
<pre><code>$ nix path-info -rsSh $(nix-build ~/n -A pipewire.out) | nl | tail -n1
   207  /nix/store/hl4dffvc73nsh3zfbji0y7h9lcnrk14b-pipewire-0.3.79 1.3M  452.0M</code></pre>
<p>12 dependencies and ~90MB (<code>543.4</code> -> <code>452.0</code> reduction, ~20% of the whole
output) are just gone!</p>
<h2 id="parting-words">Parting words</h2>
<p>Looking at the runtime closure is always fun. There are many other
low-hanging fruits in <code>nixpkgs</code> to remove. Most of the time just adding
an extra <code>dev</code> output is enough to slim down the output.</p>
<p><code>__FILE__</code> is a tricky macro that makes <code>nix</code> builds to leak out
unnecessary references into final closure.</p>
<p><code>-fmacro-prefix-map=</code> seems to be a robust workaround for <code>__FILE__</code>
induced leaks. The flag is supported in both <code>gcc</code> and <code>clang</code> for quite
a while.</p>
<p>Even <code>nix</code> package itself had a redundant dependency in it’s final
closure.</p>
<p><a href="https://github.com/NixOS/nixpkgs/pull/255192" class="uri">https://github.com/NixOS/nixpkgs/pull/255192</a> should fix it in <code>nixpkgs</code>.</p>
<p>And for some packages like <code>pipewire</code> the closure size reduction is
substantial and is taking about 20% of all closure size. Not bad for a
single extra compiler option.</p>
<p>Have fun!</p>
</article>
<article>
<h1>32-bit file API strikes back</h1>
<p>2023-09-07T00:00:00Z</p>
<p>It was another day of me trying a new <code>gcc</code> version.</p>
<h2 id="the-problem">The problem</h2>
<p>This time build failure was in <code>gcc</code> itself:</p>
<pre><code>$ nix build -f. pkgsi686Linux.stdenv
       > /nix/store/bxvqx767s4gwry9km5c3cmflskmparyf-bootstrap-stage-xgcc-stdenv-linux/setup: line 167: type: install_name_tool: not found
       > preFixupLibGccPhase
       > stat: Value too large for defined data type
       For full logs, run 'nix log /nix/store/v3cr2nghg1s4bmm30r1vnq1124qqvv9m-xgcc-14.0.0.drv'.</code></pre>
<h2 id="debugging-the-failure">Debugging the failure</h2>
<p>The actual error here is <code>stat: Value too large for defined data type</code>.
While <code>type: install_name_tool: not found</code> is an unrelated distraction.</p>
<p>Note that this failure happens a bit earlier than <code>pkgsi686Linux.stdenv</code>
build itself. If we poke a bit around the action build failure happens
in <code>pkgsi686Linux.stdenv.__bootPackages.stdenv.__bootPackages.stdenv.__bootPackages.stdenv.cc.cc</code> package in early bootstrap phases.</p>
<p>From the error message failure happens in <code>preFixupLibGccPhase</code> phase.
Let’s have a look at its definition:</p>
<pre><code>$ nix repl '<nixpkgs>'
...
nix-repl> builtins.trace pkgsi686Linux.stdenv.__bootPackages.stdenv.__bootPackages.stdenv.__bootPackages.stdenv.cc.cc.preFixupLibGccPhase ""
trace: # move libgcc from lib to its own output (libgcc)
mkdir -p $libgcc/lib
mv    $lib/lib/libgcc_s.so      $libgcc/lib/
mv    $lib/lib/libgcc_s.so.1    $libgcc/lib/
ln -s $libgcc/lib/libgcc_s.so   $lib/lib/
ln -s $libgcc/lib/libgcc_s.so.1 $lib/lib/
patchelf --set-rpath "" $libgcc/lib/libgcc_s.so.1

""</code></pre>
<p>One of these commands did fail, not clear which one. Let’s add a bit of
debugging by adding <code>set -x</code> into the phase:</p>
<pre><code>$ nix develop --impure --expr 'with import ./. {};
  pkgsi686Linux.stdenv.__bootPackages.stdenv.__bootPackages.stdenv.__bootPackages.stdenv.cc.cc.overrideAttrs (oa: {
    preFixupLibGccPhase = "set -x\n" + oa.preFixupLibGccPhase; })'
$$ genericBuild
...
++ patchelf --set-rpath '' /home/slyfox/dev/git/nixpkgs/outputs/libgcc/lib/libgcc_s.so.1
stat: Value too large for defined data type</code></pre>
<p>Yay! It was the <code>patchelf</code> call! We can re-enter the environment and
poke a bit more at the environment:</p>
<pre><code>$ nix develop --impure --expr 'with import ...'
$$ patchelf --set-rpath '' $libgcc/lib/libgcc_s.so.1
stat: Value too large for defined data type</code></pre>
<p>The error is still there. What kind of <code>stat</code> call does <code>patchelf</code> use?</p>
<pre><code>$ which patchelf
/nix/store/i9v173g8a5wwi8i8fd2wmdyr8ix6mla1-bootstrap-tools/bin/patchelf

$ nm -DC /nix/store/i9v173g8a5wwi8i8fd2wmdyr8ix6mla1-bootstrap-tools/bin/patchelf |& fgrep stat
         U __xstat@GLIBC_2.0</code></pre>
<p>Note that this <code>patchelf</code> comes from <code>bootstrapTools</code>.
<code>pkgs/stdenv/linux/bootstrap-files/i686.nix</code> says it was updated last
time in <code>2019</code> (4 years ago).</p>
<p>For comparison currently built <code>patchelf</code> built on <code>i686</code> system does
use <code>stat64</code> call:</p>
<pre><code>$ nm -DC $(nix-build --no-link '<nixpkgs>' -A patchelf --argstr system i686-linux )/bin/patchelf |& fgrep stat
         U stat64@GLIBC_2.33</code></pre>
<p>And it runs the patch just fine:</p>
<pre><code>$$ $(nix-build --no-link '<nixpkgs>' -A patchelf --argstr system i686-linux)/bin/patchelf  --set-rpath '' $libgcc/lib/libgcc_s.so.1</code></pre>
<h2 id="refreshing-bootstrapfiles">Refreshing <code>bootstrapFiles</code></h2>
<p>The fix is as simple as regenerating <code>bootstrapFiles</code> for <code>i686</code>:</p>
<pre><code>$ nix-build '<nixpkgs/pkgs/stdenv/linux/make-bootstrap-tools.nix>' -A bootstrapFiles      --arg pkgs 'import <nixpkgs> { system = "i686-linux"; }'
/nix/store/713cyy66gkxqmi1wpdswd4llq1qzikr5-bootstrap-tools.tar.xz
/nix/store/cvdfhnwjbbfjbv6ibgcl8rz47giy771v-busybox</code></pre>
<p>I did not have to build anything. Hydra has it cached today.</p>
<p>We can point our seed binaries to freshly built version of those:</p>
<pre class="diff"><code>--- a/pkgs/stdenv/linux/bootstrap-files/i686.nix
+++ b/pkgs/stdenv/linux/bootstrap-files/i686.nix
@@ -1,12 +1,4 @@
 {
-  busybox = import <nix/fetchurl.nix> {
-    url = "http://tarballs.nixos.org/stdenv-linux/i686/4907fc9e8d0d82b28b3c56e3a478a2882f1d700f/busybox";
-    sha256 = "ef4c1be6c7ae57e4f654efd90ae2d2e204d6769364c46469fa9ff3761195cba1";
-    executable = true;
-  };
-
-  bootstrapTools = import <nix/fetchurl.nix> {
-    url = "http://tarballs.nixos.org/stdenv-linux/i686/c5aabb0d603e2c1ea05f5a93b3be82437f5ebf31/bootstrap-tools.tar.xz";
-    sha256 = "b9bf20315f8c5c0411679c5326084420b522046057a0850367c67d9514794f1c";
-  };
+  busybox = ./i686-linux/busybox;
+  bootstrapTools = ./i686-linux/bootstrap-tools.tar.xz;
 }</code></pre>
<p>Now <code>pkgsi686Linux.stdenv</code> builds just fine:</p>
<pre><code>$ nix build -f. pkgsi686Linux.stdenv</code></pre>
<p>Unfortunately the change is not usable for upstream as is: uploading new
bootstrap binaries is a strange rarely exercised process that requires
privileged user to upload tarballs to <code>s3</code>. Filed
<a href="https://github.com/NixOS/nixpkgs/issues/253274" class="uri">https://github.com/NixOS/nixpkgs/issues/253274</a> to do it correctly.</p>
<p>I would say it’s a waste of time to debug issues in outdated binaries
like that. The bootstrap tarballs should be updated at least every
<code>NixOS</code> release (every 6 months). Or more frequently :) Filed
<a href="https://github.com/NixOS/nixpkgs/issues/253713" class="uri">https://github.com/NixOS/nixpkgs/issues/253713</a> for that.</p>
<p>Periodic updates would also make tarballs more homogeneous across
architectures. Today we ship different <code>glibc</code> and <code>gcc</code> versions in
bootstrap tarballs which adds another dimension of bugs.</p>
<h2 id="why-did-upgrade-work">Why did upgrade work?</h2>
<p><code>patchelf</code> itself was fixed in 2016 (7 years ago) as
<a href="https://github.com/NixOS/patchelf/commit/a4d21661d510ccf7ff72bb0e4ccd3f087e9086ad" class="uri">https://github.com/NixOS/patchelf/commit/a4d21661d510ccf7ff72bb0e4ccd3f087e9086ad</a>:</p>
<pre class="diff"><code>--- a/src/Makefile.am
+++ b/src/Makefile.am
@@ -1,4 +1,4 @@
-AM_CXXFLAGS = -Wall -std=c++11
+AM_CXXFLAGS = -Wall -std=c++11 -D_FILE_OFFSET_BITS=64</code></pre>
<p>We just happened to pull in a fix for it along with <code>glibc</code> that
supports <code>stat64</code>.</p>
<h2 id="how-should-fix-usually-looks-like">How should fix usually looks like?</h2>
<p>Setting <code>-D_FILE_OFFSET_BITS=64</code> explicitly should be a safe workaround.</p>
<p><code>autoconf</code>-based systems usually use
<a href="https://www.gnu.org/software/autoconf/manual/autoconf-2.67/html_node/System-Services.html">AC_SYS_LARGEFILE</a>.
It should set both <code>_FILE_OFFSET_BITS</code> and <code>_LARGE_FILES</code> where needed
and also provides a nice <code>--disable-largefile</code> knob. Other build systems
have the equivalent or always enable it by default.</p>
<p><code>glibc</code> implements <code>stat</code> switch in <code>io/sys/stat.h</code> as:</p>
<pre class="c"><code>#ifndef __USE_FILE_OFFSET64
/* Get file attributes for FILE and put them in BUF.  */
extern int stat (const char *__restrict __file,
                 struct stat *__restrict __buf) __THROW __nonnull ((1, 2));
#else
# ifdef __USE_TIME_BITS64
extern int __REDIRECT_NTH (stat, (const char *__restrict __file,
                                  struct stat *__restrict __buf),
                                  __stat64_time64)
# endif
#else
extern int stat64 (const char *__restrict __file,
                   struct stat64 *__restrict __buf) __THROW __nonnull ((1, 2));
#endif</code></pre>
<p>(I skipped a bit of <code>#define</code>ery where <code>stat</code> gets redirected to
<code>__xstat</code>.)</p>
<p>The above hints that we will soon have a similar problem of switching
to 64-bit <code>time_t</code> on 32-bit systems.</p>
<h2 id="why-did-patchelf-fail-at-all">Why did <code>patchelf</code> fail at all?</h2>
<p>I hear you ask: “why did <code>patchelf</code> fail at all”? Is <code>libgcc.so</code> such a
large file by any definition? It’s size is unlikely to overflow 32 bits
(4GB). Why does <code>stat()</code> implementation matter here?</p>
<p>And you are right: <code>libgcc_s.so.1</code> is only 139KB large.</p>
<p>Here is the full structure <code>man 2 stat</code> knows about:</p>
<pre class="c"><code>struct stat {
    dev_t     st_dev;         /* ID of device containing file */
    ino_t     st_ino;         /* Inode number */
    mode_t    st_mode;        /* File type and mode */
    nlink_t   st_nlink;       /* Number of hard links */
    uid_t     st_uid;         /* User ID of owner */
    gid_t     st_gid;         /* Group ID of owner */
    dev_t     st_rdev;        /* Device ID (if special file) */
    off_t     st_size;        /* Total size, in bytes */
    blksize_t st_blksize;     /* Block size for filesystem I/O */
    blkcnt_t  st_blocks;      /* Number of 512B blocks allocated */

    /* Since Linux 2.6, the kernel supports nanosecond
       precision for the following timestamp fields.
       For the details before Linux 2.6, see NOTES. */

    struct timespec st_atim;  /* Time of last access */
    struct timespec st_mtim;  /* Time of last modification */
    struct timespec st_ctim;  /* Time of last status change */

#define st_atime st_atim.tv_sec      /* Backward compatibility */
#define st_mtime st_mtim.tv_sec
#define st_ctime st_ctim.tv_sec
};</code></pre>
<p><code>stat()</code> has to fill all the fields. It does not known which ones
userspace is going to need. The man page also tells us the overflowing
condition:</p>
<pre><code>ERRORS
...
   EOVERFLOW
          pathname  or fd refers to a file whose size, inode number,
          or number of blocks cannot be represented in, respectively,
          the types off_t, ino_t, or blkcnt_t.  This error can occur
          when, for example, an application compiled on a 32-bit
          platform without -D_FILE_OFFSET_BITS=64 calls stat() on a
          file whose size exceeds (1<<31)-1 bytes.</code></pre>
<p>Note that it’s 2GB limit and not a 4GB limit. And it is not just about
the file size. In my case it’s the inode number <code>ino_t st_ino;</code> field:</p>
<pre><code>$ ls -li foo
4404087433 -rw-r--r-- 1 slyfox users 0 Sep  7 09:25 foo</code></pre>
<p>Here inode number overflows our 2GB limit. Let’s use this trivial
program to make sure it fails to <code>stat()</code>:</p>
<pre class="c"><code>#include <errno.h>
#include <stdio.h>
#include <string.h>

#include <sys/stat.h>

int main() {
    struct stat s;
    int r;

    errno = 0;
    r = stat("foo", &s);

    if (r == -1)
        printf ("stat() = -1: error: %s\n", strerror(errno));
    else
        printf("stat() succeeded\n");
}</code></pre>
<p>Running it against both APIs:</p>
<pre><code>$ nix develop -f '<nixpkgs>' patchelf --argstr system i686-linux

$$ gcc a.c -o a && ./a
stat() = -1: error: Value too large for defined data type

$$ gcc a.c -o a -D_FILE_OFFSET_BITS=64 && ./a
stat() succeeded</code></pre>
<p>Yep! This is it.</p>
<h2 id="why-are-my-inode-numbers-so-big">Why are my inode numbers so big?</h2>
<p>4 billion inodes is a lot. Why such a big number? Do I have so many
files on disk? No, <code>find /</code> tells me I have around 25 million files
(~100 times smaller than 2 billion mark).</p>
<p>It comes down to the fact how exactly <code>btrfs</code> filesystem allocates inode
numbers.</p>
<p>Compared to <code>ext4</code> (which uses first available inode number in inode
table of fixed size) <code>btrfs</code> does not use a single inode table but uses
B-tree of “objects”.</p>
<p><code>btrfs</code> strategy to allocate inodes is to increment the global number
(per filesystem, number is populated in <code>btrfs_create_new_inode()</code>):</p>
<pre class="c"><code>int btrfs_create_new_inode(struct btrfs_trans_handle *trans,
                           struct btrfs_new_inode_args *args)
{
    // ...
    int ret;
    ret = btrfs_get_free_objectid(root, &objectid);
    if (ret)
        goto out;
    inode->i_ino = objectid;
    // ...
    return ret;
}
// ...
int btrfs_get_free_objectid(struct btrfs_root *root, u64 *objectid)
{
    int ret;
    // skipped locking and error handling
    *objectid = root->free_objectid++;
    ret = 0;
    return re;
}
// ...
#define BTRFS_FIRST_FREE_OBJECTID 256ULL
#define BTRFS_LAST_FREE_OBJECTID -256ULL
int btrfs_init_root_free_objectid(struct btrfs_root *root)
{
    int ret;
    // ...
    search_key.objectid = BTRFS_LAST_FREE_OBJECTID;
    search_key.type = -1;
    search_key.offset = (u64)-1;
    ret = btrfs_search_slot(NULL, root, &search_key, path, 0, 0);

    if (path->slots[0] > 0) {
        slot = path->slots[0] - 1;
        l = path->nodes[0];
        btrfs_item_key_to_cpu(l, &found_key, slot);
        root->free_objectid = max_t(u64, found_key.objectid + 1,
                                        BTRFS_FIRST_FREE_OBJECTID);
    } else {
        root->free_objectid = BTRFS_FIRST_FREE_OBJECTID;
    }
    ret = 0;
    return ret;
}</code></pre>
<p>In the code above <code>btrfs</code> literally increments <code>root->free_objectid</code> as
a way to generate new inode number. On fresh filesystems inode numbers
for files and directories start from <code>256</code> (<code>BTRFS_FIRST_FREE_OBJECTID</code>).
On used filesystem they start from the next after largest already
allocated inode.</p>
<p>Note that file removal does not normally reclaim the inode numbers.
Let’s poke a bit at it in action:</p>
<pre><code># create empty btrfs filesystem:
$ fallocate -l 10G fs.raw
$ mkfs.btrfs fs.raw
$ mkdir m
$ mount fs.raw m
$ cd m

# first file on disk:
$ ls -li first
257 -rw-r--r-- 1 root root 0 Sep  7 15:10 first

# second file on disk:
$ ls -li first
258 -rw-r--r-- 1 root root 0 Sep  7 15:10 first</code></pre>
<p>Despite the same file name being deleted and recreated in place it’s
inode number increases.</p>
<p>There is one exception to “always increasing” rule: if we delete files
with highest inode numbers and unmount/remount the filesystem we will be
able to unwind <code>free_objectid</code> back a bit:</p>
<pre><code># remount empty and try again:
$ rm first

$ cd ..
$ umount m
$ mount fs.raw m
$ cd m

$ touch first
$ ls -li first
257 -rw-r--r-- 1 root root 0 Sep  7 15:11 first</code></pre>
<p>Note: after the remount the inode number is back to <code>257</code> (and not <code>259</code>).
<code>256</code> inode is taken by <code>/</code> root directory.</p>
<p>Back to the question why my filesystem has inode numbers above 4
billion: apparently I managed to create that many files throughout the
lifetime of this filesystem. It’s a 2 years old <code>btrfs</code>. This means
filesystem sees about 70 files per second being created and deleted.</p>
<h2 id="more-failures">More failures</h2>
<p>After fixing <code>patchelf</code> locally I tried to build more <code>i686</code> packages
(mainly <code>wine</code> dependencies) and discovered a few more similar failures.</p>
<p>One of them was in <code>which</code> command:</p>
<pre><code>$ bison
bison: missing operand
Try 'bison --help' for more information.

$ which bison
which: no bison in (... long list of PATHs here, one of them with `bison`)</code></pre>
<p>If we look <code>bison</code> up manually it’s there:</p>
<pre><code>$ for p in ${PATH//:/ }; do [ -f $p/bison ] && ls -li $p/bison; done
4386192903 -r-xr-xr-x 2 root root 678408 Jan  1  1970 /nix/store/mf37crpkvz388nmqqvkbnmvp21663w26-bison-3.8.2/bin/bison</code></pre>
<p>Proposed <code>which</code> fix for <code>nixpkgs</code> as
<a href="https://github.com/NixOS/nixpkgs/pull/253382" class="uri">https://github.com/NixOS/nixpkgs/pull/253382</a> and upstream as
<a href="https://github.com/CarloWood/which/pull/1" class="uri">https://github.com/CarloWood/which/pull/1</a>.</p>
<p><code>which</code> fix allowed <code>i686</code> to progress a bit more and now it stumbled on
<code>fontconfig</code> and <code>tpm2-tss</code>. To be debugged.</p>
<h2 id="parting-words">Parting words</h2>
<p>32-bit file APIs are not just about handling of files larger that 4GB
in size. Nowadays’ filesystems can easily have other fields that don’t
fit into 32-bit counters. One of them is inode number. Next in the queue
will probably be 64-bit <code>time_t</code>.</p>
<p>The 64-bit interfaces are opt-in for many 32-bit targets and will remain
such for the foreseeable future. Each individual project will have to
adapt to it by adding <code>-D_FILE_OFFSET_BITS=64</code> (and soon
<code>-D_TIME_BITS=64</code>).</p>
<p>While projects gradually migrate to new APIs <code>bootstrapTools</code> should be
rebuilt to get the updates. I hope some form of
<a href="https://github.com/NixOS/nixpkgs/issues/253713" class="uri">https://github.com/NixOS/nixpkgs/issues/253713</a> process will be in
place to make it smoother. Otherwise one-off
<a href="https://github.com/NixOS/nixpkgs/issues/253274" class="uri">https://github.com/NixOS/nixpkgs/issues/253274</a> update will have to do.</p>
<p>If you see a project that still uses 32-bit APIs please send a patch
upstream to use 64-bit API if possible. Chances are it will fix real
breakage on filesystems with 64-bit inodes.</p>
<p>Have fun!</p>
</article>
<article>
<h1>gcc-14 bugs, pile 2</h1>
<p>2023-09-06T00:00:00Z</p>
<p>Two more months of <code>gcc-14</code> development have passed. The bug rate slowed
down considerably. Let’s look at the new bugs since. This time it is
mere <code>10</code> bugs.</p>
<h2 id="summary">Summary</h2>
<p>Here are the bugs themselves:</p>
<ul>
<li><a href="https://gcc.gnu.org/PR110652">tree-optimization/110652</a>: bootstrap failure for <code>-Werror</code>.</li>
<li><a href="https://gcc.gnu.org/PR110697">middle-end/110697</a>: bootstrap failure for <code>-Werror</code>.</li>
<li><a href="https://gcc.gnu.org/PR110726">tree-optimization/110726</a>: wrong code on <code>llvm</code>: <code>a |= a == 0</code>.</li>
<li><a href="https://gcc.gnu.org/PR110790">target/110787</a>: wrong code on <code>gmp</code></li>
<li><a href="https://gcc.gnu.org/PR110838">tree-optimization/110838</a>: wrong code on <code>x365</code></li>
<li><a href="https://gcc.gnu.org/PR110880">target/110880</a>: <code>aarch64</code> ICE on <code>highway-1.0.5</code></li>
<li><a href="https://gcc.gnu.org/PR111009">tree-optimization/111009</a>: wrong code on <code>perf</code> from <code>linux</code></li>
<li><a href="https://gcc.gnu.org/PR111048">tree-optimization/111048</a>: wrong code on <code>highway-1.0.6</code></li>
<li><a href="https://gcc.gnu.org/PR111051">target/111051</a>: <code>gcc</code> fails to use <code>avx2</code> primitives
on <code>avx512</code> targeted functions on <code>highway-1.0.6</code></li>
<li><a href="https://gcc.gnu.org/PR111060">target/111060</a>: <code>i686-linux</code> <code>gcc</code> fails to build
it’s own <code>libbstdc++</code> due to missing <code>Float16</code> support.</li>
</ul>
<p>Histograms time!</p>
<p>Type of bugs:</p>
<ul>
<li><code>wrong-code</code>: 5</li>
<li><code>other</code>: 4</li>
<li><code>ICEs</code>: 1</li>
</ul>
<p>This was an unusual cycle: most of the bugs were related to the wrong
code generated by <code>gcc</code>. Some of them (like <code>gmp</code> and <code>perf</code> ones) were
very nasty to extract and report.</p>
<p>Subsystems:</p>
<ul>
<li><code>tree-optimization</code>: 5</li>
<li><code>target</code>: 4</li>
<li><code>middle-end</code>: 1</li>
</ul>
<p>As usual half the bugs are lurking in <code>tree-optimization</code> part of the
compiler. Though this time <code>target</code>-specific bugs are frequent as well.</p>
<h2 id="a-note-on-highway">A note on highway</h2>
<p>3 out of 10 bugs were exposed by a <code>highway</code> library. It’s such a great
vectorizer benchmark for <code>gcc</code>!</p>
<p>Initially I was a bit upset by the amount of C++ template indirections
<code>highway</code> uses on it’s implementation.</p>
<p>As it exercises all the possible vector units available for the platform
(<code>SSE{2,3,4}</code>, <code>AVX{1,2}</code> and many others) it was hard for me to narrow
down to a single failing unit type: attempts to remove code for one of
them fails the build for others.</p>
<p>But! I found an easy way to deal with it! We can disable most irrelevant
targets in <code>hwy/detect_targets.h</code> by adding all the irrelevant targets
to <code>HWY_BROKEN_TARGETS</code> define. I usually add everything except the
broken target (usually <code>HWY_AVX2</code> :).</p>
<p>And then reduction becomes trivial: I expand all the templates and get
the simple function with a failure.</p>
<h2 id="parting-words">Parting words</h2>
<p><code>gcc</code> <code>master</code> still yields new bugs. It is still quite a bit of fun to
get and extract build compiler failures from real world usage.</p>
<p>I liked <a href="https://gcc.gnu.org/PR111048" class="uri">https://gcc.gnu.org/PR111048</a> the most. There <code>gcc</code> generated
wrong code for <code>highway-1.0.6</code> library in <code>-mavx2</code> mode.</p>
<p>When handling the following loop:</p>
<pre class="c"><code>u8 in_lanes[32];
for (unsigned i = 0; i < 32; i += 2) {
  in_lanes[i + 0] = 0;
  in_lanes[i + 1] = ((u8)0xff) >> (i & 7);
}</code></pre>
<p><code>gcc</code> broke <code>i = 12;</code> iteration and instead of <code>i[13] = 0xf;</code>
(<code>0xff >> 4</code>) stored something like <code>i[13] = 0xef;</code> there. <code>gcc</code> was
able to fully constant-fold the loop (and did it slightly incorrectly).</p>
<p>Have fun!</p>
</article>
</main></body></html>