scaling `ghc --make`
Tl;DR
To speed up the build of a haskell package, use:

$ ghc --make -j +RTS -A256M -qb0 -RTS <your-args>
Story mode
In the previous post I've shown a tiny hack on how to make sense out of `ghc` using `perf`. My actual goal was to get a basic understanding of why `ghc --make` does not scale perfectly. The scalability problem is known as GHC trac #9221.
Why `ghc --make -j` scalability matters
I care about `ghc --make` speed because I build a lot of haskell packages as part of gentoo-haskell overlay maintenance. My workflow to add or update a package is simple:

- quickly build the new version of a package (or packages) against my existing haskell world (~1500 globally installed haskell packages) and run its tests
- install the new version in the system and rebuild the rest of the haskell packages against it

Installation and rebuild usually happen overnight. The quick build is a relatively interactive phase. It fails frequently as packages require some tweaks along the way, like upper bound fixes or minor API changes.
I use various hacks to speed up quick builds (example below):

- build with optimisations disabled: `-O0`
- build in parallel `ghc --make -j` mode
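For example, a quick interactive build combining both might look like this (a sketch; the `src/*.hs` module list is hypothetical):

$ ghc --make -O0 -j src/*.hs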
The symptoms
My machine is an `x86_64` desktop (4x2 HT cores on a single die). Nothing fancy. Should be an easy target to get perfect parallelism. Right?
At first I chose the highlighting-kate package as it has at least 100 independent modules for various languages. The `ghc --make -j<N>` histogram looks like this:

The problem is that even on my machine `ghc --make -j3` works better than `ghc --make -j8` and I could not get more than a 1.5x speedup. I'd like to see an 8x speedup. Or at least 4x :)
Another problem is quick performance degradation once you exceed the number of CPUs. Running `ghc --make -j20` on my machine makes things seriously worse: compilation time is longer than in non-parallel mode! That did not look good.
The clues
So why are things so bad? And more importantly, how would you diagnose a similar issue in another haskell program? The problem looked like a nice opportunity to get basic experience in this area.
In comment #9 Gintas found one of the problems with large `<N>`, but I was not able to understand his reasoning at the time as I knew too little about the GHC runtime.
Simon suggested larger heap sizes and allocation areas. I tried it and got better performance numbers! The best numbers were at `-A256M`: a 2.5x speedup! But what is the real bottleneck here? How would you estimate the number in the general case?
The bug contained a lot of speculation about lock contention of sorts but no simple cause of the slowness.
Down the rabbit hole
I decided to try to find out what happens there. Maybe someone else will have an AHA! moment and will fix things for everyone if I manage to find that useless `sleep(1)` somewhere :)
`perf`, take 1
I tried to run `perf record` on my benchmark to figure out what happens within the process. `perf` suggested GHC did mostly `sched_yield()`, which does not sound like a useful call.
`threadscope`, take 1
Simon also suggested trying ThreadScope on GHC itself. In theory it's quite simple. You need to relink `ghc-stage2` with `-eventlog`:

# mk/build.mk
GhcStage2HcOpts += -rtsopts -eventlog
Run GHC with event logging enabled as `ghc --make ... +RTS -l`. Then running `threadscope ghc-stage2.eventlog` should show us a nice graph of all sorts of interesting things.
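A minimal session might look like this (a sketch; the `src/*.hs` arguments are hypothetical):

$ ghc-stage2 --make src/*.hs +RTS -l -RTS   # writes ghc-stage2.eventlog
$ threadscope ghc-stage2.eventlog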
Alas, threadscope crashed on the generated eventlog file. I first thought it was a threadscope bug and filed one. After threadscope got a fix it still could not read my eventlog profiles. I misattributed the problem to another ghc bug #9003 where GHC accidentally broke the eventlog ABI by changing enum values for existing events in one of the GHC releases.
After bug #9003 got fixed my dumps still did not render! I filed a third ghc bug #9384 which basically says that calling `setNumCapabilities` in your haskell program usually breaks your eventlog.
I didn't think of trying to fix it. In the process of finding the trigger I found a workaround: if you pass the RTS option `+RTS -N<N>` at program startup with the same value you later give to `setNumCapabilities`, it works just fine! Running `ghc --make -j<N> +RTS -N<N> -l` will not change GHC behavior except that it will produce readable eventlogs!
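A minimal sketch of the workaround (my own toy program, not GHC code): the program calls `setNumCapabilities`, and starting it with a matching `+RTS -N<N>` keeps the eventlog readable:

-- Toy.hs: build with `ghc -threaded -rtsopts -eventlog Toy.hs`,
-- then run as `./Toy +RTS -N8 -l` so -N matches the call below
import Control.Concurrent (setNumCapabilities)

main :: IO ()
main = do
  setNumCapabilities 8  -- resizing capabilities here is what used to corrupt the log
  putStrLn "parallel work goes here"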
I looked at the eventlog output and did not know what to look at. Later Alexander Vershilov fixed that bug with 2 lines of C code. Tl;DR: GHC did not handle dynamic resizing of the number of capabilities. Thank you!
`perf`, take 2
Half a year later many more things had happened:

- Peter's DWARF support for GHC landed and I found a few bugs in it: ghc bug #10655 where `-g` breaks rewrite rules, and ghc bug #10667 where `-g` breaks GNU as and produces a broken assembly file
- I mastered the black magic of porting GHC to weird architectures: basically I built an unregisterised GHC on amd64 and made it pass most tests. Some details are in this blog post.

I've started Seeing Things!
I attempted to use `perf`, found out why `perf` can't disassemble haskell functions, and wrote another blog post on how to make it see the code (still needs more things to be upstreamed).
`perf` also suggested that one of the most popular user space functions being called was `stg_BLACKHOLE_info()`.
Ad-hoc `rts` patching
Blackholes are a special kind of thunk that usually denotes a thunk being evaluated by another haskell thread. When a normal (closure) thunk is being evaluated by a thread, the thunk can be updated at one of two points in time: either before (eager) or after (lazy) the actual thunk evaluation. By default GHC uses the lazy strategy (unless `-feager-blackholing` is used). That means the same thunk can be evaluated by multiple execution threads. But that is fine for a typical pure program.
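Here is a toy sketch of the duplicated work (mine, and timing-dependent, so treat it as an illustration rather than a reliable reproducer): with the default lazy blackholing both threads may enter the shared thunk before either one claims it, so the trace can fire twice; with `-feager-blackholing` it should fire once.

-- build with -threaded, run with +RTS -N2
import Control.Concurrent (forkIO, newEmptyMVar, putMVar, takeMVar)
import Debug.Trace (trace)

main :: IO ()
main = do
  -- a single shared thunk both threads race to evaluate
  let shared = trace "evaluated!" (sum [1 .. 10000000 :: Int])
  d1 <- newEmptyMVar
  d2 <- newEmptyMVar
  _ <- forkIO (shared `seq` putMVar d1 ())
  _ <- forkIO (shared `seq` putMVar d2 ())
  takeMVar d1
  takeMVar d2
  print shared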
It's not fine for sensitive things like `unsafePerformIO` thunks, which are guaranteed to be evaluated once. For those sensitive thunks GHC uses blackholes: on entry the closure type is atomically rewritten to a BLACKHOLE thunk which points to the thread (TSO) that locks the thunk. When any other thread tries to enter the blackhole it blocks on that thunk. When the first thread finishes the computation it rewrites the TSO pointed to by the blackhole to the actual result of the computation and resumes the second thread. The second thread picks up the result of the evaluation and continues execution.
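The classic source of such sensitive thunks is the global-variable idiom (a sketch of the common pattern, not GHC code):

import Data.IORef (IORef, newIORef, readIORef)
import System.IO.Unsafe (unsafePerformIO)

-- NOINLINE keeps `counter` a single shared thunk; the BLACKHOLE locking
-- described above is what guarantees newIORef runs only once even if
-- several threads force the thunk simultaneously
{-# NOINLINE counter #-}
counter :: IORef Int
counter = unsafePerformIO (newIORef 0)

main :: IO ()
main = readIORef counter >>= print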
Thus my assumption was that the main scalability problem was a large number of `unsafePerformIO` calls causing high haskell thread rescheduling churn.
I tried to modify the cmm code to print the object pointed to by a blackhole, but most of the time those are not TSO objects but `I#` (integers), labels and other things. GHC indeed uses `unsafePerformIO` for its global unique string pool. Those `I#` are likely a sign of it (see `mkSplitUniqSupply`).
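The supply pattern looks roughly like this (a simplified sketch of the idea, not GHC's actual `mkSplitUniqSupply`): a lazy infinite tree built under `unsafeInterleaveIO`, so forcing a node produces exactly the kind of boxed `I#` thunks seen above:

import Data.IORef (IORef, newIORef, atomicModifyIORef')
import System.IO.Unsafe (unsafeInterleaveIO, unsafePerformIO)

-- an infinite tree of unique Ints; each node is an interleaved thunk
-- drawn lazily from the shared counter
data Supply = Supply Int Supply Supply

{-# NOINLINE source #-}
source :: IORef Int
source = unsafePerformIO (newIORef 0)

mkSupply :: IO Supply
mkSupply = unsafeInterleaveIO $ do
  n <- atomicModifyIORef' source (\x -> (x + 1, x))
  l <- mkSupply
  r <- mkSupply
  return (Supply n l r)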
I tried to recompile GHC with the `-feager-blackholing` option enabled. GHC became slightly slower in the single-threaded case (expected), slightly slower in the multi-threaded case (unexpected) and did not degrade as fast as the base case at large `-j<N>` (totally unexpected!).
It means that GHC evaluates some expensive things more than once. How would we find which ones? GHC needs something in the rts or in the eventlog to expose stats for:

- blackhole synchronization (who stalls other threads?)
- things needlessly evaluated more than once (what wastes CPU/heap?)
speeding up parallel GC: `-qb0`
The more I played with various GHC changes the more I realized my machine is not good enough to expose the bottlenecks. I decided to pick a 24-core virtual machine and gave threadscope another chance. This time haskell threads were frequently stopped by garbage collector threads. And the garbage collector CPU load was unevenly balanced across cores!
I started reading the parallel GC paper by Simon which describes in detail how things work. Basically there are two phases of parallel GC:

- a non-work-stealing phase, when each GC thread scans its own Capability's TSO stacks and allocation area (aka generation 0, g0, eden, nursery)
- a work-stealing phase, when each GC thread either processes its queue of from-space pointers and copies things to to-space, or steals work from other threads' queues
GHC as a compiler does not take much heap to compile a file. A huge allocation area (`-A256M` in my case) is enough to compile an average file. Older generations (g1+) don't have much to scan because there is nothing to put there!
It's fun that `+RTS -sstderr` showed me that GC work balance number all along and I did not even notice it:

Parallel GC work balance: 22.34% (serial 0%, perfect 100%) # -qb1, default
...
Parallel GC work balance: 74.48% (serial 0%, perfect 100%) # -qb0
...
Parallel GC work balance: 80.03% (serial 0%, perfect 100%) # -qb0
The bottleneck is the allocation area scan. The GHC RTS happens to have a knob to enable work stealing on g0: the `-qb0` option.
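Combined with the large allocation area, that is exactly the Tl;DR invocation:

$ ghc --make <your-args> -j +RTS -A256M -qb0 -RTS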
It was a magic knob that increased GHC parallelism from a 2.5x to a 7x speedup on the 24-core VM! I got a similar effect on my 8-CPU desktop, where the speedup went from 2.5x to 3.5x.
Two pictures of building highlighting-kate on the 24-core VM, without and with `-qb0`:

A 20-second to 9-second shrink. The first picture shows that GHC struggles to load half the CPUs at peak, while the second eats all 24 at peak and then tails off on slow modules. GREEN is mutator CPU usage, ORANGE is GC CPU usage.
It also became obvious that having more GC threads than available CPUs always hurts performance.
That's what Gintas meant by the `setNumCapabilities` discrepancy. The first fix was obvious. It does not fix the normal case but at least it does not make things worse when `ghc --make -j<N>` is higher than the number of CPUs available.
To improve the normal case GHC now auto-tunes the parallel scan based on allocation area size.
multithreading vs. multiprocessing
Threadscope also showed that the load imbalance on highlighting-kate is high due to uneven workload across haskell threads. Basically, not all highlighting modules are equal in size. Which is obvious in hindsight. So much for a perfect benchmark :)
I've written a synthetic benchmark `synth.bash` which generates a lot of independent haskell sources of even size and measures their build time:
#!/bin/bash

MODULES=128
FIELDS=100

GHC=ghc
# override with an in-tree build:
GHC=~/dev/git/ghc-perf/inplace/bin/ghc-stage2

rm -rf src/
mkdir -p src

# generate ${MODULES} independent modules of even size
for m in `seq 1 ${MODULES}`; do
    {
        echo "module M${m} where"
        echo "data D = D0"
        for f in `seq 1 ${FIELDS}`; do
            echo " | D${f} { f${f} :: Int}"
        done
        echo " deriving (Read, Show, Eq, Ord)"
    } > src/M${m}.hs
done

#perf record -- \
time \
    $GHC \
        -hide-all-packages -package=base \
        \
        --make src/*.hs -j +RTS -A256M -RTS "$@"
It's nice to be able to tweak the numbers to adapt to small and large machines. It's also interesting to compare with multiprocess GHC using the following `Makefile`:
OBJECTS := $(patsubst %.hs,%.o,$(wildcard src/*.hs))

all: $(OBJECTS)

src/%.o: src/%.hs
	~/dev/git/ghc-perf/inplace/bin/ghc-stage2 -c +RTS -A256M -RTS $< -o $@

clean:
	$(RM) $(OBJECTS)

.PHONY: clean
Note the difference: here we run a separate `ghc -c` process on each `.hs` file and let GNU make do the scheduling.
This benchmark raises even more questions! It shows (see comment #65) that multiprocess compilation scales perfectly and is fastest at `make -j<N>` where N is the number of CPUs. It's even faster than `ghc --make -j` (both on the 8-CPU and the 24-CPU machine).
It means there are still a few issues to understand here :)
This benchmark also shows the best performance I should expect from GHC on these machines. The multiprocess (Makefile-based) benchmark is expected to have ideal scalability. If it does not, the bottleneck is somewhere outside GHC (kernel / RAM bandwidth / etc.).
4x2 HT desktop:
$ make clean; time make -j1
real 1m2.561s
user 0m56.523s
sys 0m5.560s
$ make clean; time make -j4
real 0m18.936s
user 1m7.549s
sys 0m6.857s
$ make clean; time make -j8
real 0m15.964s
user 1m52.058s
sys 0m9.929s
It's 3.93x (almost exactly 4x). Almost ideal saturation of the 4 physical cores.
12x2 HT VM:
$ make clean; time make -j1
real 1m33.147s
user 1m20.836s
sys 0m11.556s
$ make clean; time make -j12
real 0m10.537s
user 1m36.276s
sys 0m16.948s
$ make clean; time make -j24
real 0m7.336s
user 2m15.936s
sys 0m19.004s
This is a 12.7x speedup. Better-than-ideal saturation of the 12 physical cores.
Note how confusing the user time is in the -j1 / -j12 / -j24 cases. It claims the CPU did twice as much work. But it should not, as I throw the same amount of work at the CPUs in every run. It's one of the confusing parts of hyperthreading: the operating system lies to you about how much work was actually done.
`perf`, take 3
Slicing and dicing the metrics reported by `perf`, I tried to get an idea of where the overhead appears in the multi-threaded case.
I noticed that GHC works way faster when compiled with the `-fno-worker-wrapper -fno-spec-constr` flags. It instantly sped GHC up by 10% on the synth benchmark and made the GHC bootstrap 5% faster.
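In build.mk terms that would be (my assumption: the same knob style as the `-eventlog` tweak earlier in this post):

# mk/build.mk
GhcStage2HcOpts += -fno-worker-wrapper -fno-spec-constr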
GHC managed to inflate this tiny snippet:
cmmExprNative :: ReferenceKind -> CmmExpr -> CmmOptM CmmExpr
cmmExprNative referenceKind expr = do
     dflags <- getDynFlags
     let platform = targetPlatform dflags
         arch = platformArch platform
     case expr of
         CmmLit (CmmLabel lbl)
            -> do
                 cmmMakeDynamicReference dflags referenceKind lbl
into the following code:
_______
│ cmmExprNative :: ReferenceKind -> CmmExpr -> CmmOptM CmmExpr
│ cmmExprNative referenceKind expr = do
0,11 │ cmp $0x3,%rax
│ ↑ jb 3ceb930 <cFO7_info+0x8b0>
│ -- we must convert block Ids to CLabels here, because we
│ -- might have to do the PIC transformation. Hence we must
│ -- not modify BlockIds beyond this point.
│
│ CmmLit (CmmLabel lbl)
│ -> do
2,02 │ add $0x890,%r12
│ cmp 0x358(%r13),%r12
│ ↑ ja 3cf456f <cFIc_info+0x7df>
0,16 │ mov 0x7(%rbx),%rax
0,59 │ lea ghc_DynFlags_DynFlags_con_info,%rbx
0,05 │ mov %rbx,-0x888(%r12)
3,41 │18e9: mov 0x50(%rsp),%rbx
0,05 │ mov %rbx,-0x880(%r12)
0,32 │ mov 0x58(%rsp),%r14
│ mov %r14,-0x878(%r12)
│ mov 0x60(%rsp),%rbx
│ mov %rbx,-0x870(%r12)
0,05 │ mov 0x68(%rsp),%r14
│ mov %r14,-0x868(%r12)
│ mov 0x70(%rsp),%rbx
│ mov %rbx,-0x860(%r12)
│ mov 0x78(%rsp),%r14
0,11 │ mov %r14,-0x858(%r12)
0,05 │ mov 0x80(%rsp),%rbx
│ mov %rbx,-0x850(%r12)
0,05 │ mov 0x88(%rsp),%r14
│ mov %r14,-0x848(%r12)
│ mov 0x90(%rsp),%rbx
│ mov %rbx,-0x840(%r12)
0,05 │ mov 0x98(%rsp),%r14
0,05 │ mov %r14,-0x838(%r12)
0,11 │ mov 0xa0(%rsp),%rbx
│ mov %rbx,-0x830(%r12)
│ mov 0xa8(%rsp),%r14
│ mov %r14,-0x828(%r12)
0,05 │ mov 0xb0(%rsp),%rbx
│ mov %rbx,-0x820(%r12)
│ mov 0xb8(%rsp),%r14
... <a few more pages of it>
That appeared to be a bug: the specialiser was too eager to move strict fields out of structs into separate function arguments (aka worker arguments), even if that means pulling out 180 arguments. The optimisation should be controlled by the `-fmax-worker-args=<N>` flag, but the flag got lost when the demand analyser was rewritten a few years ago.
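A toy illustration of the blow-up (mine, with six strict fields instead of 180): the worker/wrapper split turns a function strict in a big record into a worker that takes one argument per strict field, which is exactly what `-fmax-worker-args=<N>` is supposed to cap:

-- imagine ~180 strict fields, as in DynFlags
data Big = Big !Int !Int !Int !Int !Int !Int

-- sumBig's generated worker receives all six fields as separate
-- (unboxed) arguments after the worker/wrapper split
sumBig :: Big -> Int
sumBig (Big a b c d e f) = a + b + c + d + e + f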
I noticed it by accident when looking through the `DynFlags` structure in GHC: many of the flags were unused (including `-fmax-worker-args=<N>`). That gave rise to ghc bug #11565.
Final result
On my 8-CPU desktop the final improvement is 3.72x (22 seconds vs. 82 seconds):
$ ./synth.bash -j1 +RTS -sstderr
100,430,153,096 bytes allocated in the heap
3,970,134,600 bytes copied during GC
145,432,032 bytes maximum residency (16 sample(s))
1,639,792 bytes maximum slop
643 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 363 colls, 0 par 7.065s 7.060s 0.0194s 0.0496s
Gen 1 16 colls, 0 par 1.485s 1.484s 0.0928s 0.1705s
TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)
SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.002s ( 0.002s elapsed)
MUT time 64.242s ( 73.611s elapsed)
GC time 8.550s ( 8.544s elapsed)
EXIT time 0.012s ( 0.015s elapsed)
Total time 72.821s ( 82.172s elapsed)
Alloc rate 1,563,303,909 bytes per MUT second
Productivity 88.3% of total user, 89.6% of total elapsed
gc_alloc_block_sync: 0
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0
real 1m22.213s
user 1m21.061s
sys 0m1.139s
$ ./synth.bash +RTS -sstderr
...
100,608,799,232 bytes allocated in the heap
3,959,638,248 bytes copied during GC
173,982,704 bytes maximum residency (8 sample(s))
2,947,048 bytes maximum slop
2556 MB total memory in use (1 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 53 colls, 53 par 15.480s 2.055s 0.0388s 0.0767s
Gen 1 8 colls, 7 par 3.943s 0.558s 0.0697s 0.0897s
Parallel GC work balance: 75.64% (serial 0%, perfect 100%)
TASKS: 19 (1 bound, 18 peak workers (18 total), using -N8)
SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.002s ( 0.002s elapsed)
MUT time 131.375s ( 19.329s elapsed)
GC time 19.423s ( 2.613s elapsed)
EXIT time 0.016s ( 0.018s elapsed)
Total time 150.830s ( 21.962s elapsed)
Alloc rate 765,815,750 bytes per MUT second
Productivity 87.1% of total user, 88.1% of total elapsed
gc_alloc_block_sync: 370037
whitehole_spin: 0
gen[0].sync: 564078
gen[1].sync: 520393
real 0m22.068s
user 2m45.042s
sys 0m4.318s
On the 24-CPU VM the speedup is around 8x (9 seconds versus 75 seconds). This suggests GC takes only 12% of elapsed time in both cases. I'm not sure if the 200M allocation difference can be attributed to throw-away work done by haskell threads.
CPU MUT and GC times show a 2x increase. It's not the throw-away work but a hyperthreading artifact.
Conclusions
I have not solved the problem yet. And I don't fully understand it either! But I feel we are on the right track :) Some achievements:
- `ghc --make` is slightly faster now!
- Now I (and you!) understand haskell runtime performance a bit better
- we've fixed a few seemingly unrelated bugs :)
- `eventlog` works in programs calling `setNumCapabilities`
- `threadscope` can read GHC traces
- a large `-A` size auto-tunes parallel GC parameters
- `ghc --make` does not create more Capabilities than CPUs available
- we've learned hyperthreading is tricky :)
How you can improve GHC:

- GHC is full of low-hanging performance improvements and they are easy to fix :)
- haskell profiling tools need an improvement to see where BLACKHOLEs come from (I suspect most of them come from the `mkSplitUniqSupply` function)
- `perf` is still too hard to use on GHC-built programs
Have fun!