CmmSink: Use a IntSet instead of a list
ClosedPublic

Authored by alexbiehl on Oct 31 2017, 5:23 PM.

Details

Summary

CmmProcs which have *lots* of local variables take a considerable
amount of time in CmmSink. This was noticed by @tdammers in Trac #7258
while compiling files with large records (~200-400 fields).

Before:

        Sun Oct 29 19:58 2017 Time and Allocation Profiling Report  (Final)

           ghc-stage2 +RTS -p -RTS -B/Users/alexbiehl/git/ghc/inplace/lib /Users/alexbiehl/Downloads/W2.hs -fforce-recomp -O2

        total time  =       26.00 secs   (25996 ticks @ 1000 us, 1 processor)
        total alloc = 14,921,627,912 bytes  (excludes profiling overheads)

COST CENTRE     MODULE      SRC                                                 %time %alloc

sink            CmmPipeline compiler/cmm/CmmPipeline.hs:(104,13)-(105,59)        55.7   15.9
SimplTopBinds   SimplCore   compiler/simplCore/SimplCore.hs:761:39-74            19.5   30.6
FloatOutwards   SimplCore   compiler/simplCore/SimplCore.hs:471:40-66             4.2    9.0
RegAlloc-linear AsmCodeGen  compiler/nativeGen/AsmCodeGen.hs:(658,27)-(660,55)    4.0   11.1
pprNativeCode   AsmCodeGen  compiler/nativeGen/AsmCodeGen.hs:(529,37)-(530,65)    2.8    6.3
NewStranal      SimplCore   compiler/simplCore/SimplCore.hs:480:40-63             1.6    3.7
OccAnal         SimplCore   compiler/simplCore/SimplCore.hs:(739,22)-(740,67)     1.5    3.5
StgCmm          HscMain     compiler/main/HscMain.hs:(1426,13)-(1427,62)          1.2    2.4
regLiveness     AsmCodeGen  compiler/nativeGen/AsmCodeGen.hs:(591,17)-(593,52)    1.2    1.9
genMachCode     AsmCodeGen  compiler/nativeGen/AsmCodeGen.hs:(580,17)-(582,62)    0.9    1.8
NativeCodeGen   CodeOutput  compiler/main/CodeOutput.hs:171:18-78                 0.9    2.1
CoreTidy        HscMain     compiler/main/HscMain.hs:1253:27-67                   0.8    1.9

After:

        Sun Oct 29 19:18 2017 Time and Allocation Profiling Report  (Final)

           ghc-stage2 +RTS -p -RTS -B/Users/alexbiehl/git/ghc/inplace/lib /Users/alexbiehl/Downloads/W2.hs -fforce-recomp -O2

        total time  =       13.31 secs   (13307 ticks @ 1000 us, 1 processor)
        total alloc = 15,772,184,488 bytes  (excludes profiling overheads)

COST CENTRE     MODULE         SRC                                                 %time %alloc

SimplTopBinds   SimplCore      compiler/simplCore/SimplCore.hs:761:39-74            38.3   29.0
sink            CmmPipeline    compiler/cmm/CmmPipeline.hs:(104,13)-(105,59)        13.2   20.3
RegAlloc-linear AsmCodeGen     compiler/nativeGen/AsmCodeGen.hs:(658,27)-(660,55)    8.3   10.5
FloatOutwards   SimplCore      compiler/simplCore/SimplCore.hs:471:40-66             8.1    8.5
pprNativeCode   AsmCodeGen     compiler/nativeGen/AsmCodeGen.hs:(529,37)-(530,65)    5.4    5.9
NewStranal      SimplCore      compiler/simplCore/SimplCore.hs:480:40-63             3.1    3.5
OccAnal         SimplCore      compiler/simplCore/SimplCore.hs:(739,22)-(740,67)     2.9    3.3
StgCmm          HscMain        compiler/main/HscMain.hs:(1426,13)-(1427,62)          2.3    2.3
regLiveness     AsmCodeGen     compiler/nativeGen/AsmCodeGen.hs:(591,17)-(593,52)    2.1    1.8
NativeCodeGen   CodeOutput     compiler/main/CodeOutput.hs:171:18-78                 1.7    2.0
genMachCode     AsmCodeGen     compiler/nativeGen/AsmCodeGen.hs:(580,17)-(582,62)    1.6    1.7
CoreTidy        HscMain        compiler/main/HscMain.hs:1253:27-67                   1.4    1.8
foldNodesBwdOO  Hoopl.Dataflow compiler/cmm/Hoopl/Dataflow.hs:(397,1)-(403,17)       1.1    0.8

Diff Detail

Repository
rGHC Glasgow Haskell Compiler
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.
alexbiehl created this revision.Oct 31 2017, 5:23 PM
alexbiehl updated the Trac tickets for this revision.Oct 31 2017, 5:23 PM
duog added a subscriber: duog.Oct 31 2017, 9:19 PM

Nice speedup. Allocations increased by a lot though; I suppose that this is due to the UniqSet overhead. Note that UniqSet LocalReg is a newtype of IntMap LocalReg, and you only ever add to the set and test for membership; so perhaps it would be worthwhile to use an IntSet directly here.

duog added a comment.Oct 31 2017, 9:20 PM

Also, SimplTopBinds is using substantially more time than allocations in your second profile, so I'd wager it's also doing something silly with lists.

Good idea @duog. I will try an IntSet instead. I don't expect huge saving though.

Using IntSet

        Wed Nov  1 06:52 2017 Time and Allocation Profiling Report  (Final)

           ghc-stage2 +RTS -p -RTS -B/Users/alexbiehl/git/ghc/inplace/lib -O2 -fforce-recomp /Users/alexbiehl/Downloads/W2.hs

        total time  =       14.87 secs   (14865 ticks @ 1000 us, 1 processor)
        total alloc = 15,318,689,952 bytes  (excludes profiling overheads)

COST CENTRE     MODULE      SRC                                                 %time %alloc

SimplTopBinds   SimplCore   compiler/simplCore/SimplCore.hs:761:39-74            43.8   29.8
sink            CmmPipeline compiler/cmm/CmmPipeline.hs:(104,13)-(105,59)        10.4   18.1
FloatOutwards   SimplCore   compiler/simplCore/SimplCore.hs:471:40-66             9.4    8.8
RegAlloc-linear AsmCodeGen  compiler/nativeGen/AsmCodeGen.hs:(658,27)-(660,55)    6.9   10.8
pprNativeCode   AsmCodeGen  compiler/nativeGen/AsmCodeGen.hs:(529,37)-(530,65)    5.1    6.1
OccAnal         SimplCore   compiler/simplCore/SimplCore.hs:(739,22)-(740,67)     3.2    3.4
NewStranal      SimplCore   compiler/simplCore/SimplCore.hs:480:40-63             2.6    3.6
StgCmm          HscMain     compiler/main/HscMain.hs:(1426,13)-(1427,62)          2.0    2.4
regLiveness     AsmCodeGen  compiler/nativeGen/AsmCodeGen.hs:(591,17)-(593,52)    2.0    1.8
genMachCode     AsmCodeGen  compiler/nativeGen/AsmCodeGen.hs:(580,17)-(582,62)    1.5    1.7
NativeCodeGen   CodeOutput  compiler/main/CodeOutput.hs:171:18-78                 1.4    2.0
CoreTidy        HscMain     compiler/main/HscMain.hs:1253:27-67                   1.4    1.9
alexbiehl updated this revision to Diff 14529.Nov 1 2017, 12:58 AM
  • Use IntSet instead of UniqueSet
In D4145#115781, @duog wrote:

Nice speedup. Allocations increased by a lot though; I suppose that this is due to the UniqSet overhead. Note that UniqSet LocalReg is a newtype of IntMap LocalReg, and you only ever add to the set and test for membership; so perhaps it would be worthwhile to use an IntSet directly here.

If we only ever add to the set and test for membership, then maybe a Bloom filter could help?

alexbiehl added a comment.EditedNov 1 2017, 5:37 AM

This could work. The size of skipped is bounded by the number of assignments so we could easily pre-allocate the Bloomfilter. tryToInline needs to run in ST to allow in-place updates. We need to be careful here: Bloomfilter allow for false positives. A Bloomfilter might indicate a LocalVariable is in skipped although it isn't! This brings the possibility of not inlining something which could in fact be inlined. We could invert the relation though so that we trigger a linear scan over skipped in case we are not sure.

This would be quite some work though. Maybe we should keep that in mind and first go with the approach in this differential.

This could work. The size of skipped is bounded by the number of assignments so we could easily pre-allocate the Bloomfilter. tryToInline needs to run in ST to allow in-place updates. We need to be careful here: Bloomfilter allow for false positives. A Bloomfilter might indicate a LocalVariable is in skipped although it isn't! This brings the possibility of not inlining something which could in fact be inlined. We could invert the relation though so that we trigger a linear scan over skipped in case we are not sure.

I am aware of this, so yes, we would still have to run a set membership test or list scan in the cases where we're not sure, and we would have to decide which case (inline or not inline) we want to optimize for.

This would be quite some work though. Maybe we should keep that in mind and first go with the approach in this differential.

Agree; this wouldn't be trivial.

Also, further testing seems to indicate that with this change in place, sink is not the bottleneck anymore, so maybe it's not worth the effort at all.

alexbiehl retitled this revision from CmmSink: Use a UniqSet instead of a list to CmmSink: Use a IntSet instead of a list.Nov 1 2017, 7:35 AM

Nice! Can you do also a nofib run and compare compile time/allocation? Also do the perf/compiler tests indicate any difference? (Harbourmaster builds haven't finished yet)

bgamari accepted this revision.Nov 2 2017, 10:48 AM

This looks reasonable to me. I've pushed this to wip/D4145 so that gipeda can test it.

This revision is now accepted and ready to land.Nov 2 2017, 10:48 AM

The Gipeda results are in.

The fact that both runs improve overall compile time by over 0.5% is certainly encouraging. It's interesting how much some of the testsuite allocations fluctuate by (e.g. T9233).

duog added a comment.Nov 2 2017, 2:51 PM

The fact that both runs improve overall compile time by over 0.5% is certainly encouraging. It's interesting how much some of the testsuite allocations fluctuate by (e.g. T9233).

I wonder how much of that .5% is improvement in DynFlags in stage2, I would guess all of it!

It looks like the first first run was still using UniqSet, so a bit more allocation is what one would expect in T9233.

I'm still not exactly sure how commit messages are derrived from Diffs, but the commit message should show the updated profiling report rather than the one currently in the summary.

This revision was automatically updated to reflect the committed changes.