Refactor derived Generic instances to reduce allocations
ClosedPublic

Authored by RyanGlScott on Jun 5 2016, 12:20 PM.

Details

Summary

Previously, derived implementations of to/from in Generic
instances were wastefully putting extra M1s in every case, which led
to an O(n) increase in the number of coercions, resulting in a slowdown
during the typechecker phase.

This factors out the common M1 in every case of a to/from
definition so that the typechecker has far fewer coercions to deal with.
For a datatype with 300 constructors, this change has been observed to
save almost 3 seconds of compilation time.

This is one step towards coming up with a solution for Trac #5642.

Test Plan

./validate

Diff Detail

Repository
rGHC Glasgow Haskell Compiler
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.
RyanGlScott updated this revision to Diff 7858.Jun 5 2016, 12:20 PM
RyanGlScott retitled this revision from to Refactor derived Generic instances to reduce allocations.
RyanGlScott updated this object.
RyanGlScott edited the test plan for this revision. (Show Details)
RyanGlScott added reviewers: simonpj, hvr, austin, bgamari.
RyanGlScott updated the Trac tickets for this revision.
RyanGlScott added subscribers: osa1, carter, basvandijk.
hvr awarded a token.Jun 5 2016, 12:21 PM
RyanGlScott added a comment.EditedJun 5 2016, 12:26 PM

Note that this Diff is incomplete, since I probably changed the allocation numbers for test T5642 (a perf test). My machine gives very different numbers than what all.T expects, and Harbormaster is still down, so what is the recommended way to figure out what allocation numbers to put in all.T?

Also, I'm a bit confused as to why the Generic instance in test T5642 is manually implemented. As a result of this, I had to change the code in T5642.hs itself to match what would be generated by a derived Generic instance. Couldn't we just use deriving Generic instead?

RyanGlScott updated this revision to Diff 7859.Jun 5 2016, 3:19 PM
RyanGlScott edited edge metadata.

Update expected allocations for perf tests

RyanGlScott updated this revision to Diff 7860.Jun 5 2016, 3:19 PM
RyanGlScott edited edge metadata.

Update expected allocations for perf tests

simonpj edited edge metadata.Jun 6 2016, 2:48 AM

I like it.

Could you go further, and build a tree of pattern matches, so as to combine all the L1 matches from the first half of the constructors and the R1 matches for the second half? After all, you have to build the tree virtually in any case, in order to build the L1/R1 nest!

Also: is there a wiki page, or overview Note, describing the Generic deriving mechanism, as implemented? What we have in GHC is diverging, for good reasons, from the paper, and I'd love to have a place to look. I find it a bit scary.

bgamari accepted this revision.Jun 6 2016, 4:32 AM
bgamari edited edge metadata.

Nice! Quite remarkable how much of a speed-up this yielded. Thanks for the note.

This revision is now accepted and ready to land.Jun 6 2016, 4:32 AM

I like it.

Could you go further, and build a tree of pattern matches, so as to combine all the L1 matches from the first half of the constructors and the R1 matches for the second half? After all, you have to build the tree virtually in any case, in order to build the L1/R1 nest!

That's an idea I've briefly played around with. As a quick example, I changed the code in the T5642 perf test to have this to definition:

to (M1 x) = case x of
   L1 a -> case a of
     L1 b -> case b of
       L1 c -> case c of
         L1 d -> case d of
           L1 e -> case e of
             L1 f -> case f of
               M1 U1 -> C0
             R1 f -> case f of
               L1 g -> case g of
                 M1 U1 -> C1
               R1 g -> case g of
                 M1 U1 -> C2
           R1 e -> case e of
             L1 f -> case f of
               M1 U1 -> C3
             R1 f -> case f of
               L1 g -> case g of
                 M1 U1 -> C4
               R1 g -> case g of
                 M1 U1 -> C5
         R1 d -> case d of
           L1 e -> case e of
             L1 f -> case f of
               M1 U1 -> C6
             R1 f -> case f of
               L1 g -> case g of
                 M1 U1 -> C7
               R1 g -> case g of
                 M1 U1 -> C8
           R1 e -> case e of
             L1 f -> case f of
               M1 U1 -> C9
             R1 f -> case f of
               L1 g -> case g of
                 M1 U1 -> C10
               R1 g -> case g of
                 M1 U1 -> C11
       R1 c -> case c of
         L1 d -> case d of
           L1 e -> case e of
             L1 f -> case f of
               M1 U1 -> C12
             R1 f -> case f of
               L1 g -> case g of
                 M1 U1 -> C13
               R1 g -> case g of
                 M1 U1 -> C14
           R1 e -> case e of
             L1 f -> case f of
               M1 U1 -> C15
             R1 f -> case f of
               L1 g -> case g of
                 M1 U1 -> C16
               R1 g -> case g of
                 M1 U1 -> C17
         R1 d -> case d of
           L1 e -> case e of
             L1 f -> case f of
               M1 U1 -> C18
             R1 f -> case f of
               L1 g -> case g of
                 M1 U1 -> C19
               R1 g -> case g of
                 M1 U1 -> C20
           R1 e -> case e of
             L1 f -> case f of
               L1 g -> case g of
                 M1 U1 -> C21
               R1 g -> case g of
                 M1 U1 -> C22
             R1 f -> case f of
               L1 g -> case g of
                 M1 U1 -> C23
               R1 g -> case g of
                 M1 U1 -> C24
     R1 b -> case b of
       L1 c -> case c of
         L1 d -> case d of
           L1 e -> case e of
             L1 f -> case f of
               M1 U1 -> C25
             R1 f -> case f of
               L1 g -> case g of
                 M1 U1 -> C26
               R1 g -> case g of
                 M1 U1 -> C27
           R1 e -> case e of
             L1 f -> case f of
               M1 U1 -> C28
             R1 f -> case f of
               L1 g -> case g of
                 M1 U1 -> C29
               R1 g -> case g of
                 M1 U1 -> C30
         R1 d -> case d of
           L1 e -> case e of
             L1 f -> case f of
               M1 U1 -> C31
             R1 f -> case f of
               L1 g -> case g of
                 M1 U1 -> C32
               R1 g -> case g of
                 M1 U1 -> C33
           R1 e -> case e of
             L1 f -> case f of
               M1 U1 -> C34
             R1 f -> case f of
               L1 g -> case g of
                 M1 U1 -> C35
               R1 g -> case g of
                 M1 U1 -> C36
       R1 c -> case c of
         L1 d -> case d of
           L1 e -> case e of
             L1 f -> case f of
               M1 U1 -> C37
             R1 f -> case f of
               L1 g -> case g of
                 M1 U1 -> C38
               R1 g -> case g of
                 M1 U1 -> C39
           R1 e -> case e of
             L1 f -> case f of
               M1 U1 -> C40
             R1 f -> case f of
               L1 g -> case g of
                 M1 U1 -> C41
               R1 g -> case g of
                 M1 U1 -> C42
         R1 d -> case d of
           L1 e -> case e of
             L1 f -> case f of
               M1 U1 -> C43
             R1 f -> case f of
               L1 g -> case g of
                 M1 U1 -> C44
               R1 g -> case g of
                 M1 U1 -> C45
           R1 e -> case e of
             L1 f -> case f of
               L1 g -> case g of
                 M1 U1 -> C46
               R1 g -> case g of
                 M1 U1 -> C47
             R1 f -> case f of
               L1 g -> case g of
                 M1 U1 -> C48
               R1 g -> case g of
                 M1 U1 -> C49
   R1 a -> case a of
     L1 b -> case b of
       L1 c -> case c of
         L1 d -> case d of
           L1 e -> case e of
             L1 f -> case f of
               M1 U1 -> C50
             R1 f -> case f of
               L1 g -> case g of
                 M1 U1 -> C51
               R1 g -> case g of
                 M1 U1 -> C52
           R1 e -> case e of
             L1 f -> case f of
               M1 U1 -> C53
             R1 f -> case f of
               L1 g -> case g of
                 M1 U1 -> C54
               R1 g -> case g of
                 M1 U1 -> C55
         R1 d -> case d of
           L1 e -> case e of
             L1 f -> case f of
               M1 U1 -> C56
             R1 f -> case f of
               L1 g -> case g of
                 M1 U1 -> C57
               R1 g -> case g of
                 M1 U1 -> C58
           R1 e -> case e of
             L1 f -> case f of
               M1 U1 -> C59
             R1 f -> case f of
               L1 g -> case g of
                 M1 U1 -> C60
               R1 g -> case g of
                 M1 U1 -> C61
       R1 c -> case c of
         L1 d -> case d of
           L1 e -> case e of
             L1 f -> case f of
               M1 U1 -> C62
             R1 f -> case f of
               L1 g -> case g of
                 M1 U1 -> C63
               R1 g -> case g of
                 M1 U1 -> C64
           R1 e -> case e of
             L1 f -> case f of
               M1 U1 -> C65
             R1 f -> case f of
               L1 g -> case g of
                 M1 U1 -> C66
               R1 g -> case g of
                 M1 U1 -> C67
         R1 d -> case d of
           L1 e -> case e of
             L1 f -> case f of
               M1 U1 -> C68
             R1 f -> case f of
               L1 g -> case g of
                 M1 U1 -> C69
               R1 g -> case g of
                 M1 U1 -> C70
           R1 e -> case e of
             L1 f -> case f of
               L1 g -> case g of
                 M1 U1 -> C71
               R1 g -> case g of
                 M1 U1 -> C72
             R1 f -> case f of
               L1 g -> case g of
                 M1 U1 -> C73
               R1 g -> case g of
                 M1 U1 -> C74
     R1 b -> case b of
       L1 c -> case c of
         L1 d -> case d of
           L1 e -> case e of
             L1 f -> case f of
               M1 U1 -> C75
             R1 f -> case f of
               L1 g -> case g of
                 M1 U1 -> C76
               R1 g -> case g of
                 M1 U1 -> C77
           R1 e -> case e of
             L1 f -> case f of
               M1 U1 -> C78
             R1 f -> case f of
               L1 g -> case g of
                 M1 U1 -> C79
               R1 g -> case g of
                 M1 U1 -> C80
         R1 d -> case d of
           L1 e -> case e of
             L1 f -> case f of
               M1 U1 -> C81
             R1 f -> case f of
               L1 g -> case g of
                 M1 U1 -> C82
               R1 g -> case g of
                 M1 U1 -> C83
           R1 e -> case e of
             L1 f -> case f of
               M1 U1 -> C84
             R1 f -> case f of
               L1 g -> case g of
                 M1 U1 -> C85
               R1 g -> case g of
                 M1 U1 -> C86
       R1 c -> case c of
         L1 d -> case d of
           L1 e -> case e of
             L1 f -> case f of
               M1 U1 -> C87
             R1 f -> case f of
               L1 g -> case g of
                 M1 U1 -> C88
               R1 g -> case g of
                 M1 U1 -> C89
           R1 e -> case e of
             L1 f -> case f of
               M1 U1 -> C90
             R1 f -> case f of
               L1 g -> case g of
                 M1 U1 -> C91
               R1 g -> case g of
                 M1 U1 -> C92
         R1 d -> case d of
           L1 e -> case e of
             L1 f -> case f of
               M1 U1 -> C93
             R1 f -> case f of
               L1 g -> case g of
                 M1 U1 -> C94
               R1 g -> case g of
                 M1 U1 -> C95
           R1 e -> case e of
             L1 f -> case f of
               L1 g -> case g of
                 M1 U1 -> C96
               R1 g -> case g of
                 M1 U1 -> C97
             R1 f -> case f of
               L1 g -> case g of
                 M1 U1 -> C98
               R1 g -> case g of
                 M1 U1 -> C99

I compiled that with -v3 +RTS -s and compared it against what is currently in T5642.hs (after this Diff). Surprisingly, the number of types and coercions are exactly the same. However, it did seem to save on allocations.

Here's before that change:

     902,483,936 bytes allocated in the heap
     257,873,992 bytes copied during GC
      39,535,120 bytes maximum residency (10 sample(s))
       1,200,944 bytes maximum slop
              85 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0       248 colls,     0 par    0.257s   0.257s     0.0010s    0.0120s
  Gen  1        10 colls,     0 par    0.227s   0.227s     0.0227s    0.0507s

  TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.001s  (  0.001s elapsed)
  MUT     time    0.677s  (  0.696s elapsed)
  GC      time    0.484s  (  0.484s elapsed)
  EXIT    time    0.018s  (  0.018s elapsed)
  Total   time    1.197s  (  1.199s elapsed)

  Alloc rate    1,333,330,307 bytes per MUT second

  Productivity  59.5% of total user, 59.4% of total elapsed

gc_alloc_block_sync: 0
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0

And here's after that change:

     855,759,664 bytes allocated in the heap
     204,160,752 bytes copied during GC
      29,375,336 bytes maximum residency (9 sample(s))
       1,210,312 bytes maximum slop
              75 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0       234 colls,     0 par    0.235s   0.235s     0.0010s    0.0122s
  Gen  1         9 colls,     0 par    0.180s   0.180s     0.0200s    0.0455s

  TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.001s  (  0.001s elapsed)
  MUT     time    0.687s  (  0.705s elapsed)
  GC      time    0.414s  (  0.415s elapsed)
  EXIT    time    0.018s  (  0.017s elapsed)
  Total   time    1.137s  (  1.138s elapsed)

  Alloc rate    1,246,491,310 bytes per MUT second

  Productivity  63.5% of total user, 63.4% of total elapsed

gc_alloc_block_sync: 0
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0

Ultimately, not that much faster in terms of compilation time, but there is evidence that it saves some allocations. Perhaps this effect would be more pronounced on a larger datatype. I'll explore this later.

Also: is there a wiki page, or overview Note, describing the Generic deriving mechanism, as implemented? What we have in GHC is diverging, for good reasons, from the paper, and I'd love to have a place to look. I find it a bit scary.

I've started a new section in the GenericDeriving wiki page detailing the list of differences between the paper A Generic Deriving Mechanism for Haskell and what GHC generics does, which gives lip service to a section on compilation performance hacks.

I think you are saying that

  • Commoning up the outer M1 has a huge effect on compile time. (You say 3s but you don't say what percentage that is.)
  • Commoning up all the rest has a rather minor effect

That's very surprising! Can you give a pair of side-by-side examples, without deriving, that shows this effect?

@simonpj, I decided to test this hypothesis on the 300-constructor sum type mentioned in Trac #5642. I manually implemented a Generic instance for this datatype three times, and put each one in its own file:

  • Gen_v1.hs, which contains a Generic instance as GHC derives it currently (without this Diff's changes)
  • Gen_v2.hs, which is like Gen_v1.hs except that it factors out the topmost M1 in from/to
  • Gen_v3.hs, which is like Gen_v1.hs except that it both (1) factors out the topmost M1 in from/to and (2) factors out common occurrences of L1/R1 in to as per your suggestion

I compiled each file with ghc -O2 -v3 +RTS -s and dumped the results to logs. Here are the highlights:

  • Gen_v1.txt
    • {terms: 16,282, types: 2,563,921, coercions: 639,012}
    • 7,781,716,432 bytes allocated in the heap
    • Total time 8.708s ( 8.719s elapsed)
  • Gen_v2.txt
    • {terms: 16,288, types: 2,924,492, coercions: 9,950}
    • 4,479,400,544 bytes allocated in the heap
    • Total time 5.580s ( 5.590s elapsed)
  • Gen_v3.txt
    • {terms: 16,288, types: 2,924,492, coercions: 9,950}
    • 4,016,934,848 bytes allocated in the heap
    • Total time 4.990s ( 5.006s elapsed)

There is a huge difference between v1 and v2, as suspected. There is a difference between v2 and v3 in that it allocated fewer bytes on the heap, but interestingly, v3 has the exact same number of types and coercions, so I'm not sure where the improvement is coming from.

Wow that is AMAZING.

Ryan, could you open a Trac ticket, attaching the files (or links) and the stats you give?

And, one last favour, do some a very small version of v1/v2, with say two or three constructors. I'm looking for a program with manageable size that I can eyeball, but where there's a big increase in the number of coercions. Or does the effect disappear altogether in small programs? Is it linear in the program size or non-linear?

Thanks!

Ryan, could you open a Trac ticket, attaching the files (or links) and the stats you give?

I've copied my above comment to the original Trac ticket here.

And, one last favour, do some a very small version of v1/v2, with say two or three constructors. I'm looking for a program with manageable size that I can eyeball, but where there's a big increase in the number of coercions. Or does the effect disappear altogether in small programs? Is it linear in the program size or non-linear?

Sure thing. I redid the same experiment with a much smaller datatype:

data Bigsum = C0 | C1 | C2 | C3

And posted the results here. Here's the code for:

  • Gen_v1.hs (what you get when deriving Generic)
  • Gen_v2.hs (like Gen_v1.hs, but factoring out the common M1 at the top)
  • Gen_v3.sh (like Gen_v2.hs, but factoring out common L1s/R1s

Here are the highlights:

  • Gen_v1.txt
    • {terms: 143, types: 562, coercions: 239}
    • 94,739,056 bytes allocated in the heap
    • Total time 0.201s ( 0.191s elapsed)
  • Gen_v2.txt
    • {terms: 148, types: 638, coercions: 174}
    • 93,979,880 bytes allocated in the heap
    • Total time 0.196s ( 0.187s elapsed)
  • Gen_v3.txt
    • {terms: 148, types: 638, coercions: 174}
    • 94,281,768 bytes allocated in the heap
    • Total time 0.203s ( 0.192s elapsed)

Even at this scale, v2 has visible improvements over v1, since it saves about 800 KB in heap allocations, and saves a little sliver of compilation time. Curiously, v3 actually does worse than v2 in this little example. Again, I'm not sure what's going on there, but I do expect v3 to be asymptotically better than v2, since we go from an O(n) number of L1s/R1s to an O(log n) number.

This revision was automatically updated to reflect the committed changes.