Don't inline String literals or CallStacks
Needs Revision · Public

Authored by gridaphobe on Sep 20 2015, 8:43 PM.

Details

Reviewers
simonpj
hvr
nomeata
bgamari
austin
Trac Issues
#10844
Summary

This patch avoids inlining String literals and CallStacks by immediately floating them to top-level binders in the desugarer.

The only knock-on effect is that DsGblEnv has a new field to track extra top-level binders we may want to add. It's only used to track String literals and CallStacks at the moment. The new function DsUtils.bindExprAtTopLevel takes an Expr and attempts to create a new top-level binder for it (attempts, because we may be compiling a simple expression, in which case the notion of a top-level binder makes no sense). If successful it returns a Var that is bound to the Expr, otherwise it returns the Expr unmodified. The new function DsMonad.withTopBinds initializes the DsGblEnv to track additional top-level binders and returns any added binders alongside the result of the provided action.
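The shape of this mechanism can be pictured with a toy model. This is only a sketch of the idea, not GHC's actual DsM code: the Expr type, the ds-prefixed fresh names, and both functions below are invented for illustration. Literals in an expression are replaced by Vars bound at a simulated top level, and the new bindings come back alongside the rewritten body, as withTopBinds does.

```haskell
-- Toy expression language standing in for Core (illustration only).
data Expr
  = Var String
  | Lit String          -- a string literal we want to float
  | App Expr Expr
  deriving (Eq, Show)

type TopBinds = [(String, Expr)]

-- Analogue of bindExprAtTopLevel: replace each literal with a Var
-- referring to a fresh top-level binding. The Int threads a supply
-- of fresh names (ds0, ds1, ...).
floatLits :: Int -> Expr -> (Int, TopBinds, Expr)
floatLits n (Lit s) =
  let v = "ds" ++ show n
  in (n + 1, [(v, Lit s)], Var v)
floatLits n (App f x) =
  let (n1, bs1, f') = floatLits n  f
      (n2, bs2, x') = floatLits n1 x
  in (n2, bs1 ++ bs2, App f' x')
floatLits n e = (n, [], e)

-- Analogue of withTopBinds: run the rewrite and return the extra
-- top-level bindings alongside the result.
withTopBinds :: Expr -> (Expr, TopBinds)
withTopBinds e = let (_, bs, e') = floatLits 0 e in (e', bs)
```

On App (Var "f") (Lit "hello") this yields App (Var "f") (Var "ds0") together with the binding ("ds0", Lit "hello"), mirroring how the desugarer floats the literal so that unfoldings never carry it inline.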

nofib indicates a substantial reduction in both binary size and allocations with this patch.

--------------------------------------------------------------------------------
        Program           Size    Allocs   Runtime   Elapsed  TotalMem
--------------------------------------------------------------------------------
            Min         -15.6%    -10.2%     -4.1%    -13.4%    -20.0%
            Max          +1.3%     +0.8%     +7.1%     +9.3%     +6.7%
 Geometric Mean         -14.9%     -0.3%     +1.9%     +1.3%     -0.1%
Test Plan
  • validate (new test in testsuite/tests/deSugar/should_compile/T10844.hs)
  • inspect nofib benchmarks to see if the binary sizes have gone down

Diff Detail

Repository
rGHC Glasgow Haskell Compiler
Branch
wip/D1259 (branched from arcpatch-D1259)
Lint
Lint Warnings (Excuse: it's okay)
Severity   Location                            Code   Message
Warning    compiler/deSugar/Desugar.hs:144     TXT3   Line Too Long
Warning    compiler/deSugar/DsBinds.hs:1204    TXT3   Line Too Long
Unit
No Unit Test Coverage
Build Status
Buildable 14504
Build 22929: [GHC] Linux/amd64: Patch building
Build 22928: [GHC] OSX/amd64: Continuous Integration
Build 22927: [GHC] Windows/amd64: Continuous Integration
Build 22926: arc lint + arc unit
simonpj added inline comments. Oct 12 2015, 4:24 AM
compiler/deSugar/Desugar.hs
124

This withExtraBinds stuff is inconsistent with the handling of 'msgs'. I'd make it all part of initDs, which can return the extra bindings.

Or maybe withExtraBinds :: DsM a -> DsM (a, [CoreBind]), if that seems better. But I don't like all this IORef-reading stuff right here at top level

compiler/deSugar/DsUtils.hs
583

A poor name. Something like bindExprAtTopLevel or something would be better. Needs a comment saying that it either returns its argument expression, or (Var v), where there's a new top-level binding for v

593

I'd much prefer NOT to mark it NOINLINE. GHC should not inline stupid things. And I think it's already careful about not gratuitously inlining strings.

compiler/typecheck/TcRnTypes.hs
382

I'm ok with the Maybe, but please add a careful Note explaining what this field does, and how it is used, with examples. Perhaps call it ds_top_binds, or something stressing that this is a bunch of top-level bindings. That makes it clearer that it's not appropriate when compiling an Expr, since there is no top level.

libraries/base/Data/OldList.hs
1140

I think it may be better with your change. The string "" is actually a bit of a complicated thing to match on (lots of unpack nonsense)! Whereas [] is nice and inert.

Better check that ("" :: String) does turn into [] very early in compilation.

Worth a Note to explain the use of []

austin requested changes to this revision. Oct 13 2015, 12:53 AM
austin edited edge metadata.

(Marking as Request Changes pending response to @simonpj's comments.)

This revision now requires changes to proceed. Oct 13 2015, 12:53 AM
gridaphobe updated this revision to Diff 8359. Jul 31 2016, 2:41 PM
gridaphobe retitled this revision from Don't inline CallStacks to Don't inline String literals or CallStacks.
gridaphobe updated this object.
gridaphobe edited edge metadata.

Address Simon's comments.

There are still a couple benchmarks with increased allocations (cse and parstof), I'm not yet sure what's going on there.

bgamari requested changes to this revision. Aug 2 2016, 7:02 AM
bgamari edited edge metadata.

Going to bump from review queue while the last regressions are addressed. These regressions that you refer to are runtime regressions, not compiler allocation regressions, correct?

compiler/deSugar/DsUtils.hs
578

How is the Ds suffix here supposed to indicate this distinction?

libraries/ghc-prim/GHC/CString.hs
63 (On Diff #8359)

It would be worth mentioning the reason for CONLIKE in the note above.

testsuite/tests/deSugar/should_compile/T10844.hs
10

A brief comment here describing what it is that we are checking for would be helpful.

This revision now requires changes to proceed. Aug 2 2016, 7:02 AM
simonpj requested changes to this revision. Aug 18 2016, 5:36 AM
simonpj edited edge metadata.

As well as looking at regressions I'd like to see a proper Note explaining why we are going to all this trouble, citing the ticket(s), and boiling out the key insights from the long thread on the ticket.

It'd be great to finish this off! Thanks

Simon

compiler/deSugar/DsUtils.hs
598

Please comment this 'find'. I think you are doing pre-emptive CSE. Maybe that's a good plan, but say so and say why.

I finally had some time to dig into the regressions a bit, specifically the cse benchmark. The -ticky flag helped me isolate the allocation increases to the draw function.

draw (Node x ts) = grp (s1 ++ pad width x ++ "]") (space (width+3)) (stLoop ts)
 where stLoop []     = [""]
       stLoop [t]    = grp s2 "  " (draw t)
       stLoop (t:ts) = grp s3 s4 (draw t) ++ [s4] ++ rsLoop ts

       rsLoop [t]    = grp s5 "  " (draw t)
       rsLoop (t:ts) = grp s6 s4 (draw t) ++ [s4] ++ rsLoop ts

       grp fst rst   = zipWith (++) (fst:repeat rst)

       -- Define the strings used to print tree diagrams:
       [s1,s2,s3,s4,s5,s6] | pcGraphics = ["\196[", "\196\196", "\196\194",
                                           " \179", " \192",    " \195"]
                           | otherwise  = ["-[",    "--",       "-+",
                                           " |",    " `",       " +"]

       pad n x    = take n (x ++ repeat ' ')
       width      = 4
       pcGraphics = False

After staring at the simplified Core, the only real difference I can see is in the handling of the s1-s6 list of string literals.

In master, each string literal is lifted to its own top-level binder. With my patch, the entire list is lifted to a top-level 6-tuple. Each element of the tuple is given its own top-level binder, *except* for s1, which is repeatedly extracted from the tuple inside draw. The Core for the tuple is somewhat bizarre too.

ds2_r6yH :: ([Char], [Char], [Char], [Char], [Char], [Char])
[GblId, Str=m]
ds2_r6yH =
  let {
    s1_s5Wv :: GHC.Prim.Addr#
    [LclId]
    s1_s5Wv = "-["# } in
  let {
    ds6_s5Wx :: GHC.Prim.Addr#
    [LclId]
    ds6_s5Wx = "--"# } in
  let {
    ds7_s5Ww :: [Char]
    [LclId]
    ds7_s5Ww = GHC.CString.unpackCString# ds6_s5Wx } in
  let {
    ds8_s5WA :: GHC.Prim.Addr#
    [LclId]
    ds8_s5WA = "-+"# } in
  let {
    ds9_s5Wz :: [Char]
    [LclId]
    ds9_s5Wz = GHC.CString.unpackCString# ds8_s5WA } in
  let {
    ds10_s5WD :: GHC.Prim.Addr#
    [LclId]
    ds10_s5WD = " |"# } in
  let {
    ds11_s5WC :: [Char]
    [LclId]
    ds11_s5WC = GHC.CString.unpackCString# ds10_s5WD } in
  let {
    ds12_s5WG :: GHC.Prim.Addr#
    [LclId]
    ds12_s5WG = " `"# } in
  let {
    ds13_s5WF :: [Char]
    [LclId]
    ds13_s5WF = GHC.CString.unpackCString# ds12_s5WG } in
  let {
    ds14_s5WJ :: GHC.Prim.Addr#
    [LclId]
    ds14_s5WJ = " +"# } in
  let {
    ds15_s5WI :: [Char]
    [LclId]
    ds15_s5WI = GHC.CString.unpackCString# ds14_s5WJ } in
  (GHC.CString.unpackCString# s1_s5Wv, ds7_s5Ww, ds9_s5Wz, ds11_s5WC,
   ds13_s5WF, ds15_s5WI)

Note that each string literal has a let binder that calls unpackCString# on it except for the first element, where unpackCString# is called inside the tuple expression. Why would GHC single out the first element? And regardless, why would this lead to increased allocations when the entire tuple should be evaluated at most once?

One other odd thing I noticed is that there are two helper functions inside draw that are lifted to top-level functions in master, but not in my patch. The Core for draw (with my patch) starts like this:

$wdraw_r6za =
  \ (ww_s67f :: [Char]) (ww1_s67g :: [GenTree [Char]]) ->
    zipWith
      @ [Char]
      @ [Char]
      @ [Char]
      (++ @ Char)
      (GHC.Types.:
         @ [Char]
         (case ds2_r6yH of
          { (s1_a2OL, s2_a2OO, s3_a2OR, s7_a2OU, s8_a2OX, s9_a2P0) ->
          ++
            @ Char
            s1_a2OL
            (letrec {
               $wxs1_s6eL [InlPrag=[0], Occ=LoopBreaker]
                 :: GHC.Prim.Int# -> [Char]
               [LclId, Arity=1, Str=<S,1*U>]
               $wxs1_s6eL =
                 \ (ww2_s670 :: GHC.Prim.Int#) ->
                   case ww2_s670 of ds7_a3aa {
                     __DEFAULT ->
                       GHC.Types.:
                         @ Char Main.$fShowGenTree3 ($wxs1_s6eL (GHC.Prim.-# ds7_a3aa 1#));
                     1# -> lvl9_r6z8
                   }; } in
             letrec {
               $wgo1_s6eO [InlPrag=[0], Occ=LoopBreaker]
                 :: [Char] -> GHC.Prim.Int# -> [Char]
               [LclId, Arity=2, Str=<S,1*U><S,1*U>]
               $wgo1_s6eO =
                 \ (w_s673 :: [Char]) (ww2_s677 :: GHC.Prim.Int#) ->
                   case w_s673 of {
                     [] -> $wxs1_s6eL ww2_s677;
                     : y_a37u ys_a37v ->
                       case ww2_s677 of ds7_a3aa {
                         __DEFAULT ->
                           GHC.Types.:
                             @ Char y_a37u ($wgo1_s6eO ys_a37v (GHC.Prim.-# ds7_a3aa 1#));
                         1# -> GHC.Types.: @ Char y_a37u ds3_r6yQ
                       }
                   }; } in
             $wgo1_s6eO ww_s67f 4#)
          })

On master, the $wxs1 and $wgo1 functions are lifted to top-level binders, and as far as I can tell they *could* be lifted out here too. Curious...

compiler/deSugar/DsUtils.hs
578

Good point. I'll make the names more descriptive.

598

I am doing pre-emptive CSE, but it may be a case of premature optimization. I'll try removing it and rerun the benchmarks to see if it actually makes a difference.

Why would GHC single out the first element? And regardless, why would this lead to increased allocations when the entire tuple should be evaluated at most once?

That's odd, but should make absolutely no difference. If you do -ddump-prep it'll probably look identical.

I'm puzzled why ALL of those strings aren't being allocated at top level, thus

ds7 = unpackCString# "[-"#

etc. Why do we end up with nested bindings at all?

On master, the $wxs1 and $wgo1 functions are lifted to top-level binders, and as far as I can tell they *could* be lifted out here too. Curious...

That looks like a place where allocation will change, because each call to $wdraw will allocate a closure for each of those functions. To narrow it down, look at the output of FloatOut, and see why they aren't being floated.

On master, the $wxs1 and $wgo1 functions are lifted to top-level binders, and as far as I can tell they *could* be lifted out here too. Curious...

That looks like a place where allocation will change, because each call to $wdraw will allocate a closure for each of those functions. To narrow it down, look at the output of FloatOut, and see why they aren't being floated.

It turns out they are actually being floated out, but then LiberateCase inlines them right back because $wdraw scrutinizes the 6-tuple of strings (the notNull free_scruts check inside LiberateCase.libCaseId).

It seems to me that the best solution here is to figure out why the tuple is not being eliminated and fix that, which should solve the allocation of $wxs1 and $wgo1 by extension. This change must be related to the CONLIKE pragma I've attached to unpackCString#, because the tuple *is* eliminated if I remove CONLIKE.

Why would GHC single out the first element? And regardless, why would this lead to increased allocations when the entire tuple should be evaluated at most once?

That's odd, but should make absolutely no difference. If you do -ddump-prep it'll probably look identical.

I'm puzzled why ALL of those strings aren't being allocated at top level, thus

ds7 = unpackCString# "[-"#

etc. Why do we end up with nested bindings at all?

I was able to minimize this odd behavior to the following example

module Foo where

draw xs = a ++ b ++ xs
[a,b] = ["aa", "bb"]

which produces this core on master

b = unpackCString# "bb"#

a = unpackCString# "aa"#

draw = \ xs_aqM -> ++ a (++ b xs_aqM)

and this core with my patch

a1 =
  let {
    a2_sBk
    [LclId]
    a2_sBk = "aa"# } in
  let {
    ds_sBm
    [LclId]
    ds_sBm = "bb"# } in
  let {
    ds1_sBl
    [LclId]
    ds1_sBl = unpackCString# ds_sBm } in
  (unpackCString# a2_sBk, ds1_sBl)

a = case a1 of { (a2_aqO, b1_aqR) -> a2_aqO }

b = case a1 of { (a2_aqO, b1_aqR) -> b1_aqR }

draw = \ xs_aqL -> ++ a (++ b xs_aqL)

In both cases the simplifier (phase 1, if I'm reading the trace correctly) collapses the pattern match on the list ["aa", "bb"] into a tuple (unpackCString# "aa"#, unpackCString# "bb"#). Then the process diverges.

On Master
It ANFs the applications of unpackCString# and floats them to top-level binders, giving us

ds_a = unpackCString# "aa"#
ds_b = unpackCString# "bb"#

a1 = (ds_a, ds_b)

a = case a1 of { (a2_aqO, b1_aqR) -> a2_aqO }

b = case a1 of { (a2_aqO, b1_aqR) -> b1_aqR }

which allows the case-of-known-constructor rewrite to simplify the top-level a and b binders.

My Patch
The simplifier ANFs the unpackCString#s and the "aa"# and "bb"# literals, which I believe is preventing it from floating any of the binders, as unlifted values cannot exist at the top level. Thus we get the Core

a1 =
  let {
    a2_sBk
    [LclId]
    a2_sBk = "aa"# } in
  let {
    ds_sBm
    [LclId]
    ds_sBm = "bb"# } in
  let {
    ds1_sBl
    [LclId]
    ds1_sBl = unpackCString# ds_sBm } in
  (unpackCString# a2_sBk, ds1_sBl)

a = case a1 of { (a2_aqO, b1_aqR) -> a2_aqO }

b = case a1 of { (a2_aqO, b1_aqR) -> b1_aqR }

draw = \ xs_aqL -> ++ a (++ b xs_aqL)

and case-of-known-constructor cannot fire.
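For reference, the case-of-known-constructor rewrite discussed here fires when a case scrutinizes a visible constructor application, substituting the fields directly into the alternative. A toy sketch (an invented mini-AST, not GHC Core; the nested lets in the patched output above are precisely what hide the constructor and block the rewrite, absent let-floating):

```haskell
-- Toy language: pair construction, first projection, variables.
data E = Pair E E | Fst E | V String
  deriving (Eq, Show)

-- One simplification pass. When the scrutinee of Fst simplifies to
-- a visible Pair, project the component directly; this is the
-- essence of case-of-known-constructor.
simp :: E -> E
simp (Fst e) = case simp e of
  Pair a _ -> simp a   -- case-of-known-constructor
  e'       -> Fst e'
simp (Pair a b) = Pair (simp a) (simp b)
simp e = e
```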

@simonpj does this explanation make sense? If so, I can see two possible solutions.

  1. Prevent GHC from ANF'ing the unlifted strings. I'm not sure why it is doing this (though I suspect it's due to the CONLIKE pragma on unpackCString#), but I don't think we gain anything from it since the unlifted strings can't be floated out to the top. The concern is that we'd have to add some special exception for unpackCString# in whatever machinery is doing the ANF'ing.
  2. Allow unlifted strings at the top level. I seem to recall seeing a comment or ticket somewhere that this might be desirable anyway, but I don't know how much work would be involved, or what the knock-on effects might be.

Eric, the ticket is https://ghc.haskell.org/trac/ghc/ticket/8472. The example there is not dissimilar!

PLEASE add your example to the ticket.

I think that's the only "right" solution.

The only other solid, consistent alternative is to make unlifted string literals freely duplicable (like Int#, say), by messing with litIsTrivial. That would lead to duplication of literal strings, so they'd have to be commoned up later. It amounts the same thing in the end.

I favour supporting top-level literal strings. As I say in the ticket, it's not hard, I think. Would you like to have a go?
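For context, supporting top-level literal strings (the Trac #8472 route, which D2605 pursues) makes bindings like the following legal. This snippet is only an illustration, not part of this patch, and assumes a GHC new enough to accept top-level unlifted string literals (the feature eventually landed in GHC 8.2):

```haskell
{-# LANGUAGE MagicHash #-}

import GHC.Exts (Addr#)
import GHC.Base (unpackCString#)

-- An unlifted string literal bound directly at the top level:
-- exactly the shape the desugarer wants to produce when floating.
helloAddr :: Addr#
helloAddr = "hello"#

-- Its lifted unpacking, also at the top level, so it is allocated
-- once and stays visible for sharing across unfoldings.
hello :: String
hello = unpackCString# helloAddr
```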

PS: Phab isn't a good long-term store of value (I think). Trac is. Maybe put substantive discussion onto a ticket instead of here?
Simon

PS: excellent diagnosis!

@gridaphobe, what ever happened to this?

@bgamari thanks for the prod. There are two things preventing this patch from being ready.

  1. It depends on D2605 (I thought there was a way to add that to the metadata, but can't find it ATM). EDIT: apparently I just added the dependency, but have no idea how :)
  2. There are still a few nofib benchmarks (bspt and transform specifically) that are reporting increased allocations (+0.1%) with the patch, though the average is down (-0.2%), as is binary size.

I'm out of town this week, but I'll try to find some time next week to investigate (2) more closely so this can be ready to merge as soon as D2605 is.

Thanks Eric

  1. It depends on D2605 (I thought there was a way to add that to the metadata, but can't find it ATM). EDIT: apparently I just added the dependency, but have no idea how :)

Akio seems a bit stalled on D2605, so any help you can give there would be terrific. It seems 95% done!

Are the nofib results in the summary still valid? A 15% reduction in binary size would be amazing.

@simonmar no, those nofib results are quite old. I'll update the summary with the final results once I've sorted out the last issues, but the current summary is:

--------------------------------------------------------------------------------
        Program           Size    Allocs   Runtime   Elapsed  TotalMem
--------------------------------------------------------------------------------
            Min          -0.2%    -10.2%     -4.6%     -4.7%     -4.3%
            Max         +20.7%     +3.2%     +5.4%     +5.4%     +3.4%
 Geometric Mean          +0.3%     -0.3%     -0.3%     -0.4%     -0.0%

I'm not surprised that the binary size isn't moving anymore; D1290 should have already achieved (mostly) the same binary size savings this patch would.

@simonpj I believe I have traced some of the increased allocations to GHC.Arr (specifically the array function). Looking at the simplified Core I see something a bit odd.

-- RHS size: {terms: 1, types: 0, coercions: 0}
GHC.Arr.$fIxInt2 :: Addr#
[GblId,
 Caf=NoCafRefs,
 Unf=Unf{Src=<vanilla>, TopLvl=True, Value=True, ConLike=True,
         WorkFree=True, Expandable=True, Guidance=IF_ARGS [] 20 0}]
GHC.Arr.$fIxInt2 = "Int"#

-- RHS size: {terms: 2, types: 0, coercions: 0}
GHC.Arr.$fIxInt1 :: [Char]
[GblId,
 Unf=Unf{Src=<vanilla>, TopLvl=True, Value=False, ConLike=True,
         WorkFree=False, Expandable=True, Guidance=IF_ARGS [] 20 0}]
GHC.Arr.$fIxInt1 = unpackCString# GHC.Arr.$fIxInt2

-- RHS size: {terms: 1, types: 0, coercions: 0}
ds_rdnp :: Addr#
[GblId, Caf=NoCafRefs]
ds_rdnp = " out of range "#

-- RHS size: {terms: 2, types: 0, coercions: 0}
ds1_rdnq :: [Char]
[GblId]
ds1_rdnq = unpackCString# ds_rdnp

Does this mean that some string literals are being marked ConLike while others are not, or is it just an artifact of the pretty printer (perhaps because the 2nd literal is not exported)?

Ok, I've isolated the largest remaining allocation increase to the way GHC.Arr.index (specifically the Int specialization) is inlined.

index :: (Int, Int) -> Int -> Int
index b i | inRange b i =  unsafeIndex b i
          | otherwise   =  indexError b i "Int"

The unfolding for index changes subtly with this patch.

BEFORE:

case &&
       (tagToEnum# @ Bool (<=# m_a2de i1_a2dg))
       (tagToEnum# @ Bool (<=# i1_a2dg n_a2df))
of {
  False ->
    indexError
      @ Int @ Int GHC.Show.$fShowInt wild_X2R wild3_Xa
      (build @ Char (\ (@ b1_a4rm) -> unpackFoldrCString# @ b1 "Int"#));

AFTER:

case &&
       (tagToEnum# @ Bool (<=# m_a2bI i1_a2bK))
       (tagToEnum# @ Bool (<=# i1_a2bK n_a2bJ))
of {
  False ->
    indexError
      @ Int @ Int GHC.Show.$fShowInt wild_X35 wild3_Xa 
      GHC.Arr.$fIxInt1;
  ...

GHC.Arr.$fIxInt2 = "Int"#
GHC.Arr.$fIxInt1 = unpackCString# GHC.Arr.$fIxInt2

After the patch the unfolding no longer contains a string literal, which is exactly the goal. So far so good.

But now things go awry when we inline index, e.g. in GHC.Arr.safeIndex.

BEFORE:

let {
  $j_s4Mw :: Void# -> Int
  [LclId,
   Arity=1,
   Unf=Unf{Src=<vanilla>, TopLvl=False, Value=True, ConLike=True,
           WorkFree=True, Expandable=True, Guidance=IF_ARGS [0] 100 0}]
  $j_s4Mw =
    \ _ [Occ=Dead, OS=OneShot] ->
      GHC.Arr.indexError
        @ Int
        @ Int
        GHC.Show.$fShowInt
        (wild1_a4Ba, wild2_a4Be)
        wild3_a4Bi
        (build
           @ Char (\ (@ b1_a4L3) -> unpackFoldrCString# @ b1 "Int"#)) } in
case tagToEnum# @ Bool (<=# m_a4Bc i1_a4Bk) of {
  False -> $j_s4Mw void#;
  True ->
    case tagToEnum# @ Bool (<=# i1_a4Bk n_a4Bg) of {
      False -> $j_s4Mw void#;
      True -> GHC.Types.I# (-# i1_a4Bk m_a4Bc)
    }

AFTER:

case tagToEnum# @ Bool (<=# m_a4ED i1_a4EL) of {
  False ->
    GHC.Arr.indexError
      @ Int
      @ Int
      GHC.Show.$fShowInt
      (wild1_a4EB, wild2_a4EF)
      wild3_a4EJ
      GHC.Arr.$fIxInt1;
  True ->
    case tagToEnum# @ Bool (<=# i1_a4EL n_a4EH) of {
      False ->
        GHC.Arr.indexError
          @ Int
          @ Int
          GHC.Show.$fShowInt
          (wild1_a4EB, wild2_a4EF)
          wild3_a4EJ
          GHC.Arr.$fIxInt1;

The AFTER unfolding looks cheaper to GHC, so it doesn't mind pushing it into the unfolding of (&&). Again, this seems like a reasonable decision.

Unfortunately, later on GHC realizes that in the BEFORE case, the $j_s4Mw binder can be floated out to a top-level binder, whereas there's no binder to float in the AFTER case. Finally, the call to GHC.Arr.indexError allocates a pair, which is now done in EACH "call" to index rather than just in the bottoming calls.

So the question seems to be, how can we convince GHC to float the GHC.Arr.indexError call in the AFTER case?

Max         +20.7%     +3.2%     +5.4%     +5.4%     +3.4%

Looks like that +20% size increase deserves investigating too.

So the question seems to be, how can we convince GHC to float the GHC.Arr.indexError call in the AFTER case?

This is odd. In HEAD I tried

import GHC.Arr

f x True  = indexError (x,x) x "rur"
f x False = True

And indeed I get just what I expect

f =
  \ (@ a_aRV)
    ($dShow_a19L :: Show a)
    (x_aAH :: a)
    (ds_d19W :: Bool) ->
    case ds_d19W of {
      False -> GHC.Types.True;
      True -> Bar.f1 @ a $dShow_a19L x_aAH
    }

where f1 is the floated call to indexError. Don't you get this? I have not yet tried building your exact patch, but I suppose I can do that if you are stuck. It would be great to nail this before the 8.2 deadline

gridaphobe added a comment (edited). Jan 16 2017, 11:39 PM

@simonpj sorry to be so slow.

I think my last post may have been a bit confusing. It's not the compilation of index that is changed, it's the compilation of a function that calls index that changes. For example, GHC.Arr.safeIndex.

safeIndex :: (Int, Int) -> Int -> Int -> Int
safeIndex (l,u) n i
  | (0 <= i') && (i' < n) = i'
  | otherwise             = error "bad index"
  where
    i' = index (l,u) i

When we inline index into safeIndex we end up with the situation I described above.

Ultimately the problem is that the new indexError expression appears quite a bit smaller to GHC, and thus it's happy to duplicate the expression where it wouldn't have before. It actually has very little to do with Strings, that was just the trigger.

I was able to recover the old behavior by telling GHC to not duplicate bottoming expressions, but I don't think this is the right solution... Given all the trouble I'm having with this approach, I wonder if your original suggestion to have the simplifier float the strings out (and remove them from the unfoldings) wouldn't be better?

EDIT: Here's the patch I used to tell GHC not to duplicate bottoming expressions. I'm not adding it to the diff yet because it's not at all clear that it would be the right approach.

Can you give me a small repro case so I can see exactly what you are saying? I can't do it in my head!

Note that in my compilation of f (shown above) the *entire* indexError call is floated to top level, as indeed it should. But not for you?

Sorry, I've been having trouble properly minimizing the example, but I think I finally got it.

For reference, the original function where I observed the increased allocation is nofib/spectral/hartel/Fast2haskell.tabulate.

Here's a minimized version that doesn't depend on any external modules.

module Foo where

tabulate :: (Int -> a) -> (Int, Int) -> [Int]
tabulate f (l,u) = array (l,u) [l..u]

{-# INLINE array #-}
array :: (Int, Int) -> [Int] -> [Int]
array (l,u) is = [index (l,u) i | i <- is]

{-# INLINE index #-}
index :: (Int, Int) -> Int -> Int
index b@(l,h) i
  | l <= i && i < h = 0
  | otherwise       = indexError b i "int"

{-# NOINLINE indexError #-}
indexError :: (Int, Int) -> Int -> String -> b
indexError rng i tp = indexError rng i tp

On HEAD tabulate compiles to

tabulate2 =
  \ wild1_a1Y8 wild_a1Y4 x_a1Z9 _ ->
    indexError (wild_a1Y4, wild1_a1Y8) (I# x_a1Z9) lvl_r25h

$wtabulate =
  \ @ a_s22X ww_s235 ww1_s23a ->
    case tagToEnum# (># ww_s235 ww1_s23a) of {
      False ->
        let {
          wild1_a1Y8 = I# ww1_s23a } in
        let {
          wild2_a1Y4 = I# ww_s235 } in
        letrec {
          go_a1Z8 =
            \ x_a1Z9 ->
              : (case tagToEnum# (<=# ww_s235 x_a1Z9) of {
                   False -> tabulate2 wild1_a1Y8 wild2_a1Y4 x_a1Z9 void#;
                   True ->
                     case tagToEnum# (<# x_a1Z9 ww1_s23a) of {
                       False -> tabulate2 wild1_a1Y8 wild2_a1Y4 x_a1Z9 void#;
                       True -> tabulate1
                     }
                 })
                (case tagToEnum# (==# x_a1Z9 ww1_s23a) of {
                   False -> go_a1Z8 (+# x_a1Z9 1#);
                   True -> []
                 }); } in
        go_a1Z8 ww_s235;
      True -> []
    }

which is good, the tuple is allocated inside tabulate2, i.e. only on diverging paths.

With the PATCH, it compiles to

$wtabulate =
  \ @ a_s230 ww_s238 ww1_s23d ->
    let {
      wild1_a1Yg = I# ww1_s23d } in
    let {
      wild_a1Yc = I# ww_s238 } in
    let {
      lvl_s24e = (wild_a1Yc, wild1_a1Yg) } in
    case tagToEnum# (># ww_s238 ww1_s23d) of {
      False ->
        letrec {
          go_a1Zb =
            \ x_a1Zc ->
              : (case tagToEnum# (<=# ww_s238 x_a1Zc) of {
                   False -> indexError lvl_s24e (I# x_a1Zc) array1;
                   True ->
                     case tagToEnum# (<# x_a1Zc ww1_s23d) of {
                       False -> indexError lvl_s24e (I# x_a1Zc) array1;
                       True -> tabulate1
                     }
                 })
                (case tagToEnum# (==# x_a1Zc ww1_s23d) of {
                   False -> go_a1Zb (+# x_a1Zc 1#);
                   True -> []
                 }); } in
        go_a1Zb ww_s238;
      True -> []
    }

which is bad, because the tuple is always allocated.

The reason we get worse Core with my patch is that the string "int" now appears cheaper to GHC (it's just a variable reference), so GHC is happy to duplicate the indexError call (via Simplify.mkDupableAlt). If we change the string "int" to the integer 0, HEAD will also produce the bad Core (interestingly the same thing happens if we just make array use a map rather than a list comprehension).

So this regression is kinda tangential to the string changes, the problem is that we're duplicating and then floating computations that will only be demanded on diverging paths, it could be happening elsewhere!

Here's a minimized version that doesn't depend on any external modules.

Oh the joy of a repro case. Now I can see exactly what is going on. You've exposed at least two bugs: Trac #13143 and Trac #13144.

Now I want to fix those tickets, and it should not be hard. I've already got a fix for Trac #13144 in the works. But I'm out of time for this week. Would you like to try Trac #13143? Just delete the offending lines in WorkWrap; I think everything else will happen automatically!

Sure, I'll take a look this evening. Thanks for your patience!

@simonmar the +17% binary size for transform is indeed concerning. I've done some digging and found that the issue can be reproduced on HEAD just by adding the CONLIKE pragma to unpackCString#; somehow this is causing GHC to duplicate a ton of code. I spent some time yesterday trying to minimize transform and came up with the following.

module Foo (f_t5) where

data T_def
  = F_Caf [String] T_expr

data T_expr
  = F_Cons Int T_expr T_expr
  | F_Map Int String [T_expr]
  | F_Ap String [T_expr]
  | F_Var String

f_t5 a_prg
  = f_t5a [] a_prg [] 1

f_t5a a_ds' [] a_r a_n
  = f_t5x "e" a_ds' a_r a_n
f_t5a a_ds' ((F_Caf ["output"] (F_Map a_k a_f a_xs)) : a_ds) a_r a_n
  = let r_new = [F_Caf ["output"] (F_Cons a_k (F_Ap a_f []) (F_Var "xxv"))]
    in f_t5x "a" r_new a_r (a_n + 2)
f_t5a a_ds' (a_other : a_ds) a_r a_n
  = f_t5a (a_other : a_ds') a_ds a_r a_n

f_t5x "e" a_all a_r a_n
  = a_all
f_t5x a_step a_all a_r a_n
  = f_t5a [] a_all a_r a_n

With HEAD

> ghc -O2 -fforce-recomp Foo.hs; ls -lh Foo.o
[1 of 1] Compiling Foo              ( Foo.hs, Foo.o )
-rw-r--r--  1 gridaphobe  staff    11K Jan 31 11:26 Foo.o

With HEAD + CONLIKE

> ./inplace/bin/ghc-stage2 -O2 -fforce-recomp Foo.hs; ls -lh Foo.o

[1 of 1] Compiling Foo              ( Foo.hs, Foo.o )
WARNING: file compiler/simplCore/SimplCore.hs, line 666
  Simplifier bailing out after 4 iterations [85, 1, 1, 1]
    Size = {terms: 296, types: 152, coercions: 0}
-rw-r--r--  1 gridaphobe  staff    33K Jan 31 11:25 Foo.o

I took a quick look at the Core and it appears that the inliner has gone a bit crazy. I count 6 definitions each of $wf_t5a1 (which appears to be f_t5x) and $s$wf_t5a.

Is this worth reporting as a separate issue?

compiler/deSugar/DsUtils.hs
598

It appears that the pre-emptive CSE was covering up a bug in the implementation. If I disable it, the simplifier (I think postInlineUnconditionally specifically) will just inline the values right back because they have a single use and appear cheap.

I'll need to think of a more principled solution. I think we want the top-level bound values to not be inlined period unless it's part of a RULE, otherwise why bind them at the top level?

Eric I'm keen to nail this one. It's dragged on for too long.

I've done some digging and found that the issue can be reproduced on HEAD just by adding the CONLIKE pragma to unpackCString#,

I tried this (in HEAD), and did discover two bugs: Trac #13316 and Trac #13317

But I did not see

I took a quick look at the Core and it appears that the inliner has gone a bit crazy. I count 6 definitions each of $wf_t5a1 (which appears to be f_t5x) and $s$wf_t5a.

Can you see if this is still happening in HEAD, and if so make a ticket? The inliner should not go crazy!

compiler/deSugar/DsUtils.hs
598

Can you try in HEAD? If we are floating literals out we should not inline them back in. If that happens, do submit a Trac ticket.

libraries/ghc-prim/GHC/CString.hs
63 (On Diff #8359)

Yes! Please say why!

I've committed a patch to HEAD that makes unpackCString CONLIKE (and documents why). And improves exprIsConApp_maybe.

Does that get you un-glued?

Simon

I've committed a patch to HEAD that makes unpackCString CONLIKE (and documents why). And improves exprIsConApp_maybe.

Does that get you un-glued?

Yes, your patch to exprIsConApp_maybe seems to have been the key to the +17% binary size for transform, thanks!

I'm now getting the following nofib results.

--------------------------------------------------------------------------------
        Program           Size    Allocs   Runtime   Elapsed  TotalMem
--------------------------------------------------------------------------------
           anna          +0.2%     -0.0%     0.059     0.065      0.0%
           ansi           0.0%     -2.0%     0.000     0.003      0.0%
           bspt          -0.4%     +0.1%     0.004     0.009      0.0%
         gamteb          -0.4%     -4.0%     0.018     0.024      0.0%
             gg          +0.4%     -2.4%     0.005     0.013      0.0%
        mkhprog          +0.5%     -9.5%     0.001     0.005      0.0%
        rewrite          -0.4%     +0.1%     0.006     0.010      0.0%
      transform           0.0%     +0.1%     0.195     0.202      0.0%
--------------------------------------------------------------------------------
            Min          -1.1%     -9.5%     -4.9%     -5.5%    -29.1%
            Max          +0.5%     +0.1%    +15.1%    +18.6%      0.0%
 Geometric Mean          -0.2%     -0.2%     +0.7%     +0.6%     -0.4%

I'm a bit annoyed that there are any Size or Alloc increases, but the balance is positive and there are no more large outliers. Furthermore, the largest Size increases are offset by much larger Alloc decreases, which seems like the right tradeoff if we have to choose. I'm now working through the testsuite, there are a large number of tests that need to be updated.

I hope to have an updated patch later today or tomorrow, I'll also dig into the remaining Size/Alloc increases to see if I can at least explain them.

Great, modulo the Note

compiler/typecheck/TcRnTypes.hs
389

Please document carefully here the reason that the FloatOut pass doesn't nail this. See my comment on the ticket

gridaphobe updated this revision to Diff 11316.Feb 25 2017, 3:09 PM
gridaphobe edited edge metadata.
  • address comments
  • float primitive strings out too
  • expand the Note to explain why FloatOut isn't the answer here
  • make T10844 more robust
gridaphobe updated this revision to Diff 11319.Feb 25 2017, 5:02 PM
  • preemptively float primitive strings too (see the last paragraph of the Note)
  • we can do CSE on top-level primitive strings
gridaphobe updated this revision to Diff 11320.Feb 25 2017, 6:32 PM

preInlineUnconditionally was re-inlining the floated primitive strings

gridaphobe updated this revision to Diff 11321.Feb 25 2017, 8:55 PM

fix CSE to maintain Core invariant

As of the latest patch, the nofib Size increases are gone and there's a single 0.1% Alloc increase.

--------------------------------------------------------------------------------
        Program           Size    Allocs   Runtime   Elapsed  TotalMem
--------------------------------------------------------------------------------
      transform          -0.7%     +0.1%     0.190     0.199      0.0%
--------------------------------------------------------------------------------
            Min          -3.1%     -9.5%     -7.2%     -9.2%     -2.7%
            Max          -0.4%     +0.1%    +11.2%    +12.0%      0.0%
 Geometric Mean          -1.1%     -0.3%     +1.7%     +2.2%     -0.0%

Not sure yet what's going on in transform, I'll try to take a look tomorrow.

@bgamari harbormaster is failing to build my patch because it can't find my nofib patch. I thought all I had to do was point to my fork in .gitmodules?

As of the latest patch, the nofib Size increases are gone and there's a single 0.1% Alloc increase.

Great work!

@bgamari harbormaster is failing to build my patch because it can't find my nofib patch. I thought all I had to do was point to my fork in .gitmodules?

Hmm, I thought this (Trac #12623) was fixed. I'll have to have a look at this.

Nearly there!

compiler/deSugar/DsUtils.hs
584

No need for all this faff. Just generate a single binding

x = unpackCString# "foo"#

FloatOut will do the rest.

compiler/simplCore/CSE.hs
298 ↗(On Diff #11321)

Please explain this better.

Why are you not using exprIsLiteralString_maybe here?

The two cases look very similar. Why not use the same code except returning (NonRec b2 e2) where

e2 = case exprIsLiteralString_maybe e of
            Just {} -> e
            Nothing -> e1

And explain, in the context of the example in Note [Take care with literal strings] what you expect the substitution and reverse-mapping to be.
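Putting the suggestion together, the unified case might look something like this (a sketch only; the surrounding helpers addBinder and tryForCSE are assumed from CSE.hs, not quoted):

```haskell
cseBind :: CSEnv -> CoreBind -> (CSEnv, CoreBind)
cseBind env (NonRec b e)
  = (env', NonRec b2 e2)
  where
    (env', b2) = addBinder env b
    e1         = tryForCSE env e
    -- Top-level literal strings must keep their original RHS so the
    -- "unpackCString# applied to a literal" Core invariant survives;
    -- everything else takes the CSE'd RHS.
    e2 = case exprIsLiteralString_maybe e of
           Just {} -> e
           Nothing -> e1
```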

compiler/simplCore/SimplUtils.hs
1052

No, don't do this. If you do, you'll fail to drop dead bindings!

Better: modify canInlineInLam to reject literals for which litIsTrivial is false. And add a Note!!
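Concretely, the suggested change might look like this (a sketch against the local canInlineInLam in SimplUtils.preInlineUnconditionally; the surrounding clauses are assumed, not quoted):

```haskell
canInlineInLam :: CoreExpr -> Bool
canInlineInLam (Lit l)     = litIsTrivial l  -- NEW: non-trivial literals
                                             -- (e.g. strings) stay put
canInlineInLam (Lam b e)   = isRuntimeVar b || canInlineInLam e
canInlineInLam (Tick t e)  = not (tickishIsCode t) && canInlineInLam e
canInlineInLam _           = False
```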

compiler/typecheck/TcRnTypes.hs
403

StableUnfolding

gridaphobe added inline comments.Feb 27 2017, 2:31 PM
compiler/deSugar/DsUtils.hs
584

Hmm, you're right. I thought I had an example where the primitive string would sneak into a StableUnfolding, but I can't reproduce on a clean build. I'll revert.

gridaphobe updated this revision to Diff 11375.Feb 27 2017, 2:47 PM
  • don't pre-emptively float primitive strings
  • clean up code duplication in cseBind
  • add a Note explaining why preInlineUnconditionally must guard against inlining string literals.
gridaphobe updated this revision to Diff 11380.Feb 27 2017, 7:14 PM
  • sigh, fragile tests
  • avoid unnecessary tuples
  • disable preemptive floating at -O0
simonpj added inline comments.Feb 28 2017, 2:58 AM
compiler/deSugar/DsMonad.hs
296

Why not at -O0? Explain!

gridaphobe added inline comments.Feb 28 2017, 11:04 AM
compiler/deSugar/DsMonad.hs
296

It's in the Note: the issue is that floating the strings causes extra work for GHC (specifically CodeGen).

We only notice this at -O0 because FloatOut will float them anyway. T1969 was failing with +15% allocations as a result.

Solution: keep -O0 fast by disabling preemptive floating.
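One way to express that gate (hypothetical names; the real check lives in the desugarer's setup code) is to make the preemptive float conditional on the optimisation level:

```haskell
-- Only preemptively float string literals when optimising; at -O0
-- FloatOut never runs to merge the floated bindings, so floating is
-- pure overhead for CodeGen.
shouldFloatStrings :: DynFlags -> Bool
shouldFloatStrings dflags = optLevel dflags > 0
```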

simonpj added inline comments.Mar 1 2017, 4:46 AM
compiler/deSugar/DsMonad.hs
296

It's in the Note: the issue is that floating the strings causes extra work for GHC (specifically CodeGen).

Cf. Trac #13344 comment:14. I'm suspicious about this extra work. We are talking here about

f x = ...(unpackCString "foo"#)...

vs

s = unpackCString "foo"#
f x = ...s...

I don't understand why the latter would compile so much slower than the former. Maybe this has changed since you removed the "faff" a few comments ago?

Could you check? I smell low-hanging perf fruit.

In the case of T1969 (compiled with -O0) the difference is quite stark: with floating the non-optimizing simplifier pass produces {terms: 16,893, types: 7,552, coercions: 0, joins: 0/0}, without it produces {terms: 12,693, types: 4,552, coercions: 0, joins: 0/0}.

The test looks like,

class C a where
    c :: a -> String
    d :: a -> String
    d x = c x
    e :: a -> String
    e x = c x

data A1 = A1
instance C A1 where
    c A1 = "A1"

data A2 = A2
instance C A2 where
    c A2 = "A2"

...

Post-desugar

The reason for the regression is in part due to the fact that we float out multiple copies of each unpackCString# "An" expression. That is, after desugaring without floating we get (looking at just the A1 bindings),

T1969.$dme :: forall a. C a => a -> String
T1969.$dme = \ (@ a_aM1) ($dC_a1h9 :: C a_aM1) (x_aM3 :: a_aM1) ->
      c @ a_aM1 $dC_a1h9 x_aM3

-- same as $dme
T1969.$dmd :: forall a. C a => a -> String
T1969.$dmd = \ (@ a_aM1) ($dC_a1h9 :: C a_aM1) (x_aM2 :: a_aM1) ->
      c @ a_aM1 $dC_a1h9 x_aM2

$cc_a1i7 :: A1 -> String
$cc_a1i7= \ (ds_d1jJ :: A1) -> case ds_d1jJ of { A1 -> GHC.CString.unpackCString# "A1"# }

Rec {
T1969.$fCA3 :: C A3
T1969.$fCA3 = T1969.C:C @ A3 $cc_a1hl $cd_a1hp $ce_a1hy

$ce_a1hy :: A3 -> String
$ce_a1hy = T1969.$dme @ A3 T1969.$fCA3

$cd_a1hp :: A3 -> String
$cd_a1hp = T1969.$dmd @ A3 T1969.$fCA3
end Rec }

Whereas with floating we get,

-- same as above
T1969.$dme :: forall a. C a => a -> String
T1969.$dmd :: forall a. C a => a -> String

ds_d1k4 :: [Char]
ds_d1k4 = GHC.CString.unpackCString# "A1"#

$cc_a1i7 :: A1 -> String
$cc_a1i7 = \ (ds_d1k3 :: A1) -> case ds_d1k3 of { A1 -> ds_d1k4 }

Rec {
T1969.$fCA1 :: C A1
T1969.$fCA1 = T1969.C:C @ A1 $cc_a1i7 $cd_a1ib $ce_a1ik

$ce_a1ik :: A1 -> String
$ce_a1ik = T1969.$dme @ A1 T1969.$fCA1

$cd_a1ib :: A1 -> String
$cd_a1ib = T1969.$dmd @ A1 T1969.$fCA1
end Rec }

So far things aren't so bad: the only interesting difference is the floated [Char], which we would expect. However, let's then see what happens during simplification.

Post-simplification

Without floating we see,

T1969.$fCA1 :: C A1
T1969.$fCA1 = T1969.C:C @ A1 $cc_a1i7 $cc_a1i7 $cc_a1i7

$cc_a1i7 :: A1 -> String
$cc_a1i7
  = \ (ds_d1jJ :: A1) ->
      case ds_d1jJ of { A1 -> GHC.CString.unpackCString# "A1"# }

Whereas with floating we have,

ds_d1k4 :: [Char]
ds_d1k4 = GHC.CString.unpackCString# "A1"#

$cc_a1i7 :: A1 -> String
$cc_a1i7 = \ (ds_d1k3 :: A1) -> case ds_d1k3 of { A1 -> ds_d1k4 }

$cd_a1ib :: A1 -> String
$cd_a1ib = \ (x_aM2 :: A1) -> case x_aM2 of { A1 -> ds_d1k4 }

$ce_a1ik :: A1 -> String
$ce_a1ik = \ (x_aM3 :: A1) -> case x_aM3 of { A1 -> ds_d1k4 }

T1969.$fCA1 :: C A1
T1969.$fCA1 = T1969.C:C @ A1 $cc_a1i7 $cd_a1ib $ce_a1ik

This is quite interesting: without floating we are somehow able to collapse each of the A1 -> String bindings into a single binding (despite CSE being disabled due to -O0!).

I'd prefer to always do the "float-in-the-desugarer" approach (rather than leave a hack in the code) and open a ticket for the unexpected bump in cases like T1969. It's very much a corner case, but it needs a ticket of its own.

bgamari updated this revision to Diff 11624.Mar 7 2017, 4:56 PM
  • Update note

This diff has been rebased onto the current state of master and updated. However, after rerunning nofib it's not clear that we really still need it, as most of the reductions appear to have already been accomplished by other patches. Namely, we have

Binary Sizes

-------------------------------------------------------------------------------
        Program                  old              new
-------------------------------------------------------------------------------
        -1 s.d.                -----            -0.5%
        +1 s.d.                -----            +0.8%
        Average                -----            +0.2%

Strangely, though, there are a few rather striking reductions in runtime allocations,

mkhprog              3351704           -13.7%
rewrite             16510904           -11.2%
-1 s.d.                -----            -2.2%
+1 s.d.                -----            +1.6%
Average                -----            -0.3%

These reductions aren't accompanied by corresponding reductions in wall time, though.

Compiler allocations go up a bit,

-1 s.d.                -----            -1.4%
+1 s.d.                -----            +3.8%
Average                -----            +1.1%

Consequently I think I will punt this off beyond 8.2.

Hmm, I'll have to find time to take a look at the updated nofib results. There really shouldn't be any increases in binary size (or, ideally, in allocations either).

I think the patch could still have merit though. The original motivation was to avoid duplicating Strings when inlining across modules, which AFAIK has not been addressed by any of the other related patches. If I had to guess, I'd say this is probably what's responsible for the decreased allocations.

I'm ok with deferring beyond 8.2 (which is ready to go), but I do think we should commit this after the fork, for the reasons Eric says. It'd be worth checking why allocations do sometimes go down -- but more inlining is a plausible cause.

dfeuer added a subscriber: dfeuer.Mar 18 2017, 7:23 PM

Would it make sense to try to separate the CallStack part? That seems rather more likely to be a pure win than the String aspect, no?

austin resigned from this revision.Nov 6 2017, 3:50 PM
This revision now requires changes to proceed.Nov 6 2017, 3:50 PM