Drop custom mapM impl for [] & set mapM_ = traverse_
AbandonedPublic

Authored by bgamari on May 29 2015, 2:57 AM.

Details

Reviewers
simonmar
dolio
hvr
austin
ekmett
Trac Issues
#10457
Summary
Test Plan

let harbormaster do the needful

Diff Detail

hvr updated this revision to Diff 3038.May 29 2015, 2:57 AM
hvr retitled this revision from to Drop custom mapM impl for [] & set mapM_ = traverse_.
hvr updated this object.
hvr edited the test plan for this revision. (Show Details)
hvr added reviewers: austin, simonmar.
hvr updated the Trac tickets for this revision.
hvr added a reviewer: dolio.May 29 2015, 3:00 AM
hvr added a reviewer: ekmett.
hvr updated this revision to Diff 3039.May 29 2015, 3:23 AM
  • drop now redundant import
simonmar edited edge metadata.May 29 2015, 8:26 AM

Need a nofib run to test perf.

austin requested changes to this revision.Jun 2 2015, 2:48 PM
austin edited edge metadata.

Yes, it'd be nice to at least get some ballpark allocation numbers for this.

This revision now requires changes to proceed.Jun 2 2015, 2:48 PM
hvr added a comment.Jun 4 2015, 1:03 PM

see P59 (use "View Raw File" to get into a more readable form); The most visible regression is in cryptarithm2, but I have no explanation for that

cryptarithm2 has this:

solve :: [[Char]] -> [Char] -> Int -> DigitState ()
solve tops (bot:bots) carry =
  do topN <- (case tops of
		   [] -> return carry
		   (top:_) -> 
		     do topNS <- mapM select top
	     	        return (sum topNS + carry))
     botN <- select bot
     guard (topN `mod` 10 == botN)	-- key optimization
     solve (rest tops) bots (topN `div` 10)
  where
     rest []     = []
     rest (x:xs) = xs

There's a mapM there...

dolio edited edge metadata.Jun 6 2015, 4:15 PM

Here's a simplified test case that may be relevant:

Running with option 1 (traverse) does 100% more allocation than option 2 (mapM).

One interesting point is that switching out length for a hand-rolled recursive implementation puts them on par with one another. So the mapM version must be fusing with length somehow, perhaps? Marking both functions NOINLINE has little effect on test1 and makes test2 much worse than noinlined test1, although 2 does less copying than 1 in that case (instead of more when inlining is enabled).

Not sure about the exact mechanism here yet.

dolio added a comment.Jun 6 2015, 4:45 PM

Now I'm not sure if my test case is demonstrating the same thing, because moving the mapM/traverse call out to a separate, noinline function in cryptarithm2 has no effect on the results.

dolio added a comment.Jun 6 2015, 8:46 PM

Okay. After further investigation, I have some results to report.

My first clue was that when changing the code to use traverse, there is a call to map in the core, which of course isn't there in the mapM code. This was likely causing an extra intermediate list over the mapM version.

So, I hand-rolled an implementation of DigitState, implementing all the classes myself, and marking everything INLINE, performance is then identical between traverse and mapM, and it is better than using transformers.

With that in mind, I popped over to transformers, and looked at StateT. Marking fmap INLINE there had no effect. But, it looked suspicious to me that the Applicative definition just had (<*>) = ap. So, I hand-wrote an implementation and marked it INLINE. This was better than mapM. So, instead I just changed it back to (<*>) = ap and marked it INLINE. This then makes traverse identical in performance to mapM.

So, the discrepancy in performance in this benchmark is due to the StateT instance for Applicative. There is nothing fundamentally slower about traverse here.

Ok, so this makes sense. If <*> = ap, then traverse is the same as mapM, and will have the same performance provided <*> is inlined. If <*> is not inlined, then there might be an extra performance penalty compared with mapM.

I suspect your hand-written and inlined <*> with traverse beat mapM because >>= is not inlined.

This could end up pessimising existing code, but in a recoverable way. We should check whether it happens for IO.

dolio added a comment.Jun 8 2015, 12:10 PM

Actually, now that I think about it, I think the allocation difference was that I implemented (<*>) for strict StateT instead of lazy. But it doesn't matter, because just inlining (<*>) = ap achieves parity.

IO seems to display no difference between traverse and mapM.

simonmar accepted this revision.Jun 10 2015, 4:55 AM
simonmar edited edge metadata.

I think we're good to go here then. Just one thing that puzzles me: <*> = ap should be inlined automatically. Perhaps without the INLINE on <*> = ap we get ap inlined instead, and then <*> does not get inlined at call sites because it is too big. But that still wouldn't explain the difference, because we would only get an extra call, not extra allocation. Strange.

dolio added a comment.Jun 10 2015, 9:22 AM

The only thing that makes sense to me is that it doesn't make the decision on whether to inline (<*>) until ap is inlined, at which point it's too large. I'm not sure I knew it worked that way.

The extra allocation is caused by missed fusion. Without inlining, there is a map (:) call that disappears when we inline (<*>) because they fuse together.

ekmett accepted this revision.Jul 1 2015, 1:02 PM
ekmett edited edge metadata.

LGTM

austin accepted this revision.Jul 20 2015, 8:56 AM
austin edited edge metadata.

Alright, sounds OK to me.

This revision is now accepted and ready to land.Jul 20 2015, 8:56 AM
bgamari requested changes to this revision.Jul 21 2015, 6:21 AM
bgamari added a reviewer: bgamari.

Indeed this still regresses on a few tests. Namely,

make test TEST="ghcirun004 T1969 T5642"

Two are stat failures due to allocations, one appears to time out.

This revision now requires changes to proceed.Jul 21 2015, 6:21 AM
bgamari added a comment.EditedJul 24 2015, 9:05 AM

I'm looking deeper into these regressions. It seems that T1969 fails even without this patch so we can safely ignore it.

ghcirun004 (which runs only with the ghci way) is a bit odd. The program is essentially,

main = print (map foo [1,50..5000] )

foo :: Int -> Int
foo 1 = 0
foo 2 = 1
...
foo 5000 = 4999
foo n = n + 1

With doesn't touch mapM at all. Compiling without --interactive produces an executable that works as one would expect.

Ahh, here's a hint,

$ "/mnt/work/ghc/ghc-d924/inplace/bin/ghc-stage2" --interactive
GHCi, version 7.11.20150723: http://www.haskell.org/ghc/  :? for help
Prelude> :load ghcirun004.hs
[1 of 1] Compiling Main             ( ghcirun004.hs, interpreted )

and there it hangs... This would suggest that this testcase just happens to (luckily) tickle a use of mapM in the compiler. I'm going to recompile ghc with profiling and see if -xc doesn't suggest where this might be.

bgamari added a comment.EditedJul 24 2015, 4:04 PM

Here are some notes from earlier:

I should mention that technically ghci will eventually load ghcirun004 given enough time. As noted above, the culprit appears to be the change in mapM_,

mapM_ = traverse                         -- user: 5 min 38 sec
mapM_ f = foldr ((*>) . f) (return ())   -- user: 5 min 41 sec
mapM_ f = foldr ((>>) . f) (return ())   -- user: 4.4 sec

Unfortunately at the moment I still haven't the slightest idea which monad this is being applied to, so it's unclear how (*>) and (>>) differ.

libraries/base/Data/Foldable.hs
412

Intriguingly, reverting just this piece of the patch allows ghcirun004 to be loaded and run. Removing the INLINE has no effect.

Indeed the mapM_ instance in question appears to live in ByteCodeAsm in assembleBCO. Reverting it to use the previous definition of mapM_ results in normal load times for ghcirun004. The monad here is Assembler, defined in the same module.

Indeed defining (*>) = (>>) in the Applicative Assembler instance brings things back to sanity.

Hmm, isn't Phabricator supposed to pull in dependencies of a Diff when validating it? This validates on my machine with D1097 yet somehow the Harbormaster still fails.

bgamari commandeered this revision.Aug 3 2015, 11:07 AM
bgamari edited reviewers, added: hvr; removed: bgamari.

I'm going to abandon this to preserve it for posterity and merge the removal of the mapM override from [] in D1124.

This revision is now accepted and ready to land.Aug 3 2015, 11:07 AM
bgamari abandoned this revision.Aug 3 2015, 11:07 AM