See https://mail.haskell.org/pipermail/libraries/2015-May/025708.html for
This addresses Trac #10457
cryptarithm2 has this:
solve :: [[Char]] -> [Char] -> Int -> DigitState () solve tops (bot:bots) carry = do topN <- (case tops of  -> return carry (top:_) -> do topNS <- mapM select top return (sum topNS + carry)) botN <- select bot guard (topN `mod` 10 == botN) -- key optimization solve (rest tops) bots (topN `div` 10) where rest  =  rest (x:xs) = xs
There's a mapM there...
Here's a simplified test case that may be relevant:
Running with option 1 (traverse) does 100% more allocation than option 2 (mapM).
One interesting point is that switching out length for a hand-rolled recursive implementation puts them on par with one another. So the mapM version must be fusing with length somehow, perhaps? Marking both functions NOINLINE has little effect on test1 and makes test2 much worse than noinlined test1, although 2 does less copying than 1 in that case (instead of more when inlining is enabled).
Not sure about the exact mechanism here yet.
Now I'm not sure if my test case is demonstrating the same thing, because moving the mapM/traverse call out to a separate, noinline function in cryptarithm2 has no effect on the results.
Okay. After further investigation, I have some results to report.
My first clue was that when changing the code to use traverse, there is a call to map in the core, which of course isn't there in the mapM code. This was likely causing an extra intermediate list over the mapM version.
So, I hand-rolled an implementation of DigitState, implementing all the classes myself, and marking everything INLINE, performance is then identical between traverse and mapM, and it is better than using transformers.
With that in mind, I popped over to transformers, and looked at StateT. Marking fmap INLINE there had no effect. But, it looked suspicious to me that the Applicative definition just had (<*>) = ap. So, I hand-wrote an implementation and marked it INLINE. This was better than mapM. So, instead I just changed it back to (<*>) = ap and marked it INLINE. This then makes traverse identical in performance to mapM.
So, the discrepancy in performance in this benchmark is due to the StateT instance for Applicative. There is nothing fundamentally slower about traverse here.
Ok, so this makes sense. If <*> = ap, then traverse is the same as mapM, and will have the same performance provided <*> is inlined. If <*> is not inlined, then there might be an extra performance penalty compared with mapM.
I suspect your hand-written and inlined <*> with traverse beat mapM because >>= is not inlined.
This could end up pessimising existing code, but in a recoverable way. We should check whether it happens for IO.
Actually, now that I think about it, I think the allocation difference was that I implemented (<*>) for strict StateT instead of lazy. But it doesn't matter, because just inlining (<*>) = ap achieves parity.
IO seems to display no difference between traverse and mapM.
I think we're good to go here then. Just one thing that puzzles me: <*> = ap should be inlined automatically. Perhaps without the INLINE on <*> = ap we get ap inlined instead, and then <*> does not get inlined at call sites because it is too big. But that still wouldn't explain the difference, because we would only get an extra call, not extra allocation. Strange.
The only thing that makes sense to me is that it doesn't make the decision on whether to inline (<*>) until ap is inlined, at which point it's too large. I'm not sure I knew it worked that way.
The extra allocation is caused by missed fusion. Without inlining, there is a map (:) call that disappears when we inline (<*>) because they fuse together.
I'm looking deeper into these regressions. It seems that T1969 fails even without this patch so we can safely ignore it.
ghcirun004 (which runs only with the ghci way) is a bit odd. The program is essentially,
main = print (map foo [1,50..5000] ) foo :: Int -> Int foo 1 = 0 foo 2 = 1 ... foo 5000 = 4999 foo n = n + 1
With doesn't touch mapM at all. Compiling without --interactive produces an executable that works as one would expect.
Ahh, here's a hint,
$ "/mnt/work/ghc/ghc-d924/inplace/bin/ghc-stage2" --interactive GHCi, version 7.11.20150723: http://www.haskell.org/ghc/ :? for help Prelude> :load ghcirun004.hs [1 of 1] Compiling Main ( ghcirun004.hs, interpreted )
and there it hangs... This would suggest that this testcase just happens to (luckily) tickle a use of mapM in the compiler. I'm going to recompile ghc with profiling and see if -xc doesn't suggest where this might be.
Here are some notes from earlier:
I should mention that technically ghci will eventually load ghcirun004 given enough time. As noted above, the culprit appears to be the change in mapM_,
mapM_ = traverse -- user: 5 min 38 sec mapM_ f = foldr ((*>) . f) (return ()) -- user: 5 min 41 sec mapM_ f = foldr ((>>) . f) (return ()) -- user: 4.4 sec
Unfortunately at the moment I still haven't the slightest idea which monad this is being applied to, so it's unclear how (*>) and (>>) differ.
Intriguingly, reverting just this piece of the patch allows ghcirun004 to be loaded and run. Removing the INLINE has no effect.
Indeed the mapM_ instance in question appears to live in ByteCodeAsm in assembleBCO. Reverting it to use the previous definition of mapM_ results in normal load times for ghcirun004. The monad here is Assembler, defined in the same module.