nofib/Simon-nofib-notes

--------------------------------------------
 Notes on nofib programs
--------------------------------------------

Every time I do performance tuning on GHC I re-discover wrinkles in
particular nofib programs. I intend to record these wrinkles herein.


NOTE: With 4.09 we introduced Unicode. That means that (C# x) always allocates,
whereas it didn't before. So allocations go up a bit.

---------------------------------------
 Imaginary suite
---------------------------------------

integrate
~~~~~~~~~
integrate1D is strict in its second argument 'u', but it also passes 'u' to
function 'f'. Hence it now does some reboxing, which pushes up allocation
slightly.
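A minimal sketch of the pattern (this is not the actual nofib source; the trapezoid body is assumed): the arithmetic forces 'u', so after worker/wrapper the worker receives an unboxed Double, yet 'u' must be re-boxed for the call to 'f'.

```haskell
-- Hedged sketch (not the real integrate1D): strict in 'u' via the
-- arithmetic, but 'u' is also passed, boxed, to the function argument 'f'.
-- After worker/wrapper, the worker gets an unboxed 'u' and must re-box it
-- at the call of 'f' -- the "reboxing" that costs a little allocation.
integrate1D :: Double -> Double -> (Double -> Double) -> Double
integrate1D l u f = (u - l) * (f l + f u) / 2

main :: IO ()
main = print (integrate1D 0 1 id)   -- trapezoid rule for 'id' over [0,1]
```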

gen_regexps
~~~~~~~~~~~
I found that there were some very bad loss-of-arity cases in PrelShow.
In particular, we had:

    showl ""       = showChar '"'
    showl ('"':xs) = showString "\\\"" . showl xs
    showl (x:xs)   = showLitChar x . showl xs

Trouble is, we get
    showl = \xs -> case xs of
              ...
              (x:xs) -> let f = showLitChar x
                            g = showl xs
                        in \s -> f (g s)
which is TERRIBLE. We can't spot that showLitChar has arity 2 because
it looks like this:

    ...other eqns...
    showLitChar c = showString ('\\' : asciiTab!!ord c)

Notice that the (asciiTab!!ord c) is outside the \s, so GHC can't rewrite it to

    showLitChar c = \s -> showString ('\\' : asciiTab!!ord c) s

So I've changed PrelShow.showLitChar to use an explicit \s. Even then, showl
doesn't work, because GHC can't see that showl xs can be pushed inside the \s.
So I've put an explicit \s there too:

    showl ""       s = showChar '"' s
    showl ('"':xs) s = showString "\\\"" (showl xs s)
    showl (x:xs)   s = showLitChar x (showl xs s)

Net result: imaginary/gen_regexps more than halves in allocation!


x2n1
~~~~

Relies quite heavily on specialisations for Num (Complex Double). If
this test suddenly gets worse by a factor of 2 or so, chances are that
specialisation is broken.
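To illustrate what the test depends on, here is a hedged sketch (the function `sumPowers` is made up, not taken from the benchmark) of overloaded arithmetic that must be specialised to Complex Double, with the kind of SPECIALISE pragma whose effect the benchmark relies on getting automatically:

```haskell
import Data.Complex

-- Hedged sketch, not the benchmark source: an overloaded loop that needs
-- a Num (Complex Double) specialisation to run at full speed.  Without
-- it, every (+) and (*) goes through a dictionary.
{-# SPECIALISE sumPowers :: Complex Double -> Int -> Complex Double #-}
sumPowers :: Num a => a -> Int -> a
sumPowers x n = sum [ x ^ k | k <- [1 .. n] ]

main :: IO ()
main = print (sumPowers (0 :+ 1) 4)   -- i + i^2 + i^3 + i^4
```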

---------------------------------------
 Spectral suite
---------------------------------------

rewrite
~~~~~~~
It's important to inline p_ident.

There's a very delicate CSE in p_expr:
    p_expr = seQ q_op [p_term1, p_op, p_term2] ## p_term3

(where p_term1, p_term2 and p_term3 are all really just p_term).

This expands into
    p_expr s = case p_term1 s of
                 Nothing -> p_term3 s
                 Just (v,s') -> case p_op s' of ...
                   Nothing -> p_term3 s

Notice that if the bit before (##) fails, we call p_term3 s; if CSE
notices, we can re-use the result of the first call. But that
depends on how much inlining takes place. In particular, the follow-up
calls to p_term3 can get hidden inside join points, which don't "see" the
first call.
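The shapes involved can be sketched with ordinary Maybe-based combinators (the definitions below are assumed for illustration, not taken from the rewrite source): `##` tries its left parser and falls back to the right one on the same input, which is why `p_term3 s` appears in two branches of the expansion.

```haskell
-- Hedged sketch of the combinator shapes (assumed definitions, not the
-- actual rewrite source).
type Parser a = String -> Maybe (a, String)

-- Alternation: if p fails, fall back to q on the *same* input string,
-- which is why the expansion of p_expr calls p_term3 s in two branches.
(##) :: Parser a -> Parser a -> Parser a
(p ## q) s = case p s of
               Nothing -> q s
               r       -> r

item :: Parser Char          -- consume one character
item (c:cs) = Just (c, cs)
item _      = Nothing

failp :: Parser Char         -- always fails, forcing the fallback
failp _ = Nothing

main :: IO ()
main = print ((failp ## item) "ab")
```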



cse
~~~
The functions 'go' and 'go1', called from draw, get arity 1, but they
should get arity 2. We need a better arity analysis, a la Gill. 'go' looks
like this:
    go_r1og =
      \ (ds_aGI :: [GHC.Base.Char]) ->
        case ds_aGI of wild_aGJ {
          [] -> z_r1oe;
          : y_aGN ys_aGO ->
            let {
              xs7_s1i8 :: GHC.Prim.Int# -> [GHC.Base.Char]
              [Str: DmdType]
              xs7_s1i8 = go_r1og ys_aGO
            } in
              \ (m_XWf :: GHC.Prim.Int#) ->
                case GHC.Prim.<=# m_XWf 1 of wild1_aSI {
                  GHC.Base.False ->
                    GHC.Base.: @ GHC.Base.Char y_aGN (xs7_s1i8 (GHC.Prim.-# m_XWf 1));
                  GHC.Base.True -> GHC.Base.: @ GHC.Base.Char y_aGN lvl13_r1n0
                }
        }
Notice the 'let' which stops the lambda moving out.


Eliza
~~~~~
In June 2002, GHC 5.04 emitted four successive
    NOTE: Simplifier still going after 4 iterations; bailing out.
messages. I suspect that the simplifier is looping somehow.


Expert
~~~~~~
In spectral/expert/Search.ask there's a statically visible CSE. Catching this
depends almost entirely on chance, which is a pity.


Fish
~~~~
The performance of fish depends crucially on inlining scale_vec2.
It turns out to be right on the edge of GHC's normal threshold size, so
small changes to the compiler can knock it to and fro.

[Sept 08] The new instance-declaration story makes show slightly worse.
Look at Main.showl in the simplified core. You'll see this:

    a79_aRZ :: GHC.Types.Int
               -> (GHC.Types.Int, GHC.Types.Int, GHC.Types.Int, GHC.Types.Int)
               -> GHC.Show.ShowS
    [Str: DmdType]
    a79_aRZ =
      GHC.Show.showsPrec9 @ Int @ Int @ Int @ Int
        GHC.Show.$f16 GHC.Show.$f16 GHC.Show.$f16 GHC.Show.$f16

    Rec {
    showl_sWK :: [Main.Line_segment] -> String -> String
    [Arity 2
     Str: DmdType SL]
    showl_sWK =
      \ (ds_dQJ :: [Main.Line_segment]) (s_alB :: GHC.Base.String) ->
        case ds_dQJ of wild_X1d [ALWAYS Just A] {
          [] -> GHC.Types.: @ GHC.Types.Char lvl_sYT s_alB;
          : x_alD [ALWAYS Just L] xs_alF [ALWAYS Just L] ->
            GHC.Base.++
              @ GHC.Types.Char
              a_s12R
              (a79_aRZ a_s1uS x_alD (showl_sWK xs_alF s_alB))
        }
    end Rec }

Note the non-inlined call to GHC.Show.showsPrec9, which looks like this:
    showsPrec9 :: forall a b c d. (Show a, Show b, Show c, Show d) =>
                  GHC.Types.Int -> (a, b, c, d) -> GHC.Show.ShowS
      {- Arity: 4 Strictness: LLLL
         Unfolding: (\ @ a b c d
                       $dShow :: GHC.Show.Show a
                       $dShow1 :: GHC.Show.Show b
                       $dShow2 :: GHC.Show.Show c
                       $dShow3 :: GHC.Show.Show d ->
                     let {
                       shows1 :: d -> GHC.Show.ShowS = GHC.Show.shows @ d $dShow3
                     } in
                     let {
                       shows2 :: c -> GHC.Show.ShowS = GHC.Show.shows @ c $dShow2
                     } in
                     let {
                       shows3 :: b -> GHC.Show.ShowS = GHC.Show.shows @ b $dShow1
                     } in
                     let {
                       shows4 :: a -> GHC.Show.ShowS = GHC.Show.shows @ a $dShow
                     } in
                     \ ds :: GHC.Types.Int ds1 :: (a, b, c, d) s :: GHC.Base.String ->
                     case @ GHC.Base.String ds1 of wild { (a79, b, c, d) ->
                     GHC.Show.show_tuple
                       (GHC.Types.:
                          @ GHC.Show.ShowS
                          (shows4 a79)
                          (GHC.Types.:
                             @ GHC.Show.ShowS
                             (shows3 b)
                             (GHC.Types.:
                                @ GHC.Show.ShowS
                                (shows2 c)
                                (GHC.Types.:
                                   @ GHC.Show.ShowS
                                   (shows1 d)
                                   (GHC.Types.[] @ GHC.Show.ShowS)))))
                     s }) -}

We would do better to inline showsPrec9, but it looks too big. Before,
it was inlined regardless by the instance-decl stuff. So perf drops slightly.


Integer
~~~~~~~
A good benchmark for beating on big-integer arithmetic.

Knights
~~~~~~~
In knights/KnightHeuristic, we don't find that possibleMoves is strict
(with important knock-on effects) unless we apply rules before floating
out the literal list [A,B,C...].
Similarly for f_se (F_Cmp ...) in listcompr, but with a smaller effect.


Lambda
~~~~~~
This program shows the cost of the non-eta-expanded lambdas that arise from
a state monad.
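A hedged, self-contained sketch of where such lambdas come from (the `State`/`bindS` definitions below are the standard textbook ones, assumed rather than taken from the benchmark): every bind builds a `\s -> ...` closure, and unless GHC can eta-expand these lambdas, each bind in a chain allocates a function closure at runtime.

```haskell
-- Hedged sketch: the standard state-monad shape (assumed, not the
-- benchmark's own code).
newtype State s a = State { runState :: s -> (a, s) }

bindS :: State s a -> (a -> State s b) -> State s b
bindS m k = State $ \s ->              -- this \s is the lambda in question:
  case runState m s of                 -- it is built afresh at every bind
    (a, s') -> runState (k a) s'       -- unless GHC eta-expands it away

tick :: State Int Int                  -- return the state, then bump it
tick = State $ \s -> (s, s + 1)

main :: IO ()
main = print (runState (tick `bindS` \a ->
                        tick `bindS` \b ->
                        State (\s -> (a + b, s))) 0)
```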

mandel2
~~~~~~~
check_perim's several calls to point_colour lead to opportunities for CSE
which may be more or less well taken.

Mandel
~~~~~~
Relies heavily on having a specialised version of Complex.magnitude
(:: Complex Double -> Double) available.

Mandel.mandelset has a go-loop inside an enumDeltaToIntegerFB, out of which
4.08.2 managed to float a constant expression, but HEAD did not. I think
this is because the pre-let-floating simplification did too little inlining;
in particular, it did not inline windowToViewport.


Multiplier
~~~~~~~~~~
In spectral/multiplier, we have
    xor = lift21 forceBit f
      where f :: Bit -> Bit -> Bit
            f 0 0 = 0
            f 0 1 = 1
            f 1 0 = 1
            f 1 1 = 0
Trouble is, f is CPR'd, and that means that instead of returning
the constants I# 0, I# 1, it returns 0,1 and then boxes them.
So allocation goes up. I don't see a way around this.


Parstof
~~~~~~~
spectral/hartel/parstof ends up saying
    case (unpackCString "x") of { c:cs -> ... }
quite a bit. We should spot these and behave accordingly.


Power
~~~~~
With GHC 4.08, for some reason the arithmetic defaults to Double. The
right thing is to default to Rational, which accounts for the big increase
in runtime after 4.08.


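The defaulting difference can be seen with a `default` declaration; a hedged sketch (the expression is illustrative, not from the benchmark):

```haskell
-- Hedged illustration of the defaulting difference (not benchmark code).
-- With 'default (Rational)', ambiguous numeric literals resolve to
-- Rational, giving exact (but slower) arithmetic; under the usual
-- default they would be Integer/Double.
default (Rational)

main :: IO ()
main = print (2 / 3 + 1 / 6)   -- exact at Rational: prints 5 % 6
```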
Puzzle
~~~~~~
The main function is 'transfer'. It has some complicated join points, and
a big issue is that full laziness can float out many small MFEs that then
make much bigger closures. It's quite delicate: small changes can make
big differences, and I spent far too long gazing at it.

I found that in my experimental proto 4.09 compiler I had

    let ds = go xs in
    let $j = .... ds ... in
    case e of
      p1 -> $j
      p2 -> $j
      p3 -> ds

But ds wasn't getting inlined because we couldn't spot that it was actually
only being used once. Keith's analyser should find this!


Also, making concat into a good producer made a large gain.

My proto 4.09 still allocates more, partly because of more full laziness relative
to 4.08; I don't know why that happens.

Extra allocation is happening in 5.02 as well, perhaps for the same reasons. There is
at least one instance of floating that prevents fusion; namely the enumerated lists
in 'transfer'.

Sphere
~~~~~~
A key function is vecsub, which looks like this (after w/w):

$wvecsub
  = \ ww :: Double ww1 :: Double ww2 :: Double
      ww3 :: Double ww4 :: Double ww5 :: Double ->
      let { a = case ww of wild { D# x ->
                case ww3 of wild1 { D# y ->
                let { a1 = -## x y
                } in $wD# a1
                } } } in
      let { a1 = case ww1 of wild { D# x ->
                 case ww4 of wild1 { D# y ->
                 let { a2 = -## x y
                 } in $wD# a2
                 } } } in
      let { a2 = case ww2 of wild { D# x ->
                 case ww5 of wild1 { D# y ->
                 let { a3 = -## x y
                 } in $wD# a3
                 } }
          } in (# a, a1, a2 #)

Currently it gets guidance: IF_ARGS 6 [2 2 2 2 2 2] 25 4
It occurs in a context like

    case $wvecsub a b c d e f of (# p, q, r #) ->
    case p of D# p' ->
    case q of D# q' ->
    case r of D# r' ->

So it would be much, much better to inline it. But the result-discounting
doesn't "see" that the three results are each taken apart, and that's the
problem.

One question is this: why did minusDouble get inlined? If it didn't, then
$wvecsub would look much smaller:

$wvecsub
  = \ ww :: Double ww1 :: Double ww2 :: Double
      ww3 :: Double ww4 :: Double ww5 :: Double ->
      let { a = minusDouble ww ww3 } in
      let { a1 = minusDouble ww1 ww4 } in
      let { a2 = minusDouble ww2 ww5 } in
      (# a, a1, a2 #)

Reason minusDouble gets inlined: because we are trying to get update in place.
So I added a flag -funfolding-update-in-place to enable this "optimisation".
Omitting the flag gives much better inlining for $wvecsub, at least.


Sphere also does 60,000 calls to hPutStr, so I/O plays a major role. Currently
this I/O does a *lot* of allocation, much of it since the addition of thread-safety.


treejoin
~~~~~~~~
Does a lot of IO.readFile. In GHC.IO.Encoding.UTF8 the demand
analyser sees a strict function with type
    a_s1gj :: GHC.IO.Buffer.Buffer GHC.Word.Word8
              -> GHC.IO.Buffer.Buffer GHC.Types.Char
              -> GHC.Prim.State# GHC.Prim.RealWorld
              -> (# GHC.Prim.State# GHC.Prim.RealWorld,
                    (GHC.IO.Encoding.Types.CodingProgress,
                     GHC.IO.Buffer.Buffer GHC.Word.Word8,
                     GHC.IO.Buffer.Buffer GHC.Types.Char) #)
Unboxing both Buffer arguments makes a HUGE difference (halves
allocation); but that makes the worker function have 12 arguments. A
good reason for unboxing even if the worker gets a lot of args.

sorting
~~~~~~~
Same issue with GHC.IO.Encoding.UTF8 as treejoin.


---------------------------------------
 Real suite
---------------------------------------

gg
~~
Same issue with GHC.IO.Encoding.UTF8 as treejoin.

Maillist
~~~~~~~~
Uses appendFile repeatedly rather than opening the output file once,
which leads to numerous file opens/closes. Allocations will rise with
the new I/O subsystem in 5.02 because the I/O buffer will be
re-allocated on the heap for each open, whereas previously it would be
allocated on the C heap and therefore not show up in the stats.
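A hedged sketch of the two patterns (the helper names and temp-file path are made up, not from the benchmark): each appendFile call opens, buffers, and closes the file, whereas a single withFile keeps one handle, and hence from 5.02 on one heap-allocated buffer, for the whole run.

```haskell
-- Hedged sketch (helper names and the temp-file path are made up, not
-- from the benchmark).  'perLine' mirrors Maillist's pattern: one
-- open/close -- and hence one freshly allocated I/O buffer -- per call.
-- 'once' opens the handle a single time.
import System.IO

perLine :: FilePath -> [String] -> IO ()
perLine path = mapM_ (\l -> appendFile path (l ++ "\n"))

once :: FilePath -> [String] -> IO ()
once path ls = withFile path AppendMode (\h -> mapM_ (hPutStrLn h) ls)

main :: IO ()
main = do
  let path = "/tmp/maillist-demo.txt"  -- hypothetical scratch file
  writeFile path ""                    -- start from an empty file
  perLine path ["a", "b"]              -- two opens, two closes
  once    path ["c", "d"]              -- one open, one close
  readFile path >>= putStr             -- a, b, c, d on separate lines
```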