1 | -------------------------------------------- |
---|---|

2 | Notes on nofib programs |

3 | -------------------------------------------- |

4 | |

5 | Every time I do performance tuning on GHC I re-discover wrinkles in |

6 | particular nofib programs. I intend to record these wrinkles herein. |

7 | |

8 | |

9 | NOTE: With 4.09 we introduced Unicode. That means that (C# x) always allocates, |

10 | whereas it didn't before. So allocations go up a bit. |

11 | |

12 | --------------------------------------- |

13 | Imaginary suite |

14 | --------------------------------------- |

15 | |

16 | integrate |

17 | ~~~~~~~~~ |

18 | integrate1D is strict in its second argument 'u', but it also passes 'u' to |

19 | function 'f'. Hence it now does some reboxing, which pushes up allocation |

20 | slightly. |

21 | |

22 | gen_regexps |

23 | ~~~~~~~~~~~ |

24 | I found that there were some very bad loss-of-arity cases in PrelShow. |

25 | In particular, we had: |

26 | |

27 | showl "" = showChar '"' s |

28 | showl ('"':xs) = showString "\\\"" . showl xs |

29 | showl (x:xs) = showLitChar x . showl xs |

30 | |

31 | Trouble is, we get |

32 | showl = \xs -> case xs of |

33 | ... |

34 | (x:xs) -> let f = showLitChar x |

35 | g = showl xs |

36 | in \s -> f (g s) |

37 | which is TERRIBLE. We can't spot that showLitChar has arity 2 because |

38 | it looks like this: |

39 | |

40 | ...other eqns... |

41 | showLitChar c = showString ('\\' : asciiTab!!ord c) |

42 | |

43 | notice that the (asciiTab!!orc c) is outside the \s, so GHC can't rewrite it to |

44 | |

45 | showLitChar c = \s -> showString ('\\' : asciiTab!!ord c) s |

46 | |

47 | So I've changed PrelShow.showLitChar to use explicit \s. Even then, showl |

48 | doesn't work, because GHC can't see that showl xs can be pushed inside the \s. |

49 | So I've put an explict \s there too. |

50 | |

51 | showl "" s = showChar '"' s |

52 | showl ('"':xs) s = showString "\\\"" (showl xs s) |

53 | showl (x:xs) s = showLitChar x (showl xs s) |

54 | |

55 | Net result: imaginary/gen_regexps more than halves in allocation! |

56 | |

57 | |

58 | x2n1 |

59 | ~~~~ |

60 | |

61 | Relies quite heavily on specialisations for Num (Complex Double). If |

62 | this test suddenly gets worse by a factor of 2 or so, chances are that |

63 | specialisation is broken. |

64 | |

65 | --------------------------------------- |

66 | Spectral suite |

67 | --------------------------------------- |

68 | |

69 | rewrite |

70 | ~~~~~~~ |

71 | It's important to inline p_ident. |

72 | |

73 | There's a very delicate CSE in p_expr |

74 | p_expr = seQ q_op [p_term1, p_op, p_term2] ## p_term3 |

75 | |

76 | (where all the pterm1,2,3 are really just p_term). |

77 | |

78 | This expands into |

79 | p_expr s = case p_term1 s of |

80 | Nothing -> p_term3 s |

81 | Just (v,s') -> case p_op s' of ... |

82 | Nothing -> p_term3 s |

83 | |

84 | Notice that if the bit before (##) fails, we call p_term3 s, and if |

85 | CSE notices we can re-use the result of the first call. But that |

86 | depends on how much inlining takes place. In particular, the follow-up |

87 | calls p_term3 can get hidden inside join points, which don't "see" the first |

88 | call. |

89 | |

90 | |

91 | |

92 | cse |

93 | ~~~ |

94 | The functions 'go' and 'go1', called from draw, get arity 1, but they |

95 | should get arity 2. We need a better arity analysis, a la Gill. 'go' looks |

96 | like this: |

97 | go_r1og = |

98 | \ (ds_aGI :: [GHC.Base.Char]) -> |

99 | case ds_aGI of wild_aGJ { |

100 | [] -> z_r1oe; |

101 | : y_aGN ys_aGO -> |

102 | let { |

103 | xs7_s1i8 :: GHC.Prim.Int# -> [GHC.Base.Char] |

104 | [Str: DmdType] |

105 | xs7_s1i8 = go_r1og ys_aGO |

106 | } in |

107 | \ (m_XWf :: GHC.Prim.Int#) -> |

108 | case GHC.Prim.<=# m_XWf 1 of wild1_aSI { |

109 | GHC.Base.False -> |

110 | GHC.Base.: @ GHC.Base.Char y_aGN (xs7_s1i8 (GHC.Prim.-# m_XWf 1)); |

111 | GHC.Base.True -> GHC.Base.: @ GHC.Base.Char y_aGN lvl13_r1n0 |

112 | } |

113 | } |

114 | Notice the 'let' which stops the lambda moving out. |

115 | |

116 | |

117 | Eliza |

118 | ~~~~~ |

119 | In June 2002, GHC 5.04 emitted four successive |

120 | NOTE: Simplifier still going after 4 iterations; bailing out. |

121 | messages. I suspect that the simplifer is looping somehow. |

122 | |

123 | |

124 | Expert |

125 | ~~~~~~ |

126 | In spectral/expert/Search.ask there's a statically visible CSE. Catching this |

127 | depends almost entirely on chance, which is a pity. |

128 | |

129 | |

130 | Fish |

131 | ~~~~ |

132 | The performance of fish depends crucially on inlining scale_vec2. |

133 | It turns out to be right on the edge of GHC's normal threshold size, so |

134 | small changes to the compiler can knock it to and fro. |

135 | |

136 | [Sept 08] The new instance-declaration story makes show slightly worse. |

137 | Look at Main.showl in the simplified core. You'll see this: |

138 | |

139 | a79_aRZ :: GHC.Types.Int |

140 | -> (GHC.Types.Int, GHC.Types.Int, GHC.Types.Int, GHC.Types.Int) |

141 | -> GHC.Show.ShowS |

142 | [Str: DmdType] |

143 | a79_aRZ = |

144 | GHC.Show.showsPrec9 @ Int @ Int @ Int @ Int |

145 | GHC.Show.$f16 GHC.Show.$f16 GHC.Show.$f16 GHC.Show.$f16 |

146 | |

147 | Rec { |

148 | showl_sWK :: [Main.Line_segment] -> String -> String |

149 | [Arity 2 |

150 | Str: DmdType SL] |

151 | showl_sWK = |

152 | \ (ds_dQJ :: [Main.Line_segment]) (s_alB :: GHC.Base.String) -> |

153 | case ds_dQJ of wild_X1d [ALWAYS Just A] { |

154 | [] -> GHC.Types.: @ GHC.Types.Char lvl_sYT s_alB; |

155 | : x_alD [ALWAYS Just L] xs_alF [ALWAYS Just L] -> |

156 | GHC.Base.++ |

157 | @ GHC.Types.Char |

158 | a_s12R |

159 | (a79_aRZ a_s1uS x_alD (showl_sWK xs_alF s_alB)) |

160 | } |

161 | end Rec } |

162 | |

163 | Note the non-inlined call to GHC.Show.showsPrec9, which looks like this: |

164 | showsPrec9 :: forall a b c d. (Show a, Show b, Show c, Show d) => |

165 | GHC.Types.Int -> (a, b, c, d) -> GHC.Show.ShowS |

166 | {- Arity: 4 Strictness: LLLL |

167 | Unfolding: (\ @ a b c d |

168 | $dShow :: GHC.Show.Show a |

169 | $dShow1 :: GHC.Show.Show b |

170 | $dShow2 :: GHC.Show.Show c |

171 | $dShow3 :: GHC.Show.Show d -> |

172 | let { |

173 | shows1 :: d -> GHC.Show.ShowS = GHC.Show.shows @ d $dShow3 |

174 | } in |

175 | let { |

176 | shows2 :: c -> GHC.Show.ShowS = GHC.Show.shows @ c $dShow2 |

177 | } in |

178 | let { |

179 | shows3 :: b -> GHC.Show.ShowS = GHC.Show.shows @ b $dShow1 |

180 | } in |

181 | let { |

182 | shows4 :: a -> GHC.Show.ShowS = GHC.Show.shows @ a $dShow |

183 | } in |

184 | \ ds :: GHC.Types.Int ds1 :: (a, b, c, d) s :: GHC.Base.String -> |

185 | case @ GHC.Base.String ds1 of wild { (a79, b, c, d) -> |

186 | GHC.Show.show_tuple |

187 | (GHC.Types.: |

188 | @ GHC.Show.ShowS |

189 | (shows4 a79) |

190 | (GHC.Types.: |

191 | @ GHC.Show.ShowS |

192 | (shows3 b) |

193 | (GHC.Types.: |

194 | @ GHC.Show.ShowS |

195 | (shows2 c) |

196 | (GHC.Types.: |

197 | @ GHC.Show.ShowS |

198 | (shows1 d) |

199 | (GHC.Types.[] @ GHC.Show.ShowS))))) |

200 | s }) -} |

201 | |

202 | We would do better to inpline showsPrec9 but it looks too big. Before |

203 | it was inlined regardless by the instance-decl stuff. So perf drops slightly. |

204 | |

205 | |

206 | Integer |

207 | ~~~~~~~ |

208 | A good benchmark for beating on big-integer arithmetic |

209 | |

210 | Knights |

211 | ~~~~~~~ |

212 | In knights/KnightHeuristic, we don't find that possibleMoves is strict |

213 | (with important knock-on effects) unless we apply rules before floating |

214 | out the literal list [A,B,C...]. |

215 | Similarly, in f_se (F_Cmp ...) in listcompr (but a smaller effect) |

216 | |

217 | |

218 | Lambda |

219 | ~~~~~~ |

220 | This program shows the cost of the non-eta-expanded lambdas that arise from |

221 | a state monad. |

222 | |

223 | mandel2 |

224 | ~~~~~~~ |

225 | check_perim's several calls to point_colour lead to opportunities for CSE |

226 | which may be more or less well taken. |

227 | |

228 | Mandel |

229 | ~~~~~~ |

230 | Relies heavily on having a specialised version of Complex.magnitude |

231 | (:: Complex Double -> Double) available. |

232 | |

233 | Mandel.mandelset has a go-loop inside an enumDeltaToIntegerFB, out of which |

234 | 4.08.2 managed to float a constant expression, but HEAD did not. I think |

235 | this is because the pre-let-floating simplification did too little inlining; |

236 | in particular, it did not inline windowToViewport |

237 | |

238 | |

239 | Multiplier |

240 | ~~~~~~~~~~ |

241 | In spectral/multiplier, we have |

242 | xor = lift21 forceBit f |

243 | where f :: Bit -> Bit -> Bit |

244 | f 0 0 = 0 |

245 | f 0 1 = 1 |

246 | f 1 0 = 1 |

247 | f 1 1 = 0 |

248 | Trouble is, f is CPR'd, and that means that instead of returning |

249 | the constants I# 0, I# 1, it returns 0,1 and then boxes them. |

250 | So allocation goes up. I don't see a way around this. |

251 | |

252 | |

253 | Parstof |

254 | ~~~~~~~ |

255 | spectral/hartel/parstof ends up saying |

256 | case (unpackCString "x") of { c:cs -> ... } |

257 | quite a bit. We should spot these and behave accordingly. |

258 | |

259 | |

260 | Power |

261 | ~~~~~ |

262 | With GHC 4.08, for some reason the arithmetic defaults to Double. The |

263 | right thing is to default to Rational, which accounts for the big increase |

264 | in runtime after 4.08 |

265 | |

266 | |

267 | Puzzle |

268 | ~~~~~~ |

269 | The main function is 'transfer'. It has some complicated join points, and |

270 | a big issue is the full laziness can float out many small MFEs that then |

271 | make much bigger closures. It's quite delicate: small changes can make |

272 | big differences, and I spent far too long gazing at it. |

273 | |

274 | I found that in my experimental proto 4.09 compiler I had |

275 | |

276 | let ds = go xs in |

277 | let $j = .... ds ... in |

278 | case e of |

279 | p1 -> $j |

280 | p2 -> $j |

281 | p3 -> ds |

282 | |

283 | But ds wasn't getting inlined because we couldn't spot that it was actually |

284 | only being used once. Keith's analyser should find this! |

285 | |

286 | |

287 | Also, making concat into a good producer made a large gain. |

288 | |

289 | My proto 4.09 still allocates more, partly because of more full laziness relative |

290 | to 4.08; I don't know why that happens |

291 | |

292 | Extra allocation is happening in 5.02 as well; perhaps for the same reasons. There is |

293 | at least one instance of floating that prevents fusion; namely the enumerated lists |

294 | in 'transfer'. |

295 | |

296 | Sphere |

297 | ~~~~~~ |

298 | A key function is vecsub, which looks like this (after w/w) |

299 | |

300 | $wvecsub |

301 | = \ ww :: Double ww1 :: Double ww2 :: Double |

302 | ww3 :: Double ww4 :: Double ww5 :: Double -> |

303 | let { a = case ww of wild { D# x -> |

304 | case ww3 of wild1 { D# y -> |

305 | let { a1 = -## x y |

306 | } in $wD# a1 |

307 | } } } in |

308 | let { a1 = case ww1 of wild { D# x -> |

309 | case ww4 of wild1 { D# y -> |

310 | let { a2 = -## x y |

311 | } in $wD# a2 |

312 | } } } in |

313 | let { a2 = case ww2 of wild { D# x -> |

314 | case ww5 of wild1 { D# y -> |

315 | let { a3 = -## x y |

316 | } in $wD# a3 |

317 | } } |

318 | } in (# a, a1, a2 #) |

319 | |

320 | Currently it gets guidance: IF_ARGS 6 [2 2 2 2 2 2] 25 4 |

321 | It occurs in a context like |

322 | |

323 | case $wvecsub a b c d e f of (# p, q, r #) -> |

324 | case p of D# p' -> |

325 | case q of D# q' -> |

326 | case r of D# r' -> |

327 | |

328 | So it would be much, much better to inline it. But the result-discounting |

329 | doesn't "see" that the three results are each taken apart, and that's the |

330 | problem. |

331 | |

332 | One question is this: why did minusDouble get inlined? If it didn't then |

333 | $vecsub would look much smaller: |

334 | |

335 | $wvecsub |

336 | = \ ww :: Double ww1 :: Double ww2 :: Double |

337 | ww3 :: Double ww4 :: Double ww5 :: Double -> |

338 | let { a = minusDouble ww ww3 } in |

339 | let { a1 = minusDouble ww1 ww4 } in |

340 | let { a2 = minusDouble ww2 ww5 } in |

341 | (# a, a1, a2 #) |

342 | |

343 | Reason minusDouble gets inlined: because we are trying to get update in place. |

344 | So I added a flag -funfolding-update-in-place to enable this "optimisation". |

345 | Omitting the flag gives much better inlining for $wvecsub at least. |

346 | |

347 | |

348 | Sphere also does 60,000 calls to hPutStr, so I/O plays a major role. Currently |

349 | this I/O does a *lot* of allocation, much of it since the adddition of thread-safety. |

350 | |

351 | |

352 | treejoin |

353 | ~~~~~~~~ |

354 | Does a lot of IO.readFile. In GHC.IO.Encoding.UTF8 the demand |

355 | analyser sees a strict function with type |

356 | a_s1gj :: GHC.IO.Buffer.Buffer GHC.Word.Word8 |

357 | -> GHC.IO.Buffer.Buffer GHC.Types.Char |

358 | -> GHC.Prim.State# GHC.Prim.RealWorld |

359 | -> (# GHC.Prim.State# GHC.Prim.RealWorld, |

360 | (GHC.IO.Encoding.Types.CodingProgress, |

361 | GHC.IO.Buffer.Buffer GHC.Word.Word8, |

362 | GHC.IO.Buffer.Buffer GHC.Types.Char) #) |

363 | Unboxing both Buffer arguments makes a HUGE difference (halves |

364 | allocation); but that makes the worker function have 12 arguments. A |

365 | good reason for unboxing even if the worker gets a lot of args. |

366 | |

367 | sorting |

368 | ~~~~~~~ |

369 | Same issue with GHC.IO.Encoding.UTF8 as treejoin |

370 | |

371 | |

372 | --------------------------------------- |

373 | Real suite |

374 | --------------------------------------- |

375 | |

376 | gg |

377 | ~~ |

378 | Same issue with GHC.IO.Encoding.UTF8 as treejoin |

379 | |

380 | Maillist |

381 | ~~~~~~~~ |

382 | Uses appendFile repeatedly rather than opening the output file once, |

383 | which leads to numerous file opens/closes. Allocations will rise with |

384 | the new I/O subsystem in 5.02 because the I/O buffer will be |

385 | re-allocated on the heap for each open, whereas previously it would be |

386 | allocated on the C heap and therefore not show up in the stats. |

387 | |

388 | --------------------------------------- |

389 | Shootout suite |

390 | --------------------------------------- |

391 | |

392 | binary-tree |

393 | ~~~~~~~~~~~ |

394 | |

395 | In May 2016, a series of seemingly unrelated commits changed the runtime |

396 | performance of this up and down by ~3%. Maybe a performance cliff. Mailinglist |

397 | thread: |

398 | https://mail.haskell.org/pipermail/ghc-devs/2017-March/013886.html |

# nofib/Simon-nofib-notes

nofib/Simon-nofib-notes

Owners

*None*