Remove explicit recursion in retainer profiling (fixes #14758)
ClosedPublic

Authored by qnikst on Sun, Nov 18, 7:37 AM.

Details

Summary

Retainer profiling contained a recursion that under
certain circumstances could lead to the stack overflow
in C code.

The idea of the improvement is to keep an explicit stack for the
object, more precise to reuse existing stack, but allow new type of
objects to be stored there.

There is no reliable reproducer that is not a big program
but in some cases foldr (+) 0 [1..10000000] can work.

Diff Detail

Repository
rGHC Glasgow Haskell Compiler
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.
qnikst created this revision.Sun, Nov 18, 7:37 AM
osa1 added a subscriber: osa1.Sun, Nov 18, 7:48 AM

Thanks! One thing that reliably triggered this problem in the past (at least on my system) was running profiled GHCi (build profiled GHC using the prof flavor, then run ghc-stage2 --interactive +RTS -hr). Could you give it a try without this patch to see if it still fails, and then try with this patch?

osa1 added a comment.Sun, Nov 18, 7:54 AM

I just remembered the full command I was using as a test for this bug: ghc-stage2 --interactive +RTS -hr <<(echo 'sequence_ (replicate 1000000 (return ()))')

In D5351#147264, @osa1 wrote:

Thanks! One thing that reliably triggered this problem in the past (at least on my system) was running profiled GHCi (build profiled GHC using the prof flavor, then run ghc-stage2 --interactive +RTS -hr). Could you give it a try without this patch to see if it still fails, and then try with this patch?

I've tried exactly that, but the program that I have used didn't reliably fail before the patch. Though it has always failed when run under rr.
With the patch it has never failed neither directly nor when run under rr. I have not added the program in the testsuite because it depends on
timeout and not always pinpoints the problem.

In D5351#147266, @osa1 wrote:

I just remembered the full command I was using as a test for this bug: ghc-stage2 --interactive +RTS -hr <<(echo 'sequence_ (replicate 1000000 (return ()))')

Thanks! will report once do that.

osa1 added a comment.Sun, Nov 18, 8:05 AM
In D5351#147264, @osa1 wrote:

Thanks! One thing that reliably triggered this problem in the past (at least on my system) was running profiled GHCi (build profiled GHC using the prof flavor, then run ghc-stage2 --interactive +RTS -hr). Could you give it a try without this patch to see if it still fails, and then try with this patch?

I've tried exactly that, but the program that I have used didn't reliably fail before the patch. Though it has always failed when run under rr.
With the patch it has never failed neither directly nor when run under rr. I have not added the program in the testsuite because it depends on
timeout and not always pinpoints the problem.

Great! I'll give this a read later today unless someone else beats me to it.

osa1 accepted this revision.Sun, Nov 18, 11:27 PM
In D5351#147266, @osa1 wrote:

I just remembered the full command I was using as a test for this bug: ghc-stage2 --interactive +RTS -hr <<(echo 'sequence_ (replicate 1000000 (return ()))')

Thanks! will report once do that.

I just tried this and it works even with larger values (tried 50000000), so this seems to work as expected.

Also looked at the code; it's a straightforward refactoring that replaces retainClosure calls with stack pushes (using a new stack entry type), so LGTM. I dropped a few inline comments though.

rts/RetainerProfile.c
399

objects -> object's

413

This code is duplicated from push(), could you put this into a new function (maybe call it actualPush or something like that) and call it in pushClosure() and push() ?

794

Why not just if (se->info.type == posTypeFresh) ?

796

For consistency could you indent this block? This file uses 4 spaces for indentation.

This revision is now accepted and ready to land.Sun, Nov 18, 11:27 PM
qnikst updated this revision to Diff 18763.Mon, Nov 19, 5:17 AM
  • Address review comments.
qnikst marked 4 inline comments as done.Tue, Nov 20, 12:54 PM

I've addressed all of the comments, please tell me if I need so cleanup anything else there.

rts/RetainerProfile.c
794

The previous version of the patch was a bit more elaborated and I had 2 cases there, so used switch, thanks for pointing that out.

bgamari requested changes to this revision.Thu, Nov 22, 10:49 AM

The implementation looks fine but I recall we discussed adding a note. What happened to this? It is badly needed.

This revision now requires changes to proceed.Thu, Nov 22, 10:49 AM
qnikst marked an inline comment as done.Thu, Nov 22, 1:02 PM

The implementation looks fine but I recall we discussed adding a note. What happened to this? It is badly needed.

I'm not sure I remember the exact context of the note we wanted to add. I remember that the first implementation
required that, but what should I add now?

In general this file is terribly under-documented. In particular, there is no discussion of what the various stack entry types mean.

In general this file is terribly under-documented. In particular, there is no discussion of what the various stack entry types mean.

Ah! I have naively hoped I could do that afterwards. I'll try to do some archaeology and investigation and document that.

qnikst updated this revision to Diff 18956.Fri, Nov 30, 1:32 PM
qnikst marked an inline comment as not done.
  • Add a note about retainers.

@bgamari please take another look, I've updated documentation a bit and have added a general information.
I was not able to dig why do we keep mutable objects as a retainer, but I have decided not to change semantics there.

bgamari added inline comments.Fri, Nov 30, 1:44 PM
rts/RetainerProfile.c
54

I think you are missing a preposition here. "on the path" perhaps?

145

Thank you for documenting the new stack entry type. However,m I'm still worried that the others are completely undocumented. Given that you have this paged in, could you add a comment for each?

bgamari added inline comments.Fri, Nov 30, 1:52 PM
rts/RetainerProfile.c
398

I wonder if we in fact want to inline this. It seems like a pretty large function.

qnikst updated this revision to Diff 18958.Fri, Nov 30, 2:03 PM
  • Remove INLINE on actualPush function
qnikst marked 2 inline comments as done.Fri, Nov 30, 2:13 PM
bgamari accepted this revision.Mon, Dec 3, 3:27 PM

Looks good. Thanks!

This revision is now accepted and ready to land.Mon, Dec 3, 3:27 PM
osa1 updated this revision to Diff 19006.Tue, Dec 4, 11:24 AM
  • Rebase
  • Comments
osa1 added a comment.Tue, Dec 4, 11:25 AM

Testing this one last time before merging ...

rts/RetainerProfile.c
670–671

I think this comment should be moved to actualPush.

qnikst added inline comments.Tue, Dec 4, 11:33 AM
rts/RetainerProfile.c
670–671

the // If the size of stackElement was huge, we would better replace the part was moved

and

// ToDo: The line below leads to the warning:

should still live here, should I restore it?

osa1 added inline comments.Tue, Dec 4, 11:35 AM
rts/RetainerProfile.c
670–671

Does it still lead to a warning? I don't think this currently generates a warning because we don't allow warnings when validating, and this validated the last time I checked (I'm validating it again now).

qnikst added inline comments.Tue, Dec 4, 11:38 AM
rts/RetainerProfile.c
670–671

I'm not sure actually, the tools I've used didn't give me any warning. But maybe it was some other external static analyzer. Actually, at this point, I'm not sure that todo actually gives much information.

osa1 added a comment.Tue, Dec 4, 11:43 AM

Well it segfaults now. I'm currently in the debugger, hopefully it's an easy one.

osa1 requested changes to this revision.Tue, Dec 4, 9:57 PM

Moving this out of the merge queue until we fix the segfault.

This revision now requires changes to proceed.Tue, Dec 4, 9:57 PM
qnikst updated this revision to Diff 19015.Wed, Dec 5, 5:25 AM
  • Fixes.
qnikst added a comment.Wed, Dec 5, 5:27 AM

@osa1 , @bgamari please take another look, the problem appears to be in the removing INLINE from the actual push. While I don't understand the reasons for that
I was able to find that while bisecting. In addition, I have renamed pushClosure to name the name unique in the project.

qnikst added a comment.Wed, Dec 5, 5:31 AM

ok, the problem it seems that there is an actualPush function in Profiling.c, what is better to do, remove INLINE and rename the function or keep both?

qnikst updated this revision to Diff 19018.Wed, Dec 5, 10:21 AM
  • Initialize posTypeObjects on the stack.
qnikst updated this revision to Diff 19019.Wed, Dec 5, 10:36 AM
  • fixup
osa1 accepted this revision.Wed, Dec 5, 10:44 AM

Awesome! Confirmed that this works now. Merging.

This revision is now accepted and ready to land.Wed, Dec 5, 10:44 AM
This revision was automatically updated to reflect the committed changes.