rts/base: Fix #9423
ClosedPublic

Authored by AndreasVoellmy on Aug 8 2014, 10:43 AM.

Details

Summary

Fix Trac #9423.

The problem in Trac #9423 is caused when code invoked by hs_exit() waits
on all foreign calls to return, but some IO managers are in safe foreign
calls and do not return. The previous design signaled to the timer manager
(via its control pipe) that it should "die" and when the timer manager
returned to Haskell-land, the Haskell code in timer manager then signalled
to the IO manager threads that they should return from foreign calls and
die. Unfortunately, in the shutdown sequence the timer manager is unable
to return to Haskell-land fast enough and so the code that signals to the
IO manager threads (via their control pipes) is never executed and the IO
manager threads remain out in the foreign calls.

This patch solves this problem by having the RTS signal to all the IO
manager threads (via their control pipes; and in addition to signalling
to the timer manager thread) that they should shutdown (in ioManagerDie()
in rts/Signals.c. To do this, we arrange for each IO manager thread to
register its control pipe with the RTS (in GHC.Thread.startIOManagerThread).
In addition, GHC.Thread.startTimerManagerThread registers its control pipe.
These are registered via C functions setTimerManagerControlFd (in
rts/Signals.c) and setIOManagerControlFd (in rts/Capability.c). The IO
manager control pipe file descriptors are stored in a new field of the
Capability_ struct.

Test Plan

See the notes on Trac #9423 to recreate the problem and to verify that it no longer occurs with the fix.

Diff Detail

Repository
rGHC Glasgow Haskell Compiler
Lint
Lint Skipped
Unit
Unit Tests Skipped
AndreasVoellmy retitled this revision from to Fix bug from trac issue #9423.
AndreasVoellmy updated this object.
AndreasVoellmy edited the test plan for this revision. (Show Details)
AndreasVoellmy added reviewers: edsko, simonmar.
AndreasVoellmy set the repository for this revision to rGHC Glasgow Haskell Compiler.
AndreasVoellmy updated the Trac tickets for this revision.
edsko edited edge metadata.Aug 8 2014, 11:38 AM

I'm not sure I completely understand the patch -- but doesn't the above obsolete the Haskell-land shutdownManagers?

Yes, I think we can get rid of shutdownManagers in GHC.Thread. I'm doing that now and also running validate. Will update when this is done.

AndreasVoellmy set the repository for this revision to rGHC Glasgow Haskell Compiler.Aug 8 2014, 1:10 PM
  • Fix issue Trac #9423 (problem discovered in Trac #9284).
  • Remove GHC.Event.shutdownManagers and fix warnings.

Whoops, Build B347: Diff 289 (D129) has failed! Full logs available at F12061.

Whoops, Build B347: Diff 289 (D129) has failed! Full logs available at F12063.

Phabricator's builds seem to fail on my diff, but I can't see why from the build log provided. I don't see any obvious errors in the log.

ezyang edited edge metadata.Aug 8 2014, 7:27 PM

Phabricator's builds seem to fail on my diff, but I can't see why from the build log provided. I don't see any obvious errors in the log.

Andreas, you need to validate. Here is the error I get:

rts/posix/Signals.c: In function ‘ioManagerDie’:

rts/posix/Signals.c:242:14:
     error: ignoring return value of ‘write’, declared with attribute warn_unused_result [-Werror=unused-result]
cc1: all warnings being treated as errors
ezyang added a comment.Aug 8 2014, 9:25 PM

It would be good if the commit message commented on the general approach of the patch.

I'm not entirely sure, but isn't there only one IO manager thread per capability? So it seems having everyone contend against a linked list, it would be better to preallocate an array the size of the number of capabilities we have (much like in other cases in the RTS) and directly access that.

Thanks for pointing out the warning that I triggered. I will fix and put up
a new diff.

I'll add some more commentary as well.

Yes, there is only one IO manager thread per capability, plus one timer
manager. There is likely a way to store these so that recording the control
pipe for a capability's IO manager can be done without taking a lock. On
the other hand, I'm not sure it's worth it, since this will be called so
rarely (once per capability and then once on shutdown).

AndreasVoellmy edited edge metadata.
  • Examine return value of write call in ioManagerDie.

Whoops, Build B350: Diff 292 (D129) has failed! Full logs available at F12175. The testsuite summary sez:

testsuite_summary.txt
Unexpected results from:
TEST="T5313 dynCompileExpr linker_unload"

OVERALL SUMMARY for test run started at Sat Aug  9 05:28:06 2014 UTC
 0:06:29 spent to go through
    4083 total tests, which gave rise to
   13462 test cases, of which
    9720 were skipped

      26 had missing libraries
    3653 expected passes
      60 expected failures

       0 caused framework failures
       0 unexpected passes
       3 unexpected failures

Unexpected failures:
   driver                  T5313 [bad exit code] (normal)
   ghc-api/dynCompileExpr  dynCompileExpr [bad exit code] (normal)
   rts                     linker_unload [bad exit code] (normal)

Validate fails on my machine, but the 4 failures are also present in HEAD (d0ee4eb0879f77067e7f05e0daa80e6ca8817f1d) for me, so I don't think this has anything to do with this patch.

Rework the patch to maintain a per-capability file descriptor for the control pipes of the IO managers and to avoid introducint any new locks.

What happens if we change the number of capabilities at runtime?

rts/Capability.c
1085

typo

Validate still fails for me due to failures that were in master before my patches - this patch introduces no new validate failures.

Let me know if I should squash these commits and start a new Phab diff.

I added a bunch of explanation in my commit message for this latest patch. I'm not sure this is preserved in Phab, so I'm pasting it below as well. Let me know if you feel it needs more explanation of any points.

Date: Sun Aug 10 22:17:06 2014 -0400

Fix #9423.

The problem in #9423 is caused when code invoked by hs_exit() waits
on all foreign calls to return, but some IO managers are in safe foreign
calls and do not return. The previous design signaled to the timer manager
(via its control pipe) that it should "die" and when the timer manager
returned to Haskell-land, the Haskell code in timer manager then signalled
to the IO manager threads that they should return from foreign calls and
die. Unfortunately, in the shutdown sequence the timer manager is unable
to return to Haskell-land fast enough and so the code that signals to the
IO manager threads (via their control pipes) is never executed and the IO
manager threads remain out in the foreign calls.

This patch solves this problem by having the RTS signal to all the IO
manager threads (via their control pipes; and in addition to signalling
to the timer manager thread) that they should shutdown (in ioManagerDie()
in rts/Signals.c. To do this, we arrange for each IO manager thread to
register its control pipe with the RTS (in GHC.Thread.startIOManagerThread).
In addition, GHC.Thread.startTimerManagerThread registers its control pipe.
These are registered via C functions setTimerManagerControlFd (in
rts/Signals.c) and setIOManagerControlFd (in rts/Capability.c). The IO
manager control pipe file descriptors are stored in a new field of the
Capability_ struct.

I think it should be fine if the number of capabilities changes at runtime.
The control pipe for the IO manager of capability i is stored in the
Capability_ struct of capability[i]. So there is no need to resize any data
structures outside of what is already done by the RTS. At shutdown,
ioManagerDie simply reads the control pipe field of all the capability[i]
elements for i in 0..n_capabilities.

Yay! Build B387: Diff 319 (D129) has succeeded! Full logs available at F12375.

  • Fix typo.

Yay! Build B389: Diff 321 (D129) has succeeded! Full logs available at F12385.

Should the setTimerManagerControlFd() and setIOManagerControlFd() be defined as primops rather than imported foreign functions? I imagine there is less overhead associated with primops.

They're called O(1) times, right? I think FFI ought to be OK.

Yes, O(1) times. So I won't bother with primops. I will mark the calls as
"unsafe", since all they do is write to a variable in the RTS. Will
validate and update.

  • Make foreign calls setIOManagerWakeupFd, setIOManagerControlFd, and setTimerManagerControlFd "unsafe" since all they do is write values into variables in the RTS.

Whoops, Build B392: Diff 324 (D129) has failed! Full logs available at F12413.

Whoops, Build B392: Diff 324 (D129) has failed! Full logs available at F12426.

My validate (and maybe Harbormaster's too) is failing on the following:

Unexpected failures:
   cabal/cabal06  cabal06 [bad stdout] (normal)
   perf/compiler  T3064 [stat not good enough] (normal)
   perf/compiler  T4801 [stat too good] (normal)
   rts            T5435_dyn_asm [bad stdout] (normal)

These were all failing before my patches, so I don't think I caused them. At this point, I don't intend to make any more changes on this patch, so it's ready for another round of review.

Whoops, Build B392: Diff 324 (D129) has failed! Full logs available at F12629.

austin accepted this revision.Aug 14 2014, 8:37 AM
austin edited edge metadata.

This looks fine to me, although perhaps @simonmar also has something to say too still.

This revision is now accepted and ready to land.Aug 14 2014, 8:37 AM
austin retitled this revision from Fix bug from trac issue #9423 to rts/base: Fix #9423.Aug 14 2014, 8:39 AM
austin updated this object.
austin edited edge metadata.
austin updated the Trac tickets for this revision.

@AndreasVoellmy I've updated the description a bit; if you're satisfied with the diff here, when I merge this using arcanist it will automatically use the updated description. You can also merge it yourself (by using arc amend, arc patch, or using git yourself to squash/write a new commit description)

I'm happy with the diff. @austin: do you want to apply it?

austin closed this revision.Aug 19 2014, 8:03 AM
austin updated this revision to Diff 405.

Closed by commit rGHCf9f89b7884cc (authored by @AndreasVoellmy, committed by @austin).

simonmar added inline comments.Sep 3 2014, 4:54 AM
rts/posix/Signals.c
246–264

why are we sending the signal to the timer thread and all of the IO manager threads? Don't we need to send it to just one of the IO manager threads?