Fix deadlock between STM and throwTo
ClosedPublic

Authored by simonmar on Wed, Jul 11, 11:01 AM.

Details

Summary

There was a lock-order reversal between lockTSO() and the TVar lock,
see Trac #15136 for the details.

It turns out we can fix this pretty easily by just deleting all the
locking code(!). The principle for unblocking a BlockedOnSTM thread
then becomes the same as for other kinds of blocking: if the TSO
belongs to this capability then we do it directly, otherwise we send a
message to the capability that owns the TSO. That is, a thread blocked
on STM is owned by its capability, as it should be.

The possible downside of this is that we might send multiple messages
to wake up a thread when the thread is on another capability. This is
safe, it's just not very efficient. I'll try to do some experiments
to see if this is a problem.

Test Plan

Test case from Trac #15136 doesn't deadlock any more.

Diff Detail

Repository
rGHC Glasgow Haskell Compiler
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.
simonmar created this revision.Wed, Jul 11, 11:01 AM
osa1 accepted this revision.Wed, Jul 11, 1:21 PM

Thanks! I was looking at this code today and was terribly confused ...

I still don't understand this code too well but if my understanding is correct this makes perfect sense to me.

My understanding: if we make sure the TSO is only modified by the capability it belongs then we don't need to lock it. This is certainly the case in raiseAsync() (otherwise raiseAsync() calls throwToSendMsg() and returns), and it turns out it's also the case with unpark_tso() as tryWakeupThread() checks for this and sends a message if we don't own the TSO. So it turns out no locking was necessary.

So this looks great to me! (much less code to understand now :)

(I can't accept this as the button is disabled for some reason ...)

rts/sm/Sanity.c
551

I wonder if this assert is triggered if we remove this? This kind of code usually costs a lot of time to a new reader and causes problems when extending code (as we have to assume this case may happen) so I wonder if it's a good idea (assuming it isn't triggered during tests with debug RTS) to remove this.

This revision is now accepted and ready to land.Wed, Jul 11, 1:21 PM
This revision was automatically updated to reflect the committed changes.

Ah, I was actually going to revise this to try to make it more efficient. Not to worry, I should remember to set it to Plan Changes next time.