schedulePushWork: avoid unnecessary wakeups
This function had some pathologically bad behaviour: if we had 2 threads
on the current capability and 23 other idle capabilities, we would
- grab all 23 capabilities
- migrate one Haskell thread to one of them
- wake up a worker on *all* 23 other capabilities.
This led to a lot of unnecessary wakeups when using large -N values.
Now we:
- Count how many capabilities we need to wake up
- Start from cap->no+1, so that we don't overload low-numbered capabilities
- Only wake up capabilities that we migrated a thread to (unless we have sparks to steal)
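As a rough illustration only, the selection logic described above can be
modelled like this. It is a simplified sketch, not the RTS code: ToyCap and
toy_push_work are hypothetical names, each capability is reduced to a thread
count and two flags, and the spark-stealing exception is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for a capability (not the RTS struct). */
typedef struct {
    bool idle;      /* capability currently has nothing to run */
    int  n_threads; /* runnable Haskell threads queued here */
    bool woken;     /* set when we would wake a worker on it */
} ToyCap;

/* Migrate surplus threads from caps[self] to idle capabilities.
 * Returns the number of capabilities that were woken. */
static int toy_push_work(ToyCap *caps, int n_caps, int self)
{
    assert(self >= 0 && self < n_caps);

    /* Keep one thread for ourselves; only the rest can be pushed. */
    int spare = caps[self].n_threads - 1;
    int woken = 0;

    /* Scan from self+1 and wrap around, so low-numbered capabilities
     * are not always the first to receive work. Stop as soon as there
     * is nothing left to hand out. */
    for (int k = 1; k < n_caps && spare > 0; k++) {
        int i = (self + k) % n_caps;
        if (!caps[i].idle)
            continue;

        /* Migrate one thread, and wake only this capability. */
        caps[self].n_threads--;
        caps[i].n_threads++;
        caps[i].idle = false;
        caps[i].woken = true;
        spare--;
        woken++;
    }
    return woken;
}
```

In this model, with 24 capabilities and 2 threads on capability 0, one thread
moves to capability 1 and only that capability is woken; the other 22 idle
capabilities are left alone.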
This results in a pretty dramatic improvement in our production system.