Add Stack Traces in Haskell land based on symtab
AbandonedPublic

Authored by Tarrasch on Jun 6 2015, 10:09 AM.

Details

Reviewers
scpmw
simonmar
hvr
austin
bgamari
Trac Issues
#10656
Summary

This patch extends the base library with a new Haskell module called GHC.ExecutionStack. Essentially it provides the simple functions to print a stack trace to stdout. As the module name suggests, it uses GHC-specific implementation details to retrieve a stack. In particular it uses the stg stack.

The usefulness of the printed stack is currently low. As it is only using names from the symbol table, which most of the time are just random (something like "s2cZ_info"). But you can sometimes get useful names in the stack trace still (like "Main_myfunction_info"), so it is still useful when debugging. See the module documentation for an example when it could print something useful.

Test Plan

Run validate script. I also added 4 test cases. On my machine, I've run validate with libelf installed, and the build machine that run phabricator diffs validates without libelf installed.

This patch doesn't change many existing code paths of the compiler code or the run time system. So it's quite unintrusive in that regard. However, it does change build requirements and files like distrib/configure.ac.in in order to make libelf optional at user compile time (when a user compiles ghc from source from [1]) instead of at the time a ghc maintainer packages ghc. These changes are also covered by ./validate in its make test_bindtest step.

[1]: https://www.haskell.org/ghc/download

Diff Detail

Repository
rGHC Glasgow Haskell Compiler
Branch
arcpatch-D963
Lint
Lint OK
Unit
No Unit Test Coverage
Build Status
Buildable 5138
Build 5352: GHC Patch Validation (amd64/Linux)
Build 5351: arc lint + arc unit
There are a very large number of changes, so older changes are hidden. Show Older Changes

Ok, so the validation errors is basically part of this build step: make binary-dist-prep && make test_bindist TEST_PREP=YES.

$ make test_bindist TEST_PREP=YES
...
 ( lots of output)
...
[1 of 1] Compiling Main             ( bindisttest/HelloWorld.lhs, bindisttest/HelloWorld.o )
Linking bindisttest/HelloWorld ...
/home/arash/repos/ghc/bindisttest/install   dir/lib/ghc-7.11.20150704/rts/libHSrts.a(Codemap.o): In function `codemap_load_symbols':

/home/arash/repos/ghc/rts/Codemap.c:119:0: error:
     undefined reference to `elf_nextscn'

/home/arash/repos/ghc/rts/Codemap.c:120:0: error:
     undefined reference to `gelf_getshdr'

/home/arash/repos/ghc/rts/Codemap.c:125:0: error:
     undefined reference to `elf_getdata'

...

So I guess this kind of is about that I haven't solved the fact that we don't ship libelf with the Haskell binary. Any advice on this packaging-related question?

Tarrasch updated this revision to Diff 3519.Jul 11 2015, 8:39 AM
  • Mark *ip as UNUSED for no USE_ELF
Tarrasch updated this revision to Diff 3520.Jul 11 2015, 8:40 AM
  • Mark *ip as UNUSED for no USE_ELF
Tarrasch updated this revision to Diff 3521.Jul 11 2015, 8:45 AM
  • Mark *ip as UNUSED for no USE_ELF
  • Remove redundant import
Tarrasch updated this revision to Diff 3525.Jul 11 2015, 4:31 PM
  • Whoups, ELF not DWARF :)
  • Only run tests when GHC has libelf

Hmm, I have a couple of questions. I would appreciate any help:

1. How do I get libelf on the build agents?

Is that reasonable to have you think @hvr? It is perhaps not needed but would be nice to able to run the executionStack tests as well.

2. How do I disable tests based on if libelf exists or not?

I fixed this by emitting that information through ghc --info and then parse it from the testing framework. Does that make sense?

3. After this patch, Linux users need to install libelf when using GHC. Is that correct?

If I'm not mistaken, a GHC user will need libelf iff the one who compiled the compiler for them had libelf at the time ./configure was run. That seems somewhat reasonable to me. We'll just have to make sure that the distribution creator has libelf only on the Linux machines.

I can't think of any way for this to be a more flexible option without introducing another "Way". But I would be happy to be shown wrong on this. :)

Tarrasch updated this revision to Diff 3528.Jul 11 2015, 4:37 PM
  • Only run tests when GHC has libelf
Tarrasch updated this revision to Diff 3596.Jul 19 2015, 7:15 AM

Make it ./validate with elf as well.

Now it always validates!

Tarrasch edited the test plan for this revision. (Show Details)Jul 19 2015, 7:36 AM
Tarrasch updated the Trac tickets for this revision.
Tarrasch edited the test plan for this revision. (Show Details)

I would really appreciate another round of review guys. :)

Macro prettypleasecat: can you please review my code?

ping @scpmw @simonmar

@Tarrasch, great work!

Some comments inline.

I wonder if this should be tied in to the RTS's linker to ensure that symbols are loaded for newly loaded libraries.

compiler/prelude/primops.txt.pp
1984

A description for the haddocks would be nice here. See, for instance readMutVar# to get an idea of the expected syntax.

includes/rts/Codemap.h
28

A comment here describing what a CodemapUnit actually is would be quite useful.

41

This is fairly self-explanatory but that being said it a comment wouldn't hurt.

libraries/base/GHC/ExecutionStack/Internal.hsc
75

Might I suggest the haddock "Pretty-print an execution stack,"?

116

This doesn't actually "display"; I think showStackFrames would be a more appropriate name.

138

Minor aesthetic point: the redundant grouping symbols make things less readable, not more, IMHO.

205

This extra newline is technically a whitespace error. Perhaps you could kill it in the next iteration?

libraries/base/tests/all.T
117

Indeed this smells a bit funny. What happens when you do not omit dyn?

rts/Codemap.c
42

I would say it's not just a good idea, it's the correct way to use the interface. Let's not leave this up to interpretation.

62

In what way does this force loading? Why is codemap_load not an appropriate name?

339

A brief comment explaining the arguments would be nice (e.g. "look-up the procedure associated with a given address (or NULL if unknown). Compilation unit in which procedure is defined is stored to *punit if non-null.")

bgamari requested changes to this revision.Jul 20 2015, 8:25 AM
bgamari edited edge metadata.
This revision now requires changes to proceed.Jul 20 2015, 8:25 AM

Thanks for the review @bgamari! I'll try to iterate on it early this week. :)

rts/Codemap.c
62

Haha, stupid, historical reasons. dwarf_load already existed already in libdwarf so I changed it to dwarf_force_load and then I just s/dwarf/codemap this ^^

Tarrasch updated this revision to Diff 3607.Jul 20 2015, 6:04 PM
Tarrasch edited edge metadata.
Tarrasch marked 9 inline comments as done.
  • Fix @bgmari's comments. Remove unused Codemap-code

For some very weird reason I have started to get this error locally now. I'm pushing my changes to see if the build machine also suddenly have started to fail or not.

~/repos/ghc arcpatch-D963  
❯ "inplace/bin/ghc-stage1" -optc-fno-stack-protector -optc-Wall -optc-Wall -optc-Wextra -optc-Wstrict-prototypes -optc-Wmissing-prototypes -optc-Wmissing-declarations -optc-Winline -optc-Waggregate-return -optc-Wpointer-arith -optc-Wmissing-noreturn -optc-Wnested-externs -optc-Wredundant-decls -optc-Iincludes -optc-Iincludes/dist -optc-Iincludes/dist-derivedconstants/header -optc-Iincludes/dist-ghcconstants/header -optc-Irts -optc-Irts/dist/build -optc-DCOMPILING_RTS -optc-fno-strict-aliasing -optc-fno-common -optc-DUSE_ELF -optc-Irts/dist/build/autogen -optc-Werror=unused-but-set-variable -optc-Wno-error=inline -optc-O2 -optc-fomit-frame-pointer -optc-g -optc-DRtsWay=\"rts_p\" -optc-Wno-strict-prototypes -static -prof  -H32m -O -Wall  -Iincludes -Iincludes/dist -Iincludes/dist-derivedconstants/header -Iincludes/dist-ghcconstants/header -Irts -Irts/dist/build -DCOMPILING_RTS -this-package-key rts -dcmm-lint      -i -irts -irts/dist/build -irts/dist/build/autogen -Irts/dist/build -Irts/dist/build/autogen           -O2    -c rts/sm/Storage.c -o rts/dist/build/sm/Storage.p_o

In file included from includes/Rts.h:217:0: error:
    0,
                     from rts/sm/Storage.c:15:
rts/sm/Storage.c: In function ‘allocArrWords’:

rts/sm/Storage.c:774:43: error:
     error: ‘CCCS’ undeclared (first use in this function)
         SET_ARR_HDR(arr, &stg_ARR_WORDS_info, CCCS, size_in_bytes);
                                               ^

includes/rts/storage/ClosureMacros.h:138:33: error:
     note: in definition of macro ‘SET_PROF_HDR’
             ((c)->header.prof.ccs = ccs_,   \
                                     ^
...
compiler/prelude/primops.txt.pp
1984

Note to self: Don't put () parens within your {} comments in .pp files. #hourwasted

rts/Codemap.c
339

I commented in the header file instead.

Tarrasch updated this revision to Diff 3620.Jul 21 2015, 12:09 PM
Tarrasch edited edge metadata.

Small fixes

  • Explain why 'dyn' flag is needed for every test case
  • Fix test failure after rename of dipslayStackFrame
  • Make test 003 more trustworthy
Tarrasch updated this revision to Diff 3621.Jul 21 2015, 12:16 PM

Remove extra whitespace in testlib.py

I'll iterate on this again because of the core dumped issue in executionStack002 test. Anyway, I think this patch can be reviewed still. :)

libraries/base/tests/all.T
117

It's great that you pointed this out. I realized that the error from executionStack002 is pretty scary. That you get a says-me-nothing segfault with "Core dumped" isn't acceptable of course. It would've been almost impossible to nail the cause of it in a large application I guess.

I'll check it out.

bgamari requested changes to this revision.Jul 21 2015, 3:46 PM
bgamari edited edge metadata.

@scpmw, what do you think?

Requesting changes due to failing tests.

This revision now requires changes to proceed.Jul 21 2015, 3:46 PM
scpmw added a comment.Jul 23 2015, 5:14 AM

Looks okay, nothing big stands out. Really disappointing that we need conditional compilation for libelf though... I would really have hoped that we could avoid that.

Hm. If we need a global configuration flag anyway, it might be an option to use the existing libbfd conditionals here - at minimum that would allow us to save some configuration overhead. After looking into a bit, this actually might have some nice benefits: more portable and bfd_find_nearest_line gives us some basic DWARF functionality. Original reason for libelf was that it allows us to read more DWARF info using libdwarf, but that could turn out to be irrelevant now.

Might have to take a stab at porting the code over. Sorry for being so slow :(

libraries/base/GHC/ExecutionStack/Internal.hsc
12

typo ;)

92

It might be nice to have a basic z-decoder at some point, btw. Maybe just look for the "_info" suffix and make something like "base_GHC.ExecutionStack.Internal_currentStackFrames1" out of it? Could help making the stack less "scary". See e.g. here.

rts/Codemap.c
3

... why not? I think the doubtful comments in other files are about that it might be more elegant to exclude the file from the build completely. On the other hand, this way it's clear that the code is conditionally compiled.

Cool. Seems like this patch is looking ready then. I'll just have to look at the segfault. Hopefully a minor fix. :)

libraries/base/GHC/ExecutionStack/Internal.hsc
92

While I really like this idea. I would push for those nice-to-haves to be part of a different patch. This patch is really big already.

rts/Codemap.c
3

Good point. I feel much better about this now. :)

Indeed, the value that it's clear from the file that it's conditionally compiles is very valuable.

Tarrasch updated this revision to Diff 3680.Jul 27 2015, 1:01 PM
Tarrasch edited edge metadata.

Make tests run with 'dyn' as well

  • Fix typo
  • Fix bug p_unit
  • Fix 003 test, now all tests pass with 'dyn'
Tarrasch marked 17 inline comments as done.Jul 27 2015, 1:16 PM

Alright, the 'dyn' tests pass now as well. Let me know it there's anything more to fix for this patch. Otherwise what needs to be done in order to get this to land?

For practical reasons, should getCurrentFrames limit the reification to a thousand frames or something? You would like to able to get a stack trace on out of stack memory exceptions too right?

@simonmar, have you planned to take a look at this patch some time? :)

Tarrasch updated this revision to Diff 3683.Jul 27 2015, 1:47 PM
Tarrasch edited edge metadata.

Add (c) notices to rts files

Tarrasch updated this revision to Diff 3755.Aug 4 2015, 9:39 AM

Rebase so patch can be applied (and validate)

Also squash commits and --reset-author

This looks good to me.

@scpmw, any final requests?

compiler/prelude/primops.txt.pp
1984

The documentation should probably mention what happens if the primop is called on a build lacking stack trace support.

Tarrasch updated this revision to Diff 3811.Aug 9 2015, 6:25 AM

Update docs and comments

Tarrasch updated this object.Aug 9 2015, 6:32 AM
Tarrasch edited edge metadata.
Tarrasch updated this object.
Tarrasch marked an inline comment as done.Aug 9 2015, 6:35 AM
Tarrasch added a subscriber: trofi.Aug 9 2015, 7:01 AM

@bgamari, nice. I updated some documentation. Now I'll try to ping our other colleagues. :)

@simonmar, @scpmw, feel free to review again. @scpmw, I have not looked at the libbfd path much yet. But I must say I'm more up for changing this later than iterating further on this patch (since I've iterated so much on this patch already). Where can I read more about libbfd and how it is used? Maybe @trofi knows?

@austin, @hvr, you're marked as "Blocking review". What remains to be done for you guys to unblock the review?

@Tarrasch, don't worry about @austin for the moment. He's currently on vacation.

@scpmw, concerning your statement,

Original reason for libelf was that it allows us to read more DWARF info using libdwarf, but that could turn out to be irrelevant now.

If I'm not mistaken (which is quite possible), libelf is a library provided by elfutils. The DWARF library provided by elfutils is libdw. libdwarf is an entirely separate project. What did you mean by your statement?

Also, on a related note, what every happened to the .debug_ghc section and CoreNote mechanism that you describe in your thesis?

scpmw added a comment.Aug 19 2015, 8:47 AM
In D963#31774, @bgamari wrote:

@scpmw, concerning your statement,

Original reason for libelf was that it allows us to read more DWARF info using libdwarf, but that could turn out to be irrelevant now.

If I'm not mistaken (which is quite possible), libelf is a library provided by elfutils. The DWARF library provided by elfutils is libdw. libdwarf is an entirely separate project. What did you mean by your statement?

That's the crux here really - we have a lot of options, and the choices I made for my PhD were rather ad-hoc. A good re-evaluation would probably be a good idea, and given that I don't even have a proper Linux build machine anymore this makes matters rather... complicated.

Here's my current (still woefully under-informed) image of the situation. First off - for reading object files:

  • libelf: Only works for Linux, and less of a non-brainer to link against than I had hoped
  • libbfd: At least *claims* to work on more than Linux, and given that it drives gdb this seems believable. Unfortunately, it is rather tightly tied to these tools (rumours of unstable API etc.), and we probably can't just link against it either.
  • glibc: Just using dladdr might actually be enough for getting the information we need here (without any of the address map complications, too!). However, it will be hard to extend.
  • Doing it manually, using a different library per platform: Sounds like a nightmare, but ought to be tested.

For reading DWARF:

  • libdwarf: The library I decided to use for my PhD, chosen after a quick Google popularity poll and the docs made sense to me. It has an interface for overloading the object file access interface, but in practice it only works with libelf. There's a patch for Mach-O flying around on the Internet though, so it's not impossible.
  • libdw: The API is fairly similar, and we could probably do the same things. I suspect it's even more tied to libelf though, but I might be wrong.
  • Doing it manually: Once we have object file reading sorted out this might actually be quite tempting - DWARF is actually fairly well standardised, and we are only ever interested in a very small subset of it.

Also, on a related note, what every happened to the .debug_ghc section and CoreNote mechanism that you describe in your thesis?

There's two reasons why I'm delaying talking about this - firstly this has a rather singular use (profiling as I showed for ThreadScope), which makes it a hard feature to sell. We will probably have to maintain an alternative client application besides ThreadScope for this to really take off. Furthermore, the actual hard technical bits are, again, about reading symbol tables and DWARF. So let's focus on these issues first.

bgamari added a comment.EditedAug 19 2015, 9:57 AM

@Tarrasch, how do you envision full DWARF unwinding fitting in here? After looking a bit closer into how libdw's unwinding mechanism works, it seems like Codemap may be the wrong level of abstraction here. I'm wondering whether the c

In D963#31803, @scpmw wrote:
In D963#31774, @bgamari wrote:

If I'm not mistaken (which is quite possible), libelf is a library provided by elfutils. The DWARF library provided by elfutils is libdw. libdwarf is an entirely separate project. What did you mean by your statement?

That's the crux here really - we have a lot of options, and the choices I made for my PhD were rather ad-hoc. A good re-evaluation would probably be a good idea, and given that I don't even have a proper Linux build machine anymore this makes matters rather... complicated.

Here's my current (still woefully under-informed) image of the situation. First off - for reading object files:
...

Right.

So I actually took a few hours today and tried to piece together DWARF-based unwinding support. I first started with libdw but soon ran into some interface deficiencies which I brought up on the elfutils mailing list. The maintainer helpfully pointed out libbacktrace (part of GCC) as another possible option. It's 3-clause BSD licensed (it was apparently originally developed for Go) and so it shouldn't be problematic to use.

The interface is incredibly simple. This simplicity may rule it out in our case. For one, I'm unsure of is whether it supports dynamic linking at all (the interface just looks too simple for all of this to work automatically). Secondly, it is quite limited in what it offers: it will happily unwind the stack for you, giving you the PC, a symbol name, and source location information as it goes. At first glance this looks perfect, but I'm not sure if we might need more control at some point given the rather tricky mapping from Haskell onto DWARF. @scpmw, perhaps you could comment here?

Regardless, I have some test code which seems to work on statically linked C targets. I have no idea what it will do with GHC's slightly more intricate DIEs.

Also, on a related note, what every happened to the .debug_ghc section and CoreNote mechanism that you describe in your thesis?

There's two reasons why I'm delaying talking about this - firstly this has a rather singular use (profiling as I showed for ThreadScope), which makes it a hard feature to sell. We will probably have to maintain an alternative client application besides ThreadScope for this to really take off. Furthermore, the actual hard technical bits are, again, about reading symbol tables and DWARF. So let's focus on these issues first.

Ahh, fair enough.

@bgamari, actually Codemap is just a shrink-down of the old Dwarf.c, which was Codemap.c + dwarf-parts. I had a prototype in D662 even. So I guess current Codemap.c can be expanded to incorporate Dwarf in the future. It used libdwarf and not libdw, not sure if that matters.

scpmw added a comment.EditedAug 19 2015, 10:09 AM
In D963#31808, @bgamari wrote:

The interface is incredibly simple. This simplicity may rule it out in our case. For one, I'm unsure of is whether it supports dynamic linking at all (the interface just looks too simple for all of this to work automatically). Secondly, it is quite limited in what it offers: it will happily unwind the stack for you, giving you the PC, a symbol name, and source location information as it goes. At first glance this looks perfect, but I'm not sure if we might need more control at some point given the rather tricky mapping from Haskell onto DWARF. @scpmw, perhaps you could comment here?

The actual backtracing facilities are worthless to us - from a quick scan they don't read .debug_unwind, so will not be able to do anything with the Haskell stack. Fortunately we don't actually need those parts either because the RTS can do the stack walk. On the other hand, this leaves just backtrace_pcinfo or backtrace_syminfo which share the problem of dladdr: Relatively little information, and we especially can't easily extend it without changing the library's code. Also I can't find any reference to it working for something else apart from ELF?

On the bright side, this might be an excellent starting point for rolling our own custom DWARF reader at some point, I like dwarf.c quite a bit.

In D963#31810, @scpmw wrote:
In D963#31808, @bgamari wrote:

The interface is incredibly simple. This simplicity may rule it out in our case. For one, I'm unsure of is whether it supports dynamic linking at all (the interface just looks too simple for all of this to work automatically). Secondly, it is quite limited in what it offers: it will happily unwind the stack for you, giving you the PC, a symbol name, and source location information as it goes. At first glance this looks perfect, but I'm not sure if we might need more control at some point given the rather tricky mapping from Haskell onto DWARF. @scpmw, perhaps you could comment here?

The actual backtracing facilities are worthless to us - from a quick scan they don't read .debug_unwind, so will not be able to do anything with the Haskell stack. Fortunately we don't actually need those parts either because the RTS can do the stack walk. On the other hand, this leaves just backtrace_pcinfo or backtrace_syminfo which share the problem of dladdr: Relatively little information, and we especially can't easily extend it without changing the library's code.

Right, I was afraid this might be the case. Indeed I can't find any reference to .debug_unwind.

In this case I'll fall back to libdw. In principle it shouldn't be difficult to implement local tracing. The only tricky part is actually extracting the initial register state to pass to the unwinder. Tricky but doable, I think.

Also I can't find any reference to it working for something else apart from ELF?

It may support PE/COFF but I think that's about all. No mention of MachO.

On the bright side, this might be an excellent starting point for rolling our own custom DWARF reader at some point, I like dwarf.c quite a bit.

The code is generally rather nice.

@scpmw, have you looked at libunwind? @simonmar recommended it.

Also, where did you get the .debug_unwind section name from? I can't seem to find any references to this in the x86-64 ABI, ELF, or DWARF specs. Perhaps I'm missing a spec? AFAICT you are supposed to use .debug_frame for call frame infomation.

scpmw added a comment.Aug 20 2015, 4:35 AM
In D963#31819, @bgamari wrote:

@scpmw, have you looked at libunwind? @simonmar recommended it.

No. From a quick look there might be a chance that it is able to unwind the Haskell stack, but as I said earlier - that's not our primary problem here, as the RTS already knows the layout. On the other hand, this might allow us to stack-trace through C calls at some point, but let's not run too far ahead just yet.

Also, where did you get the .debug_unwind section name from? I can't seem to find any references to this in the x86-64 ABI, ELF, or DWARF specs. Perhaps I'm missing a spec? AFAICT you are supposed to use .debug_frame for call frame infomation.

Er, yes, .debug_frame. Misremembered, and was for some reason pretty sure about it. Sorry about that :)

So for the record, I have libdw-based stack traces implemented here. It appears to work for my simple test of dumping a stack trace in schedule. See D1156.

Ping!

@scpmw and @bgamari, did the experimentation done in D1156 bring any conclusions related to this patch? If not I would like us to continue to see if this patch can be merged. What needs to be done?

I also saw that "Haskell land stack traces" was not part of the email about the upcoming 7.12 feature freeze. That's a bit disappointing, this patch is for all that I know "good to go". Unit tests are included and all tests pass.

Please also note that this Haskell-land effort should (and probably will) not stop here. I'm willing to push this idea further and have Codemap understand Dwarf. But as requested in D662, I made this patch as small as possible while still providing value.

scpmw added a comment.Sep 1 2015, 11:45 AM

Well, we would add an additional dependencies for a feature that we know to be incomplete. gdb's backtraces are a lot better and completely portable to boot, so the added value would be marginal. Neither is the fault of the Haskell interface, obviously...

On the bright side, @bgamari is doing exactly what I had hoped to do with this patch eventually - use it as a stepping stone to build more complete stack traces using different back-end code. Also just last weekend I finished putting together my new PC, which should hopefully make privately building GHC a lot less frustrating (5 hours before!). So as far as I'm concerned, I like the thought of continuing work on this quite a bit better than trying to push some bare minimum version into 7.12.

@Tarrasch I've been quietly pushing D1156 along, although haven't been as good about keeping Phab up-to-date as I should.

At this point I have RTS-land and Haskell-land stack-traces working with libdw. I cleaned up the patches last weekend and will try to update the Phab diff shortly.

I have also been making good progress on folding in @scpmw's ideas on profiling. I believe I am quite close to having something working in this area.

Tarrasch abandoned this revision.Nov 4 2015, 4:12 AM

Abandoning. As far as I understand, D1196 together with D1198 strictly supersedes this. Also see https://ghc.haskell.org/trac/ghc/wiki/DWARF/80Status for clearer status page.