compiler/iface: compress .hi files
ClosedPublic

Authored by austin on Aug 21 2015, 12:10 AM.

Details

Summary

Compress all interface files generated by the compiler with LZ4. Despite
its extremely small code footprint, LZ4 is very fast at both compression
and decompression, and still offers quite good space savings.

Non-scientific size test: size of stage2 compiler .hi files:

find ./compiler/stage2 -type f -iname '*.hi' -exec du -ch {} + | grep total$

Without this patch: 22MB of .hi files for stage2.
With this patch: 9.2MB of .hi files for stage2.

Signed-off-by: Austin Seipp <austin@well-typed.com>

Test Plan

I ran ./validate

Diff Detail

Repository
rGHC Glasgow Haskell Compiler
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.
austin retitled this revision from to compiler/iface: compress .hi files. Aug 21 2015, 12:10 AM
austin updated this object.
austin edited the test plan for this revision.
austin added reviewers: thomie, bgamari, hvr.
austin planned changes to this revision. Aug 21 2015, 12:19 AM

Some notes:

  1. I should do some real tests on the performance impact. A 50% size reduction is pretty good. I imagine that in general the compression speed won't even be a blip on the radar.
    • As a side note, if the compression really is just a blip on the radar, I know a way to get even better ratios: double-compress the bytestring, once in normal mode and then again in 'high compression' mode, which is about 50x slower on raw input but gets better ratios. The trick is that the dictionaries/tables embedded in the first pass's output (so it can be decompressed) are regular, so some regularities remain in the compressed stream that a more aggressive algorithm can exploit for further gains. Meanwhile, LZ4 HC mode discards 'uncompressable' bytes as soon as possible, for speed; the net effect is that HC mode over already-compressed output runs much faster than over uncompressed text (only a few times slower than normal mode, not 50x) and can still shave a few % of storage off the original. It's worth seeing whether this pays off (see the sketch after this list).
  2. The LZ4 module should probably be renamed to something like IfaceCompress, since the implementation isn't relevant (we can change the interface file format on a whim anyway).
  3. I should update to the latest LZ4. I authored this several months ago (back in February according to git log!) so it's probably a few versions behind.
  4. This can't be merged, because removeDirectoryRecursive001 is apparently failing. I haven't investigated.
    • Strangely the first time I authored this I had a *lot* more failures, but I guess that may have been a dirty test suite. I'm not complaining either way!
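
A minimal sketch of the double-compression idea from note 1, with compress and compressHC as illustrative stand-ins for the normal- and high-compression-mode LZ4 entry points (not the actual LZ4.hs API):

import qualified Data.ByteString as BS

-- Stand-ins for the two LZ4 modes; not the real LZ4.hs API.
compress, compressHC :: BS.ByteString -> BS.ByteString
compress   = error "stand-in: normal-mode LZ4"
compressHC = error "stand-in: high-compression-mode LZ4"

-- Normal mode first, then HC over the already-compressed stream. HC skips
-- incompressible spans quickly, so the second pass costs only a few times
-- a normal-mode pass rather than the usual ~50x.
doubleCompress :: BS.ByteString -> BS.ByteString
doubleCompress = compressHC . compress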
austin added inline comments. Aug 21 2015, 12:23 AM
compiler/utils/LZ4.hs
2

Note that this file is cribbed directly from the lz4 package, which I co-authored a while back with someone.

These days, there is an official 'LZ4 Frame' format, which defines a standard so streams can be incrementally compressed and shared between languages, but I've never used that API, and it was introduced *after* I wrote this module. (Originally, LZ4 users had to 'frame' the output themselves, recording the uncompressed size so the decompressor could allocate a properly sized buffer; that is why the inline framing code exists here.)
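
To make that framing concrete, here is a minimal sketch of the scheme described above: prefix the compressed payload with the uncompressed length, so the decompressor knows how big a buffer to allocate. lz4CompressRaw and lz4DecompressRaw below are stand-ins for the raw C entry points (LZ4_compress / LZ4_decompress_safe), not the actual LZ4.hs API:

import Data.Bits (shiftL, shiftR, (.&.))
import Data.Word (Word32)
import qualified Data.ByteString as BS

-- Stand-ins for the raw bindings; the raw API records no sizes itself.
lz4CompressRaw :: BS.ByteString -> BS.ByteString
lz4CompressRaw = error "stand-in for the LZ4_compress binding"

lz4DecompressRaw :: Int -> BS.ByteString -> BS.ByteString
lz4DecompressRaw = error "stand-in for the LZ4_decompress_safe binding"

-- Frame: a 4-byte little-endian length header, then the compressed payload.
frame :: BS.ByteString -> BS.ByteString
frame input =
  word32le (fromIntegral (BS.length input)) `BS.append` lz4CompressRaw input

unframe :: BS.ByteString -> BS.ByteString
unframe bs =
  let (hdr, body) = BS.splitAt 4 bs
      origLen     = fromIntegral (readWord32le hdr)
  in lz4DecompressRaw origLen body  -- the raw API needs the output size up front

word32le :: Word32 -> BS.ByteString
word32le w = BS.pack [ fromIntegral ((w `shiftR` s) .&. 0xff) | s <- [0,8,16,24] ]

readWord32le :: BS.ByteString -> Word32
readWord32le h = sum [ fromIntegral b `shiftL` (8 * i) | (i, b) <- zip [0..3] (BS.unpack h) ]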

austin updated this revision to Diff 3885. Aug 21 2015, 2:00 AM
  • Update to lz4 r131

Okay, so I probably won't rename the LZ4 module (the compression happens under utils, so putting IfaceCompress there seems a bit odd). So I'd appreciate a cursory review from anyone while I gather some other numbers later.

I'm skeptical about this feature. What is the goal here? People complain about GHC being slow, not about running out of disk space. At the very least, explain your reasoning in a Note.

austin added a comment (edited). Aug 23 2015, 11:17 AM
In D1159#32325, @thomie wrote:

I'm skeptical about this feature. What is the goal here? People complain about GHC being slow, not about running out of disk space. At the very least, explain your reasoning in a Note.

Really? I find that I often hear complaints about the size of GHC's installation and libraries. Just the base GHC installation is around 1GB! On more than one occasion (IRL, on Reddit, etc.), people have brought this up to me, and I think it's still something people dislike. It's also not as simple as you think: anyone doing ARM development will be happy with this, because when you only have 16GB of eMMC flash (because the memory bus is tiny, so SD is unusably slow, and the bus is shared with e.g. USB/ethernet), every byte saved is vital.

Let me put it this way: Why not cut the size of all interface files in half if it's so easy, when the compressor is so incredibly fast and portable?

The second reason is that it could actually be better for performance, and at minimum for I/O traffic on the disk. LZ4 is fast enough to run at near-memory speed, and the compression ratio is good; this adds up to the point where, when you issue disk reads to fetch interface files, the cost of reading the smaller file and decompressing it in memory is less than the overall cost of reading the bigger file. Disks are cheap but they are not fast, and this result is easily achievable when your decompressor is something like 5x faster than your disk and within a small multiple of a multi-GB/s memory bus.
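
A back-of-envelope version of that claim, using the sizes from the summary above and lz4's ~2GB/s decompression figure, and assuming a hypothetical 500MB/s SSD (the disk speed is a made-up round number, not a measurement):

-- Seconds to load the 22MB of stage2 .hi files from the summary:
plainRead, compressedRead :: Double
plainRead      = 22  / 500              -- ~44ms: read 22MB uncompressed
compressedRead = 9.2 / 500 + 22 / 2000  -- ~29ms: read 9.2MB, decompress to 22MB

On slower disks the gap widens, and on very fast drives it narrows, which is why real measurements are still needed.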

Full disclosure: it is probably much more complicated for performance in GHC, because smaller interface files will interact with the filesystem's notion of how to store them (e.g. whether a file can be packed inside a block). Furthermore, interface files are numerous, so this equates to lots of I/O requests hitting the disk, one per file, as opposed to a few requests for a large amount of contiguous space.

The second point in that last paragraph hints at what would be a true performance win: mapping the entire set of interface files into GHC at once as one bigger file, an idea I've mulled over with @duncan before. (In fact I don't think you even need one file, but you do need a way of ensuring they all get read contiguously into the page cache.) In some future scenario like that, with chunkier files, I can almost guarantee compression would only enhance the performance.

Either way, that shouldn't stop this from going in, IMO, because again, there's no real reason *not* to save all that space. And yes, I'll definitely add this novel to a Note.

austin added a subscriber: duncan. Aug 23 2015, 11:21 AM

And I apologize, it's hard to substantiate that without actual compressor speeds, but the lz4 project's published benchmarks show the numbers I'm talking about: decompression runs at 2GB/s in the default configuration! And even compression is still nearly 400MB/s, which is pretty damn close to the speed of a high-end SSD and going to be dwarfed by GHC actually, like, compiling things.

thomie added a comment (edited). Aug 23 2015, 11:47 AM

If this zipping/unzipping really is just a blip on the radar speed-wise, as you say, then this disk space saving is of course great.

(It's great for me personally, as my SSD is only 32GB, and it's all taken up by various GHC installations.)

It's just that when I use ghc-7.6 to run some tests, it kills me to notice how much faster it is than ghc-7.10. So I just wanted to be sure we don't accidentally make GHC slower again in this Diff. I'm afraid 7.12 is going to be slower again anyway, with the major rewrites of the compiler going on.

Note: I just fixed a bug in Cabal, where it didn't strip installed .a files for us.

$ du -bhs /opt/ghc/7.10.2/

Before: 867MB
After: 812MB
hvr added a comment. Aug 23 2015, 2:51 PM

Nice!

(although I'm already compressing .hi files thanks to btrfs with lzo-compression)

Is LZ4 a deterministic compression algorithm? I'd guess so, but I couldn't find any information about it.

Sidenote:
I was curious about the size of a ghc installation, separated by file type, and got the following result:

$ dir /opt/ghc/7.10.2/
                     Extension   Bytes Files
============================================
                         .html     860 (1)
                          .gif    1141 (5)
                          .sty    2140 (1)
                          .txt    3622 (2)
                           .js    8189 (1)
                          .png     13K (2)
                          .css     16K (2)
                            .6     51K (1)
                            .2     52K (6)
                         .conf     55K (25)
                        .cache    127K (1)
                            .h    458K (86)
                             .    992K (239)
                                 4821K (43)
                      .haddock   8002K (24)
                         .p_hi     43M (963)
                           .hi     43M (963)
                       .dyn_hi     43M (963)
                           .so     99M (31)
                            .a    627M (64)

                        Total:    867M (3423)

Is there a ticket for this? Shouldn't there be?

This will save 60M out of an 867M installation. Useful, but hardly transformative.

Is it faster, though? Writing less stuff to disk might actually speed up GHC. Can't tell without data though.

Is all the code BSD?

I'd like to know what Simon Mar thinks.

Simon

If you compile stage2 before and after a few times (make sure it's the same stage2 source, including the LZ4 module), and it doesn't get slower, then I have no objections. But why not zlib, is that too slow?

thomie added a comment (edited). Sep 11 2015, 1:01 PM

I ran nofib-analyse A B on the build logs from perf.haskell.org for the parent commit (A) and this Diff (B).

Results: median compilation time with this Diff is +1.6%, using 187 measurements, of which 38 show a decrease and 140 show an increase in compilation time.

I excluded the modules where the compilation time difference was not listed as a percentage.

BUT:

  • The nofib run for this Diff (commit B) was done about 2 weeks after the one for commit A. Maybe something changed in the environment. It would be better to redo this experiment on a different machine.
  • Comparing the parent commit (A) with HEAD (C) shows a 1.8% increase in median compilation time. Further evidence of a change in environment. Or maybe GHC really has been getting that much slower in 3 weeks' time.
  • Comparing commit A with its parent, as a control, shows a 0.0% difference in median compilation time. This suggests to me that my measurement technique is not completely flawed.

It would be nice to improve nofib-analyse and fix Trac #9319.

bgamari requested changes to this revision. Sep 18 2015, 5:12 AM

I certainly wouldn't complain if interface files got smaller so long as the effect on compilation time is negligible. As someone with finite funds yet a desire to have an SSD, I appreciate any reduction in GHC's disk footprint.

That being said, I think we really should measure what the real cost of this will be. A nofib comparison of compile time and find nofib/ -iname '*.hi' | xargs du would be quite helpful here.

This revision now requires changes to proceed. Sep 18 2015, 5:12 AM

If you compile stage2 before and after a few times (make sure it's the same stage2 source, including the LZ4 module), and it doesn't get slower, then I have no objections. But why not zlib, is that too slow?

Yes, zlib is quite slow compared to lz4, although it has a better compression ratio. I chose lz4 because it achieves a very nice balance of speed (its code is small enough to fit even in your L1 cache) and overall compression ratio.

Is there a ticket for this? Shouldn't there be?

I mean, it's not really a 'bug' that interface files are so large... This is just something I thought of a while back that was easy enough to implement. I can make a ticket though.

This will save 60M out of 867M installation. Useful, but hardly transformative.

Is it faster, though? Writing less stuff to disk might actually speed up GHC. Can't tell without data though.

Is all the code BSD?

Yep, 2-clause BSD (which is compatible with our 3-clause version).

austin updated this revision to Diff 7114. Mar 29 2016, 4:58 PM

Rebase onto master

I've updated this. I went ahead and ran nofib on an unused, 12-core Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz. The CPU governor was set to performance to disable frequency scaling. The nofib-analyse results for Compile Time were:

-1 s.d.                -----            -7.5%
+1 s.d.                -----            +7.2%
Average                -----            -0.4%

For Compile Allocations:

-1 s.d.                -----            -1.0%
+1 s.d.                -----            -0.3%
Average                -----            -0.7%

This was against this diff vs the current HEAD as of this writing (cf768ec062dee47cf72cbddf42d69d9508ec3afb).

I observed some other changes; in particular, things like 'binary size' seemed to show minor wibbles, although this change clearly should have no impact on that. So there may still be some noise lingering; either way, this change seems to have no impact.

I'm going to look at the .hi file sizes of nofib too, but I didn't record that, so I have to do another run...

I reran the benchmarks and collected interface file sizes (same 12-core machine, etc., with the performance governor). Interface file sizes and nofib-analyse results below. I also ./validated this change again.

The results are positive overall, so I think this is good to go in now.

Commits

HEAD as of this writing (rGHC973633ae3327: Comments only in Unify.hs), with the LZ4 patch on top of that.

Building

NOT with ./validate - using a performance build.

$ ./boot && ./configure && make -j13

nofib-analyse results

Compile time

-1 s.d.                -----            -7.8%
+1 s.d.                -----            +7.7%
Average                -----            -0.3%

Compile allocations

-1 s.d.                -----            -1.0%
+1 s.d.                -----            -0.3%
Average                -----            -0.7%

Interface file sizes

Before

$ find ./compiler/stage2 -type f -iname '*.hi' -exec du -ch {} + | grep total$
26M     total
$ find ./nofib -type f -iname '*.hi' -exec du -ch {} + | grep total$
6.4M    total

After

$ find ./compiler/stage2 -type f -iname '*.hi' -exec du -ch {} + | grep total$
11M     total
$ find ./nofib -type f -iname '*.hi' -exec du -ch {} + | grep total$
3.4M    total
bgamari added a comment (edited). Mar 30 2016, 2:39 PM

The numbers look sensible enough; cutting our interface files roughly in half sounds good to me. A few comments about the patch itself inline.

It might be good to open a Trac feature request just so we have someplace to hang comments (and perhaps the numbers gathered here) for future reference.

compiler/utils/LZ4.hs
21

What's up with the comments?

99

Hmm, why not just return a lazy bytestring and save a copy?

What is your plan for this?

austin planned changes to this revision. May 10 2016, 8:54 AM

Planning changes based on Ben's feedback.

compiler/utils/LZ4.hs
21

Probably just dead code that tripped ./validate or something. I'll eliminate it.

99

It was mostly in the interest of keeping the changes to Binary small, but it could probably be refactored a bit further.
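
For readers of the thread, a sketch of the shape being discussed (a paraphrase, not code from the patch): compression happens once over the whole serialised buffer at the file boundary, which is why Binary itself barely needs to change. The wrapper names below are invented, and the stand-in codec signatures are assumed total for brevity; the real module may differ (e.g. returning Maybe).

import qualified Data.ByteString as BS

-- Stand-ins for the module under review; real signatures may differ.
lz4Compress, lz4Decompress :: BS.ByteString -> BS.ByteString
lz4Compress   = error "stand-in for LZ4 compression"
lz4Decompress = error "stand-in for LZ4 decompression"

-- Hypothetical wrappers showing where the calls sit: between serialisation
-- and the filesystem, leaving Binary's internals alone.
writeIfaceFile :: FilePath -> BS.ByteString -> IO ()
writeIfaceFile path payload = BS.writeFile path (lz4Compress payload)

readIfaceFile :: FilePath -> IO BS.ByteString
readIfaceFile path = fmap lz4Decompress (BS.readFile path)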

austin marked 4 inline comments as done. May 19 2016, 5:38 PM
austin updated this revision to Diff 7664. May 19 2016, 10:22 PM
  • Make encoding simpler
  • docs/relnotes: add blurb
austin updated this revision to Diff 7665. May 19 2016, 10:25 PM
  • Remove some more dead code
bgamari accepted this revision. May 21 2016, 11:46 AM

Indeed, looks good to me.

compiler/utils/Binary.hs
200

Nice simplification.

This revision is now accepted and ready to land. May 21 2016, 11:46 AM
This revision was automatically updated to reflect the committed changes.

Note that switching to CBOR might also reduce file sizes by ~50% and increase encode and decode performance vs the binary package.

But can we switch to CBOR yet?

But can we switch to CBOR yet?

It's not released yet, and @duncan and I are still tuning it. I mean, it's definitely seen a lot of production use; we're just not committed to the initial API (yet).

I expect this patch will probably still give some additional benefit even when combined with CBOR (a lot of the redundancy will be stripped, e.g. because Ints will be encoded much more densely: CBOR packs small integers into a single byte, where binary's default Int encoding is a fixed 8 bytes), so I imagine the overall effect of the two combined will be somewhere in the 50-70% range over what we're currently doing. Just speculation.

I have a private branch with a tiny amount of work starting a transition of GHC to use binary-serialise-cbor, but it doesn't work.

In D1159#65732, @austin wrote:

But can we switch to CBOR yet?

It's not released yet, and @duncan and I are still tuning it. I mean, it's definitely seen a lot of production use; we're just not committed to the initial API (yet).

I expect this patch will probably still give some additional benefit even when combined with CBOR (a lot of the redundancy will be stripped, e.g. because Ints will be encoded much more densely: CBOR packs small integers into a single byte, where binary's default Int encoding is a fixed 8 bytes), so I imagine the overall effect of the two combined will be somewhere in the 50-70% range over what we're currently doing. Just speculation.

I have a private branch with a tiny amount of work starting a transition of GHC to use binary-serialise-cbor, but it doesn't work.

And to be clear: I think a release could happen "soon", but we're about to do at least one overhaul of the decoding interface. It will almost certainly happen in time for 8.2, so preparing work to transition GHC in the meantime would probably prove useful in the long run, and instructive anyway.

Do we think that incorporating binary-serialise-cbor into GHC is a viable option? Won't this significantly increase our dependency set? AFAICT it would add text, unordered-containers, hashable, if taken as-is. Are these dependencies just for defining instances? If so, what is going to become of them?

Do we think that incorporating binary-serialise-cbor into GHC is a viable option? Won't this significantly increase our dependency set? AFAICT it would add text, unordered-containers, hashable, if taken as-is. Are these dependencies just for defining instances? If so, what is going to become of them?

I think these are reasonable. For one, @duncan plans eventually to overhaul the Cabal parser, and to do this he needs parsec, which will inevitably bring along text as a dependency of Cabal. I, at least, agreed with this plan a long time ago.

unordered-containers and hashable aren't strictly necessary in this regard, but I can imagine some cases in GHC where they would be of use, so I don't think adding them does much harm. Finally, it also means GHC will now indirectly depend on vector, but luckily I fixed the stage2 requirement long ago, so I think this will be OK as well.

I also use binary for -fexternal-interpreter, and I really want a faster replacement. I was going to look at Store, but if cbor is not far away maybe I should wait for that. Is it likely to be as fast as Store?

austin added a comment (edited). May 29 2016, 6:39 PM

I also use binary for -fexternal-interpreter, and I really want a faster replacement. I was going to look at Store, but if cbor is not far away maybe I should wait for that. Is it likely to be as fast as Store?

Depends on the use case; store is basically just a fancy, thin wrapper over Storable, so in some regards it's going to be extremely hard to beat no matter what you do; CBOR still has overhead because it uses a system/source-independent data format[1], so in many cases it will be slower. But there are some actual drawbacks to store as it exists today that I think rule it out directly: namely, it has far too large a dependency set, and I believe it depends directly on TH for calculating type hashes, which is used for deriving instances. That means it's going to be impossible to work it into GHC's build system.

If you want the fastest possible communication primitives between only the interpreter and the runtime process, I'd suggest just building your own super-tiny copy of store or something; you can probably get about 80% of the way with 20% of the fancy gadgetry, and it will likely still be way faster than binary or binary-serialise-cbor. It could all fit in one module. This is basically the way shake is heading to eliminate binary-format overhead (e.g. they'll just use a tiny Storable-based interface internally).
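
A rough sketch of what such a one-module, Storable-based codec could look like (invented names; this illustrates the 80/20 idea, it is not code from store or shake):

import qualified Data.ByteString as BS
import qualified Data.ByteString.Unsafe as BSU
import Foreign.Marshal.Alloc (allocaBytes)
import Foreign.Ptr (castPtr)
import Foreign.Storable (Storable (..))

-- Encode a fixed-size value by poking it into a scratch buffer and copying
-- the bytes out. No tags, no varints: the layout is exactly the Storable one.
encodeStorable :: Storable a => a -> IO BS.ByteString
encodeStorable x =
  allocaBytes (sizeOf x) $ \p -> do
    poke p x
    BS.packCStringLen (castPtr p, sizeOf x)

-- Decode by peeking straight out of the ByteString's buffer.
decodeStorable :: Storable a => BS.ByteString -> IO a
decodeStorable bs = BSU.unsafeUseAsCString bs (peek . castPtr)

Note that decodeStorable is only safe if the ByteString really holds sizeOf-many bytes written on the same architecture; that non-portability is exactly the trade-off store makes for its speed.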

For the interface file format, I think CBOR would still be a good choice; it's much faster and more space efficient than binary while being essentially as convenient for users, and it comes with a decent array of debugging and testing tools thanks to the design. It's not as fast as store in an absolute sense (though I think we can get close to, or match, it in some cases), but it's much more general, being in a somewhat different design space.

Overall I'd say it's a toss-up. If you wait a little while and are patient, CBOR will be on the scene and viable, and for essentially every use case it will be faster than binary. If you want to do a little extra work and squeeze out the performance, writing a miniature version of store should go even further, I think. Maybe it's worth having both cases (ifaces & IPC) use the same format, or one specialized to each.

[1] However, I still say it depends on the use case, because today @duncan got the 'unboxed array' case about 20x faster than before (close to just a memcpy), so it really depends on exactly what kind of interface and tooling you want.

duncan added a comment (edited). May 30 2016, 2:15 PM

I also use binary for -fexternal-interpreter, and I really want a faster replacement. I was going to look at Store, but if cbor is not far away maybe I should wait for that. Is it likely to be as fast as Store?

It depends. If it fits squarely into the sweet spot for store, then no. That sweet spot is not-too-large vectors of smallish records (or sequences thereof). For potentially large, lazily generated, more pointer-heavy values, store is not likely to be so good, since it has to do two passes over the data (for the size calculation), so it forces both the input and the output fully into memory at once (and the output is large, like binary's, since it doesn't use variable-length encodings). There are not yet very good benchmarks comparing the two except for the cases that are store's sweet spot. For the one quoted large case, the speedup was only 2x over binary, which is well within the range one can expect from the cbor lib.

The argument for .hi files is much more compelling for the cbor lib, since as disk files there is more advantage to using a sensible and compact on-disk format, and the structures to serialise are rather bigger and more pointer-heavy.

As for a release: the API is not likely to change much between now and a proper release, so you could use the git version for now. The main changes are likely to be renaming things to be a bit more sensible, and adding an 's' param to the Decoder to support using ST underneath.