Bug 169260 - default man page and doc compression too high
Summary: default man page and doc compression too high
Status: CONFIRMED
Alias: None
Product: Portage Development
Classification: Unclassified
Component: Core
Hardware: All Linux
Importance: Normal normal
Assignee: Portage team
URL:
Whiteboard:
Keywords:
Depends on:
Blocks: 835380
Reported: 2007-03-04 10:55 UTC by J.O. Aho
Modified: 2024-05-12 19:15 UTC (History)
CC List: 11 users

See Also:
Package list:
Runtime testing required: ---


Description J.O. Aho 2007-03-04 10:55:43 UTC
In sys-apps/portage-2.1.2-r9 the default compression of man pages and docs is set too high to make any good sense. "bzip2 -9" generates larger man pages than "gzip -1" when you are dealing with small man pages. The only time "bzip2 -9" gets the upper hand is when dealing with the few really big man pages.

There is no documentation in /etc/make.conf.example on how to change the compression format, and there is no way to set the compression level.

Reproducible: Always

Steps to Reproduce:
1. emerge app-arch/gzip

Actual Results:  
The new zdiff.1.bz2 is larger than the old zdiff.1.gz was.

Here is a small list of sizes with different compressions:
zdiff.1         :    802
zdiff.1.bz2 (-1):    461
zdiff.1.bz2 (-5):    461
zdiff.1.bz2 (-9):    461
zdiff.1.gz  (-1):    439
zdiff.1.gz  (-5):    429
zdiff.1.gz  (-9):    429

Expected Results:  
I had expected a saner choice of compression. I would suggest that the default compression be "gzip -5", as it doesn't generate much larger man pages than "gzip -9" does. Of course, there should also be good documentation in /etc/make.conf.example on how to change the defaults.
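
The size comparison above can be repeated on any man page with a small loop. This is only a sketch: compare_sizes is a made-up helper name, and it assumes the stock gzip and bzip2 tools, which accept a level flag and -c to write to standard output. Tools that are not installed are skipped.

```sh
# Print the size of FILE uncompressed and compressed at levels 1, 5 and 9.
compare_sizes() {
    file=$1
    printf '%-12s %s\n' "raw" "$(wc -c < "$file" | tr -d ' ')"
    for tool in gzip bzip2; do
        # Skip tools that are not installed on this system.
        command -v "$tool" >/dev/null 2>&1 || continue
        for level in 1 5 9; do
            size=$("$tool" "-$level" -c < "$file" | wc -c | tr -d ' ')
            printf '%-12s %s\n' "$tool -$level" "$size"
        done
    done
}
```

Run it on an uncompressed page, for example `compare_sizes zdiff.1`. The exact numbers depend on the page and the tool versions, so no output is shown here.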
Comment 1 Zac Medico gentoo-dev 2007-03-04 11:38:02 UTC
Use PORTAGE_COMPRESS_FLAGS which is documented in `man make.conf`.
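
For illustration, a make.conf fragment along these lines would select gzip at its highest level. It is a sketch only: PORTAGE_COMPRESS is the companion variable mentioned later in this bug, and the values shown are an example choice, not a recommended default. `man make.conf` remains the authority on what each variable accepts.

```sh
# /etc/make.conf (example values, adjust to taste)
PORTAGE_COMPRESS="gzip"
PORTAGE_COMPRESS_FLAGS="-9"
```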
Comment 2 Nuno Silva 2007-03-04 13:14:43 UTC
(In reply to comment #1)
> Use PORTAGE_COMPRESS_FLAGS which is documented in `man make.conf`.
> 

The issue is not the possibility of choosing another compression system and level, it's the default setting.

What I can't understand is the reason for changing such a setting automagically, instead of providing instructions in the config files on how to switch to bzip2... changing something like this without the system administrator's authorization is really not a good idea.
Comment 3 Arthur Hagen 2007-03-04 16:18:40 UTC
Yes, bzip2 -9 is overkill, and will lead to MUCH longer decompression times compared to gzip -9.  We're talking several times as slow here.  Yes, really.
And in most cases, it won't save a single byte of disk space either, unless the difference between the gzip compressed and bzip2 compressed file is enough to cross a block size boundary.  With 4 kB blocks, it doesn't matter whether a file is 8400 or 9000 bytes, it'll take up the same disk space.  So the overall savings are VERY minimal, at a VERY high price in CPU and RAM use.
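
The block-boundary argument is plain rounding arithmetic. A minimal sketch, assuming a fixed 4096-byte block and no tail packing or inline data (disk_usage is a made-up helper name):

```sh
# Bytes occupied on disk by a file of SIZE bytes, rounded up to whole blocks.
disk_usage() {
    size=$1
    block=${2:-4096}
    echo $(( (size + block - 1) / block * block ))
}

disk_usage 8400   # prints 12288
disk_usage 9000   # prints 12288
```

Both files land in three blocks, so the 600 bytes saved in the file itself save nothing on disk.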

Also, the PORTAGE_COMPRESS and PORTAGE_COMPRESS_FLAGS variables have been implemented, er, shall we diplomatically say less than optimally.  If set to a null string, it works the same as if unset, and you get "bzip2" and "-9".  You have to set PORTAGE_COMPRESS to a null command (like ":") and PORTAGE_COMPRESS_FLAGS to a non-altering option (like " ") to avoid compression.
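
Spelled out as a make.conf fragment, the workaround described above would look roughly like this. It reflects only the behaviour reported in this comment for the Portage version under discussion; other versions may treat an empty value differently, so check `man make.conf` before relying on it.

```sh
# /etc/make.conf: workaround to avoid compression, as described above
PORTAGE_COMPRESS=":"          # ":" is the shell's do-nothing command
PORTAGE_COMPRESS_FLAGS=" "    # a single space, so the value is not null
```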

Then there's the point that /usr/share should be distributable across architecture boundaries, like sharing /usr/share/man with NFS (a historical practice).  If compressed with bzip2, the man pages no longer work on Unix-like systems that *don't* have bzip2.  That's most of them.  If the pages aren't meant to be shared, they should go under /usr/man and not /usr/share/man.

Also, the compression of files in /usr/share/doc means that much of the documentation won't be readily available.  Tried reading HTML docs that have been compressed, and following the links between pages?
While it makes sense to compress man pages with a fast compression algorithm on systems with slow storage, as this actually *increases* speed, it doesn't likewise make sense to compress documentation.  Especially not when the files cross-reference each other by name.

(And, unfortunately, someone changed the distributed /etc/man.conf to use bzip2 -9 for catman pages too.  It should, of course, use compress or gzip -9, because the WHOLE POINT of compressing catman pages is to increase access speed, not to save disk space.  If you want to save disk space, you don't use a cache.)

Anyhow, reverting from bzip2 to gzip should go without saying, as there's no conceivable use for bzip2 outside special situations where a few kB saved is paramount (liveCDs, keyfob distros) and CPU/RAM usage of little importance.
  And there should, of course, be an option to keep the doc files uncompressed for convenience even if you want the man pages compressed for speed.

Comment 4 Arthur Hagen 2007-03-04 16:24:58 UTC
Also, here's a small test that also measures the speed and actual disk usage:

Assuming man.1 is a typical man page (in size, it appears to be fairly close to the median), I get the following test results:

Name       Command   Size   Disk*  Decomp. speed**
man.1      none      12297  16 kB  1.207 s
man.1.Z    compress   6398   8 kB  1.357 s
man.1.gz   gzip -1    5332   8 kB  1.439 s
man.1.gz   gzip -9    4812   8 kB  1.370 s
man.1.bz2  bzip2 -1   4618   8 kB  2.735 s
man.1.bz2  bzip2 -9   4618   8 kB  2.857 s

*:  Assuming 4 kB block size.
**:  1000 iterations, best result of 3 consecutive runs.
Equipment used:  1.79 GHz Athlon-4 M (Compaq Presario 2175US) 
File system:  XFS, noatime.
Commands used:
time for i in `seq 1 1000`; do cat man.1 >/dev/null; done
time for i in `seq 1 1000`; do zcat man.1.Z >/dev/null; done
time for i in `seq 1 1000`; do gzcat man.1.gz >/dev/null; done
time for i in `seq 1 1000`; do bzcat man.1.bz2 >/dev/null; done
From this is subtracted the time of the loop construct (0.009s):
time for i in `seq 1 1000`; do :; done

In other words, for this man page, you save *nothing* by using bzip2 -9 instead of gzip -9, and the net effect is to more than double the extraction time.
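
The conclusion can be checked against the table with a few lines of arithmetic. This is a sketch using only the figures quoted above and the assumed 4 kB block size; ratio is a made-up helper name.

```sh
# Figures for man.1 taken from the table above.
gzip9_size=4812
bzip9_size=4618
block=4096

# Bytes saved inside the file itself by bzip2 -9 over gzip -9.
echo $(( gzip9_size - bzip9_size ))                 # prints 194

# Blocks occupied by each variant.
echo $(( (gzip9_size + block - 1) / block )) \
     $(( (bzip9_size + block - 1) / block ))        # prints 2 2

# Quotient of two timings, to two decimals.
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}
ratio 2.857 1.370                                   # prints 2.09
```

So 194 bytes are saved in the file, zero blocks on disk, and decompression takes a little over twice as long.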
Comment 5 Marius Mauch (RETIRED) gentoo-dev 2007-06-23 16:46:33 UTC
Mike, any objection if we change this back? I think most of the ecompress bugs should be found/fixed by now (which I assume was the main motivation to change the defaults to bzip2)
Comment 6 Zac Medico gentoo-dev 2007-06-27 02:09:54 UTC
(In reply to comment #5)
> Mike, any objection if we'd change this back? I think most of the ecompress
> bugs should be found/fixed by now (which I assume was the main motivation to
> change the defaults to bzip)

Yeah, the bzip2 was great for flushing out bugs but gzip is probably a suitable long term default.  Mike, please close this bug when you get a chance.
Comment 7 Jorge Peixoto de Morais Neto 2007-07-03 20:24:08 UTC
Hang on!
According to my tests, bzip2 -5 is faster than gzip. 
Instead of using a specific man page, I compressed every file in /usr/share/man with a certain setting (gzip -1, gzip -5, gzip -9, bzip2 -1, bzip2 -5, bzip2 -9) and measured the time to uncompress all the files sending the output to /dev/null. Each uncompression was performed three times in a row, just after booting the PC in single user mode.

The times : 

bzip2 -1:
			
real	1m30.190s	0m50.056s	0m50.068s
user	0m24.878s	0m25.334s	0m25.574s
sys	0m26.174s	0m24.722s	0m24.494s

bzip2 -5:

real	1m29.377s	0m50.799s	0m50.759s
user	0m25.222s	0m24.958s	0m24.998s
sys	0m26.438s	0m25.842s	0m25.762s
 
bzip2 -9:
			
real	1m40.974s	0m50.769s	0m50.551s
user	0m25.578s	0m25.598s	0m25.034s
sys	0m26.190s	0m25.170s	0m25.518s

gzip -1:

real	1m52.998s	1m6.384s	1m6.536s
user	0m34.050s	0m33.594s	0m33.330s
sys	0m33.598s	0m32.790s	0m33.206s

gzip -5:
			
real	1m48.797s	1m6.475s	1m6.375s
user	0m33.658s	0m33.018s	0m34.094s
sys	0m33.474s	0m33.458s	0m32.282s

gzip -9:
			
real	1m47.781s	1m6.567s	1m6.104s
user	0m34.082s	0m33.122s	0m33.542s
sys	0m33.150s	0m33.446s	0m32.562s

the sizes:

40M     man-bzip2-1
39M     man-bzip2-3
39M     man-bzip2-4
39M     man-bzip2-5
39M     man-bzip2-9
42M     man-gzip-1
40M     man-gzip-5
40M     man-gzip-9
72M     man-u (uncompressed)

some information about my PC:

sudo hdparm -tT /dev/hda

/dev/hda:
 Timing cached reads:   438 MB in  2.00 seconds = 218.95 MB/sec
 Timing buffered disk reads:  160 MB in  3.01 seconds =  53.19 MB/sec

uname -a
Linux jorge 2.6.21-gentoo-r2 #5 PREEMPT Sat Jun 9 12:49:08 BRT 2007 i686 AMD Athlon(tm) XP 2600+ AuthenticAMD GNU/Linux
Comment 8 Jorge Peixoto de Morais Neto 2007-07-03 20:30:57 UTC
Two corrections:

1) I said bzip2 -5, but I meant bzip2 -1

2) I should give the sizes in finer units than MB. So, removing the -h flag from du (GNU du then reports in its default 1 KiB blocks, not bytes):

du -s man*
40017   man-bzip2-1
39913   man-bzip2-3
39905   man-bzip2-4
39905   man-bzip2-5
39901   man-bzip2-9
42915   man-gzip-1
40751   man-gzip-5
40651   man-gzip-9
72802   man-u
Comment 9 Ole Langbehn 2009-06-27 15:32:16 UTC
Jorge's numbers are probably flawed, since he cold booted and then started the test. That means that his first bzip2 run read from disk instead of from cached memory.

I would trust the numbers in comment #4.

And I agree, the default should be gzip -9, since it takes the least time for decompression.
Comment 10 Arthur Hagen 2009-06-28 15:32:43 UTC
(In reply to comment #9)
> 
> And I agree, the default should be gzip -9, since it takes the least time for
> decompression.

Well, actually pack/unpack (lower case .z files) is the fastest for uncompressing, which is why several commercial Unixes use it for catman.  And lzop (.lzo files) is faster too.  But pack isn't easily available for Linux due to licensing, and lzop, like bzip2, isn't a standard install, and thus should not be used on anything that could be shared between machines.

Anyhow, out of what we have to work with, gzip is, by far, the most available these days, and the decompression speed is relatively fast.  So yeah, gzip is the sensible choice, next to "no compression".
Comment 11 Roman Žilka 2013-03-09 12:02:33 UTC
For what it's worth, I ran some tests to figure out this dilemma today. I have a relatively good laptop SATA drive, ext4 w/ noatime, 2.5GHz Core i5, hardened kernel 3.7.5.

Tests on an average-sized (2-4 KB) manpage show that I/O costs about 400 ms for both uncompressed and compressed versions of the manpage, while decompression costs 1-3 ms for everything between lzo -1 and xz -9, including gzip -X and bzip2 -X. The largest manpage I have (perltoc.1, over 1 MB) takes 20 ms to unxz -9 and 5 ms to unlzop -1.

People with mechanical storage have fast CPUs, so I find that it doesn't really matter to them what kind of compression (if any) they use. People with SSD/flash storage care less about I/O. I cannot test how little they care, but I suppose there is negligible speed difference between reading 4 KB and 1 MB. Some of them are the modern laptop users, who don't even care about CPU demands; others are the embedded people who might care about CPU.

I conclude that the best default way to do things is no compression. The embedded folks will be 5% happier and the rest of us will get 1 second faster merge times per package. Also, things will just be generally a tiny bit simpler.

I suppose no one today cares about disk space when we're talking manpages and docs. The rare few embedded users who care about disk space on the scale of tens to a few hundred MB will have to explicitly turn on compression and wait a bit longer for manpages to display.

I'm interested in any ideas concerning this evaluation.
Comment 12 Roman Žilka 2013-03-09 12:28:48 UTC
Oops, a typo there: s/400 ms/40 ms/. And reading 1 MB costs about twice as much. But I consider anything <100 ms like it's not there.

It would probably also be correct to consider the number of embedded people too low compared to the others, and to give higher priority to the <100 ms concerns of us SATA regulars. In that case, the best default would probably be xz -9: for small files it does no harm; for larger files the reduced I/O saves the day.

What do you think? It seems consensus is yet to be reached in this bug.
Comment 13 Zac Medico gentoo-dev 2013-03-09 16:12:56 UTC
Maybe we should try for conformity with other Linux distros, or at least look at the reasoning other distros used to choose their default compression?

Looking at an Ubuntu 12.10 system, the default compression appears to be gzip.
Comment 14 SpanKY gentoo-dev 2013-03-20 07:16:01 UTC
(In reply to comment #13)

tl;dr: skip to end :p

i don't think any other distro has picked anything other than gzip.  and most likely for legacy reasons ("it's always been that way") rather than anyone doing serious research.

i did numbers analysis a long time ago on this, but i'm not sure if i didn't post it (or i posted it to a diff bug or mailing list or ...).  doesn't matter.

keep in mind that portage (currently) only supports one compression scheme.  so focusing on just /usr/share/man/ doesn't make sense.  /usr/share/doc/ has a lot more files and makes much more sense to be compressed higher as those aren't actively being decompressed on a day-to-day basis.

i'm not too interested in the claim that this "only matters for embedded".  i've got plenty of systems where disk is a premium, and they aren't embedded.  to turn the logic around, if your system is "beefy", then the decompression overhead should be irrelevant.

for man pages, the focus here should be on decompression.  we don't generally care how long it takes to compress the files.

on my system of ~17500 man pages (which i think is a bit on the unusual side in terms of total # of files), the compressed sizes we have:
bzip2-9       37363270 35%
bzip2-8       37363846 35%
bzip2-7       37364252 35%
bzip2-6       37367266 35%
bzip2-5       37377376 35%
bzip2-4       37382266 35%
bzip2-3       37412669 35%
bzip2-2       37476800 35%
xz-7          37648808 36%
xz-6          37648810 36%
xz-8          37648811 36%
xz-9          37648811 36%
xz-5          37674888 36%
bzip2-1       37692413 36%
xz-4          37914222 36%
gzip-9        38610832 37%
gzip-8        38610944 37%
gzip-7        38625440 37%
gzip-6        38661329 37%
gzip-5        38860087 37%
xz-3          39039534 37%
xz-2          39132314 37%
xz-1          39434849 37%
gzip-4        39607037 38%
gzip-3        41248262 39%
gzip-2        41971804 40%
gzip-1        43059502 41%
lzop-9        46518382 44%
lzop-8        46518522 44%
lzop-7        46757820 44%
lzop-5        57622842 55%
lzop-6        57622842 55%
lzop-4        57622843 55%
lzop-2        57622844 55%
lzop-3        57622844 55%
lzop-1        57843543 55%
uncompressed 104150218

while for compression time (src & dst in tmpfs on an otherwise quiet system):
lzop-2       0.32user 0.36system 0:00.74elapsed 91%CPU
lzop-1       0.37user 0.30system 0:00.71elapsed 94%CPU
lzop-4       0.38user 0.29system 0:00.73elapsed 91%CPU
lzop-5       0.39user 0.26system 0:00.94elapsed 69%CPU
lzop-6       0.39user 0.27system 0:00.99elapsed 66%CPU
lzop-3       0.41user 0.28system 0:00.84elapsed 82%CPU
gzip-1       1.81user 0.22system 0:02.07elapsed 98%CPU
gzip-2       1.93user 0.19system 0:02.16elapsed 98%CPU
gzip-3       2.05user 0.23system 0:02.32elapsed 98%CPU
gzip-4       2.34user 0.18system 0:02.59elapsed 97%CPU
gzip-5       2.75user 0.27system 0:04.56elapsed 66%CPU
gzip-6       3.26user 0.26system 0:04.67elapsed 75%CPU
gzip-7       3.47user 0.21system 0:04.73elapsed 77%CPU
gzip-8       3.76user 0.27system 0:05.36elapsed 75%CPU
gzip-9       3.76user 0.25system 0:04.94elapsed 81%CPU
lzop-7       5.56user 0.36system 0:06.40elapsed 92%CPU
lzop-8       8.64user 0.25system 0:09.43elapsed 94%CPU
lzop-9       8.72user 0.30system 0:09.57elapsed 94%CPU
bzip2-3     13.78user 0.76system 0:14.76elapsed 98%CPU
bzip2-1     13.87user 0.78system 0:15.00elapsed 97%CPU
bzip2-4     13.87user 0.63system 0:14.88elapsed 97%CPU
bzip2-2     13.88user 0.64system 0:14.60elapsed 99%CPU
bzip2-7     13.95user 2.24system 0:20.93elapsed 77%CPU
bzip2-9     13.97user 1.32system 0:19.80elapsed 77%CPU
bzip2-5     13.99user 0.71system 0:18.37elapsed 80%CPU
bzip2-8     14.06user 1.73system 0:20.29elapsed 77%CPU
bzip2-6     14.12user 2.26system 0:21.36elapsed 76%CPU
xz-1        16.51user 0.70system 0:18.59elapsed 92%CPU
xz-2        28.51user 0.82system 0:30.79elapsed 95%CPU
xz-3        43.86user 0.96system 0:45.93elapsed 97%CPU
xz-4        56.64user 0.82system 0:58.54elapsed 98%CPU
xz-5       154.20user 1.00system 3:17.77elapsed 78%CPU
xz-6       156.40user 0.98system 3:16.90elapsed 79%CPU
xz-7       261.04user 0.90system 5:06.92elapsed 85%CPU
xz-9       401.04user 1.44system 7:26.30elapsed 90%CPU
xz-8       402.18user 1.52system 7:26.45elapsed 90%CPU

and for decompression time:
 - src is on a RAID of 7200 rpm disks
 - dst is /dev/null -- normally we write to a pipe/tty which is RAM anyways
 - `echo 3 > /proc/sys/vm/drop_caches` before each set of tests
 - e.g. find <dir> -exec xzcat {} + >/dev/null
 - system was otherwise quiet
lzop-6       1.25user 1.24system 0:10.57elapsed 23%CPU
lzop-1       1.31user 1.25system 0:12.76elapsed 20%CPU
lzop-2       1.34user 1.22system 0:14.08elapsed 18%CPU
lzop-4       1.35user 1.19system 0:12.42elapsed 20%CPU
lzop-7       1.40user 1.14system 0:09.94elapsed 25%CPU
lzop-3       1.41user 1.16system 0:14.13elapsed 18%CPU
lzop-8       1.41user 1.15system 0:10.31elapsed 24%CPU
lzop-5       1.42user 1.06system 0:08.99elapsed 27%CPU
lzop-9       1.50user 1.07system 0:10.72elapsed 24%CPU
gzip-8       3.63user 0.83system 0:10.33elapsed 43%CPU
gzip-7       3.81user 0.90system 0:09.13elapsed 51%CPU
gzip-6       3.84user 0.91system 0:11.71elapsed 40%CPU
gzip-5       3.90user 0.90system 0:11.66elapsed 41%CPU
gzip-9       3.90user 0.82system 0:10.79elapsed 43%CPU
gzip-1       3.96user 1.21system 0:14.43elapsed 35%CPU
gzip-4       3.97user 0.94system 0:12.77elapsed 38%CPU
gzip-3       4.00user 0.91system 0:12.36elapsed 39%CPU
gzip-2       4.02user 0.96system 0:13.33elapsed 37%CPU
xz-5         6.20user 0.69system 0:12.01elapsed 57%CPU
xz-7         6.30user 0.68system 0:10.60elapsed 65%CPU
xz-1         6.64user 0.63system 0:11.27elapsed 64%CPU
xz-3         6.71user 0.76system 0:11.15elapsed 67%CPU
xz-2         6.76user 0.82system 0:11.38elapsed 66%CPU
xz-8         6.92user 0.68system 0:12.43elapsed 61%CPU
xz-6         6.99user 0.72system 0:11.43elapsed 67%CPU
bzip2-5      7.18user 0.81system 0:12.00elapsed 66%CPU
xz-4         7.26user 0.63system 0:13.78elapsed 57%CPU
xz-9         7.44user 0.82system 0:13.40elapsed 61%CPU
bzip2-6      7.91user 0.88system 0:12.98elapsed 67%CPU
bzip2-2      8.11user 0.93system 0:13.66elapsed 66%CPU
bzip2-9      8.11user 0.89system 0:12.84elapsed 70%CPU
bzip2-8      8.20user 0.96system 0:14.17elapsed 64%CPU
bzip2-4      8.57user 1.01system 0:15.31elapsed 62%CPU
bzip2-1      8.65user 1.08system 0:16.54elapsed 58%CPU
bzip2-7      9.06user 1.05system 0:15.19elapsed 66%CPU
bzip2-3      9.38user 0.97system 0:17.02elapsed 60%CPU

bzip2 is the worst in decompression speeds, but has the best compressed result.  keep in mind that xz excels with large inputs, but not so much with small inputs which is probably why bzip2 beats it out here (many small files).

lzo is kind of interesting, but not really.  the fact that our current man packages don't support it out of the box means that it's not useful (we could fix that, but eh).

i think as a default, we can sacrifice <10MiB for the majority of systems to get better decompression speeds.
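
To put a number on that trade-off, here is the difference computed from the size table earlier in this comment. A sketch only; mib_diff is a made-up helper name.

```sh
# Difference between two byte counts, in MiB, to two decimals.
mib_diff() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", (a - b) / 1048576 }'
}

mib_diff 38661329 37363270   # gzip -6 vs bzip2 -9: prints 1.24
mib_diff 43059502 37363270   # gzip -1 vs bzip2 -9: prints 5.43
```

Even the weakest gzip level stays well under 10 MiB on this set of roughly 17500 man pages.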

we can probably go even better and skip compressing of files below a certain threshold (like <128bytes) because those usually don't compress at all (they get bigger in fact) which means we gain literally nothing but lose out in every other way.  pretty much every single one of those is a '.so' redirect to a different man page.  if we wanted to get tricky, we might even consider deleting the .so altogether and rewriting it into a symlink ...

so what i think we should do is split the compression vars.  PMS has no say in this area -- it talks about it in the abstract as in "things may be compressed", but doesn't require any particular variable/compression scheme/flags/etc...

thus, for man pages, we default to `gzip -6`.  for other files (i.e. /usr/share/doc/), we keep the default of `bzip2 -9`.  and we skip compression entirely on man pages under 128 bytes.
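
The size threshold could be sketched roughly as follows. This is not Portage's implementation, just an illustration with find and gzip; compress_above is a made-up helper name, and it skips files of exactly the limit or smaller.

```sh
# Compress every regular file under DIR that is larger than LIMIT bytes,
# leaving smaller files (typically ".so" redirects) untouched.
compress_above() {
    dir=$1
    limit=${2:-128}
    find "$dir" -type f -size +"${limit}c" ! -name '*.gz' -exec gzip -6 {} +
}

# Why bother: a 16-byte ".so" redirect grows when gzipped, because the
# gzip header and trailer alone take 18 bytes.
printf '.so man1/gzip.1\n' | gzip -9 | wc -c
```

A call such as `compress_above /tmp/image/usr/share/man` (a hypothetical staging path) would then leave the redirects alone and gzip everything else.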
Comment 15 Roman Žilka 2013-03-20 11:03:53 UTC
OK, but in that case, why gzip -6? According to your results, shouldn't it ideally be more like this?

<128B: no compression
128B - 4KiB: still no compression, because the I/O costs the same for everything in this range (I haven't measured, but I believe it's true), so one might as well avoid the decompression phase
4KiB - xyKiB: bzip2 -9
>xyKiB: xz -9 or even -9e

I don't have /usr/share/doc, so I can't evaluate how much 'xy' should be or if the files there are large enough to even consider xz.

However, we could consider the fact that xz -9 and bzip2 -9 both compress the manpages to an almost equivalent size - the difference is ~0.7% in your case. That easily rounds to nothing. Combined with the faster decompression of xz -9 the picture gets even simpler then:

<=4KiB: no compression
>4KiB: xz -9 or -9e

A more punctual person might want to insert something intermediate which has faster decompression than xz -9, but still produces compressed files <4KiB, so I/O time is not affected. Something like gzip -1. But it probably wouldn't really make much difference. On the other hand, the embedded few would get happier - gzip-1 is fast and the limit would cover most manpages, I guess(?).
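
As a sketch, the simplified two-tier scheme above boils down to a single size test. pick_compressor is a made-up helper name, and the 4 KiB cut-off is the one proposed in this comment, not something Portage implements.

```sh
# Choose a compressor from the file size in bytes:
# nothing up to 4 KiB, xz -9 above that.
pick_compressor() {
    if [ "$1" -le 4096 ]; then
        echo "none"
    else
        echo "xz -9"
    fi
}

pick_compressor 802     # prints none
pick_compressor 12297   # prints xz -9
```

An intermediate gzip -1 tier would just be one more branch with its own threshold.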
Comment 16 SpanKY gentoo-dev 2013-03-20 16:19:16 UTC
(In reply to comment #15)

i've posted a patch to disable compression on man pages <=128B

looking at the man pages locally, after deleting everything <=128B, there is literally no man page that falls into the <=4KiB region.  the next smallest page is 64KiB.  so i don't think we need to bother getting into that ugly rabbit hole of graduated compression.

i do have /usr/share/doc/.  i don't build with USE=doc or USE=examples, but my dir is still 603MiB.  if you include those other flags, that'll get much bigger.

re-running those tests with all the small files (<=128B) trimmed:
bzip2-9        5722229 23%
bzip2-8        5722805 23%
bzip2-7        5723211 23%
bzip2-6        5726221 23%
bzip2-5        5736335 23%
bzip2-4        5741225 23%
bzip2-3        5771628 24%
bzip2-2        5835757 24%
xz-7           5913491 24%
xz-9           5913492 24%
xz-6           5913493 24%
xz-8           5913496 24%
xz-5           5931178 24%
xz-4           6036545 25%
bzip2-1        6051371 25%
xz-3           6318873 26%
xz-2           6372501 26%
xz-1           6518569 27%
gzip-9         6704404 27%
gzip-8         6704492 27%
gzip-7         6712945 27%
gzip-6         6729971 28%
gzip-5         6834475 28%
gzip-4         7092539 29%
gzip-3         7529731 31%
lzop-9         7694473 32%
lzop-8         7694587 32%
lzop-7         7797339 32%
gzip-2         7814169 32%
gzip-1         8182159 34%
lzop-2        10803656 45%
lzop-4        10803656 45%
lzop-5        10803656 45%
lzop-6        10803656 45%
lzop-3        10803658 45%
lzop-1        10887167 45%
uncompressed  23995226

compression times:
lzop-6       0.05user 0.02system 0:00.13elapsed 60%CPU
lzop-1       0.06user 0.02system 0:00.09elapsed 96%CPU
lzop-2       0.07user 0.02system 0:00.10elapsed 95%CPU
lzop-5       0.07user 0.01system 0:00.09elapsed 93%CPU
lzop-3       0.08user 0.01system 0:00.10elapsed 95%CPU
lzop-4       0.08user 0.01system 0:00.10elapsed 97%CPU
gzip-1       0.34user 0.02system 0:00.37elapsed 97%CPU
gzip-2       0.38user 0.01system 0:00.40elapsed 98%CPU
gzip-3       0.46user 0.01system 0:00.47elapsed 98%CPU
gzip-4       0.48user 0.03system 0:00.52elapsed 98%CPU
gzip-5       0.68user 0.00system 0:00.91elapsed 75%CPU
gzip-6       0.94user 0.01system 0:01.41elapsed 67%CPU
gzip-7       1.06user 0.02system 0:01.22elapsed 88%CPU
gzip-8       1.26user 0.00system 0:01.36elapsed 92%CPU
gzip-9       1.27user 0.06system 0:01.45elapsed 92%CPU
xz-1         1.75user 0.06system 0:01.86elapsed 97%CPU
lzop-7       1.92user 0.02system 0:01.95elapsed 99%CPU
bzip2-2      2.14user 0.06system 0:02.21elapsed 99%CPU
bzip2-3      2.15user 0.05system 0:02.23elapsed 99%CPU
bzip2-4      2.16user 0.07system 0:02.26elapsed 98%CPU
bzip2-5      2.16user 0.09system 0:02.27elapsed 99%CPU
bzip2-1      2.17user 0.05system 0:02.25elapsed 98%CPU
bzip2-7      2.20user 0.18system 0:02.99elapsed 79%CPU
bzip2-8      2.21user 0.10system 0:02.98elapsed 77%CPU
xz-2         2.23user 0.04system 0:02.29elapsed 99%CPU
bzip2-6      2.24user 0.08system 0:02.75elapsed 84%CPU
bzip2-9      2.25user 0.17system 0:03.04elapsed 80%CPU
xz-3         2.72user 0.06system 0:02.83elapsed 98%CPU
lzop-8       2.84user 0.01system 0:02.88elapsed 98%CPU
lzop-9       2.91user 0.00system 0:02.97elapsed 97%CPU
xz-4         5.89user 0.05system 0:05.99elapsed 99%CPU
xz-5         8.04user 0.19system 0:09.28elapsed 88%CPU
xz-6         8.80user 0.08system 0:10.70elapsed 83%CPU
xz-7         9.34user 0.07system 0:12.47elapsed 75%CPU
xz-8        10.65user 0.09system 0:13.18elapsed 81%CPU
xz-9        10.82user 0.08system 0:12.82elapsed 85%CPU

decompression times:
lzop-7       0.04user 0.01system 0:00.06elapsed 86%CPU
lzop-9       0.04user 0.00system 0:00.06elapsed 87%CPU
lzop-4       0.05user 0.00system 0:00.07elapsed 86%CPU
lzop-6       0.05user 0.01system 0:00.07elapsed 86%CPU
lzop-8       0.05user 0.00system 0:00.06elapsed 83%CPU
lzop-1       0.06user 0.00system 0:00.07elapsed 82%CPU
lzop-2       0.06user 0.00system 0:00.07elapsed 86%CPU
lzop-3       0.06user 0.00system 0:00.08elapsed 76%CPU
lzop-5       0.06user 0.00system 0:00.07elapsed 87%CPU
gzip-8       0.23user 0.00system 0:00.35elapsed 68%CPU
gzip-7       0.36user 0.03system 0:00.64elapsed 61%CPU
gzip-1       0.37user 0.02system 0:00.80elapsed 49%CPU
xz-7         0.38user 0.00system 0:00.40elapsed 97%CPU
xz-6         0.39user 0.00system 0:00.40elapsed 97%CPU
xz-8         0.39user 0.00system 0:00.40elapsed 97%CPU
gzip-9       0.40user 0.02system 0:00.50elapsed 84%CPU
xz-5         0.40user 0.01system 0:00.43elapsed 95%CPU
xz-9         0.40user 0.00system 0:00.41elapsed 97%CPU
gzip-2       0.41user 0.02system 0:00.60elapsed 73%CPU
gzip-6       0.42user 0.00system 0:00.58elapsed 73%CPU
gzip-3       0.44user 0.02system 0:00.64elapsed 72%CPU
gzip-5       0.48user 0.02system 0:01.06elapsed 47%CPU
xz-4         0.48user 0.01system 0:00.62elapsed 79%CPU
xz-2         0.50user 0.03system 0:00.71elapsed 74%CPU
gzip-4       0.51user 0.02system 0:00.93elapsed 57%CPU
xz-1         0.58user 0.02system 0:00.78elapsed 77%CPU
xz-3         0.60user 0.02system 0:00.86elapsed 71%CPU
bzip2-7      0.64user 0.02system 0:00.72elapsed 92%CPU
bzip2-6      0.69user 0.02system 0:00.82elapsed 86%CPU
bzip2-9      0.69user 0.03system 0:00.85elapsed 84%CPU
bzip2-4      0.71user 0.04system 0:00.89elapsed 83%CPU
bzip2-3      0.72user 0.03system 0:00.96elapsed 78%CPU
bzip2-8      0.72user 0.02system 0:00.94elapsed 78%CPU
bzip2-2      0.73user 0.02system 0:00.95elapsed 78%CPU
bzip2-5      0.73user 0.02system 0:00.89elapsed 85%CPU
bzip2-1      0.76user 0.02system 0:01.07elapsed 73%CPU

these yield more interesting numbers.  there is no compressed size difference between xz -[6789], while compression & decompression is better with xz -6.  so with that in mind, i guess i'll change my recommendation to use that by default for man pages.
Comment 17 Roman Žilka 2013-03-20 18:58:11 UTC
(In reply to comment #16)
> looking at the man pages locally, after deleting everything <=128B, there is
> literally no man page that falls into the <=4KiB region.  the next smallest
> page is 64KiB.  so i don't think we need to bother getting into that ugly
> rabbit hole of graduated compression.

To begin with, the <=4KiB tier is not really part of the graduated scheme: the kernel ends up loading at least 4 KiB anyway, even if the file is smaller (IIRC from my classes). But that needs checking, with a test that takes into account both the I/O for each individual file(!) on decompression and the algorithm's run time. Decompression must be non-deleting. The working dir contains ~1000 uncompressed manpages, randomly selected from the set of all manpages. ext4+noatime, SATA disk.

Compress:
sync; echo 3 > /proc/sys/vm/drop_caches; time bash -c 'for i in *; do XYzip -[1-9] $i; done'
Uncompress:
time bash -c 'for i in *; do sync; echo 3 > /proc/sys/vm/drop_caches; XYzip -kdc $i >/dev/null; done'

First "real": compression time, then decompression.

bzip2 -9
real    0m3.187s
real    0m18.562s
bzip2 -5
real    0m2.775s
real    0m18.072s
bzip2 -1
real    0m2.776s
real    0m18.803s
xz -9
real    0m30.914s
real    0m18.651s
xz -5
real    0m11.478s
real    0m24.592s
xz -1
real    0m4.741s
real    0m20.107s
xz -9e
real    0m30.912s
real    0m19.978s
xz -5e
real    0m11.530s
real    0m20.758s
xz -1e
real    0m5.645s
real    0m19.971s
lzop -9
real    0m2.940s
real    0m15.741s
lzop -5
real    0m2.073s
real    0m14.622s
lzop -1
real    0m2.129s
real    0m14.634s
gzip -9
real    0m3.799s
real    0m17.614s
gzip -5
real    0m2.066s
real    0m16.527s
gzip -1
real    0m1.883s
real    0m16.658s
cat
real    ---
real    0m13.754s

Again, I can't test on /usr/share/doc content, but I'm totally switching back to uncompressed.
Comment 18 SpanKY gentoo-dev 2013-03-20 19:20:23 UTC
portage no longer compresses man pages <=128B.  i might look into adding a similar limit on doc files.

http://git.overlays.gentoo.org/gitweb/?p=proj/portage.git;a=commitdiff;h=b47ea7a91220d72b78547721cedb8a4ca6cec39e

that means for the remaining files, `bzip2 -9` does give the best compression ratio overall (which is the current default).

we can look at adding a dedicated compression var for man pages, but it's much less of an issue now.
Comment 19 RumpletonBongworth 2017-01-10 14:47:22 UTC
Note that xz compression is more effective for documentation with a profile containing "pb=0" than without. At least, this was true when I tested back in 2012. At the time, I used the following option in make.conf:

  PORTAGE_COMPRESS_FLAGS="--lzma2=preset=6e,pb=0" 

These were my results at the time (the directories without a .bz2 suffix are those containing the xz compressed material):

  # du -sb /usr/share/{doc.bz2,man.bz2,doc,man}
  16676478        /usr/share/doc.bz2
  17882717        /usr/share/man.bz2
  17393009        /usr/share/doc
  11889470        /usr/share/man

While xz didn't quite beat bzip2 -9 for compression of my doc directory, it yielded overall savings of 5.03 MiB. What's interesting is that pb=0 was the magic sauce that allowed it to exceed - rather than fall short of - the degree of compression applied by bzip2.
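
The savings figure follows directly from the du output above. A small worked check, using only those four numbers:

```sh
bz2_total=$(( 16676478 + 17882717 ))   # doc.bz2 + man.bz2
xz_total=$(( 17393009 + 11889470 ))    # doc + man (xz with pb=0)
saved=$(( bz2_total - xz_total ))

echo "$saved"                                                    # prints 5276716
awk -v s="$saved" 'BEGIN { printf "%.2f MiB\n", s / 1048576 }'   # prints 5.03 MiB
```

The doc directory is about 0.7 MB larger with xz, but the man directory is about 6 MB smaller, which is where the net gain comes from.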

I'm not necessarily suggesting that xz is the best choice for gentoo, but this is worth keeping in mind during any evaluation where the effectiveness of the compression is considered important.