| Summary: | app-arch/lzma-utils-4.32.7: --best argument does not describe the function it performs | | |
|---|---|---|---|
| Product: | Gentoo Linux | Reporter: | Erik <esigra> |
| Component: | [OLD] Core system | Assignee: | Gentoo Linux bug wranglers <bug-wranglers> |
| Status: | RESOLVED WORKSFORME | | |
| Severity: | normal | CC: | jer |
| Priority: | High | | |
| Version: | unspecified | | |
| Hardware: | All | | |
| OS: | Linux | | |
| URL: | http://tukaani.org/lzma/ | | |
| Whiteboard: | | | |
| Package list: | | Runtime testing required: | --- |
| Attachments: | uncompressed file used for test | | |
Description (Erik, 2009-12-27 16:44:54 UTC):

Created attachment 214321 [details]: uncompressed file used for test
---

I'd say INVALID. That compression scheme was meant for large files. What you're probably seeing is a larger dictionary: for large files, that can make the archive smaller, but manpages are too small to benefit.

---

(In reply to comment #0)

> Of course a higher compression level should never give a larger file
> than a lower compression level.

This is false: it should not give a larger file *in the mean* (for typical data). It lies in the nature of compression that you *cannot* know in advance what is best.

> A workaround would be to create an alias that tries the compression
> levels up to the one given on the command line and keeps the smallest
> file.

There are several wrappers around which do things like this, even using different compression programs. You are always free to use one of those; requiring this from the compression program itself would not be sane. Just think what it would do if it were implemented: it would increase the time by about a factor of 5-7 (just a wild guess) and the required memory by about a factor of 2, in order to save a few bytes in exceptional cases. If you instead used that time and memory to enlarge the dictionary even more, you would gain more in the mean, so the latter is the cleverer choice (in the mean; compression is always a game of probabilities) if you are willing to invest the time and memory.

---

(In reply to comment #3)

> Just think what it would do if it were implemented: it would increase
> the time by about a factor of 5-7 (just a wild guess) and the required
> memory by about a factor of 2

My guess was that it would increase the time by a factor of 2 and not increase the required memory at all. It seems obvious that it would not need more memory, because it would try the compression levels one after the other. The required memory seems to double for each level (according to the manual), so it seems like a reasonable guess that the time does too. In that case, trying all the lower levels in turn would take just a little less time than the highest level alone, which means a total time factor of about 2. I have a dual core here, so the program could try the highest level on one core and the lower levels one after the other on the other core. Then the time factor would be 1 and the memory factor 1.5. How did you arrive at a time factor of 5-7 and a memory factor of 2?

> if you would just use that time and memory to enlarge the dictionary
> even more, you would gain more in the mean

That must be wrong, assuming Rafał Mużyło was right that a too-large dictionary is the problem. The program could probably do something more clever than trying all compression levels in turn. It could use a heuristic function that estimates the dictionary size expected to be best for the given input, compress with a few dictionary sizes around that estimate, then interpolate/extrapolate to find an even better dictionary size, narrowing it down to the optimal one.

---

Well, then maybe the best solution is to rename --best to --dictionary=large or some such. As far as I remember, most compression utilities actually use this kind of option.

bzip2:

    -1 (or --fast) to -9 (or --best)
        Set the block size to 100 k, 200 k .. 900 k when compressing.
        Has no effect when decompressing. See MEMORY MANAGEMENT below.
        The --fast and --best aliases are primarily for GNU gzip
        compatibility. In particular, --fast doesn't make things
        significantly faster. And --best merely selects the default
        behaviour.

gzip:

    -# --fast --best
        Regulate the speed of compression using the specified digit #,
        where -1 or --fast indicates the fastest compression method
        (less compression) and -9 or --best indicates the slowest
        compression method (best compression). The default compression
        level is -6 (that is, biased towards high compression at
        expense of speed).

zip is more careful in its wording:

    -# (-0, -1, -2, -3, -4, -5, -6, -7, -8, -9)
        Regulate the speed of compression using the specified digit #,
        where -0 indicates no compression (store all files), -1
        indicates the fastest compression speed (less compression) and
        -9 indicates the slowest compression speed (optimal
        compression, ignores the suffix list). The default compression
        level is -6. Though still being worked, the intention is this
        setting will control compression speed for all compression
        methods. Currently only deflation is controlled.

Oh, and look at xz:

    --fast and --best
        These are somewhat misleading aliases for -0 and -9,
        respectively. These are provided only for backwards
        compatibility with LZMA Utils. Avoid using these options.
        Especially the name of --best is misleading, because the
        definition of best depends on the input data, and usually
        people don't want the very best compression ratio anyway,
        because it would be very slow.

Please talk to upstream ([URL]) about this, but do note this:

"Users of LZMA Utils should move to XZ Utils. XZ Utils support the legacy .lzma format used by LZMA Utils, and can also emulate the command line tools of LZMA Utils. This should make transition from LZMA Utils to XZ Utils relatively easy."

Maybe you should find out if app-arch/xz-utils perhaps doesn't try to deceive you in this way? :)

---

(In reply to comment #5)

> Maybe you should find out if app-arch/xz-utils perhaps doesn't try to
> deceive you in this way? :)

Actually, with lzma from the beta version of xz-utils, the size really is a monotonically decreasing function of the compression level (for the sample input file):

    % for level in $(seq 9); do echo -n $level:; lzma --keep -$level printf.3 --stdout | wc -c; done
    1:8843
    2:8791
    3:8418
    4:8418
    5:8418
    6:8402
    7:8402
    8:8402
    9:8402

But note that it cannot compress as well as the stable lzma-utils (at level 4). Of course I also tried xz. It is also monotonic, but compresses even worse:

    % for level in $(seq 9); do echo -n $level:; xz --keep -$level printf.3 --stdout | wc -c; done
    1:8888
    2:8836
    3:8464
    4:8464
    5:8464
    6:8448
    7:8448
    8:8448
    9:8448

---

(In reply to comment #6)

> % for level in $(seq 9); do echo -n $level:; lzma --keep -$level
> printf.3 --stdout | wc -c; done

Yes yes, but it was already mentioned that a tiny little file will not compress better with a larger dictionary. That is inherent in the design of all dictionary-based compression algorithms, and there is nothing we can do about it. If you were to compress a tar file a couple of hundred megabytes in size, you would probably notice that a larger dictionary does improve the compression ratio.

But please stop adding to this bug now - there's nothing Gentoo can do about it.
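
---

For reference, the wrapper debated in comments #3 and #4 (try every level up to the requested one, keep only the smallest output) can be sketched in a few lines of shell. This is only an illustration, not part of any shipped tool: the script name, the argument handling, and the temporary file names are made up, and it calls modern xz rather than the old lzma binary, since the thread itself points to xz-utils as the successor.

    #!/bin/sh
    # Sketch: compress FILE at every preset from 1 up to MAXLEVEL and
    # keep whichever result is smallest.
    # Usage: ./trylevels.sh MAXLEVEL FILE
    maxlevel=${1:-9}
    input=$2
    best_size=
    for level in $(seq 1 "$maxlevel"); do
        xz --keep --stdout -"$level" "$input" > "$input.try$level.xz"
        size=$(wc -c < "$input.try$level.xz")
        # Keep the smallest result seen so far, discard the rest.
        if [ -z "$best_size" ] || [ "$size" -lt "$best_size" ]; then
            best_size=$size
            mv "$input.try$level.xz" "$input.xz"
        else
            rm "$input.try$level.xz"
        fi
    done
    echo "kept $input.xz ($best_size bytes)"

The cost profile is exactly what the thread argues about: the input is compressed once per level, so the total time is the sum over all levels rather than the time of the highest level alone, while peak memory stays at the highest level's requirement because the levels run one after the other.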
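The dual-core variant suggested in comment #4 (highest level in one job, the lower levels sequentially in another) could look roughly like this; again the file names are invented for illustration, and the final line assumes GNU ls:

    # Sketch of the two-job idea: level 9 in one background job, levels
    # 1-8 sequentially in another, then keep the smallest result.
    xz --keep --stdout -9 printf.3 > printf.3.l9.xz &
    ( for level in $(seq 1 8); do
          xz --keep --stdout -"$level" printf.3 > "printf.3.l$level.xz"
      done ) &
    wait
    # GNU ls -S sorts largest first, so the last entry is the smallest.
    ls -S printf.3.l*.xz | tail -n 1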
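The dictionary-size experiment proposed in comment #4 can also be tried by hand with current xz, which exposes the dictionary size through its --lzma2 option. The sizes below are arbitrary picks for illustration (a real heuristic would narrow in on a good size instead of using a fixed list), and printf.3 is the attached sample file:

    # Sweep a few dictionary sizes at a fixed preset and report the
    # compressed size for each.
    for dict in 64KiB 256KiB 1MiB 4MiB 16MiB 64MiB; do
        printf '%s: ' "$dict"
        xz --keep --stdout --lzma2=preset=9,dict="$dict" printf.3 | wc -c
    done

For a file as small as a manpage the numbers should plateau quickly, which is the point the closing comments make: beyond the input size, a larger dictionary cannot help.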