| Summary: | app-arch/lzma-utils-4.32.7: --best argument does not describe the function it performs | | |
|---|---|---|---|
| Product: | Gentoo Linux | Reporter: | Erik <esigra> |
| Component: | [OLD] Core system | Assignee: | Gentoo Linux bug wranglers <bug-wranglers> |
| Status: | RESOLVED WORKSFORME | | |
| Severity: | normal | CC: | jer |
| Priority: | High | | |
| Version: | unspecified | | |
| Hardware: | All | | |
| OS: | Linux | | |
| URL: | http://tukaani.org/lzma/ | | |
| Whiteboard: | | | |
| Package list: | | Runtime testing required: | --- |
| Attachments: | uncompressed file used for test | | |
Description (Erik, 2009-12-27 16:44:54 UTC):

Created attachment 214321 [details]: uncompressed file used for test
---

I'd say INVALID. That compression scheme was meant for large files. What you're probably seeing is a larger dictionary: for large files, that can make the archive smaller, but manpages are too small to benefit.

---

(In reply to comment #0)

> Of course a higher compression level should never give a larger file
> than a lower compression level.

This is false: it should not give a larger file *in the mean* (for typical data). It lies in the nature of compression that you *cannot* know in advance what is best.

> A workaround would be to create an alias that tries the compression
> levels up to the one given on the command line and keeps the smallest
> file.

There are several wrappers around which do things like this, even using different compression programs. You are always free to use one of those; requiring this from the compression program itself would not be sane. Just think what it would do if it were implemented: it would increase the time by about a factor of 5-7 (just a wild guess) and the required memory by about a factor of 2, in order to save a few bytes in exceptional cases. If you instead used that time and memory to enlarge the dictionary even more, you would gain more in the mean, so the latter is the cleverer choice (in the mean; compression is always a game of probabilities) if you are willing to invest the time and memory.

---

(In reply to comment #3)

> Just think what it would do if it were implemented: it would increase
> the time by about a factor of 5-7 (just a wild guess) and the required
> memory by about a factor of 2

My guess was that it would increase the time by a factor of 2 and not increase the required memory at all. It seems obvious that it would not need more memory, because it would try the compression levels one after the other. The required memory seems to double for each level (according to the manual), so it seems like a reasonable guess that the time does too. In that case, trying all the lower levels in turn would take just a little less time than the highest level alone, which means a total time factor of about 2. I have a dual core here, so the program could try the highest level on one core and the lower levels one after the other on the other core. Then the time factor would be 1 and the memory factor 1.5. How did you arrive at a time factor of 5-7 and a memory factor of 2?

> if you would just use that time and memory to enlarge the dictionary
> even more, you would gain more in the mean

That must be wrong, assuming Rafał Mużyło was right that a too-large dictionary is the problem. The program could probably do something more clever than trying all compression levels in turn. It could use a heuristic function that estimates the dictionary size expected to be best for the given input, compress with a few dictionary sizes around that estimate, then interpolate/extrapolate to find an even better dictionary size, narrowing it down to the optimal one.

---

Well, then maybe the best solution is to rename --best to --dictionary=large or some such. As far as I remember, most compression utilities actually use this kind of option.

bzip2:

    -1 (or --fast) to -9 (or --best)
        Set the block size to 100 k, 200 k .. 900 k when compressing.
        Has no effect when decompressing. See MEMORY MANAGEMENT below.
        The --fast and --best aliases are primarily for GNU gzip
        compatibility. In particular, --fast doesn't make things
        significantly faster. And --best merely selects the default
        behaviour.

gzip:

    -# --fast --best
        Regulate the speed of compression using the specified digit #,
        where -1 or --fast indicates the fastest compression method
        (less compression) and -9 or --best indicates the slowest
        compression method (best compression). The default compression
        level is -6 (that is, biased towards high compression at
        expense of speed).

zip is more careful in its wording:

    -# (-0, -1, -2, -3, -4, -5, -6, -7, -8, -9)
        Regulate the speed of compression using the specified digit #,
        where -0 indicates no compression (store all files), -1
        indicates the fastest compression speed (less compression) and
        -9 indicates the slowest compression speed (optimal
        compression, ignores the suffix list). The default compression
        level is -6. Though still being worked, the intention is this
        setting will control compression speed for all compression
        methods. Currently only deflation is controlled.

Oh, and look at xz:

    --fast and --best
        These are somewhat misleading aliases for -0 and -9,
        respectively. These are provided only for backwards
        compatibility with LZMA Utils. Avoid using these options.
        Especially the name of --best is misleading, because the
        definition of best depends on the input data, and usually
        people don't want the very best compression ratio anyway,
        because it would be very slow.

Please talk to upstream ([URL]) about this, but do note this:

"Users of LZMA Utils should move to XZ Utils. XZ Utils support the legacy .lzma format used by LZMA Utils, and can also emulate the command line tools of LZMA Utils. This should make transition from LZMA Utils to XZ Utils relatively easy."

Maybe you should find out if app-arch/xz-utils perhaps doesn't try to deceive you in this way? :)

---

(In reply to comment #5)

> Maybe you should find out if app-arch/xz-utils perhaps doesn't try to
> deceive you in this way? :)

Actually, with lzma from the beta version of xz-utils, the size really is a monotonically decreasing function of the compression level (for the sample input file):

    % for level in $(seq 9); do echo -n $level:; lzma --keep -$level printf.3 --stdout | wc -c; done
    1:8843
    2:8791
    3:8418
    4:8418
    5:8418
    6:8402
    7:8402
    8:8402
    9:8402

But note that it cannot compress as well as the stable lzma-utils (at level 4). Of course I also tried xz. It is also monotonic, but compresses even worse:

    % for level in $(seq 9); do echo -n $level:; xz --keep -$level printf.3 --stdout | wc -c; done
    1:8888
    2:8836
    3:8464
    4:8464
    5:8464
    6:8448
    7:8448
    8:8448
    9:8448

---

(In reply to comment #6)

> % for level in $(seq 9); do echo -n $level:; lzma --keep -$level
> printf.3 --stdout | wc -c; done

Yes yes, but it was already mentioned that a tiny little file will not compress better with a larger dictionary. That is inherent in the design of all dictionary-based compression algorithms, and there is nothing we can do about it. If you were to compress a tar file a couple of hundred megabytes in size, you would probably notice that a larger dictionary does improve the compression ratio.

But please stop adding to this bug now - there's nothing Gentoo can do about it.
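
---

For reference, the wrapper debated in comments #3 and #4 (try every level up to the requested one, keep only the smallest output) can be sketched in a few lines of shell. This is only an illustration, not part of any shipped tool: the script name, the argument handling, and the temporary file names are made up, and it calls modern xz rather than the old lzma binary, since the thread itself points to xz-utils as the successor.

    #!/bin/sh
    # Sketch: compress FILE at every preset from 1 up to MAXLEVEL and
    # keep whichever result is smallest.
    # Usage: ./trylevels.sh MAXLEVEL FILE
    maxlevel=${1:-9}
    input=$2
    best_size=
    for level in $(seq 1 "$maxlevel"); do
        xz --keep --stdout -"$level" "$input" > "$input.try$level.xz"
        size=$(wc -c < "$input.try$level.xz")
        # Keep the smallest result seen so far, discard the rest.
        if [ -z "$best_size" ] || [ "$size" -lt "$best_size" ]; then
            best_size=$size
            mv "$input.try$level.xz" "$input.xz"
        else
            rm "$input.try$level.xz"
        fi
    done
    echo "kept $input.xz ($best_size bytes)"

The cost profile is exactly what the thread argues about: the input is compressed once per level, so the total time is the sum over all levels rather than the time of the highest level alone, while peak memory stays at the highest level's requirement because the levels run one after the other.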
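The dual-core variant suggested in comment #4 (highest level in one job, the lower levels sequentially in another) could look roughly like this; again the file names are invented for illustration, and the final line assumes GNU ls:

    # Sketch of the two-job idea: level 9 in one background job, levels
    # 1-8 sequentially in another, then keep the smallest result.
    xz --keep --stdout -9 printf.3 > printf.3.l9.xz &
    ( for level in $(seq 1 8); do
          xz --keep --stdout -"$level" printf.3 > "printf.3.l$level.xz"
      done ) &
    wait
    # GNU ls -S sorts largest first, so the last entry is the smallest.
    ls -S printf.3.l*.xz | tail -n 1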
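The dictionary-size experiment proposed in comment #4 can also be tried by hand with current xz, which exposes the dictionary size through its --lzma2 option. The sizes below are arbitrary picks for illustration (a real heuristic would narrow in on a good size instead of using a fixed list), and printf.3 is the attached sample file:

    # Sweep a few dictionary sizes at a fixed preset and report the
    # compressed size for each.
    for dict in 64KiB 256KiB 1MiB 4MiB 16MiB 64MiB; do
        printf '%s: ' "$dict"
        xz --keep --stdout --lzma2=preset=9,dict="$dict" printf.3 | wc -c
    done

For a file as small as a manpage the numbers should plateau quickly, which is the point the closing comments make: beyond the input size, a larger dictionary cannot help.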