Gentoo Websites Logo
Go to: Gentoo Home Documentation Forums Lists Bugs Planet Store Wiki Get Gentoo!
Bug 309683 - app-arch/lbzip2 patch to act as bzip2 replacement
Summary: app-arch/lbzip2 patch to act as bzip2 replacement
Status: RESOLVED FIXED
Alias: None
Product: Gentoo Linux
Classification: Unclassified
Component: Current packages (show other bugs)
Hardware: All All
: High normal (vote)
Assignee: Justin Lecher (RETIRED)
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-03-16 01:57 UTC by Jan Psota
Modified: 2012-09-27 16:18 UTC (History)
2 users (show)

See Also:
Package list:
Runtime testing required: ---


Attachments
ebuild based on that by Justin (sorry for small caps before ;-) (lbzip2-0.23-r1.ebuild,1.60 KB, text/plain)
2010-03-16 01:59 UTC, Jan Psota
Details
patch allowing to decompress symlinked archives without '--force' (s_isreg.patch,350 bytes, patch)
2010-03-16 02:00 UTC, Jan Psota
Details | Diff
rewrite of emerge-delta-webrsync (emerge-delta-webrsync,3.48 KB, text/plain)
2010-03-23 22:25 UTC, Jan Psota
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Jan Psota 2010-03-16 01:57:14 UTC
lbzip2 is faster because takes advantage of smp, but refuses to decompress symlinks to archives even with '-dc'. I patched it to skip this check on decompression and added some lines to original justin's ebuild to replace original bzip2 (with USE=symlink) saving it in .orig for security.

lbzip2 programmer Laszlo Ersek proposed to use --force with -dc, but I believe it's not the right way.
Comment 1 Jan Psota 2010-03-16 01:59:34 UTC
Created attachment 223837 [details]
ebuild based on that by Justin (sorry for small caps before ;-)
Comment 2 Jan Psota 2010-03-16 02:00:55 UTC
Created attachment 223839 [details, diff]
patch allowing to decompress symlinked archives without '--force'
Comment 3 Jan Psota 2010-03-17 09:47:20 UTC
Watch out! emerge-delta-webrsync works, but does not compress patched portage archive to $DISTDIR.
Comment 4 Laszlo Ersek 2010-03-17 21:33:35 UTC
(In reply to comment #3)
> Watch out! emerge-delta-webrsync works, but does not compress patched portage
> archive to $DISTDIR.

I'm looking at emerge-delta-webrsync-3.5.1.

First of all, "emerge-delta-webrsync-3.5.1" has a bug (perhaps file a bug report for it if you care):

   459		echo "recompressing."
   460		bzip2 -v9 "${TEMPDIR}/portage-${final_date}.tar.bz2"

The ".bz2" suffix on line 460 is an error -- this won't work with the original bzip2 either.

Second, "emerge-delta-webrsync-3.5.1" will not move the compressed archive to ${DISTDIR} in

   486	else
   487		dfile="${DISTDIR}/portage-${final_date}.tar.bz2"
   488		mv "${TEMPDIR}/portage-${final_date}.tar.bz2" ${DISTDIR}/
   489	fi

because this condition will hold:

   463	echo "verifying generated tarball"
   464	
   465	if ! verify_md5_file "${TEMPDIR}/portage-${final_date}.tar.bz2"; then

That is, the MD5 of the newly compressed tarball won't match the downloaded MD5. This is because the compressed output of lbzip2 is NOT identical to that of standard bzip2 (even though it is a completely valid bz2 file).

This is the first thing listed in the BUGS section of the manual page, and it is explained in even more detail in the README. ("In order to spare the compressor's muxer a lot of cycles"...)

I will not fix this bug in lbzip2, because
(1) it is not a functionality bug but a compatibility bug,
(2) it doesn't bother me,
(3) fixing it would be a lot of work,
(4) fixing it naively would seriously hamper performance (it would instantly create a serialization bottleneck in the muxer),
(5) fixing it very cleverly (if at all possible) would hamper performance only mildly, but at the price of introducing platform-dependent, non-portable code into lbzip2, and messing up the source,
(6) fixing it would even require changes to the splitter -- it would have to mimic (duplicate) the workings of the first-stage RLE (like dbzip2 does) so that the BWT receives identical input from the first-stage RLE.

In short, I won't touch this.

I'm not sure why the compressed file must be bit-identical -- I'd think it's sufficient if the decompressed stream is identical (the umd5sum), but I have no experience with Portage, so who know.
Comment 5 Jan Psota 2010-03-23 22:23:41 UTC
New, written from scratch, emerge-delta-webrsync (without manual for now) follows.
Give it a try please :-)
And I will try to get rid of tarsync.
Comment 6 Jan Psota 2010-03-23 22:25:57 UTC
Created attachment 224959 [details]
rewrite of emerge-delta-webrsync
Comment 7 Justin Lecher (RETIRED) gentoo-dev 2010-03-24 07:24:51 UTC
Thanks for your efforts. I really like to get this thing working with portage. Together with app-arch/pigz we can speed portage a lot. As I am more interested in the usage during emerge and not delta-webrsync I will CC the portage guys, so that they could comment on your delta-webrsync patches.
Comment 8 Zac Medico gentoo-dev 2010-03-24 08:50:28 UTC
Can't we just use something like a eselect module to choose which bzip2 program we want to use? Then maybe we won't have to modify anything (like emerge-delta-webrsync) to use it.
Comment 9 Justin Lecher (RETIRED) gentoo-dev 2010-03-24 10:03:56 UTC
Might be a nice idea. I will look at it and post on the dev ml for that.

@Laszlo
Are you going to include the patch into your releases so that we have a version which supports symlink decompression out-of-the-box?
I lost the discussion a little bit.
Comment 10 Jan Psota 2010-03-24 10:31:39 UTC
(In reply to comment #8)
> Can't we just use something like a eselect module to choose which bzip2 program
> we want to use? Then maybe we won't have to modify anything (like
> emerge-delta-webrsync) to use it.
> 
I think that symlinking is a natural way. Of course eselect module can do exactly this. But I inserted it into -r1 lbzip2 ebuild (with USE=symlink) original bzip2 is moved to bzip2.orig and symlink from bzip2 to lbzip2 is made. Of course it is reversed while unmerging!

@Justin:
He (Laszlo) won't do it ;-)
But he gave us free hand to do it if we needs it.

And about choosing which bzip to use: if sombebody wants to use any option compatible archiver -- he can. For now. Because I think that it shouldn't be a matter of "like" but a matter of which compressor is good for which purposes:
lzip [plzip], lzma, xz, 7z for program distribution,
lzop, gzip [pigz], tamp where speed is most important,
bzip2 [lbzip2, pbzip2] in many other cases.
Comment 11 Justin Lecher (RETIRED) gentoo-dev 2010-03-24 11:05:52 UTC
Setting symlinks is what eselect does in many cases.

Before we can proceed with that, we need to insure that the lbzip2 implementations are 100% compatible with the onces from bzip2.
Comment 12 Laszlo Ersek 2010-03-24 22:00:28 UTC
Hello All,

thank you very much for considering lbzip2 as a tool to be used by portage.

(In reply to comment #11)
> Setting symlinks is what eselect does in many cases.
>
> Before we can proceed with that, we need to insure that the lbzip2
> implementations are 100% compatible with the onces from bzip2.
>

Indiscriminate compatibility with bzip2 is not among the purposes of lbzip2. I'll try to summarize the points on this topic I tried to make in our private correspondence with Justin and Jan  -- that is, why "lbzip2 -dc" won't follow symlinks.

First off, decompression (-d) is not special wrt. what kinds of files lbzip2 or bzip2 opens.

original bzip2
                         -z   -zc  -zf  -zcf  -d  -dc -df  -dcf
regular file             1    1    1    1     1   1   1    1
symlink to regular file       1    1    1         1   1    1
device                        1    1    1         1   1    1
symlink to device             1    1    1         1   1    1

lbzip2
                         -z   -zc  -zf  -zcf  -d  -dc -df  -dcf
regular file             1    1    1    1     1   1   1    1
symlink to regular file       X    1    1         X   1    1
device                        X    1    1         X   1    1
symlink to device             X    1    1         X   1    1

The X characters show situations where bzip2 opens the file passed as a command line operand, while lbzip2 does not.

Standard bzip2 cares about the type of the input file operands because it will *remove them* after (de)compression, and in bzip2's opinion, removing non-regular files is probably not what you mean. If you pass --stdout (-c) to bzip2, either with compression or decompression, its scruple about removing the file vanishes (because --stdout implies the file operand should not be removed), and happily opens any kind of file.

bzip2's approach is inconsistent: consider the --keep (-k) option. It should also make bzip2's scruple about removing the file operand disappear. It does not, however:

$ bzip2 --keep /dev/zero
bzip2: Input file /dev/zero is not a normal file.

Same with a symlink to a regular file.

(There are other problems with bzip2's --keep: "bzip2 --keep" refuses to open a regular file having more than 1 hard links to it. A Debian bug report was submitted about this.)

The behavior of --stdout suggests that bzip2's reason for not opening non-regular file operands is that it will remove them. Option --keep's behavior disagrees. There are two ways to resolve this conflict. The first way is to make --stdout work like --keep (in this regard), that is, don't open special files because *reading* from special files is probably not what you mean. The second way is to make --keep work like --stdout, that is, don't open special files because *removing* special files is probably not what you mean; but if you don't remove them (due to either -c or -k), reading from them is fine.

In the name of consistency, lbzip2 will not mimic this contradicting state of affairs, so it has to choose one of both resolutions. lbzip2 chooses the first way, because in my opinion, *reading* from special files is the thing to avoid per default, because that is what will cause you a nasty surprise in the first place. Reading from a block special or character special, or even trying to open a FIFO is considered unexpected, unless you say otherwise. (You might have some kind of hardware attached to said devices.)

With that out of the way, why does lbzip2 lump symlinks together with other kinds of non-regular files? I'll quote one of my mails to Jan (typos fixed, lower case / upper case changed etc):

----v----

The problem is that the symlink may point to a character special, a block special, or a FIFO. [...] Reliably checking the type of the link destination would require:
- omitting lstat() as the first step,
- instead, opening the file as the first step,
- then checking it through the file descriptor with fstat().

Unfortunately, by the time one would call fstat(), it's too late: we're already past open() then. open() might do unexpected things if those character or block devices correspond to pieces of actual hardware, or some exclusive-access kernel resources.

Even when considering a mere FIFO (named pipe), the open() would hang the lbzip2 process until there was a producer (writer) reference to the same FIFO. Yes, O_NONBLOCK can be used to make such an open() return immediately for FIFO's, but, for example, O_NONBLOCK has unspecified behavior for regular files, according to SUSv2. (SUSv2 is the foundational standard for lbzip2.) I avoid unspecified behavior. Thus you would have to query, again, the filetype *before* opening the file...

In the end, I have to put an lstat() or stat() before the open(). Both have the same race, that is, the pathname in question can be changed to point to a different file between the lstat()/stat() call and the open(). So why did I choose lstat() instead of stat(), if lstat() can still be tricked? Because stat() will follow not only one symlink, but an arbitrarily long chain of symlinks as well, and in my opinion that increases the chances of said race greatly. lstat() won't follow any symlink.

-----^----

I have no formal proof (or anectodal evidence, for that matter) how this "race" would be dangerous to a user's files, in case of either a privileged or an unprivileged user. Still I perceive the lstat() manner to be stricter, cleaner and more elegant.

Accordingly, I suggest to pass --force. The behavior of --force wrt. lbzip2 is precisely documented in the command line help and in the manual page (unlike with bzip2 IIRC):

(1) Open non-regular input files.
(2) Open input files with more than one links.
(3) Try to remove each output file before opening it.

Let's see consequences of --stdout: 'Write output to stdout even with FILE operands. Implies "-k". Incompatible with "-t".' This pulls out the venom tooth of (3) immediately -- there is no output file!

Let's see --keep, implied by --stdout: 'Don't remove FILE operands. Open regular input files with more than one links.' This directly implies (2).

Hence, if -dc is passed *anyway*, the only thing --force enables *in addition* is (1), ie. "open non-regular input files". If you look up the table at the top, you'll see that bzip2 -dc opens absolutely anything. Particularly, "bzip2 -c [-d]" lumps symlinks in with all other kinds of non-regular files.

Therefore,

(i) "lbzip2 -dcf" is *equivalent* to (not more, not less than) "bzip2 -dc"; and
(ii) it is further *equivalent* to (not more, not less than) to "bzip2 -dcf".

This means that you can modify your scripts calling "bzip2 -dc" to "bzip2 -dcf", and *nothing* will change with bzip2. Additionally, this will make an lbzip2 binary wearing a bzip2 robe behave in an identical fashion.

You could also set the environment variable BZIP2 to "-f". It is respected by both bzip2 and lbzip2. What's more, you can set the env. var. LBZIP2 to "-f", which will only affect lbzip2 (even though -f's effect on original bzip2, when it accompanies -dc, is exactly nil).

... Actually, I'm lying a bit with "ii". bzip2 will pipe non-bz2 files transparently in response to --force. Now this is a misfeature I will *never* implement. (The bzip2 manual says this (IMHO dubious) ability was modeled after "gzip --force".) If, in your Gentoo scripts, you can't even be sure that the files passed to "bzip2 -dc" are *actually* compressed, then all hope is lost :) As you don't pass --force to bzip2 *now* (or so it seems to me), I guess this is not the case, however. Anyhow, point "i" and the applicability of the LBZIP2 env. var. still hold even if you take this transparent piping of "bzip2 --force" into consideration.

Of course, if you decompress *single files* in each invocation, you could sidestep the whole issue by simply redirecting the process' stdin instead of making the process open the file operand itself.

$ bzip2 -d <symlink-to-regular-bz2-file | ...

Note that the shell *absolutely* doesn't care where it redirects stdin from, so if you were willing to do this, this is *at least* as forceful as actually passing --force. (Not considering the transparent piping, which I don't. This utter endeavour towards the lazy user's satisfaction will result in a 400MB "swiss_army_knife --do-whatever-I-happen-to-think-of *.png *.ogg *.bz2 *.c *.html *.iso" utility.)

----o----

Nonetheless, please feel free to patch lbzip2 in Gentoo as you wish -- I released lbzip2 under the GPLv2+ so that, among other reasons, you don't even have to ask. If you have questions about the code, I'll gladly try to explain it to the best of my knowledge.

I apologize for over-emphasizing stuff *constantly*. :)
Comment 13 Justin Lecher (RETIRED) gentoo-dev 2010-03-27 09:47:39 UTC
I commited the patch. Let's test it out.
Comment 14 Justin Lecher (RETIRED) gentoo-dev 2010-05-01 07:32:42 UTC
So I tested the patch and it seems to work out very well. Lets see what we can get at the replacement front.
Comment 15 Michał Górny archtester Gentoo Infrastructure gentoo-dev Security 2012-09-26 21:15:47 UTC
(In reply to comment #14)
> So I tested the patch and it seems to work out very well. Lets see what we
> can get at the replacement front.

Is there anything that clearly needs to be done wrt this bug? Maybe you could open a new one to not confuse the patch with actual replacement.
Comment 16 Zac Medico gentoo-dev 2012-09-27 16:18:04 UTC
Please file a new bug for dev-portage@gentoo.org if you need any changes in emerge-delta-webrsync.