The ZIP format does not, by default, define a standard encoding for the names of archived files. As a result, files compressed under one locale and extracted on a machine that uses a different encoding end up with broken names. A patch that recodes file names during extraction has been available for a very long time (http://sisyphus.ru/cgi-bin/srpm.pl/Sisyphus/unzip/getpatch/3). Unfortunately it handles Cyrillic encodings only, although adding support for any other encoding looks trivial. A slightly improved version of the patch is available in the crg overlay (http://gentoo-overlays.zugaina.org/crg/portage/app-arch/unzip/files//unzip-5.50-iconv-v1.2-utf8.patch), and there is of course no reason to use the outdated version.

Reproducible: Always
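To make the problem concrete, here is a minimal Python sketch. It is not taken from any of the patches mentioned in this report, and the sample file name is made up for illustration. It shows that one name maps to different bytes under different legacy encodings, and that bytes written under one encoding still decode under another, just to the wrong letters.

```python
# Illustration only: the same file name has a different byte sequence under
# each legacy encoding, and a ZIP entry written without the UTF-8 flag does
# not record which encoding was used.
name = "Привет.txt"  # hypothetical sample name

for codec in ("cp866", "cp1251", "koi8_r", "utf-8"):
    print(f"{codec:7} {name.encode(codec).hex(' ')}")

# Bytes written under cp1251 but read as koi8-r decode without any error;
# the result is simply a different, wrong string.
stored = name.encode("cp1251")
misread = stored.decode("koi8_r")
print(misread)
```

Because the misreading raises no error, the extracting tool cannot detect it on its own; it has to be told (or guess) the original encoding.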
Created attachment 195622 [details, diff] Proposed patch for the ebuild
The patch you reference here is not used by ALT Linux, and AFAIK it has some issues that were discussed in the ALT Linux bugzilla. The correct way is to use the NATSPEC library[*], but packaging it properly requires quite some work on its configure.in script: it has many automagic dependencies, so there is no way to disable building the documentation or the Python bindings, which are really redundant for such a basic package. I started working on it but had no time to finish.

* http://freesource.info/wiki/Lokalizacija/BibliotekaNATSPEC

I've dropped a very rough draft ebuild into my overlay (dev-libs/natspec).

Another problem: unzip-6.0 with some UTF-8 support is out, so both this patch and the natspec patch need to be reworked (anyone?).
unzip-6.0 should have unicode support now
Yes, unzip has Unicode support, but it is still unable to decode the names of files packed on Windows (and that is the problem here, since I have to open such files). OTOH, ALT Linux has updated its patch for unzip-6.0, so the only blocker is to fix the automagic dependencies in natspec and improve its build system.
let me phrase it this way ... i have no files that cause a problem for unzip, nor do i have an interest in fixing this, nor do i really understand the issues you reference. so if you have a fix for unzip-6.0, feel free to update the ebuild in the tree.
(In reply to comment #5)
> let me phrase it this way ... i have no files that cause a problem for unzip,

See http://www.fipi.ru/binaries/724/bio%20WinRAR.zip as a sample.
Created attachment 208700 [details]
dev-libs/natspec-0.2.5.ebuild

Ebuild for the dev-libs/natspec-0.2.5 library (required for the ALT Linux patch). This version has a better configure script: the Python bindings are now optional. Tested on ~amd64.
Created attachment 208701 [details, diff]
unzip-6.0-alt-natspec.patch

Patch for unzip-6.0 by ALT Linux (available at http://sisyphus.ru/ru/srpm/Sisyphus/unzip/patches/0) that enables manually setting legacy filename encodings via the -O switch.
Created attachment 208702 [details]
new unzip-6.0-r1.ebuild

unzip-6.0 ebuild that uses the above patch. I have successfully used the patched unzip to extract zip files whose member filenames are in the cp866 encoding (as a test case, have a look at all the .zip files at the bottom of http://www.lawinstitut.ru/archnum.aspx?lang=ru).

Can we please get this in portage?
Created attachment 208704 [details]
dev-libs/natspec-0.2.5.ebuild

Better ebuild for natspec (popt and tcl are only used during the build process, not in the installed libraries, so move them to DEPEND).
It works well for me. I have tested it with the cp852 encoding on an x86 machine. I would like to see this feature in the Portage tree.
libnatspec links against popt, so it needs it in RDEPEND

i've added that package to the tree, but even with the proposed patch, the sample zip in comment #6 still doesn't work for me

Archive: bio WinRAR.zip
 creating: ????/
 inflating: ????/????_??????????_2009.pdf
 inflating: ????/????_????????_2009.pdf
 inflating: ????/????_????????????_2009.pdf
(In reply to comment #12)
> sample zip in comment #6 still doesnt work for me
>
> Archive: bio WinRAR.zip
> creating: ????/
> inflating: ????/????_??????????_2009.pdf
> inflating: ????/????_????????_2009.pdf
> inflating: ????/????_????????????_2009.pdf

I know, the list of files is wrong. But the files have the right names once they are unpacked. Try this:

$ zipnote file.zip | iconv -f cp852 -t utf8

Change cp852 to whichever encoding you need.
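The same re-decoding idea can be sketched in Python for readers without zipnote at hand. This is a rough illustration under stated assumptions, not what the ALT Linux patch does: `legacy_names` and `make_sample_archive` are hypothetical helper names, the sample entry name is chosen for the demo, and cp866 is assumed as the archive's real encoding. It relies on the fact that Python's zipfile module decodes names stored without the UTF-8 flag as cp437 by default.

```python
import io
import zipfile


def legacy_names(zf, encoding):
    """Return member names, re-decoding those stored without the UTF-8 flag."""
    names = []
    for info in zf.infolist():
        if info.flag_bits & 0x800:
            # Bit 11 set: the name is stored as UTF-8, nothing to fix.
            names.append(info.filename)
        else:
            # zipfile decoded the raw bytes as cp437; undo that, then decode
            # with the encoding the archive was really written in.
            names.append(info.filename.encode("cp437").decode(encoding))
    return names


def make_sample_archive():
    """Build a tiny in-memory archive whose entry name is raw cp866 bytes."""
    raw = "БИ_Демо_2009.pdf".encode("cp866")
    placeholder = b"X" * len(raw)
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
        zf.writestr(placeholder.decode("ascii"), b"data")
    # zipfile writes non-ASCII names as UTF-8, so swap the ASCII placeholder
    # for cp866 bytes of the same length (a fixture trick for this demo only).
    return buf.getvalue().replace(placeholder, raw)


with zipfile.ZipFile(io.BytesIO(make_sample_archive())) as zf:
    print(zf.namelist())              # names as misread by default
    print(legacy_names(zf, "cp866"))  # names re-decoded as cp866
```

As with the iconv pipeline, the encoding has to be supplied by the user; nothing in the archive says which one was used.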
the file output is incorrect too:
-- üê
|-- üê_æ»Ñµ¿Σ_2009.pdf
|-- üê_äѼ«_2009.pdf
`-- üê_è«ñ¿Σ_2009.pdf

but even if the files were correct, the output should have been correct. i'm not inclined to add a patch that only fixes 20% of the problem.

i'm using a unicode locale here ...
(In reply to comment #14)
> the file output is incorrect too:
> -- üê
> |-- üê_æ»Ñµ¿Σ_2009.pdf
> |-- üê_äѼ«_2009.pdf
> `-- üê_è«ñ¿Σ_2009.pdf

That is because the wrong encoding was applied here. CP866 is the Cyrillic/Russian encoding used in DOS.

$ zipnote file.zip | iconv -f cp866 -t utf8

produces the correct file names (БИ/, БИ/БИ_Кодиф_2009.pdf, БИ/БИ_Демо_2009.pdf, БИ/БИ_Специф_2009.pdf respectively; I hope Bugzilla and your browser display them correctly). I'm using en_US.UTF-8.
(In reply to comment #14)
> but even if the files were correct, the output should have been correct. i'm
> not inclined to add a patch that only fixes 20% of the problem.

This solves the main problem: the files have correct names after unpacking. ALT uses other patches (Ark on KDE-4) for GUI output.
(In reply to comment #14)
> the file output is incorrect too:
> -- üê
> |-- üê_æ»Ñµ¿Σ_2009.pdf
> |-- üê_äѼ«_2009.pdf
> `-- üê_è«ñ¿Σ_2009.pdf
> but even if the files were correct, the output should have been correct. i'm
> not inclined to add a patch that only fixes 20% of the problem.
>
> i'm using a unicode locale here ...
>

Improved patch: http://www.opennet.ru/soft/zip_rus/unzip60-natspec-mod.diff.gz (detailed article in Russian: http://www.opennet.ru/tips/info/2494.shtml). It gives clean output and correct filenames in KDE4 Ark.
Created attachment 257604 [details, diff]
unzip-6.0-alt-natspec.patch

Improved patch (source: http://www.opennet.ru/soft/zip_rus/unzip60-natspec-mod.diff.gz).
Patches for zip and unzip were applied in the tree.

Mike, zipnote was not covered by the patch from ALT Linux. The goal of that patch is to make zip encode non-ASCII filenames in such a way that they can be read in Windows. That said, I've added natspec support for zipnote to our patchset too :)