Conversion from IBM-1047 to any ASCII-based codepage fails: the EOL is converted incorrectly.

Reproducible: Always

Steps to Reproduce:
1. iconv -f IBM-1047 -t UTF8 ibm-1047.example > ibm-1047.to.utf8.converted
or
2. iconv -f UTF8 -t IBM-1047 utf8.example > utf8.to.ibm-1047.converted
etc.

Actual Results:
What is converted:

abcdefghijklmnopqrstuvwxyz01234567890!@#$%^&*()_+-=<EOL>
ABCDEFGHIJKLMNOPQRSTUVWXYZ<EOL>

<EOL> is \x0a or \x0d\x0a in ASCII and \x15 in IBM-1047.

Expected Results:
ibm-1047.example is:

$ hexdump -C ibm-1047.example
00000000 81 82 83 84 85 86 87 88 89 91 92 93 94 95 96 97 |................|
00000010 98 99 a2 a3 a4 a5 a6 a7 a8 a9 f0 f1 f2 f3 f4 f5 |................|
00000020 f6 f7 f8 f9 f0 5a 7c 7b 5b 6c 5f 50 5c 4d 5d 6d |.....Z|{[l_P\M]m|
00000030 4e 60 7e[15]c1 c2 c3 c4 c5 c6 c7 c8 c9 d1 d2 d3 |N`~.............|
00000040 d4 d5 d6 d7 d8 d9 e2 e3 e4 e5 e6 e7 e8 e9[15] |...............|

[15] is the EOL.

After iconv -f IBM-1047 -t UTF8 ibm-1047.example > ibm-1047.to.utf8.converted:

00000000 61 62 63 64 65 66 67 68 69 6a 6b 6c 6d 6e 6f 70 |abcdefghijklmnop|
00000010 71 72 73 74 75 76 77 78 79 7a 30 31 32 33 34 35 |qrstuvwxyz012345|
00000020 36 37 38 39 30 21 40 23 24 25 5e 26 2a 28 29 5f |67890!@#$%^&*()_|
00000030 2b 2d 3d[c2 85]41 42 43 44 45 46 47 48 49 4a 4b |+-=..ABCDEFGHIJK|
00000040 4c 4d 4e 4f 50 51 52 53 54 55 56 57 58 59 5a[c2 |LMNOPQRSTUVWXYZ.|
00000050 85] |.|

So the EOL [15] is converted to the [c2 85] sequence instead of [0a].

And vice versa:

$ hexdump -C utf8.example
00000000 61 62 63 64 65 66 67 68 69 6a 6b 6c 6d 6e 6f 70 |abcdefghijklmnop|
00000010 71 72 73 74 75 76 77 78 79 7a 30 31 32 33 34 35 |qrstuvwxyz012345|
00000020 36 37 38 39 30 21 40 23 24 25 5e 26 2a 28 29 5f |67890!@#$%^&*()_|
00000030 2b 2d 3d[0a]41 42 43 44 45 46 47 48 49 4a 4b 4c |+-=.ABCDEFGHIJKL|
00000040 4d 4e 4f 50 51 52 53 54 55 56 57 58 59 5a[0a] |MNOPQRSTUVWXYZ.|

[0a] is the EOL.

After iconv -f UTF8 -t IBM-1047 utf8.example > utf8.to.ibm-1047.converted:

$ hexdump -C utf8.to.ibm-1047.converted
00000000 81 82 83 84 85 86 87 88 89 91 92 93 94 95 96 97 |................|
00000010 98 99 a2 a3 a4 a5 a6 a7 a8 a9 f0 f1 f2 f3 f4 f5 |................|
00000020 f6 f7 f8 f9 f0 5a 7c 7b 5b 6c 5f 50 5c 4d 5d 6d |.....Z|{[l_P\M]m|
00000030 4e 60 7e[25]c1 c2 c3 c4 c5 c6 c7 c8 c9 d1 d2 d3 |N`~%............|
00000040 d4 d5 d6 d7 d8 d9 e2 e3 e4 e5 e6 e7 e8 e9[25] |..............%|

So the EOL [0a] is converted to [25] instead of [15].

If we take an example with \x0d\x0a line endings, only the \x0a is converted to \x25; the \x0d stays in the converted file as it was in the original.

iconv from another *nix system's libc seems to work properly: it converts \x15 to \x0a for iconv -f IBM-1047 -t UTF8, and \x0a to \x15 for iconv -f UTF8 -t IBM-1047.

If we look at ibm1047.c (.../glibc-2.5/iconvdata/ibm1047.c):

...
#include <stdint.h>

/* Get the conversion table. */
#include <ibm1047.h>

#define CHARSET_NAME "IBM1047//"
#define HAS_HOLES 1 /* Not all 256 character are defined. */

#include <8bit-generic.c>

But the file ibm1047.h doesn't exist in glibc-2.5/iconvdata/, and I couldn't find it in any other version of glibc that I checked.
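Until the conversion itself is sorted out, the byte mismatches described above can be patched up after running iconv. The following is only a rough sketch in Python, not a fix for glibc: the function names are made up for illustration, and the byte values are taken straight from the hexdumps in this report. It assumes that every [c2 85] in the UTF-8 output and every [25] in the IBM-1047 output was meant to be a line ending, which may not hold for arbitrary files. (As a side note, [c2 85] is the UTF-8 encoding of U+0085.)

```python
# Post-processing sketch for the EOL bytes reported above.
# Function names are hypothetical; byte values come from the hexdumps.

NEL_UTF8 = b"\xc2\x85"  # what iconv produced for IBM-1047 0x15
LF_ASCII = b"\x0a"      # what the reporter expected instead
EBCDIC_15 = b"\x15"     # EOL in the IBM-1047 example file
EBCDIC_25 = b"\x25"     # what iconv produced for ASCII 0x0a


def fix_utf8_output(data: bytes) -> bytes:
    """Replace each c2 85 sequence in iconv's UTF-8 output with 0a."""
    return data.replace(NEL_UTF8, LF_ASCII)


def fix_ibm1047_output(data: bytes) -> bytes:
    """Replace each 0x25 in iconv's IBM-1047 output with 0x15.

    Assumption: every 0x25 in the file stands for a line ending.
    A 0x0d left over from \\x0d\\x0a input is not touched here.
    """
    return data.replace(EBCDIC_25, EBCDIC_15)


if __name__ == "__main__":
    # Tail of the first converted line from the report: "+-=" then EOL.
    print(fix_utf8_output(b"+-=\xc2\x85").hex(" "))
    print(fix_ibm1047_output(bytes.fromhex("4e607e25")).hex(" "))
```

For example, the output of step 1 could be read in binary mode, passed through fix_utf8_output, and written back out; the same goes for step 2 with fix_ibm1047_output.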
That's probably because the tables are generated on the fly while building. Actually, please post the files as attachments instead of printing their hexdumps.
Created attachment 131881: text file in IBM-1047 that is used in the description
Can you open a bug here, please: http://sources.redhat.com/bugzilla/
You know a lot more about the issue than I do ;)
Is this still an issue? Was it reported upstream? :)