I've had a few RTF documents to convert to text, and I noticed that unrtf outputs an exclamation mark instead of accented characters. Here's a patch that makes it produce valid UTF-8 text for any ANSI RTF input file. Please test :-)
Created attachment 61385 [details, diff] Patch to output ANSI RTF characters correctly
Created attachment 61386 [details, diff] Patch for the ebuild
Robin, do you want to take this bug? Joël, did you send the patch to the upstream developers?
No, not yet. Should I send it? (I suppose unrtf was written before a common encoding, UTF-8, was created. Now that many people use UTF-8, I guess it would be nice to put the extended characters to good use.)
Let's wait for robbat2's comment. He's travelling for the next 2 weeks.
please send this to upstream. if they are unresponsive, then i'll just patch our ebuild, but i'd prefer it if they took it first.
Robin,
Thanks for your response! I'm trying to do it. A few remarks though:
- I've just found a newer version: http://ftp.gnu.org/gnu/unrtf/0.19.7/
- unrtf@gnu.ai.mit.edu does not work
- there is a patch (text_french.patch) in the 0.19.7 package, which is similar to mine, but only handles a few accents. I'll try to contact its author.
I'll let you know when I get something!
Any news on this? I'm just trying the 3rd party kat ebuilds and they contain an ebuild with this patch. Would be cool if I needed one ebuild less in my overlay :)
I just saw that there's a new version 0.19.9 from last week. From the changelog:
| 0.19.4: added unicode support
| 0.19.5: removed defective PS support and non-free text files
| more unicode support
| improved symbol font support - no longer puts entities in latex output
| Bug#266020 concerning double slashes fixed
| Bug#269054 concerning Doctype fixed
| Bug#287038 security breach fixed
| (thanks to Joey Hess <joeyh@debian.org>)
| 0.19.6: fix some latex problems
| 0.19.7: updated FSF address
| 0.19.8: minor fixes
| 0.19.9: included verbose mode
So it might be fixed in that version...
Hi,
Actually (before I made the patch) the authors did put an _unused_ "text_french.patch" file in unrtf 0.19.7 -- but their patch is incomplete (see comment #7). I sent an email containing the information, as well as a link to this bugzilla page, to the upstream developers on 3rd July 2005:
TO: tuorfa@yahoo.com, csurchi@debian.org
CC: victor.stinner@haypocalc.com
I have had no response so far. I haven't looked at (or tried) unrtf 0.19.9 -- could you have a quick look at the test.c file, to see what characters they added in the tables?
Best Regards
unrtf has a project page at savannah, here [1]. There's both a bug and a patch tracker, maybe you'll have more luck there.
[1] http://savannah.gnu.org/projects/unrtf/
It seems like they added a few but not all characters, and in a different way from your solution:
mss@otherland ~/tmp $ diff -u unrtf-0.19.3/text.c unrtf_0.19.9/text.c
--- unrtf-0.19.3/text.c 2004-02-19 00:35:04.000000000 +0100
+++ unrtf_0.19.9/text.c 2006-01-06 22:56:06.000000000 +0100
@@ -1,7 +1,6 @@
-
 /*=============================================================================
 GNU UnRTF, a command-line program to convert RTF documents to other formats.
- Copyright (C) 2000,2001 Zachary Thayer Smith
+ Copyright (C) 2000,2001,2004 by Zachary Smith
 This program is free software; you can redistribute it and/or modify
 it under the terms of the GNU General Public License as published by
@@ -15,20 +14,25 @@
 You should have received a copy of the GNU General Public License
 along with this program; if not, write to the Free Software
- Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
- The author is reachable by electronic mail at tuorfa@yahoo.com.
+ The maintainer is reachable by electronic mail at daved@physiol.usyd.edu.au
 =============================================================================*/
 /*----------------------------------------------------------------------
 * Module name: text
- * Author name: Zach Smith
+ * Author name: Zachary Smith
 * Create date: 19 Sep 01
 * Purpose: Plain text output module
 *----------------------------------------------------------------------
 * Changes:
 * 22 Sep 01, tuorfa@yahoo.com: added function-level comment blocks
+ * 29 Mar 05, daved@physiol.usyd.edu.au: changes requested by ZT Smith
+ * 14 Jun 05, daved@physiol.usyd.edu.au: higher Iso-Latin-1 characters
+ * added - thanks to ronross@colba.net and
+ * victor.stinner@haypocalc.com
+ * 23 Jul 05, daved@physiol.usyd.edu.au: added endash, emdash and bullet
 *--------------------------------------------------------------------*/
@@ -59,22 +63,24 @@
 static char* upper_translation_table [128] = {
- "?", "?", "?", "?", "?", "?", "?", "?",
- "?", "?", "?", "?", "?", "?", "?", "?",
- "?", "?", "?", "?", "?", "?", "?", "?",
- "?", "?", "?", "?", "?", "?", "?", "?",
- "?", "?", "?", "?", "?", "?", "?", "?",
- "?", "?", "?", "?", "?", "?", "?", "?",
- "?", "?", "?", "?", "?", "?", "?", "?",
- "?", "?", "?", "?", "?", "?", "?", "?",
- "?", "?", "?", "?", "?", "?", "?", "?",
- "?", "?", "?", "?", "?", "?", "?", "?",
- "?", "?", "?", "?", "?", "?", "?", "?",
- "?", "?", "?", "?", "?", "?", "?", "?",
- "?", "?", "?", "?", "?", "?", "?", "?",
- "?", "?", "?", "?", "?", "?", "?", "?",
- "?", "?", "?", "?", "?", "?", "?", "?",
- "?", "?", "?", "?", "?", "?", "?", "?",
+/* 0 1 2 3 4 5 6 7 */
+/* 80 */ "?", "?", "?", "?", "?", "?", "?", "?",
+/* 88 */ "?", "?", "?", "?", "?", "?", "?", "?",
+/* 90 */ "?", "?", "?", "?", "?", "?", "?", "?",
+/* 98 */ "?", "?", "?", "?", "?", "?", "?", "?",
+/* A0 */ " ", "¡", "¢", "£", "¤", "¥", "¦", "§",
+/* A8 */ "¨", "©", "ª", "«", "¬", "", "®", "¯",
+/* B0 */ "°", "±", "²", "³", "´", "µ", "¶", "·",
+/* B8 */ "¸", "¹", "º", "»", "¼", "½", "¾", "¿",
+/* C0 */ "À", "Á", "Â", "Ã", "Ä", "Å", "Æ", "Ç",
+/* C8 */ "È", "É", "Ê", "Ë", "Ì", "Í", "Î", "Ï",
+/* D0 */ "Ð", "Ñ", "Ò", "Ó", "Ô", "Õ", "Ö", "×",
+/* D8 */ "Ø", "Ù", "Ú", "Û", "Ü", "Ý", "Þ", "ß",
+/* E0 */ "à", "á", "â", "ã", "ä", "å", "æ", "ç",
+/* E8 */ "è", "é", "ê", "ë", "ì", "í", "î", "ï",
+/* F0 */ "ð", "ñ", "ò", "ó", "ô", "õ", "ö", "÷",
+/* F8 */ "ø", "ù", "ú", "û", "ü", "ý", "þ", "ÿ",
+/* 8 9 A B C D E F */
 };
@@ -255,6 +261,11 @@
 text_op->chars.left_quote = "`";
 text_op->chars.right_dbl_quote = "''";
 text_op->chars.left_dbl_quote = "``";
+#if 1 /* daved - 0.19.8 */
+ text_op->chars.endash = ""; /* not ASCII */
+ text_op->chars.emdash = "-";
+ text_op->chars.bullet = "·"; /* not ASCII */
+#endif
 return text_op;
 }
Ah, this new patch looks good :-) It handles everything except the values 0x80..0x9F. That may be because that range is forbidden/reserved and cannot be found in ANSI RTF anyway (I have no idea what the deal is with these 0x80..0x9F values). My only concern: filling the array in a C file with literal characters (instead of hex values) could be a bit dangerous, depending on the compiler's character set support (?)
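To illustrate the hex-value idea, here is a minimal sketch (not taken from either patch) that computes the UTF-8 bytes for an ISO-8859-1 byte arithmetically, so the source file stays pure ASCII. The function name latin1_byte_to_utf8 is hypothetical. It assumes the input really is ISO-8859-1; under that assumption 0x80..0x9F map to the C1 control characters U+0080..U+009F, whereas Windows-1252 assigns printable characters to most of that range, which this sketch does not handle.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of unrtf: encode one ISO-8859-1 byte
 * as UTF-8. Writes 1 or 2 bytes plus a terminating NUL into out
 * (which must have room for 3 chars) and returns the byte count. */
size_t latin1_byte_to_utf8(unsigned char c, char out[3])
{
    assert(out != NULL);
    if (c < 0x80) {
        /* ASCII is identical in ISO-8859-1 and UTF-8. */
        out[0] = (char)c;
        out[1] = '\0';
        return 1;
    }
    /* U+0080..U+00FF encode as 110000xx 10xxxxxx. */
    out[0] = (char)(0xC0 | (c >> 6));
    out[1] = (char)(0x80 | (c & 0x3F));
    out[2] = '\0';
    return 2;
}
```

A table like upper_translation_table could equally be written with escapes such as "\xC3\xA9" for 0xE9; either way the compiler never sees a non-ASCII byte in the source.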
I've just committed 0.19.9 to the tree. Is the patch from this bug still needed?
I've just tried the 0.19.9 version. Indeed, the patch I posted is not needed anymore, *but* please note that unrtf will always output ISO-8859-1 text, regardless of the user's $LANG setting. Not very good for pure UTF-8 users IMHO.
Ideal workaround: unrtf should iconv() the whole text at runtime, so the output obeys the user's preferred encoding.
In the meantime, I suggest adding this as the first line in src_compile():
src_compile() {
    iconv -f ISO-8859-15 text.c >text.c.new && mv text.c.new text.c
This would detect the user's encoding at emerge time, which is better than ignoring it completely. With this line added, unrtf outputs proper UTF-8 text for me. Since iconv is called without a '-t' (target encoding) argument, it *should* convert to the user's preferred encoding. It works for UTF-8 -- can someone please test with an ISO-8859 $LANG/$LC_ALL? I have userlocales and only UTF-8 locales built.
Thanks
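For reference, a rough sketch of what the runtime approach could look like, using the POSIX iconv(3) and nl_langinfo(3) interfaces. This is not unrtf code: the helper names latin1_to_charset and user_charset are hypothetical, and it assumes the text produced by the output module is ISO-8859-1. On non-glibc systems it may need linking with -liconv.

```c
#include <assert.h>
#include <iconv.h>
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: convert NUL-terminated ISO-8859-1 text to
 * to_charset. Returns the number of bytes written (excluding the
 * terminating NUL), or -1 on error or if out is too small. */
long latin1_to_charset(const char *in, const char *to_charset,
                       char *out, size_t outsize)
{
    assert(in != NULL && to_charset != NULL && out != NULL);
    if (outsize == 0)
        return -1;

    iconv_t cd = iconv_open(to_charset, "ISO-8859-1");
    if (cd == (iconv_t)-1)
        return -1;

    char *inp = (char *)in;        /* iconv() takes a non-const pointer */
    size_t inleft = strlen(in);
    char *outp = out;
    size_t outleft = outsize - 1;  /* keep room for the NUL */

    size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    if (rc != (size_t)-1)
        /* Flush any pending shift state (a no-op for UTF-8). */
        rc = iconv(cd, NULL, NULL, &outp, &outleft);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return -1;

    *outp = '\0';
    return (long)(outp - out);
}

/* Hypothetical helper: charset name of the current user's locale,
 * taken from the environment at run time rather than at build time. */
const char *user_charset(void)
{
    setlocale(LC_CTYPE, "");
    return nl_langinfo(CODESET);
}
```

Calling latin1_to_charset(text, user_charset(), buf, sizeof buf) would pick the encoding when the program runs, so different users of the same binary could get different output encodings.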
I don't agree with using iconv like that. My root user runs in a different $LANG than my regular user. unrtf really must be made encoding-aware. I'm going to close this for now, and I'd ask you to take it upstream again. If you diff the old release with the new one, you'll see there is a new maintainer, and hopefully he will be more responsive.
He's from Australia, right ? Ok, e-mail is sent (including of course, a link to this page) :-) When something happens I'll report it here.