96376 – FIX for unrtf to handle accented characters correctly

Status:	RESOLVED FIXED

Alias:	None

Product:	Gentoo Linux
Classification:	Unclassified
Component:	New packages
Hardware:	All All

Importance:	High enhancement
Assignee:	Robin Johnson

URL:
Whiteboard:
Keywords:	Inclusion

Depends on:
Blocks:

Reported:	2005-06-17 06:39 UTC by Joël
Modified:	2006-02-20 08:13 UTC
CC List:	2 users

See Also:
Package list:
Runtime testing required:	---

Attachments
Patch to output ANSI RTF characters correctly (unrtf-0.19.3-full-charset-ansi.patch, 2.21 KB, patch) 2005-06-17 06:40 UTC, Joël
Patch for the ebuild (unrtf-0.19.3-r1.ebuild.patch, 340 bytes, patch) 2005-06-17 06:42 UTC, Joël

Description Joël 2005-06-17 06:39:14 UTC

I had to convert a few RTF documents to text, and I noticed that unrtf outputs an exclamation mark in place of each accented character.

Here's a patch that makes it produce valid UTF-8 text for any ANSI RTF input file. Please test :-)
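For readers without access to the attachment, here is a minimal sketch of the general idea of turning one 8-bit character into UTF-8. It is not the content of the attached patch. It assumes the input byte is ISO-8859-1, and the helper name latin1_to_utf8 is hypothetical. Under this assumption, bytes 0x80..0x9F come out as C1 control characters, which may not be what a Windows-1252 document intends.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from unrtf): encode one ISO-8859-1 byte as UTF-8.
 * Writes 1 or 2 bytes plus a terminating NUL into out, which must hold at
 * least 3 bytes. Returns the number of bytes written, excluding the NUL. */
static size_t latin1_to_utf8(unsigned char c, char out[3])
{
    assert(out != NULL);
    if (c < 0x80) {                      /* plain ASCII: copied unchanged */
        out[0] = (char)c;
        out[1] = '\0';
        return 1;
    }
    /* U+0080..U+00FF need two bytes: 110000xx 10xxxxxx */
    out[0] = (char)(0xC0 | (c >> 6));    /* lead byte carries the top 2 bits */
    out[1] = (char)(0x80 | (c & 0x3F));  /* continuation carries the low 6 bits */
    out[2] = '\0';
    return 2;
}
```

Calling latin1_to_utf8(0xE9, buf) stores the two bytes 0xC3 0xA9, which is the UTF-8 encoding of "é".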

Comment 1 Joël 2005-06-17 06:40:38 UTC

Created attachment 61385 [details, diff]
Patch to output ANSI RTF characters correctly

Comment 2 Joël 2005-06-17 06:42:22 UTC

Created attachment 61386 [details, diff]
Patch for the ebuild

Comment 4 Torsten Veller (RETIRED) gentoo-dev

2005-06-20 13:25:38 UTC

Robin, do you want to take this bug?

Joël, did you send the patch to the upstream developers?

Comment 5 Joël 2005-06-20 14:54:12 UTC

No, not yet. Should I send it?

(I suppose unrtf was written before a common encoding, UTF-8, was created. Now
that many people use UTF-8, I guess it would be nice to put the extended
characters to good use.)

Comment 6 Torsten Veller (RETIRED) gentoo-dev

2005-06-20 15:26:31 UTC

Let's wait for robbat2's comment. He's travelling for the next 2 weeks.

Comment 7 Robin Johnson archtester

2005-07-02 14:05:04 UTC

Please send this upstream.
If they are unresponsive, I'll just patch our ebuild, but I'd prefer that
they take it first.

Comment 8 Joël 2005-07-03 02:31:49 UTC

Robin,

Thanks for your response! I'm working on it.

A few remarks, though:
- I've just found a newer version: http://ftp.gnu.org/gnu/unrtf/0.19.7/
- the address unrtf@gnu.ai.mit.edu does not work
- there is a patch (text_french.patch) in the 0.19.7 package, which is similar
to mine, but only handles a few accents. I'll try to contact its author.

I'll let you know when I get something!

Comment 9 Malte S. Stretz 2006-01-11 02:09:35 UTC

Any news on this?  I'm just trying the 3rd party kat ebuilds and they contain an ebuild with this patch.  Would be cool if I needed one ebuild less in my overlay :)

Comment 10 Malte S. Stretz 2006-01-11 02:14:24 UTC

I just saw that there's a new version 0.19.9 from last week, from the changelog:
| 0.19.4: added unicode support
| 0.19.5: removed defective PS support and non-free text files
|         more unicode support
|         improved symbol font support - no longer puts entities in latex output
|         Bug#266020 concerning double slashes fixed
|         Bug#269054 concerning Doctype fixed
|         Bug#287038 security breach fixed
|                 (thanks to Joey Hess <joeyh@debian.org>)
| 0.19.6: fix some latex problems
| 0.19.7: updated FSF address
| 0.19.8: minor fixes
| 0.19.9: included verbose mode

So it might be fixed in that version...

Comment 11 Joël 2006-01-11 02:33:28 UTC

Hi,

Actually (before I made the patch) the authors did put an _unused_ "text_french.patch" file in unrtf 0.19.7 -- but their patch is incomplete (see comment #7).

I sent an email containing the information, as well as a link to this bugzilla page, to the upstream developers on 3rd July 2005:
TO: tuorfa@yahoo.com, csurchi@debian.org
CC: victor.stinner@haypocalc.com

I have had no response so far.

I haven't looked at (or tried) unrtf 0.19.9 -- could you have a quick look at the text.c file, to see what characters they added in the tables?

Best Regards

Comment 13 Malte S. Stretz 2006-01-11 03:03:28 UTC

unrtf has a project page at savannah, here [1].  There's both a bug and a patch tracker, maybe you'll have more luck there.

[1] http://savannah.gnu.org/projects/unrtf/

It seems like they added a few but not all characters, and in a different way from your solution:
mss@otherland ~/tmp $ diff -u unrtf-0.19.3/text.c unrtf_0.19.9/text.c
--- unrtf-0.19.3/text.c 2004-02-19 00:35:04.000000000 +0100
+++ unrtf_0.19.9/text.c 2006-01-06 22:56:06.000000000 +0100
@@ -1,7 +1,6 @@
-
 /*=============================================================================
    GNU UnRTF, a command-line program to convert RTF documents to other formats.
-   Copyright (C) 2000,2001 Zachary Thayer Smith
+   Copyright (C) 2000,2001,2004 by Zachary Smith

    This program is free software; you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
@@ -15,20 +14,25 @@

    You should have received a copy of the GNU General Public License
    along with this program; if not, write to the Free Software
-   Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+   Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02110-1301  USA

-   The author is reachable by electronic mail at tuorfa@yahoo.com.
+   The maintainer is reachable by electronic mail at daved@physiol.usyd.edu.au
 =============================================================================*/


 /*----------------------------------------------------------------------
  * Module name:    text
- * Author name:    Zach Smith
+ * Author name:    Zachary Smith
  * Create date:    19 Sep 01
  * Purpose:        Plain text output module
  *----------------------------------------------------------------------
  * Changes:
  * 22 Sep 01, tuorfa@yahoo.com: added function-level comment blocks
+ * 29 Mar 05, daved@physiol.usyd.edu.au: changes requested by ZT Smith
+ * 14 Jun 05, daved@physiol.usyd.edu.au: higher Iso-Latin-1 characters
+ *             added - thanks to ronross@colba.net and
+ *             victor.stinner@haypocalc.com
+ * 23 Jul 05, daved@physiol.usyd.edu.au: added endash, emdash and bullet
  *--------------------------------------------------------------------*/


@@ -59,22 +63,24 @@

 static char*
 upper_translation_table [128] = {
-       "?", "?", "?", "?", "?", "?", "?", "?",
-       "?", "?", "?", "?", "?", "?", "?", "?",
-       "?", "?", "?", "?", "?", "?", "?", "?",
-       "?", "?", "?", "?", "?", "?", "?", "?",
-       "?", "?", "?", "?", "?", "?", "?", "?",
-       "?", "?", "?", "?", "?", "?", "?", "?",
-       "?", "?", "?", "?", "?", "?", "?", "?",
-       "?", "?", "?", "?", "?", "?", "?", "?",
-       "?", "?", "?", "?", "?", "?", "?", "?",
-       "?", "?", "?", "?", "?", "?", "?", "?",
-       "?", "?", "?", "?", "?", "?", "?", "?",
-       "?", "?", "?", "?", "?", "?", "?", "?",
-       "?", "?", "?", "?", "?", "?", "?", "?",
-       "?", "?", "?", "?", "?", "?", "?", "?",
-       "?", "?", "?", "?", "?", "?", "?", "?",
-       "?", "?", "?", "?", "?", "?", "?", "?",
+/*        0    1    2    3    4    5    6    7 */
+/* 80 */ "?", "?", "?", "?", "?", "?", "?", "?",
+/* 88 */ "?", "?", "?", "?", "?", "?", "?", "?",
+/* 90 */ "?", "?", "?", "?", "?", "?", "?", "?",
+/* 98 */ "?", "?", "?", "?", "?", "?", "?", "?",
+/* A0 */ " ", "¡", "¢", "£", "¤", "¥", "¦", "§",
+/* A8 */ "¨", "©", "ª", "«", "¬", "", "®", "¯",
+/* B0 */ "°", "±", "²", "³", "´", "µ", "¶", "·",
+/* B8 */ "¸", "¹", "º", "»", "¼", "½", "¾", "¿",
+/* C0 */ "À", "Á", "Â", "Ã", "Ä", "Å", "Æ", "Ç",
+/* C8 */ "È", "É", "Ê", "Ë", "Ì", "Í", "Î", "Ï",
+/* D0 */ "Ð", "Ñ", "Ò", "Ó", "Ô", "Õ", "Ö", "×",
+/* D8 */ "Ø", "Ù", "Ú", "Û", "Ü", "Ý", "Þ", "ß",
+/* E0 */ "à", "á", "â", "ã", "ä", "å", "æ", "ç",
+/* E8 */ "è", "é", "ê", "ë", "ì", "í", "î", "ï",
+/* F0 */ "ð", "ñ", "ò", "ó", "ô", "õ", "ö", "÷",
+/* F8 */ "ø", "ù", "ú", "û", "ü", "ý", "þ", "ÿ",
+/*        8    9    A    B    C    D    E    F */
 };


@@ -255,6 +261,11 @@
        text_op->chars.left_quote = "`";
        text_op->chars.right_dbl_quote = "''";
        text_op->chars.left_dbl_quote = "``";
+#if 1 /* daved - 0.19.8 */
+       text_op->chars.endash = ""; /* not ASCII */
+       text_op->chars.emdash = "-";
+       text_op->chars.bullet = "·"; /* not ASCII */
+#endif

        return text_op;
 }

Comment 14 Joël 2006-01-11 07:10:37 UTC

Ah, this new patch looks good :-)

It handles everything except the values 0x80..0x9F. That may be because that range is forbidden/reserved and cannot be found in ANSI RTF anyway (I have no idea what's the deal with these 0x80..0x9F values).
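A note on that range, as a hedged sketch rather than anything taken from unrtf: in ISO-8859-1 the bytes 0x80..0x9F are C1 control codes, but in Windows-1252 most of them are printable characters. Assuming the RTF file's ANSI code page is Windows-1252 (an assumption, the thread does not say), a partial mapping to Unicode code points could look like this. The function name is hypothetical and only a few entries are listed.

```c
#include <assert.h>

/* Hypothetical helper: Unicode code point for a few Windows-1252 bytes in
 * the range 0x80..0x9F. Returns 0 for bytes this partial sketch does not
 * cover. The caller must pass a byte inside that range. */
static unsigned cp1252_high_to_unicode(unsigned char c)
{
    assert(c >= 0x80 && c <= 0x9F);
    switch (c) {
    case 0x80: return 0x20AC; /* euro sign */
    case 0x85: return 0x2026; /* horizontal ellipsis */
    case 0x91: return 0x2018; /* left single quotation mark */
    case 0x92: return 0x2019; /* right single quotation mark */
    case 0x93: return 0x201C; /* left double quotation mark */
    case 0x94: return 0x201D; /* right double quotation mark */
    case 0x95: return 0x2022; /* bullet */
    case 0x96: return 0x2013; /* en dash */
    case 0x97: return 0x2014; /* em dash */
    default:   return 0;      /* not covered by this sketch */
    }
}
```

The bullet and dash entries correspond to the endash, emdash and bullet items that the 0.19.9 diff handles separately in text_op->chars.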

My only concern: filling the array in a C file with literal characters (instead of hex values) could be a bit dangerous, depending on the compiler's character set support (?)
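To illustrate the hex-value alternative, here is a small sketch of one table row written with byte escapes, so the source file stays pure ASCII regardless of the editor's or compiler's character set. It assumes UTF-8 output. The array and function names are hypothetical, and only the row 0xE8..0xEF is shown. For ISO-8859-1 output the entries would be single-byte escapes such as "\xE8" instead.

```c
#include <assert.h>

/* Hypothetical excerpt: entries 0xE8..0xEF of an upper translation table,
 * spelled as UTF-8 byte escapes instead of literal accented characters. */
static const char *const upper_e8_ef[8] = {
    "\xC3\xA8", /* E8: e with grave */
    "\xC3\xA9", /* E9: e with acute */
    "\xC3\xAA", /* EA: e with circumflex */
    "\xC3\xAB", /* EB: e with diaeresis */
    "\xC3\xAC", /* EC: i with grave */
    "\xC3\xAD", /* ED: i with acute */
    "\xC3\xAE", /* EE: i with circumflex */
    "\xC3\xAF", /* EF: i with diaeresis */
};

/* Look up the output string for an input byte in 0xE8..0xEF. */
static const char *lookup_e8_ef(unsigned char c)
{
    assert(c >= 0xE8 && c <= 0xEF);
    return upper_e8_ef[c - 0xE8];
}
```

The trade-off is readability: the escapes are harder to proofread than literal characters, which is why the comments name each entry.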

Comment 15 Robin Johnson archtester

2006-02-16 17:38:41 UTC

I've just committed 0.19.9 to the tree, is the patch from this bug still needed?

Comment 16 Joël 2006-02-17 04:30:35 UTC

I've just tried the 0.19.9 version.

Indeed, the patch I posted is not needed anymore, *but* please note that unrtf will always output ISO-8859-1 text, regardless of the user's $LANG setting. Not very good for pure UTF-8 users IMHO.

Ideal workaround: unrtf should iconv() the whole text at runtime, so the output obeys the user's preferred encoding.
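A rough sketch of what such a runtime conversion could look like, using the POSIX iconv(3) and nl_langinfo(3) interfaces. This is not code from unrtf. The helper names latin1_to_charset and user_charset are hypothetical, and it assumes the text being converted is ISO-8859-1.

```c
#include <assert.h>
#include <iconv.h>
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper: convert ISO-8859-1 text to to_charset.
 * Returns the number of bytes written to out (excluding the NUL),
 * or (size_t)-1 on failure, e.g. unknown charset or buffer too small. */
static size_t latin1_to_charset(const char *to_charset,
                                const char *in, size_t in_len,
                                char *out, size_t out_size)
{
    assert(out != NULL && out_size > 0);

    iconv_t cd = iconv_open(to_charset, "ISO-8859-1");
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    char *src = (char *)in;           /* iconv() takes a non-const pointer */
    char *dst = out;
    size_t src_left = in_len;
    size_t dst_left = out_size - 1;   /* keep room for the NUL */

    size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
    if (rc != (size_t)-1)             /* flush any pending shift state */
        rc = iconv(cd, NULL, NULL, &dst, &dst_left);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return (size_t)-1;

    *dst = '\0';
    return (size_t)(dst - out);
}

/* Charset name of the user's locale, taken from the environment at run time. */
static const char *user_charset(void)
{
    setlocale(LC_CTYPE, "");
    return nl_langinfo(CODESET);
}
```

A caller would pass user_charset() as to_charset, so the target encoding is picked up from the locale of whoever runs the program rather than being fixed at build time.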

In the meantime, I suggest adding this as a first line in src_compile():

src_compile() {
    iconv -f ISO-8859-15 text.c >text.c.new && mv text.c.new text.c

This would detect the user's encoding at emerge time, which is better than ignoring it completely. With this line added, unrtf outputs proper UTF-8 text for me.

Since iconv is called without a '-t' (target encoding) argument, it *should* convert to the user's preferred encoding. It works for UTF-8 -- can someone please test with an ISO-8859 $LANG/$LC_ALL? I have userlocales and only UTF-8 locales built.

Thanks

Comment 17 Robin Johnson archtester

2006-02-20 01:06:24 UTC

I don't agree with using iconv like that.
My root user runs in a different $LANG than my regular user.
unrtf really must be made encoding-aware.

I'm going to close this for now, and I'd ask you to take it upstream again. If you diff the old release with the new one, you'll see there is a new maintainer, and hopefully he will be more responsive.

Comment 18 Joël 2006-02-20 08:13:41 UTC

He's from Australia, right?

Ok, e-mail is sent (including of course, a link to this page) :-)

When something happens I'll report it here.