Bug 276635 - app-text/par doesn't include the UTF-8 support
Summary: app-text/par doesn't include the UTF-8 support
Status: CONFIRMED
Alias: None
Product: Gentoo Linux
Classification: Unclassified
Component: Current packages
Hardware: All Linux
Importance: High enhancement
Assignee: No maintainer - Look at https://wiki.gentoo.org/wiki/Project:Proxy_Maintainers if you want to take care of it
URL: http://sysmic.org/dotclear/index.php?...
Whiteboard:
Keywords: NeedPatch
Depends on:
Blocks:
 
Reported: 2009-07-05 20:28 UTC by Álvaro Castro Castilla
Modified: 2015-09-17 03:00 UTC
CC List: 1 user

See Also:
Package list:
Runtime testing required: ---


Attachments
proposed new ebuild (par-1.52-r3.ebuild,871 bytes, text/plain)
2009-07-06 09:29 UTC, Álvaro Castro Castilla
Patch contributed by http://sysmic.org/par/ (par-1.52-i18n.3.patch,45.55 KB, text/plain)
2009-07-06 09:31 UTC, Álvaro Castro Castilla
example file that works (artifex_es.txt,2.15 KB, text/plain)
2009-07-12 13:55 UTC, Álvaro Castro Castilla
Result from running 'par j < artifex_es.txt > artifex_es.justified.txt' (artifex_es.justified.txt,2.29 KB, text/plain)
2009-07-25 20:19 UTC, Wormo (RETIRED)

Description Álvaro Castro Castilla 2009-07-05 20:28:45 UTC
app-text/par doesn't include UTF-8 support. A patch that adds it already exists, but it is not included in the version in Portage. See:

http://sysmic.org/dotclear/index.php?2006/06/22/55-add-multibyte-characters-support-in-par

Reproducible: Always

Steps to Reproduce:
1. Write/copy-paste a text with international characters (èáñöî)
2. Format it with: $ par 70j p0 s0 < file.txt

Actual Results:
The lines containing these characters are not justified properly.


The patch found at the link above fixes this.
Comment 1 Álvaro Castro Castilla 2009-07-05 20:32:17 UTC
You can find the latest version of that patch here:

http://sysmic.org/par/


It seems there is a newer version at this link.
Comment 2 Álvaro Castro Castilla 2009-07-06 09:29:09 UTC
Created attachment 196872 [details]
proposed new ebuild

Well, I made this to fix it in my overlay. If the Gentoo gurus consider it acceptable, maybe it can be added to Portage. The fact is that anyone whose native language is not English needs it.
Comment 3 Álvaro Castro Castilla 2009-07-06 09:31:39 UTC
Created attachment 196873 [details]
Patch contributed by http://sysmic.org/par/

See here:

http://sysmic.org/par/

Thanks Jezz!
Comment 4 Wormo (RETIRED) gentoo-dev 2009-07-12 07:02:28 UTC
Sorry, but it looks like this patch still needs some work -- it causes weird output (lots of extra blank lines inserted) or even segfaults when I run it on files that have UTF-8 characters (example files close at hand -- gentoo ChangeLogs). 

Could you post your 'emerge --info' and attach some sample files that work for you?
Comment 5 Álvaro Castro Castilla 2009-07-12 13:49:19 UTC
Portage 2.1.6.13 (default/linux/amd64/2008.0, gcc-4.3.2, glibc-2.9_p20081201-r2, 2.6.29-gentoo-r5 x86_64)
=================================================================
System uname: Linux-2.6.29-gentoo-r5-x86_64-AMD_Athlon-tm-_64_X2_Dual_Core_Processor_4200+-with-glibc2.2.5
Timestamp of tree: Sat, 11 Jul 2009 16:00:01 +0000
app-shells/bash:     3.2_p39
dev-java/java-config: 2.1.8-r1
dev-lang/python:     2.5.4-r3
dev-util/cmake:      2.6.4
sys-apps/baselayout: 1.12.11.1
sys-apps/sandbox:    1.6-r2
sys-devel/autoconf:  2.13, 2.63
sys-devel/automake:  1.5, 1.9.6-r2, 1.10.2
sys-devel/binutils:  2.18-r3
sys-devel/gcc-config: 1.4.1
sys-devel/libtool:   1.5.26
virtual/os-headers:  2.6.27-r2
ACCEPT_KEYWORDS="amd64"
CBUILD="x86_64-pc-linux-gnu"
CFLAGS="-march=native -mtune=native -O2 -pipe"
CHOST="x86_64-pc-linux-gnu"
CONFIG_PROTECT="/etc /var/lib/hsqldb"
CONFIG_PROTECT_MASK="/etc/ca-certificates.conf /etc/env.d /etc/env.d/java/ /etc/fonts/fonts.conf /etc/gconf /etc/revdep-rebuild /etc/sandbox.d /etc/terminfo /etc/udev/rules.d"
CXXFLAGS="-march=native -mtune=native -O2 -pipe"
DISTDIR="/usr/portage/distfiles"
FEATURES="collision-protect distlocks fixpackages parallel-fetch protect-owned sandbox sfperms strict unmerge-orphans userfetch"
GENTOO_MIRRORS="http://trumpetti.atm.tut.fi/gentoo/ ftp://trumpetti.atm.tut.fi/gentoo/ ftp://ftp.wh2.tu-dresden.de/pub/mirrors/gentoo "
LANG="en_US.UTF-8"
LC_ALL="en_US.UTF-8"
LDFLAGS="-Wl,-O1"
LINGUAS="en es es_ES"
MAKEOPTS="-j3"
PKGDIR="/usr/portage/packages"
PORTAGE_CONFIGROOT="/"
PORTAGE_RSYNC_OPTS="--recursive --links --safe-links --perms --times --compress --force --whole-file --delete --stats --timeout=180 --exclude=/distfiles --exclude=/local --exclude=/packages"
PORTAGE_TMPDIR="/var/tmp"
PORTDIR="/usr/portage"
PORTDIR_OVERLAY="/usr/local/portage/layman/sunrise /usr/local/portage/layman/java-overlay /usr/local/portage"
SYNC="rsync://rsync.gentoo.org/gentoo-portage"
USE="3dnow X a52 aac aalib acl acpi alsa amd64 apache apache2 apm bash-completion berkdb bitmap-fonts blas bzip2 cairo cddb cdparanoia cdr cli cracklib crypt cscope cups curl cxx dbus dri dv dvd dvdread encode ffmpeg firefox flac fortran gcj gd gdbm ggi gif gmp gphoto2 gpm gtk gtk2 guile hal hddtemp iconv imagegemagic imlib ipv6 isdnlog java java6 javascript jpeg jpeg2k lcms libcaca live lzo mad matroska midi mmext mmx mono mp3 mpeg mplayer msn mudflap multilib ncurses nls nptl nptlonly nsplugin ocaml ocamlopt offensive openal opengl openmp pam pcre pdf perl png ppds pppd python qt4 quicktime readline recode reflection session skins slang source spell spl sse sse2 sse3 ssl subvesion svg sysfs tcpd theora threads tiff truetype unicode usb v4l v4l2 vcd vim vim-syntax vorbis wxwindows x264 xcb xcomposite xft xml xorg xterm-color xulrunner xv xvid zlib" ALSA_CARDS="ali5451 als4000 atiixp atiixp-modem bt87x ca0106 cmipci emu10k1x ens1370 ens1371 es1938 es1968 fm801 hda-intel intel8x0 intel8x0m maestro3 trident usb-audio via82xx via82xx-modem ymfpci" ALSA_PCM_PLUGINS="adpcm alaw asym copy dmix dshare dsnoop empty extplug file hooks iec958 ioplug ladspa lfloat linear meter mmap_emul mulaw multi null plug rate route share shm softvol" APACHE2_MODULES="actions alias auth_basic authn_alias authn_anon authn_dbm authn_default authn_file authz_dbm authz_default authz_groupfile authz_host authz_owner authz_user autoindex cache dav dav_fs dav_lock deflate dir disk_cache env expires ext_filter file_cache filter headers include info log_config logio mem_cache mime mime_magic negotiation rewrite setenvif speling status unique_id userdir usertrack vhost_alias" ELIBC="glibc" INPUT_DEVICES="evdev" KERNEL="linux" LCD_DEVICES="bayrad cfontz cfontz633 glk hd44780 lb216 lcdm001 mtxorb ncurses text" LINGUAS="en es es_ES" USERLAND="GNU" VIDEO_CARDS="nvidia"
Unset:  CPPFLAGS, CTARGET, EMERGE_DEFAULT_OPTS, FFLAGS, INSTALL_MASK, PORTAGE_COMPRESS, PORTAGE_COMPRESS_FLAGS, PORTAGE_RSYNC_EXTRA_OPTS
Comment 6 Álvaro Castro Castilla 2009-07-12 13:55:06 UTC
Created attachment 197648 [details]
example file that works 

This file is in Spanish. It uses characters like "áéíóñ¿¡" (not all).

The patch works perfectly for me, and apparently for many other people. Maybe you could also send the file that is not working, so I can check whether it works on my computer.

Basically, the patch does little more than replace char types with wide chars.

Note: the best way to see how the non-UTF-8 version fails is with the "justify" option.
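
To illustrate why byte-oriented code misjustifies such lines, here is a small sketch. The helper name count_chars is hypothetical and is not taken from par or from the patch. In a UTF-8 locale a character such as "á" occupies two bytes, so strlen() overstates the length of the line, while a multibyte-aware count does not. The calling program is assumed to have called setlocale(LC_CTYPE, "") so that the locale's encoding is in effect.

```c
#include <string.h>
#include <wchar.h>

/* Hypothetical helper (not from par): count the characters in a multibyte
 * string according to the current locale's encoding.
 * Returns (size_t)-1 if the string contains an invalid or truncated
 * multibyte sequence. */
size_t count_chars(const char *s)
{
    mbstate_t state;
    size_t left = strlen(s);   /* bytes still to be examined */
    size_t count = 0;          /* characters seen so far */

    memset(&state, 0, sizeof state);   /* start in the initial shift state */

    while (left > 0) {
        /* mbrlen reports how many bytes the next character occupies. */
        size_t len = mbrlen(s, left, &state);

        if (len == (size_t)-1 || len == (size_t)-2)
            return (size_t)-1;         /* invalid or incomplete sequence */
        if (len == 0)
            break;                     /* embedded terminator, stop here */

        s += len;
        left -= len;
        count++;
    }
    return count;
}
```

For the word "señal" encoded as UTF-8 and processed in a UTF-8 locale, strlen() reports 6 bytes while count_chars() reports 5 characters. A formatter that pads lines using the byte count therefore ends up one column short for every such character.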
Comment 7 Wormo (RETIRED) gentoo-dev 2009-07-25 20:17:17 UTC
I've tracked down one problem with the patch -- it has undefined behavior when your text file has a multibyte sequence not valid for the current locale. That can lead to the crashes I was seeing.

In par.c, mbstowcs is called without checking its return value. If a bad sequence occurs, mbstowcs will stop converting and leave uninitialized memory at the end of the buffer (since malloc() rather than calloc() was used to allocate the buffer). Later code doesn't realize the memory was uninitialized and tries to use it, leading to crashes or other bad behavior.
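
A minimal sketch of the kind of check described above. The helper name to_wide_checked is hypothetical and this is not the code from par.c or a proposed fix to the patch; it only shows the pattern of testing the return value of mbstowcs and zero-filling the buffer with calloc().

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper (not from par.c): convert a multibyte string to a
 * newly allocated wide string. Returns NULL if allocation fails or if the
 * input contains a sequence that is invalid in the current locale.
 * The caller frees the result. */
wchar_t *to_wide_checked(const char *mbs)
{
    /* Every wide character consumes at least one input byte, so the byte
     * length is an upper bound on the number of wide characters. */
    size_t max_chars = strlen(mbs);

    /* calloc zero-fills the buffer, so it is terminated even if the
     * conversion stops early. */
    wchar_t *wcs = calloc(max_chars + 1, sizeof *wcs);
    if (wcs == NULL)
        return NULL;

    /* mbstowcs returns (size_t)-1 when it meets an invalid sequence. */
    if (mbstowcs(wcs, mbs, max_chars + 1) == (size_t)-1) {
        free(wcs);
        return NULL;
    }
    return wcs;
}
```

A caller can then treat a NULL result as "this line is not valid in the current locale" and report an error or fall back to byte-wise handling, instead of reading uninitialized memory.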

However, there is still another regression that I'm looking into -- the blank lines in my test file are getting bunched together in big chunks at the top of the file (and other random places in the file, if the input is long enough). I'll attach my output from your test case so you can see what I mean. This happens even if there are only ASCII characters, so it really is a regression.

Hey, this is really strange... the problem with blank lines does not occur if output goes to the terminal rather than being redirected to an output file. This smells to me like a glibc bug, and I notice that you are using a newer glibc than the one in stable x86 (which is sys-libs/glibc-2.8_p20080602-r1).

Do you have any boxes with older glibc to test on, to check this theory?
Comment 8 Wormo (RETIRED) gentoo-dev 2009-07-25 20:19:28 UTC
Created attachment 199163 [details]
Result from running 'par j < artifex_es.txt > artifex_es.justified.txt'

Notice how all the blank lines were collected at top... weird.