Bug 344475

Summary: sys-apps/sed: processes files with non-ASCII chars wrong if LC_ALL="C"
Product: Gentoo/Alt Reporter: Charles Davis <cdavis5x>
Component: Prefix Support Assignee: Gentoo Prefix <prefix>
Status: RESOLVED UPSTREAM    
Severity: normal    
Priority: High    
Version: unspecified   
Hardware: All   
OS: OS X   
Whiteboard:
Package list:
Runtime testing required: ---

Description Charles Davis 2010-11-07 05:19:59 UTC
If a file contains non-ASCII characters, and the locale is set to "C", sed's regexp pattern matching fails to pick up the non-ASCII characters. In fact, if it encounters even one non-ASCII char, the match terminates.

This pops up when building Wine. dlls/shell32/authors.c ends up mangled because of this.


Reproducible: Always

Steps to Reproduce:
1. echo 'aâbc' > test.txt
2. env LC_ALL="C" sed -e 's/\(.*\)/\"\1\",/' test.txt

Actual Results:  
sed produces:

"a",âb


Expected Results:  
sed should print:

"aâbc",
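
To see what sed is actually given, it can help to look at the raw bytes of the test file. This is a minimal sketch, assuming the terminal wrote 'â' as UTF-8 (bytes 0xC3 0xA2); the helper names `make_test_file` and `dump_bytes` are made up for illustration and are not part of the original report.

```shell
# Write the test line with explicit bytes so the result does not depend on
# the terminal's encoding. Assumption: 'â' is UTF-8, i.e. 0xC3 0xA2
# (octal 303 242).
make_test_file() {
    printf 'a\303\242bc\n' > "$1"
}

# Dump the file as hex bytes on one line. The command substitution is left
# unquoted on purpose so the shell collapses od's padding to single spaces.
dump_bytes() {
    echo $(od -An -tx1 "$1")
}

make_test_file test.txt
dump_bytes test.txt
```

If the assumption holds, the dump shows two bytes between `61` ('a') and `62` ('b'), which is the sequence a byte-oriented locale cannot interpret as a single character.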

Portage 2.2.01.17133-prefix (prefix/darwin/macos/10.6/x86, gcc-4.2.1, unavailable, 10.4.0 i386)
=================================================================
                        System Settings
=================================================================
System uname: Darwin-10.4.0-i386-32bit
Timestamp of tree: Sun, 07 Nov 2010 00:42:31 +0000
distcc 3.1-toolwhip.1 i386-apple-darwin10.0 [disabled]
app-shells/bash:     4.1_p7
dev-lang/python:     2.6.5-r2
dev-util/cmake:      2.8.1-r2
sys-devel/autoconf:  2.65-r1
sys-devel/automake:  1.9.6-r3, 1.11.1
sys-devel/gcc-config: 1.4.1-r00.2
sys-devel/libtool:   2.2.10
sys-devel/make:      3.81-r2
Repositories: gentoo_prefix
ACCEPT_KEYWORDS="~x86-macos"
ACCEPT_LICENSE="* -@EULA"
CBUILD="i686-apple-darwin10"
CFLAGS="-O2 -pipe -march=core2"
CHOST="i686-apple-darwin10"
CONFIG_PROTECT="/etc"
CONFIG_PROTECT_MASK="/etc/ca-certificates.conf /etc/env.d /etc/fonts/fonts.conf /etc/gconf /etc/portage /etc/revdep-rebuild /etc/terminfo"
CXXFLAGS="-O2 -pipe -march=core2"
DISTDIR="/Users/chip/Gentoo/usr/portage/distfiles"
FEATURES="assume-digests binpkg-logs collision-protect distlocks fixlafiles fixpackages news nostrip parallel-fetch preserve-libs protect-owned sfperms strict unknown-features-warn unmerge-logs unmerge-orphans userfetch"
GENTOO_MIRRORS="http://distfiles.gentoo.org"
LDFLAGS="-Wl,-dead_strip_dylibs"
MAKEOPTS="-j4"
PKGDIR="/Users/chip/Gentoo/usr/portage/packages"
PORTAGE_CONFIGROOT="/Users/chip/Gentoo/"
PORTAGE_RSYNC_OPTS="--recursive --links --safe-links --perms --times --compress --force --whole-file --delete --stats --timeout=180 --exclude=/distfiles --exclude=/local --exclude=/packages"
PORTAGE_TMPDIR="/Users/chip/Gentoo/var/tmp"
PORTDIR="/Users/chip/Gentoo/usr/portage"
PORTDIR_OVERLAY=""
SYNC="rsync://rsync.prefix.freens.org/gentoo-portage-prefix"
USE="aqua bash-completion berkdb bzip2 coreaudio cracklib crypt curl cxx dbus exceptions expat extensions gdbm gnutls gpg gzip iconv icu ipv6 jbig jpeg libssh2 lzma lzo mmx mmxext mng modules mysql ncurses nls objc objc++ pch perl png prefix python qt3support readline ruby sql sqlite3 sse sse2 ssl subversion tcl threads tiff tk truetype unicode vim-syntax x86-macos xml zlib" APACHE2_MODULES="actions alias auth_basic authn_alias authn_anon authn_dbm authn_default authn_file authz_dbm authz_default authz_groupfile authz_host authz_owner authz_user autoindex cache cgi cgid dav dav_fs dav_lock deflate dir disk_cache env expires ext_filter file_cache filter headers include info log_config logio mem_cache mime mime_magic negotiation rewrite setenvif speling status unique_id userdir usertrack vhost_alias" COLLECTD_PLUGINS="df interface irq load memory rrdtool swap syslog" ELIBC="Darwin" GPSD_PROTOCOLS="ashtech aivdm earthmate evermore fv18 garmin garmintxt gpsclock itrax mtk3301 nmea ntrip navcom oceanserver oldstyle oncore rtcm104v2 rtcm104v3 sirf superstar2 timing tsip tripmate tnt ubx" INPUT_DEVICES="keyboard mouse" KERNEL="Darwin" LCD_DEVICES="bayrad cfontz cfontz633 glk hd44780 lb216 lcdm001 mtxorb ncurses text" PHP_TARGETS="php5-2" RUBY_TARGETS="ruby18" USERLAND="GNU" XTABLES_ADDONS="quota2 psd pknock lscan length2 ipv4options ipset ipp2p iface geoip fuzzy condition tee tarpit sysrq steal rawnat logmark ipmark dhcpmac delude chaos account" 
Unset:  CPPFLAGS, CTARGET, EMERGE_DEFAULT_OPTS, FFLAGS, INSTALL_MASK, LANG, LC_ALL, LINGUAS, PORTAGE_BUNZIP2_COMMAND, PORTAGE_COMPRESS, PORTAGE_COMPRESS_FLAGS, PORTAGE_RSYNC_EXTRA_OPTS

=================================================================
                        Package Settings
=================================================================

sys-apps/sed-4.2.1-r1 was built with the following:
USE="nls (prefix) (-acl) (-selinux) -static"

Comment 1 Fabian Groffen gentoo-dev 2010-11-09 18:01:06 UTC
The C-locale of the Mac actually is MacRoman.

After fixing your sed, this should work for you too:
% env LC_ALL=en_GB.UTF-8 sed -e 's/\(.*\)/\"\1\",/' test.txt
"aâbc",

I don't think it's really sed's fault.

Comment 2 Charles Davis 2010-11-09 18:27:34 UTC
(In reply to comment #1)
> The C-locale of the Mac actually is MacRoman.
That makes sense.
> 
> After fixing your sed, this should work for you too:
> % env LC_ALL=en_GB.UTF-8 sed -e 's/\(.*\)/\"\1\",/' test.txt
> "aâbc",
That works all right.
> 
> I don't think it's really sed's fault
I forgot to mention that this works perfectly fine with Mac OS X's built-in sed with LC_ALL=C. (Gee, that would have been helpful to know before! ;)

Personally I think it is sed's fault. In a regexp, '.' means "match ANY character." Mac OS's own sed has no trouble at all with this, so if GNU sed isn't matching all the characters, including the non-ASCII ones, that is a problem.

In fact, I tried it with GNU sed from MacPorts. It too has this problem. This appears to be a bug in GNU sed itself. Sorry to have wasted your time.
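
The comparison described in this thread can be scripted. This is a rough sketch only: `try_sed` is a hypothetical helper, the `en_GB.UTF-8` locale is the one used in comment 1 and may not be installed on every system, and which binary plain `sed` resolves to depends on PATH. The input again assumes 'â' is UTF-8 encoded.

```shell
# Run the substitution from the report with a given sed binary and locale.
# $1 = sed binary (name or path), $2 = value for LC_ALL; input comes from stdin.
try_sed() {
    env LC_ALL="$2" "$1" -e 's/\(.*\)/"\1",/'
}

# Feed the same bytes to sed under each locale and label the output.
for loc in C en_GB.UTF-8; do
    printf '%s: ' "$loc"
    printf 'a\303\242bc\n' | try_sed sed "$loc"
done
```

Passing a full path as the first argument (for example the system sed versus the Prefix or MacPorts one) allows the same line to be checked against several implementations.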