Bug 149526

Summary:	sed's misdoings
Product:	Gentoo Linux	Reporter:	Igor Golubev <ooptimum>
Component:	Current packages	Assignee:	Gentoo's Team for Core System packages <base-system>
Status:	RESOLVED INVALID
Severity:	critical	CC:	truedfx
Priority:	High
Version:	2006.1
Hardware:	All
OS:	Linux
Whiteboard:
Package list:		Runtime testing required:	---

Description Igor Golubev 2006-09-29 08:24:02 UTC

$ echo "[aA][bB][zZ]" | sed 's/[A-Z]//g'
[a][][]

the result must be: [a][b][z]

$ echo "[gG][iI][fF]" | sed 's/[A-Z]//g'
[][][]

the result must be: [g][i][f]

sed gives wrong output on both of these configurations:

System uname: 2.6.17-gentoo-r4 i686 Intel(R) Pentium(R) 4 CPU 3.00GHz
Gentoo Base System version 1.12.5
Last Sync: Fri, 29 Sep 2006 12:30:04 +0000
app-admin/eselect-compiler: [Not Present]
dev-java/java-config: [Not Present]
dev-lang/python:     2.4.3-r1
dev-python/pycrypto: 2.0.1-r5
dev-util/ccache:     [Not Present]
dev-util/confcache:  [Not Present]
sys-apps/sandbox:    1.2.17
sys-devel/autoconf:  2.13, 2.59-r7
sys-devel/automake:  1.4_p6, 1.5, 1.6.3, 1.7.9-r1, 1.8.5-r3, 1.9.6-r2
sys-devel/binutils:  2.17.50.0.3
sys-devel/gcc-config: 1.3.13-r3
sys-devel/libtool:   1.5.22
virtual/os-headers:  2.6.11-r5
ACCEPT_KEYWORDS="x86"
AUTOCLEAN="yes"
CBUILD="i686-pc-linux-gnu"
CFLAGS="-O2 -march=i686 -fomit-frame-pointer -pipe"
CHOST="i686-pc-linux-gnu"
CONFIG_PROTECT="/etc"
CONFIG_PROTECT_MASK="/etc/env.d /etc/gconf /etc/revdep-rebuild /etc/terminfo"
CXXFLAGS="-O2 -march=i686 -fomit-frame-pointer -pipe"
DISTDIR="/usr/portage/distfiles"
FEATURES="autoconfig distlocks metadata-transfer sandbox sfperms strict"
GENTOO_MIRRORS="ftp://mirror.aiya.ru/pub/gentoo/ ftp://ftp.citkit.ru/pub/Linux/gentoo/"
LANG="en_US.UTF-8"
LC_ALL=""
LDFLAGS="-Wl,-O1,--hash-style=both"
LINGUAS=""
MAKEOPTS="-j3"
PKGDIR="/usr/portage/packages"
PORTAGE_RSYNC_OPTS="--recursive --links --safe-links --perms --times --compress --force --whole-file --delete --delete-after --stats --timeout=180 --exclude='/distfiles' --exclude='/local' --exclude='/packages'"
PORTAGE_TMPDIR="/var/tmp"
PORTDIR="/usr/portage"
PORTDIR_OVERLAY="/usr/portage/local /usr/portage/local/layman/toolchain_overlay"
SYNC="rsync://rsync.gentoo.org/gentoo-portage"
USE="x86 apache2 berkdb bitmap-fonts cli crypt cups dlloader dri elibc_glibc fortran input_devices_evdev input_devices_keyboard input_devices_mouse isdnlog kernel_linux libg++ mailwrapper mysql ncurses nls nptl nptlonly pam pcre perl ppds pppd python readline reflection session snmp spl ssl truetype truetype-fonts type1-fonts udev unicode userland_GNU vhosts video_cards_apm video_cards_ark video_cards_ati video_cards_chips video_cards_cirrus video_cards_cyrix video_cards_dummy video_cards_fbdev video_cards_glint video_cards_i128 video_cards_i740 video_cards_i810 video_cards_imstt video_cards_mga video_cards_neomagic video_cards_nsc video_cards_nv video_cards_rendition video_cards_s3 video_cards_s3virge video_cards_savage video_cards_siliconmotion video_cards_sis video_cards_sisusb video_cards_tdfx video_cards_tga video_cards_trident video_cards_tseng video_cards_v4l video_cards_vesa video_cards_vga video_cards_via video_cards_vmware video_cards_voodoo xml xorg zlib"
Unset:  CTARGET, EMERGE_DEFAULT_OPTS, INSTALL_MASK, PORTAGE_RSYNC_EXTRA_OPTS

and

Portage 2.1.1 (default-linux/x86/2006.1, gcc-4.1.1, glibc-2.4-r3, 2.6.17-gentoo-r8-ww i686)
=================================================================
System uname: 2.6.17-gentoo-r8-ww i686 AMD Athlon(TM) XP 2700+
Gentoo Base System version 1.12.5
Last Sync: Fri, 29 Sep 2006 01:53:01 +0000
ccache version 2.3 [enabled]
app-admin/eselect-compiler: [Not Present]
dev-java/java-config: [Not Present]
dev-lang/python:     2.4.3-r4
dev-python/pycrypto: 2.0.1-r5
dev-util/ccache:     2.3
dev-util/confcache:  [Not Present]
sys-apps/sandbox:    1.2.17
sys-devel/autoconf:  2.13, 2.59-r7
sys-devel/automake:  1.4_p6, 1.5, 1.6.3, 1.7.9-r1, 1.8.5-r3, 1.9.6-r2
sys-devel/binutils:  2.16.1-r3
sys-devel/gcc-config: 1.3.13-r3
sys-devel/libtool:   1.5.22
virtual/os-headers:  2.6.17-r1
ACCEPT_KEYWORDS="x86"
AUTOCLEAN="yes"
CBUILD="i686-pc-linux-gnu"
CFLAGS="-O3 -march=i686 -mtune=athlon-xp -pipe"
CHOST="i686-pc-linux-gnu"
CONFIG_PROTECT="/etc /usr/kde/3.5/env /usr/kde/3.5/share/config /usr/kde/3.5/shutdown /usr/share/X11/xkb /usr/share/config"
CONFIG_PROTECT_MASK="/etc/env.d /etc/gconf /etc/revdep-rebuild /etc/splash /etc/terminfo"
CXXFLAGS="-O3 -march=i686 -mtune=athlon-xp -pipe"
DISTDIR="/usr/portage/distfiles"
FEATURES="autoconfig ccache distlocks metadata-transfer parallel-fetch sandbox sfperms strict"
GENTOO_MIRRORS="http://mirror.aiya.ru/pub/gentoo http://gentoo.osuosl.org http://mirror.gentoo.no"
LANG="ru_RU.UTF-8"
LINGUAS="ru en"
MAKEOPTS="-j2"
PKGDIR="/usr/portage/packages"
PORTAGE_RSYNC_OPTS="--recursive --links --safe-links --perms --times --compress --force --whole-file --delete --delete-after --stats --timeout=180 --exclude='/distfiles' --exclude='/local' --exclude='/packages'"
PORTAGE_TMPDIR="/var/tmp"
PORTDIR="/usr/portage"
PORTDIR_OVERLAY="/usr/local/portage"
SYNC="rsync://rsync.gentoo.org/gentoo-portage"
USE="x86 3dnow 3dnowext 7zip X aac acl alsa bash-completion berkdb bitmap-fonts bzip2 cdparanoia chardet cli crypt cups curl dbus directfb divx dlloader dri dvd dvdr dvdread elibc_glibc encode examples fbcon ffmpeg gdbm gif glitz gpm gtk gtk2 gzip hal hardened iconv imlib input_devices_evdev input_devices_keyboard input_devices_mouse isdnlog ithreads jpeg kernel_linux ldap libg++ linguas_en linguas_ru mad matroska md5sum mikmod mmx mmxext mng mozilla mp3 mpeg ncurses nls no-old-linux nptl nptlonly nsplugin nvidia ogg opengl pam pango pcre perl png ppds pppd pyste python quicktime rar readline reflection sdl session spell spl sqlite sse ssl startup-notification svg symlink sysfs tcltk tcpd theora threads thumbnail tk toolbar trayicon truetype truetype-fonts type1-fonts udev unicode usb userland_GNU userlocales v4l v4l2 video_cards_nv video_cards_nvidia video_cards_vesa vorbis win32codecs wma wmf x264 xfce xml xorg xpm xv xvid zip zlib"
Unset:  CTARGET, EMERGE_DEFAULT_OPTS, INSTALL_MASK, LC_ALL, LDFLAGS, PORTAGE_RSYNC_EXTRA_OPTS

Switching to gcc-3.4.6 on these configurations makes no difference.

Comment 1 Igor Golubev 2006-09-29 08:32:51 UTC

I forgot to include the header line of the first `emerge --info` output. Here it is:

Portage 2.1.1 (default-linux/x86/2006.1/server, gcc-4.1.1, glibc-2.4-r3, 2.6.17-gentoo-r4 i686)
=================================================================

Comment 2 Jakub Moc (RETIRED) gentoo-dev 2006-09-29 08:33:22 UTC

$ echo "[aA][bB][zZ]" | sed 's/[A-Z]//g'
[a][][]

$ echo "[gG][iI][fF]" | sed 's/[A-Z]//g'
[][][]

$ emerge --info | grep glibc
Portage 2.1.2_pre1-r4 (hardened/x86/2.6, gcc-3.4.6, glibc-2.3.6-r4, 2.6.17-gentoo-r8-amd64 i686)

I really don't see how this is a glibc-2.4 issue.

Comment 3 Igor Golubev 2006-09-29 08:38:56 UTC

This is what gave me the wrong clue, Jakub:

$ echo "[aA][bB][zZ]" | sed 's/[A-Z]//g'
[a][b][z]
$ echo "[gG][iI][fF]" | sed 's/[A-Z]//g'
[g][i][f]
$ emerge --info |grep glibc
Portage 2.1.1 (hardened/x86/2.6, gcc-3.3.6, glibc-2.3.6-r4, 2.6.11-hardened-r15 i686)

Comment 4 Harald van Dijk (RETIRED) gentoo-dev 2006-09-29 09:04:12 UTC

There's no bug here. If you want to match only the uppercase letters of the English alphabet, set LC_ALL=C. If you want to match the uppercase letters of the current locale, use [[:upper:]]. [A-Z] means "uppercase A, uppercase Z, or any of the characters that would be sorted between them in the current locale", and in en_US.UTF-8, that includes the lowercase b through z.

echo {A..Z} {a..z} | fmt -w 1 | sort
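
The two fixes suggested above can be sketched with the inputs from the original report. This is only a minimal illustration, not output captured from any of the systems in this bug, and the variable names are made up for the example. Forcing LC_ALL=C for the single sed invocation makes [A-Z] cover just the 26 ASCII capitals, while [[:upper:]] asks for the uppercase letters of the current locale instead of a collation range.

```shell
#!/bin/sh
# Inputs taken from the original report.
input1='[aA][bB][zZ]'
input2='[gG][iI][fF]'

# Fix 1: force the C locale for this one command, so [A-Z] is a plain
# range over the ASCII capitals.
c_locale_1=$(printf '%s\n' "$input1" | LC_ALL=C sed 's/[A-Z]//g')
c_locale_2=$(printf '%s\n' "$input2" | LC_ALL=C sed 's/[A-Z]//g')

# Fix 2: use a character class instead of a range; no collation order
# is consulted for a class.
class_1=$(printf '%s\n' "$input1" | sed 's/[[:upper:]]//g')
class_2=$(printf '%s\n' "$input2" | sed 's/[[:upper:]]//g')

printf '%s\n' "$c_locale_1" "$c_locale_2" "$class_1" "$class_2"
```

Both variants should print [a][b][z] and [g][i][f] for these inputs. They differ for non-ASCII text: the first leaves capitals outside A-Z untouched, the second removes whatever the active locale classifies as uppercase.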

Comment 5 Igor Golubev 2006-09-29 09:49:22 UTC

Ubuntu 6.06LTS:

$ locale
LANG=ru_RU.UTF-8
LANGUAGE=ru_RU:ru:en_GB:en
LC_CTYPE="ru_RU.UTF-8"
LC_NUMERIC="ru_RU.UTF-8"
LC_TIME="ru_RU.UTF-8"
LC_COLLATE="ru_RU.UTF-8"
LC_MONETARY="ru_RU.UTF-8"
LC_MESSAGES="ru_RU.UTF-8"
LC_PAPER="ru_RU.UTF-8"
LC_NAME="ru_RU.UTF-8"
LC_ADDRESS="ru_RU.UTF-8"
LC_TELEPHONE="ru_RU.UTF-8"
LC_MEASUREMENT="ru_RU.UTF-8"
LC_IDENTIFICATION="ru_RU.UTF-8"
LC_ALL=
$ echo "[aA][bB][zZ]" | sed 's/[A-Z]//g'
[a][b][z]
$ echo "[gG][iI][fF]" | sed 's/[A-Z]//g'
[g][i][f]
$ sed --version
GNU sed версия 4.1.4

Don't you think that this behaviour of sed in Gentoo could lead to numerous mistakes in the scripts written with this syntax in mind?

Comment 6 Sergey Dryabzhinsky 2006-09-29 10:04:23 UTC

On Gentoo Linux:

$ locale
LANG=ru_RU.KOI8-R
LC_CTYPE="ru_RU.KOI8-R"
LC_NUMERIC="ru_RU.KOI8-R"
LC_TIME="ru_RU.KOI8-R"
LC_COLLATE="ru_RU.KOI8-R"
LC_MONETARY="ru_RU.KOI8-R"
LC_MESSAGES="ru_RU.KOI8-R"
LC_PAPER="ru_RU.KOI8-R"
LC_NAME="ru_RU.KOI8-R"
LC_ADDRESS="ru_RU.KOI8-R"
LC_TELEPHONE="ru_RU.KOI8-R"
LC_MEASUREMENT="ru_RU.KOI8-R"
LC_IDENTIFICATION="ru_RU.KOI8-R"
LC_ALL=

$ echo "[aA][bB][cC]" | sed 's/[A-Z]//g' && sed --version | grep sed
[a][][]
GNU sed версия 4.1.5
$ emerge --info | grep glibc | grep gcc
Portage 2.1.1 (default-linux/x86/2006.1/desktop, gcc-4.1.1, glibc-2.4-r3, 2.6.18.xsuid.bot i686)

Comment 7 Sergey Dryabzhinsky 2006-09-29 10:12:17 UTC

Actions on ASCII character ranges should not depend on the locale.

Comment 8 Oleg S. Marin 2006-09-29 10:15:47 UTC

From urxvt launched with LANG="C"
wwolf@terrum ~ $ echo "[bB][aA][zZ]" | sed 's/[A-Z]/'
[b][aA][zZ]
wwolf@terrum ~ $ echo "[gG][iI][fF]" | sed 's[A-Z]//g'
[g][i][f]

From urxvt launched with LANG="ru_RU.KOI8-R"
wwolf@terrum ~ $ echo "[bB][aA][zZ]" | sed 's/[A-Z]//'
[B][aA][zZ]
wwolf@terrum ~ $ echo "[gG][iI][fF]" | sed 's/[A-Z]//g'
[][][]

Comment 9 Harald van Dijk (RETIRED) gentoo-dev 2006-09-29 10:19:24 UTC

(In reply to comment #5)
> Don't you think that this behaviour of sed in Gentoo could lead to numerous
> mistakes in the scripts written with this syntax in mind?

Such scripts are broken and should be fixed -- and they are.

(In reply to comment #7)
> Actions on ASCII character ranges should not depend on the locale.

Yes, they should. This is briefly mentioned in the sed info page, and it is also the behaviour required by POSIX.
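
For a script that relied on [A-Z], one possible repair is sketched below. The function name strip_ascii_upper is hypothetical and not taken from any existing script. It spells out the 26 letters in the bracket expression, so no range is involved and the collation order cannot change the result; the LC_ALL=C prefix is an extra safeguard that applies to the sed command alone.

```shell
#!/bin/sh
# strip_ascii_upper: hypothetical helper, not taken from any Gentoo script.
# Deletes the 26 ASCII capitals from each line of standard input.
strip_ascii_upper() {
    # The letters are listed explicitly, so the result does not depend on
    # how the locale sorts characters between A and Z.
    LC_ALL=C sed 's/[ABCDEFGHIJKLMNOPQRSTUVWXYZ]//g'
}

fixed=$(printf '%s\n' '[gG][iI][fF]' | strip_ascii_upper)
printf '%s\n' "$fixed"
```

Because the assignment is a prefix to one external command, the rest of the script keeps running in the user's locale, which matters if it also prints localized messages or sorts user-visible text.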

Comment 10 Oleg S. Marin 2006-09-29 10:40:31 UTC

Sorry, I missed some symbols in my previous post; with "C" everything is OK.
But from urxvt launched with LANG="ru_RU.KOI8-R" I get:

wwolf@terrum ~ $ echo "[bB][aA][zZ]" | sed 's/[A-Z]//g'
[][a][]
wwolf@terrum ~ $ echo "[gG][iI][fF]" | sed 's/[A-Z]//g'
[][][]
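
The tests in this thread set LANG, which only decides collation when LC_ALL and LC_COLLATE are unset or empty. A small sketch of that precedence may help when reproducing the results. The function name effective_collate is invented for this example; it looks only at the values passed to it rather than asking the C library, and the final fallback to C is an assumption, since POSIX leaves the default locale implementation-defined.

```shell
#!/bin/sh
# effective_collate: hypothetical helper that reports which locale name
# governs collation, following the precedence LC_ALL, then LC_COLLATE,
# then LANG. An empty value is treated the same as an unset one.
#   $1 = LC_ALL, $2 = LC_COLLATE, $3 = LANG
effective_collate() {
    if [ -n "$1" ]; then
        printf '%s\n' "$1"
    elif [ -n "$2" ]; then
        printf '%s\n' "$2"
    elif [ -n "$3" ]; then
        printf '%s\n' "$3"
    else
        # Assumed fallback; the real default is implementation-defined.
        printf '%s\n' C
    fi
}

# The caller's own environment.
current=$(effective_collate "${LC_ALL-}" "${LC_COLLATE-}" "${LANG-}")

# Only LANG set, as in the urxvt tests above.
lang_only=$(effective_collate "" "" "ru_RU.KOI8-R")

# LC_ALL set as well: it overrides LANG.
overridden=$(effective_collate "C" "" "ru_RU.KOI8-R")

printf '%s\n' "$current" "$lang_only" "$overridden"
```

This is why setting LC_ALL=C is the more reliable of the two overrides: a LANG=C setting is silently ignored if the environment already carries a non-empty LC_ALL or LC_COLLATE.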

Comment 11 SpanKY gentoo-dev 2006-09-29 12:25:07 UTC

Harald van Dijk is spot on with everything he has said.