Mail::SpamAssassin::Plugin::TextCat: /usr/share/spamassassin/languages contains wrong tokens for ru.iso-8859-5 (UTF-8).

After switching to "normalize_charset 1", TextCat becomes useless: it never detects the language. When I replaced the file with the one from http://www.phpclasses.org/browse/file/14651.html, which contains tokens for ru.iso-8859-5 (UTF-8), everything works fine.

Reproducible: Always

Steps to Reproduce:
1. wget http://www1.uralpress.ru/my/utf8.msg
2. sa-learn -D --spam utf8.msg 2>&1 | grep textcat
3. The output is:
   dbg: textcat: can't determine language uniquely enough

With the replacement languages file:
wget http://www1.uralpress.ru/my/languages
cp languages /usr/share/spamassassin/
sa-learn -D --spam utf8.msg 2>&1 | grep textcat
dbg: textcat: language possibly: ru.iso-8859-5

Actual Results:
dbg: textcat: can't determine language uniquely enough

Expected Results:
dbg: textcat: language possibly: ru.iso-8859-5

/etc/spamassassin/v310.pre:
loadplugin Mail::SpamAssassin::Plugin::TextCat

Portage 2.1.4.5 (default/linux/x86/2008.0, gcc-4.1.2, glibc-2.6.1-r0, 2.6.25-gentoo-r9 i686)
=================================================================
System uname: 2.6.25-gentoo-r9 i686 AMD Athlon(tm) 64 X2 Dual Core Processor 4200+
Timestamp of tree: Fri, 26 Dec 2008 01:00:01 +0000
ccache version 2.4 [enabled]
app-shells/bash: 3.2_p33
dev-lang/python: 2.5.2-r7
dev-util/ccache: 2.4-r7
sys-apps/baselayout: 1.12.11.1
sys-apps/sandbox: 1.2.18.1-r2
sys-devel/autoconf: 2.61-r2
sys-devel/automake: 1.9.6-r2, 1.10.1-r1
sys-devel/binutils: 2.18-r3
sys-devel/gcc-config: 1.4.0-r4
sys-devel/libtool: 1.5.26
virtual/os-headers: 2.6.23-r3
ACCEPT_KEYWORDS="x86"
CBUILD="i686-pc-linux-gnu"
CFLAGS="-O2 -march=k8 -pipe"
CHOST="i686-pc-linux-gnu"
CONFIG_PROTECT="/etc /var/bind"
CONFIG_PROTECT_MASK="/etc/ca-certificates.conf /etc/env.d /etc/fonts/fonts.conf /etc/gconf /etc/php/apache2-php5/ext-active/ /etc/php/cgi-php5/ext-active/ /etc/php/cli-php5/ext-active/ /etc/revdep-rebuild /etc/terminfo /etc/udev/rules.d"
CXXFLAGS="-O2 -march=k8 -pipe"
DISTDIR="/usr/portage/distfiles"
FEATURES="ccache distlocks metadata-transfer sandbox sfperms strict unmerge-orphans userfetch"
GENTOO_MIRRORS="http://mirror.bytemark.co.uk/gentoo/ http://gentoo.tups.lv/source/ http://trumpetti.atm.tut.fi/gentoo/"
LANG="ru_RU.UTF-8"
LC_ALL=""
LDFLAGS="-Wl,-O1"
LINGUAS="ru"
MAKEOPTS="-j3"
PKGDIR="/usr/portage/packages"
PORTAGE_RSYNC_OPTS="--recursive --links --safe-links --perms --times --compress --force --whole-file --delete --stats --timeout=180 --exclude=/distfiles --exclude=/local --exclude=/packages"
PORTAGE_TMPDIR="/var/tmp"
PORTDIR="/usr/portage"
PORTDIR_OVERLAY="/usr/local/portage"
SYNC="rsync://rsync.gentoo.org/gentoo-portage"
USE="acl apache2 bash-completion bzip2 cli cracklib crypt cups dri gdbm gpm iconv isdnlog jpeg logrotate midi mmx mudflap mysql mysqli nls nptl nptlonly openmp pam pcre perl png pppd python readline reflection session spl sse sse2 ssl sysfs tcpd truetype unicode vhosts x86 xorg zlib"
ALSA_CARDS="ali5451 als4000 atiixp atiixp-modem bt87x ca0106 cmipci emu10k1 emu10k1x ens1370 ens1371 es1938 es1968 fm801 hda-intel intel8x0 intel8x0m maestro3 trident usb-audio via82xx via82xx-modem ymfpci"
ALSA_PCM_PLUGINS="adpcm alaw asym copy dmix dshare dsnoop empty extplug file hooks iec958 ioplug ladspa lfloat linear meter mmap_emul mulaw multi null plug rate route share shm softvol"
APACHE2_MODULES="actions alias auth_basic authn_alias authn_dbd authn_default authn_file authz_default authz_groupfile authz_host authz_owner authz_user autoindex cache dav dav_fs dav_lock dbd deflate dir disk_cache env expires ext_filter file_cache filter headers imagemap log_config logio mem_cache mime mime_magic negotiation rewrite setenvif speling ssl unique_id vhost_alias"
APACHE2_MPMS="prefork"
ELIBC="glibc"
INPUT_DEVICES="keyboard mouse evdev"
KERNEL="linux"
LCD_DEVICES="bayrad cfontz cfontz633 glk hd44780 lb216 lcdm001 mtxorb ncurses text"
LINGUAS="ru"
USERLAND="GNU"
VIDEO_CARDS="fbdev glint i810 intel mach64 mga neomagic nv r128 radeon savage sis tdfx trident vesa vga via vmware voodoo"
Unset: CPPFLAGS, CTARGET, EMERGE_DEFAULT_OPTS, FFLAGS, INSTALL_MASK, PORTAGE_COMPRESS, PORTAGE_COMPRESS_FLAGS, PORTAGE_RSYNC_EXTRA_OPTS
Created attachment 176408 [details] Fixed languages file
Created attachment 176411 [details] testcase - mail message UTF-8 RU
(In reply to comment #0)
> 1.wget http://www1.uralpress.ru/my/utf8.msg
> wget http://www1.uralpress.ru/my/languages

Please use the files from the attachments.
This is a misconfigured installation.

normalize_charset is not required on a UTF-8 aware system: all messages are normalized to UTF-8 by default.

ISO-8859-5 is NOT UTF-8. The messages are not available, so I can't reproduce the bug.

I think this bug is INVALID.
(In reply to comment #4)
> This is a misconfigured installation.
>
> normalize_charset is not required on a UTF-8 aware system: all messages
> are normalized to UTF-8 by default.
>
> ISO-8859-5 is NOT UTF-8. The messages are not available, so I can't reproduce the bug.
>
> I think this bug is INVALID.
>

See comment #3: the messages are in the attachments.

ISO-8859-5 is the charset detected by spamassassin; the encoding of the message was UTF-8.

normalize_charset: see http://spamassassin.apache.org/full/3.2.x/doc/Mail_SpamAssassin_Conf.html#language_options
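To illustrate why the encoding of the tokens matters here, below is a rough sketch of n-gram matching on raw bytes. It is not the TextCat plugin's code: the helper names (top_ngrams, out_of_place), the sample sentence, and the limits are all made up for the example, and the ranking is a simplified form of the usual "out-of-place" measure. It only shows that a profile built from ISO-8859-5 bytes barely overlaps with the same text encoded as UTF-8.

```python
from collections import Counter


def top_ngrams(data, max_n=3, limit=20):
    """Return the most frequent byte n-grams (lengths 1..max_n), most common first."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(data) - n + 1):
            counts[data[i:i + n]] += 1
    return [gram for gram, _ in counts.most_common(limit)]


def out_of_place(doc, model):
    """Sum of rank differences; n-grams missing from the model get a fixed penalty."""
    rank = {gram: r for r, gram in enumerate(model)}
    penalty = len(model)
    return sum(abs(r - rank[g]) if g in rank else penalty
               for r, g in enumerate(doc))


# Hypothetical sample text, not taken from the attached message.
text = "привет, это тестовое сообщение на русском языке"
utf8_bytes = text.encode("utf-8")
iso_bytes = text.encode("iso8859_5")

model_utf8 = top_ngrams(utf8_bytes)   # profile built from UTF-8 bytes
model_iso = top_ngrams(iso_bytes)     # profile built from ISO-8859-5 bytes
doc = top_ngrams(utf8_bytes)          # the normalized (UTF-8) message

d_same = out_of_place(doc, model_utf8)
d_other = out_of_place(doc, model_iso)
print("distance to UTF-8 profile:", d_same)
print("distance to ISO-8859-5 profile:", d_other)
```

If the languages file really holds ISO-8859-5 byte tokens under the ru.iso-8859-5 label, a message normalized to UTF-8 would end up far from that profile, which would be consistent with the "can't determine language uniquely enough" output.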
(In reply to comment #4)
> This is a misconfigured installation.
> I think this bug is INVALID.

You think. This is great. Try to reproduce it, or prove it's INVALID.
Created attachment 179423 [details] ru.iso-8859-5.lm