Bug 312541

Summary:	app-editors/vim-core-7.2.402: gentoo autocmd takes too long time when opening a large file
Product:	Gentoo Linux	Reporter:	Takano Akio <aljee>
Component:	Current packages	Assignee:	Vim Maintainers <vim>
Status:	RESOLVED TEST-REQUEST
Severity:	normal
Priority:	High
Version:	unspecified
Hardware:	All
OS:	Linux
Whiteboard:
Package list:		Runtime testing required:	---
Attachments:	vimrc-r5 attempt 1 vimrc-r5 attempt 2

Description Takano Akio 2010-04-01 04:36:57 UTC

I'm running vim on the ja_JP.UTF-8 locale. When I try to open a large text file (10MBytes, for example), vim takes a lot of time executing an autocmd set by /etc/vim/vimrc:

  autocmd BufReadPost *
        \ if exists("g:added_fenc_utf8") && &fileencoding == "utf-8" &&
        \    ! &bomb && search('[\x80-\xFF]','nw') == 0 && &modifiable |
        \       set fileencoding= |
        \ endif

I think there are at least 3 problems.

First, since the default encoding of my locale is UTF-8, /etc/vim/vimrc shouldn't define g:added_fenc_utf8, which signifies that the default encoding is not UTF-8.
Actually, it defines g:added_fenc_utf8 whenever the value of v:lang begins with ko, ja_JP, zh_TW or zh_CN, regardless of whether the locale's default is UTF-8.

Second, the autocmd should search for [^\x00-\x7F], not for [\x80-\xFF], because there are non-ASCII characters that don't match [\x80-\xFF].

Third, even when searching is required, I think it shouldn't search the entire file, which can be quite large.

Reproducible: Always

Steps to Reproduce:
1. LC_MESSAGES=ja_JP.UTF-8 vim <large-text-file>
Actual Results:  
It takes a lot of time starting up.

Expected Results:  
It starts up as fast as in the case of LC_MESSAGES=en_US.UTF-8 vim <large-text-file>

emerge --info:
Portage 2.1.8.3 (default/linux/x86/10.0/desktop, gcc-4.4.3, glibc-2.11-r1, 2.6.33-gentoo i686)
=================================================================
System uname: Linux-2.6.33-gentoo-i686-Intel-R-_Celeron-R-_M_processor_1.30GHz-with-gentoo-2.0.1
Timestamp of tree: Thu, 01 Apr 2010 03:15:01 +0000
app-shells/bash:     4.1_p2
dev-java/java-config: 2.1.10
dev-lang/python:     2.6.4, 3.1.1-r1
dev-python/pycrypto: 2.1.0
dev-util/cmake:      2.6.2-r1
sys-apps/baselayout: 2.0.1
sys-apps/openrc:     0.6.0
sys-apps/sandbox:    2.2
sys-devel/autoconf:  2.13, 2.65
sys-devel/automake:  1.9.6-r2, 1.10.3, 1.11.1
sys-devel/binutils:  2.20.1
sys-devel/gcc:       4.4.3
sys-devel/gcc-config: 1.4.1
sys-devel/libtool:   2.2.6b
virtual/os-headers:  2.6.33
ACCEPT_KEYWORDS="x86 ~x86"
ACCEPT_LICENSE="* -@EULA dlj-1.1"
CBUILD="i686-pc-linux-gnu"
CFLAGS="-O2 -fomit-frame-pointer -march=pentium-m -pipe"
CHOST="i686-pc-linux-gnu"
CONFIG_PROTECT="/etc /usr/share/X11/xkb"
CONFIG_PROTECT_MASK="/etc/ca-certificates.conf /etc/env.d /etc/env.d/java/ /etc/fonts/fonts.conf /etc/gconf /etc/gentoo-release /etc/revdep-rebuild /etc/sandbox.d /etc/terminfo"
CXXFLAGS="-O2 -fomit-frame-pointer -march=pentium-m -pipe"
DISTDIR="/usr/portage/distfiles"
FEATURES="assume-digests distlocks fixpackages metadata-transfer news parallel-fetch protect-owned sandbox sfperms strict unmerge-logs unmerge-orphans userfetch"
GENTOO_MIRRORS="http://gentoo.gg3.net/ ftp://ftp.ecc.u-tokyo.ac.jp/GENTOO"
LANG="ja_JP.UTF-8"
LDFLAGS="-Wl,-O1"
LINGUAS="ja en"
PKGDIR="/usr/portage/packages"
PORTAGE_CONFIGROOT="/"
PORTAGE_RSYNC_OPTS="--recursive --links --safe-links --perms --times --compress --force --whole-file --delete --stats --timeout=180 --exclude=/distfiles --exclude=/local --exclude=/packages"
PORTAGE_TMPDIR="/var/tmp"
PORTDIR="/usr/portage"
PORTDIR_OVERLAY="/usr/portage/local/layman/haskell /usr/portage/local/hand"
SYNC="rsync://rsync.europe.gentoo.org/gentoo-portage"
USE="X a52 aac accessibility acpi alsa avi bash-completion berkdb bluetooth branding bzip2 cairo cdparanoia cdr cjk cli consolekit cracklib crypt cxx dri dts dvd dvdr emboss encode exif fbcon firefox flac gdbm gif gnutls gpm gtk gtk2 iconv ipv6 jack jpeg lcms mad mikmod mime mmap mmx mng modules mozsvg mp3 mp4 mpeg mudflap ncurses nls nptl nptlonly nsplugin ogg oggvorbis opengl openmp pam pango pcre pdf perl png ppds pppd profile qt3support readline reflection sdl session spl sse sse2 ssl startup-notification svg sysfs tcpd tiff truetype truetypr unicode usb vorbis win32codecs wxwindows x264 x86 xcb xml xml2 xorg xulrunner xv xvid zlib" ALSA_CARDS="intel8x0" ALSA_PCM_PLUGINS="adpcm alaw asym copy dmix dshare dsnoop empty extplug file hooks iec958 ioplug ladspa lfloat linear meter mmap_emul mulaw multi null plug rate route share shm softvol" APACHE2_MODULES="actions alias auth_basic auth_digest authn_anon authn_dbd authn_dbm authn_default authn_file authz_dbm authz_default authz_groupfile authz_host authz_owner authz_user autoindex cache dav dav_fs dav_lock dbd deflate dir disk_cache env expires ext_filter file_cache filter headers ident imagemap include info log_config logio mem_cache mime mime_magic negotiation proxy proxy_ajp proxy_balancer proxy_connect proxy_http rewrite setenvif so speling status unique_id userdir usertrack vhost_alias" ELIBC="glibc" INPUT_DEVICES="keyboard mouse synaptics" KERNEL="linux" LCD_DEVICES="bayrad cfontz cfontz633 glk hd44780 lb216 lcdm001 mtxorb ncurses text" LINGUAS="ja en" RUBY_TARGETS="ruby18" USERLAND="GNU" VIDEO_CARDS="vesa intel fbdev"
Unset:  CPPFLAGS, CTARGET, EMERGE_DEFAULT_OPTS, FFLAGS, INSTALL_MASK, LC_ALL, MAKEOPTS, PORTAGE_COMPRESS, PORTAGE_COMPRESS_FLAGS, PORTAGE_RSYNC_EXTRA_OPTS

Comment 1 Jim Ramsay (lack) (RETIRED) gentoo-dev

2010-04-01 16:41:05 UTC

These are good points!

(In reply to comment #0)
> First, since the default encoding of my locale is UTF-8, /etc/vim/vimrc
> shouldn't define g:added_fenc_utf8, which signifies that the default encoding
> is not UTF-8.
> Actually, it defines g:added_fenc_utf8 whenever the value of v:lang begins with
> ko, ja_JP, zh_TW or zh_CN, regardless of whether the locale's default is UTF-8.

Yes, I see the problem.  Just above where we set g:added_fenc_utf8, we are checking v:lang and overriding the default fileencodings!  I am not sure why we are doing this, but I believe it should be moved until *after*, and then just append to the fileencodings list instead of replacing it wholesale.

> Second, the autocmd should search for [^\x00-\x7F], not for [\x80-\xFF],
> because there are non-ASCII characters that don't match [\x80-\xFF].

I'm not sure I fully appreciate the distinction here... Since the '\x' character class only matches against single-byte characters (0x00 through 0xFF), it looks to me like [^\x00-\x7F] is exactly equivalent to [\x80-\xFF].  What am I not seeing?

> Third, even when searching is required, I think it shouldn't search the entire
> file, which can be quite large.

Other than the potential issue of saving, moving the cursor to the top of the file, and then restoring the position afterwards, which I could do, how does one decide how much of the file is enough to search?  10 lines? 80 lines? 300 lines?

I would consider perhaps putting in some sort of flag users could set 'g:no-auto-encoding-scan' or something so people who don't want this feature are not hit by the startup cost, but if the point is to ensure we don't force the wrong encoding only on files that have *no* non-ascii characters, I'm not sure we can accurately do any less than scan the whole file.

I will be uploading a revised vimrc file shortly that addresses issue (1), please test it and let me know if it does the right thing.

Though I'm still a bit stuck figuring out what order of fileencodings is really correct...  The choices are:
  ucs-bom,utf-8,euc-jp,default,latin1
  ucs-bom,euc-jp,utf-8,default,latin1
  ucs-bom,utf-8,default,latin1
Any input from your experiences?

If you wouldn't mind testing this out for me, I'd appreciate it very much!

Comment 2 Jim Ramsay (lack) (RETIRED) gentoo-dev

2010-04-01 16:44:23 UTC

Created attachment 226153 [details]
vimrc-r5 attempt 1

As promised, this should solve the main problem.  Please test by replacing your /etc/vim/vimrc file with this one, and let me know what the results are.

Comment 3 Takano Akio 2010-04-02 12:51:21 UTC

(In reply to comment #1)

Thank you, your vimrc file works well for me.

> > Second, the autocmd should search for [^\x00-\x7F], not for [\x80-\xFF],
> > because there are non-ASCII characters that don't match [\x80-\xFF].
> 
> I'm not sure I fully appreciate the distinction here... Since the '\x'
> character class only matches against single-byte characters (0x00 through
> 0xFF), it looks to me like [^\x00-\x7F] is exactly equivalent to [\x80-\xFF]. 
> What am I not seeing?

If the `encoding' option of vim is set to utf-8, [\x80-\xFF] matches against characters in the range U+0080 .. U+00FF, not necessarily encoded in a single byte. For example, it matches against 'é' (Latin small e with acute, U+00e9), whose representation in UTF-8 is a two-byte sequence "C3 A9". On the other hand, it does not match against 'α' (Greek small alpha, U+03b1), because its code point is above 0xFF. [^\x00-\x7F] matches against the both.

I don't know exactly how these patterns work if `encoding' is not set to utf-8. However I confirmed, with `encoding' set to eucjp, that [^\x00-\x7F] matches against 'あ' (hiragana a, U+3042), while [\x80-\xFF] does not.

> 
> > Third, even when searching is required, I think it shouldn't search the entire
> > file, which can be quite large.
> 
> Other than the potential issue of saving, moving the cursor to the top of the
> file, and then restoring the position afterwards, which I could do, how does
> one decide how much of the file is enough to search?  10 lines? 80 lines? 300
> lines?
> 
> I would consider perhaps putting in some sort of flag users could set
> 'g:no-auto-encoding-scan' or something so people who don't want this feature
> are not hit by the startup cost, but if the point is to ensure we don't force
> the wrong encoding only on files that have *no* non-ascii characters, I'm not
> sure we can accurately do any less than scan the whole file.

Placing an arbitrary search limit (top 300 lines for example) looks good enough to me, because guessing character encodings is often inaccurate in nature. Anyway, a user should at least be able to choose not to wait each time opening a large ASCII file.

> Though I'm still a bit stuck figuring out what order of fileencodings is really
> correct...  The choices are:
>   ucs-bom,utf-8,euc-jp,default,latin1
>   ucs-bom,euc-jp,utf-8,default,latin1
>   ucs-bom,utf-8,default,latin1
> Any input from your experiences?

For ja_JP.UTF-8 users, euc-jp should not precede utf-8, because they have chosen UTF-8 as their default encoding.

Comment 4 Jim Ramsay (lack) (RETIRED) gentoo-dev

2012-06-19 19:17:26 UTC

Created attachment 315791 [details]
vimrc-r5 attempt 2

Comment 5 Jim Ramsay (lack) (RETIRED) gentoo-dev

2012-06-19 19:18:35 UTC

This latest attempt takes your point about the regex used for searching, and also adds a 1s timeout to the search for quicker load times on large files.

Does this look okay?  I promise I'll actually commit the fix this time!

And my apologies for letting this sleep so long :)

Comment 6 Patrice Clement (RETIRED) gentoo-dev

2017-07-29 21:44:04 UTC

Hi. Can you still reproduce this problem with vim 8? Let us know and reopen that bug you do.