| Summary: | app-editors/vim-core-7.2.402: gentoo autocmd takes too long time when opening a large file | | |
|---|---|---|---|
| Product: | Gentoo Linux | Reporter: | Takano Akio <aljee> |
| Component: | Current packages | Assignee: | Vim Maintainers <vim> |
| Status: | RESOLVED TEST-REQUEST | | |
| Severity: | normal | | |
| Priority: | High | | |
| Version: | unspecified | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Package list: | | Runtime testing required: | --- |
| Attachments: | vimrc-r5 attempt 1, vimrc-r5 attempt 2 | | |
Description

Takano Akio 2010-04-01 04:36:57 UTC
These are good points!

(In reply to comment #0)
> First, since the default encoding of my locale is UTF-8, /etc/vim/vimrc
> shouldn't define g:added_fenc_utf8, which signifies that the default encoding
> is not UTF-8. Actually, it defines g:added_fenc_utf8 whenever the value of
> v:lang begins with ko, ja_JP, zh_TW or zh_CN, regardless of whether the
> locale's default is UTF-8.

Yes, I see the problem. Just above where we set g:added_fenc_utf8, we check v:lang and override the default fileencodings! I am not sure why we do this, but I believe the check should be moved until *after* the default is set, and should then append to the fileencodings list instead of replacing it wholesale.

> Second, the autocmd should search for [^\x00-\x7F], not for [\x80-\xFF],
> because there are non-ASCII characters that don't match [\x80-\xFF].

I'm not sure I fully appreciate the distinction here... Since the '\x' character class only matches against single-byte characters (0x00 through 0xFF), it looks to me like [^\x00-\x7F] is exactly equivalent to [\x80-\xFF]. What am I not seeing?

> Third, even when searching is required, I think it shouldn't search the entire
> file, which can be quite large.

Other than the potential issue of saving the cursor position, moving to the top of the file, and then restoring the position afterwards, which I could do, how does one decide how much of the file is enough to search? 10 lines? 80 lines? 300 lines?

I would consider perhaps putting in some sort of flag users could set, 'g:no-auto-encoding-scan' or something, so people who don't want this feature are not hit by the startup cost. But if the point is to ensure we don't force the wrong encoding only on files that have *no* non-ASCII characters, I'm not sure we can accurately do any less than scan the whole file.

I will be uploading a revised vimrc file shortly that addresses issue (1); please test it and let me know if it does the right thing.
Though I'm still a bit stuck figuring out what order of fileencodings is really correct... The choices are:

    ucs-bom,utf-8,euc-jp,default,latin1
    ucs-bom,euc-jp,utf-8,default,latin1
    ucs-bom,utf-8,default,latin1

Any input from your experience? If you wouldn't mind testing this out for me, I'd appreciate it very much!

Created attachment 226153 [details]
vimrc-r5 attempt 1
As promised, this should solve the main problem. Please test by replacing your /etc/vim/vimrc file with this one, and let me know what the results are.
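The "move the v:lang check after the default and append instead of replace" fix described above could be sketched roughly as follows. This is a hypothetical illustration, not the contents of the attached vimrc; note also that Vim variable names cannot contain hyphens, so an opt-out flag like the proposed 'g:no-auto-encoding-scan' would need a name such as g:no_auto_encoding_scan.

```vim
" Hypothetical sketch: set the default list first, then append
" locale-specific encodings rather than replacing the list wholesale.
set fileencodings=ucs-bom,utf-8,default,latin1

if v:lang =~# '^ja_JP'
  " Append, so utf-8 keeps its priority for UTF-8 locales.
  set fileencodings+=euc-jp
endif

" Opt-out flag as proposed in the discussion (name is illustrative):
if exists('g:no_auto_encoding_scan')
  finish
endif
```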
(In reply to comment #1)

Thank you, your vimrc file works well for me.

> > Second, the autocmd should search for [^\x00-\x7F], not for [\x80-\xFF],
> > because there are non-ASCII characters that don't match [\x80-\xFF].
>
> I'm not sure I fully appreciate the distinction here... Since the '\x'
> character class only matches against single-byte characters (0x00 through
> 0xFF), it looks to me like [^\x00-\x7F] is exactly equivalent to [\x80-\xFF].
> What am I not seeing?

If the 'encoding' option of vim is set to utf-8, [\x80-\xFF] matches characters in the range U+0080..U+00FF, which are not necessarily encoded in a single byte. For example, it matches 'é' (Latin small e with acute, U+00E9), whose representation in UTF-8 is the two-byte sequence "C3 A9". On the other hand, it does not match 'α' (Greek small alpha, U+03B1), because its code point is above 0xFF. [^\x00-\x7F] matches both.

I don't know exactly how these patterns behave when 'encoding' is not set to utf-8. However, I confirmed, with 'encoding' set to eucjp, that [^\x00-\x7F] matches 'あ' (hiragana a, U+3042), while [\x80-\xFF] does not.

> > Third, even when searching is required, I think it shouldn't search the entire
> > file, which can be quite large.
>
> Other than the potential issue of saving, moving the cursor to the top of the
> file, and then restoring the position afterwards, which I could do, how does
> one decide how much of the file is enough to search? 10 lines? 80 lines? 300
> lines?
>
> I would consider perhaps putting in some sort of flag users could set
> 'g:no-auto-encoding-scan' or something so people who don't want this feature
> are not hit by the startup cost, but if the point is to ensure we don't force
> the wrong encoding only on files that have *no* non-ascii characters, I'm not
> sure we can accurately do any less than scan the whole file.
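The distinction between the two character classes can be tried from a UTF-8 Vim session with match(), which applies a pattern to a string and returns the byte index of the first match, or -1 for no match. The expected results below follow from the behaviour described in the comment above:

```vim
" With 'encoding' set to utf-8:
echo match('é', '[\x80-\xFF]')    " 0   -- é is U+00E9, inside 0x80-0xFF
echo match('α', '[\x80-\xFF]')    " -1  -- α is U+03B1, above 0xFF
echo match('α', '[^\x00-\x7F]')   " 0   -- any code point above 0x7F matches
```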
Placing an arbitrary search limit (the top 300 lines, for example) looks good enough to me, because guessing character encodings is inherently inaccurate anyway. In any case, a user should at least be able to choose not to wait every time they open a large ASCII file.

> Though I'm still a bit stuck figuring out what order of fileencodings is really
> correct... The choices are:
>     ucs-bom,utf-8,euc-jp,default,latin1
>     ucs-bom,euc-jp,utf-8,default,latin1
>     ucs-bom,utf-8,default,latin1
> Any input from your experiences?

For ja_JP.UTF-8 users, euc-jp should not precede utf-8, because they have chosen UTF-8 as their default encoding.

Created attachment 315791 [details]
vimrc-r5 attempt 2
This latest attempt takes your point about the regex used for searching, and also adds a 1s timeout to the search for quicker load times on large files. Does this look okay? I promise I'll actually commit the fix this time! And my apologies for letting this sleep so long :)

Hi. Can you still reproduce this problem with vim 8? Let us know and reopen this bug if you do.
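A bounded scan along these lines (a stopline and a timeout) could be sketched with Vim's search() function, which accepts an optional stop line as its third argument and a timeout in milliseconds as its fourth. This is an illustration of the technique, not necessarily the exact code in the attached vimrc:

```vim
" 'n' = test only, don't move the cursor; 'w' = wrap around.
" Third argument (stopline): 0 means no line limit; a value such as
" 300 would scan only the top of the file. Fourth: timeout in ms.
if search('[^\x00-\x7F]', 'nw', 0, 1000) == 0
  " No non-ASCII character found within 1s: treat the buffer as
  " plain ASCII and leave the encoding alone.
endif
```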