vim-6.3.068 has started to make some strange charset conversions. If I create a new file with vim and insert some 7-bit text, the file will be just that: plain ASCII. If I create another new file and insert some 8-bit text, the file will be latin1 text. That's how things have always been.

Reproducible: Always

Steps to Reproduce:
If I open an *existing* *7-bit* text file in vim, vim will instantly convert it into UTF-8 (apparent from the text " [converted] " on the status line). If I then add some 8-bit characters and save, the file will be in UTF-8 format. On the other hand, if I open an existing latin1 text file, no conversion occurs.

Expected Results:
Vim should respect my locale settings. Files should only be converted to UTF-8 if the locale says so.

This has a rather nasty effect with mutt. When the user writes a new message, mutt first creates a plain text file (with the user's signature, if one is used) and then opens it in $EDITOR. So the user writes a message, vim saves it in UTF-8, mutt thinks it is still latin1 or latin9 and sets the headers accordingly... See bug 87424.
You should override &fileencodings to better suit your locale. The way it is set in the vimrc provided by vim-core-6.3.068 is about the best we could come up with to avoid having vim munge files. See :help 'fileencodings' and /etc/vim/vimrc (you can override it in ~/.vimrc or /etc/vim/vimrc.local). Please reopen if you can come up with a saner set of rules than those we have currently.
I might have something... How about changing /etc/vim/vimrc like this:

--------------------
--- vimrc.utf8	2005-04-03 07:05:33.000000000 +0300
+++ vimrc	2005-04-05 21:34:12.184209500 +0300
@@ -71,9 +71,11 @@
   set fileencodings^=ucs-bom
 endif
 
-" Always check for UTF-8 when trying to determine encodings.
-if &fileencodings !~? "utf-8"
-  set fileencodings+=utf-8
+if v:lang =~? "utf"
+  " Check for UTF-8 when trying to determine encodings.
+  if &fileencodings !~? "utf-8"
+    set fileencodings+=utf-8
+  endif
 endif
 
 " Make sure we have a sane fallback for encoding detection
--------------------

This lets me get utf-8 files when my locale is set to accept utf, and latin1 (or some other default?) when not. Is this sane? Sure, it will make vim munge utf files if the locale is set wrong, but isn't it supposed to?
No go. Vim must not be made to munge utf-8 files regardless of locale. Editing files which use an encoding other than the one in the active locale is entirely legitimate.
I suppose so, and it is very nice to be able to open utf-8 files on a latin1 terminal. But there must be a way to make vim stick to latin1 when latin1 is sufficient and the active locale does not use utf. Use 7-bit ASCII when only 7-bit characters are needed, convert the file to latin1 when latin1 characters appear and convert to utf-8 only when latin1 is not enough (if the active locale uses utf-8, *then* utf-8 is naturally the way to go). Vim does seem to behave in this way when creating new files: a new file with 8-bit latin1 characters will be saved in latin1 and it will remain so afterwards. Somehow I'd imagine that vim's default fileencodings setting 'ucs-bom,utf-8,latin1' would have this effect, but it doesn't. Isn't this a bug?
The problem is that if there are no high-bit characters, it's impossible to tell whether a file is utf8 or latin1. And, since all files are valid latin1, including utf8 files, utf8 has to go before latin1 in the list.
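The asymmetry described above is easy to verify. A short Python snippet (used here purely for illustration; vim's own detection is in C) shows that every byte sequence is valid latin1, that UTF-8 is strict about high-bit bytes, and that a pure 7-bit file is byte-identical under both encodings:

```python
# Every possible byte value decodes under latin1, so *any* file is
# "valid latin1" -- including files that are really UTF-8.
data = bytes(range(256))
as_latin1 = data.decode("latin-1")      # never raises
assert len(as_latin1) == 256

# UTF-8 is stricter: a lone high-bit byte is rejected.
try:
    b"\xe4".decode("utf-8")             # latin1 'a-umlaut' on its own
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
assert not utf8_ok

# A pure 7-bit file is byte-identical in latin1 and utf-8, so there
# is nothing in the bytes to distinguish the two encodings.
ascii_text = "hello\n"
assert ascii_text.encode("latin-1") == ascii_text.encode("utf-8")
```

This is why utf-8 must come before latin1 in 'fileencodings': latin1 never fails, so it would otherwise swallow every file.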
Um, how can a utf8 encoded file be valid latin1? Latin1 only has single-byte characters whereas utf8 has both single-byte and multi-byte characters (meaning encoding, of course). Also, if there are no high-bit characters, the file is plain 7-bit ASCII, no? If high-bit characters are entered during editing, their existence can be seen when the file is saved. And when the file is saved, converting to latin1 encoding is enough if the file contains only latin1 characters.
Forget what I said about saving in the last comment... What vim should do when opening an existing file is check if there are high-bit characters in the file. If there are, you can tell if the file is utf-8 or latin1 because the encoding is different, right? Then set the encoding in vim based on the file. If the file contains only 7-bit characters, set the encoding to that used in the active locale. This way, if the locale uses utf-8 any high-bit characters will be saved in utf-8 encoding and if the locale uses latin1, the files get "latinized".
Well, consider a file whose entire content is c2a30a (hex encoded). This is a valid utf8 sequence representing a pound sign followed by a newline. It's also a valid latin1 sequence representing a capital A with a hat, then a pound sign, then a newline. So how do you tell which it's supposed to be?
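Both readings of that byte sequence really are legal, which a quick Python check confirms (Python only used for illustration):

```python
data = b"\xc2\xa3\x0a"

# Read as UTF-8: pound sign + newline (2 characters).
assert data.decode("utf-8") == "\u00a3\n"          # '£' '\n'

# Read as latin1: A-circumflex + pound sign + newline (3 characters).
assert data.decode("latin-1") == "\u00c2\u00a3\n"  # 'Â' '£' '\n'
```

Neither decode raises an error, so the bytes alone cannot settle the question.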
Ask the user? Respect the locale? What does vim do now if the user opens this? As I pointed out earlier, when I open an existing file (say, containing just the letter ä (small letter a with diaeresis) in latin1 encoding), the file will not be converted to utf, so (in easy cases at least) vim already does the right thing. BTW, if you insert the text "c2a30a" into a file, you expect it to be saved as text, not hex. :)
vim does the right thing when it gets a file that can't be utf8. In cases where there's no way to tell whether it's latin1 or utf8, vim will go for utf8. Try this: echo 6dc3b8c3b873650a | xxd -p -r > foo Then figure out how to determine what foo is encoded as.
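The file produced by that command is genuinely ambiguous; decoding the same bytes both ways succeeds (a Python illustration, not what vim itself runs):

```python
# The same bytes the shell command writes into foo.
data = bytes.fromhex("6dc3b8c3b873650a")

# As UTF-8: "møøse" -- perfectly plausible text.
assert data.decode("utf-8") == "m\u00f8\u00f8se\n"

# As latin1: "mÃ¸Ã¸se" -- ugly, but every byte is legal latin1.
assert data.decode("latin-1") == "m\u00c3\u00b8\u00c3\u00b8se\n"
```

With both decodes succeeding, any automatic chooser has to pick one by policy, which is exactly the ordering question 'fileencodings' answers.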
That is a hard one. "file foo" thinks it's UTF-8. :) In this case it might be ok to open it as utf-8, though I'd still say "use the locale" and maybe add a warning if vim is unsure about the encoding. Still, there is nothing ambiguous about a plain 7-bit file, is there? Try this: echo a > foo What makes vim think it should be utf-8 when there are no high-bit characters in the file? If high-bit characters are now inserted, vim should absolutely use the user's intended charset (from the locale) as default.
Ok, how would that be implemented?
Can't think of any configuration options that would help, but I found a couple of places in fileio.c that might be hackable:

1) readfile() makes plenty of checks when deciding which encoding to use, so why not add one more? Let it try opening the file in the active locale's encoding before trying the items in 'fileencodings'.

2) Change next_fenc() so that the first fileencoding it returns is always the one "closest" to the active locale. If "fileencodings=ucs-bom,utf-8,latin1" but the active locale prefers latin1, vim should try latin1 first. If that works, great; otherwise try ucs-bom and utf-8.

I don't know if this would work. My C is at best read-only, so I cannot really try it out. :(
Sigh. A comment numbered #13 was bound to be bad. If the active locale's encoding is set to latin1, trying it first would only have the same effect as "fileencodings=ucs-bom,latin1,utf-8" which certainly does not work... What I meant is "try opening the file in 7-bit mode before trying items in 'fileencodings'". Or maybe there could also be a way to set fileencodings to something like "ucs-bom,7bit,utf-8,latin1" to prevent it from jumping straight to utf-8 when high-bit characters are inserted? That would take care of my worst problem, but it would still mean that in ambiguous cases (as in Ciaran's comment #8) vim would munge legal latin1 files in order to avoid munging utf-8 files. I don't think that's good either. Anybody else have any good ideas? Am I the only one who still prefers latin1?
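The "7bit"-first detection order could be sketched roughly like this (a Python sketch of the proposed logic, not vim's actual readfile() code; detect_encoding and its parameters are made-up names for illustration):

```python
def detect_encoding(data: bytes, locale_enc: str = "latin-1") -> str:
    """Proposed order: a pure 7-bit file takes the active locale's
    encoding; files with high-bit bytes are tried as utf-8 first,
    falling back to latin1."""
    if all(b < 0x80 for b in data):
        # No high-bit bytes: the file is byte-identical in ASCII,
        # latin1 and utf-8, so honour the active locale.
        return locale_enc
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "latin-1"

# A plain 7-bit file keeps the locale's encoding either way:
assert detect_encoding(b"hello\n") == "latin-1"
assert detect_encoding(b"hello\n", locale_enc="utf-8") == "utf-8"
# Files with high-bit bytes are still guessed as before:
assert detect_encoding("m\u00f8\u00f8se\n".encode("utf-8")) == "utf-8"
assert detect_encoding(b"\xe4h\xe4n\n") == "latin-1"   # invalid utf-8
```

Note that this only fixes the 7-bit case; as the comment says, ambiguous files that are valid both ways would still be taken for utf-8.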
Marko, you are certainly not the only user suffering from latin1-to-utf8 conversions, especially when writing emails with vim. I suppose all European users writing German, Spanish, French and so on have this problem. As I understand all the previous comments, there is still no accepted solution. As a workaround, I created a vimrc.local which makes each user's LC_CTYPE setting the primary fileencoding, unless vim hits a conversion error when trying it. When such a user wants to edit a UTF-8 file, he/she won't of course get an error, but must do

:set fileencodings=utf-8
:e!

which is a workaround for me. But this is still bad. This is my /etc/vim/vimrc.local:

-- SNIP
" Maiks extended locale settings
if v:lang =~? "^ko"
  set fileencodings=euc-kr
  set guifontset=-*-*-medium-r-normal--16-*-*-*-*-*-*-*
elseif v:lang =~? "^ja_JP"
  set fileencodings=euc-jp
  set guifontset=-misc-fixed-medium-r-normal--14-*-*-*-*-*-*-*
elseif v:lang =~? "^zh_TW"
  set fileencodings=big5
  set guifontset=-sony-fixed-medium-r-normal--16-150-75-75-c-80-iso8859-1,-taipei-fixed-medium-r-normal--16-150-75-75-c-160-big5-0
elseif v:lang =~? "^zh_CN"
  set fileencodings=gb2312
  set guifontset=*-r-*
elseif v:ctype =~? "^de_DE"
  set fileencodings=iso-8859-15
endif

" If we have a BOM, always honour that rather than trying to guess.
if &fileencodings !~? "ucs-bom"
  set fileencodings^=ucs-bom
endif

" Always check for UTF-8 when trying to determine encodings.
if &fileencodings !~? "utf-8"
  set fileencodings+=utf-8
endif

" Make sure we have a sane fallback for encoding detection
set fileencodings+=default
" }}}
-- SNAP
Hi! I found this bug report after messing with my computer for days. I almost threw the whole thing out of the window. I mean it. I wanted to use Mutt (a console email client) together with vim as editor. I use {LANG,LC_ALL}=de_DE@euro. Every time I tried sending a mail containing umlauts or other special characters, something would mess up the whole text. I tried to seek some help on different forums, to no avail. Finally I was told by some m*****f***** that I needed a UTF-8 enabled system to do this right. As a last resort, and after fiddling for hours, I compiled my whole box with unicode support. And it still didn't work. Somehow I stumbled on "fileencodings" and found out that Gentoo's default vimrc set it to utf no matter what. I moved the /etc/vim dir somewhere else and it worked. I put the dir back and made up my own .vimrc afterwards. I mean, can you knowingly put a user into a mess like this? Cheers, mic
Please reopen if you can suggest a saner set of rules than the existing lot.