87953 – Vim converts files into UTF-8 ignoring locale

Status:	RESOLVED CANTFIX

Alias:	None

Product:	Gentoo Linux
Classification:	Unclassified
Component:	Current packages
Hardware:	All Linux

Importance:	High normal
Assignee:	Vim Maintainers

Reported:	2005-04-04 13:08 UTC by Marko Vallius
Modified:	2005-08-04 17:25 UTC
CC List:	1 user

Description Marko Vallius 2005-04-04 13:08:37 UTC

vim-6.3.068 has started to make some strange charset conversions. If I create a new file with vim and insert some 7-bit text, the file will be just that: plain ASCII. If I create another new file and insert some 8-bit text, the file will be latin1 text. That's as things have always been.



Reproducible: Always
Steps to Reproduce:
Now if I open an *existing* *7-bit* text file in vim, vim will instantly convert it into UTF-8 (that's apparent from the text " [converted] " on the status line). If I add some 8-bit characters and save, the file will be in UTF-8 format. On the other hand, if I open an existing latin1 text file, no conversion occurs. 



Expected Results:  
Vim should respect my locale settings. Files should only be converted to  
UTF-8 if the locale says so.  

This has a rather nasty effect with mutt. When the user writes a new message,   
mutt first creates a plain text file (with the user's signature, if one is   
used) and then opens it in $EDITOR. So the user writes a message, vim saves it   
in UTF-8, mutt thinks it is still latin1 or latin9 and sets the headers   
accordingly... See bug 87424.

Comment 1 Ciaran McCreesh 2005-04-05 11:02:07 UTC

You should override &fileencodings to better suit your locale. The way it is in the vim-core-6.3.068 provided vimrc is about the best we could come up with to avoid having vim munge files. See :help 'fileencodings' and /etc/vim/vimrc (you can override in ~/.vimrc or /etc/vim/vimrc.local).

Please reopen if you can come up with a saner set of rules than those we have currently.

Comment 2 Marko Vallius 2005-04-05 11:48:27 UTC

I might have something... How about changing /etc/vim/vimrc like this:

--------------------
--- vimrc.utf8  2005-04-03 07:05:33.000000000 +0300
+++ vimrc       2005-04-05 21:34:12.184209500 +0300
@@ -71,9 +71,11 @@
   set fileencodings^=ucs-bom
 endif
 
-" Always check for UTF-8 when trying to determine encodings.
-if &fileencodings !~? "utf-8"
-  set fileencodings+=utf-8
+if v:lang =~? "utf"
+  " Check for UTF-8 when trying to determine encodings.
+  if &fileencodings !~? "utf-8"
+    set fileencodings+=utf-8
+  endif
 endif
 
 " Make sure we have a sane fallback for encoding detection
--------------------

This lets me get utf-8 files when my locale is set to accept utf and latin1
(or some other default?) when not. Is this sane? Sure it will make vim munge
utf files if the locale is set wrong, but isn't it supposed to?

Comment 3 Ciaran McCreesh 2005-04-05 13:47:11 UTC

No go. Vim must not be made to munge utf-8 files regardless of locale. Editing files which use an encoding other than the one in the active locale is entirely legitimate.

Comment 4 Marko Vallius 2005-04-05 20:58:06 UTC

I suppose so, and it is very nice to be able to open utf-8 files on a latin1
terminal. But there must be a way to make vim stick to latin1 when latin1
is sufficient and the active locale does not use utf. Use 7-bit ASCII when
only 7-bit characters are needed, convert the file to latin1 when latin1
characters appear and convert to utf-8 only when latin1 is not enough (if the
active locale uses utf-8, *then* utf-8 is naturally the way to go). Vim does
seem to behave in this way when creating new files: a new file with 8-bit
latin1 characters will be saved in latin1 and it will remain so afterwards.

Somehow I'd imagine that vim's default fileencodings setting 'ucs-bom,utf-8,latin1' would have this effect, but it doesn't. 
Isn't this a bug?

Comment 5 Ciaran McCreesh 2005-04-05 21:26:34 UTC

The problem is that if there are no high-bit characters, it's impossible to tell whether a file is utf8 or latin1. And, since all files are valid latin1, including utf8 files, utf8 has to go before latin1 in the list.

Comment 6 Marko Vallius 2005-04-05 21:47:49 UTC

Um, how can a utf8 encoded file be valid latin1? Latin1 only has single-byte
characters whereas utf8 has both single-byte and two-byte characters (meaning
encoding, of course). 

Also, if there are no high-bit characters, the file is plain 7-bit ASCII, no?
If high-bit characters are entered during editing, their existence can be seen
when the file is saved. And when the file is saved, converting to latin1
encoding is enough if the file contains only latin1 characters.

Comment 7 Marko Vallius 2005-04-05 22:06:15 UTC

Forget what I said about saving in the last comment...

What vim should do when opening an existing file is check if there are
high-bit characters in the file. If there are, you can tell if the file 
is utf-8 or latin1 because the encoding is different, right? Then set the
encoding in vim based on the file. If the file contains only 7-bit characters,
set the encoding to that used in the active locale. 

This way, if the locale uses utf-8 any high-bit characters will be saved 
in utf-8 encoding and if the locale uses latin1, the files get "latinized".
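
The rule proposed in this comment can be sketched in shell. This is only an illustration of the proposal, not how Vim works; `pick_encoding` is a hypothetical helper, and the locale's charset is passed in by hand as an assumption.

```shell
#!/bin/sh
# Hypothetical helper, not part of Vim: choose an encoding for a file
# following the rule proposed above.
#   $1 = file to inspect
#   $2 = charset of the active locale (e.g. ISO-8859-1 or UTF-8)
pick_encoding() {
    pe_file=$1
    pe_locale_charset=$2
    # Count the bytes that have the high bit set (0x80-0xFF).
    pe_high=$(LC_ALL=C tr -d '\000-\177' < "$pe_file" | wc -c | tr -d ' ')
    if [ "$pe_high" -eq 0 ]; then
        # Pure 7-bit content: nothing to detect, so follow the locale.
        echo "$pe_locale_charset"
    elif iconv -f UTF-8 -t UTF-8 < "$pe_file" > /dev/null 2>&1; then
        # The high-bit bytes form valid UTF-8 sequences.
        echo "UTF-8"
    else
        # High-bit bytes that are not valid UTF-8: treat as latin1.
        echo "ISO-8859-1"
    fi
}
```

For example, `pick_encoding message.txt ISO-8859-1` (with `message.txt` standing in for any file) would print `ISO-8859-1` for a 7-bit file. Note that the second branch still picks UTF-8 whenever the bytes happen to be valid UTF-8, which is the ambiguity discussed in the following comments.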

Comment 8 Ciaran McCreesh 2005-04-05 22:07:54 UTC

Well, consider a file whose entire content is c2a3a0 (hex encoded). This is a valid utf8 sequence representing a pound sign followed by a newline. It's also a valid latin1 sequence representing a capital A with a hat, then a pound sign, then a newline. So how do you tell which it's supposed to be?
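
A small shell sketch of this ambiguity, using iconv. It assumes the intended final byte is 0a (a newline) rather than the a0 written above, and `pound.txt` is just a scratch file name.

```shell
# Assumption: the last byte is 0a (newline); the comment writes "a0".
# \302\243\012 are the octal forms of the bytes c2 a3 0a.
printf '\302\243\012' > pound.txt

# Read as UTF-8: the sequence is valid, so iconv exits successfully.
iconv -f UTF-8 -t UTF-8 < pound.txt > /dev/null && echo "valid UTF-8"

# Read as latin1: also valid, because every byte maps to a character.
# Re-encoding that reading as UTF-8 gives the bytes c3 82 c2 a3 0a,
# i.e. capital A with circumflex, pound sign, newline.
iconv -f ISO-8859-1 -t UTF-8 < pound.txt | od -An -tx1
```

Both conversions succeed, so the bytes alone do not say which reading was intended.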

Comment 10 Marko Vallius 2005-04-05 22:22:11 UTC

Ask the user? Respect the locale? What does vim do now if the user opens this?
As I pointed out earlier, when I open an existing file (say, containing just
the letter ä (small letter a with diaeresis) in latin1 encoding), the file will
not be converted to utf, so (in easy cases at least) vim already does the
right thing. 

BTW, if you insert the text "c2a3a0" into a file, you expect it to be saved 
as text, not hex. :)

Comment 11 Ciaran McCreesh 2005-04-05 22:35:03 UTC

vim does the right thing when it gets a file that can't be utf8. In cases where there's no way to tell whether it's latin1 or utf8, vim will go for utf8.

Try this:

echo 6dc3b8c3b873650a | xxd -p -r > foo

Then figure out how to determine what foo is encoded as.
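
To see why the order of the candidates matters, here is a rough shell sketch of "the first encoding that decodes cleanly wins". It is only an approximation of the idea behind 'fileencodings', not Vim's code; `first_matching_encoding` is a hypothetical helper, and the file is created with printf so that xxd is not needed.

```shell
# Hypothetical helper: print the first encoding in the argument list
# under which the file decodes without error.
#   $1 = file, remaining arguments = encodings to try, in order
first_matching_encoding() {
    fme_file=$1
    shift
    for fme_enc in "$@"; do
        if iconv -f "$fme_enc" -t UTF-8 < "$fme_file" > /dev/null 2>&1; then
            echo "$fme_enc"
            return 0
        fi
    done
    return 1
}

# Same bytes as the xxd command above: 6d c3 b8 c3 b8 73 65 0a
printf 'm\303\270\303\270se\n' > foo

first_matching_encoding foo UTF-8 ISO-8859-1   # prints UTF-8
first_matching_encoding foo ISO-8859-1 UTF-8   # prints ISO-8859-1
```

Because latin1 accepts any byte sequence, putting it first means UTF-8 is never reached; putting UTF-8 first means an ambiguous file is always read as UTF-8.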

Comment 12 Marko Vallius 2005-04-06 00:22:01 UTC

That is a hard one. "file foo" thinks it's UTF-8. :) In this case it might be
ok to open it as utf-8, though I'd still say "use the locale" and maybe add a
warning if vim is unsure about the encoding.

Still, there is nothing ambiguous about a plain 7-bit file, is there?
Try this:

echo a > foo

What makes vim think it should be utf-8 when there are no high-bit characters in the file? If high-bit characters are now inserted, vim should absolutely
use the user's intended charset (from the locale) as default.
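
A quick shell check of the 7-bit case, as a sketch with scratch file names (`plain.txt`, `as_utf8.txt`): the file has the same bytes in ASCII, latin1 and UTF-8, so the choice of encoding only shows up on disk once a high-bit character is added.

```shell
# A 7-bit file is byte-for-byte identical in latin1 and UTF-8.
printf 'a\n' > plain.txt
iconv -f ISO-8859-1 -t UTF-8 < plain.txt > as_utf8.txt
cmp -s plain.txt as_utf8.txt && echo "identical bytes"

# The encodings diverge as soon as a character such as ä is inserted.
# \344 is the octal form of e4, the latin1 byte for ä.
printf '\344\n' | od -An -tx1                                  # latin1: e4 0a
printf '\344\n' | iconv -f ISO-8859-1 -t UTF-8 | od -An -tx1   # UTF-8: c3 a4 0a
```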

Comment 13 Ciaran McCreesh 2005-04-06 00:42:43 UTC

Ok, how would that be implemented?

Comment 14 Marko Vallius 2005-04-06 11:36:47 UTC

Can't think of any configuration options that would help, but I find
a couple of places in fileio.c that might be hackable:

1) readfile() makes plenty of checks when deciding which encoding to use
so why not add one more? Let it try opening the file in active locale's 
encoding before trying items in 'fileencodings'. 

2) Change next_fenc() so that the first fileencoding it returns is always 
the one "closest" to the active locale.

If "fileencodings=ucs-bom,utf-8,latin1" but the active locale prefers
latin1, vim should try latin1 first. If it works, that's great; otherwise
try ucs-bom and utf-8.

I don't know if this would work. My C is at best read-only, so I cannot 
really try it out. :(

Comment 15 Marko Vallius 2005-04-07 09:05:28 UTC

Sigh. A comment numbered #13 was bound to be bad. If the active locale's
encoding is set to latin1, trying it first would only have the same
effect as "fileencodings=ucs-bom,latin1,utf-8" which certainly does not
work... What I meant is "try opening the file in 7-bit mode before trying
items in 'fileencodings'". Or maybe there could also be a way to set
fileencodings to something like "ucs-bom,7bit,utf-8,latin1" to prevent
it from jumping straight to utf-8 when high-bit characters are inserted?

That would take care of my worst problem, but it would still mean that
in ambiguous cases (as in Ciaran's comment #8) vim would munge legal
latin1 files in order to avoid munging utf-8 files. I don't think that's
good either.

Anybody else have any good ideas? Am I the only one who still prefers
latin1?

Comment 16 Maik Musall 2005-07-04 09:56:45 UTC

Marko, you are certainly not the only user suffering from
latin1-to-utf8-conversions, especially when writing emails with vim. I suppose
all European users writing German, Spanish, French and so on have this problem.

As I understand all the previous comments, there is still no accepted solution.
As a workaround, I created a vimrc.local which makes the LC_CTYPE setting of all
users the primary fileencoding, unless there's a conversion error when vim tries
using it. When such a user wants to edit a UTF-8 file, he/she won't of course
get an error but must do

:set fileencodings=utf-8
:e!

which is a workaround for me. But this is still bad.

This is my /etc/vim/vimrc.local:

-- SNIP
" Maiks extended locale settings
if v:lang =~? "^ko"
  set fileencodings=euc-kr
  set guifontset=-*-*-medium-r-normal--16-*-*-*-*-*-*-*
elseif v:lang =~? "^ja_JP"
  set fileencodings=euc-jp
  set guifontset=-misc-fixed-medium-r-normal--14-*-*-*-*-*-*-*
elseif v:lang =~? "^zh_TW"
  set fileencodings=big5
  set guifontset=-sony-fixed-medium-r-normal--16-150-75-75-c-80-iso8859-1,-taipei-fixed-medium-r-normal--16-150-75-75-c-160-big5-0
elseif v:lang =~? "^zh_CN"
  set fileencodings=gb2312
  set guifontset=*-r-*
elseif v:ctype =~? "^de_DE"
  set fileencodings=iso-8859-15
endif

" If we have a BOM, always honour that rather than trying to guess.
if &fileencodings !~? "ucs-bom"
  set fileencodings^=ucs-bom
endif

" Always check for UTF-8 when trying to determine encodings.
if &fileencodings !~? "utf-8"
  set fileencodings+=utf-8
endif

" Make sure we have a sane fallback for encoding detection
set fileencodings+=default
" }}}
-- SNAP

Comment 17 Sebastian 2005-07-26 06:40:26 UTC

Hi!

I found this bug report after messing with my computer for days. I almost threw
the whole thing out of the window. I mean it.

I wanted to use Mutt (a console email client) together with vim as editor. I use
{LANG,LC_ALL}=de_DE@euro. Every time I tried sending a mail containing umlauts or
other special characters something would mess up the whole text.

I tried to seek some help on different forums to no avail. Finally I was told by
some m*****f***** that I needed a UTF8 enabled system to do this right. As a
last resort and after fiddling for hours I compiled my whole box with unicode
support. And it still didn't work.

Somehow I stumbled on "fileencodings" and found out that Gentoo's default vimrc
set it to utf no matter what. I moved the /etc/vim dir somewhere else and it
worked. I put the dir back and made up my own .vimrc afterwards.

I mean, can you knowingly put a user into a mess like this?

Cheers

mic

Comment 18 Ciaran McCreesh 2005-08-04 17:25:44 UTC

Please reopen if you can suggest a saner set of rules than the existing lot.