504778 – dev-python/pytz-2014.1 - setup.py: README.txt: UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 14839: ordinal not in range(128)


Status:	RESOLVED FIXED

Alias:	None

Product:	Gentoo Linux
Classification:	Unclassified
Component:	[OLD] Development
Hardware:	All Linux

Importance:	Normal normal
Assignee:	Python Gentoo Team

URL:
Whiteboard:
Keywords:	PATCH

Depends on:
Blocks:

Reported:	2014-03-16 10:24 UTC by Alex Turbov
Modified:	2014-03-22 21:28 UTC
CC List:	5 users

See Also:
Package list:
Runtime testing required:	---

Attachments
patch to fix the problem (pytz-2014.1-fix-setup.patch, 503 bytes, patch) 2014-03-16 10:25 UTC, Alex Turbov

Description Alex Turbov 2014-03-16 10:24:09 UTC

I get the following error with Python 3.3 as the default interpreter:

Traceback (most recent call last):
  File "setup.py", line 23, in <module>
    long_description=open('README.txt','r').read(),
  File "/usr/lib64/python3.3/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 14839: ordinal not in range(128)


Reproducible: Always
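
For context, a minimal sketch of how this failure arises, assuming the build runs under a locale whose preferred encoding is ASCII (for example LANG=C). On Python 3, open() in text mode without an encoding argument decodes with locale.getpreferredencoding(False). The sample text and temporary file below are illustrative stand-ins for README.txt, and the sketch passes encoding="ascii" explicitly so the result does not depend on the local environment.

```python
import io
import locale
import os
import tempfile

# What open() would use by default for text files on this machine.
default_encoding = locale.getpreferredencoding(False)

# Illustrative stand-in for README.txt: ASCII text plus the bytes 0xe2 0x80 0x94.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "README.txt")
with io.open(path, "wb") as handle:
    handle.write(b"UTC+0800 \xe2\x80\x94 hours to add or subtract\n")

# Force the ASCII codec to mimic a build environment with an ASCII locale.
try:
    with io.open(path, "r", encoding="ascii") as handle:
        handle.read()
    failure = None
except UnicodeDecodeError as exc:
    failure = exc

print(default_encoding)
print(failure)
```

The position reported in the exception is the offset of the first non-ASCII byte, which is why the traceback above points at byte 0xe2.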

Comment 1 Alex Turbov 2014-03-16 10:25:27 UTC

Created attachment 372800
patch to fix the problem

Comment 2 François Bissey 2014-03-17 00:13:07 UTC

Got hit too. Haven't tried the patch.

Comment 3 Mike Gilbert gentoo-dev 2014-03-17 01:12:05 UTC

Comment on attachment 372800
patch to fix the problem

This will cause an error in python 2.6 and 2.7.

Comment 4 Mike Gilbert gentoo-dev 2014-03-17 01:14:48 UTC

Here's a hint.

http://python3porting.com/problems.html#reading-from-files

Comment 5 Francesco Riosa 2014-03-18 00:14:36 UTC

Or simply patch readme, that character is probably an error

--- README.txt  2014-03-14 09:58:59.000000000 +0100
+++ /root/README.txt    2014-03-18 01:12:47.988642649 +0100
@@ -425,7 +425,7 @@
 measurement.

 All other timezones are defined relative to UTC, and include offsets like
-UTC+0800 — hours to add or subtract from UTC to derive the local time. No
+UTC+0800 - hours to add or subtract from UTC to derive the local time. No
 daylight saving time occurs in UTC, making it a useful timezone to perform
 date arithmetic without worrying about the confusion and ambiguities caused
 by daylight saving time transitions, your country changing its timezone, or

Comment 6 Ian Delaney (RETIRED) gentoo-dev 2014-03-19 15:04:56 UTC

This was fixed by way of a patch in many other Python packages long ago, and sadly it still occurs. The fix is:
infile = codecs.open('UTF-8.txt', 'r', encoding='UTF-8')

I believe that is the essence of the hint. It's not the README you patch but the code that opens it. There ought still to be such patches in python herd packages.
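
A minimal sketch of that approach, assuming the goal is one code path that runs on Python 2.6, 2.7 and 3.x. The built-in open() on Python 2 has no encoding parameter, which is presumably the failure comment 3 refers to; io.open() accepts it on all of those versions. The helper name read_text and the temporary sample file are hypothetical, and this is not necessarily what the attached patch or the final ebuild patch does.

```python
import io
import os
import tempfile


def read_text(path, encoding="utf-8"):
    """Read a text file with an explicit encoding instead of the locale default."""
    with io.open(path, "r", encoding=encoding) as handle:
        return handle.read()


# Illustrative stand-in for README.txt containing the character in question.
sample = u"UTC+0800 \u2014 hours to add or subtract from UTC\n"
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "README.txt")
with io.open(path, "w", encoding="utf-8") as handle:
    handle.write(sample)

# In a setup.py this value would be passed as long_description=...
long_description = read_text(path)
```

codecs.open() with encoding='UTF-8', as quoted above, would work similarly; io.open() is used here only because it is the same function as the built-in open() on Python 3.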

Comment 7 Joshua Kinard gentoo-dev 2014-03-21 02:12:27 UTC

(In reply to Francesco Riosa from comment #5)
> Or simply patch readme, that character is probably an error
> 
> --- README.txt  2014-03-14 09:58:59.000000000 +0100
> +++ /root/README.txt    2014-03-18 01:12:47.988642649 +0100
> @@ -425,7 +425,7 @@
>  measurement.
> 
>  All other timezones are defined relative to UTC, and include offsets like
> -UTC+0800 — hours to add or subtract from UTC to derive the local time. No
> +UTC+0800 - hours to add or subtract from UTC to derive the local time. No
>  daylight saving time occurs in UTC, making it a useful timezone to perform
>  date arithmetic without worrying about the confusion and ambiguities caused
>  by daylight saving time transitions, your country changing its timezone, or

It's not an error.  I looked at it in a hex editor, and it's three bytes, 0xe2 0x80 0x94, which is definitely UTF-16 encoding.  Extract out the relevant bits and re-assemble into a 2-byte value, and you get U+2014, also known as an "Em Dash" in the Unicode char set:
http://www.fileformat.info/info/unicode/char/2014/index.htm

It probably means that the README.txt file was typed up in an editor that attempts to do some kind of "smart replacement" of specific characters with correct typographic replacements.  Like MS Word and those accursed smart quotes...

Comment 8 Fausto 2014-03-22 11:40:26 UTC

(In reply to Joshua Kinard from comment #7)
> (In reply to Francesco Riosa from comment #5)
> > Or simply patch readme, that character is probably an error
> > 
> > --- README.txt  2014-03-14 09:58:59.000000000 +0100
> > +++ /root/README.txt    2014-03-18 01:12:47.988642649 +0100
> > @@ -425,7 +425,7 @@
> >  measurement.
> > 
> >  All other timezones are defined relative to UTC, and include offsets like
> > -UTC+0800 — hours to add or subtract from UTC to derive the local time. No
> > +UTC+0800 - hours to add or subtract from UTC to derive the local time. No
> >  daylight saving time occurs in UTC, making it a useful timezone to perform
> >  date arithmetic without worrying about the confusion and ambiguities caused
> >  by daylight saving time transitions, your country changing its timezone, or
> 
> It's not an error.  I looked at it in a hex editor, and it's three bytes,
> 0xe2 0x80 0x94, which is definitely UTF-16 encoding.  Extract out the
> relevant bits and re-assemble into a 2-byte value, and you get U+2014, also
> known as an "Em Dash" in the Unicode char set:
> http://www.fileformat.info/info/unicode/char/2014/index.htm
> 
> It probably means that the README.txt file was typed up in an editor that
> attempts to do some kind of "smart replacement" of specific characters with
> correct typographic replacements.  Like MS Word and those accursed smart
> quotes...

Hello,

Maybe it is not an error, but as it stands an entire update process is stopped (at least in my case) because someone put some unusual characters in a README file.

IMHO this should be avoided...

thanks,
Fausto

Comment 9 Mike Gilbert gentoo-dev 2014-03-22 21:16:17 UTC

(In reply to Joshua Kinard from comment #7)
> It's not an error.  I looked at it in a hex editor, and it's three bytes,
> 0xe2 0x80 0x94, which is definitely UTF-16 encoding.

It's actually UTF-8. ^_^

+  22 Mar 2014; Mike Gilbert <floppym@gentoo.org>
+  +files/pytz-2014.1-setup.py.patch, pytz-2014.1.ebuild:
+  Specify the correct encoding when opening README.txt in setup.py, bug 504778
+  by Alex Turbov.
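
The encoding question is easy to check directly. A short sketch, using only the three bytes quoted from the hex editor in comment 7 (variable names are illustrative):

```python
# The three bytes found in README.txt.
raw = b"\xe2\x80\x94"

# Decoded as UTF-8 they form a single character, U+2014 (EM DASH).
as_utf8 = raw.decode("utf-8")
code_point = ord(as_utf8)

# The same character in UTF-16 (big-endian, no BOM) is only two bytes.
utf16_be = as_utf8.encode("utf-16-be")

# The ASCII codec rejects the very first byte, as in the traceback.
try:
    raw.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc

print(hex(code_point), utf16_be, ascii_error)
```

Three bytes for one character in this range is what UTF-8 produces; UTF-16 would store it as 0x20 0x14.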

Comment 10 Mike Gilbert gentoo-dev 2014-03-22 21:27:03 UTC

+*pytz-2014.1.1 (22 Mar 2014)
+
+  22 Mar 2014; Mike Gilbert <floppym@gentoo.org> +pytz-2014.1.1.ebuild,
+  -files/pytz-2014.1-setup.py.patch, -pytz-2014.1.ebuild:
+  Upstream already fixed the encoding issue by removing the non-ASCII character
+  from README.txt. Bug 504778.

Comment 11 Joshua Kinard gentoo-dev 2014-03-22 21:28:17 UTC

(In reply to Mike Gilbert from comment #9)
> (In reply to Joshua Kinard from comment #7)
> > It's not an error.  I looked at it in a hex editor, and it's three bytes,
> > 0xe2 0x80 0x94, which is definitely UTF-16 encoding.
> 
> It's actually UTF-8. ^_^
> 
> +  22 Mar 2014; Mike Gilbert <floppym@gentoo.org>
> +  +files/pytz-2014.1-setup.py.patch, pytz-2014.1.ebuild:
> +  Specify the correct encoding when opening README.txt in setup.py, bug
> 504778
> +  by Alex Turbov.

Oh, I was going by the number of encoded bits when stating UTF-16, as U+2014 is greater than 0x0800 and less than 0xffff, which is 16 bits of code point.
https://en.wikipedia.org/wiki/UTF-8#Description

But now I see that there are clear differences between UTF-8, UTF-16, and even UTF-32.  Oy...
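
For reference, the "extract the relevant bits and re-assemble" step from comment 7 can be sketched as follows, assuming the three-byte UTF-8 layout 1110xxxx 10xxxxxx 10xxxxxx from the Wikipedia table linked above. The function name is hypothetical, and the sketch skips the overlong and surrogate checks a full decoder would need.

```python
def decode_three_byte_utf8(b0, b1, b2):
    """Reassemble a code point from a three-byte UTF-8 sequence."""
    # Lead byte must look like 1110xxxx; it carries the top 4 bits.
    if (b0 & 0xF0) != 0xE0:
        raise ValueError("not a three-byte UTF-8 lead byte")
    # Continuation bytes must look like 10xxxxxx; each carries 6 bits.
    if (b1 & 0xC0) != 0x80 or (b2 & 0xC0) != 0x80:
        raise ValueError("invalid UTF-8 continuation byte")
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


code_point = decode_three_byte_utf8(0xE2, 0x80, 0x94)
print("U+%04X" % code_point)
```

The 4 + 6 + 6 payload bits add up to 16 bits of code point, which is where the confusion with UTF-16 came from; the encoded form is still three bytes of UTF-8.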