Bug 504778 - dev-python/pytz-2014.1 - setup.py: README.txt: UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 14839: ordinal not in range(128)
Status: RESOLVED FIXED
Alias: None
Product: Gentoo Linux
Classification: Unclassified
Component: [OLD] Development
Hardware: All Linux
Importance: Normal normal
Assignee: Python Gentoo Team
URL:
Whiteboard:
Keywords: PATCH
Depends on:
Blocks:
 
Reported: 2014-03-16 10:24 UTC by Alex Turbov
Modified: 2014-03-22 21:28 UTC

See Also:
Package list:
Runtime testing required: ---


Attachments
patch to fix the problem (pytz-2014.1-fix-setup.patch,503 bytes, patch)
2014-03-16 10:25 UTC, Alex Turbov

Description Alex Turbov 2014-03-16 10:24:09 UTC
I get the following error with Python 3.3 as the default interpreter:

Traceback (most recent call last):
  File "setup.py", line 23, in <module>
    long_description=open('README.txt','r').read(),
  File "/usr/lib64/python3.3/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 14839: ordinal not in range(128)


Reproducible: Always
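
The failure above can be reproduced without pytz at all. This is a minimal sketch, assuming the build ran under a locale whose preferred encoding was ASCII, so that open() in text mode decoded README.txt as ASCII. The sample string and the helper name try_ascii are illustrative and not taken from setup.py.

```python
# The em dash (U+2014) encoded as UTF-8 is three bytes, the first being 0xe2.
data = u"UTC+0800 \u2014 hours".encode("utf-8")

def try_ascii(raw):
    """Return the decoded text, or the error message if ASCII decoding fails."""
    try:
        return raw.decode("ascii")
    except UnicodeDecodeError as exc:
        return str(exc)

# Decoding fails at the first byte outside the 0-127 range.
message = try_ascii(data)
print(message)
```

The reported position differs from the one in the traceback because the sample string is much shorter than the real README.txt.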
Comment 1 Alex Turbov 2014-03-16 10:25:27 UTC
Created attachment 372800
patch to fix the problem
Comment 2 François Bissey 2014-03-17 00:13:07 UTC
Got hit too. Haven't tried the patch.
Comment 3 Mike Gilbert gentoo-dev 2014-03-17 01:12:05 UTC
Comment on attachment 372800
patch to fix the problem

This will cause an error in Python 2.6 and 2.7.
Comment 4 Mike Gilbert gentoo-dev 2014-03-17 01:14:48 UTC
Here's a hint.

http://python3porting.com/problems.html#reading-from-files
Comment 5 Francesco Riosa 2014-03-18 00:14:36 UTC
Or simply patch the README; that character is probably an error

--- README.txt  2014-03-14 09:58:59.000000000 +0100
+++ /root/README.txt    2014-03-18 01:12:47.988642649 +0100
@@ -425,7 +425,7 @@
 measurement.

 All other timezones are defined relative to UTC, and include offsets like
-UTC+0800 — hours to add or subtract from UTC to derive the local time. No
+UTC+0800 - hours to add or subtract from UTC to derive the local time. No
 daylight saving time occurs in UTC, making it a useful timezone to perform
 date arithmetic without worrying about the confusion and ambiguities caused
 by daylight saving time transitions, your country changing its timezone, or
Comment 6 Ian Delaney (RETIRED) gentoo-dev 2014-03-19 15:04:56 UTC
This was fixed long ago by way of a patch in many other Python packages, and sadly it still occurs.  The fix is:
infile = codecs.open('UTF-8.txt', 'r', encoding='UTF-8')

That, I believe, is the essence of the hint.  It's not the README you patch but the code that opens it.  There ought still to be such patches in Python herd packages.
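
A minimal sketch of that idea which should run on Python 2.6, 2.7 and 3.x alike: io.open accepts an encoding argument on all of them, whereas the built-in open() only does so on Python 3. The helper name read_text and the temporary sample file are assumptions for illustration; this is not the patch attached to this bug.

```python
import io
import os
import tempfile

def read_text(path, encoding="utf-8"):
    """Read a text file with an explicit encoding instead of the locale default."""
    # io.open takes `encoding` on Python 2.6+; on Python 3 it is the built-in open.
    with io.open(path, "r", encoding=encoding) as handle:
        return handle.read()

# Hypothetical stand-in for README.txt, written to a temporary directory.
sample_path = os.path.join(tempfile.mkdtemp(), "README.txt")
with io.open(sample_path, "w", encoding="utf-8") as handle:
    handle.write(u"UTC+0800 \u2014 hours to add or subtract from UTC")

# In setup.py this value would be passed as long_description=...
long_description = read_text(sample_path)
```

codecs.open, as quoted above, is another way to get the same effect; io.open is used here only because it matches the Python 3 built-in.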
Comment 7 Joshua Kinard gentoo-dev 2014-03-21 02:12:27 UTC
(In reply to Francesco Riosa from comment #5)
> Or simply patch the README; that character is probably an error
> 
> --- README.txt  2014-03-14 09:58:59.000000000 +0100
> +++ /root/README.txt    2014-03-18 01:12:47.988642649 +0100
> @@ -425,7 +425,7 @@
>  measurement.
> 
>  All other timezones are defined relative to UTC, and include offsets like
> -UTC+0800 — hours to add or subtract from UTC to derive the local time. No
> +UTC+0800 - hours to add or subtract from UTC to derive the local time. No
>  daylight saving time occurs in UTC, making it a useful timezone to perform
>  date arithmetic without worrying about the confusion and ambiguities caused
>  by daylight saving time transitions, your country changing its timezone, or

It's not an error.  I looked at it in a hex editor, and it's three bytes, 0xe2 0x80 0x94, which is definitely UTF-16 encoding.  Extract out the relevant bits and re-assemble into a 2-byte value, and you get U+2014, also known as an "Em Dash" in the Unicode char set:
http://www.fileformat.info/info/unicode/char/2014/index.htm

It probably means that the README.txt file was typed up in an editor that attempts to do some kind of "smart replacement" of specific characters with correct typographic replacements.  Like MS Word and those accursed smart quotes...
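
The bit extraction described above can be sketched as follows. Note that the three-byte layout 1110xxxx 10xxxxxx 10xxxxxx handled here is the UTF-8 form. The function name is hypothetical, and the sketch does not check for overlong or surrogate sequences.

```python
def decode_three_byte_utf8(b0, b1, b2):
    """Reassemble a code point from bytes laid out as 1110xxxx 10xxxxxx 10xxxxxx."""
    if (b0 & 0xF0) != 0xE0 or (b1 & 0xC0) != 0x80 or (b2 & 0xC0) != 0x80:
        raise ValueError("not a three-byte UTF-8 sequence")
    # 4 payload bits from the lead byte, 6 from each continuation byte.
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)

# The three bytes seen in the hex editor.
code_point = decode_three_byte_utf8(0xE2, 0x80, 0x94)
print("U+%04X" % code_point)
```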
Comment 8 Fausto 2014-03-22 11:40:26 UTC
(In reply to Joshua Kinard from comment #7)
> (In reply to Francesco Riosa from comment #5)
> > Or simply patch the README; that character is probably an error
> > 
> > --- README.txt  2014-03-14 09:58:59.000000000 +0100
> > +++ /root/README.txt    2014-03-18 01:12:47.988642649 +0100
> > @@ -425,7 +425,7 @@
> >  measurement.
> > 
> >  All other timezones are defined relative to UTC, and include offsets like
> > -UTC+0800 — hours to add or subtract from UTC to derive the local time. No
> > +UTC+0800 - hours to add or subtract from UTC to derive the local time. No
> >  daylight saving time occurs in UTC, making it a useful timezone to perform
> >  date arithmetic without worrying about the confusion and ambiguities caused
> >  by daylight saving time transitions, your country changing its timezone, or
> 
> It's not an error.  I looked at it in a hex editor, and it's three bytes,
> 0xe2 0x80 0x94, which is definitely UTF-16 encoding.  Extract out the
> relevant bits and re-assemble into a 2-byte value, and you get U+2014, also
> known as an "Em Dash" in the Unicode char set:
> http://www.fileformat.info/info/unicode/char/2014/index.htm
> 
> It probably means that the README.txt file was typed up in an editor that
> attempts to do some kind of "smart replacement" of specific characters with
> correct typographic replacements.  Like MS Word and those accursed smart
> quotes...

Hello,

Maybe it is not an error, but this way an entire update process is stopped (at least in my case) because someone put an unusual character in a README file.

IMHO, this should be avoided...

thanks,
Fausto
Comment 9 Mike Gilbert gentoo-dev 2014-03-22 21:16:17 UTC
(In reply to Joshua Kinard from comment #7)
> It's not an error.  I looked at it in a hex editor, and it's three bytes,
> 0xe2 0x80 0x94, which is definitely UTF-16 encoding.

It's actually UTF-8. ^_^

+  22 Mar 2014; Mike Gilbert <floppym@gentoo.org>
+  +files/pytz-2014.1-setup.py.patch, pytz-2014.1.ebuild:
+  Specify the correct encoding when opening README.txt in setup.py, bug 504778
+  by Alex Turbov.
Comment 10 Mike Gilbert gentoo-dev 2014-03-22 21:27:03 UTC
+*pytz-2014.1.1 (22 Mar 2014)
+
+  22 Mar 2014; Mike Gilbert <floppym@gentoo.org> +pytz-2014.1.1.ebuild,
+  -files/pytz-2014.1-setup.py.patch, -pytz-2014.1.ebuild:
+  Upstream already fixed the encoding issue by removing the non-ASCII character
+  from README.txt. Bug 504778.
Comment 11 Joshua Kinard gentoo-dev 2014-03-22 21:28:17 UTC
(In reply to Mike Gilbert from comment #9)
> (In reply to Joshua Kinard from comment #7)
> > It's not an error.  I looked at it in a hex editor, and it's three bytes,
> > 0xe2 0x80 0x94, which is definitely UTF-16 encoding.
> 
> It's actually UTF-8. ^_^
> 
> +  22 Mar 2014; Mike Gilbert <floppym@gentoo.org>
> +  +files/pytz-2014.1-setup.py.patch, pytz-2014.1.ebuild:
> +  Specify the correct encoding when opening README.txt in setup.py, bug
> 504778
> +  by Alex Turbov.

Oh, I was going by the number of encoded bits when stating UTF-16, as U+2014 is greater than 0x0800 and less than 0xffff, which is 16 bits of codepoint.
https://en.wikipedia.org/wiki/UTF-8#Description

But now I see that there are clear differences between UTF-8, UTF-16, and even UTF-32.  Oy...
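
As a quick illustration of those differences, the sketch below encodes the same character, U+2014, in each of the three encodings. The big-endian variants are chosen only so that no byte order mark is prepended; the variable names are illustrative.

```python
em_dash = u"\u2014"
encodings = ["utf-8", "utf-16-be", "utf-32-be"]

# bytearray yields integers on both Python 2 and 3, which keeps the formatting simple.
encoded = {name: bytearray(em_dash.encode(name)) for name in encodings}

for name in encodings:
    print(name + ": " + " ".join("%02x" % b for b in encoded[name]))
```

Only the UTF-8 form produces the e2 80 94 sequence seen in README.txt; UTF-16 stores the code point directly as the two bytes 20 14.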