425760 – >=app-portage/esearch-1.0 encoding error in eupdatedb

Status:	RESOLVED FIXED

Alias:	None

Product:	Gentoo Linux
Classification:	Unclassified
Component:	Current packages
Hardware:	All Linux

Importance:	Normal normal
Assignee:	Portage Tools Team

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2012-07-11 00:51 UTC by Brian Dolbec
Modified:	2012-11-02 19:33 UTC
CC List:	0 users

See Also:
Package list:
Runtime testing required:	---

Description Brian Dolbec (RETIRED) gentoo-dev

2012-07-11 00:51:43 UTC

big_daddy layman # eupdatedb
 * indexing: 14815 ebuilds to go * Missing digest for '/usr/local/portage/app-portage/some-package/some-package-0.6.0.ebuild'
8293 ebuilds to goTraceback (most recent call last):
  File "/usr/bin/eupdatedb", line 5, in <module>
    main()
  File "/usr/lib64/python2.7/site-packages/esearch/update.py", line 252, in main
    success = updatedb(config)
  File "/usr/lib64/python2.7/site-packages/esearch/update.py", line 208, in updatedb
    str(description),
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 125: ordinal not in range(128)

It appears that unicode is getting into ebuild descriptions now.
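
To illustrate the failure in the traceback: on Python 2, calling str() on a unicode object implicitly encodes it with the 'ascii' codec, so any character outside that range raises. The sketch below makes that encode step explicit so it behaves the same on Python 2 and 3. The sample description is made up for illustration and is not the text of the offending ebuild.

```python
# -*- coding: utf-8 -*-
# Illustration only: this description is invented, not the real ebuild's.
description = u"A tool that doesn\u2019t use plain ASCII"

error_message = None
try:
    # Python 2's str(description) performs this ASCII encode implicitly.
    description.encode("ascii")
except UnicodeEncodeError as err:
    error_message = str(err)
    print(error_message)
```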

Looks like we need to convert the db to full unicode, which, judging from layman's unicode db, may be a pain to code.  I wasn't able to get layman's code to work with both py-2 and py-3 at the same time.  It may need to be coded so that we run 2to3 on the code for py-3 installs.

Reproducible: Always

Comment 1 Paul Varner (RETIRED) gentoo-dev

2012-07-11 14:56:43 UTC

I don't have the problem on any of my machines, which indicates the ebuild is coming from an overlay.  Which overlays do you have installed?

Comment 2 Brian Dolbec (RETIRED) gentoo-dev

2012-07-11 15:13:59 UTC

big_daddy layman # layman -l

 * mgorny                    [Git       ] (git://git.overlays.gentoo.org/dev/mgorny.git                                                      )
 * multimedia                [Git       ] (git://gitorious.org/gentoo-multimedia/gentoo-multimedia.git                                       )
 * mva                       [Git       ] (git://github.com/msva/mva-overlay                                                                 )
 * science                   [Git       ] (git://git.overlays.gentoo.org/proj/sci.git                                                        )
 * sunrise                   [Git       ] (git://git.overlays.gentoo.org/proj/sunrise-reviewed.git                                           )
 * xfce-dev                  [Git       ] (git://git.overlays.gentoo.org/proj/xfce.git                                                       )

big_daddy layman #

I'll add some debug try:except pairs to the code to try and trap them.

That should make things work a little better and give us more info where the unicode is coming from.

We may be able to do a char substitution for the offending string as a temp workaround as well as report it to stderr.
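
A minimal sketch of what such a temporary workaround could look like, assuming the description arrives as a unicode string. The helper name ascii_safe and its label parameter are hypothetical and are not part of esearch.

```python
import sys


def ascii_safe(text, label="description"):
    """Hypothetical helper: return an ASCII-only copy of text.

    If text contains non-ASCII characters, report the problem on stderr
    and substitute '?' for each character that cannot be encoded.
    """
    try:
        text.encode("ascii")
        return text
    except UnicodeEncodeError as err:
        sys.stderr.write("warning: non-ASCII character in %s: %s\n" % (label, err))
        return text.encode("ascii", "replace").decode("ascii")


safe = ascii_safe(u"It\u2019s a package")
```

This loses information, so it only makes sense as a stopgap until the db itself is stored as unicode.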

What about adding logging to esearch?  Might be good to have things like this filed for bug submittal.

Comment 3 Zac Medico gentoo-dev

2012-07-12 00:27:32 UTC

It should work fine if we just write esearchdb.py with UTF-8 encoding and put a line like "# -*- coding: UTF8 -*-" at the top. Instead of using str(), use _unicode() like portage typically does:

import sys

# Pick the text type for the running interpreter.
if sys.hexversion >= 0x3000000:
	_unicode = str
else:
	_unicode = unicode

And open the unicode file like this:

import io

# dbfd is the already-opened file descriptor for the db file.
dbfile = io.open(dbfd, mode="w", encoding="utf_8")
dbfile.write(_unicode("# -*- coding: UTF8 -*-\n"))

Just use _unicode() instead of str() to wrap any strings that you write to dbfile, and it should work fine because the strings that come from portage are all unicode.
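
Putting the pieces above together, here is a small round-trip sketch that should run unchanged on Python 2.7 and Python 3. It is not the actual esearch code: the temporary file name, the sample description, and the use of a path instead of the dbfd file descriptor are all assumptions made for illustration.

```python
import io
import os
import sys
import tempfile

# Same shim as above: the text type for the running interpreter.
if sys.hexversion >= 0x3000000:
    _unicode = str
else:
    _unicode = unicode  # noqa: F821 (name only exists on Python 2)

# Invented sample data; the real esearchdb layout is not shown in this bug.
description = u"Doesn\u2019t fit in ASCII"

# Hypothetical location; esearch opens its real db through a file descriptor.
db_path = os.path.join(tempfile.mkdtemp(), "esearchdb_example.py")

# Write text through io.open so encoding to UTF-8 happens in one place.
with io.open(db_path, mode="w", encoding="utf_8") as dbfile:
    dbfile.write(_unicode("# -*- coding: UTF8 -*-\n"))
    dbfile.write(_unicode("# ") + description + _unicode("\n"))

# Read it back the same way; the non-ASCII character survives intact.
with io.open(db_path, mode="r", encoding="utf_8") as dbfile:
    lines = dbfile.read().splitlines()
```

Because io.open with an explicit encoding accepts and returns only text on both interpreters, nothing falls back to the implicit ASCII conversion that str() triggered in the original traceback.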

Comment 4 Brian Dolbec (RETIRED) gentoo-dev

2012-07-27 03:10:44 UTC

With Zac's help, it now saves the db as unicode in a way that is compatible with both py2 and py3.  Whichever Python creates the db, it will load correctly in either one.

commit: https://github.com/fuzzyray/esearch/commit/2be2aa2f0f66c6e68acd0ea4b5b49e55305836f2

Comment 5 Paul Varner (RETIRED) gentoo-dev

2012-11-02 19:33:52 UTC

Released in esearch-1.3