575946 – dev-vcs/gitstats-0_pre131024: ValueError: need more than 1 value to unpack

Status:	RESOLVED FIXED

Alias:	None

Product:	Gentoo Linux
Classification:	Unclassified
Component:	Current packages
Hardware:	All Linux

Importance:	Normal normal
Assignee:	Göktürk Yüksek

Reported:	2016-02-28 21:17 UTC by Martin Mokrejš
Modified:	2016-03-24 15:13 UTC
CC List:	3 users

Attachments
git rev-list output (git_rev-list.txt.bz2, 121.06 KB, application/octet-stream), 2016-03-24 08:01 UTC, Martin Mokrejš

Description Martin Mokrejš 2016-02-28 21:17:04 UTC

Also, please fetch the current snapshot from GitHub, as I am getting the following when running it over the sci overlay:

$ gitstats ~/proj/sci /tmp/blah
[2.14413] >> gnuplot --version
Output path: /scratch/gitstats
Git path: /home/xxx/proj/sci
Collecting data...
[8.03722] >> git shortlog -s HEAD | wc -l
[0.08298] >> git show-ref --tags
[0.07031] >> git rev-list --pretty=format:"%at %ai %aN <%aE>" HEAD | grep -v ^commit
Traceback (most recent call last):
  File "/usr/lib/python-exec/python2.7/gitstats", line 1472, in <module>
    g.run(sys.argv[1:])
  File "/usr/lib/python-exec/python2.7/gitstats", line 1449, in run
    data.collect(gitpath)
  File "/usr/lib/python-exec/python2.7/gitstats", line 333, in collect
    author, mail = parts[4].split('<', 1)
ValueError: need more than 1 value to unpack
$

Please also fix the ebuild to install the man page.
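
The failing statement in the traceback, `author, mail = parts[4].split('<', 1)`, raises ValueError whenever the fifth field contains no '<'. One plausible trigger (an assumption, not confirmed in this report) is a diagnostic line such as grep's binary-input message reaching the parser instead of a commit line. The sketch below shows a tolerant version of that parsing step. It is written for Python 3, the helper name `parse_rev_line` is hypothetical, and it is not the fix that was applied to the package.

```python
def parse_rev_line(line):
    """Parse '<stamp> <date> <time> <timezone> <author> <mail>'.

    Returns (stamp, author, mail), or None for lines that do not
    look like rev-list output (hypothetical helper, not gitstats code).
    """
    parts = line.split(' ', 4)
    if len(parts) < 5:
        return None          # too few fields: not a commit line
    try:
        stamp = int(parts[0])
    except ValueError:
        return None          # first field is not a Unix timestamp
    rest = parts[4]
    if '<' in rest:
        author, mail = rest.split('<', 1)
        mail = mail.rstrip('>')
    else:
        author, mail = rest, ''   # no e-mail part; keep the name only
    return stamp, author.rstrip(), mail
```

Returning None for unparseable lines lets the caller skip them instead of aborting the whole run.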

Comment 1 Pacho Ramos gentoo-dev 2016-02-29 19:53:28 UTC

Are you willing to proxy maintain this?
https://wiki.gentoo.org/wiki/Project:Proxy_Maintainers

Comment 2 Göktürk Yüksek archtester 2016-03-04 14:05:20 UTC

(In reply to Pacho Ramos from comment #1)
> Are you willing to proxy maintain this?
> https://wiki.gentoo.org/wiki/Project:Proxy_Maintainers

I am already proxying this but it looks like we forgot to edit the metadata. See: https://archives.gentoo.org/gentoo-dev/message/67c561f9e121df5f746d578d523951b2

I have a pull request to update the live ebuild: https://github.com/gentoo/gentoo/pull/972

I can't do snapshots because I have no dev space. I'll see what I can do.

Comment 3 Pacho Ramos gentoo-dev 2016-03-15 19:27:49 UTC

I will CC amadio as he was your proxy, right?

Comment 4 Göktürk Yüksek archtester 2016-03-24 03:00:35 UTC

commit 69e6ed97490e2d6f7ebe18687002e0aa33a256e1
Author:     Göktürk Yüksek <gokturk@binghamton.edu>
AuthorDate: Wed Mar 23 22:35:00 2016 -0400
Commit:     NP-Hardass <NP-Hardass@gentoo.org>
CommitDate: Wed Mar 23 22:52:25 2016 -0400

    dev-vcs/gitstats: bump to snapshot 2015-12-23
    
    Package-Manager: portage-2.2.26

Can you try the new snapshot and see if it's reproducible? Thanks.

Comment 5 Martin Mokrejš 2016-03-24 07:59:22 UTC

Thank you. It no longer crashes, but it gets stuck on the same command, so it may still crash while parsing.

$ /usr/bin/gitstats ~/proj/sci /tmp/blah
[0.01236] >> gnuplot --version
Output path: /scratch/gitstats
Loading cache...
Git path: /home/xxx/proj/sci
Collecting data...
[0.38159] >> git shortlog -s HEAD | wc -l
[0.02234] >> git show-ref --tags
>> git rev-list --pretty=format:"%at %ai %aN <%aE>" HEAD | grep -v ^commit

I already emailed you privately; it turns out grep complains that the input contains some "binary" characters.

Here is the repo I use:

> $ git pull -v
>  From git+ssh://git.gentoo.org/proj/sci
>   = [up to date]      master     -> origin/master
>   = [up to date]      ambertools -> origin/ambertools
>   = [up to date]      wxmacmolplt -> origin/wxmacmolplt
> $

This patch works for me:

$ diff -u -w gitstats.ori gitstats
--- gitstats.ori        2016-01-08 18:12:21.000000000 +0100
+++ gitstats    2016-03-03 16:39:53.033358354 +0100
@@ -327,7 +327,7 @@
 
                # Collect revision statistics
                # Outputs "<stamp> <date> <time> <timezone> <author> '<' <mail> '>'"
-               lines = getpipeoutput(['git rev-list --pretty=format:"%%at %%ai %%aN <%%aE>" %s' % getlogrange('HEAD'), 'grep -v ^commit']).split('\n')
+               lines = getpipeoutput(['git rev-list --pretty=format:"%%at %%ai %%aN <%%aE>" %s' % getlogrange('HEAD'), 'LC_ALL=en_US grep -v ^commit']).split('\n')
                for line in lines:
                        parts = line.split(' ', 4)
                        author = ''


The git rev-list output mixes US-ASCII, UTF-8 and azbuka (Cyrillic) characters.

Here is the very first of the problematic lines (the one after 1260558957), first shown through "less":

git rev-list --pretty=format:"%at %ai %aN <%aE>" HEAD | less

1260558957 2009-12-11 20:15:57 +0100 root <root@localhost.(none)>
commit 893107e474bbfdc40abb7f87564b5ad9b52de3c2
1260548737 2009-12-11 13:25:37 -0300 Tom<E1>s Touceda <chiiph@gmail.com>
commit 55fd9de91abe7eeacca6f3b0ada3b2d64f0de7d8


Here is roughly the same section through "od -c":

git rev-list --pretty=format:"%at %ai %aN <%aE>" HEAD | od -c | less


5077220   d   8   b   5   a   f   f   c   8   f  \n   1   2   6   0   5
5077240   5   8   9   5   7       2   0   0   9   -   1   2   -   1   1
5077260       2   0   :   1   5   :   5   7       +   0   1   0   0
5077300   r   o   o   t       <   r   o   o   t   @   l   o   c   a   l
5077320   h   o   s   t   .   (   n   o   n   e   )   >  \n   c   o   m
5077340   m   i   t       8   9   3   1   0   7   e   4   7   4   b   b
5077360   f   d   c   4   0   a   b   b   7   f   8   7   5   6   4   b
5077400   5   a   d   9   b   5   2   d   e   3   c   2  \n   1   2   6
5077420   0   5   4   8   7   3   7       2   0   0   9   -   1   2   -
5077440   1   1       1   3   :   2   5   :   3   7       -   0   3   0
5077460   0       T   o   m 341   s       T   o   u   c   e   d   a
5077500   <   c   h   i   i   p   h   @   g   m   a   i   l   .   c   o
5077520   m   >  \n   c   o   m   m   i   t       5   5   f   d   9   d
5077540   e   9   1   a   b   e   7   e   e   a   c   c   a   6   f   3
5077560   b   0   a   d   a   3   b   2   d   6   4   f   0   d   e   7
5077600   d   8  \n   1   2   6   0   5   3   2   8   8   2       2   0
5077620   0   9   -   1   2   -   1   1       1   2   :   0   1   :   2
5077640   2       +   0   0   0   0       J   o   n   a   t   h   a   n
5077660   -   C   h   r   i   s   t   o   f   e   r       D   e   m   a
5077700   y       <   j   c   d   e   m   a   y   @   g   m   a   i   l
5077720   .   c   o   m   >  \n   c   o   m   m   i   t       c   e   d
5077740   d   e   1   b   9   4   3   6   1   c   0   4   0   6   0   c
5077760   9   4   1   6   1   4   b   6   d   0   f   b   2   7   a   3
5100000   1   2   0   1   a  \n   1   2   6   0   5   3   2   8   5   1 


I will attach the raw git rev-list output. As you know, I contacted upstream, but we did not get any further. It seems that, in general, my "export LC_ALL=en_US.UTF-8" does not propagate into the gitstats Python code, hence the dirty patch above.

Although I have an xterm with UTF-8 support and have UTF-8 enabled in general, as you can see this is not enough to get the grep call working from inside the gitstats Python code.
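
For illustration, a locale can also be forced on a child pipeline from the Python side rather than by prefixing the command string as in the patch above. This is a minimal Python 3 sketch, not gitstats' own `getpipeoutput`; the helper name `run_in_locale` is hypothetical, and which locale value is appropriate remains an open question in this report.

```python
import os
import subprocess


def run_in_locale(shell_cmd, locale):
    """Run a shell pipeline with LC_ALL forced to `locale`.

    Returns the raw stdout bytes (hypothetical helper).
    """
    env = dict(os.environ)   # copy, so the parent's environment is untouched
    env["LC_ALL"] = locale   # LC_ALL takes precedence over LANG and other LC_* variables
    result = subprocess.run(shell_cmd, shell=True, env=env,
                            stdout=subprocess.PIPE, check=False)
    return result.stdout
```

Returning bytes rather than text is deliberate: decoding is left to the caller, because the stream may contain byte sequences that are invalid in the chosen encoding.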

Comment 6 Martin Mokrejš 2016-03-24 08:01:54 UTC

Created attachment 428896 [details]
git rev-list output

Here is the input for your attempts to make grep(1) work on this stream.

Comment 7 Guilherme Amadio gentoo-dev 2016-03-24 12:45:40 UTC

I cannot reproduce this bug.

Comment 8 Guilherme Amadio gentoo-dev 2016-03-24 12:46:25 UTC

Never mind, I can reproduce it when I use the Gentoo Science project repository...

Comment 9 Guilherme Amadio gentoo-dev 2016-03-24 13:22:20 UTC

What works for me is to add an additional grep pipe to filter out the invalid UTF-8 lines from the stuff to be parsed, i.e., change

> grep -v ^commit

to

> grep -av ^commit | grep -ax '.*'

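The -a makes grep treat binary input as text, and the -x selects only lines that match the pattern in their entirety. Since '.' does not match an invalid byte sequence, this filters out lines containing invalid UTF-8. See the link below for a reference: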
http://unix.stackexchange.com/questions/6516/filtering-invalid-utf8

I will apply this fix if others are in agreement.
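
As a rough Python 3 analogue of the `grep -ax '.*'` filter (a sketch only; the applied fix uses grep, and the helper name `keep_valid_utf8_lines` is hypothetical), lines can be kept only if they decode as strict UTF-8:

```python
def keep_valid_utf8_lines(raw):
    """Return the lines of `raw` (bytes) that are valid UTF-8, decoded to str.

    Lines containing any invalid byte sequence are dropped entirely.
    """
    kept = []
    for line in raw.split(b"\n"):
        try:
            kept.append(line.decode("utf-8"))   # strict decoding by default
        except UnicodeDecodeError:
            continue                            # skip the whole line
    return kept
```

Like the grep filter, this drops the whole commit line, so commits by authors whose names are not valid UTF-8 would not be counted.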

Comment 10 Guilherme Amadio gentoo-dev 2016-03-24 13:26:24 UTC

An alternative that discards only the invalid characters instead of whole lines, also listed in the link above, is to use

> grep -av ^commit | iconv -c -t UTF-8
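
The character-dropping idea can be sketched in Python 3 with the `ignore` error handler of `bytes.decode`. This is an analogue of `iconv -c` under the assumption that the stream is meant to be UTF-8; it is not part of any patch here, and the helper name `drop_invalid_bytes` is hypothetical.

```python
def drop_invalid_bytes(raw):
    """Decode `raw` (bytes) as UTF-8, silently discarding invalid bytes."""
    return raw.decode("utf-8", errors="ignore")
```

The line survives but the offending byte is lost, so an author name containing a Latin-1 accented letter would come out with that letter missing.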

Comment 11 Guilherme Amadio gentoo-dev 2016-03-24 14:15:59 UTC

Fixed by commit below.
https://gitweb.gentoo.org/repo/gentoo.git/commit/?id=93107b6974084bd968d73c5c2b09a054e433972e

Pull request placed upstream too.
https://github.com/hoxu/gitstats/pull/63

commit 93107b6974084bd968d73c5c2b09a054e433972e
Author: Guilherme Amadio <amadio@gentoo.org>
Date:   Thu Mar 24 11:09:16 2016 -0300

    dev-vcs/gitstats-0_pre131024: fix bug #575946
    
    Gentoo-bug: 575946
    Reported-by: Martin Mokrejš
    
    Package-Manager: portage-2.2.28

Comment 12 Martin Mokrejš 2016-03-24 15:02:44 UTC

(In reply to Guilherme Amadio from comment #10)
> An alternative that discards only characters instead of lines, also listed
> in the link above is to use 
> 
> > grep -av ^commit | iconv -c -t UTF-8

I would have preferred this one (it discards just one character, but can it really not be converted to US-ASCII?). Interestingly, I was also trying to use iconv to convert away the non-ASCII chars, but I failed. Maybe that was because of the Russian azbuka; I don't remember now.

I thought there was another bug: a user's LANG or other LC_* variables affect how grep works, and if it switches itself into binary mode the process gets stuck. Maybe gitstats could internally force its own LANG/LC_CTYPE/other, but I do not know which one and to what value. Forcing all of them to UTF-8 did not help in my hands. So my opinion was, and still is, that more is needed on the gitstats side too.

I do not know how this squares with the general expectation that gitstats should respect the user's environment settings, eventually for its own output. Maybe they should only be respected in the output HTML encoding. If not, it should cherry-pick some env variable and forcibly pass it to the forked grep processes ('export LC_CTYPE' does not work).

I did not learn enough about how USE=nls affects the grep binary; gnuplot also has an nls USE flag.

Quite a knotted situation.

Comment 13 Guilherme Amadio gentoo-dev 2016-03-24 15:13:46 UTC

(In reply to Martin Mokrejš from comment #12)
> (In reply to Guilherme Amadio from comment #10)
> > An alternative that discards only characters instead of lines, also listed
> > in the link above is to use 
> > 
> > > grep -av ^commit | iconv -c -t UTF-8
> 
> I would have preferred this one (it discards just one character, but can it
> really not be converted to US-ASCII?). Interestingly, I was also trying to
> use iconv to convert away the non-ASCII chars, but I failed. Maybe that was
> because of the Russian azbuka; I don't remember now.

I totally agree. I also would have preferred the iconv solution. I tested it and it didn't work, so I used the solution with grep. Of course, if we can make iconv work, I will happily update the patch.

> I thought there was another bug: a user's LANG or other LC_* variables
> affect how grep works, and if it switches itself into binary mode the
> process gets stuck. Maybe gitstats could internally force its own
> LANG/LC_CTYPE/other, but I do not know which one and to what value. Forcing
> all of them to UTF-8 did not help in my hands. So my opinion was, and still
> is, that more is needed on the gitstats side too.

The -a option to grep forces it not to switch to binary mode, so there is no need to change the locale.

> I do not know how this squares with the general expectation that gitstats
> should respect the user's environment settings, eventually for its own
> output. Maybe they should only be respected in the output HTML encoding. If
> not, it should cherry-pick some env variable and forcibly pass it to the
> forked grep processes ('export LC_CTYPE' does not work).

In this case, I don't think that the locale is the problem, as the data in the commits is presumably Unicode, but with invalid characters. Since it's a mixture of different encodings, any single encoding will fail. The LC_ALL=en_US.UTF-8 solution shown above did not solve the problem for me.
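
To illustrate the mixed-encoding point: the od dump in comment 5 shows the byte 341 (octal, 0xE1) inside "Tomás", which is "á" in ISO-8859-1 but not a valid UTF-8 sequence on its own. A per-line fallback is one way to keep such lines, sketched here in Python 3. The Latin-1 fallback is an assumption drawn from that one byte, the helper name `decode_line` is hypothetical, and nothing like this was applied in the fix.

```python
def decode_line(raw_line, fallback="latin-1"):
    """Decode one line as UTF-8, falling back to `fallback` if that fails.

    Latin-1 maps every byte to a character, so the fallback never raises,
    but it may mis-render text that was in some other legacy encoding.
    """
    try:
        return raw_line.decode("utf-8")
    except UnicodeDecodeError:
        return raw_line.decode(fallback)
```

Decoding the whole stream with a single codec would either fail on the Latin-1 line or garble the UTF-8 Cyrillic lines, which is consistent with the observation that no single locale setting helped.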