Also, please fetch the current snapshot from GitHub, as I am getting the following when running it over the sci overlay:

$ gitstats ~/proj/sci /tmp/blah
[2.14413] >> gnuplot --version
Output path: /scratch/gitstats
Git path: /home/xxx/proj/sci
Collecting data...
[8.03722] >> git shortlog -s HEAD | wc -l
[0.08298] >> git show-ref --tags
[0.07031] >> git rev-list --pretty=format:"%at %ai %aN <%aE>" HEAD | grep -v ^commit
Traceback (most recent call last):
  File "/usr/lib/python-exec/python2.7/gitstats", line 1472, in <module>
    g.run(sys.argv[1:])
  File "/usr/lib/python-exec/python2.7/gitstats", line 1449, in run
    data.collect(gitpath)
  File "/usr/lib/python-exec/python2.7/gitstats", line 333, in collect
    author, mail = parts[4].split('<', 1)
ValueError: need more than 1 value to unpack
$

Please also fix the ebuild to install the manpage file.
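The traceback above comes from unpacking the result of parts[4].split('<', 1) when the field contains no '<'. A minimal sketch of a more defensive parser follows. It assumes the line format given by the rev-list format string above, and the helper name parse_rev_line is hypothetical, not part of gitstats:

```python
def parse_rev_line(line):
    """Parse one line of the form '%at %ai %aN <%aE>'.

    Returns (stamp, author, mail), or None when the line does not have
    the expected shape, instead of raising ValueError.
    Hypothetical helper for illustration; not taken from gitstats.
    """
    # %ai expands to three space-separated fields (date, time, timezone),
    # so the author and mail end up together in the fifth field.
    parts = line.split(' ', 4)
    if len(parts) < 5 or '<' not in parts[4]:
        return None
    try:
        stamp = int(parts[0])
    except ValueError:
        return None
    author, mail = parts[4].split('<', 1)
    return stamp, author.rstrip(), mail.rstrip('>')
```

With this shape, a truncated or otherwise malformed line is skipped by the caller rather than aborting the whole run. Whether skipping is acceptable for the statistics is a separate decision.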
Are you willing to proxy maintain this? https://wiki.gentoo.org/wiki/Project:Proxy_Maintainers
(In reply to Pacho Ramos from comment #1)
> Are you willing to proxy maintain this?
> https://wiki.gentoo.org/wiki/Project:Proxy_Maintainers

I am already proxying this, but it looks like we forgot to edit the metadata. See:
https://archives.gentoo.org/gentoo-dev/message/67c561f9e121df5f746d578d523951b2

I have a pull request to update the live ebuild:
https://github.com/gentoo/gentoo/pull/972

I can't do snapshots because I have no dev space. I'll see what I can do.
I will CC amadio as he was your proxy, right?
commit 69e6ed97490e2d6f7ebe18687002e0aa33a256e1
Author:     Göktürk Yüksek <gokturk@binghamton.edu>
AuthorDate: Wed Mar 23 22:35:00 2016 -0400
Commit:     NP-Hardass <NP-Hardass@gentoo.org>
CommitDate: Wed Mar 23 22:52:25 2016 -0400

    dev-vcs/gitstats: bump to snapshot 2015-12-23

    Package-Manager: portage-2.2.26

Can you try the new snapshot and see if it's reproducible? Thanks.
Thank you. It does not crash, but it gets stuck on the same command, so it may still crash while parsing.

$ /usr/bin/gitstats ~/proj/sci /tmp/blah
[0.01236] >> gnuplot --version
Output path: /scratch/gitstats
Loading cache...
Git path: /home/xxx/proj/sci
Collecting data...
[0.38159] >> git shortlog -s HEAD | wc -l
[0.02234] >> git show-ref --tags
>> git rev-list --pretty=format:"%at %ai %aN <%aE>" HEAD | grep -v ^commit

As I already emailed you privately, it turns out grep complains that the input contains some "binary" characters. Here is the repo I use:

> $ git pull -v
> From git+ssh://git.gentoo.org/proj/sci
>  = [up to date]      master     -> origin/master
>  = [up to date]      ambertools -> origin/ambertools
>  = [up to date]      wxmacmolplt -> origin/wxmacmolplt
> $

This patch works for me:

$ diff -u -w gitstats.ori gitstats
--- gitstats.ori 2016-01-08 18:12:21.000000000 +0100
+++ gitstats 2016-03-03 16:39:53.033358354 +0100
@@ -327,7 +327,7 @@
 # Collect revision statistics
 # Outputs "<stamp> <date> <time> <timezone> <author> '<' <mail> '>'"
- lines = getpipeoutput(['git rev-list --pretty=format:"%%at %%ai %%aN <%%aE>" %s' % getlogrange('HEAD'), 'grep -v ^commit']).split('\n')
+ lines = getpipeoutput(['git rev-list --pretty=format:"%%at %%ai %%aN <%%aE>" %s' % getlogrange('HEAD'), 'LC_ALL=en_US grep -v ^commit']).split('\n')
 for line in lines:
 parts = line.split(' ', 4)
 author = ''

The git rev-list output mixes us-ascii, utf-8 and azbuka (Cyrillic) characters. Here is the very first of the problematic lines (the one after 1260558957).
First shown through "less":

git rev-list --pretty=format:"%at %ai %aN <%aE>" HEAD | less

1260558957 2009-12-11 20:15:57 +0100 root <root@localhost.(none)>
commit 893107e474bbfdc40abb7f87564b5ad9b52de3c2
1260548737 2009-12-11 13:25:37 -0300 Tom<E1>s Touceda <chiiph@gmail.com>
commit 55fd9de91abe7eeacca6f3b0ada3b2d64f0de7d8

Here is roughly the same section through:

git rev-list --pretty=format:"%at %ai %aN <%aE>" HEAD | od -c | less

5077220 d 8 b 5 a f f c 8 f \n 1 2 6 0 5
5077240 5 8 9 5 7 2 0 0 9 - 1 2 - 1 1
5077260 2 0 : 1 5 : 5 7 + 0 1 0 0
5077300 r o o t < r o o t @ l o c a l
5077320 h o s t . ( n o n e ) > \n c o m
5077340 m i t 8 9 3 1 0 7 e 4 7 4 b b
5077360 f d c 4 0 a b b 7 f 8 7 5 6 4 b
5077400 5 a d 9 b 5 2 d e 3 c 2 \n 1 2 6
5077420 0 5 4 8 7 3 7 2 0 0 9 - 1 2 -
5077440 1 1 1 3 : 2 5 : 3 7 - 0 3 0
5077460 0 T o m 341 s T o u c e d a
5077500 < c h i i p h @ g m a i l . c o
5077520 m > \n c o m m i t 5 5 f d 9 d
5077540 e 9 1 a b e 7 e e a c c a 6 f 3
5077560 b 0 a d a 3 b 2 d 6 4 f 0 d e 7
5077600 d 8 \n 1 2 6 0 5 3 2 8 8 2 2 0
5077620 0 9 - 1 2 - 1 1 1 2 : 0 1 : 2
5077640 2 + 0 0 0 0 J o n a t h a n
5077660 - C h r i s t o f e r D e m a
5077700 y < j c d e m a y @ g m a i l
5077720 . c o m > \n c o m m i t c e d
5077740 d e 1 b 9 4 3 6 1 c 0 4 0 6 0 c
5077760 9 4 1 6 1 4 b 6 d 0 f b 2 7 a 3
5100000 1 2 0 1 a \n 1 2 6 0 5 3 2 8 5 1

I will attach the raw git rev-list output. I contacted upstream, as you know, but we did not get any further.

In general, it seems that my export LC_ALL=en_US.UTF-8 does not propagate into the gitstats Python code, hence the dirty patch above. Although I have an xterm with utf-8 support and generally have utf-8 enabled, as you can see this is not enough to get the grep call working from inside the Python code of gitstats.
Created attachment 428896
git rev-list output

Here is the input for your attempts to get grep(1) working on this stream.
I cannot reproduce this bug.
Never mind, I can reproduce it when I use the Gentoo science project...
What works for me is to add an additional grep to the pipe to filter out the lines with invalid UTF-8 from the data to be parsed, i.e., change

> grep -v ^commit

to

> grep -av ^commit | grep -ax '.*'

The -a makes grep treat binary input as text, and the -x requires the pattern to match a whole line. In a UTF-8 locale, '.' does not match bytes that are not valid UTF-8, so lines containing such bytes cannot match '.*' in full and are filtered out. See the link below for a reference:

http://unix.stackexchange.com/questions/6516/filtering-invalid-utf8

I will apply this fix if others are in agreement.
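For comparison, the same line-level filtering can be sketched in Python. This is only an illustration of the idea behind the two grep calls, not what gitstats does. The function name keep_valid_utf8_lines is hypothetical, and the sample bytes are taken from the od dump earlier in this report:

```python
def keep_valid_utf8_lines(raw):
    """Return the lines of raw (bytes) that are valid UTF-8, as str.

    Mirrors 'grep -av ^commit | grep -ax ".*"': lines starting with
    'commit' are dropped, and so is any line that fails to decode.
    Unlike grep, this sketch also drops empty lines.
    """
    kept = []
    for line in raw.split(b'\n'):
        if not line or line.startswith(b'commit'):
            continue
        try:
            kept.append(line.decode('utf-8'))
        except UnicodeDecodeError:
            # The whole line is discarded, just as in the grep variant.
            continue
    return kept


sample = (
    b'commit 893107e474bbfdc40abb7f87564b5ad9b52de3c2\n'
    b'1260548737 2009-12-11 13:25:37 -0300 Tom\xe1s Touceda <chiiph@gmail.com>\n'
    b'commit 55fd9de91abe7eeacca6f3b0ada3b2d64f0de7d8\n'
    b'1260532882 2009-12-11 12:01:22 +0000 Jonathan-Christofer Demay <jcdemay@gmail.com>\n'
)
valid_lines = keep_valid_utf8_lines(sample)
```

The cost of this approach is visible in the sample: the commit whose author name contains the stray byte disappears from the statistics entirely.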
An alternative that discards only characters instead of lines, also listed in the link above, is to use

> grep -av ^commit | iconv -c -t UTF-8
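The character-discarding idea can also be sketched in Python, using the decoder's 'ignore' error handler. This is an analogue of the iconv -c approach, not a claim that it behaves identically to iconv. The function name drop_invalid_bytes is hypothetical:

```python
def drop_invalid_bytes(raw_line):
    """Decode raw_line (bytes) as UTF-8, silently dropping invalid bytes.

    Only the offending bytes are removed; the rest of the line survives.
    """
    return raw_line.decode('utf-8', errors='ignore')


# Sample bytes from the od dump earlier in this report (341 octal is 0xE1).
cleaned = drop_invalid_bytes(b'Tom\xe1s Touceda <chiiph@gmail.com>')
# cleaned == 'Toms Touceda <chiiph@gmail.com>'
```

The commit is kept, but the author name is altered, so the same person may show up under two different names in the statistics.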
Fixed by commit below.
https://gitweb.gentoo.org/repo/gentoo.git/commit/?id=93107b6974084bd968d73c5c2b09a054e433972e

Pull request placed upstream too.
https://github.com/hoxu/gitstats/pull/63

commit 93107b6974084bd968d73c5c2b09a054e433972e
Author: Guilherme Amadio <amadio@gentoo.org>
Date:   Thu Mar 24 11:09:16 2016 -0300

    dev-vcs/gitstats-0_pre131024: fix bug #575946

    Gentoo-bug: 575946
    Reported-by: Martin Mokrejš
    Package-Manager: portage-2.2.28
(In reply to Guilherme Amadio from comment #10)
> An alternative that discards only characters instead of lines, also listed
> in the link above is to use
>
> > grep -av ^commit | iconv -c -t UTF-8

I would have preferred this one (just discarding one character, but can it really not be converted to us-ascii?). Interestingly, I was also trying to use iconv to convert away the non-ascii chars, but I failed. Maybe that was because of the Russian azbuka (Cyrillic), I don't remember now.

I thought there was another bug: a user's LANG or other LC_* variables affect how grep works, and if grep switches itself into binary mode the process gets stuck. Maybe gitstats could internally force its own LANG/LC_CTYPE/other, but I do not know which one and to what value. Forcing all to UTF8 did not help in my hands. So my opinion was, and still is, that more is needed on the gitstats side too.

I do not know how this squares with the general expectation that gitstats should respect the user's environment settings, possibly for its own output. Maybe they should only be respected for the encoding of the output HTML. If not, gitstats should cherry-pick some env variable and forcibly pass it to the forked grep processes ('export LC_CTYPE' does not work).

I have not learned enough about how USE=nls affects the grep binary; gnuplot also has an nls USE flag. Quite a knotted situation.
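For the idea of forcibly passing a locale variable to the forked processes, here is a minimal sketch of the mechanism in Python. It shows only how a child process can be given a fixed LC_ALL regardless of the caller's environment. The function name run_with_locale and the default value 'C' are assumptions for illustration, and which variable and value would actually help is exactly the open question above:

```python
import os
import subprocess
import sys


def run_with_locale(argv, locale='C'):
    """Run argv with LC_ALL forced to locale and return its stdout (bytes).

    The parent's environment is copied, so everything else (PATH, HOME,
    ...) is inherited unchanged. Hypothetical helper, not gitstats code.
    """
    env = dict(os.environ)
    env['LC_ALL'] = locale
    result = subprocess.run(argv, env=env, stdout=subprocess.PIPE, check=True)
    return result.stdout


# Demonstration: the child reports the LC_ALL value it actually received.
child_code = 'import os; print(os.environ["LC_ALL"])'
seen = run_with_locale([sys.executable, '-c', child_code], 'C')
```

Passing the variable through env= avoids depending on whether the user exported it in the shell that started the program.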
(In reply to Martin Mokrejš from comment #12)
> (In reply to Guilherme Amadio from comment #10)
> > An alternative that discards only characters instead of lines, also listed
> > in the link above is to use
> >
> > > grep -av ^commit | iconv -c -t UTF-8
>
> I would have preferred this one (just discarding one character but cannot is
> be really converted to us-ascii?). Interestingly, I was also trying to use
> iconv to convert away the non-ascii chars but I failed. Maybe that was
> because of the russian azbuka, I don't remember now.

I totally agree. I also would have preferred the iconv solution. I tested it and it didn't work, so I used the solution with grep. Of course, if we can make iconv work, I will happily update the patch.

> I thought there is another bug - a users LANG or other LC_* variables
> affects how grep works and if it switches itself into the binary mode the
> process gets stuck. Maybe gitstat could internally force its own
> LANG/LC_CTYPE/other but I do not know which one and to what value. Forcing
> all to UTF8 did not help in my hands. So, my opinion was and still is there
> is more needed on the gitstats side to.

The -a option to grep forces it not to switch to binary mode, so there is no need to change the locale.

> I do not know how this translates to a general expectation that gitstats
> should respects users environment settings eventually for it own output.
> Maybe that should be only respected in output HTML encoding. If not, it
> should cherry-pick some env variable and forcibly pass it to forked grep
> processes ('export LC_CTYPE' does not work).

In this case, I don't think that the locale is the problem, as the data in the commits presumably contains Unicode, but with some invalid characters. Since it is a mixture of different encodings, any single encoding will fail. The LC_ALL=en_US.UTF-8 solution shown above did not solve the problem for me.
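Because the stream mixes encodings, one option is to decode line by line and fall back to a second encoding only when UTF-8 fails. The sketch below assumes Latin-1 as the fallback, which is a guess: it fits the byte 0xE1 reported earlier in this thread (it decodes to "á"), but it would mis-decode text written in other legacy encodings. The function name decode_line is hypothetical:

```python
def decode_line(raw_line, fallback='latin-1'):
    """Decode raw_line (bytes) as UTF-8, falling back to another encoding.

    Latin-1 maps every byte to a character, so the fallback never fails,
    but the result is only correct if the guess about the encoding is.
    """
    try:
        return raw_line.decode('utf-8')
    except UnicodeDecodeError:
        return raw_line.decode(fallback)


# Byte 0xE1 (341 octal in the od dump) is not valid UTF-8 here.
legacy = decode_line(b'Tom\xe1s Touceda')
# A correctly encoded UTF-8 name passes through unchanged.
modern = decode_line('G\u00f6kt\u00fcrk'.encode('utf-8'))
```

Compared with dropping lines or bytes, this keeps every commit and every character, at the price of possibly wrong characters for authors whose legacy encoding is not the assumed one.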