The net-misc/pavuk-0.9.34-r1 ebuild hardcodes --disable-gtk and --disable-socks. An old ChangeLog comment from 2004 says GTK is "quite broken", but I re-enabled it and it builds and runs just fine for me. Modified ebuild attached.
Created attachment 185024 [details] pavuk ebuild
btw: Pavuk 0.9.35 has been available since 2007-02-21. http://sourceforge.net/projects/pavuk/files/ I guess it would be better to bump first before fixing. ;-)
The developer has been missing for 3 years. I'll enhance the ebuild... seems simple.
Created attachment 275269 [details] Update and enhanced ebuild.
The package is unusable, it segfaults. Remove it.
OK, thanks for reporting. If you find any other package assigned to maintainer-needed that is also badly broken, please report it, or add a comment if a bug already exists.
pavuk works fine for me, I've been using it for a long time. Too bad if this isn't maintained by upstream anymore, as it's a great piece of software. As for GUI, I didn't know it was supposed to have one. To me it's a multi threaded replacement for wget, to be used on the command line. So far no segfaults (on ~amd64). I'll just move this into my local overlay then, as long as it works...
It segfaults frequently in my case and I tried various CFLAGS/CXXFLAGS to solve the case, but it didn't work out. http://tech.groups.yahoo.com/group/pavuk/message/1006 Can you reproduce it with my test URL?
I could not reproduce your segfault with the URL, dE. pavuk's CVS source was still being developed until 11 months ago, so you could test whether the CVS version segfaults, too.
~amd64 sys-devel/gcc-4.5.3-r2 hardened
USE="nls ssl"
CFLAGS="-march=native -O2 -fomit-frame-pointer -pipe -floop-interchange -floop-strip-mine -floop-block"
It does throw a severe warning about a buffer overflow when compiling, though. I'm not aware of any other web grabber as powerful as pavuk. It's surprising that such a beautiful project has never been forked and maintained by anyone. As long as there are no security risks, I would still prefer it to be kept in the Portage tree.
@dE, I see there is a last upstream version, 0.9.35, from 2007. Could you try bumping it locally to see whether it solves your issues (including that buffer overflow problem)?
Ok, your URL segfaults for me too, if I let pavuk crawl it with no restrictions. It ends up kicking me to the German weather service, where it then segfaults. I usually let pavuk crawl only simpler/cleaner structures and/or explicitly forbid it to follow links I'm not interested in (which is also nicer to the website owners). Maybe that's why I haven't come across such issues so far.
(In reply to comment #9)
> It's surprising that such a beautiful project is never forked and maintained by anyone.
I don't understand the code. :-( If I had to maintain it I'd rewrite it from scratch in another language. :-)
(In reply to comment #9)
> I could not reproduce your segfault with the URL, dE. pavuk's cvs source was
> still being developed until 11 months ago, so you could test if the cvs version
> segfaults, too.
>
> ~amd64 sys-devel/gcc-4.5.3-r2 hardened
> USE="nls ssl"
> CFLAGS="-march=native -O2 -fomit-frame-pointer -pipe -floop-interchange
> -floop-strip-mine -floop-block"
>
> It does throw a severe warning about buffer overflow when compiling, though.
>
> I'm not aware of any other web grabber as powerful as pavuk is. It's surprising
> that such a beautiful project is never forked and maintained by anyone. As far
> as there's no security risks I would still prefer it being kept in Portage
> tree.
I'll try. This is much better than the one and only alternative, httrack, which is slow and takes too much memory.
No... I remember, I did try, but it still segfaults, not only on Gentoo but also on Debian. I also tried a variety of CFLAGS; I even tried -march=native -O0. Anyway, I'll try again. Suggestions are welcome.
Humm. There's no configure script.
YES!! The CVS version does NOT segfault!!!!!!! :) :) Now we need a CVS ebuild.
Although I don't know much about CVS (or, for that matter, about version control at all), can I get some documentation related to this EAPI?
Created attachment 302003 [details] pavuk-9999.ebuild
Here you go, pavuk-9999.ebuild. Notes:
1. The pcre USE flag is broken! Well, I may look into it someday.
2. I only tested with USE="nls ssl -hammer -ipv6 -pcre -profile" on amd64.
3. The hammer USE flag is used for server stress tests.
4. I took out the two patches from pavuk-0.9.34-r2; not sure whether that breaks things.
5. I used an ugly hack in src_unpack() to make aclocal work. I wonder if there's a more graceful solution.
6. Here's the how-to about using CVS sources in ebuilds. Note that it's outdated and quite broken, so don't follow it literally! http://devmanual.gentoo.org/ebuild-writing/functions/src_unpack/cvs-sources/index.html
Created attachment 302043 [details] pavuk-0.9.36_pre20120215.ebuild
Well, CVS ebuilds are not recommended for main tree usage (upstream could change the sources at any time without notice, which makes them hard to maintain). Instead, we take a snapshot of the live sources and rely on that. I have just done that, but when trying to compile with the attached ebuild, the build fails with the following:
-pipe -march=native -Wall -Wextra -Wno-unused-parameter -Wno-missing-field-initializers -c jsbind.c
x86_64-pc-linux-gnu-gcc -std=gnu99 -DHAVE_CONFIG_H -I. -I.. -I/usr/include/js -DXP_UNIX -I/usr/include/gtk-2.0 -I/usr/lib64/gtk-2.0/include -I/usr/include/atk-1.0 -I/usr/include/cairo -I/usr/include/gdk-pixbuf-2.0 -I/usr/include/pango-1.0 -I/usr/include/glib-2.0 -I/usr/lib64/glib-2.0/include -I/usr/include/pixman-1 -I/usr/include/freetype2 -I/usr/include/libpng15 -I/usr/include/libdrm -pthread -I/usr/include -DGETTEXT_DEFAULT_CATALOG_DIR="\"/usr/share/locale\"" -DDEFAULTRC="\"/etc/pavukrc\"" -O2 -pipe -march=native -Wall -Wextra -Wno-unused-parameter -Wno-missing-field-initializers -c jstrans.c
htmlparser.c: In function 'html_parser_process_new_base_url':
htmlparser.c:263: warning: comparison of unsigned expression >= 0 is always true
htmlparser.c:264: warning: comparison between signed and unsigned integer expressions
htmlparser.c: In function 'html_parser_url_to_local':
htmlparser.c:1560: warning: comparison of unsigned expression >= 0 is always true
htmlparser.c:1561: warning: comparison between signed and unsigned integer expressions
htmlparser.c: In function 'html_parser_change_url':
htmlparser.c:1690: warning: comparison of unsigned expression >= 0 is always true
htmlparser.c:1691: warning: comparison between signed and unsigned integer expressions
x86_64-pc-linux-gnu-gcc -std=gnu99 -DHAVE_CONFIG_H -I. -I.. -I/usr/include/js -DXP_UNIX -I/usr/include/gtk-2.0 -I/usr/lib64/gtk-2.0/include -I/usr/include/atk-1.0 -I/usr/include/cairo -I/usr/include/gdk-pixbuf-2.0 -I/usr/include/pango-1.0 -I/usr/include/glib-2.0 -I/usr/lib64/glib-2.0/include -I/usr/include/pixman-1 -I/usr/include/freetype2 -I/usr/include/libpng15 -I/usr/include/libdrm -pthread -I/usr/include -DGETTEXT_DEFAULT_CATALOG_DIR="\"/usr/share/locale\"" -DDEFAULTRC="\"/etc/pavukrc\"" -O2 -pipe -march=native -Wall -Wextra -Wno-unused-parameter -Wno-missing-field-initializers -c lfname.c
jsbind.c: In function 'pjs_tracing':
jsbind.c:407: error: dereferencing pointer to incomplete type
jsbind.c:410: error: dereferencing pointer to incomplete type
jsbind.c: In function 'pjs_load':
jsbind.c:480: warning: comparison between signed and unsigned integer expressions
jsbind.c: In function 'pjs_print':
jsbind.c:503: warning: comparison between signed and unsigned integer expressions
jsbind.c: In function 'pjs_url_obj_set_url':
jsbind.c:693: warning: comparison of unsigned expression >= 0 is always true
jsbind.c: In function 'pjs_url_get_property':
jsbind.c:780: warning: comparison of unsigned expression >= 0 is always true
jsbind.c:780: warning: comparison between signed and unsigned integer expressions
jsbind.c: In function 'pjs_url_check_cond':
jsbind.c:1035: warning: comparison between signed and unsigned integer expressions
jsbind.c: In function 'pjs_url_class_init':
jsbind.c:1077: error: 'JSObject' has no member named 'setProto'
jsbind.c: In function 'pjs_fnrules_class_init':
jsbind.c:1354: error: 'JSObject' has no member named 'setProto'
jsbind.c: In function 'psj_BranchCallback':
jsbind.c:1383: error: dereferencing pointer to incomplete type
jsbind.c:1383: error: dereferencing pointer to incomplete type
jsbind.c: In function 'psj_ContextNewSetup':
jsbind.c:1446: error: too many arguments to function 'JS_SetOperationCallback'
jsbind.c:1447: error: 'JSOPTION_NATIVE_BRANCH_CALLBACK' undeclared (first use in this function)
jsbind.c:1447: error: (Each undeclared identifier is reported only once
jsbind.c:1447: error: for each function it appears in.)
jsbind.c: In function 'psj_ContextCallback':
jsbind.c:1460: warning: unused variable 'retv'
jsbind.c: In function 'pjs_run_parse_content_func':
jsbind.c:1888: warning: passing argument 1 of 'STRING_TO_JSVAL' from incompatible pointer type
/usr/include/js/jsapi.h:216: note: expected 'struct JSString *' but argument is of type 'char *'
jsbind.c:1889: warning: passing argument 1 of 'STRING_TO_JSVAL' from incompatible pointer type
/usr/include/js/jsapi.h:216: note: expected 'struct JSString *' but argument is of type 'char *'
jsbind.c:1890: warning: passing argument 1 of 'STRING_TO_JSVAL' from incompatible pointer type
/usr/include/js/jsapi.h:216: note: expected 'struct JSString *' but argument is of type 'char *'
jsbind.c: In function 'pjs_run_doc_process_func':
jsbind.c:1986: warning: unused variable 'ourl'
jsbind.c:1985: warning: unused variable 'rv'
jsbind.c:1984: warning: unused variable 'param'
make[2]: *** [jsbind.o] Error 1
make[2]: *** Waiting for unfinished jobs....
jstrans.c: In function 'js_transform_match_tag':
jstrans.c:55: warning: comparison between signed and unsigned integer expressions
http.c: In function 'is_valid_http_response_code':
http.c:95: warning: comparison between signed and unsigned integer expressions
http.c: In function 'http_dummy_proxy_send_connect':
http.c:467: warning: comparison between signed and unsigned integer expressions
http.c: In function 'http_request':
http.c:701: warning: comparison of unsigned expression >= 0 is always true
http.c:702: warning: comparison between signed and unsigned integer expressions
http.c:1014: warning: comparison between signed and unsigned integer expressions
http.c: In function 'http_get_response_info2':
http.c:2242: warning: comparison between signed and unsigned integer expressions
lfname.c: In function 'lfname_get_by_url':
lfname.c:317: warning: comparison of unsigned expression >= 0 is always true
lfname.c:318: warning: comparison between signed and unsigned integer expressions
lfname.c:413: warning: comparison between signed and unsigned integer expressions
lfname.c: In function 'lfname_lsp_token_type':
lfname.c:1345: warning: comparison between signed and unsigned integer expressions
lfname.c: At top level:
lfname.c:1392: warning: 'lfname_lsp_var_ret_free' defined but not used
make[2]: Leaving directory `/var/tmp/portage/net-misc/pavuk-0.9.36_pre20120215/work/pavuk/src'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/var/tmp/portage/net-misc/pavuk-0.9.36_pre20120215/work/pavuk'
make: *** [all] Error 2
I am trying to merge it with these USE flags:
net-misc/pavuk-0.9.36_pre20120215 USE="nls pcre ssl -hammer -ipv6 -profile"
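A side note on the many "comparison of unsigned expression >= 0 is always true" warnings in that log: they usually point at a check or loop condition on an unsigned variable that can never fail. The sketch below is not pavuk code; last_index_of() is a hypothetical helper that only illustrates the pattern and one common way to write the loop so that it terminates.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not taken from pavuk: index of the last occurrence
 * of c in s, or -1 if c does not occur. */
long last_index_of(const char *s, char c)
{
    size_t len = strlen(s);

    /* Form that triggers the warning:
     *     for (size_t i = len - 1; i >= 0; i--)
     * i is unsigned, so "i >= 0" is always true; after reaching 0 the
     * index wraps around to a huge value instead of ending the loop. */
    for (size_t i = len; i-- > 0; ) {
        if (s[i] == c)
            return (long)i;
    }
    return -1;
}
```

With this form, last_index_of("a/b/c", '/') yields 3 and an empty string simply yields -1.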
Created attachment 302047 [details] pavuk-0.9.36_pre20120215.ebuild
First of all, thanks for your CVS snapshot. :) Okay, I guess I know why your compilation failed: you have spidermonkey installed, and pavuk tries to enable its JavaScript bindings with spidermonkey, which unfortunately seem to be broken. The JS binding is controlled by the javascript USE flag in this ebuild. If you are experiencing issues, disable it.
Changes in the ebuild:
* I removed the broken pcre USE flag. (Well, I said it was broken in the last comment...)
* Fixed a bug in my old ebuild that made USE flags totally ineffective. (A silly mistake indeed...)
* Added some USE flags and a line of code to install the man page.
* Another ugly sed hack to fix an annoying bug in pavuk's configure script that prevents people from turning debugging features off...
Fine, thanks a lot. I won't have time to check it until tomorrow, but I will look into it as soon as possible :) Another question: have you checked whether the GUI can be enabled (this bug was originally opened for it)? Or is it still too broken? Also, is bug 340833 still valid for this snapshot?
I hope pavuk-0.9.36_pre20120215 will be uploaded to the tree soon.
Created attachment 302111 [details] pavuk-0.9.36_pre20120215.ebuild
I could hardly believe that this 3-year-old GTK+ 2 interface still works! It was broken in my older ebuilds because the sed hack corrupted some other lines in configure.in. This ebuild installs pavuk.desktop, too. The _FORTIFY_SOURCE overflow issue is gone in the CVS ebuild. Yet there's another QA warning:
 * QA Notice: Package triggers severe warnings which indicate that it
 * may exhibit random runtime failures.
 * /usr/include/bits/socket2.h:43:2: warning: call to ‘__recv_chk_warn’ declared with attribute warning: recv called with bigger length than size of destination buffer
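For context, the __recv_chk_warn notice fires when the compiler can see that the length passed to recv() is larger than the destination buffer. The following is a minimal sketch of the usual remedy, assuming a POSIX system; recv_text() is a hypothetical helper, not a function from pavuk, and where the oversized call sits in pavuk's sources is not identified here.

```c
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical helper, not from pavuk: receive at most cap - 1 bytes into
 * buf and NUL-terminate the result. The length handed to recv() is derived
 * from the buffer capacity, so it can never exceed the destination. */
ssize_t recv_text(int fd, char *buf, size_t cap)
{
    ssize_t n;

    if (cap == 0)
        return -1;                    /* no room even for the terminator */

    n = recv(fd, buf, cap - 1, 0);    /* bounded by the buffer size */
    if (n >= 0)
        buf[n] = '\0';
    return n;
}
```

A caller would pass sizeof buf for a local array, e.g. recv_text(fd, buf, sizeof buf), rather than a separately maintained constant.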
+*pavuk-0.9.36_pre20120215 (16 Feb 2012)
+
+  16 Feb 2012; Pacho Ramos <pacho@gentoo.org> +pavuk-0.9.36_pre20120215.ebuild,
+  -files/pavuk-0.9.34-gcc43.patch, -files/pavuk-0.9.34-nls.patch,
+  -pavuk-0.9.34-r1.ebuild, -pavuk-0.9.34-r2.ebuild, metadata.xml:
+  Version bump with latest upstream snapshot including tons of bugfixes, this
+  also solves bug #262504 (adding a gtk2 gui) and buffer overflow problems (bug
+  #340833). Thanks a lot to Richard Grenville, dE, Andreas Klauer and D_W.
+  Remove old and broken buggy versions.
+
An important last suggestion: what about proxy-maintaining this? ;) I am willing to be your proxy maintainer: http://www.gentoo.org/proj/en/qa/proxy-maintainers/index.xml Or, if you want to become a gentoo-dev and help with this and other packages, feel free to mail me and I will try to find a mentor for you :) Thanks a lot
Created attachment 302157 [details] pavuk-0.9.36_pre20120215.ebuild
Unfortunately I can confirm that the JavaScript binding is utterly broken. After quite some effort I got through pavuk's src_configure() with ~dev-lang/spidermonkey-1.8.5, yet I saw a million error messages in src_compile() coming from jsbind.c... So I presume it would take too much effort to fix the 2000+ line JavaScript binding code in pavuk, and I have to give up. This ebuild doesn't change much: I removed the javascript USE flag, added some more comments to make my nasty hacks more understandable, and replaced my ugly pavuk.desktop installation code with domenu. I might try to fix the PCRE support later. I'm willing to proxy-maintain this ebuild, and help with other ebuilds. I guess my knowledge is insufficient for being a Gentoo-dev, nonetheless. "Gentoo-dev" sounds like a too sacred job. :D
Created attachment 302159 [details] pavuk-0.9.36_pre20120215.ebuild
Oops, I did not notice you had made many changes to the ebuild, sorry. Merged your changes.
Can you please add meaning of the 'hammer' and 'profile' (still don't know this one) in ebuild?
And automake in the build depends.
>>> Emerging (1 of 1) www-client/pavuk-0.9.36_pre20120215 from my-tree
>>> Failed to emerge www-client/pavuk-0.9.36_pre20120215
 * Messages for package www-client/pavuk-0.9.36_pre20120215:
 * ERROR: www-client/pavuk-0.9.36_pre20120215 failed (prepare phase):
 *   Cannot find the latest automake! Tried 1.11
 *
 * Call stack:
 *   ebuild.sh, line 85: Called src_prepare
 *   environment, line 2945: Called autotools-utils_src_prepare
 *   environment, line 552: Called autotools-utils_autoreconf
 *   environment, line 464: Called eaclocal
 *   environment, line 912: Called eaclocal_amflags
 *   environment, line 920: Called autotools_env_setup
 *   environment, line 590: Called die
 * The specific snippet of code:
 *   [[ ${WANT_AUTOMAKE} == "latest" ]] && die "Cannot find the latest automake! Tried ${_LATEST_AUTOMAKE}";
 *
 * If you need support, post the output of 'emerge --info =www-client/pavuk-0.9.36_pre20120215',
 * the complete build log and the output of 'emerge -pqv =www-client/pavuk-0.9.36_pre20120215'.
 * This ebuild is from an overlay named 'my-tree': '/home/de/dev-tree/'
 * The complete build log is located at '/tmp/portage/www-client/pavuk-0.9.36_pre20120215/temp/build.log'.
 * The ebuild environment file is located at '/tmp/portage/www-client/pavuk-0.9.36_pre20120215/temp/environment'.
 * S: '/tmp/portage/www-client/pavuk-0.9.36_pre20120215/work/pavuk'
Pacho Ramos has already added a description for "hammer" to the tree:
---------------
<flag name="hammer">Turn on chunky/hammer mode (DoS) in pavuk: when specified, pavuk will include features to stress test web sites using an ultrahigh performance replay mechanism</flag>
---------------
(In reply to comment #27)
> Can you please add meaning of the 'hammer' and 'profile' (still don't know this
> one) in ebuild?
===============
1. I don't know exactly which version of automake pavuk requires, so I used the default value "latest".
2. Someone said he resolved the issue by manually emerging automake. https://bugs.gentoo.org/show_bug.cgi?id=400639
3. The pavuk ebuild is in the Portage tree already, so please don't emerge from your local overlay. And never enable that javascript USE flag!
(In reply to comment #28)
> And automake in the build depends.
>
> >>> Emerging (1 of 1) www-client/pavuk-0.9.36_pre20120215 from my-tree
> >>> Failed to emerge www-client/pavuk-0.9.36_pre20120215
>
> * Messages for package www-client/pavuk-0.9.36_pre20120215:
>
> * ERROR: www-client/pavuk-0.9.36_pre20120215 failed (prepare phase):
> * Cannot find the latest automake! Tried 1.11
+  17 Feb 2012; Pacho Ramos <pacho@gentoo.org> metadata.xml,
+  pavuk-0.9.36_pre20120215.ebuild:
+  Drop javascript support as it's completely broken, bug #262504#c25 by Richard
+  Grenville. I will start to be his proxy-maintainer for this also.
+
Thanks again :)
(In reply to comment #25)
> I'm willing to proxy-maintain this ebuild, and help with other ebuilds. I guess
> my knowledge is insufficient for being a Gentoo-dev, nonetheless. "Gentoo-dev"
> sounds like a too sacred job. :D
I think almost all of us have at some point thought our knowledge was insufficient (I still remember when fauli, my mentor, "pushed" me to become a dev ;)). After reading:
http://www.gentoo.org/proj/en/devrel/handbook/handbook.xml
http://devmanual.gentoo.org/
I am sure you will be ready (it looks like a lot of documentation, but you can read it over the coming months, whenever you have time :))
(In reply to comment #27)
> Can you please add meaning of the 'hammer' and 'profile' (still don't know this
> one) in ebuild?
They are already there:
# equery uses pavuk
[ Legend : U - final flag setting for installation]
[        : I - package is installed with flag     ]
[ Colors : set, unset                             ]
 * Found these USE flags for net-misc/pavuk-0.9.36_pre20120215:
 U I
 - - debug   : Enable extra debug codepaths, like asserts and extra output. If you want to get meaningful backtraces see http://www.gentoo.org/proj/en/qa/backtraces.xml
 + + gtk     : Adds support for x11-libs/gtk+ (The GIMP Toolkit)
 - - hammer  : Turn on chunky/hammer mode (DoS) in pavuk: when specified, pavuk will include features to stress test web sites using an ultrahigh performance replay mechanism
 - - ipv6    : Adds support for IP version 6
 + + nls     : Adds Native Language Support (using gettext - GNU locale utilities)
 - - profile : Adds support for software performance analysis (will likely vary from ebuild to ebuild)
 + + ssl     : Adds support for Secure Socket Layer connections
(The meaning of profile for this ebuild is the "global" one, nothing special compared with other ebuilds using it.)
Regarding the automake problems, I cannot reproduce them; try re-emerging it as Richard suggested.
AH!
*** glibc detected *** pavuk: free(): invalid next size (fast): 0x00000000034bf860 ***
======= Backtrace: =========
/lib64/libc.so.6[0x3be7a74d25]
/lib64/libc.so.6(cfree+0x6c)[0x3be7a79abc]
pavuk[0x40f419]
pavuk[0x430743]
pavuk[0x42f6b9]
pavuk[0x42eac0]
pavuk[0x42d283]
pavuk[0x443f33]
pavuk[0x4448c7]
pavuk[0x444bb0]
/lib64/libpthread.so.0[0x3be7e07c2c]
/lib64/libc.so.6(clone+0x6d)[0x3be7ad77bd]
======= Memory map: ========
00400000-0047e000 r-xp 00000000 08:07 352906 /usr/bin/pavuk
0067d000-0067e000 r--p 0007d000 08:07 352906 /usr/bin/pavuk
0067e000-0068f000 rw-p 0007e000 08:07 352906 /usr/bin/pavuk
0068f000-00692000 rw-p 00000000 00:00 0
0244e000-03cd2000 rw-p 00000000 00:00 0 [heap]
34a9200000-34a9215000 r-xp 00000000 08:05 721347 /lib64/libgcc_s.so.1
34a9215000-34a9414000 ---p 00015000 08:05 721347 /lib64/libgcc_s.so.1
34a9414000-34a9415000 r--p 00014000 08:05 721347 /lib64/libgcc_s.so.1
34a9415000-34a9416000 rw-p 00015000 08:05 721347 /lib64/libgcc_s.so.1
34a9a00000-34a9b87000 r-xp 00000000 08:07 363 /usr/lib64/libcrypto.so.1.0.0
34a9b87000-34a9d87000 ---p 00187000 08:07 363 /usr/lib64/libcrypto.so.1.0.0
34a9d87000-34a9da0000 r--p 00187000 08:07 363 /usr/lib64/libcrypto.so.1.0.0
34a9da0000-34a9daa000 rw-p 001a0000 08:07 363 /usr/lib64/libcrypto.so.1.0.0
34a9daa000-34a9dae000 rw-p 00000000 00:00 0
34a9e00000-34a9e55000 r-xp 00000000 08:07 167740 /usr/lib64/libssl.so.1.0.0
34a9e55000-34aa055000 ---p 00055000 08:07 167740 /usr/lib64/libssl.so.1.0.0
34aa055000-34aa058000 r--p 00055000 08:07 167740 /usr/lib64/libssl.so.1.0.0
34aa058000-34aa05d000 rw-p 00058000 08:07 167740 /usr/lib64/libssl.so.1.0.0
34b2800000-34b2975000 r-xp 00000000 08:07 199427 /usr/lib64/libdb-4.8.so
34b2975000-34b2b75000 ---p 00175000 08:07 199427 /usr/lib64/libdb-4.8.so
34b2b75000-34b2b77000 r--p 00175000 08:07 199427 /usr/lib64/libdb-4.8.so
34b2b77000-34b2b7a000 rw-p 00177000 08:07 199427 /usr/lib64/libdb-4.8.so
3be7600000-3be761f000 r-xp 00000000 08:05 11028 /lib64/ld-2.13.so
3be781f000-3be7820000 r--p 0001f000 08:05 11028 /lib64/ld-2.13.so
3be7820000-3be7821000 rw-p 00020000 08:05 11028 /lib64/ld-2.13.so
3be7821000-3be7822000 rw-p 00000000 00:00 0
3be7a00000-3be7b7f000 r-xp 00000000 08:05 12283 /lib64/libc-2.13.so
3be7b7f000-3be7d7e000 ---p 0017f000 08:05 12283 /lib64/libc-2.13.so
3be7d7e000-3be7d82000 r--p 0017e000 08:05 12283 /lib64/libc-2.13.so
3be7d82000-3be7d83000 rw-p 00182000 08:05 12283 /lib64/libc-2.13.so
3be7d83000-3be7d88000 rw-p 00000000 00:00 0
3be7e00000-3be7e17000 r-xp 00000000 08:05 13481 /lib64/libpthread-2.13.so
3be7e17000-3be8017000 ---p 00017000 08:05 13481 /lib64/libpthread-2.13.so
3be8017000-3be8018000 r--p 00017000 08:05 13481 /lib64/libpthread-2.13.so
3be8018000-3be8019000 rw-p 00018000 08:05 13481 /lib64/libpthread-2.13.so
3be8019000-3be801d000 rw-p 00000000 00:00 0
3be8600000-3be8602000 r-xp 00000000 08:05 703533 /lib64/libdl-2.13.so
3be8602000-3be8802000 ---p 00002000 08:05 703533 /lib64/libdl-2.13.so
3be8802000-3be8803000 r--p 00002000 08:05 703533 /lib64/libdl-2.13.so
3be8803000-3be8804000 rw-p 00003000 08:05 703533 /lib64/libdl-2.13.so
3bf3c00000-3bf3c13000 r-xp 00000000 08:05 712580 /lib64/libresolv-2.13.so
3bf3c13000-3bf3e13000 ---p 00013000 08:05 712580 /lib64/libresolv-2.13.so
3bf3e13000-3bf3e14000 r--p 00013000 08:05 712580 /lib64/libresolv-2.13.so
3bf3e14000-3bf3e15000 rw-p 00014000 08:05 712580 /lib64/libresolv-2.13.so
3bf3e15000-3bf3e17000 rw-p 00000000 00:00 0
7f0804000000-7f0805830000 rw-p 00000000 00:00 0
7f0805830000-7f0808000000 ---p 00000000 00:00 0
7f0808000000-7f080988f000 rw-p 00000000 00:00 0
7f080988f000-7f080c000000 ---p 00000000 00:00 0
7f080c000000-7f080dd33000 rw-p 00000000 00:00 0
7f080dd33000-7f0810000000 ---p 00000000 00:00 0
7f0810000000-7f0811821000 rw-p 00000000 00:00 0
7f0811821000-7f0814000000 ---p 00000000 00:00 0
7f0814000000-7f0815aaa000 rw-p 00000000 00:00 0
7f0815aaa000-7f0818000000 ---p 00000000 00:00 0
7f081ac52000-7f081ac57000 r-xp 00000000 08:05 734724 /lib64/libnss_dns-2.13.so
7f081ac57000-7f081ae56000 ---p 00005000 08:05 734724 /lib64/libnss_dns-2.13.so
7f081ae56000-7f081ae57000 r--p 00004000 08:05 734724 /lib64/libnss_dns-2.13.so
7f081ae57000-7f081ae58000 rw-p 00005000 08:05 734724 /lib64/libnss_dns-2.13.so
7f081ae83000-7f081ae84000 ---p 00000000 00:00 0
7f081ae84000-7f081aec2000 rw-p 00000000 00:00 0
7f081aec2000-7f081aec3000 ---p 00000000 00:00 0
7f081aec3000-7f081af01000 rw-p 00000000 00:00 0
7f081af01000-7f081af02000 ---p 00000000 00:00 0
7f081af02000-7f081af40000 rw-p 00000000 00:00 0
7f081af40000-7f081af41000 ---p 00000000 00:00 0
7f081af41000-7f081af7f000 rw-p 00000000 00:00 0
7f081af7f000-7f081af80000 ---p 00000000 00:00 0
7f081af80000-7f081afbe000 rw-p 00000000 00:00 0
7f081afbe000-7f081afbf000 ---p 00000000 00:00 0
7f081afbf000-7f081affd000 rw-p 00000000 00:00 0
7f081affd000-7f081affe000 ---p 00000000 00:00 0
7f081affe000-7f081b03c000 rw-p 00000000 00:00 0
7f081b03c000-7f081b03d000 ---p 00000000 00:00 0
7f081b03d000-7f081b07b000 rw-p 00000000 00:00 0
7f081b07b000-7f081b07c000 ---p 00000000 00:00 0
7f081b07c000-7f081b0ba000 rw-p 00000000 00:00 0
7f081b0ba000-7f081b0bb000 ---p 00000000 00:00 0
7f081b0bb000-7f081b0f9000 rw-p 00000000 00:00 0
7f081b0f9000-7f081b105000 r-xp 00000000 08:05 734726 /lib64/libnss_files-2.13.so
7f081b105000-7f081b304000 ---p 0000c000 08:05 734726 /lib64/libnss_files-2.13.so
7f081b304000-7f081b305000 r--p 0000b000 08:05 734726 /lib64/libnss_files-2.13.so
7f081b305000-7f081b306000 rw-p 0000c000 08:05 734726 /lib64/libnss_files-2.13.so
7f081b306000-7f081b8ed000 rw-p 00000000 00:00 0
7f081b8ed000-7f081b901000 r-xp 00000000 08:05 768968 /lib64/libz.so.1.2.5.1
7f081b901000-7f081bb00000 ---p 00014000 08:05 768968 /lib64/libz.so.1.2.5.1
7f081bb00000-7f081bb01000 r--p 00013000 08:05 768968 /lib64/libz.so.1.2.5.1
7f081bb01000-7f081bb02000 rw-p 00014000 08:05 768968 /lib64/libz.so.1.2.5.1
7f081bb2c000-7f081bb2e000 rw-p 00000000 00:00 0
7fff3a828000-7fff3a84a000 rw-p 00000000 00:00 0 [stack]
7fff3a9ff000-7fff3aa00000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
Aborted
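For readers unfamiliar with this glibc message: "free(): invalid next size" means the allocator found corrupted bookkeeping data next to the block being freed, which is typically the result of an earlier write past the end of a heap buffer. Where that happens in pavuk is not known from this dump. The sketch below only illustrates the classic cause and its fix; concat_alloc() is a hypothetical helper, not pavuk code.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not from pavuk: return a newly allocated string
 * holding a followed by b, or NULL on allocation failure. */
char *concat_alloc(const char *a, const char *b)
{
    size_t la = strlen(a);
    size_t lb = strlen(b);

    /* A typical bug is malloc(la + lb) without the "+ 1": the terminating
     * NUL then lands one byte past the block and can corrupt the heap
     * metadata that free() later validates. */
    char *out = malloc(la + lb + 1);
    if (out == NULL)
        return NULL;

    memcpy(out, a, la);
    memcpy(out + la, b, lb + 1);      /* copies b including its NUL */
    return out;
}
```

Such overruns often go unnoticed until a later free(), which is why the abort can appear far away from the code that caused it.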
Open a separate bug report with a test case please
Ok, this'll take some time.
Created attachment 302337 [details] pavuk-0.9.36_pre20120215.ebuild
New ebuild with PCRE support, tested with libpcre-0.30 but should work with previous versions. The patch pavuk-0.9.36_pre20120215-pcre-fix.patch will be attached later. I didn't expect that fixing PCRE support would require only such a little patch... I checked the PCRE support code in re.c and found no other issues. pavuk does not quite free compiled PCRE regular expressions in the standard way, though.
I discovered a segfault in pavuk's GTK+ interface:
pavuk -X -> "Files" menu -> "Add URL" -> "Limitations" button -> SIGSEGV
I could not get meaningful backtraces for now, as I don't have GTK+ and glibc built with debugging symbols... Well, and I have a really slow computer. Anyway, a GUI that segfaults sometimes is better than no GUI at all.
===============================
(In reply to comment #33)
> Ok, this'll take some time.
I guess a backtrace would also be needed: http://www.gentoo.org/proj/en/qa/backtraces.xml
Created attachment 302339 [details, diff] pavuk-0.9.36_pre20120215-pcre-fix.patch
For starters, you can try and mirror http://en.wikipedia.org/wiki/Main_Page, but I'm looking forward towards more examples.
(In reply to comment #36)
> For starters, you can try and mirror http://en.wikipedia.org/wiki/Main_Page,
> but I'm looking forward towards more examples.
You are talking about the glibc issue? I tried running with the URL for a few minutes and nothing unexpected occurred. I really should not try to mirror Wikipedia unless I wish to have my little hard drive exploded. Please tell us on which URL the problem appeared, the full command you used to execute pavuk, what the last few lines of output before the problem occurred were, and your emerge --info pavuk.
(In reply to comment #34)
> Created attachment 302337 [details]
> pavuk-0.9.36_pre20120215.ebuild
>
> New ebuild with PCRE support, tested with libpcre-0.30 but should work with
> previous versions. The patch pavuk-0.9.36_pre20120215-pcre-fix.patch will be
> attached later.
>
Please always try to apply patches unconditionally; otherwise, there is a high chance of ending up with broken patches in future updates, as it's easy to forget to test building and running for every USE flag combination ;)
+  19 Feb 2012; Pacho Ramos <pacho@gentoo.org>
+  +files/pavuk-0.9.36_pre20120215-pcre-fix.patch,
+  pavuk-0.9.36_pre20120215.ebuild:
+  Fix PCRE support (bug #262504#c34 by Richard Grenville), install icon for menu
+  entry.
+
Created attachment 302661 [details, diff] pavuk-0.9.36_pre20120215-fix-gtkmulticol-segfault.patch
A patch to fix the segfault I discovered in pavuk's GTK+2 interface, which probably only happens on amd64 systems. This patch is experimental and may cause other issues, since I have almost no knowledge of GTK+/glib.
> I discovered a segfault in pavuk's GTK+ interface.
> pavuk -X -> "Files" menu -> "Add URL" -> "Limitations" button -> SIGSEGV
>
> Could not get meaningful backtraces currently as I don't have GTK+ and glibc
> built with debugging symbols... Well, and I have a really slow computer.
> Anyway, a GUI that segfaults sometimes is better than no GUI at all.
Basically, pavuk uses a custom GTK+ widget, "GtkMultiCol", in the "Limitations" window. When it registers the widget with gtk_type_unique() (a very deprecated function...), it stores the returned address of the new widget type in a variable of type "guint" (unsigned int?). An unsigned int is not wide enough to hold a pointer on amd64 systems, so the assignment truncates the pointer and produces a totally incorrect one, which causes the segfault.
Again, I know basically nothing about GTK+ or glib and I started actually touching gdb only yesterday, so this patch may cause unpredictable behavior... I could only verify that it fixes the segfault on my own system.
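The truncation described above can be modeled with plain fixed-width integers. This is only an illustrative sketch under the assumption that guint is a 32-bit unsigned int on amd64; store_in_uint32() and survives_uint32() are hypothetical names, and the sample addresses are arbitrary.

```c
#include <stdint.h>

/* Models storing a 64-bit address in a 32-bit unsigned variable:
 * the upper 32 bits are silently discarded by the conversion. */
uint32_t store_in_uint32(uint64_t address)
{
    return (uint32_t)address;
}

/* Returns 1 if the address comes back unchanged after the round trip,
 * 0 if information was lost. */
int survives_uint32(uint64_t address)
{
    return (uint64_t)store_in_uint32(address) == address;
}
```

An address such as 0x69ac10 fits into 32 bits and survives, while something like 0x7fffe9e240e0 comes back as 0xe9e240e0, which explains why such a bug can stay hidden on 32-bit systems or for low addresses. Keeping the value in a pointer-sized type (for example uintptr_t in plain C) avoids the loss.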
(In reply to comment #37)
> (In reply to comment #36)
> > For starters, you can try and mirror http://en.wikipedia.org/wiki/Main_Page,
> > but I'm looking forward towards more examples.
>
> You are talking about the glibc issue? I tried running with the URL for a few
> minutes and nothing unexpected occurred. I really should not try to mirror
> Wikipedia unless I wish to have my little hard drive exploded. Please tell us
> on which URL the problem appeared, the full command you used to execute pavuk,
> what the last few lines of output before the problem occurred were, and your
> emerge --info pavuk.
This problem occurred on an intranet. I was just trying Wikipedia for testing. This will need some heavy-duty testing.
So it could have something to do with crawling a massive number of pages? Still, if you are able to provide either a way to reproduce the problem or a full backtrace, maybe I could find the cause. http://www.gentoo.org/proj/en/qa/backtraces.xml
(In reply to comment #41)
> This problem occurred on an Intranet. I was just trying wiki for testing.
>
> This'll need some heavy duty testing.
Surprisingly, upgrading to an Intel processor solved the problem. The program now never segfaults.
(In reply to comment #43)
> Surprisingly, upgrading to an Intel processor solved the problem. The
> program now never segfaults.
I think you probably just happened not to hit the problem, or some other change fixed it. I can't think of any difference between AMD and Intel CPUs except that they support different instructions, and if it were an issue related to unsupported instructions, you would get SIGILL instead of SIGSEGV. Still: backtrace, backtrace, backtrace!
There appears to be a JS detection bug in pavuk which causes a segfault. So disable JS processing to avoid triggering this bug.
(In reply to comment #45)
> There appears to be a JS detection bug in pavuk which causes sefault.
>
> So disable js processing to avoid trigging this bug.
Do you mean -js_pattern? I tried fetching 100 Wikipedia pages with this command:
$ pavuk -enable_js -js_pattern "^document.[a-zA-Z0-9_]*.src[ \t]*=[ \t]*'(.*)'$" -sleep 1 http://en.wikipedia.org/wiki/Main_Page
I also tried a few other random JS patterns that actually appeared in Wikipedia's <script> code. I'm not sure whether they actually worked, but pavuk didn't segfault, anyway. I compiled pavuk with clang's Address Sanitizer, and it reported nothing wrong either. As I've already said, without a stack trace or an actual command that reproduces this problem, I guess what we can do is extremely limited.
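To show what the pattern in that command captures, here is a small sketch that applies the same expression with POSIX regcomp()/regexec(). Whether pavuk itself evaluates -js_pattern through POSIX regex or PCRE is an assumption not settled in this thread, and extract_js_src() is a hypothetical helper. Note that in the C string literal "\t" is a real tab character inside the bracket expression.

```c
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: if line matches the JS pattern from the command
 * above, copy the quoted value (capture group 1) into out, truncated to
 * fit cap bytes including the NUL, and return 1. Return 0 otherwise. */
int extract_js_src(const char *line, char *out, size_t cap)
{
    static const char pattern[] =
        "^document.[a-zA-Z0-9_]*.src[ \t]*=[ \t]*'(.*)'$";
    regex_t re;
    regmatch_t m[2];
    int found = 0;

    if (cap == 0 || regcomp(&re, pattern, REG_EXTENDED) != 0)
        return 0;

    if (regexec(&re, line, 2, m, 0) == 0 && m[1].rm_so >= 0) {
        size_t len = (size_t)(m[1].rm_eo - m[1].rm_so);
        if (len >= cap)
            len = cap - 1;            /* truncate rather than overflow */
        memcpy(out, line + m[1].rm_so, len);
        out[len] = '\0';
        found = 1;
    }
    regfree(&re);
    return found;
}
```

For a line like document.logo.src = 'img/logo.png' the helper extracts img/logo.png; lines that do not have this shape are rejected.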
It happened in html_parser_parse_jspatterns (htmlparser.c) line 1996; that's when I realized that this was a problem with JS processing (which is enabled by default). Disabling it worked. Anyway, it's a rare problem.
Created attachment 346762 [details, diff]
pavuk-0.9.36_pre20120215-fix-segfault-on-empty-external-js.patch

Well, finally I got some time to look into this. I'm able to reproduce a segfault when pavuk fetches an empty external JavaScript file (which is certainly a rare case!). Do you think it's related? Anyway, I attached a patch above that could possibly fix the issue.
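For readers without the attachment at hand, the general shape of a guard against an empty fetched document can be sketched as follows. This is not the attached patch and not pavuk's code: sketch_count_matches() and its parameters are hypothetical, and the assumption that an empty file arrives as a NULL or zero-length buffer is mine.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not part of pavuk: counts (possibly overlapping)
 * occurrences of `needle` in a fetched document body of `len` bytes.
 * An empty download is modelled as body == NULL and/or len == 0, and is
 * rejected before any byte of the buffer is read.
 */
size_t sketch_count_matches(const char *body, size_t len, const char *needle)
{
    size_t count = 0;
    size_t nlen;
    size_t i;

    /* Guard: an empty external script must not be scanned at all. */
    if (body == NULL || len == 0 || needle == NULL)
        return 0;

    nlen = strlen(needle);
    if (nlen == 0 || nlen > len)
        return 0;

    /* Only compare windows that lie fully inside the buffer. */
    for (i = 0; i + nlen <= len; i++) {
        if (memcmp(body + i, needle, nlen) == 0)
            count++;
    }
    return count;
}
```

The point of the sketch is only that the length check comes before the first access to the buffer; where exactly such a check belongs in pavuk depends on code not shown in this report.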
(In reply to comment #48)
> Created attachment 346762 [details, diff]
> pavuk-0.9.36_pre20120215-fix-segfault-on-empty-external-js.patch
>
> Well, finally I got some time to look into this. I'm able to reproduce a
> segfault when pavuk fetches an empty external JavaScript file (which is
> certainly a rare case!). Do you think it's related? Anyway, I attached a
> patch above that could possibly fix the issue.

Thank you!! Unfortunately I lost the test case. Here, pavuk works on very unclean websites. This patch should be merged with the Gentoo ebuild.
#0 url_hash_func (size=233, key=0x7fffe9e240e0) at dlhash_tools.c:48
#1 0x000000000041e66f in dlhash_insert (hash=0x69ac10, key_data=key_data@entry=0x7fffe9e240e0) at dlhash.c:75
#2 0x0000000000454128 in url_add_to_url_hash_tab (urlp=urlp@entry=0x7fffe9e240e0) at url.c:2323
#3 0x0000000000459252 in cat_links_to_url_list (l1=l1@entry=0x7fffe9e29600) at url.c:1969
#4 0x0000000000445e95 in process_document (docu=docu@entry=0x7ffff77974d0, check_lim=check_lim@entry=1) at recurse.c:881
#5 0x000000000044630a in mt_recurse (thrnr=thrnr@entry=1) at recurse.c:1195
#6 0x00000000004465f0 in mt_recurse_thrd (param=<optimized out>) at recurse.c:1335
#7 0x0000003fbac08ec6 in start_thread () from /lib64/libpthread.so.0
#8 0x0000003fba8ea6ed in clone () from /lib64/libc.so.6
(In reply to comment #50)
> #0 url_hash_func (size=233, key=0x7fffe9e240e0) at dlhash_tools.c:48
> #1 0x000000000041e66f in dlhash_insert (hash=0x69ac10, key_data=key_data@entry=0x7fffe9e240e0) at dlhash.c:75
> #2 0x0000000000454128 in url_add_to_url_hash_tab (urlp=urlp@entry=0x7fffe9e240e0) at url.c:2323
> #3 0x0000000000459252 in cat_links_to_url_list (l1=l1@entry=0x7fffe9e29600) at url.c:1969
> #4 0x0000000000445e95 in process_document (docu=docu@entry=0x7ffff77974d0, check_lim=check_lim@entry=1) at recurse.c:881
> #5 0x000000000044630a in mt_recurse (thrnr=thrnr@entry=1) at recurse.c:1195
> #6 0x00000000004465f0 in mt_recurse_thrd (param=<optimized out>) at recurse.c:1335
> #7 0x0000003fbac08ec6 in start_thread () from /lib64/libpthread.so.0
> #8 0x0000003fba8ea6ed in clone () from /lib64/libc.so.6

Now, this problem is a bit complicated. url_hash_func() could trigger a segfault if url_get_path() returns NULL, and url_get_path() returns NULL when it encounters either an unknown URL (URLT_UNKNOWN) or a URLT_FROMPARENT URL ("//hostname/path..."). Unknown URLs are usually filtered out correctly by url_append_condition(). URLT_FROMPARENT URLs, however, are usually converted to normal URLs according to info in their parent URLs. (Note, these URLs are NOT NORMAL. Firefox seemingly doesn't handle those URLs correctly either!)

Unfortunately, a parent URL isn't always easy to find (or sometimes not available at all), and pavuk's author doesn't do the conversion in many cases. As far as I know, a "//hostname/path" style URL is probably not converted correctly (note, only probably, I only skimmed through the code) when it's in a place in HTML not subject to (html_parser_)url_to_absolute_url() (seemingly it's done for all attributes), when it's inside JavaScript parsed by SpiderMonkey, when it's inside an HTTP redirection, or when it's inside CSS, thus possibly leading to this segfault.
So my assumption is that your website somehow uses a "//hostname/path" style URL in a place that pavuk doesn't handle correctly. If you enable the debug USE flag and the -debug option, you might be able to find where exactly the issue is.

And a perfect fix is not easy, as far as I can see. Handling conversion of URLT_FROMPARENT URLs in all those places would be tricky. And making the hash table system handle these URLs probably requires many modifications, and could potentially lead to more issues. If you don't really need to handle those crazy URLs, this may work for you:

diff --git a/src/url.c b/src/url.c
index 70cc90d..21b7fde 100644
--- a/src/url.c
+++ b/src/url.c
@@ -42,7 +42,7 @@ const protinfo prottable[] = {
 #endif
   {URLT_FILE, NULL, "file", "file://", 0, TRUE},
   {URLT_GOPHER, "gopher", "gopher", "gopher://", DEFAULT_GOPHER_PORT, TRUE},
-  {URLT_FROMPARENT, NULL, "//", "//", DEFAULT_HTTP_PORT, TRUE}
+  {URLT_FROMPARENT, NULL, "//", "//", DEFAULT_HTTP_PORT, FALSE}
 };
 
 int prottable_size(void)

Dropping the whole URLT_FROMPARENT line may work as well.
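The failure mode described above (a hash function dereferencing a path that can legitimately be NULL) can be illustrated with a minimal sketch. Everything here is hypothetical: sketch_url, sketch_get_path() and sketch_url_hash() only stand in for pavuk's url struct, url_get_path() and url_hash_func(), whose real definitions are not shown in this report, and the hash formula is an arbitrary choice for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a URL record; pavuk's real struct differs. */
typedef struct {
    const char *path; /* NULL models the "no path available" case */
} sketch_url;

/* Stand-in for a path getter that may return NULL. */
const char *sketch_get_path(const sketch_url *u)
{
    return u != NULL ? u->path : NULL;
}

/*
 * Hashes the URL path into the range [0, size). A NULL path is sent to
 * bucket 0 instead of being dereferenced, which is the guard an
 * unguarded hash function would be missing.
 */
unsigned int sketch_url_hash(unsigned int size, const sketch_url *u)
{
    const char *p = sketch_get_path(u);
    unsigned int h = 0;

    assert(size > 0); /* a zero-sized table would divide by zero below */

    if (p == NULL)
        return 0;

    for (; *p != '\0'; p++)
        h = h * 31u + (unsigned char)*p;

    return h % size;
}
```

Such a guard would only stop the crash in the hash function; it does not resolve the "//hostname/path" URL against its parent, so the URL would still not be fetched correctly.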
Apparently that's not the problem. It appears that each time there's a segfault, it's at a different location. I don't think it can be fixed without someone who knows all of the code.
(In reply to comment #52)
> Apparently that's not the problem.

Why, then?

> It appears that each time there's a segfault, it's at a different location.

This may happen if you are using multiple threads, or when you are starting from a directory with pre-fetched files, even if what you actually encounter is the same bug.

And, last but not least, I have already told you how to provide proper debugging information:

> If you enable the debug USE flag and -debug option, you might be able to find where exactly the issue is.

If you do wish to let me or anyone else look into your problem, please provide enough information. Otherwise, I've done what I could.
(In reply to comment #53)
> (In reply to comment #52)
> > Apparently that's not the problem.
>
> Why, then?
>
> > It appears that each time there's a segfault, it's at a different location.
>
> This may happen if you are using multiple threads, or when you are starting
> from a directory with pre-fetched files, even if what you actually
> encounter is the same bug.
>
> And, last but not least, I have already told you how to provide proper
> debugging information:
>
> > If you enable the debug USE flag and -debug option, you might be able to find where exactly the issue is.
>
> If you do wish to let me or anyone else look into your problem, please
> provide enough information. Otherwise, I've done what I could.

Thanks, I'll do that.