Bug 424385 - app-portage/euscan should stop scanning when blocked by robots.txt
Status: RESOLVED INVALID
Alias: None
Product: Gentoo Linux
Classification: Unclassified
Component: Current packages
Hardware: All Linux
Importance: Normal normal
Assignee: Corentin Chary (RETIRED)
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-07-01 12:59 UTC by Justin Lecher (RETIRED)
Modified: 2012-07-16 05:16 UTC
0 users

See Also:
Package list:
Runtime testing required: ---


Description Justin Lecher (RETIRED) gentoo-dev 2012-07-01 12:59:21 UTC
 * Url 'http://www.kdau.com/files' blocked by robots.txt
 * Generating version from 1.2.0
 * Brute forcing: http://www.kdau.com/files/gelemental-${PV}.tar.bz2
 * Url 'http://www.kdau.com/files/gelemental-1.2.1.tar.bz2' blocked by robots.txt
 * Url 'http://www.kdau.com/files/gelemental-1.2.2.tar.bz2' blocked by robots.txt
 * Url 'http://www.kdau.com/files/gelemental-1.2.3.tar.bz2' blocked by robots.txt
 * Url 'http://www.kdau.com/files/gelemental-1.3.0.tar.bz2' blocked by robots.txt
 * Url 'http://www.kdau.com/files/gelemental-1.4.0.tar.bz2' blocked by robots.txt
 * Url 'http://www.kdau.com/files/gelemental-1.5.0.tar.bz2' blocked by robots.txt
 * Url 'http://www.kdau.com/files/gelemental-2.0.0.tar.bz2' blocked by robots.txt
 * Url 'http://www.kdau.com/files/gelemental-3.0.0.tar.bz2' blocked by robots.txt
 * Url 'http://www.kdau.com/files/gelemental-4.0.0.tar.bz2' blocked by robots.txt

Once the base URL is blocked, we can skip the remaining URLs, because they will be blocked too.
Comment 1 Corentin Chary (RETIRED) gentoo-dev 2012-07-02 11:53:17 UTC
Not always: a "Disallow:" rule can be set for only a particular URL.

Anyway, printing these lines is almost free: robots.txt is fetched only once, and before scanning a URL we check whether we are allowed to do so, without starting a network request. The only drawback is the noise in the log...
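
The behaviour described above can be sketched with Python's standard urllib.robotparser: the rules are parsed once, and each candidate URL is then tested locally with can_fetch(), so no network request is made per URL. The robots.txt rules below are hypothetical (www.kdau.com's actual rules are not shown in this bug), and this is an illustration, not euscan's actual code.

```python
# Sketch only: hypothetical robots.txt rules, not the real www.kdau.com ones.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /files/
Allow: /downloads/
"""

# Parse the rules once, as euscan does per host.
parser = RobotFileParser()
parser.parse(rules.splitlines())

# Each check is purely local -- this is why printing one "blocked"
# line per candidate URL costs almost nothing.
for url in ("http://www.kdau.com/files/gelemental-1.2.1.tar.bz2",
            "http://www.kdau.com/downloads/gelemental-1.2.1.tar.bz2"):
    verdict = "allowed" if parser.can_fetch("euscan", url) else "blocked"
    print(url, "->", verdict)
```

This also shows the commenter's point that blocking is per-path, not per-host: the same rules block /files/ while allowing /downloads/, so a blocked base URL does not by itself prove every other URL is blocked.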