Bug 424385 - app-portage/euscan should stop scanning when blocked by robots.txt
Status: RESOLVED INVALID
Alias: None
Product: Gentoo Linux
Classification: Unclassified
Component: Current packages
Hardware: All Linux
Importance: Normal normal
Assignee: Corentin Chary (RETIRED)
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-07-01 12:59 UTC by Justin Lecher (RETIRED)
Modified: 2012-07-16 05:16 UTC
0 users

See Also:
Package list:
Runtime testing required: ---


Description Justin Lecher (RETIRED) gentoo-dev 2012-07-01 12:59:21 UTC
 * Url 'http://www.kdau.com/files' blocked by robots.txt
 * Generating version from 1.2.0
 * Brute forcing: http://www.kdau.com/files/gelemental-${PV}.tar.bz2
 * Url 'http://www.kdau.com/files/gelemental-1.2.1.tar.bz2' blocked by robots.txt
 * Url 'http://www.kdau.com/files/gelemental-1.2.2.tar.bz2' blocked by robots.txt
 * Url 'http://www.kdau.com/files/gelemental-1.2.3.tar.bz2' blocked by robots.txt
 * Url 'http://www.kdau.com/files/gelemental-1.3.0.tar.bz2' blocked by robots.txt
 * Url 'http://www.kdau.com/files/gelemental-1.4.0.tar.bz2' blocked by robots.txt
 * Url 'http://www.kdau.com/files/gelemental-1.5.0.tar.bz2' blocked by robots.txt
 * Url 'http://www.kdau.com/files/gelemental-2.0.0.tar.bz2' blocked by robots.txt
 * Url 'http://www.kdau.com/files/gelemental-3.0.0.tar.bz2' blocked by robots.txt
 * Url 'http://www.kdau.com/files/gelemental-4.0.0.tar.bz2' blocked by robots.txt

Once the base URL is blocked, we can skip the remaining URLs, because they will be blocked too.
Comment 1 Corentin Chary (RETIRED) gentoo-dev 2012-07-02 11:53:17 UTC
Not always: a "Disallow:" rule can be set for only a particular URL.

Anyway, printing these lines is almost free: robots.txt is fetched only once, and before scanning a URL we check whether we are allowed to do so, without starting a network request. The only drawback is the noise in the log...
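
The behaviour described above can be sketched with Python's standard urllib.robotparser: the rules are parsed once, and each candidate URL is then tested locally with can_fetch(), so no network request is made per URL. The robots.txt rules below are hypothetical (www.kdau.com's actual rules are not shown in this bug), and this is an illustration, not euscan's actual code.

```python
# Sketch only: hypothetical robots.txt rules, not the real www.kdau.com ones.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /files/
Allow: /downloads/
"""

# Parse the rules once, as euscan does per host.
parser = RobotFileParser()
parser.parse(rules.splitlines())

# Each check is purely local -- this is why printing one "blocked"
# line per candidate URL costs almost nothing.
for url in ("http://www.kdau.com/files/gelemental-1.2.1.tar.bz2",
            "http://www.kdau.com/downloads/gelemental-1.2.1.tar.bz2"):
    verdict = "allowed" if parser.can_fetch("euscan", url) else "blocked"
    print(url, "->", verdict)
```

This also shows the commenter's point that blocking is per-path, not per-host: the same rules block /files/ while allowing /downloads/, so a blocked base URL does not by itself prove every other URL is blocked.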