Bug 446402 - Proposal: make bugs.gentoo.org more search-engine friendly, but w/o email addresses
Status: RESOLVED INVALID
Alias: None
Product: Gentoo Infrastructure
Classification: Unclassified
Component: Bugzilla
Hardware: All Linux
Importance: Normal enhancement
Assignee: Bugzilla Admins
Reported: 2012-12-07 22:43 UTC by Daniel Santos
Modified: 2013-02-16 02:03 UTC
CC List: 0 users

Description Daniel Santos 2012-12-07 22:43:47 UTC
I would very much like Gentoo bugs to show up as good hits when googling, but our robots.txt is very restrictive. That isn't uncommon: many Bugzilla installations have an even more restrictive robots.txt than ours (Gnome actually threatens to hunt you down and give you a wedgie: https://bugs.gnome.org/robots.txt). The solution probably lies in writing a good Bugzilla extension for this, if one doesn't already exist (the best I've found so far is a sitemap generator, which isn't enough on its own: http://code.google.com/p/bugzilla-sitemap). I'm filing this bug to propose such a change and to discuss what the requirements would be.

First off, any crawlers that ignore robots.txt need to be detected and permanently banned (hopefully that's already happening). I don't have a particular design in mind yet, but it should efficiently produce output stripped of email addresses, probably lazily generated and only refreshed every x interval, so that subsequent crawler requests receive data that's slightly stale (perhaps a day old).
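For illustration only, the stripping could be as simple as the sketch below; the regular expression and the scrub_for_crawler() name are mine, not anything Bugzilla ships, and a real implementation would also have to cope with obfuscated addresses:

```python
import re

# Loose pattern for strings that look like email addresses.
EMAIL_RE = re.compile(r'\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b')

def scrub_for_crawler(page_text: str) -> str:
    """Return a copy of a rendered bug page with email-like strings masked."""
    return EMAIL_RE.sub('<email address hidden>', page_text)

print(scrub_for_crawler("Reported by jrandom@example.org; CC maintainer@example.net"))
# -> Reported by <email address hidden>; CC <email address hidden>
```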

One easy (but less efficient) implementation would be to use the bugzilla-sitemap extension so that crawlers only attempt to retrieve pages whose data has changed, and to modify the CGI scripts to detect when a crawler is requesting a page and strip the email addresses. A more efficient but more complicated design would have the CGI scripts detect when they are being called by a crawler and forward the request (internally) to a specialized handler. That handler would make no database access, or only enough to check a modification date, and would return the content of a lazily generated static snapshot of the bug report (regenerating it if the snapshot is too stale). As long as the forwarding mechanism is efficient, this should let search engines index all of our bugs, without email addresses and without bogging down the server.
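A rough sketch of that handler, purely illustrative: render_bug() stands in for the real show_bug.cgi rendering, and the crawler list, cache path, and daily refresh window are assumptions of mine, not existing Bugzilla behavior:

```python
import os
import re
import time

CACHE_DIR = '/var/tmp/bug-snapshots'              # assumed cache location
MAX_AGE = 24 * 60 * 60                            # refresh snapshots at most daily
CRAWLER_RE = re.compile(r'Googlebot|bingbot|DuckDuckBot', re.I)   # illustrative list
EMAIL_RE = re.compile(r'\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b')        # same masking as above

def render_bug(bug_id: int) -> str:
    # Stand-in for the real show_bug.cgi rendering path.
    return f"<html>bug {bug_id}, reported by someone@example.org</html>"

def serve_bug(bug_id: int, user_agent: str) -> str:
    """Serve the live page to people, a stale email-free snapshot to crawlers."""
    if not CRAWLER_RE.search(user_agent or ''):
        return render_bug(bug_id)                 # normal, fully dynamic path

    path = os.path.join(CACHE_DIR, f'{bug_id}.html')
    stale = (not os.path.exists(path)
             or time.time() - os.path.getmtime(path) > MAX_AGE)
    if stale:
        # Regenerate lazily: at most one full render per bug per day.
        os.makedirs(CACHE_DIR, exist_ok=True)
        with open(path, 'w') as f:
            f.write(EMAIL_RE.sub('<email address hidden>', render_bug(bug_id)))
    with open(path) as f:
        return f.read()

print(serve_bug(446402, 'Mozilla/5.0 (compatible; Googlebot/2.1)'))
```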

I'm sure there are even better approaches; I don't understand the robots.txt and crawler protocols very well at this point. Looking at Red Hat's implementation, Google is evidently smart enough to read a sitemap index (an XML file whose URLs point to gzipped sitemap XMLs) and work out which pages to request and which to leave alone (see http://bugzilla.redhat.com/robots.txt and http://bugzilla.redhat.com/sitemap_index.xml), so I suspect there are better approaches than what I've proposed above.
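As a sketch of that shape, a toy generator might look like the following; aside from the sitemap protocol's 50,000-URLs-per-file limit, everything here (file names, chunk handling, base URL) is made up rather than anything Red Hat or Gentoo actually runs:

```python
import gzip
from datetime import date

BASE = 'https://bugs.gentoo.org'   # assumed base URL
CHUNK = 50000                      # sitemap protocol allows at most 50,000 URLs per file

def write_sitemaps(bug_ids, out_dir='.'):
    """Write gzipped sitemap chunks plus a plain-XML index that lists them."""
    names = []
    for n, start in enumerate(range(0, len(bug_ids), CHUNK), 1):
        name = f'sitemap{n}.xml.gz'
        urls = ''.join(f'<url><loc>{BASE}/show_bug.cgi?id={b}</loc></url>'
                       for b in bug_ids[start:start + CHUNK])
        with gzip.open(f'{out_dir}/{name}', 'wt') as f:
            f.write('<?xml version="1.0" encoding="UTF-8"?>'
                    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
                    + urls + '</urlset>')
        names.append(name)

    entries = ''.join(f'<sitemap><loc>{BASE}/{n}</loc>'
                      f'<lastmod>{date.today()}</lastmod></sitemap>' for n in names)
    with open(f'{out_dir}/sitemap_index.xml', 'w') as f:
        f.write('<?xml version="1.0" encoding="UTF-8"?>'
                '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
                + entries + '</sitemapindex>')

write_sitemaps(list(range(1, 1001)))   # e.g. the first thousand bug IDs
```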
Comment 1 Alec Warner (RETIRED) archtester gentoo-dev Security 2012-12-29 23:54:38 UTC
Try not to file a bug asking for more than one concrete thing.

1) No one is going to be constantly watching the logs to ban bots that ignore robots.txt. We all have better things to do.

2) Email addresses are only visible to logged-in users. Many bots don't log in because it adds state to the crawler, and when you crawl billions of sites that's a bunch of extra complexity. Email addresses written out in raw text (comments, etc.) are of course still visible.

3) We already generate cached datasets for bots; this is documented in our bot policy:
https://bugs.gentoo.org/bots.html

-A
Comment 2 Daniel Santos 2013-02-16 02:03:52 UTC
Sorry for my late response. I guess I'll open a new bug for the actual problem, which appears to be that https://bugs.gentoo.org/bots.html links to static cached files that robots.txt disallows with this rule:

Disallow: /data/cached/
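To sanity-check my reading of that rule, something like the following (a quick sketch using Python's standard urllib.robotparser; the cached file name is made up) should show whether a compliant crawler may fetch those files:

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('https://bugs.gentoo.org/robots.txt')
rp.read()   # fetches the live robots.txt

# If this prints False, a compliant crawler is indeed barred from the cached
# datasets under /data/cached/ (the exact file name here is hypothetical).
print(rp.can_fetch('Googlebot', 'https://bugs.gentoo.org/data/cached/buglist.gz'))
```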

I need to re-read the robots.txt spec first though. Thanks for the response.

Daniel