Bug 457782 - bugs.gentoo.org (mostly) inaccessible to bots
Summary: bugs.gentoo.org (mostly) inaccessible to bots
Status: UNCONFIRMED
Alias: None
Product: Gentoo Infrastructure
Classification: Unclassified
Component: Bugzilla
Hardware: All Linux
Importance: Normal normal
Assignee: Bugzilla Admins
URL:
Whiteboard:
Keywords:
Depends on:
Blocks: 457784
Reported: 2013-02-16 02:54 UTC by Daniel Santos
Modified: 2013-02-16 03:09 UTC (History)
0 users

See Also:
Package list:
Runtime testing required: ---


Description Daniel Santos 2013-02-16 02:54:11 UTC
This is a follow-up on #446402, which was closed for being too broad in scope. I hope this report is more helpful!

Specifically, a summary list of bugs (summary, ID and link for each) is available to bots, but the individual reports are not. The /data/cached/* files link to each bug via the https protocol. https://bugs.gentoo.org and http://bugs.gentoo.org serve different robots.txt files. The https server forbids everything and then re-forbids each sub-directory specifically, in case "Disallow: *" and "Disallow: /" weren't clear enough. IMO, there's nothing wrong with that: I don't think we need bots crawling around in https, burning our CPU on encryption. But limiting search engines to just our summaries and IDs is too restrictive.
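To make the difference between the two robots.txt files concrete, here is a minimal sketch using Python's standard urllib.robotparser. The two rule sets and the file name under /data/cached/ are assumptions made up for illustration; the real files are not quoted in this report, so only the "Disallow: /" line is taken from the description above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rule sets: the real robots.txt contents are not quoted here.
HTTPS_RULES = """\
User-agent: *
Disallow: /
"""
HTTP_RULES = """\
User-agent: *
Disallow: /attachment.cgi
"""

def allowed_paths(rules_text, base_url, paths, agent="*"):
    """Return {path: bool} saying whether `agent` may fetch each path."""
    parser = RobotFileParser()
    parser.parse(rules_text.splitlines())
    return {p: parser.can_fetch(agent, base_url + p) for p in paths}

# "/data/cached/buglist.html" is an assumed file name under /data/cached/.
PATHS = ["/data/cached/buglist.html", "/123", "/show_bug.cgi?id=123"]

https_access = allowed_paths(HTTPS_RULES, "https://bugs.gentoo.org", PATHS)
http_access = allowed_paths(HTTP_RULES, "http://bugs.gentoo.org", PATHS)
```

With these assumed rules, a well-behaved bot that follows an https link from the cached list finds every bug path forbidden, even though the same paths would be permitted over http.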

The second challenge is that the http server redirects to https whenever a bug is requested (bug #123 in this example) via either the URI "/123" or "/show_bug.cgi?id=123".
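One way to confirm the redirect behaviour is to send a HEAD request over plain http and look at the status and Location header. The sketch below is only an illustration: the helper names probe and is_https_redirect are hypothetical, and the set of status codes treated as redirects is my own choice.

```python
import http.client
from urllib.parse import urlsplit

REDIRECT_CODES = (301, 302, 303, 307, 308)

def is_https_redirect(status, location):
    """True if the response is a redirect whose target uses the https scheme."""
    if status not in REDIRECT_CODES or not location:
        return False
    return urlsplit(location).scheme == "https"

def probe(host, path, timeout=10):
    """Send a HEAD request over plain http; return (status, Location header)."""
    conn = http.client.HTTPConnection(host, timeout=timeout)
    try:
        conn.request("HEAD", path)
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()

def check_paths(host, paths):
    """Probe each path and report whether it is bounced to https."""
    return {p: is_https_redirect(*probe(host, p)) for p in paths}
```

Calling check_paths("bugs.gentoo.org", ["/123", "/show_bug.cgi?id=123"]) performs the actual network requests; is_https_redirect can be exercised on its own without any network access.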

It seems to me that we need to:
1. Change the cache generators to specify the http protocol.
2. Allow access to bugs via http instead of 301-ing them to the https server.
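As a rough illustration of step 1, the link rewriting could look like the following. I have not seen the actual cache generator code, so this is a stand-alone sketch that post-processes generated text; the function name use_http_links is hypothetical.

```python
import re

# Match https links to this exact host. The lookahead rejects look-alike
# hosts (bugs.gentoo.org.example.com) and explicit ports.
_BUG_HOST = re.compile(r"https://bugs\.gentoo\.org(?![\w.:-])")

def use_http_links(text):
    """Rewrite https://bugs.gentoo.org links in `text` to plain http."""
    return _BUG_HOST.sub("http://bugs.gentoo.org", text)
```

Links to other hosts are left untouched, so only the bug links in the cached summaries change protocol.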

Of course, this presents a new problem: show_bug.cgi is (I suspect) resource-intensive.  Can we access bug reports via "/<id>" and hit a static document?  If not, we need a solution so that search engines can hit our full bug reports (minus email addresses of course).
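If static documents were used, the "/<id>" lookup could be as simple as the sketch below. Everything here is an assumption: the STATIC_ROOT directory, the "<id>.html" naming scheme, and the function name static_path_for are made up for illustration and do not describe how the server is actually laid out.

```python
import re
from pathlib import PurePosixPath

# Hypothetical location of pre-rendered bug pages.
STATIC_ROOT = PurePosixPath("/var/www/bugs/static")

def static_path_for(request_path):
    """Map "/<id>" to a static file path, or return None for anything else."""
    match = re.fullmatch(r"/([0-9]+)", request_path)
    if match is None:
        return None  # not a bare bug id; let the normal handlers deal with it
    bug_id = int(match.group(1))  # normalises leading zeros
    return STATIC_ROOT / f"{bug_id}.html"
```

Because only digits are accepted, a request cannot escape the static directory, and show_bug.cgi is never invoked for these URIs.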
Comment 1 Daniel Santos 2013-02-16 03:09:26 UTC
I think I see how we can ensure that email addresses don't end up in search engines. If the http server uses a show_bug.cgi that simply never shows the addresses (whether or not you are logged in), then I think all should be fine. That way, regardless of whether or not the crawler uses an account, the email addresses will be suppressed (as long as robots.txt is obeyed).
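The idea of suppressing addresses unconditionally can be sketched as a filter applied to the rendered page, with no login check at all. This is not Bugzilla's own code; the regular expression is a deliberately simple approximation of an email address, and suppress_emails and the placeholder text are hypothetical.

```python
import re

# Simple approximation of an email address; not a full RFC 5322 parser.
_EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")

def suppress_emails(text, placeholder="[address hidden]"):
    """Replace anything that looks like an email address with a placeholder.

    There is intentionally no "logged in" parameter: the output is the same
    for every visitor, crawler or not.
    """
    return _EMAIL.sub(placeholder, text)
```

For example, suppress_emails("Reported by Jane Doe <jane@example.org>") yields "Reported by Jane Doe <[address hidden]>", where the name and address are made-up sample data.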