
Bug 457782

Summary: bugs.gentoo.org (mostly) inaccessible to bots
Product: Gentoo Infrastructure
Reporter: Daniel Santos <daniel.santos>
Component: Bugzilla
Assignee: Bugzilla Admins <bugzilla>
Status: RESOLVED OBSOLETE    
Severity: normal    
Priority: Normal    
Version: unspecified   
Hardware: All   
OS: Linux   
Whiteboard:
Package list:
Runtime testing required: ---
Bug Depends on:    
Bug Blocks: 457784    

Description Daniel Santos 2013-02-16 02:54:11 UTC
This is a follow-up on #446402, which was closed for being too broad in scope. I hope this report is more helpful!

Specifically, a summary list of bugs (their summary, id # and link) is available to bots, but the individual reports are not.  The /data/cached/* files link to each bug via the https protocol.  https://bugs.gentoo.org and http://bugs.gentoo.org have different robots.txt files.  The https server forbids everything and then re-forbids each sub-directory specifically, in case "Disallow: *" and "Disallow: /" weren't clear enough.  IMO, there's nothing wrong with that -- I don't think we need bots crawling around in https, burning our CPU on encryption -- but limiting search engines to just our summaries & id#s is too restrictive.
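
For illustration, a more permissive robots.txt on the http side might look something like this (a sketch only -- the paths and the use of the non-standard "Allow" extension are assumptions, not the actual Gentoo config):

    # Hypothetical robots.txt for http://bugs.gentoo.org
    User-agent: *
    Allow: /show_bug.cgi    # individual bug reports stay crawlable
    Disallow: /             # everything else remains off-limits

Crawlers that honor "Allow" (Google does) would then index the full reports while staying out of the rest of the CGIs.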

The second challenge is that the http server redirects to https when attempting to access a bug via either the URI "/123" or "/show_bug.cgi?id=123" (bug #123 in this case).

It seems to me that we need to:
1. Change the cache generators to specify the http protocol
2. Allow access to bugs via http instead of 301-ing them to the https server (see the sketch after this list).
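
If the http->https redirect is done with Apache's mod_rewrite (an assumption -- the actual vhost config isn't visible from outside), point 2 might be as small as:

    # Hypothetical rules for the http vhost: leave bug views on
    # http for crawlers, and 301 everything else to https as before.
    RewriteCond %{REQUEST_URI} !^/show_bug\.cgi
    RewriteCond %{REQUEST_URI} !^/[0-9]+$
    RewriteRule ^/?(.*)$ https://bugs.gentoo.org/$1 [R=301,L]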

Of course, this presents a new problem: show_bug.cgi is (I suspect) resource-intensive.  Can we access bug reports via "/<id>" and hit a static document?  If not, we need a solution so that search engines can hit our full bug reports (minus email addresses of course).
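
One way to get both (again assuming Apache; the directory name below is hypothetical): have a periodic job pre-render public bug pages to static HTML, and only fall through to the CGI on a cache miss:

    # Hypothetical: serve a pre-rendered snapshot when one exists,
    # otherwise pass the request through to show_bug.cgi.
    RewriteCond %{DOCUMENT_ROOT}/static-bugs/$1.html -f
    RewriteRule ^/?([0-9]+)$ /static-bugs/$1.html [L]
    RewriteRule ^/?([0-9]+)$ /show_bug.cgi?id=$1 [PT]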
Comment 1 Daniel Santos 2013-02-16 03:09:26 UTC
I think I see how we can ensure that email addresses don't end up in search engines.  If the http server uses a show_bug.cgi that simply never shows the addresses (whether or not you are logged in), then I think all should be fine.  That way, regardless of whether or not the crawler uses an account, the email addresses will be suppressed (as long as robots.txt is obeyed).
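
The masking itself is the easy part.  Just to illustrate the idea (Bugzilla itself is Perl, so this Python sketch and the function name are purely hypothetical):

    import re

    # Hypothetical filter applied to every page the http server emits:
    # keep the local part of each address, drop the domain, whether or
    # not the viewer is logged in.
    EMAIL_RE = re.compile(r'\b([A-Za-z0-9._+-]+)@[A-Za-z0-9.-]+')

    def mask_emails(html: str) -> str:
        # "user@example.org" -> "user@..."
        return EMAIL_RE.sub(r'\1@...', html)

The harder part is wiring such a filter into the right output hook so nothing leaks around it.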
Comment 2 Alec Warner (RETIRED) 2021-08-25 21:11:37 UTC
This is 8 years old; please re-open if it's an actual problem for you.