This is a follow-up on #446402, which was closed for being too broad in scope. I hope this report is more helpful!
Specifically, a summary list of bugs (summary, bug id, and link) is available to bots, but the individual reports are not. The /data/cached/* files link to each bug via the https protocol, and https://bugs.gentoo.org and http://bugs.gentoo.org have differing robots.txt files. The https server forbids everything, then re-forbids each sub-directory specifically in case "Disallow: *" and "Disallow: /" weren't clear enough. IMO, there's nothing wrong with that -- I don't think we need bots crawling around over https, burning our CPU on encryption -- but limiting search engines to just our summaries and id numbers is too restrictive.
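To make the split concrete, here's a hypothetical sketch (not a verbatim copy of the live files) of how the two robots.txt files could diverge under this proposal:

```text
# https://bugs.gentoo.org/robots.txt (sketch): keep bots off https entirely
User-agent: *
Disallow: /

# http://bugs.gentoo.org/robots.txt (sketch): let crawlers fetch bug pages,
# but keep them away from the expensive CGI entry point
User-agent: *
Disallow: /show_bug.cgi
```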
The second challenge is that the http server redirects to https when attempting to access a bug (bug #123, say) via either the URI "/123" or "/show_bug.cgi?id=123".
It seems to me that we need to:
1. Change the cache generators to specify the http protocol
2. Allow access to bugs via http instead of 301-ing them to the https server.
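Assuming the front end is Apache with mod_rewrite (an assumption on my part; the rules below are illustrative only), step 2 might amount to carving bug URIs out of the blanket 301:

```text
# Hypothetical Apache sketch: assumes an http->https blanket redirect exists;
# carve out numeric bug URIs so crawlers can stay on http.
RewriteEngine On
# Serve numeric bug URIs directly over http (no redirect)
RewriteRule ^/([0-9]+)$ /show_bug.cgi?id=$1 [PT,L]
# Everything else still gets sent to the https server
RewriteCond %{HTTPS} off
RewriteRule ^(.*)$ https://bugs.gentoo.org$1 [R=301,L]
```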
Of course, this presents a new problem: show_bug.cgi is (I suspect) resource-intensive. Can we access bug reports via "/<id>" and hit a static document? If not, we need a solution so that search engines can hit our full bug reports (minus email addresses of course).
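One way around the show_bug.cgi cost would be to pre-render each report to a static file at cache-generation time, so that "/<id>" is just a file hit. A minimal sketch of the idea -- the bug data, field names, and output layout here are all made up for illustration:

```python
import tempfile
from pathlib import Path

def render_static_bug(bug: dict, outdir: Path) -> Path:
    """Write a minimal static HTML page for one bug, to be served as /<id>."""
    page = (
        "<html><head><title>Bug #{id}: {summary}</title></head>"
        "<body><h1>Bug #{id}: {summary}</h1><pre>{description}</pre></body></html>"
    ).format(**bug)
    # e.g. htdocs/123, which the web server serves as /123
    path = outdir / str(bug["id"])
    path.write_text(page, encoding="utf-8")
    return path

# Hypothetical example bug
bug = {"id": 123, "summary": "example summary", "description": "details here"}
out = render_static_bug(bug, Path(tempfile.mkdtemp()))
```

The cache generators already walk every bug to build the summary lists, so emitting one extra file per bug in the same pass seems cheap compared to a crawler hitting the CGI for each report.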
I think I see how we can ensure that email addresses don't end up in search engines. If the http server uses a show_bug.cgi that simply never shows addresses (whether or not you are logged in), then all should be fine. Regardless of whether the crawler uses an account, the email addresses will be suppressed (as long as robots.txt is obeyed).
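The crawler-safe output could be run through a filter along these lines before serving (a sketch; the function name is mine and the regex is deliberately simple):

```python
import re

# Deliberately simple pattern -- good enough to keep addresses out of
# search indexes, not a full RFC 5322 parser.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

def suppress_emails(html: str) -> str:
    """Replace every email address in the page with a masked placeholder."""
    return EMAIL_RE.sub("[email suppressed]", html)

masked = suppress_emails("Reported by dev@gentoo.org on 2006-01-01")
```

Masking unconditionally on the http side means there's nothing account-dependent to get wrong: even a logged-in crawler sees no addresses.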