Bug 823860 - Consider making a public database dump
Summary: Consider making a public database dump
Status: CONFIRMED
Alias: None
Product: Websites
Classification: Unclassified
Component: Wiki
Hardware: All Linux
Importance: Normal normal with 1 vote
Assignee: Gentoo Wiki Team
Reported: 2021-11-15 19:09 UTC by John Helmert III
Modified: 2024-03-03 22:34 UTC
CC List: 7 users

Description John Helmert III archtester Gentoo Infrastructure gentoo-dev Security 2021-11-15 19:09:46 UTC
I see a couple of obvious benefits to doing this: distributed and crowdsourced backups, and the ability for users to browse the wiki offline. Wikipedia already does this, of course, so I think it would be good for us to do this too.
Comment 1 Matthew Marchese Gentoo Infrastructure gentoo-dev 2022-07-15 03:59:42 UTC
The Wiki project has had a policy, since before my involvement, of not performing database dumps (although I think I may have formalized the position at the following link):

https://wiki.gentoo.org/wiki/Project:Wiki#Can_I_get_a_dump_of_the_Gentoo_wiki_database.3F

There are quite a few considerations if we are to make this a possibility:

1. Historically there were a few 'competing' Gentoo wikis, all of which except ours were unofficial. Our wiki team didn't want to create more confusion by making it easy for an unofficial, unaffiliated third party to stand up unofficial instances of our wiki, which has a very real possibility of misleading our community members. At this point in history, it is much more difficult to get confused: our wiki has been established in search engine history and the unofficial wikis have all been offline for years. Personally, although I was not the project lead at the time the decision was made to not publish database dumps, I still hold a concern that unaffiliated third parties will stand up instances that may be confusing for our community. I'd like others' thoughts on this matter. My caution here is that as soon as we publish a single database dump, there is no going back. If unofficial clones appear, we'll want to help them stay up to date with continuous releases, and that sounds like a lot of bandwidth and hosting requirements, which leads into my next point...
2. Does infra have the bandwidth to distribute these database dumps? If we do enable database dumps, what would be the release cadence and how would we distribute them? Dump once per month, once per quarter, once per year? Create a package and publish it out to distfiles/?
3. If we published the wiki, should we also publish the Bugzilla database and/or the packages.g.o databases as well? Perhaps this is comparing apples to oranges, but I find it likely **someone** in the community would want us to follow suit for consistency, although these would certainly be less useful.

I'm probably missing some important considerations, so I'd like to rope in more Infra opinions. The above reasons are a few quick things off the top of my head.

In short - I'd like to make this a reality, but I have concerns over abuse and the cost/benefit ratio seems heavier on the cost side... 

That's hard to say because Gentoo has a very transparent project structure; we're not trying to hide anything by not releasing the database - it's just that it may not be best to do so.
Comment 2 John Helmert III archtester Gentoo Infrastructure gentoo-dev Security 2022-07-15 04:05:47 UTC
Looping in infra@ as some of these points are definitely infra questions.
Comment 3 Alec Warner (RETIRED) archtester gentoo-dev Security 2022-07-15 04:45:07 UTC
(In reply to John Helmert III from comment #2)
> Looping in infra@ as some of these points are definitely infra questions.

So the general problem is that dumps contain private information (such as PII) and we want to avoid disseminating that information. Things like who signed up when, from where, hashed and salted passwords, etc.

In general I'd advocate for a crawling-based approach (like we have for bugzilla), where users are granted a scoped API token to read what they are allowed to read, and can crawl the site as normal.
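A crawl of this kind could be sketched against the standard MediaWiki Action API (`action=query&list=allpages` with continuation). This is only an illustration, not an existing Gentoo tool: the endpoint URL, the bearer-token header, and the function names are assumptions that do not come from this bug.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable, Iterator, Optional

# Assumption: the wiki's Action API endpoint; not stated anywhere in this bug.
API_URL = "https://wiki.gentoo.org/api.php"


def http_get_json(url: str, token: Optional[str] = None) -> dict:
    """Fetch a URL and decode its JSON body.

    The Authorization header is a hypothetical way to pass a scoped token;
    the real mechanism would depend on how infra issues such tokens.
    """
    request = urllib.request.Request(
        url, headers={"User-Agent": "wiki-crawl-sketch/0.1"}
    )
    if token:
        request.add_header("Authorization", f"Bearer {token}")
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def iter_page_titles(
    fetch: Callable[[str], dict],
    api_url: str = API_URL,
    namespace: int = 0,
) -> Iterator[str]:
    """Yield every page title in one namespace, following API continuation."""
    params = {
        "action": "query",
        "list": "allpages",
        "apnamespace": str(namespace),
        "aplimit": "max",
        "format": "json",
    }
    while True:
        data = fetch(api_url + "?" + urllib.parse.urlencode(params))
        for page in data.get("query", {}).get("allpages", []):
            yield page["title"]
        continuation = data.get("continue")
        if not continuation:
            break
        # The API returns the parameters needed to request the next batch.
        params.update(continuation)
```

Passing the fetch function in as a parameter keeps the paging logic separate from the network and authentication details, e.g. `for title in iter_page_titles(http_get_json): ...`.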

Specifically for packages.gentoo.org: it's not a stateful database. Users who "want a copy" can just download and execute the code and they will get the same data (in their copy of the database) after it syncs up from external sources.

-A
Comment 4 Robin Johnson archtester Gentoo Infrastructure gentoo-dev Security 2022-07-15 16:08:23 UTC
I think the PII problem is the most difficult non-technical problem; the User:* namespace contains most of it.
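As a rough illustration of excluding that namespace, the sketch below drops User and User talk pages from an export, assuming the export is in MediaWiki's XML export format, where each `<page>` carries an `<ns>` number (2 for User, 3 for User talk). The function names are made up for this example, and it reads the whole document into memory, which would not suit a large dump.

```python
import xml.etree.ElementTree as ET

# NS_USER and NS_USER_TALK in MediaWiki's default namespace numbering.
USER_NAMESPACES = {"2", "3"}


def local_name(tag: str) -> str:
    """Strip the '{namespace-uri}' prefix ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def strip_user_pages(xml_text: str) -> str:
    """Return the export with all User and User talk pages removed."""
    root = ET.fromstring(xml_text)
    for page in list(root):
        if local_name(page.tag) != "page":
            continue
        ns_text = next(
            ((child.text or "").strip() for child in page
             if local_name(child.tag) == "ns"),
            "",
        )
        if ns_text in USER_NAMESPACES:
            root.remove(page)
    return ET.tostring(root, encoding="unicode")
```

This only removes whole pages. Revisions of the remaining pages still carry contributor names, so it is not a complete answer to the PII concern.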

Effort cost is the largest problem overall: simply dumping SQL is not suitable for the reasons others provided. Is there a meaningful export format and pre-existing automation? Then, as you say, we have to maintain those exports, ensure the tooling keeps working, etc.

The literal hosting of it isn't difficult per se.

How does Wikipedia handle this (because I know they publish exports)?
Comment 5 Matt Jolly gentoo-dev 2022-10-13 01:32:01 UTC
I did this recently for work.

I believe that the appropriate tool is built into MediaWiki: dumpBackup.php

https://www.mediawiki.org/wiki/Manual:DumpBackup.php

tl;dr: 

> XML Dumps contain the content of a wiki (wiki pages with all their revisions), without the site-related data. A XML dump does not create a full backup of the wiki database, the dump does not contain user accounts, images, edit logs, etc. 

This appears to be the format (and likely tool) that Wikipedia uses.

It should be trivial to implement and maintain dumps, so I wholeheartedly support the idea!
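A minimal sketch of how such a dump might be automated, assuming `php` is on the PATH and that the wiki is installed under a known root directory. The helper names and paths are hypothetical; `--full`, `--current`, and `--quiet` are options of dumpBackup.php, which writes the XML to standard output. The exact invocation may differ between MediaWiki versions.

```python
import subprocess
from pathlib import PurePosixPath
from typing import List


def build_dump_command(wiki_root: str, full_history: bool = False) -> List[str]:
    """Build the dumpBackup.php command line for a wiki installed at wiki_root."""
    script = str(PurePosixPath(wiki_root) / "maintenance" / "dumpBackup.php")
    # --full exports every revision; --current exports only the latest one.
    mode = "--full" if full_history else "--current"
    return ["php", script, mode, "--quiet"]


def write_dump(wiki_root: str, destination: str, full_history: bool = False) -> None:
    """Run the dump and write the XML from stdout to the destination file."""
    with open(destination, "wb") as out:
        subprocess.run(
            build_dump_command(wiki_root, full_history),
            stdout=out,
            check=True,  # raise if the maintenance script exits non-zero
        )
```

For example, `write_dump("/srv/wiki", "/tmp/wiki-current.xml")` would export the latest revision of every page, with both paths being placeholders. Scheduling, compression, and publication are left out.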
Comment 6 Matthew Marchese Gentoo Infrastructure gentoo-dev 2023-08-08 04:04:41 UTC
This is related to bug 671696.