Bug 823860 - Consider making a public wiki database dump
Summary: Consider making a public wiki database dump
Status: CONFIRMED
Alias: None
Product: Websites
Classification: Unclassified
Component: Wiki
Hardware: All
OS: Linux
Importance: Normal normal
Assignee: Gentoo Wiki Team
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2021-11-15 19:09 UTC by John Helmert III
Modified: 2025-04-03 02:19 UTC
CC List: 8 users

See Also:
Package list:
Runtime testing required: ---



Description John Helmert III archtester Gentoo Infrastructure gentoo-dev Security 2021-11-15 19:09:46 UTC
I see a couple obvious benefits to doing this: distributed and crowdsourced backups, and the ability for users to browse the wiki offline. Wikipedia already does this of course, so I think it would be good for us to do this too.
Comment 1 Matthew Marchese Gentoo Infrastructure gentoo-dev 2022-07-15 03:59:42 UTC
The Wiki project has had a policy of not publishing database dumps since before my involvement (although I think I may have formalized the position at the following link):

https://wiki.gentoo.org/wiki/Project:Wiki#Can_I_get_a_dump_of_the_Gentoo_wiki_database.3F

There are quite a few considerations if we are to make this a possibility:

1. Historically there were a few 'competing' Gentoo wikis - all of which, except for ours, were unofficial. Our wiki team didn't want to create more confusion by making it easy for an unofficial, unaffiliated third party to stand up unofficial instances of our wiki, which has a very real possibility of misleading our community members. At this point in history, it is much more difficult to get confused: our wiki is well established in search engines and the unofficial wikis have all been offline for years. Personally, although I was not the project lead at the time the decision was made to not publish database dumps, I still hold a concern that unaffiliated third parties will stand up instances that may be confusing for our community. I'd like others' thoughts on this matter. My caution here is that as soon as we publish a single database dump, there is no going back. If unofficial clones appear, we'll want to help them stay up to date with continuous releases, and that sounds like a lot of bandwidth and hosting requirements, which leads into my next point...
2. Does infra have the bandwidth to distribute these database dumps? If we do enable database dumps, what would be the release cadence and how would we distribute them? Dump once per month, once per quarter, once per year? Create a package and publish it out to distfiles/?
3. If we published the wiki, should we also publish the Bugzilla database and/or the packages.g.o databases as well? Perhaps this is comparing apples to oranges, but I find it likely **someone** in the community would want us to follow suit for consistency, although these would certainly be less useful.

I'm probably missing some important considerations, so I'd like to rope in more Infra opinions. The above reasons are a few quick things off the top of my head.

In short - I'd like to make this a reality, but I have concerns over abuse and the cost/benefit ratio seems heavier on the cost side... 

That's hard to say because Gentoo has a very transparent project structure; we're not trying to hide anything by not releasing the database - it's just that it may not be best to do so.
Comment 2 John Helmert III archtester Gentoo Infrastructure gentoo-dev Security 2022-07-15 04:05:47 UTC
Looping in infra@ as some of these points are definitely infra questions.
Comment 3 Alec Warner (RETIRED) archtester gentoo-dev Security 2022-07-15 04:45:07 UTC
(In reply to John Helmert III from comment #2)
> Looping in infra@ as some of these points are definitely infra questions.

So the general problem is that dumps contain private information (such as PII) and we want to avoid disseminating that information. Things like who signed up when, from where, hashed and salted passwords, etc.

In general I'd advocate for a crawling-based approach (like we have for bugzilla), where users are granted a scoped API token to read what they are allowed to read, and can crawl the site as normal.
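
For illustration, a minimal sketch of what such a crawl could look like using MediaWiki's standard export endpoints (assuming Special:Export and api.php are enabled on our install, as they are by default; the page name, URL parameters, and any token handling are illustrative, not a finished design):

`curl 'https://wiki.gentoo.org/index.php?title=Special:Export&pages=Handbook:Main_Page&history=1' > handbook-main.xml`

`curl 'https://wiki.gentoo.org/api.php?action=query&generator=allpages&gapnamespace=0&gaplimit=50&export&exportnowrap' > main-ns-batch.xml`

A real crawler would follow the API's continuation parameters and respect rate limits; a scoped token would simply be sent with each request.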

Specifically for packages.gentoo.org: it's not a stateful database. Users who "want a copy" can just download and execute the code and they will get the same data (in their copy of the database) after it syncs up from external sources.

-A
Comment 4 Robin Johnson archtester Gentoo Infrastructure gentoo-dev Security 2022-07-15 16:08:23 UTC
I think the PII problem is the most difficult non-technical problem; the User:* namespace contains most of it.

Effort cost is the largest problem overall: simply dumping SQL is not suitable for the reasons others provided. Is there a meaningful export format and pre-existing automation? Then, as you say, we have to maintain those exports, ensure the tooling keeps working, etc.

The literal hosting of it isn't difficult per se.

How does Wikipedia handle this (because I know they publish exports)?
Comment 5 Matt Jolly gentoo-dev 2022-10-13 01:32:01 UTC
I did this recently for work.

I believe that the appropriate tool is built into MediaWiki - DumpBackup.php

https://www.mediawiki.org/wiki/Manual:DumpBackup.php

tl;dr: 

> XML Dumps contain the content of a wiki (wiki pages with all their revisions), without the site-related data. A XML dump does not create a full backup of the wiki database, the dump does not contain user accounts, images, edit logs, etc. 

This appears to be the format (and likely tool) that Wikipedia uses.

It should be trivial to implement and maintain dumps so I wholeheartedly support the idea!
Comment 6 Matthew Marchese Gentoo Infrastructure gentoo-dev 2023-08-08 04:04:41 UTC
This is related to bug 671696.
Comment 7 Matt Jolly gentoo-dev 2024-09-22 21:56:34 UTC
Ping. Is there any good reason _not_ to implement this? It seems quite straightforward and provides clear benefits.

What blockers are there on the wiki or infra side? Anything I can help with?
Comment 8 Robin Johnson archtester Gentoo Infrastructure gentoo-dev Security 2024-09-23 05:11:03 UTC
(In reply to Matt Jolly from comment #7)
> Ping. Is there any good reason _not_ to implement this? It seems quite
> straightforward and provides clear benefits.
> 
> What blockers are there on the wiki or infra side? Anything I can help with?

Two things I see as important to resolve:

Compliance:
Can you research the PII that may be in the dump?

Esp. how this would interact for users that file GDPR deletion requests and are scrubbed.

Licensing:
Need to ensure that the dumps have suitable licenses attached, so they are less likely to be "accidentally" fed into AI training systems.

Plus a practical question:

Infra:
If we are to use that script, what's the right options for OUR case; e.g. splitting namespaces for more usable dumps?
Comment 9 Matt Jolly gentoo-dev 2024-09-24 23:48:41 UTC
> Compliance:
> Can you research the PII that may be in the dump?

Yes.

> XML dumps contain the content of the wiki (wiki pages with all their revisions), without the site-related data. DumpBackup.php does not create a full backup of the wiki database, the dump does not contain user accounts, images, deleted revisions, etc

I.e. there should be no issue; this tool is explicitly designed for making public exports (as per the warning when running it):

> WARNING: this is not a full database dump! It is merely for public export of your wiki. For full backup, see our online help at: https://www.mediawiki.org/wiki/Backup

IMO if a user has their PII in a `User:` namespace page, that's content that they've explicitly licensed under the current wiki licence (CC-BY-SA-4.0). We _could_ sidestep this by simply not exporting that namespace, but it's a wiki and it's already all publicly available.

> Esp. how this would interact for users that file GDPR deletion requests and
> are scrubbed.

I don't think we have to worry about that as we aren't including user accounts (etc). If a user requests that their account be deleted via the `User:` page + contact IRC method, that page will not be included in future dumps after being removed.

Do we need to go back through old dumps and purge data in the event of a GDPR request? Probably not. It's a wiki, but IANAL.

> Licensing:
> Need to ensure that the dumps have suitable licenses attached, so they are
> less likely be "accidentally" fed into AI training systems.

Probably covered by the wiki's existing license:

> Unless otherwise expressly stated, all content on the Gentoo wiki is licensed under CC-BY-SA-4.0

If we start attempting to restrict dumps, people are just going to scrape the website instead, which is allowed under the current CC-BY-SA.

> Plus a practical question:
> 
> Infra:
> If we are to use that script, what's the right options for OUR case; e.g.
> splitting namespaces for more usable dumps?

It's really well documented. A (slightly modified) chunk from https://www.mediawiki.org/wiki/Manual:DumpBackup.php:

Save the revision history of all pages (--full) into a file named pagedump.xml:

`php dumpBackup.php --full > pagedump.xml`

Include the uploaded files by doing:

`php dumpBackup.php --full --include-files --uploads > page-and-filedump.xml`

Restrict the data dump to one namespace. In this example, only templates (namespace 10) with their current revision are exported:

`php dumpBackup.php --current --quiet --filter=namespace:10 > templates.xml`

As above with all revisions:

`php dumpBackup.php --full --quiet --filter=namespace:10 > templates.xml`

Include multiple namespaces with their current revision:

`php dumpBackup.php --current --quiet --filter=namespace:10,11 > templates_plus_template_talk.xml`

Also include files when filtering by certain namespaces:

`php dumpBackup.php --current --quiet --filter=namespace:0,1,6 --include-files --uploads > main_plus_talk_plus_files.xml`

There are also detailed examples further down the page and a ton of flags to customise the output.
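
To answer the namespace-splitting question for our wiki specifically - just a sketch, since I haven't checked the numeric IDs of the Gentoo-specific namespaces (Handbook:, Project:, etc.) - the IDs can be listed via the standard siteinfo API and then passed to --filter:

`curl 'https://wiki.gentoo.org/api.php?action=query&meta=siteinfo&siprop=namespaces&format=json'`

`php dumpBackup.php --current --quiet --filter=namespace:0 > gentoo-wiki-main-current.xml`

The second command is just the generic main-namespace case from above; a per-namespace split for our custom namespaces would substitute the IDs reported by the first.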
Comment 10 Matt Jolly gentoo-dev 2024-09-27 02:58:40 UTC
> Do we need to go back through old dumps and purge data in the event of a GDPR request? Probably not. It's a wiki, but IANAL.

Since dumps can include revisions, we don't need to keep a bunch around: if we publish only one version of the output at any given time, any GDPR removals will be reflected in the next update (which should be trivially automated).
Comment 11 Matt Jolly gentoo-dev 2025-04-03 02:19:20 UTC
Ping. Are there any objections from wiki or infra on proceeding with this?

- We know how to do it / how Wikipedia does it
- I don't see any concerns about PII - it's wiki content that's explicitly licensed as CC-BY-SA-4.0.
- GDPR compliance seems like a non-issue, since we're not exporting user data. Worst case, we handle a GDPR request on the wiki and trigger a new export.

@infra: any input on the following?

> Infra:
> If we are to use that script, what's the right options for OUR case; e.g. splitting namespaces for more usable dumps?

If there are no objections we just need to come up with a plan and implement it.

My suggestion: A cron job that runs a weekly full export of the wiki via the `DumpBackup.php` script.

It's straightforward, easy to implement, and seems like a fine middle ground between bleeding-edge updates and our hosting resources.
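
For concreteness, a minimal sketch of what that could look like - the install path, output location, and compression are assumptions, not settled infra decisions - overwriting a single published file so that (per comment 10) GDPR removals are reflected in the next run:

`0 3 * * 0 /usr/local/bin/wiki-dump.sh` (weekly, Sunday 03:00; hypothetical wrapper script)

where wiki-dump.sh would be roughly:

  #!/bin/sh
  # Weekly public export of the Gentoo wiki (sketch; paths are assumptions).
  set -eu
  OUT=/var/www/dumps/gentoo-wiki-pages.xml.gz   # hypothetical publish location
  TMP="${OUT}.tmp"
  # dumpBackup.php writes XML to stdout: full page history, no user tables.
  php /var/www/mediawiki/maintenance/dumpBackup.php --full --quiet | gzip > "$TMP"
  mv "$TMP" "$OUT"   # replace the previous dump; only one version is ever published

Distribution (distfiles/, a mirror path, rsync) and whether we also want per-namespace --current dumps can be decided separately.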