Bug 762403 - List of excluded files in vcs-history contains many false positives
Summary: List of excluded files in vcs-history contains many false positives
Status: RESOLVED FIXED
Alias: None
Product: Gentoo Infrastructure
Classification: Unclassified
Component: Git
Hardware: All Linux
Importance: Normal normal
Assignee: Gentoo Infrastructure
URL: https://projects.gentoo.org/vcs-histo...
Whiteboard:
Keywords: PATCH
Depends on:
Blocks:
 
Reported: 2020-12-29 11:08 UTC by Ulrich Müller
Modified: 2022-01-25 19:41 UTC (History)
0 users

See Also:
Package list:
Runtime testing required: ---


Attachments
Review of files in gentoo-projects CVS repository (gentoo-projects.txt,1018 bytes, text/plain)
2020-12-29 23:22 UTC, Ulrich Müller
Review of files in gentoo CVS repository (gentoo.txt,6.84 KB, text/plain)
2020-12-29 23:22 UTC, Ulrich Müller
Patch for excluded_files.txt (0001-excluded_files.txt-Keep-some-files.patch,4.38 KB, patch)
2020-12-29 23:23 UTC, Ulrich Müller
CVSROOT.tar.gz: Add file (0001-CVSROOT.tar.gz-Add-file.patch,205.09 KB, patch)
2022-01-09 13:11 UTC, Ulrich Müller

Description Ulrich Müller gentoo-dev 2020-12-29 11:08:57 UTC
https://projects.gentoo.org/vcs-history/excluded_files.txt appears to list every file containing the word "birthday", which causes many false positives. This includes files like:

   gentoo/xml/htdocs/dtd/guide.dtd,v
   gentoo/xml/htdocs/xsl/guide.xsl,v

which have a "birthday" attribute as part of the Guide XML DTD, some patches that seem unlikely to contain any personal information:

   gentoo/src/patchsets/glibc/2.5/1505_hppa_cvs-head-20061203.patch,v
   gentoo/src/patchsets/glibc/2.5.1/1505_hppa_cvs-head-20061203.patch,v

and some Council IRC logs which are publicly available elsewhere:

   gentoo/xml/htdocs/proj/en/council/meeting-logs/Attic/20110201.txt,v
   gentoo/xml/htdocs/proj/en/council/meeting-logs/Attic/20061109.txt,v
   gentoo/xml/htdocs/proj/en/council/meeting-logs/Attic/20090817.txt,v
   gentoo/xml/htdocs/proj/en/council/meeting-logs/Attic/20080410.txt,v

There are also files like:

   gentoo/src/fortune-gentoo-dev/fortunes/00020-20030809_00020,v

which is publicly available as part of https://dev.gentoo.org/~robbat2/distfiles/fortune-gentoo-dev-20090306.tar.bz2 and is certainly a false positive :)

   [discussing the pronounciation of gif]
   <seemant> I say it with soft g
   <seemant> coz like I envision this exchange:
   <seemant> "hey, <so-and-so> I have a gif for you" <-- hard g
   <seemant> where <so-and-so> will reply "ooh for me? you shouldn't 
    have!  it's not even my birthday"
   <seemant> and then you're in an embarrassing situtation

guide.dtd and guide.xsl in particular are essential if one wants to recreate any of the old documentation (which is how I noticed the issue in the first place). So please reconsider whether some of the files in the list could be distributed.

Generally I think that these archives should be as faithful to the CVS repositories as possible and should therefore exclude files only where it is absolutely necessary.
Comment 1 Alec Warner (RETIRED) archtester gentoo-dev Security 2020-12-29 15:53:28 UTC
As I discussed on IRC, the current test for "whether it is absolutely necessary" is:

for file in archive:
  if file.contains('birthday'):
    archive.remove(file)

I'm happy to use a more accurate method, but I felt it better to produce a redacted archive quickly than a perfect archive never. We can always add files back into the archives; it's hard to unpublish files.

-A
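
One way the keyword check above could be made less coarse is to combine it with a list of paths that a human has already reviewed and cleared. The following Python sketch is only an illustration under that assumption; the function name `files_to_exclude` and the `REVIEWED_OK` set are hypothetical, and it is not the script that was actually used to build the archives.

```python
from pathlib import Path

# Hypothetical allowlist: paths that a reviewer has cleared for publication.
# The two entries are taken from the bug description above.
REVIEWED_OK = {
    "gentoo/xml/htdocs/dtd/guide.dtd,v",
    "gentoo/xml/htdocs/xsl/guide.xsl,v",
}

KEYWORD = b"birthday"


def files_to_exclude(root, reviewed_ok=REVIEWED_OK):
    """Return sorted relative paths under root that contain KEYWORD
    and have not been cleared by review."""
    root = Path(root)
    excluded = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(root).as_posix()
        if rel in reviewed_ok:
            # Reviewed false positive: keep it in the archive.
            continue
        if KEYWORD in path.read_bytes():
            excluded.append(rel)
    return excluded
```

The result is a list of candidates rather than an in-place deletion, so it can be compared against excluded_files.txt before anything is removed.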
Comment 2 Ulrich Müller gentoo-dev 2020-12-29 23:22:14 UTC
Created attachment 680227 [details]
Review of files in gentoo-projects CVS repository
Comment 3 Ulrich Müller gentoo-dev 2020-12-29 23:22:45 UTC
Created attachment 680230 [details]
Review of files in gentoo CVS repository
Comment 4 Ulrich Müller gentoo-dev 2020-12-29 23:23:28 UTC
Created attachment 680233 [details, diff]
Patch for excluded_files.txt
Comment 5 Max Magorsch (RETIRED) Gentoo Infrastructure gentoo-dev 2020-12-30 01:54:48 UTC
I've just updated the repository according to the reviewed list.

Thanks for taking the time to review the list more closely.
Comment 6 Ulrich Müller gentoo-dev 2020-12-30 16:51:58 UTC
(In reply to Max Magorsch from comment #5)
> I've just updated the repository according to the reviewed list.

Looks good.
Comment 7 Ulrich Müller gentoo-dev 2021-03-10 10:25:17 UTC
I've noticed that the CVSROOT repository is missing. Presumably it won't add much value, but should it be added for completeness?

It contains the file CVSROOT/history which (IIUC) is a log of all reads and writes to any repo, so I believe it would be better to exclude it. Apart from that file, I don't see any personal data.
Comment 8 Ulrich Müller gentoo-dev 2022-01-09 13:11:48 UTC
Created attachment 761706 [details, diff]
CVSROOT.tar.gz: Add file

Attached patch adds CVSROOT.tar.gz, with CVSROOT/history excluded.
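
For illustration, an archive with one member left out can be produced with Python's `tarfile` module and its `filter` callback. This is a sketch under assumptions: the helper names `skip_excluded` and `make_archive` are hypothetical, and nothing in this bug says how the attached CVSROOT.tar.gz was actually created.

```python
import tarfile

# Archive member names to leave out (as they appear inside the tarball).
EXCLUDED = {"CVSROOT/history"}


def skip_excluded(tarinfo):
    """tarfile filter: return None to drop a member, or the member to keep it."""
    return None if tarinfo.name in EXCLUDED else tarinfo


def make_archive(src_dir, out_path, arcname="CVSROOT"):
    """Write src_dir to a gzip-compressed tarball, omitting EXCLUDED members."""
    with tarfile.open(out_path, "w:gz") as tar:
        tar.add(src_dir, arcname=arcname, filter=skip_excluded)
```

Calling `make_archive("/path/to/CVSROOT", "CVSROOT.tar.gz")` would then store every file under the `CVSROOT/` prefix except `CVSROOT/history`.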
Comment 9 Alec Warner (RETIRED) archtester gentoo-dev Security 2022-01-25 15:59:03 UTC
antarus@marine-bay:~/gentoo/vcs-history$ patch --binary -p1 < patch 
File CVSROOT.tar.gz: git binary diffs are not supported.

I think you need to remake the patch with git diff --binary (to generate the right diff format?)

-A
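
As a rough aid for this situation: `git diff --binary` writes binary changes as a section that begins with a `GIT binary patch` line, while a diff made without that option only contains a `Binary files ... differ` placeholder that carries no data. The message quoted above appears to be GNU patch declining a Git-style binary diff; `git apply` and `git am` are the tools that understand the full format. The Python sketch below, with the hypothetical helper name `classify_binary_sections`, only reports which of the two forms a patch contains and is not part of any Gentoo tooling.

```python
def classify_binary_sections(patch_text):
    """Return the kinds of binary sections found in a patch.

    "full"        - a 'GIT binary patch' section that carries the data
    "placeholder" - a 'Binary files ... differ' line without any data
    """
    kinds = set()
    for line in patch_text.splitlines():
        if line.startswith("GIT binary patch"):
            kinds.add("full")
        elif line.startswith("Binary files ") and line.rstrip().endswith(" differ"):
            kinds.add("placeholder")
    return kinds
```

A patch reported as "placeholder" has to be regenerated before the binary file can be applied at all; one reported as "full" already contains the data and needs a Git-aware tool to apply it.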
Comment 10 Larry the Git Cow gentoo-dev 2022-01-25 17:27:30 UTC
The bug has been referenced in the following commit(s):

https://gitweb.gentoo.org/sites/projects/vcs-history.git/commit/?id=fe3a5ddd85eacb7795e340d24468b62f3dedb482

commit fe3a5ddd85eacb7795e340d24468b62f3dedb482
Author:     Ulrich Müller <ulm@gentoo.org>
AuthorDate: 2022-01-09 13:02:25 +0000
Commit:     Alec Warner <antarus@gentoo.org>
CommitDate: 2022-01-25 17:27:00 +0000

    CVSROOT.tar.gz: Add file
    
    The file CVSROOT/history is a log of all reads and writes to any repo
    and has been excluded from the archive.
    
    Bug: https://bugs.gentoo.org/762403#c7
    Signed-off-by: Ulrich Müller <ulm@gentoo.org>
    Signed-off-by: Alec Warner <antarus@gentoo.org>

 CVSROOT.tar.gz     | Bin 0 -> 165234 bytes
 checksums.b2       |   1 +
 excluded_files.txt |   9 ++++++++-
 3 files changed, 9 insertions(+), 1 deletion(-)
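
The commit touches checksums.b2 alongside the new tarball. Assuming that file holds BLAKE2b digests in the `b2sum` output format (the bug only shows the file name, so this is a guess), a matching line could be computed as in the Python sketch below. The function name `b2sum_line` is hypothetical.

```python
import hashlib


def b2sum_line(path, chunk_size=1 << 20):
    """Return '<hex digest>  <path>' using BLAKE2b with a 512-bit digest."""
    digest = hashlib.blake2b()
    with open(path, "rb") as fh:
        # Read in chunks so large archives do not have to fit in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return f"{digest.hexdigest()}  {path}"
```

Comparing the returned line with the entry in checksums.b2 would confirm that a downloaded archive is intact, provided the assumption about the file format holds.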
Comment 11 Ulrich Müller gentoo-dev 2022-01-25 19:41:19 UTC
AFAICS all done. Closing.