Bug 806415 - Replace historical.git with newer copy
Status: IN_PROGRESS
Alias: None
Product: Gentoo Infrastructure
Classification: Unclassified
Component: Git
Hardware: All Linux
Importance: Normal normal with 1 vote
Assignee: Gentoo Infrastructure
Blocks: 761876
Reported: 2021-08-04 18:28 UTC by Alec Warner
Modified: 2022-03-29 18:13 UTC

Description Alec Warner (RETIRED) archtester gentoo-dev Security 2021-08-04 18:28:53 UTC
Ulm has done some work making the historical.git more accurate:

Prerequisite packages to install
================================

- dev-vcs/cvs
- dev-vcs/cvs-fast-export
- dev-vcs/git
- dev-libs/libxslt (for userinfo.xml conversion)


Create the author map
=====================

Extract userinfo.xml from LDAP on dev.gentoo.org:
$ perl_ldap -U

Create authormap.txt from userinfo.xml:
$ ./make-authormap.sh >authormap.txt
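
For reference, cvs-fast-export expects each author-map line in the form
"cvsuser = Full Name <email>", optionally followed by a timezone. The
following is a minimal sketch of a sanity check over authormap.txt. The
helper name check_authormap is made up for this illustration, and the
pattern is only an approximation of what the tool accepts (blank lines
and comment lines are reported as well):

```shell
# check_authormap FILE: print (with line numbers) every line that does not
# look like
#   cvsuser = Full Name <email> [timezone]
# Returns non-zero if any such line is found.
# Hypothetical helper; the regex is an approximation, not the tool's parser.
check_authormap() {
    if LC_ALL=C grep -Evn '^[^ =]+ *= *[^<>]+ <[^<> ]+>( +[^ ]+)?$' "$1"; then
        return 1
    fi
    return 0
}

# Usage:
#   check_authormap authormap.txt && echo "author map looks well-formed"
```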


Fetch and unpack the CVS repository
===================================

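Fetch a copy of the archived gentoo-x86 CVS repository from the URL
below and unpack it. The next step assumes that the archive extracts to
var/cvsroot/gentoo-x86 below the current directory:
https://projects.gentoo.org/vcs-history/gentoo-x86.tar.gz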


Run cvs-fast-export
===================

$ cd var/cvsroot/gentoo-x86
$ find . | cvs-fast-export -A /path/to/authormap.txt -l /path/to/gentoo-x86-export.log -p >/path/to/gentoo-x86-export.out

This will run for some time (about 8 hours on an i7-8700), mostly as a
single thread, and produce a 21 GiB output file.

The CVS repository contains a package app-backup/Attic, which confuses
cvs-fast-export: "Files in CVS Attic and RCS directories are treated
as though the 'Attic/' or 'RCS/' portion of the path were absent."
This can be seen in the output file (note that the "Attic" path
component is missing):

  ----------------------------------------------------------------------
  commit refs/heads/master
  mark :5149424
  committer Hanno Böck <hanno@gentoo.org> 1431281161 +0000
  data 118
  Initial commit of Attic

  (Portage version: 2.2.18/cvs/Linux x86_64, signed Manifest commit with key A5880072BBB51E42)

  from :5149420
  M 100644 :5149421 app-backup/Attic-0.15.ebuild
  M 100644 :5149422 app-backup/ChangeLog
  M 100644 :5149423 app-backup/metadata.xml
  ----------------------------------------------------------------------

  ----------------------------------------------------------------------
  commit refs/heads/master
  mark :5149426
  committer Hanno Böck <hanno@gentoo.org> 1431281167 +0000
  data 118
  Initial commit of Attic

  (Portage version: 2.2.18/cvs/Linux x86_64, signed Manifest commit with key A5880072BBB51E42)

  from :5149424
  M 100644 :5149425 app-backup/Manifest
  ----------------------------------------------------------------------

This is fixed by an additional sed filter in the following step.


Import into Git
===============

$ mkdir gentoo-x86-git
$ cd gentoo-x86-git
$ git init
$ LC_ALL=C sed '/^Initial commit of Attic$/,/^M [0-7]\{6\} .* app-backup\/Manifest/{s:^\(M [0-7]\{6\} .* app-backup/\)\(.*\):\1Attic/\2:}' \
/path/to/gentoo-x86-export.out | git fast-import
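
To see what the sed filter does, it can be run on a few lines taken from
the excerpt in the previous section. This is only an illustration: the
last sample line (with mark :9999999) is made up to show that lines
after the matched range pass through unchanged.

```shell
# Sample fast-export lines: the first six come from the excerpt above,
# the last one is a made-up line that must stay untouched.
sample='Initial commit of Attic
M 100644 :5149421 app-backup/Attic-0.15.ebuild
M 100644 :5149422 app-backup/ChangeLog
M 100644 :5149423 app-backup/metadata.xml
Initial commit of Attic
M 100644 :5149425 app-backup/Manifest
M 100644 :9999999 app-backup/foo/foo-1.0.ebuild'

# Same sed expression as in the import step: between the commit message
# line and the Manifest line, re-insert the missing "Attic/" component.
fixed=$(printf '%s\n' "$sample" | LC_ALL=C sed '/^Initial commit of Attic$/,/^M [0-7]\{6\} .* app-backup\/Manifest/{s:^\(M [0-7]\{6\} .* app-backup/\)\(.*\):\1Attic/\2:}')

printf '%s\n' "$fixed"
```

Within the range, "app-backup/Attic-0.15.ebuild" becomes
"app-backup/Attic/Attic-0.15.ebuild" (and likewise for ChangeLog,
metadata.xml and Manifest), while the made-up last line is left alone.
Note that the range spans both "Initial commit of Attic" commits, because
only the second one touches the Manifest.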


Differences to the old conversion
=================================

- cvs-fast-export(1) says:
  "A set of file operations is coalesced into a changeset if either
  (a) they all share the same commitid, or (b) all have no commitid
  but identical change comments, authors, and modification dates
  within the window defined by the time-fuzz parameter."

  For our case this means that for commits after 2006-03-04T10:23:03Z
  (commit 0b9dd1d2e89c) the commitid has been used to group them
  together, while earlier ones have been grouped by author and commit
  message, within a 5-minute time window (which is the default for the
  fuzz parameter).

  This results in a total of 1688447 commits in the master branch,
  while the old conversion has only 788893 commits. Most of the
  difference can be explained by the fact that "repoman commit"
  actually did two CVS commits, the second one for the Manifest to
  catch up with the updated $Header$ keywords. Since this reflects
  the actual workflow, no attempts have been made to squash these
  pairs of commits.

- The new conversion used a complete author map. Previously, users
  cbrannon, jerrya, luke-jr, and uid2214 (darkside) were missing.

- Commit messages have been left alone. For example, no conversion
  to Git footer lines has taken place. Conversion of character sets
  wasn't attempted either. (There are 310 commit messages with
  non-UTF-8 characters. About 80% of them appear to be latin-1,
  but the rest is something else, or just contains some garbage
  characters.)

- Category app-backup is now there.

- File sci-libs/qfits/Manifest in HEAD differs. The new conversion
  agrees with the last CVS checkout.

- The new conversion has a .gitignore file in its top-level directory.
  Also metadata/.cvsignore was renamed to metadata/.gitignore
  (cvs-fast-export does this automatically).

- Output of "diff -qr --exclude=.git" between tips of old and new repo:

  Only in gentoo-x86-git: .gitignore
  Only in gentoo-x86-git: app-backup
  Files historical/header.txt and gentoo-x86-git/header.txt differ
  Only in historical/metadata: .cvsignore
  Only in gentoo-x86-git/metadata: .gitignore
  Files historical/sci-libs/qfits/Manifest and gentoo-x86-git/sci-libs/qfits/Manifest differ
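
The commit totals quoted above can be checked with git rev-list. A minimal
sketch, assuming both conversions are available as local clones; the
helper name count_commits is made up here, and the directory names in
the usage comment are the ones from the diff output above.

```shell
# count_commits REPO [REV]: number of commits reachable from REV
# (default: master) in the Git repository at REPO.
# Hypothetical helper for illustration.
count_commits() {
    git -C "$1" rev-list --count "${2:-master}"
}

# Usage (directories are assumed to be local clones):
#   count_commits gentoo-x86-git    # new conversion
#   count_commits historical        # old conversion
```

If the clones correspond to the repositories described here, the two
calls would be expected to report the totals given in the list above.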


Notes
=====

Keyword expansion
-----------------

Although the man page of cvs-fast-export (version 1.57) says that the
program "does the equivalent of cvs -kb when checking out masters, not
performing any $-keyword expansion at all", it actually does expand
$-keywords.

For the tip of the trunk, expanded keywords appear to be correct,
as can be verified with Manifest checksums. This is not always true
earlier in history. For example, the CVS repository was located in
/home/cvsroot and moved to /var/cvsroot later ($Header$ lines suggest
that this move happened in early 2004). Also it is known that some
files were moved in the raw repository. Expanded keywords from before
such a move won't match.
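
The repository location embedded in an expanded $Header$ keyword can be
pulled out with a small filter, which helps when checking which root a
given file revision was expanded against. This is a sketch only: the
helper name header_root is made up, and it assumes that the module
directory was named gentoo-x86 under both roots.

```shell
# header_root: read text on stdin and print the CVS root directory found
# in each expanded $Header$ keyword, e.g. /home/cvsroot or /var/cvsroot.
# Assumption: the path has the form <root>/gentoo-x86/<file>,v
header_root() {
    sed -n 's|.*\$Header: \(/.*\)/gentoo-x86/.*,v .*|\1|p'
}

# Usage (commit and path are placeholders):
#   git show <commit>:<path/to/file.ebuild> | header_root
```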


Branch points
-------------

cvs-fast-export-1.57 gets confused about branch points if a file has no
commits on the trunk that are newer than those on the branch.

This triggers some warnings during conversion:

  cvs-fast-export: warning - non-vendor ./app-admin/analog/files/analog.cfg,v branch RELEASE-1_4 has no parent
  [and many more of the same type]

  cvs-fast-export: warning - branch point import-1.1.1 -> master later than branch
  cvs-fast-export:        trunk(85563):  2005-11-30T09:36:17Z  en.txt 1.1
  cvs-fast-export:        branch(85563): 2005-11-30T09:38:30Z  app-accessibility/SphinxTrain/files/digest-SphinxTrain-0.9.1-r1 1.1

It also results in commits from the branch showing up in the converted
Git master branch. The problem has been reported upstream:
https://gitlab.com/esr/cvs-fast-export/-/issues/57

For the time being, this is worked around by adding an extra commit to
the trunk (and removing it from the converted repository later):

$ export CVSROOT=/var/cvsroot
$ cvs checkout gentoo-x86
$ cd gentoo-x86
$ for file in $(find . -type d -name CVS -prune -o -type f -print); do echo >>${file}; done
$ cvs commit -m "extra commit in trunk"
Comment 1 Alec Warner (RETIRED) archtester gentoo-dev Security 2021-08-04 18:30:37 UTC
Rich0, ulm@ asked that you weigh in on this.

-A
Comment 2 Ulrich Müller gentoo-dev 2021-08-04 18:36:35 UTC
A copy of the repo is on Github: https://github.com/ulm/gentoo-x86-historical
Comment 3 Ulrich Müller gentoo-dev 2021-08-25 07:40:32 UTC
IMHO we should keep both conversions under repo/gentoo/. Not sure what names are best, maybe historical-1 and historical-2?

Replacing historical.git by the new repo may be confusing, especially if people have this location as a git-remote and try to pull from it.
Comment 4 Michał Górny archtester Gentoo Infrastructure gentoo-dev Security 2021-08-25 07:47:49 UTC
historical-ulm ;-)
Comment 5 Ulrich Müller gentoo-dev 2021-08-26 07:23:58 UTC
I have pushed the new repo to archive/repo/gentoo-2.git.

Should repo/gentoo/historical.git be moved to archive/repo/gentoo-1.git now?
Comment 6 Ulrich Müller gentoo-dev 2022-01-09 11:25:14 UTC
(In reply to Ulrich Müller from comment #5)
> Should repo/gentoo/historical.git be moved to archive/repo/gentoo-1.git now?

Ping. Can we do this please? People get confused about which one is the newer one.
Comment 7 Alec Warner (RETIRED) archtester gentoo-dev Security 2022-01-24 22:40:15 UTC
(In reply to Ulrich Müller from comment #6)
> (In reply to Ulrich Müller from comment #5)
> > Should repo/gentoo/historical.git be moved to archive/repo/gentoo-1.git now?
> 
> Ping. Can we do this please? People get confused about which one is the
> newer one.

To clarify, I can rename /repo/gentoo/historical.git to /archive/repo/gentoo-1.git but then...should there be a repo at 'historical.git'? Or is the goal to just have 2 archive repos?

-A
Comment 8 Ulrich Müller gentoo-dev 2022-01-24 23:53:55 UTC
I'd say just rename it.

If it's easy to do, gentoo/historical.git could redirect to archive/repo/gentoo-1.git, but IMHO that's in the "nice to have" category.
Comment 9 Ulrich Müller gentoo-dev 2022-03-29 18:13:41 UTC
Ping.

Could the last missing step be done please, namely renaming of repo/gentoo/historical.git to archive/repo/gentoo-1.git? The current layout is confusing.