Ulm has done some work making the historical.git more accurate:

Prerequisite packages to install
================================

- dev-vcs/cvs
- dev-vcs/cvs-fast-export
- dev-vcs/git
- dev-libs/libxslt (for userinfo.xml conversion)

Create the author map
=====================

Extract userinfo.xml from LDAP on dev.gentoo.org:

    $ perl_ldap -U

Create authormap.txt from userinfo.xml:

    $ ./make-authormap.sh >authormap.txt

Fetch and unpack the CVS repository
===================================

Fetch a copy of the archived gentoo-x86 CVS repository from:
https://projects.gentoo.org/vcs-history/gentoo-x86.tar.gz

Run cvs-fast-export
===================

    $ cd var/cvsroot/gentoo-x86
    $ find . | cvs-fast-export -A /path/to/authormap.txt -l /path/to/gentoo-x86-export.log -p >/path/to/gentoo-x86-export.out

This will run for some time (8 hours on i7-8700), mostly as a single
thread, and produce a 21 GiB output file.

The CVS repository contains a package app-backup/Attic, which confuses
cvs-fast-export:

    "Files in CVS Attic and RCS directories are treated as though the
    'Attic/' or 'RCS/' portion of the path were absent."
This can be seen in the output file (note that the "Attic" path
component is missing):

----------------------------------------------------------------------
commit refs/heads/master
mark :5149424
committer Hanno Böck <hanno@gentoo.org> 1431281161 +0000
data 118
Initial commit of Attic

(Portage version: 2.2.18/cvs/Linux x86_64, signed Manifest commit with key A5880072BBB51E42)
from :5149420
M 100644 :5149421 app-backup/Attic-0.15.ebuild
M 100644 :5149422 app-backup/ChangeLog
M 100644 :5149423 app-backup/metadata.xml
----------------------------------------------------------------------

----------------------------------------------------------------------
commit refs/heads/master
mark :5149426
committer Hanno Böck <hanno@gentoo.org> 1431281167 +0000
data 118
Initial commit of Attic

(Portage version: 2.2.18/cvs/Linux x86_64, signed Manifest commit with key A5880072BBB51E42)
from :5149424
M 100644 :5149425 app-backup/Manifest
----------------------------------------------------------------------

This is fixed by an additional sed filter in the following step.

Import into Git
===============

    $ mkdir gentoo-x86-git
    $ cd gentoo-x86-git
    $ git init
    $ LC_ALL=C sed '/^Initial commit of Attic$/,/^M [0-7]\{6\} .* app-backup\/Manifest/{s:^\(M [0-7]\{6\} .* app-backup/\)\(.*\):\1Attic/\2:}' \
        ../../var/cvsroot/gentoo-x86-export.txt | git fast-import

Differences to the old conversion
=================================

- cvs-fast-export(1) says: "A set of file operations is coalesced into
  a changeset if either (a) they all share the same commitid, or (b)
  all have no commitid but identical change comments, authors, and
  modification dates within the window defined by the time-fuzz
  parameter."
  For our case this means that for commits after 2006-03-04T10:23:03Z
  (commit 0b9dd1d2e89c) the commitid has been used to group them
  together, while earlier ones have been grouped by authors and commit
  messages, within a 5 minutes time window (which is the default for
  the fuzz parameter).
  This results in a total of 1688447 commits in the master branch,
  while the old conversion has only 788893 commits. Most of the
  difference can be explained by the fact that "repoman commit"
  actually did two CVS commits, the second one for the Manifest to
  catch up with the updated $Header$ keywords. Since this reflects the
  actual workflow, no attempts have been made to squash these pairs of
  commits.

- The new conversion used a complete author map; previously, users
  cbrannon, jerrya, luke-jr, and uid2214 (darkside) were missing.

- Commit messages have been left alone. For example, no conversion to
  Git footer lines has taken place. Conversion of character sets wasn't
  attempted either. (There are 310 commit messages with non-UTF-8
  characters. About 80% of them appear to be latin-1, but the rest is
  something else, or just contains some garbage characters.)

- Category app-backup is now there.

- File sci-libs/qfits/Manifest in HEAD differs. The new conversion
  agrees with the last CVS checkout.

- The new conversion has a .gitignore file in its top-level directory.
  Also metadata/.cvsignore was renamed to metadata/.gitignore
  (cvs-fast-export does this automatically).

- Output of "diff -qr --exclude=.git" between tips of old and new repo:

    Only in gentoo-x86-git: .gitignore
    Only in gentoo-x86-git: app-backup
    Files historical/header.txt and gentoo-x86-git/header.txt differ
    Only in historical/metadata: .cvsignore
    Only in gentoo-x86-git/metadata: .gitignore
    Files historical/sci-libs/qfits/Manifest and gentoo-x86-git/sci-libs/qfits/Manifest differ

Notes
=====

Keyword expansion
-----------------

Although the man page of cvs-fast-export (version 1.57) says that the
program "does the equivalent of cvs -kb when checking out masters, not
performing any $-keyword expansion at all", it actually does expand
$-keywords.

For the tip of the trunk, expanded keywords appear to be correct, as
can be verified with Manifest checksums. This is not always true
earlier in history.
For example, the CVS repository was located in /home/cvsroot and moved
to /var/cvsroot later ($Header$ lines suggest that this move happened
in early 2004). Also it is known that some files were moved in the raw
repository. Expanded keywords from before such a move won't match.

Branch points
-------------

cvs-fast-export-1.57 gets confused about branch points, if a file
doesn't have any commits on the trunk that are newer than those on the
branch. This triggers some warnings during conversion:

    cvs-fast-export: warning - non-vendor ./app-admin/analog/files/analog.cfg,v branch RELEASE-1_4 has no parent
    [and many more of the same type]
    cvs-fast-export: warning - branch point import-1.1.1 -> master later than branch
    cvs-fast-export: trunk(85563): 2005-11-30T09:36:17Z en.txt 1.1
    cvs-fast-export: branch(85563): 2005-11-30T09:38:30Z app-accessibility/SphinxTrain/files/digest-SphinxTrain-0.9.1-r1 1.1

It also results in commits from the branch showing up in the converted
Git master branch. The problem has been reported upstream:
https://gitlab.com/esr/cvs-fast-export/-/issues/57

For the time being, this is worked around by adding an extra commit to
the trunk (and removing it from the converted repository later):

    $ export CVSROOT=/var/cvsroot
    $ cvs checkout gentoo-x86
    $ cd gentoo-x86
    $ for file in $(find . -type d -name CVS -prune -o -type f -print); do echo >>${file}; done
    $ cvs commit -m "extra commit in trunk"
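
The sed filter from the "Import into Git" step can be tried on a small fragment before it is run on the full 21 GiB stream. The following is only a sketch and not part of the original procedure. The sample file is an abbreviated excerpt, not a valid fast-import stream; its third commit and the app-backup/foo path are made up to show that lines outside the matched range are left alone.

```shell
#!/bin/sh
# Write an abbreviated, partly hypothetical stream fragment to a temp file.
tmp=$(mktemp)
cat >"$tmp" <<'EOF'
data 118
Initial commit of Attic

(Portage version: 2.2.18/cvs/Linux x86_64, signed Manifest commit with key A5880072BBB51E42)
from :5149420
M 100644 :5149421 app-backup/Attic-0.15.ebuild
M 100644 :5149422 app-backup/ChangeLog
M 100644 :5149423 app-backup/metadata.xml

data 118
Initial commit of Attic

(Portage version: 2.2.18/cvs/Linux x86_64, signed Manifest commit with key A5880072BBB51E42)
from :5149424
M 100644 :5149425 app-backup/Manifest

data 13
other commit
from :5149426
M 100644 :5149427 app-backup/foo/foo-1.ebuild
EOF

# Same sed expression as in the import step: within the range that starts at
# the "Initial commit of Attic" message line and ends at the app-backup/Manifest
# file operation, re-insert the "Attic/" path component after "app-backup/".
fixed=$(LC_ALL=C sed '/^Initial commit of Attic$/,/^M [0-7]\{6\} .* app-backup\/Manifest/{s:^\(M [0-7]\{6\} .* app-backup/\)\(.*\):\1Attic/\2:}' "$tmp")

# Show only the file operations so the rewritten paths are easy to compare.
printf '%s\n' "$fixed" | grep '^M '

rm -f "$tmp"
```

With GNU sed, the four file operations of the two Attic commits should come out under app-backup/Attic/, while the made-up app-backup/foo line stays unchanged.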
Rich0, ulm@ asked that you weigh in on this. -A
A copy of the repo is on GitHub: https://github.com/ulm/gentoo-x86-historical
IMHO we should keep both conversions under repo/gentoo/. Not sure what names are best, maybe historical-1 and historical-2? Replacing historical.git by the new repo may be confusing, especially if people have this location as a git-remote and try to pull from it.
historical-ulm ;-)
I have pushed the new repo to archive/repo/gentoo-2.git. Should repo/gentoo/historical.git be moved to archive/repo/gentoo-1.git now?
(In reply to Ulrich Müller from comment #5)
> Should repo/gentoo/historical.git be moved to archive/repo/gentoo-1.git now?

Ping. Can we do this please? People get confused about which one is the newer one.
(In reply to Ulrich Müller from comment #6)
> (In reply to Ulrich Müller from comment #5)
> > Should repo/gentoo/historical.git be moved to archive/repo/gentoo-1.git now?
>
> Ping. Can we do this please? People get confused about which one is the
> newer one.

To clarify, I can rename /repo/gentoo/historical.git to /archive/repo/gentoo-1.git but then...should there be a repo at 'historical.git'? Or is the goal to just have 2 archive repos?

-A
I'd say just rename it. If it's easy to do, gentoo/historical.git could redirect to archive/repo/gentoo-1.git, but IMHO that's in the "nice to have" category.
Ping. Could the last missing step be done please, namely renaming of repo/gentoo/historical.git to archive/repo/gentoo-1.git? The current layout is confusing.