cvsps gets an update
TL;DR:
cvsps --fast-export --root $CVSROOT $proj | git fast-import
Not so long ago esr took over the maintainer role of the wonderful cvsps tool.
In short, CVS does not store change sets at all; special tools are needed to extract that information out of a CVS repository.
The problem is that you usually lack direct access to the repository itself. You only have a client program to check out an arbitrary version or to query the history of a given file or set of files.
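To make that concrete, here is the kind of client-side view you get (the paths and module name are placeholders): cvs log reports an independent revision history per file, with nothing tying together the revisions that were committed as one change.
# per-file history only: each file has its own revision numbers
$ cvs -d $CVSROOT log src/main.c
# check out one arbitrary revision of a single file
$ cvs -d $CVSROOT co -r 1.42 myproject/src/main.c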
Tools like git-cvsimport use cvsps to extract the change set information, fetch all files belonging to each change set, and commit them to the respective branch.
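For reference, a typical git-cvsimport run looks something like this (module and target directory names are placeholders):
# drive cvsps under the hood and build the git repository in myproject.git
$ git cvsimport -v -d $CVSROOT -C myproject.git myproject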
cvsps was a buggy and mostly abandoned project. 2.2_beta1 was the only version that did not crash for me.
esr decided to fix bugs all the way down and to get rid of git-cvsimport as an intermediary (the --fast-export option in the 3.x series).
A live upstream is a virtue, and I packaged the new release without any testing.
The 3.x series came out incompatible with git-cvsimport, but that's really a feature: just use cvsps directly and post-process the result with git filter-branch or similar.
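As a sketch of such post-processing (the username mapping below is made up): CVS records only bare login names, so one common cleanup is rewriting them into full author identities.
# rewrite bare CVS usernames into proper name/email pairs
# (a full cleanup would map the GIT_COMMITTER_* variables the same way)
$ git filter-branch --env-filter '
    if test "$GIT_AUTHOR_NAME" = "jrandom"; then
        export GIT_AUTHOR_NAME="J. Random Hacker"
        export GIT_AUTHOR_EMAIL="jrandom@example.com"
    fi
' -- --all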
After a breakage report I decided to use it myself on our largest and oldest CVS project at work.
cvsps hung at the very start. valgrind told me there was garbage in the input data, and that started my real contribution.
The hang didn't go away, so I started digging into the on-the-wire CVS client protocol, which resulted in a real fix of the problem.
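If you want to stare at that wire format yourself, the stock CVS client can dump it (this is a standard CVS debugging facility, nothing cvsps-specific):
# log the raw client/server protocol exchange, one file per direction
$ CVS_CLIENT_LOG=/tmp/cvslog cvs -d $CVSROOT co myproject
$ ls /tmp/cvslog.*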
And now the fun part (the reason I wrote this post): while fixing the above bug I noticed that the code fetching revisions sometimes ran noticeably faster (a 5x speedup) depending on seemingly random factors.
I looked at strace -r output and figured out that the first response to a cvs co <file> request came back only after 150-200 milliseconds.
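The measurement itself is nothing fancy (the invocation below is a sketch with a placeholder module): -r prefixes each syscall with a timestamp relative to the previous one, so a stalled round trip shows up as a fat gap before the read() from the server socket.
# -r: relative timestamps; -e trace=network: socket syscalls only
$ strace -r -e trace=network ../cvsps --root $CVSROOT --fast-export myproject >/dev/null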
That's severe lag: you can't fetch faster than about 5 files per second. The culprit is Nagle's algorithm, which holds back small writes until the previous packet is acknowledged, so every tiny request/response pair costs an extra round trip. After playing a bit with anti-Nagle hacks, the turbo booster fix went into the tree.
After those hacks I haven't managed to crash cvsps on my internal projects.
Well, let's test it on a really large CVS project: gentoo's ebuild tree. It has 2.2GB of history.
Some preparations:
# in $HOME/portage/gentoo-x86.rsync
rsync -aP rsync://anonvcs.gentoo.org/vcs-public-cvsroot/gentoo-x86/ gentoo-x86/
rsync -aP rsync://anonvcs.gentoo.org/vcs-public-cvsroot/CVSROOT/ CVSROOT/
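A quick sanity check that the mirror really came down whole:
# the full ,v history should weigh in at about 2.2GB
$ du -sh gentoo-x86/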
Let's try to import the kde-base category (one of the largest categories: 190MB of history, about 10% of the whole tree).
$ git init
$ time { ../cvsps --root :local:$HOME/portage/gentoo-x86.rsync --fast-export gentoo-x86/kde-base | git fast-import; }
cvsps: branch symbol RELEASE-1_4 not translated
cvsps: multiple vendor or anonymous branches; head content may be incorrect.
git-fast-import statistics:
....
real 29m11.682s
user 4m11.970s
sys 1m9.217s
A slightly more complex example, with authentication over ssh:
$ git init
$ CVS_RSH=ssh ../cvsps --fast-export --root :ext:slyfox@cvs.gentoo.org:/var/cvsroot gentoo-x86/dev-lang/ghc | git fast-import
And the scariest run: the whole-tree conversion.
$ git init
$ ../cvsps --root :local:$HOME/portage/gentoo-x86.rsync --fast-export gentoo-x86 | git fast-import
It takes 3.8GB of RAM to build the in-RAM revision history. The run hasn't finished yet, but I expect 3-4 hours of work.
The next step is to set up incremental updates and push the result out to the public.
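A minimal sketch of what the incremental side could look like, assuming the export stream stays stable across runs (git fast-import's marks files do the bookkeeping; whether cvsps plays along is exactly what needs testing):
# keep fast-import marks between runs so a later import appends
# new commits instead of rebuilding all of history
$ ../cvsps --root :local:$HOME/portage/gentoo-x86.rsync --fast-export gentoo-x86 |
    git fast-import --export-marks=cvs.marks --import-marks-if-exists=cvs.marks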
UPDATE: the import finished. It took ~5 hours; the resulting repo is 1.2GB:
git-fast-import statistics:
---------------------------------------------------------------------
Alloc'd objects: 2655000
Total objects: 2653581 ( 148626 duplicates )
blobs : 986447 ( 119173 duplicates 493906 deltas of 966966 attempts)
trees : 1348212 ( 29453 duplicates 1192295 deltas of 1241649 attempts)
commits: 318922 ( 0 duplicates 0 deltas of 0 attempts)
tags : 0 ( 0 duplicates 0 deltas of 0 attempts)
Total branches: 8 ( 3 loads )
marks: 1073741824 ( 1424542 unique )
atoms: 174556
Memory total: 150808 KiB
pools: 26355 KiB
objects: 124453 KiB
---------------------------------------------------------------------
pack_report: getpagesize() = 4096
pack_report: core.packedGitWindowSize = 1073741824
pack_report: core.packedGitLimit = 8589934592
pack_report: pack_used_ctr = 8342911
pack_report: pack_mmap_calls = 354193
pack_report: pack_open_windows = 2 / 3
pack_report: pack_mapped = 1213886709 / 1890833032
---------------------------------------------------------------------
real 317m53.483s
user 19m24.108s
sys 5m47.618s
And it broke. The latest imported commit is:
commit e123e7caa8b45f3ce8a7b358e3137de393f2619c
Author: agriffis <agriffis>
Date: Tue Feb 7 08:55:13 2006 +0000
UPDATE 2: more info. It turns out to be a bug in cvs server itself: it leaked all 32GB of RAM and crashed, leaving poor cvsps with an incomplete import.
Those leaks also make the import slow down a bit with every checkout request: cvs server serves each request by forking, so the bigger the leaked address space grows, the more PTEs have to be copied on each fork(). Now looking at cvs server to fix the disease.
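For the record, the leak is easy to watch from the outside (a generic observation trick, nothing cvs-specific):
# print the memory footprint of all cvs processes every 10 seconds
$ watch -n 10 'ps -o pid,rss,vsz,args -C cvs'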