
Bug 534528

Summary: distfiles should be sorted into subdirectories of DISTDIR
Product: Portage Development
Reporter: G.Wolfe Woodbury <redwolfe>
Component: Enhancement/Feature Requests
Assignee: Portage team <dev-portage>
Status: IN_PROGRESS
Severity: normal
CC: daniel, flow, infra-bugs, lssndrbarbieri, mgorny, redwolfe, ulm
Priority: Normal
Version: unspecified   
Hardware: All   
OS: Linux   
See Also: https://bugs.gentoo.org/show_bug.cgi?id=13325
Whiteboard:
Package list:
Runtime testing required: ---
Bug Depends on: 645810, 646898, 697566, 756778    
Bug Blocks:    

Description G.Wolfe Woodbury 2015-01-03 20:19:19 UTC
If a system mirrors the /usr/portage/distfiles directory, or if lots of packages are installed on a system, the directory gets huge.  With an ext3 or ext4 filesystem this does not cause much of a problem, but ext2 and several other filesystem types have problems with directories that large.

Reproducible: Always

Steps to Reproduce:
1. Mirror distfiles locally or have large numbers of packages installed
Actual Results:  
my /usr/portage/distfiles directory is currently 3.6 Mb in size and has 67069 (approximately) files in it.

Expected Results:  
I suggest that the distfiles directory be subdivided into a series of directories using the first character of the file name.  This significantly reduces the size of each directory to something a bit more reasonable for some filesystem types.

Large directories are generally not a problem for the modern filesystem types like ext4, but ext2 and vfat and ntfs filesystems can create performance problems with very large directories.  As an example, the Fedora package trees now use this technique to make reasonably sized directories on their servers and mirrors.
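
A minimal sketch of the first-character split suggested above, assuming a POSIX-style DISTDIR. The function names are hypothetical and not part of Portage. The second helper counts how many files would land in each subdirectory, which makes it easy to check how evenly a given set of names spreads.

```python
import posixpath
from collections import Counter


def first_char_path(distdir, filename):
    """Return DISTDIR/<first character>/<filename> (hypothetical layout)."""
    return posixpath.join(distdir, filename[0], filename)


def bucket_sizes(filenames):
    """Count how many files would land in each first-character directory."""
    return Counter(name[0] for name in filenames)


print(first_char_path("/usr/portage/distfiles", "foo-1.0.tar.gz"))
print(bucket_sizes(["texlive-module-a.tar.xz", "texlive-module-b.tar.xz", "foo-1.0.tar.gz"]))
```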
Comment 1 Jeroen Roovers (RETIRED) gentoo-dev 2015-01-04 09:34:13 UTC
> If a system mirrors the /usr/portage/distfiles directory, or if lots of
> packages are installed on a system, the directory gets huge. 

Of course it does.

> With an ext3 or ext4 filesystem this does not cause much of a problem,
> but ext2 and several
> other filesystem types have problems with directories that large.

Then pick an appropriate filesystem for that storage requirement.

> 1.mirror distfiles locally or have large numbers of packages installed

Everyone does that, I guess.

> my /usr/portage/distfiles directory is currently 3.6 Mb in size and has
> 67069 (approximately) files in it.

3.6 megabits?

> I suggest that the distfiles directory be subdivided into a series of
> directories using the first character of the file name.  This significantly
> reduces the size of each directory to something a bit more reasonable for
> some filesystem types.

That's bug #13325, but maybe it was closed somewhat prematurely.

> Large directories are generally not a problem for the modern filesystem
> types like ext4, but ext2 and vfat and ntfs filesystems can create
> performance problems with very large directories.  As an example, the Fedora
> package trees now use this technique to make reasonably sized directories on
> their servers and mirrors.

You can move it to a better place (set DISTDIR in make.conf) and you can regularly clean it up (using eclean-dist for instance).

Again: pick a filesystem type to match your storage requirements. Having subdirectories in DISTDIR would solve only one of the issues with ext2/VFAT/NTFS storage.
Comment 2 Jeroen Roovers (RETIRED) gentoo-dev 2015-01-04 09:39:42 UTC
One immediate problem would be that many many ebuilds directly access DISTDIR/<somefile>, notably in src_unpack() and src_prepare().

Somehow all package managers would suddenly need to be able to translate DISTDIR/<somefile> transparently into DISTDIR/<firstletterof<somefile>>/<somefile> (or all those ebuilds would need to be rewritten).
Comment 3 Michał Górny archtester Gentoo Infrastructure gentoo-dev Security 2015-01-04 09:43:06 UTC
(In reply to Jeroen Roovers from comment #2)
> One immediate problem would be that many many ebuilds directly access
> DISTDIR/<somefile>, notably in src_unpack() and src_prepare().
> 
> Somehow all package managers would suddenly need to be able to translate
> DISTDIR/<somefile> transparently into
> DISTDIR/<firstletterof<somefile>>/<somefile> (or all those ebuilds would
> need to be rewritten).

Portage uses a temporary distdir with symlinks.
Comment 4 Zac Medico gentoo-dev 2015-01-04 21:00:08 UTC
Adding infra-bugs to CC, since we probably also want to use the new DISTDIR layout on our mirrors.

@infra: Do we already have a new distfiles directory structure in mind?
Comment 5 G.Wolfe Woodbury 2015-06-01 19:21:56 UTC
Well, of course distfiles *should* be placed on an ext4 or other appropriate filesystem type, but some installations may not be able to do this and still want to have all the distfiles available.

This enhancement/modification would simply add more flexibility for all types of installations.
Comment 6 G.Wolfe Woodbury 2016-08-27 19:33:46 UTC
In reply to comment #3:

The default src_unpack and src_prepare provided by Portage eclasses could handle the redirection easily. I don't know Portage internals well enough to say whether there is an eclass function that could also do this for them; one might need to be written for accessing the files in DISTDIR.

The main point is that even though the default filesystem types used by Gentoo are capable of dealing with the 3+MB directory sizes, other filesystem types that are in common use do not fare so well.

This suggestion is another way to make Gentoo more flexible and offer more choice to users and local sysadmins.
Comment 7 Zac Medico gentoo-dev 2016-08-27 20:56:15 UTC
(In reply to G.Wolfe Woodbury from comment #6)
> the default src_unpack and src_prepare provided by portage eclasses could
> handle the redirection easily.

We'd only have to adjust the _prepare_fake_distdir function here, since we already use that as a layer of indirection:

https://gitweb.gentoo.org/proj/portage.git/tree/pym/portage/package/ebuild/doebuild.py?h=portage-2.3.0#n1309
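
The layer of indirection can be pictured roughly as follows. This is an illustrative sketch only, not the code of _prepare_fake_distdir: prepare_shadow_distdir and its parameters are hypothetical. It shows why ebuilds that read DISTDIR/<somefile> would not need to care where the real file lives.

```python
import os


def prepare_shadow_distdir(real_paths, shadow_dir):
    """Populate shadow_dir with one symlink per distfile, named by basename.

    real_paths may point anywhere (flat or split layout); the ebuild only
    ever sees shadow_dir/<filename>.
    """
    os.makedirs(shadow_dir, exist_ok=True)
    for path in real_paths:
        link = os.path.join(shadow_dir, os.path.basename(path))
        if os.path.lexists(link):
            os.unlink(link)  # refresh stale or dangling links
        os.symlink(os.path.abspath(path), link)
```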
Comment 8 Robin Johnson archtester Gentoo Infrastructure gentoo-dev Security 2016-08-28 21:13:56 UTC
Many many years ago, when I did more mirror-admin stuff, I asked about hashing the names as well, but the major problem at the time was supporting all of the old Portage versions (they are going to keep requesting the non-hashed layout for a long time).

More on that in a moment, as it may be more doable now by replacing the http://distfiles.gentoo.org/ DNS round-robin with an improved bouncer (that subdivides the filename).

The larger issue is that of the 75k files on the mirrors in /distfiles/, 24k start with "texlive-module-". 66k files in /distfiles/ are actively referenced in Manifests at present (I don't have active stats on the other 9k).

Fedora, I believe, went to using the checksum of distfiles as part of the directory structure, to ensure a more even distribution.

Maybe a scheme like that would help overall?
/distfiles/sha256-${SHA256:0:2}/${SHA256}/$FILENAME 
* I explicitly include the name of the hash in the path, so multiple hashes can be trivially supported in future.

Implementation plan:
--------------------
Phase 1:
(on mirror infrastructure)
Move distfiles to new scheme, replace existing files with symlinks.

Phase 2:
Roll-out new Portage that uses the new scheme

Phase 3:
Replace the distfiles.g.o round-robin with a bouncer that generates HTTP redirects to the old-style /distfiles/ structure (specifically, it's UNAWARE of distfile checksums).

Phase 4:
Make bouncer AWARE of distfile checksums, so it generates redirects to the hashed path where possible. Needs to have some idea of fallback as well.

Phase 5:
Deprecate non-hashed paths on mirrors, advise that they'll be going away in 6 months.

Phase 6 (+6 months from previous phase):
Turn off non-hashed paths on mirrors.

Legacy Portage installs should have their mirror://gentoo/ value pointing simply to http://distfiles.gentoo.org/, and that will redirect them to the new distfile location as needed.
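
A rough sketch of the path scheme and of the Phase 4 fallback described above. The function names and the known_digests mapping are hypothetical. A real bouncer would need its own source of checksums and would emit HTTP redirects rather than return strings.

```python
import hashlib
import posixpath


def hashed_path(filename, sha256_hex, root="/distfiles"):
    """Build <root>/sha256-<first 2 hex digits>/<full digest>/<filename>."""
    return posixpath.join(root, "sha256-" + sha256_hex[:2], sha256_hex, filename)


def redirect_path(filename, known_digests, root="/distfiles"):
    """Use the hashed path when the digest is known, else the flat layout."""
    digest = known_digests.get(filename)
    if digest is None:
        return posixpath.join(root, filename)  # fallback: old-style path
    return hashed_path(filename, digest, root)


digest = hashlib.sha256(b"example contents").hexdigest()
print(redirect_path("foo-1.0.tar.gz", {"foo-1.0.tar.gz": digest}))
print(redirect_path("unknown.tar.gz", {}))
```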
Comment 9 Robin Johnson archtester Gentoo Infrastructure gentoo-dev Security 2016-09-09 18:25:42 UTC
@redwolfe, @zmedico:
Any concerns/opinions about my suggestions in comment 8?
Comment 10 Zac Medico gentoo-dev 2016-09-09 18:40:28 UTC
It seems like a lot of unnecessary complexity, so we really need to question the value of supporting legacy filesystems.
Comment 11 Robin Johnson archtester Gentoo Infrastructure gentoo-dev Security 2016-09-09 18:57:27 UTC
(In reply to Zac Medico from comment #10)
> It seems like a lot of unnecessary complexity, so we really need to question
> the value of supporting legacy filesystems.

Which part do you see as unnecessary? The hash-based system is needed otherwise the "texlive-module-" prefix will cause one of the name-based directories to be very large still.

Phase 4 is the hardest portion, and only relevant for supporting old clients for an extended period of time. If we didn't care about that, we could just give 6 months' notice and then remove the symlinks (this would be TOTALLY fine with me).
Comment 12 Zac Medico gentoo-dev 2016-09-09 19:04:45 UTC
(In reply to Robin Johnson from comment #11)
> (In reply to Zac Medico from comment #10)
> > It seems like a lot of unnecessary complexity, so we really need to question
> > the value of supporting legacy filesystems.
> 
> Which part do you see as unnecessary? The hash-based system is needed
> otherwise the "texlive-module-" prefix will cause one of the name-based
> directories to be very large still.

I mean that the status quo is fine. Modern filesystems are designed to handle directories containing millions of files. We really don't need to change anything, unless we are catering to legacy filesystems. It's a lot of work, just to support legacy filesystems.
Comment 13 Robin Johnson archtester Gentoo Infrastructure gentoo-dev Security 2016-09-09 20:07:43 UTC
Just because they are designed to handle millions of files in a single directory doesn't mean there aren't significant performance benefits to keeping each directory smaller. listdir() is very expensive on the distfiles directory as it stands.
Comment 14 Michał Górny archtester Gentoo Infrastructure gentoo-dev Security 2016-09-09 20:11:34 UTC
I have to agree with Zac here. Unless you've got a better problem to solve than 'old filesystems supposedly do not work well', I don't think this should be changed arbitrarily.

We're talking about a major change that is going to break existing tools, and confuse people who are used to the current schema. Even if we can guarantee that legacy distfile URIs will work, the local layout change will cause incompatibility between different package managers (and different versions of the same manager), break eclean-dist, confuse people who are used to unpacking or looking up files in DISTDIR…

If someone wants a split DISTDIR locally, I think we can easily add an option to support that. If we want a split mirror layout, we should be able to handle it Infra side with backwards compatibility. But in any case, I think it'd be nice to summarize the problem being solved first.
Comment 15 Zac Medico gentoo-dev 2016-09-09 20:43:41 UTC
(In reply to Robin Johnson from comment #13)
> listdir() is very expensive on the distfiles directory as it stands.

For applications where listdir introduces a performance problem, you can cache the result in a file, and invalidate the cache when the directory timestamp changes. Just make sure that you also invalidate the cache if the timestamp changes *during* the listdir call (and use nanosecond resolution timestamps).
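
A minimal sketch of the caching approach described in this comment, assuming the cache file lives outside the listed directory (writing it inside would itself change the directory timestamp). cached_listdir and the JSON cache format are hypothetical choices made for illustration.

```python
import json
import os


def cached_listdir(directory, cache_file):
    """Return the directory listing, reusing a cached copy when still valid."""
    mtime = os.stat(directory).st_mtime_ns  # nanosecond resolution
    try:
        with open(cache_file) as f:
            cached = json.load(f)
        if cached["mtime_ns"] == mtime:
            return cached["names"]
    except (OSError, ValueError, KeyError, TypeError):
        pass  # missing or unreadable cache: fall through and rescan

    while True:
        before = os.stat(directory).st_mtime_ns
        names = sorted(os.listdir(directory))
        after = os.stat(directory).st_mtime_ns
        if before == after:
            break  # timestamp did not change during the listdir call

    tmp = cache_file + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"mtime_ns": after, "names": names}, f)
    os.replace(tmp, cache_file)  # atomic update of the cache file
    return names
```

The retry loop repeats the scan if the timestamp moved while listdir was running, so a listing that raced with a change is never stored.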
Comment 16 Robin Johnson archtester Gentoo Infrastructure gentoo-dev Security 2016-09-09 20:59:30 UTC
Simply loading http://$MIRRORNAME/distfiles/ on some mirrors is problematic, because of listdir performance, and apache not having listdir caching as you describe.
Comment 17 Robin Johnson archtester Gentoo Infrastructure gentoo-dev Security 2016-09-09 21:02:38 UTC
(In reply to Michał Górny from comment #14)
> But in any case, I
> think it'd be nice to summarize the problem being solved first.

Problem 1:
On user systems, listdir of $DISTDIR is slow, esp. when less modern filesystems are used (see the first message in the bug).

Problem 2:
On mirrors, listdir of /distfiles/ is slow, mostly due to having 74k files in the single directory. This is esp. evident on mirrors that share a storage backend via NFS to many webservers.

Root cause:
The flat directory layout does not scale well.
Comment 18 Michał Górny archtester Gentoo Infrastructure gentoo-dev Security 2016-09-09 21:07:25 UTC
What is the use case for the listdir?
Comment 19 Robin Johnson archtester Gentoo Infrastructure gentoo-dev Security 2016-09-10 22:55:07 UTC
(In reply to Michał Górny from comment #18)
> What is the use case for the listdir?
Hitting them via HTTP, but also the frequent rsync for mirror replication. If the kernel cached listdir better, it probably would hurt way less, but that's hard to get changed everywhere, esp. on mirrors running older distros.
Comment 20 Zac Medico gentoo-dev 2016-09-11 00:22:10 UTC
(In reply to Robin Johnson from comment #19)
> (In reply to Michał Górny from comment #18)
> > What is the use case for the listdir?
> Hitting them via HTTP,



> but also the frequent rsync for mirror replication.
> If the kernel cached listdir better, it probably would hurt way less, but
> that's hard to get changed everywhere, esp. on mirrors running older distros.
Comment 21 Zac Medico gentoo-dev 2016-09-11 00:25:54 UTC
Whoops, premature comment there...

(In reply to Zac Medico from comment #20)
> (In reply to Robin Johnson from comment #19)
> > (In reply to Michał Górny from comment #18)
> > > What is the use case for the listdir?
> > Hitting them via HTTP,

We can probably just disable browsing on that directory, no?

> > but also the frequent rsync for mirror replication.
> > If the kernel cached listdir better, it probably would hurt way less, but
> > that's hard to get changed everywhere, esp. on mirrors running older distros.

Since rsync is something that runs in the background, it shouldn't bother anyone. If there's a real performance issue, then that's a bug. For example, linux-4.7 has this "Parallel directory lookups" fix:

https://kernelnewbies.org/Linux_4.7#head-cb7faf5c84d36d6bec87c7f9233bfe2d50b0073a
Comment 22 Robin Johnson archtester Gentoo Infrastructure gentoo-dev Security 2016-09-11 15:13:49 UTC
(In reply to Zac Medico from comment #21)
> Whoops, premature comment there...
> 
> (In reply to Zac Medico from comment #20)
> > (In reply to Robin Johnson from comment #19)
> > > (In reply to Michał Górny from comment #18)
> > > > What is the use case for the listdir?
> > > Hitting them via HTTP,
> We can probably just disable browsing on that directory, no?
That's terrible for people looking for old distfiles. 


> Since rsync is something that runs in the background, it shouldn't bother
> anyone. If there's a real performance issue, then that's a bug. For example,
> linux-4.7 has this "Parallel directory lookups" fix:
> 
> https://kernelnewbies.org/Linux_4.7#head-cb7faf5c84d36d6bec87c7f9233bfe2d50b0073a
That fix would really help, but this is something that's been a problem for a long time and still affects mirrors that won't be running a 4.7 kernel until it gets into RHEL/CentOS. Getting the structure changed will mitigate it in the meantime.

The only downside that I've been able to identify is that previously you could try to compare a single directory listing to check how up to date a mirror was, and now you have to scan many directories.
Comment 23 Zac Medico gentoo-dev 2016-09-11 20:54:10 UTC
(In reply to Robin Johnson from comment #22)
> (In reply to Zac Medico from comment #21)
> > Whoops, premature comment there...
> > 
> > (In reply to Zac Medico from comment #20)
> > > (In reply to Robin Johnson from comment #19)
> > > > (In reply to Michał Górny from comment #18)
> > > > > What is the use case for the listdir?
> > > > Hitting them via HTTP,
> > We can probably just disable browsing on that directory, no?
> That's terrible for people looking for old distfiles. 

We had a similar problem at work, involving a directory containing build artifacts. I solved that by creating a service that watches the directory for changes (it polls the directory timestamp, since pyinotify proved to be unreliable), and generates/synchronizes a browse-able directory structure containing hardlinks to the original files. It compares inode numbers to detect if hardlinks need to be updated.

> > Since rsync is something that runs in the background, it shouldn't bother
> > anyone. If there's a real performance issue, then that's a bug. For example,
> > linux-4.7 has this "Parallel directory lookups" fix:
> > 
> > https://kernelnewbies.org/Linux_4.7#head-cb7faf5c84d36d6bec87c7f9233bfe2d50b0073a
> That fix would really help, but this is something that's been a problem for
> a long time and still affects mirrors that won't be running a 4.7 kernel
> until it gets into RHEL/CentOS. Getting the structure changed will mitigate
> it in the meantime.
> 
> The only downside that I've been able to identify is that previously you
> could try to compare a single directory listing to check how up to date a
> mirror was, and now you have to scan many directories.

Well, we can do both, if we have a service to maintain a directory structure containing hardlinks to the original files, like the one that I've described above.
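
The synchronization step of such a service might look roughly like this. It is a sketch under assumptions, not the service described above: sync_hardlink_tree and the subdir_for callback are hypothetical, both trees are assumed to be on the same filesystem (a requirement for hardlinks), and the polling loop and removal of vanished files are left out.

```python
import os


def sync_hardlink_tree(flat_dir, tree_dir, subdir_for):
    """Mirror flat_dir into tree_dir/<subdir_for(name)>/<name> using hardlinks.

    Returns the number of links created or replaced.
    """
    updated = 0
    for name in os.listdir(flat_dir):
        src = os.path.join(flat_dir, name)
        if not os.path.isfile(src):
            continue
        dst_dir = os.path.join(tree_dir, subdir_for(name))
        os.makedirs(dst_dir, exist_ok=True)
        dst = os.path.join(dst_dir, name)
        try:
            # Same inode number means the hardlink is already up to date.
            if os.stat(dst).st_ino == os.stat(src).st_ino:
                continue
            os.unlink(dst)
        except FileNotFoundError:
            pass
        os.link(src, dst)
        updated += 1
    return updated
```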
Comment 24 G.Wolfe Woodbury 2016-10-03 12:08:36 UTC
To me, it seems a bit callous to just say "don't use old filesystem types" or to not acknowledge that a performance problem exists -- it is, I think, akin to dismissing software bloat because modern CPUs are fast enough to deal with all the extra work.

Last I looked (a few moments ago), Fedora was still just using the first letter of the package name as a spreading method to make the tree components more reasonable in size. Browsing is still sufficiently fast, and rsync doesn't spend nearly as much time analysing the file lists as it does for Gentoo's distfiles.

There is also the question of dealing with folks who may be using something like NTFS or (heaven forbid) vfat in dual installations.

I will acknowledge that most folks won't notice or appreciate the performance problems. There are, however, digital pack-rats that try to collect anything and everything, and system administrators of mirror sites (who are professional pack-rats!) that will notice, and appreciate anything that will/may help.

[Personally, I keep an up-to-date copy of distfiles and portages around because I have a herd of machines and VMs and I don't want each of them hitting my home external bandwidth all the time.  I also feed a few other folks in the RTP area with files as I get them. I try to be a helpful NetCitizen. :-) ]
Comment 25 G.Wolfe Woodbury 2017-01-26 22:30:37 UTC
Is there any progress or update on thinking of a solution or change from this concept? 

Large directories on any filesystem do create performance problems when being searched, even if the underlying structure is modern.  The problem is not that the older filesystems can't handle many files, but that loading or searching the directory creates performance problems while the system is running.
Comment 26 Michał Górny archtester Gentoo Infrastructure gentoo-dev Security 2017-01-27 14:51:31 UTC
I'm going to do some stats on distfiles, and research how we could split them effectively. But don't expect me to write patches for this.
Comment 27 Michał Górny archtester Gentoo Infrastructure gentoo-dev Security 2017-01-27 16:09:32 UTC
I guess I'm going to say nothing new but:

1. splitting on first letter of filename (the obvious idea) is totally inefficient, with letter 't' beating all competition,

2. splitting on first bits of file SHA256SUM (from Manifest) gives quite even-sized groups,

3. splitting on first bits of filename checksum gives similarly good results.


Results for files currently listed in Manifest files:

a. grouping by first letter of hex SHA256 file checksum:

0	4330	28785356091
1	4270	28303294571
2	4432	16338800513
3	4436	28121072905
4	4394	19148942090
5	4354	19899583492
6	4331	21005434359
7	4461	12475733267
8	4364	23470557236
9	4410	17520663264
a	4216	19048461741
b	4303	26029227065
c	4290	18938965494
d	4277	14847605391
e	4286	25792760424
f	4362	25413179569

b. by first letter of hex SHA256 filename checksum:

0	4299	18789218764
1	4238	18867311226
2	4505	43039037488
3	4307	18516689827
4	4395	13705113020
5	4314	16678197673
6	4351	14726445287
7	4407	29858917031
8	4300	18577810355
9	4384	31248828773
a	4413	17575358359
b	4274	21286490828
c	4344	17897224867
d	4310	20944029518
e	4315	27354703695
f	4360	16074260761

Method a. has stdev = 69.4, b. has stdev = 65.07.


Therefore, I'd say either makes sense.

Using file checksum has the advantage of using data already available in Manifest. However, it means Portage will have to use a temporary directory for initial checksumming and it won't work for mirrors.

Using filename checksum seems like the best portable option, though it will need explicit calculation.

As the example shows, using 4 bits means at most 4200-4500 files per directory. User systems will certainly have fewer (unless somebody fetches distfiles for everything in Gentoo); mirrors will probably have more. With 6 bits, we'd get around 1000, and 8 would mean max 250. I don't know if we need to be going further than that.
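
The figures above can be rechecked directly from the two tables. The sketch below copies the file-count columns and computes the sample standard deviation with Python's statistics module (an assumption about how the quoted stdev values were obtained, though the results agree), plus the average number of files per directory for 4, 6 and 8 bits.

```python
import statistics

# File counts per group, copied from tables a. and b. above.
by_file_checksum = [4330, 4270, 4432, 4436, 4394, 4354, 4331, 4461,
                    4364, 4410, 4216, 4303, 4290, 4277, 4286, 4362]
by_name_checksum = [4299, 4238, 4505, 4307, 4395, 4314, 4351, 4407,
                    4300, 4384, 4413, 4274, 4344, 4310, 4315, 4360]

total = sum(by_file_checksum)  # both tables cover the same set of files

# Sample standard deviation of the per-group file counts.
stdev_a = statistics.stdev(by_file_checksum)
stdev_b = statistics.stdev(by_name_checksum)

# Average files per directory if `bits` bits of the checksum are used.
average_per_dir = {bits: total / 2 ** bits for bits in (4, 6, 8)}

print(total, round(stdev_a, 2), round(stdev_b, 2))
print(average_per_dir)
```

With 69516 files in total, the averages come out at roughly 4345, 1086 and 272 files per directory for 4, 6 and 8 bits respectively.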
Comment 28 Michał Górny archtester Gentoo Infrastructure gentoo-dev Security 2017-01-27 16:10:29 UTC
Oh, forgive me for missing legend to my table. The first column is group key, the second is file count, the third is sum of file sizes in bytes (as a side info, not really used in my considerations).
Comment 29 Robin Johnson archtester Gentoo Infrastructure gentoo-dev Security 2017-01-30 22:53:31 UTC
I like the idea of checksum of filename, because it allows complete pre-computation.

Checksum from the manifest is problematic because not every file in distfiles is in a Manifest (there are whitelisted files).

I'm going to vote for the first 8 bits of $hash(filename) as the prefix length. If we want to continue to use SHA256 on the filename, that's fine, but moving to a newer hash should also be considered (the original GLEP that brought in SHA256 did plan for its replacement with the outcome of the SHA-3 contest).
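
A minimal sketch of the filename-hash prefix voted for here, using SHA256 from Python's hashlib and the first 8 bits (two hex digits). The function name and its defaults are hypothetical. Because only the file name is hashed, the subdirectory can be computed before the file has been fetched.

```python
import hashlib
import posixpath


def filename_hash_subpath(filename, hexdigits=2, algo="sha256"):
    """Return <first hex digits of hash(filename)>/<filename>."""
    digest = hashlib.new(algo, filename.encode("utf-8")).hexdigest()
    return posixpath.join(digest[:hexdigits], filename)


print(filename_hash_subpath("foo-1.0.tar.gz"))
```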
Comment 30 Michał Górny archtester Gentoo Infrastructure gentoo-dev Security 2017-03-18 07:55:58 UTC
Note that if we want to implement this on local systems, we also need to account for lots of pkg_nofetch() ebuilds telling people to put files in $DISTDIR. While this is strictly invalid anyway (since DISTDIR may be a shadow dir there), with subdirectories the user will now have to figure out the correct subdir for each file.

@dev-portage, 'emaint add-distfile' helper? ;-)
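
What such a helper would have to do can be sketched in a few lines. This is purely hypothetical: no 'emaint add-distfile' command is confirmed in this thread, and the layout (two hex digits of the SHA256 of the file name) is only one of the options under discussion.

```python
import hashlib
import os
import shutil


def add_distfile(src, distdir, hexdigits=2):
    """Copy a manually downloaded file into its subdirectory of distdir."""
    name = os.path.basename(src)
    prefix = hashlib.sha256(name.encode("utf-8")).hexdigest()[:hexdigits]
    target_dir = os.path.join(distdir, prefix)
    os.makedirs(target_dir, exist_ok=True)
    # shutil.copy2 returns the path of the newly created file.
    return shutil.copy2(src, os.path.join(target_dir, name))
```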
Comment 31 Ulrich Müller gentoo-dev 2017-03-18 20:39:10 UTC
(In reply to Robin Johnson from comment #29)
> I'm going to vote for the first 8 bits of $hash(filename) as the prefix
> length. If we want to continue to use SHA256 on the filename, that's fine,
> but moving to a newer hash should also be considered (the original GLEP that
> brought in SHA256 did plan for its replacement with the outcome of the
> SHA-3 contest).

Why would you need any cryptographic strength of the hash at all? For the purpose of balancing directory sizes MD5 should be more than good enough, or even simple (and fast) CRCs like the one used in cksum(1).
Comment 32 Michał Górny archtester Gentoo Infrastructure gentoo-dev Security 2017-03-18 21:18:25 UTC
(In reply to Ulrich Müller from comment #31)
> Why would you need any cryptographic strength of the hash at all? For the
> purpose of balancing directory sizes MD5 should be more than good enough, or
> even simple (and fast) CRCs like the one used in cksum(1).

Why would you introduce additional hash methods instead of using the one that is required for Manifests? It's not like using 'simpler' hash has any real benefit here.
Comment 33 Ulrich Müller gentoo-dev 2017-03-19 09:47:43 UTC
(In reply to Michał Górny from comment #32)
> Why would you introduce additional hash methods instead of using the one
> that is required for Manifests? It's not like using 'simpler' hash has any
> real benefit here.

Sure, take what is most convenient, and what isn't likely to go away. My point was only that security is of no concern here.
Comment 34 Robin Johnson archtester Gentoo Infrastructure gentoo-dev Security 2017-09-02 23:46:59 UTC
portage team:
This got derailed before, but can we revive the discussion?

I've had a personal report from a mirror that sync was failing weirdly on ext4, as a result of dir_index:
EXT4-fs warning (device dm-26): ext4_dx_add_entry:2172: inode #5849089: comm rsync: Directory index full!
Distfiles as it goes to mirrors is over 60k files.

They were going to switch to XFS when possible, but it was going to mean shuffling the data around.

Re the hash choice, there is one further piece relating to that that came up: limits to the length of the directory name & limits to the length of the entire path. We do need to keep them shorter rather than longer, so truncating the hash is going to be a must.

We also need a way to convey the format of a given mirror of distfiles, to make cut-over easier. To that end, I propose adding a metadata file for a given hosting location. It should convey: the hash used, what's being hashed, how many bits/characters are being used (eg use SHA512 but only take the first 8 bits/2 characters). The metadata should allow multiple formats to be present (eg with symlinks).

One of the impacts here is that portage needs to be able to treat gentoo mirror distfile URLs differently than upstream URLs: upstream doesn't get a hash injected.
Comment 35 Michał Górny archtester Gentoo Infrastructure gentoo-dev Security 2018-01-26 16:44:42 UTC
(In reply to Robin Johnson from comment #34)
> Re the hash choice, there is one further piece relating to that that came
> up: limits to the length of the directory name & limits to the length of the
> entire path. We do need to keep them shorter rather than longer, so
> truncating the hash is going to be a must.

There is no need to store the complete hash anywhere -- we just need to use the part that's used for splitting, i.e. 1-2 hexdigits.

> We also need a way to convey the format of a given mirror of distfiles, to
> make cut-over easier. To that end, I propose adding a metadata file for a
> given hosting location. It should convey: the hash used, what's being
> hashed, how many bits/characters are being used (eg use SHA512 but only take
> the first 8 bits/2 characters). The metadata should allow multiple formats
> to be present (eg with symlinks).

Do we want it to contain per-mirror information, or just one metadata file that will be propagated to all mirrors unchanged? Do we need the metadata file to be extensible for additional data in the future?
Comment 36 Robin Johnson archtester Gentoo Infrastructure gentoo-dev Security 2018-01-26 19:06:46 UTC
(In reply to Michał Górny from comment #35)
> (In reply to Robin Johnson from comment #34)
> > Re the hash choice, there is one further piece relating to that that came
> > up: limits to the length of the directory name & limits to the length of the
> > entire path. We do need to keep them shorter rather than longer, so
> > truncating the hash is going to be a must.
> There is no need to store the complete hash anywhere -- we just need to use
> the part that's used for splitting, i.e. 1-2 hexdigits.
Publishing the mapping table will help people searching. Think old school FTP listing files.

> > We also need a way to convey the format of a given mirror of distfiles, to
> > make cut-over easier. To that end, I propose adding a metadata file for a
> > given hosting location. It should convey: the hash used, what's being
> > hashed, how many bits/characters are being used (eg use SHA512 but only take
> > the first 8 bits/2 characters). The metadata should allow multiple formats
> > to be present (eg with symlinks).
> 
> Do we want it to contain per-mirror information, or just one metadata file
> that will be propagated to all mirrors unchanged? Do we need the metadata
> file to be extensible for additional data in the future?
It's just one file that propagates to mirrors, but custom mirrors will want tooling to generate their own. Portage could query a mirror and figure out what format is on it.

1. push out a metadata file that says we're on the old flat structure.
2. start a gradual pushout of HARDLINKS from old files to new location
3. when the pushout is done, ADD an entry to the metadata file.
4. Later on, when we decide the flat structure is not supported anymore, drop it from the metadata file, and WATCH for old accesses.
5. drop the old file paths.

Zac noted the hardlinks, and that's what older releng processes used for getting releases to mirrors but not accessible to the public until a release date.
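
How a client could use such a metadata file during the migration steps above can be sketched as follows. The in-memory format shown here is invented for illustration (the thread does not define one). It only carries the items listed earlier: the hash used, what is hashed, and how many bits are taken.

```python
import hashlib

# Hypothetical parsed form of a mirror's metadata file, in order of preference.
LAYOUTS = [
    {"kind": "filename-hash", "algo": "sha512", "bits": 8},
    {"kind": "flat"},
]


def candidate_paths(filename, layouts):
    """Return the relative paths to try on a mirror, one per advertised layout."""
    paths = []
    for layout in layouts:
        if layout["kind"] == "flat":
            paths.append(filename)
        elif layout["kind"] == "filename-hash":
            digest = hashlib.new(layout["algo"], filename.encode("utf-8")).hexdigest()
            paths.append(digest[: layout["bits"] // 4] + "/" + filename)
    return paths


print(candidate_paths("foo-1.0.tar.gz", LAYOUTS))
```

Dropping the "flat" entry from the list corresponds to step 4: clients that honour the metadata stop requesting the old paths.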
Comment 37 Michał Górny archtester Gentoo Infrastructure gentoo-dev Security 2019-10-07 07:48:18 UTC
How deep do we want the new layout to be?

a. 4 bits of checksum: 16 dirs, 4200-4500 files each,

b. 8 bits of checksum, 256 dirs, 250-300 files each.

If b., then do we want 256 subdirs in top-level, or 16x16 in two levels?
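
For comparison, the two variants of option b. differ only in how the first two hex digits of the checksum are turned into directories. The helper names below are hypothetical.

```python
def one_level(digest_hex, filename):
    """256 top-level subdirectories: <2 hex digits>/<filename>."""
    return "/".join([digest_hex[:2], filename])


def two_level(digest_hex, filename):
    """16x16 subdirectories: <hex digit>/<hex digit>/<filename>."""
    return "/".join([digest_hex[0], digest_hex[1], filename])


print(one_level("ab12cd", "foo-1.0.tar.gz"))
print(two_level("ab12cd", "foo-1.0.tar.gz"))
```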
Comment 38 Michał Górny archtester Gentoo Infrastructure gentoo-dev Security 2019-10-07 08:57:43 UTC
There's also the question of hardlinks vs symlinks.  To summarize:

Hardlinks:

1. We don't really know if people are using -H.

1a. FWICS 4 people answered Robin's survey, with two mirrors using -H before and two enabling it after the mail.  If I were to extrapolate from that, I'd have to assume that half of our mirrors don't use -H.

1b. -H is NOT enabled by -a and it's documented as expensive.  That decreases the chance that people actually enabled it.

2. Mirrors that do not enable -H will fetch and store two copies of every file during the transitional period.

3. After the transitional period, removing old files will be cheap.


Symlinks:

1. We CAN know if people are using -l.

1a. I've pushed 'symlink-test' to distfiles-local today.  We can check later how many of the mirrors have fetched it.

1b. -l is enabled by -a, so there's a good chance people have it enabled.

2. Mirrors that do not enable -l will not get the files at all.  This sucks for users but it will help me establish 1a.

3. After the transitional period, symlinks will have to be replaced with real files which implies transferring them all again.
Comment 39 Ulrich Müller gentoo-dev 2019-10-07 09:59:14 UTC
(In reply to Michał Górny from comment #37)
> How deep do we want the new layout to be?
> 
> a. 4 bits of checksum: 16 dirs, 4200-4500 files each,
> 
> b. 8 bits of checksum, 256 dirs, 250-300 files each.

I always found it a good rule of thumb to have about the same number of files in each level. Looks like b. would fulfil that better.

> If b., then do we want 256 subdirs in top-level, or 16x16 in two levels?

By the same argument, it should rather be 10 (= 5 + 5) bits of the checksum, if it's going to be two levels of directories.
Comment 40 Michał Górny archtester Gentoo Infrastructure gentoo-dev Security 2019-10-07 13:17:47 UTC
Only multiples of 4, please, so we can cut off hex checksum.
Comment 41 Michał Górny archtester Gentoo Infrastructure gentoo-dev Security 2019-10-07 14:50:54 UTC
(In reply to Michał Górny from comment #38)
> 1a. I've pushed 'symlink-test' to distfiles-local today.  We can check later
> how many of the mirrors have fetched it.

Update on that: so far out of all mirrors on distfiles.gentoo.org DNS and bouncer, the following do not have that file (yet):

https://mirrors.163.com/gentoo/
  -> timestamp from yesterday
https://lug.mtu.edu/gentoo
  -> the whole directory 404s, so I'll remove it from the list
https://mirror.isoc.org.il/pub/gentoo/
  -> timestamp says it didn't sync yet

So I think we can reasonably assume all mirrors handle symlinks.
Comment 42 Ulrich Müller gentoo-dev 2019-10-07 17:21:09 UTC
(In reply to Michał Górny from comment #40)
> Only multiples of 4, please, so we can cut off hex checksum.

Although I'm sympathetic to this for aesthetic reasons, shouldn't package managers implement the algorithm outlined in https://www.gentoo.org/glep/glep-0075.html#filename-hash-structure?

(And maybe that would be a reason _not_ to use multiples of 4. :-)


(In reply to Michał Górny from comment #37)
> a. 4 bits of checksum: 16 dirs, 4200-4500 files each,

Looking at GLEP 75, I just noticed that this wouldn't fulfil
"the number of files in a single directory should not exceed 1000"
which is listed as first goal in https://www.gentoo.org/glep/glep-0075.html#algorithm-for-splitting-distfiles.
Comment 43 Michał Górny archtester Gentoo Infrastructure gentoo-dev Security 2019-10-07 18:10:17 UTC
(In reply to Ulrich Müller from comment #42)
> (In reply to Michał Górny from comment #40)
> > Only multiples of 4, please, so we can cut off hex checksum.
> 
> Although I'm sympathetic to this for aesthetic reasons, shouldn't package
> managers implement the algorithm outlined in
> https://www.gentoo.org/glep/glep-0075.html#filename-hash-structure?
> 
> (And maybe that would be a reason _not_ to use multiples of 4. :-)

Doing this in Python would be slower than just cutting the hex-string, so we'd probably special case multiples of 4 anyway.  Since 8 bits seem to work fine, I don't really see why we'd make it more complex.
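
A minimal sketch of the "cut the hex string" approach, not the actual Portage code. The function name `filename_hash_path` is hypothetical, and the choice of BLAKE2B over the UTF-8 encoded filename is an assumption; the thread itself only says "checksum":

```python
import hashlib


def filename_hash_path(filename: str, cutoffs=(8,)) -> str:
    """Return '<hex prefix>/.../<filename>' by cutting hex digits off a hash.

    Assumptions: BLAKE2B over the UTF-8 filename; cutoffs are given in bits.
    """
    digest = hashlib.blake2b(filename.encode("utf-8")).hexdigest()
    parts = []
    for bits in cutoffs:
        # One hex digit encodes 4 bits, so other widths cannot be sliced off.
        if bits % 4 != 0:
            raise ValueError("cutoff must be a multiple of 4 bits")
        ndigits = bits // 4
        parts.append(digest[:ndigits])
        digest = digest[ndigits:]
    return "/".join(parts + [filename])


print(filename_hash_path("foo.tar.gz"))          # one level, 8 bits
print(filename_hash_path("foo.tar.gz", (4, 4)))  # two levels, 4 bits each
```

A cutoff such as 5 bits would need integer arithmetic on the raw digest instead of a string slice, which is the extra complexity being argued against here.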

> (In reply to Michał Górny from comment #37)
> > a. 4 bits of checksum: 16 dirs, 4200-4500 files each,
> 
> Looking at GLEP 75, I just noticed that this wouldn't fulfil
> "the number of files in a single directory should not exceed 1000"
> which is listed as first goal in
> https://www.gentoo.org/glep/glep-0075.html#algorithm-for-splitting-distfiles.

Yes, I also like 8 bits better.
Comment 44 Michał Górny archtester Gentoo Infrastructure gentoo-dev Security 2019-10-08 03:35:47 UTC
Mail to discuss symlinks vs hardlinks sent to gentoo-mirrors ml.

FTR, I've come up with one more solution: use symlinks during the transitional period, then slowly replace them with hardlinks during the cleanup period.  This should let us avoid retransferring them, while keeping potential double space usage to a few groups at a time.
Comment 45 Michał Górny archtester Gentoo Infrastructure gentoo-dev Security 2019-10-09 16:02:25 UTC
So, my proposed transition plan would be to:

1. Enable emirrordist to produce new layout on new files via symlinks.

We will start having:

  0/a/foo.tar.gz -> ../../foo.tar.gz

2. Manually (via a cheap script) populate symlinks for existing distfiles, optionally splitting this into a few windows.

3. Give mirrors some time to sync this, then swap the order in layout.conf.  Users will start fetching from new layout, and emirrordist will start reversing symlink direction.

Now we will start having:

  foo.tar.gz -> 0/a/foo.tar.gz

4. Reverse symlink direction.  That is, for a few groups at a time:

4a. replace symlinks from steps 1&2 with hardlinks,

4b. wait some, then replace original files with symlinks.

This will ensure that mirrors with -H enabled will not retransfer the file, and those with -H disabled will duplicate only a few groups at a time.

5. After a long transitional period, we can disable the old layout and remove symlinks.
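
Step 2 of the plan above could look roughly like the following. This is an illustrative sketch, not the script that was actually used: `hashed_subdir` and `populate_symlinks` are hypothetical names, the hash (BLAKE2B over the filename) is an assumption, and a single 8-bit level is used, so the link target is `../foo.tar.gz` instead of the two-level `../../foo.tar.gz` from the example.

```python
import hashlib
import os


def hashed_subdir(filename: str) -> str:
    # Assumption: 8 bits (two hex digits) of BLAKE2B over the filename.
    return hashlib.blake2b(filename.encode("utf-8")).hexdigest()[:2]


def populate_symlinks(distdir: str) -> int:
    """Create XY/<name> -> ../<name> for each regular file at the top level."""
    created = 0
    # Materialize the listing first, since subdirectories are created below.
    for entry in list(os.scandir(distdir)):
        if not entry.is_file(follow_symlinks=False):
            continue  # skip subdirectories and existing symlinks
        subdir = os.path.join(distdir, hashed_subdir(entry.name))
        os.makedirs(subdir, exist_ok=True)
        link = os.path.join(subdir, entry.name)
        if not os.path.lexists(link):
            # A relative target keeps the tree valid wherever a mirror stores it.
            os.symlink(os.path.join("..", entry.name), link)
            created += 1
    return created
```

Running it a second time creates nothing new, so it could be applied in several windows as the plan suggests.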
Comment 46 Ulrich Müller gentoo-dev 2019-10-09 16:20:15 UTC
(In reply to Michał Górny from comment #45)
>   0/a/foo.tar.gz -> ../../foo.tar.gz

What is the rationale for having two levels?
Comment 47 Michał Górny archtester Gentoo Infrastructure gentoo-dev Security 2019-10-09 18:19:47 UTC
(In reply to Ulrich Müller from comment #46)
> (In reply to Michał Górny from comment #45)
> >   0/a/foo.tar.gz -> ../../foo.tar.gz
> 
> What is the rationale for having two levels?

That's just an example.  I was testing with two levels just in case they would cause trouble.

My personal rationale would be keeping immediate directories as small as possible to improve lookup efficiency.
Comment 48 Michał Górny archtester Gentoo Infrastructure gentoo-dev Security 2019-10-12 22:19:24 UTC
Ok, I've done some Infra testing and everything seems good so far.

What I did:

1. Cloned master mirror and got it working with old setup.

2. Cloned slave mirror and rsynced it to master mirror (with minimal changes due to missync).

3. Now, switched emirrordist invocation on master to use '--symlinks --layout-conf ...', with layout.conf specifying filename-hash+flat structure.

Master started fetching new distfiles as:
foo.tar.gz -> XY/foo.tar.gz (symlink)

4. Ran a cheap script that hardlinked all existing distfiles into XY/ subdirs.  It needed around 30 seconds to do that.

5. Synced slave to master, with -H enabled.  It took a few seconds, and transferred around 7 MiB for all distfiles.

6. Ran a cheap script that replaced top-level distfile hardlinks with symlinks.

7. Synced slave again.  Same result -- few seconds, ~7 MiB transfer.

8. Removed top-level symlinks from master.

9. Synced slave.  Few seconds, ~3 MiB transfer.


Therefore, emirrordist (with one extra patch on top) works fine.  Transition scripts are trivial and work fine.  The hybrid approach saves lots of transfer, provided that mirrors are using -H.
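
For illustration, steps 4 and 6 of the test above might be sketched as follows. These are not the actual transition scripts; the function names are hypothetical and the directory hash (8 bits of BLAKE2B over the filename) is an assumption.

```python
import hashlib
import os


def hashed_subdir(filename: str) -> str:
    # Assumption: 8 bits (two hex digits) of BLAKE2B over the filename.
    return hashlib.blake2b(filename.encode("utf-8")).hexdigest()[:2]


def hardlink_into_subdirs(distdir: str) -> None:
    """Step 4: hardlink each top-level regular file into its XY/ subdirectory."""
    for entry in list(os.scandir(distdir)):
        if not entry.is_file(follow_symlinks=False):
            continue  # skip subdirectories and files that are already symlinks
        subdir = os.path.join(distdir, hashed_subdir(entry.name))
        os.makedirs(subdir, exist_ok=True)
        target = os.path.join(subdir, entry.name)
        if not os.path.lexists(target):
            os.link(entry.path, target)


def replace_toplevel_with_symlinks(distdir: str) -> None:
    """Step 6: turn each top-level hardlink into a symlink to XY/<name>."""
    for entry in list(os.scandir(distdir)):
        if not entry.is_file(follow_symlinks=False):
            continue
        rel = os.path.join(hashed_subdir(entry.name), entry.name)
        hashed = os.path.join(distdir, rel)
        # Only drop the top-level name if the hashed copy is the same inode.
        if not (os.path.exists(hashed) and os.path.samefile(entry.path, hashed)):
            continue
        tmp = entry.path + ".tmp-symlink"
        os.symlink(rel, tmp)
        os.replace(tmp, entry.path)  # rename over the old name
```

Because the data is only ever linked and renamed, never copied, a receiving rsync run with -H has no file content to retransfer, which matches the small transfer sizes reported above.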

I believe we're ready to go (up to symlinking all distfiles) as soon as new Portage is released.  Before the hardlink phase, I think we should try to mail all mirror owners again (suggesting -H).
Comment 49 Michał Górny archtester Gentoo Infrastructure gentoo-dev Security 2019-10-17 11:53:50 UTC
The masterdist already has the new layout deployed.  I'm going to update layout.conf to expose this to users later today -- when I can reasonably assume the majority of the mirrors have the new layout.

Right now, all new distfiles are fetched according to the new layout, and symlinked into the old layout.  Past distfiles are symlinked the other way around.

I think we can stop here for some time, and let the transition occur naturally.  As new distfiles are fetched and old ones are removed, we'll eventually be getting closer to the desired final layout.  Then, we can transition the remaining files when that doesn't involve so much hassle.
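
As a rough illustration of how a client could read the layout order from layout.conf: the file contents below are an assumed example in the style of GLEP 75 (a `[structure]` section with numbered entries, most preferred first), not a copy of the deployed file, and `parse_layouts` is a hypothetical helper.

```python
import configparser

# Assumed example content; the real layout.conf on the mirrors may differ.
EXAMPLE_LAYOUT_CONF = """\
[structure]
0=filename-hash BLAKE2B 8
1=flat
"""


def parse_layouts(text: str) -> list:
    """Return the layouts in preference order, each as a tuple of words."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    # Keys are numeric strings; sort numerically so "10" sorts after "2".
    items = sorted(parser["structure"].items(), key=lambda kv: int(kv[0]))
    return [tuple(value.split()) for _, value in items]


for layout in parse_layouts(EXAMPLE_LAYOUT_CONF):
    print(layout)
```

Swapping the two numbered entries is all it would take to change which layout clients try first.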