Bug 583000 - [R_Overlay] Repository URI inaccessible
Summary: [R_Overlay] Repository URI inaccessible
Status: RESOLVED FIXED
Alias: None
Product: Gentoo Infrastructure
Classification: Unclassified
Component: Gentoo Overlays
Hardware: All Linux
Importance: Normal normal
Assignee: Gentoo Infrastructure
URL: https://qa-reports.gentoo.org/output/...
Whiteboard:
Keywords:
Depends on:
Blocks: repository-qa-issues
Reported: 2016-05-14 10:10 UTC by Michał Górny
Modified: 2016-06-30 21:57 UTC
CC List: 3 users

See Also:
Package list:
Runtime testing required: ---


Description Michał Górny 2016-05-14 10:10:45 UTC
Our automated repository checks [1] have detected that the 'R_Overlay'
repository cannot be synced.

The following URIs are listed for the repository:

  [   rsync] rsync://roverlay.dev.gentoo.org/roverlay

Please verify that the server hosting the repository is working
correctly. If the repository has been moved to a new location or removed
altogether, please let us know so that we can update the record accordingly.

We reserve the right to remove the repository if we do not receive any
reply within 2 weeks.
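
For reference, the failure can be reproduced with a plain listing of the advertised module (a minimal check, assuming only a stock rsync client; a connection error here confirms the report):

  $ rsync --list-only rsync://roverlay.dev.gentoo.org/roverlay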

[1]: https://wiki.gentoo.org/wiki/Project:Repository_mirror_and_CI
Comment 1 Michał Górny 2016-05-14 10:13:58 UTC
Note: I'm aware this is related to the VMs that were taken offline to provide a replacement for dipper.
Comment 2 Benda Xu 2016-05-19 03:21:56 UTC
Is there an estimated date to revive roverlay.dev.gentoo.org?
Comment 3 Robin Johnson 2016-05-19 18:16:23 UTC
@heroxbd:
I think there might be just enough space to bring it back right now, but otherwise it's going to have to wait for new hardware, unless you can help trim down the disk usage.

The VM had 3 volumes:
10G /dev/vda system disk
10G /dev/vdb seems entirely unused (was all zeros)
150G /dev/vdc. Ext4, 81GiB used, of which 78GiB is in distfiles/.
(170GiB total)

Questions:
- can we ditch the 10GB empty disk?
- Are all those distfiles really needed? (if you just need to archive them, project hosting is a MUCH better place).
- Ideally, can we squeeze it all down to 80GiB? (or smaller)
Comment 4 Benda Xu 2016-05-20 00:57:07 UTC
Thanks Robin.

(In reply to Robin Johnson from comment #3)
> @heroxbd:
> I think there might be just enough space to bring it back right now, but
> otherwise it's going to have to wait for new hardware, unless you can help
> trim down the disk usage.
> 
> The VM had 3 volumes:
> 10G /dev/vda system disk
> 10G /dev/vdb seems entirely unused (was all zeros)

I couldn't remember what vdb was.  Most likely it is unused.

> 150G /dev/vdc. Ext4, 81GiB used, of which 78GiB is in distfiles/.
> (170GiB total)
> 
> Questions:
> - can we ditch the 10GB empty disk?

I think so.

> - Are all those distfiles really needed? (if you just need to archive them,
> project hosting is a MUCH better place).

At present they are only used for generating manifests of the ebuilds.  In bug 564912 I was requesting more space.

Could you elaborate on project hosting?  I found two links:

https://projects.gentoo.org
https://wiki.gentoo.org/wiki/Project:Infrastructure/Project_File_Hosting

But I did not fully understand them.

> - Ideally, can we squeeze it all down to 80GiB? (or smaller)

Yes, by removing distfiles and shrinking the drive.  I think that's the best we can do for the moment.
Comment 5 Michał Górny 2016-05-20 06:01:24 UTC
You could also provide the usual git hosting for the repository like most of us do. repo-mirror-ci will take care of providing the cache.
Comment 6 Benda Xu 2016-05-20 06:46:39 UTC
(In reply to Michał Górny from comment #5)
> You could also provide the usual git hosting for the repository like most of
> us do. repo-mirror-ci will take care of providing the cache.

Thanks Michał.  Unfortunately, the package metadata of CRAN is stored inside the tarballs.  Generating the overlay requires the tarballs to be available locally.

If at the moment the disk is limited to 80GB, the only way to move forward is to drop all the sci-BIOC packages (which have large tarballs) before new hardware is available.
Comment 7 Robin Johnson 2016-05-20 07:28:20 UTC
(In reply to Benda Xu from comment #6)
> (In reply to Michał Górny from comment #5)
> > You could also provide the usual git hosting for the repository like most
> > of us do. repo-mirror-ci will take care of providing the cache.
> Thanks Michał.  Unfortunately, the package metadata of CRAN is stored inside
> the tarballs.  Generating the overlay requires the tarballs to be available
> locally.
He's talking about making the repo available via git instead of rsync. Rather than generating and publishing only via rsync, you can generate & commit to git (with a push to git.gentoo.org).
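
In practice that can be a short step at the end of each generation run, roughly like this (a sketch; the local overlay path and the remote repository URL are hypothetical):

  # sketch: publish the freshly generated overlay via git
  cd /var/lib/roverlay/overlay        # hypothetical overlay location
  git add -A
  git commit -m "automated roverlay update $(date -u +%Y-%m-%d)"
  git push git@git.gentoo.org:proj/R_overlay.git master   # hypothetical remote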

> If at the moment the disk is limited to 80GB, the only way to move forward
> is to drop all the sci-BIOC packages (which have large tarballs) before new
> hardware is available.

I can see downloading them to generate the data once, but I don't see why you need to retain them on this system.

Download tarball, extract meta, retain meta, move tarball into location for expired/timed removal.
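
A minimal sketch of that cycle (the paths, package name, and 14-day expiry window are illustrative assumptions, not existing roverlay configuration):

  # sketch: fetch one tarball, keep only its metadata, queue the tarball for timed removal
  pkg=foo_1.0.tar.gz                                         # hypothetical package
  wget -q -P /tmp/incoming "https://cran.r-project.org/src/contrib/${pkg}"
  tar -xzf "/tmp/incoming/${pkg}" -C /var/lib/roverlay/meta \
      --wildcards '*/DESCRIPTION'                            # retain only the metadata
  mv "/tmp/incoming/${pkg}" /var/lib/roverlay/expire/
  find /var/lib/roverlay/expire -type f -mtime +14 -delete   # e.g. a daily cron job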

Alternatively, if there is a need to retain the tarball, this is where project distfile hosting/archival would come into play (I apologize for the lack of documentation on it so far; it's not ready for wide use yet, and is mostly intended for archival, not hosting).
Comment 8 Benda Xu 2016-05-20 08:21:35 UTC
(In reply to Robin Johnson from comment #7)
>
> He's talking about making the repo available via git instead of rsync.
> Rather than generating and publishing only via rsync, you can generate &
> commit to git (with a push to git.gentoo.org).

I see.  Then it is a good idea.

> > If at the moment the disk is limited to 80GB, the only way to move forward
> > is to drop all the sci-BIOC packages (which have large tarballs) before new
> > hardware is available.
> 
> I can see downloading them to generate the data once, but I don't see why
> you need to retain them on this system.
>
> Download tarball, extract meta, retain meta, move tarball into location for
> expired/timed removal.

If we retain the latest tarballs, a lot of bandwidth can be saved by rsync.  Otherwise the whole upstream repo will have to be downloaded every time the ebuilds are verified to be up-to-date, because all the metadata are in the tarballs.

> Alternatively, if there is a need to retain the tarball, this is where
> project distfile hosting/archival would come into play (I apologize for the
> lack of documentation on it so far; it's not ready for wide use yet, and is
> mostly intended for archival, not hosting).

I myself don't see the need to host the tarballs.  Denis (calchan), who first created roverlay, has argued that mirroring the tarballs was necessary.  Here I quote his arguments from 2 years ago [1].

I consider generating only the ebuilds on roverlay.dev.gentoo.org and deploying repo-mirror-ci to handle the manifests/cache an efficient approach.  We can use the project distfile hosting/archival when it's ready.  Besides, I am happy to be the first user outside the Infra Team to help test out the mechanism.


1. email: private communication with calchan.  Quoted without permission.

Actually, tarballs *do* disappear, often as soon as they're bumped on
CRAN. This is why we need to mirror them:

 - There will be a gap between the time when the tarball disappears
and when we run the roverlay update, when it will be impossible to
emerge the package even if you stay current.

 - Scientists like to use a specific version of a package throughout
an entire project (which can be years), for example for validations.

When tarballs disappear they're usually moved to CRAN archives, but
that thing is a big mess and I'm not sure we can handle it at least at
the beginning. I would like us to handle it at some point, but even so
this thing is such a mess that it happened to me that I couldn't
(manually) find what I was looking for, or could find something but
couldn't be sure it was the right thing. Our mirror would have no such
problem. If we need bigger hardware at some point, I'll make sure we
get it.
Comment 9 Robin Johnson 2016-05-20 20:27:42 UTC
(In reply to Benda Xu from comment #8)
> (In reply to Robin Johnson from comment #7)
> >
> > He's talking about making the repo available via git vs rsync. Rather than
> > generate and publish only via rsync, you can generate & commit to git (with
> > a push to git.gentoo.org).
> I see.  Then it is a good idea.
Ok, let's separate that for the moment from the rest of the discussion. Please make a new bug for it, and we'll get a repo for you to push commits to (probably with an automated user key for the pushes).

> > > If at the moment the disk is limited to 80GB, the only way to move forward
> > > is to drop all the sci-BIOC packages (which have large tarballs) before new
> > > hardware is available.
> > 
> > I can see downloading them to generate the data once, but I don't see why
> > you need to retain them on this system.
> >
> > Download tarball, extract meta, retain meta, move tarball into location for
> > expired/timed removal.
> 
> If we retain the latest tarballs, a lot of bandwidth can be saved by rsync.
> Otherwise the whole upstream repo will have to be downloaded every time the
> ebuilds are verified to be up-to-date, because all the metadata are in the
> tarballs.
Ugh, so it's really upstream's design that we're stuck with.

- How big is just the live upstream portion (vs. the archive of old tarballs)?
- Does upstream have reliable timestamps, so that you could tell whether something has changed without downloading it again?

> 
> > Alternatively, if there is a need to retain the tarball, this is where
> > project distfile hosting/archival would come into play (I apologize for the
> > lack of documentation on it so far; it's not ready for wide use yet, and is
> > mostly intended for archival, not hosting).
> 
> I myself don't see the need to host the tarballs.  Denis (calchan), who
> first created roverlay, has argued that mirroring the tarballs was
> necessary.  Here I quote his arguments from 2 years ago [1].
> 
> I consider generating only the ebuilds on roverlay.dev.gentoo.org and
> deploying repo-mirror-ci to handle the manifests/cache an efficient
> approach.  We can use the project distfile hosting/archival when it's
> ready.  Besides, I am happy to be the first user outside the Infra Team to
> help test out the mechanism.
From an outside perspective, it looks like a destination similar to distfiles-local: you push files to it, and they end up stored (removal or moving files around is a special case).
Comment 10 Benda Xu 2016-05-21 00:27:29 UTC
(In reply to Robin Johnson from comment #9)
> > 
> > If we retain the latest tarballs, a lot of bandwidth can be saved by rsync.
> > Otherwise the whole upstream repo will have to be downloaded every time the
> > ebuilds are verified to be up-to-date, because all the metadata are in the
> > tarballs.
> Ugh, so it's really upstream's design that we're stuck with.
> 
> - How big is just the live upstream portion (vs. the archive of old tarballs)?

There are 3 big categories; BIOC (Bioconductor) is the biggest (about 80GB).  CRAN and Rforge should be on the order of 1GB (not sure).

> - Does upstream have reliable timestamps, so that you could tell whether
> something has changed without downloading it again?

That should work.  What tools do you suggest to achieve this?  Can rsync save the timestamp/checksum somewhere after the local copy of files is deleted?

> > I myself don't see the need to host the tarballs.  Denis (calchan), who
> > first created roverlay, has argued that mirroring the tarballs was
> > necessary.  Here I quote his arguments from 2 years ago [1].
> > 
> > I consider generating only the ebuilds on roverlay.dev.gentoo.org and
> > deploying repo-mirror-ci to handle the manifests/cache an efficient
> > approach.  We can use the project distfile hosting/archival when it's
> > ready.  Besides, I am happy to be the first user outside the Infra Team
> > to help test out the mechanism.
> From an outside perspective, it looks like a destination similar to
> distfiles-local: you push files to it, and they end up stored (removal or
> moving files around is a special case).

Exactly, an automated distfiles-local.
Comment 11 Robin Johnson 2016-05-22 23:45:32 UTC
(In reply to Benda Xu from comment #10)
> > - How big is just the live upstream portion (vs. the archive of old tarballs)?
> There are 3 big categories; BIOC (Bioconductor) is the biggest (about 80GB).
> CRAN and Rforge should be on the order of 1GB (not sure).
That's still larger than I'd like to keep on the VM. Read on.

> > - Does upstream have reliable timestamps, so that you could tell whether
> > something has changed without downloading it again?
> 
> That should work.  What tools do you suggest to achieve this?  Can rsync
> save the timestamp/checksum somewhere after the local copy of files is
> deleted?
rsync can't do it natively, but it certainly can with some scripting help.
It'll have to be timestamp-only, however; I don't know of any other way to ask the remote rsync to send the checksums (and it could be expensive on their IO system without the checksum-cache patches, which almost nobody uses).

Does upstream publish checksums at all? Ideally, on the VM, you could mirror _just_ the checksums, and be able to detect changes (add/remove/modify) to pull down added/changed distfiles, update the overlay, push the distfile to the archive.
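
A timestamp-only pass could be as simple as diffing two recursive listings between runs (a sketch; the state file names are assumptions):

  # sketch: compare the current listing (size + mtime + path) against the previous run
  rsync -r --list-only master.bioconductor.org::release > listing.new
  diff listing.old listing.new    # any added/removed/changed entry shows up here
  mv listing.new listing.old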

> > From an outside perspective, it looks like a destination similar to
> > distfiles-local: you push files to it, and they end up stored (removal or
> > moving files around is a special case).
> Exactly, an automated distfiles-local.
Ok, that I can give you soon.
Comment 12 Robin Johnson 2016-05-23 00:09:49 UTC
CC Andre, as he can hopefully help with modifying the R overlay software.

The VM is too large, and we'd like to shrink it down, while improving the functionality.

My present understanding is that it:
- mirrors one or more upstream rsync sources from https://www.bioconductor.org/about/mirrors/mirror-how-to/, without deleting distfiles
- updates ebuilds based on the available distfiles
- keeps the old distfiles for reproducibility
- serves up the overlay via rsync.



I'd like it to be more like this:
1. Do NOT keep an entire copy of upstream
2. Do keep a copy of checksums/DIGESTS from upstream.
3. Detect changes between passes of upstream checksums/digests.
4. Based on change:
4.1. Added: add ebuild, push new distfile to archive
4.2. Removed: drop ebuild
4.3. Changed: revbump ebuild, push new distfile to archive
5. The overlay gets pushed to Git automatically on a regular schedule, and is not served via rsync anymore.
6. The distfile archive is hosted outside of the VM.

This of course depends mostly on what we can do with the contents of upstream.

If upstream offers strictly reliable timestamps, then we can use just that to detect changes. Otherwise we need actual digests to detect changes.
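
If digests are available, step 3 reduces to a set comparison between two snapshots, roughly (a sketch; the CHECKSUMS URL and state file names are assumptions about upstream, not a known interface):

  # sketch: classify changes between two digest snapshots (bash)
  wget -qO digests.new https://upstream.example.org/CHECKSUMS
  comm -13 <(sort digests.old) <(sort digests.new)    # added or changed files
  comm -23 <(sort digests.old) <(sort digests.new)    # removed or changed files
  mv digests.new digests.old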
Comment 13 Michał Górny 2016-05-23 06:32:35 UTC
If you really insist on using rsync, I think the -l option could be helpful for timestamps:

$ rsync -r -l master.bioconductor.org::release
[...]
-rw-r--r--        835,017 2015/08/27 20:25:51 bioc/bin/macosx/contrib/3.3/ABAEnrichment_0.99.6.tgz
-rw-r--r--        514,515 2015/08/26 21:26:23 bioc/bin/macosx/contrib/3.3/ABSSeq_1.5.1.tgz
-rw-r--r--        570,601 2015/04/17 21:10:32 bioc/bin/macosx/contrib/3.3/ABarray_1.37.0.tgz
[...]

Though I think this is getting past the point of a shell script, and you'd rather take a look at librsync.
Comment 14 Robin Johnson 2016-05-25 19:40:39 UTC
heroxbd/dywi/calchan:
The VM is available again for the moment, so you can work on the improvements discussed here.
Comment 15 Benda Xu 2016-06-01 02:32:08 UTC
(In reply to Robin Johnson from comment #14)
> heroxbd/dywi/calchan:
> The VM is available again for the moment, so you can work on the
> improvements discussed here.

Many thanks Robin!
Comment 16 Michał Górny 2016-06-30 21:57:20 UTC
The bug seems to be fixed in the repository. Closing.