580648 – mirror://pypi/ does not work anymore

Status:	RESOLVED FIXED

Alias:	None

Product:	Gentoo Linux
Classification:	Unclassified
Component:	[OLD] Unspecified
Hardware:	All Linux

Importance:	Normal major
Assignee:	Python Gentoo Team

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2016-04-20 15:24 UTC by Matthew Thode ( prometheanfire )
Modified:	2023-09-11 21:57 UTC
CC List:	2 users

See Also:
Package list:
Runtime testing required:	---

Description Matthew Thode ( prometheanfire ) archtester

2016-04-20 15:24:38 UTC

Upstream PyPI changed its URL scheme.

The following is from dstufft's paste:

So, previously PyPI used URLs like :
    /packages/{python version}/{name[0]}/{name}/{filename}

Now it uses:
    /packages/{hash[:2]}/{hash[2:4]}/{hash[4:]}/{filename}
Where hash is blake2b(file_content, digest_size=32).hexdigest().lower()
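
As an illustration of the new scheme, here is a minimal Python sketch that builds the path from a file's contents. The function name pypi_hashed_path is hypothetical and only mirrors the formula quoted above; it is not taken from PyPI's code.

```python
import hashlib


def pypi_hashed_path(file_content: bytes, filename: str) -> str:
    """Build a /packages/... path from file contents (illustrative helper)."""
    # blake2b with a 32-byte digest gives 64 hex characters.
    digest = hashlib.blake2b(file_content, digest_size=32).hexdigest().lower()
    # Split as {hash[:2]}/{hash[2:4]}/{hash[4:]} per the scheme quoted above.
    return f"/packages/{digest[:2]}/{digest[2:4]}/{digest[4:]}/{filename}"
```

Since the path depends on the contents, two uploads with the same filename but different bytes end up at different paths, which is the property the reasons below rely on.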

There are a few reasons for this:

* We generally do not allow people to delete a file and re-upload the same
  version again. However the old layout generally means that we *can't* do
  that even if we wanted to because HTTP clients will use the URL as the key
  for a cache and thus it can never change (other than to be deleted).

* The file system is not transactional and isn't part of the database, which
  means we get put in a funny pickle where we have to decide if we persist the
  change to the file system *prior* to committing the transaction or *after*
  committing. Both ways have their ups and downs and neither solves all of the
  issues. In general, on upload we try to save the file prior to committing
  because once it's been committed downstream users will expect it to exist
  and if we haven't saved the file to disk yet it may not exist yet (and
  if saving fails, it may never exist).

  However, this raises a problem. We're currently using Amazon S3 to save
  files, which is an eventually consistent data store. A brand *new* file is
  (in the S3 region we're using) available immediately after it is written;
  however, when overwriting a file that already existed it can take some time
  for it to become consistent (reportedly up to hours). This leaves us in a
  sticky situation where someone can run this:

      setup.py sdist upload

  And have PyPI accept the upload, write it to S3 and then fail to commit the
  upload. Then when the user re-runs that we'll write the file to S3 again
  (however it will have changed contents because ``setup.py sdist`` is not
  deterministic) and then commit the database, succeeding this time. If this
  happens then in the time period between when the database commits and when
  Amazon S3 finally updates the file to the latest version (possibly taking
  hours) everyone is going to fail downloading/installing that file because
  the hash we're getting from Amazon S3 isn't going to match the hash that we
  have recorded in the PyPI database. To make this even more painful, we
  utilize download caching of the files pretty heavily and to do that we make
  the assumption that the contents at the URL will never change. So not only
  will it be broken in that window before Amazon S3 has become consistent, it
  will be persistently broken for anyone who attempted to install it until
  they go out of their way to delete their cache. By making the URL determined
  by the *contents* of the file, we make it so repeating the same upload with
  different contents will by definition end up with a different URL,
  sidestepping the entire problem.

* When a file gets deleted from PyPI we have to delete it from the backing
  store too because the URL is predictable and people attempt to short circuit
  the Simple Repository API and we want a file deletion to, by default, mean
  that people don't discover that version. However, this flies in the face of
  people who use the simple repository API to resolve a version (or the Web UI)
  who then want to bake the resolved URL into something with the expectation it will not
  change or go away. This change allows us to simply stop deleting files, so
  that if someone bakes a file URL into something it can continue to work into
  perpetuity without people accidentally installing that through simple URL
  building in the end user software.


Now even though the specific location of the file has not been considered part
of our "API" nonetheless people have over time baked in assumptions about that
URL scheme in various things, and obviously this change will break those
things. So then how should someone deal with this change?

Well, the simplest option (though perhaps not the least effort) is to remove
whatever assumptions have been made and replace them with the new URL
structure. This will fix things today, but it may or may not be the case that
tomorrow the URL structure changes again.

Another option is to discover the final URL using a method similar to what pip
does. The protocol is documented in PEP 503, but generally what you need to do
is look at /simple/<name>/ and see what links are available there. That will
tell you all of the files that currently exist for that project.
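
A rough Python sketch of that lookup is shown here. It assumes the /simple/<name>/ page has already been fetched (for example with urllib.request.urlopen) and only covers scanning the anchors for a wanted filename. The names _AnchorCollector and find_file_url are hypothetical, and this is not how pip itself is implemented.

```python
import posixpath
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class _AnchorCollector(HTMLParser):
    """Collect the href of every <a> tag on a simple index page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def find_file_url(index_html, base_url, filename):
    """Return the absolute URL of `filename` listed on the page, or None."""
    parser = _AnchorCollector()
    parser.feed(index_html)
    for href in parser.hrefs:
        # Links may be relative, so resolve them against the page URL.
        absolute = urljoin(base_url, href)
        # Compare only the last path component; any #hash fragment is ignored.
        if posixpath.basename(urlsplit(absolute).path) == filename:
            return absolute
    return None
```

Here base_url would be the URL of the /simple/<name>/ page itself, so that relative links resolve correctly.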

Yet another option is to run a sort of "translator" service that can consume
the PyPI JSON API and will output the URLs in whatever format best suits you.
An example of this is pypi.debian.net (I don't know where its code base is,
but the proof of concept I wrote for it is at
https://github.com/dstufft/pypi-debian). These translators are fairly simple:
they take a URL, pull the project and filename out of it, then use the JSON
API to figure out the "real" URL and simply redirect to that.
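
The core of such a translator might look like the Python sketch below. It is an assumption-laden illustration, not the pypi-debian code: the helper names are hypothetical, fetch_project_json stands in for whatever retrieves the JSON document for a project, and the JSON is assumed to have a "releases" mapping of version to a list of file entries with "filename" and "url" keys.

```python
def split_old_path(path):
    """Pull (name, filename) out of an old-style PyPI path.

    Old layout: /packages/{python version}/{name[0]}/{name}/{filename}
    """
    parts = path.strip("/").split("/")
    if len(parts) != 5 or parts[0] != "packages":
        raise ValueError(f"not an old-style PyPI path: {path!r}")
    _, _python_version, _initial, name, filename = parts
    return name, filename


def resolve_real_url(old_path, fetch_project_json):
    """Map an old-style path to the current file URL, or None if not found.

    `fetch_project_json(name)` is a caller-supplied callable (hypothetical)
    returning the parsed JSON API document for the project.
    """
    name, filename = split_old_path(old_path)
    project = fetch_project_json(name)
    # Assumed shape: {"releases": {version: [{"filename": ..., "url": ...}]}}
    for files in project.get("releases", {}).values():
        for entry in files:
            if entry.get("filename") == filename:
                return entry.get("url")
    return None
```

A real service would wrap resolve_real_url in an HTTP handler and answer with a redirect to the returned URL.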

Reproducible: Always

Steps to Reproduce:
Try to package ansible 2.0.2.0 from PyPI using the mirror://pypi syntax.

Comment 1 Matthew Thode ( prometheanfire ) archtester

2016-04-20 15:26:54 UTC

also, my vote is in favor of doing the lookup at the /simple/${PN} url

Comment 2 Matthew Thode ( prometheanfire ) archtester

2016-04-20 15:58:33 UTC

it looks like the old format might be coming back, maybe at a different domain though

Comment 3 Mike Gilbert gentoo-dev

2016-04-20 18:21:14 UTC

Moving this to the python team.

Comment 4 Mike Gilbert gentoo-dev

2016-05-04 12:09:11 UTC

commit 1ab4c5d0fa5ca3cc38cf33beca4da2aae7d90b8f
Author: Mike Gilbert <floppym@gentoo.org>
Date:   Mon Apr 25 10:57:43 2016 -0400

    thirdpartymirrors: Add https://pypi.io/ for pypi
    
    The URL scheme for the old pypi is changing, but a new version with
    backward-compatible redirects was implemented on pypi.io.
    
    Bug: https://bitbucket.org/pypa/pypi/issues/438

 profiles/thirdpartymirrors | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)