Bug 612178 - packages.gentoo.org: duplicate package versions
Summary: packages.gentoo.org: duplicate package versions
Status: RESOLVED OBSOLETE
Alias: None
Product: Websites
Classification: Unclassified
Component: Packages
Hardware: All Linux
Importance: Normal normal
Assignee: Gentoo Packages Website
URL: https://packages.gentoo.org/packages/...
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-03-10 07:46 UTC by Ulrich Müller
Modified: 2020-05-16 02:34 UTC

See Also:
Package list:
Runtime testing required: ---


Attachments
Screenshot (pgo.png,54.86 KB, image/png)
2017-03-10 07:46 UTC, Ulrich Müller
openrc package page screenshot (openrc.png,66.78 KB, image/png)
2017-09-07 08:43 UTC, Hadrien Lacour

Description Ulrich Müller gentoo-dev 2017-03-10 07:46:03 UTC
Created attachment 466532 [details]
Screenshot

See URL and attached screenshot. Version 25.2_rc1 appears twice, in the first and the third row.
Comment 1 Joerg Schaible 2017-04-18 09:28:52 UTC
The same currently happens for multiple versions of sys-devel/gcc. It is even worse there: for version 3.4.0-r3, one line marks amd64 and x86 as stable and the other does not. After syncing the tree, I currently have only the unstable version.
Comment 2 Hadrien Lacour 2017-09-07 08:43:02 UTC
Created attachment 492928 [details]
openrc package page screenshot

The duplicates can also be in strange order.
Comment 3 Alice Ferrazzi Gentoo Infrastructure gentoo-dev 2018-02-20 12:29:40 UTC
same for sys-kernel/*
Comment 4 Alec Warner (RETIRED) archtester gentoo-dev Security 2018-02-20 15:12:49 UTC
It's possible this is now worse in the new indexing scheme.

I see two possible means of attack.

1) Quick and dirty. For each index, we key on specific things. For Packages, Versions, etc., it's probably not too difficult because the keys for these items are all obvious (CP, CPV, and so forth). We can just write a rake task that detects duplicate documents and applies some ranking algorithm to keep the 'best' document for each key. It's likely this will result in some loss (particularly of 'event'-like data). The benefit of this approach is that we don't necessarily need to root-cause why the duplicates occur. It's not an ideal solution.

2) Figure out why the duplicates show up upon insert. The basic idea is that each document has an _id field, and repeated inserts to the same _id should overwrite. It's likely that the duplicates exist because the _id field we are using (afaik an md5 hash over some metadata) is changing between inserts. This is a no-no. Assuming we can fix 2, we still need to either run 1 or dump the database and reindex, which takes about an hour. I suspect I prefer the latter ;)
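
Option 1 could look roughly like the following. This is only a sketch under stated assumptions: `Doc` and `partition_duplicates` are hypothetical names, the sample documents are made up, and "most recently updated wins" is just one possible ranking. It is not the actual p.g.o rake task.

```ruby
require 'time'

# Hypothetical stand-in for an indexed Version document; the real model
# carries many more fields.
Doc = Struct.new(:id, :atom, :updated_at)

# Group documents by their natural key (the CPV stored in `atom`). For each
# group, keep the most recently updated document and report the rest as
# duplicates that could be deleted.
def partition_duplicates(docs)
  keep = []
  drop = []
  docs.group_by(&:atom).each_value do |group|
    best = group.max_by { |d| Time.parse(d.updated_at) }
    keep << best
    # Compare by identity: Struct#== compares by value, which is not what we want.
    drop.concat(group.reject { |d| d.equal?(best) })
  end
  [keep, drop]
end

# Made-up sample data: two documents for the same CPV, one for another CPV.
docs = [
  Doc.new('doc-1', 'sys-devel/gcc-3.4.0-r3', '2017-04-17 00:00:00 UTC'),
  Doc.new('doc-2', 'sys-devel/gcc-3.4.0-r3', '2017-04-18 00:00:00 UTC'),
  Doc.new('doc-3', 'sys-devel/gcc-4.9.4', '2017-04-18 00:00:00 UTC')
]
keep, drop = partition_duplicates(docs)
puts "keep: #{keep.map(&:id).join(', ')}"
puts "drop: #{drop.map(&:id).join(', ')}"
```

Whatever ranking is chosen, the dropped documents are where the loss mentioned above would come from, so the `drop` list is worth reviewing before anything is deleted.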
Comment 5 Alec Warner (RETIRED) archtester gentoo-dev Security 2018-02-20 15:26:23 UTC
Let's focus on duplicate versions first:

We will set aside the p.g.o code that detects whether a given CPV is updated and instead focus on how the _id is computed: if a given CPV is updated, the new document should overwrite the old one, not add a duplicate.

    # Imports data from an ebuild model and saves the object
    #
    # @param [Portage::Repository::Ebuild] ebuild_model
    def import!(ebuild_model, parent_package, options)
      self.version = ebuild_model.version
      self.atom = ebuild_model.to_cpv
      self.package = parent_package.atom

      raw_slot = nil
      raw_subslot = nil
      raw_slot, raw_subslot = ebuild_model.metadata[:slot].split '/' if ebuild_model.metadata[:slot]
      self.slot = raw_slot || ''
      self.subslot = raw_subslot || ''

      old_keywords = keywords
      self.keywords = ebuild_model.metadata[:keywords] || []
      self.use = strip_useflag_defaults(ebuild_model.metadata[:iuse] || []).uniq
      self.restrict = ebuild_model.metadata[:restrict] || []
      self.properties = ebuild_model.metadata[:properties] || []
      self.masks = Portage::Util::Masks.for(ebuild_model)
      self.metadata_hash = ebuild_model.metadata_hash

      save()

We don't see much code in here about setting _id. I assume in the previous indexing scheme there was something, but not in the multi-index we are using.

To pick a random gentoo-sources version:

irb(main):023:0> Version.find_all_by(:package, 'sys-kernel/gentoo-sources')[94]
=> #<Version {created_at: 2018-02-19 00:25:48 UTC, updated_at: 2018-02-19 00:25:48 UTC, version: "4.4.115", package: "sys-kernel/gentoo-sources", atom: "sys-kernel/gentoo-sources-4.4.115", sort_key: 35, slot: "4.4.115", subslot: "", eapi: nil, keywords: ["~alpha", "~amd64", "~arm", "~arm64", "~hppa", "~ia64", "~mips", "~ppc", "~ppc64", "~s390", "~sh", "~sparc", "~x86"], masks: [], use: ["experimental", "symlink", "build"], restrict: ["binchecks", "strip"], properties: [], metadata_hash: "be3111e6a241b26578a1c4499ed83258", id: "SMxzq2EBjknS_DU0zQP9"}>
irb(main):024:0> Version.find_all_by(:package, 'sys-kernel/gentoo-sources')[9]
=> #<Version {created_at: 2018-02-18 00:28:56 UTC, updated_at: 2018-02-18 00:28:56 UTC, version: "4.4.115", package: "sys-kernel/gentoo-sources", atom: "sys-kernel/gentoo-sources-4.4.115", sort_key: 35, slot: "4.4.115", subslot: "", eapi: nil, keywords: ["~alpha", "~amd64", "~arm", "~arm64", "~hppa", "~ia64", "~mips", "~ppc", "~ppc64", "~s390", "~sh", "~sparc", "~x86"], masks: [], use: ["experimental", "symlink", "build"], restrict: ["binchecks", "strip"], properties: [], metadata_hash: "fc773025e69496a947b5fdcc401e0eb9", id: "h8lQpmEBjknS_DU0Th-J"}>

So here is evidence that we have the same CPV inserted twice, about a day apart (the created_at timestamps are three minutes short of 24 hours apart), with a different _id each time; the later insert should have replaced the earlier one.
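
A minimal sketch of the behaviour we want, assuming the id can be derived from the CPV alone. `version_doc_id` and the in-memory `index` hash are hypothetical stand-ins, not p.g.o code or its storage API; the point is only that a key which does not change between imports makes the second write an overwrite.

```ruby
require 'digest'

# Hypothetical helper: derive the document id from the CPV only, so that
# re-importing the same version always targets the same id.
def version_doc_id(cpv)
  Digest::MD5.hexdigest(cpv)
end

# In-memory stand-in for an index: writing to an existing id overwrites.
index = {}

# The two documents from the irb session above, reduced to the relevant fields.
[
  { atom: 'sys-kernel/gentoo-sources-4.4.115',
    metadata_hash: 'fc773025e69496a947b5fdcc401e0eb9' },
  { atom: 'sys-kernel/gentoo-sources-4.4.115',
    metadata_hash: 'be3111e6a241b26578a1c4499ed83258' }
].each do |doc|
  index[version_doc_id(doc[:atom])] = doc
end

puts index.size
puts index.values.first[:metadata_hash]
```

Keying on the CPV rather than on metadata_hash is a deliberate choice in this sketch: the two documents above have different metadata_hash values, so an id built from that field would still produce two entries.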
Comment 6 Alec Warner (RETIRED) archtester gentoo-dev Security 2018-02-20 15:41:02 UTC
(In reply to Alec Warner from comment #5)

Another challenge is that ideally you could just set the id field to the hash of the metadata, but metadata_hash is different for these two version documents. It's supposed to be an MD5 sum, and I didn't think the time metadata was included, but more research is needed here.
Comment 7 Alec Warner (RETIRED) archtester gentoo-dev Security 2018-03-02 02:12:57 UTC
*** Bug 605856 has been marked as a duplicate of this bug. ***
Comment 8 Alec Warner (RETIRED) archtester gentoo-dev Security 2018-03-02 02:19:20 UTC
You should see less of this in production now as a major bug was resolved.

-A
Comment 9 Alec Warner (RETIRED) archtester gentoo-dev Security 2020-05-16 02:34:54 UTC
We rewrote the entire application.