Created attachment 466532
Screenshot

See URL and attached screenshot. Version 25.2_rc1 appears twice, in the first and the third row.
The same currently happens for sys-devel/gcc, for multiple versions. It is even worse there: for version 3.4.0-r3, one line declares amd64 and x86 as stable and the other line does not. After synchronizing the tree, I currently have only the unstable version.
Created attachment 492928
openrc package page screenshot

The duplicates can also appear in a strange order.
same for sys-kernel/*
It's possible this is now worse in the new indexing scheme. I see two possible means of attack.

1) Quick-and-dirty. For each index, we key on specific things. For Packages, Versions, etc. it's probably not too difficult, because the keys for these items are all obvious (CP, CPV, and so forth). We can just write a rake task that detects duplicate documents and applies some ranking algorithm to keep the 'best' document for each key. It's likely this will result in some loss (particularly of 'event'-like data). The benefit of this approach is that we don't necessarily need to root-cause why the duplicates occur. It's not an ideal solution.

2) Figure out why the duplicates show up upon insert. The basic idea is that each document has an _id field, and repeated inserts to the same _id should overwrite. It's likely that the duplicates exist because the _id field we are using (afaik an md5 hash over some metadata) is changing between inserts. This is a no-no.

Assuming we can fix 2, we still need to either run 1, or dump the database and reindex, which takes about an hour. I suspect I prefer the latter ;)
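For what it's worth, the detection half of option 1 could look roughly like the sketch below. It works on plain hashes rather than the real index, and the ranking rule (newest updated_at wins) is only an assumption; the method name stale_duplicates and the sample documents are made up for illustration.

```ruby
# Hypothetical stand-ins for Version documents; ids, atoms and dates are made up.
docs = [
  { id: 'doc-1', atom: 'cat/pkg-1.0', updated_at: Time.utc(2018, 2, 18) },
  { id: 'doc-2', atom: 'cat/pkg-1.0', updated_at: Time.utc(2018, 2, 19) },
  { id: 'doc-3', atom: 'cat/pkg-2.0', updated_at: Time.utc(2018, 2, 19) }
]

# Groups documents by their natural key (the CPV for Versions) and returns
# every document except the "best" one in each group.
# Assumed ranking: the most recently updated document is the one to keep.
def stale_duplicates(docs, key: :atom)
  docs.group_by { |doc| doc[key] }.flat_map do |_, group|
    next [] if group.size < 2

    keep = group.max_by { |doc| doc[:updated_at] }
    group.reject { |doc| doc.equal?(keep) }
  end
end

# A rake task would delete these from the index instead of printing them.
p stale_duplicates(docs).map { |doc| doc[:id] } # prints ["doc-1"]
```

Anything attached only to the discarded document (such as 'event'-like data) would be lost with this rule, which is the loss mentioned above.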
Let's focus on duplicate versions first.

We will avoid focusing on the p.g.o code that detects whether a given CPV is updated, and will instead focus on how the _id is computed; if a given CPV is updated, it should overwrite and not add a duplicate.

  # Imports data from an ebuild model and saves the object
  #
  # @param [Portage::Repository::Ebuild] ebuild_model
  def import!(ebuild_model, parent_package, options)
    self.version = ebuild_model.version
    self.atom = ebuild_model.to_cpv
    self.package = parent_package.atom

    raw_slot = nil
    raw_subslot = nil
    raw_slot, raw_subslot = ebuild_model.metadata[:slot].split '/' if ebuild_model.metadata[:slot]
    self.slot = raw_slot || ''
    self.subslot = raw_subslot || ''

    old_keywords = keywords
    self.keywords = ebuild_model.metadata[:keywords] || []
    self.use = strip_useflag_defaults(ebuild_model.metadata[:iuse] || []).uniq
    self.restrict = ebuild_model.metadata[:restrict] || []
    self.properties = ebuild_model.metadata[:properties] || []
    self.masks = Portage::Util::Masks.for(ebuild_model)
    self.metadata_hash = ebuild_model.metadata_hash

    save()

We don't see much code in here about setting _id. I assume there was something in the previous indexing scheme, but not in the multi-index we are using now.
To pick a random gentoo-sources version:

irb(main):023:0> Version.find_all_by(:package, 'sys-kernel/gentoo-sources')[94]
=> #<Version {created_at: 2018-02-19 00:25:48 UTC, updated_at: 2018-02-19 00:25:48 UTC,
   version: "4.4.115", package: "sys-kernel/gentoo-sources",
   atom: "sys-kernel/gentoo-sources-4.4.115", sort_key: 35, slot: "4.4.115",
   subslot: "", eapi: nil,
   keywords: ["~alpha", "~amd64", "~arm", "~arm64", "~hppa", "~ia64", "~mips",
              "~ppc", "~ppc64", "~s390", "~sh", "~sparc", "~x86"],
   masks: [], use: ["experimental", "symlink", "build"],
   restrict: ["binchecks", "strip"], properties: [],
   metadata_hash: "be3111e6a241b26578a1c4499ed83258", id: "SMxzq2EBjknS_DU0zQP9"}>
irb(main):024:0> Version.find_all_by(:package, 'sys-kernel/gentoo-sources')[9]
=> #<Version {created_at: 2018-02-18 00:28:56 UTC, updated_at: 2018-02-18 00:28:56 UTC,
   version: "4.4.115", package: "sys-kernel/gentoo-sources",
   atom: "sys-kernel/gentoo-sources-4.4.115", sort_key: 35, slot: "4.4.115",
   subslot: "", eapi: nil,
   keywords: ["~alpha", "~amd64", "~arm", "~arm64", "~hppa", "~ia64", "~mips",
              "~ppc", "~ppc64", "~s390", "~sh", "~sparc", "~x86"],
   masks: [], use: ["experimental", "symlink", "build"],
   restrict: ["binchecks", "strip"], properties: [],
   metadata_hash: "fc773025e69496a947b5fdcc401e0eb9", id: "h8lQpmEBjknS_DU0Th-J"}>

So here is evidence that we have the same CPV inserted twice, about a day apart (2018-02-18 00:28:56 UTC and 2018-02-19 00:25:48 UTC, three minutes short of 24 hours), with a different _id each time; the later insert should have replaced the earlier one.
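To illustrate the overwrite behaviour we want, here is a minimal sketch in which the document id is derived from the CPV alone, so a re-import of the same CPV lands on the same id. A plain Hash stands in for the index, and version_document_id and upsert are hypothetical helpers, not existing p.g.o code; whether an MD5 of the atom is the right id is an assumption.

```ruby
require 'digest/md5'

# Hypothetical helper: the id depends only on the CPV, never on
# timestamps or on metadata that may change between imports.
def version_document_id(atom)
  Digest::MD5.hexdigest(atom)
end

# A Hash keyed by id stands in for the index: writing to an existing id
# replaces the stored document instead of adding a second one.
def upsert(index, doc)
  index[version_document_id(doc[:atom])] = doc
end

index = {}
# The two imports from the irb session above, reduced to the relevant fields.
upsert(index, { atom: 'sys-kernel/gentoo-sources-4.4.115',
                metadata_hash: 'fc773025e69496a947b5fdcc401e0eb9' })
upsert(index, { atom: 'sys-kernel/gentoo-sources-4.4.115',
                metadata_hash: 'be3111e6a241b26578a1c4499ed83258' })

puts index.size # prints 1: the second import replaced the first
```

The point is only that the id must be a pure function of the key; the two ids shown above do not look like they are derived from the CPV at all.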
(In reply to Alec Warner from comment #5)

Another challenge is that, ideally, you could just set the id field to the hash of the metadata; but metadata_hash is different for these two Version documents. It's supposed to be an MD5 sum, and I didn't think the time metadata was included, but more research is needed here.
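As a sketch of what a stable hash would need to look like: compute the MD5 only over fields that change when the ebuild changes, in a fixed order, and leave out anything time-like. I have not confirmed how metadata_hash is actually computed; stable_metadata_hash and the list of volatile keys below are assumptions for illustration only.

```ruby
require 'digest/md5'

# Hypothetical: MD5 over a canonical rendering of the metadata, skipping
# fields assumed to vary between imports of an unchanged ebuild.
def stable_metadata_hash(metadata, volatile_keys: %i[created_at updated_at mtime])
  stable = metadata.reject { |key, _| volatile_keys.include?(key) }
  # Sort by key so that hash ordering cannot influence the digest.
  canonical = stable.sort_by { |key, _| key.to_s }
                    .map { |key, value| "#{key}=#{Array(value).join(' ')}" }
                    .join("\n")
  Digest::MD5.hexdigest(canonical)
end

first  = { slot: '4.4.115', keywords: ['~amd64', '~x86'], mtime: 1_518_913_736 }
second = { slot: '4.4.115', keywords: ['~amd64', '~x86'], mtime: 1_518_999_948 }

# Same ebuild content, different mtime: the digests match.
puts stable_metadata_hash(first) == stable_metadata_hash(second) # prints true
```

If the real hash turns out to include something volatile, that alone would explain two different metadata_hash values for the same CPV.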
*** Bug 605856 has been marked as a duplicate of this bug. ***
You should see less of this in production now as a major bug was resolved. -A
We rewrote the entire application.