There were a bunch of metadata.xml files in the tree without the DTD reference. This in turn makes the validation mostly useless (since it only checks wellformedness).
What should repoman check for? I did a grep for the first two lines in all metadata.xml files. The following lines appear in at least one file as one of the first two lines. empty line <!DOCTYPE catmetadata SYSTEM "http://www.gentoo.org/dtd/metadata.dtd"> <!DOCTYPE pkgmetadata SYSTEM "http://www.gentoo.org/dtd/metadata.dtd"> <!DOCTYPE pkgmetadata SYSTEM "http://www.gentoo.org/dtd/metadata.dtd"> <!DOCTYPE pkgmetadata SYSTEM "http://www.gentoo.org/dtd/metadata.dtd"> <?xml version = '1.0' encoding = 'UTF-8'?> <?xml version="1.0" encoding="UTF-8"?> <?xml version="1.0" encoding="UTF-8"?> <?xml version="1.0"encoding="UTF-8"?> <?xml version="1.0" encoding="utf-8"?> <?xml version="1.0"?> <?xml version='1.0' encoding='UTF-8'?> <pkgmetadata> What is valid? In which order? etc.
The first line should be the <?xml one to identify it as an XML file.
I have a patch that uses the following two regular expressions to match the first resp. the second line. metadata_xml_header_re = re.compile("^<\?xml [^>]+>$") metadata_xml_dtd_re = re.compile("^<!DOCTYPE [^ ]+ SYSTEM \"" + metadata_dtd_uri + "\">$") Let me know if you want something changed.
you probably want to match something such that you get - warning for trailing whitespace (just style) - a '<?xml version="1.0" encoding="UTF-8"?>' header, where you can barf at whitespace, lowercase, or use of single quotes (just style) - a '<!DOCTYPE pkgmetadata SYSTEM "http://www.gentoo.org/dtd/metadata.dtd">' line * verify that if pkgmetadata is in there, there is also a <pkgmetadata> tag, and the metadata.xml file is in a package location (2nd dir?) * verify that is catmetadata is in there, there is also a <catmetadata> tag, and the metadata.xml file is in a category location (1st dir?) * verify that the url points to 'http://www.gentoo.org/dtd/metadata.dtd' leave the rest to the xml validator My €0.02 :)
(In reply to comment #4) > you probably want to match something such that you get > - warning for trailing whitespace (just style) I meant leading whitespace before the <?xml thing
And verify that the file is actually UTF-8 encoded. Would be handy for ChangeLog files too, as some devs still don't have UTF-8 fully sorted out.
(In reply to comment #4) > * verify that if pkgmetadata is in there, there is also a <pkgmetadata> > tag, and the metadata.xml file is in a package location (2nd dir?) > * verify that is catmetadata is in there, there is also a <catmetadata> > tag, and the metadata.xml file is in a category location (1st dir?) These particular checks would have been helpful to prevent bug 458054 from happening. The other checks mentioned in comment 4 also have merit, as package metadata.xml's are still inconsistent regarding the first three lines. I used the following command against the tree to check this: for i in */*/metadata.xml; do cat -vETn $i | head -n 3; done | sort -u | less Category metadata.xml's are at least consistent here, but checks to ensure that it stays that way are always a good idea. From a consistency standpoint, I'd suggest requiring that the first three lines match exactly to the following: For category metadata.xml: <?xml version="1.0" encoding="UTF-8"?> <!DOCTYPE catmetadata SYSTEM "http://www.gentoo.org/dtd/metadata.dtd"> <catmetadata> For package metadata.xml: <?xml version="1.0" encoding="UTF-8"?> <!DOCTYPE pkgmetadata SYSTEM "http://www.gentoo.org/dtd/metadata.dtd"> <pkgmetadata> If it doesn't match, throw a warning. Just my thoughts on the subject.
We already use xml.etree.ElementTree.parse() to parse the metadata.xml, so if we could use that to validate it, then that would be ideal.
Here's a doctype check for pkgmetadata: http://git.overlays.gentoo.org/gitweb/?p=proj/portage.git;a=commit;h=02db319dd7a3c5ca3f499e70f4ab922cc3c71717
Here's a check for the xml declaration: http://git.overlays.gentoo.org/gitweb/?p=proj/portage.git;a=commit;h=9b2d0beb3200553380889395e296a6bf961dd0d3
This is fixed in 2.1.11.54 and 2.2.0_alpha165.