Summary: | repoman should complain if metadata.xml doesn't refer to dtd | ||
---|---|---|---|
Product: | Portage Development | Reporter: | Dirkjan Ochtman (RETIRED) <djc> |
Component: | Repoman | Assignee: | Portage team <dev-portage> |
Status: | RESOLVED FIXED | ||
Severity: | normal | CC: | dwfreed |
Priority: | High | Keywords: | InVCS |
Version: | 2.1 | ||
Hardware: | All | ||
OS: | Linux | ||
See Also: | https://bugs.gentoo.org/show_bug.cgi?id=558646 | ||
Whiteboard: | |||
Package list: | Runtime testing required: | --- | |
Bug Depends on: | |||
Bug Blocks: | 459934 |
Description
Dirkjan Ochtman (RETIRED)
2010-07-13 14:55:10 UTC
What should repoman check for? I did a grep for the first two lines in all metadata.xml files. The following lines appear in at least one file as one of the first two lines. empty line <!DOCTYPE catmetadata SYSTEM "http://www.gentoo.org/dtd/metadata.dtd"> <!DOCTYPE pkgmetadata SYSTEM "http://www.gentoo.org/dtd/metadata.dtd"> <!DOCTYPE pkgmetadata SYSTEM "http://www.gentoo.org/dtd/metadata.dtd"> <!DOCTYPE pkgmetadata SYSTEM "http://www.gentoo.org/dtd/metadata.dtd"> <?xml version = '1.0' encoding = 'UTF-8'?> <?xml version="1.0" encoding="UTF-8"?> <?xml version="1.0" encoding="UTF-8"?> <?xml version="1.0"encoding="UTF-8"?> <?xml version="1.0" encoding="utf-8"?> <?xml version="1.0"?> <?xml version='1.0' encoding='UTF-8'?> <pkgmetadata> What is valid? In which order? etc. The first line should be the <?xml one to identify it as an XML file. I have a patch that uses the following two regular expressions to match the first resp. the second line. metadata_xml_header_re = re.compile("^<\?xml [^>]+>$") metadata_xml_dtd_re = re.compile("^<!DOCTYPE [^ ]+ SYSTEM \"" + metadata_dtd_uri + "\">$") Let me know if you want something changed. you probably want to match something such that you get - warning for trailing whitespace (just style) - a '<?xml version="1.0" encoding="UTF-8"?>' header, where you can barf at whitespace, lowercase, or use of single quotes (just style) - a '<!DOCTYPE pkgmetadata SYSTEM "http://www.gentoo.org/dtd/metadata.dtd">' line * verify that if pkgmetadata is in there, there is also a <pkgmetadata> tag, and the metadata.xml file is in a package location (2nd dir?) * verify that is catmetadata is in there, there is also a <catmetadata> tag, and the metadata.xml file is in a category location (1st dir?) * verify that the url points to 'http://www.gentoo.org/dtd/metadata.dtd' leave the rest to the xml validator My €0.02 :) (In reply to comment #4) > you probably want to match something such that you get > - warning for trailing whitespace (just style) I meant leading whitespace before the <?xml thing And verify that the file is actually UTF-8 encoded. Would be handy for ChangeLog files too, as some devs still don't have UTF-8 fully sorted out. (In reply to comment #4) > * verify that if pkgmetadata is in there, there is also a <pkgmetadata> > tag, and the metadata.xml file is in a package location (2nd dir?) > * verify that is catmetadata is in there, there is also a <catmetadata> > tag, and the metadata.xml file is in a category location (1st dir?) These particular checks would have been helpful to prevent bug 458054 from happening. The other checks mentioned in comment 4 also have merit, as package metadata.xml's are still inconsistent regarding the first three lines. I used the following command against the tree to check this: for i in */*/metadata.xml; do cat -vETn $i | head -n 3; done | sort -u | less Category metadata.xml's are at least consistent here, but checks to ensure that it stays that way are always a good idea. From a consistency standpoint, I'd suggest requiring that the first three lines match exactly to the following: For category metadata.xml: <?xml version="1.0" encoding="UTF-8"?> <!DOCTYPE catmetadata SYSTEM "http://www.gentoo.org/dtd/metadata.dtd"> <catmetadata> For package metadata.xml: <?xml version="1.0" encoding="UTF-8"?> <!DOCTYPE pkgmetadata SYSTEM "http://www.gentoo.org/dtd/metadata.dtd"> <pkgmetadata> If it doesn't match, throw a warning. Just my thoughts on the subject. We already use xml.etree.ElementTree.parse() to parse the metadata.xml, so if we could use that to validate it, then that would be ideal. Here's a doctype check for pkgmetadata: http://git.overlays.gentoo.org/gitweb/?p=proj/portage.git;a=commit;h=02db319dd7a3c5ca3f499e70f4ab922cc3c71717 Here's a check for the xml declaration: http://git.overlays.gentoo.org/gitweb/?p=proj/portage.git;a=commit;h=9b2d0beb3200553380889395e296a6bf961dd0d3 This is fixed in 2.1.11.54 and 2.2.0_alpha165. |