Summary: | app-emacs/nxml-mode-20041004: Regular expression too big in xmlschema.rnc | ||
---|---|---|---|
Product: | Gentoo Linux | Reporter: | Martin von Gagern <Martin.vGagern> |
Component: | Current packages | Assignee: | Emacs project <emacs> |
Status: | RESOLVED FIXED | ||
Severity: | normal | CC: | sgml |
Priority: | High | ||
Version: | 2007.0 | ||
Hardware: | All | ||
OS: | Linux | ||
URL: | http://www.w3.org/TR/xmlschema-1/#element-selector | ||
Whiteboard: | |||
Package list: | Runtime testing required: | --- | |
Attachments: |
patch xmlschema.rnc to specify two pattern parameters
two patterns with fixes to colon and whitespace handling
two patterns, with whitespace handling now in both of them |
Description
Martin von Gagern
2007-08-08 14:27:52 UTC
Created attachment 127406 [details, diff]
patch xmlschema.rnc to specify two pattern parameters
I just found out that rng-xsd-compile processes multiple parameters in a way that ANDs them together, so we can simply specify two such parameters to get the suggested intersection. However, it could be argued that this goes against the spec and thus turns a bug into a feature, which would be a problem if nxml ever decided to follow the spec more closely. I still have found no way to express a hierarchy of simple derived types in RELAX NG. Relying on XSD doesn't seem to be an option either, as I think nxml can't read it. The only right thing to do would probably be to implement a new datatype library and provide Lisp code for parsing it. For now I tend to rather exploit the present ANDing.
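To illustrate what the ANDing amounts to, here is a minimal bash sketch. It is not nxml code: `matches_all` is a hypothetical helper, and the two patterns are made up for the illustration rather than taken from the patch. A value is accepted only if every pattern matches the whole string, which is the intersection of the patterns' languages.

```shell
#!/bin/bash
# Hypothetical helper (not part of nxml): succeed only if the value matches
# every pattern given, modelling the ANDing of several pattern parameters.
matches_all() {
    local value=$1 pattern re
    shift
    for pattern in "$@"; do
        re="^(${pattern})\$"          # anchor so the whole value must match
        [[ $value =~ $re ]] || return 1
    done
    return 0
}

# Illustrative patterns only (assumption, not the ones from the patch):
first='[a-z/]+'                       # lowercase letters and slashes
second='[^/]+(/[^/]+)*'               # non-empty segments separated by single slashes

for value in 'a/b' 'a//b'; do
    if matches_all "$value" "$first" "$second"; then
        echo "accepted: $value"
    else
        echo "rejected: $value"
    fi
done
```

Here `a//b` satisfies the first pattern but not the second, so it is rejected as a whole.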
While trying out my schema I found out two things. I had missed one * in my regular expression, and I realized that \i and \c include the colon : as well. This seems very strange as the regular expressions mention colons as a structural element, so I had assumed the names in between to be without colons. Right now you can have a name like _:.... which would not match \i\c*:\i\c* but does match \i\c* itself and therefore is a valid name. I guess this is more likely a problem with the specs than with getting nxml to work for us here.
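The colon problem can be reproduced with a small bash sketch. The character classes below are ASCII-only stand-ins for \i and \c (an assumption made so that bash's POSIX regexps can be used; the real classes cover far more of Unicode), and `full_match` is a hypothetical helper.

```shell
#!/bin/bash
# ASCII-only approximations (assumption): the real \i and \c include many
# more Unicode characters, but both do include the colon.
i='[A-Za-z_:]'
c='[A-Za-z0-9_:.-]'

# Hypothetical helper: succeed if the whole string matches the regexp.
full_match() {
    local re="^($2)\$"
    [[ $1 =~ $re ]]
}

name='_:....'
if full_match "$name" "${i}${c}*"; then
    echo "$name matches i c*"
fi
if ! full_match "$name" "${i}${c}*:${i}${c}*"; then
    echo "$name does not match i c* : i c*"
fi
```

The only colon in `_:....` is followed by a dot, which is not an initial name character, so the prefixed form fails while the plain form swallows the colon as an ordinary name character.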
We will work on it, but that regexp is quite complex, so give us time.

(In reply to comment #2)
> We will work on it, but that regexp is quite complex

Taken by themselves, they are real brutes. Originally I compared them to the EBNF in the documentation tags of the schema for schemas, but now I've found a better reference: http://www.w3.org/TR/xmlschema-1/#c-selector-xpath This seems to be what the regexp should match.

I'm a bit confused about the whitespace issue. The reference above says whitespace is allowed, whereas the schema for schemas has a regexp that doesn't seem to allow it, and both seem to be normative. It almost looks like a problem with the spec itself. Should this be reported to the working group? We could match the optional whitespace as \s* if we wanted to, making the pattern even longer.

Looking at http://www.w3.org/TR/2006/REC-xml-names-20060816/#NT-NCName I find that the colon is definitely not part of an NCName. Probably NC stands for Non-Colon. We could match an NCName as [^\I:][^\C:]* or as [\i-[:]][\c-[:]]* to give a more restrictive pattern that still agrees with the spec. Either form seems to work with nxml.

> so give us time.

Sure. I'm already using a modified version here, so I'm in no particular hurry to see this in portage, as long as it gets there one day. If I can help in any way, let me know.

It would help if you could sort it out with the working group about the whitespace issues.

(In reply to comment #4)
> It would help if you could sort it out with the working group about the
> whitespace issues.

OK, the regexp really should accept whitespace, this is http://www.w3.org/Bugs/Public/show_bug.cgi?id=2207
Also the NCNames should not match colons, this is http://www.w3.org/Bugs/Public/show_bug.cgi?id=2122

I guess it's worth generating those regexps with something that looks more like the BNF notation.
I tried the following shell script:

    #!/bin/bash -u
    S="\\s*"
    NCName='[^./|:*@]+'
    QName="${NCName}:${NCName}"
    NameTest="(child::${S})?(${NCName}:)?(${NCName}|\\*)"
    Step="${S}(\\.|${NameTest})${S}"
    Path="(${S}\\.${S}//)?${Step}(/${Step})*"
    Selector="${Path}\\|${Path}"
    echo "selector: ${Selector}"
    LastStep="${Step}|${S}(@|attribute::)${S}${NameTest}${S}"
    Path="(${S}\\.${S}//)?(${Step}/)*(${LastStep})"
    Selector="${Path}\\|${Path}"
    echo "field: ${Selector}"

I'm still working on some tool that will compare the resulting regexps to the original ones, to ensure that they are equal except for intended differences.

Any news here?

Sorry, I should have given you an update. Correctly parsing and handling regular expressions, either manually or using parser generator tools, proved surprisingly difficult, so the checking tool I had in mind would have taken more time than I have to spare at the moment. It would have to wait at least a month, and I'm not sure it's worth the effort. I trust my script pretty much. You can either trust me on this, or do some manual checking yourself. In either case, even a possibly (though unlikely) wrong regexp should be better than the definitely non-working regexp currently in place.

Created attachment 130368 [details, diff]
two patterns with fixes to colon and whitespace handling

OK, immediately after I posted my stuff I noticed an error in my script... I should read first and post then, this kind of thing happens too often. :( The error is that there is a starred part in Selector which I missed. In case you want to check the script against the spec, the main reference is still http://www.w3.org/TR/xmlschema-1/#c-selector-xpath. Unfortunately it is not simply a matter of copying the BNF and translating it to shell. There is a bit of prose about axes and abbreviated forms, and there is another set of BNF about the whitespace handling.
I hope this updated script follows the spec:

    #!/bin/bash -u
    S="\\s*"
    NCName='[^./|:*@]+'
    QName="${NCName}:${NCName}"
    NameTest="(child::${S})?(${NCName}:)?(${NCName}|\\*)"
    Step="${S}(\\.|${NameTest})${S}"
    Path="(${S}\\.${S}//)?${Step}(/${Step})*"
    Selector="${Path}(\\|${Path})*"
    echo "selector: ${Selector}"
    LastStep="${Step}|${S}(@|attribute::)${S}${NameTest}${S}"
    Path="(${S}\\.${S}//)?(${Step}/)*(${LastStep})"
    Selector="${Path}(\\|${Path})*"
    echo "field: ${Selector}"

I've included the regexps generated by this script, along with the new, colon-sensitive name checks, into the attached patch.

Thanks for the updated patch. (Please provide patches with a path relative to ${WORKDIR} or ${S}, otherwise we have to redo them.) The "Selector" makes more sense than in the previous one. ;-) I've committed an ebuild with the new patch as -r2. Please test.

Adding the SGML team to CC since they are mentioned in metadata.

Created attachment 130685 [details, diff]
two patterns, with whitespace handling now in both of them
I did some testing by using Google code search to get some real world examples. Not much interesting out there, but I found out I had forgotten to test the whitespace handling, sorry. It didn't work, because there was no whitespace allowed by the first pattern. Now I updated everything, and tested this at least. I placed the shell script in the header of the patch file this time, to have the patch better documented.
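For anyone who wants to spot-check the generated selector regexp by hand, the following is a minimal sketch that rebuilds it from the definitions in the updated script posted earlier in this report and tries it on a few sample strings. Two things are assumptions of the sketch, not part of the patch: `\s` is replaced by the POSIX class `[[:space:]]` so that bash's own regexp engine accepts it, and `is_selector` is a hypothetical helper name.

```shell
#!/bin/bash
# Rebuild the "selector" regexp as in the updated script.
# Assumption: [[:space:]] stands in for \s, which POSIX ERE does not define.
S='[[:space:]]*'
NCName='[^./|:*@]+'
NameTest="(child::${S})?(${NCName}:)?(${NCName}|\\*)"
Step="${S}(\\.|${NameTest})${S}"
Path="(${S}\\.${S}//)?${Step}(/${Step})*"
Selector="${Path}(\\|${Path})*"

# Hypothetical helper: succeed if the whole string matches the selector regexp.
is_selector() {
    local re="^(${Selector})\$"
    [[ $1 =~ $re ]]
}

# A few hand-picked samples; attribute steps belong in fields, not selectors.
for xpath in './/a/b' 'child::a | ns:*' 'a//b' '@id'; do
    if is_selector "$xpath"; then
        echo "accepted: $xpath"
    else
        echo "rejected: $xpath"
    fi
done
```

The first two samples should be accepted and the last two rejected: `//` is only allowed directly after a leading `.`, and `@` is excluded from the name class. This only exercises the script's regexp in isolation; it says nothing about how nxml combines it with the second pattern.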
Updated patch committed, no revbump this time (might do one if this is finally tested and working). Please test again and reopen.

(In reply to comment #11)
> Please test again and reopen.

Any news here? I would like to close it as FIXED at some point.

(In reply to comment #12)
> Any news here? I would like to close it as FIXED at some point.

If no news is good news, then I've got the good news that I haven't found any further errors. So I'd consider it FIXED until someone stumbles upon some more bizarre test cases, or there is some activity on the W3C bugs.

Reopening for proper closing.

(In reply to comment #13)
> So I'd consider it FIXED until someone stumbles upon some more bizarre
> test cases, or there is some activity on the W3C bugs.

FIXED then. Thanks a lot for the patch, and for digging into the matter in the first place.