Bug 188114

Summary:	app-emacs/nxml-mode-20041004: Regular expression too big in xmlschema.rnc
Product:	Gentoo Linux	Reporter:	Martin von Gagern <Martin.vGagern>
Component:	Current packages	Assignee:	Emacs project <emacs>
Status:	RESOLVED FIXED
Severity:	normal	CC:	sgml
Priority:	High
Version:	2007.0
Hardware:	All
OS:	Linux
URL:	http://www.w3.org/TR/xmlschema-1/#element-selector
Whiteboard:
Package list:		Runtime testing required:	---
Attachments:	patch xmlschema.rnc to specify two pattern parameters; two patterns with fixes to colon and whitespace handling; two patterns, with whitespace handling now in both of them

Description Martin von Gagern 2007-08-08 14:27:52 UTC

I'm trying to validate an xsd with uniqueness constraints using nxml-mode.

First I encountered bug 188112, but after fixing that, I got this error message instead:

Internal error in rng-validate-mode triggered at buffer position 1805. Invalid regexp: "Regular expression too big"

The problem is the regexps for field and selector, which are in fact over 50000 chars long when translated. The syntax has a lot of "REGEXP ( DELIMITER REGEXP )*" structure, which repeats REGEXP every time. Then there is a lot of "\i\c*" to express names. Each of these escapes expands to a massively huge character class, which is where all those chars come from. Those regular expressions occur in xmlschema.rnc, which comes from nxml-mode-20040910-xmlschema.patch.gz, which is not mentioned in ChangeLog but seems to originate from bug 65836.

It should be possible to express all of this as the intersection of two regular expressions. These are the elements of the regexp. One could be this:
[./|:*@]*(\i\c*[./|:*@]+)(\i\c*)?
The other could be the original xpath regular expressions with every occurrence of "\i\c*" replaced by "[^./|:*@]+".
The idea is to have one regular expression denoting any sequence of alternating names and symbol groups, and the other to identify names simply as non-symbols. The first would be short enough as there are only four of those massively expanding escapes, and the second because it has no such escapes at all.
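As a rough illustration of the intersection idea, the sketch below accepts a string only if it matches both patterns. It is not the actual patch: bash's regex engine has no \i or \c, so the ASCII class in NAME is a stand-in assumption, the second pattern is a heavily simplified path grammar, and the variable and function names are hypothetical.

```shell
#!/bin/bash -u
# Stand-in for \i\c*: ASCII-only and colon-free (an assumption, not the XSD definition).
NAME='[A-Za-z_][A-Za-z0-9_-]*'
SYM='[./|:*@]'
# Pattern 1: alternating names and symbol groups; the only place NAME is needed.
alternating="^${SYM}*(${NAME}${SYM}+)*(${NAME})?\$"
# Pattern 2: simplified path structure where a "name" is just a run of non-symbols.
SEG='[^./|:*@]+'
structure="^([.]//)?${SEG}(/${SEG})*\$"

# Succeeds only if the argument matches both patterns (the intersection).
matches_both() {
    [[ $1 =~ $alternating && $1 =~ $structure ]]
}
```

Here "a/b" passes both patterns, "a//b" passes the first but fails the structural one, and "1a/b" passes the structural one but fails the name check, so only "a/b" is accepted.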

Seems like these pattern restrictions are not really part of the RELAX NG but rather inherited from XSD. And the XSD specs say that pattern restrictions from different levels of derivation are ANDed together. So it should be possible to define a type for the general alternating sequence, and then to use this type instead of xsd:token as the base type for the xpath attributes.

However, I don't know enough RELAX NG (yet) to express this in compact syntax, and I'm not sure whether nxml-mode would follow the specs in this respect. Anyone here with RELAX NG knowledge willing to help me out?

References:
http://www.w3.org/TR/xmlschema-1/#element-selector <selector xpath="..."/>
http://www.w3.org/TR/xmlschema-1/#element-field <field xpath="..."/>
http://www.w3.org/TR/xmlschema-1/#normative-schemaSchema source of the regexps
http://www.w3.org/TR/xmlschema-2/#element-pattern pattern restriction
http://www.w3.org/TR/xmlschema-2/#dt-regex regexp syntax
http://www.w3.org/TR/xmlschema-2/#nt-MultiCharEsc \i and \c explained
http://www.w3.org/TR/xmlschema-2/#src-multiple-patterns ANDing of patterns
http://www.w3.org/TR/2000/WD-xml-2e-20000814#NT-CombiningChar character classes

Comment 1 Martin von Gagern 2007-08-10 00:01:48 UTC

Created attachment 127406 [details, diff]
patch xmlschema.rnc to specify two pattern parameters

I just found out that rng-xsd-compile processes multiple parameters in a way that ANDs them together, so we can simply specify two such parameters to get the suggested intersection. However, it could probably be argued that this is against the specs and thus turning a bug into a feature, which would be a problem if nxml ever decided to follow the spec more closely. I still have found no way to express a hierarchy of simple derived types in RELAX NG. Relying on xsd doesn't seem to be an option either, as I think nxml can't read them. The only right thing to do would probably be to implement a new data type library and provide lisp code for parsing it. Right now I tend to rather exploit the present ANDing.

While trying out my schema I found out two things. I had missed one * in my regular expression, and I realized that \i and \c include the colon : as well. This seems very strange as the regular expressions mention colons as a structural element, so I had assumed the names in between to be without colons. Right now you can have a name like _:.... which would not match \i\c*:\i\c* but does match \i\c* itself and therefore is a valid name. I guess this is more likely a problem with the specs than with getting nxml to work for us here.

Comment 2 Christian Faulhammer (RETIRED) gentoo-dev 2007-08-15 07:45:03 UTC

We will work on it, but that regexp is quite complex, so give us time.

Comment 3 Martin von Gagern 2007-08-15 10:28:41 UTC

(In reply to comment #2)
> We will work on it, but that regexp is quite complex

Taken by themselves, they are real brutes. Originally I've compared them to the EBNF in the documentation tags of the schema for schemas, but now I've found a better reference: http://www.w3.org/TR/xmlschema-1/#c-selector-xpath
This seems to be what the regexp should match.

I'm a bit confused about the whitespace issue. The reference stated above says whitespace is allowed, whereas the schema for schemas has a regexp that doesn't seem to allow it, and both seem to be normative. Almost looks like a problem with the spec itself. Should this be reported to the working group?
We could match the optional whitespace as \s* if we wanted to, making the pattern even longer.

Looking at http://www.w3.org/TR/2006/REC-xml-names-20060816/#NT-NCName I find that the colon is definitely not part of an NCName. Probably NC stands for Non-Colon. We could match an NCName as [^\I:][^\C:]* or as [\i-[:]][\c-[:]]* to give a more restrictive pattern that still agrees with the spec. Either form seems to work with nxml.
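To make the colon issue concrete, here is a small sketch contrasting a name class that includes the colon with one that excludes it. The ASCII ranges are only an assumption standing in for \i and \c, which cover far more of Unicode, and the function names are hypothetical.

```shell
#!/bin/bash -u
# ASCII-only approximations (assumptions), anchored because bash does not anchor implicitly.
name_with_colon='^[A-Za-z_:][A-Za-z0-9_.:-]*$'   # roughly \i\c*, colon allowed
ncname='^[A-Za-z_][A-Za-z0-9_.-]*$'              # roughly [\i-[:]][\c-[:]]*, colon removed

is_name()   { [[ $1 =~ $name_with_colon ]]; }
is_ncname() { [[ $1 =~ $ncname ]]; }
```

With these definitions, `is_name '_:foo'` succeeds while `is_ncname '_:foo'` fails, which is the difference the more restrictive pattern is meant to capture.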

> so give us time.

Sure. I'm already using a modified version here, so I'm in no particular hurry to see this in portage. As long as it will be one day. If I can help in any way, let me know.

Comment 4 Christian Faulhammer (RETIRED) gentoo-dev 2007-08-15 10:51:56 UTC

It would help if you could sort it out with the working group about the whitespace issues.

Comment 5 Martin von Gagern 2007-08-15 12:56:56 UTC

(In reply to comment #4)
> It would help if you could sort it out with the working group about the
> whitespace issues.

OK, the regexp really should accept whitespace, this is
http://www.w3.org/Bugs/Public/show_bug.cgi?id=2207

Also the NCNames should not match colons, this is
http://www.w3.org/Bugs/Public/show_bug.cgi?id=2122

I guess it's worth generating those regexps with something that looks more like the BNF notation. I tried the following shell script:

#!/bin/bash -u
S="\\s*"
NCName='[^./|:*@]+'
QName="${NCName}:${NCName}"
NameTest="(child::${S})?(${NCName}:)?(${NCName}|\\*)"
Step="${S}(\\.|${NameTest})${S}"
Path="(${S}\\.${S}//)?${Step}(/${Step})*"
Selector="${Path}\\|${Path}"
echo "selector: ${Selector}"
LastStep="${Step}|${S}(@|attribute::)${S}${NameTest}${S}"
Path="(${S}\\.${S}//)?(${Step}/)*(${LastStep})"
Selector="${Path}\\|${Path}"
echo "field: ${Selector}"

I'm still working on some tool that will compare the resulting regexps to the original ones, to ensure that they are equal except for intended differences.

Comment 6 Ulrich Müller gentoo-dev 2007-09-08 21:25:36 UTC

Any news here?

Comment 7 Martin von Gagern 2007-09-08 22:06:53 UTC

Sorry, I should have given you an update.

Correctly parsing and handling regular expressions either manually or using parser generator tools proved surprisingly difficult, so the checking tool I had in mind would have taken more time than I have to spare at the moment. It would have to wait at least a month, but I'm not sure if it's worth the effort.

I trust my script pretty much. You can either trust me on this, or do some manual checking yourself. In either case, even having a possibly (though unlikely) wrong regexp should be better than the definitely not working regexp currently in place.

Comment 8 Martin von Gagern 2007-09-08 22:37:03 UTC

Created attachment 130368 [details, diff]
two patterns with fixes to colon and whitespace handling

OK, immediately after I posted my stuff I noticed an error in my script...
I should read first and post then, this kind of thing happens too often. :(
The error is that there is a starred part in Selector which I missed.

In case you want to check the script against the spec, the main reference is still http://www.w3.org/TR/xmlschema-1/#c-selector-xpath. Unfortunately it is not simply a matter of copying the BNF and translating it to shell. There is a bit of prose about axes and abbreviated forms, and there is another set of BNF about the whitespace handling. I hope this updated script follows the spec:

#!/bin/bash -u
S="\\s*"
NCName='[^./|:*@]+'
QName="${NCName}:${NCName}"
NameTest="(child::${S})?(${NCName}:)?(${NCName}|\\*)"
Step="${S}(\\.|${NameTest})${S}"
Path="(${S}\\.${S}//)?${Step}(/${Step})*"
Selector="${Path}(\\|${Path})*"
echo "selector: ${Selector}"
LastStep="${Step}|${S}(@|attribute::)${S}${NameTest}${S}"
Path="(${S}\\.${S}//)?(${Step}/)*(${LastStep})"
Selector="${Path}(\\|${Path})*"
echo "field: ${Selector}"

I've included the regexps generated by this script, along with the new, colon-sensitive name checks, into the attached patch.
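For anyone wanting to spot-check the selector pattern by hand, the following sketch runs an adapted copy of it against sample strings in bash. The adaptations are assumptions made for bash's POSIX ERE engine, not part of the patch: \s is replaced by [[:space:]], backslash escapes are written as bracket expressions, explicit ^ and $ anchors are added since XSD patterns are implicitly anchored, and is_selector is a hypothetical helper name.

```shell
#!/bin/bash -u
# Adapted from the selector part of the script above (see the listed assumptions).
S='[[:space:]]*'
NCName='[^./|:*@]+'
NameTest="(child::${S})?(${NCName}:)?(${NCName}|[*])"
Step="${S}([.]|${NameTest})${S}"
Path="(${S}[.]${S}//)?${Step}(/${Step})*"
Selector="^(${Path}([|]${Path})*)\$"

# Succeeds if the argument matches the adapted selector pattern.
is_selector() { [[ $1 =~ $Selector ]]; }

for xpath in './/a:b/c' 'child::a/*' 'a | b' 'a//b' '@a'; do
    if is_selector "$xpath"; then
        echo "accepted: $xpath"
    else
        echo "rejected: $xpath"
    fi
done
```

This only exercises the structural pattern; the name check with the colon-sensitive classes is a separate pattern and is not covered here.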

Comment 9 Ulrich Müller gentoo-dev 2007-09-09 10:21:41 UTC

Thanks for the updated patch. (Please provide patches with a path relative to ${WORKDIR} or ${S}, otherwise we have to redo them.) The "Selector" makes more sense than in the previous one. ;-) I've committed an ebuild with the new patch as -r2. Please test.

Adding the SGML team to CC since they are mentioned in metadata.

Comment 10 Martin von Gagern 2007-09-12 09:45:56 UTC

Created attachment 130685 [details, diff]
two patterns, with whitespace handling now in both of them

I did some testing by using Google code search to get some real world examples. Not much interesting out there, but I found out I had forgotten to test the whitespace handling, sorry. It didn't work, because there was no whitespace allowed by the first pattern. Now I updated everything, and tested this at least. I placed the shell script in the header of the patch file this time, to have the patch better documented.

Comment 11 Ulrich Müller gentoo-dev 2007-09-12 15:09:08 UTC

Updated patch committed, no revbump this time (might do one if this is finally tested and working).

Please test again and reopen.

Comment 12 Ulrich Müller gentoo-dev 2007-11-23 15:20:19 UTC

(In reply to comment #11)
> Please test again and reopen.

Any news here? I would like to close it as FIXED at some point.

Comment 13 Martin von Gagern 2007-11-28 19:34:00 UTC

(In reply to comment #12)
> Any news here? I would like to close it as FIXED at some point.

If no news is good news, then I've got the good news that I haven't found any further errors. So I'd consider it FIXED until someone stumbles upon some more bizarre test cases, or there is some activity on the W3C bugs.

Comment 14 Ulrich Müller gentoo-dev 2007-11-28 21:30:17 UTC

Reopening for proper closing.

Comment 15 Ulrich Müller gentoo-dev 2007-11-28 21:32:27 UTC

(In reply to comment #13)
> So I'd consider it FIXED until someone stumbles upon some more bizarre
> test cases, or there is some activity on the W3C bugs.

FIXED then.

Thanks a lot for the patch, and for digging into the matter in the first place.