This document describes common errors, anti-patterns, and known bugs in the SimpleParse 2.0 engine.
Expressing repetition as recursion is extremely inefficient: it generates 4 new Python objects and a number of new object pointers for every match (figure more than 100 bytes per match), on top of the engine overhead of tracking the recursion. If you have a 1-million-character match that is "matching" on every character, you'll have hundreds of megabytes of memory in use.
In addition, if you are not using the non-recursive rewrite of mx.TextTools, you can actually blow up the C stack with the recursive calls to tag(). The symptom is a memory access error when attempting to parse.
a := 'b', a? # bad!
a := 'b'+ # good!
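As a rough illustration of the difference, the following pure-Python toy models both productions for a run of 'b' characters. It is not SimpleParse or mxTextTools code; the names match_recursive, match_repeat and depth are hypothetical and exist only for this sketch. The recursive form builds one nested result and one call frame per character, while the repetition builds a single flat result.

```python
def match_recursive(text, pos=0):
    """Toy model of  a := 'b', a?  (one nested result node per character)."""
    if pos >= len(text) or text[pos] != 'b':
        return None
    child = match_recursive(text, pos + 1)      # a? : optional recursive call
    end = child[2] if child else pos + 1
    return ('a', pos, end, [child] if child else [])


def match_repeat(text, pos=0):
    """Toy model of  a := 'b'+  (a single flat result, no recursion)."""
    end = pos
    while end < len(text) and text[end] == 'b':
        end += 1
    return ('a', pos, end, []) if end > pos else None


def depth(node):
    """Nesting depth of a (tag, start, end, children) result tree."""
    return 1 + max((depth(c) for c in node[3]), default=0)


small = 'b' * 50
print(depth(match_recursive(small)))  # 50: nesting grows with input length
print(depth(match_repeat(small)))     # 1: flat regardless of input length
```

In this toy, a sufficiently long input will typically make match_recursive raise Python's RecursionError, which is only loosely analogous to exhausting the C stack in the recursive engine.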
At present, there's no way for the engine to know whether a child of a repeating group has been satisfied (matched) because it is optional (or all of its children are optional), or because it actually matched. The problem with the obvious solution of just checking whether we've moved forward in the text is that many classes of match object may match depending on external (non-text-based) conditions, so if we do the check, all of those mechanisms suddenly fail. For now, make sure:
No child of a repeating FirstOfGroup (x/y/z)+ or (x/y/z)* can match a Null-string
At least one child of a repeating SequentialGroup (x,y,z)+ or (x,y,z)* must not match the Null-string
You can recognize this situation by the process going into an endless loop with little or no memory being consumed. To fix this one, I'd likely need to add another return value type to the mxTextTools engine.
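The endless loop can be reproduced with a small pure-Python model of a naive repeating group. This is a sketch only, not the mxTextTools implementation: repeat, optional_x and required_x are hypothetical names, and the max_steps cap is an addition made purely so that the demonstration terminates.

```python
def repeat(child, text, pos, max_steps=1000):
    """Toy model of a naive repeating group: loop while the child succeeds.

    child(text, pos) returns the new position, or None on failure.
    max_steps is a demo-only safety cap; the model itself has no such guard.
    Returns (final position, iterations, whether the cap was hit).
    """
    steps = 0
    while steps < max_steps:
        new_pos = child(text, pos)
        if new_pos is None:
            return pos, steps, False
        pos = new_pos
        steps += 1
    return pos, steps, True  # cap hit: without it, this would loop forever


def optional_x(text, pos):
    """Models  'x'?  which succeeds even when nothing is consumed."""
    return pos + 1 if text[pos:pos + 1] == 'x' else pos


def required_x(text, pos):
    """Models  'x'  which fails when nothing can be consumed."""
    return pos + 1 if text[pos:pos + 1] == 'x' else None


print(repeat(optional_x, 'xxy', 0))  # (2, 1000, True): stuck at position 2
print(repeat(required_x, 'xxy', 0))  # (2, 2, False): stops normally
```

Note that the stuck loop never advances and never allocates, which matches the symptom described above: an endless loop with little or no memory being consumed.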
The TextTools engine does not support backtracking as seen in RE engines and many parsers, so productions like this can never match:
a := (b/c)*, c
This is because the 'c' productions will all have been consumed by the FirstOfGroup, so the final 'c' can never match. This is a fundamental limit of the current back-end, so unless a new back-end is created, the problem will not go away. You will need to design your grammars accordingly.
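To make the failure concrete, here is a small pure-Python model of greedy matching without backtracking. It is a sketch under stated assumptions, not the TextTools engine: the combinators lit, first_of, star, plus and seq are hypothetical, and b and c are assumed to be the single-character literals 'b' and 'c'. For that special case only, the grammar can be restated as (b*, c)+, which accepts the same strings without needing to back up; other grammars will need their own redesign.

```python
def lit(ch):
    """Match a single literal character."""
    def match(text, pos):
        return pos + 1 if text[pos:pos + 1] == ch else None
    return match


def first_of(*children):
    """Ordered choice: the first child to match wins, with no second chances."""
    def match(text, pos):
        for child in children:
            end = child(text, pos)
            if end is not None:
                return end
        return None
    return match


def star(child):
    """Greedy zero-or-more; never gives back what it has consumed."""
    def match(text, pos):
        while True:
            end = child(text, pos)
            if end is None or end == pos:  # stop on failure or no progress
                return pos
            pos = end
    return match


def plus(child):
    """Greedy one-or-more."""
    def match(text, pos):
        end = child(text, pos)
        return None if end is None else star(child)(text, end)
    return match


def seq(*children):
    """Sequence: every child must match in order."""
    def match(text, pos):
        for child in children:
            pos = child(text, pos)
            if pos is None:
                return None
        return pos
    return match


b, c = lit('b'), lit('c')
greedy = seq(star(first_of(b, c)), c)   # models  a := (b/c)*, c
rewritten = plus(seq(star(b), c))       # models  a := (b*, c)+

print(greedy('bcc', 0))     # None: the group ate every 'c'
print(rewritten('bcc', 0))  # 3: the whole string matches
```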
The production c := (a/b) produces a FirstOfGroup, that is, a group which
matches the first child to match. Many parsers and regex engines
use an algorithm that matches all children and chooses the longest successful
match. It would be possible to define a new TextTools tagging command to
support the longest-of semantics for Table/SubTable matches, but I haven't
felt the need to do so. If such a command is created, it will likely be
spelled '|' rather than '/' in the SimpleParse grammar.
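The difference between the two semantics can be sketched in a few lines of plain Python. This is an illustration only, assuming the alternatives are literal strings; first_match and longest_match are hypothetical helpers, not SimpleParse or TextTools functions.

```python
def first_match(alternatives, text, pos=0):
    """First-of semantics: return the end of the first alternative that matches."""
    for alt in alternatives:
        if text.startswith(alt, pos):
            return pos + len(alt)
    return None


def longest_match(alternatives, text, pos=0):
    """Longest-of semantics: try every alternative and keep the longest match."""
    ends = [pos + len(alt) for alt in alternatives if text.startswith(alt, pos)]
    return max(ends) if ends else None


alternatives = ['in', 'int']
print(first_match(alternatives, 'integer'))    # 2: 'in' is listed first
print(longest_match(alternatives, 'integer'))  # 3: 'int' is longer
```

Under first-of semantics the order of alternatives matters, so in a grammar using '/' the longer alternative generally has to be listed before any alternative that is a prefix of it.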
Although not particularly likely, users of SimpleParse 1.0 may have relied
on the (extremely non-intuitive) grouping mechanism for element tokens in
their grammars. With that mechanism, the group:
a,b,c/d,e
was interpreted as:
a,b,(c/(d,e))
The new rule is simply that alternation binds closer than sequences, so
the same grammar becomes:
a,b,(c/d),e
which, though no more (or less) intuitive than:
(a,b,c)/(d,e) ### it doesn't work this way!!!
is certainly better than the original mechanism.
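The new binding rule can be mimicked with a one-line Python sketch: split on ',' first (the loosest operator), then split each item on '/' (the tighter one). The helper group is hypothetical, ignores parentheses, quoting and repetition markers, and is not how the SimpleParse grammar parser is implemented.

```python
def group(expression):
    """Group a flat element-token list the way the new rule reads it.

    Returns a sequence (outer list) of alternations (inner lists).
    """
    return [item.strip().split('/') for item in expression.split(',')]


print(group('a,b,c/d,e'))  # [['a'], ['b'], ['c', 'd'], ['e']]
```

Each inner list with more than one entry corresponds to a parenthesised alternation, so the output above reads as a,b,(c/d),e.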
You will, if possible, want to use the non-recursive rewrite of the 2.1.0
mxTextTools engine (2.1.0nr). At the time of writing, the mainline 2.1.0b3
has some errors (which I'm told are fixed for 2.1.0final), while the non-recursive
rewrite passes all tests. The bugs in the (recursive) engine(s) that are
known (and not likely to be fixed in the case of 2.1.0 final) are: