Common Problems

This document describes common errors, anti-patterns and known bugs with the SimpleParse 2.0 engine.

Repetition-as-Recursion

Expressing repetition through recursion is extremely inefficient. It generates 4 new Python objects and a number of new object pointers for every match (figure more than 100 bytes for each match), on top of the engine overhead in tracking the recursion. If you have a 1-million-character match that is “matching” for every character, you'll use hundreds of megabytes of memory.

In addition, if you are not using the non-recursive rewrite of mx.TextTools, you can actually overflow the C stack with the recursive calls to tag(). The symptom is a memory access error when attempting to parse.

a := 'b', a? # bad!
a := 'b'+ # good!
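
The difference can be illustrated with a toy model in plain Python. This is only a sketch: the nested tuples merely stand in for the engine's result trees, and the names match_recursive, match_repeating and depth are invented for the illustration, not part of SimpleParse or mxTextTools.

```python
# Toy model only: (tag, start, stop, children) tuples stand in for result trees.

def match_recursive(text, pos=0):
    """Model of  a := 'b', a?  : every match nests the next one as a child."""
    if not text.startswith("b", pos):
        return None
    child = match_recursive(text, pos + 1)      # one Python frame per character
    stop = child[2] if child else pos + 1
    return ("a", pos, stop, [child] if child else [])

def match_repeating(text, pos=0):
    """Model of  a := 'b'+  : one loop, one result tuple."""
    stop = pos
    while text.startswith("b", stop):
        stop += 1
    return ("a", pos, stop, []) if stop > pos else None

def depth(node):
    """Count how deeply the result tree is nested."""
    levels = 0
    while node is not None:
        levels += 1
        node = node[3][0] if node[3] else None
    return levels

text = "b" * 200
nested = match_recursive(text)
flat = match_repeating(text)
print(depth(nested), depth(flat))   # 200 1
```

Both versions cover the same 200 characters, but the recursive form builds one result object (and one stack frame) per character, while the repeating form builds a single result.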

Null-match Children of Repeating Groups

At present, the engine has no way to know whether a child was satisfied (matched) because it is optional (or all of its children are optional), or because it actually matched. The problem with the obvious solution of just checking whether we've moved forward in the text is that many classes of match object may match depending on external (non-text-based) conditions, so if we do the check, all of those mechanisms suddenly fail. For now, make sure:

No child of a repeating FirstOfGroup (x/y/z)+ or (x/y/z)* can match a Null-string
At least one child of a repeating SequentialGroup (x,y,z)+ or (x,y,z)* must not match the Null-string

You can recognize this situation by the process going into an endless loop with little or no memory being consumed. To fix this one, I'd likely need to add another return value type to the mxTextTools engine.
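
The endless loop can be reproduced in a toy model. The sketch below is plain Python, not the mxTextTools engine: the matchers are ordinary functions that return a new position or None, and the max_steps cap is a hypothetical safety net added only so the demonstration terminates.

```python
def optional_x(text, pos):
    """Model of  'x'?  : always succeeds, sometimes without consuming anything."""
    return pos + 1 if text.startswith("x", pos) else pos

def required_x(text, pos):
    """Model of  'x'  : fails (returns None) when there is no 'x'."""
    return pos + 1 if text.startswith("x", pos) else None

def repeat(child, text, pos, max_steps=1000):
    """Model of  (child)* ; max_steps exists only to stop this demo."""
    steps = 0
    while steps < max_steps:
        new_pos = child(text, pos)
        if new_pos is None:          # the child failed: repetition ends normally
            return pos, steps
        pos = new_pos
        steps += 1
    return pos, steps                # gave up: the child never failed

print(repeat(required_x, "xxy", 0))  # (2, 2)
print(repeat(optional_x, "xxy", 0))  # (2, 1000)
```

With the optional child, the loop stops advancing at position 2 but keeps "succeeding", which is the behaviour described above: no progress, no failure, and almost no memory growth.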

No Backtracking

The TextTools engine does not support backtracking as seen in RE engines and many parsers, so productions like this can never match:

a := (b/c)*, c

This is because the FirstOfGroup will already have consumed every 'c', so the final 'c' can never match. This is a fundamental limit of the current back-end, so unless a new back-end is created, the problem will not go away. You will need to design your grammars accordingly.
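
To make the contrast concrete, here is a minimal sketch of a non-backtracking matcher in plain Python, compared with Python's backtracking re module. The combinators (literal, first_of, star, plus, sequence) are invented for this illustration and are not the TextTools API. The rewritten production a := (b*, c)+ is only one possible restructuring, offered as an assumption rather than a recommended idiom.

```python
import re

def literal(ch):
    def match(text, pos):
        return pos + 1 if text.startswith(ch, pos) else None
    return match

def first_of(*children):
    def match(text, pos):
        for child in children:               # first child to match wins
            new_pos = child(text, pos)
            if new_pos is not None:
                return new_pos
        return None
    return match

def star(child):
    def match(text, pos):
        while True:                          # greedy, never gives anything back
            new_pos = child(text, pos)
            if new_pos is None:
                return pos
            pos = new_pos
    return match

def plus(child):
    def match(text, pos):
        first = child(text, pos)
        return None if first is None else star(child)(text, first)
    return match

def sequence(*children):
    def match(text, pos):
        for child in children:
            pos = child(text, pos)
            if pos is None:
                return None
        return pos
    return match

b, c = literal("b"), literal("c")

a = sequence(star(first_of(b, c)), c)        # a := (b/c)*, c
print(a("bcc", 0))                           # None: the group ate every 'c'
print(bool(re.fullmatch(r"(?:b|c)*c", "bcc")))  # True: re backtracks

rewritten = plus(sequence(star(b), c))       # a := (b*, c)+
print(rewritten("bcc", 0))                   # 3
```

The rewritten form never needs to give back input, because each repetition ends on the 'c' it requires.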

First-Of, not Longest-Of (Meaning of / )

The production c := (a/b) produces a FirstOfGroup, that is, a group which matches the first child to match. Many parsers and regex engines use an algorithm that matches all children and chooses the longest successful match. It would be possible to define a new TextTools tagging command to support the longest-of semantics for Table/SubTable matches, but I haven't felt the need to do so. If such a command is created, it will likely be spelled '|' rather than '/' in the SimpleParse grammar.
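
The two semantics can be compared in a few lines of plain Python. This is a sketch only: first_of and longest_of are hypothetical helper functions written for the illustration, and longest_of does not correspond to any existing TextTools command.

```python
def literal(word):
    def match(text, pos):
        return pos + len(word) if text.startswith(word, pos) else None
    return match

def first_of(*children):
    """Return the end position of the first child that matches."""
    def match(text, pos):
        for child in children:
            new_pos = child(text, pos)
            if new_pos is not None:
                return new_pos
        return None
    return match

def longest_of(*children):
    """Try every child and return the furthest end position."""
    def match(text, pos):
        results = [r for r in (child(text, pos) for child in children)
                   if r is not None]
        return max(results) if results else None
    return match

short, long_ = literal("in"), literal("integer")

print(first_of(short, long_)("integer", 0))    # 2: 'in' matched first
print(longest_of(short, long_)("integer", 0))  # 7: 'integer' is longer
print(first_of(long_, short)("integer", 0))    # 7: reordering gives the same result
```

Under first-of semantics, listing the longer alternative first gives the same result as longest-of in this example, so the order of alternatives in a grammar matters.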

Grouping Rules

Although not particularly likely, users of SimpleParse 1.0 may have relied on the (extremely non-intuitive) grouping mechanism for element tokens in their grammars. With that mechanism, the group:

a,b,c/d,e

was interpreted as:

a,b,(c/(d,e))

The new rule is simply that alternation binds closer than sequences, so the same grammar becomes:

a,b,(c/d),e

which, though no more (or less) intuitive than:

(a,b,c)/(d,e) ### it doesn't work this way!!!

is certainly better than the original mechanism.
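
The new precedence can be mimicked with a two-step split, shown here as a rough Python sketch. The group function is hypothetical and deliberately naive: it ignores parentheses, quoting and everything else in the real grammar syntax, and only shows that '/' binds more tightly than ','.

```python
def group(production):
    """Split on ',' first, then on '/', so '/' binds tighter than ','."""
    items = production.replace(" ", "").split(",")
    return [item.split("/") for item in items]

print(group("a,b,c/d,e"))
# [['a'], ['b'], ['c', 'd'], ['e']]
```

Each inner list is one element of the sequence, and an inner list with more than one entry is an alternation, which corresponds to a,b,(c/d),e.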

mxTextTools Versions

You will, if possible, want to use the non-recursive rewrite of the 2.1.0 mxTextTools engine (2.1.0nr). At the time of writing, the mainline 2.1.0b3 has some errors (which I'm told are fixed for 2.1.0final), while the non-recursive rewrite passes all tests. The bugs in the (recursive) engine(s) that are known (and not likely to be fixed in the case of 2.1.0 final) are:

Recursive data structure returns from reported subtable matches (not a problem for SimpleParse, as we never report a subtable match) [2.0.3 only]
Failure to truncate/report position for a failed table correctly (may confuse your code, as the results from failed matches may show up in your tag-lists) [2.0.3 only]
AppendToTag only supports list objects with [2.0.3 only]
C Stack overflow with recursive calls [2.0.3 AND 2.1.0 (recursive)]
