SimpleParse: A Parser Generator for mxTextTools v2.1.0

SimpleParse is a BSD-licensed Python package providing a simple and fast parser generator using a modified version of the mxTextTools text-tagging engine. SimpleParse allows you to generate parsers directly from your EBNF grammar.

Unlike most parser generators, SimpleParse generates single-pass parsers (there is no distinct tokenization stage), an approach taken from the predecessor project (mcf.pars) which attempted to create "autonomously parsing regex objects". The resulting parsers are not as generalized as those created by, for instance, the Earley algorithm, but they do tend to be useful for the parsing of computer file formats and the like (as distinct from natural language and similar "hard" parsing problems).

As of version 2.1.0 the SimpleParse project includes a patched copy of the mxTextTools tagging library with the non-recursive rewrite of the core parsing loop. This means that you will need to build the extension module to use SimpleParse, but the effect is to provide a uniform parsing platform where all of the features of a given SimpleParse version are always available.

For those interested in working on the project, I'm actively interested in welcoming and supporting both new developers and new users. Feel free to contact me.

Documentation

Acquisition and Installation

You will need a copy of Python with distutils support (Python versions 2.0 and above include this). You'll also need a C compiler compatible with your Python build and understood by distutils.

To install the base SimpleParse engine, download the latest version in your preferred format. If you are using the Win32 installer, simply run the executable. If you are using one of the source distributions, unpack the distribution into a temporary directory (maintaining the directory structure) then run:

setup.py install

in the top directory created by the expansion process.  This will cause the patched mxTextTools library to be built as a sub-package of the simpleparse package and will then install the whole package to your system.

Features/Changelog

New in 2.1.0a1:

New in 2.0.1:

diff -w -r1.4 error.py
32c32
<             return '%s: %s'%( self.__class__.__name__, self.messageFormat(message) )
---
>             return '%s: %s'%( self.__class__.__name__, self.messageFormat(self.message) )

New in 2.0:

General

"Class" of Parsers Generated

Our (current) parsers are top-down, in that they work from the top of the parsing graph (the root production). They are not, however, tokenising parsers, so there is no appropriate LL(x) designation as far as I can see, and there is an arbitrary lookahead mechanism (one that could theoretically parse the entire rest of the file just to see if a particular character matches). I would hazard a guess that they are theoretically closest to a deterministic recursive-descent parser.
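As a rough illustration of what "top-down without a tokenisation stage" means, here is a minimal pure-Python sketch. It is not SimpleParse code and does not use the mxTextTools engine. Every name in it (literal, sequence, lookahead, digits) is hypothetical and chosen only for this example. Each matcher works directly on characters, and the lookahead matcher inspects the input without consuming any of it.

```python
# Illustrative sketch only: hand-written scannerless, top-down matching.
# Each matcher takes (text, pos) and returns the new position, or None on failure.

def literal(s):
    """Match the exact string s at the current position."""
    def match(text, pos):
        return pos + len(s) if text.startswith(s, pos) else None
    return match

def sequence(*matchers):
    """Match each matcher in turn; fail as soon as one fails."""
    def match(text, pos):
        for m in matchers:
            pos = m(text, pos)
            if pos is None:
                return None
        return pos
    return match

def lookahead(matcher):
    """Succeed if matcher would succeed here, but consume no input."""
    def match(text, pos):
        return pos if matcher(text, pos) is not None else None
    return match

def digits(text, pos):
    """Match one or more decimal digits."""
    end = pos
    while end < len(text) and text[end].isdigit():
        end += 1
    return end if end > pos else None

# "a number that is followed by 'px'", consuming only the number
number_before_px = sequence(digits, lookahead(literal("px")))

print(number_before_px("120px", 0))  # 3
print(number_before_px("120em", 0))  # None
```

Nothing limits how far the matcher wrapped by lookahead may read, which is the sense in which the lookahead is arbitrary.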

There are no backtracking facilities, so any ambiguity is handled by choosing the first successful match of a grammar (not the longest, as in most top-down parsers, mostly because without tokenisation, it would be expensive to do checks for each possible match's length).  As a result of this, the parsers are entirely deterministic.
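The practical consequence of first-match selection can be shown with a small pure-Python sketch. This is an illustration of the idea only, not SimpleParse's implementation. The helper names (literal, first_of, longest_of) are hypothetical.

```python
# Illustrative sketch only: ordered choice takes the first alternative that
# matches, even when a later alternative would have matched more text.

def literal(s):
    """Match the exact string s at the current position."""
    def match(text, pos):
        return pos + len(s) if text.startswith(s, pos) else None
    return match

def first_of(*matchers):
    """First successful alternative wins; later ones are never tried."""
    def match(text, pos):
        for m in matchers:
            end = m(text, pos)
            if end is not None:
                return end
        return None
    return match

def longest_of(*matchers):
    """Shown for contrast: every alternative has to be tried."""
    def match(text, pos):
        ends = [e for e in (m(text, pos) for m in matchers) if e is not None]
        return max(ends) if ends else None
    return match

keyword_first = first_of(literal("in"), literal("int"))
keyword_longest = longest_of(literal("in"), literal("int"))
reordered = first_of(literal("int"), literal("in"))

print(keyword_first("integer", 0))    # 2
print(keyword_longest("integer", 0))  # 3
print(reordered("integer", 0))        # 3
```

With first-match semantics the order of alternatives is part of the grammar's meaning, so when one alternative is a prefix of another, the longer one should normally be listed first.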

The time/memory characteristics are such that, in general, the time to parse an input text varies with the amount of text to parse. There are two major factors: the time to do the actual parsing (which, for simple deterministic grammars, should be close to linear with the length of the text, though a pathological grammar might have radically different operating characteristics) and the time to build the results tree (which depends on the memory architecture of the machine, the currently free memory, and the phase of the moon). As a rule, SimpleParse parsers will be faster (for suitably limited grammars) than anything you can code directly in Python. They will not generally outperform grammar-specific parsers written in C.

Missing Features

Possible Future Directions

mxTextTools Rewrite Enhancements

Alternate C Back-end?

mxBase/mxTextTools Installation

NOTE: This section only applies to SimpleParse versions before 2.1.0. SimpleParse 2.1.0 and above already include a patched version of mxTextTools!

You will want an mxBase 2.1.0 distribution to run SimpleParse, preferably with the non-recursive rewrite. If you want to use the non-recursive implementation, you will need to get the source archive for mxTextTools.  It is possible to use mxBase 2.0.3 with SimpleParse, but not to use it for building the non-recursive TextTools engine (2.0.3 also lacks a lot of features and bug-fixes found in the 2.1.0 versions).

Note: without the non-recursive rewrite of 2.1.0 (i.e. with the recursive version), the test suite will not pass all tests.  I'm not sure why they fail with the recursive version, but it does argue for using the non-recursive rewrite.

To build the non-recursive TextTools engine, you'll need to get the source distribution for the non-recursive implementation from the SimpleParse file repository. Note that there are incompatibilities in the mxBase 2.1 versions that make it necessary to use the versions specified below to build the non-recursive versions.

This archive is intended to be expanded over the mxBase source archive from the top-level directory, replacing one file and adding four others.

cd egenix-mx-base-2.1.0
gunzip non-recursive-1.0.0b1.tar.gz
tar -xvf non-recursive-1.0.0b1.tar

(Or use WinZip on Windows). When you have completed that, run:

setup.py build --force install

in the top directory of the eGenix-mx-base source tree.

Copyright, License & Disclaimer

The 2.1.0 and greater releases include the eGenix mxTextTools extension:

Licensed under the eGenix.com Public License; see the mxLicense.html file for details on licensing terms for the original library. The eGenix extensions are:

    Copyright (c) 1997-2000, Marc-Andre Lemburg
    Copyright (c) 2000-2001, eGenix.com Software GmbH

Extensions to the eGenix extensions (most significantly the rewrite of the core loop) are copyright Mike Fletcher and released under the SimpleParse License below:

    Copyright © 2003-2006, Mike Fletcher

SimpleParse License:

Copyright © 1998-2006, Copyright by Mike C. Fletcher; All Rights Reserved.
mailto: mcfletch@users.sourceforge.net

Permission to use, copy, modify, and distribute this software and its documentation for any purpose and without fee or royalty is hereby granted, provided that the above copyright notice appear in all copies and that both the copyright notice and this permission notice appear in supporting documentation or portions thereof, including modifications, that you make.

THE AUTHOR MIKE C. FLETCHER DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE, INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS, IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE!
