Python: module simpleparse.stt.TextTools.__init__

simpleparse.stt.TextTools.__init__ (version 2.1.0)

/home/mcfletch/pylive/simpleparse/stt/TextTools/__init__.py

mxTextTools - A tools package for fast text processing.
Copyright (c) 2000, Marc-Andre Lemburg; mailto:mal@lemburg.com
Copyright (c) 2000-2003, eGenix.com Software GmbH; mailto:info@egenix.com
Copyright (c) 2003-2006, Mike Fletcher; mailto:mcfletch@vrplumber.com
See the documentation for further information on copyrights, or contact the author. All Rights Reserved.

Modules

simpleparse.stt.TextTools.mxTextTools.mxTextTools
string
time
types

Functions


BMS = TextSearch(...)
TextSearch(match[,translate=None,algorithm=default_algorithm]) Create a substring search object for the string match; translate is an optional translate-string like the one used in the module re.

CharSet(...)
CharSet(definition) Create a character set matching object from the string

FS = TextSearch(...)
TextSearch(match[,translate=None,algorithm=default_algorithm]) Create a substring search object for the string match; translate is an optional translate-string like the one used in the module re.

FSType = TextSearch(...)
TextSearch(match[,translate=None,algorithm=default_algorithm]) Create a substring search object for the string match; translate is an optional translate-string like the one used in the module re.

TagTable(...)
TagTable(definition[,cachable=1])

TextSearch(...)
TextSearch(match[,translate=None,algorithm=default_algorithm]) Create a substring search object for the string match; translate is an optional translate-string like the one used in the module re.

UnicodeTagTable(...)
TagTable(definition[,cachable=1])

_BMS(match, translate)
# Needed for backward compatibility:

_CS(definition)
# Shortcuts for pickle (reduces the pickle's length)

_FS(match, translate)

_TS(match, translate, algorithm)

_TT(definition)

charsplit(...)
charsplit(text,char,start=0,stop=len(text)) Split text[start:stop] into substrings at char and return the result as list of strings.
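
The slicing-then-splitting behaviour described for charsplit() can be modelled in pure Python. This is only a sketch of the documented semantics, not the library's C implementation; the name charsplit_model is hypothetical, and the sketch assumes that empty substrings are kept, as str.split does.

```python
def charsplit_model(text, char, start=0, stop=None):
    """Pure-Python model of the documented charsplit() behaviour (assumption)."""
    if stop is None:
        stop = len(text)
    # Split only the requested slice at every occurrence of char.
    return text[start:stop].split(char)


print(charsplit_model("a,b,,c", ","))        # ['a', 'b', '', 'c']
print(charsplit_model("key=value", "=", 4))  # ['value']
```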

cmp(...)
cmp(a,b) Compare two valid taglist tuples with respect to their slice position; this is useful for sorting joinlists.

hex2str(...)
hex2str(text) Return text interpreted as two byte HEX values converted to a string.

isascii(...)
isascii(text,start=0,stop=len(text)) Return 1/0 depending on whether text only contains ASCII characters.

join(...)
join(joinlist,sep='',start=0,stop=len(joinlist))
Copy snippets from different strings together, producing a new string.
The first argument must be a list of tuples or strings; tuples must be of the form (string,l,r[,...]) and turn out as string[l:r].
NOTE: the syntax used for negative slices is different from the Python standard: -1 corresponds to the first character *after* the string, e.g. ('Example',0,-1) gives 'Example' and not 'Exampl', as it would in Python.
sep is an optional separator string; start and stop define the slice of joinlist that is taken into account.
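
The negative-index convention in the join() docstring is easy to get wrong, so a small pure-Python model may help. This is a sketch of the documented behaviour only: join_model and resolve are hypothetical names, and the mapping of an index -k to len(string) + 1 - k is an assumption generalised from the single ('Example',0,-1) case given in the docstring.

```python
def resolve(index, length):
    """Map a possibly negative index; -1 is the position just after the
    last character (assumed rule: -k -> length + 1 - k)."""
    return index if index >= 0 else length + 1 + index


def join_model(joinlist, sep="", start=0, stop=None):
    """Pure-Python model of the documented join() behaviour (not the C code)."""
    if stop is None:
        stop = len(joinlist)
    parts = []
    for item in joinlist[start:stop]:
        if isinstance(item, str):
            parts.append(item)              # plain strings are copied as-is
        else:
            string, left, right = item[:3]  # extra tuple entries are ignored
            n = len(string)
            parts.append(string[resolve(left, n):resolve(right, n)])
    return sep.join(parts)


print(join_model([("Example", 0, -1)]))  # Example
print(join_model([("Hello world", 0, 5), "!", ("abc", 1, 3)], sep=" "))  # Hello ! bc
```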

joinlist(...)
joinlist(text,list,start=0,stop=len(text))
Takes a list of tuples (replacement,l,r,...) and produces a taglist suitable for join() which creates a copy of text where every slice [l:r] is replaced by the given replacement.
- the list must be sorted using cmp() as compare function
- it may not contain overlapping slices
- the slices may not contain negative indices
- if the taglist cannot contain overlapping slices, you can give this function the taglist produced by tag() directly (sorting is not needed, as the list will already be sorted)
- start and stop set the slice to work in, i.e. text[start:stop]
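
The replace-by-slice workflow described for joinlist() can be sketched in pure Python as follows. The names joinlist_model and render are hypothetical; sorting by the left index stands in for sorting with cmp(), and render is a minimal stand-in for join() that handles only non-negative indices. Treat it as an illustration of the documented contract, not as the library's implementation.

```python
def joinlist_model(text, replacements, start=0, stop=None):
    """Model of the documented joinlist(): build a list suitable for join()."""
    if stop is None:
        stop = len(text)
    result = []
    pos = start
    # Sorting by left index stands in for cmp(); slices must not overlap.
    for entry in sorted(replacements, key=lambda e: e[1]):
        replacement, left, right = entry[:3]
        if left > pos:
            result.append((text, pos, left))  # untouched text before the slice
        result.append(replacement)
        pos = right
    if pos < stop:
        result.append((text, pos, stop))      # untouched tail
    return result


def render(joinlist):
    """Minimal stand-in for join(): strings as-is, tuples as string[l:r]."""
    return "".join(p if isinstance(p, str) else p[0][p[1]:p[2]] for p in joinlist)


text = "The quick brown fox"
jl = joinlist_model(text, [("slow", 4, 9), ("dog", 16, 19)])
print(render(jl))  # The slow brown dog
```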

lower(...)
lower(text) Return text converted to lower case.

prefix(...)
prefix(text,prefixes,start=0,stop=len(text)[,translate]) Looks at text[start:stop] and returns the first matching prefix out of the tuple of strings given in prefixes. If no prefix is found to be matching, None is returned. The optional 256 char translate string is used to translate the text prior to comparing it with the given prefixes.
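
A pure-Python model of the lookup described for prefix() might look like the sketch below. prefix_model is a hypothetical name, the optional translate argument is left out, and "first matching" is taken to mean first in tuple order, which is an assumption about the documented wording.

```python
def prefix_model(text, prefixes, start=0, stop=None):
    """Model of the documented prefix() lookup, without the translate option."""
    if stop is None:
        stop = len(text)
    window = text[start:stop]
    for candidate in prefixes:  # order in the tuple decides which match wins
        if window.startswith(candidate):
            return candidate
    return None


print(prefix_model("https://example.org", ("http", "https")))  # http
print(prefix_model("ftp://example.org", ("http", "https")))    # None
```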

set(...)
set(string,logic=1)
Returns a character set for string: a bit-encoded version of the characters occurring in string.
- logic can be set to 0 if all characters *not* in string should go into the set
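
The bit encoding can be reproduced in a few lines. The layout used here (32 bytes, character code c stored as bit c & 7 of byte c >> 3) is inferred from the A2Z_set and number_set values listed under Data, not taken from the library source, and charset_bits is a hypothetical name. The sketch returns a bytes object and assumes character codes below 256.

```python
def charset_bits(chars, logic=1):
    """Build a 32-byte bit set for chars (layout inferred, see lead-in)."""
    bits = bytearray(32)                    # 256 bits, one per character code
    for ch in chars:
        code = ord(ch)
        bits[code >> 3] |= 1 << (code & 7)  # byte index, then bit within the byte
    if not logic:
        # logic=0: every character *not* in chars goes into the set
        bits = bytearray(b ^ 0xFF for b in bits)
    return bytes(bits)


upper = charset_bits("ABCDEFGHIJKLMNOPQRSTUVWXYZ")
# Same byte pattern as the A2Z_set value shown under Data.
print(upper == b"\x00" * 8 + b"\xfe\xff\xff\x07" + b"\x00" * 20)  # True
```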

setfind(...)
setfind(text,set,start=0,stop=len(text)) Find the first occurrence of any character from set in text[start:stop]. set must be a string obtained with set(). DEPRECATED: use CharSet().search() instead.

setsplit(...)
setsplit(text,set,start=0,stop=len(text)) Split text[start:stop] into substrings using set, omitting the splitting parts and empty substrings. set must be a string obtained from set(). DEPRECATED: use CharSet().split() instead.

setsplitx(...)
setsplitx(text,set,start=0,stop=len(text)) Split text[start:stop] into substrings using set, so that every second entry consists only of characters in set. set must be a string obtained with set(). DEPRECATED: use CharSet().splitx() instead.

setstrip(...)
setstrip(text,set,start=0,stop=len(text),mode=0) Strip all characters in text[start:stop] appearing in set. mode indicates where to strip (<0: left; =0: left and right; >0: right). set must be a string obtained with set(). DEPRECATED: use CharSet().strip() instead.
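
The three mode values can be illustrated with a pure-Python model. setstrip_model is a hypothetical name, and for simplicity it takes the characters to strip as an ordinary string rather than a bit-encoded set from set(); it sketches the documented mode semantics only.

```python
def setstrip_model(text, chars, start=0, stop=None, mode=0):
    """Model of the documented setstrip() modes using a plain character string."""
    if stop is None:
        stop = len(text)
    window = text[start:stop]
    if mode <= 0:
        window = window.lstrip(chars)  # mode < 0 or mode == 0: strip on the left
    if mode >= 0:
        window = window.rstrip(chars)  # mode > 0 or mode == 0: strip on the right
    return window


print(repr(setstrip_model("  pad  ", " ")))           # 'pad'
print(repr(setstrip_model("  pad  ", " ", mode=-1)))  # 'pad  '
print(repr(setstrip_model("  pad  ", " ", mode=1)))   # '  pad'
```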

splitat(...)
splitat(text,char,nth=1,start=0,stop=len(text)) Split text[start:stop] into two substrings at the nth occurrence of char and return the result as 2-tuple. If the character is not found, the second string is empty. nth may be negative: the search is then done from the right and the first string is empty in case the character is not found.
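
The not-found rules for positive and negative nth can be made concrete with a pure-Python model. splitat_model is a hypothetical name, and the sketch assumes that the separator character itself is dropped from both halves, which the docstring does not state explicitly. The case nth == 0 is not covered by the docstring and is treated here like a failed search from the right.

```python
def splitat_model(text, char, nth=1, start=0, stop=None):
    """Model of the documented splitat() behaviour (separator assumed dropped)."""
    if stop is None:
        stop = len(text)
    window = text[start:stop]
    positions = [i for i, c in enumerate(window) if c == char]
    if nth > 0 and len(positions) >= nth:
        pos = positions[nth - 1]  # nth occurrence counted from the left
    elif nth < 0 and len(positions) >= -nth:
        pos = positions[nth]      # nth occurrence counted from the right
    else:
        # Not found: second string empty (left search), first empty (right search).
        return (window, "") if nth > 0 else ("", window)
    return window[:pos], window[pos + 1:]


print(splitat_model("a:b:c", ":"))      # ('a', 'b:c')
print(splitat_model("a:b:c", ":", -1))  # ('a:b', 'c')
print(splitat_model("abc", ":"))        # ('abc', '')
print(splitat_model("abc", ":", -1))    # ('', 'abc')
```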

str2hex(...)
str2hex(text) Return text converted to a string consisting of two byte HEX values.
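
str2hex() and hex2str() are described as inverses of each other. The round trip can be sketched with the standard binascii module; this is a stand-in, not the library's code. str2hex_model and hex2str_model are hypothetical names, the model works on bytes objects, and the lowercase digits come from binascii rather than from anything the docstrings promise.

```python
import binascii


def str2hex_model(data):
    """Model of str2hex(): two hex digits per input byte."""
    return binascii.hexlify(data).decode("ascii")


def hex2str_model(text):
    """Model of hex2str(): decode pairs of hex digits back into bytes."""
    return binascii.unhexlify(text)


print(str2hex_model(b"abc"))    # 616263
print(hex2str_model("616263"))  # b'abc'
```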

suffix(...)
suffix(text,suffixes,start=0,stop=len(text)[,translate]) Looks at text[start:stop] and returns the first matching suffix out of the tuple of strings given in suffixes. If no suffix is found to be matching, None is returned. The optional 256 char translate string is used to translate the text prior to comparing it with the given suffixes.

tag(...)
tag(text,tagtable,sliceleft=0,sliceright=len(text),taglist=[],context=None)
Produce a tag list for a string, given a tag-table.
- returns a tuple (success, taglist, nextindex)
- if taglist == None, then no taglist is created
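
Consuming the (success, taglist, nextindex) result usually means walking the taglist recursively. The sketch below does that on hand-written sample data rather than on real tag() output, so it runs without the library. It assumes the usual mxTextTools entry layout (tagobj, left, right, subtags), which this page does not spell out; dump and the sample values are hypothetical.

```python
def dump(text, taglist, depth=0):
    """Return one indented line per taglist entry, recursing into subtags."""
    lines = []
    for tagobj, left, right, subtags in taglist:
        lines.append("%s%s: %r" % ("  " * depth, tagobj, text[left:right]))
        if subtags:
            lines.extend(dump(text, subtags, depth + 1))
    return lines


text = "x = 42"
# Hand-written stand-in for the tuple that tag(text, tagtable) would return.
result = (1, [("name", 0, 1, None), ("number", 4, 6, None)], 6)

success, taglist, nextindex = result
if success:
    print("\n".join(dump(text, taglist)))
    print("parsed up to index", nextindex)
```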

upper(...)
upper(text) Return text converted to upper case.

Data

A2Z = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
A2Z_charset = <Character Set object for 'A-Z'>
A2Z_set = '\x00\x00\x00\x00\x00\x00\x00\x00\xfe\xff\xff\x07\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
AllIn = 11
AllInCharSet = 41
AllInSet = 31
AllNotIn = 12
AppendMatch = 2048
AppendTagobj = 1024
AppendToTagobj = 512
BOYERMOORE = 0
Break = 0
Call = 201
CallArg = 202
CallTag = 256
EOF = 101
FASTSEARCH = 1
Fail = 100
Here = 1
Is = 13
IsIn = 14
IsInCharSet = 42
IsInSet = 32
IsNot = 15
IsNotIn = 15
Jump = 100
JumpTarget = 104
LookAhead = 4096
Loop = 205
LoopControl = 206
MatchFail = -1000000
MatchOk = 1000000
Move = 103
NoWord = 211
Reset = -1
Skip = 102
SubTable = 207
SubTableInList = 208
TRIVIAL = 2
Table = 203
TableInList = 204
ThisTable = 999
To = 0
ToBOF = 0
ToEOF = -1
Umlaute = '\xc4\xd6\xdc'
Umlaute_charset = <Character Set object for '\xc4\xd6\xdc'>
Word = 21
WordEnd = 23
WordStart = 22
a2z = 'abcdefghijklmnopqrstuvwxyz'
a2z_charset = <Character Set object for 'a-z'>
a2z_set = '\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xfe\xff\xff\x07\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
alpha = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
alpha_charset = <Character Set object for 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'>
alpha_set = '\x00\x00\x00\x00\x00\x00\x00\x00\xfe\xff\xff\x07\xfe\xff\xff\x07\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
alphanumeric = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789'
alphanumeric_charset = <Character Set object for 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789'>
alphanumeric_set = '\x00\x00\x00\x00\x00\x00\xff\x03\xfe\xff\xff\x07\xfe\xff\xff\x07\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
any = '\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./...\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff'
any_charset = <Character Set object for '\x00-\xff'>
any_set = '\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff'
formfeed = '\x0c'
formfeed_charset = <Character Set object for '\x0c'>
german_alpha = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz\xe4\xf6\xfc\xdf\xc4\xd6\xdc'
german_alpha_charset = <Character Set object for 'ABCDEFGHIJKLMNOPQRSTU...hijklmnopqrstuvwxyz\xe4\xf6\xfc\xdf\xc4\xd6\xdc'>
german_alpha_set = '\x00\x00\x00\x00\x00\x00\x00\x00\xfe\xff\xff\x07\xfe\xff\xff\x07\x00\x00\x00\x00\x00\x00\x00\x00\x10\x00@\x90\x10\x00@\x10'
id2cmd = {-1000000: 'MatchFail', -1: 'ToEOF', 0: 'Fail/Jump', 1: 'Here', 11: 'AllIn', 12: 'AllNotIn', 13: 'Is', 14: 'IsIn', 15: 'IsNotIn', 21: 'Word', ...}
newline = '\r\n'
newline_charset = <Character Set object for '\r\n'>
newline_set = '\x00$\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
nonwhitespace_charset = <Character Set object for '^ \t\x0b\r\n\x0c'>
nonwhitespace_set = '\xff\xc1\xff\xff\xfe\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff'
number = '0123456789'
number_charset = <Character Set object for '0-9'>
number_set = '\x00\x00\x00\x00\x00\x00\xff\x03\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
sFindWord = 213
sWordEnd = 212
sWordStart = 211
tagtable_cache = {(46912536021760, 0): <String Tag Table object>, (46912540134840, 0): <String Tag Table object>, (46912541410080, 0): <String Tag Table object>, (46912541454848, 0): <String Tag Table object>, (46912541455136, 0): <String Tag Table object>, (46912541455208, 0): <String Tag Table object>, (46912541489264, 0): <String Tag Table object>, (46912541566016, 0): <String Tag Table object>, (46912543903688, 0): <String Tag Table object>, (46912543908136, 0): <String Tag Table object>, ...}
to_lower = '\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./...\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff'
to_upper = '\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./...\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff'
umlaute = '\xe4\xf6\xfc\xdf'
umlaute_charset = <Character Set object for '\xe4\xf6\xfc\xdf'>
white = ' \t\x0b'
white_charset = <Character Set object for ' \t\x0b'>
white_set = '\x00\x02\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
whitespace = ' \t\x0b\r\n\x0c'
whitespace_charset = <Character Set object for ' \t\x0b\r\n\x0c'>
whitespace_set = '\x00&\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
