best split tokens?

John Machin sjmachin at lexicon.net
Sat Sep 9 18:15:00 CEST 2006


Tim Chase wrote:
> >> Any more crazy examples? :)
> >
> > 'ey, 'alf a mo, wot about when 'enry 'n' 'orace drop their aitches?
>
> I said "crazy"...not "pathological" :)
>
> If one really wants such a case, one has to omit the standard
> practice of nesting quotes:
>
> 	John replied "Dad told me 'you can't go' but let Judy"
>
> However, if you don't have such situations and want to make
> 'enry and 'orace 'appy, you can change the regexp to
>
>
>  >>> s="He was wont to be alarmed/amused by answers that won't work"
>  >>> s2="The two-faced liar--a real joker--can't tell the truth"
>  >>> s3="'ey, 'alf a mo, wot about when 'enry 'n' 'orace drop their aitches?"
>
>  >>> r = re.compile("(?:(?:[a-zA-Z][-'])|(?:[-'][a-zA-Z])|[a-zA-Z])+")
>
> It will also choke on double-dashes, keeping them inside a token:
>
>  >>> r.findall(s), r.findall(s2), r.findall(s3)
> (['He', 'was', 'wont', 'to', 'be', 'alarmed', 'amused', 'by',
> 'answers', 'that', "won't", 'work'], ['The', 'two-faced',
> 'liar--a', 'real', "joker--can't", 'tell', 'the', 'truth'],
> ["'ey", "'alf", 'a', 'mo', 'wot', 'about', 'when', "'enry", "'n",
> "'orace", 'drop', 'their', 'aitches'])
>
> Or, to allow only infix dashes but allow apostrophes anywhere
> in the word, including the front or back, one could use:
>
>  >>> r = re.compile("(?:(?:[a-zA-Z]')|(?:'[a-zA-Z])|(?:[a-zA-Z]-[a-zA-Z])|[a-zA-Z])+")
>  >>> r.findall(s), r.findall(s2), r.findall(s3)
> (['He', 'was', 'wont', 'to', 'be', 'alarmed', 'amused', 'by',
> 'answers', 'that', "won't", 'work'], ['The', 'two-faced', 'liar',
> 'a', 'real', 'joker', "can't", 'tell', 'the', 'truth'], ["'ey",
> "'alf", 'a', 'mo', 'wot', 'about', 'when', "'enry", "'n",
> "'orace", 'drop', 'their', 'aitches'])
>
>
> Now your spell-checker has to have the "dropped initial or
> terminal letter" locale... :)
> 

Too complicated for string.bleedin'_split(), innit?
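
For anyone who wants to see the difference concretely, here is a rough
sketch (Python 3 syntax, not the 2006 session quoted above) that runs plain
whitespace splitting and the second quoted pattern over two of the sample
strings. The names WORD_RE, split_s2, regex_s2, split_s3 and regex_s3 are
made up for this illustration only.

```python
import re

# Second pattern from the quoted message: apostrophes may sit anywhere in a
# word, dashes only between two letters. WORD_RE is a hypothetical name.
WORD_RE = re.compile(
    r"(?:(?:[a-zA-Z]')|(?:'[a-zA-Z])|(?:[a-zA-Z]-[a-zA-Z])|[a-zA-Z])+"
)

s2 = "The two-faced liar--a real joker--can't tell the truth"
s3 = "'ey, 'alf a mo, wot about when 'enry 'n' 'orace drop their aitches?"

# Plain whitespace splitting leaves double-dashes and punctuation attached.
split_s2 = s2.split()
split_s3 = s3.split()

# The regex pulls out word-like runs and drops everything else.
regex_s2 = WORD_RE.findall(s2)
regex_s3 = WORD_RE.findall(s3)

for label, tokens in [("split s2", split_s2), ("regex s2", regex_s2),
                      ("split s3", split_s3), ("regex s3", regex_s3)]:
    print(label, tokens)
```

With split(), "liar--a" and "joker--can't" stay glued together and "mo,"
and "aitches?" keep their punctuation; the pattern separates the former and
strips the latter, at the cost of turning "'n'" into "'n".
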
Cheers,
John



