Scanning a file character by character

Tue Feb 17 15:19:15 EST 2009

In [401]: import shlex

In [402]: shlex.split("""Joe went to 'the store' where he bought a "box of chocolates" and stuff.""")
Out[402]: 
['Joe',
 'went',
 'to',
 'the store',
 'where',
 'he',
 'bought',
 'a',
 'box of chocolates',
 'and',
 'stuff.']

how's that work for ya? 

http://docs.python.org/library/shlex.html
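
One caveat: shlex uses shell quoting rules, so a lone apostrophe (as in "didn't") looks like an unclosed quote and shlex.split raises ValueError. Here is a rough sketch of one way around that, assuming it is acceptable to fall back to plain whitespace splitting for such lines. The helper name tokenize is made up for illustration; it is not part of shlex.

```python
import shlex

def tokenize(line):
    """Split a line into words, keeping quoted phrases together.

    Hypothetical helper: falls back to str.split() when shlex
    cannot parse the line.
    """
    try:
        return shlex.split(line)
    except ValueError:
        # shlex sees an unmatched quote character (e.g. the apostrophe
        # in "didn't") as an unclosed quotation and raises ValueError.
        return line.split()

print(tokenize('he bought a "box of chocolates" and stuff.'))
print(tokenize("Joe didn't go"))
```

The fallback loses quote handling for that line, so it is only a stopgap for text that mixes apostrophes with quoted phrases.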

On Tue, 10 Feb 2009 16:46:30 -0600
Tim Chase <python.list at tim.thechases.com> wrote:

> >> Or for a slightly less simple minded splitting you could try
> >> re.split:
> >>
> >>    >>> re.split(r"(\w+)", "The quick brown fox jumps, and falls over.")[1::2]
> >> ['The', 'quick', 'brown', 'fox', 'jumps', 'and', 'falls', 'over']
> > 
> > 
> > Perhaps I'm missing something, but the above regex does the exact
> > same thing as line.split() except it is significantly slower and
> > harder to read.
> > 
> > Neither deals with quoted text, apostrophes, hyphens, punctuation or
> > any other details of real-world text. That's what I mean by
> > "simple-minded".
> 
>    >>> s = "The quick brown fox jumps, and falls over."
>    >>> import re
>    >>> re.split(r"(\w+)", s)[1::2]
>    ['The', 'quick', 'brown', 'fox', 'jumps', 'and', 'falls', 'over']
>    >>> s.split()
>    ['The', 'quick', 'brown', 'fox', 'jumps,', 'and', 'falls', 'over.']
> 
> Note the difference in "jumps" vs. "jumps,"  (extra comma in the 
> string.split() version) and likewise the period after "over". 
> Thus not quite "the exact same thing as line.split()".
> 
> I think an easier-to-read variant would be
> 
>    >>> re.findall(r"\w+", s)
>    ['The', 'quick', 'brown', 'fox', 'jumps', 'and', 'falls', 'over']
> 
> which just finds words.  One could also just limit it to letters with
> 
>    re.findall(r"[a-zA-Z]+", s)
> 
> as "\w" is a little more encompassing (letters, digits, and
> underscores) if that's a problem.
> 
> -tkc

-- 

Josh Dukes
MicroVu IT Department