Scanning a file character by character
Josh Dukes
josh.dukes at microvu.com
Tue Feb 17 15:19:15 EST 2009
In [401]: import shlex
In [402]: shlex.split("""Joe went to 'the store' where he bought a "box of chocolates" and stuff.""")
Out[402]:
['Joe',
'went',
'to',
'the store',
'where',
'he',
'bought',
'a',
'box of chocolates',
'and',
'stuff.']
How's that work for ya?
http://docs.python.org/library/shlex.html
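
One caveat, sketched below: shlex.split follows shell-like quoting rules, so a line with an unbalanced quote (a lone apostrophe, say) raises ValueError. The split_words helper is a hypothetical name of mine, not part of shlex, and falling back to plain whitespace splitting is just one possible choice.

```python
import shlex


def split_words(line):
    """Hypothetical helper: shell-style splitting with a plain-split fallback."""
    try:
        # Quoted phrases come back as single tokens with the quotes removed.
        return shlex.split(line)
    except ValueError:
        # shlex raises ValueError on unbalanced quotes, such as the
        # apostrophe in "didn't"; fall back to whitespace splitting.
        return line.split()


print(split_words("""Joe went to 'the store' and stuff."""))
print(split_words("Joe didn't go"))
```

Note that the fallback only triggers when the quotes are unbalanced. A line with an even number of stray apostrophes will not raise, and may be grouped in ways you don't expect, so check this against your real input first.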
On Tue, 10 Feb 2009 16:46:30 -0600
Tim Chase <python.list at tim.thechases.com> wrote:
> >> Or for a slightly less simple minded splitting you could try
> >> re.split:
> >>
> >> >>> re.split("(\w+)", "The quick brown fox jumps, and falls over.")[1::2]
> >> ['The', 'quick', 'brown', 'fox', 'jumps', 'and', 'falls', 'over']
> >
> >
> > Perhaps I'm missing something, but the above regex does the exact
> > same thing as line.split() except it is significantly slower and
> > harder to read.
> >
> > Neither deals with quoted text, apostrophes, hyphens, punctuation or
> > any other details of real-world text. That's what I mean by
> > "simple-minded".
>
> >>> s = "The quick brown fox jumps, and falls over."
> >>> import re
> >>> re.split(r"(\w+)", s)[1::2]
> ['The', 'quick', 'brown', 'fox', 'jumps', 'and', 'falls', 'over']
> >>> s.split()
> ['The', 'quick', 'brown', 'fox', 'jumps,', 'and', 'falls',
> 'over.']
>
> Note the difference in "jumps" vs. "jumps," (extra comma in the
> string.split() version) and likewise the period after "over".
> Thus not quite "the exact same thing as line.split()".
>
> I think an easier-to-read variant would be
>
> >>> re.findall(r"\w+", s)
> ['The', 'quick', 'brown', 'fox', 'jumps', 'and', 'falls', 'over']
>
> which just finds words. One could also just limit it to letters with
>
> re.findall(r"[a-zA-Z]+", s)
>
> as "\w" is a little more encompassing (letters, digits, and
> underscores) if that's a problem.
>
> -tkc
--
Josh Dukes
MicroVu IT Department