[Python-Dev] "".tokenize() ?
M.-A. Lemburg
mal@lemburg.com
Fri, 04 May 2001 12:16:16 +0200
Fredrik Lundh wrote:
>
> mal wrote:
>
> > Gustavo Niemeyer submitted a patch which adds a tokenize like
> > method to strings and Unicode:
> >
> > "one, two and three".tokenize([",", "and"])
> > -> ["one", " two ", "three"]
> >
> > I like this method -- should I review the code and then check it in ?
>
> -1. method bloat. not exactly something you do every day, and
> when you do, it's a one-liner:
>
> import re
>
> def tokenize(string, ignore):
>     return [word for word in re.findall(r"\w+", string) if word not in ignore]
This is not the same as what .tokenize() does: it cuts at each
occurrence of a substring, rather than at words as in your example
(although I must say that list comprehension looks cool ;-).
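Just for illustration, here's a rough pure-Python sketch of what the
patched method does (tokenize_like() is only an illustrative name, not
part of the patch, and I'd have to check the code for how it treats
the surrounding whitespace):

    import re

    def tokenize_like(string, separators):
        # Cut the string at each occurrence of any of the given
        # substrings; re.split() returns the text in between.
        pattern = "|".join([re.escape(sep) for sep in separators])
        return re.split(pattern, string)

    tokenize_like("one, two and three", [",", "and"])
    # -> ['one', ' two ', ' three']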
> > PS: Haven't gotten any response regarding the .decode() method yet...
> > should I take this as "no objections" ?
>
> -0. method bloat. we don't have asfloat methods on integers and
> asint methods on strings either...
Well, we already have .encode() which interfaces to PyString_Encode(),
but no Python API for getting at PyString_Decode(). This is what
.decode() is for. Depending on the codecs you use, these two
methods can be very useful, e.g. for "fixing" line-endings or
hexifying strings. The codec concept can be used for far more
applications than just converting from and to Unicode.
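To give an idea of the kind of usage this enables, here's a sketch
assuming a "hex" codec is registered (the codec name is illustrative):

    data = "hello"
    hexed = data.encode("hex")          # existing .encode(): 'hello' -> '68656c6c6f'
    assert hexed.decode("hex") == data  # proposed .decode(): reverses the encoding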
About rich method APIs in general: I like having rich method APIs,
since they make life easier (you don't have to reinvent the wheel
every time you want a common job done). IMHO, having too many
methods can never hurt, but I'm probably alone in that POV.
--
Marc-Andre Lemburg
______________________________________________________________________
Company & Consulting: http://www.egenix.com/
Python Software: http://www.lemburg.com/python/