[Python-Dev] "".tokenize() ?

M.-A. Lemburg mal@lemburg.com
Fri, 04 May 2001 12:16:16 +0200


Fredrik Lundh wrote:
> 
> mal wrote:
> 
> > Gustavo Niemeyer submitted a patch which adds a tokenize like
> > method to strings and Unicode:
> >
> > "one, two and three".tokenize([",", "and"])
> > -> ["one", " two ", "three"]
> >
> > I like this method -- should I review the code and then check it in ?
> 
> -1.  method bloat.  not exactly something you do every day, and
> when you do, it's a one-liner:
> 
> import re
> 
> def tokenize(string, ignore):
>     return [word for word in re.findall(r"\w+", string) if word not in ignore]

This is not the same as what .tokenize() does: it cuts the string at
each occurrence of one of the given substrings, rather than extracting
words as your example does (although I must say that list comprehension
looks cool ;-).
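
For the record, here's a rough pure-Python sketch of the semantics as
I understand them (the helper name and the left-to-right scanning are
my own; the patch's exact whitespace handling may differ):

    def tokenize(s, separators):
        # Cut s at each occurrence of any of the separator
        # substrings, scanning left to right; nothing is stripped.
        result = []
        pos = 0
        while 1:
            hits = [(s.find(sep, pos), sep) for sep in separators]
            hits = [hit for hit in hits if hit[0] != -1]
            if not hits:
                break
            i, sep = min(hits)
            result.append(s[pos:i])
            pos = i + len(sep)
        result.append(s[pos:])
        return result

    # e.g. tokenize("one, two and three", [",", "and"])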
 
> > PS: Haven't gotten any response regarding the .decode() method yet...
> > should I take this as "no objections" ?
> 
> -0.  method bloat.  we don't have asfloat methods on integers and
> asint methods on strings either...

Well, we already have .encode(), which interfaces to PyString_Encode(),
but there is no Python API for getting at PyString_Decode(). This is
what .decode() is for. Depending on the codecs you use, these two
methods can be very useful, e.g. for "fixing" line endings or
hexifying strings. The codec concept can be applied to far more than
just converting to and from Unicode.
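
To make the hexifying example concrete, assuming a "hex" codec is
registered (just a sketch; the codec name is an assumption on my
part), the two methods would give you a round-trip like:

    >>> "hello".encode("hex")
    '68656c6c6f'
    >>> "68656c6c6f".decode("hex")   # this is what .decode() adds
    'hello'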

About rich method APIs in general: I like them, since they make life
easier (you don't have to reinvent the wheel every time you want a
common job done). IMHO, too many methods can never hurt, but I'm
probably alone in that POV.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Company & Consulting:                           http://www.egenix.com/
Python Software:                        http://www.lemburg.com/python/