[Pythonmac-SIG] Unicode and split

Jeremy Reichman jaharmi at jaharmi.com
Fri May 23 16:50:49 CEST 2008


I have some characters in line strings in a file I'm processing that appear
to be Unicode. (When I print them to the shell from my script, they are
Asian characters for files like fonts in the Mac OS X filesystem.)

When I run a.split() on the affected line strings, they split on what I'm
guessing is considered a Unicode whitespace character. Specifically, the
culprit seems to be '\xe1':

$ python -c 'print "\xe1"'
?

I want to split only on ASCII spaces and tabs, however. Unfortunately, the
fields in the file's lines may be separated by runs of spaces and/or tabs -- and I
have no control over what was originally written to the source files -- so
the defaults for a.split() are otherwise ideal. The split method works on
most lines I'm processing perfectly well.

I'd rather not have to import the 're' module to split on a regular
expression.

Does anyone have any suggestions on how to handle this? I'm in Apple's
Python 2.5.1 in Leopard, and I'd also like to remain compatible with 2.3.x
in Tiger. I'd appreciate advice, thanks!
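
For reference, here is a minimal sketch of the behavior I'm after, written
without 're'. The helper name split_spaces_tabs is hypothetical, and it
assumes that only the ASCII space and tab characters should count as
separators and that a trailing line ending should be dropped. It uses only
plain string methods and a list comprehension, which I believe are available
in both 2.3.x and 2.5.1:

```python
def split_spaces_tabs(line):
    """Split on runs of ASCII spaces and tabs only (hypothetical helper)."""
    # Drop a trailing line ending so it does not stick to the last field.
    line = line.rstrip('\r\n')
    # Turn every tab into a space, so the only separator left is ' '.
    pieces = line.replace('\t', ' ').split(' ')
    # split(' ') yields empty strings between adjacent separators;
    # filtering them out makes a run of separators act as one.
    return [p for p in pieces if p]
```

Called on a line such as 'name\t\xe1font  42\n', this should keep '\xe1font'
together as one field, since '\xe1' is neither a space nor a tab.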


-- 
Jeremy



