re.compile.match() results in unicode strings - why?
Peter Otten
__peter__ at web.de
Fri Nov 12 06:04:11 EST 2004
Axel Bock wrote:
> Kent Johnson wrote:
>
>> Apparently if the input strings are unicode then the groups will be as
>> well:
>> [...]
>> Are you sure that exp is not a unicode string?
>
> hm. pretty much - i read the lines from a text file which contains only
> normal text. a sample line looks like that:
>
> 6. call_noparam 1000 runs 149453,1 ms 149,4531 ms/call
>
> no surprise here, i think ... . Actually I also wrote the program which
> produces that file, and I really didn't use unicode then. opening the file
> with a text editor also does not show unicode, and I can't believe that
> windows does actually manage the unicode stuff transparently to text
> editors. and also I have never heard of file-attached codepage
> information, those would be the only things i could imagine as a reason.
Why do you keep speculating?
[Your code from another post]
> ** CODE **
> string = "1. asdf asdf 327,88"
> exp = re.compile("(\S+) (\S+) (\S+) (\S+).*")
> m = exp.match(string)
> print m.groups()
> ** /CODE **
You could modify that along these lines (untested):
import re

string = "1. asdf asdf 327,88"
pattern = r"(\S+) (\S+) (\S+) (\S+).*"
# make sure that there is no unicode input:
assert not isinstance(string, unicode)
assert not isinstance(pattern, unicode)
exp = re.compile(pattern)
m = exp.match(string)
# make sure at least one group is a unicode string
if m:
    assert [g for g in m.groups() if isinstance(g, unicode)]
If this does not raise an AssertionError we can look further, but I still
think that is unlikely.
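To illustrate Kent's point, here is a minimal sketch (not part of the original exchange) showing that the groups come back with the same type as the string being matched. It is written to run on both Python 2 and Python 3, so it uses b"..." and u"..." literals rather than the unicode builtin; the helper name group_types is made up for illustration.

```python
import re

def group_types(pattern, subject):
    # Hypothetical helper: return the type of every matched group,
    # or an empty list if the pattern does not match.
    m = re.match(pattern, subject)
    return [type(g) for g in m.groups()] if m else []

byte_line = b"1. asdf asdf 327,88"   # byte string (plain str on Python 2)
text_line = u"1. asdf asdf 327,88"   # unicode text (str on Python 3)

# Byte pattern against byte input, text pattern against text input.
byte_types = group_types(br"(\S+) (\S+) (\S+) (\S+).*", byte_line)
text_types = group_types(r"(\S+) (\S+) (\S+) (\S+).*", text_line)

print(byte_types)
print(text_types)
```

If the groups in the real program turn out to be unicode, the same check applied to the lines read from the file should show where the unicode input comes from.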
Peter