problems with regex in Japanese?

Fri Aug 10 13:07:21 EDT 2001

My company is using the PCRE library in an international product.  But
we're discovering some problems with its UTF-8 support, which are
causing grief for our Japanese users (and others).  Since (AIUI) Python
also uses PCRE, I'm wondering whether the Python community has
encountered these problems and, if so, how they've been handled.

Matching a single UTF-8 character works fine.  For the sake of
discussion, let's imagine that "#" and "@" are Japanese Kanji.  If my
target string is "##@", and my search pattern is "#", PCRE will
correctly match the string "#".

But matching multiple UTF-8 characters appears to be broken.  If
(again) my target string is "##@", but my search pattern is "#+", it
should match the string "##".  But it doesn't; instead, it matches only
"#".  And of course more complex search patterns fail in more
mysterious ways.
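
For anyone who wants to reproduce the symptom, here is a minimal sketch
in Python.  It assumes (and this is only a guess, not something
confirmed for PCRE) that the engine is treating the pattern and target
as raw bytes rather than as UTF-8 characters, so that "+" applies only
to the last byte of the multi-byte character.  The two Kanji are
arbitrary stand-ins for "#" and "@", and Python's re module with bytes
patterns is used only to imitate byte-wise matching.

```python
import re

# Arbitrary stand-ins for the post's "#" and "@" (illustrative choice).
HASH = "日"   # plays the role of "#"
AT = "本"     # plays the role of "@"
target = HASH + HASH + AT          # corresponds to "##@"

# Character-wise matching: "+" repeats the whole character.
char_match = re.search(HASH + "+", target)
print(char_match.group() == HASH + HASH)   # True: matches "##"

# Byte-wise matching on the UTF-8 encoding: "+" binds only to the
# final byte of the three-byte sequence, so only one character matches.
target_bytes = target.encode("utf-8")
byte_pattern = (HASH + "+").encode("utf-8")
byte_match = re.search(byte_pattern, target_bytes)
print(byte_match.group() == HASH.encode("utf-8"))   # True: matches "#"

# Possible workaround under the same assumption: group the character
# explicitly so that "+" repeats the whole byte sequence.
grouped_pattern = b"(?:" + HASH.encode("utf-8") + b")+"
grouped_match = re.search(grouped_pattern, target_bytes)
print(grouped_match.group() == (HASH + HASH).encode("utf-8"))   # True
```

If byte-wise matching really is the cause, wrapping each multi-byte
character in a non-capturing group is only a stopgap; character
classes and "." would still operate on single bytes.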

We've tried to contact Philip Hazel (author of PCRE) about it, but have
received no reply.  Any tips from the Python community about this
problem?

Thanks,
- Joe

-- 
,------------------------------------------------------------------.
|    Joseph J. Strout         Check out the Mac Web Directory:     |
|    joe at strout.net           http://www.macwebdir.com             |
`------------------------------------------------------------------'